348 Commits

Author SHA1 Message Date
SwechchhaSinha
b34f6588ee
fix: [cherry-pick] changes to propagate traceid from client (#32264) (#34640)
https://github.com/milvus-io/milvus/issues/32321
PR merged to master -
[#32264](https://github.com/milvus-io/milvus/pull/32264)

Issue Description:
Tracing is an important means of identifying bottlenecks in a
system and is crucial for debugging production issues. Milvus (or any DB)
is generally the most downstream system for a user call: a user call
can originate from the UI and pass through multiple components, in a
microservices architecture, before reaching Milvus. So, when a user
experiences a glitch, one would debug the call trace via logs using a
common trace id. As of now, Milvus generates a new trace id for every
call. This request makes sure a client can pass a trace id that
is used for all the logs across the Milvus sub-components, so that
one can fetch the logs for a user call across all components, including
Milvus.
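
The intended behavior can be sketched as follows. This is a minimal illustration and not the Milvus implementation: the metadata key `client-trace-id` and the function `resolve_trace_id` are hypothetical names, and the 32-hex-character id shape is borrowed from W3C Trace Context as an assumption.

```python
import re
import secrets

# Hypothetical metadata key; the header name Milvus actually reads is not stated here.
TRACE_ID_KEY = "client-trace-id"

# 32 lowercase hex characters, the trace id shape used by W3C Trace Context (assumed).
_TRACE_ID_RE = re.compile(r"[0-9a-f]{32}")


def resolve_trace_id(metadata: dict) -> str:
    """Return the caller's trace id if it is well formed, else a fresh one."""
    candidate = metadata.get(TRACE_ID_KEY, "").strip().lower()
    # An all-zero id is invalid in W3C Trace Context, so treat it as missing.
    if _TRACE_ID_RE.fullmatch(candidate) and candidate != "0" * 32:
        return candidate
    # Fall back to the old behavior: generate a new id for this call.
    return secrets.token_hex(16)  # 16 random bytes -> 32 hex characters
```

With a helper like this, every log line written while serving the request can carry the same id the upstream services already used.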

Signed-off-by: Shreesha Srinath Madogaran <smadogaran@salesforce.com>
Signed-off-by: Swechchha Sinha <swechchha.sinha@salesforce.com>
Co-authored-by: madogar <36537062+madogar@users.noreply.github.com>
Co-authored-by: Shreesha Srinath Madogaran <smadogaran@salesforce.com>
2024-08-16 14:12:54 +08:00
Chun Han
20e26588af
fix: enable limiter for restful server(#35350) (#35354)
related: #35350

Signed-off-by: MrPresent-Han <chun.han@gmail.com>
Co-authored-by: MrPresent-Han <chun.han@gmail.com>
2024-08-13 15:36:21 +08:00
wei liu
e5681e5b9c
enhance: make delegator delete buffer holding all delete from cp (#29626) (#35074)
See also #29625
pr: #29626 

This PR:
- Add a new implementation of `DeleteBuffer`: listDeleteBuffer
  - holds a slice of cacheBlock
  - the `Put` method appends new delete data to the last block
  - when a block is full, a new block is appended to the list
- Add a `TryDiscard` method to the `DeleteBuffer` interface
  - For doubleCacheBuffer, do nothing
  - For listDeleteBuffer, try to evict "old" blocks, which are the blocks
    before the first block whose start ts is behind the provided ts
- Add a checkpoint field to the `UpdateVersion` sync action, which shall be
  used to discard old cached delete blocks
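
A rough sketch of the list-based buffer described in this commit, written in Python for brevity. The class and method names are illustrative, the block capacity is counted in entries (an assumption), and the eviction rule is one reading of the description: keep the newest block whose start ts is not after the given ts, plus every later block.

```python
from dataclasses import dataclass, field


@dataclass
class CacheBlock:
    """One block of buffered deletes; head_ts is the block's start timestamp."""
    head_ts: int
    entries: list = field(default_factory=list)


class ListDeleteBuffer:
    def __init__(self, start_ts: int, block_capacity: int):
        # Capacity is measured in entries here, purely for illustration.
        self.block_capacity = block_capacity
        self.blocks = [CacheBlock(head_ts=start_ts)]

    def put(self, ts: int, pk) -> None:
        last = self.blocks[-1]
        if len(last.entries) >= self.block_capacity:
            # The last block is full: append a new block starting at this ts.
            last = CacheBlock(head_ts=ts)
            self.blocks.append(last)
        last.entries.append((ts, pk))

    def try_discard(self, ts: int) -> None:
        # Walk from the newest block backwards and stop at the first block
        # whose head_ts <= ts. Blocks before it only hold older data and are
        # dropped; index 0 is never a cut point, so one block always remains.
        for idx in range(len(self.blocks) - 1, 0, -1):
            if self.blocks[idx].head_ts <= ts:
                self.blocks = self.blocks[idx:]
                return
```

Calling `try_discard` with the checkpoint timestamp carried by the sync action would then bound the memory held by the buffer.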

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
Co-authored-by: congqixia <congqi.xia@zilliz.com>
2024-08-09 18:48:18 +08:00
yihao.dai
20dca130c6
enhance: [cherry-pick] Retry on incomplete query result (#35061)
This PR cherry-picks the following PRs:

1. Return specific error codes when encountering incomplete requery
results error. https://github.com/milvus-io/milvus/pull/31343
2. Retry on incomplete requery result in proxy.
https://github.com/milvus-io/milvus/pull/31713

issue: https://github.com/milvus-io/milvus/issues/34820

pr: https://github.com/milvus-io/milvus/pull/31343,
https://github.com/milvus-io/milvus/pull/31713

---------

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2024-08-05 15:22:16 +08:00
wei liu
ff7c1a79ee
enhance: Reduce delegator memory overloaded factor to 0.1 (#35092) (#35165)
pr: #35092

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-08-05 15:12:16 +08:00
Gao
0a122533d0
enhance: change autoindex default metric type (#34328)
issue: #34304 
pr: #34261

Signed-off-by: chasingegg <chao.gao@zilliz.com>
2024-08-02 16:22:20 +08:00
yihao.dai
289336a617
enhance: Avoid panic due to nil schema (#35063) (#35065)
/kind improvement

issue: https://github.com/milvus-io/milvus/discussions/25620

pr: https://github.com/milvus-io/milvus/pull/35063

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2024-07-30 20:19:49 +08:00
wei liu
79c0c78a07
enhance: Preserve fixed-size memory in delegator node for growing segment (#34602)
issue: #34595
pr: #34596

When consuming insert data on the delegator node, QueryCoord will move
out some sealed segments to manage its memory usage. After the growing
segment gets flushed, some sealed segments from other workers will be
moved back to the delegator node. To avoid the frequent movement of
segments, we estimate the maximum growing row count and preserve a
fixed-size memory in the delegator node.

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-07-13 10:25:40 +08:00
congqixia
3c44248105
fix: [2.3] support set up knowhere-build-pool-size on querynode (#34647)
Cherry-pick from master
pr: #30922
Related: #29650

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
Co-authored-by: MrPresent-Han <chun.han@zilliz.com>
2024-07-12 19:27:36 +08:00
SimFG
00b02ee6ae
enhance: [2.3] try to speed up the loading of small collections (#33863)
- issue: #33569
- pr: #33570

Signed-off-by: SimFG <bang.fu@zilliz.com>
2024-06-22 11:46:04 +08:00
congqixia
9157980232
fix: [2.3] Return record with largest timestamp for entries with same PK (#33936) (#34026)
Cherry-pick from master
pr: #33936
See also #33883
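
The rule named in the title can be illustrated with a small sketch. The record layout `(pk, ts, value)` and the function name are assumptions made for this example, not the structures used in Milvus.

```python
def latest_per_pk(records):
    """Keep, for each primary key, the value with the largest timestamp."""
    latest = {}  # pk -> (ts, value)
    for pk, ts, value in records:
        # Replace the stored entry only when a strictly newer one shows up.
        if pk not in latest or ts > latest[pk][0]:
            latest[pk] = (ts, value)
    return {pk: value for pk, (ts, value) in latest.items()}
```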

---------

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-06-20 19:56:00 +08:00
aoiasd
963f601a96
enhance:[Cherry-pick] Check by proxy rate limiter when delete get data by query. (#30891) (#33794)
relate: https://github.com/milvus-io/milvus/issues/30927
pr: https://github.com/milvus-io/milvus/pull/30891

Signed-off-by: aoiasd <zhicheng.yue@zilliz.com>
2024-06-17 19:36:00 +08:00
wei liu
284e79cf3a
enhance: Execute bloom filter apply in parallel to speed up process delete (#33870)
issue: #33610
pr: #33611 #33793

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-06-17 12:06:04 +08:00
Chun Han
0d4ee287e1
fix: query iterator lack results(#33137) (#33468)
related: #33137
pr: https://github.com/milvus-io/milvus/pull/33422

Signed-off-by: MrPresent-Han <chun.han@zilliz.com>
2024-05-31 13:54:07 +08:00
zhenshan.cao
23e7155a48
fix: avoid memoryleak in rendezvousFlushManager (#33112)
issue: https://github.com/milvus-io/milvus/issues/33110

Signed-off-by: zhenshan.cao <zhenshan.cao@zilliz.com>
2024-05-20 22:19:40 +08:00
congqixia
f848e82971
enhance: [2.3] Add param item to ignore bad message id in checkpoint (#33128)
Cherry-pick from master
pr: #33123
See also #33122

This PR adds the param item `mq.ignoreBadPosition` to control the behavior when
the mq fails to parse a message id from a checkpoint.

---------

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-05-20 11:31:39 +08:00
congqixia
a631856321
fix: [2.3] Validate num of rows for insert field data with schema (#32770) (#32845)
Cherry-pick from master
pr: #32770 
See also #32769

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-05-08 16:25:30 +08:00
SimFG
3a7154b796
enhance: [2.3] add the skip auto id and partition key check config (#32671)
/kind improvement
issue: #32591
pr: #32592

Signed-off-by: SimFG <bang.fu@zilliz.com>
2024-04-29 10:19:31 +08:00
aoiasd
bf2c5def8d
enhance: [Cherry-Pick] access log support get sdk type by user agent (#30760) (#32554)
Support getting the SDK type from the user agent in the access log when the
SDK version cannot be obtained from the connection.

---------
pr: https://github.com/milvus-io/milvus/pull/30760

Signed-off-by: aoiasd <zhicheng.yue@zilliz.com>
2024-04-25 16:37:27 +08:00
congqixia
c36b54cb57
enhance: [2.3] Use different interval for gc scan (#31363) (#32551)
Cherry-pick from master
pr: #31363
See also #31362

This PR makes the datacoord garbage collection scan operation use a different
interval than other operations.

This interval is a newly added param item whose default value is 7*24
hours.

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-04-25 16:07:26 +08:00
foxspy
560e167214
fix: add score compute consistency config for knowhere (#32584)
issue: #32583 
/kind branch-feature

Signed-off-by: xianliang.li <xianliang.li@zilliz.com>
2024-04-25 14:07:25 +08:00
Xiaofan
37e5728229
fix: reduce didn't handle offset without limit and reduceStopForBest … (#32087)
fix #32059
pr: #32089

This PR fixes two issues:
1. offset is not handled correctly when no limit is specified
2. reduceStopForBest is not guaranteed to return limit results, even when
more results are available, if there is a small segment

Signed-off-by: xiaofanluan <xiaofan.luan@zilliz.com>
2024-04-10 21:20:37 -07:00
wei liu
9d4ce6e581
enhance: Add restful api for devops to execute rolling upgrade (#29998) (#31846)
issue: #29261
pr: #29998
This PR adds a RESTful API for devops to execute rolling upgrades, including
suspending/resuming balance and manually transferring segments/channels.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-04-10 19:47:20 +08:00
cqy123456
47f767cf32
enhance: remove float16 in 2.3 branch (#31720)
issue: https://github.com/milvus-io/milvus/issues/31696

Signed-off-by: cqy123456 <qianya.cheng@zilliz.com>
2024-03-30 10:49:13 +08:00
groot
91cdada12a
fix: minio ssl compatible issue (#31619)
issue: https://github.com/milvus-io/milvus/issues/30709
pr: https://github.com/milvus-io/milvus/pull/31607

Signed-off-by: yhmo <yihua.mo@zilliz.com>
2024-03-27 14:41:20 +08:00
PowderLi
f2f0d44a5d
feat: [cherry-pick] restful phase two (#30430)
issue: #28348 #29732

Support to trace the grpc request, pr: #28349
Support to trace restful request and request error, pr: #28685

restful phase two, pr: #29728 #30343
include: collections, entities, partitions, users, roles, indexes,
aliases, import jobs

---------

Signed-off-by: SimFG <bang.fu@zilliz.com>
Signed-off-by: PowderLi <min.li@zilliz.com>
Co-authored-by: SimFG <bang.fu@zilliz.com>
2024-03-25 10:39:09 +08:00
Jiquan Long
ab059bb064
enhance: add more metrics (#31271) (#31511)
/kind improvement
pr: #31271 
fix: https://github.com/milvus-io/milvus/issues/31272

This PR adds more metrics:

- Slow query count, where the duration considered slow is configurable;
- Number of deleted entities;
- Number of entities per collection;
- Number of loaded entities per collection;
- Number of indexed entities;
- Number of indexed entities, per collection, per index, and whether it is a
  vector index;
- Quota states (LongTimeTickDelay, MemoryExhausted, DiskQuotaExhausted)
  per database.

---------

Signed-off-by: longjiquan <jiquan.long@zilliz.com>
2024-03-22 16:11:07 +08:00
wei liu
c8658d17f8
fix: Grpcclient return unrecoverable error (#31256) (#31452)
issue: #31222
pr: #31256

grpcclient's `call` func returns an unrecoverable error, so the caller's
retry policy also breaks because of this unrecoverable error.

This PR introduces `retry.Handle`. The new func takes a `func() (bool,
error)` as its input parameter, which returns `shouldRetry` directly, so
that grpcclient no longer has to return an unrecoverable error.
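
The idea behind `retry.Handle` can be sketched in Python. This is an illustration of the pattern only: the function name `handle`, its `attempts` and `sleep_s` parameters, and the tuple return convention are assumptions, not the Go signature used in Milvus beyond the `(shouldRetry, error)` pair mentioned above.

```python
import time
from typing import Callable, Optional, Tuple


def handle(
    fn: Callable[[], Tuple[bool, Optional[Exception]]],
    attempts: int = 3,
    sleep_s: float = 0.0,
) -> Optional[Exception]:
    """Call fn until it succeeds, declines a retry, or attempts run out.

    fn returns (should_retry, err); err is None on success.
    """
    last_err: Optional[Exception] = None
    for _ in range(attempts):
        should_retry, err = fn()
        if err is None:
            return None
        last_err = err
        if not should_retry:
            # The callee decided the failure is final; no special
            # "unrecoverable" error type is needed to stop the loop.
            break
        time.sleep(sleep_s)
    return last_err
```

Because the callback states directly whether to retry, the error it returns stays an ordinary error, and an outer retry policy in the caller is not short-circuited by it.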

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-03-21 11:59:12 +08:00
groot
1ca7cba222
enhance: Support MinIO TLS connection (#31292)
issue: https://github.com/milvus-io/milvus/issues/30709
master pr: #31311

Signed-off-by: yhmo <yihua.mo@zilliz.com>
Co-authored-by: Chen Rao <chenrao317328@163.com>
2024-03-21 11:15:20 +08:00
wei liu
9d712f4dd4
fix: Balance param use duplicated key (#31112) (#31141)
pr: #31112
issue: #31115
This PR fixes the balance check interval param using a duplicated key.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-03-11 15:03:02 +08:00
Jiquan Long
c37b7792f4
enhance: purge client infos periodically (#31037) (#31092)
https://github.com/milvus-io/milvus/issues/31007
pr: #31037 

---------

Signed-off-by: longjiquan <jiquan.long@zilliz.com>
2024-03-08 10:17:01 +08:00
yihao.dai
91d17870d6
enhance: Prevent the backlog of channelCP update tasks, perform batch updates of channelCPs (#30941) (#31024)
This PR includes the following adjustments:

1. To prevent channelCP update task backlog, only one task with the same
vchannel is retained in the updater. Additionally, the lastUpdateTime is
refreshed after the flowgraph submits the update task, rather than in
the callBack function.
2. Batch updates of multiple vchannel checkpoints are performed in the
UpdateChannelCheckpoint RPC (default batch size is 128). Additionally,
the lock for channelCPs in DataCoord meta has been switched from key
lock to global lock.
3. The concurrency of UpdateChannelCheckpoint RPCs in the datanode has
been reduced from 1000 to 10.
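
Points 1 and 2 can be pictured with the following sketch. It is not the datanode code: the class name, method names, and the plain dict used for pending tasks are assumptions; only the one-task-per-vchannel rule and the default batch size of 128 come from the description above.

```python
class ChannelCheckpointUpdater:
    def __init__(self, batch_size: int = 128):
        self.batch_size = batch_size
        # vchannel name -> latest checkpoint position waiting to be sent.
        self.pending = {}

    def submit(self, vchannel: str, position: int) -> None:
        # A newer task for the same vchannel overwrites the older one,
        # so the backlog never exceeds the number of vchannels.
        self.pending[vchannel] = position

    def drain_batches(self) -> list:
        """Return the pending checkpoints grouped into RPC-sized batches."""
        items = list(self.pending.items())
        self.pending.clear()
        return [
            items[i : i + self.batch_size]
            for i in range(0, len(items), self.batch_size)
        ]
```

Each returned batch would correspond to a single UpdateChannelCheckpoint call carrying several vchannel checkpoints.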

issue: https://github.com/milvus-io/milvus/issues/30004

pr: https://github.com/milvus-io/milvus/pull/30941

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2024-03-05 14:27:01 +08:00
congqixia
b7635ed989
enhance: [Cherry-pick] Change proxy connection manager to concurrent safe (#31009)
Cherry-pick from master
pr: #31008 
See also #31007

This PR:
- Add param item for connection manager behavior: TTL & check interval
- Change clientInfo map to concurrent map

---------

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-03-05 14:13:00 +08:00
SimFG
b0569f430b
enhance: [2.3] retry to read when the s3 get the unexpect eof error (#30976)
issue: https://github.com/milvus-io/milvus/issues/30877
pr: #30861

Signed-off-by: SimFG <bang.fu@zilliz.com>
2024-03-04 10:42:59 +08:00
groot
5b695d7e86
fix: Clean kafka default configuration (#30925)
issue: #30917
pr: #30924

Signed-off-by: yhmo <yihua.mo@zilliz.com>
2024-03-01 18:15:29 +08:00
congqixia
430e10c8e2
fix: [Cherry-pick] Use localStorage path to check disk cap (#30944) (#30966)
Cherry-pick from master
pr: #30944
See also #30943

---------

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-03-01 15:11:01 +08:00
congqixia
c3f831fce4
fix: [Cherry-pick] Disk resource is not requested for index loaded with disk (#30757) (#30948)
Cherry pick from master
pr: #30757
See also #30756

This PR:
- Request disk resource when index type, version loaded with disk
- Add attribute cache for index utility
- Add `typeutil.Pair`

---------

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-03-01 13:07:00 +08:00
chyezh
483a32bced
feat: add collection level flush rate control (#29568)
Flush rate control at the collection level, to avoid generating too many
segments.
0.1 qps by default.
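
A per-collection limiter of this kind can be sketched as a token bucket. The class below is illustrative only: its name, the burst size of 1, and the explicit `now` argument (used to keep the example deterministic) are assumptions; only the 0.1 qps default comes from the commit message.

```python
class CollectionFlushLimiter:
    def __init__(self, rate_per_sec: float = 0.1, burst: float = 1.0):
        self.rate = rate_per_sec
        self.burst = burst
        # collection id -> (available tokens, time of last update)
        self._state = {}

    def allow(self, collection_id: int, now: float) -> bool:
        """Return True if a flush on this collection may proceed at time now."""
        tokens, last = self._state.get(collection_id, (self.burst, now))
        # Refill according to elapsed time, capped at the burst size.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens >= 1.0:
            self._state[collection_id] = (tokens - 1.0, now)
            return True
        self._state[collection_id] = (tokens, now)
        return False
```

At 0.1 qps a collection earns one flush every 10 seconds, while other collections are limited independently.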

issue: #29477
pr: #29567

Signed-off-by: chyezh <ye.zhen@zilliz.com>
2024-03-01 10:23:01 +08:00
PowderLi
a4219cbb0f
fix: [cherry-pick] set proxy.http.acceptTypeAllowInt64: true as default (#30738)
issue: #30680
pr: #30720

Also make the parameter item refreshable.

Signed-off-by: PowderLi <min.li@zilliz.com>
2024-02-29 09:59:07 +08:00
congqixia
df16bf6acd
fix: [Cherry-pick] Remove time tick delay metrics when nodes go offline (#30833) (#30879)
Cherry-pick from master
pr: #30833
See also #30832

This PR removes time tick delay metrics when the rootcoord GetMetrics
response no longer contains a previously existing querynode/datanode.

Also adds unit tests for this case.

---------

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
Signed-off-by: Congqi.Xia <congqi.xia@zilliz.com>
2024-02-28 18:55:00 +08:00
groot
2009c3c783
fix: Support TLS for kafka connection (#30466)
issue: https://github.com/milvus-io/milvus/discussions/27977
pr: #30468 

Add extra configurations in milvus.yaml to pass certificates for kafka.

Signed-off-by: yhmo <yihua.mo@zilliz.com>
2024-02-28 18:43:07 +08:00
chyezh
be1bd9615a
enhance: add configurable memory index load predict memory usage factor (#30563)
pr: #30561

related pr: #30475

Signed-off-by: chyezh <chyezh@outlook.com>
2024-02-06 22:00:49 +08:00
jaime
7e7722ed43
enhance: [skip e2e] set logrus log level to reduce output error logs (#30478)
issue: #30295

Signed-off-by: jaime <yun.zhang@zilliz.com>
2024-02-04 17:23:06 +08:00
cqy123456
3036c19867
fix: can not get search_cache_budget_gb in create index (#30353)
issue: https://github.com/milvus-io/milvus/issues/30375
pr: https://github.com/milvus-io/milvus/pull/30119

Signed-off-by: cqy123456 <qianya.cheng@zilliz.com>
2024-01-31 15:49:03 +08:00
chyezh
21c944beaa
enhance: add basic information of milvus into metrics (#29666)
Add basic build information and runtime component dependencies to
metrics.

issue: #29664
pr: #29665

Signed-off-by: chyezh <ye.zhen@zilliz.com>
2024-01-29 15:49:04 +08:00
chyezh
77e123762f
enhance: add graceful stop timeout to avoid node stop hang under extreme cases (#30320)
1. set the coordinator and proxy graceful stop timeout to 5s.
3. set the graceful stop timeout of other worker nodes to 900s; we should
potentially change this to 600s once graceful stop is smooth.
4. change the stop order of the datacoord components.
5. `LivenessCheck` no longer performs a graceful shutdown.

issue: https://github.com/milvus-io/milvus/issues/30310
pr: #30317
also see: https://github.com/milvus-io/milvus/pull/30306

---------

Signed-off-by: chyezh <chyezh@outlook.com>
2024-01-27 08:45:02 +08:00
yihao.dai
e0f987ee9b
enhance: Allows proactive warming up of chunk cache (#30182) (#30289)
Allows proactive warming up of chunk cache. Original vector data will be
asynchronously loaded into the chunk cache during the load process. It
has the potential to significantly reduce query/search latency for a
certain duration after the load, albeit with a concurrent increase in
disk usage.

issue: https://github.com/milvus-io/milvus/issues/30181

pr: https://github.com/milvus-io/milvus/pull/30182

---------

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2024-01-26 09:57:01 +08:00
Bingyi Sun
2c4d0605ef
enhance: add a weight for growing row count when balancing segments (#30293)
Cherry-pick from master
pr: #30271

Signed-off-by: sunby <sunbingyi1992@gmail.com>
2024-01-26 09:17:03 +08:00
yah01
0e71923408
enhance: enable converting segcore error to merr (#29914) (#30178)
this converts the segcore error to merr if possible
pr: #29914

Signed-off-by: yah01 <yang.cen@zilliz.com>
2024-01-22 16:56:55 +08:00
yah01
1cc5a613d5
enhance: adjust the GPU pool size (#29937) (#30177)
According to benchmarks, a GPU pool size of 6 performs best.
pr: #29937

Signed-off-by: yah01 <yang.cen@zilliz.com>
2024-01-22 16:55:04 +08:00