20636 Commits

Author SHA1 Message Date
yihao.dai
df4d5e1096
enhance: [10kcp] Read metadata concurrently to accelerate recovery (#38404)
Read metadata such as segments, binlogs, and partitions concurrently at
the collection level.

issue: https://github.com/milvus-io/milvus/issues/37630

pr: https://github.com/milvus-io/milvus/pull/38403

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2024-12-12 16:39:06 +08:00
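The collection-level concurrency described in #38404 can be shown with a minimal Go sketch. It assumes a bounded worker count and a per-collection loader callback. `CollectionMeta`, `loadAllMeta`, and `loadOne` are hypothetical names, not Milvus's actual catalog API.

```go
package main

import "sync"

// CollectionMeta is a hypothetical container for the per-collection
// metadata (segments, binlogs, partitions) read during recovery.
type CollectionMeta struct {
	CollectionID int64
	Segments     []int64
}

// loadAllMeta reads metadata for every collection concurrently, running at
// most `parallelism` loaders at a time. loadOne is a hypothetical stand-in
// for the per-collection catalog reads. Results keep the input order.
func loadAllMeta(collectionIDs []int64, parallelism int,
	loadOne func(int64) (CollectionMeta, error)) ([]CollectionMeta, error) {
	if parallelism < 1 {
		parallelism = 1
	}
	results := make([]CollectionMeta, len(collectionIDs))
	errs := make([]error, len(collectionIDs))
	sem := make(chan struct{}, parallelism) // bounds concurrent loaders
	var wg sync.WaitGroup
	for i, id := range collectionIDs {
		wg.Add(1)
		go func(i int, id int64) {
			defer wg.Done()
			sem <- struct{}{}        // acquire a worker slot
			defer func() { <-sem }() // release it when done
			// Each goroutine writes only its own index, so no lock is needed.
			results[i], errs[i] = loadOne(id)
		}(i, id)
	}
	wg.Wait()
	for _, err := range errs {
		if err != nil {
			return nil, err // report the first error in input order
		}
	}
	return results, nil
}
```

Bounding the parallelism keeps recovery over thousands of collections from issuing thousands of simultaneous metadata reads at once.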
yihao.dai
11118db7d6
enhance: [10kcp] remove unnecessary clone in meta cache (#38398)
issue: https://github.com/milvus-io/milvus/issues/36627,
https://github.com/milvus-io/milvus/issues/37630

pr: https://github.com/milvus-io/milvus/pull/36628

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
Co-authored-by: Ted Xu <ted.xu@zilliz.com>
2024-12-12 16:33:38 +08:00
congqixia
5521091dcd
enhance: [10kcp] Refine querynode collection number metrics (#38352)
Related to #37630

Previously, the loaded-collection metric was calculated by scanning all
loaded segments in the segment manager, which was a slow and buggy
implementation.

This PR:

- Move collection num metrics to collection manager
- Remove deprecated loaded partition metrics update logic

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-12-10 21:06:42 +08:00
yihao.dai
4a2a5f0183
fix: [10kcp] Fix standby mixcoord start failed (#38327)
Follow-up fix for https://github.com/milvus-io/milvus/pull/38324

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2024-12-10 11:47:45 +08:00
yihao.dai
15b01daec5
fix: [10kcp] Fix standby mixcoord start failed (#38324)
When a standby node transitions to active, its component state changes to
Initialize. If the initialization takes too long (exceeding the liveness
probe's maximum retries), the standby pod is stopped and fails to start.
This PR removes the Initialize state during standby transitions in
rolling upgrades. The state now switches directly from standby to
healthy, preventing health check failures.

issue: https://github.com/milvus-io/milvus/issues/37630

pr: https://github.com/milvus-io/milvus/pull/38308

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2024-12-10 10:53:50 +08:00
congqixia
24a055996b
enhance: [10kcp] Add secondary index for querynode segment manager (#38312)
Cherry-pick from pr #38311
Related to #37630

Add a secondary index keyed by vchannel to reduce how long `GetBy` holds
the read lock when the number of segments is large.

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-12-09 19:56:16 +08:00
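The vchannel secondary index from #38312 can be sketched as a second map kept in sync with the primary one. `SegmentManager`, `Put`, `Remove`, and `GetByChannel` are hypothetical simplifications; only the goal of shortening the `GetBy` read-lock hold time comes from the commit.

```go
package main

import "sync"

// Segment is a hypothetical, trimmed-down segment record.
type Segment struct {
	ID       int64
	VChannel string
}

// SegmentManager keeps a primary map by segment ID plus a secondary index
// by vchannel, so per-channel lookups do not scan every segment.
type SegmentManager struct {
	mu        sync.RWMutex
	segments  map[int64]*Segment
	byChannel map[string]map[int64]*Segment // secondary index
}

func NewSegmentManager() *SegmentManager {
	return &SegmentManager{
		segments:  make(map[int64]*Segment),
		byChannel: make(map[string]map[int64]*Segment),
	}
}

// Put inserts or replaces a segment and keeps both indexes in sync.
func (m *SegmentManager) Put(s *Segment) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if old, ok := m.segments[s.ID]; ok {
		m.removeFromChannelIndex(old) // the segment may have moved channels
	}
	m.segments[s.ID] = s
	idx, ok := m.byChannel[s.VChannel]
	if !ok {
		idx = make(map[int64]*Segment)
		m.byChannel[s.VChannel] = idx
	}
	idx[s.ID] = s
}

// Remove deletes a segment from both indexes.
func (m *SegmentManager) Remove(id int64) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if s, ok := m.segments[id]; ok {
		delete(m.segments, id)
		m.removeFromChannelIndex(s)
	}
}

// removeFromChannelIndex assumes the write lock is held.
func (m *SegmentManager) removeFromChannelIndex(s *Segment) {
	idx := m.byChannel[s.VChannel]
	delete(idx, s.ID)
	if len(idx) == 0 {
		delete(m.byChannel, s.VChannel)
	}
}

// GetByChannel touches only one channel's segments, so the read lock is
// held for O(segments in channel) rather than O(all segments).
func (m *SegmentManager) GetByChannel(ch string) []*Segment {
	m.mu.RLock()
	defer m.mu.RUnlock()
	out := make([]*Segment, 0, len(m.byChannel[ch]))
	for _, s := range m.byChannel[ch] {
		out = append(out, s)
	}
	return out
}
```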
yihao.dai
3e65cc5850
enhance: [10kcp] Enable score based balance channel policy (#38301)
issue: https://github.com/milvus-io/milvus/issues/38142
The current balance channel policy only considers the current collection's
distribution. If every collection has one channel and all channels have
been loaded on the same querynode, channel balancing won't be triggered
after the number of querynodes increases.

This PR enables the score-based balance channel policy to:

1. distribute all channels evenly across multiple querynodes
2. distribute each collection's channels evenly across multiple
querynodes.

pr: https://github.com/milvus-io/milvus/pull/38143

---------

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
Co-authored-by: Wei Liu <wei.liu@zilliz.com>
2024-12-09 19:50:05 +08:00
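The two goals listed in #38301 can be illustrated with a simplified greedy scorer. This is an assumption-laden sketch, not Milvus's actual score-based policy: the ordering of criteria (per-collection count first, then global count) and all names (`Channel`, `assignChannels`, `lowerScore`) are hypothetical.

```go
package main

// Channel identifies a hypothetical DML channel owned by one collection.
type Channel struct {
	Collection int64
	Name       string
}

// assignChannels greedily places each channel on the lowest-scoring node.
// The score compares, in order: how many channels of the same collection
// the node already holds, then how many channels it holds in total, then
// the node ID as a tie-breaker. It returns channel name -> node ID.
func assignChannels(channels []Channel, nodes []int64) map[string]int64 {
	plan := make(map[string]int64, len(channels))
	if len(nodes) == 0 {
		return plan
	}
	total := make(map[int64]int)             // node -> channels on it
	perColl := make(map[int64]map[int64]int) // collection -> node -> channels
	for _, ch := range channels {
		coll := perColl[ch.Collection]
		if coll == nil {
			coll = make(map[int64]int)
			perColl[ch.Collection] = coll
		}
		best := nodes[0]
		for _, n := range nodes[1:] {
			if lowerScore(n, best, coll, total) {
				best = n
			}
		}
		plan[ch.Name] = best
		total[best]++
		coll[best]++
	}
	return plan
}

// lowerScore reports whether node a scores strictly lower than node b.
func lowerScore(a, b int64, coll, total map[int64]int) bool {
	if coll[a] != coll[b] {
		return coll[a] < coll[b]
	}
	if total[a] != total[b] {
		return total[a] < total[b]
	}
	return a < b
}
```

With four single-channel collections and two querynodes, each node ends up with two channels: the per-collection counts are all tied, so the global count decides, which a purely per-collection view cannot do.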
yihao.dai
ae4e2b8063
fix: [10kcp] Query coord stop progress is too slow (#38300)
issue: https://github.com/milvus-io/milvus/issues/38237

QueryCoord saves each collection's target during the stop process so that
a new QueryCoord can recover quickly. However, if the Milvus cluster has
thousands of collections, this makes the QueryCoord stop process much
slower than expected.

This PR refines the implementation to save a collection's target to etcd
whenever the target is updated, and to clean it up when the collection is
released.

pr: https://github.com/milvus-io/milvus/pull/38238

---------

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
Co-authored-by: Wei Liu <wei.liu@zilliz.com>
2024-12-09 19:49:49 +08:00
yihao.dai
2fe6423552
enhance: [10kcp] Speed up meta recovery (#38298)
Increase the batchSize in WalkWithPrefix operations to 10000.

issue: https://github.com/milvus-io/milvus/issues/37630

pr: https://github.com/milvus-io/milvus/pull/38285

---------

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2024-12-09 19:49:35 +08:00
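Why a larger `batchSize` speeds up `WalkWithPrefix` (#38298) can be shown with a minimal paginated walk over a hypothetical in-memory stand-in for etcd. `memKV`, `rangeFrom`, and `walkWithPrefix` are illustrative names, not the Milvus kv interface; the point is that the number of range round trips shrinks as `batchSize` grows.

```go
package main

import (
	"sort"
	"strings"
)

// memKV is a hypothetical in-memory stand-in for etcd: a sorted key space
// read in limited ranges, similar to an etcd Range request with a limit.
type memKV struct {
	keys   []string // sorted
	vals   map[string]string
	rounds int // number of range requests issued
}

func newMemKV(data map[string]string) *memKV {
	kv := &memKV{vals: data}
	for k := range data {
		kv.keys = append(kv.keys, k)
	}
	sort.Strings(kv.keys)
	return kv
}

// rangeFrom returns up to limit keys >= start that share prefix.
func (kv *memKV) rangeFrom(start, prefix string, limit int) []string {
	kv.rounds++
	var out []string
	for i := sort.SearchStrings(kv.keys, start); i < len(kv.keys) && len(out) < limit; i++ {
		if !strings.HasPrefix(kv.keys[i], prefix) {
			break // keys with a common prefix are contiguous when sorted
		}
		out = append(out, kv.keys[i])
	}
	return out
}

// walkWithPrefix visits every key under prefix in batches of batchSize,
// resuming each batch just after the last key seen.
func (kv *memKV) walkWithPrefix(prefix string, batchSize int, fn func(k, v string) error) error {
	if batchSize < 1 {
		batchSize = 1
	}
	start := prefix
	for {
		batch := kv.rangeFrom(start, prefix, batchSize)
		for _, k := range batch {
			if err := fn(k, kv.vals[k]); err != nil {
				return err
			}
		}
		if len(batch) < batchSize {
			return nil // short batch: nothing left under prefix
		}
		start = batch[len(batch)-1] + "\x00" // smallest key after the last one
	}
}
```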
yihao.dai
3d490aa158
fix: [10kcp] Replace outer lock with concurrent map (#38286)
See also: #37493
pr: #37817

Signed-off-by: yangxuan <xuan.yang@zilliz.com>
Co-authored-by: XuanYang-cn <xuan.yang@zilliz.com>
2024-12-09 19:49:20 +08:00
yihao.dai
df100e5bbe
fix: [10kcp] Fix init rootcoord meta timeout (#38249)
issue: https://github.com/milvus-io/milvus/issues/37630

pr: https://github.com/milvus-io/milvus/pull/38248

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2024-12-05 17:45:31 +08:00
Zhen Ye
99279e0bef
enhance: remove the rpc layer of coordinator when enabling standalone or mixcoord (#38246)
issue: #33285
pr: #37815

- remove the rpc layer of coordinator when enabling standalone or
mixcoord
- move health check into init

---------

Signed-off-by: chyezh <chyezh@outlook.com>
2024-12-05 17:27:53 +08:00
congqixia
c4df6b5910
enhance: [10kcp] Refine Replica manager colle2Replicas secondary index (#37907)
Related to #37630

This PR adds a new coll2Replicas secondary index utility to reduce map
access and iteration when getting replicas by collection.

---------

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-12-05 11:57:29 +08:00
yihao.dai
d75fb5b3f8
enhance: [10kcp] Reduce mutex contention in datacoord meta (#38229)
1. Use a secondary index to avoid retrieving all segments in
GetSegmentsChanPart.
2. Perform batch SetAllocations to reduce the number of times the meta
lock is acquired.

issue: https://github.com/milvus-io/milvus/issues/37630

pr: https://github.com/milvus-io/milvus/pull/38219

---------

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2024-12-05 11:57:07 +08:00
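The batching half of #38229 can be illustrated by counting lock acquisitions. `segmentMeta`, `Allocation`, and the `SetAllocation`/`SetAllocations` signatures below are hypothetical; the commit only states that allocations are applied in batches so the meta lock is acquired fewer times.

```go
package main

import "sync"

// Allocation is a hypothetical record of rows assigned to a segment.
type Allocation struct {
	SegmentID int64
	Rows      int64
}

// segmentMeta is a hypothetical meta store guarded by one mutex.
// lockCount records how many times the lock was taken.
type segmentMeta struct {
	mu        sync.Mutex
	allocated map[int64]int64
	lockCount int
}

func newSegmentMeta() *segmentMeta {
	return &segmentMeta{allocated: make(map[int64]int64)}
}

// SetAllocation applies one allocation: one lock acquisition per call.
func (m *segmentMeta) SetAllocation(a Allocation) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.lockCount++
	m.allocated[a.SegmentID] += a.Rows
}

// SetAllocations applies a whole batch under a single lock acquisition,
// so N allocations cost one lock instead of N.
func (m *segmentMeta) SetAllocations(batch []Allocation) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.lockCount++
	for _, a := range batch {
		m.allocated[a.SegmentID] += a.Rows
	}
}
```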
yihao.dai
3219b869a3
fix: [10kcp] Fix timeout when listing meta (#38152)
When there are too many key-value pairs, the etcd list operation may
time out. This PR replaces LoadWithPrefix in list operations, which
could involve many keys, with WalkWithPrefix.

issue: https://github.com/milvus-io/milvus/issues/37917

pr: https://github.com/milvus-io/milvus/pull/38151

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2024-12-03 14:15:49 +08:00
yihao.dai
0c29d8ff64
enhance: [10kcp] Update segment manager (#38153)
Use a channel level key lock for segments in segmentManager.

issue: https://github.com/milvus-io/milvus/issues/37633,
https://github.com/milvus-io/milvus/issues/37630

pr: https://github.com/milvus-io/milvus/pull/37836

---------

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2024-12-03 14:15:35 +08:00
yihao.dai
338ccc9ff9
enhance: [10kcp] Reduce memory usage of BF in DataNode and QueryNode (#38133)
1. DataNode: Skip generating the bloom filter (BF) during the insert phase
(the BF is regenerated during the sync phase).
2. QueryNode: Skip generating or maintaining BF for growing segments;
deletion checks will be handled in the segcore.

issue: https://github.com/milvus-io/milvus/issues/37630

pr: https://github.com/milvus-io/milvus/pull/38129

---------

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2024-12-02 14:41:19 +08:00
yihao.dai
0930430a68
enhance: [10kcp] Skip creating partition rate limiters when not enabled (#38062)
issue: https://github.com/milvus-io/milvus/issues/37630

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2024-11-28 10:45:46 +08:00
yihao.dai
635d161109
enhance: [10kcp] Accelerate observe collection (#38058)
issue: https://github.com/milvus-io/milvus/issues/37630

---------

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2024-11-28 10:05:24 +08:00
yihao.dai
312475d1f1
enhance: [10kcp] remove the rpc level of coordinator (#37984)
issue: https://github.com/milvus-io/milvus/issues/37764

- add a local client to call local server directly for
querycoord/rootcoord/datacoord.
- enable the local client if Milvus is running in mixcoord or standalone mode.

Signed-off-by: chyezh <chyezh@outlook.com>

---------

Signed-off-by: chyezh <chyezh@outlook.com>
Co-authored-by: Zhen Ye <chyezh@outlook.com>
2024-11-25 14:50:42 +08:00
yihao.dai
e5c16e0676
fix: [10kcp] Fix checkGeneralCapacity slowly (#37981)
Cache the general count to speed up checkGeneralCapacity.

issue: https://github.com/milvus-io/milvus/issues/37630

pr: https://github.com/milvus-io/milvus/pull/37976

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2024-11-25 14:50:24 +08:00
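Caching the general count (#37981) amounts to maintaining a running total on every change instead of re-summing all collections on each check. In this sketch, `capacityChecker` and the per-collection contribution (`units`) are hypothetical abstractions; the commit does not state the exact formula Milvus uses for general capacity.

```go
package main

import "sync"

// capacityChecker caches the sum of per-collection contributions so that
// a capacity check is O(1) instead of a scan over every collection.
type capacityChecker struct {
	mu    sync.Mutex
	limit int
	units map[int64]int // hypothetical per-collection contribution
	total int           // cached sum of units
}

func newCapacityChecker(limit int) *capacityChecker {
	return &capacityChecker{limit: limit, units: make(map[int64]int)}
}

// Set records one collection's contribution and adjusts the cached total.
func (c *capacityChecker) Set(collectionID int64, units int) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.total += units - c.units[collectionID]
	c.units[collectionID] = units
}

// Drop removes a collection's contribution from the cached total.
func (c *capacityChecker) Drop(collectionID int64) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.total -= c.units[collectionID]
	delete(c.units, collectionID)
}

// Check reports whether adding extra units stays within the limit,
// using the cached total.
func (c *capacityChecker) Check(extra int) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.total+extra <= c.limit
}
```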
yihao.dai
fd30034c77
fix: [10kcp] Fix data view and add more ut (#37915)
Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2024-11-21 21:35:42 +08:00
yihao.dai
4845e4d679
enhance: [10kcp] Revert "enhance: remove the rpc level of coordinator (#37914)
Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2024-11-21 21:35:29 +08:00
yihao.dai
bf90e55319
enhance: [10kcp] Reduce GetRecoveryInfo calls (#37891)
1. Introduce a data view mechanism for DataCoord, attempting to update
each collection's data view periodically.
2. QueryCoord maintains a cache of data view versions. Before
batch-fetching recovery info, it retrieves all versions and only fetches
recovery info for collections with updated versions.
3. Return DataCoord's current data view when fetching recovery info.

issue: https://github.com/milvus-io/milvus/issues/37743,
https://github.com/milvus-io/milvus/issues/37630

pr: https://github.com/milvus-io/milvus/pull/37863

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2024-11-21 15:43:13 +08:00
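The version check described in #37891 can be sketched from QueryCoord's side. `recoveryFetcher`, `refresh`, and the `fetch` callback are hypothetical; `fetch` stands in for a GetRecoveryInfo-style call, and the cached version advances only after a successful fetch, so failures are retried on the next refresh.

```go
package main

import "sort"

// recoveryFetcher is a hypothetical QueryCoord-side cache of DataCoord
// data view versions, keyed by collection ID.
type recoveryFetcher struct {
	cached map[int64]int64
}

func newRecoveryFetcher() *recoveryFetcher {
	return &recoveryFetcher{cached: make(map[int64]int64)}
}

// refresh compares the latest versions (retrieved in one batch) with the
// cache and calls fetch only for collections whose data view changed.
// It returns the IDs that were fetched, in ascending order.
func (f *recoveryFetcher) refresh(versions map[int64]int64,
	fetch func(collectionID int64) error) ([]int64, error) {
	ids := make([]int64, 0, len(versions))
	for id := range versions {
		ids = append(ids, id)
	}
	sort.Slice(ids, func(i, j int) bool { return ids[i] < ids[j] })

	var fetched []int64
	for _, id := range ids {
		v := versions[id]
		if old, ok := f.cached[id]; ok && old == v {
			continue // data view unchanged: skip the expensive call
		}
		if err := fetch(id); err != nil {
			return fetched, err // cache not advanced, so this is retried
		}
		f.cached[id] = v
		fetched = append(fetched, id)
	}
	return fetched, nil
}
```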
Zhen Ye
ce8069c0fd
enhance: remove the rpc layer of coordinator when enabling standalone or mixcoord (#37892)
issue: #37764

- add a local client to call local server directly for
querycoord/rootcoord/datacoord.
- enable the local client if Milvus is running in mixcoord or standalone mode.

Signed-off-by: chyezh <chyezh@outlook.com>
2024-11-21 15:42:18 +08:00
Zhen Ye
1a6b98be77
enhance: remove the rpc level of coordinator (#37876)
issue: #33285
pr: #37722

- move most cgo operations related to search/query into the segcore
package so they can be reused by streamingnode.
- add go unittest for segcore operations.

Signed-off-by: chyezh <chyezh@outlook.com>
2024-11-21 15:21:11 +08:00
yihao.dai
99da46dd0b
fix: [10kcp] Fix load slowly (#37454) (#37878)
When there are many loaded collections, they occupy the target
observer scheduler’s pool. This prevents loading collections from
updating the current target in time, slowing down the load process. This
PR adds a separate target dispatcher for loading collections.

issue: https://github.com/milvus-io/milvus/issues/37166

pr: https://github.com/milvus-io/milvus/pull/37454

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2024-11-21 15:11:03 +08:00
yihao.dai
ac7b485a08
enhance: [10kcp] Accelerate the loading of collection (#37879)
Remove unnecessary ListIndex and DescribeCollection RPC calls during
loading.

issue: https://github.com/milvus-io/milvus/issues/37166,
https://github.com/milvus-io/milvus/issues/37630

pr: https://github.com/milvus-io/milvus/pull/37741

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2024-11-21 15:10:36 +08:00
yihao.dai
9e1ba0759c
enhance: [10kcp] Optimize segmentManager segments (#37884)
1. Use vchannel and partition indices for segments.
2. Replace coarse-grained mutex with concurrent map.

issue: https://github.com/milvus-io/milvus/issues/37633,
https://github.com/milvus-io/milvus/issues/37630

pr: https://github.com/milvus-io/milvus/pull/37836

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2024-11-21 15:10:04 +08:00
yihao.dai
92ab65ada0
enhance: [10kcp] Reduce GetIndexInfos calls (#37877)
Batch GetIndexInfos calls for segments to reduce RPC calls.

issue: https://github.com/milvus-io/milvus/issues/37634

pr: https://github.com/milvus-io/milvus/pull/37695

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2024-11-21 15:09:39 +08:00
congqixia
0bd26171d5
enhance: [2.4] Provide secondary index criteria when filter leaderview (#37777) (#37802)
Cherry-pick from master
pr: #37777 
Related to #37630

---------

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-11-21 10:48:33 +08:00
congqixia
28adfe4629
enhance: [2.4] Remove unnecessary segment clone updating dist (#37797) (#37833)
Cherry-pick from master
pr: #37797
Related to #37630

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-11-20 19:48:33 +08:00
sre-ci-robot
5ac4e4839e
[automated] Bump milvus version to v2.4.16 (#37790)
Bump milvus version to v2.4.16
Signed-off-by: sre-ci-robot sre-ci-robot@users.noreply.github.com

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2024-11-20 11:46:37 +08:00
congqixia
cffde80e68
enhance: [2.4] Prevent generate "null" search params (#37811)
pr: #37812
Prevent generating null search params in RESTful search requests.

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
v2.4.16
2024-11-19 18:20:32 +08:00
Zhen Ye
ebfd917bb6
fix: make asan available when building milvus image (#37804)
issue: #35854
pr: #37041

- USE_ASAN will not enable the Debug mode.
- replace USE_ASAN with `ldd` so that the correct .so files end up in the milvus image.

Signed-off-by: chyezh <chyezh@outlook.com>
Co-authored-by: yellow-shine <sammy.huang@zilliz.com>
2024-11-19 17:28:32 +08:00
congqixia
a10f95d71c
enhance: Bump milvus & proto version to v2.4.16 (#37762)
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-11-18 20:36:31 +08:00
congqixia
876e06b862
fix: [2.4] Load l0 delta for growings when using RemoteLoad (#37772)
Cherry-pick from master
pr: #37771
Related to #37574

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-11-18 20:26:31 +08:00
smellthemoon
46692d7525
enhance: support upsert autoid==true in Restful API and fix some bugs(#37072)(#37487) (#37766)
pr: #37072
pr: #37487

---------

Signed-off-by: lixinguo <xinguo.li@zilliz.com>
Co-authored-by: lixinguo <xinguo.li@zilliz.com>
2024-11-18 19:44:31 +08:00
wei liu
2a4f54cd4f
fix: L0 segment has been loaded to worker during channel balance (#37758)
issue: https://github.com/milvus-io/milvus/issues/37703
pr: https://github.com/milvus-io/milvus/pull/37748

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-11-18 17:00:32 +08:00
foxspy
cabb55595a
enhance: update knowhere version (#37763)
/kind branch-feature

knowhere release notes:
https://github.com/zilliztech/knowhere/releases/tag/v2.3.13

Signed-off-by: xianliang.li <xianliang.li@zilliz.com>
2024-11-18 16:30:32 +08:00
wei liu
79f676e7d8
enhance: Use batch to speed up list collections from meta kv (#37752)
issue: #36228
pr: #37742

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-11-18 15:58:33 +08:00
nico
bbd96e1829
test: update pymilvus version and test cases (#37711)
Signed-off-by: nico <cheng.yuan@zilliz.com>
2024-11-18 14:14:32 +08:00
jaime
3ce27ca689
enhance: remove collection queryable check from health check (#37731)
pr: #37712

Signed-off-by: jaime <yun.zhang@zilliz.com>
2024-11-18 10:50:38 +08:00
yihao.dai
13f83df019
enhance: [2.4] Remove segment-level tag from monitoring metrics (#37737)
When there are a large number of segments, the metrics consume a lot of
memory. This PR removes the segment-level tag from monitoring metrics.

issue: https://github.com/milvus-io/milvus/issues/37636

pr: https://github.com/milvus-io/milvus/pull/37696

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2024-11-16 23:04:33 +08:00
yihao.dai
d29573551b
enhance: [2.4] Remove unnecessary clone in SetState (#37736)
issue: https://github.com/milvus-io/milvus/issues/37637

pr: https://github.com/milvus-io/milvus/pull/37697

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2024-11-16 19:04:34 +08:00
congqixia
cdf703aabc
enhance: [2.4] Enable RemoteLoad l0 forward policy by default (#37678) (#37713)
Cherry-pick from master
pr: #37678
Related to #35303

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-11-15 18:28:31 +08:00
smellthemoon
b3e6482367
enhance: add search params in search request in restful(#36304) (#37673)
pr: #36304 
pr: #36714 
pr: #36448

---------

Signed-off-by: lixinguo <xinguo.li@zilliz.com>
Co-authored-by: lixinguo <xinguo.li@zilliz.com>
Co-authored-by: zhuwenxing <wenxing.zhu@zilliz.com>
2024-11-15 17:54:30 +08:00
Zhen Ye
4e11fe7adf
enhance: make milvus image with asan available (#37682)
issue: #35854
pr: #37050

Signed-off-by: chyezh <chyezh@outlook.com>
2024-11-15 17:10:30 +08:00
wei liu
1bd502b585
fix: Delegator stuck at unserviceable status (#37694) (#37702)
issue: #37679
pr: #37694

PR #36549 introduced a logic error that updated the current target when
only some of the channels were ready.

This PR fixes the logic error and lets the dist handler keep pulling the
distribution from querynodes until all delegators become serviceable.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-11-15 14:52:30 +08:00
congqixia
e222289038
fix: [2.4] Store default value if ErrKeyNotFound is returned (#37691) (#37705)
Cherry-pick from master
pr: #37691
Related to #37690

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-11-15 14:50:32 +08:00