20648 Commits

Author SHA1 Message Date
yihao.dai
501d1b58cf
Revert "fix: [10kcp] Query coord stop progress is too slow (#38300)" (#38794)
This reverts commit ae4e2b8063f8e06f3af02e8e1f846d0e0b502fd0.
2024-12-26 21:48:41 +08:00
yihao.dai
05f50b11ff
fix: [10kcp] Fix slow preprocess in qc scheduler (#38784)
supplement to pr: https://github.com/milvus-io/milvus/pull/38566

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2024-12-26 17:05:44 +08:00
yihao.dai
7f5467577e
fix: [10kcp] Fix index meta mutex contention (#38777)
issue: https://github.com/milvus-io/milvus/issues/37630

Reduce the frequency of updateIndexTasksMetrics to avoid holding the
mutex for long periods.

pr: https://github.com/milvus-io/milvus/pull/38775

---------

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2024-12-26 17:04:14 +08:00
yihao.dai
1969ab3da7
enhance: Optimize GetLocalDiskSize and segment loader mutex (#38683)
fix of: https://github.com/milvus-io/milvus/pull/38599

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2024-12-24 11:19:58 +08:00
yihao.dai
bf27f70c32
enhance: [10kcp] Optimize GetLocalDiskSize and segment loader mutex (#38601)
fix of pr: https://github.com/milvus-io/milvus/pull/38599

---------

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2024-12-19 21:29:13 +08:00
yihao.dai
ecd55596cf
enhance: [10kcp] Optimize GetLocalDiskSize and segment loader mutex (#38600)
1. Make the segment loader lock protect only the resource.
2. Optimize GetDiskUsage to avoid excessive overhead.

issue: https://github.com/milvus-io/milvus/issues/37630

pr: https://github.com/milvus-io/milvus/pull/38599

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2024-12-19 21:14:26 +08:00
congqixia
f5ae24f955
fix: [10kcp] SyncSegments rpc always failed (#38032) (#38579)
Cherry-pick from 2.4
pr: #38032
issue: #38031
Cause: the call to `cli.SyncSegments` used a ctx that had already been
overridden and canceled, so the SyncSegments rpc always failed.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
Co-authored-by: wei liu <wei.liu@zilliz.com>
2024-12-19 14:01:46 +08:00
yihao.dai
c3d4469259
enhance: Print observe time (#38575)
Print observe, dist handling and schedule time.

issue: https://github.com/milvus-io/milvus/issues/37630

pr: https://github.com/milvus-io/milvus/pull/38566

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2024-12-19 11:46:13 +08:00
yihao.dai
ca234e7847
fix: [10kcp] Fix slow dist handle and slow observe (#38567)
1. Provide partition-level indexing in the collection target.
2. Make SegmentAction not wait for distribution.
3. Optimize logging to reduce CPU overhead.

issue: https://github.com/milvus-io/milvus/issues/37630

pr: https://github.com/milvus-io/milvus/pull/38566

---------

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2024-12-18 21:00:39 +08:00
congqixia
999437e76e
enhance: [10kcp] Trim data distribution resp index info (#38521)
Related to #37630

Data distribution became too large when the segment number was huge. This PR
trims the index info struct and returns only the needed info.

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-12-17 15:20:26 +08:00
congqixia
28841ebdf9
enhance: [10kcp] Simplify querynode tsafe & reduce goroutine number (#38416) (#38433)
Related to #37630

The TSafe manager is too complex for the current implementation, and each
delegator needs one goroutine waiting for tsafe update events.

Tsafe updating can be executed in the pipeline. This PR removes the tsafe
manager and simplifies the entire tsafe updating logic.

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-12-13 21:20:57 +08:00
yihao.dai
de78de7689
fix: [10kcp] Fix consume blocked due to too many consumers (#38456)
This PR limits the maximum number of consumers per pchannel to 10 for
each QueryNode and DataNode.

issue: https://github.com/milvus-io/milvus/issues/37630

pr: https://github.com/milvus-io/milvus/pull/38455

---------

Signed-off-by: SimFG <bang.fu@zilliz.com>
Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
Co-authored-by: SimFG <bang.fu@zilliz.com>
2024-12-13 21:20:47 +08:00
yihao.dai
df4d5e1096
enhance: [10kcp] Read metadata concurrently to accelerate recovery (#38404)
Read metadata such as segments, binlogs, and partitions concurrently at
the collection level.

issue: https://github.com/milvus-io/milvus/issues/37630

pr: https://github.com/milvus-io/milvus/pull/38403

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2024-12-12 16:39:06 +08:00
yihao.dai
11118db7d6
enhance: [10kcp] remove unnecessary clone in meta cache (#38398)
issue: https://github.com/milvus-io/milvus/issues/36627,
https://github.com/milvus-io/milvus/issues/37630

pr: https://github.com/milvus-io/milvus/pull/36628

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
Co-authored-by: Ted Xu <ted.xu@zilliz.com>
2024-12-12 16:33:38 +08:00
congqixia
5521091dcd
enhance: [10kcp] Refine querynode collection number metrics (#38352)
Related to #37630

Previously the loaded collection metrics were calculated by scanning all
loaded segments in the segment manager, which was a slow and buggy
implementation.

This PR:

- Move collection num metrics to collection manager
- Remove deprecated loaded partition metrics update logic

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-12-10 21:06:42 +08:00
yihao.dai
4a2a5f0183
fix: [10kcp] Fix standby mixcoord start failed (#38327)
fix of https://github.com/milvus-io/milvus/pull/38324

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2024-12-10 11:47:45 +08:00
yihao.dai
15b01daec5
fix: [10kcp] Fix standby mixcoord start failed (#38324)
When standby transitions to active, the component state changes to
Initialize. If the initialization takes too long (exceeding the liveness
probe's maximum retries), the standby pod is stopped and fails to start.
This PR removes the Initialize state during standby transitions in
rolling upgrades. The state now switches directly from standby to
healthy, preventing health check failures.

issue: https://github.com/milvus-io/milvus/issues/37630

pr: https://github.com/milvus-io/milvus/pull/38308

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2024-12-10 10:53:50 +08:00
congqixia
24a055996b
enhance: [10kcp] Add secondary index for querynode segment manager (#38312)
Cherry-pick from pr: #38311
Related to #37630

Add secondary index with vchannel to reduce `GetBy` rlock holding time
when segment number is large.

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-12-09 19:56:16 +08:00
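
The secondary-index idea described in #38312 above can be illustrated with a minimal sketch. This is not the Milvus implementation: `Segment`, `SegmentManager`, and every method name below are hypothetical and chosen only for illustration. The assumption is that lookups by vchannel are frequent, so a second map keyed by vchannel lets a reader copy one small bucket under the read lock instead of scanning every segment.

```go
package main

import (
	"fmt"
	"sync"
)

// Segment is a hypothetical, simplified stand-in for a loaded segment.
type Segment struct {
	ID       int64
	VChannel string
}

// SegmentManager keeps a primary map by segment ID plus a secondary
// index keyed by vchannel, so channel lookups do not scan all segments.
type SegmentManager struct {
	mu        sync.RWMutex
	byID      map[int64]*Segment
	byChannel map[string]map[int64]*Segment
}

func NewSegmentManager() *SegmentManager {
	return &SegmentManager{
		byID:      make(map[int64]*Segment),
		byChannel: make(map[string]map[int64]*Segment),
	}
}

// Put inserts or replaces a segment and keeps both indexes in sync.
func (m *SegmentManager) Put(s *Segment) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if old, ok := m.byID[s.ID]; ok {
		m.dropFromChannelLocked(old)
	}
	m.byID[s.ID] = s
	bucket, ok := m.byChannel[s.VChannel]
	if !ok {
		bucket = make(map[int64]*Segment)
		m.byChannel[s.VChannel] = bucket
	}
	bucket[s.ID] = s
}

// Remove deletes a segment from both indexes if it exists.
func (m *SegmentManager) Remove(id int64) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if old, ok := m.byID[id]; ok {
		m.dropFromChannelLocked(old)
		delete(m.byID, id)
	}
}

// dropFromChannelLocked must be called with the write lock held.
func (m *SegmentManager) dropFromChannelLocked(s *Segment) {
	bucket := m.byChannel[s.VChannel]
	delete(bucket, s.ID)
	if len(bucket) == 0 {
		delete(m.byChannel, s.VChannel)
	}
}

// GetByChannel holds the read lock only while copying one bucket.
func (m *SegmentManager) GetByChannel(ch string) []*Segment {
	m.mu.RLock()
	defer m.mu.RUnlock()
	bucket := m.byChannel[ch]
	out := make([]*Segment, 0, len(bucket))
	for _, s := range bucket {
		out = append(out, s)
	}
	return out
}

func main() {
	m := NewSegmentManager()
	m.Put(&Segment{ID: 1, VChannel: "ch-a"})
	m.Put(&Segment{ID: 2, VChannel: "ch-b"})
	m.Put(&Segment{ID: 3, VChannel: "ch-a"})
	fmt.Println(len(m.GetByChannel("ch-a"))) // prints 2
}
```

The trade-off in this sketch is extra bookkeeping on every write in exchange for a read whose lock hold time depends on one channel's segment count rather than the total.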
yihao.dai
3e65cc5850
enhance: [10kcp] Enable score based balance channel policy (#38301)
issue: https://github.com/milvus-io/milvus/issues/38142
The current balance channel policy only considers the current collection's
distribution. If every collection has 1 channel and all channels have
been loaded on the same querynode, channel balancing is not triggered
after the number of querynodes increases.

This PR enables the score-based balance channel policy to:

1. distribute all channels evenly across multiple querynodes
2. distribute each collection's channels evenly across multiple
querynodes.

pr: https://github.com/milvus-io/milvus/pull/38143

---------

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
Co-authored-by: Wei Liu <wei.liu@zilliz.com>
2024-12-09 19:50:05 +08:00
yihao.dai
ae4e2b8063
fix: [10kcp] Query coord stop progress is too slow (#38300)
issue: https://github.com/milvus-io/milvus/issues/38237

Query coord saves each collection's target during the stop process so that
a new querycoord can recover quickly. But if the milvus cluster has
thousands of collections, this makes query coord's stop process much
slower than expected.

This PR refines the implementation to save a collection's target to etcd
when the target updates, and to clean it up when the collection is released.

pr: https://github.com/milvus-io/milvus/pull/38238

---------

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
Co-authored-by: Wei Liu <wei.liu@zilliz.com>
2024-12-09 19:49:49 +08:00
yihao.dai
2fe6423552
enhance: [10kcp] Speed up meta recovery (#38298)
Increase the batchSize in WalkWithPrefix operations to 10000.

issue: https://github.com/milvus-io/milvus/issues/37630

pr: https://github.com/milvus-io/milvus/pull/38285

---------

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2024-12-09 19:49:35 +08:00
yihao.dai
3d490aa158
fix: [10kcp] Replace outer lock with concurrent map (#38286)
See also: #37493
pr: #37817

Signed-off-by: yangxuan <xuan.yang@zilliz.com>
Co-authored-by: XuanYang-cn <xuan.yang@zilliz.com>
2024-12-09 19:49:20 +08:00
yihao.dai
df100e5bbe
fix: [10kcp] Fix init rootcoord meta timeout (#38249)
issue: https://github.com/milvus-io/milvus/issues/37630

pr: https://github.com/milvus-io/milvus/pull/38248

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2024-12-05 17:45:31 +08:00
Zhen Ye
99279e0bef
enhance: remove the rpc layer of coordinator when enabling standalone or mixcoord (#38246)
issue: #33285
pr: #37815

- remove the rpc layer of coordinator when enabling standalone or
mixcoord
- move health check into init

---------

Signed-off-by: chyezh <chyezh@outlook.com>
2024-12-05 17:27:53 +08:00
congqixia
c4df6b5910
enhance: [10kcp] Refine Replica manager colle2Replicas secondary index (#37907)
Related to #37630

This PR adds a new util coll2Replicas secondary index to reduce map
access and iteration when getting replicas by collection.

---------

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-12-05 11:57:29 +08:00
yihao.dai
d75fb5b3f8
enhance: [10kcp] Reduce mutex contention in datacoord meta (#38229)
1. Use a secondary index to avoid retrieving all segments in
GetSegmentsChanPart.
2. Perform batch SetAllocations to reduce the number of times the meta
lock is acquired.

issue: https://github.com/milvus-io/milvus/issues/37630

pr: https://github.com/milvus-io/milvus/pull/38219

---------

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2024-12-05 11:57:07 +08:00
yihao.dai
3219b869a3
fix: [10kcp] Fix timeout when listing meta (#38152)
When there are too many key-value pairs, the etcd list operation may
time out. This PR replaces LoadWithPrefix in list operations, which
could involve many keys, with WalkWithPrefix.

issue: https://github.com/milvus-io/milvus/issues/37917

pr: https://github.com/milvus-io/milvus/pull/38151

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2024-12-03 14:15:49 +08:00
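
As a rough illustration of why the paged walk in #38152 above avoids the timeout, the sketch below reads a prefix in fixed-size batches instead of one large list call. It is not the Milvus `WalkWithPrefix` code: `memKV`, `page`, and `walkWithPrefix` are hypothetical names, and an in-memory sorted key list stands in for etcd so the example stays self-contained.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// memKV is a hypothetical in-memory stand-in for a remote KV store.
type memKV struct {
	keys []string // kept sorted, like a range scan would return them
	vals map[string]string
}

func newMemKV(data map[string]string) *memKV {
	keys := make([]string, 0, len(data))
	for k := range data {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	return &memKV{keys: keys, vals: data}
}

// page returns at most limit keys with the given prefix that sort
// strictly after the key `after` (or from the prefix start if empty).
func (kv *memKV) page(prefix, after string, limit int) []string {
	start := sort.SearchStrings(kv.keys, prefix)
	if after != "" {
		start = sort.Search(len(kv.keys), func(i int) bool {
			return kv.keys[i] > after
		})
	}
	out := []string{}
	for i := start; i < len(kv.keys) && len(out) < limit; i++ {
		if !strings.HasPrefix(kv.keys[i], prefix) {
			break
		}
		out = append(out, kv.keys[i])
	}
	return out
}

// walkWithPrefix visits every key under prefix, one bounded batch at a
// time, so no single request has to return the whole key space.
func walkWithPrefix(kv *memKV, prefix string, batchSize int, fn func(k, v string) error) error {
	if batchSize <= 0 {
		return fmt.Errorf("batchSize must be positive, got %d", batchSize)
	}
	after := ""
	for {
		keys := kv.page(prefix, after, batchSize)
		for _, k := range keys {
			if err := fn(k, kv.vals[k]); err != nil {
				return err
			}
		}
		if len(keys) < batchSize {
			return nil // a short page means the prefix is exhausted
		}
		after = keys[len(keys)-1]
	}
}

func main() {
	kv := newMemKV(map[string]string{
		"seg/1": "a", "seg/2": "b", "seg/3": "c", "seg/4": "d", "seg/5": "e",
		"part/1": "x",
	})
	count := 0
	err := walkWithPrefix(kv, "seg/", 2, func(k, v string) error {
		count++
		return nil
	})
	fmt.Println(count, err) // prints 5 <nil>
}
```

A larger batch size means fewer round trips but bigger responses, which is the knob a later change (#38298) in this log raises to 10000.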
yihao.dai
0c29d8ff64
enhance: [10kcp] Update segment manager (#38153)
Use a channel level key lock for segments in segmentManager.

issue: https://github.com/milvus-io/milvus/issues/37633,
https://github.com/milvus-io/milvus/issues/37630

pr: https://github.com/milvus-io/milvus/pull/37836

---------

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2024-12-03 14:15:35 +08:00
yihao.dai
338ccc9ff9
enhance: [10kcp] Reduce memory usage of BF in DataNode and QueryNode (#38133)
1. DataNode: Skip generating BF during the insert phase (BF will be
regenerated during the sync phase).
2. QueryNode: Skip generating or maintaining BF for growing segments;
deletion checks will be handled in the segcore.

issue: https://github.com/milvus-io/milvus/issues/37630

pr: https://github.com/milvus-io/milvus/pull/38129

---------

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2024-12-02 14:41:19 +08:00
yihao.dai
0930430a68
enhance: [10kcp] Skip creating partition rate limiters when not enabled (#38062)
issue: https://github.com/milvus-io/milvus/issues/37630

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2024-11-28 10:45:46 +08:00
yihao.dai
635d161109
enhance: [10kcp] Accelerate observe collection (#38058)
issue: https://github.com/milvus-io/milvus/issues/37630

---------

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2024-11-28 10:05:24 +08:00
yihao.dai
312475d1f1
enhance: [10kcp] remove the rpc level of coordinator (#37984)
issue: https://github.com/milvus-io/milvus/issues/37764

- add a local client to call local server directly for
querycoord/rootcoord/datacoord.
- enable local client if milvus is running mixcoord or standalone mode.

Signed-off-by: chyezh <chyezh@outlook.com>

---------

Signed-off-by: chyezh <chyezh@outlook.com>
Co-authored-by: Zhen Ye <chyezh@outlook.com>
2024-11-25 14:50:42 +08:00
yihao.dai
e5c16e0676
fix: [10kcp] Fix checkGeneralCapacity slowly (#37981)
Cache the general count to speed up checkGeneralCapacity.

issue: https://github.com/milvus-io/milvus/issues/37630

pr: https://github.com/milvus-io/milvus/pull/37976

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2024-11-25 14:50:24 +08:00
yihao.dai
fd30034c77
fix: [10kcp] Fix data view and add more ut (#37915)
Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2024-11-21 21:35:42 +08:00
yihao.dai
4845e4d679
enhance: [10kcp] Revert "enhance: remove the rpc level of coordinator (#37914)
Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2024-11-21 21:35:29 +08:00
yihao.dai
bf90e55319
enhance: [10kcp] Reduce GetRecoveryInfo calls (#37891)
1. Introduce a data view mechanism for DataCoord, attempting to update
each collection's data view periodically.
2. QueryCoord maintains a cache of data view versions. Before
batch-fetching recovery info, it retrieves all versions and only fetches
recovery info for collections with updated versions.
3. Return DataCoord's current data view when fetching RecoverInfo.

issue: https://github.com/milvus-io/milvus/issues/37743,
https://github.com/milvus-io/milvus/issues/37630

pr: https://github.com/milvus-io/milvus/pull/37863

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2024-11-21 15:43:13 +08:00
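
The version check in step 2 of #37891 above can be sketched as follows. This is an assumption-laden illustration, not the QueryCoord code: `versionCache`, `changed`, `commit`, and `refresh` are hypothetical names, and `fetch` stands in for whatever call retrieves recovery info for one collection.

```go
package main

import (
	"fmt"
	"sort"
)

// versionCache remembers the last data view version seen per collection.
type versionCache struct {
	seen map[int64]int64
}

func newVersionCache() *versionCache {
	return &versionCache{seen: make(map[int64]int64)}
}

// changed returns, in ascending order, the collection IDs whose current
// version is new or differs from the cached one.
func (c *versionCache) changed(current map[int64]int64) []int64 {
	var out []int64
	for id, v := range current {
		if old, ok := c.seen[id]; !ok || old != v {
			out = append(out, id)
		}
	}
	sort.Slice(out, func(i, j int) bool { return out[i] < out[j] })
	return out
}

// commit records a version after its recovery info was fetched.
func (c *versionCache) commit(id, version int64) {
	c.seen[id] = version
}

// refresh fetches recovery info only for collections whose version
// changed. A failed fetch leaves the cached version untouched, so the
// collection is retried on the next round.
func refresh(c *versionCache, current map[int64]int64, fetch func(id int64) error) []int64 {
	fetched := []int64{}
	for _, id := range c.changed(current) {
		if err := fetch(id); err != nil {
			continue
		}
		c.commit(id, current[id])
		fetched = append(fetched, id)
	}
	return fetched
}

func main() {
	cache := newVersionCache()
	calls := 0
	fetch := func(id int64) error { calls++; return nil }

	refresh(cache, map[int64]int64{1: 1, 2: 1}, fetch) // both collections are new
	refresh(cache, map[int64]int64{1: 1, 2: 2}, fetch) // only collection 2 changed
	fmt.Println(calls)                                 // prints 3
}
```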
Zhen Ye
ce8069c0fd
enhance: remove the rpc layer of coordinator when enabling standalone or mixcoord (#37892)
issue: #37764

- add a local client to call local server directly for
querycoord/rootcoord/datacoord.
- enable local client if milvus is running mixcoord or standalone mode.

Signed-off-by: chyezh <chyezh@outlook.com>
2024-11-21 15:42:18 +08:00
Zhen Ye
1a6b98be77
enhance: remove the rpc level of coordinator (#37876)
issue: #33285
pr: #37722

- move most cgo operations related to search/query into the segcore package
for reuse by streamingnode.
- add go unittest for segcore operations.

Signed-off-by: chyezh <chyezh@outlook.com>
2024-11-21 15:21:11 +08:00
yihao.dai
99da46dd0b
fix: [10kcp] Fix load slowly (#37454) (#37878)
When there're a lot of loaded collections, they would occupy the target
observer scheduler’s pool. This prevents loading collections from
updating the current target in time, slowing down the load process. This
PR adds a separate target dispatcher for loading collections.

issue: https://github.com/milvus-io/milvus/issues/37166

pr: https://github.com/milvus-io/milvus/pull/37454

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2024-11-21 15:11:03 +08:00
yihao.dai
ac7b485a08
enhance: [10kcp] Accelerate the loading of collection (#37879)
Remove unnecessary ListIndex and DescribeCollection RPC call during
loading.

issue: https://github.com/milvus-io/milvus/issues/37166,
https://github.com/milvus-io/milvus/issues/37630

pr: https://github.com/milvus-io/milvus/pull/37741

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2024-11-21 15:10:36 +08:00
yihao.dai
9e1ba0759c
enhance: [10kcp] Optimize segmentManager segments (#37884)
1. Use vchannel and partition indices for segments.
2. Replace coarse-grained mutex with concurrent map.

issue: https://github.com/milvus-io/milvus/issues/37633,
https://github.com/milvus-io/milvus/issues/37630

pr: https://github.com/milvus-io/milvus/pull/37836

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2024-11-21 15:10:04 +08:00
yihao.dai
92ab65ada0
enhance: [10kcp] Reduce GetIndexInfos calls (#37877)
Batch GetIndexInfos calls for segments to reduce RPC calls.

issue: https://github.com/milvus-io/milvus/issues/37634

pr: https://github.com/milvus-io/milvus/pull/37695

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2024-11-21 15:09:39 +08:00
congqixia
0bd26171d5
enhance: [2.4] Provide secondary index criteria when filter leaderview (#37777) (#37802)
Cherry-pick from master
pr: #37777 
Related to #37630

---------

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-11-21 10:48:33 +08:00
congqixia
28adfe4629
enhance: [2.4] Remove unnecessary segment clone updating dist (#37797) (#37833)
Cherry-pick from master
pr: #37797
Related to #37630

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-11-20 19:48:33 +08:00
sre-ci-robot
5ac4e4839e
[automated] Bump milvus version to v2.4.16 (#37790)
Bump milvus version to v2.4.16
Signed-off-by: sre-ci-robot <sre-ci-robot@users.noreply.github.com>

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2024-11-20 11:46:37 +08:00
congqixia
cffde80e68
enhance: [2.4] Prevent generate "null" search params (#37811)
pr: #37812
Prevent generating null search params in RESTful search requests.

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
v2.4.16
2024-11-19 18:20:32 +08:00
Zhen Ye
ebfd917bb6
fix: make asan available when building milvus image (#37804)
issue: #35854
pr: #37041

- USE_ASAN will not enable the Debug mode.
- replace USE_ASAN with `ldd` to generate the right .so files in the milvus image.

Signed-off-by: chyezh <chyezh@outlook.com>
Co-authored-by: yellow-shine <sammy.huang@zilliz.com>
2024-11-19 17:28:32 +08:00
congqixia
a10f95d71c
enhance: Bump milvus & proto version to v2.4.16 (#37762)
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-11-18 20:36:31 +08:00
congqixia
876e06b862
fix: [2.4] Load l0 delta for growings when using RemoteLoad (#37772)
Cherry-pick from master
pr: #37771
Related to #37574

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-11-18 20:26:31 +08:00
smellthemoon
46692d7525
enhance: support upsert autoid==true in Restful API and fix some bugs(#37072)(#37487) (#37766)
pr: #37072
pr: #37487

---------

Signed-off-by: lixinguo <xinguo.li@zilliz.com>
Co-authored-by: lixinguo <xinguo.li@zilliz.com>
2024-11-18 19:44:31 +08:00