650 Commits

Author SHA1 Message Date
yihao.dai
893caee467
fix: [2.5] Fix task delta cache data race (#40262)
issue: https://github.com/milvus-io/milvus/issues/40258

pr: https://github.com/milvus-io/milvus/pull/40259

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2025-03-02 16:52:10 +08:00
wei liu
82c000a4b2
fix: task delta cache leak due to duplicate task id (#40184)
issue: #40052
pr: #40183

The task delta cache relies on the taskID being unique: it calls
incDeltaCache at AddTask and decDeltaCache at RemoveTask. But the taskID
allocator is not atomic, which can produce two tasks with the same taskID.
In that case incDeltaCache is called twice but decDeltaCache only once,
which causes a delta cache leak.
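
As an illustration of the uniqueness requirement, here is a minimal Go sketch of an allocator backed by an atomic counter. The names `idAllocator`, `Next`, and `allocateConcurrently` are hypothetical and are not taken from the Milvus code base; the sketch only shows that an atomic increment never hands the same ID to two concurrent callers.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// idAllocator is a hypothetical task ID allocator (not the Milvus type).
// atomic.Int64 makes each increment indivisible, so two goroutines can
// never receive the same ID.
type idAllocator struct {
	next atomic.Int64
}

// Next returns a fresh ID.
func (a *idAllocator) Next() int64 { return a.next.Add(1) }

// allocateConcurrently draws n IDs from n goroutines and reports how many
// distinct values were handed out.
func allocateConcurrently(a *idAllocator, n int) int {
	var (
		mu   sync.Mutex
		wg   sync.WaitGroup
		seen = make(map[int64]struct{}, n)
	)
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			id := a.Next()
			mu.Lock()
			seen[id] = struct{}{}
			mu.Unlock()
		}()
	}
	wg.Wait()
	return len(seen)
}

func main() {
	// With an atomic allocator every one of the 1000 IDs is distinct.
	fmt.Println(allocateConcurrently(&idAllocator{}, 1000))
}
```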

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-02-28 10:22:08 +08:00
wei liu
14f05650e3
enhance: clean shard location cache after collection released (#40228)
issue: #40077
pr: #40088

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-02-27 19:42:05 +08:00
Xianhui Lin
a4eb2ce224
fix: [2.5] Revert qc statschecker for json key stats (#40125)
Revert qc statschecker for json key stats
issue: https://github.com/milvus-io/milvus/issues/36995
pr: https://github.com/milvus-io/milvus/pull/39876

Signed-off-by: Xianhui.Lin <xianhui.lin@zilliz.com>
2025-02-24 13:31:55 +08:00
congqixia
709594f158
enhance: [2.5] Use v2 package name for pkg module (#40117)
Cherry-pick from master
pr: #39990
Related to #39095

https://go.dev/doc/modules/version-numbers

Update pkg version according to golang dep version convention

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2025-02-23 00:46:01 +08:00
Xianhui Lin
c1de61ff7c
fix: [2.5] Replace the position of EnabledJSONKeyStats (#40108)
Replace the position of EnabledJSONKeyStats
issue: https://github.com/milvus-io/milvus/issues/36995
pr: https://github.com/milvus-io/milvus/pull/38039

---------

Signed-off-by: Xianhui.Lin <xianhui.lin@zilliz.com>
2025-02-22 14:35:54 +08:00
yihao.dai
b8a758b6c4
enhance: [2.5] Add get vector latency metric and refine request limit error message (#40085)
issue: https://github.com/milvus-io/milvus/issues/40078

pr: https://github.com/milvus-io/milvus/pull/40083

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2025-02-21 20:19:55 +08:00
wei liu
82fb0bf9c1
fix: [2.5] task delta cache leak on reduce task (#40056)
issue: #40052
pr: #40055

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-02-21 16:49:54 +08:00
wei liu
e42c944e04
fix: [2.5] querycoord panic in corner case (#40058)
issue: #40050 
pr: #40057

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-02-21 11:19:58 +08:00
wei liu
3c2d8c1419
enhance: [2.5] Add management api to check querycoord balance status (#37784) (#39909)
issue: #37783
pr: #37784

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-02-19 10:56:49 +08:00
wei liu
bf54f47c34
enhance: [2.5] use rated logger for high frequency log in dist handler (#39452) (#39928)
pr: #39452

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-02-18 14:32:52 +08:00
Xianhui Lin
f0964f769d
enhance: [2.5] Add json key inverted index in stats for optimization (#39876)
Add json key inverted index in stats for optimization
issue: https://github.com/milvus-io/milvus/issues/36995
pr: https://github.com/milvus-io/milvus/pull/38039

---------

Signed-off-by: Xianhui.Lin <xianhui.lin@zilliz.com>
Co-authored-by: luzhang <luzhang@zilliz.com>
2025-02-16 20:12:15 +08:00
congqixia
9407a3c9b1
fix: [2.5] Check collection released before target checks (#39843)
Cherry-pick from master
pr: #39841 
Related to #39840

The target could be updated asynchronously in the previous code. This PR
makes removing a collection from the target observer block until all
related tasks in the dispatchers are removed, preventing the metrics from
being updated after the collection is released.

---------

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2025-02-13 20:00:15 +08:00
wei liu
82dc57ace0
fix: [skip e2e][2.5] pr conflict cause ut failed (#39810)
Related to https://github.com/milvus-io/milvus/pull/39701 &
https://github.com/milvus-io/milvus/issues/39681

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-02-12 11:44:51 +08:00
congqixia
4322a0d49a
fix: [2.5] Resolve conflict on qc task test (#39797)
Cherry-pick from master
pr: #39796
Related to #39701 & #39681

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2025-02-11 18:52:45 +08:00
wei liu
11cba57dc7
fix: [2.5] load collection gets stuck if compaction/gc happens (#39761)
issue: #39680
pr: #39701
If compaction/gc happens, load collection may get stuck due to
SegmentNotFound. We should trigger UpdateNextTarget to get a new data
view to execute the loading operation.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-02-11 15:48:50 +08:00
wei liu
969e34d540
fix: [2.5] uneven distribution caused by executing task delta cache leak (#39759)
issue: #39681
pr: #39702
This PR maintains the workload effect in the action instead of computing
the workload effect from the target, which may cause a leak if the target
changes.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-02-11 14:32:46 +08:00
jaime
ddc5b299ad
enhance: expose more metrics data (#39466)
issue: #36621 #39417
pr: #39456
1. Adjust the server-side cache size.
2. Add source information for configurations.
3. Add node ID for compaction and indexing tasks.
4. Resolve localhost access issues to fix health check failures for
etcd.

Signed-off-by: jaime <yun.zhang@zilliz.com>
2025-02-07 11:48:45 +08:00
yihao.dai
4464966462
enhance: [2.5] Remove frequent observe log (#39414)
/kind improvement

pr: https://github.com/milvus-io/milvus/pull/39413

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2025-01-20 11:01:10 +08:00
yihao.dai
89a183c7c2
enhance: [2.5] enable task delta cache (#39349)
When there are many segment tasks in the querycoord scheduler, the
traversal in GetSegmentTaskDelta checks becomes time-consuming. This PR
adds caching for segment deltas.
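
A minimal Go sketch of the caching idea, assuming the cache is keyed by node and collection and updated when a task is added or removed. The type `deltaCache` and its methods are hypothetical names for illustration, not the Milvus implementation.

```go
package main

import (
	"fmt"
	"sync"
)

// cacheKey identifies one (node, collection) pair. Hypothetical layout.
type cacheKey struct{ nodeID, collectionID int64 }

// deltaCache keeps a running sum of segment deltas per key, so a lookup is
// a map read instead of a traversal over every task in the scheduler.
type deltaCache struct {
	mu     sync.RWMutex
	deltas map[cacheKey]int
}

func newDeltaCache() *deltaCache {
	return &deltaCache{deltas: make(map[cacheKey]int)}
}

// Inc is called when a task is added.
func (c *deltaCache) Inc(nodeID, collectionID int64, delta int) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.deltas[cacheKey{nodeID, collectionID}] += delta
}

// Dec is called when a task is removed; it must mirror Inc exactly once,
// otherwise the cached value leaks.
func (c *deltaCache) Dec(nodeID, collectionID int64, delta int) {
	c.mu.Lock()
	defer c.mu.Unlock()
	k := cacheKey{nodeID, collectionID}
	c.deltas[k] -= delta
	if c.deltas[k] == 0 {
		delete(c.deltas, k) // drop empty entries to keep the map small
	}
}

// Get returns the cached delta for the key, or 0 if there is none.
func (c *deltaCache) Get(nodeID, collectionID int64) int {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return c.deltas[cacheKey{nodeID, collectionID}]
}

func main() {
	c := newDeltaCache()
	c.Inc(1, 100, 3)
	c.Inc(1, 100, 2)
	c.Dec(1, 100, 3)
	fmt.Println(c.Get(1, 100)) // prints 2
}
```

Guarding the map with a mutex keeps concurrent Inc/Dec/Get calls free of data races, at the cost of a short critical section per update.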

issue: https://github.com/milvus-io/milvus/issues/37630

pr: https://github.com/milvus-io/milvus/pull/39307

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
Co-authored-by: Wei Liu <wei.liu@zilliz.com>
2025-01-17 12:01:03 +08:00
yihao.dai
6773fb10a8
enhance: [2.5] Read metadata concurrently to accelerate recovery (#38900)
Read metadata such as segments, binlogs, and partitions concurrently at
the collection level.
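
A rough Go sketch of collection-level concurrent loading, using only the standard library. The `loadFunc` signature and `loadAll` helper are assumptions made for illustration; the real recovery code and its metadata types are not shown in this log.

```go
package main

import (
	"fmt"
	"sync"
)

// loadFunc is a hypothetical loader that reads the metadata of one
// collection (segments, binlogs, partitions) and returns it as a string.
type loadFunc func(collectionID int64) (string, error)

// loadAll starts one goroutine per collection and waits for all of them.
// It returns the first error observed, if any.
func loadAll(ids []int64, load loadFunc) (map[int64]string, error) {
	var (
		mu       sync.Mutex
		wg       sync.WaitGroup
		firstErr error
		out      = make(map[int64]string, len(ids))
	)
	for _, id := range ids {
		wg.Add(1)
		go func(id int64) {
			defer wg.Done()
			meta, err := load(id)
			mu.Lock()
			defer mu.Unlock()
			if err != nil {
				if firstErr == nil {
					firstErr = err
				}
				return
			}
			out[id] = meta
		}(id)
	}
	wg.Wait()
	if firstErr != nil {
		return nil, firstErr
	}
	return out, nil
}

func main() {
	metas, err := loadAll([]int64{1, 2, 3}, func(id int64) (string, error) {
		return fmt.Sprintf("meta-%d", id), nil
	})
	fmt.Println(len(metas), err) // prints: 3 <nil>
}
```

A production version would likely bound the number of goroutines; this sketch leaves that out for brevity.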

issue: https://github.com/milvus-io/milvus/issues/37630

pr: https://github.com/milvus-io/milvus/pull/38403

---------

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2025-01-16 17:53:01 +08:00
yihao.dai
9d2a0e775c
fix: [2.5] Fix slow dist handle and slow observe (#38905)
1. Provide partition&channel level indexing in the collection target.
2. Make SegmentAction not wait for distribution.
3. Remove scheduler and target manager mutex.
4. Optimize logging to reduce CPU overhead.

issue: https://github.com/milvus-io/milvus/issues/37630

pr: https://github.com/milvus-io/milvus/pull/38566

---------

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2025-01-16 17:07:02 +08:00
yihao.dai
c741b8be2b
fix: [2.5] Remove frequently updating metric to avoid mutex contention (#38778)
issue: https://github.com/milvus-io/milvus/issues/37630

Reduce the frequency of `updateIndexTasksMetrics` to avoid holding the
mutex for long periods.

pr: https://github.com/milvus-io/milvus/pull/38775

---------

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2025-01-16 11:51:02 +08:00
wei liu
76ed552b00
enhance: Add logs for check health failed (#39208) (#39302)
pr: #39208

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-01-16 10:31:04 +08:00
wei liu
51994158d9
fix: channel unbalance during stopping balance progress (#38971) (#39200)
issue: #38970
pr: #38971
The stopping balance channel still uses the row_count_based policy,
which may cause channel imbalance in the multi-collection case.

This PR implements a score based stopping balance channel policy.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-01-14 18:25:00 +08:00
wei liu
4fd56e4773
fix: Prevent leader checker from generating excessive duplicate leader tasks (#39000) (#39160)
issue: #39001
pr: #39000
Background:
Segment Load Version: Each segment load request assigns a timestamp as
its version. When multiple copies of a segment are loaded on different
QueryNodes, the leader checker uses this version to identify the latest
copy and updates the routing table in the leader view to point to it.
Delegator Router Version: When a delegator builds a route to a QueryNode
that has loaded a segment, it also records the segment's version.

Router Table Update Logic: If the leader checker detects that the
version of a segment in the routing table does not match the version in
the worker, it updates the routing table to point to the QueryNode with
the latest version. Additionally, it updates the segment's load version
in the QueryNode during this process.

Issue:
When a channel is undergoing load balancing, the leader checker may sync
the routing table to a new delegator. This sync operation modifies the
segment's load version, which invalidates the routing in the old
delegator. Subsequently, the leader checker updates the routing table in
the old delegator, breaking the routing in the new delegator. This cycle
continues, causing repeated updates and inconsistencies.

Fix:
This PR introduces two changes to address the issue:
1. Use NodeID to verify whether the delegator's routing table needs an
update, avoiding unnecessary modifications.
2. Ensure compatibility by using the latest segment's load version as
the version recorded in the routing table.

These changes resolve the cyclic updates and prevent the leader checker
from generating excessive duplicate tasks, ensuring routing stability
across delegators during load balancing.
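
The first change can be sketched in Go as follows. The types `segmentCopy` and `routeEntry` and the helpers `latestCopy` and `needsSync` are hypothetical and only model the description above: the routing entry is compared by NodeID against the copy with the latest load version, so a version mismatch alone no longer triggers a sync.

```go
package main

import "fmt"

// segmentCopy is one loaded copy of a segment on a QueryNode (hypothetical).
type segmentCopy struct {
	NodeID  int64
	Version int64 // load timestamp used as the version
}

// routeEntry is what a delegator's routing table records for a segment
// (hypothetical).
type routeEntry struct {
	NodeID  int64
	Version int64
}

// latestCopy returns the copy with the highest load version.
func latestCopy(copies []segmentCopy) (segmentCopy, bool) {
	if len(copies) == 0 {
		return segmentCopy{}, false
	}
	latest := copies[0]
	for _, c := range copies[1:] {
		if c.Version > latest.Version {
			latest = c
		}
	}
	return latest, true
}

// needsSync reports whether the routing entry must be updated. It compares
// node IDs only, so two delegators that record different versions for the
// same node do not keep invalidating each other.
func needsSync(entry routeEntry, copies []segmentCopy) bool {
	latest, ok := latestCopy(copies)
	return ok && entry.NodeID != latest.NodeID
}

func main() {
	copies := []segmentCopy{{NodeID: 1, Version: 10}, {NodeID: 2, Version: 20}}
	// Same node as the latest copy, different version: no sync needed.
	fmt.Println(needsSync(routeEntry{NodeID: 2, Version: 5}, copies))
	// Points at a stale node: sync needed.
	fmt.Println(needsSync(routeEntry{NodeID: 1, Version: 10}, copies))
}
```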

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-01-14 18:11:06 +08:00
Zhen Ye
adfc3f945e
enhance: record memory size (uncompressed) item for index (#38844)
issue: #38715 
pr: #38770

- Currently Milvus uses the serialized (compressed) index size to
estimate the resources needed for loading.
- Add a new field MemSize (size before compression) to the index for
resource estimation.

---------

Signed-off-by: chyezh <chyezh@outlook.com>
2025-01-14 10:33:06 +08:00
jaime
b0afe32c98
fix: unstable ut in leader_view_manager.go file (#39162)
issue: #38672
pr: #39161

Signed-off-by: jaime <yun.zhang@zilliz.com>
2025-01-10 19:54:57 +08:00
Zhen Ye
95809ca767
enhance: make new go package to manage proto (#39128)
issue: #39095
pr: #39114

---------

Signed-off-by: chyezh <chyezh@outlook.com>
2025-01-10 10:53:01 +08:00
jaime
0693634f62
enhance: add db name in replica description (#38673)
issue: #36621
pr: #38672

Signed-off-by: jaime <yun.zhang@zilliz.com>
2025-01-09 19:43:04 +08:00
wei liu
35cef0567c
enhance: Add log for case which target not update as expected (#38944) (#39046)
pr: #38944

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-01-08 19:32:57 +08:00
Xiaofan
a2c4cd59ce
fix: [2.5] drop partition cannot succeed if load failed (#38874)
fix https://github.com/milvus-io/milvus/issues/38649
pr: #38793
When a partition load fails, the partition drop also fails due to the
wrong error message.

Signed-off-by: xiaofanluan <xiaofan.luan@zilliz.com>
2025-01-02 09:56:53 +08:00
wei liu
f441ccdbe9
fix: [2.5] Prevent balancer from overloading the same QueryNode (#38724)
issue: #38718
pr: #38719
The balancer calculates the workload of executing tasks as an ongoing
score for target nodes. However, a logic issue arises when
GetSegmentTaskDelta or GetChannelTaskDelta is called with
collectionID=-1, which incorrectly returns zero.

Due to the incorrect global score, the executing task's workload is not
properly reflected for each collection. Consequently, each collection
submits its own balance task, leading to the balancer assigning
excessive tasks to the same QueryNode.
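
A small Go sketch of the lookup semantics involved, under the assumption that collectionID=-1 is meant as "all collections" when computing the global score. The names `taskDelta`, `allCollections`, and `sumDelta` are hypothetical and do not reproduce GetSegmentTaskDelta or GetChannelTaskDelta.

```go
package main

import "fmt"

// allCollections is the wildcard value; treating -1 this way is an
// assumption made for this sketch.
const allCollections int64 = -1

// taskDelta is the workload effect of one executing task (hypothetical).
type taskDelta struct {
	NodeID       int64
	CollectionID int64
	Delta        int
}

// sumDelta adds up the deltas of tasks targeting nodeID. When collectionID
// is allCollections, tasks of every collection are counted instead of
// matching the literal value -1 and returning zero.
func sumDelta(tasks []taskDelta, nodeID, collectionID int64) int {
	sum := 0
	for _, t := range tasks {
		if t.NodeID != nodeID {
			continue
		}
		if collectionID != allCollections && t.CollectionID != collectionID {
			continue
		}
		sum += t.Delta
	}
	return sum
}

func main() {
	tasks := []taskDelta{
		{NodeID: 1, CollectionID: 100, Delta: 2},
		{NodeID: 1, CollectionID: 200, Delta: 3},
		{NodeID: 2, CollectionID: 100, Delta: 7},
	}
	fmt.Println(sumDelta(tasks, 1, 100))            // prints 2
	fmt.Println(sumDelta(tasks, 1, allCollections)) // prints 5
}
```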

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-12-25 16:16:49 +08:00
wei liu
cb0618b2d4
fix: [2.5] Querycoord will trigger unexpected balance task after restart (#38725)
issue: https://github.com/milvus-io/milvus/issues/38606
pr: https://github.com/milvus-io/milvus/pull/38630

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-12-25 16:14:49 +08:00
wei liu
b16d04d7cc
fix: Fix update loading collection's load config doesn't work (#38737)
issue: #38594 
pr: #38595

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-12-25 15:02:50 +08:00
jaime
11bedf5e76
fix: Revert "Expose metrics of standby coordinators (#27698)" (#38621)
issue: #38608
pr: #38620

Signed-off-by: jaime <yun.zhang@zilliz.com>
2024-12-20 18:04:47 +08:00
jaime
78438ef41e
fix: revert optimize CPU usage for CheckHealth requests (#35589) (#38555)
issue: #35563

Signed-off-by: jaime <yun.zhang@zilliz.com>
2024-12-19 00:38:45 +08:00
yihao.dai
d3c174b0f1
enhance: Accelerate observe collection (#38028)
1. A collection should observe the channel only once.
2. A collection should check the CollectionLoadPercent for updates only
once.
3. Skip saving coll/partition meta if there are no changes, primarily to
accelerate collection observation after recovery.

issue: https://github.com/milvus-io/milvus/issues/37630

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2024-12-17 14:14:45 +08:00
jaime
28fdbc4e30
enhance: optimize CPU usage for CheckHealth requests (#35589)
issue: #35563
1. Use an internal health checker to monitor the cluster's health state,
storing the latest state on the coordinator node. The CheckHealth
request retrieves the cluster's health from this latest state on the
proxy side, which enhances cluster stability.
2. Each health check will assess all collections and channels, with
detailed failure messages temporarily saved in the latest state.
3. Use the CheckHealth request instead of the heavy GetMetrics request on
the querynode and datanode.
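
The pattern in item 1 can be sketched in Go as a background checker that caches its latest result, so that request handlers read the cached state instead of running the check themselves. All names here (`healthChecker`, `healthState`, `refresh`, `Run`, `Latest`) are hypothetical and not the Milvus API.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// healthState is the cached result of the most recent check (hypothetical).
type healthState struct {
	Healthy   bool
	Reasons   []string // detailed failure messages, empty when healthy
	CheckedAt time.Time
}

// healthChecker runs check periodically and stores the latest state.
type healthChecker struct {
	mu     sync.RWMutex
	latest healthState
	check  func() []string // returns failure reasons; empty means healthy
}

// refresh runs one check and replaces the cached state.
func (h *healthChecker) refresh() {
	reasons := h.check()
	h.mu.Lock()
	h.latest = healthState{Healthy: len(reasons) == 0, Reasons: reasons, CheckedAt: time.Now()}
	h.mu.Unlock()
}

// Run refreshes the state on every tick until stop is closed.
func (h *healthChecker) Run(interval time.Duration, stop <-chan struct{}) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			h.refresh()
		case <-stop:
			return
		}
	}
}

// Latest is the cheap read path used by request handlers.
func (h *healthChecker) Latest() healthState {
	h.mu.RLock()
	defer h.mu.RUnlock()
	return h.latest
}

func main() {
	h := &healthChecker{check: func() []string { return []string{"channel ch-1 unavailable"} }}
	h.refresh()
	s := h.Latest()
	fmt.Println(s.Healthy, s.Reasons)
}
```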

Signed-off-by: jaime <yun.zhang@zilliz.com>
2024-12-17 11:02:45 +08:00
SimFG
2afe2eaf3e
feat: support replicating collection when the service contains the system tt msg (#37559)
- issue: #37105

---------

Signed-off-by: SimFG <bang.fu@zilliz.com>
2024-12-17 09:08:46 +08:00
wei liu
659847c11f
enhance: Remove load task limit in one round (#38436)
The task limit in assignSegment/assignChannel applies to both load
tasks and balance tasks.

This PR removes the load task limit and only limits the number of balance
tasks in one round.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-12-16 19:30:43 +08:00
wei liu
40f9db491e
fix: Fix SyncDistribution may cost too much time on retry (#38454)
issue: #38428

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-12-16 11:38:44 +08:00
tinswzy
27229f7907
enhance: refine existing log print with ctx (#38080)
issue: #35917
Refine existing log printing to use ctx.

Signed-off-by: tinswzy <zhenyuan.wei@zilliz.com>
2024-12-14 22:36:44 +08:00
Zhen Ye
833c74aa66
enhance: add detail, replica count for resource group (#38314)
issue: #30647

---------

Signed-off-by: chyezh <chyezh@outlook.com>
2024-12-13 14:14:50 +08:00
wei liu
e279ccf109
enhance: Enable score based balance channel policy (#38143)
issue: #38142
The current balance channel policy only considers the current
collection's distribution. So if every collection has 1 channel and all
channels have been loaded on the same querynode, balance channel won't be
triggered after the number of querynodes increases.

This PR enables the score based balance channel policy, to achieve:
1. distribute all channels evenly across multiple querynodes
2. distribute each collection's channels evenly across multiple
querynodes.
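
One way to picture a score that serves both goals is sketched below in Go. The scoring formula, the `collectionWeight` constant, and the names `nodeLoad`, `score`, and `pickNode` are assumptions made for illustration; the log does not state how the actual policy computes its score.

```go
package main

import "fmt"

// collectionWeight is a hypothetical weight that makes channels of the
// collection being balanced count more than channels of other collections.
const collectionWeight = 10

// nodeLoad describes the channels currently placed on one querynode.
type nodeLoad struct {
	NodeID             int64
	TotalChannels      int // channels of all collections on this node
	CollectionChannels int // channels of the collection being balanced
}

// score combines the global and per-collection channel counts.
func score(n nodeLoad) int {
	return n.TotalChannels + collectionWeight*n.CollectionChannels
}

// pickNode returns the node with the lowest score; ties go to the lower
// NodeID so the choice is deterministic.
func pickNode(nodes []nodeLoad) (int64, bool) {
	if len(nodes) == 0 {
		return 0, false
	}
	best := nodes[0]
	for _, n := range nodes[1:] {
		if score(n) < score(best) || (score(n) == score(best) && n.NodeID < best.NodeID) {
			best = n
		}
	}
	return best.NodeID, true
}

func main() {
	nodes := []nodeLoad{
		{NodeID: 1, TotalChannels: 5, CollectionChannels: 0},
		{NodeID: 2, TotalChannels: 0, CollectionChannels: 0}, // newly added node
	}
	id, _ := pickNode(nodes)
	fmt.Println(id) // prints 2
}
```

Because the total channel count is part of the score, a newly added empty node wins even when each collection has only one channel.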

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-12-11 17:20:43 +08:00
Zhen Ye
d3ae8e9232
fix: delay the wait other coord logic in query coord after query coord change into standby state (#38259)
issue: https://github.com/milvus-io/milvus/issues/37764

- After removing the rpc layer from mixcoord, the querycoord in standby
mode will be blocked forever during a rolling deployment

---------

Signed-off-by: chyezh <chyezh@outlook.com>
2024-12-11 15:48:42 +08:00
wei liu
950203aba0
enhance: Optimize save collection target latency (#38345)
issue: #38237
This PR uses the better compression level only for proto msgs larger
than 1MB, and uses a lighter compression level for smaller proto msgs,
which gives better latency in most cases.

This PR reduces the latency from 22.7s to 4.7s with 10000 collections,
each of which has 1000 segments.

before this PR:
BenchmarkTargetManager-8 1 22781536357 ns/op 566407275088 B/op 11188282 allocs/op
after this PR:
BenchmarkTargetManager-8 1 4729566944 ns/op 36713248864 B/op 10963615 allocs/op
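
The size-based choice of compression level can be sketched in Go as follows. This sketch uses the standard library's `compress/gzip` purely for illustration; the codec, the level constants, and the helper names `levelFor` and `compress` are assumptions, and only the 1MB threshold comes from the description above.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
)

const (
	sizeThreshold = 1 << 20              // 1MB, as stated in the commit message
	lightLevel    = gzip.BestSpeed       // assumed level for small messages
	heavyLevel    = gzip.BestCompression // assumed level for large messages
)

// levelFor picks a cheap level for small payloads and a stronger one only
// when the payload is larger than the threshold.
func levelFor(size int) int {
	if size > sizeThreshold {
		return heavyLevel
	}
	return lightLevel
}

// compress compresses data with the level chosen from its size.
func compress(data []byte) ([]byte, error) {
	var buf bytes.Buffer
	w, err := gzip.NewWriterLevel(&buf, levelFor(len(data)))
	if err != nil {
		return nil, err
	}
	if _, err := w.Write(data); err != nil {
		return nil, err
	}
	if err := w.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

func main() {
	small := bytes.Repeat([]byte("segment"), 100)
	out, err := compress(small)
	fmt.Println(len(small), len(out), err)
}
```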

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-12-11 10:12:43 +08:00
congqixia
7ea9c983d2
enhance: Add mockery package config for QC&QN (#38340)
Related to #38339

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-12-10 19:18:42 +08:00
wei liu
856e2aad7d
fix: Leader task stuck and retry again and again (#38202)
issue: #38201
A leader task is required to update the delegator's distribution, and
only succeeds after the distribution change has been applied to the
delegator. But the delegator rejects the distribution change if its
version is older than the current version in the delegator, which causes
the leader task to get stuck and retry forever.

This PR removes the leader task finish check.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-12-10 19:16:42 +08:00
wei liu
f04986fceb
enhance: Remove constraint on release segment task (#38297)
issue: #38305
After we prevent balance segment and balance channel from happening at
the same time, the constraint that requires release segment to happen on
a serviceable shard leader is unnecessary.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-12-10 11:18:49 +08:00