668 Commits

Author SHA1 Message Date
wei liu
b298218a29
enhance: [2.5] Remove balance constraints between channel and segment tasks (#42410)
issue: #42176
pr: #42177

Remove the mutual exclusion constraints between channel and segment
balance tasks to allow them to run concurrently.

Changes include:
- Remove permitBalanceChannel() and permitBalanceSegment() methods from
RoundRobinBalancer
- Update ChannelLevelScoreBalancer, MultiTargetBalancer,
RowCountBasedBalancer, and ScoreBasedBalancer to remove constraint
checks
- Allow segment balance tasks to proceed even when channel balance tasks
are running
- Update test cases to reflect new behavior where balance tasks no
longer block each other
- Improve error handling in task executor by preferring serviceable
shard leaders for segment release operations
- Add fallback logic to find latest shard leader when serviceable leader
is not available

This change improves the efficiency of load balancing by removing
unnecessary coordination overhead between different types of balance
operations.
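
The leader selection described in the last two bullets can be sketched as follows. This is a minimal illustration, not the actual task executor code: the `leaderView` type, its fields, and `pickLeader` are hypothetical names, and "latest" is assumed to mean the highest version.

```go
package main

import "fmt"

// leaderView is a hypothetical, simplified view of a shard leader.
type leaderView struct {
	NodeID      int64
	Version     int64
	Serviceable bool
}

// pickLeader prefers the newest serviceable leader. If no leader is
// serviceable, it falls back to the leader with the highest version.
// The second return value is false when views is empty.
func pickLeader(views []leaderView) (leaderView, bool) {
	var best leaderView
	found := false
	for _, v := range views {
		if v.Serviceable && (!found || v.Version > best.Version) {
			best, found = v, true
		}
	}
	if found {
		return best, true
	}
	// Fallback: no serviceable leader, take the latest one (assumption:
	// "latest" means highest version).
	for _, v := range views {
		if !found || v.Version > best.Version {
			best, found = v, true
		}
	}
	return best, found
}

func main() {
	views := []leaderView{
		{NodeID: 1, Version: 3, Serviceable: false},
		{NodeID: 2, Version: 2, Serviceable: true},
	}
	leader, ok := pickLeader(views)
	fmt.Println(leader.NodeID, ok)
}
```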

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-06-03 10:16:32 +08:00
wei liu
d2ff390a52
fix: Segment may be released prematurely during balance channel (#42043)
issue: #41143
pr: #42090

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-05-29 18:36:35 +08:00
aoiasd
198ff1f150
enhance: [2.5] support run analyzer by loaded collection field (#42119)
relate: https://github.com/milvus-io/milvus/issues/42094
pr: https://github.com/milvus-io/milvus/pull/42113

Signed-off-by: aoiasd <zhicheng.yue@zilliz.com>
2025-05-29 10:26:30 +08:00
wei liu
4a05180f88
enhance: [2.5] support balancing multiple collections in single trigger (#41875) (#42134)
issue: #41874
pr: #41875
- Optimize balance_checker to support balancing multiple collections
simultaneously
- Add new parameters for segment and channel balancing batch sizes
- Add enableBalanceOnMultipleCollections parameter
- Update tests for balance checker

This change improves resource utilization by allowing the system to
balance multiple collections in a single trigger with configurable batch
sizes.
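
A rough sketch of how a batch size and a multi-collection switch might interact is shown below. The function `pickTasks` and its parameters are hypothetical and only illustrate the idea; they are not the balance_checker implementation, and the real parameter semantics may differ.

```go
package main

import "fmt"

// pickTasks collects balance tasks for one trigger.
// candidates[i] holds the tasks generated for the i-th collection.
// batchSize caps the total number of tasks picked in this trigger.
// When multiCollection is false, picking stops after the first
// collection that produced at least one task.
func pickTasks(candidates [][]string, batchSize int, multiCollection bool) []string {
	var picked []string
	for _, tasks := range candidates {
		for _, t := range tasks {
			if len(picked) >= batchSize {
				return picked
			}
			picked = append(picked, t)
		}
		if !multiCollection && len(picked) > 0 {
			return picked
		}
	}
	return picked
}

func main() {
	candidates := [][]string{{"a1", "a2"}, {"b1", "b2"}}
	fmt.Println(pickTasks(candidates, 3, true))
	fmt.Println(pickTasks(candidates, 3, false))
}
```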

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-05-28 23:18:30 +08:00
yihao.dai
7c8370ccd2
fix: [2.5] Fix ants.Pool goroutine leak (#41893)
1. Release the pool after it is no longer in use.
2. Upgrade ants.Pool to fix the goroutine leak issue (see
https://github.com/panjf2000/ants/pull/287).

issue: https://github.com/milvus-io/milvus/issues/41838

pr: https://github.com/milvus-io/milvus/pull/41892

---------

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2025-05-16 19:12:22 +08:00
SimFG
6e18ededab
fix: [2.5] mockery tool unavailable after upgrading golang version (#41522)
- issue: #41291
- pr: #41481

Signed-off-by: SimFG <bang.fu@zilliz.com>
2025-04-25 14:40:40 +08:00
SimFG
18eb627533
fix: [2.5] Update logging context and upgrade dependencies (#41319)
- issue: #41291
- pr: #41318

---------

Signed-off-by: SimFG <bang.fu@zilliz.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2025-04-24 23:50:40 +08:00
wei liu
2e8445c2ef
fix: balance checker may enter infinite normal balance loop after balance suspension (#41196)
issue: #41194 
pr: #41195
- Refactor hasUnbalancedCollection flag handling to function scope
- Ensure tracking sets are cleared when no balance is needed
- Add deferred cleanup for both normal/stopping balance paths
- Add unit tests for collection tracking scenarios

The changes ensure tracking sets (normalBalanceCollectionsCurrentRound
and stoppingBalanceCollectionsCurrentRound) are properly cleared when:
- All collections in current round are balanced
- Balance checks return early due to unready targets
- Balance feature flags are disabled

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-04-10 15:18:28 +08:00
liliu-z
cb0f984155
enhance: Revert "separate for index completed (#40873)" (#41152)
This reverts commit 23e579e3240a30397f05f5b308be687f6f16b013. #40873

issue: #39519

Signed-off-by: Li Liu <li.liu@zilliz.com>
2025-04-08 17:36:30 +08:00
Chun Han
23e579e324
separate for index completed (#40873)
related: https://github.com/milvus-io/milvus/issues/40781

Signed-off-by: MrPresent-Han <chun.han@gmail.com>
Co-authored-by: MrPresent-Han <chun.han@gmail.com>
2025-04-05 10:20:24 +08:00
wei liu
37a533fe6d
fix: [2.5] Address manual balance and balance check issues (#41038)
issue: #37651
pr: #41037
- Fix context propagation for manual balance segment task creation from
PR #38080.
- Optimize stopping balance by preventing redundant checks per round,
addressing performance regression from PR #40297.
- Decrease default `checkBalanceInterval` from 3000ms to 300ms.
- Correct minor log messages in `BalanceChecker`.

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-04-03 01:26:23 +08:00
Xianhui Lin
249d5b9b41
fix: jsonstats check if cache schema is nil lazy describecollection (#41068)
fix: jsonstats check if cache schema is nil lazy describecollection
pr:https://github.com/milvus-io/milvus/pull/38039
issue:https://github.com/milvus-io/milvus/issues/36995

---------

Signed-off-by: Xianhui.Lin <xianhui.lin@zilliz.com>
2025-04-03 00:32:21 +08:00
wei liu
d185a8f941
enhance: Balance the collection with the largest row count first (#40958)
issue: #37651
pr: #40297
This PR balances the collection with the largest row count first.
Otherwise, when new nodes join, data from small collections may be
migrated to them temporarily, only to be moved out again once the large
collection is balanced, which causes unnecessary load.
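
The ordering can be illustrated with a short sketch. The `collection` type and `orderForBalance` are hypothetical names used only for illustration; the actual checker may obtain row counts differently.

```go
package main

import (
	"fmt"
	"sort"
)

// collection is a hypothetical, simplified collection descriptor.
type collection struct {
	ID       int64
	RowCount int64
}

// orderForBalance returns a copy of cols sorted by row count in
// descending order, so the largest collection is balanced first.
// The sort is stable, so collections with equal row counts keep
// their original relative order.
func orderForBalance(cols []collection) []collection {
	out := append([]collection(nil), cols...)
	sort.SliceStable(out, func(i, j int) bool {
		return out[i].RowCount > out[j].RowCount
	})
	return out
}

func main() {
	cols := []collection{{ID: 1, RowCount: 100}, {ID: 2, RowCount: 5000}, {ID: 3, RowCount: 700}}
	for _, c := range orderForBalance(cols) {
		fmt.Println(c.ID, c.RowCount)
	}
}
```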

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-03-31 16:14:21 +08:00
wei liu
b64bb63e77
enhance: [2.5] Add trigger interval config for auto balance (#39154) (#39918)
issue: #39156
pr: #39154

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-03-27 16:40:23 +08:00
Xianhui Lin
8bdff401a3
fix: fix indexchecker schema released (#40809)
pr:https://github.com/milvus-io/milvus/pull/38039
issue:https://github.com/milvus-io/milvus/issues/36995

Signed-off-by: Xianhui.Lin <xianhui.lin@zilliz.com>
2025-03-20 18:05:22 +08:00
Xianhui Lin
705b3c90a5
fix: Rolling upgrade from v2.5.6 to a new 2.5 version fails when JsonKeyStats is enabled (#40661)
fix: Rolling upgrade from v2.5.6 to a new 2.5 version fails when
JsonKeyStats is enabled. The reason is that the file path of the
jsonkeyindex has changed.
issue: https://github.com/milvus-io/milvus/issues/40649
https://github.com/milvus-io/milvus/issues/40669
https://github.com/milvus-io/milvus/issues/40707
master-pr: https://github.com/milvus-io/milvus/pull/38039

---------

Signed-off-by: Xianhui.Lin <xianhui.lin@zilliz.com>
2025-03-18 17:32:16 +08:00
Xianhui Lin
f5e9dea2aa
fix: [2.5]fix the garbage cleanup logic of jsonkey stats && improve json key stats filer (#40039)
fix: fix the garbage collection cleanup logic of jsonkey stats &&
improve json key stats filer
issue: https://github.com/milvus-io/milvus/issues/36995
https://github.com/milvus-io/milvus/issues/40034
https://github.com/milvus-io/milvus/issues/40041
https://github.com/milvus-io/milvus/issues/40106
https://github.com/milvus-io/milvus/issues/40138
pr: https://github.com/milvus-io/milvus/pull/38039

---------

Signed-off-by: Xianhui.Lin <xianhui.lin@zilliz.com>
2025-03-13 20:18:10 +08:00
Bingyi Sun
683b26ffb7
feat: cherry pick json path index (#40313)
issue: #35528 
pr: #36750 
this pr includes json path index pr and some related prs:
1. update tantivy version #39253 
2. json path index #36750 
3. fall back to brute force #40076 
4. term filter #40140 
5. bug fix #40336

---------

Signed-off-by: sunby <sunbingyi1992@gmail.com>
2025-03-10 22:14:05 +08:00
yihao.dai
893caee467
fix: [2.5] Fix task delta cache data race (#40262)
issue: https://github.com/milvus-io/milvus/issues/40258

pr: https://github.com/milvus-io/milvus/pull/40259

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2025-03-02 16:52:10 +08:00
wei liu
82c000a4b2
fix: task delta cache leak due to duplicate task id (#40184)
issue: #40052
pr: #40183

The task delta cache relies on task IDs being unique: it calls
incDeltaCache in AddTask and decDeltaCache in RemoveTask. However, the
task ID allocator is not atomic, so two tasks may get the same task ID.
In that case incDeltaCache is called twice but decDeltaCache only once,
which leaks the delta cache.
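
As an illustration of the uniqueness requirement, here is a minimal sketch of an atomic ID allocator. `idAllocator` is a hypothetical type, not the Milvus allocator; it only shows that an atomic increment never hands out the same ID twice under concurrency.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// idAllocator is a hypothetical stand-in for a task ID allocator.
// atomic.Int64.Add performs the increment and the read as one atomic
// step, so two goroutines can never receive the same ID.
type idAllocator struct {
	next atomic.Int64
}

func (a *idAllocator) Alloc() int64 {
	return a.next.Add(1)
}

// allocConcurrently allocates n IDs from n goroutines and returns the
// number of distinct IDs that were produced.
func allocConcurrently(n int) int {
	var (
		alloc idAllocator
		mu    sync.Mutex
		wg    sync.WaitGroup
		seen  = make(map[int64]struct{}, n)
	)
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			id := alloc.Alloc()
			mu.Lock()
			seen[id] = struct{}{}
			mu.Unlock()
		}()
	}
	wg.Wait()
	return len(seen)
}

func main() {
	// With an atomic allocator the distinct count equals the request count.
	fmt.Println(allocConcurrently(1000))
}
```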

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-02-28 10:22:08 +08:00
wei liu
14f05650e3
enhance: clean shard location cache after collection released (#40228)
issue: #40077
pr: #40088

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-02-27 19:42:05 +08:00
Xianhui Lin
a4eb2ce224
fix: [2.5]Revert qc statschecker for json key stats (#40125)
Revert qc statschecker for json key stats
issue:https://github.com/milvus-io/milvus/issues/36995
pr:https://github.com/milvus-io/milvus/pull/39876

Signed-off-by: Xianhui.Lin <xianhui.lin@zilliz.com>
2025-02-24 13:31:55 +08:00
congqixia
709594f158
enhance: [2.5] Use v2 package name for pkg module (#40117)
Cherry-pick from master
pr: #39990
Related to #39095

https://go.dev/doc/modules/version-numbers

Update pkg version according to golang dep version convention

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2025-02-23 00:46:01 +08:00
Xianhui Lin
c1de61ff7c
fix: [2.5]Replace the position of EnabledJSONKeyStats (#40108)
Replace the position of EnabledJSONKeyStats
issue: https://github.com/milvus-io/milvus/issues/36995
pr: https://github.com/milvus-io/milvus/pull/38039

---------

Signed-off-by: Xianhui.Lin <xianhui.lin@zilliz.com>
2025-02-22 14:35:54 +08:00
yihao.dai
b8a758b6c4
enhance: [2.5] Add get vector latency metric and refine request limit error message (#40085)
issue: https://github.com/milvus-io/milvus/issues/40078

pr: https://github.com/milvus-io/milvus/pull/40083

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2025-02-21 20:19:55 +08:00
wei liu
82fb0bf9c1
fix: [2.5] task delta cache leak on reduce task (#40056)
issue: #40052
pr: #40055

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-02-21 16:49:54 +08:00
wei liu
e42c944e04
fix: [2.5] querycoord panic in corner case (#40058)
issue: #40050 
pr: #40057

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-02-21 11:19:58 +08:00
wei liu
3c2d8c1419
enhance: [2.5] Add management api to check querycoord balance status (#37784) (#39909)
issue: #37783
pr: #37784

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-02-19 10:56:49 +08:00
wei liu
bf54f47c34
enhance: [2.5] use rated logger for high frequency log in dist handler (#39452) (#39928)
pr: #39452

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-02-18 14:32:52 +08:00
Xianhui Lin
f0964f769d
enhance: [2.5]Add json key inverted index in stats for optimization (#39876)
Add json key inverted index in stats for optimization
issue: https://github.com/milvus-io/milvus/issues/36995
pr: https://github.com/milvus-io/milvus/pull/38039

---------

Signed-off-by: Xianhui.Lin <xianhui.lin@zilliz.com>
Co-authored-by: luzhang <luzhang@zilliz.com>
2025-02-16 20:12:15 +08:00
congqixia
9407a3c9b1
fix: [2.5] Check collection released before target checks (#39843)
Cherry-pick from master
pr: #39841 
Related to #39840

The target could be updated asynchronously in the previous code. This PR
makes removing a collection from the target observer block until all
related tasks in the dispatchers are removed, preventing metrics from
being updated after the collection is released.

---------

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2025-02-13 20:00:15 +08:00
wei liu
82dc57ace0
fix: [skip e2e][2.5] pr conflict cause ut failed (#39810)
Related to https://github.com/milvus-io/milvus/pull/39701 &
https://github.com/milvus-io/milvus/issues/39681

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-02-12 11:44:51 +08:00
congqixia
4322a0d49a
fix: [2.5] Resolve conflict on qc task test (#39797)
Cherry-pick from master
pr: #39796
Related to #39701 & #39681

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2025-02-11 18:52:45 +08:00
wei liu
11cba57dc7
fix: [2.5] load collection gets stuck if compaction/gc happens (#39761)
issue: #39680
pr: #39701
If compaction/gc happens, load collection may get stuck due to
SegmentNotFound. We should trigger UpdateNextTarget to get a new data
view and execute the loading operation.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-02-11 15:48:50 +08:00
wei liu
969e34d540
fix: [2.5]uneven distribution caused by executing task delta cache leak (#39759)
issue: #39681
pr: #39702
This PR maintains the workload effect in the action instead of computing
it from the target, which may cause a leak if the target changes.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-02-11 14:32:46 +08:00
jaime
ddc5b299ad
enhance: expose more metrics data (#39466)
issue: #36621 #39417
pr: #39456
1. Adjust the server-side cache size.
2. Add source information for configurations.
3. Add node ID for compaction and indexing tasks.
4. Resolve localhost access issues to fix health check failures for
etcd.

Signed-off-by: jaime <yun.zhang@zilliz.com>
2025-02-07 11:48:45 +08:00
yihao.dai
4464966462
enhance: [2.5] Remove frequent observe log (#39414)
/kind improvement

pr: https://github.com/milvus-io/milvus/pull/39413

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2025-01-20 11:01:10 +08:00
yihao.dai
89a183c7c2
enhance: [2.5] enable task delta cache (#39349)
When there are many segment tasks in the querycoord scheduler, the
traversal in GetSegmentTaskDelta checks becomes time-consuming. This PR
adds caching for segment deltas.
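
The caching idea can be sketched as follows. This is an assumption-laden illustration, not the querycoord implementation: `deltaCache`, `deltaKey`, and the choice of (collection, node) as the cache key are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// deltaKey identifies one cached delta (hypothetical key layout).
type deltaKey struct {
	CollectionID int64
	NodeID       int64
}

// deltaCache keeps a running sum of task deltas so that a lookup is a
// single map access instead of a traversal over all scheduled tasks.
type deltaCache struct {
	mu     sync.Mutex
	deltas map[deltaKey]int
}

func newDeltaCache() *deltaCache {
	return &deltaCache{deltas: make(map[deltaKey]int)}
}

// Add applies delta when a task is added; pass the negated value when
// the task is removed. Entries that return to zero are deleted.
func (c *deltaCache) Add(collectionID, nodeID int64, delta int) {
	c.mu.Lock()
	defer c.mu.Unlock()
	k := deltaKey{collectionID, nodeID}
	c.deltas[k] += delta
	if c.deltas[k] == 0 {
		delete(c.deltas, k)
	}
}

// Get returns the cached delta, or 0 if nothing is cached.
func (c *deltaCache) Get(collectionID, nodeID int64) int {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.deltas[deltaKey{collectionID, nodeID}]
}

func main() {
	c := newDeltaCache()
	c.Add(1, 100, 2)
	c.Add(1, 100, 3)
	fmt.Println(c.Get(1, 100))
	c.Add(1, 100, -5)
	fmt.Println(c.Get(1, 100))
}
```

Every increment must be paired with exactly one decrement, otherwise the cached value drifts away from the real task set.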

issue: https://github.com/milvus-io/milvus/issues/37630

pr: https://github.com/milvus-io/milvus/pull/39307

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
Co-authored-by: Wei Liu <wei.liu@zilliz.com>
2025-01-17 12:01:03 +08:00
yihao.dai
6773fb10a8
enhance: [2.5] Read metadata concurrently to accelerate recovery (#38900)
Read metadata such as segments, binlogs, and partitions concurrently at
the collection level.

issue: https://github.com/milvus-io/milvus/issues/37630

pr: https://github.com/milvus-io/milvus/pull/38403

---------

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2025-01-16 17:53:01 +08:00
yihao.dai
9d2a0e775c
fix: [2.5] Fix slow dist handle and slow observe (#38905)
1. Provide partition&channel level indexing in the collection target.
2. Make SegmentAction not wait for distribution.
3. Remove scheduler and target manager mutex
4. Optimize logging to reduce CPU overhead.

issue: https://github.com/milvus-io/milvus/issues/37630

pr: https://github.com/milvus-io/milvus/pull/38566

---------

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2025-01-16 17:07:02 +08:00
yihao.dai
c741b8be2b
fix: [2.5] Remove frequently updating metric to avoid mutex contention (#38778)
issue: https://github.com/milvus-io/milvus/issues/37630

Reduce the frequency of `updateIndexTasksMetrics` to avoid holding the
mutex for long periods.

pr: https://github.com/milvus-io/milvus/pull/38775

---------

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2025-01-16 11:51:02 +08:00
wei liu
76ed552b00
enhance: Add logs for check health failed (#39208) (#39302)
pr: #39208

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-01-16 10:31:04 +08:00
wei liu
51994158d9
fix: channel unbalance during stopping balance progress (#38971) (#39200)
issue: #38970
pr: #38971
The stopping balance for channels still uses the row_count_based policy,
which may cause channel imbalance in the multi-collection case.

This PR implements a score-based stopping balance channel policy.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-01-14 18:25:00 +08:00
wei liu
4fd56e4773
fix: Prevent leader checker from generating excessive duplicate leader tasks (#39000) (#39160)
issue: #39001
pr: #39000
Background:
Segment Load Version: Each segment load request assigns a timestamp as
its version. When multiple copies of a segment are loaded on different
QueryNodes, the leader checker uses this version to identify the latest
copy and updates the routing table in the leader view to point to it.
Delegator Router Version: When a delegator builds a route to a QueryNode
that has loaded a segment, it also records the segment's version.

Router Table Update Logic: If the leader checker detects that the
version of a segment in the routing table does not match the version in
the worker, it updates the routing table to point to the QueryNode with
the latest version. Additionally, it updates the segment's load version
in the QueryNode during this process.

Issue:
When a channel is undergoing load balancing, the leader checker may sync
the routing table to a new delegator. This sync operation modifies the
segment's load version, which invalidates the routing in the old
delegator. Subsequently, the leader checker updates the routing table in
the old delegator, breaking the routing in the new delegator. This cycle
continues, causing repeated updates and inconsistencies.

Fix:
This PR introduces two changes to address the issue:
1. Use NodeID to verify whether the delegator's routing table needs an
update, avoiding unnecessary modifications.
2. Ensure compatibility by using the latest segment's load version as
the version recorded in the routing table.

These changes resolve the cyclic updates and prevent the leader checker
from generating excessive duplicate tasks, ensuring routing stability
across delegators during load balancing.
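
The NodeID-based check from the fix can be sketched as follows. The types `routeEntry` and `segmentCopy` and the function names are hypothetical; this only illustrates why comparing node IDs avoids the update cycle that version comparison caused.

```go
package main

import "fmt"

// routeEntry is what a delegator records for one segment (hypothetical).
type routeEntry struct {
	NodeID  int64
	Version int64
}

// segmentCopy is one loaded copy of a segment on a QueryNode (hypothetical).
type segmentCopy struct {
	NodeID  int64
	Version int64
}

// latestCopy returns the copy with the highest load version.
func latestCopy(copies []segmentCopy) (segmentCopy, bool) {
	if len(copies) == 0 {
		return segmentCopy{}, false
	}
	best := copies[0]
	for _, c := range copies[1:] {
		if c.Version > best.Version {
			best = c
		}
	}
	return best, true
}

// needsUpdate compares node IDs only. A version mismatch on the same
// node no longer triggers a routing update, so syncing one delegator
// cannot invalidate the route held by another.
func needsUpdate(route routeEntry, copies []segmentCopy) bool {
	latest, ok := latestCopy(copies)
	if !ok {
		return false
	}
	return route.NodeID != latest.NodeID
}

func main() {
	copies := []segmentCopy{{NodeID: 1, Version: 9}}
	fmt.Println(needsUpdate(routeEntry{NodeID: 1, Version: 5}, copies)) // same node
	fmt.Println(needsUpdate(routeEntry{NodeID: 2, Version: 9}, copies)) // different node
}
```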

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-01-14 18:11:06 +08:00
Zhen Ye
adfc3f945e
enhance: record memory size (uncompressed) item for index (#38844)
issue: #38715 
pr: #38770

- Milvus currently uses the serialized (compressed) index size to
estimate the resources needed for loading.
- Add a new field MemSize (size before compression) to the index for
resource estimation.

---------

Signed-off-by: chyezh <chyezh@outlook.com>
2025-01-14 10:33:06 +08:00
jaime
b0afe32c98
fix: unstable ut in leader_view_manager.go file (#39162)
issue: #38672
pr: #39161

Signed-off-by: jaime <yun.zhang@zilliz.com>
2025-01-10 19:54:57 +08:00
Zhen Ye
95809ca767
enhance: make new go package to manage proto (#39128)
issue: #39095
pr: #39114

---------

Signed-off-by: chyezh <chyezh@outlook.com>
2025-01-10 10:53:01 +08:00
jaime
0693634f62
enhance: add db name in replica description (#38673)
issue: #36621
pr: #38672

Signed-off-by: jaime <yun.zhang@zilliz.com>
2025-01-09 19:43:04 +08:00
wei liu
35cef0567c
enhance: Add log for case which target not update as expected (#38944) (#39046)
pr: #38944

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-01-08 19:32:57 +08:00
Xiaofan
a2c4cd59ce
fix: drop partition can not be successful if load failed[2.5] (#38874)
fix https://github.com/milvus-io/milvus/issues/38649
pr: #38793
When a partition load fails, dropping the partition also fails due to
the wrong error message.

Signed-off-by: xiaofanluan <xiaofan.luan@zilliz.com>
2025-01-02 09:56:53 +08:00