milvus

mirror of https://gitee.com/milvus-io/milvus.git synced 2026-02-02 01:06:41 +08:00

Author	SHA1	Message	Date
congqixia	e70e70699c	enhance: [2.5] skip adding stopping node to resource group in handleNodeUp (#45969) (#45982) Cherry-pick from master pr: #45969. Related to #45960. Follow-up to #45961. After #45961 ensured that handleNodeUp is always called for nodes discovered during rewatchNodes (including stopping nodes), this change adds a safeguard in ResourceManager.handleNodeUp to skip adding stopping nodes to resource groups. 1. resource_manager.go: Add check for IsStoppingState() in handleNodeUp to prevent stopping nodes from being added to the incomingNode set and assigned to resource groups. 2. server.go: Delete processed nodes from sessionMap to avoid duplicate processing in the subsequent loop; add warning logs for stopping state transitions during rewatch. Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>	2025-12-02 10:23:13 +08:00
congqixia	a24a0f11aa	fix: [2.5] always call handleNodeUp in rewatchNodes for proper stopping balance (#45964) Cherry-pick from master pr: #45961. Related to #45960. When QueryCoord restarts or reconnects to etcd, the rewatchNodes function previously skipped handleNodeUp for QueryNodes in stopping state. This caused stopping balance to fail because necessary components were not initialized: the task scheduler executor was not added, the dist handler was not started, and the node was not registered in the resource manager. This fix ensures handleNodeUp is always called for new nodes regardless of their stopping state, followed by handleNodeStopping if the node is stopping. This allows the graceful shutdown process to correctly migrate segments and channels away from stopping nodes. Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>	2025-12-01 11:11:10 +08:00
7y-9	a42e847678	fix: [2.5] Fix infinite loop in ResourceManager recovery process (#45563) relate: https://github.com/milvus-io/milvus/issues/45557 Signed-off-by: lianyu.sun <lianyu.sun@ly.com>	2025-11-17 15:19:39 +08:00
congqixia	9ed77d4484	fix: [2.5] prevent data race in querycoord collection notifier update (#45037) (#45052) Cherry-pick from master pr: #45037. Fixes #45035. This commit addresses a data race issue where refreshCollection was updating the collection notifier without proper lock protection. Changes: add UpdateCollection method to CollectionManager with proper locking; introduce CollectionOperator pattern for thread-safe collection updates; make setRefreshNotifier private and use it through the operator pattern; update refreshCollection to use the new UpdateCollection method; handle collection not found error gracefully in refreshCollection. The CollectionOperator pattern ensures all collection modifications go through the CollectionManager's lock, preventing concurrent access issues. Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>	2025-10-23 19:34:12 +08:00
wei liu	82081eba1b	fix: [2.5] Fix deactivate balance checker also stops stopping balance (#44835) issue: #43858 pr: #44834 Fix the issue introduced in PR #43992 where deactivating the balance checker incorrectly stops stopping balance operations. Changes: move IsActive() check after stopping balance logic; only skip normal balance when checker is inactive; allow stopping balance to proceed regardless of checker state. This ensures stopping balance can execute even when the balance checker is deactivated. Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2025-10-15 15:56:01 +08:00
wei liu	47949fd883	enhance: Implement rewatch mechanism for etcd failure scenarios (#43829) (#43920) issue: #43828 pr: #43829 #43909 Implement robust rewatch mechanism to handle etcd connection failures and node reconnection scenarios in DataCoord and QueryCoord, along with heartbeat lag monitoring capabilities. Changes include: implement rewatchDataNodes/rewatchQueryNodes callbacks for etcd reconnection scenarios; add idempotent rewatchNodes method to handle etcd session recovery gracefully; add QueryCoordLastHeartbeatTimeStamp metric for monitoring node heartbeat lag; clean up heartbeat metrics when nodes go down to prevent metric leaks. Signed-off-by: Wei Liu <wei.liu@zilliz.com> Co-authored-by: Zhen Ye <chyezh@outlook.com>	2025-10-15 14:12:01 +08:00
wei liu	cbe2761e99	fix: Fix L0 segment duplicate load task generation during channel balance (#44700) issue: #44699 Fix the issue where L0 segment checking logic incorrectly identifies L0 segments as missing when they exist on multiple delegators during channel balance process, which blocks sealed segment loading and target progression. Changes include: replace GetLatestShardLeaderByFilter with GetByFilter to check all delegators instead of only the latest leader; iterate through all delegator views to identify which ones lack the L0 segment. The original logic only checked the latest shard leader, causing false positive detection of missing L0 segments when they actually exist on other delegators in the same channel during balance operations. This led to continuous generation of duplicate L0 segment load tasks, preventing normal sealed segment loading flow. Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2025-10-11 10:04:00 +08:00
wei liu	892d63d26e	enhance: [2.5] Refactor balance checker with priority queue (#43992) (#44588) issue: #43858 pr: #43992 Refactor the balance checker implementation to use priority queues for managing collection balance operations, improving processing efficiency and order control. Changes include: export priority queue interfaces (Item, BaseItem, PriorityQueue); replace collection round-robin with priority-based queue system; add BalanceCheckCollectionMaxCount configuration parameter; optimize balance task generation with batch processing limits; refactor processBalanceQueue method for different strategies; enhance test coverage with comprehensive unit tests. The new priority queue system processes collections based on row count or collection ID order, providing better control over balance operation priorities and resource utilization. Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2025-09-28 19:23:05 +08:00
wei liu	39754af727	fix: Fix L0 segment loading delegator selection in QueryCoord (#43795) issue: #43794 Fix the issue where L0 segments were not correctly selecting appropriate delegators during loading, which could cause load failures or incorrect delegator assignments. Changes include: add special handling for L0 segments in delegator selection logic; find delegators that are missing the L0 segment for direct loading; fall back to existing serviceable delegator selection when no suitable delegator is found for L0 segments; add comprehensive test coverage for L0 segment loading scenarios (delegator selection when some delegators are missing segments, fallback behavior when all delegators already have the segment, and error handling when no delegators are available). Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2025-08-19 16:35:47 +08:00
wei liu	80d1ef74ce	fix: apply load config changes failed after restart (#43555) issue: #43107 pr: #43554 Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2025-08-01 20:17:37 +08:00
wei liu	75463725b3	fix: skip loading non-existent L0 segments to prevent load blocking (#43576) issue: #43557 In 2.5 branch, L0 segments must be loaded before other segments. If an L0 segment has been garbage collected but is still in the target list, the load operation would keep failing, preventing other segments from being loaded. This patch adds a segment existence check for L0 segments in getSealedSegmentDiff. Only L0 segments that actually exist will be included in the load list. Changes: add checkSegmentExist function parameter to SegmentChecker constructor; filter L0 segments by existence check in getSealedSegmentDiff; add unit tests using mockey to verify the fix behavior. Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2025-07-31 14:33:38 +08:00
wei liu	4631657304	fix: Unstable integration case TestBalanceOnSingleReplica (#43552) issue: #42930 Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2025-07-25 10:52:55 +08:00
wei liu	ad0bf9cad8	enhance: Optimize channel node balancing for uneven QN distribution (#42786) (#43423) issue: #42860 pr: #42786 Fix channel node allocation when QueryNode count is not a multiple of channel count. The previous algorithm used simple division which caused uneven distribution with remainders. Key improvements: implement smart remainder distribution algorithm; refactor large function into focused helper functions; support two-phase rebalancing (release then allocate); handle edge cases like insufficient nodes gracefully. Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2025-07-21 17:04:54 +08:00
wei liu	b08d9efe69	fix: Prevent delegator unserviceable due to shard leader change (#42689) (#43309) issue: #42098 #42404 pr: #42689 Fix critical issue where concurrent balance segment and balance channel operations cause delegator view inconsistency. When shard leader switches between load and release phases of segment balance, it results in loading segments on old delegator but releasing on new delegator, making the new delegator unserviceable. The root cause is that balance segment modifies delegator views, and if these modifications happen on different delegators due to leader change, it corrupts the delegator state and affects query availability. Changes include: add shardLeaderID field to SegmentTask to track delegator for load; record shard leader ID during segment loading in move operations; skip release if shard leader changed from the one used for loading; add comprehensive unit tests for leader change scenarios. This ensures balance segment operations are atomic on single delegator, preventing view corruption and maintaining delegator serviceability. Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2025-07-15 17:46:51 +08:00
wei liu	4952b8c416	enhance: apply load config changes after QueryCoord restart (#43108) (#43236) issue: #43107 pr: #43108 Changes: add checkLoadConfigChanges() to apply load config during startup; call config check in startQueryCoord() after restart; skip auto-updates for collections with user-specified replica numbers; add is_user_specified_replica_mode field to preserve user settings; add comprehensive unit tests with mockey. Ensures existing collections use latest cluster-level config after restart. Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2025-07-14 10:22:50 +08:00
congqixia	2531ebda27	fix: [2.5] Check field mmap property before apply collection level one (#43091) Cherry-pick from master pr: #43090. Related to #43089. Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>	2025-07-03 14:32:45 +08:00
congqixia	3d58b2ecee	fix: [2.5] Make controller wait checker worker quit (#42704) (#42726) Cherry-pick from master pr: #42704. Related to #42702. This patch adds wait logic for `CheckerController`. Nil check already exists due to code branching. Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>	2025-06-16 15:14:38 +08:00
Zhen Ye	edca441eae	fix: filter the streaming query node from resource group when upgrading (#42594) issue: #42492 pr: #38677 Filter out the streaming query node from 2.6.0 to avoid loading sealed segments on the streaming query node. Signed-off-by: chyezh <chyezh@outlook.com>	2025-06-09 22:10:35 +08:00
wei liu	f06de7eca6	fix: Fix delegator selection logic in releaseSegment (#42572) issue: #42568 Fix incorrect delegator selection during the segment release process, which was introduced by pr #42410. Changes: add serviceable filter to prioritize available shard leaders; fix fallback logic with channel-specific lookup; add early return when no leader found; add comprehensive unit tests for all scenarios. Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2025-06-06 19:24:33 +08:00
Xianhui Lin	a1927e22a5	fix: add ShowLoadCollections and ShowLoadPartitions for compatible mixcoord (#42514) issue: https://github.com/milvus-io/milvus/issues/42492 Signed-off-by: Xianhui.Lin <xianhui.lin@zilliz.com>	2025-06-05 15:46:33 +08:00
wei liu	b298218a29	enhance: [2.5] Remove balance constraints between channel and segment tasks (#42410) issue: #42176 pr: #42177 Remove the mutual exclusion constraints between channel and segment balance tasks to allow them to run concurrently. Changes include: remove permitBalanceChannel() and permitBalanceSegment() methods from RoundRobinBalancer; update ChannelLevelScoreBalancer, MultiTargetBalancer, RowCountBasedBalancer, and ScoreBasedBalancer to remove constraint checks; allow segment balance tasks to proceed even when channel balance tasks are running; update test cases to reflect new behavior where balance tasks no longer block each other; improve error handling in task executor by preferring serviceable shard leaders for segment release operations; add fallback logic to find latest shard leader when serviceable leader is not available. This change improves the efficiency of load balancing by removing unnecessary coordination overhead between different types of balance operations. Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2025-06-03 10:16:32 +08:00
wei liu	d2ff390a52	fix: Segment may be released prematurely during balance channel (#42043) issue: #41143 pr: #42090 Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2025-05-29 18:36:35 +08:00
aoiasd	198ff1f150	enhance: [2.5] support run analyzer by loaded collection field (#42119) relate: https://github.com/milvus-io/milvus/issues/42094 pr: https://github.com/milvus-io/milvus/pull/42113 Signed-off-by: aoiasd <zhicheng.yue@zilliz.com>	2025-05-29 10:26:30 +08:00
wei liu	4a05180f88	enhance: [2.5] support balancing multiple collections in single trigger (#41875) (#42134) issue: #41874 pr: #41875 Changes: optimize balance_checker to support balancing multiple collections simultaneously; add new parameters for segment and channel balancing batch sizes; add enableBalanceOnMultipleCollections parameter; update tests for balance checker. This change improves resource utilization by allowing the system to balance multiple collections in a single trigger with configurable batch sizes. Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2025-05-28 23:18:30 +08:00
yihao.dai	7c8370ccd2	fix: [2.5] Fix ants.Pool goroutine leak (#41893) 1. Release the pool after it is no longer in use. 2. Upgrade ants.Pool to fix the goroutine leak issue (see https://github.com/panjf2000/ants/pull/287). issue: https://github.com/milvus-io/milvus/issues/41838 pr: https://github.com/milvus-io/milvus/pull/41892 Signed-off-by: bigsheeper <yihao.dai@zilliz.com>	2025-05-16 19:12:22 +08:00
SimFG	6e18ededab	fix: [2.5] mockery tool unavailable after upgrade golang version (#41522) issue: #41291 pr: #41481 Signed-off-by: SimFG <bang.fu@zilliz.com>	2025-04-25 14:40:40 +08:00
SimFG	18eb627533	fix: [2.5] Update logging context and upgrade dependencies (#41319) issue: #41291 pr: #41318 Signed-off-by: SimFG <bang.fu@zilliz.com> Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>	2025-04-24 23:50:40 +08:00
wei liu	2e8445c2ef	fix: balance checker may enter infinite normal balance loop after balance suspension (#41196) issue: #41194 pr: #41195 Changes: refactor hasUnbalancedCollection flag handling to function scope; ensure tracking sets are cleared when no balance is needed; add deferred cleanup for both normal/stopping balance paths; add unit tests for collection tracking scenarios. The changes ensure tracking sets (normalBalanceCollectionsCurrentRound and stoppingBalanceCollectionsCurrentRound) are properly cleared when all collections in the current round are balanced, when balance checks return early due to unready targets, or when balance feature flags are disabled. Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2025-04-10 15:18:28 +08:00
liliu-z	cb0f984155	enhance: Revert "separate for index completed (#40873)" (#41152) This reverts commit 23e579e3240a30397f05f5b308be687f6f16b013. #40873 issue: #39519 Signed-off-by: Li Liu <li.liu@zilliz.com>	2025-04-08 17:36:30 +08:00
Chun Han	23e579e324	separate for index completed (#40873) related: https://github.com/milvus-io/milvus/issues/40781 Signed-off-by: MrPresent-Han <chun.han@gmail.com> Co-authored-by: MrPresent-Han <chun.han@gmail.com>	2025-04-05 10:20:24 +08:00
wei liu	37a533fe6d	fix: [2.5] Address manual balance and balance check issues (#41038) issue: #37651 pr: #41037 Fix context propagation for manual balance segment task creation from PR #38080. Optimize stopping balance by preventing redundant checks per round, addressing performance regression from PR #40297. Decrease default `checkBalanceInterval` from 3000ms to 300ms. Correct minor log messages in `BalanceChecker`. Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2025-04-03 01:26:23 +08:00
Xianhui Lin	249d5b9b41	fix: jsonstats check if cache schema is nil lazy describecollection (#41068) pr: https://github.com/milvus-io/milvus/pull/38039 issue: https://github.com/milvus-io/milvus/issues/36995 Signed-off-by: Xianhui.Lin <xianhui.lin@zilliz.com>	2025-04-03 00:32:21 +08:00
wei liu	d185a8f941	enhance: Balance the collection with the largest row count first (#40958) issue: #37651 pr: #40297 This PR balances the collection with the largest row count first. This avoids temporarily migrating small-table data to new nodes during their onboarding, only for it to be moved out again after the large table is balanced, which would cause unnecessary load. Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2025-03-31 16:14:21 +08:00
wei liu	b64bb63e77	enhance: [2.5] Add trigger interval config for auto balance (#39154) (#39918) issue: #39156 pr: #39154 Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2025-03-27 16:40:23 +08:00
Xianhui Lin	8bdff401a3	fix: fix indexchecker schema released (#40809) pr: https://github.com/milvus-io/milvus/pull/38039 issue: https://github.com/milvus-io/milvus/issues/36995 Signed-off-by: Xianhui.Lin <xianhui.lin@zilliz.com>	2025-03-20 18:05:22 +08:00
Xianhui Lin	705b3c90a5	fix: Failed to rolling upgrade from v2.5.6 to new 2.5 version when enable JsonKeyStats (#40661) The reason is that the file path of the jsonkeyindex has changed. issue: https://github.com/milvus-io/milvus/issues/40649, https://github.com/milvus-io/milvus/issues/40669, https://github.com/milvus-io/milvus/issues/40707 master-pr: https://github.com/milvus-io/milvus/pull/38039 Signed-off-by: Xianhui.Lin <xianhui.lin@zilliz.com>	2025-03-18 17:32:16 +08:00
Xianhui Lin	f5e9dea2aa	fix: [2.5] fix the garbage cleanup logic of jsonkey stats && improve json key stats filer (#40039) fix: fix the garbage collection cleanup logic of jsonkey stats && improve json key stats filer issue: https://github.com/milvus-io/milvus/issues/36995 https://github.com/milvus-io/milvus/issues/40034 https://github.com/milvus-io/milvus/issues/40041 https://github.com/milvus-io/milvus/issues/40106 https://github.com/milvus-io/milvus/issues/40138 pr: https://github.com/milvus-io/milvus/pull/38039 Signed-off-by: Xianhui.Lin <xianhui.lin@zilliz.com>	2025-03-13 20:18:10 +08:00
Bingyi Sun	683b26ffb7	feat: cherry pick json path index (#40313) issue: #35528 pr: #36750 This pr includes the json path index pr and some related prs: 1. update tantivy version #39253; 2. json path index #36750; 3. fall back to brute force #40076; 4. term filter #40140; 5. bug fix #40336. Signed-off-by: sunby <sunbingyi1992@gmail.com>	2025-03-10 22:14:05 +08:00
yihao.dai	893caee467	fix: [2.5] Fix task delta cache data race (#40262) issue: https://github.com/milvus-io/milvus/issues/40258 pr: https://github.com/milvus-io/milvus/pull/40259 Signed-off-by: bigsheeper <yihao.dai@zilliz.com>	2025-03-02 16:52:10 +08:00
wei liu	82c000a4b2	fix: task delta cache leak due to duplicate task id (#40184) issue: #40052 pr: #40183 The task delta cache relies on the taskID being unique: it calls incDeltaCache at AddTask and decDeltaCache at RemoveTask. The taskID allocator is not atomic, however, so two tasks can end up with the same taskID. In that case incDeltaCache is called twice but decDeltaCache only once, which causes a delta cache leak. Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2025-02-28 10:22:08 +08:00
wei liu	14f05650e3	enhance: clean shard location cache after collection released (#40228) issue: #40077 pr: #40088 Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2025-02-27 19:42:05 +08:00
Xianhui Lin	a4eb2ce224	fix: [2.5] Revert qc statschecker for json key stats (#40125) issue: https://github.com/milvus-io/milvus/issues/36995 pr: https://github.com/milvus-io/milvus/pull/39876 Signed-off-by: Xianhui.Lin <xianhui.lin@zilliz.com>	2025-02-24 13:31:55 +08:00
congqixia	709594f158	enhance: [2.5] Use v2 package name for pkg module (#40117) Cherry-pick from master pr: #39990. Related to #39095. Update pkg version according to the golang dep version convention (https://go.dev/doc/modules/version-numbers). Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>	2025-02-23 00:46:01 +08:00
Xianhui Lin	c1de61ff7c	fix: [2.5] Replace the position of EnabledJSONKeyStats (#40108) issue: https://github.com/milvus-io/milvus/issues/36995 pr: https://github.com/milvus-io/milvus/pull/38039 Signed-off-by: Xianhui.Lin <xianhui.lin@zilliz.com>	2025-02-22 14:35:54 +08:00
yihao.dai	b8a758b6c4	enhance: [2.5] Add get vector latency metric and refine request limit error message (#40085) issue: https://github.com/milvus-io/milvus/issues/40078 pr: https://github.com/milvus-io/milvus/pull/40083 Signed-off-by: bigsheeper <yihao.dai@zilliz.com>	2025-02-21 20:19:55 +08:00
wei liu	82fb0bf9c1	fix: [2.5] task delta cache leak on reduce task (#40056) issue: #40052 pr: #40055 Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2025-02-21 16:49:54 +08:00
wei liu	e42c944e04	fix: [2.5] querycoord panic in corner case (#40058) issue: #40050 pr: #40057 Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2025-02-21 11:19:58 +08:00
wei liu	3c2d8c1419	enhance: [2.5] Add management api to check querycoord balance status (#37784) (#39909) issue: #37783 pr: #37784 Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2025-02-19 10:56:49 +08:00
wei liu	bf54f47c34	enhance: [2.5] use rated logger for high frequency log in dist handler (#39452) (#39928) pr: #39452 Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2025-02-18 14:32:52 +08:00
Xianhui Lin	f0964f769d	enhance: [2.5] Add json key inverted index in stats for optimization (#39876) issue: https://github.com/milvus-io/milvus/issues/36995 pr: https://github.com/milvus-io/milvus/pull/38039 Signed-off-by: Xianhui.Lin <xianhui.lin@zilliz.com> Co-authored-by: luzhang <luzhang@zilliz.com>	2025-02-16 20:12:15 +08:00
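
Note on commit ad0bf9cad8 (channel node balancing): the message says plain division left the remainder unevenly distributed when the QueryNode count is not a multiple of the channel count. The sketch below shows one common way to spread a remainder so that per-channel shares differ by at most one. It is an illustration only; `nodesPerChannel` is a hypothetical helper and is not taken from the Milvus source, whose actual algorithm may differ.

```go
package main

import "fmt"

// nodesPerChannel splits nodeCount query nodes across channelCount channels.
// Every channel gets the floor share, and the first nodeCount%channelCount
// channels get one extra node, so shares never differ by more than one.
// Hypothetical helper for illustration; not the Milvus implementation.
func nodesPerChannel(nodeCount, channelCount int) []int {
	if channelCount <= 0 || nodeCount < 0 {
		return nil // nothing sensible to allocate
	}
	base := nodeCount / channelCount
	remainder := nodeCount % channelCount
	shares := make([]int, channelCount)
	for i := range shares {
		shares[i] = base
		if i < remainder {
			shares[i]++ // hand out the remainder one node at a time
		}
	}
	return shares
}

func main() {
	fmt.Println(nodesPerChannel(7, 3)) // [3 2 2]
	fmt.Println(nodesPerChannel(2, 3)) // [1 1 0]: fewer nodes than channels
}
```

With 7 nodes and 3 channels, plain integer division gives each channel 2 nodes and leaves 1 unassigned; the sketch assigns that leftover node to the first channel instead.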

688 Commits