milvus

mirror of https://gitee.com/milvus-io/milvus.git synced 2025-12-06 17:18:35 +08:00

Author	SHA1	Message	Date
wei liu	a308331b81	fix: Set replica field in balance plans to prevent panic (#45722 ) issue: #45598 The MultiTargetBalancer was missing replica field assignment in the generated segment and channel plans, which caused panic during balance operations. This change ensures that all balance plans have the replica field properly set to fix the panic issue. Also refactored the balance test to extract common test logic into a reusable helper function and added a new integration test specifically for MultipleTargetBalancer policy. Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2025-12-04 10:19:11 +08:00
wei liu	e70c01362d	enhance: Add resource exhaustion querynode penalty policy (#45808) issue: #40513 For a querynode that returns a resource exhausted error, add a penalty duration to it and suspend loading new resources on it until the penalty duration expires. --------- Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2025-12-02 16:59:11 +08:00
wei liu	3bb3e8c09e	fix: Enable leader checker to sync segment distribution to RO nodes (#45949) issue: #45865 - Modified leader_checker.go to include all nodes (RO + RW) instead of only RW nodes, preventing channel balance from getting stuck on RO nodes - Added debug logging in segment_checker.go when no shard leader found - Enhanced target_observer.go with detailed logging for delegator check failures to improve debugging visibility - Fixed integration tests: - Temporarily disabled partial result counter assertion in partial_result_on_node_down_test.go pending concurrent issue fix - Increased transfer channel timeout from 10s to 20s in manual_rolling_upgrade_test.go to avoid flaky test caused by target update interval (10s) --------- Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2025-12-02 10:07:09 +08:00
Zhen Ye	2ef18c5b4f	enhance: remove watch at session liveness check (#45968 ) issue: #45724 --------- Signed-off-by: chyezh <chyezh@outlook.com>	2025-12-01 17:55:10 +08:00
congqixia	af734f19dc	enhance: skip adding stopping node to resource group in handleNodeUp (#45969 ) Related to #45960 Follow-up to #45961 After #45961 ensured that handleNodeUp is always called for nodes discovered during rewatchNodes (including stopping nodes), this change adds a safeguard in ResourceManager.handleNodeUp to skip adding stopping nodes to resource groups. 1. resource_manager.go: Add check for IsStoppingState() in handleNodeUp to prevent stopping nodes from being added to incomingNode set and assigned to resource groups. 2. server.go: - Delete processed nodes from sessionMap to avoid duplicate processing in the subsequent loop - Add warning logs for stopping state transitions during rewatch Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>	2025-12-01 16:17:10 +08:00
congqixia	5f5560d042	fix: always call handleNodeUp in rewatchNodes for proper stopping balance (#45961 ) Related to #45960 When QueryCoord restarts or reconnects to etcd, the rewatchNodes function previously skipped handleNodeUp for QueryNodes in stopping state. This caused stopping balance to fail because necessary components were not initialized: - Task scheduler executor was not added - Dist handler was not started - Node was not registered in resource manager This fix ensures handleNodeUp is always called for new nodes regardless of their stopping state, followed by handleNodeStopping if the node is stopping. This allows the graceful shutdown process to correctly migrate segments and channels away from stopping nodes. Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>	2025-12-01 11:09:08 +08:00
Zhen Ye	4f080bd3a0	fix: remove the streamingnode checking when loading segment (#45859) issue: #43117 If we enable this check when loading segments, all segments would always be loaded by the streamingnode rather than the 2.5 querynode, which makes some searches and queries fail when upgrading. Otherwise, some search and query results will be wrong when upgrading. We choose to disable this check for now to keep search and query available when upgrading. also see pr: #43346 Signed-off-by: chyezh <chyezh@outlook.com>	2025-11-28 10:09:08 +08:00
Zhen Ye	31976d8adb	fix: executor/scheduler should be latest replica meta but not replica copy (#45877 ) issue: #45865 --------- Signed-off-by: chyezh <chyezh@outlook.com>	2025-11-28 06:59:08 +08:00
cai.zhang	7c9a9c6f7e	fix: Reduce querycoord check node in replica interval for test (#45837 ) issue: #45791 Signed-off-by: Cai Zhang <cai.zhang@zilliz.com>	2025-11-27 15:07:07 +08:00
congqixia	f51fcc09ae	fix: resolve SessionWatcher goroutine leak and unstable UT in querycoordv2 (#45627 ) Related to #44620 Related to unstable ut "internal/querycoordv2 TestServer/TestNodeUp" Introduce SessionWatcher interface to fix race condition and goroutine leak that caused unstable unit test TestServer/TestNodeUp. Changes: - Add SessionWatcher interface with EventChannel() and Stop() methods - Refactor WatchServices() to return SessionWatcher instead of raw channel - Fix cleanup order in QueryCoordV2: stop watcher before session - Update DataCoord, ConnectionManager to use SessionWatcher - Add MockSessionWatcher for testing Fixes race condition between session context cancellation and internal loop exit. Eliminates goroutine leak by providing explicit lifecycle management. --------- Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>	2025-11-21 18:33:06 +08:00
aoiasd	947c8855f3	feat: support search bm25 with highlight (#44923 ) relate: https://github.com/milvus-io/milvus/issues/42589 --------- Signed-off-by: aoiasd <zhicheng.yue@zilliz.com>	2025-11-18 16:09:39 +08:00
Zhen Ye	b7fb8ed38c	fix: use the right resource key lock for ddl and use new ddl in transfer replica (#45506) issue: #45452 - alias/rename related DDL should use database level exclusive lock - alias cannot be used as the resource key of the lock; use the collection name instead - transfer replica should use WAL-based framework Signed-off-by: chyezh <chyezh@outlook.com>	2025-11-12 19:01:38 +08:00
yihao.dai	cabc47ce01	fix: Fix channel not available error and release collection blocking (#45428 ) 1. Ensure replica creation is idempotent. 2. Prevent currentTarget update when replica is missing. 3. Move the wait-for-release logic into the DDL framework's callback, and add a timeout to prevent it from blocking the DDL callback indefinitely. issue: https://github.com/milvus-io/milvus/issues/45301, https://github.com/milvus-io/milvus/issues/45274, https://github.com/milvus-io/milvus/issues/45295 --------- Signed-off-by: bigsheeper <yihao.dai@zilliz.com>	2025-11-12 18:55:37 +08:00
Zhen Ye	a2ce70d252	fix: ddl framework bug patch (#45290) issue: #45080, #45274, #45285 - LoadCollection doesn't ignore the ignorable request, for false field array. - CreateIndex doesn't ignore the ignorable request, for wrong index. - index meta is not thread safe. - lost parameter check of DDL. - DDL Ack scheduler may get stuck and DDL is blocked until the next incoming DDL. - lost parameter checker of ddl --------- Signed-off-by: chyezh <chyezh@outlook.com>	2025-11-04 22:25:33 +08:00
Spade A	c0029b788d	fix: alter collection failed with MMAP setting for STRUCT (#45173 ) issue: https://github.com/milvus-io/milvus/issues/45001 ref: https://github.com/milvus-io/milvus/issues/42148 --------- Signed-off-by: SpadeA <tangchenjie1210@gmail.com> Signed-off-by: aoiasd <zhicheng.yue@zilliz.com> Signed-off-by: SpadeA-Tang <tangchenjie1210@gmail.com> Co-authored-by: aoiasd <zhicheng.yue@zilliz.com>	2025-11-04 20:19:33 +08:00
Zhen Ye	966ebfbcab	fix: support upgrading from 2.6.x to 2.6.5 (#45264 ) issue: #43897 Signed-off-by: chyezh <chyezh@outlook.com>	2025-11-04 18:31:32 +08:00
Zhen Ye	00d8d2c33d	enhance: support load/release collection/partition with WAL-based DDL framework (#45154 ) issue: #43897 - Load/Release collection/partition is implemented by WAL-based DDL framework now. - Support AlterLoadConfig/DropLoadConfig in wal now. - Load/Release operation can be synced by new CDC now. - Refactor some UT for load/release DDL. --------- Signed-off-by: chyezh <chyezh@outlook.com>	2025-11-02 18:39:32 +08:00
Zhen Ye	309d564796	enhance: support collection and index with WAL-based DDL framework (#45033 ) issue: #43897 - Part of collection/index related DDL is implemented by WAL-based DDL framework now. - Support following message type in wal, CreateCollection, DropCollection, CreatePartition, DropPartition, CreateIndex, AlterIndex, DropIndex. - Part of collection/index related DDL can be synced by new CDC now. - Refactor some UT for collection/index DDL. - Add Tombstone scheduler to manage the tombstone GC for collection or partition meta. - Move the vchannel allocation into streaming pchannel manager. --------- Signed-off-by: chyezh <chyezh@outlook.com>	2025-10-30 14:24:08 +08:00
congqixia	569a5b40d2	enhance: [StorageV2] add manifest path support for FFI integration (#44991 ) Related to #44956 Add manifest_path field throughout the data path to support LOON Storage V2 manifest tracking. The manifest stores metadata for segment data files and enables the unified Storage V2 FFI interface. Changes include: - Add manifest_path field to SegmentInfo and SaveBinlogPathsRequest proto messages - Add UpdateManifest operator to datacoord meta operations - Update metacache, sync manager, and meta writer to propagate manifest paths - Include manifest_path in segment load info for query coordinator This is part of the Storage V2 FFI interface integration. Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>	2025-10-27 19:24:10 +08:00
Spade A	6494c75d31	fix: collection level MMAP does not take effect for STRUCT (#44996 ) issue: https://github.com/milvus-io/milvus/issues/42148 Signed-off-by: SpadeA <tangchenjie1210@gmail.com>	2025-10-23 19:52:05 +08:00
aoiasd	cfeb095ad7	enhance: forbid build analyzer at proxy (#44067) relate: https://github.com/milvus-io/milvus/issues/43687 We used to run the temporary analyzer and validate analyzer on the proxy, but the proxy should not be a computation-heavy node. This PR moves all analyzer calculations to the streaming node. --------- Signed-off-by: aoiasd <zhicheng.yue@zilliz.com>	2025-10-23 10:58:12 +08:00
congqixia	20dcb45b3d	fix: prevent data race in querycoord collection notifier update (#45037 ) Fixes #45035 This commit addresses a data race issue where refreshCollection was updating the collection notifier without proper lock protection. Changes: - Add UpdateCollection method to CollectionManager with proper locking - Introduce CollectionOperator pattern for thread-safe collection updates - Make setRefreshNotifier private and use it through the operator pattern - Update refreshCollection to use the new UpdateCollection method - Handle collection not found error gracefully in refreshCollection The CollectionOperator pattern ensures all collection modifications go through the CollectionManager's lock, preventing concurrent access issues. Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>	2025-10-23 10:28:04 +08:00
Zhen Ye	21076196bf	enhance: support resource group with WAL-based DDL framework (#44874 ) issue: #43897 - Resource group related DDL is implemented by WAL-based DDL framework now. - Support following message type in wal AlterResourceGroup, DropResourceGroup. - Resource group DDL can be synced by new CDC now. - Refactor some UT for resource group DDL. --------- Signed-off-by: chyezh <chyezh@outlook.com>	2025-10-21 09:58:03 +08:00
1mmortal	e18e7d3b32	fix: Pingpong load balancing issue when segment has only 1 row(#44840 ) (#44841 ) Use math.Ceil to calculate Priority uniformly issue: https://github.com/milvus-io/milvus/issues/44840 Signed-off-by: 1mmortal <lmzzzzz1@163.com>	2025-10-16 11:18:00 +08:00
wei liu	38833b0e1d	fix: Fix deactivate balance checker also stops stopping balance (#44834 ) issue: #43858 Fix the issue introduced in PR #43992 where deactivating the balance checker incorrectly stops stopping balance operations. Changes: - Move IsActive() check after stopping balance logic - Only skip normal balance when checker is inactive - Allow stopping balance to proceed regardless of checker state This ensures stopping balance can execute even when the balance checker is deactivated. --------- Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2025-10-15 15:50:04 +08:00
Zhen Ye	53e8f150e8	fix: check if qn is sqn with label and streamingnode list (#44792) issue: #44014 - On standalone, the inner query node needs to load segments and watch channels, so the querynode is not an embedded querynode in streamingnode without `LabelStreamingNodeEmbeddedQueryNode`. The channel dist manager cannot confirm that a standalone node is an embededStreamingNode. The bug was introduced by #44099 Signed-off-by: chyezh <chyezh@outlook.com>	2025-10-13 16:33:59 +08:00
wei liu	33d1e7de83	fix: Replace incorrect log import with milvus v2 log package (#44731 ) issue: #44730 Fix the issue where logs were not outputting as expected due to incorrect log package imports across multiple components. Changes include: - Add golangci-lint rule to forbid github.com/pingcap/log usage - Replace github.com/pingcap/log with github.com/milvus-io/milvus/pkg/v2/log Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2025-10-10 20:27:57 +08:00
zhenshan.cao	4279f166c6	enhance: Add refine logs for task scheduler in QueryCoord (#44577 ) issue: https://github.com/milvus-io/milvus/issues/43968 Signed-off-by: zhenshan.cao <zhenshan.cao@zilliz.com>	2025-10-10 10:07:55 +08:00
Zhen Ye	19e5e9f910	enhance: broadcaster will lock resource until message acked (#44508) issue: #43897 - Return LastConfirmedMessageID from the wal append operation. - Add resource-key-based locker for broadcast-ack operation to protect the coord state when executing ddl. - Resource-key-based locker is held until the broadcast operation is acked. - ResourceKey supports shared and exclusive locks. - Add FastAck to execute the ack right away after the broadcast is done, to speed up ddl. - Ack callback will support broadcast message result now. - Add tombstone for broadcaster to avoid repeatedly committing DDL and the ABA issue. --------- Signed-off-by: chyezh <chyezh@outlook.com>	2025-09-24 20:58:05 +08:00
XuanYang-cn	24037a396a	fix: LoadSegment failed for dup field mmap.enabled props (#44465) When mmap is enabled in both collection properties and field properties, load segment fails. See also: #44443 Signed-off-by: yangxuan <xuan.yang@zilliz.com>	2025-09-22 14:40:06 +08:00
wei liu	6d4961b978	enhance: Refactor balance checker with priority queue (#43992 ) issue: #43858 Refactor the balance checker implementation to use priority queues for managing collection balance operations, improving processing efficiency and order control. Changes include: - Export priority queue interfaces (Item, BaseItem, PriorityQueue) - Replace collection round-robin with priority-based queue system - Add BalanceCheckCollectionMaxCount configuration parameter - Optimize balance task generation with batch processing limits - Refactor processBalanceQueue method for different strategies - Enhance test coverage with comprehensive unit tests The new priority queue system processes collections based on row count or collection ID order, providing better control over balance operation priorities and resource utilization. --------- Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2025-09-19 17:46:01 +08:00
zhenshan.cao	691a8df953	feat: Add RESTful api for rolling upgrade support (#44381 ) issue: https://github.com/milvus-io/milvus/issues/43968 Co-authored-by: chyezh <ye.zhen@zilliz.com>	2025-09-16 20:08:00 +08:00
Bingyi Sun	0c0630cc38	feat: support dropping index without releasing collection (#42941 ) issue: #42942 This pr includes the following changes: 1. Added checks for index checker in querycoord to generate drop index tasks 2. Added drop index interface to querynode 3. To avoid search failure after dropping the index, the querynode allows the use of lazy mode (warmup=disable) to load raw data even when indexes contain raw data. 4. In segcore, loading the index no longer deletes raw data; instead, it evicts it. 5. In expr, the index is pinned to prevent concurrent errors. --------- Signed-off-by: sunby <sunbingyi1992@gmail.com>	2025-09-02 16:17:52 +08:00
Zhen Ye	9e2d1963d4	enhance: support cchannel for streaming service (#44143 ) issue: #43897 - add cchannel as a special vchannel to hold some ddl and dcl. Signed-off-by: chyezh <chyezh@outlook.com>	2025-09-02 10:05:52 +08:00
zhagnlu	fc876639cf	enhance: support json stats with shredding design (#42534 ) #42533 Co-authored-by: luzhang <luzhang@zilliz.com>	2025-09-01 10:49:52 +08:00
Zhen Ye	23085ae437	fix: use query node label check if streamingnode (#44099 ) issue: #44014 - Because the session of querynode and streamingnode is different. - So when streamingnode session down first, a streaming query node will be treated as querynode. - Use label but not streaming node session to fix it. Signed-off-by: chyezh <chyezh@outlook.com>	2025-08-29 10:45:59 +08:00
Chun Han	da156981c6	feat: milvus support posix-compatible mode(milvus-io#43942) (#43944 ) related: #43942 Signed-off-by: MrPresent-Han <chun.han@gmail.com> Co-authored-by: MrPresent-Han <chun.han@gmail.com>	2025-08-27 16:29:50 +08:00
XuanYang-cn	37a447d166	feat: Add CMEK cipher plugin (#43722 ) 1. Enable Milvus to read cipher configs 2. Enable cipher plugin in binlog reader and writer 3. Add a testCipher for unittests 4. Support pooling for datanode 5. Add encryption in storagev2 See also: #40321 Signed-off-by: yangxuan <xuan.yang@zilliz.com> --------- Signed-off-by: yangxuan <xuan.yang@zilliz.com>	2025-08-27 11:15:52 +08:00
Zhen Ye	575345ae7b	fix: get streamingnodes from service discovery without channel assign (#44033 ) issue: #43767 Signed-off-by: chyezh <chyezh@outlook.com>	2025-08-26 14:29:51 +08:00
Zhen Ye	cbb9392564	fix: filter the streaming node from resource group (#43984 ) issue: #43981 Signed-off-by: chyezh <chyezh@outlook.com>	2025-08-22 19:21:47 +08:00
wei liu	399f63300c	enhance: Implement dynamic interval updates for ticker components (#43865 ) issue: #43858 Enable dynamic configuration updates for ticker intervals without restart. This enhancement allows runtime configuration changes to take effect immediately for better operational flexibility. Changes include: - Apply "drain+Reset only when interval changed" pattern across all ticker components to preserve existing timing phases - Fix goroutine variable capture issue in CheckerController.Start() - Remove unnecessary ticker.Stop() in manual trigger paths - Add dynamic interval checking in QueryCoordV2 components: * checkers/controller.go: Various checker intervals * dist/dist_handler.go: DistPullInterval, CheckExecutedFlagInterval * session/cluster.go: CheckNodeSessionInterval * server.go: CheckAutoBalanceConfigInterval * observers/target_observer.go: UpdateNextTargetInterval * observers/collection_observer.go: CollectionObserverInterval - Add dynamic interval checking in QueryNodeV2 components: * segments/disk_usage_fetcher.go: DiskSizeFetchInterval - Ensure thread safety by performing all ticker operations in same goroutine with proper drain before Reset to avoid spurious triggers --------- Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2025-08-21 10:07:47 +08:00
wei liu	384c493d0e	fix: Fix node status inconsistency after QueryCoord restart (#43941 ) issue: #43933 Fix the issue where QueryCoord restart leads to node status inconsistency in resource manager, causing segment loading failures and incorrect resource group assignments. Changes include: - Add CheckNodesInResourceGroup method to sync node status after restart - Implement proper cleanup of offline/stopping nodes from resource groups - Add automatic discovery and assignment of new nodes to resource groups - Enhance rewatchNodes process to include resource manager synchronization This ensures resource manager maintains correct node status and assignments even after QueryCoord restarts, preventing segment loading failures and improving system reliability. Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2025-08-20 14:13:46 +08:00
wei liu	dada00a81c	fix: dirty querynode doesn't clean up after restart (#43909 ) issue: #43905 Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2025-08-18 18:05:46 +08:00
wei liu	3e9e830074	enhance: Implement rewatch mechanism for etcd failure scenarios (#43829 ) issue: #43828 Implement robust rewatch mechanism to handle etcd connection failures and node reconnection scenarios in DataCoord and QueryCoord, along with heartbeat lag monitoring capabilities. Changes include: - Implement rewatchDataNodes/rewatchQueryNodes callbacks for etcd reconnection scenarios - Add idempotent rewatchNodes method to handle etcd session recovery gracefully - Add QueryCoordLastHeartbeatTimeStamp metric for monitoring node heartbeat lag - Clean up heartbeat metrics when nodes go down to prevent metric leaks --------- Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2025-08-14 10:31:44 +08:00
wei liu	ecc2ac0426	fix: apply load config changes failed after restart (#43554 ) issue: #43107 --------- Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2025-08-01 20:13:37 +08:00
Spade A	faeb7fd410	feat: impl StructArray -- create schema, insert, and retrieve data (#42855 ) Ref https://github.com/milvus-io/milvus/issues/42148 https://github.com/milvus-io/milvus/pull/42406 impls the segcore part of storage for handling with VectorArray. This PR: 1. impls the go part of storage for VectorArray 2. impls the collection creation with StructArrayField and VectorArray 3. insert and retrieve data from the collection. --------- Signed-off-by: SpadeA <tangchenjie1210@gmail.com> Signed-off-by: SpadeA-Tang <tangchenjie1210@gmail.com> Signed-off-by: SpadeA-Tang <u6748471@anu.edu.au>	2025-07-27 01:30:55 +08:00
Zhen Ye	e9ab73e93d	enhance: add schema version at recovery storage (#43500 ) issue: #43072, #43289 - manage the schema version at recovery storage. - update the schema when creating collection or alter schema. - get schema at write buffer based on version. - recover the schema when upgrading from 2.5. --------- Signed-off-by: chyezh <chyezh@outlook.com>	2025-07-23 21:38:54 +08:00
Zhen Ye	df7e507c49	fix: balance may not trigger at balance checker when upgrading (#43462 ) issue: #43416 Signed-off-by: chyezh <chyezh@outlook.com>	2025-07-22 16:02:53 +08:00
Zhen Ye	25b76e1fde	fix: cannot auto balance the channel from old arch to streamingnode (#43424 ) issue: #43416, #43413 - also fix the panic on streamingnode when concurrent sync Signed-off-by: chyezh <chyezh@outlook.com>	2025-07-20 23:00:52 +08:00
Zhen Ye	3aacd179f7	fix: balance channel before balance segment when upgrading (#43346 ) issue: #43117, #42966, #43373 - also fix channel balance may not work at 2.6. - fix error lost at delete path - add mvcc into s/q log - change the log level for TestCoordDownSearch Signed-off-by: chyezh <chyezh@outlook.com>	2025-07-17 20:16:52 +08:00

740 Commits