milvus

mirror of https://gitee.com/milvus-io/milvus.git synced 2025-12-07 09:38:39 +08:00

Author	SHA1	Message	Date
Zhen Ye	8ca120b841	fix: use the right resource key lock for ddl and use new ddl in transfer replica (#45509 ) issue: #45452 pr: #45506 - alias/rename related DDL should use database level exclusive lock - alias cannot use as the resource key of lock, use collection name instead - transfer replica should use WAL-based framework Signed-off-by: chyezh <chyezh@outlook.com>	2025-11-12 20:13:38 +08:00
yihao.dai	b1255dba6f	fix: [2.6] Fix channel not available error and release collection blocking (#45429 ) 1. Ensure replica creation is idempotent. 2. Prevent currentTarget update when replica is missing. 3. Move the wait-for-release logic into the DDL framework's callback, and add a timeout to prevent it from blocking the DDL callback indefinitely. issue: https://github.com/milvus-io/milvus/issues/45301, https://github.com/milvus-io/milvus/issues/45274, https://github.com/milvus-io/milvus/issues/45295 pr: https://github.com/milvus-io/milvus/pull/45428 --------- Signed-off-by: bigsheeper <yihao.dai@zilliz.com>	2025-11-12 18:57:37 +08:00
Zhen Ye	02e2170601	enhance: cherry pick patch of new DDL framework and CDC 2 (#45241 ) issue: #43897, #44123 pr: #45224 also pick pr: #45216,#45154,#45033,#45145,#45092,#45058,#45029 enhance: Close channel replicator more gracefully (#45029) issue: https://github.com/milvus-io/milvus/issues/44123 enhance: Show create time for import job (#45058) issue: https://github.com/milvus-io/milvus/issues/45056 fix: wal state may be unconsistent after recovering from crash (#45092) issue: #45088, #45086 - Message on control channel should trigger the checkpoint update. - LastConfrimedMessageID should be recovered from the minimum of checkpoint or the LastConfirmedMessageID of uncommitted txn. - Add more log info for wal debugging. fix: make ack of broadcaster cannot canceled by client (#45145) issue: #45141 - make ack of broadcaster cannot canceled by rpc. - make clone for assignment snapshot of wal balancer. - add server id for GetReplicateCheckpoint to avoid failure. enhance: support collection and index with WAL-based DDL framework (#45033) issue: #43897 - Part of collection/index related DDL is implemented by WAL-based DDL framework now. - Support following message type in wal, CreateCollection, DropCollection, CreatePartition, DropPartition, CreateIndex, AlterIndex, DropIndex. - Part of collection/index related DDL can be synced by new CDC now. - Refactor some UT for collection/index DDL. - Add Tombstone scheduler to manage the tombstone GC for collection or partition meta. - Move the vchannel allocation into streaming pchannel manager. enhance: support load/release collection/partition with WAL-based DDL framework (#45154) issue: #43897 - Load/Release collection/partition is implemented by WAL-based DDL framework now. - Support AlterLoadConfig/DropLoadConfig in wal now. - Load/Release operation can be synced by new CDC now. - Refactor some UT for load/release DDL. enhance: Don't start cdc by default (#45216) issue: https://github.com/milvus-io/milvus/issues/44123 fix: unrecoverable when replicate from old (#45224) issue: #44962 --------- Signed-off-by: bigsheeper <yihao.dai@zilliz.com> Signed-off-by: chyezh <chyezh@outlook.com> Co-authored-by: yihao.dai <yihao.dai@zilliz.com>	2025-11-04 01:35:33 +08:00
Zhen Ye	318db122b8	enhance: cherry pick patch of new DDL framework and CDC (#45025 ) issue: #43897, #44123 pr: #44898 related pr: #44607 #44642 #44792 #44809 #44564 #44560 #44735 #44822 #44865 #44850 #44942 #44874 #44963 #44886 #44898 enhance: remove redundant channel manager from datacoord (#44532) issue: #41611 - After enabling streaming arch, channel manager of data coord is a redundant component. fix: Fix CDC OOM due to high buffer size (#44607) Fix CDC OOM by: 1. free msg buffer manually. 2. limit max msg buffer size. 3. reduce scanner msg hander buffer size. issue: https://github.com/milvus-io/milvus/issues/44123 fix: remove wrong start timetick to avoid filtering DML whose timetick is less than it. (#44691) issue: #41611 - introduced by #44532 enhance: support remove cluster from replicate topology (#44642) issue: #44558, #44123 - Update config(A->C) to A and C, config(B) to B on replicate topology (A->B,A->C) can remove the B from replicate topology - Fix some metric error of CDC fix: check if qn is sqn with label and streamingnode list (#44792) issue: #44014 - On standalone, the query node inside need to load segment and watch channel, so the querynode is not a embeded querynode in streamingnode without `LabelStreamingNodeEmbeddedQueryNode`. The channel dist manager can not confirm a standalone node is a embededStreamingNode. Bug is introduced by #44099 enhance: Make GetReplicateInfo API work at the pchannel level (#44809) issue: https://github.com/milvus-io/milvus/issues/44123 enhance: Speed up CDC scheduling (#44564) Make CDC watch etcd replicate pchannel meta instead of listing them periodically. issue: https://github.com/milvus-io/milvus/issues/44123 enhance: refactor update replicate config operation using wal-broadcast-based DDL/DCL framework (#44560) issue: #43897 - UpdateReplicateConfig operation will broadcast AlterReplicateConfig message into all pchannels with cluster-exclusive-lock. - Begin txn message will use commit message timetick now (to avoid timetick rollback when CDC with txn message). - If current cluster is secondary, the UpdateReplicateConfig will wait until the replicate configuration is consistent with the config replicated from primary. enhance: support rbac with WAL-based DDL framework (#44735) issue: #43897 - RBAC(Roles/Users/Privileges/Privilege Groups) is implemented by WAL-based DDL framework now. - Support following message type in wal `AlterUser`, `DropUser`, `AlterRole`, `DropRole`, `AlterUserRole`, `DropUserRole`, `AlterPrivilege`, `DropPrivilege`, `AlterPrivilegeGroup`, `DropPrivilegeGroup`, `RestoreRBAC`. - RBAC can be synced by new CDC now. - Refactor some UT for RBAC. enhance: support database with WAL-based DDL framework (#44822) issue: #43897 - Database related DDL is implemented by WAL-based DDL framework now. - Support following message type in wal CreateDatabase, AlterDatabase, DropDatabase. - Database DDL can be synced by new CDC now. - Refactor some UT for Database DDL. enhance: support alias with WAL-based DDL framework (#44865) issue: #43897 - Alias related DDL is implemented by WAL-based DDL framework now. - Support following message type in wal AlterAlias, DropAlias. - Alias DDL can be synced by new CDC now. - Refactor some UT for Alias DDL. enhance: Disable import for replicating cluster (#44850) 1. Import in replicating cluster is not supported yet, so disable it for now. 2. Remove GetReplicateConfiguration wal API issue: https://github.com/milvus-io/milvus/issues/44123 fix: use short debug string to avoid newline in debug logs (#44925) issue: #44924 fix: rerank before requery if reranker didn't use field data (#44942) issue: #44918 enhance: support resource group with WAL-based DDL framework (#44874) issue: #43897 - Resource group related DDL is implemented by WAL-based DDL framework now. - Support following message type in wal AlterResourceGroup, DropResourceGroup. - Resource group DDL can be synced by new CDC now. - Refactor some UT for resource group DDL. fix: Fix Fix replication txn data loss during chaos (#44963) Only confirm CommitMsg for txn messages to prevent data loss. issue: https://github.com/milvus-io/milvus/issues/44962, https://github.com/milvus-io/milvus/issues/44123 fix: wrong execution order of DDL/DCL on secondary (#44886) issue: #44697, #44696 - The DDL executing order of secondary keep same with order of control channel timetick now. - filtering the control channel operation on shard manager of streamingnode to avoid wrong vchannel of create segment. - fix that the immutable txn message lost replicate header. fix: Fix primary-secondary replication switch blocking (#44898) 1. Fix primary-secondary replication switchover blocking by delete replicate pchannel meta using modRevision. 2. Stop channel replicator(scanner) when cluster role changes to prevent continued message consumption and replication. 3. Close Milvus client to prevent goroutine leak. 4. Create Milvus client once for a channel replicator. 5. Simplify CDC controller and resources. issue: https://github.com/milvus-io/milvus/issues/44123 --------- Signed-off-by: bigsheeper <yihao.dai@zilliz.com> Signed-off-by: chyezh <chyezh@outlook.com> Co-authored-by: yihao.dai <yihao.dai@zilliz.com>	2025-11-03 15:39:33 +08:00
congqixia	c3932022c9	fix: [2.6] prevent data race in querycoord collection notifier update (#45037 ) (#45051 ) Cherry-pick from master pr: #45037 Fixes #45035 This commit addresses a data race issue where refreshCollection was updating the collection notifier without proper lock protection. Changes: - Add UpdateCollection method to CollectionManager with proper locking - Introduce CollectionOperator pattern for thread-safe collection updates - Make setRefreshNotifier private and use it through the operator pattern - Update refreshCollection to use the new UpdateCollection method - Handle collection not found error gracefully in refreshCollection The CollectionOperator pattern ensures all collection modifications go through the CollectionManager's lock, preventing concurrent access issues. Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>	2025-10-23 19:34:13 +08:00
wei liu	ecc2ac0426	fix: apply load config changes failed after restart (#43554 ) issue: #43107 --------- Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2025-08-01 20:13:37 +08:00
wei liu	b2597c6329	enhance: apply load config changes after QueryCoord restart (#43108 ) issue: #43107 - Add checkLoadConfigChanges() to apply load config during startup - Call config check in startQueryCoord() after restart - Skip auto-updates for collections with user-specified replica numbers - Add is_user_specified_replica_mode field to preserve user settings - Add comprehensive unit tests with mockey Ensures existing collections use latest cluster-level config after restart. --------- Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2025-07-10 14:28:48 +08:00
Bingyi Sun	6bebb68727	fix: Return all targets segments in ListLoadedSegments (#42728 ) issue: https://github.com/milvus-io/milvus/issues/42412 Signed-off-by: sunby <sunbingyi1992@gmail.com>	2025-06-18 11:20:38 +08:00
Bingyi Sun	1bf960b1a8	enhance: Check loaded segments before gc (#42639 ) issue: https://github.com/milvus-io/milvus/issues/42412 --------- Signed-off-by: sunby <sunbingyi1992@gmail.com>	2025-06-13 17:44:38 +08:00
wei liu	54619eaa2c	feat: Implement partial result support on node down (#42009 ) issue: https://github.com/milvus-io/milvus/issues/41690 This commit implements partial search result functionality when query nodes go down, improving system availability during node failures. The changes include: - Enhanced load balancing in proxy (lb_policy.go) to handle node failures with retry support - Added partial search result capability in querynode delegator and distribution logic - Implemented tests for various partial result scenarios when nodes go down - Added metrics to track partial search results in querynode_metrics.go - Updated parameter configuration to support partial result required data ratio - Replaced old partial_search_test.go with more comprehensive partial_result_on_node_down_test.go - Updated proto definitions and improved retry logic These changes improve query resilience by returning partial results to users when some query nodes are unavailable, ensuring that queries don't completely fail when a portion of data remains accessible. --------- Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2025-05-28 00:12:28 +08:00
Xianhui Lin	f9febe3bae	enhance: Merge RootCoord, DataCoord And QueryCoord into MixCoord (#41006 ) Merge RootCoord, DataCoord And QueryCoord into MixCoord Make Session into one issue : https://github.com/milvus-io/milvus/issues/37764 --------- Signed-off-by: Xianhui.Lin <xianhui.lin@zilliz.com>	2025-04-11 16:36:30 +08:00
congqixia	cb7f2fa6fd	enhance: Use v2 package name for pkg module (#39990 ) Related to #39095 https://go.dev/doc/modules/version-numbers Update pkg version according to golang dep version convention --------- Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>	2025-02-22 23:15:58 +08:00
Zhen Ye	bb8d1ab3bf	enhance: make new go package to manage proto (#39114 ) issue: #39095 --------- Signed-off-by: chyezh <chyezh@outlook.com>	2025-01-10 10:49:01 +08:00
Xiaofan	cb6eca8e91	fix: drop partition can not be successful if load failed (#38793 ) fix #38649 when partition load failed, the partition drop will also fail due to the wrong error message Signed-off-by: xiaofanluan <xiaofan.luan@zilliz.com>	2024-12-30 19:42:52 +08:00
wei liu	25f0c82ceb	fix: Fix update loading collection's load config doesn't work (#38595 ) issue: #38594 --------- Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2024-12-25 18:02:51 +08:00
jaime	5afd0c0a2b	fix: Revert "Expose metrics of stanby coordinators (#27698 )" (#38620 ) issue: #38608 Signed-off-by: jaime <yun.zhang@zilliz.com>	2024-12-23 11:46:57 +08:00
jaime	78438ef41e	fix: revert optimize CPU usage for CheckHealth requests (#35589 ) (#38555 ) issue: #35563 Signed-off-by: jaime <yun.zhang@zilliz.com>	2024-12-19 00:38:45 +08:00
jaime	28fdbc4e30	enhance: optimize CPU usage for CheckHealth requests (#35589 ) issue: #35563 1. Use an internal health checker to monitor the cluster's health state, storing the latest state on the coordinator node. The CheckHealth request retrieves the cluster's health from this latest state on the proxy sides, which enhances cluster stability. 2. Each health check will assess all collections and channels, with detailed failure messages temporarily saved in the latest state. 3. Use CheckHealth request instead of the heavy GetMetrics request on the querynode and datanode Signed-off-by: jaime <yun.zhang@zilliz.com>	2024-12-17 11:02:45 +08:00
SimFG	2afe2eaf3e	feat: support to replicate collection when the services contains the system tt msg (#37559 ) - issue: #37105 --------- Signed-off-by: SimFG <bang.fu@zilliz.com>	2024-12-17 09:08:46 +08:00
tinswzy	27229f7907	enhance: refine exists log print with ctx (#38080 ) issue: #35917 Refines exists log print with ctx Signed-off-by: tinswzy <zhenyuan.wei@zilliz.com>	2024-12-14 22:36:44 +08:00
congqixia	051bc280dd	enhance: Make dynamic load/release partition follow targets (#38059 ) Related to #37849 --------- Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>	2024-12-05 16:24:40 +08:00
tinswzy	e76802f910	enhance: refine querycoord meta/catalog related interfaces to ensure that each method includes a ctx parameter (#37916 ) issue: #35917 This PR refine the querycoord meta related interfaces to ensure that each method includes a ctx parameter. Signed-off-by: tinswzy <zhenyuan.wei@zilliz.com>	2024-11-25 11:14:34 +08:00
jaime	257ecab84b	enhance: remove collection queryable check from health check (#37712 ) Signed-off-by: jaime <yun.zhang@zilliz.com>	2024-11-18 10:50:38 +08:00
wei liu	266f8ef1f5	fix: Search may return less result after qn recover (#36549 ) issue: #36293 #36242 after qn recover, delegator may be loaded in new node, after all segment has been loaded, delegator becomes serviceable. but delegator's target version hasn't been synced, and if search/query comes, delegator will use wrong target version to filter out a empty segment list, which caused empty search result. This pr will block delegator's serviceable status until target version is synced --------- Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2024-11-12 16:34:28 +08:00
congqixia	f5b06a3c9f	enhance: Invalidate collection cache when release collection (#37577 ) Related to #37395 --------- Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>	2024-11-12 10:16:29 +08:00
congqixia	dcc1b506dc	enhance: Add context trace for querycoord queryable check (#37524 ) When check health logic failed to collection not-queryable, the related reason is hard to find in log. This PR add context for log with trace id and print unqueryable collection info log. Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>	2024-11-08 14:00:26 +08:00
jaime	9d16b972ea	feat: add tasks page into management WebUI (#37002 ) issue: #36621 1. Add API to access task runtime metrics, including: - build index task - compaction task - import task - balance (including load/release of segments/channels and some leader tasks on querycoord) - sync task 2. Add a debug model to the webpage by using debug=true or debug=false in the URL query parameters to enable or disable debug mode. Signed-off-by: jaime <yun.zhang@zilliz.com>	2024-10-28 10:13:29 +08:00
jaime	4746f47282	feat: management WebUI homepage (#36822 ) issue: #36784 1. Implement an embedded web server for WebUI access. 2. Complete the homepage development. Home page demo: <img width="2177" alt="iShot_2024-10-10_17 57 34" src="https://github.com/user-attachments/assets/38539917-ce09-4e54-a5b5-7f4f7eaac353"> Signed-off-by: jaime <yun.zhang@zilliz.com>	2024-10-23 11:29:28 +08:00
wei liu	3cd0b26285	enhance: Enable dynamic update loaded collection's replica (#35822 ) issue: #35821 After collection loaded, if we need to increase/decrease collection's replica, we need to release and load it again. milvus offers 4 solution to update loaded collection's replica, this PR aims to dynamic change the replica number without release, and after replica number changed, milvus will execute load replica or release replica in async, and the replica loaded status can be checked by getReplicas API. Notice that if set too much replicas than querynode can afford，the new replica won't be loaded successfully until enough querynode joins. --------- Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2024-09-25 10:13:18 +08:00
congqixia	2fbc628994	feat: Support field partial load collection (#35416 ) Related to #35415 --------- Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>	2024-08-20 16:49:02 +08:00
wei liu	22ced010cd	enhance: make configure load param feature be compatible with old sdk (#35520 ) issue: #31570 #35521 --------- Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2024-08-20 10:30:55 +08:00
wei liu	9b37d3f517	enhance: Enable setting the replica number and resource group during collection creation (#34403 ) issue: #30040 --------- Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2024-07-10 10:20:13 +08:00
jaime	0426390f06	enhance: improve check health (#33800 ) issue: #34264 Signed-off-by: jaime <yun.zhang@zilliz.com>	2024-07-01 10:16:06 +08:00
wei liu	935bc1fb71	fix: Fix GetReplicas API return nil status (#33715 ) issue: #33702 Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2024-06-20 14:40:15 +08:00
wei liu	b13932bb55	enhance: Enable database level replica num and resource groups for loading collection (#33052 ) issue: #30040 This PR introduce two database level props: 1. database.replica.number 2. database.resource_groups User can set those two database props by AlterDatabase API, then can load collection without specified replica_num and resource groups. then it will use database level load param when try to load collections. Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2024-05-29 10:59:43 +08:00
wei liu	a7f6193bfc	fix: query node may stuck at stopping progress (#33104 ) issue: #33103 when try to do stopping balance for stopping query node, balancer will try to get node list from replica.GetNodes, then check whether node is stopping, if so, stopping balance will be triggered for this replica. after the replica refactor, replica.GetNodes only return rwNodes, and the stopping node maintains in roNodes, so balancer couldn't find replica which contains stopping node, and stopping balance for replica won't be triggered, then query node will stuck forever due to segment/channel doesn't move out. --------- Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2024-05-20 10:21:38 +08:00
chyezh	293f14a8b9	fix: remove redundant replica recover (#32985 ) issue: #22288 - replica recover should be only triggered by replica recover Signed-off-by: chyezh <chyezh@outlook.com>	2024-05-13 15:25:32 +08:00
wei liu	ba02d54a30	enhance: update shard leader cache when leader location changed (#32470 ) issue: #32466 this PR enhance that when shard location changed, update proxy's shard leader cache. in case of query node failover case, proxy can find replica recover --------- Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2024-05-08 10:05:29 +08:00
wei liu	d900e68440	fix: fix GetShardLeaders return empty node list (#32685 ) issue: #32449 to avoid GetShardLeaders return empty node list, this PR add node list check in both client side and server side. Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2024-04-29 14:19:26 +08:00
chyezh	f06509bf97	fix: get replica should not report error when no querynode serve (#32536 ) issue: #30647 - Remove error report if there's no query node serve. It's hard for programer to use it to do resource management. - Change resource group `transferNode` logic to keep compatible with old version sdk. --------- Signed-off-by: chyezh <chyezh@outlook.com>	2024-04-25 19:25:24 +08:00
chyezh	b287fbaa2e	fix: return collection on recovering but not collection not loaded when target is not recovered (#32447 ) issue: #32398 Signed-off-by: chyezh <chyezh@outlook.com>	2024-04-25 11:21:26 +08:00
smellthemoon	96d95e7743	enhance: fix pass error msg as channel name (#32511 ) Signed-off-by: lixinguo <xinguo.li@zilliz.com> Co-authored-by: lixinguo <xinguo.li@zilliz.com>	2024-04-23 16:45:22 +08:00
congqixia	d7ff1bbe5c	enhance: Make querycoordv2 collection observer task driven (#32441 ) See also #32440 - Add loadTask in collection observer - For load collection/partitions, load task shall timeout as a whole - Change related constructor to load jobs --------- Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>	2024-04-22 10:39:22 +08:00
chyezh	a8c8a6bb0f	fix: parameter check of TransferReplica and TransferNode (#32297 ) issue: #30647 - Same dst and src resource group should not be allowed in `TransferReplica` and `TransferNode`. - Remove redundant parameter check. Signed-off-by: chyezh <chyezh@outlook.com>	2024-04-17 15:27:19 +08:00
chyezh	48fe977a9d	enhance: declarative resource group api (#31930 ) issue: #30647 - Add declarative resource group api - Add config for resource group management - Resource group recovery enhancement --------- Signed-off-by: chyezh <chyezh@outlook.com>	2024-04-15 08:13:19 +08:00
wei liu	c4806b69c4	enhance: Refactor leader view manager interface (#31133 ) issue: #31091 This PR add GetByFilter interface in leader view manager, instead of all kind of get func --------- Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2024-04-10 15:13:36 +08:00
chyezh	a2502bde75	enhance: replica manager enhancement (#31496 ) issue: #30647 - ReplicaManager manage read only node now, and always do persistent of node distribution of replica. - All segment/channel checker using ReplicaManager to get read-only node or read-write node, but not ResourceManager. - ReplicaManager promise that only apply unique querynode to one replica in same collection now (replicas in same collection never hold same querynode at same time). - ReplicaManager promise that fairly node count assignment policy if multi replicas of collection is assigned to one resource group. - Move some parameters check into ReplicaManager to avoid data race. - Allow transfer replica to resource group that already load replica of same collection - Allow transfer node between resource groups that load replica of same collection --------- Signed-off-by: chyezh <chyezh@outlook.com>	2024-04-05 04:57:16 +08:00
congqixia	c2aad513c0	fix: Check collection nil before check load status (#31850 ) See also #31849 Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>	2024-04-03 10:07:13 +08:00
wei liu	7471a8005f	fix: querycoord panic after node down (#31831 ) issue: #30519 Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2024-04-03 10:03:22 +08:00
wei liu	92971707de	enhance: Add restful api for devops to execute rolling upgrade (#29998 ) issue: #29261 This PR Add restful api for devops to execute rolling upgrade, including suspend/resume balance and manual transfer segments/channels. --------- Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2024-03-27 16:15:19 +08:00

1 2 3

122 Commits