milvus

mirror of https://gitee.com/milvus-io/milvus.git synced 2025-12-07 01:28:27 +08:00

Author	SHA1	Message	Date
Zhen Ye	23085ae437	fix: use query node label check if streamingnode (#44099 ) issue: #44014 - Because the session of querynode and streamingnode is different. - So when streamingnode session down first, a streaming query node will be treated as querynode. - Use label but not streaming node session to fix it. Signed-off-by: chyezh <chyezh@outlook.com>	2025-08-29 10:45:59 +08:00
Chun Han	da156981c6	feat: milvus support posix-compatible mode(milvus-io#43942) (#43944 ) related: #43942 Signed-off-by: MrPresent-Han <chun.han@gmail.com> Co-authored-by: MrPresent-Han <chun.han@gmail.com>	2025-08-27 16:29:50 +08:00
Zhen Ye	cbb9392564	fix: filter the streaming node from resource group (#43984 ) issue: #43981 Signed-off-by: chyezh <chyezh@outlook.com>	2025-08-22 19:21:47 +08:00
wei liu	399f63300c	enhance: Implement dynamic interval updates for ticker components (#43865 ) issue: #43858 Enable dynamic configuration updates for ticker intervals without restart. This enhancement allows runtime configuration changes to take effect immediately for better operational flexibility. Changes include: - Apply "drain+Reset only when interval changed" pattern across all ticker components to preserve existing timing phases - Fix goroutine variable capture issue in CheckerController.Start() - Remove unnecessary ticker.Stop() in manual trigger paths - Add dynamic interval checking in QueryCoordV2 components: * checkers/controller.go: Various checker intervals * dist/dist_handler.go: DistPullInterval, CheckExecutedFlagInterval * session/cluster.go: CheckNodeSessionInterval * server.go: CheckAutoBalanceConfigInterval * observers/target_observer.go: UpdateNextTargetInterval * observers/collection_observer.go: CollectionObserverInterval - Add dynamic interval checking in QueryNodeV2 components: * segments/disk_usage_fetcher.go: DiskSizeFetchInterval - Ensure thread safety by performing all ticker operations in same goroutine with proper drain before Reset to avoid spurious triggers --------- Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2025-08-21 10:07:47 +08:00
wei liu	384c493d0e	fix: Fix node status inconsistency after QueryCoord restart (#43941 ) issue: #43933 Fix the issue where QueryCoord restart leads to node status inconsistency in resource manager, causing segment loading failures and incorrect resource group assignments. Changes include: - Add CheckNodesInResourceGroup method to sync node status after restart - Implement proper cleanup of offline/stopping nodes from resource groups - Add automatic discovery and assignment of new nodes to resource groups - Enhance rewatchNodes process to include resource manager synchronization This ensures resource manager maintains correct node status and assignments even after QueryCoord restarts, preventing segment loading failures and improving system reliability. Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2025-08-20 14:13:46 +08:00
wei liu	dada00a81c	fix: dirty querynode doesn't clean up after restart (#43909 ) issue: #43905 Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2025-08-18 18:05:46 +08:00
wei liu	3e9e830074	enhance: Implement rewatch mechanism for etcd failure scenarios (#43829 ) issue: #43828 Implement robust rewatch mechanism to handle etcd connection failures and node reconnection scenarios in DataCoord and QueryCoord, along with heartbeat lag monitoring capabilities. Changes include: - Implement rewatchDataNodes/rewatchQueryNodes callbacks for etcd reconnection scenarios - Add idempotent rewatchNodes method to handle etcd session recovery gracefully - Add QueryCoordLastHeartbeatTimeStamp metric for monitoring node heartbeat lag - Clean up heartbeat metrics when nodes go down to prevent metric leaks --------- Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2025-08-14 10:31:44 +08:00
wei liu	ecc2ac0426	fix: apply load config changes failed after restart (#43554 ) issue: #43107 --------- Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2025-08-01 20:13:37 +08:00
wei liu	b2597c6329	enhance: apply load config changes after QueryCoord restart (#43108 ) issue: #43107 - Add checkLoadConfigChanges() to apply load config during startup - Call config check in startQueryCoord() after restart - Skip auto-updates for collections with user-specified replica numbers - Add is_user_specified_replica_mode field to preserve user settings - Add comprehensive unit tests with mockey Ensures existing collections use latest cluster-level config after restart. --------- Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2025-07-10 14:28:48 +08:00
wei liu	78010262f0	enhance: Optimize shard serviceable mechanism (#41937 ) issue: https://github.com/milvus-io/milvus/issues/41690 - Merge leader view and channel management into ChannelDistManager, allowing a channel to have multiple delegators. - Improve shard leader switching to ensure a single replica only has one shard leader per channel. The shard leader handles all resource loading and query requests. - Refine the serviceable mechanism: after QC completes loading, sync the query view to the delegator. The delegator then determines its serviceable status based on the query view. - When a delegator encounters forwarding query or deletion failures, mark the corresponding segment as offline and transition it to an unserviceable state. Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2025-05-22 11:38:24 +08:00
SimFG	91d40fa558	fix: Update logging context and upgrade dependencies (#41318 ) - issue: #41291 --------- Signed-off-by: SimFG <bang.fu@zilliz.com> Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>	2025-04-23 10:52:38 +08:00
Xianhui Lin	f9febe3bae	enhance: Merge RootCoord, DataCoord And QueryCoord into MixCoord (#41006 ) Merge RootCoord, DataCoord And QueryCoord into MixCoord Make Session into one issue : https://github.com/milvus-io/milvus/issues/37764 --------- Signed-off-by: Xianhui.Lin <xianhui.lin@zilliz.com>	2025-04-11 16:36:30 +08:00
congqixia	cb7f2fa6fd	enhance: Use v2 package name for pkg module (#39990 ) Related to #39095 https://go.dev/doc/modules/version-numbers Update pkg version according to golang dep version convention --------- Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>	2025-02-22 23:15:58 +08:00
Zhen Ye	c84a0748c4	enhance: add rw/ro streaming query node replica management (#38677 ) issue: #38399 - Embed the query node into streaming node to make delegator available at streaming node. - The embedded query node has a special server label `QUERYNODE_STREAMING-EMBEDDED`. - Change the balance strategy to make the channel assigned to streaming node as much as possible. Signed-off-by: chyezh <chyezh@outlook.com>	2025-01-24 16:55:07 +08:00
wei liu	d2834a1812	enhance: Add logs for check health failed (#39208 ) Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2025-01-15 17:31:00 +08:00
Zhen Ye	bb8d1ab3bf	enhance: make new go package to manage proto (#39114 ) issue: #39095 --------- Signed-off-by: chyezh <chyezh@outlook.com>	2025-01-10 10:49:01 +08:00
jaime	f03a85725a	enhance: add db name in replica (#38672 ) issue: #36621 Signed-off-by: jaime <yun.zhang@zilliz.com>	2025-01-09 19:40:59 +08:00
jaime	78438ef41e	fix: revert optimize CPU usage for CheckHealth requests (#35589 ) (#38555 ) issue: #35563 Signed-off-by: jaime <yun.zhang@zilliz.com>	2024-12-19 00:38:45 +08:00
jaime	28fdbc4e30	enhance: optimize CPU usage for CheckHealth requests (#35589 ) issue: #35563 1. Use an internal health checker to monitor the cluster's health state, storing the latest state on the coordinator node. The CheckHealth request retrieves the cluster's health from this latest state on the proxy sides, which enhances cluster stability. 2. Each health check will assess all collections and channels, with detailed failure messages temporarily saved in the latest state. 3. Use CheckHealth request instead of the heavy GetMetrics request on the querynode and datanode Signed-off-by: jaime <yun.zhang@zilliz.com>	2024-12-17 11:02:45 +08:00
tinswzy	27229f7907	enhance: refine exists log print with ctx (#38080 ) issue: #35917 Refines exists log print with ctx Signed-off-by: tinswzy <zhenyuan.wei@zilliz.com>	2024-12-14 22:36:44 +08:00
Zhen Ye	d3ae8e9232	fix: delay the wait other coord logic in query coord after query coord change into standby state (#38259 ) issue: https://github.com/milvus-io/milvus/issues/37764 - After removing the rpc layer from mixcoord, a querycoord in standby mode is blocked forever during a rolling deployment --------- Signed-off-by: chyezh <chyezh@outlook.com>	2024-12-11 15:48:42 +08:00
tinswzy	7944538ade	enhance: Add ctx param to KV operation interfaces (#38154 ) issue: #35917 Refine KV operation interfaces by adding a ctx param Signed-off-by: tinswzy <zhenyuan.wei@zilliz.com>	2024-12-05 15:16:41 +08:00
tinswzy	e76802f910	enhance: refine querycoord meta/catalog related interfaces to ensure that each method includes a ctx parameter (#37916 ) issue: #35917 This PR refine the querycoord meta related interfaces to ensure that each method includes a ctx parameter. Signed-off-by: tinswzy <zhenyuan.wei@zilliz.com>	2024-11-25 11:14:34 +08:00
jaime	7bbfe86bcd	enhance: add list index and segment index retrieval API for WebUI (#37861 ) issue: #36621 Signed-off-by: jaime <yun.zhang@zilliz.com>	2024-11-22 16:58:34 +08:00
jaime	1e8ea4a7e7	feat: add segment/channel/task/slow query render (#37561 ) issue: #36621 Signed-off-by: jaime <yun.zhang@zilliz.com>	2024-11-12 17:44:29 +08:00
wei liu	266f8ef1f5	fix: Search may return less result after qn recover (#36549 ) issue: #36293 #36242 After a qn recovers, the delegator may be loaded on a new node, and once all segments have been loaded the delegator becomes serviceable. But the delegator's target version hasn't been synced yet, so if a search/query arrives, the delegator uses the wrong target version and filters down to an empty segment list, which causes an empty search result. This PR blocks the delegator's serviceable status until the target version is synced --------- Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2024-11-12 16:34:28 +08:00
wei liu	a03157838b	enhance: Enable node assign policy on resource group (#36968 ) issue: #36977 With node_label_filter on a resource group, users can label a querynode with env `MILVUS_COMPONENT_LABEL`, and the resource group will then prefer to accept nodes that match its node_label_filter. Querynodes can then be grouped by label, with querynodes sharing a label placed in the same resource group. Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2024-11-08 11:18:27 +08:00
jaime	f348bd9441	feat: add segment,pipeline, replica and resourcegroup api for WebUI (#37344 ) issue: #36621 Signed-off-by: jaime <yun.zhang@zilliz.com>	2024-11-07 11:52:25 +08:00
jaime	9d16b972ea	feat: add tasks page into management WebUI (#37002 ) issue: #36621 1. Add API to access task runtime metrics, including: - build index task - compaction task - import task - balance (including load/release of segments/channels and some leader tasks on querycoord) - sync task 2. Add a debug model to the webpage by using debug=true or debug=false in the URL query parameters to enable or disable debug mode. Signed-off-by: jaime <yun.zhang@zilliz.com>	2024-10-28 10:13:29 +08:00
jaime	4746f47282	feat: management WebUI homepage (#36822 ) issue: #36784 1. Implement an embedded web server for WebUI access. 2. Complete the homepage development. Home page demo: https://github.com/user-attachments/assets/38539917-ce09-4e54-a5b5-7f4f7eaac353 Signed-off-by: jaime <yun.zhang@zilliz.com>	2024-10-23 11:29:28 +08:00
wei liu	3cd0b26285	enhance: Enable dynamic update loaded collection's replica (#35822 ) issue: #35821 After a collection is loaded, increasing or decreasing its replica number requires releasing and loading it again. milvus offers 4 solutions to update a loaded collection's replicas; this PR aims to change the replica number dynamically without a release. After the replica number changes, milvus loads or releases replicas asynchronously, and the replica load status can be checked with the getReplicas API. Note that if more replicas are set than the querynodes can afford, the new replicas won't be loaded successfully until enough querynodes join. --------- Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2024-09-25 10:13:18 +08:00
congqixia	2fbc628994	feat: Support field partial load collection (#35416 ) Related to #35415 --------- Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>	2024-08-20 16:49:02 +08:00
jaime	fcec4c21b9	fix: check collection health(queryable) fail for releasing collection (#34947 ) issue: #34946 Signed-off-by: jaime <yun.zhang@zilliz.com>	2024-08-02 17:20:15 +08:00
jaime	9630974fbb	enhance: move rocksmq from internal to pkg module (#33881 ) issue: #33956 Signed-off-by: jaime <yun.zhang@zilliz.com>	2024-06-25 21:18:15 +08:00
wayblink	5fac2fa1d2	fix: Panic if ProcessActiveStandBy returns error (#33369 ) #33368 Signed-off-by: wayblink <anyang.wang@zilliz.com>	2024-06-19 11:16:00 +08:00
wei liu	303470fc35	fix: Clean offline node from resource group after qc restart (#33232 ) issue: #33200 #33207 pr#33104 causes offline nodes to be kept in the resource group after qc recovers; an offline node can then be assigned to a new replica as an rwNode, and requests sent to that node fail with NodeNotFound. Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2024-05-22 10:03:40 +08:00
wei liu	33bd6eed28	fix: Clean offline node from replica after qc recover (#33213 ) issue: #33200 #33207 pr#33104 removed this logic by mistake, so offline nodes are kept in the replica after qc recovers, and requests sent to an offline qn get a NodeNotFound error. Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2024-05-21 15:41:39 +08:00
wei liu	2013d97243	enhance: Enable to dynamic update balancer policy in querycoord (#33037 ) issue: #33036 This PR enable to dynamic update balancer policy without restart querycoord. --------- Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2024-05-21 14:29:39 +08:00
wei liu	a7f6193bfc	fix: query node may stuck at stopping progress (#33104 ) issue: #33103 When doing stopping balance for a stopping query node, the balancer gets the node list from replica.GetNodes and checks whether each node is stopping; if so, stopping balance is triggered for that replica. After the replica refactor, replica.GetNodes only returns rwNodes, while stopping nodes are kept in roNodes, so the balancer can't find the replica that contains the stopping node and stopping balance is never triggered for it. The query node then gets stuck forever because its segments/channels are never moved out. --------- Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2024-05-20 10:21:38 +08:00
congqixia	861977ab60	fix: Start `LeaderCacheObserver` before `SyncAll` (#33035 ) Related to #33033 Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>	2024-05-14 13:25:32 +08:00
wei liu	e2332bdc17	enhance: Enable channel exclusive balance policy (#32911 ) issue: #32910 * split replica's node list to channels when create replicas * balance nodes among channels when node change happens * implement channel level balance, let balance happens in channel level Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2024-05-10 17:27:31 +08:00
wei liu	ba02d54a30	enhance: update shard leader cache when leader location changed (#32470 ) issue: #32466 This PR updates the proxy's shard leader cache when the shard location changes, so that after a query node failover the proxy can find the recovered replica --------- Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2024-05-08 10:05:29 +08:00
chyezh	48fe977a9d	enhance: declarative resource group api (#31930 ) issue: #30647 - Add declarative resource group api - Add config for resource group management - Resource group recovery enhancement --------- Signed-off-by: chyezh <chyezh@outlook.com>	2024-04-15 08:13:19 +08:00
congqixia	25a1c9ecf0	fix: Make coordinator `Register` not blocked on ProcessActiveStandby (#32069 ) See also #32066 This PR make coordinator register successful and let `ProcessActiveStandBy` run async. And roles may receive stop signal and notify servers. --------- Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>	2024-04-10 18:49:18 +08:00
chyezh	a2502bde75	enhance: replica manager enhancement (#31496 ) issue: #30647 - ReplicaManager now manages read-only nodes and always persists the node distribution of a replica. - All segment/channel checkers use ReplicaManager, not ResourceManager, to get read-only or read-write nodes. - ReplicaManager now guarantees that a querynode is assigned to at most one replica within a collection (replicas in the same collection never hold the same querynode at the same time). - ReplicaManager guarantees a fair node-count assignment policy when multiple replicas of a collection are assigned to one resource group. - Move some parameter checks into ReplicaManager to avoid data races. - Allow transferring a replica to a resource group that already loads a replica of the same collection - Allow transferring nodes between resource groups that load replicas of the same collection --------- Signed-off-by: chyezh <chyezh@outlook.com>	2024-04-05 04:57:16 +08:00
wei liu	4dfdb1a443	fix: save current target after target observer stop (#31315 ) issue: #28491 The target should be saved to the meta store after the target observer stops, in case the target has changed Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2024-03-18 12:27:04 +08:00
wei liu	d79aa58b37	enhance: Speed up target recovery after query coord restart (#31240 ) issue: #28491 After querycoord restarts, it pulls a new target, which includes the channel and segment lists. When the segments loaded on querynode have reached the target, the collection can serve search/query. But if the segment list changes over time, then after querycoord pulls a new target it takes a few minutes to catch up with the target's segment distribution, and until then query/search fails due to missing segments. This PR saves the currently loaded target to the meta store during querycoord's stop process and recovers it when querycoord starts, to speed up target recovery. --------- Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2024-03-15 14:19:03 +08:00
chyezh	ff4237bb90	enhance: add hostname into node info (#30673 ) issue: https://github.com/milvus-io/milvus/issues/30647 - Address may be reused in k8s environment. Using hostname can be better. Signed-off-by: chyezh <chyezh@outlook.com>	2024-03-15 10:45:06 +08:00
jaime	db79be3ae0	fix: ctx cancel should be the last step while stopping server (#31220 ) issue: #31219 Signed-off-by: jaime <yun.zhang@zilliz.com>	2024-03-15 10:33:05 +08:00
SimFG	ee8d6f236c	enhance: make the watch dm channel request better compatibility (#30952 ) issue: #30938 Signed-off-by: SimFG <bang.fu@zilliz.com>	2024-03-01 16:07:37 +08:00

137 Commits