78 Commits

Author SHA1 Message Date
Zhen Ye
21076196bf
enhance: support resource group with WAL-based DDL framework (#44874)
issue: #43897

- Resource group DDL is now implemented by the WAL-based DDL
framework.
- Support the following message types in the WAL: AlterResourceGroup,
DropResourceGroup.
- Resource group DDL can now be synced by the new CDC.
- Refactor some unit tests for resource group DDL.

---------

Signed-off-by: chyezh <chyezh@outlook.com>
2025-10-21 09:58:03 +08:00
Zhen Ye
23085ae437
fix: use query node label check if streamingnode (#44099)
issue: #44014

- The querynode and streamingnode use different sessions.
- So when the streamingnode session goes down first, a streaming query
node is treated as a plain querynode.
- Use the query node label instead of the streaming node session to
tell them apart.

Signed-off-by: chyezh <chyezh@outlook.com>
2025-08-29 10:45:59 +08:00
wei liu
384c493d0e
fix: Fix node status inconsistency after QueryCoord restart (#43941)
issue: #43933

Fix the issue where QueryCoord restart leads to node status
inconsistency in resource manager, causing segment loading failures and
incorrect resource group assignments.

Changes include:
- Add CheckNodesInResourceGroup method to sync node status after restart
- Implement proper cleanup of offline/stopping nodes from resource
groups
- Add automatic discovery and assignment of new nodes to resource groups
- Enhance rewatchNodes process to include resource manager
synchronization

This ensures resource manager maintains correct node status and
assignments even after QueryCoord restarts, preventing segment loading
failures and improving system reliability.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-08-20 14:13:46 +08:00
wei liu
dada00a81c
fix: dirty querynode doesn't clean up after restart (#43909)
issue: #43905

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-08-18 18:05:46 +08:00
wei liu
3e9e830074
enhance: Implement rewatch mechanism for etcd failure scenarios (#43829)
issue: #43828
Implement a robust rewatch mechanism to handle etcd connection failures
and node reconnection scenarios in DataCoord and QueryCoord, along with
heartbeat lag monitoring capabilities.

Changes include:
- Implement rewatchDataNodes/rewatchQueryNodes callbacks for etcd
reconnection scenarios
- Add idempotent rewatchNodes method to handle etcd session recovery
gracefully
- Add QueryCoordLastHeartbeatTimeStamp metric for monitoring node
heartbeat lag
- Clean up heartbeat metrics when nodes go down to prevent metric leaks

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-08-14 10:31:44 +08:00
wei liu
ecc2ac0426
fix: apply load config changes failed after restart (#43554)
issue: #43107

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-08-01 20:13:37 +08:00
wei liu
b2597c6329
enhance: apply load config changes after QueryCoord restart (#43108)
issue: #43107 
- Add checkLoadConfigChanges() to apply load config during startup
- Call config check in startQueryCoord() after restart
- Skip auto-updates for collections with user-specified replica numbers
- Add is_user_specified_replica_mode field to preserve user settings
- Add comprehensive unit tests with mockey

Ensures existing collections use latest cluster-level config after
restart.

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-07-10 14:28:48 +08:00
cai.zhang
5566a85bcc
enhance: Add proxy task queue metrics (#42156)
issue: #42155

Signed-off-by: Cai Zhang <cai.zhang@zilliz.com>
2025-06-04 11:26:32 +08:00
wei liu
78010262f0
enhance: Optimize shard serviceable mechanism (#41937)
issue: https://github.com/milvus-io/milvus/issues/41690
- Merge leader view and channel management into ChannelDistManager,
allowing a channel to have multiple delegators.
- Improve shard leader switching to ensure a single replica only has one
shard leader per channel. The shard leader handles all resource loading
and query requests.
- Refine the serviceable mechanism: after QC completes loading, sync the
query view to the delegator. The delegator then determines its
serviceable status based on the query view.
- When a delegator encounters forwarding query or deletion failures,
mark the corresponding segment as offline and transition it to an
unserviceable state.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-05-22 11:38:24 +08:00
Xianhui Lin
f9febe3bae
enhance: Merge RootCoord, DataCoord And QueryCoord into MixCoord (#41006)
Merge RootCoord, DataCoord, and QueryCoord into MixCoord.
Merge their sessions into a single session.
issue : https://github.com/milvus-io/milvus/issues/37764

---------

Signed-off-by: Xianhui.Lin <xianhui.lin@zilliz.com>
2025-04-11 16:36:30 +08:00
congqixia
cb7f2fa6fd
enhance: Use v2 package name for pkg module (#39990)
Related to #39095

https://go.dev/doc/modules/version-numbers

Update pkg version according to golang dep version convention

---------

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2025-02-22 23:15:58 +08:00
Zhen Ye
bb8d1ab3bf
enhance: make new go package to manage proto (#39114)
issue: #39095

---------

Signed-off-by: chyezh <chyezh@outlook.com>
2025-01-10 10:49:01 +08:00
SimFG
2afe2eaf3e
feat: support to replicate collection when the services contains the system tt msg (#37559)
- issue: #37105

---------

Signed-off-by: SimFG <bang.fu@zilliz.com>
2024-12-17 09:08:46 +08:00
Zhen Ye
d3ae8e9232
fix: delay the wait other coord logic in query coord after query coord change into standby state (#38259)
issue: https://github.com/milvus-io/milvus/issues/37764

- After the RPC layer was removed from mixcoord, a querycoord in standby
mode would block forever during a rolling deployment.

---------

Signed-off-by: chyezh <chyezh@outlook.com>
2024-12-11 15:48:42 +08:00
tinswzy
e76802f910
enhance: refine querycoord meta/catalog related interfaces to ensure that each method includes a ctx parameter (#37916)
issue: #35917 
This PR refines the querycoord meta-related interfaces to ensure that
each method includes a ctx parameter.

Signed-off-by: tinswzy <zhenyuan.wei@zilliz.com>
2024-11-25 11:14:34 +08:00
wei liu
266f8ef1f5
fix: Search may return less result after qn recover (#36549)
issue: #36293 #36242
After a QN recovers, the delegator may be loaded on a new node. Once all
segments have been loaded, the delegator becomes serviceable, but its
target version has not been synced yet. If a search/query arrives, the
delegator uses the wrong target version and filters the segments down to
an empty list, which causes an empty search result.

This PR blocks the delegator's serviceable status until the target
version is synced.
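
A minimal sketch of the gating described above, not the actual Milvus
code. The `delegatorState` type, its fields, and the convention that a
version of 0 means "not synced yet" are all assumptions made for
illustration.

```go
package main

import "fmt"

// delegatorState is a hypothetical, simplified view of a shard delegator.
type delegatorState struct {
	segmentsLoaded      bool
	syncedTargetVersion int64 // assumption: 0 means no target version synced yet
}

// serviceable reports whether the delegator may serve search/query.
// Loading all segments is not enough: the synced target version must
// also match the coordinator's current target version.
func (d delegatorState) serviceable(currentTargetVersion int64) bool {
	if !d.segmentsLoaded {
		return false
	}
	return d.syncedTargetVersion != 0 && d.syncedTargetVersion == currentTargetVersion
}

func main() {
	recovered := delegatorState{segmentsLoaded: true}
	fmt.Println(recovered.serviceable(7)) // loaded, but version not synced

	recovered.syncedTargetVersion = 7
	fmt.Println(recovered.serviceable(7)) // loaded and synced
}
```

With this check, a freshly recovered delegator stays unserviceable in the
window between "all segments loaded" and "target version synced", which
is the window in which the empty results were observed.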

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-11-12 16:34:28 +08:00
congqixia
f985173da0
fix: Fill load field list from old version load info (#35993)
See also #35959

---------

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-09-05 16:57:05 +08:00
congqixia
2fbc628994
feat: Support field partial load collection (#35416)
Related to #35415

---------

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-08-20 16:49:02 +08:00
wei liu
b13932bb55
enhance: Enable database level replica num and resource groups for loading collection (#33052)
issue: #30040

This PR introduces two database-level props:
1. database.replica.number
2. database.resource_groups

Users can set these two database props through the AlterDatabase API
and then load a collection without specifying replica_num or resource
groups; the load then uses the database-level load params.
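
A minimal sketch of the fallback described above, not the actual
QueryCoord code. The `LoadParams` type and `resolveLoadParams` helper are
hypothetical, and resolving the two settings independently of each other
is an assumption.

```go
package main

import "fmt"

// LoadParams is a hypothetical container for the two load settings.
type LoadParams struct {
	ReplicaNumber  int32
	ResourceGroups []string
}

// resolveLoadParams keeps any value given in the load request and falls
// back to the database-level props (database.replica.number and
// database.resource_groups) for values the request leaves unset.
func resolveLoadParams(req, db LoadParams) LoadParams {
	out := req
	if out.ReplicaNumber <= 0 {
		out.ReplicaNumber = db.ReplicaNumber
	}
	if len(out.ResourceGroups) == 0 {
		out.ResourceGroups = db.ResourceGroups
	}
	return out
}

func main() {
	db := LoadParams{ReplicaNumber: 2, ResourceGroups: []string{"rg1", "rg2"}}

	// Request without load params: database-level values apply.
	fmt.Println(resolveLoadParams(LoadParams{}, db))

	// Request with an explicit replica number: the request wins for that field.
	fmt.Println(resolveLoadParams(LoadParams{ReplicaNumber: 3}, db))
}
```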

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-05-29 10:59:43 +08:00
wei liu
2013d97243
enhance: Enable to dynamic update balancer policy in querycoord (#33037)
issue: #33036
This PR enables dynamically updating the balancer policy without
restarting querycoord.

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-05-21 14:29:39 +08:00
chyezh
48fe977a9d
enhance: declarative resource group api (#31930)
issue: #30647

- Add declarative resource group api

- Add config for resource group management

- Resource group recovery enhancement

---------

Signed-off-by: chyezh <chyezh@outlook.com>
2024-04-15 08:13:19 +08:00
congqixia
25a1c9ecf0
fix: Make coordinator Register not blocked on ProcessActiveStandby (#32069)
See also #32066

This PR makes coordinator registration succeed without blocking and lets
`ProcessActiveStandBy` run asynchronously. Roles can then receive the
stop signal and notify servers.

---------

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-04-10 18:49:18 +08:00
chyezh
ff4237bb90
enhance: add hostname into node info (#30673)
issue: https://github.com/milvus-io/milvus/issues/30647

- Addresses may be reused in a k8s environment, so using the hostname
is more reliable.

Signed-off-by: chyezh <chyezh@outlook.com>
2024-03-15 10:45:06 +08:00
congqixia
c886aa29ff
enhance: Use ListIndexes instead of DescribeIndex for qc broker (#31122)
See also #31103

Since querycoord only needs index meta information from datacoord, the
broker should use `ListIndexes` to skip the segment index building check
logic in datacoord.

This PR is also related to #30538, in which DescribeIndex caused heavy
memory usage and eventually led to OOM.

---------

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-03-07 21:43:03 +08:00
wei liu
e98c62abbb
enhance: refactor leader_observer to leader_checker (#29454)
issue: #29453 

Syncing distribution by RPC also calls loadSegment/releaseSegment,
which may cause concurrency problems on the same segment, such as a
concurrent load and release of one segment.
This PR adds leader_checker, which generates load/release tasks to
correct the leader view instead of syncing distribution by RPC.

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-01-05 15:54:55 +08:00
wei liu
839a72129e
fix: Auto balance param can't be updated by dynamic (#29501)
This PR fixes the auto balance param not being dynamically updatable.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2023-12-27 14:30:53 +08:00
wei liu
7f78e1dd46
fix datacoord unstable ut (#28281)
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2023-11-08 18:43:31 +08:00
yah01
1b90630633
Fix the target updated before version updated to cause data missing (#28250)
Signed-off-by: yah01 <yah2er0ne@outlook.com>
2023-11-08 11:36:22 +08:00
wei liu
5b45a138b1
disable auto balance when old node exists (#28191)
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2023-11-07 14:02:20 +08:00
yah01
dc89730a50
Support collection-level mmap control (#26901)
Signed-off-by: yah01 <yah2er0ne@outlook.com>
2023-11-02 23:52:16 +08:00
wei liu
178db7b0f0
check stopping node during start qc (#27859)
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2023-10-24 12:20:11 +08:00
yah01
be980fbc38
Refine state check (#27541)
Signed-off-by: yah01 <yah2er0ne@outlook.com>
2023-10-11 21:01:35 +08:00
jaime
7f7c71ea7d
Decoupling client and server API in types interface (#27186)
Co-authored-by: aoiasd <zhicheng.yue@zilliz.com>

Signed-off-by: jaime <yun.zhang@zilliz.com>
2023-09-26 09:57:25 +08:00
SimFG
26f06dd732
Format the code (#27275)
Signed-off-by: SimFG <bang.fu@zilliz.com>
2023-09-21 09:45:27 +08:00
yah01
b4f86ea55e
Construct all success status with merr (#27226)
Signed-off-by: yah01 <yah2er0ne@outlook.com>
2023-09-20 10:57:23 +08:00
yiwangdr
337edc321b
tikv integration (#26246)
Signed-off-by: yiwangdr <yiwangdr@gmail.com>
2023-09-07 07:25:14 +08:00
Enwei Jiao
fb0705df1b
Decouple basetable and componentparam (#26725)
Signed-off-by: Enwei Jiao <enwei.jiao@zilliz.com>
2023-09-05 10:31:48 +08:00
congqixia
e8f1b1736e
Remove log.Error(err.error())-style log (#26783)
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2023-09-01 13:09:01 +08:00
wei liu
949c320185
remove pull target from qc recover (#26775)
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2023-09-01 11:17:01 +08:00
congqixia
9364d0ea49
Remove etcd dependency for querycoord unit test (#26550)
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2023-08-28 11:20:25 +08:00
congqixia
1045c88102
Support replace indexed field in QueryCoord (#25747)
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2023-07-19 21:22:58 +08:00
yah01
948d1f1f4a
Handle errors by merr for QueryCoord (#24926)
Signed-off-by: yah01 <yang.cen@zilliz.com>
2023-07-17 14:59:34 +08:00
yiwangdr
b9189b9f41
Organize mocks from types.go (#25466)
Signed-off-by: yiwangdr <yiwangdr@gmail.com>
2023-07-14 10:12:31 +08:00
wei liu
68ae199a9f
load segment with target version, avoid read redundant segment (#24929)
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2023-06-27 11:48:45 +08:00
congqixia
41af0a98fa
Use go-api/v2 for milvus-proto (#24770)
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2023-06-09 01:28:37 +08:00
yihao.dai
89db828f71
Fix load collection failed after drop partition (#24680)
Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2023-06-07 19:04:36 +08:00
congqixia
ed81eaa963
Make CollectionObserver trigger checker more frequently during load procedure (#23928)
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2023-05-08 14:06:41 +08:00
foxspy
6f4ed517de
add growing segment index (#23615)
Signed-off-by: xianliang <xianliang.li@zilliz.com>
2023-04-26 10:14:41 +08:00
wei liu
1deac692a0
fix nodeup block (#23634)
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2023-04-25 19:20:37 +08:00
wei liu
4336ed8609
fix node up (#23415)
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2023-04-20 09:52:31 +08:00