issue: #43897, #44123
pr: #45224
also pick pr: #45216,#45154,#45033,#45145,#45092,#45058,#45029
enhance: Close channel replicator more gracefully (#45029)
issue: https://github.com/milvus-io/milvus/issues/44123
enhance: Show create time for import job (#45058)
issue: https://github.com/milvus-io/milvus/issues/45056
fix: wal state may be unconsistent after recovering from crash (#45092)
issue: #45088, #45086
- Message on control channel should trigger the checkpoint update.
- LastConfrimedMessageID should be recovered from the minimum of
checkpoint or the LastConfirmedMessageID of uncommitted txn.
- Add more log info for wal debugging.
fix: make ack of broadcaster cannot canceled by client (#45145)
issue: #45141
- make ack of broadcaster cannot canceled by rpc.
- make clone for assignment snapshot of wal balancer.
- add server id for GetReplicateCheckpoint to avoid failure.
enhance: support collection and index with WAL-based DDL framework
(#45033)
issue: #43897
- Part of collection/index related DDL is implemented by WAL-based DDL
framework now.
- Support following message type in wal, CreateCollection,
DropCollection, CreatePartition, DropPartition, CreateIndex, AlterIndex,
DropIndex.
- Part of collection/index related DDL can be synced by new CDC now.
- Refactor some UT for collection/index DDL.
- Add Tombstone scheduler to manage the tombstone GC for collection or
partition meta.
- Move the vchannel allocation into streaming pchannel manager.
enhance: support load/release collection/partition with WAL-based DDL
framework (#45154)
issue: #43897
- Load/Release collection/partition is implemented by WAL-based DDL
framework now.
- Support AlterLoadConfig/DropLoadConfig in wal now.
- Load/Release operation can be synced by new CDC now.
- Refactor some UT for load/release DDL.
enhance: Don't start cdc by default (#45216)
issue: https://github.com/milvus-io/milvus/issues/44123
fix: unrecoverable when replicate from old (#45224)
issue: #44962
---------
Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
Signed-off-by: chyezh <chyezh@outlook.com>
Co-authored-by: yihao.dai <yihao.dai@zilliz.com>
issue: #43897, #44123
pr: #44898
related pr: #44607#44642#44792#44809#44564#44560#44735#44822#44865#44850#44942#44874#44963#44886#44898
enhance: remove redundant channel manager from datacoord (#44532)
issue: #41611
- After enabling streaming arch, channel manager of data coord is a
redundant component.
fix: Fix CDC OOM due to high buffer size (#44607)
Fix CDC OOM by:
1. free msg buffer manually.
2. limit max msg buffer size.
3. reduce scanner msg hander buffer size.
issue: https://github.com/milvus-io/milvus/issues/44123
fix: remove wrong start timetick to avoid filtering DML whose timetick
is less than it. (#44691)
issue: #41611
- introduced by #44532
enhance: support remove cluster from replicate topology (#44642)
issue: #44558, #44123
- Update config(A->C) to A and C, config(B) to B on replicate topology
(A->B,A->C) can remove the B from replicate topology
- Fix some metric error of CDC
fix: check if qn is sqn with label and streamingnode list (#44792)
issue: #44014
- On standalone, the query node inside need to load segment and watch
channel, so the querynode is not a embeded querynode in streamingnode
without `LabelStreamingNodeEmbeddedQueryNode`. The channel dist manager
can not confirm a standalone node is a embededStreamingNode.
Bug is introduced by #44099
enhance: Make GetReplicateInfo API work at the pchannel level (#44809)
issue: https://github.com/milvus-io/milvus/issues/44123
enhance: Speed up CDC scheduling (#44564)
Make CDC watch etcd replicate pchannel meta instead of listing them
periodically.
issue: https://github.com/milvus-io/milvus/issues/44123
enhance: refactor update replicate config operation using
wal-broadcast-based DDL/DCL framework (#44560)
issue: #43897
- UpdateReplicateConfig operation will broadcast AlterReplicateConfig
message into all pchannels with cluster-exclusive-lock.
- Begin txn message will use commit message timetick now (to avoid
timetick rollback when CDC with txn message).
- If current cluster is secondary, the UpdateReplicateConfig will wait
until the replicate configuration is consistent with the config
replicated from primary.
enhance: support rbac with WAL-based DDL framework (#44735)
issue: #43897
- RBAC(Roles/Users/Privileges/Privilege Groups) is implemented by
WAL-based DDL framework now.
- Support following message type in wal `AlterUser`, `DropUser`,
`AlterRole`, `DropRole`, `AlterUserRole`, `DropUserRole`,
`AlterPrivilege`, `DropPrivilege`, `AlterPrivilegeGroup`,
`DropPrivilegeGroup`, `RestoreRBAC`.
- RBAC can be synced by new CDC now.
- Refactor some UT for RBAC.
enhance: support database with WAL-based DDL framework (#44822)
issue: #43897
- Database related DDL is implemented by WAL-based DDL framework now.
- Support following message type in wal CreateDatabase, AlterDatabase,
DropDatabase.
- Database DDL can be synced by new CDC now.
- Refactor some UT for Database DDL.
enhance: support alias with WAL-based DDL framework (#44865)
issue: #43897
- Alias related DDL is implemented by WAL-based DDL framework now.
- Support following message type in wal AlterAlias, DropAlias.
- Alias DDL can be synced by new CDC now.
- Refactor some UT for Alias DDL.
enhance: Disable import for replicating cluster (#44850)
1. Import in replicating cluster is not supported yet, so disable it for
now.
2. Remove GetReplicateConfiguration wal API
issue: https://github.com/milvus-io/milvus/issues/44123
fix: use short debug string to avoid newline in debug logs (#44925)
issue: #44924
fix: rerank before requery if reranker didn't use field data (#44942)
issue: #44918
enhance: support resource group with WAL-based DDL framework (#44874)
issue: #43897
- Resource group related DDL is implemented by WAL-based DDL framework
now.
- Support following message type in wal AlterResourceGroup,
DropResourceGroup.
- Resource group DDL can be synced by new CDC now.
- Refactor some UT for resource group DDL.
fix: Fix Fix replication txn data loss during chaos (#44963)
Only confirm CommitMsg for txn messages to prevent data loss.
issue: https://github.com/milvus-io/milvus/issues/44962,
https://github.com/milvus-io/milvus/issues/44123
fix: wrong execution order of DDL/DCL on secondary (#44886)
issue: #44697, #44696
- The DDL executing order of secondary keep same with order of control
channel timetick now.
- filtering the control channel operation on shard manager of
streamingnode to avoid wrong vchannel of create segment.
- fix that the immutable txn message lost replicate header.
fix: Fix primary-secondary replication switch blocking (#44898)
1. Fix primary-secondary replication switchover blocking by delete
replicate pchannel meta using modRevision.
2. Stop channel replicator(scanner) when cluster role changes to prevent
continued message consumption and replication.
3. Close Milvus client to prevent goroutine leak.
4. Create Milvus client once for a channel replicator.
5. Simplify CDC controller and resources.
issue: https://github.com/milvus-io/milvus/issues/44123
---------
Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
Signed-off-by: chyezh <chyezh@outlook.com>
Co-authored-by: yihao.dai <yihao.dai@zilliz.com>
issue: #44014
- Because the session of querynode and streamingnode is different.
- So when streamingnode session down first, a streaming query node will
be treated as querynode.
- Use label but not streaming node session to fix it.
Signed-off-by: chyezh <chyezh@outlook.com>
issue: #43933
Fix the issue where QueryCoord restart leads to node status
inconsistency in resource manager, causing segment loading failures and
incorrect resource group assignments.
Changes include:
- Add CheckNodesInResourceGroup method to sync node status after restart
- Implement proper cleanup of offline/stopping nodes from resource
groups
- Add automatic discovery and assignment of new nodes to resource groups
- Enhance rewatchNodes process to include resource manager
synchronization
This ensures resource manager maintains correct node status and
assignments even after QueryCoord restarts, preventing segment loading
failures and improving system reliability.
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
issue: #43828
Implement robust rewatch mechanism to handle etcd connection failures
and node reconnection scenarios in DataCoord and QueryCoord, along with
heartbeat lag monitoring capabilities.
Changes include:
- Implement rewatchDataNodes/rewatchQueryNodes callbacks for etcd
reconnection scenarios
- Add idempotent rewatchNodes method to handle etcd session recovery
gracefully
- Add QueryCoordLastHeartbeatTimeStamp metric for monitoring node
heartbeat lag
- Clean up heartbeat metrics when nodes go down to prevent metric leaks
---------
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
issue: #43107
- Add checkLoadConfigChanges() to apply load config during startup
- Call config check in startQueryCoord() after restart
- Skip auto-updates for collections with user-specified replica numbers
- Add is_user_specified_replica_mode field to preserve user settings
- Add comprehensive unit tests with mockey
Ensures existing collections use latest cluster-level config after
restart.
---------
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
issue: https://github.com/milvus-io/milvus/issues/41690
- Merge leader view and channel management into ChannelDistManager,
allowing a channel to have multiple delegators.
- Improve shard leader switching to ensure a single replica only has one
shard leader per channel. The shard leader handles all resource loading
and query requests.
- Refine the serviceable mechanism: after QC completes loading, sync the
query view to the delegator. The delegator then determines its
serviceable status based on the query view.
- When a delegator encounters forwarding query or deletion failures,
mark the corresponding segment as offline and transition it to an
unserviceable state.
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
Merge RootCoord, DataCoord And QueryCoord into MixCoord
Make Session into one
issue : https://github.com/milvus-io/milvus/issues/37764
---------
Signed-off-by: Xianhui.Lin <xianhui.lin@zilliz.com>
issue: https://github.com/milvus-io/milvus/issues/37764
- After removing rpc layer from mixcoord, the querycoord at standby mode
will be blocked forever of deployment rolling
---------
Signed-off-by: chyezh <chyezh@outlook.com>
issue: #35917
This PR refine the querycoord meta related interfaces to ensure that
each method includes a ctx parameter.
Signed-off-by: tinswzy <zhenyuan.wei@zilliz.com>
issue: #36293#36242
after qn recover, delegator may be loaded in new node, after all segment
has been loaded, delegator becomes serviceable. but delegator's target
version hasn't been synced, and if search/query comes, delegator will
use wrong target version to filter out a empty segment list, which
caused empty search result.
This pr will block delegator's serviceable status until target version
is synced
---------
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
issue: #30040
This PR introduce two database level props:
1. database.replica.number
2. database.resource_groups
User can set those two database props by AlterDatabase API, then can
load collection without specified replica_num and resource groups. then
it will use database level load param when try to load collections.
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
issue: #30647
- Add declarative resource group api
- Add config for resource group management
- Resource group recovery enhancement
---------
Signed-off-by: chyezh <chyezh@outlook.com>
See also #32066
This PR make coordinator register successful and let
`ProcessActiveStandBy` run async. And roles may receive stop signal and
notify servers.
---------
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
See also #31103
Since querycoord need index meta information from datacoord only, broker
shall use `ListIndexes` to skip segment index building check logic in
datacoord
This PR is also related to #30538, in which DescribeIndex caused lots of
memory usage and lead to OOM eventually
---------
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
issue: #29453
sync distribution by rpc will also call loadSegment/releaseSegment,
which may cause all kinds of concurrent case on same segment, such as
concurrent load and release on one segment.
This PR add leader_checker which generate load/release task to correct
the leader view, instead of calling sync distribution by rpc
---------
Signed-off-by: Wei Liu <wei.liu@zilliz.com>