58 Commits

Author SHA1 Message Date
yihao.dai
5b97cb70a0
enhance: Support delaying scanner startup (#46369)
Introduce a ScannerStartupDelay configuration to enable WAL write-only
recovery, allowing fence messages to be persisted during
primary–secondary switchover when the StreamingNode is trapped in crash
loops.

issue: https://github.com/milvus-io/milvus/issues/46368

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* Added a configurable WAL scanner pause/resume and a consumer request
flag to optionally ignore pause signals.

* **Metrics**
* Added a scanner pause gauge and pause-duration tracking for WAL
scanning.

* **Tests**
* Added coverage for pause-consumption behavior and cleanup in stream
client tests.

* **Chores**
* Consolidated flush-all logging into a single field and added a helper
for bulk message conversion.

<sub>✏️ Tip: You can customize this high-level summary in your review
settings.</sub>
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2025-12-24 11:53:19 +08:00
Zhen Ye
7c575a18b0
enhance: support AckSyncUp for broadcaster, and enable it in truncate api (#46313)
issue: #43897
also for issue: #46166

add ack_sync_up flag into broadcast message header, which indicates that
whether the broadcast operation is need to be synced up between the
streaming node and the coordinator.
If the ack_sync_up is false, the broadcast operation will be acked once
the recovery storage see the message at current vchannel, the fast ack
operation can be applied to speed up the broadcast operation.
If the ack_sync_up is true, the broadcast operation will be acked after
the checkpoint of current vchannel reach current message.
The fast ack operation can not be applied to speed up the broadcast
operation, because the ack operation need to be synced up with streaming
node.
e.g. if truncate collection operation want to call ack once callback
after the all segment are flushed at current vchannel, it should set the
ack_sync_up to be true.

TODO: current implementation doesn't promise the ack sync up semantic,
it only promise FastAck operation will not be applied, wait for 3.0 to
implement the ack sync up semantic. only for truncate api now.

---------

Signed-off-by: chyezh <chyezh@outlook.com>
2025-12-17 16:55:17 +08:00
XuanYang-cn
0bbb134e39
feat: Enable to backup and reload ez (#46332)
see also: #40013

Signed-off-by: yangxuan <xuan.yang@zilliz.com>
2025-12-16 17:19:16 +08:00
Zhen Ye
b8086cb62b
fix: lost database in restful v2 (#46171)
issue: #45812

---------

Signed-off-by: chyezh <chyezh@outlook.com>
2025-12-09 10:59:13 +08:00
Zhen Ye
c22cdbbf9a
enhance: support proxy DQL forward (#46036)
issue: #45812

Signed-off-by: chyezh <chyezh@outlook.com>
2025-12-04 17:11:12 +08:00
Zhen Ye
f365025d68
enhance: support proxy DML forward on restful v2 (#46020)
issue: #45812

Signed-off-by: chyezh <chyezh@outlook.com>
2025-12-03 14:39:09 +08:00
Zhen Ye
adbdf916e1
enhance: support proxy DML forward (#45921)
issue: #45812

- 2.6 proxy will try to forward DWL to 2.5 proxy if streaming service is
not ready

Signed-off-by: chyezh <chyezh@outlook.com>
2025-12-01 19:37:10 +08:00
Zhen Ye
576084fe86
enhance: support alter collection/database with WAL-based DDL framework (#45266)
issue: #43897

- Alter collection/database is implemented by WAL-based DDL framework
now.
- Support AlterCollection/AlterDatabase in wal now.
- Alter operation can be synced by new CDC now.
- Refactor some UT for alter DDL.

---------

Signed-off-by: chyezh <chyezh@outlook.com>
2025-11-04 09:59:33 +08:00
Zhen Ye
309d564796
enhance: support collection and index with WAL-based DDL framework (#45033)
issue: #43897

- Part of collection/index related DDL is implemented by WAL-based DDL
framework now.
- Support following message type in wal, CreateCollection,
DropCollection, CreatePartition, DropPartition, CreateIndex, AlterIndex,
DropIndex.
- Part of collection/index related DDL can be synced by new CDC now.
- Refactor some UT for collection/index DDL.
- Add Tombstone scheduler to manage the tombstone GC for collection or
partition meta.
- Move the vchannel allocation into streaming pchannel manager.

---------

Signed-off-by: chyezh <chyezh@outlook.com>
2025-10-30 14:24:08 +08:00
Zhen Ye
2aa48bf4ca
fix: wrong execution order of DDL/DCL on secondary (#44886)
issue: #44697, #44696

- The DDL executing order of secondary keep same with order of control
channel timetick now.
- filtering the control channel operation on shard manager of
streamingnode to avoid wrong vchannel of create segment.
- fix that the immutable txn message lost replicate header.

---------

Signed-off-by: chyezh <chyezh@outlook.com>
2025-10-21 22:38:05 +08:00
yihao.dai
168dc49bfc
enhance: Disable import for replicating cluster (#44850)
1. Import in replicating cluster is not supported yet, so disable it for
now.
2. Remove GetReplicateConfiguration wal API

issue: https://github.com/milvus-io/milvus/issues/44123

---------

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2025-10-20 09:56:01 +08:00
Zhen Ye
8bf7d6ae72
enhance: refactor update replicate config operation using wal-broadcast-based DDL/DCL framework (#44560)
issue: #43897

- UpdateReplicateConfig operation will broadcast AlterReplicateConfig
message into all pchannels with cluster-exclusive-lock.
- Begin txn message will use commit message timetick now (to avoid
timetick rollback when CDC with txn message).
- If current cluster is secondary, the UpdateReplicateConfig will wait
until the replicate configuration is consistent with the config
replicated from primary.

---------

Signed-off-by: chyezh <chyezh@outlook.com>
2025-10-15 15:26:01 +08:00
Zhen Ye
369c6eb206
enhance: support remove cluster from replicate topology (#44642)
issue: #44558, #44123
- Update config(A->C) to A and C, config(B) to B on replicate topology
(A->B,A->C) can remove the B from replicate topology
- Fix some metric error of CDC

Signed-off-by: chyezh <chyezh@outlook.com>
2025-10-13 11:07:58 +08:00
Zhen Ye
19e5e9f910
enhance: broadcaster will lock resource until message acked (#44508)
issue: #43897

- Return LastConfirmedMessageID when wal append operation.
- Add resource-key-based locker for broadcast-ack operation to protect
the coord state when executing ddl.
- Resource-key-based locker is held until the broadcast operation is
acked.
- ResourceKey support shared and exclusive lock.
- Add FastAck execute ack right away after the broadcast done to speed
up ddl.
- Ack callback will support broadcast message result now.
- Add tombstone for broadcaster to avoid to repeatedly commit DDL and
ABA issue.

---------

Signed-off-by: chyezh <chyezh@outlook.com>
2025-09-24 20:58:05 +08:00
Zhen Ye
c171280f63
enhance: support replicate message in wal. (#44456)
issue: #44123

- support replicate message  in wal of milvus.
- support CDC-replicate recovery from wal.
- fix some CDC replicator bugs

Signed-off-by: chyezh <chyezh@outlook.com>
2025-09-22 17:06:11 +08:00
zhenshan.cao
691a8df953
feat: Add RESTful api for rolling upgrade support (#44381)
issue: https://github.com/milvus-io/milvus/issues/43968

Co-authored-by: chyezh <ye.zhen@zilliz.com>
2025-09-16 20:08:00 +08:00
yihao.dai
51f69f32d0
feat: Add CDC support (#44124)
This PR implements a new CDC service for Milvus 2.6, providing log-based
cross-cluster replication.

issue: https://github.com/milvus-io/milvus/issues/44123

---------

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
Signed-off-by: chyezh <chyezh@outlook.com>
Co-authored-by: chyezh <chyezh@outlook.com>
2025-09-16 16:32:01 +08:00
Zhen Ye
cbe4c3d231
enhance: get cchannel before build message (#44229)
issue: #43897

- support never expire txn message.

Signed-off-by: chyezh <chyezh@outlook.com>
2025-09-10 11:09:57 +08:00
Zhen Ye
9e2d1963d4
enhance: support cchannel for streaming service (#44143)
issue: #43897

- add cchannel as a special vchannel to hold some ddl and dcl.

Signed-off-by: chyezh <chyezh@outlook.com>
2025-09-02 10:05:52 +08:00
Zhen Ye
3327df72e4
enhance: make immutable message as the param of ack operation for cdc (#43900)
issue: #43897

- The original broadcast ack operation need to recover message from
etcd, which can not support cdc.
- immutable message will set as the ack parameter to fix it.

Signed-off-by: chyezh <chyezh@outlook.com>
2025-09-01 10:21:52 +08:00
Zhen Ye
d0e3a33c37
enhance: add IsRebalanceSuspended interface for wal balancer (#44026)
issue: #43968

Signed-off-by: chyezh <chyezh@outlook.com>
2025-08-24 09:19:47 +08:00
Zhen Ye
082ca62ec1
enhance: support balancer interface for streaming client to fetch streaming node information (#43969)
issue: #43968

- Add ListStreamingNode/GetWALDistribution to  fetch streaming node info
- Add SuspendRebalance/ResumeRebalance to enable or stop balance
- Add FreezeNodeIDs/DefreezeNodeIDs to freeze target node

Signed-off-by: chyezh <chyezh@outlook.com>
2025-08-21 15:55:47 +08:00
Zhen Ye
5551d99425
enhance: remove old arch non-streaming arch code (#43651)
issue: #41609

- remove all dml dead code at proxy
- remove dead code at l0_write_buffer
- remove msgstream dependency at proxy
- remove timetick reporter from proxy
- remove replicate stream implementation

---------

Signed-off-by: chyezh <chyezh@outlook.com>
2025-08-06 14:41:40 +08:00
Zhen Ye
15a6631147
enhance: add quota limit based on sn consuming lag (#43105)
issue: #42995

- The consuming lag at streaming node will be reported to coordinator.
- The consuming lag will trigger the write limit and deny by quota
center.
- Set the ttProtection by default.

---------

Signed-off-by: chyezh <chyezh@outlook.com>
2025-07-11 14:10:49 +08:00
Zhen Ye
f598ca2b4e
fix: block at msgpack adaptor and wrong metrics (#43235)
issue: #43018

Signed-off-by: chyezh <chyezh@outlook.com>
2025-07-11 10:14:49 +08:00
Zhen Ye
66cc194ab2
enhance: add partition gc at streaming arch (#42179)
issue: #41976

- make drop partition message as a broadcast message.
- add gc when drop partition message is acked.
- add a call back to handle the broadcast message when ack.
- the ack operation of broadcast message will retry until success.

Signed-off-by: chyezh <chyezh@outlook.com>
2025-05-29 23:20:30 +08:00
yihao.dai
65dd3982d8
fix: Fix ants.Pool goroutine leak (#41892)
1. Release the pool after it is no longer in use.
2. Upgrade ants.Pool to fix the goroutine leak issue (see [PR
#287](https://github.com/panjf2000/ants/pull/287)).

issue: https://github.com/milvus-io/milvus/issues/41838

---------

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2025-05-19 17:56:22 +08:00
Zhen Ye
0a465bb5b7
enhance: use recovery+shardmanager, remove segment assignment interceptor (#41824)
issue: #41544

- add lock interceptor into wal.
- use recovery and shardmanager to replace the original implementation
of segment assignment.
- remove redundant implementation and unittest.
- remove redundant proto definition.
- use 2 streamingnode in e2e.

---------

Signed-off-by: chyezh <chyezh@outlook.com>
2025-05-14 23:00:23 +08:00
Zhen Ye
21d6d1669e
fix: wal should be reopen if wal append receive the fence error (#41807)
issue: #41544

Signed-off-by: chyezh <chyezh@outlook.com>
2025-05-14 01:02:56 +08:00
Zhen Ye
de8f0af20d
enhance: use dispatcher at delegator when enable streaming (#41266)
issue: #38399

- add an adaptor type to adapt the streaming service client and
msgstream client to reuse the msgdispatcher.

Signed-off-by: chyezh <chyezh@outlook.com>
2025-05-06 01:12:53 +08:00
Zhen Ye
a3d621cb5e
fix: remove the concurrent limits for streaming service (#41484)
issue: #41479

Signed-off-by: chyezh <chyezh@outlook.com>
2025-04-24 20:36:38 +08:00
Zhen Ye
78fca7e88d
fix: transaction should retry if transaction is expired (#41379)
issue: #41248

Signed-off-by: chyezh <chyezh@outlook.com>
2025-04-20 22:38:36 +08:00
Zhen Ye
224728c2d2
fix: catchup cannot work if using StartAfter (#41201)
issue: #41062

Signed-off-by: chyezh <chyezh@outlook.com>
2025-04-10 19:04:27 +08:00
Zhen Ye
f6fb4bc442
fix: backoff will retry infinitely after reaching max elapse (#40589)
issue: #40588

Signed-off-by: chyezh <chyezh@outlook.com>
2025-03-13 16:24:06 +08:00
Zhen Ye
f47ab31f23
enhance: remove redundant resource key watch operation, just keep consistency of wal (#40235)
issue: #38399
related PR: #39522

- Just implement exclusive broadcaster between broadcast message with
same resource key to keep same order in different wal.
- After simplify the broadcast model, original watch-based broadcast is
too complicated and redundant, remove it.

---------

Signed-off-by: chyezh <chyezh@outlook.com>
2025-03-03 14:40:05 +08:00
Zhen Ye
84df80b5e4
enhance: refactor metrics of streaming (#40031)
issue: #38399

- add metrics for broadcaster component.
- add metrics for wal flusher component.
- add metrics for wal interceptors.
- add slow log for wal.
- add more label for some wal metrics. (local or remote/catcup or
tailing...)

Signed-off-by: chyezh <chyezh@outlook.com>
2025-02-25 12:25:56 +08:00
congqixia
cb7f2fa6fd
enhance: Use v2 package name for pkg module (#39990)
Related to #39095

https://go.dev/doc/modules/version-numbers

Update pkg version according to golang dep version convention

---------

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2025-02-22 23:15:58 +08:00
Zhen Ye
21724ab52c
enhance: generate guaranteets at delegator if local wal (#39799)
issue: #38399, #39892

- use mvcc timestamp of wal as guaranteets if wal and delegator is
located at same node.
- fix: ignore growing option is lost at hibridsearch

---------

Signed-off-by: chyezh <chyezh@outlook.com>
2025-02-17 15:22:15 +08:00
SimFG
047254665d
feat: support to replicate import msg (#39171)
- issue: #39849

---------

Signed-off-by: SimFG <bang.fu@zilliz.com>
Signed-off-by: chyezh <chyezh@outlook.com>
Co-authored-by: chyezh <chyezh@outlook.com>
2025-02-16 00:08:13 +08:00
Zhen Ye
a9e0e0a852
enhance: broadcast with event-based notification (#39522)
issue: #38399

- broadcast message can carry multi resource key now.
- implement event-based notification for broadcast messages
- broadcast message use broadcast id as a unique identifier in message
- broadcasted message on vchannels keep the broadcasted vchannel now.
- broadcasted message and broadcast message have a common broadcast
header now.

---------

Signed-off-by: chyezh <chyezh@outlook.com>
2025-02-07 11:14:43 +08:00
Zhen Ye
5669016af0
enhance: erase the rpc level when wal is located at same node (#38858)
issue: #38399

- Make the wal scanner interface same with streaming scanner.
- Use wal if the wal is located at current node.
- Otherwise fallback the old logic.

Signed-off-by: chyezh <chyezh@outlook.com>
2025-02-05 22:25:10 +08:00
Zhen Ye
92bde5b4f6
fix: panic when streaming release if using msgstream (#39374)
issue: #39367

Signed-off-by: chyezh <chyezh@outlook.com>
2025-01-17 11:47:02 +08:00
Zhen Ye
fd84ed817c
enhance: add broadcast operation for msgstream (#39040)
issue: #38399

- make broadcast service available for msgstream by reusing the
architecture streaming service

---------

Signed-off-by: chyezh <chyezh@outlook.com>
2025-01-14 15:14:59 +08:00
Zhen Ye
3bcdd92915
enhance: add broadcast for streaming service (#39020)
issue: #38399 

- Add new rpc for transfer broadcast to streaming coord
- Add broadcast service at streaming coord to make broadcast message
sent automicly

Signed-off-by: chyezh <chyezh@outlook.com>
2025-01-09 16:24:55 +08:00
Zhen Ye
69a9fd6ead
enhance: enable rmq for streaming (#38669)
issue: #38399

---------

Signed-off-by: chyezh <chyezh@outlook.com>
2024-12-24 20:24:48 +08:00
Zhen Ye
afac153c26
enhance: move the lifetime implementation out of server level lifetime (#38442)
issue: #38399

- move the lifetime implementation of common code out of the server
level lifetime implementation

Signed-off-by: chyezh <chyezh@outlook.com>
2024-12-17 11:42:44 +08:00
Zhen Ye
1b6edd0b4b
enhance: refactor the consumer grpc proto for reusing grpc stream for multi-consumer (#37564)
issue: #33285

- Modify the proto of consumer of streaming service.
- Make VChannel as a required option for streaming

---------

Signed-off-by: chyezh <chyezh@outlook.com>
2024-11-11 17:24:29 +08:00
Zhen Ye
f0f5147aef
fix: streaming consumer may get stucked when handler is un-consumed (#36818)
issue: #36378

Signed-off-by: chyezh <chyezh@outlook.com>
2024-10-14 15:23:23 +08:00
Zhen Ye
2ec6e602d6
enhance: add streaming client metrics (#36523)
issue: #33285

---------

Signed-off-by: chyezh <chyezh@outlook.com>
2024-10-08 21:25:19 +08:00
Zhen Ye
a6545b2e29
fix: refactor milvus config and change default txn timeout (#36522)
issue: #36498

Signed-off-by: chyezh <chyezh@outlook.com>
2024-09-29 11:01:15 +08:00