Introduce a ScannerStartupDelay configuration to enable WAL write-only
recovery, allowing fence messages to be persisted during
primary–secondary switchover when the StreamingNode is trapped in crash
loops.
issue: https://github.com/milvus-io/milvus/issues/46368
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
* **New Features**
* Added a configurable WAL scanner pause/resume and a consumer request
flag to optionally ignore pause signals.
* **Metrics**
* Added a scanner pause gauge and pause-duration tracking for WAL
scanning.
* **Tests**
* Added coverage for pause-consumption behavior and cleanup in stream
client tests.
* **Chores**
* Consolidated flush-all logging into a single field and added a helper
for bulk message conversion.
<sub>✏️ Tip: You can customize this high-level summary in your review
settings.</sub>
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
---------
Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
issue: #45088, #45086
- Message on control channel should trigger the checkpoint update.
- LastConfrimedMessageID should be recovered from the minimum of
checkpoint or the LastConfirmedMessageID of uncommitted txn.
- Add more log info for wal debugging.
---------
Signed-off-by: chyezh <chyezh@outlook.com>
issue: #45077
We need to promise the state of wal consistent with the memory state of
streamingnode. So we don't allow the append operation can be cancelled
by the append caller to avoid leave a inconsistent state of alive wal.
The wal append operation can only be cancelled when the wal is shutting
down.
Signed-off-by: chyezh <chyezh@outlook.com>
issue: #43897
- Return LastConfirmedMessageID when wal append operation.
- Add resource-key-based locker for broadcast-ack operation to protect
the coord state when executing ddl.
- Resource-key-based locker is held until the broadcast operation is
acked.
- ResourceKey support shared and exclusive lock.
- Add FastAck execute ack right away after the broadcast done to speed
up ddl.
- Ack callback will support broadcast message result now.
- Add tombstone for broadcaster to avoid to repeatedly commit DDL and
ABA issue.
---------
Signed-off-by: chyezh <chyezh@outlook.com>
issue: #44123
- support replicate message in wal of milvus.
- support CDC-replicate recovery from wal.
- fix some CDC replicator bugs
Signed-off-by: chyezh <chyezh@outlook.com>
issue: #42995
- The consuming lag at streaming node will be reported to coordinator.
- The consuming lag will trigger the write limit and deny by quota
center.
- Set the ttProtection by default.
---------
Signed-off-by: chyezh <chyezh@outlook.com>
issue: #43074
- fix: panic when logging a old message should be skipped, #43074
- fix: make the ack of broadcaster idompotent, #43026
- fix: lost dropping collection when upgrading, #43092
- fix: panic when DropPartition happen after DropCollection, #43027,
#43078
---------
Signed-off-by: chyezh <chyezh@outlook.com>
issue: #40532
- start timeticksync at rootcoord if the streaming service is not
available
- stop timeticksync if the streaming service is available
- open a read-only wal if some nodes in cluster is not upgrading to 2.6
- allow to open read-write wal after all nodes in cluster is upgrading
to 2.6
---------
Signed-off-by: chyezh <chyezh@outlook.com>
issue: #42162
- enhance: add read ahead buffer size issue #42129
- fix: rocksmq consumer's close operation may get stucked
- fix: growing segment from old arch is not flushed after upgrading
---------
Signed-off-by: chyezh <chyezh@outlook.com>
issue: #41544
- add lock interceptor into wal.
- use recovery and shardmanager to replace the original implementation
of segment assignment.
- remove redundant implementation and unittest.
- remove redundant proto definition.
- use 2 streamingnode in e2e.
---------
Signed-off-by: chyezh <chyezh@outlook.com>
issue: #41544
- simplify the proto message for flush and create segment.
- simplify the msg handler for flowgraph.
---------
Signed-off-by: chyezh <chyezh@outlook.com>
issue: #41439
- add IsPersisted and VChannel interface for message
- add WithNotPersisted() for message builder
- fix the persisted time tick lost at write ahead buffer
Signed-off-by: chyezh <chyezh@outlook.com>
Related to #41302
Previously wait for 200 milliseconds could cause unsable behavior of
this unittest. This PR make unittest wait for certain function call
instead of wait for some time.
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
Merge RootCoord, DataCoord And QueryCoord into MixCoord
Make Session into one
issue : https://github.com/milvus-io/milvus/issues/37764
---------
Signed-off-by: Xianhui.Lin <xianhui.lin@zilliz.com>
issue: #38399
- add metrics for broadcaster component.
- add metrics for wal flusher component.
- add metrics for wal interceptors.
- add slow log for wal.
- add more label for some wal metrics. (local or remote/catcup or
tailing...)
Signed-off-by: chyezh <chyezh@outlook.com>
issue: #38399, #39892
- use mvcc timestamp of wal as guaranteets if wal and delegator is
located at same node.
- fix: ignore growing option is lost at hibridsearch
---------
Signed-off-by: chyezh <chyezh@outlook.com>
issue: #38399
- Make a timetick-commit-based write ahead buffer at write side.
- Add a switchable scanner at read side to transfer the state between
catchup and tailing read
Signed-off-by: chyezh <chyezh@outlook.com>
issue: #38399
- Add a pchannel level checkpoint for flush processing
- Refactor the recovery of flushers of wal
- make a shared wal scanner first, then make multi datasyncservice on it
Signed-off-by: chyezh <chyezh@outlook.com>
issue: #38399
- Make the wal scanner interface same with streaming scanner.
- Use wal if the wal is located at current node.
- Otherwise fallback the old logic.
Signed-off-by: chyezh <chyezh@outlook.com>
issue: #38399
- Add new rpc for transfer broadcast to streaming coord
- Add broadcast service at streaming coord to make broadcast message
sent automicly
Signed-off-by: chyezh <chyezh@outlook.com>
issue: #38399
- move the lifetime implementation of common code out of the server
level lifetime implementation
Signed-off-by: chyezh <chyezh@outlook.com>
issue: #33285
- Modify the proto of consumer of streaming service.
- Make VChannel as a required option for streaming
---------
Signed-off-by: chyezh <chyezh@outlook.com>
issue: #37172
- add redo interceptor to implement append context refresh. (make new
timetick)
- add create segment handler for flusher.
- make empty segment flushable and directly change it into dropped.
- add create segment message into wal when creating new growing segment.
- make the insert operation into following seq: createSegment -> insert
-> insert -> flushSegment.
- make manual flush into following seq: flushTs -> flushsegment ->
flushsegment -> manualflush.
---------
Signed-off-by: chyezh <chyezh@outlook.com>
issue: #36858
- Start channel manager on datacoord, but with empty assign policy in
streaming service.
- Make collection at dropping state can be recovered by flusher to make
sure that
milvus consume the dropCollection message.
- Add backoff for flusher lifetime.
- remove the proxy watcher from timetick at rootcoord in streaming
service.
Also see the better fixup: #37176
---------
Signed-off-by: chyezh <chyezh@outlook.com>
issue: #33285
- using streaming service in insert/upsert/flush/delete/querynode
- fixup flusher bugs and refactor the flush operation
- enable streaming service for dml and ddl
- pass the e2e when enabling streaming service
- pass the integration tst when enabling streaming service
---------
Signed-off-by: chyezh <chyezh@outlook.com>
Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
issue: #33285
- support transaction on single wal.
- last confirmed message id can still be used when enable transaction.
- add fence operation for segment allocation interceptor.
---------
Signed-off-by: chyezh <chyezh@outlook.com>