When computing load diff, binlogs in v1/legacy format have empty
child_fields. In this case, the field_id itself should be used as the
child_id (group_id == field_id for legacy format).
Without this fix, legacy format binlogs are not recognized during diff
computation, causing segments to fail loading and TestProxy to timeout.
Changes:
- Add fallback to use fieldid as child_id when child_fields is empty
- Add LoadDiff::ToString() for debugging
- Add logging for diff in Load/Reopen operations
- Add comprehensive unit tests for legacy format handling
Related to #46594
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
- Core invariant: load-diff computation must enumerate every binlog
child group for a field so current vs new segment state comparisons
include all column-group/binlog groups; for legacy (v1) binlogs that
have empty child_fields, the code must treat group_id == field_id to
preserve that mapping.
- Bug fix (resolves#46594): SegmentLoadInfo now normalizes
field_binlog.child_fields() into a vector and falls back to using
field_id as the single child group when child_fields is empty; the same
normalization is applied for both current and new-info paths, ensuring
legacy v1 binlogs are discovered and included in Load/ComputeDiff
results so segments load correctly.
- Logic simplified: removed the implicit assumption that child_fields is
always present by centralizing a single normalization/fallback step used
symmetrically for both diff paths, avoiding ad-hoc special-casing and
unifying iteration over child groups.
- No data loss / no behavior regression: the fallback only activates
when child_fields is empty — non-legacy binlogs continue to use their
child_fields unchanged. Add/drop semantics are preserved because the
same normalization is applied to both sides of the diff. Unit tests
(v1-only, v4-only, mixed cases) were added to validate correctness;
LoadDiff::ToString() and extra logging are diagnostic only.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
Co-authored-by: Cai Zhang <cai.zhang@zilliz.com>
---------
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
Introduce a ScannerStartupDelay configuration to enable WAL write-only
recovery, allowing fence messages to be persisted during
primary–secondary switchover when the StreamingNode is trapped in crash
loops.
issue: https://github.com/milvus-io/milvus/issues/46368
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
* **New Features**
* Added a configurable WAL scanner pause/resume and a consumer request
flag to optionally ignore pause signals.
* **Metrics**
* Added a scanner pause gauge and pause-duration tracking for WAL
scanning.
* **Tests**
* Added coverage for pause-consumption behavior and cleanup in stream
client tests.
* **Chores**
* Consolidated flush-all logging into a single field and added a helper
for bulk message conversion.
<sub>✏️ Tip: You can customize this high-level summary in your review
settings.</sub>
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
---------
Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
### **PR Type**
Enhancement
___
### **Description**
- Only flush and fence segments for schema-changing alter collection
messages
- Skip segment sealing for collection property-only alterations
- Add conditional check using messageutil.IsSchemaChange utility
function
___
### Diagram Walkthrough
```mermaid
flowchart LR
A["Alter Collection Message"] --> B{"Is Schema Change?"}
B -->|Yes| C["Flush and Fence Segments"]
B -->|No| D["Skip Segment Operations"]
C --> E["Set Flushed Segment IDs"]
D --> E
E --> F["Append Operation"]
```
<details><summary><h3>File Walkthrough</h3></summary>
<table><thead><tr><th></th><th align="left">Relevant
files</th></tr></thead><tbody><tr><td><strong>Enhancement</strong></td><td><table>
<tr>
<td>
<details>
<summary><strong>shard_interceptor.go</strong><dd><code>Conditional
segment sealing based on schema changes</code>
</dd></summary>
<hr>
internal/streamingnode/server/wal/interceptors/shard/shard_interceptor.go
<ul><li>Added import for <code>messageutil</code> package to access
schema change detection <br>utility<br> <li> Modified
<code>handleAlterCollection</code> to conditionally flush and fence
<br>segments only for schema-changing messages<br> <li> Wrapped segment
flushing logic in <code>if
</code><br><code>messageutil.IsSchemaChange(header)</code> check<br>
<li> Skips unnecessary segment sealing when only collection properties
are <br>altered</ul>
</details>
</td>
<td><a
href="https://github.com/milvus-io/milvus/pull/46488/files#diff-c1acf785e5b530e59137b21584cf567ccd9aeeb613fb3684294b439289e80beb">+9/-3</a>
</td>
</tr>
</table></td></tr></tbody></table>
</details>
___
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
* **Bug Fixes**
* Optimized collection schema alteration to conditionally perform
segment allocation operations only when schema changes are detected,
reducing unnecessary overhead in unmodified collection scenarios.
<sub>✏️ Tip: You can customize this high-level summary in your review
settings.</sub>
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
Signed-off-by: Cai Zhang <cai.zhang@zilliz.com>
This PR:
1. Define and implement the new FlushAllMessage.
2. Refactor FlushAll to flush the entire cluster.
issue: https://github.com/milvus-io/milvus/issues/45919
---------
Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
issue: #44369
woodpecker related[ issue:
#59](https://github.com/zilliztech/woodpecker/issues/59)
Refactor the WAL retention logic in Milvus StreamingNode:
- Remove the simple sampling-based truncation mechanism.
- After flush, WAL data is directly truncated.
- The retention control is now delegated to the underlying message queue
(MQ) implementation.
Signed-off-by: tinswzy <zhenyuan.wei@zilliz.com>
issue: #43897
- Alter collection/database is implemented by WAL-based DDL framework
now.
- Support AlterCollection/AlterDatabase in wal now.
- Alter operation can be synced by new CDC now.
- Refactor some UT for alter DDL.
---------
Signed-off-by: chyezh <chyezh@outlook.com>
issue: #43897
- Part of collection/index related DDL is implemented by WAL-based DDL
framework now.
- Support following message type in wal, CreateCollection,
DropCollection, CreatePartition, DropPartition, CreateIndex, AlterIndex,
DropIndex.
- Part of collection/index related DDL can be synced by new CDC now.
- Refactor some UT for collection/index DDL.
- Add Tombstone scheduler to manage the tombstone GC for collection or
partition meta.
- Move the vchannel allocation into streaming pchannel manager.
---------
Signed-off-by: chyezh <chyezh@outlook.com>
issue: #45088, #45086
- Message on control channel should trigger the checkpoint update.
- LastConfrimedMessageID should be recovered from the minimum of
checkpoint or the LastConfirmedMessageID of uncommitted txn.
- Add more log info for wal debugging.
---------
Signed-off-by: chyezh <chyezh@outlook.com>
issue: #45077
We need to promise the state of wal consistent with the memory state of
streamingnode. So we don't allow the append operation can be cancelled
by the append caller to avoid leave a inconsistent state of alive wal.
The wal append operation can only be cancelled when the wal is shutting
down.
Signed-off-by: chyezh <chyezh@outlook.com>
issue: #44697, #44696
- The DDL executing order of secondary keep same with order of control
channel timetick now.
- filtering the control channel operation on shard manager of
streamingnode to avoid wrong vchannel of create segment.
- fix that the immutable txn message lost replicate header.
---------
Signed-off-by: chyezh <chyezh@outlook.com>
issue: #43897
- Return LastConfirmedMessageID when wal append operation.
- Add resource-key-based locker for broadcast-ack operation to protect
the coord state when executing ddl.
- Resource-key-based locker is held until the broadcast operation is
acked.
- ResourceKey support shared and exclusive lock.
- Add FastAck execute ack right away after the broadcast done to speed
up ddl.
- Ack callback will support broadcast message result now.
- Add tombstone for broadcaster to avoid to repeatedly commit DDL and
ABA issue.
---------
Signed-off-by: chyezh <chyezh@outlook.com>
issue: #44123
- support replicate message in wal of milvus.
- support CDC-replicate recovery from wal.
- fix some CDC replicator bugs
Signed-off-by: chyezh <chyezh@outlook.com>
issue: #43897
- The original broadcast ack operation need to recover message from
etcd, which can not support cdc.
- immutable message will set as the ack parameter to fix it.
Signed-off-by: chyezh <chyezh@outlook.com>
issue: #42416
- Rename the InsertMetric into ModifiedMetric.
- Add L0 control configuration.
- Add some L0 current state collect.
Signed-off-by: chyezh <chyezh@outlook.com>
issue: #43072, #43289
- manage the schema version at recovery storage.
- update the schema when creating collection or alter schema.
- get schema at write buffer based on version.
- recover the schema when upgrading from 2.5.
---------
Signed-off-by: chyezh <chyezh@outlook.com>
issue: #42995
- The consuming lag at streaming node will be reported to coordinator.
- The consuming lag will trigger the write limit and deny by quota
center.
- Set the ttProtection by default.
---------
Signed-off-by: chyezh <chyezh@outlook.com>
issue: #43074
- fix: panic when logging a old message should be skipped, #43074
- fix: make the ack of broadcaster idompotent, #43026
- fix: lost dropping collection when upgrading, #43092
- fix: panic when DropPartition happen after DropCollection, #43027,
#43078
---------
Signed-off-by: chyezh <chyezh@outlook.com>
issue: #42649
- the sync operation of different pchannel is concurrent now.
- add a option to notify the backlog clear automatically.
- make pulsar walimpls can be recovered from backlog exceed.
Signed-off-by: chyezh <chyezh@outlook.com>
issue: #41976
- make drop partition message as a broadcast message.
- add gc when drop partition message is acked.
- add a call back to handle the broadcast message when ack.
- the ack operation of broadcast message will retry until success.
Signed-off-by: chyezh <chyezh@outlook.com>
issue: #40532
- start timeticksync at rootcoord if the streaming service is not
available
- stop timeticksync if the streaming service is available
- open a read-only wal if some nodes in cluster is not upgrading to 2.6
- allow to open read-write wal after all nodes in cluster is upgrading
to 2.6
---------
Signed-off-by: chyezh <chyezh@outlook.com>
issue: #42162
- enhance: add read ahead buffer size issue #42129
- fix: rocksmq consumer's close operation may get stucked
- fix: growing segment from old arch is not flushed after upgrading
---------
Signed-off-by: chyezh <chyezh@outlook.com>
issue: #41544
- unify the log field of message
- use the minimum one of flusher and recovery storage checkpoint as the
truncate position
Signed-off-by: chyezh <chyezh@outlook.com>