milvus

mirror of https://gitee.com/milvus-io/milvus.git synced 2025-12-28 22:45:26 +08:00

Author	SHA1	Message	Date
congqixia	6f94d8c41a	fix: Handle legacy binlog format (v1) in segment load diff computation (#46598 ) When computing load diff, binlogs in v1/legacy format have empty child_fields. In this case, the field_id itself should be used as the child_id (group_id == field_id for legacy format). Without this fix, legacy format binlogs are not recognized during diff computation, causing segments to fail loading and TestProxy to time out. Changes: - Add fallback to use field_id as child_id when child_fields is empty - Add LoadDiff::ToString() for debugging - Add logging for diff in Load/Reopen operations - Add comprehensive unit tests for legacy format handling Related to #46594 Release notes (auto-generated by coderabbit.ai): - Core invariant: load-diff computation must enumerate every binlog child group for a field so current vs new segment state comparisons include all column-group/binlog groups; for legacy (v1) binlogs that have empty child_fields, the code must treat group_id == field_id to preserve that mapping. - Bug fix (resolves #46594): SegmentLoadInfo now normalizes field_binlog.child_fields() into a vector and falls back to using field_id as the single child group when child_fields is empty; the same normalization is applied for both current and new-info paths, ensuring legacy v1 binlogs are discovered and included in Load/ComputeDiff results so segments load correctly. - Logic simplified: removed the implicit assumption that child_fields is always present by centralizing a single normalization/fallback step used symmetrically for both diff paths, avoiding ad-hoc special-casing and unifying iteration over child groups. - No data loss / no behavior regression: the fallback only activates when child_fields is empty; non-legacy binlogs continue to use their child_fields unchanged. Add/drop semantics are preserved because the same normalization is applied to both sides of the diff. Unit tests (v1-only, v4-only, mixed cases) were added to validate correctness; LoadDiff::ToString() and extra logging are diagnostic only. Co-authored-by: Cai Zhang <cai.zhang@zilliz.com> --------- Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>	2025-12-25 23:33:19 +08:00
yihao.dai	5b97cb70a0	enhance: Support delaying scanner startup (#46369 ) Introduce a ScannerStartupDelay configuration to enable WAL write-only recovery, allowing fence messages to be persisted during primary–secondary switchover when the StreamingNode is trapped in crash loops. issue: https://github.com/milvus-io/milvus/issues/46368 ## Summary by CodeRabbit * New Features * Added a configurable WAL scanner pause/resume and a consumer request flag to optionally ignore pause signals. * Metrics * Added a scanner pause gauge and pause-duration tracking for WAL scanning. * Tests * Added coverage for pause-consumption behavior and cleanup in stream client tests. * Chores * Consolidated flush-all logging into a single field and added a helper for bulk message conversion. --------- Signed-off-by: bigsheeper <yihao.dai@zilliz.com>	2025-12-24 11:53:19 +08:00
cai.zhang	21b0e5ca9d	enhance: Don't seal segments when only alter collection properties (#46488 ) ### PR Type Enhancement ### Description - Only flush and fence segments for schema-changing alter collection messages - Skip segment sealing for collection property-only alterations - Add conditional check using messageutil.IsSchemaChange utility function ### Diagram Walkthrough ```mermaid flowchart LR A["Alter Collection Message"] --> B{"Is Schema Change?"} B -->|Yes| C["Flush and Fence Segments"] B -->|No| D["Skip Segment Operations"] C --> E["Set Flushed Segment IDs"] D --> E E --> F["Append Operation"] ``` ### File Walkthrough Enhancement: internal/streamingnode/server/wal/interceptors/shard/shard_interceptor.go (+9/-3, https://github.com/milvus-io/milvus/pull/46488/files#diff-c1acf785e5b530e59137b21584cf567ccd9aeeb613fb3684294b439289e80beb): Conditional segment sealing based on schema changes - Added import for messageutil package to access schema change detection utility - Modified handleAlterCollection to conditionally flush and fence segments only for schema-changing messages - Wrapped segment flushing logic in if messageutil.IsSchemaChange(header) check - Skips unnecessary segment sealing when only collection properties are altered ## Summary by CodeRabbit * Bug Fixes * Optimized collection schema alteration to conditionally perform segment allocation operations only when schema changes are detected, reducing unnecessary overhead in unmodified collection scenarios. Signed-off-by: Cai Zhang <cai.zhang@zilliz.com>	2025-12-22 20:55:19 +08:00
tinswzy	9345caa135	fix: call truncate when checkpoint is persisted (#46382 ) issue: #44434 Signed-off-by: tinswzy <zhenyuan.wei@zilliz.com>	2025-12-21 19:01:17 +08:00
sijie-ni-0214	f51de1a8ab	feat: support TruncateCollection api to clear collection data (#46167 ) issue: https://github.com/milvus-io/milvus/issues/46166 --------- Signed-off-by: sijie-ni-0214 <sijie.ni@zilliz.com>	2025-12-12 10:31:14 +08:00
Zhen Ye	15f8dfc7ad	enhance: introduce a tolerance duration to delay the drop operation (#46251 ) issue: #46214 Signed-off-by: chyezh <chyezh@outlook.com>	2025-12-10 19:57:13 +08:00
yihao.dai	f32f2694bc	enhance: Implement new FlushAllMessage and refactor flush all (#45920 ) This PR: 1. Define and implement the new FlushAllMessage. 2. Refactor FlushAll to flush the entire cluster. issue: https://github.com/milvus-io/milvus/issues/45919 --------- Signed-off-by: bigsheeper <yihao.dai@zilliz.com>	2025-12-10 19:27:13 +08:00
Zhen Ye	10a781d22c	fix: write ahead buffer unittest failure (#45978 ) issue: #45977 Signed-off-by: chyezh <chyezh@outlook.com>	2025-12-02 10:25:10 +08:00
Zhen Ye	8e0ae6433d	fix: LastConfirmedMessageID may be wrong if high concurrent writing (#45873 ) issue: #45872 Signed-off-by: chyezh <chyezh@outlook.com>	2025-11-27 12:01:07 +08:00
tinswzy	1427825133	enhance: improve WAL retention strategy (#45350 ) issue: #44369 woodpecker related issue: [#59](https://github.com/zilliztech/woodpecker/issues/59) Refactor the WAL retention logic in Milvus StreamingNode: - Remove the simple sampling-based truncation mechanism. - After flush, WAL data is directly truncated. - The retention control is now delegated to the underlying message queue (MQ) implementation. Signed-off-by: tinswzy <zhenyuan.wei@zilliz.com>	2025-11-23 21:41:05 +08:00
Zhen Ye	576084fe86	enhance: support alter collection/database with WAL-based DDL framework (#45266 ) issue: #43897 - Alter collection/database is implemented by WAL-based DDL framework now. - Support AlterCollection/AlterDatabase in wal now. - Alter operation can be synced by new CDC now. - Refactor some UT for alter DDL. --------- Signed-off-by: chyezh <chyezh@outlook.com>	2025-11-04 09:59:33 +08:00
Zhen Ye	25e0485a56	fix: unrecoverable when replicate from old (#45224 ) issue: #44962 Signed-off-by: chyezh <chyezh@outlook.com>	2025-11-03 15:07:36 +08:00
Zhen Ye	309d564796	enhance: support collection and index with WAL-based DDL framework (#45033 ) issue: #43897 - Part of collection/index related DDL is implemented by WAL-based DDL framework now. - Support following message type in wal, CreateCollection, DropCollection, CreatePartition, DropPartition, CreateIndex, AlterIndex, DropIndex. - Part of collection/index related DDL can be synced by new CDC now. - Refactor some UT for collection/index DDL. - Add Tombstone scheduler to manage the tombstone GC for collection or partition meta. - Move the vchannel allocation into streaming pchannel manager. --------- Signed-off-by: chyezh <chyezh@outlook.com>	2025-10-30 14:24:08 +08:00
Zhen Ye	ce164db1f3	fix: wal state may be unconsistent after recovering from crash (#45092 ) issue: #45088, #45086 - Message on control channel should trigger the checkpoint update. - LastConfirmedMessageID should be recovered from the minimum of checkpoint or the LastConfirmedMessageID of uncommitted txn. - Add more log info for wal debugging. --------- Signed-off-by: chyezh <chyezh@outlook.com>	2025-10-29 16:26:10 +08:00
Zhen Ye	9d29e6ee64	fix: append operation can be only canceled by the wal itself but not the rpc (#45078 ) issue: #45077 The state of the wal must stay consistent with the in-memory state of the streamingnode, so the append caller is not allowed to cancel an append operation, which would otherwise leave an alive wal in an inconsistent state. The wal append operation can only be cancelled when the wal is shutting down. Signed-off-by: chyezh <chyezh@outlook.com>	2025-10-27 11:08:05 +08:00
Zhen Ye	2aa48bf4ca	fix: wrong execution order of DDL/DCL on secondary (#44886 ) issue: #44697, #44696 - The DDL execution order on the secondary now matches the order of the control channel timetick. - Filter the control channel operations in the shard manager of the streamingnode to avoid a wrong vchannel when creating a segment. - Fix the immutable txn message losing its replicate header. --------- Signed-off-by: chyezh <chyezh@outlook.com>	2025-10-21 22:38:05 +08:00
Zhen Ye	19e5e9f910	enhance: broadcaster will lock resource until message acked (#44508 ) issue: #43897 - Return LastConfirmedMessageID from the wal append operation. - Add resource-key-based locker for broadcast-ack operation to protect the coord state when executing ddl. - Resource-key-based locker is held until the broadcast operation is acked. - ResourceKey supports shared and exclusive locks. - Add FastAck, which executes the ack right away after the broadcast is done, to speed up ddl. - Ack callback now supports the broadcast message result. - Add tombstone for broadcaster to avoid repeatedly committing DDL and the ABA issue. --------- Signed-off-by: chyezh <chyezh@outlook.com>	2025-09-24 20:58:05 +08:00
Zhen Ye	c171280f63	enhance: support replicate message in wal. (#44456 ) issue: #44123 - support replicate message in wal of milvus. - support CDC-replicate recovery from wal. - fix some CDC replicator bugs Signed-off-by: chyezh <chyezh@outlook.com>	2025-09-22 17:06:11 +08:00
yihao.dai	51f69f32d0	feat: Add CDC support (#44124 ) This PR implements a new CDC service for Milvus 2.6, providing log-based cross-cluster replication. issue: https://github.com/milvus-io/milvus/issues/44123 --------- Signed-off-by: bigsheeper <yihao.dai@zilliz.com> Signed-off-by: chyezh <chyezh@outlook.com> Co-authored-by: chyezh <chyezh@outlook.com>	2025-09-16 16:32:01 +08:00
Zhen Ye	cbe4c3d231	enhance: get cchannel before build message (#44229 ) issue: #43897 - support never expire txn message. Signed-off-by: chyezh <chyezh@outlook.com>	2025-09-10 11:09:57 +08:00
Zhen Ye	9e2d1963d4	enhance: support cchannel for streaming service (#44143 ) issue: #43897 - add cchannel as a special vchannel to hold some ddl and dcl. Signed-off-by: chyezh <chyezh@outlook.com>	2025-09-02 10:05:52 +08:00
Zhen Ye	3327df72e4	enhance: make immutable message as the param of ack operation for cdc (#43900 ) issue: #43897 - The original broadcast ack operation needs to recover the message from etcd, which cannot support cdc. - The immutable message is now passed as the ack parameter to fix it. Signed-off-by: chyezh <chyezh@outlook.com>	2025-09-01 10:21:52 +08:00
Zhen Ye	f5cee0012a	fix: remove panic for message type in recovery storage and marshal log (#43976 ) issue: #43897 Signed-off-by: chyezh <chyezh@outlook.com>	2025-08-21 14:23:47 +08:00
Zhen Ye	a86b6f2a54	enhance: extend the stats manage at streaming shard manager for L0 (#43371 ) issue: #42416 - Rename the InsertMetric into ModifiedMetric. - Add L0 control configuration. - Add collection of some current L0 state. Signed-off-by: chyezh <chyezh@outlook.com>	2025-08-18 20:41:46 +08:00
Zhen Ye	7b005c48bf	enhance: support util template generation for messages (#43881 ) issue: #43880 Signed-off-by: chyezh <chyezh@outlook.com>	2025-08-18 01:19:44 +08:00
Zhen Ye	8ff118a9ff	fix: call IntoMessageProto instead of Payload when rpc (#43678 ) issue: #43677 Signed-off-by: chyezh <chyezh@outlook.com>	2025-08-06 14:45:40 +08:00
Zhen Ye	3e3775fb81	fix: panics when describe collection internal failure (#43630 ) issue: #43629 - also fix the scanner_switchable panic when the underlying wal scanner returns a context error. Signed-off-by: chyezh <chyezh@outlook.com>	2025-07-29 20:33:36 +08:00
Zhen Ye	070aabd27e	enhance: fix remove flushing state of segment (#43560 ) issue: #43559, #42884 - also fix the data loss when streaming resumes from old arch messages. Signed-off-by: chyezh <chyezh@outlook.com>	2025-07-25 18:08:54 +08:00
Zhen Ye	e9ab73e93d	enhance: add schema version at recovery storage (#43500 ) issue: #43072, #43289 - manage the schema version at recovery storage. - update the schema when creating collection or alter schema. - get schema at write buffer based on version. - recover the schema when upgrading from 2.5. --------- Signed-off-by: chyezh <chyezh@outlook.com>	2025-07-23 21:38:54 +08:00
Zhen Ye	b142589942	enhance: support all partitions in shard manager for L0 segment (#43385 ) issue: #42416 - change the key from partitionID into PartitionUniqueKey to support AllPartitionsID Signed-off-by: chyezh <chyezh@outlook.com>	2025-07-18 11:40:51 +08:00
Zhen Ye	ffc8c0730c	fix: wrong metric for sn timetick (#43312 ) issue: #43266 Signed-off-by: chyezh <chyezh@outlook.com>	2025-07-14 20:40:50 +08:00
Zhen Ye	15a6631147	enhance: add quota limit based on sn consuming lag (#43105 ) issue: #42995 - The consuming lag at streaming node will be reported to coordinator. - The consuming lag will trigger the write limit and deny by quota center. - Set the ttProtection by default. --------- Signed-off-by: chyezh <chyezh@outlook.com>	2025-07-11 14:10:49 +08:00
Zhen Ye	f598ca2b4e	fix: block at msgpack adaptor and wrong metrics (#43235 ) issue: #43018 Signed-off-by: chyezh <chyezh@outlook.com>	2025-07-11 10:14:49 +08:00
Zhen Ye	490c5d5088	fix: lost message version after compatible message modification (#43217 ) issue: #43018 Signed-off-by: chyezh <chyezh@outlook.com>	2025-07-10 10:36:48 +08:00
Zhen Ye	46b6f1b9e2	fix: panic when logging a old message should be skipped (#43076 ) issue: #43074 - fix: panic when logging a old message should be skipped, #43074 - fix: make the ack of broadcaster idempotent, #43026 - fix: lost dropping collection when upgrading, #43092 - fix: panic when DropPartition happen after DropCollection, #43027, #43078 --------- Signed-off-by: chyezh <chyezh@outlook.com>	2025-07-04 16:04:44 +08:00
Zhen Ye	e97e44d56e	enhance: limit the gc concurrency when cpu is high (#43059 ) issue: #42833 Signed-off-by: chyezh <chyezh@outlook.com>	2025-07-04 09:22:43 +08:00
cai.zhang	f6b2a71c95	enhance: Remove chunkmanager-related dependencies from datanode (#43021 ) issue: #41611 --------- Signed-off-by: Cai Zhang <cai.zhang@zilliz.com>	2025-07-03 14:44:45 +08:00
Zhen Ye	8367e4ec6a	fix: set 72h for wal retention (#42910 ) issue: #42706 Signed-off-by: chyezh <chyezh@outlook.com>	2025-06-27 17:36:43 +08:00
Zhen Ye	1f66b650e9	fix: pulsar cannot work properly if backlog exceed (#42653 ) issue: #42649 - the sync operations of different pchannels are concurrent now. - add an option to notify the backlog clear automatically. - make pulsar walimpls recoverable from an exceeded backlog. Signed-off-by: chyezh <chyezh@outlook.com>	2025-06-13 14:28:37 +08:00
Zhen Ye	fc010e44a8	fix: release memory after pop from heap (#42482 ) issue: #42481 Signed-off-by: chyezh <chyezh@outlook.com>	2025-06-04 10:00:32 +08:00
Zhen Ye	e479467582	fix: panic when upgrading from old arch (#42422 ) issue: #42405 - add delete rows into header when upsert. Signed-off-by: chyezh <chyezh@outlook.com>	2025-05-31 22:56:29 +08:00
Zhen Ye	66cc194ab2	enhance: add partition gc at streaming arch (#42179 ) issue: #41976 - make drop partition message as a broadcast message. - add gc when drop partition message is acked. - add a call back to handle the broadcast message when ack. - the ack operation of broadcast message will retry until success. Signed-off-by: chyezh <chyezh@outlook.com>	2025-05-29 23:20:30 +08:00
Zhen Ye	4bad293655	enhance: make upgrading from 2.5.x less down time (#42082 ) issue: #40532 - start timeticksync at rootcoord if the streaming service is not available - stop timeticksync if the streaming service is available - open a read-only wal if some nodes in the cluster have not been upgraded to 2.6 - allow opening a read-write wal after all nodes in the cluster have been upgraded to 2.6 --------- Signed-off-by: chyezh <chyezh@outlook.com>	2025-05-29 23:02:29 +08:00
Zhen Ye	b94cee2413	fix: growing segment from old arch is not flushed after upgrading (#42164 ) issue: #42162 - enhance: add read ahead buffer size issue #42129 - fix: rocksmq consumer's close operation may get stuck - fix: growing segment from old arch is not flushed after upgrading --------- Signed-off-by: chyezh <chyezh@outlook.com>	2025-05-29 23:00:28 +08:00
Zhen Ye	38c804fb01	fix: more stable recovery graceful closing and stable unittest (#42013 ) issue: #41544 Signed-off-by: chyezh <chyezh@outlook.com>	2025-05-23 17:52:26 +08:00
congqixia	244aa30076	fix: Lock before reading flusher cp sampling truncate cp (#42019 ) Related to #42018 --------- Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>	2025-05-22 21:38:28 +08:00
Zhen Ye	c9b0748ff9	enhance: add delete rows into delete msg header and more metric (#41952 ) issue: #41544 - add delete rows into delete message header - add more insert/delete metrics - fix non-broadcast message has broadcast header Signed-off-by: chyezh <chyezh@outlook.com>	2025-05-22 20:28:26 +08:00
Zhen Ye	59ab274dbe	fix: use flusher and recovery checkpoint together to determine the truncate position (#41934 ) issue: #41544 - unify the log field of message - use the minimum one of flusher and recovery storage checkpoint as the truncate position Signed-off-by: chyezh <chyezh@outlook.com>	2025-05-20 16:10:24 +08:00
yihao.dai	65dd3982d8	fix: Fix ants.Pool goroutine leak (#41892 ) 1. Release the pool after it is no longer in use. 2. Upgrade ants.Pool to fix the goroutine leak issue (see [PR #287](https://github.com/panjf2000/ants/pull/287)). issue: https://github.com/milvus-io/milvus/issues/41838 --------- Signed-off-by: bigsheeper <yihao.dai@zilliz.com>	2025-05-19 17:56:22 +08:00
Zhen Ye	59dff668dc	enhance: schema change without manual flush (#41882 ) issue: #39718 - remove the manual flush message from schema change operation - add flush segment id handle into schema change processes Signed-off-by: chyezh <chyezh@outlook.com> Co-authored-by: congqixia <congqi.xia@zilliz.com>	2025-05-19 10:14:22 +08:00
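
Illustration for commit 6f94d8c41a above: the message describes normalizing each field binlog's child_fields and falling back to field_id as the single child group when child_fields is empty (legacy v1 format), with the same normalization applied to both sides of the diff. The following is only a minimal Python sketch of that idea, not the Milvus code (the real change is in SegmentLoadInfo). The names `FieldBinlog`, `child_groups`, and `compute_diff` are hypothetical, and the diff is simplified to plain sets of group ids.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class FieldBinlog:
    # Hypothetical stand-in for a field binlog entry; not the Milvus type.
    field_id: int
    child_fields: List[int] = field(default_factory=list)


def child_groups(binlog: FieldBinlog) -> List[int]:
    """Return child group ids; legacy (v1) binlogs with no child_fields use field_id."""
    if binlog.child_fields:
        return list(binlog.child_fields)
    return [binlog.field_id]


def compute_diff(
    current: List[FieldBinlog], new: List[FieldBinlog]
) -> Tuple[List[int], List[int]]:
    """Return (to_load, to_drop) group ids, normalizing both sides the same way."""
    cur = {g for b in current for g in child_groups(b)}
    nxt = {g for b in new for g in child_groups(b)}
    return sorted(nxt - cur), sorted(cur - nxt)
```

Because the fallback is applied symmetrically, a legacy binlog that appears in both the current and the new state contributes the same group id on each side and so drops out of the diff, while a legacy binlog that appears only in the new state is reported as something to load.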

119 Commits