Add support for DataNode compaction using file resources in ref mode.
SortCompation and StatsJobs will build text indexes, which may use file
resources.
relate: https://github.com/milvus-io/milvus/issues/43687
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
- Core invariant: file resources (analyzer binaries/metadata) are only
fetched, downloaded and used when the node is configured in Ref mode
(fileresource.IsRefMode via CommonCfg.QNFileResourceMode /
DNFileResourceMode); Sync now carries a version and managers track
per-resource versions/resource IDs so newer resource sets win and older
entries are pruned (RefManager/SynchManager resource maps).
- Logic removed / simplified: component-specific FileResourceMode flags
and an indirection through a long-lived BinlogIO wrapper were
consolidated — file-resource mode moved to CommonCfg, Sync/Download APIs
became version- and context-aware, and compaction/index tasks accept a
ChunkManager directly (binlog IO wrapper creation inlined). This
eliminates duplicated config checks and wrapper indirection while
preserving the same chunk/IO semantics.
- Why no data loss or behavior regression: all file-resource code paths
are gated by the configured mode (default remains "sync"); when not in
ref-mode or when no resources exist, compaction and stats flows follow
existing code paths unchanged. Versioned Sync + resourceID maps ensure
newly synced sets replace older ones and RefManager prunes stale files;
GetFileResources returns an error if requested IDs are missing (prevents
silent use of wrong resources). Analyzer naming/parameter changes add
analyzer_extra_info but default-callers pass "" so existing analyzers
and index contents remain unchanged.
- New capability: DataNode compaction and StatsJobs can now build text
indexes using external file resources in Ref mode — DataCoord exposes
GetFileResources and populates CompactionPlan.file_resources;
SortCompaction/StatsTask download resources via fileresource.Manager,
produce an analyzer_extra_info JSON (storage + resource->id map) via
analyzer.BuildExtraResourceInfo, and propagate analyzer_extra_info into
BuildIndexInfo so the tantivy bindings can load custom analyzers during
text index creation.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
Signed-off-by: aoiasd <zhicheng.yue@zilliz.com>
issue: #44358
Implement complete snapshot management system including creation,
deletion, listing, description, and restoration capabilities across all
system components.
Key features:
- Create snapshots for entire collections
- Drop snapshots by name with proper cleanup
- List snapshots with collection filtering
- Describe snapshot details and metadata
Components added/modified:
- Client SDK with full snapshot API support and options
- DataCoord snapshot service with metadata management
- Proxy layer with task-based snapshot operations
- Protocol buffer definitions for snapshot RPCs
- Comprehensive unit tests with mockey framework
- Integration tests for end-to-end validation
Technical implementation:
- Snapshot metadata storage in etcd with proper indexing
- File-based snapshot data persistence in object storage
- Garbage collection integration for snapshot cleanup
- Error handling and validation across all operations
- Thread-safe operations with proper locking mechanisms
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
- Core invariant/assumption: snapshots are immutable point‑in‑time
captures identified by (collection, snapshot name/ID); etcd snapshot
metadata is authoritative for lifecycle (PENDING → COMMITTED → DELETING)
and per‑segment manifests live in object storage (Avro / StorageV2). GC
and restore logic must see snapshotRefIndex loaded
(snapshotMeta.IsRefIndexLoaded) before reclaiming or relying on
segment/index files.
- New capability added: full end‑to‑end snapshot subsystem — client SDK
APIs (Create/Drop/List/Describe/Restore + restore job queries),
DataCoord SnapshotWriter/Reader (Avro + StorageV2 manifests),
snapshotMeta in meta, SnapshotManager orchestration
(create/drop/describe/list/restore), copy‑segment restore
tasks/inspector/checker, proxy & RPC surface, GC integration, and
docs/tests — enabling point‑in‑time collection snapshots persisted to
object storage and restorations orchestrated across components.
- Logic removed/simplified and why: duplicated recursive
compaction/delta‑log traversal and ad‑hoc lookup code were consolidated
behind two focused APIs/owners (Handler.GetDeltaLogFromCompactTo for
delta traversal and SnapshotManager/SnapshotReader for snapshot I/O).
MixCoord/coordinator broker paths were converted to thin RPC proxies.
This eliminates multiple implementations of the same traversal/lookup,
reducing divergence and simplifying responsibility boundaries.
- Why this does NOT introduce data loss or regressions: snapshot
create/drop use explicit two‑phase semantics (PENDING → COMMIT/DELETING)
with SnapshotWriter writing manifests and metadata before commit; GC
uses snapshotRefIndex guards and
IsRefIndexLoaded/GetSnapshotBySegment/GetSnapshotByIndex checks to avoid
removing referenced files; restore flow pre‑allocates job IDs, validates
resources (partitions/indexes), performs rollback on failure
(rollbackRestoreSnapshot), and converts/updates segment/index metadata
only after successful copy tasks. Extensive unit and integration tests
exercise pending/deleting/GC/restore/error paths to ensure idempotence
and protection against premature deletion.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
---------
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
issue: #45881
Add persistent task management for external collections with automatic
detection of external_source and external_spec changes. When source
changes, the system aborts running tasks and creates new ones, ensuring
only one active task per collection. Tasks validate their source on
completion to prevent superseded tasks from committing results.
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
- Core invariant: at most one active UpdateExternalCollection task
exists per collection — tasks are serialized by collectionID
(collection-level locking) and any change to external_source or
external_spec aborts superseded tasks and causes a new task creation
(externalCollectionManager + external_collection_task_meta
collection-based locks enforce this).
- What was simplified/removed: per-task fine-grained locking and
concurrent multi-task acceptance per collection were replaced by
collection-level synchronization (external_collection_task_meta.go) and
a single persistent task lifecycle in DataCoord/Index task code;
redundant double-concurrent update paths were removed by checking
existing task presence in AddTask/LoadOrStore and aborting/overwriting
via Drop/Cancel flows.
- Why this does NOT cause data loss or regress behavior: task state
transitions and commit are validated against the current external
source/spec before applying changes — UpdateStateWithMeta and SetJobInfo
verify task metadata and persist via catalog only under matching
collection-state; DataNode externalCollectionManager persists task
results to in-memory manager and exposes Query/Drop flows (services.go)
without modifying existing segment data unless a task successfully
finishes and SetJobInfo atomically updates segments via meta/catalog
calls, preventing superseded tasks from committing stale results.
- New capability added: end-to-end external collection update workflow —
DataCoord Index task + Cluster RPC helpers + DataNode external task
runner and ExternalCollectionManager enable creating, querying,
cancelling, and applying external collection updates
(fragment-to-segment balancing, kept/updated segment handling, allocator
integration); accompanying unit tests cover success, failure,
cancellation, allocator errors, and balancing logic.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
---------
Signed-off-by: sunby <sunbingyi1992@gmail.com>
issue: #46274
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
* **Performance Improvements**
* Field-level text index creation and JSON-key statistics now run
concurrently, reducing overall indexing time and speeding task
completion.
* **Observability Enhancements**
* Per-task and per-field logging expanded with richer context and
per-phase elapsed-time reporting for improved monitoring and
diagnostics.
* **Refactor**
* Node slot handling simplified to compute slot counts on demand instead
of storing them.
<sub>✏️ Tip: You can customize this high-level summary in your review
settings.</sub>
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
---------
Signed-off-by: Cai Zhang <cai.zhang@zilliz.com>
relate:https://github.com/milvus-io/milvus/issues/43687
Support use file resource with sync mode.
Auto download or remove file resource to local when user add or remove
file resource.
Sync file resource to node when find new node session.
---------
Signed-off-by: aoiasd <zhicheng.yue@zilliz.com>
Remove stopTask.
Replace multiple task tracking maps with single unified taskState map.
Fix slot tracking, improve state transitions, and add comprehensive test
See also: #44714
---------
Signed-off-by: yangxuan <xuan.yang@zilliz.com>
1. Use goroutine pool instead of sem.
2. Remove compaction executor from pipeline, since in streaming mode
pipeline should be decoupled from compaction.
issue: https://github.com/milvus-io/milvus/issues/44541
Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
https://github.com/milvus-io/milvus/issues/44011
this is to support compaction that sorts records by partition key and pk
in the future
---------
Signed-off-by: sunby <sunbingyi1992@gmail.com>
1. Enable Milvus to read cipher configs
2. Enable cipher plugin in binlog reader and writer
3. Add a testCipher for unittests
4. Support pooling for datanode
5. Add encryption in storagev2
See also: #40321
Signed-off-by: yangxuan <xuan.yang@zilliz.com>
---------
Signed-off-by: yangxuan <xuan.yang@zilliz.com>
- Introduce dynamic buffer sizing to avoid generating small binlogs
during import
- Refactor import slot calculation based on CPU and memory constraints
- Implement dynamic pool sizing for sync manager and import tasks
according to CPU core count
issue: https://github.com/milvus-io/milvus/issues/43131
---------
Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
Related to #42530
The cluster id is missing when drop worker drop causing redoing task on
report duplicated task error.
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
Remove the unlimited logID mechanism and switch to redundantly
allocating a large number of IDs.
issue: https://github.com/milvus-io/milvus/issues/42518
Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
1. Add global scheduler for datacoord.
2. Define and implement new CreateTask, QueryTask, DropTask interfaces.
3. Refine Import, Compaction, Stats, Index task.
issue: https://github.com/milvus-io/milvus/issues/41123
Co-authored-by: Cai Zhang <cai.zhang@zilliz.com>
When autoID is enabled, the preimport task estimates row distribution by
evenly dividing the total row count (numRows) across all vchannels:
`estimatedCount = numRows / vchannelNum`.
However, the actual import task hashes real auto-generated IDs to
determine
the target vchannel. This mismatch can lead to inaccurate row
distribution estimation
in such corner cases:
- Importing 1 row into 2 vchannels:
• Preimport: 1 / 2 = 0 → both v0 and v1 are estimated to have 0 rows
• Import: real autoID (e.g., 457975852966809057) hashes to v1
→ actual result: v0 = 0, v1 = 1
To resolve such corner case, we now allocate at least one segment for
each vchannel
when autoID is enabled, ensuring all vchannels are prepared to receive
data even
if no rows are estimated for them.
issue: https://github.com/milvus-io/milvus/issues/41759
---------
Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
With concurrenct L0 compaction
(https://github.com/milvus-io/milvus/pull/36816), delta logs might be
written to the same L1 segment, causing logID duplication when using the
incremental beginLogID. This PR removes the beginLogID mechanism and
instead passes a log ID range, where the number of IDs in the range
equals the number of compaction segment binlogs multiplied by an
expansion factor.
issue: https://github.com/milvus-io/milvus/issues/40207
---------
Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
Iterators are long deprecated, but sort are still using it. This PR
unifies stats task with the latest compaction common functions and
remove the usage of iterators.
1. Rename `datanode/compaction` to `datanode/compactor`
2. Add `internal/compaction` and move some compaction commons into it.
3. Replace `DeltalogIterators` with `ComposeDeleteFromDeltalogs`
4. Remove `datanode/iterators`
See also: #39242
Signed-off-by: yangxuan <xuan.yang@zilliz.com>
issue: #35563
1. Use an internal health checker to monitor the cluster's health state,
storing the latest state on the coordinator node. The CheckHealth
request retrieves the cluster's health from this latest state on the
proxy sides, which enhances cluster stability.
2. Each health check will assess all collections and channels, with
detailed failure messages temporarily saved in the latest state.
3. Use CheckHealth request instead of the heavy GetMetrics request on
the querynode and datanode
Signed-off-by: jaime <yun.zhang@zilliz.com>
issue: #36621
1. Add API to access task runtime metrics, including:
- build index task
- compaction task
- import task
- balance (including load/release of segments/channels and some leader
tasks on querycoord)
- sync task
2. Add a debug model to the webpage by using debug=true or debug=false
in the URL query parameters to enable or disable debug mode.
Signed-off-by: jaime <yun.zhang@zilliz.com>
1. Move the common modules of streamingNode and dataNode to flushcommon
2. Add new GetVChannels interface for rootcoord
issue: https://github.com/milvus-io/milvus/issues/33285
---------
Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
This PR removes the dependency of compaction on the ID allocator by
pre-allocating the logID and segmentID.
issue: https://github.com/milvus-io/milvus/issues/33957
---------
Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
issue: #33540
1. gorwing L0 segments is invisible to datacoord.
2. flushed L0 segments need to clean by datacoord.
Signed-off-by: Cai Zhang <cai.zhang@zilliz.com>
Due to the removal of injection and syncSegments from the compaction, we
need to ensure that no compaction is successfully executed during the
rolling upgrade. This PR renames Compaction to CompactionV2, with the
following effects:
- New datacoord + old datanode: Utilizes the CompactionV2 interface,
resulting in the datanode error "CompactionV2 not implemented," causing
compaction to fail;
- Old datacoord + new datanode: Utilizes the CompactionV1 interface,
resulting in the datanode error "CompactionV1 not implemented," causing
compaction to fail.
issue: https://github.com/milvus-io/milvus/issues/32809
Signed-off-by: bigsheeper <yihao.dai@zilliz.com>