Cherry-pick from master
pr: #45061 #45488 #45803 #46017 #44991 #45132 #45723 #45726 #45798 #45897 #45918 #44998
This feature integrates the Storage V2 (Loon) FFI interface as a unified
storage layer for segment loading and index building in Milvus. It enables
manifest-based data access, replacing the traditional binlog-based approach
with a more efficient columnar storage format.
Key changes:
### Segment Self-Managed Loading Architecture
- Move segment loading orchestration from Go layer to C++ segcore
- Add NewSegmentWithLoadInfo() API for passing load info during segment
creation
- Implement SetLoadInfo() and Load() methods in SegmentInterface
- Support parallel loading of indexed and non-indexed fields
- Enable both sealed and growing segments to self-manage loading
### Storage V2 FFI Integration
- Integrate milvus-storage library's FFI interface for packed columnar
data
- Add manifest path support throughout the data path (SegmentInfo,
LoadInfo)
- Implement ManifestReader for generating manifests from binlogs
- Support zero-copy data exchange using Arrow C Data Interface
- Add ToCStorageConfig() for Go-to-C storage config conversion
### Manifest-Based Index Building
- Extend FileManagerContext to carry loon_ffi_properties
- Implement GetFieldDatasFromManifest() using Arrow C Stream interface
- Support manifest-based reading in DiskFileManagerImpl and
MemFileManagerImpl
- Add fallback to traditional segment insert files when manifest
unavailable
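The fallback described in the last bullet can be pictured with a minimal Go sketch. The real logic lives in the C++ file managers; `loadSource` and `chooseSource` are hypothetical names used only for illustration, and the sketch assumes an empty manifest path means "manifest unavailable".

```go
package main

import "fmt"

// loadSource describes where index-building input is read from.
// Hypothetical type, not part of Milvus.
type loadSource struct {
	Kind  string // "manifest" or "insert_files"
	Paths []string
}

// chooseSource prefers the manifest when a path is set and falls back
// to the traditional segment insert files otherwise.
func chooseSource(manifestPath string, insertFiles []string) loadSource {
	if manifestPath != "" {
		return loadSource{Kind: "manifest", Paths: []string{manifestPath}}
	}
	return loadSource{Kind: "insert_files", Paths: insertFiles}
}

func main() {
	// No manifest: the binlog-style insert files are used.
	fmt.Println(chooseSource("", []string{"insert_log/1", "insert_log/2"}).Kind)
	// Manifest present: it takes precedence.
	fmt.Println(chooseSource("segment/manifest", nil).Kind)
}
```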
### Compaction Pipeline Updates
- Include manifest path in all compaction task builders (clustering, L0,
mix)
- Update BulkPackWriterV2 to return manifest path
- Propagate manifest metadata through compaction pipeline
### Configuration & Protocol
- Add common.storageV2.useLoonFFI config option (default: false)
- Add manifest_path field to SegmentLoadInfo and related proto messages
- Add manifest field to compaction segment messages
### Bug Fixes
- Fix mmap settings not applied during segment load (key typo fix)
- Populate index info after segment loading to prevent redundant load
tasks
- Fix memory corruption by removing premature transaction handle
destruction
Related issues: #44956, #45060, #39173
## Individual Cherry-Picked Commits
1. **e1c923b5cc** - fix: apply mmap settings correctly during segment
load (#46017)
2. **63b912370b** - enhance: use milvus-storage internal C++ Reader API
for Loon FFI (#45897)
3. **bfc192faa5** - enhance: Resolve issues integrating loon FFI
(#45918)
4. **fb18564631** - enhance: support manifest-based index building with
Loon FFI reader (#45726)
5. **b9ec2392b9** - enhance: integrate StorageV2 FFI interface for
manifest-based segment loading (#45798)
6. **66db3c32e6** - enhance: integrate Storage V2 FFI interface for
unified storage access (#45723)
7. **ae789273ac** - fix: populate index info after segment loading to
prevent redundant load tasks (#45803)
8. **49688b0be2** - enhance: Move segment loading logic from Go layer to
segcore for self-managed loading (#45488)
9. **5b2df88bac** - enhance: [StorageV2] Integrate FFI interface for
packed reader (#45132)
10. **91ff5706ac** - enhance: [StorageV2] add manifest path support for
FFI integration (#44991)
11. **2192bb4a85** - enhance: add NewSegmentWithLoadInfo API to support
segment self-managed loading (#45061)
12. **4296b01da0** - enhance: update delta log serialization APIs to
integrate storage V2 (#44998)
## Technical Details
### Architecture Changes
- **Before**: Go layer orchestrated segment loading, making multiple CGO
calls
- **After**: Segments autonomously manage loading in C++ layer with
single entry point
### Storage Access Pattern
- **Before**: Read individual binlog files through Go storage layer
- **After**: Read manifest file that references packed columnar data via
FFI
### Benefits
- Reduced cross-language call overhead
- Better resource management at C++ level
- Improved I/O performance through batched streaming reads
- Cleaner separation of concerns between Go and C++ layers
- Foundation for proactive schema evolution handling
---------
Signed-off-by: Ted Xu <ted.xu@zilliz.com>
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
Co-authored-by: Ted Xu <ted.xu@zilliz.com>
1. Enable Milvus to read cipher configs
2. Enable cipher plugin in binlog reader and writer
3. Add a testCipher for unittests
4. Support pooling for datanode
5. Add encryption in storagev2
See also: #40321
Signed-off-by: yangxuan <xuan.yang@zilliz.com>
Ref https://github.com/milvus-io/milvus/issues/42148
https://github.com/milvus-io/milvus/pull/42406 implements the segcore part of
storage for handling VectorArray.
This PR:
1. Implements the Go part of storage for VectorArray.
2. Implements collection creation with StructArrayField and VectorArray.
3. Supports inserting and retrieving data from the collection.
---------
Signed-off-by: SpadeA <tangchenjie1210@gmail.com>
Signed-off-by: SpadeA-Tang <tangchenjie1210@gmail.com>
Signed-off-by: SpadeA-Tang <u6748471@anu.edu.au>
This parameter determines whether the returned value should be a copy or
a reference from the arrow array. The updates enhance memory management
and provide more control over data handling during deserialization.
See #43186
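To illustrate the copy-versus-reference distinction, here is a minimal sketch that does not use the Arrow library. The `column` type is a stand-in for an Arrow binary array (one shared buffer plus offsets), and the `shouldCopy` parameter name is an assumption, not the actual signature.

```go
package main

import "fmt"

// column is a stand-in for an Arrow binary array: one shared buffer
// plus offsets. It is not the real Arrow API.
type column struct {
	buf     []byte
	offsets []int // length is the number of values + 1
}

// value returns the i-th value. With shouldCopy=false the result aliases
// the column's buffer; with shouldCopy=true it is an independent copy.
func (c *column) value(i int, shouldCopy bool) []byte {
	v := c.buf[c.offsets[i]:c.offsets[i+1]]
	if !shouldCopy {
		return v
	}
	out := make([]byte, len(v))
	copy(out, v)
	return out
}

func main() {
	c := &column{buf: []byte("abcdef"), offsets: []int{0, 3, 6}}
	ref := c.value(0, false)
	cp := c.value(0, true)
	c.buf[0] = 'X' // simulate the underlying buffer being reused
	fmt.Println(string(ref), string(cp)) // prints "Xbc abc"
}
```

A reference avoids an allocation but is only valid while the underlying buffer is alive and unchanged; a copy costs memory but can outlive the record.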
---------
Signed-off-by: Ted Xu <ted.xu@zilliz.com>
See: #43186
In this PR:
1. Flush renamed to FlushChunk, while a new Flush primitive is
introduced to serialize values to records.
2. Segment mapping in clustering compaction now processes data by records
instead of values; it calls flush on all buffers after each record is
processed.
Signed-off-by: Ted Xu <ted.xu@zilliz.com>
Related to #43522
Currently, passing a partial schema to the storage v2 packed reader may
trigger a SEGV during the clustering compaction unit test.
This patch implements `NeededFields` differently in each `RecordReader`
implementation. For now, the v2 implementation is a no-op. It will be
supported once the packed reader supports this API.
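The per-implementation behavior can be sketched as follows. The interface and type names (`recordReader`, `v1Reader`, `v2Reader`, `SetNeededFields`) are hypothetical and trimmed down for illustration; only the idea that v2 ignores the requested field set comes from this patch.

```go
package main

import "fmt"

// fieldSet is a set of field IDs.
type fieldSet map[int64]struct{}

// recordReader is a hypothetical, trimmed-down reader interface.
type recordReader interface {
	SetNeededFields(fields fieldSet)
	Fields() []int64
}

// v1Reader honors the needed-field set and reads only those fields.
type v1Reader struct {
	all    []int64
	needed fieldSet
}

func (r *v1Reader) SetNeededFields(fields fieldSet) { r.needed = fields }

func (r *v1Reader) Fields() []int64 {
	if r.needed == nil {
		return r.all
	}
	out := make([]int64, 0, len(r.needed))
	for _, id := range r.all {
		if _, ok := r.needed[id]; ok {
			out = append(out, id)
		}
	}
	return out
}

// v2Reader treats SetNeededFields as a no-op and always reads the
// full schema, so the packed reader never sees a partial schema.
type v2Reader struct {
	all []int64
}

func (r *v2Reader) SetNeededFields(fieldSet) {}

func (r *v2Reader) Fields() []int64 { return r.all }

func main() {
	readers := []recordReader{
		&v1Reader{all: []int64{0, 1, 100}},
		&v2Reader{all: []int64{0, 1, 100}},
	}
	for _, r := range readers {
		r.SetNeededFields(fieldSet{100: {}})
		fmt.Println(r.Fields())
	}
}
```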
---------
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
Correct read and buffer size to 64MB to prevent OOM during clustering
compaction.
issue: https://github.com/milvus-io/milvus/issues/43310
---------
Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
Related to #39173
Like logic in #41919, storage v2 fs shall use complete paths with
bucketName prefix to be compatible with its definition. This PR fills
bucket name from config when creating reader for compaction tasks.
NOTE: the bucket name shall be read from task params config for
compaction task pooling.
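A minimal sketch of the path handling, assuming the bucket name arrives with the task parameters rather than local configuration. `taskParams` and `fullPath` are hypothetical names, and the example paths are made up.

```go
package main

import (
	"fmt"
	"path"
)

// taskParams is a hypothetical stand-in for the parameters carried
// by a compaction task.
type taskParams struct {
	BucketName string
}

// fullPath builds the complete path expected by the storage v2
// filesystem by prefixing the log path with the bucket name.
func fullPath(p taskParams, logPath string) string {
	return path.Join(p.BucketName, logPath)
}

func main() {
	params := taskParams{BucketName: "a-bucket"}
	fmt.Println(fullPath(params, "files/insert_log/1/2/3"))
	// prints "a-bucket/files/insert_log/1/2/3"
}
```

Reading the bucket name from the task keeps a pooled DataNode correct even when its own configuration names a different bucket.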
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
Make DataNode use compaction parameters from request instead of
configuration.
issue: https://github.com/milvus-io/milvus/issues/41123
---------
Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
With concurrent L0 compaction
(https://github.com/milvus-io/milvus/pull/36816), delta logs might be
written to the same L1 segment, causing logID duplication when using the
incremental beginLogID. This PR removes the beginLogID mechanism and
instead passes a log ID range, where the number of IDs in the range
equals the number of compaction segment binlogs multiplied by an
expansion factor.
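The sizing rule can be worked through in a short sketch. `logIDRange` and `allocRange` are hypothetical names, the half-open range convention is an assumption, and the expansion factor of 4 is an illustrative value, not the one used by Milvus.

```go
package main

import "fmt"

// logIDRange is a half-open range [Begin, End) of pre-allocated log IDs.
type logIDRange struct {
	Begin, End int64
}

// allocRange reserves binlogCount*expansionFactor IDs starting at next
// and returns the range together with the next free ID.
func allocRange(next, binlogCount, expansionFactor int64) (logIDRange, int64) {
	n := binlogCount * expansionFactor
	r := logIDRange{Begin: next, End: next + n}
	return r, r.End
}

func main() {
	// Two concurrent tasks get disjoint ranges, so their delta logs
	// cannot collide even when they target the same L1 segment.
	r1, next := allocRange(1000, 8, 4)
	r2, _ := allocRange(next, 3, 4)
	fmt.Println(r1, r2) // prints "{1000 1032} {1032 1044}"
}
```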
issue: https://github.com/milvus-io/milvus/issues/40207
---------
Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
- Feat: Support Mix compaction. Covering tests include compatibility and
rollback ability.
- Read v1 segments and compact with v2 format.
- Read both v1 and v2 segments and compact with v2 format.
- Read v2 segments and compact with v2 format.
- Compact with duplicate primary key test.
- Compact with bm25 segments.
- Compact with merge sort segments.
- Compact with no expiration segments.
- Compact with segments lacking binlogs.
- Compact with nullable field segments.
- Feat: Support Clustering compaction. Covering tests include
compatibility and rollback ability.
- Read v1 segments and compact with v2 format.
- Read both v1 and v2 segments and compact with v2 format.
- Read v2 segments and compact with v2 format.
- Compact bm25 segments with v2 format.
- Compact with memory limit.
- Enhance: Use serdeMap serialize in BuildRecord function to support all
Milvus data types.
related: #39173
Signed-off-by: shaoting-huang <shaoting.huang@zilliz.com>
Iterators have long been deprecated, but sort still uses them. This PR
unifies the stats task with the latest compaction common functions and
removes the usage of iterators.
1. Rename `datanode/compaction` to `datanode/compactor`
2. Add `internal/compaction` and move some compaction commons into it.
3. Replace `DeltalogIterators` with `ComposeDeleteFromDeltalogs`
4. Remove `datanode/iterators`
See also: #39242
Signed-off-by: yangxuan <xuan.yang@zilliz.com>