Related to #44956
Use the exact manifest version from the path parameter instead of always
fetching the latest manifest. This ensures data consistency by reading
from the specific version that was requested.
Changes:
- Update GetColumnGroups to use transaction.begin(version) with the
specified version from the path JSON
- Replace get_latest_manifest() with get_current_manifest() after
beginning transaction at the target version
- Update Go FFI binding to call get_column_groups_by_version instead of
get_latest_column_groups
- Remove unused GetManifest function from util.cpp/util.h
- Bump milvus-storage version from 5fff4f5 to 33bf815
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
Related to #46098
Fix the ReverseDataFromIndex function where the assignment of raw_data
to scalar_array and the break statement were incorrectly placed inside
the for loop for TIMESTAMPTZ data type. This caused QueryNode to panic
when outputting timestamptz fields from scalar index.
Move the assignment and break statement outside the loop to match the
pattern used by other data types like VARCHAR.
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
See: #44956
This PR upgrades loon to the latest version and resolves building
conflicts.
---------
Signed-off-by: Ted Xu <ted.xu@zilliz.com>
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
Co-authored-by: Congqi Xia <congqi.xia@zilliz.com>
Previously, mmap settings configured at the collection or field level
were not being applied during segment loading in segcore. This was
caused by:
1. A typo in the key name: "mmap.enable" instead of "mmap.enabled"
2. Missing logic to parse and apply mmap settings from schema
This commit fixes the issue by:
- Correcting the key name to "mmap.enabled" to match the standard
- Adding Schema::MmapEnabled() method to retrieve field/collection level
mmap settings with proper fallback logic
- Parsing mmap settings from field type_params and collection properties
during schema parsing
- Applying computed mmap settings in LoadColumnGroup() and Load()
methods instead of hardcoded false values
- Using global MmapConfig as fallback when no explicit setting exists
The mmap setting priority is now:
1. Field-level mmap setting (from type_params)
2. Collection-level mmap setting (from properties)
3. Global mmap config (from MmapManager)
For column groups, if any field has mmap enabled, the entire group uses
mmap (since they are loaded together).
Related issue: #45060
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
This PR refactors the Loon FFI reader implementation to use
milvus-storage's internal C++ Reader API directly instead of the
external FFI interface.
Key changes:
- Replace external FFI calls (get_record_batch_reader, reader_destroy)
with direct C++ Reader API calls
- Add GetLoonReader() helper function to create Reader instances using
milvus-storage::api::Reader::create()
- Use MakeInternalPropertiesFromStorageConfig() instead of
MakePropertiesFromStorageConfig() to get internal properties
- Update NewPackedFFIReaderWithManifest() to deserialize column groups
from JSON manifest content directly
- Simplify GetFFIReaderStream() to use Reader::get_record_batch_reader()
and arrow::ExportRecordBatchReader() for Arrow stream export
- Change CFFIPackedReader typedef from ReaderHandle to void* for
flexibility
- Update milvus-storage dependency version to ba7df7b
This change improves code maintainability by using the native C++ API
directly and eliminates the overhead of going through the external FFI
layer.
issue: #44956
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
Related to #44956
- Update milvus-storage version to ba7df7b for chunk reader fix
- Pass manifest path to index build request in DataCoord/DataNode
- Add null chunk assertion with detailed debug info in
ManifestGroupTranslator
- Fix memory corruption by removing premature transaction handle
destruction
- Clean up log message in ChunkedSegmentSealedImpl
---------
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
This PR adds support for reading data from StorageV2 using manifest
files and the Loon FFI interface during index building, providing an
alternative to the traditional segment insert files approach.
Key changes:
Core C++ changes:
- Add SEGMENT_MANIFEST_KEY and LOON_FFI_PROPERTIES_KEY constants for
manifest handling
- Extend FileManagerContext to carry loon_ffi_properties for FFI
operations
- Update index_c.cpp to pass manifest and loon properties to file
managers for all index types (vector, JSON key, text)
- Implement GetFieldDatasFromManifest() in Util.cpp using Arrow C Stream
interface:
* Create Arrow schema from field metadata
* Initialize FFI reader with manifest content and storage properties
* Import record batches from C data interface
* Convert to FieldData for index building
- Update DiskFileManagerImpl and MemFileManagerImpl to support
manifest-based data reading with fallback to traditional paths
Loon FFI utilities (internal/core/src/storage/loon_ffi/):
- Add ToCStorageConfig() to convert StorageConfig to C-compatible
structure
- Implement GetManifest() to parse manifest JSON and retrieve column
groups via FFI
- Enhance MakePropertiesFromStorageConfig() integration
Storage V2 integration:
- Update milvus-storage dependency from 0883026 to 302143c for latest
FFI support
Protobuf changes:
- Add manifest field to BuildIndexInfo for passing manifest path to C++
layer
Configuration:
- Add common.storageV2.useLoonFFI config option (default: false) for
feature toggle
This change is part of issue #44956 to integrate the StorageV2 FFI
interface as the unified storage layer. The implementation maintains
backward compatibility by checking for manifest presence and falling
back to existing segment insert files approach when manifest is not
provided.
Related issue: #44956
---------
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
issue: #45486
This commit refactors the chunk writing system by introducing a
two-phase
approach: size calculation followed by writing to a target. This enables
efficient group chunk creation where multiple fields share a single mmap
region, significantly reducing the number of mmap system calls and VMAs.
- Optimize `mmap` usage: single `mmap` per group chunk instead of per
field
- Split ChunkWriter into two phases:
- `calculate_size()`: Pre-compute required memory without allocation
- `write_to_target()`: Write data to a provided ChunkTarget
- Implement `ChunkMmapGuard` for unified mmap region lifecycle
management
- Handles `munmap` and file cleanup via RAII
- Shared via `std::shared_ptr` across multiple chunks in a group
Signed-off-by: Shawn Wang <shawn.wang@zilliz.com>
---------
Signed-off-by: Shawn Wang <shawn.wang@zilliz.com>
Related to #44956
**New Translator (C++)**
- Added `ManifestGroupTranslator`
(`internal/core/src/segcore/storagev2translator/`)
- Translates manifest-based column groups to Milvus internal format
- Implements `GroupCTMeta` interface for chunk-based column access
- Supports both memory and mmap storage modes
- Handles cache warmup policies for vector and scalar data
**ChunkedSegmentSealedImpl**
(`internal/core/src/segcore/ChunkedSegmentSealedImpl.cpp:333`)
- Added `LoadColumnGroups(const std::string& manifest_path)`: Main entry
point for manifest-based loading
- Creates milvus-storage Reader from manifest file
- Parallelizes column group loading using thread pool
- Aggregates loading exceptions and reports errors
- Added `LoadColumnGroup()`: Loads individual column group
- Extracts field IDs from column group metadata
- Creates ManifestGroupTranslator for each column group
- Builds ProxyChunkColumn for field access
- Special handling for timestamp field index construction
**SegmentGrowingImpl**
(`internal/core/src/segcore/SegmentGrowingImpl.cpp`)
- Added similar `LoadColumnGroups()` and `LoadColumnGroup()` methods for
growing segments
- Maintains consistency with sealed segment loading path
Storage FFI Utilities
**loon_ffi/util** (`internal/core/src/storage/loon_ffi/util.cpp`)
- Added `MakeInternalPropertiesFromStorageConfig()`: Converts C storage
config to internal Properties
- Maps all storage configuration fields (S3, GCS, Azure, local)
- Handles SSL, IAM, virtual host settings
- Configures connection timeouts and max connections
- Added `MakeInternalLocalProperies()`: Creates local filesystem
properties
- Added `ToCStorageConfig()`: Converts Go StorageConfig to C
representation
- Added `GetColumnGroups()`: Extracts column groups from manifest file
using Transaction API
Protocol Buffer Changes
**segcore.proto** (`pkg/proto/segcore.proto:121`)
- Added `manifest_path` field to `SegmentLoadInfo` message
- Enables passing manifest file path from Go layer to C++ core
Go Integration
**segment.go** (`internal/util/segcore/segment.go:372`)
- Updated `ConvertToSegcoreSegmentLoadInfo()` to propagate
`ManifestPath` field
- Bridges QueryNode segment load info to Segcore format
---------
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
This commit optimizes std::vector usage across segcore by adding
reserve() calls where the size is known in advance, reducing memory
reallocations during push_back operations.
Changes:
- TimestampIndex.cpp: Reserve space for prefix_sums and
timestamp_barriers
- SegmentGrowingImpl.cpp: Reserve space for binlog info vectors
- ChunkedSegmentSealedImpl.cpp: Reserve space for futures and field data
vectors
- storagev2translator/GroupChunkTranslator.cpp: Reserve space for
metadata vectors
This improves performance by avoiding multiple memory reallocations when
the vector size is predictable.
issue: https://github.com/milvus-io/milvus/issues/45679
---------
Signed-off-by: Buqian Zheng <zhengbuqian@gmail.com>
#45610
this fix add a little cost for execute:
=== Lower Bound Overhead (isolated) ===
Position 1 (list len = 90000): 39 ns per lower_bound
Position 2 (list len =180000): 45 ns per lower_bound
Position 3 (list len =270000): 46 ns per lower_bound
Position 4 (list len =360000): 38 ns per lower_bound
Position 5 (list len =450000): 42 ns per lower_bound
Position 6 (list len =540000): 55 ns per lower_bound
Position 7 (list len =630000): 56 ns per lower_bound
Position 8 (list len =720000): 49 ns per lower_bound
Position 9 (list len =810000): 48 ns per lower_bound
Signed-off-by: luzhang <luzhang@zilliz.com>
Co-authored-by: luzhang <luzhang@zilliz.com>
issue: https://github.com/milvus-io/milvus/issues/45783
for simdjson ondemand api, a iterator can only be used once. use dom api
to prevent crashes when processing JSON contains operations with
different types.
Signed-off-by: sunby <sunbingyi1992@gmail.com>
relate: https://github.com/milvus-io/milvus/issues/43687
Support use user provice file by file params, in analyzer params.
Could use local file or remote file resource.
Support use file params in jieba extern dict.
Signed-off-by: aoiasd <zhicheng.yue@zilliz.com>
Related #44956
This commit integrates the Storage V2 FFI (Foreign Function Interface)
interface throughout the Milvus codebase, enabling unified storage
access through the Loon FFI layer. This is a significant step towards
standardizing storage operations across different storage versions.
1. Configuration Support
- **configs/milvus.yaml**: Added `useLoonFFI` configuration flag under
`common.storage.file.splitByAvgSize` section
- Allows runtime toggle between traditional binlog readers and new
FFI-based manifest readers
- Default: `false` (maintains backward compatibility)
2. Core FFI Infrastructure
Enhanced Utilities (internal/core/src/storage/loon_ffi/util.cpp/h)
- **ToCStorageConfig()**: Converts Go's `StorageConfig` to C's
`CStorageConfig` struct for FFI calls
- **GetManifest()**: Parses manifest JSON and retrieves latest column
groups using FFI
- Accepts manifest path with `base_path` and `ver` fields
- Calls `get_latest_column_groups()` FFI function
- Returns column group information as string
- Comprehensive error handling for JSON parsing and FFI errors
3. Dependency Updates
- **internal/core/thirdparty/milvus-storage/CMakeLists.txt**:
- Updated milvus-storage version from `0883026` to `302143c`
- Ensures compatibility with latest FFI interfaces
4. Data Coordinator Changes
All compaction task builders now include manifest path in segment
binlogs:
- **compaction_task_clustering.go**: Added `Manifest:
segInfo.GetManifestPath()` to segment binlogs
- **compaction_task_l0.go**: Added manifest path to both L0 segment
selection and compaction plan building
- **compaction_task_mix.go**: Added manifest path to mixed compaction
segment binlogs
- **meta.go**: Updated metadata completion logic:
- `completeClusterCompactionMutation()`: Set `ManifestPath` in new
segment info
- `completeMixCompactionMutation()`: Preserve manifest path in compacted
segments
- `completeSortCompactionMutation()`: Include manifest path in sorted
segments
5. Data Node Compactor Enhancements
All compactors updated to support dual-mode reading (binlog vs
manifest):
6. Flush & Sync Manager Updates
Pack Writer V2 (pack_writer_v2.go)
- **BulkPackWriterV2.Write()**: Extended return signature to include
`manifest string`
- Implementation:
- Generate manifest path: `path.Join(pack.segmentID, "manifest.json")`
- Write packed data using FFI-based writer
- Return manifest path along with binlogs, deltas, and stats
Task Handling (task.go)
- Updated all sync task result handling to accommodate new manifest
return value
- Ensured backward compatibility for callers not using manifest
7. Go Storage Layer Integration
New Interfaces and Implementations
- **record_reader.go**: Interface for unified record reading across
storage versions
- **record_writer.go**: Interface for unified record writing across
storage versions
- **binlog_record_writer.go**: Concrete implementation for traditional
binlog-based writing
Enhanced Schema Support (schema.go, schema_test.go)
- Schema conversion utilities to support FFI-based storage operations
- Ensures proper Arrow schema mapping for V2 storage
Serialization Updates
- **serde.go, serde_events.go, serde_events_v2.go**: Updated to work
with new reader/writer interfaces
- Test files updated to validate dual-mode serialization
8. Storage V2 Packed Format
FFI Common (storagev2/packed/ffi_common.go)
- Common FFI utilities and type conversions for packed storage format
Packed Writer FFI (storagev2/packed/packed_writer_ffi.go)
- FFI-based implementation of packed writer
- Integrates with Loon storage layer for efficient columnar writes
Packed Reader FFI (storagev2/packed/packed_reader_ffi.go)
- Already existed, now complemented by writer implementation
9. Protocol Buffer Updates
data_coord.proto & datapb/data_coord.pb.go
- Added `manifest` field to compaction segment messages
- Enables passing manifest metadata through compaction pipeline
worker.proto & workerpb/worker.pb.go
- Added compaction parameter for `useLoonFFI` flag
- Allows workers to receive FFI configuration from coordinator
10. Parameter Configuration
component_param.go
- Added `UseLoonFFI` parameter to compaction configuration
- Reads from `common.storage.file.useLoonFFI` config path
- Default: `false` for safe rollout
11. Test Updates
- **clustering_compactor_storage_v2_test.go**: Updated signatures to
handle manifest return value
- **mix_compactor_storage_v2_test.go**: Updated test helpers for
manifest support
- **namespace_compactor_test.go**: Adjusted writer calls to expect
manifest
- **pack_writer_v2_test.go**: Validated manifest generation in pack
writing
This integration follows a **dual-mode approach**:
1. **Legacy Path**: Traditional binlog-based reading/writing (when
`useLoonFFI=false` or no manifest)
2. **FFI Path**: Manifest-based reading/writing through Loon FFI (when
`useLoonFFI=true` and manifest exists)
---------
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
1. Array.h: Add output_data(ScalarFieldProto&) overload for both Array
and ArrayView classes
2. Use std::string_view instead of std::string for VARCHAR and GEOMETRY
types to avoid extra string copies
3. Call Reserve(length_) before writing to proto objects to reduce
memory reallocations
a simple test shows those optimizations improve the Array of Varchar
bulk_subscript performance by 20%
issue: https://github.com/milvus-io/milvus/issues/45679
Signed-off-by: Buqian Zheng <zhengbuqian@gmail.com>
Related to #44974
The emplace() operation on tbb::concurrent_hash_map was not protected,
allowing other threads to erase entries between the emplace attempt and
the subsequent lookup.
Solution:
1. Add shared_lock protection around the emplace() operation to prevent
concurrent erasure during insertion
2. Instead of returning nullptr when the key is not found on retry,
recursively call Get(key) to retry the entire operation
3. Fix typo: "earsed" -> "erased"
This ensures that concurrent Get() operations are properly synchronized
and will eventually succeed even under high contention.
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
Related to #45060
Refactor segment loading architecture to make segments autonomously
manage their own loading process, moving the orchestration logic from Go
(segment_loader.go) to C++ (segcore).
**C++ Layer (segcore):**
- Added `SetLoadInfo()` and `Load()` methods to `SegmentInterface` and
implementations
- Implemented `ChunkedSegmentSealedImpl::Load()` with parallel loading
strategy:
- Separates indexed fields from non-indexed fields
- Loads indexes concurrently using thread pools
- Loads field data for non-indexed fields in parallel
- Implemented `SegmentGrowingImpl::Load()` to convert and load field
data
- Extracted `LoadIndexData()` as a reusable utility function in
`Utils.cpp`
- Added `SegmentLoad()` C binding in `segment_c.cpp`
**Go Layer:**
- Added `Load()` method to segment interfaces
- Updated mock implementations and test interfaces
- Integrated new C++ `SegmentLoad()` binding in Go segment wrapper
---------
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
Related to #45445
Previously, FillFieldData for JSON fields would assert and fail when a
default_value was provided, blocking index creation for JSON fields with
default values (including dynamic fields like $meta).
This change enables JSON default value support by:
- Removing the assertion that blocked default values
- Parsing bytes_data into Json objects when default_value is present
- Properly filling data_ array and setting valid_data_ bitset to true
- Maintaining null behavior when no default_value is provided
Impact:
- Fixes index creation failure for JSON fields with default values
- Resolves upgrade issues from 2.5 to 2.6.5 where dynamic fields with
default values couldn't be indexed
- Index builds that were stuck in InProgress state can now complete
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
Related to #44988
This PR commit newly updated tantivy-binding.h with cargo build result
which shall passes format check.
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>