related: #45993
Add nullable vector support in import utility layer
Key changes:
ImportV2 util:
- Add nullable vector types (FloatVector, Float16Vector, BFloat16Vector,
BinaryVector, SparseFloatVector, Int8Vector) to
AppendNullableDefaultFieldsData()
- Add tests for nullable vector field data appending
CSV/JSON/Numpy readers:
- Add nullPercent parameter to test data generation for better null
coverage
- Mark vector fields as nullable in test schemas
- Add test cases for nullable vector field parsing
- Refactor tests to use loop-based approach with 0%, 50%, 100% null
percentages
Parquet field reader:
- Add ReadNullableBinaryData() for nullable
BinaryVector/Float16Vector/BFloat16Vector
- Add ReadNullableFloatVectorData() for nullable FloatVector
- Add ReadNullableSparseFloatVectorData() for nullable SparseFloatVector
- Add ReadNullableInt8VectorData() for nullable Int8Vector
- Add ReadNullableStructData() for generic nullable struct data
- Update Next() to use nullable read methods when field is nullable
- Add null data validation for non-nullable fields
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
- Core invariant: import must preserve per-row alignment and validity
for every field — nullable vector fields are expected to be encoded with
per-row validity masks and all readers/writers must emit arrays aligned
to original input rows (null entries represented explicitly).
- New feature & scope: adds end-to-end nullable-vector support in the
import utility layer — AppendNullableDefaultFieldsData in
internal/datanode/importv2/util.go now appends nil placeholders for
nullable vectors (FloatVector, Float16Vector, BFloat16Vector,
BinaryVector, SparseFloatVector, Int8Vector); parquet reader
(internal/util/importutilv2/parquet/field_reader.go) adds
ReadNullableBinaryData, ReadNullableFloatVectorData,
ReadNullableSparseFloatVectorData, ReadNullableInt8VectorData,
ReadNullableStructData and routes nullable branches to these helpers;
CSV/JSON/Numpy readers and test utilities updated to generate and
validate 0/50/100% null scenarios and mark vector fields as nullable in
test schemas.
- Logic removed / simplified: eliminates ad-hoc "parameter-invalid"
rejections for nullable vectors inside FieldReader.Next by centralizing
nullable handling into ReadNullable* helpers and shared validators
(getArrayDataNullable,
checkNullableVectorAlignWithDim/checkNullableVectorAligned), simplifying
control flow and removing scattered special-case checks.
- No data loss / no regression (concrete code paths): nulls are
preserved end-to-end — AppendNullableDefaultFieldsData explicitly
inserts nil entries per null row (datanode import append path);
ReadNullable*Data helpers return both data and []bool validity masks so
callers in field_reader.go and downstream readers receive exact per-row
validity; testutil.BuildSparseVectorData was extended to accept
validData so sparse vectors are materialized only for valid rows while
null rows are represented as missing. These concrete paths ensure null
rows are represented rather than dropped, preventing data loss or
behavioral regression.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
Signed-off-by: marcelo-cjl <marcelo.chen@zilliz.com>
Related to #44956
This change propagates the useLoonFFI configuration through the import
pipeline to enable LOON FFI usage during data import operations.
Key changes:
- Add use_loon_ffi field to ImportRequest protobuf message
- Add manifest_path field to ImportSegmentInfo for tracking manifest
- Initialize manifest path when creating segments (both import and
growing)
- Pass useLoonFFI flag through NewSyncTask in import tasks
- Simplify pack_writer_v2 by removing GetManifestInfo method and relying
on pre-initialized manifest path from segment creation
- Update segment meta with manifest path after import completion
This allows the import workflow to use the LOON FFI based packed writer
when the common.useLoonFFI configuration is enabled.
---------
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
1. Use blocking memory allocation to wait until memory becomes available
2. Perform memory allocation at the file level instead of per task
3. Limit Parquet file reader batch size to prevent excessive memory
consumption
4. Limit import buffer size from 20% to 10% of total memory
issue: https://github.com/milvus-io/milvus/issues/43387,
https://github.com/milvus-io/milvus/issues/43131
---------
Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
Ref https://github.com/milvus-io/milvus/issues/42148https://github.com/milvus-io/milvus/pull/42406 impls the segcore part of
storage for handling with VectorArray.
This PR:
1. impls the go part of storage for VectorArray
2. impls the collection creation with StructArrayField and VectorArray
3. insert and retrieve data from the collection.
---------
Signed-off-by: SpadeA <tangchenjie1210@gmail.com>
Signed-off-by: SpadeA-Tang <tangchenjie1210@gmail.com>
Signed-off-by: SpadeA-Tang <u6748471@anu.edu.au>
issue: #43072, #43289
- manage the schema version at recovery storage.
- update the schema when creating collection or alter schema.
- get schema at write buffer based on version.
- recover the schema when upgrading from 2.5.
---------
Signed-off-by: chyezh <chyezh@outlook.com>
- Introduce dynamic buffer sizing to avoid generating small binlogs
during import
- Refactor import slot calculation based on CPU and memory constraints
- Implement dynamic pool sizing for sync manager and import tasks
according to CPU core count
issue: https://github.com/milvus-io/milvus/issues/43131
---------
Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
issue: #41976
- make drop partition message as a broadcast message.
- add gc when drop partition message is acked.
- add a call back to handle the broadcast message when ack.
- the ack operation of broadcast message will retry until success.
Signed-off-by: chyezh <chyezh@outlook.com>
issue: #36621
1. Add API to access task runtime metrics, including:
- build index task
- compaction task
- import task
- balance (including load/release of segments/channels and some leader
tasks on querycoord)
- sync task
2. Add a debug model to the webpage by using debug=true or debug=false
in the URL query parameters to enable or disable debug mode.
Signed-off-by: jaime <yun.zhang@zilliz.com>
1. Move the common modules of streamingNode and dataNode to flushcommon
2. Add new GetVChannels interface for rootcoord
issue: https://github.com/milvus-io/milvus/issues/33285
---------
Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
The import is dependent on syncTask, which in turn relies on the
allocator. This PR pre-allocate the necessary IDs for import syncTask.
issue: https://github.com/milvus-io/milvus/issues/33957
---------
Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
See also #33561
This PR:
- Use zero copy when buffering insert messages
- Make `storage.InsertCodec` support serialize multiple insert data
chunk into same batch binlog files
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
---------
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
The data coordinator computed the appropriate number of import segments,
thus when importing in the data node, one can randomly select a segment.
issue: https://github.com/milvus-io/milvus/issues/33604
---------
Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
The max number of import files per request should not exceed 1024 by
default (configurable).
The import file size allowed for importing should not exceed 16GB by
default (configurable).
issue: https://github.com/milvus-io/milvus/issues/28521
Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
1. The Import APIs now provide detailed progress information for each
imported file, including details such as file name, file size, progress,
and more.
2. The APIs now return the collection name and the completion time.
3. Other modifications include changing jobID to jobId and other similar
adjustments.
issue: https://github.com/milvus-io/milvus/issues/28521
---------
Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
This PR introduces novel managerial roles for importv2:
1. ImportMeta: To manage all the import tasks;
2. ImportScheduler: To process tasks and modify their states;
3. ImportChecker: To ascertain the completion of all tasks and instigate
relevant operations.
issue: https://github.com/milvus-io/milvus/issues/28521
---------
Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
This pr decoups importing segment from flush process by:
1. Exclude the importing segment from the flush policy, this approch
avoids notifying the datanode to flush the importing segment, which may
not exist.
2. When RootCoord call Flush, DataCoord directly set the importing
segment state to `Flushed`.
issue: https://github.com/milvus-io/milvus/issues/30359
---------
Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
This PR introduces novel importv2 roles for datanode:
1. Executor: To execute tasks, a import task will be divided into the
following steps: read data -> hash data -> sync data;
2. Manager: To manage all the tasks;
issue: https://github.com/milvus-io/milvus/issues/28521
---------
Signed-off-by: bigsheeper <yihao.dai@zilliz.com>