issue: https://github.com/milvus-io/milvus/issues/42148
Optimized from
Go VectorArray → VectorArray Proto → Binary → C++ VectorArray Proto →
C++ VectorArray local impl → Memory
to
Go VectorArray → Arrow ListArray → Memory
---------
Signed-off-by: SpadeA <tangchenjie1210@gmail.com>
https://github.com/milvus-io/milvus/issues/44011
This prepares for future compaction that sorts records by partition key
and pk.
---------
Signed-off-by: sunby <sunbingyi1992@gmail.com>
1. Enable Milvus to read cipher configs
2. Enable cipher plugin in binlog reader and writer
3. Add a testCipher for unittests
4. Support pooling for datanode
5. Add encryption in storagev2
See also: #40321
Signed-off-by: yangxuan <xuan.yang@zilliz.com>
Ref https://github.com/milvus-io/milvus/issues/42148
This PR supports creating an index on a vector array (currently only for
`DataType.FLOAT_VECTOR` elements) and searching on it.
The only index type supported in this PR is `EMB_LIST_HNSW`, and the only
metric type is `MAX_SIM`.
Usage:
```python
from pymilvus import DataType, MilvusClient

milvus_client = MilvusClient("xxx:19530")
schema = milvus_client.create_schema(enable_dynamic_field=True, auto_id=True)
...
# Define the struct array field that holds the vector array sub-field.
struct_schema = milvus_client.create_struct_array_field_schema("struct_array_field")
...
struct_schema.add_field("struct_float_vec", DataType.ARRAY_OF_VECTOR, element_type=DataType.FLOAT_VECTOR, dim=128, max_capacity=1000)
...
schema.add_struct_array_field(struct_schema)
# Index the vector array with EMB_LIST_HNSW and the MAX_SIM metric.
index_params = milvus_client.prepare_index_params()
index_params.add_index(field_name="struct_float_vec", index_type="EMB_LIST_HNSW", metric_type="MAX_SIM", index_params={"nlist": 128})
...
# COLLECTION_NAME is the name of the target collection, defined elsewhere.
milvus_client.create_index(COLLECTION_NAME, schema=schema, index_params=index_params)
```
Note: This PR uses `Lims` to convey the offsets of each vector array to
knowhere. The vectors of multiple vector arrays are concatenated, so
offsets are needed to specify which vectors belong to which vector array.
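To illustrate the note above, here is a minimal sketch of how offsets can mark array boundaries in a concatenated buffer. The helper names (`concat_with_lims`, `vectors_of`) are hypothetical, and the prefix-offset layout with a leading 0 is an assumption for illustration, not necessarily the exact `Lims` layout knowhere expects.

```python
from typing import List, Tuple

Vector = List[float]


def concat_with_lims(vector_arrays: List[List[Vector]]) -> Tuple[List[Vector], List[int]]:
    """Concatenate vector arrays and record boundary offsets (assumed layout)."""
    flat: List[Vector] = []
    lims: List[int] = [0]  # lims[i]..lims[i+1] delimits the i-th vector array
    for arr in vector_arrays:
        flat.extend(arr)
        lims.append(len(flat))
    return flat, lims


def vectors_of(flat: List[Vector], lims: List[int], i: int) -> List[Vector]:
    """Recover the i-th vector array from the concatenated buffer."""
    return flat[lims[i]:lims[i + 1]]


rows = [
    [[0.0, 0.1], [0.2, 0.3]],              # row 0 holds 2 vectors
    [[1.0, 1.1]],                          # row 1 holds 1 vector
    [[2.0, 2.1], [2.2, 2.3], [2.4, 2.5]],  # row 2 holds 3 vectors
]
flat, lims = concat_with_lims(rows)
```

With these three rows, `lims` has one more entry than there are rows, and `vectors_of(flat, lims, i)` returns exactly the vectors of row `i`.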
---------
Signed-off-by: SpadeA <tangchenjie1210@gmail.com>
Signed-off-by: SpadeA-Tang <tangchenjie1210@gmail.com>
Add an `arrowBuild.Reserve` call to `ValueSerializer` to reduce repeated
buffer resizing when the write size is large
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
Ref https://github.com/milvus-io/milvus/issues/42148
https://github.com/milvus-io/milvus/pull/42406 implements the segcore part of
storage for handling VectorArray.
This PR:
1. Implements the Go part of storage for VectorArray
2. Implements collection creation with StructArrayField and VectorArray
3. Supports inserting data into and retrieving data from the collection
---------
Signed-off-by: SpadeA <tangchenjie1210@gmail.com>
Signed-off-by: SpadeA-Tang <tangchenjie1210@gmail.com>
Signed-off-by: SpadeA-Tang <u6748471@anu.edu.au>
This parameter determines whether the returned value should be a copy or
a reference from the arrow array. The updates enhance memory management
and provide more control over data handling during deserialization.
See #43186
---------
Signed-off-by: Ted Xu <ted.xu@zilliz.com>
Related to #43522
Currently, passing a partial schema to the storage v2 packed reader may
trigger a SEGV during the clustering compaction unit test.
This patch implements `NeededFields` differently in each `RecordReader`
implementation. For now, v2 implements it as a no-op; it will be
supported once the packed reader supports this API.
---------
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
This PR fills default values for the `PackedBinlogRecordWriter` timestamp
range so that the target segment meta contains the correct timestamp range
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
Correct read and buffer size to 64MB to prevent OOM during clustering
compaction.
issue: https://github.com/milvus-io/milvus/issues/43310
---------
Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
Related to #39173
This PR
- Close packed reader after sort
- Release `arrow.Record` to prevent memory leaks
- Invoke `pack_reader->Close()` for CloseReader
---------
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
Fix issues in end-to-end tests:
1. **Split column groups based on schema**, rather than estimating by
average chunk row size. **Ensure column group consistency within a
segment**, to avoid errors caused by loading multiple column group
chunks simultaneously.
2. **Use sorted segmentId** when generating the stats binlog path, to
ensure consistent and correct file path resolution.
3. **Determine field IDs as follows**:
For multi-column column groups, retrieve the field ID list from
metadata.
For single-column column groups, use the column group ID directly as the
field ID.
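The field ID rule in item 3 can be sketched as follows. The function name, the `column_count` argument, and the shape of `metadata` (a mapping from column group ID to field ID list) are hypothetical and only illustrate the rule as stated; they are not the actual storage v2 API.

```python
from typing import Dict, List


def field_ids_for_group(group_id: int, column_count: int,
                        metadata: Dict[int, List[int]]) -> List[int]:
    """Resolve the field IDs covered by one column group (illustrative only)."""
    if column_count > 1:
        # Multi-column group: the field ID list comes from metadata.
        return list(metadata[group_id])
    # Single-column group: the column group ID is itself the field ID.
    return [group_id]


multi = field_ids_for_group(0, 3, {0: [0, 1, 100]})
single = field_ids_for_group(101, 1, {})
```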
related: #39173
fix: #42862
---------
Signed-off-by: shaoting-huang <shaoting.huang@zilliz.com>
Related to #42856
The default value is missing after a segment gets sorted/compacted. This
PR is a temporary workaround, since in the long term default values should
be filled by the storage engine instead.
---------
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
Related to #39173
`null_bitmap_data()` returns a raw pointer to the Array's null bitmap.
After slicing, this bitmap is not rewritten because of the zero-copy
implementation, so the current start position may be non-zero when
`FillFieldData` generates the column's `valid_data` array.
This PR adds an `offset` param to the `FillFieldData` method and forces
every invocation to pass the correct offset into the `null_bitmap_data` ptr.
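A small pure-Python sketch of why the offset matters. It simulates an Arrow-style validity bitmap (least-significant bit first) rather than calling Arrow itself, and `valid_flags` is a hypothetical stand-in for the bitmap-reading part of `FillFieldData`.

```python
from typing import List


def valid_flags(bitmap: bytes, offset: int, length: int) -> List[bool]:
    """Read `length` validity bits starting at bit `offset` (LSB-first)."""
    flags = []
    for i in range(length):
        bit = offset + i
        flags.append(bool((bitmap[bit // 8] >> (bit % 8)) & 1))
    return flags


# Values [1, None, 3, None, 5]: bits 0, 2 and 4 are set.
bitmap = bytes([0b00010101])

# A zero-copy slice covering elements 1..3 (None, 3, None) shares this bitmap.
with_offset = valid_flags(bitmap, offset=1, length=3)     # honors the slice offset
without_offset = valid_flags(bitmap, offset=0, length=3)  # reads from bit 0 by mistake
```

Reading from bit 0 yields the validity of elements 0..2 instead of 1..3, which is the kind of mismatch the new `offset` param avoids.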
Also update the milvus-storage commit to fix the reader failing to return
data when the buffer size is smaller than the row group size.
---------
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
Related to #42723
The previous PR #42684 permits insert msg transformation, but insertCodec
did not adopt the same skip logic, which causes a panic.
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
issue: #42649
- The sync operations of different pchannels now run concurrently.
- Add an option to notify backlog clearing automatically.
- Make pulsar walimpls recoverable after the backlog is exceeded.
Signed-off-by: chyezh <chyezh@outlook.com>
Related to #41858 #41951 #42084
When the insert msg consumer (pipeline/flowgraph) has a newer schema than
the insertMsg, it has to adapt the insert msg that used the old schema
(missing the newly added field)
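A rough sketch of that adaptation on column-based data. The function name and data shapes are hypothetical, and filling the missing field with a default value or null is an assumption made for illustration; the commit itself does not spell out the fill policy.

```python
from typing import Any, Dict, List


def adapt_insert_columns(columns: Dict[str, List[Any]],
                         schema_fields: List[str],
                         defaults: Dict[str, Any],
                         num_rows: int) -> Dict[str, List[Any]]:
    """Align an insert message built with an old schema to a newer schema."""
    adapted: Dict[str, List[Any]] = {}
    for name in schema_fields:
        if name in columns:
            adapted[name] = columns[name]
        else:
            # Assumed policy: use the field default if one exists, else null.
            adapted[name] = [defaults.get(name)] * num_rows
    return adapted


old_msg = {"pk": [1, 2], "vec": [[0.1], [0.2]]}
new_schema = ["pk", "vec", "added_int", "added_str"]
adapted = adapt_insert_columns(old_msg, new_schema, {"added_int": 7}, num_rows=2)
```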
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
issue: #42028
- Limit the concurrency of zstd compression.
- zstd.go is modified from
`github.com/apache/arrow/go/v17/parquet/compress/zstd.go`
- May be related to #42129
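The idea of capping concurrent compression can be sketched with a semaphore. This is only an illustration: it uses the standard-library `zlib` as a stand-in for zstd, and the class name and the limit of 2 are hypothetical rather than taken from the PR.

```python
import threading
import zlib
from concurrent.futures import ThreadPoolExecutor


class LimitedCompressor:
    """Allow at most `max_concurrent` compressions to run at once."""

    def __init__(self, max_concurrent: int) -> None:
        self._sem = threading.BoundedSemaphore(max_concurrent)
        self._lock = threading.Lock()
        self._active = 0
        self.peak = 0  # highest number of simultaneous compressions observed

    def compress(self, data: bytes) -> bytes:
        with self._sem:  # blocks while the limit is reached
            with self._lock:
                self._active += 1
                self.peak = max(self.peak, self._active)
            try:
                return zlib.compress(data)  # stand-in for zstd
            finally:
                with self._lock:
                    self._active -= 1


compressor = LimitedCompressor(max_concurrent=2)
payloads = [bytes([i]) * 100_000 for i in range(8)]
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(compressor.compress, payloads))
```

Eight callers submit work, but the semaphore keeps the number of in-flight compressions at or below the limit, which bounds the memory held by compression contexts.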
Signed-off-by: chyezh <chyezh@outlook.com>
Related to #39173
The `nullable` flag is crucial for the serde logic of the v2 writer;
missing this flag causes a logic bug for v2 nullable data.
---------
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
Remove the hardcoded batchSize of 100,000 and instead trigger a write
every 64MB based on actual data size. This prevents sort stats from
generating excessively large binlog files.
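A minimal sketch of size-triggered flushing, as opposed to a fixed row-count batch. The class and callback names are hypothetical, and the example uses a tiny threshold instead of 64MB so the behavior is easy to follow.

```python
from typing import Callable, List


class SizeBasedBatcher:
    """Buffer records and flush once the buffered byte size hits a threshold."""

    def __init__(self, flush_threshold_bytes: int,
                 flush_fn: Callable[[List[bytes]], None]) -> None:
        self._threshold = flush_threshold_bytes
        self._flush_fn = flush_fn
        self._buffer: List[bytes] = []
        self._size = 0

    def write(self, record: bytes) -> None:
        self._buffer.append(record)
        self._size += len(record)
        if self._size >= self._threshold:
            self.flush()

    def flush(self) -> None:
        if self._buffer:
            self._flush_fn(self._buffer)
            self._buffer = []
            self._size = 0


batches: List[List[bytes]] = []
batcher = SizeBasedBatcher(flush_threshold_bytes=10, flush_fn=batches.append)
for _ in range(7):
    batcher.write(b"abcd")  # 4 bytes per record
batcher.flush()             # write out the remainder at the end
batch_sizes = [len(b) for b in batches]
```

Because the trigger is the accumulated byte size, wide rows flush after fewer records and narrow rows after more, so each written file stays near the target size regardless of row width.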
issue: https://github.com/milvus-io/milvus/issues/42400
---------
Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
Related to #39173
This PR
- Use updated path with bucketName for packedReader
- Update milvus-storage commit to report reader/writer initialization
failure, see also milvus-io/milvus-storage#192
---------
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
Related to #39173
This PR:
- Upgrade milvus-storage commit to fix filesystem finalized issue
- Add the bucket name as a prefix for all fs-style I/O access
- Initialize arrow fs on querynode startup
- Fix timestamp access when loading sealed segment
---------
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
After this PR is merged, insert, upsert, build index, query, and search
are supported on the added field.
These operations can only be performed on the added field after the add
field request completes, which is a synchronous operation.
Compaction will be supported in the next PR.
#39718
---------
Signed-off-by: lixinguo <xinguo.li@zilliz.com>
Co-authored-by: lixinguo <xinguo.li@zilliz.com>