relate: https://github.com/milvus-io/milvus/issues/43687
We used to run the temporary analyzer and validate analyzer on the
proxy, but the proxy should not be a computation-heavy node. This PR
move all analyzer calculations to the streaming node.
---------
Signed-off-by: aoiasd <zhicheng.yue@zilliz.com>
See #39173
In this PR:
- Adjusted the delta log serialization APIs.
- Refactored the stats collector to improve the collection and digest of
primary key and BM25 statistics.
- Introduced new tests for the delta log reader/writer and stats
collectors to ensure functionality and correctness.
---------
Signed-off-by: Ted Xu <ted.xu@zilliz.com>
Related to #44995
Added missing case for JSON data type in GetDefaultValue function to
properly retrieve default values for JSON fields. This prevents crashes
when enabling dynamic fields with default values during concurrent
insert operations.
Changes:
- Added JSON data type case in GetDefaultValue to return BytesData
- Added comprehensive tests for fillMissingFields covering JSON and
other data types with default values
- Added tests for nullable fields, required fields validation, and edge
cases
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
issue: #43427
This pr's main goal is merge #37417 to milvus 2.5 without conflicts.
# Main Goals
1. Create and describe collections with geospatial type
2. Insert geospatial data into the insert binlog
3. Load segments containing geospatial data into memory
4. Enable query and search can display geospatial data
5. Support using GIS funtions like ST_EQUALS in query
6. Support R-Tree index for geometry type
# Solution
1. **Add Type**: Modify the Milvus core by adding a Geospatial type in
both the C++ and Go code layers, defining the Geospatial data structure
and the corresponding interfaces.
2. **Dependency Libraries**: Introduce necessary geospatial data
processing libraries. In the C++ source code, use Conan package
management to include the GDAL library. In the Go source code, add the
go-geom library to the go.mod file.
3. **Protocol Interface**: Revise the Milvus protocol to provide
mechanisms for Geospatial message serialization and deserialization.
4. **Data Pipeline**: Facilitate interaction between the client and
proxy using the WKT format for geospatial data. The proxy will convert
all data into WKB format for downstream processing, providing column
data interfaces, segment encapsulation, segment loading, payload
writing, and cache block management.
5. **Query Operators**: Implement simple display and support for filter
queries. Initially, focus on filtering based on spatial relationships
for a single column of geospatial literal values, providing parsing and
execution for query expressions.Now only support brutal search
7. **Client Modification**: Enable the client to handle user input for
geospatial data and facilitate end-to-end testing.Check the modification
in pymilvus.
---------
Signed-off-by: Yinwei Li <yinwei.li@zilliz.com>
Signed-off-by: Cai Zhang <cai.zhang@zilliz.com>
Co-authored-by: ZhuXi <150327960+Yinwei-Yu@users.noreply.github.com>
issue: https://github.com/milvus-io/milvus/issues/27467
>My plan is as follows.
>- [x] M1 Create collection with timestamptz field
>- [x] M2 Insert timestamptz field data
>- [x] M3 Retrieve timestamptz field data
>- [x] M4 Implement handoff
>- [x] M5 Implement compare operator
>- [x] M6 Implement extract operator
>- [x] M8 Support database/collection level default timezone
>- [x] M7 Support STL-SORT index for datatype timestamptz
---
The third PR of issue: https://github.com/milvus-io/milvus/issues/27467,
which completes M5, M6, M7, M8 described above.
## M8 Default Timezone
We will be able to use alter_collection() and alter_database() in a
future Python SDK release to modify the default timezone at the
collection or database level.
For insert requests, the timezone will be resolved using the following
order of precedence: String Literal-> Collection Default -> Database
Default.
For retrieval requests, the timezone will be resolved in this order:
Query Parameters -> Collection Default -> Database Default.
In both cases, the final fallback timezone is UTC.
## M5: Comparison Operators
We can now use the following expression format to filter on the
timestamptz field:
- `timestamptz_field [+/- INTERVAL 'interval_string'] {comparison_op}
ISO 'iso_string' `
- The interval_string follows the ISO 8601 duration format, for example:
P1Y2M3DT1H2M3S.
- The iso_string follows the ISO 8601 timestamp format, for example:
2025-01-03T00:00:00+08:00.
- Example expressions: "tsz + INTERVAL 'P0D' != ISO
'2025-01-03T00:00:00+08:00'" or "tsz != ISO
'2025-01-03T00:00:00+08:00'".
## M6: Extract
We will be able to extract sepecific time filed by kwargs in a future
Python SDK release.
The key is `time_fields`, and value should be one or more of "year,
month, day, hour, minute, second, microsecond", seperated by comma or
space. Then the result of each record would be an array of int64.
## M7: Indexing Support
Expressions without interval arithmetic can be accelerated using an
STL-SORT index. However, expressions that include interval arithmetic
cannot be indexed. This is because the result of an interval calculation
depends on the specific timestamp value. For example, adding one month
to a date in February results in a different number of added days than
adding one month to a date in March.
---
After this PR, the input / output type of timestamptz would be iso
string. Timestampz would be stored as timestamptz data, which is int64_t
finally.
> for more information, see https://en.wikipedia.org/wiki/ISO_8601
---------
Signed-off-by: xtx <xtianx@smail.nju.edu.cn>
Related to #39173
backlog issue that memory size and log size shared same value. This
patch add `GetFileSize` api to get remote compressed binlog size as meta
log file size to calculate usage more accurate.
---------
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
Related to #39173
- Put bucket name concatenation logic back for azure support
This reverts commit 8f97eb355fde6b86cf37f166d2191750b4210ba3.
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
Related to #44257
This PR store split result in compaction segment binlog struct to make
querynode could utilize it later.
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
Related to #44257
This PR:
- Utilize configurable split policy for storage v2, enabling system
field policy
- Store split result in field binlog struct
- Adapt legacy binlog without child fields
---------
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
Related to #39173
This PR:
- Bump milvus-storage commit to handle bucket name concatenation logic
in multipart s3 fs
- Remove all user-side bucket name concatenation code
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
issue: https://github.com/milvus-io/milvus/issues/42148
Optimized from
Go VectorArray → VectorArray Proto → Binary → C++ VectorArray Proto →
C++ VectorArray local impl → Memory
to
Go VectorArray → Arrow ListArray → Memory
---------
Signed-off-by: SpadeA <tangchenjie1210@gmail.com>
https://github.com/milvus-io/milvus/issues/44011
this is to support compaction that sorts records by partition key and pk
in the future
---------
Signed-off-by: sunby <sunbingyi1992@gmail.com>
1. Enable Milvus to read cipher configs
2. Enable cipher plugin in binlog reader and writer
3. Add a testCipher for unittests
4. Support pooling for datanode
5. Add encryption in storagev2
See also: #40321
Signed-off-by: yangxuan <xuan.yang@zilliz.com>
---------
Signed-off-by: yangxuan <xuan.yang@zilliz.com>
Ref https://github.com/milvus-io/milvus/issues/42148
This PR supports create index for vector array (now, only for
`DataType.FLOAT_VECTOR`) and search on it.
The index type supported in this PR is `EMB_LIST_HNSW` and the metric
type is `MAX_SIM` only.
The way to use it:
```python
milvus_client = MilvusClient("xxx:19530")
schema = milvus_client.create_schema(enable_dynamic_field=True, auto_id=True)
...
struct_schema = milvus_client.create_struct_array_field_schema("struct_array_field")
...
struct_schema.add_field("struct_float_vec", DataType.ARRAY_OF_VECTOR, element_type=DataType.FLOAT_VECTOR, dim=128, max_capacity=1000)
...
schema.add_struct_array_field(struct_schema)
index_params = milvus_client.prepare_index_params()
index_params.add_index(field_name="struct_float_vec", index_type="EMB_LIST_HNSW", metric_type="MAX_SIM", index_params={"nlist": 128})
...
milvus_client.create_index(COLLECTION_NAME, schema=schema, index_params=index_params)
```
Note: This PR uses `Lims` to convey offsets of the vector array to
knowhere where vectors of multiple vector arrays are concatenated and we
need offsets to specify which vectors belong to which vector array.
---------
Signed-off-by: SpadeA <tangchenjie1210@gmail.com>
Signed-off-by: SpadeA-Tang <tangchenjie1210@gmail.com>
Add `arrowBuild.Reserve` call for `ValueSerializer` to reduce repeated
resizing buffer when write size is large
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
Ref https://github.com/milvus-io/milvus/issues/42148https://github.com/milvus-io/milvus/pull/42406 impls the segcore part of
storage for handling with VectorArray.
This PR:
1. impls the go part of storage for VectorArray
2. impls the collection creation with StructArrayField and VectorArray
3. insert and retrieve data from the collection.
---------
Signed-off-by: SpadeA <tangchenjie1210@gmail.com>
Signed-off-by: SpadeA-Tang <tangchenjie1210@gmail.com>
Signed-off-by: SpadeA-Tang <u6748471@anu.edu.au>
This parameter determines whether the returned value should be a copy or
a reference from the arrow array. The updates enhance memory management
and provide more control over data handling during deserialization.
See #43186
---------
Signed-off-by: Ted Xu <ted.xu@zilliz.com>
Related to #43522
Currently, passing partial schema to storage v2 packed reader may
trigger SEGV during clustering compaction unit test.
This patch implement `NeededFields` differently in each `RecordReader`
imlementation. For now, v2 will implemented as no-op. This will be
supported after packed reader support this API.
---------
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
This PR fill default value for `PackedBinlogRecordWriter` timestamp
range so target segment meta will contains correct timestamp range
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
Correct read and buffer size to 64MB to prevent OOM during clustering
compaction.
issue: https://github.com/milvus-io/milvus/issues/43310
---------
Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
Related to #39173
This PR
- Close packed reader after sort
- Release arrow.Record preventing memory leakage
- Invoke `pack_reader->Close()` for CloseReader
---------
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
Fix issues in end-to-end tests:
1. **Split column groups based on schema**, rather than estimating by
average chunk row size. **Ensure column group consistency within a
segment**, to avoid errors caused by loading multiple column group
chunks simultaneously.
2. **Use sorted segmentId** when generating the stats binlog path, to
ensure consistent and correct file path resolution.
3. **Determine field IDs as follows**:
For multi-column column groups, retrieve the field ID list from
metadata.
For single-column column groups, use the column group ID directly as the
field ID.
related: #39173fix: #42862
---------
Signed-off-by: shaoting-huang <shaoting.huang@zilliz.com>
Related to #42856
Default value will be missing after segment get sorted/compacted. This
PR is a temp workaround since in long term default value shall be filled
with storage engine instead.
---------
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
Related to #39173
`null_bitmap_data()` returns raw pointer of null bitmap of Array. While
after slicing, this bitmap is not rewritten due to zero copy
implementation, so the current start pos maybe non-zero while
FillFieldData generating column `valid_data` array.
This PR add `offset` param for `FillFieldData` method, and force all
invocation pass correct offset of `null_bitmap_data` ptr.
Also update milvus-storage commit fixing reader failed to return data
when buffer size smaller than row group size problem.
---------
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>