relate: https://github.com/milvus-io/milvus/issues/43687
We used to run the temporary analyzer and validate analyzer on the
proxy, but the proxy should not be a computation-heavy node. This PR
move all analyzer calculations to the streaming node.
---------
Signed-off-by: aoiasd <zhicheng.yue@zilliz.com>
Related to #44995
Added missing case for JSON data type in GetDefaultValue function to
properly retrieve default values for JSON fields. This prevents crashes
when enabling dynamic fields with default values during concurrent
insert operations.
Changes:
- Added JSON data type case in GetDefaultValue to return BytesData
- Added comprehensive tests for fillMissingFields covering JSON and
other data types with default values
- Added tests for nullable fields, required fields validation, and edge
cases
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
issue: #43427
This pr's main goal is merge #37417 to milvus 2.5 without conflicts.
# Main Goals
1. Create and describe collections with geospatial type
2. Insert geospatial data into the insert binlog
3. Load segments containing geospatial data into memory
4. Enable query and search can display geospatial data
5. Support using GIS funtions like ST_EQUALS in query
6. Support R-Tree index for geometry type
# Solution
1. **Add Type**: Modify the Milvus core by adding a Geospatial type in
both the C++ and Go code layers, defining the Geospatial data structure
and the corresponding interfaces.
2. **Dependency Libraries**: Introduce necessary geospatial data
processing libraries. In the C++ source code, use Conan package
management to include the GDAL library. In the Go source code, add the
go-geom library to the go.mod file.
3. **Protocol Interface**: Revise the Milvus protocol to provide
mechanisms for Geospatial message serialization and deserialization.
4. **Data Pipeline**: Facilitate interaction between the client and
proxy using the WKT format for geospatial data. The proxy will convert
all data into WKB format for downstream processing, providing column
data interfaces, segment encapsulation, segment loading, payload
writing, and cache block management.
5. **Query Operators**: Implement simple display and support for filter
queries. Initially, focus on filtering based on spatial relationships
for a single column of geospatial literal values, providing parsing and
execution for query expressions.Now only support brutal search
7. **Client Modification**: Enable the client to handle user input for
geospatial data and facilitate end-to-end testing.Check the modification
in pymilvus.
---------
Signed-off-by: Yinwei Li <yinwei.li@zilliz.com>
Signed-off-by: Cai Zhang <cai.zhang@zilliz.com>
Co-authored-by: ZhuXi <150327960+Yinwei-Yu@users.noreply.github.com>
issue: https://github.com/milvus-io/milvus/issues/27467
>My plan is as follows.
>- [x] M1 Create collection with timestamptz field
>- [x] M2 Insert timestamptz field data
>- [x] M3 Retrieve timestamptz field data
>- [x] M4 Implement handoff
>- [x] M5 Implement compare operator
>- [x] M6 Implement extract operator
>- [x] M8 Support database/collection level default timezone
>- [x] M7 Support STL-SORT index for datatype timestamptz
---
The third PR of issue: https://github.com/milvus-io/milvus/issues/27467,
which completes M5, M6, M7, M8 described above.
## M8 Default Timezone
We will be able to use alter_collection() and alter_database() in a
future Python SDK release to modify the default timezone at the
collection or database level.
For insert requests, the timezone will be resolved using the following
order of precedence: String Literal-> Collection Default -> Database
Default.
For retrieval requests, the timezone will be resolved in this order:
Query Parameters -> Collection Default -> Database Default.
In both cases, the final fallback timezone is UTC.
## M5: Comparison Operators
We can now use the following expression format to filter on the
timestamptz field:
- `timestamptz_field [+/- INTERVAL 'interval_string'] {comparison_op}
ISO 'iso_string' `
- The interval_string follows the ISO 8601 duration format, for example:
P1Y2M3DT1H2M3S.
- The iso_string follows the ISO 8601 timestamp format, for example:
2025-01-03T00:00:00+08:00.
- Example expressions: "tsz + INTERVAL 'P0D' != ISO
'2025-01-03T00:00:00+08:00'" or "tsz != ISO
'2025-01-03T00:00:00+08:00'".
## M6: Extract
We will be able to extract sepecific time filed by kwargs in a future
Python SDK release.
The key is `time_fields`, and value should be one or more of "year,
month, day, hour, minute, second, microsecond", seperated by comma or
space. Then the result of each record would be an array of int64.
## M7: Indexing Support
Expressions without interval arithmetic can be accelerated using an
STL-SORT index. However, expressions that include interval arithmetic
cannot be indexed. This is because the result of an interval calculation
depends on the specific timestamp value. For example, adding one month
to a date in February results in a different number of added days than
adding one month to a date in March.
---
After this PR, the input / output type of timestamptz would be iso
string. Timestampz would be stored as timestamptz data, which is int64_t
finally.
> for more information, see https://en.wikipedia.org/wiki/ISO_8601
---------
Signed-off-by: xtx <xtianx@smail.nju.edu.cn>
Ref https://github.com/milvus-io/milvus/issues/42148
This PR supports create index for vector array (now, only for
`DataType.FLOAT_VECTOR`) and search on it.
The index type supported in this PR is `EMB_LIST_HNSW` and the metric
type is `MAX_SIM` only.
The way to use it:
```python
milvus_client = MilvusClient("xxx:19530")
schema = milvus_client.create_schema(enable_dynamic_field=True, auto_id=True)
...
struct_schema = milvus_client.create_struct_array_field_schema("struct_array_field")
...
struct_schema.add_field("struct_float_vec", DataType.ARRAY_OF_VECTOR, element_type=DataType.FLOAT_VECTOR, dim=128, max_capacity=1000)
...
schema.add_struct_array_field(struct_schema)
index_params = milvus_client.prepare_index_params()
index_params.add_index(field_name="struct_float_vec", index_type="EMB_LIST_HNSW", metric_type="MAX_SIM", index_params={"nlist": 128})
...
milvus_client.create_index(COLLECTION_NAME, schema=schema, index_params=index_params)
```
Note: This PR uses `Lims` to convey offsets of the vector array to
knowhere where vectors of multiple vector arrays are concatenated and we
need offsets to specify which vectors belong to which vector array.
---------
Signed-off-by: SpadeA <tangchenjie1210@gmail.com>
Signed-off-by: SpadeA-Tang <tangchenjie1210@gmail.com>
Ref https://github.com/milvus-io/milvus/issues/42148https://github.com/milvus-io/milvus/pull/42406 impls the segcore part of
storage for handling with VectorArray.
This PR:
1. impls the go part of storage for VectorArray
2. impls the collection creation with StructArrayField and VectorArray
3. insert and retrieve data from the collection.
---------
Signed-off-by: SpadeA <tangchenjie1210@gmail.com>
Signed-off-by: SpadeA-Tang <tangchenjie1210@gmail.com>
Signed-off-by: SpadeA-Tang <u6748471@anu.edu.au>
Related to #41858#41951#42084
When insert msg consumer (pipeline/flowgraph) have newer schema than
insertMsg, it have to adapter the insert msg used old schema(missing
newly added field)
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
Related to #39173
`nullable` flag is crucial for serde logic of v2 writer, missing this
flag causes logic bug for v2 nullalbe data.
---------
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
issue: https://github.com/milvus-io/milvus/issues/39818
This PR mimics Varchar data type, allows insert, search, query, delete,
full-text search and others.
Functionalities related to filter expressions are disabled temporarily.
Storage changes for Text data type will be in the following PRs.
Signed-off-by: Patrick Weizhi Xu <weizhi.xu@zilliz.com>
See: #39697
In sync operations, the type conversions from message to insert data
always result in a memory copy, which is not necessary if the converting
type is identical.
Signed-off-by: Ted Xu <ted.xu@zilliz.com>
issue:https://github.com/milvus-io/milvus/issues/27576
# Main Goals
1. Create and describe collections with geospatial fields, enabling both
client and server to recognize and process geo fields.
2. Insert geospatial data as payload values in the insert binlog, and
print the values for verification.
3. Load segments containing geospatial data into memory.
4. Ensure query outputs can display geospatial data.
5. Support filtering on GIS functions for geospatial columns.
# Solution
1. **Add Type**: Modify the Milvus core by adding a Geospatial type in
both the C++ and Go code layers, defining the Geospatial data structure
and the corresponding interfaces.
2. **Dependency Libraries**: Introduce necessary geospatial data
processing libraries. In the C++ source code, use Conan package
management to include the GDAL library. In the Go source code, add the
go-geom library to the go.mod file.
3. **Protocol Interface**: Revise the Milvus protocol to provide
mechanisms for Geospatial message serialization and deserialization.
4. **Data Pipeline**: Facilitate interaction between the client and
proxy using the WKT format for geospatial data. The proxy will convert
all data into WKB format for downstream processing, providing column
data interfaces, segment encapsulation, segment loading, payload
writing, and cache block management.
5. **Query Operators**: Implement simple display and support for filter
queries. Initially, focus on filtering based on spatial relationships
for a single column of geospatial literal values, providing parsing and
execution for query expressions.
6. **Client Modification**: Enable the client to handle user input for
geospatial data and facilitate end-to-end testing.Check the modification
in pymilvus.
---------
Signed-off-by: tasty-gumi <1021989072@qq.com>
fix not append valid data when transfer to insert record and add a tiny
check when in groupBy field.
#35924
Signed-off-by: lixinguo <xinguo.li@zilliz.com>
Co-authored-by: lixinguo <xinguo.li@zilliz.com>
issue: #34357
Go Parquet uses dictionary encoding by default, and it will fall back to
plain encoding if the dictionary size exceeds the dictionary size page
limit. Users can specify custom fallback encoding by using
`parquet.WithEncoding(ENCODING_METHOD)` in writer properties. However,
Go Parquet [fallbacks to plain
encoding](e65c1e295d/go/parquet/file/column_writer_types.gen.go.tmpl (L238))
rather than custom encoding method users provide. Therefore, this patch
only turns off dictionary encoding for the primary key.
With a 5 million auto ID primary key benchmark, the parquet file size
improves from 13.93 MB to 8.36 MB when dictionary encoding is turned
off, reducing primary key storage space by 40%.
Signed-off-by: shaoting-huang <shaoting.huang@zilliz.com>
add sparse float vector support to different milvus components,
including proxy, data node to receive and write sparse float vectors to
binlog, query node to handle search requests, index node to build index
for sparse float column, etc.
https://github.com/milvus-io/milvus/issues/29419
---------
Signed-off-by: Buqian Zheng <zhengbuqian@gmail.com>
fix: #29757
In previous code, `ColumnBasedInsertMsgToInsertData` adds empty field if
the insertMsg parameter does not have the column schema defined. This
may lead to unexpected behavior of caller functions.
This PR:
- Add column missing check
- Add column length check
- Generate BlobInfo for ColumnBasedInsertMsgToInsertData result
---------
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
Benchmark Milvus with https://github.com/qdrant/vector-db-benchmark and
specify the datasets as 'deep-image-96-angular'. Meanwhile, do perf
profiling during 'upload + index' stage of vector-db-benchmark and see
the following hot spots.
39.59%--github.com/milvus-io/milvus/internal/storage.MergeInsertData
|
|--21.43%--github.com/milvus-io/milvus/internal/storage.MergeFieldData
| |
| |--17.22%--runtime.memmove
| |
| |--1.53%--asm_exc_page_fault
| ......
|
|--18.16%--runtime.memmove
|
|--1.66%--asm_exc_page_fault
......
The hot code path is in storage.MergeInsertData() which updates
buffer.buffer by creating a new 'InsertData' instance and merging both
the old buffer.buffer and addedBuffer into it. When it calls golang
runtime.memmove to move buffer.buffer which is with big size (>1M), the
hot spots appear.
To avoid the above overhead, update storage.MergeInsertData() by
appending addedBuffer to buffer.buffer, instead of moving buffer.buffer
and addedBuffer to a new 'InsertData'. This change removes the hot spots
'runtime.memmove' from perf profiling output. Additionally, the 'upload
+ index' time, which is one performance metric of vector-db-benchmark,
is reduced around 60% with this change.
Signed-off-by: Cathy Zhang <cathy.zhang@intel.com>