1827 Commits

Author SHA1 Message Date
congqixia
22098c1785
fix: add null check for packed_writer_ in JsonStatsParquetWriter::Close() (#45158)
Related to #45157

Fix a bug where DataNode panics when building json stats index throws an
exception before the writer is initialized. The destructor would call
Close() on an uninitialized packed_writer_ pointer, causing a null
pointer dereference.

Changes:
- Add null check for packed_writer_ before calling Flush() and Close()
- Prevents null pointer dereference in edge cases
- Ignore close status as this is a cleanup operation

This ensures safe cleanup even when initialization fails due to
exceptions.
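
A minimal sketch of the guarded cleanup described above. `FakePackedWriter` and `StatsWriterSketch` are hypothetical stand-ins for illustration, not the real milvus classes; only the null check and the ignored close status mirror the commit message.

```cpp
#include <cassert>
#include <memory>
#include <stdexcept>

// Hypothetical stand-in for the underlying packed writer.
struct FakePackedWriter {
    void Flush() {}
    bool Close() { return true; }
};

// Hypothetical owner whose Init() may throw before packed_writer_ is set.
class StatsWriterSketch {
 public:
    void Init(bool fail) {
        assert(!packed_writer_ && "Init must not run twice");
        if (fail) throw std::runtime_error("init failed");
        packed_writer_ = std::make_unique<FakePackedWriter>();
    }

    void Close() {
        if (!packed_writer_) return;    // never initialized: nothing to clean up
        packed_writer_->Flush();
        (void)packed_writer_->Close();  // cleanup path: status deliberately ignored
        packed_writer_.reset();
    }

    bool initialized() const { return packed_writer_ != nullptr; }

    // The destructor can now run safely even when Init() threw.
    ~StatsWriterSketch() { Close(); }

 private:
    std::unique_ptr<FakePackedWriter> packed_writer_;
};
```

With this shape, an object whose `Init()` failed can still be destroyed or closed without dereferencing a null pointer.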

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2025-10-30 17:40:09 +08:00
cqy123456
35d8213a00
fix: fail to mmap emb_list_meta in embedding list (#45127)
issue: https://github.com/milvus-io/milvus/issues/44965

Signed-off-by: cqy123456 <qianya.cheng@zilliz.com>
2025-10-30 11:00:09 +08:00
aoiasd
ad9a0cae48
enhance: add global analyzer options (#44684)
relate: https://github.com/milvus-io/milvus/issues/43687
Add global analyzer options to avoid having to merge some milvus params
into the user's analyzer params.

Signed-off-by: aoiasd <zhicheng.yue@zilliz.com>
2025-10-28 14:52:10 +08:00
congqixia
fd0ef09e97
fix: Handle all-null data in StringIndexSort to prevent load timeout (#45100)
Related to #45081

StringIndexSort now properly handles collections with all-null string
fields by:
- Removing the error thrown when unique_count is 0 in ParseBinaryData
- Adding alignment and padding support in mmap serialization (similar to
ScalarIndexSort)
- Separating data_size_ from mmap_size_ to correctly parse data without
reading padding

This fixes load collection timeout failures when all string field data
is null, particularly affecting STL_SORT and TRIE index types.
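
The split between data size and mapped size can be sketched as follows. `MmapLayoutSketch`, `PlanLayout`, and `BytesToParse` are hypothetical names, and the round-up-to-alignment rule is an assumption about how padding is computed, not the actual StringIndexSort code.

```cpp
#include <cassert>
#include <cstddef>

struct MmapLayoutSketch {
    std::size_t data_size;  // bytes that carry index content
    std::size_t mmap_size;  // bytes written to the mapped file, padding included
};

// Rounds the payload up to the next multiple of `alignment` (must be > 0).
MmapLayoutSketch PlanLayout(std::size_t payload_bytes, std::size_t alignment) {
    assert(alignment > 0);
    const std::size_t padded =
        (payload_bytes + alignment - 1) / alignment * alignment;
    return MmapLayoutSketch{payload_bytes, padded};
}

// A parser stops at data_size so it never interprets padding as content.
std::size_t BytesToParse(const MmapLayoutSketch& layout) {
    return layout.data_size;
}
```

An all-null field simply yields a payload with zero unique values; the layout is still valid, and the parser reads exactly `data_size` bytes.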

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2025-10-27 18:04:09 +08:00
congqixia
36a887b38b
enhance: add NewSegmentWithLoadInfo API to support segment self-managed loading (#45061)
This commit introduces the foundation for enabling segments to manage
their own loading process by passing load information during segment
creation.

Changes:

C++ Layer:
- Add NewSegmentWithLoadInfo() C API to create segments with serialized
load info
- Add SetLoadInfo() method to SegmentInterface for storing load
information
- Refactor segment creation logic into shared CreateSegment() helper
function
- Add comprehensive documentation for the new API

Go Layer:
- Extend CreateCSegmentRequest to support optional LoadInfo field
- Update segment creation in querynode to pass SegmentLoadInfo when
available
- Add ConvertToSegcoreSegmentLoadInfo() and helper converters for proto
translation

Proto Definitions:
- Add segcorepb.SegmentLoadInfo message with essential loading metadata
- Add supporting messages: Binlog, FieldBinlog, FieldIndexInfo,
TextIndexStats, JsonKeyStats
- Remove dependency on data_coord.proto by creating segcore-specific
definitions

Testing:
- Add comprehensive unit tests for proto conversion functions
- Test edge cases including nil inputs, empty data, and nil array/map
elements

This is the first step toward issue #45060 - enabling segments to
autonomously manage their loading process, which will:
- Clarify responsibilities between Go and C++ layers
- Reduce cross-language call overhead
- Enable precise resource management at the C++ level
- Support better integration with caching layer
- Enable proactive schema evolution handling

Related to #45060

---------

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2025-10-27 15:28:12 +08:00
congqixia
7c627260f3
enhance: Optimize ScalarIndexSort bitmap initialization for range queries (#45085)
Optimize bitmap initialization in ScalarIndexSort range queries by using
adaptive strategy based on result density. When more than 50% of
elements match the range condition, initialize bitmap with all true
values and clear non-matching elements. Otherwise, use the original
approach of initializing with false and setting matching elements. Also
defer bitmap allocation until after early return checks to avoid
unnecessary memory allocation.

This optimization reduces bit operations for high-selectivity queries
while maintaining the same performance for low-selectivity queries.
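
The adaptive strategy can be sketched as below. This is an illustration under assumptions: `Entry` and `RangeBitmap` are hypothetical names, the index is modeled as a vector of (value, row) pairs sorted by value, and `std::vector<bool>` stands in for the real bitmap type.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct Entry {
    int value;
    std::size_t row;  // row offset in the segment
};

// Returns one flag per row: true when lo <= value <= hi.
std::vector<bool> RangeBitmap(const std::vector<Entry>& sorted, int lo, int hi) {
    auto lb = std::lower_bound(sorted.begin(), sorted.end(), lo,
                               [](const Entry& e, int v) { return e.value < v; });
    auto ub = std::upper_bound(sorted.begin(), sorted.end(), hi,
                               [](int v, const Entry& e) { return v < e.value; });
    const std::size_t n = sorted.size();
    // Early exit is decided before a strategy (and its fill value) is chosen.
    if (lb >= ub) return std::vector<bool>(n, false);

    const std::size_t hits = static_cast<std::size_t>(ub - lb);
    if (hits * 2 > n) {
        // Dense result: start from all true and clear the non-matching rows.
        std::vector<bool> bitmap(n, true);
        for (auto it = sorted.begin(); it != lb; ++it) bitmap[it->row] = false;
        for (auto it = ub; it != sorted.end(); ++it) bitmap[it->row] = false;
        return bitmap;
    }
    // Sparse result: start from all false and set the matching rows.
    std::vector<bool> bitmap(n, false);
    for (auto it = lb; it != ub; ++it) bitmap[it->row] = true;
    return bitmap;
}
```

Either branch touches at most half of the entries one bit at a time, which is where the saving for dense results comes from.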

---------

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2025-10-27 10:08:06 +08:00
Buqian Zheng
c284e8c4a8
enhance: some minor code cleanup, prepare for scalar benchmark (#45008)
issue: https://github.com/milvus-io/milvus/issues/44452

---------

Signed-off-by: Buqian Zheng <zhengbuqian@gmail.com>
2025-10-24 14:22:05 +08:00
congqixia
199f6d936e
fix: Update milvus-storage to fix duplicate AWS SDK initialization (#45062)
Update milvus-storage version from aa189ad to e5f5b4c to include the fix
for duplicate AWS SDK initialization that was causing init conflicts.

This update removes the redundant arrow::fs::InitializeS3() call that
was resulting in duplicate Aws::InitAPI() initialization. The duplicate
initialization was causing AWS SDK to ignore custom configurations,
particularly affecting GCP Workload Identity authentication.

Changes in milvus-storage e5f5b4c:
- Remove redundant arrow::fs::InitializeS3() call
- Keep only the extended S3 initialization with custom AWS SDK options
- Ensure GCP IAM authentication via custom HTTP client factory works
correctly

Relates to #44745
Reference: milvus-io/milvus-storage#285

---------

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2025-10-24 11:32:05 +08:00
Buqian Zheng
22995cea3f
fix: Remove debug logging from JsonFlatIndex (#44807)
issue: https://github.com/milvus-io/milvus/issues/44452

Signed-off-by: Buqian Zheng <zhengbuqian@gmail.com>
Co-authored-by: buqian.zheng <buqian.zheng@zilliz.com>
2025-10-23 16:08:06 +08:00
Bingyi Sun
52270701ce
feat: use namespace skip index when search (#44888)
issue: #44011

---------

Signed-off-by: sunby <sunbingyi1992@gmail.com>
2025-10-23 12:04:04 +08:00
Spade A
6077178553
enhance: enable STL_SORT to support VARCHAR (#44401)
issue: https://github.com/milvus-io/milvus/issues/44399

This PR implements STL_SORT for VARCHAR data type for both RAM and MMAP
mode.
The general idea is to deduplicate field values and maintain a
posting list for each unique value.

The serialization format of the index is:
```
[unique_count][string_offsets][string_data][post_list_offsets][post_list_data][magic_code]
string_offsets: array of offsets into string_data section
string_data: str_len1, str1, str_len2, str2, ...
post_list_offsets: array of offsets into post_list_data section
post_list_data: post_list_len1, row_id1, row_id2, ..., post_list_len2, row_id1, row_id2, ...
```
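
The deduplication and posting-list layout can be sketched in memory as follows. All names here are hypothetical, and 32-bit element offsets are an assumption; the real on-disk field widths are not stated in the commit message.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

struct StringSortIndexSketch {
    std::vector<std::string> values;               // unique values, ascending
    std::vector<std::uint32_t> post_list_offsets;  // start of each list in post_list_data
    std::vector<std::uint32_t> post_list_data;     // len, row_id, ..., len, row_id, ...
};

StringSortIndexSketch BuildIndex(const std::vector<std::string>& rows) {
    // std::map keeps keys sorted, so iteration yields unique values in order.
    std::map<std::string, std::vector<std::uint32_t>> grouped;
    for (std::size_t row = 0; row < rows.size(); ++row) {
        grouped[rows[row]].push_back(static_cast<std::uint32_t>(row));
    }
    StringSortIndexSketch index;
    for (const auto& entry : grouped) {
        index.values.push_back(entry.first);
        index.post_list_offsets.push_back(
            static_cast<std::uint32_t>(index.post_list_data.size()));
        index.post_list_data.push_back(
            static_cast<std::uint32_t>(entry.second.size()));
        index.post_list_data.insert(index.post_list_data.end(),
                                    entry.second.begin(), entry.second.end());
    }
    return index;
}

// Returns the row ids whose value equals `key`, or an empty vector when absent.
std::vector<std::uint32_t> Lookup(const StringSortIndexSketch& index,
                                  const std::string& key) {
    auto it = std::lower_bound(index.values.begin(), index.values.end(), key);
    if (it == index.values.end() || *it != key) return {};
    const std::uint32_t start = index.post_list_offsets[it - index.values.begin()];
    const std::uint32_t len = index.post_list_data[start];
    return std::vector<std::uint32_t>(
        index.post_list_data.begin() + start + 1,
        index.post_list_data.begin() + start + 1 + len);
}
```

An equality lookup is then one binary search over the unique values followed by a single read of the matching posting list.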

---------

Signed-off-by: SpadeA <tangchenjie1210@gmail.com>
2025-10-23 11:00:05 +08:00
cai.zhang
3d11ba06ef
fix: Double check to avoid using an iter that has been erased by another thread (#45013)
issue: #44974

---------

Signed-off-by: Cai Zhang <cai.zhang@zilliz.com>
2025-10-21 23:36:04 +08:00
zhagnlu
730308b1eb
fix: fix not equal not include None (#44959)
#44816

Signed-off-by: luzhang <luzhang@zilliz.com>
Co-authored-by: luzhang <luzhang@zilliz.com>
2025-10-21 17:08:03 +08:00
cai.zhang
b23d75a032
fix: Fix bug for gis function to filter geometry (#44966)
issue: #44961 

This PR fixes 3 geometry related bugs:
1. Implement `ToString` interface for GisFunctionFilter.
2. Ignore GisFunctionFilter `MoveCursor` for growing segment.
3. Don't skip null geometries when building the R-Tree index; they should
be recorded in null_offsets.

---------

Signed-off-by: Cai Zhang <cai.zhang@zilliz.com>
2025-10-21 09:52:04 +08:00
cai.zhang
a35a3b7c69
fix: Ensure fulfill promise when CreateArrowFileSystem throw an exception (#44975)
issue: #44974

---------

Signed-off-by: Cai Zhang <cai.zhang@zilliz.com>
2025-10-20 23:32:03 +08:00
zhagnlu
05df48fbe4
fix: remove duplicated '/' in jsonstats path (#44939)
#44950

Signed-off-by: luzhang <luzhang@zilliz.com>
Co-authored-by: luzhang <luzhang@zilliz.com>
2025-10-20 14:06:03 +08:00
Zhen Ye
f98d02b3e1
fix: use short debug string to avoid newline in debug logs (#44925)
issue: #44924

Signed-off-by: chyezh <chyezh@outlook.com>
2025-10-20 10:16:03 +08:00
Bingyi Sun
3ddf9154ab
fix: Fix exists expr for json flat index (#44910)
issue: https://github.com/milvus-io/milvus/issues/44915

Signed-off-by: sunby <sunbingyi1992@gmail.com>
2025-10-19 19:46:07 +08:00
congqixia
27dbb8e75d
fix: support JSON default value in CreateArrowScalarFromDefaultValue (#44912)
Related to #44897

Add missing JSON data type handling in CreateArrowScalarFromDefaultValue
to fix query failures when dynamic fields are enabled. JSON default
values are now properly converted to arrow::BinaryScalar using
bytes_data().

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2025-10-17 18:22:00 +08:00
cai.zhang
b0f642fb4c
fix: Fix the geometry return POINT(0 0) when growing mmap is enabled (#44889)
issue: #44802 

After a Geometry object is serialized into WKB, the resulting binary may
contain '\0' bytes.
When growing mmap is enabled, the append data logic uses strcpy, which
stops copying at the first '\0' byte.
As a result, only part of the WKB (typically the portion up to the
geometry type field) is copied, leading to corrupted data.
During parsing, all POINT geometries are then incorrectly
interpreted as POINT(0 0).

To fix this issue, memcpy will be used instead of strcpy.
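
The truncation can be reproduced with a small sketch. The helper names are hypothetical, and the encoder assumes a little-endian host so that the raw bytes match little-endian WKB for a 2D point (1 byte order flag, a 4-byte geometry type, then two 8-byte doubles).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Builds the 21-byte WKB of POINT(x y); assumes a little-endian host.
std::vector<char> PointWkb(double x, double y) {
    std::vector<char> wkb(21);
    wkb[0] = 1;                    // 1 = little-endian byte order
    const std::uint32_t type = 1;  // 1 = Point
    std::memcpy(wkb.data() + 1, &type, sizeof(type));
    std::memcpy(wkb.data() + 5, &x, sizeof(x));
    std::memcpy(wkb.data() + 13, &y, sizeof(y));
    return wkb;
}

// Number of bytes a strcpy-style copy would transfer before hitting '\0'.
std::size_t BytesCopiedByStrcpy(const std::vector<char>& src) {
    const void* nul = std::memchr(src.data(), '\0', src.size());
    if (nul == nullptr) return src.size();
    return static_cast<std::size_t>(static_cast<const char*>(nul) - src.data());
}

// A length-based copy keeps every byte, embedded zeros included.
std::vector<char> CopyWithMemcpy(const std::vector<char>& src) {
    std::vector<char> dst(src.size());
    std::memcpy(dst.data(), src.data(), src.size());
    return dst;
}
```

Here the geometry type `1` is stored as bytes `01 00 00 00`, so a string copy ends inside the type field and never reaches the coordinates.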

Signed-off-by: Cai Zhang <cai.zhang@zilliz.com>
2025-10-17 17:10:11 +08:00
zhagnlu
b7935557e1
fix: unify json exists path semantics (#44916)
#44927

Signed-off-by: luzhang <luzhang@zilliz.com>
Co-authored-by: luzhang <luzhang@zilliz.com>
2025-10-17 16:40:02 +08:00
zhagnlu
ae19c93c14
enhance: remove timestamp filter for search_ids to optimize performance (#44634)
#44352

Signed-off-by: luzhang <luzhang@zilliz.com>
Co-authored-by: luzhang <luzhang@zilliz.com>
2025-10-17 16:10:01 +08:00
sparknack
4bd30a74ca
enhance: cachinglayer: add mmap and eviction support for TextMatchIndex (#44806)
issue: #41435, #44502

Signed-off-by: Shawn Wang <shawn.wang@zilliz.com>
2025-10-17 14:42:02 +08:00
Bingyi Sun
633cae9461
enhance: add namespace for query and search request (#44343)
issue: #44011

---------

Signed-off-by: sunby <sunbingyi1992@gmail.com>
2025-10-16 17:52:01 +08:00
congqixia
684018ca7b
fix: ensure deterministic search result ordering when scores are equal (#44870)
Related to #44819
This fix addresses an issue(#44819) where the offset parameter did not
work correctly during searches when multiple results had identical
scores. The problem occurred because results with equal scores were not
consistently ordered, leading to unpredictable pagination behavior.

The solution adds a new sorting step (SortEqualScoresByPks) in the
reduce phase that sorts results with identical scores by their primary
keys in ascending order. This ensures deterministic ordering and enables
proper offset functionality.

Changes:
- Add SortEqualScoresByPks() to sort results with equal scores by PK
- Add SortEqualScoresOneNQ() to handle per-query sorting logic
- Invoke sorting step after FillPrimaryKey() in Reduce() workflow
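
A minimal sketch of the tie-breaking step, assuming the hits of one query are already ordered by score. `Hit` and `SortEqualScoresByPk` are hypothetical names for illustration and do not reproduce the actual reduce code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <iterator>
#include <vector>

struct Hit {
    float score;
    std::int64_t pk;
};

// Reorders each run of equal scores by ascending primary key.
// Precondition: `hits` is already ordered by score.
void SortEqualScoresByPk(std::vector<Hit>& hits) {
    auto run_begin = hits.begin();
    while (run_begin != hits.end()) {
        const float score = run_begin->score;
        // Start searching at the next element so every run has length >= 1.
        auto run_end = std::find_if(std::next(run_begin), hits.end(),
                                    [score](const Hit& h) { return h.score != score; });
        std::sort(run_begin, run_end,
                  [](const Hit& a, const Hit& b) { return a.pk < b.pk; });
        run_begin = run_end;
    }
}
```

Because ties now resolve the same way on every request, skipping the first `offset` results always skips the same rows.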

---------

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2025-10-16 10:04:00 +08:00
Bingyi Sun
26d06c6340
feat: load skip index using parquet statistics (#44252)
#44011

---------

Signed-off-by: sunby <sunbingyi1992@gmail.com>
2025-10-15 19:16:00 +08:00
cqy123456
822588302a
enhance: embedding_list support mmap in MemVectorIndex (#44764)
issue: https://github.com/milvus-io/milvus/issues/44702

Signed-off-by: cqy123456 <qianya.cheng@zilliz.com>
2025-10-15 15:22:00 +08:00
Spade A
c4f3f0ce4c
feat: impl StructArray -- support more types of vector in STRUCT (#44736)
ref: https://github.com/milvus-io/milvus/issues/42148

---------

Signed-off-by: SpadeA <tangchenjie1210@gmail.com>
Signed-off-by: SpadeA-Tang <tangchenjie1210@gmail.com>
2025-10-15 10:25:59 +08:00
Spade A
b8df1c0cc5
enhance: improve observability in trace for segcore scalar expression (#44260)
Ref https://github.com/milvus-io/milvus/issues/44259

This PR connects the trace between go and segcore, and adds full traces
for the scalar expression call chain:
<img width="2418" height="960" alt="image"
src="https://github.com/user-attachments/assets/8cad69d7-bcb7-4002-a4e3-679a3641e229"
/>
<img width="2452" height="850" alt="image"
src="https://github.com/user-attachments/assets/8b44aed0-0f03-48a7-baa0-b022fee994ce"
/>
<img width="2403" height="707" alt="image"
src="https://github.com/user-attachments/assets/cd6f0601-0d5c-4087-8ed8-2385f1bc740b"
/>

---------

Signed-off-by: SpadeA <tangchenjie1210@gmail.com>
2025-10-14 17:15:59 +08:00
Bingyi Sun
6cb1f7d7c6
enhance: optimize the performance of bitmap reverse lookup (#44804)
Signed-off-by: sunby <sunbingyi1992@gmail.com>
2025-10-14 11:57:58 +08:00
zhagnlu
2f178f810f
fix: fix json_contains(path, int) bug (#44814)
#44816

Signed-off-by: luzhang <luzhang@zilliz.com>
Co-authored-by: luzhang <luzhang@zilliz.com>
2025-10-14 00:19:59 +08:00
sparknack
df6a4dc1a0
fix: cachinglayer: avoid eviction during json handling (#44812)
issue: #44797

Signed-off-by: Shawn Wang <shawn.wang@zilliz.com>
2025-10-13 22:07:58 +08:00
aoiasd
1b17e16fc7
fix: expr filter return wrong result when skipped (#44778)
relate: https://github.com/milvus-io/milvus/issues/44777
A skipped expression should return a result of false, but the filter
currently returns valid[0], which is almost always true.

Signed-off-by: aoiasd <zhicheng.yue@zilliz.com>
2025-10-13 18:33:59 +08:00
zhagnlu
3dd5deb70a
fix: disable shredding when json_path contains digits (#44724)
#44132

Signed-off-by: luzhang <luzhang@zilliz.com>
Co-authored-by: luzhang <luzhang@zilliz.com>
2025-10-13 17:25:59 +08:00
sparknack
c8a4d6e2ef
enhance: add cachinglayer management for TextMatchIndex (#44741)
issue: #41435, #44502

Signed-off-by: Shawn Wang <shawn.wang@zilliz.com>
2025-10-13 14:37:58 +08:00
aoiasd
09865a5da5
fix: BM25 with boost return result not ordered. (#44744)
relate: https://github.com/milvus-io/milvus/issues/44758
The comparison should be `(result.seg_offsets_[i] >= 0 &&
result.seg_offsets_[j] < 0)`, but the code had `(result.seg_offsets_[j] >= 0 &&
result.seg_offsets_[j] < 0)`.
The bug was hidden because every placeholder (offset -1) is filled with the
worst distance value.
For IP, L2, or COSINE that value is +inf or -inf, so sorting by distance
alone was enough.
With BM25 the value is NaN, which leaves the sorted results out of order.
Signed-off-by: aoiasd <zhicheng.yue@zilliz.com>
2025-10-11 17:17:58 +08:00
congqixia
5ece760d73
fix: Pass fs via FileManagerContext when loading index (#44733)
Related to #44615

---------

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2025-10-11 09:55:57 +08:00
sparknack
7e750190b6
enhance: add a size getter for tantivy inverted index (#44609)
issue: #41435

---------

Signed-off-by: Shawn Wang <shawn.wang@zilliz.com>
2025-10-10 17:43:57 +08:00
congqixia
8a443c699e
fix: Make aws credential provider singleton (#44687)
Related to #44647

This patch makes milvus-storage use a singleton credential provider to
avoid a data race when concurrent index build tasks are received.

See also milvus-io/milvus-storage#44647

---------

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2025-10-09 16:11:58 +08:00
congqixia
1d85b83215
enhance: [backlog] Fix unittest and remove fs fallback logic (#44615)
Related to #44535

This PR:
- Fix the unittest creating `DiskFileManagerImpl` without `filesystem`
- Add comments for methods that need `fs_`
- Remove fallback creation and add assertion for nullptr fs

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2025-10-09 10:41:57 +08:00
cai.zhang
9d1bb8497c
fix: Get R-Tree index correct for growing segment (#44612)
issue: #43427 

The R-Tree index covers the entire segment, not a single chunk.

Signed-off-by: Cai Zhang <cai.zhang@zilliz.com>
2025-09-29 21:34:54 +08:00
cai.zhang
aecb46a08b
fix: Skip empty loop for process growing segment (#44606)
issue: #43427 

The GISFunction asserts that the segment_offsets cannot be nullptr. When
size is 0, the segment_offsets is nullptr, so the loop is skipped.

Signed-off-by: Cai Zhang <cai.zhang@zilliz.com>
2025-09-29 21:15:05 +08:00
cai.zhang
19346fa389
feat: Geospatial Data Type and GIS Function support for milvus (#44547)
issue: #43427

The main goal of this PR is to merge #37417 into milvus 2.5 without conflicts.

# Main Goals

1. Create and describe collections with geospatial type
2. Insert geospatial data into the insert binlog
3. Load segments containing geospatial data into memory
4. Enable query and search to display geospatial data
5. Support using GIS functions like ST_EQUALS in query
6. Support R-Tree index for geometry type

# Solution

1. **Add Type**: Modify the Milvus core by adding a Geospatial type in
both the C++ and Go code layers, defining the Geospatial data structure
and the corresponding interfaces.
2. **Dependency Libraries**: Introduce necessary geospatial data
processing libraries. In the C++ source code, use Conan package
management to include the GDAL library. In the Go source code, add the
go-geom library to the go.mod file.
3. **Protocol Interface**: Revise the Milvus protocol to provide
mechanisms for Geospatial message serialization and deserialization.
4. **Data Pipeline**: Facilitate interaction between the client and
proxy using the WKT format for geospatial data. The proxy will convert
all data into WKB format for downstream processing, providing column
data interfaces, segment encapsulation, segment loading, payload
writing, and cache block management.
5. **Query Operators**: Implement simple display and support for filter
queries. Initially, focus on filtering based on spatial relationships
for a single column of geospatial literal values, providing parsing and
execution for query expressions. Only brute-force search is supported for now.
6. **Client Modification**: Enable the client to handle user input for
geospatial data and facilitate end-to-end testing. See the modification
in pymilvus.

---------

Signed-off-by: Yinwei Li <yinwei.li@zilliz.com>
Signed-off-by: Cai Zhang <cai.zhang@zilliz.com>
Co-authored-by: ZhuXi <150327960+Yinwei-Yu@users.noreply.github.com>
2025-09-28 19:43:05 +08:00
aoiasd
1b20e956be
enhance: support random score for boost function score (#44214)
Also support setting the function mode and boost mode when running a search with boost.

RandomScore returns a random function score in [0, weight).
FunctionMode decides how the boost score is calculated from multiple boost
function scores.
BoostMode decides how the final score is calculated from the original score
and the boost score.
relate: https://github.com/milvus-io/milvus/issues/43867

---------

Signed-off-by: aoiasd <zhicheng.yue@zilliz.com>
2025-09-24 17:50:04 +08:00
foxspy
13c3b0b909
enhance: add autoindex configuration for the int8 vector type (#44554)
issue: #38666 

Add int8 support for autoindex to ensure it can be independently
configured. At the same time, remove the restriction on int8 type for
vectorDiskIndex (note that vectorDiskIndex only determines the building
and loading method of the index, not the index type).

Signed-off-by: xianliang.li <xianliang.li@zilliz.com>
2025-09-24 17:48:04 +08:00
sparknack
0145dc8c06
fix: refund loaded resource usage in Insert/DeleteRecord destructor (#44555)
issue: #44528

Signed-off-by: Shawn Wang <shawn.wang@zilliz.com>
2025-09-24 16:16:04 +08:00
zhagnlu
eac16a577c
enhance:support cachelayer for json stats (#44446)
#42533

Signed-off-by: zhagnlu <lu.zhang@zilliz.com>
2025-09-24 15:30:04 +08:00
sparknack
14c085374e
fix: set mmap_file_raii_ to nullptr when mmap is disabled (#44516)
issue: #44510
related: #44501

Signed-off-by: Shawn Wang <shawn.wang@zilliz.com>
2025-09-24 11:50:03 +08:00
congqixia
ea307ea3c9
fix: [StorageV2] Make DiskFileManager use fs from context (#44535)
Related to #44534

Datanode shall not use the singleton fs from 2.6 onward. This patch makes the
disk file manager use the filesystem passed by fileManagerContext instead of
the erroneous singleton one.

---------

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2025-09-24 10:12:03 +08:00
Bingyi Sun
f0446fd9a0
enhance: optimize the performance of binary_search_string (#44469)
Signed-off-by: sunby <sunbingyi1992@gmail.com>
2025-09-23 10:52:13 +08:00