1936 Commits

Author SHA1 Message Date
foxspy
358bc150df
enhance: add force rebuild index configuration (#41473)
issue: #41431

Signed-off-by: xianliang.li <xianliang.li@zilliz.com>
2025-05-14 10:52:21 +08:00
zhagnlu
f094d026f8
fix: add params to ignore config type exception (#41776)
#41707

Signed-off-by: luzhang <luzhang@zilliz.com>
Co-authored-by: luzhang <luzhang@zilliz.com>
2025-05-13 13:48:56 +08:00
Buqian Zheng
ff5c2770e5
feat: cachinglayer: various improvements (#41546)
issue: https://github.com/milvus-io/milvus/issues/41435

This PR is based on https://github.com/milvus-io/milvus/pull/41436.

Improvements include:

- Lazy Load support for Storage v1
- Use Low/High watermark to control eviction
- Caching Layer related config changes
- Removed ChunkCache related configs and code in golang
- Add `PinAllCells` helper method to CacheSlot class
- Modified ValueAt, RawAt, PrimitiveRawAt to Bulk versions to reduce
caching layer overhead
- Removed some unclear templated bulk_subscript methods
- CachedSearchIterator now stores a PinWrapper when searching on
ChunkedColumn; removed an unused constructor.
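
The low/high watermark eviction listed above can be sketched roughly as follows. This is an illustration only, not the cachinglayer code from this PR: `WatermarkCache`, `Insert`, `Contains`, `Used`, and the byte thresholds are hypothetical names, and evicting in LRU order is an assumption. Pinned cells are not modelled.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <list>
#include <unordered_map>
#include <utility>

// Hypothetical LRU cache of cells. Eviction starts once usage exceeds the
// high watermark and continues until usage is at or below the low watermark.
class WatermarkCache {
 public:
    WatermarkCache(size_t low_bytes, size_t high_bytes)
        : low_(low_bytes), high_(high_bytes) {
        assert(low_ <= high_);
    }

    // Insert or refresh a cell, then evict if the high watermark was crossed.
    void Insert(int64_t cell_id, size_t bytes) {
        auto it = index_.find(cell_id);
        if (it != index_.end()) {
            used_ -= it->second->second;
            lru_.erase(it->second);
        }
        lru_.emplace_front(cell_id, bytes);
        index_[cell_id] = lru_.begin();
        used_ += bytes;
        if (used_ > high_) {
            EvictToLow();
        }
    }

    bool Contains(int64_t cell_id) const { return index_.count(cell_id) > 0; }
    size_t Used() const { return used_; }

 private:
    using Entry = std::pair<int64_t, size_t>;  // cell id, size in bytes

    // Drop least recently used cells from the back of the list.
    void EvictToLow() {
        while (used_ > low_ && !lru_.empty()) {
            const Entry& victim = lru_.back();
            used_ -= victim.second;
            index_.erase(victim.first);
            lru_.pop_back();
        }
    }

    size_t low_;
    size_t high_;
    size_t used_ = 0;
    std::list<Entry> lru_;
    std::unordered_map<int64_t, std::list<Entry>::iterator> index_;
};
```

Keeping two thresholds rather than one means eviction runs in batches: nothing is evicted until the high mark is crossed, and each pass frees enough room that the next few inserts do not trigger eviction again.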

---------

Signed-off-by: Buqian Zheng <zhengbuqian@gmail.com>
2025-05-10 09:19:16 +08:00
congqixia
bcf94a0754
fix: Remove noexcept from CacheIndexToDiskInternal (#41725)
Related to #41219

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2025-05-09 14:16:53 +08:00
zhagnlu
f674e232b9
fix: GetValueFromConfig return nullopt instead of exception for null value (#41709)
#41707

Signed-off-by: luzhang <luzhang@zilliz.com>
Co-authored-by: luzhang <luzhang@zilliz.com>
2025-05-09 11:18:53 +08:00
Xianhui Lin
26cbc74478
fix: support infix and suffix match types in JsonStats (#41720)
issue:https://github.com/milvus-io/milvus/issues/41386

Signed-off-by: Xianhui.Lin <xianhui.lin@zilliz.com>
2025-05-09 10:42:53 +08:00
zhagnlu
e3c81ba1cc
enhance: use scan mode for like although inverted index exists (#41325)
#41065

Signed-off-by: luzhang <luzhang@zilliz.com>
Co-authored-by: luzhang <luzhang@zilliz.com>
2025-05-09 10:36:54 +08:00
zhagnlu
39e7ad33d7
enhance: add optimize for like expr (#41066)
#41065

Signed-off-by: luzhang <luzhang@zilliz.com>
Co-authored-by: luzhang <luzhang@zilliz.com>
2025-05-08 14:28:52 +08:00
foxspy
e2ddbe4962
feat: add cachinglayer to index (#41653)
issue: #41435

Signed-off-by: xianliang.li <xianliang.li@zilliz.com>
2025-05-08 10:12:54 +08:00
congqixia
b1f3fe1f07
fix: Use sum of num_rows instead of last one (#41685)
Related to #41656

---------

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2025-05-07 19:40:53 +08:00
Bingyi Sun
0dee3ccfd7
enhance: Make user specified doc id selectable for tantivy index writer (#41528)
issue: https://github.com/milvus-io/milvus/issues/41527

---------

Signed-off-by: sunby <sunbingyi1992@gmail.com>
2025-05-07 10:48:53 +08:00
Bingyi Sun
4c08090687
feat: Add json index support for json contains expr (#41478)
issue: #35528

---------

Signed-off-by: sunby <sunbingyi1992@gmail.com>
2025-05-06 11:44:52 +08:00
Buqian Zheng
73bbf4c674
fix: error when lack_binlog_rows = 0 (#41644)
issue: https://github.com/milvus-io/milvus/issues/41643

Signed-off-by: Buqian Zheng <zhengbuqian@gmail.com>
2025-05-04 00:24:56 +08:00
sthuang
e9442f575d
feat: storage v2 seal segment load (#41567)
Storage v2 chunked sealed segment loading is built on the caching layer. A
cell in storage v2 is a Parquet row group in remote object storage that
contains all fields, so each field needs a proxy to perform its
single-field operations.
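
A rough sketch of the per-field proxy idea described above, assuming in-memory stand-ins for row groups. `RowGroupCell` and `FieldProxy` are hypothetical names for illustration and do not correspond to classes in this PR; all columns are modelled as `int64_t` for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical cell: one row group holding a column chunk for every field.
struct RowGroupCell {
    std::map<int64_t, std::vector<int64_t>> columns;  // field id -> values
};

// Hypothetical per-field proxy: shares the cells but exposes a single field.
class FieldProxy {
 public:
    FieldProxy(std::shared_ptr<const std::vector<RowGroupCell>> cells,
               int64_t field_id)
        : cells_(std::move(cells)), field_id_(field_id) {
        assert(cells_ != nullptr);
    }

    // Number of rows of this field across all cells.
    size_t NumRows() const {
        size_t n = 0;
        for (const auto& cell : *cells_) {
            n += cell.columns.at(field_id_).size();
        }
        return n;
    }

    // Value at a global row offset, located by walking the cells in order.
    int64_t ValueAt(size_t row) const {
        for (const auto& cell : *cells_) {
            const auto& col = cell.columns.at(field_id_);
            if (row < col.size()) {
                return col[row];
            }
            row -= col.size();
        }
        assert(false && "row out of range");
        return 0;
    }

 private:
    std::shared_ptr<const std::vector<RowGroupCell>> cells_;
    int64_t field_id_;
};
```

Several proxies can point at the same cells, so a row group is held once even when many fields are read from it.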

<img width="965" alt="Screenshot 2025-04-28 at 10 59 30"
src="https://github.com/user-attachments/assets/83e93a10-3b1d-4066-ac17-b996d5650416"
/>

related: #39173

---------

Signed-off-by: shaoting-huang <shaoting.huang@zilliz.com>
2025-04-30 14:22:58 +08:00
sthuang
6c377b6e86
feat: Storage v2 index and stats raw data (#41534)
related: #39173

---------

Signed-off-by: shaoting-huang <shaoting.huang@zilliz.com>
2025-04-30 08:48:54 +08:00
zhagnlu
cd60b965c8
enhance: add expr filter ratio monitor params (#41402)
#41401

Signed-off-by: luzhang <luzhang@zilliz.com>
Co-authored-by: luzhang <luzhang@zilliz.com>
2025-04-29 17:02:54 +08:00
foxspy
1d99f8bd67
enhance: add force rebuild index configuration (#41473)
issue: #41431

Signed-off-by: xianliang.li <xianliang.li@zilliz.com>
2025-04-29 16:20:56 +08:00
congqixia
f3f8227cd0
enhance: [AddField] Trigger check schema in retrieve as well (#41598)
Related to #39718
Fixes milvus-io/pymilvus#2771

This PR:
- Make the AsyncRetrieve task trigger the "schema check" logic as well
- Rename `AddField` related methods to align with the code standard

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2025-04-29 14:10:49 +08:00
Spade A
910f68c986
fix: update tantivy to fix tantivy doc out of order when merge (#41596)
issue: #41597

Signed-off-by: SpadeA <tangchenjie1210@gmail.com>
2025-04-29 13:46:49 +08:00
Spade A
f35e8f7420
fix: fix arm64 compile issue (#41593)
issue: https://github.com/milvus-io/milvus/issues/41059,
https://github.com/milvus-io/milvus/issues/41510

Signed-off-by: SpadeA <tangchenjie1210@gmail.com>
2025-04-29 13:19:25 +08:00
Buqian Zheng
3de904c7ea
feat: add cachinglayer to sealed segment (#41436)
issue: https://github.com/milvus-io/milvus/issues/41435

---------

Signed-off-by: Buqian Zheng <zhengbuqian@gmail.com>
2025-04-28 10:52:40 +08:00
cai.zhang
640f526301
fix: Update current scalar index version to compatible tantivy different versions (#41141)
issue: #40823

Signed-off-by: Cai Zhang <cai.zhang@zilliz.com>
2025-04-27 20:44:39 +08:00
Chun Han
12cde913b5
fix: fail to get string views due to chunk bound empty loop(#41300) (#41452)
related: #41300

Signed-off-by: MrPresent-Han <chun.han@gmail.com>
Co-authored-by: MrPresent-Han <chun.han@gmail.com>
2025-04-27 10:40:38 +08:00
congqixia
b5443ddbd0
enhance: [AddField] Reopen loaded segments after AddField (#41529)
Related to #39718

This PR:
- Add reopen logic for growing & sealed segments
- Lazy reopen when schema version increases
- Add FinishLoad api for loading progress
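
The lazy reopen step above could look roughly like this. It is a sketch under assumptions: `LoadedSegment` and `ReopenIfStale` are hypothetical names, and the real reopen work (loading newly added fields) is reduced to a counter.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical segment that remembers the schema version it was opened with.
class LoadedSegment {
 public:
    explicit LoadedSegment(uint64_t schema_version)
        : schema_version_(schema_version) {}

    // Called on access: reopen only if the collection schema has moved ahead.
    // Returns true when a reopen was performed.
    bool ReopenIfStale(uint64_t latest_schema_version) {
        if (latest_schema_version <= schema_version_) {
            return false;  // already up to date, nothing to do
        }
        // A real reopen would load the newly added fields here.
        schema_version_ = latest_schema_version;
        ++reopen_count_;
        return true;
    }

    uint64_t SchemaVersion() const { return schema_version_; }
    int ReopenCount() const { return reopen_count_; }

 private:
    uint64_t schema_version_;
    int reopen_count_ = 0;
};
```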

---------

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2025-04-26 08:48:39 +08:00
Buqian Zheng
1c8b9c127d
fix: Make sure segment in ut is destroyed before static MmapManager singleton (#41508)
issue: #41507

Signed-off-by: Buqian Zheng <zhengbuqian@gmail.com>
2025-04-25 18:50:38 +08:00
Xianhui Lin
1a6838b496
fix: json stats add map null check before insert into tantivy (#41505)
Add a null check on the map before inserting into tantivy. Building the
JSON stats index may fail if there is no data.
issue:https://github.com/milvus-io/milvus/issues/41494

---------

Signed-off-by: Xianhui.Lin <xianhui.lin@zilliz.com>
2025-04-24 21:06:37 +08:00
congqixia
dbe54c2df8
enhance: [AddField] Resolve conflicts & make WAL ts collection updatets (#41476)
Related to #39718

This PR:
- Use WAL broadcast timestamp as Collection update timestamp
- Remove request_fields size assertion
- Remove proxy schema cache loaded field check & skip related cases
- other minor issues

---------

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2025-04-24 12:06:39 +08:00
Spade A
f3d878ab3f
fix: update tantivy for fixing phrase match (#41450)
issue: #41454
https://github.com/zilliztech/tantivy/pull/8 fixes the problem; this PR
updates tantivy to include it.

---------

Signed-off-by: SpadeA <tangchenjie1210@gmail.com>
2025-04-24 10:52:37 +08:00
aoiasd
f52c2909c4
feat: support multi analyzer for bm25 function (#41351)
relate: https://github.com/milvus-io/milvus/issues/41213

---------

Signed-off-by: aoiasd <zhicheng.yue@zilliz.com>
2025-04-23 18:22:38 +08:00
Xianhui Lin
3d4889586d
fix: JsonStats filter by conjunctExpr and improve the task slot calculation logic (#41459)
Optimized JSON filter execution by introducing
ProcessJsonStatsChunkPos() for unified position calculation and
GetNextBatchSize() for better batch processing.
Improved JSON key generation by replacing manual path joining with
milvus::Json::pointer() and adjusted slot size calculation for JSON key
index jobs.
Updated the task slot calculation logic in calculateStatsTaskSlot() to
handle the increased resource needs of JSON key index jobs.
issue: https://github.com/milvus-io/milvus/issues/41378
https://github.com/milvus-io/milvus/issues/41218

---------

Signed-off-by: Xianhui.Lin <xianhui.lin@zilliz.com>
2025-04-23 16:30:37 +08:00
aoiasd
a16bd6263b
feat: support more languages for built-in stop words and add remove punct, regex filter (#41412)
relate: https://github.com/milvus-io/milvus/issues/41213

---------

Signed-off-by: aoiasd <zhicheng.yue@zilliz.com>
2025-04-23 11:44:37 +08:00
aoiasd
11f2fae42e
feat: support extend default dict for jieba tokenizer (#41360)
relate: https://github.com/milvus-io/milvus/issues/41213

Signed-off-by: aoiasd <zhicheng.yue@zilliz.com>
2025-04-22 20:34:37 +08:00
congqixia
b36c88f3c8
enhance: [AddField] Broadcast schema change via WAL (#41373)
Related to #39718

Add broadcast logic for collection schema changes that notifies:
- Streamnode - Delegator
- Streamnode - Flush component
- QueryNodes via grpc

---------

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2025-04-22 16:28:37 +08:00
aoiasd
110c5aaaf4
feat: support icu and language identifier tokenizer (#41214)
relate: https://github.com/milvus-io/milvus/issues/41213

Signed-off-by: aoiasd <zhicheng.yue@zilliz.com>
2025-04-22 15:56:37 +08:00
cqy123456
5219d9a723
fix: Inserting null and non-null array at the same time will cause milvus crash when growing mmap open (#41051)
issue: https://github.com/milvus-io/milvus/issues/40981
2.5 pr: https://github.com/milvus-io/milvus/pull/41052

Signed-off-by: cqy123456 <qianya.cheng@zilliz.com>
2025-04-22 12:26:37 +08:00
aoiasd
f166843c5e
enhance: support use lindera tag filter (#40416)
relate: https://github.com/milvus-io/milvus/issues/39659

Signed-off-by: aoiasd <zhicheng.yue@zilliz.com>
2025-04-21 15:56:36 +08:00
sparknack
8ccb875e41
enhance: add simde package (#40943)
issue: #40942

Add simde package, which can make porting SIMD code to other
architectures much easier.

Signed-off-by: Shawn Wang <shawn.wang@zilliz.com>
2025-04-21 12:18:40 +08:00
Spade A
5b1430f27e
enhance: tantivy collector set bitset directly (#39748)
fix: #39755

The following shows a simple benchmark that inserts 1M docs where all
rows are "hello". Latency is measured at the segcore level on a 9900K CPU:
master: 2.62ms
this PR: 2.11ms

Benchmark code:

```
TEST(TextMatch, TestPerf) {
    auto schema = GenTestSchema({}, true);
    auto seg = CreateSealedSegment(schema, empty_index_meta);
    int64_t N = 1000000;
    uint64_t seed = 19190504;
    auto raw_data = DataGen(schema, N, seed);
    auto str_col = raw_data.raw_->mutable_fields_data()
                       ->at(1)
                       .mutable_scalars()
                       ->mutable_string_data()
                       ->mutable_data();
    for (int64_t i = 0; i < N - 1; i++) {
        str_col->at(i) = "hello";
    }
    SealedLoadFieldData(raw_data, *seg);
    seg->CreateTextIndex(FieldId(101));

    auto now = std::chrono::high_resolution_clock::now();
    auto expr = GetMatchExpr(schema, "hello", OpType::TextMatch);
    auto final = ExecuteQueryExpr(expr, seg.get(), N, MAX_TIMESTAMP);
    auto end = std::chrono::high_resolution_clock::now();
    auto duration =
        std::chrono::duration_cast<std::chrono::microseconds>(end - now);
    std::cout << "TextMatch query time: " << duration.count() << "us"
              << std::endl;
}
```

---------

Signed-off-by: SpadeA <tangchenjie1210@gmail.com>
2025-04-20 23:02:41 +08:00
Chun Han
016920b023
fix: solve incompatible problem for none-encoding index(#40838) (#41369)
related: #40838

Signed-off-by: MrPresent-Han <chun.han@gmail.com>
Co-authored-by: MrPresent-Han <chun.han@gmail.com>
2025-04-20 22:56:44 +08:00
Ted Xu
d50781c8cc
enhance: support nullable group by keys (#41313)
See #36264

---------

Signed-off-by: Ted Xu <ted.xu@zilliz.com>
2025-04-18 10:08:34 +08:00
Spade A
62293cb582
fix: revert batch add (#41374)
issue: #41375

todo: fix the problems described in the issue.

---------

Signed-off-by: SpadeA <tangchenjie1210@gmail.com>
2025-04-17 22:32:38 +08:00
Bingyi Sun
4552dd4b23
fix: Fix json index does not work for string filter (#41382)
issue: #35528

Signed-off-by: sunby <sunbingyi1992@gmail.com>
2025-04-17 20:10:39 +08:00
sthuang
1f1c836fb9
feat: Storage v2 growing segment load (#41001)
Support parallel loading of sealed and growing segments in storage v2
format by reading row groups asynchronously.
related: #39173

---------

Signed-off-by: shaoting-huang <shaoting.huang@zilliz.com>
2025-04-16 17:14:33 +08:00
Spade A
70d13dcf61
enhance: update tantivy for removing "doc_id" fast field (#41198)
Issue: #41210

After https://github.com/zilliztech/tantivy/pull/5, we can provide the
milvus row id directly to tantivy rather than recording it in the fast
field "doc_id".
Previously we searched for the tantivy doc id and then read the milvus row
id from "doc_id". Now the tantivy doc id returned by a search is the milvus
row id, which eliminates the expensive row id lookup phase.
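
The difference between the two paths can be sketched as below. This is a simplified model, not the tantivy binding code: `MarkViaFastField` and `MarkDirect` are hypothetical names, the "doc_id" fast field is modelled as a plain vector, and the result bitset as `std::vector<bool>`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Before: hits are tantivy doc ids, and each one is translated to a milvus
// row id through the stored "doc_id" column.
std::vector<bool> MarkViaFastField(const std::vector<uint32_t>& hits,
                                   const std::vector<int64_t>& doc_id_field,
                                   size_t num_rows) {
    std::vector<bool> bitset(num_rows, false);
    for (uint32_t doc : hits) {
        int64_t row = doc_id_field.at(doc);  // extra lookup per hit
        assert(row >= 0 && static_cast<size_t>(row) < num_rows);
        bitset[static_cast<size_t>(row)] = true;
    }
    return bitset;
}

// After: the doc id already is the row id, so it indexes the bitset directly.
std::vector<bool> MarkDirect(const std::vector<uint32_t>& hits,
                             size_t num_rows) {
    std::vector<bool> bitset(num_rows, false);
    for (uint32_t doc : hits) {
        assert(doc < num_rows);
        bitset[doc] = true;
    }
    return bitset;
}
```

When the stored column is the identity mapping, both functions produce the same bitset; the direct path simply skips one lookup per hit.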

The following shows a simple benchmark that inserts **1M** docs where
all rows are "hello". Latency is measured at the **segcore** level on a 9900K CPU:

![image](https://github.com/user-attachments/assets/d8e72134-56b5-430b-8628-36c3bed8eaad)
**The latency is 2.02 and 2.1 times respectively.**

Benchmark code:
```
TEST(TextMatch, TestPerf) {
    auto schema = GenTestSchema({}, true);
    auto seg = CreateSealedSegment(schema, empty_index_meta);
    int64_t N = 1000000;
    uint64_t seed = 19190504;
    auto raw_data = DataGen(schema, N, seed);
    auto str_col = raw_data.raw_->mutable_fields_data()
                       ->at(1)
                       .mutable_scalars()
                       ->mutable_string_data()
                       ->mutable_data();
    for (int64_t i = 0; i < N - 1; i++) {
        str_col->at(i) = "hello";
    }
    SealedLoadFieldData(raw_data, *seg);
    seg->CreateTextIndex(FieldId(101));

    auto now = std::chrono::high_resolution_clock::now();
    auto expr = GetMatchExpr(schema, "hello", OpType::TextMatch);
    auto final = ExecuteQueryExpr(expr, seg.get(), N, MAX_TIMESTAMP);
    auto end = std::chrono::high_resolution_clock::now();
    auto duration =
        std::chrono::duration_cast<std::chrono::microseconds>(end - now);
    std::cout << "TextMatch query time: " << duration.count() << "us"
              << std::endl;
}
```

---------

Signed-off-by: SpadeA <tangchenjie1210@gmail.com>
2025-04-15 20:20:32 +08:00
Bingyi Sun
a953eaeaf0
enhance: support binary range expression for json path index (#41025)
issue: #35528

---------

Signed-off-by: sunby <sunbingyi1992@gmail.com>
2025-04-15 19:32:33 +08:00
Chun Han
59b14d38f5
enhance: Optimize index format for improved load performance(#40838) (#40839)
related: https://github.com/milvus-io/milvus/issues/40838

Signed-off-by: MrPresent-Han <chun.han@gmail.com>
Co-authored-by: MrPresent-Han <chun.han@gmail.com>
2025-04-15 03:10:30 +08:00
Bingyi Sun
bf617115ca
enhance: Remove single chunk segment related codes (#39249)
https://github.com/milvus-io/milvus/issues/39112

---------

Signed-off-by: sunby <sunbingyi1992@gmail.com>
2025-04-11 18:56:29 +08:00
Spade A
9ce3e3cb44
enhance: add documents in batch for json key stats (#41228)
issue: https://github.com/milvus-io/milvus/issues/40897

After this change, the scheduling time of the document add operations
drops from roughly 6s to 0.9s for the case in the issue.

---------

Signed-off-by: SpadeA <tangchenjie1210@gmail.com>
2025-04-11 14:08:26 +08:00
Bingyi Sun
b9b8419cbf
fix: Use int32 when creating array index for element type int8/int16 (#41185)
issue: #41172
Array elements of type int8 or int16 are encoded as int32, so we
should parse them as int32 when creating the index.

Signed-off-by: sunby <sunbingyi1992@gmail.com>
2025-04-11 13:18:25 +08:00
foxspy
17e10beba0
fix: avoid segmentation faults caused by retrieving empty vector datasets (#40545)
issue: #40544

Signed-off-by: xianliang.li <xianliang.li@zilliz.com>
2025-04-10 20:16:29 +08:00