360 Commits

sthuang
f77571d5c1
fix: [StorageV2] file writer splits row groups at the default size (#43471)
Bumped milvus storage version.
related: https://github.com/milvus-io/milvus/issues/43310

* https://github.com/milvus-io/milvus-storage/pull/213
* https://github.com/milvus-io/milvus-storage/pull/217
* https://github.com/milvus-io/milvus-storage/pull/220

Signed-off-by: shaoting-huang <shaoting.huang@zilliz.com>
2025-07-22 09:52:52 +08:00
aoiasd
e9fc140eaf
fix: jieba tokenizer causes a panic when a dict word is an empty string (#43337)
relate: https://github.com/milvus-io/milvus/issues/42779

Signed-off-by: aoiasd <zhicheng.yue@zilliz.com>
2025-07-21 16:34:53 +08:00
aoiasd
c7b53ed43b
enhance: run rust format (#43447)
Signed-off-by: aoiasd <zhicheng.yue@zilliz.com>
2025-07-21 14:12:53 +08:00
aoiasd
f7e1f1c382
enhance: support downloading the lindera system dictionary online (#43121)
relate: https://github.com/milvus-io/milvus/issues/43120

Signed-off-by: aoiasd <zhicheng.yue@zilliz.com>
2025-07-20 23:24:52 +08:00
Spade A
42ad786f75
fix: update tantivy for fixing dir removing race condition (#43399)
fix: https://github.com/milvus-io/milvus/issues/43258

Signed-off-by: SpadeA <tangchenjie1210@gmail.com>
2025-07-18 15:44:56 +08:00
Spade A
8612a2c946
enhance: optimize in by batch-in (#43268)
fix: https://github.com/milvus-io/milvus/issues/43267

---------

Signed-off-by: SpadeA <tangchenjie1210@gmail.com>
2025-07-17 19:40:52 +08:00
sparknack
9b4081e110
enhance: cachinglayer: some performance optimization (#42858)
issue: #41435

We compared the performance using the modified test_sealed.cpp, which
randomly accesses all rows in all chunks and counts the number of runs
within 3s.
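
As a rough sketch of that methodology (the actual benchmark is the C++ test_sealed.cpp mentioned above; this Python snippet only illustrates counting runs inside a fixed 3-second window):

```
import time

def ops_per_second(run_once, duration_s=3.0):
    # Repeatedly run the access pattern and count completed runs in the window.
    count = 0
    start = time.perf_counter()
    while time.perf_counter() - start < duration_s:
        run_once()
        count += 1
    return count / duration_s
```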

## performance data comparison (ops/second)

chunk config: 1x1000

| Field Type | w/o cachinglayer (commit 640f526301) | w/ cachinglayer | w/ cachinglayer + opt |
|---|---|---|---|
| Bool field | 82428 | -63.6% (29983) | +2.7% (84675) |
| Int8 field | 82228 | -63.3% (30166) | +2.4% (84163) |
| Int16 field | 82572 | -63.8% (29867) | +1.8% (84036) |
| Int32 field | 82797 | -63.7% (30031) | +1.5% (84043) |
| Int64 field | 81077 | -62.9% (30107) | +0.6% (81604) |
| Float field | 82678 | -63.4% (30266) | +1.8% (84146) |
| Double field | 81925 | -63.4% (29974) | +0.2% (82097) |
| Varchar field | 19933 | -19.6% (16027) | +18.9% (23690) |
| JSON field | 16519 | -96.8% (533) | +2.5% (16927) |
| Int array field | 7325 | -13.7% (6321) | -1.4% (7220) |
| Long array field | 6347 | -8.9% (5781) | -0.1% (6344) |
| Bool array field | 8275 | -14.0% (7116) | +0.4% (8311) |
| String array field | 2281 | -5.0% (2168) | +0.2% (2287) |
| Double array field | 6427 | -13.3% (5574) | -2.0% (6301) |
| Float array field | 7291 | -13.0% (6346) | -1.5% (7183) |
| Vector field | 27487 | -40.4% (16371) | -4.7% (26192) |
| Float16 vector field | 49773 | -54.6% (22601) | -5.9% (46834) |
| BFloat16 vector field | 49783 | -53.1% (23350) | -5.7% (46934) |
| Int8 vector field | 63871 | -59.0% (26179) | -6.2% (59926) |

---

chunk config: 10x1000

| Field Type | w/o cachinglayer (commit 640f526301) | w/ cachinglayer | w/ cachinglayer + opt |
|---|---|---|---|
| Bool field | 3659 | -48.6% (1879) | +110.1% (7686) |
| Int8 field | 3410 | -45.3% (1864) | +123.9% (7636) |
| Int16 field | 3647 | -48.6% (1874) | +110.1% (7661) |
| Int32 field | 3647 | -48.8% (1866) | +109.6% (7645) |
| Int64 field | 3645 | -48.9% (1863) | +107.8% (7573) |
| Float field | 3647 | -49.0% (1861) | +109.5% (7639) |
| Double field | 3640 | -45.1% (1998) | +108.4% (7586) |
| Varchar field | 1594 | -23.9% (1213) | +20.6% (1922) |
| JSON field | 1202 | -26.5% (884) | +16.1% (1396) |
| Int array field | 602 | -12.3% (528) | +12.7% (678) |
| Long array field | 529 | -12.2% (465) | +7.5% (569) |
| Double array field | 537 | -13.0% (467) | +6.4% (571) |
| Vector field | 1520 | -37.9% (943) | -5.5% (1437) |
| Float16 vector field | 2607 | -47.0% (1382) | +6.4% (2774) |
| BFloat16 vector field | 2586 | -46.5% (1383) | +8.8% (2813) |
| Int8 vector field | 3101 | -47.3% (1633) | +41.9% (4400) |

---------

Signed-off-by: Shawn Wang <shawn.wang@zilliz.com>
2025-07-17 11:20:51 +08:00
foxspy
58a9e49066
enhance: update knowhere version (#43331)
issue: #42937 #43294

Signed-off-by: xianliang.li <xianliang.li@zilliz.com>
2025-07-16 15:04:50 +08:00
Spade A
db91d85dbc
feat: more types of matches for ngram (#43081)
Ref https://github.com/milvus-io/milvus/issues/42053

This PR enables ngram to support more kinds of matches, such as prefix and
postfix (suffix) matches.
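
As an illustration, a hedged client-side sketch of such filters (the client, collection, and field names are placeholders; the `like` expression syntax should be checked against the docs):

```
from pymilvus import MilvusClient

milvus_client = MilvusClient("http://localhost:19530")
COLLECTION_NAME = "my_collection"  # placeholder

# Prefix match: rows whose content_ngram starts with "miles".
prefix_hits = milvus_client.query(
    collection_name=COLLECTION_NAME,
    filter='content_ngram like "miles%"',
    output_fields=["content_ngram"],
)

# Postfix (suffix) match: rows whose content_ngram ends with "davis".
suffix_hits = milvus_client.query(
    collection_name=COLLECTION_NAME,
    filter='content_ngram like "%davis"',
    output_fields=["content_ngram"],
)
```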

---------

Signed-off-by: SpadeA <tangchenjie1210@gmail.com>
2025-07-14 20:34:50 +08:00
foxspy
8171a2a0b5
enhance: update knowhere version (#43246)
issue: #42937

Signed-off-by: xianliang.li <xianliang.li@zilliz.com>
2025-07-14 11:06:49 +08:00
Spade A
26ec841feb
feat: optimize Like query with n-gram (#41803)
Ref #42053

This is the first PR for optimizing `LIKE` with an ngram inverted index.
For now, only the VARCHAR data type and the InnerMatch LIKE (%xxx%) query
are supported.


How to use it:
```
milvus_client = MilvusClient("http://localhost:19530")
schema = milvus_client.create_schema()
...
schema.add_field("content_ngram", DataType.VARCHAR, max_length=10000)
...
index_params = milvus_client.prepare_index_params()
index_params.add_index(field_name="content_ngram", index_type="NGRAM", index_name="ngram_index", min_gram=2, max_gram=3)
milvus_client.create_collection(COLLECTION_NAME, ...)
```

min_gram and max_gram control how the documents are tokenized. For
example, with min_gram=2 and max_gram=4, each document is tokenized
into 2-grams, 3-grams and 4-grams.
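
Continuing the snippet above, a hedged sketch of an InnerMatch query that the ngram index can serve (the pattern is only an example):

```
# InnerMatch (%xxx%) LIKE query over the ngram-indexed field.
results = milvus_client.query(
    collection_name=COLLECTION_NAME,
    filter='content_ngram like "%hello%"',
    output_fields=["content_ngram"],
)
```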

---------

Signed-off-by: SpadeA <tangchenjie1210@gmail.com>
Signed-off-by: SpadeA-Tang <tangchenjie1210@gmail.com>
2025-07-01 10:08:44 +08:00
foxspy
be05b653c1
enhance: update knowhere version (#42938)
issue: #42937

Signed-off-by: xianliang.li <xianliang.li@zilliz.com>
2025-06-26 01:22:41 +08:00
sthuang
ad6d620e9f
fix: [StorageV2] Compiling in debug mode throws a DCHECK s3 initialization error (#42922)
related: https://github.com/milvus-io/milvus/issues/42844

Signed-off-by: shaoting-huang <shaoting.huang@zilliz.com>
2025-06-24 19:30:41 +08:00
Spade A
50f7579d8f
fix: fix some bugs discovered by chaos tests (#42906)
fix: https://github.com/milvus-io/milvus/issues/42870

This PR fixes:
1. The SetBitset fn should consider growing segments with concurrent writes
2. Avoid using from_raw_parts directly

---------

Signed-off-by: SpadeA <tangchenjie1210@gmail.com>
2025-06-24 16:32:42 +08:00
Spade A
e15926b40c
enhance: optimize tantivy cargo config (#42880)
fix: https://github.com/milvus-io/milvus/issues/42879

Signed-off-by: SpadeA <tangchenjie1210@gmail.com>
2025-06-20 16:17:49 +08:00
aoiasd
43a9f7a79e
enhance: Add and run rust format command in makefile (#42807)
relate: https://github.com/milvus-io/milvus/issues/42806

Signed-off-by: aoiasd <zhicheng.yue@zilliz.com>
2025-06-20 10:22:39 +08:00
Spade A
e2c85eec81
fix: load stats index based on mmap config (#42788)
ref https://github.com/milvus-io/milvus/issues/42626

This PR makes the text match index and the json key stats index load
according to the mmap config.
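
For context, a hedged sketch of toggling the mmap setting these indexes now respect (the "mmap.enabled" property key and the alter helper are assumptions to verify against the pymilvus/Milvus version in use):

```
from pymilvus import MilvusClient

milvus_client = MilvusClient("http://localhost:19530")
COLLECTION_NAME = "my_collection"  # placeholder

# Collections are released before changing mmap-related properties.
milvus_client.release_collection(COLLECTION_NAME)
milvus_client.alter_collection_properties(
    collection_name=COLLECTION_NAME,
    properties={"mmap.enabled": "true"},  # assumed property key
)
milvus_client.load_collection(COLLECTION_NAME)
```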

---------

Signed-off-by: SpadeA <tangchenjie1210@gmail.com>
2025-06-19 10:10:39 +08:00
aoiasd
d49989345b
enhance: forbid the regex filter from cloning the regex for each streamer (#42781)
Signed-off-by: aoiasd <zhicheng.yue@zilliz.com>
2025-06-18 16:10:39 +08:00
congqixia
f01ff57f3f
fix: [StorageV2] Use correct offset when filling null bitmap (#42774)
Related to #39173

`null_bitmap_data()` returns the raw pointer of an Array's null bitmap. After
slicing, this bitmap is not rewritten due to the zero-copy implementation, so
the current start position may be non-zero while `FillFieldData` generates the
column's `valid_data` array.

This PR adds an `offset` param to the `FillFieldData` method and forces all
invocations to pass the correct offset into the `null_bitmap_data` ptr.
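
A small pyarrow illustration of the zero-copy behavior in question (not the C++ code path touched by this PR): slicing shares the original validity bitmap and only records an offset, so null bits must be read starting at that offset.

```
import pyarrow as pa

arr = pa.array([1, None, 3, None, 5])
sliced = arr.slice(2, 3)  # zero-copy view over the same buffers

print(sliced.offset)  # 2 -> validity bits for the slice start at bit 2
# The validity buffer is shared, not rewritten for the slice:
print(sliced.buffers()[0].address == arr.buffers()[0].address)  # True
```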

Also updates the milvus-storage commit to fix the reader failing to return
data when the buffer size is smaller than the row group size.

---------

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2025-06-17 10:08:38 +08:00
Bingyi Sun
fbf5cb4e62
feat: Add json flat index (#39917)
issue: https://github.com/milvus-io/milvus/issues/35528

This PR introduces a JSON flat index that allows indexing JSON fields
and dynamic fields in the same way as other field types.

In a previous PR (#36750), we implemented a JSON index that requires
specifying a JSON path and a cast type. The only distinction lies in
the json_cast_type parameter: when json_cast_type is set to the JSON type,
Milvus automatically creates a JSON flat index.
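
A hedged client-side sketch of the two variants (milvus_client is a MilvusClient placeholder, the "metadata" field name is illustrative, and the exact spelling of the JSON cast type should be checked against the release in use):

```
index_params = milvus_client.prepare_index_params()

# Path-specific JSON index (PR #36750 style): one JSON path, cast to a scalar type.
index_params.add_index(
    field_name="metadata",
    index_type="INVERTED",
    index_name="metadata_price_idx",
    params={"json_path": 'metadata["price"]', "json_cast_type": "double"},
)

# JSON flat index: setting json_cast_type to JSON indexes the field as a whole.
index_params.add_index(
    field_name="metadata",
    index_type="INVERTED",
    index_name="metadata_flat_idx",
    params={"json_cast_type": "JSON"},
)
```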

For details on how Tantivy interprets JSON data, refer to the [tantivy
documentation](https://github.com/quickwit-oss/tantivy/blob/main/doc/src/json.md#pitfalls-limitation-and-corner-cases).

Limitations
Array handling: Arrays do not function as nested objects. See the
[limitations
section](https://github.com/quickwit-oss/tantivy/blob/main/doc/src/json.md#arrays-do-not-work-like-nested-object)
for more details.

---------

Signed-off-by: sunby <sunbingyi1992@gmail.com>
2025-06-10 19:14:35 +08:00
cqy123456
317bbfbf81
enhance: milvus supports minhash vectors and the MHJACCARD metric (#42036)
issue:
https://github.com/issues/assigned?issue=milvus-io%7Cmilvus%7C41746

Signed-off-by: cqy123456 <qianya.cheng@zilliz.com>
2025-06-10 14:38:34 +08:00
aoiasd
fd6e2b52ff
enhance: use English names as language names for all language identifier types (#42600)
Make whatlang return the detected language name as its English name,
keeping it consistent with lingua.

Signed-off-by: aoiasd <zhicheng.yue@zilliz.com>
2025-06-10 10:24:35 +08:00
aoiasd
6e16653597
fix: update tantivy commit version to fix stemmer panic (#42171)
relate: https://github.com/milvus-io/milvus/issues/42168

Signed-off-by: aoiasd <zhicheng.yue@zilliz.com>
2025-06-09 10:34:33 +08:00
foxspy
3dbad0306a
fix: Add bypass thread pool mode to avoid growing indexes blocking insert/load (#41012)
issue: #40825

Signed-off-by: xianliang.li <xianliang.li@zilliz.com>
2025-05-20 14:30:24 +08:00
congqixia
a22088a380
enhance: [StorageV2] Make packed reader use correct path (#41919)
Related to #39173

This PR
- Uses the updated path with bucketName for packedReader
- Updates the milvus-storage commit to report reader/writer initialization
failures; see also milvus-io/milvus-storage#192

---------

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2025-05-20 10:36:23 +08:00
congqixia
3bbc0fa560
enhance: [StorageV2] update storage to pass endpoint as-is (#41889)
Related to milvus-io/milvus-storage#190

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2025-05-16 18:06:21 +08:00
Buqian Zheng
b0260d8676
feat: manually evict cache after building interim index (#41836)
issue: https://github.com/milvus-io/milvus/issues/41435

this PR also makes HasRawData of ChunkedSegmentSealedImpl answer based on
metadata, without needing to load the cache just to answer this simple
question.

---------

Signed-off-by: Buqian Zheng <zhengbuqian@gmail.com>
2025-05-16 16:34:23 +08:00
congqixia
a6d09ff4cd
enhance: [StorageV2] fix issues integrating basic RW operations (#41834)
Related to #39173

This PR:
- Upgrades the milvus-storage commit to fix the filesystem-finalized issue
- Adds the bucket name as a prefix for all fs-style access IO
- Initializes arrow fs on querynode startup
- Fixes timestamp access when loading sealed segments

---------

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2025-05-15 09:52:23 +08:00
foxspy
358bc150df
enhance: add force rebuild index configuration (#41473)
issue: #41431

Signed-off-by: xianliang.li <xianliang.li@zilliz.com>
2025-05-14 10:52:21 +08:00
foxspy
e2ddbe4962
feat: add cachinglayer to index (#41653)
issue: #41435

Signed-off-by: xianliang.li <xianliang.li@zilliz.com>
2025-05-08 10:12:54 +08:00
Bingyi Sun
0dee3ccfd7
enhance: Make user specified doc id selectable for tantivy index writer (#41528)
issue: https://github.com/milvus-io/milvus/issues/41527

---------

Signed-off-by: sunby <sunbingyi1992@gmail.com>
2025-05-07 10:48:53 +08:00
foxspy
1d99f8bd67
enhance: add force rebuild index configuration (#41473)
issue: #41431

Signed-off-by: xianliang.li <xianliang.li@zilliz.com>
2025-04-29 16:20:56 +08:00
Spade A
910f68c986
fix: update tantivy to fix tantivy docs being out of order when merging (#41596)
issue: #41597

Signed-off-by: SpadeA <tangchenjie1210@gmail.com>
2025-04-29 13:46:49 +08:00
Spade A
f35e8f7420
fix: fix arm64 compile issue (#41593)
issue: https://github.com/milvus-io/milvus/issues/41059,
https://github.com/milvus-io/milvus/issues/41510

Signed-off-by: SpadeA <tangchenjie1210@gmail.com>
2025-04-29 13:19:25 +08:00
cai.zhang
640f526301
fix: Update current scalar index version to be compatible with different tantivy versions (#41141)
issue: #40823

Signed-off-by: Cai Zhang <cai.zhang@zilliz.com>
2025-04-27 20:44:39 +08:00
Spade A
f3d878ab3f
fix: update tantivy for fixing phrase match (#41450)
issue: #41454
https://github.com/zilliztech/tantivy/pull/8 fixes the problem; this PR
updates tantivy accordingly.
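
For reference, a hedged sketch of a phrase match filter expression (milvus_client and COLLECTION_NAME are placeholders; the PHRASE_MATCH syntax and a match-enabled VARCHAR field named "text" are assumptions to verify against the docs):

```
# Assumed expression syntax: PHRASE_MATCH(field, phrase, slop).
results = milvus_client.query(
    collection_name=COLLECTION_NAME,
    filter="PHRASE_MATCH(text, 'machine learning', 1)",
    output_fields=["text"],
)
```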

---------

Signed-off-by: SpadeA <tangchenjie1210@gmail.com>
2025-04-24 10:52:37 +08:00
aoiasd
a16bd6263b
feat: support more languages for built-in stop words and add remove-punct and regex filters (#41412)
relate: https://github.com/milvus-io/milvus/issues/41213
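
A hedged sketch of analyzer params combining a built-in stop-word list with the new filters (the "removepunct" and "regex" filter names and their options here are assumptions, not confirmed spellings):

```
from pymilvus import DataType

analyzer_params = {
    "tokenizer": "standard",
    "filter": [
        "lowercase",
        {"type": "stop", "stop_words": ["_english_"]},  # built-in stop-word list
        "removepunct",                                   # assumed punctuation-removal filter
        {"type": "regex", "expr": "^[0-9]+$"},           # assumed regex filter config
    ],
}

# schema comes from milvus_client.create_schema() as in the earlier snippets.
schema.add_field("text", DataType.VARCHAR, max_length=2000,
                 enable_analyzer=True, analyzer_params=analyzer_params)
```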

---------

Signed-off-by: aoiasd <zhicheng.yue@zilliz.com>
2025-04-23 11:44:37 +08:00
aoiasd
11f2fae42e
feat: support extending the default dict for the jieba tokenizer (#41360)
relate: https://github.com/milvus-io/milvus/issues/41213
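
A hedged sketch of extending the default jieba dictionary with extra entries (the "_default_" marker and option names follow the analyzer docs but should be verified):

```
# Jieba tokenizer config; "_default_" keeps the built-in dictionary,
# additional entries extend it with custom words.
analyzer_params = {
    "tokenizer": {
        "type": "jieba",
        "dict": ["_default_", "结巴分词器"],
        "mode": "search",
        "hmm": True,
    },
}
```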

Signed-off-by: aoiasd <zhicheng.yue@zilliz.com>
2025-04-22 20:34:37 +08:00
aoiasd
110c5aaaf4
feat: support icu and language identifier tokenizer (#41214)
relate: https://github.com/milvus-io/milvus/issues/41213

Signed-off-by: aoiasd <zhicheng.yue@zilliz.com>
2025-04-22 15:56:37 +08:00
aoiasd
f166843c5e
enhance: support using the lindera tag filter (#40416)
relate: https://github.com/milvus-io/milvus/issues/39659

Signed-off-by: aoiasd <zhicheng.yue@zilliz.com>
2025-04-21 15:56:36 +08:00
Spade A
5b1430f27e
enhance: tantivy collector sets bitset directly (#39748)
fix: #39755

The following shows a simple benchmark that inserts 1M docs where all
rows are "hello"; the latency is at segcore level, and the CPU is a 9900K:
master: 2.62ms
this PR: 2.11ms

benchmark code:

```
TEST(TextMatch, TestPerf) {
    auto schema = GenTestSchema({}, true);
    auto seg = CreateSealedSegment(schema, empty_index_meta);
    int64_t N = 1000000;
    uint64_t seed = 19190504;
    auto raw_data = DataGen(schema, N, seed);
    auto str_col = raw_data.raw_->mutable_fields_data()
                       ->at(1)
                       .mutable_scalars()
                       ->mutable_string_data()
                       ->mutable_data();
    for (int64_t i = 0; i < N - 1; i++) {
        str_col->at(i) = "hello";
    }
    SealedLoadFieldData(raw_data, *seg);
    seg->CreateTextIndex(FieldId(101));

    auto now = std::chrono::high_resolution_clock::now();
    auto expr = GetMatchExpr(schema, "hello", OpType::TextMatch);
    auto final = ExecuteQueryExpr(expr, seg.get(), N, MAX_TIMESTAMP);
    auto end = std::chrono::high_resolution_clock::now();
    auto duration =
        std::chrono::duration_cast<std::chrono::microseconds>(end - now);
    std::cout << "TextMatch query time: " << duration.count() << "ms"
              << std::endl;
}
```

---------

Signed-off-by: SpadeA <tangchenjie1210@gmail.com>
2025-04-20 23:02:41 +08:00
Spade A
62293cb582
fix: revert batch add (#41374)
issue: #41375

todo: fix the problems described in the issue.

---------

Signed-off-by: SpadeA <tangchenjie1210@gmail.com>
2025-04-17 22:32:38 +08:00
sthuang
1f1c836fb9
feat: Storage v2 growing segment load (#41001)
support loading sealed and growing segments in parallel with the storage v2
format by asynchronously reading row groups.
related: #39173

---------

Signed-off-by: shaoting-huang <shaoting.huang@zilliz.com>
2025-04-16 17:14:33 +08:00
Spade A
70d13dcf61
enhance: update tantivy for removing "doc_id" fast field (#41198)
Issue: #41210

After https://github.com/zilliztech/tantivy/pull/5, we can provide the
milvus row id directly to tantivy rather than recording it in the fast
field "doc_id".
So instead of searching for the tantivy doc id and then getting the milvus
row id from "doc_id", the searched tantivy doc id is now the milvus row id,
eliminating the expensive row-id acquisition phase.

The following shows a simple benchmark that inserts **1M** docs where
all rows are "hello"; the latency is at **segcore** level, and the CPU is a 9900K:

![image](https://github.com/user-attachments/assets/d8e72134-56b5-430b-8628-36c3bed8eaad)
**The latency improves by 2.02x and 2.1x respectively.**

benchmark code:
```
TEST(TextMatch, TestPerf) {
    auto schema = GenTestSchema({}, true);
    auto seg = CreateSealedSegment(schema, empty_index_meta);
    int64_t N = 1000000;
    uint64_t seed = 19190504;
    auto raw_data = DataGen(schema, N, seed);
    auto str_col = raw_data.raw_->mutable_fields_data()
                       ->at(1)
                       .mutable_scalars()
                       ->mutable_string_data()
                       ->mutable_data();
    for (int64_t i = 0; i < N - 1; i++) {
        str_col->at(i) = "hello";
    }
    SealedLoadFieldData(raw_data, *seg);
    seg->CreateTextIndex(FieldId(101));

    auto now = std::chrono::high_resolution_clock::now();
    auto expr = GetMatchExpr(schema, "hello", OpType::TextMatch);
    auto final = ExecuteQueryExpr(expr, seg.get(), N, MAX_TIMESTAMP);
    auto end = std::chrono::high_resolution_clock::now();
    auto duration =
        std::chrono::duration_cast<std::chrono::microseconds>(end - now);
    std::cout << "TextMatch query time: " << duration.count() << "ms"
              << std::endl;
}
```

---------

Signed-off-by: SpadeA <tangchenjie1210@gmail.com>
2025-04-15 20:20:32 +08:00
Spade A
9ce3e3cb44
enhance: add documents in batch for json key stats (#41228)
issue: https://github.com/milvus-io/milvus/issues/40897

After this, the scheduling duration of document-add operations decreases
from roughly 6s to 0.9s for the case in the issue.

---------

Signed-off-by: SpadeA <tangchenjie1210@gmail.com>
2025-04-11 14:08:26 +08:00
Xianhui Lin
3bc24c264f
enhance: Add json key inverted index in stats for optimization (#38039)
Add json key inverted index in stats for optimization
https://github.com/milvus-io/milvus/issues/36995

---------

Signed-off-by: Xianhui.Lin <xianhui.lin@zilliz.com>
Co-authored-by: luzhang <luzhang@zilliz.com>
2025-04-10 15:20:28 +08:00
Spade A
e9fa30f462
fix: remove single segment logic in V7 (#41159)
Ref: https://github.com/milvus-io/milvus/issues/40823

It does not make sense to create a single-segment tantivy index for old
versions such as 2.4 using tantivy V7, so the relevant code is cleaned up.

---------

Signed-off-by: SpadeA <tangchenjie1210@gmail.com>
2025-04-09 19:54:27 +08:00
sthuang
50e02e3598
enhance: update packed reader api (#41055)
related: https://github.com/milvus-io/milvus/issues/39173

Signed-off-by: shaoting-huang <shaoting.huang@zilliz.com>
2025-04-09 10:18:26 +08:00
Spade A
c6a0c2ab64
enhance: process tantivy document add by batch (#40124)
issue: https://github.com/milvus-io/milvus/issues/40006

This PR makes tantivy documents be added in batches. Adding documents by
batch can greatly reduce the latency of scheduling the document-add
operation (calling tantivy `add_document` only schedules the add operation
and returns immediately once scheduled), because each call involves a tokio
block_on, which is relatively heavy.

Reducing the scheduling part does not necessarily reduce the overall
latency if the index writer threads do not process indexing quickly enough.
But if scheduling itself is slow, the overall performance can still be
limited even when the index writer threads process indexing very fast (by
increasing the thread number).

The following code benchmarks the PR (note: the duration only counts
scheduling, without commit):
```
fn test_performance() {
    let field_name = "text";
    let dir = TempDir::new().unwrap();
    let mut index_wrapper = IndexWriterWrapper::create_text_writer(
        field_name,
        dir.path().to_str().unwrap(),
        "default",
        "",
        1,
        50_000_000,
        false,
        TantivyIndexVersion::V7,
    )
    .unwrap();

    let mut batch = vec![];
    for i in 0..1_000_000 {
        batch.push(format!("hello{:04}", i));
    }
    let batch_ref = batch.iter().map(|s| s.as_str()).collect::<Vec<_>>();

    let now = std::time::Instant::now();
    index_wrapper
        .add_data_by_batch(&batch_ref, Some(0))
        .unwrap();
    let elapsed = now.elapsed();
    println!("add_data_by_batch elapsed: {:?}", elapsed);
}
```
Latency roughly reduces from 1.4s to 558ms.

---------

Signed-off-by: SpadeA <tangchenjie1210@gmail.com>
2025-04-08 19:50:24 +08:00
aoiasd
6f17720e4e
enhance: support using the jieba tokenizer with a custom dictionary (#39854)
relate: https://github.com/milvus-io/milvus/issues/40168

Signed-off-by: aoiasd <zhicheng.yue@zilliz.com>
2025-04-08 14:52:27 +08:00