1883 Commits

Author SHA1 Message Date
zhagnlu
ee1faf80dd
fix:add clear bitmap for batch skip mode (#41166)
#41086 #41150

Signed-off-by: luzhang <luzhang@zilliz.com>
Co-authored-by: luzhang <luzhang@zilliz.com>
2025-04-09 13:08:27 +08:00
sthuang
50e02e3598
enhance: update packed reader api (#41055)
related: https://github.com/milvus-io/milvus/issues/39173

Signed-off-by: shaoting-huang <shaoting.huang@zilliz.com>
2025-04-09 10:18:26 +08:00
congqixia
e2d8adb963
fix: Use element_type for Array is null operator (#41157)
Related to #41156

---------

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2025-04-09 10:16:24 +08:00
Spade A
c6a0c2ab64
enhance: process tantivy document add by batch (#40124)
issue: https://github.com/milvus-io/milvus/issues/40006

This PR make tantivy document add by batch. Add document by batch can
greately reduce the latency of scheduling the document add operation
(call tantivy `add_document` only schdules the add operation and it
returns immediately after scheduled) , because each call involes a tokio
block_on which is relatively heavy.

Reduce scheduling part not necessarily reduces the overall latency if
the index writer threads does not process indexing quickly enough.
But if scheduling itself is pretty slow, even the index writer threads
process indexing very fast (by increasing thread number), the overall
performance can still be limited.

The following codes bench the PR (Note, the duration only counts for
scheduling without commit)
```
fn test_performance() {
    let field_name = "text";
    let dir = TempDir::new().unwrap();
    let mut index_wrapper = IndexWriterWrapper::create_text_writer(
        field_name,
        dir.path().to_str().unwrap(),
        "default",
        "",
        1,
        50_000_000,
        false,
        TantivyIndexVersion::V7,
    )
    .unwrap();

    let mut batch = vec![];
    for i in 0..1_000_000 {
        batch.push(format!("hello{:04}", i));
    }
    let batch_ref = batch.iter().map(|s| s.as_str()).collect::<Vec<_>>();

    let now = std::time::Instant::now();
    index_wrapper
        .add_data_by_batch(&batch_ref, Some(0))
        .unwrap();
    let elapsed = now.elapsed();
    println!("add_data_by_batch elapsed: {:?}", elapsed);
}
```
Latency roughly reduces from 1.4s to 558ms.

---------

Signed-off-by: SpadeA <tangchenjie1210@gmail.com>
2025-04-08 19:50:24 +08:00
Bingyi Sun
da21640ac3
fix: Fix the bug that null data can not be filtered by null expr (#41124)
issue: https://github.com/milvus-io/milvus/issues/41063

---------

Signed-off-by: sunby <sunbingyi1992@gmail.com>
2025-04-08 19:12:24 +08:00
aoiasd
6f17720e4e
enhance: support use jieba tokenizer with costum dictionary (#39854)
relate: https://github.com/milvus-io/milvus/issues/40168

Signed-off-by: aoiasd <zhicheng.yue@zilliz.com>
2025-04-08 14:52:27 +08:00
Spade A
e4da2765ba
enhance: process batch of strings within one tantivy_index_add_string call (#40007)
issue: #40006

---------

Signed-off-by: SpadeA <tangchenjie1210@gmail.com>
2025-04-08 01:20:25 +08:00
Bingyi Sun
355f62d6c9
fix: Align brute force search with json index for exists expr (#41116)
issue: #35528

Signed-off-by: sunby <sunbingyi1992@gmail.com>
2025-04-07 15:42:23 +08:00
zhagnlu
ee8783cae9
fix:add operator type for some operator (#40895)
#40894

Signed-off-by: luzhang <luzhang@zilliz.com>
Co-authored-by: luzhang <luzhang@zilliz.com>
2025-04-07 11:58:27 +08:00
zhagnlu
10a63b3f2e
enhance: add formatter for serveral types to remove compile warning (#41094)
#41091

Signed-off-by: luzhang <luzhang@zilliz.com>
Co-authored-by: luzhang <luzhang@zilliz.com>
2025-04-07 11:54:24 +08:00
zhagnlu
0a378dc308
fix:fix format error for json (#41026)
#40963

Signed-off-by: luzhang <luzhang@zilliz.com>
Co-authored-by: luzhang <luzhang@zilliz.com>
2025-04-07 10:22:22 +08:00
Bingyi Sun
fcb03b5bd1
feat: add json null/exists expression (#41004)
issue: #35528

---------

Signed-off-by: sunby <sunbingyi1992@gmail.com>
2025-04-03 17:48:21 +08:00
Zhen Ye
9f27d9af61
fix: segv if the LoadArrowReaderFromRemote run at the exception path (#41069)
issue: #41067

Signed-off-by: chyezh <chyezh@outlook.com>
2025-04-03 02:54:21 +08:00
Spade A
f552ec67dd
fix: support building tantivy index with low version(5) (#40822)
fix: https://github.com/milvus-io/milvus/issues/40823
To solve the problem in the issue, we have to support building tantivy
index with low version
for those query nodes with low tantivy version.

This PR does two things:
1. refactor codes for IndexWriterWrapper to make it concise
2. enable IndexWriterWrapper to build tantivy index by different tantivy
crate

---------

Signed-off-by: SpadeA <tangchenjie1210@gmail.com>
2025-04-02 18:46:20 +08:00
Chun Han
afa519b4c7
fix: array is null failed(#40686) (#41027)
related: #40686

Signed-off-by: MrPresent-Han <chun.han@gmail.com>
Co-authored-by: MrPresent-Han <chun.han@gmail.com>
2025-04-02 18:20:22 +08:00
smellthemoon
cb1e86e17c
enhance: support add field (#39800)
after the pr merged, we can support to insert, upsert, build index,
query, search in the added field.
can only do the above operates in added field after add field request
complete, which is a sync operate.

compact will be supported in the next pr.
#39718

---------

Signed-off-by: lixinguo <xinguo.li@zilliz.com>
Co-authored-by: lixinguo <xinguo.li@zilliz.com>
2025-04-02 14:24:31 +08:00
Spade A
216be1494b
fix: add log for object storage operation fail (#40666)
fix: #40665

---------

Signed-off-by: SpadeA <tangchenjie1210@gmail.com>
2025-04-02 01:26:21 +08:00
cqy123456
6dc0f42830
fix:growing mmap data type crashed by nullable input (#40994)
issue: https://github.com/milvus-io/milvus/issues/40981
2.5 pr: https://github.com/milvus-io/milvus/pull/40980

Signed-off-by: cqy123456 <qianya.cheng@zilliz.com>
2025-03-31 20:32:19 +08:00
Bingyi Sun
27ff3a42e7
enhance: Record simdjson error (#41003)
issue: #35528

---------

Signed-off-by: sunby <sunbingyi1992@gmail.com>
2025-03-31 17:56:19 +08:00
Bingyi Sun
15ec7bae4d
fix: Fix using json index when iterative_filter is specified (#40945)
issue: #40934

Signed-off-by: sunby <sunbingyi1992@gmail.com>
2025-03-31 15:26:19 +08:00
Bingyi Sun
9676365af9
fix: Fix json index not equal filter (#40647)
issue: #35528

---------

Signed-off-by: sunby <sunbingyi1992@gmail.com>
2025-03-27 23:06:23 +08:00
aoiasd
384d39ef5a
enhance: not build lindera features by default and support make milvus with tantivy features (#40813)
relate: https://github.com/milvus-io/milvus/issues/39659

Signed-off-by: aoiasd <zhicheng.yue@zilliz.com>
2025-03-27 14:08:22 +08:00
zhagnlu
87e7d6d79f
fix:fix exception when do arith expr with using index (#40794)
#40783

Signed-off-by: luzhang <luzhang@zilliz.com>
Co-authored-by: luzhang <luzhang@zilliz.com>
2025-03-27 11:10:21 +08:00
Xiaofan
8788e591cd
enhance: add detailed stack for error message (#40883)
fix #40882
adding stacktrace will operator execute failed.

Signed-off-by: xiaofanluan <xiaofan.luan@zilliz.com>
2025-03-26 13:24:20 +08:00
zhagnlu
7fdb2e144f
enhance:change multi or expr to in expr (#40757)
#40752

Signed-off-by: luzhang <luzhang@zilliz.com>
Co-authored-by: luzhang <luzhang@zilliz.com>
2025-03-25 11:06:18 +08:00
cai.zhang
a41cb942f6
fix: Do not delete the centroids file when sampling fails instead wait GC (#40701)
issue: #40700

---------

Signed-off-by: Cai Zhang <cai.zhang@zilliz.com>
2025-03-21 10:32:12 +08:00
sthuang
d7df78a6c9
feat: Storage v2 compaction (#40667)
- Feat: Support Mix compaction. Covering tests include compatibility and
rollback ability.
  - Read v1 segments and compact with v2 format.
  - Read both v1 and v2 segments and compact with v2 format.
  - Read v2 segments and compact with v2 format.
  - Compact with duplicate primary key test.
  - Compact with bm25 segments.
  - Compact with merge sort segments.
  - Compact with no expiration segments.
  - Compact with lack binlog segments.
  - Compact with nullable field segments.
- Feat: Support Clustering compaction. Covering tests include
compatibility and rollback ability.
  - Read v1 segments and compact with v2 format.
  - Read both v1 and v2 segments and compact with v2 format.
  - Read v2 segments and compact with v2 format.
  - Compact bm25 segments with v2 format.
  - Compact with memory limit.
- Enhance: Use serdeMap serialize in BuildRecord function to support all
Milvus data types.
related: #39173

Signed-off-by: shaoting-huang <shaoting.huang@zilliz.com>
2025-03-21 10:16:12 +08:00
Bingyi Sun
5a6b4e56d5
fix: Fix tasks will panic if one of them throw an exception. (#40691)
issue: https://github.com/milvus-io/milvus/issues/40690

the variable rcm will be dangling if a future throws an exception and
return.

Signed-off-by: sunby <sunbingyi1992@gmail.com>
2025-03-19 16:52:09 +08:00
aoiasd
92bdf7a0c1
enhance: support run anayser return detaild token (#40458)
relate: https://github.com/milvus-io/milvus/issues/39705

Signed-off-by: aoiasd <zhicheng.yue@zilliz.com>
2025-03-19 15:48:15 +08:00
zhagnlu
6c55db44f1
enhance: reorder sub expr for conjunct expr (#39872)
two point:
 (1) reoder conjucts expr's subexpr, postpone heavy operations
sequence: int(column) -> index(column) -> string(column) -> light
conjuct
...... -> json(column) -> heavy conjuct -> two_column_compare
(2) support pre filter for expr execute, skip scan raw data that had
been skipped
     because of preceding expr result.

#39869

Signed-off-by: luzhang <luzhang@zilliz.com>
Co-authored-by: luzhang <luzhang@zilliz.com>
2025-03-19 14:50:14 +08:00
Zhen Ye
8db708f67d
enhance: enable memory prof based on jemalloc (#40731)
issue: #40730

also see: https://github.com/milvus-io/cgosymbolizer/pull/2

After these PR, at linux:

- the milvus will always enable jemalloc by default.
- jemalloc will always compiled with --enable-prof options.
- all image will always enable the jemalloc prof by default.
- a pprof http service for jemalloc at `/debug/jemalloc/` will be
registered into restful.
- `jeprof` can remote profile the memory of milvus.

Signed-off-by: chyezh <chyezh@outlook.com>
2025-03-19 14:46:18 +08:00
zhagnlu
7ebe3d7038
enhance: refine chunk access logic and add some comment on data (#40618)
#40367

Signed-off-by: luzhang <luzhang@zilliz.com>
Co-authored-by: luzhang <luzhang@zilliz.com>
2025-03-16 22:20:08 +08:00
Bingyi Sun
6249335859
fix: Catch invalid json pointer error (#40625)
issue: #35528

Signed-off-by: sunby <sunbingyi1992@gmail.com>
2025-03-14 16:56:08 +08:00
Bingyi Sun
d3adab15ac
fix: Build double index for all json numeric field (#40619)
issue: #35528

---------

Signed-off-by: sunby <sunbingyi1992@gmail.com>
2025-03-14 16:52:11 +08:00
Bingyi Sun
8fbacf3583
fix: Null expr does not work for json field (#40456)
issue: https://github.com/milvus-io/milvus/issues/40455

Signed-off-by: sunby <sunbingyi1992@gmail.com>
2025-03-14 16:06:08 +08:00
Spade A
001fc992df
enhance: get doc ids by batch (#40608)
issue: #40607

tantivy change: https://github.com/zilliztech/tantivy/pull/3

Benchmarks:
Test Envrioment: CPU 9900K
The data is insert by:
```
for i in 0..N {
    for j in 0..UNIQUE {
        let key = format!("hello{}", j);
        index_writer.add_string(&key, i * UNIQUE + j).unwrap();
    }
}
```
So the unique influences the locality of the matched docs.
The latency is the avg latency over 1000 repeate quries.
The result shows 22.5%-34.8% latency reduction.

![image](https://github.com/user-attachments/assets/dd8af75a-ddc3-445d-92df-50d354dd5645)

---------

Signed-off-by: SpadeA <tangchenjie1210@gmail.com>
2025-03-14 15:48:09 +08:00
Spade A
f36d1562bd
enhance: add metrics for random sample (#40634)
issue: #39541

Signed-off-by: SpadeA <tangchenjie1210@gmail.com>
2025-03-13 21:42:11 +08:00
Spade A
9f3bd55755
fix: avoid panic when field not exists in schema in query node (#40541)
ref #40473

This PR is a workaround to avoid the panic described in the issue.

---------

Signed-off-by: SpadeA <tangchenjie1210@gmail.com>
2025-03-12 22:44:08 +08:00
Bingyi Sun
0698d04f7d
enhance: Upgrade simdjson version (#40538)
issue: https://github.com/milvus-io/milvus/issues/40519
simdjson returns better error code in newer version.

Signed-off-by: sunby <sunbingyi1992@gmail.com>
2025-03-11 15:04:05 +08:00
cai.zhang
e5f50076ec
enhance: Only check element type with not null array (#40446)
Signed-off-by: Cai Zhang <cai.zhang@zilliz.com>
2025-03-11 14:58:07 +08:00
Bingyi Sun
0a7e692b6f
fix: Fix null offset loading in inverted index (#40523)
issue: #40516

---------

Signed-off-by: sunby <sunbingyi1992@gmail.com>
2025-03-10 22:12:04 +08:00
Cai Yudong
2bd2cca04a
enhance: Truly support multi vector data types in SearchBruteForce (#40499)
Issue: #38666

Signed-off-by: CaiYudong <yudong.cai@zilliz.com>
2025-03-10 18:36:03 +08:00
sre-ci-robot
a6d4121034
[automated] Update Knowhere Commit (#40486)
Update Knowhere Commit
Signed-off-by: sre-ci-robot sre-ci-robot@users.noreply.github.com

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2025-03-10 12:28:04 +08:00
smellthemoon
faae8ee518
fix: store wrong offset when build tantivy in nullable field (#40452)
#40454

Signed-off-by: lixinguo <xinguo.li@zilliz.com>
Co-authored-by: lixinguo <xinguo.li@zilliz.com>
2025-03-09 09:34:04 +08:00
Bingyi Sun
37b118d55d
fix: Skip loading primary key if index has raw data (#39921)
issue: https://github.com/milvus-io/milvus/issues/39907

Signed-off-by: sunby <sunbingyi1992@gmail.com>
2025-03-06 17:46:02 +08:00
Spade A
3db56560fb
fix: fix concurrent issues in null offset (#40363)
issue: #40308
This issue fixes these two concurrent issues:
1. element in null_offset is used to set bitset where the size of bitset
is initialized by tantivy document count. However, there may still be
some documents that are not committed in tantivy but are null in
null_offset. So array out of range occurs.
2. null_offset can be read and write concurrently but there's no
synchronization protection.

---------

Signed-off-by: SpadeA <tangchenjie1210@gmail.com>
2025-03-05 17:48:00 +08:00
Bingyi Sun
be4d09561b
fix: Fix missing null or non-exist key in json index (#40336)
issue: #35528

---------

Signed-off-by: sunby <sunbingyi1992@gmail.com>
2025-03-05 11:48:02 +08:00
Bingyi Sun
7040ba1c12
enhance: make json path index support term filter (#40140)
issue: #35528

---------

Signed-off-by: sunby <sunbingyi1992@gmail.com>
2025-03-04 11:56:02 +08:00
Zhen Ye
8eb662b4dc
enhance: add more metrics for async cgo component (#40136)
issue: #40014

Signed-off-by: chyezh <chyezh@outlook.com>
2025-03-03 09:56:03 +08:00
sre-ci-robot
6a57a1973f
[automated] Update Knowhere Commit (#40283)
Update Knowhere Commit
Signed-off-by: sre-ci-robot sre-ci-robot@users.noreply.github.com

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2025-03-03 01:11:58 +08:00