issue: https://github.com/milvus-io/milvus/issues/40006
This PR make tantivy document add by batch. Add document by batch can
greately reduce the latency of scheduling the document add operation
(call tantivy `add_document` only schdules the add operation and it
returns immediately after scheduled) , because each call involes a tokio
block_on which is relatively heavy.
Reduce scheduling part not necessarily reduces the overall latency if
the index writer threads does not process indexing quickly enough.
But if scheduling itself is pretty slow, even the index writer threads
process indexing very fast (by increasing thread number), the overall
performance can still be limited.
The following codes bench the PR (Note, the duration only counts for
scheduling without commit)
```
fn test_performance() {
let field_name = "text";
let dir = TempDir::new().unwrap();
let mut index_wrapper = IndexWriterWrapper::create_text_writer(
field_name,
dir.path().to_str().unwrap(),
"default",
"",
1,
50_000_000,
false,
TantivyIndexVersion::V7,
)
.unwrap();
let mut batch = vec![];
for i in 0..1_000_000 {
batch.push(format!("hello{:04}", i));
}
let batch_ref = batch.iter().map(|s| s.as_str()).collect::<Vec<_>>();
let now = std::time::Instant::now();
index_wrapper
.add_data_by_batch(&batch_ref, Some(0))
.unwrap();
let elapsed = now.elapsed();
println!("add_data_by_batch elapsed: {:?}", elapsed);
}
```
Latency roughly reduces from 1.4s to 558ms.
---------
Signed-off-by: SpadeA <tangchenjie1210@gmail.com>
after the pr merged, we can support to insert, upsert, build index,
query, search in the added field.
can only do the above operates in added field after add field request
complete, which is a sync operate.
compact will be supported in the next pr.
#39718
---------
Signed-off-by: lixinguo <xinguo.li@zilliz.com>
Co-authored-by: lixinguo <xinguo.li@zilliz.com>
issue: #40308
This issue fixes these two concurrent issues:
1. element in null_offset is used to set bitset where the size of bitset
is initialized by tantivy document count. However, there may still be
some documents that are not committed in tantivy but are null in
null_offset. So array out of range occurs.
2. null_offset can be read and write concurrently but there's no
synchronization protection.
---------
Signed-off-by: SpadeA <tangchenjie1210@gmail.com>
fix: #39711
Unlike English sentence where each words are parsed exactly once and one
after one with position length 1, one Chinese word may be parsed to
multiple words with position length larger than 1.
For example, "badminton and skiing" will be parsed to Token{ start: 0,
length: 1, text: "badminton" }, Token{ start: 1, length: 1, text: "and"
}, and Token{ start: 2, length: 1, text: "tennis" }.
While for exmaple for Chinsese: "羽毛球和滑雪" may be parsed to Token{ start:
0, length: 2, text: "羽毛" }, Token{ start: 0, length: 3, text: "羽毛球" },
Token{ start: 3, length: 1, text: "和" }, and Token{ start: 4, length: 2,
text: "滑雪" }.
This PR fix that the code not recognizes this situation.
---------
Signed-off-by: SpadeA <tangchenjie1210@gmail.com>
issue: #35922
add an enable_tokenizer param to varchar field: must be set to true so
that a varchar field can enable_match or used as input of BM25 function
---------
Signed-off-by: Buqian Zheng <zhengbuqian@gmail.com>