Ref https://github.com/milvus-io/milvus/issues/42053
This PR enables the ngram index to support more kinds of matches, such as
prefix and postfix matches.
---------
Signed-off-by: SpadeA <tangchenjie1210@gmail.com>
Ref #42053
This is the first PR for optimizing `LIKE` with ngram inverted index.
Currently, only the VARCHAR data type and InnerMatch LIKE (`%xxx%`)
queries are supported.
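To illustrate why an ngram inverted index helps an InnerMatch query, here is a minimal sketch of the general technique, not the actual Milvus implementation. All names (`grams`, `build_index`, `inner_match`) are hypothetical, and a single gram size `n` is assumed for simplicity. The literal is split into n-grams, the posting lists are intersected to get candidate rows, and each candidate is then verified against the raw string.

```python
from collections import defaultdict


def grams(text: str, n: int) -> set[str]:
    """All distinct substrings of length n (hypothetical helper)."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def build_index(docs: list[str], n: int) -> dict[str, set[int]]:
    """Map each n-gram to the set of row ids that contain it."""
    index: dict[str, set[int]] = defaultdict(set)
    for row_id, doc in enumerate(docs):
        for g in grams(doc, n):
            index[g].add(row_id)
    return index


def inner_match(docs: list[str], index: dict[str, set[int]], n: int, literal: str) -> list[int]:
    """Row ids matching LIKE '%literal%'."""
    if len(literal) < n:
        # Assumption: literals shorter than n cannot use the index, so scan.
        return [i for i, d in enumerate(docs) if literal in d]
    candidates = set(range(len(docs)))
    for g in grams(literal, n):
        candidates &= index.get(g, set())
    # Sharing every n-gram is necessary but not sufficient, so verify.
    return sorted(i for i in candidates if literal in docs[i])


docs = ["hello world", "yellow", "low hell", "help"]
index = build_index(docs, 2)
print(inner_match(docs, index, 2, "ello"))
```

In this example "low hell" contains every 2-gram of "ello" but not the literal itself, which is why the final verification step is needed.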
How to use it:
```
from pymilvus import MilvusClient, DataType

milvus_client = MilvusClient("http://localhost:19530")
schema = milvus_client.create_schema()
...
schema.add_field("content_ngram", DataType.VARCHAR, max_length=10000)
...
index_params = milvus_client.prepare_index_params()
index_params.add_index(field_name="content_ngram", index_type="NGRAM", index_name="ngram_index", min_gram=2, max_gram=3)
milvus_client.create_collection(COLLECTION_NAME, ...)
```
`min_gram` and `max_gram` control how we tokenize the documents. For
example, with min_gram=2 and max_gram=4, we tokenize each document
into 2-grams, 3-grams and 4-grams.
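As a rough illustration of the tokenization described above, the following sketch enumerates every n-gram between `min_gram` and `max_gram`. The `ngrams` function is hypothetical and assumes character-level grams; the tokenizer in the index itself may differ in details.

```python
def ngrams(text: str, min_gram: int, max_gram: int) -> list[str]:
    """Return every substring of length min_gram..max_gram, shortest first."""
    tokens = []
    for n in range(min_gram, max_gram + 1):
        # Slide a window of width n over the text.
        for i in range(len(text) - n + 1):
            tokens.append(text[i:i + n])
    return tokens


print(ngrams("milvus", 2, 4))
```

A larger gap between `min_gram` and `max_gram` produces more tokens per document, so the index grows accordingly.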
---------
Signed-off-by: SpadeA <tangchenjie1210@gmail.com>
Signed-off-by: SpadeA-Tang <tangchenjie1210@gmail.com>
ref https://github.com/milvus-io/milvus/issues/42626
This PR makes the text match index and the JSON key stats index load
according to the mmap config.
---------
Signed-off-by: SpadeA <tangchenjie1210@gmail.com>
Related to #39173
`null_bitmap_data()` returns a raw pointer to the null bitmap of an Array.
After slicing, this bitmap is not rewritten because slicing is zero-copy,
so the current start position may be non-zero when `FillFieldData`
generates the column's `valid_data` array.
This PR adds an `offset` param to the `FillFieldData` method and forces
all invocations to pass the correct offset into the `null_bitmap_data`
pointer.
It also updates the milvus-storage commit to fix the reader failing to
return data when the buffer size is smaller than the row group size.
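The offset issue can be shown with a small sketch. It assumes the bitmap uses LSB-first bit order (the Arrow convention); `valid_bits` is a hypothetical helper, not the `FillFieldData` signature.

```python
def valid_bits(bitmap: bytes, offset: int, length: int) -> list[bool]:
    """Read `length` validity bits starting at bit `offset` (LSB-first)."""
    return [
        bool((bitmap[(offset + i) // 8] >> ((offset + i) % 8)) & 1)
        for i in range(length)
    ]


# Parent array of 10 values; values at positions 3 and 7 are null.
bitmap = bytes([0b01110111, 0b00000011])

# A zero-copy slice [2, 7) shares the parent's bitmap, so the reader must
# start at bit 2. Starting at bit 0 marks the wrong row as null.
print(valid_bits(bitmap, 2, 5))  # correct: honours the slice offset
print(valid_bits(bitmap, 0, 5))  # wrong for the slice: ignores the offset
```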
---------
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
Related to #39173
This PR:
- Uses the updated path with bucketName for packedReader
- Updates the milvus-storage commit to report reader/writer
initialization failures; see also milvus-io/milvus-storage#192
---------
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
issue: https://github.com/milvus-io/milvus/issues/41435
This PR also makes `HasRawData` of `ChunkedSegmentSealedImpl` answer
from metadata, without loading the cache just to answer this simple
question.
---------
Signed-off-by: Buqian Zheng <zhengbuqian@gmail.com>
Related to #39173
This PR:
- Upgrades the milvus-storage commit to fix the filesystem finalized issue
- Adds bucket-name as a prefix for all fs-style I/O access
- Initializes the arrow fs on querynode startup
- Fixes timestamp access when loading sealed segments
---------
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
fix: #39755
The following is a simple benchmark that inserts 1M docs where all rows
are "hello". The latency is measured at the segcore level on a 9900K CPU:
master: 2.62ms
this PR: 2.11ms
Benchmark code:
```
TEST(TextMatch, TestPerf) {
    auto schema = GenTestSchema({}, true);
    auto seg = CreateSealedSegment(schema, empty_index_meta);
    int64_t N = 1000000;
    uint64_t seed = 19190504;
    auto raw_data = DataGen(schema, N, seed);

    // Overwrite rows [0, N - 1) of the string column with "hello".
    auto str_col = raw_data.raw_->mutable_fields_data()
                       ->at(1)
                       .mutable_scalars()
                       ->mutable_string_data()
                       ->mutable_data();
    for (int64_t i = 0; i < N - 1; i++) {
        str_col->at(i) = "hello";
    }
    SealedLoadFieldData(raw_data, *seg);
    seg->CreateTextIndex(FieldId(101));

    // Time expression construction plus query execution.
    auto now = std::chrono::high_resolution_clock::now();
    auto expr = GetMatchExpr(schema, "hello", OpType::TextMatch);
    auto final = ExecuteQueryExpr(expr, seg.get(), N, MAX_TIMESTAMP);
    auto end = std::chrono::high_resolution_clock::now();
    auto duration =
        std::chrono::duration_cast<std::chrono::microseconds>(end - now);
    // The duration is in microseconds, so label it accordingly.
    std::cout << "TextMatch query time: " << duration.count() << "us"
              << std::endl;
}
```
---------
Signed-off-by: SpadeA <tangchenjie1210@gmail.com>
Support parallel loading of sealed and growing segments in storage v2
format by reading row groups asynchronously.
related: #39173
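As a rough sketch of the idea (not the segcore implementation), row groups can be read concurrently and then stitched back together in order. `read_row_group` is a hypothetical stand-in for the actual storage v2 reader, and a thread pool is assumed as the async mechanism.

```python
from concurrent.futures import ThreadPoolExecutor


def read_row_group(index: int) -> list[int]:
    """Hypothetical stand-in for reading one row group from storage."""
    return [index * 10 + k for k in range(3)]


def load_segment(num_row_groups: int, max_workers: int = 4) -> list[int]:
    """Read all row groups in parallel and concatenate them in order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in submission order, so the rows
        # keep their original order even if reads finish out of order.
        groups = list(pool.map(read_row_group, range(num_row_groups)))
    return [row for group in groups for row in group]


print(load_segment(3))
```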
---------
Signed-off-by: shaoting-huang <shaoting.huang@zilliz.com>
Issue: #41210
After https://github.com/zilliztech/tantivy/pull/5, we can provide the
milvus row id directly to tantivy rather than recording it in the fast
field "doc_id".
So instead of searching for the tantivy doc id and then looking up the
milvus row id from "doc_id", the tantivy doc id returned by the search
is now the milvus row id, which eliminates the expensive row id lookup
phase.
The following is a simple benchmark that inserts **1M** docs where all
rows are "hello". The latency is measured at the **segcore** level on a
9900K CPU:

**The latency is 2.02 and 2.1 times respectively.**
Benchmark code:
```
TEST(TextMatch, TestPerf) {
    auto schema = GenTestSchema({}, true);
    auto seg = CreateSealedSegment(schema, empty_index_meta);
    int64_t N = 1000000;
    uint64_t seed = 19190504;
    auto raw_data = DataGen(schema, N, seed);

    // Overwrite rows [0, N - 1) of the string column with "hello".
    auto str_col = raw_data.raw_->mutable_fields_data()
                       ->at(1)
                       .mutable_scalars()
                       ->mutable_string_data()
                       ->mutable_data();
    for (int64_t i = 0; i < N - 1; i++) {
        str_col->at(i) = "hello";
    }
    SealedLoadFieldData(raw_data, *seg);
    seg->CreateTextIndex(FieldId(101));

    // Time expression construction plus query execution.
    auto now = std::chrono::high_resolution_clock::now();
    auto expr = GetMatchExpr(schema, "hello", OpType::TextMatch);
    auto final = ExecuteQueryExpr(expr, seg.get(), N, MAX_TIMESTAMP);
    auto end = std::chrono::high_resolution_clock::now();
    auto duration =
        std::chrono::duration_cast<std::chrono::microseconds>(end - now);
    // The duration is in microseconds, so label it accordingly.
    std::cout << "TextMatch query time: " << duration.count() << "us"
              << std::endl;
}
```
---------
Signed-off-by: SpadeA <tangchenjie1210@gmail.com>
issue: https://github.com/milvus-io/milvus/issues/40897
After this PR, the scheduling duration of the document add operations
drops from roughly 6s to 0.9s for the case in the issue.
---------
Signed-off-by: SpadeA <tangchenjie1210@gmail.com>
Ref: https://github.com/milvus-io/milvus/issues/40823
It makes no sense to create a single-segment tantivy index for an old
version such as 2.4 using tantivy V7, so this PR removes the relevant
code.
---------
Signed-off-by: SpadeA <tangchenjie1210@gmail.com>
issue: https://github.com/milvus-io/milvus/issues/40006
This PR makes tantivy add documents in batches. Adding documents in
batches can greatly reduce the latency of scheduling the document add
operations (a call to tantivy's `add_document` only schedules the add
operation and returns immediately once it is scheduled), because each
call involves a tokio `block_on`, which is relatively heavy.
Reducing the scheduling cost does not necessarily reduce the overall
latency if the index writer threads do not process indexing quickly
enough.
But if scheduling itself is slow, the overall performance can still be
limited even when the index writer threads process indexing very fast
(for example, after increasing the thread count).
The following code benchmarks the PR (note that the duration only covers
scheduling, without commit):
```
fn test_performance() {
    let field_name = "text";
    let dir = TempDir::new().unwrap();
    let mut index_wrapper = IndexWriterWrapper::create_text_writer(
        field_name,
        dir.path().to_str().unwrap(),
        "default",
        "",
        1,
        50_000_000,
        false,
        TantivyIndexVersion::V7,
    )
    .unwrap();

    // Build 1M distinct documents up front so only scheduling is timed.
    let mut batch = vec![];
    for i in 0..1_000_000 {
        batch.push(format!("hello{:04}", i));
    }
    let batch_ref = batch.iter().map(|s| s.as_str()).collect::<Vec<_>>();

    // Measure the time to schedule the whole batch (no commit).
    let now = std::time::Instant::now();
    index_wrapper
        .add_data_by_batch(&batch_ref, Some(0))
        .unwrap();
    let elapsed = now.elapsed();
    println!("add_data_by_batch elapsed: {:?}", elapsed);
}
```
Latency drops from roughly 1.4s to 558ms.
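The batching argument can be pictured with a toy model. This is only a sketch under the assumption that each scheduling call pays one fixed-cost entry (standing in for the tokio `block_on`); `SchedulerModel` and its methods are hypothetical and do not mirror the tantivy API.

```python
class SchedulerModel:
    """Toy model: every scheduling call pays one fixed-cost entry."""

    def __init__(self) -> None:
        self.fixed_cost_entries = 0  # stands in for block_on invocations
        self.queued: list[str] = []

    def _schedule(self, docs: list[str]) -> None:
        self.fixed_cost_entries += 1
        self.queued.extend(docs)

    def add_one(self, doc: str) -> None:
        """Per-document scheduling: one fixed cost per document."""
        self._schedule([doc])

    def add_batch(self, docs: list[str]) -> None:
        """Batched scheduling: one fixed cost for the whole batch."""
        self._schedule(docs)


docs = [f"hello{i:04}" for i in range(1000)]

per_doc = SchedulerModel()
for d in docs:
    per_doc.add_one(d)

batched = SchedulerModel()
batched.add_batch(docs)

print(per_doc.fixed_cost_entries, batched.fixed_cost_entries)
```

Both models queue the same documents; only the number of fixed-cost entries differs, which is the part of the latency that batching targets.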
---------
Signed-off-by: SpadeA <tangchenjie1210@gmail.com>