support parallel loading sealed and growing segments with storage v2
format by async reading row groups.
related: #39173
---------
Signed-off-by: shaoting-huang <shaoting.huang@zilliz.com>
Issue: #41210
After https://github.com/zilliztech/tantivy/pull/5, we can provide
milvus row id directly to tantivy rather than record it in the fast
field "doc_id".
So rather than search tantivy doc id and then get milvus row id from
"doc_id", now, the searched tantivy doc id is the milvus row id,
eliminating the expensive acquiring row id phase.
The following shows a simple benchmark where insert **1M** docs where
all rows are "hello", the latency is **segcore** level, CPU is 9900K:

**The latency is 2.02 and 2.1 times respectively.**
bench mark code:
```
TEST(TextMatch, TestPerf) {
auto schema = GenTestSchema({}, true);
auto seg = CreateSealedSegment(schema, empty_index_meta);
int64_t N = 1000000;
uint64_t seed = 19190504;
auto raw_data = DataGen(schema, N, seed);
auto str_col = raw_data.raw_->mutable_fields_data()
->at(1)
.mutable_scalars()
->mutable_string_data()
->mutable_data();
for (int64_t i = 0; i < N - 1; i++) {
str_col->at(i) = "hello";
}
SealedLoadFieldData(raw_data, *seg);
seg->CreateTextIndex(FieldId(101));
auto now = std::chrono::high_resolution_clock::now();
auto expr = GetMatchExpr(schema, "hello", OpType::TextMatch);
auto final = ExecuteQueryExpr(expr, seg.get(), N, MAX_TIMESTAMP);
auto end = std::chrono::high_resolution_clock::now();
auto duration =
std::chrono::duration_cast<std::chrono::microseconds>(end - now);
std::cout << "TextMatch query time: " << duration.count() << "ms"
<< std::endl;
}
```
---------
Signed-off-by: SpadeA <tangchenjie1210@gmail.com>
issue: https://github.com/milvus-io/milvus/issues/40897
After this, the document add operations scheduling duration is decreased
roughly from 6s to 0.9s for the case in the issue.
---------
Signed-off-by: SpadeA <tangchenjie1210@gmail.com>
issue: #41172
Elements with type int8 or int16 in Array is encoded using int32, so we
should parse it as int32 when creating index.
Signed-off-by: sunby <sunbingyi1992@gmail.com>
fix: https://github.com/milvus-io/milvus/issues/40823
To solve the problem in the issue, we have to support building tantivy
index with low version
for those query nodes with low tantivy version.
This PR does two things:
1. refactor codes for IndexWriterWrapper to make it concise
2. enable IndexWriterWrapper to build tantivy index by different tantivy
crate
---------
Signed-off-by: SpadeA <tangchenjie1210@gmail.com>
after the pr merged, we can support to insert, upsert, build index,
query, search in the added field.
can only do the above operates in added field after add field request
complete, which is a sync operate.
compact will be supported in the next pr.
#39718
---------
Signed-off-by: lixinguo <xinguo.li@zilliz.com>
Co-authored-by: lixinguo <xinguo.li@zilliz.com>
two point:
(1) reoder conjucts expr's subexpr, postpone heavy operations
sequence: int(column) -> index(column) -> string(column) -> light
conjuct
...... -> json(column) -> heavy conjuct -> two_column_compare
(2) support pre filter for expr execute, skip scan raw data that had
been skipped
because of preceding expr result.
#39869
Signed-off-by: luzhang <luzhang@zilliz.com>
Co-authored-by: luzhang <luzhang@zilliz.com>
issue: #40308
This issue fixes these two concurrent issues:
1. element in null_offset is used to set bitset where the size of bitset
is initialized by tantivy document count. However, there may still be
some documents that are not committed in tantivy but are null in
null_offset. So array out of range occurs.
2. null_offset can be read and write concurrently but there's no
synchronization protection.
---------
Signed-off-by: SpadeA <tangchenjie1210@gmail.com>