issue: #43088
issue: #43038
The current loading process:
* When loading an index, we first download the index files into a list
of buffers, say A
* then construct (copy) them into a vector of FieldDatas (each file is
a FieldData), say B
* then assemble them into one huge BinarySet, say C
* lastly, copy that into the actual index data structure, say D
The problem:
* As we can see, after each step the data from the previous step is no
longer needed.
* But currently, we release the memory of A, B, and C only after D has
been fully constructed.
* This leads to a peak memory usage of up to 4x the raw index size
during the loading process.
* This PR releases B promptly after C is assembled (see the sketch
after this list). With this change, peak memory usage during loading is
at most 3x the raw index size.
I will create another PR to release A after B is created; that change
seems more involved and needs more work.
Signed-off-by: Buqian Zheng <zhengbuqian@gmail.com>
ref https://github.com/milvus-io/milvus/issues/42626
This PR makes the text match index and the JSON key stats index load
according to the mmap config.
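As a purely illustrative sketch (the config struct and function below are
hypothetical, not the Milvus API), the intent is simply to branch on the mmap
setting when these indexes are loaded:
```
#include <iostream>
#include <string>

struct LoadIndexConfig {
    bool enable_mmap = false;      // derived from the mmap config
    std::string local_index_path;  // on-disk location of the index files
};

void LoadScalarIndex(const LoadIndexConfig& cfg) {
    if (cfg.enable_mmap) {
        // mmap the index files: pages are faulted in lazily and can be
        // reclaimed under memory pressure instead of residing in memory.
        std::cout << "mmap-load " << cfg.local_index_path << "\n";
    } else {
        // read the index files fully into memory, as before
        std::cout << "memory-load " << cfg.local_index_path << "\n";
    }
}
```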
---------
Signed-off-by: SpadeA <tangchenjie1210@gmail.com>
Ref #42626
This PR tidies up the paths used by scalar indexes, including the path
for loading an index from remote storage and the temporary path for
building an index.
---------
Signed-off-by: SpadeA <tangchenjie1210@gmail.com>
fix: #39755
The following shows a simple benchmark that inserts 1M docs where every
row is "hello". The latency is measured at the segcore level on a 9900K CPU:
master: 2.62ms
this PR: 2.11ms
Benchmark code:
```
TEST(TextMatch, TestPerf) {
    auto schema = GenTestSchema({}, true);
    auto seg = CreateSealedSegment(schema, empty_index_meta);
    int64_t N = 1000000;
    uint64_t seed = 19190504;
    auto raw_data = DataGen(schema, N, seed);
    auto str_col = raw_data.raw_->mutable_fields_data()
                       ->at(1)
                       .mutable_scalars()
                       ->mutable_string_data()
                       ->mutable_data();
    // Fill (almost) every row with "hello" so the match hits nearly all docs.
    for (int64_t i = 0; i < N - 1; i++) {
        str_col->at(i) = "hello";
    }
    SealedLoadFieldData(raw_data, *seg);
    seg->CreateTextIndex(FieldId(101));
    auto now = std::chrono::high_resolution_clock::now();
    auto expr = GetMatchExpr(schema, "hello", OpType::TextMatch);
    auto final = ExecuteQueryExpr(expr, seg.get(), N, MAX_TIMESTAMP);
    auto end = std::chrono::high_resolution_clock::now();
    auto duration =
        std::chrono::duration_cast<std::chrono::microseconds>(end - now);
    std::cout << "TextMatch query time: " << duration.count() << "us"
              << std::endl;
}
```
---------
Signed-off-by: SpadeA <tangchenjie1210@gmail.com>
fix: https://github.com/milvus-io/milvus/issues/40823
To solve the problem in the issue, we have to support building the
tantivy index with an older tantivy version for query nodes that still
run that older version.
This PR does two things:
1. refactors IndexWriterWrapper to make it more concise
2. enables IndexWriterWrapper to build the tantivy index with different
tantivy crate versions (see the sketch after this list)
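A minimal sketch of the dispatch idea, using hypothetical names (the version
enum values and the wrapper method are assumptions, not the real interface):
the wrapper simply selects which tantivy crate binding to call based on the
index version requested for the target query node.
```
#include <string>

enum class TantivyVersion { kLegacy, kLatest };

class IndexWriterWrapper {
 public:
    explicit IndexWriterWrapper(TantivyVersion version) : version_(version) {}

    void AddText(const std::string& text) {
        if (version_ == TantivyVersion::kLegacy) {
            // would call into the FFI built against the older tantivy crate
        } else {
            // would call into the FFI built against the newer tantivy crate
        }
        (void)text;  // placeholder body in this sketch
    }

 private:
    TantivyVersion version_;
};
```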
---------
Signed-off-by: SpadeA <tangchenjie1210@gmail.com>
issue: #40308
This PR fixes two concurrency issues:
1. Elements of null_offset are used to set bits in a bitset whose size
is initialized from the tantivy document count. However, some documents
that are null in null_offset may not yet be committed in tantivy, so an
offset can fall outside the bitset and cause an out-of-range access.
2. null_offset can be read and written concurrently, but there is no
synchronization protecting it (see the sketch after this list).
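A minimal sketch of the two fixes, with hypothetical names (NullOffsetHolder
and its methods are illustrative, not the actual Milvus types): guard
null_offset with a mutex, and bound-check each offset against the bitset size
so uncommitted documents cannot cause an out-of-range write.
```
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <vector>

class NullOffsetHolder {
 public:
    // Writers (index build path) append under the lock.
    void Append(int64_t offset) {
        std::lock_guard<std::mutex> lock(mutex_);
        null_offset_.push_back(offset);
    }

    // Readers (query path) apply the offsets under the lock to a bitset
    // whose size reflects only the committed tantivy documents.
    void ApplyTo(std::vector<bool>& bitset) const {
        std::lock_guard<std::mutex> lock(mutex_);
        for (int64_t offset : null_offset_) {
            // A document may be recorded as null here before tantivy has
            // committed it, so skip offsets beyond the bitset size.
            if (offset >= 0 && static_cast<size_t>(offset) < bitset.size()) {
                bitset[offset] = true;
            }
        }
    }

 private:
    mutable std::mutex mutex_;
    std::vector<int64_t> null_offset_;
};
```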
---------
Signed-off-by: SpadeA <tangchenjie1210@gmail.com>
issue: #38715
- Currently, Milvus uses the serialized (compressed) index size to
estimate the resources needed for loading.
- Add a new field `MemSize` (the size before compression) to the index
so resource estimation can use the in-memory size (see the sketch
below).
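A minimal sketch of the estimation idea, with hypothetical names (IndexInfo
and EstimateLoadMemory are illustrative, not the real structs); falling back
to the serialized size for indexes that lack the new field is an assumption:
```
#include <cstdint>

struct IndexInfo {
    int64_t serialized_size = 0;  // compressed size of the index files
    int64_t mem_size = 0;         // new field: in-memory size before compression
};

int64_t EstimateLoadMemory(const IndexInfo& info) {
    // Prefer the uncompressed size; assume a fallback for older indexes
    // that have no MemSize recorded.
    return info.mem_size > 0 ? info.mem_size : info.serialized_size;
}
```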
---------
Signed-off-by: chyezh <chyezh@outlook.com>