milvus

mirror of https://gitee.com/milvus-io/milvus.git synced 2026-01-05 18:31:59 +08:00

Author	SHA1	Message	Date
Buqian Zheng	b0260d8676	feat: manual evict cache after built interim index (#41836 ) issue: https://github.com/milvus-io/milvus/issues/41435 this PR also makes HasRawData of ChunkedSegmentSealedImpl to return based on metadata, without needing to load the cache just to answer this simple question. --------- Signed-off-by: Buqian Zheng <zhengbuqian@gmail.com>	2025-05-16 16:34:23 +08:00
congqixia	a6d09ff4cd	enhance: [StorageV2] fix issues integrating basic RW operations (#41834 ) Related to #39173 This PR: - Upgrade milvus-storage commit to fix filesystem finalized issue - Add bucket-name as prefix for all fs style access io - Initial arrow fs on querynodes startup - Fix timestamp access when loading sealed segment --------- Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>	2025-05-15 09:52:23 +08:00
foxspy	358bc150df	enhance: add force rebuild index configuration (#41473 ) issue: #41431 Signed-off-by: xianliang.li <xianliang.li@zilliz.com>	2025-05-14 10:52:21 +08:00
foxspy	e2ddbe4962	feat: add cachinglayer to index (#41653 ) issue: #41435 Signed-off-by: xianliang.li <xianliang.li@zilliz.com>	2025-05-08 10:12:54 +08:00
Bingyi Sun	0dee3ccfd7	enhance: Make user specified doc id selectable for tantivy index writer (#41528 ) issue: https://github.com/milvus-io/milvus/issues/41527 --------- Signed-off-by: sunby <sunbingyi1992@gmail.com>	2025-05-07 10:48:53 +08:00
foxspy	1d99f8bd67	enhance: add force rebuild index configuration (#41473 ) issue: #41431 Signed-off-by: xianliang.li <xianliang.li@zilliz.com>	2025-04-29 16:20:56 +08:00
Spade A	910f68c986	fix: update tantivy to fix tantivy doc out of order when merge (#41596 ) issue: #41597 Signed-off-by: SpadeA <tangchenjie1210@gmail.com>	2025-04-29 13:46:49 +08:00
Spade A	f35e8f7420	fix: fix arm64 compile issue (#41593 ) issue: https://github.com/milvus-io/milvus/issues/41059, https://github.com/milvus-io/milvus/issues/41510 Signed-off-by: SpadeA <tangchenjie1210@gmail.com>	2025-04-29 13:19:25 +08:00
cai.zhang	640f526301	fix: Update current scalar index version to compatible tantivy different versions (#41141 ) issue: #40823 Signed-off-by: Cai Zhang <cai.zhang@zilliz.com>	2025-04-27 20:44:39 +08:00
Spade A	f3d878ab3f	fix: update tantivy for fixing phrase match (#41450 ) issue: #41454 https://github.com/zilliztech/tantivy/pull/8 fixes the problem, this PR update the tantivy. --------- Signed-off-by: SpadeA <tangchenjie1210@gmail.com>	2025-04-24 10:52:37 +08:00
aoiasd	a16bd6263b	feat: support more lauguage for build in stop words and add remove punct, regex filter (#41412 ) relate: https://github.com/milvus-io/milvus/issues/41213 --------- Signed-off-by: aoiasd <zhicheng.yue@zilliz.com>	2025-04-23 11:44:37 +08:00
aoiasd	11f2fae42e	feat: support extend default dict for jieba tokenizer (#41360 ) relate: https://github.com/milvus-io/milvus/issues/41213 Signed-off-by: aoiasd <zhicheng.yue@zilliz.com>	2025-04-22 20:34:37 +08:00
aoiasd	110c5aaaf4	feat: support icu and language identifier tokenizer (#41214 ) relate: https://github.com/milvus-io/milvus/issues/41213 Signed-off-by: aoiasd <zhicheng.yue@zilliz.com>	2025-04-22 15:56:37 +08:00
aoiasd	f166843c5e	enhance: support use lindera tag filter (#40416 ) relate: https://github.com/milvus-io/milvus/issues/39659 Signed-off-by: aoiasd <zhicheng.yue@zilliz.com>	2025-04-21 15:56:36 +08:00
Spade A	5b1430f27e	enhance: tantivy collector set bitset directly (#39748 ) fix: #39755 The following shows a simple benchmark where insert 1M docs where all rows are "hello", the latency is segcore level, CPU is 9900K: master: 2.62ms this PR: 2.11ms bench mark code: ``` TEST(TextMatch, TestPerf) { auto schema = GenTestSchema({}, true); auto seg = CreateSealedSegment(schema, empty_index_meta); int64_t N = 1000000; uint64_t seed = 19190504; auto raw_data = DataGen(schema, N, seed); auto str_col = raw_data.raw_->mutable_fields_data() ->at(1) .mutable_scalars() ->mutable_string_data() ->mutable_data(); for (int64_t i = 0; i < N - 1; i++) { str_col->at(i) = "hello"; } SealedLoadFieldData(raw_data, *seg); seg->CreateTextIndex(FieldId(101)); auto now = std::chrono::high_resolution_clock::now(); auto expr = GetMatchExpr(schema, "hello", OpType::TextMatch); auto final = ExecuteQueryExpr(expr, seg.get(), N, MAX_TIMESTAMP); auto end = std::chrono::high_resolution_clock::now(); auto duration = std::chrono::duration_cast<std::chrono::microseconds>(end - now); std::cout << "TextMatch query time: " << duration.count() << "ms" << std::endl; } ``` --------- Signed-off-by: SpadeA <tangchenjie1210@gmail.com>	2025-04-20 23:02:41 +08:00
Spade A	62293cb582	fix: revert batch add (#41374 ) issue: #41375 todo: to fix the problems fixed in the issue. --------- Signed-off-by: SpadeA <tangchenjie1210@gmail.com>	2025-04-17 22:32:38 +08:00
sthuang	1f1c836fb9	feat: Storage v2 growing segment load (#41001 ) support parallel loading sealed and growing segments with storage v2 format by async reading row groups. related: #39173 --------- Signed-off-by: shaoting-huang <shaoting.huang@zilliz.com>	2025-04-16 17:14:33 +08:00
Spade A	70d13dcf61	enhance: update tantivy for removing "doc_id" fast field (#41198 ) Issue: #41210 After https://github.com/zilliztech/tantivy/pull/5, we can provide milvus row id directly to tantivy rather than record it in the fast field "doc_id". So rather than search tantivy doc id and then get milvus row id from "doc_id", now, the searched tantivy doc id is the milvus row id, eliminating the expensive acquiring row id phase. The following shows a simple benchmark where insert 1M docs where all rows are "hello", the latency is segcore level, CPU is 9900K: ![image](https://github.com/user-attachments/assets/d8e72134-56b5-430b-8628-36c3bed8eaad) The latency is 2.02 and 2.1 times respectively. bench mark code: ``` TEST(TextMatch, TestPerf) { auto schema = GenTestSchema({}, true); auto seg = CreateSealedSegment(schema, empty_index_meta); int64_t N = 1000000; uint64_t seed = 19190504; auto raw_data = DataGen(schema, N, seed); auto str_col = raw_data.raw_->mutable_fields_data() ->at(1) .mutable_scalars() ->mutable_string_data() ->mutable_data(); for (int64_t i = 0; i < N - 1; i++) { str_col->at(i) = "hello"; } SealedLoadFieldData(raw_data, *seg); seg->CreateTextIndex(FieldId(101)); auto now = std::chrono::high_resolution_clock::now(); auto expr = GetMatchExpr(schema, "hello", OpType::TextMatch); auto final = ExecuteQueryExpr(expr, seg.get(), N, MAX_TIMESTAMP); auto end = std::chrono::high_resolution_clock::now(); auto duration = std::chrono::duration_cast<std::chrono::microseconds>(end - now); std::cout << "TextMatch query time: " << duration.count() << "ms" << std::endl; } ``` --------- Signed-off-by: SpadeA <tangchenjie1210@gmail.com>	2025-04-15 20:20:32 +08:00
Spade A	9ce3e3cb44	enhance: add documents in batch for json key stats (#41228 ) issue: https://github.com/milvus-io/milvus/issues/40897 After this, the document add operations scheduling duration is decreased roughly from 6s to 0.9s for the case in the issue. --------- Signed-off-by: SpadeA <tangchenjie1210@gmail.com>	2025-04-11 14:08:26 +08:00
Xianhui Lin	3bc24c264f	enhance: Add json key inverted index in stats for optimization (#38039 ) Add json key inverted index in stats for optimization https://github.com/milvus-io/milvus/issues/36995 --------- Signed-off-by: Xianhui.Lin <xianhui.lin@zilliz.com> Co-authored-by: luzhang <luzhang@zilliz.com>	2025-04-10 15:20:28 +08:00
Spade A	e9fa30f462	fix: remove single segment logic in V7 (#41159 ) Ref: https://github.com/milvus-io/milvus/issues/40823 It does not make any sense to create single segment tantivy index for old version such as 2.4 by using tantivy V7. So, clean the relevant code. --------- Signed-off-by: SpadeA <tangchenjie1210@gmail.com>	2025-04-09 19:54:27 +08:00
sthuang	50e02e3598	enhance: update packed reader api (#41055 ) related: https://github.com/milvus-io/milvus/issues/39173 Signed-off-by: shaoting-huang <shaoting.huang@zilliz.com>	2025-04-09 10:18:26 +08:00
Spade A	c6a0c2ab64	enhance: process tantivy document add by batch (#40124 ) issue: https://github.com/milvus-io/milvus/issues/40006 This PR make tantivy document add by batch. Add document by batch can greately reduce the latency of scheduling the document add operation (call tantivy `add_document` only schdules the add operation and it returns immediately after scheduled) , because each call involes a tokio block_on which is relatively heavy. Reduce scheduling part not necessarily reduces the overall latency if the index writer threads does not process indexing quickly enough. But if scheduling itself is pretty slow, even the index writer threads process indexing very fast (by increasing thread number), the overall performance can still be limited. The following codes bench the PR (Note, the duration only counts for scheduling without commit) ``` fn test_performance() { let field_name = "text"; let dir = TempDir::new().unwrap(); let mut index_wrapper = IndexWriterWrapper::create_text_writer( field_name, dir.path().to_str().unwrap(), "default", "", 1, 50_000_000, false, TantivyIndexVersion::V7, ) .unwrap(); let mut batch = vec![]; for i in 0..1_000_000 { batch.push(format!("hello{:04}", i)); } let batch_ref = batch.iter().map(\|s\| s.as_str()).collect::<Vec<_>>(); let now = std::time::Instant::now(); index_wrapper .add_data_by_batch(&batch_ref, Some(0)) .unwrap(); let elapsed = now.elapsed(); println!("add_data_by_batch elapsed: {:?}", elapsed); } ``` Latency roughly reduces from 1.4s to 558ms. --------- Signed-off-by: SpadeA <tangchenjie1210@gmail.com>	2025-04-08 19:50:24 +08:00
aoiasd	6f17720e4e	enhance: support use jieba tokenizer with costum dictionary (#39854 ) relate: https://github.com/milvus-io/milvus/issues/40168 Signed-off-by: aoiasd <zhicheng.yue@zilliz.com>	2025-04-08 14:52:27 +08:00
Spade A	e4da2765ba	enhance: process batch of strings within one tantivy_index_add_string call (#40007 ) issue: #40006 --------- Signed-off-by: SpadeA <tangchenjie1210@gmail.com>	2025-04-08 01:20:25 +08:00
Spade A	f552ec67dd	fix: support building tantivy index with low version(5) (#40822 ) fix: https://github.com/milvus-io/milvus/issues/40823 To solve the problem in the issue, we have to support building tantivy index with low version for those query nodes with low tantivy version. This PR does two things: 1. refactor codes for IndexWriterWrapper to make it concise 2. enable IndexWriterWrapper to build tantivy index by different tantivy crate --------- Signed-off-by: SpadeA <tangchenjie1210@gmail.com>	2025-04-02 18:46:20 +08:00
aoiasd	384d39ef5a	enhance: not build lindera features by default and support make milvus with tantivy features (#40813 ) relate: https://github.com/milvus-io/milvus/issues/39659 Signed-off-by: aoiasd <zhicheng.yue@zilliz.com>	2025-03-27 14:08:22 +08:00
sthuang	d7df78a6c9	feat: Storage v2 compaction (#40667 ) - Feat: Support Mix compaction. Covering tests include compatibility and rollback ability. - Read v1 segments and compact with v2 format. - Read both v1 and v2 segments and compact with v2 format. - Read v2 segments and compact with v2 format. - Compact with duplicate primary key test. - Compact with bm25 segments. - Compact with merge sort segments. - Compact with no expiration segments. - Compact with lack binlog segments. - Compact with nullable field segments. - Feat: Support Clustering compaction. Covering tests include compatibility and rollback ability. - Read v1 segments and compact with v2 format. - Read both v1 and v2 segments and compact with v2 format. - Read v2 segments and compact with v2 format. - Compact bm25 segments with v2 format. - Compact with memory limit. - Enhance: Use serdeMap serialize in BuildRecord function to support all Milvus data types. related: #39173 Signed-off-by: shaoting-huang <shaoting.huang@zilliz.com>	2025-03-21 10:16:12 +08:00
aoiasd	92bdf7a0c1	enhance: support run anayser return detaild token (#40458 ) relate: https://github.com/milvus-io/milvus/issues/39705 Signed-off-by: aoiasd <zhicheng.yue@zilliz.com>	2025-03-19 15:48:15 +08:00
Zhen Ye	8db708f67d	enhance: enable memory prof based on jemalloc (#40731 ) issue: #40730 also see: https://github.com/milvus-io/cgosymbolizer/pull/2 After these PR, at linux: - the milvus will always enable jemalloc by default. - jemalloc will always compiled with --enable-prof options. - all image will always enable the jemalloc prof by default. - a pprof http service for jemalloc at `/debug/jemalloc/` will be registered into restful. - `jeprof` can remote profile the memory of milvus. Signed-off-by: chyezh <chyezh@outlook.com>	2025-03-19 14:46:18 +08:00
Spade A	001fc992df	enhance: get doc ids by batch (#40608 ) issue: #40607 tantivy change: https://github.com/zilliztech/tantivy/pull/3 Benchmarks: Test Envrioment: CPU 9900K The data is insert by: ``` for i in 0..N { for j in 0..UNIQUE { let key = format!("hello{}", j); index_writer.add_string(&key, i * UNIQUE + j).unwrap(); } } ``` So the unique influences the locality of the matched docs. The latency is the avg latency over 1000 repeate quries. The result shows 22.5%-34.8% latency reduction. ![image](https://github.com/user-attachments/assets/dd8af75a-ddc3-445d-92df-50d354dd5645) --------- Signed-off-by: SpadeA <tangchenjie1210@gmail.com>	2025-03-14 15:48:09 +08:00
Bingyi Sun	0698d04f7d	enhance: Upgrade simdjson version (#40538 ) issue: https://github.com/milvus-io/milvus/issues/40519 simdjson returns better error code in newer version. Signed-off-by: sunby <sunbingyi1992@gmail.com>	2025-03-11 15:04:05 +08:00
sre-ci-robot	a6d4121034	[automated] Update Knowhere Commit (#40486 ) Update Knowhere Commit Signed-off-by: sre-ci-robot sre-ci-robot@users.noreply.github.com Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>	2025-03-10 12:28:04 +08:00
sre-ci-robot	6a57a1973f	[automated] Update Knowhere Commit (#40283 ) Update Knowhere Commit Signed-off-by: sre-ci-robot sre-ci-robot@users.noreply.github.com Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>	2025-03-03 01:11:58 +08:00
Micka	5cc104b412	fix: Change CMake variable for switch to knowhere-cuvs (#40105 ) issue: https://github.com/milvus-io/milvus/issues/39883 Signed-off-by: Mickael Ide <mide@nvidia.com>	2025-02-27 22:05:58 +08:00
sre-ci-robot	b2769fb357	[automated] Update Knowhere Commit (#40223 ) Update Knowhere Commit Signed-off-by: sre-ci-robot sre-ci-robot@users.noreply.github.com Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>	2025-02-27 01:35:59 +08:00
aoiasd	38f1608910	enhance: pack analyzer code and support lindera tokenizer (#39660 ) relate: https://github.com/milvus-io/milvus/issues/39659 Signed-off-by: aoiasd <zhicheng.yue@zilliz.com>	2025-02-24 12:13:55 +08:00
sre-ci-robot	dd1347d041	[automated] Update Knowhere Commit (#40103 ) Update Knowhere Commit Signed-off-by: sre-ci-robot sre-ci-robot@users.noreply.github.com Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>	2025-02-22 01:01:53 +08:00
sthuang	3eb3af5f08	feat: explicitly specify column groups for storage v2 api (#39790 ) * use the new packed reader and writer api to be compatible with current etcd meta * For the new packed writer API: column groups and paths are explicitly defined by users and won't split column groups by memory in storage v2. Packed writer follows the user-defined column groups to split arrow record and write into the corresponding file path. * For the new packed reader API: read paths are explicitly defined by users. related: #39173 Signed-off-by: shaoting-huang <shaoting.huang@zilliz.com>	2025-02-21 22:03:54 +08:00
Spade A	d34d70582d	fix: fix misleading name _add_multi_ (#39997 ) fix: #39995 Signed-off-by: SpadeA <tangchenjie1210@gmail.com>	2025-02-21 16:45:55 +08:00
sre-ci-robot	f0d3d98c3f	[automated] Update Knowhere Commit (#40063 ) Update Knowhere Commit Signed-off-by: sre-ci-robot sre-ci-robot@users.noreply.github.com Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>	2025-02-21 01:19:54 +08:00
Spade A	52c7d7dd80	fix: offset combined with term should be based on Token positions in phrase match (#39931 ) fix: #39711 Unlike English sentence where each words are parsed exactly once and one after one with position length 1, one Chinese word may be parsed to multiple words with position length larger than 1. For example, "badminton and skiing" will be parsed to Token{ start: 0, length: 1, text: "badminton" }, Token{ start: 1, length: 1, text: "and" }, and Token{ start: 2, length: 1, text: "tennis" }. While for exmaple for Chinsese: "羽毛球和滑雪" may be parsed to Token{ start: 0, length: 2, text: "羽毛" }, Token{ start: 0, length: 3, text: "羽毛球" }, Token{ start: 3, length: 1, text: "和" }, and Token{ start: 4, length: 2, text: "滑雪" }. This PR fix that the code not recognizes this situation. --------- Signed-off-by: SpadeA <tangchenjie1210@gmail.com>	2025-02-18 20:38:51 +08:00
sre-ci-robot	61cc22354e	[automated] Update Knowhere Commit (#39898 ) Update Knowhere Commit Signed-off-by: sre-ci-robot sre-ci-robot@users.noreply.github.com Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>	2025-02-16 01:32:13 +08:00
Bingyi Sun	b59555057d	feat: support json index (#36750 ) https://github.com/milvus-io/milvus/issues/35528 This PR adds json index support for json and dynamic fields. Now you can only do unary query like 'a["b"] > 1' using this index. We will support more filter type later. basic usage: ``` collection.create_index("json_field", {"index_type": "INVERTED", "params": {"json_cast_type": DataType.STRING, "json_path": 'json_field["a"]["b"]'}}) ``` There are some limits to use this index: 1. If a record does not have the json path you specify, it will be ignored and there will not be an error. 2. If a value of the json path fails to be cast to the type you specify, it will be ignored and there will not be an error. 3. A specific json path can have only one json index. 4. If you try to create more than one json indexes for one json field, sdk(pymilvus<=2.4.7) may return immediately because of internal implementation. This will be fixed in a later version. --------- Signed-off-by: sunby <sunbingyi1992@gmail.com>	2025-02-15 14:06:15 +08:00
Spade A	f7d9587720	enhance: add tantivy collector for i64 (#39850 ) issue: #39852 Signed-off-by: SpadeA <tangchenjie1210@gmail.com>	2025-02-14 15:50:15 +08:00
sre-ci-robot	ba03a435fb	[automated] Update Knowhere Commit (#39878 ) Update Knowhere Commit Signed-off-by: sre-ci-robot sre-ci-robot@users.noreply.github.com Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>	2025-02-14 15:18:21 +08:00
Bingyi Sun	c13fc8cd19	enhance: update tantivy version (#39253 ) https://github.com/milvus-io/milvus/issues/39254 --------- Signed-off-by: sunby <sunbingyi1992@gmail.com>	2025-02-08 14:08:43 +08:00
sre-ci-robot	ba312427f2	[automated] Update Knowhere Commit (#39696 ) Update Knowhere Commit Signed-off-by: sre-ci-robot sre-ci-robot@users.noreply.github.com Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>	2025-02-08 01:36:43 +08:00
Gao	c1794cc490	enhance: update knowhere version and IsAdditionalScalarSupported interface (#39573 ) Signed-off-by: chasingegg <chao.gao@zilliz.com>	2025-02-05 19:51:10 +08:00
sthuang	c4ae9f4ece	feat: introduce third-party milvus-storage (#39418 ) related: https://github.com/milvus-io/milvus/issues/39173 Signed-off-by: shaoting-huang <shaoting.huang@zilliz.com>	2025-01-24 17:21:13 +08:00

1 2 3 4 5 ...

334 Commits