milvus

mirror of https://gitee.com/milvus-io/milvus.git synced 2026-01-07 19:31:51 +08:00

Author	SHA1	Message	Date
aoiasd	38f1608910	enhance: pack analyzer code and support lindera tokenizer (#39660 ) relate: https://github.com/milvus-io/milvus/issues/39659 Signed-off-by: aoiasd <zhicheng.yue@zilliz.com>	2025-02-24 12:13:55 +08:00
Spade A	d34d70582d	fix: fix misleading name _add_multi_ (#39997 ) fix: #39995 Signed-off-by: SpadeA <tangchenjie1210@gmail.com>	2025-02-21 16:45:55 +08:00
Spade A	52c7d7dd80	fix: offset combined with term should be based on Token positions in phrase match (#39931 ) fix: #39711 Unlike an English sentence, where each word is parsed exactly once, one after another, with position length 1, one Chinese word may be parsed into multiple tokens with position length larger than 1. For example, "badminton and skiing" will be parsed to Token{ start: 0, length: 1, text: "badminton" }, Token{ start: 1, length: 1, text: "and" }, and Token{ start: 2, length: 1, text: "skiing" }. In Chinese, by contrast, "羽毛球和滑雪" may be parsed to Token{ start: 0, length: 2, text: "羽毛" }, Token{ start: 0, length: 3, text: "羽毛球" }, Token{ start: 3, length: 1, text: "和" }, and Token{ start: 4, length: 2, text: "滑雪" }. This PR fixes the code, which did not handle this situation. --------- Signed-off-by: SpadeA <tangchenjie1210@gmail.com>	2025-02-18 20:38:51 +08:00
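The difference described in 52c7d7dd80 can be illustrated with a small sketch. It is not the Milvus or tantivy code: the `Token` class and both helper functions are hypothetical, and only the token values come from the commit message. It pairs each term with an offset in two ways, by list index and by the token's own start position, and shows that the two agree for the English example but diverge for the Chinese one.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Token:
    # Hypothetical stand-in for the Token{ start, length, text } shown in the commit message.
    start: int
    length: int
    text: str


def offsets_by_index(tokens: List[Token]) -> List[Tuple[int, str]]:
    # Naive pairing: assumes the i-th token always sits at position i.
    return [(i, t.text) for i, t in enumerate(tokens)]


def offsets_by_position(tokens: List[Token]) -> List[Tuple[int, str]]:
    # Pairing based on the position reported by the tokenizer itself.
    return [(t.start, t.text) for t in tokens]


english = [
    Token(0, 1, "badminton"),
    Token(1, 1, "and"),
    Token(2, 1, "skiing"),
]
chinese = [
    Token(0, 2, "羽毛"),
    Token(0, 3, "羽毛球"),
    Token(3, 1, "和"),
    Token(4, 2, "滑雪"),
]

# English: both pairings are identical.
print(offsets_by_index(english) == offsets_by_position(english))
# Chinese: overlapping tokens share a start position, so the pairings differ.
print(offsets_by_index(chinese))
print(offsets_by_position(chinese))
```

With index-based offsets, "羽毛" and "羽毛球" would be treated as consecutive terms even though both begin at position 0, which is the situation the commit message describes.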
Bingyi Sun	b59555057d	feat: support json index (#36750 ) https://github.com/milvus-io/milvus/issues/35528 This PR adds json index support for json and dynamic fields. For now, this index only serves unary queries such as 'a["b"] > 1'. More filter types will be supported later. Basic usage: ``` collection.create_index("json_field", {"index_type": "INVERTED", "params": {"json_cast_type": DataType.STRING, "json_path": 'json_field["a"]["b"]'}}) ``` This index has some limits: 1. If a record does not have the json path you specify, it will be ignored and there will not be an error. 2. If a value of the json path fails to be cast to the type you specify, it will be ignored and there will not be an error. 3. A specific json path can have only one json index. 4. If you try to create more than one json index for one json field, the sdk (pymilvus<=2.4.7) may return immediately because of its internal implementation. This will be fixed in a later version. --------- Signed-off-by: sunby <sunbingyi1992@gmail.com>	2025-02-15 14:06:15 +08:00
Spade A	f7d9587720	enhance: add tantivy collector for i64 (#39850 ) issue: #39852 Signed-off-by: SpadeA <tangchenjie1210@gmail.com>	2025-02-14 15:50:15 +08:00
Bingyi Sun	c13fc8cd19	enhance: update tantivy version (#39253 ) https://github.com/milvus-io/milvus/issues/39254 --------- Signed-off-by: sunby <sunbingyi1992@gmail.com>	2025-02-08 14:08:43 +08:00
Spade A	8c4ba70a4c	fix: enable to build index with single segment (#39233 ) fix https://github.com/milvus-io/milvus/issues/39232 --------- Signed-off-by: SpadeA-Tang <tangchenjie1210@gmail.com>	2025-01-16 11:01:06 +08:00
Spade A	032292a432	feat: support phrase match query (#38869 ) The relevant issue: https://github.com/milvus-io/milvus/issues/38930 --------- Signed-off-by: SpadeA-Tang <tangchenjie1210@gmail.com>	2025-01-12 20:24:58 +08:00
Bingyi Sun	f0cddfd160	fix: Fix panic caused by removing directory (#38622 ) https://github.com/milvus-io/milvus/issues/38604 --------- Signed-off-by: sunby <sunbingyi1992@gmail.com>	2025-01-06 10:54:54 +08:00
Bingyi Sun	3822819942	enhance: Remove an undefined behavior in index writer (#38657 ) Signed-off-by: sunby <sunbingyi1992@gmail.com>	2024-12-31 10:42:52 +08:00
Bingyi Sun	3e2a2f278b	enhance: Handle rust error in c++ (#38113 ) https://github.com/milvus-io/milvus/issues/37930 --------- Signed-off-by: sunby <sunbingyi1992@gmail.com>	2024-12-16 19:40:45 +08:00
aoiasd	87aa9a0f2d	fix: empty analyzer params not use standard tokenizer (#38148 ) relate: https://github.com/milvus-io/milvus/issues/35853 Signed-off-by: aoiasd <zhicheng.yue@zilliz.com>	2024-12-04 14:58:39 +08:00
Bingyi Sun	e6af806a0d	enhance: optimize self defined rust error (#37975 ) Prepare for issue: https://github.com/milvus-io/milvus/issues/37930 Signed-off-by: sunby <sunbingyi1992@gmail.com>	2024-11-28 20:30:36 +08:00
Zhen Ye	fbb68ca370	enhance: make all index operation async scheduled by tokio (#37946 ) issue: #37851 related pr: https://github.com/milvus-io/tantivy/pull/3 Signed-off-by: chyezh <chyezh@outlook.com>	2024-11-25 10:12:34 +08:00
Bingyi Sun	700a448a54	fix: Escape prefix before search in inverted index (#37925 ) issue: https://github.com/milvus-io/milvus/issues/37912 Signed-off-by: sunby <sunbingyi1992@gmail.com>	2024-11-22 14:10:33 +08:00
Bingyi Sun	06d73cf2e2	enhance: Remove raw tokenizer register. (#37886 ) tantivy already registers the raw tokenizer by default Signed-off-by: sunby <sunbingyi1992@gmail.com>	2024-11-22 12:02:32 +08:00
Zhen Ye	1dc1a97e65	fix: use different thread pool for scheduler and merger (#37911 ) issue: #37895 related pr: https://github.com/milvus-io/tantivy/pull/2 Signed-off-by: chyezh <chyezh@outlook.com>	2024-11-21 21:34:33 +08:00
Zhen Ye	f3a36f8a29	fix: use global pool but not dedicated pool for every index (#37852 ) issue: #37851 - make a global thread pool at tantivy temporarily. - use 1 thread instead of 4 for the inverted text index. Signed-off-by: chyezh <chyezh@outlook.com>	2024-11-20 20:44:32 +08:00
aoiasd	16e206167c	enhance: analyzer length filter max should be close interval instead open interval (#37770 ) Signed-off-by: aoiasd <zhicheng.yue@zilliz.com>	2024-11-18 19:30:31 +08:00
aoiasd	3b5a0df159	enhance: Optimize chinese analyzer and support CnAlphaNumFilter (#37727 ) relate: https://github.com/milvus-io/milvus/issues/35853 Signed-off-by: aoiasd <zhicheng.yue@zilliz.com>	2024-11-16 10:28:30 +08:00
aoiasd	1c5b5e1e3d	feat: Add chinese and english analyzer with refactor jieba tokenizer (#37494 ) relate: https://github.com/milvus-io/milvus/issues/35853 Signed-off-by: aoiasd <zhicheng.yue@zilliz.com>	2024-11-14 10:34:31 +08:00
aoiasd	12951f0abb	enhance: rename tokenizer to analyzer and check analyzer params (#37478 ) relate: https://github.com/milvus-io/milvus/issues/35853 --------- Signed-off-by: aoiasd <zhicheng.yue@zilliz.com>	2024-11-10 16:12:26 +08:00
aoiasd	d67853fa89	feat: Tokenizer support build with params and clone for concurrency (#37048 ) relate: https://github.com/milvus-io/milvus/issues/35853 https://github.com/milvus-io/milvus/issues/36751 --------- Signed-off-by: aoiasd <zhicheng.yue@zilliz.com>	2024-11-06 17:48:24 +08:00
Buqian Zheng	9997c5de34	fix: remove excessive logging (#36859 ) issue: https://github.com/milvus-io/milvus/issues/35853 Signed-off-by: Buqian Zheng <zhengbuqian@gmail.com>	2024-10-16 10:47:22 +08:00
Buqian Zheng	f7b811450d	feat: add enable_tokenizer params to VarChar field (#36480 ) issue: #35922 add an enable_tokenizer param to varchar field: it must be set to true before a varchar field can set enable_match or be used as input to a BM25 function --------- Signed-off-by: Buqian Zheng <zhengbuqian@gmail.com>	2024-10-10 20:33:21 +08:00
Jiquan Long	89bf226f0b	feat: support keyword text match (#35923 ) fix: #35922 --------- Signed-off-by: longjiquan <jiquan.long@zilliz.com>	2024-09-10 15:11:08 +08:00
Jiquan Long	5ea2454fdf	feat: tantivy tokenizer binding (#35801 ) fix: #35800 --------- Signed-off-by: longjiquan <jiquan.long@zilliz.com>	2024-09-01 17:13:03 +08:00
Jiquan Long	a52ba3d09d	enhance: allow many segments for inverted index (#35616 ) fix: https://github.com/milvus-io/milvus/issues/35615 --------- Signed-off-by: longjiquan <jiquan.long@zilliz.com>	2024-08-28 11:30:59 +08:00
Zhen Ye	a773836b89	enhance: optimize milvus core building (#35610 ) issue: #35549,#35611,#35633 - remove milvus_segcore milvus_indexbuilder..., add libmilvus_core - core building only link once - move opendal compilation into cmake - fix odr --------- Signed-off-by: chyezh <chyezh@outlook.com>	2024-08-23 12:35:02 +08:00
Jiquan Long	7b9462c0d3	enhance: fix copying hits of inverted index twice (#33968 ) issue: https://github.com/milvus-io/milvus/issues/29793 The custom `VecCollector` has already transformed the results into a vector of offsets, so there is no need to copy them a second time. Signed-off-by: longjiquan <jiquan.long@zilliz.com>	2024-06-19 12:40:01 +08:00
Jiquan Long	ecf2bcee42	enhance: speed up array-equal operator via inverted index (#33633 ) fix: #33632 --------- Signed-off-by: longjiquan <jiquan.long@zilliz.com>	2024-06-11 14:13:54 +08:00
Jiquan Long	0c5d8660aa	feat: support inverted index for array (#33452 ) issue: https://github.com/milvus-io/milvus/issues/27704 --------- Signed-off-by: longjiquan <jiquan.long@zilliz.com>	2024-05-31 09:47:47 +08:00
Jiquan Long	035a508722	fix: make sure inverted index has only one segment (#32858 ) issue: #32717 --------- Signed-off-by: longjiquan <jiquan.long@zilliz.com>	2024-05-08 21:25:30 +08:00
Jiquan Long	03e0db109e	fix: udpate Cargo.lock (#31859 ) issue: #31681 Signed-off-by: longjiquan <jiquan.long@zilliz.com>	2024-04-03 14:18:23 +08:00
Jiquan Long	9750e78f1d	enhance: lock tantivy dependencies (#31688 ) issue: https://github.com/milvus-io/milvus/issues/31681 Signed-off-by: longjiquan <jiquan.long@zilliz.com>	2024-03-29 10:15:17 +08:00
Jiquan Long	e33dba8afe	fix: [skip-e2e] use zstd-sys 2.0.9 (#31682 ) fix: #31681 /kind improvement Signed-off-by: longjiquan <jiquan.long@zilliz.com>	2024-03-28 15:14:10 +08:00
Jiquan Long	e549148a19	enhance: full-support for wildcard pattern matching (#30288 ) issue: #29988 This PR adds full support for wildcard pattern matching from end to end. Before this PR, users could only use prefix match in their expressions, for example, "like 'prefix%'". With this PR, more flexible patterns can be used. To do so, this PR makes these changes: - 1. support regex query on both index and raw data; - 2. translate the pattern matching to a regex query, so that it can be handled by the regex query logic; - 3. loosen the limit of the expression parsing, which allows general pattern matching syntax; With regex query supported in the segcore backend, a MySQL-like `REGEXP` syntax can also be added easily later. --------- Signed-off-by: longjiquan <jiquan.long@zilliz.com>	2024-02-01 12:37:04 +08:00
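Step 2 of e549148a19, translating a wildcard pattern into a regex, might look roughly like the sketch below. This is an illustration under assumptions, not the segcore implementation: the function names `like_to_regex` and `like_match` are hypothetical, and the pattern syntax is assumed to be SQL-style, with `%` matching any run of characters, `_` matching exactly one character, and a backslash escaping the next character.

```python
import re


def like_to_regex(pattern: str) -> str:
    """Translate a SQL-style LIKE pattern into a regex string (hypothetical helper)."""
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            # Escaped character: match it literally, even if it is % or _.
            out.append(re.escape(pattern[i + 1]))
            i += 2
            continue
        if ch == "%":
            out.append(".*")   # any run of characters, including none
        elif ch == "_":
            out.append(".")    # exactly one character
        else:
            out.append(re.escape(ch))  # literal; regex metacharacters are escaped
        i += 1
    return "".join(out)


def like_match(pattern: str, value: str) -> bool:
    # LIKE must match the whole value, hence fullmatch; DOTALL lets wildcards span newlines.
    return re.fullmatch(like_to_regex(pattern), value, flags=re.DOTALL) is not None


print(like_match("prefix%", "prefix123"))  # prefix match, supported before the PR
print(like_match("%mid%", "a-mid-b"))      # infix match
print(like_match("a_c", "abc"))            # single-character wildcard
```

Once every pattern is reduced to a regex, a single regex query path can serve prefix, infix, and suffix matches alike, which is the motivation the commit message gives for the translation.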
Jiquan Long	67ab5be15a	enhance: optimize search performance of inverted index (#29794 ) issue: #29793 Use `DocSetCollector` instead of `TopDocsCollector`, which will avoid scoring and sorting. --------- Signed-off-by: longjiquan <jiquan.long@zilliz.com>	2024-01-11 11:12:49 +08:00
Jiquan Long	e9f3df3626	fix: inverted index file not found (#29695 ) issue: https://github.com/milvus-io/milvus/issues/29654 --------- Signed-off-by: longjiquan <jiquan.long@zilliz.com>	2024-01-07 20:26:49 +08:00
Jiquan Long	3f46c6d459	feat: support inverted index (#28783 ) issue: https://github.com/milvus-io/milvus/issues/27704 Add inverted index for some data types in Milvus. This index type can save a lot of memory compared to loading all data into RAM, and it speeds up term queries and range queries. Supported: `INT8`, `INT16`, `INT32`, `INT64`, `FLOAT`, `DOUBLE`, `BOOL` and `VARCHAR`. Not supported: `ARRAY` and `JSON`. Note: - The inverted index for `VARCHAR` is not designed to serve full-text search now. We will treat every row as a whole keyword instead of tokenizing it into multiple terms. - The inverted index does not support retrieval well, so if you create an inverted index for a field, operations that depend on the raw data fall back to chunk storage, which brings some performance loss. Examples are comparisons between two columns and retrieval of output fields. The inverted index is easy to use. Take the collection below as an example: ```python fields = [ FieldSchema(name="pk", dtype=DataType.VARCHAR, is_primary=True, auto_id=False, max_length=100), FieldSchema(name="int8", dtype=DataType.INT8), FieldSchema(name="int16", dtype=DataType.INT16), FieldSchema(name="int32", dtype=DataType.INT32), FieldSchema(name="int64", dtype=DataType.INT64), FieldSchema(name="float", dtype=DataType.FLOAT), FieldSchema(name="double", dtype=DataType.DOUBLE), FieldSchema(name="bool", dtype=DataType.BOOL), FieldSchema(name="varchar", dtype=DataType.VARCHAR, max_length=1000), FieldSchema(name="random", dtype=DataType.DOUBLE), FieldSchema(name="embeddings", dtype=DataType.FLOAT_VECTOR, dim=dim), ] schema = CollectionSchema(fields) collection = Collection("demo", schema) ``` Then we can create an inverted index for each field via: ```python index_type = "INVERTED" collection.create_index("int8", {"index_type": index_type}) collection.create_index("int16", {"index_type": index_type}) collection.create_index("int32", {"index_type": index_type}) collection.create_index("int64", {"index_type": index_type}) collection.create_index("float", {"index_type": index_type}) collection.create_index("double", {"index_type": index_type}) collection.create_index("bool", {"index_type": index_type}) collection.create_index("varchar", {"index_type": index_type}) ``` Then, term queries and range queries on the field are sped up automatically by the inverted index: ```python result = collection.query(expr='int64 in [1, 2, 3]', output_fields=["pk"]) result = collection.query(expr='int64 < 5', output_fields=["pk"]) result = collection.query(expr='int64 > 2997', output_fields=["pk"]) result = collection.query(expr='1 < int64 < 5', output_fields=["pk"]) ``` --------- Signed-off-by: longjiquan <jiquan.long@zilliz.com>	2023-12-31 19:50:47 +08:00
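To show why term and range queries benefit from an inverted index, here is a toy in-memory sketch. It is a conceptual illustration only and does not reflect how Milvus or tantivy store or search the index. The class `ToyInvertedIndex` and its methods are hypothetical. It maps each distinct value to the row offsets that contain it, so that `int64 in [1, 2, 3]` becomes a few dictionary lookups and `1 < int64 < 5` becomes a binary search over the sorted distinct values.

```python
import bisect
from collections import defaultdict


class ToyInvertedIndex:
    """Toy value -> row-offset index (hypothetical, for illustration only)."""

    def __init__(self, values):
        postings = defaultdict(list)
        for row_id, v in enumerate(values):
            postings[v].append(row_id)
        self._terms = sorted(postings)    # sorted distinct values
        self._postings = dict(postings)   # value -> list of row offsets

    def term_query(self, terms):
        # Equivalent of `field in [t1, t2, ...]`: union of the posting lists.
        rows = set()
        for t in terms:
            rows.update(self._postings.get(t, ()))
        return sorted(rows)

    def range_query(self, lower=None, upper=None):
        # Equivalent of `lower < field < upper`; both bounds are exclusive and optional.
        lo = 0 if lower is None else bisect.bisect_right(self._terms, lower)
        hi = len(self._terms) if upper is None else bisect.bisect_left(self._terms, upper)
        rows = []
        for t in self._terms[lo:hi]:
            rows.extend(self._postings[t])
        return sorted(rows)


index = ToyInvertedIndex([3, 1, 4, 1, 5, 9, 2, 6])
print(index.term_query([1, 2, 3]))   # rows whose value is 1, 2 or 3
print(index.range_query(1, 5))       # rows with 1 < value < 5
print(index.range_query(upper=2))    # rows with value < 2
```

The sketch returns only row offsets, not the stored values, which mirrors the note in the commit message that operations needing raw data fall back to chunk storage.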

40 Commits