40 Commits

Author SHA1 Message Date
aoiasd
38f1608910
enhance: pack analyzer code and support lindera tokenizer (#39660)
relate: https://github.com/milvus-io/milvus/issues/39659

Signed-off-by: aoiasd <zhicheng.yue@zilliz.com>
2025-02-24 12:13:55 +08:00
Spade A
d34d70582d
fix: fix misleading name *_add_multi_* (#39997)
fix: #39995

Signed-off-by: SpadeA <tangchenjie1210@gmail.com>
2025-02-21 16:45:55 +08:00
Spade A
52c7d7dd80
fix: offset combined with term should be based on Token positions in phrase match (#39931)
fix: #39711

Unlike an English sentence, where each word is parsed exactly once, one
after another, with a position length of 1, a Chinese word may be parsed
into multiple tokens with a position length larger than 1.

For example, "badminton and skiing" will be parsed to Token{ start: 0,
length: 1, text: "badminton" }, Token{ start: 1, length: 1, text: "and"
}, and Token{ start: 2, length: 1, text: "skiing" }.

By contrast, the Chinese sentence "羽毛球和滑雪" may be parsed to Token{ start:
0, length: 2, text: "羽毛" }, Token{ start: 0, length: 3, text: "羽毛球" },
Token{ start: 3, length: 1, text: "和" }, and Token{ start: 4, length: 2,
text: "滑雪" }.

This PR fixes the code, which did not handle this situation.
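
The position rule described above can be sketched in Python. This is a toy illustration under stated assumptions, not the Milvus or tantivy implementation: `Token`, `offsets_by_index`, `offsets_by_position`, and `phrase_matches` are hypothetical names, and the matcher only compares (position, term) pairs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    start: int   # position of the token in the token stream
    length: int  # position length (how many positions the token spans)
    text: str


def offsets_by_index(tokens):
    # Naive approach: use the ordinal of each token as its phrase offset.
    return [(i, t.text) for i, t in enumerate(tokens)]


def offsets_by_position(tokens):
    # Position-based approach: use the position reported by the tokenizer.
    return [(t.start, t.text) for t in tokens]


def phrase_matches(doc_tokens, query_terms):
    # Toy matcher: the phrase matches if some base position in the document
    # has every query term at base + (offset - first offset).
    indexed = {(t.start, t.text) for t in doc_tokens}
    first = min(off for off, _ in query_terms)
    return any(
        all((base + off - first, term) in indexed for off, term in query_terms)
        for base in {t.start for t in doc_tokens}
    )


english = [Token(0, 1, "badminton"), Token(1, 1, "and"), Token(2, 1, "skiing")]
chinese = [
    Token(0, 2, "羽毛"),
    Token(0, 3, "羽毛球"),
    Token(3, 1, "和"),
    Token(4, 2, "滑雪"),
]
```

With the English tokens both offset schemes agree. With the Chinese tokens only the position-based offsets line up with the indexed positions, so only they produce a match.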

---------

Signed-off-by: SpadeA <tangchenjie1210@gmail.com>
2025-02-18 20:38:51 +08:00
Bingyi Sun
b59555057d
feat: support json index (#36750)
https://github.com/milvus-io/milvus/issues/35528

This PR adds JSON index support for JSON and dynamic fields. For now,
the index only serves unary filters such as 'a["b"] > 1'. More filter
types will be supported later.

basic usage:
```
collection.create_index("json_field", {
    "index_type": "INVERTED",
    "params": {"json_cast_type": DataType.STRING,
               "json_path": 'json_field["a"]["b"]'},
})
```

This index has some limitations:
1. If a record does not contain the JSON path you specify, it is
ignored and no error is raised.
2. If the value at the JSON path cannot be cast to the type you specify,
it is ignored and no error is raised.
3. A given JSON path can have only one JSON index.
4. If you try to create more than one JSON index on one JSON field, the
SDK (pymilvus<=2.4.7) may return immediately because of its internal
implementation. This will be fixed in a later version.

---------

Signed-off-by: sunby <sunbingyi1992@gmail.com>
2025-02-15 14:06:15 +08:00
Spade A
f7d9587720
enhance: add tantivy collector for i64 (#39850)
issue: #39852

Signed-off-by: SpadeA <tangchenjie1210@gmail.com>
2025-02-14 15:50:15 +08:00
Bingyi Sun
c13fc8cd19
enhance: update tantivy version (#39253)
https://github.com/milvus-io/milvus/issues/39254

---------

Signed-off-by: sunby <sunbingyi1992@gmail.com>
2025-02-08 14:08:43 +08:00
Spade A
8c4ba70a4c
fix: enable to build index with single segment (#39233)
fix https://github.com/milvus-io/milvus/issues/39232

---------

Signed-off-by: SpadeA-Tang <tangchenjie1210@gmail.com>
2025-01-16 11:01:06 +08:00
Spade A
032292a432
feat: support phrase match query (#38869)
The relevant issue: https://github.com/milvus-io/milvus/issues/38930

---------

Signed-off-by: SpadeA-Tang <tangchenjie1210@gmail.com>
2025-01-12 20:24:58 +08:00
Bingyi Sun
f0cddfd160
fix: Fix panic caused by removing directory (#38622)
https://github.com/milvus-io/milvus/issues/38604

---------

Signed-off-by: sunby <sunbingyi1992@gmail.com>
2025-01-06 10:54:54 +08:00
Bingyi Sun
3822819942
enhance: Remove an undefined behavior in index writer (#38657)
Signed-off-by: sunby <sunbingyi1992@gmail.com>
2024-12-31 10:42:52 +08:00
Bingyi Sun
3e2a2f278b
enhance: Handle rust error in c++ (#38113)
https://github.com/milvus-io/milvus/issues/37930

---------

Signed-off-by: sunby <sunbingyi1992@gmail.com>
2024-12-16 19:40:45 +08:00
aoiasd
87aa9a0f2d
fix: empty analyzer params not use standard tokenizer (#38148)
relate: https://github.com/milvus-io/milvus/issues/35853

Signed-off-by: aoiasd <zhicheng.yue@zilliz.com>
2024-12-04 14:58:39 +08:00
Bingyi Sun
e6af806a0d
enhance: optimize self defined rust error (#37975)
Prepare for issue: https://github.com/milvus-io/milvus/issues/37930

Signed-off-by: sunby <sunbingyi1992@gmail.com>
2024-11-28 20:30:36 +08:00
Zhen Ye
fbb68ca370
enhance: make all index operation async scheduled by tokio (#37946)
issue: #37851
related pr: https://github.com/milvus-io/tantivy/pull/3

Signed-off-by: chyezh <chyezh@outlook.com>
2024-11-25 10:12:34 +08:00
Bingyi Sun
700a448a54
fix: Escape prefix before search in inverted index (#37925)
issue: https://github.com/milvus-io/milvus/issues/37912

Signed-off-by: sunby <sunbingyi1992@gmail.com>
2024-11-22 14:10:33 +08:00
Bingyi Sun
06d73cf2e2
enhance: Remove raw tokenizer register. (#37886)
tantivy already registers the raw tokenizer by default

Signed-off-by: sunby <sunbingyi1992@gmail.com>
2024-11-22 12:02:32 +08:00
Zhen Ye
1dc1a97e65
fix: use different thread pool for scheduler and merger (#37911)
issue: #37895
related pr: https://github.com/milvus-io/tantivy/pull/2

Signed-off-by: chyezh <chyezh@outlook.com>
2024-11-21 21:34:33 +08:00
Zhen Ye
f3a36f8a29
fix: use global pool but not dedicated pool for every index (#37852)
issue: #37851

- create a global thread pool in tantivy temporarily.
- use 1 thread instead of 4 for the inverted text index.

Signed-off-by: chyezh <chyezh@outlook.com>
2024-11-20 20:44:32 +08:00
aoiasd
16e206167c
enhance: analyzer length filter max should be close interval instead open interval (#37770)
Signed-off-by: aoiasd <zhicheng.yue@zilliz.com>
2024-11-18 19:30:31 +08:00
aoiasd
3b5a0df159
enhance: Optimize chinese analyzer and support CnAlphaNumFilter (#37727)
relate: https://github.com/milvus-io/milvus/issues/35853

Signed-off-by: aoiasd <zhicheng.yue@zilliz.com>
2024-11-16 10:28:30 +08:00
aoiasd
1c5b5e1e3d
feat: Add chinese and english analyzer with refactor jieba tokenizer (#37494)
relate: https://github.com/milvus-io/milvus/issues/35853

Signed-off-by: aoiasd <zhicheng.yue@zilliz.com>
2024-11-14 10:34:31 +08:00
aoiasd
12951f0abb
enhance: rename tokenizer to analyzer and check analyzer params (#37478)
relate: https://github.com/milvus-io/milvus/issues/35853

---------

Signed-off-by: aoiasd <zhicheng.yue@zilliz.com>
2024-11-10 16:12:26 +08:00
aoiasd
d67853fa89
feat: Tokenizer support build with params and clone for concurrency (#37048)
relate: https://github.com/milvus-io/milvus/issues/35853
https://github.com/milvus-io/milvus/issues/36751

---------

Signed-off-by: aoiasd <zhicheng.yue@zilliz.com>
2024-11-06 17:48:24 +08:00
Buqian Zheng
9997c5de34
fix: remove excessive logging (#36859)
issue: https://github.com/milvus-io/milvus/issues/35853

Signed-off-by: Buqian Zheng <zhengbuqian@gmail.com>
2024-10-16 10:47:22 +08:00
Buqian Zheng
f7b811450d
feat: add enable_tokenizer params to VarChar field (#36480)
issue: #35922

Add an enable_tokenizer param to the VarChar field: it must be set to true
before the field can set enable_match or be used as input of a BM25 function.

---------

Signed-off-by: Buqian Zheng <zhengbuqian@gmail.com>
2024-10-10 20:33:21 +08:00
Jiquan Long
89bf226f0b
feat: support keyword text match (#35923)
fix: #35922

---------

Signed-off-by: longjiquan <jiquan.long@zilliz.com>
2024-09-10 15:11:08 +08:00
Jiquan Long
5ea2454fdf
feat: tantivy tokenizer binding (#35801)
fix: #35800

---------

Signed-off-by: longjiquan <jiquan.long@zilliz.com>
2024-09-01 17:13:03 +08:00
Jiquan Long
a52ba3d09d
enhance: allow many segments for inverted index (#35616)
fix: https://github.com/milvus-io/milvus/issues/35615

---------

Signed-off-by: longjiquan <jiquan.long@zilliz.com>
2024-08-28 11:30:59 +08:00
Zhen Ye
a773836b89
enhance: optimize milvus core building (#35610)
issue: #35549,#35611,#35633

- remove milvus_segcore milvus_indexbuilder..., add libmilvus_core
- core building only link once
- move opendal compilation into cmake
- fix odr

---------

Signed-off-by: chyezh <chyezh@outlook.com>
2024-08-23 12:35:02 +08:00
Jiquan Long
7b9462c0d3
enhance: fix copying hits of inverted index twice (#33968)
issue: https://github.com/milvus-io/milvus/issues/29793
The custom `VecCollector` has already transformed the results into a
vector of offsets, so there is no need to copy them again.

Signed-off-by: longjiquan <jiquan.long@zilliz.com>
2024-06-19 12:40:01 +08:00
Jiquan Long
ecf2bcee42
enhance: speed up array-equal operator via inverted index (#33633)
fix: #33632

---------

Signed-off-by: longjiquan <jiquan.long@zilliz.com>
2024-06-11 14:13:54 +08:00
Jiquan Long
0c5d8660aa
feat: support inverted index for array (#33452)
issue: https://github.com/milvus-io/milvus/issues/27704

---------

Signed-off-by: longjiquan <jiquan.long@zilliz.com>
2024-05-31 09:47:47 +08:00
Jiquan Long
035a508722
fix: make sure inverted index has only one segment (#32858)
issue: #32717

---------

Signed-off-by: longjiquan <jiquan.long@zilliz.com>
2024-05-08 21:25:30 +08:00
Jiquan Long
03e0db109e
fix: udpate Cargo.lock (#31859)
issue: #31681

Signed-off-by: longjiquan <jiquan.long@zilliz.com>
2024-04-03 14:18:23 +08:00
Jiquan Long
9750e78f1d
enhance: lock tantivy dependencies (#31688)
issue: https://github.com/milvus-io/milvus/issues/31681

Signed-off-by: longjiquan <jiquan.long@zilliz.com>
2024-03-29 10:15:17 +08:00
Jiquan Long
e33dba8afe
fix: [skip-e2e] use zstd-sys 2.0.9 (#31682)
fix: #31681 
/kind improvement

Signed-off-by: longjiquan <jiquan.long@zilliz.com>
2024-03-28 15:14:10 +08:00
Jiquan Long
e549148a19
enhance: full-support for wildcard pattern matching (#30288)
issue: #29988 
This PR adds end-to-end support for wildcard pattern matching.
Before this PR, users could only use prefix matching in their expressions,
for example, "like 'prefix%'". With this PR, more flexible patterns are
allowed.

To do so, this PR makes these changes:
1. support regex queries on both the index and raw data;
2. translate pattern matching into a regex query, so that it can be
handled by the regex query logic;
3. loosen the limits of expression parsing to allow general pattern
matching syntax.

With regex query support in the segcore backend, a MySQL-like `REGEXP`
syntax can easily be added later.
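
The pattern-to-regex translation in step 2 can be sketched as follows. This is a minimal illustration assuming SQL-style `LIKE` semantics (`%` for any run of characters, `_` for one character, backslash as the escape character). `like_to_regex` and `like_match` are hypothetical names, and the actual translation in Milvus may differ.

```python
import re


def like_to_regex(pattern: str) -> str:
    # Translate a LIKE-style pattern into a regular expression string.
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            # Escaped character: match the next character literally.
            out.append(re.escape(pattern[i + 1]))
            i += 2
            continue
        if ch == "%":
            out.append(".*")  # any run of characters, possibly empty
        elif ch == "_":
            out.append(".")   # exactly one character
        else:
            out.append(re.escape(ch))  # literal, regex metacharacters escaped
        i += 1
    return "".join(out)


def like_match(pattern: str, value: str) -> bool:
    # The whole value must match; DOTALL lets wildcards span newlines.
    return re.fullmatch(like_to_regex(pattern), value, flags=re.DOTALL) is not None
```

Escaping every literal character keeps regex metacharacters in user input, such as `.` or `(`, from changing the meaning of the pattern.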

---------

Signed-off-by: longjiquan <jiquan.long@zilliz.com>
2024-02-01 12:37:04 +08:00
Jiquan Long
67ab5be15a
enhance: optimize search performance of inverted index (#29794)
issue: #29793 
Use `DocSetCollector` instead of `TopDocsCollector`, which will avoid
scoring and sorting.
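
The difference between the two collection strategies can be shown with a conceptual Python sketch. These functions are stand-ins for illustration, not the tantivy API: `collect_doc_set` and `collect_top_k` are hypothetical names.

```python
import heapq


def collect_doc_set(matching_docs):
    # Filter path: only membership matters, so no scores and no ordering.
    return set(matching_docs)


def collect_top_k(matching_docs, score, k):
    # Ranking path: every hit is scored, then the k best are selected.
    return heapq.nlargest(k, matching_docs, key=score)


hits = [7, 2, 9, 4]
```

A scalar filter only needs to know which rows match, so the set-based path skips the per-hit scoring and the selection step entirely.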

---------

Signed-off-by: longjiquan <jiquan.long@zilliz.com>
2024-01-11 11:12:49 +08:00
Jiquan Long
e9f3df3626
fix: inverted index file not found (#29695)
issue: https://github.com/milvus-io/milvus/issues/29654

---------

Signed-off-by: longjiquan <jiquan.long@zilliz.com>
2024-01-07 20:26:49 +08:00
Jiquan Long
3f46c6d459
feat: support inverted index (#28783)
issue: https://github.com/milvus-io/milvus/issues/27704

Add an inverted index for some data types in Milvus. This index type can
save a lot of memory compared to loading all data into RAM, and it speeds
up term queries and range queries.

Supported: `INT8`, `INT16`, `INT32`, `INT64`, `FLOAT`, `DOUBLE`, `BOOL`
and `VARCHAR`.

Not supported: `ARRAY` and `JSON`.

Note:
- The inverted index for `VARCHAR` is not designed to serve full-text
search for now. Every row is treated as a single keyword instead of
being tokenized into multiple terms.
- The inverted index doesn't support retrieval well, so if you create an
inverted index on a field, operations that depend on the raw data fall
back to chunk storage, which brings some performance loss. Examples are
comparisons between two columns and retrieval of output fields.

The inverted index is easy to use.

Take the collection below as an example:

```python
from pymilvus import Collection, CollectionSchema, DataType, FieldSchema

dim = 8  # example vector dimension (assumed; not given in the original)

fields = [
    FieldSchema(name="pk", dtype=DataType.VARCHAR, is_primary=True, auto_id=False, max_length=100),
    FieldSchema(name="int8", dtype=DataType.INT8),
    FieldSchema(name="int16", dtype=DataType.INT16),
    FieldSchema(name="int32", dtype=DataType.INT32),
    FieldSchema(name="int64", dtype=DataType.INT64),
    FieldSchema(name="float", dtype=DataType.FLOAT),
    FieldSchema(name="double", dtype=DataType.DOUBLE),
    FieldSchema(name="bool", dtype=DataType.BOOL),
    FieldSchema(name="varchar", dtype=DataType.VARCHAR, max_length=1000),
    FieldSchema(name="random", dtype=DataType.DOUBLE),
    FieldSchema(name="embeddings", dtype=DataType.FLOAT_VECTOR, dim=dim),
]
schema = CollectionSchema(fields)
collection = Collection("demo", schema)
```

Then we can create an inverted index on each field:

```python
index_type = "INVERTED"
collection.create_index("int8", {"index_type": index_type})
collection.create_index("int16", {"index_type": index_type})
collection.create_index("int32", {"index_type": index_type})
collection.create_index("int64", {"index_type": index_type})
collection.create_index("float", {"index_type": index_type})
collection.create_index("double", {"index_type": index_type})
collection.create_index("bool", {"index_type": index_type})
collection.create_index("varchar", {"index_type": index_type})
```

Then, term queries and range queries on these fields are sped up
automatically by the inverted index:

```python
result = collection.query(expr='int64 in [1, 2, 3]', output_fields=["pk"])
result = collection.query(expr='int64 < 5', output_fields=["pk"])
result = collection.query(expr='int64 > 2997', output_fields=["pk"])
result = collection.query(expr='1 < int64 < 5', output_fields=["pk"])
```
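
Why an inverted index helps with these two query shapes can be seen in a toy pure-Python model. It is a sketch for intuition only, not the tantivy-backed implementation: `ToyInvertedIndex` is a hypothetical class that maps each value to the row offsets holding it and keeps the distinct values sorted.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict


class ToyInvertedIndex:
    def __init__(self, values):
        postings = defaultdict(list)
        for offset, value in enumerate(values):
            postings[value].append(offset)
        self.postings = dict(postings)
        self.sorted_keys = sorted(self.postings)

    def term(self, candidates):
        # `field in [...]`: one dictionary lookup per candidate value.
        hits = []
        for value in candidates:
            hits.extend(self.postings.get(value, []))
        return sorted(hits)

    def range(self, low=None, high=None):
        # `low < field < high` with exclusive bounds; None means unbounded.
        lo = 0 if low is None else bisect_right(self.sorted_keys, low)
        hi = len(self.sorted_keys) if high is None else bisect_left(self.sorted_keys, high)
        hits = []
        for key in self.sorted_keys[lo:hi]:
            hits.extend(self.postings[key])
        return sorted(hits)


index = ToyInvertedIndex([3, 1, 4, 1, 5, 9, 2, 6])
```

Both query shapes touch only the matching values and their row offsets, so neither needs to scan every row of the column.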

---------

Signed-off-by: longjiquan <jiquan.long@zilliz.com>
2023-12-31 19:50:47 +08:00