aoiasd cc7652327d
enhance: optimize jieba and lindera analyzer clone (#46719)
relate: https://github.com/milvus-io/milvus/issues/46718

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Enhancement: Optimize Jieba and Lindera Analyzer Clone

**Core Invariant**: JiebaTokenizer and LinderaTokenizer must be
efficiently cloneable without lifetime constraints to support analyzer
composition in multi-language detection chains.

**What Logic Was Improved**:
- **JiebaTokenizer**: Replaced `Cow<'a, Jieba>` with
`Arc<jieba_rs::Jieba>` and removed the `<'a>` lifetime parameter. The
global JIEBA instance is now wrapped in an `Arc`, enabling
`#[derive(Clone)]` on the struct. This eliminates lifetime management
complexity while maintaining zero-copy sharing via atomic reference
counting.
- **LinderaTokenizer**: Introduced a public `LinderaSegmenter` struct
encapsulating dictionary and mode state, and implemented an explicit
`Clone` that properly duplicates the segmenter (cloning the Arc-wrapped
dictionary), applies `box_clone()` to each boxed token filter, and
clones the token buffer. Previously, `Clone` was either unavailable or
handled the boxed trait-object filters incompletely.

**Why Previous Implementation Was Limiting**:
- The `Cow::Borrowed` pattern for JiebaTokenizer created explicit
lifetime dependencies that prevented straightforward `#[derive(Clone)]`.
Switching to Arc eliminates borrow checker constraints while providing
the same reference semantics for immutable shared state.
- LinderaTokenizer's token filters are boxed trait objects, which cannot
be auto-derived. Manual Clone implementation with `box_clone()` calls
correctly handles polymorphic filter duplication.

**No Data Loss or Behavior Regression**:
- Arc cloning is semantically equivalent to `Cow::Borrowed` for
read-only access; both efficiently share the underlying Jieba instance
and Dictionary without data duplication.
- The explicit Clone preserves all tokenizer state: segmenter (with
shared Arc dictionary), all token filters (via individual box_clone),
and the token buffer used during tokenization.
- Token stream behavior is unchanged: segmentation and filter
application order remain identical.
- New benchmarks (`bench_jieba_tokenizer_clone`,
`bench_lindera_tokenizer_clone`) measure and validate clone performance
for both tokenizers.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: aoiasd <zhicheng.yue@zilliz.com>
2026-01-06 21:19:25 +08:00