mirror of https://gitee.com/milvus-io/milvus.git (synced 2026-01-07 19:31:51 +08:00)
relate: https://github.com/milvus-io/milvus/issues/46718

<!-- This is an auto-generated comment: release notes by coderabbit.ai -->
## Enhancement: Optimize Jieba and Lindera Analyzer Clone

**Core Invariant**: `JiebaTokenizer` and `LinderaTokenizer` must be efficiently cloneable without lifetime constraints to support analyzer composition in multi-language detection chains.

**What Logic Was Improved**:
- **JiebaTokenizer**: Replaced `Cow<'a, Jieba>` with `Arc<jieba_rs::Jieba>` and removed the `<'a>` lifetime parameter. The global JIEBA instance is now wrapped in an `Arc`, enabling `#[derive(Clone)]` on the struct. This eliminates lifetime-management complexity while retaining zero-copy sharing via atomic reference counting.
- **LinderaTokenizer**: Introduced a public `LinderaSegmenter` struct encapsulating dictionary and mode state, and implemented an explicit `Clone` that duplicates the segmenter (cloning the `Arc`-wrapped dictionary), applies `box_clone()` to each boxed token filter, and clones the token buffer. Previously, `Clone` was either unavailable or handled the trait objects incompletely.

**Why the Previous Implementation Was Limiting**:
- The `Cow::Borrowed` pattern for `JiebaTokenizer` created explicit lifetime dependencies that prevented a straightforward `#[derive(Clone)]`. Switching to `Arc` removes the borrow-checker constraints while providing the same reference semantics for immutable shared state.
- `LinderaTokenizer`'s token filters are boxed trait objects, for which `Clone` cannot be auto-derived. The manual `Clone` implementation with `box_clone()` calls correctly handles polymorphic filter duplication.

**No Data Loss or Behavior Regression**:
- `Arc` cloning is semantically equivalent to `Cow::Borrowed` for read-only access; both efficiently share the underlying Jieba instance and `Dictionary` without duplicating data.
- The explicit `Clone` preserves all tokenizer state: the segmenter (with its shared `Arc` dictionary), every token filter (via an individual `box_clone` call), and the token buffer used during tokenization.
- Token-stream behavior is unchanged: segmentation and filter-application order remain identical.
- New benchmarks (`bench_jieba_tokenizer_clone`, `bench_lindera_tokenizer_clone`) measure and validate clone performance for both tokenizers.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: aoiasd <zhicheng.yue@zilliz.com>
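The `Cow<'a, Jieba>` → `Arc<jieba_rs::Jieba>` change can be sketched roughly as below. This is a minimal, std-only illustration: `Dict`, `GLOBAL_DICT`, and `JiebaLikeTokenizer` are stand-in names for `jieba_rs::Jieba`, the global JIEBA instance, and the real tokenizer struct.

```rust
use std::sync::{Arc, OnceLock};

// Stand-in for jieba_rs::Jieba: an immutable dictionary shared by all tokenizers.
struct Dict {
    words: Vec<String>,
}

// Global instance wrapped in Arc, mirroring the PR's change to the JIEBA global.
static GLOBAL_DICT: OnceLock<Arc<Dict>> = OnceLock::new();

fn global_dict() -> Arc<Dict> {
    Arc::clone(GLOBAL_DICT.get_or_init(|| {
        Arc::new(Dict {
            words: vec!["milvus".to_string()],
        })
    }))
}

// No <'a> lifetime parameter is needed anymore: Arc<T> is Clone, so the
// whole struct can simply #[derive(Clone)].
#[derive(Clone)]
struct JiebaLikeTokenizer {
    dict: Arc<Dict>,
}

fn main() {
    let t1 = JiebaLikeTokenizer { dict: global_dict() };
    let t2 = t1.clone();
    // Cloning only bumps the atomic refcount; both clones share one allocation.
    assert!(Arc::ptr_eq(&t1.dict, &t2.dict));
    println!("shared: {}", Arc::ptr_eq(&t1.dict, &t2.dict));
}
```

Like `Cow::Borrowed`, the `Arc` gives read-only access to a single shared dictionary, but without tying the tokenizer's type to a borrow's lifetime.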
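The manual `Clone` for a struct holding boxed trait objects follows a standard `box_clone` pattern; a self-contained sketch (with `TokenFilter`, `Lowercase`, and `LinderaLikeTokenizer` as illustrative stand-ins for the real Lindera types) might look like:

```rust
// Box<dyn TokenFilter> cannot be handled by #[derive(Clone)], so the trait
// itself exposes a box_clone method that each concrete filter implements.
trait TokenFilter {
    fn apply(&self, token: &str) -> String;
    fn box_clone(&self) -> Box<dyn TokenFilter>;
}

#[derive(Clone)]
struct Lowercase;

impl TokenFilter for Lowercase {
    fn apply(&self, token: &str) -> String {
        token.to_lowercase()
    }
    fn box_clone(&self) -> Box<dyn TokenFilter> {
        Box::new(self.clone())
    }
}

struct LinderaLikeTokenizer {
    filters: Vec<Box<dyn TokenFilter>>,
    tokens: Vec<String>, // token buffer reused during tokenization
}

impl Clone for LinderaLikeTokenizer {
    fn clone(&self) -> Self {
        Self {
            // Duplicate each polymorphic filter via its box_clone method.
            filters: self.filters.iter().map(|f| f.box_clone()).collect(),
            // Plain clone for the token buffer.
            tokens: self.tokens.clone(),
        }
    }
}

fn main() {
    let t = LinderaLikeTokenizer {
        filters: vec![Box::new(Lowercase)],
        tokens: Vec::new(),
    };
    let c = t.clone();
    // The cloned tokenizer applies its filters identically.
    assert_eq!(c.filters[0].apply("Tokyo"), "tokyo");
    println!("cloned filter output: {}", c.filters[0].apply("Tokyo"));
}
```

Because each clone gets its own boxed filters and its own token buffer, clones can tokenize independently while still sharing any `Arc`-wrapped dictionary inside the segmenter.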
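The intent behind benchmarks such as `bench_jieba_tokenizer_clone` can be sketched with a std-only timing loop (the PR's actual benchmarks use a proper harness; `CheapClone` and the counts here are purely illustrative):

```rust
use std::sync::Arc;
use std::time::Instant;

// Stand-in for an Arc-backed tokenizer: the shared buffer plays the role
// of the dictionary that clones must not copy.
#[derive(Clone)]
struct CheapClone {
    shared: Arc<Vec<u8>>,
}

fn main() {
    let t = CheapClone {
        shared: Arc::new(vec![0u8; 1 << 20]), // 1 MiB of shared state
    };
    let start = Instant::now();
    let mut clones = Vec::with_capacity(10_000);
    for _ in 0..10_000 {
        // Each clone is one atomic refcount increment, not a data copy.
        clones.push(t.clone());
    }
    let elapsed = start.elapsed();
    assert_eq!(clones.len(), 10_000);
    println!("10000 clones in {:?}", elapsed);
}
```

If cloning accidentally regressed to deep-copying the dictionary, a loop like this would show it immediately, which is what the new clone benchmarks guard against.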