https://github.com/milvus-io/milvus/issues/42589
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Semantic Highlighting Feature
**Core Invariant**: Semantic highlighting operates on a per-field basis
with independent text processing through an external Zilliz highlight
provider. The implementation maintains field ID to field name mapping
and correlates highlight results back to original field outputs.
**What is Added**: This PR introduces semantic highlighting capability
for search results alongside the existing lexical highlighting. The
feature consists of:
- New `SemanticHighlight` orchestrator that validates queries/input
fields against collection schema, instantiates a Zilliz-based provider,
and batches text processing across multiple queries
- New `SemanticHighlighter` proxy wrapper implementing the `Highlighter`
interface for search pipeline integration
- New `semanticHighlightOperator` that processes search results by
delegating per-field text processing to the provider and attaching
correlated `HighlightResult` data to search outputs
- New gRPC service definition (`HighlightService`) and
`ZillizClient.Highlight()` method for external provider communication
**No Data Loss or Regression**: The change is purely additive without
modifying existing logic:
- Lexical highlighting path remains unchanged (separate switch case in
`createHighlightTask`)
- New `HighlightResults` field is only populated when semantic
highlighting is explicitly requested via `HighlightType_Semantic` enum
value
- Gracefully handles missing fields by returning explicit errors rather
than silent failures
- Pipeline operator integration follows existing patterns and only
processes when semantic highlighter is instantiated
**Why This Design**: Semantic highlighting is routed through the same
pipeline operator pattern as lexical highlighting, ensuring consistent
integration into search workflows. The per-field model allows flexible
highlighting across different text columns and batch processing ensures
efficient handling of multiple queries with configurable provider
constraints.
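To make the per-field flow concrete, here is a minimal Go sketch of how a semantic highlighter could plug into the pipeline; `Highlighter` and `HighlightResult` are names from this PR, while the method signature, provider closure, and all other details are hypothetical simplifications.
```
package main

import (
	"context"
	"fmt"
)

// HighlightResult is a simplified stand-in for the proto message of the same name.
type HighlightResult struct {
	FieldName string
	Fragments []string
}

// Highlighter loosely mirrors the interface the proxy wrapper implements;
// the actual signature in the PR may differ.
type Highlighter interface {
	Highlight(ctx context.Context, queries []string, fieldTexts map[string][]string) ([]HighlightResult, error)
}

// semanticHighlighter delegates per-field text processing to an external
// provider and correlates results back to the originating field.
type semanticHighlighter struct {
	provider func(ctx context.Context, query string, texts []string) ([]string, error)
}

func (h *semanticHighlighter) Highlight(ctx context.Context, queries []string, fieldTexts map[string][]string) ([]HighlightResult, error) {
	var results []HighlightResult
	for field, texts := range fieldTexts { // independent per-field processing
		for _, q := range queries { // the real provider batches across queries
			frags, err := h.provider(ctx, q, texts)
			if err != nil {
				return nil, fmt.Errorf("highlight field %q: %w", field, err) // explicit error, no silent failure
			}
			results = append(results, HighlightResult{FieldName: field, Fragments: frags})
		}
	}
	return results, nil
}

func main() {
	h := &semanticHighlighter{
		provider: func(_ context.Context, q string, texts []string) ([]string, error) {
			out := make([]string, 0, len(texts))
			for range texts {
				out = append(out, "<em>"+q+"</em>")
			}
			return out, nil
		},
	}
	res, _ := h.Highlight(context.Background(), []string{"vector database"},
		map[string][]string{"doc": {"Milvus is a vector database."}})
	fmt.Println(res)
}
```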
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
Signed-off-by: junjie.jiang <junjie.jiang@zilliz.com>
issue: #45881
Add persistent task management for external collections with automatic
detection of external_source and external_spec changes. When source
changes, the system aborts running tasks and creates new ones, ensuring
only one active task per collection. Tasks validate their source on
completion to prevent superseded tasks from committing results.
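A minimal Go sketch of the single-active-task invariant described above; the manager, task, and lock granularity below are illustrative stand-ins for externalCollectionManager and its meta-backed collection-level locks.
```
package main

import (
	"fmt"
	"sync"
)

type task struct {
	collectionID   int64
	externalSource string // snapshot of the source when the task was created
	aborted        bool
}

// externalTaskManager keeps at most one active task per collection.
type externalTaskManager struct {
	mu     sync.Mutex // collection-level serialization (simplified to one lock)
	active map[int64]*task
}

// onSourceChanged aborts any superseded task and installs a fresh one.
func (m *externalTaskManager) onSourceChanged(collectionID int64, newSource string) *task {
	m.mu.Lock()
	defer m.mu.Unlock()
	if old, ok := m.active[collectionID]; ok && old.externalSource != newSource {
		old.aborted = true // the running task observes this and stops
	}
	t := &task{collectionID: collectionID, externalSource: newSource}
	m.active[collectionID] = t
	return t
}

// commit re-validates the task's source snapshot against the current source,
// so a superseded task can never commit stale results.
func (m *externalTaskManager) commit(t *task, currentSource string) error {
	m.mu.Lock()
	defer m.mu.Unlock()
	if t.aborted || t.externalSource != currentSource {
		return fmt.Errorf("task superseded: source changed")
	}
	delete(m.active, t.collectionID)
	return nil
}

func main() {
	m := &externalTaskManager{active: map[int64]*task{}}
	t1 := m.onSourceChanged(100, "s3://bucket/v1")
	t2 := m.onSourceChanged(100, "s3://bucket/v2") // supersedes t1
	fmt.Println(m.commit(t1, "s3://bucket/v2"))    // error: superseded
	fmt.Println(m.commit(t2, "s3://bucket/v2"))    // nil
}
```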
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
- Core invariant: at most one active UpdateExternalCollection task
exists per collection — tasks are serialized by collectionID
(collection-level locking) and any change to external_source or
external_spec aborts superseded tasks and causes a new task creation
(externalCollectionManager + external_collection_task_meta
collection-based locks enforce this).
- What was simplified/removed: per-task fine-grained locking and
concurrent multi-task acceptance per collection were replaced by
collection-level synchronization (external_collection_task_meta.go) and
a single persistent task lifecycle in DataCoord/Index task code;
redundant double-concurrent update paths were removed by checking
existing task presence in AddTask/LoadOrStore and aborting/overwriting
via Drop/Cancel flows.
- Why this does NOT cause data loss or regress behavior: task state
transitions and commit are validated against the current external
source/spec before applying changes — UpdateStateWithMeta and SetJobInfo
verify task metadata and persist via catalog only under matching
collection-state; DataNode externalCollectionManager persists task
results to in-memory manager and exposes Query/Drop flows (services.go)
without modifying existing segment data unless a task successfully
finishes and SetJobInfo atomically updates segments via meta/catalog
calls, preventing superseded tasks from committing stale results.
- New capability added: end-to-end external collection update workflow —
DataCoord Index task + Cluster RPC helpers + DataNode external task
runner and ExternalCollectionManager enable creating, querying,
cancelling, and applying external collection updates
(fragment-to-segment balancing, kept/updated segment handling, allocator
integration); accompanying unit tests cover success, failure,
cancellation, allocator errors, and balancing logic.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
---------
Signed-off-by: sunby <sunbingyi1992@gmail.com>
issue: #46550
- Add CatchUpStreamingDataTsLag parameter to control tolerable lag
threshold for delegator to be considered caught up
- Add catchingUpStreamingData field in delegator to track whether
delegator has caught up with streaming data
- Add catching_up_streaming_data field in LeaderViewStatus proto
- Check catching up status in CheckDelegatorDataReady, return not
ready when delegator is still catching up streaming data
- Add unit tests for the new functionality
When tsafe lag exceeds the threshold, the distribution will not be
considered serviceable, preventing queries from timing out in waitTSafe.
This is useful when streaming message queue consumption is slow.
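A minimal Go sketch of this gate, assuming the default 1s lag budget; the real flag lives on shardDelegator and is consulted by QueryCoord's CheckDelegatorDataReady, but the types below are simplified stand-ins.
```
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

type delegator struct {
	catchingUp atomic.Bool   // starts true: not yet caught up with streaming data
	tsafe      atomic.Int64  // last applied timestamp, unix nanos for simplicity
	lagBudget  time.Duration // CatchUpStreamingDataTsLag equivalent, default 1s
}

// updateTSafe flips the catching-up flag once the observed lag drops below budget.
func (d *delegator) updateTSafe(ts, latest time.Time) {
	d.tsafe.Store(ts.UnixNano())
	if latest.Sub(ts) < d.lagBudget {
		d.catchingUp.Store(false)
	}
}

// dataReady mirrors the new readiness precondition: reject the leader while
// it is still catching up, before any segment checks run.
func (d *delegator) dataReady() error {
	if d.catchingUp.Load() {
		return fmt.Errorf("channel not available: delegator catching up streaming data")
	}
	return nil // segment-distribution checks would follow here
}

func main() {
	d := &delegator{lagBudget: time.Second}
	d.catchingUp.Store(true)
	now := time.Now()
	d.updateTSafe(now.Add(-5*time.Second), now) // lag 5s: still catching up
	fmt.Println(d.dataReady())
	d.updateTSafe(now.Add(-100*time.Millisecond), now) // lag 100ms: caught up
	fmt.Println(d.dataReady())
}
```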
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
- Core invariant: a delegator must not be considered serviceable while
its tsafe lags behind the latest committed timestamp beyond a
configurable tolerance; a delegator is "caught-up" only when
(latestTsafe - delegator.GetTSafe()) < CatchUpStreamingDataTsLag
(configured by queryNode.delegator.catchUpStreamingDataTsLag, default
1s).
- New capability and where it takes effect: adds streaming-catchup
tracking to QueryNode/QueryCoord — an atomic catchingUpStreamingData
flag on shardDelegator (internal/querynodev2/delegator/delegator.go), a
new param CatchUpStreamingDataTsLag
(pkg/util/paramtable/component_param.go), and a
LeaderViewStatus.catching_up_streaming_data field in the proto
(pkg/proto/query_coord.proto). The flag is exposed in
GetDataDistribution (internal/querynodev2/services.go) and used by
QueryCoord readiness checks
(internal/querycoordv2/utils/util.go::CheckDelegatorDataReady) to reject
leaders that are still catching up.
- What logic is simplified/added (not removed): instead of relying
solely on segment distribution/worker heartbeats, the PR adds an
explicit readiness gate that returns "not available" when the delegator
reports catching-up-streaming-data. This is strictly additive — no
existing checks are removed; the new precondition runs before segment
availability validation to prevent premature routing to slow-consuming
delegators.
- Why this does NOT cause data loss or regress behavior: the change only
controls serviceability visibility and routing — it never drops or
mutates data. Concretely: shardDelegator starts with
catchingUpStreamingData=true and flips to false in UpdateTSafe once the
sampled lag falls below the configured threshold
(internal/querynodev2/delegator/delegator.go::UpdateTSafe). QueryCoord
will short-circuit in CheckDelegatorDataReady when
leader.Status.GetCatchingUpStreamingData() is true
(internal/querycoordv2/utils/util.go), returning a channel-not-available
error before any segment checks; when the flag clears, existing
segment-distribution checks (same code paths) resume. Tests added cover
both catching-up and caught-up paths
(internal/querynodev2/delegator/delegator_test.go,
internal/querycoordv2/utils/util_test.go,
internal/querynodev2/services_test.go), demonstrating convergence
without changed data flows or deletion of data.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
---------
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
issue: https://github.com/milvus-io/milvus/issues/46517
ref: https://github.com/milvus-io/milvus/issues/42148
This PR supports the match operator family with struct arrays and
brute-force search only.
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
- Core invariant: match operators only target struct-array element-level
predicates and assume callers provide a correct row_start so element
indices form a contiguous range; IArrayOffsets implementations convert
row-level bitmaps/rows (starting at row_start) into element-level
bitmaps or a contiguous element-offset vector used by brute-force
evaluation (a Go sketch of this conversion follows this list).
- New capability added: end-to-end support for MATCH_* semantics
(match_any, match_all, match_least, match_most, match_exact) — parser
(grammar + proto), planner (ParseMatchExprs), expr model
(expr::MatchExpr), compilation (Expr→PhyMatchFilterExpr), execution
(PhyMatchFilterExpr::Eval uses element offsets/bitmaps), and unit tests
(MatchExprTest + parser tests). Implementation currently works for
struct-array inputs and uses brute-force element counting via
RowBitsetToElementOffsets/RowBitsetToElementBitset.
- Logic removed or simplified and why: removed the ad-hoc
DocBitsetToElementOffsets helper and consolidated offset/bitset
derivation into IArrayOffsets::RowBitsetToElementOffsets and a
row_start-aware RowBitsetToElementBitset, and removed EvalCtx overloads
that embedded ExprSet (now EvalCtx(exec_ctx, offset_input)). This
centralizes array-layout logic in ArrayOffsets and removes duplicated
offset conversion and EvalCtx variants that were redundant for
element-level evaluation.
- No data loss / no behavior regression: persistent formats are
unchanged (no proto storage or on-disk layout changed); callers were
updated to supply row_start and now route through the centralized
ArrayOffsets APIs which still use the authoritative
row_to_element_start_ mapping, preserving exact element index mappings.
Eval logic changes are limited to in-memory plumbing (how
offsets/bitmaps are produced and how EvalCtx is constructed); expression
evaluation still invokes exprs_->Eval where needed, so existing behavior
and stored data remain intact.
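To ground the row-to-element conversion, here is a minimal Go sketch assuming the authoritative row→element-start mapping; the method loosely mirrors RowBitsetToElementOffsets, but the types are illustrative, not the C++ implementation.
```
package main

import "fmt"

// arrayOffsets knows where each row's elements start; rowToElementStart has
// one extra sentinel entry so row i spans [rowToElementStart[i], rowToElementStart[i+1]).
type arrayOffsets struct {
	rowToElementStart []int
}

// rowBitsetToElementOffsets expands selected rows (bitset starting at rowStart)
// into a contiguous vector of element offsets for brute-force evaluation.
func (a *arrayOffsets) rowBitsetToElementOffsets(rowBitset []bool, rowStart int) []int {
	var out []int
	for i, set := range rowBitset {
		if !set {
			continue
		}
		row := rowStart + i
		for e := a.rowToElementStart[row]; e < a.rowToElementStart[row+1]; e++ {
			out = append(out, e)
		}
	}
	return out
}

func main() {
	// Three rows with 2, 3, and 1 elements respectively.
	a := &arrayOffsets{rowToElementStart: []int{0, 2, 5, 6}}
	// Rows 1 and 2 selected, with the bitset starting at row 1.
	fmt.Println(a.rowBitsetToElementOffsets([]bool{true, true}, 1)) // [2 3 4 5]
}
```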
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
---------
Signed-off-by: SpadeA <tangchenjie1210@gmail.com>
Signed-off-by: SpadeA-Tang <tangchenjie1210@gmail.com>
Support creating an analyzer with file resource info, and return the
used file resource IDs when validating the analyzer.
Save the related resource IDs in the collection schema.
relate: https://github.com/milvus-io/milvus/issues/43687
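A minimal Go sketch of the Go-side propagation, assuming a validate response carrying resource IDs; the types are simplified stand-ins for querypb.ValidateAnalyzerResponse and the collection schema, and the analyzer params shown are hypothetical.
```
package main

import "fmt"

type validateAnalyzerResponse struct {
	ResourceIds []int64 // collected on the Rust side and surfaced via FFI
}

type collectionSchema struct {
	Name            string
	FileResourceIds []int64
}

// validateSchema asks the analyzer layer to validate params and records the
// file resources the analyzer actually used into the collection schema.
func validateSchema(schema *collectionSchema, validate func(params string) (*validateAnalyzerResponse, error), analyzerParams string) error {
	resp, err := validate(analyzerParams)
	if err != nil {
		return err
	}
	schema.FileResourceIds = resp.ResourceIds // persisted with the schema
	return nil
}

func main() {
	s := &collectionSchema{Name: "docs"}
	_ = validateSchema(s, func(string) (*validateAnalyzerResponse, error) {
		// Hypothetical IDs, e.g. a synonyms file and a stopwords file.
		return &validateAnalyzerResponse{ResourceIds: []int64{7, 42}}, nil
	}, `{"tokenizer":"standard"}`)
	fmt.Println(s.FileResourceIds) // [7 42]
}
```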
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
- Core invariant: analyzer file-resource resolution is deterministic and
traceable by threading a FileResourcePathHelper (collecting used
resource IDs in a HashSet) through all tokenizer/analyzer construction
and validation paths; validate_analyzer(params, extra_info) returns the
collected Vec<i64> which is propagated through C/Rust/Go layers to
callers (CValidateResult → RustResult::from_vec_i64 → Go []int64 →
querypb.ValidateAnalyzerResponse.ResourceIds →
CollectionSchema.FileResourceIds).
- Logic removed/simplified: ad‑hoc, scattered resource-path lookups and
per-filter file helpers (e.g., read_synonyms_file and other inline
file-reading logic) were consolidated into ResourceInfo +
FileResourcePathHelper and a centralized get_resource_path(helper, ...)
API; filter/tokenizer builder APIs now accept &mut
FileResourcePathHelper so all file path resolution and ID collection use
the same path and bookkeeping logic (redundant duplicated lookups
removed).
- Why no data loss or behavior regression: changes are additive and
default-preserving — existing call sites pass extra_info = "" so
analyzer creation/validation behavior and error paths remain unchanged;
new Collection.FileResourceIds is populated from resp.ResourceIds in
validateSchema and round‑tripped through marshal/unmarshal
(model.Collection ↔ schemapb.CollectionSchema) so schema persistence
uses the new list without overwriting other schema fields; proto change
adds a repeated field (resource_ids) which is wire‑compatible (older
clients ignore extra field). Concrete code paths: analyzer creation
still uses create_analyzer (now with extra_info ""), tokenizer
validation still returns errors as before but now also returns IDs via
CValidateResult/RustResult, and rootcoord.validateSchema assigns
resp.ResourceIds → schema.FileResourceIds.
- New capability added: end‑to‑end discovery, return, and persistence of
file resource IDs used by analyzers — validate flows now return resource
IDs and the system stores them in collection schema (affects tantivy
analyzer binding, canalyzer C bindings, internal/util analyzer APIs,
querynode ValidateAnalyzer response, and rootcoord/create_collection
flow).
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
Signed-off-by: aoiasd <zhicheng.yue@zilliz.com>
Introduce a ScannerStartupDelay configuration to enable WAL write-only
recovery, allowing fence messages to be persisted during
primary–secondary switchover when the StreamingNode is trapped in crash
loops.
issue: https://github.com/milvus-io/milvus/issues/46368
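A minimal Go sketch of the write-only recovery window such a delay enables; the config name comes from this PR, but the gating mechanism shown is an assumption, not the actual wiring.
```
package main

import (
	"fmt"
	"time"
)

type wal struct {
	openedAt            time.Time
	scannerStartupDelay time.Duration // ScannerStartupDelay config
}

// scannerAllowed reports whether read-path scanners may start. During the
// delay window the WAL is effectively write-only, so fence messages can be
// persisted even if downstream consumption would crash-loop the node.
func (w *wal) scannerAllowed(now time.Time) bool {
	return now.Sub(w.openedAt) >= w.scannerStartupDelay
}

func main() {
	w := &wal{openedAt: time.Now(), scannerStartupDelay: 30 * time.Second}
	fmt.Println(w.scannerAllowed(time.Now()))                  // false: write-only window
	fmt.Println(w.scannerAllowed(time.Now().Add(time.Minute))) // true: scanning may begin
}
```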
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
* **New Features**
* Added a configurable WAL scanner pause/resume and a consumer request
flag to optionally ignore pause signals.
* **Metrics**
* Added a scanner pause gauge and pause-duration tracking for WAL
scanning.
* **Tests**
* Added coverage for pause-consumption behavior and cleanup in stream
client tests.
* **Chores**
* Consolidated flush-all logging into a single field and added a helper
for bulk message conversion.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
---------
Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
issue: https://github.com/milvus-io/milvus/issues/45890
ComputePhraseMatchSlop accepts three parameters:
1. A string: the query text
2. Some strings: the data texts
3. Analyzer params
The slop is calculated for the query text against each data text in the
context of phrase match, where both are tokenized by a tokenizer built
from the analyzer params.
Two arrays are returned:
1. is_match: whether the phrase match can succeed
2. slop: the corresponding slop if the phrase match succeeds, or -1 if it cannot
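A minimal Go sketch of the contract; the real implementation tokenizes with the configured analyzer, whereas this sketch splits on whitespace and only considers in-order matches, which is a simplification.
```
package main

import (
	"fmt"
	"strings"
)

// computePhraseMatchSlop mirrors the described API: for one query text and
// several data texts it returns, per data text, whether a phrase match can
// succeed and the minimal slop if so (-1 otherwise).
func computePhraseMatchSlop(query string, data []string) (isMatch []bool, slop []int) {
	q := strings.Fields(query)
	isMatch = make([]bool, len(data))
	slop = make([]int, len(data))
	for i, d := range data {
		toks := strings.Fields(d)
		best := -1
		for start := range toks {
			// Greedily match query tokens in order from this start position.
			j, last := 0, start-1
			for k := start; k < len(toks) && j < len(q); k++ {
				if toks[k] == q[j] {
					j++
					last = k
				}
			}
			if j == len(q) {
				// Slop = extra tokens inside the matched window.
				if s := (last - start + 1) - len(q); best == -1 || s < best {
					best = s
				}
			}
		}
		isMatch[i], slop[i] = best >= 0, best
	}
	return
}

func main() {
	m, s := computePhraseMatchSlop("vector database", []string{
		"vector database",               // exact: slop 0
		"vector purpose-built database", // one gap: slop 1
		"database only",                 // no match: -1
	})
	fmt.Println(m, s) // [true true false] [0 1 -1]
}
```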
---------
Signed-off-by: SpadeA <tangchenjie1210@gmail.com>
issue: #43897
also for issue: #46166
add an ack_sync_up flag to the broadcast message header, which indicates
whether the broadcast operation needs to be synced up between the
streaming node and the coordinator.
If ack_sync_up is false, the broadcast operation is acked once the
recovery storage sees the message on the current vchannel, so the
fast-ack path can be applied to speed up the broadcast operation.
If ack_sync_up is true, the broadcast operation is acked only after the
checkpoint of the current vchannel reaches the message; the fast-ack
path cannot be applied, because the ack must be synced up with the
streaming node.
e.g. if the truncate collection operation wants its ack-once callback to
fire after all segments are flushed on the current vchannel, it should
set ack_sync_up to true.
TODO: the current implementation doesn't guarantee the ack-sync-up
semantic; it only guarantees that the FastAck path will not be applied.
The full ack-sync-up semantic is deferred to 3.0. Only used by the
truncate API for now.
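A minimal Go sketch of the ack decision; the ack_sync_up flag comes from this PR, and the surrounding types are simplified.
```
package main

import "fmt"

type broadcastHeader struct {
	AckSyncUp bool
}

// canFastAck encodes the rule: a broadcast message may take the fast-ack
// path only when the header does not require the ack to be synced up with
// the streaming node.
func canFastAck(h broadcastHeader, seenByRecoveryStorage bool) bool {
	// When AckSyncUp is set, the ack must wait for the vchannel checkpoint,
	// so the fast-ack shortcut is never applied.
	if h.AckSyncUp {
		return false
	}
	// Otherwise the message is acked as soon as recovery storage sees it.
	return seenByRecoveryStorage
}

func main() {
	// e.g. truncate collection sets AckSyncUp=true so its ack-once callback
	// fires only after all segments are flushed on the vchannel.
	fmt.Println(canFastAck(broadcastHeader{AckSyncUp: true}, true))  // false
	fmt.Println(canFastAck(broadcastHeader{AckSyncUp: false}, true)) // true
}
```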
---------
Signed-off-by: chyezh <chyezh@outlook.com>
Related to #44956
This change propagates the useLoonFFI configuration through the import
pipeline to enable LOON FFI usage during data import operations.
Key changes:
- Add use_loon_ffi field to ImportRequest protobuf message
- Add manifest_path field to ImportSegmentInfo for tracking manifest
- Initialize manifest path when creating segments (both import and
growing)
- Pass useLoonFFI flag through NewSyncTask in import tasks
- Simplify pack_writer_v2 by removing GetManifestInfo method and relying
on pre-initialized manifest path from segment creation
- Update segment meta with manifest path after import completion
This allows the import workflow to use the LOON FFI based packed writer
when the common.useLoonFFI configuration is enabled.
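A minimal Go sketch of the flag's journey through the import path; the types are simplified stand-ins for the ImportRequest proto and the sync task.
```
package main

import "fmt"

type importRequest struct {
	UseLoonFFI bool // new protobuf field on ImportRequest
}

type syncTask struct {
	useLoonFFI   bool
	manifestPath string // pre-initialized at segment creation, not recomputed
}

// newSyncTask shows the propagation point: the import task forwards the flag
// and the segment's manifest path into the sync task it creates.
func newSyncTask(req importRequest, segmentManifestPath string) *syncTask {
	return &syncTask{useLoonFFI: req.UseLoonFFI, manifestPath: segmentManifestPath}
}

func (t *syncTask) writerKind() string {
	if t.useLoonFFI {
		return "loon-ffi packed writer (manifest: " + t.manifestPath + ")"
	}
	return "traditional binlog writer"
}

func main() {
	t := newSyncTask(importRequest{UseLoonFFI: true}, "seg-1/manifest.json")
	fmt.Println(t.writerKind())
}
```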
---------
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
issue: #46358
This PR implements segment reopening functionality on query nodes,
enabling the application of data or schema changes to already-loaded
segments without requiring a full reload.
### Core (C++)
**New SegmentLoadInfo class**
(`internal/core/src/segcore/SegmentLoadInfo.h/cpp`):
- Encapsulates segment load configuration with structured access
- Implements `ComputeDiff()` to calculate differences between old and
new load states
- Tracks indexes, binlogs, and column groups that need to be loaded or
dropped
- Provides `ConvertFieldIndexInfoToLoadIndexInfo()` for index loading
**ChunkedSegmentSealedImpl modifications**:
- Added `Reopen(const SegmentLoadInfo&)` method to apply incremental
changes based on computed diff
- Refactored `LoadColumnGroups()` and `LoadColumnGroup()` to support
selective loading via field ID map
- Extracted `LoadBatchIndexes()` and `LoadBatchFieldData()` for reusable
batch loading logic
- Added `LoadManifest()` for manifest-based loading path
- Updated all methods to use `SegmentLoadInfo` wrapper instead of direct
proto access
**SegmentGrowingImpl modifications**:
- Added `Reopen()` stub method for interface compliance
**C API additions** (`segment_c.h/cpp`):
- Added `ReopenSegment()` function exposing reopen to Go layer
### Go Side
**QueryNode handlers** (`internal/querynodev2/`):
- Added `HandleReopen()` in handlers.go
- Added `ReopenSegments()` RPC in services.go
**Segment interface** (`internal/querynodev2/segments/`):
- Extended `Segment` interface with `Reopen()` method
- Implemented `Reopen()` in LocalSegment
- Added `Reopen()` to segment loader
**Segcore wrapper** (`internal/util/segcore/`):
- Added `Reopen()` method in segment.go
- Added `ReopenSegmentRequest` in requests.go
### Proto
- Added new fields to support reopen in `query_coord.proto`
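A minimal Go sketch of the diff-then-apply idea behind Reopen; ComputeDiff lives in C++ in this PR, so the structures below are illustrative and only cover indexes.
```
package main

import "fmt"

type loadInfo struct {
	indexes map[int64]string // fieldID -> index version/path
}

type loadDiff struct {
	indexesToLoad map[int64]string
	indexesToDrop []int64
}

// computeDiff compares the old and new load states so Reopen can apply only
// the incremental changes instead of performing a full reload.
func computeDiff(old, next loadInfo) loadDiff {
	d := loadDiff{indexesToLoad: map[int64]string{}}
	for f, idx := range next.indexes {
		if old.indexes[f] != idx {
			d.indexesToLoad[f] = idx
		}
	}
	for f := range old.indexes {
		if _, ok := next.indexes[f]; !ok {
			d.indexesToDrop = append(d.indexesToDrop, f)
		}
	}
	return d
}

func main() {
	old := loadInfo{indexes: map[int64]string{101: "v1", 102: "v1"}}
	next := loadInfo{indexes: map[int64]string{101: "v2", 103: "v1"}}
	fmt.Printf("%+v\n", computeDiff(old, next))
	// e.g. {indexesToLoad:map[101:v2 103:v1] indexesToDrop:[102]}
}
```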
---------
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
Related #44956
Add manifest_path field to CreateStatsRequest and propagate it through
the stats task pipeline. This enables stats tasks and text index
building to access segment manifest for storage v2 format operations.
- Add manifest_path field to CreateStatsRequest proto
- Set ManifestPath from segment metadata in DataCoord
- Pass manifest to BuildIndexInfo in stats task builder
- Include manifest in compaction text index creation
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
issue: https://github.com/milvus-io/milvus/issues/42148
For a vector field inside a STRUCT, since a STRUCT can only appear as
the element type of an ARRAY field, the vector field in STRUCT is
effectively an array of vectors, i.e. an embedding list.
Milvus already supports searching embedding lists with metrics whose
names start with the prefix MAX_SIM_.
This PR allows Milvus to search embeddings inside an embedding list
using the same metrics as normal embedding fields. Each embedding in the
list is treated as an independent vector and participates in ANN search.
Further, since STRUCT may contain scalar fields that are highly related
to the embedding field, this PR introduces an element-level filter
expression to refine search results.
The grammar of the element-level filter is:
`element_filter(structFieldName, $[subFieldName] == 3)`
where `$[subFieldName]` refers to the value of subFieldName in each
element of the STRUCT array structFieldName.
It can be combined with existing filter expressions, for example:
"varcharField == 'aaa' && element_filter(struct_field, $[struct_int] ==
3)"
A full example:
```
struct_schema = milvus_client.create_struct_field_schema()
struct_schema.add_field("struct_str", DataType.VARCHAR, max_length=65535)
struct_schema.add_field("struct_int", DataType.INT32)
struct_schema.add_field("struct_float_vec", DataType.FLOAT_VECTOR, dim=EMBEDDING_DIM)
schema.add_field(
    "struct_field",
    datatype=DataType.ARRAY,
    element_type=DataType.STRUCT,
    struct_schema=struct_schema,
    max_capacity=1000,
)
...
filter = "varcharField == 'aaa' && element_filter(struct_field, $[struct_int] == 3 && $[struct_str] == 'abc')"
res = milvus_client.search(
    COLLECTION_NAME,
    data=query_embeddings,
    limit=10,
    anns_field="struct_field[struct_float_vec]",
    filter=filter,
    output_fields=["struct_field[struct_int]", "varcharField"],
)
```
TODO:
1. When an `element_filter` expression is used, a regular filter
expression must also be present. Remove this restriction.
2. Implement `element_filter` expressions in the `query`.
---------
Signed-off-by: SpadeA <tangchenjie1210@gmail.com>
This PR:
1. Defines and implements the new FlushAllMessage.
2. Refactors FlushAll to flush the entire cluster.
issue: https://github.com/milvus-io/milvus/issues/45919
---------
Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
relate:https://github.com/milvus-io/milvus/issues/43687
Support using file resources in sync mode.
Automatically download or remove the local copy of a file resource when
the user adds or removes the file resource.
Sync file resources to a node when a new node session is found.
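A minimal Go sketch of the sync-mode behavior; the syncer type and its bookkeeping are illustrative, not the PR's actual session watcher or download RPCs.
```
package main

import "fmt"

type resourceSyncer struct {
	resources map[int64]string   // resourceID -> remote path
	local     map[string][]int64 // nodeID -> resource IDs already downloaded
}

// addResource records the resource and pushes it to every known node;
// removal would mirror this by deleting local copies.
func (s *resourceSyncer) addResource(id int64, path string) {
	s.resources[id] = path
	for node := range s.local {
		s.local[node] = append(s.local[node], id) // stand-in for the download RPC
	}
}

// onNewNodeSession syncs the full resource set to a node discovered later.
func (s *resourceSyncer) onNewNodeSession(node string) {
	for id := range s.resources {
		s.local[node] = append(s.local[node], id)
	}
}

func main() {
	s := &resourceSyncer{resources: map[int64]string{}, local: map[string][]int64{"node-1": nil}}
	s.addResource(7, "resources/synonyms.txt")
	s.onNewNodeSession("node-2")
	fmt.Println(s.local["node-1"], s.local["node-2"]) // [7] [7]
}
```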
---------
Signed-off-by: aoiasd <zhicheng.yue@zilliz.com>
Previously, search with highlight only supported using BM25 search text
as the highlight target.
This PR adds support for highlighting with user-defined queries.
relate: https://github.com/milvus-io/milvus/issues/42589
---------
Signed-off-by: aoiasd <zhicheng.yue@zilliz.com>
issue: https://github.com/milvus-io/milvus/issues/45691
Add persistent task management for external collections with automatic
detection of external_source and external_spec changes. When source
changes, the system aborts running tasks and creates new ones, ensuring
only one active task per collection. Tasks validate their source on
completion to prevent superseded tasks from committing results.
---------
Signed-off-by: sunby <sunbingyi1992@gmail.com>
This PR adds support for reading data from StorageV2 using manifest
files and the Loon FFI interface during index building, providing an
alternative to the traditional segment insert files approach.
Key changes:
Core C++ changes:
- Add SEGMENT_MANIFEST_KEY and LOON_FFI_PROPERTIES_KEY constants for
manifest handling
- Extend FileManagerContext to carry loon_ffi_properties for FFI
operations
- Update index_c.cpp to pass manifest and loon properties to file
managers for all index types (vector, JSON key, text)
- Implement GetFieldDatasFromManifest() in Util.cpp using Arrow C Stream
interface:
* Create Arrow schema from field metadata
* Initialize FFI reader with manifest content and storage properties
* Import record batches from C data interface
* Convert to FieldData for index building
- Update DiskFileManagerImpl and MemFileManagerImpl to support
manifest-based data reading with fallback to traditional paths
Loon FFI utilities (internal/core/src/storage/loon_ffi/):
- Add ToCStorageConfig() to convert StorageConfig to C-compatible
structure
- Implement GetManifest() to parse manifest JSON and retrieve column
groups via FFI
- Enhance MakePropertiesFromStorageConfig() integration
Storage V2 integration:
- Update milvus-storage dependency from 0883026 to 302143c for latest
FFI support
Protobuf changes:
- Add manifest field to BuildIndexInfo for passing manifest path to C++
layer
Configuration:
- Add common.storageV2.useLoonFFI config option (default: false) for
feature toggle
This change is part of issue #44956 to integrate the StorageV2 FFI
interface as the unified storage layer. The implementation maintains
backward compatibility by checking for manifest presence and falling
back to existing segment insert files approach when manifest is not
provided.
Related issue: #44956
---------
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
Related to #44956
**New Translator (C++)**
- Added `ManifestGroupTranslator`
(`internal/core/src/segcore/storagev2translator/`)
- Translates manifest-based column groups to Milvus internal format
- Implements `GroupCTMeta` interface for chunk-based column access
- Supports both memory and mmap storage modes
- Handles cache warmup policies for vector and scalar data
**ChunkedSegmentSealedImpl**
(`internal/core/src/segcore/ChunkedSegmentSealedImpl.cpp:333`)
- Added `LoadColumnGroups(const std::string& manifest_path)`: Main entry
point for manifest-based loading
- Creates milvus-storage Reader from manifest file
- Parallelizes column group loading using thread pool
- Aggregates loading exceptions and reports errors
- Added `LoadColumnGroup()`: Loads individual column group
- Extracts field IDs from column group metadata
- Creates ManifestGroupTranslator for each column group
- Builds ProxyChunkColumn for field access
- Special handling for timestamp field index construction
**SegmentGrowingImpl**
(`internal/core/src/segcore/SegmentGrowingImpl.cpp`)
- Added similar `LoadColumnGroups()` and `LoadColumnGroup()` methods for
growing segments
- Maintains consistency with sealed segment loading path
Storage FFI Utilities
**loon_ffi/util** (`internal/core/src/storage/loon_ffi/util.cpp`)
- Added `MakeInternalPropertiesFromStorageConfig()`: Converts C storage
config to internal Properties
- Maps all storage configuration fields (S3, GCS, Azure, local)
- Handles SSL, IAM, virtual host settings
- Configures connection timeouts and max connections
- Added `MakeInternalLocalProperies()`: Creates local filesystem
properties
- Added `ToCStorageConfig()`: Converts Go StorageConfig to C
representation
- Added `GetColumnGroups()`: Extracts column groups from manifest file
using Transaction API
Protocol Buffer Changes
**segcore.proto** (`pkg/proto/segcore.proto:121`)
- Added `manifest_path` field to `SegmentLoadInfo` message
- Enables passing manifest file path from Go layer to C++ core
Go Integration
**segment.go** (`internal/util/segcore/segment.go:372`)
- Updated `ConvertToSegcoreSegmentLoadInfo()` to propagate
`ManifestPath` field
- Bridges QueryNode segment load info to Segcore format
---------
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
Related #44956
This commit integrates the Storage V2 FFI (Foreign Function Interface)
interface throughout the Milvus codebase, enabling unified storage
access through the Loon FFI layer. This is a significant step towards
standardizing storage operations across different storage versions.
1. Configuration Support
- **configs/milvus.yaml**: Added `useLoonFFI` configuration flag under
`common.storage.file.splitByAvgSize` section
- Allows runtime toggle between traditional binlog readers and new
FFI-based manifest readers
- Default: `false` (maintains backward compatibility)
2. Core FFI Infrastructure
Enhanced Utilities (internal/core/src/storage/loon_ffi/util.cpp/h)
- **ToCStorageConfig()**: Converts Go's `StorageConfig` to C's
`CStorageConfig` struct for FFI calls
- **GetManifest()**: Parses manifest JSON and retrieves latest column
groups using FFI
- Accepts manifest path with `base_path` and `ver` fields
- Calls `get_latest_column_groups()` FFI function
- Returns column group information as string
- Comprehensive error handling for JSON parsing and FFI errors
3. Dependency Updates
- **internal/core/thirdparty/milvus-storage/CMakeLists.txt**:
- Updated milvus-storage version from `0883026` to `302143c`
- Ensures compatibility with latest FFI interfaces
4. Data Coordinator Changes
All compaction task builders now include manifest path in segment
binlogs:
- **compaction_task_clustering.go**: Added `Manifest:
segInfo.GetManifestPath()` to segment binlogs
- **compaction_task_l0.go**: Added manifest path to both L0 segment
selection and compaction plan building
- **compaction_task_mix.go**: Added manifest path to mixed compaction
segment binlogs
- **meta.go**: Updated metadata completion logic:
- `completeClusterCompactionMutation()`: Set `ManifestPath` in new
segment info
- `completeMixCompactionMutation()`: Preserve manifest path in compacted
segments
- `completeSortCompactionMutation()`: Include manifest path in sorted
segments
5. Data Node Compactor Enhancements
All compactors updated to support dual-mode reading (binlog vs
manifest):
6. Flush & Sync Manager Updates
Pack Writer V2 (pack_writer_v2.go)
- **BulkPackWriterV2.Write()**: Extended return signature to include
`manifest string`
- Implementation:
- Generate manifest path: `path.Join(pack.segmentID, "manifest.json")`
- Write packed data using FFI-based writer
- Return manifest path along with binlogs, deltas, and stats
Task Handling (task.go)
- Updated all sync task result handling to accommodate new manifest
return value
- Ensured backward compatibility for callers not using manifest
7. Go Storage Layer Integration
New Interfaces and Implementations
- **record_reader.go**: Interface for unified record reading across
storage versions
- **record_writer.go**: Interface for unified record writing across
storage versions
- **binlog_record_writer.go**: Concrete implementation for traditional
binlog-based writing
Enhanced Schema Support (schema.go, schema_test.go)
- Schema conversion utilities to support FFI-based storage operations
- Ensures proper Arrow schema mapping for V2 storage
Serialization Updates
- **serde.go, serde_events.go, serde_events_v2.go**: Updated to work
with new reader/writer interfaces
- Test files updated to validate dual-mode serialization
8. Storage V2 Packed Format
FFI Common (storagev2/packed/ffi_common.go)
- Common FFI utilities and type conversions for packed storage format
Packed Writer FFI (storagev2/packed/packed_writer_ffi.go)
- FFI-based implementation of packed writer
- Integrates with Loon storage layer for efficient columnar writes
Packed Reader FFI (storagev2/packed/packed_reader_ffi.go)
- Already existed, now complemented by writer implementation
9. Protocol Buffer Updates
data_coord.proto & datapb/data_coord.pb.go
- Added `manifest` field to compaction segment messages
- Enables passing manifest metadata through compaction pipeline
worker.proto & workerpb/worker.pb.go
- Added compaction parameter for `useLoonFFI` flag
- Allows workers to receive FFI configuration from coordinator
10. Parameter Configuration
component_param.go
- Added `UseLoonFFI` parameter to compaction configuration
- Reads from `common.storage.file.useLoonFFI` config path
- Default: `false` for safe rollout
11. Test Updates
- **clustering_compactor_storage_v2_test.go**: Updated signatures to
handle manifest return value
- **mix_compactor_storage_v2_test.go**: Updated test helpers for
manifest support
- **namespace_compactor_test.go**: Adjusted writer calls to expect
manifest
- **pack_writer_v2_test.go**: Validated manifest generation in pack
writing
This integration follows a **dual-mode approach**:
1. **Legacy Path**: Traditional binlog-based reading/writing (when
`useLoonFFI=false` or no manifest)
2. **FFI Path**: Manifest-based reading/writing through Loon FFI (when
`useLoonFFI=true` and manifest exists)
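A minimal Go sketch of the dual-mode selection rule; the flag and the manifest-presence check come from this PR, while the types are placeholders.
```
package main

import "fmt"

type segmentBinlogs struct {
	Manifest string // new field: empty when no manifest was written
}

// pickReadPath encodes the dual-mode rule: the FFI path is used only when the
// feature flag is on AND a manifest exists; otherwise fall back to binlogs.
func pickReadPath(useLoonFFI bool, seg segmentBinlogs) string {
	if useLoonFFI && seg.Manifest != "" {
		return "loon-ffi manifest reader: " + seg.Manifest
	}
	return "legacy binlog reader"
}

func main() {
	fmt.Println(pickReadPath(true, segmentBinlogs{Manifest: "seg-1/manifest.json"}))
	fmt.Println(pickReadPath(true, segmentBinlogs{}))                                 // flag on, no manifest -> legacy
	fmt.Println(pickReadPath(false, segmentBinlogs{Manifest: "seg-1/manifest.json"})) // flag off -> legacy
}
```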
---------
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
issue: #43897
- Part of the collection/index related DDL is now implemented by the
WAL-based DDL framework.
- Support the following message types in the WAL: CreateCollection,
DropCollection, CreatePartition, DropPartition, CreateIndex, AlterIndex,
DropIndex.
- Part of the collection/index related DDL can be synced by the new CDC
now.
- Refactor some UTs for collection/index DDL.
- Add a Tombstone scheduler to manage tombstone GC for collection or
partition meta.
- Move vchannel allocation into the streaming pchannel manager.
---------
Signed-off-by: chyezh <chyezh@outlook.com>
Related to #44956
Add manifest_path field throughout the data path to support LOON Storage
V2 manifest tracking. The manifest stores metadata for segment data
files and enables the unified Storage V2 FFI interface.
Changes include:
- Add manifest_path field to SegmentInfo and SaveBinlogPathsRequest
proto messages
- Add UpdateManifest operator to datacoord meta operations
- Update metacache, sync manager, and meta writer to propagate manifest
paths
- Include manifest_path in segment load info for query coordinator
This is part of the Storage V2 FFI interface integration.
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
This commit introduces the foundation for enabling segments to manage
their own loading process by passing load information during segment
creation.
Changes:
C++ Layer:
- Add NewSegmentWithLoadInfo() C API to create segments with serialized
load info
- Add SetLoadInfo() method to SegmentInterface for storing load
information
- Refactor segment creation logic into shared CreateSegment() helper
function
- Add comprehensive documentation for the new API
Go Layer:
- Extend CreateCSegmentRequest to support optional LoadInfo field
- Update segment creation in querynode to pass SegmentLoadInfo when
available
- Add ConvertToSegcoreSegmentLoadInfo() and helper converters for proto
translation
Proto Definitions:
- Add segcorepb.SegmentLoadInfo message with essential loading metadata
- Add supporting messages: Binlog, FieldBinlog, FieldIndexInfo,
TextIndexStats, JsonKeyStats
- Remove dependency on data_coord.proto by creating segcore-specific
definitions
Testing:
- Add comprehensive unit tests for proto conversion functions
- Test edge cases including nil inputs, empty data, and nil array/map
elements
This is the first step toward issue #45060 - enabling segments to
autonomously manage their loading process, which will:
- Clarify responsibilities between Go and C++ layers
- Reduce cross-language call overhead
- Enable precise resource management at the C++ level
- Support better integration with caching layer
- Enable proactive schema evolution handling
Related to #45060
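A minimal Go sketch of creating a segment with load info attached; field names only approximate CreateCSegmentRequest and segcorepb.SegmentLoadInfo.
```
package main

import "fmt"

type segmentLoadInfo struct {
	SegmentID int64
	Binlogs   []string
}

// createCSegmentRequest mirrors the extended request: LoadInfo is optional,
// so existing callers keep working unchanged.
type createCSegmentRequest struct {
	SegmentID int64
	LoadInfo  *segmentLoadInfo
}

func newSegment(req createCSegmentRequest) string {
	if req.LoadInfo != nil {
		// Serialized load info crosses the cgo boundary once; the C++ segment
		// can then drive its own loading without further Go round-trips.
		return fmt.Sprintf("segment %d created with %d binlogs to self-load",
			req.SegmentID, len(req.LoadInfo.Binlogs))
	}
	return fmt.Sprintf("segment %d created; loading driven from Go as before", req.SegmentID)
}

func main() {
	fmt.Println(newSegment(createCSegmentRequest{SegmentID: 1}))
	fmt.Println(newSegment(createCSegmentRequest{
		SegmentID: 2,
		LoadInfo:  &segmentLoadInfo{SegmentID: 2, Binlogs: []string{"b1", "b2"}},
	}))
}
```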
---------
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
relate: https://github.com/milvus-io/milvus/issues/43687
We used to run the temporary analyzer and analyzer validation on the
proxy, but the proxy should not be a computation-heavy node. This PR
moves all analyzer computation to the streaming node.
---------
Signed-off-by: aoiasd <zhicheng.yue@zilliz.com>
issue: #43897
- Resource group related DDL is now implemented by the WAL-based DDL
framework.
- Support the following message types in the WAL: AlterResourceGroup,
DropResourceGroup.
- Resource group DDL can be synced by the new CDC now.
- Refactor some UTs for resource group DDL.
---------
Signed-off-by: chyezh <chyezh@outlook.com>
issue: #43897
- RBAC (Roles/Users/Privileges/Privilege Groups) is now implemented by
the WAL-based DDL framework.
- Support the following message types in the WAL: `AlterUser`,
`DropUser`, `AlterRole`, `DropRole`, `AlterUserRole`, `DropUserRole`,
`AlterPrivilege`, `DropPrivilege`, `AlterPrivilegeGroup`,
`DropPrivilegeGroup`, `RestoreRBAC`.
- RBAC can be synced by the new CDC now.
- Refactor some UTs for RBAC.
---------
Signed-off-by: chyezh <chyezh@outlook.com>
issue: #43427
This PR's main goal is to merge #37417 into Milvus 2.5 without conflicts.
# Main Goals
1. Create and describe collections with geospatial type
2. Insert geospatial data into the insert binlog
3. Load segments containing geospatial data into memory
4. Enable query and search to display geospatial data
5. Support using GIS functions like ST_EQUALS in query
6. Support R-Tree index for geometry type
# Solution
1. **Add Type**: Modify the Milvus core by adding a Geospatial type in
both the C++ and Go code layers, defining the Geospatial data structure
and the corresponding interfaces.
2. **Dependency Libraries**: Introduce necessary geospatial data
processing libraries. In the C++ source code, use Conan package
management to include the GDAL library. In the Go source code, add the
go-geom library to the go.mod file.
3. **Protocol Interface**: Revise the Milvus protocol to provide
mechanisms for Geospatial message serialization and deserialization.
4. **Data Pipeline**: Facilitate interaction between the client and
proxy using the WKT format for geospatial data. The proxy will convert
all data into WKB format for downstream processing, providing column
data interfaces, segment encapsulation, segment loading, payload
writing, and cache block management.
5. **Query Operators**: Implement simple display and support for filter
queries. Initially, focus on filtering based on spatial relationships
for a single column of geospatial literal values, providing parsing and
execution for query expressions. Currently only brute-force search is
supported.
6. **Client Modification**: Enable the client to handle user input for
geospatial data and facilitate end-to-end testing. See the corresponding
modifications in pymilvus.
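Since the proxy converts client-facing WKT into WKB for downstream processing, here is a minimal Go sketch of that conversion using the go-geom library the PR adds to go.mod; the function is illustrative, not Milvus's actual call site.
```
package main

import (
	"fmt"

	"github.com/twpayne/go-geom/encoding/wkb"
	"github.com/twpayne/go-geom/encoding/wkt"
)

// wktToWKB mirrors the proxy-side conversion: clients speak WKT, downstream
// components store and process WKB.
func wktToWKB(s string) ([]byte, error) {
	g, err := wkt.Unmarshal(s) // parse client-provided WKT
	if err != nil {
		return nil, fmt.Errorf("invalid WKT %q: %w", s, err)
	}
	return wkb.Marshal(g, wkb.NDR) // little-endian WKB for storage
}

func main() {
	b, err := wktToWKB("POINT (116.39 39.91)")
	if err != nil {
		panic(err)
	}
	fmt.Printf("% x\n", b) // 21-byte WKB point
}
```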
---------
Signed-off-by: Yinwei Li <yinwei.li@zilliz.com>
Signed-off-by: Cai Zhang <cai.zhang@zilliz.com>
Co-authored-by: ZhuXi <150327960+Yinwei-Yu@users.noreply.github.com>
Also support setting the function mode and boost mode when running
search with boost.
RandomScore supports generating a random function score in [0, weight).
FunctionMode decides how to combine multiple boost function scores into
the boost score.
BoostMode decides how to combine the original score and the boost score
into the final score (see the sketch below).
relate: https://github.com/milvus-io/milvus/issues/43867
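A minimal Go sketch of the two-stage combination described above; the mode names are common conventions (sum, multiply) and may not match Milvus's exact options.
```
package main

import (
	"fmt"
	"math/rand"
)

// randomScore returns a score in [0, weight), like the RandomScore function.
func randomScore(weight float64) float64 { return rand.Float64() * weight }

// functionMode combines the individual boost function scores into one boost score.
func functionMode(mode string, scores []float64) float64 {
	out := 0.0
	if mode == "multiply" {
		out = 1.0
	}
	for _, s := range scores {
		switch mode {
		case "sum":
			out += s
		case "multiply":
			out *= s
		}
	}
	return out
}

// boostMode combines the original relevance score with the boost score.
func boostMode(mode string, origin, boost float64) float64 {
	switch mode {
	case "sum":
		return origin + boost
	case "multiply":
		return origin * boost
	}
	return origin
}

func main() {
	fnScores := []float64{randomScore(2.0), 0.5}
	boost := functionMode("sum", fnScores)
	fmt.Println(boostMode("multiply", 0.8, boost)) // final score
}
```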
---------
Signed-off-by: aoiasd <zhicheng.yue@zilliz.com>