issue: #44358
Implement a complete snapshot management system, covering creation,
deletion, listing, description, and restoration across all system
components.
Key features:
- Create snapshots for entire collections
- Drop snapshots by name with proper cleanup
- List snapshots with collection filtering
- Describe snapshot details and metadata
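A hypothetical Go usage sketch of the client snapshot API listed above; the `CreateSnapshot`/`ListSnapshots` methods and option constructors are assumptions for illustration, not the verified SDK surface:

```go
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/milvus-io/milvus/client/v2/milvusclient"
)

func main() {
	ctx := context.Background()
	cli, err := milvusclient.New(ctx, &milvusclient.ClientConfig{Address: "localhost:19530"})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close(ctx)

	// Create a point-in-time snapshot of a collection (method/option names hypothetical).
	if err := cli.CreateSnapshot(ctx, milvusclient.NewCreateSnapshotOption("my_collection", "snap_v1")); err != nil {
		log.Fatal(err)
	}

	// List snapshots filtered by collection (hypothetical API).
	names, err := cli.ListSnapshots(ctx, milvusclient.NewListSnapshotsOption("my_collection"))
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(names)
}
```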
Components added/modified:
- Client SDK with full snapshot API support and options
- DataCoord snapshot service with metadata management
- Proxy layer with task-based snapshot operations
- Protocol buffer definitions for snapshot RPCs
- Comprehensive unit tests with mockey framework
- Integration tests for end-to-end validation
Technical implementation:
- Snapshot metadata storage in etcd with proper indexing
- File-based snapshot data persistence in object storage
- Garbage collection integration for snapshot cleanup
- Error handling and validation across all operations
- Thread-safe operations with proper locking mechanisms
- Core invariant/assumption: snapshots are immutable point‑in‑time
captures identified by (collection, snapshot name/ID); etcd snapshot
metadata is authoritative for lifecycle (PENDING → COMMITTED → DELETING)
and per‑segment manifests live in object storage (Avro / StorageV2). GC
and restore logic must see snapshotRefIndex loaded
(snapshotMeta.IsRefIndexLoaded) before reclaiming or relying on
segment/index files.
- New capability added: full end‑to‑end snapshot subsystem — client SDK
APIs (Create/Drop/List/Describe/Restore + restore job queries),
DataCoord SnapshotWriter/Reader (Avro + StorageV2 manifests),
snapshotMeta in meta, SnapshotManager orchestration
(create/drop/describe/list/restore), copy‑segment restore
tasks/inspector/checker, proxy & RPC surface, GC integration, and
docs/tests — enabling point‑in‑time collection snapshots persisted to
object storage and restorations orchestrated across components.
- Logic removed/simplified and why: duplicated recursive
compaction/delta‑log traversal and ad‑hoc lookup code were consolidated
behind two focused APIs/owners (Handler.GetDeltaLogFromCompactTo for
delta traversal and SnapshotManager/SnapshotReader for snapshot I/O).
MixCoord/coordinator broker paths were converted to thin RPC proxies.
This eliminates multiple implementations of the same traversal/lookup,
reducing divergence and simplifying responsibility boundaries.
- Why this does NOT introduce data loss or regressions: snapshot
create/drop use explicit two‑phase semantics (PENDING → COMMIT/DELETING)
with SnapshotWriter writing manifests and metadata before commit; GC
uses snapshotRefIndex guards and
IsRefIndexLoaded/GetSnapshotBySegment/GetSnapshotByIndex checks to avoid
removing referenced files; restore flow pre‑allocates job IDs, validates
resources (partitions/indexes), performs rollback on failure
(rollbackRestoreSnapshot), and converts/updates segment/index metadata
only after successful copy tasks. Extensive unit and integration tests
exercise pending/deleting/GC/restore/error paths to ensure idempotence
and protection against premature deletion.
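A minimal sketch of the two-phase lifecycle described above (PENDING → COMMITTED → DELETING); the types and transitions here are illustrative stand-ins, not the actual snapshotMeta implementation:

```go
package snapshot

import (
	"fmt"
	"sync"
)

// State models the lifecycle for which etcd metadata is authoritative.
type State int

const (
	Pending State = iota
	Committed
	Deleting
)

// Meta is an illustrative stand-in for snapshotMeta: a locked map from
// snapshot name to lifecycle state.
type Meta struct {
	mu    sync.Mutex
	state map[string]State
}

func NewMeta() *Meta { return &Meta{state: make(map[string]State)} }

// Begin registers PENDING before any manifest is written, so a crash leaves
// only reclaimable pending metadata behind.
func (m *Meta) Begin(name string) error {
	m.mu.Lock()
	defer m.mu.Unlock()
	if _, ok := m.state[name]; ok {
		return fmt.Errorf("snapshot %q already exists", name)
	}
	m.state[name] = Pending
	return nil
}

// Commit flips PENDING to COMMITTED only after manifests and metadata are
// persisted; GC never reclaims files referenced by a committed snapshot.
func (m *Meta) Commit(name string) error {
	m.mu.Lock()
	defer m.mu.Unlock()
	if m.state[name] != Pending {
		return fmt.Errorf("snapshot %q is not pending", name)
	}
	m.state[name] = Committed
	return nil
}

// Drop marks DELETING first, so cleanup can resume idempotently after a crash.
func (m *Meta) Drop(name string) error {
	m.mu.Lock()
	defer m.mu.Unlock()
	if _, ok := m.state[name]; !ok {
		return fmt.Errorf("snapshot %q not found", name)
	}
	m.state[name] = Deleting
	return nil
}
```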
---------
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
issue: #45640
- With async logging, C logs and Go logs have no ordering guarantee, and
the C log format is inconsistent with the Go log format; so we disable
glog's own output and forward all log records to the Go side, where they
are handled by the async zap logger.
- Use CGO to capture all C-side logging and guarantee ordering between C
logs and Go logs.
- Also fix the metric name and add a new metric to count log entries.
- TODO: once woodpecker uses the Milvus logger, we can use a bigger
buffer for logging.
- Core invariant: all C (glog) and Go logs must be routed through the
same zap async pipeline so ordering and formatting are preserved; this
PR ensures every glog emission is captured and forwarded to zap before
any async buffering can cause the outputs to diverge.
- Logic removed/simplified: direct glog outputs and hard
stdout/stderr/log_dir settings are disabled (configs/glog.conf and flags
in internal/core/src/config/ConfigKnowhere.cpp) because they are
redundant once a single zap sink handles all logs; logging metrics were
simplified from per-length/volatile gauges to totalized counters
(pkg/metrics/logging_metrics.go & pkg/log/*), removing duplicate
length-tracking and making accounting consistent.
- No data loss or behavior regression (concrete code paths): Google
logging now adds a GoZapSink (internal/core/src/common/logging_c.h,
logging_c.cpp) that calls the exported CGO bridge goZapLogExt
(internal/util/cgo/logging/logging.go). Go side uses
C.GoStringN/C.GoString to capture full message and file, maps glog
severities to zapcore levels, preserves caller info, and writes via the
existing zap async core (same write path used by Go logs). The C++
send() trims glog's trailing newline and forwards exact buffers/lengths,
so message content, file, line, and severity are preserved and
serialized through the same async writer—no log entries are dropped or
reordered relative to Go logs.
- Capability added (where it takes effect): a CGO bridge that forwards
glog into zap—new Go-exported function goZapLogExt
(internal/util/cgo/logging/logging.go), a GoZapSink in C++ that forwards
glog sends (internal/core/src/common/logging_c.h/.cpp), and blank
imports of the cgo initializer across multiple packages (various
internal/* files) to ensure the bridge is registered early so all C logs
are captured.
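A hedged sketch of such a CGO bridge; the exact goZapLogExt signature and severity encoding are assumptions, and fatal severities are mapped to error level here so the sketch stays side-effect free:

```go
package logging

// #include <stddef.h>
import "C"

import (
	"go.uber.org/zap"
	"go.uber.org/zap/zapcore"
)

// toZapLevel maps assumed glog severities (INFO=0, WARNING=1, ERROR=2,
// FATAL=3) onto zap levels; FATAL is downgraded so the bridge never exits.
func toZapLevel(sev C.int) zapcore.Level {
	switch sev {
	case 1:
		return zapcore.WarnLevel
	case 2, 3:
		return zapcore.ErrorLevel
	default:
		return zapcore.InfoLevel
	}
}

//export goZapLogExt
func goZapLogExt(sev C.int, file *C.char, fileLen C.int, line C.int, msg *C.char, msgLen C.int) {
	// C.GoStringN copies the exact byte length, so the trimmed glog
	// buffers arrive intact on the Go side.
	m := C.GoStringN(msg, msgLen)
	f := C.GoStringN(file, fileLen)
	if ce := zap.L().Check(toZapLevel(sev), m); ce != nil {
		// Written through the same zap core as Go logs, preserving order.
		ce.Write(zap.String("file", f), zap.Int("line", int(line)))
	}
}
```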
Signed-off-by: chyezh <chyezh@outlook.com>
Introduce a ScannerStartupDelay configuration to enable WAL write-only
recovery, allowing fence messages to be persisted during
primary–secondary switchover when the StreamingNode is trapped in crash
loops.
issue: https://github.com/milvus-io/milvus/issues/46368
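A minimal sketch of what such a startup delay enables, under assumed type names (the real StreamingNode WAL types differ): the append path opens immediately, so fence messages can be persisted, while the scanner is held back:

```go
package wal

import "time"

// Illustrative stand-in for the StreamingNode WAL.
type WAL struct {
	scannerStarted chan struct{}
}

func (w *WAL) enableWrites() { /* open the append path */ }
func (w *WAL) startScanner() { close(w.scannerStarted) }

// Open accepts writes (including fence messages) immediately and defers the
// scanner by ScannerStartupDelay, so a node stuck in crash loops can still
// persist a fence during primary-secondary switchover.
func Open(scannerStartupDelay time.Duration) *WAL {
	w := &WAL{scannerStarted: make(chan struct{})}
	w.enableWrites()
	time.AfterFunc(scannerStartupDelay, w.startScanner)
	return w
}
```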
* **New Features**
* Added a configurable WAL scanner pause/resume and a consumer request
flag to optionally ignore pause signals.
* **Metrics**
* Added a scanner pause gauge and pause-duration tracking for WAL
scanning.
* **Tests**
* Added coverage for pause-consumption behavior and cleanup in stream
client tests.
* **Chores**
* Consolidated flush-all logging into a single field and added a helper
for bulk message conversion.
---------
Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
This change fixes the lag calculation by using timestamp subtraction
(WAL confirmed time - last replicate time). This ensures the lag metric
spikes immediately when replication is blocked, providing reliable
monitoring.
issue: https://github.com/milvus-io/milvus/issues/46116
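A small sketch of the corrected calculation with illustrative names; the real code works on internal timestamps, but the shape is the same:

```go
package metrics

import "time"

// replicationLagSeconds reports WAL confirmed time minus last replicate time.
// When replication is blocked, lastReplicated stops advancing while
// walConfirmed keeps moving, so the lag gauge spikes immediately.
func replicationLagSeconds(walConfirmed, lastReplicated time.Time) float64 {
	return walConfirmed.Sub(lastReplicated).Seconds()
}
```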
Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
issue: #45640
- Logs may be dropped if the underlying file system is busy.
- Use an async write syncer so log operations do not block the major
Milvus components.
- Remove some log dependencies from the util functions to avoid a
dependency loop.
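A minimal sketch of an async write syncer using zap's BufferedWriteSyncer; the buffer size and flush interval are illustrative values, not the ones Milvus ships:

```go
package log

import (
	"os"
	"time"

	"go.uber.org/zap"
	"go.uber.org/zap/zapcore"
)

// newAsyncLogger buffers writes so a busy file system blocks the background
// flusher, not the calling goroutine. Real code keeps a handle to the
// syncer and calls Stop() on shutdown to flush remaining entries.
func newAsyncLogger() *zap.Logger {
	ws := &zapcore.BufferedWriteSyncer{
		WS:            zapcore.AddSync(os.Stdout),
		Size:          512 * 1024,       // illustrative buffer size
		FlushInterval: 30 * time.Second, // flush even when the buffer isn't full
	}
	core := zapcore.NewCore(
		zapcore.NewJSONEncoder(zap.NewProductionEncoderConfig()),
		ws,
		zap.InfoLevel,
	)
	return zap.New(core)
}
```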
---------
Signed-off-by: chyezh <chyezh@outlook.com>
issue: #44369
Woodpecker-related [issue #59](https://github.com/zilliztech/woodpecker/issues/59)
Refactor the WAL retention logic in Milvus StreamingNode:
- Remove the simple sampling-based truncation mechanism.
- After flush, WAL data is directly truncated.
- The retention control is now delegated to the underlying message queue
(MQ) implementation.
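A sketch of the new flow under assumed interfaces; the point is that truncation happens directly on flush, with longer-term retention left to the MQ:

```go
package retention

// Illustrative stand-ins for the StreamingNode WAL abstractions.
type MessageID int64

type WAL interface {
	Truncate(upTo MessageID) error // drop WAL data at/before this position
}

// onFlushDone truncates the WAL directly to the flushed checkpoint instead
// of sampling; anything older is retained (or not) by the underlying MQ.
func onFlushDone(w WAL, flushedCheckpoint MessageID) error {
	return w.Truncate(flushedCheckpoint)
}
```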
Signed-off-by: tinswzy <zhenyuan.wei@zilliz.com>
issue: #44212
Also, record metrics only when storageUsageTracking is enabled.
Use MB for scanned_remote counter and scanned_total counter metrics to
avoid overflow.
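A sketch of the guarded, MB-denominated recording; the metric name and tracking flag are illustrative:

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

var scannedRemoteMB = prometheus.NewCounter(prometheus.CounterOpts{
	Name: "scanned_remote_mb", // illustrative name
	Help: "Remote storage scanned, in MB.",
})

// recordScannedRemote records only when storage-usage tracking is enabled,
// and converts bytes to MB so very large totals don't overflow the counter.
func recordScannedRemote(bytes int64, trackingEnabled bool) {
	if !trackingEnabled {
		return
	}
	scannedRemoteMB.Add(float64(bytes) / (1024 * 1024))
}
```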
---------
Signed-off-by: chasingegg <chao.gao@zilliz.com>
issue: #44123
- Support replicate messages in the WAL of Milvus.
- Support CDC-replicate recovery from the WAL.
- Fix some CDC replicator bugs.
Signed-off-by: chyezh <chyezh@outlook.com>
issue: #44212
Implement search/query storage usage statistics on the Go side (result
reduce); for now, storage usage is recorded only in the vector search
C++ path. The query C++ path will be implemented in follow-up PRs.
---------
Signed-off-by: chasingegg <chao.gao@zilliz.com>
Signed-off-by: marcelo.chen <marcelo.chen@zilliz.com>
Co-authored-by: marcelo.chen <marcelo.chen@zilliz.com>
Related to #43966 and #43809
This PR:
- Consolidate distributed request metrics collection into one interceptor
- Add `Retry` and `Reject` labels to represent retryable-error and
auth-rejection cases
---------
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
issue: #42416
- Rename InsertMetric to ModifiedMetric.
- Add L0 control configuration.
- Add collection of some L0 current-state metrics.
Signed-off-by: chyezh <chyezh@outlook.com>
issue: #43828
Implement robust rewatch mechanism to handle etcd connection failures
and node reconnection scenarios in DataCoord and QueryCoord, along with
heartbeat lag monitoring capabilities.
Changes include:
- Implement rewatchDataNodes/rewatchQueryNodes callbacks for etcd
reconnection scenarios
- Add idempotent rewatchNodes method to handle etcd session recovery
gracefully
- Add QueryCoordLastHeartbeatTimeStamp metric for monitoring node
heartbeat lag
- Clean up heartbeat metrics when nodes go down to prevent metric leaks
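A sketch of what makes the rewatch idempotent, with illustrative types: re-listing sessions and skipping already-watched nodes makes the callback safe to run on every etcd session recovery:

```go
package coord

import "sync"

// Illustrative stand-ins for coordinator session bookkeeping.
type Session struct{ ServerID string }

type Server struct {
	mu      sync.Mutex
	watched map[string]struct{}
}

// rewatchNodes can be invoked on every etcd session recovery: nodes that are
// already tracked are skipped, so repeated calls are no-ops rather than
// duplicate watches.
func (s *Server) rewatchNodes(sessions map[string]*Session) {
	s.mu.Lock()
	defer s.mu.Unlock()
	for id, sess := range sessions {
		if _, ok := s.watched[id]; ok {
			continue // idempotent: already watching this node
		}
		s.watched[id] = struct{}{}
		go s.monitorHeartbeat(sess) // start heartbeat-lag tracking for the node
	}
}

func (s *Server) monitorHeartbeat(sess *Session) { /* update heartbeat metric */ }
```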
---------
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
issue: https://github.com/milvus-io/milvus/issues/41690
This commit implements partial search result functionality when query
nodes go down, improving system availability during node failures. The
changes include:
- Enhanced load balancing in proxy (lb_policy.go) to handle node
failures with retry support
- Added partial search result capability in querynode delegator and
distribution logic
- Implemented tests for various partial result scenarios when nodes go
down
- Added metrics to track partial search results in querynode_metrics.go
- Updated parameter configuration to support partial result required
data ratio
- Replaced old partial_search_test.go with more comprehensive
partial_result_on_node_down_test.go
- Updated proto definitions and improved retry logic
These changes improve query resilience by returning partial results to
users when some query nodes are unavailable, ensuring that queries don't
completely fail when a portion of data remains accessible.
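A sketch of the required-data-ratio gate implied above, with assumed names:

```go
package delegator

// canServePartial reports whether enough data remains reachable to return a
// partial result instead of failing the query outright; requiredRatio comes
// from the partial-result configuration mentioned above (name assumed).
func canServePartial(accessibleRows, totalRows int64, requiredRatio float64) bool {
	if totalRows == 0 {
		return false
	}
	return float64(accessibleRows)/float64(totalRows) >= requiredRatio
}
```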
---------
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
issue: #41853
- Persist the estimated binary size of insert messages into the WAL.
- Add a metric to record the total growing rows per channel.
Signed-off-by: chyezh <chyezh@outlook.com>
issue: #41544
- Add a lock interceptor into the WAL.
- Use recovery and the shard manager to replace the original segment
assignment implementation.
- Remove redundant implementations and unit tests.
- Remove redundant proto definitions.
- Use 2 StreamingNodes in e2e tests.
---------
Signed-off-by: chyezh <chyezh@outlook.com>
When autoID is enabled, the preimport task estimates row distribution by
evenly dividing the total row count (numRows) across all vchannels:
`estimatedCount = numRows / vchannelNum`.
However, the actual import task hashes real auto-generated IDs to
determine the target vchannel. This mismatch can lead to inaccurate row
distribution estimation in corner cases such as:
- Importing 1 row into 2 vchannels:
• Preimport: 1 / 2 = 0 → both v0 and v1 are estimated to have 0 rows
• Import: real autoID (e.g., 457975852966809057) hashes to v1
→ actual result: v0 = 0, v1 = 1
To resolve this corner case, we now allocate at least one segment for
each vchannel when autoID is enabled, ensuring all vchannels are
prepared to receive data even if no rows are estimated for them.
issue: https://github.com/milvus-io/milvus/issues/41759
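A sketch of the floor-division pitfall and the fix, with illustrative names:

```go
package importv2

// estimatedRowsPerVChannel mirrors the preimport estimate: integer division
// floors to zero for tiny imports (1 / 2 == 0), leaving vchannels that the
// import's autoID hashing may still route rows into.
func estimatedRowsPerVChannel(numRows, vchannelNum int) int {
	return numRows / vchannelNum
}

// allocateSegments applies the fix: with autoID enabled, every vchannel is
// allocated at least one segment even when its estimated row count is zero,
// so hashed autoIDs always have a destination.
func allocateSegments(estimatedRows int, autoID bool) int {
	segments := estimatedRows // illustrative 1:1 mapping; real sizing differs
	if autoID && segments == 0 {
		segments = 1
	}
	return segments
}
```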
---------
Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
issue: #41544
- Implement an in-memory shard manager to maintain shard state on the
write-ahead path.
- Remove all RPC and meta operations from the write-ahead path, so the
segment assignment logic uses only the WAL and memory.
- Refactor global stats management and add a node-level flush policy.
- Fix a recovery storage inconsistency bug on graceful close.
Signed-off-by: chyezh <chyezh@outlook.com>
enhance: update MixCoord registration in MilvusRoles
The `runMixCoord` function in `MilvusRoles` was updated to use the
`RegisterMixCoord` function from the `rootcoord_metrics` package instead
of `RegisterRootCoord`. This change aligns with the recent modifications
made to the `rootcoord_metrics` package.
issue: https://github.com/milvus-io/milvus/issues/41338
---------
Signed-off-by: Xianhui.Lin <xianhui.lin@zilliz.com>
issue: #40638
- Add `ChannelID` for streaming replicas, for future use.
- Remove the pchannel-count fair balance policy for streaming.
- Add a score-based vchannel fair balance policy for streaming.
- Add a pchannel stats manager to collect pchannel stats for the
balancer.
- Add configuration and metrics for the new balance policy.
---------
Signed-off-by: chyezh <chyezh@outlook.com>
- Use CounterVec to calculate the sum of increase over a time period.
- Use the number of entries instead of binlog size.
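A minimal sketch of the CounterVec approach, with an illustrative metric name: a monotonic counter lets dashboards compute the sum of increase over a window (e.g. PromQL `sum(increase(flushed_entries_total[5m]))`), which a raw gauge cannot do reliably:

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// flushedEntries counts entries (not binlog bytes), labeled per channel.
var flushedEntries = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "flushed_entries_total", // illustrative name
		Help: "Total number of entries flushed.",
	},
	[]string{"channel"},
)

func init() { prometheus.MustRegister(flushedEntries) }

// recordFlush adds the number of flushed entries for a channel; Prometheus
// rate()/increase() then derives the per-window sum from the counter.
func recordFlush(channel string, entries int) {
	flushedEntries.WithLabelValues(channel).Add(float64(entries))
}
```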
---------
Signed-off-by: yangxuan <xuan.yang@zilliz.com>
issue: #38399
related PR: #39522
- Implement an exclusive broadcaster so that broadcast messages with the
same resource key keep the same order across different WALs.
- After simplifying the broadcast model, the original watch-based
broadcast became overly complicated and redundant, so it is removed.
---------
Signed-off-by: chyezh <chyezh@outlook.com>
issue: #38399
- Add metrics for the broadcaster component.
- Add metrics for the WAL flusher component.
- Add metrics for WAL interceptors.
- Add slow logs for the WAL.
- Add more labels to some WAL metrics (local or remote, catchup or
tailing, ...).
Signed-off-by: chyezh <chyezh@outlook.com>
1. Provide partition- and channel-level indexing in the collection target.
2. Make `SegmentAction` not wait for distribution.
3. Remove scheduler and target manager mutex.
4. Optimize logging to reduce CPU overhead.
issue: https://github.com/milvus-io/milvus/issues/37630
---------
Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
Sparse vectors may have an arbitrary number of non-zeros (nnz), and it
is hard to optimize without knowing the actual nnz distribution. This PR
adds a metric for analyzing that.
issue: https://github.com/milvus-io/milvus/issues/35853
Compared with https://github.com/milvus-io/milvus/pull/38328, this also
includes the metric for FTS in the query node delegator.
Also fixed a bug in sparse search when searching by PK.
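A sketch of such an nnz-distribution metric; the name and buckets are illustrative:

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// sparseNNZ records how many non-zeros each sparse vector carries, giving
// the distribution needed to guide optimization (name/buckets illustrative).
var sparseNNZ = prometheus.NewHistogram(prometheus.HistogramOpts{
	Name:    "sparse_vector_nnz",
	Help:    "Distribution of non-zero counts in sparse vectors.",
	Buckets: prometheus.ExponentialBuckets(1, 2, 16), // 1, 2, 4, ..., 32768
})

func observeSparseVector(nnz int) {
	sparseNNZ.Observe(float64(nnz))
}
```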
Signed-off-by: Buqian Zheng <zhengbuqian@gmail.com>