milvus

mirror of https://gitee.com/milvus-io/milvus.git synced 2026-01-07 19:31:51 +08:00

Author	SHA1	Message	Date
Zhen Ye	bb913dd837	fix: simplify go ut (#46606 ) issue: #46500 - simplify the run_go_codecov.sh to make sure the set -e to protect any sub command failure. - remove all embed etcd in test to make full test can be run at local. <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## PR Summary: Simplify Go Unit Tests by Removing Embedded etcd and Async Startup Scaffolding Core Invariant: This PR assumes that unit tests can be simplified by running without embedded etcd servers (delegating to environment-based or external etcd instances via `kvfactory.GetEtcdAndPath()` or `ETCD_ENDPOINTS`) and by removing goroutine-based async startup scaffolding in favor of synchronous component initialization. Tests remain functionally equivalent while becoming simpler to run and debug locally. What is Removed or Simplified: 1. Embedded etcd test infrastructure deleted: Removes `EmbedEtcdUtil` type and its public methods (SetupEtcd, TearDownEmbedEtcd) from `pkg/util/testutils/embed_etcd.go`, removes the `StartTestEmbedEtcdServer()` helper from `pkg/util/etcd/etcd_util.go`, and removes etcd embedding from test suites (e.g., `TaskSuite`, `EtcdSourceSuite`, `mixcoord/client_test.go`). Tests now either skip etcd-dependent tests (via `MILVUS_UT_WITHOUT_KAFKA=1` environment flag in `kafka_test.go`) or source etcd from external configuration (via `kvfactory.GetEtcdAndPath()` in `task_test.go`, or `ETCD_ENDPOINTS` environment variable in `etcd_source_test.go`). This eliminates the overhead of spinning up temporary etcd servers for unit tests. 2. Async startup scaffolding replaced with synchronous initialization: In `internal/proxy/proxy_test.go` and `proxy_rpc_test.go`, the `startGrpc()` method signature removes the `sync.WaitGroup` parameter; components are now created, prepared, and run synchronously in-place rather than in goroutines (e.g., `go testServer.startGrpc(ctx, &p)` becomes `testServer.startGrpc(ctx, &p)` running synchronously). Readiness checks (e.g., `waitForGrpcReady()`) remain in place to ensure startup safety without concurrency constructs. This simplifies control flow and reduces debugging complexity. 3. Shell script orchestration unified with proper error handling: In `scripts/run_go_codecov.sh` and `scripts/run_intergration_test.sh`, per-package inline test invocations are consolidated into a single `test_cmd()` function with unified `TEST_CMD_WITH_ARGS` array containing race, coverage, verbose, and other flags. The problematic `set -ex` is replaced with `set -e` alone (removing debug output noise while preserving strict error semantics), ensuring the scripts fail fast on any command failure. Why No Regression: - Test assertions and code paths remain unchanged; only deployment source of etcd (embedded → external) and startup orchestration (async → sync) change. - Readiness verification (e.g., `waitForGrpcReady()`) is retained, ensuring components are initialized before test execution. - Test flags (race detection, coverage, verbosity) are uniformly applied across all packages via unified `TEST_CMD_WITH_ARGS`, preserving test coverage and quality. - `set -e` alone is sufficient for strict failure detection without the `-x` flag's verbose output. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Signed-off-by: chyezh <chyezh@outlook.com>	2025-12-31 16:07:22 +08:00
wei liu	b907f9e7a8	fix: unify ro node handling to avoid balance channel task stuck (#46440 ) issue: #46393 RO node can be created from two sources: stopping a QueryNode or replica node transfer (e.g., suspend node). Before this fix, there were two defects and one constraint that caused a deadlock: Defects: 1. LeaderChecker does not sync segment distribution to RO nodes 2. Scheduler only cancels tasks on stopping nodes, not RO nodes Constraint: - Balance channel task blocks waiting for new delegator to become serviceable (via sync segment) before executing release action Deadlock scenario: When target node becomes RO node (but not stopping) during balance channel execution, the task gets stuck because: - Cannot sync segment to RO node (defect 1) -> task blocks - Task is not cancelled since node is not stopping (defect 2) PR #45949 attempted to fix defect 1 but was not successful. This PR unifies RO node handling by: - LeaderChecker: only sync segment distribution to RW nodes - Scheduler: cancel task when target node becomes RO node - Simplify checkStale logic with unified node state checking Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2025-12-24 10:31:19 +08:00
congqixia	1414065860	feat: query coord support segment reopen when manifest path changes (#46394 ) Related to #46358 Add segment reopen mechanism in QueryCoord to handle segment data updates when the manifest path changes. This enables QueryNode to reload segment data without full segment reload, supporting storage v2 incremental updates. Changes: - Add ActionTypeReopen action type and LoadScope_Reopen in protobuf - Track ManifestPath in segment distribution metadata - Add CheckSegmentDataReady utility to verify segment data matches target - Extend getSealedSegmentDiff to detect segments needing reopen - Create segment reopen tasks when manifest path differs from target - Block target update until segment data is ready --------- Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>	2025-12-17 22:15:16 +08:00
wei liu	e70c01362d	enhance: Add resource exhaustion querynode penalty policy (#45808 ) issue: #40513 for querynode which return resource exhausted error, add a penalty duration on it, and suspend loading new resource until penalty duration expired. --------- Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2025-12-02 16:59:11 +08:00
Zhen Ye	4f080bd3a0	fix: remove the streamingnode checking when loading segment (#45859 ) issue: #43117 If we enable checking when loading segments, all segment should always be loaded by streamingnode but not 2.5 querynode, make some search and query failure when upgrading. Otherwise, some search and query result will be wrong when upgrading. We choose to disable this checking for now to promise available search and query when upgrading. also see pr: #43346 Signed-off-by: chyezh <chyezh@outlook.com>	2025-11-28 10:09:08 +08:00
Zhen Ye	31976d8adb	fix: executor/scheduler should be latest replica meta but not replica copy (#45877 ) issue: #45865 --------- Signed-off-by: chyezh <chyezh@outlook.com>	2025-11-28 06:59:08 +08:00
Spade A	c0029b788d	fix: alter collection failed with MMAP setting for STRUCT (#45173 ) issue: https://github.com/milvus-io/milvus/issues/45001 ref: https://github.com/milvus-io/milvus/issues/42148 --------- Signed-off-by: SpadeA <tangchenjie1210@gmail.com> Signed-off-by: aoiasd <zhicheng.yue@zilliz.com> Signed-off-by: SpadeA-Tang <tangchenjie1210@gmail.com> Co-authored-by: aoiasd <zhicheng.yue@zilliz.com>	2025-11-04 20:19:33 +08:00
Spade A	6494c75d31	fix: collection level MMAP does not take effect for STRUCT (#44996 ) issue: https://github.com/milvus-io/milvus/issues/42148 Signed-off-by: SpadeA <tangchenjie1210@gmail.com>	2025-10-23 19:52:05 +08:00
zhenshan.cao	4279f166c6	enhance: Add refine logs for task scheduler in QueryCoord (#44577 ) issue: https://github.com/milvus-io/milvus/issues/43968 Signed-off-by: zhenshan.cao <zhenshan.cao@zilliz.com>	2025-10-10 10:07:55 +08:00
XuanYang-cn	24037a396a	fix: LoadSegment failed for dup field mmap.enabel props (#44465 ) When set mmap enabled in both collection properties and field properties, load segment will fail. See also: #44443 Signed-off-by: yangxuan <xuan.yang@zilliz.com>	2025-09-22 14:40:06 +08:00
Bingyi Sun	0c0630cc38	feat: support dropping index without releasing collection (#42941 ) issue: #42942 This pr includes the following changes: 1. Added checks for index checker in querycoord to generate drop index tasks 2. Added drop index interface to querynode 3. To avoid search failure after dropping the index, the querynode allows the use of lazy mode (warmup=disable) to load raw data even when indexes contain raw data. 4. In segcore, loading the index no longer deletes raw data; instead, it evicts it. 5. In expr, the index is pinned to prevent concurrent errors. --------- Signed-off-by: sunby <sunbingyi1992@gmail.com>	2025-09-02 16:17:52 +08:00
Zhen Ye	23085ae437	fix: use query node label check if streamingnode (#44099 ) issue: #44014 - Because the session of querynode and streamingnode is different. - So when streamingnode session down first, a streaming query node will be treated as querynode. - Use label but not streaming node session to fix it. Signed-off-by: chyezh <chyezh@outlook.com>	2025-08-29 10:45:59 +08:00
Zhen Ye	575345ae7b	fix: get streamingnodes from service discovery without channel assign (#44033 ) issue: #43767 Signed-off-by: chyezh <chyezh@outlook.com>	2025-08-26 14:29:51 +08:00
Spade A	faeb7fd410	feat: impl StructArray -- create schema, insert, and retrieve data (#42855 ) Ref https://github.com/milvus-io/milvus/issues/42148 https://github.com/milvus-io/milvus/pull/42406 impls the segcore part of storage for handling with VectorArray. This PR: 1. impls the go part of storage for VectorArray 2. impls the collection creation with StructArrayField and VectorArray 3. insert and retrieve data from the collection. --------- Signed-off-by: SpadeA <tangchenjie1210@gmail.com> Signed-off-by: SpadeA-Tang <tangchenjie1210@gmail.com> Signed-off-by: SpadeA-Tang <u6748471@anu.edu.au>	2025-07-27 01:30:55 +08:00
Zhen Ye	3aacd179f7	fix: balance channel before balance segment when upgrading (#43346 ) issue: #43117, #42966, #43373 - also fix channel balance may not work at 2.6. - fix error lost at delete path - add mvcc into s/q log - change the log level for TestCoordDownSearch Signed-off-by: chyezh <chyezh@outlook.com>	2025-07-17 20:16:52 +08:00
wei liu	039564199c	fix: Prevent duplicate segment results in count queries (#43173 ) issue: #41570 Fix issue where growing and sealed segments could be searched simultaneously, causing inflated count() results. This was caused by logic introduced in PR #42009 that made sealed segments readable before target version advancement. Changes include: - Fix conditional filtering logic in PinReadableSegments to prevent sealed segments from becoming readable prematurely - Use target version filter for full results (ratio=1.0) to ensure sealed segments only become readable after target advancement - Use query view segment list filter for partial results (ratio<1.0) to maintain backward compatibility - Simplify target version setting in AddDistributions to prevent premature segment readability - Add logging for redundant growing segments during sync - Add comprehensive unit tests covering the duplicate segment scenario This fix ensures count() queries return accurate results by preventing the same segment from being counted in both growing and sealed states. --------- Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2025-07-14 11:10:49 +08:00
congqixia	1fae5230fe	fix: Check field mmap property before apply collection level one (#43090 ) Related to #43089 --------- Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>	2025-07-03 14:30:44 +08:00
congqixia	7bc7b18ed5	fix: [AddField] Prevent concurrent load during UpdateSchema (#43043 ) Related to #43028 This PR: - Add mutex prevent concurrent load segment & schema change - Add schema verison field in load meta - Update schema in PutOrRef if schema verison is larger --------- Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>	2025-07-02 17:38:44 +08:00
wei liu	bf5fde1431	fix: Prevent delegator unserviceable due to shard leader change (#42689 ) issue: #42098 #42404 Fix critical issue where concurrent balance segment and balance channel operations cause delegator view inconsistency. When shard leader switches between load and release phases of segment balance, it results in loading segments on old delegator but releasing on new delegator, making the new delegator unserviceable. The root cause is that balance segment modifies delegator views, and if these modifications happen on different delegators due to leader change, it corrupts the delegator state and affects query availability. Changes include: - Add shardLeaderID field to SegmentTask to track delegator for load - Record shard leader ID during segment loading in move operations - Skip release if shard leader changed from the one used for loading - Add comprehensive unit tests for leader change scenarios This ensures balance segment operations are atomic on single delegator, preventing view corruption and maintaining delegator serviceability. --------- Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2025-06-19 12:10:38 +08:00
Chun Han	001619aef9	feat: supporing load priority for loading (#42413 ) related: #40781 Signed-off-by: MrPresent-Han <chun.han@gmail.com> Co-authored-by: MrPresent-Han <chun.han@gmail.com>	2025-06-17 15:22:38 +08:00
wei liu	e7c0a6ffbb	enhance: Refine QueryNode task parallelism based on CPU core count (#42166 ) issue: #42165 Implement dynamic task execution capacity calculation based on QueryNode CPU core count instead of static configuration for better resource utilization. Changes include: - Add CpuCoreNum() method and WithCpuCoreNum() option to NodeInfo - Implement GetTaskExecutionCap() for dynamic capacity calculation - Add QueryNodeTaskParallelismFactor parameter for tuning - Update proto definition to include cpu_core_num field - Add unit tests for new functionality This allows QueryCoord to automatically adjust task parallelism based on actual hardware resources. Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2025-06-11 13:20:35 +08:00
wei liu	aa66072a1c	enhance: Remove inadvertently introduced goccy/go-json dependency (#42146 ) Remove the 'goccy/go-json' library, which was inadvertently introduced, and revert to using the standard internal JSON handling. Changes include: - Removed dependency on 'github.com/goccy/go-json' in go.mod and go.sum. - Replaced import of 'goccy/go-json' with 'internal/json' in 'internal/querycoordv2/task/scheduler.go'. This correction ensures the project continues to use the intended JSON processing libraries and avoids unnecessary external dependencies. Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2025-06-03 17:38:32 +08:00
wei liu	54619eaa2c	feat: Implement partial result support on node down (#42009 ) issue: https://github.com/milvus-io/milvus/issues/41690 This commit implements partial search result functionality when query nodes go down, improving system availability during node failures. The changes include: - Enhanced load balancing in proxy (lb_policy.go) to handle node failures with retry support - Added partial search result capability in querynode delegator and distribution logic - Implemented tests for various partial result scenarios when nodes go down - Added metrics to track partial search results in querynode_metrics.go - Updated parameter configuration to support partial result required data ratio - Replaced old partial_search_test.go with more comprehensive partial_result_on_node_down_test.go - Updated proto definitions and improved retry logic These changes improve query resilience by returning partial results to users when some query nodes are unavailable, ensuring that queries don't completely fail when a portion of data remains accessible. --------- Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2025-05-28 00:12:28 +08:00
wei liu	78010262f0	enhance: Optimize shard serviceable mechanism (#41937 ) issue: https://github.com/milvus-io/milvus/issues/41690 - Merge leader view and channel management into ChannelDistManager, allowing a channel to have multiple delegators. - Improve shard leader switching to ensure a single replica only has one shard leader per channel. The shard leader handles all resource loading and query requests. - Refine the serviceable mechanism: after QC completes loading, sync the query view to the delegator. The delegator then determines its serviceable status based on the query view. - When a delegator encounters forwarding query or deletion failures, mark the corresponding segment as offline and transition it to an unserviceable state. Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2025-05-22 11:38:24 +08:00
Zhen Ye	5fd47c3c89	fix: mockery too unavailable after upgrade golang version (#41481 ) issue: #41291 pr: #41318 Signed-off-by: chyezh <chyezh@outlook.com>	2025-04-24 10:46:43 +08:00
Xianhui Lin	3bc24c264f	enhance: Add json key inverted index in stats for optimization (#38039 ) Add json key inverted index in stats for optimization https://github.com/milvus-io/milvus/issues/36995 --------- Signed-off-by: Xianhui.Lin <xianhui.lin@zilliz.com> Co-authored-by: luzhang <luzhang@zilliz.com>	2025-04-10 15:20:28 +08:00
yihao.dai	c368113233	fix: Fix task delta cache data race (#40259 ) issue: https://github.com/milvus-io/milvus/issues/40258 Signed-off-by: bigsheeper <yihao.dai@zilliz.com>	2025-03-02 16:52:09 +08:00
wei liu	b0806bb900	fix: task delta cache leak due to duplicate task id (#40183 ) issue: #40052 task delta cache rely on the taskID is unique, so it incDeltaCache at AddTask, and decDeltaCache at RemoveTask, but the taskID allocator is not atomic, which cause two task with same taskID, in such case, it will call incDeltaCache twice, but call decDeltaCacheOnce, which cause delta cache leak. --------- Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2025-02-28 10:22:08 +08:00
wei liu	69b8b89369	enhance: Remove QueryCoord's scheduling of L0 segments (#39552 ) issue: #39551 This PR remove querycoord's scheduling of l0 segments: - only load l0 segment when watch channel - only release l0 segment when release channel or sync data distribution --------- Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2025-02-26 21:38:00 +08:00
congqixia	cb7f2fa6fd	enhance: Use v2 package name for pkg module (#39990 ) Related to #39095 https://go.dev/doc/modules/version-numbers Update pkg version according to golang dep version convention --------- Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>	2025-02-22 23:15:58 +08:00
yihao.dai	2a037a97f1	enhance: Add get vector latency metric and refine request limit error message (#40083 ) issue: https://github.com/milvus-io/milvus/issues/40078 Signed-off-by: bigsheeper <yihao.dai@zilliz.com>	2025-02-21 19:41:55 +08:00
wei liu	7d2c948c69	fix: task delta cache leak on reduce task (#40055 ) issue: #40052 Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2025-02-21 16:47:54 +08:00
wei liu	c12c4b4fff	fix: [skip e2e] pr conflict cause ut failed (#39811 ) Related to https://github.com/milvus-io/milvus/pull/39701 & https://github.com/milvus-io/milvus/issues/39681 Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2025-02-12 11:44:51 +08:00
congqixia	7b51e4839f	fix: Resolve conflict on qc task test (#39796 ) Related to #39701 & #39681 Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>	2025-02-11 18:40:45 +08:00
wei liu	ff5c680c99	fix: load collection stucks if compaction/gc happens (#39701 ) issue: #39680 if compaction/gc happens, load collection may stuck due to SegmentNotFound, we should trigger UpdateNextTarget to get a new data view to execute loading operation. Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2025-02-11 15:48:50 +08:00
wei liu	85c9f92ff4	fix: uneven distribution caused by executing task delta cache leak (#39702 ) issue: #39681 this PR maintain workload effect in action instead of computing workload effect from target, which may cause leak if target changes. Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2025-02-11 14:30:46 +08:00
jaime	8a4ac8cccd	enhance: expose more metrics data (#39456 ) issue: #36621 #39417 1. Adjust the server-side cache size. 2. Add source information for configurations. 3. Add node ID for compaction and indexing tasks. 4. Resolve localhost access issues to fix health check failures for etcd. Signed-off-by: jaime <yun.zhang@zilliz.com>	2025-02-07 11:50:50 +08:00
yihao.dai	5fb597b37b	fix: Remove frequently updating metric to avoid mutex contention (#38775 ) issue: https://github.com/milvus-io/milvus/issues/37630 Reduce the frequency updating metrics to avoid holding the mutex for long periods. --------- Signed-off-by: bigsheeper <yihao.dai@zilliz.com>	2025-01-24 10:31:07 +08:00
yihao.dai	e0b26260f2	enhance: enable task delta cache (#39307 ) When there are many segment tasks in the querycoord scheduler, the traversal in `GetSegmentTaskDelta` checks becomes time-consuming. This PR adds caching for segment deltas. issue: https://github.com/milvus-io/milvus/issues/37630 Signed-off-by: Wei Liu <wei.liu@zilliz.com> Co-authored-by: Wei Liu <wei.liu@zilliz.com>	2025-01-23 14:31:16 +08:00
yihao.dai	657550cf06	fix: Fix slow dist handle and slow observe (#38566 ) 1. Provide partition&channel level indexing in the collection target. 2. Make `SegmentAction` not wait for distribution. 3. Remove scheduler and target manager mutex. 4. Optimize logging to reduce CPU overhead. issue: https://github.com/milvus-io/milvus/issues/37630 --------- Signed-off-by: bigsheeper <yihao.dai@zilliz.com>	2025-01-15 20:17:00 +08:00
Zhen Ye	bb8d1ab3bf	enhance: make new go package to manage proto (#39114 ) issue: #39095 --------- Signed-off-by: chyezh <chyezh@outlook.com>	2025-01-10 10:49:01 +08:00
wei liu	9c3f59dbbe	fix: Prevent balancer from overloading the same QueryNode (#38719 ) issue: #38718 The balancer calculates the workload of executing tasks as an ongoing score for target nodes. However, a logic issue arises when GetSegmentTaskDelta or GetChannelTaskDelta is called with collectionID=-1, which incorrectly returns zero. Due to the incorrect global score, the executing task's workload is not properly reflected for each collection. Consequently, each collection submits its own balance task, leading to the balancer assigning excessive tasks to the same QueryNode. --------- Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2024-12-25 16:36:49 +08:00
SimFG	2afe2eaf3e	feat: support to replicate collection when the services contains the system tt msg (#37559 ) - issue: #37105 --------- Signed-off-by: SimFG <bang.fu@zilliz.com>	2024-12-17 09:08:46 +08:00
tinswzy	27229f7907	enhance: refine exists log print with ctx (#38080 ) issue: #35917 Refines exists log print with ctx Signed-off-by: tinswzy <zhenyuan.wei@zilliz.com>	2024-12-14 22:36:44 +08:00
wei liu	856e2aad7d	fix: Leader task stuck and retry again and again (#38202 ) issue: #38201 leader task require to update delegator's distribution, and only success after the distribution change has been applyed to delegator. but the delegator will reject the distribution change if it's version is older than current version in delegator. which cause the leader task stuck and retry forever. this PR remove the leader task finish check. Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2024-12-10 19:16:42 +08:00
wei liu	f04986fceb	enhance: Remove constraint on release segment task (#38297 ) issue: #38305 after we disable balance segment and balance channel happens at same time, the constriant which require release segment must happens on serviceable shard leader is unnessary. Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2024-12-10 11:18:49 +08:00
congqixia	051bc280dd	enhance: Make dynamic load/release partition follow targets (#38059 ) Related to #37849 --------- Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>	2024-12-05 16:24:40 +08:00
tinswzy	7944538ade	enhance: Add ctx param to KV operation interfaces (#38154 ) issue: #35917 Refine KV operation interfaces by adding a ctx param Signed-off-by: tinswzy <zhenyuan.wei@zilliz.com>	2024-12-05 15:16:41 +08:00
tinswzy	e76802f910	enhance: refine querycoord meta/catalog related interfaces to ensure that each method includes a ctx parameter (#37916 ) issue: #35917 This PR refine the querycoord meta related interfaces to ensure that each method includes a ctx parameter. Signed-off-by: tinswzy <zhenyuan.wei@zilliz.com>	2024-11-25 11:14:34 +08:00
wei liu	0a440e0d38	fix: Prevent simultaneous balance of segments and channels (#37850 ) issue: #33550 balance segment and balance segment execute at same time, which will cause bounch of corner case. This PR disable simultaneous balance of segments and channels Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2024-11-21 17:56:55 +08:00

1 2 3 4 5

221 Commits