203 Commits

Author SHA1 Message Date
wei liu
bf5fde1431
fix: Prevent delegator unserviceable due to shard leader change (#42689)
issue: #42098 #42404
Fix critical issue where concurrent balance segment and balance channel
operations cause delegator view inconsistency. When shard leader
switches between load and release phases of segment balance, it results
in loading segments on old delegator but releasing on new delegator,
making the new delegator unserviceable.

The root cause is that balance segment modifies delegator views, and if
these modifications happen on different delegators due to leader change,
it corrupts the delegator state and affects query availability.

Changes include:
- Add shardLeaderID field to SegmentTask to track delegator for load
- Record shard leader ID during segment loading in move operations
- Skip release if shard leader changed from the one used for loading
- Add comprehensive unit tests for leader change scenarios

This ensures balance segment operations are atomic on single delegator,
preventing view corruption and maintaining delegator serviceability.
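
A minimal sketch of the leader-change guard described above. The `SegmentTask` type here is a simplified stand-in, and the helper names `recordLoadLeader` and `shouldRelease` are hypothetical; only the `shardLeaderID` field name comes from the commit message.

```go
package main

import "fmt"

// SegmentTask is a simplified, hypothetical stand-in for the real task type.
type SegmentTask struct {
	SegmentID     int64
	shardLeaderID int64 // delegator node that served the load phase
}

// recordLoadLeader remembers which delegator handled the load phase.
func (t *SegmentTask) recordLoadLeader(leaderID int64) {
	t.shardLeaderID = leaderID
}

// shouldRelease reports whether the release phase may run against
// currentLeader. It refuses when the leader differs from the one
// recorded during load, so both phases touch the same delegator.
func (t *SegmentTask) shouldRelease(currentLeader int64) bool {
	return t.shardLeaderID == currentLeader
}

func main() {
	task := &SegmentTask{SegmentID: 100}
	task.recordLoadLeader(1)
	fmt.Println(task.shouldRelease(1)) // same leader: release proceeds
	fmt.Println(task.shouldRelease(2)) // leader changed: release is skipped
}
```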

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-06-19 12:10:38 +08:00
Chun Han
001619aef9
feat: supporting load priority for loading (#42413)
related: #40781

Signed-off-by: MrPresent-Han <chun.han@gmail.com>
Co-authored-by: MrPresent-Han <chun.han@gmail.com>
2025-06-17 15:22:38 +08:00
wei liu
e7c0a6ffbb
enhance: Refine QueryNode task parallelism based on CPU core count (#42166)
issue: #42165
Implement dynamic task execution capacity calculation based on QueryNode
CPU core count instead of static configuration for better resource
utilization.

Changes include:
- Add CpuCoreNum() method and WithCpuCoreNum() option to NodeInfo
- Implement GetTaskExecutionCap() for dynamic capacity calculation
- Add QueryNodeTaskParallelismFactor parameter for tuning
- Update proto definition to include cpu_core_num field
- Add unit tests for new functionality

This allows QueryCoord to automatically adjust task parallelism based on
actual hardware resources.
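
One way such a capacity calculation could look is sketched below. The formula (cores multiplied by a factor, with a static fallback) and the function name `taskExecutionCap` are assumptions for illustration, not the actual `GetTaskExecutionCap()` implementation.

```go
package main

import "fmt"

// taskExecutionCap is a hypothetical formula: cores * factor, falling back
// to a static default when the core count or factor is not usable.
func taskExecutionCap(cpuCoreNum, parallelismFactor, staticDefault int64) int64 {
	if cpuCoreNum <= 0 || parallelismFactor <= 0 {
		return staticDefault
	}
	return cpuCoreNum * parallelismFactor
}

func main() {
	fmt.Println(taskExecutionCap(8, 2, 256)) // node reported 8 cores
	fmt.Println(taskExecutionCap(0, 2, 256)) // core count unknown: static value
}
```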

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-06-11 13:20:35 +08:00
wei liu
aa66072a1c
enhance: Remove inadvertently introduced goccy/go-json dependency (#42146)
Remove the 'goccy/go-json' library, which was inadvertently introduced,
and revert to using the standard internal JSON handling.

Changes include:
- Removed dependency on 'github.com/goccy/go-json' in go.mod and go.sum.
- Replaced import of 'goccy/go-json' with 'internal/json' in
'internal/querycoordv2/task/scheduler.go'.

This correction ensures the project continues to use the intended JSON
processing libraries and avoids unnecessary external dependencies.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-06-03 17:38:32 +08:00
wei liu
54619eaa2c
feat: Implement partial result support on node down (#42009)
issue: https://github.com/milvus-io/milvus/issues/41690
This commit implements partial search result functionality when query
nodes go down, improving system availability during node failures. The
changes include:

- Enhanced load balancing in proxy (lb_policy.go) to handle node
failures with retry support
- Added partial search result capability in querynode delegator and
distribution logic
- Implemented tests for various partial result scenarios when nodes go
down
- Added metrics to track partial search results in querynode_metrics.go
- Updated parameter configuration to support partial result required
data ratio
- Replaced old partial_search_test.go with more comprehensive
partial_result_on_node_down_test.go
- Updated proto definitions and improved retry logic

These changes improve query resilience by returning partial results to
users when some query nodes are unavailable, ensuring that queries don't
completely fail when a portion of data remains accessible.
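
The "required data ratio" idea can be sketched as a simple threshold check. The function name `canReturnPartial` and the row-count based ratio are assumptions; the commit does not say how the ratio is measured.

```go
package main

import "fmt"

// canReturnPartial decides whether a partial result is acceptable, assuming
// the ratio is measured as accessible rows over total rows (hypothetical).
func canReturnPartial(accessibleRows, totalRows int64, requiredRatio float64) bool {
	if totalRows <= 0 {
		return true // nothing to search, so nothing is missing
	}
	return float64(accessibleRows)/float64(totalRows) >= requiredRatio
}

func main() {
	fmt.Println(canReturnPartial(90, 100, 0.8))  // enough data is reachable
	fmt.Println(canReturnPartial(50, 100, 0.8))  // too much data is missing
	fmt.Println(canReturnPartial(100, 100, 1.0)) // ratio 1.0 demands full results
}
```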

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-05-28 00:12:28 +08:00
wei liu
78010262f0
enhance: Optimize shard serviceable mechanism (#41937)
issue: https://github.com/milvus-io/milvus/issues/41690
- Merge leader view and channel management into ChannelDistManager,
allowing a channel to have multiple delegators.
- Improve shard leader switching to ensure a single replica only has one
shard leader per channel. The shard leader handles all resource loading
and query requests.
- Refine the serviceable mechanism: after QC completes loading, sync the
query view to the delegator. The delegator then determines its
serviceable status based on the query view.
- When a delegator encounters forwarding query or deletion failures,
mark the corresponding segment as offline and transition it to an
unserviceable state.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-05-22 11:38:24 +08:00
Zhen Ye
5fd47c3c89
fix: mockery tool unavailable after upgrade golang version (#41481)
issue: #41291
pr: #41318

Signed-off-by: chyezh <chyezh@outlook.com>
2025-04-24 10:46:43 +08:00
Xianhui Lin
3bc24c264f
enhance: Add json key inverted index in stats for optimization (#38039)
Add json key inverted index in stats for optimization
https://github.com/milvus-io/milvus/issues/36995

---------

Signed-off-by: Xianhui.Lin <xianhui.lin@zilliz.com>
Co-authored-by: luzhang <luzhang@zilliz.com>
2025-04-10 15:20:28 +08:00
yihao.dai
c368113233
fix: Fix task delta cache data race (#40259)
issue: https://github.com/milvus-io/milvus/issues/40258

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2025-03-02 16:52:09 +08:00
wei liu
b0806bb900
fix: task delta cache leak due to duplicate task id (#40183)
issue: #40052

The task delta cache relies on taskID being unique: it calls incDeltaCache
at AddTask and decDeltaCache at RemoveTask. But the taskID allocator is
not atomic, so two tasks can get the same taskID. In that case
incDeltaCache is called twice but decDeltaCache only once, which leaks
the delta cache.
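
A minimal sketch of an atomic ID allocator, which is one way to rule out duplicate task IDs under concurrency. The `idAllocator` type is hypothetical and is not the allocator used in the scheduler.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// idAllocator hands out unique, increasing IDs. Hypothetical type.
type idAllocator struct {
	next int64
}

// Alloc increments the counter atomically, so concurrent callers can never
// observe the same value.
func (a *idAllocator) Alloc() int64 {
	return atomic.AddInt64(&a.next, 1)
}

func main() {
	var alloc idAllocator
	var mu sync.Mutex
	seen := make(map[int64]struct{})

	var wg sync.WaitGroup
	for i := 0; i < 1000; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			id := alloc.Alloc()
			mu.Lock()
			seen[id] = struct{}{}
			mu.Unlock()
		}()
	}
	wg.Wait()
	fmt.Println(len(seen)) // number of distinct IDs handed out
}
```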

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-02-28 10:22:08 +08:00
wei liu
69b8b89369
enhance: Remove QueryCoord's scheduling of L0 segments (#39552)
issue: #39551
This PR removes querycoord's scheduling of L0 segments:
- only load L0 segments when watching a channel
- only release L0 segments when releasing a channel or syncing data distribution

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-02-26 21:38:00 +08:00
congqixia
cb7f2fa6fd
enhance: Use v2 package name for pkg module (#39990)
Related to #39095

https://go.dev/doc/modules/version-numbers

Update pkg version according to golang dep version convention

---------

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2025-02-22 23:15:58 +08:00
yihao.dai
2a037a97f1
enhance: Add get vector latency metric and refine request limit error message (#40083)
issue: https://github.com/milvus-io/milvus/issues/40078

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2025-02-21 19:41:55 +08:00
wei liu
7d2c948c69
fix: task delta cache leak on reduce task (#40055)
issue: #40052

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-02-21 16:47:54 +08:00
wei liu
c12c4b4fff
fix: [skip e2e] pr conflict cause ut failed (#39811)
Related to https://github.com/milvus-io/milvus/pull/39701 &
https://github.com/milvus-io/milvus/issues/39681

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-02-12 11:44:51 +08:00
congqixia
7b51e4839f
fix: Resolve conflict on qc task test (#39796)
Related to #39701 & #39681

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2025-02-11 18:40:45 +08:00
wei liu
ff5c680c99
fix: load collection gets stuck if compaction/gc happens (#39701)
issue: #39680
If compaction/gc happens, load collection may get stuck due to
SegmentNotFound. We should trigger UpdateNextTarget to get a new data
view for the loading operation.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-02-11 15:48:50 +08:00
wei liu
85c9f92ff4
fix: uneven distribution caused by executing task delta cache leak (#39702)
issue: #39681 

This PR maintains the workload effect in the action instead of computing
it from the target, which may cause a leak if the target changes.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-02-11 14:30:46 +08:00
jaime
8a4ac8cccd
enhance: expose more metrics data (#39456)
issue: #36621 #39417
1. Adjust the server-side cache size.
2. Add source information for configurations.
3. Add node ID for compaction and indexing tasks.
4. Resolve localhost access issues to fix health check failures for
etcd.

Signed-off-by: jaime <yun.zhang@zilliz.com>
2025-02-07 11:50:50 +08:00
yihao.dai
5fb597b37b
fix: Remove frequently updating metric to avoid mutex contention (#38775)
issue: https://github.com/milvus-io/milvus/issues/37630

Reduce the frequency of metric updates to avoid holding the mutex for
long periods.

---------

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2025-01-24 10:31:07 +08:00
yihao.dai
e0b26260f2
enhance: enable task delta cache (#39307)
When there are many segment tasks in the querycoord scheduler, the
traversal in `GetSegmentTaskDelta` checks becomes time-consuming. This
PR adds caching for segment deltas.
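
The caching idea can be sketched as a running sum per (node, collection) bucket, updated when tasks are added or removed, so a lookup no longer walks every task. The `deltaCache` type and its key layout are assumptions for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// deltaKey identifies one (node, collection) bucket. Hypothetical layout.
type deltaKey struct {
	nodeID       int64
	collectionID int64
}

// deltaCache keeps a running sum of segment deltas per bucket.
type deltaCache struct {
	mu   sync.RWMutex
	data map[deltaKey]int
}

func newDeltaCache() *deltaCache {
	return &deltaCache{data: make(map[deltaKey]int)}
}

// add applies a signed delta: positive when a task is added, negative when
// the same task is removed.
func (c *deltaCache) add(nodeID, collectionID int64, delta int) {
	c.mu.Lock()
	defer c.mu.Unlock()
	k := deltaKey{nodeID, collectionID}
	c.data[k] += delta
	if c.data[k] == 0 {
		delete(c.data, k) // drop empty buckets so the map does not grow forever
	}
}

// get returns the cached delta with a single map lookup.
func (c *deltaCache) get(nodeID, collectionID int64) int {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return c.data[deltaKey{nodeID, collectionID}]
}

func main() {
	cache := newDeltaCache()
	cache.add(1, 100, 2)
	cache.add(1, 100, 1)
	fmt.Println(cache.get(1, 100)) // sum of the two added tasks
	cache.add(1, 100, -3)
	fmt.Println(cache.get(1, 100)) // back to zero after removal
}
```

Every increment needs a matching decrement; an unmatched pair leaves a stale value in the cache.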

issue: https://github.com/milvus-io/milvus/issues/37630

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
Co-authored-by: Wei Liu <wei.liu@zilliz.com>
2025-01-23 14:31:16 +08:00
yihao.dai
657550cf06
fix: Fix slow dist handle and slow observe (#38566)
1. Provide partition&channel level indexing in the collection target.
2. Make `SegmentAction` not wait for distribution.
3. Remove scheduler and target manager mutex.
4. Optimize logging to reduce CPU overhead.

issue: https://github.com/milvus-io/milvus/issues/37630

---------

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2025-01-15 20:17:00 +08:00
Zhen Ye
bb8d1ab3bf
enhance: make new go package to manage proto (#39114)
issue: #39095

---------

Signed-off-by: chyezh <chyezh@outlook.com>
2025-01-10 10:49:01 +08:00
wei liu
9c3f59dbbe
fix: Prevent balancer from overloading the same QueryNode (#38719)
issue: #38718
The balancer calculates the workload of executing tasks as an ongoing
score for target nodes. However, a logic issue arises when
GetSegmentTaskDelta or GetChannelTaskDelta is called with
collectionID=-1, which incorrectly returns zero.

Due to the incorrect global score, the executing task's workload is not
properly reflected for each collection. Consequently, each collection
submits its own balance task, leading to the balancer assigning
excessive tasks to the same QueryNode.
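
A sketch of the wildcard handling at issue, assuming `collectionID=-1` is meant to match every collection. The `segmentTask` type and `segmentTaskDelta` function are simplified, hypothetical stand-ins for the scheduler's real bookkeeping.

```go
package main

import "fmt"

// allCollections is the wildcard value mentioned in the commit message.
const allCollections int64 = -1

// segmentTask is a simplified, hypothetical view of an executing task.
type segmentTask struct {
	nodeID       int64
	collectionID int64
	delta        int
}

// segmentTaskDelta sums the deltas targeting nodeID. With the wildcard it
// sums across all collections instead of returning zero.
func segmentTaskDelta(tasks []segmentTask, nodeID, collectionID int64) int {
	sum := 0
	for _, t := range tasks {
		if t.nodeID != nodeID {
			continue
		}
		if collectionID != allCollections && t.collectionID != collectionID {
			continue
		}
		sum += t.delta
	}
	return sum
}

func main() {
	tasks := []segmentTask{{1, 100, 2}, {1, 200, 3}, {2, 100, 5}}
	fmt.Println(segmentTaskDelta(tasks, 1, 100))            // one collection on node 1
	fmt.Println(segmentTaskDelta(tasks, 1, allCollections)) // global score for node 1
}
```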

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-12-25 16:36:49 +08:00
SimFG
2afe2eaf3e
feat: support replicating collections when the services contain the system tt msg (#37559)
- issue: #37105

---------

Signed-off-by: SimFG <bang.fu@zilliz.com>
2024-12-17 09:08:46 +08:00
tinswzy
27229f7907
enhance: refine existing log print with ctx (#38080)
issue: #35917
Refines existing log printing with ctx

Signed-off-by: tinswzy <zhenyuan.wei@zilliz.com>
2024-12-14 22:36:44 +08:00
wei liu
856e2aad7d
fix: Leader task stuck and retry again and again (#38202)
issue: #38201
A leader task needs to update the delegator's distribution, and only
succeeds after the distribution change has been applied to the delegator.
But the delegator rejects the distribution change if its version is older
than the delegator's current version, which causes the leader task to get
stuck and retry forever.

This PR removes the leader task finish check.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-12-10 19:16:42 +08:00
wei liu
f04986fceb
enhance: Remove constraint on release segment task (#38297)
issue: #38305
Now that balance segment and balance channel can no longer happen at the
same time, the constraint that requires release segment to happen on a
serviceable shard leader is unnecessary.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-12-10 11:18:49 +08:00
congqixia
051bc280dd
enhance: Make dynamic load/release partition follow targets (#38059)
Related to #37849

---------

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-12-05 16:24:40 +08:00
tinswzy
7944538ade
enhance: Add ctx param to KV operation interfaces (#38154)
issue: #35917 
Refine KV operation interfaces by adding a ctx param

Signed-off-by: tinswzy <zhenyuan.wei@zilliz.com>
2024-12-05 15:16:41 +08:00
tinswzy
e76802f910
enhance: refine querycoord meta/catalog related interfaces to ensure that each method includes a ctx parameter (#37916)
issue: #35917 
This PR refine the querycoord meta related interfaces to ensure that
each method includes a ctx parameter.

Signed-off-by: tinswzy <zhenyuan.wei@zilliz.com>
2024-11-25 11:14:34 +08:00
wei liu
0a440e0d38
fix: Prevent simultaneous balance of segments and channels (#37850)
issue: #33550
Balance segment and balance channel can execute at the same time, which
causes a bunch of corner cases.

This PR disables simultaneous balance of segments and channels.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-11-21 17:56:55 +08:00
yihao.dai
b6612e02b4
enhance: Reduce GetIndexInfos calls (#37695)
Batch `GetIndexInfos` calls for segments to reduce RPC calls.
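
The batching idea can be sketched as grouping segment IDs so that one request covers many segments. The `batches` helper and the batch size are assumptions; the real `GetIndexInfos` request shape is not shown here.

```go
package main

import "fmt"

// batches splits ids into consecutive groups of at most size elements.
func batches(ids []int64, size int) [][]int64 {
	if size <= 0 {
		size = 1
	}
	var out [][]int64
	for start := 0; start < len(ids); start += size {
		end := start + size
		if end > len(ids) {
			end = len(ids)
		}
		out = append(out, ids[start:end])
	}
	return out
}

func main() {
	segmentIDs := []int64{1, 2, 3, 4, 5}
	for _, b := range batches(segmentIDs, 2) {
		// One hypothetical RPC per batch instead of one per segment.
		fmt.Println("request for segments", b)
	}
}
```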

issue: https://github.com/milvus-io/milvus/issues/37634

---------

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2024-11-19 14:24:31 +08:00
congqixia
6d86b9022e
enhance: Provide secondary index criteria when filter leaderview (#37777)
Related to #37630

---------

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-11-19 10:12:30 +08:00
congqixia
b0bd290a6e
enhance: Use internal json(sonic) to replace std json lib (#37708)
Related to #35020

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-11-18 10:46:31 +08:00
jaime
1d06d4324b
fix: Int64 overflow in JSON encoding (#37657)
issue: #36621

- For simple types in a struct, add "string" to the JSON tag for
automatic string conversion during JSON encoding.
- For complex types in a struct, replace "int64" with "string."

Signed-off-by: jaime <yun.zhang@zilliz.com>
2024-11-14 22:52:30 +08:00
jaime
1e8ea4a7e7
feat: add segment/channel/task/slow query render (#37561)
issue: #36621

Signed-off-by: jaime <yun.zhang@zilliz.com>
2024-11-12 17:44:29 +08:00
sthuang
70605cf5b3
enhance: Support custom privilege group for RBAC (#37087)
issue: #37031

---------

Signed-off-by: shaoting-huang <shaoting.huang@zilliz.com>
2024-11-09 08:44:28 +08:00
wei liu
8714774305
fix: search/query failed due to segment not loaded (#37403)
issue: #36970
Release segment and balance channel may happen at the same time. Before
the new delegator becomes serviceable, release segment may execute on the
new delegator while search/query arrives at the old delegator, so release
segment and query segment run in parallel. If release segment executes
first in the worker, search/query gets a SegmentNotLoaded error.

This PR adds a serviceable filter on delegators, so all load/release
segment operations happen on a serviceable delegator.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-11-06 15:10:25 +08:00
jaime
9d16b972ea
feat: add tasks page into management WebUI (#37002)
issue: #36621

1. Add API to access task runtime metrics, including:
  - build index task
  - compaction task
  - import task
  - balance (including load/release of segments/channels and some leader
    tasks on querycoord)
  - sync task
2. Add a debug mode to the webpage, enabled or disabled with debug=true
or debug=false in the URL query parameters.

Signed-off-by: jaime <yun.zhang@zilliz.com>
2024-10-28 10:13:29 +08:00
Bingyi Sun
6851738fd1
fix: fix make generate-mockery panic with go1.22 (#36830)
https://github.com/milvus-io/milvus/issues/36831
Fix `make generate-mockery` panic.

Signed-off-by: sunby <sunbingyi1992@gmail.com>
2024-10-17 12:11:31 +08:00
wei liu
5dfa1c3397
fix: Segment unbalance after many times load/release (#36537)
issue: #36536
QueryCoord uses `segmentTaskDelta/channelTaskDelta` to measure the
executing workload for each querynode in the scheduler, and maintains
them through `scheduler.Add(task)` and `scheduler.remove(task)`. But
`scheduler.remove(task)` has been called in unexpected ways, which
produces a wrong `segmentTaskDelta/channelTaskDelta` value, affects the
segment assign logic, and causes segment unbalance.

This PR computes `segmentTaskDelta/channelTaskDelta` on access instead,
so a wrong stored value cannot take effect.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-09-26 15:13:15 +08:00
wei liu
3cd0b26285
enhance: Enable dynamic update loaded collection's replica (#35822)
issue: #35821
After a collection is loaded, increasing or decreasing its replica number
requires releasing and loading it again.

Milvus offers 4 solutions to update a loaded collection's replicas. This
PR aims to change the replica number dynamically without release. After
the replica number changes, Milvus loads or releases replicas
asynchronously, and the replica load status can be checked with the
getReplicas API.

Note that if more replicas are set than the querynodes can afford, the
new replicas won't be loaded successfully until enough querynodes join.

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-09-25 10:13:18 +08:00
wei liu
3bd7ec8751
fix: Fix corner case that segment can't be moved out from stopping node (#36431)
issue: #36426
The old constraint requires that only segments in the current target can
be balanced, which is wrong: a segment that exists only in the next
target can't be moved out of a stopping node.

By design, stopping balance needs to move all segments off the node via
balance tasks, so the old constraint should be removed.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-09-24 17:01:14 +08:00
wei liu
c84ea5465c
fix: Fix some replicas don't participate in the query after the failure recovery (#35850)
issue: #35846
QueryCoord notifies the proxy to update its shard leader cache after a
delegator's location changes. But during querynode failure recovery, some
delegators may become unserviceable due to missing segments and return to
serviceable after the segments are loaded, so we also need to notify the
proxy to invalidate its shard leader cache when a delegator's serviceable
state changes.

This PR will maintain querynode's serviceable state during heartbeat,
and notify proxy to invalidate shard leader cache if serviceable state
changes.
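
A minimal sketch of change detection on heartbeat, assuming the state is tracked per channel. The `serviceableTracker` type is hypothetical, and treating the first observation as "no change" is a choice made for this example.

```go
package main

import "fmt"

// serviceableTracker remembers the last reported state per channel.
type serviceableTracker struct {
	last map[string]bool
}

func newServiceableTracker() *serviceableTracker {
	return &serviceableTracker{last: make(map[string]bool)}
}

// observe records the state from a heartbeat and reports whether it differs
// from the previously known state, i.e. whether the proxy's shard leader
// cache should be invalidated.
func (s *serviceableTracker) observe(channel string, serviceable bool) bool {
	prev, known := s.last[channel]
	s.last[channel] = serviceable
	return known && prev != serviceable
}

func main() {
	tr := newServiceableTracker()
	fmt.Println(tr.observe("ch-0", true))  // first observation: no invalidation
	fmt.Println(tr.observe("ch-0", false)) // state flipped: invalidate
	fmt.Println(tr.observe("ch-0", false)) // unchanged: no invalidation
}
```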

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-09-03 15:39:03 +08:00
SimFG
731d45abbe
enhance: provide more general configuration to control mmap behavior (#35359)
- issue: #35273

Signed-off-by: SimFG <bang.fu@zilliz.com>
2024-08-21 00:22:54 +08:00
congqixia
2fbc628994
feat: Support field partial load collection (#35416)
Related to #35415

---------

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-08-20 16:49:02 +08:00
wei liu
c0200eec39
enhance: limit getSegmentInfo batch size to avoid exceeding grpc message limit (#35394)
issue: #35395

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-08-15 19:17:00 +08:00
wei liu
f6aaf3fef2
fix: force update next target if target can't be loaded (#35365)
issue: #35361

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-08-15 19:15:00 +08:00
jaime
fcec4c21b9
fix: check collection health(queryable) fail for releasing collection (#34947)
issue: #34946

Signed-off-by: jaime <yun.zhang@zilliz.com>
2024-08-02 17:20:15 +08:00