90 Commits

Author SHA1 Message Date
wei liu
bf5fde1431
fix: Prevent delegator unserviceable due to shard leader change (#42689)
issue: #42098 #42404
Fix critical issue where concurrent balance segment and balance channel
operations cause delegator view inconsistency. When shard leader
switches between load and release phases of segment balance, it results
in loading segments on old delegator but releasing on new delegator,
making the new delegator unserviceable.

The root cause is that balance segment modifies delegator views, and if
these modifications happen on different delegators due to leader change,
it corrupts the delegator state and affects query availability.

Changes include:
- Add shardLeaderID field to SegmentTask to track delegator for load
- Record shard leader ID during segment loading in move operations
- Skip release if shard leader changed from the one used for loading
- Add comprehensive unit tests for leader change scenarios

This ensures balance segment operations are atomic on single delegator,
preventing view corruption and maintaining delegator serviceability.

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-06-19 12:10:38 +08:00
Chun Han
001619aef9
feat: supporing load priority for loading (#42413)
related: #40781

Signed-off-by: MrPresent-Han <chun.han@gmail.com>
Co-authored-by: MrPresent-Han <chun.han@gmail.com>
2025-06-17 15:22:38 +08:00
wei liu
e7c0a6ffbb
enhance: Refine QueryNode task parallelism based on CPU core count (#42166)
issue: #42165
Implement dynamic task execution capacity calculation based on QueryNode
CPU core count instead of static configuration for better resource
utilization.

Changes include:
- Add CpuCoreNum() method and WithCpuCoreNum() option to NodeInfo
- Implement GetTaskExecutionCap() for dynamic capacity calculation
- Add QueryNodeTaskParallelismFactor parameter for tuning
- Update proto definition to include cpu_core_num field
- Add unit tests for new functionality

This allows QueryCoord to automatically adjust task parallelism based on
actual hardware resources.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-06-11 13:20:35 +08:00
wei liu
54619eaa2c
feat: Implement partial result support on node down (#42009)
issue: https://github.com/milvus-io/milvus/issues/41690
This commit implements partial search result functionality when query
nodes go down, improving system availability during node failures. The
changes include:

- Enhanced load balancing in proxy (lb_policy.go) to handle node
failures with retry support
- Added partial search result capability in querynode delegator and
distribution logic
- Implemented tests for various partial result scenarios when nodes go
down
- Added metrics to track partial search results in querynode_metrics.go
- Updated parameter configuration to support partial result required
data ratio
- Replaced old partial_search_test.go with more comprehensive
partial_result_on_node_down_test.go
- Updated proto definitions and improved retry logic

These changes improve query resilience by returning partial results to
users when some query nodes are unavailable, ensuring that queries don't
completely fail when a portion of data remains accessible.

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-05-28 00:12:28 +08:00
wei liu
78010262f0
enhance: Optimize shard serviceable mechanism (#41937)
issue: https://github.com/milvus-io/milvus/issues/41690
- Merge leader view and channel management into ChannelDistManager,
allowing a channel to have multiple delegators.
- Improve shard leader switching to ensure a single replica only has one
shard leader per channel. The shard leader handles all resource loading
and query requests.
- Refine the serviceable mechanism: after QC completes loading, sync the
query view to the delegator. The delegator then determines its
serviceable status based on the query view.
- When a delegator encounters forwarding query or deletion failures,
mark the corresponding segment as offline and transition it to an
unserviceable state.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-05-22 11:38:24 +08:00
Xianhui Lin
3bc24c264f
enhance: Add json key inverted index in stats for optimization (#38039)
Add json key inverted index in stats for optimization
https://github.com/milvus-io/milvus/issues/36995

---------

Signed-off-by: Xianhui.Lin <xianhui.lin@zilliz.com>
Co-authored-by: luzhang <luzhang@zilliz.com>
2025-04-10 15:20:28 +08:00
congqixia
cb7f2fa6fd
enhance: Use v2 package name for pkg module (#39990)
Related to #39095

https://go.dev/doc/modules/version-numbers

Update pkg version according to golang dep version convention

---------

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2025-02-22 23:15:58 +08:00
Zhen Ye
bb8d1ab3bf
enhance: make new go package to manage proto (#39114)
issue: #39095

---------

Signed-off-by: chyezh <chyezh@outlook.com>
2025-01-10 10:49:01 +08:00
SimFG
2afe2eaf3e
feat: support to replicate collection when the services contains the system tt msg (#37559)
- issue: #37105

---------

Signed-off-by: SimFG <bang.fu@zilliz.com>
2024-12-17 09:08:46 +08:00
wei liu
f04986fceb
enhance: Remove constraint on release segment task (#38297)
issue: #38305
after we disable balance segment and balance channel happens at same
time, the constriant which require release segment must happens on
serviceable shard leader is unnessary.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-12-10 11:18:49 +08:00
congqixia
051bc280dd
enhance: Make dynamic load/release partition follow targets (#38059)
Related to #37849

---------

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-12-05 16:24:40 +08:00
tinswzy
e76802f910
enhance: refine querycoord meta/catalog related interfaces to ensure that each method includes a ctx parameter (#37916)
issue: #35917 
This PR refine the querycoord meta related interfaces to ensure that
each method includes a ctx parameter.

Signed-off-by: tinswzy <zhenyuan.wei@zilliz.com>
2024-11-25 11:14:34 +08:00
yihao.dai
b6612e02b4
enhance: Reduce GetIndexInfos calls (#37695)
Batch `GetIndexInfos` calls for segments to reduce RPC calls.

issue: https://github.com/milvus-io/milvus/issues/37634

---------

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2024-11-19 14:24:31 +08:00
wei liu
8714774305
fix: search/query failed due to segment not loaded (#37403)
issue: #36970
cause release segment and balance channel may happen at same time, and
before new delegator become serviceable, if release segment exeuctes on
new delegator, and search/query comes on old delegator, then release
segment and query segment happens in parallel, if release segment
execute first in worker, then search/query will got a SegmentNodeLoaded
error.

This PR add serviceable filter on delegator, then all load/release
segment operation will happens on serviceable delegator.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-11-06 15:10:25 +08:00
jaime
9d16b972ea
feat: add tasks page into management WebUI (#37002)
issue: #36621

1. Add API to access task runtime metrics, including:
  - build index task
  - compaction task
  - import task
- balance (including load/release of segments/channels and some leader
tasks on querycoord)
  - sync task
2. Add a debug model to the webpage by using debug=true or debug=false
in the URL query parameters to enable or disable debug mode.

Signed-off-by: jaime <yun.zhang@zilliz.com>
2024-10-28 10:13:29 +08:00
wei liu
3cd0b26285
enhance: Enable dynamic update loaded collection's replica (#35822)
issue: #35821
After collection loaded, if we need to increase/decrease collection's
replica, we need to release and load it again.

milvus offers 4 solution to update loaded collection's replica, this PR
aims to dynamic change the replica number without release, and after
replica number changed, milvus will execute load replica or release
replica in async, and the replica loaded status can be checked by
getReplicas API.

Notice that if set too much replicas than querynode can afford,the new
replica won't be loaded successfully until enough querynode joins.

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-09-25 10:13:18 +08:00
congqixia
2fbc628994
feat: Support field partial load collection (#35416)
Related to #35415

---------

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-08-20 16:49:02 +08:00
wei liu
c0200eec39
enhance: limit getSegmentInfo batch size to avoid excced grpc message limit (#35394)
issue: #35395

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-08-15 19:17:00 +08:00
jaime
fcec4c21b9
fix: check collection health(queryable) fail for releasing collection (#34947)
issue: #34946

Signed-off-by: jaime <yun.zhang@zilliz.com>
2024-08-02 17:20:15 +08:00
wayblink
a1232fafda
feat: Major compaction (#33620)
#30633

Signed-off-by: wayblink <anyang.wang@zilliz.com>
Co-authored-by: MrPresent-Han <chun.han@zilliz.com>
2024-06-10 21:34:08 +08:00
SimFG
ecee7d90d4
enhance: try to speed up the loading of small collections (#33570)
- issue: #33569

Signed-off-by: SimFG <bang.fu@zilliz.com>
2024-06-07 08:25:53 +08:00
wei liu
68dec7dcd4
fix: Use correct ts to avoid exclude segment list leak (#31991)
issue: #31990

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-04-12 10:39:19 +08:00
wei liu
c4806b69c4
enhance: Refactor leader view manager interface (#31133)
issue: #31091
This PR add GetByFilter interface in leader view manager, instead of all
kind of get func

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-04-10 15:13:36 +08:00
wei liu
7471a8005f
fix: querycoord panic after node down (#31831)
issue: #30519

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-04-03 10:03:22 +08:00
wei liu
5d752498e7
fix: Skip release duplicate l0 segment (#31540)
issue: #31480 #31481

release duplicate l0 segment task, which execute on old delegator may
cause segment lack, and execute on new delegator may break new
delegator's leader view.

This PR skip release duplicate l0 segment by segment_checker, cause l0
segment will be released with unsub channel

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-03-27 12:53:10 +08:00
chyezh
9f9ef8ac32
enhance: transfer resource group and dbname to querynode when load (#30936)
issue: #30931

Signed-off-by: chyezh <chyezh@outlook.com>
2024-03-21 11:59:12 +08:00
congqixia
773c64ecbb
fix: Set nodeID when remove distribution (#31259)
See also #30930

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-03-14 15:09:03 +08:00
congqixia
5b51c20293
fix: Use Remove sync type for distribution removal (#31215)
See also #31214

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-03-13 06:11:04 +08:00
wei liu
22df5061c1
fix: Leader checker can't update segment's load version (#31040)
issue: #30890

when leader checker find that leader view has an older load version of
segment, it will try to correct leader view. but the sync action doesn't
specify the latest load version. so the update operation will failed.

This PR fix leader checker can't update segment's load version and
keeping generate same task to scheduler.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-03-08 11:57:01 +08:00
congqixia
c886aa29ff
enhance: Use ListIndexes instead of DescribeIndex for qc broker (#31122)
See also #31103

Since querycoord need index meta information from datacoord only, broker
shall use `ListIndexes` to skip segment index building check logic in
datacoord

This PR is also related to #30538, in which DescribeIndex caused lots of
memory usage and lead to OOM eventually

---------

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-03-07 21:43:03 +08:00
SimFG
ee8d6f236c
enhance: make the watch dm channel request better compatibility (#30952)
issue: #30938

Signed-off-by: SimFG <bang.fu@zilliz.com>
2024-03-01 16:07:37 +08:00
wei liu
befe0e21fd
fix: Set indexInfo when try to set segment to leader view (#30758)
issue: #30150
see also: #30258

cause `SyncDataDistribution` will try to load delta for segment. if miss
indexInfo in request, sync action will failed due to lack of index info.

This PR set indexinfo when try to set segment to leader view

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-02-26 11:02:55 +08:00
congqixia
7b91fa3db8
fix: Make leader checker generate leader task instead of segment task (#30258)
See also #30150

For leader view distribution with offline nodes, a release task can
never be sent to querynode due to targetNode online check logic. Even
the request is dispatched, normal release task does not have "force"
flag when calling `delegator.ReleaseSegment`.

This PR adds a new type of querycoord task: LeaderTask, the
responsibility of which is to rectify leader view distribtion.

---------

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-02-21 11:08:51 +08:00
wei liu
f69f65ff68
fix: Leader checker can't remove segment from leader view (#30151)
issue: #30150

This PR fix three problems:
1. leader checker use wrong node id when generate release task, which
cause the release task finished immediately
2. the release request generated by leader_checker doesn't set the
`force` flag, the operation to clean leader view on delegator will fail.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-01-20 18:58:58 +08:00
yah01
c68c128e47
fix: level 0 segments not loaded (#29908)
the recent changes move the level 0 segments list to a new proto field,
which leads to the QueryCoord can't see the level 0 segments, handle the
new changes
fix #29907

Signed-off-by: yah01 <yang.cen@zilliz.com>
2024-01-16 14:40:53 +08:00
xige-16
9702cef2b5
feat: Support multiple vector search (#29433)
issue #25639 

Signed-off-by: xige-16 <xi.ge@zilliz.com>

Signed-off-by: xige-16 <xi.ge@zilliz.com>
2024-01-08 15:34:48 +08:00
congqixia
a3cb8e2625
fix: Add atomic method to get collection target (#29577)
Related to #29575

Add `getCollectionTarget` method which is atomic when scope is
`CurrentTargetFirst` or `NextTargetFirst`
Also return error when executor finds no channel in target manager

---------

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2023-12-29 09:04:46 +08:00
yah01
1d6bcd1ded
enhance: speed up loading with many deletions (#29455)
the executor always fetches the latest segment info, so we could consume
from the latest checkpoint, which could save much time while deleted
many entities

Signed-off-by: yah01 <yang.cen@zilliz.com>
2023-12-25 20:58:45 +08:00
yah01
a0e1a1eb31
feat: support enable/disable mmap for index (#29005)
support enable/disable mmap for index, the user could alter the index's
mode by `AlterIndex` method
related: https://github.com/milvus-io/milvus/issues/21866

---------

Signed-off-by: yah01 <yah2er0ne@outlook.com>
Signed-off-by: yah01 <yang.cen@zilliz.com>
2023-12-21 18:07:24 +08:00
yah01
0a87724f18
enhance: remove merger for load segments (#29062)
remove merger as now QueryNode could load segments concurrently
fix #29063

Signed-off-by: yah01 <yah2er0ne@outlook.com>
2023-12-12 10:48:37 +08:00
yah01
fab52d167b
fix: may miss stream delta while loading (#28871)
we consume the delta data from the lastest channel checkpoint while
loading segment,

this works well without level 0 segments, but now it may lead to miss
some delta data,

so we have to consume from the current target's channel checkpoint

related: #27349

---------

Signed-off-by: yah01 <yah2er0ne@outlook.com>
2023-12-05 17:34:45 +08:00
yah01
ece592a42f
Deliver L0 segments delete records (#27722)
Signed-off-by: yah01 <yah2er0ne@outlook.com>
2023-11-07 01:44:18 +08:00
yah01
dc89730a50
Support collection-level mmap control (#26901)
Signed-off-by: yah01 <yah2er0ne@outlook.com>
2023-11-02 23:52:16 +08:00
congqixia
852be152de
Change task sourceID to stringer interface (#27965)
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2023-10-27 01:08:12 +08:00
SimFG
26f06dd732
Format the code (#27275)
Signed-off-by: SimFG <bang.fu@zilliz.com>
2023-09-21 09:45:27 +08:00
yah01
168e82ee10
Fix panic while handling with the nil status (#27040)
Signed-off-by: yah01 <yah2er0ne@outlook.com>
2023-09-15 10:09:21 +08:00
congqixia
edde3cf1c7
Add tracer for querycoord tasks (#27058)
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2023-09-14 09:59:19 +08:00
Bingyi Sun
54c0e64059
Fix search on empty segments set bug (#26136)
Signed-off-by: sunby <sunbingyi1992@gmail.com>
2023-08-08 11:17:08 +08:00
yah01
2180ef180c
Record only failed task error (#26033)
Signed-off-by: yah01 <yang.cen@zilliz.com>
2023-08-01 10:11:05 +08:00
congqixia
2b9ec565bb
Fix task executor can not dedup same task (#25901)
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2023-07-26 08:51:02 +08:00