57 Commits

Author SHA1 Message Date
wei liu
3bb3e8c09e
fix: Enable leader checker to sync segment distribution to RO nodes (#45949)
issue: #45865

- Modified leader_checker.go to include all nodes (RO + RW) instead of
only RW nodes, preventing channel balance from stucking on RO nodes
- Added debug logging in segment_checker.go when no shard leader found
- Enhanced target_observer.go with detailed logging for delegator check
failures to improve debugging visibility
- Fixed integration tests:
- Temporarily disabled partial result counter assertion in
partial_result_on_node_down_test.go pending concurrent issue fix
- Increased transfer channel timeout from 10s to 20s in
manual_rolling_upgrade_test.go to avoid flaky test caused by target
update interval (10s)

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-12-02 10:07:09 +08:00
yihao.dai
cabc47ce01
fix: Fix channel not available error and release collection blocking (#45428)
1. Ensure replica creation is idempotent.
2. Prevent currentTarget update when replica is missing.
3. Move the wait-for-release logic into the DDL framework's callback,
and add a timeout to prevent it from blocking the DDL callback
indefinitely.

issue: https://github.com/milvus-io/milvus/issues/45301,
https://github.com/milvus-io/milvus/issues/45274,
https://github.com/milvus-io/milvus/issues/45295

---------

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2025-11-12 18:55:37 +08:00
wei liu
399f63300c
enhance: Implement dynamic interval updates for ticker components (#43865)
issue: #43858

Enable dynamic configuration updates for ticker intervals without
restart. This enhancement allows runtime configuration changes to take
effect immediately for better operational flexibility.

Changes include:
- Apply "drain+Reset only when interval changed" pattern across all
ticker components to preserve existing timing phases
- Fix goroutine variable capture issue in CheckerController.Start()
- Remove unnecessary ticker.Stop() in manual trigger paths
- Add dynamic interval checking in QueryCoordV2 components:
  * checkers/controller.go: Various checker intervals
  * dist/dist_handler.go: DistPullInterval, CheckExecutedFlagInterval
  * session/cluster.go: CheckNodeSessionInterval
  * server.go: CheckAutoBalanceConfigInterval
  * observers/target_observer.go: UpdateNextTargetInterval
  * observers/collection_observer.go: CollectionObserverInterval
- Add dynamic interval checking in QueryNodeV2 components:
  * segments/disk_usage_fetcher.go: DiskSizeFetchInterval
- Ensure thread safety by performing all ticker operations in same
goroutine with proper drain before Reset to avoid spurious triggers

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-08-21 10:07:47 +08:00
wei liu
be492c2939
fix: Add missing keylocks in ReleasePartition operation (#42940)
issue: #42098
Fix concurrent access issue by adding proper locking around
ReleasePartition operation to prevent race conditions when releasing
partitions on the same collection.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-06-25 21:48:42 +08:00
wei liu
317e7999da
fix: ReleasePartition cause delegator unserviceable. (#42423)
issue: #42098 #42404
related to: ##42009 #41937

Implement new method to handle partition removal from next target
without directly modifying current target.

Changes include:
- Add RemovePartitionFromNextTarget method and deprecate RemovePartition
- Update target_observer to use new method for ReleasePartition
operations
- Add unit tests and mock methods for new functionality

This ensures that all changes to next target will propagates to
delegator's query view.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-06-09 19:02:34 +08:00
wei liu
54619eaa2c
feat: Implement partial result support on node down (#42009)
issue: https://github.com/milvus-io/milvus/issues/41690
This commit implements partial search result functionality when query
nodes go down, improving system availability during node failures. The
changes include:

- Enhanced load balancing in proxy (lb_policy.go) to handle node
failures with retry support
- Added partial search result capability in querynode delegator and
distribution logic
- Implemented tests for various partial result scenarios when nodes go
down
- Added metrics to track partial search results in querynode_metrics.go
- Updated parameter configuration to support partial result required
data ratio
- Replaced old partial_search_test.go with more comprehensive
partial_result_on_node_down_test.go
- Updated proto definitions and improved retry logic

These changes improve query resilience by returning partial results to
users when some query nodes are unavailable, ensuring that queries don't
completely fail when a portion of data remains accessible.

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-05-28 00:12:28 +08:00
wei liu
78010262f0
enhance: Optimize shard serviceable mechanism (#41937)
issue: https://github.com/milvus-io/milvus/issues/41690
- Merge leader view and channel management into ChannelDistManager,
allowing a channel to have multiple delegators.
- Improve shard leader switching to ensure a single replica only has one
shard leader per channel. The shard leader handles all resource loading
and query requests.
- Refine the serviceable mechanism: after QC completes loading, sync the
query view to the delegator. The delegator then determines its
serviceable status based on the query view.
- When a delegator encounters forwarding query or deletion failures,
mark the corresponding segment as offline and transition it to an
unserviceable state.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-05-22 11:38:24 +08:00
wei liu
69b8b89369
enhance: Remove QueryCoord's scheduling of L0 segments (#39552)
issue: #39551
This PR remove querycoord's scheduling of l0 segments:
  - only load l0 segment when watch channel
- only release l0 segment when release channel or sync data distribution

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-02-26 21:38:00 +08:00
congqixia
cb7f2fa6fd
enhance: Use v2 package name for pkg module (#39990)
Related to #39095

https://go.dev/doc/modules/version-numbers

Update pkg version according to golang dep version convention

---------

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2025-02-22 23:15:58 +08:00
congqixia
58045a3396
fix: Check collection released before target checks (#39841)
Related to #39840

The target could be updated async in previous code. This PR make remove
collection from target observer block until all tasks related in
dispatchers are removed preventing the metrics being updated after
collection released.

---------

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2025-02-14 11:38:14 +08:00
wei liu
ff5c680c99
fix: load collection stucks if compaction/gc happens (#39701)
issue: #39680
if compaction/gc happens, load collection may stuck due to
SegmentNotFound, we should trigger UpdateNextTarget to get a new data
view to execute loading operation.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-02-11 15:48:50 +08:00
Zhen Ye
bb8d1ab3bf
enhance: make new go package to manage proto (#39114)
issue: #39095

---------

Signed-off-by: chyezh <chyezh@outlook.com>
2025-01-10 10:49:01 +08:00
wei liu
40f9db491e
fix: Fix SyncDistribution may cost too much time on retry (#38454)
issue: #38428

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-12-16 11:38:44 +08:00
congqixia
051bc280dd
enhance: Make dynamic load/release partition follow targets (#38059)
Related to #37849

---------

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-12-05 16:24:40 +08:00
tinswzy
e76802f910
enhance: refine querycoord meta/catalog related interfaces to ensure that each method includes a ctx parameter (#37916)
issue: #35917 
This PR refine the querycoord meta related interfaces to ensure that
each method includes a ctx parameter.

Signed-off-by: tinswzy <zhenyuan.wei@zilliz.com>
2024-11-25 11:14:34 +08:00
wei liu
a1b6be1253
fix: Delegator stuck at unserviceable status (#37694)
issue: #37679

pr #36549 introduce the logic error which update current target when
only parts of channel is ready.

This PR fix the logic error and let dist handler keep pull distribution
on querynode until all delegator becomes serviceable.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-11-15 10:20:31 +08:00
wei liu
1304b40552
fix: Balance channel may stuck at increasing replica number case (#37641)
issue: #37640
fix the pr #36549
cause balance channel will wait until new delegator becomes serviceable,
but new delegator need to sync target version then becomes serviceable,
and sync target version need to be wait all replica load done. so if
increasing replica number and balance channel happens at same time,
logic dead lock occurs.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-11-14 10:08:31 +08:00
wei liu
266f8ef1f5
fix: Search may return less result after qn recover (#36549)
issue: #36293 #36242
after qn recover, delegator may be loaded in new node, after all segment
has been loaded, delegator becomes serviceable. but delegator's target
version hasn't been synced, and if search/query comes, delegator will
use wrong target version to filter out a empty segment list, which
caused empty search result.

This pr will block delegator's serviceable status until target version
is synced

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-11-12 16:34:28 +08:00
yihao.dai
ff9bdf7029
fix: Fix load slowly (#37454)
When there're a lot of loaded collections, they would occupy the target
observer scheduler’s pool. This prevents loading collections from
updating the current target in time, slowing down the load process.
This PR adds a separate target dispatcher for loading collections.

issue: https://github.com/milvus-io/milvus/issues/37166

---------

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2024-11-09 07:48:26 +08:00
Xiaofan
e073906a19
enhance: optimize describe collection and index (#37490)
fix #37489
combine multiple describe collection and list index into one call

Signed-off-by: xiaofanluan <xiaofan.luan@zilliz.com>
2024-11-08 10:18:34 +08:00
wei liu
75676fbd11
fix: Fix dynamic release partition may fail search/query request (#35919)
issue: #33550
cause concurrent issue may occur between remove parition in target
manager and sync segment list to delegator. when it happens, some
segment may be released in delegator, and those segment may also be
synced to delegator, which cause delegator become unserviceable due to
lack of necessary segments, then search/query fails.

this PR make sure that all write access to target_manager will be
executed in serial to avoid the concurrent issues.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-09-05 18:47:03 +08:00
congqixia
09ef3f1b4f
fix: Make sure querycoord observers started once (#35811)
Related to #35809

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-08-29 14:45:00 +08:00
jaime
fcec4c21b9
fix: check collection health(queryable) fail for releasing collection (#34947)
issue: #34946

Signed-off-by: jaime <yun.zhang@zilliz.com>
2024-08-02 17:20:15 +08:00
congqixia
b284b81a47
fix: Check partition in current target when observing partition load status (#34282)
See also #34234

`LoadPartitions` does not guarantee the current target has loading
partitions if there are some partitions already loaded before.

This PR check current target contains the partition to load when
advancing loading percentage to 100.

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-07-01 17:40:07 +08:00
wayblink
a1232fafda
feat: Major compaction (#33620)
#30633

Signed-off-by: wayblink <anyang.wang@zilliz.com>
Co-authored-by: MrPresent-Han <chun.han@zilliz.com>
2024-06-10 21:34:08 +08:00
wei liu
c4806b69c4
enhance: Refactor leader view manager interface (#31133)
issue: #31091
This PR add GetByFilter interface in leader view manager, instead of all
kind of get func

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-04-10 15:13:36 +08:00
congqixia
73858b23bc
fix: Make target observer auto/manual task mutual exclusive (#31584)
See also #30867

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-03-26 09:57:08 +08:00
chyezh
9f9ef8ac32
enhance: transfer resource group and dbname to querynode when load (#30936)
issue: #30931

Signed-off-by: chyezh <chyezh@outlook.com>
2024-03-21 11:59:12 +08:00
congqixia
c886aa29ff
enhance: Use ListIndexes instead of DescribeIndex for qc broker (#31122)
See also #31103

Since querycoord need index meta information from datacoord only, broker
shall use `ListIndexes` to skip segment index building check logic in
datacoord

This PR is also related to #30538, in which DescribeIndex caused lots of
memory usage and lead to OOM eventually

---------

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-03-07 21:43:03 +08:00
yiwangdr
32cff25f97
enhance: decrease coordinator init time (#29822)
This PR mainly improve two items:
1. Target observer should refresh loading status during init time. An
uninitialized loading status blocks search/query. Currently, the target
observer refreshes every 10 seconds, i.e. we'd need to wait for 10s for
no reason. That's also the reason why we constantly see false log
"collection unloaded" upon mixcoord restarts.
2. Delete session when service is stopped. So that the new service
doesn't need to wait for the previous session to expire (~10s).

Item 1 is the major improvement of this PR, which should speed up init
time by 10s.
Item 2 is not a big concern in most cases as coordinators usually shut
down after stop(). In those cases, coordinator restart triggers serverID
change which further triggers an existing logic that deletes expired
session. This PR only fixes rare cases where serverID doesn't change.

integration test:
`go test -tags dynamic -v -coverprofile=profile.out -covermode=atomic
tests/integration/coordrecovery/coord_recovery_test.go -timeout=20m`
Performance after the change:
Average init time of coordinators: 10s
Hardware: M2 Pro
Test setup: 1000 collections with 1000 rows (dim=128) per collection.


issue: #29409

Signed-off-by: yiwangdr <yiwangdr@gmail.com>
2024-02-05 14:00:12 +08:00
congqixia
da7c3cbd88
enhance: make delegator delete buffer holding all delete from cp (#29626)
See also #29625

This PR:
- Add a new implemention of `DeleteBuffer`: listDeleteBuffer
  - holds cacheBlock slice
  - `Put` method append new delete data into last block
  - when a block is full, append a new block into the list
- Add `TryDiscard` method for `DeleteBuffer` interface
  - For doubleCacheBuffer, do nothing
- For listDeleteBuffer, try to evict "old" blocks, which are blocks
before the first block whose start ts is behind provided ts
- Add checkpoint field for `UpdateVersion` sync action, which shall be
used to discard old cache delete block

---------

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-01-04 17:02:46 +08:00
yah01
9819090247
enhance: add more logs for target updating (#29090)
- add more logs about the condition satisfying

---------

Signed-off-by: yah01 <yah2er0ne@outlook.com>
Signed-off-by: yah01 <yang.cen@zilliz.com>
2023-12-12 14:06:43 +08:00
aoiasd
b4af6f8c40
fix: sync action load segment with lack collection index info list (#28788)
relate: https://github.com/milvus-io/milvus/issues/28779
https://github.com/milvus-io/milvus/issues/28637

Signed-off-by: aoiasd <zhicheng.yue@zilliz.com>
2023-12-04 18:14:34 +08:00
wei liu
d081fd5481
enhance: Change some frequency log to rated level (#28897)
This pr change some frequency log's level to rated.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2023-12-04 10:38:35 +08:00
yah01
ecb3f585c3
Fix passing the wrong dropped list from current target (#28265)
Signed-off-by: yah01 <yah2er0ne@outlook.com>
2023-11-08 17:02:18 +08:00
yah01
1b90630633
Fix the target updated before version updated to cause data missing (#28250)
Signed-off-by: yah01 <yah2er0ne@outlook.com>
2023-11-08 11:36:22 +08:00
wei liu
e0222b2ce3
refine target manager code style (#27883)
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2023-10-25 00:44:12 +08:00
congqixia
93a877f55e
Make qcv2 target&leader observer execute in parallel (#27844)
- Add `taskDispatcher` to submit and run task async safely
- Change `LeaderObeserver` and `TargetObserver` schedule and manual check action to submitting task into dispatcher
- Fix logic problem in collection observer when manual check return false

See also #27494

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2023-10-24 10:14:11 +08:00
wei liu
55e5f80e24
update collection target after observer start (#27774)
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2023-10-19 21:52:10 +08:00
congqixia
eca79d149c
Add ctx control for observer manual check methods (#27531)
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2023-10-09 11:07:33 +08:00
yah01
a8ce1b6686
Refine QueryCoord stopping (#27371)
Signed-off-by: yah01 <yah2er0ne@outlook.com>
2023-09-27 16:27:27 +08:00
wei liu
949c320185
remove pull target from qc recover (#26775)
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2023-09-01 11:17:01 +08:00
Enwei Jiao
7d61355ab0
Refactor log for Query (#26310)
Signed-off-by: Enwei Jiao <enwei.jiao@zilliz.com>
2023-08-14 18:57:32 +08:00
wei liu
6f89620a43
remove pull target rpc from lock (#26054)
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2023-08-04 10:31:06 +08:00
yah01
848ef7418b
Fix the refresh may be notified finished early (#24438)
Signed-off-by: yah01 <yah2er0ne@outlook.com>
2023-05-26 18:43:26 +08:00
yah01
296380d6e6
Support async refresh (#23107)
Signed-off-by: yah01 <yang.cen@zilliz.com>
2023-04-12 15:06:28 +08:00
jaime
c9d0c157ec
Move some modules from internal to public package (#22572)
Signed-off-by: jaime <yun.zhang@zilliz.com>
2023-04-06 19:14:32 +08:00
yah01
2b81933d13
Refine logs of DistHandler (#22879)
Signed-off-by: yah01 <yang.cen@zilliz.com>
2023-03-21 18:11:57 +08:00
yihao.dai
1f718118e9
Dynamic load/release partitions (#22655)
Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2023-03-20 14:55:57 +08:00
zhenshan.cao
e768437681
Correct usage of Timer and Ticker (#22228)
Signed-off-by: zhenshan.cao <zhenshan.cao@zilliz.com>
2023-02-23 18:59:45 +08:00