issue: https://github.com/milvus-io/milvus/issues/41690
This commit implements partial search result functionality when query
nodes go down, improving system availability during node failures. The
changes include:
- Enhanced load balancing in proxy (lb_policy.go) to handle node
failures with retry support
- Added partial search result capability in querynode delegator and
distribution logic
- Implemented tests for various partial result scenarios when nodes go
down
- Added metrics to track partial search results in querynode_metrics.go
- Updated parameter configuration to support partial result required
data ratio
- Replaced old partial_search_test.go with more comprehensive
partial_result_on_node_down_test.go
- Updated proto definitions and improved retry logic
These changes improve query resilience by returning partial results to
users when some query nodes are unavailable, ensuring that queries don't
completely fail when a portion of data remains accessible.
---------
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
issue: https://github.com/milvus-io/milvus/issues/41690
- Merge leader view and channel management into ChannelDistManager,
allowing a channel to have multiple delegators.
- Improve shard leader switching to ensure a single replica only has one
shard leader per channel. The shard leader handles all resource loading
and query requests.
- Refine the serviceable mechanism: after QC completes loading, sync the
query view to the delegator. The delegator then determines its
serviceable status based on the query view.
- When a delegator encounters forwarding query or deletion failures,
mark the corresponding segment as offline and transition it to an
unserviceable state.
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
issue: #39937
Before PR #39552, whenever a segment was missing in either the `current
target` or the `next target`, we would trigger `load segment` to recover
the delegator. However, restoring only the missing segments in the `next
target` is sufficient to advance the target and complete the recovery
process.
In PR #39552, we removed the scheduling of L0 segments along with this
unnecessary `load segment` logic. However, this exposed a new issue: if
the `current target` still has missing segments and there is a flaw in
the `checkDelegatorDataReady` logic, it could block the recovery of a
delegator that contains `offline segments`.
Since `offline segments` are cleaned up asynchronously in this scenario,
this PR removes their blocking effect on delegator recovery, ensuring a
smoother failure recovery process.
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
issue: #39551
This PR remove querycoord's scheduling of l0 segments:
- only load l0 segment when watch channel
- only release l0 segment when release channel or sync data distribution
---------
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
1. Provide partition&channel level indexing in the collection target.
2. Make `SegmentAction` not wait for distribution.
3. Remove scheduler and target manager mutex.
4. Optimize logging to reduce CPU overhead.
issue: https://github.com/milvus-io/milvus/issues/37630
---------
Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
issue: #35563
1. Use an internal health checker to monitor the cluster's health state,
storing the latest state on the coordinator node. The CheckHealth
request retrieves the cluster's health from this latest state on the
proxy sides, which enhances cluster stability.
2. Each health check will assess all collections and channels, with
detailed failure messages temporarily saved in the latest state.
3. Use CheckHealth request instead of the heavy GetMetrics request on
the querynode and datanode
Signed-off-by: jaime <yun.zhang@zilliz.com>
issue: #35917
This PR refine the querycoord meta related interfaces to ensure that
each method includes a ctx parameter.
Signed-off-by: tinswzy <zhenyuan.wei@zilliz.com>
issue: #36293#36242
after qn recover, delegator may be loaded in new node, after all segment
has been loaded, delegator becomes serviceable. but delegator's target
version hasn't been synced, and if search/query comes, delegator will
use wrong target version to filter out a empty segment list, which
caused empty search result.
This pr will block delegator's serviceable status until target version
is synced
---------
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
When check health logic failed to collection not-queryable, the related
reason is hard to find in log.
This PR add context for log with trace id and print unqueryable
collection info log.
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
issue: #35846
querycoord will notify proxy to update shard leader cache after
delegator location changes, but during querynode's failure recovery,
some delegator may become unserviceable due to lacking of segments, and
back to serviceable after segment loaded, so we also need to notify
proxy to invalidate shard leader cache when delegator serviceable state
changes.
This PR will maintain querynode's serviceable state during heartbeat,
and notify proxy to invalidate shard leader cache if serviceable state
changes.
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
issue: #32466
this PR enhance that when shard location changed, update proxy's shard
leader cache. in case of query node failover case, proxy can find
replica recover
---------
Signed-off-by: Wei Liu <wei.liu@zilliz.com>