issue: #36970
Release segment and balance channel may run at the same time. Before
the new delegator becomes serviceable, if release segment executes on
the new delegator while a search/query arrives on the old delegator,
the release and the query run in parallel. If the release executes
first in the worker, the search/query gets a SegmentNotLoaded error.
This PR adds a serviceable filter on delegators, so that all
load/release segment operations happen on a serviceable delegator.
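As a rough illustration of the serviceable filter, the sketch below picks a delegator for a channel and skips any that is not yet serviceable. The `Delegator` struct and `PickServiceable` are hypothetical names for this example, not the types used in querycoord.

```go
package main

// Delegator is a hypothetical, simplified view of a shard delegator.
type Delegator struct {
	NodeID      int64
	Channel     string
	Serviceable bool
}

// PickServiceable returns the node ID of a serviceable delegator for the
// channel. During channel balance both the old and the new delegator may be
// listed; the new one is skipped until it reports itself serviceable.
func PickServiceable(channel string, delegators []Delegator) (int64, bool) {
	for _, d := range delegators {
		if d.Channel == channel && d.Serviceable {
			return d.NodeID, true
		}
	}
	return 0, false
}
```

With this shape, a release issued during channel balance lands on the same delegator that is still answering search/query requests.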
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
issue: #36621
1. Add API to access task runtime metrics, including:
- build index task
- compaction task
- import task
- balance (including load/release of segments/channels and some leader
tasks on querycoord)
- sync task
2. Add a debug mode to the webpage, enabled or disabled by setting
debug=true or debug=false in the URL query parameters.
Signed-off-by: jaime <yun.zhang@zilliz.com>
issue: #36536
query coord uses `segmentTaskDelta/channelTaskDelta` to measure the
executing workload of each querynode in the scheduler, and maintains
these values in `scheduler.Add(task)` and `scheduler.remove(task)`.
However, `scheduler.remove(task)` has been called in unexpected ways,
which produces a wrong `segmentTaskDelta/channelTaskDelta` value,
affects the segment assignment logic, and leaves segments unbalanced.
This PR computes `segmentTaskDelta/channelTaskDelta` on access instead,
so that a wrong stored value can no longer affect assignment.
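A minimal sketch of the compute-on-access idea, assuming a scheduler that only stores its live tasks. The `Task` and `Scheduler` types here are simplified stand-ins invented for the example; the real scheduler carries much more state.

```go
package main

// Task is a hypothetical, simplified scheduler task.
type Task struct {
	ID     int64
	NodeID int64
	// Delta is +1 for a load (grow) action and -1 for a release (reduce) action.
	Delta int
}

// Scheduler keeps only the set of live tasks; no running counter is stored.
type Scheduler struct {
	tasks map[int64]Task
}

// NewScheduler returns an empty scheduler.
func NewScheduler() *Scheduler {
	return &Scheduler{tasks: make(map[int64]Task)}
}

// Add registers a task.
func (s *Scheduler) Add(t Task) { s.tasks[t.ID] = t }

// Remove is idempotent: removing a task twice cannot skew the delta.
func (s *Scheduler) Remove(id int64) { delete(s.tasks, id) }

// SegmentTaskDelta sums the deltas of the tasks that target nodeID at call time.
func (s *Scheduler) SegmentTaskDelta(nodeID int64) int {
	sum := 0
	for _, t := range s.tasks {
		if t.NodeID == nodeID {
			sum += t.Delta
		}
	}
	return sum
}
```

Because the value is derived from the task set on every call, an extra or missing `Remove` cannot leave a stale counter behind; the cost is a scan of the live tasks per read.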
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
issue: #35821
After a collection is loaded, increasing or decreasing its replica
number requires releasing and loading it again.
Milvus offers 4 solutions to update a loaded collection's replicas; this
PR aims to change the replica number dynamically without a release.
After the replica number changes, Milvus loads or releases replicas
asynchronously, and the replica load status can be checked with the
getReplicas API.
Note that if more replicas are requested than the querynodes can
afford, the new replicas won't load successfully until enough
querynodes join.
---------
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
issue: #36426
The old constraint required that only segments in the current target
can be balanced. This is wrong: it prevents a segment from being moved
off a stopping node if the segment exists only in the next target.
By design, stopping balance must move all segments off the node through
balance tasks, so the old constraint is removed.
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
issue: #35846
querycoord notifies the proxy to update the shard leader cache after a
delegator's location changes. But during querynode failure recovery,
some delegators may become unserviceable due to a lack of segments and
return to serviceable once the segments are loaded, so we also need to
notify the proxy to invalidate the shard leader cache when a
delegator's serviceable state changes.
This PR maintains the querynode's serviceable state during heartbeat
and notifies the proxy to invalidate the shard leader cache if the
serviceable state changes.
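The change detection could look roughly like the sketch below. `ServiceableTracker` and `Observe` are hypothetical names, and treating the first report for a channel as a change is an assumption of this example.

```go
package main

// ServiceableTracker remembers the last serviceable state seen per channel.
type ServiceableTracker struct {
	last map[string]bool
}

// NewServiceableTracker returns a tracker with no recorded state.
func NewServiceableTracker() *ServiceableTracker {
	return &ServiceableTracker{last: make(map[string]bool)}
}

// Observe records the state carried by a heartbeat and reports whether the
// shard leader cache should be invalidated. The first report for a channel
// counts as a change (an assumption of this sketch).
func (st *ServiceableTracker) Observe(channel string, serviceable bool) bool {
	prev, seen := st.last[channel]
	st.last[channel] = serviceable
	return !seen || prev != serviceable
}
```

A heartbeat handler would call `Observe` for each delegator and send the invalidation only when it returns true, so repeated heartbeats with an unchanged state cause no extra notifications.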
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
issue: #34095
When a new query node comes online, the segment_checker,
channel_checker, and balance_checker simultaneously attempt to allocate
segments to it. If this occurs during the execution of a load task and
the distribution of the new query node hasn't been updated, the query
coordinator may mistakenly view the new query node as empty. As a
result, it assigns segments or channels to it, potentially overloading
the new query node with more segments or channels than expected.
This PR measures the workload of the executing tasks on the target query
node to prevent assigning an excessive number of segments to it.
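One way to picture the workload-aware assignment is a node score that adds in-flight work to the reported distribution. The `NodeLoad` type and `LeastLoaded` function are illustrative assumptions, not querycoord's scoring code.

```go
package main

// NodeLoad is a hypothetical summary of one query node's workload.
type NodeLoad struct {
	NodeID   int64
	Loaded   int // segments already reported in the node's distribution
	InFlight int // segments carried by executing load tasks, not yet reported
}

// LeastLoaded returns the node with the smallest Loaded+InFlight.
// Ties go to the lower NodeID. The bool is false when nodes is empty.
func LeastLoaded(nodes []NodeLoad) (int64, bool) {
	if len(nodes) == 0 {
		return 0, false
	}
	best := nodes[0]
	for _, n := range nodes[1:] {
		bw, nw := best.Loaded+best.InFlight, n.Loaded+n.InFlight
		if nw < bw || (nw == bw && n.NodeID < best.NodeID) {
			best = n
		}
	}
	return best.NodeID, true
}
```

A freshly joined node with an empty distribution but eight segments in flight then scores 8 rather than 0, so the checkers stop piling more work onto it.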
---------
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
When querycoord processes a segment task, it iterates the whole
segment list to check whether the segment is loaded, which costs too
much CPU when there are thousands of segments.
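The message does not say how the scan was removed. One common way to avoid a linear scan, shown here purely as an assumption, is to keep a set keyed by node and segment ID; `SegmentIndex` and its methods are made-up names.

```go
package main

// segmentKey identifies a segment on a specific node.
type segmentKey struct {
	NodeID    int64
	SegmentID int64
}

// SegmentIndex is a hypothetical set of loaded segments with O(1) lookup.
type SegmentIndex struct {
	loaded map[segmentKey]struct{}
}

// NewSegmentIndex returns an empty index.
func NewSegmentIndex() *SegmentIndex {
	return &SegmentIndex{loaded: make(map[segmentKey]struct{})}
}

// Add marks the segment as loaded on the node.
func (idx *SegmentIndex) Add(nodeID, segmentID int64) {
	idx.loaded[segmentKey{nodeID, segmentID}] = struct{}{}
}

// Remove clears the loaded mark; it is a no-op for unknown segments.
func (idx *SegmentIndex) Remove(nodeID, segmentID int64) {
	delete(idx.loaded, segmentKey{nodeID, segmentID})
}

// IsLoaded answers the per-task check without scanning every segment.
func (idx *SegmentIndex) IsLoaded(nodeID, segmentID int64) bool {
	_, ok := idx.loaded[segmentKey{nodeID, segmentID}]
	return ok
}
```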
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
issue: #31091
This PR adds a GetByFilter interface to the leader view manager,
replacing the various get functions.
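A filter-based getter of this kind might be shaped like the sketch below. Only the name `GetByFilter` comes from the message; the `LeaderView` fields, the filter constructors, and the manager layout are assumptions for illustration.

```go
package main

// LeaderView is a hypothetical, trimmed-down leader view.
type LeaderView struct {
	NodeID       int64
	CollectionID int64
	Channel      string
}

// LeaderViewFilter reports whether a view should be kept.
type LeaderViewFilter func(*LeaderView) bool

// WithNodeID keeps views that live on the given node (hypothetical helper).
func WithNodeID(nodeID int64) LeaderViewFilter {
	return func(v *LeaderView) bool { return v.NodeID == nodeID }
}

// WithChannelName keeps views of the given channel (hypothetical helper).
func WithChannelName(channel string) LeaderViewFilter {
	return func(v *LeaderView) bool { return v.Channel == channel }
}

// LeaderViewManager holds every known view.
type LeaderViewManager struct {
	views []*LeaderView
}

// GetByFilter returns the views accepted by all filters. With no filters it
// returns every view.
func (m *LeaderViewManager) GetByFilter(filters ...LeaderViewFilter) []*LeaderView {
	var out []*LeaderView
	for _, v := range m.views {
		keep := true
		for _, f := range filters {
			if !f(v) {
				keep = false
				break
			}
		}
		if keep {
			out = append(out, v)
		}
	}
	return out
}
```

Callers compose filters instead of picking among several specialised getters, for example `m.GetByFilter(WithNodeID(1), WithChannelName("ch1"))`.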
---------
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
issue: #30816
Stale-check rules for leader tasks:
1. A reduce leader task should keep executing until the leader's node
goes offline.
2. A grow leader task should keep executing until the leader's node
becomes stopping.
This PR checks the leader node's stopping state for grow leader tasks.
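The two rules above can be sketched as a small decision function. The enum and function names are hypothetical, and treating an offline node as stale for grow tasks as well is an assumption that follows from rule 2.

```go
package main

// NodeState is a hypothetical lifecycle state of the leader's node.
type NodeState int

const (
	NodeOnline NodeState = iota
	NodeStopping
	NodeOffline
)

// LeaderAction is the kind of change a leader task applies to the leader view.
type LeaderAction int

const (
	ActionGrow LeaderAction = iota
	ActionReduce
)

// IsLeaderTaskStale applies the rules: a reduce task stays valid until the
// node is offline, while a grow task is stale once the node is stopping
// (or, by assumption, already offline).
func IsLeaderTaskStale(action LeaderAction, state NodeState) bool {
	switch action {
	case ActionReduce:
		return state == NodeOffline
	case ActionGrow:
		return state != NodeOnline
	default:
		return true // unknown actions are treated as stale in this sketch
	}
}
```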
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
issue: #31480 #31481
A task that releases a duplicate l0 segment may cause a missing segment
if it executes on the old delegator, and may break the new delegator's
leader view if it executes on the new delegator.
This PR makes segment_checker skip releasing duplicate l0 segments,
because l0 segments are released when the channel is unsubscribed.
---------
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
issue: #30816
PR #31319 introduced logic requiring the segment checker to load level
zero segments that exist only in the current target.
This PR fixes the failure to promote a load segment task when the
segment belongs only to the current target.
---------
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
This PR adds metrics for task latency in the querycoord scheduler, so
that a stuck task of any kind is easy to spot from the metrics.
---------
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
issue: #30186
During channel balance, after the new delegator is loaded, instead of
syncing the l0 segment's location to the new delegator, we should load
the l0 segment on the new delegator and release the old l0 segment,
then start releasing the old delegator.
---------
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
issue: #30890
When the leader checker finds that the leader view holds an older load
version of a segment, it tries to correct the leader view. But the sync
action doesn't specify the latest load version, so the update fails.
This PR fixes the leader checker being unable to update a segment's
load version and repeatedly generating the same task for the scheduler.
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
See also #31103
Since querycoord needs only index meta information from datacoord, the
broker shall use `ListIndexes` to skip the segment index building check
logic in datacoord.
This PR is also related to #30538, in which DescribeIndex caused heavy
memory usage and eventually led to OOM.
---------
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
issue: #30150
`checkLeaderTaskStale` checks whether the segment exists in the next
target for a leaderTask's grow action, which causes promoting the
leader task to fail when the segment exists only in the current target.
This PR checks whether the segment exists in either the current or the
next target.
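The corrected check amounts to a lookup in both targets. The `Target` type below is a hypothetical stand-in (a plain set of segment IDs), not the target manager's real representation.

```go
package main

// Target is a hypothetical set of segment IDs belonging to one target.
type Target map[int64]struct{}

// ExistsInEitherTarget reports whether the segment is present in the current
// or the next target. A nil Target behaves like an empty one.
func ExistsInEitherTarget(segmentID int64, current, next Target) bool {
	_, inCurrent := current[segmentID]
	_, inNext := next[segmentID]
	return inCurrent || inNext
}
```

A grow action whose segment appears only in the current target passes this check, so the task is no longer judged stale before it can be promoted.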
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
issue: #30150
see also: #30258
`SyncDataDistribution` tries to load delta for the segment. If
indexInfo is missing from the request, the sync action fails due to
the lack of index info.
This PR sets indexInfo when setting a segment to the leader view.
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
issue: #30723
This PR skips generating balance tasks when the collection's target
isn't ready.
It also refines the stale check in query coord's scheduler: if the
channel exists in the current or next target, the task won't be
canceled.
---------
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
See also #30150
For a leader view distribution with offline nodes, a release task can
never be sent to the querynode due to the targetNode online check
logic. Even if the request is dispatched, a normal release task does
not set the "force" flag when calling `delegator.ReleaseSegment`.
This PR adds a new type of querycoord task, LeaderTask, whose
responsibility is to rectify the leader view distribution.
---------
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
issue: #30150
This PR fixes the following problems:
1. leader checker uses the wrong node id when generating a release
task, which causes the release task to finish immediately
2. the release request generated by leader_checker doesn't set the
`force` flag, so the operation to clean the leader view on the
delegator fails.
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
issue: #29841
If a segment is already loaded, submitting a load segment task for it
isn't permitted, to avoid loading the segment twice. But this check
blocks the leader checker from correcting the leader view via
`LoadSegment`.
This PR removes the segment loaded check, so that the leader checker
can submit load tasks.
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
Recent changes moved the level 0 segment list to a new proto field, so
QueryCoord can't see the level 0 segments. This PR handles the new
field.
fix #29907
Signed-off-by: yah01 <yang.cen@zilliz.com>
See also #29803
This PR:
- Add trace span for collection/partition load
- Use TraceSpan to generate Segment/ChannelTasks when loading
- Refine BaseTask trace tag usage
---------
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>