issue: #30150
pr: #30151
This PR fix three problems:
1. the load request generated by leader checker doesn't set load scope
2. leader checker use wrong node id when generate release task, which
cause the release task finished immediately
3. the release request generated by leader_checker doesn't set the force
flag, the operation to clean leader view on delegator will fail.
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
Cherry-pick from master
pr: #30066
Shuffle candidates to reduce scenario that some channel allocated into
same node
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
issue: #29677#29838
pr: #29999
during get shard leaders, if qeurynode doesn't ack the heartbeat than
10s, querycoord will treat it as unavailable, and won't return shard
leader on it. but when querynode has a full cpu usage, it's easily to
stuck for more than 10s without ack the heartbeat, which cause no shard
leader to search/query.
This PR remove heartbeat lag logic during get shard leaders
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
issue: #29453
pr: #29452
sync distribution by rpc will also call loadSegment/releaseSegment,
which may cause all kinds of concurrent case on same segment, such as
concurrent load and release on one segment.
This PR add leader_checker which generate load/release task to correct
the leader view, instead of calling sync distribution by rpc
---------
---------
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
Cherry-pick from master
pr: #29922
See also: #29650
Either segment dml position & channel checkpoint could be newer in some
cases. This PR make PackLoadSegments use the newer one improving load
performance during cases where there are lots of upsert.
---------
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
issue: #29814
pr: #29724
if channel is not subscribed yet, the generated load segment task will
be remove from task scheduler due to the load segment task need to be
transfer to worker node by shard leader.
This PR skip generate load segment task when channel is not subscribed
yet.
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
Cherry-pick from master
pr: #29806
See also #29803
This PR:
- Add trace span for collection/partition load
- Use TraceSpan to generate Segment/ChannelTasks when loading
- Refine BaseTask trace tag usage
---------
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
issue: #29582
pr: #29574
This PR rewrite gen segment plan logic based on assign segment in
`score_based_balancer`
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
Cherry pick from master
pr: #29577
Related to #29575
Add `getCollectionTarget` method which is atomic when scope is
`CurrentTargetFirst` or `NextTargetFirst`
Also return error when executor finds no channel in target manager
---------
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
issue: #29523
pr: #29525
readable shard leader should still be the old one during channel
balance, if the new shard leader is not ready.
This PR fixed that query coord choose wrong shard leader during balance
channel
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
Cherry-pick from master
pr: #29526
`WatchDmChannel` only need growing segment info, this PR removes fetch
segmentInfos when fill watch dml channel request.
---------
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
the executor always fetches the latest segment info, so we could consume
from the latest checkpoint, which could save much time while deleted
many entities
pr: #29455
Signed-off-by: yah01 <yang.cen@zilliz.com>
Signed-off-by: yah01 <yah2er0ne@outlook.com>
pr: #29473
`AssignSegment` method defines how to assign segment to nodes, but
score_based_balance implement another assign logic in
`genStoppingSegmentPlan`
This PR rewrite gen stopping segment plan based on assign segment.
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
pr: #29443
milvus branch 2.3 add `loadType` in CollectionLoadInfo, so for
collection meta upgrade from 2.2, we should add `loadType` to
CollectionLoadInfo. This PR update CollectionLoadInfo with `loadType`
when meet a old version CollectionLoadInfo
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
issue: #23726
pr: #29231
This PR add control config to querycoord's background auto balance
channel operation
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
issue: #28622
pr: #29216
After we support balance segment with growing segment count #28623, if
we balance segment and channel at same time, some segments need to be
rebalanced after balance channel finish.
This PR skip balance segment when channel need be balanced.
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
issue: #23726
pr: #28469
1. enable auto balance channel between nodes in querycoord
2. make `genSegmentPlan` reuse the `AssignSegment` logic
3. make `genChannelPlan` reuse the `AssignChannel` logic
---------
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
we found the load got stuck probably, and reviewed the logs.
the target observer seems not working, the reason is the taskDispatcher
removes the task in a goroutine, and modifies the task status after
committing the task into the goroutine pool, but this may happen after
the task removed, which leads to the task will never be removed
related #29086
pr: #29191
Signed-off-by: yah01 <yang.cen@zilliz.com>
issue: #28622
pr: #28623
query node with delegator will has more rows than other query node due
to delgator loads all growing rows.
This PR enable the balance segment which based on the num of growing
rows in leader view.
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
This pull request enhances the logging functionality in the code related
to target updating. It adds more logs about the condition satisfying
when updating the target. The logs provide additional information about
the collection ID, replica number, channel readiness, segment readiness,
and leader view readiness. These logs will help in troubleshooting and
monitoring the target updating process.
pr: #29090
Signed-off-by: yah01 <yah2er0ne@outlook.com>
Signed-off-by: yah01 <yang.cen@zilliz.com>
pr: #28829
issue: #28831
release old delegator before new delegator update it's distribution may
cause `channel not available` error
This PR will block release old delgator before new delegator finish
`syncDistribution`
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
Cherry-pick from master
pr: #28661
See also #28660
This pr add request timeout config item for etcd kv request timeout
Sync the default timeout value to same value for etcdKV & tikv config
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
issue: #28332
pr: #28396
during querycoord's recover, it try to call `DescribeCollection` and
`ShowPartitions` to root coord, to checker whether collection or
partition has been released in rootcoord. but if rootcoord isn't not
ready yet, the rpc will fail, the querycoord panic.
to fix this, we remove rpc call during querycoord's start
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
Cherry-pick from master
pr: #28472
See also #28466
In `taskDispatcher.schedule`, same task may be resubmitted if the
previous round did not finish
In this case, TaskObserver.check may set current target by mistake,
which may cause the random search/query failure
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
Remove the "failCount" log field, which is ambiguous
replace the status (int32) with string, to improve the readability for
log of task removed
pr: #28331
Signed-off-by: yah01 <yah2er0ne@outlook.com>