368 Commits

Author SHA1 Message Date
wei liu
b2997eb881
fix: Leader checker can't remove segment from leader view (#30152)
issue: #30150
pr: #30151

This PR fixes three problems:

1. The load request generated by the leader checker doesn't set the
load scope.
2. The leader checker uses the wrong node ID when generating a release
task, which causes the release task to finish immediately.
3. The release request generated by the leader checker doesn't set the
force flag, so the operation to clean the leader view on the delegator
fails.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-01-20 18:58:58 +08:00
congqixia
079ddbfc01
enhance: [Cherry-pick] Shuffle candidates before channel assignment (#30066) (#30089)
Cherry-pick from master
pr: #30066

Shuffle candidates to reduce the chance that several channels are
allocated to the same node

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-01-19 12:06:54 +08:00
SimFG
be1470a654
enhance: [2.3] Add load/release partitions to replicate msg stream (#30001)
/kind improvement
pr: #28399

---------

Signed-off-by: SimFG <bang.fu@zilliz.com>
2024-01-18 22:50:55 +08:00
wei liu
71e24f0a7f
fix: Remove heartbeat lag logic during get shard leaders (#29999) (#30085)
issue: #29677 #29838
pr: #29999
During get shard leaders, if a querynode doesn't ack the heartbeat for
more than 10s, querycoord treats it as unavailable and won't return a
shard leader on it. But when a querynode is at full CPU usage, it can
easily stall for more than 10s without acking the heartbeat, which
leaves no shard leader available for search/query.

This PR removes the heartbeat lag logic during get shard leaders

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-01-18 17:48:55 +08:00
congqixia
7f32576f36
enhance: [cherry-pick] replace magic number with ParamItem for dist handler (#30020) (#30070)
Cherry-pick from master
pr: #30020
See also #28817

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-01-18 15:58:54 +08:00
wei liu
7d73032582
enhance: refactor leader_observer to leader_checker (#29454) (#29984)
issue: #29453
pr: #29452
Syncing distribution by RPC also calls loadSegment/releaseSegment,
which may cause all kinds of concurrency cases on the same segment, such
as a concurrent load and release of one segment.
This PR adds leader_checker, which generates load/release tasks to
correct the leader view instead of calling sync distribution by RPC

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-01-18 14:08:54 +08:00
congqixia
7fc7e1a0d5
enhance: [Cherry-pick] Use newer checkpoint when packing LoadSegmentRequest (#29922) (#29978)
Cherry-pick from master
pr: #29922 
See also: #29650

Either the segment DML position or the channel checkpoint could be newer
in some cases. This PR makes PackLoadSegments use the newer one,
improving load performance in cases where there are lots of upserts.

---------

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-01-16 12:08:53 +08:00
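As a rough illustration of the "use the newer one" rule in 7fc7e1a0d5, the sketch below compares two positions by timestamp. The `Position` type and `newerPosition` function are hypothetical stand-ins for this example, and comparing by timestamp alone is an assumption.

```go
package main

import "fmt"

// Position is a hypothetical stand-in for a message-stream position.
type Position struct {
	Channel   string
	Timestamp uint64
}

// newerPosition returns whichever position has the larger timestamp.
// A nil argument is treated as unknown, so the other position wins.
func newerPosition(segmentDML, channelCheckpoint *Position) *Position {
	if segmentDML == nil {
		return channelCheckpoint
	}
	if channelCheckpoint == nil {
		return segmentDML
	}
	if channelCheckpoint.Timestamp > segmentDML.Timestamp {
		return channelCheckpoint
	}
	return segmentDML
}

func main() {
	seg := &Position{Channel: "dml-0", Timestamp: 100}
	cp := &Position{Channel: "dml-0", Timestamp: 250}
	fmt.Println(newerPosition(seg, cp).Timestamp) // prints 250
}
```

Starting consumption from the later position means fewer already-applied messages have to be replayed.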
wei liu
81fdb6f472
enhance: Skip generate load segment task (#29724) (#29982)
issue: #29814
pr: #29724
If the channel is not subscribed yet, the generated load segment task
will be removed from the task scheduler, because a load segment task
needs to be transferred to the worker node by the shard leader.

This PR skips generating load segment tasks when the channel is not
subscribed yet.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-01-16 10:12:52 +08:00
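The skip described in 81fdb6f472 amounts to filtering out segments whose channel has no shard leader yet. The sketch below assumes a hypothetical `Segment` type and a `subscribed` map keyed by channel name; neither comes from the Milvus sources.

```go
package main

import "fmt"

// Segment is a hypothetical, simplified segment descriptor.
type Segment struct {
	ID      int64
	Channel string
}

// filterLoadable keeps only the segments whose channel is already subscribed,
// so no load task is generated that the scheduler would drop right away.
func filterLoadable(segments []Segment, subscribed map[string]bool) []Segment {
	loadable := make([]Segment, 0, len(segments))
	for _, s := range segments {
		if subscribed[s.Channel] {
			loadable = append(loadable, s)
		}
	}
	return loadable
}

func main() {
	segments := []Segment{{ID: 1, Channel: "dml-0"}, {ID: 2, Channel: "dml-1"}}
	subscribed := map[string]bool{"dml-0": true}
	fmt.Println(filterLoadable(segments, subscribed)) // prints [{1 dml-0}]
}
```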
wei liu
5520bfbb05
enhance: Change some frequency log to rated level (#29720) (#29903)
pr: #29720
This PR changes some frequent logs to rated level

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-01-12 11:46:52 +08:00
congqixia
00c0a5a2ab
enhance: [Cherry-pick] make Load process traceable in querycoord (#29806) (#29869)
Cherry-pick from master
pr: #29806
See also #29803

This PR:
- Add trace span for collection/partition load
- Use TraceSpan to generate Segment/ChannelTasks when loading
- Refine BaseTask trace tag usage

---------

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-01-11 18:00:52 +08:00
congqixia
6c9a5e347e
fix: [cherry-pick] Assertion all async invocations in test case (#29737) (#29782)
Cherry-pick from master
pr: #29737
Resolves: #29736

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-01-09 17:48:49 +08:00
wei liu
4088b00602
enhance: Rewrite gen segment plan based on assign segment (#29574) (#29684)
issue: #29582
pr: #29574
This PR rewrites the gen segment plan logic based on assign segment in
`score_based_balancer`

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-01-06 09:58:49 +08:00
congqixia
5ec79ab6f6
fix: [Cherry-pick] Add atomic method to get collection target (#29580)
Cherry pick from master
pr: #29577
Related to #29575

Add a `getCollectionTarget` method, which is atomic when scope is
`CurrentTargetFirst` or `NextTargetFirst`.
Also return an error when the executor finds no channel in the target
manager.

---------

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2023-12-29 10:30:46 +08:00
wei liu
a13fc4d346
enhance: Remove useless log in collection observer (#29555)
pr: #29554
This PR removes useless logs in the collection observer

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2023-12-28 17:14:45 +08:00
wei liu
07ef52e845
fix: Choose wrong shard leader during balance channel(#29525) (#29532)
issue: #29523
pr: #29525

The readable shard leader should still be the old one during channel
balance if the new shard leader is not ready.
This PR fixes querycoord choosing the wrong shard leader during channel
balance

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2023-12-28 15:22:51 +08:00
congqixia
eb11b1a56e
enhance: [Cherry-pick] remove flushed segmentInfo in WatchChannelRequest (#29527)
Cherry-pick from master
pr: #29526
`WatchDmChannel` only needs growing segment info, so this PR removes
fetching segmentInfos when filling the watch DML channel request.

---------

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2023-12-27 23:46:46 +08:00
yah01
4c0ca83928
enhance: speed up loading with many deletions (#29455) (#29520)
The executor always fetches the latest segment info, so we can consume
from the latest checkpoint, which saves much time when many entities
have been deleted

pr: #29455

Signed-off-by: yah01 <yang.cen@zilliz.com>
Signed-off-by: yah01 <yah2er0ne@outlook.com>
2023-12-27 23:24:46 +08:00
wei liu
ad37b98cda
enhance: Rewrite gen stopping segment plan based on assign segment (29473) (#29480)
pr: #29473

The `AssignSegment` method defines how to assign segments to nodes, but
score_based_balance implements another assignment logic in
`genStoppingSegmentPlan`.
This PR rewrites gen stopping segment plan based on assign segment.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2023-12-27 15:50:47 +08:00
wei liu
d0bcbf3953
fix: Upgrade from 2.2 should update CollectionLoadInfo (#29443) (#29479)
pr: #29443
Milvus branch 2.3 adds `loadType` to CollectionLoadInfo, so for
collection meta upgraded from 2.2, we should add `loadType` to
CollectionLoadInfo. This PR updates CollectionLoadInfo with `loadType`
when it meets an old-version CollectionLoadInfo

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2023-12-27 15:48:58 +08:00
wei liu
26b1853c54
fix: Auto balance param can't be updated by dynamic(#29501) (#29502)
pr: #29501
This PR fixes the auto balance param not being dynamically updatable

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2023-12-27 14:30:53 +08:00
SimFG
74e72ce27e
enhance: [2.3] Support to get the param value in the runtime (#29298)
pr: #29297
/kind improvement

Signed-off-by: SimFG <bang.fu@zilliz.com>
2023-12-21 20:36:43 +08:00
wei liu
2d33c7fe41
enhance: Add config for querycoord auto balance channel (#29231) (#29262)
issue: #23726
pr: #29231
This PR adds a control config for querycoord's background auto balance
channel operation

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2023-12-18 14:32:41 +08:00
congqixia
49c9dc4923
fix: [cherry-pick] balance_unstable_view unit test (#29127) (#29249)
Cherry-pick from master
pr: #29127
fix: #29126
Allow unstable output channel balance plan

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2023-12-16 00:16:39 +08:00
wei liu
97d71c2580
enhance: Skip balance segment when channel need be balanced (#29116) (#29232)
issue: #28622
pr: #29216
After we supported balancing segments by growing segment count (#28623),
if we balance segments and channels at the same time, some segments need
to be rebalanced after the channel balance finishes.

This PR skips segment balancing when a channel needs to be balanced.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2023-12-15 15:58:37 +08:00
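The ordering rule from 97d71c2580 can be sketched as follows: if any channel plan exists, segment plans are not generated in that round. `ChannelPlan`, `SegmentPlan`, and `balanceReplica` are hypothetical names chosen for this example, not the actual balancer interface.

```go
package main

import "fmt"

// ChannelPlan and SegmentPlan are hypothetical, simplified move plans.
type ChannelPlan struct {
	Channel  string
	From, To int64
}

type SegmentPlan struct {
	SegmentID int64
	From, To  int64
}

// balanceReplica returns channel plans alone when any exist, and only falls
// back to generating segment plans when no channel needs to move.
func balanceReplica(channelPlans []ChannelPlan, genSegmentPlans func() []SegmentPlan) ([]SegmentPlan, []ChannelPlan) {
	if len(channelPlans) > 0 {
		// Moving a channel shifts growing rows between nodes, so segment
		// plans computed now could be stale once the move finishes.
		return nil, channelPlans
	}
	return genSegmentPlans(), nil
}

func main() {
	gen := func() []SegmentPlan { return []SegmentPlan{{SegmentID: 7, From: 1, To: 2}} }

	segs, chans := balanceReplica([]ChannelPlan{{Channel: "dml-0", From: 1, To: 2}}, gen)
	fmt.Println(len(segs), len(chans)) // prints 0 1

	segs, chans = balanceReplica(nil, gen)
	fmt.Println(len(segs), len(chans)) // prints 1 0
}
```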
wei liu
e8a480c28d
enhance: Enable balance channel in querycoord (#28469) (#29209)
issue: #23726
pr: #28469

1. enable auto balance channel between nodes in querycoord
2. make `genSegmentPlan` reuse the `AssignSegment` logic
3. make `genChannelPlan` reuse the `AssignChannel` logic

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2023-12-15 11:52:38 +08:00
yah01
5a8ddde92a
fix: load gets stuck probably (#29191) (#29192)
We found that load sometimes got stuck, and reviewed the logs.

The target observer seemed not to be working. The reason is that the
taskDispatcher removes the task in a goroutine and modifies the task
status after committing the task into the goroutine pool, but this
modification may happen after the task has been removed, so the task is
never removed

related #29086
pr: #29191

Signed-off-by: yah01 <yang.cen@zilliz.com>
2023-12-14 16:56:38 +08:00
wei liu
9092b1ae8a
feat: enable balance based on growing segment row count (#28623) (#29184)
issue: #28622 
pr: #28623
A query node with a delegator will have more rows than other query
nodes, because the delegator loads all growing rows.
This PR enables segment balancing based on the number of growing rows in
the leader view.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2023-12-14 15:26:37 +08:00
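To show why growing rows matter for 9092b1ae8a, the sketch below scores each node by sealed rows plus growing rows and picks the least loaded one. The unweighted sum, the `NodeRows` type, and both function names are assumptions for illustration; the real balancer may weight these terms differently.

```go
package main

import "fmt"

// NodeRows is a hypothetical per-node row summary.
type NodeRows struct {
	NodeID      int64
	SealedRows  int64
	GrowingRows int64 // rows held by a delegator on this node, if any
}

// score is an assumed, unweighted load measure: sealed plus growing rows.
func score(n NodeRows) int64 {
	return n.SealedRows + n.GrowingRows
}

// leastLoaded returns the ID of the node with the lowest score.
// The boolean is false when the input is empty.
func leastLoaded(nodes []NodeRows) (int64, bool) {
	if len(nodes) == 0 {
		return 0, false
	}
	best := nodes[0]
	for _, n := range nodes[1:] {
		if score(n) < score(best) {
			best = n
		}
	}
	return best.NodeID, true
}

func main() {
	nodes := []NodeRows{
		{NodeID: 1, SealedRows: 1000, GrowingRows: 800}, // hosts a delegator
		{NodeID: 2, SealedRows: 1200},
		{NodeID: 3, SealedRows: 1500},
	}
	id, _ := leastLoaded(nodes)
	fmt.Println(id) // prints 2
}
```

Counting sealed rows alone would make node 1 look like the lightest node, even though it holds the most rows overall.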
yah01
76757e53c4
enhance: Add more logs for target updating (#29090) (#29141)
This pull request enhances the logging functionality in the code related
to target updating. It adds more logs about the condition satisfying
when updating the target. The logs provide additional information about
the collection ID, replica number, channel readiness, segment readiness,
and leader view readiness. These logs will help in troubleshooting and
monitoring the target updating process.

pr: #29090

Signed-off-by: yah01 <yah2er0ne@outlook.com>
Signed-off-by: yah01 <yang.cen@zilliz.com>
2023-12-12 22:28:38 +08:00
yah01
4334e4e7ad
enhance: remove merger for load segments (#29062) (#29064)
remove merger as now QueryNode could load segments concurrently
fix https://github.com/milvus-io/milvus/issues/29063
pr: #29062

Signed-off-by: yah01 <yah2er0ne@outlook.com>
2023-12-12 16:22:50 +08:00
MrPresent-Han
5f4ac437b2
enhance: [Cherry-pick] Moving etcd client into session (#27069) (#28996)
relate: #26694
pr: https://github.com/milvus-io/milvus/pull/27069

Signed-off-by: Filip Haltmayer <filip.haltmayer@zilliz.com>
Signed-off-by: MrPresent-Han <chun.han@zilliz.com>
Co-authored-by: Filip Haltmayer <81822489+filip-halt@users.noreply.github.com>
2023-12-07 16:22:34 +08:00
aoiasd
8502037cff
fix: [Cherry-pick] sync action load segment with lack collection index info list (#28956)
relate: https://github.com/milvus-io/milvus/issues/28779
https://github.com/milvus-io/milvus/issues/28637
pr: https://github.com/milvus-io/milvus/pull/28788

Signed-off-by: aoiasd <zhicheng.yue@zilliz.com>
2023-12-07 14:14:42 +08:00
congqixia
3a33afd1fb
enhance: [cherry-pick] Change const magic number in querycoord to param (#28819) (#28947)
Cherry-pick from master
pr: #28819 
See also #28817

---------

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2023-12-04 19:06:40 +08:00
wei liu
c650240f31
enhance: Change some frequency log to rated level (#28897) (#28934)
pr: #28897
This PR changes the level of some frequent logs to rated.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2023-12-04 18:52:32 +08:00
wei liu
d2c171354f
fix: Balance channel may cause channel not availble error (#28829) (#28902)
pr: #28829
issue: #28831
Releasing the old delegator before the new delegator updates its
distribution may cause a `channel not available` error.
This PR blocks releasing the old delegator until the new delegator
finishes `syncDistribution`

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2023-12-04 09:40:32 +08:00
jaime
9378f78218
enhance: Add logs for each step during service initialization (#28687)
/kind improvement
pr: #28624

Signed-off-by: jaime <yun.zhang@zilliz.com>
2023-11-27 17:54:26 +08:00
congqixia
6512b12fba
enhance: [cherry-pick] Make etcd kv request timeout configurable (#28661) (#28701)
Cherry-pick from master
pr: #28661
See also #28660
This PR adds a request timeout config item for etcd kv requests and
syncs the default timeout to the same value for the etcdKV & tikv configs

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2023-11-24 21:16:26 +08:00
yah01
5ca7851f4a
enhance: refine error messages (#28424) (#28614)
- Split the simple reason and full detail
- Refine existing error messages related: #28422
related: https://github.com/milvus-io/milvus/issues/28422
pr: #28424

---------

Signed-off-by: yah01 <yah2er0ne@outlook.com>
2023-11-24 10:04:24 +08:00
wei liu
c7ec882033
enhance: Remove rpc during querycoord start (#28396) (#28604)
issue: #28332
pr: #28396

During querycoord's recovery, it tries to call `DescribeCollection` and
`ShowPartitions` on rootcoord to check whether a collection or
partition has been released in rootcoord. But if rootcoord isn't ready
yet, the RPC fails and querycoord panics.

To fix this, we remove the RPC calls during querycoord's start

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2023-11-21 18:08:29 +08:00
congqixia
d0f94f3d17
fix: make qcv2 observer dispatcher execute exactly once (#28472) (#28477)
Cherry-pick from master
pr: #28472
See also #28466

In `taskDispatcher.schedule`, the same task may be resubmitted if the
previous round did not finish.
In this case, TaskObserver.check may set the current target by mistake,
which may cause random search/query failures

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2023-11-17 01:34:21 +08:00
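One common way to get the "exactly once" behaviour described in d0f94f3d17 is to track in-flight task keys and refuse duplicates. The sketch below assumes that approach; the `dispatcher` type and its methods are hypothetical and do not mirror the real `taskDispatcher`.

```go
package main

import (
	"fmt"
	"sync"
)

// dispatcher tracks which task keys are currently running.
// Hypothetical type for illustration only.
type dispatcher struct {
	mu       sync.Mutex
	inFlight map[int64]struct{}
}

func newDispatcher() *dispatcher {
	return &dispatcher{inFlight: make(map[int64]struct{})}
}

// trySubmit marks key as in flight and reports whether the caller may run it.
// It returns false if a previous round for the same key has not finished.
func (d *dispatcher) trySubmit(key int64) bool {
	d.mu.Lock()
	defer d.mu.Unlock()
	if _, running := d.inFlight[key]; running {
		return false
	}
	d.inFlight[key] = struct{}{}
	return true
}

// finish clears the in-flight mark so the key can be scheduled again.
func (d *dispatcher) finish(key int64) {
	d.mu.Lock()
	defer d.mu.Unlock()
	delete(d.inFlight, key)
}

func main() {
	d := newDispatcher()
	fmt.Println(d.trySubmit(1)) // prints true
	fmt.Println(d.trySubmit(1)) // prints false
	d.finish(1)
	fmt.Println(d.trySubmit(1)) // prints true
}
```

Checking and marking under the same lock is what prevents two scheduling rounds from both accepting the same key.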
yah01
e36976c474
enhance: modify log to avoid ambiguity and improve readability (#28331) (#28414)
Remove the "failCount" log field, which is ambiguous.
Replace the status (int32) with a string to improve the readability of
the task-removed log
pr: #28331

Signed-off-by: yah01 <yah2er0ne@outlook.com>
2023-11-15 10:26:19 +08:00
wei liu
d3f149c403
fix unstable auto balance config ut (#28289)
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2023-11-09 10:02:19 +08:00
yah01
385507ce47
Fix the target updated before version updated to cause data missing (#28257)
Signed-off-by: yah01 <yah2er0ne@outlook.com>
2023-11-08 18:54:18 +08:00
wei liu
12a09231f1
fix datacoord unstable ut (#28282)
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2023-11-08 18:44:58 +08:00
wei liu
918333817e
Disable auto balance when old node exists (#28191) (#28224)
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2023-11-08 07:10:17 +08:00
yah01
d10a82dba4
Fix getting incorrect CPU num (#28178)
Signed-off-by: yah01 <yang.cen@zilliz.com>
2023-11-07 11:52:22 +08:00
wei liu
87e8d04ed7
fix sync distribution with wrong version (#28130) (#28170)
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2023-11-06 11:38:18 +08:00
wei liu
416c3275a0
fix load index for stopping node (#28047) (#28137)
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2023-11-03 11:02:17 +08:00
congqixia
02f4d145ca
Set qcv2 index task priority to Low (#28117) (#28134)
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2023-11-03 01:30:16 +08:00
congqixia
da4a062e5b
Change task sourceID to stringer interface (#27965) (#28074)
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2023-11-01 23:12:46 +08:00
wei liu
e0222b2ce3
refine target manager code style (#27883)
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2023-10-25 00:44:12 +08:00