574 Commits

Author SHA1 Message Date
wei liu
93063ce1f9
fix: Prevent simultaneous balance of segments and channels (#37850) (#37939)
issue: #33550
pr: #37850
Balance segment and balance channel can execute at the same time, which
causes a bunch of corner cases.

This PR disables simultaneous balancing of segments and channels.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-11-26 10:26:40 +08:00
congqixia
8601f3ed66
enhance: [2.4] Refine Replica manager coll2Replicas secondary index (#37906) (#37970)
Cherry-pick from master
pr: #37906
Related to #37630

This PR adds a new utility secondary index, coll2Replicas, to reduce map
access and iteration when getting replicas by collection.
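
A minimal Go sketch of the idea, not the actual Milvus code: a second map from collection ID to replica IDs is kept in sync with the primary map, so a lookup by collection no longer scans every replica. The `Replica` and `ReplicaIndex` types and their methods are hypothetical names used for illustration only.

```go
package main

// Replica is a simplified, hypothetical stand-in for a replica record.
type Replica struct {
	ID           int64
	CollectionID int64
}

// ReplicaIndex keeps a primary map plus a collection -> replica-ID secondary index.
type ReplicaIndex struct {
	replicas      map[int64]*Replica           // primary: replica ID -> replica
	coll2Replicas map[int64]map[int64]struct{} // secondary: collection ID -> replica IDs
}

func NewReplicaIndex() *ReplicaIndex {
	return &ReplicaIndex{
		replicas:      make(map[int64]*Replica),
		coll2Replicas: make(map[int64]map[int64]struct{}),
	}
}

// Put inserts or overwrites a replica and keeps the secondary index in sync.
func (idx *ReplicaIndex) Put(r *Replica) {
	if old, ok := idx.replicas[r.ID]; ok {
		idx.unlink(old)
	}
	idx.replicas[r.ID] = r
	ids, ok := idx.coll2Replicas[r.CollectionID]
	if !ok {
		ids = make(map[int64]struct{})
		idx.coll2Replicas[r.CollectionID] = ids
	}
	ids[r.ID] = struct{}{}
}

// Remove deletes a replica from both maps.
func (idx *ReplicaIndex) Remove(id int64) {
	if old, ok := idx.replicas[id]; ok {
		idx.unlink(old)
		delete(idx.replicas, id)
	}
}

// unlink drops a replica from the secondary index only.
func (idx *ReplicaIndex) unlink(r *Replica) {
	ids := idx.coll2Replicas[r.CollectionID]
	delete(ids, r.ID)
	if len(ids) == 0 {
		delete(idx.coll2Replicas, r.CollectionID)
	}
}

// GetByCollection touches only the replicas of the requested collection.
func (idx *ReplicaIndex) GetByCollection(collectionID int64) []*Replica {
	ids := idx.coll2Replicas[collectionID]
	out := make([]*Replica, 0, len(ids))
	for id := range ids {
		out = append(out, idx.replicas[id])
	}
	return out
}
```

The cost of the extra map is paid on writes (`Put`, `Remove`), which is a reasonable trade when reads by collection are far more frequent than replica changes.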

---------

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-11-26 10:20:35 +08:00
wei liu
bb66636448
fix: Channel may be released after balance (#37862) (#37940)
issue: #37830
pr: #37862
Because the dist handler doesn't set the channel's version, the channel
checker may release the new delegator after the balance finishes when it
tries to dedup channels.

This PR sets the proper version for the channel.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-11-25 11:26:44 +08:00
congqixia
0bd26171d5
enhance: [2.4] Provide secondary index criteria when filter leaderview (#37777) (#37802)
Cherry-pick from master
pr: #37777 
Related to #37630

---------

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-11-21 10:48:33 +08:00
congqixia
28adfe4629
enhance: [2.4] Remove unnecessary segment clone updating dist (#37797) (#37833)
Cherry-pick from master
pr: #37797
Related to #37630

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-11-20 19:48:33 +08:00
jaime
3ce27ca689
enhance: remove collection queryable check from health check (#37731)
pr: #37712

Signed-off-by: jaime <yun.zhang@zilliz.com>
2024-11-18 10:50:38 +08:00
wei liu
1bd502b585
fix: Delegator stuck at unserviceable status (#37694) (#37702)
issue: #37679
pr: #37694

PR #36549 introduced a logic error that updates the current target when
only some of the channels are ready.

This PR fixes the logic error and lets the dist handler keep pulling
distribution from the querynode until all delegators become serviceable.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-11-15 14:52:30 +08:00
wei liu
28bcd85bd0
fix: Balance channel may stuck at increasing replica number case (#37642)
issue: #37640
pr: #37641
Fixes PR #36549.
Balance channel waits until the new delegator becomes serviceable, but
the new delegator must sync the target version before it becomes
serviceable, and syncing the target version must wait until all replicas
finish loading. So if increasing the replica number and balance channel
happen at the same time, a logical deadlock occurs.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-11-13 14:26:30 +08:00
congqixia
8801322371
enhance: [2.4] Invalidate collection cache when release collection (#37577) (#37628)
Cherry-pick from master
pr: #37577
Related to #37395

---------

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-11-13 14:00:31 +08:00
wei liu
6dc879b1e2
enhance: Enable node assign policy on resource group (#36968) (#37588)
issue: #36977
pr: #36968
With node_label_filter on a resource group, users can add a label to a
querynode with the env `MILVUS_COMPONENT_LABEL`, and the resource group
will prefer to accept nodes that match its node_label_filter.

Querynodes can then be grouped by label, with querynodes that share a
label placed in the same resource group.
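
The preference can be pictured with a small Go sketch. It assumes labels are modelled as key/value pairs and that "prefer" means matching nodes are considered first; the `Node` type and the `preferNodes` and `matches` helpers are hypothetical and do not reflect the real resource group API.

```go
package main

import "sort"

// Node is a hypothetical query node carrying labels as key/value pairs.
type Node struct {
	ID     int64
	Labels map[string]string
}

// matches reports whether the node carries every key/value pair in filter.
func matches(n Node, filter map[string]string) bool {
	for k, v := range filter {
		if n.Labels[k] != v {
			return false
		}
	}
	return true
}

// preferNodes returns a copy of candidates with nodes matching the filter
// first. The relative order inside each group is preserved.
func preferNodes(candidates []Node, filter map[string]string) []Node {
	out := append([]Node(nil), candidates...)
	sort.SliceStable(out, func(i, j int) bool {
		return matches(out[i], filter) && !matches(out[j], filter)
	})
	return out
}
```

Non-matching nodes stay in the list as a fallback, which is one way to read "prefer" as opposed to "require".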

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-11-13 11:10:29 +08:00
wei liu
7d1c899155
fix: Search may return less result after qn recover (#36549) (#37610)
issue: #36293 #36242
pr: #36549
After a qn recovers, a delegator may be loaded on a new node, and once
all segments have been loaded the delegator becomes serviceable. But the
delegator's target version hasn't been synced yet, so if a search/query
arrives, the delegator uses the wrong target version and filters down to
an empty segment list, which causes an empty search result.

This PR blocks the delegator's serviceable status until the target
version is synced.
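
As a rough Go illustration (an assumption-laden sketch, not the real delegator), serviceability can be modelled as the conjunction of two conditions instead of one. The `Delegator` type, its fields, and the sentinel `unsyncedVersion` are hypothetical.

```go
package main

// unsyncedVersion is a hypothetical sentinel meaning "no target version synced yet".
const unsyncedVersion int64 = 0

// Delegator is a simplified stand-in for a shard delegator.
type Delegator struct {
	segmentsLoaded bool
	targetVersion  int64
}

// MarkSegmentsLoaded records that every required segment is loaded.
func (d *Delegator) MarkSegmentsLoaded() { d.segmentsLoaded = true }

// SyncTargetVersion records the target version pushed by the coordinator.
func (d *Delegator) SyncTargetVersion(v int64) { d.targetVersion = v }

// Serviceable is true only when segments are loaded AND a target version
// has been synced, so searches never filter against a stale version.
func (d *Delegator) Serviceable() bool {
	return d.segmentsLoaded && d.targetVersion != unsyncedVersion
}
```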

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-11-12 19:16:30 +08:00
wei liu
074f8ee696
enhance: optimize describe collection and index (#37490) (#37605)
fix #37489
pr: #34790
combine multiple describe collection and list index into one call

Signed-off-by: xiaofanluan <xiaofan.luan@zilliz.com>
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
Co-authored-by: Xiaofan <83447078+xiaofan-luan@users.noreply.github.com>
2024-11-12 16:54:29 +08:00
wei liu
25c96991f6
fix: Lost loading collection's updateTs after qc restart (#37538) (#37580)
issue: #37537
pr: #37538

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-11-11 17:50:30 +08:00
congqixia
2fbb157dc8
enhance: [2.4] Handle legacy proxy load fields request (#37565) (#37569)
Cherry-pick from master
pr: #37565
Related to #35415

During a rolling upgrade, a legacy proxy may dispatch a load request with
an empty load field list. The upgraded querycoord may then mistakenly
report that the load field list has changed.

This PR:

- Auto-fills an empty load field list with all user field ids
- Refines the error message when the load field list is updated
- Refines the load job unit test with service cases
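
A hedged Go sketch of the auto-fill and the follow-up comparison, under the assumption that field lists are plain slices of field ids compared as sets. `normalizeLoadFields` and `sameFieldSet` are hypothetical helper names, not functions from the Milvus code base.

```go
package main

// normalizeLoadFields treats an empty list (as sent by a legacy proxy) as
// "all user fields" and returns a copy of userFieldIDs in that case.
func normalizeLoadFields(requested, userFieldIDs []int64) []int64 {
	if len(requested) == 0 {
		return append([]int64(nil), userFieldIDs...)
	}
	return requested
}

// sameFieldSet compares two field lists as sets, ignoring order and duplicates.
func sameFieldSet(a, b []int64) bool {
	set := make(map[int64]struct{}, len(a))
	for _, id := range a {
		set[id] = struct{}{}
	}
	seen := make(map[int64]struct{}, len(b))
	for _, id := range b {
		if _, ok := set[id]; !ok {
			return false
		}
		seen[id] = struct{}{}
	}
	return len(seen) == len(set)
}
```

With this shape, a legacy request with no fields normalizes to the full list and compares equal to a previously loaded full list, so no spurious "load fields changed" error is raised.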

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-11-11 14:06:29 +08:00
congqixia
cedc34053c
enhance: [2.4] Add context trace for querycoord queryable check (#37524) (#37534)
Cherry-pick from master
pr: #37524

When the health check fails because a collection is not queryable, the
reason is hard to find in the log.

This PR adds a trace id to the log context and prints info logs for
unqueryable collections.

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-11-08 18:58:27 +08:00
wei liu
7b71411b60
fix: search/query failed due to segment not loaded (#37403) (#37544)
issue: #36970
pr: #37403
Release segment and balance channel may happen at the same time. Before
the new delegator becomes serviceable, if release segment executes on
the new delegator while a search/query arrives on the old delegator,
release segment and query segment run in parallel. If release segment
executes first in the worker, the search/query gets a SegmentNotLoaded
error.

This PR adds a serviceable filter on delegators, so all load/release
segment operations happen on a serviceable delegator.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-11-08 18:56:26 +08:00
congqixia
1a09d6385e
enhance: [2.4] Release compacted growing segment if in dropped list (#37245) (#37266)
Cherry-pick from master
pr: #37245
See also #37205

Previously releasing growing segments could be triggered by two
conditions:

- Sealed Segment with same id is loaded
- Segment start position is before target checkpoint ts

This has a worst case in which the corresponding sealed segment is
compacted and the checkpoint is pinned by a growing L0 segment.

This PR introduces a new rule: a growing segment can be released if its
segment id appears in the current target's dropped segment id list.

---------

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-10-31 10:14:22 +08:00
wei liu
057bfbe678
fix: Delegator may becomes unserviceable after querycoord restart (#37055) (#37100)
issue: #37054
pr: #37055
After querycoord restarts, segment_checker may release segments by
mistake because the next target isn't ready yet.

This PR requires that release segment happens only after the next target
is ready.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-10-25 14:55:31 +08:00
congqixia
6bc8aba17f
enhance: [2.4] Batch forward delete when using DirectForward (#37076) (#37107)
Cherry pick from master
pr: #37076
Related #36887

DirectForward streaming delete causes memory usage to explode when the
number of segments is large. This PR adds a batching delete API and uses
it for the direct forward implementation.
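
A generic batching loop in Go shows the shape of the change. It assumes the unit being batched is a list of segment ids and that `send` stands in for whatever RPC forwards the delete; both are assumptions for illustration, and `forwardInBatches` is a hypothetical name.

```go
package main

// forwardInBatches calls send once per chunk of at most batchSize ids, so
// only one chunk's worth of request data is alive at a time.
func forwardInBatches(segmentIDs []int64, batchSize int, send func(batch []int64) error) error {
	if batchSize <= 0 {
		batchSize = 1 // defensive default for a misconfigured size
	}
	for start := 0; start < len(segmentIDs); start += batchSize {
		end := start + batchSize
		if end > len(segmentIDs) {
			end = len(segmentIDs)
		}
		if err := send(segmentIDs[start:end]); err != nil {
			return err // stop at the first failed batch
		}
	}
	return nil
}
```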

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-10-25 11:53:29 +08:00
wei liu
59b2563029
fix: Dynamic release partition may fail search/query. (#37049) (#37099)
issue: #33550
pr: #37049
Because of a wrong impl of UpdateCollectionNextTarget, if
ReleaseCollection and UpdateCollectionNextTarget happen at the same
time, the released partition's segment list may be added to the target
again, and the delegator will be marked as unserviceable due to a lack
of segments.

This PR fixes the impl of UpdateCollectionNextTarget.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-10-24 18:01:30 +08:00
congqixia
b24788b2c7
enhance: [2.4] Add balance report log for qc balancer (#36749)
Cherry pick from master
pr: #36747 
Related to #36746

---------

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-10-11 10:25:24 +08:00
wei liu
2428adea3b
enhance: Enable balance on querynode with different mem capacity (#36466) (#36625)
issue: #36464
pr: #36466
This PR enables balance on querynodes with different mem capacity. A
query node with more mem capacity is assigned more records, and the
query node with the largest difference between assignedScore and
currentScore gets higher priority to carry the new segment.
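
One way to read the priority rule, sketched in Go under the assumption that a node's assigned score is its capacity-proportional share of the total load. The `NodeLoad` type and `pickNode` function are hypothetical; the real scoring formula may differ.

```go
package main

import "math"

// NodeLoad is a hypothetical view of one query node.
type NodeLoad struct {
	ID           int64
	MemCapacity  float64 // e.g. GiB
	CurrentScore float64 // load the node carries now
}

// pickNode returns the node whose assigned share exceeds its current score
// by the largest margin. It reports false when no capacity is available.
func pickNode(nodes []NodeLoad) (int64, bool) {
	var totalCap, totalScore float64
	for _, n := range nodes {
		totalCap += n.MemCapacity
		totalScore += n.CurrentScore
	}
	if totalCap <= 0 {
		return 0, false
	}
	bestID, bestGap := int64(0), math.Inf(-1)
	for _, n := range nodes {
		// Assumption: assigned share is proportional to memory capacity.
		assignedScore := totalScore * n.MemCapacity / totalCap
		if gap := assignedScore - n.CurrentScore; gap > bestGap {
			bestID, bestGap = n.ID, gap
		}
	}
	return bestID, true
}
```

For example, with a 16 GiB node carrying a score of 50 and a 32 GiB node carrying 60, the larger node is still under its proportional share and is picked first.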

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-09-30 18:11:18 +08:00
wei liu
4120320074
enhance: make TransferChannel/TransferSegment idempotent (#36489) (#36552)
issue: #36488
pr: #36489
When TransferChannel/TransferSegment is called, querycoord generates a
balance task and submits it to the scheduler. If a task for the
segment/channel already exists in the scheduler, the submit fails.

To make TransferChannel/TransferSegment idempotent, we skip the submit
if the task already exists in the scheduler.
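
The skip-if-present pattern looks roughly like this in Go. The `Scheduler` type here is a toy with one task slot per segment id, and `TransferSegments` is a hypothetical wrapper, not the real RPC handler.

```go
package main

import "errors"

// ErrTaskExists mirrors the "task already exists" failure described above.
var ErrTaskExists = errors.New("task already exists in scheduler")

// Scheduler is a toy scheduler that allows one task per segment id.
type Scheduler struct {
	tasks map[int64]struct{}
}

func NewScheduler() *Scheduler {
	return &Scheduler{tasks: make(map[int64]struct{})}
}

// Has reports whether a task for the segment is already queued.
func (s *Scheduler) Has(segmentID int64) bool {
	_, ok := s.tasks[segmentID]
	return ok
}

// Add queues a task and fails if one already exists for the segment.
func (s *Scheduler) Add(segmentID int64) error {
	if s.Has(segmentID) {
		return ErrTaskExists
	}
	s.tasks[segmentID] = struct{}{}
	return nil
}

// TransferSegments submits one task per segment and skips segments that
// already have a task, so repeating the call is harmless.
func TransferSegments(s *Scheduler, segmentIDs []int64) (int, error) {
	submitted := 0
	for _, id := range segmentIDs {
		if s.Has(id) {
			continue // already in flight: nothing to do
		}
		if err := s.Add(id); err != nil {
			return submitted, err
		}
		submitted++
	}
	return submitted, nil
}
```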

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-09-30 14:25:26 +08:00
wei liu
74af00ba8c
fix: Segment unbalance after many times load/release (#36537) (#36543)
issue: #36536
pr: #36537
Query coord uses `segmentTaskDelta/channelTaskDelta` to measure the
executing workload for a querynode in the scheduler, and we maintain
`segmentTaskDelta/channelTaskDelta` through `scheduler.Add(task)` and
`scheduler.remove(task)`. But `scheduler.remove(task)` has been called
in unexpected ways, which produces a wrong
`segmentTaskDelta/channelTaskDelta` value, affects the segment assign
logic, and causes segment unbalance.

This PR computes `segmentTaskDelta/channelTaskDelta` on access instead,
so a wrong cached value can no longer affect the result.
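
The difference between a maintained counter and a value derived on access can be shown with a short Go sketch. The `Task` and `Scheduler` types are hypothetical simplifications; the point is only that the per-node delta is recomputed from the live task set each time it is read.

```go
package main

// Task is a hypothetical scheduler task that adds (+) or removes (-) load
// on a node.
type Task struct {
	NodeID int64
	Delta  int
}

// Scheduler keeps only the live task set; no running counter is stored.
type Scheduler struct {
	tasks map[int64]Task // task id -> task
}

func NewScheduler() *Scheduler {
	return &Scheduler{tasks: make(map[int64]Task)}
}

func (s *Scheduler) Add(id int64, t Task) { s.tasks[id] = t }

// Remove is safe to call any number of times for the same id.
func (s *Scheduler) Remove(id int64) { delete(s.tasks, id) }

// TaskDelta derives the node's pending workload from the tasks that exist
// right now, so an unexpected extra Remove cannot skew the result.
func (s *Scheduler) TaskDelta(nodeID int64) int {
	sum := 0
	for _, t := range s.tasks {
		if t.NodeID == nodeID {
			sum += t.Delta
		}
	}
	return sum
}
```

The trade-off is a scan per read in exchange for a value that cannot drift from the task set.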

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-09-26 20:57:14 +08:00
wei liu
975a9797a2
enhance: Enable dynamic update loaded collection's replica (#36417)
issue: #35821
pr: #35822
After a collection is loaded, if we need to increase or decrease its
replica number, we have to release and load it again.

Milvus offers 4 solutions to update a loaded collection's replica. This
PR aims to change the replica number dynamically without release. After
the replica number changes, Milvus loads or releases replicas
asynchronously, and the replica load status can be checked with the
getReplicas API.

Note that if more replicas are set than the querynodes can afford, the
new replicas won't load successfully until enough querynodes join.

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-09-26 10:43:15 +08:00
wei liu
bdc59f3b15
fix: Fix corner case that segment can't be moved out from stopping node (#36431) (#36475)
issue: #36426
pr: #36431
The old constraint requires that only segments in the current target can
be balanced, which is wrong: a segment can't be moved out from a
stopping node if it exists only in the next target.

By design, stopping balance needs to move all segments off the node
through balance tasks, so the old constraint should be removed.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-09-25 10:13:18 +08:00
SimFG
95e47bfcf8
fix: force to set the metric type in the search request (#36279)
- issue: #35960
- pr: #35962

Signed-off-by: SimFG <bang.fu@zilliz.com>
2024-09-18 19:21:11 +08:00
wei liu
efed3d3ed0
fix: [skip e2e] Fix unstable ut TestCollectionObserver (#36231) (#36260)
issue: #36237
pr: #36231

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-09-14 15:43:08 +08:00
wei liu
38b307e230
fix: Clean dirty segment/channel on querynode (#36202) (#36259)
issue: #36201
pr: #36202
After a querynode has been removed from a replica, all dirty
segments/channels on it should be released.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-09-14 14:41:09 +08:00
wei liu
cc414d53b7
fix: Fix logic dead lock when delegator has high memory usage (#36066)
issue: #36064
pr: #36065
When a delegator has high memory usage, loading an L0 segment fails, and
the balance segment task is blocked by the load segment task. The
delegator then can't free memory by moving out some segments, which
causes a logical deadlock.

This PR removes the limit for balance and permits load segment and
balance to execute in parallel. This won't cause side effects because:
1. one segment can have only one task in qc's scheduler, and a
load/release task replaces the balance task if necessary.
2. balance speed is limited, so it won't block the load segment task.
3. if a collection has a load task and a balance task at the same time,
the load task is scheduled first due to its higher priority.

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-09-06 22:01:07 +08:00
congqixia
b34b035edc
fix: [2.4] Use SliceSetEqual to compare load field list (#36062)
Cherry-pick from master
pr: #36051
Related to #36037

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-09-06 19:17:05 +08:00
congqixia
e21b09cc90
fix: [2.4] Fill load field list from old version load info (#35993) (#36018)
Cherry-pick from master
pr: #35993
See also #35959

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-09-06 17:27:06 +08:00
wei liu
10211ea056
fix: Fix dynamic release partition may fail search/query request (#35919) (#36019)
issue: #33550
pr: #35919
A concurrency issue may occur between removing a partition in the target
manager and syncing the segment list to the delegator. When it happens,
some segments may be released in the delegator while the same segments
are also synced to it, which makes the delegator unserviceable due to a
lack of necessary segments, and search/query then fails.

This PR makes sure that all write access to target_manager is executed
serially to avoid the concurrency issue.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-09-06 10:49:05 +08:00
wei liu
c87711d903
fix: Fix some replicas don't participate in the query after the failure recovery (#35850) (#35925)
issue: #35846
pr: #35850
querycoord notifies proxy to update the shard leader cache after a
delegator's location changes. But during a querynode's failure recovery,
some delegators may become unserviceable due to a lack of segments and
return to serviceable after the segments are loaded, so we also need to
notify proxy to invalidate the shard leader cache when a delegator's
serviceable state changes.

This PR maintains the querynode's serviceable state during heartbeat and
notifies proxy to invalidate the shard leader cache if the serviceable
state changes.
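
Change detection on heartbeat can be sketched in a few lines of Go. The `ServiceableTracker` type, the channel-name key, and the `invalidate` callback are assumptions made for illustration; the real notification path to proxy is not shown.

```go
package main

// ServiceableTracker remembers the last reported state per channel and
// fires a callback only when the state flips.
type ServiceableTracker struct {
	last       map[string]bool
	invalidate func(channel string) // hypothetical hook to notify proxy
}

func NewServiceableTracker(invalidate func(channel string)) *ServiceableTracker {
	return &ServiceableTracker{last: make(map[string]bool), invalidate: invalidate}
}

// OnHeartbeat records the reported state and triggers invalidation when it
// differs from the previously seen state for the same channel.
func (t *ServiceableTracker) OnHeartbeat(channel string, serviceable bool) {
	prev, seen := t.last[channel]
	t.last[channel] = serviceable
	if seen && prev != serviceable {
		t.invalidate(channel)
	}
}
```

Firing only on transitions keeps the proxy cache from being invalidated on every heartbeat.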

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-09-05 10:09:04 +08:00
SimFG
084b3efaa1
fix: [2.4] fill the metric type field in the LoadMetaInfo object (#35963)
- issue: #35960
- pr: #35962

Signed-off-by: SimFG <bang.fu@zilliz.com>
2024-09-04 16:21:05 +08:00
congqixia
df8d1c7ca3
enhance: [2.4] Check load fields for previous loaded collection (#35905) (#35910)
Cherry-pick from master
pr: #35905
Related to #35415

This PR makes querycoord report an error when a load request tries to
update the load fields list, which is currently not supported.

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-09-03 11:25:03 +08:00
congqixia
cfc99e63b1
fix: [2.4] Make sure querycoord observers started once (#35811) (#35817)
Cherry-pick from master
pr: #35811
Related to #35809

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-08-29 19:15:01 +08:00
congqixia
8928c9d570
enhance: [2.4] Change frequent balancer debug log to rated one (#35749) (#35796)
Cherry-pick from master
pr: #35749
The "skip balance" log is too frequent at debug level. This PR changes
it into a rate-limited one.

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-08-29 12:31:00 +08:00
SimFG
fc324b4254
feat: [2.4] add the rbac msg and send them to the replicate channel (#35562)
- issue: #35391
- pr: #35392

Signed-off-by: SimFG <bang.fu@zilliz.com>
2024-08-27 14:45:00 +08:00
congqixia
ab261d0f8b
feat: [2.4] Support field partial load collection (#35416) (#35696)
Cherry-pick from master
pr: #35416
Related to #35415

---------

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-08-27 14:07:00 +08:00
Xiaofan
7269d5eda2
enhance: [2.4] reduce the log level of frequent log (#35653)
pr: #35651

Signed-off-by: xiaofanluan <xiaofan.luan@zilliz.com>
2024-08-25 17:48:57 +08:00
SimFG
5b5119a51f
feat: [2.4] provide more general configuration to control mmap behavior (#35609)
- issue: #35273
- pr: #35359

Signed-off-by: SimFG <bang.fu@zilliz.com>
2024-08-23 12:35:02 +08:00
wei liu
e2542a1bf5
enhance: Update protobuf-go to protobuf-go v2 (#34394) (#35555)
issue: #34252
pr: #34394 #35072 #35084

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
Co-authored-by: Congqi Xia <congqi.xia@zilliz.com>
2024-08-21 18:50:58 +08:00
wei liu
4bf4cbad85
enhance: Mark query node as read only after suspend (#35492) (#35586)
issue: #34985 #35493
pr: #35492
After a querynode has been suspended, loading segments/channels on it is
not allowed, which means the node is read only. To be compatible with
the resource group design, after a query node has been suspended we
remove it from its original resource group and make it a read-only query
node in the replica. Two things then happen:
1. its original resource group lacks query nodes, so query coord assigns
a new node to it.
2. querycoord tries to move out all segments/channels after the
querynode has been suspended.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-08-20 19:00:56 +08:00
wei liu
4610dafb2e
enhance: make configure load param feature be compatible with old sdk (#35520) (#35573)
issue: #31570 #35521
pr: #35520 #35546

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-08-20 18:20:57 +08:00
wei liu
8cd6718672
enhance: limit getSegmentInfo batch size to avoid exceeding grpc message limit (#35432)
issue: #35395 
pr: #35394

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-08-13 11:42:19 +08:00
wei liu
b316040634
fix: force update next target if target can't be loaded (#35366)
issue: #35361
pr: #35365

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-08-13 10:24:20 +08:00
wei liu
0201e00a2f
enhance: enable to set load config in cluster level (#35293)
issue: #35170
pr: #35169
This PR enables setting load configs at the cluster level, such as
replicas and resource groups. Loading a collection then uses that load
config.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-08-07 12:38:21 +08:00
wei liu
2ac1bf7532
enhance: Enable setting the replica number and resource group during collection creation (#34403) (#34561)
issue: #30040
pr: #34403

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-08-06 15:06:17 +08:00
wei liu
d48c690cb3
enhance: Avoid unnecessary syncTargetVersion func call after querycoord recover (#34954) (#35234)
pr: #34954
Before querycoord stops gracefully, we save the current target to the
meta store and recover it after querycoord starts up, to speed up
querycoord's recovery. But the target version hasn't been recovered as
expected: the latest timestamp is used as the current target's version,
which has no effect on querycoord other than an unnecessary
syncTargetVersion func call.

This PR recovers the correct target version to avoid the unnecessary
syncTargetVersion func call.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-08-05 10:18:16 +08:00