issue: #37640
pr: #37641
Fix for pr #36549.
Balance channel waits until the new delegator becomes serviceable, but
the new delegator must sync the target version before it becomes
serviceable, and syncing the target version must wait until all replicas
finish loading. So if increasing the replica number and balance channel
happen at the same time, a logical deadlock occurs.
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
Cherry-pick from master
pr: #37405
Cgo API cost is not observable since no metrics are related to it.
This PR adds metrics for some sync cgo calls related to load & write.
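As a rough illustration of the idea, a synchronous call can be wrapped so that its count and elapsed time are recorded. This is only a sketch: `callStats`, `observe`, and the `"load"` name are hypothetical, and an in-memory map stands in for whatever metrics library the PR actually uses.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// callStats is a hypothetical in-memory stand-in for a metrics histogram.
type callStats struct {
	mu    sync.Mutex
	count map[string]int
	total map[string]time.Duration
}

func newCallStats() *callStats {
	return &callStats{count: map[string]int{}, total: map[string]time.Duration{}}
}

// observe runs fn and records its call count and elapsed time under name.
func (s *callStats) observe(name string, fn func()) {
	start := time.Now()
	fn()
	elapsed := time.Since(start)

	s.mu.Lock()
	defer s.mu.Unlock()
	s.count[name]++
	s.total[name] += elapsed
}

// calls reports how many times name has been observed.
func (s *callStats) calls(name string) int {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.count[name]
}

func main() {
	stats := newCallStats()
	// The closure stands in for a synchronous cgo call such as a segment load.
	stats.observe("load", func() { time.Sleep(time.Millisecond) })
	fmt.Println("load calls:", stats.calls("load"))
}
```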
---------
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
issue: #36977
pr: #36968
With node_label_filter on a resource group, users can add a label to a
querynode with the env `MILVUS_COMPONENT_LABEL`; the resource group will
then prefer to accept nodes that match its node_label_filter.
Querynodes can then be grouped by label, with querynodes that share the
same label placed in the same resource group.
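The preference rule could look roughly like the sketch below. The `node` type, the key/value form of the filter, and both helper functions are assumptions made for illustration; the actual filter format is not described in this message.

```go
package main

import (
	"fmt"
	"sort"
)

// node is a hypothetical, simplified view of a querynode and its labels.
type node struct {
	ID     int64
	Labels map[string]string
}

// matchesFilter reports whether every key/value pair in filter is present in labels.
func matchesFilter(labels, filter map[string]string) bool {
	for k, want := range filter {
		if got, ok := labels[k]; !ok || got != want {
			return false
		}
	}
	return true
}

// preferMatching returns a copy of nodes with filter-matching nodes first,
// keeping the original relative order inside each group.
func preferMatching(nodes []node, filter map[string]string) []node {
	ordered := append([]node(nil), nodes...)
	sort.SliceStable(ordered, func(i, j int) bool {
		return matchesFilter(ordered[i].Labels, filter) &&
			!matchesFilter(ordered[j].Labels, filter)
	})
	return ordered
}

func main() {
	nodes := []node{
		{ID: 1, Labels: map[string]string{"tier": "cold"}},
		{ID: 2, Labels: map[string]string{"tier": "hot"}},
		{ID: 3, Labels: nil},
	}
	for _, n := range preferMatching(nodes, map[string]string{"tier": "hot"}) {
		fmt.Println(n.ID, n.Labels)
	}
}
```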
---------
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
issue: #36293, #36242
pr: #36549
After a qn recovers, the delegator may be loaded on a new node, and once
all segments have been loaded the delegator becomes serviceable. But the
delegator's target version has not been synced yet, so if a search/query
arrives, the delegator uses the wrong target version and filters down to
an empty segment list, which causes an empty search result.
This pr blocks the delegator's serviceable status until the target
version is synced.
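A minimal sketch of the gating condition, assuming a simplified `delegator` model (the type, its fields, and the `noTargetVersion` sentinel are hypothetical and not the Milvus types):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// noTargetVersion marks a delegator whose target version has not been synced yet.
const noTargetVersion int64 = -1

// delegator is a hypothetical, simplified model used only for illustration.
type delegator struct {
	segmentsLoaded atomic.Bool
	targetVersion  atomic.Int64
}

func newDelegator() *delegator {
	d := &delegator{}
	d.targetVersion.Store(noTargetVersion)
	return d
}

// serviceable requires both loaded segments and a synced target version.
func (d *delegator) serviceable() bool {
	return d.segmentsLoaded.Load() && d.targetVersion.Load() != noTargetVersion
}

func main() {
	d := newDelegator()
	d.segmentsLoaded.Store(true)
	fmt.Println("after load:", d.serviceable())
	d.targetVersion.Store(100)
	fmt.Println("after sync:", d.serviceable())
}
```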
---------
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
fix #37489
pr: #34790
combine multiple describe collection and list index into one call
Signed-off-by: xiaofanluan <xiaofan.luan@zilliz.com>
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
Co-authored-by: Xiaofan <83447078+xiaofan-luan@users.noreply.github.com>
Cherry-pick from master
pr: #37565
Related to #35415
In a rolling upgrade, a legacy proxy may dispatch a load request with an
empty load field list. The upgraded querycoord may then report by
mistake that the load field list has changed.
This PR:
- Auto fills an empty load field list with all user field ids
- Refines the error message when the load field list updates
- Refines the load job unit test with service cases
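The auto-fill step in the first bullet can be sketched as follows. `fillLoadFields` and the field ids are hypothetical; the sketch assumes an empty list means "load every user field".

```go
package main

import "fmt"

// fillLoadFields returns requested unchanged when it is non-empty; otherwise it
// returns a copy of all user field ids, treating "empty" as "load everything".
func fillLoadFields(requested, allUserFields []int64) []int64 {
	if len(requested) > 0 {
		return requested
	}
	return append([]int64(nil), allUserFields...)
}

func main() {
	all := []int64{100, 101, 102}
	fmt.Println(fillLoadFields(nil, all))          // legacy proxy: empty list
	fmt.Println(fillLoadFields([]int64{101}, all)) // explicit list is kept
}
```

With the list normalized this way, comparing it against the previously loaded field list no longer flags a legacy request as a change.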
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
Cherry-pick from master
pr: #37416
See also #37404, #37402
IP addresses in paramtable need validation and should fail fast with a
reasonable error message.
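A fail-fast check of this kind can be built on the standard library's `net.ParseIP`. The sketch below is not the paramtable code; `validateIP` is a hypothetical helper, and rejecting hostnames and empty strings is an assumption.

```go
package main

import (
	"fmt"
	"net"
)

// validateIP returns an error when addr is not a valid IPv4 or IPv6 literal.
func validateIP(addr string) error {
	if net.ParseIP(addr) == nil {
		return fmt.Errorf("invalid IP address %q", addr)
	}
	return nil
}

func main() {
	for _, addr := range []string{"10.0.0.1", "::1", "10.0.0.256", "localhost"} {
		if err := validateIP(addr); err != nil {
			fmt.Println("reject:", err)
			continue
		}
		fmt.Println("accept:", addr)
	}
}
```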
---------
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
Cherry-pick from master
pr: #37524
When the check health logic fails because a collection is not
queryable, the related reason is hard to find in the log.
This PR adds context with the trace id to the log and prints info about
the unqueryable collection.
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
issue: #36970
pr: #37403
Release segment and balance channel may happen at the same time. Before
the new delegator becomes serviceable, if release segment executes on
the new delegator while a search/query arrives on the old delegator,
then release segment and query segment happen in parallel. If release
segment executes first in the worker, the search/query gets a
SegmentNotLoaded error.
This PR adds a serviceable filter on the delegator, so that all
load/release segment operations happen on a serviceable delegator.
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
issue: #37166
pr: #37433
The cause is a misuse of timer.Reset, which makes the dispatcher fail to
send msg to the virtual channel buffer. The dispatcher then splits again
and again while holding the dispatcher manager's lock, which blocks the
watch channel progress.
This PR fixes the misuse of timer.Reset.
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
issue: #37289
pr: #37480
Because pr #37116 introduced retry on get shard leader, search no longer
fails while a query node is down.
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
Cherry-pick from master
pr: #37468
Previously the failed label was used for canceled storage ops, which may
raise a wrong alarm when a user cancels a load operation or similar.
This PR uses the cancel label in such cases.
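One way to pick the label is to check for `context.Canceled` with `errors.Is` before falling back to the failed label. This is a sketch under assumptions: the label strings, `statusLabel`, and the two sample errors are hypothetical.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

const (
	labelSuccess = "success"
	labelFail    = "fail"
	labelCancel  = "cancel"
)

// errBackend is a sample non-cancellation failure.
var errBackend = errors.New("backend unavailable")

// statusLabel maps a storage op result to a metric label, keeping caller
// cancellation separate from real failures.
func statusLabel(err error) string {
	switch {
	case err == nil:
		return labelSuccess
	case errors.Is(err, context.Canceled):
		return labelCancel
	default:
		return labelFail
	}
}

// canceledReadErr simulates a wrapped storage error after the caller cancels.
func canceledReadErr() error {
	ctx, cancel := context.WithCancel(context.Background())
	cancel()
	return fmt.Errorf("read object: %w", ctx.Err())
}

func main() {
	fmt.Println(statusLabel(nil))
	fmt.Println(statusLabel(canceledReadErr()))
	fmt.Println(statusLabel(errBackend))
}
```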
---------
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
Cherry pick from master
pr: #37439
Related #37223
RPC stats worked in middleware but failed to get method & collection info
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
issue: #37115
pr #37116 lets the proxy retry getting the shard leader if an error
happens. As a result, a search/query on an unloaded collection keeps
retrying until ctx is done.
This PR adds an error type check to skip retry on ErrCollectionNotLoaded.
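The skip-retry check can be sketched with a generic retry loop that consults a predicate before each new attempt. Everything here is hypothetical: `retry`, `isRetryable`, and the two sentinel errors are local stand-ins, not the proxy's retry utilities or error definitions.

```go
package main

import (
	"errors"
	"fmt"
)

// Local sentinel errors standing in for the real error types.
var (
	errCollectionNotLoaded = errors.New("collection not loaded")
	errLeaderUnavailable   = errors.New("shard leader unavailable")
)

// isRetryable reports whether another attempt could plausibly succeed.
func isRetryable(err error) bool {
	return !errors.Is(err, errCollectionNotLoaded)
}

// retry calls fn up to attempts times. It stops early on success or on an
// error that retryable rejects, and returns the number of calls made.
func retry(attempts int, fn func() error, retryable func(error) bool) (int, error) {
	var err error
	for i := 1; i <= attempts; i++ {
		if err = fn(); err == nil || !retryable(err) {
			return i, err
		}
	}
	return attempts, err
}

func main() {
	calls, err := retry(5, func() error { return errCollectionNotLoaded }, isRetryable)
	fmt.Println(calls, err)
	calls, err = retry(5, func() error { return errLeaderUnavailable }, isRetryable)
	fmt.Println(calls, err)
}
```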
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
Cherry pick from master
pr: #37337
Related to #35303
`deleteMut` should protect the streaming delete buffer only; forward l0
can be moved out of the rlock section to reduce the tsafe impact from
loading segments.
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
issue: #37289
pr: #37288
These test cases use search to verify the replica's status, but if the
search gap is 1s, the effect of the node going down may already be
repaired by balance.
This PR removes the 1-second gap between search operations.
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
Timeout is a bad design for long-running tasks, especially with a
static timeout config. We should monitor execution progress and fail the
task if the progress has been stale for a long time.
This pr is a small patch to stop DC from marking compaction tasks as
timed out while it is still waiting for DN to finish them; that design
is self-contradictory. After this pr, mix and L0 compaction are no
longer controlled by DC timeout, but clustering is still under timeout
control.
The compaction queue capacity grew larger for priority calculation, so
compaction timeouts appear more often, and when one times out, the
queued tasks time out too, so no compaction succeeds afterwards.
See also: #37108, #37015
pr: #37118
---------
Signed-off-by: yangxuan <xuan.yang@zilliz.com>
Cherry-pick from master
pr: #37245
See also #37205
Previously, releasing a growing segment could be triggered by two
conditions:
- A sealed segment with the same id is loaded
- The segment start position is before the target checkpoint ts
The worst case is that the corresponding sealed segment has been
compacted and the checkpoint is pinned by a growing l0 segment, so
neither condition is met.
This PR introduces a new rule: a growing segment can be released if its
segment id appears in the current target's dropped segment id list.
---------
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
Cherry pick from master
pr: #37305
Related to #36887
The logic that removes non-hit pk delete records does not work, because
`insert_record_.contain` does not work due to a logic problem.
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
Cherry pick from master
pr: #37195
Related to #36887
The `LoadDeltaLogs` API did not check memory usage. When the system is
under high delete load pressure, this could result in an OOM quit.
This PR adds a resource check for `LoadDeltaLogs` actions and separates
the internal deltalog loading function from the public one.
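A resource check of this kind might compare the estimated size of the load against the remaining memory budget before starting. The sketch below is an assumption-heavy illustration: `checkMemory`, its byte parameters, and the sentinel error are hypothetical and do not reflect how the PR estimates deltalog size.

```go
package main

import (
	"errors"
	"fmt"
)

// errInsufficientMemory is a hypothetical sentinel for a rejected load.
var errInsufficientMemory = errors.New("insufficient memory")

// checkMemory rejects a load when used+required would exceed limit.
// The comparison is arranged so that it cannot overflow uint64.
func checkMemory(usedBytes, requiredBytes, limitBytes uint64) error {
	if requiredBytes > limitBytes || usedBytes > limitBytes-requiredBytes {
		return fmt.Errorf("need %d bytes, used %d of %d: %w",
			requiredBytes, usedBytes, limitBytes, errInsufficientMemory)
	}
	return nil
}

func main() {
	fmt.Println(checkMemory(60, 30, 100)) // fits within the limit
	fmt.Println(checkMemory(80, 30, 100)) // would exceed the limit
}
```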
---------
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
Cherry-pick from master
pr: #37220
Related to #36887
Previously, creating a new pool per request caused goroutine leakage.
This PR changes this behavior by using a singleton delete pool.
This change can also provide better concurrency control over delete
memory usage.
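A process-wide pool can be created lazily with `sync.Once`, so every request shares one concurrency limit. The `pool` type below is a hypothetical semaphore-based stand-in, not the pool library Milvus uses, and the size of 4 is an arbitrary assumption.

```go
package main

import (
	"fmt"
	"sync"
)

// pool is a hypothetical bounded pool: at most cap(sem) tasks run at once.
type pool struct {
	sem chan struct{}
	wg  sync.WaitGroup
}

// submit blocks until a slot is free, then runs task in its own goroutine.
func (p *pool) submit(task func()) {
	p.wg.Add(1)
	p.sem <- struct{}{}
	go func() {
		defer p.wg.Done()
		defer func() { <-p.sem }()
		task()
	}()
}

// wait blocks until all submitted tasks have finished.
func (p *pool) wait() { p.wg.Wait() }

var (
	deletePool     *pool
	deletePoolOnce sync.Once
)

// getDeletePool lazily creates the shared pool exactly once.
func getDeletePool() *pool {
	deletePoolOnce.Do(func() {
		deletePool = &pool{sem: make(chan struct{}, 4)}
	})
	return deletePool
}

func main() {
	p := getDeletePool()
	results := make(chan int, 8)
	for i := 0; i < 8; i++ {
		i := i
		p.submit(func() { results <- i })
	}
	p.wait()
	fmt.Println("tasks finished:", len(results), "same pool:", p == getDeletePool())
}
```

Because the limit lives in one shared pool, the total number of concurrent delete tasks is bounded regardless of how many requests arrive.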
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
Cherry pick from master
pr: #37223
Related to #36102
Previous PR #36107 added a grpc interceptor to observe rpc stats. Using
the same strategy, this pr adds gin middleware to observe restful v2 rpc
stats.
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>