639 Commits

Author SHA1 Message Date
congqixia
7b51e4839f
fix: Resolve conflict on qc task test (#39796)
Related to #39701 & #39681

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2025-02-11 18:40:45 +08:00
wei liu
ff5c680c99
fix: load collection stucks if compaction/gc happens (#39701)
issue: #39680
if compaction/gc happens, load collection may stuck due to
SegmentNotFound, we should trigger UpdateNextTarget to get a new data
view to execute loading operation.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-02-11 15:48:50 +08:00
wei liu
85c9f92ff4
fix: uneven distribution caused by executing task delta cache leak (#39702)
issue: #39681 

this PR maintain workload effect in action instead of computing workload
effect from target, which may cause leak if target changes.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-02-11 14:30:46 +08:00
Zhen Ye
d3e32bb599
enhance: make pchannel level flusher (#39275)
issue: #38399

- Add a pchannel level checkpoint for flush processing
- Refactor the recovery of flushers of wal
- make a shared wal scanner first, then make multi datasyncservice on it

Signed-off-by: chyezh <chyezh@outlook.com>
2025-02-10 16:32:45 +08:00
jaime
8a4ac8cccd
enhance: expose more metrics data (#39456)
issue: #36621 #39417
1. Adjust the server-side cache size.
2. Add source information for configurations.
3. Add node ID for compaction and indexing tasks.
4. Resolve localhost access issues to fix health check failures for
etcd.

Signed-off-by: jaime <yun.zhang@zilliz.com>
2025-02-07 11:50:50 +08:00
wei liu
05ac4041aa
enhance: use rated logger for high frequency log in dist handler (#39452)
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-02-05 15:31:10 +08:00
Zhen Ye
c84a0748c4
enhance: add rw/ro streaming query node replica management (#38677)
issue: #38399

- Embed the query node into streaming node to make delegator available
at streaming node.
- The embedded query node has a special server label
`QUERYNODE_STREAMING-EMBEDDED`.
- Change the balance strategy to make the channel assigned to streaming
node as much as possible.

Signed-off-by: chyezh <chyezh@outlook.com>
2025-01-24 16:55:07 +08:00
yihao.dai
5fb597b37b
fix: Remove frequently updating metric to avoid mutex contention (#38775)
issue: https://github.com/milvus-io/milvus/issues/37630

Reduce the frequency updating metrics to avoid holding the mutex for
long periods.

---------

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2025-01-24 10:31:07 +08:00
yihao.dai
e0b26260f2
enhance: enable task delta cache (#39307)
When there are many segment tasks in the querycoord scheduler, the
traversal in `GetSegmentTaskDelta` checks becomes time-consuming. This
PR adds caching for segment deltas.

issue: https://github.com/milvus-io/milvus/issues/37630

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
Co-authored-by: Wei Liu <wei.liu@zilliz.com>
2025-01-23 14:31:16 +08:00
yihao.dai
38f813bed3
enhance: Read metadata concurrently to accelerate recovery (#38403)
Read metadata such as segments, binlogs, and partitions concurrently at
the collection level.

issue: https://github.com/milvus-io/milvus/issues/37630

---------

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2025-01-23 14:27:27 +08:00
yihao.dai
e55d6506e3
enhance: Remove frequent observe log (#39413)
/kind improvement

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2025-01-20 11:01:10 +08:00
yihao.dai
657550cf06
fix: Fix slow dist handle and slow observe (#38566)
1. Provide partition&channel level indexing in the collection target.
2. Make `SegmentAction` not wait for distribution.
3. Remove scheduler and target manager mutex.
4. Optimize logging to reduce CPU overhead.

issue: https://github.com/milvus-io/milvus/issues/37630

---------

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2025-01-15 20:17:00 +08:00
wei liu
d2834a1812
enhance: Add logs for check health failed (#39208)
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-01-15 17:31:00 +08:00
jaime
e8f76cd2d9
fix: unstable ut in leader_vew_manager.go file (#39161)
issue: #38672

Signed-off-by: jaime <yun.zhang@zilliz.com>
2025-01-15 12:26:59 +08:00
Zhen Ye
3e788f0fbd
enhance: record memory size (uncompressed) item for index (#38770)
issue: #38715

- Current milvus use a serialized index size(compressed) for estimate
resource for loading.
- Add a new field `MemSize` (before compressing) for index to estimate
resource.

---------

Signed-off-by: chyezh <chyezh@outlook.com>
2025-01-14 10:33:06 +08:00
wei liu
cc5d59392a
fix: channel unbalance during stopping balance progress (#38971)
issue: #38970
cause the stopping balance channel still use the row_count_based policy,
which may causes channel unbalance in multi-collection case.

This PR impl a score based stopping balance channel policy.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-01-13 11:21:06 +08:00
wei liu
826b726c86
fix: Prevent leader checker from generating excessive duplicate leader tasks (#39000)
issue: #39001
Background:
Segment Load Version: Each segment load request assigns a timestamp as
its version. When multiple copies of a segment are loaded on different
QueryNodes, the leader checker uses this version to identify the latest
copy and updates the routing table in the leader view to point to it.
Delegator Router Version: When a delegator builds a route to a QueryNode
that has loaded a segment, it also records the segment's version.

Router Table Update Logic: If the leader checker detects that the
version of a segment in the routing table does not match the version in
the worker, it updates the routing table to point to the QueryNode with
the latest version. Additionally, it updates the segment's load version
in the QueryNode during this process.

Issue:
When a channel is undergoing load balancing, the leader checker may sync
the routing table to a new delegator. This sync operation modifies the
segment's load version, which invalidates the routing in the old
delegator. Subsequently, the leader checker updates the routing table in
the old delegator, breaking the routing in the new delegator. This cycle
continues, causing repeated updates and inconsistencies.

Fix:
This PR introduces two changes to address the issue:
1. Use NodeID to verify whether the delegator's routing table needs an
update, avoiding unnecessary modifications.
2. Ensure compatibility by using the latest segment's load version as
the version recorded in the routing table.

These changes resolve the cyclic updates and prevent the leader checker
from generating excessive duplicate tasks, ensuring routing stability
across delegators during load balancing.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-01-10 14:12:57 +08:00
Zhen Ye
bb8d1ab3bf
enhance: make new go package to manage proto (#39114)
issue: #39095

---------

Signed-off-by: chyezh <chyezh@outlook.com>
2025-01-10 10:49:01 +08:00
jaime
f03a85725a
enhance: add db name in replica (#38672)
issue: #36621

Signed-off-by: jaime <yun.zhang@zilliz.com>
2025-01-09 19:40:59 +08:00
wei liu
47e7ea241e
enhance: Add log for case which target not update as expected (#38944)
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-01-07 17:45:03 +08:00
Xiaofan
cb6eca8e91
fix: drop partition can not be successful if load failed (#38793)
fix #38649
when partition load failed, the partition drop will also fail due to the
wrong error message

Signed-off-by: xiaofanluan <xiaofan.luan@zilliz.com>
2024-12-30 19:42:52 +08:00
wei liu
f49d618382
fix: Querycoord will trigger unexpected balance task after restart (#38630)
issue: #38606

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-12-25 19:30:48 +08:00
wei liu
25f0c82ceb
fix: Fix update loading collection's load config doesn't work (#38595)
issue: #38594

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-12-25 18:02:51 +08:00
wei liu
9c3f59dbbe
fix: Prevent balancer from overloading the same QueryNode (#38719)
issue: #38718
The balancer calculates the workload of executing tasks as an ongoing
score for target nodes. However, a logic issue arises when
GetSegmentTaskDelta or GetChannelTaskDelta is called with
collectionID=-1, which incorrectly returns zero.

Due to the incorrect global score, the executing task's workload is not
properly reflected for each collection. Consequently, each collection
submits its own balance task, leading to the balancer assigning
excessive tasks to the same QueryNode.

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-12-25 16:36:49 +08:00
jaime
5afd0c0a2b
fix: Revert "Expose metrics of stanby coordinators (#27698)" (#38620)
issue: #38608

Signed-off-by: jaime <yun.zhang@zilliz.com>
2024-12-23 11:46:57 +08:00
jaime
78438ef41e
fix: revert optimize CPU usage for CheckHealth requests (#35589) (#38555)
issue: #35563

Signed-off-by: jaime <yun.zhang@zilliz.com>
2024-12-19 00:38:45 +08:00
yihao.dai
d3c174b0f1
enhance: Accelerate observe collection (#38028)
1. A collection should observe the channel only once.
2. A collection should check the CollectionLoadPercent for updates only
once.
3. Skip saving coll/partition meta if there are no changes, primarily to
accelerate collection observation after recovery.

issue: https://github.com/milvus-io/milvus/issues/37630

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2024-12-17 14:14:45 +08:00
jaime
28fdbc4e30
enhance: optimize CPU usage for CheckHealth requests (#35589)
issue: #35563
1. Use an internal health checker to monitor the cluster's health state,
storing the latest state on the coordinator node. The CheckHealth
request retrieves the cluster's health from this latest state on the
proxy sides, which enhances cluster stability.
2. Each health check will assess all collections and channels, with
detailed failure messages temporarily saved in the latest state.
3. Use CheckHealth request instead of the heavy GetMetrics request on
the querynode and datanode

Signed-off-by: jaime <yun.zhang@zilliz.com>
2024-12-17 11:02:45 +08:00
SimFG
2afe2eaf3e
feat: support to replicate collection when the services contains the system tt msg (#37559)
- issue: #37105

---------

Signed-off-by: SimFG <bang.fu@zilliz.com>
2024-12-17 09:08:46 +08:00
wei liu
659847c11f
enhance: Remove load task limit in one round (#38436)
the task limit in assignSegment/assignChannel will works for both load
task and balance task.

this PR remove the load task limit, only limit balance task num in one
round.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-12-16 19:30:43 +08:00
wei liu
40f9db491e
fix: Fix SyncDistribution may cost too much time on retry (#38454)
issue: #38428

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-12-16 11:38:44 +08:00
tinswzy
27229f7907
enhance: refine exists log print with ctx (#38080)
issue: #35917 
Refines exists log print with ctx

Signed-off-by: tinswzy <zhenyuan.wei@zilliz.com>
2024-12-14 22:36:44 +08:00
Zhen Ye
833c74aa66
enhance: add detail, replica count for resource group (#38314)
issue: #30647

---------

Signed-off-by: chyezh <chyezh@outlook.com>
2024-12-13 14:14:50 +08:00
wei liu
e279ccf109
enhance: Enable score based balance channel policy (#38143)
issue: #38142
current balance channel policy only consider current collection's
distribution, so if all collections has 1 channel, and all channels has
been loaded on same querynode, after querynode num increase, balance
channel won't be triggered.

This PR enable score based balance channel policy, to achieve:
1. distribute all channels evenly across multiple querynodes
2. distribute each collection's channel evenly across multiple
querynodes.

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-12-11 17:20:43 +08:00
Zhen Ye
d3ae8e9232
fix: delay the wait other coord logic in query coord after query coord change into standby state (#38259)
issue: https://github.com/milvus-io/milvus/issues/37764

- After removing rpc layer from mixcoord, the querycoord at standby mode
will be blocked forever of deployment rolling

---------

Signed-off-by: chyezh <chyezh@outlook.com>
2024-12-11 15:48:42 +08:00
wei liu
950203aba0
enhance: Optimize save colelction target latency (#38345)
issue: #38237
this PR only use better compression level for proto msg which is larger
than 1MB, and use a lighter compression level for smaller proto msg,
which could get a better latency in most case.

this PR could reduce the latency from 22.7s to 4.7s with 10000
collctions and each collections has 1000 segments.

before this PR:
BenchmarkTargetManager-8 1 22781536357 ns/op 566407275088 B/op 11188282
allocs/op
after this PR:
BenchmarkTargetManager-8 1 4729566944 ns/op 36713248864 B/op 10963615
allocs/op

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-12-11 10:12:43 +08:00
congqixia
7ea9c983d2
enhance: Add mockery package config for QC&QN (#38340)
Related to #38339

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-12-10 19:18:42 +08:00
wei liu
856e2aad7d
fix: Leader task stuck and retry again and again (#38202)
issue: #38201
leader task require to update delegator's distribution, and only success
after the distribution change has been applyed to delegator. but the
delegator will reject the distribution change if it's version is older
than current version in delegator. which cause the leader task stuck and
retry forever.

this PR remove the leader task finish check.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-12-10 19:16:42 +08:00
wei liu
f04986fceb
enhance: Remove constraint on release segment task (#38297)
issue: #38305
after we disable balance segment and balance channel happens at same
time, the constriant which require release segment must happens on
serviceable shard leader is unnessary.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-12-10 11:18:49 +08:00
jaime
8ed019735c
enhance: add disk stats within system metrics (#38033)
issue: ##36621

Signed-off-by: jaime <yun.zhang@zilliz.com>
2024-12-06 16:32:41 +08:00
congqixia
36946cc9ce
enhance: Set loaded collection/partition number to metrics (#38271)
Related to #36456
Previous PR: #38471 #38233

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-12-06 16:18:40 +08:00
congqixia
6ff19481f0
enhance: Resolve compilation error due to PR conflict (#38252)
Related pr: #38233 #38059

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-12-05 19:26:40 +08:00
congqixia
051bc280dd
enhance: Make dynamic load/release partition follow targets (#38059)
Related to #37849

---------

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-12-05 16:24:40 +08:00
congqixia
32645fc28a
enhance: Unify querycoord meta metrics (#38233)
Related to #36456

Unify collection/partition number metrics to collection manager in case
of unwant missing modification

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-12-05 15:48:39 +08:00
tinswzy
7944538ade
enhance: Add ctx param to KV operation interfaces (#38154)
issue: #35917 
Refine KV operation interfaces by adding a ctx param

Signed-off-by: tinswzy <zhenyuan.wei@zilliz.com>
2024-12-05 15:16:41 +08:00
tinswzy
e76802f910
enhance: refine querycoord meta/catalog related interfaces to ensure that each method includes a ctx parameter (#37916)
issue: #35917 
This PR refine the querycoord meta related interfaces to ensure that
each method includes a ctx parameter.

Signed-off-by: tinswzy <zhenyuan.wei@zilliz.com>
2024-11-25 11:14:34 +08:00
jaime
7bbfe86bcd
enhance: add list index and segment index retrieval API for WebUI (#37861)
issue: #36621

Signed-off-by: jaime <yun.zhang@zilliz.com>
2024-11-22 16:58:34 +08:00
congqixia
b34bfb98a0
enhance: Refine Replica manager colle2Replicas secondary index (#37906)
Related to #37630

This PR add a new util coll2Replicas secondary index to reduce map
access & iteration while get replicas by collection

---------

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-11-22 11:54:32 +08:00
wei liu
965bda6e60
enhance: Add channel name to shard leader log in meta cache (#37856)
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-11-21 19:24:31 +08:00
wei liu
0a440e0d38
fix: Prevent simultaneous balance of segments and channels (#37850)
issue: #33550
balance segment and balance segment execute at same time, which will
cause bounch of corner case.

This PR disable simultaneous balance of segments and channels

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-11-21 17:56:55 +08:00