114 Commits

Author SHA1 Message Date
jaime
7bbfe86bcd
enhance: add list index and segment index retrieval API for WebUI (#37861)
issue: #36621

Signed-off-by: jaime <yun.zhang@zilliz.com>
2024-11-22 16:58:34 +08:00
jaime
1e8ea4a7e7
feat: add segment/channel/task/slow query render (#37561)
issue: #36621

Signed-off-by: jaime <yun.zhang@zilliz.com>
2024-11-12 17:44:29 +08:00
wei liu
266f8ef1f5
fix: Search may return less result after qn recover (#36549)
issue: #36293 #36242
after qn recover, delegator may be loaded in new node, after all segment
has been loaded, delegator becomes serviceable. but delegator's target
version hasn't been synced, and if search/query comes, delegator will
use wrong target version to filter out a empty segment list, which
caused empty search result.

This pr will block delegator's serviceable status until target version
is synced

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-11-12 16:34:28 +08:00
wei liu
a03157838b
enhance: Enable node assign policy on resource group (#36968)
issue: #36977
with node_label_filter on resource group, user can add label on
querynode with env `MILVUS_COMPONENT_LABEL`, then resource group will
prefer to accept node which match it's node_label_filter.

then querynode's can't be group by labels, and put querynodes with same
label to same resource groups.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-11-08 11:18:27 +08:00
jaime
f348bd9441
feat: add segment,pipeline, replica and resourcegroup api for WebUI (#37344)
issue: #36621

Signed-off-by: jaime <yun.zhang@zilliz.com>
2024-11-07 11:52:25 +08:00
jaime
9d16b972ea
feat: add tasks page into management WebUI (#37002)
issue: #36621

1. Add API to access task runtime metrics, including:
  - build index task
  - compaction task
  - import task
- balance (including load/release of segments/channels and some leader
tasks on querycoord)
  - sync task
2. Add a debug model to the webpage by using debug=true or debug=false
in the URL query parameters to enable or disable debug mode.

Signed-off-by: jaime <yun.zhang@zilliz.com>
2024-10-28 10:13:29 +08:00
jaime
4746f47282
feat: management WebUI homepage (#36822)
issue: #36784
1. Implement an embedded web server for WebUI access.  
2. Complete the homepage development.

Home page demo:
<img width="2177" alt="iShot_2024-10-10_17 57 34"
src="https://github.com/user-attachments/assets/38539917-ce09-4e54-a5b5-7f4f7eaac353">

Signed-off-by: jaime <yun.zhang@zilliz.com>
2024-10-23 11:29:28 +08:00
wei liu
3cd0b26285
enhance: Enable dynamic update loaded collection's replica (#35822)
issue: #35821
After collection loaded, if we need to increase/decrease collection's
replica, we need to release and load it again.

milvus offers 4 solution to update loaded collection's replica, this PR
aims to dynamic change the replica number without release, and after
replica number changed, milvus will execute load replica or release
replica in async, and the replica loaded status can be checked by
getReplicas API.

Notice that if set too much replicas than querynode can afford,the new
replica won't be loaded successfully until enough querynode joins.

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-09-25 10:13:18 +08:00
congqixia
2fbc628994
feat: Support field partial load collection (#35416)
Related to #35415

---------

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-08-20 16:49:02 +08:00
jaime
fcec4c21b9
fix: check collection health(queryable) fail for releasing collection (#34947)
issue: #34946

Signed-off-by: jaime <yun.zhang@zilliz.com>
2024-08-02 17:20:15 +08:00
jaime
9630974fbb
enhance: move rocksmq from internal to pkg module (#33881)
issue: #33956

Signed-off-by: jaime <yun.zhang@zilliz.com>
2024-06-25 21:18:15 +08:00
wayblink
5fac2fa1d2
fix: Panic if ProcessActiveStandBy returns error (#33369)
#33368

Signed-off-by: wayblink <anyang.wang@zilliz.com>
2024-06-19 11:16:00 +08:00
wei liu
303470fc35
fix: Clean offline node from resource group after qc restart (#33232)
issue: #33200 #33207
pr#33104 causes the offline node will be kept in resource group after qc
recover, and offline node will be assign to new replica as rwNode, then
request send to those node will fail by NodeNotFound.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-05-22 10:03:40 +08:00
wei liu
33bd6eed28
fix: Clean offline node from replica after qc recover (#33213)
issue: #33200 #33207

pr#33104 remove this logic by mistake, which cause the offline node will
be kept in replica after qc recover, and request send to offline qn will
go a NodeNotFound error.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-05-21 15:41:39 +08:00
wei liu
2013d97243
enhance: Enable to dynamic update balancer policy in querycoord (#33037)
issue: #33036
This PR enable to dynamic update balancer policy without restart
querycoord.

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-05-21 14:29:39 +08:00
wei liu
a7f6193bfc
fix: query node may stuck at stopping progress (#33104)
issue: #33103 
when try to do stopping balance for stopping query node, balancer will
try to get node list from replica.GetNodes, then check whether node is
stopping, if so, stopping balance will be triggered for this replica.

after the replica refactor, replica.GetNodes only return rwNodes, and
the stopping node maintains in roNodes, so balancer couldn't find
replica which contains stopping node, and stopping balance for replica
won't be triggered, then query node will stuck forever due to
segment/channel doesn't move out.

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-05-20 10:21:38 +08:00
congqixia
861977ab60
fix: Start LeaderCacheObserver before SyncAll (#33035)
Related to #33033

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-05-14 13:25:32 +08:00
wei liu
e2332bdc17
enhance: Enable channel exclusive balance policy (#32911)
issue: #32910  
* split replica's node list to channels when create replicas
 * balance nodes among channels when node change happens
 * implement channel level balance, let balance happens in channel level

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-05-10 17:27:31 +08:00
wei liu
ba02d54a30
enhance: update shard leader cache when leader location changed (#32470)
issue: #32466

this PR enhance that when shard location changed, update proxy's shard
leader cache. in case of query node failover case, proxy can find
replica recover

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-05-08 10:05:29 +08:00
chyezh
48fe977a9d
enhance: declarative resource group api (#31930)
issue: #30647

- Add declarative resource group api

- Add config for resource group management

- Resource group recovery enhancement

---------

Signed-off-by: chyezh <chyezh@outlook.com>
2024-04-15 08:13:19 +08:00
congqixia
25a1c9ecf0
fix: Make coordinator Register not blocked on ProcessActiveStandby (#32069)
See also #32066

This PR make coordinator register successful and let
`ProcessActiveStandBy` run async. And roles may receive stop signal and
notify servers.

---------

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-04-10 18:49:18 +08:00
chyezh
a2502bde75
enhance: replica manager enhancement (#31496)
issue: #30647 

- ReplicaManager manage read only node now, and always do persistent of
node distribution of replica.

- All segment/channel checker using ReplicaManager to get read-only node
or read-write node, but not ResourceManager.

- ReplicaManager promise that only apply unique querynode to one replica
in same collection now (replicas in same collection never hold same
querynode at same time).

- ReplicaManager promise that fairly node count assignment policy if
multi replicas of collection is assigned to one resource group.

- Move some parameters check into ReplicaManager to avoid data race.

- Allow transfer replica to resource group that already load replica of
same collection

- Allow transfer node between resource groups that load replica of same
collection

---------

Signed-off-by: chyezh <chyezh@outlook.com>
2024-04-05 04:57:16 +08:00
wei liu
4dfdb1a443
fix: save current target after target observer stop (#31315)
issue: #28491

should save target to meta store after target observer stop, incase of
target changed

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-03-18 12:27:04 +08:00
wei liu
d79aa58b37
enhance: Speed up target recovery after query coord restart (#31240)
issue: #28491

after querycoord restart, it will pull a new target, which include
channel and segment list. when segments loaded on querynode has reached
the target, the collection could provide search/query. but if segment
list changes by time, ater querycoord pull a new target, it will takes a
few minutes to catch up the target's segment distribution. and before
that, query/search will fail due to lack of segments.

This PR save the current loaded target to meta storein querycoord's stop
progress, and recover it when query coord starts, to speed up the target
recovery time.

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-03-15 14:19:03 +08:00
chyezh
ff4237bb90
enhance: add hostname into node info (#30673)
issue: https://github.com/milvus-io/milvus/issues/30647

- Address may be reused in k8s environment. Using hostname can be
better.

Signed-off-by: chyezh <chyezh@outlook.com>
2024-03-15 10:45:06 +08:00
jaime
db79be3ae0
fix: ctx cancel should be the last step while stopping server (#31220)
issue: #31219

Signed-off-by: jaime <yun.zhang@zilliz.com>
2024-03-15 10:33:05 +08:00
SimFG
ee8d6f236c
enhance: make the watch dm channel request better compatibility (#30952)
issue: #30938

Signed-off-by: SimFG <bang.fu@zilliz.com>
2024-03-01 16:07:37 +08:00
chyezh
0c7474d7e8
enhance: add graceful stop timeout to avoid node stop hang under extreme cases (#30317)
1. add coordinator graceful stop timeout to 5s
2. change the order of datacoord component while stop
3. change querynode grace stop timeout to 900s, and we should
potentially change this to 600s when graceful stop is smooth

issue: #30310
also see pr: #30306

---------

Signed-off-by: chyezh <chyezh@outlook.com>
2024-02-29 17:01:50 +08:00
Bingyi Sun
564b12c661
enhance: make balance cost threshold configurable (#30636)
Signed-off-by: sunby <sunbingyi1992@gmail.com>
2024-02-19 15:24:50 +08:00
wei liu
e98c62abbb
enhance: refactor leader_observer to leader_checker (#29454)
issue: #29453 

sync distribution by rpc will also call loadSegment/releaseSegment,
which may cause all kinds of concurrent case on same segment, such as
concurrent load and release on one segment.
This PR add leader_checker which generate load/release task to correct
the leader view, instead of calling sync distribution by rpc

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-01-05 15:54:55 +08:00
wei liu
839a72129e
fix: Auto balance param can't be updated by dynamic (#29501)
This PR fixed that auto balance param can't be updated by dynamic

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2023-12-27 14:30:53 +08:00
SimFG
dd9c61831d
enhance: Support to get the param value in the runtime (#29297)
/kind improvement
issue: #29299

Signed-off-by: SimFG <bang.fu@zilliz.com>
2023-12-22 18:36:44 +08:00
jaime
b1e0a27f31
enhance: Add logs for each step during service initialization (#28624)
/kind improvement

Signed-off-by: jaime <yun.zhang@zilliz.com>
2023-11-27 16:30:26 +08:00
congqixia
a2fe9dad49
enhance: Make etcd kv request timeout configurable (#28661)
See also #28660
This pr add request timeout config item for etcd kv request timeout
 Sync the default timeout value to same value for etcdKV & tikv config

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2023-11-23 19:34:23 +08:00
smellthemoon
73f2bab454
enhance:add some log when create client and get component states (#28160)
/kind improvement

Signed-off-by: lixinguo <xinguo.li@zilliz.com>
Co-authored-by: lixinguo <xinguo.li@zilliz.com>
2023-11-22 09:12:22 +08:00
wei liu
b9bf910039
fix unstable auto balance config ut (#28288)
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2023-11-09 10:00:22 +08:00
yah01
1b90630633
Fix the target updated before version updated to cause data missing (#28250)
Signed-off-by: yah01 <yah2er0ne@outlook.com>
2023-11-08 11:36:22 +08:00
wei liu
5b45a138b1
disable auto balance when old node exists (#28191)
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2023-11-07 14:02:20 +08:00
wei liu
7485eeb689
fix sync distribution with wrong version (#28130)
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2023-11-03 19:02:18 +08:00
Filip Haltmayer
6b1a106a31
Moving etcd client into session (#27069)
Signed-off-by: Filip Haltmayer <filip.haltmayer@zilliz.com>
2023-10-27 07:36:12 +08:00
wei liu
178db7b0f0
check stopping node during start qc (#27859)
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2023-10-24 12:20:11 +08:00
jaime
ec1fe3549e
Add a stop hook to clean session (#27564)
Signed-off-by: jaime <yun.zhang@zilliz.com>
2023-10-16 10:24:10 +08:00
yah01
be980fbc38
Refine state check (#27541)
Signed-off-by: yah01 <yah2er0ne@outlook.com>
2023-10-11 21:01:35 +08:00
yah01
a8ce1b6686
Refine QueryCoord stopping (#27371)
Signed-off-by: yah01 <yah2er0ne@outlook.com>
2023-09-27 16:27:27 +08:00
jaime
7f7c71ea7d
Decoupling client and server API in types interface (#27186)
Co-authored-by:: aoiasd <zhicheng.yue@zilliz.com>

Signed-off-by: jaime <yun.zhang@zilliz.com>
2023-09-26 09:57:25 +08:00
SimFG
26f06dd732
Format the code (#27275)
Signed-off-by: SimFG <bang.fu@zilliz.com>
2023-09-21 09:45:27 +08:00
yiwangdr
337edc321b
tikv integration (#26246)
Signed-off-by: yiwangdr <yiwangdr@gmail.com>
2023-09-07 07:25:14 +08:00
SimFG
28681276e2
Improve the retry of the rpc client (#26795)
Signed-off-by: SimFG <bang.fu@zilliz.com>
2023-09-06 17:43:14 +08:00
yah01
3349db4aa7
Refine errors to remove changes breaking design (#26521)
Signed-off-by: yah01 <yah2er0ne@outlook.com>
2023-09-04 09:57:09 +08:00
wei liu
949c320185
remove pull target from qc recover (#26775)
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2023-09-01 11:17:01 +08:00