wei liu 47949fd883
enhance: Implement rewatch mechanism for etcd failure scenarios (#43829) (#43920)
issue: #43828
pr: #43829 #43909
Implement robust rewatch mechanism to handle etcd connection failures
and node reconnection scenarios in DataCoord and QueryCoord, along with
heartbeat lag monitoring capabilities.

Changes include:
- Implement rewatchDataNodes/rewatchQueryNodes callbacks for etcd
reconnection scenarios
- Add idempotent rewatchNodes method to handle etcd session recovery
gracefully
- Add QueryCoordLastHeartbeatTimeStamp metric for monitoring node
heartbeat lag
- Clean up heartbeat metrics when nodes go down to prevent metric leaks

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
Co-authored-by: Zhen Ye <chyezh@outlook.com>
2025-10-15 14:12:01 +08:00

Data Coordinator

Data coordinator (datacoord for short) is the component that organizes DataNodes and segment allocations.

Dependency

  • KV store: holds all the meta information datacoord needs to operate. (etcd)
  • Message stream: used to exchange statistics information with data nodes. (Pulsar)
  • Root Coordinator: the source of timestamps, IDs, and meta information.
  • Data Node(s): a single instance or a cluster; the worker group that actually handles data modification operations.
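
The dependency list above can be pictured as a set of interfaces that datacoord is wired with at startup. The following is only a minimal sketch under stated assumptions: every type, method, and function name in it (MetaKV, StatsStream, RootCoord, DataNode, Dependencies, Validate, and the in-memory fakes) is hypothetical and is not taken from the datacoord source. It also assumes that data nodes may register after startup, so an empty node list is not treated as an error.

```go
package main

import (
	"fmt"
	"strings"
)

// NOTE: all names in this block are hypothetical, for illustration only.

// MetaKV stands in for the KV store (etcd) holding datacoord meta info.
type MetaKV interface {
	Load(key string) (string, error)
	Save(key, value string) error
}

// StatsStream stands in for the message stream (Pulsar) carrying statistics.
type StatsStream interface {
	Consume() <-chan []byte
}

// RootCoord stands in for the timestamp and ID source.
type RootCoord interface {
	AllocTimestamp() (uint64, error)
	AllocID() (int64, error)
}

// DataNode stands in for one worker that handles data modification.
type DataNode interface {
	ID() int64
}

// Dependencies groups everything the sketch expects to be wired in.
type Dependencies struct {
	KV        MetaKV
	Stream    StatsStream
	RootCoord RootCoord
	DataNodes []DataNode // assumption: may be empty at startup
}

// Validate reports which required dependencies are missing.
func (d Dependencies) Validate() error {
	var missing []string
	if d.KV == nil {
		missing = append(missing, "kv store")
	}
	if d.Stream == nil {
		missing = append(missing, "message stream")
	}
	if d.RootCoord == nil {
		missing = append(missing, "root coordinator")
	}
	if len(missing) > 0 {
		return fmt.Errorf("datacoord: missing dependencies: %s", strings.Join(missing, ", "))
	}
	return nil
}

// memKV is an in-memory fake of MetaKV.
type memKV map[string]string

func (m memKV) Load(key string) (string, error) {
	v, ok := m[key]
	if !ok {
		return "", fmt.Errorf("key %q not found", key)
	}
	return v, nil
}

func (m memKV) Save(key, value string) error {
	m[key] = value
	return nil
}

// chanStream is a channel-backed fake of StatsStream.
type chanStream chan []byte

func (c chanStream) Consume() <-chan []byte { return c }

// counterRootCoord is a fake RootCoord that hands out increasing values.
type counterRootCoord struct {
	ts uint64
	id int64
}

func (r *counterRootCoord) AllocTimestamp() (uint64, error) { r.ts++; return r.ts, nil }
func (r *counterRootCoord) AllocID() (int64, error)         { r.id++; return r.id, nil }
```

With the fakes wired in, Validate returns nil; leaving a field unset makes it name the missing dependency, which keeps a misconfigured startup from failing later in a less obvious place.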