Cherry-pick from master
pr: #45147
Related to #44509
Fix a bug where QueryNodeNumEntities metrics were not updated for
collections with zero segments, causing stale metrics when all segments
are flushed or compacted.
The previous implementation used separate loops: one to update size
metrics for all collections, and another to update num entities metrics
only for collections present in the grouped segments map. Collections
with no segments were skipped in the second loop, leaving their
NumEntities metrics stale.
Changes:
- Consolidate size and num entities metric updates into single loop
- Iterate over all collections instead of grouped segments
- Get collection metadata from manager instead of segment instances
- Correctly set NumEntities to 0 for collections with no segments
- Apply the same fix to both growing and sealed segment processing
- Add nil check for collection metadata before processing
This ensures all collection metrics are updated consistently, even when
segment count drops to zero.
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
issue: #42995
- The consuming lag at streaming node will be reported to coordinator.
- The consuming lag will trigger the write limit and deny by quota
center.
- Set the ttProtection by default.
---------
Signed-off-by: chyezh <chyezh@outlook.com>
issue: #41048
Fixes issue introduced in PR #33522 where metric resets caused
incomplete data collection by monitoring systems.
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
issue: #35563
1. Use an internal health checker to monitor the cluster's health state,
storing the latest state on the coordinator node. The CheckHealth
request retrieves the cluster's health from this latest state on the
proxy sides, which enhances cluster stability.
2. Each health check will assess all collections and channels, with
detailed failure messages temporarily saved in the latest state.
3. Use CheckHealth request instead of the heavy GetMetrics request on
the querynode and datanode
Signed-off-by: jaime <yun.zhang@zilliz.com>
Related to #37630
TSafe manager is too complex for current implementation and each
delegator need one goroutine waiting for tsafe update event.
Tsafe updating could be executed in pipeline. This PR remove tsafe
manager and simplify the entire logic of tsafe updating.
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
issue: #36621
1. Add API to access task runtime metrics, including:
- build index task
- compaction task
- import task
- balance (including load/release of segments/channels and some leader
tasks on querycoord)
- sync task
2. Add a debug model to the webpage by using debug=true or debug=false
in the URL query parameters to enable or disable debug mode.
Signed-off-by: jaime <yun.zhang@zilliz.com>
/kind improvement
fix: #31272
This pr add more metrics, which are:
- Slow query count, which the duration considered as slow can be
configurable;
- Number of deleted entities;
- Number of entities imported;
- Number of entities per collection;
- Number of loaded entities per collection;
- Number of indexed entities;
- Number of indexed entities, per collection, per index and whether it's
a vetor index;
- Quota states (LongTimeTickDelay, MemoryExhuasted, DiskQuotaExhuasted)
per database;
---------
Signed-off-by: longjiquan <jiquan.long@zilliz.com>
See also #31289
This PR:
- Set collection level `QueryNodeEntitiesSize` to zero if all segment
released
- Delete `QueryNodeEntitiesSize` metrics value after collection ref is
zero
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>