ShardClient Package

The shardclient package provides client-side connection management and load balancing for communicating with QueryNode shards in the Milvus distributed architecture. It manages QueryNode client connections, caches shard leader information, and implements intelligent request routing strategies.

Overview

In Milvus, collections are divided into shards (channels), and each shard has multiple replicas distributed across different QueryNodes for high availability and load balancing. The shardclient package is responsible for:

  1. Connection Management: Maintaining a pool of gRPC connections to QueryNodes with automatic lifecycle management
  2. Shard Leader Cache: Caching the mapping of shards to their leader QueryNodes to reduce coordination overhead
  3. Load Balancing: Distributing requests across available QueryNode replicas using configurable policies
  4. Fault Tolerance: Automatic retry and failover when QueryNodes become unavailable

Architecture

┌──────────────────────────────────────────────────────────────┐
│                      Proxy Layer                              │
│                                                                │
│  ┌─────────────────────────────────────────────────────┐    │
│  │              ShardClientMgr                          │    │
│  │  • Shard leader cache (database → collection → shards) │
│  │  • QueryNode client pool management                   │
│  │  • Client lifecycle (init, purge, close)             │
│  └───────────────────────┬──────────────────────────────┘    │
│                          │                                    │
│  ┌───────────────────────▼──────────────────────────────┐    │
│  │              LBPolicy                                 │    │
│  │  • Execute workload on collection/channels           │    │
│  │  • Retry logic with replica failover                 │    │
│  │  • Node selection via balancer                       │    │
│  └───────────────────────┬──────────────────────────────┘    │
│                          │                                    │
│         ┌────────────────┴────────────────┐                  │
│         │                                  │                  │
│  ┌──────▼────────┐              ┌─────────▼──────────┐       │
│  │ RoundRobin    │              │  LookAsideBalancer │       │
│  │ Balancer      │              │  • Cost-based      │       │
│  │               │              │  • Health check    │       │
│  └───────────────┘              └────────────────────┘       │
│                          │                                    │
│  ┌───────────────────────▼──────────────────────────────┐    │
│  │           shardClient (per QueryNode)                │    │
│  │  • Connection pool (configurable size)               │    │
│  │  • Round-robin client selection                      │    │
│  │  • Lazy initialization and expiration                │    │
│  └──────────────────────────────────────────────────────┘    │
└─────────────────────┬────────────────────────────────────────┘
                      │ gRPC
      ┌───────────────┴───────────────┐
      │                               │
┌─────▼─────┐                  ┌──────▼──────┐
│ QueryNode │                  │ QueryNode   │
│    (1)    │                  │    (2)      │
└───────────┘                  └─────────────┘

Core Components

1. ShardClientMgr

The central manager for QueryNode client connections and shard leader information.

File: manager.go

Key Responsibilities:

  • Cache shard leader mappings from QueryCoord (database → collectionName → channel → []nodeInfo)
  • Manage shardClient instances for each QueryNode
  • Automatically purge expired clients (default: 60 minutes of inactivity)
  • Invalidate cache when shard leaders change

Interface:

type ShardClientMgr interface {
    GetShard(ctx context.Context, withCache bool, database, collectionName string,
             collectionID int64, channel string) ([]nodeInfo, error)
    GetShardLeaderList(ctx context.Context, database, collectionName string,
                       collectionID int64, withCache bool) ([]string, error)
    DeprecateShardCache(database, collectionName string)
    InvalidateShardLeaderCache(collections []int64)
    GetClient(ctx context.Context, nodeInfo nodeInfo) (types.QueryNodeClient, error)
    Start()
    Close()
}
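
A hedged usage sketch (the helper below is illustrative, reuses the imports from the Usage Example further down plus "fmt", and naively picks the first replica, whereas real callers let LBPolicy choose the node):

// Illustrative only: resolve a channel's shard leaders, then fetch a client
// for one of them. Production code goes through LBPolicy instead of picking
// a replica by hand.
func getClientForChannel(ctx context.Context, mgr shardclient.ShardClientMgr,
    db, collection string, collectionID int64, channel string) (types.QueryNodeClient, error) {
    leaders, err := mgr.GetShard(ctx, true /* withCache */, db, collection, collectionID, channel)
    if err != nil {
        return nil, err
    }
    if len(leaders) == 0 {
        return nil, fmt.Errorf("no shard leaders for channel %q", channel)
    }
    // naive choice of the first replica; LBPolicy balances across replicas
    return mgr.GetClient(ctx, leaders[0])
}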

Configuration:

  • purgeInterval: Interval for checking expired clients (default: 600s)
  • expiredDuration: Time after which inactive clients are purged (default: 60min)

2. shardClient

Manages a connection pool to a single QueryNode.

File: shard_client.go

Features:

  • Lazy initialization: Connections are created on first use
  • Connection pooling: Configurable pool size (ProxyCfg.QueryNodePoolingSize, default: 1)
  • Round-robin selection: Distributes requests across pool connections
  • Expiration tracking: Tracks last active time for automatic cleanup
  • Thread-safe: Safe for concurrent access

Lifecycle:

  1. Created when first request needs a QueryNode
  2. Initializes connection pool on first getClient() call
  3. Tracks lastActiveTs on each use
  4. Closed by manager if expired or during shutdown
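
The pooling behavior can be pictured with the simplified sketch below. This is not the shard_client.go implementation; the type and field names are illustrative, and the standard library sync, sync/atomic, and time packages plus the types import from the Usage Example are assumed.

// Simplified sketch of the pooling pattern: lazy init, round-robin pick,
// last-active tracking. poolSize is assumed to be >= 1 (as with the default
// ProxyCfg.QueryNodePoolingSize of 1).
type pooledClient struct {
    mu           sync.Mutex
    clients      []types.QueryNodeClient
    idx          atomic.Int64
    lastActiveTs atomic.Int64
    newClient    func(ctx context.Context) (types.QueryNodeClient, error)
    poolSize     int
}

func (p *pooledClient) getClient(ctx context.Context) (types.QueryNodeClient, error) {
    p.lastActiveTs.Store(time.Now().UnixNano()) // expiration tracking
    p.mu.Lock()
    if len(p.clients) == 0 { // lazy initialization on first use
        for i := 0; i < p.poolSize; i++ {
            c, err := p.newClient(ctx)
            if err != nil {
                p.mu.Unlock()
                return nil, err
            }
            p.clients = append(p.clients, c)
        }
    }
    p.mu.Unlock()
    // round-robin across the pool
    next := int(p.idx.Add(1) % int64(len(p.clients)))
    return p.clients[next], nil
}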

3. LBPolicy

Executes workloads on collections/channels with retry and failover logic.

File: lb_policy.go

Key Methods:

  • Execute(ctx, CollectionWorkLoad): Execute workload in parallel across all shards
  • ExecuteOneChannel(ctx, CollectionWorkLoad): Execute workload on any single shard (for lightweight operations)
  • ExecuteWithRetry(ctx, ChannelWorkload): Execute on specific channel with retry on different replicas

Retry Strategy:

  • Retry up to max(retryOnReplica, len(shardLeaders)) times
  • Maintain excludeNodes set to avoid retrying failed nodes
  • Refresh shard leader cache if initial attempt fails
  • Clear excludeNodes if all replicas exhausted

Workload Types:

type ChannelWorkload struct {
    Db             string
    CollectionName string
    CollectionID   int64
    Channel        string
    Nq             int64           // Number of queries
    Exec           ExecuteFunc     // Actual work to execute
}

type ExecuteFunc func(context.Context, UniqueID, types.QueryNodeClient, string) error
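
The retry strategy above condenses into the hedged sketch below; selectNode and refreshLeaders are illustrative stand-ins for the balancer and the shard leader cache refresh, not package APIs.

// Condensed sketch of the retry-with-failover flow.
func executeWithRetrySketch(ctx context.Context, wl shardclient.ChannelWorkload,
    retryOnReplica, replicaCount int,
    selectNode func(exclude map[int64]struct{}) (int64, types.QueryNodeClient, error),
    refreshLeaders func(ctx context.Context) error) error {
    // retry up to max(retryOnReplica, number of shard leaders) times
    attempts := retryOnReplica
    if replicaCount > attempts {
        attempts = replicaCount
    }
    exclude := make(map[int64]struct{})
    var lastErr error
    for i := 0; i < attempts; i++ {
        if err := ctx.Err(); err != nil { // respect cancellation
            return err
        }
        nodeID, client, err := selectNode(exclude)
        if err != nil {
            // every replica is excluded: refresh leaders and start over
            _ = refreshLeaders(ctx)
            exclude = make(map[int64]struct{})
            lastErr = err
            continue
        }
        if err := wl.Exec(ctx, nodeID, client, wl.Channel); err != nil {
            exclude[nodeID] = struct{}{} // don't retry the failed node
            lastErr = err
            continue
        }
        return nil
    }
    return lastErr
}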

4. Load Balancers

Two strategies for selecting QueryNode replicas:

RoundRobinBalancer

File: roundrobin_balancer.go

Simple round-robin selection across available nodes. Keeps no per-node state beyond an atomic request counter, so overhead is minimal.

Use case: Uniform workload distribution when all nodes have similar capacity
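
A minimal sketch of the pattern (the real balancer also integrates with the package's balancer interface and skips unavailable nodes; the names below are illustrative):

// Round-robin selection over the available node IDs using an atomic counter.
type roundRobinSketch struct {
    idx atomic.Int64
}

func (b *roundRobinSketch) selectNode(available []int64) (int64, error) {
    if len(available) == 0 {
        return -1, errors.New("no available QueryNode")
    }
    next := b.idx.Add(1) % int64(len(available))
    return available[next], nil
}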

LookAsideBalancer

File: look_aside_balancer.go

Cost-aware load balancer that considers QueryNode workload and health.

Features:

  • Cost metrics tracking: Caches CostAggregation (response time, service time, total NQ) from QueryNodes
  • Workload score calculation: Uses power-of-3 formula to prefer lightly loaded nodes:
    score = executeSpeed + (1 + totalNQ + executingNQ)³ × serviceTime
    
  • Periodic health checks: Monitors QueryNode health via GetComponentStates RPC
  • Unavailable node handling: Marks nodes unreachable after consecutive health check failures
  • Adaptive behavior: Falls back to round-robin when workload difference is small
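
Expressed in Go, the score looks roughly like the sketch below; the parameter names follow the description here rather than the actual CostAggregation fields.

// Workload score per the power-of-3 formula above: larger queues and slower
// service times raise a node's score, so lightly loaded nodes win selection.
func workloadScore(executeSpeed float64, totalNQ, executingNQ int64, serviceTime float64) float64 {
    pending := float64(1 + totalNQ + executingNQ)
    return executeSpeed + pending*pending*pending*serviceTime
}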

Configuration Parameters:

  • ProxyCfg.CostMetricsExpireTime: How long to trust cached cost metrics (default: varies)
  • ProxyCfg.CheckWorkloadRequestNum: Check workload every N requests (default: varies)
  • ProxyCfg.WorkloadToleranceFactor: Tolerance for workload difference before preferring lighter node
  • ProxyCfg.CheckQueryNodeHealthInterval: Interval for health checks
  • ProxyCfg.HealthCheckTimeout: Timeout for health check RPC
  • ProxyCfg.RetryTimesOnHealthCheck: Failures before marking node unreachable

Selection Strategy:

if (requestCount % CheckWorkloadRequestNum == 0) {
    // Cost-aware selection
    select node with minimum workload score
    if (maxScore - minScore) / minScore <= WorkloadToleranceFactor {
        fall back to round-robin
    }
} else {
    // Fast path: round-robin
    select next available node
}

Configuration

Key configuration parameters from paramtable:

Parameter                      Path                                     Description                                          Default
QueryNodePoolingSize           ProxyCfg.QueryNodePoolingSize            Size of connection pool per QueryNode                1
RetryTimesOnReplica            ProxyCfg.RetryTimesOnReplica             Max retry times on replica failures                  varies
ReplicaSelectionPolicy         ProxyCfg.ReplicaSelectionPolicy          Load balancing policy: round_robin or look_aside     look_aside
CostMetricsExpireTime          ProxyCfg.CostMetricsExpireTime           Expiration time for cost metrics cache               varies
CheckWorkloadRequestNum        ProxyCfg.CheckWorkloadRequestNum         Frequency of workload-aware selection                varies
WorkloadToleranceFactor        ProxyCfg.WorkloadToleranceFactor         Tolerance for workload differences                   varies
CheckQueryNodeHealthInterval   ProxyCfg.CheckQueryNodeHealthInterval    Health check interval                                varies
HealthCheckTimeout             ProxyCfg.HealthCheckTimeout              Health check RPC timeout                             varies

Usage Example

import (
    "context"

    "github.com/milvus-io/milvus/internal/proxy/shardclient"
    "github.com/milvus-io/milvus/internal/types"
    // plus the querypb proto package for building the Search request below
)

// 1. Create ShardClientMgr with MixCoord client
mgr := shardclient.NewShardClientMgr(mixCoordClient)
mgr.Start()  // Start background purge goroutine
defer mgr.Close()

// 2. Create LBPolicy
policy := shardclient.NewLBPolicyImpl(mgr)
policy.Start(ctx)  // Start load balancer (health checks, etc.)
defer policy.Close()

// 3. Execute collection workload (e.g., search/query)
workload := shardclient.CollectionWorkLoad{
    Db:             "default",
    CollectionName: "my_collection",
    CollectionID:   12345,
    Nq:             100,  // Number of queries
    Exec: func(ctx context.Context, nodeID int64, client types.QueryNodeClient, channel string) error {
        // Perform actual work (search, query, etc.)
        req := &querypb.SearchRequest{/* ... */}
        _, err := client.Search(ctx, req)
        return err
    },
}

// Execute on all channels in parallel
if err := policy.Execute(ctx, workload); err != nil {
    // handle the failed search/query
}

// Or execute on any single channel (for lightweight ops)
if err := policy.ExecuteOneChannel(ctx, workload); err != nil {
    // handle the failure
}

Cache Management

Shard Leader Cache

The shard leader cache stores the mapping of shards to their leader QueryNodes:

database → collectionName → shardLeaders {
    collectionID: int64
    shardLeaders: map[channel][]nodeInfo
}
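
A rough Go shape of this layout (illustrative only; the concrete types in manager.go may differ):

// database -> collectionName -> per-collection shard leaders, keyed by channel.
type nodeInfoSketch struct {
    nodeID  int64
    address string
}

type shardLeadersSketch struct {
    collectionID int64
    channels     map[string][]nodeInfoSketch // channel -> replica leaders
}

type leaderCacheSketch map[string]map[string]*shardLeadersSketch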

Cache Operations:

  • Hit: When cached shard leaders are used (tracked via ProxyCacheStatsCounter)
  • Miss: When cache lookup fails, triggers RPC to QueryCoord via GetShardLeaders
  • Invalidation:
    • DeprecateShardCache(db, collection): Remove specific collection
    • InvalidateShardLeaderCache(collectionIDs): Remove collections by ID (called on shard leader changes)
    • RemoveDatabase(db): Remove entire database

Client Purging

The ShardClientMgr periodically purges unused clients:

  1. Every purgeInterval (default: 600s), iterate over all cached clients
  2. Check whether each client's QueryNode is still a shard leader (via ListShardLocation())
  3. If the node is no longer a leader and the client has been idle for longer than expiredDuration (based on lastActiveTs), close and remove it
  4. This prevents connection leaks when QueryNodes are removed or shards rebalance (see the sketch below)
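
The sketch below captures the shape of that loop; the callback parameters stand in for the manager's internal state and are not part of the package API.

// Hedged sketch of the periodic purge loop.
func purgeLoopSketch(ctx context.Context, purgeInterval, expiredDuration time.Duration,
    listNodes func() []int64,
    isShardLeader func(nodeID int64) bool,
    lastActive func(nodeID int64) time.Time,
    remove func(nodeID int64)) {
    ticker := time.NewTicker(purgeInterval)
    defer ticker.Stop()
    for {
        select {
        case <-ctx.Done():
            return
        case <-ticker.C:
            for _, nodeID := range listNodes() {
                expired := time.Since(lastActive(nodeID)) > expiredDuration
                if !isShardLeader(nodeID) && expired {
                    remove(nodeID) // close the client's connections and drop it
                }
            }
        }
    }
}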

Error Handling

Common Errors

  • errClosed: Client is closed (returned when accessing closed shardClient)
  • merr.ErrChannelNotAvailable: No available shard leaders for channel
  • merr.ErrNodeNotAvailable: Selected node is not available
  • merr.ErrCollectionNotLoaded: Collection is not loaded in QueryNodes
  • merr.ErrServiceUnavailable: All available nodes are unreachable

Retry Logic

Retry is handled at multiple levels:

  1. LBPolicy level:

    • Retries on different replicas when request fails
    • Refreshes shard leader cache on failure
    • Respects context cancellation
  2. Balancer level:

    • Tracks failed nodes and excludes them from selection
    • Health checks recover nodes when they come back online
  3. gRPC level:

    • Connection-level retries handled by gRPC layer

Metrics

The package exports several metrics:

  • ProxyCacheStatsCounter: Shard leader cache hit/miss statistics
    • Labels: nodeID, method (GetShard/GetShardLeaderList), status (hit/miss)
  • ProxyUpdateCacheLatency: Latency of updating shard leader cache
    • Labels: nodeID, method
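
Assuming both are standard Prometheus collectors exposed by the Milvus metrics package, and that the label order matches the listing above (both are assumptions; check the metric definitions), recording a cache hit would look roughly like this:

// Assumption: ProxyCacheStatsCounter is a prometheus CounterVec with labels
// (nodeID, method, status) in that order; "metrics" is the Milvus metrics package.
func recordCacheHit(nodeID int64, method string) {
    metrics.ProxyCacheStatsCounter.
        WithLabelValues(strconv.FormatInt(nodeID, 10), method, "hit").
        Inc()
}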

Testing

The package includes extensive test coverage:

  • shard_client_test.go: Tests for connection pool management
  • manager_test.go: Tests for cache management and client lifecycle
  • lb_policy_test.go: Tests for retry logic and workload execution
  • roundrobin_balancer_test.go: Tests for round-robin selection
  • look_aside_balancer_test.go: Tests for cost-aware selection and health checks

Mock interfaces (via mockery):

  • mock_shardclient_manager.go: Mock ShardClientMgr
  • mock_lb_policy.go: Mock LBPolicy
  • mock_lb_balancer.go: Mock LBBalancer

Thread Safety

All components are designed for concurrent access:

  • shardClientMgrImpl: Uses sync.RWMutex for cache, typeutil.ConcurrentMap for clients
  • shardClient: Uses sync.RWMutex and atomic operations
  • LookAsideBalancer: Uses typeutil.ConcurrentMap for all mutable state
  • RoundRobinBalancer: Uses atomic.Int64 for index

Related Components

  • Proxy (internal/proxy/): Uses shardclient to route search/query requests to QueryNodes
  • QueryCoord (internal/querycoordv2/): Provides shard leader information via GetShardLeaders RPC
  • QueryNode (internal/querynodev2/): Receives and processes requests routed by shardclient
  • Registry (internal/registry/): Provides client creation functions for gRPC connections

Future Improvements

Potential areas for enhancement:

  1. Adaptive pooling: Dynamically adjust connection pool size based on load
  2. Circuit breaker: Add circuit breaker pattern for consistently failing nodes
  3. Advanced metrics: Export more detailed metrics (per-node latency, error rates, etc.)
  4. Smart caching: Use TTL-based cache expiration instead of invalidation-only
  5. Connection warming: Pre-establish connections to known QueryNodes