issue: #45598
The MultiTargetBalancer was missing replica field assignment in the
generated segment and channel plans, which caused a panic during balance
operations. This change ensures that all balance plans have the replica
field properly set to fix the panic issue.
Also refactored the balance test to extract common test logic into a
reusable helper function and added a new integration test specifically
for MultipleTargetBalancer policy.
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
issue: #40513
For a querynode that returns a resource-exhausted error, apply a penalty
duration to it and suspend loading new resources on it until the penalty
duration expires.
---------
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
issue: #45865
- Modified leader_checker.go to include all nodes (RO + RW) instead of
only RW nodes, preventing channel balance from getting stuck on RO nodes
- Added debug logging in segment_checker.go when no shard leader found
- Enhanced target_observer.go with detailed logging for delegator check
failures to improve debugging visibility
- Fixed integration tests:
  - Temporarily disabled partial result counter assertion in
    partial_result_on_node_down_test.go pending concurrent issue fix
  - Increased transfer channel timeout from 10s to 20s in
    manual_rolling_upgrade_test.go to avoid flaky test caused by target
    update interval (10s)
---------
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
Related to #45960
Follow-up to #45961
After #45961 ensured that handleNodeUp is always called for nodes
discovered during rewatchNodes (including stopping nodes), this change
adds a safeguard in ResourceManager.handleNodeUp to skip adding stopping
nodes to resource groups.
1. **resource_manager.go**: Add check for IsStoppingState() in
handleNodeUp to prevent stopping nodes from being added to incomingNode
set and assigned to resource groups.
2. **server.go**:
   - Delete processed nodes from sessionMap to avoid duplicate processing
     in the subsequent loop
   - Add warning logs for stopping state transitions during rewatch
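The safeguard in item 1 can be sketched roughly as follows. `nodeInfo`
and this `resourceManager` are simplified, hypothetical stand-ins; only
the names `handleNodeUp`, `IsStoppingState` and `incomingNode` come from
the description above, and the boolean return value is an assumption
added to make the behavior visible.

```go
package main

// nodeInfo is a hypothetical stand-in for the node metadata.
type nodeInfo struct {
	ID       int64
	Stopping bool
}

// IsStoppingState mirrors the check named in the description.
func (n nodeInfo) IsStoppingState() bool { return n.Stopping }

// resourceManager is a simplified stand-in; it only tracks incoming nodes.
type resourceManager struct {
	incomingNode map[int64]struct{}
}

func newResourceManager() *resourceManager {
	return &resourceManager{incomingNode: make(map[int64]struct{})}
}

// handleNodeUp queues a node for resource group assignment unless the
// node is stopping. It reports whether the node was queued.
func (rm *resourceManager) handleNodeUp(n nodeInfo) bool {
	if n.IsStoppingState() {
		// A stopping node must not be assigned to any resource group.
		return false
	}
	rm.incomingNode[n.ID] = struct{}{}
	return true
}
```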
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
Related to #45960
When QueryCoord restarts or reconnects to etcd, the rewatchNodes
function previously skipped handleNodeUp for QueryNodes in stopping
state. This caused stopping balance to fail because necessary components
were not initialized:
- Task scheduler executor was not added
- Dist handler was not started
- Node was not registered in resource manager
This fix ensures handleNodeUp is always called for new nodes regardless
of their stopping state, followed by handleNodeStopping if the node is
stopping. This allows the graceful shutdown process to correctly migrate
segments and channels away from stopping nodes.
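A rough sketch of the call ordering described above. The `session`,
`call` and `coordinator` types are hypothetical and only record which
handler ran; the real handlers start executors, dist handlers and
resource manager registration.

```go
package main

// session is a hypothetical view of a QueryNode session found during rewatch.
type session struct {
	NodeID   int64
	Stopping bool
}

// call records one handler invocation so the ordering can be inspected.
type call struct {
	Op     string
	NodeID int64
}

// coordinator is a stand-in that tracks known nodes and handler calls.
type coordinator struct {
	known map[int64]bool
	calls []call
}

func newCoordinator() *coordinator {
	return &coordinator{known: make(map[int64]bool)}
}

func (c *coordinator) handleNodeUp(id int64) {
	c.known[id] = true
	c.calls = append(c.calls, call{Op: "up", NodeID: id})
}

func (c *coordinator) handleNodeStopping(id int64) {
	c.calls = append(c.calls, call{Op: "stopping", NodeID: id})
}

// rewatchNodes initializes every newly discovered node first and only
// then marks it as stopping, so stopping balance has what it needs.
func (c *coordinator) rewatchNodes(sessions []session) {
	for _, s := range sessions {
		if c.known[s.NodeID] {
			continue // already initialized; nothing to do in this sketch
		}
		c.handleNodeUp(s.NodeID)
		if s.Stopping {
			c.handleNodeStopping(s.NodeID)
		}
	}
}
```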
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
issue: #43117
If we enable the check when loading segments, all segments must always
be loaded by streamingnode rather than by a 2.5 querynode, which makes
some searches and queries fail when upgrading. Without the check, some
search and query results will be wrong when upgrading. We choose to
disable this check for now to keep search and query available when
upgrading.
also see pr: #43346
Signed-off-by: chyezh <chyezh@outlook.com>
Related to #44620
Related to unstable ut "internal/querycoordv2 TestServer/TestNodeUp"
Introduce SessionWatcher interface to fix race condition and goroutine
leak that caused unstable unit test TestServer/TestNodeUp.
Changes:
- Add SessionWatcher interface with EventChannel() and Stop() methods
- Refactor WatchServices() to return SessionWatcher instead of raw
channel
- Fix cleanup order in QueryCoordV2: stop watcher before session
- Update DataCoord, ConnectionManager to use SessionWatcher
- Add MockSessionWatcher for testing
Fixes race condition between session context cancellation and internal
loop exit. Eliminates goroutine leak by providing explicit lifecycle
management.
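One possible shape of such a watcher, as a sketch only. The interface
follows the two methods listed above; `SessionEvent`, `chanWatcher` and
`newChanWatcher` are hypothetical names, and the real event type and
constructor may differ.

```go
package main

import "sync"

// SessionEvent is a hypothetical payload type for this sketch.
type SessionEvent struct {
	ServerID int64
	Kind     string // e.g. "add" or "del"
}

// SessionWatcher exposes the two methods named in the change list.
type SessionWatcher interface {
	EventChannel() <-chan SessionEvent
	Stop()
}

// chanWatcher forwards events from a source channel until stopped.
type chanWatcher struct {
	events   chan SessionEvent
	done     chan struct{}
	wg       sync.WaitGroup
	stopOnce sync.Once
}

func newChanWatcher(src <-chan SessionEvent) *chanWatcher {
	w := &chanWatcher{events: make(chan SessionEvent), done: make(chan struct{})}
	w.wg.Add(1)
	go func() {
		defer w.wg.Done()
		defer close(w.events) // runs before wg.Done, so Stop sees a closed channel
		for {
			select {
			case <-w.done:
				return
			case ev, ok := <-src:
				if !ok {
					return
				}
				select {
				case w.events <- ev:
				case <-w.done:
					return
				}
			}
		}
	}()
	return w
}

func (w *chanWatcher) EventChannel() <-chan SessionEvent { return w.events }

// Stop signals the loop and waits for it to exit, so no goroutine
// outlives the watcher. It is safe to call more than once.
func (w *chanWatcher) Stop() {
	w.stopOnce.Do(func() { close(w.done) })
	w.wg.Wait()
}
```

With an explicit `Stop`, the caller can stop the watcher before closing
the session, which is the cleanup order the fix relies on.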
---------
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
issue: #45452
- Alias/rename related DDL should use a database-level exclusive lock.
- An alias cannot be used as the resource key of a lock; use the
collection name instead.
- Transfer replica should use the WAL-based framework.
Signed-off-by: chyezh <chyezh@outlook.com>
issue: #45080, #45274, #45285
- LoadCollection doesn't ignore the ignorable request, for false field
array.
- CreateIndex doesn't ignore the ignorable request, for wrong index.
- Index meta is not thread safe.
- Lost parameter check of DDL.
- DDL Ack scheduler may get stuck, and DDL is blocked until the next
incoming DDL.
---------
Signed-off-by: chyezh <chyezh@outlook.com>
issue: #43897
- Load/Release collection/partition is implemented by WAL-based DDL
framework now.
- Support AlterLoadConfig/DropLoadConfig in wal now.
- Load/Release operation can be synced by new CDC now.
- Refactor some UT for load/release DDL.
---------
Signed-off-by: chyezh <chyezh@outlook.com>
issue: #43897
- Part of collection/index related DDL is implemented by WAL-based DDL
framework now.
- Support the following message types in WAL: CreateCollection,
DropCollection, CreatePartition, DropPartition, CreateIndex, AlterIndex,
DropIndex.
- Part of collection/index related DDL can be synced by new CDC now.
- Refactor some UT for collection/index DDL.
- Add Tombstone scheduler to manage the tombstone GC for collection or
partition meta.
- Move the vchannel allocation into streaming pchannel manager.
---------
Signed-off-by: chyezh <chyezh@outlook.com>
Related to #44956
Add manifest_path field throughout the data path to support LOON Storage
V2 manifest tracking. The manifest stores metadata for segment data
files and enables the unified Storage V2 FFI interface.
Changes include:
- Add manifest_path field to SegmentInfo and SaveBinlogPathsRequest
proto messages
- Add UpdateManifest operator to datacoord meta operations
- Update metacache, sync manager, and meta writer to propagate manifest
paths
- Include manifest_path in segment load info for query coordinator
This is part of the Storage V2 FFI interface integration.
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
relate: https://github.com/milvus-io/milvus/issues/43687
We used to run the temporary analyzer and validate analyzer on the
proxy, but the proxy should not be a computation-heavy node. This PR
moves all analyzer calculations to the streaming node.
---------
Signed-off-by: aoiasd <zhicheng.yue@zilliz.com>
Fixes #45035
This commit addresses a data race issue where refreshCollection was
updating the collection notifier without proper lock protection.
Changes:
- Add UpdateCollection method to CollectionManager with proper locking
- Introduce CollectionOperator pattern for thread-safe collection
updates
- Make setRefreshNotifier private and use it through the operator
pattern
- Update refreshCollection to use the new UpdateCollection method
- Handle collection not found error gracefully in refreshCollection
The CollectionOperator pattern ensures all collection modifications go
through the CollectionManager's lock, preventing concurrent access
issues.
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
issue: #43897
- Resource group related DDL is implemented by WAL-based DDL framework
now.
- Support the following message types in WAL: AlterResourceGroup,
DropResourceGroup.
- Resource group DDL can be synced by new CDC now.
- Refactor some UT for resource group DDL.
---------
Signed-off-by: chyezh <chyezh@outlook.com>
issue: #43858
Fix the issue introduced in PR #43992 where deactivating the balance
checker incorrectly stops stopping balance operations.
Changes:
- Move IsActive() check after stopping balance logic
- Only skip normal balance when checker is inactive
- Allow stopping balance to proceed regardless of checker state
This ensures stopping balance can execute even when the balance checker
is deactivated.
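The reordered check might look roughly like the sketch below. The
`balanceChecker` struct with function fields is hypothetical, and
returning stopping plans before normal plans is an assumption made for
illustration; only `IsActive()` is named in the description.

```go
package main

// balanceChecker is a simplified stand-in; plan generation is injected
// as functions so the control flow can be exercised in isolation.
type balanceChecker struct {
	active        bool
	stoppingPlans func() []string
	normalPlans   func() []string
}

func (b *balanceChecker) IsActive() bool { return b.active }

// Check runs stopping balance first, regardless of the active flag,
// and consults IsActive only before normal balance.
func (b *balanceChecker) Check() []string {
	if plans := b.stoppingPlans(); len(plans) > 0 {
		return plans
	}
	if !b.IsActive() {
		// Deactivation only skips normal balance.
		return nil
	}
	return b.normalPlans()
}
```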
---------
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
issue: #44014
- On standalone, the inner query node needs to load segments and watch
channels, so the querynode is not an embedded querynode in streamingnode
without `LabelStreamingNodeEmbeddedQueryNode`. The channel dist manager
cannot confirm that a standalone node is an embededStreamingNode.
Bug is introduced by #44099
Signed-off-by: chyezh <chyezh@outlook.com>
issue: #44730
Fix the issue where logs were not outputting as expected due to
incorrect log package imports across multiple components.
Changes include:
- Add golangci-lint rule to forbid github.com/pingcap/log usage
- Replace github.com/pingcap/log with
github.com/milvus-io/milvus/pkg/v2/log
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
issue: #43897
- Return LastConfirmedMessageID from the WAL append operation.
- Add resource-key-based locker for broadcast-ack operation to protect
the coord state when executing DDL.
- Resource-key-based locker is held until the broadcast operation is
acked.
- ResourceKey supports shared and exclusive locks.
- Add FastAck to execute the ack right after the broadcast is done, to
speed up DDL.
- Ack callback will support broadcast message result now.
- Add tombstone for broadcaster to avoid repeatedly committing DDL and
the ABA issue.
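To illustrate shared versus exclusive resource keys, here is a small
sketch built on `sync.RWMutex`. `keyLocker` and its methods are
hypothetical and much simpler than the broadcaster's locker; in
particular, the sketch never removes map entries and does not model
holding the lock until the ack arrives.

```go
package main

import "sync"

// keyLocker is a hypothetical per-key lock table for this sketch.
type keyLocker struct {
	mu    sync.Mutex
	locks map[string]*sync.RWMutex
}

func newKeyLocker() *keyLocker {
	return &keyLocker{locks: make(map[string]*sync.RWMutex)}
}

// get returns the lock for key, creating it on first use.
// Entries are never removed in this sketch.
func (k *keyLocker) get(key string) *sync.RWMutex {
	k.mu.Lock()
	defer k.mu.Unlock()
	rw, ok := k.locks[key]
	if !ok {
		rw = &sync.RWMutex{}
		k.locks[key] = rw
	}
	return rw
}

// Lock blocks until the key is held in the requested mode and returns
// the function that releases it.
func (k *keyLocker) Lock(key string, shared bool) (release func()) {
	rw := k.get(key)
	if shared {
		rw.RLock()
		return rw.RUnlock
	}
	rw.Lock()
	return rw.Unlock
}

// TryLock is the non-blocking variant; ok is false if the key is busy.
func (k *keyLocker) TryLock(key string, shared bool) (release func(), ok bool) {
	rw := k.get(key)
	if shared {
		if !rw.TryRLock() {
			return nil, false
		}
		return rw.RUnlock, true
	}
	if !rw.TryLock() {
		return nil, false
	}
	return rw.Unlock, true
}
```

In this model a DDL would take its keys before broadcasting and call the
release functions from the ack callback.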
---------
Signed-off-by: chyezh <chyezh@outlook.com>
When mmap is enabled in both collection properties and field
properties, loading a segment fails.
See also: #44443
Signed-off-by: yangxuan <xuan.yang@zilliz.com>
issue: #43858
Refactor the balance checker implementation to use priority queues for
managing collection balance operations, improving processing efficiency
and order control.
Changes include:
- Export priority queue interfaces (Item, BaseItem, PriorityQueue)
- Replace collection round-robin with priority-based queue system
- Add BalanceCheckCollectionMaxCount configuration parameter
- Optimize balance task generation with batch processing limits
- Refactor processBalanceQueue method for different strategies
- Enhance test coverage with comprehensive unit tests
The new priority queue system processes collections based on row count
or collection ID order, providing better control over balance operation
priorities and resource utilization.
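A sketch of a row-count-ordered queue using `container/heap`. The names
`collectionItem`, `byRowCountDesc` and `nextBatch` are hypothetical, and
processing larger collections first with ties broken by collection ID is
an assumption; `maxCount` plays the role of
BalanceCheckCollectionMaxCount.

```go
package main

import "container/heap"

// collectionItem is a hypothetical queue entry.
type collectionItem struct {
	CollectionID int64
	RowCount     int64
}

// byRowCountDesc implements heap.Interface: larger row counts come
// first, and ties are broken by smaller collection ID.
type byRowCountDesc []collectionItem

func (q byRowCountDesc) Len() int { return len(q) }
func (q byRowCountDesc) Less(i, j int) bool {
	if q[i].RowCount != q[j].RowCount {
		return q[i].RowCount > q[j].RowCount
	}
	return q[i].CollectionID < q[j].CollectionID
}
func (q byRowCountDesc) Swap(i, j int) { q[i], q[j] = q[j], q[i] }
func (q *byRowCountDesc) Push(x any)   { *q = append(*q, x.(collectionItem)) }
func (q *byRowCountDesc) Pop() any {
	old := *q
	n := len(old)
	item := old[n-1]
	*q = old[:n-1]
	return item
}

// nextBatch returns the IDs of at most maxCount collections in
// priority order, without modifying the input slice.
func nextBatch(items []collectionItem, maxCount int) []int64 {
	if maxCount <= 0 {
		return nil
	}
	q := make(byRowCountDesc, len(items))
	copy(q, items)
	heap.Init(&q)
	out := make([]int64, 0, maxCount)
	for q.Len() > 0 && len(out) < maxCount {
		out = append(out, heap.Pop(&q).(collectionItem).CollectionID)
	}
	return out
}
```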
---------
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
issue: #42942
This PR includes the following changes:
1. Added checks for index checker in querycoord to generate drop index
tasks
2. Added drop index interface to querynode
3. To avoid search failure after dropping the index, the querynode
allows the use of lazy mode (warmup=disable) to load raw data even when
indexes contain raw data.
4. In segcore, loading the index no longer deletes raw data; instead, it
evicts it.
5. In expr, the index is pinned to prevent concurrent errors.
---------
Signed-off-by: sunby <sunbingyi1992@gmail.com>
issue: #44014
- The sessions of querynode and streamingnode are different.
- So when the streamingnode session goes down first, a streaming query
node will be treated as a querynode.
- Use the label instead of the streaming node session to fix it.
Signed-off-by: chyezh <chyezh@outlook.com>
1. Enable Milvus to read cipher configs
2. Enable cipher plugin in binlog reader and writer
3. Add a testCipher for unittests
4. Support pooling for datanode
5. Add encryption in storagev2
See also: #40321
Signed-off-by: yangxuan <xuan.yang@zilliz.com>
issue: #43933
Fix the issue where QueryCoord restart leads to node status
inconsistency in resource manager, causing segment loading failures and
incorrect resource group assignments.
Changes include:
- Add CheckNodesInResourceGroup method to sync node status after restart
- Implement proper cleanup of offline/stopping nodes from resource
groups
- Add automatic discovery and assignment of new nodes to resource groups
- Enhance rewatchNodes process to include resource manager
synchronization
This ensures resource manager maintains correct node status and
assignments even after QueryCoord restarts, preventing segment loading
failures and improving system reliability.
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
issue: #43828
Implement robust rewatch mechanism to handle etcd connection failures
and node reconnection scenarios in DataCoord and QueryCoord, along with
heartbeat lag monitoring capabilities.
Changes include:
- Implement rewatchDataNodes/rewatchQueryNodes callbacks for etcd
reconnection scenarios
- Add idempotent rewatchNodes method to handle etcd session recovery
gracefully
- Add QueryCoordLastHeartbeatTimeStamp metric for monitoring node
heartbeat lag
- Clean up heartbeat metrics when nodes go down to prevent metric leaks
---------
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
Ref https://github.com/milvus-io/milvus/issues/42148
https://github.com/milvus-io/milvus/pull/42406 implements the segcore
part of storage for handling VectorArray.
This PR:
1. Implements the Go part of storage for VectorArray.
2. Implements collection creation with StructArrayField and VectorArray.
3. Inserts and retrieves data from the collection.
---------
Signed-off-by: SpadeA <tangchenjie1210@gmail.com>
Signed-off-by: SpadeA-Tang <tangchenjie1210@gmail.com>
Signed-off-by: SpadeA-Tang <u6748471@anu.edu.au>
issue: #43072, #43289
- manage the schema version at recovery storage.
- update the schema when creating collection or alter schema.
- get schema at write buffer based on version.
- recover the schema when upgrading from 2.5.
---------
Signed-off-by: chyezh <chyezh@outlook.com>
issue: #43117, #42966, #43373
- Also fix channel balance that may not work in 2.6.
- Fix the error being lost in the delete path.
- Add mvcc into s/q log.
- Change the log level for TestCoordDownSearch.
Signed-off-by: chyezh <chyezh@outlook.com>