742 Commits

Author SHA1 Message Date
aoiasd
354ab2f55e
enhance: sync file resource to querynode and datanode (#44480)
relate: https://github.com/milvus-io/milvus/issues/43687
Support using file resources in sync mode.
Automatically download or remove the local copy of a file resource when
the user adds or removes it.
Sync file resources to a node when a new node session is found.

---------

Signed-off-by: aoiasd <zhicheng.yue@zilliz.com>
2025-12-04 16:23:11 +08:00
wei liu
0c63ed95bb
test: [skip e2e] fix unstable assignment tests in balancer (#46042)
issue: #46038

- Add assertSegmentPlanNumAndTargetNodeMatch and
assertChannelPlanNumAndTargetNodeMatch helper functions to validate plan
count and target node membership for unstable assignment tests
- Mark "test assigning channels with resource exhausted nodes" as
unstable since node 2 and 3 have equal priority after filtering
- Replace simple length check with target node validation to ensure
plans assign to expected node set even when order is non-deterministic
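
A minimal sketch of the kind of check described above, assuming a simplified plan type. `SegmentPlan` and `checkPlanNumAndTargetNodes` are hypothetical names for illustration, not the helpers added in this commit:

```go
package main

import "fmt"

// SegmentPlan is a hypothetical stand-in for a balance plan; only the
// target node matters for this check.
type SegmentPlan struct {
	SegmentID int64
	To        int64
}

// checkPlanNumAndTargetNodes returns an error unless there are exactly
// wantNum plans and every plan targets a node in the allowed set. The
// order of the plans is deliberately ignored.
func checkPlanNumAndTargetNodes(plans []SegmentPlan, wantNum int, allowed []int64) error {
	if len(plans) != wantNum {
		return fmt.Errorf("expected %d plans, got %d", wantNum, len(plans))
	}
	set := make(map[int64]struct{}, len(allowed))
	for _, n := range allowed {
		set[n] = struct{}{}
	}
	for _, p := range plans {
		if _, ok := set[p.To]; !ok {
			return fmt.Errorf("segment %d assigned to unexpected node %d", p.SegmentID, p.To)
		}
	}
	return nil
}

func main() {
	plans := []SegmentPlan{{SegmentID: 1, To: 3}, {SegmentID: 2, To: 2}}
	fmt.Println(checkPlanNumAndTargetNodes(plans, 2, []int64{2, 3}))
}
```

Checking membership in a node set rather than an exact node keeps the test stable when two nodes have equal priority.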

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-12-04 16:17:11 +08:00
wei liu
a308331b81
fix: Set replica field in balance plans to prevent panic (#45722)
issue: #45598

The MultiTargetBalancer was missing replica field assignment in the
generated segment and channel plans, which caused panic during balance
operations. This change ensures that all balance plans have the replica
field properly set to fix the panic issue.

Also refactored the balance test to extract common test logic into a
reusable helper function and added a new integration test specifically
for MultipleTargetBalancer policy.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-12-04 10:19:11 +08:00
wei liu
e70c01362d
enhance: Add resource exhaustion querynode penalty policy (#45808)
issue: #40513
For a querynode that returns a resource exhausted error, apply a penalty
duration to it and suspend loading new resources onto it until the
penalty duration expires.

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-12-02 16:59:11 +08:00
wei liu
3bb3e8c09e
fix: Enable leader checker to sync segment distribution to RO nodes (#45949)
issue: #45865

- Modified leader_checker.go to include all nodes (RO + RW) instead of
only RW nodes, preventing channel balance from getting stuck on RO nodes
- Added debug logging in segment_checker.go when no shard leader found
- Enhanced target_observer.go with detailed logging for delegator check
failures to improve debugging visibility
- Fixed integration tests:
- Temporarily disabled partial result counter assertion in
partial_result_on_node_down_test.go pending concurrent issue fix
- Increased transfer channel timeout from 10s to 20s in
manual_rolling_upgrade_test.go to avoid flaky test caused by target
update interval (10s)

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-12-02 10:07:09 +08:00
Zhen Ye
2ef18c5b4f
enhance: remove watch at session liveness check (#45968)
issue: #45724

---------

Signed-off-by: chyezh <chyezh@outlook.com>
2025-12-01 17:55:10 +08:00
congqixia
af734f19dc
enhance: skip adding stopping node to resource group in handleNodeUp (#45969)
Related to #45960
Follow-up to #45961

After #45961 ensured that handleNodeUp is always called for nodes
discovered during rewatchNodes (including stopping nodes), this change
adds a safeguard in ResourceManager.handleNodeUp to skip adding stopping
nodes to resource groups.

1. **resource_manager.go**: Add check for IsStoppingState() in
handleNodeUp to prevent stopping nodes from being added to incomingNode
set and assigned to resource groups.

2. **server.go**:
   - Delete processed nodes from sessionMap to avoid duplicate processing
     in the subsequent loop
   - Add warning logs for stopping state transitions during rewatch

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2025-12-01 16:17:10 +08:00
congqixia
5f5560d042
fix: always call handleNodeUp in rewatchNodes for proper stopping balance (#45961)
Related to #45960

When QueryCoord restarts or reconnects to etcd, the rewatchNodes
function previously skipped handleNodeUp for QueryNodes in stopping
state. This caused stopping balance to fail because necessary components
were not initialized:
- Task scheduler executor was not added
- Dist handler was not started
- Node was not registered in resource manager

This fix ensures handleNodeUp is always called for new nodes regardless
of their stopping state, followed by handleNodeStopping if the node is
stopping. This allows the graceful shutdown process to correctly migrate
segments and channels away from stopping nodes.
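
The call order described above can be sketched as follows. This is only an illustration: `node` and `rewatch` are hypothetical, and the handler calls are recorded as strings instead of invoking real QueryCoord code:

```go
package main

import "fmt"

// node is a hypothetical view of a discovered QueryNode session.
type node struct {
	ID       int64
	Stopping bool
}

// rewatch returns the handler calls made for each newly discovered node,
// in order. handleNodeUp is always recorded first; handleNodeStopping
// follows only for nodes that are already stopping.
func rewatch(nodes []node) []string {
	var calls []string
	for _, n := range nodes {
		calls = append(calls, fmt.Sprintf("handleNodeUp(%d)", n.ID))
		if n.Stopping {
			calls = append(calls, fmt.Sprintf("handleNodeStopping(%d)", n.ID))
		}
	}
	return calls
}

func main() {
	for _, c := range rewatch([]node{{ID: 1}, {ID: 2, Stopping: true}}) {
		fmt.Println(c)
	}
}
```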

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2025-12-01 11:09:08 +08:00
Zhen Ye
4f080bd3a0
fix: remove the streamingnode checking when loading segment (#45859)
issue: #43117

If we enable this check when loading segments, every segment must be
loaded by a streamingnode rather than a 2.5 querynode, which makes some
searches and queries fail during an upgrade. If we disable it, some
search and query results will be wrong during an upgrade. We choose to
disable the check for now to keep search and query available during an
upgrade.

also see pr: #43346

Signed-off-by: chyezh <chyezh@outlook.com>
2025-11-28 10:09:08 +08:00
Zhen Ye
31976d8adb
fix: executor/scheduler should use the latest replica meta, not a replica copy (#45877)
issue: #45865

---------

Signed-off-by: chyezh <chyezh@outlook.com>
2025-11-28 06:59:08 +08:00
cai.zhang
7c9a9c6f7e
fix: Reduce querycoord check node in replica interval for test (#45837)
issue: #45791

Signed-off-by: Cai Zhang <cai.zhang@zilliz.com>
2025-11-27 15:07:07 +08:00
congqixia
f51fcc09ae
fix: resolve SessionWatcher goroutine leak and unstable UT in querycoordv2 (#45627)
Related to #44620
Related to unstable ut "internal/querycoordv2 TestServer/TestNodeUp"

Introduce SessionWatcher interface to fix race condition and goroutine
leak that caused unstable unit test TestServer/TestNodeUp.

Changes:
- Add SessionWatcher interface with EventChannel() and Stop() methods
- Refactor WatchServices() to return SessionWatcher instead of raw
channel
- Fix cleanup order in QueryCoordV2: stop watcher before session
- Update DataCoord, ConnectionManager to use SessionWatcher
- Add MockSessionWatcher for testing

Fixes race condition between session context cancellation and internal
loop exit. Eliminates goroutine leak by providing explicit lifecycle
management.
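
A minimal sketch of a watcher with an explicit lifecycle, assuming a simplified event type. Only the method names `EventChannel()` and `Stop()` come from the commit message; `SessionEvent`, `loopWatcher` and the signatures are assumptions for illustration:

```go
package main

import (
	"fmt"
	"sync"
)

// SessionEvent is a hypothetical, simplified event type.
type SessionEvent struct {
	NodeID int64
	Type   string
}

// SessionWatcher exposes events plus an explicit Stop.
type SessionWatcher interface {
	EventChannel() <-chan SessionEvent
	Stop()
}

type loopWatcher struct {
	events   chan SessionEvent
	stopCh   chan struct{}
	stopOnce sync.Once
	wg       sync.WaitGroup
}

// newLoopWatcher forwards events from source until Stop is called or
// source is closed.
func newLoopWatcher(source <-chan SessionEvent) *loopWatcher {
	w := &loopWatcher{events: make(chan SessionEvent), stopCh: make(chan struct{})}
	w.wg.Add(1)
	go func() {
		defer w.wg.Done()
		defer close(w.events)
		for {
			select {
			case <-w.stopCh:
				return
			case ev, ok := <-source:
				if !ok {
					return
				}
				select {
				case w.events <- ev:
				case <-w.stopCh:
					return
				}
			}
		}
	}()
	return w
}

func (w *loopWatcher) EventChannel() <-chan SessionEvent { return w.events }

// Stop signals the loop to exit and waits for it, so no goroutine is
// left behind. It is safe to call more than once.
func (w *loopWatcher) Stop() {
	w.stopOnce.Do(func() { close(w.stopCh) })
	w.wg.Wait()
}

func main() {
	src := make(chan SessionEvent, 1)
	src <- SessionEvent{NodeID: 1, Type: "add"}
	var w SessionWatcher = newLoopWatcher(src)
	fmt.Println(<-w.EventChannel())
	w.Stop()
}
```

Because `Stop` waits for the internal loop, a caller can stop the watcher first and then tear down the session without racing the loop's exit.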

---------

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2025-11-21 18:33:06 +08:00
aoiasd
947c8855f3
feat: support search bm25 with highlight (#44923)
relate: https://github.com/milvus-io/milvus/issues/42589

---------

Signed-off-by: aoiasd <zhicheng.yue@zilliz.com>
2025-11-18 16:09:39 +08:00
Zhen Ye
b7fb8ed38c
fix: use the right resource key lock for ddl and use new ddl in transfer replica (#45506)
issue: #45452

- alias/rename related DDL should use database level exclusive lock
- an alias cannot be used as the resource key of a lock; use the
collection name instead
- transfer replica should use WAL-based framework

Signed-off-by: chyezh <chyezh@outlook.com>
2025-11-12 19:01:38 +08:00
yihao.dai
cabc47ce01
fix: Fix channel not available error and release collection blocking (#45428)
1. Ensure replica creation is idempotent.
2. Prevent currentTarget update when replica is missing.
3. Move the wait-for-release logic into the DDL framework's callback,
and add a timeout to prevent it from blocking the DDL callback
indefinitely.

issue: https://github.com/milvus-io/milvus/issues/45301,
https://github.com/milvus-io/milvus/issues/45274,
https://github.com/milvus-io/milvus/issues/45295

---------

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2025-11-12 18:55:37 +08:00
Zhen Ye
a2ce70d252
fix: ddl framework bug patch (#45290)
issue: #45080, #45274, #45285

- LoadCollection doesn't ignore the ignorable request, for false field
array.
- CreateIndex doesn't ignore the ignorable request, for wrong index.
- index meta is not thread safe.
- missing parameter check for DDL.
- DDL Ack scheduler may get stuck, and DDL is blocked until the next
incoming DDL.

---------

Signed-off-by: chyezh <chyezh@outlook.com>
2025-11-04 22:25:33 +08:00
Spade A
c0029b788d
fix: alter collection failed with MMAP setting for STRUCT (#45173)
issue: https://github.com/milvus-io/milvus/issues/45001
ref: https://github.com/milvus-io/milvus/issues/42148

---------

Signed-off-by: SpadeA <tangchenjie1210@gmail.com>
Signed-off-by: aoiasd <zhicheng.yue@zilliz.com>
Signed-off-by: SpadeA-Tang <tangchenjie1210@gmail.com>
Co-authored-by: aoiasd <zhicheng.yue@zilliz.com>
2025-11-04 20:19:33 +08:00
Zhen Ye
966ebfbcab
fix: support upgrading from 2.6.x to 2.6.5 (#45264)
issue: #43897

Signed-off-by: chyezh <chyezh@outlook.com>
2025-11-04 18:31:32 +08:00
Zhen Ye
00d8d2c33d
enhance: support load/release collection/partition with WAL-based DDL framework (#45154)
issue: #43897

- Load/Release collection/partition is implemented by WAL-based DDL
framework now.
- Support AlterLoadConfig/DropLoadConfig in wal now.
- Load/Release operation can be synced by new CDC now.
- Refactor some UT for load/release DDL.

---------

Signed-off-by: chyezh <chyezh@outlook.com>
2025-11-02 18:39:32 +08:00
Zhen Ye
309d564796
enhance: support collection and index with WAL-based DDL framework (#45033)
issue: #43897

- Part of collection/index related DDL is implemented by WAL-based DDL
framework now.
- Support the following message types in WAL: CreateCollection,
DropCollection, CreatePartition, DropPartition, CreateIndex, AlterIndex,
DropIndex.
- Part of collection/index related DDL can be synced by new CDC now.
- Refactor some UT for collection/index DDL.
- Add Tombstone scheduler to manage the tombstone GC for collection or
partition meta.
- Move the vchannel allocation into streaming pchannel manager.

---------

Signed-off-by: chyezh <chyezh@outlook.com>
2025-10-30 14:24:08 +08:00
congqixia
569a5b40d2
enhance: [StorageV2] add manifest path support for FFI integration (#44991)
Related to #44956

Add manifest_path field throughout the data path to support LOON Storage
V2 manifest tracking. The manifest stores metadata for segment data
files and enables the unified Storage V2 FFI interface.

Changes include:
- Add manifest_path field to SegmentInfo and SaveBinlogPathsRequest
proto messages
- Add UpdateManifest operator to datacoord meta operations
- Update metacache, sync manager, and meta writer to propagate manifest
paths
- Include manifest_path in segment load info for query coordinator

This is part of the Storage V2 FFI interface integration.

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2025-10-27 19:24:10 +08:00
Spade A
6494c75d31
fix: collection level MMAP does not take effect for STRUCT (#44996)
issue: https://github.com/milvus-io/milvus/issues/42148

Signed-off-by: SpadeA <tangchenjie1210@gmail.com>
2025-10-23 19:52:05 +08:00
aoiasd
cfeb095ad7
enhance: forbid build analyzer at proxy (#44067)
relate: https://github.com/milvus-io/milvus/issues/43687
We used to run the temporary analyzer and validate analyzer on the
proxy, but the proxy should not be a computation-heavy node. This PR
moves all analyzer calculations to the streaming node.

---------

Signed-off-by: aoiasd <zhicheng.yue@zilliz.com>
2025-10-23 10:58:12 +08:00
congqixia
20dcb45b3d
fix: prevent data race in querycoord collection notifier update (#45037)
Fixes #45035

This commit addresses a data race issue where refreshCollection was
updating the collection notifier without proper lock protection.

Changes:
- Add UpdateCollection method to CollectionManager with proper locking
- Introduce CollectionOperator pattern for thread-safe collection
updates
- Make setRefreshNotifier private and use it through the operator
pattern
- Update refreshCollection to use the new UpdateCollection method
- Handle collection not found error gracefully in refreshCollection

The CollectionOperator pattern ensures all collection modifications go
through the CollectionManager's lock, preventing concurrent access
issues.

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2025-10-23 10:28:04 +08:00
Zhen Ye
21076196bf
enhance: support resource group with WAL-based DDL framework (#44874)
issue: #43897

- Resource group related DDL is implemented by WAL-based DDL framework
now.
- Support the following message types in WAL: AlterResourceGroup,
DropResourceGroup.
- Resource group DDL can be synced by new CDC now.
- Refactor some UT for resource group DDL.

---------

Signed-off-by: chyezh <chyezh@outlook.com>
2025-10-21 09:58:03 +08:00
1mmortal
e18e7d3b32
fix: Pingpong load balancing issue when segment has only 1 row (#44840) (#44841)
Use math.Ceil to calculate Priority uniformly
issue: https://github.com/milvus-io/milvus/issues/44840
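
A small illustration of why the rounding mode matters for a 1-row segment. The scaling factor and function names are hypothetical, and the balancer's actual priority formula may differ:

```go
package main

import (
	"fmt"
	"math"
)

// priorityTruncated converts a scaled row count to an int by truncation.
func priorityTruncated(rowCount int, factor float64) int {
	return int(float64(rowCount) * factor)
}

// priorityCeil rounds the scaled row count up, so any non-empty segment
// contributes at least 1.
func priorityCeil(rowCount int, factor float64) int {
	return int(math.Ceil(float64(rowCount) * factor))
}

func main() {
	fmt.Println(priorityTruncated(1, 0.1), priorityCeil(1, 0.1))
}
```

If one code path truncates and another rounds up, the two can disagree about the same 1-row segment, so applying the same rounding everywhere removes that source of disagreement.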

Signed-off-by: 1mmortal <lmzzzzz1@163.com>
2025-10-16 11:18:00 +08:00
wei liu
38833b0e1d
fix: Fix deactivate balance checker also stops stopping balance (#44834)
issue: #43858
Fix the issue introduced in PR #43992 where deactivating the balance
checker incorrectly stops stopping balance operations.

Changes:
- Move IsActive() check after stopping balance logic
- Only skip normal balance when checker is inactive
- Allow stopping balance to proceed regardless of checker state

This ensures stopping balance can execute even when the balance checker
is deactivated.
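
The reordered control flow can be sketched as follows, assuming tasks are plain strings and that pending stopping-balance tasks are returned before normal balance is considered. `balanceChecker` and `check` are hypothetical names:

```go
package main

import "fmt"

// balanceChecker is a hypothetical, simplified checker.
type balanceChecker struct {
	active bool
}

// check returns the tasks to run in this round. Stopping balance is
// evaluated first and does not depend on the active flag; only normal
// balance is skipped when the checker is inactive.
func (c *balanceChecker) check(stoppingTasks, normalTasks []string) []string {
	if len(stoppingTasks) > 0 {
		return stoppingTasks
	}
	if !c.active {
		return nil
	}
	return normalTasks
}

func main() {
	c := &balanceChecker{active: false}
	fmt.Println(c.check([]string{"move-off-stopping-node"}, []string{"normal-balance"}))
	fmt.Println(c.check(nil, []string{"normal-balance"}))
}
```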

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-10-15 15:50:04 +08:00
Zhen Ye
53e8f150e8
fix: check if qn is sqn with label and streamingnode list (#44792)
issue: #44014

- On standalone, the internal query node needs to load segments and
watch channels, so the querynode is not an embedded querynode in
streamingnode without `LabelStreamingNodeEmbeddedQueryNode`. The channel
dist manager cannot confirm that a standalone node is an embedded
streamingnode.

Bug is introduced by #44099

Signed-off-by: chyezh <chyezh@outlook.com>
2025-10-13 16:33:59 +08:00
wei liu
33d1e7de83
fix: Replace incorrect log import with milvus v2 log package (#44731)
issue: #44730
Fix the issue where logs were not outputting as expected due to
incorrect log package imports across multiple components.

Changes include:
- Add golangci-lint rule to forbid github.com/pingcap/log usage
- Replace github.com/pingcap/log with
github.com/milvus-io/milvus/pkg/v2/log

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-10-10 20:27:57 +08:00
zhenshan.cao
4279f166c6
enhance: Add refine logs for task scheduler in QueryCoord (#44577)
issue: https://github.com/milvus-io/milvus/issues/43968

Signed-off-by: zhenshan.cao <zhenshan.cao@zilliz.com>
2025-10-10 10:07:55 +08:00
Zhen Ye
19e5e9f910
enhance: broadcaster will lock resource until message acked (#44508)
issue: #43897

- Return LastConfirmedMessageID from the WAL append operation.
- Add a resource-key-based locker for the broadcast-ack operation to
protect the coord state when executing DDL.
- The resource-key-based locker is held until the broadcast operation is
acked.
- ResourceKey supports shared and exclusive locks.
- Add FastAck to execute the ack right away after the broadcast is done,
speeding up DDL.
- Ack callback will support broadcast message result now.
- Add tombstone for broadcaster to avoid repeatedly committing DDL and
the ABA issue.
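
A minimal sketch of a resource-key-based locker with shared and exclusive modes, assuming one `sync.RWMutex` per key name. `ResourceKey`, `keyLocker` and the returned unlock function are hypothetical and do not reproduce the broadcaster's implementation:

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// ResourceKey names a resource and the lock mode wanted on it.
type ResourceKey struct {
	Name   string
	Shared bool
}

type keyLocker struct {
	mu    sync.Mutex
	locks map[string]*sync.RWMutex
}

func newKeyLocker() *keyLocker {
	return &keyLocker{locks: make(map[string]*sync.RWMutex)}
}

// get returns the RWMutex for a key name, creating it on first use.
func (l *keyLocker) get(name string) *sync.RWMutex {
	l.mu.Lock()
	defer l.mu.Unlock()
	m, ok := l.locks[name]
	if !ok {
		m = &sync.RWMutex{}
		l.locks[name] = m
	}
	return m
}

// Lock acquires all keys in name order (to avoid lock-order deadlocks)
// and returns a function that releases them, e.g. once the ack arrives.
// Key names within one call are assumed to be distinct.
func (l *keyLocker) Lock(keys ...ResourceKey) (unlock func()) {
	sorted := append([]ResourceKey(nil), keys...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].Name < sorted[j].Name })
	for _, k := range sorted {
		if k.Shared {
			l.get(k.Name).RLock()
		} else {
			l.get(k.Name).Lock()
		}
	}
	return func() {
		for i := len(sorted) - 1; i >= 0; i-- {
			if sorted[i].Shared {
				l.get(sorted[i].Name).RUnlock()
			} else {
				l.get(sorted[i].Name).Unlock()
			}
		}
	}
}

func main() {
	l := newKeyLocker()
	unlock := l.Lock(ResourceKey{Name: "db", Shared: true}, ResourceKey{Name: "coll"})
	fmt.Println("holding locks until ack")
	unlock()
}
```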

---------

Signed-off-by: chyezh <chyezh@outlook.com>
2025-09-24 20:58:05 +08:00
XuanYang-cn
24037a396a
fix: LoadSegment failed for dup field mmap.enabled props (#44465)
When mmap.enabled is set in both collection properties and field
properties, load segment fails.
See also: #44443

Signed-off-by: yangxuan <xuan.yang@zilliz.com>
2025-09-22 14:40:06 +08:00
wei liu
6d4961b978
enhance: Refactor balance checker with priority queue (#43992)
issue: #43858
Refactor the balance checker implementation to use priority queues for
managing collection balance operations, improving processing efficiency
and order control.

Changes include:
- Export priority queue interfaces (Item, BaseItem, PriorityQueue)
- Replace collection round-robin with priority-based queue system
- Add BalanceCheckCollectionMaxCount configuration parameter
- Optimize balance task generation with batch processing limits
- Refactor processBalanceQueue method for different strategies
- Enhance test coverage with comprehensive unit tests

The new priority queue system processes collections based on row count
or collection ID order, providing better control over balance operation
priorities and resource utilization.
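
A minimal sketch of a priority queue over collections using `container/heap`, assuming larger row counts are processed first with collection ID as a tie-breaker. `collectionItem`, `collectionQueue` and `nextBatch` are hypothetical, and the `maxCount` argument only stands in for a limit such as BalanceCheckCollectionMaxCount:

```go
package main

import (
	"container/heap"
	"fmt"
)

type collectionItem struct {
	ID       int64
	RowCount int64
}

// collectionQueue implements heap.Interface.
type collectionQueue []collectionItem

func (q collectionQueue) Len() int { return len(q) }

// Less puts larger row counts first and breaks ties by smaller ID.
func (q collectionQueue) Less(i, j int) bool {
	if q[i].RowCount != q[j].RowCount {
		return q[i].RowCount > q[j].RowCount
	}
	return q[i].ID < q[j].ID
}

func (q collectionQueue) Swap(i, j int) { q[i], q[j] = q[j], q[i] }

func (q *collectionQueue) Push(x interface{}) { *q = append(*q, x.(collectionItem)) }

func (q *collectionQueue) Pop() interface{} {
	old := *q
	n := len(old)
	item := old[n-1]
	*q = old[:n-1]
	return item
}

// nextBatch returns the IDs of at most maxCount collections in priority
// order without modifying the input slice.
func nextBatch(items []collectionItem, maxCount int) []int64 {
	q := make(collectionQueue, len(items))
	copy(q, items)
	heap.Init(&q)
	var ids []int64
	for q.Len() > 0 && len(ids) < maxCount {
		ids = append(ids, heap.Pop(&q).(collectionItem).ID)
	}
	return ids
}

func main() {
	items := []collectionItem{{1, 10}, {2, 50}, {3, 50}, {4, 5}}
	fmt.Println(nextBatch(items, 3))
}
```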

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-09-19 17:46:01 +08:00
zhenshan.cao
691a8df953
feat: Add RESTful api for rolling upgrade support (#44381)
issue: https://github.com/milvus-io/milvus/issues/43968

Co-authored-by: chyezh <ye.zhen@zilliz.com>
2025-09-16 20:08:00 +08:00
Bingyi Sun
0c0630cc38
feat: support dropping index without releasing collection (#42941)
issue: #42942

This pr includes the following changes:
1. Added checks for index checker in querycoord to generate drop index
tasks
2. Added drop index interface to querynode
3. To avoid search failure after dropping the index, the querynode
allows the use of lazy mode (warmup=disable) to load raw data even when
indexes contain raw data.
4. In segcore, loading the index no longer deletes raw data; instead, it
evicts it.
5. In expr, the index is pinned to prevent concurrent errors.

---------

Signed-off-by: sunby <sunbingyi1992@gmail.com>
2025-09-02 16:17:52 +08:00
Zhen Ye
9e2d1963d4
enhance: support cchannel for streaming service (#44143)
issue: #43897

- add cchannel as a special vchannel to hold some ddl and dcl.

Signed-off-by: chyezh <chyezh@outlook.com>
2025-09-02 10:05:52 +08:00
zhagnlu
fc876639cf
enhance: support json stats with shredding design (#42534)
#42533

Co-authored-by: luzhang <luzhang@zilliz.com>
2025-09-01 10:49:52 +08:00
Zhen Ye
23085ae437
fix: use query node label check if streamingnode (#44099)
issue: #44014

- The sessions of querynode and streamingnode are different.
- So when the streamingnode session goes down first, a streaming query
node will be treated as a querynode.
- Use the label rather than the streaming node session to fix it.

Signed-off-by: chyezh <chyezh@outlook.com>
2025-08-29 10:45:59 +08:00
Chun Han
da156981c6
feat: milvus support posix-compatible mode(milvus-io#43942) (#43944)
related: #43942

Signed-off-by: MrPresent-Han <chun.han@gmail.com>
Co-authored-by: MrPresent-Han <chun.han@gmail.com>
2025-08-27 16:29:50 +08:00
XuanYang-cn
37a447d166
feat: Add CMEK cipher plugin (#43722)
1. Enable Milvus to read cipher configs
2. Enable cipher plugin in binlog reader and writer
3. Add a testCipher for unittests
4. Support pooling for datanode
5. Add encryption in storagev2

See also: #40321 
Signed-off-by: yangxuan <xuan.yang@zilliz.com>

---------

Signed-off-by: yangxuan <xuan.yang@zilliz.com>
2025-08-27 11:15:52 +08:00
Zhen Ye
575345ae7b
fix: get streamingnodes from service discovery without channel assign (#44033)
issue: #43767

Signed-off-by: chyezh <chyezh@outlook.com>
2025-08-26 14:29:51 +08:00
Zhen Ye
cbb9392564
fix: filter the streaming node from resource group (#43984)
issue: #43981

Signed-off-by: chyezh <chyezh@outlook.com>
2025-08-22 19:21:47 +08:00
wei liu
399f63300c
enhance: Implement dynamic interval updates for ticker components (#43865)
issue: #43858

Enable dynamic configuration updates for ticker intervals without
restart. This enhancement allows runtime configuration changes to take
effect immediately for better operational flexibility.

Changes include:
- Apply "drain+Reset only when interval changed" pattern across all
ticker components to preserve existing timing phases
- Fix goroutine variable capture issue in CheckerController.Start()
- Remove unnecessary ticker.Stop() in manual trigger paths
- Add dynamic interval checking in QueryCoordV2 components:
  * checkers/controller.go: Various checker intervals
  * dist/dist_handler.go: DistPullInterval, CheckExecutedFlagInterval
  * session/cluster.go: CheckNodeSessionInterval
  * server.go: CheckAutoBalanceConfigInterval
  * observers/target_observer.go: UpdateNextTargetInterval
  * observers/collection_observer.go: CollectionObserverInterval
- Add dynamic interval checking in QueryNodeV2 components:
  * segments/disk_usage_fetcher.go: DiskSizeFetchInterval
- Ensure thread safety by performing all ticker operations in same
goroutine with proper drain before Reset to avoid spurious triggers

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-08-21 10:07:47 +08:00
wei liu
384c493d0e
fix: Fix node status inconsistency after QueryCoord restart (#43941)
issue: #43933

Fix the issue where QueryCoord restart leads to node status
inconsistency in resource manager, causing segment loading failures and
incorrect resource group assignments.

Changes include:
- Add CheckNodesInResourceGroup method to sync node status after restart
- Implement proper cleanup of offline/stopping nodes from resource
groups
- Add automatic discovery and assignment of new nodes to resource groups
- Enhance rewatchNodes process to include resource manager
synchronization

This ensures resource manager maintains correct node status and
assignments even after QueryCoord restarts, preventing segment loading
failures and improving system reliability.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-08-20 14:13:46 +08:00
wei liu
dada00a81c
fix: dirty querynode doesn't clean up after restart (#43909)
issue: #43905

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-08-18 18:05:46 +08:00
wei liu
3e9e830074
enhance: Implement rewatch mechanism for etcd failure scenarios (#43829)
issue: #43828
Implement robust rewatch mechanism to handle etcd connection failures
and node reconnection scenarios in DataCoord and QueryCoord, along with
heartbeat lag monitoring capabilities.

Changes include:
- Implement rewatchDataNodes/rewatchQueryNodes callbacks for etcd
reconnection scenarios
- Add idempotent rewatchNodes method to handle etcd session recovery
gracefully
- Add QueryCoordLastHeartbeatTimeStamp metric for monitoring node
heartbeat lag
- Clean up heartbeat metrics when nodes go down to prevent metric leaks

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-08-14 10:31:44 +08:00
wei liu
ecc2ac0426
fix: apply load config changes failed after restart (#43554)
issue: #43107

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-08-01 20:13:37 +08:00
Spade A
faeb7fd410
feat: impl StructArray -- create schema, insert, and retrieve data (#42855)
Ref https://github.com/milvus-io/milvus/issues/42148

https://github.com/milvus-io/milvus/pull/42406 implements the segcore
part of storage for handling VectorArray.
This PR:
1. implements the Go part of storage for VectorArray
2. implements collection creation with StructArrayField and VectorArray
3. supports inserting into and retrieving data from the collection.

---------

Signed-off-by: SpadeA <tangchenjie1210@gmail.com>
Signed-off-by: SpadeA-Tang <tangchenjie1210@gmail.com>
Signed-off-by: SpadeA-Tang <u6748471@anu.edu.au>
2025-07-27 01:30:55 +08:00
Zhen Ye
e9ab73e93d
enhance: add schema version at recovery storage (#43500)
issue: #43072, #43289

- manage the schema version at recovery storage.
- update the schema when creating collection or alter schema.
- get schema at write buffer based on version.
- recover the schema when upgrading from 2.5.

---------

Signed-off-by: chyezh <chyezh@outlook.com>
2025-07-23 21:38:54 +08:00
Zhen Ye
df7e507c49
fix: balance may not trigger at balance checker when upgrading (#43462)
issue: #43416

Signed-off-by: chyezh <chyezh@outlook.com>
2025-07-22 16:02:53 +08:00