708 Commits

Author SHA1 Message Date
Bingyi Sun
0c0630cc38
feat: support dropping index without releasing collection (#42941)
issue: #42942

This PR includes the following changes:
1. Added checks for index checker in querycoord to generate drop index
tasks
2. Added drop index interface to querynode
3. To avoid search failure after dropping the index, the querynode
allows the use of lazy mode (warmup=disable) to load raw data even when
indexes contain raw data.
4. In segcore, loading the index no longer deletes raw data; instead, it
evicts it.
5. In expr, the index is pinned to prevent concurrent errors.

---------

Signed-off-by: sunby <sunbingyi1992@gmail.com>
2025-09-02 16:17:52 +08:00
Zhen Ye
9e2d1963d4
enhance: support cchannel for streaming service (#44143)
issue: #43897

- add cchannel as a special vchannel to hold some ddl and dcl.

Signed-off-by: chyezh <chyezh@outlook.com>
2025-09-02 10:05:52 +08:00
zhagnlu
fc876639cf
enhance: support json stats with shredding design (#42534)
#42533

Co-authored-by: luzhang <luzhang@zilliz.com>
2025-09-01 10:49:52 +08:00
Zhen Ye
23085ae437
fix: use query node label check if streamingnode (#44099)
issue: #44014

- The querynode and streamingnode sessions are different.
- So when the streamingnode session goes down first, a streaming query node
is treated as a querynode.
- Use the label instead of the streamingnode session to fix it.

Signed-off-by: chyezh <chyezh@outlook.com>
2025-08-29 10:45:59 +08:00
Chun Han
da156981c6
feat: milvus support posix-compatible mode(milvus-io#43942) (#43944)
related: #43942

Signed-off-by: MrPresent-Han <chun.han@gmail.com>
Co-authored-by: MrPresent-Han <chun.han@gmail.com>
2025-08-27 16:29:50 +08:00
XuanYang-cn
37a447d166
feat: Add CMEK cipher plugin (#43722)
1. Enable Milvus to read cipher configs
2. Enable cipher plugin in binlog reader and writer
3. Add a testCipher for unittests
4. Support pooling for datanode
5. Add encryption in storagev2

See also: #40321 

---------

Signed-off-by: yangxuan <xuan.yang@zilliz.com>
2025-08-27 11:15:52 +08:00
Zhen Ye
575345ae7b
fix: get streamingnodes from service discovery without channel assign (#44033)
issue: #43767

Signed-off-by: chyezh <chyezh@outlook.com>
2025-08-26 14:29:51 +08:00
Zhen Ye
cbb9392564
fix: filter the streaming node from resource group (#43984)
issue: #43981

Signed-off-by: chyezh <chyezh@outlook.com>
2025-08-22 19:21:47 +08:00
wei liu
399f63300c
enhance: Implement dynamic interval updates for ticker components (#43865)
issue: #43858

Enable dynamic configuration updates for ticker intervals without
restart. This enhancement allows runtime configuration changes to take
effect immediately for better operational flexibility.

Changes include:
- Apply "drain+Reset only when interval changed" pattern across all
ticker components to preserve existing timing phases
- Fix goroutine variable capture issue in CheckerController.Start()
- Remove unnecessary ticker.Stop() in manual trigger paths
- Add dynamic interval checking in QueryCoordV2 components:
  * checkers/controller.go: Various checker intervals
  * dist/dist_handler.go: DistPullInterval, CheckExecutedFlagInterval
  * session/cluster.go: CheckNodeSessionInterval
  * server.go: CheckAutoBalanceConfigInterval
  * observers/target_observer.go: UpdateNextTargetInterval
  * observers/collection_observer.go: CollectionObserverInterval
- Add dynamic interval checking in QueryNodeV2 components:
  * segments/disk_usage_fetcher.go: DiskSizeFetchInterval
- Ensure thread safety by performing all ticker operations in same
goroutine with proper drain before Reset to avoid spurious triggers
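
The "drain+Reset only when interval changed" pattern described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the helper names `resetIfChanged` and `demo` are hypothetical, and only the standard `time.Ticker` API is used. The helper is meant to be called from the same goroutine that reads `ticker.C`.

```go
package main

import (
	"fmt"
	"time"
)

// resetIfChanged (hypothetical helper) resets the ticker only when the
// configured interval actually changed, so an unchanged interval keeps its
// current timing phase. A pending tick from the old interval is drained
// first to avoid a spurious trigger right after Reset.
func resetIfChanged(ticker *time.Ticker, current *time.Duration, next time.Duration) bool {
	if next <= 0 || next == *current {
		return false // invalid or unchanged: leave the ticker alone
	}
	select {
	case <-ticker.C: // drain a tick that already fired under the old interval
	default:
	}
	ticker.Reset(next)
	*current = next
	return true
}

// demo applies one unchanged and one changed interval and reports both results.
func demo() (bool, bool) {
	interval := 50 * time.Millisecond
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	unchanged := resetIfChanged(ticker, &interval, 50*time.Millisecond)
	changed := resetIfChanged(ticker, &interval, 10*time.Millisecond)
	<-ticker.C // the next tick arrives on the new 10ms interval
	return unchanged, changed
}

func main() {
	fmt.Println(demo()) // false true
}
```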

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-08-21 10:07:47 +08:00
wei liu
384c493d0e
fix: Fix node status inconsistency after QueryCoord restart (#43941)
issue: #43933

Fix the issue where QueryCoord restart leads to node status
inconsistency in resource manager, causing segment loading failures and
incorrect resource group assignments.

Changes include:
- Add CheckNodesInResourceGroup method to sync node status after restart
- Implement proper cleanup of offline/stopping nodes from resource
groups
- Add automatic discovery and assignment of new nodes to resource groups
- Enhance rewatchNodes process to include resource manager
synchronization

This ensures resource manager maintains correct node status and
assignments even after QueryCoord restarts, preventing segment loading
failures and improving system reliability.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-08-20 14:13:46 +08:00
wei liu
dada00a81c
fix: dirty querynode doesn't clean up after restart (#43909)
issue: #43905

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-08-18 18:05:46 +08:00
wei liu
3e9e830074
enhance: Implement rewatch mechanism for etcd failure scenarios (#43829)
issue: #43828
Implement robust rewatch mechanism to handle etcd connection failures
and node reconnection scenarios in DataCoord and QueryCoord, along with
heartbeat lag monitoring capabilities.

Changes include:
- Implement rewatchDataNodes/rewatchQueryNodes callbacks for etcd
reconnection scenarios
- Add idempotent rewatchNodes method to handle etcd session recovery
gracefully
- Add QueryCoordLastHeartbeatTimeStamp metric for monitoring node
heartbeat lag
- Clean up heartbeat metrics when nodes go down to prevent metric leaks
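
One way to read the "idempotent rewatchNodes" item is as a reconciliation between the known node set and the set rediscovered after the etcd session recovers. The sketch below is an assumption about the general shape only; `nodeRegistry` and its `rewatch` method are hypothetical names, not the Milvus API.

```go
package main

import (
	"fmt"
	"sort"
)

// nodeRegistry is a hypothetical stand-in for a coordinator's node manager.
type nodeRegistry struct {
	online map[int64]struct{}
}

// rewatch reconciles the registry with the freshly discovered node IDs.
// Calling it again with the same input changes nothing, which is what makes
// it safe to run on every reconnection.
func (r *nodeRegistry) rewatch(discovered []int64) (added, removed []int64) {
	seen := make(map[int64]struct{}, len(discovered))
	for _, id := range discovered {
		seen[id] = struct{}{}
		if _, ok := r.online[id]; !ok {
			r.online[id] = struct{}{}
			added = append(added, id)
		}
	}
	for id := range r.online {
		if _, ok := seen[id]; !ok {
			delete(r.online, id) // node vanished while the watch was broken
			removed = append(removed, id)
		}
	}
	sort.Slice(added, func(i, j int) bool { return added[i] < added[j] })
	sort.Slice(removed, func(i, j int) bool { return removed[i] < removed[j] })
	return added, removed
}

func main() {
	r := &nodeRegistry{online: map[int64]struct{}{1: {}, 2: {}}}
	fmt.Println(r.rewatch([]int64{2, 3})) // [3] [1]
	fmt.Println(r.rewatch([]int64{2, 3})) // [] []
}
```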

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-08-14 10:31:44 +08:00
wei liu
ecc2ac0426
fix: apply load config changes failed after restart (#43554)
issue: #43107

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-08-01 20:13:37 +08:00
Spade A
faeb7fd410
feat: impl StructArray -- create schema, insert, and retrieve data (#42855)
Ref https://github.com/milvus-io/milvus/issues/42148

https://github.com/milvus-io/milvus/pull/42406 implements the segcore part of
storage for handling VectorArray.
This PR:
1. implements the Go part of storage for VectorArray
2. implements collection creation with StructArrayField and VectorArray
3. supports inserting and retrieving data from the collection.

---------

Signed-off-by: SpadeA <tangchenjie1210@gmail.com>
Signed-off-by: SpadeA-Tang <tangchenjie1210@gmail.com>
Signed-off-by: SpadeA-Tang <u6748471@anu.edu.au>
2025-07-27 01:30:55 +08:00
Zhen Ye
e9ab73e93d
enhance: add schema version at recovery storage (#43500)
issue: #43072, #43289

- manage the schema version at recovery storage.
- update the schema when creating collection or alter schema.
- get schema at write buffer based on version.
- recover the schema when upgrading from 2.5.

---------

Signed-off-by: chyezh <chyezh@outlook.com>
2025-07-23 21:38:54 +08:00
Zhen Ye
df7e507c49
fix: balance may not trigger at balance checker when upgrading (#43462)
issue: #43416

Signed-off-by: chyezh <chyezh@outlook.com>
2025-07-22 16:02:53 +08:00
Zhen Ye
25b76e1fde
fix: cannot auto balance the channel from old arch to streamingnode (#43424)
issue: #43416, #43413

- also fix the panic on streamingnode when concurrent sync

Signed-off-by: chyezh <chyezh@outlook.com>
2025-07-20 23:00:52 +08:00
Zhen Ye
3aacd179f7
fix: balance channel before balance segment when upgrading (#43346)
issue: #43117, #42966, #43373

- also fix channel balance that may not work on 2.6.
- fix error lost at delete path
- add mvcc into s/q log
- change the log level for TestCoordDownSearch

Signed-off-by: chyezh <chyezh@outlook.com>
2025-07-17 20:16:52 +08:00
wei liu
039564199c
fix: Prevent duplicate segment results in count queries (#43173)
issue: #41570
Fix issue where growing and sealed segments could be searched
simultaneously, causing inflated count(*) results. This was caused by
logic introduced in PR #42009 that made sealed segments readable before
target version advancement.

Changes include:
- Fix conditional filtering logic in PinReadableSegments to prevent
sealed segments from becoming readable prematurely
- Use target version filter for full results (ratio=1.0) to ensure
sealed segments only become readable after target advancement
- Use query view segment list filter for partial results (ratio<1.0) to
maintain backward compatibility
- Simplify target version setting in AddDistributions to prevent
premature segment readability
- Add logging for redundant growing segments during sync
- Add comprehensive unit tests covering the duplicate segment scenario

This fix ensures count(*) queries return accurate results by preventing
the same segment from being counted in both growing and sealed states.

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-07-14 11:10:49 +08:00
wei liu
b2597c6329
enhance: apply load config changes after QueryCoord restart (#43108)
issue: #43107 
- Add checkLoadConfigChanges() to apply load config during startup
- Call config check in startQueryCoord() after restart
- Skip auto-updates for collections with user-specified replica numbers
- Add is_user_specified_replica_mode field to preserve user settings
- Add comprehensive unit tests with mockey

Ensures existing collections use latest cluster-level config after
restart.

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-07-10 14:28:48 +08:00
congqixia
1fae5230fe
fix: Check field mmap property before apply collection level one (#43090)
Related to #43089

---------

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2025-07-03 14:30:44 +08:00
congqixia
7bc7b18ed5
fix: [AddField] Prevent concurrent load during UpdateSchema (#43043)
Related to #43028

This PR:
- Add a mutex to prevent concurrent segment load and schema change
- Add a schema version field in load meta
- Update schema in PutOrRef if the schema version is larger

---------

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2025-07-02 17:38:44 +08:00
Zhen Ye
ecb24e7232
enhance: use multi-process framework in integration test (#42976)
issue: #41609

- add env `MILVUS_NODE_ID_FOR_TESTING` to set up a node id for milvus
process.
- add env `MILVUS_CONFIG_REFRESH_INTERVAL` to set up the refresh
interval of paramtable.
- Init paramtable when calling `paramtable.Get()`.
- add new multi process framework for integration test.
- change all integration test into multi process.
- merge some test cases into one suite to speed them up.
- modify some test, which need to wait for issue #42966, #42685.
- remove the waittssync for delete collection to fix issue: #42989

---------

Signed-off-by: chyezh <chyezh@outlook.com>
2025-06-30 14:22:43 +08:00
wei liu
c919340763
enhance: Optimize channel node balancing for uneven QN distribution (#42786)
issue: #42860
Fix channel node allocation when QueryNode count is not a multiple of
channel count. The previous algorithm used simple division which caused
uneven distribution with remainders.

Key improvements:
- Implement smart remainder distribution algorithm
- Refactor large function into focused helper functions
- Support two-phase rebalancing (release then allocate)
- Handle edge cases like insufficient nodes gracefully
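
The commit message does not spell out the remainder distribution algorithm. A common way to handle a node count that is not a multiple of the channel count is to give every channel the integer quotient and hand the remainder out one node at a time. The sketch below shows that idea under this assumption; `spreadNodes` is a hypothetical name.

```go
package main

import "fmt"

// spreadNodes (hypothetical) returns how many nodes each channel receives.
// Every channel gets nodeCount/channelCount nodes, and the first
// nodeCount%channelCount channels get one extra, so counts differ by at most 1.
func spreadNodes(nodeCount, channelCount int) []int {
	if channelCount <= 0 {
		return nil
	}
	if nodeCount < 0 {
		nodeCount = 0
	}
	base, rem := nodeCount/channelCount, nodeCount%channelCount
	counts := make([]int, channelCount)
	for i := range counts {
		counts[i] = base
		if i < rem {
			counts[i]++
		}
	}
	return counts
}

func main() {
	fmt.Println(spreadNodes(7, 3)) // [3 2 2]
	fmt.Println(spreadNodes(2, 3)) // [1 1 0], fewer nodes than channels
}
```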

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-06-30 12:14:42 +08:00
wei liu
be492c2939
fix: Add missing keylocks in ReleasePartition operation (#42940)
issue: #42098
Fix concurrent access issue by adding proper locking around
ReleasePartition operation to prevent race conditions when releasing
partitions on the same collection.
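
A per-key lock of the kind this fix refers to can be sketched as below. This is a simplified illustration with hypothetical names (`keyLock`, `releaseConcurrently`); Milvus has its own key lock utility, whose API is not shown here.

```go
package main

import (
	"fmt"
	"sync"
)

// keyLock (hypothetical) hands out one mutex per key, so operations on the
// same collection are serialized while different collections run in parallel.
type keyLock struct {
	mu    sync.Mutex
	locks map[int64]*sync.Mutex
}

func newKeyLock() *keyLock {
	return &keyLock{locks: make(map[int64]*sync.Mutex)}
}

// lock acquires the mutex for key and returns the matching unlock function.
func (k *keyLock) lock(key int64) func() {
	k.mu.Lock()
	l, ok := k.locks[key]
	if !ok {
		l = &sync.Mutex{}
		k.locks[key] = l
	}
	k.mu.Unlock()
	l.Lock()
	return l.Unlock
}

// releaseConcurrently simulates n concurrent releases on one collection and
// returns how many were applied; the key lock keeps the counter race-free.
func releaseConcurrently(n int) int {
	locks := newKeyLock()
	released := 0
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			unlock := locks.lock(42) // 42 is an arbitrary collection ID
			defer unlock()
			released++
		}()
	}
	wg.Wait()
	return released
}

func main() {
	fmt.Println(releaseConcurrently(100)) // 100
}
```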

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-06-25 21:48:42 +08:00
wei liu
bf5fde1431
fix: Prevent delegator unserviceable due to shard leader change (#42689)
issue: #42098 #42404
Fix critical issue where concurrent balance segment and balance channel
operations cause delegator view inconsistency. When shard leader
switches between load and release phases of segment balance, it results
in loading segments on old delegator but releasing on new delegator,
making the new delegator unserviceable.

The root cause is that balance segment modifies delegator views, and if
these modifications happen on different delegators due to leader change,
it corrupts the delegator state and affects query availability.

Changes include:
- Add shardLeaderID field to SegmentTask to track delegator for load
- Record shard leader ID during segment loading in move operations
- Skip release if shard leader changed from the one used for loading
- Add comprehensive unit tests for leader change scenarios

This ensures balance segment operations are atomic on single delegator,
preventing view corruption and maintaining delegator serviceability.
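
The "skip release if shard leader changed" rule can be illustrated with a small sketch. The field name `shardLeaderID` comes from the commit message; the type `segmentTask` and the methods `recordLoad` and `shouldRelease` are hypothetical.

```go
package main

import "fmt"

// segmentTask is a hypothetical, reduced form of a balance task.
type segmentTask struct {
	segmentID     int64
	shardLeaderID int64 // delegator node that served the load phase
}

// recordLoad remembers which delegator the segment was loaded through.
func (t *segmentTask) recordLoad(leaderID int64) {
	t.shardLeaderID = leaderID
}

// shouldRelease reports whether the release phase may run: only when the
// current shard leader is still the one recorded during the load phase.
func (t *segmentTask) shouldRelease(currentLeaderID int64) bool {
	return t.shardLeaderID == currentLeaderID
}

func main() {
	task := &segmentTask{segmentID: 1001}
	task.recordLoad(7)
	fmt.Println(task.shouldRelease(7)) // true
	fmt.Println(task.shouldRelease(9)) // false: leader changed, skip release
}
```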

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-06-19 12:10:38 +08:00
Bingyi Sun
6bebb68727
fix: Return all targets segments in ListLoadedSegments (#42728)
issue: https://github.com/milvus-io/milvus/issues/42412

Signed-off-by: sunby <sunbingyi1992@gmail.com>
2025-06-18 11:20:38 +08:00
Chun Han
001619aef9
feat: supporting load priority for loading (#42413)
related: #40781

Signed-off-by: MrPresent-Han <chun.han@gmail.com>
Co-authored-by: MrPresent-Han <chun.han@gmail.com>
2025-06-17 15:22:38 +08:00
congqixia
9653ec8d8c
fix: [AddField] Remove load list check on querycoord (#42736)
Related to #42735

The load field list should work only as a hint after the tiered storage
implementation, so comparing load lists is meaningless and blocks loading
with an empty list after a new field is added.

This PR removes the check logic entirely.

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2025-06-17 09:50:37 +08:00
Bingyi Sun
1bf960b1a8
enhance: Check loaded segments before gc (#42639)
issue: https://github.com/milvus-io/milvus/issues/42412

---------

Signed-off-by: sunby <sunbingyi1992@gmail.com>
2025-06-13 17:44:38 +08:00
congqixia
d59002d45e
fix: Make controller wait checker worker quit and add nil protection (#42704)
Related to #42702

This patch adds wait logic for `CheckerController` and a nil check for the
channel checker to avoid panicking during the server/testcase stop
procedure.

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2025-06-13 13:20:35 +08:00
wei liu
e7c0a6ffbb
enhance: Refine QueryNode task parallelism based on CPU core count (#42166)
issue: #42165
Implement dynamic task execution capacity calculation based on QueryNode
CPU core count instead of static configuration for better resource
utilization.

Changes include:
- Add CpuCoreNum() method and WithCpuCoreNum() option to NodeInfo
- Implement GetTaskExecutionCap() for dynamic capacity calculation
- Add QueryNodeTaskParallelismFactor parameter for tuning
- Update proto definition to include cpu_core_num field
- Add unit tests for new functionality

This allows QueryCoord to automatically adjust task parallelism based on
actual hardware resources.
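
The exact formula behind GetTaskExecutionCap() is not given in the commit message. The sketch below assumes the simplest reading, capacity = CPU cores × parallelism factor, with a static fallback when the core count is unknown. The function name `taskExecutionCap` and the fallback parameter are hypothetical.

```go
package main

import "fmt"

// taskExecutionCap (hypothetical) derives a per-node task capacity from the
// reported CPU core count. It assumes capacity scales linearly with cores and
// falls back to a static value when the core count or factor is not usable.
func taskExecutionCap(cpuCores, factor, fallback int) int {
	if cpuCores <= 0 || factor <= 0 {
		return fallback
	}
	return cpuCores * factor
}

func main() {
	fmt.Println(taskExecutionCap(8, 2, 256)) // 16
	fmt.Println(taskExecutionCap(0, 2, 256)) // 256: core count not reported
}
```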

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-06-11 13:20:35 +08:00
wei liu
317e7999da
fix: ReleasePartition cause delegator unserviceable. (#42423)
issue: #42098 #42404
related to: #42009 #41937

Implement new method to handle partition removal from next target
without directly modifying current target.

Changes include:
- Add RemovePartitionFromNextTarget method and deprecate RemovePartition
- Update target_observer to use new method for ReleasePartition
operations
- Add unit tests and mock methods for new functionality

This ensures that all changes to the next target propagate to the
delegator's query view.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-06-09 19:02:34 +08:00
cai.zhang
5566a85bcc
enhance: Add proxy task queue metrics (#42156)
issue: #42155

Signed-off-by: Cai Zhang <cai.zhang@zilliz.com>
2025-06-04 11:26:32 +08:00
Zhen Ye
508264f953
fix: querynode upgrade from 2.5 gets stuck (#42502)
issue: #42492

- consider the old RO query node (not streaming node) when balancing
channel.
- querynode graceful stop can complete if only L0 segments remain.

Signed-off-by: chyezh <chyezh@outlook.com>
2025-06-04 11:20:30 +08:00
wei liu
aa66072a1c
enhance: Remove inadvertently introduced goccy/go-json dependency (#42146)
Remove the 'goccy/go-json' library, which was inadvertently introduced,
and revert to using the standard internal JSON handling.

Changes include:
- Removed dependency on 'github.com/goccy/go-json' in go.mod and go.sum.
- Replaced import of 'goccy/go-json' with 'internal/json' in
'internal/querycoordv2/task/scheduler.go'.

This correction ensures the project continues to use the intended JSON
processing libraries and avoids unnecessary external dependencies.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-06-03 17:38:32 +08:00
wei liu
2669d14ba0
refactor: Remove balance constraints between channel and segment tasks (#42177)
issue: #42176

Remove the mutual exclusion constraints between channel and segment
balance tasks to allow them to run concurrently.

Changes include:
- Remove permitBalanceChannel() and permitBalanceSegment() methods from
RoundRobinBalancer
- Update ChannelLevelScoreBalancer, MultiTargetBalancer,
RowCountBasedBalancer, and ScoreBasedBalancer to remove constraint
checks
- Allow segment balance tasks to proceed even when channel balance tasks
are running
- Update test cases to reflect new behavior where balance tasks no
longer block each other

This change improves the efficiency of load balancing by removing
unnecessary coordination overhead between different types of balance
operations.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-05-30 18:14:25 +08:00
wei liu
eabb62e3ab
fix: Segment may be released prematurely during balance channel (#42090)
issue: #41143

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-05-29 18:36:35 +08:00
aoiasd
2ae4d80120
enhance: support run analyzer by loaded collection field (#42113)
relate: https://github.com/milvus-io/milvus/issues/42094

Signed-off-by: aoiasd <zhicheng.yue@zilliz.com>
2025-05-29 10:54:30 +08:00
wei liu
54619eaa2c
feat: Implement partial result support on node down (#42009)
issue: https://github.com/milvus-io/milvus/issues/41690
This commit implements partial search result functionality when query
nodes go down, improving system availability during node failures. The
changes include:

- Enhanced load balancing in proxy (lb_policy.go) to handle node
failures with retry support
- Added partial search result capability in querynode delegator and
distribution logic
- Implemented tests for various partial result scenarios when nodes go
down
- Added metrics to track partial search results in querynode_metrics.go
- Updated parameter configuration to support partial result required
data ratio
- Replaced old partial_search_test.go with more comprehensive
partial_result_on_node_down_test.go
- Updated proto definitions and improved retry logic

These changes improve query resilience by returning partial results to
users when some query nodes are unavailable, ensuring that queries don't
completely fail when a portion of data remains accessible.
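
How a "partial result required data ratio" might gate a response can be sketched as follows. The decision rule here is an assumption made for illustration: the ratio is taken to be the accessible share of rows, and a required ratio of 1.0 means partial results are never accepted. `decidePartial` is a hypothetical name.

```go
package main

import "fmt"

// decidePartial (hypothetical) decides whether a query may return.
// ok is true when a result may be returned; partial is true when that
// result covers only part of the data.
func decidePartial(accessibleRows, totalRows int64, requiredRatio float64) (ok, partial bool) {
	if totalRows <= 0 || accessibleRows >= totalRows {
		return true, false // nothing missing: full result
	}
	if requiredRatio >= 1.0 {
		return false, false // full data required, so the query fails
	}
	ratio := float64(accessibleRows) / float64(totalRows)
	if ratio >= requiredRatio {
		return true, true // enough data is reachable: return a partial result
	}
	return false, false
}

func main() {
	fmt.Println(decidePartial(100, 100, 0.8)) // true false
	fmt.Println(decidePartial(90, 100, 0.8))  // true true
	fmt.Println(decidePartial(50, 100, 0.8))  // false false
	fmt.Println(decidePartial(90, 100, 1.0))  // false false
}
```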

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-05-28 00:12:28 +08:00
wei liu
78010262f0
enhance: Optimize shard serviceable mechanism (#41937)
issue: https://github.com/milvus-io/milvus/issues/41690
- Merge leader view and channel management into ChannelDistManager,
allowing a channel to have multiple delegators.
- Improve shard leader switching to ensure a single replica only has one
shard leader per channel. The shard leader handles all resource loading
and query requests.
- Refine the serviceable mechanism: after QC completes loading, sync the
query view to the delegator. The delegator then determines its
serviceable status based on the query view.
- When a delegator encounters forwarding query or deletion failures,
mark the corresponding segment as offline and transition it to an
unserviceable state.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-05-22 11:38:24 +08:00
wei liu
4e1208f4f6
enhance: support balancing multiple collections in single trigger (#41875)
issue: #41874
- Optimize balance_checker to support balancing multiple collections
simultaneously
- Add new parameters for segment and channel balancing batch sizes
- Add enableBalanceOnMultipleCollections parameter
- Update tests for balance checker

This change improves resource utilization by allowing the system to
balance multiple collections in a single trigger with configurable batch
sizes.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-05-21 21:38:25 +08:00
yihao.dai
65dd3982d8
fix: Fix ants.Pool goroutine leak (#41892)
1. Release the pool after it is no longer in use.
2. Upgrade ants.Pool to fix the goroutine leak issue (see [PR
#287](https://github.com/panjf2000/ants/pull/287)).

issue: https://github.com/milvus-io/milvus/issues/41838

---------

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2025-05-19 17:56:22 +08:00
Zhen Ye
5fd47c3c89
fix: mockery tool unavailable after upgrade golang version (#41481)
issue: #41291
pr: #41318

Signed-off-by: chyezh <chyezh@outlook.com>
2025-04-24 10:46:43 +08:00
SimFG
91d40fa558
fix: Update logging context and upgrade dependencies (#41318)
- issue: #41291

---------

Signed-off-by: SimFG <bang.fu@zilliz.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2025-04-23 10:52:38 +08:00
congqixia
b36c88f3c8
enhance: [AddField] Broadcast schema change via WAL (#41373)
Related to #39718

Add Broadcast logic for collection schema change and notifies:
- Streamnode - Delegator
- Streamnode - Flush component
- QueryNodes via grpc

---------

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2025-04-22 16:28:37 +08:00
Xianhui Lin
f9febe3bae
enhance: Merge RootCoord, DataCoord And QueryCoord into MixCoord (#41006)
Merge RootCoord, DataCoord, and QueryCoord into MixCoord.
Unify their sessions into one.
issue: https://github.com/milvus-io/milvus/issues/37764

---------

Signed-off-by: Xianhui.Lin <xianhui.lin@zilliz.com>
2025-04-11 16:36:30 +08:00
wei liu
a839d94c9e
fix: balance checker may enter infinite normal balance loop after balance suspension (#41195)
issue: #41194
- Refactor hasUnbalancedCollection flag handling to function scope
- Ensure tracking sets are cleared when no balance is needed
- Add deferred cleanup for both normal/stopping balance paths
- Add unit tests for collection tracking scenarios

The changes ensure tracking sets (normalBalanceCollectionsCurrentRound
and stoppingBalanceCollectionsCurrentRound) are properly cleared when:
- All collections in current round are balanced
- Balance checks return early due to unready targets
- Balance feature flags are disabled
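
The deferred cleanup described above can be sketched like this. It is a simplified model, not the checker's real code: `balanceChecker`, `pickNext`, and the single tracking set are hypothetical, and the function-scoped flag mirrors the `hasUnbalancedCollection` idea from the commit message.

```go
package main

import "fmt"

// balanceChecker is a hypothetical, reduced model of the checker.
type balanceChecker struct {
	checkedThisRound map[int64]struct{} // collections already handled this round
}

// pickNext returns the next collection to balance, or 0 if there is none.
// The deferred func clears the tracking set on every path that found nothing
// to balance, including the early return when balancing is disabled.
func (b *balanceChecker) pickNext(candidates []int64, enabled bool) int64 {
	hasUnbalanced := false
	defer func() {
		if !hasUnbalanced {
			b.checkedThisRound = make(map[int64]struct{})
		}
	}()

	if !enabled {
		return 0
	}
	for _, id := range candidates {
		if _, done := b.checkedThisRound[id]; done {
			continue
		}
		b.checkedThisRound[id] = struct{}{}
		hasUnbalanced = true
		return id
	}
	return 0
}

func main() {
	b := &balanceChecker{checkedThisRound: make(map[int64]struct{})}
	ids := []int64{1, 2}
	fmt.Println(b.pickNext(ids, true)) // 1
	fmt.Println(b.pickNext(ids, true)) // 2
	fmt.Println(b.pickNext(ids, true)) // 0: round finished, set cleared
	fmt.Println(b.pickNext(ids, true)) // 1: a new round starts
}
```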

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-04-10 15:22:29 +08:00
Xianhui Lin
3bc24c264f
enhance: Add json key inverted index in stats for optimization (#38039)
Add json key inverted index in stats for optimization
https://github.com/milvus-io/milvus/issues/36995

---------

Signed-off-by: Xianhui.Lin <xianhui.lin@zilliz.com>
Co-authored-by: luzhang <luzhang@zilliz.com>
2025-04-10 15:20:28 +08:00
wei liu
99270103cf
fix: Offline segment block delegator recovery (#40827)
issue: #39937
Before PR #39552, whenever a segment was missing in either the `current
target` or the `next target`, we would trigger `load segment` to recover
the delegator. However, restoring only the missing segments in the `next
target` is sufficient to advance the target and complete the recovery
process.

In PR #39552, we removed the scheduling of L0 segments along with this
unnecessary `load segment` logic. However, this exposed a new issue: if
the `current target` still has missing segments and there is a flaw in
the `checkDelegatorDataReady` logic, it could block the recovery of a
delegator that contains `offline segments`.

Since `offline segments` are cleaned up asynchronously in this scenario,
this PR removes their blocking effect on delegator recovery, ensuring a
smoother failure recovery process.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-04-07 14:56:22 +08:00