milvus/internal/querycoordv2/meta/channel_dist_manager_test.go
Zhen Ye 318db122b8
enhance: cherry pick patch of new DDL framework and CDC (#45025)
issue: #43897, #44123
pr: #44898
related pr: #44607 #44642 #44792 #44809 #44564 #44560 #44735 #44822
#44865 #44850 #44942 #44874 #44963 #44886 #44898

enhance: remove redundant channel manager from datacoord (#44532)

issue: #41611

- After enabling the streaming arch, the channel manager of datacoord is
a redundant component.


fix: Fix CDC OOM due to high buffer size (#44607)

Fix CDC OOM by:
1. freeing the msg buffer manually (see the sketch below).
2. limiting the max msg buffer size.
3. reducing the scanner msg handler buffer size.

issue: https://github.com/milvus-io/milvus/issues/44123
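
A minimal sketch of points 1 and 2, with hypothetical types (the real
buffer lives in the CDC scanner code):

import "sync"

// boundedBuffer caps the total payload bytes held in memory and drops
// its reference to each message once handled. Hypothetical types; the
// actual scanner buffer differs.
type boundedBuffer struct {
	mu       sync.Mutex
	msgs     [][]byte
	curBytes int
	maxBytes int
}

// Push applies back-pressure instead of growing without bound.
func (b *boundedBuffer) Push(msg []byte) bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.curBytes+len(msg) > b.maxBytes {
		return false
	}
	b.msgs = append(b.msgs, msg)
	b.curBytes += len(msg)
	return true
}

// Pop drops the buffer's reference so the payload can be collected
// promptly ("free msg buffer manually").
func (b *boundedBuffer) Pop() ([]byte, bool) {
	b.mu.Lock()
	defer b.mu.Unlock()
	if len(b.msgs) == 0 {
		return nil, false
	}
	msg := b.msgs[0]
	b.msgs[0] = nil
	b.msgs = b.msgs[1:]
	b.curBytes -= len(msg)
	return msg, true
}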

fix: remove the wrong start timetick to avoid filtering out DML whose
timetick is less than it (#44691)

issue: #41611

- introduced by #44532

enhance: support remove cluster from replicate topology (#44642)

issue: #44558, #44123
- Updating config(A->C) on A and C, and config(B) on B, in the
replicate topology (A->B, A->C) removes B from the replicate topology.
- Fix some metric errors in CDC.

fix: check if qn is sqn with label and streamingnode list (#44792)

issue: #44014

- On standalone, the query node inside needs to load segments and watch
channels, so a query node without `LabelStreamingNodeEmbeddedQueryNode`
is not an embedded query node in the streaming node. The channel dist
manager cannot confirm that a standalone node is an embedded streaming
node (see the sketch below).

The bug was introduced by #44099.
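
A rough sketch of the combined check; the helper signature is invented,
though `sessionutil.LabelStreamingNodeEmbeddedQueryNode` is the real
label:

import "github.com/milvus-io/milvus/internal/util/sessionutil"

// isEmbeddedSQN reports whether a query node is the embedded query node
// of a streaming node. The label alone is not reliable on standalone,
// so the current streaming node list is consulted as well.
// Hypothetical helper for illustration only.
func isEmbeddedSQN(labels map[string]string, streamingNodes map[int64]struct{}, nodeID int64) bool {
	if labels[sessionutil.LabelStreamingNodeEmbeddedQueryNode] != "1" {
		return false
	}
	_, ok := streamingNodes[nodeID]
	return ok
}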

enhance: Make GetReplicateInfo API work at the pchannel level (#44809)

issue: https://github.com/milvus-io/milvus/issues/44123

enhance: Speed up CDC scheduling (#44564)

Make CDC watch the etcd replicate pchannel meta instead of listing it
periodically (see the sketch below).

issue: https://github.com/milvus-io/milvus/issues/44123
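
Roughly, using etcd's clientv3 watch API (the meta prefix name is
illustrative):

import (
	"context"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// watchReplicatePChannelMeta schedules replicators from etcd watch
// events instead of re-listing the prefix on a timer.
func watchReplicatePChannelMeta(ctx context.Context, cli *clientv3.Client) {
	const prefix = "replicate-pchannel/" // hypothetical meta prefix
	for resp := range cli.Watch(ctx, prefix, clientv3.WithPrefix()) {
		for _, ev := range resp.Events {
			switch ev.Type {
			case clientv3.EventTypePut:
				// start or refresh the replicator for this pchannel
			case clientv3.EventTypeDelete:
				// stop the replicator for this pchannel
			}
		}
	}
}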


enhance: refactor the update replicate config operation using the
WAL-broadcast-based DDL/DCL framework (#44560)

issue: #43897

- The UpdateReplicateConfig operation now broadcasts an
AlterReplicateConfig message into all pchannels while holding the
cluster-exclusive lock (see the sketch below).
- Begin txn messages now use the commit message's timetick (to avoid
timetick rollback when CDC replicates txn messages).
- If the current cluster is the secondary, UpdateReplicateConfig waits
until the replicate configuration is consistent with the config
replicated from the primary.
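
A hedged sketch of the first point; the broadcaster interface is
invented for illustration:

import "context"

// broadcaster abstracts the WAL broadcast service; illustrative only.
type broadcaster interface {
	AcquireClusterExclusiveLock(ctx context.Context) (unlock func(), err error)
	Broadcast(ctx context.Context, msgType string, body []byte) error
}

// updateReplicateConfig broadcasts one AlterReplicateConfig message
// into every pchannel while holding the cluster-exclusive lock.
func updateReplicateConfig(ctx context.Context, b broadcaster, cfg []byte) error {
	unlock, err := b.AcquireClusterExclusiveLock(ctx)
	if err != nil {
		return err
	}
	defer unlock()
	return b.Broadcast(ctx, "AlterReplicateConfig", cfg)
}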


enhance: support RBAC with the WAL-based DDL framework (#44735)

issue: #43897

- RBAC (Roles/Users/Privileges/Privilege Groups) is now implemented by
the WAL-based DDL framework.
- Support the following message types in the WAL: `AlterUser`,
`DropUser`, `AlterRole`, `DropRole`, `AlterUserRole`, `DropUserRole`,
`AlterPrivilege`, `DropPrivilege`, `AlterPrivilegeGroup`,
`DropPrivilegeGroup`, `RestoreRBAC` (dispatch sketched below).
- RBAC can now be synced by the new CDC.
- Refactor some UTs for RBAC.
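
On the consumer side, replay dispatches on the message type; a hedged
sketch with an invented catalog interface:

import "fmt"

// rbacCatalog is an illustrative stand-in for the local RBAC store.
type rbacCatalog interface {
	AlterUser(body []byte) error
	DropUser(body []byte) error
	RestoreRBAC(body []byte) error
}

// applyRBACMessage replays one RBAC message from the WAL into the
// local catalog; names are illustrative, not the real framework API.
func applyRBACMessage(msgType string, body []byte, catalog rbacCatalog) error {
	switch msgType {
	case "AlterUser":
		return catalog.AlterUser(body)
	case "DropUser":
		return catalog.DropUser(body)
	case "RestoreRBAC":
		return catalog.RestoreRBAC(body)
	// ... the remaining RBAC message types follow the same shape.
	default:
		return fmt.Errorf("unhandled RBAC message type %q", msgType)
	}
}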


enhance: support database with WAL-based DDL framework (#44822)

issue: #43897

- Database-related DDL is now implemented by the WAL-based DDL
framework.
- Support the following message types in the WAL: CreateDatabase,
AlterDatabase, DropDatabase.
- Database DDL can now be synced by the new CDC.
- Refactor some UTs for database DDL.

enhance: support alias with WAL-based DDL framework (#44865)

issue: #43897

- Alias-related DDL is now implemented by the WAL-based DDL framework.
- Support the following message types in the WAL: AlterAlias, DropAlias.
- Alias DDL can now be synced by the new CDC.
- Refactor some UTs for alias DDL.

enhance: Disable import for replicating cluster (#44850)

1. Import into a replicating cluster is not supported yet, so disable
it for now.
2. Remove the GetReplicateConfiguration WAL API.

issue: https://github.com/milvus-io/milvus/issues/44123


fix: use short debug string to avoid newline in debug logs (#44925)

issue: #44924

fix: rerank before requery if reranker didn't use field data (#44942)

issue: #44918


enhance: support resource group with WAL-based DDL framework (#44874)

issue: #43897

- Resource-group-related DDL is now implemented by the WAL-based DDL
framework.
- Support the following message types in the WAL: AlterResourceGroup,
DropResourceGroup.
- Resource group DDL can now be synced by the new CDC.
- Refactor some UTs for resource group DDL.


fix: Fix replication txn data loss during chaos (#44963)

Only confirm the CommitMsg for txn messages to prevent data loss (see
the sketch below).

issue: https://github.com/milvus-io/milvus/issues/44962,
https://github.com/milvus-io/milvus/issues/44123
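
A minimal sketch of the confirmation rule, with illustrative
message-type names:

// shouldConfirm encodes the fix: only the commit of a txn acknowledges
// the whole txn, so a crash mid-txn replays the body messages instead
// of silently dropping them.
func shouldConfirm(msgType string) bool {
	switch msgType {
	case "BeginTxn", "TxnBody":
		return false // not confirmed; replayed after a crash
	case "CommitTxn":
		return true // the txn is durable only once committed
	default:
		return true // non-txn messages confirm individually
	}
}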

fix: wrong execution order of DDL/DCL on secondary (#44886)

issue: #44697, #44696

- The DDL execution order on the secondary now matches the order of the
control channel timetick (see the sketch below).
- Filter control channel operations on the shard manager of the
streaming node to avoid a wrong vchannel for create segment.
- Fix immutable txn messages losing the replicate header.
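
A sketch of the ordering fix, assuming a simplified op shape:

import "sort"

// replicatedOp is an illustrative stand-in for a buffered DDL/DCL op.
type replicatedOp struct {
	timetick uint64
	apply    func() error
}

// applyInTimetickOrder applies buffered ops strictly in control
// channel timetick order, matching the primary's execution order.
func applyInTimetickOrder(ops []replicatedOp) error {
	sort.Slice(ops, func(i, j int) bool { return ops[i].timetick < ops[j].timetick })
	for _, op := range ops {
		if err := op.apply(); err != nil {
			return err
		}
	}
	return nil
}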


fix: Fix primary-secondary replication switchover blocking (#44898)

1. Fix primary-secondary replication switchover blocking by deleting
the replicate pchannel meta using modRevision (see the sketch below).
2. Stop the channel replicator (scanner) when the cluster role changes
to prevent continued message consumption and replication.
3. Close the Milvus client to prevent goroutine leaks.
4. Create the Milvus client once per channel replicator.
5. Simplify the CDC controller and resources.

issue: https://github.com/milvus-io/milvus/issues/44123
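
The first point sketched with the real etcd clientv3 transaction API
(the key layout is illustrative):

import (
	"context"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// deleteReplicatePChannelMeta deletes the meta key only if it was not
// modified after we read it at rev, so a concurrent switchover writer
// is never clobbered.
func deleteReplicatePChannelMeta(ctx context.Context, cli *clientv3.Client, key string, rev int64) (bool, error) {
	resp, err := cli.Txn(ctx).
		If(clientv3.Compare(clientv3.ModRevision(key), "=", rev)).
		Then(clientv3.OpDelete(key)).
		Commit()
	if err != nil {
		return false, err
	}
	return resp.Succeeded, nil
}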

---------

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
Signed-off-by: chyezh <chyezh@outlook.com>
Co-authored-by: yihao.dai <yihao.dai@zilliz.com>
2025-11-03 15:39:33 +08:00


// Licensed to the LF AI & Data foundation under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

package meta

import (
	"testing"

	"github.com/stretchr/testify/assert"
	"github.com/stretchr/testify/suite"

	"github.com/milvus-io/milvus/internal/coordinator/snmanager"
	"github.com/milvus-io/milvus/internal/querycoordv2/session"
	"github.com/milvus-io/milvus/internal/util/sessionutil"
	"github.com/milvus-io/milvus/pkg/v2/proto/datapb"
	"github.com/milvus-io/milvus/pkg/v2/proto/querypb"
	"github.com/milvus-io/milvus/pkg/v2/util/metricsinfo"
)

type ChannelDistManagerSuite struct {
suite.Suite
dist *ChannelDistManager
collection int64
nodes []int64
channels map[string]*DmChannel
}

func (suite *ChannelDistManagerSuite) SetupSuite() {
// Replica 0: 0, 2
// Replica 1: 1
suite.collection = 10
suite.nodes = []int64{0, 1, 2}
suite.channels = map[string]*DmChannel{
"dmc0": {
VchannelInfo: &datapb.VchannelInfo{
CollectionID: suite.collection,
ChannelName: "dmc0",
},
Node: 0,
Version: 1,
View: &LeaderView{
ID: 1,
CollectionID: suite.collection,
Channel: "dmc0",
Version: 1,
Status: &querypb.LeaderViewStatus{
Serviceable: true,
},
},
},
"dmc1": {
VchannelInfo: &datapb.VchannelInfo{
CollectionID: suite.collection,
ChannelName: "dmc1",
},
Node: 1,
Version: 1,
View: &LeaderView{
ID: 1,
CollectionID: suite.collection,
Channel: "dmc1",
Version: 1,
Status: &querypb.LeaderViewStatus{
Serviceable: true,
},
},
},
}
}

func (suite *ChannelDistManagerSuite) SetupTest() {
snmanager.ResetDoNothingStreamingNodeManager(suite.T())
suite.dist = NewChannelDistManager(session.NewNodeManager())
// Distribution:
// node 0 contains channel dmc0
// node 1 contains channel dmc0, dmc1
// node 2 contains channel dmc1
suite.dist.Update(suite.nodes[0], suite.channels["dmc0"].Clone())
suite.dist.Update(suite.nodes[1], suite.channels["dmc0"].Clone(), suite.channels["dmc1"].Clone())
suite.dist.Update(suite.nodes[2], suite.channels["dmc1"].Clone())
}

func (suite *ChannelDistManagerSuite) TestGetBy() {
dist := suite.dist
// Test GetAll
channels := dist.GetByFilter()
suite.Len(channels, 4)
// Test GetByNode
for _, node := range suite.nodes {
channels := dist.GetByFilter(WithNodeID2Channel(node))
suite.AssertNode(channels, node)
}
// Test GetByCollection
channels = dist.GetByCollectionAndFilter(suite.collection)
suite.Len(channels, 4)
suite.AssertCollection(channels, suite.collection)
channels = dist.GetByCollectionAndFilter(-1)
suite.Len(channels, 0)
// Test GetByNodeAndCollection
// 1. Valid node and valid collection
for _, node := range suite.nodes {
channels := dist.GetByCollectionAndFilter(suite.collection, WithNodeID2Channel(node))
suite.AssertNode(channels, node)
suite.AssertCollection(channels, suite.collection)
}
// 2. Valid node and invalid collection
channels = dist.GetByCollectionAndFilter(-1, WithNodeID2Channel(suite.nodes[1]))
suite.Len(channels, 0)
// 3. Invalid node and valid collection
channels = dist.GetByCollectionAndFilter(suite.collection, WithNodeID2Channel(-1))
suite.Len(channels, 0)
}

func (suite *ChannelDistManagerSuite) AssertNames(channels []*DmChannel, names ...string) bool {
for _, channel := range channels {
hasChannel := false
for _, name := range names {
if channel.ChannelName == name {
hasChannel = true
break
}
}
if !suite.True(hasChannel, "channel %v not in the given expected list %+v", channel.ChannelName, names) {
return false
}
}
return true
}

func (suite *ChannelDistManagerSuite) AssertNode(channels []*DmChannel, node int64) bool {
for _, channel := range channels {
if !suite.Equal(node, channel.Node) {
return false
}
}
return true
}

func (suite *ChannelDistManagerSuite) AssertCollection(channels []*DmChannel, collection int64) bool {
for _, channel := range channels {
if !suite.Equal(collection, channel.GetCollectionID()) {
return false
}
}
return true
}

func TestChannelDistManager(t *testing.T) {
suite.Run(t, new(ChannelDistManagerSuite))
}

func TestDmChannelClone(t *testing.T) {
// Test that Clone properly copies the View field including Status
originalChannel := &DmChannel{
VchannelInfo: &datapb.VchannelInfo{
CollectionID: 100,
ChannelName: "test-channel",
},
Node: 1,
Version: 10,
View: &LeaderView{
ID: 5,
CollectionID: 100,
Channel: "test-channel",
Version: 20,
Status: &querypb.LeaderViewStatus{
Serviceable: true,
},
},
}
clonedChannel := originalChannel.Clone()
// Check all fields were properly cloned
assert.Equal(t, originalChannel.GetCollectionID(), clonedChannel.GetCollectionID())
assert.Equal(t, originalChannel.GetChannelName(), clonedChannel.GetChannelName())
assert.Equal(t, originalChannel.Node, clonedChannel.Node)
assert.Equal(t, originalChannel.Version, clonedChannel.Version)
// Check that View was properly cloned
assert.NotNil(t, clonedChannel.View)
assert.Equal(t, originalChannel.View.ID, clonedChannel.View.ID)
assert.Equal(t, originalChannel.View.CollectionID, clonedChannel.View.CollectionID)
assert.Equal(t, originalChannel.View.Channel, clonedChannel.View.Channel)
assert.Equal(t, originalChannel.View.Version, clonedChannel.View.Version)
// Check that Status was properly cloned
assert.NotNil(t, clonedChannel.View.Status)
assert.Equal(t, originalChannel.View.Status.GetServiceable(), clonedChannel.View.Status.GetServiceable())
// Verify that modifying the clone doesn't affect the original
clonedChannel.View.Status.Serviceable = false
assert.True(t, originalChannel.View.Status.GetServiceable())
assert.False(t, clonedChannel.View.Status.GetServiceable())
}

func TestDmChannelIsServiceable(t *testing.T) {
// Test serviceable channel
serviceableChannel := &DmChannel{
VchannelInfo: &datapb.VchannelInfo{
CollectionID: 100,
ChannelName: "serviceable",
},
View: &LeaderView{
Status: &querypb.LeaderViewStatus{
Serviceable: true,
},
},
}
assert.True(t, serviceableChannel.IsServiceable())
// Test non-serviceable channel
nonServiceableChannel := &DmChannel{
VchannelInfo: &datapb.VchannelInfo{
CollectionID: 100,
ChannelName: "non-serviceable",
},
View: &LeaderView{
Status: &querypb.LeaderViewStatus{
Serviceable: false,
},
},
}
assert.False(t, nonServiceableChannel.IsServiceable())
}

func (suite *ChannelDistManagerSuite) TestUpdateReturnsNewServiceableChannels() {
dist := NewChannelDistManager(session.NewNodeManager())
// Create a non-serviceable channel
nonServiceableChannel := suite.channels["dmc0"].Clone()
nonServiceableChannel.View.Status.Serviceable = false
// Update with non-serviceable channel first
newServiceableChannels := dist.Update(suite.nodes[0], nonServiceableChannel)
suite.Len(newServiceableChannels, 0, "No new serviceable channels should be returned")
// Now update with a serviceable channel
serviceableChannel := nonServiceableChannel.Clone()
serviceableChannel.View.Status.Serviceable = true
newServiceableChannels = dist.Update(suite.nodes[0], serviceableChannel)
suite.Len(newServiceableChannels, 1, "One new serviceable channel should be returned")
suite.Equal("dmc0", newServiceableChannels[0].GetChannelName())
// Update with same serviceable channel should not return it again
newServiceableChannels = dist.Update(suite.nodes[0], serviceableChannel)
suite.Len(newServiceableChannels, 0, "Already serviceable channel should not be returned")
// Add a different channel that's serviceable
newChannel := suite.channels["dmc1"].Clone()
newChannel.View.Status.Serviceable = true
newServiceableChannels = dist.Update(suite.nodes[0], serviceableChannel, newChannel)
suite.Len(newServiceableChannels, 1, "Only the new serviceable channel should be returned")
suite.Equal("dmc1", newServiceableChannels[0].GetChannelName())
}
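
// TestGetShardLeader exercises the shard leader selection policy
// asserted below: an embedded streaming-node query node is preferred
// outright; otherwise serviceable channels win over non-serviceable
// ones, and the highest Version wins among the remaining candidates.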
func (suite *ChannelDistManagerSuite) TestGetShardLeader() {
nodeManager := session.NewNodeManager()
dist := NewChannelDistManager(nodeManager)
// Create a replica
replicaPB := &querypb.Replica{
ID: 1,
CollectionID: suite.collection,
Nodes: []int64{0, 2, 4},
}
replica := NewReplica(replicaPB)
// Create channels with different versions and serviceability
channel1Node0 := suite.channels["dmc0"].Clone()
channel1Node0.Version = 1
channel1Node0.View.Status.Serviceable = false
channel1Node2 := suite.channels["dmc0"].Clone()
channel1Node2.Node = 2
channel1Node2.Version = 2
channel1Node2.View.Status.Serviceable = false
// Update with non-serviceable channels
dist.Update(0, channel1Node0)
dist.Update(2, channel1Node2)
// Test getting leader with no serviceable channels - should return highest version
leader := dist.GetShardLeader("dmc0", replica)
suite.NotNil(leader)
suite.Equal(int64(2), leader.Node)
suite.Equal(int64(2), leader.Version)
// Now make one channel serviceable
channel1Node0.View.Status.Serviceable = true
dist.Update(0, channel1Node0)
// Test that serviceable channel is preferred even with lower version
leader = dist.GetShardLeader("dmc0", replica)
suite.NotNil(leader)
suite.Equal(int64(0), leader.Node)
suite.Equal(int64(1), leader.Version)
suite.True(leader.IsServiceable())
// Make both channels serviceable but with different versions
channel1Node2.View.Status.Serviceable = true
dist.Update(2, channel1Node2)
// Test that highest version is chosen among serviceable channels
leader = dist.GetShardLeader("dmc0", replica)
suite.NotNil(leader)
suite.Equal(int64(2), leader.Node)
suite.Equal(int64(2), leader.Version)
suite.True(leader.IsServiceable())
// Test channel not in replica
// Create a new replica with different nodes
replicaPB = &querypb.Replica{
ID: 1,
CollectionID: suite.collection,
Nodes: []int64{1},
}
replicaWithDifferentNodes := NewReplica(replicaPB)
leader = dist.GetShardLeader("dmc0", replicaWithDifferentNodes)
suite.Nil(leader)
// Test nonexistent channel
leader = dist.GetShardLeader("nonexistent", replica)
suite.Nil(leader)
// Test streaming node
nodeManager.Add(session.NewNodeInfo(session.ImmutableNodeInfo{
NodeID: 4,
Address: "localhost:1",
Hostname: "localhost",
Labels: map[string]string{sessionutil.LabelStreamingNodeEmbeddedQueryNode: "1"},
}))
channel1Node4 := suite.channels["dmc0"].Clone()
channel1Node4.Node = 4
channel1Node4.Version = 3
channel1Node4.View.Status.Serviceable = false
dist.Update(4, channel1Node4)
leader = dist.GetShardLeader("dmc0", replica)
suite.NotNil(leader)
suite.Equal(int64(4), leader.Node)
suite.Equal(int64(3), leader.Version)
suite.False(leader.IsServiceable())
}

func TestGetChannelDistJSON(t *testing.T) {
manager := NewChannelDistManager(session.NewNodeManager())
channel1 := &DmChannel{
VchannelInfo: &datapb.VchannelInfo{
CollectionID: 100,
ChannelName: "channel-1",
},
Node: 1,
Version: 1,
View: &LeaderView{
ID: 1,
CollectionID: 100,
Channel: "channel-1",
Version: 1,
Status: &querypb.LeaderViewStatus{
Serviceable: true,
},
},
}
channel2 := &DmChannel{
VchannelInfo: &datapb.VchannelInfo{
CollectionID: 200,
ChannelName: "channel-2",
},
Node: 2,
Version: 1,
View: &LeaderView{
ID: 1,
CollectionID: 200,
Channel: "channel-2",
Version: 1,
Status: &querypb.LeaderViewStatus{
Serviceable: true,
},
},
}
manager.Update(1, channel1)
manager.Update(2, channel2)
channels := manager.GetChannelDist(0)
assert.Equal(t, 2, len(channels))
checkResult := func(channel *metricsinfo.DmChannel) {
if channel.NodeID == 1 {
assert.Equal(t, "channel-1", channel.ChannelName)
assert.Equal(t, int64(100), channel.CollectionID)
} else if channel.NodeID == 2 {
assert.Equal(t, "channel-2", channel.ChannelName)
assert.Equal(t, int64(200), channel.CollectionID)
} else {
assert.Failf(t, "unexpected node id", "unexpected node id %d", channel.NodeID)
}
}
for _, channel := range channels {
checkResult(channel)
}
}