milvus/internal/util/segcore/segment_test.go
congqixia f94b04e642
feat: [2.6] integrate Loon FFI for manifest-based segment loading and index building (#46076)
Cherry-pick from master
pr: #45061 #45488 #45803 #46017 #44991 #45132 #45723 #45726 #45798
#45897 #45918 #44998

This feature integrates the Storage V2 (Loon) FFI interface as a unified
storage layer for segment loading and index building in Milvus. It
enables manifest-based data access, replacing the traditional
binlog-based approach with a more efficient columnar storage format.

Key changes:

### Segment Self-Managed Loading Architecture
- Move segment loading orchestration from Go layer to C++ segcore
- Add NewSegmentWithLoadInfo() API for passing load info during segment
creation
- Implement SetLoadInfo() and Load() methods in SegmentInterface
- Support parallel loading of indexed and non-indexed fields
- Enable both sealed and growing segments to self-manage loading
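The self-managed flow amounts to a single handoff from Go to the segment. A minimal sketch in Go, using illustrative stand-in types rather than the real segcore API (`LoadInfo`, `NewSegmentWithLoadInfo`, and `Load` here are simplified mirrors of the names above):

```go
package main

import "fmt"

// LoadInfo is an illustrative stand-in for the proto-derived load descriptor
// passed at segment creation time.
type LoadInfo struct {
	SegmentID    int64
	ManifestPath string
}

// Segment sketches a self-managing segment: it keeps its load info and
// performs loading internally instead of being driven field-by-field from Go.
type Segment struct {
	id     int64
	info   *LoadInfo
	loaded bool
}

// NewSegmentWithLoadInfo attaches the load info up front, mirroring the
// single-entry-point design described above.
func NewSegmentWithLoadInfo(info *LoadInfo) *Segment {
	return &Segment{id: info.SegmentID, info: info}
}

// Load replaces the many per-field calls of the old flow with one call;
// the real segcore loads indexed and non-indexed fields in parallel here.
func (s *Segment) Load() error {
	s.loaded = true
	return nil
}

func main() {
	seg := NewSegmentWithLoadInfo(&LoadInfo{SegmentID: 100, ManifestPath: "path/to/manifest"})
	if err := seg.Load(); err != nil {
		panic(err)
	}
	fmt.Println("loaded:", seg.loaded)
}
```

The point is the shape of the call sequence: one constructor carrying all load metadata, then one `Load` call, instead of the Go side issuing repeated per-field CGO calls.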

### Storage V2 FFI Integration
- Integrate milvus-storage library's FFI interface for packed columnar
data
- Add manifest path support throughout the data path (SegmentInfo,
LoadInfo)
- Implement ManifestReader for generating manifests from binlogs
- Support zero-copy data exchange using Arrow C Data Interface
- Add ToCStorageConfig() for Go-to-C storage config conversion
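A rough sketch of what generating a manifest from existing binlogs involves; `Binlog`, `Manifest`, and `BuildManifest` are illustrative stand-ins for the ManifestReader described above, not the actual milvus-storage types:

```go
package main

import (
	"fmt"
	"sort"
)

// Binlog and Manifest are simplified stand-ins for the structures a
// ManifestReader bridges: per-field binlog paths in, one manifest that
// describes the data files out.
type Binlog struct {
	FieldID int64
	Path    string
}

type Manifest struct {
	Columns map[int64][]string // fieldID -> ordered data file paths
}

// BuildManifest sketches deriving a manifest from legacy binlogs so that
// segments written before Storage V2 remain readable through the FFI path.
func BuildManifest(binlogs []Binlog) *Manifest {
	m := &Manifest{Columns: make(map[int64][]string)}
	for _, b := range binlogs {
		m.Columns[b.FieldID] = append(m.Columns[b.FieldID], b.Path)
	}
	for _, paths := range m.Columns {
		sort.Strings(paths) // deterministic file order per field
	}
	return m
}

func main() {
	m := BuildManifest([]Binlog{
		{FieldID: 100, Path: "/log/2"},
		{FieldID: 100, Path: "/log/1"},
		{FieldID: 101, Path: "/log/3"},
	})
	fmt.Println(len(m.Columns[100]), m.Columns[100][0])
}
```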

### Manifest-Based Index Building
- Extend FileManagerContext to carry loon_ffi_properties
- Implement GetFieldDatasFromManifest() using Arrow C Stream interface
- Support manifest-based reading in DiskFileManagerImpl and
MemFileManagerImpl
- Add fallback to traditional segment insert files when manifest
unavailable
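The fallback rule fits in a few lines. This sketch uses a hypothetical `readFieldData` helper; the real file managers pass richer contexts, but the decision is the same: use the manifest when present, otherwise the insert files:

```go
package main

import "fmt"

// readFieldData sketches the fallback described above: prefer the manifest
// when one is available, otherwise fall back to the traditional segment
// insert files. The function and its signature are illustrative, not the
// actual DiskFileManagerImpl/MemFileManagerImpl API.
func readFieldData(manifestPath string, insertFiles []string) (source string, files []string) {
	if manifestPath != "" {
		return "manifest", []string{manifestPath}
	}
	return "binlog", insertFiles
}

func main() {
	src, files := readFieldData("", []string{"/insert/1", "/insert/2"})
	fmt.Println(src, len(files))
}
```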

### Compaction Pipeline Updates
- Include manifest path in all compaction task builders (clustering, L0,
mix)
- Update BulkPackWriterV2 to return manifest path
- Propagate manifest metadata through compaction pipeline

### Configuration & Protocol
- Add common.storageV2.useLoonFFI config option (default: false)
- Add manifest_path field to SegmentLoadInfo and related proto messages
- Add manifest field to compaction segment messages
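Gating on the new option might look like the following; the map stands in for the real paramtable lookup, and only the key name `common.storageV2.useLoonFFI` and its `false` default come from the change itself:

```go
package main

import "fmt"

// useLoonFFI sketches how a boolean option such as common.storageV2.useLoonFFI
// (default: false) gates the new code path.
func useLoonFFI(cfg map[string]string) bool {
	v, ok := cfg["common.storageV2.useLoonFFI"]
	if !ok {
		return false // feature stays off unless explicitly enabled
	}
	return v == "true"
}

func main() {
	fmt.Println(useLoonFFI(nil))
	fmt.Println(useLoonFFI(map[string]string{"common.storageV2.useLoonFFI": "true"}))
}
```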

### Bug Fixes
- Fix mmap settings not applied during segment load (key typo fix)
- Populate index info after segment loading to prevent redundant load
tasks
- Fix memory corruption by removing premature transaction handle
destruction

Related issues: #44956, #45060, #39173

## Individual Cherry-Picked Commits

1. **e1c923b5cc** - fix: apply mmap settings correctly during segment
load (#46017)
2. **63b912370b** - enhance: use milvus-storage internal C++ Reader API
for Loon FFI (#45897)
3. **bfc192faa5** - enhance: Resolve issues integrating loon FFI
(#45918)
4. **fb18564631** - enhance: support manifest-based index building with
Loon FFI reader (#45726)
5. **b9ec2392b9** - enhance: integrate StorageV2 FFI interface for
manifest-based segment loading (#45798)
6. **66db3c32e6** - enhance: integrate Storage V2 FFI interface for
unified storage access (#45723)
7. **ae789273ac** - fix: populate index info after segment loading to
prevent redundant load tasks (#45803)
8. **49688b0be2** - enhance: Move segment loading logic from Go layer to
segcore for self-managed loading (#45488)
9. **5b2df88bac** - enhance: [StorageV2] Integrate FFI interface for
packed reader (#45132)
10. **91ff5706ac** - enhance: [StorageV2] add manifest path support for
FFI integration (#44991)
11. **2192bb4a85** - enhance: add NewSegmentWithLoadInfo API to support
segment self-managed loading (#45061)
12. **4296b01da0** - enhance: update delta log serialization APIs to
integrate storage V2 (#44998)

## Technical Details

### Architecture Changes
- **Before**: Go layer orchestrated segment loading, making multiple CGO
calls
- **After**: Segments autonomously manage loading in C++ layer with
single entry point

### Storage Access Pattern
- **Before**: Read individual binlog files through Go storage layer
- **After**: Read manifest file that references packed columnar data via
FFI
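The difference in access pattern is essentially a difference in round-trip counts. A toy illustration (the reader shapes are invented for the sketch):

```go
package main

import "fmt"

// readViaBinlogs models the old path: one storage round-trip per binlog file.
func readViaBinlogs(files []string) int {
	calls := 0
	for range files {
		calls++
	}
	return calls
}

// readViaManifest models the new path: one manifest-driven streaming read
// covering all referenced packed files.
func readViaManifest(files []string) int {
	_ = files
	return 1
}

func main() {
	files := []string{"f1", "f2", "f3", "f4"}
	fmt.Println(readViaBinlogs(files), readViaManifest(files))
}
```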

### Benefits
- Reduced cross-language call overhead
- Better resource management at C++ level
- Improved I/O performance through batched streaming reads
- Cleaner separation of concerns between Go and C++ layers
- Foundation for proactive schema evolution handling

---------

Signed-off-by: Ted Xu <ted.xu@zilliz.com>
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
Co-authored-by: Ted Xu <ted.xu@zilliz.com>
2025-12-04 17:09:12 +08:00


package segcore_test

import (
"context"
"path/filepath"
"testing"

"github.com/stretchr/testify/assert"
"google.golang.org/protobuf/proto"

"github.com/milvus-io/milvus-proto/go-api/v2/commonpb"
"github.com/milvus-io/milvus-proto/go-api/v2/schemapb"
"github.com/milvus-io/milvus/internal/mocks/util/mock_segcore"
"github.com/milvus-io/milvus/internal/storage"
"github.com/milvus-io/milvus/internal/util/initcore"
"github.com/milvus-io/milvus/internal/util/segcore"
"github.com/milvus-io/milvus/pkg/v2/proto/datapb"
"github.com/milvus-io/milvus/pkg/v2/proto/planpb"
"github.com/milvus-io/milvus/pkg/v2/proto/querypb"
"github.com/milvus-io/milvus/pkg/v2/proto/segcorepb"
"github.com/milvus-io/milvus/pkg/v2/util/paramtable"
"github.com/milvus-io/milvus/pkg/v2/util/typeutil"
)

func TestGrowingSegment(t *testing.T) {
paramtable.Init()
localDataRootPath := filepath.Join(paramtable.Get().LocalStorageCfg.Path.GetValue(), typeutil.QueryNodeRole)
initcore.InitLocalChunkManager(localDataRootPath)
err := initcore.InitMmapManager(paramtable.Get(), 1)
assert.NoError(t, err)
err = initcore.InitTieredStorage(paramtable.Get())
assert.NoError(t, err)
collectionID := int64(100)
segmentID := int64(100)
schema := mock_segcore.GenTestCollectionSchema("test-reduce", schemapb.DataType_Int64, true)
collection, err := segcore.CreateCCollection(&segcore.CreateCCollectionRequest{
CollectionID: collectionID,
Schema: schema,
IndexMeta: mock_segcore.GenTestIndexMeta(collectionID, schema),
})
assert.NoError(t, err)
assert.NotNil(t, collection)
defer collection.Release()
segment, err := segcore.CreateCSegment(&segcore.CreateCSegmentRequest{
Collection: collection,
SegmentID: segmentID,
SegmentType: segcore.SegmentTypeGrowing,
IsSorted: false,
})
assert.NoError(t, err)
assert.NotNil(t, segment)
defer segment.Release()
assert.Equal(t, segmentID, segment.ID())
assert.Equal(t, int64(0), segment.RowNum())
assert.Zero(t, segment.MemSize())
assert.True(t, segment.HasRawData(0))
assertEqualCount(t, collection, segment, 0)
insertMsg, err := mock_segcore.GenInsertMsg(collection, 1, segmentID, 100)
assert.NoError(t, err)
insertResult, err := segment.Insert(context.Background(), &segcore.InsertRequest{
RowIDs: insertMsg.RowIDs,
Timestamps: insertMsg.Timestamps,
Record: &segcorepb.InsertRecord{
FieldsData: insertMsg.FieldsData,
NumRows: int64(len(insertMsg.RowIDs)),
},
})
assert.NoError(t, err)
assert.NotNil(t, insertResult)
assert.Equal(t, int64(100), insertResult.InsertedRows)
assert.Equal(t, int64(100), segment.RowNum())
assertEqualCount(t, collection, segment, 100)
pk := storage.NewInt64PrimaryKeys(1)
pk.Append(storage.NewInt64PrimaryKey(10))
deleteResult, err := segment.Delete(context.Background(), &segcore.DeleteRequest{
PrimaryKeys: pk,
Timestamps: []typeutil.Timestamp{
1000,
},
})
assert.NoError(t, err)
assert.NotNil(t, deleteResult)
assert.Equal(t, int64(99), segment.RowNum())
}

func assertEqualCount(
t *testing.T,
collection *segcore.CCollection,
segment segcore.CSegment,
count int64,
) {
plan := planpb.PlanNode{
Node: &planpb.PlanNode_Query{
Query: &planpb.QueryPlanNode{
IsCount: true,
},
},
}
expr, err := proto.Marshal(&plan)
assert.NoError(t, err)
retrievePlan, err := segcore.NewRetrievePlan(
collection,
expr,
typeutil.MaxTimestamp,
100,
0,
0)
// Check the error before deferring Delete; deferring on a possibly nil
// plan would panic if plan creation failed.
assert.NoError(t, err)
assert.NotNil(t, retrievePlan)
defer retrievePlan.Delete()
assert.True(t, retrievePlan.ShouldIgnoreNonPk())
assert.False(t, retrievePlan.IsIgnoreNonPk())
retrievePlan.SetIgnoreNonPk(true)
assert.True(t, retrievePlan.IsIgnoreNonPk())
assert.NotZero(t, retrievePlan.MsgID())
retrieveResult, err := segment.Retrieve(context.Background(), retrievePlan)
assert.NotNil(t, retrieveResult)
assert.NoError(t, err)
result, err := retrieveResult.GetResult()
assert.NoError(t, err)
assert.NotNil(t, result)
assert.Equal(t, count, result.AllRetrieveCount)
retrieveResult.Release()
retrieveResult2, err := segment.RetrieveByOffsets(context.Background(), &segcore.RetrievePlanWithOffsets{
RetrievePlan: retrievePlan,
Offsets: []int64{0, 1, 2, 3, 4},
})
assert.NoError(t, err)
assert.NotNil(t, retrieveResult2)
retrieveResult2.Release()
}

func TestConvertToSegcoreSegmentLoadInfo(t *testing.T) {
t.Run("nil input", func(t *testing.T) {
result := segcore.ConvertToSegcoreSegmentLoadInfo(nil)
assert.Nil(t, result)
})
t.Run("empty input", func(t *testing.T) {
src := &querypb.SegmentLoadInfo{}
result := segcore.ConvertToSegcoreSegmentLoadInfo(src)
assert.NotNil(t, result)
assert.Equal(t, int64(0), result.SegmentID)
assert.Equal(t, int64(0), result.PartitionID)
assert.Equal(t, int64(0), result.CollectionID)
})
t.Run("full conversion", func(t *testing.T) {
// Create source querypb.SegmentLoadInfo with all fields populated
src := &querypb.SegmentLoadInfo{
SegmentID: 1001,
PartitionID: 2001,
CollectionID: 3001,
DbID: 4001,
FlushTime: 5001,
BinlogPaths: []*datapb.FieldBinlog{
{
FieldID: 100,
Binlogs: []*datapb.Binlog{
{
EntriesNum: 10,
TimestampFrom: 1000,
TimestampTo: 2000,
LogPath: "/path/to/binlog",
LogSize: 1024,
LogID: 9001,
MemorySize: 2048,
},
},
ChildFields: []int64{101, 102},
},
},
NumOfRows: 1000,
Statslogs: []*datapb.FieldBinlog{
{
FieldID: 200,
Binlogs: []*datapb.Binlog{
{
EntriesNum: 5,
TimestampFrom: 1500,
TimestampTo: 2500,
LogPath: "/path/to/statslog",
LogSize: 512,
LogID: 9002,
MemorySize: 1024,
},
},
},
},
Deltalogs: []*datapb.FieldBinlog{
{
FieldID: 300,
Binlogs: []*datapb.Binlog{
{
EntriesNum: 3,
TimestampFrom: 2000,
TimestampTo: 3000,
LogPath: "/path/to/deltalog",
LogSize: 256,
LogID: 9003,
MemorySize: 512,
},
},
},
},
CompactionFrom: []int64{8001, 8002},
IndexInfos: []*querypb.FieldIndexInfo{
{
FieldID: 100,
EnableIndex: true,
IndexName: "test_index",
IndexID: 7001,
BuildID: 7002,
IndexParams: []*commonpb.KeyValuePair{{Key: "index_type", Value: "HNSW"}},
IndexFilePaths: []string{"/path/to/index"},
IndexSize: 4096,
IndexVersion: 1,
NumRows: 1000,
CurrentIndexVersion: 2,
IndexStoreVersion: 3,
},
},
SegmentSize: 8192,
InsertChannel: "insert_channel_1",
ReadableVersion: 6001,
StorageVersion: 7001,
IsSorted: true,
TextStatsLogs: map[int64]*datapb.TextIndexStats{
400: {
FieldID: 400,
Version: 1,
Files: []string{"/path/to/text/stats1", "/path/to/text/stats2"},
LogSize: 2048,
MemorySize: 4096,
BuildID: 9101,
},
},
Bm25Logs: []*datapb.FieldBinlog{
{
FieldID: 500,
Binlogs: []*datapb.Binlog{
{
EntriesNum: 7,
TimestampFrom: 3000,
TimestampTo: 4000,
LogPath: "/path/to/bm25log",
LogSize: 768,
LogID: 9004,
MemorySize: 1536,
},
},
},
},
JsonKeyStatsLogs: map[int64]*datapb.JsonKeyStats{
600: {
FieldID: 600,
Version: 2,
Files: []string{"/path/to/json/stats"},
LogSize: 1024,
MemorySize: 2048,
BuildID: 9201,
JsonKeyStatsDataFormat: 1,
},
},
Priority: commonpb.LoadPriority_HIGH,
}
// Convert to segcorepb.SegmentLoadInfo
result := segcore.ConvertToSegcoreSegmentLoadInfo(src)
// Validate basic fields
assert.NotNil(t, result)
assert.Equal(t, src.SegmentID, result.SegmentID)
assert.Equal(t, src.PartitionID, result.PartitionID)
assert.Equal(t, src.CollectionID, result.CollectionID)
assert.Equal(t, src.DbID, result.DbID)
assert.Equal(t, src.FlushTime, result.FlushTime)
assert.Equal(t, src.NumOfRows, result.NumOfRows)
assert.Equal(t, src.SegmentSize, result.SegmentSize)
assert.Equal(t, src.InsertChannel, result.InsertChannel)
assert.Equal(t, src.ReadableVersion, result.ReadableVersion)
assert.Equal(t, src.StorageVersion, result.StorageVersion)
assert.Equal(t, src.IsSorted, result.IsSorted)
assert.Equal(t, src.Priority, result.Priority)
assert.Equal(t, src.CompactionFrom, result.CompactionFrom)
// Validate BinlogPaths conversion
assert.Equal(t, len(src.BinlogPaths), len(result.BinlogPaths))
assert.Equal(t, src.BinlogPaths[0].FieldID, result.BinlogPaths[0].FieldID)
assert.Equal(t, len(src.BinlogPaths[0].Binlogs), len(result.BinlogPaths[0].Binlogs))
assert.Equal(t, src.BinlogPaths[0].Binlogs[0].EntriesNum, result.BinlogPaths[0].Binlogs[0].EntriesNum)
assert.Equal(t, src.BinlogPaths[0].Binlogs[0].TimestampFrom, result.BinlogPaths[0].Binlogs[0].TimestampFrom)
assert.Equal(t, src.BinlogPaths[0].Binlogs[0].TimestampTo, result.BinlogPaths[0].Binlogs[0].TimestampTo)
assert.Equal(t, src.BinlogPaths[0].Binlogs[0].LogPath, result.BinlogPaths[0].Binlogs[0].LogPath)
assert.Equal(t, src.BinlogPaths[0].Binlogs[0].LogSize, result.BinlogPaths[0].Binlogs[0].LogSize)
assert.Equal(t, src.BinlogPaths[0].Binlogs[0].LogID, result.BinlogPaths[0].Binlogs[0].LogID)
assert.Equal(t, src.BinlogPaths[0].Binlogs[0].MemorySize, result.BinlogPaths[0].Binlogs[0].MemorySize)
assert.Equal(t, src.BinlogPaths[0].ChildFields, result.BinlogPaths[0].ChildFields)
// Validate Statslogs conversion
assert.Equal(t, len(src.Statslogs), len(result.Statslogs))
assert.Equal(t, src.Statslogs[0].FieldID, result.Statslogs[0].FieldID)
// Validate Deltalogs conversion
assert.Equal(t, len(src.Deltalogs), len(result.Deltalogs))
assert.Equal(t, src.Deltalogs[0].FieldID, result.Deltalogs[0].FieldID)
// Validate IndexInfos conversion
assert.Equal(t, len(src.IndexInfos), len(result.IndexInfos))
assert.Equal(t, src.IndexInfos[0].FieldID, result.IndexInfos[0].FieldID)
assert.Equal(t, src.IndexInfos[0].EnableIndex, result.IndexInfos[0].EnableIndex)
assert.Equal(t, src.IndexInfos[0].IndexName, result.IndexInfos[0].IndexName)
assert.Equal(t, src.IndexInfos[0].IndexID, result.IndexInfos[0].IndexID)
assert.Equal(t, src.IndexInfos[0].BuildID, result.IndexInfos[0].BuildID)
assert.Equal(t, len(src.IndexInfos[0].IndexParams), len(result.IndexInfos[0].IndexParams))
assert.Equal(t, src.IndexInfos[0].IndexFilePaths, result.IndexInfos[0].IndexFilePaths)
assert.Equal(t, src.IndexInfos[0].IndexSize, result.IndexInfos[0].IndexSize)
assert.Equal(t, src.IndexInfos[0].IndexVersion, result.IndexInfos[0].IndexVersion)
assert.Equal(t, src.IndexInfos[0].NumRows, result.IndexInfos[0].NumRows)
assert.Equal(t, src.IndexInfos[0].CurrentIndexVersion, result.IndexInfos[0].CurrentIndexVersion)
assert.Equal(t, src.IndexInfos[0].IndexStoreVersion, result.IndexInfos[0].IndexStoreVersion)
// Validate TextStatsLogs conversion
assert.Equal(t, len(src.TextStatsLogs), len(result.TextStatsLogs))
textStats := result.TextStatsLogs[400]
assert.NotNil(t, textStats)
assert.Equal(t, src.TextStatsLogs[400].FieldID, textStats.FieldID)
assert.Equal(t, src.TextStatsLogs[400].Version, textStats.Version)
assert.Equal(t, src.TextStatsLogs[400].Files, textStats.Files)
assert.Equal(t, src.TextStatsLogs[400].LogSize, textStats.LogSize)
assert.Equal(t, src.TextStatsLogs[400].MemorySize, textStats.MemorySize)
assert.Equal(t, src.TextStatsLogs[400].BuildID, textStats.BuildID)
// Validate Bm25Logs conversion
assert.Equal(t, len(src.Bm25Logs), len(result.Bm25Logs))
assert.Equal(t, src.Bm25Logs[0].FieldID, result.Bm25Logs[0].FieldID)
// Validate JsonKeyStatsLogs conversion
assert.Equal(t, len(src.JsonKeyStatsLogs), len(result.JsonKeyStatsLogs))
jsonStats := result.JsonKeyStatsLogs[600]
assert.NotNil(t, jsonStats)
assert.Equal(t, src.JsonKeyStatsLogs[600].FieldID, jsonStats.FieldID)
assert.Equal(t, src.JsonKeyStatsLogs[600].Version, jsonStats.Version)
assert.Equal(t, src.JsonKeyStatsLogs[600].Files, jsonStats.Files)
assert.Equal(t, src.JsonKeyStatsLogs[600].LogSize, jsonStats.LogSize)
assert.Equal(t, src.JsonKeyStatsLogs[600].MemorySize, jsonStats.MemorySize)
assert.Equal(t, src.JsonKeyStatsLogs[600].BuildID, jsonStats.BuildID)
assert.Equal(t, src.JsonKeyStatsLogs[600].JsonKeyStatsDataFormat, jsonStats.JsonKeyStatsDataFormat)
})
t.Run("nil elements in arrays and maps", func(t *testing.T) {
src := &querypb.SegmentLoadInfo{
SegmentID: 1001,
BinlogPaths: []*datapb.FieldBinlog{
nil, // nil element should be skipped
{FieldID: 100},
},
Statslogs: []*datapb.FieldBinlog{
nil,
},
IndexInfos: []*querypb.FieldIndexInfo{
nil,
{FieldID: 200},
},
TextStatsLogs: map[int64]*datapb.TextIndexStats{
100: nil, // nil value should be skipped
200: {FieldID: 200},
},
JsonKeyStatsLogs: map[int64]*datapb.JsonKeyStats{
300: nil,
400: {FieldID: 400},
},
}
result := segcore.ConvertToSegcoreSegmentLoadInfo(src)
assert.NotNil(t, result)
assert.Equal(t, 1, len(result.BinlogPaths))
assert.Equal(t, int64(100), result.BinlogPaths[0].FieldID)
assert.Equal(t, 0, len(result.Statslogs))
assert.Equal(t, 1, len(result.IndexInfos))
assert.Equal(t, int64(200), result.IndexInfos[0].FieldID)
assert.Equal(t, 1, len(result.TextStatsLogs))
assert.NotNil(t, result.TextStatsLogs[200])
assert.Equal(t, 1, len(result.JsonKeyStatsLogs))
assert.NotNil(t, result.JsonKeyStatsLogs[400])
})
}