# Milvus Snapshot User Guide

## Overview

The Milvus snapshot feature allows users to create point-in-time copies of collections. This capability enables data backup, versioning, and restoration scenarios. Snapshots capture the complete state of a collection, including vector data, metadata, indexes, and schema information, at a specific timestamp.

## Key Features

- **Point-in-time consistency**: Snapshots capture data at a specific timestamp, ensuring data consistency
- **Metadata preservation**: Snapshots include schema, indexes, and collection properties
- **Efficient storage**: Snapshots use a manifest-based approach for efficient storage in object storage (S3)
- **Restore capability**: Restore snapshots to new collections with data integrity

## Core Concepts

### Snapshot Components

A Milvus snapshot consists of:

1. **Snapshot Metadata**: Basic information including name, description, collection ID, and creation timestamp
2. **Collection Description**: Schema definition, partition information, and collection properties
3. **Segment Data**: Vector data files (binlogs), deletion logs (deltalogs), and index files
4. **Index Information**: Index metadata and file paths

### Storage Structure

Snapshots are stored in object storage with the following structure:

```
snapshots/{collection_id}/
├── metadata/
│   └── {snapshot_id}.json        # Snapshot metadata (JSON format)
└── manifests/
    └── {snapshot_id}/            # Directory for each snapshot
        ├── {segment_id_1}.avro   # Individual segment manifest (Avro format)
        ├── {segment_id_2}.avro
        └── ...
```

**Note**: The metadata JSON file directly contains an array of manifest file paths, eliminating the need for a separate manifest list file.

## API Reference

### Create Snapshot

Create a snapshot for a collection.

**Best Practice (Strongly Recommended)**: Call `flush()` before creating a snapshot to ensure all data is persisted. The `create_snapshot` operation only captures existing sealed segments and does not trigger data flushing automatically. Data in growing segments will not be included in the snapshot.

**Note**: Calling `flush()` is not mandatory, but it is highly recommended: if you skip it, only data that has already been flushed to sealed segments will be included in the snapshot.

**Python SDK Example:**

```python
from pymilvus import MilvusClient

client = MilvusClient(uri="http://localhost:19530")

# Recommended: Flush data before creating snapshot to ensure all data is included
client.flush(collection_name="my_collection")

# Create snapshot for entire collection
client.create_snapshot(
    collection_name="my_collection",
    snapshot_name="backup_20240101",
    description="Daily backup for January 1st, 2024"
)
```

**Go SDK Example:**

```go
import (
    "context"
    "log"

    "github.com/milvus-io/milvus/client/v2/milvusclient"
)

client, err := milvusclient.New(context.Background(), &milvusclient.ClientConfig{
    Address: "localhost:19530",
})
if err != nil {
    log.Fatal(err)
}

// Recommended: Flush data before creating snapshot to ensure all data is included
err = client.Flush(context.Background(), milvusclient.NewFlushOption("my_collection"))
if err != nil {
    log.Fatal(err)
}

// Create snapshot
createOpt := milvusclient.NewCreateSnapshotOption("backup_20240101", "my_collection").
    WithDescription("Daily backup for January 1st, 2024")
err = client.CreateSnapshot(context.Background(), createOpt)
if err != nil {
    log.Fatal(err)
}
```

Parameters:

- `snapshot_name` (string): User-defined unique name for the snapshot
- `collection_name` (string): Name of the collection to snapshot
- `description` (string, optional): Description of the snapshot

### List Snapshots

List existing snapshots for collections.

**Python SDK Example:**

```python
# List all snapshots for a collection
snapshots = client.list_snapshots(collection_name="my_collection")
```

**Go SDK Example:**

```go
// List snapshots for collection
listOpt := milvusclient.NewListSnapshotsOption().
    WithCollectionName("my_collection")
snapshots, err := client.ListSnapshots(context.Background(), listOpt)
```

Parameters:

- `collection_name` (string, optional): Filter snapshots by collection

Returns:

- List of snapshot names

### Describe Snapshot

Get detailed information about a specific snapshot.

**Python SDK Example:**

```python
snapshot_info = client.describe_snapshot(
    snapshot_name="backup_20240101",
    include_collection_info=True
)
print(f"Snapshot ID: {snapshot_info.id}")
print(f"Collection: {snapshot_info.collection_name}")
print(f"Created: {snapshot_info.create_ts}")
print(f"Description: {snapshot_info.description}")
```

**Go SDK Example:**

```go
describeOpt := milvusclient.NewDescribeSnapshotOption("backup_20240101")
resp, err := client.DescribeSnapshot(context.Background(), describeOpt)
fmt.Printf("Snapshot ID: %d\n", resp.GetSnapshotInfo().GetId())
fmt.Printf("Collection: %s\n", resp.GetSnapshotInfo().GetCollectionName())
```

Parameters:

- `snapshot_name` (string): Name of the snapshot to describe
- `include_collection_info` (bool, optional): Whether to include collection schema and index information

Returns:

- `SnapshotInfo`: Basic snapshot information
- `CollectionDescription`: Collection schema and properties (if requested)
- `IndexInfo[]`: Index information (if requested)
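When a collection has several snapshots, `list_snapshots` and `describe_snapshot` can be combined to find the most recent one. A minimal sketch, assuming `create_ts` is comparable as shown in the example above (`latest_snapshot` is an illustrative helper, not an SDK call):

```python
def latest_snapshot(client, collection_name):
    """Return info for the most recent snapshot of a collection, or None."""
    names = client.list_snapshots(collection_name=collection_name)
    infos = [client.describe_snapshot(snapshot_name=name) for name in names]
    return max(infos, key=lambda info: info.create_ts, default=None)
```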
### Restore Snapshot

Restore a snapshot to a new collection. This operation is asynchronous and returns a job ID for tracking the restore progress.

**Restore Mechanism**: Snapshot restore uses a **Copy Segment** mechanism instead of traditional bulk insert. This approach:

- Directly copies segment files (binlogs, deltalogs, index files) from snapshot storage
- Preserves Field IDs and Index IDs to ensure compatibility with existing data files
- Avoids data rewriting and index rebuilding, resulting in significantly faster restore times
- Is typically 10-100x faster than traditional backup/restore methods

**Python SDK Example:**

```python
import time

# Restore snapshot to new collection
job_id = client.restore_snapshot(
    snapshot_name="backup_20240101",
    collection_name="restored_collection",
)

# Wait for restore to complete
while True:
    state = client.get_restore_snapshot_state(job_id=job_id)
    if state.state == "RestoreSnapshotCompleted":
        print(f"Restore completed in {state.time_cost}ms")
        break
    elif state.state == "RestoreSnapshotFailed":
        print(f"Restore failed: {state.reason}")
        break
    print(f"Restore progress: {state.progress}%")
    time.sleep(1)
```

**Go SDK Example:**

```go
restoreOpt := milvusclient.NewRestoreSnapshotOption("backup_20240101", "restored_collection")
jobID, err := client.RestoreSnapshot(context.Background(), restoreOpt)
if err != nil {
    log.Fatal(err)
}

// Poll for restore completion
for {
    state, err := client.GetRestoreSnapshotState(context.Background(),
        milvusclient.NewGetRestoreSnapshotStateOption(jobID))
    if err != nil {
        log.Fatal(err)
    }
    if state.GetState() == milvuspb.RestoreSnapshotState_RestoreSnapshotCompleted {
        log.Printf("Restore completed in %dms", state.GetTimeCost())
        break
    }
    if state.GetState() == milvuspb.RestoreSnapshotState_RestoreSnapshotFailed {
        log.Fatalf("Restore failed: %s", state.GetReason())
    }
    log.Printf("Restore progress: %d%%", state.GetProgress())
    time.Sleep(time.Second)
}
```

Parameters:

- `snapshot_name` (string): Name of the snapshot to restore
- `collection_name` (string): Name of the target collection to create

Returns:

- `job_id` (int64): Restore job ID for tracking progress

### Drop Snapshot

Delete a snapshot permanently.

**Python SDK Example:**

```python
client.drop_snapshot(snapshot_name="backup_20240101")
```

**Go SDK Example:**

```go
dropOpt := milvusclient.NewDropSnapshotOption("backup_20240101")
err := client.DropSnapshot(context.Background(), dropOpt)
```

Parameters:

- `snapshot_name` (string): Name of the snapshot to drop
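Because dropping a missing snapshot surfaces a "snapshot not found" error (see Troubleshooting), cleanup scripts can guard the call. A minimal idempotent wrapper over the APIs above (`drop_snapshot_if_exists` is illustrative, not an SDK call):

```python
def drop_snapshot_if_exists(client, collection_name, snapshot_name):
    """Drop a snapshot only if it still exists, so repeated calls are safe."""
    if snapshot_name in client.list_snapshots(collection_name=collection_name):
        client.drop_snapshot(snapshot_name=snapshot_name)
```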
### Get Restore Snapshot State

Query the status and progress of a restore snapshot job.

**Python SDK Example:**

```python
state = client.get_restore_snapshot_state(job_id=12345)
print(f"Job ID: {state.job_id}")
print(f"Snapshot Name: {state.snapshot_name}")
print(f"Collection ID: {state.collection_id}")
print(f"State: {state.state}")
print(f"Progress: {state.progress}%")
if state.state == "RestoreSnapshotFailed":
    print(f"Failure Reason: {state.reason}")
print(f"Time Cost: {state.time_cost}ms")
```

**Go SDK Example:**

```go
stateOpt := milvusclient.NewGetRestoreSnapshotStateOption(12345)
state, err := client.GetRestoreSnapshotState(context.Background(), stateOpt)
if err != nil {
    log.Fatal(err)
}
fmt.Printf("Job ID: %d\n", state.GetJobId())
fmt.Printf("Snapshot Name: %s\n", state.GetSnapshotName())
fmt.Printf("Collection ID: %d\n", state.GetCollectionId())
fmt.Printf("State: %s\n", state.GetState())
fmt.Printf("Progress: %d%%\n", state.GetProgress())
if state.GetState() == milvuspb.RestoreSnapshotState_RestoreSnapshotFailed {
    fmt.Printf("Failure Reason: %s\n", state.GetReason())
}
fmt.Printf("Time Cost: %dms\n", state.GetTimeCost())
```

Parameters:

- `job_id` (int64): The restore job ID returned from RestoreSnapshot

Returns:

- `RestoreSnapshotInfo` with the following fields:
  - `job_id` (int64): Restore job ID
  - `snapshot_name` (string): Snapshot name being restored
  - `collection_id` (int64): Target collection ID
  - `state` (enum): Current state (RestoreSnapshotPending, RestoreSnapshotInProgress, RestoreSnapshotCompleted, RestoreSnapshotFailed)
  - `progress` (int32): Progress percentage (0-100)
  - `reason` (string): Error reason if failed
  - `time_cost` (uint64): Time cost in milliseconds

### List Restore Snapshot Jobs

List all restore snapshot jobs, optionally filtered by collection name.

**Python SDK Example:**

```python
# List all restore jobs
jobs = client.list_restore_snapshot_jobs()
for job in jobs:
    print(f"Job {job.job_id}: {job.snapshot_name} -> Collection {job.collection_id}")
    print(f"  State: {job.state}, Progress: {job.progress}%")

# List restore jobs for a specific collection
jobs = client.list_restore_snapshot_jobs(collection_name="my_collection")
```

**Go SDK Example:**

```go
// List all restore jobs
listOpt := milvusclient.NewListRestoreSnapshotJobsOption()
jobs, err := client.ListRestoreSnapshotJobs(context.Background(), listOpt)
if err != nil {
    log.Fatal(err)
}
for _, job := range jobs {
    fmt.Printf("Job %d: %s -> Collection %d\n",
        job.GetJobId(), job.GetSnapshotName(), job.GetCollectionId())
    fmt.Printf("  State: %s, Progress: %d%%\n", job.GetState(), job.GetProgress())
}

// List restore jobs for a specific collection
listOpt = milvusclient.NewListRestoreSnapshotJobsOption().
    WithCollectionName("my_collection")
jobs, err = client.ListRestoreSnapshotJobs(context.Background(), listOpt)
```

Parameters:

- `collection_name` (string, optional): Filter jobs by target collection name

Returns:

- List of `RestoreSnapshotInfo` objects for all matching restore jobs

## Use Cases

### 1. Data Backup and Recovery

Snapshots provide a lightweight and efficient backup solution compared to traditional tools like milvus-backup.

```python
import datetime

# Create daily backup
today = datetime.date.today().strftime("%Y%m%d")
snapshot_name = f"daily_backup_{today}"

# Recommended: Flush data to ensure all changes are persisted
client.flush(collection_name="production_vectors")

# Create snapshot
client.create_snapshot(
    collection_name="production_vectors",
    snapshot_name=snapshot_name,
    description=f"Daily backup for {today}"
)
```
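Scheduled backups like this accumulate, so pair them with a retention policy (see Storage Management Best Practices below). A minimal sketch that keeps only the most recent daily backups, assuming the `daily_backup_YYYYMMDD` naming used above (`prune_daily_backups` is an illustrative helper, not an SDK call):

```python
from datetime import date, timedelta

def prune_daily_backups(client, collection_name, keep_days=7):
    """Drop daily_backup_YYYYMMDD snapshots older than keep_days."""
    cutoff = (date.today() - timedelta(days=keep_days)).strftime("%Y%m%d")
    for name in client.list_snapshots(collection_name=collection_name):
        if name.startswith("daily_backup_") and name[len("daily_backup_"):] < cutoff:
            client.drop_snapshot(snapshot_name=name)
```

Plain string comparison suffices here because `YYYYMMDD` strings sort chronologically.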
**Comparison: Snapshot vs. milvus-backup**

| Operation | milvus-backup | Snapshot |
|-----------|---------------|----------|
| **Backup Creation** | Copies all data files | Creates metadata only (milliseconds) |
| **Restore Process** | Imports data and rebuilds indexes | Copies existing data and index files directly |
| **Performance** | Slower, resource-intensive | Fast and lightweight |
| **System Impact** | High I/O and CPU usage | Minimal impact |

**Why Snapshots are More Efficient:**

- **Creation**: Only generates snapshot metadata without copying any data files
- **Restoration**: Directly copies existing data files and index files; no data rewriting or index rebuilding needed
- **Speed**: Backup in milliseconds, restore in seconds to minutes (vs. hours for large collections)

The snapshot capability provides a foundation for significantly improving the milvus-backup tool.

### 2. Offline Data Processing with Spark

Snapshots enable efficient offline data processing by providing stable, consistent data sources for analytical workloads. Users can directly access snapshot data stored in object storage (S3) with Spark or other big data processing frameworks without impacting the live Milvus cluster.

**Key Benefits:**

- **Direct Access**: Read snapshot data directly from S3 without going through Milvus query APIs
- **Data Stability**: The snapshot mechanism ensures data remains available and unchanged during long-running batch jobs
- **No Cluster Impact**: Offline processing doesn't affect production Milvus query performance
- **Cost-Effective**: Leverage cheaper compute resources for batch analytics instead of online query nodes

**Use Case Example: Vector Similarity Analysis**

```python
from pyspark.sql import SparkSession
import datetime
import json

# Step 1: Create snapshot for offline processing
snapshot_name = f"analytics_snapshot_{datetime.date.today().strftime('%Y%m%d')}"

# Recommended: Flush data to ensure all changes are persisted
client.flush(collection_name="user_embeddings")

client.create_snapshot(
    collection_name="user_embeddings",
    snapshot_name=snapshot_name,
    description="Snapshot for daily analytics job"
)

# Step 2: Get snapshot metadata to locate data files in S3
snapshot_info = client.describe_snapshot(
    snapshot_name=snapshot_name,
    include_collection_info=True
)

# Step 3: Process snapshot data with Spark
spark = SparkSession.builder \
    .appName("VectorAnalytics") \
    .config("spark.hadoop.fs.s3a.access.key", "YOUR_ACCESS_KEY") \
    .config("spark.hadoop.fs.s3a.secret.key", "YOUR_SECRET_KEY") \
    .getOrCreate()

# Read and parse snapshot metadata to get actual file paths
# Note: This is a simplified example. In practice, you need to:
# 1. Parse the metadata JSON file to get manifest file paths
# 2. Parse manifest Avro files to get binlog/deltalog paths
# 3. Read the actual data files
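
# A hedged sketch of steps 1 and 2. Assumptions beyond this guide: the key
# holding the manifest path array in the metadata JSON ("manifests") and the
# Avro record field naming the data files ("binlog_path") are illustrative
# placeholders; check the actual layout described under "Storage Structure".
#
#   import io, boto3, fastavro
#   s3 = boto3.client("s3")
#   meta_key = f"snapshots/{collection_id}/metadata/{snapshot_id}.json"
#   meta = json.load(s3.get_object(Bucket=BUCKET, Key=meta_key)["Body"])
#   binlog_paths = []
#   for manifest_key in meta["manifests"]:
#       body = io.BytesIO(s3.get_object(Bucket=BUCKET, Key=manifest_key)["Body"].read())
#       binlog_paths += [rec["binlog_path"] for rec in fastavro.reader(body)]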

s3_path = snapshot_info.s3_location

# Example: Read binlog files directly if you know the structure
# In reality, you would parse the manifest files first
df = spark.read.format("your_format").load(f"s3a://{s3_path}/binlogs/")

# Perform analytics operations
# Example: Compute vector statistics, clustering, or quality metrics
result = df.groupBy("partition_id").agg({
    "vector_dim": "count",
    "timestamp": "max"
})

result.write.mode("overwrite").parquet("s3a://analytics-results/daily_stats/")

# Step 4: Clean up snapshot after processing completes
client.drop_snapshot(snapshot_name=snapshot_name)
```

**Common Offline Processing Scenarios:**

- **Vector Quality Analysis**: Analyze embedding distributions and detect anomalies
- **Data Migration**: Transform and migrate data between different Milvus clusters
- **ETL Pipelines**: Extract vectors for training or fine-tuning machine learning models
- **Compliance Auditing**: Generate reports on data usage and access patterns
- **Feature Engineering**: Derive new features from existing vector embeddings for downstream tasks

### 3. Data Versioning

Maintain multiple versions of data for experimentation:

```python
# Create version snapshots before major updates
# Recommended: Flush to ensure all data is captured
client.flush(collection_name="ml_embeddings")
client.create_snapshot(
    collection_name="ml_embeddings",
    snapshot_name="v1.0_baseline",
    description="Baseline model embeddings"
)

# After model update, flush and create new snapshot
client.flush(collection_name="ml_embeddings")
client.create_snapshot(
    collection_name="ml_embeddings",
    snapshot_name="v1.1_improved",
    description="Improved model embeddings"
)
```

### 4. Testing and Development

Create snapshots for testing environments:

```python
import time

# Create test data snapshot
# Recommended: Flush to ensure all test data is captured
client.flush(collection_name="test_collection")
client.create_snapshot(
    collection_name="test_collection",
    snapshot_name="test_dataset_v1",
    description="Test dataset for regression testing"
)

# Restore for testing with progress tracking
job_id = client.restore_snapshot(
    snapshot_name="test_dataset_v1",
    collection_name="test_environment"
)

# Monitor restore progress
while True:
    state = client.get_restore_snapshot_state(job_id=job_id)
    if state.state == "RestoreSnapshotCompleted":
        print(f"Test environment ready! Restored in {state.time_cost}ms")
        break
    elif state.state == "RestoreSnapshotFailed":
        print(f"Restore failed: {state.reason}")
        break
    print(f"Setting up test environment: {state.progress}%")
    time.sleep(1)
```

### 5. Managing Multiple Restore Operations

Track multiple restore jobs simultaneously:

```python
import time

# Start multiple restore operations
job_ids = []
for i in range(3):
    job_id = client.restore_snapshot(
        snapshot_name=f"snapshot_v{i}",
        collection_name=f"test_env_{i}"
    )
    job_ids.append(job_id)

# Monitor all jobs
while job_ids:
    completed = []
    for job_id in job_ids:
        state = client.get_restore_snapshot_state(job_id=job_id)
        if state.state in ["RestoreSnapshotCompleted", "RestoreSnapshotFailed"]:
            completed.append(job_id)
            print(f"Job {job_id} finished: {state.state}")
    for job_id in completed:
        job_ids.remove(job_id)
    if job_ids:
        time.sleep(1)

# Alternatively, list all restore jobs
jobs = client.list_restore_snapshot_jobs()
for job in jobs:
    print(f"Job {job.job_id}: {job.snapshot_name} - {job.progress}% ({job.state})")
```
## Best Practices

### 1. Naming Conventions

Use consistent and descriptive naming:

```python
# Good naming examples
"daily_backup_20240101"
"v2.1_production_release"
"test_dataset_regression_suite"

# Avoid generic names
"backup1", "test", "snapshot"
```

### 2. Snapshot Management

- **Regular cleanup**: Remove old snapshots to save storage
- **Documentation**: Write meaningful snapshot descriptions for future reference
- **Verification**: Always verify snapshot creation and restoration
- **Monitoring**: Track snapshot creation times and storage usage
- **Job tracking**: Store restore job IDs for monitoring and troubleshooting

### 3. Storage Considerations

- Snapshots consume storage space proportional to collection size
- Object storage costs apply for snapshot retention
- Consider compression and deduplication at the storage layer
- Plan retention policies based on business requirements

### 4. Performance Optimization

- Create snapshots during low-traffic periods
- Avoid creating multiple simultaneous snapshots
- Monitor impact on system resources

### 5. Restore Operation Management

- **Asynchronous operations**: Restore operations are asynchronous and return immediately with a job ID
- **Progress monitoring**: Always poll restore job status using the job ID for large collections
- **Timeout handling**: Implement appropriate timeout and retry logic for restore operations (see the sketch after this list)
- **Error recovery**: Check the restore job state and reason field if restoration fails
- **Resource planning**: Ensure sufficient system resources (memory, disk, CPU) before restoring large snapshots
- **Job listing**: Use ListRestoreSnapshotJobs to monitor all ongoing restore operations
- **Completion verification**: After restore completes, verify collection data integrity before using it
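A minimal polling helper with timeout handling, written against the state strings used in the examples above (`wait_for_restore` is an illustrative helper, not an SDK call):

```python
import time

def wait_for_restore(client, job_id, timeout_s=600, poll_s=2):
    """Poll a restore job until it completes, fails, or times out."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        state = client.get_restore_snapshot_state(job_id=job_id)
        if state.state == "RestoreSnapshotCompleted":
            return state
        if state.state == "RestoreSnapshotFailed":
            raise RuntimeError(f"Restore job {job_id} failed: {state.reason}")
        time.sleep(poll_s)
    raise TimeoutError(f"Restore job {job_id} still running after {timeout_s}s")
```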
### 6. Performance and Storage Considerations

#### Execution Time

**Snapshot Creation:**

- Creation typically completes in milliseconds
- The operation only generates snapshot metadata and stores it to object storage
- Lightweight operation with minimal performance impact on the system

**Snapshot Restore:**

- Restore time ranges from seconds to minutes depending on data volume
- The operation involves copying data from the snapshot to the target collection
- Key factors affecting restore time:
  - Total data volume in the snapshot (binlogs, deltalogs, index files)
  - Network bandwidth between the Milvus cluster and object storage
  - Object storage throughput limits and concurrent I/O operations
- Use the `GetRestoreSnapshotState` API to monitor restore progress for large snapshots

#### Storage Impact

**Important Storage Behavior:**

- Data and indexes referenced by snapshots are **NOT automatically garbage collected**
- Even if the original segment is dropped, snapshot-referenced data remains in object storage
- This prevents data loss but increases storage consumption

**GC Protection Mechanism:**

- **Segment-level protection**: The garbage collector checks `GetSnapshotBySegment()` before deleting any segment
  - Segments referenced by any snapshot are skipped during GC
  - Protection applies to both metadata (Etcd) and data files (S3)
- **Index-level protection**: The garbage collector checks `GetSnapshotByIndex()` before deleting index files
  - Index files referenced by snapshots are preserved even after drop index operations
  - Ensures index data availability during snapshot restore
- **Implementation location**: `internal/datacoord/garbage_collector.go`

**Storage Cost Considerations:**

- In extreme cases, a single snapshot can double your object storage costs
- Example: If the original collection uses 100GB and compaction generates new segments after you create a snapshot, object storage may hold up to 200GB, because the snapshot pins the old segments while the live collection references the new ones
- Snapshot metadata itself is minimal (typically < 1MB per snapshot)

**Storage Management Best Practices:**

- **Regular cleanup**: Explicitly drop snapshots that are no longer needed to free storage
- **Retention policies**: Define and enforce snapshot lifecycle policies (e.g., keep only the last 7 days)
- **Monitor usage**: Track object storage consumption and identify orphaned snapshot data
- **Cost planning**: Factor in potential storage doubling when planning capacity and budgets
- **Snapshot audit**: Periodically review and remove obsolete snapshots to reclaim storage space

## Limitations and Considerations

### Current Limitations

1. **Read-only snapshots**: Snapshots are immutable once created
2. **Cross-cluster restoration**: Snapshots are currently tied to the originating cluster (cross-cluster restore is not yet supported)
3. **Schema compatibility**: Restored collections maintain the original schema
4. **Resource usage**: Snapshot creation may impact system performance during metadata collection
5. **Channel/Partition matching**:
   - The restored collection's shard count must match the snapshot's channel count
   - Partition count must match (including both auto-created and user-created partitions)
6. **TTL handling**:
   - The current implementation does not automatically handle collection TTL settings
   - Restored historical data may conflict with TTL policies
   - Recommendation: Disable TTL or adjust the TTL time before restoring snapshots (see the sketch after this list)
7. **Field/Index ID preservation**:
   - The restore process uses `PreserveFieldId=true` and `PreserveIndexId=true`
   - These flags ensure compatibility between snapshot data files and the restored collection
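One way to act on the TTL recommendation in item 6, sketched with Milvus's standard collection-properties API (`alter_collection_properties` and `collection.ttl.seconds` are general Milvus features, not part of the snapshot API; whether TTL settings carry over to a restored collection is an assumption to verify in your deployment):

```python
# Hedged sketch: relax TTL on the restored collection once the restore job
# completes, so restored historical rows are not expired immediately.
# A TTL of 0 disables expiry; adjust the value instead if you still need TTL.
client.alter_collection_properties(
    collection_name="restored_collection",
    properties={"collection.ttl.seconds": 0},
)
```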
### Planning Considerations

1. **Storage costs**: Factor in long-term storage costs for snapshots
2. **Recovery time**: Larger snapshots take longer to restore (monitor via job progress)
3. **Network bandwidth**: Restoration involves data transfer
4. **Consistency model**: Snapshots reflect point-in-time consistency
5. **Asynchronous operations**: Restore operations run in the background; plan for monitoring and status checking
6. **Job management**: Keep track of restore job IDs in production environments

## Troubleshooting

### Common Issues

**Snapshot creation fails:**

- Verify the collection exists and is accessible
- Check available storage space
- Ensure proper permissions for object storage
- Verify system resources are sufficient

**Restoration fails:**

- Confirm the snapshot exists and is accessible
- Check that the target collection name doesn't already exist
- Verify sufficient system resources
- Ensure object storage connectivity
- Query the restore job state to get the specific failure reason
- Check restore job progress to identify at which stage it failed

**Performance issues:**

- Monitor system resource usage during operations
- Consider creating snapshots during maintenance windows

### Error Messages

Common error patterns and solutions:

- Error: "snapshot not found"
  - Solution: Verify the snapshot name and check whether it was deleted
- Error: "collection already exists"
  - Solution: Use a different target collection name for restoration
- Error: "insufficient storage"
  - Solution: Free up storage space or increase limits
- Error: "permission denied"
  - Solution: Check object storage credentials and permissions
- Error: "restore job not found"
  - Solution: Verify the job ID is correct; the job may have expired or been cleaned up
- Error: "restore snapshot failed" (check `state.reason` for details)
  - Solution: Query `GetRestoreSnapshotState` for the specific failure reason and address the underlying issue

## Conclusion

Milvus snapshots provide a robust solution for data backup, versioning, and recovery scenarios. By following the best practices and understanding the limitations, users can effectively leverage snapshots to ensure data durability and enable sophisticated data management workflows.

For additional support and advanced configuration options, refer to the Milvus documentation or contact the Milvus community.