milvus/tests/python_client/milvus_client/expressions/README.md

# Expression Filtering Tests

This directory contains comprehensive test modules for Milvus client expression filtering capabilities.

## Test Modules

### 1. `test_milvus_client_scalar_expression_filtering_optimized.py`
**Primary test module for comprehensive scalar expression filtering**

**Features:**
- Tests all Milvus-supported scalar data types (INT8, INT16, INT32, INT64, BOOL, FLOAT, DOUBLE, VARCHAR, ARRAY, JSON)
- Covers all operators: Comparison (==, !=, >, <, >=, <=), Range (IN, LIKE), Arithmetic (+, -, *, /, %, **), Logical (AND, OR, NOT), Null (IS NULL, IS NOT NULL)
- Single collection design with multiple index types for efficiency
- Index consistency verification (same results for indexed vs non-indexed fields)
- Comprehensive error handling and failure debugging
- Automatic reproduction script generation
- Test complex Json expression (JSON[JSON], JSON[LIST[JSON]], JSON[JSON[LIST]], etc)

**Key Design:**
- One collection containing all data types
- Each data type has multiple fields representing different index types
- 10% of data is NULL to test IS NULL/IS NOT NULL operators
- Specific VARCHAR patterns: `str_xxx`, `xxx_str`, `xxx_str_xxx`
- Comprehensive LIKE pattern coverage with escape handling
- Create examples of typed, dynamic, and shared keys in json
- Generate expressions to valida query result

### 2. `test_milvus_client_scalar_expression_filtering.py`
**Legacy comprehensive scalar expression filtering test**

**Features:**
- Original comprehensive test implementation
- Multiple collection approach
- Extensive test coverage for all data types and operators
- Detailed validation logic

### 3. `test_milvus_client_random_expression_generator.py`
**Random expression generation for edge case testing**

**Features:**
- Generates random complex expressions
- Tests edge cases and unusual combinations
- Stress testing for expression parsing
- Random data generation with various patterns

## Data Type Coverage

### Supported Scalar Types
- **Numeric**: INT8, INT16, INT32, INT64, FLOAT, DOUBLE
- **Boolean**: BOOL
- **String**: VARCHAR
- **Array**: ARRAY (with all element types)
- **JSON**: JSON (with complex nested structures)

### Array Element Types
- All scalar types: INT8, INT16, INT32, INT64, BOOL, FLOAT, DOUBLE, VARCHAR

## Operator Coverage

### Comparison Operators
- `==`, `!=`, `>`, `<`, `>=`, `<=`

### Range Operators
- `IN` (with array indexing support)
- `LIKE` (with comprehensive pattern coverage)

### Arithmetic Operators
- `+`, `-`, `*`, `/`, `%`, `**`

### Logical Operators
- `AND`, `OR`, `NOT`

### Null Operators
- `IS NULL`, `IS NOT NULL`

### Array Functions
- Array indexing: `field[index]`

### JSON Functions
- JSON key access: `field['key']`

## Index Type Support

### Scalar Index Types
| Data Types                                               | INVERTED | BITMAP | STL_SORT | Trie | NGRAM | AUTOINDEX |
|:---------------------------------------------------------|:--------:|:------:|:--------:|:----:|:-----:|:---------:|
| INT8, INT16, INT32, INT64                                |   yes    |  yes   |   yes    |  no  |  no   |    yes    |
| BOOL                                                     |   yes    |  yes   |    no    |  no  |  no   |    yes    |
| FLOAT, DOUBLE                                            |   yes    |   no   |   yes    |  no  |  no   |    yes    |
| VARCHAR                                                  |   yes    |  yes   |    no    | yes  |  yes  |    yes    |
| JSON                                                     |   yes    |   no   |    no    |  no  |  yes* |    yes    |
| ARRAY (elements: BOOL, INT8, INT16, INT32, INT64, VARCHAR) |   yes    |  yes   |    no    |  no  |  no   |    yes    |
| ARRAY (elements: FLOAT, DOUBLE)                          |   yes    |   no   |    no    |  no  |  no   |    yes    |

*JSON fields require `json_path` and `json_cast_type: "varchar"` parameters for NGRAM index

### NGRAM Index Specific Features

The NGRAM index is specialized for efficient text partial matching and fuzzy search on VARCHAR and JSON fields.

**Supported Fields:**
- **VARCHAR**: Direct text content indexing
- **JSON**: Requires `json_path` parameter to specify the JSON field path (e.g., `field_name['key']`)

**Index Parameters:**
- `min_gram`: Minimum n-gram length (required, positive integer)
- `max_gram`: Maximum n-gram length (required, positive integer, ≥ min_gram)
- `json_path`: JSON field path for JSON fields (e.g., `"json_field['body']"`)
- `json_cast_type`: Must be `"varchar"` for JSON fields

**Performance Characteristics:**
- Optimized for LIKE queries with `%` and `_` wildcards
- Two-phase query execution: n-gram filtering + secondary validation
- Query strings shorter than `min_gram` fall back to full table scan
- Supports multilingual text including Chinese, Japanese, and Korean

**Example Index Creation:**
```python
# VARCHAR field
index_params.add_index(
    field_name="content",
    index_type="NGRAM",
    params={"min_gram": 2, "max_gram": 3}
)

# JSON field
index_params.add_index(
    field_name="json_field",
    index_type="NGRAM",
    params={
        "min_gram": 2,
        "max_gram": 3,
        "json_path": "json_field['body']",
        "json_cast_type": "varchar"
    }
)
```

## Test Features

### Error Handling
- Parsing error detection and skipping
- Graceful handling of unsupported expressions
- Detailed error reporting

### Debugging Support
- Automatic debug info saving on failure
- Parquet file export for test data
- Reproduction script generation
- Schema and configuration preservation

### Validation Logic
- Ground truth calculation using Python lambdas
- Result count and ID verification
- Index consistency verification

### LIKE Pattern Coverage
- Prefix patterns: `str%`
- Suffix patterns: `%str`
- Contains patterns: `%str%`
- Single character wildcard: `str_`, `_str`
- Combination patterns: `str_%`, `%_str`
- Escape patterns: `str\%`, `str\_`

**NGRAM Index Optimization:**
- LIKE queries on VARCHAR and JSON fields with NGRAM index are automatically optimized
- Query performance significantly improves for pattern matching operations
- Supports all LIKE patterns with `%` and `_` wildcards
- Automatic fallback to full scan when query length < `min_gram`

## Usage

### Running Tests
```bash
# Run optimized test
pytest test_milvus_client_scalar_expression_filtering_optimized.py

# Run legacy comprehensive test
pytest test_milvus_client_scalar_expression_filtering.py

# Run random expression generator
pytest test_milvus_client_random_expression_generator.py

# Run NGRAM index specific tests
pytest ../../testcases/indexes/test_ngram.py
```

### Debug Information
On test failure, debug information is automatically saved to `/tmp/ci_logs/`:
- Test data as Parquet files
- Collection schema and configuration
- Failed expressions list
- Reproduction script

### Reproduction Script
The generated reproduction script can:
- Rebuild the entire test environment
- Recreate schema, data, and indexes
- Re-run failed expressions
- Validate results

## Design Principles

1. **Comprehensive Coverage**: Test all supported data types, operators, and index types (including NGRAM)
2. **Efficiency**: Single collection design for optimal performance
3. **Reliability**: Robust error handling and debugging
4. **Maintainability**: Clear code structure and documentation
5. **Reproducibility**: Automatic failure reproduction capabilities
6. **Index Optimization**: Validate performance improvements with specialized indexes like NGRAM