mirror of
https://gitee.com/milvus-io/milvus.git
synced 2025-12-28 22:45:26 +08:00
<test>: <add test case for complex json expression
On branch feature/json-shredding
Changes to be committed:
modified: milvus_client/expressions/README.md
modified:
milvus_client/expressions/test_milvus_client_scalar_filtering.py
---------
Signed-off-by: Eric Hou <eric.hou@zilliz.com>
Co-authored-by: Eric Hou <eric.hou@zilliz.com>
210 lines
7.4 KiB
Markdown
210 lines
7.4 KiB
Markdown
# Expression Filtering Tests
|
|
|
|
This directory contains comprehensive test modules for Milvus client expression filtering capabilities.
|
|
|
|
## Test Modules
|
|
|
|
### 1. `test_milvus_client_scalar_expression_filtering_optimized.py`
|
|
**Primary test module for comprehensive scalar expression filtering**
|
|
|
|
**Features:**
|
|
- Tests all Milvus-supported scalar data types (INT8, INT16, INT32, INT64, BOOL, FLOAT, DOUBLE, VARCHAR, ARRAY, JSON)
|
|
- Covers all operators: Comparison (==, !=, >, <, >=, <=), Range (IN, LIKE), Arithmetic (+, -, *, /, %, **), Logical (AND, OR, NOT), Null (IS NULL, IS NOT NULL)
|
|
- Single collection design with multiple index types for efficiency
|
|
- Index consistency verification (same results for indexed vs non-indexed fields)
|
|
- Comprehensive error handling and failure debugging
|
|
- Automatic reproduction script generation
|
|
- Test complex Json expression (JSON[JSON], JSON[LIST[JSON]], JSON[JSON[LIST]], etc)
|
|
|
|
**Key Design:**
|
|
- One collection containing all data types
|
|
- Each data type has multiple fields representing different index types
|
|
- 10% of data is NULL to test IS NULL/IS NOT NULL operators
|
|
- Specific VARCHAR patterns: `str_xxx`, `xxx_str`, `xxx_str_xxx`
|
|
- Comprehensive LIKE pattern coverage with escape handling
|
|
- Create examples of typed, dynamic, and shared keys in json
|
|
- Generate expressions to valida query result
|
|
|
|
### 2. `test_milvus_client_scalar_expression_filtering.py`
|
|
**Legacy comprehensive scalar expression filtering test**
|
|
|
|
**Features:**
|
|
- Original comprehensive test implementation
|
|
- Multiple collection approach
|
|
- Extensive test coverage for all data types and operators
|
|
- Detailed validation logic
|
|
|
|
### 3. `test_milvus_client_random_expression_generator.py`
|
|
**Random expression generation for edge case testing**
|
|
|
|
**Features:**
|
|
- Generates random complex expressions
|
|
- Tests edge cases and unusual combinations
|
|
- Stress testing for expression parsing
|
|
- Random data generation with various patterns
|
|
|
|
## Data Type Coverage
|
|
|
|
### Supported Scalar Types
|
|
- **Numeric**: INT8, INT16, INT32, INT64, FLOAT, DOUBLE
|
|
- **Boolean**: BOOL
|
|
- **String**: VARCHAR
|
|
- **Array**: ARRAY (with all element types)
|
|
- **JSON**: JSON (with complex nested structures)
|
|
|
|
### Array Element Types
|
|
- All scalar types: INT8, INT16, INT32, INT64, BOOL, FLOAT, DOUBLE, VARCHAR
|
|
|
|
## Operator Coverage
|
|
|
|
### Comparison Operators
|
|
- `==`, `!=`, `>`, `<`, `>=`, `<=`
|
|
|
|
### Range Operators
|
|
- `IN` (with array indexing support)
|
|
- `LIKE` (with comprehensive pattern coverage)
|
|
|
|
### Arithmetic Operators
|
|
- `+`, `-`, `*`, `/`, `%`, `**`
|
|
|
|
### Logical Operators
|
|
- `AND`, `OR`, `NOT`
|
|
|
|
### Null Operators
|
|
- `IS NULL`, `IS NOT NULL`
|
|
|
|
### Array Functions
|
|
- Array indexing: `field[index]`
|
|
|
|
### JSON Functions
|
|
- JSON key access: `field['key']`
|
|
|
|
## Index Type Support
|
|
|
|
### Scalar Index Types
|
|
| Data Types | INVERTED | BITMAP | STL_SORT | Trie | NGRAM | AUTOINDEX |
|
|
|:---------------------------------------------------------|:--------:|:------:|:--------:|:----:|:-----:|:---------:|
|
|
| INT8, INT16, INT32, INT64 | yes | yes | yes | no | no | yes |
|
|
| BOOL | yes | yes | no | no | no | yes |
|
|
| FLOAT, DOUBLE | yes | no | yes | no | no | yes |
|
|
| VARCHAR | yes | yes | no | yes | yes | yes |
|
|
| JSON | yes | no | no | no | yes* | yes |
|
|
| ARRAY (elements: BOOL, INT8, INT16, INT32, INT64, VARCHAR) | yes | yes | no | no | no | yes |
|
|
| ARRAY (elements: FLOAT, DOUBLE) | yes | no | no | no | no | yes |
|
|
|
|
*JSON fields require `json_path` and `json_cast_type: "varchar"` parameters for NGRAM index
|
|
|
|
### NGRAM Index Specific Features
|
|
|
|
The NGRAM index is specialized for efficient text partial matching and fuzzy search on VARCHAR and JSON fields.
|
|
|
|
**Supported Fields:**
|
|
- **VARCHAR**: Direct text content indexing
|
|
- **JSON**: Requires `json_path` parameter to specify the JSON field path (e.g., `field_name['key']`)
|
|
|
|
**Index Parameters:**
|
|
- `min_gram`: Minimum n-gram length (required, positive integer)
|
|
- `max_gram`: Maximum n-gram length (required, positive integer, ≥ min_gram)
|
|
- `json_path`: JSON field path for JSON fields (e.g., `"json_field['body']"`)
|
|
- `json_cast_type`: Must be `"varchar"` for JSON fields
|
|
|
|
**Performance Characteristics:**
|
|
- Optimized for LIKE queries with `%` and `_` wildcards
|
|
- Two-phase query execution: n-gram filtering + secondary validation
|
|
- Query strings shorter than `min_gram` fall back to full table scan
|
|
- Supports multilingual text including Chinese, Japanese, and Korean
|
|
|
|
**Example Index Creation:**
|
|
```python
|
|
# VARCHAR field
|
|
index_params.add_index(
|
|
field_name="content",
|
|
index_type="NGRAM",
|
|
params={"min_gram": 2, "max_gram": 3}
|
|
)
|
|
|
|
# JSON field
|
|
index_params.add_index(
|
|
field_name="json_field",
|
|
index_type="NGRAM",
|
|
params={
|
|
"min_gram": 2,
|
|
"max_gram": 3,
|
|
"json_path": "json_field['body']",
|
|
"json_cast_type": "varchar"
|
|
}
|
|
)
|
|
```
|
|
|
|
## Test Features
|
|
|
|
### Error Handling
|
|
- Parsing error detection and skipping
|
|
- Graceful handling of unsupported expressions
|
|
- Detailed error reporting
|
|
|
|
### Debugging Support
|
|
- Automatic debug info saving on failure
|
|
- Parquet file export for test data
|
|
- Reproduction script generation
|
|
- Schema and configuration preservation
|
|
|
|
### Validation Logic
|
|
- Ground truth calculation using Python lambdas
|
|
- Result count and ID verification
|
|
- Index consistency verification
|
|
|
|
### LIKE Pattern Coverage
|
|
- Prefix patterns: `str%`
|
|
- Suffix patterns: `%str`
|
|
- Contains patterns: `%str%`
|
|
- Single character wildcard: `str_`, `_str`
|
|
- Combination patterns: `str_%`, `%_str`
|
|
- Escape patterns: `str\%`, `str\_`
|
|
|
|
**NGRAM Index Optimization:**
|
|
- LIKE queries on VARCHAR and JSON fields with NGRAM index are automatically optimized
|
|
- Query performance significantly improves for pattern matching operations
|
|
- Supports all LIKE patterns with `%` and `_` wildcards
|
|
- Automatic fallback to full scan when query length < `min_gram`
|
|
|
|
## Usage
|
|
|
|
### Running Tests
|
|
```bash
|
|
# Run optimized test
|
|
pytest test_milvus_client_scalar_expression_filtering_optimized.py
|
|
|
|
# Run legacy comprehensive test
|
|
pytest test_milvus_client_scalar_expression_filtering.py
|
|
|
|
# Run random expression generator
|
|
pytest test_milvus_client_random_expression_generator.py
|
|
|
|
# Run NGRAM index specific tests
|
|
pytest ../../testcases/indexes/test_ngram.py
|
|
```
|
|
|
|
### Debug Information
|
|
On test failure, debug information is automatically saved to `/tmp/ci_logs/`:
|
|
- Test data as Parquet files
|
|
- Collection schema and configuration
|
|
- Failed expressions list
|
|
- Reproduction script
|
|
|
|
### Reproduction Script
|
|
The generated reproduction script can:
|
|
- Rebuild the entire test environment
|
|
- Recreate schema, data, and indexes
|
|
- Re-run failed expressions
|
|
- Validate results
|
|
|
|
## Design Principles
|
|
|
|
1. **Comprehensive Coverage**: Test all supported data types, operators, and index types (including NGRAM)
|
|
2. **Efficiency**: Single collection design for optimal performance
|
|
3. **Reliability**: Robust error handling and debugging
|
|
4. **Maintainability**: Clear code structure and documentation
|
|
5. **Reproducibility**: Automatic failure reproduction capabilities
|
|
6. **Index Optimization**: Validate performance improvements with specialized indexes like NGRAM
|