issue: https://github.com/milvus-io/milvus/issues/45525

See the added README.md for the added optimizations.

## Summary by CodeRabbit

* **New Features**
  * Added query expression optimization feature with a new `optimizeExpr` configuration flag to enable automatic simplification of filter predicates, including range predicate optimization, merging of IN/NOT IN conditions, and flattening of nested logical operators.
* **Bug Fixes**
  * Adjusted delete operation behavior to correctly handle expression evaluation.

Signed-off-by: Buqian Zheng <zhengbuqian@gmail.com>
Expression Rewriter (planparserv2/rewriter)
This module performs rule-based logical rewrites on parsed planpb.Expr trees right after template value filling and before planning/execution.
Entry
- `RewriteExpr(*planpb.Expr) *planpb.Expr` (in `entry.go`)
  - Recursively visits the expression tree and applies a set of composable, side-effect-free rewrite rules.
  - Uses global configuration from `paramtable.Get().CommonCfg.EnabledOptimizeExpr`
- `RewriteExprWithConfig(*planpb.Expr, bool) *planpb.Expr` (in `entry.go`)
  - Same as `RewriteExpr` but allows custom configuration for testing or special cases.
Configuration
The rewriter can be configured via the following parameter (refreshable at runtime):
| Parameter | Default | Description |
|---|---|---|
| `common.enabledOptimizeExpr` | `true` | Enable query expression optimization, including range simplification, IN/NOT IN merge, TEXT_MATCH merge, and all other optimizations |
IMPORTANT: IN/NOT IN value list sorting and deduplication always runs regardless of this configuration setting, because the execution engine depends on sorted value lists.
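The split between always-on normalization and flag-gated optimization can be sketched as follows. This is only an illustration under simplifying assumptions: `inExpr`, `normalize`, and `rewriteWithConfig` are hypothetical names, and the real rewriter walks `planpb.Expr` trees rather than a flat slice of operands.

```go
package main

import (
	"fmt"
	"sort"
)

// inExpr is a hypothetical stand-in for an IN predicate on one column.
type inExpr struct {
	Column string
	Values []int64
}

// normalize sorts and deduplicates the value list. In this sketch it always
// runs, mirroring the note that the execution engine expects sorted lists.
func normalize(e inExpr) inExpr {
	sorted := append([]int64(nil), e.Values...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	unique := make([]int64, 0, len(sorted))
	for _, v := range sorted {
		if len(unique) == 0 || unique[len(unique)-1] != v {
			unique = append(unique, v)
		}
	}
	return inExpr{Column: e.Column, Values: unique}
}

// rewriteWithConfig treats ins as the operands of one OR over the same column.
// Normalization is unconditional; the union merge only runs when optimize is set.
func rewriteWithConfig(ins []inExpr, optimize bool) []inExpr {
	normalized := make([]inExpr, 0, len(ins))
	for _, e := range ins {
		normalized = append(normalized, normalize(e))
	}
	if !optimize || len(normalized) < 2 {
		return normalized
	}
	merged := inExpr{Column: normalized[0].Column}
	for _, e := range normalized {
		merged.Values = append(merged.Values, e.Values...)
	}
	return []inExpr{normalize(merged)}
}

func main() {
	ins := []inExpr{{"a", []int64{3, 1, 3}}, {"a", []int64{2, 1}}}
	fmt.Println(rewriteWithConfig(ins, false)) // two operands, each sorted and deduplicated
	fmt.Println(rewriteWithConfig(ins, true))  // one merged operand
}
```

With the flag off, both operands survive but their value lists are still sorted and deduplicated; with the flag on, they collapse into a single list.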
Implemented Rules
- IN / NOT IN normalization and merges (`term_in.go`)
  - OR-equals to IN (same column): `a == v1 OR a == v2 ...` → `a IN (v1, v2, ...)`
    - Numeric columns only merge when count > threshold (default 150); others when count > 1.
  - AND-not-equals to NOT IN (same column): `a != v1 AND a != v2 ...` → `NOT (a IN (v1, v2, ...))`
    - Same thresholds as above.
  - IN vs Equal redundancy elimination (same column):
    - AND: `(a ∈ S) AND (a = v)`:
      - if `v ∈ S` → `a = v`
      - if `v ∉ S` → contradiction → constant `false`
    - OR: `(a ∈ S) OR (a = v)` → `a ∈ (S ∪ {v})` (always union)
  - IN with IN merge (union or intersection):
    - OR: `(a ∈ S1) OR (a ∈ S2)` → `a ∈ (S1 ∪ S2)` with sorting/dedup
    - AND: `(a ∈ S1) AND (a ∈ S2)` → `a ∈ (S1 ∩ S2)`; empty intersection → constant `false`
  - Sort and deduplicate `IN`/`NOT IN` value lists (supported types: bool, int64, float64, string).
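A few of the IN rules above can be sketched in isolation. The helper names below are hypothetical, the constant simply mirrors the documented default of 150, and the sketch assumes the count is the number of equality operands as written and that IN lists have already been normalized (sorted, duplicate-free).

```go
package main

import (
	"fmt"
	"sort"
)

// numericLimit mirrors the documented default threshold; the name is illustrative.
const numericLimit = 150

// shouldConvertToIn applies the documented thresholds: numeric columns need
// more than numericLimit equality operands, other columns need more than one.
func shouldConvertToIn(count int, numeric bool) bool {
	if numeric {
		return count > numericLimit
	}
	return count > 1
}

// intersectSorted intersects two normalized IN lists with a two-pointer walk.
// An empty result corresponds to the constant-false rewrite.
func intersectSorted(s1, s2 []int64) []int64 {
	out := []int64{}
	i, j := 0, 0
	for i < len(s1) && j < len(s2) {
		switch {
		case s1[i] < s2[j]:
			i++
		case s1[i] > s2[j]:
			j++
		default:
			out = append(out, s1[i])
			i++
			j++
		}
	}
	return out
}

// inAndEqual resolves `(a IN set) AND (a == v)` on a normalized set.
// true means the conjunction reduces to `a == v`; false means contradiction.
func inAndEqual(set []int64, v int64) bool {
	i := sort.Search(len(set), func(k int) bool { return set[k] >= v })
	return i < len(set) && set[i] == v
}

func main() {
	fmt.Println(shouldConvertToIn(3, true), shouldConvertToIn(3, false))
	fmt.Println(intersectSorted([]int64{1, 2, 3}, []int64{2, 3, 4}))
	fmt.Println(inAndEqual([]int64{1, 3, 5}, 3), inAndEqual([]int64{1, 3, 5}, 4))
}
```

The two-pointer intersection relies on the normalization guarantee, which is one reason to normalize every IN list as soon as it is built.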
- TEXT_MATCH OR merge (`text_match.go`)
  - Merge ORs of `TEXT_MATCH(field, "literal")` on the same column (no options):
    - Concatenate literals with a single space in the order they appear; no tokenization, deduplication, or sorting is performed.
    - Example: `TEXT_MATCH(f, "A C") OR TEXT_MATCH(f, "B D")` → `TEXT_MATCH(f, "A C B D")`
  - If any `TEXT_MATCH` in the group has options (e.g., `minimum_should_match`), this optimization is skipped for that group.
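The TEXT_MATCH merge amounts to a guarded string join. The sketch below assumes a hypothetical `textMatch` struct in place of the real expression node and treats the presence of any option as a single boolean.

```go
package main

import (
	"fmt"
	"strings"
)

// textMatch is a hypothetical stand-in for a TEXT_MATCH call.
type textMatch struct {
	Field      string
	Literal    string
	HasOptions bool // e.g. minimum_should_match was supplied
}

// mergeTextMatchOr merges OR operands that target one field. It returns
// ok=false, leaving the group untouched, when there are fewer than two
// operands, the fields differ, or any operand carries options.
func mergeTextMatchOr(group []textMatch) (merged textMatch, ok bool) {
	if len(group) < 2 {
		return textMatch{}, false
	}
	literals := make([]string, 0, len(group))
	for _, m := range group {
		if m.HasOptions || m.Field != group[0].Field {
			return textMatch{}, false
		}
		literals = append(literals, m.Literal)
	}
	// Literals are joined in appearance order with a single space;
	// nothing is tokenized, deduplicated, or sorted.
	return textMatch{Field: group[0].Field, Literal: strings.Join(literals, " ")}, true
}

func main() {
	group := []textMatch{{Field: "f", Literal: "A C"}, {Field: "f", Literal: "B D"}}
	merged, ok := mergeTextMatchOr(group)
	fmt.Println(merged.Literal, ok)
}
```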
- Range predicate simplification (`range.go`)
  - AND tighten (same column):
    - Lower bounds: `a > 10 AND a > 20` → `a > 20` (pick strongest lower)
    - Upper bounds: `a < 50 AND a < 60` → `a < 50` (pick strongest upper)
    - Mixed lower and upper: `a > 10 AND a < 50` → `10 < a < 50` (BinaryRangeExpr)
    - Inclusion respected (`>`, `>=`, `<`, `<=`). On ties, exclusive is considered stronger than inclusive for tightening.
  - OR weaken (same column, same direction):
    - Lower bounds: `a > 10 OR a > 20` → `a > 10` (pick weakest lower)
    - Upper bounds: `a < 10 OR a < 20` → `a < 20` (pick weakest upper)
    - Inclusion respected, preferring inclusive for weakening in ties.
  - Mixed-direction OR (lower vs upper) is not merged.
  - Equivalent-bound collapses (same column, same value):
    - AND: `a ≥ x AND a > x` → `a > x`; `a ≤ y AND a < y` → `a < y`
    - OR: `a ≥ x OR a > x` → `a ≥ x`; `a ≤ y OR a < y` → `a ≤ y`
    - Symmetric dedup: `a > 10 AND a ≥ 10` → `a > 10`; `a < 5 OR a ≤ 5` → `a ≤ 5`
  - IN ∩ range filtering:
    - AND: `(a ∈ {…}) AND (range)` → keep only values in the set that satisfy the range
      - e.g., `{1,3,5} AND a > 3` → `{5}`
  - Supported columns for range optimization:
    - Scalar: Int8/Int16/Int32/Int64, Float/Double, VarChar
    - Array element access: when indexing an element (e.g., `ArrayInt[0]`), the element type above applies
    - JSON/dynamic fields with nested paths (e.g., `JSONField["price"]`, `$meta["age"]`) are range-optimized
      - Type determined from literal value (int, float, string)
      - Numeric types (int and float) are compatible and normalized to Double for merging
      - Different type categories are not merged (e.g., `json["a"] > 10` and `json["a"] > "hello"` remain separate)
      - Bool literals are not optimized (no meaningful ranges)
  - Literal compatibility:
    - Integer columns require integer literals (e.g., `Int64Field > 10`)
    - Float/Double columns accept both integer and float literals (e.g., `FloatField > 10` or `> 10.5`)
  - Column identity:
    - Merges only happen within the same `ColumnInfo` (including nested path and element index). For example, `ArrayInt[0]` and `ArrayInt[1]` are different columns and are not merged with each other.
  - BinaryRangeExpr merging:
    - AND: Merge multiple `BinaryRangeExpr` nodes on the same column to compute intersection (max lower, min upper)
      - `(10 < x < 50) AND (20 < x < 40)` → `(20 < x < 40)`
      - Empty intersection → constant `false`
    - AND with UnaryRangeExpr: Update appropriate bound of `BinaryRangeExpr`
      - `(10 < x < 50) AND (x > 30)` → `(30 < x < 50)`
    - OR: Merge overlapping or adjacent `BinaryRangeExpr` nodes into wider interval
      - `(10 < x < 25) OR (20 < x < 40)` → `(10 < x < 40)` (overlapping)
      - `(10 < x <= 20) OR (20 <= x < 30)` → `(10 < x < 30)` (adjacent with inclusive)
      - Disjoint intervals remain separate: `(10 < x < 20) OR (30 < x < 40)` → remains as OR
    - Inclusivity handling: AND prefers exclusive on equal bounds (stronger), OR prefers inclusive (weaker)
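The AND-side range rules (tightening, interval intersection, and IN ∩ range filtering) reduce to a few comparisons. The sketch below uses hypothetical `bound` and `interval` types with `float64` values; the actual module operates on `UnaryRangeExpr` and `BinaryRangeExpr` nodes and handles integer and string literals as well.

```go
package main

import "fmt"

// bound is a hypothetical one-sided bound on a single column.
type bound struct {
	Value     float64
	Inclusive bool
}

// interval is a hypothetical stand-in for a BinaryRangeExpr.
type interval struct {
	Lower, Upper bound
}

// tightenLower picks the stronger of two lower bounds for an AND.
// On equal values the exclusive bound wins, because a > x implies a >= x.
func tightenLower(p, q bound) bound {
	switch {
	case p.Value > q.Value:
		return p
	case q.Value > p.Value:
		return q
	default:
		return bound{Value: p.Value, Inclusive: p.Inclusive && q.Inclusive}
	}
}

// tightenUpper is the mirror image for upper bounds.
func tightenUpper(p, q bound) bound {
	switch {
	case p.Value < q.Value:
		return p
	case q.Value < p.Value:
		return q
	default:
		return bound{Value: p.Value, Inclusive: p.Inclusive && q.Inclusive}
	}
}

// intersect computes (max lower, min upper). ok is false when no value can
// satisfy both intervals, which corresponds to the constant-false rewrite.
func intersect(x, y interval) (result interval, ok bool) {
	result = interval{
		Lower: tightenLower(x.Lower, y.Lower),
		Upper: tightenUpper(x.Upper, y.Upper),
	}
	lo, hi := result.Lower, result.Upper
	if lo.Value > hi.Value || (lo.Value == hi.Value && !(lo.Inclusive && hi.Inclusive)) {
		return interval{}, false
	}
	return result, true
}

// filterIn keeps the IN values that satisfy a lower bound.
func filterIn(values []float64, lower bound) []float64 {
	kept := []float64{}
	for _, v := range values {
		if v > lower.Value || (lower.Inclusive && v == lower.Value) {
			kept = append(kept, v)
		}
	}
	return kept
}

func main() {
	a := interval{bound{10, false}, bound{50, false}}
	b := interval{bound{20, false}, bound{40, false}}
	fmt.Println(intersect(a, b))
	fmt.Println(filterIn([]float64{1, 3, 5}, bound{3, false}))
}
```

The OR-side weakening is the dual: pick the smaller lower bound (or larger upper bound) and combine inclusivity with `||` instead of `&&` on ties.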
General Notes
- All merges require operands to target the same column (same `ColumnInfo`, including nested path/element type).
- Rewrite runs after template value filling; template placeholders do not appear here.
- Sorting/dedup for IN/NOT IN is deterministic; duplicates are removed post-sort.
- Numeric-threshold for OR→IN / AND≠→NOT IN is defined in `util.go` (`defaultConvertOrToInNumericLimit`, default 150).
Pass Ordering (current)
- OR branch:
  - Flatten
  - OR `==` → IN
  - TEXT_MATCH merge (no options)
  - Range weaken (same-direction bounds)
  - BinaryRangeExpr merge (overlapping/adjacent intervals)
  - IN with `!=` short-circuiting
  - IN ∪ IN union
  - IN vs Equal redundancy elimination
  - Fold back to BinaryExpr
- AND branch:
  - Flatten
  - Range tighten / interval construction
  - BinaryRangeExpr merge (intersection, also with UnaryRangeExpr)
  - IN ∩ IN intersection (if any)
  - IN with `!=` filtering
  - IN ∩ range filtering
  - IN vs Equal redundancy elimination
  - AND `!=` → NOT IN
  - Fold back to BinaryExpr
Every IN node built by these passes is normalized (sorted and deduplicated). The TEXT_MATCH OR merge concatenates literals with a single space; no tokenization, deduplication, or sorting is performed.
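Both branches start with "Flatten" and end with "Fold back to BinaryExpr". A minimal sketch of those two bracketing steps follows, using a hypothetical `node` type rather than `planpb.Expr`. Rebuilding a left-deep tree is an assumption made here for illustration; the module may fold in a different shape.

```go
package main

import "fmt"

// node is a hypothetical expression node: a leaf predicate when Op is empty,
// otherwise a binary logical operator ("AND" or "OR") with two children.
type node struct {
	Op          string
	Leaf        string
	Left, Right *node
}

// flatten collects the operands of a chain of the same operator, so
// (p OR q) OR (r OR s) yields [p q r s]. Sub-trees with a different
// operator are kept whole as single operands.
func flatten(n *node, op string) []*node {
	if n.Op != op {
		return []*node{n}
	}
	return append(flatten(n.Left, op), flatten(n.Right, op)...)
}

// foldBack rebuilds a left-deep binary tree from a non-empty operand list.
func foldBack(operands []*node, op string) *node {
	result := operands[0]
	for _, next := range operands[1:] {
		result = &node{Op: op, Left: result, Right: next}
	}
	return result
}

// render prints the tree with explicit parentheses.
func render(n *node) string {
	if n.Op == "" {
		return n.Leaf
	}
	return "(" + render(n.Left) + " " + n.Op + " " + render(n.Right) + ")"
}

func main() {
	tree := &node{
		Op:    "OR",
		Left:  &node{Op: "OR", Left: &node{Leaf: "p"}, Right: &node{Leaf: "q"}},
		Right: &node{Op: "OR", Left: &node{Leaf: "r"}, Right: &node{Leaf: "s"}},
	}
	operands := flatten(tree, "OR")
	// The merge passes listed above would run on operands here.
	fmt.Println(len(operands), render(foldBack(operands, "OR")))
}
```

Working on the flat operand list is what lets each pass group predicates by column without caring how the user parenthesized the original filter.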
File Structure
- `entry.go`: rewrite entry and visitor orchestration
- `util.go`: shared helpers (column keying, value classification, sorting/dedup, constructors)
- `term_in.go`: IN/NOT IN normalization and conversions
- `text_match.go`: TEXT_MATCH OR merge (no options)
- `range.go`: range tightening/weakening and interval construction
Future Extensions
- More IN-range algebra (e.g., `IN` vs exact equality propagation across subtrees).
- Merging phrase_match or other string ops with clearly-defined token rules.
- More algebraic simplifications around equality and null checks:
  - Contradiction detection: `(a == 1) AND (a == 2)` → `false`; `(a > 10) AND (a == 5)` → `false`
  - Tautology detection: `(a > 10) OR (a <= 10)` → `true` (for non-NULL values)
  - Absorption laws: `(a > 10) OR ((a > 10) AND (b > 20))` → `a > 10`
- Advanced BinaryRangeExpr merging:
  - OR with 3+ intervals: Currently limited to 2 intervals. Full interval merging algorithm needed for `(10 < x < 20) OR (15 < x < 25) OR (22 < x < 30)` → `(10 < x < 30)`.
  - OR with unbounded + bounded: Currently skipped. Could optimize `(x > 10) OR (5 < x < 15)` → `x > 5`.
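One possible shape for the full interval merge mentioned above is the classic sort-and-sweep. This is a sketch of a candidate approach, not something the module currently implements; the `interval` type and `mergeOr` function are hypothetical, bounds are `float64`, and every input interval is assumed to be non-empty and bounded on both sides.

```go
package main

import (
	"fmt"
	"sort"
)

// interval is a hypothetical bounded range on one column.
type interval struct {
	Lo, Hi       float64
	LoInc, HiInc bool
}

// mergeOr sorts intervals by lower bound and sweeps once, merging every
// interval that overlaps or touches the current one. Two intervals touch
// when they share an endpoint and at least one side includes it.
func mergeOr(in []interval) []interval {
	if len(in) == 0 {
		return nil
	}
	s := append([]interval(nil), in...)
	sort.Slice(s, func(i, j int) bool {
		if s[i].Lo != s[j].Lo {
			return s[i].Lo < s[j].Lo
		}
		return s[i].LoInc && !s[j].LoInc // inclusive lower bound sorts first
	})
	out := []interval{s[0]}
	for _, next := range s[1:] {
		cur := &out[len(out)-1]
		connected := next.Lo < cur.Hi ||
			(next.Lo == cur.Hi && (next.LoInc || cur.HiInc))
		if !connected {
			out = append(out, next)
			continue
		}
		// Extend the upper bound; on ties OR prefers inclusive (weaker).
		if next.Hi > cur.Hi {
			cur.Hi, cur.HiInc = next.Hi, next.HiInc
		} else if next.Hi == cur.Hi {
			cur.HiInc = cur.HiInc || next.HiInc
		}
	}
	return out
}

func main() {
	merged := mergeOr([]interval{
		{Lo: 10, Hi: 20}, {Lo: 15, Hi: 25}, {Lo: 22, Hi: 30},
	})
	fmt.Println(len(merged), merged[0].Lo, merged[0].Hi)
}
```

Supporting the unbounded case (`x > 10`) would need an explicit representation for missing bounds, for example infinities or a separate flag, which this sketch leaves out.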