Spade A 0114bd1dc9
feat: support match operator family (#46518)
issue: https://github.com/milvus-io/milvus/issues/46517
ref: https://github.com/milvus-io/milvus/issues/42148

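This PR adds support for the match operator family (match_any, match_all,
match_least, match_most, match_exact). For now it works only on struct-array
fields and evaluates by brute-force search.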
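
To make the operator family concrete, here is a minimal Python sketch of the per-row semantics that the operator names suggest. It assumes each row holds an array of structs and the predicate is evaluated once per element. The function `match_row`, the comparison directions (`match_least` as "at least", `match_most` as "at most"), and the treatment of empty arrays are illustrative assumptions, not the actual C++ implementation. In the grammar, the corresponding filter would be written along the lines of `match_least(items, $[price] > 10, threshold=2)`, where `items` and `price` are made-up field names.

```python
from typing import Callable, Optional, Sequence

THRESHOLD_OPS = {"match_least", "match_most", "match_exact"}
ALL_OPS = THRESHOLD_OPS | {"match_any", "match_all"}


def match_row(
    op: str,
    elements: Sequence[dict],
    predicate: Callable[[dict], bool],
    threshold: Optional[int] = None,
) -> bool:
    """Brute-force evaluation of one row: count matching elements, then compare."""
    if op not in ALL_OPS:
        raise ValueError(f"unknown operator: {op}")
    if op in THRESHOLD_OPS and threshold is None:
        raise ValueError(f"{op} requires a threshold")

    hits = sum(1 for e in elements if predicate(e))

    if op == "match_any":
        return hits >= 1
    if op == "match_all":
        # Assumption: an empty array vacuously satisfies match_all.
        return hits == len(elements)
    if op == "match_least":
        return hits >= threshold
    if op == "match_most":
        return hits <= threshold
    return hits == threshold  # match_exact


# One row whose struct array has three elements (hypothetical data).
row = [{"price": 5}, {"price": 12}, {"price": 30}]


def expensive(e: dict) -> bool:
    return e["price"] > 10
```

With this row, two of the three elements satisfy the predicate, so `match_row("match_least", row, expensive, threshold=2)` holds while `match_row("match_all", row, expensive)` does not.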

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
- Core invariant: match operators only target struct-array element-level
predicates and assume callers provide a correct row_start so element
indices form a contiguous range; IArrayOffsets implementations convert
row-level bitmaps/rows (starting at row_start) into element-level
bitmaps or a contiguous element-offset vector used by brute-force
evaluation.

- New capability added: end-to-end support for MATCH_* semantics
(match_any, match_all, match_least, match_most, match_exact) — parser
(grammar + proto), planner (ParseMatchExprs), expr model
(expr::MatchExpr), compilation (Expr→PhyMatchFilterExpr), execution
(PhyMatchFilterExpr::Eval uses element offsets/bitmaps), and unit tests
(MatchExprTest + parser tests). Implementation currently works for
struct-array inputs and uses brute-force element counting via
RowBitsetToElementOffsets/RowBitsetToElementBitset.

- Logic removed or simplified and why: removed the ad-hoc
DocBitsetToElementOffsets helper and consolidated offset/bitset
derivation into IArrayOffsets::RowBitsetToElementOffsets and a
row_start-aware RowBitsetToElementBitset, and removed EvalCtx overloads
that embedded ExprSet (now EvalCtx(exec_ctx, offset_input)). This
centralizes array-layout logic in ArrayOffsets and removes duplicated
offset conversion and EvalCtx variants that were redundant for
element-level evaluation.

- No data loss / no behavior regression: persistent formats are
unchanged (no proto storage or on-disk layout changed); callers were
updated to supply row_start and now route through the centralized
ArrayOffsets APIs which still use the authoritative
row_to_element_start_ mapping, preserving exact element index mappings.
Eval logic changes are limited to in-memory plumbing (how
offsets/bitmaps are produced and how EvalCtx is constructed); expression
evaluation still invokes exprs_->Eval where needed, so existing behavior
and stored data remain intact.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
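
The row-to-element conversion described in the notes above can be pictured with a small Python sketch. It assumes a prefix-sum array playing the role of `row_to_element_start_`, with one extra trailing entry holding the total element count. The function names and signatures below are hypothetical stand-ins chosen for illustration, not the real `IArrayOffsets` API.

```python
from typing import List


def row_bitset_to_element_offsets(
    row_bitset: List[bool], row_start: int, row_to_element_start: List[int]
) -> List[int]:
    """Expand selected rows into element offsets.

    Bit i of row_bitset refers to row (row_start + i).
    """
    offsets: List[int] = []
    for i, selected in enumerate(row_bitset):
        if not selected:
            continue
        row = row_start + i
        begin = row_to_element_start[row]
        end = row_to_element_start[row + 1]
        offsets.extend(range(begin, end))
    return offsets


def row_bitset_to_element_bitset(
    row_bitset: List[bool], row_start: int, row_to_element_start: List[int]
) -> List[bool]:
    """Build an element-level bitmap for rows [row_start, row_start + len(row_bitset))."""
    first = row_to_element_start[row_start]
    last = row_to_element_start[row_start + len(row_bitset)]
    bits = [False] * (last - first)
    for off in row_bitset_to_element_offsets(
        row_bitset, row_start, row_to_element_start
    ):
        # Element bitmap is indexed relative to the first element of row_start.
        bits[off - first] = True
    return bits


# Rows 0..3 hold 2, 0, 3 and 1 elements respectively (hypothetical layout).
starts = [0, 2, 2, 5, 6]
```

Because the rows covered by one bitmap are consecutive, their elements occupy one contiguous range of the prefix-sum array, which is why a correct `row_start` matters for the mapping.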

---------

Signed-off-by: SpadeA <tangchenjie1210@gmail.com>
Signed-off-by: SpadeA-Tang <tangchenjie1210@gmail.com>
2025-12-29 11:03:26 +08:00


grammar Plan;
expr:
Identifier (op1=(ADD | SUB) INTERVAL interval_string=StringLiteral)? op2=(LT | LE | GT | GE | EQ | NE) ISO compare_string=StringLiteral # TimestamptzCompareForward
| ISO compare_string=StringLiteral op2=(LT | LE | GT | GE | EQ | NE) Identifier (op1=(ADD | SUB) INTERVAL interval_string=StringLiteral)? # TimestamptzCompareReverse
| IntegerConstant # Integer
| FloatingConstant # Floating
| BooleanConstant # Boolean
| StringLiteral # String
| (Identifier|Meta) # Identifier
| JSONIdentifier # JSONIdentifier
| StructSubFieldIdentifier # StructSubField
| LBRACE Identifier RBRACE # TemplateVariable
| '(' expr ')' # Parens
| '[' expr (',' expr)* ','? ']' # Array
| EmptyArray # EmptyArray
| EXISTS expr # Exists
| expr LIKE StringLiteral # Like
| TEXTMATCH'('Identifier',' StringLiteral (',' textMatchOption)? ')' # TextMatch
| PHRASEMATCH'('Identifier',' StringLiteral (',' expr)? ')' # PhraseMatch
| RANDOMSAMPLE'(' expr ')' # RandomSample
| ElementFilter'('Identifier',' expr')' # ElementFilter
| MATCH_ALL'(' Identifier ',' expr ')' # MatchAll
| MATCH_ANY'(' Identifier ',' expr ')' # MatchAny
| MATCH_LEAST'(' Identifier ',' expr ',' THRESHOLD ASSIGN IntegerConstant ')' # MatchLeast
| MATCH_MOST'(' Identifier ',' expr ',' THRESHOLD ASSIGN IntegerConstant ')' # MatchMost
| MATCH_EXACT'(' Identifier ',' expr ',' THRESHOLD ASSIGN IntegerConstant ')' # MatchExact
| expr POW expr # Power
| op = (ADD | SUB | BNOT | NOT) expr # Unary
// | '(' typeName ')' expr # Cast
| expr op = (MUL | DIV | MOD) expr # MulDivMod
| expr op = (ADD | SUB) expr # AddSub
| expr op = (SHL | SHR) expr # Shift
| expr op = NOT? IN expr # Term
| (JSONContains | ArrayContains)'('expr',' expr')' # JSONContains
| (JSONContainsAll | ArrayContainsAll)'('expr',' expr')' # JSONContainsAll
| (JSONContainsAny | ArrayContainsAny)'('expr',' expr')' # JSONContainsAny
| STEuqals'('Identifier','StringLiteral')' # STEuqals
| STTouches'('Identifier','StringLiteral')' # STTouches
| STOverlaps'('Identifier','StringLiteral')' # STOverlaps
| STCrosses'('Identifier','StringLiteral')' # STCrosses
| STContains'('Identifier','StringLiteral')' # STContains
| STIntersects'('Identifier','StringLiteral')' # STIntersects
| STWithin'('Identifier','StringLiteral')' # STWithin
| STDWithin'('Identifier','StringLiteral',' expr')' # STDWithin
| STIsValid'('Identifier')' # STIsValid
| ArrayLength'('(Identifier | JSONIdentifier)')' # ArrayLength
| Identifier '(' ( expr (',' expr )* ','? )? ')' # Call
| expr op1 = (LT | LE) (Identifier | JSONIdentifier | StructSubFieldIdentifier) op2 = (LT | LE) expr # Range
| expr op1 = (GT | GE) (Identifier | JSONIdentifier | StructSubFieldIdentifier) op2 = (GT | GE) expr # ReverseRange
| expr op = (LT | LE | GT | GE) expr # Relational
| expr op = (EQ | NE) expr # Equality
| expr BAND expr # BitAnd
| expr BXOR expr # BitXor
| expr BOR expr # BitOr
| expr AND expr # LogicalAnd
| expr OR expr # LogicalOr
| (Identifier | JSONIdentifier) ISNULL # IsNull
| (Identifier | JSONIdentifier) ISNOTNULL # IsNotNull;
textMatchOption:
MINIMUM_SHOULD_MATCH ASSIGN IntegerConstant;
// typeName: ty = (BOOL | INT8 | INT16 | INT32 | INT64 | FLOAT | DOUBLE);
// BOOL: 'bool';
// INT8: 'int8';
// INT16: 'int16';
// INT32: 'int32';
// INT64: 'int64';
// FLOAT: 'float';
// DOUBLE: 'double';
LBRACE: '{';
RBRACE: '}';
LT: '<';
LE: '<=';
GT: '>';
GE: '>=';
EQ: '==';
NE: '!=';
LIKE: 'like' | 'LIKE';
EXISTS: 'exists' | 'EXISTS';
TEXTMATCH: 'text_match'|'TEXT_MATCH';
PHRASEMATCH: 'phrase_match'|'PHRASE_MATCH';
RANDOMSAMPLE: 'random_sample' | 'RANDOM_SAMPLE';
MATCH_ALL: 'match_all' | 'MATCH_ALL';
MATCH_ANY: 'match_any' | 'MATCH_ANY';
MATCH_LEAST: 'match_least' | 'MATCH_LEAST';
MATCH_MOST: 'match_most' | 'MATCH_MOST';
MATCH_EXACT: 'match_exact' | 'MATCH_EXACT';
INTERVAL: 'interval' | 'INTERVAL';
ISO: 'iso' | 'ISO';
MINIMUM_SHOULD_MATCH: 'minimum_should_match' | 'MINIMUM_SHOULD_MATCH';
THRESHOLD: 'threshold' | 'THRESHOLD';
ASSIGN: '=';
ADD: '+';
SUB: '-';
MUL: '*';
DIV: '/';
MOD: '%';
POW: '**';
SHL: '<<';
SHR: '>>';
BAND: '&';
BOR: '|';
BXOR: '^';
AND: '&&' | 'and' | 'AND';
OR: '||' | 'or' | 'OR';
ISNULL: 'is null' | 'IS NULL';
ISNOTNULL: 'is not null' | 'IS NOT NULL';
BNOT: '~';
NOT: '!' | 'not' | 'NOT';
IN: 'in' | 'IN';
EmptyArray: '[' (Whitespace | Newline)* ']';
JSONContains: 'json_contains' | 'JSON_CONTAINS';
JSONContainsAll: 'json_contains_all' | 'JSON_CONTAINS_ALL';
JSONContainsAny: 'json_contains_any' | 'JSON_CONTAINS_ANY';
ArrayContains: 'array_contains' | 'ARRAY_CONTAINS';
ArrayContainsAll: 'array_contains_all' | 'ARRAY_CONTAINS_ALL';
ArrayContainsAny: 'array_contains_any' | 'ARRAY_CONTAINS_ANY';
ArrayLength: 'array_length' | 'ARRAY_LENGTH';
ElementFilter: 'element_filter' | 'ELEMENT_FILTER';
STEuqals:'st_equals' | 'ST_EQUALS';
STTouches:'st_touches' | 'ST_TOUCHES';
STOverlaps: 'st_overlaps' | 'ST_OVERLAPS';
STCrosses: 'st_crosses' | 'ST_CROSSES';
STContains: 'st_contains' | 'ST_CONTAINS';
STIntersects : 'st_intersects' | 'ST_INTERSECTS';
STWithin :'st_within' | 'ST_WITHIN';
STDWithin: 'st_dwithin' | 'ST_DWITHIN';
STIsValid: 'st_isvalid' | 'ST_ISVALID';
BooleanConstant: 'true' | 'True' | 'TRUE' | 'false' | 'False' | 'FALSE';
IntegerConstant:
DecimalConstant
| OctalConstant
| HexadecimalConstant
| BinaryConstant;
FloatingConstant:
DecimalFloatingConstant
| HexadecimalFloatingConstant;
Identifier: Nondigit (Nondigit | Digit)*;
Meta: '$meta';
StringLiteral: EncodingPrefix? ('"' DoubleSCharSequence? '"' | '\'' SingleSCharSequence? '\'');
JSONIdentifier: (Identifier | Meta)('[' (StringLiteral | DecimalConstant) ']')+;
StructSubFieldIdentifier: '$[' Identifier ']';
fragment EncodingPrefix: 'u8' | 'u' | 'U' | 'L';
fragment DoubleSCharSequence: DoubleSChar+;
fragment SingleSCharSequence: SingleSChar+;
fragment DoubleSChar: ~["\\\r\n] | EscapeSequence | '\\\n' | '\\\r\n';
fragment SingleSChar: ~['\\\r\n] | EscapeSequence | '\\\n' | '\\\r\n';
fragment Nondigit: [a-zA-Z_];
fragment Digit: [0-9];
fragment BinaryConstant: '0' [bB] [0-1]+;
fragment DecimalConstant: NonzeroDigit Digit* | '0';
fragment OctalConstant: '0' OctalDigit*;
fragment HexadecimalConstant: '0' [xX] HexadecimalDigitSequence;
fragment NonzeroDigit: [1-9];
fragment OctalDigit: [0-7];
fragment HexadecimalDigit: [0-9a-fA-F];
fragment HexQuad:
HexadecimalDigit HexadecimalDigit HexadecimalDigit HexadecimalDigit;
fragment UniversalCharacterName:
'\\u' HexQuad
| '\\U' HexQuad HexQuad;
fragment DecimalFloatingConstant:
FractionalConstant ExponentPart?
| DigitSequence ExponentPart;
fragment HexadecimalFloatingConstant:
'0' [xX] (
HexadecimalFractionalConstant
| HexadecimalDigitSequence
) BinaryExponentPart;
fragment FractionalConstant:
DigitSequence? '.' DigitSequence
| DigitSequence '.';
fragment ExponentPart: [eE] [+-]? DigitSequence;
fragment DigitSequence: Digit+;
fragment HexadecimalFractionalConstant:
HexadecimalDigitSequence? '.' HexadecimalDigitSequence
| HexadecimalDigitSequence '.';
fragment HexadecimalDigitSequence: HexadecimalDigit+;
fragment BinaryExponentPart: [pP] [+-]? DigitSequence;
fragment EscapeSequence:
'\\' ['"?abfnrtv\\]
| '\\' OctalDigit OctalDigit? OctalDigit?
| '\\x' HexadecimalDigitSequence
| UniversalCharacterName;
Whitespace: [ \t]+ -> skip;
Newline: ( '\r' '\n'? | '\n') -> skip;