relate: https://github.com/milvus-io/milvus/issues/46718
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Enhancement: Optimize Jieba and Lindera Analyzer Clone
**Core Invariant**: JiebaTokenizer and LinderaTokenizer must be
efficiently cloneable without lifetime constraints to support analyzer
composition in multi-language detection chains.
**What Logic Was Improved**:
- **JiebaTokenizer**: Replaced `Cow<'a, Jieba>` with
`Arc<jieba_rs::Jieba>` and removed the `<'a>` lifetime parameter. The
global JIEBA instance is now wrapped in Arc, enabling `#[derive(Clone)]` on
the struct. This eliminates lifetime management complexity while
maintaining zero-copy sharing via atomic reference counting.
- **LinderaTokenizer**: Introduced public `LinderaSegmenter` struct
encapsulating dictionary and mode state, and implemented explicit
`Clone` that properly duplicates the segmenter (cloning Arc-wrapped
dictionary), applies `box_clone()` to each boxed token filter, and
clones the token buffer. Previously, Clone was either unavailable or
handled trait objects incompletely.
**Why Previous Implementation Was Limiting**:
- The `Cow::Borrowed` pattern for JiebaTokenizer created explicit
lifetime dependencies that prevented straightforward `#[derive(Clone)]`.
Switching to Arc eliminates borrow checker constraints while providing
the same reference semantics for immutable shared state.
- LinderaTokenizer's token filters are boxed trait objects, which cannot
be auto-derived. Manual Clone implementation with `box_clone()` calls
correctly handles polymorphic filter duplication.
**No Data Loss or Behavior Regression**:
- Arc cloning is semantically equivalent to `Cow::Borrowed` for
read-only access; both efficiently share the underlying Jieba instance
and Dictionary without data duplication.
- The explicit Clone preserves all tokenizer state: segmenter (with
shared Arc dictionary), all token filters (via individual box_clone),
and the token buffer used during tokenization.
- Token stream behavior unchanged—segmentation and filter application
order remain identical.
- New benchmarks (`bench_jieba_tokenizer_clone`,
`bench_lindera_tokenizer_clone`) measure and validate clone performance
for both tokenizers.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
---------
Signed-off-by: aoiasd <zhicheng.yue@zilliz.com>
issue: #46540
Empty timeticks are only used to synchronize the clock between different
components in Milvus, so they can be ignored once timeticks follow
LSN/MVCC semantics. Currently, some components still need empty
timeticks to trigger operations such as flush/tsafe, so for now we only
slow empty timeticks down to one every 5 seconds.
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
- Core invariant: with LSN/MVCC semantics consumers only need (a) the
first timetick that advances the latest-required-MVCC to unblock
MVCC-dependent waits and (b) occasional periodic timeticks (~≤5s) for
clock synchronization—therefore frequent non-persisted empty timeticks
can be suppressed without breaking MVCC correctness.
- Logic removed/simplified: per-message dispatch/consumption of frequent
non-persisted empty timeticks is suppressed — an MVCC-aware filter
emptyTimeTickSlowdowner (internal/util/pipeline/consuming_slowdown.go)
short-circuits frequent empty timeticks in the stream pipeline
(internal/util/pipeline/stream_pipeline.go), and the WAL flusher
rate-limits non-persisted timetick dispatch to one emission per ~5s
(internal/streamingnode/server/flusher/flusherimpl/wal_flusher.go); the
delegator exposes GetLatestRequiredMVCCTimeTick to drive the filter
(internal/querynodev2/delegator/delegator.go).
- Why this does NOT introduce data loss or regressions: the slowdowner
always refreshes latestRequiredMVCCTimeTick via
GetLatestRequiredMVCCTimeTick and (1) never filters timeticks <
latestRequiredMVCCTimeTick (so existing tsafe/flush waits stay
unblocked) and (2) always lets the first timetick ≥
latestRequiredMVCCTimeTick pass to notify pending MVCC waits;
separately, WAL flusher suppression applies only to non-persisted
timeticks and still emits when the 5s threshold elapses, preserving
periodic clock-sync messages used by flush/tsafe.
- Enhancement summary (where it takes effect): adds
GetLatestRequiredMVCCTimeTick on ShardDelegator and
LastestMVCCTimeTickGetter, wires emptyTimeTickSlowdowner into
NewPipelineWithStream (internal/util/pipeline), and adds WAL flusher
rate-limiting + metrics
(internal/streamingnode/server/flusher/flusherimpl/wal_flusher.go,
pkg/metrics) to reduce CPU/dispatch overhead while keeping MVCC
correctness and periodic synchronization.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
---------
Signed-off-by: chyezh <chyezh@outlook.com>
issue: #46737
PR #46440 refactored checkStale to use action.Node() for all action
types, which breaks LeaderAction stale checking. For LeaderAction,
Node() returns the worker node where the segment resides, but the task is
executed on the leader node (delegator).
When syncing segments from RO worker nodes to a RW delegator, using
action.Node() incorrectly marks the task as stale because the worker is
RO, even though the leader is RW and the task should proceed.
This fix:
- Uses leaderID instead of Node() for LeaderAction stale checking
- Adds detailed comments explaining the distinction
- Adds unit tests covering the RO worker with RW leader scenario
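
The distinction can be sketched roughly as follows. All types and function names below are hypothetical simplifications rather than the actual task/action types; the only point is that the node used for the stale check depends on the action type.

```go
package main

import "fmt"

// action is a hypothetical, reduced version of the task action interface.
type action interface{ Node() int64 }

type segmentAction struct{ node int64 }

func (a segmentAction) Node() int64 { return a.node }

type leaderAction struct {
	workerID int64 // node where the segment resides (what Node() returns)
	leaderID int64 // delegator node that actually executes the task
}

func (a leaderAction) Node() int64 { return a.workerID }

// executorNode returns the node whose state decides whether the task is stale.
func executorNode(a action) int64 {
	if la, ok := a.(leaderAction); ok {
		return la.leaderID
	}
	return a.Node()
}

// isStale reports whether the executing node is read-only (RO).
func isStale(a action, readOnly map[int64]bool) bool {
	return readOnly[executorNode(a)]
}

func main() {
	readOnly := map[int64]bool{1: true} // node 1 is RO, node 2 is RW
	fmt.Println(isStale(leaderAction{workerID: 1, leaderID: 2}, readOnly))
	fmt.Println(isStale(segmentAction{node: 1}, readOnly))
}
```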
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
issue: https://github.com/milvus-io/milvus/issues/46743
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary
**Core Invariant:** Sort compaction tasks must not be created
concurrently for the same segment. The system relies on atomic
check-and-set semantics to prevent duplicate task creation.
**What Logic is Improved:** The code now guards sort compaction task
creation with an explicit `CheckAndSetSegmentsCompacting` check before
calling `createSortCompactionTask`. Previously, tasks could be attempted
for segments already undergoing compaction, triggering warning logs that
incorrectly suggested task creation failures. The fix skips task
creation when a segment is already compacting, avoiding these misleading
warnings entirely.
**Why No Data Loss or Regression:**
- The `CheckAndSetSegmentsCompacting` method atomically checks whether a
segment is already being compacted and only proceeds if it's not; this
is the correct guard pattern for preventing concurrent compactions
- When a segment is already compacting (`isCompacting == true`), the
code correctly increments the done counter and skips to the next
segment, which is the intended behavior (no wasted task creation
attempts)
- The function signature change to `createSortCompactionTask` adds only
an internal parameter (the current task context for logging); no public
APIs are affected
- Logging refactoring maintains semantic equivalence while providing
task-scoped context
**Concrete Fix:** The misleading warning during sort compaction is
eliminated by preventing task creation attempts for already-compacting
segments through the mutex-protected `CheckAndSetSegmentsCompacting`
guard, rather than attempting creation and failing downstream.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
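A rough sketch of the guard pattern, assuming a simplified single-segment signature. `compactingSet` and `checkAndSet` are hypothetical; the real `CheckAndSetSegmentsCompacting` may take and return different values.

```go
package main

import (
	"fmt"
	"sync"
)

// compactingSet is a hypothetical stand-in for the meta that tracks
// which segments are currently being compacted.
type compactingSet struct {
	mu         sync.Mutex
	compacting map[int64]bool
}

// checkAndSet marks the segment as compacting unless it already is, and
// reports whether it was already compacting. Both steps run under one lock,
// so two callers cannot both see "not compacting" for the same segment.
func (c *compactingSet) checkAndSet(segmentID int64) (isCompacting bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.compacting[segmentID] {
		return true
	}
	c.compacting[segmentID] = true
	return false
}

func main() {
	c := &compactingSet{compacting: map[int64]bool{}}
	created, skipped := 0, 0
	for _, id := range []int64{1, 2, 1} {
		if c.checkAndSet(id) {
			skipped++ // already compacting: count as done, no task, no warning
			continue
		}
		created++ // stand-in for createSortCompactionTask
	}
	fmt.Println(created, skipped)
}
```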
Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
Add support for DataNode compaction using file resources in ref mode.
SortCompaction and StatsJobs build text indexes, which may use file
resources.
relate: https://github.com/milvus-io/milvus/issues/43687
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
- Core invariant: file resources (analyzer binaries/metadata) are only
fetched, downloaded and used when the node is configured in Ref mode
(fileresource.IsRefMode via CommonCfg.QNFileResourceMode /
DNFileResourceMode); Sync now carries a version and managers track
per-resource versions/resource IDs so newer resource sets win and older
entries are pruned (RefManager/SynchManager resource maps).
- Logic removed / simplified: component-specific FileResourceMode flags
and an indirection through a long-lived BinlogIO wrapper were
consolidated — file-resource mode moved to CommonCfg, Sync/Download APIs
became version- and context-aware, and compaction/index tasks accept a
ChunkManager directly (binlog IO wrapper creation inlined). This
eliminates duplicated config checks and wrapper indirection while
preserving the same chunk/IO semantics.
- Why no data loss or behavior regression: all file-resource code paths
are gated by the configured mode (default remains "sync"); when not in
ref-mode or when no resources exist, compaction and stats flows follow
existing code paths unchanged. Versioned Sync + resourceID maps ensure
newly synced sets replace older ones and RefManager prunes stale files;
GetFileResources returns an error if requested IDs are missing (prevents
silent use of wrong resources). Analyzer naming/parameter changes add
analyzer_extra_info but default-callers pass "" so existing analyzers
and index contents remain unchanged.
- New capability: DataNode compaction and StatsJobs can now build text
indexes using external file resources in Ref mode — DataCoord exposes
GetFileResources and populates CompactionPlan.file_resources;
SortCompaction/StatsTask download resources via fileresource.Manager,
produce an analyzer_extra_info JSON (storage + resource->id map) via
analyzer.BuildExtraResourceInfo, and propagate analyzer_extra_info into
BuildIndexInfo so the tantivy bindings can load custom analyzers during
text index creation.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
Signed-off-by: aoiasd <zhicheng.yue@zilliz.com>
related: #36380
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
- Core invariant: aggregation is centralized and schema-aware — all
aggregate functions are created via the exec Aggregate registry
(milvus::exec::Aggregate) and validated by ValidateAggFieldType, use a
single in-memory accumulator layout (Accumulator/RowContainer) and
grouping primitives (GroupingSet, HashTable, VectorHasher), ensuring
consistent typing, null semantics and offsets across planner → exec →
reducer conversion paths (toAggregateInfo, Aggregate::create,
GroupingSet, AggResult converters).
- Removed / simplified logic: removed ad‑hoc count/group-by and reducer
code (CountNode/PhyCountNode, GroupByNode/PhyGroupByNode, cntReducer and
its tests) and consolidated into a unified AggregationNode →
PhyAggregationNode + GroupingSet + HashTable execution path and
centralized reducers (MilvusAggReducer, InternalAggReducer,
SegcoreAggReducer). AVG now implemented compositionally (SUM + COUNT)
rather than a bespoke operator, eliminating duplicate implementations.
- Why this does NOT cause data loss or regressions: existing data-access
and serialization paths are preserved and explicitly validated —
bulk_subscript / bulk_script_field_data and FieldData creation are used
for output materialization; converters (InternalResult2AggResult ↔
AggResult2internalResult, SegcoreResults2AggResult ↔
AggResult2segcoreResult) enforce shape/type/row-count validation; proxy
and plan-level checks (MatchAggregationExpression,
translateOutputFields, ValidateAggFieldType, translateGroupByFieldIds)
reject unsupported inputs (ARRAY/JSON, unsupported datatypes) early.
Empty-result generation and explicit error returns guard against silent
corruption.
- New capability and scope: end-to-end GROUP BY and aggregation support
added across the stack — proto (plan.proto, RetrieveRequest fields
group_by_field_ids/aggregates), planner nodes (AggregationNode,
ProjectNode, SearchGroupByNode), exec operators (PhyAggregationNode,
PhyProjectNode) and aggregation core (Aggregate implementations:
Sum/Count/Min/Max, SimpleNumericAggregate, RowContainer, GroupingSet,
HashTable) plus proxy/querynode reducers and tests — enabling grouped
and global aggregation (sum, count, min, max, avg via sum+count) with
schema-aware validation and reduction.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
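To illustrate the "avg via sum+count" composition: partial sums and counts can be merged by plain addition across segments or nodes, whereas partial averages cannot. The `partial` type below is a hypothetical illustration, not the Accumulator/RowContainer layout used by the PR.

```go
package main

import "fmt"

// partial is a hypothetical per-segment partial aggregate for one group.
type partial struct {
	sum   float64
	count int64
}

// merge combines partial results; SUM and COUNT both reduce by addition.
func merge(parts []partial) partial {
	var out partial
	for _, p := range parts {
		out.sum += p.sum
		out.count += p.count
	}
	return out
}

// avg derives the average from the merged SUM and COUNT.
// The second return value is false when the group is empty.
func (p partial) avg() (float64, bool) {
	if p.count == 0 {
		return 0, false
	}
	return p.sum / float64(p.count), true
}

func main() {
	// Two segments contribute to the same group.
	merged := merge([]partial{{sum: 10, count: 2}, {sum: 20, count: 3}})
	v, ok := merged.avg()
	fmt.Println(merged.sum, merged.count, v, ok)
}
```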
Signed-off-by: MrPresent-Han <chun.han@gmail.com>
Co-authored-by: MrPresent-Han <chun.han@gmail.com>
Related to #46595
Remove the EnableStorageV2 config option and enforce StorageV2 format
across all write paths including compaction, import, write buffer, and
streaming segment allocation. V1 format write tests are now skipped as
writing V1 format is no longer supported.
---------
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
issue: #46090
This change introduces a global node blacklist mechanism to immediately
cut off query traffic to failed delegators across all concurrent
requests.
Key features:
- Introduce ChannelBlacklist to track failed delegator nodes per channel
- When a query fails, the node is immediately blacklisted and excluded
from ALL subsequent requests (not just retries within the same request)
- Blacklisted nodes are automatically excluded during node selection
- Entries expire after configurable duration (default 30s) to allow
automatic recovery when nodes become healthy again
- Background cleanup loop removes expired entries periodically
- Add proxy.replicaBlacklistDuration and
proxy.replicaBlacklistCleanupInterval configuration parameters
- Blacklist can be disabled by setting duration to 0
Before this change:
- Failed nodes were only excluded within the same request's retry loop
- Concurrent requests would still attempt to query the failed node
- Each request had to experience its own failure before avoiding the
node
After this change:
- Once a node fails, it is immediately excluded from all requests
- New requests arriving during the blacklist period will skip the failed
node without experiencing any failure
- This significantly reduces latency spikes during node failures
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
issue: https://github.com/milvus-io/milvus/issues/46713
/kind bug
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
- Core invariant: a Limiter's hasUpdated flag controls whether its
non-Infinite limit is exported to proxies via toRequestLimiter(); only
limiters with hasUpdated==true and non-Inf rates produce rate updates
delivered to proxies (pkg/util/ratelimitutil/limiter.go: HasUpdated /
toRequestLimiter behavior unchanged).
- Exact bug and fix (issue #46713): collection-level limiters created
from configured collection/partition/database properties were
constructed with correct limits but left hasUpdated==false, so they were
skipped by the existing !HasUpdated() check and never sent to proxies.
Fix: add Limiter.SetHasUpdated(updated bool) and call a new
updateLimiterHasUpdated helper immediately after creating limiter nodes
during initialization/reset (internal/rootcoord/quota_center.go) to mark
non-Inf newly-created limiters as updated so they are included in
toRequestLimiter exports.
- Logic simplified / redundancy removed: initialization now explicitly
sets limiter initialization state (hasUpdated) for newly-created
non-Infinite limiters instead of relying on implicit later side-effects
to toggle the flag; this removes the implicit gap between creation and
the expectation that a configured limiter should be published.
- No data-loss or behavior regression: the change only mutates the
in-memory hasUpdated flag for freshly created limiter instances
(pkg/util/ratelimitutil/limiter.go: SetHasUpdated) and sets it in the
limiter initialization path (internal/rootcoord/quota_center.go). It
does not alter token accounting (advance, AllowN, Cancel), rate
computation, SetLimit semantics, persistence, or proxy filtering
logic—only ensures intended collection-level rates are delivered to
proxies—so no persisted data or runtime rate behavior is removed or
degraded.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
Signed-off-by: guanghuihuang <guanghuihuang@didiglobal.com>
Co-authored-by: guanghuihuang <guanghuihuang@didiglobal.com>
issue: #44358
Implement complete snapshot management system including creation,
deletion, listing, description, and restoration capabilities across all
system components.
Key features:
- Create snapshots for entire collections
- Drop snapshots by name with proper cleanup
- List snapshots with collection filtering
- Describe snapshot details and metadata
Components added/modified:
- Client SDK with full snapshot API support and options
- DataCoord snapshot service with metadata management
- Proxy layer with task-based snapshot operations
- Protocol buffer definitions for snapshot RPCs
- Comprehensive unit tests with mockey framework
- Integration tests for end-to-end validation
Technical implementation:
- Snapshot metadata storage in etcd with proper indexing
- File-based snapshot data persistence in object storage
- Garbage collection integration for snapshot cleanup
- Error handling and validation across all operations
- Thread-safe operations with proper locking mechanisms
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
- Core invariant/assumption: snapshots are immutable point‑in‑time
captures identified by (collection, snapshot name/ID); etcd snapshot
metadata is authoritative for lifecycle (PENDING → COMMITTED → DELETING)
and per‑segment manifests live in object storage (Avro / StorageV2). GC
and restore logic must see snapshotRefIndex loaded
(snapshotMeta.IsRefIndexLoaded) before reclaiming or relying on
segment/index files.
- New capability added: full end‑to‑end snapshot subsystem — client SDK
APIs (Create/Drop/List/Describe/Restore + restore job queries),
DataCoord SnapshotWriter/Reader (Avro + StorageV2 manifests),
snapshotMeta in meta, SnapshotManager orchestration
(create/drop/describe/list/restore), copy‑segment restore
tasks/inspector/checker, proxy & RPC surface, GC integration, and
docs/tests — enabling point‑in‑time collection snapshots persisted to
object storage and restorations orchestrated across components.
- Logic removed/simplified and why: duplicated recursive
compaction/delta‑log traversal and ad‑hoc lookup code were consolidated
behind two focused APIs/owners (Handler.GetDeltaLogFromCompactTo for
delta traversal and SnapshotManager/SnapshotReader for snapshot I/O).
MixCoord/coordinator broker paths were converted to thin RPC proxies.
This eliminates multiple implementations of the same traversal/lookup,
reducing divergence and simplifying responsibility boundaries.
- Why this does NOT introduce data loss or regressions: snapshot
create/drop use explicit two‑phase semantics (PENDING → COMMIT/DELETING)
with SnapshotWriter writing manifests and metadata before commit; GC
uses snapshotRefIndex guards and
IsRefIndexLoaded/GetSnapshotBySegment/GetSnapshotByIndex checks to avoid
removing referenced files; restore flow pre‑allocates job IDs, validates
resources (partitions/indexes), performs rollback on failure
(rollbackRestoreSnapshot), and converts/updates segment/index metadata
only after successful copy tasks. Extensive unit and integration tests
exercise pending/deleting/GC/restore/error paths to ensure idempotence
and protection against premature deletion.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
---------
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
Add configurable MAP_POPULATE flag support for mmap operations to reduce
page faults and improve first read performance.
Key changes:
- Add `queryNode.mmap.populate` config (default: true) to control
MAP_POPULATE flag usage
- Add `mmap_populate` parameter to MmapChunkTarget, ChunkTranslator,
GroupChunkTranslator, and ManifestGroupTranslator
- Apply MAP_POPULATE to both MmapChunkTarget and MemChunkTarget
- Propagate mmap_populate setting through chunk creation pipeline
When enabled, MAP_POPULATE pre-faults the mapped pages into memory,
eliminating page faults during subsequent access and improving query
performance for the first read operations.
issue: #46760
---------
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
If QueryNode loads multiple sealed segments with BM25 enabled, BM25
stats registration into IDFOracle could stop after the first segment due
to an early-terminating ConcurrentMap.Range callback. This change:
- Register BM25 stats for all sealed segments by continuing iteration
(return true) during sealed-segment load
- Prevent repeated warnings like "idf oracle lack some sealed segment"
- Ensure IDF/BM25 statistics are not silently incomplete (improving BM25
ranking correctness)
issue: #46725
Core invariant: for any BM25-enabled collection, every loaded sealed
segment with available BM25 stats must be registered into IDFOracle, so
SyncDistribution can always find the sealed segments present in the
distribution snapshot.
Bug fix: ConcurrentMap.Range respects the callback’s boolean return;
returning false stops iteration. The sealed BM25 stats registration
callback previously returned false, which could register only the first
sealed segment and leave the rest missing, causing IDFOracle to warn "idf
oracle lack some sealed segment" and potentially compute IDF from
incomplete stats. Fixed by returning true to continue iterating and
registering all segments.
No behavior regression: the change only affects the sealed-segment BM25
stats registration loop; it does not alter segment loading, distribution
snapshot generation, or non-BM25 codepaths. For collections without BM25
(or when BM25 stats are nil), behavior remains unchanged.
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
- Core invariant: For any BM25-enabled collection, every loaded sealed
segment with available BM25 stats must be registered into IDFOracle so
SyncDistribution can discover them in distribution snapshots.
- Bug fix (links to #46725): The BM25 stats registration callback used
with bm25Stats.Range() in loadStreamDelete() returned false, which
prematurely stopped iteration after the first sealed segment and left
subsequent sealed segments unregistered. The fix changes the callback to
return true so the Range loop completes and registers BM25 stats for all
sealed segments.
- Logic simplified/removed: The early-return (false) in the
ConcurrentMap.Range callback that aborted further registrations has been
removed (replaced by returning true). That early abort was redundant and
incorrect because registration must proceed for every entry; allowing
Range to continue restores the intended one-to-many registration
behavior.
- No data loss or regression: The change is narrowly scoped to the
sealed-segment BM25 stats registration loop in
internal/querynodev2/delegator/delegator_data.go and does not modify
segment loading, distribution snapshot generation, growing-segment
handling, or non-BM25 codepaths. Returning true only permits full
iteration and registration; it does not delete or alter existing data
structures or load state, so IDF/BM25 statistics become complete without
changing other behaviors.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
Signed-off-by: thangTang <tangtianhang099@gmail.com>
issue: #46635
## Summary
- Fix spelling error in constant name: `CredentialSeperator` ->
`CredentialSeparator`
- Updated all usages across the codebase to use the correct spelling
## Changes
- `pkg/util/constant.go`: Renamed the constant
- `pkg/util/contextutil/context_util.go`: Updated usage
- `pkg/util/contextutil/context_util_test.go`: Updated usage
- `internal/proxy/authentication_interceptor.go`: Updated usage
- `internal/proxy/util.go`: Updated usage
- `internal/proxy/util_test.go`: Updated usage
- `internal/proxy/trace_log_interceptor_test.go`: Updated usage
- `internal/proxy/accesslog/info/util.go`: Updated usage
- `internal/distributed/proxy/service.go`: Updated usage
- `internal/distributed/proxy/httpserver/utils.go`: Updated usage
## Test Plan
- [x] All references updated consistently
- [x] No functional changes - only constant name spelling correction
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
- Core invariant: the separator character for credentials remains ":"
everywhere — only the exported identifier was renamed from
CredentialSeperator → CredentialSeparator; the constant value and
split/join semantics are unchanged.
- Change (bug fix): corrected the misspelled exported constant in
pkg/util/constant.go and updated all references across the codebase
(parsing, token construction, header handling and tests) to use the new
identifier; this is an identifier rename that removes an inconsistent
symbol and prevents compile-time/reference errors.
- Logic simplified/redundant work removed: no runtime logic was removed;
the simplification is purely maintenance-focused — eliminating a
misspelled exported name that could cause developers to introduce
duplicate or incorrect constants.
- No data loss or behavior regression: runtime code paths are unchanged
— e.g., GetAuthInfoFromContext, ParseUsernamePassword,
AuthenticationInterceptor, proxy service token construction and
access-log extraction still use ":" to split/join credentials; updated
and added unit tests (parsing and metadata extraction) exercise these
paths and validate identical semantics.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
---------
Signed-off-by: majiayu000 <1835304752@qq.com>
Signed-off-by: lif <1835304752@qq.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
issue: #46740
- Add duplicate ID check before query execution (fail fast)
- Auto infer anns_field when only one vector field exists in schema
Signed-off-by: Li Liu <li.liu@zilliz.com>
issue: https://github.com/milvus-io/milvus/issues/46678
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
- Core invariant: Text index log keys are canonicalized at KV
(serialization) boundaries — etcd stores compressed filename-only
entries, while in-memory and runtime consumers must receive full
object-storage keys so Datanode/QueryNode can load text indexes
directly.
- Logic removed/simplified: ad-hoc reconstruction of full text-log paths
scattered across components (garbage_collector.getTextLogs,
querynodev2.LoadTextIndex, compactor/index task code) was removed;
consumers now use TextIndexStats.Files as-provided (full keys). Path
compression/decompression was centralized into KV marshal/unmarshal
utilities (metautil.ExtractTextLogFilenames in marshalSegmentInfo and
metautil.BuildTextLogPaths in kv_catalog.listSegments), eliminating
redundant, inconsistent prefix-rebuilding logic that broke during
rolling upgrades.
- Why this does NOT cause data loss or regressions: before persist,
marshalSegmentInfo compresses TextStatsLogs.Files to filenames
(metautil.ExtractTextLogFilenames) so stored KV remains compact; on
load, kv_catalog.listSegments calls metautil.BuildTextLogPaths to
restore full paths and includes compatibility logic that leaves
already-full keys unchanged. Thus every persisted filename is
recoverable to a valid full key and consumers receive correct full paths
(see marshalSegmentInfo → KV write path and kv_catalog.listSegments →
reload path), preventing dropped or malformed keys.
- Bug fix (refs #46678): resolves text-log loading failures during
cluster upgrades by centralizing path handling at KV encode/decode and
removing per-component path reconstruction — the immediate fix is
changing consumers to read TextIndexStats.Files directly and relying on
marshal/unmarshal to perform compression/expansion, preventing
mixed-format failures during rolling upgrades.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
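A simplified sketch of the compress-on-write, expand-on-read round trip. `extractFilenames` and `buildPaths` are hypothetical stand-ins for metautil.ExtractTextLogFilenames and metautil.BuildTextLogPaths; the prefix layout and the "contains a slash" compatibility check are assumptions made for illustration, not the actual implementation.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// extractFilenames keeps only the last path element of each key, which is
// what would be persisted in the KV store.
func extractFilenames(files []string) []string {
	out := make([]string, 0, len(files))
	for _, f := range files {
		out = append(out, path.Base(f))
	}
	return out
}

// buildPaths restores full object-storage keys on load. Entries that already
// look like full keys are left unchanged (assumed compatibility rule).
func buildPaths(prefix string, files []string) []string {
	out := make([]string, 0, len(files))
	for _, f := range files {
		if strings.Contains(f, "/") {
			out = append(out, f) // already a full key, e.g. written by another version
			continue
		}
		out = append(out, path.Join(prefix, f))
	}
	return out
}

func main() {
	prefix := "files/text_log/1/2/3/4/5/6" // hypothetical key prefix
	full := []string{prefix + "/a.idx", prefix + "/b.idx"}

	stored := extractFilenames(full)     // compact form written to etcd
	loaded := buildPaths(prefix, stored) // full keys handed to consumers
	fmt.Println(stored, loaded)
}
```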
---------
Signed-off-by: sijie-ni-0214 <sijie.ni@zilliz.com>
issue: #46033
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Pull Request Summary: Entity-Level TTL Field Support
### Core Invariant and Design
This PR introduces **per-entity TTL (time-to-live) expiration** via a
dedicated TIMESTAMPTZ field as a fine-grained alternative to
collection-level TTL. The key invariant is **mutual exclusivity**:
collection-level TTL and entity-level TTL field cannot coexist on the
same collection. Validation is enforced at the proxy layer during
collection creation/alteration (`validateTTL()` prevents both being set
simultaneously).
### What Is Removed and Why
- **Global `EntityExpirationTTL` parameter** removed from config
(`configs/milvus.yaml`, `pkg/util/paramtable/component_param.go`). This
was the only mechanism for collection-level expiration. The removal is
safe because:
- The collection-level TTL path (`isEntityExpired(ts)` check) remains
intact in the codebase for backward compatibility
- TTL field check (`isEntityExpiredByTTLField()`) is a secondary path
invoked only when a TTL field is configured
- Existing deployments using collection TTL can continue without
modification
The global parameter was removed because per-entity TTL makes a global,
collection-wide setting redundant, and the PR chooses one mechanism per
collection rather than layering both.
### No Data Loss or Behavior Regression
**TTL filtering logic is additive and safe:**
1. **Collection-level TTL unaffected**: The `isEntityExpired(ts)` check
still applies when no TTL field is configured; callers of
`EntityFilter.Filtered()` pass `-1` as the TTL expiration timestamp when
no field exists, causing `isEntityExpiredByTTLField()` to return false
immediately
2. **Null/invalid TTL values treated safely**: Rows with null TTL or TTL
≤ 0 are marked as "never expire" (using sentinel value `int64(^uint64(0)
>> 1)`) and are preserved across compactions; percentile calculations
only include positive TTL values
3. **Query-time filtering automatic**: TTL filtering is transparently
added to expression compilation via `AddTTLFieldFilterExpressions()`,
which appends `(ttl_field IS NULL OR ttl_field > current_time)` to the
filter pipeline. Entities with null TTL always pass the filter
4. **Compaction triggering granular**: Percentile-based expiration (20%,
40%, 60%, 80%, 100%) allows configurable compaction thresholds via
`SingleCompactionRatioThreshold`, preventing premature data deletion
### Capability Added: Per-Entity Expiration with Data Distribution
Awareness
Users can now set a collection property `ttl_field` naming a TIMESTAMPTZ
schema field. During data writes, TTL values are collected per
segment and percentile quantiles (5-value array) are computed and stored
in segment metadata. At query time, the TTL field is automatically
filtered. At compaction time, segment-level percentiles drive
expiration-based compaction decisions, enabling intelligent compaction
of segments where a configurable fraction of data has expired (e.g.,
compact when 40% of rows are expired, controlled by threshold ratio).
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
---------
Signed-off-by: Cai Zhang <cai.zhang@zilliz.com>
/kind improvement
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
- Core invariant: index parameter validation and test expectations for
the HNSW-family must be explicit, consistent, and deterministic — this
PR enforces that by adding exhaustive parameter matrices for HNSW_PRQ
(tests/python_client/testcases/indexes/{idx_hnsw_prq.py,
test_hnsw_prq.py}) and normalizing expectations in idx_hnsw_pq.py via a
shared success variable.
- Logic removed / simplified: brittle, ad-hoc string expectations were
consolidated — literal "success" occurrences were replaced with a single
success variable and ambiguous short error messages were replaced by the
canonical descriptive error text; this reduces duplicated assertion
logic in tests and removes dependence on fragile, truncated messages.
- Bug fix (tests): corrected HNSW_PQ test expectations to assert the
full, authoritative error for invalid PQ m ("The dimension of the vector
(dim) should be a multiple of the number of subquantizers (m).") and
aligned HNSW_PRQ test matrices (idx_hnsw_prq.py) to the same explicit
expectations — the change targets test assertions only and fixes false
negatives caused by mismatched messages.
- No data loss or behavior regression: only test code is added/modified
(tests/python_client/testcases/indexes/*). Production code paths remain
unmodified — collection creation, insert/flush, client.create_index,
wait_for_index_ready, load_collection, search, and client.describe_index
are invoked by tests but not changed; therefore persisted data, index
artifacts, and runtime behavior are unaffected.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
Signed-off-by: zilliz <jiaming.li@zilliz.com>
issue: #45640
- After async logging was introduced, the C log and Go log have no
ordering guarantee, and the C log format is not consistent with the Go
log; so we disable glog output and forward each log entry to the Go side,
where it is handled by the async zap logger.
- Use CGO to filter all cgo logging and guarantee the order between C
logs and Go logs.
- Also fix the metric name and add a new metric to count log entries.
- TODO: after woodpecker uses the milvus logger, we can use a bigger
buffer for logging.
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
- Core invariant: all C (glog) and Go logs must be routed through the
same zap async pipeline so ordering and formatting are preserved; this
PR ensures every glog emission is captured and forwarded to zap before
any async buffering diverges the outputs.
- Logic removed/simplified: direct glog outputs and hard
stdout/stderr/log_dir settings are disabled (configs/glog.conf and flags
in internal/core/src/config/ConfigKnowhere.cpp) because they are
redundant once a single zap sink handles all logs; logging metrics were
simplified from per-length/volatile gauges to totalized counters
(pkg/metrics/logging_metrics.go & pkg/log/*), removing duplicate
length-tracking and making accounting consistent.
- No data loss or behavior regression (concrete code paths): Google
logging now adds a GoZapSink (internal/core/src/common/logging_c.h,
logging_c.cpp) that calls the exported CGO bridge goZapLogExt
(internal/util/cgo/logging/logging.go). Go side uses
C.GoStringN/C.GoString to capture full message and file, maps glog
severities to zapcore levels, preserves caller info, and writes via the
existing zap async core (same write path used by Go logs). The C++
send() trims glog's trailing newline and forwards exact buffers/lengths,
so message content, file, line, and severity are preserved and
serialized through the same async writer—no log entries are dropped or
reordered relative to Go logs.
- Capability added (where it takes effect): a CGO bridge that forwards
glog into zap—new Go-exported function goZapLogExt
(internal/util/cgo/logging/logging.go), a GoZapSink in C++ that forwards
glog sends (internal/core/src/common/logging_c.h/.cpp), and blank
imports of the cgo initializer across multiple packages (various
internal/* files) to ensure the bridge is registered early so all C logs
are captured.
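The severity mapping and newline trimming described above can be sketched in isolation. This is an illustrative stand-in only: it maps glog's numeric severities (INFO=0, WARNING=1, ERROR=2, FATAL=3) to level names instead of real zapcore levels, and `levelForSeverity` and `formatEntry` are hypothetical helpers, not the actual `goZapLogExt` bridge.

```go
package main

import (
	"fmt"
	"strings"
)

// glog severity values as defined by glog: INFO=0, WARNING=1, ERROR=2, FATAL=3.
const (
	glogInfo = iota
	glogWarning
	glogError
	glogFatal
)

// levelForSeverity maps a glog severity to a level name.
// Unknown values fall back to "info" (an assumption of this sketch).
func levelForSeverity(sev int) string {
	switch sev {
	case glogWarning:
		return "warn"
	case glogError:
		return "error"
	case glogFatal:
		return "fatal"
	default:
		return "info"
	}
}

// formatEntry trims glog's trailing newline and keeps severity, file and line
// so the entry can be written through a single shared sink.
func formatEntry(sev int, file string, line int, msg string) string {
	return fmt.Sprintf("[%s] %s:%d %s",
		levelForSeverity(sev), file, line, strings.TrimSuffix(msg, "\n"))
}

func main() {
	fmt.Println(formatEntry(glogWarning, "segment.cpp", 42, "slow load\n"))
}
```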
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
Signed-off-by: chyezh <chyezh@outlook.com>
Bump milvus version to v2.6.8
Signed-off-by: sre-ci-robot sre-ci-robot@users.noreply.github.com
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
**Core Invariant**: This PR assumes the Milvus v2.6.8 Docker image is a
stable, compatible release that can transparently replace v2.6.7 in
standalone embed configurations without breaking backward compatibility.
**What Changed**: Updated the Milvus Docker image tag from `v2.6.7` to
`v2.6.8` in two standalone embedding configuration scripts:
- `scripts/standalone_embed.bat` (line 83)
- `scripts/standalone_embed.sh` (line 62)
**Why This Is Safe**: These scripts only specify the container image
version and pass through pre-existing configuration files
(`embedEtcd.yaml`, `user.yaml`) to the container. No local logic, data
schemas, or API contracts are modified—the container startup behavior
remains identical, just pulling a newer upstream image tag. Version
increments within the same major.minor series (v2.6.x) follow semantic
versioning conventions ensuring no breaking changes.
**Impact**: Users pulling or running these standalone embed scripts will
automatically use the newer v2.6.8 Milvus release, receiving bug fixes
and enhancements from the patch version bump while maintaining
compatible behavior with existing configurations.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Bump milvus version to v2.6.8
Signed-off-by: sre-ci-robot sre-ci-robot@users.noreply.github.com
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Version Bump Summary: Milvus v2.6.7 → v2.6.8
**Core Invariant**: This PR assumes v2.6.8 is a drop-in replacement for
v2.6.7, maintaining API compatibility and deployment configuration
compatibility across all environments (standalone, GPU, and cluster
deployments).
**Scope of Changes**: Updates Docker image version references from
`milvusdb/milvus:v2.6.7` to `milvusdb/milvus:v2.6.8` across all
deployment documentation and configuration files:
- Binary deployment README
- Standalone docker-compose.yml (CPU variant)
- GPU standalone docker-compose.yml
- Cluster distributed deployment inventory.ini
**No Behavior Regression Risk**: Since this modifies only external
artifact references (Docker image tags in deployment configs and
documentation examples), not any runtime logic or data schemas, there is
zero risk of data loss or operational regression. The semantic
versioning convention (patch-level bump: v2.6.7 → v2.6.8) indicates this
is a maintenance release with backward compatibility preserved.
**Automation Context**: This is an automated version bump by
sre-ci-robot, indicating a routine dependency update process rather than
manual configuration changes.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
If collection TTL property is malformed (e.g., non-numeric value),
compaction tasks would fail silently and get stuck. This change:
- Add centralized GetCollectionTTL/GetCollectionTTLFromMap functions in
pkg/common to handle TTL parsing with proper error handling
- Validate TTL property in createCollectionTask and alterCollectionTask
PreExecute to reject invalid values early
- Refactor datacoord compaction policies to use the new common functions
- Remove duplicated getCollectionTTL from datacoord/util.go
issue: #46716
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
- Core invariant: collection.ttl.seconds must be a parseable int64 and
validated at collection creation/alter time so malformed TTLs never
reach compaction/execution codepaths.
- Bug fix (resolves #46716): malformed/non-numeric TTLs could silently
cause compaction tasks to fail/stall; fixed by adding centralized
parsing helpers pkg/common.GetCollectionTTL and GetCollectionTTLFromMap
and validating TTL in createCollectionTask.PreExecute and
alterCollectionTask.PreExecute (calls with default -1 and return
parameter-invalid errors on parse failure).
- Simplification / removed redundancy: eliminated duplicated
getCollectionTTL in internal/datacoord/util.go and replaced ad-hoc TTL
parsing across datacoord (compaction policies, import_util, compaction
triggers) and proxy util with the common helpers, centralizing error
handling and defaulting logic.
- No data loss or behavior regression: valid TTL parsing semantics
unchanged (helpers use identical int64 parsing and default fallback from
paramtable/CommonCfg); validation occurs in PreExecute so existing valid
collections proceed unchanged while malformed values are rejected
early—compaction codepaths now receive only validated TTL values (or
explicit defaults), preventing silent skips without altering valid
execution flows.
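A centralized TTL parser of the kind described above might look like the sketch below. The function name `collectionTTLFromMap` and its signature are assumptions for illustration; the real helpers are `GetCollectionTTL` and `GetCollectionTTLFromMap` in pkg/common, whose exact signatures are not shown here.

```go
package main

import (
	"fmt"
	"strconv"
	"time"
)

// ttlKey is the collection property that holds the TTL in seconds.
const ttlKey = "collection.ttl.seconds"

// collectionTTLFromMap parses the TTL property as an int64 number of seconds.
// A missing property yields defaultTTL; a malformed value yields an error
// so callers can reject it early instead of failing later in compaction.
func collectionTTLFromMap(props map[string]string, defaultTTL time.Duration) (time.Duration, error) {
	raw, ok := props[ttlKey]
	if !ok {
		return defaultTTL, nil
	}
	secs, err := strconv.ParseInt(raw, 10, 64)
	if err != nil {
		return 0, fmt.Errorf("invalid %s %q: %w", ttlKey, raw, err)
	}
	return time.Duration(secs) * time.Second, nil
}

func main() {
	ttl, err := collectionTTLFromMap(map[string]string{ttlKey: "3600"}, -1)
	fmt.Println(ttl, err)

	_, err = collectionTTLFromMap(map[string]string{ttlKey: "one-hour"}, -1)
	fmt.Println(err != nil)
}
```

Calling the same helper from both the validation step and the compaction path keeps the accepted format identical in both places.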
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
---------
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
issue: #46500
- simplify run_go_codecov.sh so that set -e catches any sub-command
failure.
- remove all embedded etcd in tests so the full test suite can be run locally.
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## PR Summary: Simplify Go Unit Tests by Removing Embedded etcd and
Async Startup Scaffolding
**Core Invariant:**
This PR assumes that unit tests can be simplified by running without
embedded etcd servers (delegating to environment-based or external etcd
instances via `kvfactory.GetEtcdAndPath()` or `ETCD_ENDPOINTS`) and by
removing goroutine-based async startup scaffolding in favor of
synchronous component initialization. Tests remain functionally
equivalent while becoming simpler to run and debug locally.
**What is Removed or Simplified:**
1. **Embedded etcd test infrastructure deleted**: Removes
`EmbedEtcdUtil` type and its public methods (SetupEtcd,
TearDownEmbedEtcd) from `pkg/util/testutils/embed_etcd.go`, removes the
`StartTestEmbedEtcdServer()` helper from `pkg/util/etcd/etcd_util.go`,
and removes etcd embedding from test suites (e.g., `TaskSuite`,
`EtcdSourceSuite`, `mixcoord/client_test.go`). Tests now either skip
etcd-dependent tests (via `MILVUS_UT_WITHOUT_KAFKA=1` environment flag
in `kafka_test.go`) or source etcd from external configuration (via
`kvfactory.GetEtcdAndPath()` in `task_test.go`, or `ETCD_ENDPOINTS`
environment variable in `etcd_source_test.go`). This eliminates the
overhead of spinning up temporary etcd servers for unit tests.
2. **Async startup scaffolding replaced with synchronous
initialization**: In `internal/proxy/proxy_test.go` and
`proxy_rpc_test.go`, the `startGrpc()` method signature removes the
`sync.WaitGroup` parameter; components are now created, prepared, and
run synchronously in-place rather than in goroutines (e.g., `go
testServer.startGrpc(ctx, &p)` becomes `testServer.startGrpc(ctx, &p)`
running synchronously). Readiness checks (e.g., `waitForGrpcReady()`)
remain in place to ensure startup safety without concurrency constructs.
This simplifies control flow and reduces debugging complexity.
3. **Shell script orchestration unified with proper error handling**: In
`scripts/run_go_codecov.sh` and `scripts/run_intergration_test.sh`,
per-package inline test invocations are consolidated into a single
`test_cmd()` function with unified `TEST_CMD_WITH_ARGS` array containing
race, coverage, verbose, and other flags. The problematic `set -ex` is
replaced with `set -e` alone (removing debug output noise while
preserving strict error semantics), ensuring the scripts fail fast on
any command failure.
**Why No Regression:**
- Test assertions and code paths remain unchanged; only deployment
source of etcd (embedded → external) and startup orchestration (async →
sync) change.
- Readiness verification (e.g., `waitForGrpcReady()`) is retained,
ensuring components are initialized before test execution.
- Test flags (race detection, coverage, verbosity) are uniformly applied
across all packages via unified `TEST_CMD_WITH_ARGS`, preserving test
coverage and quality.
- `set -e` alone is sufficient for strict failure detection without the
`-x` flag's verbose output.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
---------
Signed-off-by: chyezh <chyezh@outlook.com>
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
- Core invariant: config refresh events must reliably propagate updated
values and evict cached entries within a bounded time window; tests must
observe this deterministically without relying on fixed sleeps.
- Logic simplified: brittle fixed time.Sleep delays and separate error
assertions were replaced by assert.Eventually polling blocks that
combine value checks and cache-eviction verification, and consolidated
checks to reduce redundant assertions.
- Why no data loss / no behavior regression: only test synchronization
and assertions were changed—production config manager code paths (value
propagation, KV puts, cache eviction) are untouched; tests now wait for
the same outcomes more robustly, so no mutation of runtime behavior or
storage occurs.
- Enhancement scope: this is a test-stability improvement (no new
runtime capability); it fixes flaky unit tests (root cause: timing
assumptions) by replacing fixed waits with bounded polling and by using
t.Context for KV puts to align test context usage.
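The bounded-polling pattern that replaces fixed sleeps can be sketched with the standard library alone. The `eventually` helper below is a hypothetical stand-in that mirrors the idea behind testify's `assert.Eventually` (poll a condition at a fixed tick until a deadline); it is not the test code from this change.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// eventually polls cond every tick until it returns true or waitFor elapses.
// It returns whether the condition was observed in time.
func eventually(cond func() bool, waitFor, tick time.Duration) bool {
	deadline := time.Now().Add(waitFor)
	for {
		if cond() {
			return true
		}
		if time.Now().After(deadline) {
			return false
		}
		time.Sleep(tick)
	}
}

func main() {
	var value atomic.Value
	value.Store("old")

	// Simulate an asynchronous config refresh arriving a little later.
	go func() {
		time.Sleep(20 * time.Millisecond)
		value.Store("new")
	}()

	ok := eventually(func() bool {
		return value.Load().(string) == "new"
	}, time.Second, 5*time.Millisecond)
	fmt.Println(ok)
}
```

The test finishes as soon as the condition holds, and the timeout only bounds the worst case.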
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
Signed-off-by: aoiasd <zhicheng.yue@zilliz.com>
issue: #45841
- C++ logs: merge multi-line log output into a single debug line,
removing the "\n\t".
- remove some logs that make no sense.
- slow down some logs, such as ChannelDistManager.
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
- Core invariant: logging is purely observational — this PR only
reduces, consolidates, or reformats diagnostic output (removing
per-item/noise logs, consolidating batched logs, and converting
multi-line log strings) while preserving all control flow, return
values, and state mutations across affected code paths.
- Removed / simplified logic: deleted low-value per-operation debug/info
logs (e.g., ListIndexes, GetRecoveryInfo, GcConfirm,
push-to-reorder-buffer, several streaming/wal/debug traces), replaced
per-item inline logs with single batched deferred logs in
querynodev2/delegator (logExcludeInfo) and CleanInvalid, changed C++
PlanNode ToString() multi-line output to compact single-line bracketed
format (removed "\n\t"), and added thresholded interceptor logging
(InterceptorMetrics.ShouldBeLogged) and message-type-driven log levels
to avoid verbose entries.
- Why this does NOT cause data loss or behavioral regression: no
function signatures, branching, state updates, persistence calls, or
return values were changed — examples: ListIndexes still returns the
same Status/IndexInfos; GcConfirm still constructs and returns
resp.GetGcFinished(); Insert and CleanInvalid still perform the same
insert/removal operations (only their per-item logging was aggregated);
PlanNode ToString changes only affect emitted debug strings. All error
handling and control flow paths remain intact.
- Enhancement intent: reduce log volume and improve signal-to-noise for
debugging by removing redundant, noisy logs and emitting concise,
rate-/threshold-limited summaries while preserving necessary diagnostics
and original program behavior.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
---------
Signed-off-by: chyezh <chyezh@outlook.com>
issue: #46669
When partial result is enabled (PartialResultRequiredDataRatio < 1.0),
the Serviceable() method would return true even if syncedByCoord is
false (by bypassing viewReady check). However, PinReadableSegments uses
GetLoadedRatio() == 1.0 to decide whether to filter segments by target
version.
This causes a problem: when loadedRatio == 1.0 but syncedByCoord ==
false, segments are filtered by an incorrect target version, resulting
in an empty segment list during search.
This change:
- Replace GetLoadedRatio() == 1.0 with Serviceable() check to ensure
target version filtering only happens after coord sync completes
- Remove partial result bypass in Serviceable() to keep the check
consistent
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Bug Fix Summary
**Core Invariant**: `Serviceable()` must enforce a strict requirement
that both data loading AND coordinator synchronization are complete
before allowing full search operations. This prevents using stale or
uninitialized target versions.
**Logic Removed/Simplified**:
- Removed the partial-result bypass from `Serviceable()` that previously
allowed it to return `true` even when `syncedByCoord == false`
- Replaced `GetLoadedRatio() == 1.0` checks in `PinReadableSegments`
with `Serviceable()` calls to ensure target-version filtering only
occurs after coord sync completes
- Simplified the serviceability condition from parameterized
partial-result logic to a direct conjunction: `loadedRatio >= 1.0 AND
syncedByCoord == true`
**No Data Loss or Regression**: The change is safe because:
- When `Serviceable()` returns `true` (both loadedRatio ≥ 1.0 AND
syncedByCoord == true), segments are filtered by the current valid target
version—this is the full-result path
- When `Serviceable()` returns `false` but `loadedRatio >=
requiredLoadRatio` (partial result case), segments are filtered against
the query view's segment lists rather than target version, ensuring
non-empty results as validated by
`TestPinReadableSegments_PartialResultNotEmpty`
- The test explicitly demonstrates that even with `loadedRatio == 1.0`
and `syncedByCoord == false`, calling `PinReadableSegments(0.8,
partition)` returns segments (partial result) instead of an empty list,
which was the bug root cause
**Root Cause Fix**: Previously, segments could be filtered with
`unreadableTargetVersion` when `loadedRatio == 1.0` but the querycoord
hadn't yet synced the target, causing empty segment lists. Now the sync
state is checked before deciding the filtering strategy, preventing this
race condition.
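The resulting decision can be summarized in a small sketch. The `distribution` type, `choosePinStrategy` method, and strategy names below are hypothetical simplifications; only `Serviceable()`, `loadedRatio`, `syncedByCoord`, and the required load ratio come from the description above.

```go
package main

import "fmt"

// distribution is a simplified stand-in for the delegator's view state.
type distribution struct {
	loadedRatio   float64
	syncedByCoord bool
}

// Serviceable requires both full loading and completed coord sync.
func (d distribution) Serviceable() bool {
	return d.loadedRatio >= 1.0 && d.syncedByCoord
}

type pinStrategy string

const (
	filterByTargetVersion pinStrategy = "target-version"
	filterByQueryView     pinStrategy = "query-view"
	notServiceable        pinStrategy = "not-serviceable"
)

// choosePinStrategy picks how segments are filtered when pinning.
// Target-version filtering is used only once Serviceable() holds; otherwise a
// sufficient loaded ratio falls back to the query view's segment lists.
func (d distribution) choosePinStrategy(requiredLoadRatio float64) pinStrategy {
	switch {
	case d.Serviceable():
		return filterByTargetVersion
	case d.loadedRatio >= requiredLoadRatio:
		return filterByQueryView
	default:
		return notServiceable
	}
}

func main() {
	// Fully loaded but not yet synced: the case that used to yield no segments.
	d := distribution{loadedRatio: 1.0, syncedByCoord: false}
	fmt.Println(d.choosePinStrategy(0.8))
}
```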
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
issue: #46676
This change fixes a bug where etcd prefix queries in RBAC could
incorrectly match entries with similar prefixes. For example, when
querying roles for user "admin", it could mistakenly return roles
belonging to "admin2".
The fix adds explicit "/" suffix to prefix keys before LoadWithPrefix
calls in three locations:
- getRolesByUsername: user role mapping queries
- ListGrant (appendGrantEntity): grantee ID queries
- ListGrant (role query): role grant queries
Also updates related unit tests to match the new prefix format and adds
TestRBACPrefixMatch to verify the fix.
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Bug Fix: RBAC Etcd Prefix Matching Data Leakage
**Core Invariant:**
Etcd prefix queries must use explicit "/" delimiters between key
segments to enforce strict hierarchical boundaries; without them,
string-prefix matching returns all keys with similar starting characters
(e.g., prefix "admin" matches both "admin" and "admin2").
**Root Cause & Fix:**
The bug occurred in three RBAC query functions where prefix-based
lookups lacked trailing "/" separators. For example,
`getRolesByUsername(ctx, tenant, "admin")` would construct prefix
`"RoleMappingPrefix/tenant/admin"` and query `LoadWithPrefix(ctx,
prefix)`, unintentionally matching roles assigned to both "admin" and
"admin2" users. The fix appends "/" to the prefix before querying (e.g.,
`prefix + "/"`), making queries strictly match the intended
user/role/grantee entry only.
**Why No Data Loss or Regression:**
The fix modifies only how keys are *queried*, not how they are *stored*.
Etcd keys remain unchanged (still formatted as
`"RoleMappingPrefix/tenant/username/rolename"`). The corresponding
parsing logic using `typeutil.AfterN(key, k, "/")` correctly extracts
role names since the prefix `k` now ends with "/" (eliminating the need
to append "/" in the delimiter argument). All three affected code
paths—`getRolesByUsername`, `ListGrant` grantee ID queries, and
`ListGrant` role grant queries—consistently apply the same pattern,
ensuring backward-compatible behavior while fixing the unintended
cross-user/role leakage.
**Verification:**
New test suite `TestRBACPrefixMatch` confirms that querying user "user1"
no longer returns user "user10"'s roles, and similarly for role/grantee
ID prefixes, validating the fix resolves the reported data isolation
issue.
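The effect of the trailing "/" can be reproduced with plain string-prefix matching. In this sketch `loadWithPrefix` is a hypothetical in-memory stand-in for the etcd `LoadWithPrefix` call, and `rolesForUser` is an illustrative helper rather than the actual `getRolesByUsername` code.

```go
package main

import (
	"fmt"
	"strings"
)

// loadWithPrefix mimics a KV prefix scan over an in-memory key list.
func loadWithPrefix(keys []string, prefix string) []string {
	var out []string
	for _, k := range keys {
		if strings.HasPrefix(k, prefix) {
			out = append(out, k)
		}
	}
	return out
}

// rolesForUser builds the prefix with a trailing "/" so that "admin" cannot
// match keys belonging to "admin2", then strips the prefix to get role names.
func rolesForUser(keys []string, tenant, user string) []string {
	prefix := strings.Join([]string{"RoleMappingPrefix", tenant, user}, "/") + "/"
	var roles []string
	for _, k := range loadWithPrefix(keys, prefix) {
		roles = append(roles, strings.TrimPrefix(k, prefix))
	}
	return roles
}

func main() {
	keys := []string{
		"RoleMappingPrefix/tenant/admin/role1",
		"RoleMappingPrefix/tenant/admin2/role2",
	}
	// Without the trailing "/", both users' keys match.
	fmt.Println(len(loadWithPrefix(keys, "RoleMappingPrefix/tenant/admin")))
	// With it, only admin's role is returned.
	fmt.Println(rolesForUser(keys, "tenant", "admin"))
}
```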
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
test: add unit tests for mixed int64/float types in BinaryRangeExpr
When processing binary range expressions (e.g., `x > 499 && x <= 512.0`)
on JSON/dynamic fields with expression templates, the lower and upper
bounds could have different numeric types (int64 vs float64). This
caused an assertion failure in GetValueFromProto when the template type
didn't match the actual proto value type.
Fixes:
1. Go side (fill_expression_value.go): Normalize numeric types for JSON
fields - if either bound is float and the other is int, convert the int
to float.
2. C++ side (BinaryRangeExpr.cpp):
- Check both lower_val and upper_val types when dispatching
- Use double template when either bound is float
- Use GetValueWithCastNumber instead of GetValueFromProto to safely
handle int64->double conversion
issue: #46588
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
- Core invariant: JSON field binary-range expressions must present
numeric bounds to the evaluator with a consistent numeric type; if
either bound is floating-point, both bounds must be treated as double to
avoid proto-type mismatches during template instantiation.
- Bug fix (issue #46588 & concrete change): mixed int64/float bounds
could dispatch the wrong template (e.g.,
ExecRangeVisitorImplForJson<int64_t>) and trigger assertions in
GetValueFromProto. Fixes: (1) Go parser (FillBinaryRangeExpressionValue
in fill_expression_value.go) normalizes mixed JSON numeric bounds by
promoting the int bound to float; (2) C++ evaluator
(PhyBinaryRangeFilterExpr::Eval in BinaryRangeExpr.cpp) inspects both
lower_type and upper_type, sets use_double when either is float, selects
ExecRangeVisitorImplForJson<double> for mixed numeric cases, and
replaces GetValueFromProto with GetValueWithCastNumber so int64→double
conversions are handled safely.
- Removed / simplified logic: the previous evaluator branched on only
the lower bound's proto type and had separate index/non-index handling
for int64 vs float; that per-bound branching is replaced by unified
numeric handling (convert to double when needed) and a single numeric
path for index use — eliminating redundant, error-prone branches that
assumed homogeneous bound types.
- No data loss or regression: changes only promote int→double for
JSON-range comparisons when the other bound is float; integer-only and
float-only paths remain unchanged. Promotion uses IEEE double (C++
double and Go float64) and only affects template dispatch and
value-extraction paths; GetValueWithCastNumber safely converts int64 to
double and index/non-index code paths both normalize consistently,
preserving semantics for comparisons and avoiding assertion failures.
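The Go-side normalization can be illustrated with a simplified value type. The `bound` struct and `normalizeBounds` function are hypothetical; the real code operates on proto generic values inside FillBinaryRangeExpressionValue, which is not reproduced here.

```go
package main

import "fmt"

// bound is a simplified numeric range bound: either an int64 or a float64.
type bound struct {
	isFloat bool
	i       int64
	f       float64
}

// normalizeBounds promotes the integer bound to float when the two bounds
// have different numeric types, so both sides share one type downstream.
// Bounds that already agree are returned unchanged.
func normalizeBounds(lower, upper bound) (bound, bound) {
	if lower.isFloat == upper.isFloat {
		return lower, upper
	}
	promote := func(b bound) bound {
		if b.isFloat {
			return b
		}
		return bound{isFloat: true, f: float64(b.i)}
	}
	return promote(lower), promote(upper)
}

func main() {
	// x > 499 && x <= 512.0 : the lower bound is int, the upper bound is float.
	lo, hi := normalizeBounds(bound{i: 499}, bound{isFloat: true, f: 512.0})
	fmt.Println(lo.isFloat, lo.f, hi.isFloat, hi.f)
}
```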
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
---------
Signed-off-by: Buqian Zheng <zhengbuqian@gmail.com>
https://github.com/milvus-io/milvus/issues/42589
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Semantic Highlighting Feature
**Core Invariant**: Semantic highlighting operates on a per-field basis
with independent text processing through an external Zilliz highlight
provider. The implementation maintains field ID to field name mapping
and correlates highlight results back to original field outputs.
**What is Added**: This PR introduces semantic highlighting capability
for search results alongside the existing lexical highlighting. The
feature consists of:
- New `SemanticHighlight` orchestrator that validates queries/input
fields against collection schema, instantiates a Zilliz-based provider,
and batches text processing across multiple queries
- New `SemanticHighlighter` proxy wrapper implementing the `Highlighter`
interface for search pipeline integration
- New `semanticHighlightOperator` that processes search results by
delegating per-field text processing to the provider and attaching
correlated `HighlightResult` data to search outputs
- New gRPC service definition (`HighlightService`) and
`ZillizClient.Highlight()` method for external provider communication
**No Data Loss or Regression**: The change is purely additive without
modifying existing logic:
- Lexical highlighting path remains unchanged (separate switch case in
`createHighlightTask`)
- New `HighlightResults` field is only populated when semantic
highlighting is explicitly requested via `HighlightType_Semantic` enum
value
- Gracefully handles missing fields by returning explicit errors rather
than silent failures
- Pipeline operator integration follows existing patterns and only
processes when semantic highlighter is instantiated
**Why This Design**: Semantic highlighting is routed through the same
pipeline operator pattern as lexical highlighting, ensuring consistent
integration into search workflows. The per-field model allows flexible
highlighting across different text columns and batch processing ensures
efficient handling of multiple queries with configurable provider
constraints.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
Signed-off-by: junjie.jiang <junjie.jiang@zilliz.com>
issue: #46481
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
- Core invariant: DiskANN requires OS support for asynchronous I/O
(AIO); the Makefile now encodes this by defaulting disk_index=OFF on
Darwin (macOS) and disk_index=ON on other OSes where AIO is available.
- Simplified logic: the build-time default was inverted for non-Darwin
platforms so DiskANN is enabled by default; redundant conditional
handling that previously forced OFF everywhere has been removed in favor
of an OS-based default while preserving the single manual override
variable.
- No data loss or regression: this is a compile-time change only — it
toggles inclusion of DiskANN code paths at build time and does not
modify runtime persistence or existing index files. macOS builds still
skip AIO-dependent DiskANN code paths, and Linux/other builds merely
compile support by default; no migration or runtime data-path changes
are introduced.
- Backward compatibility / fix for issue #46481: addresses the reported
need to enable DiskANN by default (issue #46481) while keeping explicit
disk_index overrides intact for CI and developer workflows, so existing
build scripts and deployments that pass disk_index continue to behave
unchanged.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
Signed-off-by: xianliang.li <xianliang.li@zilliz.com>
related issue: #46616
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
- Core invariant: these tests assume the v2 group-by search
implementation (TestMilvusClientV2Base and pymilvus v2 APIs such as
AnnSearchRequest/WeightedRanker) is functionally correct; the PR extends
coverage to validate group-by semantics when using JSON fields and
dynamic fields (see
tests/python_client/milvus_client_v2/test_milvus_client_search_group_by.py
— TestGroupSearch.setup_class and parametrized group_by_field cases).
- Logic removed/simplified: legacy v1 test scaffolding and duplicated
parametrized fixtures/test permutations were consolidated into
v2-focused suites (TestGroupSearch now inherits TestMilvusClientV2Base;
old TestGroupSearch/TestcaseBase patterns and large blocks in
test_mix_scenes were removed) to avoid redundant fixture permutations
and duplicate assertions while reusing shared helpers in common_func
(e.g., gen_scalar_field, gen_row_data_by_schema) and common_type
constants.
- Why this does NOT introduce data loss or behavior regression: only
test code, test helpers, and test imports were changed — no
production/server code altered. Test helper changes are
backward-compatible (gen_scalar_field forces primary key nullable=False
and only affects test data generation paths in
tests/python_client/common/common_func.py; get_field_dtype_by_field_name
now accepts schema dicts/ORM schemas and is used only by tests to choose
vector generation) and collection creation/insertion in tests use the
same CollectionSchema/FieldSchema paths, so production
storage/serialization logic is untouched.
- New capability (test addition): adds v2 test coverage for group-by
search over JSON and dynamic fields plus related scenarios — pagination,
strict/non-strict group_size, min/max group constraints, multi-field
group-bys and binary vector cases — implemented in
tests/python_client/milvus_client_v2/test_milvus_client_search_group_by.py
to address issue #46616.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
---------
Signed-off-by: yanliang567 <yanliang.qiao@zilliz.com>
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Pull Request Summary: Test Case Updates for API Behavior Changes
**Core Invariant**: These test case updates reflect backend API
improvements to error messaging and schema information returned by
collection operations. The changes maintain backward compatibility—no
public signatures change, and all modifications are test expectation
updates.
**Updated Error Messages for Better Diagnostics**:
- `test_add_field_feature.py`: Updated expected error when adding a
vector field without dimension specification from a generic "not support
to add vector field" to the more descriptive "vector field must have
dimension specified, field name = {field_name}: invalid parameter". This
change is non-breaking for clients that only check error codes; it
improves developer experience with clearer error context.
**Schema Information Extension**:
- `test_milvus_client_collection.py`: Added `enable_namespace: False` to
the expected `describe_collection()` output. This is a new boolean field
in the collection metadata that defaults to False, representing an
opt-in feature. Existing code querying describe_collection continues to
work; the new field is simply an additional property in the response
dictionary.
**Dynamic Error Message Construction**:
- `test_milvus_client_search_invalid.py`: Replaced hardcoded error
message with conditional logic that generates the appropriate error
based on input state (None vectors vs invalid vector data). This
prevents brittle test failures when multiple error conditions exist, and
verifies that the API handles both "missing data" and "malformed data"
cases distinctly.
**No Regression Risk**: All changes update test expectations to match
improved backend behavior. The changes are additive (new field in
schema) or clarifying (better error messages), with no modifications to
existing response structures or behavior for valid inputs.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
Signed-off-by: nico <cheng.yuan@zilliz.com>
issue: #46687
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
- Core invariant: raw-data cleanup must be scoped to (segment_id,
field_id) so deleting temporary raw files for one field never removes
raw files for other fields in the same segment (prevents cross-field
deletion during index builds).
- Root cause and fix (bug): VectorDiskIndex::Build() and
BuildWithDataset() called RemoveDir on the segment-level path; this
removed rawdata/{segment_id}/. The fix changes both calls to remove
storage::GenFieldRawDataPathPrefix(local_chunk_manager, segment_id,
field_id) instead, limiting cleanup to rawdata/{segment_id}_{field_id}/
(field-scoped).
- Logic removed/simplified: the old helper GetSegmentRawDataPathPrefix
was removed and callers were switched to GenFieldRawDataPathPrefix;
cleanup logic is simplified from segment-level to field-level path
generation and removal, eliminating redundant broad deletions.
- Why this does NOT cause data loss or regress behavior: the change
narrows RemoveDir() to the exact field path used when caching raw data
and offsets earlier in Build (offsets_path and CacheRawDataToDisk
produce field-scoped local paths). Build still writes/reads offsets and
raw data from GenFieldRawDataPathPrefix(...) and then removes that same
prefix after successful index.Build(); therefore only temporary files
for the built field are deleted and other fields’ raw files under the
same segment are preserved. This fixes issue #46687 by preventing
accidental deletion of other fields’ raw data.
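
The difference between segment-scoped and field-scoped cleanup can be illustrated with a small sketch. It is written in Python rather than the C++ of the actual fix, and `field_raw_data_prefix` is a hypothetical stand-in for `storage::GenFieldRawDataPathPrefix`. The layout follows the `rawdata/{segment_id}_{field_id}/` form described above; everything else is an assumption made for illustration.

```python
import os
import shutil
import tempfile


def field_raw_data_prefix(root, segment_id, field_id):
    # Hypothetical helper: field-scoped directory rawdata/{segment_id}_{field_id}/
    return os.path.join(root, "rawdata", f"{segment_id}_{field_id}")


def cache_raw_data(root, segment_id, field_id):
    # Write a placeholder raw-data file under the field-scoped prefix.
    prefix = field_raw_data_prefix(root, segment_id, field_id)
    os.makedirs(prefix, exist_ok=True)
    with open(os.path.join(prefix, "raw_data"), "wb") as f:
        f.write(b"\x00" * 16)
    return prefix


def cleanup_after_build(root, segment_id, field_id):
    # Remove only the temporary files of the field that was just built.
    shutil.rmtree(field_raw_data_prefix(root, segment_id, field_id),
                  ignore_errors=True)


root = tempfile.mkdtemp()
built = cache_raw_data(root, segment_id=1, field_id=100)
other = cache_raw_data(root, segment_id=1, field_id=101)
cleanup_after_build(root, segment_id=1, field_id=100)
built_exists, other_exists = os.path.exists(built), os.path.exists(other)
shutil.rmtree(root)  # tidy up the temporary directory used by the sketch
```

After the cleanup, the directory for field 100 is gone while the one for field 101 in the same segment is still present, which is the invariant the fix restores.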
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
Signed-off-by: Cai Zhang <cai.zhang@zilliz.com>
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
- Core invariant: TestWait must deterministically verify that
FixedSizeAllocator.Wait() is notified when virtual resources are
released, so an allocation blocked due to exhausted virtual capacity
eventually succeeds after explicit deallocations.
- Removed/simplified logic: replaced the previous flaky pattern that
spawned 100 concurrent goroutines performing Reallocate with an explicit
channel-synchronized release goroutine that performs 100 sequential
negative Reallocate calls only after the test blocks on allocation. This
eliminates timing-dependent concurrency and the nondeterministic i-based
assertion.
- Why no data loss or behavior regression: only the test changed —
allocator implementation (Allocate/Reallocate/Release/Wait/notify) and
public APIs are unchanged. The test now exercises the same code paths
(Allocate fails, Wait blocks on cond, Reallocate/Release call notify),
but in a deterministic order, so the allocator semantics and resource
accounting (used, allocs map) remain unaffected.
- Change type and intent: Enhancement/refactor of unit test stability —
it tightens test synchronization to remove race-dependent assertions and
ensure the Wait/notify mechanism is reliably exercised without modifying
allocator logic.
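
The synchronization pattern described above can be sketched as follows. This is a loose Python analogue, not the Go test or the real `FixedSizeAllocator`: the class, `wait_and_allocate`, and the `on_blocked` hook are hypothetical names, and a `threading.Event` plays the role of the channel that tells the release thread when to start.

```python
import threading


class FixedSizeAllocator:
    """Toy allocator: a capacity counter guarded by a condition variable."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.used = 0
        self._cond = threading.Condition()

    def allocate(self, amount):
        with self._cond:
            if self.used + amount > self.capacity:
                return False
            self.used += amount
            return True

    def release(self, amount):
        with self._cond:
            self.used -= amount
            self._cond.notify_all()  # wake any waiter so it re-checks capacity

    def wait_and_allocate(self, amount, on_blocked, timeout=5.0):
        with self._cond:
            # Signal while holding the lock: releases cannot run until
            # wait_for() below gives the lock up, so none can be missed.
            on_blocked()
            ok = self._cond.wait_for(
                lambda: self.used + amount <= self.capacity, timeout)
            if ok:
                self.used += amount
            return ok


def run_wait_test():
    alloc = FixedSizeAllocator(capacity=100)
    assert alloc.allocate(100)      # exhaust the capacity
    assert not alloc.allocate(1)    # further allocation fails
    blocked = threading.Event()

    def releaser():
        blocked.wait()              # start only once the waiter is blocked
        for _ in range(100):        # 100 sequential releases
            alloc.release(1)

    worker = threading.Thread(target=releaser)
    worker.start()
    ok = alloc.wait_and_allocate(100, on_blocked=blocked.set)
    worker.join()
    return ok, alloc.used
```

Because the releases begin only after the waiter has signalled that it is about to block, the outcome does not depend on thread scheduling.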
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
Signed-off-by: aoiasd <zhicheng.yue@zilliz.com>
issue: https://github.com/milvus-io/milvus/issues/42053
Process n-grams in batches rather than all at once.
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Batch Processing for N-gram Queries
**Core Invariant:** All data iteration is now driven by `batch_size_` as
the fundamental unit; for sealed chunked segments processing string/JSON
data, processing is strictly stateless to allow specialized batched
algorithms.
**Simplified Logic:**
- Removed the `process_all_chunks` boolean flag from
`ProcessMultipleChunksCommon` (renamed to
`ProcessDataChunksForMultipleChunk`) as it was redundant—all iteration
paths now converge on the same `batch_size_`-driven chunking strategy
with unified data size clamping (`std::min(chunk_size, batch_size_ -
processed_size)`).
- Eliminated the old wrapper methods
(`ProcessDataChunksForMultipleChunk` and
`ProcessAllChunksForMultipleChunk`) that delegated to a
single common implementation with a conditional flag.
**No Data Loss or Behavior Regression:**
- The new `ProcessAllDataChunkBatched<T>` is an additional stateless
public path (requires sealed + chunked segments, type constraints:
`std::string_view|Json|ArrayView`) that iterates all `num_data_chunk_`
chunks in `batch_size_` granularity without mutating cursor state
(`current_data_chunk_`, `current_data_chunk_pos_`), ensuring
deterministic re-entrant processing.
- Existing cursor-based APIs (`ProcessDataChunksForMultipleChunk`,
`ProcessChunkForSealedSeg`) remain unchanged for standard expression
evaluation—no segment state is corrupted.
- N-gram query execution now routes through
`ExecuteQueryWithPredicate<T, Predicate>(literal, segment, predicate,
need_post_filter)` which forwards generic predicates and delegates to
`segment->ProcessAllDataChunkBatched<T>(execute_batch, res)` for
post-filtering, avoiding per-chunk single-pass traversal.
**Enhancement:** Generic predicate template `template <typename T,
typename Predicate>` with perfect forwarding (`Predicate&& predicate`)
replaces the fixed `std::function<bool(const T&)>` signature,
eliminating function wrapper overhead for n-gram matcher closures and
enabling efficient batch processing callbacks.
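
The batch-size-driven iteration with size clamping can be sketched as below. This is a simplified Python model, not the C++ implementation: `process_all_chunks_batched` is a hypothetical name, chunks are plain lists, and the predicate is an ordinary callable. The clamping line mirrors `std::min(chunk_size, batch_size_ - processed_size)`, and the cursor is kept in local variables so no shared state is mutated.

```python
def process_all_chunks_batched(chunks, batch_size, predicate):
    """Apply predicate to every value, walking the chunks in batches."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    batch_sizes, matches = [], []
    chunk_idx, pos = 0, 0  # local cursor only; nothing outside is modified
    while chunk_idx < len(chunks):
        batch = []
        while len(batch) < batch_size and chunk_idx < len(chunks):
            chunk = chunks[chunk_idx]
            # Clamp to what is left in this chunk and in this batch.
            take = min(len(chunk) - pos, batch_size - len(batch))
            batch.extend(chunk[pos:pos + take])
            pos += take
            if pos == len(chunk):  # chunk exhausted, move to the next one
                chunk_idx, pos = chunk_idx + 1, 0
        if batch:
            batch_sizes.append(len(batch))
            matches.extend(predicate(value) for value in batch)
    return batch_sizes, matches


chunks = [["ab", "cd", "ef"], ["gh"], ["ij", "kl", "mn", "op"]]
sizes, hits = process_all_chunks_batched(chunks, 3, lambda s: "k" in s)
```

With a batch size of 3, the eight values are visited as batches of 3, 3, and 2, with the second batch spanning two chunks. Calling the function again yields the same result because no cursor state survives the call.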
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
---------
Signed-off-by: SpadeA <tangchenjie1210@gmail.com>
Issue: #46627
add one more test case to cover duplicate pk partial update
On branch feature/partial-update
Changes to be committed:
modified: milvus_client/test_milvus_client_partial_update.py
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
- Core invariant: upserts with partial_update=True consolidate records
by primary key (PK) rather than creating duplicate rows; this test
verifies the partial-update upsert path preserves PK identity and merge
semantics.
- Change: adds test
test_milvus_client_partial_update_duplicate_pk_partial_update which
inserts duplicate-PK batches, then calls client.upsert(...,
partial_update=True) on a subset of fields and asserts final row count
equals default_nb, exercising the partial-update code path (upsert →
partial update handling → query) not previously covered.
- No production logic removed/simplified: this PR only adds test
coverage (no code paths removed or altered); nothing in production code
is changed or simplified by the PR.
- No data loss or regression introduced: the test validates concrete
code paths — upsert with partial_update True followed by
query(out_fields/with_vec, pk checks) — and asserts deduplication
(2×default_nb → default_nb). Because the PR only adds assertions against
existing behavior and does not modify runtime logic, it cannot cause
data loss or behavioral regressions.
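
The deduplication that the test asserts (2×default_nb → default_nb) can be pictured with a toy in-memory model. It is not the Milvus implementation and does not use the client API: `insert` and `partial_upsert` are hypothetical helpers, and merging onto the most recently inserted row for a PK is an assumption made for illustration.

```python
def insert(table, rows):
    # Plain insert: appends rows without checking for duplicate PKs.
    table.extend(dict(row) for row in rows)


def partial_upsert(table, updates, pk_field="id"):
    # Merge each update onto the latest row with the same PK, then keep
    # exactly one row per PK (assumed semantics for this toy model).
    for update in updates:
        pk = update[pk_field]
        existing = [row for row in table if row[pk_field] == pk]
        base = existing[-1] if existing else {}
        merged = {**base, **update}
        table[:] = [row for row in table if row[pk_field] != pk]
        table.append(merged)


nb = 5
table = []
insert(table, [{"id": i, "vec": [0.0, float(i)], "tag": "a"} for i in range(nb)])
insert(table, [{"id": i, "vec": [1.0, float(i)], "tag": "b"} for i in range(nb)])
rows_before = len(table)  # duplicates are present at this point

# Update only a subset of fields; untouched fields are carried over.
partial_upsert(table, [{"id": i, "tag": "c"} for i in range(nb)])
rows_after = len(table)
row_zero = next(row for row in table if row["id"] == 0)
```

In this model the row count drops from `2 * nb` to `nb`, and each surviving row keeps its vector while taking the updated `tag`.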
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
Signed-off-by: Eric Hou <eric.hou@zilliz.com>
Co-authored-by: Eric Hou <eric.hou@zilliz.com>
Issue: #46424
test: add_collection_field (invalid default value),
hybrid_search (NOT supported);
simplify some test cases by using a single collection to save time;
query with different time shift and timezone settings
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
- Core invariant: TIMESTAMPTZ values are treated as absolute instants
(timezone-preserving). Tests assume conversions between stored instants
and display timezones/time-shifts are deterministic and reversible; the
PR validates queries/reads across different timezone and time-shift
settings against that invariant.
- Removed/simplified logic: duplicated per-test create/insert/teardown
flows and several isolated timestamptz unit cases (edge_case, Feb_29,
partial_update, standalone query) were consolidated into a module-scoped
fixture that creates a single COLLECTION_NAME, inserts ROWS, and handles
teardown. This removes redundant setup/teardown code and repeated
scaffolding while preserving the same API exercise points
(create_collection, insert, query, alter_collection_properties,
alter_database_properties, describe_collection, describe_database).
- No data loss or behavior regression: only test code was reorganized
and new assertions exercise the same production APIs and code paths used
previously (create_collection → insert → query / alter_properties →
describe). The fixture inserts the same ROWS and tests still
convert/compare timestamptz values via cf.convert_timestamptz and query
check routines; the new invalid-default-value test only asserts error
handling when adding a TIMESTAMPTZ field with an invalid default and
does not mutate persisted data or change production logic.
- PR type (Enhancement/Test): expands and reorganizes E2E test coverage
for TIMESTAMPTZ—centralizes collection setup to reduce runtime and
flakiness, adds explicit coverage for invalid-default-value behavior,
and increases timezone/time-shift query scenarios without altering
product behavior.
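
The "absolute instant" invariant can be illustrated with Python's standard `datetime` module. This sketch does not use `cf.convert_timestamptz` or any Milvus API; the sample timestamp and the fixed UTC offsets are arbitrary choices for illustration.

```python
from datetime import datetime, timedelta, timezone

# A stored instant, written with an explicit offset (leap-day sample).
stored = datetime.fromisoformat("2024-02-29T23:30:00+00:00")

# Two display timezones, modelled as fixed offsets to avoid tz database lookups.
utc_plus_8 = timezone(timedelta(hours=8))
utc_minus_5 = timezone(timedelta(hours=-5))

in_plus_8 = stored.astimezone(utc_plus_8)
in_minus_5 = stored.astimezone(utc_minus_5)

# Converting back to UTC recovers the original representation.
round_trip = in_plus_8.astimezone(timezone.utc)
```

The wall-clock text differs per display timezone (and here even the calendar date changes), yet all three values compare equal because they denote the same instant.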
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
---------
Signed-off-by: Eric Hou <eric.hou@zilliz.com>
Co-authored-by: Eric Hou <eric.hou@zilliz.com>
issue: https://github.com/milvus-io/milvus/issues/44123
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
- Core invariant: legacy in-cluster CDC/replication plumbing
(ReplicateMsg types, ReplicateID-based guards and flags) is obsolete —
the system relies on standard msgstream positions, subPos/end-ts
semantics and timetick ordering as the single source of truth for
message ordering and skipping, so replication-specific
channels/types/guards can be removed safely.
- Removed/simplified logic (what and why): removed replication feature
flags and params (ReplicateMsgChannel, TTMsgEnabled,
CollectionReplicateEnable), ReplicateMsg type and its tests, ReplicateID
constants/helpers and MergeProperties hooks, ReplicateConfig and its
propagation (streamPipeline, StreamConfig, dispatcher, target),
replicate-aware dispatcher/pipeline branches, and replicate-mode
pre-checks/timestamp-allocation in proxy tasks — these implemented a
redundant alternate “replicate-mode” pathway that duplicated
position/end-ts and timetick logic.
- Why this does NOT cause data loss or regression (concrete code paths):
no persistence or core write paths were removed — proxy PreExecute flows
(internal/proxy/task_*.go) still perform the same schema/ID/size
validations and then follow the normal non-replicate execution path;
dispatcher and pipeline continue to use position/subPos and
pullback/end-ts in Seek/grouping (pkg/mq/msgdispatcher/dispatcher.go,
internal/util/pipeline/stream_pipeline.go), so skipping and ordering
behavior remains unchanged; timetick emission in rootcoord
(sendMinDdlTsAsTt) is now ungated (no silent suppression), preserving or
increasing timetick delivery rather than removing it.
- PR type and net effect: Enhancement/Refactor — removes deprecated
replication API surface (types, helpers, config, tests) and replication
branches, simplifies public APIs and constructor signatures, and reduces
surface area for future maintenance while keeping DML/DDL persistence,
ordering, and seek semantics intact.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
---------
Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
issue: https://github.com/milvus-io/milvus/issues/46651
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Enhancement: Add Context-Aware Logging for Proxy and RootCoord Meta
Table Operations
**Core Invariant**: All changes maintain existing cache behavior and
state transition logic by purely enhancing observability through
context-aware logging without modifying control flow, return values, or
data structures.
**Logic Simplified Without Regression**:
- Removed internal helper method `getFullCollectionInfo` from MetaCache
by inlining its logic directly into GetCollectionInfo, eliminating an
unnecessary abstraction layer while preserving the exact same
cache-hit/miss and fetch-or-update paths
- This consolidation has no impact on behavior because the helper was
only called from one location and the inlined logic executes identically
**Enhanced Logging for Observability (No Behavior Changes)**:
- Added context-aware logging (log.Ctx(ctx)) to cache miss scenarios and
timestamp comparisons in proxy MetaCache, enabling request tracing
without altering cache lookup logic
- Expanded RootCoord MetaTable's internal helper method signatures to
propagate context for contextual logging across collection lifecycle
events (begin truncate, update state, remove names/aliases, delete from
collections map), while keeping all call sites and state transitions
unchanged
- Enhanced DescribeCollection logging in proxy to capture request scope
(role, database, collection IDs, timestamp) and response schema at
operation boundaries
**Type**: Enhancement focused on improved observability. All
modifications are strictly additive logging; no data structures, caching
strategies, or core logic paths were altered.
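
The idea of logging a cache miss with request context, without touching the lookup logic, can be sketched in Python. This is an analogue only: the real change uses `log.Ctx(ctx)` in Go, while here a `contextvars.ContextVar` named `trace_id` and a logging filter stand in for the request context. `MetaCache`, `get_collection_info`, and the fetch callback are simplified, hypothetical versions of the real types.

```python
import contextvars
import logging

# Hypothetical request-scoped value, standing in for fields carried by ctx.
trace_id = contextvars.ContextVar("trace_id", default="-")


class TraceFilter(logging.Filter):
    def filter(self, record):
        record.trace_id = trace_id.get()  # attach context to every record
        return True


class ListHandler(logging.Handler):
    """Collects formatted log lines in memory so they can be inspected."""

    def __init__(self):
        super().__init__()
        self.lines = []

    def emit(self, record):
        self.lines.append(self.format(record))


logger = logging.getLogger("meta_cache_sketch")
logger.setLevel(logging.INFO)
logger.propagate = False
handler = ListHandler()
handler.setFormatter(logging.Formatter("[trace=%(trace_id)s] %(message)s"))
handler.addFilter(TraceFilter())
logger.addHandler(handler)


class MetaCache:
    def __init__(self, fetch):
        self._fetch = fetch
        self._cache = {}

    def get_collection_info(self, name):
        info = self._cache.get(name)
        if info is None:
            # Logging is purely additive: the hit/miss path is unchanged.
            logger.info("cache miss for collection %s", name)
            info = self._fetch(name)
            self._cache[name] = info
        return info


trace_id.set("req-1")
cache = MetaCache(fetch=lambda name: {"name": name})
first = cache.get_collection_info("c1")   # miss: fetched and logged
second = cache.get_collection_info("c1")  # hit: served from cache, no log
```

Only the miss produces a log line, tagged with the request's trace ID, and both calls return the same cached object.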
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
Signed-off-by: bigsheeper <yihao.dai@zilliz.com>