issue: #40513
for querynode which return resource exhausted error, add a penalty
duration on it, and suspend loading new resource until penalty duration
expired.
---------
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
issue: #45865
- Modified leader_checker.go to include all nodes (RO + RW) instead of
only RW nodes, preventing channel balance from stucking on RO nodes
- Added debug logging in segment_checker.go when no shard leader found
- Enhanced target_observer.go with detailed logging for delegator check
failures to improve debugging visibility
- Fixed integration tests:
- Temporarily disabled partial result counter assertion in
partial_result_on_node_down_test.go pending concurrent issue fix
- Increased transfer channel timeout from 10s to 20s in
manual_rolling_upgrade_test.go to avoid flaky test caused by target
update interval (10s)
---------
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
/kind improvement
Replace direct self.schema access and describe_collection() calls with
get_schema() method to ensure consistent schema handling with complete
struct_fields information. Also fix FlushChecker error handling and
change schema log level from info to debug.
Signed-off-by: zhuwenxing <wenxing.zhu@zilliz.com>
Related to #45960
Follow-up to #45961
After #45961 ensured that handleNodeUp is always called for nodes
discovered during rewatchNodes (including stopping nodes), this change
adds a safeguard in ResourceManager.handleNodeUp to skip adding stopping
nodes to resource groups.
1. **resource_manager.go**: Add check for IsStoppingState() in
handleNodeUp to prevent stopping nodes from being added to incomingNode
set and assigned to resource groups.
2. **server.go**:
- Delete processed nodes from sessionMap to avoid duplicate processing
in the subsequent loop
- Add warning logs for stopping state transitions during rewatch
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
Related to #45960
When QueryCoord restarts or reconnects to etcd, the rewatchNodes
function previously skipped handleNodeUp for QueryNodes in stopping
state. This caused stopping balance to fail because necessary components
were not initialized:
- Task scheduler executor was not added
- Dist handler was not started
- Node was not registered in resource manager
This fix ensures handleNodeUp is always called for new nodes regardless
of their stopping state, followed by handleNodeStopping if the node is
stopping. This allows the graceful shutdown process to correctly migrate
segments and channels away from stopping nodes.
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
Previously, search with highlight only supported using BM25 search text
as the highlight target.
This PR adds support for highlighting with user-defined queries.
relate: https://github.com/milvus-io/milvus/issues/42589
---------
Signed-off-by: aoiasd <zhicheng.yue@zilliz.com>
Related to #45910
When IndexNodeBinding mode is enabled, DataCoord skips session watching
for datanodes but the dnSessionWatcher field remains nil. This causes a
panic when other code attempts to access the watcher.
This fix introduces an EmptySessionWatcher as a placeholder for the
IndexNodeBinding mode scenario. The empty watcher implements the
SessionWatcher interface with no-op methods, preventing nil pointer
dereferences while maintaining the expected interface contract.
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
This PR refactors the Loon FFI reader implementation to use
milvus-storage's internal C++ Reader API directly instead of the
external FFI interface.
Key changes:
- Replace external FFI calls (get_record_batch_reader, reader_destroy)
with direct C++ Reader API calls
- Add GetLoonReader() helper function to create Reader instances using
milvus-storage::api::Reader::create()
- Use MakeInternalPropertiesFromStorageConfig() instead of
MakePropertiesFromStorageConfig() to get internal properties
- Update NewPackedFFIReaderWithManifest() to deserialize column groups
from JSON manifest content directly
- Simplify GetFFIReaderStream() to use Reader::get_record_batch_reader()
and arrow::ExportRecordBatchReader() for Arrow stream export
- Change CFFIPackedReader typedef from ReaderHandle to void* for
flexibility
- Update milvus-storage dependency version to ba7df7b
This change improves code maintainability by using the native C++ API
directly and eliminates the overhead of going through the external FFI
layer.
issue: #44956
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
Related to #44956
- Update milvus-storage version to ba7df7b for chunk reader fix
- Pass manifest path to index build request in DataCoord/DataNode
- Add null chunk assertion with detailed debug info in
ManifestGroupTranslator
- Fix memory corruption by removing premature transaction handle
destruction
- Clean up log message in ChunkedSegmentSealedImpl
---------
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
issue: #45640
- log may be dropped if the underlying file system is busy.
- use async write syncer to avoid the log operation block the milvus
major system.
- remove some log dependency from the until function to avoid
dependency-loop.
---------
Signed-off-by: chyezh <chyezh@outlook.com>
issue: #43117
If we enable checking when loading segments, all segment should always
be loaded by streamingnode but not 2.5 querynode, make some search and
query failure when upgrading. Otherwise, some search and query result
will be wrong when upgrading. We choose to disable this checking for now
to promise available search and query when upgrading.
also see pr: #43346
Signed-off-by: chyezh <chyezh@outlook.com>
issue: #45847
After a collection is successfully loaded, the shard-leader state on the
QC may still not be marked as serviceable. It becomes serviceable only
after the scheduled distribution update runs, which will also invalidate
the shard-leader cache on the proxy. Therefore, even if queries are
already executable, the shard-leader mapping on the proxy may still
change afterward.
Try to ensure—as much as possible—that the proxy’s shard-leader cache
remains stable before killing the mixcoord.
Signed-off-by: Cai Zhang <cai.zhang@zilliz.com>
- Refactor connection logic to prioritize uri and token parameters over
host/port/user/password for a more modern connection approach
- Add explicit limit parameter (limit=5) to search and query operations
in chaos checkers to avoid returning unlimited results
- Migrate test_all_collections_after_chaos.py from Collection wrapper to
MilvusClient API style
- Update pytest fixtures in chaos test files to support uri/token params
Signed-off-by: zhuwenxing <wenxing.zhu@zilliz.com>
issue: https://github.com/milvus-io/milvus/issues/45691
Add persistent task management for external collections with automatic
detection of external_source and external_spec changes. When source
changes, the system aborts running tasks and creates new ones, ensuring
only one active task per collection. Tasks validate their source on
completion to prevent superseded tasks from committing results.
---------
Signed-off-by: sunby <sunbingyi1992@gmail.com>
issue: #45831
This PR fixes a race condition in `TestZillizClient` test cases where
`t.Logf`
could be called after the test function had returned, leading to a panic
or data race failure.
Root cause:
1. Incorrect defer order: `defer s.Stop()` was called after `defer
lis.Close()`.
This caused `lis.Accept()` to return an error before `s.Stop()` had set
the quit flag,
triggering the error handling path in the server goroutine.
2. Unsafe logging: The error handling path used `t.Logf` inside a
goroutine, which
is unsafe if the main test function exits before the log is printed.
Fixes:
- Changed defer order to ensure `s.Stop()` is called before
`lis.Close()`.
- Replaced `t.Logf` with `fmt.Printf` in the background goroutine to
avoid panic on test completion.
Signed-off-by: Li Liu <li.liu@zilliz.com>
This PR adds support for reading data from StorageV2 using manifest
files and the Loon FFI interface during index building, providing an
alternative to the traditional segment insert files approach.
Key changes:
Core C++ changes:
- Add SEGMENT_MANIFEST_KEY and LOON_FFI_PROPERTIES_KEY constants for
manifest handling
- Extend FileManagerContext to carry loon_ffi_properties for FFI
operations
- Update index_c.cpp to pass manifest and loon properties to file
managers for all index types (vector, JSON key, text)
- Implement GetFieldDatasFromManifest() in Util.cpp using Arrow C Stream
interface:
* Create Arrow schema from field metadata
* Initialize FFI reader with manifest content and storage properties
* Import record batches from C data interface
* Convert to FieldData for index building
- Update DiskFileManagerImpl and MemFileManagerImpl to support
manifest-based data reading with fallback to traditional paths
Loon FFI utilities (internal/core/src/storage/loon_ffi/):
- Add ToCStorageConfig() to convert StorageConfig to C-compatible
structure
- Implement GetManifest() to parse manifest JSON and retrieve column
groups via FFI
- Enhance MakePropertiesFromStorageConfig() integration
Storage V2 integration:
- Update milvus-storage dependency from 0883026 to 302143c for latest
FFI support
Protobuf changes:
- Add manifest field to BuildIndexInfo for passing manifest path to C++
layer
Configuration:
- Add common.storageV2.useLoonFFI config option (default: false) for
feature toggle
This change is part of issue #44956 to integrate the StorageV2 FFI
interface as the unified storage layer. The implementation maintains
backward compatibility by checking for manifest presence and falling
back to existing segment insert files approach when manifest is not
provided.
Related issue: #44956
---------
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
issue: #45486
This commit refactors the chunk writing system by introducing a
two-phase
approach: size calculation followed by writing to a target. This enables
efficient group chunk creation where multiple fields share a single mmap
region, significantly reducing the number of mmap system calls and VMAs.
- Optimize `mmap` usage: single `mmap` per group chunk instead of per
field
- Split ChunkWriter into two phases:
- `calculate_size()`: Pre-compute required memory without allocation
- `write_to_target()`: Write data to a provided ChunkTarget
- Implement `ChunkMmapGuard` for unified mmap region lifecycle
management
- Handles `munmap` and file cleanup via RAII
- Shared via `std::shared_ptr` across multiple chunks in a group
Signed-off-by: Shawn Wang <shawn.wang@zilliz.com>
---------
Signed-off-by: Shawn Wang <shawn.wang@zilliz.com>
This will avoid endless retry CreateDatabase/DropDatabase when
cipherPlugin fails in the new DDL framework.
See also: #45826
---------
Signed-off-by: yangxuan <xuan.yang@zilliz.com>
issue: #45728
When mixcoord is in standby mode and shutdown is triggered, the
ProcessActiveStandBy goroutine may panic if context cancellation occurs.
This happens because the error handling didn't check for
context.Canceled errors before panicking.
Changes:
- Add context cancellation check in mix_coord Register() before panic
- Check s.ctx.Err() == context.Canceled and gracefully exit
- Remove unused ForceActiveStandby() function from session_util
This ensures standby mixcoord can shutdown gracefully without panic when
context is cancelled during the standby process.
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
Related to #44956
**New Translator (C++)**
- Added `ManifestGroupTranslator`
(`internal/core/src/segcore/storagev2translator/`)
- Translates manifest-based column groups to Milvus internal format
- Implements `GroupCTMeta` interface for chunk-based column access
- Supports both memory and mmap storage modes
- Handles cache warmup policies for vector and scalar data
**ChunkedSegmentSealedImpl**
(`internal/core/src/segcore/ChunkedSegmentSealedImpl.cpp:333`)
- Added `LoadColumnGroups(const std::string& manifest_path)`: Main entry
point for manifest-based loading
- Creates milvus-storage Reader from manifest file
- Parallelizes column group loading using thread pool
- Aggregates loading exceptions and reports errors
- Added `LoadColumnGroup()`: Loads individual column group
- Extracts field IDs from column group metadata
- Creates ManifestGroupTranslator for each column group
- Builds ProxyChunkColumn for field access
- Special handling for timestamp field index construction
**SegmentGrowingImpl**
(`internal/core/src/segcore/SegmentGrowingImpl.cpp`)
- Added similar `LoadColumnGroups()` and `LoadColumnGroup()` methods for
growing segments
- Maintains consistency with sealed segment loading path
Storage FFI Utilities
**loon_ffi/util** (`internal/core/src/storage/loon_ffi/util.cpp`)
- Added `MakeInternalPropertiesFromStorageConfig()`: Converts C storage
config to internal Properties
- Maps all storage configuration fields (S3, GCS, Azure, local)
- Handles SSL, IAM, virtual host settings
- Configures connection timeouts and max connections
- Added `MakeInternalLocalProperies()`: Creates local filesystem
properties
- Added `ToCStorageConfig()`: Converts Go StorageConfig to C
representation
- Added `GetColumnGroups()`: Extracts column groups from manifest file
using Transaction API
Protocol Buffer Changes
**segcore.proto** (`pkg/proto/segcore.proto:121`)
- Added `manifest_path` field to `SegmentLoadInfo` message
- Enables passing manifest file path from Go layer to C++ core
Go Integration
**segment.go** (`internal/util/segcore/segment.go:372`)
- Updated `ConvertToSegcoreSegmentLoadInfo()` to propagate
`ManifestPath` field
- Bridges QueryNode segment load info to Segcore format
---------
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
This commit optimizes std::vector usage across segcore by adding
reserve() calls where the size is known in advance, reducing memory
reallocations during push_back operations.
Changes:
- TimestampIndex.cpp: Reserve space for prefix_sums and
timestamp_barriers
- SegmentGrowingImpl.cpp: Reserve space for binlog info vectors
- ChunkedSegmentSealedImpl.cpp: Reserve space for futures and field data
vectors
- storagev2translator/GroupChunkTranslator.cpp: Reserve space for
metadata vectors
This improves performance by avoiding multiple memory reallocations when
the vector size is predictable.
issue: https://github.com/milvus-io/milvus/issues/45679
---------
Signed-off-by: Buqian Zheng <zhengbuqian@gmail.com>
#45610
this fix add a little cost for execute:
=== Lower Bound Overhead (isolated) ===
Position 1 (list len = 90000): 39 ns per lower_bound
Position 2 (list len =180000): 45 ns per lower_bound
Position 3 (list len =270000): 46 ns per lower_bound
Position 4 (list len =360000): 38 ns per lower_bound
Position 5 (list len =450000): 42 ns per lower_bound
Position 6 (list len =540000): 55 ns per lower_bound
Position 7 (list len =630000): 56 ns per lower_bound
Position 8 (list len =720000): 49 ns per lower_bound
Position 9 (list len =810000): 48 ns per lower_bound
Signed-off-by: luzhang <luzhang@zilliz.com>
Co-authored-by: luzhang <luzhang@zilliz.com>
issue: https://github.com/milvus-io/milvus/issues/45783
for simdjson ondemand api, a iterator can only be used once. use dom api
to prevent crashes when processing JSON contains operations with
different types.
Signed-off-by: sunby <sunbingyi1992@gmail.com>
relate: https://github.com/milvus-io/milvus/issues/43687
Support use user provice file by file params, in analyzer params.
Could use local file or remote file resource.
Support use file params in jieba extern dict.
Signed-off-by: aoiasd <zhicheng.yue@zilliz.com>
Skip redundant failure marking for completed import jobs when the
collection is dropped or import jobs are timeout.
issue: https://github.com/milvus-io/milvus/issues/45766
Signed-off-by: bigsheeper <yihao.dai@zilliz.com>