milvus

mirror of https://gitee.com/milvus-io/milvus.git synced 2025-12-07 01:28:27 +08:00

Author	SHA1	Message	Date
sparknack	bdd65871ea	enhance: tiered storage: estimate segment loading resource usage while considering eviction (#43323 ) issue: #41435 After introducing the caching layer's lazy loading and eviction mechanisms, most parts of a segment won't be loaded into memory or disk immediately, even if the segment is marked as LOADED. This means physical resource usage may be very low. However, we still need to reserve enough resources for the segments marked as LOADED. Thus, the logic of resource usage estimation during segment loading, which is currently based only on physical resource usage, should be changed. To address this issue, we introduced the concept of logical resource usage in this patch. This can be thought of as the base reserved resource for each LOADED segment. A segment’s logical resource usage is derived from its final evictable and inevictable resource usage and calculated as follows: ``` SLR = SFPIER + evitable_cache_ratio * SFPER ``` which is equivalent to ``` SLR = (SFPIER + SFPER) - (1.0 - evitable_cache_ratio) * SFPER ``` `SLR`: The logical resource usage of a segment. `SFPIER`: The final physical inevictable resource usage of a segment. `SFPER`: The final physical evictable resource usage of a segment. `evitable_cache_ratio`: The ratio of a segment's evictable resources that can be cached locally. The higher the ratio, the more physical memory is reserved for evictable memory. When loading a segment, two types of resource usage are taken into account. First is the estimated maximum physical resource usage: ``` PPR = HPR + CPR + SMPR - SFPER ``` `PPR`: The predicted physical resource usage after the current segment is allowed to load. `HPR`: The physical resource usage obtained from hardware information. `CPR`: The total physical resource usage of segments that have been committed but not yet loaded. When one new segment is allowed to load, `CPR' = CPR + (SMPR - SFPER)`. When one of the committed segments is loaded, `CPR' = CPR - (SMPR - SFPER)`. 
`SMPR`: The maximum physical resource usage of the current segment. `SFPER`: The final physical evictable resource usage of the current segment. Second is the estimated logical resource usage; this check is only valid when eviction is enabled: ``` PLR = LLR + CLR + SLR ``` `PLR`: The predicted logical resource usage after the current segment is allowed to load. `LLR`: The total logical resource usage of all loaded segments. When a new segment is loaded, `LLR` should be updated to `LLR' = LLR + SLR`. `CLR`: The total logical resource usage of segments that have been committed but not yet loaded. When one new segment is allowed to load, `CLR' = CLR + SLR`. When one of the committed segments is loaded, `CLR' = CLR - SLR`. `SLR`: The logical resource usage of the current segment. The segment is allowed to load only when `PPR < PRL && PLR < PRL` (`PRL`: Physical resource limit of the querynode). --------- Signed-off-by: Shawn Wang <shawn.wang@zilliz.com>	2025-08-01 21:31:37 +08:00
yihao.dai	50f621abf2	fix: Fix compaction failed due to ID exhausted (#43699 ) Change default `compaction.preAllocateIDExpansionFactor` to 10000. issue: https://github.com/milvus-io/milvus/issues/43673 Signed-off-by: bigsheeper <yihao.dai@zilliz.com>	2025-08-01 19:17:37 +08:00
Buqian Zheng	052fb6c562	feat: add time based eviction to data managed by cachinglayer (#43490 ) issue: https://github.com/milvus-io/milvus/issues/41435 also added disk capacity protection --------- Signed-off-by: Buqian Zheng <zhengbuqian@gmail.com>	2025-07-29 16:17:35 +08:00
yihao.dai	a29b3272b0	fix: Improve import memory management to prevent OOM (#43568 ) 1. Use blocking memory allocation to wait until memory becomes available 2. Perform memory allocation at the file level instead of per task 3. Limit Parquet file reader batch size to prevent excessive memory consumption 4. Limit import buffer size from 20% to 10% of total memory issue: https://github.com/milvus-io/milvus/issues/43387, https://github.com/milvus-io/milvus/issues/43131 --------- Signed-off-by: bigsheeper <yihao.dai@zilliz.com>	2025-07-28 21:25:35 +08:00
yihao.dai	9fbd41a97d	fix: Adjust binlog and parquet reader buffer size for import (#43495 ) 1. Modify the binlog reader to stop reading a fixed 4096 rows and instead use the calculated bufferSize to avoid generating small binlogs. 2. Use a fixed bufferSize (32MB) for the Parquet reader to prevent OOM. issue: https://github.com/milvus-io/milvus/issues/43387 --------- Signed-off-by: bigsheeper <yihao.dai@zilliz.com>	2025-07-23 21:28:54 +08:00
Buqian Zheng	0599113a4b	enhance: add timeout to resource reservation (#43441 ) issue: https://github.com/milvus-io/milvus/issues/41435 Signed-off-by: Buqian Zheng <zhengbuqian@gmail.com>	2025-07-22 15:24:53 +08:00
Buqian Zheng	d793def47c	feat: impose a physical memory limit when loading cells (#43222 ) issue: #41435 issue: https://github.com/milvus-io/milvus/issues/43038 This PR also: 1. removed ERROR state from ListNode 2. CacheSlot will do reserveMemory once for all requested cells after updating the state to LOADING, so now we transition a cell to LOADING before its resource reservation 3. reject resource reservation directly if size >= max_size --------- Signed-off-by: Buqian Zheng <zhengbuqian@gmail.com>	2025-07-18 11:18:52 +08:00
Zhen Ye	07fa2cbdd3	enhance: wal balance consider the wal status on streamingnode (#43265 ) issue: #42995 - don't balance the wal if the producing-consuming lag is too long. - don't balance if the rebalance is set as false. - don't balance if the wal is balanced recently. Signed-off-by: chyezh <chyezh@outlook.com>	2025-07-18 11:10:51 +08:00
sthuang	4f17640598	enhance: [StorageV2] clean up legacy flag (#43290 ) related: #39173 Signed-off-by: shaoting-huang <shaoting.huang@zilliz.com>	2025-07-15 10:18:49 +08:00
Ted Xu	07894b37b6	enhance: returning collection metadata from cache (#42823 ) See #43187 --------- Signed-off-by: Ted Xu <ted.xu@zilliz.com>	2025-07-14 10:54:50 +08:00
PjJinchen	a90694165b	feat: Supports tracing services that require header-based authentication. (#43211 ) issue: https://github.com/milvus-io/milvus/issues/43082 support tracing services that require header-based authentication. for example: aliyun SLS, volcengine LogService etc... [aliyun SLS](https://help.aliyun.com/zh/sls/import-trace-data-from-golang-applications-to-log-service-by-using-opentelemetry-sdk-for-golang?spm=a2c4g.11186623.help-menu-search-28958.d_1#section-ktk-xxz-8om) Add a headers config in trace config ``` trace: exporter: otlp sampleFraction: 1 otlp: endpoint: milvus-cn-beijing-pre.cn-beijing.log.aliyuncs.com:10010 method: # otlp export method, acceptable values: ["grpc", "http"], using "grpc" by default secure: true headers: # base64 initTimeoutSeconds: 10 ``` it is encoded as base64, raw data is json ``` { "x-sls-otel-project": "milvus-cn-beijing-pre", "x-sls-otel-instance-id": "milvus-cn-beijing-pre", "x-sls-otel-ak-id": "xxx", "x-sls-otel-ak-secret": "xxx" } ``` [volcengine tls](https://www.volcengine.com/docs/6470/812322#grpc-%E5%8D%8F%E8%AE%AE%E5%88%9D%E5%A7%8B%E5%8C%96%E7%A4%BA%E4%BE%8B) Add a headers config in trace config ``` trace: exporter: otlp sampleFraction: 1 otlp: endpoint: xxx method: # otlp export method, acceptable values: ["grpc", "http"], using "grpc" by default secure: true headers: # base64 initTimeoutSeconds: 10 ``` it is encoded as base64, raw data is json ``` { "x-tls-otel-region": "cn-beijing", "x-tls-otel-tracetopic": "milvus-cn-beijing-pre", "x-tls-otel-ak": "xxx", "x-tls-otel-sk": "xxx" } ``` Signed-off-by: PjJinchen <6268414+pj1987111@users.noreply.github.com>	2025-07-10 17:32:48 +08:00
cai.zhang	6989e18599	enhance: Move sort stats task to sort compaction (#42562 ) issue: #42560 --------- Signed-off-by: Cai Zhang <cai.zhang@zilliz.com>	2025-07-08 20:22:47 +08:00
Zhen Ye	ed9aa1d4db	fix: limit GC concurrency as CPU number (#43165 ) issue: #42833 Signed-off-by: chyezh <chyezh@outlook.com>	2025-07-08 10:46:46 +08:00
Ted Xu	6153272d4b	enhance: disabling max entry limit by default (#43166 ) See: #43055 --------- Signed-off-by: Ted Xu <ted.xu@zilliz.com>	2025-07-08 10:10:46 +08:00
yihao.dai	9cbd194c6b	fix: Prevent import from generating small binlogs (#43132 ) - Introduce dynamic buffer sizing to avoid generating small binlogs during import - Refactor import slot calculation based on CPU and memory constraints - Implement dynamic pool sizing for sync manager and import tasks according to CPU core count issue: https://github.com/milvus-io/milvus/issues/43131 --------- Signed-off-by: bigsheeper <yihao.dai@zilliz.com>	2025-07-07 21:32:47 +08:00
cai.zhang	4133e3b8fd	fix: Enable merge sort and fix sort bug (#43080 ) issue: #42980, #43034 Signed-off-by: Cai Zhang <cai.zhang@zilliz.com>	2025-07-04 10:18:44 +08:00
Zhen Ye	e97e44d56e	enhance: limit the gc concurrency when cpu is high (#43059 ) issue: #42833 Signed-off-by: chyezh <chyezh@outlook.com>	2025-07-04 09:22:43 +08:00
sparknack	7e855f1046	enhance: add disk file writer with Direct IO support (#42665 ) issue: #43040 This patch introduces a disk file writer that supports Direct IO. Currently, it is used exclusively during the QueryNode load process. Its parameters are: 1. `common.diskWriteMode` This parameter controls the write mode of the local disk, which is used to write temporary data downloaded from remote storage. Currently, only QueryNode uses 'common.diskWrite*' parameters. Support for other components will be added in the future. The options include 'direct' and 'buffered'. The default value is 'buffered'. 2. `common.diskWriteBufferSizeKb` Disk write buffer size in KB, only used when disk write mode is 'direct', default is 64KB. Current valid range is [4, 65536]. If the value is not aligned to 4KB, it will be rounded up to the nearest multiple of 4KB. 3. `common.diskWriteNumThreads` This parameter controls the number of writer threads used for disk write operations. The valid range is [0, hardware_concurrency]. It is designed to limit the maximum concurrency of disk write operations to reduce the impact on disk read performance. For example, if you want to limit the maximum concurrency of disk write operations to 1, you can set this parameter to 1. The default value is 0, which means the caller will perform write operations directly without using an additional writer thread pool. In this case, the maximum concurrency of disk write operations is determined by the caller's thread pool size. Both parameters can be updated during runtime. --------- Signed-off-by: Shawn Wang <shawn.wang@zilliz.com>	2025-07-02 22:18:44 +08:00
Zhen Ye	08fff353af	fix: Revert "enhance: Enable mergeSort by default starting from version 2.6.0 (#42981 )" (#43046 ) issue: #43034 - implementation of mergeSortMultipleSegments is wrong. Signed-off-by: chyezh <chyezh@outlook.com>	2025-07-01 17:30:29 +08:00
cai.zhang	c82943dca1	enhance: Enable mergeSort by default starting from version 2.6.0 (#42981 ) issue: #42980 Enable mergeSort for mix compaction to reduce sort stats tasks. --------- Signed-off-by: Cai Zhang <cai.zhang@zilliz.com>	2025-06-30 21:46:43 +08:00
Zhen Ye	8367e4ec6a	fix: set 72h for wal retention (#42910 ) issue: #42706 Signed-off-by: chyezh <chyezh@outlook.com>	2025-06-27 17:36:43 +08:00
aoiasd	e2566c0e92	enhance: bm25 stats local cache use local storage path (#42923 ) Signed-off-by: aoiasd <zhicheng.yue@zilliz.com>	2025-06-25 13:44:46 +08:00
Zhen Ye	a081906fb4	enhance: smaller backoff configuration for wal balancer to make faster recovery (#42869 ) issue: #42835 Signed-off-by: chyezh <chyezh@outlook.com>	2025-06-23 10:32:40 +08:00
cai.zhang	8f8ffe9989	fix: Reduce task slot for standalone to 1/4 of normal datanode (#42808 ) issue: #42129 --------- Signed-off-by: Cai Zhang <cai.zhang@zilliz.com>	2025-06-20 16:38:46 +08:00
Zhen Ye	1f66b650e9	fix: pulsar cannot work properly if backlog exceed (#42653 ) issue: #42649 - the sync operations of different pchannels are concurrent now. - add an option to notify the backlog clear automatically. - make the pulsar walimpls recoverable after the backlog is exceeded. Signed-off-by: chyezh <chyezh@outlook.com>	2025-06-13 14:28:37 +08:00
yihao.dai	86876682da	enhance: Enhance import integration tests and logs (#42612 ) 1. Optimize the import process: skip subsequent steps and mark the task as complete if the number of imported rows is 0. 2. Improve import integration tests: a. Add a test to verify that autoIDs are not duplicated b. Add a test for the corner case where all data is deleted c. Shorten test execution time 3. Enhance import logging: a. Print imported segment information upon completion b. Include file name in failure logs issue: https://github.com/milvus-io/milvus/issues/42488, https://github.com/milvus-io/milvus/issues/42518 Signed-off-by: bigsheeper <yihao.dai@zilliz.com>	2025-06-12 20:02:35 +08:00
Buqian Zheng	8511ede5f8	feat: add back queryNode.cache.warmup for compatibility (#42621 ) issue: https://github.com/milvus-io/milvus/issues/41435 also make ChunkTranslator to load in parallel --------- Signed-off-by: Buqian Zheng <zhengbuqian@gmail.com>	2025-06-12 10:56:40 +08:00
wei liu	e7c0a6ffbb	enhance: Refine QueryNode task parallelism based on CPU core count (#42166 ) issue: #42165 Implement dynamic task execution capacity calculation based on QueryNode CPU core count instead of static configuration for better resource utilization. Changes include: - Add CpuCoreNum() method and WithCpuCoreNum() option to NodeInfo - Implement GetTaskExecutionCap() for dynamic capacity calculation - Add QueryNodeTaskParallelismFactor parameter for tuning - Update proto definition to include cpu_core_num field - Add unit tests for new functionality This allows QueryCoord to automatically adjust task parallelism based on actual hardware resources. Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2025-06-11 13:20:35 +08:00
Zhen Ye	43f0c56ce7	fix: limit the concurrency of zstd compression and decrease the memory usage of binlog generation (#42630 ) issue: #42028 - limit the concurrency of zstd compression. - zstd.go modified from `github.com/apache/arrow/go/v17/parquet/compress/zstd.go` - may be related to #42129 Signed-off-by: chyezh <chyezh@outlook.com>	2025-06-11 09:06:34 +08:00
yihao.dai	837349dead	enhance: Adjust default import buffer size (#42541 ) Increase insert buffer size from 16MB to 64MB, while keeping delete buffer size at 16MB. issue: https://github.com/milvus-io/milvus/issues/42518 Signed-off-by: bigsheeper <yihao.dai@zilliz.com>	2025-06-09 13:02:33 +08:00
wei liu	8511881d3f	enhance: Increase search/query retry times on proxy before timeout (#40438 ) issue: #39379 Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2025-06-06 18:12:32 +08:00
Zhen Ye	0567f512b3	fix: streamingnode gets stuck when stopping (#42501 ) issue: #42498 - fix: sealed segment cannot be flushed after upgrading - fix: get mvcc panic when upgrading - ignore the L0 segment during graceful stop of querynode. --------- Signed-off-by: chyezh <chyezh@outlook.com>	2025-06-05 12:22:31 +08:00
Ted Xu	35c17523de	feat: limit search result entries (#42522 ) See: #42521 Signed-off-by: Ted Xu <ted.xu@zilliz.com>	2025-06-05 12:08:33 +08:00
yihao.dai	6fda1f69c8	fix: Fix duplicate autoID between import and insert (#42519 ) Remove the unlimited logID mechanism and switch to redundantly allocating a large number of IDs. issue: https://github.com/milvus-io/milvus/issues/42518 Signed-off-by: bigsheeper <yihao.dai@zilliz.com>	2025-06-04 19:58:31 +08:00
Chun Han	ed0df38605	enhance: resize high priority wqthreadpool dynamically(#40838 ) (#41549 ) (#41929 ) related: #40838 pr: https://github.com/milvus-io/milvus/pull/41549 Signed-off-by: MrPresent-Han <chun.han@gmail.com>	2025-05-30 10:18:36 +08:00
Zhen Ye	b94cee2413	fix: growing segment from old arch is not flushed after upgrading (#42164 ) issue: #42162 - enhance: add read ahead buffer size issue #42129 - fix: rocksmq consumer's close operation may get stuck - fix: growing segment from old arch is not flushed after upgrading --------- Signed-off-by: chyezh <chyezh@outlook.com>	2025-05-29 23:00:28 +08:00
Buqian Zheng	7243c1d0ce	feat: remove async warmup policy (#42123 ) issue: https://github.com/milvus-io/milvus/issues/41993 Signed-off-by: Buqian Zheng <zhengbuqian@gmail.com>	2025-05-28 10:30:28 +08:00
cqy123456	5fe7015f63	enhance: InterimIndex support more index type and data type (#41021 ) issue: https://github.com/milvus-io/milvus/issues/27678 cherry pick from : https://github.com/milvus-io/milvus/pull/39180, https://github.com/milvus-io/milvus/pull/40429 Signed-off-by: cqy123456 <qianya.cheng@zilliz.com>	2025-05-28 08:40:28 +08:00
wei liu	54619eaa2c	feat: Implement partial result support on node down (#42009 ) issue: https://github.com/milvus-io/milvus/issues/41690 This commit implements partial search result functionality when query nodes go down, improving system availability during node failures. The changes include: - Enhanced load balancing in proxy (lb_policy.go) to handle node failures with retry support - Added partial search result capability in querynode delegator and distribution logic - Implemented tests for various partial result scenarios when nodes go down - Added metrics to track partial search results in querynode_metrics.go - Updated parameter configuration to support partial result required data ratio - Replaced old partial_search_test.go with more comprehensive partial_result_on_node_down_test.go - Updated proto definitions and improved retry logic These changes improve query resilience by returning partial results to users when some query nodes are unavailable, ensuring that queries don't completely fail when a portion of data remains accessible. --------- Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2025-05-28 00:12:28 +08:00
congqixia	6d0b15308d	enhance: Take nq into slow query consideration (#42109 ) Related to #40756 A large nq naturally increases query time, which causes many slow-query log entries when the user's NQ is very large. This PR makes the slow-search check use the span per nq (the average value) to decide whether a request is slow. Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>	2025-05-27 19:56:28 +08:00
Zhen Ye	212e17c4c5	fix: modify param to use less memory when flush and sync (#42102 ) issue: #42097 Signed-off-by: chyezh <chyezh@outlook.com>	2025-05-27 10:12:27 +08:00
aoiasd	0fafb706ba	enhance: add segment bm25 stats local cache (#41775 ) relate: https://github.com/milvus-io/milvus/issues/41424 Signed-off-by: aoiasd <zhicheng.yue@zilliz.com>	2025-05-26 18:44:27 +08:00
wei liu	f84650ece0	enhance: Reduce session TTL from 30s to 10s for faster failure detection (#42050 ) Optimize session management by reducing the TTL (Time To Live) value for service registration from 30 seconds to 10 seconds. This change improves the system's ability to detect service failures more quickly and enhances overall cluster responsiveness. Changes include: - Update default session TTL from 30s to 10s in milvus.yaml - Adjust DefaultSessionTTL constant from 30 to 10 seconds - Update SessionTTL default value from 60 to 10 seconds - Maintain consistent TTL values across configuration files This optimization reduces the time required for the system to detect when services become unavailable, leading to faster failover and improved cluster stability during node failures or network issues. Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2025-05-26 12:04:26 +08:00
Chun Han	d1cfa58a0a	feature: support compact expiry data(#41336 ) (#42056 ) related: #41336 Signed-off-by: MrPresent-Han <chun.han@gmail.com> Co-authored-by: MrPresent-Han <chun.han@gmail.com>	2025-05-25 16:46:31 +08:00
Buqian Zheng	2e3539319d	feat: vector field raw data to mmap by default (#41975 ) issue: https://github.com/milvus-io/milvus/issues/41435 should address https://github.com/milvus-io/milvus/issues/41774 this PR also: * added caching layer memory overhead metric * re-enable TextMatch.GrowingLoadData test Signed-off-by: Buqian Zheng <zhengbuqian@gmail.com>	2025-05-22 11:56:25 +08:00
wei liu	4e1208f4f6	enhance: support balancing multiple collections in single trigger (#41875 ) issue: #41874 - Optimize balance_checker to support balancing multiple collections simultaneously - Add new parameters for segment and channel balancing batch sizes - Add enableBalanceOnMultipleCollections parameter - Update tests for balance checker This change improves resource utilization by allowing the system to balance multiple collections in a single trigger with configurable batch sizes. Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2025-05-21 21:38:25 +08:00
yihao.dai	142bd2fc05	enhance: Pooling for data tasks (#41256 ) 1. Add global scheduler for datacoord. 2. Define and implement new CreateTask, QueryTask, DropTask interfaces. 3. Refine Import, Compaction, Stats, Index task. issue: https://github.com/milvus-io/milvus/issues/41123 Co-authored-by: Cai Zhang <cai.zhang@zilliz.com>	2025-05-20 21:06:24 +08:00
cai.zhang	38ded7364f	fix: Don't create index for unsorted importing segment when enable stats (#41864 ) issue: #41863 --------- Signed-off-by: Cai Zhang <cai.zhang@zilliz.com>	2025-05-19 10:52:23 +08:00
wei liu	2d0ae3a709	fix: unexpected password for root user (#41817 ) issue: #41816 pr #37983 introduced an issue: if `defaultRootPassword` is not specified in milvus.yaml, `"Milvus"` is used as the default password for the root user instead of `Milvus`. This PR fixes the unexpected root password and adds a comment noting that a large numeric password requires double quotes. Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2025-05-14 19:42:22 +08:00
Zhen Ye	7beafe99a7	enhance: implement wal garbage collector with truncate api (#41770 ) issue: #41544 - add a truncator implementation into wal recovery storage. - add metrics for recovery storage. --------- Signed-off-by: chyezh <chyezh@outlook.com>	2025-05-13 22:08:56 +08:00
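
The admission rule described in commit bdd65871ea can be illustrated with a small Python sketch. This is not the QueryNode implementation: the `SegmentEstimate` class, the function names, and the parameter spelling `evictable_cache_ratio` are hypothetical, and only the formulas (`SLR`, `PPR`, `PLR`, and the `PPR < PRL && PLR < PRL` condition) come from the commit message.

```python
from dataclasses import dataclass


@dataclass
class SegmentEstimate:
    """Hypothetical container; fields follow the commit's abbreviations."""
    smpr: float    # SMPR: maximum physical resource usage of the segment
    sfpier: float  # SFPIER: final physical inevictable resource usage
    sfper: float   # SFPER: final physical evictable resource usage


def logical_usage(seg: SegmentEstimate, evictable_cache_ratio: float) -> float:
    # SLR = SFPIER + ratio * SFPER
    return seg.sfpier + evictable_cache_ratio * seg.sfper


def can_load(seg: SegmentEstimate, hpr: float, cpr: float, llr: float,
             clr: float, prl: float, evictable_cache_ratio: float,
             eviction_enabled: bool = True) -> bool:
    # Physical check: PPR = HPR + CPR + SMPR - SFPER must stay below PRL.
    ppr = hpr + cpr + seg.smpr - seg.sfper
    if ppr >= prl:
        return False
    # The logical check only applies when eviction is enabled.
    if not eviction_enabled:
        return True
    # Logical check: PLR = LLR + CLR + SLR must also stay below PRL.
    plr = llr + clr + logical_usage(seg, evictable_cache_ratio)
    return plr < prl
```

With a ratio of 0.5, a segment with `SFPIER = 2` and `SFPER = 6` reserves a logical usage of 5, which matches the alternative form `(2 + 6) - (1.0 - 0.5) * 6` given in the commit.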

406 Commits
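
Commit a90694165b states that the `headers` value in the trace config is base64-encoded and that the raw data is JSON. A minimal Python sketch of producing and checking such a value follows. The helper names are hypothetical, the header values are the `xxx` placeholders from the commit message, and how Milvus itself parses the value is not shown here.

```python
import base64
import json


def encode_trace_headers(headers: dict) -> str:
    """Serialize the headers to JSON, then base64-encode the bytes."""
    raw = json.dumps(headers).encode("utf-8")
    return base64.b64encode(raw).decode("ascii")


def decode_trace_headers(value: str) -> dict:
    """Reverse of encode_trace_headers, useful for checking a config value."""
    return json.loads(base64.b64decode(value))


# Placeholder values copied from the commit message's aliyun SLS example.
sls_headers = {
    "x-sls-otel-project": "milvus-cn-beijing-pre",
    "x-sls-otel-instance-id": "milvus-cn-beijing-pre",
    "x-sls-otel-ak-id": "xxx",
    "x-sls-otel-ak-secret": "xxx",
}
encoded = encode_trace_headers(sls_headers)
print(encoded)  # candidate value for the `headers:` key under `otlp:`
```

Decoding the printed string with `decode_trace_headers` should return the original dictionary, which is a quick way to confirm the value before placing it in the config.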