/kind improvement

Release notes (auto-generated by coderabbit.ai):

- Core invariant: the tests' persistence of EventRecords and RequestRecords must be append-safe under concurrent writers. This PR replaces Parquet with JSONL and uses per-file locks and explicit buffer flushes to guarantee atomic, append-safe writes (EventRecords uses `event_lock` and appends one line per event; RequestRecords buffers under `request_lock` and flushes to file when a threshold is reached or on `sink()`).
- Logic removed and rationale: the DataFrame-based Parquet append/read logic (pyarrow/fastparquet) and implicit Parquet buffering were removed in favor of simple line-oriented JSON writes and explicit buffer management. The complex Parquet append/merge paths were redundant because Parquet append under concurrent test-writer patterns caused corruption; JSONL removes the append-mode complexity along with the Parquet-specific buffering and serialization code.
- Why there is no data loss or behavior regression (concrete code paths): `EventRecords.insert` writes a complete JSON object per event under `event_lock` to `/tmp/ci_logs/event_records_*.jsonl`, and `get_records_df` reads every JSON line under the same lock (or returns an empty DataFrame with the same schema on FileNotFoundError or any other error), preserving the fields `event_name`/`event_status`/`event_ts`. `RequestRecords.insert` appends to an in-memory buffer under `request_lock` and triggers `_flush_buffer()` when `len(buffer) >= 100`; `_flush_buffer()` writes each buffered JSON line to `/tmp/ci_logs/request_records_*.jsonl` and clears the buffer; `sink()` calls `_flush_buffer()` under `request_lock` before `get_records_df()` reads the file, ensuring all buffered records are persisted before reads. Both read paths handle FileNotFoundError and other exceptions by returning empty DataFrames with identical column schemas, so external callers see the same API and no silent record loss.
- Enhancement summary: replaces the flaky Parquet append/read with JSONL plus explicit locking and deterministic flush semantics, removing the root cause of Parquet append corruption in tests while keeping the original DataFrame-based analysis consumers unchanged (`get_records_df` returns equivalent schemas).

Signed-off-by: zhuwenxing <wenxing.zhu@zilliz.com>
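As a concrete illustration of the pattern the notes above describe, here is a minimal, self-contained sketch. Class, lock, and field names follow the PR description, but the bodies and file paths are illustrative rather than the exact diff:

```python
import json
import os
import threading

import pandas as pd

EVENT_COLUMNS = ["event_name", "event_status", "event_ts"]


class EventRecords:
    """Append-safe JSONL persistence: one complete JSON object per line,
    written under a lock so concurrent test writers never interleave."""

    def __init__(self, path="/tmp/ci_logs/event_records_demo.jsonl"):
        self.path = path
        self.event_lock = threading.Lock()
        os.makedirs(os.path.dirname(self.path), exist_ok=True)

    def insert(self, event_name, event_status, event_ts):
        record = {"event_name": event_name,
                  "event_status": event_status,
                  "event_ts": event_ts}
        with self.event_lock:
            with open(self.path, "a") as f:
                f.write(json.dumps(record) + "\n")
                f.flush()  # make the line durable before releasing the lock

    def get_records_df(self):
        with self.event_lock:
            try:
                with open(self.path) as f:
                    rows = [json.loads(line) for line in f if line.strip()]
                return pd.DataFrame(rows, columns=EVENT_COLUMNS)
            except Exception:
                # FileNotFoundError or a parse error: return an empty frame
                # with the same schema, so callers see an unchanged API
                return pd.DataFrame(columns=EVENT_COLUMNS)


class RequestRecords:
    """Buffers records in memory and flushes them to JSONL when the buffer
    reaches a threshold or when sink() is called."""

    def __init__(self, path="/tmp/ci_logs/request_records_demo.jsonl",
                 threshold=100):
        self.path = path
        self.threshold = threshold
        self.buffer = []
        self.request_lock = threading.Lock()
        os.makedirs(os.path.dirname(self.path), exist_ok=True)

    def insert(self, record):
        with self.request_lock:
            self.buffer.append(record)
            if len(self.buffer) >= self.threshold:
                self._flush_buffer()

    def _flush_buffer(self):
        # caller must already hold request_lock
        with open(self.path, "a") as f:
            for record in self.buffer:
                f.write(json.dumps(record) + "\n")
        self.buffer = []

    def sink(self):
        # flush everything still buffered so subsequent reads see all records
        with self.request_lock:
            self._flush_buffer()
```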
Chaos Tests
Goal
Chaos tests are designed to check the reliability of Milvus.
For instance, if one pod is killed, the tests verify that (see the sketch after this list):
- the pod restarts automatically
- the related operation fails, while the other operations keep working successfully during the absence of the pod
- all the operations work successfully after the pod returns to the running state
- no data is lost
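The flow above maps naturally onto a pytest case. The sketch below is illustrative only: `checkers`, `kill_pod`, and `wait_pods_ready` are hypothetical stand-ins for the repo's actual checker and chaos helpers:

```python
import time


def test_querynode_pod_kill(checkers, kill_pod, wait_pods_ready):
    # baseline: every operation succeeds before the chaos is injected
    for checker in checkers.values():
        assert checker.succ_rate() > 0.9

    kill_pod("querynode")  # inject the chaos

    time.sleep(60)  # keep operations running while the pod is absent
    # the related operation fails, the other operations keep working
    assert checkers["search"].succ_rate() < 0.9
    assert checkers["insert"].succ_rate() > 0.9

    wait_pods_ready()  # the pod should restart automatically
    for checker in checkers.values():
        checker.reset()
    time.sleep(60)
    # all operations succeed again, and previously inserted data is still there
    for checker in checkers.values():
        assert checker.succ_rate() > 0.9
```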
Prerequisite
Chaos tests run in the pytest framework, the same as the e2e tests.
Please refer to Run E2E Tests.
Flow Chart
Test Scenarios
Milvus in cluster mode
pod kill
Kill a pod every 5s.
pod network partition
Two-direction (to and from) network isolation between one pod and the rest of the pods.
pod failure
Deploy the pod (querynode, indexnode, and datanode) with multiple replicas, make one of them fail, and test Milvus's functionality.
pod memory stress
Limit the memory resource of a pod and generate heavy memory stress over a group of pods.
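For reference, the pod kill scenario above could be expressed as a chaos object like the following, written here as a Python dict dumped to YAML to stay consistent with the other sketches in this doc. Field values are illustrative, and the `scheduler` block follows the older Chaos Mesh v1-style PodChaos API (newer Chaos Mesh versions use a separate Schedule resource):

```python
import yaml  # PyYAML

# Illustrative pod-kill chaos object: kill one querynode pod every 5s.
pod_kill = {
    "apiVersion": "chaos-mesh.org/v1alpha1",
    "kind": "PodChaos",
    "metadata": {"name": "test-querynode-pod-kill",
                 "namespace": "chaos-testing"},
    "spec": {
        "action": "pod-kill",
        "mode": "one",
        "selector": {"labelSelectors": {
            "app.kubernetes.io/component": "querynode"}},
        "scheduler": {"cron": "@every 5s"},
    },
}

# e.g. write this out as chaos_objects/chaos_querynode_podkill.yaml
print(yaml.dump(pod_kill))
```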
Milvus in standalone mode
- standalone pod is killed
- minio pod is killed
How it works
- Test scenarios are designed around different chaos objects
- Every chaos object is defined in a YAML file located in the folder `chaos_objects`
- Every chaos YAML file specified by `ALL_CHAOS_YAMLS` in `constants.py` is parsed as a parameter and passed into `test_chaos.py` (see the sketch after this list)
- All expectations of every scenario are defined in `testcases.yaml`, located in the folder `chaos_objects`
- Chaos Mesh is used to inject chaos into Milvus in `test_chaos.py`
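A minimal sketch of that parametrization, assuming the `ALL_CHAOS_YAMLS` name from `constants.py`; the real `test_chaos.py` is more involved:

```python
import glob

import pytest
import yaml

# matches the pattern configured in constants.py
ALL_CHAOS_YAMLS = "chaos_*_network_partition.yaml"


def get_chaos_yamls():
    # each matching file becomes one test case
    return glob.glob("chaos_objects/" + ALL_CHAOS_YAMLS)


@pytest.mark.parametrize("chaos_yaml", get_chaos_yamls())
def test_chaos(chaos_yaml):
    with open(chaos_yaml) as f:
        chaos_config = yaml.safe_load(f)  # the parsed chaos object
    # apply the chaos object via Chaos Mesh, run operations against Milvus,
    # then assert the expectations defined in testcases.yaml
    assert chaos_config["kind"] in ("PodChaos", "NetworkChaos", "StressChaos")
```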
Run
Manually
Run a single test scenario manually (take query node pod kill as an example):
- update `ALL_CHAOS_YAMLS = 'chaos_querynode_podkill.yaml'` in `constants.py`
- run the commands below:

```bash
cd /milvus/tests/python_client/chaos
pytest test_chaos.py --host ${Milvus_IP} -v
```
Run multiple test scenarios in a category manually (take network partition chaos for all pods as an example):
- update `ALL_CHAOS_YAMLS = 'chaos_*_network_partition.yaml'` in `constants.py`
- run the commands below:

```bash
cd /milvus/tests/python_client/chaos
pytest test_chaos.py --host ${Milvus_IP} -v
```
Automation Scripts
Run a test scenario automatically:
- update the chaos type and pod in `chaos_test.sh`
- run the commands below:

```bash
cd /milvus/tests/python_client/chaos
# in this step, the script will install milvus with replicas_num and run the testcase
bash chaos_test.sh ${pod} ${chaos_type} ${chaos_task} ${replicas_num}
# example: bash chaos_test.sh querynode pod_kill chaos-test 2
```
Github Action
Nightly
Still in planning.
Todo
- network attack
- clock skew
- IO injection
How to contribute
- Get familiar with chaos engineering and Chaos Mesh
- Design chaos scenarios, preferably picking from the todo list above
- Generate a YAML file for your chaos scenario. You can create a chaos experiment in the chaos-dashboard, then download its YAML file.
- Add the YAML file to the `chaos_objects` dir and rename it as `chaos_${component_name}_${chaos_type}.yaml`. Make sure `kubectl apply -f ${your_chaos_yaml_file}` can take effect (see the validation sketch after this list)
- Add a testcase in `testcases.yaml`. You should figure out the expectation of Milvus during the chaos
- Run your added testcase according to Manually above and check whether it behaves as you expect
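To help with the `kubectl apply -f` check mentioned above, a small helper can dry-run the chaos object against the cluster before you commit it. This is a sketch; the file path is just an example following the naming convention above:

```python
import subprocess
import sys


def validate_chaos_yaml(path):
    """Server-side dry-run of the chaos object, mirroring the
    `kubectl apply -f` check in the contribution steps."""
    result = subprocess.run(
        ["kubectl", "apply", "-f", path, "--dry-run=server"],
        capture_output=True, text=True,
    )
    if result.returncode != 0:
        sys.exit(f"invalid chaos yaml {path}: {result.stderr}")
    print(result.stdout.strip())


validate_chaos_yaml("chaos_objects/chaos_querynode_pod_kill.yaml")
```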