From 73bda08ec9222d906100df18f9a18c11d27250b2 Mon Sep 17 00:00:00 2001
From: Xiaofan <83447078+xiaofan-luan@users.noreply.github.com>
Date: Sat, 8 Mar 2025 09:48:02 +0800
Subject: [PATCH] doc: json storage format (#40479)

the design doc for the JSON storage improvement

Signed-off-by: xiaofanluan
---
 docs/design_docs/json_storage.md | 116 +++++++++++++++++++++++++++++++
 1 file changed, 116 insertions(+)
 create mode 100644 docs/design_docs/json_storage.md

diff --git a/docs/design_docs/json_storage.md b/docs/design_docs/json_storage.md
new file mode 100644
index 0000000000..26d011fbff
--- /dev/null
+++ b/docs/design_docs/json_storage.md
@@ -0,0 +1,116 @@
# JSON Storage Design Document

## 1. Data Model Design

### 1.1 Data Layering

#### Dense Part
A set of "core fields" (such as primary keys and commonly used metadata) that are present in most records.

#### Sparse Part
Additional attributes that appear only in some records, potentially involving unstructured or dynamically extended information.

### 1.2 JSON Splitting and Mapping

#### Dense Field Extraction
When parsing JSON, predefined dense fields are extracted and mapped to independent columns in Parquet. A method similar to Parquet Variant Shredding is used to flatten nested data.

#### Sparse Data Preservation
Fields not included in the dense part are stored in a sparse data field. They are serialized in the BSON (Binary JSON) format, leveraging its efficient binary representation and rich data type support, and the result is stored in a Parquet BINARY field.

## 2. Storage Strategy

### 2.1 Columnar Storage for Dense Data
- **Schema Definition**: Create an independent Parquet column for each dense field, explicitly specifying its data type (numeric, string, list, etc.).
- **Query Performance**: The columnar format suits large scans and aggregations, improving query efficiency, especially for vectors, indexes, and frequently queried fields.

### 2.2 Row Storage for Sparse Data
- **BSON Storage**:
  - Serialize sparse data into the BSON binary format and store it in a single binary column of the Parquet file.
  - BSON not only compresses efficiently but also preserves the complete type information of the original data, avoiding large numbers of null values and file fragmentation.

## 3. Parquet Schema Construction
- **Columnar Part**: Build a fixed schema from the dense fields, each with a clear data type definition.
- **Row Part**: Define a dedicated field (e.g., `sparse_data`) of type BINARY that directly stores the BSON-encoded sparse data.
- **Hybrid Mode**: When writing, dense data is filled into its respective columns, and the remaining sparse data is serialized as BSON and written to the `sparse_data` field, balancing query efficiency and storage flexibility.

## 4. Integration and Implementation Considerations

### 4.1 Data Classification Strategy
- **Density Classification**:
  - Classify fields as dense or sparse based on how frequently they occur in records (e.g., present in more than 30% of records counts as dense), while also considering data type consistency. If a field appears with multiple data types, the types that appear in more than 30% of records are treated as dense, and the remaining types are stored as sparse fields (see the classification sketch after this list).
- **Dynamic Extension**:
  - For dynamically extended fields, regardless of frequency, store them in the BSON-formatted sparse part to simplify schema evolution.
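The 30% threshold above can be applied with a simple frequency scan over a sample of incoming records. The following Python sketch is illustrative only and is not Milvus code (the actual path lives in the Go/C++ storage layer); the names `DENSE_THRESHOLD`, `classify_fields`, and `split_record` are hypothetical. It counts occurrences per (key, value type) pair, so a field with mixed types is classified per type as described above, and then splits each record into dense column values and a sparse remainder.

```python
from collections import Counter

# Hypothetical knob from section 4.1: a (key, type) pair is dense when it
# appears in more than 30% of the sampled records.
DENSE_THRESHOLD = 0.30

def classify_fields(records):
    """Return the set of (key, type-name) pairs to promote to dense Parquet columns."""
    counts = Counter()
    for rec in records:
        for key, value in rec.items():
            counts[(key, type(value).__name__)] += 1
    total = len(records)
    return {kt for kt, n in counts.items() if n / total > DENSE_THRESHOLD}

def split_record(record, dense_key_types):
    """Split one JSON record into dense column values and the sparse remainder."""
    dense, sparse = {}, {}
    for key, value in record.items():
        if (key, type(value).__name__) in dense_key_types:
            dense[key] = value
        else:
            sparse[key] = value  # later serialized to BSON (section 2.2)
    return dense, sparse

if __name__ == "__main__":
    sample = [
        {"id": 1, "price": 9.9},
        {"id": 2, "price": 8.5, "note": "flash sale"},
        {"id": 3, "price": 7.0},
        {"id": 4, "tag": "clearance"},
    ]
    dense_key_types = classify_fields(sample)
    # id appears in 4/4 records and price in 3/4, so both clear the 30% bar;
    # note and tag appear in only 1/4 of records and stay sparse.
    for rec in sample:
        print(split_record(rec, dense_key_types))
```

In practice the frequency rule is only one input: predefined core fields (primary keys, common metadata) are always dense, and dynamically extended fields always go to the sparse part regardless of frequency.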
### 4.2 Indexing for Sparse Data Access

#### Sparse Column Key Indexing
To accelerate BSON parsing, an inverted index stores each BSON key together with either its offset and size within the sparse data, or the value itself when the value is numeric.

##### Value Data Structure Diagram
| Valid | Type | Row ID | Offset/Value |
|:-----:|:-----:|:------:|:------------:|
| 1 bit | 4 bits | 27 bits | 32 bits (16-bit offset + 16-bit size, or 32-bit value) |

- **64-bit Structure Breakdown**:
  - **Bit 1 (Valid)**: 1 bit indicating data validity (1 = valid, 0 = invalid).
  - **Bits 2-5 (Type)**: 4 bits representing the data type.
  - **Bits 6-32 (Row ID)**: 27 bits for the row ID, uniquely identifying the data row.
  - **Bits 33-64 (Last 32 bits)**:
    - If **Valid = 1**: the last 32 bits store the actual data value.
    - If **Valid = 0**: the last 32 bits are split into:
      - **First 16 bits (Offset)**: the offset of the data within the sparse column.
      - **Last 16 bits (Size)**: the size of the data.

The column key index is optional; it can be configured at table creation time or modified later through field properties.

## 5. Example Data

### 5.1 Example JSON Records

```json
[
  {"id": 1, "attr1": "value1", "attr2": 100},
  {"id": 2, "attr1": "value2", "attr3": true},
  {"id": 3, "attr1": "value3", "attr4": "extra", "attr5": 3.14}
]
```

- **Dense Data:**
  - The field `id` is considered dense.
- **Sparse Data:**
  - Record 1: `attr1`, `attr2`
  - Record 2: `attr1`, `attr3`
  - Record 3: `attr1`, `attr4`, `attr5`

### 5.2 Parquet File Storage

#### Schema Representation

| Column Name | Data Type | Description |
|--------------|-----------|-------------|
| **id** | int64 | Dense column storing the integer identifier. |
| **sparse_data** | binary | Sparse column storing BSON-serialized data of all remaining fields. |
| **sparse_index** | binary | Index column storing key offsets for efficient parsing. |

#### Stored Data Breakdown

- **Dense Column (`id`)**:
  - Row 1: `1`
  - Row 2: `2`
  - Row 3: `3`

- **Sparse Column (`sparse_data`)**:
  - **Row 1:** BSON representation of `{"attr1": "value1", "attr2": 100}`
  - **Row 2:** BSON representation of `{"attr1": "value2", "attr3": true}`
  - **Row 3:** BSON representation of `{"attr1": "value3", "attr4": "extra", "attr5": 3.14}`

- **Sparse Index (`sparse_index`)**:
  - **Row 1:** Index entries mapping `attr1` and `attr2` to their respective positions in `sparse_data`.
  - **Row 2:** Index entries mapping `attr1` and `attr3`.
  - **Row 3:** Index entries mapping `attr1`, `attr4`, and `attr5`.

In an actual system, the sparse data would be serialized using a BSON library (e.g., bsoncxx) for a compact binary format. The example above demonstrates the logical mapping of JSON data to the Parquet storage format; a minimal write/read sketch of this mapping follows below.

---
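As a companion to the worked example in section 5, here is a minimal Python sketch of the logical mapping, assuming `pyarrow` and the `bson` package from PyMongo purely as stand-ins for the real storage path (Milvus would use its own writer and a library such as bsoncxx in C++). The dense `id` values go into a typed Parquet column, each record's remaining fields are BSON-encoded into the binary `sparse_data` column, and the optional `sparse_index` column from section 4.2 is omitted here. The file name `json_storage_example.parquet` is arbitrary.

```python
import bson                      # provided by the pymongo package
import pyarrow as pa
import pyarrow.parquet as pq

records = [
    {"id": 1, "attr1": "value1", "attr2": 100},
    {"id": 2, "attr1": "value2", "attr3": True},
    {"id": 3, "attr1": "value3", "attr4": "extra", "attr5": 3.14},
]

DENSE_FIELDS = ["id"]            # only the predefined core field is shredded out

ids, sparse_blobs = [], []
for rec in records:
    ids.append(rec["id"])
    # Everything that is not a dense field stays row-wise and is BSON-encoded.
    sparse = {k: v for k, v in rec.items() if k not in DENSE_FIELDS}
    sparse_blobs.append(bson.encode(sparse))

# Schema mirroring section 5.2: one typed dense column plus a BINARY sparse column.
schema = pa.schema([
    ("id", pa.int64()),
    ("sparse_data", pa.binary()),
])
table = pa.table({"id": ids, "sparse_data": sparse_blobs}, schema=schema)
pq.write_table(table, "json_storage_example.parquet")

# Reading a row back: decode the BSON blob to recover the original sparse fields.
row0 = pq.read_table("json_storage_example.parquet").to_pylist()[0]
print(row0["id"], bson.decode(row0["sparse_data"]))
# -> 1 {'attr1': 'value1', 'attr2': 100}
```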