-
Notifications
You must be signed in to change notification settings - Fork 3.7k
[feat](Variant) Support NestedGroup for variant columns with search DSL NESTED clause#60847
[feat](Variant) Support NestedGroup for variant columns with search DSL NESTED clause#60847eldenmoon wants to merge 11 commits intoapache:masterfrom
Conversation
NESTED clause
Introduce NestedGroup support for tracking nested array structures in variant columns, including storage layer definitions, core data structures, read/write path integration, FE pruning rule enhancements, and NESTED clause in search DSL.
IDL definitions:
Add NestedGroupInfoPB, NestedOffsetsIndexPB in segment_v2.proto
Add ColumnPathInfo fields for nested group metadata
Add NESTED clause type and nested_path field in Exprs.thrift
Core data structures (BE):
nested_group_path.h: Path utilities for nested group identification
nested_group_provider.h/cpp: Provider interface with default no-op implementation
Offsets mapping index and offset tracking utilities for nested arrays
Variant reader/writer integration (BE):
Integrate NestedGroup-aware read path in variant_column_reader
Integrate NestedGroup-aware write path in variant_column_writer_impl
Add segment_iterator and segment_creator integration
Enhance path_in_data, variant_util, json_parser for nested structures
Add variant_nested_group_max_depth config
Update variant_statistics for nested group tracking
FE pruning and rewrite rules:
Add AccessPathExpressionCollector for access path collection
Refactor NestedColumnPruning and VariantSubPathPruning logic
Enhance ExpressionAnalyzer and expression context for pruning support
Add NestedGroup array support in Explode/ExplodeOuter
Search DSL NESTED clause:
Add NESTED token and grammar rules in SearchLexer/Parser.g4
Implement NESTED clause parsing in SearchDslParser
Implement NESTED function evaluation in function_search and vsearch
Tests:
nested_group_path_test, nested_group_path_filter_test, nested_group_provider_test
function_search_nested_test, vsearch_expr_test updates
PruneNestedColumnTest, VariantPruningLogicTest, SearchDslParserTest (FE)
4 new regression test suites for variant pruning
- NestedGroupInfoPB for nested group statistics
- NestedOffsetsIndexPB for nested offsets index
- Fields in ColumnPathInfo for nested group metadata
- NESTED_OFFSETS_INDEX enum value
Add thrift definition for NESTED search clause:
- NESTED clause type in TSearchClause
- nested_path field for NESTED clause
in variant columns:
- nested_group_path.h: Path utilities for nested group identification
- nested_group_provider.h/cpp: Provider interface with default no-op impl
- nested_offsets_mapping_index.h/cpp: Offsets mapping index for nested arrays
- offset_manager.h/offset_manager_impl.h: Offset tracking utilities
Tests:
- nested_group_path_test.cpp: Path utility tests
- nested_group_path_filter_test.cpp: Path filter tests
- nested_group_provider_test.cpp: Provider contract tests
within variant columns.
FE changes:
- SearchLexer/Parser.g4: Add NESTED token and grammar rules
- SearchDslParser.java: Parse NESTED clause with path specification
- SearchPredicate.java: Support NESTED clause type
BE changes:
- vsearch.cpp: Evaluate NESTED clause in search expressions
- function_search.cpp/h: Implement NESTED function evaluation
Tests:
- SearchDslParserTest.java: FE parser tests
- function_search_nested_test.cpp: BE NESTED function tests
- vsearch_expr_test.cpp: Search expression tests
pruning to better support variant column access patterns.
FE changes:
- AccessPathExpressionCollector.java: New collector for access paths
- AccessPathPlanCollector.java: Enhanced plan collection
- NestedColumnPruning.java: Refactored nested column pruning logic
- VariantSubPathPruning.java: Enhanced sub-path pruning
- ExpressionAnalyzer.java: Expression analysis updates
- Explode/ExplodeOuter.java: NestedGroup array support
BE changes:
- vexpr.cpp/h: Expression node enhancements
- vexpr_context.h: Context changes for pruning support
Tests:
- PruneNestedColumnTest.java: Nested column pruning tests
- VariantPruningLogicTest.java: Variant pruning logic tests
and related utility components.
Core reader/writer:
- variant_column_reader.cpp/h: NestedGroup-aware read path
- variant_column_writer_impl.cpp/h: NestedGroup-aware write path
- variant_statistics.h: Statistics tracking for nested groups
- column_reader.cpp/h: Column reader NestedGroup support
- segment_iterator.cpp: Iterator integration
- segment_creator.cpp: Creator integration
Utilities:
- path_in_data.cpp/h: Path utilities for nested structures
- variant_util.cpp/h: Variant utility enhancements
- json_parser.cpp: JSON parser improvements
- column_variant.cpp/h: Column variant changes
- config.cpp/h: variant_nested_group_max_depth config
- exec_env.h: Query cache setter
Tests & regression:
- Modified existing tests for compatibility
- 4 new regression test suites for variant pruning
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
3271773 to
f9c0e67
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR introduces NestedGroup-related infrastructure for VARIANT (multi-level array/object nesting) and adds a NESTED(path, query) clause to the search DSL, alongside expanded access-path/sub-path pruning support and coverage tests across FE/BE and regression suites.
Changes:
- Extend FE search DSL parsing/AST + thrift explain to support a top-level
NESTED()clause with anested_path. - Add NestedGroup metadata/index plumbing (proto, iterators, mapping index) and related BE execution support for nested search evaluation.
- Expand VARIANT access-path/sub-path pruning support (including numeric array indices) and add regression + unit tests.
Reviewed changes
Copilot reviewed 73 out of 74 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| regression-test/suites/variant_p0/test_variant_mixed_object_array_pruning_debugpoint.groovy | New regression coverage for mixed object/array pruning with debugpoints |
| regression-test/suites/variant_p0/test_variant_compaction_root_empty_after_full.groovy | New regression to ensure compaction doesn't empty VARIANT root |
| regression-test/suites/variant_p0/test_variant_access_path_pruning_debugpoint.groovy | New regression coverage for access path pruning via debugpoints |
| regression-test/suites/variant_p0/test_sub_path_pruning.groovy | Adds explode-related pruning assertions/explain checks |
| gensrc/thrift/Exprs.thrift | Adds NESTED clause support via nested_path in thrift |
| gensrc/proto/segment_v2.proto | Adds NestedGroup-related metadata and a nested offsets index proto |
| fe/fe-core/src/test/java/org/apache/doris/nereids/trees/expressions/functions/scalar/SearchDslParserTest.java | Adds FE unit tests for NESTED() parsing/validation rules |
| fe/fe-core/src/test/java/org/apache/doris/nereids/rules/rewrite/VariantPruningLogicTest.java | New FE pruning logic tests for VARIANT access paths/subcolumns |
| fe/fe-core/src/test/java/org/apache/doris/nereids/rules/rewrite/PruneNestedColumnTest.java | Extends pruning tests to cover VARIANT access path propagation |
| fe/fe-core/src/main/java/org/apache/doris/nereids/trees/expressions/functions/scalar/SearchDslParser.java | Implements NESTED() parsing, AST fields, and top-level validation |
| fe/fe-core/src/main/java/org/apache/doris/nereids/trees/expressions/functions/generator/ExplodeOuter.java | Allows EXPLODE_OUTER to accept VARIANT inputs |
| fe/fe-core/src/main/java/org/apache/doris/nereids/trees/expressions/functions/generator/Explode.java | Allows EXPLODE to accept VARIANT inputs |
| fe/fe-core/src/main/java/org/apache/doris/nereids/rules/rewrite/VariantSubPathPruning.java | Supports integer-like literal indices in VARIANT subpaths |
| fe/fe-core/src/main/java/org/apache/doris/nereids/rules/rewrite/SlotTypeReplacer.java | Includes VARIANT in slot replacement/pruning decision |
| fe/fe-core/src/main/java/org/apache/doris/nereids/rules/rewrite/RewriteSearchToSlots.java | Minor cleanup around variant-subcolumn comment/handling |
| fe/fe-core/src/main/java/org/apache/doris/nereids/rules/rewrite/NestedColumnPruning.java | Extends pruning entry conditions & access-path handling for VARIANT |
| fe/fe-core/src/main/java/org/apache/doris/nereids/rules/rewrite/AccessPathPlanCollector.java | Collects access paths for VARIANT and adjusts generate handling |
| fe/fe-core/src/main/java/org/apache/doris/nereids/rules/rewrite/AccessPathExpressionCollector.java | Builds VARIANT access paths including numeric indices |
| fe/fe-core/src/main/java/org/apache/doris/nereids/rules/analysis/ExpressionAnalyzer.java | Marks VARIANT usage as "hasNestedColumns" for pruning pipeline |
| fe/fe-core/src/main/java/org/apache/doris/analysis/SearchPredicate.java | Includes nested_path in explain and thrift conversion |
| fe/fe-core/src/main/antlr4/org/apache/doris/nereids/search/SearchParser.g4 | Adds parser rule for nestedQuery |
| fe/fe-core/src/main/antlr4/org/apache/doris/nereids/search/SearchLexer.g4 | Adds lexer mode/tokens for nested path scanning |
| be/test/vec/function/function_search_test.cpp | Updates clause categorization/mapping for NESTED; moves tests out |
| be/test/vec/function/function_search_nested_test.cpp | New BE unit tests focused on NESTED clause behavior |
| be/test/vec/exprs/vsearch_expr_test.cpp | Updates inverted index context creation; adds NESTED fallback test |
| be/test/vec/exprs/vexpr_evalute_inverted_index_test.cpp | Updates IndexExecContext construction signature |
| be/test/vec/exprs/try_cast_expr_test.cpp | Adds include needed after context/header refactor |
| be/test/olap/rowset/segment_v2/segment_iterator_no_need_read_data_test.cpp | Updates test expectation for extracted VARIANT read-data behavior |
| be/test/olap/rowset/segment_v2/nested_group_provider_test.cpp | New tests for default NestedGroup provider contracts (CE behavior) |
| be/test/olap/rowset/segment_v2/nested_group_path_test.cpp | New tests for nested group path utilities/constants |
| be/test/olap/rowset/segment_v2/nested_group_path_filter_test.cpp | New tests for NestedGroup path filtering behavior |
| be/test/olap/rowset/segment_v2/external_col_meta_util_test.cpp | Removes non-English comment note |
| be/src/vec/runtime/vparquet_transformer.cpp | Fully-qualifies parquet types/exceptions to avoid ambiguity |
| be/src/vec/olap/olap_data_convertor.cpp | Adds convertor for unsigned bigint (offset/length internal cols) |
| be/src/vec/json/path_in_data.h | Adds array-depth helper + new path utility declarations |
| be/src/vec/json/path_in_data.cpp | Implements prefix stripping + path append helpers |
| be/src/vec/json/json_parser.cpp | Adjusts nested array handling & array alignment behavior |
| be/src/vec/functions/function_search.h | Adds nested-search APIs and forward declarations |
| be/src/vec/functions/function_search.cpp | Implements NESTED evaluation flow + execution context refactor |
| be/src/vec/exprs/vsearch.cpp | Passes IndexExecContext/segment info; improves VARIANT subcolumn binding |
| be/src/vec/exprs/vexpr_context.h | Extends IndexExecContext with segment pointer + iterator options |
| be/src/vec/exprs/vexpr.h | Moves contains_blockable_function definition out-of-header |
| be/src/vec/exprs/vexpr.cpp | Implements new contains_blockable_function; SEARCH_EXPR treated as slot-acting |
| be/src/vec/exec/scan/olap_scanner.cpp | Uses parent_unique_id fallback for access-path maps |
| be/src/vec/common/variant_util.h | Tracks presence of nested groups in variant extended info |
| be/src/vec/common/variant_util.cpp | Nullable-aware array base-type detection; compaction skips extended schema when nested groups exist |
| be/src/vec/columns/column_variant.h | Adds missing include for column vector usage |
| be/src/vec/columns/column_variant.cpp | Minor include cleanup/formatting |
| be/src/runtime/exec_env.h | Adds setter for inverted index query cache |
| be/src/olap/rowset/segment_v2/variant/variant_statistics.h | Adds NestedGroup stats, pb (de)serialization updates, renames doc-value map |
| be/src/olap/rowset/segment_v2/variant/variant_external_meta_reader.cpp | Removes non-English comment note |
| be/src/olap/rowset/segment_v2/variant/variant_column_writer_impl.h | Introduces NestedGroup write provider integration and stats merging hooks |
| be/src/olap/rowset/segment_v2/variant/variant_column_writer_impl.cpp | Wires NestedGroup provider into write pipeline; stats merge + writer refactors |
| be/src/olap/rowset/segment_v2/variant/variant_column_reader.h | Adds NestedGroup reader structs and planning hooks |
| be/src/olap/rowset/segment_v2/variant/offset_manager_impl.h | New shared implementation for offset padding/reading utilities |
| be/src/olap/rowset/segment_v2/variant/offset_manager.h | New offset manager interface for NestedGroup offset handling |
| be/src/olap/rowset/segment_v2/variant/nested_offsets_mapping_index.h | New mapping index interface + reader/writer declarations |
| be/src/olap/rowset/segment_v2/variant/nested_offsets_mapping_index.cpp | Implements nested offsets mapping index build/read/map logic |
| be/src/olap/rowset/segment_v2/variant/nested_group_provider.h | Defines NestedGroup provider interfaces, filters, and factories |
| be/src/olap/rowset/segment_v2/variant/nested_group_provider.cpp | Default CE providers (no-op/disabled) + default path matching |
| be/src/olap/rowset/segment_v2/variant/nested_group_path.h | New helpers/constants for nested-group path encoding |
| be/src/olap/rowset/segment_v2/variant/hierarchical_data_iterator.cpp | Skips NestedGroup subcolumns in hierarchical merge; adds debug checks |
| be/src/olap/rowset/segment_v2/segment_iterator.cpp | Uses full path for extracted column field_name; updates inverted-index type inference; passes segment/iter opts into IndexExecContext |
| be/src/olap/rowset/segment_v2/index_writer.cpp | Uses column path when constructing extracted-column field_name |
| be/src/olap/rowset/segment_v2/index_file_writer.cpp | Include tweak for atomic usage |
| be/src/olap/rowset/segment_v2/column_writer.h | Adds NestedGroup provider/stats members for variant writers |
| be/src/olap/rowset/segment_v2/column_reader_cache.cpp | Removes stray preprocessor character |
| be/src/olap/rowset/segment_v2/column_reader.h | Exposes lazy-loading accessors for zone-map and ordinal indexes |
| be/src/olap/rowset/segment_v2/column_reader.cpp | Implements lazy-loading accessors; recognizes new index type in init |
| be/src/olap/rowset/segment_creator.cpp | Minor header/include cleanup and param renames |
| be/src/common/config.h | Adds config knob for max nested group depth |
| be/src/common/config.cpp | Defines default for nested group max depth |
Comments suppressed due to low confidence (1)
be/src/vec/json/json_parser.cpp:35
json_parser.cppreferencesconfig::variant_max_json_key_lengthbut no longer includescommon/config.h, so this should fail to compile withconfigundefined. Add back the#include "common/config.h"(keep the IWYU pragma if needed).
Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
be/src/olap/rowset/segment_v2/variant/nested_offsets_mapping_index.h
Outdated
Show resolved
Hide resolved
...c/main/java/org/apache/doris/nereids/trees/expressions/functions/scalar/SearchDslParser.java
Show resolved
Hide resolved
f9c0e67 to
c61965d
Compare
caa1183 to
bfef111
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 65 out of 65 changed files in this pull request and generated 5 comments.
Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
...c/main/java/org/apache/doris/nereids/trees/expressions/functions/scalar/SearchDslParser.java
Show resolved
Hide resolved
| uint32_t idx = | ||
|
static_cast |
Copilot
AI
Feb 26, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The cast to uint32_t from int could be unsafe if atoi returns a negative value or the parsed integer is larger than UINT32_MAX. Consider using std::stoul with error handling instead of atoi, or add validation to ensure the parsed value is within valid range.
| if (rel_str.find(DOC_VALUE_COLUMN_PATH) != std::string::npos) { | ||
| size_t bucket = rel_str.rfind('b'); | ||
| int bucket_value = std::stoi(rel_str.substr(bucket + 1)); | ||
|
uint32_t bucket_value = static_cast |
Copilot
AI
Feb 26, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar to line 1040, this cast from stoi result to uint32_t is potentially unsafe. The stoi function can throw exceptions for invalid input and may return negative values. Consider using std::stoul with proper error handling or add validation checks.
| CHECK(column->size() == nrows) << "column->size()=" << column->size() << ", nrows=" << nrows | ||
| << ", path=" << node.path.get_path(); |
Copilot
AI
Feb 26, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The CHECK macro will cause the program to abort on failure. For production code, consider using a more graceful error handling approach that returns a Status error instead of aborting the process, unless data corruption is guaranteed and aborting is the safest option.
| CHECK(column->size() == nrows) << "column->size()=" << column->size() << ", nrows=" << nrows | |
| << ", path=" << node.path.get_path(); | |
| if (column->size() != nrows) { | |
| return Status::InternalError( | |
| "Column size {} not equal to nrows {} at path {}", | |
| column->size(), nrows, node.path.get_path()); | |
| } |
|
const std::unordered_map |
||
|
std::shared_ptr |
||
| (void)field_name_to_column_id; |
Copilot
AI
Feb 26, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The field_name_to_column_id parameter is marked with (void) to suppress unused warnings, but it's passed through the call chain. If this parameter is intended for future use, add a TODO comment explaining its purpose. If it's truly unused, consider removing it from the function signature to reduce complexity.
| const std::unordered_map |
|
| std::shared_ptr |
|
| (void)field_name_to_column_id; | |
| [[maybe_unused]] const std::unordered_map |
|
| std::shared_ptr |
|
| // TODO: Use field_name_to_column_id to map nested field names to column IDs | |
| // when evaluating nested queries, so that nested_path and its children | |
| // can be resolved consistently with the outer query. |
bfef111 to
d71cce4
Compare
|
run buildall |
Cloud UT Coverage ReportIncrement line coverage Increment coverage report
|
TPC-H: Total hot run time: 28886 msmachine: 'aliyun_ecs.c7a.8xlarge_32C64G'
|
TPC-DS: Total hot run time: 183590 msmachine: 'aliyun_ecs.c7a.8xlarge_32C64G'
|
|
run buildall |
Cloud UT Coverage ReportIncrement line coverage Increment coverage report
|
TPC-H: Total hot run time: 28840 msmachine: 'aliyun_ecs.c7a.8xlarge_32C64G'
|
TPC-DS: Total hot run time: 183630 msmachine: 'aliyun_ecs.c7a.8xlarge_32C64G'
|
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
5e66e1a to
922ae43
Compare
|
run buildall |
Cloud UT Coverage ReportIncrement line coverage Increment coverage report
|
TPC-H: Total hot run time: 28583 msmachine: 'aliyun_ecs.c7a.8xlarge_32C64G'
|
TPC-DS: Total hot run time: 184918 msmachine: 'aliyun_ecs.c7a.8xlarge_32C64G'
|
BE UT Coverage ReportIncrement line coverage Increment coverage report
|