# Grove The `grove` is a specialized B+ tree optimized for genomic interval storage and querying. It organizes data by index (e.g., chromosome) and supports efficient overlap queries. An embedded graph overlay allows you to create directed edges between keys, representing relationships between genomic features. Beyond the core data structure covered on this page, the grove also supports: - **{doc}`graph`** — Create directed edges between keys to represent relationships such as transcript structures or gene regulatory networks - **{doc}`graph_manipulation`** — Inspect, modify, and analyze the graph structure with operations for edge management, neighbor traversal, and graph statistics - **{doc}`loading_data`** — Load genomic data from BED, GFF/GTF, and BAM/SAM files directly into a grove ## Key Types and the key_type_base Concept The grove is highly flexible and can work with any key type that satisfies the `key_type_base` concept. Genogrove provides four built-in key types: - `interval` - Basic genomic intervals (closed `[start, end]`) - `genomic_coordinate` - Intervals with strand information - `numeric` - Single numerical value - `kmer` - Single k-mer value **Custom Key Types** You can use any custom type as a key in the grove, as long as it satisfies the `key_type_base` concept requirements: ```cpp // Required operations for key_type_base concept: // 1. Comparison operators: <, >, == // 2. Static overlap detection: overlaps(a, b) -> bool // 3. Static pairwise aggregation: aggregate(a, b) -> T // 4. String representation: to_string() -> string ``` Example of a custom key type: ```cpp #include struct CustomInterval { size_t start; size_t end; int priority; // Comparison operators bool operator<(const CustomInterval& other) const { return start < other.start; } bool operator>(const CustomInterval& other) const { return start > other.start; } bool operator==(const CustomInterval& other) const { return start == other.start && end == other.end; } // Static overlap method static bool overlaps(const CustomInterval& a, const CustomInterval& b) { return !(a.end <= b.start || b.end <= a.start); } // Static pairwise aggregate method static CustomInterval aggregate(const CustomInterval& a, const CustomInterval& b) { return {std::min(a.start, b.start), std::max(a.end, b.end), 0}; } // String representation std::string to_string() const { return "[" + std::to_string(start) + ", " + std::to_string(end) + ") p=" + std::to_string(priority); } }; // Now you can use it with grove gst::grove my_grove(100); ``` ## Creating a Grove Create a grove with specified template parameters and tree order: ```cpp #include #include namespace gdt = genogrove::data_type; namespace gst = genogrove::structure; int main() { // grove(order) // Order determines max keys per node (higher = more cache-friendly) // Order must be >= 3 (throws std::invalid_argument otherwise) // Using built-in interval type gst::grove grove1(100); // Using genomic_coordinate (with strand) gst::grove grove2(100); // Default-constructed grove uses order 3 gst::grove grove3; return 0; } ``` **Template Parameters:** - `key_type`: Type satisfying `key_type_base` concept (interval, genomic_coordinate, or custom) - `data_type`: Associated data type (default: void for no data) - `edge_data_type`: Edge metadata for graph overlay (default: void) **Constructor Parameters:** - `order` (int): Maximum children per node (and `order - 1` keys). Must be `>= 3`; throws `std::invalid_argument("grove order must be >= 3")` otherwise. Order 2 is degenerate — internal splits with `order == 2` would produce a sibling with zero keys, which classical B+ trees forbid. The default constructor (`grove()`) uses order 3. ## Ownership Semantics `grove` (and its internal `node` type) is **non-copyable and move-only**. Copy construction and copy assignment are deleted because these types own raw pointers internally—a shallow copy would cause a double-free. Move operations are provided, so you can return a grove from a function or store it in a container: ```cpp // OK — move construction auto loaded = gst::grove::deserialize(in); // OK — pass by reference void process(gst::grove& g); // OK — move into a container std::vector> groves; groves.push_back(std::move(my_grove)); // ERROR — copy is deleted // auto copy = my_grove; // won't compile // void f(gst::grove g); // won't compile (pass-by-value copies) ``` Always pass groves by reference or move them explicitly with `std::move`. ## Accessing Indices Use `get_root_nodes()` to access the grove's index map (chromosome → root node). It returns a **const reference** to the internal `std::unordered_map`, so use `.at()` or `.find()` for lookups (not `operator[]`, which requires a non-const map). The reference is valid only while the grove is alive and unmodified. ```cpp const auto& roots = my_grove.get_root_nodes(); if (auto it = roots.find("chr1"); it != roots.end()) { // it->second is the root node for chr1 } ``` `set_root_nodes()` is private — indices are created automatically on first insertion. ## Inserting Data The grove supports multiple insertion modes with multi-index organization: ```cpp #include #include namespace gdt = genogrove::data_type; namespace gst = genogrove::structure; int main() { gst::grove my_grove(100); // Insert with index (e.g., chromosome name) // Returns pointer to inserted key auto* key1 = my_grove.insert_data("chr1", gdt::interval{100, 200}, "gene1"); auto* key2 = my_grove.insert_data("chr1", gdt::interval{150, 250}, "gene2"); // Insert on different chromosome auto* key3 = my_grove.insert_data("chr2", gdt::interval{300, 400}, "gene3"); // Optimized insertion for pre-sorted data auto* key4 = my_grove.insert_data("chr1", gdt::interval{500, 600}, "gene4", gst::sorted); // Much faster! return 0; } ``` **Insertion Modes:** - `unsorted` (default): Full tree traversal for random data - `sorted`: Optimized O(1) insertion for pre-sorted data - `bulk`: Bulk insertion using hybrid bottom-up/append approach **Multi-Index Organization:** - Each index (e.g., chromosome) maintains a separate B+ tree - Enables efficient chromosome-specific queries - Automatic index creation on first insertion ## Bulk Insertion For loading large datasets efficiently, use bulk insertion. Bulk insertion uses a hybrid bottom-up/append approach that is significantly faster than incremental insertion. **How It Works:** - **Empty index**: Builds tree from bottom-up (O(n) time) - Creates fully-packed leaf nodes first - Then builds internal layers - \~10-20x faster than incremental insertion for large datasets - Better space utilization: 75-90% full nodes vs ~50% from split-based insertion - **Existing data**: Appends to rightmost node - **CRITICAL PRECONDITION**: All new keys must be strictly greater than existing keys in the index - Throws `std::runtime_error` if precondition is violated **Usage:** ```cpp #include #include #include namespace gdt = genogrove::data_type; namespace gst = genogrove::structure; int main() { gst::grove my_grove(100); // Prepare sorted data as a container of (key, data) pairs std::vector> data = { {{100, 200}, "gene1"}, {{300, 400}, "gene2"}, {{500, 600}, "gene3"} }; // Bulk insert with sorted data (fastest method) my_grove.insert_data("chr1", data, gst::sorted, gst::bulk); return 0; } ``` **Performance Comparison:** For loading 1M intervals into an empty index: - **Incremental insertion** (`unsorted`): ~10-20 seconds - **Sorted insertion** (`sorted`): ~2-5 seconds - **Bulk insertion** (`sorted, bulk`): ~0.5-1 second **When to Use Bulk Insertion:** - Loading pre-sorted genomic files (BED, GFF, GTF) - Initial data loading into empty groves - Datasets with >10K intervals per chromosome - When maximum performance is required **Important Notes:** - **Data must be sorted** before calling bulk insertion - Bulk insertion is most beneficial for datasets >10K intervals - When appending to existing data, ensure new keys are strictly greater than all existing keys in that index - Genomic files (BED, GFF, GTF) are typically pre-sorted by position ## Removing Keys Use `remove_key()` to remove a previously-inserted key from the B+ tree. The method takes the index name (e.g., chromosome) and a pointer to the key — typically one that was returned from `insert_data()` or looked up via `intersect()`. ```cpp gst::grove my_grove(100); auto* k = my_grove.insert_data("chr1", gdt::interval{100, 200}, "gene1", gst::sorted); // Remove the key. Returns true if found and removed, false otherwise. bool removed = my_grove.remove_key("chr1", k); ``` **Behaviour:** - Returns `true` if the key was found and removed, `false` otherwise (unknown index, null pointer, or key not present in the index's tree). - Handles leaf underflow via borrow-from-sibling, merge, cascading rebalance, and root collapse — the tree stays balanced automatically. - **Graph edges are cleaned up automatically**: all incoming and outgoing edges for the removed key are dropped, so callers never need to call `remove_edge`/`remove_edges_from`/`remove_edges_to` manually after `remove_key`. - The key object itself is *not* deallocated — it remains in the grove's internal deque so that pointer stability is preserved. It is simply unlinked from the tree and graph. Call [`compact()`](#reclaiming-removed-key-storage-compact) to reclaim the slots when the dead storage adds up. ```cpp // Example: prune genes that fall below a coverage threshold auto results = my_grove.intersect(gdt::interval{0, 1000000}, "chr1"); for (auto* key : results.get_keys()) { if (coverage_of(key) < min_coverage) { my_grove.remove_key("chr1", key); } } ``` ### Reclaiming removed-key storage (`compact`) `remove_key()` unlinks keys from the tree but **does not free** the underlying `key<>` objects in the grove's internal deque. Across many insert/remove cycles the deque grows without bound while `indexed_vertex_count()` decreases — invisible to public counters. `compact()` is the caller-controlled, transactional rebuild that reclaims that memory. ```cpp gst::grove my_grove(100); // ... many insert / remove cycles ... std::cout << "Live indexed keys: " << my_grove.indexed_vertex_count() << "\n"; std::cout << "Deque slots in use: " << my_grove.key_storage_size() << "\n"; if (my_grove.key_storage_size() > 2 * my_grove.indexed_vertex_count()) { my_grove.compact(); // O(N + E); rebuilds key_storage, rewrites graph adjacency } ``` **Behaviour:** - **O(N + E)** — single pre-order walk of every tree to migrate keys + a single pass over the graph adjacency map to remap pointers. - **Invalidates every indexed-key pointer** previously returned by `insert_data()`, query results, or `get_neighbors()`. Callers must rediscover keys via queries after compaction. - **External keys are unaffected.** Pointers returned by `add_external_key()` and any graph edges referring to them remain valid. Use `key_storage_size()` to decide when compaction is worth running. The value strictly grows with internal separator keys (a healthy property of B+ trees) and with `remove_key()` calls; compaction is only useful once the latter starts dominating. ```cpp // Typical pattern: bulk-remove, then compact when no callers hold stale pointers for (auto* k : victims) my_grove.remove_key("chr1", k); // ... ensure no callers still hold `k` pointers ... my_grove.compact(); ``` ## Querying Intervals Find all intervals that overlap with a query region: ```cpp #include #include namespace gdt = genogrove::data_type; namespace gst = genogrove::structure; int main() { gst::grove my_grove(100); // Insert some intervals my_grove.insert_data("chr1", gdt::interval{100, 200}, "gene1"); my_grove.insert_data("chr1", gdt::interval{150, 250}, "gene2"); my_grove.insert_data("chr1", gdt::interval{300, 400}, "gene3"); my_grove.insert_data("chr2", gdt::interval{100, 200}, "gene4"); // Query specific chromosome (temporaries work directly) auto results = my_grove.intersect(gdt::interval{175, 225}, "chr1"); // Process results — get_keys() returns pointers to keys owned by the grove. for (auto* key : results.get_keys()) { std::cout << "Found: " << key->get_data() << " at " << key->get_value().to_string() << "\n"; } // Output: gene1, gene2 (gene3 doesn't overlap) // Query across all chromosomes auto all_results = my_grove.intersect(gdt::interval{175, 225}); std::cout << "Total matches: " << all_results.get_keys().size() << "\n"; return 0; } ``` **Query Features:** - Efficient overlap-based searching using B+ tree structure - Index-specific queries (single chromosome) — `index` parameter accepts `std::string_view` - Global queries (all chromosomes) - Accepts temporaries and named variables (const reference parameter) - Returns `query_result` containing matching keys **Result type:** `query_result::get_keys()` returns `const std::vector*>&` — pointers to keys owned by the grove, suitable for feeding directly into mutating graph operations such as `add_edge` or `link_if` without a `const_cast`. The container template `query_result` is constrained with the `key_type_base` concept, so misuse with a non-conforming type fails at the result-container instantiation site with a clean concept diagnostic instead of a deep template instantiation error inside `key<>`. ```{warning} Mutating a key's value via `key::set_value(...)` corrupts B+ tree ordering invariants. The same risk exists through the pointer returned by `insert_data()`, so the convention is enforced by discipline rather than at the API level — only mutate the *data* payload through these pointers, never the key value. ``` For the closest features on each side of a query (rather than features overlapping it), see [Flanking-Key Queries](#flanking-key-queries) below. **Concept Constraints:** The grove uses C++20 concepts to provide clear compile-time errors: - `link_if(keys, predicate)` requires `std::invocable` - `get_neighbors_if(source, predicate)` requires `std::predicate` - `insert_data(index, data, sorted, bulk)` and `build_tree_bottom_up`: `Container` must satisfy `std::ranges::forward_range` and `std::ranges::sized_range` - `insert_data(index, data, bulk)`: `Container` must satisfy `std::ranges::random_access_range` and `std::ranges::sized_range` ## Flanking-Key Queries In addition to `intersect()`, which finds keys that overlap a query, the grove supports a `flanking()` query that returns the **predecessor** and **successor** of a query — the two nearest non-overlapping keys that bracket it in sort order. This is useful when the caller needs the closest features on each side of a position rather than the features covering it (for example, the upstream and downstream genes of an enhancer). ```cpp #include #include namespace gdt = genogrove::data_type; namespace gst = genogrove::structure; // flanking() works on any populated grove regardless of how it was built — // `gst::sorted` here is just for parity with surrounding examples. gst::grove g(100); g.insert_data("chr1", gdt::interval{100, 200}, "A", gst::sorted); g.insert_data("chr1", gdt::interval{500, 600}, "B", gst::sorted); g.insert_data("chr1", gdt::interval{900, 1000}, "C", gst::sorted); auto r = g.flanking(gdt::interval{650, 700}, "chr1"); if (auto* pred = r.get_predecessor()) { // "B" — largest non-overlapping key < query } if (auto* succ = r.get_successor()) { // "C" — smallest non-overlapping key > query } ``` Each field of the returned `flanking_query_result` may be `nullptr` if no such key exists (for example, the query is past the rightmost key, or the index does not exist). Returned pointers reference keys owned by the grove and remain valid as long as the referenced key is not removed. Like `query_result`, `flanking_query_result` is constrained with the `key_type_base` concept so non-conforming `key_type` arguments fail at the result-container instantiation site. **Selection rule.** Keys that satisfy `key_type::overlaps(K, query)` are excluded by definition. Among the remaining candidates: - For interval-like keys (`interval`, `genomic_coordinate`), the predecessor is the key with the largest `end` below `query.start` (smallest left-side gap), and the successor is the key with the smallest `start` above `query.end` (smallest right-side gap). For nested intervals these can differ from the sort-order extremum — e.g., between `[50, 100]` and `[80, 90]`, sort picks `[80, 90]` but `[50, 100]` is closer to a query at `[150, …]`. - For scalar keys (`numeric`, `kmer`), nearest-by-value coincides with the sort-order extremum, so the predecessor/successor are simply the closest values on each side. **Distance is computed by the caller** because it is type-specific. For closed-coord intervals: `query.start - predecessor->get_value().get_end() - 1`; for numeric scalars: `query.value - predecessor->get_value()`. ### Filtering by a Predicate A second overload accepts a binary predicate `is_compatible(candidate, query)` that filters candidates before the overlap and ordering checks. Use this to express domain-specific constraints that are not encoded in `key_type::overlaps()` — for example, restricting the search to same-strand neighbors when keys carry strand information: ```cpp // Assumes a grove parameterised with genomic_coordinate (which carries strand); // gst::grove grove(...); // gdt::genomic_coordinate q{...}; auto r = grove.flanking(q, "chr1", [](const auto& candidate, const auto& q) { return q.get_strand() == '*' || candidate.get_strand() == '*' || candidate.get_strand() == q.get_strand(); }); ``` The `get_strand()` calls assume `genomic_coordinate` keys; for plain `interval` keys (no strand), use a predicate that inspects whatever metadata your key type exposes. The predicate is invoked at leaf level only. Internal-node pruning is purely structural (based on aggregate ranges) and never consults the predicate, so subtrees that contain only incompatible keys are still visited but filtered out at the leaves. ```{note} `flanking()` is named to avoid collision with the graph-overlay `get_neighbors()`, which returns graph-edge-connected keys — a distinct concept from spatial nearest neighbors. ``` ```{toctree} :maxdepth: 1 graph graph_manipulation loading_data ```