Grove#
The grove is a specialized B+ tree optimized for genomic interval storage and querying. It organizes data by index (e.g., chromosome) and supports efficient overlap queries. An embedded graph overlay allows you to create directed edges between keys, representing relationships between genomic features.
Beyond the core data structure covered on this page, the grove also supports:
Linking Keys — Create directed edges between keys to represent relationships such as transcript structures or gene regulatory networks
Graph Manipulation — Inspect, modify, and analyze the graph structure with operations for edge management, neighbor traversal, and graph statistics
Loading Data from Files — Load genomic data from BED, GFF/GTF, and BAM/SAM files directly into a grove
Key Types and the key_type_base Concept#
The grove is highly flexible and can work with any key type that satisfies the key_type_base concept. Genogrove provides four built-in key types:
interval- Basic genomic intervals (closed[start, end])genomic_coordinate- Intervals with strand informationnumeric- Single numerical valuekmer- Single k-mer value
Custom Key Types
You can use any custom type as a key in the grove, as long as it satisfies the key_type_base concept requirements:
// Required operations for key_type_base concept:
// 1. Comparison operators: <, >, ==
// 2. Static overlap detection: overlaps(a, b) -> bool
// 3. Static pairwise aggregation: aggregate(a, b) -> T
// 4. String representation: to_string() -> string
Example of a custom key type:
#include <genogrove/data_type/key_type_base.hpp>
struct CustomInterval {
size_t start;
size_t end;
int priority;
// Comparison operators
bool operator<(const CustomInterval& other) const {
return start < other.start;
}
bool operator>(const CustomInterval& other) const {
return start > other.start;
}
bool operator==(const CustomInterval& other) const {
return start == other.start && end == other.end;
}
// Static overlap method
static bool overlaps(const CustomInterval& a, const CustomInterval& b) {
return !(a.end <= b.start || b.end <= a.start);
}
// Static pairwise aggregate method
static CustomInterval aggregate(const CustomInterval& a, const CustomInterval& b) {
return {std::min(a.start, b.start), std::max(a.end, b.end), 0};
}
// String representation
std::string to_string() const {
return "[" + std::to_string(start) + ", " +
std::to_string(end) + ") p=" + std::to_string(priority);
}
};
// Now you can use it with grove
gst::grove<CustomInterval, std::string> my_grove(100);
Creating a Grove#
Create a grove with specified template parameters and tree order:
#include <genogrove/structure/grove/grove.hpp>
#include <genogrove/data_type/interval.hpp>
namespace gdt = genogrove::data_type;
namespace gst = genogrove::structure;
int main() {
// grove<key_type, data_type, edge_data_type>(order)
// Order determines max keys per node (higher = more cache-friendly)
// Order must be >= 3 (throws std::invalid_argument otherwise)
// Using built-in interval type
gst::grove<gdt::interval, std::string> grove1(100);
// Using genomic_coordinate (with strand)
gst::grove<gdt::genomic_coordinate, std::string> grove2(100);
// Default-constructed grove uses order 3
gst::grove<gdt::interval, std::string> grove3;
return 0;
}
Template Parameters:
key_type: Type satisfyingkey_type_baseconcept (interval, genomic_coordinate, or custom)data_type: Associated data type (default: void for no data)edge_data_type: Edge metadata for graph overlay (default: void)
Constructor Parameters:
order(int): Maximum children per node (andorder - 1keys). Must be>= 3; throwsstd::invalid_argument("grove order must be >= 3")otherwise. Order 2 is degenerate — internal splits withorder == 2would produce a sibling with zero keys, which classical B+ trees forbid. The default constructor (grove()) uses order 3.
Ownership Semantics#
grove (and its internal node type) is non-copyable and move-only. Copy construction and
copy assignment are deleted because these types own raw pointers internally—a shallow copy would
cause a double-free. Move operations are provided, so you can return a grove from a function or
store it in a container:
// OK — move construction
auto loaded = gst::grove<gdt::interval, std::string>::deserialize(in);
// OK — pass by reference
void process(gst::grove<gdt::interval, std::string>& g);
// OK — move into a container
std::vector<gst::grove<gdt::interval, std::string>> groves;
groves.push_back(std::move(my_grove));
// ERROR — copy is deleted
// auto copy = my_grove; // won't compile
// void f(gst::grove<gdt::interval, std::string> g); // won't compile (pass-by-value copies)
Always pass groves by reference or move them explicitly with std::move.
Accessing Indices#
Use get_root_nodes() to access the grove’s index map (chromosome → root node). It returns a
const reference to the internal std::unordered_map, so use .at() or .find() for lookups
(not operator[], which requires a non-const map). The reference is valid only while the grove is
alive and unmodified.
const auto& roots = my_grove.get_root_nodes();
if (auto it = roots.find("chr1"); it != roots.end()) {
// it->second is the root node for chr1
}
set_root_nodes() is private — indices are created automatically on first insertion.
Inserting Data#
The grove supports multiple insertion modes with multi-index organization:
#include <genogrove/structure/grove/grove.hpp>
#include <genogrove/data_type/interval.hpp>
namespace gdt = genogrove::data_type;
namespace gst = genogrove::structure;
int main() {
gst::grove<gdt::interval, std::string> my_grove(100);
// Insert with index (e.g., chromosome name)
// Returns pointer to inserted key
auto* key1 = my_grove.insert_data("chr1",
gdt::interval{100, 200},
"gene1");
auto* key2 = my_grove.insert_data("chr1",
gdt::interval{150, 250},
"gene2");
// Insert on different chromosome
auto* key3 = my_grove.insert_data("chr2",
gdt::interval{300, 400},
"gene3");
// Optimized insertion for pre-sorted data
auto* key4 = my_grove.insert_data("chr1",
gdt::interval{500, 600},
"gene4",
gst::sorted); // Much faster!
return 0;
}
Insertion Modes:
unsorted(default): Full tree traversal for random datasorted: Optimized O(1) insertion for pre-sorted databulk: Bulk insertion using hybrid bottom-up/append approach
Multi-Index Organization:
Each index (e.g., chromosome) maintains a separate B+ tree
Enables efficient chromosome-specific queries
Automatic index creation on first insertion
Bulk Insertion#
For loading large datasets efficiently, use bulk insertion. Bulk insertion uses a hybrid bottom-up/append approach that is significantly faster than incremental insertion.
How It Works:
Empty index: Builds tree from bottom-up (O(n) time)
Creates fully-packed leaf nodes first
Then builds internal layers
~10-20x faster than incremental insertion for large datasets
Better space utilization: 75-90% full nodes vs ~50% from split-based insertion
Existing data: Appends to rightmost node
CRITICAL PRECONDITION: All new keys must be strictly greater than existing keys in the index
Throws
std::runtime_errorif precondition is violated
Usage:
#include <genogrove/structure/grove/grove.hpp>
#include <genogrove/data_type/interval.hpp>
#include <vector>
namespace gdt = genogrove::data_type;
namespace gst = genogrove::structure;
int main() {
gst::grove<gdt::interval, std::string> my_grove(100);
// Prepare sorted data as a container of (key, data) pairs
std::vector<std::pair<gdt::interval, std::string>> data = {
{{100, 200}, "gene1"}, {{300, 400}, "gene2"}, {{500, 600}, "gene3"}
};
// Bulk insert with sorted data (fastest method)
my_grove.insert_data("chr1", data, gst::sorted, gst::bulk);
return 0;
}
Performance Comparison:
For loading 1M intervals into an empty index:
Incremental insertion (
unsorted): ~10-20 secondsSorted insertion (
sorted): ~2-5 secondsBulk insertion (
sorted, bulk): ~0.5-1 second
When to Use Bulk Insertion:
Loading pre-sorted genomic files (BED, GFF, GTF)
Initial data loading into empty groves
Datasets with >10K intervals per chromosome
When maximum performance is required
Important Notes:
Data must be sorted before calling bulk insertion
Bulk insertion is most beneficial for datasets >10K intervals
When appending to existing data, ensure new keys are strictly greater than all existing keys in that index
Genomic files (BED, GFF, GTF) are typically pre-sorted by position
Removing Keys#
Use remove_key() to remove a previously-inserted key from the B+ tree. The method takes the
index name (e.g., chromosome) and a pointer to the key — typically one that was returned from
insert_data() or looked up via intersect().
gst::grove<gdt::interval, std::string> my_grove(100);
auto* k = my_grove.insert_data("chr1", gdt::interval{100, 200}, "gene1", gst::sorted);
// Remove the key. Returns true if found and removed, false otherwise.
bool removed = my_grove.remove_key("chr1", k);
Behaviour:
Returns
trueif the key was found and removed,falseotherwise (unknown index, null pointer, or key not present in the index’s tree).Handles leaf underflow via borrow-from-sibling, merge, cascading rebalance, and root collapse — the tree stays balanced automatically.
Graph edges are cleaned up automatically: all incoming and outgoing edges for the removed key are dropped, so callers never need to call
remove_edge/remove_edges_from/remove_edges_tomanually afterremove_key.The key object itself is not deallocated — it remains in the grove’s internal deque so that pointer stability is preserved. It is simply unlinked from the tree and graph. Call
compact()to reclaim the slots when the dead storage adds up.
// Example: prune genes that fall below a coverage threshold
auto results = my_grove.intersect(gdt::interval{0, 1000000}, "chr1");
for (auto* key : results.get_keys()) {
if (coverage_of(key) < min_coverage) {
my_grove.remove_key("chr1", key);
}
}
Reclaiming removed-key storage (compact)#
remove_key() unlinks keys from the tree but does not free the underlying key<> objects
in the grove’s internal deque. Across many insert/remove cycles the deque grows without bound
while indexed_vertex_count() decreases — invisible to public counters. compact() is the
caller-controlled, transactional rebuild that reclaims that memory.
gst::grove<gdt::interval, std::string> my_grove(100);
// ... many insert / remove cycles ...
std::cout << "Live indexed keys: " << my_grove.indexed_vertex_count() << "\n";
std::cout << "Deque slots in use: " << my_grove.key_storage_size() << "\n";
if (my_grove.key_storage_size() > 2 * my_grove.indexed_vertex_count()) {
my_grove.compact(); // O(N + E); rebuilds key_storage, rewrites graph adjacency
}
Behaviour:
O(N + E) — single pre-order walk of every tree to migrate keys + a single pass over the graph adjacency map to remap pointers.
Invalidates every indexed-key pointer previously returned by
insert_data(), query results, orget_neighbors(). Callers must rediscover keys via queries after compaction.External keys are unaffected. Pointers returned by
add_external_key()and any graph edges referring to them remain valid.
Use key_storage_size() to decide when compaction is worth running. The value strictly grows
with internal separator keys (a healthy property of B+ trees) and with remove_key() calls;
compaction is only useful once the latter starts dominating.
// Typical pattern: bulk-remove, then compact when no callers hold stale pointers
for (auto* k : victims) my_grove.remove_key("chr1", k);
// ... ensure no callers still hold `k` pointers ...
my_grove.compact();
Querying Intervals#
Find all intervals that overlap with a query region:
#include <genogrove/structure/grove/grove.hpp>
#include <genogrove/data_type/interval.hpp>
namespace gdt = genogrove::data_type;
namespace gst = genogrove::structure;
int main() {
gst::grove<gdt::interval, std::string> my_grove(100);
// Insert some intervals
my_grove.insert_data("chr1", gdt::interval{100, 200}, "gene1");
my_grove.insert_data("chr1", gdt::interval{150, 250}, "gene2");
my_grove.insert_data("chr1", gdt::interval{300, 400}, "gene3");
my_grove.insert_data("chr2", gdt::interval{100, 200}, "gene4");
// Query specific chromosome (temporaries work directly)
auto results = my_grove.intersect(gdt::interval{175, 225}, "chr1");
// Process results — get_keys() returns pointers to keys owned by the grove.
for (auto* key : results.get_keys()) {
std::cout << "Found: " << key->get_data()
<< " at " << key->get_value().to_string() << "\n";
}
// Output: gene1, gene2 (gene3 doesn't overlap)
// Query across all chromosomes
auto all_results = my_grove.intersect(gdt::interval{175, 225});
std::cout << "Total matches: " << all_results.get_keys().size() << "\n";
return 0;
}
Query Features:
Efficient overlap-based searching using B+ tree structure
Index-specific queries (single chromosome) —
indexparameter acceptsstd::string_viewGlobal queries (all chromosomes)
Accepts temporaries and named variables (const reference parameter)
Returns
query_resultcontaining matching keys
Result type:
query_result::get_keys() returns const std::vector<key<...>*>& — pointers to keys owned by
the grove, suitable for feeding directly into mutating graph operations such as add_edge or
link_if without a const_cast. The container template query_result<key_type, data_type> is
constrained with the key_type_base concept, so misuse with a non-conforming type fails at the
result-container instantiation site with a clean concept diagnostic instead of a deep template
instantiation error inside key<>.
Warning
Mutating a key’s value via key::set_value(...) corrupts B+ tree ordering invariants. The same
risk exists through the pointer returned by insert_data(), so the convention is enforced by
discipline rather than at the API level — only mutate the data payload through these
pointers, never the key value.
For the closest features on each side of a query (rather than features overlapping it), see Flanking-Key Queries below.
Concept Constraints:
The grove uses C++20 concepts to provide clear compile-time errors:
link_if(keys, predicate)requiresstd::invocable<Predicate, key*, key*>get_neighbors_if(source, predicate)requiresstd::predicate<Predicate, const edge_data_type&>insert_data(index, data, sorted, bulk)andbuild_tree_bottom_up:Containermust satisfystd::ranges::forward_rangeandstd::ranges::sized_rangeinsert_data(index, data, bulk):Containermust satisfystd::ranges::random_access_rangeandstd::ranges::sized_range
Flanking-Key Queries#
In addition to intersect(), which finds keys that overlap a query, the grove supports a
flanking() query that returns the predecessor and successor of a query — the
two nearest non-overlapping keys that bracket it in sort order. This is useful when the
caller needs the closest features on each side of a position rather than the features
covering it (for example, the upstream and downstream genes of an enhancer).
#include <genogrove/structure/grove/grove.hpp>
#include <genogrove/data_type/interval.hpp>
namespace gdt = genogrove::data_type;
namespace gst = genogrove::structure;
// flanking() works on any populated grove regardless of how it was built —
// `gst::sorted` here is just for parity with surrounding examples.
gst::grove<gdt::interval, std::string> g(100);
g.insert_data("chr1", gdt::interval{100, 200}, "A", gst::sorted);
g.insert_data("chr1", gdt::interval{500, 600}, "B", gst::sorted);
g.insert_data("chr1", gdt::interval{900, 1000}, "C", gst::sorted);
auto r = g.flanking(gdt::interval{650, 700}, "chr1");
if (auto* pred = r.get_predecessor()) {
// "B" — largest non-overlapping key < query
}
if (auto* succ = r.get_successor()) {
// "C" — smallest non-overlapping key > query
}
Each field of the returned flanking_query_result may be nullptr if no such
key exists (for example, the query is past the rightmost key, or the index does
not exist). Returned pointers reference keys owned by the grove and remain valid
as long as the referenced key is not removed. Like query_result,
flanking_query_result<key_type, data_type> is constrained with the
key_type_base concept so non-conforming key_type arguments fail at the
result-container instantiation site.
Selection rule. Keys that satisfy key_type::overlaps(K, query) are
excluded by definition. Among the remaining candidates:
For interval-like keys (
interval,genomic_coordinate), the predecessor is the key with the largestendbelowquery.start(smallest left-side gap), and the successor is the key with the smalleststartabovequery.end(smallest right-side gap). For nested intervals these can differ from the sort-order extremum — e.g., between[50, 100]and[80, 90], sort picks[80, 90]but[50, 100]is closer to a query at[150, …].For scalar keys (
numeric,kmer), nearest-by-value coincides with the sort-order extremum, so the predecessor/successor are simply the closest values on each side.
Distance is computed by the caller because it is type-specific. For closed-coord
intervals: query.start - predecessor->get_value().get_end() - 1; for numeric scalars:
query.value - predecessor->get_value().
Filtering by a Predicate#
A second overload accepts a binary predicate is_compatible(candidate, query) that
filters candidates before the overlap and ordering checks. Use this to express
domain-specific constraints that are not encoded in key_type::overlaps() — for
example, restricting the search to same-strand neighbors when keys carry strand
information:
// Assumes a grove parameterised with genomic_coordinate (which carries strand);
// gst::grove<gdt::genomic_coordinate, /* data */> grove(...);
// gdt::genomic_coordinate q{...};
auto r = grove.flanking(q, "chr1",
[](const auto& candidate, const auto& q) {
return q.get_strand() == '*' || candidate.get_strand() == '*'
|| candidate.get_strand() == q.get_strand();
});
The get_strand() calls assume genomic_coordinate keys; for plain interval
keys (no strand), use a predicate that inspects whatever metadata your key
type exposes.
The predicate is invoked at leaf level only. Internal-node pruning is purely structural (based on aggregate ranges) and never consults the predicate, so subtrees that contain only incompatible keys are still visited but filtered out at the leaves.
Note
flanking() is named to avoid collision with the graph-overlay get_neighbors(),
which returns graph-edge-connected keys — a distinct concept from spatial nearest
neighbors.