Structure#
The genogrove::structure namespace contains core data structures for genomic data storage and querying.
Tag Types#
sorted_t#
-
struct sorted_t#
Tag type for dispatching to sorted insertion algorithm.
See also
grove::insert_data(std::string_view, key_type, data_type, sorted_t)
Note
Use this when inserting data that is already sorted for optimal performance
bulk_t#
-
struct bulk_t#
Tag type for dispatching to bulk insertion algorithm.
string_hash#
grove#
Both grove and node are non-copyable, move-only types (Rule of Five). Copy construction
and copy assignment are deleted because these classes manage owned raw pointers—a shallow copy
would cause double-free. Move construction and move assignment are provided so groves can be
returned from functions (e.g., grove::deserialize()) and stored in containers.
-
template<typename key_type, typename data_type = void, typename edge_data_type = void>
class grove# Template-based B+ tree container for efficient genomic interval storage and querying.
The grove is a specialized B+ tree implementation designed for genomic data with:
Multi-index support: Separate trees for different indices (e.g., chromosomes)
Optimized sorted insertion: Direct rightmost leaf insertion for pre-sorted data
Graph overlay: Optional directed graph structure on top of the tree
Deque-based key storage: Stable memory addresses with improved cache locality
Linked leaf nodes: Efficient range traversal through leaf chaining
Key features:
Insert operations return pointers to inserted keys for immediate use
Automatic node splitting when capacity (order) is exceeded
Efficient overlap-based queries for genomic intervals
Embedded graph_overlay for relationship management between keys
Example usage:
grove<interval, bed_entry> g(100); // Create grove with order 100 auto* key_ptr = g.insert_data("chr1", interval{100, 200}, data, sorted); auto results = g.intersect(query_interval, "chr1"); g.add_edge(key_ptr, another_key); // Build graph relationships
Note
The order parameter controls the maximum number of keys per node (B+ tree capacity)
Note
Keys are stored in a std::deque owned by the grove for stable addresses
Note
The graph overlay is embedded within the grove and shares its lifetime
- Template Parameters:
key_type – The type of keys stored in the tree (must satisfy key_type_base concept)
data_type – Optional associated data type (default: void for no data)
edge_data_type – Optional edge metadata type for graph overlay (default: void)
Public Functions
-
inline explicit grove(int order)#
Construct a grove with specified order.
Order must be at least 3. With order == 2,
split_internal_nodewould produce a right sibling with 0 keys and 1 child (a degenerate internal node that breaks every path which dereferenceskeys[0]or callscalc_keys_aggregate). Classical B+ trees require order >= 3 for splits to produce balanced halves on both sides.- Parameters:
order – Determines the maximum number of order-1 keys and order children per node
- Throws:
std::invalid_argument – if order < 3
-
inline grove()#
Construct a grove with default order of 3.
-
inline ~grove()#
Destructor that cleans up all tree nodes.
Note
Recursively deletes all root nodes and their children; keys in deque are automatically freed
-
inline int get_order() const#
Get the order (maximum capacity) of the grove.
- Returns:
The order of the B+ tree (maximum number of keys per node)
-
inline const std::unordered_map<std::string, node<key_type, data_type>*, string_hash, std::equal_to<>> &get_root_nodes() const#
Get map of all root nodes indexed by their string keys.
- Returns:
Const reference to unordered map from index names (e.g., chromosome names) to root node pointers
-
inline node<key_type, data_type> *get_rightmost_node(std::string_view key) const#
Get the rightmost leaf node for a given index.
Note
Used for optimized sorted insertion
- Parameters:
key – The index name (e.g., chromosome name)
- Returns:
Pointer to rightmost leaf node, or nullptr if index doesn’t exist
node#
node is non-copyable, move-only (same rationale as grove). The move constructor and move
assignment operator transfer ownership of child pointers and leave the source node empty.
-
template<typename key_type, typename data_type = void>
class node# B+ tree node representing internal or leaf nodes in the grove structure.
The node class is a fundamental building block of the grove B+ tree implementation:
Internal nodes: Store keys for navigation and maintain children pointers
Leaf nodes: Store actual data keys and are linked together for range queries
Memory ownership: Nodes do NOT own keys (owned by grove’s deque), but DO own child nodes
Parent-child relationships: Each node maintains links to parent, children, and next sibling
Key characteristics:
Order parameter controls maximum capacity (order-1 keys, order children)
Keys are stored as pointers to entries in grove’s deque for stable addresses
Leaf nodes are chained via next pointers for efficient sequential traversal
Internal nodes aggregate child keys for navigation during tree traversal
Note
Keys are NOT deleted in destructor as they’re owned by the grove’s deque
Note
Only internal nodes delete their children; leaf nodes have no children to delete
Note
The order must be at least 2
- Template Parameters:
key_type – The type of keys stored in this node (must satisfy key_type_base concept)
data_type – Optional associated data type (default: void for no data)
Public Functions
-
inline explicit node(int order)#
Construct a new node with specified order.
Initializes a node with:
Pre-allocated capacity for keys and children to avoid reallocations
No parent or next sibling (set later during tree operations)
Not marked as leaf (set explicitly when needed)
- Parameters:
order – The B+ tree order (determines max keys = order-1, max children = order)
-
inline ~node()#
Destroy the node and recursively delete child nodes.
Memory management rules:
Keys are NOT deleted (owned by grove’s deque)
Child nodes ARE deleted recursively (owned by parent node)
Leaf nodes have no children to delete
-
inline int get_order() const noexcept#
Get the B+ tree order of this node.
- Returns:
The order value (max children = order, max keys = order-1)
-
inline std::vector<gdt::key<key_type, data_type>*> &get_keys()#
Get mutable reference to the keys vector.
Note
Keys are pointers to entries in grove’s deque, not owned by node
- Returns:
Reference to vector of key pointers
-
inline const std::vector<gdt::key<key_type, data_type>*> &get_keys() const#
Get const reference to the keys vector.
- Returns:
Const reference to vector of key pointers
-
inline std::vector<node<key_type, data_type>*> &get_children()#
Get mutable reference to the children vector.
Note
Children are owned by this node and will be deleted in destructor
- Returns:
Reference to vector of child node pointers
-
inline const std::vector<node<key_type, data_type>*> &get_children() const#
Get const reference to the children vector.
- Returns:
Const reference to vector of child node pointers
-
inline node *get_parent() const noexcept#
Get pointer to parent node.
- Returns:
Pointer to parent node, or nullptr if this is root
-
inline void set_parent(node<key_type, data_type> *parent)#
Set the parent node pointer.
- Parameters:
parent – Pointer to the parent node
-
inline node *get_next() const noexcept#
Get pointer to next sibling node.
Note
Only meaningful for leaf nodes that are chained together
- Returns:
Pointer to next leaf node, or nullptr if no next sibling
-
inline void set_next(node<key_type, data_type> *next)#
Set the next sibling pointer (for leaf node chaining)
Note
Only used for leaf nodes to enable efficient range traversal
- Parameters:
next – Pointer to the next leaf node in sequence
-
inline bool get_is_leaf() const noexcept#
Check if this node is a leaf.
- Returns:
True if this is a leaf node, false if internal node
-
inline void set_is_leaf(bool is_leaf)#
Set whether this node is a leaf.
- Parameters:
is_leaf – True if this is a leaf node, false if internal node
-
inline void insert_key_ptr(gdt::key<key_type, data_type> *key_ptr)#
Insert a pre-allocated key pointer into the node at sorted position.
Finds the correct sorted position by comparing key values and inserts the key pointer at that position, maintaining sorted order.
Note
Key must be allocated by grove before calling this
Note
This performs a linear search; consider using indexed insertion for bulk operations
- Parameters:
key_ptr – Pointer to key (already allocated by grove’s deque)
-
inline void insert_key_ptr(gdt::key<key_type, data_type> *key_ptr, int index)#
Insert a pre-allocated key pointer at specific index.
Directly inserts the key at the specified index without checking sort order. Caller is responsible for ensuring the insertion maintains sorted order.
Note
Key must be allocated by grove before calling this
- Parameters:
key_ptr – Pointer to key (already allocated by grove’s deque)
index – Position to insert at (0-based, must be in [0, keys.size()])
- Throws:
std::out_of_range – if index is out of bounds
-
inline key_type calc_keys_aggregate()#
Aggregate this node’s own keys.
For leaf nodes, this is the bounding range of the leaf data. For internal nodes, it covers children[0..n-2] (each separator covers its corresponding child) but NOT the last child, which has no separator key in this node. Use calc_subtree_range() to cover the full subtree including the last child’s catch-all chain.
For example with intervals: returns the bounding interval (min start, max end) of this node’s keys.
- Returns:
Aggregate key_type covering all keys stored directly in this node
-
inline key_type calc_subtree_range()#
Compute the bounding range of the entire subtree rooted at this node.
For leaf nodes, identical to calc_keys_aggregate(). For internal nodes, calc_keys_aggregate() only covers children[0..n-2] (each separator key covers its corresponding child), missing the last child which has no separator in this node. This method descends the last-child chain at every level to pick up the missing range.
Used as a separator value when the parent needs an accurate upper bound for this node’s entire subtree — especially after rebalancing, where calc_keys_aggregate() would leave the separator too narrow to reach keys in the catch-all chain.
- Returns:
Aggregate key_type covering all leaf keys reachable from this node
-
inline void add_child(node<key_type, data_type> *child, int index)#
Add a child node at the specified index.
Inserts a child node at the given position, shifting existing children to the right. The index must be in range [0, children.size()].
- Parameters:
child – Pointer to the child node to add
index – Position to insert the child (0-based)
- Throws:
std::out_of_range – if index is invalid
-
inline node *get_child(int index)#
Get child node at the specified index.
- Parameters:
index – Position of the child to retrieve (0-based)
- Throws:
std::out_of_range – if index is invalid
- Returns:
Pointer to the child node
-
inline node *get_child(int index) const#
Get child node at the specified index (const version)
- Parameters:
index – Position of the child to retrieve (0-based)
- Throws:
std::out_of_range – if index is invalid
- Returns:
Pointer to the child node
-
void serialize(std::ostream &os) const#
Serialize this node to an output stream.
Writes the node’s structure and content to the stream in binary format, allowing the node to be persisted and later restored.
- Parameters:
os – Output stream to write serialized data to
-
inline void print_keys(std::ostream &os, std::string_view sep = "\t") const#
Print all keys in this node to an output stream.
Outputs the string representation of each key in the node, separated by the specified separator. Useful for debugging and visualization.
- Parameters:
os – Output stream to write to
sep – Separator string between keys (default: tab)
Public Static Functions
-
static node<key_type, data_type> *deserialize(std::istream &is, int order, std::deque<gdt::key<key_type, data_type>> &key_storage)#
Deserialize a node from an input stream.
Reads serialized node data from the stream and reconstructs the node structure, including keys and child relationships. Keys are placed directly into the provided deque (owned by the grove) rather than heap-allocated, ensuring pointer stability and single-owner semantics.
- Parameters:
is – Input stream to read serialized data from
order – The B+ tree order to use for the deserialized node
key_storage – Deque to allocate keys into for stable pointer addresses
- Returns:
Pointer to the newly created and populated node
graph_overlay#
-
template<typename key_type, typename data_type = void, typename edge_data_type = void>
class graph_overlay# Graph overlay for grove - decoupled graph structure.
Provides graph capabilities on top of any grove by storing directed edges between keys. The graph is completely separate from the tree structure, allowing multiple graphs with different edge metadata types on the same grove.
- Template Parameters:
key_type – The key_type of the keys (e.g., interval)
data_type – The data_type of the keys
edge_data_type – Optional metadata for edges (void for no metadata)
Public Functions
-
graph_overlay() = default#
Default constructor.
-
inline void add_edge(gdt::key<key_type, data_type> *source, gdt::key<key_type, data_type> *target)#
Add edge without metadata.
- Parameters:
source – Pointer to source key
target – Pointer to target key
-
template<typename M = edge_data_type>
inline void add_edge(gdt::key<key_type, data_type> *source, gdt::key<key_type, data_type> *target, M &&metadata)# Add edge with metadata.
- Parameters:
source – Pointer to source key
target – Pointer to target key
metadata – Edge metadata (e.g., transcript ID, confidence)
-
inline bool remove_edge(gdt::key<key_type, data_type> *source, gdt::key<key_type, data_type> *target)#
Remove a specific edge.
- Parameters:
source – Pointer to source key
target – Pointer to target key
- Returns:
true if edge was found and removed, false otherwise
-
inline std::vector<gdt::key<key_type, data_type>*> get_neighbors(const gdt::key<key_type, data_type> *source) const#
Get all neighbor keys (targets of outgoing edges)
- Parameters:
source – Pointer to source key
- Returns:
Vector of pointers to neighbor keys
-
template<typename M = edge_data_type>
inline std::vector<M> get_edges(const gdt::key<key_type, data_type> *source) const# Get edge metadata for all outgoing edges.
- Parameters:
source – Pointer to source key
- Returns:
Vector of edge metadata (only available when edge_data_type != void)
-
inline const std::vector<edge> &get_edge_list(const gdt::key<key_type, data_type> *source) const#
Get all outgoing edge structures (with targets and metadata)
- Parameters:
source – Pointer to source key
- Returns:
Const reference to vector of edges (empty if no edges)
-
template<typename Predicate>
inline std::vector<gdt::key<key_type, data_type>*> get_neighbors_if(const gdt::key<key_type, data_type> *source, Predicate predicate) const# Get neighbors filtered by predicate on edge metadata.
- Parameters:
source – Pointer to source key
predicate – Function to filter edges by metadata
- Returns:
Vector of neighbor keys where predicate returns true
-
inline bool has_edge(const gdt::key<key_type, data_type> *source, const gdt::key<key_type, data_type> *target) const#
Check if edge exists.
- Parameters:
source – Pointer to source key
target – Pointer to target key
- Returns:
true if edge exists, false otherwise
-
inline size_t out_degree(const gdt::key<key_type, data_type> *source) const#
Get number of outgoing edges from a key.
- Parameters:
source – Pointer to source key
- Returns:
Number of outgoing edges
-
inline size_t edge_count() const#
Get total number of edges in graph.
- Returns:
Total edge count
-
inline size_t vertex_count_with_edges() const#
Get number of vertices (keys) with at least one outgoing edge.
- Returns:
Number of keys that have outgoing edges
-
inline size_t remove_edges_from(gdt::key<key_type, data_type> *source)#
Remove all outgoing edges from a source key.
- Parameters:
source – Pointer to source key
- Returns:
Number of edges removed
-
inline size_t remove_edges_to(gdt::key<key_type, data_type> *target)#
Remove all incoming edges to a target key.
Note
O(E) — scans all edges in the graph
- Parameters:
target – Pointer to target key
- Returns:
Number of edges removed
-
inline size_t remove_all_edges(gdt::key<key_type, data_type> *key)#
Remove all edges (both incoming and outgoing) involving a key.
- Parameters:
key – Pointer to the key
- Returns:
Total number of edges removed
-
template<typename Predicate>
inline size_t remove_edges_if(Predicate predicate)# Remove all edges matching a predicate.
- Parameters:
predicate – Callable taking const edge& and returning bool
- Returns:
Number of edges removed
-
inline void remap_keys(const std::unordered_map<const gdt::key<key_type, data_type>*, gdt::key<key_type, data_type>*> &remap)#
Rewrite every source/target pointer using an old → new remap.
For each adjacency entry (source → edges):
if
sourceappears inremap, the entry is moved toremap[source]for each
edge.targetthat appears inremap, it is rewritten
Keys absent from the remap (e.g. external keys, which were not migrated) are preserved unchanged. If multiple old sources map to the same new source (non-injective remap), their edge lists are appended into the destination bucket — no edges are dropped. Intended for grove::compact() to migrate adjacency after rebuilding the indexed key storage.
- Parameters:
remap – Map from old key pointer to new key pointer
-
inline void clear()#
Clear all edges.
-
inline bool empty() const#
Check if graph is empty.
- Returns:
true if no edges exist
-
struct edge#
Edge structure representing a directed connection.
Public Functions
-
template<typename M = edge_data_type>
inline edge(gdt::key<key_type, data_type> *tgt, M &&meta)#
Public Members
-
std::conditional_t<std::is_void_v<edge_data_type>, std::monostate, edge_data_type> metadata#
-
template<typename M = edge_data_type>