Structure#

The genogrove::structure namespace contains core data structures for genomic data storage and querying.

Tag Types#

sorted_t#

struct sorted_t#

Tag type for dispatching to sorted insertion algorithm.

See also

grove::insert_data(std::string_view, key_type, data_type, sorted_t)

Note

Use this when inserting data that is already sorted for optimal performance

bulk_t#

struct bulk_t#

Tag type for dispatching to bulk insertion algorithm.

string_hash#

struct string_hash#

Transparent hash for std::string keys, enabling lookup with std::string_view without allocating a temporary std::string.

Public Types

using is_transparent = void#

Public Functions

inline size_t operator()(std::string_view sv) const noexcept#

grove#

Both grove and node are non-copyable, move-only types (Rule of Five). Copy construction and copy assignment are deleted because these classes manage owned raw pointers—a shallow copy would cause double-free. Move construction and move assignment are provided so groves can be returned from functions (e.g., grove::deserialize()) and stored in containers.

template<typename key_type, typename data_type = void, typename edge_data_type = void>
class grove#

Template-based B+ tree container for efficient genomic interval storage and querying.

The grove is a specialized B+ tree implementation designed for genomic data with:

  • Multi-index support: Separate trees for different indices (e.g., chromosomes)

  • Optimized sorted insertion: Direct rightmost leaf insertion for pre-sorted data

  • Graph overlay: Optional directed graph structure on top of the tree

  • Deque-based key storage: Stable memory addresses with improved cache locality

  • Linked leaf nodes: Efficient range traversal through leaf chaining

Key features:

  • Insert operations return pointers to inserted keys for immediate use

  • Automatic node splitting when capacity (order) is exceeded

  • Efficient overlap-based queries for genomic intervals

  • Embedded graph_overlay for relationship management between keys

Example usage:

grove<interval, bed_entry> g(100);  // Create grove with order 100
auto* key_ptr = g.insert_data("chr1", interval{100, 200}, data, sorted);
auto results = g.intersect(query_interval, "chr1");
g.add_edge(key_ptr, another_key);  // Build graph relationships

Note

The order parameter controls the maximum number of keys per node (B+ tree capacity)

Note

Keys are stored in a std::deque owned by the grove for stable addresses

Note

The graph overlay is embedded within the grove and shares its lifetime

Template Parameters:
  • key_type – The type of keys stored in the tree (must satisfy key_type_base concept)

  • data_type – Optional associated data type (default: void for no data)

  • edge_data_type – Optional edge metadata type for graph overlay (default: void)

Public Functions

inline explicit grove(int order)#

Construct a grove with specified order.

Order must be at least 3. With order == 2, split_internal_node would produce a right sibling with 0 keys and 1 child (a degenerate internal node that breaks every path which dereferences keys[0] or calls calc_keys_aggregate). Classical B+ trees require order >= 3 for splits to produce balanced halves on both sides.

Parameters:

order – Determines the maximum number of order-1 keys and order children per node

Throws:

std::invalid_argument – if order < 3

inline grove()#

Construct a grove with default order of 3.

inline ~grove()#

Destructor that cleans up all tree nodes.

Note

Recursively deletes all root nodes and their children; keys in deque are automatically freed

grove(const grove&) = delete#
grove &operator=(const grove&) = delete#
grove(grove&&) noexcept = default#
grove &operator=(grove&&) noexcept = default#
inline int get_order() const#

Get the order (maximum capacity) of the grove.

Returns:

The order of the B+ tree (maximum number of keys per node)

inline const std::unordered_map<std::string, node<key_type, data_type>*, string_hash, std::equal_to<>> &get_root_nodes() const#

Get map of all root nodes indexed by their string keys.

Returns:

Const reference to unordered map from index names (e.g., chromosome names) to root node pointers

inline node<key_type, data_type> *get_rightmost_node(std::string_view key) const#

Get the rightmost leaf node for a given index.

Note

Used for optimized sorted insertion

Parameters:

key – The index name (e.g., chromosome name)

Returns:

Pointer to rightmost leaf node, or nullptr if index doesn’t exist

node#

node is non-copyable, move-only (same rationale as grove). The move constructor and move assignment operator transfer ownership of child pointers and leave the source node empty.

template<typename key_type, typename data_type = void>
class node#

B+ tree node representing internal or leaf nodes in the grove structure.

The node class is a fundamental building block of the grove B+ tree implementation:

  • Internal nodes: Store keys for navigation and maintain children pointers

  • Leaf nodes: Store actual data keys and are linked together for range queries

  • Memory ownership: Nodes do NOT own keys (owned by grove’s deque), but DO own child nodes

  • Parent-child relationships: Each node maintains links to parent, children, and next sibling

Key characteristics:

  • Order parameter controls maximum capacity (order-1 keys, order children)

  • Keys are stored as pointers to entries in grove’s deque for stable addresses

  • Leaf nodes are chained via next pointers for efficient sequential traversal

  • Internal nodes aggregate child keys for navigation during tree traversal

Note

Keys are NOT deleted in destructor as they’re owned by the grove’s deque

Note

Only internal nodes delete their children; leaf nodes have no children to delete

Note

The order must be at least 2

Template Parameters:
  • key_type – The type of keys stored in this node (must satisfy key_type_base concept)

  • data_type – Optional associated data type (default: void for no data)

Public Functions

inline explicit node(int order)#

Construct a new node with specified order.

Initializes a node with:

  • Pre-allocated capacity for keys and children to avoid reallocations

  • No parent or next sibling (set later during tree operations)

  • Not marked as leaf (set explicitly when needed)

Parameters:

order – The B+ tree order (determines max keys = order-1, max children = order)

inline ~node()#

Destroy the node and recursively delete child nodes.

Memory management rules:

  • Keys are NOT deleted (owned by grove’s deque)

  • Child nodes ARE deleted recursively (owned by parent node)

  • Leaf nodes have no children to delete

node(const node&) = delete#
node &operator=(const node&) = delete#
inline node(node &&other) noexcept#
inline node &operator=(node &&other) noexcept#
inline int get_order() const noexcept#

Get the B+ tree order of this node.

Returns:

The order value (max children = order, max keys = order-1)

inline std::vector<gdt::key<key_type, data_type>*> &get_keys()#

Get mutable reference to the keys vector.

Note

Keys are pointers to entries in grove’s deque, not owned by node

Returns:

Reference to vector of key pointers

inline const std::vector<gdt::key<key_type, data_type>*> &get_keys() const#

Get const reference to the keys vector.

Returns:

Const reference to vector of key pointers

inline std::vector<node<key_type, data_type>*> &get_children()#

Get mutable reference to the children vector.

Note

Children are owned by this node and will be deleted in destructor

Returns:

Reference to vector of child node pointers

inline const std::vector<node<key_type, data_type>*> &get_children() const#

Get const reference to the children vector.

Returns:

Const reference to vector of child node pointers

inline node *get_parent() const noexcept#

Get pointer to parent node.

Returns:

Pointer to parent node, or nullptr if this is root

inline void set_parent(node<key_type, data_type> *parent)#

Set the parent node pointer.

Parameters:

parent – Pointer to the parent node

inline node *get_next() const noexcept#

Get pointer to next sibling node.

Note

Only meaningful for leaf nodes that are chained together

Returns:

Pointer to next leaf node, or nullptr if no next sibling

inline void set_next(node<key_type, data_type> *next)#

Set the next sibling pointer (for leaf node chaining)

Note

Only used for leaf nodes to enable efficient range traversal

Parameters:

next – Pointer to the next leaf node in sequence

inline bool get_is_leaf() const noexcept#

Check if this node is a leaf.

Returns:

True if this is a leaf node, false if internal node

inline void set_is_leaf(bool is_leaf)#

Set whether this node is a leaf.

Parameters:

is_leaf – True if this is a leaf node, false if internal node

inline void insert_key_ptr(gdt::key<key_type, data_type> *key_ptr)#

Insert a pre-allocated key pointer into the node at sorted position.

Finds the correct sorted position by comparing key values and inserts the key pointer at that position, maintaining sorted order.

Note

Key must be allocated by grove before calling this

Note

This performs a linear search; consider using indexed insertion for bulk operations

Parameters:

key_ptr – Pointer to key (already allocated by grove’s deque)

inline void insert_key_ptr(gdt::key<key_type, data_type> *key_ptr, int index)#

Insert a pre-allocated key pointer at specific index.

Directly inserts the key at the specified index without checking sort order. Caller is responsible for ensuring the insertion maintains sorted order.

Note

Key must be allocated by grove before calling this

Parameters:
  • key_ptr – Pointer to key (already allocated by grove’s deque)

  • index – Position to insert at (0-based, must be in [0, keys.size()])

Throws:

std::out_of_range – if index is out of bounds

inline key_type calc_keys_aggregate()#

Aggregate this node’s own keys.

For leaf nodes, this is the bounding range of the leaf data. For internal nodes, it covers children[0..n-2] (each separator covers its corresponding child) but NOT the last child, which has no separator key in this node. Use calc_subtree_range() to cover the full subtree including the last child’s catch-all chain.

For example with intervals: returns the bounding interval (min start, max end) of this node’s keys.

Returns:

Aggregate key_type covering all keys stored directly in this node

inline key_type calc_subtree_range()#

Compute the bounding range of the entire subtree rooted at this node.

For leaf nodes, identical to calc_keys_aggregate(). For internal nodes, calc_keys_aggregate() only covers children[0..n-2] (each separator key covers its corresponding child), missing the last child which has no separator in this node. This method descends the last-child chain at every level to pick up the missing range.

Used as a separator value when the parent needs an accurate upper bound for this node’s entire subtree — especially after rebalancing, where calc_keys_aggregate() would leave the separator too narrow to reach keys in the catch-all chain.

Returns:

Aggregate key_type covering all leaf keys reachable from this node

inline void add_child(node<key_type, data_type> *child, int index)#

Add a child node at the specified index.

Inserts a child node at the given position, shifting existing children to the right. The index must be in range [0, children.size()].

Parameters:
  • child – Pointer to the child node to add

  • index – Position to insert the child (0-based)

Throws:

std::out_of_range – if index is invalid

inline node *get_child(int index)#

Get child node at the specified index.

Parameters:

index – Position of the child to retrieve (0-based)

Throws:

std::out_of_range – if index is invalid

Returns:

Pointer to the child node

inline node *get_child(int index) const#

Get child node at the specified index (const version)

Parameters:

index – Position of the child to retrieve (0-based)

Throws:

std::out_of_range – if index is invalid

Returns:

Pointer to the child node

void serialize(std::ostream &os) const#

Serialize this node to an output stream.

Writes the node’s structure and content to the stream in binary format, allowing the node to be persisted and later restored.

Parameters:

os – Output stream to write serialized data to

inline void print_keys(std::ostream &os, std::string_view sep = "\t") const#

Print all keys in this node to an output stream.

Outputs the string representation of each key in the node, separated by the specified separator. Useful for debugging and visualization.

Parameters:
  • os – Output stream to write to

  • sep – Separator string between keys (default: tab)

Public Static Functions

static node<key_type, data_type> *deserialize(std::istream &is, int order, std::deque<gdt::key<key_type, data_type>> &key_storage)#

Deserialize a node from an input stream.

Reads serialized node data from the stream and reconstructs the node structure, including keys and child relationships. Keys are placed directly into the provided deque (owned by the grove) rather than heap-allocated, ensuring pointer stability and single-owner semantics.

Parameters:
  • is – Input stream to read serialized data from

  • order – The B+ tree order to use for the deserialized node

  • key_storage – Deque to allocate keys into for stable pointer addresses

Returns:

Pointer to the newly created and populated node

graph_overlay#

template<typename key_type, typename data_type = void, typename edge_data_type = void>
class graph_overlay#

Graph overlay for grove - decoupled graph structure.

Provides graph capabilities on top of any grove by storing directed edges between keys. The graph is completely separate from the tree structure, allowing multiple graphs with different edge metadata types on the same grove.

Template Parameters:
  • key_type – The key_type of the keys (e.g., interval)

  • data_type – The data_type of the keys

  • edge_data_type – Optional metadata for edges (void for no metadata)

Public Functions

graph_overlay() = default#

Default constructor.

inline void add_edge(gdt::key<key_type, data_type> *source, gdt::key<key_type, data_type> *target)#

Add edge without metadata.

Parameters:
  • source – Pointer to source key

  • target – Pointer to target key

template<typename M = edge_data_type>
inline void add_edge(gdt::key<key_type, data_type> *source, gdt::key<key_type, data_type> *target, M &&metadata)#

Add edge with metadata.

Parameters:
  • source – Pointer to source key

  • target – Pointer to target key

  • metadata – Edge metadata (e.g., transcript ID, confidence)

inline bool remove_edge(gdt::key<key_type, data_type> *source, gdt::key<key_type, data_type> *target)#

Remove a specific edge.

Parameters:
  • source – Pointer to source key

  • target – Pointer to target key

Returns:

true if edge was found and removed, false otherwise

inline std::vector<gdt::key<key_type, data_type>*> get_neighbors(const gdt::key<key_type, data_type> *source) const#

Get all neighbor keys (targets of outgoing edges)

Parameters:

source – Pointer to source key

Returns:

Vector of pointers to neighbor keys

template<typename M = edge_data_type>
inline std::vector<M> get_edges(const gdt::key<key_type, data_type> *source) const#

Get edge metadata for all outgoing edges.

Parameters:

source – Pointer to source key

Returns:

Vector of edge metadata (only available when edge_data_type != void)

inline const std::vector<edge> &get_edge_list(const gdt::key<key_type, data_type> *source) const#

Get all outgoing edge structures (with targets and metadata)

Parameters:

source – Pointer to source key

Returns:

Const reference to vector of edges (empty if no edges)

template<typename Predicate>
inline std::vector<gdt::key<key_type, data_type>*> get_neighbors_if(const gdt::key<key_type, data_type> *source, Predicate predicate) const#

Get neighbors filtered by predicate on edge metadata.

Parameters:
  • source – Pointer to source key

  • predicate – Function to filter edges by metadata

Returns:

Vector of neighbor keys where predicate returns true

inline bool has_edge(const gdt::key<key_type, data_type> *source, const gdt::key<key_type, data_type> *target) const#

Check if edge exists.

Parameters:
  • source – Pointer to source key

  • target – Pointer to target key

Returns:

true if edge exists, false otherwise

inline size_t out_degree(const gdt::key<key_type, data_type> *source) const#

Get number of outgoing edges from a key.

Parameters:

source – Pointer to source key

Returns:

Number of outgoing edges

inline size_t edge_count() const#

Get total number of edges in graph.

Returns:

Total edge count

inline size_t vertex_count_with_edges() const#

Get number of vertices (keys) with at least one outgoing edge.

Returns:

Number of keys that have outgoing edges

inline size_t remove_edges_from(gdt::key<key_type, data_type> *source)#

Remove all outgoing edges from a source key.

Parameters:

source – Pointer to source key

Returns:

Number of edges removed

inline size_t remove_edges_to(gdt::key<key_type, data_type> *target)#

Remove all incoming edges to a target key.

Note

O(E) — scans all edges in the graph

Parameters:

target – Pointer to target key

Returns:

Number of edges removed

inline size_t remove_all_edges(gdt::key<key_type, data_type> *key)#

Remove all edges (both incoming and outgoing) involving a key.

Parameters:

key – Pointer to the key

Returns:

Total number of edges removed

template<typename Predicate>
inline size_t remove_edges_if(Predicate predicate)#

Remove all edges matching a predicate.

Parameters:

predicate – Callable taking const edge& and returning bool

Returns:

Number of edges removed

inline void remap_keys(const std::unordered_map<const gdt::key<key_type, data_type>*, gdt::key<key_type, data_type>*> &remap)#

Rewrite every source/target pointer using an old → new remap.

For each adjacency entry (source → edges):

  • if source appears in remap, the entry is moved to remap[source]

  • for each edge.target that appears in remap, it is rewritten

Keys absent from the remap (e.g. external keys, which were not migrated) are preserved unchanged. If multiple old sources map to the same new source (non-injective remap), their edge lists are appended into the destination bucket — no edges are dropped. Intended for grove::compact() to migrate adjacency after rebuilding the indexed key storage.

Parameters:

remap – Map from old key pointer to new key pointer

inline void clear()#

Clear all edges.

inline bool empty() const#

Check if graph is empty.

Returns:

true if no edges exist

struct edge#

Edge structure representing a directed connection.

Public Functions

inline explicit edge(gdt::key<key_type, data_type> *tgt)#
template<typename M = edge_data_type>
inline edge(gdt::key<key_type, data_type> *tgt, M &&meta)#

Public Members

gdt::key<key_type, data_type> *target#
std::conditional_t<std::is_void_v<edge_data_type>, std::monostate, edge_data_type> metadata#