genogrove

A high-performance modern C++ library for genomic data structures and interval queries.

Overview

Genogrove provides a specialized B+ tree data structure (the grove) optimized for storing and querying genomic intervals. It combines efficient interval overlap detection with an embedded graph overlay for representing relationships between genomic features.

Key Features:

  • Flexible Key Types: Works with any type satisfying the key_type_base concept (built-in: interval, genomic_coordinate, numeric, kmer)

  • Multi-Index Organization: Separate trees per chromosome for efficient queries

  • Sorted Insertion: O(1) amortized insertion for pre-sorted genomic data

  • Graph Overlay: Link keys within the grove to represent feature relationships

  • File I/O: Automatic format detection and compression support (BED, GFF/GTF, BAM/SAM, FASTA/FASTQ)

  • Modern C++20: Type-safe, concept-based design

Quick Example

Here’s a complete example showing file reading, storage, and querying:

#include <genogrove/io/bed_reader.hpp>
#include <genogrove/structure/grove/grove.hpp>
#include <genogrove/data_type/interval.hpp>
#include <iostream>
#include <stdexcept>
#include <string>

namespace gio = genogrove::io;
namespace gdt = genogrove::data_type;
namespace gst = genogrove::structure;

int main() {
    // Create grove to store genomic features
    gst::grove<gdt::interval, std::string> features(100);

    // Read BED file (handles .bed.gz automatically)
    gio::bed_reader reader("genes.bed.gz");

    try {
        for (const auto& entry : reader) {
            // Insert under the entry's chromosome; the file is assumed to be sorted
            // BED is half-open [start, end); convert to closed [start, end]
            features.insert_data(
                entry.chrom,
                gdt::interval(entry.start, entry.end - 1),
                entry.name,
                gst::sorted  // Optimized for pre-sorted data
            );
        }
    } catch (const std::runtime_error& e) {
        // Stop here rather than querying a partially loaded grove
        std::cerr << "Error: " << e.what() << "\n";
        return 1;
    }

    // Query for overlapping features
    auto results = features.intersect(gdt::interval{1000, 2000}, "chr1");

    std::cout << "Found " << results.get_keys().size()
              << " overlapping features\n";

    return 0;
}
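The example converts each BED record from half-open `[start, end)` to closed `[start, end]` by subtracting one from the end. The standalone sketch below shows that conversion and the closed-interval overlap test in isolation. The `half_open` and `closed` structs and both functions are hypothetical helpers written for this illustration and do not come from genogrove. The guard against empty records is an added assumption: with unsigned coordinates, `end - 1` would wrap around when `end` is 0.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical helper types for illustration; not genogrove types.
struct half_open {            // BED-style [start, end)
    std::uint64_t start;
    std::uint64_t end;
};

struct closed {               // inclusive [first, last]
    std::uint64_t first;
    std::uint64_t last;
};

// Converts [start, end) to [start, end - 1].
// Empty or inverted records (start >= end) have no closed form, and
// subtracting 1 from an unsigned end of 0 would wrap, so they are rejected.
std::optional<closed> to_closed(half_open r) {
    if (r.start >= r.end) {
        return std::nullopt;
    }
    return closed{r.start, r.end - 1};
}

// Closed intervals overlap when each begins no later than the other ends.
bool overlaps(closed a, closed b) {
    assert(a.first <= a.last && b.first <= b.last);
    return a.first <= b.last && b.first <= a.last;
}
```

Under these definitions, the adjacent BED records `[1000, 2000)` and `[2000, 3000)` become `[1000, 1999]` and `[2000, 2999]`, which do not overlap. Without the conversion, a closed-interval test would wrongly report that they share position 2000.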

Why Genogrove?

Performance

Optimized B+ tree implementation with amortized O(1) insertion for pre-sorted data and efficient overlap queries.
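The general idea behind constant-time insertion of pre-sorted data can be sketched as follows. This page does not describe genogrove's internals, so this is only an illustration of the technique. When every new key is at least as large as all stored keys, it can be appended to the rightmost leaf without a root-to-leaf search. The `leaf_chain` class is a hypothetical, heavily simplified model that keeps only the leaf level and omits the internal index nodes a real B+ tree maintains.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model of a B+ tree's leaf level; not genogrove code.
class leaf_chain {
public:
    explicit leaf_chain(std::size_t capacity) : capacity_(capacity) {}

    // Appends a key that must be >= every key stored so far.
    // Only the rightmost leaf is touched, so the cost is amortized O(1).
    void append_sorted(int key) {
        assert(leaves_.empty() || key >= leaves_.back().back());
        if (leaves_.empty() || leaves_.back().size() >= capacity_) {
            // Rightmost leaf is full (or none exists yet): open a new one.
            leaves_.emplace_back();
            leaves_.back().reserve(capacity_);
        }
        leaves_.back().push_back(key);
        ++size_;
    }

    std::size_t leaf_count() const { return leaves_.size(); }
    std::size_t size() const { return size_; }

private:
    std::size_t capacity_;                  // maximum keys per leaf
    std::size_t size_ = 0;                  // total keys stored
    std::vector<std::vector<int>> leaves_;  // leaves in key order
};
```

Appending ten ascending keys to a `leaf_chain` with capacity 4 fills two leaves and leaves two keys in a third. Unsorted input would violate the precondition and would need the usual search from the root instead.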

Flexibility

Use built-in genomic types or define custom key types for specialized applications.

Graph Integration

Represent complex relationships (transcripts, regulatory networks) alongside spatial queries.
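One way to picture a graph overlay is as directed edges stored alongside the features and keyed by stable feature identifiers. The sketch below makes that concrete under stated assumptions: `feature`, `feature_graph`, and the `add`, `link`, and `neighbors` member functions are hypothetical names chosen for this illustration, not genogrove's interface.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical types for illustration; not genogrove's API.
struct feature {
    std::string name;
    std::size_t start;
    std::size_t end;
};

class feature_graph {
public:
    // Stores a feature and returns its stable identifier (its index).
    std::size_t add(feature f) {
        features_.push_back(std::move(f));
        return features_.size() - 1;
    }

    // Adds a directed edge between two stored features.
    void link(std::size_t from, std::size_t to) {
        assert(from < features_.size() && to < features_.size());
        edges_[from].push_back(to);
    }

    // Returns the identifiers reachable in one step (empty if none).
    const std::vector<std::size_t>& neighbors(std::size_t id) const {
        static const std::vector<std::size_t> none;
        const auto it = edges_.find(id);
        return it == edges_.end() ? none : it->second;
    }

    const feature& at(std::size_t id) const { return features_.at(id); }

private:
    std::vector<feature> features_;
    std::unordered_map<std::size_t, std::vector<std::size_t>> edges_;
};
```

For instance, the exons of a transcript can be added in order and linked exon 1 to exon 2 to exon 3. A spatial query would then locate one exon by position, and `neighbors` would follow the transcript structure from there.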

Modern Design

C++20 concepts, type safety, and zero-cost abstractions.

Requirements

  • Compiler: C++20 compatible (GCC 13+, Clang 16+, Apple Clang 15+)

  • Build System: CMake 3.14 or higher

  • Dependencies: htslib (for compressed file support)

Getting Started

Ready to use genogrove? Check out the User Guide for:

  • Installation instructions

  • Detailed tutorials on I/O operations

  • Working with data types and the grove

  • Complete examples and best practices

Documentation

User Guide

Comprehensive tutorials and examples

API Reference

Complete API reference

GitHub Repository

genogrove on GitHub

Community

License

genogrove is distributed under the GNU General Public License v3.0 or later.