genogrove
A high-performance modern C++ library for genomic data structures and interval queries.
Overview
Genogrove provides a specialized B+ tree data structure (the grove) optimized for storing and querying genomic intervals. It combines efficient interval overlap detection with an embedded graph overlay for representing relationships between genomic features.
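To make the overlap idea concrete, here is a minimal, self-contained sketch of overlap detection on closed intervals, stored in one sorted container per chromosome. The names `closed_interval`, `overlaps`, and `toy_index` are hypothetical and are not part of genogrove's API. A sorted vector stands in for the B+ tree, so this shows the overlap test and the per-chromosome layout, not the library's actual data structure.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for a closed genomic interval [start, end].
struct closed_interval {
    std::size_t start;
    std::size_t end;
};

// Two closed intervals overlap when each one starts no later than the other ends.
constexpr bool overlaps(const closed_interval& a, const closed_interval& b) {
    return a.start <= b.end && b.start <= a.end;
}

// Toy index (assumption, not genogrove's design): one start-sorted vector per
// chromosome. A B+ tree would prune subtrees instead of scanning linearly.
class toy_index {
public:
    void insert(const std::string& chrom, closed_interval iv) {
        auto& ivs = by_chrom_[chrom];
        auto pos = std::upper_bound(
            ivs.begin(), ivs.end(), iv,
            [](const closed_interval& x, const closed_interval& y) {
                return x.start < y.start;
            });
        ivs.insert(pos, iv);  // keep the vector sorted by start
    }

    std::vector<closed_interval> intersect(const std::string& chrom,
                                           closed_interval query) const {
        std::vector<closed_interval> hits;
        auto it = by_chrom_.find(chrom);
        if (it == by_chrom_.end()) return hits;  // unknown chromosome
        for (const auto& iv : it->second) {
            // Sorted by start: once an interval starts past the query's end,
            // no later interval can overlap either.
            if (iv.start > query.end) break;
            if (overlaps(iv, query)) hits.push_back(iv);
        }
        return hits;
    }

private:
    std::map<std::string, std::vector<closed_interval>> by_chrom_;
};
```

With this sketch, inserting `[100, 200]` and `[150, 300]` on `chr1` and querying `[180, 250]` returns both intervals, while the same query on a chromosome with no data returns nothing.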
Key Features:
Flexible Key Types: Works with any type satisfying the key_type_base concept (built-in: interval, genomic_coordinate, numeric, kmer)
Multi-Index Organization: Separate trees per chromosome for efficient queries
Sorted Insertion: O(1) amortized insertion for pre-sorted genomic data
Graph Overlay: Link keys within the grove to represent feature relationships
File I/O: Automatic format detection and compression support (BED, GFF/GTF, BAM/SAM, FASTA/FASTQ)
Modern C++20: Type-safe, concept-based design
Quick Example
Here’s a complete example showing file reading, storage, and querying:
    #include <genogrove/io/bed_reader.hpp>
    #include <genogrove/structure/grove/grove.hpp>
    #include <genogrove/data_type/interval.hpp>
    #include <iostream>
    #include <stdexcept>
    #include <string>

    namespace gio = genogrove::io;
    namespace gdt = genogrove::data_type;
    namespace gst = genogrove::structure;

    int main() {
        // Create grove to store genomic features
        gst::grove<gdt::interval, std::string> features(100);

        // Read BED file (handles .bed.gz automatically)
        gio::bed_reader reader("genes.bed.gz");

        try {
            for (const auto& entry : reader) {
                // Insert sorted by chromosome
                // Convert half-open [start, end) to closed [start, end]
                features.insert_data(
                    entry.chrom,
                    gdt::interval(entry.start, entry.end - 1),
                    entry.name,
                    gst::sorted  // Optimized for pre-sorted data
                );
            }
        } catch (const std::runtime_error& e) {
            std::cerr << "Error: " << e.what() << "\n";
        }

        // Query for overlapping features
        auto results = features.intersect(gdt::interval{1000, 2000}, "chr1");
        std::cout << "Found " << results.get_keys().size()
                  << " overlapping features\n";

        return 0;
    }
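The example converts BED's 0-based half-open coordinates to a closed interval with `entry.end - 1`. If the coordinate type is unsigned, that subtraction wraps around for a record whose end is 0, and an empty record (start equal to end) would produce an inverted interval. The helper below is a hedged sketch of a guarded conversion. `closed_interval` and `bed_to_closed` are hypothetical names, not genogrove functions.

```cpp
#include <cstddef>
#include <optional>

// Hypothetical closed interval [start, end], mirroring how the example uses
// gdt::interval.
struct closed_interval {
    std::size_t start;
    std::size_t end;
};

// BED stores 0-based half-open [start, end); the closed form is [start, end - 1].
// An empty or inverted BED record (start >= end) has no closed equivalent,
// so it is reported as std::nullopt instead of underflowing.
constexpr std::optional<closed_interval> bed_to_closed(std::size_t start,
                                                       std::size_t end) {
    if (start >= end) return std::nullopt;
    return closed_interval{start, end - 1};
}
```

For instance, the BED record `chr1 1000 2000` maps to the closed interval `[1000, 1999]`, while a zero-length record such as `chr1 5 5` is skipped rather than inserted.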
Why Genogrove?
- Performance
Optimized B+ tree implementation with amortized O(1) insertion of pre-sorted data and efficient overlap queries.
- Flexibility
Use built-in genomic types or define custom key types for specialized applications.
- Graph Integration
Represent complex relationships (transcripts, regulatory networks) alongside spatial queries.
- Modern Design
C++20 concepts, type safety, and zero-cost abstractions.
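The graph integration point can be pictured with a small standalone sketch: features are stored once, and directed edges between them record relationships such as the order of exons in a transcript. The types `feature` and `feature_graph` and their member functions are hypothetical and only illustrate the idea of an overlay. They do not reflect how genogrove stores or links keys internally.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical feature record; in a grove the key would live in a tree node.
struct feature {
    std::string chrom;
    std::size_t start;
    std::size_t end;  // closed end coordinate
    std::string name;
};

// Toy overlay (assumption): features are stored once and edges refer to them
// by index, so relationship traversal reuses the same records as spatial lookup.
class feature_graph {
public:
    // Store a feature and return its handle.
    std::size_t add(feature f) {
        features_.push_back(std::move(f));
        return features_.size() - 1;
    }

    // Record a directed relationship, e.g. exon -> next exon in a transcript.
    void link(std::size_t from, std::size_t to) { edges_[from].push_back(to); }

    // Names of the features directly linked from `from`.
    std::vector<std::string> neighbors(std::size_t from) const {
        std::vector<std::string> names;
        auto it = edges_.find(from);
        if (it == edges_.end()) return names;  // no outgoing links
        for (std::size_t to : it->second) names.push_back(features_[to].name);
        return names;
    }

private:
    std::vector<feature> features_;
    std::unordered_map<std::size_t, std::vector<std::size_t>> edges_;
};
```

Linking `exon1 -> exon2 -> exon3` in this sketch lets a caller walk a transcript's exon chain after locating any one exon, which is the kind of combined spatial and relational query the overlay is meant to support.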
Requirements
Compiler: C++20 compatible (GCC 13+, Clang 16+, Apple Clang 15+)
Build System: CMake 3.14 or higher
Dependencies: htslib (for compressed file support)
Getting Started
Ready to use genogrove? Check out the User Guide for:
Installation instructions
Detailed tutorials on I/O operations
Working with data types and the grove
Complete examples and best practices
Documentation
- User Guide
Comprehensive tutorials and examples
- API Reference
Complete API reference
- GitHub Repository
Community
Issues: Report bugs and request features on GitHub Issues
Discussions: Ask questions and share ideas on GitHub Discussions
License
genogrove is distributed under the GNU General Public License v3.0 or later.