Performance Optimization#

C++ · genogrove 0.25.3

Tips for Optimal Performance

Choose Appropriate Tree Order

The tree order parameter significantly affects performance:

Small datasets (< 10K intervals): order = 50-100
Medium datasets (10K-1M intervals): order = 100-500
Large datasets (> 1M intervals): order = 500-1000

Higher order values provide:

Better cache locality (fewer cache misses)
Reduced tree height (fewer levels to traverse)
More keys per node (better for sequential access)

// For a large genomic dataset
gst::grove<gdt::interval, std::string> my_grove(500);

Use Sorted Insertion for Pre-sorted Data

When data is already sorted (common for BED/GFF files), use sorted insertion:

// For sorted data (e.g., BED files)
grove.insert_data(index, interval, data, gst::sorted);  // Much faster!

// For unsorted data — no tag, full tree traversal (O(log n))
grove.insert_data(index, interval, data);

Sorted insertion provides O(1) amortized insertion time vs O(log n) for unsorted.

Use Bulk Insertion for Large Datasets

For loading large datasets (>10K intervals), bulk insertion provides dramatic performance improvements:

std::vector<std::pair<gdt::interval, std::string>> data = load_data();

// Bulk insertion - ~10-20x faster than incremental
// Data must be sorted before calling this function
grove.insert_data("chr1", data, gst::sorted, gst::bulk);

Runtime Characteristics:

Empty index: O(n) bottom-up tree construction
- Builds fully-packed leaf nodes (order-1 keys per node)
- Then constructs internal layers
- ~10-20x faster than incremental insertion
- Better space utilization: 75-90% full nodes vs ~50% from splits
Existing data: Appends to rightmost node
- CRITICAL: All new keys must be strictly greater than existing keys
- Throws std::runtime_error if violated

Performance Comparison (1M intervals, empty index):

Method	Time	Node Utilization
Incremental (unsorted)	10-20 seconds	~50% (from splits)
Sorted insertion	2-5 seconds	~50% (from splits)
Bulk insertion	0.5-1 second	75-90% (packed)

When to Use Bulk Insertion:

Initial loading of large genomic files (BED, GFF, GTF)
Datasets with >10K intervals per chromosome
When maximum performance is required
Building indices for production use
Important: Data must be sorted before using bulk insertion

Organize by Chromosome

Always use chromosome/contig names as the index parameter:

// Good - organized by chromosome
grove.insert_data("chr1", interval, data);
grove.insert_data("chr2", interval, data);

// Bad - all data in one tree
grove.insert_data("all", interval, data);

Benefits:

Efficient chromosome-specific queries
Reduced tree traversal for most genomic queries
Better memory locality for same-chromosome data

Compression Support

Choosing the Right Compression Format

Different compression formats have different trade-offs:

BGZF/GZIP (.gz)

Standard for genomic files
Supports random access (BGZF)
Good compression ratio
Use for: BED, GFF, VCF files

ZSTD (.zst)

Best compression ratio
Fast decompression
Use for: Long-term storage, archives

LZ4 (.lz4)

Fastest decompression
Lower compression ratio
Use for: Temporary files, fast processing

Example:

// All these are automatically handled
gio::bed_reader reader1("data.bed.gz");    // BGZF
gio::bed_reader reader2("data.bed.zst");   // ZSTD
gio::bed_reader reader3("data.bed.lz4");   // LZ4
gio::bed_reader reader4("data.bed");       // Uncompressed

Memory Management

Efficient Memory Usage

The grove uses a deque for key storage, providing:

Stable pointers (required for graph overlay)
Better cache locality than individual allocations
Automatic memory management

// Keys are automatically managed
auto* key = grove.insert_data("chr1", interval, data);
// Pointer remains valid until grove is destroyed

Graph Overlay Considerations

When using graph overlays, consider:

Edges are stored separately from the tree structure
Each edge stores a pointer to target key
Memory usage: O(E) where E is number of edges

// Clear graph if no longer needed
grove.clear_graph();  // Frees edge memory, keeps keys

Serialization

Save and load grove structures for faster startup:

#include <fstream>

// Save grove to disk
{
    std::ofstream out("grove_data.bin", std::ios::binary);
    my_grove.serialize(out);
}

// Load grove from disk
{
    std::ifstream in("grove_data.bin", std::ios::binary);
    auto loaded_grove = gst::grove<gdt::interval, std::string>::deserialize(in);
}

Note: Serialization saves:

Tree structure
All keys and data
Order parameter
External (graph-only) keys
Graph overlay edges (with metadata when edge_data_type is non-void)

Not saved:

Rightmost node cache (recalculated on load)

Query Optimization

Chromosome-Specific Queries

Always specify the chromosome when possible:

// Fast - searches only chr1 tree
auto results = grove.intersect(query, "chr1");

// Slower - searches all chromosomes
auto all_results = grove.intersect(query);

Result Processing

Process query results efficiently:

auto results = grove.intersect(query, "chr1");

// Get keys once
const auto& keys = results.get_keys();

// Process without repeated calls
for (auto* key : keys) {
    process(key);
}

Benchmarking

Example benchmark for insertion:

#include <chrono>
#include <iostream>

void benchmark_insertion() {
    gst::grove<gdt::interval, std::string> grove(100);

    const int N = 1000000;
    auto start = std::chrono::high_resolution_clock::now();

    for (int i = 0; i < N; ++i) {
        grove.insert_data(
            "chr1",
            gdt::interval{i * 100, i * 100 + 50},
            "feature" + std::to_string(i),
            gst::sorted  // Data is sorted
        );
    }

    auto end = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(
        end - start
    );

    std::cout << "Inserted " << N << " intervals in "
              << duration.count() << " ms\n";
    std::cout << "Rate: " << (N / (duration.count() / 1000.0))
              << " insertions/second\n";
}

Best Practices Summary

Choose appropriate tree order based on dataset size
Use bulk insertion for large datasets (>10K intervals)
Use sorted insertion for pre-sorted incremental data
Organize by chromosome for efficient queries
Use BGZF compression for genomic files
Specify chromosome in queries when possible
Clear graph when edges are no longer needed
Serialize large groves for faster loading
Benchmark your specific use case

Python · pygenogrove 0.7.0

Bulk and Fast-Path Inserts

For large loads, prefer insert_bulk(index, items, presorted=False): the tree is built bottom-up in a single native pass, which is much faster than repeated insert. For an incremental rightmost-append fast path, use insert_sorted(index, key, data).

import pygenogrove as pg

g = pg.Grove()

# Bulk load: per-index (one chromosome at a time), built bottom-up
items = [(pg.GenomicCoordinate('+', i * 100, i * 100 + 50), f"feature{i}") for i in range(1_000_000)]
g.insert_bulk("chr1", items, presorted=True)   # presorted skips the internal sort

# Rightmost-append fast path for incremental sorted inserts
g.insert_sorted("chr2", pg.GenomicCoordinate('+', 100, 200), "feat")

insert_bulk is per-index — call it once per chromosome / contig.
Pass presorted=True only when items is already sorted; otherwise leave it False and the call sorts for you.
Both insert_bulk and insert_sorted release the GIL around the native build.

Threading & the GIL

pygenogrove releases the Python GIL around its I/O-bound native work, so multiple Python threads can run during that native work.

Calls that RELEASE the GIL (other Python threads run during the native work):

File readers __next__: BedReader / GffReader / BamReader / VcfReader / FastaReader (disk read + htslib/parse).
FastaIndex — construction (builds a missing .fai) and both fetch overloads.
Grove.serialize / Grove.deserialize / Grove.to_sif (file + zlib).
insert_bulk — around the C++ build loop.

Calls that KEEP the GIL (and why):

Single insert / insert_sorted / intersect — short, hot operations.
The predicate-callback methods flanking(..., predicate) / get_neighbors_if / link_if / link_with / remove_edges_if — their C++ re-enters Python, so the GIL is required.

Warning

Thread-safety contract: a genogrove object (reader / grove / index) is not internally synchronized. The safe pattern is one object per thread — e.g. a pool where each worker has its own reader / grove, overlapping only the native work. Sharing a single object across threads is a data race.

Example (parallel I/O with a thread pool):

from concurrent.futures import ThreadPoolExecutor
import pygenogrove as pg

def count(path):                        # each worker its own reader
    return sum(1 for _ in pg.BedReader(path))

with ThreadPoolExecutor() as ex:        # readers' disk I/O now overlaps
    totals = list(ex.map(count, bed_paths))