Performance Optimization#

Tips for Optimal Performance#

Choose Appropriate Tree Order#

The tree order parameter significantly affects performance:

  • Small datasets (< 10K intervals): order = 50-100

  • Medium datasets (10K-1M intervals): order = 100-500

  • Large datasets (> 1M intervals): order = 500-1000

Higher order values provide:

  • Better cache locality (fewer cache misses)

  • Reduced tree height (fewer levels to traverse)

  • More keys per node (better for sequential access)

// For a large genomic dataset
gst::grove<gdt::interval, std::string> my_grove(500);

Use Sorted Insertion for Pre-sorted Data#

When data is already sorted (common for BED/GFF files), use sorted insertion:

// For sorted data (e.g., BED files)
grove.insert_data(index, interval, data, gst::sorted);  // Much faster!

// For unsorted data — no tag, full tree traversal (O(log n))
grove.insert_data(index, interval, data);

Sorted insertion provides O(1) amortized insertion time vs O(log n) for unsorted.

Use Bulk Insertion for Large Datasets#

For loading large datasets (>10K intervals), bulk insertion provides dramatic performance improvements:

std::vector<std::pair<gdt::interval, std::string>> data = load_data();

// Bulk insertion - ~10-20x faster than incremental
// Data must be sorted before calling this function
grove.insert_data("chr1", data, gst::sorted, gst::bulk);

Runtime Characteristics:

  • Empty index: O(n) bottom-up tree construction

    • Builds fully-packed leaf nodes (order-1 keys per node)

    • Then constructs internal layers

    • ~10-20x faster than incremental insertion

    • Better space utilization: 75-90% full nodes vs ~50% from splits

  • Existing data: Appends to rightmost node

    • CRITICAL: All new keys must be strictly greater than existing keys

    • Throws std::runtime_error if violated

Performance Comparison (1M intervals, empty index):

Method

Time

Node Utilization

Incremental (unsorted)

10-20 seconds

~50% (from splits)

Sorted insertion

2-5 seconds

~50% (from splits)

Bulk insertion

0.5-1 second

75-90% (packed)

When to Use Bulk Insertion:

  • Initial loading of large genomic files (BED, GFF, GTF)

  • Datasets with >10K intervals per chromosome

  • When maximum performance is required

  • Building indices for production use

  • Important: Data must be sorted before using bulk insertion

Organize by Chromosome#

Always use chromosome/contig names as the index parameter:

// Good - organized by chromosome
grove.insert_data("chr1", interval, data);
grove.insert_data("chr2", interval, data);

// Bad - all data in one tree
grove.insert_data("all", interval, data);

Benefits:

  • Efficient chromosome-specific queries

  • Reduced tree traversal for most genomic queries

  • Better memory locality for same-chromosome data

Compression Support#

Choosing the Right Compression Format#

Different compression formats have different trade-offs:

BGZF/GZIP (.gz)

  • Standard for genomic files

  • Supports random access (BGZF)

  • Good compression ratio

  • Use for: BED, GFF, VCF files

ZSTD (.zst)

  • Best compression ratio

  • Fast decompression

  • Use for: Long-term storage, archives

LZ4 (.lz4)

  • Fastest decompression

  • Lower compression ratio

  • Use for: Temporary files, fast processing

Example:

// All these are automatically handled
gio::bed_reader reader1("data.bed.gz");    // BGZF
gio::bed_reader reader2("data.bed.zst");   // ZSTD
gio::bed_reader reader3("data.bed.lz4");   // LZ4
gio::bed_reader reader4("data.bed");       // Uncompressed

Memory Management#

Efficient Memory Usage#

The grove uses a deque for key storage, providing:

  • Stable pointers (required for graph overlay)

  • Better cache locality than individual allocations

  • Automatic memory management

// Keys are automatically managed
auto* key = grove.insert_data("chr1", interval, data);
// Pointer remains valid until grove is destroyed

Graph Overlay Considerations#

When using graph overlays, consider:

  • Edges are stored separately from the tree structure

  • Each edge stores a pointer to target key

  • Memory usage: O(E) where E is number of edges

// Clear graph if no longer needed
grove.clear_graph();  // Frees edge memory, keeps keys

Serialization#

Save and load grove structures for faster startup:

#include <fstream>

// Save grove to disk
{
    std::ofstream out("grove_data.bin", std::ios::binary);
    my_grove.serialize(out);
}

// Load grove from disk
{
    std::ifstream in("grove_data.bin", std::ios::binary);
    auto loaded_grove = gst::grove<gdt::interval, std::string>::deserialize(in);
}

Note: Serialization saves:

  • Tree structure

  • All keys and data

  • Order parameter

  • External (graph-only) keys

  • Graph overlay edges (with metadata when edge_data_type is non-void)

Not saved:

  • Rightmost node cache (recalculated on load)

Query Optimization#

Chromosome-Specific Queries#

Always specify the chromosome when possible:

// Fast - searches only chr1 tree
auto results = grove.intersect(query, "chr1");

// Slower - searches all chromosomes
auto all_results = grove.intersect(query);

Result Processing#

Process query results efficiently:

auto results = grove.intersect(query, "chr1");

// Get keys once
const auto& keys = results.get_keys();

// Process without repeated calls
for (auto* key : keys) {
    process(key);
}

Benchmarking#

Example benchmark for insertion:

#include <chrono>
#include <iostream>

void benchmark_insertion() {
    gst::grove<gdt::interval, std::string> grove(100);

    const int N = 1000000;
    auto start = std::chrono::high_resolution_clock::now();

    for (int i = 0; i < N; ++i) {
        grove.insert_data(
            "chr1",
            gdt::interval{i * 100, i * 100 + 50},
            "feature" + std::to_string(i),
            gst::sorted  // Data is sorted
        );
    }

    auto end = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(
        end - start
    );

    std::cout << "Inserted " << N << " intervals in "
              << duration.count() << " ms\n";
    std::cout << "Rate: " << (N / (duration.count() / 1000.0))
              << " insertions/second\n";
}

Best Practices Summary#

  1. Choose appropriate tree order based on dataset size

  2. Use bulk insertion for large datasets (>10K intervals)

  3. Use sorted insertion for pre-sorted incremental data

  4. Organize by chromosome for efficient queries

  5. Use BGZF compression for genomic files

  6. Specify chromosome in queries when possible

  7. Clear graph when edges are no longer needed

  8. Serialize large groves for faster loading

  9. Benchmark your specific use case