Performance Optimization#
Tips for Optimal Performance#
Choose Appropriate Tree Order#
The tree order parameter significantly affects performance:
Small datasets (< 10K intervals): order = 50-100
Medium datasets (10K-1M intervals): order = 100-500
Large datasets (> 1M intervals): order = 500-1000
Higher order values provide:
Better cache locality (fewer cache misses)
Reduced tree height (fewer levels to traverse)
More keys per node (better for sequential access)
// For a large genomic dataset
gst::grove<gdt::interval, std::string> my_grove(500);
Use Sorted Insertion for Pre-sorted Data#
When data is already sorted (common for BED/GFF files), use sorted insertion:
// For sorted data (e.g., BED files)
grove.insert_data(index, interval, data, gst::sorted); // Much faster!
// For unsorted data — no tag, full tree traversal (O(log n))
grove.insert_data(index, interval, data);
Sorted insertion provides O(1) amortized insertion time vs O(log n) for unsorted.
Use Bulk Insertion for Large Datasets#
For loading large datasets (>10K intervals), bulk insertion provides dramatic performance improvements:
std::vector<std::pair<gdt::interval, std::string>> data = load_data();
// Bulk insertion - ~10-20x faster than incremental
// Data must be sorted before calling this function
grove.insert_data("chr1", data, gst::sorted, gst::bulk);
Runtime Characteristics:
Empty index: O(n) bottom-up tree construction
Builds fully-packed leaf nodes (order-1 keys per node)
Then constructs internal layers
~10-20x faster than incremental insertion
Better space utilization: 75-90% full nodes vs ~50% from splits
Existing data: Appends to rightmost node
CRITICAL: All new keys must be strictly greater than existing keys
Throws
std::runtime_errorif violated
Performance Comparison (1M intervals, empty index):
Method |
Time |
Node Utilization |
|---|---|---|
Incremental (unsorted) |
10-20 seconds |
~50% (from splits) |
Sorted insertion |
2-5 seconds |
~50% (from splits) |
Bulk insertion |
0.5-1 second |
75-90% (packed) |
When to Use Bulk Insertion:
Initial loading of large genomic files (BED, GFF, GTF)
Datasets with >10K intervals per chromosome
When maximum performance is required
Building indices for production use
Important: Data must be sorted before using bulk insertion
Organize by Chromosome#
Always use chromosome/contig names as the index parameter:
// Good - organized by chromosome
grove.insert_data("chr1", interval, data);
grove.insert_data("chr2", interval, data);
// Bad - all data in one tree
grove.insert_data("all", interval, data);
Benefits:
Efficient chromosome-specific queries
Reduced tree traversal for most genomic queries
Better memory locality for same-chromosome data
Compression Support#
Choosing the Right Compression Format#
Different compression formats have different trade-offs:
BGZF/GZIP (.gz)
Standard for genomic files
Supports random access (BGZF)
Good compression ratio
Use for: BED, GFF, VCF files
ZSTD (.zst)
Best compression ratio
Fast decompression
Use for: Long-term storage, archives
LZ4 (.lz4)
Fastest decompression
Lower compression ratio
Use for: Temporary files, fast processing
Example:
// All these are automatically handled
gio::bed_reader reader1("data.bed.gz"); // BGZF
gio::bed_reader reader2("data.bed.zst"); // ZSTD
gio::bed_reader reader3("data.bed.lz4"); // LZ4
gio::bed_reader reader4("data.bed"); // Uncompressed
Memory Management#
Efficient Memory Usage#
The grove uses a deque for key storage, providing:
Stable pointers (required for graph overlay)
Better cache locality than individual allocations
Automatic memory management
// Keys are automatically managed
auto* key = grove.insert_data("chr1", interval, data);
// Pointer remains valid until grove is destroyed
Graph Overlay Considerations#
When using graph overlays, consider:
Edges are stored separately from the tree structure
Each edge stores a pointer to target key
Memory usage: O(E) where E is number of edges
// Clear graph if no longer needed
grove.clear_graph(); // Frees edge memory, keeps keys
Serialization#
Save and load grove structures for faster startup:
#include <fstream>
// Save grove to disk
{
std::ofstream out("grove_data.bin", std::ios::binary);
my_grove.serialize(out);
}
// Load grove from disk
{
std::ifstream in("grove_data.bin", std::ios::binary);
auto loaded_grove = gst::grove<gdt::interval, std::string>::deserialize(in);
}
Note: Serialization saves:
Tree structure
All keys and data
Order parameter
External (graph-only) keys
Graph overlay edges (with metadata when
edge_data_typeis non-void)
Not saved:
Rightmost node cache (recalculated on load)
Query Optimization#
Chromosome-Specific Queries#
Always specify the chromosome when possible:
// Fast - searches only chr1 tree
auto results = grove.intersect(query, "chr1");
// Slower - searches all chromosomes
auto all_results = grove.intersect(query);
Result Processing#
Process query results efficiently:
auto results = grove.intersect(query, "chr1");
// Get keys once
const auto& keys = results.get_keys();
// Process without repeated calls
for (auto* key : keys) {
process(key);
}
Benchmarking#
Example benchmark for insertion:
#include <chrono>
#include <iostream>
void benchmark_insertion() {
gst::grove<gdt::interval, std::string> grove(100);
const int N = 1000000;
auto start = std::chrono::high_resolution_clock::now();
for (int i = 0; i < N; ++i) {
grove.insert_data(
"chr1",
gdt::interval{i * 100, i * 100 + 50},
"feature" + std::to_string(i),
gst::sorted // Data is sorted
);
}
auto end = std::chrono::high_resolution_clock::now();
auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(
end - start
);
std::cout << "Inserted " << N << " intervals in "
<< duration.count() << " ms\n";
std::cout << "Rate: " << (N / (duration.count() / 1000.0))
<< " insertions/second\n";
}
Best Practices Summary#
Choose appropriate tree order based on dataset size
Use bulk insertion for large datasets (>10K intervals)
Use sorted insertion for pre-sorted incremental data
Organize by chromosome for efficient queries
Use BGZF compression for genomic files
Specify chromosome in queries when possible
Clear graph when edges are no longer needed
Serialize large groves for faster loading
Benchmark your specific use case