Reading Genomic Files (I/O)#
The genogrove::io namespace provides efficient readers for common genomic file formats with automatic compression detection.
Reader Ownership#
All streaming file readers (bed_reader, gff_reader, bam_reader, fasta_reader) own raw
htslib resource pointers and are non-copyable but movable. Attempting to copy a reader will
produce a compile error. fasta_index (indexed random-access FASTA) follows the same rule.
namespace gio = genogrove::io;
gio::bed_reader reader("data.bed");
// gio::bed_reader copy = reader; // compile error — not copyable
gio::bed_reader moved = std::move(reader); // OK — transfers ownership
When passing readers to functions, use a reference or move them:
namespace gio = genogrove::io;
void process(gio::bed_reader& reader) { // pass by reference
for (const auto& entry : reader) { /* ... */ }
}
Automatic File Type Detection#
Genogrove can automatically detect file types and compression formats:
#include <genogrove/io/filetype_detector.hpp>
namespace gio = genogrove::io;
int main() {
gio::filetype_detector detector;
auto [filetype, compression] = detector.detect_filetype("data.bed.gz");
// filetype will be gio::filetype::BED
// compression will be gio::compression_type::GZIP
return 0;
}
Supported File Types:
BED (Browser Extensible Data)
BEDGRAPH
GFF (General Feature Format)
GTF (Gene Transfer Format)
BAM/SAM/CRAM (Sequence Alignments)
FASTA (
.fa,.fasta,.fna)FASTQ (
.fq,.fastq,.fnq)VCF (Variant Call Format)
GG (Genogrove native format)
Supported Compression Formats:
GZIP (.gz, including BGZF)
BZIP2 (.bz2)
XZ (.xz, LZMA)
ZSTD (.zst)
LZ4 (.lz4)
For bed_reader and gff_reader, both BGZF (block-gzip, bgzip) and plain gzip
(gzip) compressed inputs are accepted with the .gz extension — internally they
share htslib’s bgzf_open(), which reads both formats transparently. BGZF files
support random-access seeks; plain gzip files are read sequentially. This is useful
when consuming files from sources that distribute plain-gzip compression (e.g.,
ENCODE GTFs, GENCODE annotations) — there is no need to re-compress with bgzip.
Iterator Equality Contract#
The iterator returned by reader.begin() is a single-pass input iterator. Equality is
position-aware:
End iterators (
reader.end()) compare equal to each other.A non-end iterator compares equal to an end iterator only if it has hit EOF or an error.
Two non-end iterators are equal iff they share the same parent reader AND have advanced the same number of times (an internal monotonic position counter is bumped on each successful advance).
The common range-for pattern (for (const auto& entry : reader)) and the it != reader.end()
loop are unaffected by this rule. The rule matters only when you copy an iterator and advance
one copy — that pattern violates the single-pass nature of the underlying reader anyway (the
older copy keeps its cached current_entry_ but the reader state has moved forward), so it is
not a recommended idiom; the equality contract just makes the resulting iterators distinguishable
rather than silently equal.
Error Handling#
All file readers throw std::runtime_error on parse and I/O errors by default. The read_next()
method returns false only at end-of-file. Wrap iteration in a try-catch to handle errors:
namespace gio = genogrove::io;
gio::bed_reader reader("data.bed");
try {
for (const auto& entry : reader) {
// process entries...
}
} catch (const std::runtime_error& e) {
std::cerr << "Error: " << e.what() << "\n";
// error messages include the line number, e.g.
// "Invalid coordinate format at line 42"
}
This pattern applies to all readers (bed_reader, gff_reader, bam_reader).
Lenient Mode#
To skip malformed records instead of throwing, enable lenient mode via the reader’s options struct.
Use get_error_message() to check for errors on individual records:
namespace gio = genogrove::io;
// BED: skip_invalid_lines
gio::bed_reader reader("data.bed", gio::bed_reader_options{.skip_invalid_lines = true});
// GFF: skip_invalid_lines
// gio::gff_reader reader("data.gff", gio::gff_reader_options{.skip_invalid_lines = true});
// BAM: bam_reader throws only on I/O errors — no lenient mode option
for (const auto& entry : reader) {
// process entries — malformed lines are silently skipped
}
Error Message Lifecycle#
get_error_message() reflects the most recently attempted record only — it is cleared at the
start of each iteration. It does not accumulate errors across records.
During iteration: contains the error from the last skipped record (if any), or is empty if the last record was valid.
After iteration completes: empty, because the final read was EOF (no error).
To log every skipped record, check get_error_message() inside the loop:
for (const auto& entry : reader) {
if (!reader.get_error_message().empty()) {
std::cerr << "Skipped: " << reader.get_error_message() << "\n";
}
// process valid entry...
}
Zero-Record Inputs#
For bed_reader and gff_reader, structurally valid inputs that contain no records — empty files,
files where every line is blank, and files that contain only #-prefixed comments or header lines —
are not an error. The constructor returns successfully and the iterator immediately compares
equal to end(), so the body of a for (...) loop simply runs zero times.
This means callers decide the policy for “no data” rather than the reader. Detect zero records either by checking inside the loop or by comparing iterators directly:
gio::gff_reader reader(path);
bool any = false;
for (const auto& entry : reader) {
any = true;
// process entry...
}
if (!any) {
// no records — consumer-defined policy: warn, skip, or fail
}
The iterator is single-pass: each call to begin() on a reader that still has data
consumes the next record from the underlying stream (the iterator’s constructor calls
read_next() eagerly so that *it is immediately valid). Call begin() at most once
per reader. When detecting empty inputs without iterating, do the comparison and stop:
gio::gff_reader reader(path);
if (reader.begin() == reader.end()) {
// no records — safe: begin() consumed nothing because the file was empty
}
// do not call begin() again on the same reader if you already called it above —
// for a non-empty file, that second call would skip the first record
The following error conditions still throw std::runtime_error (or skip the line, in lenient mode):
File-open failures (missing file, permission denied, unreadable BGZF header)
Malformed first record that fails to parse
Per-line parse errors discovered mid-iteration
In other words, “valid file with zero records” is now a quiet success; only structurally broken inputs raise errors.
Coordinate Semantics#
Readers preserve format-native coordinate conventions. The grove uses closed [start, end] coordinates (both endpoints inclusive). When inserting reader entries, convert as needed:
Format |
Convention |
Conversion to grove interval |
|---|---|---|
BED / BAM |
0-based half-open |
|
GFF / GTF |
1-based inclusive |
|
// BED / BAM — subtract 1 from end
for (const auto& entry : bed_reader) {
grove.insert_data(entry.chrom,
gdt::interval(entry.start, entry.end - 1),
data);
}
// GFF / GTF — use start and end directly (both already inclusive)
for (const auto& entry : gff_reader) {
grove.insert_data(entry.seqid,
gdt::interval(entry.start, entry.end),
data);
}
BED Files#
BED files store genomic intervals with optional metadata. The bed_reader provides iterator-based access:
#include <genogrove/io/bed_reader.hpp>
#include <iostream>
namespace gio = genogrove::io;
int main() {
// Automatically handles compressed files (.bed.gz)
gio::bed_reader reader("example.bed");
try {
for (const auto& entry : reader) {
std::cout << "Chromosome: " << entry.chrom << "\n"
<< "Start: " << entry.start << "\n"
<< "End: " << entry.end << "\n";
// Optional fields (if present in file)
if (entry.name) {
std::cout << "Name: " << *entry.name << "\n";
}
if (entry.strand) {
std::cout << "Strand: " << *entry.strand << "\n";
}
}
} catch (const std::runtime_error& e) {
std::cerr << "Parse error: " << e.what() << "\n";
}
return 0;
}
Mixed BED Formats#
The bed_reader supports files that mix BED3, BED6, and BED12 records. Optional fields are reset
on each record, so a BED6 line following a BED12 line will not carry over stale block or RGB data:
gio::bed_reader reader("mixed.bed");
for (const auto& entry : reader) {
// Always present (BED3)
std::cout << entry.chrom << "\t"
<< entry.start << "\t"
<< entry.end;
// Only set on BED6+ lines
if (entry.name) std::cout << "\t" << *entry.name;
if (entry.score) std::cout << "\t" << *entry.score;
if (entry.strand) std::cout << "\t" << *entry.strand;
// Only set on BED12 lines
if (entry.blocks) std::cout << "\t(has blocks)";
std::cout << "\n";
}
BED Entry Fields#
chrom(std::string): Chromosome namestart(size_t): Start position (0-based, half-open)end(size_t): End position (0-based, half-open)name(std::optional<std::string>): Feature namescore(std::optional<int>): Score valuestrand(std::optional<char>): Strand (+/-)thickness(std::optional<thick_info>): Display thickness coordinatesitem_rgb(std::optional<rgb_color>): RGB color valueblocks(std::optional<block_info>): Block information
GFF/GTF Files#
GFF3 and GTF files contain gene annotations. The gff_reader auto-detects the format variant
by inspecting the attribute column (GFF3 uses key=value, GTF uses key "value").
Coordinates are stored in native GFF format: 1-based inclusive [start, end].
#include <genogrove/io/gff_reader.hpp>
#include <iostream>
namespace gio = genogrove::io;
int main() {
gio::gff_reader reader("annotations.gff");
try {
for (const auto& entry : reader) {
std::cout << "Sequence: " << entry.seqid << "\n"
<< "Type: " << entry.type << "\n"
<< "Start: " << entry.start << "\n"
<< "End: " << entry.end << "\n";
// Access attributes (column 9)
if (auto gene_id = entry.get_gene_id()) {
std::cout << "Gene ID: " << *gene_id << "\n";
}
// Check format
if (entry.is_gtf()) {
std::cout << "Format: GTF\n";
} else if (entry.is_gff3()) {
std::cout << "Format: GFF3\n";
}
}
} catch (const std::runtime_error& e) {
std::cerr << "Parse error: " << e.what() << "\n";
}
return 0;
}
GFF Entry Fields#
seqid(std::string): Chromosome/contig namesource(std::string): Source of the featuretype(std::string): Feature type (gene, exon, CDS, etc.)start(size_t): Start position (1-based inclusive, native GFF format)end(size_t): End position (1-based inclusive, native GFF format)score(std::optional<double>): Score value (std::nulloptwhen.in file)strand(std::optional<char>): Strand (+, -, ., or ?)phase(std::optional<int>): Phase for CDS features (0, 1, or 2)attributes(std::map<std::string, std::string, std::less<>>): Key-value pairs from column 9 (transparent comparator enablesstring_viewlookups)format(gff_format): Detected format —gff_format::GFF3,gff_format::GTF, orgff_format::UNKNOWN
Attribute Access#
Helper methods return std::optional<std::string> (or std::optional<int> for get_exon_number()).
Some helpers try multiple attribute keys to work across GFF3 and GTF conventions:
get_gene_id()— returnsgene_idget_transcript_id()— returnstranscript_idget_exon_number()— parsesexon_numberasintget_gene_name()— triesgene_name, then falls back to GFF3’sNameget_gene_biotype()— triesgene_biotype,gene_type, thenbiotypeget_attribute(key)— generic getter for any attribute key (acceptsstd::string_view)
You can also access the attributes map directly:
// Direct map access
auto it = entry.attributes.find("ID");
if (it != entry.attributes.end()) {
std::cout << "ID: " << it->second << "\n";
}
GTF Quoted Semicolons#
Semicolons inside double-quoted GTF attribute values (e.g., gene_name "test;name") are correctly
preserved. The parser recognizes that these are part of the value rather than field delimiters.
GFF3 files are unaffected — GFF3 uses URL-encoding (%3B) for literal semicolons per spec.
GTF Validation#
GTF attribute validation is opt-in via the validate_gtf option (default: false). When
enabled on GTF-format files, the reader enforces mandatory GTF2 attributes:
gene_idis required on all featurestranscript_idis required on exon, CDS, start_codon, stop_codon, UTR, 5UTR, and 3UTR features
If validation fails, read_next() throws std::runtime_error (or skips the line when
skip_invalid_lines is enabled). GFF3 files are unaffected regardless of this setting.
// Enable GTF validation
gio::gff_reader reader("annotations.gtf",
gio::gff_reader_options{.validate_gtf = true});
// Combine with lenient mode to skip invalid records instead of throwing
gio::gff_reader lenient_reader("annotations.gtf",
gio::gff_reader_options{.skip_invalid_lines = true, .validate_gtf = true});
Convenience Methods#
is_gtf()— returnstrueif format is GTFis_gff3()— returnstrueif format is GFF3
BAM/SAM Files#
BAM, SAM, and CRAM files store sequence alignments. The bam_reader auto-detects the format and handles
decompression via htslib. SAM uses 1-based positions (POS); these are converted to 0-based half-open
[start, end) values using the CIGAR string to compute the aligned reference length.
Note
read_next() error contract. When read_next() returns true the record is fully populated
and get_error_message() reads empty. bam_reader::read_next() throws std::runtime_error on
both I/O errors and on truncated auxiliary data ("Truncated auxiliary data at record N") —
truncated aux is no longer a silent post-hoc warning in get_error_message(). Always wrap BAM
iteration in try/catch rather than relying on a post-loop get_error_message() check to
detect aux truncation.
#include <genogrove/io/bam_reader.hpp>
#include <iostream>
namespace gio = genogrove::io;
int main() {
gio::bam_reader reader("alignments.bam");
try {
for (const auto& entry : reader) {
std::cout << "Read: " << entry.qname << "\n"
<< "Chrom: " << entry.chrom << "\n"
<< "Start: " << entry.start << "\n"
<< "End: " << entry.end << "\n"
<< "Strand: " << entry.get_strand() << "\n"
<< "MAPQ: " << static_cast<int>(entry.mapq) << "\n"
<< "CIGAR: " << entry.cigar_string_repr() << "\n";
if (entry.mate) {
std::cout << "Mate chrom: " << entry.mate->chrom << "\n"
<< "Mate pos: " << entry.mate->position << "\n"
<< "Insert size: " << entry.mate->insert_size << "\n";
}
}
} catch (const std::runtime_error& e) {
std::cerr << "Read error: " << e.what() << "\n";
}
return 0;
}
Filtering Options#
Use factory methods on bam_reader_options to apply common filters, or build a custom options struct:
namespace gio = genogrove::io;
// Factory presets
gio::bam_reader reader1("reads.bam", gio::bam_reader_options::defaults()); // skip unmapped (default)
gio::bam_reader reader2("reads.bam", gio::bam_reader_options::include_all()); // no filtering
gio::bam_reader reader3("reads.bam", gio::bam_reader_options::primary_only()); // primary alignments only
gio::bam_reader reader4("reads.bam", gio::bam_reader_options::high_quality(20)); // MAPQ >= 20
// Custom options
gio::bam_reader_options opts;
opts.skip_unmapped = true;
opts.skip_secondary = true;
opts.skip_duplicates = true;
opts.min_mapq = 30;
gio::bam_reader reader5("reads.bam", opts);
Filter fields:
skip_unmapped(bool, defaulttrue): Skip unmapped readsskip_secondary(bool, defaultfalse): Skip secondary alignmentsskip_supplementary(bool, defaultfalse): Skip supplementary alignmentsskip_qc_fail(bool, defaultfalse): Skip QC-failed readsskip_duplicates(bool, defaultfalse): Skip duplicate readsmin_mapq(uint8_t, default0): Minimum mapping quality
SAM Entry Fields#
qname(std::string): Read namechrom(std::string): Reference sequence namestart(size_t): Start position (0-based, half-open, computed from POS). For unmapped reads and for records whose CIGAR consumes zero reference bases (pure soft-clip100S, hard-clip-only secondary alignments100H+FLAG=256),start == POSandend == start— gate onconsumes_reference()before treating the record as a real interval.end(size_t): End position (0-based, half-open, computed from POS + CIGAR-consumed reference length). Equalsstartfor unmapped / zero-ref-consuming records (see above).flags(alignment_flags): Bitwise flags with convenience methods (is_paired(),is_reverse(),is_duplicate(), etc.)mapq(uint8_t): Mapping qualitycigar(cigar_string): CIGAR operations as astd::vector<cigar_element>sequence(std::string): Read sequencequality(std::string): ASCII quality scoresmate(std::optional<mate_info>): Mate information (chrom,position,insert_size)tags(sam_tags): Auxiliary tags (std::unordered_map<std::string, sam_tag_value>)
CIGAR Operations#
Each cigar_element has an op (operation code) and a length. Reference-consuming operations
(M, D, N, =, X) determine the aligned interval length.
for (const auto& elem : entry.cigar) {
std::cout << elem.length << elem.to_char(); // e.g. "50M2I30M"
if (elem.consumes_reference()) { /* M, D, N, =, X */ }
if (elem.consumes_query()) { /* M, I, S, =, X */ }
}
Tag Access#
Auxiliary tags are stored in an std::unordered_map<std::string, sam_tag_value>. The value is a
std::variant supporting char, int64_t, float, std::string, and typed vectors.
auto it = entry.tags.find("NH");
if (it != entry.tags.end()) {
int64_t num_hits = std::get<int64_t>(it->second);
std::cout << "Number of hits: " << num_hits << "\n";
}
Header Access#
The bam_reader provides methods to inspect the SAM header and reference sequences before or after iteration:
get_header()— returns the full SAM header text (all@HD,@SQ,@RG,@PGlines)get_reference_names()— returns astd::vector<std::string>of reference sequence names from the header
namespace gio = genogrove::io;
gio::bam_reader reader("alignments.bam");
// Inspect reference sequences
const auto& refs = reader.get_reference_names();
std::cout << "References (" << refs.size() << "):\n";
for (const auto& name : refs) {
std::cout << " " << name << "\n";
}
// Access the raw SAM header (e.g., to find read groups)
const std::string& header = reader.get_header();
std::cout << "Header:\n" << header << "\n";
Convenience Methods#
get_strand()— returns'+','-', or'.'is_primary()— not secondary and not supplementaryis_mapped()— not unmappedconsumes_reference()—trueiff the record covers any reference bases (start < end). Returnsfalsefor unmapped reads and for mapped records whose CIGAR consumes zero reference bases (pure soft-clip, hard-clip-only secondary alignments). Use this as the gate before converting to a closedgdt::interval(start, end - 1)or inserting into a grove — the closed-interval conversion underflows otherwise.cigar_string_repr()— CIGAR as a human-readable string (e.g."50M2I30M")
// Recommended insertion pattern: gate on consumes_reference()
for (const auto& entry : reader) {
if (!entry.consumes_reference()) continue; // skip soft-clip / hard-clip-only
grove.insert_data(entry.chrom,
gdt::interval(entry.start, entry.end - 1),
entry);
}
FASTA / FASTQ Files#
Genogrove provides two complementary APIs for FASTA/FASTQ data:
fasta_reader— streaming iteration over every record in a FASTA or FASTQ file (including gzip-compressed variants). Follows the samefile_reader<EntryType>iterator pattern as the other readers. Backed by htslib’skseqparser.fasta_index— indexed random-access reader for.fa/.fasta/.fnafiles. Fetches regions or whole sequences in O(1) using a.faiindex (auto-created on first use). Backed by htslib’sfaidxAPI.
Streaming: fasta_reader#
#include <genogrove/io/fasta_reader.hpp>
#include <iostream>
namespace gio = genogrove::io;
int main() {
gio::fasta_reader reader("reads.fq.gz");
try {
for (const auto& entry : reader) {
std::cout << entry.name << " (" << entry.sequence.size() << " bp)\n";
if (entry.quality) {
std::cout << " quality: " << *entry.quality << "\n";
}
}
} catch (const std::runtime_error& e) {
std::cerr << "Parse error: " << e.what() << "\n";
}
return 0;
}
Format (FASTA vs. FASTQ) is auto-detected per record by
kseq— mixed>and@headers are handled transparently.entry.qualityisstd::optional<std::string>— populated for FASTQ records,std::nulloptfor FASTA records.fasta_reader_options{.skip_empty_sequences = true}skips records whose sequence is empty.fasta_readerhas no lenient mode:read_next()throwsstd::runtime_erroron truncated quality strings or I/O errors, so use the same try/catch pattern shown forbam_readerabove.
FASTA Entry Fields#
name(std::string): Sequence name (text after>or@, up to the first whitespace)comment(std::string): Rest of the header line after the name (empty if none)sequence(std::string): Nucleotide sequencequality(std::optional<std::string>): Per-base quality string (FASTQ only,std::nulloptfor FASTA)
Indexed Access: fasta_index#
fasta_index is useful when you already have genomic coordinates (e.g., from a BED or GFF file)
and want to pull out the underlying sequence without scanning the entire FASTA.
#include <genogrove/io/fasta_index.hpp>
#include <iostream>
namespace gio = genogrove::io;
int main() {
// Opens the FASTA and loads (or creates) its .fai index.
gio::fasta_index fasta("genome.fa");
// Fetch a region — coordinates are 0-based half-open [start, end),
// matching BED / BAM conventions.
std::string promoter = fasta.fetch("chr1", 1000, 2000);
// Fetch an entire sequence by name.
std::string chrM = fasta.fetch("chrM");
// Enumerate sequences in the index.
for (size_t i = 0; i < fasta.sequence_count(); ++i) {
auto name = fasta.sequence_name(i);
std::cout << name << ": " << fasta.sequence_length(name) << " bp\n";
}
if (fasta.has_sequence("chrUn")) {
// ...
}
}
Coordinate reminder: fetch(name, start, end) follows BED/BAM’s 0-based half-open convention.
When pairing with a GFF/GTF record (which is 1-based inclusive), shift the start by one:
// GFF: 1-based inclusive [start, end] → FASTA: 0-based half-open [start-1, end)
for (const auto& entry : gff_reader) {
std::string seq = fasta.fetch(entry.seqid, entry.start - 1, entry.end);
// ...
}
Notes:
All
fasta_indexlookup methods (fetch,sequence_name,sequence_length) throwstd::out_of_rangeon unknown sequence names or out-of-range indices;fetchadditionally throwsstd::out_of_rangeon invalid regions (start >= end, or a region exceeding htslib’s coordinate limit).std::runtime_erroris reserved for an actual htslib fetch failure.The
.faifile is created on first use if missing (requires write permission to the FASTA directory).fasta_indexis non-copyable and movable.