# Reading Genomic Files (I/O)

The `genogrove::io` namespace provides efficient readers for common genomic file formats with automatic compression detection.

## Reader Ownership

All streaming file readers (`bed_reader`, `gff_reader`, `bam_reader`, `fasta_reader`) own raw
htslib resource pointers and are **non-copyable** but **movable**. Attempting to copy a reader will
produce a compile error. `fasta_index` (indexed random-access FASTA) follows the same rule.

```cpp
namespace gio = genogrove::io;

gio::bed_reader reader("data.bed");

// gio::bed_reader copy = reader;            // compile error — not copyable
gio::bed_reader moved = std::move(reader);   // OK — transfers ownership
```

When passing readers to functions, use a reference or move them:

```cpp
namespace gio = genogrove::io;

void process(gio::bed_reader& reader) {      // pass by reference
    for (const auto& entry : reader) { /* ... */ }
}
```

## Automatic File Type Detection

Genogrove can automatically detect file types and compression formats:

```cpp
#include <genogrove/io/filetype_detector.hpp>

namespace gio = genogrove::io;

int main() {
    gio::filetype_detector detector;
    auto [filetype, compression] = detector.detect_filetype("data.bed.gz");

    // filetype will be gio::filetype::BED
    // compression will be gio::compression_type::GZIP

    return 0;
}
```

**Supported File Types:**

- BED (Browser Extensible Data)
- BEDGRAPH
- GFF (General Feature Format)
- GTF (Gene Transfer Format)
- BAM/SAM/CRAM (Sequence Alignments)
- FASTA (`.fa`, `.fasta`, `.fna`)
- FASTQ (`.fq`, `.fastq`, `.fnq`)
- VCF (Variant Call Format)
- GG (Genogrove native format)

**Supported Compression Formats:**

- GZIP (.gz, including BGZF)
- BZIP2 (.bz2)
- XZ (.xz, LZMA)
- ZSTD (.zst)
- LZ4 (.lz4)

For `bed_reader` and `gff_reader`, both BGZF (block-gzip, `bgzip`) and plain gzip
(`gzip`) compressed inputs are accepted with the `.gz` extension — internally they
share htslib's `bgzf_open()`, which reads both formats transparently. BGZF files
support random-access seeks; plain gzip files are read sequentially. This is useful
when consuming files from sources that distribute plain-gzip compression (e.g.,
ENCODE GTFs, GENCODE annotations) — there is no need to re-compress with `bgzip`.

## Iterator Equality Contract

The iterator returned by `reader.begin()` is a **single-pass input iterator**. Equality is
position-aware:

- End iterators (`reader.end()`) compare equal to each other.
- A non-end iterator compares equal to an end iterator only if it has hit EOF or an error.
- Two non-end iterators are equal **iff they share the same parent reader AND have advanced
  the same number of times** (an internal monotonic position counter is bumped on each
  successful advance).

The common range-for pattern (`for (const auto& entry : reader)`) and the `it != reader.end()`
loop are unaffected by this rule. The rule matters only when you copy an iterator and advance
one copy — that pattern violates the single-pass nature of the underlying reader anyway (the
older copy keeps its cached `current_entry_` but the reader state has moved forward), so it is
not a recommended idiom; the equality contract just makes the resulting iterators distinguishable
rather than silently equal.

## Error Handling

All file readers throw `std::runtime_error` on parse and I/O errors by default. The `read_next()`
method returns `false` only at end-of-file. Wrap iteration in a try-catch to handle errors:

```cpp
namespace gio = genogrove::io;

gio::bed_reader reader("data.bed");

try {
    for (const auto& entry : reader) {
        // process entries...
    }
} catch (const std::runtime_error& e) {
    std::cerr << "Error: " << e.what() << "\n";
    // error messages include the line number, e.g.
    // "Invalid coordinate format at line 42"
}
```

This pattern applies to all readers (`bed_reader`, `gff_reader`, `bam_reader`).

### Lenient Mode

To skip malformed records instead of throwing, enable lenient mode via the reader's options struct.
Use `get_error_message()` to check for errors on individual records:

```cpp
namespace gio = genogrove::io;

// BED: skip_invalid_lines
gio::bed_reader reader("data.bed", gio::bed_reader_options{.skip_invalid_lines = true});

// GFF: skip_invalid_lines
// gio::gff_reader reader("data.gff", gio::gff_reader_options{.skip_invalid_lines = true});

// BAM: bam_reader throws only on I/O errors — no lenient mode option

for (const auto& entry : reader) {
    // process entries — malformed lines are silently skipped
}
```

### Error Message Lifecycle

`get_error_message()` reflects the **most recently attempted record** only — it is cleared at the
start of each iteration. It does not accumulate errors across records.

- **During iteration**: contains the error from the last skipped record (if any), or is empty if the
  last record was valid.
- **After iteration completes**: empty, because the final read was EOF (no error).

To log every skipped record, check `get_error_message()` inside the loop:

```cpp
for (const auto& entry : reader) {
    if (!reader.get_error_message().empty()) {
        std::cerr << "Skipped: " << reader.get_error_message() << "\n";
    }
    // process valid entry...
}
```

### Zero-Record Inputs

For `bed_reader` and `gff_reader`, structurally valid inputs that contain no records — empty files,
files where every line is blank, and files that contain only `#`-prefixed comments or header lines —
are **not** an error. The constructor returns successfully and the iterator immediately compares
equal to `end()`, so the body of a `for (...)` loop simply runs zero times.

This means callers decide the policy for "no data" rather than the reader. Detect zero records
either by checking inside the loop or by comparing iterators directly:

```cpp
gio::gff_reader reader(path);

bool any = false;
for (const auto& entry : reader) {
    any = true;
    // process entry...
}
if (!any) {
    // no records — consumer-defined policy: warn, skip, or fail
}
```

The iterator is single-pass: each call to `begin()` on a reader that still has data
consumes the next record from the underlying stream (the iterator's constructor calls
`read_next()` eagerly so that `*it` is immediately valid). Call `begin()` at most once
per reader. When detecting empty inputs without iterating, do the comparison and stop:

```cpp
gio::gff_reader reader(path);
if (reader.begin() == reader.end()) {
    // no records — safe: begin() consumed nothing because the file was empty
}
// do not call begin() again on the same reader if you already called it above —
// for a non-empty file, that second call would skip the first record
```

The following error conditions still throw `std::runtime_error` (or skip the line, in lenient mode):

- File-open failures (missing file, permission denied, unreadable BGZF header)
- Malformed first record that fails to parse
- Per-line parse errors discovered mid-iteration

In other words, "valid file with zero records" is now a quiet success; only structurally broken
inputs raise errors.

## Coordinate Semantics

Readers preserve format-native coordinate conventions. The grove uses **closed** `[start, end]` coordinates (both endpoints inclusive). When inserting reader entries, convert as needed:

| Format    | Convention             | Conversion to grove interval                        |
|-----------|------------------------|-----------------------------------------------------|
| BED / BAM | 0-based half-open `[start, end)` | `gdt::interval(entry.start, entry.end - 1)` |
| GFF / GTF | 1-based inclusive `[start, end]` | `gdt::interval(entry.start, entry.end)`      |

```cpp
// BED / BAM — subtract 1 from end
for (const auto& entry : bed_reader) {
    grove.insert_data(entry.chrom,
                      gdt::interval(entry.start, entry.end - 1),
                      data);
}

// GFF / GTF — use start and end directly (both already inclusive)
for (const auto& entry : gff_reader) {
    grove.insert_data(entry.seqid,
                      gdt::interval(entry.start, entry.end),
                      data);
}
```

## BED Files

BED files store genomic intervals with optional metadata. The `bed_reader` provides iterator-based access:

```cpp
#include <genogrove/io/bed_reader.hpp>
#include <iostream>

namespace gio = genogrove::io;

int main() {
    // Automatically handles compressed files (.bed.gz)
    gio::bed_reader reader("example.bed");

    try {
        for (const auto& entry : reader) {
            std::cout << "Chromosome: " << entry.chrom << "\n"
                      << "Start: " << entry.start << "\n"
                      << "End: " << entry.end << "\n";

            // Optional fields (if present in file)
            if (entry.name) {
                std::cout << "Name: " << *entry.name << "\n";
            }
            if (entry.strand) {
                std::cout << "Strand: " << *entry.strand << "\n";
            }
        }
    } catch (const std::runtime_error& e) {
        std::cerr << "Parse error: " << e.what() << "\n";
    }

    return 0;
}
```

### Mixed BED Formats

The `bed_reader` supports files that mix BED3, BED6, and BED12 records. Optional fields are reset
on each record, so a BED6 line following a BED12 line will not carry over stale block or RGB data:

```cpp
gio::bed_reader reader("mixed.bed");

for (const auto& entry : reader) {
    // Always present (BED3)
    std::cout << entry.chrom << "\t"
              << entry.start << "\t"
              << entry.end;

    // Only set on BED6+ lines
    if (entry.name)   std::cout << "\t" << *entry.name;
    if (entry.score)  std::cout << "\t" << *entry.score;
    if (entry.strand) std::cout << "\t" << *entry.strand;

    // Only set on BED12 lines
    if (entry.blocks)  std::cout << "\t(has blocks)";

    std::cout << "\n";
}
```

### BED Entry Fields

- `chrom` (std::string): Chromosome name
- `start` (size_t): Start position (0-based, half-open)
- `end` (size_t): End position (0-based, half-open)
- `name` (std::optional\<std::string>): Feature name
- `score` (std::optional\<int>): Score value
- `strand` (std::optional\<char>): Strand (+/-)
- `thickness` (std::optional\<thick_info>): Display thickness coordinates
- `item_rgb` (std::optional\<rgb_color>): RGB color value
- `blocks` (std::optional\<block_info>): Block information

## GFF/GTF Files

GFF3 and GTF files contain gene annotations. The `gff_reader` auto-detects the format variant
by inspecting the attribute column (GFF3 uses `key=value`, GTF uses `key "value"`).
Coordinates are stored in native GFF format: **1-based inclusive** `[start, end]`.

```cpp
#include <genogrove/io/gff_reader.hpp>
#include <iostream>

namespace gio = genogrove::io;

int main() {
    gio::gff_reader reader("annotations.gff");

    try {
        for (const auto& entry : reader) {
            std::cout << "Sequence: " << entry.seqid << "\n"
                      << "Type: " << entry.type << "\n"
                      << "Start: " << entry.start << "\n"
                      << "End: " << entry.end << "\n";

            // Access attributes (column 9)
            if (auto gene_id = entry.get_gene_id()) {
                std::cout << "Gene ID: " << *gene_id << "\n";
            }

            // Check format
            if (entry.is_gtf()) {
                std::cout << "Format: GTF\n";
            } else if (entry.is_gff3()) {
                std::cout << "Format: GFF3\n";
            }
        }
    } catch (const std::runtime_error& e) {
        std::cerr << "Parse error: " << e.what() << "\n";
    }

    return 0;
}
```

### GFF Entry Fields

- `seqid` (std::string): Chromosome/contig name
- `source` (std::string): Source of the feature
- `type` (std::string): Feature type (gene, exon, CDS, etc.)
- `start` (size_t): Start position (1-based inclusive, native GFF format)
- `end` (size_t): End position (1-based inclusive, native GFF format)
- `score` (std::optional\<double>): Score value (`std::nullopt` when `.` in file)
- `strand` (std::optional\<char>): Strand (+, -, ., or ?)
- `phase` (std::optional\<int>): Phase for CDS features (0, 1, or 2)
- `attributes` (std::map\<std::string, std::string, std::less\<>>): Key-value pairs from column 9 (transparent comparator enables `string_view` lookups)
- `format` (gff_format): Detected format — `gff_format::GFF3`, `gff_format::GTF`, or `gff_format::UNKNOWN`

### Attribute Access

Helper methods return `std::optional<std::string>` (or `std::optional<int>` for `get_exon_number()`).
Some helpers try multiple attribute keys to work across GFF3 and GTF conventions:

- `get_gene_id()` — returns `gene_id`
- `get_transcript_id()` — returns `transcript_id`
- `get_exon_number()` — parses `exon_number` as `int`
- `get_gene_name()` — tries `gene_name`, then falls back to GFF3's `Name`
- `get_gene_biotype()` — tries `gene_biotype`, `gene_type`, then `biotype`
- `get_attribute(key)` — generic getter for any attribute key (accepts `std::string_view`)

You can also access the attributes map directly:

```cpp
// Direct map access
auto it = entry.attributes.find("ID");
if (it != entry.attributes.end()) {
    std::cout << "ID: " << it->second << "\n";
}
```

### GTF Quoted Semicolons

Semicolons inside double-quoted GTF attribute values (e.g., `gene_name "test;name"`) are correctly
preserved. The parser recognizes that these are part of the value rather than field delimiters.
GFF3 files are unaffected — GFF3 uses URL-encoding (`%3B`) for literal semicolons per spec.

### GTF Validation

GTF attribute validation is **opt-in** via the `validate_gtf` option (default: `false`). When
enabled on GTF-format files, the reader enforces mandatory GTF2 attributes:

- `gene_id` is required on **all** features
- `transcript_id` is required on exon, CDS, start_codon, stop_codon, UTR, 5UTR, and 3UTR features

If validation fails, `read_next()` throws `std::runtime_error` (or skips the line when
`skip_invalid_lines` is enabled). GFF3 files are unaffected regardless of this setting.

```cpp
// Enable GTF validation
gio::gff_reader reader("annotations.gtf",
    gio::gff_reader_options{.validate_gtf = true});

// Combine with lenient mode to skip invalid records instead of throwing
gio::gff_reader lenient_reader("annotations.gtf",
    gio::gff_reader_options{.skip_invalid_lines = true, .validate_gtf = true});
```

### Convenience Methods

- `is_gtf()` — returns `true` if format is GTF
- `is_gff3()` — returns `true` if format is GFF3

## BAM/SAM Files

BAM, SAM, and CRAM files store sequence alignments. The `bam_reader` auto-detects the format and handles
decompression via htslib. SAM uses 1-based positions (POS); these are converted to 0-based half-open
`[start, end)` values using the CIGAR string to compute the aligned reference length.

```{note}
**`read_next()` error contract.** When `read_next()` returns `true` the record is fully populated
and `get_error_message()` reads empty. `bam_reader::read_next()` throws `std::runtime_error` on
both I/O errors **and on truncated auxiliary data** (`"Truncated auxiliary data at record N"`) —
truncated aux is no longer a silent post-hoc warning in `get_error_message()`. Always wrap BAM
iteration in `try`/`catch` rather than relying on a post-loop `get_error_message()` check to
detect aux truncation.
```

```cpp
#include <genogrove/io/bam_reader.hpp>
#include <iostream>

namespace gio = genogrove::io;

int main() {
    gio::bam_reader reader("alignments.bam");

    try {
        for (const auto& entry : reader) {
            std::cout << "Read: " << entry.qname << "\n"
                      << "Chrom: " << entry.chrom << "\n"
                      << "Start: " << entry.start << "\n"
                      << "End: " << entry.end << "\n"
                      << "Strand: " << entry.get_strand() << "\n"
                      << "MAPQ: " << static_cast<int>(entry.mapq) << "\n"
                      << "CIGAR: " << entry.cigar_string_repr() << "\n";

            if (entry.mate) {
                std::cout << "Mate chrom: " << entry.mate->chrom << "\n"
                          << "Mate pos: " << entry.mate->position << "\n"
                          << "Insert size: " << entry.mate->insert_size << "\n";
            }
        }
    } catch (const std::runtime_error& e) {
        std::cerr << "Read error: " << e.what() << "\n";
    }

    return 0;
}
```

### Filtering Options

Use factory methods on `bam_reader_options` to apply common filters, or build a custom options struct:

```cpp
namespace gio = genogrove::io;

// Factory presets
gio::bam_reader reader1("reads.bam", gio::bam_reader_options::defaults());       // skip unmapped (default)
gio::bam_reader reader2("reads.bam", gio::bam_reader_options::include_all());    // no filtering
gio::bam_reader reader3("reads.bam", gio::bam_reader_options::primary_only());   // primary alignments only
gio::bam_reader reader4("reads.bam", gio::bam_reader_options::high_quality(20)); // MAPQ >= 20

// Custom options
gio::bam_reader_options opts;
opts.skip_unmapped = true;
opts.skip_secondary = true;
opts.skip_duplicates = true;
opts.min_mapq = 30;
gio::bam_reader reader5("reads.bam", opts);
```

**Filter fields:**

- `skip_unmapped` (bool, default `true`): Skip unmapped reads
- `skip_secondary` (bool, default `false`): Skip secondary alignments
- `skip_supplementary` (bool, default `false`): Skip supplementary alignments
- `skip_qc_fail` (bool, default `false`): Skip QC-failed reads
- `skip_duplicates` (bool, default `false`): Skip duplicate reads
- `min_mapq` (uint8_t, default `0`): Minimum mapping quality

### SAM Entry Fields

- `qname` (std::string): Read name
- `chrom` (std::string): Reference sequence name
- `start` (size_t): Start position (0-based, half-open, computed from POS). For unmapped reads and for records whose CIGAR consumes zero reference bases (pure soft-clip `100S`, hard-clip-only secondary alignments `100H`+`FLAG=256`), `start == POS` and `end == start` — gate on `consumes_reference()` before treating the record as a real interval.
- `end` (size_t): End position (0-based, half-open, computed from POS + CIGAR-consumed reference length). Equals `start` for unmapped / zero-ref-consuming records (see above).
- `flags` (alignment_flags): Bitwise flags with convenience methods (`is_paired()`, `is_reverse()`, `is_duplicate()`, etc.)
- `mapq` (uint8_t): Mapping quality
- `cigar` (cigar_string): CIGAR operations as a `std::vector<cigar_element>`
- `sequence` (std::string): Read sequence
- `quality` (std::string): ASCII quality scores
- `mate` (std::optional\<mate_info>): Mate information (`chrom`, `position`, `insert_size`)
- `tags` (sam_tags): Auxiliary tags (`std::unordered_map<std::string, sam_tag_value>`)

### CIGAR Operations

Each `cigar_element` has an `op` (operation code) and a `length`. Reference-consuming operations
(M, D, N, =, X) determine the aligned interval length.

```cpp
for (const auto& elem : entry.cigar) {
    std::cout << elem.length << elem.to_char();        // e.g. "50M2I30M"
    if (elem.consumes_reference()) { /* M, D, N, =, X */ }
    if (elem.consumes_query())     { /* M, I, S, =, X */ }
}
```

### Tag Access

Auxiliary tags are stored in an `std::unordered_map<std::string, sam_tag_value>`. The value is a
`std::variant` supporting `char`, `int64_t`, `float`, `std::string`, and typed vectors.

```cpp
auto it = entry.tags.find("NH");
if (it != entry.tags.end()) {
    int64_t num_hits = std::get<int64_t>(it->second);
    std::cout << "Number of hits: " << num_hits << "\n";
}
```

### Header Access

The `bam_reader` provides methods to inspect the SAM header and reference sequences before or after iteration:

- `get_header()` — returns the full SAM header text (all `@HD`, `@SQ`, `@RG`, `@PG` lines)
- `get_reference_names()` — returns a `std::vector<std::string>` of reference sequence names from the header

```cpp
namespace gio = genogrove::io;

gio::bam_reader reader("alignments.bam");

// Inspect reference sequences
const auto& refs = reader.get_reference_names();
std::cout << "References (" << refs.size() << "):\n";
for (const auto& name : refs) {
    std::cout << "  " << name << "\n";
}

// Access the raw SAM header (e.g., to find read groups)
const std::string& header = reader.get_header();
std::cout << "Header:\n" << header << "\n";
```

### Convenience Methods

- `get_strand()` — returns `'+'`, `'-'`, or `'.'`
- `is_primary()` — not secondary and not supplementary
- `is_mapped()` — not unmapped
- `consumes_reference()` — `true` iff the record covers any reference bases (`start < end`). Returns `false` for unmapped reads and for mapped records whose CIGAR consumes zero reference bases (pure soft-clip, hard-clip-only secondary alignments). Use this as the gate before converting to a closed `gdt::interval(start, end - 1)` or inserting into a grove — the closed-interval conversion underflows otherwise.
- `cigar_string_repr()` — CIGAR as a human-readable string (e.g. `"50M2I30M"`)

```cpp
// Recommended insertion pattern: gate on consumes_reference()
for (const auto& entry : reader) {
    if (!entry.consumes_reference()) continue;   // skip soft-clip / hard-clip-only
    grove.insert_data(entry.chrom,
                      gdt::interval(entry.start, entry.end - 1),
                      entry);
}
```

## FASTA / FASTQ Files

Genogrove provides two complementary APIs for FASTA/FASTQ data:

- **`fasta_reader`** — streaming iteration over every record in a FASTA or FASTQ file
  (including gzip-compressed variants). Follows the same `file_reader<EntryType>` iterator pattern
  as the other readers. Backed by htslib's `kseq` parser.
- **`fasta_index`** — indexed random-access reader for `.fa` / `.fasta` / `.fna` files. Fetches
  regions or whole sequences in O(1) using a `.fai` index (auto-created on first use). Backed by
  htslib's `faidx` API.

### Streaming: `fasta_reader`

```cpp
#include <genogrove/io/fasta_reader.hpp>
#include <iostream>

namespace gio = genogrove::io;

int main() {
    gio::fasta_reader reader("reads.fq.gz");

    try {
        for (const auto& entry : reader) {
            std::cout << entry.name << " (" << entry.sequence.size() << " bp)\n";
            if (entry.quality) {
                std::cout << "  quality: " << *entry.quality << "\n";
            }
        }
    } catch (const std::runtime_error& e) {
        std::cerr << "Parse error: " << e.what() << "\n";
    }
    return 0;
}
```

- Format (FASTA vs. FASTQ) is auto-detected per record by `kseq` — mixed `>` and `@` headers are
  handled transparently.
- `entry.quality` is `std::optional<std::string>` — populated for FASTQ records, `std::nullopt`
  for FASTA records.
- `fasta_reader_options{.skip_empty_sequences = true}` skips records whose sequence is empty.
- `fasta_reader` has no lenient mode: `read_next()` throws `std::runtime_error` on truncated
  quality strings or I/O errors, so use the same try/catch pattern shown for `bam_reader` above.

### FASTA Entry Fields

- `name` (std::string): Sequence name (text after `>` or `@`, up to the first whitespace)
- `comment` (std::string): Rest of the header line after the name (empty if none)
- `sequence` (std::string): Nucleotide sequence
- `quality` (std::optional\<std::string>): Per-base quality string (FASTQ only, `std::nullopt` for FASTA)

### Indexed Access: `fasta_index`

`fasta_index` is useful when you already have genomic coordinates (e.g., from a BED or GFF file)
and want to pull out the underlying sequence without scanning the entire FASTA.

```cpp
#include <genogrove/io/fasta_index.hpp>
#include <iostream>

namespace gio = genogrove::io;

int main() {
    // Opens the FASTA and loads (or creates) its .fai index.
    gio::fasta_index fasta("genome.fa");

    // Fetch a region — coordinates are 0-based half-open [start, end),
    // matching BED / BAM conventions.
    std::string promoter = fasta.fetch("chr1", 1000, 2000);

    // Fetch an entire sequence by name.
    std::string chrM = fasta.fetch("chrM");

    // Enumerate sequences in the index.
    for (size_t i = 0; i < fasta.sequence_count(); ++i) {
        auto name = fasta.sequence_name(i);
        std::cout << name << ": " << fasta.sequence_length(name) << " bp\n";
    }

    if (fasta.has_sequence("chrUn")) {
        // ...
    }
}
```

**Coordinate reminder**: `fetch(name, start, end)` follows BED/BAM's 0-based half-open convention.
When pairing with a GFF/GTF record (which is 1-based inclusive), shift the start by one:

```cpp
// GFF: 1-based inclusive [start, end] → FASTA: 0-based half-open [start-1, end)
for (const auto& entry : gff_reader) {
    std::string seq = fasta.fetch(entry.seqid, entry.start - 1, entry.end);
    // ...
}
```

**Notes:**

- All `fasta_index` lookup methods (`fetch`, `sequence_name`, `sequence_length`) throw
  `std::out_of_range` on unknown sequence names or out-of-range indices; `fetch` additionally
  throws `std::out_of_range` on invalid regions (`start >= end`, or a region exceeding htslib's
  coordinate limit). `std::runtime_error` is reserved for an actual htslib fetch failure.
- The `.fai` file is created on first use if missing (requires write permission to the FASTA
  directory).
- `fasta_index` is non-copyable and movable.