Working with Data Types#
The genogrove::data_type namespace provides core genomic data types.
Common Interface#
All data types in genogrove implement a shared interface:
Comparison operators:
<,>,==Overlap detection: Static method checking if two values intersect
Aggregation: Combines multiple values into one result
Serialization/Deserialization: Binary I/O operations
String representation: Converts values to readable format
Constructors, comparison operators, getters/setters, and overlap detection are constexpr, so they
can be used in compile-time contexts:
constexpr gdt::interval region{100, 200};
static_assert(region.get_start() == 100);
static_assert(gdt::interval::overlaps(region, gdt::interval{150, 250}));
Intervals#
The interval class represents genomic regions using closed [start, end] coordinates (both endpoints inclusive):
#include <genogrove/data_type/interval.hpp>
namespace gdt = genogrove::data_type;
int main() {
// Create intervals
gdt::interval iv1{100, 200}; // [100, 200]
gdt::interval iv2{150, 250}; // [150, 250]
// Check for overlap
if (gdt::interval::overlaps(iv1, iv2)) {
std::cout << "Intervals overlap\n";
}
// Comparison operators
gdt::interval iv3{50, 75};
if (iv3 < iv1) {
std::cout << "iv3 comes before iv1\n";
}
// Aggregate two intervals into one
auto merged = gdt::interval::aggregate(iv1, iv2);
// String representation
std::cout << iv1.to_string() << "\n"; // "[100, 200]"
return 0;
}
Interval Methods:
INVALID_POSITION— Named constant (std::numeric_limits<size_t>::max()) used as the sentinel for default-constructed (uninitialized) intervals. Use this instead ofstd::string::nposwhen checking for uninitialized state.overlaps(a, b)- Static method to check overlapaggregate(a, b)- Merge two intervals into their bounding intervalComparison:
<,>,==get_start(),get_end(),set_range(start, end)— validates both positions atomicallyto_string()- String representation
Genomic Coordinates#
The genomic_coordinate class extends intervals with strand information:
#include <genogrove/data_type/genomic_coordinate.hpp>
namespace gdt = genogrove::data_type;
int main() {
// Create coordinate with strand
gdt::genomic_coordinate coord{'+', 1000, 2000};
// Access properties
char strand = coord.get_strand();
size_t start = coord.get_start();
size_t end = coord.get_end();
// Modify strand (validates: must be '+', '-', '.', or '*')
coord.set_strand('-');
// Modify both positions atomically
coord.set_range(1500, 2500);
// All interval methods are available
if (gdt::genomic_coordinate::overlaps(coord, other_coord)) {
std::cout << "Coordinates overlap\n";
}
return 0;
}
Keys and Associated Data#
The key class wraps genomic types with optional associated data:
#include <genogrove/data_type/key.hpp>
#include <genogrove/data_type/interval.hpp>
namespace gdt = genogrove::data_type;
struct GeneInfo {
std::string name;
double expression;
};
int main() {
// Key with data
gdt::key<gdt::interval, GeneInfo> gene_key(
gdt::interval{100, 200},
GeneInfo{"BRCA1", 45.3}
);
// Access key value
auto interval = gene_key.get_value();
// Access associated data
auto info = gene_key.get_data();
std::cout << "Gene: " << info.name << "\n";
std::cout << "Expression: " << info.expression << "\n";
// Check if key has data
if (gene_key.has_data()) {
std::cout << "Key has associated data\n";
}
return 0;
}
Numeric#
The numeric type provides integer wrapper semantics for point-based operations:
#include <genogrove/data_type/numeric.hpp>
namespace gdt = genogrove::data_type;
int main() {
// Create numeric values
gdt::numeric n1{100};
gdt::numeric n2{200};
// Comparison uses standard integer ordering
if (n1 < n2) {
std::cout << "n1 comes before n2\n";
}
// Overlap occurs only when values are exactly equal
gdt::numeric n3{100};
if (gdt::numeric::overlaps(n1, n3)) {
std::cout << "Values are equal\n";
}
// Aggregation returns the maximum of two values
auto max_val = gdt::numeric::aggregate(n1, n2); // Returns 200
return 0;
}
Numeric Characteristics:
Comparison: Standard integer ordering
Overlap: Only when values are exactly equal
Aggregation: Returns the maximum value
K-mer#
The kmer type represents DNA sequences using compact 2-bit encoding:
#include <genogrove/data_type/kmer.hpp>
namespace gdt = genogrove::data_type;
int main() {
// Create k-mer from DNA sequence
gdt::kmer k1{"ACGT"}; // 4-mer
gdt::kmer k2{"TGCA"}; // 4-mer
// Comparison orders by k-value first, then encoding
if (k1 < k2) {
std::cout << "k1 comes before k2\n";
}
// Overlap requires exact sequence match
gdt::kmer k3{"ACGT"};
if (gdt::kmer::overlaps(k1, k3)) {
std::cout << "K-mers are identical\n";
}
// String representation
std::cout << k1.to_string() << "\n"; // "ACGT"
return 0;
}
K-mer Characteristics:
Encoding: 2-bit per nucleotide (A=00, C=01, G=10, T=11)
BASE_MASK— Named constant (0x03) for the 2-bit nucleotide mask used indecode_base()Maximum length: 32 nucleotides (fits in 64-bit integer)
Comparison: Orders by k-value first, then encoding
Overlap: Exact sequence match required
Aggregation: Returns k-mer with maximum encoding