Serialization#
Genogrove can serialize a grove to disk and load it back later. Loading a serialized grove avoids re-parsing and re-inserting data from source files, which is significantly faster for large datasets.
Basic Usage#
Save a grove to disk and load it back:
#include <genogrove/structure/grove/grove.hpp>
#include <genogrove/data_type/interval.hpp>
#include <fstream>
#include <string>
namespace gdt = genogrove::data_type;
namespace gst = genogrove::structure;
int main() {
    gst::grove<gdt::interval, std::string> my_grove(100);
    my_grove.insert_data("chr1", gdt::interval{100, 200}, "gene1");
    my_grove.insert_data("chr1", gdt::interval{300, 400}, "gene2");
    // Save to disk
    {
        std::ofstream out("grove.bin", std::ios::binary);
        my_grove.serialize(out);
    }
    // Load from disk
    {
        std::ifstream in("grove.bin", std::ios::binary);
        auto loaded = gst::grove<gdt::interval, std::string>::deserialize(in);
        // loaded is a fully functional grove with all data restored
    }
    return 0;
}
Always open streams with std::ios::binary to avoid platform-specific newline translation.
grove::serialize() (and the supporting grove_to_sif() and node::serialize()) are
const-qualified, so consumers holding a const grove& — for example a read-only post-build
query layer — can serialize without a const_cast:
void persist(const gst::grove<gdt::interval, std::string>& g, std::ostream& os) {
    g.serialize(os); // OK: serialize() is const
}
How It Works#
The grove serializes its complete B+ tree structure and compresses the result with zlib. The output is a compressed binary stream (not raw bytes), so files are compact, but their contents cannot be read in a hex editor without decompressing them first. The uncompressed payload is laid out in this order, with each tree written in a depth-first traversal:
1. Tree order and number of indices (chromosomes)
2. For each index: the index name followed by the full tree (nodes, keys, and associated data)
3. External key storage
4. Graph overlay edges
All built-in key types (interval, genomic_coordinate, numeric, kmer) and common data types
(std::string, trivially copyable types like int, double, uint32_t) are serialized automatically.
Combined Persistence with Registry#
When you use the registry to store shared metadata (e.g., sample names referenced by ID in the grove),
serialize the registry before the grove and deserialize them in the same order:
#include <genogrove/structure/grove/grove.hpp>
#include <genogrove/data_type/interval.hpp>
#include <genogrove/data_type/registry.hpp>
#include <cstdint>
#include <fstream>
#include <string>
namespace gdt = genogrove::data_type;
namespace gst = genogrove::structure;
int main() {
    auto& reg = gdt::registry<std::string>::instance();
    gst::grove<gdt::interval, uint32_t> my_grove(100);
    // Intern shared metadata, store IDs in the grove
    auto id1 = reg.intern("SampleA_liver");
    auto id2 = reg.intern("SampleB_brain");
    my_grove.insert_data("chr1", gdt::interval{100, 200}, id1);
    my_grove.insert_data("chr1", gdt::interval{150, 250}, id2);
    // Save: registry first, then grove
    {
        std::ofstream out("data.bin", std::ios::binary);
        reg.serialize(out);
        my_grove.serialize(out);
    }
    // Load: same order
    reg.clear();
    {
        std::ifstream in("data.bin", std::ios::binary);
        auto& restored = gdt::registry<std::string>::deserialize(in);
        auto loaded = gst::grove<gdt::interval, uint32_t>::deserialize(in);
        // Registry IDs in the grove still resolve correctly
        auto results = loaded.intersect(gdt::interval{100, 200}, "chr1");
        for (auto* key : results.get_keys()) {
            // Resolves the stored ID back to its sample name, e.g. "SampleA_liver"
            const auto& name = restored.get(key->get_data());
        }
    }
    return 0;
}
Custom Key Type Serialization#
If you use a custom key type with the grove, it must implement a serialize member method and a
static deserialize factory method:
#include <cstddef>
#include <istream>
#include <ostream>
#include <stdexcept>

struct CustomInterval {
    size_t start;
    size_t end;
    // Write both fields as raw bytes in host byte order
    void serialize(std::ostream& os) const {
        os.write(reinterpret_cast<const char*>(&start), sizeof(start));
        os.write(reinterpret_cast<const char*>(&end), sizeof(end));
    }
    // Read the fields back in the same order; throw if the stream is short
    [[nodiscard]] static CustomInterval deserialize(std::istream& is) {
        CustomInterval ci;
        is.read(reinterpret_cast<char*>(&ci.start), sizeof(ci.start));
        is.read(reinterpret_cast<char*>(&ci.end), sizeof(ci.end));
        if (!is) throw std::runtime_error("Failed to deserialize CustomInterval");
        return ci;
    }
    // ... other key_type_base requirements (operator<, overlap, aggregate, etc.)
};
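A quick way to check a serialize/deserialize pair before plugging the type into a grove is a round trip through an in-memory stream. The sketch below is only an illustration: it repeats the serialization pair of the CustomInterval example above, and `round_trip` is a hypothetical helper, not part of genogrove.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <ostream>
#include <sstream>
#include <stdexcept>
#include <string>

// Copy of the CustomInterval example, reduced to its serialization pair.
struct CustomInterval {
    std::size_t start;
    std::size_t end;

    void serialize(std::ostream& os) const {
        os.write(reinterpret_cast<const char*>(&start), sizeof(start));
        os.write(reinterpret_cast<const char*>(&end), sizeof(end));
    }

    [[nodiscard]] static CustomInterval deserialize(std::istream& is) {
        CustomInterval ci{};
        is.read(reinterpret_cast<char*>(&ci.start), sizeof(ci.start));
        is.read(reinterpret_cast<char*>(&ci.end), sizeof(ci.end));
        if (!is) throw std::runtime_error("Failed to deserialize CustomInterval");
        return ci;
    }
};

// Hypothetical helper: serialize a value into memory, then read it back.
// Works for any type with a member serialize() and a static deserialize().
template <typename T>
T round_trip(const T& value) {
    std::stringstream buf(std::ios::in | std::ios::out | std::ios::binary);
    value.serialize(buf);
    return T::deserialize(buf);
}
```

Comparing `round_trip(x)` with `x` field by field, and feeding `deserialize` a deliberately short stream, covers both the happy path and the truncation check.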
Custom Data Type Serialization#
For custom data types stored as associated data in keys, you have two options:
Option 1: Member methods#
Add serialize and deserialize methods directly to your type:
#include <istream>
#include <ostream>
#include <stdexcept>
#include <string>
#include <utility>

struct Annotation {
    std::string name;
    double score;
    void serialize(std::ostream& os) const {
        genogrove::data_type::serializer<std::string>::write(os, name);
        os.write(reinterpret_cast<const char*>(&score), sizeof(score));
    }
    [[nodiscard]] static Annotation deserialize(std::istream& is) {
        // Read fields in the same order they were written
        auto name = genogrove::data_type::serializer<std::string>::read(is);
        double score;
        is.read(reinterpret_cast<char*>(&score), sizeof(score));
        if (!is) throw std::runtime_error("Failed to deserialize Annotation");
        return {std::move(name), score};
    }
};
Option 2: Specialize serialization_traits#
For third-party types you cannot modify, specialize serialization_traits:
#include <genogrove/data_type/serialization_traits.hpp>
template<>
struct genogrove::data_type::serialization_traits<ThirdPartyType> {
    static void serialize(std::ostream& os, const ThirdPartyType& value) {
        // write fields to os
    }
    static ThirdPartyType deserialize(std::istream& is) {
        // read fields from is and return constructed object
    }
};
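To make the skeleton above concrete, here is a filled-in sketch for a made-up type. Both `vendor::Point` and the `sketch::serialization_traits` template are assumptions introduced so the example compiles without genogrove; real code would specialize `genogrove::data_type::serialization_traits` with the same two static functions.

```cpp
#include <cassert>
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>
#include <stdexcept>

// Hypothetical third-party type that cannot be modified.
namespace vendor {
struct Point {
    std::int32_t x;
    std::int32_t y;
};
}  // namespace vendor

namespace sketch {

// Local stand-in for the library's primary template (assumption).
template <typename T>
struct serialization_traits;

// Specialization with the two static functions shown in the skeleton.
template <>
struct serialization_traits<vendor::Point> {
    static void serialize(std::ostream& os, const vendor::Point& value) {
        os.write(reinterpret_cast<const char*>(&value.x), sizeof(value.x));
        os.write(reinterpret_cast<const char*>(&value.y), sizeof(value.y));
    }

    static vendor::Point deserialize(std::istream& is) {
        vendor::Point p{};
        is.read(reinterpret_cast<char*>(&p.x), sizeof(p.x));
        is.read(reinterpret_cast<char*>(&p.y), sizeof(p.y));
        if (!is) throw std::runtime_error("Failed to deserialize vendor::Point");
        return p;
    }
};

}  // namespace sketch
```

Writing the fields one by one, rather than copying the whole struct, keeps the stored layout independent of any padding the third-party type may contain.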
What works automatically#
- Trivially copyable types (int, double, uint32_t, etc.): serialized via memcpy
- std::string: built-in specialization (length-prefixed)
- Built-in key types (interval, genomic_coordinate, numeric, kmer): member methods provided
- gio::bed_entry (with its nested gio::block_info): member serialize/deserialize provided, so a grove<gdt::interval, gio::bed_entry> can be persisted directly. The .gg files produced by the idx CLI subcommand use this form. gio::rgb_color and gio::thick_info are trivially copyable and serialize automatically.
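The two automatic cases in the list (byte-wise copy for trivially copyable types, length prefix for strings) can be sketched with the standard library alone. This is not genogrove's implementation: the helper names are hypothetical, and the 64-bit length prefix in host byte order is an assumption, since the actual on-disk layout is not described here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <istream>
#include <ostream>
#include <sstream>
#include <stdexcept>
#include <string>
#include <type_traits>

// Write a trivially copyable value as its raw bytes (host byte order).
template <typename T>
void write_pod(std::ostream& os, const T& value) {
    static_assert(std::is_trivially_copyable<T>::value,
                  "write_pod requires a trivially copyable type");
    char bytes[sizeof(T)];
    std::memcpy(bytes, &value, sizeof(T));
    os.write(bytes, sizeof(T));
}

// Read the raw bytes back; T must also be default constructible.
template <typename T>
T read_pod(std::istream& is) {
    static_assert(std::is_trivially_copyable<T>::value,
                  "read_pod requires a trivially copyable type");
    char bytes[sizeof(T)];
    is.read(bytes, sizeof(T));
    if (!is) throw std::runtime_error("truncated stream");
    T value;
    std::memcpy(&value, bytes, sizeof(T));
    return value;
}

// Length-prefixed string: a 64-bit byte count (assumed width), then the bytes.
void write_string(std::ostream& os, const std::string& s) {
    write_pod<std::uint64_t>(os, static_cast<std::uint64_t>(s.size()));
    os.write(s.data(), static_cast<std::streamsize>(s.size()));
}

std::string read_string(std::istream& is) {
    const std::uint64_t len = read_pod<std::uint64_t>(is);
    std::string s(static_cast<std::size_t>(len), '\0');
    is.read(&s[0], static_cast<std::streamsize>(len));
    if (!is) throw std::runtime_error("truncated string payload");
    return s;
}
```

A production reader would also bound `len` before allocating, because a corrupt prefix could otherwise request an enormous buffer.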
Source Stream Must Be Seekable for Concatenated Payloads#
grove::deserialize() uses zlib’s streaming decoder, which may finish consuming the compressed
payload before exhausting the input buffer. To preserve any bytes that follow the grove (e.g.,
concatenated payloads, sentinel markers, file tails), the internal inflate_streambuf rewinds
the unconsumed bytes via source.seekg(...).
The source stream must therefore be seekable when anything follows the grove in the stream.
On non-seekable sources (pipes, sockets, custom non-seekable streambufs) the seek fails and
deserialize() throws:
inflate_streambuf: source stream is not seekable; concatenated payloads require a seekable source
For a single-payload .gg file loaded via std::ifstream, this requirement is automatically
satisfied — file streams are seekable. The requirement matters only for the concatenated-payload
pattern (registry then grove from the same stream, multiple grove payloads back-to-back, sentinel
trailers).
If you must deserialize from a non-seekable source, copy it into a std::stringstream first:
std::stringstream buf;
buf << non_seekable_source.rdbuf(); // drain into a seekable buffer
auto g = gst::grove<...>::deserialize(buf);
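The effect of the drain step can be seen without genogrove. In the sketch below, `forward_only_buf` is a hypothetical stand-in for a pipe or socket: it serves bytes from memory but keeps the default `std::streambuf` seek functions, which always report failure. `make_seekable` is likewise a made-up helper name for the copy shown above.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <streambuf>
#include <string>
#include <utility>

// Hypothetical non-seekable source: readable front to back only, because
// seekoff/seekpos are not overridden and the defaults return "failed".
class forward_only_buf : public std::streambuf {
public:
    explicit forward_only_buf(std::string data) : data_(std::move(data)) {
        char* begin = &data_[0];
        setg(begin, begin, begin + data_.size());
    }

private:
    std::string data_;
};

// Drain any input stream into an in-memory buffer that supports seekg().
// Note: an empty source leaves failbit set on the returned stream.
std::stringstream make_seekable(std::istream& source) {
    std::stringstream buf(std::ios::in | std::ios::out | std::ios::binary);
    buf << source.rdbuf();
    return buf;
}
```

Calling `seekg` on a stream backed by `forward_only_buf` sets failbit, while the stream returned by `make_seekable` accepts arbitrary seeks. The cost is holding the whole payload in memory at once.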
Important Notes#
- grove::deserialize() returns a grove by value. Because grove is a move-only type (copy is deleted), the return relies on Named Return Value Optimization (NRVO) or implicit move. No special handling is needed: assign the result to a local variable as shown in the examples above.
- All deserialize methods (node::deserialize, grove::deserialize, registry::deserialize, serialization_traits<std::string>::deserialize) throw std::runtime_error on corrupt or truncated streams. node::deserialize additionally validates B+ tree invariants (num_keys < order, num_children <= order).
- registry::deserialize provides a strong exception guarantee: if the stream throws or is truncated, the singleton is left exactly as it was before the call. The new state is built in local containers and committed via noexcept move-assign only after the read loop completes. It also rejects header counts that exceed the id_type capacity ("Failed to deserialize registry: entry count exceeds id_type capacity") and streams containing duplicate keys ("Failed to deserialize registry: duplicate key"). See Registry for details.
- Graph edges added via add_edge() or link_if() are now persisted during serialization and restored on deserialize.
- Breaking format change: The serialized format now includes graph edges after external keys. Files serialized with older versions are incompatible and must be re-created.
- All deserialize methods are marked [[nodiscard]] to prevent accidentally discarding the result.
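The strong exception guarantee described for registry::deserialize follows a common stage-then-commit pattern. The sketch below illustrates that pattern on a hypothetical `name_table` class. It is not the registry's code, and the 32-bit count and length fields are assumptions made for the example.

```cpp
#include <cassert>
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>
#include <stdexcept>
#include <string>
#include <unordered_set>
#include <utility>
#include <vector>

// Hypothetical table of unique names, used only to show the pattern.
class name_table {
public:
    void add(std::string name) { names_.push_back(std::move(name)); }
    const std::vector<std::string>& names() const { return names_; }

    void save(std::ostream& os) const {
        const std::uint32_t count = static_cast<std::uint32_t>(names_.size());
        os.write(reinterpret_cast<const char*>(&count), sizeof(count));
        for (const std::string& name : names_) {
            const std::uint32_t len = static_cast<std::uint32_t>(name.size());
            os.write(reinterpret_cast<const char*>(&len), sizeof(len));
            os.write(name.data(), static_cast<std::streamsize>(len));
        }
    }

    // Replace the contents from `is`. If anything throws, names_ is untouched.
    void load(std::istream& is) {
        const std::uint32_t count = read_u32(is);
        std::vector<std::string> staged;        // new state lives in locals
        std::unordered_set<std::string> seen;   // used to reject duplicates
        for (std::uint32_t i = 0; i < count; ++i) {
            std::string name(read_u32(is), '\0');
            is.read(&name[0], static_cast<std::streamsize>(name.size()));
            if (!is) throw std::runtime_error("truncated entry");
            if (!seen.insert(name).second) throw std::runtime_error("duplicate key");
            staged.push_back(std::move(name));
        }
        names_ = std::move(staged);             // commit: non-throwing move-assign
    }

private:
    static std::uint32_t read_u32(std::istream& is) {
        std::uint32_t value = 0;
        is.read(reinterpret_cast<char*>(&value), sizeof(value));
        if (!is) throw std::runtime_error("truncated stream");
        return value;
    }

    std::vector<std::string> names_;
};
```

Every operation that can fail happens before the final move-assign, so a truncated stream or a duplicate entry leaves the existing contents exactly as they were.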