# Serialization

Genogrove supports serialization for persisting groves to disk and loading them back. This avoids re-parsing and re-inserting data from source files, which is significantly faster for large datasets.

## Basic Usage

Save a grove to disk and load it back:

```cpp
#include <fstream>
// Also include the genogrove headers that declare grove and interval.

namespace gdt = genogrove::data_type;
namespace gst = genogrove::structure;

int main() {
    // <...> stands for the grove's template arguments.
    gst::grove<...> my_grove(100);
    my_grove.insert_data("chr1", gdt::interval{100, 200}, "gene1");
    my_grove.insert_data("chr1", gdt::interval{300, 400}, "gene2");

    // Save to disk
    {
        std::ofstream out("grove.bin", std::ios::binary);
        my_grove.serialize(out);
    }

    // Load from disk
    {
        std::ifstream in("grove.bin", std::ios::binary);
        auto loaded = gst::grove<...>::deserialize(in);
        // loaded is a fully functional grove with all data restored
    }

    return 0;
}
```

Always open streams with `std::ios::binary` to avoid platform-specific newline translation.

`grove::serialize()` (and the supporting `grove_to_sif()` and `node::serialize()`) are **`const`-qualified**, so consumers holding a `const grove&`, for example a read-only post-build query layer, can serialize without a `const_cast`:

```cpp
void persist(const gst::grove<...>& g, std::ostream& os) {
    g.serialize(os);  // OK: serialize() is const
}
```

## How It Works

The grove serializes its complete B+ tree structure using **zlib compression**. The output is a compressed binary stream (not raw bytes), so files are compact but cannot be inspected directly with a hex editor.

Internally the data is written in a depth-first traversal:

1. Tree order and number of indices (chromosomes)
2. For each index: the index name followed by the full tree (nodes, keys, and associated data)
3. External key storage
4. Graph overlay edges

All built-in key types (`interval`, `genomic_coordinate`, `numeric`, `kmer`) and common data types (`std::string`, trivially copyable types like `int`, `double`, `uint32_t`) are serialized automatically.

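To make the "serialized automatically" cases more concrete, here is a minimal sketch using only standard-library streams. It is not Genogrove's implementation: `write_pod`, `read_pod`, `write_string`, `read_string`, and `round_trip_demo` are hypothetical helpers, and the 64-bit length prefix is an assumption chosen for illustration, not the library's actual on-disk layout. The sketch shows the general technique of copying raw bytes for trivially copyable types and writing a length prefix before string contents.

```cpp
#include <cstddef>
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>
#include <stdexcept>
#include <string>
#include <type_traits>

// Hypothetical helper: write a trivially copyable value as raw bytes.
template <typename T>
void write_pod(std::ostream& os, const T& value) {
    static_assert(std::is_trivially_copyable_v<T>,
                  "raw byte copy requires a trivially copyable type");
    os.write(reinterpret_cast<const char*>(&value), sizeof(T));
}

// Hypothetical helper: read a trivially copyable value back, throwing on a short read.
template <typename T>
T read_pod(std::istream& is) {
    static_assert(std::is_trivially_copyable_v<T>,
                  "raw byte copy requires a trivially copyable type");
    T value{};
    is.read(reinterpret_cast<char*>(&value), sizeof(T));
    if (!is) throw std::runtime_error("truncated stream while reading value");
    return value;
}

// Hypothetical helper: length-prefixed string (assumed 64-bit length, then the bytes).
void write_string(std::ostream& os, const std::string& s) {
    write_pod<std::uint64_t>(os, static_cast<std::uint64_t>(s.size()));
    os.write(s.data(), static_cast<std::streamsize>(s.size()));
}

std::string read_string(std::istream& is) {
    const auto len = read_pod<std::uint64_t>(is);
    std::string s(static_cast<std::size_t>(len), '\0');
    is.read(s.data(), static_cast<std::streamsize>(len));
    if (!is) throw std::runtime_error("truncated stream while reading string");
    return s;
}

// Writes one integer and one string into an in-memory buffer and reads them back.
bool round_trip_demo() {
    std::stringstream buf;
    write_pod<std::uint32_t>(buf, 42u);
    write_string(buf, "gene1");
    return read_pod<std::uint32_t>(buf) == 42u && read_string(buf) == "gene1";
}
```

Raw byte copies like this are only portable between machines with the same endianness and type sizes, which is one reason a real format usually pins those details down.
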
## Combined Persistence with Registry

When using `registry` to store shared metadata (e.g., sample names referenced by ID in the grove), serialize the registry **before** the grove and deserialize in the same order:

```cpp
#include <fstream>
// Also include the genogrove headers that declare grove, interval, and registry.

namespace gdt = genogrove::data_type;
namespace gst = genogrove::structure;

int main() {
    auto& reg = gdt::registry::instance();
    gst::grove<...> my_grove(100);

    // Intern shared metadata, store IDs in the grove
    auto id1 = reg.intern("SampleA_liver");
    auto id2 = reg.intern("SampleB_brain");
    my_grove.insert_data("chr1", gdt::interval{100, 200}, id1);
    my_grove.insert_data("chr1", gdt::interval{150, 250}, id2);

    // Save: registry first, then grove
    {
        std::ofstream out("data.bin", std::ios::binary);
        reg.serialize(out);
        my_grove.serialize(out);
    }

    // Load: same order
    reg.clear();
    {
        std::ifstream in("data.bin", std::ios::binary);
        auto& restored = gdt::registry::deserialize(in);
        auto loaded = gst::grove<...>::deserialize(in);

        // Registry IDs in the grove still resolve correctly
        auto results = loaded.intersect(gdt::interval{100, 200}, "chr1");
        for (auto* key : results.get_keys()) {
            const auto& name = restored.get(key->get_data());  // "SampleA_liver"
        }
    }

    return 0;
}
```

## Custom Key Type Serialization

If you use a custom key type with the grove, it must implement a `serialize` member method and a static `deserialize` factory method:

```cpp
struct CustomInterval {
    size_t start;
    size_t end;

    void serialize(std::ostream& os) const {
        os.write(reinterpret_cast<const char*>(&start), sizeof(start));
        os.write(reinterpret_cast<const char*>(&end), sizeof(end));
    }

    [[nodiscard]] static CustomInterval deserialize(std::istream& is) {
        CustomInterval ci;
        is.read(reinterpret_cast<char*>(&ci.start), sizeof(ci.start));
        is.read(reinterpret_cast<char*>(&ci.end), sizeof(ci.end));
        if (!is) throw std::runtime_error("Failed to deserialize CustomInterval");
        return ci;
    }

    // ... other key_type_base requirements (operator<, overlap, aggregate, etc.)
};
```

## Custom Data Type Serialization

For custom data types stored as associated data in keys, you have two options:

### Option 1: Member methods

Add `serialize` and `deserialize` methods directly to your type:

```cpp
struct Annotation {
    std::string name;
    double score;

    void serialize(std::ostream& os) const {
        genogrove::data_type::serializer<std::string>::write(os, name);
        os.write(reinterpret_cast<const char*>(&score), sizeof(score));
    }

    [[nodiscard]] static Annotation deserialize(std::istream& is) {
        auto name = genogrove::data_type::serializer<std::string>::read(is);
        double score;
        is.read(reinterpret_cast<char*>(&score), sizeof(score));
        if (!is) throw std::runtime_error("Failed to deserialize Annotation");
        return {std::move(name), score};
    }
};
```

### Option 2: Specialize serialization_traits

For third-party types you cannot modify, specialize `serialization_traits`:

```cpp
// Include the genogrove header that declares serialization_traits.

template<>
struct genogrove::data_type::serialization_traits<ThirdPartyType> {
    static void serialize(std::ostream& os, const ThirdPartyType& value) {
        // write fields to os
    }

    static ThirdPartyType deserialize(std::istream& is) {
        // read fields from is and return constructed object
    }
};
```

### What works automatically

- **Trivially copyable types** (`int`, `double`, `uint32_t`, etc.): serialized via `memcpy`
- **`std::string`**: built-in specialization (length-prefixed)
- **Built-in key types** (`interval`, `genomic_coordinate`, `numeric`, `kmer`): member methods provided
- **`gio::bed_entry`** (with its nested `gio::block_info`): member `serialize`/`deserialize` provided, so a `grove` storing them can be persisted directly. The `.gg` files produced by the `idx` CLI subcommand are this form. `gio::rgb_color` and `gio::thick_info` are trivially copyable and serialize automatically.

## Source Stream Must Be Seekable for Concatenated Payloads

`grove::deserialize()` uses zlib's streaming decoder, which may read past the end of the compressed payload into its input buffer before it detects that the payload is complete. To preserve any bytes that follow the grove (e.g., concatenated payloads, sentinel markers, file tails), the internal `inflate_streambuf` rewinds over the unconsumed bytes via `source.seekg(...)`. **The source stream must therefore be seekable when anything follows the grove in the stream.** On non-seekable sources (pipes, sockets, custom non-seekable streambufs) the seek fails and `deserialize()` throws:

> `inflate_streambuf: source stream is not seekable; concatenated payloads require a seekable source`

For a single-payload `.gg` file loaded via `std::ifstream`, this requirement is automatically satisfied because file streams are seekable. The requirement matters only for the concatenated-payload pattern (registry then grove from the same stream, multiple grove payloads back-to-back, sentinel trailers).

If you must deserialize from a non-seekable source, copy it into a `std::stringstream` first:

```cpp
std::stringstream buf;
buf << non_seekable_source.rdbuf();  // drain into a seekable buffer
auto g = gst::grove<...>::deserialize(buf);
```

## Important Notes

- `grove::deserialize()` returns a grove **by value**. Because `grove` is a move-only type (copy is deleted), the return relies on Named Return Value Optimization (NRVO) or implicit move. No special handling is needed: assign the result to a local variable as in the examples above.
- All `deserialize` methods (`node::deserialize`, `grove::deserialize`, `registry::deserialize`, `serialization_traits::deserialize`) throw `std::runtime_error` on corrupt or truncated streams.
- `node::deserialize` additionally validates B+ tree invariants (`num_keys < order`, `num_children <= order`).
- `registry::deserialize` provides a **strong exception guarantee**: if the stream throws or is truncated, the singleton is left exactly as it was before the call. The new state is built in local containers and committed via a noexcept move-assign only after the read loop completes. It also rejects header counts that exceed the `id_type` capacity (`"Failed to deserialize registry: entry count exceeds id_type capacity"`) and streams containing duplicate keys (`"Failed to deserialize registry: duplicate key"`). See {doc}`data_types/registry` for details.
- Graph edges added via `add_edge()` or `link_if()` are now persisted during serialization and restored on deserialization.
- **Breaking format change**: The serialized format now includes graph edges after external keys. Files serialized with older versions are incompatible and must be re-created.
- All `deserialize` methods are marked `[[nodiscard]]` to prevent accidentally discarding the result.
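
The strong exception guarantee described for `registry::deserialize` follows a common stage-then-commit pattern, which can be sketched as follows. This is an illustration under stated assumptions, not the library's code: `toy_registry`, its `save`/`load` methods, the 32-bit count and length fields, and `truncated_load_keeps_state` are all hypothetical. The sketch reads everything into a local container and replaces the live state only after the whole stream has been read successfully.

```cpp
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a registry; not Genogrove's actual class.
class toy_registry {
public:
    void add(std::string name) { names_.push_back(std::move(name)); }
    const std::vector<std::string>& names() const { return names_; }

    // Assumed layout: 32-bit entry count, then (32-bit length, bytes) per entry.
    void save(std::ostream& os) const {
        const auto count = static_cast<std::uint32_t>(names_.size());
        os.write(reinterpret_cast<const char*>(&count), sizeof(count));
        for (const auto& n : names_) {
            const auto len = static_cast<std::uint32_t>(n.size());
            os.write(reinterpret_cast<const char*>(&len), sizeof(len));
            os.write(n.data(), static_cast<std::streamsize>(len));
        }
    }

    // Replaces the contents from the stream; on failure *this is left untouched.
    void load(std::istream& is) {
        std::uint32_t count = 0;
        is.read(reinterpret_cast<char*>(&count), sizeof(count));
        if (!is) throw std::runtime_error("truncated header");

        std::vector<std::string> staged;  // new state is built locally first
        for (std::uint32_t i = 0; i < count; ++i) {
            std::uint32_t len = 0;
            is.read(reinterpret_cast<char*>(&len), sizeof(len));
            if (!is) throw std::runtime_error("truncated entry length");
            std::string s(len, '\0');
            is.read(s.data(), static_cast<std::streamsize>(len));
            if (!is) throw std::runtime_error("truncated entry bytes");
            staged.push_back(std::move(s));
        }
        // Commit step: move-assigning a vector with the default allocator does not throw.
        names_ = std::move(staged);
    }

private:
    std::vector<std::string> names_;
};

// Feeds a truncated payload to load() and checks that the old state survives.
bool truncated_load_keeps_state() {
    toy_registry source;
    source.add("SampleA_liver");
    source.add("SampleB_brain");
    std::stringstream full;
    source.save(full);
    const std::string bytes = full.str();

    std::istringstream cut(bytes.substr(0, bytes.size() - 3));  // drop the tail
    toy_registry target;
    target.add("existing");
    try {
        target.load(cut);
    } catch (const std::runtime_error&) {
        return target.names().size() == 1 && target.names()[0] == "existing";
    }
    return false;  // load() should have thrown on the truncated input
}
```

Because every throwing operation happens before the commit step, a failure partway through the read loop only discards the local `staged` vector.
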