Collaborative Cross Graphical Genome

The mouse reference is one of the most widely used and accurately assembled mammalian genomes and is the foundation for a wide range of bioinformatics and genetics tools. However, it represents the genome of a single inbred mouse strain, C57BL/6J. Recently, inexpensive and fast genome sequencing has enabled the assembly of other laboratory mouse strains at a quality approaching that of the reference, but using these alternative assemblies in standard genomics analysis pipelines presents significant challenges. Moreover, new genetic resource populations are being developed, such as the Collaborative Cross (CC), whose genomes are mosaics of multiple inbred mouse strains. This creates a pressing need to integrate multiple genomes into standard sequence analysis. It has been suggested that a pangenome reference assembly, which incorporates multiple genomes into a single representation, is the path forward, but there are few standards for, or instances of, practical pangenome representations suitable for large eukaryotic genomes. We present a pragmatic graph-based pangenome representation as a genomic resource for the widely used recombinant-inbred CC mouse strains and their eight founder genomes. The pangenome representation leverages existing standards for genomic sequence representations, with backward-compatible extensions to describe graph topology and genome-specific annotations along paths. It packs 83 mouse genomes (8 founders and 75 CC strains) into a single graph representation that captures important notions related to genomes, such as identity-by-descent and highly variable genomic regions. The introduction of special anchor nodes with sequence content provides a valid coordinate framework that divides large eukaryotic genomes into homologous segments and addresses most of the position-referencing issues that arise in graph-based representations. Parallel edges between anchors place variants within a context that facilitates orthogonal genome comparison and visualization.
Furthermore, the graph structure allows annotations to be placed in multiple genomic contexts and simplifies their maintenance as the assembly improves. The CC reference pangenome provides an open framework for graph-based tool chain development and analysis.
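To make the anchor-and-edge idea concrete, the following is a minimal sketch and not the representation used for the CC resource itself. It assumes a toy in-memory model in which anchor nodes carry sequence shared by all genomes, and each parallel edge between two anchors carries one alternative sequence together with the set of strains that follow it. The class names, method names, strain labels, and sequences are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Anchor:
    """Hypothetical anchor node: sequence shared by every genome."""
    name: str
    seq: str


@dataclass(frozen=True)
class Edge:
    """Hypothetical edge: one alternative sequence between two anchors."""
    src: str
    dst: str
    seq: str
    strains: frozenset  # labels of the genomes whose path uses this edge


class TinyPangenome:
    """Toy graph: anchors joined by parallel, strain-labelled edges."""

    def __init__(self):
        self.anchors = {}
        self.out_edges = {}

    def add_anchor(self, name, seq):
        self.anchors[name] = Anchor(name, seq)
        self.out_edges.setdefault(name, [])

    def add_edge(self, src, dst, seq, strains):
        self.out_edges[src].append(Edge(src, dst, seq, frozenset(strains)))

    def path_sequence(self, strain, start):
        """Spell out one strain's linear sequence by walking its path."""
        parts = []
        node = start
        while True:
            parts.append(self.anchors[node].seq)
            # Pick the outgoing edge labelled with this strain, if any.
            nxt = [e for e in self.out_edges[node] if strain in e.strains]
            if not nxt:
                break
            parts.append(nxt[0].seq)
            node = nxt[0].dst
        return "".join(parts)

    def strains_sharing_edge(self, strain, src):
        """Strains that take the same edge as `strain` when leaving `src`."""
        for e in self.out_edges[src]:
            if strain in e.strains:
                return set(e.strains)
        return set()


def build_example():
    """Two anchors joined by two parallel edges (made-up data)."""
    g = TinyPangenome()
    g.add_anchor("A1", "ACGT")
    g.add_anchor("A2", "TTGA")
    g.add_edge("A1", "A2", "CC", {"founder_1", "cc_strain_1"})
    g.add_edge("A1", "A2", "CG", {"founder_2"})
    return g


if __name__ == "__main__":
    g = build_example()
    for strain in ("founder_1", "founder_2", "cc_strain_1"):
        print(strain, g.path_sequence(strain, "A1"))
```

In this toy model, the recombinant strain and one founder share an edge between the same pair of anchors, which is a simplified stand-in for the way a shared segment could be represented once rather than per genome. The variant between the two parallel edges is also positioned relative to its flanking anchors rather than to a single linear reference coordinate.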