An Online Viewer and Database for the Collaborative Cross Graphical Genome
A pan-genome is a collection of genomes that are organized in a single representation and used jointly in downstream analysis. As assemblies of multiple genomes from the same species become available, the trend is to build pan-genomes as the new reference. We previously built a pan-genome called the Collaborative Cross Graphical Genome (CCGG) for a mouse population known as the Collaborative Cross (CC). The CC is a widely used genetic reference mouse population derived from eight genetically diverse founder strains. The CCGG was constructed by merging the eight founder assemblies into a single graph.
We present the CCGG probe database for two purposes: 1) to assess the founder assemblies represented in the CCGG; 2) to characterize newly sequenced CC samples. In the CCGG, edges represent the sequence diversity among founder strains. The CCGG probe database was constructed by first selecting probes from edges to highlight differences between founder assemblies. The selection of edge probes was designed to be unbiased relative to any founder assembly represented in the CCGG. The probes contain at least one variant on 99% of edges and cover every base pair within the CCGG. We then queried the raw sequence data of 96 CC samples and 8 newly sequenced CC founders for the read count of each probe (i.e., the number of occurrences). The read counts were organized and compressed to speed up online queries. This database has been used in three applications: compressing the CCGG, resolving recombination boundaries in the CC, and locating assembly errors.
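As a rough illustration of what a probe read count is, the following minimal Python sketch counts how often each probe occurs in a set of raw reads, on either strand. It is not the pipeline used to build the CCGG probe database: the function names `reverse_complement` and `count_probes`, the assumption that all probes share one fixed length, and the choice to count both strands are all hypothetical and made only for this example.

```python
from collections import Counter

# Translation table for complementing DNA bases (N stays N).
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def count_probes(probes, reads):
    """Count occurrences of each probe in the reads, on either strand.

    Assumes (for this sketch only) that all probes have the same length k.
    Returns a dict mapping each probe to its read count, including zeros.
    """
    if not probes:
        return {}
    k = len(probes[0])
    if any(len(p) != k for p in probes):
        raise ValueError("all probes must have the same length")

    # Map every probe and its reverse complement back to the probe itself,
    # so a single k-mer lookup covers both strands.
    lookup = {}
    for p in probes:
        lookup.setdefault(p, set()).add(p)
        lookup.setdefault(reverse_complement(p), set()).add(p)

    counts = Counter({p: 0 for p in probes})
    for read in reads:
        # Slide a window of length k across the read.
        for i in range(len(read) - k + 1):
            for p in lookup.get(read[i:i + k], ()):
                counts[p] += 1
    return dict(counts)


if __name__ == "__main__":
    example_probes = ["AAC", "GGA"]          # toy probes, not real CCGG probes
    example_reads = ["TTAACGG", "GTTCC"]     # toy reads
    print(count_probes(example_probes, example_reads))
```

In a table of such counts across samples, a probe with a count of zero in a sample suggests that the corresponding edge sequence is absent from that sample, which is the kind of signal the applications listed above rely on.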
We also present an online visualizer for the CCGG. Currently, few bioinformatics tools have been developed for visualizing graph-based pan-genomes. As a result, their structure is hard to understand and their usage is often limited to small groups. Thus, to promote the utilization of the CCGG, we provide an online CCGG viewer that allows users to navigate any genomic region in the CCGG at different scales. The CCGG viewer currently supports four types of visualization: the anchor distribution, the edge distribution, the graph topology, and the CCGG probe database. The CCGG viewer is available online at http://devel.csbio.unc.edu/GraphicalGenome/viewer/.