Inferring TE haplotype markers from population genomics data using hierarchical clustering

Transposable elements (TEs) are genetic parasites that invade genomes and manipulate their host’s molecular machinery to replicate. Just as organisms must compete for resources and evade predators in their ecological setting, TEs do the same within their own genomic ecology: they must compete with each other for limited genomic space while evading piRNAs, the genome’s immune system. And as with their macro-organismal counterparts, these selective forces may be driving a radiation of genetic diversity among TEs. We aim to characterize this diversity by developing new methods to infer TE haplotype markers from unassembled short-read data. We define TE haplotype markers as sets of SNPs that are physically linked with each other on the same TE sequence. To infer these markers, we reason that the copy numbers of SNPs on the same TE sequence should be correlated across strains. We align short reads to TE consensus sequences using the ConTExt pipeline and estimate the copy number of each allele. We then apply hierarchical clustering to the correlations of allele copy number to infer the degree of genetic linkage between SNPs. The result is sets of SNPs that are inferred to be linked in sequence and that serve as markers distinguishing TE variants. We benchmarked this approach using simulations of short-read data, and then used it to characterize the genetic variation of recently active TEs within 85 strains of the Global Diversity Lines (GDL), a population genomics resource of Drosophila melanogaster genomes. To verify these TE haplotype markers, we aligned TE consensus sequences to PacBio assemblies and compared the full-length TE haplotypes in the PacBio sequences to our inferred TE haplotype markers. Our analysis of the GDL revealed a great diversity of TE haplotype markers, many of which are enriched in specific geographic populations.
Signatures of population structure for these TEs can be largely attributed to the expansion of only a small number of TE variants distinguished by their haplotype markers. The TE variants that expanded are mostly found at low frequency globally, suggesting that they are ancestral variants that rose to higher copy number in a subdivided population.
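
The core inference step, clustering alleles by the correlation of their copy numbers across strains, can be illustrated with a minimal sketch. This is not the implementation used in this work: the distance (1 minus Pearson correlation), the average-linkage rule, the cutoff `max_distance`, the allele labels, and the toy copy-number values are all assumptions made for illustration, and the ConTExt alignment and copy-number estimation steps are not shown.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    dev_x = [v - mean_x for v in x]
    dev_y = [v - mean_y for v in y]
    cov = sum(a * b for a, b in zip(dev_x, dev_y))
    var_x = sum(a * a for a in dev_x)
    var_y = sum(b * b for b in dev_y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant allele carries no linkage signal
    return cov / sqrt(var_x * var_y)


def cluster_alleles(copy_number, max_distance=0.3):
    """Group alleles whose copy numbers covary across strains.

    copy_number maps a (hypothetical) allele label to its estimated copy
    number in each strain. Clusters are merged by average linkage on the
    distance 1 - r until the closest pair exceeds max_distance (an
    assumed, illustrative cutoff).
    """
    names = sorted(copy_number)
    dist = {
        frozenset((a, b)): 1.0 - pearson(copy_number[a], copy_number[b])
        for a, b in combinations(names, 2)
    }
    clusters = [[name] for name in names]
    while len(clusters) > 1:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            pairs = [(a, b) for a in clusters[i] for b in clusters[j]]
            d = sum(dist[frozenset(p)] for p in pairs) / len(pairs)
            if best is None or d < best[0]:
                best = (d, i, j)
        d, i, j = best
        if d > max_distance:
            break  # remaining clusters are too weakly correlated to merge
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # safe because j > i
    return sorted(sorted(c) for c in clusters)


# Toy data (invented): copy number of five alleles in six strains.
copy_number = {
    "pos120:A": [2, 0, 5, 1, 0, 3],
    "pos455:T": [2, 1, 5, 1, 0, 3],
    "pos980:G": [3, 0, 4, 1, 0, 3],
    "pos300:C": [0, 4, 1, 0, 6, 1],
    "pos610:G": [0, 5, 1, 0, 6, 2],
}
markers = cluster_alleles(copy_number)
print(markers)
```

On this toy input the sketch groups the alleles at positions 120, 455, and 980 into one candidate haplotype marker and those at positions 300 and 610 into another, because copy numbers within each group rise and fall together across strains while the two groups are negatively correlated. With real data, the cutoff and linkage rule would need to be calibrated, for example against simulated short-read data.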