PgmNr P2022: Genotype calling from population-genomic sequencing data.

Authors:
T. Maruki; M. Lynch


Institutes:
Indiana University, Bloomington, IN.


Abstract:

Genotype calling plays an important role in population-genomic analyses.  Although many statistical methods have been developed, the performance of the widely used genotype-calling methods is not well understood, especially when the population deviates from Hardy-Weinberg equilibrium (HWE).  In this study, we develop a maximum-likelihood (ML) method for calling genotypes that incorporates population-level prior estimates of genotype frequencies and error rates to improve the accuracy of genotypes called from low-coverage sequencing data.  We compare the performance of the proposed method with that of GATK and Samtools using computer simulations under conditions in which the population may deviate from HWE.  The results show that the proposed method yields more accurate genotype calls than these widely used methods.
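
To illustrate why population-level genotype frequencies matter at low coverage, the following is a minimal sketch and not the authors' estimator.  It assumes a biallelic site with alleles labeled R and A, a symmetric per-read error rate that is already known, and genotype frequencies supplied by the caller; the function names and the example frequencies are hypothetical.

```python
from math import comb

def genotype_likelihoods(n_ref, n_alt, error_rate):
    """Binomial likelihood of the observed read counts under each diploid genotype.

    Assumes a biallelic site and a symmetric error rate (illustrative model only).
    """
    n = n_ref + n_alt
    # Probability that a single read shows allele A under each genotype.
    p_alt = {"RR": error_rate, "RA": 0.5, "AA": 1.0 - error_rate}
    return {
        g: comb(n, n_alt) * p**n_alt * (1.0 - p)**n_ref
        for g, p in p_alt.items()
    }

def call_genotype(n_ref, n_alt, genotype_freqs, error_rate):
    """Weight each likelihood by its population genotype frequency and pick the best."""
    likes = genotype_likelihoods(n_ref, n_alt, error_rate)
    weighted = {g: genotype_freqs[g] * likes[g] for g in likes}
    total = sum(weighted.values())
    posteriors = {g: w / total for g, w in weighted.items()}
    best = max(posteriors, key=posteriors.get)
    return best, posteriors

# Same reads (3 R, 1 A), two hypothetical populations.
hwe_freqs = {"RR": 0.81, "RA": 0.18, "AA": 0.01}      # consistent with HWE
deficit_freqs = {"RR": 0.89, "RA": 0.02, "AA": 0.09}  # heterozygote deficit

hwe_call = call_genotype(3, 1, hwe_freqs, error_rate=0.01)
deficit_call = call_genotype(3, 1, deficit_freqs, error_rate=0.01)
print(hwe_call[0], deficit_call[0])
```

With only four reads, the call switches from heterozygote to reference homozygote when the assumed genotype frequencies reflect a heterozygote deficit, which is the kind of sensitivity to departures from HWE that the abstract describes.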

        In addition to the method for low-coverage sequencing data, we develop a second ML method for calling genotypes from high-coverage sequencing data, which does not require prior population-level estimates and enables identification of polymorphisms with an arbitrary number of alleles.  Using computer simulations, we examine how high the coverage must be to accurately characterize polymorphisms with the proposed method.  Taking the results of this performance evaluation into account, we apply the method to high-coverage (mean 18×) whole-genome sequencing data from 83 clones from a population of Daphnia pulex.  Our results, obtained using multiple procedures to minimize the analysis of sites with mismapped reads, indicate that a nonnegligible fraction of polymorphisms in this species is triallelic, demonstrating the importance of relaxing the assumption of biallelic polymorphisms.  Because of its efficiency and flexibility, the proposed method can in principle be extended to population-genomic analyses of polyploid data.  As an example, we extend it to triploid sequencing data and examine its performance using computer simulations.  Our results show that calling accurate genotypes from triploid sequencing data requires much higher coverage than calling them from diploid sequencing data, a finding that will help researchers design sequencing strategies for population-genomic studies of polyploid organisms.
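
The coverage requirement for triploids can be illustrated with a small exact calculation.  This is a simplified sketch under stated assumptions, not the method described above: it considers a single biallelic site with alleles A and B, a known symmetric error rate, a fixed coverage, and a prior-free ML call over allele dosages.  All function names are hypothetical.

```python
from math import comb

def b_read_prob(dosage, ploidy, error_rate):
    """Probability that one read shows allele B when the genotype carries `dosage` copies of B."""
    f = dosage / ploidy
    return f * (1.0 - error_rate) + (1.0 - f) * error_rate

def ml_dosage(n_b, coverage, ploidy, error_rate):
    """Dosage of B (0..ploidy) that maximizes the binomial likelihood of the read counts."""
    def likelihood(dosage):
        p = b_read_prob(dosage, ploidy, error_rate)
        return p**n_b * (1.0 - p)**(coverage - n_b)
    return max(range(ploidy + 1), key=likelihood)

def call_accuracy(true_dosage, coverage, ploidy, error_rate=0.01):
    """Exact probability that the ML call equals the true dosage at a fixed coverage."""
    p_true = b_read_prob(true_dosage, ploidy, error_rate)
    total = 0.0
    for n_b in range(coverage + 1):
        if ml_dosage(n_b, coverage, ploidy, error_rate) == true_dosage:
            # Add the probability of observing exactly n_b B-reads.
            total += comb(coverage, n_b) * p_true**n_b * (1.0 - p_true)**(coverage - n_b)
    return total

# Genotypes carrying one copy of B: AB in a diploid, AAB in a triploid.
diploid_acc = call_accuracy(true_dosage=1, coverage=10, ploidy=2)
triploid_acc = call_accuracy(true_dosage=1, coverage=10, ploidy=3)
triploid_acc_high = call_accuracy(true_dosage=1, coverage=60, ploidy=3)
print(diploid_acc, triploid_acc, triploid_acc_high)
```

Under these assumptions, the triploid call is less accurate than the diploid call at the same coverage because AAB and ABB differ only in the expected proportion of B reads (about 1/3 versus 2/3), and separating them reliably takes more reads.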