PgmNr D1522: Library preparation effects on estimating satellite DNA abundance from short-read sequencing.

Authors:
Sarah E. Sander; Kevin H.-C. Wei; Andrew G. Clark; Daniel A. Barbash


Institutes:
Cornell University, Ithaca, NY.


Keyword: next-generation sequencing

Abstract:

Tandem repetitive DNA, also known as satellite DNA, is a major component of most eukaryotic genomes, including those of Drosophila and humans. Satellite DNAs can exist as selfish genomic parasites that propagate in the genome at the expense of the host, but can also form essential chromosome structures including centromeres, telomeres, and sub-telomeric regions. Despite these roles in chromosome structure, satellite DNAs differ widely in sequence, location, and abundance in the genome, even between closely related species. There is a basic molecular and theoretical understanding of how repetitive DNA can expand or contract within a genome. However, few studies have tested these models using genome-wide data, because the repetitive part of the genome is difficult to sequence accurately and to quantify with current technology.

Here we describe the effects of using newly developed PCR-free library preparation methods on the assessment of satellite DNA abundance from Illumina sequencing reads. We performed Illumina sequencing on replicate libraries constructed with PCR-free, 8-cycle PCR, and 12-cycle PCR methods from a single DNA extraction. We quantified satellite abundance from raw sequencing reads using our k-mer-based algorithm, k-Seek. We then compared satellite abundances across library preparations using correlations, principal components analysis (PCA), and discriminant analysis of principal components (DAPC). We find that the different preparation methods produce libraries that are clearly distinguishable from one another, and that much of the difference between libraries is driven by satellite sequences that are underrepresented in conventional PCR-based library preps, such as the highly abundant AATAT satellite. These satellites are much better represented in sequencing reads derived from PCR-free libraries. Although PCR-induced bias skews absolute abundances, it appears to be quantitatively stable, so that contrasts across lines whose libraries are all constructed in the same way can still be reliable. Nevertheless, PCR-free methods are clearly preferred if there is interest in quantitative assessment of repeat composition.
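
As a rough illustration of the kind of analysis described above, the following Python sketch counts short tandem repeat units in reads and correlates log abundances between two libraries. It is not the k-Seek implementation: the function names (count_tandem_units, canonical_unit, log_abundance_correlation), the min_copies threshold, and the log10(count + 1) transform are all assumptions made for this example, and the simple scan does not collapse units that are themselves repeats of a shorter unit.

```python
from collections import Counter
from math import log10, sqrt

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical_unit(unit):
    """Return the lexicographically smallest rotation of a repeat unit,
    considering both strands, so that AATAT and ATATT share one label."""
    rc = unit.translate(_COMPLEMENT)[::-1]
    rotations = [s[i:] + s[:i] for s in (unit, rc) for i in range(len(s))]
    return min(rotations)


def count_tandem_units(reads, k, min_copies=3):
    """Count copies of k-bp units that occur in perfect tandem runs.

    Hypothetical simplification: only exact, uninterrupted repeats are
    counted, and a run must contain at least `min_copies` copies.
    """
    counts = Counter()
    for read in reads:
        i = 0
        while i + k <= len(read):
            unit = read[i:i + k]
            copies = 1
            # Extend the run while the next k bases match the unit.
            while read[i + copies * k:i + (copies + 1) * k] == unit:
                copies += 1
            if copies >= min_copies and set(unit) <= set("ACGT"):
                counts[canonical_unit(unit)] += copies
                i += copies * k  # skip past the whole run
            else:
                i += 1
    return counts


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)  # undefined if either input is constant


def log_abundance_correlation(lib_a, lib_b):
    """Correlate log10(count + 1) over all units seen in either library."""
    units = sorted(set(lib_a) | set(lib_b))
    xs = [log10(lib_a[u] + 1) for u in units]
    ys = [log10(lib_b[u] + 1) for u in units]
    return pearson(xs, ys)


if __name__ == "__main__":
    # Toy reads invented for illustration; not data from this study.
    toy_reads = ["AATATAATATAATATAATATGC", "ATATTATATTATATT"]
    print(count_tandem_units(toy_reads, k=5))
```

In practice, counts would be normalized by sequencing depth before libraries are compared, and the per-library abundance vectors could then be passed to a PCA routine in place of the pairwise correlation shown here.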