PgmNr D1520: Curation of transcript models with all available public sequencing reads.

Authors:
ZX Chen; B. Busby; J. Fear; H. Yang; B. Oliver


Institutes:
National Institutes of Health, Bethesda, MD.


Keyword: next-generation sequencing

Abstract:

A complete reference genome and accurate gene annotations are essential for all genetics and genomics research. The Berkeley Drosophila Genome Project (BDGP) release 6 of the D. melanogaster genome has significantly improved the completeness of the genome, especially for the Y chromosome and other heterochromatic regions. However, the current annotations were lifted over from release 5 and remain incomplete. To address this issue, we are taking a deeply data-driven approach to updating the gene annotations. Here we use 13,020 runs of publicly available RNA-seq data from the Sequence Read Archive (SRA), totaling 186 billion reads or 17 terabases, to reconstruct gene models. Furthermore, we extracted associated metadata from SRA, BioSample, Gene Expression Omnibus (GEO), and publications to annotate each sample with tissue, sex, stage, genotype, cell-type and sample-type information. The samples come from a variety of tissues (whole organism [5,158], head [4,859], ovary [1,023], brain [121], testis [69]), stages (adult [8,896], embryo [2,422], larva [973], pupa [243]) and cell types (mix [10,889], S2 [1,122], Kc167 [150], OSC [118], neuroblast [57]). With a study of this scale, we need not only to assess the quality of the individual samples but also to verify the associated metadata. We subjected each of these datasets to a strict set of quality control metrics. First, all samples were mapped to the D. melanogaster BDGP release 6 genome with HISAT2. Alignments were used to measure strandedness, mappability, 5’ or 3’ bias, RNA integrity and gene expression abundance. For gene model annotation we selected only stranded libraries, for a total of 47 billion mapped reads. This dataset shows a higher signal-to-noise ratio than single datasets such as modENCODE, which allows better distinction between intronic and exonic segments of gene models and provides clearer splicing evidence.
Current FlyBase annotations are very good, with only 13% of stranded SRA runs having more than 5% of uniquely mapped reads in intergenic regions. However, there are still cases of unannotated genes, UTR extensions, novel splicing events, and antisense transcripts. For example, we found an extended 5’ UTR of the protein-coding gene cut (ct), and an unannotated antisense transcript in the 5’ region of another protein-coding gene, bazooka (baz). These results indicate that published sequencing data are a rich resource for the curation of transcript models, one that will validate many of the outstanding current transcript models and add to the transcriptional complexity of the annotation.
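
The library selection and intergenic-read summary described above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the pipeline used in this work: the `RunQC` record, its field names, and the 0.25 strandedness margin are hypothetical; only the 5% intergenic cutoff comes from the abstract.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RunQC:
    """Hypothetical per-run QC summary derived from alignments."""
    run_id: str                 # placeholder run identifier
    sense_fraction: float       # fraction of reads matching the annotated gene strand
    intergenic_fraction: float  # fraction of uniquely mapped reads in intergenic regions


def is_stranded(run: RunQC, margin: float = 0.25) -> bool:
    # Assumption: an unstranded library sits near 0.5, while a stranded
    # library (either orientation) sits far from 0.5. The margin is illustrative.
    return abs(run.sense_fraction - 0.5) >= margin


def select_stranded(runs: List[RunQC]) -> List[RunQC]:
    # Keep only runs that look stranded under the assumed margin.
    return [run for run in runs if is_stranded(run)]


def share_above_intergenic_cutoff(runs: List[RunQC], cutoff: float = 0.05) -> float:
    # Share of runs whose intergenic read fraction exceeds the cutoff
    # (5% in the abstract). Returns 0.0 for an empty list.
    if not runs:
        return 0.0
    flagged = sum(1 for run in runs if run.intergenic_fraction > cutoff)
    return flagged / len(runs)
```

Applied to the stranded subset only, `share_above_intergenic_cutoff(select_stranded(runs))` gives the kind of figure reported above as 13%; how strandedness and intergenic fractions were actually computed is not specified here.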