PgmNr M255: GENCODE: using new technologies to improve reference mouse genome annotation.

Authors:
Mark Thomas; Jennifer Harrow; GENCODE Consortium


Institutes:
Wellcome Trust Sanger Institute, Hinxton, Cambridge, United Kingdom.


Abstract:

Understanding transcriptional complexity is important for the study of disease, especially now that CRISPR-Cas9 technologies are driving a genome-editing revolution. The GENCODE resource is now the default human gene set in the UCSC and Ensembl genome browsers, and we are currently working to improve the reference mouse genome annotation. With an emphasis on alternative splicing, our current release (M9) contains 115,125 transcripts from 21,971 protein-coding and 9,436 long non-coding genes. The increasing availability of next-generation sequencing data from RNAseq, CAGE and PolyAseq allows us to define transcribed regions with ever-increasing accuracy, which adds to the known transcriptional complexity of the genome. Identifying functional transcripts is particularly important, so that they can be distinguished from transcripts arising via stochastic events or spliceosomal errors. The function of most protein-coding transcripts is evident from the encoded protein, whereas the function of long non-coding transcripts is more difficult to determine.

To improve our functional understanding of transcripts, we are combining PhyloCSF comparative data with advances in ribosome profiling and mass spectrometry to assess the coding potential of transcripts. Combining these approaches not only allows us to improve protein-coding gene annotation, but also highlights how differences in the precise transcription start site (TSS) can influence the translational start of proteins. This work has resulted in a decrease in the total number of mouse protein-coding genes, while the number of pseudogenes has increased. The number of long non-coding RNA transcripts is also increasing, with longer reads from PacBio RNAseq and Capture-Seq experiments improving transcript annotation. With this increased effort, we now have regular releases of the mouse GENCODE gene set for the C57BL/6 reference genome. We are also collaborating with the Sanger Mouse Genomes project to extend annotation to de novo genome assemblies for the other laboratory mouse strains.