PgmNr M5104: Utilizing NCBI’s Mouse Genome Resources.

Authors:
Tripti Gupta; Kelly McGarvey; Terence Murphy; Kim Pruitt


Institutes:
National Center for Biotechnology Information, National Library of Medicine, 8600 Rockville Pike, Bethesda, MD 20894.


Abstract:

Complete and accurate genome annotation is essential to researchers using genomic data. A major focus of the Reference Sequence (RefSeq) database at the National Center for Biotechnology Information (NCBI) is to provide an accurate and comprehensive annotation of the mouse genome through computational and manual curation. The RefSeq database contains annotated genomic, transcript, and protein sequence records derived from data in public sequence databases and from computation, curation, and collaboration. This combined approach yields a high-quality annotation that focuses on representing full-length, non-redundant sequence data and that is updated through re-annotation every 12 to 18 months. RefSeq provides whole-genome annotation of the reference strain C57BL/6J genome assembly maintained by the Genome Reference Consortium (GRC), including variation on alternate locus scaffolds, as well as annotation of the mixed-strain Celera assembly. In addition, complete annotations of 15 other rodents, including Peromyscus maniculatus bairdii and Rattus norvegicus, are available for download from the NCBI RefSeq Genomes FTP site (ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/vertebrate_mammalian/). To provide the most consistent and comprehensive annotation possible, RefSeq scientists manually curate genes in close collaboration with the Mouse Genome Informatics database, the Consensus Coding Sequence project, and the GRC. Manual curation methods continually incorporate new data sets, such as genome-wide promoter-associated epigenetic data and PolyA-Seq data, to improve annotation and to define manually annotated features on transcript and protein sequences.
RefSeq curation efforts have traditionally focused on representing full-length transcripts of protein-coding genes, primarily using transcript data, protein alignments, and published data as evidence. In recent years, however, we have incorporated additional data into our computational and manual curation methods, allowing for more complete and accurate annotation. For example, changes to NCBI’s eukaryotic annotation pipeline allowed the incorporation of RNA-Seq data, resulting in significant increases in the numbers of predicted protein-coding variants and non-coding transcripts. RNA-Seq data have been particularly valuable to our recent efforts to expand the representation of long non-coding RNAs. The current annotation of the GRCm38 assembly includes 31,500 predicted non-coding RNAs, a 26% increase relative to the previous annotation, which did not use RNA-Seq data. This poster provides an overview of NCBI’s mouse genome resources and highlights recent curation efforts by the RefSeq group.