Table of Contents: 2015 MAY–JUNE No. 404
RefSeq Release 70 Now Available with Re-annotated Bacterial Genomes for Uniformity Across Genomes and Species. NLM Tech Bull. 2015 May-Jun;(404):b7.
[Editor's Note: This is a reprint of an announcement published on NLM/NCBI List ncbi-announce, an e-mail announcement list available from the NLM/NCBI. To subscribe to this list, please see the ncbi-announce -- NCBI announcements and updates page.]
The full Reference Sequence (RefSeq) release 70 is now available online, on the FTP site, and through NCBI's programming utilities, with 74,720,563 records describing 50,351,119 proteins, 11,310,700 RNAs, and sequences from 54,118 different organisms.
This release reflects a large update of complete bacterial RefSeq genomes, proteins, and Genes. To make genome annotation comparable across genomes and species, NCBI has re-annotated all RefSeq prokaryotic genomes using NCBI's genome annotation pipeline. Previously, the same gene in the same species, with an identical sequence across the gene's genomic region, could be annotated with a different protein simply because different annotation methods were used. Now, the same gene in the same species with the same sequence will be annotated with exactly the same protein in RefSeq.
In addition, each annotated CDS used to be tracked with a distinct RefSeq protein accession number. However, due to identical protein sequences being found on multiple re-annotated RefSeq genomes and extensive bacterial genome sequencing, the RefSeq prokaryotic protein dataset rapidly became very redundant. Rather than flood the protein database with thousands of completely identical proteins, NCBI has adopted the use of non-redundant WP proteins for RefSeq prokaryotic genomes annotated with NCBI pipelines, which we first announced in June 2013. Now, if the identical protein sequence appears on more than one RefSeq genome, NCBI simply reuses the existing WP accession number instead of creating a new accession for each new occurrence and genome. As a result, over 7 million proteins were removed, significantly reducing protein redundancy for the prokaryotic dataset. A removed accession report (release70.removed-records.gz) and a supplemental data mapping file (release70.bacterial-reannotation-report.txt.gz) are available in the release-catalog directory on FTP.
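The accession-reuse rule described above can be illustrated with a minimal Python sketch. It is not NCBI's implementation: the class name, the exact-match comparison on the normalized sequence, and the zero-padded counter used to mint identifiers are all assumptions made for illustration, and the generated strings only imitate the look of WP accessions.

```python
from itertools import count


class NonRedundantProteinRegistry:
    """Toy model: one accession per distinct protein sequence.

    Hypothetical illustration only; real WP accessions are assigned by NCBI.
    """

    def __init__(self):
        self._accession_by_sequence = {}  # protein sequence -> accession
        self._genomes_by_accession = {}   # accession -> set of genome ids
        self._next_id = count(1)          # assumed numbering scheme

    def register(self, genome_id, protein_sequence):
        """Return the accession for a protein annotated on a genome.

        An identical sequence seen before reuses its existing accession;
        only a new sequence gets a new accession.
        """
        key = protein_sequence.strip().upper()
        accession = self._accession_by_sequence.get(key)
        if accession is None:
            accession = f"WP_{next(self._next_id):09d}"
            self._accession_by_sequence[key] = accession
        self._genomes_by_accession.setdefault(accession, set()).add(genome_id)
        return accession

    def genomes_for(self, accession):
        """Return the genome ids on which this accession is annotated."""
        return set(self._genomes_by_accession.get(accession, set()))

    def distinct_protein_count(self):
        """Return the number of distinct protein sequences registered."""
        return len(self._accession_by_sequence)


if __name__ == "__main__":
    registry = NonRedundantProteinRegistry()
    first = registry.register("genome_A", "MKTAYIAKQR")
    second = registry.register("genome_B", "MKTAYIAKQR")  # identical protein
    third = registry.register("genome_B", "MKTAYIAKQL")   # different protein
    print(first == second, first == third)
    print(registry.distinct_protein_count())
```

In this sketch, annotating the same protein on a second genome adds that genome to the accession's set rather than creating a new record, which is the effect that reduces redundancy in the protein dataset.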
This is a first step toward managing data in a world where genomes are sequenced for assays, rather than to discover novel proteins. We appreciate that this is a new and major change for RefSeq prokaryotic genomes, but it is also a necessary one as the volume of disease-outbreak and other isolate sequencing continues to increase rapidly. For more information on changes to protein records, nucleotide records, the impact on NCBI Gene, and future plans, please see the latest story on NCBI News: http://www.ncbi.nlm.nih.gov/news/05-07-2015-refseq-release-70-reannotation.
NCBI has created documentation to explain these changes in detail:
If you have further questions that are not addressed in the documentation, you can write to the Help Desk at info@ncbi.nlm.nih.gov or use the feedback form on the RefSeq page.