Genomics, Proteomics & Bioinformatics (GPB) has published online the database article "GenBase: A Nucleotide Sequence Database" from the Beijing Institute of Genomics, Chinese Academy of Sciences (China National Center for Bioinformation). In this "Recommended Translations" column, Dr. Zhao Xuetong, a co-first author of the article, gives a systematic introduction to the construction and content of GenBase, a gene sequence database.
Introduction
Gene sequences and their annotations (including DNA, RNA, and protein sequence information) are among the core data underpinning gene function research. Over the past few decades, life scientists in mainland China have produced a large amount of genetic sequence data, much of which has been submitted to the International Nucleotide Sequence Database Collaboration (INSDC). Researchers in China and other countries currently rely heavily on INSDC for sequence submission and search. At the same time, rapid advances in sequencing technology have sharply increased the volume of sequence data, which makes timely and effective submission and sharing a considerable challenge. To safeguard the sovereignty and security of genetic sequence data in mainland China, and to meet the practical needs of mainland researchers in collecting, managing, and sharing gene sequence data, we have developed the gene sequence database GenBase (https://ngdc.cncb.ac.cn/genbase/), modeled on the GenBank database of the United States National Center for Biotechnology Information (NCBI).
GenBase is a core resource of the National Genomics Data Center (CNCB-NGDC). It uses GenBank's data model and, through an online bilingual submission system, supports the submission of multiple data types, including genomic DNA, mRNA, ncRNA, and nucleic acid sequences derived from organelles, viruses, plasmids, and phages. In addition, GenBase integrates all sequences from GenBank and updates them daily. It provides free, publicly accessible data, supports the distribution and sharing of international datasets, and makes data access easier for Chinese researchers.
Data model and accession numbers
GenBase's data model is compatible with the INSDC data model and allows sequences to be linked to two CNCB-NGDC metadata databases: BioProject and BioSample. Users can submit nucleic acid sequences from multiple species in a single batch. On submission, a unique submission ID with the "sub" prefix is generated. After quality control, each nucleic acid sequence is assigned an accession number that begins with "C_", followed by 2 letters, 6 digits, and a sequence version number as a suffix. Likewise, each protein sequence associated with a given nucleic acid sequence is assigned an accession number that begins with "C_", followed by 3 letters, 5 digits, and a sequence version number suffix (Figure 1). The version number changes whenever the sequence changes. Sequences are generated and stored in ASN.1 format and displayed online in GBFF format, both of which are commonly used by GenBank.
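The accession layout described above can be sketched as a pair of regular expressions. This is only an illustrative sketch: the function and class names are hypothetical, the "." before the version number is inferred from the example accession C_AA001108.1 quoted later in this article, and the restriction to uppercase letters is an assumption rather than something the source states.

```python
import re
from typing import NamedTuple, Optional

# "C_" + letters + digits + "." + version, per the description above.
# Assumptions: letters are uppercase, and the version is separated by ".".
NUCLEOTIDE_RE = re.compile(r"^C_([A-Z]{2})([0-9]{6})\.([0-9]+)$")
PROTEIN_RE = re.compile(r"^C_([A-Z]{3})([0-9]{5})\.([0-9]+)$")


class Accession(NamedTuple):
    kind: str      # "nucleotide" or "protein"
    letters: str   # 2 letters (nucleotide) or 3 letters (protein)
    digits: str    # 6 digits (nucleotide) or 5 digits (protein)
    version: int   # sequence version number


def parse_accession(text: str) -> Optional[Accession]:
    """Return the parsed accession, or None if the text matches neither layout."""
    candidate = text.strip()
    for kind, pattern in (("nucleotide", NUCLEOTIDE_RE), ("protein", PROTEIN_RE)):
        match = pattern.match(candidate)
        if match:
            return Accession(kind, match.group(1), match.group(2), int(match.group(3)))
    return None
```

For example, parse_accession("C_AA001108.1") yields a nucleotide accession with version 1, while a submission ID such as "sub000123" (a made-up value) is rejected because it follows neither layout.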
Figure 1 GenBase data model
Data submission and validation
Generic sequences
GenBase provides a user-friendly online submission system for generic sequences that supports both Chinese and English. It has nine sections: submitter, publication information, sequencing technology, sequence, collection or batch, category, metadata, features, and result preview (Figure 2). The submission system performs comprehensive real-time validation.
Figure 2 Overall GenBase architecture
In the "Sequence" step (Step 4), GenBase validates the uploaded sequence file online, checking the sequence format, sequence content, species name, molecule type, and genetic code. In the "Metadata" step (Step 7), GenBase collects 57 sequence-related metadata fields through an Excel template and validates them in real time. For example, fields such as sampling location and organelle/location are checked against controlled vocabularies, while collection date and latitude/longitude are checked against specific formats. In the "Features" step (Step 8), GenBase accepts annotation files in three formats: the 5-column GenBank feature table, GFF3, and Excel. Users choose one of these formats for sequence annotation, and GenBase validates the submitted annotation file in real time. For example, it verifies that every sequence ID in the annotation file exactly matches an ID in the nucleic acid sequence file, that the coordinates are integers, and that the gene annotations conform to INSDC specifications. Currently, 768 features and their corresponding annotation information are available for sequence annotation. After the user confirms all the information on the "Result Preview" page, GenBase uses table2asn (https://www.ncbi.nlm.nih.gov/genbase/table2asn/) to perform a final check of the submitted sequences and generate high-quality sequence files (e.g., GBFF and SQN files).
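The kinds of checks described above can be illustrated with a short sketch. It is not GenBase's implementation: the function names are hypothetical, the accepted date formats and the INSDC-style "39.90 N 116.40 E" latitude/longitude layout are assumptions, and real feature tables may use partial coordinates such as "<1", which this simplified check would reject.

```python
import re
from datetime import datetime

# Assumed INSDC-style layout: "<lat> N|S <lon> E|W", in decimal degrees.
LAT_LON_RE = re.compile(r"^([0-9]{1,2}(?:\.[0-9]+)?) ([NS]) ([0-9]{1,3}(?:\.[0-9]+)?) ([EW])$")
INTEGER_RE = re.compile(r"[0-9]+")


def valid_collection_date(value):
    """Accept YYYY-MM-DD, YYYY-MM, or YYYY (an assumed subset of formats)."""
    for fmt in ("%Y-%m-%d", "%Y-%m", "%Y"):
        try:
            datetime.strptime(value, fmt)
            return True
        except ValueError:
            continue
    return False


def valid_lat_lon(value):
    """Check both the layout and the numeric ranges of a latitude/longitude."""
    match = LAT_LON_RE.match(value)
    if not match:
        return False
    return float(match.group(1)) <= 90 and float(match.group(3)) <= 180


def check_annotation_rows(rows, sequence_ids):
    """rows: iterable of (seq_id, start, end) strings; returns error messages."""
    errors = []
    for row_no, (seq_id, start, end) in enumerate(rows, 1):
        # Every annotated ID must exactly match an ID in the sequence file.
        if seq_id not in sequence_ids:
            errors.append(f"row {row_no}: unknown sequence ID {seq_id!r}")
        # Coordinates must be plain integers in this simplified sketch.
        if not (INTEGER_RE.fullmatch(start) and INTEGER_RE.fullmatch(end)):
            errors.append(f"row {row_no}: coordinates must be integers")
    return errors
```

Returning a list of messages rather than raising on the first problem mirrors the idea of real-time validation, where a submitter would want to see every issue in a file at once.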
SARS-CoV-2 sequences
To make the submission of SARS-CoV-2 sequences more efficient, GenBase provides a dedicated SARS-CoV-2 submission module. Its submission process is similar to that for generic sequences, but it integrates the VADR program to annotate SARS-CoV-2 sequences automatically. In addition, GenBase provides a dedicated metadata Excel template for SARS-CoV-2 that is compatible with both INSDC and the Global Initiative on Sharing All Influenza Data (GISAID).
Statistics
Since its official launch on March 24, 2023, GenBase has grown rapidly in data volume (Figures 3A and B). As of April 16, 2024, GenBase had integrated 270,606,796 nucleic acid sequences and 305,810,135 protein sequences from GenBank and kept them up to date (Figure 3C). By the same date, it had received 67,399 nucleic acid sequences and 681,930 protein sequences submitted by users, covering 393 species (Figure 3C). Of the submitted data, 62,988 nucleic acid sequences (93%) and 613,351 annotated protein sequences (90%) have been released. Notably, of the 54,884 submitted SARS-CoV-2 genome sequences with standardized annotations, 52,147 have been released.
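The release percentages quoted above follow directly from the counts, as the short calculation below shows. The variable and function names are illustrative only; the figure for SARS-CoV-2 is derived here from the two counts and is not stated in the source.

```python
def percent(part, whole):
    """Share of `whole` represented by `part`, as a percentage."""
    return 100 * part / whole


# Counts as reported for April 16, 2024.
released_nucleotide = percent(62_988, 67_399)    # released / submitted nucleic acid sequences
released_protein = percent(613_351, 681_930)     # released / submitted protein sequences
released_sars_cov_2 = percent(52_147, 54_884)    # released / submitted SARS-CoV-2 genomes

for label, value in [
    ("nucleic acid", released_nucleotide),
    ("protein", released_protein),
    ("SARS-CoV-2", released_sars_cov_2),
]:
    print(f"{label}: {value:.1f}% released")
```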
Figure 3 GenBase statistics (as of April 16, 2024)
Retrieval and download
GenBase offers an advanced search function with 31 search fields, along with a search history that makes it easy to review earlier queries. Users can refine search results with filters such as species, data source, and data type, and can sort them by options such as accession number, date modified, organism, and sequence length. GenBase provides four data display formats and supports batch downloads to meet different needs. To facilitate bulk download of FASTA files, a REST API has been developed (e.g., https://ngdc.cncb.ac.cn/genbase/api/file/fasta?acc=C_AA001108.1). In addition, an FTP site (https://download2.cncb.ac.cn/genbase/daily/) lets users download the nucleic acid and protein sequences that GenBase releases each day.
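As a minimal sketch of how the REST endpoint quoted above might be used from a script, the example below builds the URL for one accession and downloads several accessions in a loop. Only the single-accession URL pattern comes from the article; the helper names, the UTF-8 decoding, and the one-request-per-accession approach are assumptions, and the service's rate limits or batch parameters, if any, are not described in the source.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint taken from the example URL in the article.
FASTA_ENDPOINT = "https://ngdc.cncb.ac.cn/genbase/api/file/fasta"


def fasta_url(accession):
    """Build the download URL for a single accession."""
    return f"{FASTA_ENDPOINT}?{urlencode({'acc': accession})}"


def fetch_fasta(accession, timeout=30):
    """Download one FASTA record as text (assumes a UTF-8 text response)."""
    with urlopen(fasta_url(accession), timeout=timeout) as response:
        return response.read().decode("utf-8")


def download_many(accessions, fetch=fetch_fasta):
    """Fetch each accession in turn; `fetch` can be swapped out for testing."""
    return {accession: fetch(accession) for accession in accessions}


if __name__ == "__main__":
    # Performs a real network request when run as a script.
    print(fetch_fasta("C_AA001108.1")[:200])
```

Passing the fetch function as a parameter keeps the loop testable without network access and leaves room to add retries or delays between requests later.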
Future developments
GenBase is based in China and serves the world: it accepts data submissions from researchers everywhere and provides one-stop web services for collecting, storing, releasing, and sharing gene sequence data. Going forward, GenBase will continue to support research and development in biology. Planned work includes improving the web interface for data submission, retrieval, and presentation, and expanding its services to cover genome annotation, for example of viral, mitochondrial, and chloroplast genomes, to ensure the accuracy of downstream data analysis. We will also integrate user-friendly online tools for sequence data analysis, such as species identification. Finally, we will promote collaboration by sharing and exchanging all publicly available nucleic acid sequences with INSDC members, thereby providing a comprehensive data resource for researchers worldwide.
Reviewers:
GPB Youth Editorial Board Member Zhou Zhan
Article compilation source:
Bu C, Zheng X, Zhao X, Xu T, Bai X, Jia Y, et al. GenBase: A Nucleotide Sequence Database. Genomics, Proteomics & Bioinformatics https://doi.org/10.1093/gpbjnl/qzae047.
For the full text in English, please see:
https://academic.oup.com/gpb/advance-article/doi/10.1093/gpbjnl/qzae047/7698051
Author & Funding Information:
Bu Congfan, Zheng Xinchang, Zhao Xuetong, Xu Tianyi, and Bai Xue of the Beijing Institute of Genomics, Chinese Academy of Sciences (China National Center for Bioinformation) (https://ngdc.cncb.ac.cn/) are the co-first authors of this paper, and senior engineer Tang Bixia and researcher Bao Yiming are its co-corresponding authors. This research was supported by the Strategic Priority Research Program of the Chinese Academy of Sciences, the National Key R&D Program of China, the International Thematic Network of the Belt and Road International Alliance of Scientific Organizations, the International Cooperation Project of the Chinese Academy of Sciences "Research and Development of International Genomics Data Sharing System", and the International Biodiversity and Health Big Data Sharing Program.
GPB Papers:
GenBase: A Nucleotide Sequence Database