VCGDB: a dynamic genome database of the Chinese population
© Ling et al.; licensee BioMed Central Ltd. 2014
Received: 22 September 2013
Accepted: 28 March 2014
Published: 5 April 2014
The data released by the 1000 Genomes Project contain an increasing number of genome sequences from different nations and populations with a large number of genetic variations. As a result, the focus of human genome studies is changing from single and static to complex and dynamic. The currently available human reference genome (GRCh37) is based on sequencing data from 13 anonymous Caucasian volunteers, which might limit the scope of genomics, transcriptomics, epigenetics, and genome-wide association studies.
We used the massive amount of sequencing data published by the 1000 Genomes Project Consortium to construct the Virtual Chinese Genome Database (VCGDB), a dynamic genome database of the Chinese population based on the whole genome sequencing data of 194 individuals. VCGDB provides dynamic genomic information, which contains 35 million single nucleotide variations (SNVs), 0.5 million insertions/deletions (indels), and 29 million rare variations, together with genomic annotation information. VCGDB also provides a highly interactive user-friendly virtual Chinese genome browser (VCGBrowser) with functions like seamless zooming and real-time searching. In addition, we have established three population-specific consensus Chinese reference genomes that are compatible with mainstream alignment software.
VCGDB offers a feasible strategy for processing big data to keep pace with the biological data explosion by providing a robust resource for genomics studies; in particular, studies aimed at finding regions of the genome associated with diseases.
The 1000 Genomes Project Consortium has used the dramatic increase in available sequencing power to sequence the genomes of 1092 individuals from 14 populations in different parts of the world [1, 2]. Other approaches, like genome-wide association studies (GWAS), combine several hundred thousand variants from different individuals known to have a particular disease and related clinical traits, thereby associating genome-wide genotyping with the phenotypic disease for gene discovery [3, 4]. The amount of healthcare-related data being digitally collected and stored, especially disease-related sequence variations that are used widely in personalized medicine studies, is vast and expanding rapidly [5, 6]. As a result, data management and analysis tools to convert this vast resource into information and knowledge are also advancing [7, 8]. Third-generation high-throughput sequencing technology, with its much higher throughput and lower cost, is now available [9, 10]. Meanwhile, traditional sequencing platforms are still producing about five petabytes of sequencing data annually. These two technologies are driving the exponential growth of the genomics data "ocean" [11–13], which raises urgent problems of how to handle such huge amounts of data, including their storage, transfer, integration, and mining [14–18]. Sequencing data from different sample preparation protocols and various sequencing platforms with variable read lengths and sequencing coverage are often handled with different analysis tools and parameters. Thus, the standardization of sequencing studies and their interpretation are challenges to which researchers are beginning to pay more attention.
The current Genome Reference Consortium human genome (build 37), GRCh37, is derived from 13 anonymous Caucasian volunteers from Buffalo, New York. This sequence is used as a standard template and a guiding principle for the discovery of low-frequency variants in different individuals from different populations, the development of computational software, and for building clinical genomic resources [21, 22]. Although the reference genome has been revised several times, it still offers limited information for the study of population- and individual-specific variants. The human reference genome sequence is unable to meet the precise investigative requirements of genomics, transcriptomics, epigenetics, and genome-wide association studies (GWAS). Several efforts have been made to generate specific complete human genome sequences of different populations. Levy et al. reported the whole genome sequence of a Caucasian individual of Western European ancestry (CEU) sequenced using Sanger methodology. Bentley et al. used the short reads sequenced on a next-generation sequencing platform to determine the genome sequence of a Yoruba individual from Ibadan, Nigeria (YRI). Wang et al. sequenced the genome of a Han Chinese individual using combined strategies for alignment and assembly. Other population-specific genome sequences, including the genomes of a Korean and an Irish individual, have been released [27, 28]. Each of these genome sequences represents an individual from a particular population and is annotated with information about single nucleotide polymorphisms (SNPs), insertions/deletions (indels), and large structural variations based on the human reference genome sequence. These data are for one individual's genome only and do not represent or characterize population-specific differences.
The huge amount of human genome sequence data [30–32] has made it possible to detect high- and low-frequency variants across the whole human genome. As a result, the focus of human genome studies is changing from single and static to complex and dynamic. The 1000 Genomes Project Consortium has generated and published a massive amount of human genome sequencing data. Using some of these data, we have constructed the Virtual Chinese Genome Database (VCGDB), a dynamic genome database of the Chinese population based on whole-genome sequencing of 194 individuals. VCGDB is "virtual" because the reference genome provided in the database is the statistical result of terabases of sequencing data from hundreds of Chinese individuals that describe the genetic variation features specific to the Chinese population. VCGDB is "dynamic" because dynamic variations of individual characters and genomic annotation information, such as reference genes, genomic duplications, and GWAS clinical traits, are integrated in the database. VCGDB offers a strategy for processing large amounts of genomic data and is a robust database for genomics and disease-related studies in the Chinese population.
We collected all the whole genome low-coverage (2–4x) alignment results of 194 individuals from two typical Chinese populations released by the 1000 Genomes Project. The data included samples from 100 Southern Han Chinese individuals (CHS) and samples from 94 Han Chinese individuals in Beijing (CHB). All the data can be downloaded from mirror FTP sites of NCBI (ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/) or EBI (ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/). The sequencing data were generated in different laboratories using a variety of different platforms; therefore, all the data were standardized based on "Bavarva Theory". The raw sequencing reads have been mapped to the human reference genome using a consensus strategy based on different sequencing platforms, and the output is in binary BAM format. The raw data and alignment data have been compressed to about 3.3 and 4.8 terabytes, respectively.
In the data preprocessing step, we used SAMtools to multi-pileup all the samples to convert read-based alignment information into position-based data by splitting the reads into bases. The process was optimized by piling up the samples in parallel. The resulting files from the 194 samples were about 70 gigabytes each and 12.6 terabytes in total.
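To illustrate what "position-based data" means here, the following minimal sketch counts the nucleotides observed at one position from a line of SAMtools multi-sample pileup text. It assumes the standard pileup layout (chromosome, position, reference base, then depth, read bases, and base qualities for each sample); the function names `count_bases` and `parse_pileup_line` are hypothetical and are not taken from the VCGDB pipeline.

```python
import re
from collections import Counter


def count_bases(ref_base: str, read_bases: str) -> Counter:
    """Count A/C/G/T observations in one sample's pileup read-base column."""
    counts = Counter()
    ref_base = ref_base.upper()
    i = 0
    while i < len(read_bases):
        ch = read_bases[i]
        if ch == "^":
            # Read start marker: skip it and the mapping-quality character.
            i += 2
        elif ch in "+-":
            # Indel: +<length><sequence> or -<length><sequence>; skip it whole.
            digits = re.match(r"\d+", read_bases[i + 1:]).group()
            i += 1 + len(digits) + int(digits)
        elif ch in ".,":
            # Match to the reference on the forward (.) or reverse (,) strand.
            counts[ref_base] += 1
            i += 1
        elif ch.upper() in "ACGT":
            # Mismatch: the observed base is written explicitly.
            counts[ch.upper()] += 1
            i += 1
        else:
            # '$' (read end), '*' (deleted base), 'N', and similar symbols.
            i += 1
    return counts


def parse_pileup_line(line: str):
    """Return (chromosome, position, [Counter per sample]) for one line."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, ref = fields[0], int(fields[1]), fields[2]
    # Each sample contributes three columns: depth, read bases, qualities.
    samples = [count_bases(ref, fields[k + 1]) for k in range(3, len(fields), 3)]
    return chrom, pos, samples


example = "chr1\t100\tA\t5\t..,^]Tt\tIIIII\t3\t.+2AG.$C\tIII"
chrom, pos, per_sample = parse_pileup_line(example)
```

Keeping one counter per sample, rather than pooling all reads, matches the later requirement that samples remain independent and equally weighted.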
The raw sequence data of the 194 whole genomes contained 3 billion nucleotide positions; therefore, a specific algorithm was required to extract meaningful information in a short time using limited storage space. We designed a two-step candidate dynamic information-extraction strategy to filter out irrelevant or redundant data and select the data for further data analysis and database construction. First, we built a candidate dynamic position list containing all the positions with variant probability within at least one sample. Assuming the samples to be independent from each other, we ran a parallel search and calculated the candidate dynamic positions (CDPs) from all the samples simultaneously. As a result, we obtained a non-redundant CDP hash dataset with a total of 55,549,120 CDPs from the 194 Chinese genome samples. Next, we developed a cross-chromosome data searching and extracting algorithm (Additional file 1: Figure S1) to filter the redundant data. The output of this process contained the mapped nucleotide information from each CDP together with the sample information. This candidate dynamic information-extraction strategy reduced the size of the data by 98%, which significantly decreased the programming difficulty and CPU time required for subsequent analysis.
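The two-step strategy can be sketched as follows. This is an illustration under stated assumptions, not the authors' algorithm: a position is treated as a candidate if any sample shows at least one non-reference base there, the in-memory dictionaries stand in for the per-sample pileup files, and the names `sample_candidates` and `extract_cdps` are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def sample_candidates(pileup, reference):
    """Step 1, per sample: positions with at least one non-reference base."""
    return {
        site
        for site, counts in pileup.items()
        if any(n > 0 for base, n in counts.items() if base != reference[site])
    }


def extract_cdps(samples, reference, workers=4):
    """Build the non-redundant CDP set, then pull every sample's data for it.

    samples:   {sample_name: {(chrom, pos): Counter of observed bases}}
    reference: {(chrom, pos): reference base}
    """
    # Step 1: samples are independent, so they can be scanned in parallel.
    # (Threads keep the sketch short; separate processes or cluster jobs
    # would be the realistic choice for CPU-bound scans.)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_sample = pool.map(
            lambda pileup: sample_candidates(pileup, reference), samples.values()
        )
        cdps = set().union(*per_sample)

    # Step 2: keep only the CDP rows, together with the sample information.
    return {
        site: {name: p[site] for name, p in samples.items() if site in p}
        for site in sorted(cdps)
    }


reference = {("chr1", 1): "A", ("chr1", 2): "C", ("chr1", 3): "G"}
samples = {
    "s1": {("chr1", 1): Counter(A=3), ("chr1", 2): Counter(C=2, T=1)},
    "s2": {
        ("chr1", 1): Counter(A=2),
        ("chr1", 2): Counter(C=4),
        ("chr1", 3): Counter(G=1),
    },
}
table = extract_cdps(samples, reference)
```

In this toy input only position 2 carries a non-reference base, so the two invariant positions are dropped while both samples' observations at position 2 are retained.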
The common SNP/SNV (single nucleotide polymorphism/single nucleotide variation) calling algorithms generally use parameters such as read quality to evaluate and filter variations caused by sequencing errors. This strategy is useful for sequencing data with a reasonable level of coverage from a single individual, because false positives of base positions will be quite low and differentiation between individuals does not need to be considered. However, for the massive amounts of low-coverage sequencing data from the large number of samples from different individuals in VCGDB, false positives could be higher and harder to detect, and difficult to distinguish from normal variations in each sample. Further, the samples were generated using different platforms under slightly different conditions, which may cause unequal weighting when the data are merged. To address these problems, we developed a data analysis strategy especially for large-scale data samples with low coverage, to preserve the independence and maintain equal weighting of each sample when handling the dynamic variation information.
First, we calculated the comentropy (Shannon entropy) of each unit:

$$H = -\sum_{i=1}^{n} p(x_i)\,\log_2 p(x_i)$$

where x_i is one of the detected nucleotide types (A/T/G/C); n is the number of detected nucleotide types; and p(x_i) is the occurrence probability of each nucleotide type in a given unit. The units were filtered by their comentropy values against thresholds of ≤0.95 for binary units, ≤1.50 for ternary units, and ≤1.92 for quaternary units.
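The comentropy filter can be sketched in a few lines. Two assumptions are made here and should be checked against the original method: comentropy is taken to be Shannon entropy in bits (log base 2), which is consistent with the thresholds lying just below the maxima of 1, about 1.58, and 2 for two, three, and four nucleotide types; and a unit is taken to pass when its entropy does not exceed the threshold. The names `comentropy` and `passes_filter` are hypothetical.

```python
import math
from collections import Counter

# Thresholds from the text, keyed by the number of detected nucleotide types.
THRESHOLDS = {2: 0.95, 3: 1.50, 4: 1.92}


def comentropy(counts) -> float:
    """Shannon entropy (bits) of the nucleotide counts in one unit."""
    total = sum(counts.values())
    probs = [n / total for n in counts.values() if n > 0]
    return sum(-p * math.log2(p) for p in probs)


def passes_filter(counts) -> bool:
    """Assumed rule: keep a unit whose entropy is at or below its threshold."""
    n_types = sum(1 for n in counts.values() if n > 0)
    if n_types < 2:
        # A single nucleotide type has zero entropy; no threshold applies.
        return True
    return comentropy(counts) <= THRESHOLDS[n_types]


balanced = Counter(A=1, T=1)   # two equally frequent types
skewed = Counter(A=9, T=1)     # one dominant type
```

Under these assumptions a perfectly balanced binary unit (entropy of 1 bit) is rejected, whereas a unit dominated by one nucleotide is kept.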
Second, we calculated the nucleotide distribution of each CDP in three pre-defined populations, Han Chinese in Beijing (CHB), Southern Han Chinese (CHS), and an integrated dataset of CHB and CHS to represent the entire eastern region of China (CHN), and then used several parameters such as read depth, nucleotide distribution, major allele probability, and the population-level comentropy value to estimate each dynamic position in the corresponding population. Rare variations and indels are generally considered to have a high probability of being related to diseases and they are often the main factors that cause differences in genome size among different individuals or populations. We extracted rare variants that occurred in less than 5% of the samples and grouped them by population.
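A minimal sketch of the population-level step is given below. It is not the authors' implementation: the major allele probability is assumed to be the mean of the per-sample allele frequencies (so that every sample carries equal weight regardless of read depth), an allele is assumed to be rare when it is carried by fewer than 5% of the samples covering the position, and `summarize_position` is a hypothetical name.

```python
from collections import Counter


def summarize_position(sample_counts, rare_cutoff=0.05):
    """Summarize one dynamic position across the samples of a population.

    sample_counts: list of Counters, one per sample, of observed bases.
    """
    covered = [c for c in sample_counts if sum(c.values()) > 0]
    if not covered:
        return None

    mean_freq = Counter()   # equally weighted average allele frequency
    carriers = Counter()    # number of samples in which each allele was seen
    for counts in covered:
        depth = sum(counts.values())
        for base, n in counts.items():
            if n > 0:
                mean_freq[base] += n / depth / len(covered)
                carriers[base] += 1

    major, major_prob = mean_freq.most_common(1)[0]
    rare = sorted(
        base
        for base, k in carriers.items()
        if base != major and k / len(covered) < rare_cutoff
    )
    return {
        "major_allele": major,
        "major_allele_probability": major_prob,
        "rare_alleles": rare,
    }


# 40 samples: 39 show only A, one shows A and G once each.
population = [Counter(A=3)] * 39 + [Counter(A=1, G=1)]
summary = summarize_position(population)
```

Here G is seen in 1 of 40 samples (2.5%), so it falls under the 5% cutoff and is reported as a rare variant.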
Based on the annotations for the human reference genome sequence, we annotated the dynamic positions in the genome sequences of the three pre-defined Chinese populations to coding regions, intergenic regions, introns, and untranslated regions (UTRs). Traditional genome annotation databases and software such as the Gene Ontology and the KEGG pathway databases require a gene list as input and not position information, which makes them unsuitable for the dynamic positions and indels analysis that is required for the virtual genome data. Therefore, we used ANNOVAR, a position- and short region-based genome annotation software, to annotate the dynamic positions and indels in our data, especially the major allele and indel positions against the reference genome (hereafter referred to as MAIR). A number of databases, such as RefGene, genomicSuperDups, and gwasCatalog, were used to assign the annotations and an enrichment analysis was performed to reveal the potential biological significance of the dynamic genomics information.
To handle the huge amounts of data that were generated, we used the MySQL database management system to construct VCGDB. We configured the optimal relationships among the tables and in the data structure to obtain the best searching efficiency. The information obtained from the analysis described above was stored in VCGDB in three parts: dynamic position information, indel information, and genomic annotation information. The three parts were all highly structured, indexed, classified, and properly stored in MySQL tables. The chromosome name and the position index were used to build connections between the tables. The tables were split by chromosome name and every table was constrained to fewer than ten million records, which largely reduced the search region and the response time. Considering the data characteristics and search requirements, we constructed VCGDB in four main parts: "entity", "annotation", "dynamic", and "reference". The "entity" part stores basic information that describes the base positions; the "dynamic" part stores all the information that corresponds to dynamic positions, indels, and rare variations; the "annotation" part contains raw datasets of the annotation database associated with position-based annotation records; and the "reference" part is a structured database of linear references capable of high performance sequence initiation. We used the MAIR information to build population-specific consensus Chinese reference genome sequences, named the virtual Chinese reference genomes (VCGs), for the three pre-defined populations and tested them by aligning raw sequencing data to the different genomes and comparing the mapped reads. The VCGs can be recognized by genome alignment software like BWA or Bowtie [38, 39].
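How a consensus reference might be derived from MAIR-style records can be sketched as follows. This is a simplified illustration, not the VCG construction procedure: each record is assumed to be a non-overlapping (1-based position, reference allele, major allele) triple, and the helper names `build_consensus` and `to_fasta` are hypothetical. Writing the result as FASTA is what lets aligners such as BWA or Bowtie index it like any other linear reference.

```python
def build_consensus(reference: str, edits) -> str:
    """Apply major-allele substitutions and indels to a reference sequence.

    edits: iterable of (pos, ref_allele, major_allele), pos being 1-based.
    Assumes the edits do not overlap one another.
    """
    consensus = reference
    # Work from right to left so that indels never shift the coordinates
    # of edits that are still waiting to be applied.
    for pos, ref_allele, major_allele in sorted(edits, reverse=True):
        start = pos - 1
        end = start + len(ref_allele)
        if consensus[start:end].upper() != ref_allele.upper():
            raise ValueError(f"reference allele mismatch at position {pos}")
        consensus = consensus[:start] + major_allele + consensus[end:]
    return consensus


def to_fasta(name: str, sequence: str, width: int = 60) -> str:
    """Format a sequence as a FASTA record with fixed-width lines."""
    lines = [f">{name}"]
    lines += [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join(lines) + "\n"


toy_reference = "ACGTACGT"
toy_edits = [
    (2, "C", "T"),      # substitution
    (5, "A", "AGG"),    # insertion of GG after position 5
    (7, "GT", "G"),     # deletion of the final T
]
toy_consensus = build_consensus(toy_reference, toy_edits)
toy_record = to_fasta("toy_VCG", toy_consensus, width=4)
```

Repeated string slicing is fine for a toy sequence; a chromosome-scale version would collect the pieces in a list and join them once.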
Traditional genome browsers are not designed to visualize the dynamic genomics information in VCGDB. Therefore, we developed the Virtual Chinese Genome Browser (VCGBrowser), a highly flexible, region-based genome browser to display the virtual and dynamic genomes in VCGDB. VCGBrowser was built using the Java Swing Applet and the Genoviz software development kit (Genoviz SDK), an open source library of re-usable components for genomics data visualization. VCGBrowser can be installed as a client-based application running on a personal computer, or it can be used online through the Java applet web interface. To ensure data transfer security, we have implemented a servlet so that users do not have to change their local security policy for a remote connection. The object-based methodology in VCGBrowser accelerates the transfer speed of visualized data (Additional file 1: Figure S2). Compatibility testing was passed on mainstream operating systems such as Windows XP, Windows 7, Ubuntu 12.04, and CentOS 6.0 and in browsers such as Internet Explorer, Mozilla Firefox, and Google Chrome.
Dynamic position counts for the different populations in the virtual Chinese genome database (VCGDB): number of dynamic positions, number of indels, and number of rare variations
Statistical analysis of genetic variations in the exonic regions of the Chinese and GRCh37 genomes
Enrichment analysis of MAIR in GWAS trait locations in the Chinese and GRCh37 genomes
Multiple sclerosis (41/187) | Multiple sclerosis (41/187) | Multiple sclerosis (41/187)
Crohn's disease (36/181) | Crohn's disease (40/181) | Crohn's disease (38/181)
Body mass index (34/109) | Body mass index (34/109) | Coronary heart disease (37/151)
Coronary heart disease (34/151) | Coronary heart disease (33/151) | Body mass index (36/109)
Type 2 diabetes (33/164) | Type 2 diabetes (32/164) | Bipolar disorder (32/109)
Rheumatoid arthritis (31/170) | LDL cholesterol (32/114) | Type 2 diabetes (31/164)
LDL cholesterol (31/114) | HDL cholesterol (31/118) | Rheumatoid arthritis (30/170)
Bipolar disorder (30/109) | Type 1 diabetes (29/107) | Bone mineral density (30/87)
Bone mineral density (30/87) | Bone mineral density (29/87) | LDL cholesterol (30/114)
VCGDB is a well-organized, highly structured, and indexed database that supports real-time high-performance searches using a web search engine and a genome browser. Users can launch a query to obtain information from several different VCGDB tables, including dynamic position information, MAIR, gene information, and GWAS clinical traits. For example, a dynamic position search returns basic information such as position, population, major and minor allele contribution, and major allele probability and distribution. A download page is generated so that users can download any Chinese-specific genome sequence generated using the data in the database.
The VCGBrowser is a highly interactive user-friendly interface that can be used to view the "virtual" and "dynamic" genomics information (Figure 4A). VCGBrowser is both a web-based applet and a client-based cross-platform application that can be used as an online browser or can be downloaded for use as local software. The VCGBrowser consists of five modules: a control module, browser module, brief module, detail module, and progress module. VCGBrowser differs from traditional genome browsers in five main aspects. One, the VCGBrowser allows users to browse the GRCh37 reference genome and the consensus CHN, CHS, and CHB genome sequences in the same window, making it easier for users to detect differences among the populations. Two, the VCGBrowser integrates the dynamic genomics information onto the consensus coordinate of genome sequences. For each dynamic position, multi-colored rectangle bars are used to indicate the nucleotide distribution at that position, triangles are used to indicate indels, and colored characters are used to mark rare variations. Three, the VCGBrowser supports flexible and seamless zooming and browsing. Traditional genome browsers like the UCSC Genome Browser use image-based technologies so that images are always refreshed when zooming in or out. The efficiency of this methodology depends on internet speed and bottlenecks can occur when a large number of users access the site at the same time. VCGBrowser uses servlet technology to fetch all the information and to calculate all the symbols in real-time. Thus, VCGBrowser supports seamless zooming and scrolling to any resolution, from the genomic level that shows the dynamic distribution of a region of interest to the nucleotide level at which the residues and detailed information can be displayed.
Four, the VCGBrowser provides biological annotations, including gene name, gene region, and GWAS traits, and marks them all in the browser window, so that users can easily combine these biological factors with the dynamic genomics information. Five, all the symbols shown in the browser window are selectable and support real-time searching. After a query has been initiated, brief information about the region being queried is displayed in a side window; then, simply clicking on a symbol of interest triggers an instant search in the database, producing detailed information for the users.
A download page is provided for users to download the applications and data in VCGDB. The downloadable version of VCGBrowser can be run on different Windows or Linux operating systems. Moreover, the consensus Chinese reference genome sequences (VCGs) for the CHN, CHS, and CHB populations are also provided on this page. The three VCGs are population-specific linear reference genome sequences that support alignment software, such as BWA, Bowtie, and SOAP [38, 39, 43].
People from different geographical regions have their own specific phenotypes. For instance, Southern Han Chinese (CHS) are generally thinner and shorter, while Han Chinese in Beijing (CHB) are generally stronger and taller. We used the information in VCGDB to examine the potential genotype factors that may lead to these physical differences. Based on the results of a comparison of MAIR in the CHS and CHB genome sequences, we connected the significant genotypic differences that we found with GWAS traits, which showed that these two populations had many specific genotypes related to height and body mass index. These findings might explain why Caucasians generally appear taller and stronger than Chinese. For the disease GWAS traits, Crohn's disease, coronary heart disease, and type 2 diabetes were found to have higher morbidity rates in the Chinese populations, which agreed with a previously reported common ailment investigation [44–46]. Although CHS and CHB are both Han Chinese populations, differences in the dynamic genomics information revealed differences in several disease phenotypes between the two populations.
Mapping of 15 Asian genomes onto the VCG, YH and GRCh37 reference genomes
The continuous innovations in high-throughput sequencing technology, even single-molecule sequencing, have dramatically expanded the capacity for description and data collection; however, a large bottleneck remains in the efficiency of compiling, organizing, and manipulating these data. The biggest challenges are in computing resource allocation, parallel computing control, algorithm optimization, and the physical structural design of a database.
VCGDB is considered "virtual" because the reference genome that we provide here does not belong to any real human being; it is the statistical result of terabases of sequencing data from hundreds of Chinese individuals. VCGDB adequately describes the genetic variations, features, and preferences of the Chinese populations that it represents. VCGDB and the associated VCGBrowser provide refined and comprehensive data from the 1000 Genomes Project biological big data, which have been annotated and analyzed with the aim of building connections between human genomic research and medical diagnosis. The VCGBrowser provides a highly flexible user-friendly interface for the user to search and work on. Moreover, users no longer need to deal with massive amounts of data; rather, they can use mature databases and analysis tools to classify individuals or patients into subpopulations that differ in their susceptibility to a particular disease or in their response to a specific treatment. Here, we have developed the VCGDB and VCGBrowser as a support system for researchers and doctors to build an accurate and precise classification of the human genome and diseases, and thereby promote progress in social healthcare.
The "dynamic" feature in VCGDB can display multiple levels of genetic variation information in the three Chinese populations. Individual genomes were examined first to detect nucleotide-level variations between individuals and populations and to evaluate the degree of variation in the base backbone. Then, all the genetic variation information was collected for all the individuals in the three populations. Finally, the dynamic variations of individual characters and genomic annotation information, such as reference genes, genomic duplications, and GWAS clinical traits, were integrated into the VCGDB structure. The data preprocessing that we developed downsized the raw data to an analyzable scale without losing any detailed information, and the analysis algorithms translated sequencing data into dynamic genomics information using limited time and computing resources. The results are output as a big-table-like data structure, which is convenient for data exporting and follow-up studies. Furthermore, the optimized VCGDB structure allowed the implementation of a high-efficiency real-time search of all the dynamic genomics information in the database along the whole length of the Chinese genomes. The VCGDB structure is a novel database scheme that can be used to deal with the huge amounts of incoming biological data.
Although about 8 terabytes of Chinese genome data have been integrated in VCGDB, it remains just a tiny piece of the huge data iceberg that will be required to fully illustrate the complexities of the human genome. We set a threshold of 5% to define rare variants because of the limited sample size and the current sequencing error rate. Usually, a rare variant that may be disease-related will have an occurrence probability of less than 1% or even one in a million, which is currently almost impossible to detect. In the near future, sequencing projects such as UK10K, The Cancer Genome Atlas (TCGA), and TwinsUK will generate ultra-large volumes of human genome data [47–49]. Because the sequencing data are generated in different laboratories using various platforms, normalizing these data, merging them with existing data, and analyzing and interpreting them are major challenges that have to be addressed.
Traditional alignment software programs or algorithms always use the static human GRCh37 reference genome as the template and dynamic variations are seldom considered. To overcome this limitation, an advanced dataset could be used in a dynamic mapping process, but this approach wastes data and overlooks potential biological significance. New algorithms with higher mapping speeds and dynamic variation support need to be developed to handle the increasing quantities of data. Conversely, limited computing resources restrain data mining efficiency and its applicability to many investigators. Cloud computing and supercomputing may be the best solution in response to the data explosion crisis [50–52]. We are planning to build a stable data system in the cloud, develop user-friendly tools and pipelines, and establish cloud platforms to accelerate further algorithm development.
Management and maintenance are critical for databases. We will continue to provide the computational resources, debug the programs, and ensure the stable running of VCGDB and VCGBrowser. In the future, we will continue to monitor the progress of other human genome projects, collect and merge Chinese sequencing samples, execute our data analysis workflow, validate the results, and periodically update VCGDB.
In this study, we constructed a new type of dynamic genome database of three Chinese populations, including CHS and CHB, which is very different from any of the current traditional human genome databases. VCGDB integrates all the dynamic information generated from the whole-genome sequencing of hundreds of individuals, and combines it with the corresponding genomic annotation information. The "virtual" and "dynamic" features of VCGDB helped reveal genetic variations in the Chinese genomes. We developed a highly interactive user-friendly VCGBrowser, which has significant functions like seamless zooming and real-time searching, for users to search and compare the dynamic information of the different populations in VCGDB. Based on the population-specific information in VCGDB, we built consensus Chinese reference genomes to detect nucleotide preferences in the Chinese populations and to be compatible with traditional alignment software. We propose that VCGDB offers a feasible strategy for processing big data to keep pace with the growing volume of biological data and provides a robust resource based on the massive amounts of genomics data for genomics studies and investigations into genetic diseases.
Database homepage: http://vcg.cbi.ac.cn/
Requirements: Java Runtime Environment (JRE) version 1.6.0 or later
VCGDB: Virtual Chinese genome database
VCGBrowser: Virtual Chinese genome browser
SNP: Single nucleotide polymorphism
SNV: Single nucleotide variation
MAIR: Major allele and indel positions against reference genome
GWAS: Genome-wide association studies
HGNC: HUGO Gene Nomenclature Committee
VCG: Virtual Chinese reference genome
YH: Yan Huang reference genome
CDP: Candidate dynamic position
CHS: Population from Han Chinese in southeast of China
CHB: Population from Han Chinese in Beijing, China
CHD: Population of Chinese in Denver, Colorado
CDX: Population of Dai Chinese in Xishuangbanna, China
JPT: Population of Japanese in Tokyo, Japan
CEU: Population of Utah residents with Northern and Western European ancestry
YRI: Population of Yoruba in Ibadan, Nigeria
The authors are grateful to the reviewers for their valuable comments, which have led to improvements of this paper. The authors thank the 1000 Genomes Project Consortium for the public data of human genome. This study was supported by grant (2010CB126604, 2011CB965200) from the National Basic Research Program (973 Program), the Ministry of Science and Technology of the People’s Republic of China; grant (2009FY120100) from the Special Foundation Work Program, the Ministry of Science and Technology of the People’s Republic of China and grant (2012AA020409) from National Programs for High Technology Research and Development (863 Program), the Ministry of Science and Technology of the People’s Republic of China; grant from the National Science Foundation of China (31101063, 31271386).
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.