A comprehensive evaluation of ensembl, RefSeq, and UCSC annotations in the context of RNA-seq read mapping and gene quantification
© Zhao and Zhang; licensee BioMed Central. 2015
Received: 30 September 2014
Accepted: 30 January 2015
Published: 18 February 2015
RNA-Seq has become increasingly popular in transcriptome profiling. One aspect of transcriptome research is to quantify the expression levels of genomic elements, such as genes, their transcripts and exons. Acquiring a transcriptome expression profile requires genomic elements to be defined in the context of the genome. Multiple human genome annotation databases exist, including RefGene (RefSeq Gene), Ensembl, and the UCSC annotation database. The impact of the choice of an annotation on estimating gene expression remains insufficiently investigated.
In this paper, we systematically characterized the impact of genome annotation choice on read mapping and transcriptome quantification by analyzing a RNA-Seq dataset generated by the Human Body Map 2.0 Project. The impact of a gene model on mapping of non-junction reads is different from junction reads. For the RNA-Seq dataset with a read length of 75 bp, on average, 95% of non-junction reads were mapped to exactly the same genomic location regardless of which gene models was used. By contrast, this percentage dropped to 53% for junction reads. In addition, about 30% of junction reads failed to align without the assistance of a gene model, while 10–15% mapped alternatively. There are 21,958 common genes among RefGene, Ensembl, and UCSC annotations. When we compared the gene quantification results in RefGene and Ensembl annotations, 20% of genes are not expressed, and thus have a zero count in both annotations. Surprisingly, identical gene quantification results were obtained for only 16.3% (about one sixth) of genes. Approximately 28.1% of genes’ expression levels differed by 5% or higher, and of those, the relative expression levels for 9.3% of genes (equivalent to 2038) differed by 50% or greater. The case studies revealed that the gene definition differences in gene models frequently result in inconsistency in gene quantification.
We demonstrated that the choice of a gene model has a dramatic effect on both gene quantification and differential analysis. Our research will help RNA-Seq data analysts to make an informed choice of gene model in practical RNA-Seq data analysis.
RNA-Seq, the sequencing of a population of RNA transcripts using high-throughput sequencing technologies, profiles an entire transcriptome at single-base resolution whilst concurrently quantifying gene expression levels [1-5]. RNA-Seq can analyze subtle features of the transcriptome, such as novel transcript variants, allele-specific expression, and splice junctions [4,5]. Previously, we performed a side-by-side comparison of RNA-Seq and microarray to investigate T-cell activation, and demonstrated that RNA-Seq is superior in detecting low abundance transcripts, and for differentiating biologically critical isoforms . RNA-Seq also avoids technical limitations inherent to the microarray platform related to probe performance, such as cross-hybridization, limited detection range of individual probes, as well as non-specific hybridization [6-8]. With decreasing sequencing cost, RNA-Seq is becoming an attractive approach to profile gene expression or transcript abundance, and to evaluate differential expression among biological conditions.
Current RNA-Seq approaches use shotgun sequencing technologies such as Illumina, in which millions or even billions of short reads are generated from a randomly fragmented cDNA library. After sequencing, the first step involves mapping those short reads to a genome or transcriptome. In recent years, a large number of mapping algorithms have been developed for read mapping and RNA-Seq differential analysis [9-14]. However, accurate alignment of high-throughput short RNA-Seq reads remains challenging, mainly because of junction (i.e., exon-exon spanning) reads and the ambiguity of multiple-mapping reads. Currently, many RNA-Seq alignment tools, including GSNAP , OSA , STAR , MapSplice, and TopHat , use reference transcriptomes to inform the alignments of junction reads. In our previous study , we had assessed the impact of using RefGene (RefSeq Gene)  on mapping short RNA-Seq reads, and demonstrated that without the assistance of RefGene, more than one third of junction reads failed to map to the reference genome in the alignment process.
One aspect of transcriptome research is to quantify expression levels of genes, transcripts, and exons. Acquiring the transcriptome expression profile requires genomic elements to be defined in the context of the genome. In addition to RefGene, there are several other public human genome annotations, including UCSC Known Genes , Ensembl , AceView , Vega , and GENCODE. Characteristics of these annotations differ because of variations in annotation strategies and information sources. RefSeq human gene models are well supported and broadly used in various studies. The UCSC Known Genes dataset is based on protein data from Swiss-Prot/TrEMBL (UniProt) and the associated mRNA data from GenBank, and serves as a foundation for the UCSC Genome Browser. Vega genes are manually curated transcripts produced by the HAVANA group at the Welcome Trust Sanger Institute, and are merged into Ensembl. Ensembl genes contain both automated genome annotation and manual curation, while the gene set of GENCODE corresponds to Ensembl annotation since GENCODE version 3c (equivalent to Ensembl 56). AceView provides a comprehensive non-redundant curated representation of all available human cDNA sequences.
Although there are multiple genome annotations available, researchers need to choose a genome annotation (or gene model) while performing RNA-Seq data analysis. However, the effect of genome annotation choice on downstream RNA-Seq expression estimates is under-appreciated. Wu et al.  defined the complexity of human genome annotations in terms of the number of genes, isoforms, and exons, and demonstrated that the selection of human genome annotation results in different gene expression estimates. Chen et al.  systematically compared the human annotations present in RefSeq, Ensembl, and AceView on diverse transcriptomic and genetic analyses. They found that the human gene annotations in the three databases are far from complete, although Ensembl and AceView annotate many more genes than RefSeq. In this paper, we performed a comprehensive evaluation of different annotations on RNA-Seq data analysis, including RefGene, UCSC, and Ensembl. We chose these three gene models because we use them regularly for in-house RNA-Seq data analysis. Our research focused on: (1) comparing the coverage and incompleteness of different gene models; (2) quantifying the impact of gene models on the mapping of both junction and non-junction reads; and (3) evaluating the effect of genome annotation choice on gene quantification and differential analysis. To a broader extent, one of the most practical questions researchers want to know in advance is: if different gene models are chosen for RNA-Seq data analysis, what is the chance of obtaining the same quantification result for a given gene?
Results and discussion
The Human Body Map 2.0 Project generated RNA-Seq data for 16 different human tissues (adipose, adrenal, brain, breast, colon, heart, kidney, leukocyte, liver, lung, lymph node, ovary, prostate, skeletal muscle, testis, and thyroid). We chose to analyze this public dataset because gene expression is tissue specific and analyzing those 16 high-quality RNA-Seq samples as a whole could result in less biased conclusions. Note that none of the gene annotation is 100% complete. As a result, for those RNA-Seq reads not covered by a gene annotation, whether to use the gene model in the mapping step has no impact on their mappings. Therefore, to fairly assess the impact of a gene model on RNA-Seq read mapping, only those reads covered by a gene model were used. In this study, we devised a two-stage mapping protocol. In Stage #1, all reads that are not covered by a gene model were filtered out. In Stage #2, all remaining reads were mapped to the reference genome with and without the use of a gene model. The role of a gene model in the mapping step was then quantified and characterized by comparing the mapping results in Stage #2.
The coverage of different gene annotations
In contrast, Figure 1 shows that the read mapping percentage is also sample dependent, and this holds true for every gene model. For instance, only 52.5% of sequence reads in the heart were mapped to the RefGene model; while in leukocytes, 84.2% of reads could be mapped to RefGene. This mapping difference between heart and leukocyte results from, at least in part, the incompleteness of the RefGene annotation. As more genes are annotated in a gene model, a higher percentage of reads will be mapped in the “Transcriptome only” mapping mode.
The data patterns in “transcriptome + genome” mapping mode were different from those determined by the “transcriptome only” mode (left panel on Figure 1). In the “transcriptome + genome” mapping mode, the average mapping rates for Ensembl, RefGene, and UCSC increased to 96.7%, 94.5%, and 94.6%, respectively, and the mapping rate difference among different gene models decreased. This large difference in the mapping rates between the two modes suggests the incompleteness of gene models: there are many reads that were mapped to the genomic regions without annotations.
In the “transcriptome only” mapping mode, an average of 6.9%, 1.4%, and 1.8% of reads were multiple-mapped reads in Ensembl, RefGene, and UCSC gene models, respectively (the right panel in Figure 1). The percentage of multiple-mapped reads in Ensembl is higher than in RefGene or UCSC. Usually, a more comprehensive annotation generally annotates more genes and isoforms, and thus, increases the possibility of ambiguous mappings. These ambiguous mappings directly translate to an increase in the percentage of non-uniquely mapped reads.
The impact of a gene model on RNA-seq read mapping
In Figure 3A, we divided uniquely mapped reads into two classes, i.e., non-junction reads and junction reads, and investigated the impact of a gene model on their mapping. Accordingly to Figure 3A, roughly 23% of mapped reads were junction reads, and the remaining 77% were non-junction reads. For non-junction reads (see Figure 3B), 95% remained mapped to exactly the same genomic location regardless of the use of a gene model. Without a gene model, 3% to 9% of non-junctions reads became multiple mapped reads. Thus, it is rare for a non-junction read to become unmapped or mapped alternatively. However, the mapping of junction reads was strongly impacted by the gene models (see Figure 3C). Without using a gene model, an average of 53% of junction reads remained mapped to the same genomic regions, 30% of failed to map to any genomic region, and 10–15% of them mapped alternatively. Such alternative mappings are generally inferior compared to their corresponding mapping results using a gene model . Similar to non-junction reads, an average of 5% of junction reads were mapped to more than one location without using a gene model. As shown in Figure 3C, more uniquely-mapped junction reads became multiple mapped reads in RefGene and/or UCSC than in Ensembl when the sequence reads were aligned to the reference genome without the use of gene models.
The impact of gene model choice on gene quantification
The distribution of the ratio of read counts between RefGene and Ensembl annotations (read length = 75 bp)
Gene definitions and quantification results for certain exemplary genes in the liver tissue sample (read length = 75 bp)
The effect of gene models on differential analysis
The effect of a gene model on mapping is read length dependent
All the analysis results for the dataset with a 50-bp read length were reported in the supplementary tables and figures. Intuitively, the shorter a read, the more likely it is to map to multiple locations. As a result, the percentage of uniquely mapped reads decreases, and the percentage of multiple-mapping reads increases. No matter which gene model was used for mapping, this observation held true; for example if we compare Additional file 1: Table S1 with Additional file 1: Table S2, and/or Additional file 1: Table S3 with Additional file 1: Table S4. Thus, the mapping fidelity for a sequence read increases with its length, and this is especially true for junction reads. As demonstrated in Figure 3C and Additional file 1: Table S5, when the read length was 75 bp, an average of 53% of junction reads remained mapped to the same genomic regions when mapped without gene annotation. However, this percentage dropped to 42% when the read length was 50 bp long (Additional file 1: Figure S3C and Additional file 1: Table S6). Thus, the effect of a gene model on the mapping of junction reads is significantly influenced by read length.
In the meantime, the relative abundance of junction reads is heavily determined by read length.
According to Figure 3A and Additional file 1: Table S5, on average, roughly 23% of sequence reads were junction reads when the read length was 75 bp. The percentage of junction reads dropped to 16% when the read length was 50 bp (see Additional file 1: Figure S3A and Additional file 1: Table S6). This is explained by the fact that the longer the read, the more likely that it spans more than one exon. As sequencing technology evolves, the read length will become longer and longer. Consequently, more junction reads will be generated by short-gun sequencing technologies. Therefore, the need to incorporate genome annotation in the read mapping process will greatly increase.
Which genome annotation to choose for gene quantification?
In practice, there is no simple answer to this question, and it depends on the purpose of the analysis. In this paper, we demonstrated that the choice of a gene model has an effect on the quantification results. Previously, we compared the gene quantification results when RefGene and Ensembl annotations were used. Among 25,958 common genes, the expressions of 2038 genes (i.e., 9.3%) differed by 50% or more when choosing one annotation over the other. Such a large difference frequently results from the gene definition differences in the annotations. Genes with the same HUGO symbol in different gene models can be defined as completely different genomic regions. When choosing an annotation database, researchers should keep in mind that no database is perfect and some gene annotations might be inaccurate or entirely wrong.
Wu et al.  suggested that when conducting research that emphasizes reproducible and robust gene expression estimates, a less complex genome annotation, such as RefGene, might be preferred. When conducting more exploratory research, a more complex genome annotation, such as Ensembl, should be chosen. Based upon our experience of RNA-Seq data analysis, we recommend using RefGene annotation if RNA-Seq is used as a replacement for a microarray in transcriptome profiling. For human samples, Affymetrix GeneChip HT HG-U133+ PM arrays are one of the most popular microarray platforms for transcriptome profiling, and the genes covered by this chip overlap with RefGene very well, according to Zhao et al.  h. Despite the fact that Ensembl R74 contains 63,677 annotated gene entries, only 22,810 entries (roughly one third) correspond to protein coding genes. There are 17,057 entries representing various types of RNAs, including rRNA (566), snoRNA (1549), snRNA (2067), miRNA (3361), misc_RNA (2174), and lincRNA (7340). There are 15,583 pseudogenes in Ensembl R74. For most RNA-Seq sequencing projects, only mRNAs are presumably enriched and sequenced, and there is no point in mapping sequence reads to RNAs such as miRNAs or lincRNAs. Ensembl R74 contains 819 processed transcripts that were generated by reverse transcription of an mRNA transcript with subsequent reintegration of the cDNA into the genome, and are usually not actively expressed. In this scenario, a read truly originating from an active mRNA can be mapped to the processed transcript or mapped to the processed transcript only, which is especially true for junction reads. Consequently, the true expression for the corresponding mRNA may be underestimated. Another downside of using a larger annotation database is calculation of adjusted p values, because the adjustment of the raw p value to allow for multiple testing is mainly determined by the number of genes in the model. If genes of interest are defined inconsistently across different annotations, it is recommended that the RNA-Seq dataset is analyzed using different gene models.
RNA-Seq has become increasingly popular in transcriptome profiling. Acquiring transcriptome expression profiles requires researchers to choose a genome annotation for RNA-Seq data analysis. In this paper, we assessed the impact of gene models on the mapping of junction and non-junction reads, and compared the impact of genome annotation choice on gene quantification and differential analysis. To fairly assess the impact of a gene model on RNA-Seq read mapping, we devised a two-stage mapping protocol, in which sequence reads that could not be mapped to a reference transcriptome were filtered out, and the remaining reads were mapped to the reference genome with and without the use of a gene model in the mapping step. Our protocol ensured that only those reads compatible with a gene model were used to evaluate the role of a genome annotation in RNA-Seq data analysis.
Ensembl annotates more genes than RefGene and UCSC. On average, 95% of non-junction reads were mapped to exactly the same genomic location without the use of a gene model. However, only an average of 53% junction reads remained mapped to the same genomic regions. About 30% of junction reads failed to be mapped without the assistance of a gene model, while 10–15% mapped alternatively. It is also demonstrated that the effect of a gene model on the mapping of sequence reads is significantly influenced by read length. The mapping fidelity for a sequence read increases with its length. When the read length was reduced from 75 bp to 50 bp, the percentage of junction reads that remained mapped to the same genomic regions dropped from 53% to 42% without the assistance of gene annotation.
There are 21,958 common genes among RefGene, Ensembl, and UCSC annotations. Using the dataset with the read length of 75 bp, we compared the gene quantification results in RefGene and Ensembl annotations, and obtained identical counts for an average of 16.3% (about one sixth) of genes. Twenty percent of genes are not expressed, and thus have zero counts in both annotations. About 28.1% of genes showed expression levels that differed by 5% or higher; of these, the relative expression levels for 9.3% of genes (equivalent to 2038) differed by 50% or greater. The case studies revealed that the difference in gene definitions caused the observed inconsistency in gene quantification.
The Human Body Map 2.0 Project, using Illumina sequencing, generated RNA-Seq data for 16 different human tissues (adipose, adrenal, brain, breast, colon, heart, kidney, leukocyte, liver, lung, lymph node, ovary, prostate, skeletal muscle, testis, and thyroid) and is accessible from ArrayExpress (accession number E-MTAB-513). To demonstrate the impact of read length on analysis results, we created a new dataset in which each original 75-bp long sequence read was trimmed to 50 bp. The same analysis protocol described below was applied to both datasets. The RefGene, Ensembl, and UCSC annotation files in GTF format were downloaded from the UCSC genome browser.
Primary sequencing reads were first mapped to reference transcriptome and the human reference genome GRCH37.3 using Omicsoft sequence aligner (OSA) . Benchmarked with existing methods such as Tophat and others, OSA improves mapping speed 4–10 fold, with better sensitivity and fewer false positives. When a gene model is used in conjunction with a reference genome, by default, OSA maps RNA-Seq reads in three consecutive steps: (1) all reads are mapped to the reference transcriptome; (2) for mapped reads with mismatches, OSA aligns them with the reference genome and chooses the best hits; and (3) for unmapped reads, OSA maps them to reference genome. OSA can be finely controlled, and step #1 could be run alone if only those reads that could be mapped to a reference transcriptome were desired.
Accordingly, the effect of a gene model on RNA-Seq read mapping could be characterized and quantified by comparing the mapping results in different mapping modes. We focused on those uniquely mapped reads in the “transcriptome only” mode, and divided them into four categories (Figure 9C) according to their mapping results without a gene annotation in mapping step: (1) “Identical”, the same alignment results were obtained regardless of the use of a gene model; (2) “Alternative”, the read still mapped but mapped differently. It turns out that the majority of reads in this category were junction reads. A junction read could be either mapped as a non-junction read, or remain mapped as a junction read but with different start, end, and splicing positions; (3) “Multiple”, a uniquely mapped read became a multiple-mapped one. When a read is mapped across the whole reference genome, it is more likely to be mapped to multiple locations; and (4) “Unmapped”, i.e., a read could not be mapped to anywhere in the genome without the assistance of a gene model. Nearly all reads in this category were junction reads.
We are grateful to Gary Ge at Omicsoft for sharing his deep insight on OSA implementation and his assistance with running OSA. We thank Pfizer for offering us computational resources. We receive no funding support from any third party.
- Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods. 2008;5(7):621–8.View ArticlePubMedGoogle Scholar
- Montgomery SB, Sammeth M, Gutierrez-Arcelus M, Lach RP, Ingle C, Nisbett J, et al. Transcriptome genetics using second generation sequencing in a caucasian population. Nature. 2010;464(7289):773–7.View ArticlePubMedGoogle Scholar
- Pan Q, Shai O, Lee LJ, Frey BJ, Blencowe BJ. Deep surveying of alternative splicing complexity in the human transcriptome by high-throughput sequencing. Nat Genet. 2008;40(12):1413–5.View ArticlePubMedGoogle Scholar
- Wang Z, Gerstein M, Snyder M. RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet. 2009;10(1):57–63.View ArticlePubMed CentralPubMedGoogle Scholar
- Mutz KO, Heilkenbrinker A, Lönne M, Walter JG, Stahl F. Transcriptome analysis using next-generation sequencing. Curr Opin Biotechnol. 2013;24(1):22–30.View ArticlePubMedGoogle Scholar
- Zhao S, Fung-Leung W-P, Bittner A, Ngo K, Liu X. Comparison of RNA-Seq and microarray in transcriptome profiling of activated T cells. PLoS One. 2014;9(1):e78644.View ArticlePubMed CentralPubMedGoogle Scholar
- Hurd PJ, Nelson CJ. Advantages of next-generation sequencing versus the microarray in epigenetic research. Brief Funct Genomic Proteomic. 2009;8(3):174–83.View ArticlePubMedGoogle Scholar
- Malone J, Oliver B. Microarrays, deep sequencing and the true measure of the transcriptome. BMC Biol. 2011;9:34.View ArticlePubMed CentralPubMedGoogle Scholar
- Garber M, Grabherr MG, Guttman M, Trapnell C. Computational methods for transcriptome annotation and quantification using RNA-seq. Nat Methods. 2011;8(6):469–77.View ArticlePubMedGoogle Scholar
- Engström PG, Steijger T, Sipos B, Grant GR, Kahles A, RGASP Consortium, et al. Systematic evaluation of spliced alignment programs for RNA-seq data. Nat Methods. 2013;10(12):1185–91.View ArticlePubMed CentralPubMedGoogle Scholar
- Soneson C, Delorenzi M. A comparison of methods for differential expression analysis of RNA-seq data. BMC Bioinformatics. 2013;14:91.View ArticlePubMed CentralPubMedGoogle Scholar
- Borozan I, Watt SN, Ferretti V. Evaluation of alignment algorithms for discovery and identification of pathogens using RNA-Seq. PLoS One. 2013;8(10):e76935.View ArticlePubMed CentralPubMedGoogle Scholar
- Anders S, Huber W. Differential expression analysis for sequence count data. Genome Biol. 2010;11(10):R106.View ArticlePubMed CentralPubMedGoogle Scholar
- Robinson MD, McCarthy DJ, Smyth GK. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010;26(1):139–40.View ArticlePubMed CentralPubMedGoogle Scholar
- Wu TD, Nacu S. Fast and SNP-tolerant detection of complex variants and splicing in short reads. Bioinformatics. 2010;26(7):873–81.View ArticlePubMed CentralPubMedGoogle Scholar
- Hu J, Ge H, Newman M, Liu K. OSA: a fast and accurate alignment tool for RNA-Seq. Bioinformatics. 2012;28(14):1933–4.View ArticlePubMedGoogle Scholar
- Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics. 2013;29(1):15–21.View ArticlePubMed CentralPubMedGoogle Scholar
- Wang K, Singh D, Zeng Z, Coleman SJ, Huang Y, Savich GL, et al. MapSplice: accurate mapping of RNA-seq reads for splice junction discovery. Nucleic Acids Res. 2010;38(18):e178.View ArticlePubMed CentralPubMedGoogle Scholar
- Kim D, Pertea G, Trapnell C, Pimentel H, Kelley R, Salzberg SL. TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biol. 2013;14(4):R36.View ArticlePubMed CentralPubMedGoogle Scholar
- Zhao S. Assessment of the impact of using a reference transcriptome in mapping short RNA-Seq reads. PLoS One. 2014;9(7):e101374.View ArticlePubMed CentralPubMedGoogle Scholar
- Pruitt KD, Tatusova T, Maglott DR. NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res. 2007;35(Database):D61–5.View ArticlePubMed CentralPubMedGoogle Scholar
- Hsu F, Kent WJ, Clawson H, Kuhn RM, Diekhans M, Haussler D. The UCSC known genes. Bioinformatics. 2006;22(9):1036–46.View ArticlePubMedGoogle Scholar
- Flicek P, Amode MR, Barrell D, Beal K, Billis K, Brent S, et al. Ensembl 2014. Nucleic Acids Res. 2014;42(Database issue):D749–55.View ArticlePubMed CentralPubMedGoogle Scholar
- Thierry-Mieg D, Thierry-Mieg J. AceView: a comprehensive cDNA-supported gene and transcripts annotation. Genome Biol. 2006;7 Suppl 1:1–14.View ArticlePubMedGoogle Scholar
- Wilming LG, Gilbert JG, Howe K, Trevanion S, Hubbard T, Harrow JL. The vertebrate genome annotation (Vega) database. Nucleic Acids Res. 2008;36(Database):D753–60.View ArticlePubMed CentralPubMedGoogle Scholar
- Harrow J, Frankish A, Gonzalez JM, Tapanari E, Diekhans M, Kokocinski F, et al. GENCODE: the reference human genome annotation for the ENCODE project. Genome Res. 2012;22(9):1760–74.View ArticlePubMed CentralPubMedGoogle Scholar
- Wu P-Y, Phan JH, Wang MD. Assessing the impact of human genome annotation choice on RNA-seq expression estimates. BMC Bioinformatics. 2013;14 Suppl 11:S8.View ArticlePubMed CentralPubMedGoogle Scholar
- Chen G, Wang C, Shi L, Qu X, Chen J, Yang J, et al. Incorporating the human gene annotations in different databases significantly improved transcriptomic and genetic analyses. RNA. 2013;19(4):479–89.View ArticlePubMed CentralPubMedGoogle Scholar
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.