Evaluation of the similarity of gene expression data estimated with SAGE and Affymetrix GeneChips
© Ruissen et al. 2005
Received: 28 October 2004
Accepted: 14 June 2005
Published: 14 June 2005
Skip to main content
© Ruissen et al. 2005
Received: 28 October 2004
Accepted: 14 June 2005
Published: 14 June 2005
Serial Analysis of Gene Expression (SAGE) and microarrays have found awidespread application, but much ambiguity exists regarding the evaluation of these technologies. Cross-platform utilization of gene expression data from the SAGE and microarray technology could reduce the need for duplicate experiments and facilitate a more extensive exchange of data within the research community. This requires a measure for the correspondence of the different gene expression platforms. To date, a number of cross-platform evaluations (including a few studies using SAGE and Affymetrix GeneChips) have been conducted showing a variable, but overall low, concordance. This study evaluates these overall measures and introduces the between-ratio difference as a concordance measure pergene.
In this study, gene expression measurements of Unigene clusters represented by both Affymetrix GeneChips HG-U133A and SAGE were compared using two independent RNA samples. After matching of the data sets the final comparison contains a small data set of 1094 unique Unigene clusters, which is unbiased with respect to expression level. Different overall correlation approaches, like Up/Down classification, contingency tables and correlation coefficients were used to compare both platforms. In addition, we introduce a novel approach to compare two platforms based on the calculation of differences between expression ratios observed in each platform for each individual transcript. This approach results in a concordance measure per gene (with statistical probability value), as opposed to the commonly used overall concordance measures between platforms.
We can conclude that intra-platform correlations are generally good, but that overall agreement between the two platforms is modest. This might be due to the binomially distributed sampling variation in SAGE tag counts, SAGE annotation errors and the intensity variation between probe sets of a single gene in Affymetrix GeneChips. We cannot identify or advice which platform performs better since both have their (dis)-advantages. Therefore it is strongly recommended to perform follow-up studies of interesting genes using additional techniques. The newly introduced between-ratio difference is a filtering-independent measure for between-platform concordance. Moreover, the between-ratio difference per gene can be used to detect transcripts with similar regulation on both platforms.
Methods for the analysis of gene expression profiles have gone through progressive development over recent years. Traditionally, the level of transcribed mRNA has been analyzed using methods such as Northern blots, quantitative RT-PCR, differential display [1, 2], representational difference analysis , total gene expression analysis  and suppressive subtractive hybridization [5, 6]. All these methods, although fruitful and still in use, have a limited scope with regard to the number of genes that can be analyzed simultaneously. Because of this limitation, new methods have been developed, including serial analysis of gene expression (SAGE) , massive parallel signature sequencing (MPSS) , cDNA and oligo microarray chip technologies [9–13] and Affymetrix GeneChips .
SAGE is based on the high-throughput sequencing of concatemers of short (13–14 bp; recently 21–25 bp) sequence tags that originate from a known position within a transcript and therefore theoretically contain sufficient information to identify a transcript . In contrast to microarrays, SAGE estimates the abundances (expression levels) of thousands of transcripts without prior knowledge of the transcripts being expressed. The proportion of the tag within the total number of tags in the library gives a direct estimate of the abundance of the transcript within a biological sample. The advantage of the SAGE technique is that it performs a random sampling from the pool of all expressed transcripts (also called a transcriptome) allowing the discovery of new transcripts. The proportional nature of the data enables easy exchange among researchers thus allowing the creation of large public SAGE data sets for numerous human tissues, both normal and diseased [14, 15]. Disadvantages of SAGE are that the technique is expensive, labor-intensive and prone to sequencing errors. Moreover, the annotation of the short 10 bp sequence tags may identify more than one transcript. This can be overcome by using LongSAGE libraries that contain 17 bp tags which can be more reliably mapped to Unigene clusters or the complete genome sequence . However, SAGE is not suitable for high-throughput analyses of multiple samples.
In contrast to SAGE, DNA microarrays are used to measure relative expression levels between samples of thousands of known transcripts. Currently, three array variants are being used (for reviews see [17, 18]) i.e. spotted cDNA microarrays, spotted oligonucleotide microarrays and synthesized oligonucleotide microarrays (Affymetrix GeneChips). The advantages of Affymetrix GeneChips are that they are highly sensitive enabling the detection of mRNAs present at levels as low as 1 transcript in 100000  when the probe labeling step is not considered . They are suitable for high-throughput analyses of multiple samples, and data can easily be shared and used for comparisons with other researchers using the same chips. Disadvantages of Affymetrix GeneChips are that they are only commercially available, are costly and require expensive specialized equipment and are inflexible in design (although custom design is possible at high cost). Furthermore, GeneChips only measure the expression of genes represented on the chip in contrast to SAGE, in which the expression profile of the complete transcriptome can be mapped.
In the current study we have determined the similarity between SAGE- and Affymetrix GeneChips-generated gene expression profiles of two independent RNA samples. One RNA sample is isolated from a Wilms' tumor; the other is the Stratagene Universal reference RNA. These expression data were then used to evaluate the annotation problems when comparing different gene profiling platforms and the methods that can be used to compare two different platforms with respect to individual gene expression measurements and with respect to between-sample gene expression ratios. Finally, it is demonstrated that the between-ratio difference can be applied to select those transcripts that display similar expression changes in both platforms.
Microarray experiments were performed using Wilms' tumor RNA and the Stratagene Reference RNA. Results of biological replicas of each sample, with independent cRNA synthesis and hybridizations, showed a good reproducibility (Pearson correlation coefficients of 0.982 (n = 11938) and 0.979 (n = 10489);both P < 0.01) using intensity values for all probe sets with a "present" signal (on average 54%; absent = 44% and marginal = 2%) (Figure 2B ; black spots). This indicates that two identical RNA samples perform very similar within the pre-processing and final hybridization reactions. Although, in contrast to SAGE, the intensity signals on the array do not represent the actual abundance of mRNA molecules, we classified the Affymetrix data to get an impression of the signal distribution. These distributions are similar to those of the SAGE data. The majority (~90%) of the probe sets showed low signal intensity.
In the comparison of data obtained by SAGE and Affymetrix GeneChips only reliably annotated tags can be included (as described in the 'Matching of platforms' paragraph of the Material and Methods section; see also Shippy et al.). Annotation of SAGE tags to genes and their corresponding Unigene cluster numbers revealed that on average 30% of all tags (including low abundant tags) could be reliably annotated based on the SAGE Genie principles . Annotation improves to an average of 70% for tags that have an intermediate to abundant expression level. The remainder of the tags could not reliably be associated with a gene or Unigene cluster because they were not available through the SAGE Genie site, annotated to unclustered ESTs, or their reliability was below 67% (according to the SAGE Genie principles). Additionally, we performed LongSAGE for the Wilms' tumor sample, which allows the identification of 17 bp tags instead of 10 bp tags. Theoretically, over 99.8% of the 17 bp tags are expected to occur only once in the human genome. However, analyses based on actual sequences have demonstrated that only 75% of the 17 bp tags occur only once in the human genome, with the remaining tags matching duplicated genes or repeated sequences . Complete annotation of LongSAGE tags using SAGE Genie data and principles revealed that 28% of all tags could be assigned a reliable Unigene cluster. Similar to SAGE, the annotation improves to approximately 70% for tags that have an intermediate to abundant expression level.
In the comparison of platforms, we first analyzed the similarity of gene expression levels between SAGE and Affymetrix data in one tissue sample. Both datasets were matched according to their Unigene cluster numbers. Figure 2C shows a scatter plot of SAGE and Affymetrix gene expression values of the 6408 Unigene clusters before exclusion of ambiguous matches (red spots). For multiple matches, the highest tag count or intensity value per cluster was plotted. In this scatter plot the black spots represent the final selection of 1094 unambiguous and filtered Unigene clusters. Note that high Affymetrix expression levels are observed for low SAGE tag counts (spots in top-left quadrant of figure 2C), but that no high tag counts are found for low Affymetrix data (few spots in bottom-right quadrant). Overall, the correlation between SAGE tag counts and Affymetrix intensity levels of the 6408 matching Unigene clusters seemed to be modest. This was confirmed by mapping the distribution of the top 100 highly expressed genes in SAGE in the distribution of the Affymetrix dataset, and vice versa (Data not shown, but this can be inferred from figure 2C). In both comparisons, only halve of the genes from the top 100 of one platform have a rank in the top 100 of the other platform, whereas approximately 10% are matched to genes with ranks of over 1000 in the other platform. This already shows that the correlation of expression levels between platforms is modest.
In most gene expression studies, alterations of expression levels are expressed in relation to the simultaneously determined expression level of a reference sample and conclusions are drawn based on these ratios. To this end, expression ratios were calculated between the reference RNA and the Wilms' tumor data for the SAGE tag counts as well as for Affymetrix HG-U133A GeneChips spot intensities. In this comparison the final data set containing only the between-sample ratios for unambiguous transcripts was used (Figure 3), allowing effective comparison of the two platforms.
Summary of similarities between SAGE and Affymetrix HG-U133GeneChips for the final dataset (= 1094)
Contingency table diagonal
Pearson Correlation coefficient3
0–3 fold between-ratio difference
Low expresssion 1
High expresssion 1
Significant difference 2
In an attempt to explain the difference in gene expression between SAGE and Affymetrix GeneChips we summarize different sources. Variation due to "noisy fold ratios" generated from low-intensity transcripts is a widespread cause of error when computing statistics on ratios without accounting for the intensities from which the ratios were derived . Within our data set we have shown that the final data set is an unbiased selection of the total data set (Figure 2D). Additionally, the mean intensity signals for both SAGE and Affymetrix GeneChips appear to be randomly distributed over the ratio distribution (data not shown). This indicates that the difference in expression ratios between platforms is not caused by low intensity values.
In addition, it has been suggested that the GC-content of the transcripts could influence the correspondence between platforms . To test this hypothesis for the final data set (n = 1094) we retrieved all transcript sequences (mostly Refseq sequences ) and probe set sequences and calculated the GC-content for each transcript and the average GC-content of the corresponding probe sets. The GC-contents were divided into classes (30–35%; 35–40%; 45–50% etc.) and the correlation between GC-content and the differences in expression ratios between platforms was tested. Statistical analysis showed that ratio differences did not depend on the GC-content of the transcript (Chi-square value of 25.69; df = 35; P = 0.875). However, Unigene clusters showing good agreement between platforms tend to depend on the high GC-content of the corresponding probe sets (Chi-square value of 61.114; df = 30; P = 0.001). This GC-analysis indicates that expression data from probe sets with a higher GCcontent show a better agreement with their corresponding SAGE data and are more reliable. Note in this respect that for a Unigene cluster the GC content of a probe set is not necessarily the same as that of a transcript.
To answer the question whether gene expression data generated by SAGE and by Affymetrix HG-U133A GeneChips can be used interchangeably, data from these two techniques were compared using two independent RNA samples. Analysis of intra-platform variation shows good correlation for both SAGE and Affymetrix; this is also observed by others (see Figure 6). The inter-platform comparison depends on reliable annotation of the SAGE tags for which we used the tag annotation from SAGE Genie . A reliable association could be made only for 30% of all tags, which increases to 70% for intermediate and high abundant tags. This indicated that SAGE tag annotation requires improvement, especially for low abundant tags. Based on literature findings, the use of LongSAGE should refine annotation of SAGE tags . However, the current study showed an annotation profile similar to the above-mentioned percentages, indicating that LongSAGE is still not sufficient for unique gene identification. Similar disappointing improvements in annotation efficiency have been found in other studies . Further comparison of SAGE and LongSAGE requires a study that falls beyond the scope of this paper; such a study has recently been published . For the annotation in Affymetrix GeneChips, accession numbers had to be converted to Unigene clusters, which was hampered by the fact that 8% of the transcripts present on Affymetrix HG-U133A GeneChips were no longer present in a Unigene cluster. Moreover, some probe sets might represent a different transcript than initially reported (see for an example ).
A first impression about the agreement between SAGE and Affymetrix HG-U133A GeneChips was obtained from the evaluation of the top100 of highly abundant transcripts in one RNA sample in each platform. This comparison showed that approximately 50% of the top100 of highly expressed transcripts showed a corresponding expression within the top100 of highly expressed transcripts of the other platform. This is in line with the findings of Ishii et al.  who compared SAGE with Affymetrix GeneChips containing approximately 6000 transcripts, and Iacobuzio-Donahue et al.  who showed that only genes that display robust changes in gene expression were identified by both platforms. In our current study, approximately 80% of transcripts detected in the top100 of one platform were mapped within the top1000 of the competing platform. A similar figure was presented by Evans et al.  who used the RG-U34A Affymetrix GeneChips. Recently, Kim  suggested that absolute expression analyses of SAGE and oligonucleotide microarray technology reliably detected medium-to-high abundant transcripts.
For a more extensive comparison between the individual gene expression profiling platforms we used gene expression ratios between Wilms' tumor and Stratagene Universal reference RNA as determined by SAGE and Affymetrix GeneChips. The use of ratios might have the disadvantage of losing information about individual expression values. However, it corrects for platform specific variations (i.e. probe design, hybridization efficiencies etc.). By matching SAGE and Affymetrix data, an unambiguous data set was generated. On average about 30% of the unambiguous genes were observed to be expressed by both SAGE and Affymetrix GeneChips and could be included in the final comparison. Although this comparison comprised only 13% of all SAGE Unigene clusters and only 8% of the Affymetrix Unigene clusters, it was demonstrated that this selection was unbiased with respect to gene expression levels in each of the platforms. This allows the extrapolation of the conclusions to the whole platform.
We looked for the correspondence in gene expression results between the two techniques using Up/Down classification (Figure 1A), the contingency table diagonal (Figure 1B) and correlation coefficients (Figure 1C). In addition, an approach was introduced in which differences between scaled ratios were calculated. The latter measure was introduced to circumvent pitfalls of Up/Down classification, contingency tables and correlation coefficient that were discussed in the background section. To this end, we introduced an approach in which the scaling of the ratio data enables the calculation of individual ratio differences between platforms. These ratio differences can then be used to determine to which extend and in which range (e.g. 0–3 fold difference) two platforms differ in their expression ratio estimation. In this study we show that, as opposed to the other overall concordance measures, the between-ratio difference is hardly sensitive to filtering of noisy data. From the current analysis, we conclude that contingency tables and, preferably, calculation of ratio differences between two platforms should be used to compare gene expression profiles from different platforms. Moreover, the between-ratio difference provides the user with a correspondence measure per individual gene that can be used to select those genes for which a predetermined correspondence level is reached. The approximately normal distribution of the between-ratio differences (Figure 4) allows the calculation of a standardized difference value for each gene from which a P-value can be obtained. Note that this P-value cannot be used to test whether the ratio difference equals zero. Such a test requires a gene specific variance estimate in the denominator of the standardized difference and such a variance estimate cannot be obtained from the four non-replicated expression values that are used to calculate the ratio difference. However, the standardized difference and its P-value can be used as a measure for the position of a specific gene within the distribution of between-platform ratio differences and as such they can serve as a statistical threshold to determine which genes can be confidently interchanged between platforms. For instance, in the current study, the transcripts with a less than 0.5 fold between-ratio difference (red dots in figure 5D) have a chance of at least 0.8 that they show similar gene expression on both platforms. Some of the choices in the scaling procedure can be considered to be ad-hoc. However, given the current State of understanding of the causes for within and between platform variability it was deemed best to opt for a simple quadratic scaling equation to convert the distribution of ratios, which is asymmetric around 1 to a common scale. When the knowledge on the physics, chemistry, and sampling statistics increases, better conversion functions will present themselves.
Disparities of the technical approaches
• Sequence errors (although it has been shown that most of the single-copy SAGE tags are not generated from experimental sequence errors, but that they are novel tags derived from novel transcripts )
• Tag annotation difficulties
• Missing transcripts due to absence of a recognition site for the anchoring enzyme (approximately 0.7%) or GC-content bias [24,54]
• Incorrect tags arise from incomplete digestion or alternative poly-adenylation 
• Sequence polymorphisms resulting in multiple tags for a single transcript
Affymetrix HG-U133 GeneChips
• Probe design issues (such as distance of the target sequence from the poly-A tail; secondary structures within the target sequence; cross-reactivity of the probe with other transcripts, nucleic acid structure)
• Differences in hybridization efficiencies between probe sets
• Incorrect annotation of transcripts (no sequence verification)
• Efficiencies in dye incorporation
Future studies should be aimed on improving the efficiency of SAGE tag annotation and avoidance of systematic bias in microarray techniques. Only then, measurements of various technologies can be directly compared and transformed to a universal gene expression catalogue. SAGE has the advantage that a whole transcriptome is analyzed, but is limited to the analysis of a small number of samples. For screening of large sets of samples SAGE cannot be the favored choice and Affymetrix GeneChips might be a good alternative. Therefore, we think that the future lies in combining the data from SAGE with Affymetrix GeneChips, custom cDNA or oligo arrays. This gives the advantage of complete expression profiling using SAGE and high-throughput array screening of a larger panel of samples allowing rapid identification and for instance validation of clinical relevant genes involved in disease onset [42, 43]. Finally, the proposed ratio difference between platforms using an universal reference sample (as also indicated in ) can serve as a measure for interplatform correspondence per individual gene.
This paper evaluates several approaches for the comparison of different gene expression platforms, outlined using SAGE and Affymetrix GeneChips. We demonstrate that for both SAGE and Affymetrix GeneChips the intra-platform correlations are extremely good, but that the inter-platform agreement based on an unbiased selection of transcripts is modest. The agreement between platforms increases if only transcripts are included with high tag counts and high hybridisation intensities. It appears that the expression distributions are similar for each of the platforms, but that the correlation between platforms is modest due to intrinsic differences, like sensitivity, levels of noise, and gene annotation. Finally, we introduce a novel, filtering-independent approach for data analysis based on the calculation of differences between expression ratios observed in SAGE and Affymetrix GeneChips for each individual transcript. The statistical probability value that can be assigned to each individual betweenratio difference, allows the selection of individual transcripts that display similar regulation on both platforms.
Wilms' tumor tissue was obtained from a single individual after resection of the tumor. Tissue was immediately frozen in liquid nitrogen. Informed consent to use this material for scientific research was obtained. After homogenization, total RNA was extracted using Trizol (Invitrogen, Breda, The Netherlands), dissolved in RNase free water and stored at -80°C. The Stratagene Universal reference RNA was obtained from Stratagene (Stratagene, Amsterdam, The Netherlands, catalog #740000-41). Purity and integrity of the RNA samples was confirmed on the Agilent 2100 Bioanalyzer (Agilent Technologies Netherlands B.V., Amstelveen, The Netherlands), using the LabChip® approach.
The SAGE library of the Wilms' tumor RNA was generated using the I-SAGE kit according to the manufacturer's instructions (Invitrogen, Breda, The Netherlands; cat. #T5000-03). A detailed protocol may be obtained as a free download . For LongSAGE minor modifications were implemented in the protocol of the I-SAGE kit; i.e. the restriction enzyme Bsm FI was replaced by Mme I, linkers were adapted for LongSAGE and ditags were created using sticky-end ligation. All sequence files were processed using the SAGE2000 software provided by Dr. K.W. Kinzler (see also ). The SAGE library from the Stratagene Universal reference RNA was obtained from the NCBI website. This library can be retrieved in the Gene Expression Omnibus under code GSM1734 [14, 20]).
Extracted SAGE tags were annotated based on the SAGE Genie principles  through several stringent filters using data from the CGAP website . Several databases (i.e. HsMap.txt, HsRepetitive.txt and HsDatasets.txt) were combined to a final dataset containing all information necessary for tag annotation. Tags matching to unclustered EST's were considered to be no-matches. Tags matching to Unigene clusters retrieved from low ranked databases (<67%; according to the rules set by CGAP) were not included in our comparisons. During this process tags are matched to no, one unique, or more than one Unigene cluster (Unigene Build 160, March 2003). To further identify tags matching more than one Unigene cluster, we extracted the 11th base from our original sequence files using the SAGE2000 software. This 11th base can be used to match against the deposited sequences (Genbank, EMBL etc.) and in this way one may be able to exclude Unigene clusters that contain a different 11th base in their sequence and thereby minimize the number of multiple matches. In the final comparison tags matching to multiple Unigene clusters were excluded. For annotation of LongSAGE tags we used the data available at the CGAP site for Unigene Build 170 (July 2004). These annotations were not available for Unigene Build 160.
Affymetrix HG-U133A GeneChips were used and the hybridizations were performed according to the manufacturer's protocols and carried out at the Micro-array Department (MAD; Institute for Life Sciences, Faculty of Science, University of Amsterdam). For analysis, the MAS 5.0 software suite was used and comparisons between duplicate Wilms' tumor hybridizations and duplicate Stratagene Universal reference RNA hybridizations were made (data were deposited into the GEO under accession GSE1158). This gives four comparisons (2 Log ratios), from which the geometric mean gene expression ratio between the two samples was calculated. Probe sets on the Affymetrix chips were matched with Unigene clusters (Unigene Build 160, March 2003).
The matching of data from two different gene expression profiling platforms (as illustrated in figure 3) poses a couple of problems. On the one hand, a SAGE tag may link to more than one Unigene cluster which results in matches with multiple different Affymetrix probe sets. On the other hand multiple tags originating from one Unigene cluster might match with one Affymetrix probe set. Examining all multiple matches for each individual transcript is extremely laborious and beyond the scope of this study. To circumvent these and other problems we included in our comparison only those clusters for which a one-to-one relation between the two platforms was found. These clusters are called unambiguous Unigene clusters. This matching step already results in a considerable reduction of data available for the comparison. In addition, data were filtered for the presence of gene expression (tag count>0 in both SAGE libraries and present signal on the arrays for both RNA samples).
For each platform and each transcript that full-filled the matching criteria an expression ratio between Wilms' tumor and Stratagene Reference RNA was calculated. With these ratios the correspondence between platforms was estimated using the Pearson correlation coefficient, Up/Down classification and a contingency table (Figure 1A, 1B, 1C). Because none of these measures was deemed satisfactory as overall correspondence measure (see background section) we developed a new measure based on the difference between the log(ratio) values in the two platforms for each individual transcript (Figure 4). The chemistry, physics and statistics of the detection technique make that in each platform the observed gene expression is a non-linear transformation of the real gene expression level. For instance, saturation of the array hybridization makes that the high expression levels are truncated. However, because such artifacts affect genes in both tissues in the same way, an observed expression ratio of 1 can still be expected to be observed for genes that are not differentially expressed in the studied tissues. On the other hand, these saturation effects, as well as the relatively larger Poisson error in the detection of low intensity values will affect the ratios on both sides of the ratio distribution in an unpredictable way. Similarly, the sampling error in SAGE will affect ratios for lowly expressed genes, despite the fact that SAGE tag counts are linearly related to transcript abundance. The substitution of zero tag counts that is required for the calculation of ratios will also skew the ratios . Finally, the discrete nature of tag counts, combined with the necessary normalization of tag counts to tags per 50000, will have non-linear effects on the observed ratio distribution in the SAGE platform. Therefore, the relation between the gene expression ratios observed in the SAGE and Affymetrix platform cannot be assumed to be a simple linear Y = X relation. This is already clear from the difference ranges of ratio values in each platform. To directly compare the ratios observed in both platforms at least the range of observed ratios should be similar. The nature of the relation is unknown and fully obscured by the variability in both platforms. However, because in each platform the observed ratio of 1 can be assumed to be true, the simplest function to scale the range of ratio of one platform to that of the other platform is a quadratic equation. Such a scaling function can be based on three values from each ratio distribution. These are the ratio of 1 and, to avoid undue influence of the extreme ratios, the 10th and 90th percentile values. The quadratic scaling takes into account that the ratio distribution is not symmetrical around ratio 1. The full scaling procedure is illustrated and detailed in Figure 4. Note that the scaling uses log(ratio) values. After scaling, the absolute difference between the log(ratios) per individual gene was calculated. The resulting differences of log(expression ratios) were classified into classes of width 0.5, which corresponds to an approximate 3-fold difference in expression ratio between platforms. These classes were used to label the genes in scatter plots of two different platforms (Figure 2D). As illustrated in Figure 4, the distribution of between-ratio differences is approximately normal. Therefore, the mean and standard deviation of this distribution can be used to calculate a standardized difference value (Diffst) per gene and a P-value for this standardized difference can be obtained from the normal distribution. This P-value can then serve as a measure for the position of each gene in the distribution of between-platform ratio differences. Note that this P-value should not be interpreted as a significance value for the ratio difference between platforms. Such a test requires a gene specific variance estimate in the denominator of the standardized difference, which cannot easily be derived from the available data.
This work was supported by the Stichting Kindergeneeskundig Kankeronderzoek (SKK) and the Dutch Cancer Society (KWF; grant UVA 2001–2558)
We would like to thank Dr. A.H.C. van Kampen for reading the manuscript and helpful discussions and Raymond J. Waaijer for his bioinformatics support (Bioinformatics Laboratory, Academic Medical Center, the Netherlands).
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.