Almost all of the steps needed to process array CGH ratio data for phylogenetic analysis can influence the result. These include the filtering, normalization and tree-building procedures applied to the data. Using empirical data, we tested the effect of different analytical approaches on distance and parsimony analysis. Different methods of normalization were combined with mean or median hybridization values for each species. For Neighbor-Joining analysis, different metrics were used to convert normalized intensity ratios to genetic distances; for parsimony analysis, different thresholds were used to convert ratios to discrete character data. Two further methods were tested: a probabilistic method for converting ratios to discrete character data (GACK), and a self-contained work-flow (MPP) that converts raw CGH data to data suitable for phylogenetic analysis. An alternative to average or median CGH ratio values, BAGEL (a Bayesian approach to estimate a representative hybridization value for each species), was also tested.
To consider the effect of evolutionary distance among taxa on the analyses, we grouped taxa into three datasets that varied in combined genetic distance. The CON set consisted of the six most closely related outbreeding individuals of Neurospora (N. crassa A, N. crassa C, N. sitophila, N. tetrasperma, N. intermedia, and N. discreta). The NEU set is the CON set plus N. terricola, an obligately self-fertilizing relative. The ALL taxa set is the NEU set plus two more distant relatives, Sordaria macrospora and Podospora anserina. Phylogenetic trees for these taxon sets, based on multilocus sequence analyses (MLSA; [26–28]) are shown in Figure 1, with and without the reference taxon as indicated. As detailed in the methods, in all analyses, we present results with and without the reference taxon because some analyses are confounded by the zero distance between the reference taxon and itself.
Summaries of our analyses given below are supported by analyses in the additional material as follows: Analysis of filtered normalized data for NJ (Additional File 1: table S1A) and parsimony after GACK processing (Additional File 1: table S1B). Analysis of Bayesian-based (BAGEL) treatments for NJ (Additional File 1: table S1C) and parsimony (Additional File 1: table S1D). Parsimony analyses using MPP (Additional File 1: table S1F). MPP was also used to implement a likelihood approach with Neighbor-Joining to compensate for the single reference design (Additional File 1: table S1E).
Neighbor-Joining Analysis Normalized data, (Figure 2A & 2B, Additional File 2: table S2)
In no case did Neighbor-Joining analysis of the ALL taxa set, using CGH data that had been simply normalized, produce the same tree as the MLSA tree (Figure 2A). The most successful analyses produced trees at least two steps longer than the MLSA tree as judged by the SymD (symmetric distance) metric. These most successful analyses used NJ with either Pearson's correlation coefficient or Euclidean distance, with or without the reference taxon, and with loess (Limma), robust spline or linear normalization, but never with lowess (Acuity) normalization. Taking the mean or median of normalization values had no effect on the outcome, and neither did employing the additional 40% filter.
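To make the "steps" reported here concrete, the following is a minimal sketch of a symmetric-distance calculation. It assumes that SymD corresponds to the standard symmetric-difference (Robinson-Foulds) count of bipartitions that appear in one unrooted tree but not the other. The nested-tuple tree encoding, the function names, and the five placeholder taxa are illustrative assumptions and are not taken from the software used in this study.

```python
def leaf_set(tree):
    """Return the set of leaf names under a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def splits(tree):
    """Collect the non-trivial bipartitions of a tree treated as unrooted.

    Each bipartition is stored as the side that does NOT contain a fixed
    anchor leaf, so that the same split always has the same representation.
    """
    all_leaves = leaf_set(tree)
    anchor = min(all_leaves)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        side = all_leaves - below if anchor in below else below
        # Skip trivial splits (a single leaf versus everything else).
        if 1 < len(side) < len(all_leaves) - 1:
            found.add(side)
        return below

    visit(tree)
    return found


def symmetric_distance(tree_a, tree_b):
    """Number of bipartitions present in exactly one of the two trees."""
    return len(splits(tree_a) ^ splits(tree_b))


# Hypothetical five-taxon example (placeholder names, not the study taxa).
reference_tree = ((("A", "B"), "C"), ("D", "E"))
candidate_tree = ((("A", "C"), "B"), ("D", "E"))
print(symmetric_distance(reference_tree, candidate_tree))  # prints 2
```

Under this definition, a single misplaced grouping changes one bipartition in each tree, which is why the smallest non-zero distance between two fully resolved trees is two steps.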
With the NEU taxa set, the MLSA tree was recovered perfectly by NJ using Euclidean distance, but only with linear normalization based on the ratio of means. There was no effect of including or excluding the reference taxon, using the mean or median value for each gene, or adding the additional 40% filter. Clearly, exclusion of the distant taxa, Sordaria and Podospora, had a positive effect on the analyses, but only with this combination of methods.
With the CON taxa set, the MLSA tree was recovered less frequently. Here, again, the most robust result (zero steps away) was obtained by NJ analysis using Euclidean distance on data normalized linearly by the ratio of means. However, results were better when the reference taxon was included and the additional 40% filter was omitted on mean values for each gene. Again, inclusion or exclusion of the more divergent taxa had the largest effect on recovery of the MLSA tree, and trees with topologies close to the MLSA tree were found only with a narrow combination of methods.
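The two distance metrics compared in the NJ analyses above can be sketched as follows. This is an illustration only: it assumes that the correlation-based distance is taken as one minus Pearson's r, which is a common convention but is not stated in the text, and the taxon labels and ratio values are hypothetical placeholders.

```python
import math


def euclidean_distance(x, y):
    """Euclidean distance between two per-gene ratio profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pearson_distance(x, y):
    """Correlation-based distance, assumed here to be 1 - Pearson's r.

    Raises ZeroDivisionError if either profile has zero variance.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return 1.0 - cov / math.sqrt(var_x * var_y)


def distance_matrix(profiles, metric):
    """Pairwise distances for a dict mapping taxon name to ratio profile."""
    names = sorted(profiles)
    return {(a, b): metric(profiles[a], profiles[b]) for a in names for b in names}


# Hypothetical normalized log-ratio profiles for three placeholder taxa.
profiles = {
    "taxon_1": [0.0, -0.2, -1.1, 0.1],
    "taxon_2": [-0.1, -0.4, -1.5, 0.0],
    "taxon_3": [-0.9, -1.2, -0.3, -1.8],
}
euclid = distance_matrix(profiles, euclidean_distance)
correl = distance_matrix(profiles, pearson_distance)
```

Either matrix could then be passed to a Neighbor-Joining implementation. Note that Euclidean distance is sensitive to the overall magnitude of the ratios, whereas the correlation-based distance depends only on the shape of the profile across genes.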
Parsimony Analysis Normalized data, (Figure 2C & 2B, Additional File 2: table S2)
For the ALL dataset, with or without the reference, no trees with topologies identical to the MLSA tree were produced for any normalization. With the reference taxon included, the averaged loess and median of the spline normalization gave trees two to four steps distant for most values of the %EPP cutoff. The lowess trees were one to ten steps longer than the MLSA tree, depending on the value of the %EPP. Trees that included the reference taxon were substantially worse (see Additional File 2: table S2).
With the NEU taxon set, several of the thresholds based on %EPP resulted in trees with topologies identical to the MLSA tree when the reference taxon was excluded. These trees identical to the MLSA tree included those made using the averaged values of the loess and the median values of the spline normalizations, and many of the percent EPP values using averaged linear normalization. Adding the additional 40% filter had a negative effect such that only the 50% EPP threshold gave the MLSA tree topology. With the reference taxon included in the analysis, the MLSA tree topology was not recovered as judged by the SymD metric (no closer than six steps) or the D1 metric (as close as one step, see Additional File 2: table S2).
For the CON dataset, excluding the reference, the averaged values of the linear and loess normalizations recovered the MLSA tree topology in four and five of the eleven percent EPP thresholds respectively. The median and average values of the robust spline normalization were also successful in capturing the MLSA tree. Again, the additional 40% filter resulted in poorer trees overall. The remaining iterations of the data were two to four steps away. Including the reference species gave trees that were no closer to the MLSA tree than four steps and then only for the spline and loess normalizations.
NJ after Bayesian estimation of a relative hybridization level (Figure 3A and 3B, Additional File 3: table S3)
For the ALL taxa set, with and without the reference species, the MLSA tree topology was not recovered with either the Euclidean or correlation-based metric. The closest approximation of the MLSA tree was achieved by the robust spline and loess normalizations (two steps longer by both the Euclidean and correlation distance metrics). For the correlation metric, including the reference species had no effect on results. However, for the Euclidean metric, including the reference species in some cases increased the length relative to the MLSA tree by two to six steps.
For the NEU dataset, the Euclidean distance metric captured the MLSA tree for both the spline and loess normalizations while the correlation metric did so solely with the robust spline normalization. This result was found regardless of whether or not the reference taxon was included. For the CON dataset, the correlation metric outperformed the Euclidean metric by capturing the MLSA tree with almost all approaches (except the lowess normalization) regardless of whether the reference taxon was included or excluded. The Euclidean metric recovered the MLSA tree topology with fewer combinations of approaches, and performed worse when the reference taxon was included (Limma spline normalization moved from zero to two steps distant).
One noteworthy result with BAGEL NJ trees is that, unlike the other methods tested, the results are far less sensitive to inclusion or exclusion of the reference taxon. This insensitivity is presumably due to BAGEL's extrapolation of the reference value, which appears to be a more robust way of including the reference taxon than including self-self controls for tree construction.
BAGEL Parsimony (Figure 3C and 3D, Additional File 3: table S3)
For parsimony analysis, the BAGEL estimates of hybridization levels described above were binned at the first, second, and third quartiles. For the ALL dataset, no method of analysis recovered the MLSA tree topology. The loess, spline, and lowess normalizations were four steps distant irrespective of inclusion of the reference taxon. The linear normalization was worse (six steps distant) and the worst result was obtained when the reference taxon was included (eight steps distant).
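The quartile binning described above can be illustrated with a short sketch. It assumes one reading of the procedure: each quartile serves in turn as a cutoff, and estimates at or above the cutoff are scored 1 (present) while those below are scored 0 (absent or divergent). The scoring direction, the function names, and the example values are assumptions made for illustration and are not taken from the study.

```python
import statistics


def quartile_cutoffs(values):
    """Return the first, second and third quartiles of the estimates."""
    q1, q2, q3 = statistics.quantiles(values, n=4)
    return q1, q2, q3


def binarize(values, cutoff):
    """Score each estimate as 1 (at or above the cutoff) or 0 (below it)."""
    return [1 if v >= cutoff else 0 for v in values]


# Hypothetical per-gene hybridization estimates for one taxon.
estimates = [0.05, 0.12, 0.30, 0.41, 0.55, 0.62, 0.78, 0.90, 0.95, 1.00]
q1, q2, q3 = quartile_cutoffs(estimates)

# One binary character vector per cutoff; a parsimony matrix would hold
# one such vector per taxon for the chosen cutoff.
characters_by_cutoff = {
    "first_quartile": binarize(estimates, q1),
    "second_quartile": binarize(estimates, q2),
    "third_quartile": binarize(estimates, q3),
}
```

A lower cutoff scores more genes as present, so the choice of quartile changes how many characters vary among taxa and therefore how much signal is available to the parsimony analysis.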
For the NEU set, no method of analysis recovered the MLSA tree topology, although the loess, spline, and lowess normalizations again performed best (two steps distant when binning data at the first quartile). The CGH trees constructed from the linear normalization were at best four steps longer than the MLSA tree when the reference taxon was excluded, and six steps longer when it was included. Binning at the second or third quartile resulted in trees that were typically four steps longer.
For the CON taxon set, the MLSA tree topology was recovered only when the reference taxon was included, and then only when binning at the first quartile for the spline, loess or lowess normalizations, or at the third quartile for the spline and loess normalizations. Other approaches gave trees two to four steps longer than the MLSA tree.
Unlike the BAGEL NJ analyses, which were insensitive to inclusion or exclusion of the reference taxon, the BAGEL parsimony analysis improved when the reference taxon was included in the CGH phylogeny. However, in the BAGEL parsimony analyses, the MLSA tree was recovered only for the CON taxa dataset, and only for a narrow set of approaches, as noted above.
Treebuilding with MPP, the Microarray to Phylogeny Pipeline (Figure 4, Additional File 4: table S4)
As described in the methods, we used the MPP pipeline to construct both Neighbor-Joining and Parsimony trees for the three groups of taxa. The MPP method begins by using CGH data to score hybridization probes as present or absent. These data can be exported for parsimony analysis or they can be used to make pairwise distance matrices by a likelihood approach that is designed to compensate for the single reference design. These distance data are then used for phylogenetic analysis by Neighbor-Joining analysis. MPP allows the user to control various options: the CGH data can be transformed using either a log or an inverse hyperbolic sine function (arsinh), the presence or absence of a probe can be estimated by EPP or by BPP, and the binwidth for assigning probe presence or absence can be set at either 0.05 (norm) or determined experimentally (exp). Applying these options in all combinations gave us eight basic combinations of options for both parsimony and NJ phylogenetic analyses.
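The eight combinations arise from three binary choices (2 x 2 x 2). The sketch below enumerates them and contrasts the two transformations. It does not call MPP or reproduce its interface; the option labels simply mirror the abbreviations used in the text, and the use of the natural logarithm is an assumption.

```python
import itertools
import math

# Option labels follow the abbreviations in the text; this is not MPP's API.
TRANSFORMS = ("log", "arsinh")
PRESENCE_ESTIMATORS = ("EPP", "BPP")
BINWIDTHS = ("norm", "exp")

option_grid = list(itertools.product(TRANSFORMS, PRESENCE_ESTIMATORS, BINWIDTHS))


def transform(name, value):
    """Apply the named transformation to a single intensity ratio.

    The natural log is assumed here; it is undefined for values <= 0,
    whereas arsinh is defined for all real values.
    """
    if name == "log":
        return math.log(value)
    if name == "arsinh":
        return math.asinh(value)
    raise ValueError(f"unknown transform: {name}")


for combo in option_grid:
    print("-".join(combo))
```

The arsinh transformation behaves like a logarithm for large values but remains finite at zero, which is one reason it is sometimes preferred for low-intensity microarray signals.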
For the ALL dataset, MPP using NJ did not recover the MLSA tree for any of the eight options. A tree two steps longer than the MLSA tree was recovered using arsinh, BPP, with a binwidth set at norm and excluding the reference taxon. Use of EPP or inclusion of the reference taxon gave trees at least twice as distant.
For the NEU taxon set using MPP and NJ, with the reference taxon excluded, trees concordant with the MLSA tree were recovered for five of the eight options. These five included all of the log-transformed data and the arsinh-transformed data option with BPP and norm binwidth. When the reference was included, the same iterations gave CGH trees four to six steps longer than the MLSA tree.
For the CON dataset using MPP with NJ, the results were similar to the NEU set in that all log-transformed data options recovered the MLSA tree topology when the reference taxon was excluded and no options recovered the MLSA tree topology when the reference taxon was included.
Parsimony MPP tree construction (Figure 4C and 4D, Additional File 4: table S4)
For parsimony analysis, the same eight combinations were used to convert CGH data to presence/absence data sets and trees were made using 50% Majority-Rule consensus.
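As an illustration of the 50% Majority-Rule consensus, the sketch below keeps every grouping that occurs in more than half of a set of input trees. It is a simplified version that treats trees as rooted nested tuples and compares clades rather than unrooted bipartitions. The encoding, function names, and placeholder taxa are assumptions and do not reproduce the phylogenetics software used in this study.

```python
from collections import Counter


def clades(tree):
    """Return the set of clades (as frozensets of leaves) in a rooted tree."""
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        found.add(below)
        return below

    root = visit(tree)
    found.discard(root)  # the full leaf set is shared by every tree
    return found


def majority_rule_clades(trees, threshold=0.5):
    """Clades present in strictly more than `threshold` of the trees."""
    counts = Counter(clade for tree in trees for clade in clades(tree))
    return {clade for clade, k in counts.items() if k / len(trees) > threshold}


# Three hypothetical equally parsimonious trees for four placeholder taxa.
equally_parsimonious = [
    (("A", "B"), ("C", "D")),
    ((("A", "B"), "C"), "D"),
    (("A", "C"), ("B", "D")),
]
consensus = majority_rule_clades(equally_parsimonious)
print(consensus)  # only the (A, B) clade occurs in more than half the trees
```

Groupings that fall at or below the threshold are collapsed, so the consensus tree may be less resolved than any individual input tree.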
For MPP with parsimony analysis of the ALL dataset with the reference taxon excluded, the best trees using any option were one step longer than the MLSA tree (data log-transformed with either EPP or BPP followed by exp binning, or data arsinh-transformed with BPP followed by norm binning). When the reference taxon was included, trees from the four BPP and EPP arsinh-transformed sets were eight steps longer and those from the log-transformed data were even more distant.
For MPP with parsimony analysis of the NEU dataset with the reference taxon excluded, the five iterations that recovered the MLSA tree topology for the NJ analysis did the same for parsimony analysis, i.e., all of the log-transformed data and the arsinh-transformed data option with BPP and norm binwidth. When the reference taxon was included in the CGH phylogeny, no option returned the MLSA topology.
For MPP with parsimony analysis of the CON dataset, when the reference was excluded from the CGH phylogeny, the results were identical to those for the NEU taxon set. When the reference was included, the results were nearly identical to those for the NEU taxon set, i.e., no option returned the MLSA tree topology.