Analysis of long branch extraction and long branch shortening
© O’Connor et al. 2010
Published: 02 November 2010
Long branch attraction (LBA) is a problem that afflicts both the parsimony and maximum likelihood phylogenetic analysis techniques. Research has shown that parsimony is particularly vulnerable to inferring the wrong tree in Felsenstein topologies. The long branch extraction method is a procedure for detecting a data set that suffers from this problem, so that Maximum Likelihood can be used instead of Maximum Parsimony.
The long branch extraction method has been well cited and used by many authors in their analyses, but its accuracy has not been rigorously validated. We performed such a validation through an extensive search of the branch length space under two six-taxon topologies, a Felsenstein-like topology and a Farris-like topology. We also examined a long branch shortening method.
The long branch extraction method seems to mask the majority of the search space, rendering it ineffective as a detection method for LBA. A proposed alternative, the long branch shortening method, is also ineffective in predicting long branch attraction for all tree topologies.
Due to its speed and simplicity, Maximum Parsimony (MP) is one of the most common methods used in phylogenetics. MP is based on the principle of Occam’s razor, which holds that the simplest explanation for any phenomenon is the most probable. Under this principle, parsimony claims to use few if any assumptions, and while this has been disputed, MP’s model is much simpler, with far fewer parameters, than many other phylogenetic methods. Three major problems have been cited with MP, stemming from this assumption of simplicity. Many authors have argued that parsimony under-parameterizes the problem, and the claim was later made that it over-parameterizes the problem. The third problem is that of Long-Branch Attraction (LBA).
LBA is the foundation for many of the arguments against the use of MP in phylogenetics. One foundational study showed that MP can be positively misleading when two non-sister taxa have long branches compared to the rest of the tree. This bias has since been reiterated in a number of other simulated and empirical studies (see Bergsten for an in-depth review of the current debate on LBA). The crux of the problem is that long branches, whether they belong to sister taxa or not, are grouped together in a clade, creating scenarios where the MP method will consistently be incorrect. This can occur often given enough evolutionary time, because many sites will diverge from each other. Since there is a finite set of characters (i.e., A, C, G, T for DNA), the two sequences will still have many sites with matching characters. As more evolutionary time passes, fewer of these matches will be due to a common ancestor, or homology, and more of them will be due to the chance use of the same nucleotide. This non-homologous yet similar sequence of characters adds noise to the phylogenetic signal. This problem is not unique to parsimony, but parsimony suffers from it more extensively than Maximum Likelihood (ML) [6–10].
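The growth of chance matches can be illustrated with a minimal sketch. It assumes the Jukes-Cantor (JC69) model, which is simpler than the GTR model used in the simulations of this study and is swapped in here only because it has a short closed form. Under JC69, two sequences separated by a total distance of d expected substitutions per site agree at a site with probability 1/4 + (3/4)exp(-4d/3). The function name is hypothetical.

```python
import math


def jc69_match_probability(distance):
    """Probability that two sequences show the same nucleotide at a site.

    `distance` is the total path length between the two sequences in
    expected substitutions per site. JC69 is an illustrative assumption;
    it is not the model used for the simulations in this study.
    """
    if distance < 0:
        raise ValueError("distance must be non-negative")
    return 0.25 + 0.75 * math.exp(-4.0 * distance / 3.0)


if __name__ == "__main__":
    # As the distance grows, the match probability approaches 1/4,
    # the level expected from chance alone with four characters.
    for d in (0.0, 0.5, 1.0, 2.0, 4.0):
        print(f"d = {d:.1f}  P(match) = {jc69_match_probability(d):.3f}")
```

At long distances nearly all remaining matches are therefore attributable to chance rather than shared ancestry, which is the noise described above.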
1. If, after completing a full parsimony search, you obtain a tree with a questionable grouping of a certain taxon that appears basal and makes the formal classification polyphyletic, suspect LBA.
2. Exclude the outgroup and re-run the analysis: does the questionable taxa form a monophyletic clade of the formal classification?
3. Return the outgroup, remove the questionable taxa, and re-run the analysis: does this root the tree differently than in step 1 (later compare to steps 4 and 5 as well)?
4. Return the questionable taxa and reanalyze the data set by separating the gene information from the morphological data: does the morphological data form a monophyletic group of the formal classification while the gene data place the questionable taxa basal in the tree?
5. Analyze the gene data using a method that takes branch lengths into account (i.e., Bayes or Likelihood): does this method form a monophyletic group of the formal classification?
6. Using the same analysis as in step 5: are the branch lengths of the questionable taxa and the outgroup some of the longest in the tree?
If you can answer yes to all the previous questions, LBA is the least refuted hypothesis. We have chosen to automate this technique with a few modifications and evaluate it on a series of synthetic data sets with six taxa under a variety of branch lengths with verified LBA.
The six taxa synthetic data sets were used for two main reasons. First, six taxa data sets are small enough to be analyzed in reasonable time but large enough for the LBE method to work. Second, because the data were simulated, we had a priori knowledge as to which trees were suffering from long branch attraction.
To produce these data sets we used the program Dawg under a General Time Reversible (GTR) [13–15] model of evolution. We used parameters similar to those found in the examples included with the program and explored a range of branch lengths. A lambda value of 0.1 was used for the indel evolution rate, which can be interpreted as one indel for every ten substitutions. The sequence length was set to 2000, which is long enough for the expected behavior of each simulation to be seen. The nucleotide frequencies for the simulation were set to 0.2, 0.3, 0.3, and 0.2 for A, C, T, and G respectively, with substitution parameters set to 1.5, 3.0, 0.9, 1.2, 2.5, and 1.0 for AC, AG, AT, CG, CT, and GT respectively.
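To make these settings concrete, the following minimal sketch builds a GTR rate matrix from the frequencies and substitution parameters listed above and rescales it so that one unit of branch length corresponds to one expected substitution per site. It is an illustration of the standard GTR construction, not Dawg's code; the function and variable names are hypothetical.

```python
# Values taken from the text; the dictionary layout is an assumption.
FREQS = {"A": 0.2, "C": 0.3, "T": 0.3, "G": 0.2}
RATES = {
    ("A", "C"): 1.5, ("A", "G"): 3.0, ("A", "T"): 0.9,
    ("C", "G"): 1.2, ("C", "T"): 2.5, ("G", "T"): 1.0,
}


def gtr_rate_matrix(freqs, rates):
    """Return a normalized GTR rate matrix as a dict of dicts.

    Off-diagonal entries are q[i][j] = r(i, j) * pi(j); each diagonal
    entry makes its row sum to zero. The matrix is then divided by the
    expected substitution rate so that -sum_i pi(i) * q[i][i] equals 1.
    """
    bases = sorted(freqs)
    q = {i: {} for i in bases}
    for i in bases:
        for j in bases:
            if i != j:
                # Exchangeabilities are symmetric, so look up either order.
                r = rates[(i, j)] if (i, j) in rates else rates[(j, i)]
                q[i][j] = r * freqs[j]
        q[i][i] = -sum(q[i].values())
    mu = -sum(freqs[i] * q[i][i] for i in bases)
    return {i: {j: q[i][j] / mu for j in bases} for i in bases}


if __name__ == "__main__":
    Q = gtr_rate_matrix(FREQS, RATES)
    for i in sorted(Q):
        print(i, " ".join(f"{Q[i][j]: .4f}" for j in sorted(Q)))
```

With this normalization, a branch of length 1 carries one expected substitution per site, which is the interpretation of branch length used in the simulations.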
Dawg generated data sets for trees under both topologies, where the α and β branch lengths were each varied from 0.1 to 2.0 in increments of 0.1. A branch length of one is interpreted to mean that each site is expected to have one substitution from the internal node under the GTR definition of branch length. For each combination of α and β branch lengths we ran 100 replicates to get a percentage of matches between the two methods. This created a total of 40,000 data sets for each topology.
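The size of this search space can be checked with a short sketch that enumerates the grid of α and β values. The function name and the choice to step in integer tenths (to avoid floating-point drift) are assumptions made for illustration.

```python
def branch_length_grid(start_tenths=1, stop_tenths=20):
    """Return every (alpha, beta) pair from 0.1 to 2.0 in steps of 0.1."""
    values = [k / 10 for k in range(start_tenths, stop_tenths + 1)]
    return [(alpha, beta) for alpha in values for beta in values]


REPLICATES = 100  # replicates per (alpha, beta) pair, as stated in the text

if __name__ == "__main__":
    grid = branch_length_grid()
    print(len(grid), "branch length combinations")
    print(len(grid) * REPLICATES, "data sets per topology")
```

Twenty values for each of the two branch lengths give 400 combinations, and 100 replicates of each give the 40,000 data sets per topology reported above.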
Each data set was then analyzed by comparing the best parsimony tree from an exhaustive search. With six taxa this means scoring all 105 possible trees to find the best one. This best tree was then compared to the parsimony tree, and the percentage of the 100 trials in which the two matched was recorded. Then, for each combination of α and β, we generated a new set of 100 data sets and performed a heuristic TBR parsimony search. All scoring and searching was done with PAUP*. The tree that MP returned from the heuristic search was then analyzed using LBE.
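The figure of 105 trees follows from the standard count of unrooted bifurcating trees on n labeled taxa, (2n - 5)!! = 3 × 5 × ... × (2n - 5). A minimal sketch, with a hypothetical function name:

```python
def unrooted_tree_count(n_taxa):
    """Number of distinct unrooted bifurcating trees on n labeled taxa."""
    if n_taxa < 3:
        raise ValueError("at least three taxa are required")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 5.
    for k in range(3, 2 * n_taxa - 4, 2):
        count *= k
    return count


if __name__ == "__main__":
    trees = unrooted_tree_count(6)
    print(trees, "trees for six taxa")
    print(f"chance of a random guess being correct: {100 / trees:.2f}%")
```

For six taxa this gives 105 trees, so a uniformly random guess picks any particular topology about 0.95% of the time.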
To perform LBE, our Java version of LBE takes as parameters the target tree (in our case the parsimony tree from the heuristic search), the data set, a list of outgroup taxa, and a list of questionable taxa. Of the two β branches, one was selected as the questionable taxa while the other was selected as the outgroup.
The first step of LBE is to remove the outgroup from the tree and the data set and rerun a parsimony search. The second step is to add the outgroup back and remove the questionable taxa. To increase the sensitivity, following the recommendations of Bergsten (see “Concluding discussion: suggestions”), we included a third step in which the original data set was evaluated under a method that estimates branch lengths. We used Maximum Likelihood, and the resultant tree was compared to the original parsimony tree. If at any step the tree found by the rerun search is the same as the original tree (minus the removed taxa in the first two steps), then LBA is no longer suspected and the search is terminated. If instead the data set passed through all of the steps, the branch lengths of the outgroup and the questionable taxa were compared to the rest of the branch lengths. If they were in the top quartile they were considered long branches. Having passed through each step or test, the least disputed hypothesis based on molecular data would be LBA.
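The decision logic of these steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the Java implementation used in this study: trees are represented as sets of nontrivial splits (groups of taxon names on one side of an internal branch), the rerun trees and ML branch lengths are supplied as precomputed inputs rather than obtained from PAUP*, the top quartile cutoff is taken as the value three quarters of the way through the sorted branch lengths, and all function names are hypothetical.

```python
def normalize(split, taxa):
    """Return the side of a bipartition that lacks the smallest taxon name."""
    side = frozenset(split)
    return frozenset(taxa) - side if min(taxa) in side else side


def canonical(splits, taxa):
    """Put every split of a tree into the same canonical orientation."""
    return {normalize(s, taxa) for s in splits}


def prune(splits, taxa, removed):
    """Remove one taxon from a tree and drop splits that become trivial."""
    remaining = frozenset(taxa) - {removed}
    pruned = set()
    for s in splits:
        side = normalize(frozenset(s) - {removed}, remaining)
        if 2 <= len(side) <= len(remaining) - 2:
            pruned.add(side)
    return pruned


def in_top_quartile(name, lengths):
    """Assumed cutoff: the value three quarters through the sorted lengths."""
    ordered = sorted(lengths.values())
    return lengths[name] >= ordered[(3 * len(ordered)) // 4]


def lbe_suspects_lba(taxa, mp_tree, no_outgroup_tree, no_qtaxa_tree,
                     ml_tree, ml_lengths, outgroup, qtaxa):
    """Return True only if every LBE test still points to LBA."""
    taxa = frozenset(taxa)
    # Step 1: rerun without the outgroup; an unchanged tree ends the search.
    if canonical(no_outgroup_tree, taxa - {outgroup}) == prune(mp_tree, taxa, outgroup):
        return False
    # Step 2: rerun without the questionable taxa; same stopping rule.
    if canonical(no_qtaxa_tree, taxa - {qtaxa}) == prune(mp_tree, taxa, qtaxa):
        return False
    # Step 3: the ML tree must differ from the original parsimony tree.
    if canonical(ml_tree, taxa) == canonical(mp_tree, taxa):
        return False
    # Final check: both suspect branches must be among the longest.
    return in_top_quartile(outgroup, ml_lengths) and in_top_quartile(qtaxa, ml_lengths)
```

Comparing split sets after pruning is one way to express "the same as the original tree, minus the removed taxa"; any other tree comparison that ignores the removed taxon would serve the same purpose.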
One problem with most phylogenetic algorithms is the loss of detectable signal with extremely long trees. The length of a tree is the sum of all of its branch lengths, and trees with an extreme total length are difficult to resolve. This problem is clearly visible when examining the upper right of the figures under both topologies. We hypothesize that as the branch lengths get longer the percentage correct will converge to 0.95%, as this is the probability of a random guess out of the 105 possible topologies being correct.
For a method to accurately detect LBA, it needs to discern between these two types of topologies and find the area of LBA. The region found by searching the branch length space should be the same predicted by LBE. Surprisingly this was not the case.
Further, LBE predicted LBA under the Farris-like topology, where we know a priori that the data set does not suffer from LBA. A few inconsistent categorizations would be understandable because no method is perfect. But this situation, where similar branch lengths give similar conservative predictions under both topologies, calls into question what the method is actually predicting.
LBE consistently classifies the wrong area of the Felsenstein-like topology as LBA, along with the same area of the Farris-like topology. In reality, this is an area suffering from loss of signal. But in other areas of loss of signal, e.g. the lower right corner of Figure 2, LBE classifies the data as not having any LBA. Even though this is technically correct, the loss of signal should produce a random-like result in the prediction of LBA, not an extremely confident vote that the data set is not suffering from LBA. Keeping in mind that the method detects LBA as the least refuted hypothesis, it seems odd that the only area detected as having LBA is not actually suffering from it, while the areas that are suffering from LBA have inconsistent results.
Siddall and Whiting make the claim that, “... if each of the two branches individually group in precisely the same place as the other when they are allowed to stand alone in an analysis, one can hardly argue that they are attracted to this placement by the absent branch.” While this seems logical, one needs to remember that a common way to avoid LBA in the first place is to add additional taxa to break up long branches [17, 18]. One possible reason that extracting taxa does not work to detect LBA is that parsimony is sensitive to the removal of taxa, which creates artificial long branches in the rerun analysis. In our analysis, removing a taxon would still lead to a classification of not LBA because the removal created an artificially long branch consisting of a full α branch along with a half α branch. This branch would then attract whichever original long branch taxon remained, so the result would look the same as the original LBA tree and LBA would be rejected. In other words, the extraction creates a sampling problem: long branches are no longer split up by additional taxa, a typical pitfall when dealing with LBA. The long branch is not being attracted by the excluded long branch; it is being attracted to the extended branch that results from no longer breaking it up. This creates a double error and deceives the procedure into concluding that it is not a case of LBA.
Area II is much more hypothetical but seems to fit the data reasonably well. When examining Figures 5 and 6 there is a noticeable but rough line at about y = -2x + 2. We hypothesize that the shape of this line is a function of the branch lengths. This area is obviously crucial, as seen in Figure 5, because it is the area suffering from LBA. In other words, the predictive power of LBE is being masked by this artificial long branch in the exact area needed for accurate prediction of LBA. This triangle directly corresponds to the areas under LBA, thus making the technique inadvisable.
Finally, area III is where the LBE method is mostly correct, that is, the area not suffering from some other artifact. Unfortunately, this area is not suffering from LBA, and eventually it loses phylogenetic signal. This is most clearly seen in Figure 4, where ML can recover the phylogenetic signal to a greater extent. At approximately the same point, LBE makes incorrect predictions because of the loss of signal. This area is not under an LBA bias for MP and so is correctly labeled as not having LBA, but this is not informative. It does not add much strength to the procedure because these cases are already unambiguous.
Due to the problems associated with Long Branch Extraction, an alternate approach could be used. Rather than removing the suspected long branch that would cause changes in the overall phylogeny, a series of iterative steps are taken to shorten the branch to diminish the phylogenetic signal being sent from the questionable branch and then see if that changes the phylogeny. If the phylogeny changes, long branch attraction is suspected.
Rather than sampling from all the other taxa, construct the ancestral sequence to all taxa excluding the outgroup and qtaxa. With this sequence, you have the combined signal of all the other taxa, or a summary of that clade.
Re-run the analysis with the hybridized sequence included in place of the qtaxa. If the taxon moves after reducing its own signal and adding some signal from the monophyletic clade, you have some evidence of LBA.
The parameter or probability of switching in the binomial distribution is increased and steps 2 and 3 are repeated until either the probability reaches 1 or the analysis consistently (i.e. over multiple runs) shows the hybridized qtaxa clading with the hypothetical clade.
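The hybridization and the increasing switching probability can be sketched as follows. The sketch rests on assumptions that go beyond the text: each site of the qtaxa sequence is replaced independently by the corresponding ancestral character with probability p, the probability is raised in fixed increments, and a single rerun per probability is used instead of multiple runs. The phylogenetic rerun itself is represented by a caller-supplied callback, and all names are hypothetical.

```python
import random


def hybridize(qtaxa_seq, ancestral_seq, p, rng):
    """Replace each qtaxa site with the ancestral character with probability p."""
    if len(qtaxa_seq) != len(ancestral_seq):
        raise ValueError("sequences must be aligned to the same length")
    return "".join(
        anc if rng.random() < p else own
        for own, anc in zip(qtaxa_seq, ancestral_seq)
    )


def shortening_probability(qtaxa_seq, ancestral_seq, qtaxa_moved,
                           increments=10, seed=0):
    """Return the smallest tested probability at which the qtaxa moves.

    `qtaxa_moved` stands in for rerunning the analysis with the hybrid
    sequence; it takes the hybrid and returns True if the qtaxa changed
    position. If it never moves, 1.0 is returned.
    """
    rng = random.Random(seed)
    for k in range(1, increments + 1):
        p = k / increments
        hybrid = hybridize(qtaxa_seq, ancestral_seq, p, rng)
        if qtaxa_moved(hybrid):
            return p
    return 1.0


if __name__ == "__main__":
    rng = random.Random(1)
    print(hybridize("ACGTACGTAC", "TTTTTTTTTT", 0.5, rng))
```

At p = 0 the qtaxa sequence is left untouched, and at p = 1 it is replaced entirely by the ancestral sequence, which corresponds to the two extremes of sampling frequency.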
One of the weaknesses of such an approach is the lack of an absolute answer. You do not get a final answer of yes or no (as to whether LBA is occurring) but added evidence that there is a problem. This evidence comes in the form of a probability, or percentage of the branch that needs to be shortened to form the monophyletic clade. If the probability comes out high, 0.9 to 1.0, you can be fairly sure that LBA is not occurring and that there is strong phylogenetic signal supporting the current position in the phylogeny. If it is very low, then long branch attraction has occurred and is causing an incorrect tree to be inferred. This evidence can help the researcher understand whether the questionable taxa (qtaxa) is sending a strong signal to be in the current location or a weak one. A weak signal implies that the location is inferred only because of analogous evolution and not homology. This implication can then be interpreted as the detection of LBA.
In Figures 9, 10, 11, and 12, black indicates regions where Long Branch Shortening (LBS) fails to predict whether long branch attraction is occurring. Gray areas indicate regions where LBS successfully determined whether or not long branch attraction was present in the resulting phylogeny. With a 0% sampling frequency (Figure 9), the target taxon is not modified at all and thus the phylogeny does not change. LBS then reports that no LBA exists anywhere. In this case, the Felsenstein Zone (the black region in the upper left portion of the graph) is clearly visible, and LBS is unable to detect long branch attraction. As the sampling frequency increases, the target taxon becomes more like the clade and LBS is better able to detect long branch attraction in the Felsenstein Zone. However, this comes at a price. The region where there is no long branch attraction (the lower left portion of the graph) is now reported incorrectly. This is because the target taxon has become so much more like the other taxa that, at 90% sampling (Figure 12), long branch attraction is always reported: the target taxon always moves, resulting in a different phylogeny.
Long Branch Extraction (LBE) and Long Branch Shortening (LBS) are not reliable methods for detecting Long Branch Attraction (LBA) and should not be used in phylogenetic inquiries about LBA. Under a variety of branch lengths for six taxa synthetic data sets, LBE incorrectly and inconsistently predicts LBA because of its inability to distinguish between artificially created long branches and the correct tree topology. The artificial long branch is created by the removal of the outgroup or questionable taxa branch, which leaves a sister taxon that is artificially long because the taxon that would break up its long branch has been removed. An additional problem is that the ML step masks a large area of the branch length space, depriving the method of the specificity that it needs to be effective. This was shown by an in-depth search over two topologies: the Felsenstein-like topology, which is easily susceptible to LBA, and the Farris-like topology, in which the long branches are correctly grouped together. The results support our conclusion that LBE is ineffective in detecting LBA.
LBS is not effective because it incorrectly estimates the sequence present at the ancestral node. Statistical sampling of the other sequences artificially causes the target taxon to appear like all the taxa rather than shortening its branch. This results in a loss of accuracy in the detection of LBA.
Both LBE and LBS suffer from a secondary effect. When a branch is extracted from the phylogeny or shortened, other branches are free to become the longest branch and will potentially draw other similarly long branches away from their correct locations. Both Maximum Likelihood and Maximum Parsimony are subject to LBA in Felsenstein topologies and Likelihood provides superior results in only a small part of the Felsenstein Zone.
This project was supported by the National Science Foundation under Grant No. 0120718 and by the Brigham Young University Office of Research and Creative Activity. Publication of this supplement was made possible with support from the International Society of Intelligent Biological Medicine (ISIBM).
This article has been published as part of BMC Genomics Volume 11 Supplement 2, 2010: Proceedings of the 2009 International Conference on Bioinformatics & Computational Biology (BioComp 2009). The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2164/11?issue=S2.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.