A bootstrap based analysis pipeline for efficient classification of phylogenetically related animal miRNAs

Huang, Yong; Gu, Xun

doi:10.1186/1471-2164-8-66

Methodology article
Open access
Published: 06 March 2007

A bootstrap based analysis pipeline for efficient classification of phylogenetically related animal miRNAs

Yong Huang¹ &
Xun Gu¹

BMC Genomics volume 8, Article number: 66 (2007) Cite this article

5092 Accesses
15 Citations
3 Altmetric
Metrics details

Abstract

Background

Phylogenetically related miRNAs (miRNA families) convey important information of the function and evolution of miRNAs. Due to the special sequence features of miRNAs, pair-wise sequence identity between miRNA precursors alone is often inadequate for unequivocally judging the phylogenetic relationships between miRNAs. Most of the current methods for miRNA classification rely heavily on manual inspection and lack measurements of the reliability of the results.

Results

In this study, we designed an analysis pipeline (the Phylogeny-Bootstrap-Cluster (PBC) pipeline) to identify miRNA families based on branch stability in the bootstrap trees derived from overlapping genome-wide miRNA sequence sets. We tested the PBC analysis pipeline with the miRNAs from six animal species, H. sapiens, M. musculus, G. gallus, D. rerio, D. melanogaster, and C. elegans. The resulting classification was compared with the miRNA families defined in miRBase. The two classifications were largely consistent.

Conclusion

The PBC analysis pipeline is an efficient method for classifying large numbers of heterogeneous miRNA sequences. It requires minimum human involvement and provides measurements of the reliability of the classification results.

Background

MicroRNAs (miRNAs) are small (~22 nucleotides) non-coding RNAs that have the ability to repress the expression of target genes post-transcriptionally. Since the discovery of the first two miRNA genes, lin-4 [1, 2] and let-7 [3, 4], much has been learned about the structure, biogenesis and function of miRNAs [5–7]. A growing number of miRNAs have been found in animals, plants and viruses [8, 9]. The "miRBase" database [10] currently hosts 4039 miRNAs from 45 species (Release 8.2).

Studies in plants [11], animals [12, 13], and viruses [8] have shown that the innovation of miRNAs is an ongoing process [14, 15], which indicates that most miRNAs are of different evolutionary origins. Some miRNAs, however, were also populated through local or genome-wide duplications [14, 16], and formed miRNA families. As the phylogenetically related miRNAs convey important information about the function and evolution of miRNAs, a reliable classification of miRNA families is indispensable for miRNA studies.

Some work has been done in classifying miRNA families. The nomenclature system for miRNAs by "miRBase" [17] can be view as one of the earliest classifications. It classified miRNAs with similar mature forms together and assigned them the same id numbers. In recent releases of miRBase, a "miRNA family (miFam)" feature was present, which clustered similar miRNA precursors together based on computational analysis and manual inspection. In a recent report by Hertel et al [14], the phylogenetic distribution of miRNAs in 30 species was systematically studied. In that study, potential miRNAs in the species were identified with BLAST [18] and profile based methods [19]. The phylogenetic analysis was based on pair-wise sequence identity between precursors, and the significance of the sequence identity cutoff was evaluated with a z-score.

These studies have revealed important phylogenetic information about miRNAs. However, there are some underlying problems with the current methods. First, the mature forms of miRNAs were often used as the classification criteria in the methods, intentionally or not. This can decrease the sensitivity of finding paralogous miRNAs, as the mature part of a duplicated copy of a miRNA is not necessarily under strong selective pressure. Meanwhile, due to the short length of the mature forms, false classification caused by convergent evolution is very likely to happen. Second, since a fixed sequence identity cutoff value alone is inadequate to classify the miRNAs (See additional file 1: Classification by BLAST), most of the current methods relied on manual inspection in deciding the families. This introduces a heavy human factor in the classification process. Third, most of the current methods do not have a measurement of the reliability of the classification results. The z-score used by Hertel et al [14] was no exception, because the z-score was actually a measurement of the significance of a fixed sequence identity cutoff value (a BLAST-like e-value could be directly derived from a z-score).

In this study, we designed a bootstrap based analysis pipeline (the Phylogeny-Bootstrap-Cluster (PBC) pipeline) to identify phylogenetically related miRNAs. In our method, the families are identified based on branch stability in the bootstrap trees derived from overlapping input sets of genome-wide miRNA precursor sequences. This approach is similar to the "nodal stability" approach [20] that has been used in phylogenetic tree inference. The difference is that we vary the input data set rather than multiple sequence alignment parameters. A "Vote" algorithm was designed to automate the process of identifying and evaluating potential families. The human involvement in the classification process was minimized, and the reliability of each family was evaluated by its supporting levels from the bootstrap trees. We tested the PBC analysis pipeline with the miRNAs from the six animal species, H. sapiens, M. musculus, G. gallus, D. rerio, D. melanogaster, and C. elegans. The resulting classification was compared with the miRNA families defined by miRBase. While the classifications were largely consistent, our reliability measurement showed that a few new families can be supported and several families in miRBase may not be supported. The PBC analysis pipeline offers an efficient and objective method for classifying large amount of miRNAs.

Results

Algorithm

The phylogeny-bootstrap-clustering (PBC) analysis pipeline

This method is base on the branch stability in the bootstrap trees derived from overlapping input sets of genome-wide miRNA precursor sequences. For a set of miRNA sequences, we can always perform a multiple sequence alignment (MSA) and build a bootstrap tree from the alignment result. Due to the imperfection of MSA, it is almost certain that some false classifications will happen. However, the true classifications are generally more robust to variations in input sequences or alignment parameters of MSA than the false classifications [20]. Based on this principle, we can add additional sequences to the original input sequence set of the MSA, and generate a new bootstrap tree. The true families should be more stable and are more likely to be intact under a branch of high bootstrap value in the new tree, while the falsely classified families are more likely to be broken up in the new tree. If multiple new trees are built with different additional sequences, the likelihood for a falsely classified family to be intact in all the new trees decreases geometrically. In practice, for efficient classification of large amount of miRNAs, the original input sequences and the additional input sequences can be genome-wide collections of miRNAs.

Suppose we have n species whose miRNA information is available. Let S_idenote the set of miRNA precursor sequences in species i, and let S_i+S_jdenote the union of S_iand S_j. In the PBC analysis pipeline, we use the sequence sets (S₁, S₁+S₂, ..., S₁+S_n) as the input, where S₁ is the original input sequence set and S₂, ..., S_nare different additional sequences. MSA, neighbor-joining (NJ) tree building and bootstrapping are carried out for each input sequence set, and n corresponding bootstrap trees are built (bTree₁, ... bTree_n) (see Figure 1). In these input sets, the S₁ set of sequences appear in all the input sequence sets. The addition of other sequence sets (S₂, ..., S_n) to S₁ introduces variations to the input of MSA. From the bootstrap trees (bTree₁, ... bTree_n), we identify the branches with bootstrap values above the family defining cutoff values and denote such branching nodes as the "family defining nodes." If all the S₁ leaves under such a branch are also clustered together under a "family defining node" in all the other trees, these leaves form a consensus family. The detailed procedure of this step is capsulated in the "Vote" algorithm described in the next section. A consensus family may contain sequences from several species, but only the classification of the S₁ sequences is confirmed in this step. The reliability of the classification of a group of S₁ sequences in a family can be measured as the average of the bootstrap values of their best common ancestors in the n input trees.

Once the S₁ sequences are classified, for an individual sequence x from (S₂, ..., S_n), we can carry out another PBC analysis with the input sequence sets as (S₁+x, S₁+x +S₂, ..., S₁+x +S_n; removing redundant x wherever it occurs) to classify x with its phylogenetically related miRNAs in S₁. However, for genome-wide sets of sequences, this is computationally prohibitive (as each run of the PBC analysis take hours). Instead of confirming individual sequence, in practice, we formulate a confirmation set of sequences C which is composed of the sequences from (S₂, ..., S_n) whose phylogenetic relationship with miRNAs in S₁ is of interest. We also remove the most obviously S₁-specific miRNAs from S₁ according to the result of the first round, these include the miRNAs in singleton families in the first round and the miRNAs in tandem clusters. A new round of PBC analysis is then carried out using the input set of (S₁+C, S₁+C +S₂, ..., S₁+C +S_n) (removing redundant sequences wherever they occur; S₁-specific sequences removed from S₁). Those sequences in the confirmation set C whose classifications are confirmed in the new round are patched back to the consensus families (of the first round). The resulting consensus families form a classification of miRNAs that are phylogenetically related to S₁ miRNAs. In other words, S₁ miRNAs are the index for this classification. Phylogenetic relations among S₂, ..., S_nsequences that are not related to any S₁ sequences are not covered in this classification, but such relations can be examined in the same way by using a different set of sequences as index.

The "Vote" algorithm

The "Vote" algorithm is the kernel of the PBC analysis pipeline. The input for the "Vote" algorithm includes the set of bootstrap trees (bTree₁, ... bTree_n), the "family defining bootstrap cutoff values" which are derived from well studied known families and are tree specific, and a "evaluation bootstrap cutoff value" which is used to evaluate families defined in other trees. Usually, the "family defining cutoff values" are set to be greater than the "evaluation cutoff value".

The "Vote" algorithm (see Figure 2) is designed to identify families from the bootstrap trees and get measurements of reliability of the identified families. Suppose the S₁ sequences form the index, as is in the case of the first round of PBC. First, for each tree and from root down, the branches with bootstrap value above (>=) the family defining cutoff values are sought. The search will not go inside a branch if the top node of the branch already has a bootstrap value above the cutoff. The top nodes of these branches are the "family defining nodes." For each identified branch, the S₁ leaves in the branch form a testing set. Second, for each testing set, we can obtain the bootstrap values of their Best Common Ancestor (BCA) in the input trees. The BCA of a set of nodes in a bootstrap tree is defined to be the common ancestor with the highest bootstrap value among all the common ancestors of the set of nodes (See additional file 2: Best common ancestor (BCA)). Third, the reliability of a testing set is evaluated by its BCA bootstrap values in the input trees. If the BCA bootstrap value is above the evaluation cutoff value in a tree, the testing set is regarded as supported by that tree. If a testing set is supported by enough input trees, it is deemed as confirmed (this is where the name "Vote" comes from). In practice, we request a testing set to be supported by n-1 trees, allowing one exception. If a testing set is not confirmed, it is progressively broken down until all its subsets are confirmed. Finally, all the confirmed sets are clustered into consensus families using a single linkage clustering approach [21].

Testing

Classification of miRNAs from six animal species using the PBC analysis pipeline

The input for the testing case were the miRNAs from six animal species, H. sapiens, M. musculus, G. gallus, D. rerio, D. melanogaster, and C. elegans. These miRNAs cover most of the experimentally identified miRNAs. S₁ was set to be the human miRNAs and S₂, ..., S₆ were the miRNAs from the other five species. MSA was carried out with CLUSTALW 1.83 [22]. Neighbor-joining trees building and bootstrapping were carried out with Mega 3.1 [23] using 1000 bootstrap replications and p-distance substitution model. The family defining bootstrap cutoff values are tree-specific, and are set to be the smallest bootstrap value of the reference miRNA families (let7, mir-124, mir-17 and mir-1, See additional file 3: Reference miRNA families) in each input tree. These reference families are well established in the literature. The actual cutoff values were 84% for the "hsa" tree, 90% for the "hsa-mmu" tree, 91% for the "hsa-gga" tree, 78% for the "hsa-dre" tree, 75% for the "hsa-dme" tree and 82% for the "hsa-cel" tree. The evaluation cutoff value is set to be 50%.

After the first round of the PBC analysis, we examined the composition of the consensus families. For human miRNAs with same id numbers, only 2 are separated in the consensus families, namely mir-92/mir-92b and mir-449/mir-449b, showing that most of the miRNA families are robust to the variation in the input of the PBC pipeline. We further examined the alignments for mir-92/92b and mir-449/449b. In both cases, the mature sequences had high sequence identities while the hairpin sequences had low identities outside the mature parts. The miRNAs in these cases are separated into different families, as they may be the results of convergent evolution.

The confirmation set of sequences were deduced from the consensus families of the first round of PBC. The confirmation set includes the miRNAs from the other species that were not classified with their human counterparts (having the same id numbers). There were 1 from mouse, 0 from chicken, 2 from fish, 5 from fly, and 3 from worm (including cel-lin-4 and cel-lsy-6). The confirmation set also includes the non-human miRNAs that were classified in a family with none of its human counterparts. There were 13 from mouse, 1 from chicken, 12 from fish (selected 3 from the dre-mir-430b miRNA cluster), 20 from fly, and 16 from worm. A second round of PBC analysis was carried out. 42 of the miRNAs in the confirmation set were confirmed and were patched back to the consensus families and the unconfirmed miRNAs were kept out of the families. The resulting consensus families formed the classification of phylogenetically related miRNAs with the human miRNAs as the index. The full list can be found in additional file 4: Full list of the families with human member. The supporting levels of the human miRNAs in each family can be found in additional file 5: Supporting levels of the families.

The phylogenetic distribution of miRNAs

Based on the family composition, we categorized the miRNA families in the testing case into categories I-IV. Category I contains miRNA families that have miRNAs from both mammals and invertebrates. In addition, we demand that all the Category I families should have chicken or fish miRNAs. Category II contains miRNA families that have miRNAs from both mammals and non-mammal vertebrates. Category III contains miRNA families that have miRNAs from mammals only. Category IV contains the miRNA families that have no mammal miRNAs. The distribution is summarized in Table 1. An abbreviated list of the human miRNAs of the families is shown in Table 2. It should be noted that worm or fly miRNAs are present in eight families in category III. These families were put in category III because there were no chicken or fish miRNAs in these families. These cases are likely to be the result of convergent evolution or incomplete miRNA discovery in chicken and fish. Most of the human miRNAs in these eight families are from a recent publication [24].

Table 1 Distribution of miRNA families in different categories of conservation

Full size table

Table 2 The human members of miRNA families in different conservation categories

Full size table

Although the discovery of miRNAs is not comprehensive in species like chicken and fish, the phylogenetic distribution of the miRNA families still shows interesting trends. The majority of the human miRNAs are conserved only in vertebrates. Among the human miRNAs, 47 are in category I, 146 are in category II, and 269 are in category III. This shows that more than half of the human miRNAs are conserved in mammals only, and the majority of the rest are vertebrate only. Only a small portion of human miRNAs have invertebrate homologues. Mouse miRNAs display a similar distribution. For the invertebrate miRNAs, the portion of miRNAs that are conserved in vertebrates is also very small, 22/78 in D. melanogaster and 6/114 in C. elegans. Only a small number of miRNAs, the 15 miRNA families in category I, are conserved between invertebrates and mammals. However, more than half of all the miRNAs with known function are in these families. This shows a high correlation of sequence conservation and functional importance among the miRNAs. The discovery of miRNAs is still ongoing, so more data and analysis will be needed to generate a quantitatively more detailed phylogenetic distribution of miRNAs.

Comparison with the miFam classification

Since Release 8.1, a miFam (miRNA family) feature which provides family classification information of miRNA hairpin sequences has been available in miRBase. We compared our classification results with the miFam classification in Release 8.2. Not considering the families with one or less human miRNA or the families without non-human miRNAs, our comparison showed that 122 out of the 172 comparable PBC families are the same in miFam. While most of the miRNA families are consistent, we examined the cases where the classification was different (summarized in Table 3). Most of these differences involve additional merging or separation of the families between the two classifications. For the families that are merged in the PBC classification, the multiple sequence alignments of the hairpin sequences are generally acceptable, but the alignments of the mature sequence are not necessarily strong. One example is the mir-134/mir-412 case (See additional file 6: Multiple sequence alignments of selected miRNA families), in which the hairpin sequences are similar while the mature sequences differ greatly. The situation is reversed in the families that are merged in the miFam classification. One example is the mir-25, 92/mir-92b case, in which the mature sequences are almost identical while the rest of the sequences share little sequence similarity (See additional file 6). In only three cases, the families defined by PBC intersect with (overlap with but are not covered by) the families defined by miFam, but all these miRNAs are tandem clustered miRNAs which diverge from each other much more than most miRNA homologues.

Table 3 Comparison of the family composition between the PBC classification and the miFam classification

Full size table

Taken together, the comparison between the PBC and the miFam classifications showed that most of the families were consistent. Most of the differences stemmed likely from the differences in treating information from the mature sequences. While the PBC analysis pipeline relies totally on the precursor sequences, the miFam classification may have taken certain reference from the similarity between the mature sequences.

Classification of new miRNAs

While preparing the manuscript, a new release of miRBase sequences (Release 9.0) became available. We used our PBC analysis pipeline to classify the 12 new human miRNAs in this release by treating them as the confirmation set. The results showed that hsa-mir-758 can be assigned to the hsa-mir-379,380, 411 family; hsa-mir-767 can be assigned to the hsa-mir-105 family; and hsa-mir-802 can be assigned to the hsa-mir-511 family. The sequence alignments and the support levels among the trees can be found in additional file 7: Classification of new human miRNAs in Release 9.0. The hsa-mir-758 case is present in miFam already, while the other two cases have not been reported.

Discussion

From an evolutionary point of view, miRNAs are a heterogeneous group of sequences. First, miRNAs are of heterogeneous evolutionary origins. Most of the miRNAs are not related to each other. They are categorized together as miRNAs just because they share certain common sequence features and functional mechanisms [6]. As is shown in the phylogenetic distribution of miRNA families in this study, miRNAs also differ greatly as to their levels of conservation across the species. Second, miRNAs differ in their evolutionary patterns. Some, the let-7 family for example, maintain almost identical mature forms in evolution. Others, the mir-10, 99, 100, 125 family for example, diverge in the mature forms (See additional file 8: The mir-10, 99, 100, 125 family).

The evolutionary distance between miRNAs is the underlying basis for the classification of miRNAs. The heterogeneous nature of miRNA sequences has made it impossible to use a single model to summarize the evolutionary distance between miRNAs. In context, many current methods actually manually inspect the classification process, which brings in a heavy human factor.

In the PBC analysis pipeline, we used a non-parametric approach. The analysis depended totally on the precursor sequences, and no model of the evolutionary distances between miRNAs was assumed a priori. The classification criteria were essentially derived from the data or were based on knowledge from the literature (the reference families). The human factor was minimized in the classification process. Meanwhile, the reliability of the families can be evaluated by their support levels in the bootstrap trees.

Conclusion

The PBC analysis pipeline is an efficient method for classifying large numbers of heterogeneous miRNA sequences. The analysis pipeline assumes no models for the evolutionary distances between miRNAs. It requires minimum human involvement and provides a method to evaluate the reliability of the classification results. This analysis pipeline is efficient for classifying genome-wide sets of miRNA sequences. It is also an efficient method to classify newly cloned individual miRNAs into existing miRNA families.

Methods

Implementation

The miRNA sequences were retrieved from miRBase [10] (Release 8.2, Jul. 2006; Release 9.0, Oct. 2006). MiRNAs from six species, including H. sapiens (hsa, 462 entries), M. musculus (mmu, 358 entries), G. gallus (gga, 152 entries), D. rerio (dre, 337 entries), D. melanogaster (dme, 78 entries), and C. elegans (cel, 114 entries), were chosen for this study. miFam family information was retrieved from the same release.

Multiple sequence alignment (MSA) was carried out with CLUSTALW 1.83 [22]; neighbor-joining trees with bootstrap were inferred by Mega 3.1 [23] with 1000 bootstrap replications and p-distance substitution model. All the other data analysis was carried out by customized Perl modules and scripts, which were attached as the additional file 9: The software and instruction for the PBC analysis pipeline. The codes are also available at [25].

References

Lee RC, Feinbaum RL, Ambros V: The C. elegans heterochronic gene lin-4 encodes small RNAs with antisense complementarity to lin-14. Cell. 1993, 75 (5): 843-854. 10.1016/0092-8674(93)90529-Y.
Article CAS PubMed Google Scholar
Wightman B, Ha I, Ruvkun G: Posttranscriptional regulation of the heterochronic gene lin-14 by lin-4 mediates temporal pattern formation in C. elegans. Cell. 1993, 75 (5): 855-862. 10.1016/0092-8674(93)90530-4.
Article CAS PubMed Google Scholar
Reinhart BJ, Slack FJ, Basson M, Pasquinelli AE, Bettinger JC, Rougvie AE, Horvitz HR, Ruvkun G: The 21-nucleotide let-7 RNA regulates developmental timing in Caenorhabditis elegans. Nature. 2000, 403 (6772): 901-906. 10.1038/35002607.
Article CAS PubMed Google Scholar
Slack FJ, Basson M, Liu Z, Ambros V, Horvitz HR, Ruvkun G: The lin-41 RBCC gene acts in the C. elegans heterochronic pathway between the let-7 regulatory RNA and the LIN-29 transcription factor. Mol Cell. 2000, 5 (4): 659-669. 10.1016/S1097-2765(00)80245-2.
Article CAS PubMed Google Scholar
Ambros V: The functions of animal microRNAs. Nature. 2004, 431 (7006): 350-355. 10.1038/nature02871.
Article CAS PubMed Google Scholar
Bartel DP: MicroRNAs: genomics, biogenesis, mechanism, and function. Cell. 2004, 116 (2): 281-297. 10.1016/S0092-8674(04)00045-5.
Article CAS PubMed Google Scholar
Lai EC: microRNAs: runts of the genome assert themselves. Curr Biol. 2003, 13 (23): R925-36. 10.1016/j.cub.2003.11.017.
Article CAS PubMed Google Scholar
Pfeffer S, Zavolan M, Grasser FA, Chien M, Russo JJ, Ju J, John B, Enright AJ, Marks D, Sander C, Tuschl T: Identification of virus-encoded microRNAs. Science. 2004, 304 (5671): 734-736. 10.1126/science.1096781.
Article CAS PubMed Google Scholar
Reinhart BJ, Weinstein EG, Rhoades MW, Bartel B, Bartel DP: MicroRNAs in plants. Genes Dev. 2002, 16 (13): 1616-1626. 10.1101/gad.1004402.
Article CAS PubMed Central PubMed Google Scholar
Griffiths-Jones S, Grocock RJ, van Dongen S, Bateman A, Enright AJ: miRBase: microRNA sequences, targets and gene nomenclature. Nucleic Acids Res. 2006, 34 (Database issue): D140-4. 10.1093/nar/gkj112.
Article CAS PubMed Central PubMed Google Scholar
Allen E, Xie Z, Gustafson AM, Sung GH, Spatafora JW, Carrington JC: Evolution of microRNA genes by inverted duplication of target gene sequences in Arabidopsis thaliana. Nat Genet. 2004, 36 (12): 1282-1290. 10.1038/ng1478.
Article CAS PubMed Google Scholar
Tanzer A, Amemiya CT, Kim CB, Stadler PF: Evolution of microRNAs located within Hox gene clusters. J Exp Zoolog B Mol Dev Evol. 2005, 304 (1): 75-85. 10.1002/jez.b.21021.
Article Google Scholar
Tanzer A, Stadler PF: Molecular evolution of a microRNA cluster. J Mol Biol. 2004, 339 (2): 327-335. 10.1016/j.jmb.2004.03.065.
Article CAS PubMed Google Scholar
Hertel J, Lindemeyer M, Missal K, Fried C, Tanzer A, Flamm C, Hofacker IL, Stadler PF: The expansion of the metazoan microRNA repertoire. BMC Genomics. 2006, 7: 25-10.1186/1471-2164-7-25.
Article PubMed Central PubMed Google Scholar
Maher C, Stein L, Ware D: Evolution of Arabidopsis microRNA families through duplication events. Genome Res. 2006, 16 (4): 510-519. 10.1101/gr.4680506.
Article CAS PubMed Central PubMed Google Scholar
Bompfunewerer AF, Flamm C, Fried C, Fritzsch G, Hofacker IL, Lehmann J, Missal K, Mosig A, Muller B, Prohaska SJ, Stadler PF, Tanzer A, Washietl S, Witwer C: Evolutionary Patterns of Non-Coding RNAs. Theory in Biosciences. 2005, 123 (4): 301-369. 10.1016/j.thbio.2005.01.002.
Article PubMed Google Scholar
Griffiths-Jones S: The microRNA Registry. Nucleic Acids Res. 2004, 32 (Database issue): D109-11. 10.1093/nar/gkh023.
Article CAS PubMed Central PubMed Google Scholar
Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol. 1990, 215 (3): 403-410.
Article CAS PubMed Google Scholar
Legendre M, Lambert A, Gautheret D: Profile-based detection of microRNA precursors in animal genomes. Bioinformatics. 2005, 21 (7): 841-845. 10.1093/bioinformatics/bti073.
Article CAS PubMed Google Scholar
Giribet G: Stability in phylogenetic formulations and its relationship to nodal support. Syst Biol. 2003, 52 (4): 554-564. 10.1080/10635150390223730.
Article PubMed Google Scholar
Martinez WL, Martinez AR: Exploratory data analysis with MATLAB. Series in computer science and data analysis. 2005, Boca Raton , Chapman & Hall/CRC, xv, 405 p.-
Google Scholar
Thompson JD, Higgins DG, Gibson TJ: CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 1994, 22 (22): 4673-4680. 10.1093/nar/22.22.4673.
Article CAS PubMed Central PubMed Google Scholar
Kumar S, Tamura K, Jakobsen IB, Nei M: MEGA2: molecular evolutionary genetics analysis software. Bioinformatics. 2001, 17 (12): 1244-1245. 10.1093/bioinformatics/17.12.1244.
Article CAS PubMed Google Scholar
Cummins JM, He Y, Leary RJ, Pagliarini R, Diaz LA, Sjoblom T, Barad O, Bentwich Z, Szafranska AE, Labourier E, Raymond CK, Roberts BS, Juhl H, Kinzler KW, Vogelstein B, Velculescu VE: The colorectal microRNAome. Proc Natl Acad Sci U S A. 2006, 103 (10): 3687-3692. 10.1073/pnas.0511155103.
Article CAS PubMed Central PubMed Google Scholar
PBC codes. [http://www.public.iastate.edu/~yhames04/PBC_codes.zip]

Download references

Acknowledgements

We thank Dr. Karin Dorman for comments on the manuscript. This work is supported by NIH grants to XG.

Author information

Authors and Affiliations

Department of Genetics, Development, and Cell Biology, Center for Bioinformatics and Biological Statistics, Iowa State University, Ames, IA, 50011, USA
Yong Huang & Xun Gu

Authors

Yong Huang
View author publications
You can also search for this author in PubMed Google Scholar
Xun Gu
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Xun Gu.

Additional information

Authors' contributions

XG and YH designed the study. YH carried out the data analysis. Both authors contributed to the manuscript writing. All authors read and approved the final manuscript.

Electronic supplementary material

12864_2006_779_MOESM1_ESM.doc

Additional File 1: Classification by BLAST. The classifications of miRNAs using different BLAST bit score cutoff values were compared to the miFam classification in the aspects of the cumulative miRNA difference and the composition of the families. (DOC 26 KB)

12864_2006_779_MOESM2_ESM.doc

Additional File 2: Best common ancestor (BCA). This file illustrates how the BCA is defined for a group of nodes. (DOC 74 KB)

12864_2006_779_MOESM3_ESM.doc

Additional File 3: Reference miRNA families. This file shows the composition of the reference families used in this analysis and the sequence identity range between miRNA precursors in the families. (DOC 24 KB)

12864_2006_779_MOESM4_ESM.txt

Additional File 4: Full list of the families with human member. The list of families from the classification of the miRNAs from six animal species, H. sapiens, M. musculus, G. gallus, D. rerio, D. melanogaster, and C. elegans using human miRNAs as the index. Only the families having human miRNAs are listed. (TXT 28 KB)

12864_2006_779_MOESM5_ESM.txt

Additional File 5: Supporting levels of the families. The supporting levels in the bootstrap trees of the human members in each family. (TXT 18 KB)

12864_2006_779_MOESM6_ESM.doc

Additional File 6: Multiple sequence alignments of selected miRNA families. The multiple sequence alignments of the miRNA families mentioned in the main text. (DOC 37 KB)

12864_2006_779_MOESM7_ESM.doc

Additional File 7: Classification of new human miRNAs in Release 9.0. The multiple sequence alignment and supporting levels (format see additional file 5) of the three families that three new human miRNAs in Release 9.0 have been assigned to. (DOC 28 KB)

12864_2006_779_MOESM8_ESM.doc

Additional File 8: The mir-10, 99, 100, 125 family. Sequence alignment and selected secondary structure of the miRNAs in the mir-10, 99, 100, 125 family. (DOC 147 KB)

12864_2006_779_MOESM9_ESM.zip

Additional File 9: The software and instruction for the PBC analysis pipeline. The PERL scripts of the programs used in the PBC analysis pipeline, and the instructions. The miRNAs sequences are retrieved from miRBase [10] (Release 8.2). (ZIP 521 KB)

Authors’ original submitted files for images

Below are the links to the authors’ original submitted files for images.

Authors’ original file for figure 1

Authors’ original file for figure 2

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Reprints and permissions

About this article

Cite this article

Huang, Y., Gu, X. A bootstrap based analysis pipeline for efficient classification of phylogenetically related animal miRNAs. BMC Genomics 8, 66 (2007). https://doi.org/10.1186/1471-2164-8-66

Download citation

Received: 28 October 2006
Accepted: 06 March 2007
Published: 06 March 2007
DOI: https://doi.org/10.1186/1471-2164-8-66

A bootstrap based analysis pipeline for efficient classification of phylogenetically related animal miRNAs

Abstract

Background

Results

Conclusion

Background

Results

Algorithm

The phylogeny-bootstrap-clustering (PBC) analysis pipeline

The "Vote" algorithm

Testing

The phylogenetic distribution of miRNAs

Comparison with the miFam classification

Classification of new miRNAs

Discussion

Conclusion

Methods

Implementation

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Additional information

Authors' contributions

Electronic supplementary material

Authors’ original submitted files for images

Rights and permissions

About this article

Cite this article

Share this article

Keywords

BMC Genomics

Contact us