- Research article
- Open Access
The expansion of the metazoan microRNA repertoire
BMC Genomics volume 7, Article number: 25 (2006)
MicroRNAs have been identified as crucial regulators in both animals and plants. Here we report on a comprehensive comparative study of all known miRNA families in animals. We expand the MicroRNA Registry 6.0 by more than 1000 new homologs of miRNA precursors whose expression has been verified in at least one species. Using this uniform data basis we analyze their evolutionary history in terms of individual gene phylogenies and in terms of preservation of genomic nearness across species. This allows us to reliably identify microRNA clusters that are derived from a common transcript.
We identify three episodes of microRNA innovation that correspond to major developmental innovations: A class of about 20 miRNAs is common to protostomes and deuterostomes and might be related to the advent of bilaterians. A second large wave of innovations maps to the branch leading to the vertebrates. The third significant outburst of miRNA innovation coincides with placental (eutherian) mammals. In addition, we observe the expected expansion of the microRNA inventory due to genome duplications in early vertebrates and in an ancestral teleost. The non-local duplications in the vertebrate ancestor are predated by local (tandem) duplications leading to the formation of about a dozen ancient microRNA clusters.
Our results suggest that microRNA innovation is an ongoing process. Major expansions of the metazoan miRNA repertoire coincide with the advent of bilaterians, vertebrates, and (placental) mammals.
MicroRNAs (miRNAs) are small non-coding RNAs that can be found in both multi-cellular animals and plants. In both kingdoms they act as negative regulators of translation. They are transcribed as longer primary transcripts from which approximately 70nt precursors (pre-miRNAs) with a characteristic stem-loop structure are extracted; after export to the cytoplasm, the mature miRNAs, approximately 22nt in length, are cut out from one side of the precursor stem structure. For reviews on the discovery and function of miRNAs we refer to the literature, see e.g. [1, 2].
Despite the rapid growth of our knowledge on microRNA regulation, little is known about the evolution and phylogenetic distribution of the hundreds of animal microRNA families. The exceptions are a few well-studied examples, including let-7 [3–5], the three non-homologous miRNA families comprising the mir-17 cluster [6, 7], two Hox-cluster associated genes mir-10 and mir-196 [8, 9], and the exceptional imprinted mir-134 cluster of microRNAs located at human locus 14q32 [10–12]. These few case studies, which were selected because of special properties of the miRNAs in question, of course cannot provide a comprehensive, or even representative, picture of microRNA evolution in animals.
Two very recent papers discuss in detail the phylogenetic distribution of plant microRNAs using expression profiling and EST data, respectively. Both studies demonstrate that "several individual miRNA regulatory circuits have ancient origins and have remained intact throughout the evolution and diversification of plants." With only a limited number of miRNA families to investigate (17 and 23, respectively), the situation is much more favorable than in animals, where the MicroRNA Registry 6.0 (MR 6.0) [15, 16] lists more than 1200 microRNAs which fall into more than 300 families defined by their "mir-number". A recent comprehensive study of microRNA gene expression in zebrafish, for example, lists 142 miRNA loci in the genome of Danio rerio that are homologous to more than 100 different mammalian microRNAs, belonging to almost 100 different families.
In this contribution we report on a comprehensive study of the phylogenetic distribution and evolutionary histories of the currently known miRNAs (as defined by the content of version 6.0 of the MicroRNA Registry) and their homologs.
Novel microRNA genes
While microRNAs have been studied in much detail in mammals, insects, and nematodes, much less is known in other lineages. Information on chicken, frog, and actinopterygian microRNAs is almost exclusively based on sequence homology. In this study we have attempted to obtain this information systematically and as exhaustively as possible. To this end, we include only those predicted microRNA candidates which can be identified as homologs of a MR 6.0 entry. Note that our statistics ignore all microRNAs that are not contained in MR 6.0, most notably many of those reported in recent studies of primates [19, 20] and zebrafish [18, 21]. While a recent survey for ncRNAs has provided evidence for a significant number of microRNAs in Ciona intestinalis, most of them are not included here because their homology with known vertebrate microRNAs cannot be established unambiguously.
Table 1 summarizes the microRNA precursor sequences that form the basis for this study; a detailed list is provided in additional file 1, and insect-specific microRNAs are summarized in additional file 2 (see supplemental material).
Our knowledge of microRNAs in basal deuterostomes is sketchy at best, despite the fact that four genomes are available at various stages of completion. In this survey we detect a number of microRNAs in basal deuterostomes: 40 sequences in only 6 families (mir-1, mir-9, mir-31, mir-124, mir-125, mir-184) were found in the genome of the sea urchin Strongylocentrotus purpuratus. Most of the 40 sequences will probably turn out to be identical in more advanced assemblies of the genome. A handful of families were detected in urochordates. A previous survey predicted 41 putative microRNAs in Ciona intestinalis, of which only 4 are recognizable orthologs of known vertebrate microRNAs. It is not clear whether the other candidates are lineage-specific innovations, or whether they are too diverged to recognize their homology with known microRNA families.
Similarly, we find only three convincing microRNA candidates in the trematode Schistosoma mansoni: mir-1, mir-9, and mir-124. In contrast, no plausible orthologs were detected outside the metazoa, e.g., in Schizosaccharomyces pombe or Encephalitozoon cuniculi.
Phylogenetic distribution of microRNA families
The tables in additional file 1 and the summary of microRNA precursor sequences, both part of the extensive electronic supplement (http://www.bioinf.uni-leipzig.de/Publications/SUPPLEMENTS/05-021/), summarize the sequences that were found through the combined blast and erpin searches described above. Since large-scale experimental surveys that were not based on a priori homology information have been performed only for four species (Homo sapiens, Mus musculus, Drosophila melanogaster, Caenorhabditis elegans), we can only analyze the innovation of microRNAs along the branches of the phylogenetic tree leading to those four species.
To this end, we map each miRNA to the branch that leads to the last common ancestor of all homologs that we could identify in our survey. Note that this does not imply that all children of this ancestral node carry a known homolog: miRNAs may have been lost in a particular lineage or they may have diverged too far to be recognizable by homology-based searches. We suspect that the small number of identified miRNAs in basal deuterostomes (both Strongylocentrotus purpuratus and the urochordates) and in Schistosoma mansoni is predominantly due to sequence divergence rather than true gene loss.
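The mapping rule can be illustrated with a minimal sketch. The species tree below is a hypothetical, heavily simplified child-to-parent table chosen for illustration only; the taxon sampling, the node names, and the function names are assumptions and not the data structure used in the study.

```python
# Hypothetical, simplified species tree (child -> parent); illustrative only.
PARENT = {
    "Homo sapiens": "Eutheria",
    "Mus musculus": "Eutheria",
    "Eutheria": "Tetrapoda",
    "Gallus gallus": "Tetrapoda",
    "Tetrapoda": "Gnathostomata",
    "Danio rerio": "Gnathostomata",
    "Gnathostomata": "Deuterostomia",
    "Strongylocentrotus purpuratus": "Deuterostomia",
    "Deuterostomia": "Bilateria",
    "Drosophila melanogaster": "Protostomia",
    "Caenorhabditis elegans": "Protostomia",
    "Protostomia": "Bilateria",
}


def path_to_root(node):
    """Return the list of nodes from `node` up to the root of the tree."""
    path = [node]
    while node in PARENT:
        node = PARENT[node]
        path.append(node)
    return path


def last_common_ancestor(species):
    """Return the deepest node that is an ancestor of every listed species."""
    paths = [path_to_root(s) for s in species]
    shared = set(paths[0]).intersection(*paths[1:])
    # Walking upward from the first species, the first shared node is the LCA.
    for node in paths[0]:
        if node in shared:
            return node
    raise ValueError("species do not share a common ancestor in this tree")


# A family with homologs in human and zebrafish maps to the gnathostome
# ancestor, even though no chicken homolog is listed.
print(last_common_ancestor(["Homo sapiens", "Danio rerio"]))
```

Only the species in which a homolog was found enter the computation, so a missing homolog in an intermediate lineage (loss or undetectable divergence) does not change the inferred branch.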
To our surprise, we find that miRNA innovation is an ongoing process, exemplified already by the small number of rodent or primate-specific sequences contained in MR 6.0. Recent studies by Berezikov et al. and Bentwich et al. demonstrate that evolutionarily young miRNAs are a common phenomenon. Many of these are members of large miRNA clusters. Note that our data set contains at least one representative of many of these clusters, suggesting that expansion of existing clusters is a major mode of miRNA evolution. On the other hand, we can clearly identify two edges in the phylogenetic tree along which innovation is concentrated: the edge leading to the ancestral gnathostome, and the edge leading to the ancestral eutherian.
In addition to the introduction of a large number of novel miRNA sequences, we find a large number of paralogous miRNA sequences throughout the metazoa. Two classes of duplication events are easily distinguishable:
• Local (tandem) duplications result in paralogous sequences that are (typically) located on the same transcript. These gene copies retain their physical linkage over long evolutionary timescales.
• Non-local duplications result in paralogous genes (or gene clusters) on (usually) different chromosomes. In some cases, copies on the same chromosome separated by large distances are observed, but in these cases the physical linkage is not preserved across larger evolutionary times.
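As a rough illustration of this distinction, the following sketch classifies a pair of paralogous loci by their genomic coordinates and groups loci into clusters. The 50 kb distance threshold, the tuple layout, and the function names are assumptions made for the example; the text gives no numerical cutoff, and the study additionally relies on conservation of linkage across species, which is not modeled here.

```python
from collections import defaultdict


def classify_pair(locus_a, locus_b, max_gap=50_000):
    """Classify two paralogous loci, each given as (chromosome, start).

    `max_gap` is an illustrative threshold, not a value from the paper.
    """
    same_chromosome = locus_a[0] == locus_b[0]
    if same_chromosome and abs(locus_a[1] - locus_b[1]) <= max_gap:
        return "local"
    return "non-local"


def group_loci(loci, max_gap=50_000):
    """Group (name, chromosome, start) records into clusters of nearby loci.

    Consecutive loci on one chromosome join the same cluster when the
    distance between their start positions is at most `max_gap`.
    """
    by_chromosome = defaultdict(list)
    for name, chromosome, start in loci:
        by_chromosome[chromosome].append((start, name))

    clusters = []
    for chromosome in sorted(by_chromosome):
        current, previous_start = [], None
        for start, name in sorted(by_chromosome[chromosome]):
            if previous_start is not None and start - previous_start > max_gap:
                clusters.append(current)
                current = []
            current.append(name)
            previous_start = start
        clusters.append(current)
    return clusters
```

Applied to coordinates from several species, the same grouping could be used to check whether a cluster retains its physical linkage, which is the property that separates tandem copies from distant copies on the same chromosome.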
Non-local duplications can almost exclusively be assigned to only two points in the metazoan phylogeny: the stem of the teleost branch and the edge separating the gnathostome ancestor from the urochordates. This is consistent with the large-scale, probably genome-wide, duplications postulated by the 2R/3R model [23–25].
As expected, we find no case of a microRNA family with more than 4 different genomic loci in tetrapods or more than 8 genomic loci in teleosts, with the sole exception of the let-7 family. In this case, which has been studied in detail elsewhere, at least one non-local duplication event predates vertebrate-specific genome duplications.
Indeed, we find that about 50% of the isolated microRNAs or microRNA clusters that predate the last common ancestor of tetrapods and teleosts appear in at least two separate genomic loci. Similarly, about 50% of these "old" microRNAs show clear evidence for an additional duplication of at least one copy in the teleost lineage.
A substantial fraction of microRNAs are located on polycistronic transcripts [26–29]. Table 2 lists the vertebrate microRNA clusters. MicroRNA clustering is also a common phenomenon in invertebrates (see summary table in additional file 2, supplemental material). The evolutionary history of four microRNA clusters has already been described in detail in the literature:
Probably the best-understood microRNA, at least in terms of its phylogenetic distribution, is let-7, which was discovered in C. elegans as a timing regulator in development. The let-7 microRNA is present in diverse animal phyla including chordates, echinoderms, mollusks, annelids, arthropods, nematodes, chaetognaths, nemerteans, and platyhelminths, but it is absent in basal metazoa including cnidarians, poriferans, ctenophora, and acoel flatworms [3, 4]. In vertebrates a plethora of let-7 paralogs are known. Paralogs of the two miRNAs mir-100 and mir-125 are transcribed together with some of the let-7 paralogs in both vertebrates and insects. A detailed reconstruction of the let-7 gene phylogeny is available in the literature.
The mir-17 cluster consists of up to 6 members belonging to three non-homologous microRNA families: mir-17, mir-19, and mir-92. While mir-92 can easily be traced back to the common ancestor of protostomes and deuterostomes, the other two families appear to be younger.
The mir-134 cluster is a unique system of microRNAs located at the imprinted human locus 14q32 [10–12, 31] and the orthologous mouse Dlk1-Gtl2 domain. It is restricted to eutherian mammals and consists of 6 known groups of microRNAs, which, however, according to our analysis share a common origin, see Fig. 7 below. The most prolific subgroup consists of mir-154 and its paralogs, which appear to be rapidly radiating. Local sub-clusters of this unique system have been studied in detail elsewhere; that work also reports additional cluster members that are not contained in the MR 6.0.
The mir-290 cluster consists of murine microRNAs mir-290 to mir-295 and their human homologs mir-371 to mir-373. It is conserved in eutherian mammals and is rapidly evolving both in gene content and sequence [20, 34].
Other miRNA clusters have not been analyzed in detail to our knowledge. Our own findings are summarized below, see also Fig. 3. Gene phylogenies of all microRNA families are provided in the supplemental material.
The mir-1 cluster is ancient, consisting of mir-1 and mir-133 (except in nematodes, where mir-133 seems to be absent). In vertebrates, there are three copies on different chromosomes.
The mir-9 family is also ancient. In diptera, we have both an isolated mir-9 paralog (most closely related to its vertebrate homologs) and a cluster of four microRNAs consisting of mir-9c, mir-306, mir-79, and mir-9b, see Fig. 3a. This cluster, which presumably arose by means of tandem duplications, is specific to diptera. One of the four members of this mir-9 cluster, mir-306, is so diverged that its homology with mir-9/mir-79 is not unambiguous.
The mir-15 cluster arose from an old tandem duplication. It occurs in 3 copies in tetrapoda, where one locus has only a single copy of the microRNA.
In some cases, even the combination of sequence information and physical linkage is insufficient to completely resolve the history of a microRNA cluster. As an example, consider the mir-23 cluster, consisting of mir-23, mir-24, and mir-27, which appear to have unrelated sequences. While tetrapoda have two clusters consisting of all three miRNAs, teleost fishes have either four (pufferfishes) or five (zebrafish) copies, usually on different chromosomes or at least separated by several million bases from each other. Fig. 4 gives the two most plausible scenarios, both of which are based on the assumption of the 2R/3R model that leads us to expect up to four paralogs in the ancestral vertebrate and a duplication of this ancestral state in the teleosts.
The mir-141 cluster consists of the paralogous microRNAs mir-141 and mir-200. The ancient tandem duplication that created this cluster predates the origin of the chordates (but there do not seem to be homologous arthropod or nematode sequences). In vertebrates there are two copies of the clusters.
The mir-302 cluster consists of four tandem copies of mir-302 and a single copy of mir-367 in amniotes. Homologs in more distant groups, including frog and teleosts, could not be identified.
A small number of microRNA clusters arose only recently, i.e., after the last common ancestor of eutherian mammals. For example, mir-298 arose next to mir-296 in the rodent lineage. mir-105, which is located on the X-chromosome, exists in three copies in Canis and in two copies in Homo, while other mammals have only a single copy.
Conversely, a few ancient microRNA families have been remodeled considerably in mammals. The mir-130 cluster, Fig. 3c, may serve as an example. This family arose by tandem duplications very early in vertebrates. An additional copy appears early in the mammalian lineage, followed by different lineage-specific deletions.
MicroRNAs and repetitive DNA
Small interfering RNAs (siRNAs) are related to retro-elements in plants and fungi: In plants they are known to silence retro-elements and promoter regions by DNA and histone methylation. In S. pombe, siRNAs complementary to centromeric dh repeats and other retrotransposon LTRs are involved in heterochromatin silencing. Recently, numerous mammalian miRNAs with extensive homology to known repetitive elements were described, including rat mir-333. These and three further miRNA sequences (mir-308, mir-421, and mir-430) as well as mir-220, which is discussed in the following section, are excluded from the phylogenetic analysis. They are marked with the symbol ♠ in the summary table in the appendices found in the supplemental material.
The D. melanogaster and D. pseudoobscura mir-308 sequences reside in the last intron of the gene encoding the 23S ribosomal protein. Candidate sequences in insects were classified as simple repeats or low complexity regions by RepeatMasker. Putative homologs in vertebrates were identified as LINEs, SINEs, MER2_type elements, and simple repeats. None of those are associated with Rps23S. The mature sequences were not conserved between those candidates; the only feature they had in common was long stretches of A- and T-rich sequence.
The eutherian-specific mir-421 is located on the X-chromosome. The majority of candidates were identified as L2/LINE elements, the remaining ones as SINE/Alu (Alu, B1F) and SINE/MIR (MIRb). The locus reflects the features of repeat-derived miRNAs described previously. Two L2 elements in tail-to-tail orientation form the stem of the pre-miRNA, whereas the loop consists of the poly(T) tail (here poly(A) since one of the L2s is found on the minus strand) and the short intervening sequence. In contrast, the sequences of eutherian-specific microRNAs that are not related to any known retrotransposon are in most cases conserved almost perfectly among different eutherian species.
The mir-430 family apparently is derived from a zebrafish repetitive element of unknown type.
Tubulin genes and mir-220
The tubulin superfamily comprises 6 families. Three of them, the alpha, beta and gamma tubulins, are ubiquitous in eukaryotes and have been used for several phylogenetic studies within this kingdom. Multiple highly conserved alpha and beta tubulin genes are found within each species. In addition, several intronless tubulin pseudogenes were found [43, 44], flanked by different repeat regions. These remnants of functional genes were, for instance, used as a molecular clock for investigating hominid evolution.
Mir-220 was discovered in D. rerio, where it is found in the fourth exon of an mRNA (NM199975.1) that appears to be related to tubulin-beta genes. It can be mapped unambiguously to the minus strand of several D. rerio ESTs.
The human mir-220 sequence was identified by homology to the experimentally verified D. rerio sequence. It is located in a genomic region highly conserved between several vertebrates according to the conservation track of the UCSC genome browser. On the DNA sequencing clone RP5-1189B24 (AL030996) this region is annotated as tubulin beta-5 (TUBB5) pseudo-gene. The mir-220 resides on the opposite strand of this predicted gene at a position homologous to the 5' end of exon 4 in the functional TUBB4. None of the sequences in the human ESTs of GenBank contained hsa-mir-220.
None of the numerous blast hits for mir-220 was identified as a repetitive sequence; rather, they appear to belong to tubulin genes and pseudogenes. Only the human sequence folds into a proper stem-loop structure, whereas the zebrafish microRNA results in a branched structure (Fig. 5). The multiple sequence alignment does not display typical features of miRNAs either. The mature sequence contains one gap in the human sequence and in addition one mismatch. Neither the loop region, nor the complementary arm, nor the 5' and 3' ends of the precursor are highly diverse. Furthermore, mir-220 would be the first microRNA to be processed from the anti-sense strand of a coding exon, a mode of transcription known so far only for cis-acting anti-sense transcripts.
Taking these facts together, it is conceivable that mir-220 is an experimental artifact. At the very least, homologous sequences in species other than zebrafish should not be interpreted as microRNAs in absence of additional evidence. We therefore disregard mir-220 in our further analysis.
Using blast, we have been able to identify a substantial number of microRNAs with different microRNA Registry names as homologs. As a consequence, our survey distinguishes 292 microRNA families (plus two sequences which could not be mapped to their respective genomes), while our starting point, the MR 6.0, contains 341 different family names for animal microRNAs.
In order to detect distant homologies between microRNA families that cannot be unambiguously determined from the precursor sequences, we also analyzed the mature microRNAs. Comparing alignments with shuffled sequences as described in the methods section, we obtain 95 pairs, 8 triples, and 3 quadruples of microRNA families at a z-score cutoff value of 3.0. Among them is in particular the entire mir-134 cluster, which can also be identified based on the precursor sequences (Fig. 7).
While mature microRNAs are much better conserved than the rest of the precursor sequences, they are at the same time less informative because of their short length (≈ 22nt). It is therefore not warranted to conclude that mature miRNAs which exhibit statistically significant similarities (as measured by the z-score of their alignment) are true homologs. The observed similarities could also have arisen through convergent evolution. For example, the first 8 nucleotides of the mature sequences show highly conserved patterns between certain families of microRNAs that regulate target genes of the Notch signaling pathway. These motifs have been characterized as GY-box, Brd-box, and K-box. In general, the corresponding pre-miRNA sequences are too divergent to conclude that they derive from a common ancestral sequence.
In four cases we find strong evidence for homology that was not detectable directly by means of blast, see Fig. 6. The first two of these cases identify putative orthologs in distant clades:
Arthropod-specific mir-8 is related to vertebrate-specific mir-429. Their mature sequences are 74% identical, and the combined stem regions still have about 60% sequence identity. A re-examination of the full precursor sequences leads us to conclude that arthropod mir-8 and vertebrate mir-429 are indeed orthologs.
Similarly, the mature sequences suggest that the nematode microRNA mir-72 is possibly homologous with mir-31 in arthropods and vertebrates. However, the full precursor sequences cannot be aligned convincingly. The z-score of z = 3.62 is only marginally significant. We hence (conservatively) count mir-31 and mir-72 as different families.
In a few more cases, distant putative paralogs can be detected using the z-score measure.
A particularly interesting case is the similarity between the Hox-cluster associated mir-10 and the mir-100 family, which is part of the let-7 cluster. They are annotated as members of the single microRNA precursor family RF00104 in the Rfam database. The mature sequences are 72% identical, the combined stem-regions share about 50% of the nucleotides, while the alignment of the complete precursor sequences is at the border of significance. In contrast, we cannot confirm that mir-51 and mir-57 are putative homologs of mir-10/mir-100. While it is likely that mir-10 and mir-100, two old and developmentally important microRNAs, are homologous, we still treat them conservatively as distinct families in all statistics reported in this contribution. In any case, the putative duplication from which the mir-10 and mir-100 families arose would date back at least to the eubilaterian ancestor.
The alignment z-scores of the mir-15 and mir-322 precursor sequences also hint at a distant homology. The human ortholog of mir-322, designated as hsa-mir-424, is located 0.4 Mb downstream of the extra copy of the mir-17 cluster located on the mammalian X-chromosome. It partially overlaps in its 3' end with the known mRNA BC007360, of which the third exon is annotated as Ensembl Gene ENSG00000165705 with predicted homologs in chimp (ENSPTRG00000022288) and cow (ENSBTAG00000001876). The entire region appears to be specific to mammals, as no homologs in the chicken genome can be found in the UCSC genome browser, although syntenic regions upstream and downstream of the miRNA exist on chicken chromosome 4. These genes as well as intergenic regions show roughly two- to three-fold compression in chicken, but the region containing the miRNA is 18 times longer in human. The syntenic region of human Xq on chicken chromosome 4p corresponds to a microchromosome in all other birds but Galliformes, indicating a spot of heavy rearrangements, which might explain missing sequences. The available information is insufficient to determine unambiguously whether mir-322/mir-424 is a true homolog of mir-15 that arose during the processes that led to the assembly of the eutherian X-chromosome. Thus we conservatively count mir-322/mir-424 and mir-15 as distinct microRNA families.
The systematic search for orthologs and paralogs of known animal microRNAs provides a suitable basis for studying their evolution. While microRNAs exist both in multicellular animals and multicellular plants, there is no evidence that particular microRNA sequences are homologous between the kingdoms. Here we systematically study the evolution of the more than 200 known animal microRNA families. Our analysis identified a substantial number of known microRNAs as homologs despite the fact that they have different names in the MicroRNA Registry. In a few additional cases, there is at least circumstantial evidence for distant homologies. Nevertheless, vertebrate genomes contain almost 200 distinct microRNA families that do not share significant sequence homology. As most of these families cannot be traced back to an ancestral bilaterian, we have to conclude that microRNAs can arise as de novo genes.
The evolution of the metazoan microRNA complement is therefore characterized by four processes:
(1) De novo appearance of novel miRNAs. Some of these sequences arise as additional members of existing clusters. A model has been proposed for this expansion process based on the fact that hairpins are very abundant RNA secondary structures. Such innovations occur throughout animal evolution. They are concentrated in the bilaterian ancestor, the vertebrate ancestor, and the eutherian ancestor. The data are at present insufficient to determine whether such periods of increased microRNA innovation also happened in invertebrate lineages. However, a small number of microRNAs are derived from repetitive elements.
(2) Tandem duplications are a frequent mechanism accounting in particular for the expansion of microRNA clusters. Such local duplications are also strongly overrepresented in the vertebrate ancestor, and at the origin of placental mammals. In the latter case, most duplications are associated with the mir-134 cluster.
(3) Non-local duplications associated with genome-wide duplication events. As described above, these map almost exclusively to the edge separating the gnathostome ancestor from the urochordates and to the stem of the teleost branch, consistent with the 2R/3R model [23–25].
(4) A small class of non-local duplications is not associated with genome-wide duplication events. The only invertebrate example is the duplication of mir-9 in arthropods. In the ancestral eutherian we find 6 such events, mostly associated with the formation of the X-chromosome. Indeed, the mammalian X chromosome has generated and recruited a disproportionately high number of functional retroposed genes, which might also have affected some microRNA genes, including the X-chromosomal copy of the mir-17 cluster.
The expansion of the microRNA repertoire is consistent with the idea that the complex metazoan genomes require an additional level of regulators [54, 55]. As one would expect from such a model, dramatic expansions of the microRNA repertoire appear to be associated with major bauplan innovations: in ancestral bilaterians, ancestral vertebrates, and with the advent of (placental) mammals.
The protocol essentially follows previously described procedures; we refer to that earlier work for a detailed description with examples. For RNA folding we used the programs contained in the Vienna RNA Package [56, 57]. Sequence searches were performed locally using NCBI blast (version 2.2.6) with default settings and an E-value cutoff of E < 0.01; alignments were computed with clustalw and visualized using clustalx. The non-stringent E-value cutoff was chosen in order to minimize false negatives; false positives at this stage do not pose a problem because of the stringent filters in the subsequent stages of the analysis.
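The E-value filter can be sketched as a small post-processing step. The example assumes that the hits are available in the standard 12-column tab-separated tabular format (written by `-m 8` in the legacy blastall program and by `-outfmt 6` in BLAST+); whether the authors used this output format is not stated, and the function name and returned tuple layout are illustrative.

```python
def significant_hits(tabular_text, max_evalue=0.01):
    """Keep hits with E-value below `max_evalue` from 12-column tabular output.

    Columns assumed: query, subject, %identity, alignment length, mismatches,
    gap opens, q.start, q.end, s.start, s.end, E-value, bit score.
    """
    hits = []
    for line in tabular_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        s_start, s_end = int(fields[8]), int(fields[9])
        evalue = float(fields[10])
        if evalue < max_evalue:
            # In nucleotide searches a minus-strand hit has s.start > s.end.
            strand = "+" if s_start <= s_end else "-"
            hits.append(
                (query, subject, min(s_start, s_end), max(s_start, s_end), strand, evalue)
            )
    return hits
```

Returning normalized coordinates and the strand makes it straightforward to extract a genomic window of roughly the query length around each hit for the subsequent alignment and folding steps.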
All metazoan microRNA precursor sequences contained in the MR 6.0 (May 2005) were blasted against the available metazoan genomes (see list in the appendices, supplemental material) as well as a few protist genomes. The resulting blast hits were extracted from the database such that the retrieved sequences had approximately the same length as the query sequences. Multiple alignments of known microRNA sequences and putative homologs were constructed using clustalw and visually inspected for unrelated sequences or sequences not sharing a well conserved mature miRNA. The aligned sequences were trimmed to closely match the length of the known homologs from the MicroRNA Registry and then realigned.
RNAalifold was used to verify the hairpin structure of the consensus fold. In some cases, sequences that deviated from the phylogenetic expectation were folded separately and tested for thermodynamic stability using the randfold program. In cases where candidate sequences had to be removed, the alignments were recomputed.
MicroRNAs for which only nematode sequences were known were blasted against all vertebrate and all arthropod genomes with a cutoff of only E ≤ 0.1. Cases in which the blast hits consistently overlap with the mature microRNA were considered further. Next we considered the vicinity of the blast hit and checked whether it is conserved in vertebrates or arthropods, respectively. This leaves only mir-86 (vertebrates) and mir-72 (arthropods) as possible candidates with unknown orthologs. In both cases the candidate sequences do not form a conserved hairpin structure so that we conclude that they are probably not homologous microRNAs.
The blast searches were complemented by searches for distant homologs, similar to a previously described procedure.
The consensus secondary structure of the final alignments of the known microRNAs and their homologs as determined above was computed using RNAalifold and converted into a search pattern for the erpin program. For each microRNA, we determined the subtree spanned by known sequences and blast hits. Using erpin, we then screened within this subtree those genomes in which we did not find a blast hit, as well as all genomes from sister groups under plausible phylogenetic assumptions. In particular, both insects and nematodes were investigated for microRNAs that could be found in all vertebrates. Conversely, for apparently insect- or nematode-specific sequences we checked the other invertebrate clade as well as a sample of vertebrate genomes.
erpin searches were repeated with different score thresholds in order to balance sensitivity versus specificity, such that for each query model no more than a few dozen candidates per genome were returned. These candidates were filtered in the following way: (1) RNAfold was used to compute the secondary structure. Sequences were removed from the candidate list if removal of at most 4 base pairs did not result in an unbranched stem-loop structure. (2) Sequences passing the first test were removed if their p-value for structural stabilization computed by randfold-2 exceeded 0.03. (3) The remaining sequences were aligned with the original search profiles. Only candidates with a significant sequence similarity according to visual inspection were retained. (4) We finally used the erpin candidates in blast searches against the remaining genomes. Candidates without a plausible phylogenetic conservation were rejected.
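Filters (1) and (2) can be sketched as follows, assuming the predicted structure is given in dot-bracket notation (as written by RNAfold) and that the p-value has already been obtained from randfold. The reduction used here is one possible reading of the criterion: in a nested structure, the largest set of base pairs that forms a single unbranched stem-loop equals the maximal nesting depth, so the minimal number of pairs to remove is the total number of pairs minus that depth. Function names are hypothetical; steps (3) and (4) involve visual inspection and further searches and are not modeled.

```python
def min_pairs_to_remove(dot_bracket):
    """Smallest number of base pairs to delete so that only one hairpin remains."""
    depth = max_depth = pairs = 0
    for symbol in dot_bracket:
        if symbol == "(":
            depth += 1
            pairs += 1
            max_depth = max(max_depth, depth)
        elif symbol == ")":
            depth -= 1
            if depth < 0:
                raise ValueError("unbalanced structure")
    if depth != 0:
        raise ValueError("unbalanced structure")
    # The deepest chain of nested pairs is the largest unbranched stem-loop.
    return pairs - max_depth


def passes_filters(dot_bracket, p_value, max_removed=4, max_p=0.03):
    """Apply filter (1) (near-unbranched fold) and filter (2) (p-value <= 0.03)."""
    return min_pairs_to_remove(dot_bracket) <= max_removed and p_value <= max_p
```

For instance, a stem carrying two short side helices of two pairs each requires two deletions and would still pass filter (1), whereas two equally long adjacent hairpins of five pairs each would not.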
We pragmatically define a microRNA family as a collection of microRNA precursors for which we can construct a plausible sequence alignment using a global alignment tool such as clustalw, i.e., for which sequence homology is unambiguous. Gene phylogenies were reconstructed using the neighbor-net method as implemented in SplitsTree4. The approximate trees were checked for consistency with accepted phylogenetic hypotheses.
For all microRNA precursors for which paralogs are known or have been detected in our survey, we attempted to reconstruct the duplication history from the gene tree. In the case of physically linked microRNA clusters we additionally verified that the gene phylogenies of the individual cluster members were consistent with the linkage information. We checked in particular for evidence of additional, relatively recent duplication events of microRNAs in teleosts relative to the tetrapods.
Detection of distant homologies
In order to identify distant sequence similarities between precursor miRNAs from different paralog groups, we computed a similarity score based on the significance of the alignment score. The identity score s(I, J) for the pairwise alignment of two precursor miRNAs I and J was computed using the implementation of the fast approximate Wilbur-Lipman algorithm from the clustalw program. The mean m and the variance ν of the identity scores obtained for randomly permuted sequences were then estimated by sampling. The z-score z(I, J) = (s(I, J) - m)/√ν was used as a convenient measure of similarity between the sequences I and J.
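The z-score can be illustrated with a short sketch. It does not reproduce the Wilbur-Lipman implementation in clustalw: as a stand-in, it scores a pair of sequences with the similarity ratio of `difflib.SequenceMatcher` from the Python standard library. Shuffling both sequences, using the population variance, and the default of 100 samples are likewise assumptions made for the example.

```python
import math
import random
import statistics
from difflib import SequenceMatcher


def identity_score(seq_a, seq_b):
    """Stand-in similarity score in [0, 1]; NOT the Wilbur-Lipman score."""
    return SequenceMatcher(None, seq_a, seq_b, autojunk=False).ratio()


def shuffled(seq, rng):
    """Return a random permutation of the characters of seq."""
    chars = list(seq)
    rng.shuffle(chars)
    return "".join(chars)


def z_score(seq_i, seq_j, n_shuffles=100, seed=0, score=identity_score):
    """z(I, J) = (s(I, J) - m) / sqrt(v), with m and v estimated by sampling."""
    rng = random.Random(seed)
    s = score(seq_i, seq_j)
    background = [
        score(shuffled(seq_i, rng), shuffled(seq_j, rng))
        for _ in range(n_shuffles)
    ]
    m = statistics.fmean(background)
    v = statistics.pvariance(background, mu=m)
    if v == 0:
        return float("nan")  # background has no spread; z is undefined
    return (s - m) / math.sqrt(v)
```

Because permutation preserves the base composition of both sequences, the background distribution controls for composition bias, which is what makes the score usable where a global alignment is unreliable.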
We used the very well-conserved mature microRNAs to identify possible homologies that had not been reported previously. In the first step, clustalw alignments were used to determine groups of mature microRNAs with pairwise identities in excess of 70%. From the resulting 291 groups, which approximately correspond to the microRNA families, we determined consensus sequences. For these we computed all pairwise alignment z-scores using 100 shuffled sequences. Subclusters with pairwise z-scores better than z = 3.0 were extracted. In order to check the stability of the procedure, z-score matrices for these subclusters were re-calculated from 1000 shuffled sequences. This method produces robust similarity scores in regimes where reliable global alignments cannot be obtained. Standard WPGMA clustering was then used to estimate a dendrogram from the z-scores.
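The final clustering step can be sketched in a few lines. WPGMA repeatedly merges the two closest clusters and sets the distance from the merged cluster to any other cluster to the plain average of the two previous distances. Since WPGMA works on dissimilarities, the sketch converts z-scores with d = z_max - z; this conversion, the function names, and the toy z-scores are assumptions made for illustration and are not taken from this study.

```python
from itertools import combinations


def z_to_distance(z_scores):
    """Turn pairwise z-scores into dissimilarities via d = z_max - z."""
    z_max = max(z_scores.values())
    return {pair: z_max - z for pair, z in z_scores.items()}


def wpgma(labels, dist):
    """Build a dendrogram as nested tuples.

    `dist` maps frozenset({a, b}) to the distance between labels a and b.
    Returns the tree and the list of merge heights.
    """
    d = dict(dist)
    clusters = list(labels)
    heights = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda p: d[frozenset(p)])
        heights.append(d[frozenset((a, b))])
        merged = (a, b)
        rest = [c for c in clusters if c != a and c != b]
        for c in rest:
            # WPGMA update: unweighted mean of the two former distances.
            d[frozenset((merged, c))] = (
                d[frozenset((a, c))] + d[frozenset((b, c))]
            ) / 2
        clusters = rest + [merged]
    return clusters[0], heights


# Toy z-scores for four hypothetical consensus sequences A to D.
toy_z = {
    frozenset("AB"): 8.0, frozenset("AC"): 4.0, frozenset("BC"): 5.0,
    frozenset("AD"): 1.0, frozenset("BD"): 1.0, frozenset("CD"): 2.0,
}
tree, heights = wpgma("ABCD", z_to_distance(toy_z))
# tree == ("D", ("C", ("A", "B"))); heights == [0.0, 3.5, 6.5]
```

In the toy example A and B, which share the highest z-score, are joined first, and D, which has low z-scores to all other sequences, is attached last.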
Ambros V: The functions of animal microRNAs. Nature. 2004, 431: 350-355. 10.1038/nature02871.
Kidner CA, Martienssen RA: The developmental role of microRNA in plants. Curr Opin Plant Biol. 2005, 8: 38-44. 10.1016/j.pbi.2004.11.008.
Pasquinelli AE, Reinhart BJ, Slack F, Martindale MQ, Kuroda MI, Maller B, Hayward DC, Ball EE, Degnan B, Müller P, Spring J, Srinivasan A, Fishman M, Finnerty J, Corbo J, Levine M, Leahy P, Davidson E, Ruvkun G: Conservation of the sequence and temporal expression of let-7 heterochronic regulatory RNA. Nature. 2000, 408: 86-89. 10.1038/35040556.
Pasquinelli AE, McCoy A, Jiménez E, Emili S, Ruvkun G, Martindale MQ, Baguñà J: Expression of the 22 nucleotide let-7 heterochronic RNA throughout the Metazoa: a role in life history evolution?. Evol Dev. 2003, 5: 372-378. 10.1046/j.1525-142X.2003.03044.x.
Bompfünewerer AF, Flamm C, Fried C, Fritzsch G, Hofacker IL, Lehmann J, Missal K, Mosig A, Müller B, Prohaska SJ, Stadler BMR, Stadler PF, Tanzer A, Washietl S, Witwer C: Evolutionary Patterns of Non-Coding RNAs. Th Biosci. 2005, 123: 301-369. 10.1016/j.thbio.2005.01.002.
Tanzer A, Stadler PF: Molecular Evolution of a MicroRNA Cluster. J Mol Biol. 2004, 339: 327-335. 10.1016/j.jmb.2004.03.065.
Tanzer A, Stadler PF: Evolution of MicroRNAs. MicroRNA Protocols, Methods in Molecular Biology. Edited by: Ying SY. 2006, Humana Press,
Yekta S, Shih Ih, Bartel DP: MicroRNA-directed cleavage of HoxB8 mRNA. Science. 2004, 304: 594-596. 10.1126/science.1097434.
Tanzer A, Amemiya CT, Kim CB, Stadler PF: Evolution of MicroRNAs Located Within Hox Gene Clusters. J Exp Zool: Mol Dev Evol. 2005, 304B: 75-85. 10.1002/jez.b.21021.
Lagos-Quintana M, Rauhut R, Yalcin A, Meyer J, Lendeckel W, Tuschl T: Identification of tissue-specific microRNAs from mouse. Current Biology. 2002, 12: 735-739. 10.1016/S0960-9822(02)00809-6.
Houbaviy HB, Murray MF, Sharp PA: Embryonic stem cell-specific microRNAs. Dev Cell. 2003, 5: 351-358. 10.1016/S1534-5807(03)00227-2.
Kim J, Krichevsky A, Grad Y, Hayes GD, Kosik KS, Church GM, Ruvkun G: Identification of many microRNAs that copurify with polyribosomes in mammalian neurons. Proc Natl Acad Sci USA. 2004, 101: 360-365. 10.1073/pnas.2333854100.
Axtell MJ, Bartel DP: Antiquity of MicroRNAs and Their Targets in Land Plants. Plant Cell. 2005, 17: 1658-1673. 10.1105/tpc.105.032185.
Zhang BH, Pan XP, Wang QL, Cobb GP, Anderson TA: Identification and characterization of new plant microRNAs using EST analysis. Cell Res. 2005, 15: 336-360. 10.1038/sj.cr.7290302.
Griffiths-Jones S: The microRNA Registry. Nucleic Acids Res. 2004, 32: D109-D111. 10.1093/nar/gkh023.
Griffiths-Jones S, Moxon S, Marshall M, Khanna A, Eddy SR, Bateman A: Rfam: annotating non-coding RNAs in complete genomes. Nucleic Acids Res. 2005, 33 (Database issue): D121-D124.
Ambros V, Bartel B, Bartel DP, Burge CB, Carrington JC, Chen X, Dreyfuss G, Eddy SR, Griffiths-Jones S, Marshall M, Matzke M, Ruvkun G, Tuschl T: A uniform system for microRNA annotation. RNA. 2003, 9: 277-279. 10.1261/rna.2183803.
Wienholds E, Kloosterman WP, Miska E, Alvarez-Saavedra E, Berezikov E, de Bruijn E, Horvitz HR, Kauppinen S, Plasterk RHA: MicroRNA Expression in Zebrafish Embryonic Development. Science. 2005, 309: 310-311. 10.1126/science.1114519.
Berezikov E, Guryev V, van de Belt J, Wienholds E, Plasterk RHA: Phylogenetic Shadowing and Computational Identification of Human microRNA Genes. Cell. 2005, 120: 21-24. 10.1016/j.cell.2004.12.031.
Bentwich I, Avniel AA, Karov Y, Aharonov R, Gilad S, Barad O, Barzilai A, Einat P, Einav U, Meiri E, Sharon E, Spector Y, Bentwich Z: Identification of hundreds of conserved and nonconserved human microRNAs. Nat Genet. 2005, 37: 766-770. 10.1038/ng1590.
Chen PY, Manninga H, Slanchev K, Chien M, Russo JJ, Ju J, Sheridan R, John B, Marks DS, Gaidatzis D, Sander C, Zavolan M, Tuschl T: The developmental miRNA profiles of zebrafish as determined by small RNA cloning. Genes Dev. 2005, 19: 1288-1293. 10.1101/gad.1310605.
Missal K, Rose D, Stadler PF: Non-coding RNAs in Ciona intestinalis. Bioinformatics. 2005, 21S2: i77-i78. [Proceedings ECCB/JBI'05, Madrid]
Holland PWH, Garcia-Fernàndez J, Williams NA, Sidow A: Gene duplication and the origins of vertebrate development. Development. 1994, Suppl: 125-133.
Amores A, Force A, Yan YL, Joly L, Amemiya C, Fritz A, Ho RK, Langeland J, Prince V, Wang YL, Westerfield M, Ekker M, Postlethwait JH: Zebrafish Hox clusters and vertebrate genome evolution. Science. 1998, 282: 1711-1714. 10.1126/science.282.5394.1711.
Spring J: Genome duplication strikes back. Nat Genet. 2002, 31: 128-129.
Lee Y, Jeon K, Lee JT, Kim S, Kim VN: MicroRNA maturation: stepwise processing and subcellular localization. EMBO J. 2002, 21: 4663-4670. 10.1093/emboj/cdf476.
Mourelatos Z, Dostie J, Paushkin S, Sharma A, Charroux B, Abel L, Rappsilber J, Mann M, Dreyfuss G: miRNPs: a novel class of ribonucleoproteins containing numerous microRNAs. Genes Dev. 2002, 16: 720-728. 10.1101/gad.974702.
Lagos-Quintana M, Rauhut R, Meyer J, Borkhardt A, Tuschl T: New microRNAs from mouse and human. RNA. 2003, 9: 175-179. 10.1261/rna.2146903.
Lai EC, Tomancak P, Williams RW, Rubin GM: Computational identification of Drosophila microRNA genes. Genome Biol. 2003, 4: R42. 10.1186/gb-2003-4-7-r42.
Reinhart BJ, Slack FJ, Basson M, Pasquinelli AE, Bettinger JC, Rougvie AE, Horvitz HR, Ruvkun G: The 21-nucleotide RNA let-7 regulates developmental timing in Caenorhabditis elegans. Nature. 2000, 403: 901-906. 10.1038/35002607.
Sewer A, Paul N, Landgraf P, Aravin A, Pfeffer S, Brownstein MJ, Tuschl T, van Nimwegen E, Zavolan M: Identification of clustered microRNAs using an ab initio prediction method. BMC Bioinformatics. 2005, 6: 267. 10.1186/1471-2105-6-267.
Seitz H, Royo H, Bortolin ML, Lin SP, Ferguson-Smith AC, Cavaillé J: A Large Imprinted microRNA Gene Cluster at the Mouse Dlk1-Gtl2 Domain. Genome Res. 2004, 14: 1741-1748. 10.1101/gr.2743304.
Altuvia Y, Landgraf P, Lithwick G, Elefant N, Pfeffer S, Aravin A, Brownstein MJ, Tuschl T, Margalith H: Clustering and conservation patterns of human microRNAs. Nucleic Acids Res. 2005, 33: 2697-2706. 10.1093/nar/gki567.
Houbaviy HB, Dennis L, Jaenisch R, Sharp PA: Characterization of a highly variable eutherian microRNA gene. RNA. 2005, 11: 1245-1257. 10.1261/rna.2890305.
Hamilton A, Voinnet O, Chappell L, Baulcombe D: Two classes of short interfering RNA in RNA silencing. EMBO J. 2002, 21: 4671-4679. 10.1093/emboj/cdf464.
Mette MF, Aufsatz W, van der Winden J, Matzke MA, Matzke AJ: Transcriptional silencing and promoter methylation triggered by double-stranded RNA. EMBO J. 2000, 19: 5194-5201. 10.1093/emboj/19.19.5194.
Reinhart B, Bartel D: Small RNAs correspond to centromere heterochromatic repeats. Science. 2002, 297: 1831-1831. 10.1126/science.1077183.
Schramke V, Allshire R: Hairpin RNAs and retrotransposon LTRs effect RNAi and chromatin-based gene silencing. Science. 2003, 301: 1069-1074. 10.1126/science.1086870.
Smalheiser N, Torvik VI: Mammalian microRNAs derived from genomic repeats. Trends Genet. 2005, 21: 322-326. 10.1016/j.tig.2005.04.008.
Smit AFA, Hubley R, Green P: RepeatMasker Open-3.0. 1996, [http://www.repeatmasker.org]
Oakley BR: An abundance of tubulins. Trends Cell Biol. 2000, 10: 537-542. 10.1016/S0962-8924(00)01857-2.
Keeling P, Doolittle W: Alpha-tubulin from early-diverging eukaryotic lineages and the evolution of the tubulin family. Mol Biol Evol. 1996, 13: 1297-1305.
Wilde CD, Crowther CE, Cripe TP, Gwo-Shu Lee M, Cowan NJ: Evidence that a human beta-tubulin pseudogene is derived from its corresponding mRNA. Nature. 1982, 297: 83-84. 10.1038/297083a0.
Lemischka I, Sharp PA: The sequences of an expressed rat alpha-tubulin gene and a pseudogene with an inserted repetitive element. Nature. 1982, 300: 330-335. 10.1038/300330a0.
Lee MG, Lewis S, Wilde CD, Cowan NJ: Evolutionary history of a multigene family: an expressed human beta-tubulin gene and three processed pseudogenes. Cell. 1983, 33: 477-487. 10.1016/0092-8674(83)90429-4.
Lewis SA, Cowan NJ: Tubulin pseudogenes as markers for hominoid divergence. J Mol Biol. 1986, 187: 623-626. 10.1016/0022-2836(86)90340-2.
Lim LP, Glasner ME, Yekta S, Burge CB, Bartel DP: Vertebrate microRNA genes. Science. 2003, 299: 1540-1540. 10.1126/science.1080372.
Lavorgna G, Dahary D, Lehner B, Sorek R, Sanderson CM, Casari G: In search of antisense. Trends Biochem Sci. 2004, 29:
Lai EC, Tam B, Rubin GM: Pervasive regulation of Drosophila Notch target genes by GY-box-, Brd-box-, and K-box-class microRNAs. Genes Dev. 2005, 19: 1067-1080. 10.1101/gad.1291905.
Kohn M, Kehrer-Sawatzki H, Vogel W, Graves JAM, Hameister H: Wide genome comparisons reveal the origins of the human X chromosome. Trends Genet. 2004, 20: 598-603. 10.1016/j.tig.2004.09.008.
Holland PWH, Garcia-Fernández J, Williams NA, Sidow A: Gene duplication and the origins of vertebrate development. Development. 1994, Suppl: 125-133.
Taylor J, Braasch I, Frickey T, Meyer A, Van De Peer Y: Genome duplication, a trait shared by 22,000 species of ray-finned fish. Genome Res. 2003, 13: 382-390. 10.1101/gr.640303.
Emerson JJ, Kaessmann H, Betrán E, Long M: Extensive Gene Traffic on the Mammalian X Chromosome. Science. 2004, 303: 537-540. 10.1126/science.1090042.
Mattick JS: Challenging the dogma: the hidden layer of non-protein-coding RNAs in complex organisms. Bioessays. 2003, 25: 930-939. 10.1002/bies.10332.
Mattick JS: RNA regulation: a new genetics?. Nat Rev Genet. 2004, 5: 316-323. 10.1038/nrg1321.
Hofacker IL, Fontana W, Stadler PF, Bonhoeffer LS, Tacker M, Schuster P: Fast Folding and Comparison of RNA Secondary Structures. Monatsh Chem. 1994, 125: 167-188. 10.1007/BF00818163.
Hofacker IL: Vienna RNA secondary structure server. Nucl Acids Res. 2003, 31: 3429-3431. 10.1093/nar/gkg599.
Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol. 1990, 215: 403-410. 10.1006/jmbi.1990.9999.
Thompson JD, Higgins DG, Gibson TJ: CLUSTALW: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position specific gap penalties, and weight matrix choice. Nucl Acids Res. 1994, 22: 4673-4680.
Thompson JD, Gibson TJ, Plewniak F, Jeanmougin F, Higgins DG: The ClustalX windows interface: flexible strategies for multiple sequence alignment aided by quality analysis tools. Nucl Acids Res. 1997, 25: 4876-4882. 10.1093/nar/25.24.4876.
Hofacker IL, Fekete M, Stadler PF: Secondary Structure Prediction for Aligned RNA Sequences. J Mol Biol. 2002, 319: 1059-1066. 10.1016/S0022-2836(02)00308-X.
Bonnet E, Wuyts J, Rouzé P, Van de Peer Y: Evidence that microRNA precursors, unlike other non-coding RNAs, have lower folding free energies than random sequences. Bioinformatics. 2004, 20: 2911-2917. 10.1093/bioinformatics/bth374.
Legendre M, Lambert A, Gautheret D: Profile-Based Detection of microRNA Precursors in Animal Genomes. Bioinformatics. 2005, 21: 841-845. 10.1093/bioinformatics/bti073.
Gautheret D, Lambert A: Direct RNA motif definition and identification from multiple sequence alignments using secondary structure profiles. J Mol Biol. 2001, 313: 1003-1011. 10.1006/jmbi.2001.5102.
Bryant D, Moulton V: Neighbor-Net: An Agglomerative Method for the Construction of Phylogenetic Networks. Mol Biol Evol. 2004, 21: 255-265. 10.1093/molbev/msh018.
Huson DH: SplitsTree: analyzing and visualizing evolutionary data. Bioinformatics. 1998, 14: 68-73. 10.1093/bioinformatics/14.1.68.
Wilbur WJ, Lipman DJ: Rapid similarity searches of nucleic acid and protein data banks. Proc Natl Acad Sci USA. 1983, 80: 726-730.
Sokal RR, Michener CD: A statistical method for evaluating systematic relationships. Univ Kans Sci Bull. 1958, 38: 1409-1438.
Philippe H, Lartillot N, Brinkmann H: Multigene Analyses of Bilaterian Animals Corroborate the Monophyly of Ecdysozoa, Lophotrochozoa, and Protostomia. Mol Biol Evol. 2005, 22: 1246-1253. 10.1093/molbev/msi111.
Whiting MF: Phylogeny of the holometabolous insect orders: molecular evidence. Zoologica Scripta. 2002, 31: 3-15. 10.1046/j.0300-3256.2001.00093.x.
Yang Y, Zhang YpZ, Qian Yh, Zeng Qt: Phylogenetic relationships of Drosophila melanogaster species group deduced from spacer regions of histone gene H2A-H2B. Mol Phylog Evol. 2004, 30: 336-343. 10.1016/S1055-7903(03)00212-4.
This work was supported in part by the Austrian Fonds zur Förderung der Wissenschaftlichen Forschung, project no. P15893, by the German DFG Bioinformatics Initiative, project no. BIZ-6/1-2, and by the Austrian Gen-AU bioinformatics integration network sponsored by BM-BWK and BM-WA.
This work is based on the results of two bioinformatics computer lab courses held at the Universities of Vienna and Leipzig in the Winter Semester 2004/2005. The following students contributed their preliminary analysis of 10–20 microRNA families to this work:
Sten Heinze, Alexander "muppet" Donath, Sven Findeiß, Stephanie Keller, Kevin Peter, Julian Jöris, Jakob Mühmel, Marco Dienelt, Lisa Hellwig, Maiko Lohet, Holger Schmidtchen, Nick Jagiella, Andrej Aderhold, Paul-Robert Kästerer, Thomas Skodawessely (in Leipzig), Martina Hödl, Bernhard Wurzinger, Camille Stephan-Otto Attolini, Ulrich Omasits, Sebastian Krüttner, Regina Anzengruber, Daniela Lenek, Gregor Neumayr, Sebastian Schmittner, Reinhard Wohlfart (in Vienna). The computer lab work was supervised by C.F., J.H., M.L., K.M., and A.T. Ch.F., I.L.H., and P.F.S. planned the courses and supervised the supervisors. A.T. contributed a re-analysis of the mir-17 cluster. J.H., M.L., K.M., C.F., and P.F.S. collected and cross-checked the student contributions. J.H., M.L., K.M., and P.F.S. computed the summary statistics, J.H. and A.T. investigated the distant homologies, A.T. analyzed the repeat associated microRNAs, and K.M. organized the supplemental material. All authors collaborated closely in preparing this manuscript.