- Research article
- Open Access
Annotation of primate miRNAs by high throughput sequencing of small RNA libraries
BMC Genomics volume 13, Article number: 116 (2012)
In addition to genome sequencing, accurate functional annotation of genomes is required in order to carry out comparative and evolutionary analyses between species. Among primates, the human genome is the most extensively annotated. Human miRNA gene annotation is based on multiple lines of evidence including evidence for expression as well as prediction of the characteristic hairpin structure. In contrast, most miRNA genes in non-human primates are annotated based on homology without any expression evidence. We have sequenced small-RNA libraries from chimpanzee, gorilla, orangutan and rhesus macaque from multiple individuals and tissues. Using patterns of miRNA expression in conjunction with a model of miRNA biogenesis we used these high-throughput sequencing data to identify novel miRNAs in non-human primates.
We predicted 47 new miRNAs in chimpanzee, 240 in gorilla, 55 in orangutan and 47 in rhesus macaque. The algorithm we used was able to predict 64% of the previously known miRNAs in chimpanzee, 94% in gorilla, 61% in orangutan and 71% in rhesus macaque. We therefore added evidence for expression in between one and five tissues to miRNAs that were previously annotated based only on homology to human miRNAs. We increased from 60 to 175 the number miRNAs that are located in orthologous regions in humans and the four non-human primate species studied here.
In this study we provide expression evidence for homology-based annotated miRNAs and predict de novo miRNAs in four non-human primate species. We increased the number of annotated miRNA genes and provided evidence for their expression in four non-human primates. Similar approaches using different individuals and tissues would improve annotation in non-human primates and allow for further comparative studies in the future.
From a comparative genomics standpoint the great apes are among the most studied groups of organisms . Since the completion of human genome sequencing in 2001 [2, 3] the genomes of all species belonging to this family have been or are being sequenced [4, 5]. Although only the human reference genome is considered of finished quality [2, 3], it is possible to compare and also use these genomes sequences as references for the alignment of reads generated in sequencing and gene expression studies. In addition to determine the DNA sequence of a genome, it is of particular importance to attach biological information to it e.g. determine the location and structure of protein-coding genes. Gene annotation is carried out both computationally and experimentally by sequencing cDNA e.g. traditionally using expressed sequence tags (ESTs) [6, 7] and more recently RNA-seq . Human EST resources are also more abundant than their non-human counterparts and therefore human gene annotation is also the most accurate among great apes . While the majority of efforts have focused on the annotation of protein-coding genes, the discovery of large-scale transcription outside of protein-coding genes [10, 11] has led to the identification of a great diversity of non-protein-coding RNA genes . Among these are the microRNAs (miRNAs) which are short (~22 bp) RNA molecules  that post-transcriptionally down-regulate protein-coding gene expression [14, 15]. The official repository of miRNAs miRBase (v.17) [16, 17] contains 1,424 human miRNA, whereas fewer miRNAs are annotated in other primate genomes (chimpanzee: 600; bonobo: 88; gorilla: 85; orangutan: 581; rhesus macaque: 479), a fact that is explained by the larger number of human studies.
MiRNAs have been annotated in humans using a mixture of bioinformatics prediction and cDNA sequencing . The identification of miRNAs in non-human primates has made use of a number of comparative methodologies such as sequence homology between closely related organisms [19–22], the genomic search for RNA secondary structure patterns characteristic of miRNAs  and by direct sequencing of small RNA libraries [24, 25]. However, direct characterization of small RNA libraries by high throughput sequencing has been performed for a limited number of tissues in only chimpanzees and rhesus macaques[24, 25]. As a result the majority of non-human primate miRNAs in miRBase have no evidence for their expression and their existence is only supported by computational prediction. In the present study we sequenced small RNA libraries from multiple chimpanzee, gorilla, orangutan and rhesus macaque individuals and tissues using the Illumina high throughput sequencing platform. We applied an algorithm (miRDeep) that uses sequencing reads in conjunction with a model of miRNA biogenesis to predict miRNAs with high accuracy[26, 27].
We used the program miRDeep2  to predict miRNAs from sequenced small RNAs. miRDeep2 takes as input the position and frequency of reads aligned to the genome ("signature") with respect to a putative RNA hairpin and scores the miRNA candidate employing a probabilistic model based on miRNA biogenesis . The score produced by miRDeep takes into account the energetic stability of the putative hairpin and the compatibility of the observed read distribution with miRNA cleavage . The more positive the score the more reliable the prediction. Additionally, miRDeep2 calculates false-positive rates by running the algorithm on a set of "signatures" and secondary structures that are paired by random permutation. Using predictions with a positive score and a significant folding p-value we identified from our sequences 47 (22 with expression evidence for star sequence) new miRNAs in chimpanzee, 240 (166 with expression evidence for star sequence) in gorilla, 55 (13 with expression evidence for star sequence) in orangutan and 47 (24 with expression evidence for star sequence) in rhesus macaque. miRDeep2 was able to predict 338 (64% of all annotated) known miRNAs (312 with a positive score) in chimpanzee, 75 (94% of all annotated, 73 with a positive score) in gorilla, 364 (61% of all annotated, 325 with a positive score) in orangutan and 348 (71% of all annotated, 312 with a positive score) in rhesus macaque (Figure 1). miRDeep2 performance statistics were similar to the ones reported in other species  (Figure 1).
MiRNAs show high expression conservation between species, and tissue-specific expression patterns [28, 29]. In testis we found a lower fraction of the total reads align to miRNAs (Table 1) as a result of the expression of an additional class of small-RNAs in this tissue - piRNAs . We were able to identify 11 tissue-specific miRNAs in chimpanzee (7 in brain, 1 in heart, 2 in kidney, 1 in testis), 110 in gorilla (100 in brain, 10 in liver), 28 in orangutan (25 in brain, 3 in liver) and 21 in rhesus macaque (11 in brain, 10 in testis).
To identify miRNAs which are shared between all the primates studied here we examined miRNAs that are encoded in orthologous locations in all four primate species and in human. For the miRNAs present in miRBase (v.17) we found 60 miRNAs that are located in orthologous regions in human and the four non-human primate species. When we included the set of miRNAs predicted in this study we increased this number to 175 miRNAs. This set of miRNAs can be considered prediction of high confidence since they were known in human and either known or predicted by us in all other four primate species.
All 60 of the known miRNAs present in all four species and human showed a high sequence identity i.e. the sequence is completely identical between the mature sequences for all of them. Using the set of 175 miRNAs we were able to reconstruct the expected phylogenetic relationships between the species studied for both the hairpin and the mature sequence. A principle component analysis on the sequence identity between hairpin sequences (Figure 2) shows a close relationship between chimpanzee and gorilla while both species are distant from orangutan and even more afar to rhesus macaque.
For some stages during their biogenesis miRNAs form a secondary structure that resembles a hairpin . Since the endonuclease that processes miRNAs recognizes them based on their three-dimensional structure , the stability of the secondary structure can be considered a proxy for miRNA functionality and therefore for the reliability of miRNAs predictions. We used the minimum free energy (MFE) as a measure of structure stability. We found that the hairpins of predicted miRNAs are as stable as hairpins from known miRNAs, which is not unexpected given that the score calculated by miRDeep2 takes into account the stability of the miRNA hairpin secondary structure.
Although the genomes of multiple non-human primates have been sequenced, the functional annotation of the human genome remains the most complete among primates. This is the case for miRNAs annotated in miRBase, where the number of human miRNAs is double than miRNAs annotated in chimpanzee (the second-best annotated genome) [16, 17]. In the present study we sequenced small RNA libraries from multiple individuals and tissues in four non-human primates in order to identify from expression data new miRNA genes. We identified these new miRNAs using miRDeep2 , which uses a model for miRNA precursor processing by Dicer to score miRNA predictions. Using this approach we predicted 47 new miRNAs in chimpanzee, 240 in gorilla, 55 in orangutan and 47 in rhesus macaque (Figure 1). We found that the secondary structures from our new miRNAs were as stable as miRNAs previously described in miRBase.
A similar number of new miRNAs were identified in chimpanzee, orangutan and rhesus macaque, whereas the number of new miRNA predictions in gorilla was much higher. While the genomes of the chimpanzee, orangutan and rhesus have been available for some time, and a number of miRNA studies in these species published, the gorilla genome has not yet been published and fully annotated [4, 5, 31], and no published description of miRNAs in gorilla - a requirement for inclusion of new miRNAs in miRBase - exists The majority of annotated miRNAs in the non-human primates are based on homology with human miRNAs [20–22]. However, the presence of a given locus in a genome is not a guarantee of its expression. We have, in this study, provided evidence of expression for 51% of the homology-based annotated miRNAs in gorilla, 49% in chimpanzee and 60% in rhesus macaque. We increased from 60 to 175 the number of miRNAs, which are located in orthologous regions in the four non-human primate genomes studied here and in human. This is a set of high confidence miRNAs based on homology, expression and miRNA biogenesis signatures.
In addition to the analysis of expression and folding, miRDeep incorporates a model of miRNA biogenesis, which makes its predictions more accurate than other software . While the sequencing of small RNA libraries is now technically feasible, the accurate identification of novel miRNAs remains challenging. A pioneer study in primates sequenced small RNAs libraries from human and chimpanzee brains . They predicted a large number (268 in human and 257 in chimpanzee) of new miRNAs in both species based on small RNA sequencing. Only few of these miRNAs have been included in miRBase, the public, curated repository for miRNAs (49 in human and 19 in chimpanzee). It is important to identify novel miRNAs accurately, and therefore particularly important to take into account the effect of genome quality and completeness on the ability to determine whether particular miRNAs are species-specific In primate comparisons the higher quality and completeness of the human genome means that miRNAs are frequently described as human-specific when in fact they are simply missed in related primate genomes due to sequence quality issues.
We sought to identify miRNAs that are expressed in tissue-specific manner. For species where we had samples from five tissues (chimpanzee and rhesus) we could say with more confidence that a given miRNA is tissue-specific than for the species where we had only two tissues (orangutan and gorilla). Brain was the tissue with both more miRNAs in total, and more tissue-specific miRNAs both in chimpanzee and marginally in rhesus. In orangutan and gorilla we could only identify miRNAs that are expressed mutually exclusively in either liver or brain. We found more miRNAs expressed exclusively in brain than in liver. This is in agreement with the fact that the miRNA repertoire in humans, chimpanzees and rhesus macaques is more diverse in brain compared to other tissues .
We have sequenced small RNA libraries from multiple individuals and tissues from chimpanzee, gorilla, orangutan and rhesus macaque. We identified known miRNAs and used miRDeep2 to predict de novo microRNAs in these four primate species. Our new expression-based predictions increased the number of known miRNAs in all four species. In addition, we showed the first expression evidence for miRNAs that were previously only annotated by sequence homology with humans. Accurate annotation of miRNAs in multiple primate species provides a fundamental to carry out evolutionary, comparative and functional studies of miRNAs.
We sequenced 56 small RNA libraries (24 from chimpanzees, 24 from rhesus macaques, four from orangutan and four from gorilla). The chimpanzee and rhesus macaque samples have been published . We added to this set eight samples from orangutan and gorilla (four liver and four brain samples from each species). All the individuals used in this study were adults and suffered sudden death that did not involve the tissues sampled. A description of the samples is available in Table 1.
Library preparation and sequencing
We used the individuals presented in  including 24 chimpanzee and rhesus macaque samples. Additionally, we sequenced four gorilla and four orangutan samples from brain and liver (two from each species and tissue). Total RNA was prepared as described in the Illumina Inc. manual "Small RNA Sample Preparation Guide" (Part # 1004239 Rev. A Illumina Inc. San Diego). Illumina Genome Analyzer I and II sequencing runs were analyzed starting from raw intensities. A detailed summary about the platform each sample was sequenced on, how many cycles and which chemistry was used can be found in Table 2. Base calling and quality score calculation was performed for all runs using the IBIS base caller .
Sample composition and read annotation
Read alignments were performed using PatMaN  allowing no mismatches. We mapped reads against miRBase [16, 17] version 17 and the corresponding species genomes - chimpanzee (panTro3), rhesus macaque (rheMac2), orangutan (ponAbe2) and the draft genome of gorilla (gorGor3).
MiRNA data was uploaded to the European Nucleotide Archive hosted by the European Bioinformatics Institute with the study accession number ERP000973 and ArrayExpress with accession number E-MTAB-828.
We used miRDeep2 prediction algorithm . All reads from each species were used for the corresponding predictions. We excluded redundant predictions for the same genomic location and only kept the prediction with the highest score. We used the mapper module (mapper.pl) provided by miRDeep2 with the following parameters: -n -d -c -i -j -l 18 -m -k TCGTATGCCGTCTTCTGCTTG. We ran miRDeep2 with default parameters. Newly predicted miRNAs that were found in orthologous genomic regions in all four species were submitted to miRBase. Names were assigned by miRBase and are available in Table 3.
Orthology of miRNAs
We identified orthologous regions starting from human hg19-based miRBase (version 17) hairpin locations [16, 17]. The genome coordinates were transferred to hg18 coordinates using liftOver  with the 95% identity cutoff. Human mature sequences from miRBase were aligned to the human genome (hg18) and their corresponding hairpin sequences were assigned by overlapping genome coordinates using intersectBed from Bedtools . All other primate miRNA mature sequences (known and predicted) were aligned against the corresponding genome and their genome locations were transferred to hg18 coordinates. The mature miRNA sequences found in the other primates that overlapped with human coordinates were defined as orthologous. The corresponding primate hairpin sequence was obtained by transferring the human genome hairpin coordinates to the corresponding primate genome. We excluded regions where liftOver was unable to identify an orthologous region.
MiRNAs were defined to be tissue specific when less than 5% of reads map to other tissues. This means that at least 80% of the perfectly aligned reads in chimpanzee and rhesus macaque (where we have reads from 4 tissues), and 95% of the perfectly aligned reads in gorilla and orangutan (where we have reads from 2 tissues) that were used for the prediction of the miRNA came from one tissue.
Secondary structure analysis
We calculated the minimum free energy (MFE) of known and predicted hairpin sequences by using RNAfold algorithm with default parameters . The MFE for each group of annotated/predicted miRNAs was computed by averaging the MFEs.
Marques-Bonet T, Ryder OA, Eichler EE: Sequencing primate genomes: what have we learned?. Annu Rev Genomics Hum Genet. 2009, 10: 355-386. 10.1146/annurev.genom.9.081307.164420.
Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, Devon K, Dewar K, Doyle M, FitzHugh W: Initial sequencing and analysis of the human genome. Nature. 2001, 409 (6822): 860-921. 10.1038/35057062.
Venter JC, Adams MD, Myers EW, Li PW, Mural RJ, Sutton GG, Smith HO, Yandell M, Evans CA, Holt RA: The sequence of the human genome. Science. 2001, 291 (5507): 1304-1351. 10.1126/science.1058040.
Mikkelsen TS, Hillier LW, Eichler EE, Zody MC, Jaffe DB, Yang SP, Enard W, Hellmann I, Lindblad-Toh K, Altheide TK: Initial sequence of the chimpanzee genome and comparison with the human genome. Nature. 2005, 437 (7055): 69-87. 10.1038/nature04072.
Locke DP, Hillier LW, Warren WC, Worley KC, Nazareth LV, Muzny DM, Yang SP, Wang Z, Chinwalla AT, Minx P: Comparative and demographic analysis of orang-utan genomes. Nature. 2011, 469 (7331): 529-533. 10.1038/nature09687.
Adams MD, Kelley JM, Gocayne JD, Dubnick M, Polymeropoulos MH, Xiao H, Merril CR, Wu A, Olde B, Moreno RF: Complementary DNA sequencing: expressed sequence tags and human genome project. Science. 1991, 252 (5013): 1651-1656. 10.1126/science.2047873.
Adams MD, Kerlavage AR, Fleischmann RD, Fuldner RA, Bult CJ, Lee NH, Kirkness EF, Weinstock KG, Gocayne JD, White O: Initial assessment of human gene diversity and expression patterns based upon 83 million nucleotides of cDNA sequence. Nature. 1995, 377 (6547 Suppl): 3-174.
Garber M, Grabherr MG, Guttman M, Trapnell C: Computational methods for transcriptome annotation and quantification using RNA-seq. Nat Methods. 2011, 8 (6): 469-477. 10.1038/nmeth.1613.
Sayers EW, Barrett T, Benson DA, Bolton E, Bryant SH, Canese K, Chetvernin V, Church DM, DiCuccio M, Federhen S: Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 2011, D38-51. 39 Database
Birney E, Stamatoyannopoulos JA, Dutta A, Guigo R, Gingeras TR, Margulies EH, Weng Z, Snyder M, Dermitzakis ET, Thurman RE: Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature. 2007, 447 (7146): 799-816. 10.1038/nature05874.
Clark MB, Amaral PP, Schlesinger FJ, Dinger ME, Taft RJ, Rinn JL, Ponting CP, Stadler PF, Morris KV, Morillon A: The reality of pervasive transcription. PLoS Biol. 2011, 9 (7): e1000625-10.1371/journal.pbio.1000625. discussion e1001102
Ghildiyal M, Zamore PD: Small silencing RNAs: an expanding universe. Nat Rev Genet. 2009, 10 (2): 94-108. 10.1038/nrg2504.
Lee RC, Feinbaum RL, Ambros V: The C. elegans heterochronic gene lin-4 encodes small RNAs with antisense complementarity to lin-14. Cell. 1993, 75 (5): 843-854. 10.1016/0092-8674(93)90529-Y.
Lim LP, Lau NC, Garrett-Engele P, Grimson A, Schelter JM, Castle J, Bartel DP, Linsley PS, Johnson JM: Microarray analysis shows that some microRNAs downregulate large numbers of target mRNAs. Nature. 2005, 433 (7027): 769-773. 10.1038/nature03315.
Wightman B, Ha I, Ruvkun G: Posttranscriptional regulation of the heterochronic gene lin-14 by lin-4 mediates temporal pattern formation in C. elegans. Cell. 1993, 75 (5): 855-862. 10.1016/0092-8674(93)90530-4.
Griffiths-Jones S, Saini HK, van Dongen S, Enright AJ: miRBase: tools for microRNA genomics. Nucleic Acids Res. 2008, D154-158. 36 Database
Kozomara A, Griffiths-Jones S: miRBase: integrating microRNA annotation and deep-sequencing data. Nucleic Acids Res. 2011, D152-157. 39 Database
Berezikov E, Cuppen E, Plasterk RH: Approaches to microRNA discovery. Nat Genet. 2006, 38 (Suppl): S2-7.
Berezikov E, Guryev V, van de Belt J, Wienholds E, Plasterk RH, Cuppen E: Phylogenetic shadowing and computational identification of human microRNA genes. Cell. 2005, 120 (1): 21-24. 10.1016/j.cell.2004.12.031.
Yue J, Sheng Y, Orwig KE: Identification of novel homologous microRNA genes in the rhesus macaque genome. BMC Genomics. 2008, 9: 8-10.1186/1471-2164-9-8.
Baev V, Daskalova E, Minkov I: Computational identification of novel microRNA homologs in the chimpanzee genome. Comput Biol Chem. 2009, 33 (1): 62-70. 10.1016/j.compbiolchem.2008.07.024.
Brameier M: Genome-wide comparative analysis of microRNAs in three non-human primates. BMC Res Notes. 2011, 3: 64-
Washietl S: Sequence and structure analysis of noncoding RNAs. Methods Mol Biol. 2010, 609 (2010): 285-306.
Berezikov E, Thuemmler F, van Laake LW, Kondova I, Bontrop R, Cuppen E, Plasterk RH: Diversity of microRNAs in human and chimpanzee brain. Nat Genet. 2006, 38 (12): 1375-1377. 10.1038/ng1914.
Somel M, Guo S, Fu N, Yan Z, Hu HY, Xu Y, Yuan Y, Ning Z, Hu Y, Menzel C: MicroRNA, mRNA, and protein expression link development and aging in human and macaque brain. Genome Res. 2010, 20 (9): 1207-1218. 10.1101/gr.106849.110.
Friedlander MR, Chen W, Adamidi C, Maaskola J, Einspanier R, Knespel S, Rajewsky N: Discovering microRNAs from deep sequencing data using miRDeep. Nat Biotechnol. 2008, 26 (4): 407-415. 10.1038/nbt1394.
Friedlander MR, Mackowiak SD, Li N, Chen W, Rajewsky N: miRDeep2 accurately identifies known and hundreds of novel microRNA genes in seven animal clades. Nucleic Acids Res. 2011
Landgraf P, Rusu M, Sheridan R, Sewer A, Iovino N, Aravin A, Pfeffer S, Rice A, Kamphorst AO, Landthaler M: A mammalian microRNA expression atlas based on small RNA library sequencing. Cell. 2007, 129 (7): 1401-1414. 10.1016/j.cell.2007.04.040.
Dannemann M, Prufer K, Lizano E, Nickel B, Burbano HA, Kelso J: Transcription factors are targeted by differentially expressed miRNAs in primates. 10.1093/gbe/evs033.
Winter J, Jung S, Keller S, Gregory RI, Diederichs S: Many roads to maturity: microRNA biogenesis pathways and their regulation. Nat Cell Biol. 2009, 11 (3): 228-234. 10.1038/ncb0309-228.
Gibbs RA, Rogers J, Katze MG, Bumgarner R, Weinstock GM, Mardis ER, Remington KA, Strausberg RL, Venter JC, Wilson RK: Evolutionary and biomedical insights from the rhesus macaque genome. Science. 2007, 316 (5822): 222-234.
Kircher M, Stenzel U, Kelso J: Improved base calling for the Illumina Genome Analyzer using machine learning strategies. Genome Biol. 2009, 10 (8): R83-10.1186/gb-2009-10-8-r83.
Prufer K, Stenzel U, Dannemann M, Green RE, Lachmann M, Kelso J: PatMaN: rapid alignment of short sequences to large databases. Bioinformatics. 2008, 24 (13): 1530-1531. 10.1093/bioinformatics/btn223.
Hinrichs AS, Karolchik D, Baertsch R, Barber GP, Bejerano G, Clawson H, Diekhans M, Furey TS, Harte RA, Hsu F: The UCSC Genome Browser Database: update 2006. Nucleic Acids Res. 2006, D590-598. 34 Database
Quinlan AR, Hall IM: BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 2010, 26 (6): 841-842. 10.1093/bioinformatics/btq033.
Edgar RC: MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 2004, 32 (5): 1792-1797. 10.1093/nar/gkh340.
Grant BJ, Rodrigues AP, ElSawy KM, McCammon JA, Caves LS: Bio3d: an R package for the comparative analysis of protein structures. Bioinformatics. 2006, 22 (21): 2695-2696. 10.1093/bioinformatics/btl461.
Hofacker IL: Vienna RNA secondary structure server. Nucleic Acids Res. 2003, 31 (13): 3429-3431. 10.1093/nar/gkg599.
We would like to thank Thomas Giger for the dissection of the frozen tissues; Ines Drinnenberg, Matthias Meyer and the Sequencing Group of the MPI-EVA for coordinating sequencing runs; Martin Kircher for technical assistance with sequencing runs processing; Marike Schreiber for assistance with the figure preparation. The project was founded by a grant of the Max Planck Society.
MD: Conceived and designed the experiments, analyzed the data, contributed reagents/materials/analysis tools, wrote the paper. BN: Performed the experiments. EL: Performed the experiments. HAB: Conceived and designed the experiments, analyzed the data, contributed reagents/materials/analysis tools, wrote the paper. JK: Conceived and designed the experiments, contributed reagents/materials/analysis tools, wrote the paper. All authors read and approved the final manuscript.
Hernán A Burbano and Janet Kelso contributed equally to this work.
About this article
Cite this article
Dannemann, M., Nickel, B., Lizano, E. et al. Annotation of primate miRNAs by high throughput sequencing of small RNA libraries. BMC Genomics 13, 116 (2012). https://doi.org/10.1186/1471-2164-13-116
- miRNA Gene
- Minimum Free Energy
- Mature Sequence
- miRNA Biogenesis
- Orthologous Region