Annotation of primate miRNAs by high throughput sequencing of small RNA libraries

Background In addition to genome sequencing, accurate functional annotation of genomes is required in order to carry out comparative and evolutionary analyses between species. Among primates, the human genome is the most extensively annotated. Human miRNA gene annotation is based on multiple lines of evidence including evidence for expression as well as prediction of the characteristic hairpin structure. In contrast, most miRNA genes in non-human primates are annotated based on homology without any expression evidence. We have sequenced small-RNA libraries from chimpanzee, gorilla, orangutan and rhesus macaque from multiple individuals and tissues. Using patterns of miRNA expression in conjunction with a model of miRNA biogenesis we used these high-throughput sequencing data to identify novel miRNAs in non-human primates. Results We predicted 47 new miRNAs in chimpanzee, 240 in gorilla, 55 in orangutan and 47 in rhesus macaque. The algorithm we used was able to predict 64% of the previously known miRNAs in chimpanzee, 94% in gorilla, 61% in orangutan and 71% in rhesus macaque. We therefore added evidence for expression in between one and five tissues to miRNAs that were previously annotated based only on homology to human miRNAs. We increased from 60 to 175 the number miRNAs that are located in orthologous regions in humans and the four non-human primate species studied here. Conclusions In this study we provide expression evidence for homology-based annotated miRNAs and predict de novo miRNAs in four non-human primate species. We increased the number of annotated miRNA genes and provided evidence for their expression in four non-human primates. Similar approaches using different individuals and tissues would improve annotation in non-human primates and allow for further comparative studies in the future.


Background
From a comparative genomics standpoint the great apes are among the most studied groups of organisms [1]. Since the completion of human genome sequencing in 2001 [2,3] the genomes of all species belonging to this family have been or are being sequenced [4,5]. Although only the human reference genome is considered of finished quality [2,3], it is possible to compare and also use these genomes sequences as references for the alignment of reads generated in sequencing and gene expression studies. In addition to determine the DNA sequence of a genome, it is of particular importance to attach biological information to it e.g. determine the location and structure of protein-coding genes. Gene annotation is carried out both computationally and experimentally by sequencing cDNA e.g. traditionally using expressed sequence tags (ESTs) [6,7] and more recently RNA-seq [8]. Human EST resources are also more abundant than their non-human counterparts and therefore human gene annotation is also the most accurate among great apes [9]. While the majority of efforts have focused on the annotation of protein-coding genes, the discovery of large-scale transcription outside of protein-coding genes [10,11] has led to the identification of a great diversity of non-protein-coding RNA genes [12]. Among these are the microRNAs (miRNAs) which are short (~22 bp) RNA molecules [13] that post-transcriptionally down-regulate protein-coding gene expression [14,15]. The official repository of miRNAs miRBase (v.17) [16,17] contains 1,424 human miRNA, whereas fewer miRNAs are annotated in other primate genomes (chimpanzee: 600; bonobo: 88; gorilla: 85; orangutan: 581; rhesus macaque: 479), a fact that is explained by the larger number of human studies.
MiRNAs have been annotated in humans using a mixture of bioinformatics prediction and cDNA sequencing [18]. The identification of miRNAs in non-human primates has made use of a number of comparative methodologies such as sequence homology between closely related organisms [19][20][21][22], the genomic search for RNA secondary structure patterns characteristic of miRNAs [23] and by direct sequencing of small RNA libraries [24,25]. However, direct characterization of small RNA libraries by high throughput sequencing has been performed for a limited number of tissues in only chimpanzees and rhesus macaques [24,25]. As a result the majority of non-human primate miRNAs in miRBase have no evidence for their expression and their existence is only supported by computational prediction. In the present study we sequenced small RNA libraries from multiple chimpanzee, gorilla, orangutan and rhesus macaque individuals and tissues using the Illumina high throughput sequencing platform. We applied an algorithm (miRDeep) that uses sequencing reads in conjunction with a model of miRNA biogenesis to predict miRNAs with high accuracy [26,27].

MiRNA prediction
We used the program miRDeep2 [27] to predict miR-NAs from sequenced small RNAs. miRDeep2 takes as input the position and frequency of reads aligned to the genome ("signature") with respect to a putative RNA hairpin and scores the miRNA candidate employing a probabilistic model based on miRNA biogenesis [26]. The score produced by miRDeep takes into account the energetic stability of the putative hairpin and the compatibility of the observed read distribution with miRNA cleavage [26]. The more positive the score the more reliable the prediction. Additionally, miRDeep2 calculates false-positive rates by running the algorithm on a set of "signatures" and secondary structures that are paired by random permutation. Using predictions with a positive score and a significant folding p-value we identified from our sequences 47 (22 with expression evidence for star sequence) new miR-NAs in chimpanzee, 240 (166 with expression evidence for star sequence) in gorilla, 55 (13 with expression evidence for star sequence) in orangutan and 47 (24 with expression evidence for star sequence) in rhesus macaque. miRDeep2 was able to predict 338 (64% of all annotated) known miRNAs (312 with a positive score) in chimpanzee, 75 (94% of all annotated, 73 with a positive score) in gorilla, 364 (61% of all annotated, 325 with a positive score) in orangutan and 348 (71% of all annotated, 312 with a positive score) in rhesus macaque (Figure 1). miRDeep2 performance statistics were similar to the ones reported in other species [27] (Figure 1).
MiRNAs show high expression conservation between species, and tissue-specific expression patterns [28,29]. In testis we found a lower fraction of the total reads align to miRNAs (Table 1) as a result of the expression of an additional class of small-RNAs in this tissue -piR-NAs [29]. We were able to identify 11 tissue-specific miRNAs in chimpanzee (7 in brain, 1 in heart, 2 in kidney, 1 in testis), 110 in gorilla (100 in brain, 10 in liver), 28 in orangutan (25 in brain, 3 in liver) and 21 in rhesus macaque (11 in brain, 10 in testis).
To identify miRNAs which are shared between all the primates studied here we examined miRNAs that are encoded in orthologous locations in all four primate species and in human. For the miRNAs present in miR-Base (v.17) we found 60 miRNAs that are located in orthologous regions in human and the four non-human primate species. When we included the set of miRNAs predicted in this study we increased this number to 175 miRNAs. This set of miRNAs can be considered prediction of high confidence since they were known in human and either known or predicted by us in all other four primate species.

Sequence identity
All 60 of the known miRNAs present in all four species and human showed a high sequence identity i.e. the sequence is completely identical between the mature sequences for all of them. Using the set of 175 miRNAs we were able to reconstruct the expected phylogenetic relationships between the species studied for both the hairpin and the mature sequence. A principle component analysis on the sequence identity between hairpin sequences ( Figure 2) shows a close relationship between chimpanzee and gorilla while both species are distant from orangutan and even more afar to rhesus macaque.

Secondary structure
For some stages during their biogenesis miRNAs form a secondary structure that resembles a hairpin [30]. Since the endonuclease that processes miRNAs recognizes them based on their three-dimensional structure [30], the stability of the secondary structure can be considered a proxy for miRNA functionality and therefore for the reliability of miRNAs predictions. We used the minimum free energy (MFE) as a measure of structure stability. We found that the hairpins of predicted miRNAs are as stable as hairpins from known miRNAs, which is not unexpected given that the score calculated by miRDeep2 takes into account the stability of the miRNA hairpin secondary structure.

Discussion
Although the genomes of multiple non-human primates have been sequenced, the functional annotation of the human genome remains the most complete among primates. This is the case for miRNAs annotated in miR-Base, where the number of human miRNAs is double than miRNAs annotated in chimpanzee (the second-best annotated genome) [16,17]. In the present study we sequenced small RNA libraries from multiple individuals and tissues in four non-human primates in order to identify from expression data new miRNA genes. We identified these new miRNAs using miRDeep2 [27], which uses a model for miRNA precursor processing by Dicer to score miRNA predictions. Using this approach we predicted 47 new miRNAs in chimpanzee, 240 in gorilla, 55 in orangutan and 47 in rhesus macaque (Figure 1). We found that the secondary structures from our new miRNAs were as stable as miRNAs previously described in miRBase.
A similar number of new miRNAs were identified in chimpanzee, orangutan and rhesus macaque, whereas the number of new miRNA predictions in gorilla was much higher. While the genomes of the chimpanzee, orangutan and rhesus have been available for some time, and a number of miRNA studies in these species published, the gorilla genome has not yet been published and fully annotated [4,5,31], and no published description of miRNAs in gorilla -a requirement for inclusion of new miRNAs in miRBase -exists The majority of annotated miRNAs in the non-human primates are based on homology with human miRNAs [20][21][22]. However, the presence of a given locus in a genome is not a guarantee of its expression. We have, in this study, provided evidence of expression for 51% of the homology-based annotated miRNAs in gorilla, 49% in chimpanzee and 60% in rhesus macaque. We increased from 60 to 175 the number of miRNAs, which are located in orthologous regions in the four nonhuman primate genomes studied here and in human. This is a set of high confidence miRNAs based on homology, expression and miRNA biogenesis signatures.
In addition to the analysis of expression and folding, miRDeep incorporates a model of miRNA biogenesis, which makes its predictions more accurate than other software [27]. While the sequencing of small RNA libraries is now technically feasible, the accurate identification of novel miRNAs remains challenging. A pioneer study in primates sequenced small RNAs libraries from human and chimpanzee brains [24]. They predicted a large number (268 in human and 257 in chimpanzee) of new miRNAs in both species based on small RNA    Column 1: individual information; column 2: tissue; column 3: fraction of reads that could be mapped perfectly to species corresponding genome; columns 4-6 are based on the reads that could be mapped to the corresponding species genome and contain how many of these reads could be aligned to known miRNAs (column 4), newly predicted miRNAs (column 5) and to neither of these 2 categories (column 6); column 7: total number of sequenced reads. sequencing. Only few of these miRNAs have been included in miRBase, the public, curated repository for miRNAs (49 in human and 19 in chimpanzee). It is important to identify novel miRNAs accurately, and therefore particularly important to take into account the effect of genome quality and completeness on the ability to determine whether particular miRNAs are speciesspecific In primate comparisons the higher quality and completeness of the human genome means that miR-NAs are frequently described as human-specific when in fact they are simply missed in related primate genomes due to sequence quality issues. We sought to identify miRNAs that are expressed in tissue-specific manner. For species where we had samples from five tissues (chimpanzee and rhesus) we could say with more confidence that a given miRNA is tissuespecific than for the species where we had only two tissues (orangutan and gorilla). Brain was the tissue with both more miRNAs in total, and more tissue-specific miRNAs both in chimpanzee and marginally in rhesus. In orangutan and gorilla we could only identify miRNAs that are expressed mutually exclusively in either liver or brain. We found more miRNAs expressed exclusively in brain than in liver. This is in agreement with the fact that the miRNA repertoire in humans, chimpanzees and rhesus macaques is more diverse in brain compared to other tissues [29].

Conclusion
We have sequenced small RNA libraries from multiple individuals and tissues from chimpanzee, gorilla, orangutan and rhesus macaque. We identified known miR-NAs and used miRDeep2 to predict de novo microRNAs in these four primate species. Our new expression-based predictions increased the number of known miRNAs in all four species. In addition, we showed the first expression evidence for miRNAs that were previously only annotated by sequence homology with humans. Accurate annotation of miRNAs in multiple primate species provides a fundamental to carry out evolutionary, comparative and functional studies of miRNAs.

MiRNA samples
We sequenced 56 small RNA libraries (24 from chimpanzees, 24 from rhesus macaques, four from orangutan and four from gorilla). The chimpanzee and rhesus macaque samples have been published [29]. We added to this set eight samples from orangutan and gorilla (four liver and four brain samples from each species). All the individuals used in this study were adults and suffered sudden death that did not involve the tissues sampled. A description of the samples is available in Table 1.

Library preparation and sequencing
We used the individuals presented in [29] including 24 chimpanzee and rhesus macaque samples. Additionally, we sequenced four gorilla and four orangutan samples from brain and liver (two from each species and tissue). Total RNA was prepared as described in the Illumina Inc. manual "Small RNA Sample Preparation Guide" (Part # 1004239 Rev. A Illumina Inc. San Diego). Illumina Genome Analyzer I and II sequencing runs were analyzed starting from raw intensities. A detailed summary about the platform each sample was sequenced on, how many cycles and which chemistry was used can be found in Table 2. Base calling and quality score calculation was performed for all runs using the IBIS base caller [32].

Sequence data
MiRNA data was uploaded to the European Nucleotide Archive hosted by the European Bioinformatics Institute with the study accession number ERP000973 and ArrayExpress with accession number E-MTAB-828.

MiRNAs prediction
We used miRDeep2 prediction algorithm [27]. All reads from each species were used for the corresponding predictions. We excluded redundant predictions for the same genomic location and only kept the prediction with the highest score. We used the mapper module (mapper.pl) provided by miRDeep2 with the following parameters: -n -d -c -i -j -l 18 -m -k TCGTATGCCGTCTTCTGCTTG. We ran miRDeep2 with default parameters. Newly predicted miRNAs that were found in orthologous genomic regions in all four     species were submitted to miRBase. Names were assigned by miRBase and are available in Table 3.

Orthology of miRNAs
We identified orthologous regions starting from human hg19-based miRBase (version 17) hairpin locations [16,17]. The genome coordinates were transferred to hg18 coordinates using liftOver [34] with the 95% identity cutoff. Human mature sequences from miRBase were aligned to the human genome (hg18) and their corresponding hairpin sequences were assigned by overlapping genome coordinates using intersectBed from Bedtools [35]. All other primate miRNA mature sequences (known and predicted) were aligned against the corresponding genome and their genome locations were transferred to hg18 coordinates. The mature miRNA sequences found in the other primates that overlapped with human coordinates were defined as orthologous. The corresponding primate hairpin sequence was obtained by transferring the human genome hairpin coordinates to the corresponding primate genome. We excluded regions where liftOver was unable to identify an orthologous region.

Tissue specificity
MiRNAs were defined to be tissue specific when less than 5% of reads map to other tissues. This means that at least 80% of the perfectly aligned reads in chimpanzee and rhesus macaque (where we have reads from 4 tissues), and 95% of the perfectly aligned reads in gorilla and orangutan (where we have reads from 2 tissues) that were used for the prediction of the miRNA came from one tissue.

Sequence comparison
Sequence identity of miRNAs (mature/hairpin) in orthologous regions was computed using the multiple sequence alignment tool MUSCLE [36] and the identity function of the R package bio3d [37].

Secondary structure analysis
We calculated the minimum free energy (MFE) of known and predicted hairpin sequences by using RNAfold algorithm with default parameters [38]. The MFE for each group of annotated/predicted miRNAs was computed by averaging the MFEs.