New methods for next generation sequencing based microRNA expression profiling
© Buermans et al. 2010
Received: 28 May 2010
Accepted: 20 December 2010
Published: 20 December 2010
Skip to main content
© Buermans et al. 2010
Received: 28 May 2010
Accepted: 20 December 2010
Published: 20 December 2010
MicroRNAs are small non-coding RNA transcripts that regulate post-transcriptional gene expression. The millions of short sequence reads generated by next generation sequencing technologies make this technique explicitly suitable for profiling of known and novel microRNAs. A modification to the small-RNA expression kit (SREK, Ambion) library preparation method for the SOLiD sequencing platform is described to generate microRNA sequencing libraries that are compatible with the Illumina Genome Analyzer.
High quality sequencing libraries can successfully be prepared from as little as 100 ng small RNA enriched RNA. An easy to use perl-based analysis pipeline called E-miR was developed to handle the sequencing data in several automated steps including data format conversion, 3' adapter removal, genome alignment and annotation to non-coding RNA transcripts. The sample preparation and E-miR pipeline were used to identify 37 cardiac enriched microRNAs in stage 16 chicken embryos. Isomir expression profiles between the heart and embryo were highly correlated for all miRNAs suggesting that tissue or cell specific miRNA modifications do not occur.
In conclusion, our alternative sample preparation method can successfully be applied to generate high quality miRNA sequencing libraries for the Illumina genome analyzer.
MicroRNAs (miRNA) are non-coding RNA transcripts with average length of 21 nt. They play important roles during post-transcriptional regulation of gene expression in various organisms and tissues. Since their discovery by Lee et al  the cellular miRNA repertoire has expanded rapidly culminating in over 10,000 entries in the September 2009 miRBase database release . Primary miRNA transcripts are predominantly RNA polymerase II derived strands that contain stem loop hairpin structures. These hairpins get excised from the transcripts by Drosha/DGCR8 in the Microprocessor complex and are transported into the cytoplasm as miRNA precursor miRNA (pre-miRNA) transcripts. The RNase III endoribonuclease Dicer subsequently cleaves the pre-miRNA to release short double-stranded miRNA fragments of on average 21 nt in length. It is generally thought that one of the strands is bound to an Ago protein and incorporated into the RNA induced Silencing Complex (RISC) to act as a guide strand, while the other strand is degraded. Selection of the strands is thought to depend on the relative thermodynamic stability of the ends of the duplex. However, for some miRNAs both strands are detected at comparable expression levels. Once incorporated into a RISC complex, the miRNA strand guides the complex by imperfect base pairing to its targets (reviewed by Kim et al ).
Despite their small size, low abundant expression and lack of unifying structural features that allow for selective isolation and/or manipulation, different methods have been developed to measure their expression . A recent addition to the existing miRNA expression profiling techniques is high throughput sequencing. The millions of short sequence reads generated by Next Generation Sequencing (NGS), like the SOLiD (AppliedBiosystems) and Illumina Genome Analyzer, are particularly useful for small RNA transcription profiling. NGS provides miRNA expression profiling at an unprecedented sensitivity and resolution. Compared to available miRNA microarray platforms, the NGS systems are not limited by a predefined number of features, probe design, probe cross hybridization or array background issues. Moreover, the NGS systems directly count the number of transcripts found as a measure for expression abundance, have high multiplexing potential, are species independent, show high sensitivity towards low abundant transcripts and display excellent reproducibility .
Different microRNA sequencing library preparation methods have been described. The Illumina genome analyzer protocols for miRNA expression profiling are based on the Modban  method. In a comparison by Linsen et al , the Small RNA Expression Kit (SREK, Appliedbiosystems/Ambion) showed better between-sample reproducibility compared to both the Modban adaptor (IDT) ligation  and poly(A) tailing  methods. Although all three methods showed specific systemic biases, these were highly reproducible and therefore do not impair relative miRNA quantification. In the present paper, we present a modified version of the SREK, originally designed for the SOLiD system, for generating high quality miRNA sequencing libraries that are compatible with the Illumina Genome Analyzer technology. The SOLiD Total RNA-Seq kit which has replaced the SREK kit, is based upon the same principles for library generation. The modifications described here can therefore be applied to the new kit as well in order to generate smallRNA, and potentially RNA-seq, libraries that are compatible with the Illumina sequencing platform.
NGS for miRNA expression profiling is becoming a more widely used technology in various biological settings, e.g., hESC differentiation into embryoid bodies , leukemia progression , chicken [11–13], pig tissues , cardiac hypertrophy , but also in fundamental miRNA biogenesis studies . Although general guidelines on data processing procedures have recently been proposed , downstream data analysis tools for miRNA data, as for all NGS applications, are still in their infancy. Few applications [18–20] that cover all steps involved in miRNA sequencing data analysis, like adapter sequence removal, genome alignment and transcript annotation, in one single analysis pipeline have been described to date. In order to efficiently exploit the massive amounts of data generated by miRNA sequencing platforms, user friendly and easy to use tools need to be developed that report data in intuitive and comprehensive format. In this paper we describe E-miR, a perl-based miRNA sequence data analysis pipeline that combines all individual data handling steps. We apply both the E-miR pipeline and the sample preparation protocol to define cardiac enriched miRNA expression in chicken embryonic development.
EmiR data processing table
< 15 nt
Only sequence tags uniquely aligned to the genome with a maximum of one nucleotide mismatch were annotated to known non-coding RNA transcripts as annotated in Ensembl. A sequence tag was annotated to a non-coding RNA transcript when there is at least a 50% overlap on genomic positions. In agreement with other miRNA sequencing studies the majority of annotated sequences represented miRNA transcripts, with relatively small contributions from other non-coding RNA species like snoRNA, and tRNAs [9, 10]. A summary listing the annotated RNA species distribution is given in Table 1. Not all sequences mapped to known Chicken non-coding transcripts. Approximately 90% of these had expression levels below 5 tpm.
For the remainder of this manuscript, we did not use the mature/star nomenclature as these suggests that the miRNA transcript annotated as the mature form is bound to the Ago protein within the RISC complex and has more abundant expression than the star, while this is not always true. Moreover, the miRbase and Ensembl databases may not list information on both the mature and star transcripts from one precursor, impairing mature/star annotation. Instead, we chose to name the miRNA transcript according to the 5-prime (5p) or 3-prime (3p) arm of the hairpin they originated from. This nomenclature is already used in cases where two different miRNA dicer products were detected from opposite arms of the the same precursor transcript . In cases where annotation for only one of the 5p or 3p Dicer cleavage products was available from the Ensembl database, the complementary sequence positions were inferred as described in the Methods section. Sequencing data has shown that miRNA precursor may give rise to multiple miRNA Dicer products which vary in length and nucleotide compositions [9, 10]. These length variations impair estimation of genomic positions of the 5p and 3p miRNA transcript pairs that are derived from the precursor hairpin. In order to cope with this, EmiR allows for a minimal 50% overlap on genomics positions of the sequence read and the miRNA loci for annotation of the sequence reads.
One of the features of the SREK kit and its replacement SOLiD Total RNA-Seq kit is that both retain the strandedness of the inserted RNA transcripts in the library. Almost all of the sequencing reads that are annotated to non-coding RNA transcripts are sense, with only 0.017% in the antisense orientation. For a subset of microRNAs, low abundant sequence reads mapping to other Dicer cleavage fragments, like the head of the hairpin structure and the regions immediately flanking the precursors were detected, albeit with low expression levels.
The NGS platforms have a high multiplexing potential. In order to investigate multiplexing performance of the modified library preparation method, the individual sequencing libraries were constructed with a unique 6 nucleotide sequence bar-code positioned in the reverse primer during library pre-amplification. A multiplexed pool consisting of equimolar amounts of the smallRNA derived EM and HT libraries was prepared. After sequencing, the sequence reads were separated on sample origin based on perfect matches of the 6 nt multiplex tags. The multiplexed libraries showed highly similar data processing and transcript annotation characteristics compared to the the single runs as is evident from Table 1 and the scatter plot in Figure 3B.
For reasons unknown, gga-mir-92-3p expression levels deviated from expected in all multiplex libraries. Excluding these transcripts further increased Pearson correlations coefficients up to 0.995. For virtually all miRNA transcripts detected, multiple sequence variants, i.e., isomirs, could be observed in the data, reaching up to 1002 unique sequence variants for the 'miRNA|ENSGALT00000042432|5p||sense' transcript in the HT and EM samples. The number of unique isomirs for each miRNA per sample is listed in the E-miR expression output table. Different summation methods to calculate the expression level for each miRNA transcript from the individual isomir complement have been proposed. In agreement with previous reports, the sum of individual reads, the most abundant isomir read and the sum of isomirs aligned uniquely without mismatches, were highly correlated [9, 10]. Figure 3C shows a heatmap clustering based on Pearson correlation coefficients for the miRNA expression levels calculated from the sum of individual reads, the most abundant read and the sum of isomirs aligned uniquely without mismatches for both the multiplex and single sequencing runs used in this study. A clear separation between the heart tube and embryos is evident with a sub division between the Embryo libraries derived from small RNA enriched and total RNA. The E-miR pipeline reports all three expression values per transcript. For the remainder of this manuscript the sum of all isomirs per miRNA will be referred to as the expression of a specific miRNA transcript. In conclusion, our miRNA sequencing library preparation methods allows for reliable sample multiplexing.
For the whole embryo, both total RNA and smallRNA enriched RNA fractions, originating from the same sample homogenate, were used to generate small RNA sequencing libraries. Overall sequence alignment characteristics were similar between the two RNA types. However, they did show a striking difference in the total number of annotated sequences, i.e., approximately 45 and 23% for the smallRNA enriched fraction and totalRNA based libraries, respectively (Table 1). MicroRNA transcripts were the main component of this difference, representing approximately 50 and 73% of sequences for the totalRNA and smallRNA libraries, respectively. Also, in the small RNA derived sample, more miRNA transcripts were expressed above 5 tpm (Figure 3A) and in a Limma analysis comparing the small and total RNA derived Embryo samples, 6 miRNAs were shown to have significantly higher expression levels for the small RNA derived libraries (Additional file 3 Figure S2). These data indicate the choice of RNA input affects the miRNA expression profile.
Cardiac enriched miRNAs
miRNA | ENSGALT00000028994 | 5p~|gga-mir-133a-1 | sense #
chr2:105670407-105670428 | -
0 ± 0
9 ± 1.3
miRNA | ENSGALT00000028999 | 3p | gga-mir-1a-1 | sense
chr20:8107876-8107896 | -
0 ± 0
23 ± 6.9
miRNA | ENSGALT00000029007 | 5p~ | gga-mir-133c | sense #
chr23:4664062-4664083 | +
0 ± 0
8 ± 2.9
miRNA | ENSGALT00000035316 | 5p | gga-mir-499 | sense #
chr20:2599386-2599408 | -
2.8 ± 1.3
2542.5 ± 574
miRNA | ENSGALT00000035316 | 3p~ | gga-mir-499 | sense #
chr20:2599348-2599370 | -
15.5 ± 1.1
3048.4 ± 318.2
miRNA | ENSGALT00000029007 | 3p | gga-mir-133c | sense #
chr23:4664099-4664120 | +
2.8 ± 1.3
496.9 ± 44.9
miRNA | ENSGALT00000029006 | 5p~ | gga-mir-1b | sense
chr23:4663916-4663936 | +
1.7 ± 1.1
142.9 ± 22.7
miRNA | ENSGALT00000028999 | 5p~ | gga-mir-1a-1 | sense
chr20:8107838-8107858 | +
35.4 ± 20.4
2366.1 ± 777.2
miRNA | ENSGALT00000029006 | 3p | gga-mir-1b | sense
chr23:4663953-4663973 | +
91.1 ± 24.2
6086.5 ± 1328.3
miRNA | ENSGALT00000042439 | 5p || sense
chr4:2151195-2151215 | +
1.4 ± 2
32.5 ± 0.4
miRNA | ENSGALT00000042439 | 3p || sense #
chr4:2151238-2151260 | +
30.2 ± 6.1
1277.3 ± 123.2
miRNA | ENSGALT00000035276 | 3p | gga-mir-490 | sense
chr1:59948716-59948737 | -
45.3 ± 12.6
1447.5 ± 231.3
miRNA | ENSGALT00000042468 | 5p | gga-mir-1773 | sense
chr20:8109147-8109169 | +
1.1 ± 0.2
25.5 ± 3.8
miRNA | ENSGALT00000042438 | 3p | gga-mir-1799 | sense
chr5:42365992-42366012 | +
8.6 ± 6.9
130.5 ± 0.5
miRNA | ENSGALT00000035276 | 5p~ | gga-mir-490 | sense
chr1:59948755-59948776 | -
13.8 ± 2.1
228.3 ± 47.4
miRNA | ENSGALT00000042432 | 3p~ | sense
chr1:104486649-104486675 | -
8.2 ± 4.6
84.6 ± 5.7
miRNA | ENSGALT00000029028 | 3p | gga-mir-133b | sense #
chr3:110384948-110384968 | -
93 ± 21.6
866.5 ± 52.1
miRNA | ENSGALT00000042246 | 3p~ | gga-mir-1731 | sense
chr12:10938277-10938299 | -
1.7 ± 1.1
15 ± 1.6
miRNA | ENSGALT00000042411 | 5p | gga-mir-1434 | sense
chr28:1055205-1055224 | +
26.5 ± 11.3
204.5 ± 18.6
miRNA | ENSGALT00000035271 | 3p | gga-mir-193b | sense
chr14:759503-759526 | +
31.8 ± 0.3
213.2 ± 26
miRNA | ENSGALT00000042289 | 5p | gga-mir-1747 | sense
chr2:62758752-62758773 | -
2.1 ± 1.2
12.5 ± 0.6
miRNA | ENSGALT00000028974 | 5p | gga-mir-146a | sense
chr13:7555655-7555676 | -
10.6 ± 4.2
60.6 ± 8.3
miRNA | ENSGALT00000028997 | 5p | gga-mir-30d | sense #
chr2:148337300-148337321 | -
1452 ± 39.6
7275.3 ± 46.5
miRNA | ENSGALT00000042432 | 5p || sense
chr1:104486710-104486736 | -
915.9 ± 102.7
3702.3 ± 152.3
miRNA | ENSGALT00000028990 | 3p~ | gga-mir-138-1 | sense
chr2:40745167-40745183 | -
9.4 ± 2.4
33.5 ± 1
miRNA | ENSGALT00000028998 | 5p | gga-mir-30b | sense
chr2:148331648-148331669 | -
461.2 ± 63
1638.4 ± 129.7
miRNA | ENSGALT00000029030 | 3p | gga-mir-223 | sense
chr4:233007-233027 | +
2759.8 ± 88
9608.1 ± 564.1
miRNA | ENSGALT00000042472 | 5p | gga-mir-1677 | sense #
chr3:76659772-76659792 | +
219.8 ± 31.9
725.3 ± 80.2
miRNA | ENSGALT00000028951 | 5p | gga-mir-125b | sense #
chr1:102457663-102457684 | +
5255.7 ± 209.1
15151.5 ± 1241
miRNA | ENSGALT00000035272 | 3p | gga-mir-181a-1 | sense
chr8:2001620-2001641 | +
154.6 ± 49.6
424.6 ± 8.1
miRNA | ENSGALT00000029026 | 5p | gga-mir-30a | sense #
chr3:85102244-85102265 | +
898.4 ± 20.4
2238.4 ± 52.7
miRNA | ENSGALT00000035333 | 3p | gga-mir-144 | sense
chr19:5824134-5824155 | -
79.4 ± 4.2
187.5 ± 8.9
miRNA | ENSGALT00000029015 | 5p | gga-let-7k | sense
chr26:1442955-1442976 | -
11.8 ± 1
25.5 ± 3.8
miRNA | ENSGALT00000028983 | 5p | gga-mir-126 | sense #
chr17:8431792-8431812 | -
1310.7 ± 123.6
2561 ± 161.5
miRNA | ENSGALT00000029008 | 5p | gga-mir-30e | sense
chr23:5248432-5248450 | +
1382.7 _ 30.5
2630.2 ± 3.1
miRNA | ENSGALT00000042329 | 3p~ | gga-mir-1811 | sense
chr4:58651749-58651768 | +
60.9 _ 6
105.5 ± 4
miRNA | ENSGALT00000028959 | 3p~ | gga-mir-18a | sense
chr1:152248637-152248658 | -
49.1 _ 0.3
82.5 ± 2.8
To gain more insight into the expression levels of the significantly contributing isomirs, scatter plots were generated that show the individual isomir expression levels for the heart and embryo. Figure 6B shows similar patterns of differential expression for all isomirs related to gga-mir-499-3p, gga-mir-125b-5p, gga-mir-20b-5p and gga-mir-219-3p. These high correlations between isomir expression levels were also observed for the not significantly regulated miRNAs (data not shown). The observation that all isomirs are regulated similarly suggests that there is no tissue specific editing of miRNA transcripts.
In conclusion, the Globaltest can be applied to test for differential transcript expression, while taking the specific isomir complement of the miRNA into account. However, given the highly correlated isomir expression profiles, using either the most abundant transcript or the sum of transcripts as a measure of the miRNA abundance, would essentially lead to similar results with a Limma analysis. The highly correlated isomir expression between the heart and embryo suggest that tissue or cell specific miRNA modifications do not occur.
The library preparation methodology we describe is an adaptation on the SREK kit, which was initially designed to prepare miSeq libraries for the SOLiD Next-Generation Sequencing system (Appliedbiosystems), to make it compatible with the Illumina Genome Analyzer technology. Compared to the Modban method on which the standard Illumina library generation protocol is based, the SREK kit was shown to have more preferable reproducibility characteristics . The main modification to the protocol is usage of a set of alternative library amplification primers which contain sequence features compatible with both the adapter sequences that were introduced during the initial adapter ligation as well as the oligonucleotides deposited on the Illumina Genome Analyzer flow cell (Figure 1).
Our libraries were prepared from 500 ng total RNA or 120 ng small enriched RNA fractions. A mere 1 μ L of the ligated and reverse transcribed cDNA pool is used during library pre-amplification in a 100 μ L reaction volume, which generally yields more than enough library for several sequencing runs. This means that in principle an equivalent of 3 ng of small enriched RNA is sufficient to successfully prepare miSeq libraries with this method. However, no attempts were made to purify and concentrate the cDNA pool in order to significantly reduce the input RNA quantity to these lower ng levels.
Our results point to using the following recommendations for generating high quality miRNA sequencing libraries. 1) Although both small and total RNA can be used, miRNA enriched small RNA fractions libraries showed increased sensitivity towards miRNA expression detection. 2) 100 ng smallRNA enriched RNA. Other miRNA libraries have successfully been generated in our laboratory using as little as 25-50 ng RNA. However, to yield enough library, several parallel pre-amplification reactions may be be needed. 3) The short amplification primers allow for increased selectivity and sensitivity towards miRNA transcripts. Needless to say, identical conditions for all processed samples are recommended.
To validate our sample preparation approach, we prepared libraries from chicken HH16 heart tubes and whole embryos without heart tube. Chicken have been used as a model for cardiac developmental biology for many years mainly due to the fact that the heart initially develops outside chest cavity and cardiac development can be precisely timed . Moreover, for some miRNA transcripts cardiac enriched or cardiac restricted patterns during chicken (cardiovascular) embryonic development have previously been described , making this an excellent model to validate our approach.
Our statistical analysis of the expression data yielded a total of 37 significantly cardiac enriched miRNA transcripts, containing all of the previously described characteristic cardiac enriched microRNA transcripts, like miR499 and the miR1 and 133 family. Cross referencing our cardiac enriched miRNA transcripts with previously published mouse and human sequencing data generated by Landgraf et al  showed overlap for miR144-3p, miR126-5p and 146a-5p. However, overlap between our data and zebra-fish cardiac enriched transcripts was limited to the miR499 and the miR133 family , while the Zebra fish cardiac enriched miR221 and miR130b miRNA transcripts showed significantly lower expression in our chicken heart data compared to embryo expression levels. Differences in species and developmental stages are likely to underlie these apparent discrepancies.
A subset of the differentially expressed miRNA transcripts were confirmed with qPCR on a set of biologically independent samples. These included not only the characteristic cardiac miRNAs 133 and 499, but also novel cardiac enriched transcripts miR1677 and ENSGALT0000042439-3p. Differential expression for miR 1799-3p with qPCR could not be confirmed using qPCR since no reliable expression signal could be detected. Given that this transcript had low miSeq expression levels, i.e., approximately 130 tpm, this implies insufficient sensitivity of realtime PCR techniques. Although differences in the absolute expression levels did not perfectly correlate between methodologies, the relative expression patterns did agree as has also been described by Linsen et al. . In conclusion, the chicken cardiac enriched sequencing data generated by our alternative library preparation approach yields biologically valid data.
Essential to the analysis of miRNA sequencing data is removing any adapter sequences that were introduced during the library preparation as these impair proper genomic alignment. In the E-miR pipeline, the user is required to provide this sequence and state if single nucleotide mismatches are allowed during adapter identification. This stretch of adapter sequence needs to be chosen carefully since matches to the first 4-6 nt of the adapter sequence may occur by chance within the RNA insert and cause aberrant truncation of the sequence. Allowing for mismatches in this step augments this even more, leading to deviating expression profiles. In our experience, using the first 8 nucleotides of the 3' adapter sequence with allowing for one mismatch does not affect the expression profiles. The SeqBuster  miRNA analysis pipeline also includes a step to remove 3' adapter sequences from the sequences. For the same dataset, the E-miR pipeline removed the 3' adapter sequences approximately 2-3× faster. More importantly, after inspection of the truncated sequence output, E-miR appeared to be more accurate in identifying and removing of these sequence features. Another advantage of the E-miR over SeqBuster is that multiple data files can be processed simultaneously.
With the increasing amount of data generated by the different NGS systems, there is a need for faster sequence alignment tools. Many, short read aligners have been described, all of which have their specific characteristics . All of the next generation short-read alignment tools available like Eland (part of the standard Illumina analysis pipeline), Rmap  and Bowtie , outperform Blast  on speed. However, these increased alignment speeds affect the quality of the alignments . Unlike Eland, the Rmap and Bowtie aligners are not limited to a predefined 32 nt sequence length. Nevertheless, since microRNAs have an average length of approximately 21 nt, this 32 nt sequence length limitation present in Eland does not pose a bottleneck for microRNA transcript alignment purposes. However, Eland but not Bowtie, is a commercial aligner. To meet the demands for faster alignments and provide more flexibility in alignment settings a Bowtie E-miR version of the pipeline is also available. The Bowtie version did the same job as the Eland version in just 39 min, mainly by reducing the alignment time from 1 hour and 43 min to under 2 min.
MiRNA isomirs may be derived by a combination of variable Dicer cleavage points, nucleotide additions and RNA editing. Nucleotide variations to the reference miRNA transcript sequence may have altered target specificity and thereby modulate different biological processes to further fine tune post-transcriptional gene expression. For this to manifest, the isomir should be differential expressed between conditions, and, the expression should not correlate with the overall differential isomir expression pattern that is observed between the groups. Although we identified the specific subset of isomirs that significantly contributed to the differences in expression of each differentially expressed miRNA by exploiting the ability of the Globaltest , uniform differential expression was observed for all isomir transcripts between the heart and embryo. This was also seen for the set of not significantly differentially expressed miRNAs. Similar results were obtained when applying the crossmapping correction method as proposed by de Hoon et al  in order to handle sequence reads that map to multiple loci. These observations suggest that the mechanisms by which the isomir complement is generated may not be specific to the cell type or differentiation state of the tissue under investigation but rather to represent a general non-tissue dependent system and hence isomir tissue specific post-transcriptional effects not to occur. Further research is needed to ascertain whether the absence of non-uniform isomir expression is a general phenomenon or confined to our sample collection.
To date, only the SeqBuster analysis pipeline describes a method for analyzing isomir complement across and within samples . In the SeqBuster approach, first a subset of isomirs of interest is selected based on common features, e.g., nucleotide modifications. These subsets are subsequently tested for differential expression across the conditions. We chose to take the opposite approach, i.e., the Globaltest first provides insight into which miRNAs are differentially expressed across the groups, based on their isomir complement. Only then are those isomirs identified that significantly contribute to this difference. This is a more unbiased and less hypothesis driven.
The data presented in this study were generated using older generation Illumina reagents and flowcells. Using up to date consumables and basecalling algorithms, over 30 million reads per flowcell can be easily be reached. Assuming that approximately 5 million sequencing reads are needed to profile the microRNA transcriptome at sufficient depth, up to six libraries may be pooled and analysed in a single flowcell lane without loss of resolution. The number of sequencing reads is expected to increase in the the near future, making multiplexing of libraries an attractive approach for cost effective smallRNA sequencing. The barcoding system used for the preparation of the libraries in this study were identical to those described in the SREK protocol manual, but an alternative barcoding schemes can easily be implemented. Here we used only the perfect matching barcode reads to sort the sequencing reads to according to their sample identity, leaving a subset of reads un-assigned (data not shown). A more sophisticated method for multiplex barcode sorting, like the Levenshtein distance , that takes into account the potential single-base errors that may be introduced during PCR pre-amplification or primer synthesis can be applied in order to decrease the number of un-assigned reads.
In this paper we present an adaptation to the SREK protocol that reliably generates miRNA sequencing libraries that are compatible with the Illumina Genome Analyzer. Furthermore, we describe a new analysis tool for analysis of next generation miRNA transcription profiling called E-miR. It allows for automated and fast processing of expression data and reports the most commonly used results to the user in a comprehensive expression table and data visualization track files for the UCSC genome browser. Although successfully tested on chicken data, the pipeline can also be used to analyze miRNA data from any species. The main and accessory perl scripts are available via: http://www.lgtc.nl/EmiR.
Fertilized chicken eggs were obtained from a local hatchery (Drost BV, Nieuw Loosdrecht, The Netherlands), incubated at 39°C in a moist atmosphere, and automatically turned every hour. After the appropriate incubation time, embryos were isolated in Earl's balanced salt solution (EBBS, Life Technologies) and staged according to Hamburger and Hamilton . The complete heart tube (HT) was dissected from stage HH16 embryos (EM). Tissue samples were stored at -80°C prior to use. Total RNA and miRNA enriched RNA fractions were isolated with the mirVana ™ miRNA Isolation Kit (Ambion) according to the manufacturers protocol. RNA integrity was confirmed with Agilent 2100 Bioanalyzer pico-RNA and small-RNA chips.
Sequencing libraries were generated using a modification on the SOLiD Small RNA Expression Kit (SREK) to make it compatible with Illumina Genome Analyzer technology. MirVana-enriched miRNA fractions from HT or EM and EM totalRNA were hybridized and ligated to the A adapter mix to prepare 5' to 3' sequencing libraries. Reverse transcription and RNaseH treatment were as described in the SREK protocol. Library pre-amplification PCR was performed with Phusion Hot Start High-Fidelity DNA Polymerase (Finnzymes) with a set of alternative primers (Figure 1A and Additional file 4 Figure S3) (Integrated DNA Technologies) that contain sequence features compatible with both the adapters that were introduced during the ligation step as well as the oligonucleotides deposited on the Illumina Genome Analyzer flow cell. DNA was denatured for 30" at 98°C followed by 18 cycles with 30" at 98°C, 30" at 65°C, 30" at 72°C and final extension for 5' at 72°C. During amplification, the RNaseH treated cDNA input did not exceed 1% of total PCR volumes. Final primer concentration was 100 nM. All libraries were amplified using a different 6-nucleotide multiplex sequence tag in the 5'-primer. Library fragments were separated on a native 6% gradient pre-cast PAGE gel (Novex, Invitrogen). The 140-155 bp size fractions containing the miRNA inserts were excised, DNA was eluted from the gel, precipitated and dissolved in 15 μ L nuclease free water. Library yield was quantified on a Agilent Bioanalyzer DNA1000 chip. Single read cluster generation and single read sequencing were performed according to the standard Illumina v2 and v3 kits, respectively. An alternative set of sequencing primers was used (Figure 1A) during the sequence-by-synthesis steps to determine the RNA insert sequence and multiplex tag. Multiplexed and single read samples were 35 and 32 nt long, respectively. The smallRNA insert and multiplex tags were sequenced using standard Illumina primer annealing protocols. The complete protocol with description of the modifications to the SREK method are available in Additional file 8 Methods S1. These data have been deposited as fastq and wig files in the GEO database as series GSE20757.
A perl based data analysis pipeline, called E-miR, was built to process the microRNA sequencing data in several automated steps (Figure 2). Multiple data files can be processed in one single run. Multiple input data formats are supported including FASTQ and SCARF as well as a simple tab delimited sequence and counts format. First, the the non-coding transcript annotation files are processed. These annotation files, i.e., non-coding RNA and Dicer processed miRNAs can be retrieved from via the Ensembl perl API with a perl script that is available with the EmiR pipeline. Custom regions of interest may be added to the annotation after these files are downloaded. In EmiR, miRNA transcripts are named according to the 5-prime (5p) or 3-prime (3p) arm of the hairpin they originated from. The relative positions of the 5p and 3p products to the miRNA precursors were and used to recalculated their genome positions. For each miRNA, the two Dicer processed transcripts are processed separately by the pipeline. In cases where relative positions on only one of the 5p or 3p products was available, the complementary positions of the un-annotated transcript were inferred, taking into account the 2 nt overhang of the miRNA duplex. In the second step of the pipeline, the 3' adapter that is introduced during the sample preparation is removed from each sequence by regular expression matching, optionally allowing for one mismatch. Sequences shorter than 15 nt after truncation are excluded from further analysis. The Eland aligner cannot handle sequences longer than 32 nt, therefore, sequences longer that 32 nt are truncated at 32 nt. Next the sum of sequence counts of identical reads after truncation are calculated and separate files for all sequence lengths are prepared for Eland alignment (step 4). All unique sequences from the input samples are aligned to the genome of interest using the Eland alignment tool (step 5). Only sequences uniquely aligned to the genome with a maximum of one mismatch are accepted for further processing. Sequence reads were annotated in step 6 to known non-coding RNA transcript like miRNA, snRNA, snoRNA, tRNA and rRNAs based on overlap of their genomic positions from step 1. In order to handle variations in miRNA transcript lengths, a sequence read is annotated to a transcript locus from the annotation files when there is a minimal 50% overlap on genomics positions. In this step, sequence sets are generated for all transcript regions present in the processed annotation files. During step 7 of the pipeline, all data is compiled into an RNA expression table for all data input files combined. The table contains data for all expressed RNA transcripts as defined in the annotation files. For miRNAs the pipeline distinguishes between the 5p and 3p miRNA precursor sequence products. Previously un-annotated miRNA transcripts generated from known precursors are included as well. MiRNA identifiers are composed of both transcript annotation and genomic location, e.g., miRNA|ENSGALT00000028942 |5p~|gga-mir-29a|sense indicates a miRNA with Ensembl transcript ID ENSGALT00000028942, the 5 prime section of the precursor hairpin of miRNA 29a and the match is in the sense orientation. The '~' in the identifier indicates the positions of this transcript was inferred from the hairpin structure. For each of the expressed RNA transcripts the number of unique reads, representing the number of unique isomirs detected for miRNAs, the total sum of reads, the sum of reads without mismatches and the most abundant transcript are listed. In addition, the sense and antisense orientation of the sequence reads relative to the annotated transcript is reported in the EmiR output in the expression table. To correct for the differences in read counts between libraries, the sum of reads per transcript are scaled to tpm based on the sum of aligned sequence tags. A square root transformation was applied for variance stabilization. For data visualization in the UCSC genome browser files in the Bed (step 9) and Wig (step 10) format are generated. E-miR was designed to run on Linux systems exclusively. On a Ubuntu 9.04 (Jaunty) OS running on an Intel Quad core 2.4 MHz with 4 Gb of memory, analysis of all five libraries described in this manuscript was completed within 2 h and 26 min, 1 hour and 43 min of which were used for the Eland genome alignment step alone. Separating samples based on their multiplex tags is not performed by the E-miR pipeline. Multiplexed sequence reads were separated based on their multiplex sequence reads with a custom command line.
MicroRNA-expression levels were tested for significant differential expression between the HTs and EMs libraries using the Limma  package in R. The Benjamini-Hochberg method (BH-FDR)  was used to control the false discovery rate. Statistical significance was tested using the square root transformed tpm expression values of the 5p and 3p miRNA transcripts. Differences in expression with BH-FDR p-values ≤ 0.05 and a minimal 1.5 fold change were considered to be significant.
The Globaltest (version 5.1.5)  was used to investigate the contribution of each specific isomir to the differential expression of each miRNA. The lists of isomirs per miRNA transcript, i.e., the covariates to be tested, was compiled by the E-miR pipeline (optional feature). An expression table was generated containing square root transformed tpm expression levels for each unique individual isomir transcript. From this table, isomir transcripts containing the 'N' nucleotide were excluded from further analysis. The minimal sum of isomirs for either EM or HT was set at 10 tpm miRNAs with BH-FDR p-values ≤ 0.05 were considered to be significantly differentially expressed between EMs and HTs groups. To avoid mis interpretation due to stochastic sampling effects of less abundant miRNAs, only miRNAs for which the most abundant isomir was above 50 tpm were taken into consideration for further analyses. The differentially expressed isomirs for each miRNA were identified via the 'subsets' and 'leafNodes' functions in the Globaltest package with Holm multiple testing correction p-values at 0.1.
Sequencing derived expression profiles were compared to qPCR measurements on small RNA fractions from independent biological samples. Complementary DNA strands were generated from 300 ng mirVana enriched miRNA fractions in a 20 μ L reaction volume using the Invitrogen NCode miRNA First-Strand cDNA Synthesis kit. An equivalent of 0.5 ng RNA was used in the PCR amplification with 100 nM miRNA specific primers (Eurogentec; Additional file 9 Table S4) and 4 μ L iQ SYBR Green Supermix (bioRad) in an 8 μ L reaction using standard cycle parameters on an LightCycler480 (Roche), with annealing temperature set to 57°C. MicroRNA primers were designed at the 5' regions of the transcripts as to avoid potential mis priming due to the sequence variation observed at the 3' ends of microRNA transcripts. PCR amplification efficiencies for each miRNA transcript were determined directly from the amplifications curves using the LinRegPCR tool  and used to calculate relative expression between heart tube and embryo.
We thank M.J.B. van den Hoff and B. van Wijk (Heart Failure Research Center, Academic Medical Center, Amsterdam) for providing the chicken embryonic tissues used in this study and M. Hestand, M. van Galen and M. Villerius (Leiden Genome Technology Center, LUMC, Leiden) for their bioinformatics support and J.J. Goeman (Medical Statistics, LUMC, Leiden) for his assistance regarding the Globaltest analysis.
This research was supported by the European Union FP6 program HeartRepair LSHM-CT-2205-018630, a Horizon grant (# 93511015) and Centre for Medical Systems Biology within the framework of the Netherlands Genomics Initiative (NGI)/Netherlands Organisation for Scientific Research (NWO).
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.