Structured RNAs and synteny regions in the pig genome

Background: Annotating mammalian genomes for noncoding RNAs (ncRNAs) is nontrivial, since far from all ncRNAs are known and the computational models are resource demanding. Currently, the human genome holds the best mammalian ncRNA annotation, a result of numerous efforts by several groups. However, a more direct strategy is desired for the increasing number of sequenced mammalian genomes, of which some, such as the pig, are relevant as disease models and production animals.

Results: We present a comprehensive annotation of structured RNAs in the pig genome. Combining sequence and structure similarity search as well as class specific methods, we obtained a conservative set with a total of 3,391 structured RNA loci, of which 1,011 and 2,314, respectively, hold strong sequence and structure similarity to structured RNAs in existing databases. The RNA loci cover 139 cis-regulatory element loci, 58 lncRNA loci, 11 conflicts of annotation, and 3,183 ncRNA genes. The ncRNA genes comprise 359 miRNAs, 8 ribozymes, 185 rRNAs, 638 snoRNAs, 1,030 snRNAs, 810 tRNAs and 153 ncRNA genes not belonging to the aforementioned classes. When running the pipeline on a local shuffled version of the genome, we obtained no matches at the highest confidence level. Additional analysis of RNA-seq data from a pooled library from 10 different pig tissues added another 165 miRNA loci, yielding an overall annotation of 3,556 structured RNA loci. This annotation represents our best effort at making an automated annotation. To further enhance the reliability, 571 of the 3,556 structured RNAs were manually curated by methods depending on the RNA class, while 1,581 were declared as pseudogenes. We further created a multiple alignment of pig against 20 representative vertebrates, from which RNAz predicted 83,859 de novo RNA loci with conserved RNA structures. Of the RNAz predictions, 528 overlapped with the homology based annotation or novel miRNAs. We further present a substantial synteny analysis, which includes 1,004 lineage specific de novo RNA loci and 4 ncRNA loci in the known annotation specific for Laurasiatheria (pig, cow, dolphin, horse, cat, dog, hedgehog).

Conclusions: We have obtained one of the most comprehensive annotations for structured ncRNAs of a mammalian genome, which is likely to play central roles in both health modelling and production. The core annotation is available in Ensembl 70 and the complete annotation is available at http://rth.dk/resources/rnannotator/susscr102/version1.02.

Electronic supplementary material: The online version of this article (doi:10.1186/1471-2164-15-459) contains supplementary material, which is available to authorized users.

Silva rRNAdb: Version 102, vertebrates only [7]
snoRNAbase: Version 3 [8]
tRNAdb: Version 2009 [9]

Version: 0.2

Table S3 - Sequence based homology search results. Results of the BLAST scan of the databases against the pig genome using high or medium confident cutoffs.
See the main paper Table 2 for an explanation of row labels. The number of RNA (sub)families and the number of annotated loci at high and medium confident BLAST cutoffs. Conflicts are the result of overlapping families of different RNA classes. In practice, conflicts always arise between a miRNA and a longer non-miRNA family. Cutoffs, High: BLAST E-value 0.1, 95% identity, 95% of query covered; Medium: BLAST E-value 0.

Results of the BLAST scan of Rfam against the pig genome using high or medium confident cutoffs. Conflicts are found to be between hits from tRNA and tRNA-Sec. These conflicts are not detected during the normal BLAST pipeline, since conflicts there were only marked between different RNA classes. Furthermore, these BLAST hits score insufficiently with Infernal to be true tRNA-Sec family members. For details also see Table S3 (confidence levels).

Table S7 - High, medium and low confident results of the homology based pipeline. The combined results of the sequence similarity search, structure homology search and class specific tools at the different cutoff levels (compare with main text Table 1). Each confidence level contains the RNAs of the previous one; that is, high is contained in medium, which in turn is contained in low. The column RNA class contains cisreg-elements: cis-regulatory elements from Rfam/Infernal; lncRNA-loci: Infernal structural loci from larger genes (lncRNAs); the next 7 rows contain (full length) ncRNA genes, miRNA: BLAST from miRBase and miRDeep predictions; ribozyme: ribozymes from Rfam/Infernal; rRNA: ribosomal RNAs, primarily from RNAmmer; snRNA and snoRNA: BLAST results and results from Infernal/Rfam; tRNA: tRNAs from BLAST, tRNAscan-SE and Infernal/Rfam; other: RNA families from Rfam not belonging to one of the other classes; conflict: conflicts of annotation.
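The high confident cutoffs and the conflict rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline's actual code: the `Hit` record, the function names, and the example families `mir-X`, `SNORA-Y` and `U6-like` are hypothetical, the E-value cutoff is assumed to mean "at most 0.1", and overlap is assumed to require the same chromosome and strand.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Hit:
    """One BLAST hit on the genome (hypothetical record layout)."""
    family: str
    rna_class: str
    chrom: str
    start: int          # 1-based, inclusive
    end: int            # 1-based, inclusive
    strand: str
    evalue: float
    pct_identity: float
    query_cov: float    # percent of the query covered by the alignment


def is_high_confidence(hit, max_evalue=0.1, min_identity=95.0, min_query_cov=95.0):
    """High cutoff from the caption: E-value 0.1, 95% identity, 95% query coverage."""
    return (hit.evalue <= max_evalue
            and hit.pct_identity >= min_identity
            and hit.query_cov >= min_query_cov)


def overlaps(a, b):
    """Assumption: loci overlap only if on the same chromosome and strand."""
    return (a.chrom == b.chrom and a.strand == b.strand
            and a.start <= b.end and b.start <= a.end)


def find_conflicts(hits):
    """Conflicts are overlapping hits that belong to different RNA classes."""
    return [(a, b) for a, b in combinations(hits, 2)
            if a.rna_class != b.rna_class and overlaps(a, b)]


# Toy data: a miRNA hit nested inside a longer snoRNA hit, plus a weak snRNA hit.
hits = [
    Hit("mir-X", "miRNA", "chr1", 100, 180, "+", 1e-30, 98.0, 100.0),
    Hit("SNORA-Y", "snoRNA", "chr1", 90, 220, "+", 1e-40, 96.5, 97.0),
    Hit("U6-like", "snRNA", "chr2", 500, 606, "-", 1e-20, 91.0, 100.0),
]
high = [h for h in hits if is_high_confidence(h)]
conflicts = find_conflicts(high)
```

In this toy input the snRNA hit fails the identity cutoff, and the remaining two hits form a single miRNA versus non-miRNA conflict, the situation the caption reports as the typical case.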
Loci are the number of RNA loci of a given class; Families are a subdivision of classes into RNAs with the same name. 12 tRNAs and 15 miRNAs were moved to the medium confident annotation as part of the curation procedure. See text for details. Note that for the final annotation we add the RNA-seq based miRNA candidates, reaching the final total of 3,556 high confident RNA loci.

Table S8 - Read supported annotation. High confident annotation supported by reads from the small RNA library. The annotation is given in the first column, the locus in the second, indication of miRDeep support in the third, and the number of reads overlapping with the annotation in the fourth and last column. The number of reads is given as multiple numbers separated by commas in the cases where gaps are observed in the read coverage of the annotation.

Table S10 - Genic context of the high confident annotations. Each annotated locus within 10,000 nucleotides of a protein coding gene has been assigned a genic context (otherwise the annotation is marked as intergenic). In some cases an annotated locus is assigned contexts from multiple protein coding genes, in which case the annotation counts as having a multi gene context in the table.
Annotations that overlap protein coding genes, but for which no exon (coding or noncoding) covers more than 50% of the annotation length, are marked as intronic. Annotations are marked as coding when at least 50% of the annotation is covered by a protein coding exon. Annotations are marked as 3' UTR or 5' UTR when found within 10,000 nt of the start or end of the protein coding gene, but outside the coding sequence.
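The Table S10 rules can be illustrated with a short Python sketch. It rests on several simplifying assumptions that are not stated in the caption: only coding exons are modelled, the 50% rule is evaluated against the single best-overlapping exon, any overlap with the gene span selects the coding/intronic branch, and the 5' and 3' labels for flanking loci are assigned by strand. The `Gene` class and all function names are hypothetical.

```python
from dataclasses import dataclass, field

FLANK = 10_000  # nt window around a protein coding gene, as in the caption


@dataclass
class Gene:
    """Simplified protein coding gene: span, strand, and coding exons only."""
    start: int
    end: int
    strand: str
    coding_exons: list = field(default_factory=list)  # (start, end), inclusive


def overlap_len(a_start, a_end, b_start, b_end):
    """Number of shared positions between two inclusive intervals."""
    return max(0, min(a_end, b_end) - max(a_start, b_start) + 1)


def context_for_gene(start, end, gene):
    """Context of locus [start, end] relative to one gene, or None if too far away."""
    length = end - start + 1
    if overlap_len(start, end, gene.start, gene.end) > 0:
        best = max((overlap_len(start, end, s, e) for s, e in gene.coding_exons),
                   default=0)
        return "coding" if best >= 0.5 * length else "intronic"
    if end < gene.start:                      # locus lies left of the gene
        dist = gene.start - end
        upstream = gene.strand == "+"
    else:                                     # locus lies right of the gene
        dist = start - gene.end
        upstream = gene.strand == "-"
    if dist > FLANK:
        return None
    return "5' UTR" if upstream else "3' UTR"


def genic_context(start, end, genes):
    """Combine per-gene contexts into intergenic, multi gene, or a single label."""
    contexts = [context_for_gene(start, end, g) for g in genes]
    contexts = [c for c in contexts if c is not None]
    if not contexts:
        return "intergenic"
    if len(contexts) > 1:
        return "multi gene"
    return contexts[0]


# Toy genes on one chromosome (coordinates are made up for illustration).
gene_a = Gene(20_000, 30_000, "+", [(20_000, 20_500), (29_000, 30_000)])
gene_b = Gene(32_000, 40_000, "-")
```

For example, `genic_context(25_000, 25_100, [gene_a])` falls inside the gene but outside its exons and is labelled intronic, while a locus at 31,000 to 31,100 lies within 10,000 nt of both toy genes and is counted as multi gene.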
These choices are largely based on the ones made for the human pairwise alignments by UCSC. The classification of species as distantly or closely related to pig can be seen in the main text Table 4.

Table S24 - 100 nt upstream of human PolII sequences. 100 nt upstream of curated human sequences known to be transcribed by PolII. Using these sequences we created the position weight matrices for the PSEA and PSEB promoter elements for PolII transcripts.

Figure S1a - miRDeep predictions with read profiles best aligning with snoRNAs. Five read profiles for miRDeep predicted miRNAs where the read profiles are best aligned with snoRNA read profiles according to deepBlockAlign. From top to bottom, the best matching read profiles were found to be SNORD61, SNORA3, snoU89, SNORD49, and SNORA52. In the figure, the RNAz track and the deepBlockAlign track are colored dark blue; the high confident annotation based on miRDeep is colored with a lighter blue; the number of reads from the small RNA library that cover each base is shown at the bottom of each plot (blue for reads on the negative strand and red for reads on the positive strand).

Figure S1b - miRDeep predictions with read profiles best aligning with snoRNAs (continued from Figure S1a).

Figure S2 - miRDeep predictions with read profiles not aligning to those of known ncRNAs. Four read profiles for miRDeep predicted miRNAs where the read profiles do not align to those of any known ncRNA within the deepBlockAlign score cutoff of 0.6. In the figure, the RNAz track and the deepBlockAlign track are colored dark blue; the high confident annotation based on miRDeep is colored with a lighter blue; the number of reads from the small RNA library that cover each base is shown at the bottom of each plot (blue for reads on the negative strand and red for reads on the positive strand).

Figure S3 - Read profiles annotated by deepBlockAlign with overlapping RNAz loci. Three read profiles for RNAs that overlap with an RNAz locus and thus have conserved structure. However, they are neither annotated by the high confident homology pipeline nor detected by the miRDeep program. From top to bottom, the read profiles are aligned with tRNA.ala, mir-876 and mir-2964. In the figure, the RNAz track and the deepBlockAlign track are colored dark blue; the number of reads from the small RNA library that cover each base is shown at the bottom of each plot (blue for reads on the negative strand and red for reads on the positive strand).

Figure S4 - Read profile annotated by deepBlockAlign and overlapping with mir-431. Read profile recognized by deepBlockAlign as a miRNA. However, the locus is missed by both high confident BLAST and miRDeep. The locus has conserved RNA structure, as it overlaps with an RNAz annotation, and medium confident BLAST identifies the locus as mir-431. In the figure, the RNAz track and deepBlockAlign track are colored dark blue; the medium confident annotation based on BLAST is colored with a lighter blue; the number of reads from the small RNA library that cover each base is shown at the bottom of each plot (blue for reads on the negative strand and red for reads on the positive strand).

Figure S5 - Read profile annotated by deepBlockAlign and overlapping with mir-223. Read profile recognized by deepBlockAlign as a miRNA. However, the locus is missed by both high confident BLAST and miRDeep. The locus has conserved RNA structure, as it overlaps with an RNAz annotation, and medium confident BLAST identifies the locus as mir-223. In the figure, the RNAz track and deepBlockAlign track are colored dark blue; the medium confident annotation based on BLAST is colored with a lighter blue; the number of reads from the small RNA library that cover each base is shown at the bottom of each plot (blue for reads on the negative strand and red for reads on the positive strand).

Read profile recognized by deepBlockAlign as a miRNA. However, the locus is missed by both high confident BLAST and miRDeep. Medium confident BLAST identifies the locus as mir-1388. In the figure, the RNAz track and deepBlockAlign track are colored dark blue; the medium confident annotation based on BLAST is colored with a lighter blue; the number of reads from the small RNA library that cover each base is shown at the bottom of each plot (blue for reads on the negative strand and red for reads on the positive strand).

Plot of the mir-379 cluster in pig.
The cluster is incomplete in the assembly and only contains miRNAs from 376c to 656, while miRNAs 379-495 are missing. In the figure, the RNAz track is colored dark blue; the high confident annotations are colored with a lighter blue; the curated annotation is colored green; the human miRNAs from miRBase, which have been lifted over from hg19 to pig, are colored red; the number of reads from the small RNA library that cover each base is shown at the bottom of each plot (blue for reads on the negative strand and red for reads on the positive strand).

Figure S13 - mir-493 cluster in pig. The mir-493 cluster is broken in the pig assembly. miRNAs are found on both strands, and even though the miRNAs are confirmed by high confident BLAST, they are not conserved in the pairwise alignments. In the figure, the RNAz track is colored dark blue; the high confident annotations are colored with a lighter blue; the curated annotation is colored green; the human miRNAs from miRBase, which have been lifted over from hg19 to pig, are colored red; the number of reads from the small RNA library that cover each base is shown at the bottom of each plot (blue for reads on the negative strand and red for reads on the positive strand).

Figure S14 - mir-532 cluster in pig. The mir-532 cluster is mostly conserved in pig; however, some miRNAs, e.g. mir-501 and mir-502, are poorly conserved. See main text for details. In the figure, the RNAz track is colored dark blue; the high confident annotations are colored with a lighter blue; the curated annotation is colored green; the human miRNAs from miRBase, which have been lifted over from hg19 to pig, are colored red; the number of reads from the small RNA library that cover each base is shown at the bottom of each plot (blue for reads on the negative strand and red for reads on the positive strand).

Figure S15 - mir-17 cluster in pig. The mir-17 cluster is well conserved in pig and the miRNAs are supported by reads.
In the figure, the RNAz track is colored dark blue; the high confident annotations are colored with a lighter blue; the curated annotation is colored green; the human miRNAs from miRBase, which have been lifted over from hg19 to pig, are colored red; the number of reads from the small RNA library that cover each base is shown at the bottom of each plot (blue for reads on the negative strand and red for reads on the positive strand).

Figure S16 - mir-363 cluster in pig. The mir-363 cluster is well conserved in pig and the miRNAs are supported by reads. In the figure, the RNAz track is colored dark blue; the high confident annotations are colored with a lighter blue; the curated annotation is colored green; the human miRNAs from miRBase, which have been lifted over from hg19 to pig, are colored red; the number of reads from the small RNA library that cover each base is shown at the bottom of each plot (blue for reads on the negative strand and red for reads on the positive strand).

Figure S17 - mir-367 cluster in pig. Only 3 of the 5 miRNAs in the mir-367 cluster are identified by high confident BLAST; however, the cluster appears complete in the pairwise alignment. In the figure, the RNAz track is colored dark blue; the high confident annotation based on BLAST is colored with a lighter blue, except for the curated annotation, which is colored green; the human miRNAs from miRBase, which have been lifted over from hg19 to pig, are colored red; the number of reads from the small RNA library that cover each base is shown at the bottom of each plot (blue for reads on the negative strand and red for reads on the positive strand).

Figure S18 - mir-513a cluster in pig. The mir-513a and mir-514b clusters are incomplete in the pig genome, leading to several problems in the pairwise alignments. Three miRNAs are observed in the miRDeep analysis of these clusters, which may be assigned as mir-506, mir-508 and likely mir-509.
The mir-509 miRNA is confirmed by high confident BLAST and is therefore curated. In the figure, the RNAz track is colored dark blue; the high confident annotations are colored with a lighter blue; the curated annotation is colored green; the human miRNAs from miRBase, which have been lifted over from hg19 to pig, are colored red; the number of reads from the small RNA library that cover each base is shown at the bottom of each plot (blue for reads on the negative strand and red for reads on the positive strand).

[Figure S18 track labels: mir-513a-2, mir-513c, mir-513b, mir-506, mir-507, mir-508, mir-514a-1, mir-509-1, mir-509-2, mir-514a-2, mir-514a-3, mir-514b, mir-509-3, mir-510; read coverage track small_reads_minus on a ln(x+1) scale, range 0 to 20.]

Figure S19 - mir-450b cluster in pig. The mir-450b cluster consists of 6 miRNAs in human, but it is incomplete in the pig assembly.
The miRNAs that we find are supported by both BLAST and reads from the small RNA library. In the figure, the RNAz track is colored dark blue; the high confident annotations are colored with a lighter blue; the curated annotation is colored green; the human miRNAs from miRBase, which have been lifted over from hg19 to pig, are colored red; the number of reads from the small RNA library that cover each base is shown at the bottom of each plot (blue for reads on the negative strand and red for reads on the positive strand).

Figure S20 - mir-892c cluster in pig. The mir-892 cluster consists of 6 miRNAs in human, none of which are found in pig by high confident BLAST. However, the entire cluster is reproduced in pig according to the human-pig pairwise alignments. We do not annotate these miRNAs, since they are supported neither by high confident BLAST nor by reads from the small RNA library. In the figure, the RNAz track is colored dark blue; the high confident annotations are colored with a lighter blue; the curated annotation is colored green; the human miRNAs from miRBase, which have been lifted over from hg19 to pig, are colored red; the number of reads from the small RNA library that cover each base is shown at the bottom of each plot (blue for reads on the negative strand and red for reads on the positive strand).

Left: Density plot of the position distribution of the PSEA elements for the putative PolII sequences (black lines) and random sequences (red lines). The dotted lines represent the PSEA distribution for scores belonging to the highest 20% quantile. The peak shows that PSEA is preferentially found 50 nt upstream of the transcript start. Right: Density plot of the distribution of the TATA-box location for putative PolIII sequences (black line) and a set of 1000 random sequences (red line). TATA-boxes are preferentially found directly upstream of the transcript start.
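
The position weight matrices mentioned for Table S24 and the positional distributions plotted above can be illustrated with a small Python sketch. It shows one common way to build a log-odds matrix and locate the best-scoring motif position in a 100 nt upstream window; it is not the procedure used for the paper. The motif instances are made-up toy strings rather than real PSEA sites, the pseudocount and uniform background are assumptions, and sequences are assumed to contain only A, C, G and T.

```python
import math

BASES = "ACGT"


def build_pwm(instances, pseudocount=1.0, background=0.25):
    """Log-odds position weight matrix from equal-length aligned motif instances."""
    width = len(instances[0])
    pwm = []
    for i in range(width):
        counts = {b: pseudocount for b in BASES}
        for seq in instances:
            counts[seq[i]] += 1
        total = sum(counts.values())
        pwm.append({b: math.log2(counts[b] / total / background) for b in BASES})
    return pwm


def best_hit(pwm, seq):
    """Return (position, score) of the highest-scoring window in seq (0-based)."""
    width = len(pwm)
    best_pos, best_score = None, -math.inf
    for pos in range(len(seq) - width + 1):
        score = sum(col[seq[pos + i]] for i, col in enumerate(pwm))
        if score > best_score:
            best_pos, best_score = pos, score
    return best_pos, best_score


# Hypothetical aligned motif instances (illustration only).
instances = ["TCACCATAA", "TCACCGTAA", "TTACCATAA", "TCACCATGA"]
pwm = build_pwm(instances)

# Toy 100 nt upstream region; its last base is position -1 relative to the
# transcript start, so a window starting at index p has offset p - 100.
upstream = "G" * 45 + "TCACCATAA" + "G" * 46
pos, score = best_hit(pwm, upstream)
offset = pos - len(upstream)
```

Collecting `offset` over many upstream sequences, and over shuffled or random sequences as a control, gives the kind of positional distribution that the density plots compare.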