Genome-wide identification of tissue-specific long non-coding RNA in three farm animal species

Kern, Colin; Wang, Ying; Chitwood, James; Korf, Ian; Delany, Mary; Cheng, Hans; Medrano, Juan F.; Van Eenennaam, Alison L.; Ernst, Catherine; Ross, Pablo; Zhou, Huaijun

doi:10.1186/s12864-018-5037-7

Research article
Open access
Published: 18 September 2018


BMC Genomics volume 19, Article number: 684 (2018)

Abstract

Background

Numerous long non-coding RNAs (lncRNAs) have been identified and their roles in gene regulation in humans, mice, and other model organisms studied; however, far less research has been focused on lncRNAs in farm animal species. While previous studies in chickens, cattle, and pigs identified lncRNAs in specific developmental stages or differentially expressed under specific conditions in a limited number of tissues, more comprehensive identification of lncRNAs in these species is needed. The goal of the FAANG Consortium (Functional Annotation of Animal Genomes) is to functionally annotate animal genomes, including the annotation of lncRNAs. As one of the FAANG pilot projects, lncRNAs were identified across eight tissues in two adult male biological replicates from chickens, cattle, and pigs.

Results

Comprehensive lncRNA annotations for the chicken, cattle, and pig genomes were generated by utilizing RNA-seq from eight tissue types from two biological replicates per species at the adult developmental stage. A total of 9393 lncRNAs in chickens, 7235 lncRNAs in cattle, and 14,429 lncRNAs in pigs were identified. Including novel isoforms and lncRNAs from novel loci, 5288 novel lncRNAs were identified in chickens, 3732 in cattle, and 4870 in pigs. These transcripts match previously known patterns of lncRNAs, such as generally lower expression levels than mRNAs and higher tissue specificity. An analysis of lncRNA conservation across species identified a set of conserved lncRNAs with potential functions associated with chromatin structure and gene regulation. Tissue-specific lncRNAs were identified. Genes proximal to tissue-specific lncRNAs were enriched for GO terms associated with the tissue of origin, such as leukocyte activation in spleen.

Conclusions

LncRNAs were identified in three important farm animal species using eight tissues from adult individuals. About half of the identified lncRNAs were not previously reported in the NCBI annotations for these species. While lncRNAs are less conserved than protein-coding genes, a set of positionally conserved lncRNAs was identified among chickens, cattle, and pigs with potential functions related to chromatin structure and gene regulation. Tissue-specific lncRNAs have potential regulatory functions on genes enriched for tissue-specific GO terms. Future work will include epigenetic data from ChIP-seq experiments to further refine these annotations.

Background

Since the invention of genome sequencing technology, the focus of genomics has been to identify the genes present in an organism and understand their link to traits, or phenotypes, that the organism exhibits. As more is learned about genetics and the key role gene regulation plays in phenotypic expression, it is becoming clear that a complete understanding of the genome-to-phenome relationship will require a more comprehensive annotation of the genome than just protein-coding genes. RNA-seq data have revealed that while less than 5% of the human genome consists of protein-coding sequences, most of the genome is transcribed [1,2,3]. Furthermore, comparative genome studies have shown evolutionary conservation in intergenic regions of the genome, indicating purifying selection and implying that these conserved regions have important functions [4,5,6,7].

One class of important regulatory elements that has recently been gaining attention is long non-coding RNAs (lncRNAs). These transcripts are distinct from miRNAs, snoRNAs, and others in that they are defined as greater than 200 bases in length and share some characteristics of mRNA, such as polyadenylation. LncRNAs were originally thought not to contain open reading frames (ORFs); however, some have been found with short ORFs that may be translated, though the function of these is still a topic of debate [8, 9]. Some lncRNAs have been shown to have functions in regulating gene expression. XIST, for example, is a lncRNA that acts as one of the major components of the X-inactivation process in placental mammals [10]. HOTAIR is another lncRNA, found on human chromosome 12. High expression of this lncRNA in breast cancer tumors is a significant predictor of metastasis [11]. HOTAIR is particularly notable as it was the first RNA discovered that is transcribed from one chromosome and regulates transcription of a gene on a different chromosome. Another lncRNA, Malat1, has been studied in mice and shown to affect the expression of neighboring genes on the same chromosome [12]. Long non-coding RNAs can therefore regulate genes in both cis and trans, demonstrating the importance of studying these molecules.

Many studies have identified genome-wide lncRNAs in model organisms such as human [13,14,15,16,17,18], mouse [18,19,20,21,22], zebrafish [23, 24], frog [25], fruit fly [26, 27], nematode [28], and Arabidopsis [29]. Some lncRNA identification efforts have focused on maize [30] and one of the primary malaria-causing parasite species, Plasmodium falciparum [31]. For farm animals, work has begun more recently to identify lncRNAs in chickens [32,33,34,35,36,37], cattle [38,39,40,41,42,43], pigs [33, 44,45,46,47,48], sheep [49,50,51,52], goats [53,54,55,56], and horses [57]. A recent review of lncRNA in livestock species provides a comprehensive overview of the current progress in the field [58]. Many of the lncRNA studies in livestock were performed using samples from varied developmental stages, or using only one or two tissues while comparing control and experimental conditions. The chicken, cattle, and pig genomes are still lacking a comprehensive genome-wide catalog of lncRNAs in multiple tissues from adult animals.

The efforts of the ENCODE projects in creating comprehensive functional annotations of the human and mouse genomes have become a model for the Functional Annotation of Animal Genomes (FAANG) Consortium [59], whose goal is to functionally annotate all farm animal genomes. As one of the FAANG pilot projects, 48 tissue samples were collected from eight tissues across two biological replicates from chickens, cattle, and pigs. Adult male animals were used as they represent a transcriptionally stable state, avoiding the relatively more dynamic gene expression associated with development, growth, and the female reproductive cycle in certain tissues. Biological replicate animals were chosen to minimize biological diversity in each species. A highly inbred line was used for the chicken, the pigs sampled were littermates, and both cattle replicates had the same sire and were from a cattle line closely related to the cattle sequenced to construct the reference genome. The tissues were selected to include those that have a large number of associated quantitative phenotypic traits, focusing on traits relevant to the food production industry such as growth, health, feed efficiency, and disease resistance. The set of eight tissues used consisted of skeletal muscle, adipose, liver, lung, spleen, cerebellum, cortex, and hypothalamus.

As part of a FAANG pilot project, 48 stranded RNA-seq libraries were generated to identify lncRNAs in eight tissues from two biological replicates across the genomes of chicken, cattle, and pig. Using data from the same eight tissues in each species enabled the identification of tissue-specific lncRNAs, as well as those that appear to be generally expressed across the eight tissues examined. Finally, a comparative analysis of lncRNAs with shared expression between the three species was conducted to study evolutionary conservation of lncRNAs.

Results

Identification of lncRNAs

Since lncRNAs are generally expressed at low levels [17] and can be hard to separate from noise in the data, the use of two biological replicates helped to verify the reproducibility of the results. Filtered and aligned RNA-seq reads (Table 1) for each of the eight tissues surpassed 100 million reads, a recommended threshold for identifying novel isoforms or transcripts that are expressed at low levels [60]. Table 2 and Table 3 show the number of genes and transcripts assembled for each RNA-seq library individually, which were then merged into a common transcriptome across all tissues. The number of transcripts in the merged transcriptome that were assigned each of the Cufflinks class codes, which indicate the relationship to previously annotated transcripts, is shown in Table 4. LncRNAs were identified by comparing them with known protein-coding genes in the NCBI annotations and with known proteins across any species in the Pfam [61] and Swiss-Prot [62] databases (Fig. 1a). A total of 31,057 lncRNAs were identified across chicken, cattle, and pig (Fig. 1b). The sequences are available in Additional files 1, 2 and 3 and their genomic locations and structures in Additional files 4, 5 and 6. Each lncRNA was placed into one of three categories based on the NCBI annotation for that species: previously annotated lncRNAs, novel isoforms of annotated lncRNAs, or transcripts from novel lncRNA loci (Fig. 1c, Table 5). On average, half of lncRNAs were previously annotated; however, a larger percentage of the lncRNAs from pig were previously annotated. In all three species, more novel lncRNAs are from novel loci rather than new isoforms of previously annotated lncRNAs. Including both novel isoforms and lncRNAs from novel loci, 5288 novel lncRNAs were identified in chickens, 3732 in cattle, and 4870 in pigs. LncRNAs were also compared to the NONCODEv5 database using sequence similarity [63]. Only 7.77% of predicted chicken lncRNAs and 5.57% of cattle lncRNAs had sequences similar to those in the NONCODE database, defined as having at least 50% sequence identity and the alignment covering at least 50% of the predicted lncRNA. In pigs, 37.59% of predicted lncRNAs were similar to those in the NONCODE database. These results are summarized in Table 6, and the individual lncRNAs with their matching NONCODE IDs are in Additional file 7.
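The NONCODE similarity criterion described above (at least 50% identity, with the alignment covering at least 50% of the predicted lncRNA) can be illustrated with a minimal Python sketch. The `Hit` record and its field names are hypothetical and do not correspond to the output format of any particular aligner, and the alignment tool used in the study is not stated in this section.

```python
from dataclasses import dataclass

@dataclass
class Hit:
    """One alignment between a predicted lncRNA (query) and a NONCODE entry.
    Field names are illustrative assumptions, not an aligner's real output."""
    query_id: str
    subject_id: str
    percent_identity: float   # 0-100, identity over the aligned region
    aligned_query_bases: int  # query bases covered by the alignment
    query_length: int         # full length of the predicted lncRNA

def is_similar(hit, min_identity=50.0, min_coverage=50.0):
    """Apply the two thresholds from the text to a single alignment."""
    coverage = 100.0 * hit.aligned_query_bases / hit.query_length
    return hit.percent_identity >= min_identity and coverage >= min_coverage

def percent_with_match(hits, all_query_ids):
    """Percentage of predicted lncRNAs with at least one qualifying hit."""
    matched = {h.query_id for h in hits if is_similar(h)}
    return 100.0 * len(matched & set(all_query_ids)) / len(all_query_ids)
```

Counting each lncRNA once, however many hits it has, matches how the percentages in the text are phrased (a share of predicted lncRNAs rather than of alignments).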

Table 1 Total number of aligned and filtered RNA-seq reads per tissue

Table 2 The number of genes assembled from each RNA-seq library

Table 3 The number of transcripts assembled from each RNA-seq library

Table 4 The number of each Cufflinks “class code” in the transcriptome merged from all tissues

Table 5 The number of lncRNA transcripts and loci from NCBI annotations and this study

Table 6 LncRNA comparison with the NONCODEv5 database based on sequence similarity

While a coding potential score was not used for identification of lncRNAs in this study, scores were calculated by FEELnc [64] that can be used as a confidence metric for further filtering of candidates. Using the default cutoff for calling a transcript coding or non-coding by FEELnc, 996 chicken lncRNAs, 475 pig lncRNAs, and 1326 cattle lncRNAs had scores predicting them as coding. This corresponded to 11.9, 3.4, and 22.4% of candidate lncRNAs, respectively.

The number of exons, transcripts, and length of lncRNAs and mRNAs are shown in Fig. 1d-f. In all three species, the majority of mRNAs contain at least 5 exons, while most lncRNAs contain only 2 or 3 exons (see Fig. 1e), which is consistent with findings from the human ENCODE project [65]. Figure 1d shows the distribution of the lengths of lncRNAs and mRNAs, which were similar within each species. However, there were differences between species that were present in both lncRNAs and mRNAs. In pigs, about 50% of both types of RNA were in the 200–999 bp range, whereas only about 25% were in this range in chickens, and cattle were in between. In general, chicken transcripts of both types were longer than those of cattle and pig, and pig transcripts were the shortest.

Potential regulatory targets of lncRNAs

To analyze potential regulatory function, each lncRNA was paired with the nearest protein-coding gene as a potential regulator of that gene. If no gene lay within 50 kb upstream or downstream of a lncRNA, measured as the distance between the two transcribed regions, that lncRNA was excluded from this analysis. Excluded lncRNAs represented 12.9% of lncRNAs in chickens, 16.8% of lncRNAs in cattle, and 21.5% of lncRNAs in pigs. Over 90% of each of the three genomes is far enough from any gene that a lncRNA located there would be excluded by the above criterion, yet fewer than a quarter of lncRNAs were found in these regions. This reinforces the potential regulatory roles that lncRNAs may have on genes. The remaining lncRNAs were then labeled as intergenic if they did not overlap the annotated gene region, exonic if they overlapped an exon by at least 1 bp, and intronic if they overlapped only introns (Fig. 2a). The exonic and intronic lncRNAs were then categorized based on whether they were on the same strand (sense) or opposite strand (antisense) of the protein-coding gene (Fig. 2b), while the intergenic lncRNAs were categorized by strand and by whether they were upstream or downstream based on the transcriptional direction of the coding gene (Fig. 2c). Table 7 shows in detail the number of lncRNAs in each of these groups. Many exon-overlapping lncRNAs overlapped only small portions of exons. Other lncRNAs overlapped a full protein-coding exon but contained novel exons that do not appear to be part of an annotated gene. Regardless of the nature of the overlap, the resulting lncRNA does not have any similarity to known protein-coding transcripts or exhibit similarity to any known protein domain, and therefore may be a non-coding isoform of the gene.
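The pairing and labeling rules in this paragraph can be sketched in Python as follows. This is an illustration under stated assumptions rather than the pipeline used in the study: the `Transcript` record, the 0-based half-open coordinates, and the choice to test lncRNA exons against gene exons when deciding between exonic and intronic are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

MAX_DISTANCE = 50_000  # the 50 kb window stated in the text

@dataclass
class Transcript:
    """Hypothetical record; coordinates are 0-based, end-exclusive."""
    name: str
    chrom: str
    start: int
    end: int
    strand: str  # "+" or "-"
    exons: List[Tuple[int, int]] = field(default_factory=list)

def gap(a, b):
    """Distance between two transcribed regions; 0 if they overlap."""
    return max(0, max(a.start, b.start) - min(a.end, b.end))

def nearest_gene(lnc, genes):
    """Closest gene on the same chromosome within 50 kb, else None."""
    near = [g for g in genes
            if g.chrom == lnc.chrom and gap(lnc, g) <= MAX_DISTANCE]
    return min(near, key=lambda g: gap(lnc, g), default=None)

def overlaps(a_start, a_end, b_start, b_end):
    """True if two half-open intervals share at least 1 bp."""
    return a_start < b_end and b_start < a_end

def classify(lnc, gene):
    """Return (location, orientation, side) for a lncRNA-gene pair."""
    orientation = "sense" if lnc.strand == gene.strand else "antisense"
    if not overlaps(lnc.start, lnc.end, gene.start, gene.end):
        # Upstream/downstream follows the gene's transcription direction.
        lnc_is_left = lnc.end <= gene.start
        upstream = lnc_is_left if gene.strand == "+" else not lnc_is_left
        return ("intergenic", orientation,
                "upstream" if upstream else "downstream")
    # Assumption: "exonic" means a lncRNA exon overlaps a gene exon.
    exonic = any(overlaps(ls, le, gs, ge)
                 for ls, le in lnc.exons for gs, ge in gene.exons)
    return ("exonic" if exonic else "intronic", orientation, None)
```

A lncRNA for which `nearest_gene` returns `None` corresponds to the excluded fraction reported above (12.9% in chickens, 16.8% in cattle, 21.5% in pigs).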

Table 7 Number of lncRNAs in each genomic location group

In all three species, about 25% of the lncRNAs that were included in this analysis overlap the genic region, with the other 75% divided evenly between upstream and downstream locations relative to the protein-coding gene. While the lncRNAs within the downstream region of genes did not appear to have any strand correlation with the gene (they were equally likely to be sense or antisense), there was a higher prevalence of antisense lncRNAs within the upstream region of genes in all three species. The Spearman correlation of the expression of the lncRNAs with their nearest genes was used to provide evidence for potential cis-regulatory function. To compare this correlation between groups and species, the average correlation was calculated for each species, then the difference from this average was calculated for each group of lncRNAs based on their positional relationship with the nearby gene, e.g. antisense upstream (Fig. 2d), and also for each tissue (Fig. 2e). A higher correlation between the expression of upstream antisense lncRNA-gene pairs was observed across all three species, supporting the potential co-regulation of these transcripts. The correlation in expression of intergenic lncRNA-gene pairs was generally higher in cattle than in chicken and pig. In chicken, however, the correlation was not affected by the distance of the lncRNA from the gene, while in cattle and pig shorter distances were associated with higher correlation (Fig. 2f). The lncRNA-gene pairs and their positional relationships are available as Additional files 8, 9 and 10, and the expression for every lncRNA in each sample is shown in Additional files 11, 12 and 13.
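The comparison described above (Spearman correlation per lncRNA-gene pair, then each group's difference from the species average) can be sketched in plain Python. The function names are hypothetical, and taking the species average over all pairs is an assumption, since the text does not specify how that average was weighted.

```python
from statistics import mean

def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation; NaN when either vector is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def deviation_from_species_mean(pair_correlations):
    """pair_correlations: list of (group_label, rho) for one species.
    Returns group mean minus the mean over all pairs (assumed weighting)."""
    valid = [(g, r) for g, r in pair_correlations if r == r]  # drop NaN
    overall = mean(r for _, r in valid)
    groups = {}
    for g, r in valid:
        groups.setdefault(g, []).append(r)
    return {g: mean(rs) - overall for g, rs in groups.items()}
```

Expressing each group as a deviation from its own species average puts the three species on a common scale, which is what allows the upstream antisense signal to be compared across them.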

Tissue-specific lncRNAs

Tissue-specific lncRNAs were identified using a Tissue Specific Index (see Methods). Fewer tissue-specific lncRNAs were seen in brain and adipose across the three species (Fig. 3a). Because lncRNAs are known to be expressed at lower levels than mRNAs [17], any expression cutoff would be arbitrary; therefore, lncRNAs that were expressed at any non-zero level were included. The percentage of lncRNAs expressed at or above a sliding cutoff was graphed, and in all three species lncRNAs specific to liver and muscle stood out as being expressed at higher levels than those of other tissues (Fig. 3b-d). The Tissue Specific Index calculated for each lncRNA is shown in Additional files 14, 15 and 16. The same analysis was repeated with tissue specificity calculated from the expression of lncRNA loci rather than the expression of individual transcripts. In other words, multiple transcripts originating from the same locus were represented by a single expression value. The results mirrored the trends of the transcript-level analysis and are not presented in detail.
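The exact definition of the Tissue Specific Index is given in the Methods, which are outside this section. As an illustration only, the sketch below uses a commonly used tau-style index (0 for uniform expression, 1 for expression in a single tissue); this formula is an assumption and may differ from the one used in the study. The second function illustrates the sliding-cutoff percentages described above. Both function names are hypothetical.

```python
def tissue_specificity_index(expression):
    """Tau-style index over per-tissue expression values (assumed formula):
    sum(1 - x_i / max(x)) / (n - 1)."""
    n = len(expression)
    peak = max(expression)
    if n < 2 or peak <= 0:
        return 0.0  # undefined for unexpressed transcripts; 0 by convention here
    return sum(1 - x / peak for x in expression) / (n - 1)

def percent_at_or_above(values, cutoffs):
    """Percentage of expression values at or above each cutoff."""
    return [100.0 * sum(v >= c for v in values) / len(values)
            for c in cutoffs]
```

Because the index is normalized by the maximum, it is insensitive to the absolute expression level, which is why a separate sliding-cutoff view is needed to see that liver- and muscle-specific lncRNAs are expressed at higher levels.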

The gene ontology (GO) terms enriched in the set of genes associated with nearby tissue-specific lncRNAs were analyzed to understand the potential regulatory function of these lncRNAs (Additional files 17, 18 and 19). The tissue-specific index was calculated for these sets of associated protein-coding genes, and the percentage found to be tissue-specific is shown in Fig. 3e. On average across all species and tissues, only 17% of these genes were tissue-specific, with a maximum of 27% in cattle liver (Fig. 3e). Only two tissues had GO terms that were enriched across all three species. In cerebellum, nervous system development, generation of neurons, positive regulation of developmental process, regulation of cell differentiation, and regulation of multicellular organismal development were enriched in chicken, cattle, and pig. In cortex, nervous system development was enriched in all three species. While no other GO terms were enriched across all three species in the same tissue, related GO terms were enriched across species in some tissues, or GO terms were shared between two species. In adipose, skeletal system development was enriched in both cattle and chickens. GO terms related to the skeletal system did not appear in adipose from pigs. In addition to the GO terms shared across all species described above, some brain tissues contained GO terms specific to individual brain regions. Regulation of circadian rhythm was enriched for lncRNAs specific to the hypothalamus in chickens, and spinal cord development was enriched for lncRNAs specific to the cerebellum in cattle. GO terms associated with vasculature were enriched in the cerebellum and hypothalamus of chicken: circulatory system development in hypothalamus and blood vessel morphogenesis in cerebellum. In liver, many GO terms related to metabolic processes were enriched for cattle and pig, such as monocarboxylic acid metabolic process in cattle and alcohol metabolic process in pig; however, these were absent in chickens. No GO terms were significantly enriched for lung in chickens, but in cattle and pigs significantly enriched GO terms included lung morphogenesis and immune response in pigs and cardiovascular system development in cattle. For muscle, very few terms were significantly enriched in cattle, but muscle tissue development was the most significant. Heart morphogenesis was the most significantly enriched term for muscle in pigs, which had a total of only three significantly enriched GO terms. Chicken had comparatively more significantly enriched terms in muscle, including skeletal muscle development. Finally, lymphocyte or T cell activation were enriched GO terms for spleen in all three species.

Conservation of lncRNAs

The lncRNAs identified in this study were used to analyze the evolutionary conservation of lncRNAs. In addition to chicken, cattle, and pig, the annotated lncRNAs from human and mouse were included. As the only non-mammal, chicken is the most evolutionarily distant of the species, while cattle and pig are more closely related to each other than to human or mouse (Fig. 4a). Previous studies have shown that lncRNAs are not well conserved at the sequence level [66]. Therefore, positional conservation was analyzed. Using the lncRNA-gene pairs from the previous analysis (Fig. 2), a lncRNA from one species was considered conserved in another species if the genes paired to each lncRNA were orthologs of each other. There was approximately 30% conservation in all species (Fig. 4b, c). A total of 39 ortholog groups were identified containing lncRNAs across the five species, consisting of 64 chicken lncRNAs, 55 cattle lncRNAs, 67 pig lncRNAs, 78 mouse lncRNAs, and 113 human lncRNAs. These lncRNAs are listed with their associated genes in Additional file 20. A GO term analysis of the genes associated with conserved lncRNAs showed that they have functions fundamental to cell biology (Fig. 4d). Chromatin assembly and nucleosome organization appeared in all three farm animal species along with related terms. Multiple sequence alignments performed on each of the groups of lncRNAs (Additional file 21) showed some regions of conservation between the species, although not at the magnitude of what would be expected of orthologous protein-coding genes.
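The positional conservation rule stated above (a lncRNA is conserved in another species when the two paired genes are orthologs) can be sketched as follows. The input structures, including the gene-to-ortholog-group mapping, are hypothetical; the source of ortholog assignments used in the study is not described in this section.

```python
def conserved_lncrnas(pairs_a, pairs_b, ortholog_group):
    """
    pairs_a, pairs_b: dict of lncRNA id -> paired gene id for species A and B.
    ortholog_group:   dict of gene id -> ortholog group id (hypothetical input).
    Returns the species-A lncRNAs whose paired gene has an ortholog that is
    paired with at least one lncRNA in species B.
    """
    groups_b = {ortholog_group[g] for g in pairs_b.values()
                if g in ortholog_group}
    return {lnc for lnc, gene in pairs_a.items()
            if ortholog_group.get(gene) in groups_b}

def percent_conserved(pairs_a, pairs_b, ortholog_group):
    """Share of species-A paired lncRNAs that are positionally conserved in B."""
    if not pairs_a:
        return 0.0
    hits = conserved_lncrnas(pairs_a, pairs_b, ortholog_group)
    return 100.0 * len(hits) / len(pairs_a)
```

Note that this criterion depends only on gene neighborhood, so two lncRNAs can be called conserved without any sequence similarity, which is consistent with the weak sequence-level conservation reported in the alignments.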

Discussion

The major goal of this study was to identify tissue-specific lncRNAs, evolutionarily conserved lncRNAs, and their potential regulatory functions across three farm animal genomes using deep RNA sequencing from eight tissues and two biological replicates. A major strength of this study compared to other lncRNA identification studies was the consistency in the methods used to obtain the data across the tissues and species. Because all the data were generated in the same lab by the same personnel, following the same procedure for the same eight tissues taken from adult males, lncRNAs could be compared among the three species with limited confounding from factors such as different developmental stages, tissue types, or sexes. Such a comparison would not have been possible using existing lncRNA annotations from Ensembl or NCBI, or by leveraging lncRNA sets previously identified by other researchers.