Comparison of transcriptomic landscapes of bovine embryos using RNA-Seq
© Huang and Khatib. 2010
Received: 26 August 2010
Accepted: 17 December 2010
Published: 17 December 2010
Skip to main content
© Huang and Khatib. 2010
Received: 26 August 2010
Accepted: 17 December 2010
Published: 17 December 2010
Advances in sequencing technologies have opened a new era of high throughput investigations. Although RNA-seq has been demonstrated in many organisms, no study has provided a comprehensive investigation of the bovine transcriptome using RNA-seq.
In this study, we provide a deep survey of the bovine embryonic transcriptomes, the first application of RNA-seq in cattle. Embryos cultured in vitro were used as models to study early embryonic development in cattle. RNA amplified from limited amounts of starting total RNA were sequenced and mapped to the reference genome to obtain digital gene expression at single base resolution. In particular, gene expression estimates from more than 1.6 million unannotated bases in 1785 novel transcribed units were obtained. We compared the transcriptomes of embryos showing distinct developmental statuses and found genes that showed differential overall expression as well as alternative splicing.
Our study demonstrates the power of RNA-seq and provides further understanding of bovine preimplantation embryonic development at a fine scale.
The recent developments in high throughput sequencing technologies allow for surveys of transcriptomes at unprecedented completeness and resolution. The 'next generation' RNA sequencing technology (RNA-seq) has several advantages, such as the ability to detect unannotated transcriptional activity, to differentiate between different transcriptional or splicing isoforms, and to provide digital measurements at single base resolution. Recently, the efficacy of RNA-seq has been demonstrated in humans  and model organisms including yeast , Arabidopsis [3, 4], and mice . An RNA-seq experiment typically sequences millions of cDNA fragments, which are subsequently aligned to a reference genome or assembled de novo to recover structural models of annotated and unannotated genes and/or infer expression. Due to its attractive features and continually decreasing cost, RNA-seq is quickly replacing traditional technologies such as microarrays for high throughput transcriptomic studies.
The recent sequencing and analysis of the bovine genome have provided opportunities for further post-genomic investigations of this important livestock species . Despite the increasingly wide applications of RNA-seq, its use in cattle had not yet been reported. In this study, we present the first application of RNA-seq in cattle by sequencing RNA from embryos cultured in vitro. Because early embryonic loss accounts for the majority of cow infertility , we have previously created an in vitro fertilization (IVF) system to recapitulate the early development of cattle and allow systematic investigations of abnormities during the very early stages of development . The transcriptomes of IVF embryos showing distinct developmental statuses were previously profiled using microarrays, and few gene expression differences were detected in embryos undergoing abnormal versus normal development .
Expression microarrays, however, rely on existing genome annotations and lack the ability to detect more complex regulations in gene expression such as changes in alternative splicing. Therefore, in this study, we seek to take advantage of RNA-seq to identify transcriptomic changes associated with abnormal bovine early embryonic development, especially for transcriptional activities that have not been annotated before and differential alternative splicing that could not be detected by expression microarrays. Although the importance of alternative splicing in regulating complex traits has been long recognized and extensively studied, there is limited information on the pattern of alternative splicing in cattle or its roles in bovine embryonic development.
Herein, using RNA-seq, we estimated and compared digital gene expression in IVF embryos. In addition, unannotated transcribed regions in the bovine genome and novel splice junctions were discovered. Finally, we show that alternative splicing is widespread in the bovine embryonic transcriptome, and we have identified alternative splicing events associated with abnormal bovine embryonic development. This study is the first application of RNA-seq in cattle and provides fine-scale insights into bovine preimplantation embryonic development.
Summary of sequencing read alignment to the reference genome
Paired end reads
18,377,462 × 2
20,499,571 × 2
Single end reads
Total sequenced fragments
Fragments mapped to nuclear genome
Uniquely mapped fragments
(percent uniquely mapped)
Fragments mapped to mitochondrial genome
Fragments mapped to annotated junctions
(number of junctions)
Fragments mapped to putative novel junctions
(number of junctions supported by at least 2 fragments)
Fragments mapped to annotated exons
Fragments overlapped with annotated introns
Fragments mapped to annotated genes (exons + introns)
Fraction of intronic fragments among genic fragments
PolyA containing reads
Mapped polyA reads (with polyA signal)
(number of putative polyA sites)
Sequencing reads were mapped to the reference genome (Btau4) using the software package Tophat . Approximately 60% of total sequenced fragments could be mapped contiguously or to exon-exon junctions of the nuclear genome, among which over 90% were mapped uniquely (Table 1). Reads that could not be mapped initially by Tophat were examined for evidence of polyA tails, trimmed to contain non-polyA sequences, and mapped to the reference genome using the software package Bowtie . In total, 226,735 and 287,088 polyA containing fragments could be mapped uniquely in blastocysts and degeneratives respectively, corresponding to 11,389 and 15,493 putative polyA signals (AAUAAA or AUUAAA) within 50 bases of the cleavage sites.
Among the uniquely mapped fragments, 14,121,649 and 10,748,218 were mapped to annotated genes (exons and introns) in blastocysts and degeneratives, respectively, including 2,174,923 and 1,268,315 fragments mapped to annotated exon-exon junctions (Table 1). We noticed a marked difference between the fractions of intronic fragments in blastocysts (17.9%) and degeneratives (33.9%; Table 1). This observation will be dicussed in more detail below. Moreover, the substantially smaller number of junction fragments in degenerative embryos was likely a result of a higher fraction of fragments overlapped with introns (Table 1). Nevertheless, using Tophat, a total of 10,638 and 8,686 novel splice junctions supported by at least two fragments were discovered in blastocysts and degeneratives respectively, among which 8,315 were found in both samples (Table 1).
To evaluate the RNA-seq gene expression measurements, we compared them to normalized expression microarray measurements for 6,169 genes that could be uniquely identified on both platforms. RNA-seq was particularly advantageous for low-abundance transcripts, for which microarrays are generally insensitive (Additional File 2). The correlations between the two platforms were high for both blastocysts (r = 0.838, Figure 1b) and degeneratives (r = 0.844, Figure 1c). Moreover, the correlation between fold changes on two platforms was also high (r = 0.525, Figure 1d). Interestingly, we observed a clear curvature towards the RNA-seq axis when comparing RNA-seq and microarray gene expressions (Figures 1b and 1c), suggesting that microarrays likely underestimated expressions of genes that were most highly expressed.
GO terms enriched for differentially expressed genes
Multicellular organismal process
Negative regulation of biological process
Cellular developmental process
Proteinaceous extracellular matrix
Identification of novel transcribed units (TUs) by RNA-seq
Supported by polyA reads
Supported by EST alignment (>50%)
In total, we have identified 1,648,307 transcribed bases in a combined total of 1,785 regions outside gene boundaries that have not been annotated, indicating that the current bovine genome annotation is incomplete. To test for differential expression between blastocysts and degeneratives for these novel TUs, raw fragment counts from novel TUs and annotated genes were combined, quantile normalized, and analyzed using the "DESeq" package. We found 25 novel TUs significantly altered in degeneratives (Figure 1a, Additional File 1). One particularly interesting example of differentially expressed novel TUs was located on the X chromosome, which showed almost exclusive expression at a high level in degeneratives (normalized counts of 60 and 24,892 in blastocysts and degeneratives, respectively).
Alternative splicing (AS) events and genes detected by RNA-seq
Number of AS events detected
Number of genes undergoing AS
Alternative 5' splice site
Alternative 3' splice site
In this study, we present the first application of RNA-seq in cattle using IVF embryos. We show the high potential of RNA-seq in transcriptomic studies, particularly in organisms that are not fully annotated. We found unannotated genes that may play roles in early embryonic development and several genes that showed differential alternative splicing between blastocysts and degeneratives. Our findings clearly demonstrate the power of RNA-seq and provided further insights into bovine early embryonic development at a finer resolution.
RNA-seq has been frequently cited as more advantageous than microarrays in many aspects . Indeed, with a typical sequencing depth, we were able to obtain digital expression measures for many low-abundance genes, for which expression microarrays do not have sufficient sensitivity (Additional File 2). Perhaps a more important feature of RNA-seq, particularly for incompletely annotated genomes such as cattle, is its independence from existing annotations. Applying stringent criteria, we were able to identify a combined total of 1,785 novel TUs covering more than 1.6 million bases in intergenic regions that have not been annotated before. Interestingly, a substantial fraction of these novel TUs were differentially expressed (Additional File 1), including a TU that showed almost exclusive expression in degenerative embryos. This X chromosome linked transcript (chrX:8564208-8564984) was supported by EST alignments showing the same gene structure. It is located in a region that lacks conservation among mammals and was predicted to have the potential to code for a 134 amino acid protein. RNA-seq is believed to have a higher dynamic range than microarrays, partly because microarrays often have detection limits. When comparing gene expression measures on the same sample for both platforms, we observed a clear curvature towards the RNA-seq axis as expression increased (Figure 1), indicating that microarray measurements were likely biased for highly-expressed genes, leading to a lower dynamic range. However, the agreement between fold changes on the two platforms was high, and we did not detect a systematic decrease in fold change estimates by microarrays (Figure 1).
Another important advantage of RNA-Seq is its ability to detect alternative splicing events. Indeed, we detected thousands of alternative splicing events and tested for their association with the status of embryonic development (Table 4). Among the differentially spliced genes between blastocysts and degenratives, MYL6 showed a higher inclusion ratio of its sixth exon in degeneratives, which was validated experimentally using real time RT-PCR. The inclusion and exclusion of exon six of the MYL6 gene switch between the smooth muscle isoform and the non-muscle isoform of myosin light chain, respectively . In degeneratives, the smooth muscle isoform expressed at a higher ratio than in blastocysts. Importantly, the protein and nucleotide sequences, as well as the alternative splicing pattern, are conserved in mammals. Furthermore, gene expression of MYL6 has been previously shown to be associated with bovine blastocyst formation, suggesting an important role of this gene in preimplantation embryonic development in cattle .
The differentially expressed novel TUs and differential alternative splicing events identified in this study adds additional layers of information to transcriptomic changes associated with abnormal embryonic development, especially for the bovine genome that has incomplete annotation and limited information on alternative splicing.
Meanwhile, we also asked if we were able to identify differentially expressed genes or pathways that were previously known to affect early embryonic development. For example, IGFBP7 (ENSBTAG00000019368) was upregulated by about eight fold in degenerative embryos significantly (Additional File 1). Importantly, this protein has been shown to promote apoptosis  and may have played a role in the embryonic degeneration process. Moreover, we found an enrichment of differentially expressed genes located to extracellular matrix (Table 2). This result, together with previous findings on the possible involvement cell communication and cell adhesion in early embryonic development , suggested that defective communication between cells may play an important role in the abnormal development of degeneratives. Indeed, many of the detected differentially expressed or spliced genes do not have known roles in embryonic development. The identification of such novel candidate genes would necessitate further investigation.
The linear amplification of mRNA was necessary to obtain sufficient materials to generate sequencing libraries from embryos. We were able to amplify approximately 2 ug aRNA from 100 ng total RNA. Assuming a 2% mRNA fraction, this was about 1000-fold amplification, providing a convenient approach for RNA-seq with limited starting materials. However, the effect of mRNA amplification on digital gene expression measurements by RNA-seq is not known. As expected, we observed higher coverage towards the 3' ends of transcripts in RNA-seq (e.g., Figures 3 and 4) as well as in probe hybridization in the microarray data. This also limited our ability to reconstruct full length transcripts, particularly for those expressed at low level. More systematic investigation is required to assess how this artificial shortening of transcripts during mRNA amplification may affect RNA-seq results.
We observed a substantial fraction of intronic reads in both embryo samples. Meanwhile, degeneratives contained considerably more intronic reads (Table 1). In addition, we did not detect any particular subset of genes that preferentially contained intronic reads in degenerative embryos. The evidence on the distribution of sequencing reads in the genome in RNA-seq was not consistent from other studies. Using polyA-selected RNA from liver total RNA, Mortazivi et al.  reported 93% and 3% of mapped reads aligned to exons and introns, respectively. In contrast, Marioni et al.  observed 32% genic reads mapped to introns, using liver and kidney RNA samples and similar sequencing protocols as Mortazavi et al.. There are several possible explanations for the high fraction of intronic reads in our degenerative embryo sample. First, minimal DNA contamination may be fragmented, selected, and PCR enriched during library generation, though the chance was likely small, particularly because RNA had been amplified (or selected for a typical RNA-seq) and the fragmentation protocol in RNA-seq specifically targeted RNA. To test for possible DNA contamination in our aRNA samples experimentally, we randomly selected several intergenic and intronic genomic regions, and the results indicated no DNA contamination at detectable level by PCR (Additional File 3). Second, nuclear pre-mRNA containing polyA tail may also be purified, amplified, and sequenced. It is important to point out that introns contain many more bases than exons thus a slight increase in pre-mRNA fraction may increase intronic reads disproportionally. Finally, alternative splicing events that retain introns may also increase the fraction of intronic reads. However, this would require many genes or highly-expressed genes to undergo such alternative splicing, which was not observed in our data.
Because a slight increase in the pre-mRNA fraction may result in dramatic increase in intronic reads (Additional File 4), for RNA-seq experiments that contain a substantial fraction of intronic reads, normalization by total mapped reads may severely bias gene expression estimates. For example, normalization by total mapped reads such as reads per kilobase per million mapped reads (RPKM) may underestimate gene expression if the intronic fraction is higher because there would be fewer reads on exons but the total reads remain the same. In addition, total read counts can also be dramatically affected by a few genes that express at a very high level. For these reasons, we chose to quantile normalize raw read counts, which is less affected by total read counts and has been shown to reduce bias for differential gene expression analysis .
Our results indicated that the current bovine genome annotation is far from complete. In addition to a combined total of 13,938 novel junctions, we also discovered 1,785 novel TUs. Although RNA-seq is able to detect unannotated transcriptional activities, various filters are needed to avoid detection of spurious "genes". Several software packages have been developed to identify novel splice junctions  and reconstruct transcripts from RNA-seq data [15, 21], yet sufficient sequencing depth and length are required. Manual inspection of reconstructed transcripts in our data showed that the accuracy was significantly affected by false splice junctions and low coverage. As sequencing technologies evolve to longer and more accurate reads, and more powerful software packages are developed, the need for a complete annotation is further limited. Nonetheless, a more complete annotation of the bovine genome will likely benefit RNA-seq at the current point.
Despite RNA-seq providing a deep survey of the transcriptomes of IVF embryos, little difference in overall gene expression was detected in degenerative embryos. Nevertheless, preference for a particular spliced isoform was observed for several genes in degeneratives. In addition, the frequent occurrence of alternative splicing in IVF embryos provided a repository for regulation of gene expression. Although we sequenced at standard depth (e.g., one lane per sample), it became apparent that deeper sequencing was required to detect subtler changes in alternative splicing with sufficient statistical power.
In summary, we demonstrated RNA-seq in cattle, providing for the first time a high resolution map of transcriptional and alternative splicing activities in IVF embryos. The results provided further systematic understanding of mammalian embryonic development at a fine scale.
IVF embryos were collected as previously described . Briefly, oocytes were aspirated from cow ovaries and allowed to mature before they were fertilized with bull semen. Embryos were cultured for five days and evaluated for signs of compaction (morula stage). Only embryos that showed evidence of compaction were further cultured for 72 hours before they were classified morphologically. On Day 8, embryos that showed a distinct inner cell mass and blastocoele were classified as 'blastocysts' while those that failed to transition from morula to blastocyst were classified as 'degeneratives'. Blastocysts (n = 20) and degeneratives (n = 20) were pooled and stored in RNAlater (Ambion, TX) to preserve RNA integrity. RNA was extracted from each embryo pool using RNaqueous Micro (Ambion), quality checked (Additional File 5) using Pico6000 on a Bioanalyzer 2100 (Agilent, CA), and amplified using MessageAmp II (Ambion). Sequencing libraries from amplified RNA (50 ng) for each pool was prepared according to Illumina's mRNA-seq protocol and sequenced on a Genome Analyzer IIx at the University of Wisconsin Biotechnology Center by one lane of a 75 bp single end run and one lane of a paired end 81 bp run. Because paired end reads were intrinsically associated, we defined a 'fragment' as a cDNA fragment that was sequenced from either end. Thus, there were two sequencing reads for one fragment in a paired end run and one sequencing read for one fragment in a single end run. Sequencing data in fastq format can be accessed by GEO with the accession number GSE25082.
The alignment of sequencing reads to the bovine reference genome (repeat masked assembly Btau4 downloaded from UCSC genome browser) comprised three stages. In the first stage, sequencing reads from blastocysts and degenerative embryos were trimmed to 60 bp and mapped independently using the software package Tophat v1.0.13 . Tophat uses Bowtie  for alignment and is able to detect novel splice junctions by splitting reads to segments and joining segment alignments . A total of 109,929 putative novel splice junctions were discovered from the data. Each junction was supported by at least one spliced read with at least five bases aligned to either exon. In the second stage of the alignment, putative novel junctions and 186,074 known junctions compiled from the Ensembl gene predictions and Ref-Seq gene models were combined and supplied to Tophat, which then aligned reads separately. This mapping strategy was able to cope with data sets with different read lengths and to fully utilize the identified novel junctions from all samples. A maximum of one mismatch in each of the 20 bp segments was allowed. Multireads that mapped equally well to more than 20 genomic locations were discarded. In the third stage, all unmapped reads were collected to be examined for evidence of polyadenylation at the ends. Because the sequencing was not stranded, we looked for reads that contained at least six As at the 3'end or six Ts at the 5'end. These polyA reads were trimmed and only reads with remaining sequences at least 25 bp long were analyzed further. We obtained 534,090 and 862,915 polyA reads in blastocysts and degeneratives, respectively (Table 1). These reads were mapped to the reference genome using the software package Bowtie allowing up to one mismatch and forcing uniqueness . We further filtered out genomic alignments where at least six consecutive nucleotides downstream of the polyA cleavage sites were As. Cleavage sites within 15 bp of each other were considered to result from the same putative polyA sites. We considered the most frequent polyA signals (AAUAAA and AUUAAA) that occurred within 50 bp of the cleavage sites.
For digital gene expression measurements, we first obtained constitutive exons that are present in all isoforms of each of the genes, and then filtered out exons that overlapped between genes. These non-overlapping constitutive exon sets were then used to calculate overall gene expression with assistance from SAMtools  and BEDTools . Raw counts were quantile normalized using a function in the 'limma' package in R . For differential gene expression analysis with count data using a negative binomial distribution without replication, the DESeq package in R was used . Microarray data was obtained in a previous study (GEO accession: GSE24936) and normalized as described in Huang et al. . For Gene Ontology enrichment analysis, GO annotations were obtained from the 'org.Bt.eg.db' database from R/Bioconductor for uniquely identified genes. GO enrichment was tested using a hypergeometric test implemented in the 'GOstats' package in R .
The alignments were assembled using Scripture, which assembles transcripts at disjoined genomic locations by finding paths through a connectivity graph representation of the alignments . After constructing all possible paths, Scripture then identifies statistically significant paths that deviated from the background distribution.
To identify novel transcriptional activities that have not been annotated, we required that for a locus that was farther than 1000 bp away from known genes, at least one assembled transcript was at least 500 bp long and covered by at least 40 sequencing fragments. Our sequencing fragments had a mean insert size of 122 bp, this threshold for expression was about five fragments per base. Transcripts that were overlapped by repeats for more than 50% of the sequences were discarded. The Cuffcompare utility in the Cufflinks package was used for comparing transcripts to known annotations . Regions that passed our filters were considered novel transcribed units (TUs). For external support of novel TUs, EST alignments and PhastCons scores were downloaded from the UCSC genome browser and compared to novel TUs using BEDTools . We required a valid support by EST alignments to be at least 50% coverage.
We estimated and classified alternative splicing events from reads that mapped to exon-exon junctions . We conservatively distrusted reads mapped to intronic regions and assembled first or last exons so we classified only the most frequent three classes of alternative splicing, "exon skipping", "alternative 5' splice site" and "alternative 3' site". For "exon skipping", we required that all three junctions be supported by at least two sequencing fragments. For "alternative 5' splice site" and "alternative 3' splice site" classes, we required that junctions be supported by at least two sequencing fragments in addition to 100% coverage of bases between the alternative splice sites. To test for association between alternative splicing and developmental status of embryos, Fisher's exact tests were used. For exon skipping, the count for inclusion of the exon was calculated as the average of the two splice junctions involving the exon.
This study was supported by USDA Hatch grant No. WIS-142-PRJ16JH from the University of Wisconsin-Madison. The authors thank Ashley Driver for assistance with the IVF experiments.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.