Identification and analysis of pig chimeric mRNAs using RNA sequencing data
© Ma et al.; licensee BioMed Central Ltd. 2012
Received: 11 December 2011
Accepted: 17 August 2012
Published: 28 August 2012
Gene fusion is ubiquitous over the course of evolution. It is expected to increase the diversity and complexity of transcriptomes and proteomes through chimeric sequence segments or altered regulation. However, chimeric mRNAs in pigs remain poorly characterised. Here we identified chimeric mRNAs in pigs and analyzed their expression across individuals and breeds using RNA-sequencing data.
The present study identified 669 putative chimeric mRNAs in pigs, of which 251 chimeric candidates were detected in a set of RNA-sequencing data. Of the putative chimeras, 618 had clear trans-splicing sites, 537 of which obeyed the canonical GU-AG splice rule. Only two putative pig chimera variants had a fusion junction that overlapped that of a known human chimeric mRNA. A set of unique chimeric events showed moderate variance in expression across individuals and breeds, and non-significant variance between sexes. Furthermore, for 458 putative chimeric mRNAs, the genomic region of the 5′ partner gene shares a similar DNA sequence with that of the 3′ partner gene. Of those shared DNA sequences, 81 significantly matched known DNA-binding motifs in the JASPAR CORE database. Four DNA motifs shared in parental genomic regions had significant similarity with known human CTCF binding sites.
The present study provided detailed information on some pig chimeric mRNAs. We proposed a model that trans-acting factors, such as CTCF, induced the spatial organisation of parental genes to the same transcriptional factory so that parental genes were coordinatively transcribed to give birth to chimeric mRNAs.
Chimeric mRNAs fused from two previously separate genes located on different genomic loci may allow a limited number of genes to encode a substantially larger number of mRNAs and proteins. They are expected to increase proteomic diversity through chimeric proteins or altered regulation. As a consequence, gene fusion can change the properties of precursor proteins and can even perturb normal regulatory pathways and initiate or stimulate neoplastic cell growth. A well-known example is the BCR-ABL1 fusion gene, which is the result of the chromosomal translocation t(9;22)(q34;q11) and is responsible for 90% of chronic myelogenous leukemia cases. In this sense, chimeric genes can be used as desirable therapeutic targets for cancers. For instance, imatinib mesylate (Gleevec, Novartis) can target the oncogenic kinase activity of BCR-ABL1 in chronic myeloid leukemia [2–4]. Therefore, the identification and analysis of novel chimeric genes will pave the way for a greater understanding of the role of gene fusion.
Chromosomal translocation is generally responsible for the generation of some chimeric mRNAs in cancer cells. Therefore, chimeric mRNAs are often viewed as potential diagnostic biomarkers for tumours caused by chromosomal translocation. However, a low amount of a chimeric RNA (JAZF1-JJAZ1) was detected in normal endometrial tissues, joining the JAZF1 gene on chromosome band 7p15 to the JJAZ1/SUZ12 gene on chromosome band 17q21. Chimeric RNAs and proteins are identical to those produced from a chromosomal rearrangement found in human endometrial stromal tumours. The explanation generally offered for this finding is that specific chromosomal rearrangements occur within small numbers of cells in healthy tissues. However, no rearranged bands t(7;17)(p15;q21) were detected in normal cells. Given the absence of any detectable rearranged DNA in cells producing chimeric RNAs, the obvious explanation is the rearrangement at the RNA level. After incubation of mixed extracts from a human endometrial stromal cell line and from a rhesus monkey fibroblast cell line, rhesus JAZF1 exons were joined to human JJAZ1 exons, implying that the JAZF1-JJAZ1 RNA is a result of trans-splicing.
In eukaryotes, trans-splicing is a special event in RNA processing where exons from two different primary RNA transcripts are joined end to end and ligated. By analogy with the RNA cis-splicing mechanism, a cDNA is thought to be generated by trans-splicing when it aligns to multiple non-contiguous genomic loci and the fusion junction obeys the canonical GU-AG splice-site rule. However, how precursor genes find each other before splicing remains to be elucidated, and where the trans-splicing event takes place is still poorly understood.
Some chimeras are derived from a non-spliceosome mechanism. Short homologous sequences are proposed to be associated with the generation of chimeric mRNAs in eukaryotes, suggesting that the ‘misalignment’ of short homologous sequences could guide the chromosomal interaction that brings distal genes into proximity. In addition, read-through/splicing is another way of generating chimeric mRNAs [8–11]. In this process, an mRNA starts from the upstream gene, reads through intergenic regions, and ends at a termination point of the adjacent downstream gene, with the region in between removed by splicing. However, read-through/splicing cannot explain the chimeras derived from different chromosomes or opposite strands. Some chimeric mRNAs may have originated from the strand-switching feature of the reverse transcriptase. In some cases, chimeric mRNAs are considered as artefacts from the reverse transcription polymerase chain reaction (RT-PCR).
The presence of chimeric mRNAs in normal cells is a critical issue because the important pathways in normal cells would be disrupted by the potential therapy targeting chimeric mRNAs and proteins. The identification of chimeric mRNAs in normal cells will provide a wealth of biological information for this issue. The pig (Sus scrofa) is an economically important species and a potential medical model for some human health issues. Therefore, research on chimeric mRNAs in normal cells can benefit from pigs. Results from the present study provide the first broad overview of chimeric mRNAs in pigs, and their analysis in normal tissue will aid in the further understanding of the molecular mechanisms of gene fusions.
To confirm a hybrid transcript candidate, we inspected whether the fusion point corresponded to a pair of known splice sites. We separately extracted the chromosomal DNA sequences of the 5′ and 3′ partners of an inferred chimera and then joined the two non-contiguous genomic sequences into an artificially fused genomic sequence. Each inferred chimera was aligned to the corresponding artificially fused genomic sequence using the SIM4 program, which takes consensus splice signals into account. The alignment around the fusion point was checked. Only the fusion points that were aligned precisely, without a gap or overlap, were retained. In addition, the reading frame must have structural integrity. Finally, 618 candidates had clear trans-splicing sites, 537 of which obeyed the GT/AG rule (Additional file 1).
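The GT/AG check described above can be illustrated with a minimal Python sketch. It is not the SIM4 procedure used in the study. The function name, its arguments, and the 0-based coordinate convention are assumptions made for illustration: given the sense-strand genomic sequences of the two partners and the exon boundaries at the fusion point, it tests whether the intron side of the 5′ partner begins with GT and the intron side of the 3′ partner ends with AG.

```python
def has_canonical_splice_sites(donor_genomic, acceptor_genomic,
                               donor_exon_end, acceptor_exon_start):
    """Return True if the fusion junction follows the GT/AG rule.

    Hypothetical helper, not part of SIM4.
    donor_genomic / acceptor_genomic: sense-strand genomic DNA of the
        5' and 3' partners.
    donor_exon_end: 0-based index one past the last exonic base of the
        5' partner (first intronic base).
    acceptor_exon_start: 0-based index of the first exonic base of the
        3' partner.
    """
    # First two intronic bases downstream of the 5' partner exon.
    donor_site = donor_genomic[donor_exon_end:donor_exon_end + 2].upper()
    # Last two intronic bases upstream of the 3' partner exon.
    upstream = max(acceptor_exon_start - 2, 0)
    acceptor_site = acceptor_genomic[upstream:acceptor_exon_start].upper()
    return donor_site == "GT" and acceptor_site == "AG"


# Toy sequences: exon "ATGGCC" followed by intron "GTAAGT";
# intron "TTTCAG" followed by exon "GGATCC".
five_prime = "ATGGCC" + "GTAAGT"
three_prime = "TTTCAG" + "GGATCC"
canonical = has_canonical_splice_sites(five_prime, three_prime, 6, 6)
non_canonical = has_canonical_splice_sites("ATGGCCCTAAGT", three_prime, 6, 6)
```

On the genomic DNA the rule reads GT/AG; the same sites appear as GU/AG in the primary RNA transcript.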
To further confirm the trans-splicing events, 48 chimeric candidates were randomly selected for the RT-PCR assay using RNA from a number of tissues (see Methods). An RT-PCR product was required to span the fusion point. Through this assay, 36 of the 48 randomly selected candidates showed identity with the expected fusion sequences (Additional file 2). Given that the transcription of mRNAs may vary in different tissues or stages of life, the samples selected for the RT-PCR assay may not be those in which the candidates are expressed. In addition, all mRNAs used in the present study have prior biological studies annotated in databases of the UCSC and the NCBI (National Centre for Biotechnology Information). Thus, the rate detected by the RT-PCR assay might underestimate the positive rate of chimeric mRNA identification. The use of expressed sequence tag (EST) and RNA-sequencing data from more tissues or stages would fill the gaps left by the RT-PCR assay.
Putative chimeras were aligned to ESTs downloaded from the UCSC database to seek support from external experimental evidence and verify the putative fusion junctions. If at least 20 nt of the sequence on either side of a putative fusion point overlapped with the ESTs, the candidate was retained for further analysis. In total, 431 candidates were supported by at least three ESTs (Additional file 3).
Putative pig chimeric mRNAs were aligned to known human chimeric transcripts annotated in the chimera database (ChimeraDB 2.0) to estimate the relationship between the two kinds of transcripts. The fusion junctions of 21 putative pig chimeric mRNAs were matched to known human chimeric mRNAs (Additional file 4). However, only two putative pig chimera variants (AK239284 and AK349030) had a fusion junction that overlapped that of a known human chimeric mRNA (AML1/AMP19 fusion gene).
We collected 396.2 million sequence reads from the transcriptome sequencing of liver tissue samples from 11 adult Bama miniature pigs (five males and six females, Additional file 5). This was done to verify that the putative chimeric mRNAs were genuinely expressed genes rather than the result of exonic coding sequences shared among multiple genes or homologous pseudogenes. The Illumina Genome Analyzer II was employed to sequence these samples. Two lengths of single-end reads, 76 and 101 nt, were generated (Additional file 5). To make the read length uniform, the 101 nt reads were trimmed to 76 nt from the low-quality (right) end, which also increases the quality of the 101 nt reads.
We further estimated the validity of junction reads, defined as reads that overlapped a fusion point by a minimum of 5 nt on either side. In the present study, reads were trimmed to 76 nt. Therefore, requiring a 5 nt overhang for reads mapping across a fusion point, the length of each fusion junction was 142 nt (71 nt on either side of the fusion point). If the start position of a read was located in the region from the 1st to the 67th nt of the fusion junction, the read was termed a junction read. In this estimation, reads from the 11 liver samples were pooled together. In total, 496 fusion junctions were matched by at least one read. Among these junctions, 89.3% (443/496) were overlapped by at least three reads and 88.7% (440/496) were validated by reads starting from at least three different positions (Additional file 7).
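The junction arithmetic above can be checked with a short Python sketch. The function names are hypothetical, and positions are taken as 1-based coordinates on the 142 nt junction sequence; only the 76 nt read length and the 5 nt overhang come from the text.

```python
def junction_window(read_len=76, min_overhang=5):
    """Return (flank, junction_len, last_start) for a junction template.

    flank: bases kept on each side of the fusion point.
    junction_len: total length of the junction sequence.
    last_start: largest 1-based start position of a valid junction read.
    """
    flank = read_len - min_overhang            # 76 - 5 = 71
    junction_len = 2 * flank                   # 142
    last_start = junction_len - read_len + 1   # 67
    return flank, junction_len, last_start


def is_junction_read(start, read_len=76, min_overhang=5):
    """True if a read starting at 1-based `start` spans the fusion point
    with at least `min_overhang` bases on each side."""
    flank, junction_len, _ = junction_window(read_len, min_overhang)
    end = start + read_len - 1
    if start < 1 or end > junction_len:
        return False
    left_overhang = flank - start + 1   # bases on the 5' partner side
    right_overhang = end - flank        # bases on the 3' partner side
    return left_overhang >= min_overhang and right_overhang >= min_overhang


flank, junction_len, last_start = junction_window()
valid_starts = [s for s in range(1, junction_len + 1) if is_junction_read(s)]
```

With these settings every read that fits on the 142 nt template automatically satisfies the overhang on both sides, which is why the valid start positions are exactly 1 to 67.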
To further assess the differential expression of unique chimeric mRNAs without the confounding effect of tissue, we used a cut-off that required junction events to be present in all samples and to be unique, without an overlap with other chimeras. This cut-off resulted in 87 unique chimeric events. The dispersion of the expression of each unique chimeric event across the samples was measured using the coefficient of variation (CV), the percentage ratio of the sample standard deviation to the sample mean of the junction reads for each event. Figure 2B represents the distribution of junction events along the CVs. The mean of the CVs was 57%, with a standard deviation of 14%, following a normal distribution (P>0.57, Kolmogorov-Smirnov test). This result implies that most of these unique chimeric events showed moderate variance in expression.
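As a worked illustration of the CV defined above, the following Python sketch computes the percentage ratio of the sample standard deviation to the sample mean for one junction event. The read counts are invented toy values, and the use of raw, unnormalised counts is an assumption; the text does not state whether counts were scaled by library size.

```python
import statistics


def coefficient_of_variation(junction_read_counts):
    """CV in percent: sample standard deviation / sample mean * 100.

    junction_read_counts: junction-read counts of one chimeric event,
    one value per sample (toy input, assumed unnormalised).
    """
    mean = statistics.mean(junction_read_counts)
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    # statistics.stdev uses the n - 1 (sample) denominator.
    return 100.0 * statistics.stdev(junction_read_counts) / mean


# Toy example: counts of 10, 20 and 30 reads in three samples.
# mean = 20, sample sd = 10, so CV = 50%.
cv_example = coefficient_of_variation([10, 20, 30])
cv_constant = coefficient_of_variation([5, 5, 5])
```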
We next examined the variation in the expression of chimeric events among pig breeds. A set of 49 nt single-end reads from three RNA-pooling samples of skeletal muscle was analysed in the same way as those from liver samples (Additional file 5, Additional file 6). These samples were obtained during embryo collection at slaughter. The first, second, and last samples were pooled using equivalent amounts of RNA from three adult female Wuzhishan, Tongcheng, and Landrace pigs, respectively. Pooling may remove the difference among female individuals to some extent. The mean of the CVs was 35%, with a standard deviation of 19%, spanning 0% to 89% and following a normal distribution (P>0.61, Kolmogorov-Smirnov test).
Distribution of potential shared DNA motifs in genomic regions
Subsequently, these shared sequences were submitted to the TOMTOM software in the MEME suite (4.6.1) for comparison against the database of known motifs. This database is the JASPAR CORE (version 12-Oct-2009), which contains a curated, non-redundant set of profiles derived from published collections of experimentally defined transcription factor binding sites for multi-cellular eukaryotes. In total, 81 shared sequences significantly matched known DNA motifs in the JASPAR CORE database (P<0.00065 and false motif discovery rate < 0.05, Additional file 9). Among these matched sequences, 6 were shared in the upstream regions of both partners (P<0.00009 and false motif discovery rate < 0.042). This finding suggests that the same or similar transcription factors would bind these potential shared DNA motifs to coordinate the transcription of parental genes, which may be necessary in generating chimeric mRNAs.
The CCCTC-binding factor (CTCF) is a versatile trans-acting factor that binds distal regulatory elements such as enhancers, and CTCF binding sites are commonly distributed along the vertebrate genomes [22–26]. Thus, we focused on computationally identifying potential CTCF binding sites shared in the two non-contiguous genomic regions of chimeric mRNAs. Four DNA motifs shared in parental genomic regions were significantly similar to known human CTCF binding sites (P<0.014 and false motif discovery rate < 0.029, Additional file 10). This result suggests that some trans-acting factors, such as CTCF, might bind these shared motifs to bring the distal genomic parts into proximity and create the subcellular environment for the generation of chimeric mRNAs. Communication between distal chromosomal elements would be an origin for the nuclear processes of gene fusions.
Several factors, including strand-switching, deep sequencing errors, or reference genome errors, would result in false positive results. Therefore, we rigorously inspected each chimera using several criteria. First, all the mRNAs used in the present study have prior biological information annotated in the UCSC and NCBI databases to avoid reference genome errors as much as possible. To remove false results from homologous, paralogous, or random spurious hits, strict filtering was performed on the high-quality alignments of mRNAs to the S. scrofa chromosomes. Trans-splicing sites were then inspected for each candidate to exclude strand-switching or the random connection of two cDNAs. In addition, 14 independent samples were used to evaluate the expression of the fusion transcripts. We could not completely rule out the possibility of the creation of a false fusion in the process of cDNA library construction. However, random breakage and rejoining of two cDNAs are unlikely to happen at the exact exon boundaries of two genes and simultaneously in multiple samples. Thus, although the present identification of chimeric RNAs filters out some genuine fusion gene transcripts through its stringent cut-offs, it is conservative and reliable.
During transcription in vivo, different genes frequently share the same transcription factory where nascent RNA production and RNA polymerase II seem to be localised [27, 28]. For example, the Igh on chromosome 12 is preferentially recruited to the same transcription factory where the Myc gene on chromosome 15 is highly transcribed. Many active genes can dynamically co-localise to shared sites of ongoing transcription, which may be induced by the classical effectors of gene expression including trans-acting factors, enhancers, chromatin modifications, and chromosomal interaction. For example, CTCF can create the dynamic nature of nuclear spatial organisation of different genes by binding to the elements on distal genomic regions or different chromosomes [25, 30, 31]. The recruitment of different genes into shared factories is expected to have a fundamental role in gene expression, which may efficiently share limited resources or perhaps coordinate the transcription of different genes.
In earlier common computational methods for identifying precursor genes, the gene with the best alignment to a chimeric mRNA was considered the precursor gene. However, the exons of different transcripts often overlap. For example, the 5′ partner of the chimeric mRNA AK343294 was precisely mapped on the exons of mRNAs AK233826, AK231250, and AK346646 in chromosome 5. Therefore, the choice of precursor mRNA would be ambiguous if multiple transcriptional start sites were present. Furthermore, the partners of chimeric mRNAs may be transcribed independently at their own transcriptional start sites that are not associated with other genes. Thus, determining which variant serves as the precursor gene would need further molecular experiments.
The present study provided detailed information on pig chimeric mRNAs and further analysed the expression of unique chimeras among samples. Interestingly, similar DNA sequences were widely shared in the two non-contiguous genomic regions of chimeric mRNAs. Similar DNA sequences shared in the upstream regions of both partners significantly matched the known transcription factor binding sites in the JASPAR CORE database, suggesting the potential coordinated transcription of the parental genes. In addition, possible CTCF binding sites were also observed in the parental genomic regions. We propose that trans-acting factors, such as CTCF, would induce the spatial organisation of parental genes to the same transcriptional factory so that parental genes would be coordinatively transcribed to give birth to chimeric mRNAs. Although this hypothesis needs further experimental evidence, it will provide useful information for the investigation of the mechanism for the generation of chimeric mRNAs. Overall, our results will aid in the further understanding of chimeric mRNAs.
The BED format table of all pig mRNAs was obtained from the UCSC Table Browser (February 2011) and analysed using Galaxy [33–35]. According to the annotation of that table, GenBank pig mRNAs were aligned against the pig genome (SGSC Sscrofa9.2/susScr2, Nov. 2009) using the Blat program. When a single mRNA was aligned in multiple places, the alignment with the highest base identity was found. Only alignments with a base identity level within 0.5% of the best and at least 96% base identity with the genomic sequence were kept (http://genome.ucsc.edu/). An entry in that BED table annotates a chromosomal locus of an mRNA. We extracted mRNAs aligned to two non-contiguous loci. To remove homologous, paralogous, or random spurious hits, we required alignments from non-contiguous loci without long similar sequences at the putative junction sites. To accommodate small errors that occur at the edges of the alignment, we only allowed overlaps or gaps of up to 10 nt within the fusion junction. Using the Circos software, we represented the genome-wide distribution of putative chimeric mRNAs in Figure 1.
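The two numeric filters in this step can be sketched in Python as follows. This is an illustration only, not the UCSC pipeline or the authors' scripts. The function names and input layout are hypothetical, and the junction check assumes that block boundaries are given as 1-based mRNA coordinates.

```python
def filter_alignments(alignments, min_identity=96.0, near_best=0.5):
    """Keep alignments of one mRNA that have at least `min_identity`
    percent identity and lie within `near_best` percentage points of
    the best alignment.

    alignments: list of (locus, percent_identity) tuples (hypothetical
    layout chosen for this sketch).
    """
    if not alignments:
        return []
    best = max(identity for _, identity in alignments)
    return [(locus, identity) for locus, identity in alignments
            if identity >= min_identity and best - identity <= near_best]


def junction_slack_ok(five_prime_end, three_prime_start, max_slack=10):
    """True if the two aligned blocks meet with a gap or overlap of at
    most `max_slack` nt on the mRNA.

    five_prime_end: last mRNA base aligned to the 5' partner locus.
    three_prime_start: first mRNA base aligned to the 3' partner locus.
    """
    gap = three_prime_start - five_prime_end - 1  # > 0 gap, < 0 overlap
    return abs(gap) <= max_slack


hits = [("chr1:100-500", 99.8), ("chr5:10-400", 99.4),
        ("chr7:1-300", 97.0), ("chrX:5-90", 95.0)]
kept = filter_alignments(hits)
```

In the toy input, only the two alignments within 0.5% of the best identity (99.8%) are kept; the 97.0% hit passes the 96% floor but is too far from the best.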
To validate putative chimeras by external experimental evidence, we aligned predicted chimeras to the EST sequences downloaded from the UCSC (May 2012) using the BLAST program (Basic Local Alignment Search Tool, version 2.2.26+) [38–40] with default parameters except at least 96% base identity. The candidate was retained for analysis when at least 20 nt of the sequence on either side of a putative fusion point overlapped ESTs. To compare with known human chimeras, we aligned pig putative chimeras to human chimeric mRNAs downloaded from the ChimeraDB 2.0 using the BLAST with default parameters (May 2012).
As previously described, we prepared an artificially fused genomic DNA sequence for putative chimeras by joining the genomic sequences of the 5′ and 3′ partners. The fusion transcript candidate was then aligned to the corresponding artificially fused genomic sequence using the SIM4 program (version 2002-03-03) with default parameters. The alignment around the fusion point was inspected to take into account consensus splice signals.
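The 20 nt flank criterion can be expressed as a small check on alignment coordinates. The sketch below is hypothetical and does not parse BLAST output. It assumes 1-based inclusive alignment coordinates on the chimera, takes the fusion point as the last base of the 5′ partner, and reads "either side" as requiring 20 nt on each side.

```python
def est_supports_junction(aln_start, aln_end, fusion_pos, min_flank=20):
    """True if one EST alignment covers at least `min_flank` nt on both
    sides of the fusion point.

    aln_start, aln_end: 1-based inclusive alignment span on the chimera.
    fusion_pos: 1-based position of the last base of the 5' partner.
    """
    left = fusion_pos - aln_start + 1   # bases on the 5' partner side
    right = aln_end - fusion_pos        # bases on the 3' partner side
    return left >= min_flank and right >= min_flank


def count_supporting_ests(alignment_spans, fusion_pos, min_flank=20):
    """Count the EST alignment spans that support the fusion point."""
    return sum(est_supports_junction(start, end, fusion_pos, min_flank)
               for start, end in alignment_spans)


# Toy spans around a fusion point at position 300 of the chimera.
spans = [(281, 320), (282, 320), (250, 400), (100, 310)]
n_support = count_supporting_ests(spans, fusion_pos=300)
```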
We obtained total RNAs from Tongcheng pig tissues (liver, kidney, spleen, heart, lung, testis, ovary, embryo, skeletal muscle, small and large intestine) using the RNA Extraction Kit (BioTeke). The cDNA was prepared by reverse transcription using the Strand cDNA Synthesis Kit (BioTeke) with random hexamer priming and oligo dT’s. PCR products covering the junction position were amplified using primers designed according to the hybrid transcripts (Additional file 2). PCR amplification was performed using the following thermocycling protocol: initial denaturation at 95°C for 4 min, followed by 30 cycles of denaturation at 95°C for 30 s, annealing at 60°C for 30 s, and elongation at 72°C for 30 s. The PCR products were then analyzed, cloned, and sequenced.
Up to 400 million sequence reads from deep sequencing the transcriptome of pigs were recently acquired in our lab. In brief, the following steps were used for transcriptome sequencing using the Illumina Genome Analyser II at Shanghai Biotechnology Co., Ltd. We isolated mRNA from 10 μg of total RNA with an RNA integrity number (RIN) ≥ 8. The isolated mRNA was then fragmented and converted into double-stranded cDNA. The ends of cDNA were ligated to adapters. The fragments with 200 to 300 base pairs in length were amplified by PCR to make a library. Finally, the library was sequenced to yield single-end reads.
A set of reads was derived from the transcriptome of the liver tissue samples obtained from 11 adult Bama miniature pigs (five males and six females, Additional file 5). Reads with a Phred quality score lower than 20 were filtered out. The length of the reads from eight pigs was 76 nt, whereas that from the other three pigs was 101 nt. To obtain uniform read lengths, the 101 nt reads were trimmed from the low-quality (right) end to 76 nt before mapping. The remaining reads were aligned to the pig genome (SGSC Sscrofa9.2/susScr2, Nov. 2009) using the Bowtie software (version 0.12.8) with default parameters except maximum two mismatches, unique mapping, and trimming from 101 to 76 nt for the three samples.
The present version of the Bowtie program (version 0.12.8) does not report gapped alignments. Thus, a read mapped on the genome was derived from a contiguous locus in the genome. However, some unmapped reads may arise from non-contiguous genomic loci, making them suitable for inspecting splice junctions. The unmapped reads were further aligned to the putative chimeric mRNAs by the Bowtie program with default parameters except maximum two mismatches and trimming from 101 to 76 nt for the three samples. The previously unmapped reads that matched the putative junctions with an overlap of at least 5 nt on either side of the RNA junction were retained for further analysis.
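The trimming step amounts to cutting each 101 nt read and its quality string to the first 76 positions. In the study this was done through Bowtie's own trimming option; the following Python sketch only illustrates the operation on FASTQ records, with a hypothetical function name and toy input.

```python
def trim_fastq_records(lines, target_len=76):
    """Yield (header, sequence, quality) for each 4-line FASTQ record,
    with sequence and quality cut to `target_len` from the 3' (right) end.

    lines: list of FASTQ lines (header, sequence, '+', quality, ...).
    Reads already at or below `target_len` are returned unchanged.
    """
    for i in range(0, len(lines) - 3, 4):
        header, seq, _plus, qual = (line.rstrip("\n")
                                    for line in lines[i:i + 4])
        yield header, seq[:target_len], qual[:target_len]


# Toy input: one 101 nt read and one 76 nt read.
fastq = ["@read1\n", "A" * 101 + "\n", "+\n", "I" * 101 + "\n",
         "@read2\n", "C" * 76 + "\n", "+\n", "I" * 76 + "\n"]
trimmed = list(trim_fastq_records(fastq))
```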
Another set of 49 nt single-end reads from three equivalently pooled RNA samples of skeletal muscle was analyzed as described above (Additional file 5). These samples were extracted during embryo collection at slaughter. The first, second, and last samples were pooled using equivalent amounts of RNA from three adult female Wuzhishan, Tongcheng, and Landrace pigs, respectively.
CV was calculated to represent the variance in the expression. The reads uniquely mapped on the pig genome and the junction reads were pooled together to reveal the read coverage along the transcript. The RS test was used to evaluate the difference in the expression levels between the male and female samples.
The MEME software (version 4.6.1) with default parameters (except DNA alphabet, zero or one occurrence of each motif per sequence, motif width between 10 and 30 nt, and maximum one motif to find) was used to search for similar DNA sequences within the two non-contiguous genomic sequences of chimeric mRNAs. Then, using the TOMTOM tool, similar DNA sequences were compared with the database of 476 known motifs, the JASPAR CORE (version 12-Oct-2009).
We thank Joshua Liao for his advice and two anonymous reviewers for their helpful suggestions on the manuscript. We also thank the China Postdoctoral Science Foundation (20110490045), the National Natural Science Foundation of China (30830080 and 31172189), and the Science Foundation of the Shihezi University (RCZX201137) for their support.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.