Genome-wide cataloging and analysis of alternatively spliced genes in cereal crops

Background Protein functional diversity at the post-transcriptional level is regulated through spliceosome mediated pre-mRNA alternative splicing (AS) events and that has been widely demonstrated to be a key player in regulating the functional diversity in plants. Identification and analysis of AS genes in cereal crop plants are critical for crop improvement and understanding regulatory mechanisms. Results We carried out the comparative analyses of the functional landscapes of the AS using the consensus assembly of expressed sequence tags and available mRNA sequences in four cereal plants. We identified a total of 8,734 in Oryza sativa subspecies (ssp) japonica, 2,657 in O. sativa ssp indica, 3,971 in Sorghum bicolor, and 10,687 in Zea mays AS genes. Among the identified AS events, intron retention remains to be the dominant type accounting for 23.5 % in S. bicolor, and up to 55.8 % in O. sativa ssp indica. We identified a total of 887 AS genes that were conserved among Z. mays, S. bicolor, and O. sativa ssp japonica; and 248 AS genes were found to be conserved among all four studied species or ssp. Furthermore, we identified 53 AS genes conserved with Brachypodium distachyon. Gene Ontology classification of AS genes revealed functional assignment of these genes in many biological processes with diverse molecular functions. Conclusions AS is common in cereal plants. The AS genes identified in four cereal crops in this work provide the foundation for further studying the roles of AS in regulation of cereal plant growth and development. The data can be accessed at Plant Alternative Splicing Database (http://proteomics.ysu.edu/altsplice/). Electronic supplementary material The online version of this article (doi:10.1186/s12864-015-1914-5) contains supplementary material, which is available to authorized users.


Background
Spliceosome mediated post-transcriptional modifications are the biggest challenges in understanding and predicting the degree of certainty and complexity of the proteome diversity [1,2]. One of the most important mechanisms that contribute to the diversity in the protein isoforms is alternative splicing (AS), thus modulating the protein function as a consequence of the linking of the functional units (exons and introns) in a ubiquitous manner [3]. In addition, to the observed alternative splicing sub-types such as exon skipping (ES), alternative donor (AltD) or acceptor (AltA) site, and intron retention (IR), various complex types can be formed by combination of basic events [4,5]. Apart from the four basic events, alternative transcripts may arise as a consequence of the alternative transcription initiation, alternative transcription termination, and alternative polyadenylation [2]. AS isoforms might encode distinct functional proteins, or might be nonfunctional, which harbor a premature termination codon. These nonfunctional isoforms generated through the process called "regulated unproductive splicing and translation" are degraded by a process known as nonsense-mediated decay [6].
Previous reports estimated around 90 % of human genes containing multiple exons are alternatively spliced [7,8]. In line with the observed reports in humans, alternative splicing has been shown to be a major player in generation of the plant proteome diversity with 60 % of Arabidopsis thaliana multi-exon genes undergoing alternative splicing [9]. Genome-wide identification and physiological implications of AS have been reported in a number of model and non-model plant species including A. thaliana [10][11][12][13], Oryza sativa [14], Nelumbo nucifera (sacred lotus) [15], Vitis vinifera [16], Brachypodium distachyon [5,17]. AS transcripts are generally generated through three pathways: (1) IR in the mature mRNA; (2) alternative exon usage (AEU), resulting in ES; and (3) the use of cryptic splice sites that may elongate or shorten an exon that generates AltD or AltA site or both [14,17]. Approximately 60-75 % of AS events occur within the protein coding regions of mRNAs, resulting changes in binding properties, intracellular localization, protein stability, enzymatic, and signaling activities [18]. In plants, IR has been shown to be the most dominant form with reports suggesting the proportions of intron containing genes undergoing AS in plants ranged from 30 % to >60 % depending the depth of available transcriptome data [4,5]. On contrast, recent reports suggest the down-regulation of the IR events and upregulation of the alternative donor/acceptor site (AltDA) and ES under heat stress in model Physcomitrella patens [19]. With the advent of the Next Generation Sequencing (NGS) based approaches, fine scale physiological implications revealed alternative splicing as the prominent mechanism, which regulates the microRNA-mediated gene regulation by increasing the complexity of the alternative mRNA processing in Arabidopsis [20]. Complex networks of regulation of gene expression and variation in AS has played a major role in the adaptation of plants to their corresponding environment and additionally in coping with environmental stresses [13].
Rice (O. sativa ssp japonica and indica), maize (Zea mays), and sorghum (Sorghum bicolor) are important cereal crops as major sources of food in many countries. Previously several approaches have widely demonstrated the identification of the quantitative trait loci, genes and proteins linked to the functional grain content in these species [21]. However, a major portion of the gene functional diversity is controlled by a spliceosomal regulated AS. AS has been shown to be a critical regulator in grass clade, demonstrating several of the genes involved in flowering and abiotic stress depicting alternative splicing [4,17,22]. Identifying alternative splicing genes in these cereal plants is the first step toward understanding the functions and regulations of these genes in plant development and abiotic or biotic stress resistance. Previously, using the homology based mapping approach and expressed sequence tags (ESTs) representing the functional transcripts, we identified a total of 941 AS genes in B. distachyon, a model temperate grass [5,17]. Previous and recent reports on the identification and prevalence of the alternative splicing events in O. sativa [4,23], S. bicolor [24], and Z. mays [25] have shown the functional diversity changes through EST/RNA-seq approaches. Previous report by Ner-Gaon et al. suggested a 3.7-fold difference in AS rates between O. sativa and S. bicolor using EST pairs gapped alignment [26]. The lack of the identification of the comparative AS events in cereal plants and realizing the importance of these functional foods in climate changes, we attempted to carry out the large scale analysis using the so far currently ESTs and mRNA based information in cereal plants to identify species specific and conserved AS events across cereal plants. In this work, we compared the AS event landscape and the AS gene functional diversity in cereal plants, which includes O. sativa ssp japonica and indica, S. bicolor and Z. mays, with a much deeper coverage of the identified AS events and also comparatively analyzed these AS genes with AS genes identified from B. distachyon to reveal conserved patterns of the AS across the grass species. Identified AS events will allow for the experimental characterization of the AS genes involved in important physiological processes. Investigation of the genome-wide conserved AS events across different species will shed light on the understanding of the evolution of the functional diversity in cereal plant for crop improvement.

Sequence datasets and sequence assembly
To identify the putative functional transcriptional changes across the Panicoideae lineage, we systematically queried and downloaded expressed sequence tags (ESTs) and mRNA sequences of O. sativa ssp japonica and indica, S. bicolor, and Z. mays from the dbEST and nucleotide repository of National Center for Biotechnology Information (NCBI; www.ncbi.nlm.nih.gov). Prior to aligning the ESTs/mRNAs to the corresponding genomic sequence, we applied stringent cleaning procedure using the strategy outlined below: 1) ESTs and mRNA sequences were subsequently cleaned using EMBOSS "trim" tool for trimming of the polyA or polyT ends; 2) Cleaned and trimmed ESTs and mRNA sequences were blasted using the BLASTN against UniVec and E. coli database for removal of vector and E. coli contaminants; 3) BLASTN searches against the plant repeat database which was built with TIGR gramineae repeat data and species specific repeat data including sorghum, maize, and rice available from ftp://ftp.plantbiology.msu.edu/ pub/data/TIGR_Plant_Repeats/. Following stringent cleaning procedure, we assembled rice and sorghum cleaned EST and mRNA sequences using CAP3 with the following parameters: −p 95 -o 50 -g 3 -y 50 -t 1000 [27]. In case of the maize data, owing to the large number of available ESTs for this species, which is difficult to assemble, we followed an alternative way of assembling those ESTs. We first mapped ESTs and mRNA sequences to each individual chromosome of the maize genome using GMAP with default settings [28], and then chromosome specific-mapped ESTs and mRNAs were assembled individually using CAP3 with the parameters as mentioned above. The unmapped data and all assembled data from each individual assembly were combined and then re-assembled using CAP3 to generate a final consensus assembly for the further identification of the AS events. The raw data and assembled data for each organism were summarized in Table 1. For the prediction of the AS events, genome sequences, predicted protein coding DNA sequences (CDS), and related GFF data of O. sativa ssp japonica, Z. mays, and S. bicolor were downloaded from Phytozome database (http://www.phytozome.net/) [29][30][31][32]. The genome sequences and CDS data of O. sativa ssp indica (strain 93-11) were downloaded from BGI database (http:// rise2.genomics.org.cn/page/rice/index.jsp) [33].
Putative unique transcripts to genome mapping, identification and functional annotation of AS isoforms In the present study, taking into the account the genome duplication events in Z. mays and S. bicolor, accurate prediction of the alternative splicing events is a major concern over the decades. In our study, calling and predicting alternative splicing events is taken into account by mapping of EST and mRNA assemblies, i.e. putative unique transcripts (hereafter simply referred them as PUTs), to the corresponding genomic sequences were carried out using in-house developed algorithm, ASFinder (http://proteomics.ysu.edu/tools/ASFinder.html/) [34], which uses SIM4 program [35] to map PUTs to the corresponding genome and then subsequently identifies those PUTs that are mapped to the same genomic location but have variable exon-intron boundaries as AS isoforms. To avoid the call of the spurious alternative splicing events, we applied a threshold of minimum of 95 % identity of aligned PUT with a genomic sequence, a minimum of 80 bp aligned length, and >75 % of a PUT sequence aligned to the genome [17]. Application of the above identity percentage and the aligned length removes the chance of the false positive AS events calling as a result of genome duplication events. The output file (AS.gtf) of ASFinder was then subsequently submitted to AStalavista server (http://genome.crg.es/astalavista/) for AS event analysis [36]. The percentage of alternative splicing genes was estimated using the genome predicted gene models having alternative splicing PUT isoforms among total genes models having at least one PUT, the results were presented in Table 2.
We further queried the coding potential and corresponding coding frame of each PUT using the ORF-Predictor [37], and to assess the full-length transcript coverage using TargetIdentifier [38] as previously described. Functional classification was assigned to the PUTs by performing BLASTX searches with an Evalue threshold of 1E-5 against UniProtKB/Swiss-Prot. Predicted protein sequences from ORFPredictor were further annotated using rpsBLAST against the PFAM database (http://pfam.xfam.org/). Gene Ontologies (GOs) were assigned on the basis of the functional homology obtained by the BLASTX searching algorithm against the UniProtKB/Swiss-Prot. The GO categories were further analyzed using GO SlimViewer using plant specific GO terms [39]. To assess the functional coverage of the assembled PUTs, we further compared PUTs against the predicted gene primary transcripts using BLASTN with a cut off E-value of 1E-10, ≥ 95 % identity and minimum aligned length of 80 bp.

Conserved alternatively spliced genes in cereal plants and visualization of AS
For the identification of the potentially conserved AS genes among O. sativa ssp japonica and indica, Z. mays and S. bicolor, reciprocal BLASTP (cutoff E-value 1E-10) were done using the longest (or longer) ORF of the AS PUT isoforms for classifying the conserved AS pairs between species or sub-species. Venn graphical visualization for conserved AS pairs were obtained using R programming language (http://www.r-project.org/). Visualization of the alternative splicing events with genome tracks is critically important from two points of views: (1) To have a graphic look at the corresponding genomic coordinate and associated genic functional changes; and (2) To extract the corresponding spliced region of interest for functional primer designing of putative AS events. Keeping in view the above points, AS events identified in this study along with the integrated genomic tracks are available from Plant Alternative Splicing Database (http://proteomics.ysu.edu/altsplice/) [15,17]. The specific pages associated with the cereal plants offer several end-users functionalities such as querying using the PUT ID, gene ID, keywords in functional annotation, PUTs putative unique transcripts PFAM, or AS event types as "query fields". Additionally, the identified AS events can be visualized and compared with predicted gene models using GBrowse for comparative assessment. Nevertheless, we also deployed BLASTN functionality to search for the PUTs and AS isoforms. The data analyzed along with the GO and PFAM annotations in the present research are publicly available at: http://proteomics.ysu.edu/publication/data/.

EST assembly and annotation
Optimization of the assembly parameters and mapping functionally annotated PUTs is a key parameter to provide a robust identification and classification of the AS events. Table 1  To check for the coverage of the assembled functional transcriptome, we further checked for the functional assignments and all the assembled PUTs were structurally and functionally annotated including putative open reading frame (ORF) prediction, coding region full-length prediction, a putative function and PFAM prediction, which ensures the reliability of the assembly strategies in case of large complex ploidy genomes underwent whole genome duplication events. PUTs were mapped to their corresponding genomes and predicted gene models were also visualized using GBrowse.
Gapped alignments of PUTs to genome, detection and classification of alternative splicing events Following the sequence assembly, resulting unique PUTs were mapped onto their corresponding genomic sequences using gapped alignments as implemented in SIM4 method that was integrated as part of ASFinder [34]. The numbers of mapped PUTs and matched gene models, as well as the number of the observed AS genes are presented in Table 2 Table 3). The percentage of AS genes was estimated based on the proportion of predicted gene models having AS PUT isoforms over the total gene models having an EST (PUT) evidence ( Table 2). The percentages of AS genes vary in different cereal plants, up to 30.1 % in O. sativa ssp japonica and 33.8 % in Z. mays, and relatively low in O. sativa ssp indica (13.9 %) and in S. bicolor (13.5 %). The difference in the mapping rate and AS rate might be due to the difference in the number of ESTs available for respective species. Previous reports on AS in B. distachyon clearly illustrates the fact that availability of the more ESTs/mRNAs reflects the prediction of the AS landscape [5,17]. Recent reports using the RNA-seq technology revealed that AS is common in plants-around 61 % of multiexonic genes in A. thaliana are alternatively spliced under normal growth conditions [12], and~40 % of  intron containing genes that undergo AS in maize [25]. Classification of the AS events observed in the cereal plants are listed in Table 3 showing the prevalence of the IR as the major splicing type showing frequency as high as 55.8 % in O. sativa ssp indica and as low as 23.5 % in S. bicolor (Table 3). The high frequency of the IR in the mature mRNA is perfectly in line with the previously observed frequencies of IR (30-50 %) in AS landscape in A. thaliana and O. sativa [14]. It is worthwhile to mention that plant spliceosomal machinery supports the intron definition model, thus identifies the introns for pre-mRNA splicing as oppose to the abundant exon-spliceosome model observed in case of mammals. Previous arguments have clearly justified the cause and benefits of retaining the introns as potential cytoplasmic translatable transcripts [26] or as mediators of increasing the gene expression, a process widely described as intron-mediated enhancement (IME) of gene expression [40].  [14,17,25,41,42]. In contrast, recently IR has been found remarkably repressed under elevated temperature in P. patens [19]. Alternative acceptor (AltA) and donor (AltD) represent the second most abundant and classified functional class of observed AS events with AltA showing a relatively higher frequency as compared to AltD (Table 3). Although ES events have been described as the rarest events in plants, which are in line with the observed results in this study, recently they have been proposed as the candidates of the transgene regulation using the conditional splicing [43]. We noted that 61.7 % events are complex events in sorghum, which have more than one basic event in compared paired PUTs. This is clearly related to the relative longer lengths of the PUTs in sorghum assembly. Recent reports suggest the differential up-regulation of the alternative donor/acceptor site (AltDA) and ES elucidating the importance of these events as indicators of early heat stress [19].
Our data in this work clearly showed that the number of AS genes and the percentage of genes with AS are different in different crops (Tables 2 and 3). However, this observation only reflects the current state in these plants based on the available data. Our previous analysis on AS in B. distachyon clearly demonstrated that more AS genes were identified with more available ESTs/ mRNA data [5,17]. This is also consistent with the finding of increasing frequency of occurrence of AS in Arabidopsis with time-a reflection of an accumulation of available transcriptome data, for example, only 1.2 % of the genes in Arabidopsis were reported undergo AS in 2003 and now it was estimated over 60 % of introncontaining genes undergo AS [13].

Features of exons and introns in protein coding genes: indicators of gene evolution
Understanding the patterns of gene evolution and identifying signatures of convergent and divergent evolution is of paramount importance, especially when we are addressing the genome complexity in terms of gene evolution. Exon-intron framework properties such as length distribution and GC content evolution have been previously used to demonstrate the gene evolution [44]. Additionally, longer introns as compared to short introns have been shown to play an important role in the gene expression [40,45]. However, reports by Yang [46] demonstrate the negative correlation of the long introns with the levels of the expression in A. thaliana and O. sativa. Realizing the importance of the features of exon-intron in evolution and physiological responses, we extracted and plotted the length distribution of all internal exons and introns from each plant and the results are summarized (Table 4; Fig. 1; Fig. 2). Interestingly, we observed that the average internal exon lengths in O. sativa ssp indica and Z. mays are almost similar, and are relatively much shorter than the internal exon lengths in O. sativa ssp japonica and S. bicolor. On the other hand, Z. mays had the longer intron length (554 bp) and showed a wide variation in intron lengths as compared to the observed range of intron lengths (422-440 bp) in other three cereal plants in this study. We further analyzed deeply the exon size and intron size distribution frequencies demonstrating that Z. mays and O. sativa ssp indica had a relatively much higher proportion of internal exons of a smaller size (<120 bp) (Fig. 1). The observed frequency  (Fig. 2). Prevalence of the introns richness and specifically long introns have been previously been shown to be widely associated with the increased expression of Adh1, Sh1, Bz1, Hsp82, actin, and GapA1 genes in Z. mays [47][48][49][50][51] and salT, Act1, and tpi genes in rice [52,53]. Additionally, a relative higher proportion of introns having a shorter length were observed in S. bicolor.
We also observed~2 % introns in maize and a small number of introns (<0.5 %) in other plants having a size >10 kb. However, taking into account the possible errors in PUT and genome assembly, these long introns were not included in the calculation of the average intron size. It is worthwhile to mention that the average internal exon size (180 bp) and intron size (440 bp) in O. sativa ssp japonica obtained in this work were close to the  [14].

Functional classification of AS genes
AS and gene regulation can be observed at almost all levels of biological interactions [54]. The AS transcripts identified in the present study were functionally annotated for the Gene Ontologies (GOs) and for putative protein domains association by performing a BLASTX search of all PUTs against UniProt/Swiss-Prot database. The ORFs of PUTs were identified using ORFPredictor webserver [37]. The protein families of the AS genes, using the longest ORF of each AS gene, were predicted using rpsBLAST searching PFAM database. Among predicted ORFs of these AS genes, 6,900 in Z. mays, 4,939 in O. sativa ssp japonica, 1,362 in O. sativa ssp indica, and 2,890 in S. bicolor were classified with a putative protein family (Table 5, Additional file 1: Table S1). We further classified AS gene functional products into 2,030 Note: a complete list is shown in Additional file 1: Table S1 unique protein families in Z. mays, 1,708 unique protein families in O. sativa ssp japonica, 757 unique protein families in O. sativa ssp indica, and 1,194 unique protein families in S. bicolor. Among the protein functions, encoded by these AS genes, widely includes protein kinase domain, RNA recognition motif, protein tyrosine kinase, ring finger domain, cytochrome P450, Myb-like DNA-binding domain, WRKY DNA-binding domain, Thioredoxin and protein phosphatase 2C (Table 5). A complete list of all the protein families encoded by AS genes is shown in Additional file 1: Table S1. Our analysis demonstrated that AS genes in cereal plants encode diverse protein families that play important roles in various biological processes. A classical example can be WRKY-DNA binding domains, which represents the largest and functionally diverse transcription factors in plants playing a major role in developmental and physiological processes. Previous studies have widely demonstrated the presence of the alternative ORF in the WRKY genes [55,56]. Yang et al. [57] and Feng et al. [58] have clearly highlighted the role of the alternative splicing and WRKY in plant immunity. Previous functional studies have shown the presence of the splicing of the R-type intron and V-type intron in O. sativa WRKY genes and functionally correlated them to plant immunity [59]. MYB-domains play an important role in plant defense mechanism and are transcriptionally regulated by alterative splicing in A. thaliana and O. sativa and encode MYB-or MYB-related proteins [60]. Alternative splicing of MYB related genes MYR1 and MYR2 have clearly demonstrated the change in protein dimerization and folding as a consequence of alternative splicing thus affecting the transcriptional sensitivity in light mediated responses [61]. GO analysis according to biological and molecular function revealed a wide visibility in all the major biological and molecular functions ( Table 6; Table 7). Interestingly, even the data we collected are from pooled data Table 6 Classification of biological processes based on Gene Ontology (GO) in the public domain, i.e., not from a strictly controlled experiment, our GO analysis revealed that relative to the average of AS percentage, a higher percentage of genes involved in response to abiotic stimulus, photosynthesis, carbohydrate metabolic process, and cell death are involved in AS in cereal plants. In contrast, the genes involved in multicellular organismal development and reproduction had a lower percentage of AS (Table 6). GO molecular function analysis revealed that genes encoding proteins having DNA binding, sequence-specific DNA binding transcription factor activity, nuclease activity had a lower percentage of AS, and the gene coding proteins for protein binding and having kinase activity had a higher percentage of AS in the majority of plants (Table 7). Our observed results are consistent with literature reviewed recently by Reddy et al. [4] and Staiger and Brown [22] that AS is involved in most plant processes and plays regulated roles in plant development and stress responses.

Conserved alternatively spliced genes
Classification of the conserved alternative splicing events provides a framework for understanding the evolution of the functional genes and their genic-regulation at the transcriptional level, which may initiate the cross-talks between the evolution of the genes under AS and between the transcriptional environment and the ecological adaptation. For the identification of the conserved AS pairs, longest ORFs of AS genes in each studied species were compared using the BLASTP (cutoff E-value 1E-10) to identify the best-reciprocal top hit as the conserved pairs. In total, we identified 1558 AS genes conserved between O. sativa ssp japonica and indica, 3,246 AS genes conserved between O. sativa ssp japonica and Z. mays, and 1,967 AS genes between S. bicolor and Z. mays (Additional file 2: Table S3). A total of 887 AS genes are conserved among Z. mays, S. bicolor, and O. sativa ssp japonica. More importantly, we identified 248 AS genes conserved among all four plants (Fig. 3). Furthermore, Table 7 Classification of molecular functions based on Gene Ontology (GO) Fig. 3 Conserved alternative splicing genes in rice (Oryza sativa) ssp japonica, rice ssp indica, sorghum (Sorghum bicolor), and maize (Zea mays) plants using the same approach, we identified a total of 53 AS genes conserved with B. distachyon belonging to BEPclade of grass evolution. The co-orthologous conserved 53 AS genes are listed in Table 8. The set of co-orthologs 248 AS genes conserved in the four plants, with 53 of them conserved to B. distachyon, are provided in Additional file 3: Table S2 (can be downloaded at http://proteomics.ysu.edu/altsplice/). Interestingly, one of the candidates among the conserved gene is Drought-induced protein (Di19). It has been previously suggested that the presence of the retained intron within the coding sequence may give rise to the non-sense mediated decay (NMD) [62]. Recent studies highlight the role of cycloheximide in introducing pre-mature termination codons (PTCs) and NMD in A. thaliana Di19, indicating the splicing mechanism in Di19 [63]. Identification of the Di19 mediated splicing will be of critical importance in increasing the drought resistance or increasing the captive yield of the cereal plants, which are acting as major suppliers of food in climate change. As current analysis were based on the pooled EST/mRNA sequences available in the public domain, more biologically functionally conserved AS genes will be identified when more transcriptome data are collected with improved technologies, various environmental conditions, developmental stages and tissues in these cereal crops. The present data is of immense potential for experimental validation and highlights the role of the AS and biological significance in plant, growth development and environmental regulation, which is a standing challenge in climate change.

Conclusions
In the present work, we investigated the functional landscape of the four most important cereal plants O. sativa ssp indica and japonica, S. bicolor and Z. mays using the updated EST and mRNA sequences available in NCBI thus bridging the knowledge gap and updating the conserved AS catalog with functional elucidation. The availability of the conserved AS genes among the four cereal plants will facilitate to understand the regulation of the alternative physiological processes in global climate change biology and their subsequent impact on the genic-environmental interactions.