Genome-wide analysis of alternative splicing in cow: implications in bovine as a model for human diseases
© Chacko and Ranganathan. 2009
Published: 3 December 2009
Alternative splicing (AS) is a primary mechanism of functional regulation in the human genome, with 60% to 80% of human genes being alternatively spliced. As part of the bovine genome annotation team, we have analysed 4567 bovine AS genes, compared to 16715 human and 16491 mouse AS genes, along with Gene Ontology (GO) analysis. We also analysed the two most important events, cassette exons and intron retention, in 94 human disease genes and mapped them to the bovine orthologous genes. Of the 94 human inherited disease genes, a protein domain analysis was carried out for the transcript sequences of 12 human genes that have orthologous genes and have been characterised in cow.
Of the 21,755 bovine genes, 4,567 genes (21%) are alternatively spliced, compared to 16,715 (68%) in human and 16,491 (57%) in mouse. Gene-level analysis of the orthologous set suggested that bovine genes show fewer AS events compared to human and mouse genes. A detailed examination of cassette exons across human and cow for 94 human disease genes suggested that a majority of cassette exons in human were present and constitutive in bovine, as opposed to intron retention, for which 50% of the exons were present and 50% absent in cow. We observed that AS plays a major role in disease implications in human through manipulations of essential/functional protein domains. It was also evident that the majority of these 12 genes had all essential domains conserved in their bovine orthologous counterparts for these human diseases.
While alternative splicing has the potential to create many mRNA isoforms from a single gene, in cow the majority of genes generate two to three isoforms, compared to six in human and four in mouse. Our analyses demonstrated that a smaller number of bovine genes show greater transcript diversity. GO definitions for bovine AS genes provided 38% more functional information than currently available in the sequence database. Our protein domain analysis helped us verify the suitability of using bovine as a model for human diseases and also recognize the contribution of AS towards the disease phenotypes.
Protein diversity in eukaryotic genomes is mainly credited to alternative splicing (AS). It is a fundamental mechanism by which a single pre-mRNA can produce more than one transcript. It is also considered by many to be an important mechanism for controlling gene expression. The introns in the pre-mRNA are spliced out and the exons are joined in different combinations, leading to a change in the primary transcript structure. This change in transcript structure can affect the encoded protein, thereby disrupting its structure and also its function. The disruption in protein structure and function brought about by AS is frequently associated with diseases. Results from previous studies indicate that more than 60% of human genes are alternatively spliced [3–9].
The association of AS with many diseases, such as cardiovascular, cancer and neurodegenerative disorders, highlights the need for an in-depth study of AS. Analyses have also shown that 15% of point mutations that cause genetic disease affect pre-mRNA splicing, providing a link between AS events and inherited genetic diseases.
Large-scale sequencing of eukaryotic genomes, together with the knowledge that AS is an important player in gene regulation, has led to several efforts [3–9] to create bioinformatics resources on alternative transcripts and protein isoforms. However, previous analyses comparing the rate of alternative splicing across organisms have produced conflicting results, at odds with the genome-wide computational analyses reported by AS databases. In a large-scale expressed sequence tag (EST) analysis across distinct eukaryotes, Brett and coworkers found that all vertebrates and invertebrates showed a similar rate of alternative splicing with respect to both the number of genes affected and the number of variants per gene. On the contrary, considerable variation in the rates of alternative splicing across organisms was reported by Lee and co-workers. Understanding the phenomenon of AS is difficult, as these databases do not provide sufficient information for multi-gene comparison across various species. ASAP II concentrates mainly on comparative and evolutionary studies. ECGene provides functional annotation for AS genes in various genomes. The Alternative Splicing Transcript Database (ASTD) [3, 4] provides an exhaustive analysis of AS events in three species, namely human, mouse and rat. Representing the transcripts and their relation to each other has become extremely complicated due to the increasing number of transcripts for each gene. This has prompted the application of graph theory to the representation of gene transcripts. Among many other solutions, graph theory is a prominent concept that has been used to express transcripts and capture their relations. The language of graph theory offers a mathematical abstraction for the description of biological relationships. Modrek and Lee used directed acyclic graphs for EST analysis, with the genomic DNA sequence as reference.
Pevzner and coworkers were the first to use de Bruijn graphs to depict the transcripts alone, without referring to the genomic DNA sequence, where the maximum common sub-sequences between transcripts were condensed into nodes and the variable regions connected by edges. The Alternative Splicing Gallery (ASG) resource uses such an approach.
Our group has used directed acyclic splicing graphs, without a genomic DNA sequence as reference, with exons as nodes, interconnected by introns as edges, where the paths through the splicing graph represent the transcripts. This scheme was applied to the genome-wide analysis of Drosophila melanogaster, leading to the DEDB data resource. Here, the first transcript served as a reference sequence to generate splicing graphs, with automatic rule-based classification of splicing events. To reduce the uncertainty in selecting the primary transcript, this methodology was further enhanced. The most conserved exons in all transcripts of a given gene were chosen to be distinct reference exons and all others were considered to be variant exons. To generate a splicing graph from a set of transcripts for a given gene, we developed the Alternative Splicing Graph Server (ASGS).
As a part of the bovine genome annotation team, we have used comparative genomics in order to associate alternative splicing patterns in human and mouse to cow. Comparative genomics studies the correlation between genome structures and functions across different biological species. It aims at understanding many aspects of the evolution of modern species.
The intermediate evolutionary distance between human and bovine is 70-100 Myr. The bovine model has been found to be relevant to human health research priorities such as obesity, female health and communicable diseases. Cow provides a valuable biological model in these significant areas because of the vast amount of research that has been conducted with respect to genetic and environmental interactions associated with complex, multi-genic physiological traits. The Cetartiodactyl order of mammals, to which cattle and all other ruminants belong, is phylogenetically distant from the primates, and thus contains invaluable information for understanding human genome evolution.
In this study, we have analysed transcripts for each gene in the bovine genome. Since the bovine genome is not yet completely annotated, we minimized any gene structure bias in the input data by carrying out comparative genome analysis on the orthologous subset of AS genes for the three species. We present here the comprehensive analysis of all bovine, human and mouse transcripts based on splicing graphs. AS events in these three genomes and their functional significance in terms of gene ontology (GO) classifications were also identified. The two main AS events (cassette exons and intron retention) in the human disease genes (94) from the NCBI Genes and Disease database were mapped onto their respective bovine orthologous genes. A protein domain analysis on 12 human disease genes that are known to occur in cow was vital in providing significant insights into the effects of AS on protein structure and function.
For AS analysis, the GTF files for Bos taurus, Homo sapiens and Mus musculus were extracted from Ensembl ver. 54. Each line in the Gene Transfer Format (GTF) file describes one feature of a transcript: an exon, a coding sequence segment, a start codon or a stop codon. For our analysis, we extracted only the protein-coding genes and eliminated the pseudogenes and mitochondrial genes. The unspliced transcript sequences were also obtained from Ensembl for cow to analyse the splice site motifs.
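The filtering step described above can be illustrated with a minimal Python sketch. It is not the pipeline used in this study. It assumes the standard nine-column, tab-separated GTF layout, that the biotype is available either as a `gene_biotype` attribute or in the source column (column 2), and that mitochondrial records carry the sequence name `MT`. The function names `parse_gtf_exons` and `alternatively_spliced` are hypothetical.

```python
import re
from collections import defaultdict

# Matches attribute pairs of the form: key "value"
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')

def parse_gtf_exons(lines, mito_names=("MT",)):
    """Collect exon coordinates per gene and transcript from GTF lines.

    Keeps only protein-coding, non-mitochondrial exon records.
    Assumption: biotype is in a gene_biotype attribute, else in column 2.
    """
    genes = defaultdict(lambda: defaultdict(list))
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9:
            continue  # skip malformed records
        seqname, source, feature, start, end, _score, strand, _frame, attrs = fields
        if feature != "exon" or seqname in mito_names:
            continue
        attr = dict(ATTR_RE.findall(attrs))
        if attr.get("gene_biotype", source) != "protein_coding":
            continue
        genes[attr["gene_id"]][attr["transcript_id"]].append(
            (int(start), int(end), strand)
        )
    return genes

def alternatively_spliced(genes):
    """Return only the genes that have more than one transcript."""
    return {g: t for g, t in genes.items() if len(t) > 1}
```

With a file handle, usage would be `alternatively_spliced(parse_gtf_exons(open(path)))`, where `path` points to a GTF file of the kind described above.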
The procedure used in ASGS has been adopted for compiling the graphs. The transcript information, including the start and stop of each exon, is compiled from the GTF file for each of the three genomes to generate the splicing graph. All transcripts are converted to the leading strand for consistency. Exons are divided into two main groups: distinct and variant. The exon that occurs in the majority of transcripts is retained as the distinct exon, with the rest classified as variant. When exons overlap, the exon with well-determined borders that occurs in most of the transcripts is considered to be distinct. If an exon is completely contained in another, larger exon, the two are not merged but retained as individual exons, considered variant and then entered into a list maintaining the mapping of variant exons to distinct exons. Splicing graphs are then generated using these distinct and variant exons. The first line of the resultant splicing graph is composed entirely of distinct exons, followed by subsequent lines showing the locations of variant exons. The exons are connected by edges, representing introns in the set of transcripts provided. Splicing graphs were compiled for every alternatively spliced gene for the three genomes. The splicing graphs were then further analysed to identify the splicing events and patterns for orthologous genes.
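The distinct/variant classification and the graph construction can be sketched as follows. This is a simplified illustration and not the ASGS implementation: it assumes that exons are grouped by coordinate overlap, that the most frequent exon in each group is the distinct one (ties broken by length), and that an exon overlapping nothing else forms its own group. The function name `build_splicing_graph` is hypothetical.

```python
from collections import Counter

def build_splicing_graph(transcripts):
    """Build a simplified splicing graph.

    transcripts: dict of transcript_id -> list of (start, end) exons,
    all on the same strand, coordinates inclusive.
    Returns (distinct exons, variant -> distinct mapping, intron edges).
    """
    # How many transcripts contain each exon
    counts = Counter(e for exons in transcripts.values() for e in set(exons))

    # Group overlapping exons (assumption: overlap defines a group)
    clusters = []
    for exon in sorted(counts):
        if clusters and exon[0] <= clusters[-1]["end"]:
            clusters[-1]["exons"].append(exon)
            clusters[-1]["end"] = max(clusters[-1]["end"], exon[1])
        else:
            clusters.append({"end": exon[1], "exons": [exon]})

    distinct, variant_to_distinct = [], {}
    for cluster in clusters:
        # Most frequent exon is distinct; ties go to the longer exon
        best = max(cluster["exons"],
                   key=lambda e: (counts[e], e[1] - e[0], -e[0]))
        distinct.append(best)
        for exon in cluster["exons"]:
            if exon != best:
                variant_to_distinct[exon] = best

    # Edges are the introns joining consecutive exons of each transcript
    edges = set()
    for exons in transcripts.values():
        ordered = sorted(exons)
        edges.update(zip(ordered, ordered[1:]))
    return distinct, variant_to_distinct, edges
```

Each path through the returned edges that starts at a first exon and ends at a last exon corresponds to one input transcript, which is the property the splicing graph is meant to capture.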
Basic statistical measures like the mean, median and standard deviation were calculated for all three genomes in order to analyse the exon and intron size conservation across the three genomes for the complete and orthologous AS gene sets. The number of exons per transcript for the three genomes was also calculated.
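As an illustration of these summary statistics, the sketch below derives exon and intron sizes from one transcript's exon coordinates and reports mean, standard deviation and median. It assumes inclusive 1-based coordinates as in GTF and uses the sample standard deviation; the text does not state which estimator was used. The function names are hypothetical.

```python
from statistics import mean, median, stdev

def exon_intron_sizes(exons):
    """Exon and intron lengths (nt) for one transcript.

    exons: list of (start, end) with inclusive coordinates.
    """
    ordered = sorted(exons)
    exon_sizes = [end - start + 1 for start, end in ordered]
    # An intron spans the gap between consecutive exons
    intron_sizes = [nxt[0] - cur[1] - 1 for cur, nxt in zip(ordered, ordered[1:])]
    return exon_sizes, intron_sizes

def summarize(values):
    """Return (mean, sample standard deviation, median)."""
    values = list(values)
    sd = stdev(values) if len(values) > 1 else 0.0
    return mean(values), sd, median(values)

def format_summary(values):
    """Format as 'mean ± sd (median)' with one decimal place."""
    m, sd, med = summarize(values)
    return f"{m:.1f} ± {sd:.1f} ({med:g})"
```

The number of exons per transcript is simply `len(exons)` for each transcript, and can be passed through `summarize` in the same way.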
Splice site mutations are believed to cause several genetic diseases. It is therefore very important to identify variations in the splice site. The frequencies of GT-AG, GC-AG, AT-AC splice site motifs were computed for bovine and analysed and compared to the splice site information for human and mouse obtained from ASTD.
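A minimal sketch of such a motif count is given below. It assumes an unspliced transcript sequence written 5' to 3' on the sense strand, and exon positions given as 0-based, half-open offsets into that sequence; both conventions are assumptions made for illustration, as are the function names.

```python
from collections import Counter

def splice_site_motifs(unspliced, exons):
    """Count donor-acceptor dinucleotide pairs for each intron.

    unspliced: unspliced transcript sequence (sense strand, 5'->3').
    exons: list of (start, end) 0-based half-open offsets into unspliced.
    """
    motifs = Counter()
    ordered = sorted(exons)
    for (_, intron_start), (intron_end, _) in zip(ordered, ordered[1:]):
        intron = unspliced[intron_start:intron_end].upper()
        if len(intron) >= 4:
            # First two bases are the donor site, last two the acceptor site
            motifs[f"{intron[:2]}-{intron[-2:]}"] += 1
    return motifs

def motif_frequencies(motifs, canonical=("GT-AG", "GC-AG", "AT-AC")):
    """Fraction of introns carrying each listed motif, plus 'other'."""
    total = sum(motifs.values())
    if total == 0:
        return {}
    freqs = {m: motifs.get(m, 0) / total for m in canonical}
    freqs["other"] = 1.0 - sum(freqs.values())
    return freqs
```

Summing the counters over all transcripts of a genome and passing the result to `motif_frequencies` gives the per-motif proportions that can be compared across species.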
Analysis of the GO annotations was conducted for two sets of data. In the first set, the transcript sequences of orthologous bovine AS genes obtained from Ensembl were processed using ESTScan, as it can detect and extract coding regions from low quality sequences with high selectivity and sensitivity and is also able to accurately correct frame shift errors. To obtain comparable datasets, the human and mouse transcript sequences were also processed using ESTScan. The output was then processed using another bioinformatics tool, Blast2GO, which we have successfully used in the annotation of expressed sequence tag sequences. The BLAST results from this program were then mapped to GO terms to obtain the GO annotation. The annotation output file was then processed using a plotting tool, WEGO, to compile the GO annotation results into category-based lists.
The second dataset was a text file comprising GO annotations for bovine AS genes orthologous to human and mouse AS genes, obtained from Ensembl using the BioMart tool. The second dataset was reformatted and put through the WEGO tool to compile the GO annotation results for plotting.
A well-annotated set of all available (94) human disease genes was extracted from the NCBI Genes and Disease database, with a view to analysing which of these genes were alternatively spliced in the human and bovine genomes. Of these 94 genes, AS analysis was conducted on the 66 spliced genes (with more than one transcript). The two most important events, cassette exon and intron retention, were examined in detail in these 66 genes. These exons were then mapped onto the orthologous exons in bovine using the CLUSTALX multiple sequence alignment tool to identify the conservation of these exons and of the splicing event across the two species. Irrespective of the position of the exons in different transcripts, if a pair of exons shows a good percentage of alignment, the exons are still considered conserved; thus, in the event of exon shuffling, the exon pairs are still treated as conserved.
We identified eight human disease genes that have bovine orthologues. The protein sequences encoded by the transcripts for these human and bovine genes were analyzed using the Pfam domain search tool to identify the effects of alternative splicing on the functional protein domains.
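The conservation call can be sketched as a percent-identity check on two aligned exon sequences, followed by a three-way classification. The text does not specify what counts as "a good percentage of alignment", so the 70% threshold below is a purely hypothetical placeholder, and the function names are illustrative only. The input is assumed to be two equal-length rows taken from an alignment, with `-` marking gaps.

```python
def percent_identity(row_a, row_b):
    """Percent identity between two rows of an alignment.

    Columns where both rows have a gap are ignored; a gap opposite
    a base counts as a mismatch.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    pairs = [(a, b) for a, b in zip(row_a.upper(), row_b.upper())
             if not (a == "-" and b == "-")]
    if not pairs:
        return 0.0
    matches = sum(1 for a, b in pairs if a == b)
    return 100.0 * matches / len(pairs)

def classify_exon(identity, in_all_bovine_transcripts, threshold=70.0):
    """Classify a human exon against its best bovine match.

    threshold is a hypothetical cut-off, not a value from the study.
    """
    if identity < threshold:
        return "absent"
    if in_all_bovine_transcripts:
        return "present and constitutive"
    return "present and regulated"
```

Because the comparison uses only the aligned sequences, the result does not depend on where the exon sits within each transcript, which matches the position-independent treatment described above.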
Comparing 4567 bovine AS genes with 16715 human AS genes and 16491 mouse AS genes, we observed that only 21% of bovine genes were alternatively spliced, as opposed to 68% of genes in human and 57% of genes in mouse. The statistics provided by the ASAP II database (26%, 53%, 53% for cow, mouse and human respectively) compare well to these estimates of the number of AS genes in cow, mouse and human, although they are roughly twice those reported by Nagasaki and colleagues (32.1% and 23% for human and mouse genomes, respectively). All AS genes in cow which have alternatively spliced orthologues in both human and mouse were extracted to minimize any gene structure bias and to get the best-annotated genes in cow for analysis. Such an approach has been adopted in the studies of Chen et al. In order to compile the orthologous gene subset, one-to-one, many-to-many, one-to-many and apparent mappings have been used. We found that 3504 genes in cow have alternatively spliced orthologues in human and mouse, amounting to 3835 and 3774 genes respectively. This dataset amounted to 16% of all bovine genes, compared to 16% in human and 13% in mouse. Our values are consistent with those (10%) observed by Brett et al. for AS between human and other species, including mouse and cow, supporting the credibility of our approach of using orthologous AS gene subsets for multi-species comparisons and to estimate the extent of AS in cow.
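The bovine percentages can be checked directly from the counts quoted in the text (21,755 bovine genes, 4567 AS genes, 3504 genes in the orthologous subset). The short sketch below does only that; the orthologous-set figure is computed on the assumption that the 16% is taken relative to all bovine genes, which is the reading the numbers support. Human and mouse gene totals are not stated in the text, so they are not computed here.

```python
def percent(part, whole):
    """Percentage of `whole` represented by `part`."""
    return 100.0 * part / whole

BOVINE_GENES = 21755        # total bovine genes, from the text
BOVINE_AS_GENES = 4567      # bovine genes with multiple transcripts
BOVINE_ORTHOLOGOUS_AS = 3504  # bovine AS genes with AS orthologues

as_rate = percent(BOVINE_AS_GENES, BOVINE_GENES)
orthologous_rate = percent(BOVINE_ORTHOLOGOUS_AS, BOVINE_GENES)

print(f"Bovine AS rate: {as_rate:.1f}%")
print(f"Orthologous AS subset: {orthologous_rate:.1f}%")
```

Rounded to whole numbers these give 21% and 16%, matching the values reported above.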
Comparison of alternative splicing in bovine, human and mouse genomes
Columns: species; genes with multiple transcripts; % of alternative splicing; transcripts per gene; exons per transcript; exon size (nt); intron size (nt). The last four columns are given as mean ± sd (median).
Complete gene set
Bovine; 4567; 21%; 2.3 ± 1.2 (3); 13.4 ± 10.5 (9); 181 ± 254 (126); 5215 ± 17003 (1191)
Human; 16715; 68%; 8.0 ± 8.0 (7); 7.7 ± 5.9 (6); 178 ± 196 (89); 5314 ± 4112 (4517)
Mouse; 16491; 57%; 6.5 ± 6.0 (5); 6.6 ± 4.2 (5); 160 ± 167 (63); 4311 ± 4003 (3889)
Orthologous gene set
Bovine; 3504; 16%; 2.5 ± 1.0 (3); 14.4 ± 9.8 (10); 162 ± 212 (123); 5105 ± 16900 (1152)
Human; 3835; 16%; 9.4 ± 7.4 (8); 9.1 ± 7.8 (8); 188 ± 150 (101); 5210 ± 4013 (4321)
Mouse; 3774; 13%; 7.0 ± 6.4 (6); 9.0 ± 7.2 (7); 145 ± 163 (103); 4304 ± 3921 (3789)
Statistics of alternative splicing events for all AS genes and the orthologous AS gene subset (gene level analysis)
Columns: type of alternative splicing event; Bovine (complete set); Bovine (orthologous set); Human (complete set); Human (orthologous set); Mouse (complete set); Mouse (orthologous set)
Rows: transcriptional start site; alternative initiation exons; transcriptional termination site; alternative termination exons
Statistics of alternative splicing events for the orthologous gene subset (event level analysis)
Rows, by type of alternative splicing event: transcriptional start site; alternative initiation exons; transcriptional termination site; alternative termination exons
Considerable conservation was observed in each of the nine AS events for the three species. Our analysis shows that exon skipping, or cassette exon, is the most prevalent internal AS event in the orthologous genes of all three species, comprising 28%, 26% and 16% of all AS events in bovine, human and mouse, respectively. On the other hand, intron retention and mutually exclusive exons were the least favoured AS events. Intron retention accounted for only 3% of bovine AS events, compared to 3% in human and 2% in mouse. Haussler and co-workers estimated 38% exon skipping and 3% intron retention in human, which are very similar to our values. ASD [3, 4] reports 52% cassette exons and 17% intron retention, which differ considerably from our calculations. This could, however, be because ASD used the entire human genome for its calculations, whereas we have only utilized orthologous AS genes for our analysis.
Overall, across the two sets of analyses, fewer bovine genes account for an equivalent percentage of AS events compared to human and mouse. This implies that the orthologous AS genes in cow show high variation in transcript structure, despite having fewer distinct transcripts than human and mouse genes.
Alternative splicing class distribution based on splicing patterns for orthologous bovine, human and mouse AS genes
Splice site motif analysis for bovine, human and mouse AS genes
Splice site motifs
Gene ontology (GO) annotation summary for the orthologous AS gene set.
A. Molecular function: molecular transducer activity; transcription regulator activity; enzyme regulator activity; structural molecule activity
B. Biological process: multicellular organismal process; establishment of localization; response to stimulus; immune system process
C. Cellular component: extracellular region part
However, a similar plot was also created for the bovine genome using a different set of annotations, where the entire GO details were obtained from Ensembl using the BioMart tool. This analysis showed considerably lower percentages for bovine than the previous plot. We believe this results from the low level of annotation available for bovine genes. In this plot, a considerable drop in functionality was noticed across all the areas for the bovine genome (Table 6 and Figure 7). Therefore, we were able to identify 38% more functional information in terms of GO annotations than is currently available in Ensembl for bovine genes.
The use of farm animals like cattle, pigs, sheep, goats, horses and chickens as research models has won many Nobel Prizes for researchers worldwide. Various new opportunities in areas of biomedical research have been created by the application of the tools for genetic manipulation and genomic sequencing in farm animals. This provides valuable insights into gene function and genetic and environmental influences on animal production and human diseases. Because of their size and relatively long intervals between generations, domestic species are widely used to unravel the mechanisms involved in programming the development of an embryo and fetus, resulting in adult onset of diseases [37, 38]. Rogers et al. have identified that the CFTR gene knockout model of pig mimics human pathology better than mouse models, as the latter fail to develop the hallmark pancreatic, lung and intestinal obstructions that occur in humans. Reynolds et al. note that surgery, blood sampling, tissue recovery, serial biopsies, instrumentations, whole organ manipulations and many other biomedical applications are more easily achieved in animals larger than a mouse, suggesting that size does matter when it comes to animal models. Hence mapping human disease genes to bovine orthologous genes is an excellent means of carrying out analytical work and verifying the suitability of cow as a model organism.
Human disease genes: Conservation of cassette exons in bovine orthologous genes.
Number of cassette exons in 38 AS human disease genes
Exons present and constitutive in bovine orthologous gene
Exons present and regulated in bovine orthologous gene
Exons absent in bovine orthologous gene
Human disease genes: Cassette exons present and regulated in bovine orthologous genes.
Ensembl human transcript ID
Cassette exon position in human transcript
Ensembl bovine transcript ID
Cassette exon position in bovine orthologous transcript
Spinal muscular atrophy
Human disease genes: Intron retention present and constitutive in bovine orthologous genes.
Conserved exon in bovine
Polycystic kidney disease
Autoimmune polyglandular syndrome
For the eight human disease genes that have orthologous genes in the bovine genome (three genes with CE and five genes with IR), protein domain analysis revealed that AS affects the structure and function of the proteins encoded by the various transcripts from these genes. It was evident that, due to AS, the majority of the transcripts either lacked the complete functional domain or lacked an essential component/segment of the functional domain. This suggests that AS is a major mechanism that could render these proteins non-functional, besides perturbing the structure or fold of the protein.
For the set of the bovine orthologous genes, only two of eight genes appear to be spliced, resulting in probable structure and function disruption. These genes are responsible for spinal muscular atrophy and colon cancer, with the former noted as a disease caused by AS. Further investigation revealed that four of these eight genes had all the domains from their human counterparts conserved. This implies that 4/8 orthologous bovine genes (including the two AS genes) had essential segments or complete functional domains missing, due to AS.
Wilson's disease is another disease that has been characterised in cow (OMIA). We observe that the human gene known to be responsible for this disease has a retained intron in one of its transcripts, which is orthologous to the only transcript available in the corresponding bovine gene. Thus, the cow would be most suitable as a model organism for this human disease.
This is the first comprehensive study of the bovine transcriptome, with 21% of bovine genes exhibiting alternative splicing, compared to 68% and 57% in human and mouse, respectively. Our analyses show that bovine AS genes are composed of fewer transcripts but many more exons than human and mouse AS genes, although comprising exons and introns of comparable extents. Nine different splicing events were compared among cow, human and mouse genomes. Compared to their human and mouse counterparts, many more bovine AS genes show intron retention. The most common AS event was found to be exon skipping and the least common events were intron retention and mutually exclusive exons. With introns predominantly linking two variable exons, fewer bovine AS genes show high transcript variability, as opposed to human and mouse genes.
Our approach, which helped us collate the GO annotations for bovine AS genes, identified 38% more functional information than is currently available in Ensembl. The orthologous bovine AS genes are functionally very similar to human and mouse genes, as suggested by GO annotations.
From the results of our protein domain analysis it is evident that AS plays a major role in disease implications in both human and cow, and that the cow is suitable as a model for investigating spinal muscular atrophy, colon cancer, Tangier disease, glaucoma, spinocerebellar ataxia, polycystic kidney disease, autoimmune polyglandular syndrome and Wilson's disease. Our results provide a window of opportunity for more in-depth analysis over a larger dataset, where the cow can serve as a model organism for many more human diseases.
EC is grateful to Macquarie University for the award of the MQ Research Excellence Scholarship (MQRES). Open access publication charges are borne by Macquarie University.
This article has been published as part of BMC Genomics Volume 10 Supplement 3, 2009: Eighth International Conference on Bioinformatics (InCoB2009): Computational Biology. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2164/10?issue=S3.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.