Genome-wide survey of allele-specific splicing in humans
© Nembaware et al. 2008
Received: 30 January 2008
Accepted: 02 June 2008
Published: 02 June 2008
Skip to main content
© Nembaware et al. 2008
Received: 30 January 2008
Accepted: 02 June 2008
Published: 02 June 2008
Accurate mRNA splicing depends on multiple regulatory signals encoded in the transcribed RNA sequence. Many examples of mutations within human splice regulatory regions that alter splicing qualitatively or quantitatively have been reported and allelic differences in mRNA splicing are likely to be a common and important source of phenotypic diversity at the molecular level, in addition to their contribution to genetic disease susceptibility. However, because the effect of a mutation on the efficiency of mRNA splicing is often difficult to predict, many mutations that cause disease through an effect on splicing are likely to remain undiscovered.
We have combined a genome-wide scan for sequence polymorphisms likely to affect mRNA splicing with analysis of publicly available Expressed Sequence Tag (EST) and exon array data. The genome-wide scan uses published tools and identified 30,977 SNPs located within donor and acceptor splice sites, branch points and exonic splicing enhancer elements. For 1,185 candidate splicing polymorphisms the difference in splicing between alternative alleles was corroborated by publicly available exon array data from 166 lymphoblastoid cell lines. We developed a novel probabilistic method to infer allele-specific splicing from EST data. The method uses SNPs and alternative mRNA isoforms mapped to EST sequences and models both regulated alternative splicing as well as allele-specific splicing. We have also estimated heritability of splicing and report that a greater proportion of genes show evidence of splicing heritability than show heritability of overall gene expression level. Our results provide an extensive resource that can be used to assess the possible effect on splicing of human polymorphisms in putative splice-regulatory sites.
We report a set of genes showing evidence of allele-specific splicing from an integrated analysis of genomic polymorphisms, EST data and exon array data, including several examples for which there is experimental evidence of polymorphisms affecting splicing in the literature. We also present a set of novel allele-specific splicing candidates and discuss the strengths and weaknesses of alternative technologies for inferring the effect of sequence variants on mRNA splicing.
One of the key tasks of the post-genome era is to determine the functional implications of genomic variants. The development of high throughput genotyping technologies and the use of these technologies in large-scale studies has enabled the identification of increasing numbers of human loci that are associated with common genetic disorders (e.g.). However, the mechanisms through which genetic variants at many disease-associated loci affect disease susceptibility remain to be determined. Mutations or polymorphisms that affect mRNA splicing can have a profound effect on the function of the spliced product, but these effects are often difficult to predict from the primary genomic sequence. The medical and biological significance of such variants is evident from the large and rapidly increasing volume of literature reporting examples of aberrant mRNA splicing associated with human cancers and genetic diseases [2, 3]. Indeed, point mutations leading to aberrant splicing are thought to be among the most important contributors to human genetic diseases .
Sequence variants found on the pre-mRNA can affect a number of different, and in some cases imperfectly characterized, cis-acting sequences that control splicing. Polymorphisms that occur at the highly conserved donor and acceptor di-nucleotides are an obvious case in which we expect an effect on splicing  and these genomic variants, when they occur close to verified exon boundaries, tend be annotated in databases of sequence polymorphisms, such as dbSNP . A much larger proportion of variants are likely to occur at sites where the effect on splicing is less obvious, for example at less conserved sites close to intron/exon boundaries, close to the intronic branch-point , or within intronic or exonic splicing enhancer or suppressor sequences . In some cases, such sequence variants disrupt normal gene splicing, causing aberrant splicing of either a proportion, or all of the transcripts produced. However, if the gene is alternatively spliced to begin with, then sequence variants that affect sites that are involved in controlling isoform abundance may be affected, causing allelic differences in the regulation of alternative splicing, with potentially important biological consequences .
The contribution of heritable variation to the observed diversity of mRNA splice isoforms is well established [10–12]. Using the ASAP database of alternatively spliced mRNA isoforms  and transcribed SNPs, we previously estimated that approximately 20% of alternatively spliced genes show evidence of allele-specific splicing (either complete allele-specific splicing, in which one allele gives rise to one isoform and another results in the alternative form, or partial allele-specific splicing in which different alleles result in distinct relative isoform abundance ). Earlier large-scale studies of alternative and allele-specific splicing relied primarily on Expressed Sequence Tag (EST) sequences. More recently, both exon-junction and exon tiling arrays have been used for genome-wide studies of alternative splicing [14, 15]. The Affymetrix GeneChip Human Exon 1.0 ST Array has probe-sets targeting approximately 1.4 million known and predicted exons. Alternatively spliced mRNA isoforms detected using the Affymetrix exon array in cell lines genotyped as part of the HapMap project , has given rise to opportunities for high-throughput discovery of alleles that affect mRNA splicing [11, 12]. Though exon arrays are arguably a superior technology, with better exon coverage than ESTs , they are also affected by a range of caveats that need to be considered . Integration of results from ESTs and microarrays is likely to increase power to detect allele-specific splicing as both arrays and ESTs have different limitations and advantages for the analysis of alternatively spliced isoforms.
Though for the present it remains a distant goal, a complete description of the effect of human sequence variants on mRNA splicing would be a powerful resource for understanding human genetic diseases and phenotypes. One option for evaluating the potential effect of cis-acting mutations on splicing is to use ab initio prediction algorithms that make use of the availability of the complete human genome sequence [18, 19]. In several previous studies, computational tools have been effective in helping to shed light on the impact of a mutation on splicing [9, 20, 21] and databases of mutations that may affect splicing have been made available [22, 23]. However, because of the difficulty of predicting all splice regulatory elements from genomic sequence and the even greater difficulty of determining accurately the effect of mutations in these regions on splicing, genomic analysis of SNPs likely to affect splicing needs to be complemented by expression data that provides information about the splice isoforms that are associated with the alternative alleles of a candidate SNP.
We have performed a genome-wide scan for Single Nucleotide Polymorphisms (SNPs) likely to influence splicing efficiency in cis using publicly available tools (ESEfinder, , MaxENTScan , and Branch Site Analyzer ). We have tested predictions based on genomic sequences using publicly available EST and exon array data. We present a novel probabilistic method to infer allelic differences in mRNA splicing from EST data, and use recently published Affymetrix exon array hybridisation data derived from 166 lymphoblastoid cell lines  for which genome-wide genotype data are available through the HapMap project  to test for association between mRNA isoforms and the genotype of putative cis-acting splicing polymorphisms. We have also investigated the heritability of splicing and compared it to heritability of transcript level expression using the exon array data.
Summary of srSNPs with supporting evidence from EST and Exon array data.
EST evidence (α = 0.05)
Exon array (FDR = 0.1)
Exon array (Holm corrected α = 0.05)
Among the classes of splicing regulatory regions analysed, SNPs that occurred in donor sites were slightly more likely to be confirmed by EST and/or exon array data (Table 1). In addition to the srSNPs for which there is supporting evidence from EST and/or exon array data a further 66 mSNPs were supported by exon array data, but no candidate srSNP was identified that could explain the allelic difference in splicing. Some of these may be false positive mSNPs but for the remainder, the causative SNP may be in an intronic splicing element (intronic splicing elements were not included in the genome-wide scan for srSNPs) or in, as yet, uncharacterized splicing regulatory elements. The possibility also exists that some of the identified putative allele-specific isoforms are caused by mutations located within trans regulators of splicing and that association with nearby polymorphisms is a result of population stratification rather than a direct cis-acting effect.
Of the 54 distinct alternate exon junction pairs with evidence of allele specific splicing from the EST data using a false detection cut-off of 0.2 (see above), 15 could be tested for allele-specific splicing using the exon array data (in order to be tested the mSNP had to be among the SNPs genotyped in the HapMap populations, a probe or probes had to occur between the genomic coordinates spanned by the alternative exon junction pair and the probe had to be detectable above background in at least some of the cell lines). Of these 15, 9 (60%) had at least one probe between the genomic coordinates of the junction pair for which the SI was significantly associated with the genotype of the mSNP (p < 0.05, with Bonferroni correction in the case where multiple probes were tested for association with a single mSNP). By comparison, for a random set of 10,000 alternatively spliced exon junction pairs from ASAPII and nearby exonic SNPs that showed no association with the mRNA isoform there were 252 (22%) associations from 1,167 exon junctions that could be tested. The proportion of allele-specific splicing candidates from the EST data that could be confirmed using the exon array data was significantly higher than for alternatively spliced exon junctions with no evidence of allele-specificity from ESTs (p = 0.002 using Fisher's Exact Test). This overlap of allele-specific splicing candidates identified by very different technologies, provides cross-validation for the candidates identified using the two approaches.
A significant association between the SI of a probeset and an srSNP is insufficient to infer a causal relationship between the srSNP and variation in SI. It is possible that the putative srSNP is not causally related to the observed difference in splicing and, instead, that it is in linkage disequilibrium with a nearby SNP that was not predicted to affect splicing (because of the imperfect understanding of splicing regulation). We can begin to investigate this possibility by testing for an association between the SI and genotype for all of the other nearby SNPs for which genotype data are available for the lymphoblastoid cell lines. For each srSNP we tested for an association between SI and genotype for all genotyped SNPs within 10 kb of the srSNP. On average there were 25 such SNPs per srSNP. For the majority (61.8%) of the srSNPs supported by the exon array data (fdr < 0.1), the predicted srSNP showed the most significant or joint most significant association between SI and genotype for at least one of the probesets tested. For the remainder, an alternative SNP, not necessarily predicted to affect splicing, showed a more strongly significant association. The mechanisms through which these alternative SNPs may affect splicing require further investigation. Examples of the association plots are shown in Figures 3 and 4. Similar association plots for all of the srSNPs supported by the exon array data are available from http://mancala.cbio.uct.ac.za/splicing/ExonArray.
A subset of the previously reported allele-specific splice isoforms detected in this study
mSNP (p-value from exon array data)
srSNP (exon-array p-value)
Exon 2 and Exon 3
Exonic splicing silencer element in exon 2
2.78 × 10-15
rs10774671 (< 10-16)
rs1951119 (1.0 × 10-6)
There are also many cases of previously unpublished splicing polymorphisms among our results, some of which are likely to be functionally and medically important. For example, the exon array data provide strong evidence (robust ANOVA F statistic: 40.5; p = 9 × 10-11) for an association between a probeset in exon 4 of the GLO1 gene, encoding an enzyme (glyoxalase I) that has been reported to show lower activity in the brains of individuals affected by autism compared to control individuals  and the genotype of a SNP in the same exon (C419A or rs2736654; Figure 4). Reduction in enzyme activity has been attributed to the direct effect of this non-synonymous SNP on the amino acid sequence of the protein. The ancestral A allele has been reported to be significantly associated with autism  and certain types of panic disorders . A larger scale study, however, has questioned the association with autism, but has found that the A allele may have a protective effect in the siblings of individuals with autism . This non-synonymous SNP occurs in a predicted exon splice enhancer site (the genomic scan for srSNPs predicts that this site acts as an ESE for both SF2 and SRp55 and the A and C alleles have scores 0.44 and 2.96, respectively, for SF2 and 1.39 and 3.53, respectively for SRp55). EST evidence from ASAPII suggests that two exons are skipped . Skipping of these exons is likely to have a much greater impact on the protein function than the replacement of alanine by glutamine at a single site within one of the exons. While the role of GLO1 in neurological disorders remains controversial , Sacco et al. , highlight the need for further investigation of the functional impact of the C419A. Our results suggest that the polymorphism is very likely to impact on splicing. This could have a significant impact on glyoxalase I activity and be the mechanism underlying the disease association.
We assessed the heritability of splicing index by estimating the slope of the correlation of child SI with the mean of parent SI values for 46 two-parent and child trios, for which exon array data were available for the complete trio. Using only core probesets and meta-probesets, the mean of the slope as well as the mean of Pearson's correlation coefficient were positive (0.093 and 0.084, respectively) and significantly different from zero (p < 10-16 in each case), but significantly less than the mean values obtained when we performed the same regression using estimated transcript-level (i.e. meta-probeset) expression values, for which the corresponding values were 0.163 and 0.146. We estimated the proportion of SI values that do not conform to the null hypothesis (which assumes no correlation between mean parent SI and child SI) using the method described by Storey and Tibshirani . This proportion was 15% for splicing index, rising to 35% for transcript-level expression estimates. This suggests that probeset splicing index is, on average, somewhat less heritable than whole transcript expression. However, because probeset expression levels are estimated from much smaller numbers of probes than transcript level expression, probeset expression level estimates are likely to be noisier and this may contribute to their apparently lower heritability. It is also important to note that there were approximately 13 probesets per core meta-probeset. At the 1% significance level, 4.8% of probesets were correlated between child and the mean of the parent values. This figure was 7.7% for meta-probeset expression level estimates, but 12.3% of genes had at least one probeset with a significantly correlated SI value (applying Bonferroni correction for multiple probesets per meta-probeset). Therefore, heritability of the relative expression of parts of transcripts appears to be common, even more so than heritability of overall gene expression level.
Large-scale discovery of genomic variants that affect splicing has the capacity to accelerate the association of diseases to causative genomic variants. However, because it is difficult, and in many cases currently not possible, to determine the effect of a genomic variant on splicing or on the regulation of alternative splice isoforms from genomic sequence data alone, this remains a challenging task and requires the integration of information from different data types. At present, no single source of data can provide information about all forms of splice variants and each source of data has advantages as well as disadvantages. The publicly available exon array data that we have used here represents an extremely extensive dataset on isoform abundance in human lymphoblastoid cell lines that can be correlated with the genotype of the cell line. However, this data provides no information on transcripts that are not expressed in lymphoblastoid cell lines, or on splicing mutations that affect relative isoform abundance in only a subset of expression contexts. Furthermore, depending on the exact location of probesets in a given gene, many of the transcript isoforms that occur, particularly those that affect donor or acceptor site but do not cause exon skipping or inclusion, are undetectable using exon arrays. When alternative isoforms are distinguishable using the exon arrays, they still provide little information on the nature of the isoforms, and this may need to be inferred either by integrating information from other sources or experimentally.
EST sequences provide information on the structure of alternative isoforms and include data from different gene expression contexts, but this information is highly biased towards ends of genes and is sparse, for all but the most highly expressed genes. The simulations provide ample evidence for frequent allele-specific splicing but also illustrate that there is not enough data to confirm most cases, especially when the effect of a very large number of statistical tests is considered. There were several published examples in the current study of genes known to be spliced in an allele-specific manner, but for which the allele-specific splicing model fits the data no better than the null model. However, although most cases of allele-specific splicing will not be detectable using EST sequences alone, ESTs can often be used to elucidate the nature of the allele-specific splicing events detected because ESTs provide information on the actual transcripts that occur. We compared the length differences between alternatively spliced isoform pairs with and without evidence of allele specific splicing from the EST analysis. Across all classes of alternative splicing considered the length differences between isoform pairs were smaller for pairs with evidence of allele-specific splicing. There was also evidence of a trend towards a greater likelihood of frame preservation among the isoforms with evidence of allele specificity [see Additional file 4].
GLO1 provides an example of a gene with a mutation that is likely to affect splicing, but although there was good coverage of this gene in the EST databases, the allele-specific splicing event was not detectable from the EST data. Because the putative causal SNP is on the skipped exon it is only observed when the constitutive isoform occurs and therefore cannot be tested for association with the skipping event using the EST data. Furthermore, there is only one EST that captures the exon skipping event, denoted by junction 334716 in ASAPII. There are also several cases of known splicing polymorphisms that could be detected from ESTs but not from the exon array data (Table 2). The gamma-aminobutyric acid (GABA) receptor, rho 1, gene (GABRR1), for example, was previously shown to have a SNP (rs4590242), located in the acceptor site that promotes use of an alternate NAGNAG acceptor . We detected this srSNP in the genomic data and EST data provided evidence of its effect on splicing with an mSNP rs12200969, (p-value = 0.03). However, due to the lack of a probe that coincides exactly with the end of the exon, the exon arrays were unable to detect this subtle alternative splicing event.
The probability of linkage disequilibrium of an srSNP and mSNP decreases with the distance that separates them. This limitation is highlighted by the failure to associate several transcribed SNPs (rs3093906, rs3093905, rs3093921, rs3093925, rs3093926, and rs3093927), located >5000 bp away from a putative allele-specific splicing event in the Ribonuclease P RNA component H1 gene (PARP-2). The ASAPII database contains the two alternate donor sites at this junction that are 39 bp apart and are supported by a total of 28 expressed transcripts, and two PARP-2 protein isoforms differing by 13 amino acids have been deposited in the SWISSPROT database . We detected an srSNP (rs2297616) located at position 4 of the corresponding splice donor site and the exon-array data provide strong evidence for an association between the splicing index of a probeset that overlaps the 39 bp region between the alternative donor sites and the genotype of this SNP (p-value = 2 × 10-5).
In previous work we used a heuristic method to find associations between SNPs mapped to ESTs and alternatively spliced isoforms in order to detect candidate allele-specific isoforms and to quantify the proportion of alternatively spliced genes that are spliced allele-specifically . However, such associations can also occur because of normal regulation of alternative splicing. For example, consider an alternatively spliced gene for which ESTs occur in just two of the cDNA libraries in dbEST. Assuming that these libraries were constructed from the tissues of single individuals, it is possible that these individuals have different genotypes for an exonic SNP in the gene. If the alternative isoforms of the gene happen to be regulated in a tissue specific way and if the cDNA libraries are derived from different tissues then this could result in an association between the alleles of the SNP and the mRNA isoforms. This association can be highly significant if there are many ESTs of the gene in the two cDNA libraries in which it occurs. To circumvent this problem in our previous work, we took a maximum of two ESTs per cDNA library (one for each allele of the SNP from heterozygous libraries and just one from homozygous libraries). This caused a substantial loss of data and reduction in power to detect and quantify allele-specific mRNA splicing. In the present work we explicitly model the regulation of alternative splicing and make much better use of the available data.
Our results and previous reports [10, 12] suggest that polymorphisms that affect splicing are common. This has important implications, not only for discovering the molecular bases of genetic diseases, but also for the study of alternative splicing. A gene cannot be confirmed to be alternatively spliced unless multiple isoforms are observed from the same allele. Until then the possibility remains that the alternative isoforms observed are polymorphic variants rather than alternatively spliced. Although we have found ample evidence for allelic differences in splicing, isoforms that result entirely from sequence variants might be less common. In the set of examples we report here, there is a relatively small proportion of cases in which the data suggest that the SI might be zero for some variants. Allele-specific splicing may be particularly important in the context of investigations of the regulation of alternative splicing [41, 42]. Such investigations should ensure that multiple samples from the same tissue source are not treated as independent. This was an issue, in particular with early investigations of regulated splicing using EST sequences, in which multiple samples from the same cDNA library were used .
Regulation of splicing is incompletely characterized and additional cis elements that regulate splicing are still being discovered . A limitation of the current study is that the srSNP candidates are restricted to a subset of well characterized cis-acting splice regulatory elements (donor and acceptor sites, polypyrimidine tract, branch points, and some exonic splicing elements). The phosphomannomutase 2 gene (PMM2), for example, which has allele-specific skipping of exon 5 due to a SNP that disrupts an ESE composed of (GAR)n repeats , where R is a purine, was detected using the EST data (p = 0.003); however, we could not identify an srSNP because the disrupted ESE is not detected by ESEfinder. Polymorphisms not found in cis-regulatory regions can also result in apparent allele-specific splicing if they introduce premature termination codons (PTC)  and cause differential nonsense-mediated decay of alternative alleles. Such SNPs are not included in our srSNP database. We have also restricted our analysis to single nucleotide polymorphisms but allele-specific splicing could be due in many cases to other types of polymorphisms such as insertions and deletions .
In the majority of the examples of allele-specific splicing we have detected, the difference in splicing is quantitative rather than qualitative. This can occur for a gene that is alternatively spliced, but for which a polymorphism exists that affects the proportions of alternative isoforms produced. In some cases, particularly for common polymorphisms, the size of the effect on SI can be relatively small, but still highly significant because of the relatively large number of individuals in each genotype group. In other cases, e.g., the alternative isoforms of the OAS1 gene shown in Figure 3, the SI associated with one genotype may be much greater than for the other genotypes. In general, the size of an effect on SI sufficient for an effect on phenotype is likely to vary substantially from transcript to transcript. Consistent with what has been observed previously for cis-acting polymorphisms with a quantitative effect on splicing , for the majority of the probesets for which SI was significantly associated with SNP genotype, the SI value of the heterozygote was intermediate to the SI of the two homozygotes. In 904 (69%) of 1,318 associations (corresponding to 1,083 different srSNPs) for which heterozygote and both homozygote cell-lines for the SNP were available, the SI of the heterozygote had an intermediate value. This was the case for all of the 148 stronger associations (that remained significant using a family-wise error rate of 0.05).
Loci that affect splicing might be termed splicing quantitative trait loci (sQTLs), by analogy with expression quantitative trait loci (eQTLs) that have become the subject of significant interest . Although the heritability of individual SI values is, on average smaller than the heritability of overall transcript expression level, our results suggest that a greater proportion of genes show heritable splicing of at least one region of the transcript, than heritability of overall gene expression level. In this study we have attempted to identify only cis-acting sQTLs. Trans-acting sQTLs are also likely to exist, particularly at genes that are involved in regulating alternative splicing, but the ratio of trans to cis acting variants may be much smaller for sQTLs than for eQTLs, because of the relatively more complex regulation of transcription initiation compared to splicing. We have taken a candidate SNP approach to detecting splicing polymorphisms. With the availability of whole-genome exon array data it is also possible to adopt a less directed approach analogous to methods that have been used previously to detect expression quantitative trait loci . Each probeset could be tested for association with every SNP that overlaps the transcripts to which it belongs. However, because of the multiplicity of probesets per gene this would result in a very large number of tests and would be likely to yield a much larger set of candidates, but potentially a set with lower specificity and for which interpretation is more difficult. Although we could identify a large number of candidate splicing polymorphisms by combining genomic, EST and exon array data, many of which were strongly associated with splicing, confirmation of causal relationships between human sequence variants will require experiment, probably involving mutagenesis so that relative isoform abundance can be compared between alternative alleles against an identical genetic background.
We have carried out a genome-wide survey of human polymorphisms that are likely to affect mRNA splicing. We developed a maximum likelihood approach to test for evidence of allele-specific splicing from EST data. This was complemented by an analysis of exon array data in the public domain, generated from cell lines that were genotyped as part of the HapMap project. For each of the putative splicing polymorphisms identified in the genome-wide scan, we tested for an association between splicing index and SNP genotype and corroborated 1,185 examples at a false discovery rate cut-off of 0.1. We discuss advantages and disadvantages of alternative high-throughput methods to detect allele-specific splicing and report that a higher proportion of transcripts show evidence of heritability of mRNA splicing of at least some part of the transcript, than show evidence of gene expression heritability at the whole transcript level. Our results underscore the importance of mutations that affect splicing for understanding human phenotypes and genetic diseases and provide a resource that can be used to help assess the effects of human polymorphisms on mRNA splicing.
We downloaded known transcripts, chromosomal genomic data and SNP and exon tables from Ensembl version 36 , which is based on NCBI Genome build 35. Genes and SNPs that mapped to multiple locations on the genome were discarded. Introns were inferred from the exon genomic coordinates obtained from Ensembl. SNP positions relative to the Ensembl exons and introns were identified via genomic coordinates. SNP positions relative to exon/intron junctions were also determined for isoforms obtained from the ASAPII database.
Published tools for detecting splicing regulatory elements were either requested from authors or downloaded from their respective sites. We extracted 9 nucleotides from the donor splice sites and 23 nucleotides from the acceptor splice sites as required by the maximum entropy algorithm of Yeo and Burge, 2004 . Scores for each pair of alternate alleles were then calculated . We also identified an inflated frequency of SNPs at the G base of the canonical AG acceptor site which has been previously identified as a sequencing artifact . We therefore restricted our analysis to validated SNPs using the information from dbSNP125 in the Ensembl database.
The ESEfinder tool  is designed to predict four ESEs: SC35, ASF2, SRp55, and SRp40. ESEfinder uses a position specific weight matrix. An ESE is considered to have a pre-defined length, m, and a recommended minimum score S. For each SNP we extracted m-1 nucleotides up- and downstream of the SNP. We then calculated the ESE scores for each of the contiguous length m subsequences of this sequence. The highest score for each SNP allele was retained if at least one of the scores was above S and the other below S. Although some strong ESEs can influence splicing at a distance of several kilobases , functional ESEs are most abundant in close proximity to splice junctions of internal exons . We therefore restricted our analysis to ESEs located within 200 bps of exon-intron junctions of internal exons. Branch point scores for pairs of alternate SNP alleles were computed using Perl scripts provided by Kol et al., 2005 .
We downloaded pre-computed EST and SNP genomic locations from the UCSC Genome Browser , which is based on NCBI genome assembly 36. ESTs and SNPs that mapped multiple times onto the genomic sequence and ESTs for which less than 90% of the sequence mapped to the genome were discarded. We used SNP and EST genomic coordinates to identify the SNP allele corresponding to each EST overlapping the SNP position. ASAPII , a database of alternatively spliced gene clusters, was downloaded on 9/11/2006. This data included gene and exon genomic locations based on NCBI genome assembly 35 as well as alternative mRNA isoforms (represented by conflicting exon junction pairs) mapped to ESTs.
For a given allele, A, of an alternatively spliced gene with alternative splice isoforms S1 and S2, let x represent the proportion of isoform S1 produced from allele A in a cDNA library. We assume that x is constant for a given allele and library, but may vary across alleles and/or libraries. The purpose of the model is to determine, using data from several libraries (in which alternative transcript isoforms may be differentially regulated and have different relative expression levels), whether x shows significant variation across alleles.
Consider cDNA library i with N transcripts from allele A, of which we observe a i ESTs that map to S1 and b i = N-a i ESTs that map to S2. Because a i is binomially distributed with binomial parameter x, we use the beta distribution (conjugate to the binomial) to describe the probability density of x. We share this distribution across all libraries but not necessarily across the two alleles. Thus the values of x for separate libraries are modeled as independent draws from the distribution f(x, α A, β A ) for allele A and f(x, α B, β B ) for allele B, where f(x, α, β) is the beta function with parameters α and β.
The likelihood of the data observed in all cDNA libraries is a product over terms such as this, and the α and β parameters can be estimated by optimizing the likelihood for the combined data set.
where a i , b i are the numbers of ESTs in cDNA library i that map to allele A and splice junctions S1 and S2 respectively and c i , d i are the corresponding EST counts for allele B. The maximum likelihood parameter estimates were obtained by optimizing the likelihood using Powell's method .
For the null model, we impose the restriction that α A = α B and β A = β B, such that both alleles are considered to be sampled from the same distribution (no allele-specific effect). To model allele-specific splicing (the alternative model), we allow α A ≠ α B and estimate separate beta distributions for the alternate alleles of a SNP (we keep the constraint β A = β B because we found that this model already has sufficient freedom to model the desired effect and adding another degree of freedom was unnecessary). If the null model can be rejected in favour of the alternative model (using the likelihood ratio test) we conclude that there is evidence of allele-specific splicing.
We constructed 1000 random replicates of the EST data such that, for every SNP, the number of libraries derived from each genotype of the SNP was identical to the real data. Each library in the simulated data was assigned a genotype, with a probability proportional to the number of libraries of that genotype in the real data (this proportion was adjusted as each simulated library was assigned a genotype). For each library, the total numbers of ESTs derived from each isoform was constrained to be the same as in the real data. For heterozygous libraries ESTs were assigned to alternative SNP alleles with equal probability.
Whole genome exon data generated using the Affymetrix Human Exon 1.0 ST array by Huang et al.  were obtained from the Gene Expression Omnibus . These data were generated from 166 cell lines for which genome-wide genotype data are available through the HapMap project . All probes that overlapped with SNPs from dbSNP were removed because these may affect hybridization and cause artifactual association . The exon array data was processed using the Affymetrix Power Tools . For all probesets we estimated expression level in each cell-line using the Plier Sketch algorithm and estimated detection above background probabilities using DABG . For the meta-probeset (transcript) level expression we used only the high confidence (or 'core') probesets from the array to avoid inaccuracy caused by the inclusion of computationally predicted probesets . For each probeset that mapped to a meta-probeset, the splicing index (SI) was calculated by dividing the probeset expression estimate by the estimate of the transcript-level expression in each cell line. For non-core probesets that mapped to core as well as non-core meta-probesets the core meta-probeset expression estimate was used.
For each srSNP we used a robust linear model to test for an association between the SI of all probes within 1 kb of the probe and SNP genotype, using only unrelated individuals (i.e parents from the parent-child trios) and treating HapMap population as a covariate. We used Holm correction (with significance level 0.05) to control the family-wise error rate and establish a high-confidence or conservative set of probes with allele-specific SI. A false detection rate correction (also with cut-off set to 0.05) was also used to generate a larger set of events that includes a small proportion of false positive inferences. All statistical analyses were performed using the R statistical computing environment [31, 32].
Expressed Sequence Tag
Single Nucleotide Polymorphism
splicing regulatory SNP
Exonic splice enhancer
analysis of variance
Premature termination codon
splicing quantitative trait locus
expression quantitative trait locus
This research was funded by the South African National Bioinformatics Network. VN is supported by the UCT International Scholarship and the NRF Africa Scholarship.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.