Does conservation account for splicing patterns?

BACKGROUND
Alternative mRNA splicing is critical to proteomic diversity and tissue and species differentiation. Exclusion of cassette exons, also called exon skipping, is the most common type of alternative splicing in mammals.


RESULTS
We present a computational model that predicts absolute (though not tissue-differential) percent-spliced-in of cassette exons more accurately than previous models, despite not using any 'hand-crafted' biological features such as motif counts. We achieve nearly identical performance using only the conservation score (mammalian phastCons) of each splice junction normalized by average conservation over 100 bp of the corresponding flanking intron, demonstrating that conservation is an unexpectedly powerful indicator of alternative splicing patterns. Using this method, we provide evidence that intronic splicing regulation occurs predominantly within 100 bp of the alternative splice sites and that conserved elements in this region are, as expected, functioning as splicing regulators. We show that among conserved cassette exons, increased conservation of flanking introns is associated with reduced inclusion. We also propose a new definition of intronic splicing regulatory elements (ISREs) that is independent of conservation, and show that most ISREs do not match known binding sites or splicing factors despite being predictive of percent-spliced-in.


CONCLUSIONS
These findings suggest that one mechanism for the evolutionary transition from constitutive to alternative splicing is the emergence of cis-acting splicing inhibitors. The association of our ISREs with differences in splicing suggests the existence of novel RNA-binding proteins and/or novel splicing roles for known RNA-binding proteins.

There are two reasons Old+ exons could have lower Ψ than Old−: increased ∆Ψ across tissues, or lower Ψ in every tissue. To ascertain which effect contributes more, we expanded our analysis to a dataset of Ψ values for 52 human and mouse cell types from Irimia et al. (2014) in order to capture as much regulatory diversity as possible. In this dataset, 80% of exons that are Old+ both upstream and downstream have ∆Ψ > 1/3 between some pair of cell types, and this percentage could increase further as more cell types are considered (Fig. S1).

Figure S1: Percentage of Old+ exons with ∆Ψ > x between any pair of cell types for various x, as increasing numbers of cell types are added to the analysis in random order, averaged over 100,000 randomizations. Error bars are not shown since the standard error is negligible. Because the percentages have not completely plateaued by 52 cell types, the true percentages could be somewhat larger when considering all possible cell types and conditions.
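As a concrete illustration of how the quantity plotted in Fig. S1 can be computed, the following sketch (not the original analysis code; the variable names and the assumption of a complete Ψ matrix are ours) adds cell types in a random order and tracks the fraction of exons whose largest pairwise ∆Ψ among the cell types seen so far exceeds a threshold, averaged over many orderings.

```python
# Minimal sketch (assumed data layout): psi is a (num_exons x num_cell_types)
# array of Psi values for exons that are Old+ both upstream and downstream.
import numpy as np

def frac_exceeding(psi, threshold=1/3, n_rand=1000, seed=0):
    """For k = 1..num_cell_types, estimate the fraction of exons whose maximum
    pairwise delta-Psi among the first k cell types (added in random order)
    exceeds `threshold`, averaged over `n_rand` random orderings."""
    rng = np.random.default_rng(seed)
    n_exons, n_types = psi.shape
    frac = np.zeros(n_types)
    for _ in range(n_rand):
        shuffled = psi[:, rng.permutation(n_types)]
        # Running max/min over the cell types seen so far; their difference is
        # the largest delta-Psi between any pair of included cell types.
        run_max = np.maximum.accumulate(shuffled, axis=1)
        run_min = np.minimum.accumulate(shuffled, axis=1)
        frac += ((run_max - run_min) > threshold).mean(axis=0)
    return frac / n_rand  # frac[k-1]: average fraction when k cell types are included
```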

Tables of ISREs
The following four tables list ISE and ISS 6-mers. 6-mers with a high affinity for at least one RBP (see next section) are bolded. 6-mers identified as ISREs by Yeo et al. (2007) are italicized.

RBPs with affinity for our ISREs
We catalogued 793 6-mers that contain the NOVA binding site YCAY (Dredge and Darnell 2003) or are subsequences of 7-mers found by Ray et al. (2013) to have a high affinity (E score > 0.45) for at least one of 207 RBPs. The following two tables list upstream and downstream ISREs that match these 793 6-mers, and the RBPs which have a high affinity for each one, with the affinity listed in brackets after each RBP. NOVA was not studied by Ray et al., so its affinities are not listed. ISEs are shown in green and ISSs in red.

Architecture of ConsNet
Fig. S2 shows the architecture of ConsNet, the neural network model that achieves state-of-the-art performance at absolute Ψ prediction. ConsNet predicts Ψ from sequence and conservation data via the following steps (a minimal code sketch follows the list):
1. The raw genetic sequence, represented as a 4-by-sequence-length binary matrix, is convolved with n filters of width w, analogous to position-weight matrices (Stormo et al. 1982), to give a 1-by-sequence-length 'feature map' for each filter indicating the similarity of that filter to each point in the sequence. While the filters do not necessarily have a direct biological interpretation, they can represent, for instance, consensus motifs for splice sites or cis-acting RBPs.
2. Each of the n feature maps is shifted upwards by a bias that depends on the filter, then rectified to set negative values (i.e. regions of the sequence that do not correspond well to the filter) to zero. The purpose of this step is to apply a non-linearity to the feature map, which increases the modelling capacity of the network.
3. Each feature map is weighted by conservation, specifically the phastCons score at each base, which intuitively has the effect of focusing the network's attention on the most conserved regions of the sequence. However, since conservation turns out to be much more useful than the sequence itself, it might be better to think of this step as weighting conservation by the sequence, rather than the other way around.
4. Each feature map is max pooled over a certain pooling width: only the largest value of the feature map in each region of the sequence is retained, and the other values are discarded.

5. Features from the pooled feature maps of every filter pass through several hidden layers, where each layer is a non-linear function of the previous layer. Here, important features of the max-pooled feature maps are recombined and distilled into composite features.
6. The composite features of the last hidden layer are combined to predict each tissue's Ψ, using multitask learning: each tissue takes a different linear combination of the features, then applies the sigmoid function σ(x) = 1/(1 + e^(−x)) so that the result is between 0 and 1.
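The sketch below, in PyTorch, walks through the six steps above. It is an illustrative reimplementation rather than the authors' code; the filter count, filter width, pooling width, layer sizes and tissue count are placeholder values (in the real model they were chosen by the hyperparameter search described below).

```python
# Illustrative ConsNet-style forward pass; all sizes are placeholder values.
import torch
import torch.nn as nn

class ConsNetSketch(nn.Module):
    def __init__(self, n_filters=64, filter_width=15, pool_width=12,
                 n_regions=8, hidden=200, n_tissues=5):
        super().__init__()
        # Step 1: convolve the one-hot sequence (4 x L) with n filters.
        self.conv = nn.Conv1d(4, n_filters, filter_width, padding='same')
        # Step 2: the per-filter bias is part of the convolution; rectification below.
        self.relu = nn.ReLU()
        # Step 4: max pooling over non-overlapping regions of the sequence.
        self.pool = nn.MaxPool1d(pool_width)
        self.n_regions = n_regions
        # Step 5: hidden layers recombining pooled features into composite features.
        self.hidden = nn.Sequential(
            nn.Linear(n_filters * n_regions, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU())
        # Step 6: one linear output per tissue, squashed to (0, 1).
        self.out = nn.Linear(hidden, n_tissues)

    def forward(self, seq_onehot, cons):
        # seq_onehot: (batch, 4, L) one-hot sequence; cons: (batch, L) phastCons scores
        fmap = self.relu(self.conv(seq_onehot))           # steps 1-2
        fmap = fmap * cons.unsqueeze(1)                   # step 3: weight by conservation
        pooled = self.pool(fmap)[:, :, :self.n_regions]   # step 4: keep a fixed number of
                                                          # pooling regions (nearest the
                                                          # splice site in the real model)
        h = self.hidden(pooled.flatten(1))                # step 5
        return torch.sigmoid(self.out(h))                 # step 6: per-tissue Psi in (0, 1)
```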
The convnet of Fig. 1(a) has the same architecture as ConsNet except that the conservation weighting step is omitted. Convnets, or convolutional neural networks, are a standard type of neural network architecture traditionally used for image processing (Fukushima 1988; LeCun et al. 1998). All other networks used a standard feed-forward architecture, with input data rescaled to have zero mean and unit variance to improve classification performance.
Informally, a model is trained by repeatedly showing it the sequence and conservation data for randomly chosen exons and asking it to predict their splicing levels in each tissue. If a prediction differs from the experimental value, every parameter in the model (the filters and their biases, as well as the components of the nonlinear functions for combining the feature maps into a prediction) is changed simultaneously in the direction (increase or decrease) that would reduce the error in the prediction, in proportion to the effect that this change would have on the error.
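A compact sketch of this training procedure, reusing ConsNetSketch from the earlier sketch: the loss, learning rate, momentum value and synthetic data below are assumptions, and the momentum schedule and early stopping described under Hyperparameters are omitted for brevity.

```python
# Illustrative training loop: SGD with momentum and per-epoch learning-rate decay.
import torch
from torch.utils.data import DataLoader, TensorDataset

L, n_tissues = 600, 5                                    # placeholder sizes
model = ConsNetSketch(n_tissues=n_tissues)
data = TensorDataset(torch.rand(1000, 4, L),             # one-hot sequence (random here)
                     torch.rand(1000, L),                # phastCons conservation track
                     torch.rand(1000, n_tissues))        # measured Psi per tissue
train_loader = DataLoader(data, batch_size=100, shuffle=True)

optimizer = torch.optim.SGD(model.parameters(), lr=0.05, momentum=0.9)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.98)
loss_fn = torch.nn.MSELoss()                             # error between predicted and measured Psi

for epoch in range(100):
    for seq, cons, psi_target in train_loader:
        loss = loss_fn(model(seq, cons), psi_target)
        optimizer.zero_grad()
        loss.backward()                                  # each parameter's effect on the error
        optimizer.step()                                 # nudge every parameter to reduce the error
    scheduler.step()                                     # exponential learning-rate decay per epoch
```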
The 7982 exons where at least one tissue had a high-confidence Ψ measurement were randomly divided into 3/4 training data, 1/8 validation data and 1/8 test data. Model settings, or 'hyperparameters', were selected via random search (Bergstra and Bengio 2012) to optimize performance on the validation data; AUCs are reported for predictions on the test set concatenated over all tissues.
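A minimal sketch of the data split and test-set evaluation: the 3/4, 1/8, 1/8 proportions follow the text, while the binarization of Ψ into high/low labels for the AUC and the dummy prediction arrays are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

n_exons, n_tissues = 7982, 5                 # 5 tissues is a placeholder value
rng = np.random.default_rng(0)
idx = rng.permutation(n_exons)
n_train, n_valid = n_exons * 3 // 4, n_exons // 8
train, valid, test = np.split(idx, [n_train, n_train + n_valid])

# Placeholder predictions and labels; in practice these come from the trained
# model and from thresholding the measured Psi values.
pred = rng.random((n_exons, n_tissues))
label = (rng.random((n_exons, n_tissues)) > 0.5).astype(int)

# AUC on the test set, concatenated over all tissues (entries without a
# high-confidence measurement would be masked out in practice).
auc = roc_auc_score(label[test].ravel(), pred[test].ravel())
print(f"test AUC: {auc:.3f}")
```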

Hyperparameters
The following hyperparameters were searched over for each of the experiments in Fig. 1:
• Num hidden layers: the number of hidden layers (search range: 0 to 4, or 0 to 3 for (h) and (i))
• Num hidden units: the number of hidden units in each layer (50, 100, 200, 500 or 1000, with the restriction that each layer must not have more units than the previous)
• Learning rate (LR): the initial learning rate
• LR decay: the value the learning rate is multiplied by after each epoch, so that it decays exponentially (0.97 to 1)
• LR multiplier - output: the multiple of the learning rate used for the output layer (0.2 to 5, log scale)
• LR multiplier - hidden: the multiple of the learning rate used for the hidden layers (0.2 to 5, log scale)
• Weight scale - hidden: the scale of the random weight initializations of the hidden layers, which are chosen uniformly between this number and its negative (0.001 to 0.1, log scale)
• L1 weight penalty: the error penalty for the magnitude of each weight (1e-10 to 1e-3, log scale)
• L2 weight penalty: the error penalty for the squared magnitude of each weight (1e-10 to 1e-3, log scale)
• Dropout probability: the chance of removing each hidden unit for each minibatch during training (0 to 0.5 in steps of 0.1)
• Momentum: starts at Initial momentum (0.05 to 1, 1 − log scale), increases linearly to Plateau momentum (0.001 to 1, 1 − log scale) over Increase time epochs (0 to 40) to give the training procedure time to find a direction of consistent gradient, stays constant until Plateau end epochs (40 to 80), then decreases linearly to Final momentum (0.05 to 1, 1 − log scale) over Decrease time (10 to 60) epochs to avoid overshooting the optimum. Note that the momentum is not mandated to first increase and then decrease; rather, the search usually selects trials with this trajectory. For most trials, early stopping occurs before Decrease time, so the subsequent momentum hyperparameters are not listed.
• Early stopping epoch: the epoch at which the best validation performance is obtained (0 to 100)

The weight scale of the output layer was not searched over, and was instead selected according to the normalized initialization formula from Glorot and Bengio (2010). A minibatch size of 100 was used for all trials.

Figure S2: Architecture of ConsNet, a deep neural network model that achieves state-of-the-art performance at predicting alternative splicing.

Several additional hyperparameters were searched over for the convnet and ConsNet:
• Pooling width: the width of the pooling regions used in max pooling (3, 6, 12 or 24). (a) and (b) also used features from an intermediate pooling step (48, 96 or 192), although this did not significantly affect performance and was removed for (c) and (d).
• Num pooling regions: the number of pooling regions nearest the splice site used as features (4, 8, 12 or 16; for the intermediate pooling step, 2, 4, 6 or 8)
• Filter width: the width of the convolutional filters (6 to 32)
• Num filters: the number of filters (24 to 128)
• Weight scale - conv: the scale of the random weight initializations of the convolutional layer (0.001 to 0.1, log scale)
• LR multiplier - conv: the multiple of the learning rate used for the convolutional layer (0.2 to 5, log scale)

The optimal hyperparameters found for each experiment are shown in Table S7.

Table S7: Optimal hyperparameters for each experiment in Fig. 1, as selected by random search. 1(e), the previous state of the art, is not shown.
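As an illustration of the random search over these ranges (Bergstra and Bengio 2012), the sketch below samples a subset of the hyperparameters listed above and keeps the best-scoring trial. It is not the authors' search code: only a few of the hyperparameters are included, and `train_and_validate` is a placeholder standing in for training a model and returning its validation AUC.

```python
# Minimal random hyperparameter search over a subset of the ranges above.
import numpy as np

rng = np.random.default_rng(0)

def sample_hyperparameters():
    n_layers = rng.integers(0, 5)                        # 0 to 4 hidden layers
    sizes = sorted(rng.choice([50, 100, 200, 500, 1000], size=n_layers),
                   reverse=True)                         # layers never grow in size
    return {
        'hidden_sizes': [int(s) for s in sizes],
        'lr': 10 ** rng.uniform(-3, -1),                 # assumed initial LR range
        'lr_decay': rng.uniform(0.97, 1.0),
        'l2_penalty': 10 ** rng.uniform(-10, -3),        # log scale
        'dropout': rng.choice([0.0, 0.1, 0.2, 0.3, 0.4, 0.5]),
        'filter_width': int(rng.integers(6, 33)),
        'n_filters': int(rng.integers(24, 129)),
        'pooling_width': int(rng.choice([3, 6, 12, 24])),
    }

def train_and_validate(hp):
    # Placeholder: train a model with hyperparameters `hp` and return its
    # validation AUC; a random score is used here so the sketch runs.
    return rng.random()

best = max((sample_hyperparameters() for _ in range(200)),
           key=train_and_validate)                       # keep the best validation AUC
print(best)
```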