Construction of vector pIL07
As described [20]. The pIL07 plasmid was made from the pUC19 subcloning vector for the purpose of isolating ARSs. S. cerevisiae LEU2 gene was PCR amplified with primers containing XbaI sites, digested and ligated into the XbaI site in pUC19. URA3 and S. cerevisiae CEN5 were cloned similarly into the EcoRI and HindIII sites respectively. The BamHI site used for screening ARS fragments is located between the divergently transcribed URA3 and LEU2 genes. Full sequence of this vector is available upon request.
Construction of L. kluyveri Genomic Libraries
Genomic DNA from the sequenced diploid FM479 strain of L. kluyveri was isolated using standard methods. Cells were broken using glass bead lysis and the genomic DNA was separated from mitochondrial DNA using a standard CsCl gradient protocol. Following the gradient, DNA was precipitated with 75% ethanol and resuspended in water. Prior to library construction, the DNA was digested with MboI and treated with Antarctic Phosphatase (New England Biolabs) to prevent the multimerization of genomic fragments. The digested insert DNA was ligated into the unique BamHI site in the pIL07 vector [20] using T4 DNA Ligase (New England Biolabs). The ligation reaction was purified using a PCR Purification Kit column (QIAgen) and redigested with BamHI to linearize all empty pIL07 molecules. The resulting reaction was used to transform chemically competent E. coli cells using standard methodology. In order to estimate library coverage we performed colony PCR on a 24 E.coli colonies containing plasmid clones from a set of transformation reactions. The primers used (IL325, IL326, see below) anneal to the vector sequence flanking the cloned inserts and the amplified products were analyzed by agarose gel electrophoresis to determine cloning efficiency and insert sizes. The insert sizes ranged from ~50 bp to > 2 kbp with an average length of 350 bp. Cloning efficiency ranged from 50%/reaction to 80%/reaction. Each plate of E.coli colonies covered an average of 75 kb of genomic DNA. Each plate was scraped separately to pool the E.coli colonies, and plasmids were extracted using the Wizard Plus SV Miniprep Kit (Promega) prior to yeast transformation.
Screening of L. kluyveri Libraries for ARS Activity
The plasmid libraries were used to transform a ura3 auxotrophic L. kluyveri strain FM628 using a standard Lithium Acetate protocol [55] and plated on medium lacking uracil to select for ARS function. One yeast colony per library transformation was re-streaked onto fresh plates and subsequently grown in culture medium lacking uracil to enrich for the ARS bearing plasmid. The plasmids were isolated from yeast using a modified DNA extraction protocol. 2 mL of culture was pelleted and resuspended in 500 μL of a buffer consisting of 1 M sorbitol, 0.1 M EDTA, 1 μL/mL β-mercaptoethanol, and 0.5 mg/mL Yeast Lytic Enzyme (VWR, catalog # IC360951). The suspension was incubated at 37 degrees for 1 hour. The treated cells were pelleted and processed using the Wizard Plus SV Miniprep kit (Promega). 5 μL of the resulting eluate was used to transform E. coli and transformants were miniprepped to isolate the ARS-bearing plasmid. A small sample of each plasmid was used to re-transform FM628 to confirm ARS function. Confirmed ARS plasmids were sequenced using primers IL325 (5'-GCCAAACAACCAATTACTTGTTGAGA-3') and IL326 (5'-TTCGTTGCTTGTCTTCCCTAGTTTC-3') from both ends of the ARS fragment. All Lk ARSs identified are listed in Additional File 3.
Cloning, Truncation and Mutagenesis of ARS sequences
As described [20]. To clone and test predicted regions of DNA for ARS activity, primers containing BamHI and BglII cloning sites were designed to anneal to the relevant region. PCR amplified DNA was cloned into pIL07 and/or pRS406 and confirmed by sequencing. Several clones of each predicted region were used to transform MW98-8C to test for ARS function. ARS fragments were truncated by amplifying and cloning smaller fragments of the ARS region. Site-directed mutagenesis was performed using a fusion-PCR mutagenesis method [56]. The DNA region to be mutagenized was PCR amplified in two separate fragments, which overlap by 50 - 60 base pairs. The overlap primers contained the desired mutation. After the initial PCR, the two fragments were purified separately and used together in another PCR reaction without any template DNA. The overlapping regions in the two DNA fragments acted as primers for each other and PCR produced a final molecule which contained the entire DNA fragment including the mutation of interest. This fragment was then cloned into the vector and sequenced as above.
Analysis of ARS interchangeability data
We categorized the functionality of each Lk ARS in S. cerevisiae and K. lactis as 'no', 'weak', or 'yes' according to a visual inspection of its ARS functionality in the respective host species. Lk ARS function in other species is listed in Additional File 3.
We assessed the statistical significance of the number of Lk ARSs that function in both of the other two species using a hypergeometric test. More precisely, we conducted 3 tests one for each of the three possible interpretations of a weakly functional ARS: ignoring any ARS that is classified as 'weak' in either one of the two host species, classifying a 'weak' ARS as 'yes', and classifying a 'weak' ARS as 'no'.
The hypergeometric test measures the level of surprise of the size of the intersection between two sets, in our case the set of Lk ARSs that function in S. cerevisiae and the set of Lk ARSs that functions in K. lactis. Specifically our model is that we sample without replacement n balls (the K. lactis 'yes' Lk ARSs) out of an urn with m black balls (the S. cerevisiae 'yes' Lk ARSs) and N-m white balls (the S. cerevisiae 'no' Lk ARSs). Let k be the number of sampled black balls, then k, which is the number of Lk ARSs that function in both species, is statistically significant if the probability of observing in our sample k or more black balls is below 0.05. This probability is given by the hypergeometric distribution function.
We repeated this test for evaluating the size of the set of Sc ARSs that function in both K. lactis and L. kluyveri and the size of the set of Kl ARSs that function in both S. cerevisiae and L. kluyveri. We evaluated the p-values for the three different possible classifications of weak functionality for all three sets of foreign ARSs. Only the intersection of Kl ARSs that function in both S. cerevisiae and L. Kluyveri was statistically significant and in this case it was significant regardless of how we classify weak functionality: 0.002 (weak = functional), 0.006 (ignoring weak), 0.018 (weak = non-functional).
Identification of ACS position weight matrix (PWM)
The L. kluyveri genome we downloaded and analyzed is the Genolevures 2008 09 version of NRRL Y-12651 (aka CBS3082) project accession AACE00000000 http://www.genolevures.org/download.html#sakl.
We used the given annotations to generate a file of L. kluyveri intergenic sequences by filtering out all sequences with feature type 'gene' or 'LTR', or 'gap'.
When searching for the L. kluyveri ACS GIMSAN was applied to the set of 84 L. kluyveri ARSs that were uniquely mapped to the L. kluyveri genome with the aforementioned intergenic file and the additional parameters:
--w = 7,8,9,10,11,12,13,14,15,16,17,25,30,40,50 --oops --t = 200 --L = 200 --nullset = 50 --markov = 4
The "S. cerevisiae intergenic file" was generated by removing all sequences with feature type 'gene' or 'LTR', or 'gap' from the October 2003 GenBank version of the S. cerevisiae genome.
To define the S. cerevisiae ACS motif GIMSAN was applied to a list of 324 verified S. cerevisiae ARSs that is largely based on the list of OriDB confirmed ARSs and that was compiled as explained in [20]. In addition to the aforementioned S. cerevisiae intergenic file GIMSAN was given the following parameters
--w = 11,17 --oops --t = 200 --L = 200 --nullset = 0 --markov = 4
Assessing the selectivity of a PWM
We define the selectivity of a motif as its ability to distinguish between sites generated according to the motif model (PWM in this case) and sites occurring in "random DNA". Specifically, if we imagine we have a list of "real sites" of length l that are generated by the model and a null generated list of N l-mers we can ask how many null sites vs. real sites score above the threshold which we vary. This is a special case of an ROC (receiver operating characteristic) curve and as such it is summarized by the area under the curve (aROC). A maximally selective PWM will have an aROC of 1 (all real sites score higher than all null sites) whereas a non-selective PWM will have an aROC of 0.5 (a real site is equally likely to score higher or lower than a null site.
For each species putative ACS motif (we used the 9 bp Lk ACS, the 33 bp Sc ACS and the 50 bp Kl ACS) we compiled a PWM from all the sites selected by our motif finder when applied to the set of the species' native ARSs and added a pseudocount of 0.01. Using each of these PWMs at a time we sampled 200 sites using the canonical independent column assumption. We then added to the right and left of each sampled site flanks of length 4 bp sampled uniformly from the corresponding set of flanks of the same putative motif sites that are used to define the PWM2.
We then generated 200,000 "null sites" by sampling a sequence of length L (where L = 100,000 + width of motif - 1) from a background 4th order Markov chain trained on the species' set of intergenic sequences. The null sites were defined by the list of all the words in this string and its reverse complement. We then scored the 200 sampled "real sites" as well as these 200,000 sampled "null sites" using an LLR (log-likelihood ratio) score: a site score is the log of the ratio of the likelihood of the site under the PWM model over the likelihood of the site under the background Markov model. The slight twist we introduced here is to average the background likelihood over the two "strands", i.e., over the site and its reverse complement.
We then used the canonical measure of aROC to gauge how well the PWM distinguishes between the null and real sites.
Estimating the predictive power of an ACS PWM
Using each host species ACS PWM we assign each foreign ARS a score that is the score of the best match to the ACS (assigned by SADMAMA). We then rank each foreign ARS according to its score and evaluate, using the standard measure of aROC, how well this ranking agrees with the ARS functionality of these sequences in the host species. A perfectly predictive classifier (or PWM in this case) would give an aROC value of 1 while a random classifier would give an aROC of ~0.5. When using the aROC we can only define 2 classes so we use only the functional and non-functional set of ARSs in this evaluation, leaving out all weak ones.
To utilize the set of weak ARSs we note that the aROC has an equivalent probabilistic formulation. Namely, if you imagine randomly drawing one sequence from the functional and one from the non-functional set of foreign ARSs, then the aROC is the probability that the (classifier) score of the functional sequence will be higher than the score of the non-functional sequence. One advantage of this latter formulation is that it suggests an obvious generalization for more than 2 classes. For example, when we have 3 linearly ordered classes, as in our example (non-functional < weak < functional), we can define the generalized aROC as the probability that the scores of a randomly drawn triplet of sequences, one from each class, will be ordered correctly: the score of the non-functional sequence is the smallest and the score of the functional sequence is the highest. Note that while a perfect classifier would still have a generalized aROC of 1, the generalized aROC of a random classifier on 3 classes would be roughly 1/6 ~ 0.167 as there are 6 different permutations or ways to order the scores of the 3 sequences.
The summary of our analysis of the 3-class, or generalized, aROC is presented in Additional File 1, Table S5. The ranking of the predictive power of the ACS PWMs of the 3 species is the same as for the 2-class, or standard, aROC. The confidence interval for the predictive power of the Kl ACS is unusually wide since there is only one foreign Kl ARS that is weakly functional.
Combining the information from the 3-class and the 2-class aROC allows us to gauge the predictive power of our models on a finer scale than we can when using the standard 2-class aROC. For example, we can test whether the scores our models assign to the weak ARSs are correctly placed between the functional and the non-functional foreign ARSs. More precisely, we compute the probability that a randomly drawn weak foreign ARS is correctly placed between a randomly drawn pair of functional and non-functional foreign ARSs, conditioned on that pair being correctly ordered: the score of the non-functional foreign ARS is smaller than the score of the functional one. This conditional probability is given by the ratio of the 3-class aROC to the 2-class aROC. As there are 3 possible outcomes for placing the weak ARS score relative to the selected ordered pair, a ratio of 1/3 is essentially random: the model is not able to resolve the more subtle differences between weak ARSs and functional/non-functional ones.
Confidence intervals for aROC of predicting functionality of foreign ARSs
To account for the random effects in evaluating the aROC we used bootstrap to construct approximate 95% confidence as described next. Each foreign ARS is assigned a score, corresponding to the best match to the host species ACS, as well as a label describing its ARS functionality in the host species: functional, non-functional, or weakly functional. Thus, for each host species we have 3 lists of foreign ARS sequence scores: functional, non-functional and weak. We sample with replacement each of the three lists separately to generate 10,000 bootstrapped score lists of the same size as the original. We then compute the aROC for the bootstrapped functional and non-functional lists and compute the 3-class aROC using all three bootstrapped lists. This provides us with an empirical sample of 10,000 aROCs from which we generate approximate confidence intervals as described next.
When the host species was L. kluyveri the empirical sample was well approximated by a normal distribution as verified using a QQ plot (data not shown). Therefore we constructed normal 95% confidence intervals based on the standard deviation estimated from the empirical sample.
When S. cerevisiae and K. lactis are host species, the normal assumption is not well supported in the empirical sample. This is probably due to the smaller sample sizes as well as the proximity of the point estimator of the aROC to the maximal possible score of 1. Therefore in these two cases we used the 0.025 and 0.975 quantiles of the empirical distribution as a proxy for the 95% confidence intervals. To validate this approach we compared the quantiles based intervals with the normal derived intervals in the case of L. kluyveri and found good agreement between the two.
Searching for an auxiliary motif
GIMSAN was applied to the set of 84 native Lk ARSs after the length 9 bp motif site was masked out in each sequence. We used the L. kluyveri intergenic background file mentioned above and the following additional parameters:
--w = 6,7,8,9,10,11,14,17,25 --oops --t = 200 --L = 200 --nullset = 50 --markov = 4
--w = 6,7,8,9,10,11,14,17,25 --zoops = 0.2 --t = 200 --L = 200 --nullset = 50 --markov = 4
Evaluating the paired linear model
We considered paired linear models defined by the 9 bp putative L. kluyveri ACS PWM and an auxiliary PWM. The model had 2 parameters which are the positive weights assigned to each of the two PWMs. Given the weights and a match to each of the PWMs the score is the weighted sum of the standard scores assigned to each PWM match. The score of a sequence is defined as the maximal weighted match scores subject to the constraint that the matches cannot overlap. As the best matches to each of the PWM can potentially overlap in any given sequence, this sequence score is not necessarily equal to the weighted sum of these two best matches.
The three auxiliary motifs we considered were selected based on the fact that two of them were assigned by GIMSAN the best overall p-values: the 14 bp motif found using a ZOOPS model and a 25 bp motif found using an OOPS model. In addition we tested a 6 bp motif reported by GIMSAN using a ZOOPS model (all 3 motifs can be inspected in Additional File 2, Figure S3).
Given a training set of ARSs for which we know the labels ('yes', 'no', or 'weak') we define the optimal pair of weights as the one that will maximize either the 2-class (in which case we ignore all 'weak' ARSs) or 3-class aROC depending on which one we are trying to optimize. The optimization is achieved using the general Powell minimization3 method implemented in the Python Scipy package.
At the core of the optimization is the function that computes the score of each sequence given the current value of the pair of weights. In principle, computing this score for a given sequence involves considering every pair of sites in the sequence, one per each PWM and taking the maximum of all the corresponding weighted sums. However, we can rank the matches to each PWM and use those ranks to identify each pair of matches with a point in the 2-d integer lattice.
It is easy to see that ignoring the non-overlap constraint the maximal weighted sum will always coincide with the (1,1) point in the lattice, that is the best match to each PWM (recall the weights are positive). However, in general this point as well as others on this lattice might not be feasible due to overlap between the corresponding sequence matches. It is not difficult to see that in this general case the maximum can only be attained on the maximal lattice points among the set of feasible lattice points. Finding the latter points is something that can readily be done in a preprocessing step by going over the 2 lists of ranked matches, one for each PWM. This preprocessing significantly reduces the amount of computation required for evaluating the aROC associated with each pair of weights.
To evaluate the predictive power of our model we use cross-validation. We randomly partition the set of L. kluyveri foreign ARSs into n folds (we used n = 10 here), subject to the constraint that the classes proportion in each fold would be essentially constant. We then sequentially leave out one fold and use the remaining n-1 folds as a training set to find the pair of optimal weights as described above. We then use these weights to assign scores to the ARSs in the left out fold and evaluate the aROC (2-class or 3-class depending on the corresponding training target function). We next average the aROC over the n left out folds. Finally, we repeat this entire process 1,000 times, randomly partitioning the ARSs into n folds each time and report the average of the n-fold averaged aROC.
Constructing approximate confidence intervals for the cross-validation procedure
In this work we used cross-validation on a number of occasions to estimate a model's aROC. As usual with such point estimates there is randomness in its exact value. In this case, the average aROC depends on the arbitrary assignment of the sequences into the n folds. It also depends on the set of foreign ARSs we happened to isolate. To control for these two sources of random fluctuations we constructed approximate confidence intervals using the following procedure.
We randomly partition the data into n folds using the same n that was used in the original cross-validation scheme, preserving the (2 or 3) functional classes proportions in each fold. We then bootstrap each fold separately by sampling with replacement the sequences in each class of that fold. This procedure leaves us with a bootstrapped set of n folds that are disjoint (although a sequence might be found more than once within a fold). We apply the original "leave out one fold at a time" evaluation scheme to this bootstrapped set of folds to obtain an aROC value. Repeating this procedure 1,000 times we generate a bootstrapped empirical distribution of the aROC. Our reported approximate 95% confidence interval is estimated from the 0.025 and 0.975 quantiles of this empirical distribution.
A variant of this procedure allows us to determine whether one method for predicting ARS function is statistically significantly better than another method. Specifically, we evaluate the difference between the two methods' aROC on each bootstrapped sample generated as above. If the 95% confidence interval of the difference lies entirely to the right/left of 0 then we say the first/second method is significantly better. The 95% confidence intervals are constructed using the normal method if that normal assumption is supported by the Lilliefors test (at the standard 5% level). Otherwise, the 0.025 and 0.975 sample quantiles are used to define the approximate 95% confidence interval.
PWM contextual model
The alignment of the native ARSs in each species (Additional File 2, Figure S4) was visually partitioned into 3-4 segments one of which was reserved for the ACS. A PWM was learned from each segment (with the ACS segment yielding back its own PWM).
For each scanned sequence (a foreign ARS) the top 50 matches of the ACS PWM were found by SADMAMA (-pwmPC 0.01 -m 4 both_strands -siteNullScore avg_strands). A python script was written to parse the SADMAMA output and add to it the weighted scores of the neighboring segments using the contextual PWMs mentioned above. A pseudo count of 0 was used for each contextual PWM and the background model for the LLR score was a 0-th order Markov chain learned from the host intergenic sequences.
Given a training set, the weights are optimized for the 2-class aROC (or the 3-class aROC depending on the optimized target). The training sets are determined through a cross-validation scheme applied to the set of foreign ARSs. Specifically, for each species we divided its set of foreign ARSs into n folds (n = 10 for S. cerevisiae and L. kluyveri, n = 7 for K. lactis). Leaving out one fold at a time, we used n-1 of the n folds to learn a set of computationally optimal segment weights and used those to calculate the aROC on the left-out fold and finally report the average aROC over the n cycles. The 95% confidence intervals for the aROC as well as for the difference between the aROCs of alternative methods were estimated using bootstrap as described above.
We also examined the effects of allowing some flexibility at the seams between the PWMs. Specifically; we allowed some slack, a small gap or overlap, between the end of the current segment and the start of the next one. The slack in the offset from the ACS PWM was no more than k times the number of segments between the relevant segment and the ACS (k = 0,1,2 were considered). Adding this flexibility did not seem to improve the model's prediction power so all the results reported below are for the rigid case (k = 0).
The S. cerevisiae segments that were tested are (see Additional File 2, Figure S4, left pane):
The K. lactis segments that were tested are (see Additional File 2, Figure S4, right pane):
The L. kluyveri segments that were tested are (see Additional File 2, Figure S4, middle pane):
We also tested two "shorter" L. kluyveri contextual models, one using the 9 bp ACS together with a pair of flanking segments of length 25 bp, and another using the first 3 segments of the 4-segments L. kluyveri model described above.
Markov contextual model
The models that were tested were based on the same segmentation used in the PWM contextual model above.