Probabilistic models are widely used in biological sequence analysis. They are essential mechanisms to pre-process the plethora of available data, creating hypotheses for biological validation. Examples are Hidden Markov Models (HMMs) [1–3], Weight Array Matrices (WAMs) [4] and Covariance Models (CMs) [5]. In this context, probabilistic models can be used to represent known families of sequences and to create programs that predict whether specific sequences belong to the family of interest. However, these models assign non-zero probability values to most input sequences. Therefore, we need a criterion to decide when a given probability value is sufficient. One of the most commonly used techniques is Bayesian classification using two probabilistic models: F, which represents a family of sequences, and A, an alternative model. The likelihood of each of the two models is measured and the sequence is classified as belonging to F if the likelihood of F is greater than the likelihood of A.
The choice of the alternative model is essential to reduce the number of false predictions and depends on the problem. An alternative model can be either a negative model, representing the complementary set of the sequences of interest, or a null model, representing random sequences. Negative models are used when there is a deeper biological understanding of the particular problem and it is possible, with a high degree of certainty, to characterize the sequences that are not part of the family. Therefore, the choice of the probabilistic model to be used as the negative model depends on a strong biological hypothesis about the complementary set. Null models are used when we do not have sufficient information to characterize the complementary set of the sequences we want to classify. This situation is generally the rule for annotation software, where we want to characterize a sequence family (e.g. tRNAs, exons, miRNAs, transmembrane domains, ...) against all other sequences. This is the scoring technique used by sequence analysis tools such as HMMER [2], SAM [6] and INFERNAL [7]. Null models are the alternative-model strategy considered in this work.
More technically, we want to compute, given a nucleotide sequence x, which model better represents the sequence: the family model F (representing the family of sequences we are interested in) or the null model N (representing other sequences). The sequence x is classified as belonging to the family represented by the model F if P(F|x) > P(N|x) or, equivalently, by Bayes' rule, if

P(x|F)P(F)/P(x) > P(x|N)P(N)/P(x).

Considering P(F) = P(N), the classification of x simplifies to the comparison of the likelihoods:

P(x|F) > P(x|N).

To cope with the very small probability values when sequences are long, log values are used. So, we use the log-odds score S:

S(x) = log P(x|F) - log P(x|N). (1)
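As a concrete illustration, the decision rule fits in a few lines of code. Below is a minimal Python sketch, where log_likelihood_F and log_likelihood_N are hypothetical functions returning log P(x|F) and log P(x|N); the toy i.i.d. models in the usage example are illustrative only:

```python
import math

def log_odds_score(x, log_likelihood_F, log_likelihood_N):
    """S(x) = log P(x|F) - log P(x|N), per Eq. (1)."""
    return log_likelihood_F(x) - log_likelihood_N(x)

def belongs_to_family(x, log_likelihood_F, log_likelihood_N):
    """Classify x as a member of family F when S(x) > 0,
    i.e. when P(x|F) > P(x|N)."""
    return log_odds_score(x, log_likelihood_F, log_likelihood_N) > 0

# Toy usage with i.i.d. models: an AT-rich family versus a uniform null.
family = {"A": 0.4, "T": 0.4, "C": 0.1, "G": 0.1}
null = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
iid_ll = lambda p: (lambda x: sum(math.log(p[b]) for b in x))
print(belongs_to_family("ATTATA", iid_ll(family), iid_ll(null)))  # True
```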
We want null models that help classifiers reject sequences that do not belong to family F (which we will call negative sequences). Therefore, such a null model N should score higher than the family model F for any negative sequence. In other words, with a good null model, the log-odds score of any negative sequence will be zero or less.
Null models, due to their very generic nature, should not present any structure. Therefore, a convenient model to describe random sequences in a null model N is a position-independent probability distribution, which imposes no structure on the sequences. For nucleic acid sequences, the null model assigns a fixed probability value P_N(i) to each nucleotide (i = A, C, G, T). Therefore, the probability value of a sequence x of length L is given by the formula:

P(x|N, L) = [P_N(A)]^c_A * [P_N(C)]^c_C * [P_N(G)]^c_G * [P_N(T)]^c_T (2)

where c_i is the count of nucleotide i in the sequence x, and c_A + c_C + c_G + c_T = L.
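In log space, Eq. (2) reduces to a sum over nucleotide counts. A minimal Python sketch under that reading (the uniform distribution shown is only an example):

```python
import math
from collections import Counter

def null_log_probability(x, p_null):
    """log P(x|N, L) = sum over i of c_i * log P_N(i), following Eq. (2)."""
    counts = Counter(x)                    # c_A, c_C, c_G, c_T
    assert sum(counts.values()) == len(x)  # the counts add up to L
    return sum(c * math.log(p_null[i]) for i, c in counts.items())

# Example: a uniform null model, P_N(i) = 0.25 for every nucleotide.
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
print(null_log_probability("GATTACA", uniform))  # 7 * log(0.25)
```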
Many possible strategies to set up a null model have been discussed in the literature [8–10], all of which seem to make sense biologically. Among them are: (i) using a uniform distribution; (ii) using the genomic background distribution; (iii) using the training set distribution; (iv) using the target sequence distribution. Each strategy uses a different reasoning to minimize false positives, based on how we characterize "random sequences". With the uniform distribution, we define randomness by the absence of information, even about nucleotide composition. With the genomic background distribution, we define randomness by what should be the standard nucleotide distribution of a sequence in a specific genome. With the training set distribution, we assume the family model will favor a specific nucleotide distribution (that of the known sequences used to infer the model, the training set); using the nucleotide distribution of the training set as a null model therefore helps the classifier reject sequences whose high score is due only to their base composition. Finally, with the target sequence distribution, random sequences are those with the same base distribution as the target sequence (in other words, a genomic background strategy reduced to the sequence locality). Independently of the rationale chosen, the null model will fall into one of three classes, sketched below: a uniform distribution, a fixed non-uniform distribution, or a target-dependent distribution.
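The strategies differ only in where the distribution P_N comes from. A hedged sketch of how each of the three classes could be instantiated (function names are illustrative, not taken from any cited tool):

```python
from collections import Counter

def uniform_null():
    """Class 1: uniform distribution; randomness as absence of information."""
    return {i: 0.25 for i in "ACGT"}

def empirical_null(sequences):
    """Class 2: fixed non-uniform distribution estimated from data, either
    a genomic background or the training set (a pseudocount of 1 avoids
    zero probabilities for unseen nucleotides)."""
    counts = Counter("".join(sequences))
    total = sum(counts[i] + 1 for i in "ACGT")
    return {i: (counts[i] + 1) / total for i in "ACGT"}

def target_null(x):
    """Class 3: target-dependent distribution, i.e. the genomic background
    strategy reduced to the locality of the target sequence itself."""
    return empirical_null([x])
```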
The goal of this study is to evaluate the impact of each of these three classes of null models on the false positive rate of classifiers. We found only two studies in the literature that analyzed the performance of null models [8, 9]. Each evaluates a single benchmark of amino acid sequences and only one probabilistic model (HMMs), which limits the generality of their conclusions. First, they do not address the problem for a wider range of classification models. Second, and more important, they only analyze the final accuracy results for their specific benchmarks, without considering whether these results generalize to other sequence families.
To make this study more general than previous works, we use random sequences and two different probabilistic models. Using random sequences guarantees that the study is not biased towards any particular benchmark, so we expect the results to be of broad application. Also, the simulations used random sequences across the whole GC spectrum, in an effort to make the results applicable to any real-life situation. The two probabilistic models chosen are very different, aimed at covering a wide range of models: one with a very simple architecture and one able to represent more structured sequences. The studies were performed using Weight Array Matrices (WAMs) [4] and Covariance Models (CMs) [5].
WAMs record only fixed-distance content dependencies, which is useful to represent sequence motifs. CMs are able to characterize indels and register dependencies between non-adjacent bases at arbitrary distances, which can be used to characterize secondary structure. We evaluated WAMs in the context of splice site prediction and CMs in the context of predicting RNAs and other genomic elements with secondary structure. Splice sites were used for three reasons: first, splice site prediction is at the heart of gene prediction, a biologically important problem in bioinformatics; second, data are abundant in public databases; third, many successful predictors use position-dependent models, which form the basis of the range of probabilistic models studied here. The spectrum of GC content in the dataset enabled using a single sequence family (splice sites) for all experiments with WAMs. The same was not possible for CMs, whose training sets are generally small and concentrated on a narrow spectrum of GC content; in this case we had to use three different sequence families (see Methods for details).
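For orientation, a WAM can be read as a position-specific model in which each base depends on the base a fixed distance before it. The sketch below assumes the common first-order case (dependency on the immediately preceding base), with illustrative data structures:

```python
import math

def wam_log_likelihood(x, first_pos, transitions):
    """log P(x|WAM) for a first-order WAM: first_pos[b] is P(x_1 = b) and
    transitions[j][(a, b)] is P(x_j = b | x_{j-1} = a) at position j."""
    score = math.log(first_pos[x[0]])
    for j in range(1, len(x)):
        score += math.log(transitions[j][(x[j - 1], x[j])])
    return score
```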
We will see below that the training set and the genomic background distributions are not good choices for a null model; in fact, no fixed non-uniform distribution is, as a quick mathematical analysis demonstrates. As we will see, two probabilistic i.i.d. models are best suited for classification: the uniform model and the target model. However, we also show that the uniform distribution can have a deleterious effect on sequences with biased GC content. This is particularly relevant, since it has not been described before and since uniform models are widely used in the context of nucleotide sequences. The final conclusion is that the target model is more dependable when choosing candidates for biological validation due to its higher specificity. This is reinforced by the real-data experiment using Plasmodium falciparum, a highly AT-rich genome. The study was performed on DNA sequences using GC content as the measure of compositional bias, but the results should also be valid for protein sequences.