Transcription and translation are the means by which cells interpret and express their genetic information. Only part of a transcribed sequence carries the information that encodes a protein (CDS, CoDing Sequence). In other words, even though the mRNA is transcribed in its entirety, only a section of it is translated into amino acids. Therefore, given a molecule of mRNA, a central problem of molecular biology is to determine whether it contains a CDS and, if so, which protein it encodes. The region of the mRNA sequence where protein synthesis is initiated is called the Translation Initiation Site (TIS).
Control of the initiation of translation is one of the most important processes in the regulation of gene expression. Determining the TIS is therefore of great relevance for genetic inference, and it is not a trivial task. Highly accurate prediction could contribute to a better understanding of the protein obtained from a sequence of nucleotides.
Normally, translation begins at the first ATG of the mRNA molecule that has an appropriate context, but it can begin at a different codon. Depending on the position at which synthesis is initiated in the mRNA strand, the triplets of nucleotides read during synthesis can vary, which also alters the amino acids generated. The lack of knowledge about the conserved features involved in identifying the initiation of translation makes the prediction of the TIS a complex task. For this reason, computational methods which identify patterns can be used with the aim of extracting the implicit knowledge involved in this process. Since 1982, the prediction of the TIS has been studied extensively using biological methods, statistics and computational techniques. Initially, statistical methods were exploited with the aim of discovering patterns in positive sequences. The pioneering work of Kozak, a statistical analysis of the sequences of 211 mRNAs of eukaryotic cells, revealed that some positions in the sequences of mRNAs, relative to the TIS, are highly conserved. This analysis led to the Kozak consensus, gcc[a/g]ccatg[g], in which the predominance of the indicated nucleotides is strongest in positions -3 and +4.
Another statistical analysis was conducted by Cavener et al. on the start codon (the codon which initiates translation) and the stop codons (the codons which terminate translation), and an algorithm was developed to analyze the frequency of the nucleotides at multiple positions. In the work developed by Kozak, A (adenine) was found in position -3 in 79% of the sequences and G in 18%, while Cavener et al., using 2,595 vertebrate sequences, obtained a 58% probability of A being in that position.
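The positional statistics described above can be illustrated with a minimal sketch. It assumes the usual numbering convention in which the A of the ATG is position +1 and the base immediately upstream is position -1 (there is no position 0). The function names and the toy sequences are hypothetical and are not data from the cited studies.

```python
from collections import Counter


def position_to_index(atg_index, position):
    """Map a TIS-relative position to a 0-based string index.

    Assumed convention: the A of the ATG is +1, the base just upstream is -1.
    """
    if position == 0:
        raise ValueError("position 0 does not exist in this convention")
    return atg_index + position - 1 if position > 0 else atg_index + position


def base_frequencies(sequences, position):
    """Relative frequency of each base at `position` over (sequence, atg_index) pairs."""
    counts = Counter()
    for seq, atg_index in sequences:
        i = position_to_index(atg_index, position)
        if 0 <= i < len(seq):  # skip sequences too short to cover the position
            counts[seq[i].upper()] += 1
    total = sum(counts.values())
    return {base: n / total for base, n in counts.items()} if total else {}


# Toy, made-up sequences; each pair is (mRNA, index of the A of the annotated ATG).
toy = [
    ("GCCACCATGGCG", 6),
    ("TTTGCCATGGAA", 6),
    ("CCCAAAATGTCC", 6),
    ("GGGATTATGGTT", 6),
]
freq_minus3 = base_frequencies(toy, -3)
freq_plus4 = base_frequencies(toy, 4)
```

On these four toy sequences, position -3 holds A three times and G once, and position +4 holds G three times and T once. Applied to a real annotated collection, the same counting yields percentages of the kind reported by Kozak and by Cavener et al.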
Nakagawa et al. conducted comparative analyses across 47 species, including animals, fungi, plants and protists, revealing the existence of consensus sequences for different species. Based on this analysis, the following consensus features were identified: a purine (A or G) in position -3, A or C in position -2, and C in position +5. The importance of position -3 had already been reported by Kozak and was confirmed by this study.
Different computational methods have been applied to the prediction of the TIS, including Artificial Neural Networks (ANN) [8, 9], Support Vector Machines (SVM) [2, 10, 11] and the Gaussian Model. Using Artificial Neural Networks, Stormo et al. classified sequences of Escherichia coli with a 4-bit encoding (A=1000, C=0100, G=0010, T=0001) and windows of 51, 71 and 101 nucleotides centered on the ATG. Pedersen and Nielsen, in turn, trained Artificial Neural Networks on a database of vertebrate sequences which was processed to obtain the corresponding mRNA sequences. Of these sequences, only those with the TIS annotated and with at least 10 nucleotides in the upstream region and at least 150 in the downstream region were selected. The resulting database had 13,502 ATGs, of which 3,312 (24.5%) were TIS and the other 10,190 (75.4%) were non-TIS. In this study, windows of 13, 33, 53, 73, 93, 113, 133, 153, 173 and 203 nucleotides centered on the ATG were used. The encoding was the same 4-bit binary code used by Stormo et al. Pedersen and Nielsen obtained sensitivity, specificity and accuracy of 78%, 87% and 85%, respectively. The authors also analyzed the sequences to reveal which features are important for distinguishing TIS from non-TIS. They found that position -3 is crucial in the identification of the TIS, which corroborates the other studies cited.
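As an illustration of the 4-bit encoding and the ATG-centered windows, the following minimal sketch encodes a window as a flat binary vector. It assumes that "centered on the ATG" means an equal number of flanking nucleotides on each side of the triplet, and that positions outside the sequence, or bases other than A, C, G and T, are encoded as 0000. Both choices, like the function name, are assumptions of this sketch and not details taken from the cited studies.

```python
# 4-bit code as given in the text: A=1000, C=0100, G=0010, T=0001.
ONE_HOT = {
    "A": (1, 0, 0, 0),
    "C": (0, 1, 0, 0),
    "G": (0, 0, 1, 0),
    "T": (0, 0, 0, 1),
}


def encode_window(seq, atg_index, window=13):
    """Encode `window` nucleotides centered on the ATG as a flat list of bits.

    The window holds (window - 3) // 2 flanking bases on each side of the ATG.
    Out-of-range positions and unknown bases become 0000 (sketch assumption).
    """
    flank = (window - 3) // 2
    start = atg_index - flank
    bits = []
    for i in range(start, start + window):
        base = seq[i].upper() if 0 <= i < len(seq) else "N"
        bits.extend(ONE_HOT.get(base, (0, 0, 0, 0)))
    return bits


# Toy example: 13-nucleotide window (5 + ATG + 5) gives 13 * 4 = 52 inputs.
vec = encode_window("GCCACCATGGCG", atg_index=6, window=13)
```

Under this encoding a window of 203 nucleotides corresponds to 812 binary inputs, which indicates how quickly the input layer grows with the window size.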
Hatzigeorgiou also used ANN to classify sequences of human cDNA, achieving an accuracy of 94%. The author used two modules: the consensus-ANN and the coding-ANN. The consensus-ANN module evaluates the TIS candidate and its immediate neighborhood through a window of 12 nucleotides, extracted from positions -7 to +5 and represented with the 4-bit binary encoding. The coding-ANN module evaluates the upstream and downstream regions of the TIS candidate and operates with windows of 54 nucleotides. The final method integrates the two modules: a score is calculated for each ATG of the molecule, and the first ATG with a score above 0.2 is considered the TIS of the molecule.
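The final decision rule, the first ATG whose score exceeds 0.2, can be sketched as a linear scan. The scoring function below is a hypothetical stand-in that only checks for a purine three bases upstream; it does not reproduce the consensus-ANN or coding-ANN modules.

```python
def predict_tis(seq, score_atg, threshold=0.2):
    """Return the index of the first ATG whose score exceeds `threshold`, else None."""
    seq = seq.upper()
    i = seq.find("ATG")
    while i != -1:
        if score_atg(seq, i) > threshold:
            return i
        i = seq.find("ATG", i + 1)  # move on to the next candidate ATG
    return None


def toy_score(seq, i):
    """Hypothetical scorer (not the published networks): 1.0 if a purine is at -3."""
    return 1.0 if i >= 3 and seq[i - 3] in "AG" else 0.0


# The first ATG (index 5) has T at -3 and is rejected; the second (index 13) is kept.
predicted = predict_tis("CCTCCATGCCACCATGG", toy_score)
```

Scanning from the 5' end and stopping at the first acceptable candidate mirrors the assumption that translation normally starts at the first ATG with an appropriate context.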
Using SVM, Zien et al. achieved an accuracy of 88.1% on the same database as Pedersen and Nielsen. The authors also used the same window size (203 nucleotides) and the same encoding. They showed how to obtain improvements using a new kernel function, called the locality-improved kernel, which applies a small window at each position. The locality-improved kernel emphasizes correlations between positions in the sequence that are close to each other, and a window extending 3 nucleotides upstream and downstream was empirically determined to be optimal. In other words, the modification favors local correlations between nucleotides, while dependencies between nucleotides in distant positions are treated as having little or no importance. With this kernel function, the authors obtained sensitivity, specificity and accuracy of 69.9%, 94.1% and 88.1%, respectively.
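A minimal sketch of a locality-improved kernel is given below. It assumes the commonly described form in which position-wise matches are summed inside a window of half-width l, raised to a power d1, summed over all window centers, and raised to a power d2. Uniform weights inside the window and the default exponents are assumptions of this sketch, not the settings used by the authors.

```python
def locality_improved_kernel(x, y, l=3, d1=2, d2=1):
    """Locality-improved kernel between two equal-length sequences (sketch).

    win_p = (number of matching positions in [p - l, p + l]) ** d1
    K     = (sum of win_p over all valid centers p) ** d2
    """
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    match = [1 if a == b else 0 for a, b in zip(x, y)]
    total = 0
    for p in range(l, len(x) - l):  # centers whose window fits in the sequence
        total += sum(match[p - l : p + l + 1]) ** d1
    return total ** d2


# Both comparisons have 4 matching positions out of 8, but adjacent matches
# score higher than scattered ones because of the exponent inside each window.
clustered = locality_improved_kernel("AAAAAAAA", "AAAACCCC", l=1)
scattered = locality_improved_kernel("AAAAAAAA", "ACACACAC", l=1)
```

With l=1 and d1=2 the clustered comparison yields 23 and the scattered one 15, which shows how the kernel rewards local correlations over the same number of isolated matches.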
Later, Zien et al. improved these results through a more sophisticated kernel function known as the Salzberg kernel, which is essentially a conditional probabilistic model of positional dinucleotides. Using this kernel, the authors obtained an accuracy of 88.6% on the same database. Li et al. put forward two new proposals for the identification of the TIS. Firstly, they introduced a class of new kernels based on string edit distance, called edit kernels, to be used with SVM. According to the authors, the edit kernels are simple and have meaningful biological and probabilistic interpretations. Secondly, they converted the downstream region of an ATG into a sequence of amino acids before applying SVM. They demonstrated that this approach is significantly better (sensitivity = 99.92%, specificity = 99.82% and accuracy = 99.9% on the database used by Pedersen and Nielsen).
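The conversion of the downstream region into amino acids can be sketched with the standard genetic code. The sketch assumes in-frame translation starting at the candidate ATG, renders stop codons as '*' without halting, and maps incomplete or ambiguous codons to 'X'. These choices and the function name are assumptions, not details of the method of Li et al.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order (third base fastest).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate_downstream(seq, atg_index):
    """Translate in-frame codons from the ATG onward; unknown codons become 'X'."""
    seq = seq.upper().replace("U", "T")  # accept RNA or DNA alphabets
    codons = (seq[i : i + 3] for i in range(atg_index, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(c, "X") for c in codons)


# ATG GCC TAA -> M, A, stop; the trailing "GG" is an incomplete codon and is ignored.
protein = translate_downstream("CCATGGCCTAAGG", atg_index=2)
```

Working at the amino acid level reduces the downstream region to one third of its length and exposes in-frame stop codons directly, which is one plausible reason such a representation helps a classifier.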
Nobre, Ortega and Braga conducted experiments to identify the TIS using 12 nucleotides in the upstream and downstream regions, together with SVM with simple kernel functions. Inspired by a study of the frequency of triplets in positive and negative sequences, they presented a new encoding methodology. Instead of encoding each nucleotide individually, the encoding was done per triplet, with a sliding window of size 3. The authors obtained a 50% reduction in the number of inputs. In order to balance the data, they used the SMOTE algorithm to generate synthetic minority class samples. The authors worked with databases of five organisms extracted from the RefSeq database: Danio rerio, Drosophila melanogaster, Homo sapiens, Mus musculus and Rattus norvegicus, under six levels of inspection.
Tzanis et al. developed a methodology for the prediction of the TIS, called MANTIS, with three main components: Consensus, Coding Region Classification, and ATG Location. The Coding Region Classification component involves training a model to classify whether or not the ATG of a sequence is the TIS. They used features selected from previous studies [1, 4] and PCA (Principal Component Analysis) to obtain the smallest number of non-correlated features, since many of the features are correlated with each other. The Consensus component uses Markov rules which capture not only the probability of occurrence of a nucleotide in a given position, but also how the occurrence of one base affects the occurrence of another in the region close to the ATG (between positions -7 and +5). The ATG Location component is considered a new model, being based on the location of the ATG in the sequence in accordance with the Ribosome Scanning Model (RSM) described by Kozak [5, 16]. The final stage of MANTIS is the fusion of the decisions of the components, the output being the estimated probability of an ATG being a TIS instead of a simple true/false decision. For the prediction, four classification algorithms were used: Naive Bayes, C4.5, K-nearest neighbor and SVM, obtaining an average accuracy and adjusted accuracy of 98.03% and 94.28%, respectively.
Tikole and Sankararamakrishnan used ANN with two hidden layers to predict the TIS in sequences of human mRNA in which there is a weak Kozak context. The authors stated that the translation initiation site has a weak Kozak context if a purine is absent in position -3 and a guanine is absent in position +4. They obtained a sensitivity of 83% and a specificity of 73%.
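The definition of a weak Kozak context stated above translates directly into a test on two positions. The sketch assumes that the A of the ATG is position +1, so that position -3 lies three bases upstream of the A and position +4 is the base immediately after the G of the ATG. The function name is hypothetical.

```python
def has_weak_kozak_context(seq, atg_index):
    """True if neither a purine at -3 nor a G at +4 flanks the ATG.

    Returns None when either position falls outside the sequence, since the
    context cannot be assessed in that case (sketch assumption).
    """
    i_minus3 = atg_index - 3  # position -3
    i_plus4 = atg_index + 3   # position +4 (A of the ATG is +1)
    if i_minus3 < 0 or i_plus4 >= len(seq):
        return None
    seq = seq.upper()
    return seq[i_minus3] not in "AG" and seq[i_plus4] != "G"


strong_example = has_weak_kozak_context("GCCACCATGGCG", 6)  # A at -3, G at +4
weak_example = has_weak_kozak_context("CCCTCCATGTCC", 6)    # T at -3, T at +4
```

A filter of this kind could be used to select the subset of annotated start codons on which a predictor for weak contexts is trained and evaluated.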
In contrast to other authors, Zeng et al. created an algorithm with the aim of constructing representative, dependable and readily available databases free from redundancy, in order to facilitate the evaluation of the efficiency of the algorithms used for predicting the TIS. To prepare these databases, they considered three different features: the molecular weight (MW), the isoelectric point (IP) and the hydrophobicity index (HI) profile.
Saeys, Abeel, Degroeve and Van de Peer evaluated the performance of several TIS recognition methods at the genomic level and compared them to state-of-the-art models for TIS prediction in transcript data. The authors concluded that the simple methods largely outperform the complex ones at the genomic scale, and proposed a new model for TIS recognition at the genome level that combines the strengths of these simple models.
Sparks and Brendel demonstrated that improvements in statistically based models for TIS prediction can be achieved by taking the class of each potential start-methionine into account, under certain testing conditions. They developed the MetWAMer package for TIS prediction and demonstrated that the proposed perceptron-based model is suitable for the TIS identification task.
Having identified that the problem of predicting the TIS is highly imbalanced and that the oversampling methods, which have already been used in the present context, significantly increase computational complexity, this study proposes an undersampling class balancing method, M-Clus. This is particularly important for large databases where oversampling techniques are not viable as they significantly increase the size of the databases involved.
In addition to the balancing method, this study also investigates the integration of features into positive and negative sequences, attempting to increase the measures of performance.
Finally, a methodology for the inclusion of acquired knowledge (InAKnow) by the classifier is proposed: using the model obtained by training on upstream region sequences and the TIS, the sequences of the downstream region are first classified and later included in a new round of training. This methodology increases the precision for all the evaluated databases.
This paper is organized as follows: Firstly, the “Methods” Section shows all the steps used in this study for the prediction of TIS. To test the proposed methodology the organisms Mus musculus and Rattus norvegicus have been used as a reference. The “Results and Discussion” Section presents the results obtained by the proposed methodology for these two organisms. Once defined, the best configuration was tested with larger databases such as Arabidopsis thaliana, Caenorhabditis elegans, Drosophila melanogaster and Homo sapiens. This is detailed in the “Validation of the methodology with other databases” Section. The “Comparison with other TIS prediction tools” Section provides a comparison between some existing tools for predicting TIS and the methodology proposed in this study. Finally, the “Conclusions” Section presents the final considerations.