In this study, we establish RNA subcellular localization datasets and then propose an integrated learning model for multi-label classification. The flowchart of our method is shown in Figure S1.
Benchmark dataset
RNAs are generally divided into two categories. One is coding RNAs, such as messenger RNAs (mRNAs), which carry the genetic information transcribed from DNA and serve as templates for protein synthesis. The other is non-coding RNAs, including long non-coding RNAs (lncRNAs), microRNAs (miRNAs) and small nucleolar RNAs (snoRNAs), which play irreplaceable regulatory roles in living cells. To study subcellular localization in Homo sapiens, we further establish human RNA subcellular localization datasets. Subcellular localizations of various RNAs in cells are shown in Fig. 6.
We use a database of RNA subcellular localization to integrate, analyze and identify RNA subcellular localizations, thereby speeding up RNA structural and functional research. The first release of RNALocate (http://www.rna-society.org/rnalocate/) contains more than 42,000 manually curated RNA-associated subcellular localization entries with experimental evidence, covering more than 23,100 RNAs, 65 organisms (e.g., Homo sapiens, Mus musculus, Saccharomyces cerevisiae), 42 subcellular localizations (e.g., cytoplasm, nucleus, endoplasmic reticulum, ribosome) and 9 RNA categories (e.g., mRNA, miRNA, lncRNA, snoRNA). Thus, RNALocate provides a comprehensive resource of subcellular localizations and even insight into the functions of hypothetical or novel RNAs. We extract multi-label classification datasets of RNA-associated subcellular localizations for four RNA categories (mRNAs, lncRNAs, miRNAs and snoRNAs). The flowchart of the mRNA subcellular localization dataset construction framework is shown in Fig. 7.
RNA subcellular localization datasets
We extract four RNA subcellular localization datasets, covering mRNAs, lncRNAs, miRNAs and snoRNAs. The procedure for constructing these RNA datasets is as follows.
- We download all RNA entries with curated subcellular localizations from RNALocate and use CD-HIT [64] to remove redundant samples with a cutoff of 80%.
- We delete samples with duplicate Gene IDs, remove samples without corresponding subcellular localization labels, and then construct four RNA subcellular localization datasets.
- We count the number of samples for each subcellular localization label and then retain the labels whose sample size exceeds a reasonable threshold (N/Nmax > 1/30); a minimal sketch of this filtering step is given after this list.
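The label-filtering step can be sketched in Python as follows. This is only an illustration of the threshold rule, not the original pipeline code; the data structure (a list of (sequence, label-set) pairs) and the function name filter_labels are assumptions.

```python
from collections import Counter

def filter_labels(samples, ratio=1.0 / 30):
    """Keep only localization labels whose sample count N satisfies
    N / N_max > ratio (1/30 for the full datasets, 1/12 for the
    Homo sapiens datasets)."""
    counts = Counter(label for _, labels in samples for label in labels)
    n_max = max(counts.values())
    kept = {lab for lab, n in counts.items() if n / n_max > ratio}
    # drop under-represented labels; discard samples left with no label
    filtered = [(seq, labels & kept) for seq, labels in samples]
    return [(seq, labels) for seq, labels in filtered if labels]
```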
The statistical distributions of these four RNA datasets are shown in Fig. 8. Details are shown in Additional file 1: Table S1-S2.
Human RNA subcellular localization datasets
We also extract four Homo sapiens RNA subcellular localization datasets, including H_mRNAs, H_lncRNAs, H_miRNA and H_snoRNAs. The procedure for constructing human RNA datasets is listed as follows.
- We screen out Homo sapiens samples from the above four RNA datasets and construct four human RNA subcellular localization datasets.
- We count the number of samples for each label and then retain the labels whose sample size exceeds a reasonable threshold (N/Nmax > 1/12).
The statistical distributions of these four human RNA datasets are shown in Fig. 9. Details are shown in Additional file 1: Table S3-S4.
Nucleotide property composition representation
An RNA sequence can be represented as S=(s1,⋯,sl,⋯,sL), where sl denotes the l-th nucleotide and L denotes the length of S. Formulating variable-length RNA sequences as fixed-length feature vectors is the key point for building an effective predictor. Many studies have shown that RNA sequences can be encoded by nucleotide property composition representations [65]. Here, we encode RNA sequences to better mine and explore their information patterns.
k-mer nucleotide composition
For the k-mer descriptor, RNAs are represented by the occurrence frequencies of k neighboring nucleotides, which has been successfully applied to human gene regulatory sequence prediction and enhancer identification. The k-mer (e.g., k=2) descriptor can be calculated as follows.
$$ f(t) = \frac{N(t)}{N-k+1}, \quad t \in \{AA, AC, AG, \ldots, TT\} $$
(2)
where N(t) is the number of occurrences of k-mer type t, while N is the length of the nucleotide sequence.
For k=1,2,3,4, there are four values of k, each of which yields 4^k distinct k-mer types (4 + 16 + 64 + 256 = 340). Therefore, we extract the 340-dimensional feature vector Fkmer1234.
When only 4-mers are retained, there are 4^4 = 256 distinct types. Therefore, we extract the 256-dimensional feature vector Fkmer4.
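As an illustration, a minimal Python sketch of the k-mer composition of Eq. (2) is given below; the helper names kmer_features and kmer1234 are ours and not part of the original implementation.

```python
from itertools import product

def kmer_features(seq, k):
    """Occurrence frequency of each k-mer (Eq. 2), in a fixed 4**k order."""
    seq = seq.upper().replace("U", "T")              # treat RNA and DNA alphabets alike
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = len(seq) - k + 1                         # N - k + 1 windows
    for i in range(total):
        window = seq[i:i + k]
        if window in counts:                         # skip windows with ambiguous bases
            counts[window] += 1
    return [counts[m] / total for m in kmers]

def kmer1234(seq):
    """F_kmer1234: concatenation of k = 1..4 (4 + 16 + 64 + 256 = 340 dims)."""
    feats = []
    for k in range(1, 5):
        feats.extend(kmer_features(seq, k))
    return feats
```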
Reverse complement k-mer
The reverse complement k-mer (RCKmer) is a variant of the k-mer descriptor for sequences that are not expected to be strand-specific. For instance, there are 16 types of 2-mer (‘AA’, ‘AC’, ‘AG’, ‘AT’, ‘CA’, ‘CC’, ‘CG’, ‘CT’, ‘GA’, ‘GC’, ‘GG’, ‘GT’, ‘TA’, ‘TC’, ‘TG’, ‘TT’), and ‘TT’ is the reverse complement of ‘AA’. After merging each k-mer with its reverse complement, only 10 distinct types of 2-mer remain in the reverse complement k-mer approach (‘AA’, ‘AC’, ‘AG’, ‘AT’, ‘CA’, ‘CC’, ‘CG’, ‘GA’, ‘GC’, ‘TA’).
For 4-mers, with 256 types, 136 distinct types remain after merging reverse complements. Therefore, we extract the 136-dimensional feature vector FRCKmer.
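A minimal sketch of the RCKmer descriptor is given below, assuming the canonical form of each k-mer pair is the lexicographically smaller one (which reproduces the 10 types for k = 2 listed above); the function names are illustrative.

```python
from itertools import product

REV = {"A": "T", "C": "G", "G": "C", "T": "A"}

def rev_comp(kmer):
    """Reverse complement of a k-mer over the DNA alphabet."""
    return "".join(REV[b] for b in reversed(kmer))

def rckmer_features(seq, k=4):
    """Reverse complement k-mer composition: each k-mer is merged with its
    reverse complement and the lexicographically smaller form is kept
    (10 types for k = 2, 136 types for k = 4)."""
    seq = seq.upper().replace("U", "T")
    canon = sorted({min(m, rev_comp(m))
                    for m in ("".join(p) for p in product("ACGT", repeat=k))})
    counts = dict.fromkeys(canon, 0)
    total = len(seq) - k + 1
    for i in range(total):
        window = seq[i:i + k]
        if set(window) <= set("ACGT"):
            counts[min(window, rev_comp(window))] += 1
    return [counts[m] / total for m in canon]
```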
Nucleic acid composition
The nucleic acid composition (NAC) encodes the frequency of each nucleic acid type in a nucleotide sequence, which is similar to 1-mer. The frequency of each natural nucleic acid (‘A’, ‘C’, ‘G’, ‘T’ or ‘U’) can be calculated as follows.
$$ f(t) = \frac{N(t)}{N}, \qquad t \in \{A, C, G, T(U)\} $$
(3)
where N(t) is the number of nucleic acid type t, while N is the length of a nucleotide sequence.
Therefore, we extract 4-dimensional feature vector FNAC.
Di-nucleotide composition
The di-nucleotide composition (DNC) encodes the frequency of each 2-tuple of nucleic acid type in a nucleotide sequence, which is similar to 2-mer. The frequency of each 2-tuple of natural nucleic acid can be calculated as follows.
$$ D(i,j) = \frac{N_{ij}}{N-1}, \qquad i,j \in \{A, C, G, T(U)\} $$
(4)
where Nij is the number of di-nucleotides composed of nucleic acid types i and j.
Therefore, we extract 16-dimensional feature vector FDNC.
Tri-nucleotide composition
The tri-nucleotide composition (TNC) encodes the frequency of each 3-tuple of nucleic acid type in a nucleotide sequence, which is similar to 3-mer. The frequency of each 3-tuple of natural nucleic acid can be calculated as follows.
$$ D(i,j,k) = \frac{N_{ijk}}{N-2}, \qquad i,j,k \in \{A, C, G, T(U)\} $$
(5)
where Nijk is the number of tri-nucleotides composed of nucleic acid types i, j and k.
Therefore, we extract 64-dimensional feature vector FTNC.
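Under the sketch given earlier, NAC, DNC and TNC correspond to the hypothetical kmer_features helper with k = 1, 2 and 3, since Eqs. (3)-(5) all normalize by the number of windows N - k + 1:

```python
# F_NAC = kmer_features(seq, 1)   # 4 dimensions  (Eq. 3)
# F_DNC = kmer_features(seq, 2)   # 16 dimensions (Eq. 4)
# F_TNC = kmer_features(seq, 3)   # 64 dimensions (Eq. 5)
```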
Composition of k-spaced nucleic acid pair
The composition of k-spaced nucleic acid pair (CKSNAP) is used to calculate the frequency of nucleic acid pairs separated by any k nucleic acids (k=0,1,2,…). For each k-space, there are 16 types of nucleic acid pair composition (‘A...A’, ‘A...C’, ‘A...G’, ‘A...T’, ‘C...A’, ‘C...C’, ‘C...G’, ‘C...T’, ‘G...A’, ‘G...C’, ‘G...G’, ‘G...T’, ‘T...A’, ‘T...C’, ‘T...G’, ‘T...T’).
For k=0,1,2,3,4,5, there are six values of k, each of which contributes 16 distinct types of nucleic acid pair composition (6 × 16 = 96). Therefore, we extract the 96-dimensional feature vector FCKSNAP.
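A minimal sketch of CKSNAP is shown below, assuming pairs separated by exactly k nucleotides are counted and normalized by the number of such positions (N - k - 1); the function name is illustrative.

```python
from itertools import product

def cksnap_features(seq, kmax=5):
    """Composition of k-spaced nucleic acid pairs for k = 0..kmax
    (6 x 16 = 96 dimensions when kmax = 5)."""
    seq = seq.upper().replace("U", "T")
    pairs = ["".join(p) for p in product("ACGT", repeat=2)]
    feats = []
    for k in range(kmax + 1):
        counts = dict.fromkeys(pairs, 0)
        total = len(seq) - k - 1                 # number of (i, i+k+1) positions
        for i in range(total):
            pair = seq[i] + seq[i + k + 1]
            if pair in counts:
                counts[pair] += 1
        feats.extend(counts[p] / total for p in pairs)
    return feats
```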
Multiple kernel support vector machine classifier
We apply the radial basis function (RBF) kernel to each of the above feature sets to construct the corresponding kernels. The RBF kernel is defined as follows.
$$ K_{ij}=K(\mathbf{x}_{i},\mathbf{x}_{j}) = \exp(-\gamma \Vert \mathbf{x}_{i} - \mathbf{x}_{j} \Vert^{2}), \quad i,j=1,2,...,N $$
(6)
where xi and xj are the feature vectors of samples i and j, N denotes the number of samples, and γ is the bandwidth of Gaussian kernel.
The kernel set with seven distinct kernels is denoted as follows.
$$\begin{array}{*{20}l} \mathbf{K}&= \left \{ \mathbf{K}_{\text{kmer4}}, \mathbf{K}_{\text{kmer1234}}, \mathbf{K}_{\text{RCKmer}},\right.\\&\left.\qquad\mathbf{K}_{\text{NAC}}, \mathbf{K}_{\text{DNC}}, \mathbf{K}_{\text{TNC}}, \mathbf{K}_{\text{CKSNAP}} \right \} \end{array} $$
(7)
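A minimal sketch of the kernel construction is shown below; the bandwidth choice gamma = 1/d is an assumption for illustration, not the value used in this work.

```python
import numpy as np

def rbf_kernel(X, gamma):
    """Gram matrix K_ij = exp(-gamma * ||x_i - x_j||^2)  (Eq. 6)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    return np.exp(-gamma * d2)

# Illustrative kernel set (Eq. 7): one RBF kernel per feature representation,
# where feature_sets would hold the matrices for F_kmer4, ..., F_CKSNAP.
# kernels = [rbf_kernel(F, gamma=1.0 / F.shape[1]) for F in feature_sets]
```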
Hilbert-Schmidt independence criterion multiple kernel learning
We use multiple kernel learning (MKL) to determine the weights of the above kernels and then integrate them into an optimal combinatorial kernel, which can be calculated as follows.
$$ \mathbf{K}^{*} = \sum_{p=1}^{7} \beta_{p} \mathbf{K}^{p}, \quad \mathbf{K}^{p} \in \mathbf{R}^{N \times N} $$
(8)
The main purpose of the Hilbert-Schmidt independence criterion (HSIC) [66] is to measure the dependence between two variables; it is similar to the covariance and is itself constructed from the covariance. Let X∈RN×d and Y∈RN×1 be two variables from a data set \(\mathbf {Z}=\left \{(\mathbf {x}_{i},y_{i})\right \}^{N}_{i=1}\), drawn jointly from some probability distribution Prxy. HSIC measures the dependence between x and y by computing the norm of the cross-covariance operator over the domain X×Y.
The Hilbert-Schmidt operator norm of Cxy is defined as follows.
$$ HSIC(\mathbf{F},\mathbf{G},Pr_{\mathbf{x}y}) = \Vert C_{\mathbf{x}y} \Vert^{2}_{HS} $$
(9)
Given set Z, empirical estimate of HSIC is computed as follows.
$$ \begin{aligned} HSIC(\mathbf{F},\mathbf{G},\mathbf{Z}) &= \frac{1}{N^{2}}tr(\mathbf{K}\mathbf{U}) - \frac{2}{N^{3}}\mathbf{e}^{T}\mathbf{K}\mathbf{U}\mathbf{e} + \frac{1}{N^{4}}\mathbf{e}^{T}\mathbf{K}\mathbf{e}\mathbf{e}^{T}\mathbf{U}\mathbf{e}\\ &= \frac{1}{N^{2}} \left [ tr(\mathbf{K}\mathbf{U}) \!- \! \frac{1}{N}tr(\mathbf{K}\mathbf{U}\mathbf{e}\mathbf{e}^{T}) - \frac{1}{N}tr(\mathbf{U}\mathbf{K}\mathbf{e}\mathbf{e}^{T}) \right. \\ & \left. + \frac{1}{N^{2}}tr(\mathbf{U}\mathbf{e}\mathbf{e}^{T}\mathbf{K}\mathbf{e}\mathbf{e}^{T}) \right ] \\ &= \frac{1}{N^{2}} tr[\mathbf{K}(\mathbf{I} - \frac{1}{N}\mathbf{e}\mathbf{e}^{T})\mathbf{U}(\mathbf{I} - \frac{1}{N}\mathbf{e}\mathbf{e}^{T})]\\ &=\frac{1}{N^{2}}tr(\mathbf{K}\mathbf{H}\mathbf{U}\mathbf{H}) \overset{\triangle}{=} HSIC(\mathbf{K},\mathbf{U}) \end{aligned} $$
(10)
where F is the RKHS of the feature set X, G is the RKHS of the label set Y, e=(1,...,1)T∈RN×1, H=I−eeT/N∈RN×N is the centering matrix, K,U∈RN×N are kernel matrices with Kij=k(xi,xj) and Uij=l(yi,yj), and I∈RN×N is the identity matrix. The stronger the dependence between K and U, the larger the value of HSIC(K,U); K and U are independent of each other when HSIC(K,U)=0.
Inspired by HSIC [67], we define the optimization function as follows.
$$\begin{array}{*{20}l} \underset{\pmb{\beta},\mathbf{K}^{*}}{max} \ HSIC(\mathbf{K}^{*},\mathbf{U}) \end{array} $$
(11a)
$$\begin{array}{*{20}l} HSIC(\mathbf{K}^{*},\mathbf{U}) = \frac{1}{N^{2}}tr(\mathbf{K}^{*}\mathbf{H}\mathbf{U}\mathbf{H}) \end{array} $$
(11b)
$$\begin{array}{*{20}l} subject \ to \ \mathbf{K}^{*} = \sum_{p=1}^{P} \beta_{p}\mathbf{K}^{p}, \end{array} $$
(11c)
$$\begin{array}{*{20}l} \beta_{p} \ge 0, \ p = 1,2,...,P, \end{array} $$
(11d)
$$\begin{array}{*{20}l} \sum_{p=1}^{P} \beta_{p} = 1 \end{array} $$
(11e)
where K∗∈RN×N is the optimal kernel in the feature space, \(\mathbf {U} = \mathbf {y}_{train}\mathbf {y}_{train}^{T} \in \mathbf {R}^{N \times N}\) is the ideal kernel matrix (label kernel), and β∈RP×1 is the kernel weight vector. We aim to maximize the HSIC between K∗ and U.
This can be solved as the following convex quadratic programming problem.
$$\begin{array}{*{20}l} \underset{\pmb{\beta},\mathbf{K}^{*}}{min} -\frac{1}{N^{2}}tr(\mathbf{K}^{*}\mathbf{H}\mathbf{U}\mathbf{H}) + \nu_{1}\Vert \pmb{\beta} \Vert^{2} \end{array} $$
(12a)
$$\begin{array}{*{20}l} subject \ to \ \mathbf{K}^{*} = \sum_{p=1}^{P} \beta_{p}\mathbf{K}^{p}, \end{array} $$
(12b)
$$\begin{array}{*{20}l} \beta_{p} \ge 0, \ p = 1,2,...,P, \end{array} $$
(12c)
$$\begin{array}{*{20}l} \sum_{p=1}^{P} \beta_{p} = 1 \end{array} $$
(12d)
where ν1 is the coefficient of the L2-norm regularization term. The final training and testing kernels are linearly weighted by β, respectively.
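A minimal sketch of this HSIC-based kernel weighting is given below. It exploits the fact that tr(K*HUH) is linear in β and solves Eq. (12) with a general-purpose solver; the function name and the use of SLSQP are illustrative assumptions, not the original implementation.

```python
import numpy as np
from scipy.optimize import minimize

def hsic_mkl_weights(kernels, y_train, nu1=1.0):
    """Solve the QP in Eq. (12): maximize HSIC(K*, U) with an L2 penalty on
    beta, subject to beta >= 0 and sum(beta) = 1."""
    N = y_train.shape[0]
    P = len(kernels)
    U = np.outer(y_train, y_train)                 # ideal (label) kernel
    H = np.eye(N) - np.ones((N, N)) / N            # centering matrix
    # HSIC(K_p, U) for each base kernel; the objective is linear in beta
    h = np.array([np.trace(K @ H @ U @ H) / N ** 2 for K in kernels])

    def objective(beta):                           # Eq. (12a)
        return -h @ beta + nu1 * beta @ beta

    constraints = ({"type": "eq", "fun": lambda b: b.sum() - 1.0},)
    bounds = [(0.0, None)] * P
    res = minimize(objective, np.full(P, 1.0 / P), method="SLSQP",
                   bounds=bounds, constraints=constraints)
    beta = res.x
    K_star = sum(b * K for b, K in zip(beta, kernels))   # Eq. (8)
    return beta, K_star
```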
Support vector machine
The support vector machine (SVM) [68] was first proposed by Cortes and Vapnik [69] and deals primarily with binary classification. Given a dataset of instance-label pairs {xi,yi}, i=1,2,...,N, the classification decision function realized by the SVM is expressed as follows.
$$ f(\mathbf{x}) = sign[\sum_{i=1}^{N} y_{i}\alpha_{i}\cdot K(\mathbf{x},\mathbf{x}_{i})+b] $$
(13)
where xi∈R1×d and yi∈{+1,−1}.
The coefficients αi can be obtained by solving the following convex quadratic programming (QP) problem.
$$\begin{array}{*{20}l} \text{Maximize} \quad & \sum_{i=1}^{N} \alpha_{i} - \frac{1}{2}\sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_{i}\alpha_{j}\cdot y_{i}y_{j}\cdot K(\mathbf{x}_{i},\mathbf{x}_{j}) \end{array} $$
(14a)
$$\begin{array}{*{20}l} \text{s.t.}\quad & 0 \le \alpha_{i} \le C \end{array} $$
(14b)
$$\begin{array}{*{20}l} & \sum_{i=1}^{N} \alpha_{i}y_{i} = 0, i=1,2,...,N \end{array} $$
(14c)
where C is a regularization parameter that controls the trade-off between the margin and misclassification errors; a sample xj with αj>0 is called a support vector.
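A minimal sketch of training a binary SVM on the precomputed combinatorial kernel K* is shown below, using scikit-learn's precomputed-kernel mode; the function name is illustrative.

```python
from sklearn.svm import SVC

def fit_predict_svm(K_train, y_train, K_test, C=1.0):
    """Binary SVM on the precomputed combinatorial kernel K*.
    K_train is (n_train, n_train); K_test is (n_test, n_train) and must be
    built with the same kernel weights beta."""
    clf = SVC(C=C, kernel="precomputed")
    clf.fit(K_train, y_train)
    return clf.predict(K_test)
```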
One-vs-rest strategy
We use an indirect strategy for the multi-label classification problem: the multi-label problem is converted into multiple binary classification problems. The one-vs-rest strategy treats one class as positive samples and the remaining classes as negative samples. We build a binary classifier for each class label, thus constructing k binary classifiers in total.
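A minimal sketch of the one-vs-rest multi-label setup on the precomputed kernel is shown below, assuming the labels are encoded as a binary indicator matrix; the function name is illustrative.

```python
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

def multilabel_predict(K_train, Y_train, K_test, C=1.0):
    """One-vs-rest multi-label prediction on the precomputed kernel K*.
    Y_train is a binary indicator matrix (n_train, n_labels), e.g. obtained
    with sklearn.preprocessing.MultiLabelBinarizer; one binary SVM is
    trained per subcellular localization label."""
    ovr = OneVsRestClassifier(SVC(C=C, kernel="precomputed"))
    ovr.fit(K_train, Y_train)
    return ovr.predict(K_test)      # (n_test, n_labels) indicator matrix
```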