We have conducted the first systematic analysis of sequence differences between xenomiRs and non-xenomiRs, and significant difference was found, which argued in favor of the selective nature of the absorption of xenomiRs and its relation with miRNA sequences. We have then shown the feasibility of distinguishing between xenomiRs and non-xenomiRs based on miRNA sequences using machine learning models. High accuracies were achieved by both random forest model and 1D-CNN model. This could serve as a valuable tool for predicting potential xenomiRs that have not yet been discovered, based on which further biological experiments could be conducted for validating their ability of transferring into human bodies and more importantly, exploring their potential functions. The important features of xenomiRs identified here might offer insights into underlying mechanisms of xenomiRs transferring into and keeping stable in animal body. In addition, we have provided the first list of high-probability xenomiRs as well as their potential functions. Of interest, the functions of the predicted targets of these xenomiRs seem to be consistent with previous studies (see details below). We have shown in our previous paper [33] that the plant miRNAs detected in human bodies are tissue-specific and cannot be fully explained by contamination and provided evidence for the xenomiRs hypothesis. The selective absorption of plant miRNAs by animal bodies could provide an explanation for studies where xenomiRs were not detected in animal bodies. If more plant-derived xenomiRs in different human tissues are available, our 1D-CNN model could be adjusted slightly using transfer learning [37] to learn the different patterns of plant-derived xenomiRs in different tissues. Thus, the channel tropism proposed in Inner Gannon of Huangdi [38] may be better explained, and the corresponding tissues, where a specific herb function may also be predicted.
More than one hundred species of xenomiRs have been reported so far, however, more species of xenomiRs remain to be discovered, especially those in the plants seldom consumed, such as traditional medical herbs. The discovery of xenomiRs in traditional medical herbs has great significance in better understanding of the mechanisms underlying the therapeutic function of these herbs. Therefore, to predict the potential xenomiRs using machine learning model is of great significance. However, since no off the shelf data were available, we collected xenomiRs/non-xenomiRs in the following way.
The positive dataset was extracted through a rigorous pipeline, which includes all plant-derived miRNAs identified from all sRNA-seq data of healthy individuals in NCBI GEO database (before 2016). It accorded well with the xenomiRs reported in human tissues [33]. Hence, the positive dataset contained currently reported plant-derived xenomiRs in common food. It is likely that some plant-derived xenomiRs were not included in the positive dataset due to their low abundance under normal condition. However, it is still the optimum choice given current scientific advances.
The collection of negative data, on the other hand, is less straightforward. We chose miRNAs from plants that are human everyday diet or closely related to common vegetables, but have never been detected in human samples. There might exist xenomiRs remain undetected, due to either the inability of current detection technique or inappropriate experimental conditions, such as the time after oral administration or amount of food taken in. To maximally avoid false negatives in our negative dataset, we included only “non-xenomiRs” in four species that can be regarded as everyday diet, i.e. ath, gma, osa and zma (ath is closely related to common vegetables, such as Brassica rapa, Brassica oleracea, Brassica juncea and Oilseed rape). Therefore, the selection of negative dataset is based on current advances in plant-derived xenomiRs. Of note, we cannot rule out all false negatives, which might limit the accuracy of our models.
However, the identification of these false positives/negatives requires biological verification, which is time consuming due to the enormous number of candidates. Our model, although imperfect, could serve as the first and efficient tool for selecting the most probable ones for further experimental verification. The verified xenomiRs or non-xenomiRs could, in turn, be fed to our model, and thus improve its accuracy.
The 166 xenomiRs in the positive dataset are from 49 miRNA families and 55.3% are from 8 miRNA families (Additional file 10: Table S10). For non-xenomiRs in the negative dataset, 48% are from these 49 miNRA families but only 19.0% are from the top 8 families containing xenomiRs (Additional file 10: Table S10). In the prediction set, 201 (83.4%) additional members of these 49 miRNA families have been predicted as xenomiRs, among which 123 (50.9%) are from the top 8 families (Additional file 10: Table S10). It can be seen that both collected and predicted xenomiRs concentrate in limited and similar group of families.
In feature extraction, each sequence was assumed to have 24 nt. The choice of 24 nt was not arbitrary. We tried to use the fist 21 nt ~ 23 nt as well as the last 21 nt ~ 23 nt of each miRNA sequence as the sequence position features, for better capturing 5′ end and 3′ end absolute sequence position features, respectively. The final choice was made upon the model performance, i.e. training our models using 24 nt could slightly outperform the other cases. Furthermore, only 0.047% plant miRNAs for prediction are longer than 24 nt. Thus, truncating the part of miRNAs longer than 24 nt had minor effect on the performance of our models. We chose sequence-based features because they are very commonly used to characterize miRNAs and play essential roles in the function of miRNAs. They might also affect miRNAs’ ability of entering human body. This motivated us to investigate the difference in sequence-based features between xenomiRs and non-xenomiRs. Our results, indeed, confirmed that existence of significant difference between these two groups of miRNAs (Table 1).
The structure-based features of miRNAs were also tried to use in training our models. However, because a miRNA is very short, most of them cannot construct a stable secondary structure, which was verified by predicting miRNA secondary structures using both RNAfold [39] and our previously published tool Fledfold [40, 41]. Nevertheless, adding secondary features of pre-miRNA, including stem features, hairpin loop features, bulge loop features, internal loop features and energy (stability), did not improve model performance. There are two possibilities resulting in this. First, the ability of miRNA entering human body has little to do with the secondary structure of a pre-miRNA. Alternatively, the sample size might limit the usefulness of additional features given to the model.
In our study, the positive dataset was composed of miRNAs from more than 20 species, among which 57% were from ath, gma, osa or zma. All the negative miRNAs were from these four species. Therefore, it is very unlikely that the models simply perform species recognition. If the dataset were separated by species (ath, gma, osa and zma), the positive dataset will be very small (average 24 miRNAs for each specie), leading to a high risk of overfitting. A model such like that has poor generalization ability. Re-training our models using miRNA data from single species (ath, gma, osa or zma) verified this thought: the re-rained model performs much better on training set, compared to testing set (accuracy: 81% VS 62%).
High GC content was reported to be responsible for the absorption of MIR2911 by animal bodies [29]. More generally, the high GC content [16] and the short length could increase the stability of an RNA. Our results confirmed this by systematic statistical analysis, revealed that xenomiRs have higher GC content (p < 0.05) and shorter length (p < 0.05), compared to non-xenomiRs. Meanwhile, other differences between xenomiRs and non-xenomiRs sequences were also identified (Fig. 2, Table 1). Besides, the important features obtained from our RF model (Additional file 5: Table S5) provided more insights into the patterns of xenomiR sequence. And if some important patterns are sabotaged, the absorption might be abolished, as in the case where miR2911 sequence is disrupted by just two GC nucleotides [31]. In addition, the 3 mer motif ‘CAG’ was evaluated as one of the most important patterns for distinguishing xenomiRs with non-xenomiRs (Additional file 5: Table S5), suggesting its possible relation with the transference or stability of xenomiRs.
These results supported our assumption that plant-derived miRNAs are absorbed selectively by human and other animals, and only the sequence of a miRNA with certain patterns could be transferred into human bodies. However, further studies are needed to identify more concrete patterns in xenomiR sequences.
Deep Learning has been widely applied in bioinformatics and obtained satisfactory performance [42, 43]. Our study used a 1D-CNN model to identify xenomiRs using only raw miRNA sequences as inputs, and higher prediction accuracy was achieved on independent test set, compared to RF model, which used 105 hand-craft features as inputs. It is likely that 1D-CNN model, which is capable of extracting the features of successive nucleotides, could capture the specific patterns underlying xenomiR sequences that contain important information for identifying xenomiRs.
To uncover the potential role of xenomiRs in human body, we analyzed the function of predicted targets of xenomiRs using enrichment analysis tools [44]. Many identified functions have already been reported by other studies with bio-experiments. For example, xenomiRs were reported to be related with cancer [13, 14], inflammatory [16, 18], circulation system [7] in human body. And recent studies reported an association between xenomiRs with neuron development of pandas [10], and the caste development of honeybees [45].
This study does not deny the animal-derived xenomiRs hypothesis. However, animal-derived xenomiRs are much more difficult to identify because of the high sequence conservation, which obscures the differences between dietary animal miRNAs and endogenous miRNAs [8]. Hence, in this study, we only studied the plant-derived xenomiRs.
In addition, many other factors could affect the detectability of xenomiRs in human samples, such as, the miRNA abundance in the consumed plant materials, the stability of plant miRNAs, the time after consumption [28] and some special molecules in the diet consumed along with plant miRNAs [18]. Therefore, in xenomiR studies, randomly selecting miRNAs to perform biological verification is risky. The methodology present in this paper could serve as a valuable and efficient tool for selecting candidate plant miRNAs for biological verification.