Linking phenotypes to ~omics datasets is difficult because of the size and noisiness of the data and the lack of easily accessible tools that can (i) handle large and noisy data, (ii) find links to phenotypes and (iii) visualize these links. We developed a web tool, PhenoLink, to identify links to phenotypes using classification-based feature selection (in this study, the features are genes). The presence/absence of 2847 genes in 42 L. plantarum strains, together with the phenotypes of these strains, was used in PhenoLink to identify links to phenotypes assessed in 51 experiments. We tried different visualization techniques, such as graph and tree structures, to enhance the visualization of identified relations. A visualization should embed as much information as possible in a single figure while remaining easy to interpret. Visualizing identified links using a different colour for each relationship type captured relations not only between genes and phenotypes, but also among genes and among phenotypes. Additionally, visualization allowed identifying partial relations between genes and phenotypes (shown in black), where different genes are essential for certain strains of a phenotype. For instance, among the correctly classified polysaccharide (D-melezitose, D-turanose and D-raffinose) utilization experiments, additional polysaccharide biosynthesis genes (lp_1197, lp_1198 and lp_1199) were found to be related only for D-raffinose (see Additional file 6). All strains that cannot grow on raffinose lack these genes, and most of the growing strains have either these genes or other polysaccharide biosynthesis genes (lp_1216-lp_1227) (see Additional file 7). Possibly the growing strains can utilize a raffinose degradation product for polysaccharide biosynthesis.
The L. plantarum strains used in this study often showed similar phenotypes, and some of them had ambiguously defined phenotypes such as "Maybe", due either to mild expression of a trait or possibly to experimental error. Strains with this phenotype are, as expected, often misclassified. Therefore, in this study we discarded all strains with this phenotype; however, the web tool can be configured to include them. Including ambiguous phenotypes could in certain cases help to validate input data, because strains with similar gene content should have similar or identical phenotypes. PhenoLink can be configured to generate classification accuracy plots for each experiment (see Additional file 8), which show how accurately strains were classified. Reasons for consistent misclassification of strains are: (i) an ambiguous or wrong phenotype, (ii) noisy ~omics data, and/or (iii) the strains belonging to a minority phenotype (see Methods).
In the present example, the presence/absence of genes was determined based on hybridization to an array containing probes for only a single reference strain (WCFS1) and its three plasmids. Because L. plantarum is a versatile species living in various environments, the gene content of many of these strains will in part differ from that of WCFS1. Strain NIZO2776 is exceptional, as it was always misclassified as not growing on the sugar L-arabinose (see Additional file 8). Based on CGH data, 16.6% of the genes of strain WCFS1 are predicted to be absent in strain NIZO2776; however, other strains that lack even more WCFS1 genes were classified correctly. Probably this strain either does not grow on L-arabinose (wrong phenotyping) or uses different sets of genes to grow on L-arabinose, which differ too much in sequence from the WCFS1 genes to be detected by CGH. Pan-genome arrays, with probes based on the genomic content of multiple strains and plasmids within the same species, should provide a better estimate of species-level genomic divergence. However, cross-hybridization of probes is a general disadvantage of microarray technology and leads to inaccurate gene calling. With prices decreasing continuously, next-generation sequencing techniques are becoming better alternatives because of their accuracy and their recall of new or divergent genes that would have been missed using microarrays. Gene presence/absence determined by sequencing would allow PhenoLink to determine links to phenotypes more accurately.
PhenoLink reduces the large number of possible experimental tests by pruning input data and prioritizing identified links. Although many phenotypes (more than 55%) were classified with an accuracy above 80%, we used a 60% classification accuracy cutoff to accommodate noise in the input data, such as wrong gene calling or imbalance in the phenotype data. Identifying partial relations is inherently difficult, even with classification-based association analysis. Such findings, which are visualized in black, should therefore first be corroborated with available literature and/or databases before follow-up lab experiments are performed.
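The accuracy cutoff described above can be illustrated with a minimal sketch. It is not PhenoLink's implementation: the function name, the experiment names and the accuracy values are hypothetical, and only the 60% cutoff is taken from the text.

```python
def prioritize_experiments(accuracies, cutoff=0.60):
    """Return experiments whose classification accuracy meets the cutoff.

    `accuracies` maps an experiment name to its accuracy (0..1).
    The result is ordered from most to least accurate; ties are
    broken alphabetically so the order is deterministic.
    """
    kept = [(name, acc) for name, acc in accuracies.items() if acc >= cutoff]
    kept.sort(key=lambda item: (-item[1], item[0]))
    return [name for name, _ in kept]


# Hypothetical accuracies, for illustration only.
example = {"exp_A": 0.92, "exp_B": 0.58, "exp_C": 0.60, "exp_D": 0.81}
selected = prioritize_experiments(example)
```

In this sketch, experiments below the cutoff are simply dropped, and the remaining ones are ordered so that the most reliably classified phenotypes are examined first.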
PhenoLink can find links to many phenotypes across several strains. The input data should contain information about at least a few strains (default of 4) for each of at least two different phenotypes (totaling 8 strains). However, most public data sets lack either ~omics or phenotype data. Most ~omics and/or phenotype data sets come from studies of only a few strains, which poses a small sample size problem that prevents their use in PhenoLink, and many others have a phenotype imbalance problem [15, 16]. In this study, we describe the use of PhenoLink on two different datasets: (i) Lactobacillus plantarum genotype and phenotype data and (ii) Streptococcus pneumoniae gene essentiality data. These datasets are publicly available (see PhenoLink website).
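The minimum input requirement can be checked before running an analysis. The sketch below assumes that the default of 4 applies to each phenotype class separately, which is our reading of the "totaling 8 strains" remark; the function name, parameter names and strain labels are hypothetical and not part of PhenoLink.

```python
from collections import Counter


def check_phenotype_input(phenotypes, min_strains_per_class=4, min_classes=2):
    """Check whether an experiment has enough strains per phenotype.

    `phenotypes` maps a strain name to its phenotype label.
    Returns (is_usable, usable_classes), where `usable_classes` maps
    each sufficiently populated phenotype to its strain count.
    """
    counts = Counter(phenotypes.values())
    usable = {label: n for label, n in counts.items()
              if n >= min_strains_per_class}
    return len(usable) >= min_classes, usable


# Hypothetical experiment: 5 growing and 4 non-growing strains.
balanced = {f"strain_{i}": "growth" for i in range(5)}
balanced.update({f"strain_{i}": "no_growth" for i in range(5, 9)})
ok, classes = check_phenotype_input(balanced)

# Hypothetical experiment with too few non-growing strains.
skewed = {f"strain_{i}": "growth" for i in range(5)}
skewed.update({f"strain_{i}": "no_growth" for i in range(5, 7)})
ok_skewed, classes_skewed = check_phenotype_input(skewed)
```

A check of this kind makes the small sample size and phenotype imbalance problems visible up front, before any classification is attempted.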
In PhenoLink, redundant and noisy features are removed before association analysis. Therefore, an increase in the number of features is not expected to increase the total run-time proportionally. We tested how PhenoLink's run-time changes as the number of features increases. To this end, we created two datasets by increasing the total number of features for the 42 L. plantarum strains from 2847 to 5000 and 10000. In this test, the increase in the number of features increased PhenoLink's run-time exponentially. Note that this is likely because, unlike in the actual L. plantarum data, most features in the randomly generated data had very high variances and were rarely correlated. This in turn substantially increased the number of features used in classification, and hence PhenoLink's run-time.
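To illustrate why pruning helps on real data but much less on random data, the following sketch removes uninformative and redundant presence/absence features. It is an assumption-laden illustration, not PhenoLink's actual pruning procedure: here a feature is dropped if it is (nearly) constant across strains, and features with identical profiles are collapsed into one representative. The function name, the `min_minority` parameter and the gene names are hypothetical.

```python
def prune_features(features, min_minority=1):
    """Collapse redundant presence/absence features.

    `features` maps a feature name to a sequence of 0/1 values, one
    per strain. Features whose rarer state occurs in fewer than
    `min_minority` strains are dropped as uninformative. Features
    with identical profiles are grouped together.
    Returns {representative_name: [all names sharing its profile]}.
    """
    groups = {}  # profile tuple -> list of feature names
    for name, profile in features.items():
        profile = tuple(profile)
        present = sum(profile)
        if min(present, len(profile) - present) < min_minority:
            continue  # constant or near-constant across strains
        groups.setdefault(profile, []).append(name)
    return {names[0]: names for names in groups.values()}


# Hypothetical presence/absence profiles over four strains.
example = {
    "gene_a": (1, 1, 1, 1),  # present everywhere: dropped
    "gene_b": (1, 0, 1, 0),
    "gene_c": (1, 0, 1, 0),  # same profile as gene_b: merged
    "gene_d": (0, 0, 1, 1),
}
pruned = prune_features(example)
```

When many features share a profile, as is common for co-occurring genes, the number of features passed to classification shrinks considerably. Randomly generated features rarely share profiles, so almost all of them survive this kind of pruning.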
We developed a web tool, PhenoLink, to link phenotypes to ~omics data. It is a flexible and versatile tool that can be used to effectively prioritize links from different ~omics datasets, such as genotype, transcriptome, metabolome and proteome data, to phenotypes. It offers enhanced visualization of links to phenotypes, is more accurate than correlation-based methods and is less resource-intensive than Bayesian-based methods. It has already been used in several studies to identify leads to phenotypes from diverse sets of ~omics data, such as genotype, transcriptome and metabolome data. Thus, PhenoLink facilitates the screening of large ~omics and phenotype data sets, effectively capturing both known and novel relations to phenotypes.