miRNA targets are preferentially regulated by environmental chemicals
We first collected the genes regulated by ECs (EC-genes). According to the expression regulatory information from the CTD [14], such as "chemical x results in increased expression of protein y", "compound x results in decreased expression of protein y" or "compound x affects the expression of protein y", we compiled the dataset of proteins regulated by ECs (see Methods), and then transformed the gene symbols to their Ensembl gene IDs using the BioMart program (http://biomart.org). Based on the reports of 4,162 literatures, we retrieved 42,770 regulatory relationships among compounds and human protein-coding genes (Additional file 1). Specifically, the expression profiles of 9,692 protein-encoding genes were regulated by at least one of the 1,938 ECs, including polycyclic compounds, organic chemicals, heterocyclic compounds, inorganic chemicals, hormones and so on.
We next evaluated the probability of these EC-genes being targeted by miRNAs using TargetScan5.1 [22] and PicTar (four-way) [23], which predict miRNA targets based on sequence complementarities, sequence context information, binding energy, and were regarded by previous surveys having high confidence [24]. As the reliance of TargetScan and PicTar upon cross-species conservation might introduce potential bias, we also included a third set of predicted human miRNA targets derived from PITA [25], which only considered the sequence complementarities and site accessibility; therefore, many more genes were annotated as miRNA targets with the advantage of detecting human-specific miRNA targets.
Using each dataset of miRNA targets (Additional file 2), we observed that miRNA targets were significantly enriched among EC-genes. As shown in Figure 1A, miRNA targets comprise 43% of the EC-genes (4,195 out of 9,692), but only make up 20% of the non-EC-genes (2,473 out of 12,202) as predicted by PicTar. Using the targets detected by TargetScan (Figure 1B), both programs of PicTar and TargetScan (Figure 1C), and PITA (Figure 1D), we obtained similar results. Therefore, miRNA targets were roughly twice as likely to be EC-genes as comparable to the other genes (Chi-square test, p-values < 1.7E-289 for the four datasets).
Because the methods of TargetScan and PicTar depend upon alignments of human-mouse orthologs, we only included the human genes with mouse orthologs and repeated the comparative analyses. In Figure 1A-C, the horizontal lines above the histogram bars represent the proportion of miRNA targets using the genes with mouse orthologs as background. Significant enrichment of miRNA targets among EC-genes were observed in this dataset.
In Mus musculus and Rattus norvegicus, we retrieved 9,552 and 5,064 genes respectively, which were regulated by ECs and found miRNA targets were over-represented among them as compared to the other genes (Additional file 3), indicating that the enrichment of miRNA targets in EC-genes seems to be common in mammal systems. In the following analysis, we only focused on the human genes because similar conclusions were drawn from the analysis results of Mus musculus and Rattus norvegicus.
Enrichment of miRNA targets is not dependent on the collection bias of EC-genes
To evaluate whether the above-observed enrichment of miRNA targets among EC-genes was caused by some sample collection bias, we performed the following analysis. First, if the experiments in the literatures from which the EC-genes were extracted, were designed to deal with miRNA-related scientific questions, the enrichment would be intuitively expected but with serious bias. We downloaded all of the abstracts of the 4,261 papers from the NCBI via the PubMed IDs (http://www.ncbi.nlm.nih.gov/sites/batchentrez) and searched keywords such as "miR", "miRNA", "microRNA" or "let-7". As a result, only 5 papers were directly related to the study of miRNAs and about 30 genes appearing in the datasets were included by them (Additional file 4); therefore, there should be no bias in the literature collection results.
Second, some EC-genes may belong to a certain class of genes that are preferentially regulated by miRNAs; therefore, the enrichment may be only contributed by that class of genes. As cancer-related genes were extensively studied [26] and often found to be miRNAs targets [18, 19], it is possible that the exposure of the cancer-related genes to ECs were more likely to be investigated and eventually made the miRNA targets over-represented among the EC-genes. To test this, we retrieved a separate set of genes over-expressed in cancer tissues [27]. Specifically, 2,362 proteins corresponding to 2,062 Ensembl genes were at least over-expressed 4-fold in brain (astrocytoma and glioblastoma), breast, colon, endometrial, kidney, liver, lung, ovary, prostate, skin, and thyroid cancers as compared to healthy tissues of the same type. Significant enrichment of miRNA targets among EC-genes were still observed even after filtering out these cancer-related genes from the datasets (Additional file 5).
Third, for the 12,202 genes not observed to be regulated by ECs (non-EC-genes), some of them may in fact be regulated by ECs but not analyzed or reported thus far. If this is indeed the case, the potential false-negatives from the non-EC-genes may seriously challenge the enrichment conclusion. Because 45% of human genes (calculated by 9,692/(9,692 + 12,202)) were confirmed to be regulated by ECs genome-widely, we arbitrarily sampled genes with the probability of 0.45 from the non-EC-genes and assumed them to be non-annotated EC-genes. To investigate the impact of the potential false-negatives of non-EC-genes, we performed the following procedures: (a) randomly sampled genes from non-EC-genes with the probability of 0.45, S = 0.45; (b) constructing the dataset of pseudo-EC-genes with n = 9,692 + 12,202*S and a dataset of pseudo-non-EC-genes with n = 12,202-12,202*S; (c) comparing the proportion of miRNA targets between the pseudo-EC-genes and pseudo-non-EC-genes. We repeated this simulation several times and always obtained significant differences of miRNA targets between pseudo-EC-genes and pseudo-non-EC-genes (see Additional file 6 for the results of eight simulations). Thus, the potential false-negatives of non-EC-genes would not affect the enrichment tendencies of miRNA targets.
Fourth, it is interesting and important to know whether the enrichment of miRNA targets could still be observed in a single experiment. In Perl scripts, the keywords such as "microarray", "array", "affymetrix" and "chip" were used to search the abstracts of 4,162 papers. Many reports did not provide the raw datasets (.cel files), but rather only displayed the differentially expressed genes in tables in the main texts or supplemental materials. We read the full-text of ≥50 papers and investigated whether the raw datasets (.cel files) were available from GEO (NCBI Gene Expression Omnibus, http://www.ncbi.nlm.nih.gov/geo/) or ArrayExpress (EBI Gene Expression Atlas, http://www.ebi.ac.uk/gxa/). Finally, we manually selected six affymetrix raw datasets and used a uniform pipeline to identify the differentially expressed genes in seven cell lines treated with chemicals (see Methods). The enrichment of miRNA targets among differentially expressed genes was observed in each dataset, respectively (Additional file 7), implying the pervasive roles of miRNAs in responding to the various chemicals in different cells.
Enrichment of miRNA targets is not caused by other properties of EC-genes
Previous studies have reported significant differences between miRNA-targets and non-miRNA-targets. These differences may be potential sources of bias and contribute to the enrichment of miRNA targets in EC-genes. We adopted a sampling method to control these biases and then examined whether the enrichment of miRNA targets in EC-genes was still observable.
The first potential bias stems from the observation showing that genes with paralogs (gene duplication mechanisms) have a high probability of being targeted by miRNAs [28]. In Figure 2A, the EC-genes have a higher tendency of having paralogs compared with non-EC-genes (61% vs. 53%, p = 2.45E-36, Chi-square test, two tailed). We sampled genes with a probability of 0.71 (S = 0.71, calculated by [3,729*(6,472/5,730)]/5,963) from the 5,963 EC-genes with paralogs. In this way, the dataset of sampled EC-genes were constructed with n = 3,729 + 5,963*S and had the same proportion of genes with paralogs as that of the non-EC-genes. By eliminating potential bias from the higher propensity for gene duplication, we observed similar enrichment of miRNA targets among EC-genes (Figure 3A).
A second potential bias comes from the observation that genes with more TF-binding sites in the 5' upstream regions have a higher probability to become targets of miRNAs [29]; thus, an increased number of transcription factor binding sites may result in an over-representation of miRNA targets among EC-genes. We tested this hypothesis by using cis-elements that were exclusively predicted from conserved motif sequences among a set of vertebrate genome sequences (cisRED database, http://www.cisred.org/) [30]. A total of 94,321 and 100,112 predicted cis-elements were found to be in the proximity of 7,630 non-EC-genes and 8,182 EC-genes, respectively. Thus, EC-genes and non-EC-genes have a similar average number of cis-elements (p = 0.65, Manny-Whitney U, two-tail test) (Figure 2B), suggesting that the enrichment of miRNA targets among EC-genes was not likely related to the over-presence of TF-binding sites in 5'-upstream regions.
A third potential bias results from the observation showing that genes with longer 3'UTRs are more likely to be regulated by distinct types of miRNA [31].As shown in Figure 2C, the EC-genes tended to have longer 3'UTRs than non-EC-genes (p = 1.76E-37, Manny-Whitney U, two-tailed test), indicating that the enrichment may be attributed to the higher probability of EC-genes of being detected as miRNA targets. To test this, we sampled 4,643 EC-genes whose 3'UTR length falls into the 1st to 3rd quartiles of the 3'UTR lengths among the 9,223 non-EC-genes with available 3'UTR annotation. By eliminating potential bias caused by longer 3'UTRs, we further observed that miRNA targets were significantly enriched among EC-genes (Figure 3B).
Fourth, we inspected whether the 3'UTRs of the EC-genes and non-EC-genes were under the same level of selective pressure. The substitution rates of the 3'UTR (K3u) of each gene were normalized against the synonymous substitutions per synonymous site (Ks) in the coding region of the same gene (see Methods). Using the ratio of K3u/Ks to estimate the evolutionary constraints on the 3'UTRs, we found the 3'UTR of the EC-genes tended to evolve more conservatively than non-EC-genes (p = 4.97E-17, Manny-Whitney U, two-tailed test) (Figure 2D). To explore the possibility that the elevated level of overall sequence conservation led to an over-representation of miRNA targets among EC-genes, we sampled 3,854 EC-genes whose K3u/Ks ratios fall into 1st to 3rd quartiles of the K3u/Ks ratios calculated from the 6,484 non-EC-genes. After getting rid of the potential bias caused by the more conserved 3'UTR, the enrichment of miRNA targets was again observable among sampled EC-genes (Figure 3C).
Collectively, these results demonstrate that the enrichment of miRNA targets is not a simple by-product of ancillary features of the analyzed gene set, but is a reflection of the propensity of being targeted by miRNAs increasing the genes' probabilities of being regulated by ECs.
Target preference of miRNAs on genes regulated by different environmental chemicals
Based on the above statistical analyses, we have confirmed that genome-widely miRNA targets were preferentially regulated by ECs; but, whether miRNAs have different targeting preference for genes regulated by different ECs is still an open question, and vice versa. If the preference exists, we would expect a large number of concurrent miRNA-EC pairs, which tend to co-regulate the same genes.
We devised a randomization method to identify significant concurrent miRNA-ECs pairs (see a detailed description in Methods). As an example to illustrate the identifying pipelines: In Figure 4A, miRNA-a has 7 target genes, 5 of which are regulated by EC-β; therefore, the |Targets(miR-a)| is assigned to 7, |Targets(EC-β)| to 8 and |Targets(miR-a) ∩ Targets(EC-β)|realto 5. In each simulated run, the 7 targets for miRNA-a were randomly replaced by the targets of other miRNAs by an edge-swapping procedure [32] (the algorithm of edge-swapping can sufficiently randomize the content of targets, while keeping the number of targets for each miRNA), then |Targets(miR-α) ∩ Targets(EC-β)|randomwas recorded. Repeating this simulation 500 times, the |Targets(miR-α) ∩ Targets(EC-β)|randomfollowed a normal distribution as N(2, 0.5) (Figure 4B). The Z-score was adopted to assess the statistical significance of whether the miRNAs and ECs tend to regulate the same genes. Here, the Z-score was calculated by (5-2)/0.5 = 6 and transformed to a p-value as 9.87E-10; thus, the targets of miRNA-a were considered to be preferentially regulated by the component-β.
Using two lists, EC-Genes (Additional file 1) and miRNA-Target by TargetScan5.1, a miRNA-EC pair was considered to be significantly concurrent if the FDR-corrected p-value (the q-value) was less than 0.01. Finally, we identified 1,842 concurrent interactions among 407 miRNAs and 497 ECs (Additional file 8), which tend to synergistically regulate the same gene sets. Therefore, distinct miRNAs tend to be "adopted" to regulate genes in response to different ECs.
Association network of miRNAs and ECs
Graph theory provides paradigms to study biological networks [33]. Here, miRNAs and ECs can be represented respectively by different colored nodes, the concurrent relationship by links. We constructed the association network of miRNAs and ECs to provide a global view of how miRNAs function in concert with ECs. As shown in Figure 4C, the number of the concurrent ECs for each miRNA followed a power-law distribution, where a small proportion of miRNAs connect to many ECs; whereas, a large number of miRNAs only connected to one or two ECs. In this way, it is possible to select a single miRNA or a combination of miRNAs as biological markers in functional studies of their concurrent ECs.
Besides a scale-free structure, the network also demonstrated a modular structure (Figure 4D); that is, a set of miRNAs and ECs were found to be densely concurrent in community-like modules, suggesting that miRNAs in some functional pathway may be co-operatively "adopted" to respond to ECs [34]. We used the algorithm of Guimera and Amaral[35] to measure the modularity (see Methods), because it performed well in making links within modules much denser than those across modules [36] and has been validated in our previous network analysis [37]. With a final value of modularity being 0.65, 14 topological modules could be separated. In each module, the miRNAs and ECs were closely connected (see Figure 5 for topological structures and Additional file 9 for description of the 14 modules).
We used available disease information of miRNAs to explore the potential function of each module. Three databases have been recently developed, HMDD [38], miR2Disease [39] and PhenomiR [40], which contain a large number of miRNA-disease associations from the literatures (i.e., the abnormal regulations of miRNAs correlated with or leading to diseases). Therefore, the correlation of chemicals associated with human diseases could be interpreted by integrating the available disease information on miRNAs with the network modules. As shown in Figure 5, eight out of 19 miRNAs of module XII involved in "Head and Neck neoplasm", indicated that the concurrent ECs of this module had a high probability to be risk factors for head and neck neoplasms. For 16 out of 40 miRNAs of module V involved in"Heart Failure", the information on the concurrent ECs of this module could aid greater understanding of the regulatory mechanisms of heart disease.
Comments
View archived comments (1)