Discernment of possible mechanisms of hepatotoxicity via biological processes over-represented by co-expressed genes

Background
Hepatotoxicity is a form of liver injury caused by exposure to stressors. Genomic-based approaches have been used to detect changes in transcription in response to hepatotoxicants. However, there are no straightforward ways of using co-expressed genes anchored to a phenotype or constrained by the experimental design for discerning mechanisms of a biological response.

Results
Through the analysis of a gene expression dataset containing 318 liver samples from rats exposed to hepatotoxicants, and leveraging alanine aminotransferase (ALT), a serum enzyme indicative of liver injury, as the phenotypic marker, we identified biological processes and molecular pathways that may be associated with mechanisms of hepatotoxicity. Our analysis used an approach called Coherent Co-expression Biclustering (cc-Biclustering) for clustering a subset of genes through a coherent (consistency) measure within each group of samples representing a subset of experimental conditions. Supervised biclustering identified 87 genes co-expressed and correlated with ALT in all the samples exposed to the chemicals. None of the pathways over-represented by these genes was related to liver injury. However, biclusters with subsets of samples exposed to one of the 7 hepatotoxicants, but not to a non-toxic isomer, contained co-expressed genes that represented pathways related to a stress response. Unsupervised biclustering of the data resulted in 1) four to five times more genes within the bicluster containing all the samples exposed to the chemicals, 2) biclusters with co-expression of genes that discerned 1,4-dichlorobenzene (a non-toxic isomer at low and mid doses) from the other chemicals, as well as pathways and biological processes that underlie liver injury, and 3) a bicluster with genes up-regulated in an early response to toxic exposure.

Conclusion
We obtained clusters of co-expressed genes that over-represented biological processes and molecular pathways related to hepatotoxicity in the rat. The mechanisms involved in the response of the liver to 1,4-dichlorobenzene exposure suggest non-genotoxicity, whereas exposure to the hepatotoxicants could be DNA-damaging, leading to overall genomic instability and activation of cell cycle checkpoint signaling. In addition, key pathways and biological processes representative of an inflammatory response, energy production and apoptosis were impacted by the hepatotoxicant exposures that manifested liver injury in the rat.

The gene expression data were first normalized by systematic variation normalization (SVN) using a polynomial fit of degree 3 [1]. Briefly, the background value for each channel of each array (the low end point determined from the intensity histogram distribution) was found to lie within a range of 20 to 100 in pixel intensity. The first step in SVN is to subtract a background value (approximately 60) from a given array. Before non-linear regression, the intensity measure for each gene and each channel is log base 2 transformed. Because the mean of the log base 2 intensities may vary among arrays, multiarray normalization was performed to scale the array data so that their mean log base 2 intensities are equal. The normalized gene expression intensity values were then converted to ratios, fluor-flips from biological replicates were merged (averaged), and the results were compiled to form a matrix A (Additional file 2).

Each expression value has four indices, i.e.

$$a_{i,j,k,l} \qquad (1)$$

where the row index i runs from 1 to N, the number of genes, and the column index breaks up into three sub-indices: the subgroup index j runs from 1 to J, the number of conditions; the index k runs from 1 to K, the number of treatments within subgroup j; and the replicate index l runs from 1 to L for a given index k (Additional file 3). The general idea of cc-Biclustering is to map matrix A to a binary coherent matrix $H = (h_{i,j})$ according to an inclusion/exclusion criterion function.
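The background subtraction, log base 2 transform and mean equalization described above can be illustrated with a minimal Python sketch. It is not the SVN implementation of [1]: the polynomial regression step, the ratio conversion and the fluor-flip merging are omitted. The function names, the `floor` guard for spots at or below background, and the choice of the grand mean as the common target are assumptions made for illustration only.

```python
import math

def background_subtract_log2(intensities, background=60.0, floor=1.0):
    """Subtract a background value, then log2-transform one channel of one array.

    `floor` is an assumption: it keeps the logarithm defined when a spot
    falls at or below the background value.
    """
    return [math.log2(max(v - background, floor)) for v in intensities]

def equalize_log2_means(log_arrays):
    """Shift every array so that all arrays share one mean log2 intensity.

    Assumption: the common target is the grand mean of the per-array means.
    A shift in log2 space corresponds to a multiplicative scaling of the
    raw intensities.
    """
    means = [sum(a) / len(a) for a in log_arrays]
    target = sum(means) / len(means)
    return [[v - m + target for v in a] for a, m in zip(log_arrays, means)]

# Toy data: two arrays with three spots each (raw pixel intensities).
raw = [[124.0, 188.0, 316.0], [92.0, 124.0, 188.0]]
logged = [background_subtract_log2(a) for a in raw]
normalized = equalize_log2_means(logged)
```

After this step the two toy arrays have the same mean log base 2 intensity, which is the property the multiarray normalization is meant to enforce.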

Supervised cc-Biclustering algorithm
In the case of supervised cc-Biclustering, the row index i of H runs from 1 to N and the column index j runs from 1 to J, the number of subgroups. Each entry of H is given by

$$h_{i,j} = \begin{cases} 1, & \text{if } CM(a_{i,j}, S_j) < p_t \\ 0, & \text{otherwise} \end{cases} \qquad (2)$$

where $a_{i,j}$ is the sub-profile at the i-th row and j-th subgroup, consisting of all the expression values over treatment index k from 1 to K and replicate index l from 1 to L, $S_j$ is the sub-profile of the phenotypic anchoring measure S for the j-th subgroup, and $p_t$ is a threshold. CM represents a coherent measure between these two sub-profiles. In this paper, CM is the p-value of their Pearson correlation.

The unsupervised cc-Biclustering approach follows the same idea as the supervised one, except that it uses a pair-wise approach instead of anchoring to a phenotypic measure.
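For the supervised case, the mapping from the expression matrix A and the phenotypic measure S to the binary matrix H can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' code. All function names are hypothetical. The default threshold of 0.05 is a placeholder, since the text does not state the value used. The `require_positive` switch is an added option, because the text does not say whether the supervised case also requires a positive correlation. The p-value is obtained by numerically integrating the null distribution of the Pearson coefficient, which stands in for a statistics library routine.

```python
import math

def pearson_r_and_p(x, y, steps=2000):
    """Pearson r and its two-sided p-value (needs at least 3 paired values)."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need two sub-profiles of equal length >= 3")
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0, 1.0  # assumption: a constant profile counts as uncorrelated
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    r = max(-1.0, min(1.0, sxy / math.sqrt(sxx * syy)))

    # Under the null, r has density proportional to (1 - r^2)^((n - 4) / 2).
    # Substituting r = sin(t) gives
    #   p = 2 * integral_{asin|r|}^{pi/2} cos(t)^(n-3) dt / B(1/2, (n-2)/2),
    # evaluated here with Simpson's rule (steps must be even).
    def f(t):
        return math.cos(t) ** (n - 3)

    lo, hi = math.asin(abs(r)), math.pi / 2
    h = (hi - lo) / steps
    total = f(lo) + f(hi)
    total += sum((4 if s % 2 else 2) * f(lo + s * h) for s in range(1, steps))
    log_beta = (0.5 * math.log(math.pi)
                + math.lgamma((n - 2) / 2) - math.lgamma((n - 1) / 2))
    p = min(1.0, 2 * (total * h / 3) / math.exp(log_beta))
    return r, p

def supervised_coherent_matrix(A, S, subgroups, p_t=0.05, require_positive=False):
    """Build H with one row per gene and one column per subgroup.

    A: list of gene rows; S: phenotype value per column (for example ALT);
    subgroups: list of column-index lists, one per subgroup j.
    """
    H = []
    for row in A:
        h_row = []
        for cols in subgroups:
            r, p = pearson_r_and_p([row[c] for c in cols], [S[c] for c in cols])
            coherent = p < p_t and (r > 0 or not require_positive)
            h_row.append(1 if coherent else 0)
        H.append(h_row)
    return H
```

A row of H that contains 1 in every column corresponds to a gene that is coherent with the phenotype in all subgroups, which is how a bicluster spanning all exposed samples would appear in this representation.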
In the unsupervised case, the binary coherent matrix H has its row index i running from 1 to N(N+1)/2. Each $h_{i,j}$ value is calculated from two different rows i1 and i2 of the expression matrix A(N,M) within the j-th subgroup, and equation (2) is modified to

$$h_{i,j} = \begin{cases} 1, & \text{if } CM(a_{i1,j}, a_{i2,j}) < p_t \\ 0, & \text{otherwise} \end{cases}$$

where $a_{i1,j}$ and $a_{i2,j}$ are two different sub-profiles of expression values at the i1-th and i2-th rows, respectively, of the j-th subgroup, and $p_t$ is the p-value threshold. When constant biclusters are extracted from this pair-wise binary coherent matrix H, a gene can be assigned to many different biclusters. To avoid this, we assign a gene only to the bicluster(s) with the largest number of subgroups.
In such a bicluster with a given set of subgroups j, each gene has at least one other gene with which it is significantly correlated. However, some genes within the bicluster may not be significantly correlated with each other. For this reason, we refer to these biclusters as "subgroup-selected-biclusters". As the next step in a tandem analysis, one can partition these genes into different clusters using a conventional clustering method. In this study, we applied EPIG [2] to the subgroup-selected-biclusters to obtain cc-biclusters, in which genes are significantly correlated with each other, or co-expressed, within a given set of subgroups. Briefly, EPIG extracts a set of discrete patterns and then categorizes each significant gene to one of the patterns. The genes assigned to a given pattern have expression highly correlated with the pattern profile. The following is the unsupervised biclustering algorithm:

Unsupervised cc-Biclustering algorithm
Input: expression matrix A(N,M), sample information
Partition columns into J subgroups
Set threshold p_t
Create an empty coherent matrix h_{i,j}, i = 1 to N(N+1)/2
Set i = 1
FOR each profile i1, i1 = 1 to N-1
    FOR each profile i2, i2 = i1+1 to N
        FOR each subgroup j
            compute the p-value p of the correlation r between a_{i1,j} and a_{i2,j}
            if (p < p_t and r > 0) h_{i,j} = 1
            else h_{i,j} = 0
        i = i + 1
Create constant subgroup-selected-biclusters {h_{i,j} = 1} in which genes have the largest number of subgroups
Apply the EPIG method to the subgroup-selected-biclusters for row-wise separation to get cc-biclusters
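The pair-wise loop and the largest-subgroup assignment in the algorithm above can be sketched in Python as follows. This is one possible reading, not the authors' implementation. All function names are hypothetical, the default threshold is a placeholder, and the p-value is computed by numerical integration in place of a statistics library routine. The rule used in `subgroup_selected_biclusters` (each gene keeps the largest subgroup set found among its coherent pairs, with ties kept) is an assumption about how the assignment is carried out. The EPIG step is not reproduced. Note that the loops visit N(N-1)/2 gene pairs, which fits within the N(N+1)/2 rows allocated in the algorithm.

```python
import math
from itertools import combinations

def pearson_r_and_p(x, y, steps=2000):
    """Pearson r and its two-sided p-value (needs at least 3 paired values)."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need two sub-profiles of equal length >= 3")
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0, 1.0  # assumption: a constant profile counts as uncorrelated
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    r = max(-1.0, min(1.0, sxy / math.sqrt(sxx * syy)))

    # p = 2 * integral_{asin|r|}^{pi/2} cos(t)^(n-3) dt / B(1/2, (n-2)/2),
    # the tail of the null distribution of r, via Simpson's rule.
    def f(t):
        return math.cos(t) ** (n - 3)

    lo, hi = math.asin(abs(r)), math.pi / 2
    h = (hi - lo) / steps
    total = f(lo) + f(hi)
    total += sum((4 if s % 2 else 2) * f(lo + s * h) for s in range(1, steps))
    log_beta = (0.5 * math.log(math.pi)
                + math.lgamma((n - 2) / 2) - math.lgamma((n - 1) / 2))
    return r, min(1.0, 2 * (total * h / 3) / math.exp(log_beta))

def pairwise_coherent_matrix(A, subgroups, p_t=0.05):
    """Map each gene pair (i1, i2), i1 < i2, to its row of h values."""
    H = {}
    for i1, i2 in combinations(range(len(A)), 2):
        row = []
        for cols in subgroups:
            r, p = pearson_r_and_p([A[i1][c] for c in cols],
                                   [A[i2][c] for c in cols])
            row.append(1 if (p < p_t and r > 0) else 0)
        H[(i1, i2)] = tuple(row)
    return H

def subgroup_selected_biclusters(H):
    """Group genes by the largest subgroup set among their coherent pairs."""
    best = {}  # gene -> set of equally large subgroup sets
    for (i1, i2), row in H.items():
        sg = frozenset(j for j, v in enumerate(row) if v == 1)
        if not sg:
            continue
        for g in (i1, i2):
            current = best.setdefault(g, {sg})
            size = len(next(iter(current)))
            if len(sg) > size:
                best[g] = {sg}
            elif len(sg) == size:
                current.add(sg)
    biclusters = {}  # subgroup set -> genes assigned to it
    for g, subgroup_sets in best.items():
        for sg in subgroup_sets:
            biclusters.setdefault(sg, set()).add(g)
    return biclusters
```

Each key of the returned dictionary is a set of subgroup indices and each value is the set of gene indices assigned to it. A row-wise clustering method such as EPIG would then be applied within each entry to separate genes that are not mutually correlated.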