Transcription network construction for large-scale microarray datasets using a high-performance computing approach
© Zhu and Wu. 2008
Published: 4 March 2008
Skip to main content
© Zhu and Wu. 2008
Published: 4 March 2008
The advance in high-throughput genomic technologies including microarrays has demonstrated the potential of generating a tremendous amount of gene expression data for the entire genome. Deciphering transcriptional networks that convey information on intracluster correlations and intercluster connections of genes is a crucial analysis task in the post-sequence era. Most of the existing analysis methods for genome-wide gene expression profiles consist of several steps that often require human involvement based on experiential knowledge that is generally difficult to acquire and formalize. Moreover, large-scale datasets typically incur prohibitively expensive computation overhead and thus result in a long experiment-analysis research cycle.
We propose a parallel computation-based random matrix theory approach to analyze the cross correlations of gene expression data in an entirely automatic and objective manner to eliminate the ambiguities and subjectivity inherent to human decisions. We apply the proposed approach to the publicly available human liver cancer data and yeast cycle data, and generate transcriptional networks that illustrate interacting functional modules. The experimental results conform accurately to those published in previous literatures.
The correlations calculated from experimental measurements typically contain both “genuine” and “random” components. In the proposed approach, we remove the “random” component by testing the statistics of the eigenvalues of the correlation matrix against a “null hypothesis” — a truly random correlation matrix obtained from mutually uncorrelated expression data series. Our investigation into the components of deviating eigenvectors after varimax orthogonal rotation reveals distinct functional modules. The utilization of high performance computing resources including ScaLAPACK package, supercomputer and Linux PC cluster in our implementations and experiments significantly reduces the amount of computation time that is otherwise needed on a single workstation. More importantly, the large distributed shared memory and parallel computing power allow us to process genomic datasets of enormous sizes.
The rapid growth of genomic sequence data starting in early 1980's has spurred the development of computational tools for DNA sequence similarity searches, structural predictions, and functional predictions. The emergence of high-throughput genomic technologies in the late 1990's has enabled the analysis of higher order cellular processes based on genome-wide expression profiles such as oligonucleotide or cDNA microarray. A typical microarray dataset contains hundreds of sample points for thousands or tens of thousands of genes. A colossal amount of profound knowledge at genome level is hidden inside such immense expression data. A single gene is usually extracted to differentially identify expression genes at a significant level. However, such point level analysis does not address the full potential of genome-scale experiments. Nowadays genes can be affiliated by their co-regulated expression waveforms in addition to sequence similarity and proximity on the chromosome as in gene content analysis. Genes ascribed to the same cluster are usually responsible for a specific physiological process or belong to the same molecular complex. Such transcriptome (mRNAs) datasets deliver new knowledge and provide a revealing insight to the existing genome (genes) datasets, and can be used to guide proteome (proteins) and interactome research that aims to extract key biological features such as protein-protein interactions and subcellular localizations more accurately and efficiently.
However, organizing genome-wide gene expression data into meaningful function modules remains a great challenge. Many non-supervised and supervised computational techniques have been proposed to conjecture the cellular network based on microarray hybridization data. The widely employed techniques include Boolean network methods, differential equation-based network methods, Bayesian network methods, hierarchical clustering, K-means clustering, self-organizing map (SOM), and correlation-based association network methods.
Boolean network method is a coarse simplification of gene networks to determine the gene state as either 0 or 1 from the inputs of many other genes [1, 2]. Differential equation-based network models gene networks as a set of nonlinear differential equations that indicate the gene rate change without the assumption of discrete time steps . Bayesian network gives a graphical display of dependence structure based on conditional probabilities among genes. In hierarchical clustering, a dendogram is constructed by iteratively grouping together genes with the highest correlation, which is essentially a greedy algorithm achieving local optimality while disregarding negative association . K-means clustering is an improved approach of hierarchical clustering requiring a subjective specification on the number of clusters . SOM is a neural network-based iterative clustering method and also requires the user to estimate the initial cluster number . The correlation-based association network technique has been commonly adopted to identify cellular networks due to its computational simplicity and the nature of microarray data (i.e., typically noisy, highly dimensional and significantly undersampled). However, the association network method relies on arbitrarily assigned thresholds for link cutoff, which inevitably introduces subjectivity in building network structure and topology. A novel technology, which can determine the structure of transcriptional networks and uncover biological regularities in a computerized and unbiased way, has been under active study by biological scientists.
There exists a wide range of microarray clustering and visualization tools available with statistical analysis support, including affy, cclust, cluster, mcluster, hybridHclust, SOM packages from R environment , integrated systems such as Bioconductor , and Cluster 3.0/Tree view , web-based systems such as cyberT , SNOMAD  and CARMAweb . Many stand-alone systems are built upon R statistical packages using aforementioned clustering algorithms. We will discuss both the advantages and disadvantages associated with each clustering algorithm using real data examples in the next section. On the other hand, rapidly emerging large-scale genomic datasets pose a great challenge to current bioinformatics software and hardware resources. Most existing bioinformatics tools were developed as serial codes that are suitable for running on a single workstation, but often incur an unbearably long time delay or even cannot complete execution for large datasets due to limited memory. To date, cutting edge supercomputers such as IBM BlueGene, SGI Altix and Cray XT3, high-speed networks, high-performance storages as well as large-screen display devices have been in place or are being deployed across the nation. Efficient utilization of these high performance computing resources can help solve the problem of computation bottleneck and expedite the experimental turnaround time. The growing desire for improved application performance and reduced operational costs necessitates the design and development of parallel computing programs targeted at large-scale biological problems.
We propose to develop a system that constructs and analyzes various aspects of transcriptional networks based on random matrix theory (RMT) [13, 14] using ScaLAPACK [15, 16] for parallel calculation of linear algebra routines. We run our software package on two datasets: (i) yeast cycle microarray dataset  with about 2,500 genes and 79 samples, and (ii) human liver cancer microarray dataset  with about 20,000 genes and 207 samples. Comparisons are performed between the results generated by our package and some other popular packages.
The program in this work is implemented in C and MPI Fortran, and currently runs on a Linux cluster with eight nodes. We are now in the process of transiting our system from the Linux cluster to supercomputers with thousands of compute nodes. The experimental datasets are extracted from two public project websites, namely yeast cycle and human liver cancer projects.
Yeast cell cycling data is one of the best known microarray datasets that have been extensively evaluated. Since the structure of the network has been quite well understood, we are able to evaluate our clustering results by referring to an extensive set of published works.
The entire yeast genome is partitioned into a large number of functional modules sharing similar expression patterns. The large components of a deviating eigenvector computed from the Pearson correlation matrix are identified as gene members that belong to a specific functional module involved in a similar cellular pathway.
We use cclust package under the R environment to conduct K-means clustering, which repeatedly moves all cluster centers to the mean of their Voronoi sets. The distance between the cluster center and the data points is based on the Euclidian distance and polynomial learning rate is used. The major drawback of K-means method is that a user must specify the number of clusters, which is usually unknown for unexplored microarray datasets. For experimental purposes, we set the cluster number to be 20 based on previous results we obtained from the RMT method. K-means is able to identify protein biogenesis group with 114 genes; however some closely related protein biogenesis genes are assigned to several other unrelated groups. K-means algorithm tends to break down a coherent group with a large or medium number of gene members but lacks the capability of identifying small groups such as histone group of 9 genes, which has been successfully identified by our RMT method.
Hierarchical clustering method has been widely used by contemporary biologists to cluster microarray datasets. Groups of genes are nested at different levels of details represented by a dendrogram. A user can choose either to build the hierarchical structure in a bottom-up or a top-down fashion. The bottom-up approach can identify small clusters but not large ones, while the top-down approach can easily discern a few large clusters. Chipman proposed a novel hybrid clustering method , which combines the advantages of bottom-up hierarchical clustering with that of top-down clustering. The key idea is to create mutual clusters comprised of members closer to each other than to any other members. A user can perform a constrained top-down clustering, which inhibits the breaking of any identified mutual clusters. We load the hybridHclust library into the R environment and run it on our yeast cycle data with 2,500 genes. Squared Euclidean distance is used to calculate the similarity between genes. Average linkage is exploited to join clusters. A user may also use single or complete for linking, which will only affect the dendrogram plotting not the mutual clusters. It generates a heavily cluttered dendrogram, which is tedious for users to interpret the nested structure even with a relatively small number of genes. When examining some of the mutual clusters, we find out that those mutual clusters are indeed highly correlated with each other. For example, five glycolysis genes are identified as one mutual cluster, eight out of nine histone genes are found within one mutual cluster. However, the size of a typical mutual cluster is generally quite small ranging from 2 to 8 with the majority of 2. We cannot use mutual clusters alone to identify any bigger size clusters. The mutual cluster method works well in recognizing distinct small size clusters. However, negative correlations providing important anti-regulation information in many cellular processes, are ignored in the similarity calculation. Moreover, mutual clustering is sensitive to small data variations which may easily cause gene membership change. Another problem associated with hybrid clustering is that with an increasing density of gene numbers, some genes will likely occur within the boundary of any mutual cluster, thus making it dificult to find mutual clusters .
We characterize the expression pattern of gene expressions in hepatocellular carcinoma (HCC) using RMT method. There are about 20000 genes with more than 200 samples, including 97 primary HCC, 76 nontumor liver tissues, 7 benigh liver tumor samples with 3 adenoma and 4 FNH, 7 metastatic cancers, and 9 HCC cell lines. We cluster the microarray data for both genes and samples. The liver samples are roughly divided into two major groups, namely the HCC tumor tissues and nontumor liver tissues, where a few HCC tumor samples are found in the nontumor cluster. Adenoma and FNH samples are dispersed within the HCC cluster. Metastatic colon cancer samples are identified as a single cluster due to their highly similar expression patterns. Two metastatic granulosa cell tumor samples are also grouped together. We observe that our method is also able to detect subclusters within a big cluster. For example, since tumor samples that are acquired from the same patient usually display similar expression patterns, 6 HCC samples from patient HK64 are grouped together as a subcluster within HCC cluster; 5 samples from patient HK66 are found in the same subcluster; 3 samples from patient SF34 also appear in one subcluster. Our clustering results conform nicely to the results published in the literature .
High-throughput genomic technologies such as microarrays have generated gene expression data at the transcription level. The unprecedented power for the study of gene expression of thousands of genes simultaneously can be potentially used to unveil the topology and functions of transcriptional networks. In this paper, we explored random matrix theory and parallel computing techniques to dissect transcriptional networks and identify various functional modules for large-scale microarray datasets.
Luo et al also proposed a random matrix theory-based approach to infer transcriptional networks based on microarray data. However, their analysis is mainly focused on eigenvalues. In addition, their method require more computation cycles to calculate eigenvalues for many different correlation matrices. In our approach, we only need to compute eigenstates for one correlation matrix. We experimentally compare the performance between hierarchical clustering, K-means clustering, and our RMT method. Hierarchical clustering method is a very popular grouping technique used by biologists due to the simple underlying rationale and tree-like structure that can be easily visualized. However, similar to other heuristic approaches, it suffers from no guarantee of global optimization. Another problem with hierarchical clustering is that once a local grouping decision is made, no backtracking is possible. Moreover, when the total number of genes becomes prohibitively large, it is extremely dificult to analyze the nested tree structure to identify clusters. K-means algorithm, a typical partitioning based clustering method, seeks to find K clusters that minimize the sum of squared distances between each gene and its centroid. Input parameters such as the number of clusters and initial centroid locations need to be selected. However, different input parameters may lead to very different results, consequently leading to the problem of human subjectivity. K-means typically converges very fast, however, to a local optima, rather than the global optimum.
Our RMT method analyzes the genomic datasets from a global view. High levels of noise inherent to most biological datasets are removed first, and the true signal is further amplified for enhanced data interpretation. Consequently, RMT method avoids being trapped into local optima. Small-sized clusters that are easily mixed with other clusters by some clustering algorithms can be accurately extracted by our RMT method. Experimental results show that RMT method is able to recognize biologically meaningful clusters with various gene numbers ranging from two to several hundreds.
Most previous clustering methods partition members into non-overlapping groups. However, in our RMT method, one gene is allowed to reside in multiple groups, which supports a legitimate argument from the biological perspective that a single gene may get involved in different pathways. Transitively co-regulated genes, which are not directly correlated but both of which have correlation with the same gene, can also be detected and grouped. Our method is computationally efficient, objective without human intervention, and robust to high levels of noise. Functions of unknown genes are conjectured and explored through their associated function modules. This computational analysis is solely based on microarray data. If genes in the same functional module do not show a significant correlation, we will not be able to identify them using RMT method. However, it is likely that genes in the same functional module show a significant correlation under one condition but not under another condition, for example, the module of heat shock proteins is rarely identified in other yeast microarray datasets. By consolidating functional modules from multiple microarray datasets simultaneously, we will be able to improve the liability of the structure of functional modules. The authors will work toward this direction in the future.
where s x and s y denotes the standard deviations. Pearson correlation ranges from 1 as perfect correlation to -1 as perfect anti-correlation. When C ij = 0, no correlation exists between genes i and j. However, conducting direct study on these empirical cross-correlation coefficients is rather difficult due to the unique properties of microarray experiments. Firstly, the cross-correlation between any pair of genes may not be constant: such co-regulations can fluctuate over time or under different sample conditions. Secondly, the limited number of samples that a microarray is typically conducted upon, may introduce significant “measurement noise” that compromises the accuracy of the underlying correlations. In order to filter out randomness contained within the empirical cross-correlation matrix, we test the eigenstates of this correlation matrix against those of a controlled counterpart, a truly random correlation matrix generated by computer random generator. Statistic properties that conform to the truly random matrix are labeled as noise contributions; on the other hand, any deviating eigenstates are treated as genuine correlations, which will be amplified and analyzed for transcriptional network construction.
We consider the set of eigenvalues that deviate from the eigenvalue range of the random matrix as genuine correlation. The amount of variance contributed by each eigenvector (factor) can be reflected by the proportion of eigenvalue over the sum of all eigenvalues based on principle component analysis (PCA). In other words, principle factors are responsible for the majority of variation within the data. Thus, only large eigenvalues and their corresponding eigenvectors are retained for further treatment and gene group analysis. The rest of the eigenstates contain either insignificant or noisy information.
After acquiring a set of normalized eigenvectors, we transform the eigenvector components to loading factors by taking the multiplication of vector components and the square root of corresponding eigenvalue. Each eigenvector represents one factor leading to one gene cluster. A larger loading factor (we select 0.5 to be the cutoff point) indicates that the corresponding gene “load” more on that eigenvector, or that gene is more expression-dominating for that cluster. To simplify the eigenvector structure and make the interpretation of gene clusters easier and more reliable, we apply orthogonal rotation to the retained eigenvectors. Since the rotation is performed in the subspace of the entire eigenstates, the total variation explained by the newly rotated factors are always less than the original space, but the variation within the subspace remains the same before and after rotation, only the partition of variation changes.
where θ i,j is the rotation angle from old axis i to new axis j.
where x t is the expression signal of a known gene at time t, and is the mean of x t . y k+t is the expression level of a gene at time k+t, and is the mean of y y+t . N denotes the number of time points and k represents the time delay/lag. Likewise, the correlation will also be large if the first expression signal leads the second with the expression waveform shifted to the left of the second.
where X(f) and Y(f) are the discrete Fourier transforms of x t and y t , and the asterisk represents complex conjugation.
In addition, after genes are clustered by time-lagging analysis, people usually conduct upstream sequence analysis to identify consensus regulatory elements for those genes that are controlled by the same transcription factor.
A single workstation is no longer fast and powerful enough to cope with emerging large scale microarray datasets. High performance computing facilities become indispensable tools for biologists to reduce the computing time and improve efficiency. In order to calculate the eigenvalues and eigenvectors for a large correlation matrix, we install ScaLAPACK on our Linux cluster. ScaLAPACK is an acronym for Scalable Linear Algebra Package, and is a library of high-performance linear algebra routines for distributed memory message-passing computers. PDSYEVX routine is used to compute selected eigenvalues and eigenvectors of a real symmetric matrix by calling the recommended sequence of ScaLAPACK routines. We use 2D block-cyclic data distribution for work load balance among all computer nodes to achieve performance and scalability. The size of the subblock dividing the symmetric correlation matrix is chosen to be 64 × 64, and the computer grid configuration is set to be 2×4 for an eight node cluster. With the help of parallel computing packages, we are able to finish some heavy computing tasks within short period of time (one hour), which might take up to days for a single workstation to run.
MZ designs and implements the RMT method on a single workstation. QW implements the parallel version of the RMT method. MZ also analyzes the yeast cycle and human liver cancer microarray datasets. Competing interests
We would like to express our sincere gratitude to Dr. Jizhong Zhou at University of Oklahoma and Dr. Yufeng Yang at Oak Ridge National Laboratory for their insightful suggestions.
This article has been published as part of BMC Genomics Volume 9 Supplement 1, 2008: The 2007 International Conference on Bioinformatics & Computational Biology (BIOCOMP'07). The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2164/9?issue=S1.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.