Biclustering for the comprehensive search of correlated gene expression patterns using clustered seed expansion
Taegyun Yun^{1} and GwanSu Yi^{1, 2}
DOI: 10.1186/1471-2164-14-144
© Yun and Yi; licensee BioMed Central Ltd. 2013
Received: 27 October 2012
Accepted: 21 February 2013
Published: 5 March 2013
Abstract
Background
In functional analysis of gene expression data, biclustering can provide crucial information by revealing correlated gene expression patterns under a subset of conditions. However, conventional biclustering algorithms remain limited in their ability to produce comprehensive and stable outputs.
Results
We propose a novel biclustering approach called “BIclustering by Correlated and Large number of Individual Clustered seeds (BICLIC)” to find comprehensive sets of correlated expression patterns in biclusters using clustered seeds and their expansion based on the correlation of gene expression. BICLIC outperformed competing biclustering algorithms by completely recovering implanted biclusters in simulated datasets with various types of correlated patterns: shifting, scaling, and shifting-scaling. Furthermore, in a real yeast microarray dataset and a lung cancer microarray dataset, BICLIC found more comprehensive sets of biclusters that are significantly enriched for more diverse sets of biological terms than those of other competing biclustering algorithms.
Conclusions
BICLIC provides significant benefits in finding comprehensive sets of correlated patterns and their functional implications from a gene expression dataset.
Background
Genes that share regulatory mechanisms under specific conditions are likely to show similar expression patterns. Identifying those patterns and the corresponding genes is one of the most important steps of microarray analysis for revealing novel gene functions, transcription factor-target relationships, and concerted gene functions in pathogenesis [1–3]. Clustering analysis is commonly performed to identify groups of genes expressed in similar patterns. However, accurate gene expression analysis can be hindered by limitations of clustering analysis. Most clustering algorithms try to find non-overlapping groups of genes that show similar expression patterns under all experimental conditions. In practice, genes tend to be co-regulated, and thus co-expressed, under a subset of experimental conditions rather than under all conditions. Some of the genes in one expression pattern may exhibit a different expression pattern under other conditions because genes can participate in more than one function depending on the specific conditions [4]. To resolve this issue, a biclustering method can replace general clustering methods by providing correlated gene clusters under a subset of conditions in an unsupervised gene expression analysis.
A bicluster can be defined as a submatrix of a whole gene expression data matrix representing a group of genes that show coherent expression patterns under a subset of conditions [5]. Functional analysis of a gene expression dataset requires an exhaustive search for biclusters. However, extracting complete sets of biclusters from a whole microarray data matrix is an NP-hard problem that requires massive computation [6]. To avoid these computational issues, most existing biclustering algorithms use a greedy iterative heuristic approach that locally improves an appropriate scoring function, starting from initial seed biclusters. To find more comprehensive sets of meaningful biclusters with a greedy iterative heuristic approach, it is important to choose the initial seed biclusters and the scoring function properly.
The output of conventional biclustering methods lacks stability. Because common biclustering methods depend on random starting seeds, the number and the contents of the resulting biclusters change from run to run, even when the algorithm is applied to the same microarray dataset. Moreover, random starting seeds guarantee neither a diverse search nor coherent biclustering results. In conventional biclustering methods, however, random starting seeds have been an unavoidable compromise to limit computational complexity, and a few studies have tried to overcome this limitation. Erten and Sözdinler [7] proposed a localization method that reorders the rows and columns of the initial data matrix so that similar patterns appear in nearby locations. Although this method alleviates part of the random seed issue by raising the chance of extracting biclusters with similar patterns in a localized matrix, it does not solve the comprehensiveness issue of random seeds.
The choice of the bicluster scoring function is also important for biclustering performance. The mean squared residue, which measures the variability of a bicluster based on the arithmetic means of gene expression, was the first scoring function used to find biclusters [8], and it was subsequently used in several other biclustering methods [9–11]. The mean squared residue is a fundamental measure for finding similar expression values; however, it is not adequate for finding scaling patterns in biclusters, as proved by Aguilar-Ruiz [12]. The inability to find scaling patterns can be a major drawback in biclustering analysis because groups of genes that show similar expression patterns at different scales are also meaningful correlated gene clusters that we aim to find.
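The contrast between shifting and scaling patterns can be illustrated with a small numerical sketch. The code below computes the mean squared residue in the form introduced by Cheng and Church [8], where the residue of a cell is its value minus its row mean, minus its column mean, plus the overall mean. The base profile and the shift and scale factors are made-up illustrative values, not data from this study.

```python
def mean_squared_residue(matrix):
    """Mean squared residue of a bicluster given as a list of rows."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    row_means = [sum(row) / n_cols for row in matrix]
    col_means = [sum(row[j] for row in matrix) / n_rows for j in range(n_cols)]
    overall = sum(row_means) / n_rows
    total = 0.0
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            residue = value - row_means[i] - col_means[j] + overall
            total += residue ** 2
    return total / (n_rows * n_cols)


# Hypothetical base expression profile over four conditions.
base = [1.0, 4.0, 2.0, 8.0]

# Shifting pattern: each gene is the base profile plus a gene-specific offset.
shifting = [[v + shift for v in base] for shift in (0.0, 3.0, -2.0)]

# Scaling pattern: each gene is the base profile times a gene-specific factor.
scaling = [[scale * v for v in base] for scale in (1.0, 2.0, 0.5)]

msr_shifting = mean_squared_residue(shifting)
msr_scaling = mean_squared_residue(scaling)
print("MSR of shifting bicluster:", msr_shifting)
print("MSR of scaling bicluster:", msr_scaling)
```

The shifting bicluster scores essentially zero, whereas the perfectly scaled bicluster receives a clearly positive residue even though its rows are exact multiples of each other, which is the behaviour described above.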
From this point of view, the correlation coefficient can be an alternative scoring function to the mean squared residue. With this measure, correlated expression patterns, including both shifting and scaling patterns, can be detected, which is more relevant to the purpose of biclustering: finding co-expressed gene clusters under the same biological regulation. Allocco et al. [13] showed that if the correlation coefficient of two genes is greater than 0.84, there is more than a 50% probability that the genes are regulated by a common transcription factor. Bhattacharya and De [14] showed that the correlation coefficient-based biclustering method, Bi-Correlation Clustering Algorithm (BCCA), can find a greater number of common transcription factors and significantly enriched biological function terms than other non-correlation-based biclustering methods.
Several correlation-based biclustering approaches have recently been proposed [15–19]. BCCA is a Pearson correlation coefficient-based biclustering method that finds groups of genes showing a correlated expression pattern across a subset of microarray conditions. BCCA begins with pairs of genes and, for each selected pair, eliminates uncorrelated conditions by backward elimination to find correlated sets of biclusters. Theoretically, BCCA can find a large number of biclusters because it searches from all pairs of correlated genes. In practice, however, BCCA is unable to extract comprehensive sets of biclusters because the backward elimination approach limits the search space. Bozdağ et al. [15] proposed the Correlated Pattern Biclusters (CPB) algorithm, which discovers biclusters by setting reference genes with randomly selected columns, then adding rows with high correlation and determining columns that have a smaller root mean squared error. In this case, the search space is again restricted by the randomly selected seed columns. Ayadi et al. [19] proposed the Pattern-Driven Neighborhood Search (PDNS) algorithm for finding correlated expression patterns of biclusters based on Spearman’s rank correlation. It converts the original numerical matrix into a discretized matrix with values −1, 0, or 1 that represent the trajectory patterns of genes. Starting from an initial solution of biclusters on the discretized matrix, the algorithm locally improves the solution by descent search and perturbation. Because the PDNS algorithm requires initial solutions of biclusters from random selection or from other fast greedy algorithms, such as the Cheng and Church algorithm, its biclustering results can vary with the selection of initial biclusters.
The Qualitative Biclustering algorithm (QUBIC) is a recently proposed gene-wise discretization-based biclustering algorithm that solves the general form of the biclustering problem efficiently, including constant, shifting, and scaling patterns [20]. QUBIC converts a microarray data matrix into a simplified integer matrix called a representing matrix, from which it finds biclusters. Therefore, QUBIC may not identify subtle changes in expression patterns. In addition, the search space in QUBIC is limited by the discretization process.
In this paper, we propose a novel biclustering algorithm called BIclustering by Correlated and Large number of Individual Clustered seeds (BICLIC), which aims to find comprehensive sets of biclusters with correlated gene expression patterns. The primary process of BICLIC is not conducted with random seed biclusters, but with a full search for correlated seed biclusters that are determined by individual dimension-based clustering. The comprehensive sets of correlated seed biclusters are then expanded to larger biclusters using a greedy iterative heuristic approach with the Pearson correlation coefficient as the scoring function. As a result, BICLIC can find comprehensive biclusters accurately and also provides stable output over multiple runs.
We demonstrate that our proposed BICLIC method outperforms other conventional biclustering methods in finding correlated gene expression patterns both in simulated data sets and in real microarray datasets.
Results and discussion
The proposed BICLIC algorithm is implemented in the R language. The R code of the BICLIC algorithm is freely available from http://bisyn.kaist.ac.kr/software/biclic.htm.
In this section, the performance of our biclustering algorithm is compared with that of three well-known existing biclustering algorithms: BCCA, CPB, and QUBIC. The BCCA, CPB, and QUBIC programs were obtained from the sources cited in their respective papers. The performance comparison is divided into two parts. In the first part, simulated datasets are used to test the accuracy and the coverage of each biclustering algorithm in identifying implanted biclusters that have various correlated patterns. In the second part, real microarray datasets are used to show that BICLIC can extract more diverse sets of correlation-based biclusters than the compared methods, and that the biclusters extracted by BICLIC are significantly enriched in biological terms, such as the gene ontology (GO) functional category [21] and the KEGG pathway [22].
Simulated datasets
Bozdağ proved that the Pearson correlation coefficient equals 1 for perfect shifting, scaling, and shifting-scaling patterns [23]. Therefore, any of these correlated patterns can be extracted by BICLIC, which uses the Pearson correlation coefficient as its scoring function. BICLIC considers positively correlated patterns when it generates biclusters because it collects genes that are positively correlated with the seed bicluster. However, negatively correlated patterns can also be discovered by comparing positively correlated biclusters with each other when negatively correlated biclusters exist.
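The property used here can be checked numerically. The following minimal sketch computes the Pearson correlation coefficient between a base profile and its shifted, scaled, and shifted-and-scaled copies; the profile and the factors 2.5 and 3.0 are arbitrary illustrative values.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


base = [1.0, 4.0, 2.0, 8.0, 5.0]            # illustrative expression profile
shifted = [v + 3.0 for v in base]           # shifting pattern
scaled = [2.5 * v for v in base]            # scaling pattern (positive factor)
shift_scaled = [2.5 * v + 3.0 for v in base]  # shifting-scaling pattern
mirrored = [-2.5 * v + 3.0 for v in base]   # negative scaling factor

r_shifted = pearson(base, shifted)
r_scaled = pearson(base, scaled)
r_shift_scaled = pearson(base, shift_scaled)
r_mirrored = pearson(base, mirrored)
print(r_shifted, r_scaled, r_shift_scaled, r_mirrored)
```

The first three coefficients equal 1 up to floating-point rounding, while a negative scaling factor yields a coefficient of -1, which corresponds to the negatively correlated case mentioned above.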
To simulate each correlated pattern, a 1000 X 100 data matrix is generated with random values drawn from a normal distribution with mean 0 and standard deviation 1. For each type of correlated pattern, 10 data matrices are generated, resulting in a total of 30 data matrices. In each data matrix, 10 non-overlapping biclusters of size 100 X 10 are implanted. Shifting, scaling, and shifting-scaling patterns of biclusters are generated from equations 4, 5, and 6, respectively. Shifting and scaling factors are randomly generated from a normal distribution with mean 0 and standard deviation 1. To generate positively correlated patterns, the randomly generated scaling factors are replaced by their absolute values.
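A data generator of this kind might be sketched as follows. The exact forms of equations 4 to 6 are not reproduced in this text, so the sketch assumes the usual forms: base + shift for shifting, scale × base for scaling, and scale × base + shift for shifting-scaling. The function name `simulate_matrix`, the diagonal placement of the implanted blocks, and the `seed` argument are illustrative choices, not part of the published procedure.

```python
import random


def simulate_matrix(pattern, n_rows=1000, n_cols=100, n_bic=10,
                    bic_rows=100, bic_cols=10, seed=0):
    """Return (data, truth): a noise matrix with implanted biclusters and their positions.

    Assumes n_bic * bic_rows <= n_rows and n_bic * bic_cols <= n_cols, because the
    biclusters are placed as non-overlapping blocks along the diagonal.
    """
    if pattern not in ("shifting", "scaling", "shifting-scaling"):
        raise ValueError("unknown pattern: " + pattern)
    rng = random.Random(seed)
    # Background values drawn from N(0, 1).
    data = [[rng.gauss(0.0, 1.0) for _ in range(n_cols)] for _ in range(n_rows)]
    truth = []
    for k in range(n_bic):
        rows = list(range(k * bic_rows, (k + 1) * bic_rows))
        cols = list(range(k * bic_cols, (k + 1) * bic_cols))
        base = [rng.gauss(0.0, 1.0) for _ in cols]  # shared base profile
        for i in rows:
            # Absolute value keeps scaling factors positive (positively correlated rows).
            scale = abs(rng.gauss(0.0, 1.0)) if pattern != "shifting" else 1.0
            shift = rng.gauss(0.0, 1.0) if pattern != "scaling" else 0.0
            for offset, j in enumerate(cols):
                data[i][j] = scale * base[offset] + shift
        truth.append((rows, cols))
    return data, truth


data, truth = simulate_matrix("shifting-scaling")
print(len(data), len(data[0]), len(truth))  # prints: 1000 100 10
```

Fixing the seed makes each simulated matrix reproducible, which is convenient when several algorithms are compared on the same input.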
In addition, simulated datasets with implanted biclusters of different column sizes are generated to study the effect of column size on the performance of the biclustering algorithms. The size of the whole data matrix is 1000 X 100, the same as that of the previous simulated dataset. The number of rows of a bicluster is fixed at 100, but the number of columns varies from 20 to 100. Five different-sized biclusters are implanted in each 1000 X 100 data matrix. These simulated datasets are also generated for the three kinds of correlated patterns: shifting, scaling, and shifting-scaling.
Comparison of average recovery scores for simulated datasets with various correlated patterns
Algorithm  Shifting  Scaling  Shifting-scaling
BICLIC     1         1        1
BCCA       0.141     0.181    0.168
CPB        1         0.996    0.915
QUBIC      0.431     0.169    0.466
Comparison of average relevance scores for simulated datasets with various correlated patterns
Algorithm  Shifting  Scaling  Shifting-scaling
BICLIC     1         1        1
BCCA       0.060     0.109    0.094
CPB        0.143     0.297    0.258
QUBIC      0.038     0.043    0.107
Experimental dataset
To investigate the usefulness of BICLIC in searching for comprehensive sets of correlation-based biclusters, a yeast Saccharomyces cerevisiae dataset [25] and a lung cancer dataset [26] were analyzed. The yeast dataset records yeast gene expression under different stress conditions and consists of 2993 genes and 173 conditions. The lung cancer dataset contains 12,625 genes and 56 samples. The 56 samples consist of 20 pulmonary carcinoid samples, 13 colon cancer metastasis samples, 17 normal lung samples, and 6 small cell lung carcinoma samples. Since BICLIC is able to extract biclusters from the numeric values of a microarray data matrix, no preprocessing step such as discretization or taking logarithms is necessary for BICLIC analysis. BCCA and CPB also do not require a preprocessing step, but the QUBIC algorithm includes a discretization step. Correlation thresholds for BICLIC, BCCA, and CPB were set to 0.9 to search for biclusters with highly correlated expression. The minimum number of rows, mnr, and the minimum number of columns, mnc, of BICLIC were both set to five in order to filter out particularly small biclusters. For CPB, we set the maximum overlap level, MO, to 1 and the maximum number of biclusters, NB, to 10,000 to extract comprehensive sets of biclusters. In addition, we did not specify the reference rows of CPB, so that biclusters related to diverse sets of individual genes could be extracted. We varied the parameters of QUBIC to report more comprehensive sets of biclusters. Although the default value of parameter o in QUBIC is 100, which restricts the number of biclusters, it was set to 10,000 to report the maximum number of biclusters. Duplicated biclusters were removed from the results of each biclustering algorithm.
Summary statistics of biclustering algorithms for the yeast stress dataset
Method  Count    Average I x J  Gene cov.  Condition cov.  Cell cov.
BICLIC  14791    2249.3         1          1               0.999
        (11172)  (7.2)          (0.905)    (1)             (0.109)
BCCA    8163     2936.8         0.776      1               0.317
CPB     3634     8413.6         0.512      1               0.185
QUBIC   2146     847.4          0.884      0.746           0.112
Summary statistics of biclustering algorithms for the lung cancer dataset
Method  Count    Average I x J  Gene cov.  Condition cov.  Cell cov.
BICLIC  6019     2302.8         1          1               0.999
        (3734)   (4.2)          (0.389)    (1)             (0.021)
CPB     386      4594.8         0.672      1               0.344
QUBIC   1355     68.2           0.543      1               0.048
Function enrichment evaluation
BICLIC found the largest number of significantly enriched functional terms compared to BCCA, CPB, and QUBIC in GO BP, GO CC, GO MF, and KEGG. Compared to QUBIC, BCCA and CPB found fewer unique functional terms, even though BCCA and CPB found more and larger biclusters. This means that many genes and conditions overlap heavily among the biclusters found by BCCA and CPB. Furthermore, the functionally enriched terms are also highly redundant in BCCA and CPB. In contrast, BICLIC found comprehensive sets of biclusters and obtained a large number of significant results from the functional enrichment process with GO BP, GO CC, GO MF, and KEGG.
Conclusions
In this paper, we proposed a novel biclustering method, BICLIC, to search for comprehensive sets of correlation-based biclusters. Our algorithm conducts individual dimension-based clustering for efficient determination of comprehensive sets of correlated seed biclusters, which are further expanded to larger correlation-based biclusters. Simulated and real microarray datasets were used to perform several experiments, and the results were compared to those obtained using BCCA, CPB, and QUBIC. The experiments showed that BICLIC could find implanted correlated biclusters accurately, while competing methods such as BCCA and QUBIC performed poorly. In addition, BICLIC was able to extract more comprehensive sets of biclusters than the other biclustering algorithms. Although CPB performed well on the simulated datasets, it performed poorly on the real microarray datasets. Finally, the biclusters found by BICLIC were enriched for more diverse biological terms in GO and KEGG.
Methods
Definitions
Definition 1
An input microarray matrix, E(G,C), is defined as an n X m matrix of real numbers, where G = {g_{1}, g_{2}, …, g_{i}, …, g_{n-1}, g_{n}} is a set of genes and C = {c_{1}, c_{2}, …, c_{j}, …, c_{m-1}, c_{m}} is a set of conditions.
Definition 2
A seed bicluster, SB(G’, C’), is a small bicluster that is a candidate for being expanded to a larger bicluster, with G’ ⊆ G and C’ ⊆ C. Sets of genes in each condition have the same cluster index, which is generated from individual dimension-based clustering for each condition. In other words, the gene expression values of genes in the same condition in a seed bicluster are very close to each other. Genes across a set of conditions in a seed bicluster show a correlated expression pattern. Therefore, each seed bicluster has two characteristics: an identical or very similar gene expression value in each condition, and a highly correlated gene expression pattern across conditions.
Definition 3
An expanded bicluster, BC, is a bicluster grown from a seed bicluster to include more genes and conditions while maintaining an average Pearson correlation coefficient above a correlation threshold θ. Seed biclusters can be expanded in two directions: gene-wise and condition-wise.
Finding comprehensive seed biclusters
The generation of seed biclusters is illustrated in Algorithm 1. In this phase, comprehensive sets of initial biclusters are found; they are expanded in a later phase. This phase consists of two steps: individual dimension-based clustering and seed bicluster determination. An n X m microarray matrix can be decomposed into m separate n X 1 vectors. Individual dimension-based clustering is employed to collect genes with similar expression levels in each decomposed vector. The approach is similar to that used in the Clustering analysis of Large microarray datasets with Individual dimension-based Clustering (CLIC) algorithm [29], which uses individual dimension-based clustering to cluster large microarray datasets efficiently. In this paper, individual dimension-based clustering is conducted on the n genes of each dimension to group very similarly expressed genes into the same cluster within that dimension. That is, the thousands of genes in each condition are clustered into a large number of small clusters that contain highly similarly expressed genes.
Although conventional clustering algorithms such as K-means and hierarchical clustering could be used for this step, an individual dimension-based clustering method is more efficient. K-means clustering requires additional steps to determine the appropriate number of clusters in each dimension, and hierarchical clustering needs to calculate the distances between all pairs of genes. Therefore, we used the following individual dimension-based clustering approach to efficiently cluster genes with very similar expression in each dimension. The threshold values that determine whether a gene is added to a cluster in a given condition are the standard deviation of all gene expression values in that condition and a cumulative standard deviation of the gene expression values in the cluster (Steps 1C and 1E-d of Algorithm 1).
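To make the idea concrete, the sketch below clusters the genes of a single condition by sorting them and cutting the sorted list wherever a cluster would become too spread out. It is a deliberately simplified variant, not the procedure of Algorithm 1: the two-part stopping test is replaced by a single assumed rule (the cluster's standard deviation must stay at or below a fraction of the standard deviation of the whole condition), and the function name `cluster_one_condition` and the parameter `sd_fraction` are hypothetical.

```python
import statistics


def cluster_one_condition(values, sd_fraction=0.1):
    """Return one cluster index per gene (in the original gene order) for one condition."""
    sd_all = statistics.pstdev(values)      # spread of all genes in this condition
    limit = sd_fraction * sd_all            # assumed single threshold (simplification)
    order = sorted(range(len(values)), key=lambda g: values[g])
    labels = [0] * len(values)
    cluster_id = 1
    members = []                            # expression values in the current cluster
    for g in order:
        candidate = members + [values[g]]
        # Start a new cluster when adding this gene would spread the cluster too much.
        if members and statistics.pstdev(candidate) > limit:
            cluster_id += 1
            candidate = [values[g]]
        members = candidate
        labels[g] = cluster_id
    return labels


expression = [0.10, 2.00, 0.12, 2.05, 0.11, 5.00]   # one condition, six genes
labels = cluster_one_condition(expression)
print(labels)  # prints: [1, 2, 1, 2, 1, 3]
```

Because the procedure is a deterministic pass over sorted values, repeated runs on the same condition give identical cluster indexes. Recomputing the standard deviation at every step keeps the sketch short but is quadratic in the worst case; a running-variance update would be preferable for thousands of genes.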
After individual dimension-based clustering, the genes that have similar expression values in each individual condition are labeled with the same cluster index. The m cluster index vectors are recombined into the original n X m matrix form, and comprehensive seed biclusters are determined from this cluster index matrix. The total number of clusters found by individual dimension-based clustering, summed over all conditions, gives the number of candidate seed biclusters. The number of discovered seed biclusters is sufficient because the genes with similar expression in each condition are divided very finely, which yields a large number of biclusters. Genes in a candidate seed bicluster are labeled with the same cluster index in one condition. In another condition, these genes may be labeled with either different cluster indexes or the same cluster index. If they are labeled with the same cluster index in another condition, their expression levels are similar not only in the original condition but also in that other condition. In other words, the genes show correlated expression patterns over these conditions. Non-duplicated sets of diverse seed biclusters are determined in this phase. These seed biclusters are more correlated than randomly extracted seed biclusters. Moreover, the same seed biclusters are obtained in multiple executions of the algorithm.
Algorithm 1 Seed Bicluster Extraction Algorithm
Input: E: n X m gene expression matrix
Output: SB: List of seed biclusters
1. Individual dimension-based clustering
   For each of the m individual conditions, do:
   A. Align gene set G = {g_{1}, g_{2}, …, g_{n}} to G’ = {g_{1}’, g_{2}’, …, g_{n}’} in increasing order of gene expression value, where g_{1}’ ≤ g_{2}’ ≤ … ≤ g_{n-1}’ ≤ g_{n}’.
   B. Initially, set gene index i = 1 and set cluster index KI to 1.
   C. Measure the standard deviation of all genes in this condition and set it as sd_all.
   D. Let K_{KI} be the set of cluster member genes when the cluster index is KI, and set K_{KI} = NULL.
      a. Set the cumulative number of genes in the cluster set, cum = 0.
      b. Set K_{KI} = K_{KI} ∪ {g_{i}’}.
      c. Assign cluster index KI to the cluster member gene.
   E. If cluster K_{KI} != NULL, then
      a. Set cum = cum + 1.
      b. Set i = i + 1.
      c. Set K_{KI} = K_{KI} ∪ {g_{i}’}.
      d. Measure the standard deviation of K_{KI} when the number of member genes in the cluster set is cum, sd(K_{KI, cum}).
      e. While sd(K_{KI, cum}) ≤ sd(K_{KI, cum-1}) and sd(K_{KI, cum}) ≤ sd_all, do:
         i. Set i = i + 1.
         ii. Set K_{KI} = K_{KI} ∪ {g_{i}’}.
         iii. Assign cluster index KI to the cluster member genes.
      f. If sd(K_{KI, cum}) > sd(K_{KI, cum-1}) or sd(K_{KI, cum}) > sd_all:
         i. Set KI = KI + 1.
         ii. Set K_{KI} = K_{KI} ∪ {g_{i}’}.
         iii. Assign cluster index KI to the cluster member genes.
         iv. Set cum = 0.
   F. Repeat Steps 1D to 1E until i == n.
   G. Align the cluster-indexed genes, i.e. {1, 1, 2, 2, …, KI - 2, KI - 1, KI}, to the original order as in G = {g_{1}, g_{2}, …, g_{n}}.
   H. Combine the m cluster index vectors into the original n X m matrix form.
2. Seed bicluster determination
   For each of the m individual conditions, do:
   A. Initially, set seed bicluster set S = NULL.
   B. For s = 1 to KI in each condition, do:
      a. Let g(K_{s}) be the rows of genes whose cluster index KI is s in this condition.
      b. Set seed cluster condition set, CS = NULL.
      c. For j = 1 to m conditions, do:
         i. Let g(K_{s, j}) be the collection of genes in the rows g(K_{s}) and in the jth column.
         ii. If all genes in g(K_{s, j}) have the same cluster index, then set CS = CS ∪ {c_{j}}.
         iii. If the number of elements in CS ≥ 2:
              Set seed bicluster, sb, to consist of g(K_{s}) and CS.
              Add seed bicluster sb to the seed bicluster list, SB.
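The seed determination step can be sketched on a small cluster index matrix as follows. The sketch follows the description above: for every cluster of every condition, it collects the conditions in which all member genes share one cluster index, and keeps the result when at least two conditions qualify. The `min_genes` parameter, which skips single-gene clusters, is an added assumption, and the index matrix is invented for illustration.

```python
def find_seed_biclusters(index_matrix, min_genes=2):
    """Return sorted, de-duplicated seed biclusters as (genes, conditions) tuples."""
    n_genes, n_conds = len(index_matrix), len(index_matrix[0])
    seeds = set()
    for c in range(n_conds):
        # Group genes by their cluster index in condition c.
        groups = {}
        for g in range(n_genes):
            groups.setdefault(index_matrix[g][c], []).append(g)
        for genes in groups.values():
            if len(genes) < min_genes:      # assumption: ignore single-gene clusters
                continue
            # Conditions in which all of these genes carry one common cluster index.
            conds = tuple(j for j in range(n_conds)
                          if len({index_matrix[g][j] for g in genes}) == 1)
            if len(conds) >= 2:
                seeds.add((tuple(genes), conds))
    return sorted(seeds)


# Rows are genes, columns are conditions, entries are per-condition cluster indexes.
index_matrix = [
    [1, 1, 2, 1],
    [1, 1, 2, 3],
    [2, 3, 1, 3],
    [2, 3, 1, 2],
    [1, 2, 2, 2],
]
seeds = find_seed_biclusters(index_matrix)
for genes, conds in seeds:
    print("genes", genes, "conditions", conds)
```

Storing the seeds in a set removes the duplicates that arise when the same gene group is reached from different conditions, and sorting the result makes the output order reproducible.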
Expanding seed biclusters
In this phase, the previously determined comprehensive sets of seed biclusters are expanded to larger biclusters with correlated patterns. The Pearson correlation coefficient is used as the scoring function to measure the correlation between pairs of genes over subsets of conditions while seed biclusters are expanded, so that similarity is maintained above a correlation threshold. BICLIC uses a heuristic approach to expand seed biclusters efficiently by merging genes or conditions one at a time, from the most similar to the least similar to the seed bicluster. Each seed bicluster is expanded in two ways, gene-wise and condition-wise, while the average Pearson correlation coefficient of pairs of genes over the conditions in each expanded bicluster is kept above the correlation threshold. The computation required by this heuristic approach is considerably less than that required by an exhaustive search of all possible combinations of genes and conditions. Although the expanded biclusters of the proposed heuristic approach may be less comprehensive than those of an iterative approach, this disadvantage is alleviated by the existence of comprehensive sets of correlated seed biclusters.
In gene-wise expansion, the number of conditions in a seed bicluster must be equal to or greater than 3. Otherwise, the average Pearson correlation coefficient of gene-wise expanded biclusters will be +1, −1, or non-computable. For each seed bicluster, the Pearson correlation coefficient between the seed bicluster and each gene vector is calculated to find the candidate set of correlated genes for expansion. Genes are then merged into the seed bicluster in decreasing order of their correlation coefficient with the seed bicluster, so that similar genes are added efficiently, for as long as the average Pearson correlation coefficient of the gene-wise expanded bicluster does not fall below the correlation threshold value, θ. This gene expansion approach also leads to stable expansion results because the order in which genes are added is fixed when the Pearson correlation coefficient between the seed bicluster and each gene vector is calculated. The Pearson correlation coefficient between a seed bicluster and a gene vector is calculated using equation 1.
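A minimal sketch of gene-wise expansion is given below. Because equation 1 is not reproduced in this text, the sketch assumes that the correlation between a seed bicluster and a gene vector is measured against the seed's mean expression profile over the seed conditions; the function names and the example matrix are likewise illustrative.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation; returns 0.0 for constant vectors (an assumed convention)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return cov / math.sqrt(var_x * var_y)


def average_pairwise_correlation(profiles):
    pairs = list(combinations(profiles, 2))
    return sum(pearson(a, b) for a, b in pairs) / len(pairs)


def expand_genewise(data, seed_genes, seed_conds, theta=0.9):
    """Add genes in decreasing order of correlation while the average stays >= theta."""
    if len(seed_conds) < 3:
        raise ValueError("gene-wise expansion needs at least 3 seed conditions")

    def profile(g):
        return [data[g][c] for c in seed_conds]

    # Assumed stand-in for equation 1: correlate each gene with the seed's mean profile.
    seed_mean = [sum(data[g][c] for g in seed_genes) / len(seed_genes)
                 for c in seed_conds]
    candidates = [g for g in range(len(data)) if g not in seed_genes]
    candidates.sort(key=lambda g: pearson(profile(g), seed_mean), reverse=True)
    genes = list(seed_genes)
    for g in candidates:
        trial = genes + [g]
        if average_pairwise_correlation([profile(x) for x in trial]) < theta:
            break                           # next gene would push the average below theta
        genes = trial
    return genes


data = [
    [1.0, 2.0, 3.0, 9.0],   # gene 0 (seed)
    [2.0, 4.0, 6.0, 1.0],   # gene 1 (seed, scaled copy over conditions 0-2)
    [3.0, 4.0, 5.0, 7.0],   # gene 2 (shifted copy over conditions 0-2)
    [3.0, 1.0, 2.0, 5.0],   # gene 3 (weakly anti-correlated)
    [9.0, 6.0, 3.0, 2.0],   # gene 4 (anti-correlated)
]
expanded = expand_genewise(data, seed_genes=[0, 1], seed_conds=[0, 1, 2])
print(expanded)  # prints: [0, 1, 2]
```

The candidate order is computed once, before any gene is merged, which is what makes the expansion deterministic for a given seed.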
If cor_{j} is greater than the overall correlation threshold θ, then all of the genes in the bicluster are still highly correlated after condition j is added. Conditions are merged into a seed bicluster in decreasing order of the Pearson correlation coefficient of the expanded bicluster, so that similar conditions are added efficiently, for as long as the correlation coefficient of the condition-wise expanded bicluster does not fall below the correlation threshold, θ. This condition-wise expansion approach also leads to stable expansion results, because the order in which conditions are added is determined by the value of the Pearson correlation coefficient.
Expanding a seed bicluster in the gene-wise and condition-wise directions yields a vertically long matrix and a horizontally long matrix, respectively. These two matrices can be combined to form a larger matrix that has the rows of the gene-wise expanded bicluster and the columns of the condition-wise expanded bicluster. This combined matrix is theoretically the largest matrix to which a seed bicluster can be expanded. The correlation coefficient of this matrix is typically less than the correlation threshold θ because not all genes are correlated under the full set of conditions in the combined matrix. By filtering uncorrelated genes and conditions out of this combined matrix, a large bicluster with a correlated pattern can be acquired. The gene-wise and condition-wise expanded biclusters are also candidate correlation-based biclusters found by the BICLIC algorithm.
Filtering less correlated genes and conditions
In each iteration, the set of less correlated genes and the least correlated condition are identified from a candidate bicluster matrix. First, the least correlated condition is tentatively eliminated, and the resulting increase in the average Pearson correlation coefficient (APCC) of the remaining matrix is measured. This result is set aside, the set of less correlated genes is tentatively eliminated from the original matrix instead, and the increase in the APCC is measured again. The two increases are then compared, and the elimination that yields the higher increase is made permanent. For instance, when eliminating the least correlated condition increases the APCC more than eliminating the set of less correlated genes, the condition is eliminated and the genes are retained, and vice versa. The number of conditions represents the length of a correlated expression pattern. Therefore, a single least correlated condition is compared against a set of less correlated genes in order to extract a large correlated expression pattern. The removal of either the less correlated genes or the least correlated condition is repeated until the average correlation coefficient of the matrix is equal to or greater than the correlation threshold, at which point a correlation-based bicluster matrix is acquired.
Algorithm 2 Filtering less correlated genes and conditions Algorithm
Input: CB: n’ X m’ candidate bicluster matrix, the correlation threshold value, θ, the minimum number of rows, mnr, and the minimum number of columns, mnc.
Output: BM: Bicluster matrix with correlated pattern
1. Calculating average Pearson correlation coefficient of candidate bicluster matrix
   A. Calculate the average Pearson correlation coefficient of all genes in the candidate bicluster matrix, θ_{CB}.
   B. If θ_{CB} ≥ θ, stop and report CB as BM.
   C. If θ_{CB} < θ, continue.
2. Calculating average Pearson correlation coefficient after eliminating less correlated sets of genes
   A. Initially, set less correlated gene set LG = NULL.
   B. For i = 1 to n’, do
      a. Calculate the average Pearson correlation coefficient after eliminating gene g_{i}, θ_{CB, g}.
      b. If θ_{CB, g} > θ_{CB}, then set LG = LG ∪ {g_{i}}.
   C. Calculate the average Pearson correlation coefficient after eliminating the less correlated gene set LG from CB, θ_{CB, lg}.
3. Calculating average Pearson correlation coefficient after eliminating the least correlated condition
   A. Initially, set less correlated condition set LC = NULL and corresponding correlation coefficient set CC = NULL.
   B. For j = 1 to m’, do
      a. Calculate the average Pearson correlation coefficient after eliminating condition c_{j}, θ_{CB, cj}.
      b. If θ_{CB, cj} > θ_{CB}, then set LC = LC ∪ {c_{j}} and CC = CC ∪ {θ_{CB, cj}}.
   C. Select the maximum of CC, max(CC), and the corresponding condition c_{j, max}.
4. Comparing average Pearson correlation coefficient increase between eliminating set of genes and condition
   A. If θ_{CB, lg} > max(CC), permanently eliminate the less correlated gene set LG from CB: CB = CB - LG.
   B. If θ_{CB, lg} < max(CC), permanently eliminate the least correlated condition c_{j, max} from CB: CB = CB - c_{j, max}.
5. Repeat steps 1 to 4 until θ_{CB} ≥ θ.
6. If θ_{CB} ≥ θ and the number of genes in CB ≥ mnr and the number of conditions in CB ≥ mnc, report CB as bicluster matrix BM.
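The filtering loop of Algorithm 2 might be sketched as follows. The sketch makes several assumptions that the algorithm leaves open: ties between the gene score and the condition score are resolved by removing the condition, a candidate that cannot be improved or that shrinks below mnr or mnc is discarded by returning None, and constant vectors are given a correlation of 0. The default values of mnr and mnc and the example matrix are illustrative only.

```python
import math
from itertools import combinations


def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0                          # assumed convention for constant vectors
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return cov / math.sqrt(var_x * var_y)


def apcc(matrix, rows, cols):
    """Average Pearson correlation over all gene pairs, restricted to cols."""
    pairs = list(combinations(rows, 2))
    return sum(pearson([matrix[a][c] for c in cols], [matrix[b][c] for c in cols])
               for a, b in pairs) / len(pairs)


def filter_bicluster(matrix, theta=0.9, mnr=2, mnc=3):
    """Return (rows, cols) of the filtered bicluster, or None if filtering fails."""
    rows, cols = list(range(len(matrix))), list(range(len(matrix[0])))
    while len(rows) >= max(mnr, 2) and len(cols) >= max(mnc, 2):
        current = apcc(matrix, rows, cols)                      # step 1
        if current >= theta:
            return rows, cols
        low_genes = []                                          # step 2
        if len(rows) > 2:
            low_genes = [g for g in rows
                         if apcc(matrix, [r for r in rows if r != g], cols) > current]
        kept = [r for r in rows if r not in low_genes]
        gene_score = (apcc(matrix, kept, cols)
                      if low_genes and len(kept) >= 2 else -math.inf)
        cond_score, worst_cond = -math.inf, None                # step 3
        if len(cols) > 2:
            for c in cols:
                score = apcc(matrix, rows, [x for x in cols if x != c])
                if score > current and score > cond_score:
                    cond_score, worst_cond = score, c
        if worst_cond is None and gene_score == -math.inf:
            return None                     # nothing improves the APCC
        if gene_score > cond_score:                             # step 4
            rows = kept
        else:                               # ties remove the condition (assumption)
            cols = [x for x in cols if x != worst_cond]
    return None


candidate = [
    [1.0, 2.0, 3.0, 4.0, 0.0],
    [2.0, 4.0, 6.0, 8.0, 20.0],
    [0.0, 1.0, 2.0, 3.0, -5.0],
]
result = filter_bicluster(candidate)
print(result)  # prints: ([0, 1, 2], [0, 1, 2, 3])
```

In this example the three genes are perfectly correlated over the first four conditions, so dropping the fifth condition raises the APCC more than dropping any gene, and the loop stops after one elimination.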
Checking and removing duplicated biclusters
After all seed biclusters are expanded, different seed biclusters may have been expanded to the same bicluster. Some biclusters may also be contained in other biclusters. Therefore, it is necessary to check for duplicated biclusters. All biclusters are sorted in increasing order of bicluster size. Proceeding from the smallest bicluster to the largest, the composition of genes and conditions of each bicluster is compared with that of the same-size or larger biclusters. If every gene and every condition of a bicluster is included in another bicluster, that bicluster is removed. After duplicated biclusters are removed, the remaining biclusters have unique compositions of genes and conditions.
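The duplicate check can be sketched with set containment. In the following illustration each bicluster is a pair of gene and condition lists, size is taken as the number of genes times the number of conditions, and a bicluster is dropped when a same-size or larger bicluster contains all of its genes and conditions. The function name and the example biclusters are hypothetical.

```python
def remove_contained_biclusters(biclusters):
    """Drop every bicluster whose genes and conditions are all contained in another one."""
    items = [(frozenset(genes), frozenset(conds)) for genes, conds in biclusters]
    # Indexes sorted by increasing bicluster size (genes x conditions).
    order = sorted(range(len(items)),
                   key=lambda k: len(items[k][0]) * len(items[k][1]))
    kept = []
    for pos, k in enumerate(order):
        genes, conds = items[k]
        # Compare only with same-size or larger biclusters later in the ordering.
        contained = any(genes <= items[other][0] and conds <= items[other][1]
                        for other in order[pos + 1:])
        if not contained:
            kept.append(k)
    return [biclusters[k] for k in sorted(kept)]


biclusters = [
    ([0, 1, 2], [0, 1]),
    ([0, 1], [0, 1]),        # contained in the first bicluster
    ([0, 1, 2], [0, 1]),     # exact duplicate of the first bicluster
    ([3, 4], [2, 3, 4]),
]
unique = remove_contained_biclusters(biclusters)
print(unique)  # prints: [([0, 1, 2], [0, 1]), ([3, 4], [2, 3, 4])]
```

Comparing each bicluster only against those that follow it in the size ordering guarantees that exactly one copy of an exact duplicate survives.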
Declarations
Acknowledgements
This research was supported by the National Research Foundation of Korea (NRF) grant No. 2012–0001001, the Converging Research Center Program grant No. 2012 K001442, and the KAIST Future Systems Healthcare Project funded by the Ministry of Education, Science and Technology (MEST) of the Korean government.
References
 Eisen MB: Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci USA. 1998, 95: 14863–14868. 10.1073/pnas.95.25.14863.
 Spellman PT: Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol Biol Cell. 1998, 9: 3273–3297.
 Hughes JD: Computational identification of cis-regulatory elements associated with groups of functionally related genes in Saccharomyces cerevisiae. J Mol Biol. 2000, 296: 1205–1214. 10.1006/jmbi.2000.3519.
 Wang H, Wang W, Yang J, Yu PS: Clustering by pattern similarity in large data sets. 2002, Madison, WI: Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data, 394–405.
 Hartigan JA: Direct clustering of a data matrix. J Am Stat Assoc. 1972, 67: 123–129. 10.1080/01621459.1972.10481214.
 Alexe G: Consensus algorithms for the generation of all maximal bicliques. Disc Appl Math. 2004, 145: 11–21. 10.1016/j.dam.2003.09.004.
 Erten C, Sözdinler M: Improving performances of suboptimal greedy iterative biclustering heuristics via localization. Bioinformatics. 2010, 26: 2594–2600. 10.1093/bioinformatics/btq473.
 Cheng Y, Church GM: Biclustering of expression data. 2000, La Jolla, CA: Proceedings of the 8th International Conference on Intelligent Systems for Molecular Biology, (ISMB 2000), 93–103.
 Yang J: δ-clusters: capturing subspace correlation in a large data set. 2002, San Jose, CA: Proceedings of the 18th IEEE Conference on Data Engineering (ICDE 2002), 517–528. 0769515312.
 Cho H: Minimum Sum-Squared Residue Co-clustering of Gene Expression Data. 2004, Lake Buena Vista, Florida: Proceedings of the 4th SIAM International Conference on Data Mining, (SIAM 2004), ISBN 089715687.
 Bryan K, Cunningham P, Bolshakova N: Biclustering of expression data using simulated annealing. 2005, Dublin, Ireland: Proceedings of the 18th IEEE Symposium on Computer-Based Medical Systems, (CBMS 2005), 383–388.
 Aguilar-Ruiz JS: Shifting and Scaling Patterns from Gene Expression Data. Bioinformatics. 2005, 21: 3840–3845. 10.1093/bioinformatics/bti641.
 Allocco DJ: Quantifying the relationship between co-expression, co-regulation and gene function. BMC Bioinformatics. 2004, 5: 18. 10.1186/1471-2105-5-18.
 Bhattacharya A, De KR: Bi-correlation clustering algorithm for determining a set of co-regulated genes. Bioinformatics. 2009, 25: 2795–2801. 10.1093/bioinformatics/btp526.
 Doruk B: A Biclustering Method to Discover Co-regulated Genes Using Diverse Gene Expression Datasets. 2009, Berlin, Heidelberg: Proceedings of the 1st International Conference on Bioinformatics and Computational Biology, 151–163.
 Mitra S: Gene interaction – An evolutionary biclustering approach. Inf Fusion. 2009, 10: 242–249. 10.1016/j.inffus.2008.11.006.
 Yang WH: Finding Correlated Biclusters from Gene Expression Data. IEEE Trans Knowledge Data Engineering. 2011, 23: 568–584.
 Teng L, Chan L: Discovering Biclusters by Iteratively Sorting with Weighted Correlation Coefficient in Gene Expression Data. J Signal Processing Syst. 2008, 50: 267–280. 10.1007/s11265-007-0121-2.
 Wassim A: Pattern-driven neighborhood search for biclustering of microarray data. BMC Bioinformatics. 2012, 13 (Suppl 7): S11.
 Li G: QUBIC: a qualitative biclustering algorithm for analyses of gene expression data. Nucleic Acids Res. 2009, 37: e101. 10.1093/nar/gkp491.
 Ashburner M: Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet. 2000, 25: 25–29. 10.1038/75556.
 Kanehisa M: KEGG for linking genomes to life and the environment. Nucleic Acids Res. 2008, 36: D480–D484.
 Doruk B: Comparative analysis of biclustering algorithms. 2010, Niagara Falls, NY: Proceedings of the First ACM International Conference on Bioinformatics and Computational Biology, 265–274.
 Prelic A: A systematic comparison and evaluation of biclustering methods for gene expression data. Bioinformatics. 2006, 22: 1122–1129. 10.1093/bioinformatics/btl060.
 Gasch AP: Genomic expression programs in the response of yeast cells to environmental changes. Mol Biol Cell. 2000, 11: 4241–4257.
 Bhattacharjee A: Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proc Natl Acad Sci USA. 2001, 98: 13790–13795. 10.1073/pnas.191502998.
 Sun CH: COFECO: composite function annotation enriched by protein complex data. Nucleic Acids Res. 2009, 37: W350–W355. 10.1093/nar/gkp331.
 Benjamini Y: Controlling the false discovery rate in behavior genetics research. Behav Brain Res. 2001, 125: 279–284. 10.1016/S0166-4328(01)00297-2.
 Yun TY, Hwang TH, Cha K, Yi GS: CLIC: Clustering analysis of large microarray datasets with individual dimension-based clustering. Nucleic Acids Res. 2010, 38: W246–W253. 10.1093/nar/gkq516.
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.