Domain-oriented functional analysis based on expression profiling

Background Co-regulation of genes may imply involvement in similar biological processes or related function. Many clusters of co-regulated genes have been identified using microarray experiments. In this study, we examined co-regulated gene families using large-scale cDNA microarray experiments on the human transcriptome. Results We present a simple model, which, for each probe pair, distills expression changes into binary digits and summarizes the expression of multiple members of a gene family as the Family Regulation Ratio. The set of Family Regulation Ratios for each protein family across multiple experiments is called a Family Regulation Profile. We analyzed these Family Regulation Profiles using Pearson Correlation Coefficients and derived a network diagram portraying relationships between the Family Regulation Profiles of gene families that are well represented on the microarrays. Our strategy was cross-validated with two randomly chosen data subsets and was proven to be a reliable approach. Conclusion This work will help us to understand and identify the functional relationships between gene families and the regulatory pathways in which each family is involved. Concepts presented here may be useful for objective clustering of protein functions and deriving a comprehensive protein interaction map. Functional genomic approaches such as this may also be applicable to the elucidation of complex genetic regulatory networks.


Background
Recent progress in genomic sequencing has led to the rapid enrichment of protein sequence databases. Computational biology strives to extract the maximum possible information from these sequences by classifying them according to their homologous relationships. Classical protein families are distinguished by members which exhibit sequence similarity, a feature which can often be used to infer that the sequences are related evolutionarily as well as functionally. One of the first goals of any genome-sequencing project is to broadly classify as many genes and their products as possible into putative functional families.
Proteins are translated from their corresponding mRNAs. Cellular mRNA levels are immensely informative about cell state and the activity of genes, and in most cases, changes in mRNA abundance are positively correlated with changes in protein abundance. Co-regulation of proteins often reflects that these proteins are involved in similar biological processes and have related functions [1]. Therefore changes in the expression patterns of protein families under different experimental conditions can provide clues about regulatory mechanisms, the relationships between broader cellular functions and biochemical pathways, as well as interactions between different protein motifs [2][3][4][5][6][7].
The advent of microarray technology has made simultaneous analysis of the gene expression profiles of tens of thousands of genes a practical reality [8]. Availability of human genome sequences and massive high throughput data on their expression provides an opportunity to study regulatory relationships between gene families [9,10]. Incyte's LifeExpress RNA Database is a gene expression database that contains raw and normalized data from hundreds of Incyte microarray experiments. Using these large-scale mRNA expression data, we can infer the functional relationships between protein families by comparing the aggregated expression profiles of the members of each protein family.
Pfam is a functionally annotated database of protein domain families [11]. In this study, each member of Pfam families was mapped to Incyte clones which were presented in the Incyte microarray chips based on the sequence identity. We then generated mRNA expression profiles for these Pfam family members using Incyte's LifeExpress RNA (LE) database (Version 3.0, April 2001 release, Incyte Genomics, Inc.). We analyzed 135 Pfam families, whose family members are well represented on the Incyte microarrays. The expression data for various members of a single Pfam family in one experiment was first converted into binary digits and then summarized as the Family Regulation Ratio. The set of Family Regulation Ratios for a particular Pfam family across multiple experiments is called a Family Regulation Profile. By using Pearson Correlation, we analyzed the similarities between these Family Regulation Profiles to impute functional relationships between the different families. This method was validated by comparing Family Regulation Profiles generated based on two different randomly selected LE data subsets.
This study explores an approach to relate protein families based on quantitative comparison of the family mRNA expression profiles. Analysis of the profiles may be useful to address what protein domain families are co-regulated and possibly how they may interact physically or genetically with one another.

Clones mapped to multiple Pfam families
Structural similarities between distinct proteins often involve only a portion of each protein sequence. These conserved sub-structures, which often have readily identifiable boundaries, may recur in a large number of different proteins in which they perform a similar function. In addition, many proteins consist of combinations of these recurrent substructures (domains) in which case the protein function can be economically annotated based on the presence of these domains. Pfam is a domain-based protein database. Multi-domain proteins can be classified into several Pfam families, which in turns means that several Pfam families can share the same corresponding proteins.
Among 135 Pfam families discussed here, there are 1647 clones mapped to more than one Pfam family, and 346 Pfam families sharing more than one clone. Multi-domain proteins are the major reason for this observation, i.e. PF00043 and PF02798 representing C-domain and Ndomain of Glutathione S-transferease respectively. Nonuniformity of Pfam classification and clone mapping may also cause clone sharing between Pfam families. A table listing pairs of Pfam families that share more than 50% of their family members is provided as an additional file (see Additional File 1: [supp1.doc]). These Pfam family pairings may suggest that the two domains are associated with a distinct functional class.

Correlations of Pfam Family Expression Profiles for Related Pfams
Our analysis includes 135 Pfam families, whose names are provided in an additional file (see Additional File 2: [supp2.doc]). Pearson Correlation Coefficients (PCC) were calculated between each pair of Pfam Family Regulation Profiles. Common clones shared by Pfam pairs were removed during PCC calculation to reduce this source of bias (See Methods). 9045 PCCs were computed based on the Family Regulation Profiles consisting of more than ten clone members. Of these, 781 PCCs were greater than 0.6 and 71 PCCs greater than 0.75. Table 1  The Pfam co-regulation network can be visualized using the Pajek software package, a program originally designed for social network analysis (12). The map is layed out with Kamada-Kawi methods where each node represents a Pfam family and each edge represents a correlation coefficient greater than 0.6 ( Fig. 2A) and 0.75 (Fig. 2B) between Family Regulation Profiles of the connecting nodes. A subset of interconnected Pfams in which each Pfam has at least k interactions (where k is an integer) forms a k-core. These cores represent Pfams associated with one another by multiple interactions. The 20-core subgraph where each Pfam had at least 20 connections, was derived from Fig. 2A and displayed as Fig 2C. This graph may represent the core network of cellular regulation.

Data cross-validation
Two independent sets of Family Regulation Profiles were constructed from the randomly selected, non-overlapping subsets of expression data S1 and S2 (see Methods). Separate Pearson Correlation Coefficients, PCC1 and PCC2, were computed from these data subsets respectively. The correlation between PCC1 and PCC2 was determined by calculating an Enrichment Factor (EF) that expresses the degree to which calculations based on one data subset were corroborated by calculations based on the other data subset.
First, we selected Pfam pairs from S1 by screening for pairs where PCC1 is greater than 0.5. If PCC1 and PCC2 for a particular pair of Pfam models are positively correlated, an increase in PCC2 cutoff should result in a greater likelihood of PCC1 falling above 0.5, i.e. EF should increase. We calculated a series of EFs with the PCC2 cutoff increasing from 0 to 1. EF stayed around 1 for low PCC2 (0 to 0.2), and gradually increased to as high as 4.2 when PCC2 -0.8 (Fig. 3).
EFs were also computed to compare PCC1 and PCC2 in the same cutoff range from 0 to 1. An EF greater than 1 indicates higher than background correlation between PCC1 and PCC2. The representation ratio predicted to occur if there is no correlation between PCC1 and PCC2, and the observed representation ratios were calculated as described in Methods. The discrepancy between them was evaluated by determining the statistical variable chisquare (χ 2 ) and the corresponding p value. The higher the chi-square and the lower the p value, the less likely the observed degree of correlation occurred by chance, which implies that the observed relationship is biologically significant. As shown in Table 2, both EF and χ 2 analyses showed strong correlation between PCC1 and PCC2 with the p value less than 0.00001.

Figure 1
Family Regulation Profiles for PF00046 and PF00520. Because of size limitation, only part of the line graph is shown here. The comparison between two profiles shows the high correlation. The Pearson Correlation Coefficients between PF00046 and PF00520 is 0.89. The Pearson correlation coefficient between all PCC1 and PCC2 values was determined to be 0.76, showing that PCC1 and PCC2 were well correlated. Our approach is validated by this consistency of Pfam correlations derived from randomly selected LE expression experiments S1 and S2.

Discussion
Expression monitoring on a genome-wide scale was first successfully demonstrated in budding yeast [1,13,14]. Whole-genome expression profiles provide a rich source of information on protein function, protein-protein interactions, and gene pathways, and have been compared with data sets describing transcription factor binding sites, protein families, protein-protein interactions, and protein abundance [15][16][17][18][19][20][21][22] mainly in budding yeast. In this study we used large scale human genome-wide expression data to study the relationship between the expression patterns of different protein families. We summarized the change in gene expression level for a set of protein family members as a binary code of "regulated" and "unregulated". A two-fold change in expression level was chosen as a cutoff for categorizing a change in expression level as significant. This cutoff is arbitrary, but from our experience with Taqman confirmation (data not shown), a two-fold change in expression level reliably distinguishes real signals from background noise for most expression experiments. The gene regulation was then converted to a percentage for gene families to profile the regulation for each microarray experiment.
Since the mapping from Pfam families to Incyte microarray clones is through the Swissprot database, Pfam family members that are not represented in the Swissprot database cannot be mapped to Incyte clones, and are therefore not included in the Family Regulation Profile analysis. That may be one source of false negatives. Other factors such as sensitivity of the Incyte microarray technology, the arbitrary twofold cutoff, etc. may also contribute to false negative results. In order to reduce these types of errors, we included in our analysis only those Pfam families that map to at least 10 Incyte clones. Due to the fact that the domain hierarchy of Pfam classification is not perfectly reflected in the database, multi-domain proteins may fall into several Pfam members leading to a significant number of false positives. To address this potential source of error, we excluded from pair-wise correlation analysis those Incyte clones shared by both Pfam families in the pair.
To validate relationships that are revealed by microarray expression profiling, one direct way is to split all the experiments randomly into two non-overlapping pools and derive the Pfam relationships independently from each of these pools. The 555 LE experiment were divided into two data sets of S1 and S2 and the corresponding Pfam relationships were denoted as PCC1 and PCC2. The correlation coefficient for PCC1 and PCC2 of 0.76 indicates a high degree of correlation between two data sets. In order to ensure the independence of the data, we also performed another analysis where data were split into two sets based on the tissue type. The Pfam relationships (not shown) derived from these latter data sets are consistent with each other and with those derived from the complete data pool, further corroborating the observed relationships.
It is important to note that many of the protein family relationships (based on co-expression) identified here are supported by functional links. Taking the G protein-coupled receptors (GPCRs) as an example, after agonist action, GPCRs activate G proteins and become phosphorylated by G protein-coupled receptor kinases [23]. This event promotes activation of effector enzymes and ion channels by the activated Gα GTP. This GPCRmediated regulation of ion channels also depends on the coordination and parallel regulation of protein tyrosine kinase and protein tyrosine phosphatase activities [24]. GPCRs can also interact with the growing family of PDZ domain-containing proteins [25]. All of these relationships are reflected in our study by the strong correlation between 7 transmembrane receptors (rhodopsin family) (PF00001), protein kinases (PF00069), protein tyrosine phosphatases (PF00102), ion transport proteins (PF00520), and PDZ domain proteins (PF00595).
Our study revealed many intriguing links between protein families. For example, the link between homeobox proteins (PF00046) and other protein families. Homeobox proteins are a very important family of transcription fac-

Figure 3
Enrichment factor analysis validates the correlation between PCC1 and PCC2. PCC1 -0.5 is used as G 1 cutoff. The EF is plotted for different PCC2 cutoffs. Please see Methods for the detail of EF calculation. tors. According to our analysis, homeobox proteins are involved in the regulation of or are regulated by many protein families such as GPCRs, ion channels, tyrosine phosphatases, nuclear hormone receptors, and protein ki-nases. For example the cone rod homeobox (Crx) binds and transactivates the rhodopsin promoter [26]. However, it has not been determined whether other rhodopsin GPCRs are also regulated by homeobox proteins. In C. el-  egans, it was reported that the homeobox gene, lim-6, is required for distinct chemosensory representations (ion sensing), which is related to ion channel/transportation mechanisms. Again, it has not been studied systematically whether homeobox proteins are important regulators of ion channel/transporter. Another example of an intriguing relationship revealed in the present study is between the PH domain, the EGF domain and the immunoglobulin superfamily. Pleckstrin-homology (PH) domains are protein modules of approximately 120 amino acids found in a wide variety of signaling proteins in organisms ranging from yeasts to humans. The EGF domain, which often occurs as multiple tandem repeats, is widely distributed among extracellular proteins involved in adhesion, receptor-ligand interactions, extracellular matrix structure, determination of cell fate, and blood coagulation [27]. Cell adhesion usually activates PI 3-kinase activity. Many PH domain proteins bind with high affinity to specific phosphoinositides such as PI-4,5-P2, PI-3,4-P2 or PI-3,4,5-P3 which are generated by PI-3 kinase [28]. This relationship was reflected by the high correlation coefficient (0.79) between the PH domain and the EGF domain seen in our analysis. Furthermore, another major cell adhesion molecular family is the immunoglobulin superfamily (PF00047). Two immunoglobulins which are particularly important in the cell adhesion cascade are Intercellular Adhesion Molecule-1 (ICAM-1) or CD54 and Vascular Cell Adhesion Molecule-1 (VCAM-1). The co-expression that we observed in our analysis for the EGF-like domains and immunoglobin superfamily suggested that they may cooperate with each other in the cell adhesion process. Links between Pfam families like these suggest novel hypotheses about family interactions that may be further explored.

Conclusions
Integrating genome-wide expression information with protein functional classifications and genetic networks is a huge challenge. Our model is based on the binary idealization of gene expression change to generate a Family Regulation Profile. Although this model is simplified, the abstraction may be useful for conceptualizing the nature of functional relationships between families of protein domains.

Selection of probe pairs in LifeExpress
The LE database contains data points from a total of 976 probe pairs (Note: here the term of "probe" refers to "fluorescent-labeled total RNA sample used for microarray hybridization") that have been hybridized to one or more Human Genome chips 1~5. 266 of these probe pairs have been hybridized to all five chips. Using data from all available probe pairs would reduce the number of genes that could be compared, since many genes are not represented in all 976 experiments. By contrast, limiting oneself to only those probe pairs that have been run against all the chips would greatly reduce the number of experiments from which the data is drawn. In order to optimize the amount of data that could be compared, 555 probe pairs that were hybridized to the first 4 Human Genome chips (HG1, HG2, HG3, and HG4) were used in this study. are represented on the microarrays by only one family member. In total 135 Pfam families are represented by more than ten clones on the microarrays. There were 2644 clones that belonged to these largest families.

Data Processing
One probe pair in LE could have been hybridized to the same Incyte Human Genome chip several times. In these cases, the average of the differential expression values (fold difference) for the different hybridizations was used in subsequent calculations. For each gene, the expression ratio for a particular probe pair was reduced to a binary variable ("regulated" or "unregulated") rather than a continuous variable. For each pair of experimental conditions, a gene was considered to be "regulated" if it showed at least a two-fold change in expression level (either up-or down-regulated) between conditions, and "unregulated" if the expression ratio was less than two-fold.
In order to analyze the expression profile for each Pfam family, we calculated the Family Regulation Ratio (FRR) for each Pfam family as the percentage of its members which were "regulated" in a pair-wise comparison and assigned the ratio to the Pfam ID for this probe pair. For example, 140 clones in the PF00001 family had been hybridized with the probe pair of "PBMC Cells, Untx, 24 hr, Dn4625 vs t/2 4 hr", of which 28 clones (20%) showed greater than two fold difference in the pair-wise comparison. This meant that 20% members in the family of PF00001 were "regulated" with this probe pair. So the Family Regulation Ratio of 0.2 was assigned to PF00001 for this probe pair.
In order to reduce random noise, we only studied the 135 Pfam families represented by more than ten clones on the microarrays. Family Regulation Profiles were generated for each of these largest Pfam families by calculating the FRRs corresponding to 555 probe pairs.

Cross-validation with two independent subsets
One way to validate our approach is to verify that the Pfam relationships based on our Family Regulation Profiles would be ubiquitous. In other words, different microarray experiment sets, as long as the experiment resource is rich, diverse and well-represented enough, should generate a similar result. The 555 LE probe pairs were randomly divided into two data sets with one data set containing 278 probe pairs (S1) and the other one containing 277 probe pairs (S2). Based on these two data sets, two Family Regulation Profiles for each Pfam were computed respectively. Pearson Correlations among Family Regulation Profiles PCC1 and PCC2 were calculated using S1 and S2 respectively. PCC1 and PCC2 should be positively correlated if the derived Pfam relationships are independent of the expression experiment selection.
The Enrichment Factor (EF) is a parameter that represents the extent to which the Pfam pairs identified by PCC1 within the cutoff range specified by C1 are over-represented in pools of Pfam pairs that could be identified by PCC2 within the cutoff range specified by C2. The number of total Pfam pairs are indicated as G total while the number of Pfam pairs selected by PCC1 -C1 and PCC2 -C2 are indicated as G 1 and G 2 respectively. The number of Pfam pairs common to both PCC1 -C1 and PCC2 -C2 is indicated as G both . Therefore the expected background representation ratio, or the random representation ratio for G 1 within G 2 is, and the observed representation ratio for G 1 within G 2 is, EF is then defined as, = with a value for EF > 1 indicating higher than background representation ratio for G 1 within G 2 .
Take an example where G 1 is selected by PCC1 -0.5 and G2 is selected by PCC2 -0. 8