Database resources
Pfam is a large collection of multiple protein sequence alignments and profile hidden Markov models covering many protein domains. Pfam Version 6.3 (released in May 2001) contains alignments and models for 2847 protein families, based on the Swiss-Prot 39 and SP-TrEMBL 14 protein sequence databases [29]. Pfam was downloaded from ftp://genetics.wusti.edu/pub/eddy/pfam-6.3/Pfam-A.full.gz. Pfam family IDs and family members (represented by Swiss-Prot ID or TrEMBL ID) were extracted and organized in a relational database (Sybase Adaptive Server Enterprise Release 11.9.3; Sybase Inc., CA). A non-redundant protein sequence database comprising Swiss-Prot and TrEMBL was also downloaded from ftp://ftp.expasy.org/databases/sp_tr_nrdb, and the data were parsed and stored in Sybase.
The Incyte microarray is manufactured by depositing DNA onto a glass surface in an array format at a density of up to 10,000 array elements per chip. LifeExpress RNA is a gene expression database which, in Version 3.0 (April 2000 release), includes 6307 expression experiments performed using Incyte's Human Genome chips 1~5. Incyte Human Genome chip 1 (HG1) contains 9766 cDNAs, HG2 contains 9612 cDNAs, HG3 contains 9686 cDNAs, HG4 contains 9249 cDNAs, and HG5 contains 9219 cDNAs. In total, the five arrays contain 46909 cDNAs, which represent about 41296 unique genes. These experiments encompass the therapy areas of Cancer Biology, the Cardiovascular System, Immunology and Inflammation, Metabolism, Neurobiology, Toxicology, and Body Map.
Selection of probe pairs in LifeExpress
The LifeExpress (LE) database contains data points from a total of 976 probe pairs (here, "probe" refers to the fluorescently labeled total RNA sample used for microarray hybridization) that have been hybridized to one or more of Human Genome chips 1~5. Of these, 266 probe pairs have been hybridized to all five chips. Using data from all available probe pairs would reduce the number of genes that could be compared, since many genes are not represented in all 976 experiments. By contrast, using only those probe pairs that have been run against all five chips would greatly reduce the number of experiments from which the data are drawn. To optimize the amount of data that could be compared, the 555 probe pairs that were hybridized to the first four Human Genome chips (HG1, HG2, HG3, and HG4) were used in this study.
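The selection rule above can be sketched as a simple set filter. This is only an illustration under stated assumptions: the mapping `chips_by_probe_pair` and the function name are hypothetical and do not reflect the actual LifeExpress schema.

```python
# Hypothetical sketch: keep probe pairs hybridized to all of HG1-HG4.
REQUIRED_CHIPS = frozenset({"HG1", "HG2", "HG3", "HG4"})


def select_probe_pairs(chips_by_probe_pair, required=REQUIRED_CHIPS):
    """Return the probe pairs whose chip set covers every required chip.

    chips_by_probe_pair: dict mapping a probe-pair ID to an iterable of
    chip names it was hybridized to (an assumed input layout).
    """
    return sorted(
        pair
        for pair, chips in chips_by_probe_pair.items()
        if required <= set(chips)
    )


example = {
    "pp1": ["HG1", "HG2", "HG3", "HG4", "HG5"],  # all five chips: kept
    "pp2": ["HG1", "HG2", "HG3", "HG4"],         # first four chips: kept
    "pp3": ["HG1", "HG5"],                       # incomplete: dropped
}
selected = select_probe_pairs(example)
```

Requiring only HG1 to HG4 keeps probe pairs that were also run on HG5, which is how the selection can exceed the 266 pairs run on all five chips.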
Mapping Pfam to LifeExpress
A total of 4935 Swiss-Prot human sequences and 22208 TrEMBL human sequences are included in 1427 protein families covered by Pfam Version 6.3. Corresponding GenBank mRNA sequences for those protein records were identified from annotation present in the Swiss-Prot and TrEMBL sequence records, and were assembled together with the clone sequences from LifeExpress Human Genome chips 1~4 using the PHRAP assembler (threaded version 3.01, licensed from Southwest Parallel Software, Inc., http://www.spsoft.com). A total of 4177 Incyte clones belonging to 1069 Pfam families were mapped to these sequences. Pfam family PF00069 is the most highly represented on the chips, with 231 Incyte clones, while 343 Pfam families are represented on the microarrays by only one family member. In total, 135 Pfam families are represented by more than ten clones on the microarrays; 2644 clones belong to these largest families.
Data Processing
A single probe pair in LE may have been hybridized to the same Incyte Human Genome chip several times. In these cases, the average of the differential expression values (fold difference) across the different hybridizations was used in subsequent calculations. For each gene, the expression ratio for a particular probe pair was reduced to a binary variable ("regulated" or "unregulated") rather than kept as a continuous variable. For each pair of experimental conditions, a gene was considered "regulated" if it showed at least a two-fold change in expression level (either up- or down-regulated) between conditions, and "unregulated" if the change was less than two-fold.
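The averaging and thresholding steps can be illustrated with a minimal sketch. It assumes that fold differences are stored as signed values (positive for up-regulation, negative for down-regulation); this sign convention and the function name are assumptions made for illustration, not details given in the text.

```python
from statistics import mean

FOLD_CUTOFF = 2.0  # two-fold threshold stated in the text


def is_regulated(fold_differences, cutoff=FOLD_CUTOFF):
    """Average replicate fold differences, then apply the two-fold cutoff.

    fold_differences: signed fold differences from repeated hybridizations
    of one probe pair to the same chip (assumed representation).
    """
    average = mean(fold_differences)
    # Up- and down-regulation are treated alike, so compare the magnitude.
    return abs(average) >= cutoff


up = is_regulated([2.5, 1.9, 2.2])    # mean 2.2
down = is_regulated([-3.0, -2.0])     # mean -2.5
flat = is_regulated([1.2, 1.5])       # mean 1.35
```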
To analyze the expression profile of each Pfam family, we calculated the Family Regulation Ratio (FRR) for each Pfam family as the fraction of its members that were "regulated" in a pair-wise comparison, and assigned this ratio to the Pfam ID for that probe pair. For example, 140 clones in the PF00001 family had been hybridized with the probe pair "PBMC Cells, Untx, 24 hr, Dn4625 vs t/2 4 hr", of which 28 clones (20%) showed a greater than two-fold difference in the pair-wise comparison. Thus 20% of the members of family PF00001 were "regulated" with this probe pair, and a Family Regulation Ratio of 0.2 was assigned to PF00001 for this probe pair.
In order to reduce random noise, we only studied the 135 Pfam families represented by more than ten clones on the microarrays. Family Regulation Profiles were generated for each of these largest Pfam families by calculating the FRRs corresponding to 555 probe pairs.
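As a minimal sketch of the FRR and profile calculation described above, assuming the regulated clones for each probe pair are available as sets (the data layout and all names here are hypothetical):

```python
def family_regulation_ratio(family_clones, regulated_clones):
    """Fraction of a family's clones that are regulated for one probe pair."""
    family_clones = set(family_clones)
    if not family_clones:
        raise ValueError("family has no clones")
    return len(family_clones & set(regulated_clones)) / len(family_clones)


def family_regulation_profile(family_clones, regulated_by_probe_pair, probe_pairs):
    """One FRR per probe pair, in the order given by probe_pairs."""
    return [
        family_regulation_ratio(family_clones, regulated_by_probe_pair[pair])
        for pair in probe_pairs
    ]


# Reproduces the PF00001 example: 28 of 140 clones regulated.
clones = [f"clone{i}" for i in range(140)]   # hypothetical clone IDs
regulated = clones[:28]
frr = family_regulation_ratio(clones, regulated)
```

With the counts from the example, `frr` evaluates to 28/140 = 0.2.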
Correlation measure
Correlations between Family Regulation Profiles are measured with the standard Pearson Correlation Coefficient (PCC). The PCC between two profiles a and b with k dimensions is calculated as

$$\mathrm{PCC}(a, b) = \frac{\sum_{i=1}^{k} (a_i - \bar{a})(b_i - \bar{b})}{\sqrt{\sum_{i=1}^{k} (a_i - \bar{a})^2}\,\sqrt{\sum_{i=1}^{k} (b_i - \bar{b})^2}}$$

where $a_i$ and $b_i$ represent the Family Regulation Ratios of Pfam a and Pfam b for sample i, and $\bar{a}$ and $\bar{b}$ indicate the respective means. If a clone was mapped to both Pfam a and Pfam b, the Family Regulation Profiles of a and b were recomputed with that clone excluded. In other words, clones shared by a Pfam pair made no contribution to the PCC calculation.
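The correlation step, including the exclusion of shared clones, can be sketched as follows. The input layout (a set of regulated clone IDs per probe pair) and the function names are assumptions for illustration; the sketch also assumes every clone has data for every probe pair.

```python
from math import sqrt


def pearson(a, b):
    """Standard Pearson correlation of two equal-length profiles."""
    k = len(a)
    mean_a, mean_b = sum(a) / k, sum(b) / k
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return None  # correlation is undefined for a constant profile
    return cov / sqrt(var_a * var_b)


def pcc_excluding_shared(clones_a, clones_b, regulated_by_probe_pair, probe_pairs):
    """PCC of two family profiles after dropping clones mapped to both."""
    shared = set(clones_a) & set(clones_b)
    only_a, only_b = set(clones_a) - shared, set(clones_b) - shared

    def profile(clones):
        # Family Regulation Ratio for each probe pair over the given clones.
        return [
            len(clones & regulated_by_probe_pair[p]) / len(clones)
            for p in probe_pairs
        ]

    return pearson(profile(only_a), profile(only_b))


regulated = {"p1": {"c1", "c3", "c4"}, "p2": {"c3"}, "p3": {"c1", "c2", "c4", "c5"}}
r = pcc_excluding_shared({"c1", "c2", "c3"}, {"c3", "c4", "c5"}, regulated, ["p1", "p2", "p3"])
```

In this toy input the shared clone `c3` is removed from both families before the profiles are built, so it cannot inflate the correlation.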
Cross-validation with two independent subsets
One way to validate our approach is to verify that the Pfam relationships derived from Family Regulation Profiles do not depend on the particular set of experiments used. In other words, different sets of microarray experiments should generate similar results, as long as each set is sufficiently rich, diverse, and representative. The 555 LE probe pairs were randomly divided into two data sets, one containing 278 probe pairs (S1) and the other containing 277 probe pairs (S2). A Family Regulation Profile for each Pfam family was computed separately from each data set. Pearson correlations among Family Regulation Profiles, PCC1 and PCC2, were then calculated using S1 and S2, respectively. PCC1 and PCC2 should be positively correlated if the derived Pfam relationships are independent of the selection of expression experiments.
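A random split of this kind can be sketched as below. The seed and the function name are arbitrary choices made here for reproducibility of the illustration; the text does not specify how the randomization was performed.

```python
import random


def split_probe_pairs(probe_pairs, seed=0):
    """Shuffle probe pairs and split them into two near-equal halves.

    With an odd count the first half receives the extra item, so 555
    probe pairs yield subsets of 278 (S1) and 277 (S2).
    """
    shuffled = list(probe_pairs)
    random.Random(seed).shuffle(shuffled)  # seeded for a repeatable split
    half = (len(shuffled) + 1) // 2
    return shuffled[:half], shuffled[half:]


probe_pairs = [f"pp{i}" for i in range(555)]  # hypothetical probe-pair IDs
s1, s2 = split_probe_pairs(probe_pairs)
```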
The Enrichment Factor (EF) is a parameter that represents the extent to which the Pfam pairs identified by PCC1 within the cutoff range specified by C1 are over-represented in the pool of Pfam pairs identified by PCC2 within the cutoff range specified by C2. The total number of Pfam pairs is denoted $G_{\mathrm{total}}$, while the numbers of Pfam pairs selected by PCC1 ≥ C1 and PCC2 ≥ C2 are denoted G1 and G2, respectively. The number of Pfam pairs common to both PCC1 ≥ C1 and PCC2 ≥ C2 is denoted $G_{\mathrm{both}}$. The expected background representation ratio, or random representation ratio, for G1 within G2 is therefore

$$R0 = \frac{G1}{G_{\mathrm{total}}}$$

and the observed representation ratio for G1 within G2 is

$$R1 = \frac{G_{\mathrm{both}}}{G2}$$

EF is then defined as

$$EF = \frac{R1}{R0}$$

with EF > 1 indicating a higher than background representation ratio for G1 within G2.
Take an example where G1 is selected by PCC1 ≥ 0.5 and G2 is selected by PCC2 ≥ 0.8. Out of a total of 8968 Pfam pairs, 2185 have PCC1 ≥ 0.5, so R0 = 2185/8968 = 0.24. There are 20 Pfam pairs with PCC2 ≥ 0.8, all of which also have PCC1 ≥ 0.5, so R1 = 20/20 = 1. Therefore, EF = 1/0.24 = 4.2.
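The worked example can be checked with a short sketch. The function and argument names are hypothetical; the ratios follow the arithmetic of the example (R0 as the share of all pairs selected by the first cutoff, R1 as the share of the second selection that also passes the first).

```python
def enrichment_factor(g_total, g1, g2, g_both):
    """Enrichment of the PCC1-selected pairs within the PCC2-selected pairs.

    g_total: total number of Pfam pairs
    g1:      pairs passing the PCC1 cutoff
    g2:      pairs passing the PCC2 cutoff
    g_both:  pairs passing both cutoffs
    """
    r0 = g1 / g_total   # background (random) representation ratio
    r1 = g_both / g2    # observed representation ratio
    return r1 / r0


ef = enrichment_factor(g_total=8968, g1=2185, g2=20, g_both=20)
```

Without intermediate rounding, `ef` is 8968/2185, or about 4.1; the value of 4.2 quoted in the example arises from rounding R0 to 0.24 before dividing.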