Genome-wide identification and characterization of cytochrome P450 monooxygenase genes in the ciliate Tetrahymena thermophila
© Fu et al. 2009
Received: 19 November 2008
Accepted: 01 May 2009
Published: 01 May 2009
Skip to main content
© Fu et al. 2009
Received: 19 November 2008
Accepted: 01 May 2009
Published: 01 May 2009
Cytochrome P450 monooxygenases play key roles in the metabolism of a wide variety of substrates and they are closely associated with endocellular physiological processes or detoxification metabolism under environmental exposure. To date, however, none has been systematically characterized in the phylum Ciliophora. T. thermophila possess many advantages as a eukaryotic model organism and it exhibits rapid and sensitive responses to xenobiotics, making it an ideal model system to study the evolutionary and functional diversity of the P450 monooxygenase gene family.
A total of 44 putative functional cytochrome P450 genes were identified and could be classified into 13 families and 21 sub-families according to standard nomenclature. The characteristics of both the conserved intron-exon organization and scaffold localization of tandem repeats within each P450 family clade suggested that the enlargement of T. thermophila P450 families probably resulted from recent separate small duplication events. Gene expression patterns of all T. thermophila P450s during three important cell physiological stages (vegetative growth, starvation and conjugation) were analyzed based on EST and microarray data, and three main categories of expression patterns were postulated. Evolutionary analysis including codon usage preference, site-specific selection and gene-expression evolution patterns were investigated and the results indicated remarkable divergences among the T. thermophila P450 genes.
The characterization, expression and evolutionary analysis of T. thermophila P450 monooxygenase genes in the current study provides useful information for understanding the characteristics and diversities of the P450 genes in the Ciliophora, and provides the baseline for functional analyses of individual P450 isoforms in this model ciliate species.
The cytochrome P450 monooxygenases (P450s) constitute a conserved gene superfamily of heme-thiolate proteins ubiquitously distributed in different life forms, including prokaryotes (archaea, bacteria), unicellular eukaryotes (protists, fungi) and multicellular eukaryotes (plants and animals) . They play key roles in the metabolism of a wide variety of substrates, including endogenous chemicals such as steroids and other important small molecules, and also xenobiotic compounds including drugs, pesticides and environmental contaminants . Substrate and functional diversity are characteristic features of P450 enzymes and are considered to be the consequence of evolutionary adaptation driven by different metabolic or environmental demands in different organisms. Evolution and expansion of major P450 branches has been suggested to be linked with the historical occurrence of important evolutionary events. One particular example is the divergence of P450s of the common plant-animal ancestor either to synthesize biochemicals/metabolites (in plants) or to detoxify xenobiotics (in animals), followed by P450 gene expansions, especially in the plants . These may well reflect different survival strategies adopted between the two kingdoms, i.e. plants evolved sessile systems with P450 enzymes with more diverse and essential roles, while animals developed higher order sensory and locomotor systems, and comparatively fewer P450s .
Although over 9,000 P450s have been named and about 1,000 P450 genes have been characterized to date, none has been systematically characterized in the phylum of Ciliophora which, together with dinoflagellates and the exclusively parasitic apicomplexa, constitute the three major evolutionary lineages that make up the alveolates . The unicellular ciliate Tetrahymena thermophila is a free-living protozoan widely distributed in freshwater and estuarine environments, elaborating typical eukaryotic components (e.g., microtubules, membrane systems) into a highly organized cell whose structural and functional complexity is comparable to, or exceeds that, of human and other metazoan cells . The physiology, biochemistry and molecular biology of Tetrahymena are well characterized , and it is an excellent model organism for toxicological and ecotoxicological studies in aquatic toxicity test systems [8, 9]. Results from the EST project  and the macronuclear genome sequencing project  have shown that, although single-celled, Tetrahymena possesses core processes conserved across a wide diversity of eukaryotes (including humans) that are not found in other unicellular model species such as the yeasts Saccharomyces cerevisiae and Schizosaccharomyces pombe. It also contains a large number of gene families that are involved in processes associated with sensing and responding to environmental cues. In the case of the cytochrome P450 gene family, S. cerevisiae and S. pombe have only three and two P450s, respectively , while in T. thermophila the number is more than 40 (this study), which is close to the typical number (50–80) in a vertebrate genome .
In humans and other mammals, extensive studies have focused on aspects of P450 gene structure and biochemical properties. Important biological functions of P450s and the associated high degree of complexity in the gene polymorphism and expression patterns have been demonstrated . As genomic databases became available and functional genomics techniques such as DNA-microarrays have been applied, investigations on P450 isoforms have also been extended to other organisms (birds, crustaceans, insects, fungi, plants) (see  for a recent review). These developments have, in turn, led to the emergence of a new field, ecotoxicogenomics , which aims to develop effective tools for identification of possible toxic environmental pollutants by characterizing their effects on terrestrial and aquatic model organisms, such as the soil-dwelling nematode Caenorhabditis elegans , the aquatic crustacean Daphnia magna , and the small fish fathead minnow (Pimephales promelas) .
We previously identified a series of differentially expressed ESTs of T. thermophila that respond sensitively to treatment with the organochlorine insecticide DDT . One EST (GenBank accession No. CF653700) was identified to be a P450 gene by homology searches, and its expression levels under different concentrations of DDT treatment were further assessed. Recently, the first genome-wide microarray platform containing the predicted coding sequences (putative genes) has been established and validated in T. thermophila and was used to study gene expression during three major stages of the organism's life cycle: vegetative growth, nutrient starvation and conjugation . Substantial progress has also been made in closure and reannotation of the MAC genome sequence of this eukaryotic model organism . All these provided us the opportunity to investigate both the functional and evolutionary characteristics of the cytochrome P450 genes in T. thermophila at the genomic level.
In this study, the putative T. thermophila P450 genes that were previous identified both by the International Nomenclature Committee (Nelson's P450 Homepage http://drnelson.utmem.edu/CytochromeP450.html) and by the TIGR genome annotation team were further checked by EST data and through cDNA sequence cloning experiments for improvement of annotation. The expression patterns of the T. thermophila P450 genes during three important cell physiological/developmental conditions (growth, starvation and conjugation) were analyzed based on EST and microarray data. These results are discussed in the context of understanding the characteristics of the T. thermophila P450 monooxygenase isoforms and their functional and evolutionary diversity.
The putative cytochrome P450 gene sequences in T. thermophila were initially identified in 2004 by Dr. Nelson based on an early TIGR release of its macronuclear genome assemblies, and 47 P450-like genes (and fragments) were posted on his P450 website http://drnelson.utmem.edu/tetrahymena.FASTA.htm. When the draft whole genome of T. thermophila was sequenced and reported in 2006, 41 P450 genes were predicted by the TIGR genome annotation team . In this study, we retrieved the putative T. thermophila P450 genes from above two locations and further searched against the TGD (Tetrahymena Genome Database) utilizing the updated annotation (Aug. 2007). The obtained P450-like gene sequences were thereafter compared. Then, we used the data from two EST database resources (NCBI and TBestDB) and our own RT-PCR investigations to verify the accuracy of the predictions.
The 44 T. thermophila putative P450 genes.
GenBank accession No.
Scaffold GenBank accession No.
EST accession No.
There were four P450 pseudogenes (fragments) in Nelson's predictions (CYP5005A5P, CYP5005A7P, CYP5005A12P and CYP5005A13P, "P" stands for "pseudo"). It was revealed that one (CYP5005A5P) was a probable pseudogene due to the presence of in-frame stop codons, and two partial fragments (CYP5005A12P and CYP5005A13P) with typical P450 sequence features appeared to be missing the N-terminus. These sequences were not included in the following analysis of the T. thermophila putative P450 genes. However, when checking the cDNA sequences of putative T. thermophila P450 genes, we identified one transcription product that is most identical to the predicted pseudogene "CYP5005A7P" which has an in-frame TGA codon (the only stop codon in T. thermophila) within its ORF. Compared with the "CYP5005A7P" sequence, the EST of the cDNA transcript has several site mutations including one nucleotide transition (A to G) within the in-frame stop codon found in the genomic sequence, and thus can be read through. However, a Blast search of the T. thermophila whole genome database failed to detect any sequence that completely matched this cDNA sequence. Thus we designed a pair of primers (5005A7_Genome_Fw and 5005A7_Genome_Re) located in either the 5' or 3' flanking region of the "CYP5005A7P" sequence, and obtained a 2.1kb PCR amplification product that was sequenced. Due to the possibility that different P450 isoforms may exist in different T. thermophila strains, we checked the PCR products both from the genomic DNA of strain SB210 (the strain that used for the T. thermophila macronuclear sequencing project) and CU428. The results showed that the PCR products from the two strains were 100% identical and were also consistent with the sequence of the unexpected cDNA transcript. We thus assigned the name CYP5005A7 to this newly observed P450 isoform. Further, Blast searches in the TBestDB and NCBI EST database either found one sequence in each that perfectly matched the CYP5005A7 sequence (TBestDB accession No. TTL00012823, cDNA library from a T. thermophila CU strain; GenBank accession No. EC269404, cDNA library from the T. thermophila strain SB210, respectively). Besides, the microarray data demonstrated that the corresponding cDNA transcript was constantly expressed during the different cellular conditions. All the above observations serve as evidence that CYP5005A7 is a functional P450 gene in both the T. thermophila CU and SB210 strains. The erroneous prediction of the pseudogene "CYP5005A7P" by TIGR was probably due to false assembly of genome sequence contigs, caused by the high similarity between the CYP5005A5P and CYP5005A7 sequences (Additional file 2).
The largest expanded P450 family in T. thermophila is the CYP5005 family that contains 16 likely functional P450 genes (purple in Figure 1) and the two predicted pseudogenes (not indicated). The CYP5010 family (blue in Figure 1) and the CYP5013 family (orange in Figure 1) were "medium"-expanded P450 families in T. thermophila that possessed 8 and 6 genes, respectively. Moreover, 10 CYP5005 family members are located on the same scaffold (CH445786), which contains a gene cluster in the form of tandem repeats with four genes (CYP5005A1, CYP5005A2, CYP5005A3 and CYP5005A4) and one pseudogene (CYP5005A5P) in the same orientation. Three CYP5013 family members (CYP5013A1, CYP5013B1 and CYP5013C1) form a tandem triplicate repeat on the scaffold CH445623. Besides these, there are three other P450 gene pairs (CYP5005A18 and CYP5005A19, CYP5007B1 and CYP5007C1, CYP5012A1 and CYP5012A2) organized in tandem duplication. Such a feature is consistent with the gene expansion strategy used by some other lower eukaryotic genomes, particularly those who have evolved to meet an extensive demand for generation of a broad range of metabolites in the secondary metabolism when interacting with the environmental niches. For example, it was revealed that the majority of the multigene P450 families in the model white rot fungus Phanerochaete chrysosporium appear to have expanded locally in its genome as a result of extensive gene duplications and rearrangements, indicating a strong need for functional divergence in response to environmental stimuli .
The intron-exon organization of P450 genes exhibits a diversity of gene structure with indicating that multiple gains and losses of introns have occurred during the evolution of P450 genes in diverse species, with little conservation of intron positions among divergent P450 families [24, 25]. Intron positions were mapped and characterized for all 44 putative T. thermophila CYP genes (Figure 1). Of the 48 introns that were identified in the P450 gene sequences, 13 were phase zero (27.1%), 16 were phase I (33.3%) and 19 were phase II (39.6%). Meanwhile, based on the definition of the UIP (introns that occupy a unique position in the alignment) , 27.1% (13) of the total introns were unique and the remaining 35 introns were present in 7 consensus locations (introns shared by all members within a family at the same aligned position) among three different family clades. Sixteen P450 genes lack introns entirely. Five genes (CYP5002A1, CYP5003A1, CYP5011A1, CYP5012A1 and CYP5012A2) have the maximum number of four introns, respectively. From Figure 1, it can be observed that there is a good correlation between the conservation of intron position and phylogenetic relationships of T. thermophila P450 subfamily members. The evolution of introns in alveolates was recently studied by Nguyen et al. , who concluded that the rates of intron gain and loss were more or less constant in the last ~800 Myr after Tetrahymena branched off. Therefore, it can be inferred that the enlargement of several P450 families in T. thermophila resulted from recent separate small duplication events, which is also apparent within many other Tetrahymena protein families containing paralogous genes .
Our analyses of the EST data to determine the intron boundaries also revealed that the CYP5013A1 gene might exhibit alternative intron splicing. Alternative splicing was suggested to be an uncommon phenomenon in T. thermophila, at least under the several growth conditions that have examined . CYP5013A1 had the greatest number of ESTs of all T. thermophila P450 isoforms. Among the 26 retrievable EST sequences, 25 represented the correct transcriptional product of CYP5013A1, while one (GenBank accession No. FF565796) retained the first intron. The different transcripts were further investigated by RT-PCR. An intron-containing transcript was observed in PCR products both from the cDNA templates of starvation and conjugation stages, but was barely detectable from vegetative growing cells (data not shown). The retained intron in CYP5013A1 is 53 bp in length and belongs to phase I type. Thus, the putative intron-containing transcript should contain a frame-shift mutation. RNA transcripts carrying such premature stop codons can be prevented by the NMD pathway . Recently, this pathway was found to play an essential role in another ciliate, P. tetraurelia, by knocking down the two homologous genes encoding UPF1, a protein which is crucial in NMD, thus indicating a universal translational control of intron splicing in eukaryotes based on NMD surveillance . A blast search of the T. thermophila genome using P. tetraurelia UPF1 genes as queries revealed the existence of one homolog with 70% identity (GenBank accession No. TTHERM_00726300). Thus, intron retention in the CYP5013A1 transcript might be caused by inefficient NMD activity during specific physiological/developmental stages. However, it would lead to a mature "non-sense" transcript and could only serve to down-regulate expression of the CYP5013A1 gene. Whether such rarity of alternative splicing in this species may simply be tolerated or directed by other mechanisms, is currently unknown.
Prediction of sub-cellular localization of the CYP proteins showed that no predicted mitochondrial localized P450 was found in T. thermophila (Table 1), and all of the genes except CYP5001A1 encode a putative typical microsomal signal peptide of about 20 hydrophobic residues that likely serves as a membrane anchor in the endoplasmic reticulum. The secretory pathway of eukaryotic cells consists of a series of discrete, membrane-bound compartments, including the ER, Golgi, and vacuoles , all of which are present in Tetrahymena . For CYP5001A1, the predicted signal sequence was ambiguous and also lacked the positively charged residues in this N-terminus critical for endocellular transportation targeting. So far, mitochondrial P450s have been described only in the animal kingdom [4, 25]. As with T. thermophila, none was found in fungi and plants. These observations support the suggestion that mitochondrial P450s did not originate from the ancestral mitochondrial endosymbiont, but evolved later, possibly by mistargeting of microsomal P450s .
CYP51 is the only P450 family that has orthologs in multiple phyla of the animal, plant, fungal and bacterial kingdoms, and it has been postulated to be the ancestral P450 isoform . This family is functionally conserved and has a very limited range of substrates related to the biosynthesis of sterols and their derivatives, and a high conservation within the SRSs is a specific feature of CYP51. It was believed that two motifs, "YxxF/L(i)xxPxFGxxVxF/YD/a" and "GQ/hHT/sS", present within the regions of SRS1 and SRS4, respectively, can be considered as the CYP51 signature . In this study, no CYP51-like gene has been identified with such a signature in the T. thermophila P450 family. So far, all studied CYP51 family members were found to exhibit catalytic function in the oxidative removal of the 14α-methyl group from sterol precursors formed downstream from cyclization of squalene 2,3-epoxide, and the 14α-demethylated products are intermediates in the sterol biosynthetic pathways leading to formation of cholesterol in animals, ergosterol in fungi and a variety of functional sterols in plants and algae [35, 36]. CYP51 appears to be absent in insects and nematodes due to the fact that they don't make cholesterol and lack the post-squalene sterol biosynthesis pathway, and absence of the CYP51 gene was thought to be the consequence of gene loss during evolution . Thus, a similar CYP51 gene loss may be inferred in T. thermophila. Instead of the sterols in most other eukaryotes, Tetrahymena produces the pentacyclic triterpenoid alcohol 'tetrahymanol' and/or hopanoids de novo as functionally equivalent structural components of cell membranes, and its ability to synthesize and use this "primitive" substance can be considered as a metabolic relic . Besides, although T. thermophila is able to incorporate exogenous sterols into cell membranes and convert those to various derivates, it utilize enzymes other than P450s in the downstream metabolism of certain steroid compounds, such as the cytochrome b5 in the direct desaturation reaction of cholesterol .
Spatial and temporal gene expression patterns are important aspects of gene regulation and expression pattern analysis has played an important role in the study of the cytochrome P450 gene superfamily. In the present study, the gene expression patterns of all T. thermophila P450s during three important cell physiological stages were analyzed based on EST and microarray data.
To examine the expression profiles of the P450 genes, we initially retrieved the EST data of T. thermophila from both in the NCBI EST database and the taxonomically broad EST database (TBestDB) (Apr. 2008) that had been derived from cDNA libraries made from cells in different physiological conditions. The ESTs matching with a corresponding predicted P450 gene are listed in Table 1, which indicates that the gene of origin is expressed and the numbers of ESTs found could be considered as a first indication of the relative expression abundance of that gene. For example, ESTs derived from a few P450 genes (CYP5005A18, CYP5005A19, CYP5008A1, CYP5010B1, CYP5013A1 and CYP5013D1) were found a number of times and their relatively high expression levels were also checked by semi-quantitative RT-PCR analysis, while ESTs from many other P450s were not identified at any of the examined cell physiological/developmental stages. While EST analyses enabled the localization of introns and demonstration of a likely case of alternative splicing, due to random fluctuation in EST numbers and differences between the sizes of each library, the total EST counts in different cDNA libraries cannot be rigorously compared for quantifying the relative amounts of the expressed genes in different physiological states.
Secondly, nearly 50% of all the P450 isoforms, including the mostly highly expressed isoforms (CYP5005A18, CYP5008A1, CYP5008A2, CYP5010B1, CYP5012A2 and CYP5013D1) are constantly expressed at all life cycle stages, although their expression levels may vary to different extents (clades B and C in Figure 4). These genes probably take part in constitutive, endogenous physiological processes. It should be pointed out that, due to an erroneous prediction of the TIGR genome sequence that merged three adjacent P450 genes (CYP5013A1, CYP5013C1 and CYP5013B1) into one "monster" gene when the microarray was designed; the separate microarray data for these three genes were unavailable. However, by manually checking the signal of 5 of the 14 oligonucleotide probes whose position correctly matched the putative CYP5013A1 gene, its high expression level could be inferred. The EST information of CYP5013A1 also supports this conclusion.
Meanwhile, expression studies of T. thermophila cytochrome P450 genes when exposed to specific chemical substances are currently under investigation. Among the P450 genes that are not expressed in the three major physiological/developmental states, at least three of these isoforms (CYP5007C1, CYP5010A4 and CYP5010A5) are transcriptionally active under different xenobiotic stresses (W. Miao and C. Fu, unpublished microarray data), suggesting that they may play a role in the metabolism and biotransformation of some chemical compounds.
Secondly, if translational selection pressure influences the shaping of codon usage, the bias would show significant positive correlation to expression levels and some translation-preferred codons should be used more frequently than others. Thus, we calculated the two often used measures of codon usage bias, the CAI  and the ENc  of each T. thermophila P450 gene,. Among the most highly expressed P450 genes (clade C in Figure 4), three isoforms (CYP5008A1, CYP5010B1 and CYP5013D1) have a CAI value above 0.6, which is significantly above that of the average (0.421) following a student's t test (p < 0.05). Meanwhile, four P450 genes (CYP5005A18, CYP5008A2, CYP5012A2 and CYP5013A1), although exhibiting a similar high basal expression level, showed a CAI value close to, or even less than, that of the average. Thus, no simple trend exists between the expression levels of P450 genes and the codons they use. Meanwhile, in the ENc-plot analysis, if a given gene is only subject to G+C composition mutation constraint, it will lie above or just below the standard curve. From Figure 6B, it can be seen that while the most T. thermophila P450 genes had ENc values slightly smaller than expected ones, the three genes with the significant high CAI values displayed a more biased codon usage according to the respective GC3s, indicating that they are probably under pressure from direct expression selection.
The correspondence analysis on the RSCU was also performed to avoid identification of trends in codon usage due to biased amino acid usage among the genes. RSCU of T. thermophila CYP genes detected one major trend on the first axis of inertia which accounted for 26.05% of the total variation and was approximately three times as much of the variation as the second axis (8.87%), and no other axes accounted for more than 7.34% of the total variation, respectively. The codons of T. thermophila CYP genes that make the major contributors to this pattern showed preference for ending in C, including CCC (Proline), UCC (Serine), CUC (Leucine), GUC (Valine), CAC (Histidine) and AUC (Isoleucine), and this might indicate that codons ending in C could lead to better translational accuracy/efficiency. Besides, Axis 1 was significantly correlated to GC and GC3s (r = 0.913, 0.931, p < 0.01, Pearson correlation coefficient), but not to ENc (r = 0.074, p > 0.05). While this result implies that the nucleotide composition mutation could be considered as the relatively major factor in shaping the codon usage in P450 genes, it should be taken into account that the robustness of our current analysis would be limited by the relative small quantities of calculated genes. A recent study reported that there was a very small but significant correlation between the ENc and expression level (estimated based on the EST counts) in T. thermophila . However, in Miao et al.  the authors found that while most codon-biased genes (95%) are expressed constitutively and at high levels, only a fraction of highly expressed genes (<15%) are codon-biased. This may well reflect that the genome-wide microarray analyses covered a wide range of physiological/developmental stages and are subject to less bias than non-saturated, random analyses of cDNAs. Thus, although the overall codon usage biased is determined by nucleotide composition mutation in T. thermophila, the presence of a preferred gene subset is under pressure from direct expression selection which probably results from the large effective population size of this species . It has been reported that in the bacteria Bacillus subtilis  and Chlamydia trachomatis , the synonymous codon usage appeared to be determined by both the neutral mutational biases and translation selection. Therefore, while genes within one species often share a single codon usage pattern, exceptions also exist, especially among microorganisms.
The two largest P450 families in T. thermophila (CYP5005 and CYP5010) were chosen for investigating whether different evolutionary pressures exist on particular sites of P450 isoforms. Here the rates of substitutions that occurred within the coding regions of the family members were calculated for all types of selection (purifying, neutral and positive). Different rates of substitutions were estimated to occur within the coding regions among the two family members. The obtained Ka/Ks ratios were between low (<1) to median (= 1) along the whole sequences, and were especially low among some residues that represent some core P450 structure regions, indicating strong negative selection on these structural regions. This suggests they are functionally important sites in catalytic reactions and may experience restriction of deleterious mutations. Meanwhile, no positive selection was indicated at any specific site along the rest of gene coding regions, suggesting a divergent or neutral evolution process among both the CYP5005 and CYP5010 family members (Additional file 5).
In ciliate species, fast protein evolution has been reported for several genes, and was suggested to be the consequence of either relaxed functional constraint on the nucleosomes of amitotic macronuclei or of adaptive evolution through gene duplication coupled with the ciliates' highly processed macronuclear genomes . The low expression levels of many isoforms within the CYP5005 family, suggested they might have undergone pseudogene formation. However, the possibility cannot be ruled out that some members of these groups serve as inducible P450 isoforms that function when the cell contacts specific xenobiotic compounds from the environment. In a recent comparative genome analysis of vertebrate cytochrome P450 genes , it was suggested that all of the CYP genes that encode enzymes with known endogenous substrates are phylogenetically stable, characterized by few or no gene duplications or losses. In contrast, the unstable P450 genes that are characterized by frequent gene duplications and losses were found to constitute most of those that encode enzymes that function as xenobiotic detoxifiers. Besides, adaptive evolution has also been estimated to occur restrictively on those duplicated CYP450 genes at the amino acid sites within the SRSs regions in the protein structure. This may be due to their functional association with unstable environmental interactions such as toxin and pathogen exposure. Such specific xenobiotic-driven P450 gene expansion events were also observed in D. melanogaster  and C. elegans . The relatively high heterogeneity among the major clades within the T. thermophila P450 family members, along with their diverse expression patterns suggested that a similar situation might be occurring in this organism.
In evolving genomes, change in gene expression is one important mechanism that can lead to retention of duplicate genes, and studying the gene expression patterns has been suggested to be an important measure of gene functions that can facilitate our understanding of the genetic basis of evolutionary change [55, 56]. High-throughput approaches, such as microarray techniques, provide an opportunity to investigate gene expression of whole genomes simultaneously, allowing studies of how different genes respond to a certain environmental stimuli and the general gene expression patterns among various gene families that were categorized into different cellular functions on genome-wide scales .
We first investigated the gene-expression evolution patterns of the P450 duplicate pairs in the two largest T. thermophila P450 gene families, CYP5005 and CYP5010, by analyzing the gene-expression data from multiple microarray experiments. For the three physiological/developmental stages, the T. thermophila cells examined, the nonphylogenetic model was the best supported model both for the two families (Additional file 6). For the CYP5005 family, the best "nonphylogenetic-free" model suggested that more closely related duplicate genes are no more likely than more distantly related genes to share similar expression patterns. This may indicate either the gene family has little influence over physiological functions or the rapid rates of gene-expression evolution . Meanwhile, the "nonphylogenetic-distance" model best fit the CYP5010 family, indicating that genetic distances since last gene duplication predict change in expression, consistent with an initial coupling during evolution of expression and coding sequences, i.e. a correlation between the genetic distance in the coding region of CYP5010 family members and the change in gene expression level still can be detected (Additional file 7). This was similar to a report on duplicate genes in the yeast S. cerevisiae showing that gene expression patterns remain similar shortly after gene duplication, but the evolution of expression occurs quickly so that the patterns become distinct from each other in a relatively short period of time . Due to the observation that many CYP5005 family members are within the "expression-silent" category, the assumed gene pseudonization events probably obscured the possible correlation between the genetic distance and gene-expression divergence in the CYP5005 family.
Further, we estimated the relative duplication time of the two gene families. Since silent substitution rates were often used as an approximation to the neutral mutation rates , we calculated the 4-fold substitution rates (d4) of the synonymous sites for all the duplicate pairs for each of the two gene families. It was indicated that except for three genes (CYP5010A3, CYP5010A4 and CYP5010A5, d4<0.015) that raised by very recent duplication events (Figure 1), most P450 genes in the CYP5005 and CYP5010 families have uniform mutation rates (1.072 and 0.880, on average). Then we calculated the amino acid distance between duplicates for using as a proxy for evolutionary time. The results showed that the overall distance values are highly identical (0.748 and 0.742, on average), thus indicating that the duplication time of the two gene families was close to each other. Therefore, given that the enlargement of these P450 families in T. thermophila was probably caused by recent separate small duplication events, the divergence rate of gene expression thus may also vary between the two P450 gene families.
Since a rapid divergence of expression among the T. thermophila duplicate P450 genes could be inferred, we also tried to examine the 5' flanking regions of the T. thermophila duplicate P450 genes for identification of likely regulatory elements. It is well known that promoters and enhancers located upstream of the coding regions usually have critical roles in regulating gene expression levels, and that major P450 genes are selectively induced by different nuclear receptors in response to endogenous substances or diverse xenochemicals in multicellular organisms [60, 61]. However, the fact that the Tetrahymena genome has a 78% A/T content makes it difficult to identify regulatory elements by sequence homology to well-characterized promoters from other model organisms using transcriptional regulation databases such as TRANSFAC . Currently, except for a few regulatory elements that have been identified by deletion and mutational analysis [63, 64], little information is available on the cis-elements of specific genes in Tetrahymena. We further checked possibility whether the T. thermophila P450 duplicate pairs or their adjacent genes in the same chromosome (scaffold) have a tendency to show similar physiological/stage-specific patterns of expression. The results showed that most of these genes exhibited wide variations in their levels and stage-specificity of expression, i.e. no chromosomal or sub-chromosomal level of gene regulation was indicated (data not shown). Therefore, although the rapid divergence of upstream non-coding sequences of the relatively well-conserved P450 OFRs may contribute to the remarkably varied expression patterns among the T. thermophila duplicate P450 genes, further experimental investigations are needed to identify specific regulatory elements or trans-acting factors involved in the transcriptional induction of P450s in this organism.
In the current study, we identified 44 putative functional P450 monooxygenase genes in the model ciliate organism T. thermophila, analyzed their evolutionary relationships and characterized their expression based on both EST and microarray data, using bioinformatics tools. The current microarray data provide background information of T. thermophila P450 gene expression in normal physiological states. Our analyses provide information on the characteristics and diversity of the P450 genes in the Ciliophora, and will facilitate further functional analysis to understand the roles of individual P450 isoforms either in cellular physiological metabolism or the possible oxidative detoxification catalysis under environmental toxic exposures in this model ciliate species.
Putative T. thermophila cytochrome P450 gene sequences were retrieved from the Institute of Genome Research (TIGR; now known as the J. Craig Venter Institute) predicted peptide database (Preliminary_Gene_Predictions_Aug_2004.pep), used to search against both the T. thermophila genome sequences database from TIGR (Assembly2-Nov_2003.scaffolds) and the TGD (Tetrahymena Genome Database, http://www.ciliate.org/)  predicted gene database using Blastp with the filter option turned off. The P450 gene predictions made by Dr. David Nelson were used as an independent source (available at http://drnelson.utmem.edu/tetrahymena.FASTA.htm). The deduced amino acid sequences obtained from the above two strategies were then compared. To improve the accuracy of exon and signal peptide identification in the putative genes, the canonical GT/AG rule was used to determine the intron splice sites and all corresponding P450 nucleotide sequences were subjected to BLAST N searches against both the NCBI EST database http://www.ncbi.nlm.nih.gov/dbEST/index.html and the Protist EST Program (TBestDB, http://tbestdb.bcm.umontreal.ca/searches/organism.php?orgID=TT) . For some putative genes with uncertain intron boundaries, RT-PCR was carried out to check the cDNA sequences. Total RNA of 2 ml cultures of T. thermophila exponential phase cells (strain SB210) was isolated by the TRIzol reagent (Gibco BRL) method according to the user manual with slight modification. RNA purity and integrity were monitored by spectrophotometry and electrophoresis. RNA samples were treated with DNase I (Promega) and reverse transcribed into double strand cDNA using MMLV (Promega) in a 25 μL reaction mix. Primers (Additional file 8) were designed using Primer3 http://frodo.wi.mit.edu/cgi-bin/primer3/primer3_www.cgi. PCR products were sub-cloned into the vector pGEM-T (Promega, Madison, USA), transformed into Escherichia coli DH5α and sequenced using the ABI model 377 stretch automated DNA sequencer (PE Applied Biosystems). A pair of primers (5013A1Int-Fw, 5013A1Int-Re) that span the intron junctions was used to check the alternative splicing of the CYP5013A1 intron, and the different transcripts were further investigated by RT-PCR experiments coupled with non-denaturing PAGE electrophoresis.
All predicted P450 genes except pseudogene sequences from T. thermophila were used for alignment and phylogenetic analysis at the amino acid level. The pairwise levels of gene similarity/identity were calculated using the program MegAlign, which is embedded in the DNASTAR package (DNASTAR, Inc.) . Multiple sequence alignment (MSA) of the P450 proteins was conducted using both the CLUSTALW program at the EMBL-EBI website http://www.ebi.ac.uk/clustalw/ and the T-coffee (Tree-based Consistency Objective Function For alignment Evaluation) server http://www.tcoffee.org/, using parameters under default settings. The quality of each resulting alignment was evaluated by the CORE method, available through the T-coffee server, compared and manually improved by removing any badly aligned columns. For construction of the phylogenetic tree, the Neighbor-Joining (NJ) method (JTT matrix with different rates among sites, gamma parameter = 1.0, bootstrap test = 1000 replicates) was applied on the MSA using MEGA (version 4.0) . In addition, a maximum-likelihood (ML) tree was constructed with PhyML  (JTT matrix, four rate categories, gamma distribution parameter = estimated). The resulting tree was tested with 200 bootstrap repeats with PhyML with the same settings. Putative full length P450 genes of Paramecium tetraurelia were retrieved from (ParameciumDB http://paramecium.cgm.cnrs-gif.fr/ and the P450 website http://drnelson.utmem.edu/param3.htm, respectively. Totally, 19 P. tetraurelia P450 genes that either were consistent with the available EST data or shown to be identical in the above two predictions were chosen for the following analysis, and the standard nomenclature was used according to Nelson's website. MSA of both the T. thermophila and P. tetraurelia P450 protein sequences was conducted using the T-coffee program and manually improved and a maximum-likelihood tree was constructed by PhyML program. Phylogenetic network was conducted with the NeighborNet method using the SplitsTree program (version 4.10) , (Model = JTT, chartransform = ProteinMLdist, splitstransform = EqualAngle, gamma distribution parameter = 2.0, bootstrap test = 1000 replicates)
One T. thermophila P450 sequence (CYP5002A1, CYP5005A18, CYP5007C1, CYP5008A1, CYP5011A1, CYP5012A1 and CYP5013A1) from each of seven family clades and one P. tetraurelia P450 sequence (CYP693A1, ParameciumDB accession No. GSPATP00036495001) that have full length or partial cDNA information were selected as the representative set. Previously reported P450 protein crystal structure data were downloaded from the RCSB (Brookhaven protein data bank) website http://www.rcsb.org/, including P450cam (CYP101), P450BM3 (CYP102), P450terp (CYP108) and P450eryF (CYP107A1) from bacteria, P450nor (CYP55A1) from fungi, CYP2C5, CYP2C8 and CYP3A4 from mammals, corresponding to PDB codes 1UTU, 1ZO4, 1CPT, 1EUP, 1EHE, 1NR6, 1PQ2 and 1TQN, respectively. With each of the indicated P450 proteins, MSAs were done at the amino acid level through the 3DCoffee web server http://www.tcoffee.org/, which affords the possibility of producing an MSA with combined sequence and structure information. Judging by sequence identity from all the examined alignment results and CORE index, mammalian P450 CYP3A4 was chosen as the template for secondary structure. The program ESPript  was used for the assignment of secondary structure elements onto the corresponding aligned sequences, and substrate recognition sites were manually indicated based on the CYP3A4 enzyme information. Then obtained aligned sequences were used to investigate the conservation pattern of the T. thermophila P450 family as inferred by Consurf program http://consurf.tau.ac.il/. All putative functional T. thermophila P450 proteins were checked for likely sub-cellular localization by using the TargetP program http://www.cbs.dtu.dk/services/TargetP/ and the WoLF PSORT http://wolfpsort.org/ with the default parameters.
Cells in the growth, starvation (strain SB210) or conjugation (strains CU428 and B2086) stages of the life cycle were used for RNA extraction. Semi-quantitative RT-PCR analysis using gene-specific primers for selected P450 genes were carried out. A pair of primers designed for the T. thermophila β-actin gene (Genbank accession No. AY315823) was used as the control for normalization of expression data. PCR cycling conditions were: 5 min at 95°C; 20 s at 94°C; 20 s at 60°C; 60 s at 72°C in a 25 μL reaction with totally 30 cycles. To calculate the normalized relative gene expression levels, the same amount of PCR products underwent a 1% EB stained agarose gel electrophoresis and scanned pictures were taken. The relative expression levels were calculated by lane analysis using the QuantiScan for Windows (Biosoft, Cambridge, England) software according to the tutorial.
RNA sample preparation, Tetrahymena thermophila whole-genome oligonucleotide DNA Microarray synthesis and data analysis are described in Miao et al. . In brief, wild-type cell lines B2086 and CU428 of T. thermophila were provided by Dr. P.J. Bruns, Cornell University, Ithaca, NY. Both of these cell lines have inbred strain B genetic background, as does cell line SB210, the source of the MAC genome sequence used to design the microarray probes. Cells were grown in SPP medium  at 30°C. For microarray analyses of growing cells, CU428 cells at low, medium and high cell densities (~1 × 105 cells/ml, ~3.5 × 105 cells/ml and ~1 × 106 cells/ml; referred to as L-l, L-m and L-h) were collected. For starvation, CU428 cells at 2 × 105 cells/ml were washed and starved at 2 × 105 cells/ml in 10 mM Tris (pH 7.5); samples were collected at 3, 6, 9, 12, 15 and 24 hours referred to as S-0, S-3, S-6, S-9, S-12, S-15 and S-24). For conjugation, equal volumes of B2086 and CU428 cells that had been starved for 18 hours in 10 mM Tris (pH 7.5) at 2 × 105 cells/ml, were mixed, and samples were collected at 0, 2, 4, 6, 8, 10, 12, 14, 16 and 18 hours after mixing (referred to as C-0, C-2, C-4, C-6, C-8, C-10, C-12, C-14, C-16 and C-18). Total RNA was extracted using the RNeasy Protect Cell Mini Kit (Qiagen, Valenica, CA) according to manufacturer's instructions. cDNA synthesis and Cy3 labeling was performed by NimbleGen Systems, Inc. The custom microarrays were manufactured by NimbleGen Systems, Inc. using the maskless photolithography method described previously . For each growing and starved Tetrahymena sample, hybridizations were performed on three independent microarrays. For analysis of conjugation, hybridizations were performed on two independent microarrays. The final data were analyzed based on the RMA-processed expression values (RMA calls). The r 2, fold changes, p values and heat maps were calculated using ArrayStar software, version 2.0 (DNASTAR, Inc, Madison, WI).
G+C contents of each entire gene, first and second, third codon positions (GC, GC1, GC2 and GC3s, respectively) were calculated for each T. thermophila P450 gene using the CodonW software http://mobyle.pasteur.fr/cgi-bin/portal.py and DnaSP4.0 . GC12 was the average of GC1 and GC2 and was used for neutrality plot analysis. The strength of the selection on a given gene relative to the mutation pressure were estimated by the method of the relative neutrality plot (RNP), which consists of plotting the G+C content at the nonsynonymous positions (GC12) of the codons against the G+C content at the synonymous position (GC3s). If GC12 was as neutral as GC3s against selection, the points should be distributed along the diagonal line (slope of unity). In contrast, if GC12 was completely non-neutral, the points should be on the parallel lines of abscissa (slope of zero). Thus, the regression coefficient (slope) provided a measure of relative neutrality of GC12 to GC3s.
Correspondence analysis was used to investigate the major trend in relative synonymous codon usage variation among the genes, using the CodonW software. The RSCU value for a codon is the observed frequency divided by the frequency expected if all synonyms codons for that amino acid were used equally. Only those codons for which there is a synonymous alternative were used in the analysis. Each gene is described by a vector of 59 variables (codons). Correspondence analysis identifies the major trends in the variation of the synonymous codon usage data and distributes genes along continuous axes in accordance with these trends. RSCU values close to 1 indicate a lack of bias, while much higher and much lower values indicate preference and avoidance of that particular codon, respectively. The calculated codon usage parameters of 44 T. thermophila P450 genes were listed in Additional file 9. Correlation analysis was performed using SPSS version 12.0 and Microsoft Excel.
Since recombination may result in higher rates of false positives in maximum likelihood tests for positive selection, the possibility of intergenic recombination events in two datasets (CYP5005 and CYP5010 family) were checked using the TOPALi software , with the DSS (Difference of Sums of Squares) and PDM (Probabilistic Divergence Measures) methods (window size 100, step 2), and no positive signal was detected. Then the two datasets were selected to detect site-specific positive selection and purifying selection. For each set, an amino acid alignment was conducted using the CLUSTAL W program with default settings. The resulting alignment was used to generate the corresponding codon alignment with RevTrans , and to construct an unrooted ML tree with PhyML. The codon alignment and the phylogenetic tree were used for calculation of the ratio (ω) between non-synonymous Ka and synonymous Ks at each site of the predicted protein for all types of selection (purifying, neutral and positive) with Selecton (version 2.4) http://selecton.tau.ac.il/, according to Stern et al. . The positive selection of the two gene family members were also checked using Likelihood ratio tests (LRTs) with PAML (version 4) 
Gene-expression evolution analysis was done according to . Firstly, the amino acid sequences of CYP5005 and CYP5010 family members were aligned respectively using the CLUSTAL W program with default settings and translated to the corresponding codon alignment. The (GTR+I+G) model of evolution was selected as determined by likelihood ratio tests in ModelTest  and the phylogenetic tree was constructed by ML analysis, using PAUP* version 4.0b10 . The microarray data for each gene at different time points were log-based-two transformed and represent the character data at the taxa tips (see Additional file 7). Then both the phylogenetic tree and the transformed microarray data as a set of continuous data were read into the program CoMET (Continuous-character Model Evaluation and Testing) . The likelihood values for each of the two datasets were calculated using the default punctuation asymmetry threshold of 100. The nine maximum-likelihood models of gene-expression evolution were compared by the Akaike Information Criterion (AIC) and the best fitting evolutionary model was determined by selecting the one with the minimum AIC value. The 4-fold substitution rates (d4), i.e. the expected number of substitutions per site at the four-fold degenerate sites of the third codon position, were calculated as a measuring of the neutral mutation rates using Kumar method in the MEGA program (version 4.0) . The pairwise levels of amino acid distance between duplicate genes were calculated using the JTT Model.
The sequences reported here were deposited in GenBank (Accession No. EU349017-EU349060) at the National Center for Biotechnology Information. Microarray data have been deposited with the NCBI Gene Expression Omnibus http://www.ncbi.nlm.nih.gov/geo under accession numbers listed in document S11 in Miao et al. . (doi:10.1371/journal.pone.0004429)
Codon Adaptation Index
consistency of overall residue evaluation
Effective Number of Codons
Expressed Sequence Tag
multiple sequence alignment
nonsense-mediated mRNA decay
open reading frame
relative neutrality plot
relative synonymous codon usage
reverse transcriptase polymerase chain reaction
substrate recognition sites
unique intron position
The authors thank Prof. Martin Gorovsky and Jody Bowen (University of Rochester) for generously sharing the microarray data and their critical review of the manuscript, Prof. Eduardo Orias (University of California at Santa Barbara) for valuable suggestions on this study, and two anonymous reviewers for constructive comments on the manuscript. This work was supported by Projects of International Cooperation and Exchanges Ministry of Science and Technology of China (No.2006DFA31960), National Natural Science Foundation of China (No.30870356) and the Sunshine Foundation of Wuhan City (No.20055003059-22) to WM. The microarray work was supported by National Institutes of Health grants GM021793 and GM072752 to Prof. Martin Gorovsky when WM was a visiting scientist at the University of Rochester.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.