Functional annotation of novel lineage-specific genes using co-expression and promoter analysis
© Kumar et al. 2010
Received: 23 November 2009
Accepted: 9 March 2010
Published: 9 March 2010
Skip to main content
© Kumar et al. 2010
Received: 23 November 2009
Accepted: 9 March 2010
Published: 9 March 2010
The diversity of placental architectures within and among mammalian orders is believed to be the result of adaptive evolution. Although, the genetic basis for these differences is unknown, some may arise from rapidly diverging and lineage-specific genes. Previously, we identified 91 novel lineage-specific transcripts (LSTs) from a cow term-placenta cDNA library, which are excellent candidates for adaptive placental functions acquired by the ruminant lineage. The aim of the present study was to infer functions of previously uncharacterized lineage-specific genes (LSGs) using co-expression, promoter, pathway and network analysis.
Clusters of co-expressed genes preferentially expressed in liver, placenta and thymus were found using 49 previously uncharacterized LSTs as seeds. Over-represented composite transcription factor binding sites (TFBS) in promoters of clustered LSGs and known genes were then identified computationally. Functions were inferred for nine previously uncharacterized LSGs using co-expression analysis and pathway analysis tools. Our results predict that these LSGs may function in cell signaling, glycerophospholipid/fatty acid metabolism, protein trafficking, regulatory processes in the nucleus, and processes that initiate parturition and immune system development.
The placenta is a rich source of lineage-specific genes that function in the adaptive evolution of placental architecture and functions. We have shown that co-expression, promoter, and gene network analyses are useful methods to infer functions of LSGs with heretofore unknown functions. Our results indicate that many LSGs are involved in cellular recognition and developmental processes. Furthermore, they provide guidance for experimental approaches to validate the functions of LSGs and to study their evolution.
Placentae exhibit remarkable variation in tissue structure and morphology within and between mammalian clades, and even within a single mammalian order . The diversity of placental architectures is thought to be the result of adaptive evolution arising from rapidly diverging and novel genes [2–4]. A greater understanding of the functional roles that these genes play would provide insights into the molecular basis for the unique phenotypic and metabolic adaptations among closely related mammalian species. Toward that end, we previously identified and bioinformatically characterized novel transcripts in cattle using placenta as a source tissue . These transcripts are lineage-specific (LSTs), and the genes that encode them have no detectable homology to genes outside of that lineage (LSGs). Functional elucidation of LSGs remains a daunting task and only a few have been characterized beyond their expression patterns [5–10]. A complementary approach that would direct the genetic and biochemical characterization of LSGs and their products is functional inference using co-expression  and promoter analysis .
Gene expression is regulated by a complex interaction of transcription factors (TFs) and their binding sites (TFBS) on the gene promoter. Co-expression analysis is based upon the assumption that a high degree of similarity in gene expression profiles correlates with relatedness of their functions . Genes that are highly co-expressed are often regulated by common transcription factor(s), forming sub-networks of genes with a common function . As a general rule, co-regulated genes share a specific arrangement of TFBSs on their promoters. The TFBSs are often located in a specific order relative to the transcription start site (TSS) as well as in a particular orientation with respect to the promoter . For example, Kindy et al.  showed that both strands of the c-myc gene are transcribed in an overlapping fashion and that transcription of the coding and non-coding strands is regulated independently. Yu and coauthors  showed a strong correlation between inter-TFBS distances and their orientation with respect to each other, demonstrating that a combination of TFs rather than an individual TF is the functional unit in tissue-specific gene regulation. Others have shown that the inter-TFBS distance between functionally over-represented TFBS pairs can vary significantly from 10 to 200 bp, although it may be greater than 200 bp in some cases [16–18]. These findings provide insights into factors governing the interactions between specific TFs and document TF pairs that are predicted to act synergistically in a tissue-specific manner  or at specific stages of development .
In a previous work we identified 91 cattle- and cetartiodactyl-specific novel transcripts that included coding sequences and noncoding RNAs (ncRNAs) . In the present work, we have inferred functions of a subset of these LSTs using co-expression analysis. In addition, we identified over-represented TFBSs and their composites in the promoters of co-expressed genes and searched existing databases and the literature for pathways and functions in which these TFs may play a synergistic role in a specific tissue or developmental stage. Using these functional inferences, we predicted sub-networks of genes that may be co-regulated with the LSGs. Our results predict that subsets of these LSGs function in glycerophospholipid/fatty acid metabolism and protein trafficking in liver and near-term placenta, and in processes involving the initiation of parturition and immune system development.
From the liver time-series dataset , 28 of the 49 LSTs that had tissue profiles and 4,711 known expressed genes were selected for clustering after data filtering (see Methods). Two large clusters were identified with average pairwise Pearson correlation r ≥ 0.75, and r ≥ 0.90 between any LST and transcripts encoded by known genes. The identity of the genes in these clusters overlapped, and instead of merging the clusters by lowering the correlation threshold, we selected the largest cluster containing four LSTs and 208 known transcripts (LIVR) for further analysis. These transcripts were co-expressed at seven time-points and two diets (Figure 2). Apart from liver, the genes in this cluster were expressed at higher levels in adrenal gland, cerebrum, and placentome (Additional file 1).
Summary information for nine LSTs co-expressed with known genes.
Bt, Ss, Oa
Bt, Ss, Oa, Ch, Ec,
Over-represented ordered TFBS pairs and unordered TFBS triplets in LIVR, PLAC and THYM co-expression clusters.
aTFBS singles and pairs
CRSD pathway (<10-03) and
Agrin in postsynaptic differentiation;
Agrin in postsynaptic differentiation
Wnt signaling pathway;
nicotinate and nicotinamide
metabolism; signal transduction
AP-2, ZF5, c-Ets
adipocytokine signaling pathway;
HIV-I Nef: negative effector of Fas
glycerolipid metabolism (with STAT
family); prion pathway; mod027529
EGFR-specific transcription factor
(ETF) not found in CRSD
mod003360; mod065501; mod070287
Phosphatidylinositol signaling system;
N-glycan biosynthesis; ribosome;
phospholipase C-epsilon pathway
Ingenuity Pathway Analysis of gene clusters.
Significantbfunctions (F) and canonical pathways (C)
Genes included in the function
glycerophospholipid metabolism (C)
LYPLA1, PGS1, PLCE1, PLCL2
FABP5, GLRX, GRRP1, MET, MLH1, PLCE1, PLXNB2, ASB2, KCTD11, CDK3
repair of DNA (F)
CDC5L, ERCC1, MLH1, NHEJ1, NTHL1, POLI, XRCC1
immune response of organism (F)
CD48, GATA3, MX1
development of epidermis (F)
ALDH3A2, FABP5, GJB5
Wnt/beta-catenin signaling (C)
CDH1, CSNK1G2, DKK1, TLE4
acute-phase response signaling(C)
FOS, HMOX2, PTPN11, SOD2
tissue morphology--size (F)
CDKN1C, DLX5, IGF2, STC1, PTGS2, FOS, CDH1
small molecule biochemistry- transport of amino acids and synthesis of prostaglandin (F)
SLC7A3, STX1A, COMT, PTGS2, IGF2, FOS, IGFBP7, CYP4A22, BCAT, STC1, MAN2A1, PTPN11, TFPI, SOD2
embryonic development-- proliferation and formation of embryonic tissue (F)
ESM1, MED28, PTGS2, CDH1, DKK1, FOS, HAND1
development of embryonic and trophoblast cells (F)
CDKN1C, HAND1, IGF2, PTPN11
cell cycle--entry into cell stage (F)
CDH1, CDKN1C, FOS, MAD2L1, PTPN11, SOD2
CYP4A22, IGF2, IGFBP7, PTGS2, STC1, PTPN11, COMT
cell adhesion (F)
CASK, CD151, CDH1, IGFBP7, MAD2L1, MAN2A1, PTPN11, PVRL2, TFPI
CASK, CDH1, CDKN1C, DKK1, DLX5, FADD, FOS, GATAD2A, HAND1, IGF2, MED28, MSX1, PTPN11, RP13-122B23.3, SNAPC2, SOD2, SPEN, TARBP2, THOC4, TLE4, UBTF, ZNF281
cancer--cell death of tumor cell lines(F)
CDH1, CDKN1C, DKK1, FADD, FOS, IGF2, IGFBP7, IHPK2, MAD2L1, MSX1, PTGS2, PTPN11, SOD2, UBTF
cellular growth and proliferation (F)
BTG1, CDCA7, ELF1, HMGB1, NCOR2, PCNA, PTK2, TCF12, ZFP36L2, ASXL1
cell death (F)
PCNA, TRAP1, PLA2G7, BTG1, NCOR2, HMGB1, PTK2, TCF12, ZFP36L2
gene expression--transcription and transactivation (F)
HMGB1, HMGB2, PCNA, ELF1, ASXL1, NCOR2, ZBTB7A, BTG1, PTK2, TCF12, NXF1
immune and lymphatic system development and function (F)
HMGB1, TCF12, PTK2, CDCA7, NCOR2
The PLAC cluster was expressed preferentially in placentome and consisted of 116 genes, including three that are LSGs, 34FL, 22JE, and 104JE (Table 1; Figure 3A). On the basis of PSI-BLAST search  and multiple sequence alignments we have annotated one of the LSTs, 34FL [GenBank: NM_001105478], as an SSLP-1 (secreted seminal vesicle protein) homolog, which belongs to a class of secreted Ly6 domain containing proteins. The predicted protein product of 34FL, like the SSLP-1 glycoprotein in mouse , has 10 cysteines and contained the conserved C-terminal CCXXXXXCN motif, indicating that it is a member of the SSLP-1 secreted Ly-6 glycoprotein subfamily (Additional file 3). In addition, 34FL was predicted by PSORTII  to contain a signal peptide, and was localized to the extracellular region providing evidence that it is a secreted protein. Furthermore, the 34FL gene was located on BTA29 in an orthologous region that is syntenic with mouse SSLP-1 on MMU9.
The PLAC cluster was not found to be enriched for any single TFBS or TFBS triplets. However, we identified four TFBS composite pairs in the cluster (Table 2). The pair, STAT*Pax-2 (signal transducer and activator of transcription; paired homeobox 2), was predicted in the LSG 34FL, PAG2 (Pregnancy associated glycoprotein), and PTGS2 (COX2, prostaglandin-endoperoxide synthase 2). The motifs predicted ab initio by ANN-Spec in the cluster had significant matches to NF-κB (nuclear factor kappa B), MAZ (Myc-associated zinc finger), and Sp1 TFBSs. All three sites were predicted in the cluster at varying frequency, although none were predicted in an LSG (Additional file 2). The cluster was found to be enriched for Wnt/β-catenin signaling and acute phase response (APR) signaling pathways. Other enriched IPA functions in the PLAC cluster were transport of amino acids and synthesis of prostaglandins, adhesion, development of trophoblast cells, and lipid metabolism.
A thymus-specific cluster (THYM) was identified, consisting of 32 genes, including two LSGs 383NG and 21PW. Both of these are single-exon transcripts (Figure 3B) and have multiple ESTs from different libraries as evidence of transcription. 383NG is a paralog that has been duplicated in two other locations on the same chromosome . The THYM cluster was found to be enriched for v-Myb (myeloblastosis viral oncogene homolog) and KROX (also EGR, early growth response gene) TFBSs (Table 2). Three TFBS composite pairs were over-represented in the THYM cluster, of which one, Nkx2-5*CdxA, was predicted in the LSG 21PW and ASXL1 (Additional sex comb-like 1). An ab initio predicted motif matched the IRF (Interferon regulatory factor) TFBS. IRF-1 was identified in 21% of the genes in the cluster, including the LSGs 383NG and 21PW (Additional file 2). An analysis of the THYM cluster using IPA showed enrichment for genes involved in apoptosis, immune and lymphatic system development, transcription and trans-activation, and cell proliferation (Table 3).
Functional elucidation of a novel gene is a challenging task. We have used an informatics-based strategy (Figure 1) to infer functions of a set of LSGs first found expressed in a cattle term-placenta cDNA library . This was accomplished by generating co-expression clusters (Figures 2 and 3) using LSTs as seeds to cluster other genes from two microarray datasets consisting of transcript profiles from 18 cattle-tissues and liver of animals fed two different diets at several peripartal time-points. We then identified over-represented TFBSs and their composites in the promoters of co-expressed genes, and searched existing databases and the literature, for pathways and functions in which these TFs may play a role in a specific tissue or developmental stage. Yu and co-authors found that genes targeted by the same TF tend to be co-expressed, with the degree of co-expression increasing if genes share more than one TF . This provides significant validation of our approach, and gives us confidence in the sub-networks of co-regulated genes that were identified. We present below a synthesis of our results with the aim of supporting the inferred functions of LSGs in each cluster.
The LIVR cluster was found to be enriched for genes in the glycerophospholipid metabolism pathway, DNA repair, transport, cell death, organ development of epidermis, and immune response functions (Table 3). These pathways and functions are also characteristic of term placenta , which was the source tissue used to create the cDNA library from which the LSTs were identified. In support of the correlated pathways and functions of genes in liver and placenta we also found that the LIVR cluster genes are expressed in placentome (Additional file 1). Glycerophospholipid metabolism plays a significant role in the onset of labor in humans , and apoptosis and immunological processes are known to represent important cellular functions in term-placenta . The overlapping functions likely represent common subpopulations of cells in liver and placenta, such as macrophages and lymphocytes.
Genes in the LIVR cluster were enriched for p53 and Oct-1 TFBSs. p53 exerts a variety of regulatory effects following DNA damage . An Oct-1 TFBS has been predicted in the 39NG promoter along with a PPARγ site. PPARγ works in concert with Oct-1 to mediate transcriptional activation of GADD45 (growth arrest and DNA damage-inducible gene 45) . The presence of both Oct-1 and PPARγ sites on the 39NG promoter suggests a role for the encoded protein in DNA repair processes in response to DNA damage. In addition, the protein is predicted by PSORTII  to be a nuclear protein, which supports such a role. A paired TFBS composite, Srebp-1*Pax-8, was significantly over-represented in the LIVR cluster, and was predicted in two LSGs, 237NG and 266NG. It was also predicted in PLCE1, NGLY1, and TRIP10, which have known roles in fatty acid (FA) metabolism, turnover of glycoproteins, and lipid binding, respectively (see Additional file 4 for protein functions). Srebp-1 is known to regulate genes involved in the biosynthesis of fatty acids, triglycerides and phospholipids in liver and adipocytes , and has been shown to play a role in glycerophospholipid metabolism  suggesting that 237NG and 266NG are also involved in these processes. Smith and coauthors reviewed evidence that show Pax-8 works together with Srebp-1 to target PPARγ (peroxisome proliferator-activated receptor gamma) in adipocytes and liver . Some of these LIVR genes were shown to form sub-networks that participate in glycerophospholipid metabolism, protein transport and signaling pathways in liver (Figure 5). The LSG, 237NG, is inferred to play a role in glycerophospholipid metabolism and cytokine signaling, and is one of the hub genes.
Inferred biological functions of LSTs.
Involved in glycerophospholipid/fatty acid metabolism, cell
signaling and protein trafficking in epithelial cells. 39NG
possibly plays a role in DNA repair processes in response to
DNA damage. Responsive to differences in pre-partum plane of
nutrition at time-points +1, +14 after onset of lactation (Figures
Preferential expression in placentome; involved in immune
response, acute phase and inflammatory processes. 34FL is a
pre-term and term placentome-specific SSLP-1 glycoprotein,
possibly involved with PAG2 and PTGS2 in the final events
before parturition at the feto-maternal interface.
Preferentially expressed in thymus and may play a role in
immune system development and cell-proliferation. 21PW may
play a role in gene activation in fetal thymus development.
Expression of the LIVR genes was found to be affected by pre-partum diet. They were down-regulated by restricted feeding at +1 and +14 days postpartum suggesting that the predicted functions (e.g., apoptosis, glycerophospholipid metabolism, DNA repair mechanisms, and cell signaling) are down-regulated during the early post-partum period when the animals are fed restricted diets that do not meet 100% of the estimated energy requirements during the non-lactating period. This management strategy is more successful in preparing the animal to the onset of parturition and lactation, and leads to lower incidence of metabolic disease . Therefore, animals on a higher plane of nutrition (i.e. consuming diets to meet or exceed energy requirements) show increased inflammatory responses, apoptosis, and DNA repair; a conclusion shared by Loor and coauthors . Above, we suggested that glycerophospholipid metabolism is a common function in liver and near-term placenta in animals approaching labor and delivery. Metabolic processes in both tissues have been shown to be affected by diet in non-ruminants. For example, in pregnant mice the FA composition in the mother's diet influences the maternal liver and fetal placenta FA composition [45, 46]. These findings suggest that the LIVR genes, many of which are involved in FA-linked functions, protein transport and cell signaling, play similar diet-responsive roles in both liver and placenta of pregnant animals (Figure 5), given that nearly all (99%) of the LIVR genes, including the LSGs, are also expressed in the placenta (Additional file 1).
The PLAC cluster genes were found to be preferentially up-regulated in placentome and enriched for specific processes in the placenta; e.g. transport of amino acids and synthesis of prostaglandins, trophoblast cell adhesion, lipid metabolism, transcription, and cell proliferation (Table 3). The cluster is also enriched for acute phase response (APR) genes, which function to restore homeostasis. These APR gene products are a variety of serum proteins synthesized in increased amounts in response to trauma and infection. Given that labor and delivery result in oxidative and immunological stresses, with APR and apoptotic responses in placental tissue , APR enrichment provides a snap-shot of these processes in near-term placenta. The cluster is also enriched for Wnt/β-catenin signaling, which has been shown to play a central role in coordinating uterus-embryo interactions required for implantation in mouse .
The composite TFBS pair, STAT*Pax-2, was over-represented in three co-expressed genes; 34FL, PAG2, and PTGS2. The PWM for the predicted STAT binding site is common to a range of STAT proteins that are involved in the development and function of the immune system and play a role in maintaining immune tolerance and tumor surveillance. PTGS2 is a biosynthetic isoenzyme that was shown in pregnant cows and guinea-pigs to be involved in intrauterine prostaglandin (PG) synthesis, which is crucial for the initiation of parturition [49, 50]. PTGS2 was found to be 20-fold greater in cattle term placentomes (delivery at 260 days or later) compared with preterm placentomes (delivery between day 174 and day 260 of gestation) further supporting its role in parturition . Given that our data show that 34FL (a predicted SSLP-1 glycoprotein), PAG2, and PTGS2 are highly co-expressed and predicted to be regulated by STAT TFs, we suggest that 34FL also plays a role in pregnancy and/or parturition.
The ANN-Spec motifs predicted ab initio in the PLAC cluster have significant matches to TFBSs for NF-κB (nuclear factor kappa B), MAZ (Myc-associated zinc finger), and Sp1 (Additional file 2). NF-κB is known to initiate transcription for a variety of genes that are involved in immune response, acute phase and inflammatory processes . It has been located in human fetal membranes and decidua at term and pre-term delivery . The physiological expression of COX-2 (PTGS2) in rat trophoblast involves a sustained activation of NF-κB, and its inhibition abrogates the inducibility of PTGS2 . This result functionally links NF-κB and PTGS2 with the other co-expressed genes in the PLAC cluster, suggesting a complex role for glycoproteins including 34FL in initiating and orchestrating the cell biology at the feto-maternal interface before parturition (Table 4).
The thymus is an immune system organ that is of central importance to the maturation of T lymphocytes. Genes in the THYM cluster are enriched for the related functions immune system and lymphatic system development, cell death, and cellular growth and proliferation (Table 3). The v-Myb TFBS was over-represented in the THYM cluster and predicted in LSG 383NG. The v-Myb oncogene product causes late onset T cell lymphomas when expressed in the T cell lineage of transgenic mice , thus suggesting a role for this LSG 383NG in cell-proliferation. The TFBS composite pair Nkx2-5*CdxA was over-represented in promoters of 21PW and ASXL1 (additional sex comb-like 1). CdxA and Nkx2 have been shown to be markers for endoderm germ layer patterning during gastrulation, a process necessary for formation of the thymus . The AsxL1 gene in Drosophila is required to maintain homeotic gene activation and silencing, and its homologs have been identified in mouse and found to be expressed in thymus . The roles played by the TFs CdxA and Nkx2 in endoderm germ layer patterning, and that of ASXL1 in homeotic gene activation and silencing support a role for the LSG 21PW in thymus development. Furthermore, the IRF-1 TFBS, which regulates IL-15 gene expression and influences the development of T-cells and natural killer cells in the thymus , is predicted in LSGs 383NG and 21PW (Additional file 2). Taken together, these findings implicate 383NG and 21PW in immune system development and cell-proliferation (Table 4).
We selected the placenta as a model system to identify and functionally characterize novel LSGs because of its unique characteristics as a rapidly evolving physiological system in mammals. As we and others have shown, the placenta is a rich source of expressed LSGs and rapidly diverging genes [2, 3, 57–60]. Such genes are candidates for adaptive placental functions acquired by the ruminant lineage. We used a combination of cluster analysis, promoter analysis, WCGNA, and gene annotation to predict the functions of nine previously uncharacterized LSGs (Table 4) from a starting set of 49 (18%). The stringent analysis criteria produced unique and highly correlated gene expression clusters among 18 different tissues and across seven time-points and two diets in liver (Figures 2 and 3). The three clusters analyzed contained nine LSTs, seven of which are encoded by presumptive novel protein encoding LSGs and two are presumptive ncRNAs . Our results represent a major advance in characterizing the novel LSTs expressed in bovine placenta and have yielded predictions of functions that are consistent with their putative role in ruminant reproductive and immune physiology.
As additional animal genomes are sequenced and the numbers of novel genes with unknown functions increases, our approach establishes a valuable precedent for future studies. We show that it is possible to identify and characterize a significant fraction of lineage-specific genes bioinformatically, which may guide hypothesis-driven experiments to determine their biochemical and cellular functions. These may in turn yield new insights into the role of LSGs in speciation and adaptive evolution.
In a previous study, 91 novel transcripts were identified in a cattle placenta cDNA library. These LSTs were characterized on the basis of their genomic distribution and annotation in Btau_2.0 and expression patterns in 18 cattle tissues . For the present work, the annotation was updated to Btau_3.1 (December 2007) . Of the original 91 LSTs, 63 currently have no matches to non-Cetartiodactyl sequences in public databases . The remaining 28 transcripts were not considered in this study as they were re-annotated as representing divergent homologs.
Two cDNA microarray expression datasets profiling ~7,000 cattle genes were used. The cDNAs used for the array were selected from a near-term cattle placenta cDNA library . The first dataset (GEO GSE3029) was obtained by profiling total RNA from 18 cattle tissues . For this dataset, transcripts were included in the analysis if the intensity was above the median signal intensity of negative control spots present on the array, and in addition, the minimum intensity was 250 units in at least one sample-point. The second dataset [GEO: GSE3331] was generated by temporal gene expression profiling of liver RNA during the peripartal period in Holstein cows fed with a moderate energy ad-libitum, or restricted diet in which the animals were fed to consume ca. 80% of their calculated energy requirements from -65 days until parturition . The temporal data spanned -65 to +49 days relative to parturition for animals receiving each diet. Expression levels of the transcripts were analyzed further if the intensity was above the median signal intensity of negative control spots present on the array, the minimum raw intensity was 150 in at least one sample-point, and the relative expression compared to the control was statistically significant in at least one sample-point with a raw P-value (P) < 0.05. For both datasets, only those intensity spots that were flagged as 'present' were included in the analysis.
Among the 63 LSTs, 49 were present in at least one of the 18 tissues with a raw intensity of 250 (Additional file 5). In addition to these 49 LSTs, expression levels of 6,178 transcripts passed this filter. The LSTs were clustered using Pearson correlation (r) threshold of 0.90. A representative was selected from each cluster and the un-clustered LSTs were self-represented. Genes on the array that co-expressed with each of these LSTs at r ≥ 0.90 were grouped into clusters that included all co-expressed LSTs. The cluster was adjusted to bring the average cluster r ≥ 0.75.
Of the 49 LSTs, 28 were present in at least one liver sample with a raw intensity of 150 and significantly expressed at P < 0.05 compared to a control mixture of tissues excluding liver. Expression of 4,711 unique genes passed these filter conditions. The temporal profiles of the LSTs were clustered hierarchically using gene-condition clustering as implemented in GeneSpring . The liver gene expression profiles of representative LSTs were used as seeds to identify co-expressed genes from the 4,711 genes on the array using Pearson correlation (r) at a threshold r ≥ 0.90. As before, the clusters were adjusted to bring the average cluster r ≥ 0.75.
A mixed effects model using the SAS procedure Proc MIXED (SAS Institute, Cary, North Carolina, USA)  was used on the 212 unique genes in the liver cluster to determine expression differences between groups of animals on two diets (moderate energy ad-libitum and restricted) at different time points (-65, -30, -14, +1, +14, +28, +49 days). The LOG2-transformed ratios were analyzed for each gene using a mixed model that included the fixed effect of diet within time point. Statistically significant P-values for the models were adjusted for multiple comparisons using the Benjamini-Hochberg false discovery rate (FDR) correction .
Functions, gene symbols and genome coordinates were assigned to each clone accession on the array using RefSeq (Btau_3.1) and human protein annotations in UCSC genome browser tables . Manual curation of the clusters involved removing identical genes and using the UCSC browser to check if each gene was annotated with the correct gene symbol and genome coordinates. This manual inspection was crucial for ensuring the transcription start site of genes and their promoter regions.
Mammalian regulatory elements are concentrated near transcription start sites (TSS). For this reason, promoter analysis was concentrated on the proximal promoter region, -1000 to +100 bp relative to the TSS. Both unmasked and repeat-masked promoter sequences [-1000, +100] were extracted for gene clusters from Btau_3.1 using the UCSC Genome Table browser. To identify TFBSs that are over-represented in the gene clusters, we used promoters of unique cattle RefSeq genes as the background set. The coordinates for a non-redundant set of Btau_3.1 RefSeq genes were downloaded from UCSC Genome Table Browser, and the proximal promoters [-1000, +100] were extracted as described for the clusters.
Vertebrate-specific TFBSs were predicted by scanning the repeat-masked promoters using the Match program  with a core similarity threshold of 0.9 and a matrix similarity threshold of 0.85. The promoters in each cluster were searched against a predefined matrix profile in the TRANSFAC Professional 11.4 database . This database contained a set of 214 high-quality, vertebrate-specific, non-redundant position weight matrices (PWMs) with minimized false positives ("vertebrate_non_redundant_minFP" with high-quality matrices selected). Only a single occurrence of a TFBS was counted in each promoter, and the predicted TFBS counts in each cluster were compared to those in the cattle RefSeq promoter set using Fisher's exact test (FET), which is based on a hypergeometric distribution. The computed P-values were adjusted for multiple comparisons using the Benjamini-Hochberg FDR correction .
Two TFBSs were defined as co-occurring if they were distinct, non-overlapping, and their PWMs had a core similarity threshold of 0.9 and matrix similarity threshold of 0.85 in the output from the Match program. The Match output lists the PWM matches in their positional order on the promoter. Both ordered (A-B ≠ B-A relative to the TSS) and unordered (A-B = B-A) TFBS pairs and triplets were predicted separately, and the orientation of the TFBSs was ignored. To prevent double-counting, only a single occurrence of a combination was counted per gene. For the unordered combinations, non-redundancy was ensured by collapsing each identified combination in its sorted order (A-B-A or A-C-B or B-A-C = A*B*C), and then counting only a unique occurrence of the unordered combination within a gene promoter. Unordered TFBS composites were denoted with a comma separating the sites. Composite ordered TFBSs were denoted with an asterisk between the sites indicating that they were predicted to be co-occurring in that order in the promoter relative to the TSS. TFBS pairs and triplets were predicted for three different minimum threshold distances of 20 bp, 50 bp, and 100 bp between the TFBSs to identify all adjacent non-overlapping TFBS combinations . The maximum allowed inter-TFBS distance was set to 250 bp.
The counts of the ordered pair and triplet TFBSs were computed for each cluster of genes and compared to the counts of the respective pairs and triplets in the background RefSeq promoters using Fisher's exact test. A minimum cell count of five was necessary for comparisons. The computed P-values were adjusted for multiple comparisons using the FDR correction as before, and comparisons were deemed significant if the adjusted P-value was ≤ 0.1. The entire analysis was carried out for both repeat-masked and unmasked promoters and significant predictions in the repeat-masked promoters had to be predicted in the unmasked cluster promoters to be selected. This precautionary measure ensured that no predictions were within repeats. TFBS prediction results were manually checked for the presence of over-represented composites.
ANN-Spec  was used for ab initio prediction of motifs that were common to an entire cluster. For each cluster of genes, the motif predictions were made on unmasked promoters, using the unmasked Btau_3.1 RefSeq promoters from which the cluster genes were subtracted as background. ANN-Spec was run iteratively by varying the predicted motif length from 6 to 16 bp and setting the run cycle (parameter m) to 100. The PWMs of predicted motifs were parsed from the ANN-Spec output. Tomtom  was used to compare the predicted PWMs with TRANSFAC v11.4 PWMs and comparisons with P < 0.01 were deemed significant. Logos depicting the frequency of each nucleotide at each position of a motif were generated for the ANN-Spec -predicted and corresponding matching Transfac PWMs using the EnoLOGO web server .
Ingenuity Pathway Analysis (version 5.5)  was used to identify functional enrichment in the clusters, using the respective source gene sets (6,149 genes in tissue experiments, 4,711 in liver time-series experiments) as reference. The Ingenuity Pathway Knowledge Base (IPKB) was used as the source database for biological function and pathway assignment to genes. The significance threshold for function and pathway enrichment was P ≤ 0.05.
To identify known pathways in which the TFs were involved we queried the CRSD, which consists of miRNA, TF and gene expression regulatory signatures assigned to specific BioCarta and KEGG pathways using genome-wide enrichment analysis . A Perl script was written that accepted a TFBS composite and parsed the dataset for the co-occurrence of the TFs in the composite in a common pathway at a P < 10-03. In addition, we used the Predicted Regulatory Module (PREMOD) database  to identify any known modules within our set of TFBS composites.
where aij is the adjacency between two genes i and j, x is the expression of a gene, and β is the power factor for a scale-free network. For the LIVR cluster, this power was 8 as determined by the scale-free network criterion provided by the authors . Default parameters were used for module generation. Gene connectivity was determined, and the top five genes with the highest connectivity (hub genes) were identified using 1.2 as a cutoff for gene significance and intramodular connectivity (K/Kmax) cutoff of 0.95.
GeneGO MetaCore  was used to identify known interactions in the LIVR cluster of genes, modeled on the human interaction database included in GeneGO.
Transcription factor binding site
position weight matrix
cluster of genes expressed in liver and showing effect of diet
cluster of genes preferentially expressed in cattle placenta
cluster of genes preferentially expressed in cattle thymus
cluster of genes preferentially expressed in cattle skin
cluster of genes preferentially expressed in cattle adrenal gland, thalamus, and cerebellum
Fisher's exact test
Benjamini-Hochberg false discovery rate
We would like to thank Prof. Sheng Zhong, Department of Bioengineering, University of Illinois at Urbana-Champaign, for providing critical advice on gene clustering using the seeding approach and Dr. Denis Larkin for his helpful comments on the manuscript. We would also like to thank the UCSC genome bioinformatics staff for their help at various times with data download using the genome browser tables.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.