Microarray and comparative genomics-based identification of genes and gene regulatory regions of the mouse immune system
© Hutton et al. 2004
Received: 30 March 2004
Accepted: 25 October 2004
Published: 25 October 2004
In this study we have built and mined a gene expression database composed of 65 diverse mouse tissues for genes preferentially expressed in immune tissues and cell types. Using expression pattern criteria, we identified 360 genes with preferential expression in thymus, spleen, peripheral blood mononuclear cells, lymph nodes (unstimulated or stimulated), or in vitro activated T-cells.
Gene clusters, formed based on similarity of expression pattern across either all tissues or the immune tissues only, had highly significant associations both with immunological processes such as chemokine-mediated response, antigen processing, receptor-related signal transduction, and transcriptional regulation, and also with more general processes such as replication and cell cycle control. Within-cluster gene correlations recovered known associations among well-characterized genes and suggested immune process-related roles for poorly described genes. To characterize regulatory mechanisms and cis-elements of genes with similar patterns of expression, we used a new version of a comparative genomics-based cis-element analysis tool to identify clusters of cis-elements with compositional similarity among multiple genes. Several clusters contained genes that shared 5–6 cis-elements that included ETS and zinc-finger binding sites. The cis-element clusters AP2 EGRF ETSF MAZF SP1F ZF5F and AREB ETSF MZF1 PAX5 STAT were shared in a thymus-expressed set; AP4R E2FF EBOX ETSF MAZF SP1F ZF5F and CREB E2FF MAZF PCAT SP1F STAT occurred in activated T-cells; CEBP CREB NFKB SORY and GATA NKXH OCT1 RBIT occurred in stimulated lymph nodes.
This study demonstrates a series of analytic approaches that implicate genes and regulatory elements in the differentiation, maintenance, and function of the immune system. Polymorphism or mutation of these genes and elements could adversely affect immune system functions.
The immune system is composed of a multiplicity of individual cell types that derive from a relatively small number of immuno-hematopoietic progenitors that undergo complex developmental and exposure-driven differentiation and activation. Cell-type specific gene expression is driven in large measure by complex transcriptional regulation that orchestrates differential expression of a wide variety of genes necessary to accomplish immune effector functions. A number of specific transcription factors (TFs) that regulate gene expression in immune system cell types have been identified, largely through gene knockout experiments and isolation of protein complexes that bind to regulatory regions of target genes. Examples include PU.1/Ets, Ikaros, E2A, EBF, PAX5, GATA3, NFAT, cMYB, and OCT-2 [1–4]. These proteins bind to clusters of cis-regulatory elements in multiple diverse combinations to give rise to specific patterns of gene expression. However, the layout of regulatory and coding regions is not known for most genes that are preferentially expressed in lymphocytes and immune tissues (see, for example, [6–11]). Based on the nearly completed nucleotide sequences of the mouse and human genomes (http://genome.ucsc.edu; http://www.ensembl.org), we have sought to expand our knowledge of the structure and function of compartment-specific genes, and in particular, to find clusters of cis-elements that bind TFs and regulate gene expression during biological processes. DNA sequences of both coding regions and non-coding regions that harbor cis-elements governing expression are phylogenetically conserved [14–16]. This conservation of functionally important regions of DNA underpins current methods of identifying putative regulatory regions by comparative sequence analysis. In practice, finding relevant clusters of cis-elements is difficult and computationally intensive.
High-throughput gene expression profiling provides a powerful approach to the investigation of relative transcriptional activity as a function of biological differentiation across a variety of cells and tissues. Published examples that probe a wide variety of distinct, differentiated materials include the Human Gene Expression (HuGE) Index database (http://www.hugeindex.org) and the GNF (Genomics Institute of the Novartis Research Foundation; http://web.gnf.org/) database of human and mouse gene expression. These resources provide access to patterns of expression of a significant fraction (15–25%) of all mouse and human genes in several dozen tissues and cell types. We have created a large local database, which has permitted investigators from our campus to profile gene expression in mouse tissues and cell types specific to their interests. To do this, we used the Incyte Mouse GEM1 microarray, an 8638 element spotted cDNA gene expression platform, and a universal reference design that employed poly A+ mRNA prepared from whole day-1 postnatal mouse. Two-channel Cy3-Cy5 microarray hybridization technology was used to identify the relative strength of signals from each element of the array for a specific tissue. From this database, we identified 360 cDNAs on the microarray that exhibit preferential expression in immune tissues such as lymph nodes, thymus, and activated T-cells relative to most other types of tissues. We identified 333 genes that encode these sequences and have grouped them by biological functions and by patterns of expression.
Cis-element clusters that are conserved in pairs of orthologs are strong predictors of regulatory regions within mammalian genes [15,20,21]. We have used this method to identify putative regulatory modules, which are clusters of conserved cis-regulatory elements that occur in coordinately regulated genes of the immune system and may play a role in controlling their expression during development or mature cell function. Several of the modules identified through this approach contain cis-elements whose biological relevance has been experimentally validated in previous studies. Other computationally identified modules from this immunomic database have not been studied in detail, but the results, and a tool to analyze them further, are provided at the website http://cismols.cchmc.org. Taken together, these data provide valuable guidance for the design of experiments that seek to identify regulatory modules in genes with specific patterns of expression.
Our goal is to identify genes that are essential for the differentiation, maintenance, and function of the immune system, together with their associated regulatory elements. Polymorphisms or mutations in these might underlie well-known variation among individuals in the effectiveness of their immune response. Mouse immune genes were identified from our gene expression database, constructed using the 8638 element microarray and probed with mRNA prepared from 65 normal adult and fetal tissues. We chose to select relevant genes by collecting those expressed above a threshold value rather than by statistical analysis of variance. Given the small number of replicates and the large number of comparisons being made, we would not have enough statistical power to detect differentially expressed genes by using traditional statistical tests with appropriate specificity. In addition, with the reduced specificity of statistical tests, biologically non-significant but somewhat reproducible differences in gene expression would obscure changes that are of biologically significant magnitude but vary from replicate to replicate. Expression of genes is not discontinuous from tissue to tissue, but varies quantitatively over a wide range. The threshold to distinguish expressed from non-expressed genes was set to identify the hundred or so most highly expressed genes in each relevant tissue.
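The rank-based selection described above can be sketched as follows. This is a minimal illustration and not the procedure as it was run in GeneSpring: the function names, the toy ratios, and the exact cutoff of 100 elements per tissue (standing in for "the hundred or so") are assumptions.

```python
def top_expressed(ratios, n=100):
    """Return the n element IDs with the highest relative expression in one tissue.

    ratios: dict mapping array element ID -> relative expression ratio
    (sample / reference). n=100 is an assumed stand-in for "the hundred or so".
    """
    ranked = sorted(ratios.items(), key=lambda item: item[1], reverse=True)
    return [element for element, _ in ranked[:n]]


def implied_threshold(ratios, n=100):
    """Return the ratio of the n-th ranked element, i.e. the cutoff implied by top-n selection."""
    values = sorted(ratios.values(), reverse=True)
    return values[min(n, len(values)) - 1]


# Hypothetical toy data for a single tissue (not values from the database).
thymus = {"elemA": 8.2, "elemB": 0.9, "elemC": 5.1, "elemD": 1.4}
selected = top_expressed(thymus, n=2)
cutoff = implied_threshold(thymus, n=2)
```

Selecting by rank rather than by a fixed ratio means the implied cutoff differs from tissue to tissue, which is consistent with expression varying quantitatively over a wide range.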
Hierarchical tree clustering of the 360 sequences and 65 normal adult and fetal tissues was carried out by Pearson correlation using the log of the average of the relative expression ratio for each gene as measured in replicate arrays (Figure 1A). While the band of high expression extends across the 6 immune tissues, relative expression of each gene within the immune tissues shows distinct patterning (Figure 1B). For intestinal and fetal tissues, the areas of high expression are localized and do not include the majority of the immune genes. The functions of the genes expressed in these tissues are described below.
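The similarity measure used for the tree can be illustrated with a short sketch: each gene's profile is the log of its replicate-averaged ratio in each tissue, and two genes are compared by Pearson correlation. The base of the logarithm (base 2 here), the function names, and the toy ratios are assumptions; the choice of base does not change the correlation.

```python
import math


def log_mean_ratio(replicate_ratios):
    """Log2 of the average ratio over replicate arrays (the base is an assumption)."""
    return math.log2(sum(replicate_ratios) / len(replicate_ratios))


def pearson(x, y):
    """Pearson correlation of two equal-length profiles; undefined for constant profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


# Hypothetical replicate ratios for three genes in three tissues.
replicates = {
    "geneA": [[2.0, 2.0], [4.0, 4.0], [8.0, 8.0]],
    "geneB": [[1.0, 1.0], [2.0, 2.0], [4.0, 4.0]],
    "geneC": [[8.0, 8.0], [4.0, 4.0], [2.0, 2.0]],
}
profiles = {
    gene: [log_mean_ratio(r) for r in tissues]
    for gene, tissues in replicates.items()
}
r_ab = pearson(profiles["geneA"], profiles["geneB"])  # same shape, different level
r_ac = pearson(profiles["geneA"], profiles["geneC"])  # opposite shape
```

In this toy case geneA and geneB differ twofold in absolute level yet correlate perfectly, which is why correlation groups genes by the shape of their profile across tissues rather than by magnitude.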
Six sets of genes that are highly expressed in immune tissues, grouped by function. Genes are identified by gene symbol and GenBank accession number.
Defense - Immune - 38 Genes
Defense - 21 Genes
Signal - 47 Genes
Apoptosis - 14 Genes
Lysosomes - 6 Genes
Chemotaxis - 8 Genes
There are sets of genes that work together to produce the cellular and humoral immune responses. For example, molecules of the major histocompatibility complex present foreign peptides to T cells. They are encoded by genes such as H2-Aa, H2-Ab1, H2-DMa, H2-Eb1, H2-K, H2-L, H2-Oa, H2-Ob, H2-Q7, and B2m (Table 1, Defense - Immune). Signal transduction pathways are abundant and play critical roles in the function of lymphocytes. They link the recognition of antigens or chemokines by receptors on the cell surface to the transcription of genes required for cell division and new protein synthesis. This process of lymphocyte activation requires an intracellular signaling cascade with participation of protein kinases, G-proteins, and products of cleavage of membrane phospholipids [27–29] (Table 1, Signal). Janus kinases, encoded by genes such as Jak1, phosphorylate signal transducers and activators of transcription (Stat1, Stat3, and Stat4) as part of the lymphocytes' response to cytokines. The product of Rac2 is a G protein that participates in the cascade of kinases leading to activation of TFs. Chemokines are a family of small proteins that activate cells such as lymphocytes as part of the host response to infection. Genes that encode chemokines (Ccl4, Ccl6, Ccl19, Ccl22, Cxcl13) and chemokine receptors (Cxcr4 and Ccr2) (Table 1, Chemotaxis) are highly expressed in immune tissues.
Twenty-one sequences representing 19 known genes were highly expressed in gastrointestinal tissue (Figure 1B). Of these, 5 were classified as "Defense - Immune", namely B2m, H2-Q7, Tcrg, Tlr1, and H2-K. Of the 51 genes expressed in fetal tissues (Figure 1B), 46 are annotated. Sixteen genes functioned in protein synthesis and 13 in cell cycle/DNA synthesis. No "Defense" or "Defense - Immune" genes were highly expressed in fetal tissues. Genes expressed in fetal tissues reflect active growth and proliferation of cells. In immune tissues, these same genes are particularly well expressed in activated T-cells and thymus, where cell proliferation is occurring.
Cluster analysis of genome-wide expression data from microarrays permits the grouping together of genes with similar patterns of expression across cells, tissues, or experimental conditions. Clustering of genes by patterns of expression was first applied on a large scale to yeast, where control of important variables like genotype, phase of cell cycle, and growth conditions permits precise identification of coordinately regulated genes. Clustering has also been used to catalog mammalian genes that are differentially expressed in normal and malignant immune cells [35,36]. While yeast genes with similar patterns of expression have been found to share regulatory elements, identification of such elements in clustered genes of mammals is complex and has had limited success [38,39]. Conservation of functionally important regions of DNA underpins current methods of identifying putative regulatory regions by computational analysis of nucleotide sequences [14–16]. Using K-means clustering in GeneSpring (Version 4.2.1), 160 genes, which had been annotated using SOURCE early in our studies, were divided into distinct sets based on similarity of expression patterns across 15 tissues. Tissues were given equal weight, the number of clusters was set at 20, and similarity was measured by standard correlation. For technical reasons, GeneSpring did not assign 4 genes to clusters. The cluster sets are shown in Additional file 2.
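The K-means step can be illustrated with a small stand-alone sketch. It is not the GeneSpring implementation. Similarity is taken here as the uncentered correlation (the cosine of the angle between two profiles); treating this as what "standard correlation" denotes is an assumption, as are the function names, the fixed seeding, and the toy profiles. The analysis in the text used 20 clusters across 15 tissues, whereas the example uses 2 clusters across 3 tissues.

```python
import math


def uncentered_correlation(u, v):
    """Cosine of the angle between two profiles (undefined for all-zero profiles)."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def kmeans(profiles, seeds, iterations=10):
    """Assign each gene to its most similar centroid, then recompute centroids.

    profiles: dict gene -> list of expression values (one per tissue, equal weight)
    seeds: list of initial centroid vectors; len(seeds) is the number of clusters
    """
    centroids = [list(s) for s in seeds]
    assignment = {}
    for _ in range(iterations):
        assignment = {
            gene: max(range(len(centroids)),
                      key=lambda k: uncentered_correlation(profile, centroids[k]))
            for gene, profile in profiles.items()
        }
        for k in range(len(centroids)):
            members = [profiles[g] for g, c in assignment.items() if c == k]
            if members:  # keep the previous centroid if a cluster is empty
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


# Hypothetical profiles: g1/g2 peak in tissues 1-2, g3/g4 peak in tissue 3.
toy = {
    "g1": [2.0, 2.0, 0.1],
    "g2": [1.8, 2.2, 0.2],
    "g3": [0.1, 0.2, 2.0],
    "g4": [0.2, 0.1, 1.9],
}
clusters = kmeans(toy, seeds=[toy["g1"], toy["g3"]])
```

Seeding from two chosen genes keeps the example deterministic; a production run would normally use random or data-driven initial centroids.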
Examples of modules of shared cis-elements in genes of K-cluster set 15. All elements of a module are within a 200 bp window and are present in both the human and mouse orthologs. Modules are located within 3 kb upstream of the transcription start site.
Modules of shared cis-elements
AP2F EGRF MAZF SP1F ZBPF
AP2F EGRF SP1F ZBPF
AP2F MAZF SP1F ZBPF
AP2F HESF MAZF SP1F
MAZF SP1F ZBPF
AP2F MAZF SP1F
AP2F SP1F ZBPF
EGRF MAZF ZBPF
ETSF SP1F ZBPF
AP2F EGRF ETSF SP1F ZBPF
AP2F MAZF MZF1 SP1F ZBPF
AP2F MAZF MZF1 SP1F
AP2F MZF1 SP1F ZBPF
ETSF MAZF SP1F
EGRF ETSF ZBPF
Figure 4 shows the computationally predicted arrangement of cis-elements immediately upstream of the transcription start site (promoters) of specific individual genes: Arid1a, Abcg1, and Zfp162. Elements were required to be within 500 bp of the transcription start site to be shown in Figure 4, which focuses on sequence conservation in classical promoters of pairs of orthologs and does not require that elements be shared with other genes. Modules in Table 2 were within 3 kb of the transcription start site, a region that could include both classical promoters and upstream enhancers, and were shared by more than one pair of orthologs. A number of the elements of modules listed in Table 2 are also present in the predicted promoters. For example, MAZF and SP1F are also present in the promoters of Arid1a, Abcg1, and Zfp162.
K-Cluster set 7 includes 19 genes. As a group, the genes were better expressed in stimulated lymph nodes and activated T-cells than in the other tissues. Expression was characteristically low in peripheral blood mononuclear cells and in non-immune adult and fetal tissues. Among the genes in set 7 are the integral surface membrane protein CD72 found on B-cells; the transcription regulators Irf5 and Icsbp1; the tyrosine kinases Hck, Stk10, and Lyn, which are a part of the intracellular signaling cascade; the mitogen activated protein kinases Map3k1 and Map4k1, which participate in the very earliest steps of induction of new gene expression after lymphocytes are exposed to antigen; and the ATP-binding cassette transporters Abca7 and Tap1, of the type that transport peptides during antigen processing. Other less well-characterized genes in this set may have functions similar to those of the genes that are better annotated. Sequences of 11 genes from set 7 and their human orthologs were examined for the presence of clusters of TF binding sites, at least one of which is a lymphoid element, as defined in Methods. The 11 genes shared relatively few clusters of TF binding sites. There were 7 clusters shared by 3 genes. The largest cluster contained 6 elements, AP2F CDEF EGRF SP1F ZBPF ZF5F, and was shared by Irf5 and Stk10. There were 21 clusters shared by 2 genes and containing 3 to 6 TF binding sites. The composition and location of these are shown in images from CisMols (Additional files 3, 4, 5, 6, 7, 8 and 9).
In addition to searching for potential regulatory regions within sets of genes clustered by similarities of patterns of expression across sets of tissues and within regions immediately upstream of exon 1, we also sought to identify genes characterized by high expression in specific immune tissues. It is not known whether clustering by pattern of expression across tissues, grouping by high expression in specific tissues, both, or neither will be a useful way to group genes for computational identification of regulatory elements and regulatory regions. It is clear, however, that although modules of cis-elements that regulate expression of genes in tissues can occur at many different locations relative to a gene's promoter, at least some regulatory elements are located within promoter regions, and this is the region we have searched most intensively for conservation of known TF binding sites. For the purposes of this analysis, we defined genes as highly expressed if their normalized expression in an individual immune tissue was at least 4 times higher than their median signal across the entire database. High expression in a single tissue does not preclude significant expression in other tissues, so high expression is not synonymous with unique expression. We examined highly expressed mouse genes and their human orthologs for the presence of clusters of TF binding sites, with the additional constraint that at least one of the cis-elements present in the cluster was a lymphoid element, as defined in Methods. Grouped by tissue, suitable paired mouse/human orthologs were: activated T-cells, 17 genes: Ctsz, Kpnb1, Tnfrsf9, Tnfrsf4, Myc, Mcm2, Mcm5, Mcm6, Mcm7, Gzmb, Ncf4, Gapd, Ccl4, Pcna, Rpl13, Cd86, Icsbp1; thymus, 7 genes: Satb1, Hdac7a, Sgpl1, Abca1, Prss16, Abcg1, C1qg; stimulated lymph node, 4 genes: Stk10, Irf5, Cxcl9, Tnfrsf1.
Identical analyses of 6 genes highly expressed in skeletal muscle (Ckm, Myf6, Aldo1, Myog, Dmd, Chrm3) and 8 in liver (G6pc, Cyp7a1, Proc, Ttr, Aldo2, Ins2, Igf1, Pah) served as negative controls, i.e., tissues that do not play a critical role in lymphocyte differentiation or the immune response. The MCM family and Myc are involved in replication of DNA and chromosomes. The TNF and TNFR families of genes encode receptors and ligands that couple directly to signaling pathways for cell proliferation, survival, and differentiation. Prss16 encodes a thymus-specific protease that is expressed by epithelial cells in the thymic cortex and plays a role in T-cell development and, perhaps, in susceptibility to autoimmunity. Hdac7a encodes a histone deacetylase. Members of the Hdac family of genes modify histones and play a role in the regulation of expression of genes such as those functioning in the cell cycle, apoptosis, and transcription. Cxcl9 is an inflammatory chemokine induced by interferon. Its promoter contains binding sites for CREB, STAT1, and NFKB.
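The high-expression criterion used to form these tissue-specific gene sets (normalized expression in an immune tissue at least 4 times the gene's median across the database) can be written as a short sketch. The function name, the tissue labels, and the toy values are assumptions for illustration only.

```python
from statistics import median


def highly_expressed_tissues(profile, immune_tissues, fold=4.0):
    """Return the immune tissues in which a gene meets the fold-over-median criterion.

    profile: dict tissue -> normalized expression across the whole database
    immune_tissues: iterable of tissue names to test
    fold: required multiple of the gene's median signal (4 in the text)
    """
    baseline = median(profile.values())
    return [t for t in immune_tissues
            if t in profile and profile[t] >= fold * baseline]


# Hypothetical normalized expression of one gene in five tissues.
gene_profile = {"thymus": 9.0, "spleen": 3.0, "liver": 1.0, "brain": 1.0, "heart": 2.0}
hits = highly_expressed_tissues(gene_profile, ["thymus", "spleen"])
```

Here the median is 2.0, so only thymus (9.0) clears the 4-fold bar of 8.0. The gene is still detectably expressed in spleen, which illustrates why high expression is not the same as unique expression.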
Examples of modules of shared cis-elements found in highly expressed genes in 3 immune tissues. Elements were clustered with at least 2 other cis-elements within a 200 bp window, indicating the presence of a putative regulatory module which contained at least 3 transcription factor binding sites, one of which was required to be a lymphoid element. All are located within 3 kb upstream and 100 bp downstream of the first bp of exon 1. The modules were present in the mouse and human orthologs of at least 2 genes from sets of genes that were highly expressed in thymus, stimulated lymph nodes, or activated T-cells. The number of genes for which orthologs were available: thymus, 7; lymph node, 4; activated T-cells, 17.
Stimulated Lymph Node
MAZF SP1F ZBPF
AP2F CDEF EGRF SP1F ZBPF ZF5F
MAZF SP1F ZBPF
EGRF MAZF SP1F ZBPF
LHXF NKXH OCT1 RBIT
CREB SP1F ZBPF
ETSF SP1F ZBPF
EGRF ETSF NFKB
E2FF MAZF SP1F
AP2F EGRF HESF MAZF SP1F ZBPF
GATA HOXF NKXH
ECAT PCAT SP1F ZBPF
ETSF MAZF SP1F STAT ZBPF
LHXF NKXH OCT1
ETSF MAZF MZF1
AP2F EGRF ETSF SP1F ZBPF
NKXH OCT1 RBIT
EGRF SP1F ZBPF
EGRF MAZF P53F SP1F
IKRS MAZF NFKB
GATA HAML MYT1
E2FF EBOX ETSF MAZF SP1F ZF5F
BCL6 CREB E2FF STAT
HOXF LEFF LHXF OCT1
MAZF MZF1 NFKB PAX5
Regulatory modules that have been shown experimentally to regulate expression of genes contain multiple TF binding sites, much as is shown in Figures 2, 3 and 4. Examples of modules of shared cis-elements (i.e., within a 200 bp window) in highly expressed genes are listed in Table 3. For example, of the modules found in genes highly expressed in thymus, MAZF SP1F ZBPF was present in paired orthologs of Abca1, C1qg, Abcg1, and Sgpl1; module AP2F EGRF HESF MAZF SP1F ZBPF was present in Sgpl1 and Abcg1. Of the modules found in genes highly expressed in stimulated lymph nodes, AP2F CDEF EGRF SP1F ZBPF ZF5F was present in Stk10 and Irf5; module GATA HOXF NKXH was also present in Stk10 and Irf5. Of the modules found in genes highly expressed in activated T-cells, E2FF EBOX ETSF MAZF SP1F ZF5F was present in Kpnb1 and Mcm6; module BCL6 CREB E2FF STAT was present in Icsbp1 and Tnfrsf4.
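The module criteria stated for Table 3 (at least 3 binding-site families within a 200 bp window, at least one of them a lymphoid element, present in both mouse and human orthologs, and shared by at least 2 genes) can be sketched as below. This is a simplified illustration and not the CisMols algorithm: it records only the full set of families found in each window rather than every sub-combination, it compares composition but not order or spacing, and the `LYMPHOID` set is a hypothetical placeholder because the actual list is defined in Methods. All names and coordinates are assumptions.

```python
LYMPHOID = frozenset({"ETSF"})  # hypothetical placeholder, not the authors' list


def window_modules(sites, lymphoid=LYMPHOID, window=200, min_size=3):
    """Return sets of TF families that co-occur within `window` bp of a site.

    sites: list of (position, family) tuples for one sequence.
    A set is kept only if it has >= min_size families and >= 1 lymphoid element.
    """
    sites = sorted(sites)
    modules = set()
    for i, (start, _) in enumerate(sites):
        families = frozenset(f for pos, f in sites[i:] if pos - start < window)
        if len(families) >= min_size and families & lymphoid:
            modules.add(families)
    return modules


def shared_modules(genes, min_genes=2):
    """Keep modules present in both orthologs of a gene and in >= min_genes genes.

    genes: dict gene -> (mouse_sites, human_sites)
    """
    carriers = {}
    for gene, (mouse_sites, human_sites) in genes.items():
        conserved = window_modules(mouse_sites) & window_modules(human_sites)
        for module in conserved:
            carriers.setdefault(module, set()).add(gene)
    return {m: g for m, g in carriers.items() if len(g) >= min_genes}


# Hypothetical binding-site coordinates for three ortholog pairs.
genes = {
    "geneX": ([(100, "MAZF"), (150, "SP1F"), (220, "ETSF"), (900, "GATA")],
              [(400, "SP1F"), (430, "ETSF"), (520, "MAZF")]),
    "geneY": ([(50, "ETSF"), (90, "MAZF"), (180, "SP1F")],
              [(10, "MAZF"), (60, "SP1F"), (100, "ETSF")]),
    "geneZ": ([(10, "GATA"), (50, "OCT1"), (90, "NKXH")],
              [(10, "GATA"), (50, "OCT1"), (90, "NKXH")]),
}
result = shared_modules(genes)
```

In this toy input geneZ carries a conserved three-element cluster, but it is excluded because none of its elements is in the placeholder lymphoid set.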
Individual differentiated biological states can be characterized by gene expression profiling. Large-scale comparisons of profiles of cells, tissues, and developmental stages have the potential to identify a wealth of coordinately regulated groups of genes that reflect the interplay of their functional relationships and transcriptional control mechanisms. We have built a database comprising the mRNA expression profiles of 65 normal adult and fetal C57BL/6J mouse tissues using the Incyte Mouse GEM1, 8638 element, clone set. Using microarray analysis, 680 sequences were identified that were highly expressed in one or more of 6 immune tissues. Many were also expressed in certain other tissues. Some of these other tissues were organs such as heart, kidney, and brain, which do not normally contain lymphocytes in large numbers and do not play a role in the immune response. Others, such as intestine and lung, interface with the external environment, contain significant numbers of lymphocytes, and can mount an immune response. The 680 expressed sequences were filtered to remove 320 that were expressed in "non-immune" brain, heart, or kidney. This resulted in a list of 360 expressed sequences, called "immune genes", that were less broadly expressed in tissues without immune function than were the 680. Mutations and polymorphisms in both the 680 expressed sequences and the 360 immune genes have a significant chance of specifically affecting immune function. We predicted that this would be more common with changes in the 360 immune genes. We tested this by comparing reports of disease-causing mutations in the 360 immune genes with those reported for the 320 genes that were more broadly expressed (Online Mendelian Inheritance in Man, OMIM). Of the 360 mouse immune genes, 32 had an ortholog with a gene symbol in OMIM and 17 had annotations that described a function clearly linked to development or function of the immune system.
Mutations in 2 (LCP2 and PARVG) cause severe immunodeficiency disease. Examples of other diseases caused by mutations in these 32 genes were B- and T-cell malignancies, autoimmune disorders, and reduced viral or bacterial resistance. Of the 320 genes removed from the list of 680, 37 had orthologs with gene symbols listed in OMIM. Four genes were expressed in lymphocytes, and mutation in one, Bruton's tyrosine kinase, causes agammaglobulinemia. Mutations in other genes caused disorders of coagulation, red cells, or granulocytes, rather than the immune system. We conclude that the list of 360 immune genes includes a higher percentage of genes preferentially expressed in immunocompetent tissues and with more specific immune-related functions than does the full list of 680 sequences, which includes sequences also expressed in non-immune tissues.
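The filtering step that reduced the 680 sequences to 360 amounts to a set difference, sketched below. The function name, the sequence identifiers, and the toy sets are assumptions; only the logic (remove any immune-tissue sequence that is also expressed in brain, heart, or kidney) follows the text.

```python
def split_by_breadth(immune_hits, non_immune_hits):
    """Split immune-tissue sequences into 'immune genes' and broadly expressed ones.

    immune_hits: set of sequence IDs highly expressed in >= 1 immune tissue
    non_immune_hits: dict tissue -> set of sequence IDs expressed in that tissue
    Returns (kept, removed); every input sequence lands in exactly one of them.
    """
    expressed_elsewhere = set().union(*non_immune_hits.values())
    removed = immune_hits & expressed_elsewhere
    return immune_hits - removed, removed


# Hypothetical identifiers; the real lists held 680, 360, and 320 sequences.
immune_hits = {"s1", "s2", "s3", "s4", "s5"}
non_immune_hits = {"brain": {"s2", "s9"}, "heart": {"s4"}, "kidney": {"s2"}}
immune_genes, removed = split_by_breadth(immune_hits, non_immune_hits)
```

The two returned sets partition the input, which matches the counts reported in the text (360 retained plus 320 removed equals 680).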
The 360 immune genes represent a portion of the complete set of genes that encode proteins and processes necessary for the differentiation, maintenance, and function of the immune system. These genes are functionally diverse and represent both ubiquitous and specialized cellular processes. Ten or more genes are in specific functional clusters that carry out general processes such as DNA and chromosomal replication, cell cycle regulation, transcription, and translation. Other genes are in functional clusters that carry out specialized functions, largely restricted to immune tissues. These include genes that encode proteins involved in antigen recognition and transport, chemokine synthesis, chemokine recognition, and the intracellular signaling cascade necessary to initiate transcription and new protein synthesis in lymphocytes as part of the host response to antigen. Functional annotation of these genes is a work in progress. While probable functions have been assigned to most of the expressed sequences and the genes that encode them, using information shown in Additional file 1, there is much work to be done. Most functional annotations are based on the sharing of presently known protein domains and sequence homologies and provide general clues to the role a gene or protein may play in cells that participate in the immune response to antigen. A more precise understanding will come about as new laboratory data are correlated with studies of the expression of specific immune genes, their coordination with expression of other genes, and the structure and function of their products.
For several reasons, the "immune genes" that we have identified are not all of the genes that are expressed in immune tissues: (1) the Incyte set of 8638 genes probably contains representative cDNAs from 25% or less of all mouse genes; (2) genes that are essential to immune function but are expressed at similar levels in immune and other tissues will not be included in the immune set; and (3) a gene with a very low level of expression will be missed if cDNA made from its RNA is not present in sufficient quantity to give a signal on the microarray. Genes may be included or excluded in error because of the large number of genes screened for expression with a limited number of replicates. Incyte cDNA microarrays are no longer manufactured, and no Incyte arrays or public databases are available to check expression of our immune genes in other species. Two relevant publicly available gene expression databases from the Genomics Institute of the Novartis Research Foundation can be accessed. One uses the Affymetrix U74Av2 chip and a set of 90 mouse tissues and cell lines; the second uses the Affymetrix HG U133 chip and 158 human tissues and cells. Relating Affymetrix probes to Incyte cDNA probes is complex, and the Novartis tissue sets do not contain the same tissues we have used. However, our immune genes, when expressed on the Novartis arrays, are generally clustered in tissues of the immunohematopoietic system, the gastrointestinal tract, and lung. These types of publicly available databases will permit identification and functional annotation of new immune genes, with consequent availability of larger sets of coordinately regulated genes for searches of conserved regulatory modules.
Using comparative genomics-based cis-element analyses (http://trafac.cchmc.org and http://genometrafac.cchmc.org), we identified compositionally similar clusters of cis-elements in upstream regions of mouse/human orthologs of several immune genes. There was excellent agreement between the computationally predicted and experimentally determined arrangements of cis-elements in the promoters of the mouse H2-K and human HLA-A genes. Analyses of other immune genes identified a wealth of potential immune system-specific regulatory modules. For example (Table 2), Arid1a, Abcg1, and Sgpl1 are members of a K-clustered set of immune genes and share a phylogenetically conserved module of 5 cis-elements, AP2F EGRF MAZF SP1F and ZBPF, all within a 200 bp interval. Other examples of clustered TF binding sites that could be within regulatory modules of genes highly expressed in specific tissues are given in Results. Striking examples of putative modules include the 6 cis-element module AP2F EGRF HESF MAZF SP1F ZBPF in genes highly expressed in thymus; the 6 cis-element module E2FF EBOX ETSF MAZF SP1F ZF5F in genes highly expressed in activated T-cells; and the 6 cis-element module AP2F CDEF EGRF SP1F ZBPF ZF5F in genes highly expressed in stimulated lymph nodes (Table 3). Putative regulatory modules are not distributed randomly across an entire segment of DNA, but are highly clustered within distinct short segments that are the computationally identified promoters and enhancers (Figure 3). Because of the nature of the scanning algorithm with its 200 bp window, variations of multiple modules may occur within one segment. These phenomena are more easily understood by examining Figure 3.
Our data support the hypotheses that (1) regulatory modules of genes are highly clustered in a few sites that can be computationally identified, (2) modules in different genes may share cis-elements that bind TFs, and (3) certain combinations of TF binding sites are phylogenetically conserved and appear to be reused across genes when specific patterns of expression are required. Cis-elements from the same family have a high probability of interacting with similar groups of transcription factors, although they will not necessarily be in the same position relative to the transcription start site. We have identified genes and putative regulatory modules that play a role in the differentiation, maintenance, and function of the immune system. These results serve both to advance our understanding of normal gene and immune system function and to identify genes and regulatory regions whose mutation or polymorphic variation may lead to immunologic disease.
C57BL/6J mice from The Jackson Laboratory were the source of normal adult and fetal tissues. The complete panel of tissues for microarray analyses by our group has been described. Peripheral blood mononuclear cells were separated from whole blood on Ficoll/Hypaque gradients; unstimulated lymph nodes, spleen, and thymus were each collected from unimmunized mice and pooled separately; "stimulated" lymph nodes were collected from mice 10 days after they were immunized with hen egg-white lysozyme (HEL) in complete Freund's adjuvant; activated T-cells were prepared by enriching T-cells from peripheral blood and treating them with anti-CD3 and anti-CD28. Except for activated T-cells and pancreatic islet cells, all cells and tissues were collected in duplicate. In total, 128 preparations of poly(A) RNA were made from 65 different tissues, checked for quality, and quantified as previously described [19,47].
Microarray analyses were carried out using Incyte mouse GEM1 cDNA arrays (Incyte Genomics, Palo Alto, CA), as described previously for our group [19,47]. Relative abundance of probes was calculated as the ratio of the sample value against the value from the labeled whole mouse reference cDNA for each gene on each array. Data analyses were carried out with GeneSpring version 4.2.1 (Silicon Genetics) software, including filtering, K-means and hierarchical clustering. A list of all tissues in the full set of 65 normal adult and fetal tissues is provided in Additional file 12. Our analyses focused on comparison of gene expression in 18 tissues that were selected to represent a variety of adult and fetal tissues (Figure 1B), most with immunological function. Six of the 18 tissues were the "immune tissues": unstimulated and stimulated lymph nodes, spleen, peripheral blood mononuclear cells, activated T-cells, and thymus. The remaining 12 tissues of the 18 tissue set were: fetal day 16.5 intestine and lung; adult duodenum, jejunum, ileum, proximal and distal colon; adult lung and liver; and joint synovium from normal adult mice and mice with acute and chronic arthritis. All pertinent microarray data are available through the Children's Hospital Research Foundation expression database web server (http://genet.chmcc.org) within the ExpressionDB folders of the Incyte Mouse GEM1 chip genome.
Genes on the Incyte array were identified by NCBI GenBank accession and systematic numbers and by gene symbol, where available. For those sequences that could not be assigned a gene symbol, sequence homologies to known mouse genes were sought using MouseBLAST , BLAT , MGD , and LocusLink . BLAST comparisons of the human and mouse confirmed Ensembl predictions of human orthologs of mouse genes. Identity of genes was confirmed by BLAST comparison of the GenBank sequences from NCBIhttp://www.ncbi.nlm.nih.gov with Ensembl  sequences. When downloading the genomic sequences with flanking sequences, it was important to have an mRNA that contained exon 1, so the site of initiation of transcription was correctly identified. Presence of an upstream exon 1 in an isoform would lead to re-defining of the promoter and intronic regions. Criteria for presence of exon 1 included: comparison of the number and location of exons in orthologous genes, alignment of transcripts of the gene as reported by different databases, and alignment of the 5' end of the transcript with the putative start site and signals in the gene. In cases where we encountered multiple high scoring transcript hits against the genome, we manually looked into the alignments to rule out the occurrence of pseudogenes that frequently lacked introns when compared to the "true" genes. Additional information about sequences of both the transcript and the gene was obtained from UCSC Golden Path . Confirmation of the presence of exon 1 in orthologs was particularly important because of the need to locate the start site of transcription. Computational prediction of exons is error prone. DNA sequences of genes were downloaded to include at least 10,000 flanking base pairs upstream and downstream of the first and last exons respectively. 
The November 2002 and April 2003 assemblies of the human genome and the February 2002 and February 2003 assemblies of the mouse genome were used for this purpose, depending upon their availability at the time of our analyses (Additional files 10 and 11 list relevant FASTA sequences and genomic coordinates).
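As an illustration of the flanking rule (at least 10,000 bp beyond the first and last exons), the following sketch computes the coordinates of a region to download. The function name, the 1-based inclusive coordinate convention, and the example exon positions are assumptions and do not come from the assemblies listed above.

```python
def flanked_region(exons, flank=10_000, chrom_length=None):
    """Return (start, end) of a gene region extended by `flank` bp per side.

    `exons` is a hypothetical list of (start, end) pairs in 1-based,
    inclusive chromosome coordinates. Because the flank is the same on both
    sides, the result does not depend on strand. The region is clipped to
    the chromosome ends when `chrom_length` is given.
    """
    gene_start = min(start for start, _ in exons)
    gene_end = max(end for _, end in exons)
    region_start = max(1, gene_start - flank)
    region_end = gene_end + flank
    if chrom_length is not None:
        region_end = min(region_end, chrom_length)
    return region_start, region_end


# Hypothetical two-exon gene.
print(flanked_region([(25_000, 25_200), (30_000, 30_500)]))
```

Here `flank` is the minimum stated in the text; a larger value can be passed when more flanking sequence is wanted.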
The GO and MGI databases were searched for annotations of the immune genes using Stanford SOURCE (http://source.stanford.edu). For genes that were not found or were incompletely annotated, manual annotation was done using criteria similar to those of the Gene Ontology (GO), Mouse Genome Informatics (MGI), and LocusLink (http://www.ncbi.nlm.nih.gov/LocusLink) classifications. A function was assigned if the encoded protein contained distinctive InterPro functional domains, or had sequence similarity to previously annotated paralogs or to functionally characterized SwissProt/TrEMBL proteins. Using the information about structure and function, we simplified the annotations and grouped genes by major functions, such as antigen binding and processing (defense - immune function), transcription, protein synthesis, apoptosis, and cell division. Highly detailed annotations are provided in the supplementary materials (Additional file 1).
To identify putative consensus cis-acting regulatory sequences in genes that were coordinately regulated, we first selected groups of genes based on their expression patterns in different immune tissues. The complete genomic sequences (with flanking upstream and downstream regions of 40 kb) of the selected genes and their orthologs were extracted from the Ensembl/UCSC human and mouse databases [12,13]. Where available, the NCBI-RefSeq mRNAs were used as references for downloading the genomic sequences with upstream and downstream gene flanking regions of 40 kb. The transcription start site was thus at position 40,000 in the downloaded sequences used in comparative genomic analysis for identification of potential regulatory clusters with the TraFaC server. Repeat elements were masked using RepeatMasker (http://ftp.genome.washington.edu). Conserved clusters of regulatory elements in the evolutionarily conserved non-coding regions of mouse and human orthologs were displayed using the TraFaC (http://trafac.cchmc.org) or GenomeTraFaC (http://genometrafac.cchmc.org) servers, which integrate results from the MatInspector Professional (Version 4.1, 2004; 356 individual matrices in 138 families; http://www.genomatix.de) and Advanced PipMaker (chaining option; http://pipmaker.bx.psu.edu/cgi-bin/pipmaker?advanced) programs. We compared conserved putative cis-regulatory regions of each of the different groups of genes from mouse and human to identify known TF binding sites. The CisMols analyzer (http://cismols.cchmc.org) permits selection of TFs that must be present in clusters of TFs that constitute a putative regulatory module. To confer specificity on the search for modules relevant to regulation of gene expression in immune tissues, we required the presence of one or more of the following TFs, which we call "lymphoid elements".
These elements have been reported to play a role in some aspect of lymphoid biology (see, for example, [1–3]): BCL6, CMYB, CREB, EGRF, ETSF, GATA, IKRS, IRFF, MZF1, NFAT, NFKB, OCT1 (site also binds OCT2), PAX5, SP1F, VMYB, and WHZF. ECAT and PCAT were also included because of their frequent occurrence in promoters at the start of transcription. The search was limited to a region from 3 kb upstream to 100 bp downstream of the start site of exon 1 (based on the NCBI-RefSeq mRNA annotations). This is where the promoter and associated regulatory elements would be expected, although additional regulatory elements (enhancers/silencers) are almost certain to be located elsewhere. Images of the CisMols analyses of genes to identify regulatory elements are also provided in the supplementary materials (Additional files 3 to 9). One example is shown in Figure 3.
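The search constraints described above (a window from 3 kb upstream to 100 bp downstream of the start site, and at least one required element) can be sketched as a simple filter. This is not the CisMols algorithm. The function name, the input format, and the hit positions are hypothetical, and the sketch assumes that each sequence is oriented with the gene on the forward strand, that the start site sits at position 40,000, and that ECAT and PCAT count toward the required set.

```python
# Matrix family names as listed in the text.
LYMPHOID_ELEMENTS = {
    "BCL6", "CMYB", "CREB", "EGRF", "ETSF", "GATA", "IKRS", "IRFF",
    "MZF1", "NFAT", "NFKB", "OCT1", "PAX5", "SP1F", "VMYB", "WHZF",
}
# Assumption: ECAT and PCAT are treated as part of the required set.
REQUIRED_ELEMENTS = LYMPHOID_ELEMENTS | {"ECAT", "PCAT"}

TSS_POSITION = 40_000  # start site in the downloaded sequence (40 kb flank)
UPSTREAM = 3_000       # bp searched upstream of the start site
DOWNSTREAM = 100       # bp searched downstream of the start site


def candidate_module(hits, tss=TSS_POSITION):
    """Return sorted families found in the search window, or [] if none
    of them belongs to the required set.

    `hits` is a hypothetical list of (family, position) pairs, with
    positions given in downloaded-sequence coordinates.
    """
    in_window = sorted(
        {family for family, pos in hits
         if tss - UPSTREAM <= pos <= tss + DOWNSTREAM}
    )
    if any(family in REQUIRED_ELEMENTS for family in in_window):
        return in_window
    return []


# Hypothetical binding-site hits for one gene.
hits = [("ETSF", 39_500), ("MAZF", 39_900), ("NFKB", 20_000), ("ZF5F", 40_050)]
print(candidate_module(hits))
```

In this example the NFKB hit is discarded because it lies 20 kb upstream of the start site, outside the search window, while ETSF satisfies the required-element condition for the remaining cluster.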
This work was supported in part by an award from the Howard Hughes Medical Institute to the University of Cincinnati for the development of Bioinformatics Core Resources, and by grants NIEHS U01 ES11038 and ES06096 Mouse Centers Genomics Consortium, Center for Environmental Genetics, NCI Mouse Models of Human Cancer Consortium and the National Library of Medicine G08 LM007853 IAIMS. We thank Amy Sherman of Incyte Genomics, Andrew Conway of Silicon Genetics, Paul Spellman and Rodney DeKoter for valuable discussion.
This article is published under license to BioMed Central Ltd. This is an open-access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.