angaGEDUCI: Anopheles gambiae gene expression database with integrated comparative algorithms for identifying conserved DNA motifs in promoter sequences
BMC Genomics volume 7, Article number: 116 (2006)
The completed sequence of the Anopheles gambiae genome has enabled genome-wide analyses of gene expression and regulation in this principal vector of human malaria. These investigations have created a demand for efficient methods of cataloguing and analyzing the large quantities of data that have been produced. The organization of genome-wide data into one unified database makes possible the efficient identification of spatial and temporal patterns of gene expression, and by pairing these findings with comparative algorithms, may offer a tool to gain insight into the molecular mechanisms that regulate these expression patterns.
We provide a publicly accessible database and integrated data-mining tool, angaGEDUCI, that unifies 1) stage- and tissue-specific microarray analyses of gene expression in An. gambiae at different developmental stages and temporal separations following a bloodmeal, 2) functional gene annotation, 3) genomic sequence data, and 4) promoter sequence comparison algorithms. The database can be used to study genes expressed in particular stages, tissues, and patterns of interest, and to identify conserved promoter sequence motifs that may play a role in the regulation of such expression. The database is accessible from the address http://www.angaged.bio.uci.edu.
By combining gene expression, function, and sequence data with integrated sequence comparison algorithms, angaGEDUCI streamlines spatial and temporal pattern-finding and produces a straightforward means of developing predictions and designing experiments to assess how gene expression may be controlled at the molecular level.
The sequenced genome of the principal vector of human malaria parasites in sub-Saharan Africa, Anopheles gambiae, has raised expectations for the development of new and unexpected ways to manage or manipulate vector populations to control disease transmission. As part of efforts to meet these expectations, we generated and organized large data sets using gene expression microarrays to quantify genome-wide transcription in different developmental stages and tissues of this mosquito [3, 4]. Arrangement of these data into a searchable format has streamlined the elucidation of genes expressed with stage-, tissue-, and sex-specificity. In addition, by juxtaposing these microarray findings with DNA comparative algorithms, the regulation of genes co-ordinately expressed in specific spatial and temporal patterns can be studied at a mechanistic level. We provide here a public database and web-based data-mining tool that combine stage and tissue expression microarray data, functional annotation, and regulatory DNA sequence comparison algorithms to provide insight into gene expression and regulation in An. gambiae.
Construction and content
Stage-specific transcriptional signal values were imported from genome-wide microarray analyses of An. gambiae larvae, male sugar-fed adults, female sugar-fed adults, and female blood-fed adults 3, 24, 48, 72, 96 hours and 15 days after a bloodmeal using Affymetrix GCOS software. Values from tissue-specific microarray analyses also were imported using GCOS to quantify genome-wide transcription in fat bodies, midgut, and ovaries at 24 hours after bloodfeeding [3, 4]. Functional gene annotation was imported from the Ano-Xcel database to populate angaGEDUCI with keywords and annotation from the ENSEMBL, NCBI non-redundant, GO, PFAM, and SMART databases. Promoter sequences were selected as regions 1.5 kilobases (kb) in length adjacent to the 5'-ends of transcription start sites of genes using genomic data from ENSEMBL (Assembly: AgamP3, Feb 2006; Genebuild: VectorBase, Feb 2006; Database version: 37.3). Transcription factor binding sites from several classes of organisms were imported from the Transcription Factors Database (TFD) available publicly at ftp://ftp.ncbi.nih.gov/repository/TFD/datasets/. Of the 7,066 sites listed in TFD, 6,639 (94.0%) are eight nucleotides or longer and 623 (8.82%) contain degenerate notation. Five hundred and eleven sites in the database were identified in insects (7.23%), of which 499 (97.7%) are eight nucleotides or longer, and 34 (6.65%) contain degeneracy.
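The promoter definition above (the 1.5 kb adjacent to the 5'-end of each transcription start site) can be illustrated with a short Python sketch. It is not the extraction procedure used to build the database. The function names, the 0-based coordinate convention, and the handling of minus-strand genes by reverse complement are assumptions made for illustration.

```python
# Illustrative sketch only; names and coordinate conventions are assumptions.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def promoter_region(chrom_seq: str, tss: int, strand: str, length: int = 1500) -> str:
    """Return up to `length` bases immediately upstream of a transcription start site.

    `tss` is assumed to be the 0-based index of the first transcribed base on
    the forward-strand sequence `chrom_seq`. The region is truncated at the
    ends of the sequence.
    """
    if strand == "+":
        start = max(0, tss - length)
        return chrom_seq[start:tss]
    if strand == "-":
        # Upstream of a minus-strand gene lies at higher forward coordinates;
        # report it in the orientation of the gene.
        end = min(len(chrom_seq), tss + 1 + length)
        return reverse_complement(chrom_seq[tss + 1:end])
    raise ValueError("strand must be '+' or '-'")


if __name__ == "__main__":
    toy = "AAACCCGGGTTT"
    print(promoter_region(toy, 6, "+", length=3))
    print(promoter_region(toy, 5, "-", length=3))
```

With the default `length=1500` the function returns the 1.5 kb window described in the text, or a shorter sequence when the gene lies near the end of a scaffold.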
The data have been stored as a MySQL relational database that is accessible directly through an Apache web server. A web-based data mining interface is used to manage queries to identify genes that meet specific expression, keyword, and sequence criteria (Figure 1). A sequence comparison program based on the Boyer-Moore algorithm is built into the data-mining interface for comparison of promoter regions of genes within a selected gene set.
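For readers unfamiliar with Boyer-Moore-style searching, the sketch below shows the Horspool simplification, which keeps only the bad-character shift rule of the full algorithm. It is an illustration of the general technique and not the program built into angaGEDUCI. The function name and the example sequences are hypothetical.

```python
# Boyer-Moore-Horspool search (bad-character shifts only); illustrative sketch.
from typing import Dict, List


def horspool_search(text: str, pattern: str) -> List[int]:
    """Return the start positions of all exact occurrences of `pattern` in `text`."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    # For each character in the pattern (excluding its last position), store the
    # distance from its rightmost occurrence to the end of the pattern.
    shift: Dict[str, int] = {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}
    hits: List[int] = []
    pos = 0
    while pos <= n - m:
        # Compare the window against the pattern from right to left.
        j = m - 1
        while j >= 0 and text[pos + j] == pattern[j]:
            j -= 1
        if j < 0:
            hits.append(pos)
        # Shift according to the text character aligned with the pattern's end;
        # characters absent from the table allow a full-length jump.
        pos += shift.get(text[pos + m - 1], m)
    return hits


if __name__ == "__main__":
    print(horspool_search("GATTACAGATTACA", "TTAC"))
```

The right-to-left comparison lets the search skip several positions at once when the window ends in a character that is rare in the pattern, which is the property that makes this family of algorithms attractive for scanning long promoter sequences.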
The main page of the database provides hyperlinks to: Filter Database, Import Gene Set, Download Data, View Database, Submit Study, Documentation, and Contact. Selection of the Filter Database link opens the data-mining interface and allows users to focus on specific genes that satisfy input criteria based on: 1) stage- and tissue-specific expression, 2) annotated keywords, 3) DNA sequences present in promoter, 3' untranslated regions (UTR), or coding regions, or 4) presence of specific transcription factor binding sites (Figure 1). Queries are conducted by stepwise entry of input criteria with each query imposed on the previous so that all genes currently displayed meet all preceding query criteria as well as the criterion that was last entered. Once a gene set of interest has been selected, users then can use the analysis menu in the interface to search for conserved DNA motifs within the promoters of the gene set, view expression profiles, build a distribution of annotated keywords, or export the set for future retrieval (Figure 2). Detailed annotation and expression data for each gene also can be viewed at any time by selecting the gene identifier link to invoke the description of a gene entry.
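The stepwise behaviour described above, in which each query is imposed on the result of the previous one, can be sketched in Python as follows. This is a toy model under stated assumptions and not the angaGEDUCI implementation: the class name, the record fields, and the example genes are all hypothetical.

```python
# Toy model of cumulative, stepwise filtering; all names and data are hypothetical.
from typing import Callable, Dict, List, Tuple

Gene = Dict[str, object]


class FilterSession:
    """Hold the currently displayed gene set and narrow it one query at a time."""

    def __init__(self, genes: List[Gene]) -> None:
        self.current: List[Gene] = list(genes)
        self.history: List[Tuple[str, int]] = []

    def apply(self, label: str, predicate: Callable[[Gene], bool]) -> List[Gene]:
        """Keep only the genes in the current set that satisfy `predicate`."""
        self.current = [g for g in self.current if predicate(g)]
        self.history.append((label, len(self.current)))
        return self.current


genes = [
    {"id": "geneA", "keywords": {"kinase"}, "promoter": "TTGACGTCAT"},
    {"id": "geneB", "keywords": {"kinase", "immunity"}, "promoter": "AAACCCGGGT"},
    {"id": "geneC", "keywords": {"immunity"}, "promoter": "GGTGACGTAA"},
]

session = FilterSession(genes)
session.apply("keyword: immunity", lambda g: "immunity" in g["keywords"])
session.apply("promoter contains TGACGT", lambda g: "TGACGT" in g["promoter"])

if __name__ == "__main__":
    print([g["id"] for g in session.current])
    print(session.history)
```

Because every call to `apply` operates on `self.current` rather than on the full gene list, the genes remaining after the last step satisfy all earlier criteria as well, which mirrors the behaviour of the interface.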
Description of a gene entry
Each gene has a corresponding data page that can be accessed by selecting the gene identifier link during data retrieval. Gene entry pages display data from microarray expression analyses for stage- and tissue-specific expression and functional annotation as gathered by Ano-Xcel from ENSEMBL, NCBI non-redundant, GO, PFAM, and SMART databases (Figure 3). A link to the VectorBase database that contains additional, centralized gene data also is provided on each entry page. User-contributed notes and a form for sharing notes for a gene entry are found below the annotation of each gene. To encourage data sharing, note submission does not require user pre-registration.
Comparing promoters to identify conserved DNA sequence motifs
After clustering genes into gene sets that show similar patterns of expression, the data-mining interface analysis menu can be used to search for common DNA motifs that may act as regulatory sequences in coordinating these expression patterns. Two parameters must be selected to begin the analysis: 1) motif match length, the length of the conserved sequence motif to search for; and 2) mismatches, the number of base mismatches allowed between two nearly conserved sequence motifs without disqualification.
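The roles of the two parameters can be illustrated with a brute-force Python sketch that compares every window of one promoter against every window of another. It is deliberately simple and is not the Boyer-Moore-based program used by the database. The function names and the toy sequences are hypothetical.

```python
# Brute-force comparison of two sequences; illustrative sketch, not the database's method.
from typing import List, Tuple


def within_mismatches(a: str, b: str, limit: int) -> bool:
    """True if equal-length strings `a` and `b` differ at no more than `limit` positions."""
    differences = 0
    for x, y in zip(a, b):
        if x != y:
            differences += 1
            if differences > limit:
                return False
    return True


def conserved_motifs(seq_a: str, seq_b: str, length: int,
                     mismatches: int = 0) -> List[Tuple[int, int]]:
    """Return (position in seq_a, position in seq_b) for every pair of windows
    of `length` bases that differ at no more than `mismatches` positions."""
    hits: List[Tuple[int, int]] = []
    for i in range(len(seq_a) - length + 1):
        window_a = seq_a[i:i + length]
        for j in range(len(seq_b) - length + 1):
            if within_mismatches(window_a, seq_b[j:j + length], mismatches):
                hits.append((i, j))
    return hits


if __name__ == "__main__":
    print(conserved_motifs("ACGTAC", "GGACGTTT", length=4, mismatches=0))
    print(conserved_motifs("ACGTAC", "GGACGTTT", length=4, mismatches=1))
```

In this toy example, allowing one mismatch increases the number of reported window pairs from one to three, which illustrates why relaxing the mismatch parameter or shortening the motif length tends to enlarge the output.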
The resulting output from the analysis contains three parts. First, a comparison matrix is displayed indicating the number of conserved motifs found in each pair-wise comparison among every gene in the gene set (Figure 4). Each link in the matrix invokes a new page that prints the promoter sequences of the two genes being compared with areas of sequence conservation and transcription factor binding sites highlighted (Figure 5). Second, a table of the conserved motifs is displayed that compares the frequency of occurrence of each conserved motif within the gene set against the frequency of each motif in all 1) exons, 2) exons and introns, and 3) promoters within the An. gambiae genome (Figure 6). Each motif that matches or contains a transcription factor binding site is indicated in the same output. The third item displayed is a table indicating the frequency of occurrence of each transcription factor binding site of any size found within the gene set (Figure 7). Because transcription factor binding sites in TFD vary in size and may contain degenerate notation, the frequencies reported in this third table are noticeably higher than those in the conserved motif table that precedes it.
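The first two parts of this output can be approximated with the Python sketch below, which counts shared exact motifs for every pair of genes and then compares how often a motif occurs in the gene set and in a background collection. It rests on assumptions that the text does not specify: only exact matches are counted, and "frequency" is taken to mean the fraction of sequences that contain the motif. All names and sequences are hypothetical.

```python
# Sketch of a pairwise comparison matrix and a motif frequency comparison.
# Assumptions: exact matches only; frequency = fraction of sequences containing the motif.
from itertools import combinations
from typing import Dict, Iterable, Set, Tuple


def kmers(seq: str, length: int) -> Set[str]:
    """Return the set of distinct substrings of `length` bases in `seq`."""
    return {seq[i:i + length] for i in range(len(seq) - length + 1)}


def comparison_matrix(promoters: Dict[str, str], length: int) -> Dict[Tuple[str, str], int]:
    """Count distinct motifs of `length` bases shared by each pair of promoters."""
    motif_sets = {gene: kmers(seq, length) for gene, seq in promoters.items()}
    return {
        (a, b): len(motif_sets[a] & motif_sets[b])
        for a, b in combinations(sorted(motif_sets), 2)
    }


def fraction_containing(motif: str, sequences: Iterable[str]) -> float:
    """Return the fraction of `sequences` that contain `motif` at least once."""
    sequences = list(sequences)
    if not sequences:
        return 0.0
    return sum(motif in seq for seq in sequences) / len(sequences)


promoters = {"g1": "ACGTACGT", "g2": "TTACGTAA", "g3": "GGGGGGGG"}
background = ["ACGTTTTT", "CCCCCCCC", "TTTTTTTT", "AAAAAAAA"]
matrix = comparison_matrix(promoters, length=4)

if __name__ == "__main__":
    print(matrix)
    print(fraction_containing("ACGT", promoters.values()))
    print(fraction_containing("ACGT", background))
```

A motif that is found in a larger fraction of the gene set than of the background is a candidate for further inspection, although a comparison on toy data of this size says nothing about statistical significance.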
Visualization of transcription profiles
The transcription profiles for a gene set can be viewed in batch by using the analysis menu from the data-mining interface after a gene set has been selected. The resulting graphs print transcriptional expression according to developmental stage: larvae, male sugar-fed adults, female sugar-fed adults, and female blood-fed adults 3, 24, 48, 72, 96 hours and 15 days after a bloodmeal (Figure 8).
A keyword distribution listing all keywords found in a gene set, as gathered by Ano-Xcel, and their respective frequency of occurrence, can be constructed by using the analysis menu from the data-mining interface (Figure 9).
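A keyword distribution of this kind amounts to counting how many genes in the set carry each keyword. The Python sketch below shows one way to do this. The function name, the choice to count each keyword once per gene, and the example annotations are assumptions made for illustration.

```python
# Sketch of a keyword distribution; counts each keyword at most once per gene (an assumption).
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def keyword_distribution(gene_keywords: Dict[str, Iterable[str]]) -> List[Tuple[str, int]]:
    """Return (keyword, number of genes) pairs, most frequent first."""
    counts: Counter = Counter()
    for keywords in gene_keywords.values():
        counts.update(set(keywords))  # de-duplicate within a gene
    # Sort by descending count, then alphabetically for a stable listing.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


example = {
    "g1": ["serine protease", "immunity"],
    "g2": ["immunity"],
    "g3": ["immunity", "serine protease", "immunity"],
}

if __name__ == "__main__":
    for keyword, n_genes in keyword_distribution(example):
        print(keyword, n_genes)
```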
Import gene set
A gene set can be imported by entering a list of gene identifiers in ENSANGG, ENSANGP, ENSANGT, Probeset ID, or Celera form, or by choosing from a list of pre-defined gene sets. Pre-defined gene sets consist of groups of genes that have been linked to similar function or regulation in existing literature (Figure 10). Users can submit gene sets for automatic and immediate listing as a pre-defined gene set from the same page. Gene sets can be exported from the data-mining interface by using the analysis menu.
Submit a microarray study
The angaGEDUCI database has the capacity to store and integrate additional Affymetrix microarray studies that examine gene expression in An. gambiae. The Submit Study link provides a short form for uploading microarray data and specifications.
Utility and Discussion
The angaGEDUCI database identifies genes that meet stage- and tissue-specific expression criteria, and incorporates keyword searching and promoter sequence analysis into one unified data-mining tool. A case study best illustrates the utility of this integration. In this example, we will identify genes linked to the complex regulation of phenoloxidase, an enzyme involved in the melanization of invading parasites and micro-organisms as part of invertebrate innate immunity [7, 8]. Specifically, we will search for pro-phenoloxidase genes that are preferentially found in fat bodies and expressed highly three hours after bloodfeeding. Three filters will be used to complete this inquiry (Figure 1). First, a filter selects genes that contain the keyword "prophenoloxidase" in their functional annotation. Eighty-eight of the 13,639 transcripts in the An. gambiae genome contain this keyword. Second, a stage-specific filter identifies 14 of these 88 transcripts that show 5-fold up-regulated expression three hours after bloodfeeding (BF3h) as compared to sugar-fed mosquitoes (NBF). Third, a tissue-specific filter isolates six of these 14 transcripts that are expressed 5-fold higher in fat bodies as compared to their corresponding expression in the midgut and ovaries (Figure 2).
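The three filters of the case study can be expressed as a short Python sketch on invented data. The signal values, transcript identifiers, and field names below are hypothetical and do not come from the database. The sketch also assumes that "5-fold" means a ratio of at least 5 and that the fat-body signal must exceed both the midgut and the ovary signals by that factor.

```python
# Case-study filters on invented data; all values and names are hypothetical.
from typing import Dict, List


def at_least_fold(signal: float, baseline: float, fold: float = 5.0) -> bool:
    """True when `signal` is at least `fold` times a positive `baseline`."""
    return baseline > 0 and signal >= fold * baseline


transcripts: List[Dict] = [
    {"id": "T1", "annotation": "prophenoloxidase activating enzyme",
     "NBF": 10, "BF3h": 80, "fat_body": 500, "midgut": 50, "ovaries": 40},
    {"id": "T2", "annotation": "prophenoloxidase 3",
     "NBF": 10, "BF3h": 20, "fat_body": 300, "midgut": 20, "ovaries": 20},
    {"id": "T3", "annotation": "prophenoloxidase 5",
     "NBF": 10, "BF3h": 60, "fat_body": 100, "midgut": 50, "ovaries": 10},
    {"id": "T4", "annotation": "trypsin",
     "NBF": 5, "BF3h": 100, "fat_body": 900, "midgut": 10, "ovaries": 10},
]

# Filter 1: keyword in the functional annotation.
by_keyword = [t for t in transcripts if "prophenoloxidase" in t["annotation"]]

# Filter 2: at least 5-fold higher at 3 h after bloodfeeding than in sugar-fed females.
by_stage = [t for t in by_keyword if at_least_fold(t["BF3h"], t["NBF"])]

# Filter 3: at least 5-fold higher in fat body than in both midgut and ovaries.
by_tissue = [
    t for t in by_stage
    if at_least_fold(t["fat_body"], t["midgut"])
    and at_least_fold(t["fat_body"], t["ovaries"])
]

if __name__ == "__main__":
    print(len(by_keyword), len(by_stage), len(by_tissue))
    print([t["id"] for t in by_tissue])
```

Each list is derived from the previous one, so the final set satisfies all three criteria, in the same way that the 88, 14, and six transcripts in the case study are nested.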
The analysis menu can be used with this gene set of interest to search for common DNA sequence motifs that occur within the promoter regions of the genes corresponding to these transcripts. Analysis of the promoter regions of the six prophenoloxidase-related genes shows the occurrence of 14 conserved 12-basepair DNA sequence motifs (Figure 6). Of these 14 motifs, 10 match known transcription factor binding sites while the other four do not. Additional motifs of interest can be found by executing the promoter analysis as a search for a conserved motif shorter than 12 nucleotides or by specifying a number of mismatches that may be allowed within a nearly conserved but imperfectly matching motif. Depending on how these parameters are adjusted, the promoter analysis of a gene set may yield more or fewer conserved motifs, as well as a different number of motifs that are or are not matched to known transcription factor binding sites. A survey of the data produced with different specifications of these parameters in the analysis of the prophenoloxidase gene set is included in Figure 11 to aid users in choosing parameters that are most appropriate for their particular investigation.
While existing databases may allow individualized searching by expression, keyword, or sequence criteria, it is the unification of these fields that makes angaGEDUCI a unique facilitator of experimental design. The database may be used in many different ways, but perhaps most useful is the ability to use the stage- and tissue-specific expression microarray data to identify genes that are expressed in spatial and temporal patterns of interest and then compare the promoter regions of such genes to investigate putative means of facilitating such expression. The experimentally validated utility of such applications may pave the way for similar investigations into the regulatory role of conserved DNA sequence motifs in other control regions within the genome, such as putative microRNA target sites that may be found in 3' UTRs.
In addition to its current microarray data based on genome-wide tissue- and stage-specific gene expression, angaGEDUCI has been built with the goal of expanding its scope to house, integrate, and display additional microarray studies of An. gambiae. For example, Affymetrix microarray data from a study investigating gene expression in An. gambiae following infection with Plasmodium falciparum can be integrated with the existing data in the database to produce a clearer picture of how the mosquito responds to parasite challenge at the transcriptional level. This flexibility assures that angaGEDUCI is capable of growing alongside the increasing quantity of data being produced from other studies. By working closely with VectorBase and other laboratories in this way, it is hoped that angaGEDUCI will act as a catalyst in accelerating the study and understanding of gene expression and regulation in this important and devastating vector of disease.
Availability and requirements
The Anopheles gambiae Gene Expression Database at UCI is publicly accessible from the URL: http://www.angaged.bio.uci.edu. Questions and comments are welcomed through the site.
References
Holt RA, Subramanian GM, Halpern A, Sutton GG, Charlab R, Nusskern DR, Wincker P, Clark AG, Ribeiro JMC, Wides R, Salzberg SL, Loftus B, Yandell M, Majoros WH, Rusch DB, Lai Z, Kraft CL, Abril JF, Anthouard V, Arensburger P, Atkinson PW, Baden H, de Berardinis V, Baldwin D, Benes V, Biedler J, Blass C, Bolanos R, Boscus D, Barnstead M, Cai S, Center A, Chaturverdi K, Christophides GK, Chrystal MA, Clamp M, Cravchik A, Curwen V, Dana A, Delcher A, Dew I, Evans CA, Flanigan M, Grundschober-Freimoser A, Friedli L, Gu Z, Guan P, Guigo R, Hillenmeyer ME, Hladun SL, Hogan JR, Hong YS, Hoover J, Jaillon O, Ke Z, Kodira C, Kokoza E, Koutsos A, Letunic I, Levitsky A, Liang Y, Lin JJ, Lobo NF, Lopez JR, Malek JA, McIntosh TC, Meister S, Miller J, Mobarry C, Mongin E, Murphy SD, O'Brochta DA, Pfannkoch C, Qi R, Regier MA, Remington K, Shao H, Sharakhova MV, Sitter CD, Shetty J, Smith TJ, Strong R, Sun J, Thomasova D, Ton LQ, Topalis P, Tu Z, Unger MF, Walenz B, Wang A, Wang J, Wang M, Wang X, Woodford KJ, Wortman JR, Wu M, Yao A, Zdobnov EM, Zhang H, Zhao Q, Zhao S, Zhu SC, Zhimulev I, Coluzzi M, della Torre A, Roth CW, Louis C, Kalush F, Mural RJ, Myers EW, Adams MD, Smith HO, Broder S, Gardner MJ, Fraser CM, Birney E, Bork P, Brey PT, Venter JC, Weissenbach J, Kafatos FC, Collins FH, Hoffman SL: The genome sequence of the malaria mosquito Anopheles gambiae. Science. 2002, 298: 129-49. 10.1126/science.1076181.
Hill CA, Kafatos FC, Stansfield SK, Collins FH: Arthropod-borne diseases: vector control in the genomics era. Nat Rev Microbiol. 2005, 3: 262-268. 10.1038/nrmicro1101.
Marinotti O, Nguyen QK, Calvo E, James AA, Ribeiro JMC: Microarray analysis of genes showing variable expression following a bloodmeal in Anopheles gambiae. Insect Mol Biol. 2005, 14: 365-373. 10.1111/j.1365-2583.2005.00567.x.
Marinotti O, Calvo E, Nguyen QK, Dissanayake S, Ribeiro JMC, James AA: Genome-wide analysis of gene expression in adult Anopheles gambiae. Insect Mol Biol. 2006, 15: 1-12. 10.1111/j.1365-2583.2006.00610.x.
Ribeiro JM, Topalis P, Louis C: AnoXcel: an Anopheles gambiae protein database. Insect Mol Biol. 2004, 13: 449-457. 10.1111/j.0962-1075.2004.00503.x.
Boyer RS, Moore JS: A fast string searching algorithm. Communications of the ACM. 1977, 20: 762-772. 10.1145/359842.359859.
Cerenius L, Söderhäll K: The prophenoloxidase-activating system in invertebrates. Immunol Rev. 2004, 198: 116-126. 10.1111/j.0105-2896.2004.00116.x.
Dimopoulos G: Insect immunity and its implication in mosquito-malaria interactions. Cell Microbiol. 2003, 5: 3-14. 10.1046/j.1462-5822.2003.00252.x.
Li J, Riehle MM, Zhang Y, Xu J, Oduol F, Gomez SM, Eiglmeier K, Ueberheide BM, Shabanowitz J, Hunt DF, Ribeiro JM, Vernick KD: Anopheles gambiae genome reannotation through synthesis of ab initio and comparative gene prediction algorithms. Genome Biol. 2006, 7: R24. 10.1186/gb-2006-7-3-r24.
Acknowledgements
The authors thank Dr. Norman Jacobson for his advice and Lynn Olson for help in preparing the manuscript. This work was supported by a grant from the National Institutes of Health (AI29746 to AAJ).
Authors' contributions
SND designed and implemented the website, database, and promoter analysis algorithms and wrote the principal draft of the manuscript. OM assisted in designing the analysis and editing of the manuscript. JMCR captured putative promoter sequences and constructed the Ano-Xcel database. AAJ assisted in the editing of the manuscript.