- Research article
- Open Access
An expression map for Anopheles gambiae
© MacCallum et al; licensee BioMed Central Ltd. 2011
- Received: 25 July 2011
- Accepted: 20 December 2011
- Published: 20 December 2011
Quantitative transcriptome data for the malaria-transmitting mosquito Anopheles gambiae covers a broad range of biological and experimental conditions, including development, blood feeding and infection. Web-based summaries of differential expression for individual genes with respect to these conditions are a useful tool for the biologist, but they lack the context that a visualisation of all genes with respect to all conditions would give. For most organisms, including A. gambiae, such a systems-level view of gene expression is not yet available.
We have clustered microarray-based gene-averaged expression values, available from VectorBase, for 10194 genes over 93 experimental conditions using a self-organizing map. Map regions corresponding to known biological events, such as egg production, are revealed. Many individual gene clusters (nodes) on the map are highly enriched in biological and molecular functions, such as protein synthesis, protein degradation and DNA replication. Gene families, such as odorant binding proteins, can be classified into distinct functional groups based on their expression and evolutionary history. Immunity-related genes are non-randomly distributed in several distinct regions on the map, and are generally distant from genes with house-keeping roles. Each immunity-rich region appears to represent a distinct biological context for pathogen recognition and clearance (e.g. the humoral and gut epithelial responses). Several immunity gene families, such as peptidoglycan recognition proteins (PGRPs) and defensins, appear to be specialised for these distinct roles, while three genes with physically interacting protein products (LRIM1/APL1C/TEP1) are found in close proximity.
The map provides the first genome-scale, multi-experiment overview of gene expression in A. gambiae and should also be useful at the gene-level for investigating potential interactions. A web interface is available through the VectorBase website http://www.vectorbase.org/. It is regularly updated as new experimental data becomes available.
- Gene Ontology
- Blood Meal
- Odorant Binding Protein
- Paralogous Group
- Node Vector
Genome sequencing and gene expression microarray technologies have, in recent years, enabled systems-level research into the malaria-transmitting mosquito Anopheles gambiae. By measuring transcript levels with respect to biological events, such as blood feeding, development, parasite infection and mating, one can identify genes that are likely to be involved in the underlying processes. However, due to the wealth of information produced by individual experiments and the numerous leads that require further investigation, it is understandable that research groups rarely perform so-called meta-analysis of gene expression data, whereby multiple experiments are analysed simultaneously. Furthermore, meta-analysis is impeded by incompatibilities between different versions of genome annotations, microarray technologies, file formats, experimental designs, data processing pipelines and statistical analyses. Several ongoing projects aim to eliminate these inconsistencies and produce uniformly processed and analysed data for the end user. Human curators at the two major microarray repositories, NCBI GEO and ArrayExpress, are working to produce enriched resources known as GEO Datasets and the Gene Expression Atlas, respectively. The VectorBase consortium produces a similar unified gene expression resource for the invertebrate vector community.
Web-based expression summaries provide useful and concise biological overviews for individual genes of interest; however, a common requirement is to know which other genes are expressed in a similar manner to a particular gene. The curated expression resources of GEO and ArrayExpress provide such "nearest neighbour" gene lists, but within a single experiment only, not across multiple experiments. Some years ago, gene expression data from 553 Caenorhabditis elegans two-colour microarray experiments was clustered simultaneously to produce a 2D map known as TopoMap. It was found that TopoMap clustered many genes of similar function, such as lipid metabolism, heat shock and neuronal genes. TopoMap is integrated into the WormBase genomics resource, but the underlying expression data is not available, reducing its utility. To the best of our knowledge, no large-scale meta-analysis of expression data has been made public for any other species.
Here we present a simple method for clustering expression data from a diverse set of microarray experiments. We have used data from A. gambiae, but the method is applicable to any organism. The results are visualised on a 2D map, and we show that many regions of the map are strongly linked to biological function. Two case studies are presented. One focuses on odorant binding proteins, which can be classified into several functional groups. The second looks at a large number of immunity-related genes, and likewise suggests specialised roles for members of several immunity gene families.
A map of A. gambiae gene expression
Developmental series 
embryo:12-14 hours, larva:48 hours, larva:96 hours, larva:144 hours, larva:192 hours, larva:240 hours, pupa:240 hours, adult:312 hours
Adult female tissues 
head, midgut, ovaries, carcass
Odumasy vs. Kisumu strain 
Odumasy v Kisumu
Blood meal time series 
Non-blood-fed, Blood-fed 3 h, Blood-fed 24 h, Blood-fed 48 h, Blood-fed 72 h, Blood-fed 96 h, Blood-fed 15 d
Blood-fed adult female tissues 
midgut, fat body, ovaries
Alimentary canal compartments 
gastric caeca, anterior midgut, posterior midgut, hindgut, whole organism
Larval salivary glands 
salivary gland, whole organism
Blood meal after 15 days 
Non-blood-fed 18 d, Blood-fed 15 d
Male vs. female 
Two consecutive blood meals 
Non-blood-fed, Blood-fed 24 h, Blood-fed twice
Larval and adult stages 
Male vs. female 
Plasmodium berghei midgut invasion time-series 
wild-type parasite infection v invasion-deficient parasite infection:before midgut invasion, wild-type parasite infection v invasion-deficient parasite infection:during midgut invasion, wild-type parasite infection v invasion-deficient parasite infection:after midgut invasion
Plasmodium berghei midgut invasion stage comparisons 
wild-type parasite infection:during midgut invasion v before midgut invasion, invasion-deficient parasite infection:during midgut invasion v before midgut invasion, wild-type parasite infection:after midgut invasion v during midgut invasion, invasion-deficient parasite infection:after midgut invasion v during midgut invasion
M and S form 4th instar larvae 
M form, S form, M form:M-GA-CAM, M form:Mali-NIH, S form:KIST, S form:Pimperena
M and S form virgin females 
M form, S form, M form:M-GA-CAM, M form:Mali-NIH, S form:KIST, S form:Pimperena
M and S form gravid females 
M form, S form, M form:M-GA-CAM, M form:Mali-NIH, S form:KIST, S form:Pimperena
Permethrin-resistant strain 
permethrin-selected v unselected
Mated females 
virgin 0 h, mated 2 h, mated 6 h, mated 24 h
Chloroquine exposure 
chloroquine v none:Plasmodium berghei infected, chloroquine v none:uninfected
Embryonic development 
2 h, 4 h, 6 h, 7 h, 8 h, 10 h, 13 h, 16 h, 19 h, 22 h, 25 h, 28.5 h, 31 h, 34 h, 37 h, 40 h, 43 h, 46 h
Embryonic serosa 
embryonic serosa, embryo
Given the assumed difficulty of mapping such high-dimensional data into two dimensions, how reproducible are the maps with respect to the random initialisation step? A simulation, based on an additional 100 randomly seeded maps (not shown), was performed to see how often genes that are co-clustered in the "main" map (shown in the figures) would co-cluster in a re-mapping. It was found that 9907 of 50,000 (20%) randomly selected co-clustered gene pairs co-cluster again in a randomly selected re-mapping, while 40,747 (81%) of gene pairs re-map to the same or "nearby" clusters (≤ 5 grid units separation). This indicates that the general topology of the map is reproducible, although the fine details may not always be.
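The re-mapping check described above can be sketched as follows. This is a minimal illustration, not the authors' simulation code: the function names and the dictionary layout (gene identifier to grid node) are hypothetical, and the use of the city block metric for the "≤ 5 grid units" criterion is an assumption, since the text does not name the metric used for that threshold.

```python
import random

def grid_distance(a, b):
    """City block distance between two (x, y) map nodes (assumed metric)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def remap_agreement(main_map, remaps, n_pairs=50000, near=5, seed=0):
    """Estimate how often gene pairs co-clustered in main_map stay together.

    main_map and every element of remaps map gene id -> (x, y) node.
    Returns (fraction in the same node, fraction within `near` grid units).
    """
    rng = random.Random(seed)
    # Group genes by node so that co-clustered pairs can be sampled directly.
    by_node = {}
    for gene, node in main_map.items():
        by_node.setdefault(node, []).append(gene)
    clusters = [genes for genes in by_node.values() if len(genes) >= 2]
    if not clusters:
        raise ValueError("no node holds two or more genes")
    # Weight clusters by their number of pairs so every pair is equally likely.
    weights = [len(c) * (len(c) - 1) // 2 for c in clusters]
    same = nearby = 0
    for _ in range(n_pairs):
        cluster = rng.choices(clusters, weights=weights)[0]
        g1, g2 = rng.sample(cluster, 2)
        remap = rng.choice(remaps)  # a randomly selected re-mapping
        d = grid_distance(remap[g1], remap[g2])
        same += (d == 0)
        nearby += (d <= near)
    return same / n_pairs, nearby / n_pairs
```

Called with the main map and the list of 100 re-seeded maps, the two returned fractions correspond to the "co-cluster again" and "same or nearby cluster" proportions quoted in the text.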
Map nodes and regions are enriched with respect to gene function
Over-represented Gene Ontology terms P < 1 × 10⁻⁶
structural constituent of ribosome
threonine-type endopeptidase activity
monovalent inorganic cation transmembrane transporter activity
protein catabolic process
unfolded protein binding
modification-dependent macromolecule catabolic process
cellular biopolymer catabolic process
transmembrane receptor activity
cation-transporting ATPase activity
ribonucleoside triphosphate biosynthetic process
purine ribonucleoside triphosphate metabolic process
purine nucleoside triphosphate biosynthetic process
proton-transporting two-sector ATPase complex
oxidoreductase activity, acting on NADH or NADPH
purine ribonucleotide biosynthetic process
oxidoreductase activity, acting on heme group of donors
heme-copper terminal oxidase activity
sensory perception of chemical stimulus
monovalent inorganic cation transport
structural constituent of cuticle
coenzyme catabolic process
acetyl-CoA metabolic process
protein-DNA complex assembly
serine-type endopeptidase activity
cellular response to stress
amino sugar metabolic process
polysaccharide metabolic process
response to DNA damage stimulus
establishment of protein localization
cellular protein localization
regulation of cellular biosynthetic process
oxygen transporter activity
nucleic acid binding
regulation of macromolecule biosynthetic process
regulation of nucleobase, nucleoside, nucleotide and nucleic acid metabolic process
Highly enriched gene functions are frequently found in multiple distinct regions of the map, indicating major differences in their expression and hence the biological context in which the genes operate. Some examples are discussed below.
The strongest enrichment of function is seen for the ribosomal proteins on the left hand edge of the map. However, there is a second group of ribosomal protein genes in the centre of the map that is characterised by high expression in ovaries and is therefore likely to be involved in egg production.
Genes involved in DNA, RNA and protein synthesis are generally found above the diagonal from lower-left to upper-right. Temporally, spatially or functionally related metabolic functions are often co-located on the map. For example, near the centre of the map, clusters enriched in protein synthesis (ribosome), protein folding and protein degradation (proteasome) are found together. Additional file 1, Figure S1 shows a wider selection of DNA/RNA/protein metabolic functions and their relationships (e.g. the proximity of DNA replication and repair, transcription and RNA processing, and protein synthesis and protein transport).
Non-randomly distributed Gene Ontology terms P < 0.01
regulation of transcription, DNA-dependent
serine-type endopeptidase activity
transcription factor activity
regulation of transcription
structural constituent of ribosome
signal transducer activity
intracellular signaling cascade
G-protein coupled receptor activity
G-protein coupled receptor protein signaling pathway
structural constituent of cuticle
ion channel activity
rhodopsin-like receptor activity
sensory perception of smell
olfactory receptor activity
intracellular protein transport
imaginal disc-derived wing morphogenesis
transcription regulator activity
chitin metabolic process
ATP-dependent helicase activity
sodium ion transport
microtubule motor activity
homophilic cell adhesion
compound eye development
structural molecule activity
ATP synthesis coupled proton transport
proton-transporting ATPase activity, rotational mechanism
hydrogen ion transporting ATP synthase activity, rotational mechanism
proton-transporting two-sector ATPase complex
DNA-directed RNA polymerase activity
compound eye morphogenesis
protein complex assembly
Multi- vs. single-experiment maps
Blood meal time-series
All experiments release 1.0.3
All experiments release 1.0.5
All experiments release 1.0.7
+ [16, 17]
Case study: odorant binding proteins
Odorant binding proteins (OBPs), which transport odorant molecules through the extracellular fluid of chemosensilla to transmembrane odorant receptor (OR) proteins on olfactory receptor neurons, are found in three main regions of the expression map (Figure 2, two in purple and one in grey labelled "minimal regulation", bottom right). One region is characterised by minimal differential regulation and presumably represents constitutively expressed genes (although we cannot rule out differential expression in some yet-to-be-performed experiment). Another region rich in OBPs is characterised by high expression in non-blood-fed females, suggesting a role in mating or host seeking for these genes. The third cluster is expressed after blood feeding and may be implicated in locating suitable sites for egg laying. A similar functional hypothesis for OBPs, based on blood meal data alone, has already been proposed; however, it identified only three of the most differentially expressed genes, whereas the expression map classifies the majority of this large family into three, or perhaps more (see below), functional groups. In contrast, the vast majority of ORs (as defined by InterPro domain IPR004117) appear to be unregulated (data not shown, but easily available via the web interface).
Case study: immunity genes
The genes REL1 and REL2, encoding the core transcription factors of the Toll and IMD pathways respectively, have very different expression profiles. This is perhaps expected since REL1, an orthologue of Drosophila dorsal, and other Toll pathway members have well documented roles in dorso-ventral pattern formation in the early embryo, and indeed we see TOLL1B, TOLL5A, REL1, and CACT in the early embryo region of the map. Notably, TOLL1A, 1B, 5A and 5B are co-orthologues of Drosophila Toll, which codes for a transmembrane receptor with developmental and immune roles. One can speculate that, of these four mosquito receptors, TOLL1B is the most likely functional orthologue of Toll as it clusters closely with REL1 on the map. However, the location of TOLL5B close to many other immunity genes (Region labelled "High: fat body, NBF", middle right) as well as that of TOLL5A close to TOLL1B, REL1 and CACT may imply that at least three of the four co-orthologues of Toll play central, but likely distinct, roles in mosquito immunity.
Many of the major immunity gene family members are quite widely dispersed on the map. For example, the anti-microbial cecropin genes CEC1, CEC2 and CEC3 are tightly clustered in a region characterised by strong midgut expression and low expression 3 h post blood meal (top right), while CEC4 is located quite far away in a region with less overall differential expression and a mild positive response at 3 h post blood meal (lower right). This suggests that cecropins 1-3 have similar roles but are perhaps specialised to counter a range of pathogens, while CEC4 has evolved to perform a different role. The four defensins have a similarly informative distribution: DEF1 is with the main cluster of cecropins suggesting it has a similar function, while the others are in the lower right corner where the 3 h post blood meal response is strong. In particular, DEF3 is clustered with a large number of cuticle genes, suggesting a role in immunity during blood meal induced cuticle expansion, perhaps against fungal infection.
The peptidoglycan recognition proteins (PGRPs) are another gene family whose functional diversity is reflected in the map. All PGRPs, as their name implies, are able to bind microbial peptidoglycan specifically but some are believed to have catalytic activity due to the conservation of three active site amino acids [23–25]. In A. gambiae, the putative catalytic members of the family are PGRPLB, PGRPS2 and PGRPS3. Interestingly these three genes all map to the right-most edge of the map. PGRPLB lies in a region populated by other effector genes and peptides (GAM1 (AGAP008645), LYSC7, DEF1 and cecropins 1-3), supporting its proposed role as an antimicrobial agent. PGRPS2 and PGRPS3 map close to DEF2 and DEF4 respectively, suggesting parallel but as yet unidentified roles.
The recently described physical interactions between two leucine-rich repeat (LRR) proteins LRIM1 (AGAP006348) and APL1C and the complement C3-like protein TEP1 are mirrored in the expression map; the two LRR genes map to the same grid node while TEP1 maps to the node below. These proteins are implicated in the activation of the mosquito complement system, with TEP1 being shown to localise around invading Plasmodium berghei ookinetes. This region of the map (Figure 5, green outline, middle right) has the highest density of immunity genes, including many other TEPs (1, 2, 9, 10, 12, 17), CLIP-domain serine proteases, and one additional member of the recently characterised LRIM family, LRIM17 (LRRD7), which has been shown through RNAi-mediated knockdowns to affect Plasmodium ookinete invasion.
Although the clustering of genes based on their expression in many different experiments appears to be successful (at least as assessed by the co-clustering of genes with similar function), the methodology has some potential shortcomings which merit discussion.
Since data from so many experimental conditions is presented in one place there is the possibility that users could over-interpret map cluster expression summaries. For example, genes in cluster 22,9 could be (wrongly) interpreted as having "high expression in the fat bodies of non blood-fed females". However, the fat body assays used tissue from blood-fed females, so the correct summary should be "high expression in non blood-fed females and the fat bodies of blood-fed females". Users should be aware that very few of the possible combinations of experimental conditions have actually been assayed.
The use of different mosquito strains from one laboratory to another may also make interpretation of the map more difficult. First, polymorphisms may differentially alter microarray hybridisation efficiency in one strain relative to another for certain genes. However, this would appear to have a minimal confounding effect, since microarray studies have directly compared different strains and the results have been successfully validated with quantitative PCR [15, 29]. Second, strains may actually exhibit biologically meaningful differences in expression (e.g. a gene may be highly expressed in the midgut of one strain but not in another). On first impressions this may seem like a problem, but it is actually an advantage because the differential (inter-experiment) expression resulting from strain differences (and other sample characteristics, such as sex, rearing conditions, etc) simply provides data with which finer-grained clustering can be obtained. The web interface, however, could be enhanced in future versions to display all available sample characteristics. Currently only the most pertinent information is available in the experiment titles (e.g. "Adult female tissues").
While we have re-analysed all data in order to standardise the statistical treatment, there is still a possibility that technical differences between microarray technologies (platforms) could affect the meta-analysis. For example, platforms with a wider range of detection are capable of producing data with greater dynamic range. If high and low dynamic range datasets are mapped together, the high dynamic range data will have a greater influence on the clustering of genes. However, the dynamic range of expression data can also be influenced by the relative severity of the experimental conditions being tested (for example, a 10°C heat shock will cause greater magnitude gene expression changes than a 1°C heat shock). The VectorBase 1.0.7 expression data set contains both high and low dynamic range experiments (Additional file 2, Figure S2). The low dynamic range experiments tend to involve less severe conditions, such as strain comparisons. If datasets were range-normalised prior to mapping, the biological relevance of very highly regulated genes would be lost.
Another limitation is that we discard/ignore the statistics relating to the mean expression values used as input data to build the map. For instance, the numbers of replicates and standard deviations could be used to filter out bad data or to produce Gaussian models for each expression value (with which the map could be trained). Such enhancements, if implemented, would likely improve the quality of the mapping still further.
We have tried to keep the number of parameters in our approach to a minimum; however, the size and shape of the map have a major effect on the outcome and were decided somewhat arbitrarily. In general, small maps produce large gene clusters, while large maps produce smaller clusters. For any given biological annotation, the extent of its enrichment within clusters will depend on cluster size and the number of genes annotated as such (i.e. in a large map, members of a large gene family may be spread across many neighbouring nodes but not be significantly enriched in any one node, while in a smaller map, significant enrichment in one large cluster may be seen). Thus, no map size is optimal in all cases. The dimensions of the VectorBase A. gambiae expression map (25 × 20) were chosen to give an average of 20 genes per cluster, which is a manageable number. Alternative map sizes could be provided by VectorBase in the future.
VectorBase strives to be unbiased and include all data for its core species in the expression database, particularly those with raw data deposited in public repositories. However, for technical reasons, total coverage of experiments cannot be guaranteed. Furthermore, in the mosquito field there is quite a heavy experimental bias (for example, the majority of data comes from female mosquitoes). As the VectorBase resource expands, questions arise as to what to do with largely redundant datasets (there are now three adult tissue experiments: [8, 10, 31]). Multiple assays of similar conditions or tissues (albeit with strain and rearing differences, see above) will proportionally shift the focus of the map towards those conditions or tissues; less space will be available for the allocation of genes into clusters based on other expression characteristics. One solution may be to perform some pruning of redundant datasets, another may be to produce specialist maps (e.g. developmental studies only) in addition to the "all conditions" map.
One obvious use for the A. gambiae expression map is to short-list potential interaction partners for proteins of interest. For example, one can extrapolate from the recent findings for LRIM1 that other LRIM family members will form heteromeric complexes and perhaps also interact with one or more TEPs, and that these genes will, like LRIM1, APL1C and TEP1, probably also be co-located on the map. Similarly, we observe a general tendency for CLIP-domain serine proteases and serpin family serine protease inhibitors (marked "CLIP" and "SRPN" in Figure 5 respectively) to be clustered together in many areas of the map, which suggests that the experimental elucidation of enzyme-inhibitor relationships can be greatly accelerated using the map.
A further advantage of performing clustering on all available data is that the clusters obtained are likely to be fine-grained enough for promoter analysis aimed at the discovery of cis-regulatory DNA sequences responsible for co-regulation. Post-transcriptional regulatory mechanisms (e.g. endogenous miRNA) will also be responsible for some of the observed co-regulation, so transcript-based signals (e.g. miRNA targets) might also be detectable in expression map clusters.
Finally, we propose a role for expression maps in comparative transcriptomics. Current approaches compare data from two or more broadly equivalent experiments that have been performed in two or more organisms (e.g. developmental stages in A. gambiae and Drosophila melanogaster). If the experiments are performed in different laboratories and at different times, the experimental designs are likely to be different enough to invalidate or at the least complicate the analysis. However, expression maps tend to smooth out these differences, so that intra-map distances between pairs of orthologous genes should be robustly comparable between species, especially if the maps have been generated using a similar set of experiments. One can also quantify the functional divergence of gene families by measuring their intra-map dispersal, and compare these between species.
The A. gambiae expression map has potential for making expression data more accessible and useful to researchers throughout the field. A web interface showing the data presented in this paper is available at http://funcgen.vectorbase.org/ExpressionMap/Anopheles_gambiae/paper. However, as the resource is updated as part of VectorBase's release cycle, newer versions are also available. While this manuscript was being revised, an expression map for the Dengue vector Aedes aegypti was also made available. In addition, the source code for map generation and web visualisation is available under the GNU General Public License at https://github.com/VectorBase/ExpressionMap.
All data was obtained from the VectorBase gene expression resource, which is a curated collection of published, publicly available gene expression data for invertebrate vectors of human pathogens. The standard VectorBase curation pipeline begins with importing original raw data files, obtained from GEO, ArrayExpress or the authors, into the microarray data management system BASE. Low quality data is then removed according to the authors' quality flags. Intensity data is normalised with either the Lowess algorithm for two-colour data, or the RMA algorithm for single-channel data, using the relevant BASE plugin with default parameters. All ratio or intensity values for a given gene and hybridisation combination (there may be multiple reporters per gene and/or multiple spots per reporter) are summarised by their mean. The means from multiple hybridisations for the same experimental condition (these are usually biological replicates or, less often, technical "dye swap" replicates) are then averaged again to give a single value per gene and condition combination. The number of averaged data points and their variance are discarded (see Results and Discussion: "Limitations").
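The two-level averaging described above (first within a hybridisation, then across hybridisations of the same condition) can be illustrated with a short sketch. The tuple layout and function name are hypothetical and are not taken from the BASE or VectorBase pipelines.

```python
from statistics import mean

def gene_condition_means(measurements):
    """Collapse raw values to one mean per (gene, condition).

    `measurements` is an iterable of (gene, condition, hybridisation, value)
    tuples; several values may share a gene and hybridisation (multiple
    reporters per gene or multiple spots per reporter).
    """
    per_hyb = {}
    for gene, condition, hyb, value in measurements:
        per_hyb.setdefault((gene, condition, hyb), []).append(value)
    per_condition = {}
    for (gene, condition, _hyb), values in per_hyb.items():
        # First level: mean over reporters/spots within one hybridisation.
        per_condition.setdefault((gene, condition), []).append(mean(values))
    # Second level: mean of the hybridisation means. Replicate counts and
    # variances are not kept, mirroring the limitation noted in the text.
    return {key: mean(vals) for key, vals in per_condition.items()}
```

Note that a mean of means weights every hybridisation equally, so the result can differ from a flat mean over all spots when hybridisations contribute different numbers of values.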
Some microarray technologies and experimental designs produce intensity values whose absolute values cannot always be compared directly from gene to gene. These include single-channel technologies and some two-colour experiments using global reference samples. With this kind of data, it is only possible to calculate correlation coefficients between gene expression profiles within a single experiment. Some form of normalisation is needed to give expression values from different "reference-less" experiments a common reference point so that multi-experiment expression profiles can be compared. We chose to apply a "median shift" normalisation step to such ratios and intensity values. In median shift normalisation, each expression profile is centred around zero by subtracting its median value (for example, for a gene whose expression values in one particular experiment with three tissues are 11, 4 and 6, the median is 6 and the normalised values are 5, -2 and 0). The median-shift normalised data for 10194 genes and 93 experimental conditions is available from the VectorBase download page.
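A minimal sketch of median shift normalisation for a single gene within a single experiment is given below. The function name is hypothetical, and the treatment of missing values (passed through unchanged and ignored when computing the median) is an assumption rather than something stated in the text.

```python
from statistics import median

def median_shift(profile):
    """Centre one experiment's expression profile on zero.

    `profile` is a list of values for one gene within one experiment.
    Assumption: None marks a missing value, which is ignored when taking
    the median and returned unchanged.
    """
    present = [v for v in profile if v is not None]
    if not present:
        return list(profile)
    m = median(present)
    return [None if v is None else v - m for v in profile]

print(median_shift([11, 4, 6]))  # [5, -2, 0]
```

The printed result reproduces the three-tissue example given in the text.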
The expression data was clustered using the self-organizing map algorithm as follows. Unless otherwise stated, the map dimensions were 25 × 20, the starting learning rate was 0.1, and the starting neighbourhood radius was 10. Prior to training, the map was randomly initialised with values within the range of the expression data. During the training of a self-organizing map, input vectors are compared with reference vectors at each map node (henceforth: "node vectors"). These vectors have the same number of dimensions as the input data (93 in this case). In this work, the comparison is made with the Pearson correlation coefficient, and missing values are simply excluded from the calculation. (The Euclidean distance measure was also tried and gave similar results.) The node vector with the highest correlation and its neighbours within a specified radius are updated towards the input vector by an amount proportional to the learning rate. As training proceeds, input vectors are "presented" to the map at random (with replacement) on average 20 times each while the learning rate and neighbourhood radius are linearly reduced towards zero. When training is complete, genes are assigned for a final time to their closest node. Each node vector can be thought of as a mean expression vector (or profile) for the genes mapping to that node. The algorithm attempts to preserve the topology of the high-dimensional input data in the two-dimensional mapping; however, the two axes of the map have no predetermined meaning.
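The training procedure can be sketched in pure Python as follows. This is an illustrative re-implementation under stated assumptions, not the authors' Perl/PDL code: the function names are hypothetical, missing values are represented as None, and the neighbourhood is assumed to be a hard-edged circle within which all nodes receive the same update (the text does not specify the neighbourhood shape or weighting).

```python
import math
import random

def pearson(u, v):
    """Pearson correlation over positions where neither value is missing."""
    pairs = [(a, b) for a, b in zip(u, v) if a is not None and b is not None]
    n = len(pairs)
    if n < 2:
        return 0.0
    mu = sum(a for a, _ in pairs) / n
    mv = sum(b for _, b in pairs) / n
    cov = sum((a - mu) * (b - mv) for a, b in pairs)
    su = math.sqrt(sum((a - mu) ** 2 for a, _ in pairs))
    sv = math.sqrt(sum((b - mv) ** 2 for _, b in pairs))
    return cov / (su * sv) if su > 0 and sv > 0 else 0.0

def best_node(nodes, vec):
    """Grid position of the node vector most correlated with vec."""
    return max(nodes, key=lambda pos: pearson(nodes[pos], vec))

def train_som(data, width=25, height=20, rate0=0.1, radius0=10.0,
              passes=20, seed=0):
    """Train a self-organizing map; data maps gene id -> expression vector.

    Returns a dict mapping each gene id to its final (x, y) node.
    """
    rng = random.Random(seed)
    genes = list(data)
    dims = len(data[genes[0]])
    known = [v for vec in data.values() for v in vec if v is not None]
    lo, hi = min(known), max(known)
    # Random initialisation within the range of the expression data.
    nodes = {(x, y): [rng.uniform(lo, hi) for _ in range(dims)]
             for x in range(width) for y in range(height)}
    steps = passes * len(genes)  # each gene presented `passes` times on average
    for t in range(steps):
        frac = 1.0 - t / steps  # linear decay towards zero
        rate, radius = rate0 * frac, radius0 * frac
        vec = data[rng.choice(genes)]  # sampling with replacement
        bx, by = best_node(nodes, vec)
        for (x, y), node in nodes.items():
            # Assumption: circular neighbourhood with a hard cut-off.
            if math.hypot(x - bx, y - by) <= radius:
                for i, v in enumerate(vec):
                    if v is not None:
                        node[i] += rate * (v - node[i])
    # Final assignment of every gene to its closest node.
    return {g: best_node(nodes, data[g]) for g in genes}
```

The default arguments follow the parameters quoted in the text (25 × 20 map, learning rate 0.1, radius 10, 20 presentations per gene). A pure Python loop of this kind would be slow for 10194 genes and is intended only to make the steps concrete.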
The algorithm was implemented in Perl and PDL (Perl Data Language), and the maps are stored in a relational database through the object oriented Class::DBI interface. All source code is available under the GNU General Public License at https://github.com/VectorBase/ExpressionMap.
The coloured outlines in Figures 1, 2 & 5 indicate regions where one or more node vector components satisfy a simple arithmetic inequality. For example, the orange outlines marked "embryo" in Figure 1a highlight map nodes where the node vector component for embryo expression is greater than 0.25. The choice of node vector components and thresholds was largely arbitrary, with an emphasis on simplicity and clear visualisation. For Figures 2 & 5, nodes of interest (e.g. with a large fraction of genes with a particular function) were chosen manually and vector component thresholds were determined in a semi-automatic fashion. Different thresholds may be explored interactively via the web interface.
Gene function over-representation analysis
The self-organizing map presented in Figures 1, 2 & 5 consists of 500 nodes, each of which can be considered as a gene cluster. We applied a Gene Ontology (GO) over-representation analysis, as implemented in the program ErmineJ, to each cluster. The analysis uses Fisher's Exact Test and the null hypothesis states that genes with a particular GO term are randomly distributed between the cluster of interest and the rest of the map. GO terms that are associated with fewer than ten or more than a quarter of the genes on the map were excluded from the analysis as they are generally not informative. The GO term database of 2009/03/02 was used to define GO term relationships, and the GO annotations for A. gambiae genes were retrieved from VectorBase BioMart on the same date.
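For a single cluster and a single GO term, the test reduces to the upper tail of a hypergeometric distribution. The sketch below is an illustration only and does not reproduce ErmineJ: the function names are hypothetical, and the choice of a one-sided (over-representation) test is an assumption. It requires Python 3.8 or later for math.comb.

```python
from math import comb

def enrichment_p(k, cluster_size, annotated, total):
    """One-sided Fisher's exact test P value for over-representation.

    k:            genes in the cluster carrying the GO term
    cluster_size: genes in the cluster
    annotated:    genes on the whole map carrying the term
    total:        genes on the whole map
    """
    upper = min(cluster_size, annotated)
    # Probability of seeing k or more annotated genes in the cluster by chance.
    tail = sum(comb(annotated, i) * comb(total - annotated, cluster_size - i)
               for i in range(k, upper + 1))
    return tail / comb(total, cluster_size)

def term_is_tested(annotated, total):
    """Apply the filter from the text: skip terms with fewer than ten genes
    or with more than a quarter of the genes on the map."""
    return 10 <= annotated <= total / 4
```

With 10194 genes on a 500-node map, a typical call would look like enrichment_p(k, 20, annotated, 10194) for a cluster of average size.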
The P values reported from the GO analysis are corrected for multiple testing (> 1000 GO terms are tested) according to the Benjamini-Hochberg false discovery rate (FDR) procedure, and correspond to the minimum FDRs (false positives as a fraction of all positives) at which the null hypotheses can be rejected. This correction does not take into account overlaps between parent and child GO terms.
Additionally, a GO term is only reported as enriched if four or more genes in the cluster are annotated with that term.
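The Benjamini-Hochberg adjustment mentioned above can be written in a few lines. This is a generic sketch of the standard procedure with a hypothetical function name; it does not reproduce the ErmineJ implementation and, like the correction described in the text, it ignores overlaps between parent and child GO terms.

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted P values, returned in the input order.

    Each adjusted value is the smallest false discovery rate at which the
    corresponding null hypothesis would be rejected.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

A term would then be reported for a cluster only if its adjusted value passes the chosen threshold and at least four genes in the cluster carry the term.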
Empirical non-random distribution test
The over-representation analysis described above is not ideal in situations where genes with a particular function are localised within the map, but are not necessarily confined to one map node/cluster. We therefore implemented a sampling-based test to quantify the general non-randomness of a gene set on the map as follows. For the set N of n genes of interest located on the map we calculate the mean, d, of the city block distance to their closest neighbours within N. Then, sets N' of n genes are randomly sampled from the map 100 times. For each sample of genes, their mean distance to closest neighbour d' is calculated as above and compared with the "true" value d. For a non-randomly distributed set of genes, d' is not likely to be smaller than d. The estimated P value is therefore the fraction of random samples for which d' is smaller than d. Where multiple tests are performed, a Bonferroni correction is applied by multiplying the number of random samplings by the number of tests (159 in the case of Table 3).
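The sampling-based test can be sketched as follows. The function names and data layout are hypothetical. Two points are assumptions: the P value is taken as the fraction of random samples whose mean nearest-neighbour distance is strictly smaller than d, and the multiple-testing step only increases the number of random samplings, as the text describes, without any further rescaling of the estimate.

```python
import random

def mean_nearest_distance(positions):
    """Mean city block distance from each gene to its closest neighbour
    within the same set of (x, y) positions (needs at least two genes)."""
    total = 0
    for i, (x1, y1) in enumerate(positions):
        total += min(abs(x1 - x2) + abs(y1 - y2)
                     for j, (x2, y2) in enumerate(positions) if j != i)
    return total / len(positions)

def nonrandom_p(gene_set, gene_to_node, n_samples=100, n_tests=1, seed=0):
    """Estimate how often a random gene set is more tightly packed than
    gene_set; gene_to_node maps every gene on the map to its (x, y) node."""
    rng = random.Random(seed)
    all_genes = list(gene_to_node)
    d = mean_nearest_distance([gene_to_node[g] for g in gene_set])
    draws = n_samples * n_tests  # more samplings when many sets are tested
    hits = 0
    for _ in range(draws):
        sample = rng.sample(all_genes, len(gene_set))
        d_random = mean_nearest_distance([gene_to_node[g] for g in sample])
        if d_random < d:
            hits += 1
    return hits / draws
```

Genes that share a node contribute a distance of zero, so a family concentrated in a few nodes yields a small d and hence a small estimated P value.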
Odorant binding protein paralogous groups
For this analysis, odorant binding proteins are defined as the 49 VectorBase genes annotated with InterPro domain IPR006625 (Insect pheromone/odorant binding protein PhBP). The within-species paralogues for each gene were retrieved via the Perl API from the VectorBase/Ensembl Compara database (7 species, schema version 54, August 2009). Paralogous groups (PGs) are defined as sets of genes with the same mutual paralogues. Six genes have no paralogues, two PGs contain two genes each, and one PG contains three genes. The remaining PGs contain five or more genes and are listed in Figure 4.
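The grouping rule can be sketched in Python as below. This is one reading of "the same mutual paralogues", namely that two genes belong to the same PG when each gene together with its paralogues gives the same set. The retrieval from the Compara database is not reproduced, and the gene identifiers in the usage example are hypothetical.

```python
from collections import defaultdict

def paralogous_groups(paralogues):
    """Group genes whose paralogue sets (including the gene itself) match.

    paralogues maps each gene to the set of its within-species
    paralogues; genes without paralogues map to an empty set and
    end up in singleton groups.
    """
    groups = defaultdict(set)
    for gene, others in paralogues.items():
        key = frozenset(others | {gene})
        groups[key].add(gene)
    # Largest groups first, then alphabetical, for stable output.
    return sorted((sorted(g) for g in groups.values()),
                  key=lambda g: (-len(g), g))

# Hypothetical example: one group of three, one pair, one singleton.
example = {
    "OBP_A": {"OBP_B", "OBP_C"},
    "OBP_B": {"OBP_A", "OBP_C"},
    "OBP_C": {"OBP_A", "OBP_B"},
    "OBP_D": set(),
    "OBP_E": {"OBP_F"},
    "OBP_F": {"OBP_E"},
}
groups = paralogous_groups(example)
```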
This work was supported by the National Institutes of Health/National Institute of Allergy and Infectious Diseases (contract numbers HHSN266200400039C, HHSN272200900039C). The authors would like to thank VectorBase colleagues for assistance with data management. We are very grateful to Michael Povelones and Fotis Kafatos for useful discussions and suggestions.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.