### Data sets

Whitfield et al. [20] measured the levels of cDNA expression during behavioral maturation ages in the brains of honey bees from two races (*Apis mellifera mellifera* and *Apis mellifera ligustica*) raised in one of two colonies (*mellifera* and *ligustica*) representing two environments. Full sisters were used to create the two host colonies for each race. Within each combination of race and colony, three nurse bees were sampled on days 0, 4, 8, 12 and 17 after adult emergence and three forager bees were sampled on day 17 after emergence.

The expression of genes from individual brains was assessed using the double-spotted *Apis mellifera* brain 9K version 3.0 cDNA microarray using the protocols described by Whitfield et al. [11], Grozinger et al. [18] and Cash et al. [12]. The majority of the cDNAs in the microarray (5001 out of 8887 reporter cDNAs) have been mapped to 3610 individual genes in the honey bee genome assembly version 2. Based on the annotation of the genome, there is an approximately 38% redundancy of cDNAs to mapped genes and approximately 55% (1970 genes) of these mapped gene have Gene Ontology information.

The gene expression from each combination of bee race and colony in the experiment of Whitfield et al. [20] is available as separate loop designs consisting of 20 cDNA microarrays. This loop design maximized the direct comparisons between consecutive time points because samples at consecutive ages and at the first (day 0 nurse) and last (day 17 forager) ages were hybridized to the same array. From the full experiment two independent data sets, *ligustica* (*L*) and *mellifera* (*M*), were identified and used for cross-validation purposes in this study. The *L* data set included the measurements from *ligustica* bees raised in a *ligustica* colony and the *M* data set included the measurements from *ligustica* nurse bees raised in a *mellifera* colony. Each data set consisting of 20 microarrays was analyzed separately and considered an independent data set suitable to validate the results from the reminder data set because no bee was part of both data sets and a pilot analysis indicated that only 0.75% of the cDNAs had a significant (P-value < 0.00001) age by colony interaction and only 0.79% of cDNAs had a significant colony effect. Thus, we expect that at least 98% of the cDNAs will exhibit the same pattern in both data set and should be consistently assigned to the same gene collections in both data sets.

### Data processing

The same filtering and analysis procedures were conducted separately for each data set. The background subtracted fluorescence intensities were set to 1 when background was higher and foreground intensity and log2-transformed. Spot (or feature) intensities were filtered when a) the spots pertain to controls or other sequences (e.g. virus, suspected to be contaminated or present in high levels in hypopharyngeal glands) also excluded in Cash et al. [12], b) the spots were deemed of bad quality (and assigned a -100 flag) by the image analysis software (GenePix Pro 5.0 [21]), c) the spots had low signal (foreground intensities lower the 3 standard deviations of the background intensity within dye and array), d) the Cook distance of a spot intensity from the other intensity values from the same brain and dye was > 0.99 thus suggesting an inconsistent or unusual spot, e) and the median of all the intensities of a cDNA within dye and array was less than 300 [11]. After filtering the intensities from the duplicated spots on the same microarray were combined into one value, the average of the two spots when available or the value of a single spot remaining after filtering. Finally the following two gene-wise filtering criteria were applied: i) the presence of at least 75% of the spots per cDNA (at least 9 spots per cDNA out of a possible maximum of 12 spots in the arrays with hybridizations of consecutive time points or 16 spots in the arrays including day 0 nurse and forager day 17 samples), and ii) the availability of measurements at all ages.

The log2 intensity values were normalized using a linear-logarithmic transformation [22] and a general model was applied to estimate the trends of gene expression in age using the two-step approach described by Wolfinger et al. [23]. First, a linear model including the fixed effect of dye and the random effect of microarray was fitted across all cDNAs to adjust all measurements across dye and microarray combinations. Second, a linear mixed effects model including the fixed effect of dye and age and the random effect of microarray was used for each previously adjusted cDNA intensities to identify the cDNAs with significant orthogonal linear, quadratic and cubic trend across day 0, 4, 8 and 12 nurse and day 17 forager honey bees. The contrasts excluded day 17 nurse bees due to the similarity in age between the 17 day nurse and forager bees. A set of cDNAs that had significant (Bonferroni adjusted P-value < 0.05) linear or higher age trends in both data sets were identified. The adjusted age estimates of gene expression at day 0, 4, 8 and 12 in nurse bees and at day 17 in forager bees were obtained for each cDNA with significant linear, quadratic or cubic trends within data set.

The estimates of gene expression at each age were analyzed in each data set (see Additional file 5 and 6) using a semiparametric group-based mixture approach [6, 7] that assigns cDNAs to groups with distinct trajectories based on the probability of belonging to that group. A two-step clustering approach was also implemented for comparison purposes.

### Semiparametric approach

A group-based regression approach [6, 7] was used to identify groups of cDNAs with distinctive expression trajectories and, model the uncertainty of the trajectories and of the membership of cDNAs to groups. An infinite number of groups are assumed and thus the mixing distribution is nonparametric. Nonparametric methods, unlike parametric methods, do not make (or make very few) assumptions about the distribution of the observations. This property is desirable in the identification of groups of gene expression trajectories because the distribution is unknown. The limitation of a full nonparametric approach is the typically low power of results. The semiparametric approach offers a compromise because it relies on mildly strong assumptions, thus reducing the risk of misspecification while maintaining the precision [24]. The derivation of the semiparametric approach assumes a continuous distribution of the groups or cDNA trajectories that is approximated by a discrete function. The finite-group approximation results in a semiparametric maximum-likelihood approach [25, 26].

This semiparametric model allows for the generalization of mixtures of distributions while making no specific parametric assumptions about the distribution of the hidden heterogeneity over the cDNAs. The semiparametric maximum likelihood estimators are described in detail by Land et al. [25] and Nagin and Tremblay [27] in the context of psychometric data. Briefly, let *y*
_{
ij
} denote the gene expression level of cDNA *i* (*i* = 1, ..., *n*) at age *j* (*j* = 1, ..., *J*) and y
_{
i
} = (*y*
_{i1}, ..., *y*
_{
iJ
}) denote the vector of gene expression across age.

The probability density function corresponding to cDNA *i* is:

where *f*
_{
k
} (**y**
_{
i
}|**β**
_{k}, *σ*
^{2}) is the *k* component densities of the mixture, *C*
_{
i
} is an indicator variable that denotes the component that cDNA *i* belongs to, and *Pr*(*C*
_{
i
} = *k*) is the probability of cDNA *i* to belong to the *k* group. Thus *f*(**y**
_{
i
}) is a mixture distribution with K components. In this study, *f*
_{
k
} (**y**
_{
i
}|**β**
_{k}, *σ*
^{2}) follows a multivariate Normal distribution:

*f*_{
k
} (**y**_{
i
}|**β**_{k}, *σ*^{2})~MVN(**X**_{
i
}**β**_{
k
}, **I**_{
J
}*σ*^{2})

where X
_{
i
} is the design matrix relating the observations of cDNA *i* to the parameter vector β
_{k}. To describe the cDNA expression trajectories, linear, quadratic and cubic trends on age were considered and the design vector corresponding to cDNA *i* and age *j*, is
. Although the availability of information on five ages allows the evaluation of up to quartic trends, the estimates of this trend would be based on limited information and would be unreliably estimated. The estimates of **β**
_{
k
} = (*β*
_{0k
}, *β*
_{1k
}, *β*
_{2k
}, *β*
_{3k
}) are group-specific and provide K cDNA expression trajectories across age.

The unobservable discrete variable *C*
_{
i
} indicates the group membership or group that the cDNA *i* pertains. This variable can take any of *K* distinct values, each corresponding to a distinct cDNA expression trajectory and *Pr*(*C*
_{
i
} = *k*) is the mixing proportion or weight. In addition

and *Pr*(*C*
_{
i
} = *k*) follows a polychotomous multinomial (*K* degree) logistic distribution:

To address the estimability problem, one group level (e.g. 1) is considered the baseline level. Thus, *θ* = (*θ*
_{
2
},..., *θ*
_{
k
}..., *θ*
_{
K
}) and the estimated parameters are the log odds of membership in level k versus the baseline group.

Based on the regression coefficient estimates, the probability of observing each expression pattern is computed conditional on its belonging to each group. The cDNA is then assigned to the group with the highest probability of having generated the group pattern.

The likelihood is:

The Bayesian information criterion (BIC) was used to identify the optimal number of groups supported by the data [27, 28].

The expression for BIC is:

BIC = -2 [log_{e}(likelihood)] + plog(N)

where *log*
_{
e
} denotes the natural logarithmic transformation, *p* is the number of parameters in the model and *N* is the number of observations.

The BIC can consistently identify the optimal number of components in the mixture model [14] even when the models are not nested [29]. The BIC approach favors parsimonious models and Kass and Raftery [31] indicated that BIC can be used to approximate the Bayes factor for comparisons of models. The model with the smallest BIC value is preferred over the alternative specifications. The BIC offers a good compromise between model adequacy and simplicity when compared to the Akaike information criterion (AIC) and the mean square error (MSE) that tend to favor more and less parsimonious models than BIC respectively [30]. The parameters were estimated by direct maximization of the full likelihood using a SAS macro [32]. All models with 2 to 15 groups following a polynomial trajectory of up to order three were evaluated. The maximum number of groups considered was 15 because higher group numbers resulted in groups with less than 1% of the cDNAs studied within group and the trajectory would not be reliably estimated.

### Two-step clustering approach

In the two-step clustering approach, a cubic polynomial was fitted to the cDNA estimates across age and the predicted intensities at each age were clustered. Complementary clustering techniques were used to collect the cDNAs into clusters. Within hierarchical clustering, a maximum-likelihood hierarchical clustering (or EML) method was implemented. In EML, the distance between clusters A and B is:

where

D_{AB} = distance between clusters A and B,

N = number of observations (cDNAs),

v = number of variables (5 ages),

C = cluster resulting from grouping clusters A and B,

log_{e} = natural logarithmic transformation,

N_{J} = number of observations in cluster J (for J = 1, ..., A, ..., B, ..., C)

= sample mean vector pertaining to cluster J

The EML clustering method clusters observations or clusters that maximizes the likelihood at each level of the dendrogram and assumes multivariate Normal mixture distribution, equal spherical covariance matrices and unequal sampling probabilities [33]. The EML approach does not have the bias towards clusters of equal number of cDNAs that the Ward's method has and may not be appropriate for this data set. This method was implemented using SAS [33].

The k-means clustering approach was also implemented and results were compared to hierarchical clustering to minimize the potential impact of the clustering method on the resulting groups. This approach requires the specification of the number of clusters and uses the minimum least squares criterion. Euclidean distances were used, and three sets of seeds were tested to minimize the impact of the starting values on the resulting clustering implemented using SAS [33]. The clustering approaches are suitable to group the cDNA expression trajectories because there was no evident outlier intensity estimates across age. Evidence of this was the lack of singleton clusters with one cDNA in the k-means clustering. The optimal number of clusters supported by the data was identified based on consensus on three criteria computed, the R^{2} (proportion of the total variance accounted by the clusters), R^{2} ratio [R^{2}/(1-R^{2})], and the pseudo-F statistic:

Pseudo F
**= ⌊**
R
^{2}
**/(**
G
**- 1)⌋/⌊(1 -**
R
^{2}
**)/(**
N
**-**
G
**)⌋**

where

G = number of clusters and,

N = number of cDNAs.

The criteria used to identify the optimal number of clusters were local pseudo-F maxima and slight reductions of the R^{2} indicators across cluster numbers. There was no single optimal number of clusters (multiple local pseudo-F maxima were detected) and the number closest to the number of groups identified with the semiparametric approach was used for comparison purposes. The outputs from both clustering approaches were highly consistent and therefore only the results of the hierarchical EML method applied to the *M* and *L* data sets are presented.

### Model performance

The performance of this approach was evaluated in three-ways. The results from both data sets were compared first to each other, second to the results from a less obvious two-step clustering analysis and third further validated using the gene annotations supported by the assembly version 2 of the honey bee genome [13] and associated gene ontology. The cDNA collections are termed "groups" and "clusters" to distinguish the results from the semiparametric and two-step clustering approaches, respectively.