Transcript profiling of two alfalfa genotypes with contrasting cell wall composition in stems using a cross-species platform: optimizing analysis by masking biased probes

Background The GeneChip® Medicago Genome Array, developed for Medicago truncatula, is a suitable platform for transcript profiling in tetraploid alfalfa [Medicago sativa (L.) subsp. sativa]. However, previous research involving cross-species hybridization (CSH) has shown that sequence variation between two species can bias transcript profiling by decreasing sensitivity (number of expressed genes detected) and the accuracy of measuring fold-differences in gene expression. Results Transcript profiling using the Medicago GeneChip® was conducted with elongating stem (ES) and post-elongation stem (PES) internodes from alfalfa genotypes 252 and 1283 that differ in stem cell wall concentrations of cellulose and lignin. A protocol was developed that masked probes targeting inter-species variable (ISV) regions of alfalfa transcripts. A probe signal intensity threshold was selected that optimized both sensitivity and accuracy. After masking for both ISV regions and previously identified single-feature polymorphisms (SFPs), the number of differentially expressed genes between the two genotypes in both ES and PES internodes was approximately 2-fold greater than the number detected prior to masking. Regulatory genes, including transcription factor and receptor kinase genes that may play a role in development of secondary xylem, were significantly over-represented among genes up-regulated in 252 PES internodes compared to 1283 PES internodes. Several cell wall-related genes were also up-regulated in genotype 252 PES internodes. Real-time quantitative RT-PCR of differentially expressed regulatory and cell wall-related genes demonstrated increased sensitivity and accuracy after masking for both ISV regions and SFPs. Over 1,000 genes that were differentially expressed in ES and PES internodes of genotypes 252 and 1283 were mapped onto putative orthologous loci on M. truncatula chromosomes. Clustering simulation analysis of the differentially expressed genes suggested co-expression of some neighbouring genes on Medicago chromosomes. Conclusions The problems associated with transcript profiling in alfalfa stems using the Medicago GeneChip as a CSH platform were mitigated by masking probes targeting ISV regions and SFPs. Using this masking protocol resulted in the identification of numerous candidate genes that may contribute to differences in cell wall concentration and composition of stems of two alfalfa genotypes.


Background
Alfalfa [Medicago sativa (L.) subsp. sativa] is the most widely cultivated forage legume in the world [1] and the fourth most widely grown crop in the United States [2]. In 2008, over 60 million metric tons of alfalfa dry hay with a value of over $10 billion were harvested from over 8.5 million hectares in the US [3]. In addition to being a valuable forage crop for livestock, alfalfa has considerable potential as a sustainable, cellulosic feedstock for ethanol production [2]. Alfalfa is a relatively high biomass crop that also provides environmental benefits [2]. For example, alfalfa improves soil and water quality, promotes wildlife diversity and provides its own nitrogen fertilizer through symbiotic nitrogen fixation [2,[4][5][6].
A promising strategy for developing alfalfa as a cellulosic ethanol crop involves separating leaves and stems following harvest [2]. The leaves would be used as a protein supplement for livestock while the stems would be used to produce ethanol. Our research has focused on selecting for large stem, non-lodging, biomass-type alfalfa germplasm and developing management strategies to optimize biomass yield. To date, these efforts have resulted in a 40% increase in total biomass and a doubling of theoretical ethanol yield [7]. We have also initiated research to modify the composition of alfalfa stem cell walls via a transgenic approach. The efficiency of ethanol production from cellulosic biomass is positively correlated with cellulose content but negatively correlated with lignin content [8,9]. Thus, the value of alfalfa as a cellulosic feedstock would be enhanced by developing new alfalfa varieties that have increased cellulose and decreased lignin in stem cell walls [8,9]. To facilitate the identification of key genes regulating cell wall composition, we selected alfalfa germplasm (genotypes 252 and 1283) that exhibit significant differences in lignin and cellulose concentrations in stem cell walls [10]. On a dry matter basis, stem cellulose and Klason lignin concentrations of plants at flowering are significantly higher in genotype 252 compared to genotype 1283 (302 gkg -1 vs. 257 gkg -1 for cellulose and 117 gkg -1 vs. 98 gkg -1 for Klason lignin, respectively).
A high-density oligonucleotide microarray is not yet available for global transcript profiling in alfalfa. However, the GeneChip ® Medicago Genome Array is available. This GeneChip contains a total of 52,796 Medicago probe sets designed from 50,900 and 1,896 sequences from M. truncatula and alfalfa, respectively. Each probe set in the GeneChip consists of 11 perfect match (PM) and 11 mismatch (MM) 25-mer probes. An underlying assumption when using microarrays for cross-species hybridization (CSH) is that the level of sequence homology among genes of closely-related species is significant enough to enable detection by probes originally designed for their orthologs. Previous research indicated that the Medicago GeneChip ® is a suitable cross-species platform for transcript profiling in alfalfa [11,12]. In large part, this is because there is a significant level of gene homology. For example, a previous study reported that DNA sequence identity was 93% or greater between protein coding regions of selected homologous genes in alfalfa and M. truncatula [11]. However, in previous research using the Medicago GeneChip ® for transcript profiling in alfalfa tissues, we observed decreased sensitivity (number of genes detected) and decreased accuracy in measuring foldchanges in gene expression compared to results obtained with M. truncatula tissues [11,12].
Numerous studies, conducted with both animals and plants, have reported transcript profiling involving CSH to DNA microarrays of a closely-related species [13][14][15][16][17][18][19][20][21][22][23][24][25][26][27][28][29][30][31][32]. In a number of these studies, electronic masking was used to remove biased probes prior to microarray data analysis. For example, Ranz et al. [18] introduced a probe selection method based on genomic DNA hybridizations of the target and non-target species to the GeneChip. This approach has been used for CSH studies involving plant species [19,27,30]. However, a recent CSH study involving Xenopus species questioned the reliability of this method for selecting unbiased probes [32]. Transcript profiling in non-human primates using the human GeneChip for CSH was optimized by identifying interspecies conserved probe sets [26]. These probe sets were identified by aligning expressed sequence tags (ESTs) in non-human primate with probe sequences on the Affymetrix human GeneChip ® platform. However, this approach is not feasible for species with limited sequence information such as alfalfa. In a study using the human GeneChip as a cross-species platform to measure gene expression in heart and liver tissues of non-human mammals (e.g. cattle, pig, dog, mouse), Ji et al. [21] developed a protocol to selectively mask poorly hybridized probes using the match/mismatch feature of the GeneChip. To evaluate whether masking improved the accuracy of measuring gene expression, it was hypothesized that different organs (heart, liver) of humans and non-human mammals have similar gene expression patterns. After masking low intensity probes in the microarray data of the cross-species, Ji et al. [21] found a linear correlation (r = 0.93) for Ln(heart/liver) values between human and mouse GeneChip data. These authors concluded that comparisons of gene expression patterns in defined tissues of related species could be used to optimize CSH studies involving other mammals or plants.
In earlier research, we examined gene expression at two stages of stem development for alfalfa and Medicago truncatula [12]. In both species, transcript profiling was conducted in elongating stem internodes (ES) and postelongation stem internodes (PES). Genes associated with primary cell wall development were preferentially expressed in ES internodes while genes associated with secondary xylem development were enriched in PES internodes. The objective of this study was to identify genes that are differentially expressed in ES and PES internodes of alfalfa genotypes 252 and 1283 using the GeneChip ® Medicago Genome Array as a cross-species platform. To optimize cross-species hybridization analysis, we developed a protocol for masking probes targeting inter-species variable (ISV) regions. After masking for ISV regions and single-feature polymorphisms (SFPs) previously detected in genotypes 252 and 1283 [10], we identified numerous genes that were differentially expressed in ES and PES internodes of the two genotypes.

Masking probes targeting inter-species variable regions
As a preliminary analysis of sequence divergence between orthologous genes of Medicago truncatula and Medicago sativa (alfalfa), we blasted 550,074 M. truncatula probe sequences (25-mer) on the GeneChip ® Medicago Genome Array against the 12,072 alfalfa expressed sequence tag (EST) sequences that are currently available from the public database (e-value cut-off = 0.001, minimum nucleotide alignment length = 20). A total of 21,176 M. truncatula probe sequences had alfalfa EST hits and 14,960 of them (~70%) showed at least one base mismatch (data not shown). These results suggested that masking ISV regions would optimize transcript profiling when using the Medicago GeneChip ® as a cross-species platform for measuring gene expression in alfalfa.
We developed a protocol to mask probes targeting ISV regions of the probe sets expressed in ES and PES internodes of alfalfa genotypes 252 and 1283. A work-flow diagram of the protocol used to mask probes targeting ISV regions is shown in Figure 1. For each tissue sample, three biological replicates were collected producing a total of 12 data points per probe (2 tissues × 2 genotypes × 3 replicates). A series of hybridization signal intensity thresholds were used to mask probes with signals below the threshold (see Methods for details). For a particular probe (12 data points), all data points were kept if three or more probes were above the signal intensity threshold. Otherwise, all 12 probes were masked.
The gene (probe set) expression levels were derived from the signal intensities of the retained probes in each probe set by using Robust Multi-array Average (RMA) which involves background corrections, multi-chip quantile normalization and multi-chip summarization processes [33]. As expected, the number of probes retained decreased rapidly (from ~674,000 to ~62,000) while the number of probe sets retained decreased gradually (from 61,000 to ~15,000) as signal intensity threshold increased ( Figure 2).
In previous work, we used the Medicago GeneChip to compare gene expression in ES versus PES internodes of M. truncatula genotype A17 from which most of the probes on the GeneChip were designed [12]. The mask-   ing protocol developed in this study is based on the assumption that the ratio of expression of genes in ES and PES internodes (ES/PES) of alfalfa is very similar to that measured in these tissues in M. truncatula (A17). Using the M. truncatula dataset, we identified genes (probe sets) that exhibited at least a 2-fold difference in gene expression between ES and PES tissues. Next, we identified genes that exhibited at least a 2-fold difference in expression ratio between ES and PES internodes of alfalfa genotype 252 as probe signal intensity threshold was increased from 0 to 640. Genes exhibiting at least a 2-fold difference in expression ratio between ES and PES internodes of both alfalfa and M. truncatula are referred to as commonly-selected genes. The number of commonlyselected genes increased from ~1,200 to ~1,600 as the signal intensity threshold increased to a value of 40 reflecting increased detection sensitivity ( Figure 3). The effect of masking biased probes with low signal intensity on the condensed signal intensity of the corresponding probe set is shown in Figure 4. Boxplots of the twelve GeneChip datasets (2 tissues × 2 species × 3 replicates) for alfalfa before and after masking ISV regions (probes with a signal intensity less than 40) show that masking increased mean signal intensity of the GeneChip dataset by approximately four-fold. The mean signal intensity after masking ISV regions is very close to the mean signal intensity of GeneChip dataset for stem internodes of M. truncatula (genotype A17). Overall, masking ISV regions increased the signal intensity of the Medicago probe sets which in turn increased the sensitivity of detecting alfalfa gene expression on the Medicago GeneChip.
To evaluate the effect of masking for ISV regions on the accuracy of measuring fold-changes in gene expression between PES and ES internodes of alfalfa, we examined the correlation of the hybridization intensity signal ratio of PES and ES internodes for the commonly-selected genes from the two Medicago species as signal intensity threshold was increased. The Pearson correlation coefficient of the PES/ES ratio between M. truncatula and alfalfa increased from 0.29 to 0.34 as signal intensity threshold increased up to about 100 reflecting increased accuracy ( Figure 3). The decline in correlation detected at signal intensity thresholds above 100 may be due to masking too many informative probes. Although the highest correlation of the PES/ES ratio between the two Medicago species was achieved with a signal intensity threshold of 100, the number of commonly-selected genes was reduced ( Figure 3). The data in Figure 3 show that the effect of masking on accuracy and sensitivity intersect at a signal intensity threshold of 40 where over 50,000 probe sets (about 85% of the total number on the Genechip) were retained ( Figure 2). On the basis of these results, we used a signal intensity threshold of 40 for masking biased probes due to ISV regions. The use of this masking threshold significantly improved sensitivity (the number of expressed genes detected) while maintaining a high level of accuracy in measuring fold-difference in gene expression. Most CSH studies in plants have used a genomic DNAbased strategy for probe selection [19,27,30]. To our knowledge, this study is the first to employ an RNAbased probe selection protocol to mask ISV regions when using a cross-species platform for transcript profiling in plants. The masking protocol that we developed has some advantages over previously reported masking protocols especially for crops with limited sequence information. For example, neither DNA hybridization [18] nor prerequisite sequence information [26] is needed to identify inter-species conserved probe sets. In addition, with careful experimental design including adequate replication, the masking protocol developed in this study is relatively simple to implement (see Methods). The protocol is based on the assumption that the ratio of gene expression in PES and ES internodes of two closely-related Medicago species (M. truncatula and M. sativa) is similar. In mammals, a similar approach involving comparisons of gene expression between organs has been used successfully in CSH studies [21]. In our study, the ratio of expression of probe sets in ES and PES internodes of M. truncatula was used to optimize both the sensitivity and the accuracy of detecting genes differentially expressed in alfalfa stem internodes. Our results suggest that a similar RNA-based approach for masking ISV regions could be successfully applied to other closely-related plant species where a microarray platform is available for one species.
Although the masking protocol used in this study is a useful tool for optimizing CSH GeneChip date, it does not correct for all bias in the data. It is important that candidate genes selected for further study based on masking results be validated by real-time quantitative RT-PCR. In addition, one limitation of a masking protocol based on RNA hybridization intensity is bias toward abundant transcripts. Low abundance genes (probes sets) would most likely be masked using this protocol.

Masking probes for both ISV regions and SFPs
Single-feature polymorphisms (SFPs) are polymorphisms detected by single probes in microarrays [34]. Previously, we identified 10,890 SFPs between alfalfa genotypes 252 and 1283 using the GeneChip expression data files for ES and PES internodes [10]. These allelic variations between the two genotypes can bias transcript profiling by causing both false positives and false negatives [35,36]. The effect of masking for both ISV regions and SFPs (i.e. doublemasking) on the number of probe sets retained was minimal. Only about 450 additional probe sets were lost after further masking for SFPs (data not shown). By masking for probes targeting both inter-and intra-species variable regions, we improved the quality of the CSH GeneChip data for the two alfalfa genotypes examined. The doublemasking strategy employed in this study can be applied to other species when using a cross-species platform for transcript profiling between two genotypes.

Effect of masking on detection of differentially expressed genes
We used a t-test (p-value and FDR cutoff of 0.001 and 0.05, respectively) combined with an additional cut-off ratio of 2-fold to identify genes that were differentially expressed in ES and PES internodes of genotypes 252 and 1283 (Additional file 1). The Venn diagrams in Figure 5 show the number of differentially expressed genes identified before masking, after masking for ISV regions, and after double-masking (i.e. masking for both ISV regions and SFPs). Masking for ISV regions using a signal intensity threshold of 40 increased the number of differentially-expressed genes identified in ES and PES internodes by about 2.5-fold. After double-masking, the number of differentially expressed genes detected was decreased compared to the number detected after masking for ISV regions only ( Figures 5A and 5B). This decrease is probably due to removal of probes associated with SFPs that were responsible for generating false positives for differentially expressed genes. After doublemasking, 639 and 1129 genes were differentially expressed between the two alfalfa genotypes in ES and PES internodes, respectively ( Figure 5C). A total of 251 genes were detected in both ES and PES internodes. From these results, we can estimate the number of putative false positives and false negatives that might have been caused by sequence variation (Figures 5A and 5B). For example in Figure 5A, 91 genes identified as being differentially expressed prior to masking, but not included after double-masking may be putative false positives caused by ISV regions and/or SFPs; 299 genes selected after masking for ISV regions but not included after double-masking may be putative false positives caused by SFPs; and 57 differentially expressed genes detected only after double-masking may be putative false negatives caused by ISV regions and/or SFPs ( Figure 5A).

Differences in gene expression between stem internodes of genotypes 252 and 1283
Genes that were differentially expressed in the internodes of alfalfa genotypes 252 and 1283 after double-masking were functionally classified using the MapMan gene functional classification system [10,37] (Additional file 1). To obtain an overview of gene functional classes that were differentially expressed in the two alfalfa genotypes, we conducted over-representation analysis using Page-Man, a software tool for comparative analysis of gene ontology [38] (see Methods for details). Compared to all probe sets on the Medicago GeneChip, "regulation of transcription" and "signalling" classes were significantly over-represented among genes up-regulated in PES internodes of genotype 252 compared to genotype 1283 (Figure 6, Additional file 2).
The transcript profiles of regulatory genes that were differentially expressed in ES and PES internodes of alfalfa genotypes 252 and 1283 are visually displayed in Figures 7A and 7B, respectively. A total of 115 putative transcription factor and 32 receptor kinase genes were differentially regulated between stems of the two alfalfa genotypes. The number of regulatory genes that were differentially expressed between the two genotypes was greater in PES internodes compared to ES internodes. The majority of putative transcription factor and receptor kinase genes that were up-regulated in PES internodes were found in genotype 252 ( Figure 7, Additional file 1).
The role of various transcription factors in regulating cell wall development has been examined primarily in Arabidopsis and poplar (Populus spp.) [39]. For example, over-expression of some NAC (NAM/ATAF/CUC) and MYB proteins in Arabidopsis led to abnormal ectopic deposition of secondary cell walls and suppression of their functions resulted in a decrease in secondary cell wall thickening [40][41][42][43][44][45][46][47]. Some MYB family transcription factors also regulate the expression of genes involved in lignin biosynthesis [47][48][49][50]. Interestingly, a recent study suggested that the NAC family transcription factor SND1 (SECONDARY WALL-ASSOCIATED NAC DOMAIN PROTEIN1) acts as a master transcriptional switch for activating secondary cell wall biosynthetic pathways by regulating the expression of 11 transcription factors (1 homeobox-, 2 NAC-and 8 MYB-domain containing genes) essential for normal secondary cell wall development [45]. In the alfalfa genotypes examined in this study,

A B
putative NAC genes Mtr.50934.1.S1_at and Mtr.25921.1.S1_at were up-regulated in both ES and PES internodes of genotype 252 compared to the same tissues in genotype 1283. Three MYB genes (Mtr.6897.1.S1_at, Mtr.42648.1.S1_at, Mtr.44850.1.S1_at) were up-regulated in 252 PES internodes compared to 1283 PES internodes. Among these, Mtr.42648.1.S1_at is a putative homolog of AtMYB63 (86% identical at the protein level), a SND-1 regulated MYB transcription factor that specifically activates lignin biosynthetic genes during secondary cell wall formation in Arabidopsis [47]. AtMYB63 was specifically expressed in fibers and vessels undergoing secondary cell wall thickening. Over-expression of AtMYB63 resulted in specific activation of lignin biosynthetic genes causing ectopic deposition of lignin in normally non-lignifying cells. Suppression of AtMYB63 led to a reduction in secondary cell wall thickening and lignin content [47]. Mtr.42648.1.S1_at also has high sequence homology with two other SND1-regulated AtMYB genes (AtMYB85 and AtMYB103) with 73% and 70% identity at the protein level, respectively. Over-expression of AtMYB103 and AtMYB85 led to an increase in secondary cell wall thickening in fibers and ectopic deposition of lignin in epidermal and cortical cells in stems [45]. Dominant repression of AtMYB103 and AtMYB85 resulted in significantly reduced secondary cell wall thickening in fiber cells. We also identified numerous other differentially expressed transcription factor families that have not been previously reported to play a role in cell wall development. For example, zinc finger (14 genes total) and WRKY (18 genes total) were the most abundant families among the differentially expressed transcription factors in genotypes 252 and 1283. Other significant transcription factor families identified include bHLH, b-ZIP, and AP2/EREBP (Additional file 1).
Receptor-like kinases (RLKs) were also significantly over-represented among genes up-regulated in genotype 252 PES internodes compared to genotype 1283 PES internodes (Figure 7, Additional file 2). RLKs are known to play significant roles in plant growth, development and defence responses [51][52][53]. There are more than 600 RLKs in the Arabidopsis genome. Several recent reports suggested a significant role for RLKs in regulating cell wall development. For example, a loss of function mutant of THESEUS1, a plasma membrane receptor kinase, suppressed the ectopic lignification and growth inhibition phenotype of prc1-1, a recessive CELLULOSE SYN-THASE 6 Arabidopsis mutant, by repressing the induction of stress responses. These results suggested that the THESEUS1 RLK acts as sensor of cell wall integrity [54]. Mutations in two leucine-rich repeat (LRR) RLKs (FEI1 and FEI2) disrupt anisotropic expansion and the synthesis of cell wall polymers including cellulose biosynthesis [55]. WAKs (wall-associated Ser/Thr receptor kinases) are tightly bound to the cell wall and are thought to play a significant role in regulating cell wall function as well [56,57]. Among the 32 putative RLKs identified in genotypes 252 and 1283, 23 were up-regulated in 252 PES internodes, one was up-regulated in both ES and PES internodes of 252, and one was up-regulated in 252 ES internodes compared to 1283 ES or PES internodes. One of the RLKs up-regulated in 252 PES internodes (Mtr.9325.1.S1_at) is a homolog of Arabidopsis FEI1. Two putative WAKs (Mtr.13054.1.S1_at and Mtr.3807.1.S1_at) were up-regulated in 252 PES internodes as well.

Differences in gene expression between stem internodes of genotypes 252 and 1283 are consistent with differences in cell wall composition
Overall, our results show significant up-regulation of a number of regulatory and cell wall-related genes in PES internodes of genotype 252 compared to genotype 1283. Many of the regulatory genes that were up-regulated are putative transcription factors and receptor kinases. Several of these up-regulated genes are known to play a role in the development of secondary xylem in Arabidopsis (e.g., AtMYB63, AtMYB85, AtMYB103, and FEI1). In addition, putative CesA genes that play a role in secondary cell wall development and putative genes involved in lignin synthesis were up-regulated in 252 PES internodes compared to 1283 PES internodes. The up-regulation of these regulatory and cell wall-related genes may play a role in the greater cell wall concentration and modified composition of PES internodes of genotype 252. On a dry matter basis, cellulose and lignin concentrations of stems of flowering plants (primarily PES internodes) are significantly higher in genotype 252 compared to genotype 1283 (302 g kg -1 vs. 257 g kg -1 for cellulose and 117 g kg -1 vs. 98 g kg -1 for Klason lignin, respectively) [10]. The greater cellulose and lignin concentrations in stems of genotype 252 compared to genotype1282 are associated with an 11% increase in total cell wall dry matter and a reduction in concentration of pectin sugar residues in the cell wall. The genotypic differences in cell wall concentration and composition of PES internodes are consistent with greater deposition of secondary xylem in internodes of genotype 252 compared to genotype 1283. Previous research has shown that increased development of secondary xylem increases cell wall concentration in alfalfa stems expressed on a dry weight basis [63]. Furthermore, the thick secondary walls of this tissue are rich in cellulose, xylan and lignin, but contain little if any pectin compared to primary cell walls [63]. The candidate genes identified in this study, especially transcription factor genes and genes involved in secondary cell wall synthesis, may play important roles in the development of secondary xylem in PES internodes of alfalfa. Future research involving transgenic approaches will be used to evaluate the role of these genes in the deposition of secondary xylem in alfalfa stems. Modifying the amount and composition of secondary xylem in stems of alfalfa will improve the value of alfalfa as a cellulosic feedstock.

Validation of selected candidate genes
A subset of 50 differentially expressed candidate genes from three functional categories (regulatory, signalling and cell wall-related genes) was initially selected for realtime quantitative RT-PCR validation. However, only 34 of these genes produced a single amplicon based on dissociation curves (see Methods for details). Most primers for real-time quantitative RT-PCR were designed using M. truncatula sequences because most probe sets selected for validation were designed from M. truncatula sequences. Sequence variation between the two Medicago species and among multi-gene families within species may explain the lower than expected RT-PCR success rate.
A total of 13, 15, and 6 candidate genes from cell wall, transcription factor, and signal transduction categories, respectively, were used for real-time quantitative RT-PCR validation ( Table 1). The GeneChip data with no masking indicated that 12 of the 34 candidate gene examined were differentially expressed between the two alfalfa genotypes. The quantitative RT-PCR validation data (2 plates × 3 wells × 2 genotypes =12 data points) revealed that of the 12 probes sets detected with no masking, 10 (83.3%) were true positives (Table 1). Two probe sets (16.7%) were identified as false positives. After double-masking the GeneChip data, an additional 22 genes were identified as differentially expressed (Table 1). The quantitative RT-PCR validation data for the 22 genes that were identified only after double-masking indicated that 19 (86.4%) were true positives. Three probe sets (13.6%) were identified as false positives. Overall, these results indicate a significant increase in detection power after double-masking (about a 2.8-fold increase in sensitivity) with little change in the false positive rate.
Next, we examined the effect of double-masking on the accuracy of measuring fold-differences in gene expression by plotted ΔΔC T values obtained from the real-time quantitative RT-PCR data against log 2 (252PES/1283PES) values from the GeneChip data before and after masking (Figure 9). The results show a linear relationship between the ΔΔC T value from real-time quantitative RT-PCR and the log ratio both before and after masking. Masking increased the Pearson correlation coefficient (R) from 0.85 to 0.92 (Figure 9). The stronger positive correlation Receptor-like kinase Mtr.3024.1.S1_at -+ 0.6 2.1 2.2** 1 + indicates genes that were detected as differentially-expressed between PES internodes of genotypes 252 and 1283 after a t-test (p < 0.001, FDR < 0.05, ≥2-fold difference) using the GeneChip data with no masking or after double-masking. 2 Log 2 (252PES/1283PES) value obtained from the GeneChip data with no masking or after double-masking. 3

ΔΔC T = [C T (1283PES) -C T (18s)] -[C T (252PES) -C T (18s)
] from real-time quantitative RT-PCR data. C T 18s rRNA values were stable in the tissues examined and were used to normalize data. *,** indicates probe sets that were detected as differentially expressed after a t-test (P < 0.01 or P < 0.001) respectively, The t-test of the real-time quantitative RT-PCR data for PES internodes of genotypes 252 and 1283 was performed with ΔC T values (12 data points = 2 plates × 3 wells × 2 genotypes).
between the two data sets after double-masking indicates increased accuracy.

Physical mapping of differentially expressed genes
Alfalfa and M. truncatula share the same eight orthologous basic chromosome sets but with different ploidy levels; 2n = 4x = 32 for alfalfa and 2n = 2x = 16 for M. truncatula. In addition, previous studies reported a significant level of colinearity in gene content and order between diploid and tetraploid alfalfa [64] and also between diploid alfalfa and M. truncatula [65,66]. The Medicago genome sequencing project has released genome sequence version 2.0 (http://www.medicago.org/ genome/) which contains about 220 Mbp of the total estimated 300 Mbp of M. truncatula euchromatin. A total of 38,844 non-overlapping coding sequences have been annotated to some extent with locus information to the corresponding chromosomes. To examine trends in the chromosomal distribution of genes that were differentially expressed between alfalfa genotypes 252 and 1283, we mapped 1044 differentially expressed genes (out of the total of 1624) onto their corresponding orthologous loci on M. truncatula chromosomes (Figure 10, Additional file 3). Of these mapped genes, 445 and 775 were differentially expressed in ES and PES internodes, respectively ( Figure 10A). A total of 176 genes were differentially expressed in both ES and PES internodes of the two genotypes.
Although differentially expressed genes between alfalfa genotypes 252 and 1283 were distributed over all chromosomes, regions of high frequency were evident ( Figure  10A). The frequency of differentially expressed genes in ES and PES internodes was analyzed in a 50 kb sliding window along chromosomes. The results indicated possible chromosomal clustering of genes with the same expression pattern ( Figure 10B). For example, in several chromosomal segments, we observed possible clustering of genes that were up-regulated in 252 PES internodes compared to 1283 PES internodes ( Figure 10B, chromosomal regions where only blue peaks were observed). These results suggested possible co-regulation of differentially expressed genes found in localized chromosomal regions ( Figure 10). We used the simulation method described by Grant et al. [67] to statistically analyze whether the clustering patterns observed for differentially expressed genes with similar expression patterns (i.e. genes up-regulated in 252 PES internodes compared to 1283 PES internodes) were non-random (see Methods for details). The difference between experimental data (selected gene distribution) and simulated data (random distribution) was considered statistically significant if the absolute value of (experimental data -simulation mean)/(simulation standard deviation) was ≥ 2 (see Methods for details). The results indicated statistically significant clustering of two to four differentially expressed genes with a similar expression pattern within both 50 kb and 100 kb windows in the genome ( Table 2, Additional file 4).
To some degree, the clustering of co-expressed genes detected in this study may represent tandem repeats of duplicated genes. During the physical mapping process, some probe sets were targeted to closely linked multiple loci on the M. truncatula genome. If multiple loci hits per probe set were detected, we mapped only the top hit locus for each differentially expressed probe set onto the M. truncatula genome. By doing so, we reduced the chances that clusters detected in our cluster simulation analysis were due to tandem repeats of duplicated genes. Thus, the majority of co-expressed gene loci used for clustering simulation analysis are sequence unrelated. Previous studies conducted in other model systems reported chromosomal clustering of co-expressed genes even after removing duplicated genes [68,69].
Numerous studies have reported co-expression of neighbouring genes in eukaryotes [70][71][72][73][74] including Arabidopsis [68,75,76] and rice [77]. Co-regulated gene clusters often share the same biological functions and/or are in the same pathway [72,75]. These co-expressed genes could be regulated by the same transcription factor [72] or share the same promoter elements [71] for co-regulation. Natural selection may promote the clustering of coexpressed genes as well [70]. However, the mechanism behind the clustering of co-expressed genes is still unclear.
Chromosomal segments with clusters of co-expressed candidate genes will be useful for alfalfa breeding, especially for wide-crosses involving introgression of foreign  (Table 1) plotted against log 2 (252PES/1283PES) hybridization intensity ratio values from the GeneChip data before (blue triangles) and after (red circles) masking. chromosomal segments from alien species into elite alfalfa cultivars. In addition, the genomic DNA sequence of multiple candidate genes can be obtained by sequencing a BAC containing the candidate gene cluster.

Conclusion
Masking biased probes due to inter-species variable (ISV) regions and SFPs increased the sensitivity and accuracy of the transcript profiling data for alfalfa when using the Medicago GeneChip as a cross-species platform. The masking protocol developed in this study can be applied to other CSH studies involving the use of GeneChips for transcript profiling. The transcript profiling data, indicating up-regulation of putative cellulose and lignin genes involved in secondary cell wall thickening in 252 PES internodes compared to 1283 PES internodes, is consistent with difference in cell wall concentration and composition between the two genotypes. Numerous cell wall and regulatory genes that may contribute to differences in cell wall composition and concentration of the two alfalfa genotypes were identified. These candidate genes will be useful for improving alfalfa as a cellulosic feedstock via a transgenic approach. Physical mapping and clustering simulation analysis of the differentially expressed alfalfa genes on orthologous loci of M. truncatula suggested chromosomal regions where statistically significant coexpression of neighbouring genes occurred.

RNA extraction, labelling and GeneChip hybridization
Elongating and post-elongation stem internodes of genotypes 252 and 1283 grown in the greenhouse were harvested as previously described [10]. Methods used for RNA extraction, labelling, and GeneChip hybridization were previously described [10]. The raw data cel files used in this study are available in the NCBI Gene Expression Omnibus (http://www.ncbi.nlm.nih.gov/geo/query/ acc.cgi?acc=GSE13602).

Masking probes targeting variable regions
First, we created a series of probe mask files based on normalized probe intensity signals. Briefly, all cel files were background corrected and quantile-normalized at the probe level using affy package from Bioconductor (http://bioconductor.org). Based on the minimums, maximums, and different quantile distributions of the probe signal intensities, we chose several intensity points (5, 7, 10, 15, 20, 30, 40, 60, 80, 100, 120, 160, 320, 640, 1280, and 2560) for masking. One mask file was created for each probe intensity point. If the intensity of a probe was lower than the masking intensity in a defined percentage of all cel files, the probe was masked for all samples. The defined percentage (P) was determined by the number of replicates in each sample type and the number of sample types: where Ts is the total number of sample files, R is the replicate number of each sample type and S is the number of sample types. The floor and ceiling are mathematical functions that map a real number to the next smallest and next largest integer, respectively. In this study, we used genotypes 252 and 1283 with ES and PES internode tissues for each genotype (S = 4) and three replicates (R = 3) for a total of 12 sample files (Ts = 12, P = 0.83). Based on the defined percentage (P), we created a series of mask files at each intensity point for all samples. A file was also created for masking the probe locations of the previously identified 10,890 single-feature polymorphisms (SFPs) in ES and PES internodes of alfalfa genotypes 252 and 1283 [10].
Next, we applied the series of masking files to the cel files of ES and PES internodes of genotype 252 using Expressionist Refiner module (http://www.genedata.com). Briefly, the raw data cel files were loaded into Refiner with the masking files. The RMA algorithm summarized the probe-level signals into probe set expression indexes. The resulting expression index files were evaluated by comparing them to the expression files obtained from the same internode tissues in the reference species, M. truncatula. For each signal intensity, we examined the correlation of the PES/ES expression ratio for the commonly-selected genes from the two Medicago species. Commonly-selected genes are defined as genes that exhibit at least a 2-fold difference in gene expression in ES and PES internodes of the two species. The masked expression data that yielded the best performance, as evaluated by optimization of both sensitivity and accuracy, was selected for differential gene expression analysis.

Detection of differentially expressed genes
After selection of the optimum intensity threshold for masking, the differentially expressed genes were identified in the masked data set by applying a t-test to the expression values in ES or PES internodes (3 replications for each tissue type) of the two alfalfa genotypes (e.g., 252 ES vs. 1283 ES) with a p-value and FDR cutoff of 0.001 and 0.05, respectively. An additional ratio cutoff of 2-fold was applied using the Genedata Expressionist Analyst module (http://www.genedata.com/). The gene expression signals corresponding to the bacterial microsymbiont (Sinorhizobium meliloti) probe sets were excluded in this analysis.

Functional classification and over-representation analysis
The MapMan gene functional classification system [37] was assigned to the probe sets on the Medicago GeneChip following the method previously described [10]. The functional class over-representation analysis was performed using PageMan [38] as previously described [10] except that the log 2 (252/1283) values for ES and PES internodes were given to the selected probe sets and the remainder of the probe sets on the GeneChip (not selected) were given a false expression value of "zero". For over-representation analysis, the z-value cuttoff was set as 1 after Bonferroni correction.

Real-time quantitative RT-PCR
Total RNA used for GeneChip hybridization was also used to make cDNAs for real-time quantitative RT-PCR. First strand cDNAs for each sample were made using random hexamers and Taqman Reverse Transcription Reagents (Applied Biosystems, CA) following the manufacturer's recommendations. Gene specific primers for the selected probe sets were designed based on the consensus sequences (http://www.affymetrix.com) using Primer Express (Applied Biosystems, CA) (Additional file 5). Samples and standards were run in triplicate on each plate and repeated on at least two plates using SYBR-Green PCR Master Mix (Applied Biosystems, CA) on a GeneAmp 7500 Sequence Detection System (Applied Biosystems, CA) following the manufacturer's recommendations. Real-time quantitative RT-PCR was performed in a 20 μl reaction containing 7 μl ddH 2 O, 10 μl 2× PCR mix, 1 μl forward primer (4 μM), 1 μl reverse primer (4 μM), and 1 μl of template cDNA (10 ng/μl). The PCR conditions were two minutes of pre-incubation at 50°C, 10 minutes of pre-denaturation at 94°C, 40 cycles of 15 seconds at 95°C and one minute at 60°C, followed by steps for dissociation curve generation (30 seconds at 95°C, 60 seconds at 60°C and 30 seconds at 95°C). The 7500 System SDS software v.1.2.2 was used for data collection and analysis. Dissociation curves for each amplicon were carefully examined to confirm the specificity of the primer pair used. Relative transcript levels for each sample were obtained using the "comparative C T method". The threshold cycle (C T ) value obtained after each reaction was normalized to the C T value of 18S rRNA. The relative expression level was obtained by calibrating the ΔΔC T values for other samples using a normalized C T value (ΔΔC T ) for the PES internodes of alfalfa genotype 252.

Physical mapping and frequency distribution
The Medicago genome release version 2.0 (Mt2.0) (http:// www.medicago.org/genome/) contains 38,844 coding sequences with various degrees of annotation and predicted chromosome locations. We used these coding sequences to search against the Medicago GeneChip Probe consensus sequences database (http:// www.affymetrix.com) using blastn [78] with match matrix BLOSUM62 and a mismatch penalty of -3. We chose the blast parameter E value (0.0001) and bit score (100) for hit cutoff. If there were multiple loci hits per probe set, only the top hit locus was mapped for each probe set to minimize the effect of tandem repeats during the clustering simulation analysis. This analysis generated putative orthologous chromosome locations for 36,709 of the Medicago GeneChip probe sets. We used an R script to map the differentially expressed genes (p < 0.001, >2fold difference) between alfalfa genotypes 252 and 1283 [ ES internodes (red triangles) and PES internodes (blue circles)] onto the putative orthologous loci in the M. truncatula chromosomes 1 through 8.
Frequencies of differentially expressed genes on chromosomes 1 through 8 in ES and PES internodes were examined in a 50 kb sliding window. For each tissue, the physical location information of the differentially expressed genes on the M. truncatula chromosomes was extracted. The frequency of genes selected within a sliding 50 kb window was calculated. This sliding window shifted every 10 kb along the chromosome. Within each window, the calculated gene frequency was plotted against chromosome distance in kb.

Clustering simulations
The simulation protocol described by Grant et al. [67] was used to test for clustering of selected differentially expressed candidate genes on chromosomes. Mt2.0_pseudomolecule contains a total 266,102,767 bases on 8 chromosomes. This genome sequence was partitioned into a total of 3,856 and 2,122 bins based on 50 kb and 100 kb windows, respectively. The simulation program randomly positioned a defined number of selected genes on the genome and the number of bins with different frequency of assigned genes was determined. The simulation was repeated 2,000 times. The difference between experimental data (selected gene distribution) and simulated data (random distribution) was considered statistically significant if the absolute value of (experimental data -simulation mean)/(simulation standard deviation) was ≥ 2. A significant difference is indicative of clustering within the defined 50 kb or 100 kb window.