Sequence signatures involved in targeting the male-specific lethal complex to X-chromosomal genes in Drosophila melanogaster
© Philip et al; licensee BioMed Central Ltd. 2012
Received: 4 October 2011
Accepted: 19 March 2012
Published: 19 March 2012
In Drosophila melanogaster, the dosage-compensation system that equalizes X-linked gene expression between males and females, thereby assuring that an appropriate balance is maintained between the expression of genes on the X chromosome(s) and the autosomes, is at least partially mediated by the Male-Specific Lethal (MSL) complex. This complex binds to genes with a preference for exons on the male X chromosome with a 3' bias, and it targets most expressed genes on the X chromosome. However, a number of genes are expressed but not targeted by the complex. High affinity sites seem to be responsible for initial recruitment of the complex to the X chromosome, but the targeting to and within individual genes is poorly understood.
We have extensively examined X chromosome sequence variation within five types of gene features (promoters, 5' UTRs, coding sequences, introns, 3' UTRs) and intergenic sequences, and assessed its potential involvement in dosage compensation. Presented results show that: the X chromosome has a distinct sequence composition within its gene features; some of the detected variation correlates with genes targeted by the MSL-complex; the insulator protein BEAF-32 preferentially binds upstream of MSL-bound genes; BEAF-32 and MOF co-localizes in promoters; and that bound genes have a distinct sequence composition that shows a 3' bias within coding sequence.
Although, many strongly bound genes are close to a high affinity site neither our promoter motif nor our coding sequence signatures show any correlation to HAS. Based on the results presented here, we believe that there are sequences in the promoters and coding sequences of targeted genes that have the potential to direct the secondary spreading of the MSL-complex to nearby genes.
KeywordsDosage compensation Sequence signatures Drosophila Motif discovery MSL-complex
Drosophila melanogaster males are heterogametic (XY), while females are homogametic (XX). The Y chromosome has gradually lost genes and degenerated, resulting in an increasingly aneuploid condition in males and the evolution of systems that compensate for between-sex differences in doses of genes located on X chromosomes [1–4]. The dosage-compensation system equalizes X-linked gene expression between males and females, thus maintaining an appropriate balance between the expression of genes on X chromosome(s) and the autosomes [5, 6].
The amount of transcripts from the single X chromosome of male Drosophila individuals is boosted about two-fold relative to levels of each of the two in females, thereby roughly equalizing their overall X chromosome gene expression . This dosage compensation is critical, and loss of required proteins leads to male-specific lethality [8, 9]. These proteins include MSL-1 (male-specific lethal 1), MSL-2, MSL-3, MOF (males absent on the first) and MLE (maleless), which form an X chromosome-specific MSL complex, or dosage compensation complex (DCC), with two functionally redundant long non-coding RNAs: RNA on the X1 and X 2; roX1 and roX2, respectively [10–14]. The selective activation of X chromosomal genes is at least partly due to the hyperacetylation of histone H4 lysine 16 (H4K16) by the histone acetyltransferase (HAT) - MOF, an integral subunit of the MSL complex [15, 16].
The binding pattern of MSL proteins on the X chromosome has been identified in diverse cell lines, embryos and third instar larvae using various genome-wide techniques such as chromatin immunoprecipitation coupled with microarray technology (ChIP-on-chip) or deep sequencing (ChIP-seq) [17–22]. Transcript levels of genes in RNAi-mediated depletion backgrounds and msl gene mutants have also been examined in diverse cell lines, embryos, and larvae using hybridization of transcript populations to gene expression microarrays or Real-time PCR [20, 23–25]. These studies have revealed that: the MSL complex preferentially binds to gene coding regions, particularly the 3' end of genes; the binding pattern does not dramatically change during different stages of development; and loss of MSL-complex functionality only reduces expression of X-linked genes to about 80% of wild type levels. In addition, results of a recent analysis indicate that the MSL complex mediates dosage compensation of X chromosomal genes by enhancing transcriptional elongation, in accordance with the observed 3' bias .
Two main models have been proposed to explain the distribution of MSL complexes along the X chromosome. One suggests that the complex initially targets a relatively small number of X chromosome-specific primary recruitment or chromosomal "entry" sites (CES) then "spreads" along the chromosome from these sites in cis [27, 28]. The other postulates that large numbers of specific sites of varying affinities are present, based on data gathered from X chromosomal translocation studies [29, 30].
In situ hybridization analyses of polytene chromosomes have shown that the Drosophila X chromosome is enriched in (dC - dA)n/(dG - dT)n sequences , and that in every Drosophila species examined to date dosage-compensated chromosomes have higher than average CA/TG, CT/AG and C/G frequencies . Subsequent, computational whole-genome sequence analysis showed that throughout the Drosophila genus X chromosomes can be distinguished from other chromosomes by their A, T, C/An and G/Tn repeat sequences [33, 34]. Recent MSL protein-binding region analyses have also detected X chromosomal enrichment of low complexity sequence elements, such as GA- and CA-based dinucleotide repeats and runs of adenines [19, 22, 29, 35]. In addition, GA-rich or TC-rich motifs have been identified in high affinity binding sites (HAS) for MSL proteins on the X chromosome using genome-wide techniques [18, 22]. A repetitive sequence motif [G(CG)N]4 was also recently discovered in low affinity sites targeted by MSL proteins . However, although the enrichment of simple sequence elements has been detected on the X chromosome it is still unclear if primary DNA sequences are involved in the targeting of the MSL complex to and within individual genes.
Here we present an extensive analysis of X chromosome sequence variation, and its potential involvement in dosage compensation, in which we used multivariate modeling and previously published data to explore relationships between MSL complex distributions, transcription patterns and five gene features -- promoters, 5' UTRs, coding sequences (CDS), introns, 3' UTRs -- and intergenic sequences (hereafter also classed as gene features, for convenience). Our results show that: the X chromosome has a distinct sequence composition within all six types of features examined; some of this variation correlates with genes targeted by the MSL-complex; the insulator protein BEAF-32 binds preferentially upstream of MSL-bound genes; BEAF-32 and MOF co-localizes in promoters; and bound genes have a distinct sequence composition that shows a 3' bias within coding sequence.
The Drosophila melanogaster X chromosome has distinct sequence signatures
In a previous analysis of sequence variation between the Drosophila melanogaster (Dm) chromosomes we found evidence of chromosome-distinguishing sequence words on the fourth and X chromosomes . In the study presented here we focus on the X chromosome and whether any of its sequence variation can be related to the dosage compensation of this chromosome. We excluded the 4th chromosome (since it is atypical in many respects) from all our analyses.
To study sequence variation that is not directly correlated to AT-content we applied the following simple normalization. The frequency of each sequence word was divided by its expected frequency in each observation, see  for details. Interestingly, the PCA based on the normalized values showed that there is more sequence variation between gene features than between chromosomes (Figure 1B). The first Principal Component separates coding sequences of all chromosomes from all other gene features and the second separates 5' UTRs. The other gene features separate in the proceeding components (Figure 1C). This has strong implications for attempts to identify discriminating patterns between groups of sequences; if the differences between gene features included are not accounted for, the results may reflect differences in gene feature composition rather than biologically relevant sequence variation.
In the fifth Principal Component of the gene feature PCA, all X-chromosomal gene feature sequences are shifted away from the autosomal sequences (Figure 1D; X-chromosomal sequences, shown in green, are shifted to the right). Corresponding loadings for the fifth component reveal that all gene features on the X chromosome are enriched in mono- and di-nucleotide repeats (Additional file 2; a summary of fits for all multivariate models is available in Additional file 3). The AT-normalized PCA models for each individual class of gene features also clearly showed that the X chromosome has a distinct sequence composition (Additional file 4), indicating that there are potential X chromosome targeting sequences within all of its gene features.
Gene features of expressed genes strongly bound by MSL have distinct sequence signatures
We next explored whether any of the sequence signatures in the gene features correlated to binding of the Male-Specific Lethal (MSL) complex. For this purpose we used the data from  to select strongly MSL bound and weakly MSL bound genes as well as expressed and unexpressed genes (see Methods). We did not try to define genes that are bound or unbound by the MSL-complex, but rather select two extreme groups (one with very strong binding and one with very weak or no binding). The data from  currently represent the only dataset where mapping of several MSL-complex components and transcription in mutants/knock-downs of MSL-components are available in the same cell-type. Using several components of the MSL-complex should improve estimates of its binding values to genes, but we also compared the gene binding values obtained using the MOF, MSL-1 and MSL-3 data from  to gene binding values calculated in the same way based on data from [17, 18, 22, 37] and modENCODE http://www.modencode.org. In general all data sets correlate very well despite being performed by different groups using different conditions and antibodies (average Spearman R = 0.83, Additional file 5). We conclude that estimates of MSL-binding values are consistent irrespective of which data sets are used.
Promoters and coding sequences are potentially involved in MSL-complex spreading along X chromosomes
We next wanted to see whether it was possible to identify sequences or motifs that could be used to predict the MSL-binding status of individual genes. For this, we applied the supervised multivariate method Orthogonal Partial Least Squares Discriminant Analysis (OPLS-DA) [38, 39], which seeks variables that are predictive of a pre-defined classification of the observations (rather than merely overrepresented). Since transcription has been shown to be important for MSL-complex targeting , we excluded all genes for which transcription levels cannot be accurately determined (see Methods). We selected genes that are expressed and strongly bound by the MSL-complex (n = 167) and those expressed but weakly bound by the MSL-complex (n = 151) as before. In the different models some genes were excluded because they lacked the annotated gene feature modeled (see Methods and Additional file 7). Two-thirds of the dataset was randomly selected and used as a training set for constructing the models, and the other third as a test set for assessing the accuracy of their predictions. We excluded intergenic sequences as they cannot be specifically assigned to a particular gene. We obtained significant models for promoters, CDS and introns, but not for 3' UTRs and 5' UTRs (Additional file 3). By plotting the relation between the first component and both expression levels and AT-content we determined that expression levels did not significantly influence the models (Additional file 8). However, AT-content strongly affected the intron and 3' UTR models and when we normalized for AT-content, we obtained no significant models for these features. The promoter and coding sequence models were then used to predict the MSL-binding status of the previously excluded third of the genes. Strikingly, the Y prediction scores for expressed genes strongly and weakly bound by MSL differed significantly according to both the promoter and coding sequence models (Mann-Whitney U Test; p = 3 × 10-6 and 1 × 10-3, respectively). To verify these results we constructed new models, for which we again randomly selected two thirds of the data for modeling and one third for testing predictions. The prediction results were very similar (data not shown). The promoter and coding sequence models could therefore robustly predict the MSL-binding status of the genes.
Insulator protein BEAF-32 is enriched in promoters of genes strongly bound by MSL
Our OPLS-DA model was based on two-thirds of the 167 strongest bound genes on the X-chromosome and previous studies have indicated that 534-773 genes are bound by the complex [17, 19, 20]. Therefore, we plotted the fraction of genes containing the promoter motif within 500 bp of the TSS versus the average gene-binding value of the three MSL proteins for all expressed X-linked genes (Figure 3C, Additional file 9). We observe that promoters of genes that have an average binding of the three MSL-proteins considered here of about 0.6 or more seems to have an enrichment of this motif. The number of expressed genes with an average of the three proteins above 0.6 is 660, which falls within the range of previous estimates for number of MSL bound genes. Further, a search with our promoter motif PWM in the TOMTOM tool  indicated that the promoter motif match the boundary element associated factor BEAF-32 motif. To test whether this protein preferentially binds to promoters of MSL-bound, expressed genes we downloaded BEAF-32 mapping data gathered by the modENCODE consortium for S2 cells . Interestingly, the average distance to a BEAF-32 binding site from the transcription start site is significantly shorter for strongly MSL-bound, expressed genes than for both weakly MSL-bound, expressed genes (Mann-Whitney U Test p = 2.6 × 10-18) and unexpressed genes (Mann-Whitney U Test p < 1 × 10-20). In addition, a larger fraction of the strongly bound class of genes have BEAF-32 binding within 500 bp of the TSS (49%) compared to weakly bound genes (13%) and active genes on chromosome 2L (31%, Figure 3B). Similar to the motif, the BEAF-32 binding sites overlap a larger fraction of the genes (within 500 bp of the TSS) that have an average gene-binding value of the three MSL proteins of more than about 0.6, likely representing functionally MSL-bound and dosage compensated genes (Figure 3C, Additional file 9). However, we found no general enrichment of BEAF-32 on the X chromosome relative to the autosomes when we analyzed genome coverage of binding regions (3L has the lowest, 2R the highest and X chromosomes intermediate coverage: 1.6%, 2.0% and 1.9%, respectively). To test whether only BEAF-32 is found preferentially upstream of strongly bound genes or whether also other insulator proteins behave similarly, we run the same analysis (distance to TSS) using CP190, Su(Hw) and CTCF (also mapped by modENCODE consortium for S2 cells ). None of the other insulator proteins showed any preference for genes strongly bound by MSL (Additional file 10). However, Su(Hw) has a tendency to bind upstream of expressed but weakly MSL-bound genes (Mann-Whitney U Test p = 7.5 × 10-4).
The High Affinity Sites (HAS) that are thought to be responsible for the initial recruitment of the MSL-complex to the X-chromosome, was defined by being the strongest MSL-sites along the X chromosome . Therefore, it is likely that many of the genes we selected as strongly bound (the genes along X with the strongest gene-binding values) overlap a HAS. To find out how our results correlate with the presence of high affinity sites we used 188 defined HAS (a union of the HAS reported by Alekseyenko et al.  and Straub et al. . Indeed, 116 of the 167 strongly bound genes overlap or have a HAS within 5 kb. We also see that the top sequence words from the very weak 3' UTR OPLS-DA model with two-third dataset (Additional file 7) resembles the GA-rich HAS motif [18, 22]. To investigate if the presence of the promoter motif we identified is correlated to the presence of HAS, we divided the strongly bound genes into those proximal to a HAS (within 5 kb) and those far from a HAS (> 5 kb). We found no significant difference in terms of distance from the TSS to the nearest promoter motif for these two groups (Mann-Whitney U Test p = 0.90). In fact, 41% of the HAS proximal strongly bound genes have the motif within 500 bp of the TSS compared to 49% of the strongly bound genes far from a HAS (> 5 kb). We also constructed a new promoter OPLS-DA model (genes strongly bound by MSL versus genes weakly bound by MSL), this time excluding genes within 5 kb of a HAS. Since, many of the most strongly bound genes are close to a HAS, we decreased the cut-off for being strongly bound to 0.8 (instead of a binding value of at least 1 for the three proteins). We ended up with 86 strongly bound and 88 weakly bound genes that are far from any HAS. The motif extracted from this model is virtually identical with the motif in Figure 3A, showing that this motif is not correlated to High Affinity Sites. In addition, the expressed genes on the X chromosome with BEAF-32 in the promoter are not preferentially found close to high affinity sites (Additional file 9).
Sequence signatures in coding sequence of genes strongly bound by MSL have a 3' bias
Since the number of genes in the initial model is relatively low (around 200-300 observations) we wanted to test how likely it is that a model based on similar numbers of randomly selected genes produces a significant OPLS-DA model. We randomly selected seven non-overlapping groups consisting of coding sequence from 100 expressed genes from chromosome 2L. Using OPLS-DA we modelled all pair wise combinations of these seven groups (21 combinations). All pair wise combinations produced models with negative Q2 values (predictive power). Six models had a strong correlation to AT-content (Spearman R > 0.4). Normalizing for AT-content did not improve the models. Next, we repeated this process with coding sequence from seven randomly selected groups of intermediately bound genes (100 genes/group) on the X-chromosome. Intermediately bound genes where defined as expressed genes on the X chromosome not belonging to the previously defined strongly and weakly MSL-bound genes. 20 pair wise combinations produced models with a negative Q2 and one combination produced a very weak model with a Q2 of 0.03 (to be compared to a Q2 of 0.27 in the strongly versus weakly MSL-bound model). Three models had a strong AT-correlation, but normalizing for nucleotide content did not improve the models. We conclude that randomly selected groups of sequences are unlikely to produce a significant model with predictive power.
We have thoroughly investigated X chromosome sequence variation in D. melanogaster and related this variation to the targeting of the dosage compensation complex, using frequencies of two to six base pair sequence "words" and multivariate statistical analyses. The advantage of our approach is that it is unbiased and focused on finding sequences with predictive value, rather than merely over-represented sequences. First, we divided the genome sequence into intergenic, promoter, 5' UTR, coding, intron and 3' UTR sequences. Interestingly, there is more divergence among these six sequence types or gene features than within the sequence types on different chromosomes (see Figure 1). Our findings also show that sequences are present in promoters and coding sequence that could be involved in the spreading of the MSL-complex from the high affinity sites on the X chromosome. The coding sequences we have identified share a similar 3' bias with the MSL-complex. Further, the highest scoring promoter sequences form the target motif of the insulator protein BEAF-32, and BEAF-32 mapping data indicate that this protein binds preferentially upstream of genes strongly bound by MSL.
Sequence variation between gene features
Different gene features are known to vary in sequence composition, but in our opinion their variation is not normally taken into account in attempts to discover new sequence motifs. We here show the extent of this sequence variation, and that coding sequences have the most distinct sequence composition followed by 5' UTRs, 3' UTRs and promoters. This has important implications for studies of sequence variation and motif discovery; when groups of sequences are compared it is important to take gene features into account (e.g. when using the MEME option of discriminative motif discovery), otherwise the results may reflect differences in gene feature composition rather than biologically relevant sequence variation.
The X chromosome has a distinct sequence composition
Our separate analyses of the six gene features clearly show that the sequence composition of those in the X chromosome differs from the composition of corresponding features in all other chromosomes (Additional file 4). This distinction of the X chromosome is mainly due to differences in frequencies of various di-nucleotides, many of which have been previously found to be enriched on X [33, 34]. These sequences could, in principle, be involved in recruiting X chromosome-specific factors, such as the MSL-complex. Apart from being dosage-compensated in males, the X-chromosome might also be under selective forces that do not act on the autosomes. Some of the sequence variation of the X-chromosome is likely a result of its evolution as a sex chromosome. The MSL-complex is the only known protein complex involved in dosage compensation in Drosophila with an X chromosome-specific distribution. We have focused here on the sequence variation that could be related to the targeting of this complex. It has been shown that the MSL-complex is initially targeted to X by binding to so-called high affinity sites (HAS) that contain the GA-rich MSL recognition element (MRE) [18, 22]. The MSL-complex can be recruited to autosomes by inserting MRE-containing high affinity sites [18, 22], but the mechanism involved in the spreading of MSL to X-chromosomal genes is under debate [15, 44]. We have investigated here whether sequence patterns may be involved in this spreading of the MSL-complex, as discussed below.
Genes strongly bound by MSL have a distinct sequence composition
The genome distribution of the MSL-complex has been mapped in several studies [17–22]. We used the data from (Kind et al., ) to select genes that are expressed and strongly MSL-bound, expressed and weakly MSL-bound as well as unexpressed genes. This is the only currently available dataset where mapping of several MSL-complex components and transcription in mutants/knock-downs of MSL-components was done in parallel and in the same cell-type. When merging all strongly MSL-bound expressed genes into one observation and all weakly MSL-bound expressed genes into another, we find that all six sequence types have sequences that differ between strongly bound and weakly bound genes (Figure 2). We observed that sequence variation between expressed genes strongly bound and weakly bound by MSL complex is much higher than that between expressed and unexpressed genes on chromosome X. Further, expressed genes that are weakly bound by the MSL complex group more closely to unexpressed genes than to expressed MSL-bound genes in our PCA score plots. Therefore, the small but significant expression difference detected between the expressed genes that are strongly bound and weakly bound by the MSL complex did not have any major correlation on the sequence variation observed between the two groups. Sequence words extracted from PCA models of intron, 3' UTR and 5' UTR sequences were more GA, CA or adenine rich which are in agreement with the previous identification of CA dinucleotide repeats, runs of adenines and GA-rich MRE motif from High Affinity Sites (HAS) [18, 22, 32, 35]. We conclude that there are differences in sequences of all six features between expressed genes that are strongly bound and weakly bound by the MSL-complex. However, these results merely identify sequence words that are overrepresented in groups of genes strongly or weakly bound by MSL. In order to search for predictive sequence patterns for MSL-binding to individual genes we applied Orthogonal Partial Least Squares Discriminant Analysis, OPLS-DA.
Promoters and coding sequences are potentially involved in MSL-complex spreading along X chromosomes
Using OPLS-DA we explored differences between features of individual genes that are strongly MSL-bound and expressed versus weakly MSL-bound and expressed, extracted sequence words with the highest predictive power, and attempted to combine them into more complex motifs using the algorithm described herein. Interestingly, both coding sequence and promoter models yielded sequence words that could be used to predict the MSL-binding status of genes excluded from the modeling. Neither nucleotide content nor expression level significantly influence these promoter and coding sequence models and the top sequence words we identified are only weakly overrepresented on the X-chromosome. We conclude that promoters and coding sequences contain sequence signatures that are potentially involved in the spreading of the MSL-complex from high affinity sites. In principle, there may be motifs in unbound, expressed genes that block the binding of the MSL-complex, but we obtained no evidence for such motifs.
Insulator protein BEAF-32 preferentially binds to promoters of MSL-bound genes
From the promoter model we extracted the motif in Figure 3A, which could be used to predict promoters of genes strongly bound by MSL. This motif proved to correspond to the targeting motif for the insulator protein BEAF-32 , which binds to hundreds of sites across the genome, generally located upstream of active genes [46, 47]. Although the molecular mechanisms of BEAF-32 activity are unknown, it seems to be linked with active transcription [41, 42]. In order to test whether the BEAF-32 protein itself is enriched at strongly MSL-bound genes we used BEAF-32 ChIP-chip mapping data obtained from modENCODE, and found that BEAF-32 preferentially binds proximal to transcription start sites of genes strongly bound by MSL. This exciting link between BEAF-32 and dosage compensation is supported by the observation that beaf-32 mutants have a male-specific defect in X-chromosome morphology . Further, Laverty et al.  found that reporters inserted on the X chromosome are better able to recruit the MSL-complex if they have binding sites for GAGA and DREF factors. The DREF binding site is very similar to the BEAF-32 binding site and although DREF might be involved in dosage compensation it is possible that increased BEAF-32 recruitment is the true case of the effects observed by . However, since DREF has not been mapped genome wide we cannot exclude the possibility that our promoter motif correlate better with DREF. BEAF-32 is associated with active transcription and might facilitate the MSL-complex targeting of active genes. Since MSL-complex bound genes show MOF binding in the promoter  and MOF clearly co-localize with BEAF-32, we hypothesize that BEAF-32 and MOF interact in promoters of MSL-complex bound genes. BEAF-32 is a DNA-binding protein and might recruit MOF to active genes on the X-chromosome, genes that are then targeted by the MSL-complex. However, further experimental efforts are needed to understand the link uncovered here between BEAF-32 and the MSL-complex.
Sequence signatures in coding sequences of MSL-bound genes have a 3' bias
The finding of sequence patterns that are predictive of MSL-binding genes within coding sequences is intriguing, although it has been previously reported . Scoring the sequence words only in the transcribed strand or the correct frame did not improve the coding sequence model, suggesting that the relationships are not attributable to (for instance) specific variations in amino acid composition. Neither did we find any codon usage bias between strongly bound and weakly bound expressed genes, or any model correlation with expression and AT-content. However, we found using OPLS-DA that bound coding sequences are rich in AG di-nucleotides, which have been previously reported to be abundant in dosage-compensated chromosomes .
The MSL-complex binds to genes with a preference for exons . The relatively low binding to introns might suggest that the complex targets spliced RNA transcripts. However, it was recently found that the complex targets chromatin rather than transcribed RNA . The exon specificity could be explained by various chromatin factors, nucleosome density and/or sequence specificity. Variations in nucleosome density may partially explain the exon bias, as it is higher in exons  and thus may provide more targets for H4K16 acetylation, a modification that is strongly linked to the MSL-complex . In addition, the MSL-complex binding profile clearly shows that it binds most strongly towards the 3' end of genes . Accordingly, our models predicted the MSL-binding status of genes better from the 3' thirds than from the 5' thirds of the coding sequences. This is in contrast to the lack of 3' bias of the [G(GC)N]4 motif reported by . Taken together, our results strongly indicate that the MSL-complex distribution within genes on the X-chromosome is influenced by the primary DNA sequence.
The MSL-complex evidently targets a limited number of High Affinity Sites along the X-chromosome. Although, many strongly bound genes are close to a HAS neither our promoter motif nor our coding sequence signatures show any correlation to HAS. Based on the results presented here, we believe that there are sequences in the promoters and coding sequences of targeted genes that have the potential to direct the secondary spreading of the complex to nearby genes. However, a number of genes are dosage-compensated by MSL-independent mechanisms  and expression on the X-chromosome is only reduced to ~80% of wild type levels in males when msl genes are mutated or knocked down using RNAi [20, 24]. Apart from the dosage compensation mediated by the MSL-complex there is evidence for a more general buffering system that targets haploid regions in the genome . So other, as yet unknown, factors are likely involved in compensating the X chromosome and these factors could potentially act on a number of levels, such as transcription regulation, mRNA export, mRNA stability and translation . The observed optimal codon usage on the X-chromosome likely represents compensation on the translational level. However, even if additional factors involved in dosage compensation remain to be discovered, we here show that there are plenty of sequences within all types of gene features that could act as X-targeting elements.
Genome annotations and sequences
We obtained Dm genome annotation and sequence Release 5.23 from Flybase  and parsed non-overlapping coordinates of the six types of gene features (promoters, 5' UTRs, CDS, introns, 3' UTRs and intergenic sequences) on each of the chromosomes in Dm from the genome annotation data (defining promoters as sequences extending 500 bp upstream of transcription start sites). We then used the gene feature coordinates to extract corresponding sequences from the genome sequence.
For each gene feature we constructed two-dimensional data matrices with full-length chromosomes or gene-specific regions as objects (rows) and frequencies of all possible di-(16), tri-(64), tetra-(256), penta-(1024) and hexa-mers (4096) in the gene feature sequences as variables (columns). To calculate frequencies of oligonucleotides (sequence words) we counted every word in each target sequence sliding one nucleotide at a time and divided the count by the length of the target sequence. Both forward and reverse complements of each sequence word were scored and treated as one.
Determining expression and MSL-binding of genes
To explore the relationship between sequence variation in X chromosome gene features and targeting of the MSL complex to X chromosomes we used the SL-2 cell ChIP-chip and gene expression array data presented by . We preprocessed raw Affymetrix (ChIP-chip MSL-1, MSL-3 and MOF) .CEL files using Tiling Analysis Software (TAS; http://www.affymetrix.com/partners_programs/programs/developer/TilingArrayTools/index.affx), with default parameters and Probe Analysis settings: bandwidth (BW) - 300, Test Type - one sided upper, perfect match probe intensities. We used the Integrated Genome Browser (IGB;  to visualize data, then converted the genome coordinates of Release 4 to those of the latest Release (5), using the Flybase sequence coordinate converter tool. We considered all probes in all exons within each gene and calculated gene binding values as the average of the top 50% probes, to minimize the influence of alternative splicing. Genes on the X chromosome with enrichment ratios exceeding one (in log2 scale) for all three proteins were selected as representatives of genes strongly bound by the MSL complex and genes with enrichment ratios of the three proteins below 0.5 as representatives of weakly bound genes. These cut-offs was set in order to select the two extremes in terms of binding values, while keeping the size of the two groups similar as well as reasonably large. Genes with enrichment ratios between 0.5 and one were excluded from the initial models (Additional file 11 and Additional file 12). For the different gene feature models we excluded genes that lacked the gene feature in question and for the promoter models we excluded genes for which their entire 500 bp upstream region overlapped a neighbouring gene (promoters with partial overlap to other genes or promoters were truncated).
Gene expression levels in EGFP control SL-2 cells relative to mof, msl-1 and msl-3 RNAi knock-down cells were computed from raw gene expression Affymetrix .CEL files using Robust Multi-Array Average (RMA)  with the Bioconductor "affy" package . Using the latest library files from Affymetrix we then mapped each probeset to genomic release 5 coordinates. For each gene, expression was calculated as the median of the three replicates. Any effect on gene expression after RNAi knock-down can be observed only for genes expressed between six and 10 in EGFP control after RMA normalization (Additional file 13). This means that genes with expression levels outside this range are either unaffected by the RNAi or their expression levels cannot be accurately estimated. Hence, we focused on genes with expression values between six and 10 and considered those with expression values less than six as unexpressed.
Determining insulator proteins binding sites
Insulator proteins (BEAF-32, CP190, CTCF and Su(Hw)) S2 cell-line ChIP-chip data from modENCODE (BEAF-HB.S2, BEAF-70.S2, CP190-VC.S2, CP190-HB.S2, CTCF-VC.S2, CTCF.S2 and SU(HW)-HB.S2, Su(Hw)-VC.S2) were preprocessed as described above. To identify bound regions, and their peak positions, binding ratios exceeding the genomic average by at least three standard deviations were extracted. Bound regions were then defined by stretches of array probes (passing the three standard deviation cut-off) at least 360 bp in length and a region was extended as long as there was a value within 360 bp of the previous value. Regions with fewer than five probes were excluded. The value for each detected region was set to the average of the highest six consecutive probe values (ratios). The center of each peak was set to the mid-position of the six highest consecutive probe values. Only binding regions detected in both datasets (obtained using two different antibodies for each insulator protein) were used. For each protein, distances to transcription start sites were calculated as the distances from each TSS to the nearest peak center of each protein.
All multivariate analysis and visualization described below was performed using Evince (Umbio, Sweden), except OPLS analysis, for which we used SIMCA (Umetrics, Sweden). AT normalization was applied by dividing each word frequency by the expected frequency based on the nucleotide composition of the word and the target sequence assuming random distribution of nucleotides . We center-scaled all data that were AT normalized prior to multivariate analysis. For all other modeling we applied unit variance (UV) scaling.
All univariate statistical analysis and visualization were performed using Statistica software (Statsoft, USA).
Sequence word alignment and motif scoring
We developed an iterative algorithm to identify complex motifs with significant predictive values for MSL complex binding, based on the top sequence words obtained from OPLS-DA models designed to detect sequence differences between genes strongly bound and weakly bound by the complex. A flowchart for the algorithm is presented in Additional file 14, and Perl codes for the sequence word aligner can be downloaded from http://www.molbiol.umu.se/english/research/researchers/per-stenberg/. This algorithm constructs motifs in the form of position weight matrices (PWMs), constructed by summing loadings of sequence words included in the motif and normalizing so that values in each column sum to one. Each motif is scored at all possible positions in each sequence by multiplying the values of each matching position in the PWM. A user-defined number of best scores (here one) within each sequence are then summed. The top 1000 sequence words (sorted by loading) from each OPLS-DA model were used in the alignment presented here. If a sequence word is used to extend the PWM it is excluded from further extensions. All sequence words are tested for variation of the PWM before each extension. Both forward and reverse complements of the sequence words are tested in each alignment and up to two simultaneous alignments to the PWM are tested. Logistic regression is used to test each update of the PWM for predictive improvement. When the predictive value of each PWM can no longer be improved, the resulting motif is masked in all sequences and a new PWM is created from the top sequence word in the remaining list of sequence words (excluding sequence words included in any previous PWM).
We defined a cut-off score for the promoter motif using the highest scores of the motif in each of the promoters of expressed genes strongly and weakly bound by MSL, such that the motif would be present in only 5% of the expressed, weakly bound genes. This cutoff was used to find promoter motif sites across the X chromosome.
All custom software and scripts are available upon request.
- Dm :
- 5' UTR:
5 prime untranslated region
- 3' UTR:
3 prime untranslated region
Transcription start site
High affinity site
Principal component analysis
Orthogonal partial least squares discriminant analysis
Position weight matrix
We thank Jan Larsson and Yuri Schwartz for critical reading of the manuscript. We also thank members of the Computational Life Science Cluster (CLiC) for sharing their expertise in statistics and multivariate modeling. This work was supported by grants from the Carl Tryggers, Erik Philip-Sörensens and Åke Wiberg Foundations (PS).
- Larsson J, Meller VH: Dosage compensation, the origin and the afterlife of sex chromosomes. Chromosome Res. 2006, 14: 417-431. 10.1007/s10577-006-1064-3.View ArticlePubMed
- Mank JE: The W, X, Y and Z of sex-chromosome dosage compensation. Trends Genet. 2009, 25: 226-233. 10.1016/j.tig.2009.03.005.PubMed CentralView ArticlePubMed
- Stenberg P, Larsson J: Buffering and the evolution of chromosome-wide gene regulation. Chromosoma. 2011, 120: 213-225. 10.1007/s00412-011-0319-8.PubMed CentralView ArticlePubMed
- Vicoso B, Bachtrog D: Progress and prospects toward our understanding of the evolution of dosage compensation. Chromosome Res. 2009, 17: 585-602. 10.1007/s10577-009-9053-y.PubMed CentralView ArticlePubMed
- Gupta V, Parisi M, Sturgill D, Nuttall R, Doctolero M, Dudko OK, Malley JD, Eastman PS, Oliver B: Global analysis of X-chromosome dosage compensation. J Biol. 2006, 5: 3-10.1186/jbiol30.PubMed CentralView ArticlePubMed
- Nguyen DK, Disteche CM: Dosage compensation of the active X chromosome in mammals. Nat Genet. 2006, 38: 47-53. 10.1038/ng1705.View ArticlePubMed
- Arkhipova I, Li J, Meselson M: On the mode of gene-dosage compensation in Drosophila. Genetics. 1997, 145: 729-736.PubMed CentralPubMed
- Belote JM, Lucchesi JC: Male-specific lethal mutations of Drosophila melanogaster. Genetics. 1980, 96: 165-186.PubMed CentralPubMed
- Fukunaga A, Tanaka A, Oishi K: Maleless, a recessive autosomal mutant of Drosophila melanogaster that specifically kills male zygotes. Genetics. 1975, 81: 135-141.PubMed CentralPubMed
- Copps K, Richman R, Lyman LM, Chang KA, Rampersad-Ammons J, Kuroda MI: Complex formation by the Drosophila MSL proteins: role of the MSL2 RING finger in protein complex assembly. EMBO J. 1998, 17: 5409-5417. 10.1093/emboj/17.18.5409.PubMed CentralView ArticlePubMed
- Meller VH, Gordadze PR, Park Y, Chu X, Stuckenholz C, Kelley RL, Kuroda MI: Ordered assembly of roX RNAs into MSL complexes on the dosage-compensated X chromosome in Drosophila. Curr Biol. 2000, 10: 136-143. 10.1016/S0960-9822(00)00311-0.View ArticlePubMed
- Mendjan S, Taipale M, Kind J, Holz H, Gebhardt P, Schelder M, Vermeulen M, Buscaino A, Duncan K, Mueller J, Wilm M, Stunnenberg HG, Saumweber H, Akhtar A: Nuclear pore components are involved in the transcriptional regulation of dosage compensation in Drosophila. Mol Cell. 2006, 21: 811-823. 10.1016/j.molcel.2006.02.007.View ArticlePubMed
- Scott MJ, Pan LL, Cleland SB, Knox AL, Heinrich J: MSL1 plays a central role in assembly of the MSL complex, essential for dosage compensation in Drosophila. EMBO J. 2000, 19: 144-155. 10.1093/emboj/19.1.144.PubMed CentralView ArticlePubMed
- Smith ER, Pannuti A, Gu W, Steurnagel A, Cook RG, Allis CD, Lucchesi JC: The Drosophila MSL complex acetylates histone H4 at lysine 16, a chromatin modification linked to dosage compensation. Mol Cell Biol. 2000, 20: 312-318. 10.1128/MCB.20.1.312-318.2000.PubMed CentralView ArticlePubMed
- Gelbart ME, Kuroda MI: Drosophila dosage compensation: a complex voyage to the X chromosome. Development. 2009, 136: 1399-1410. 10.1242/dev.029645.PubMed CentralView ArticlePubMed
- Hilfiker A, Hilfiker-Kleiner D, Pannuti A, Lucchesi JC: mof, a putative acetyl transferase gene related to the Tip60 and MOZ human genes and to the SAS genes of yeast, is required for dosage compensation in Drosophila. EMBO J. 1997, 16: 2054-2060. 10.1093/emboj/16.8.2054.PubMed CentralView ArticlePubMed
- Alekseyenko AA, Larschan E, Lai WR, Park PJ, Kuroda MI: High-resolution ChIP-chip analysis reveals that the Drosophila MSL complex selectively identifies active genes on the male X chromosome. Genes Dev. 2006, 20: 848-857. 10.1101/gad.1400206.PubMed CentralView ArticlePubMed
- Alekseyenko AA, Peng S, Larschan E, Gorchakov AA, Lee OK, Kharchenko P, McGrath SD, Wang CI, Mardis ER, Park PJ, Kuroda MI: A sequence motif within chromatin entry sites directs MSL establishment on the Drosophila X chromosome. Cell. 2008, 134: 599-609. 10.1016/j.cell.2008.06.033.PubMed CentralView ArticlePubMed
- Gilfillan GD, Straub T, de Wit E, Greil F, Lamm R, van Steensel B, Becker PB: Chromosome-wide gene-specific targeting of the Drosophila dosage compensation complex. Genes Dev. 2006, 20: 858-870. 10.1101/gad.1399406.PubMed CentralView ArticlePubMed
- Kind J, Vaquerizas JM, Gebhardt P, Gentzel M, Luscombe NM, Bertone P, Akhtar A: Genome-wide analysis reveals MOF as a key regulator of dosage compensation and gene expression in Drosophila. Cell. 2008, 133: 813-828. 10.1016/j.cell.2008.04.036.View ArticlePubMed
- Legube G, McWeeney SK, Lercher MJ, Akhtar A: X-chromosome-wide profiling of MSL-1 distribution and dosage compensation in Drosophila. Genes Dev. 2006, 20: 871-883. 10.1101/gad.377506.PubMed CentralView ArticlePubMed
- Straub T, Grimaud C, Gilfillan GD, Mitterweger A, Becker PB: The chromosomal high-affinity binding sites for the Drosophila dosage compensation complex. PLoS Genet. 2008, 4: e1000302-10.1371/journal.pgen.1000302.PubMed CentralView ArticlePubMed
- Deng X, Koya SK, Kong Y, Meller VH: Coordinated regulation of heterochromatic genes in Drosophila melanogaster males. Genetics. 2009, 182: 481-491. 10.1534/genetics.109.102087.PubMed CentralView ArticlePubMed
- Hamada FN, Park PJ, Gordadze PR, Kuroda MI: Global regulation of X chromosomal genes by the MSL complex in Drosophila melanogaster. Genes Dev. 2005, 19: 2289-2294. 10.1101/gad.1343705.PubMed CentralView ArticlePubMed
- Zhang Y, Malone JH, Powell SK, Periwal V, Spana E, Macalpine DM, Oliver B: Expression in aneuploid Drosophila S2 cells. PLoS Biol. 2010, 8: e1000320-10.1371/journal.pbio.1000320.PubMed CentralView ArticlePubMed
- Larschan E, Bishop EP, Kharchenko PV, Core LJ, Lis JT, Park PJ, Kuroda MI: X chromosome dosage compensation via enhanced transcriptional elongation in Drosophila. Nature. 2011, 471: 115-118. 10.1038/nature09757.PubMed CentralView ArticlePubMed
- Kelley RL, Meller VH, Gordadze PR, Roman G, Davis RL, Kuroda MI: Epigenetic spreading of the Drosophila dosage compensation complex from roX RNA genes into flanking chromatin. Cell. 1999, 98: 513-522. 10.1016/S0092-8674(00)81979-0.View ArticlePubMed
- Oh H, Park Y, Kuroda MI: Local spreading of MSL complexes from roX genes on the Drosophila X chromosome. Genes Dev. 2003, 17: 1334-1339. 10.1101/gad.1082003.PubMed CentralView ArticlePubMed
- Dahlsveen IK, Gilfillan GD, Shelest VI, Lamm R, Becker PB: Targeting determinants of dosage compensation in Drosophila. PLoS Genet. 2006, 2: e5-10.1371/journal.pgen.0020005.PubMed CentralView ArticlePubMed
- Fagegaltier D, Baker BS: X chromosome sites autonomously recruit the dosage compensation complex in Drosophila males. PLoS Biol. 2004, 2: e341-10.1371/journal.pbio.0020341.PubMed CentralView ArticlePubMed
- Pardue ML, Lowenhaupt K, Rich A, Nordheim A: (dC-dA)n.(dG-dT)n sequences have evolutionarily conserved chromosomal locations in Drosophila with implications for roles in chromosome structure and function. EMBO J. 1987, 6: 1781-1789.PubMed CentralPubMed
- Lowenhaupt K, Rich A, Pardue ML: Nonrandom distribution of long mono- and dinucleotide repeats in Drosophila chromosomes: correlations with dosage compensation, heterochromatin, and recombination. Mol Cell Biol. 1989, 9: 1173-1182.PubMed CentralView ArticlePubMed
- Gallach M, Arnau V, Marín I: Global patterns of sequence evolution in Drosophila. BMC Genomics. 2007, 8: 408-10.1186/1471-2164-8-408.PubMed CentralView ArticlePubMed
- Stenberg P, Pettersson F, Saura AO, Berglund A, Larsson J: Sequence signature analysis of chromosome identity in three Drosophila species. BMC Bioinformatics. 2005, 6: 158-10.1186/1471-2105-6-158.PubMed CentralView ArticlePubMed
- Gilfillan GD, König C, Dahlsveen IK, Prakoura N, Straub T, Lamm R, Fauth T, Becker PB: Cumulative contributions of weak DNA determinants to targeting the Drosophila dosage compensation complex. Nucleic Acids Res. 2007, 35: 3561-3572. 10.1093/nar/gkm282.PubMed CentralView ArticlePubMed
- Gallach M, Arnau V, Aldecoa R, Marín I: A sequence motif enriched in regions bound by the Drosophila dosage compensation complex. BMC Genomics. 2010, 11: 169-10.1186/1471-2164-11-169.PubMed CentralView ArticlePubMed
- Larschan E, Alekseyenko AA, Gortchakov AA, Peng S, Li B, Yang P, Workman JL, Park PJ, Kuroda MI: MSL complex is attracted to genes marked by H3K36 trimethylation using a sequence-independent mechanism. Mol Cell. 2007, 28: 121-133. 10.1016/j.molcel.2007.08.011.View ArticlePubMed
- Trygg J, Wold S: Othogonal projections to latent structures (O-PLS). J Chemometrics. 2002, 16: 119-128. 10.1002/cem.695.View Article
- Bylesjo M, Cloarec O, Nicholson JK, Holmes E, Trygg J: OPLS discriminant analysis: combining the strengths of PLS-DA and SIMCA classification. J Chemometrics. 2006, 20: 341-351. 10.1002/cem.1006.View Article
- Gupta S, Stamatoyannopoulos JA, Bailey TL, Noble WS: Quantifying similarity between motifs. Genome Biol. 2007, 8: R24-10.1186/gb-2007-8-2-r24.PubMed CentralView ArticlePubMed
- modENCODE Consortium, Roy S, Ernst J, Kharchenko PV, Kheradpour P, Negre N, Eaton ML, Landolin JM, Bristow CA, Ma L, Lin MF, Washietl S, Arshinoff BI, Ay F, Meyer PE, Robine N, Washington NL, Di Stefano L, Berezikov E, Brown CD, Candeias R, Carlson JW, Carr A, Jungreis I, Marbach D, Sealfon R, Tolstorukov MY, Will S, Alekseyenko AA, Artieri C, et al: Identification of functional elements and regulatory circuits by Drosophila modENCODE. Science. 2010, 330: 1787-1797.View Article
- Jiang N, Emberly E, Cuvier O, Hart CM: Genome-wide mapping of boundary element-associated factor (BEAF) binding sites in Drosophila melanogaster links BEAF to transcription. Mol Cell Biol. 2009, 29: 3556-3568. 10.1128/MCB.01748-08.PubMed CentralView ArticlePubMed
- Singh ND, Davis JC, Petrov DA: X-linked genes evolve higher codon bias in Drosophila and Caenorhabditis. Genetics. 2005, 171: 145-155. 10.1534/genetics.105.043497.PubMed CentralView ArticlePubMed
- Georgiev P, Chlamydas S, Akhtar A: Drosophila dosage compensation: Males are from Mars, females are from Venus. Fly (Austin). 2011, 5: 147-154.View Article
- Hart CM, Zhao K, Laemmli UK: The scs' boundary element: characterization of boundary element-associated factors. Mol Cell Biol. 1997, 17: 999-1009.PubMed CentralView ArticlePubMed
- Bushey AM, Ramos E, Corces VG: Three subclasses of a Drosophila insulator show distinct and cell type-specific genomic distributions. Genes Dev. 2009, 23: 1338-1350. 10.1101/gad.1798209.PubMed CentralView ArticlePubMed
- Nègre N, Brown CD, Shah PK, Kheradpour P, Morrison CA, Henikoff JG, Feng X, Ahmad K, Russell S, White RA, Stein L, Henikoff S, Kellis M, White KP: A comprehensive map of insulator elements for the Drosophila genome. PLoS Genet. 2010, 6: e1000814-10.1371/journal.pgen.1000814.PubMed CentralView ArticlePubMed
- Roy S, Gilbert MK, Hart CM: Characterization of BEAF mutations isolated by homologous recombination in Drosophila. Genetics. 2007, 176: 801-813.PubMed CentralView ArticlePubMed
- Laverty C, Li F, Belikoff EJ, Scott MJ: Abnormal Dosage Compensation of Reporter Genes Driven by the Drosophila Glass Multiple Reporter (GMR) Enhancer-Promoter. PLoS One. 2011, 6: e20455-10.1371/journal.pone.0020455.PubMed CentralView ArticlePubMed
- Johansson AM, Allgardsson A, Stenberg P, Larsson J: msl2 mRNA is bound by free nuclear MSL complex in Drosophila melanogaster. Nucleic Acids Res. 2011, 39: 6428-6439. 10.1093/nar/gkr236.PubMed CentralView ArticlePubMed
- Schwartz S, Meshorer E, Ast G: Chromatin organization marks exon-intron structure. Nat Struct Mol Biol. 2009, 16: 990-995. 10.1038/nsmb.1659.View ArticlePubMed
- Gelbart ME, Larschan E, Peng S, Park PJ, Kuroda MI: Drosophila MSL complex globally acetylates H4K16 on the male X chromosome for dosage compensation. Nat Struct Mol Biol. 2009, 16: 825-832. 10.1038/nsmb.1644.PubMed CentralView ArticlePubMed
- Stenberg P, Lundberg LE, Johansson AM, Rydén P, Svensson MJ, Larsson J: Buffering of segmental and chromosomal aneuploidies in Drosophila melanogaster. PLoS Genet. 2009, 5: e1000465-10.1371/journal.pgen.1000465.PubMed CentralView ArticlePubMed
- Tweedie S, Ashburner M, Falls K, Leyland P, McQuilton P, Marygold S, Millburn G, Osumi-Sutherland D, Schroeder A, Seal R, Zhang H, FlyBase Consortium: FlyBase: enhancing Drosophila Gene Ontology annotations. Nucleic Acids Res. 2009, 37: D555-D559. 10.1093/nar/gkn788.PubMed CentralView ArticlePubMed
- Nicol JW, Helt GA, Blanchard SG, Raja A, Loraine AE: The Integrated Genome Browser: free software for distribution and exploration of genome-scale datasets. Bioinformatics. 2009, 25: 2730-2731. 10.1093/bioinformatics/btp472.PubMed CentralView ArticlePubMed
- Irizarry RA, Hobbs B, Collin F, Beazer-Barclay YD, Antonellis KJ, Scherf U, Speed TP: Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics. 2003, 4: 249-264. 10.1093/biostatistics/4.2.249.View ArticlePubMed
- Gautier L, Cope L, Bolstad BM, Irizarry RA: affy-analysis of Affymetrix GeneChip data at the probe level. Bioinformatics. 2004, 20: 307-315. 10.1093/bioinformatics/btg405.View ArticlePubMed
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.