Transcriptome map of mouse isochores
© Arhondakis et al; licensee BioMed Central Ltd. 2011
Received: 15 April 2011
Accepted: 17 October 2011
Published: 17 October 2011
The availability of fully sequenced genomes and the implementation of transcriptome technologies have increased the studies investigating the expression profiles for a variety of tissues, conditions, and species. In this study, using RNA-seq data for three distinct tissues (brain, liver, and muscle), we investigate how base composition affects mammalian gene expression, an issue of prime practical and evolutionary interest.
We present the transcriptome map of the mouse isochores (DNA segments with a fairly homogeneous base composition) for the three different tissues and the effects of isochores' base composition on their expression activity. Our analyses also cover the relations between the genes' expression activity and their localization in the isochore families.
This study is the first where next-generation sequencing data are used to associate the effects of both genomic and genic compositional properties to their corresponding expression activity. Our findings confirm previous results, and further support the existence of a relationship between isochores and gene expression. This relationship corroborates that isochores are primarily a product of evolutionary adaptation rather than a simple by-product of neutral evolutionary processes.
The genomes of vertebrates are mosaics of isochores, long regions (from 0.2 Mb up to several Mb) that are fairly homogeneous in base composition. The isochores belong to a small group of families characterized by different GC levels (molar ratio of guanine and cytosine over the total number of bases of the region) [1–4]. In the human genome, a typical mammalian genome, five isochore families can be found (L1, L2, H1, H2, and H3, in order of increasing GC level) that cover a wide GC range (30-60%) [2–4]. The GC-richest families, H2 and H3, represent approximately 15% of the genome, and contain about 50% of the protein-coding genes. This high gene density is accompanied by other striking properties, such as open chromatin structure, localization at the center of the nucleus, high density of short interspersed elements (SINEs), low density of long interspersed elements (LINEs), early replication, high level of recombination, high mutation rate, and higher expression level, while GC-poorer families have the opposite properties. In the mouse genome, which is of interest in this study, the L1 isochore family is under-represented compared to other vertebrates, and the H3 family is almost absent. This narrow isochore distribution in the mouse genome has been interpreted as the result of a higher substitution rate [6, 7] and a weak repair mechanism, both phenomena reducing compositional heterogeneity. Despite these differences, the distribution of genes is similar to that of the other vertebrates (gene density increases as GC level increases), and the average GC levels of the different families are remarkably conserved across species, reflecting a functional relation to the chromatin structure.
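The GC level defined above, and the assignment of a region to one of the five families, can be illustrated with a short sketch. The family boundaries used here (37, 41, 46, and 53% GC) are an assumption taken from the commonly cited isochore partition and are not stated in this article; ignoring ambiguous bases such as N is likewise an assumption, and the function names are hypothetical.

```python
from bisect import bisect_right

# Assumed family boundaries in % GC (not given in this article):
# L1 < 37 <= L2 < 41 <= H1 < 46 <= H2 < 53 <= H3
FAMILY_BOUNDS = [37.0, 41.0, 46.0, 53.0]
FAMILY_NAMES = ["L1", "L2", "H1", "H2", "H3"]


def gc_level(seq: str) -> float:
    """Return the GC level (%) of a sequence, counting only A, C, G, T."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    if total == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return 100.0 * gc / total


def isochore_family(gc_percent: float) -> str:
    """Map a GC level (%) to an isochore family using the assumed bounds."""
    return FAMILY_NAMES[bisect_right(FAMILY_BOUNDS, gc_percent)]
```

For example, `isochore_family(gc_level(region_sequence))` would label a region once its sequence is available; real isochore maps are built from windowed GC profiles rather than single calls like this.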
The emergence of the isochores is an open debate of considerable evolutionary importance: in addition to the selectionist model (functional advantage), other models attempt to explain the evolution of the isochores, namely mutational bias, GC-biased gene conversion [10, 11], and a unifying model. Despite the importance of this debate, our study focuses on how base composition affects mammalian gene expression. Such a relationship would provide additional evidence for a functional role of the isochores, supporting the view that they are mainly a product of evolutionary adaptation [2, 4], rather than a simple by-product of neutral evolutionary processes [9–11].
Previous studies have investigated the effects of base composition on gene expression, both in human and mouse tissues, through an exhaustive use of expression data from techniques based on sequencing (ESTs, SAGE, MPSS) and/or hybridization (microarrays, single-arrays, cDNA arrays) [13–21], and despite some quantitative differences, agree that the expression levels of genes are positively correlated with the GC level. Two recent studies [22, 23], through in silico compositional analysis of expression vectors and DNA carriers, showed that aside from the GC3 level (GC level in the third codon position) of the coding sequences, the genomic compositional context in which a gene is embedded affects its expression. Additionally, the Human Transcriptome Map (HTM), using SAGE data, revealed domains of highly and weakly expressed genes, namely the "RIDGES" and "anti-RIDGES", respectively. The former were found to be located in gene-dense, GC-rich, and SINE-rich genomic regions, while the latter were in regions with opposite properties [15, 25]. These findings reflect the partitioning of vertebrate genes into two types of genomic regions: the gene-rich regions ("genome core"), which correspond to the GC-rich isochores, and the gene-poor regions ("genome desert"), which correspond to the GC-poor isochores [2, 3, 26, 27]. In addition, when a transcriptome map similar to the HTM was established for the mouse genome, the expression patterns were found to be conserved with respect to those of the human genome [28, 29].
Next-generation sequencing (NGS) techniques revolutionized transcriptome analyses and, compared to previous transcriptome technologies, appear to have several advantages: a better dynamic range (absence of background noise and signal saturation phenomena, although misaligned reads could be considered as background), better quantification of transcript levels and of their isoforms (absence of an upper limit to the quantification, detection of lowly expressed transcripts), and identification of as yet unknown coding and non-coding RNA species [30–32]. Moreover, NGS reduced the processing time and cost of sequencing by orders of magnitude, making it a more attractive tool in a broad range of research, for both DNA and RNA sequencing and for detection and analysis of genetic variability [33–36]. In this study, we took advantage of publicly available NGS data of three distinct mouse tissues in order to investigate the expression patterns across the isochores of the mouse chromosomes and the effects of the isochores' compositional properties on their expression activity. In the second part, we investigated the relations between genes' expression levels and their localization in the five isochore families for the three transcriptomes considered (brain, liver, and muscle).
Reads aligned to coding sequences
It is well known that in vertebrates, including the mouse, GC-richer isochores have higher gene densities compared to the GC-poorer ones (see the Background Section). This is confirmed by the positive linear correlation we found between the gene density of the isochores and their respective GC level (R = 0.42). Having shown the positive effect of high GC levels on isochore expression, and the positive relation between GC levels and gene density, we also looked into the direct relation between the gene density and the expression level of the individual isochores. We found a positive correlation, with similar coefficients for all tissues (brain: R = 0.57, liver: R = 0.57, muscle: R = 0.58).
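As an illustration of how such a coefficient can be obtained, the sketch below computes a linear correlation coefficient between two per-isochore vectors (for instance GC level and gene density). It assumes that R denotes the Pearson coefficient, which the text does not state explicitly; the function name is hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products of deviations from the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)
```

In practice one would pass the GC levels of all isochores as `xs` and their gene densities (or expression levels for a given tissue) as `ys`.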
In order to isolate and investigate the effects of the GC level on the expression activity of the isochores, it was necessary to eliminate the effects of the gene density. To this end, the per-tissue normalized count of reads aligned within each isochore was divided by the respective gene density of the isochore, and the log2 values were calculated (Additional file 3). This approach limited our analysis to isochores containing at least one CDS (1,902 isochores out of the 2,319). As expected, we found that the percentage of isochores containing at least one CDS increased as the isochore family GC level increased (more than 60% of the L1 isochores contain no CDS against only 6% of the H2 isochores; see Additional file 4). A notable exception to the trend is the H3 family, where an increase of isochores without any CDS is observed. However, this increase is due to the fact that in the mouse genome the H3 family consists of just nine isochores, two of which had no CDS.
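The density correction described above can be sketched as follows. The exact normalization is not specified in this section, so the sketch assumes that the per-tissue normalized count is the fraction of the tissue's aligned reads falling in the isochore, and that gene density is the number of CDSs per Mb; both definitions and all names are illustrative assumptions.

```python
import math


def density_normalized_log2(reads_in_isochore, total_reads_in_tissue,
                            n_cds, isochore_length_bp):
    """log2 of (normalized read count / gene density) for one isochore.

    Returns None when the isochore has no CDS (gene density is zero) or
    no aligned reads (the logarithm is undefined).
    """
    if n_cds == 0 or reads_in_isochore == 0:
        return None
    # Assumed normalization: share of the tissue's reads in this isochore
    normalized_count = reads_in_isochore / total_reads_in_tissue
    # Assumed gene density: CDSs per megabase
    gene_density = n_cds / (isochore_length_bp / 1_000_000)
    return math.log2(normalized_count / gene_density)
```

Because the normalized counts are fractions far below one, the resulting values are negative, with less negative values indicating higher expression activity.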
In summary, in this section we first presented the transcriptome map of the mouse isochores, and demonstrated an agreement between the isochores' GC levels and their expression levels. Then, after gene density effects were removed from the isochores' expression levels, we found a tissue-dependent correlation between the isochores' GC levels and their expression activity.
We then looked for differences in the distribution of the expressed genes in the isochore families against that of the genes that are not expressed. As expressed, we considered genes with at least 10 aligned reads to avoid possible noise from misalignments, while as non-expressed, we considered genes without any aligned reads.
Looking into the distribution of the expressed genes in the isochore families, we found no differences among the three tissues (Additional file 5). The percentage of expressed genes (12,414 CDSs in the brain, 9,793 in the liver, and 10,749 in the muscle) progressively increases from low to high GC families, and peaks at the H2 family. Regarding the H3 family, the massive drop observed is related to the extreme under-representation of this family in the mouse genome. Repeating the analysis with a higher expression threshold (at least 100 reads per CDS) affects mostly the lower GC families, but overall it does not change the observed trend (data not shown). With either threshold, the distribution is different from that observed for the non-expressed genes.
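The classification and tallying described above can be sketched as follows. The thresholds (at least 10 reads for expressed genes, zero reads for non-expressed genes) come from the text; the function names, the "ambiguous" label for genes with 1 to 9 reads, and the reading of "distribution" as the share of a class falling in each family are assumptions made for illustration.

```python
from collections import Counter


def classify(read_count, min_reads=10):
    """Label a gene by its aligned read count."""
    if read_count >= min_reads:
        return "expressed"
    if read_count == 0:
        return "non-expressed"
    return "ambiguous"  # low counts that may stem from misalignments


def family_distribution(genes, status, min_reads=10):
    """Percentage of genes with the given status found in each family.

    genes: iterable of (isochore_family, read_count) pairs.
    """
    counts = Counter(family for family, reads in genes
                     if classify(reads, min_reads) == status)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {family: 100.0 * n / total for family, n in counts.items()}
```

Calling `family_distribution(genes, "expressed", min_reads=100)` reproduces the stricter threshold mentioned above, while `family_distribution(genes, "non-expressed")` gives the comparison distribution.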
In this section, we showed that genes located in GC-richer isochores have a higher expression level than genes located in GC-poor isochores. Moreover, we observed that, between liver and muscle, the genes located in L1, H2, and H3 isochores appear to maintain a similar expression activity, contrary to the expressed genes located in L2 and H1 isochores. We also presented evidence that, in three adult mouse tissues, the genes not detected as expressed are preferentially located in GC-poor isochores, while the expressed genes are preferentially located in GC-rich isochores.
As mentioned in the Background Section, the way base composition affects mammalian gene expression is an issue of prime practical and evolutionary interest and, although it has been a matter of debate, most studies agree that there is a positive correlation. The transcriptome of the mouse isochores for the three tissues (Additional file 1, Figure 1), the positive correlation between the isochores' GC level and their respective expression activity (Figure 2), and the increase of the average expression level of genes as the GC of the isochores increases (Figure 3) support the existence of a relationship between expression level and base composition.
The correlation coefficients reported here, between the expression activity of the isochores and their respective GC levels (Figure 2), are slightly higher than those reported in previous studies on mouse [16, 19], where gene expression was correlated with GC3 levels. Moreover, the order in which the expression level in the three tissues is most affected by the GC level (brain > muscle > liver) agrees with previously reported results. Finally, despite the virtual absence of H3 isochores in the mouse genome and the small number of L1 isochores, our coefficients were found to be similar to those of human, whose genome contains both L1 and H3 isochores [16, 18–21].
With regard to the GC-poor localization of the genes that are not expressed in any of the three adult mouse tissues considered here, the notion that they may be implicated in developmental processes is supported by several studies. Indeed, two recent studies [38, 39] identified, in the genome deserts of vertebrates, long-range conserved systems comprised of highly conserved non-coding elements and their developmental regulatory gene targets. Similarly, although in a different context, it has been shown that during the development of the mouse brain, most expression changes occur in the GC-poor and LINE-rich regions, and that the genes expressed in the early developmental stages of the mouse have AT-ending codons, unlike the genes expressed in later developmental stages. Genes rich in AT-ending codons are expected to be typically found in GC-poor isochore families.
This work is the first where NGS data are used in order to establish the transcriptome map of the mouse isochores for three different tissues, and to investigate the effects of base composition on the expression activity. Our results are consistent with previous ones, and further support the idea of a functional role of the isochores in gene expression. We conclude by proposing that similar compositional approaches, using NGS data from carefully designed experiments, may shed more light on the role of the genomic (in terms of isochores) and genic compositional properties in gene expression, in the context of specific tissues or biological processes, and reveal valuable information on the regulation mechanisms involved.
To produce the transcriptome map of the isochores, we used publicly available RNA-seq data of three distinct mouse tissues (brain, liver, and muscle), obtained in a recent study by Mortazavi et al using the standard Solexa pipeline (version 0.2.6). The initial 32-mer reads were subsequently truncated to a length of 25 base pairs. The data come from pooled adult C57BL6 individuals. We aligned the reads against the reference mouse genome (UCSC release mm9) using REad ALigner (REAL) [44, 45]. REAL is based on a new, relatively simple algorithm for the alignment of short reads onto a reference sequence. It uses two-bits-per-base encoding of the DNA alphabet for both the reference and read sequences. We used the appropriate arguments to allow up to two mismatches per read with no gaps, and to report the unique alignment with the least number of mismatches. In this case, REAL splits each read into four fragments, and its approximate string matching applies the pigeonhole principle as a means to quickly filter out some of the alignments that have more than two mismatches. The remaining candidate alignment locations are then examined in order to eliminate those that still have more than two mismatches. Unlike other current fast aligners like Bowtie and SOAP2, REAL is not hindered by the very short length of the reads in this dataset. This gap-less alignment method will miss reads that span splice sites. However, these should represent only a small fraction of the total reads. Since the study is aimed at the bigger picture, rather than the exact quantification of individual mRNAs and alternative splicing variants, the loss of sensitivity will have little impact. In any case, gapped alignment of such short single-end reads has its own perils.
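The pigeonhole filtering idea can be illustrated with a minimal sketch. This is not REAL's implementation (which uses a two-bits-per-base encoding and its own data structures); it only shows the principle: if a read split into four fragments aligns with at most two mismatches, at least two fragments must match the reference exactly. All function names are hypothetical, the naive `str.find` scan stands in for a proper index, and treating a tie for the best score as "no unique alignment" is an assumption.

```python
from collections import Counter


def split_read(read, parts=4):
    """Split a read into `parts` contiguous fragments as (offset, fragment)."""
    base, extra = divmod(len(read), parts)
    fragments, start = [], 0
    for i in range(parts):
        size = base + (1 if i < extra else 0)
        fragments.append((start, read[start:start + size]))
        start += size
    return fragments


def candidate_positions(reference, read, max_mismatches=2, parts=4):
    """Start positions where enough fragments match the reference exactly."""
    if len(read) < parts or parts <= max_mismatches:
        raise ValueError("read too short or too few fragments")
    votes = Counter()
    for offset, fragment in split_read(read, parts):
        hit = reference.find(fragment)
        while hit != -1:
            start = hit - offset
            if 0 <= start <= len(reference) - len(read):
                votes[start] += 1
            hit = reference.find(fragment, hit + 1)
    # Pigeonhole: k mismatches can spoil at most k of the fragments
    needed = parts - max_mismatches
    return sorted(pos for pos, n in votes.items() if n >= needed)


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def best_unique_alignment(reference, read, max_mismatches=2):
    """Return (position, mismatches) of the best alignment, or None."""
    hits = []
    for pos in candidate_positions(reference, read, max_mismatches):
        d = hamming(reference[pos:pos + len(read)], read)
        if d <= max_mismatches:  # verification step after the filter
            hits.append((d, pos))
    if not hits:
        return None
    hits.sort()
    if len(hits) > 1 and hits[0][0] == hits[1][0]:
        return None  # the best score is shared, so no unique alignment
    return hits[0][1], hits[0][0]
```

The filter discards most positions without comparing the full read, and the Hamming check then removes the surviving candidates that still exceed the mismatch limit.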
Because the normalized counts are very small, the logarithm produces negative values; higher expression nevertheless still corresponds to peaks. Details on the isochores' coordinates, GC levels, aligned reads, and expression levels, for each of the three tissues, can be found in Additional file 6.
To investigate the expression at gene level, the coding sequences for the mouse were retrieved from the Consensus Coding Sequence Database (CCDS). From the 17,704 CDSs, 14 were found to lack a start codon, and were eliminated. The remaining 17,690 CDSs were assigned to isochores based on the coordinates of their exons, as given in the CCDS database.
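The two steps above (discarding CDSs without a start codon, then assigning each CDS to an isochore from its exon coordinates) can be sketched as follows. The text does not say how a CDS whose exons span an isochore boundary was handled, so assigning it to the isochore covering the most exonic bases is an assumption, as are the function names, the ATG-only start codon check, and the half-open coordinate convention.

```python
def has_start_codon(cds_sequence):
    """True if the coding sequence begins with ATG (assumed start codon)."""
    return cds_sequence.upper().startswith("ATG")


def overlap(a_start, a_end, b_start, b_end):
    """Number of bases shared by two half-open intervals [start, end)."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))


def assign_to_isochore(exons, isochores):
    """Return the name of the isochore covering the most exonic bases.

    exons: list of (start, end) on one chromosome.
    isochores: list of (name, start, end) on the same chromosome.
    Returns None if no exon overlaps any isochore.
    """
    best_name, best_bases = None, 0
    for name, iso_start, iso_end in isochores:
        bases = sum(overlap(s, e, iso_start, iso_end) for s, e in exons)
        if bases > best_bases:
            best_name, best_bases = name, bases
    return best_name
```

With the isochore map of a chromosome as `isochores`, each retained CDS would be passed through `assign_to_isochore` using its CCDS exon coordinates.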
Details on the expression levels of the CDSs, for each of the three tissues, can be found in Additional file 7.
We thank Prof. Giorgio Bernardi and Oliver Clay for reading the manuscript and giving valuable comments. SA and SK are supported by institutional funds. KF is funded by the Greek State Scholarships Foundation. This work is also partially supported by the SeqAhead COST action.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.