The isochore patterns of invertebrate genomes

Background: Previous investigations from our laboratory were largely focused on the genome organization of vertebrates. We showed that these genomes are mosaics of isochores, megabase-size DNA sequences that are fairly homogeneous in base composition yet belong to a small number of families that cover a wide compositional spectrum. A question raised by these results concerned how far back in evolution an isochore organization of the eukaryotic genome arose.
Results: The present investigation deals with the compositional patterns of the invertebrates for which full genome sequences, or at least scaffolds, are available. We found that (i) a mosaic of isochores is the long-range organization of all the genomes that we investigated; (ii) the isochore families from the invertebrate genomes matched the corresponding families of vertebrates in GC levels; (iii) the relative amounts of isochore families were remarkably different for different genomes, except for those from phylogenetically close species, such as the Drosophilids.
Conclusion: This work demonstrates not only that an isochore organization is present in all metazoan genomes analyzed, which included Nematodes and Arthropods among Protostomia, and Echinoderms and Chordates among Deuterostomia, but also that the isochore families of invertebrates share GC levels with the corresponding families of vertebrates.


Background
Recent investigations from our laboratory [1] showed that the isochore families of all vertebrate genomes explored are essentially conserved in GC levels (as well as in dinucleotide patterns) and in isochore sizes (with some exceptions in this case). Moreover, at least in eutherian mammals and birds, even the relative amounts of isochore families (i.e. the compositional patterns) are largely conserved. It is well established by our previous work that a number of very basic genome properties, such as the distribution of genes and interspersed repeats, DNA methylation, gene expression, replication timing and recombination, are different in GC-poor and GC-rich isochore families (see [2,3], for reviews). These results obviously support the idea of isochores being a "fundamental level of genome organization" [4], at least in vertebrates. We also suggested that the conservation of GC levels of isochore families may underlie chromatin structures [5], and that the conservation of isochore size may be associated with their role in the structure and replication of chromosomes [6].
The above points raise the question of how general the isochore structure is in the genomes of metazoa. This is a pertinent question because evidence for compositional heterogeneity was already obtained in our previous work on the genomes of trypanosomes, plasmodium and drosophila (see ref. [2]). Here we approached this problem by investigating the compositional compartmentalization of genomes from invertebrates in all cases in which either full sequences or at least scaffolds were available. The case of Apis mellifera will be presented elsewhere because the presence in this genome of a very GC-poor family (in addition to the other isochore families), which was absent from other insects, required a special analysis. As expected, the present analysis broadens our view of the structure and evolution of isochores. In fact, it demonstrates that an isochore organization endowed with a number of shared features is found in the genomes of all metazoans explored in the present work.
Figure 1 displays a simplified tree of life of metazoans (derived from [7]) in order to outline the phylogenetic positions of the invertebrate genomes explored. These comprised Nematodes, Arthropods, Echinoderms and Chordates. As shown in Additional File 1, the sizes of the genomes explored covered a 2.5-fold range, 95 to 240 Mb, which, added to the 8-fold range of vertebrates, 400 to 3200 Mb, expands the overall range of investigated genome sizes to about 33-fold. Table 1A presents the relative amounts of the isochore families of all invertebrates investigated (along with those of human and three fishes that are shown for the sake of comparison). The salient features that appear in the isochore patterns (Figures 2 and 3) will now be outlined.
The isochore families of Ciona intestinalis, a Urochordate, which is the closest ancestral species to vertebrates [8], are presented in Figure 2 (along with those of three representative vertebrates, human, chicken and zebrafish) in order to show the existence of a major L1 and a minor L2 isochore family. The genome of the nematode Caenorhabditis elegans (Figure 2) is also very GC-poor, consisting essentially of L1 isochores with only a very small amount (less than 10%) of L2 isochores.

Isochore patterns
Among insects (Figure 3), the three Drosophila species studied and Anopheles gambiae display three isochore families: a minor L2 family, a predominant H1 family and an H2 family which is barely represented in Drosophila, but is rather abundant in Anopheles. A very minor (about 1%) L1 component appears to correspond to chromosome 4 in Drosophilids and is barely represented (0.5%) in A. gambiae.
Figure 1. The locations of the species investigated are reported in the approximate tree presented. This is derived from Dunn et al. [7], to which the reader is referred for the precise tree.
Figure 2. Distribution of isochores according to GC levels. The first three histograms, shown for the sake of comparison, display the distribution (by weight) of isochores, pooled in bins of 0.5% GC, of some vertebrates: human [9], chicken [11] and zebrafish [10]. The bottom panels show the isochore patterns of C. intestinalis and of C. elegans. The total amount of DNA calculated from the sums of isochores, GC % and the number of isochores are reported. Colors represent the five isochore families of the human genome. Notice the different scales on the ordinate axis.
The compositional patterns of scaffolds or contigs from Branchiostoma floridae, Strongylocentrotus purpuratus, Aedes aegypti, Tribolium castaneum and Daphnia pulex are shown in Figure 4. Compositional distributions were generally narrow, covering a range of about 5% GC, with the exception of T. castaneum, in which case the range was about 10% GC, the center of the distribution being lower (33%) than in the other cases.

GC levels and dinucleotides of isochore families
The average GC levels of the isochore families from the invertebrates investigated are very close to each other and to the corresponding values of vertebrates (see Table 1B).
As far as dinucleotide patterns are concerned, C. intestinalis showed a remarkable observed/expected (O/E) pattern in that complementary dinucleotides fell into two classes, higher and lower than statistical expectation, respectively (see Figure 5). Indeed, AA/TT, AC/GT, CA/TG, CC/GG were in the more frequent class, whereas AG/CT, GA/TC were in the less frequent class. The two dinucleotides of each of these pairs showed the same frequency. In contrast, TA and CG were less frequent than their complementary dinucleotides AT and GC. Moreover, AT/TA belonged to the less frequent class whereas CG/GC were close to statistical expectation.
Some features of C. intestinalis, a Urochordate, were also found in C. elegans, a Nematode. Indeed, in C. elegans AA/TT, CC/GG and CA/TG were also high, whereas AG/CT were also low, but, in contrast, GA/TC were low and AC/GT were high. Trinucleotide frequencies showed the expected similarities and differences (see Additional File 2).
AA/TT, CC/GG and CA/TG were also high in the insects (and in vertebrates; see ref. [1]; for the sake of comparison with human see also Additional File 3), whereas AG/CT was low in insects (but not in vertebrates), and AT/TA were low (TA especially) in all cases (see Figure 6).
Very interestingly, CG was remarkably low in S. purpuratus and B. floridae. The 0.6 values attained are, however, still much higher than the 0.2 values reached by mammals (see Figure 5B).
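The observed/expected ratios discussed in this section can be illustrated with a minimal sketch. It assumes that the expected frequency of a dinucleotide XY is the product of the frequencies of X and Y on the strand analyzed; the exact formula used for the figures is not given in the text, and the function name `dinucleotide_oe` is hypothetical.

```python
from collections import Counter

BASES = "ACGT"


def dinucleotide_oe(seq):
    """Return observed/expected ratios for the 16 dinucleotides of one strand.

    Assumption: expected frequency of XY = f(X) * f(Y). Ambiguous symbols
    (e.g. N) are ignored in both the mono- and dinucleotide counts.
    """
    seq = seq.upper()
    mono = Counter(b for b in seq if b in BASES)
    total_mono = sum(mono.values())

    # Overlapping dinucleotides made only of A, C, G and T.
    di = Counter()
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair[0] in BASES and pair[1] in BASES:
            di[pair] += 1
    total_di = sum(di.values())

    ratios = {}
    for x in BASES:
        for y in BASES:
            if total_mono == 0 or total_di == 0:
                ratios[x + y] = float("nan")
                continue
            expected = (mono[x] / total_mono) * (mono[y] / total_mono)
            observed = di[x + y] / total_di
            # A ratio is undefined when one of the two bases is absent.
            ratios[x + y] = observed / expected if expected > 0 else float("nan")
    return ratios
```

Applied to a 100-kb segment, a value such as `dinucleotide_oe(segment)["CG"]` close to 0.6 would correspond to the CpG shortage reported above for S. purpuratus and B. floridae. Values for complementary pairs such as AA/TT can be read from the same dictionary.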

Isochore sizes
The average size from the most represented families of isochores (see Table 1C) is about 0.5 Mb (megabases) for C. intestinalis, but higher values were found for C. elegans (3.4 Mb), for Drosophilids (1.6-2 Mb) and for A. gambiae (1.3 Mb). The GC-poorest isochore families comprise two size groups, a large one and a small one, as in the case of vertebrates [9][10][11]. In particular, C. elegans showed a number of extremely long GC-poor stretches (see Discussion). In contrast, the GC-richest isochores are characterized by one size group, the small one (see Additional File 4).
B) Observed/expected frequencies for dinucleotides in 100-kb DNA segments of isochore families from the Drosophilids, A. gambiae and T. castaneum.
Figure 7 displays the gene densities of the isochore families from D. melanogaster and A. gambiae along with those of two vertebrates, human and stickleback, that are presented for the sake of comparison. In both cases, gene density increased with increasing GC of isochore families. Unfortunately, the data for C. intestinalis and C. elegans did not permit us to establish a reliable ratio between gene densities in the L1 and L2 families because of the small amounts of DNA and number of genes of the minority family. In the case of the genomes only represented by scaffolds, genes were localized on the scaffolds and gene densities were shown to follow the general trend (see Figure 4).
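As an illustration of how gene densities per isochore family could be tabulated, the following sketch pools isochores by family and divides the number of genes by the amount of DNA. The unit (genes per Mb), the input layout and the function name are assumptions made for the example, and the numbers in the usage note are invented.

```python
from collections import defaultdict


def gene_density_by_family(isochores):
    """Return genes per Mb for each isochore family.

    `isochores` is an iterable of (family, size_bp, n_genes) tuples, a
    hypothetical layout chosen for this example.
    """
    size_bp = defaultdict(int)
    genes = defaultdict(int)
    for family, size, n_genes in isochores:
        size_bp[family] += size
        genes[family] += n_genes
    # Pool DNA and genes per family first, then take the ratio.
    return {
        family: genes[family] / (size_bp[family] / 1e6)
        for family in size_bp
        if size_bp[family] > 0
    }
```

For instance, with the made-up input `[("L2", 1_000_000, 50), ("H1", 2_000_000, 300), ("H1", 500_000, 100)]` the function returns 50 genes per Mb for L2 and 160 genes per Mb for H1. A minority family represented by very little DNA and very few genes gives an unstable ratio, which is the difficulty noted above for C. intestinalis and C. elegans.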

Discussion
C. intestinalis and C. elegans essentially show a very predominant L1 isochore family and a minor L2 family, a compositional pattern mimicking that of zebrafish. The case of C. intestinalis is of interest because a previous investigation by analytical ultracentrifugation in a CsCl density gradient had shown a remarkable homogeneity at an average molecular size of 100 kb [12]. The apparent discrepancy can, however, be explained by the fact that the CsCl investigation dealt with random fragments, whereas the present one dealt with unique 100-kb segments. The latter show an average standard deviation of 1.3% GC (Cammarano R., Ph.D. Thesis), a value very slightly above the average standard deviation of 100-kb segments from human isochores from the L1 and L2 families. The dinucleotide patterns (observed/expected) present some significant differences (e.g. TA being much lower than AT in C. elegans, but not in C. intestinalis), which, as expected, are also shown by trinucleotides.
Among insects, the Drosophilids exhibit similar isochore patterns that are intermediate between those of medaka and stickleback, with a major H1 family and two minor families, L2 and H2. Another point of interest is the close similarity of GC levels of the isochore families as assessed on Drosophilids and Anopheles (see Table 1B). The compositional pattern of A. gambiae, although mainly represented by H1 isochores, shows a substantial amount of H2 isochores. In fact, the GC-richer isochores of A. gambiae have probably been underestimated because of the presence of a large number of 100-kb GC-rich stretches that were pooled with the flanking regions because of the procedure used in assessing isochores (see below).
In the other cases, certainly a number of factors played a role and led to different compositional patterns. Indeed, we already noted that while the large evolutionary changes in isochore patterns occurring between mammals/birds and amphibians/fishes mainly depend upon body temperature, definitely other factors play a role as well in the case of fishes.
Figure 7. The histograms represent the gene density in the isochore families. The gene concentrations of D. melanogaster and A. gambiae increase with increasing GC in isochore families, as in the case of the genomes from human and stickleback.
The average sizes exhibited by different isochore families of invertebrates showed a greater variability compared to those of the corresponding families from vertebrates [1]. This may be due, however, to artefactual reasons, such as gaps, but also to the experimental approach used. Indeed, in the human genome [9], isochores were taken to be at least 200 kb in size, a condition linked to the need to assess standard deviations of the 100-kb segments used in the analysis. As expected, this occasionally led to standard deviations higher than 3% GC within a given isochore. Since, however, this only concerned ~7% of the human genome, such "transition isochores" were accepted.
In the case of C. elegans, long stretches of DNA very low in GC and belonging to L1 isochores were present, and interspersed short L2 isochores were neglected. If one accepts in this case isochores as small as 100 kb, the very long L1 structures are resolved into shorter stretches and the high size values are brought back to the 0.5-1 Mb range. For instance, in the case of C. elegans the large size (3.37 Mb) estimated according to the criteria of Costantini et al. [9] is reduced to 1.00 Mb if 100-kb segments belonging to the GC range of L2 isochores (and averaging 0.23 Mb in size) are considered separately. This consideration also applies to the large size of the major families of insects, as indicated by thin 100-kb lines appearing in the GC profiles.

Conclusion
The major conclusions of the present investigations are the following: (i) an isochore structure appears to be general for all metazoans explored; this raises the question whether, in fact, all eukaryotic genomes are characterized by an isochore structure; current work on plants and unicellular eukaryotes should clarify this point; (ii) the isochore families are generally characterized by GC levels that were identical or very close to those of vertebrates; (iii) differences in dinucleotide patterns (observed/expected values) were found among invertebrates, as well as between invertebrates and vertebrates; in the latter case, the most salient feature was the CpG shortage, which is due to the methylation of C in CpG followed by its deamination to T; this feature was also found in S. purpuratus and B. floridae, even if to a lesser extent compared to mammals; (iv) the average size of isochores shows a certain variability, which is apparently due, at least to a large extent, to artefactual reasons, as discussed in the preceding section; (v) no correlation was found between isochore size and genome size in spite of the very large genome size range explored so far; this practical independence of isochore size from genome size stresses their possible correlation with the structure and replication of chromosomes, as suggested by Costantini and Bernardi [6]; (vi) the relative amounts of isochore families are different in different genomes, a situation due in our opinion to the different environmental factors that play a role in determining the compositional patterns of genomes (for example, if Anopheles has a higher body temperature than Drosophilids, this could explain the higher amounts of GC-rich isochores in A. gambiae); (vii) gene concentration increases with increasing GC of isochore families, as previously found for vertebrates.
These conclusions are in keeping with some previous suggestions [1], namely that (i) the high similarity of GC levels of isochore families may be due to their composition being linked to chromatin structure; (ii) the increasing variability in isochore patterns from warm- to cold-blooded vertebrates and to invertebrates may be correlated with the increasing dependence on environmental factors that affect genome organization and functions; (iii) the distribution of genes seems to be dictated by the need for a certain genomic context, whose composition influences the transcriptional activity, and also the structure and function of the encoded proteins.

Genome and gene sequences
The sequences of the eukaryotic genomes as well as of the genes analyzed in this study were downloaded from different websites (see Additional File 5). Partial, putative, synthetic construct, predicted, not experimental, hypothetical protein, r-RNA, t-RNA, ribosomal and mitochondrial genes were eliminated and then the cleanup program [13] was applied for ridding nucleotide sequence databases of redundancies. For the remaining genes a script implemented by us was used in order to identify the coding sequences beginning with a start codon and ending with a stop codon. The coordinates of genes on the chromosomes were retrieved from the website used for downloading the chromosomes.
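The selection of coding sequences beginning with a start codon and ending with a stop codon can be sketched as follows. This is not the script used in this work, which is not reproduced here; it is a minimal illustration assuming the standard genetic code (ATG start; TAA, TAG or TGA stop) and, as a further assumption, a length that is a multiple of three. All function names are hypothetical.

```python
# Stop codons of the standard genetic code (assumed for this sketch).
STOP_CODONS = {"TAA", "TAG", "TGA"}


def is_complete_cds(seq, start_codon="ATG"):
    """True if `seq` begins with a start codon and ends with a stop codon.

    The multiple-of-three length check is an extra assumption of this
    sketch; it is not stated in the text.
    """
    seq = seq.upper()
    if len(seq) < 6 or len(seq) % 3 != 0:
        return False
    return seq[:3] == start_codon and seq[-3:] in STOP_CODONS


def filter_complete_cds(records):
    """Keep only complete coding sequences from a {name: sequence} dict."""
    return {name: seq for name, seq in records.items() if is_complete_cds(seq)}
```

A call such as `filter_complete_cds({"g1": "ATGAAATAA", "g2": "CTGAAATAA"})` keeps only `g1`.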

Isochore mapping
The entire chromosomal sequences of the finished genome assembly were partitioned into non-overlapping 100-kb windows, and their GC levels were calculated using the program draw_chromosome_gc.pl [14,15] (http://genomat.img.cas.cz; see Additional File 6).
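For readers who wish to reproduce the windowing step without the program cited above, the partition into non-overlapping windows can be sketched as follows. This is an independent illustration, not the code of draw_chromosome_gc.pl; dropping the incomplete last window and computing GC over A, C, G and T only (so that gaps written as N are ignored) are assumptions of the sketch.

```python
def gc_percent(fragment):
    """GC level (in %) of a fragment, computed over A, C, G and T only."""
    fragment = fragment.upper()
    gc = fragment.count("G") + fragment.count("C")
    at = fragment.count("A") + fragment.count("T")
    if gc + at == 0:
        return None  # window made entirely of gaps or ambiguous symbols
    return 100.0 * gc / (gc + at)


def gc_windows(seq, window=100_000):
    """Return (start, end, gc_percent) for non-overlapping windows.

    The trailing fragment shorter than `window` is dropped (assumption).
    """
    result = []
    for start in range(0, len(seq) - window + 1, window):
        end = start + window
        result.append((start, end, gc_percent(seq[start:end])))
    return result
```

With a chromosome sequence held in a string `chrom`, `gc_windows(chrom)` yields one GC value per 100-kb window; smaller windows (e.g. `window=12_500`) can be obtained by changing the parameter.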
The methodology used for isochore mapping was described by Costantini et al. [9]. Briefly, isochores are defined by two sets of data, GC levels and their standard deviations. Indeed, when genome sequences are scanned using four window sizes ranging from 12.5 to 100 kb, the final choice of the window is determined by the plateau values reached by the standard deviations [9]. Compositional jumps significantly larger than the standard deviations of isochore GC (1-2%) separate one isochore from the contiguous ones (see Additional File 7 for the coordinates, sizes, GC levels and GC standard deviations of the isochores identified in the analyzed genomes). When isochores are put in bins of 0.5% or 1% GC, families appear as well-defined peaks. In the case of the human genome, five isochore families were found which ranged in GC from 34% to 58% and decreased in relative amounts with increasing GC. These families were also detected by using four different approaches that are based, however, on our boundaries of isochore families [1,16].
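The logic of separating isochores at compositional jumps can be conveyed by a deliberately simplified sketch. It is not the procedure of Costantini et al. [9]: it uses a greedy rule (a new isochore starts when a window departs from the running mean of the current one by more than a threshold), a default threshold of 2% GC chosen for illustration, and family boundaries of 37, 41, 46 and 53% GC that are assumed here from the human genome work rather than stated in this text. All names are hypothetical.

```python
from statistics import mean, pstdev

# Upper GC bounds (%) of families L1, L2, H1 and H2; values above the last
# bound are assigned to H3. These boundaries are an assumption of the sketch.
FAMILY_UPPER_BOUNDS = [(37.0, "L1"), (41.0, "L2"), (46.0, "H1"), (53.0, "H2")]


def family_of(gc):
    """Assign a GC level (%) to an isochore family."""
    for upper, name in FAMILY_UPPER_BOUNDS:
        if gc < upper:
            return name
    return "H3"


def segment_isochores(window_gc, jump=2.0, window_kb=100):
    """Greedy segmentation of consecutive window GC levels into isochores.

    A new isochore starts when a window differs from the running mean of
    the current isochore by more than `jump` (% GC).
    """
    segments = []
    current = []
    for gc in window_gc:
        if current and abs(gc - mean(current)) > jump:
            segments.append(current)
            current = []
        current.append(gc)
    if current:
        segments.append(current)
    return [
        {
            "size_kb": len(seg) * window_kb,
            "gc": mean(seg),
            "sd": pstdev(seg),  # standard deviation of window GC levels
            "family": family_of(mean(seg)),
        }
        for seg in segments
    ]
```

For the invented series `[35.0, 35.5, 34.5, 42.0, 43.0, 42.5]` the sketch returns two 300-kb isochores, one assigned to L1 and one to H1. A single deviant 100-kb window is split off by this rule, whereas under the 200-kb minimum size criterion discussed above it would be pooled with its flanking regions.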

Nomenclature
As far as the name of each isochore is concerned, we used here, as in previous work, a convention in which the first number in the name represents the chromosome number, the following two letters are the initials of the scientific (Latin) name of the eukaryote under consideration, and the last number identifies the isochore.
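The convention can be illustrated by a short sketch. The exact formatting of the names (no separators, capitalized genus initial followed by a lower-case species initial) is an assumption made for the example, and the function name is hypothetical.

```python
def isochore_name(chromosome, species, index):
    """Build an isochore name: chromosome, two-letter species initials, index.

    Assumes `species` is a binomial such as "Ciona intestinalis"; the
    concatenated format is illustrative only.
    """
    genus, epithet = species.split()[:2]
    initials = genus[0].upper() + epithet[0].lower()
    return f"{chromosome}{initials}{index}"
```

Under these assumptions, isochore 15 of chromosome 2 of Ciona intestinalis would be written `2Ci15`.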