The human genome: a multifractal analysis
- Pedro A Moreno†1,
- Patricia E Vélez†2, 3,
- Ember Martínez4,
- Luis E Garreta1,
- Néstor Díaz4,
- Siler Amador4,
- Irene Tischer1,
- José M Gutiérrez5,
- Ashwinikumar K Naik6,
- Fabián Tobar3 and
- Felipe García3
© Moreno et al; licensee BioMed Central Ltd. 2011
Received: 13 April 2011
Accepted: 14 October 2011
Published: 14 October 2011
Several studies have shown that genomes can be studied via a multifractal formalism. Recently, we used a multifractal approach to study the genetic information content of the Caenorhabditis elegans genome. Here we investigate the possibility that the human genome shows a similar behavior to that observed in the nematode.
We report here multifractality in the human genome sequence. This behavior correlates strongly with the presence of Alu elements and, to a lesser extent, with CpG islands and (G+C) content. In contrast, little or no relationship was found for LINE, MIR, MER, and LTR elements or for DNA regions poor in genetic information. Gene function, clusters of orthologous genes, metabolic pathways, and exons tended to increase their frequencies with ranges of multifractality, and large gene families were located in genomic regions with varied multifractality. Additionally, a multifractal map and classification for human chromosomes are proposed.
Based on these findings, we propose a descriptive non-linear model for the structure of the human genome, with some biological implications. This model reveals 1) a multifractal regionalization where many regions coexist that are far from equilibrium and 2) that this non-linear organization has significant molecular and medical genetic implications for understanding the role of Alu elements in genome stability and structure of the human genome. Given the role of Alu sequences in gene regulation, genetic diseases, human genetic diversity, adaptation and phylogenetic analyses, these quantifications are especially useful.
The human genome is one of the most complex molecular structures ever seen in nature. Its extraordinary information content has revealed a surprising mosaicism between coding and non-coding sequences [1–4]. This highly regionalized structure introduces complex patterns for understanding the gene structure and repetitive DNA sequence composition and its role in human development, physiology, medicine and phylogeny. The coding regions are defined, in part, by an alternative series of motifs responsible for a variety of functions that take place on the DNA and RNA sequences, such as gene regulation, RNA transcription, RNA splicing, and DNA methylation. For example, sequencing of the human genome revealed a controversial number of interrupted genes (25,000 - 32,000) with their regulatory sequences [1, 2], representing about 2% of the genome. These genes are immersed in a giant sea of different types of non-coding sequences, which make up around 98% of the genome. The non-coding regions are characterized by many kinds of repetitive DNA sequences; almost 10.6% of the human genome consists of Alu sequences, a type of SINE (short interspersed element). These elements are not randomly distributed throughout the genome but rather are biased toward gene-rich regions. They can act as insertional mutagens, and the vast majority appear to be genetically inert. LINEs, MIR, MER, LTRs, DNA transposons, and introns are other kinds of non-coding sequences, which together make up about 86% of the genome. In addition, some of these sequences overlap one another, for example the CpG islands (CGI), which complicates analysis of the genomic landscape. In turn, each chromosome is characterized by particular properties of structure and function. Furthermore, the new era of rapid sequencing methods will make available more than one thousand human genome sequences, revealing the genetic variation between different human groups.
This knowledge will have a major impact on human health (disease origin), population studies and adaptation, among others. All these structural variations are challenging the inventiveness of theoretical and experimental scientists to create, develop and apply new approaches to quantify them. These variations allow carrying out studies of comparative genomics aimed at discovering correlations with some life-characterizing properties [8, 9]. Given that all these genomic variations produce a regionalized genomic landscape in the human genome, we thought fractal geometry could be an appropriate approach to studying how the genetic information content is fragmented.
The methodologies derived from fractal geometry have been a very useful approach to studying the degree of fragmentation (or irregularity) in natural, artificial and statistical structures or processes. Fractal structures are characterized by self-similarity, scaling independence, and a fractal dimension, an exponent obtained from a power or scaling law [11, 12]. Thus, power laws are powerful tools for searching self-similar properties in biological structures and processes and for quantifying the scaling properties of information contents.
Few studies have used the fractal approach and power laws to study the whole human genome [13, 14]. However, due to the complexity of the human genome, one exponent may not be enough to characterize a complex phenomenon. Multifractal formalism allows using more exponents. In this case, the object of analysis is divided into several fractal sets, each generating a fractal dimension that is then translated into a continuous spectrum of exponents (the so-called singularity spectrum). The multifractality degree (MD) obtained from this continuous spectrum allows measuring the genetic information content. Multifractal systems are common in nature, especially in geophysics. They include fully developed turbulence, stock market time series, heartbeat dynamics, human gait, and natural luminosity time series, among others. In post-genomics times, multifractal analysis has been a very useful approach to studying problems related to microorganism classification [18, 19], distinguishing coding and non-coding sequences, studying proteins, and promoter prediction; recently, this formalism was used to study human chromosomes and the genetic information content in the C. elegans genome. In the latter work, a significant relationship between the structural genetic information content and multifractal parameters was found, which has important biological implications. We thought that applying a similar method could be a valid approach to study the structure of the human genome. In the present paper, we report a multifractal analysis of the draft sequence of the human genome.
1) Multifractal analysis by chromosome fragment
1.1) Analyses of multifractal parameters
The multifractal parameters for 9,379 chromosome fragments were calculated and analyzed (Additional file 1). Initially, the generalized dimension spectrum and MD for all chromosome fragments were determined. The extreme generalized dimension spectra and a medium spectrum are depicted for comparison (Figure 1B). Note that the Dq maximum varies very little because negative q values are associated with the structure and properties of sparse regions, which have few points in the CGR of the human genome. In contrast, the Dq minimum varies widely because positive q values emphasize regions where the points are dense.
Subsequently, the corresponding scaling exponents τ(q) were calculated for each fragment (Additional file 2). The three multifractal spectra τ(q) show differences related to each other (Figure 1C). The scaling exponent τ(q) can reveal aspects of chromosome fragment structure. Monofractal behavior would correspond to a straight line for τ(q); for multifractal behavior, τ(q) is nonlinear. The changing curvature for the data for the chromosome fragments indicates multifractality. In contrast, τ(q) tends to be linear for that chromosome fragment with the lowest multifractality, indicating partial loss of multifractality.
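The distinction between a linear and a nonlinear τ(q) can be illustrated with a textbook object that is not part of this study: the binomial multiplicative cascade, for which τ(q) = -log2(p^q + (1 - p)^q) is known in closed form. The sketch below, with function names chosen only for illustration, measures the curvature of τ(q) through second differences. Equal weights (p = 0.5) give the straight line τ(q) = q - 1, whereas unequal weights give a curved, multifractal τ(q).

```python
import math

def tau_binomial(q, p):
    """Scaling exponent tau(q) of a binomial cascade with weights p and 1 - p."""
    return -math.log2(p ** q + (1.0 - p) ** q)

def max_curvature(p, qs):
    """Largest absolute second difference of tau(q) over equally spaced q values."""
    taus = [tau_binomial(q, p) for q in qs]
    return max(
        abs(taus[i - 1] - 2.0 * taus[i] + taus[i + 1])
        for i in range(1, len(taus) - 1)
    )

qs = list(range(-5, 6))          # illustrative q range, an assumption
mono = max_curvature(0.5, qs)    # equal weights: tau(q) is a straight line
multi = max_curvature(0.3, qs)   # unequal weights: tau(q) bends

print("monofractal curvature:", mono)
print("multifractal curvature:", multi)
```

The same second-difference check could be applied to an empirical τ(q) curve, although the choice of threshold for "close to linear" would be a judgment call.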
Using the whole data set for each chromosome we calculated the MD from each generalized dimension spectrum. Thus, the degree of multifractality for all chromosome fragments goes from ~0.79 to 1.56 with an average of 1.042 and median of 1.018 (Additional file 1). Analysis by range of multifractality (RM) reveals that the multifractal behavior for the whole data set is biased toward low multifractal values, as expected (Figure 1D).
Next, we used a discrimination method based on 2-D distributions to study the information dimension for all chromosome fragments. The data show two different informational patterns (similar to a > symbol), one with high information content (Figure 1E, dots on top) and the other with low and medium information content (Figure 1E, dots on bottom), the latter being more numerous in the data than the former. We hypothesize that these behaviors are related to some molecular parameter, which is analyzed in the following section.
1.2) Analyses of molecular parameters
This relationship was assessed in terms of Alu families and the Alu-S was found to be more correlated than the other Alu families (Figure 2B). Furthermore, the Alu contents (in conjunction with CGI) are biased toward high multifractal ranges suggesting the significant role of these sequences in determining the non-linearity in the human genome (Figure 2C).
When sequencing the human genome, a strong relationship between Alu and CGI contents [1, 2] also became evident. We observe that CGI have a lower multifractal relationship than that found for the Alu elements (Figure 2D). However, when both parameters are combined a significant fit was obtained (R2 = 0.85, p < 0.05). Other molecular parameters such as gene density, exons, introns, LINE, MIR, MER, and LTRs did not show a significant fit by a simple linear regression. However, when all repetitive elements (Alu, LINE, MIR, MER, and LTRs) are taken into account, R2 is ~0.57. Thus, among the studied genomic features, Alu has the highest correlation with multifractal degree.
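As a reminder of what the reported R2 values measure, the following minimal sketch fits a simple linear regression of MD on a single molecular parameter and returns the coefficient of determination. The study used PASW Statistics for this step; the function name and the toy values below are hypothetical and are not data from the paper.

```python
def simple_regression(x, y):
    """Least-squares fit y = intercept + slope * x; returns (slope, intercept, R^2)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - mean_y) ** 2 for b in y)
    return slope, intercept, 1.0 - ss_res / ss_tot

# Hypothetical toy values: Alu counts per fragment and the fragment's MD.
alu_counts = [10, 50, 90, 130, 170]
md_values = [0.85, 0.95, 1.02, 1.15, 1.22]

slope, intercept, r2 = simple_regression(alu_counts, md_values)
print("slope:", slope, "intercept:", intercept, "R2:", r2)
```

A combined fit on two predictors (for example Alu and CGI together) would require a multiple regression, which this sketch does not attempt.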
Multivariate analyses of ΔDq versus all variables (Alu, G+C, CGI, LINE, MIR, MER, LTRs, nCoding, nNonCoding, exons, genes, and SNPs) per chromosome were carried out and for each case the most relevant variables explaining ΔDq were selected. The most frequently used variables are (G+C), Alu, CGI, which are significant in 23, 23, 21 cases of the 24 chrs. (Additional file 3). CGI coefficients in all regressions are negative, probably compensating the high positive (G+C) coefficients, given that (G+C) and CGI are strongly correlated (R ~0.805). Positioning Alus among the most relevant variables confirms our prior analyses based on 1 and 2 dimensional regression. In the same way, we analyzed ΔDq for the whole genome, obtaining again (G+C), Alu, CGI as the most relevant variables explaining ΔDq (Additional file 4). Moreover, when the long interspersed repeats are analyzed by RM they tend to be located on low and medium multifractality (LMM) ranges (Figure 2E).
Given that the information dimension studied takes a form of > symbol (Figure 1E), we studied its behavior using a discrimination method based on 3-D distributions. In this analysis, the high information content is related to Alu content, whereas the low and medium information contents are rather related to low Alu contents and other genomic structures (Figure 2F).
1.3) Multifractal map of the human genome
Given that other repetitive elements could contribute to increase the local multifractality, five chromosome fragments with a low number of Alus and high MD in chromosomes 4, 21, and Y caught our attention (Figure 3, asterisks, Additional file 5). We analyzed these sequences and found many variable short repeats in tandem (VSRTs) (Additional file 6). Thus, the presence of these repeats increases local multifractality but reduces the entire chromosome multifractality for these chromosomes, as mentioned before.
1.4) Chromosomal location of the most multifractal chromosome fragments
1.5) Analyses by gene function, gene family, and gene length
To gain further insights into the gene function, we focused on about 208 human gene families, consisting of 4,614 genes [25, 26]. We asked about the multifractal genomic context for these gene families. The distributions obtained show three different multifractal behaviors (Figure 5B): low-skewed (for OR, KCN, HLA, IFN, KRT, CDH, and RGS), high-skewed (for ZNF, SNORA, USP, RPS, SNORD, GTF, DHX, ALOX, and UBE2), and "medium" for most gene families. Other gene families can be placed within some of these categories (Additional file 10).
When multifractality is related to the information content (for example, number of exons, Figure 5A), it is expected that the more genetic information exists, the greater is the extent of genetic information fragmentation. To verify this assumption we looked for the average lengths of genes, exons, and introns in relation to the RM. The three corresponding distributions show how the average lengths decrease as multifractality increases (Figure 5C, Additional file 11). Another approach to validate this assumption is to observe the number of information units (IU) (exons plus introns) per RM. Here, the distribution shows that the number of IUs increases when the RM increases (data not shown).
2) Multifractal analysis by chromosome
We asked whether this subjective classification could be obtained by hierarchical clustering of the complete data set of averaged multifractal parameters, using multifractality as a similarity measure (Additional file 14). The clustering process classified the chromosomes into three multifractality groups (low, medium and high; top of Figure 6B), with among-group similarities of 0.84 and 0.4, respectively, confirming (in part) the visual observation. Nearly all chromosomes (92%) lie on the consecutive, visually identified low, medium and high sections on the regression line. The only exceptions are chromosomes 22 and Y, which are placed in other groups.
3) Multifractal analysis by average of chromosome regions
Other relevant characteristics are observed in some chromosome bands and arms (Figure 7A). For example, the X chromosome, which is involved in X chromosome inactivation (XCI) and is rich in LINE1 elements and poor in Alu sequences, showed 0.95 ≤ ΔDq ≤ 1.027 (Additional file 16). The Y chromosome has two particular regions at the Yp and Yq ends, the pseudoautosomal region and the palindromic region, respectively. We thought that the palindromic regions should have low multifractality because of their symmetric structure. We found, in fact, that this region has lowered non-linearity. Moreover, recombination rates in chromosome 8 tend to be much higher in distal regions (around 20 Mb), and the analysis showed medium non-linearity at this region, as expected (Additional file 17). Regarding chromosome 1, rich in Alu sequences in one of its arms, we found significantly high multifractality (~1.13) at this region; in contrast, the other three regions have a ΔDq ≤ 1.058 (Additional file 18). Similar situations can be analyzed for other chromosomes. As two opposing references we use chromosomes 4 and 19 for comparison (Additional file 19).
Antibodies to histone modifications previously linked to active transcription showed close correspondence to regions rich in genes and CGI in the human metaphase epigenome. We analyzed chr. 1 and found that CGI profiles correspond well to multifractality (Figure 7B, Additional file 20).
We discovered a strong relationship between the multifractal parameters and part of the genetic information coded by the human genome.
Initially, the multifractality in human genome was found strongly dependent on the Alu contents
Herein, thousands of chromosome fragments with multifractality ranging from low to high values were analyzed (Figure 1A-C). For all chromosome fragments, τ(q) is a nonlinear function (Figure 1C), indicating that the molecular structure of the chromosome fragments has a multifractal behavior. However, in many chromosome fragments, τ(q) tends to be close to linear behavior, especially for τ(q≥ 2), indicating partial loss of multifractality. These results suggest that nucleotide fluctuations are less anti-correlated in many chromosome fragments. In fact, the fragment distribution is biased toward low and medium multifractal values (Figure 1D), suggesting that the human genome has a large number of regularly arranged elements, highly periodic and not very polymorphic. This is not surprising because the human genome has about 98.9% of non-coding sequences with a complex composition given by introns and intergenic regions. That is, at least 55% of this information is poorly polymorphic given that these regions mainly consist of introns, LINEs (especially L1), LTRs and DNA transposons [1, 2]. In contrast, the human genome also has a significant number of chromosome fragments with high multifractality (Figure 1D). That means these regions should be rich in specific types of sequences that are highly polymorphic and organized in a large number of possible combinations. When the information dimension was analyzed a dual informational behavior confirmed such assumption (Figure 1E). Indeed, the multifractality was found to be strongly correlated with the Alu content (Figure 2A), which became visible when plotted against the information dimension (Figure 2F). This result is very significant given that the Alu family is highly polymorphic [29, 30] and in a 300 kb chromosome fragment one can find Alu elements in many combinations in up to 50% of its length. 
The Alu elements are not identical and can be classified into three major families: Alu-J, Alu-S and Alu-Y, representing the oldest, intermediate, and youngest Alus, respectively, and each family is divided into one or more levels of subfamilies. In total, ~45 subfamilies encompass the complete Alu family. We found that multifractality was mainly dependent on the Alu-S contents (Figure 2B), especially the Alu-Sx, an expected result since these sequences are the most abundant Alu members in the human genome. Analysis via RM confirmed that the Alu sequences tend to be located toward medium and high ranges of multifractality (Figure 2C) because of the high Alu content in the human genome.
The CGI showed a moderate relationship with the multifractality (Figure 2C, D), which might be because more than 95% of CGI are less than 1,800 bp long. Genes, exons, introns, LINEs, MIR, MER and LTR contents did not show any significant relationship with the multifractality because most of these sequences have a low number of members, are large and have few polymorphisms. For example, LINE elements are ~6 kb long, more numerous than Alus and consist of four families, LINE-1 being the most abundant family (~17%) in the genome, and their density pattern is quite uniform for most chromosomes. Thus, the combination of number of members, size and polymorphism seems to determine multifractality changes. The earlier mentioned abundant number of polymorphic Alu sequences confirms the relation between these characteristics and multifractality. In fact, an in silico comparative genomics study between public and Celera versions of human genome sequences identified several hundred new Alu insertion polymorphisms, showing that these elements are highly polymorphic. A similar behavior is found in C. elegans, where the TTAGGC repeat is abundant in number and combinations within the flanking sequences.
Subsequently, we elaborated a multifractal map of the human genome (Figure 3), which shows MD and Alu density along the human chromosomes. The map reveals that the human chromosomes contain many significant correlation structures for Alu-rich regions. Thus, the high contents of Alu account for the high aperiodicity and genetic variability of many chromosome sections. A similar result in C. elegans reported changes in multifractality related to a specific type of repetitive DNA. Additionally, the correlations for CGI are lower but significant. However, no significant correspondence was found in regions poor in Alu sequences and rich in LINE, MIR, MER and LTR sequences. Not all multifractality is due to the Alu contents; many VSRTs can also contribute to increasing local multifractality (Figure 3, asterisks). We found very poor correspondence to the number of genes, perhaps due to their low frequency. These results, taken together, indicate that the observed multifractality is primarily related to nonlinear distributions for those chromosome fragments which are rich in Alu sequences, next for those with high CGI content and, in a few instances, for those with high VSRT contents.
Hundreds of highly multifractal chromosome fragments mapped in chromosomes rich in genetic information
There were a large number of chromosome fragments with very high multifractality (Figure 4A), mainly located on chromosomes 19, 17, 22, and 16 (Figure 4B, above). We suggest that all of these chromosome sections generate a mosaic of regions locating the genetic information far from equilibrium [17, 24], which could be interpreted both as a protective "shield" for the human genome against environmental fluctuations and as "genomic attractors" to maintain many components, functions and processes under a "deterministic" genomic control. In contrast, the same analysis also identified thousands of LMM chromosome fragments (Figure 2C) with low Alu content (Figure 4B, below) and perhaps prone to being affected by the environment. This result might be interpreted as some genome sections with low nonlinearity that might have high genetic instability associated with some particular (structural or functional) gene property.
Several gene characteristics are related to multifractality
This is not striking since three-fourths of all genes in the genome are associated with Alus (Figure 5A). Therefore, some gene families tend to be located preferentially within a multifractal genomic context (Figure 5B). For example, the hOR gene family lies mainly in a low multifractal genomic context. This is because this family has a very periodic and repetitive structure. It is known that the OR gene family has about 390 active members, which were propagated in the genome by gene duplication. Hence they share a high homology due to their high structural homogeneity and possess many clusters of regular characteristics; nonetheless, their functional expression depends on a complex interplay between regulatory sequences and the environment. A similar behavior is observed in the KCN gene family, responsible for building potassium channels for cell communication. In contrast, the ZNF gene family, which codes for regulatory proteins and is, therefore, involved in many cellular functions, is located in a medium and high multifractal genomic context. For example, the ZEB2 protein, involved in a chemical signaling pathway, regulates early growth and development and obeys a predetermined genetic program. In addition, these genes have a high structural inhomogeneity and many irregular characteristics. Similar inferences might apply for the RPS gene family, which codes for highly conserved proteins for the ribosome, for the SNORA machinery involved in the nuclear splicing and for USPs that help to control the levels of many proteins in the cell. This seems to suggest that the low multifractal genetic context might be related to information inputs from environmental processes, and the high one to inputs from deterministic processes. Thus, a few gene families in the human genome might be subjected to two types of information (or stimulus) inputs, while most gene families seem to be subjected to a complex regulatory interplay between epigenetic and genetic controls.
On the other hand, the degree of gene fragmentation by RM (Figure 5C) behaves according to the multifractal theory: multifractality increases when the length of exons and introns in the human genome decreases and the number of IUs per interrupted gene increases with multifractality, as expected.
The multifractal approach per chromosome permitted classifying the human chromosomes. This analysis validated the strong relationship to the Alu elements (Figure 6) that we found, especially for chromosomes 19, 17, 22, and 16, which are rich in genetic information content [1, 2]. In particular, chromosome 19 is by far the most multifractal chromosome and has the highest gene density of the whole genome. It is also unusual with respect to its density of repeat sequences. In fact, nearly 55% of this chromosome consists of repetitive elements, whereas chromosomes 6, 7, 14, 20, 21 and 22 all have repeat contents ranging from 40% to 46% (the genome average is 44.8%). This difference is due mainly to an unusually high content of SINEs in chromosome 19. In contrast, chromosomes 13, X, and 4 have the lowest multifractality because their Alu content is lower than the autosomal average and they have low gene density. Some of these chromosomes have very large "gene deserts" and the CGI and LINE contents are the highest percentage among all autosomes [1, 30]. A similar behavior can be observed for chromosomes 19, 17, and 4, as reported in a recent multifractal analysis.
Our analysis permits classifying human chromosomes into three multifractality groups, suggesting that the chromosome molecular structure might be organized as a system operating far from equilibrium (Figure 6B). Thus, those chromosomes with low multifractality might be closer to equilibrium and have greater genetic instability. If so, this would explain why some chromosomes would be involved in some genomic disorders (structural and numerical chromosome alterations). For example, some microdeletion syndromes have been reported for chr. 4: Wolf-Hirschhorn syndrome, chr. 5: Cri du chat syndrome and chr. 15: Angelman and Prader-Willi syndromes. Some aneuploidies can be present in chr. 8: Warkany syndrome, chr. 13: Patau syndrome, chr. 18: Edwards syndrome, chr. 21: Down syndrome, chr. X: Turner syndrome (XO), Klinefelter syndrome (XXY), triple X syndrome and other tetrasomies and pentasomies of chr. X. For chr. Y: XYY syndrome and Turner syndrome. With the exception of chromosomes 21 and Y, all were classified as chromosomes with low multifractality and are more susceptible to genetic damage or incorrect meiotic segregation.
The multifractal approach by chromosome region reveals different genomic scenarios (Figure 7A)
For instance, chromosome 21 regions with low multifractality might promote genetic instability during meiotic segregation in Down syndrome. Similar behaviors might arise for chromosomes X and Y to explain XCI and sex determination. For example, the most remarkable enrichment of repetitive sequences was obtained for L1, which accounts for 29% of the X chromosome sequence compared to the average of only 17%. Some studies have reported a significant association between L1 coverage and inactivation, and others have refuted this result. However, the low multifractality, especially at the third region (AMD ~0.96), may be prone to XCI. With regard to chromosome Y, the pseudoautosomal region is more stable, while the palindromic (more periodic) region is unstable and more prone to producing genetic disorders such as mixed gonadal dysgenesis and infertility [1, 34]. On the contrary, the 8p region, in which a vast section of ~15 Mb has a strikingly high mutation rate, lies in a medium multifractality region. Similar behavior can be inferred in the C. elegans chromosome arms, which have high mutation rates.
A similar approach showed that the CGI and Alus correspond well to multifractality (Figure 7B). This result is significant because of the role that CGI play in heritability of epigenetic states during the active transcription or modifications associated with active chromatin.
Finally, we propose a descriptive, non linear model for the function and organization of the human genome (Figure 8)
Secondly, there is a strong tendency to increase genetic information content when multifractality increases and to increase gene fragmentation when multifractality increases. These results are consistent with what the multifractal theory predicts (Figures 5A, C, and 8E). Thus, the human genome seems to be made up of many information units (interrupted genes, Alus and CGIs) with different degrees of fragmentation (or size) that account for the aperiodic scaling of short and long range correlations found by other authors.
Thirdly, a multifractal genomic context seems to be a significant requirement for the functional and structural organization of thousands of genes and many gene families, i.e., a low multifractal context seems to be necessary for many sequences (generated by gene duplication and periodicity) to interact with environmental signals, while a high multifractal context (aperiodic) seems to be prone (or a "genomic attractor") to many genes; and some (very aperiodic) gene families are involved in deterministic and genetic processes (Figures 5A, B, and 8E, F). Thus, the highly multifractal regions would be a guarantee to maintain a deterministic regulation control in the genome, although most of the human genome sequences can be subjected to a complex epigenetic and genetic control, as observed when the human epigenome due to the CGI contents is related to multifractality.
Fourthly, the human chromosome classification and some chromosomal region assays may have some medical implications. That is, the structure of low non-linearity exhibited by some chromosomes (or chr. regions) might imply an environmental predisposition to be sensitive targets for structural and numerical chromosomal alterations (Figures 6, 7, and 8G). In fact, the loss of non-linearity is associated with failure or alterations of many vital systems close to equilibrium [17, 40, 41]. Additionally, the sex chromosomes must have low multifractality to maintain the sexual dimorphism and likely the XCI.
All these fractal and biological arguments might explain why the Alu elements are shaping the human genome in a nonlinear manner. We believe that applying comparative multifractal genomics among many human genomes and other model organisms can help to answer how the genome came to exist.
We report evidence for multifractality in the human genome. We identified thousands of chromosome fragments with low, medium and high multifractality, which can be translated in terms of variable genetic stability. Using these fragments we demonstrated -by different approaches- that changes in multifractality depend strongly on changes in contents of Alu sequences. The generated multifractal map of the human genome allows discussing the multifractal context in which thousands of genes and repetitive sequences lie. Thus, the Alu elements (and CGI) are non-linearity shapers and protectors of the genetic information of the human genome.
Likewise, the averaged multifractality permitted analyzing chromosome regions and classifying human chromosomes into three groups. This non-linear classification has significant medical implications because it is able to explain some chromosomal disorders, among other genomic particularities.
All of these findings help to propose a useful and integrative conceptual non-linear model to discuss and quantify the structural variation and nonlinear organization of the human genome.
Databases, sequences, and multifractal approaches
The Hs_refseq human genome sequence build 36.2 was downloaded from the NCBI web site. Three multifractal approaches were followed in this study: 1) by chromosome fragment, 2) by chromosome, 3) by average of chromosome regions. In the first approach, we tested several fragment sizes of DNA sequence and found that 300 kb was an adequate length to be analyzed. This selection was based on several criteria such as percentage of discarded genome, average gene size, gene family, genetic and multifractal context, and scale independence for chromosome fragment size (data not shown). Nevertheless, other sizes could have been taken into account. Subsequently, the contig order for each chromosome was defined according to the contig files at the 36.2 version and each contig was divided into fragments of 300 kb. That resulted in 9,389 fragments, representing 2,816,700 kb. This is about 98.6% of the whole human genome, discarding about 1.4% of the genome. Another ten chromosome fragments were removed from the analysis because of an excessive number of Ns and lack of annotation, leaving 9,379 chromosome fragments (Additional file 1). By using these fragments, five types of analyses were implemented: analyses of multifractal parameters, analyses of molecular parameters, multifractal map of the human genome, chromosomal location of the most multifractal chromosome fragments, and analyses by gene function, gene family, and gene length. In the second approach, the resulting fragments per chromosome were averaged for multifractality to obtain a measure for each chromosome. In the third approach some chromosomes with some structural particularities were studied. Here, the resulting fragments per chromosome from the first approach were divided into four regions (or 27 for chr. 1) and averaged to evaluate the multifractality of each chromosome region.
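The fragmentation step and its arithmetic can be sketched as follows. This is an illustration only: it assumes that the discarded fraction corresponds to contig tails shorter than one full fragment, which the text does not state explicitly, and the function name and the 1,000 kb contig are hypothetical.

```python
FRAGMENT_KB = 300  # fragment length reported in the text

def split_contig(length_kb, fragment_kb=FRAGMENT_KB):
    """Return (start, end) intervals in kb of full-size fragments.

    Assumption: a tail shorter than one fragment is dropped.
    """
    n_fragments = length_kb // fragment_kb
    return [(i * fragment_kb, (i + 1) * fragment_kb) for i in range(n_fragments)]

# Hypothetical contig of 1,000 kb: three full fragments, 100 kb discarded.
fragments = split_contig(1000)
discarded_kb = 1000 - len(fragments) * FRAGMENT_KB

# Consistency check of the figures reported in the text.
total_kb = 9389 * FRAGMENT_KB      # 9,389 fragments of 300 kb
remaining_fragments = 9389 - 10    # ten fragments removed afterwards

print(fragments, discarded_kb, total_kb, remaining_fragments)
```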
Molecular parameters and chaos game representation
The (G+C) content and the number of Ns were counted for each 300 kb DNA fragment by a script written in Python. Likewise, several molecular parameters were counted from different files: CGI from the seq_cpg_islands.gz file; Alu (Y, S, J), LINEs, MIRs, MERs and LTRs from the seq_repeat.md.gz file; genes from the seq_gene.md.gz file; exons and introns from the gbk.gz file; SNPs from the seq_snp.md.gz file; and the number of gene functions from the rna.q file. All these files were downloaded from NCBI human build 36.2. KEGG (ftp://ftp.genome.jp/pub/kegg/pathway/organisms/hsa/) and KOGs (ftp://ftp.ncbi.nih.gov/pub/COG/KOG/) were also analyzed. As a control, we compared some molecular parameter profiles (G+C and Alus) with those reported in the literature.
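The original counting script is not reproduced here. As a rough sketch of what such a count could look like, the hypothetical function below returns the (G+C) fraction and the number of Ns for one fragment. Computing the fraction over unambiguous bases only (A, C, G, T) is an assumption, not something the text specifies.

```python
def gc_and_n_counts(fragment):
    """Return (G+C fraction over A/C/G/T bases, number of Ns) for one fragment."""
    seq = fragment.upper()
    n_count = seq.count("N")
    gc = seq.count("G") + seq.count("C")
    acgt = gc + seq.count("A") + seq.count("T")
    # Guard against fragments made only of Ns
    gc_fraction = gc / acgt if acgt else 0.0
    return gc_fraction, n_count


# Toy sequence for illustration; real fragments are 300 kb long
gc_fraction, n_count = gc_and_n_counts("ACGTNNGGcc")
```

Returning the N count alongside the (G+C) fraction makes it straightforward to flag fragments with an excessive number of Ns.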
Multifractal analysis and discrimination analyses
In the box-counting estimate of the multifractal dimension Dq, the number Mi of points that fall in the i-th grid box is determined and related to the total number of points M0, and ε is the box size.
For fixed q, ln(Mi^q) is a linear function of ln(ε); therefore Dq can be evaluated as the slope of the fitted relationship between ln(Mi^q) and (q - 1)ln(ε). We used this box-counting method for the multifractal spectrum estimation of CGR points and the corresponding analysis according to [10, 45].
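For reference, the relationship described here matches the standard box-counting form of the generalized dimension. It is written out below with the same symbols (Mi, M0, ε, q), as the textbook definition rather than a quotation of the equations used in the study. The sum runs over the occupied grid boxes.

```latex
% Generalized (Renyi) dimension by box counting, for q != 1
D_q = \frac{1}{q-1}\,\lim_{\varepsilon \to 0}
      \frac{\ln \sum_i \left( M_i / M_0 \right)^q}{\ln \varepsilon},
\qquad q \neq 1

% For small box sizes this gives the linear relation used for the fit:
% the slope of the left-hand side against (q-1) ln(epsilon) is D_q
\ln \sum_i \left( M_i / M_0 \right)^q \;\approx\; (q-1)\, D_q \,\ln \varepsilon
```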
Directly from the multifractal dimension Dq, the correlation exponent τ(q) is derived as τ(q) = (q - 1)Dq. The degree of multifractality (MD), ΔDq, is the difference between the maximum and minimum values of Dq: ΔDq = Dq max - Dq min [17, 46]. When ΔDq is high, the multifractal spectrum is rich in information and highly aperiodic; when ΔDq is small, the resulting dimension spectrum is poor in information and highly periodic. For each chromosome, the number of Alu elements was plotted against the MD per fragment. Discrimination analyses were performed using 2-D and 3-D plots with combined molecular and multifractal parameters.
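The chain from sequence to ΔDq can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the implementation used in the study: the corner layout of the chaos game square, the starting point (0.5, 0.5), the grid levels 1 to 5, and all function names are assumptions. Dq is taken as the least-squares slope of ln Σ(Mi/M0)^q against (q - 1)ln(ε).

```python
import math
from collections import Counter

# Corner assignment is an assumed, commonly used CGR layout
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(seq):
    """Chaos game: each point lies halfway between the previous point and the base's corner."""
    x, y = 0.5, 0.5
    points = []
    for base in seq.upper():
        if base not in CORNERS:
            continue  # skip N and other ambiguity codes
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2.0, (y + cy) / 2.0
        points.append((x, y))
    return points


def log_partition_sum(points, k, q):
    """ln sum_i (M_i/M_0)^q over occupied boxes of a 2^k x 2^k grid (eps = 2^-k)."""
    n = 2 ** k
    boxes = Counter((min(int(x * n), n - 1), min(int(y * n), n - 1)) for x, y in points)
    m0 = len(points)
    return math.log(sum((m / m0) ** q for m in boxes.values()))


def generalized_dimension(points, q, levels=range(1, 6)):
    """D_q as the least-squares slope of ln sum p_i^q against (q - 1) ln(eps)."""
    if q == 1:
        raise ValueError("q = 1 requires the information-dimension limit, not handled here")
    xs = [(q - 1) * math.log(2.0 ** -k) for k in levels]
    ys = [log_partition_sum(points, k, q) for k in levels]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((a - mx) ** 2 for a in xs)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    return sxy / sxx


def multifractality_degree(points, qs):
    """Delta D_q = max D_q - min D_q over the chosen q values."""
    dims = [generalized_dimension(points, q) for q in qs]
    return max(dims) - min(dims)
```

For example, `multifractality_degree(cgr_points(fragment), [-2, 0, 2])` gives a ΔDq estimate for one fragment over three illustrative q values. The range of q and the grid levels both affect the estimate, so they would need to be fixed across fragments for values to be comparable.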
The whole data set and the set of fragments of each chromosome were analyzed by simple and multivariate regressions using the PASW Statistics 18 software to determine the goodness of fit of several molecular parameters versus MD. For the multivariate regression of ΔDq, the data were normalized (values between 0 and 1). For each chromosome we determined the five variables with the highest absolute coefficient values, and the most relevant ones were considered. For some molecular parameters, their RM at a 95% occurrence level was analyzed. To classify the human chromosomes, a clustering analysis was generated using the Hierarchical Clustering Explorer program, version 3.5 (HCE3.5). The clustering tree was generated with the following parameters: row-by-row normalization by control, the complete linkage method, and Pearson's correlation coefficient.
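The normalization and the simple regression of a molecular parameter against MD can be sketched as follows. This does not reproduce the PASW analysis or the multivariate models. The function names are hypothetical, and the per-fragment values are invented purely to show the call pattern; they are not data from the study.

```python
def min_max(values):
    """Rescale values linearly to the interval [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def simple_regression(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b, r_squared)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = my - b * mx
    r2 = (sxy * sxy) / (sxx * syy)  # squared Pearson correlation
    return a, b, r2


# Invented per-fragment values, for illustration only (not study data)
alu_counts = [120, 340, 560, 610, 900]
md_values = [0.62, 0.81, 1.05, 1.10, 1.41]
a, b, r2 = simple_regression(min_max(alu_counts), min_max(md_values))
```

Min-max scaling does not change the coefficient of determination of a simple regression; its use is to make coefficients comparable across variables in the multivariate case.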
Pedro A. Moreno was formerly a graduate student at the University of Houston and is currently an assistant professor at the Universidad del Valle (http://eisc.univalle.edu.co). He teaches courses in bioinformatics, molecular biology, and information technologies. Pedro has participated in several research projects with biologists, mathematicians, and engineers in molecular biology, bioinformatics, metagenomics, and fractal geometry, and is currently an advisor for many students at the University. He pioneered bioinformatics research in Colombia with fractal geometry studies applied to biological problems. Patricia E. Vélez is a professor at Universidad del Cauca and Director of the BIMAC Group (http://bimac.unicauca.edu.co); she has pioneered and led several research projects on breast cancer, human genetics, bioinformatics, and fractal geometry applied to genetics problems. Ember Martínez is a graduate student and professor at Universidad del Cauca. Luis E. Garreta is a doctoral student at Universidad del Valle. Néstor M. Díaz is a graduate student and professor at Universidad del Cauca. Siler Amador is a professor at Universidad del Cauca. Irene Tischer is a professor at Universidad del Valle and Director of the Laboratorio de Bioinformática. José M. Gutiérrez was formerly a post-doctoral associate at Cornell University and is currently a professor at Universidad de Cantabria, Spain. Ashwinikumar K. Naik is a medical doctor and bioinformatician in Bangalore and participated in the human genome sequencing at Celera Genomics. Fabian Tobar is a doctoral student and bioinformatician at the Laboratorio de Bioinformática, Universidad del Valle. Felipe Garcia was formerly a postdoctoral associate at Harvard University. He is currently a professor at the Universidad del Valle and Director of the Laboratory of Molecular Biology and Microbiology.
- CGR: Chaos Game Representation
- MD or ΔDq: degree of multifractality
- RM: range of multifractality
- average of multifractality degree
- low and medium multifractality
- X chromosome inactivation
We gratefully acknowledge the Departamento Administrativo de Ciencia, Tecnología e Innovación - COLCIENCIAS of the Republic of Colombia for supporting this project with Biotechnology Research Grant (#1103-12-16765). Thanks to Dr. Alberto Bohorquez and Dr. Heiber Cárdenas for their comments and suggestions. Finally, thanks to the Universidad del Cauca for the computer facilities and to the Escuela de Ingeniería de Sistemas y Computación at Universidad del Valle for the financial support for publication.
- Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, Devon K, Dewar K, Doyle M, FitzHugh W, Funke R, Gage D, Harris K, Heaford A, Howland J, Kann L, Lehoczky J, LeVine R, McEwan P, McKernan K, Meldrim J, Mesirov JP, Miranda C, Morris W, Naylor J, Raymond C, Rosetti M, Santos R, Sheridan A, Sougnez C, et al: Initial sequencing and analysis of the human genome. Nature. 2001, 409: 860-921. 10.1038/35057062.
- Venter JC, Adams MD, Myers EW, Li PW, Mural RJ, Sutton GG, Smith HO, Yandell M, Evans CA, Holt RA, Gocayne JD, Amanatides P, Ballew RM, Huson DH, Wortman JR, Zhang Q, Kodira CD, Zheng XH, Chen L, Skupski M, Subramanian G, Thomas PD, Zhang J, Gabor Miklos, Nelson C, Broder S, Clark AG, Nadeau J, McKusick VA, Zinder N, et al: The sequence of the human genome. Science. 2001, 291: 1304-51. 10.1126/science.1058040.
- International human genome sequencing consortium: Finishing the euchromatic sequence of the human genome. Nature. 2004, 431: 931-45. 10.1038/nature03001.
- Levy S, Sutton G, Ng PC, Feuk L, Halpern AL, Walenz BP, Axelrod N, Huang J, Kirkness EF, Denisov G, Lin Y, MacDonald JR, Pang AW, Shago M, Stockwell TB, Tsiamouri A, Bafna V, Bansal V, Kravitz SA, Busam DA, Beeson KY, McIntosh TC, Remington KA, Abril JF, Gill J, Borman J, Rogers YH, Frazier ME, Scherer SW, Strausberg RL, et al: The diploid genome sequence of an individual human. PLoS Biol. 2007, 5: e254. 10.1371/journal.pbio.0050254.
- Versteeg R, van Schaik BDC, van Batenburg MF, Roos M, Monajemi R, Caron H, Bussemaker HJ, van Kampen AHC: The human transcriptome map reveals extremes in gene density, intron length, GC content, and repeat pattern for domains of highly and weakly expressed genes. Genome Research. 2003, 13: 1998-2004. 10.1101/gr.1649303.
- DeCerbo J, Carmichael GG: SINEs point to abundant editing in the human genome. Genome Biology. 2005, 6: 216. 10.1186/gb-2005-6-4-216.
- The 1000 genomes project consortium: A map of human genome variation from population-scale sequencing. Nature. 2010, 467: 1061-1073. 10.1038/nature09534.
- Moreno PA, Vélez PE, Burgos JD: Biología molecular, genómica y post-genómica. Pioneros, principios y tecnologías. 2009, Editorial Universidad del Cauca: Popayán
- Restrepo S, Pinzón A, Rodríguez-R LM, Sierra R, Grajales A, Bernal A, Barreto E, Moreno PA, Zambrano MM, Cristancho M, González A, Castro H: Computational biology in Colombia. PLoS Comput Biol. 2009, 5 (10): e1000535. 10.1371/journal.pcbi.1000535.
- Mandelbrot B: La geometría fractal de la naturaleza. 1982, Tusquets editores: Barcelona
- Peitgen H-O, Jürgens H, Saupe D: Chaos and fractals: New frontiers of science. 1992, Springer-Verlag: New York
- Voss R: Evolution of long-range fractal correlations and 1/f noise in DNA base sequences. Phys Rev Lett. 1992, 68: 3805-8. 10.1103/PhysRevLett.68.3805.
- Provata A, Oikonomou T: Power law exponents characterizing human DNA. Phys Rev E Stat Nonlin Soft Matter Phys. 2007, 75: 056102.
- Carpena P, Bernaola-Galván P, Coronado AV, Hackenberg M, Oliver JL: Identifying characteristic scales in the human genome. Phys Rev E. 2007, 75: 032903.
- Stanley HE, Meakin P: Multifractal phenomena in physics and chemistry. Nature. 1988, 335: 405-409. 10.1038/335405a0.
- Meneveau C, Sreenivasan KR: Simple multifractal cascade model for fully developed turbulence. Phys Rev Lett. 1987, 59: 1424-1427. 10.1103/PhysRevLett.59.1424.
- Ivanov PCh, Nunes LA, Goldberger AL, Havlin S, Rosenblum MG, Struzik ZR, Stanley HE: Multifractality in human heartbeat dynamics. Nature. 1999, 399: 461-465. 10.1038/20924.
- Yu ZG, Anh V, Lau KS: Measure representation and multifractal analysis of complete genomes. Phys Rev E Stat Nonlin Soft Matter Phys. 2001, 64: 031903.
- Yu ZG, Anh V, Lau KS: Multifractal and correlation analyses of protein sequences from complete genomes. Phys Rev E Stat Nonlin Soft Matter Phys. 2003, 68: 021913.
- Zhou LQ, Yu ZG, Deng JQ, Anh V, Long SC: A fractal method to distinguish coding and non-coding sequences in a complete genome based on a number sequence representation. J Theor Biol. 2005, 232: 559-567. 10.1016/j.jtbi.2004.09.002.
- Yu ZG, Anh V, Lau KS, Zhou LQ: Clustering of protein structures using hydrophobic free energy and solvent accessibility of proteins. Phys Rev E Stat Nonlin Soft Matter Phys. 2006, 73 (3 Pt 1): 031920.
- Yang JY, Zhou Y, Yu ZG, Anh V, Zhou LQ: Human Pol II promoter recognition based on primary sequences and free energy of dinucleotides. BMC Bioinformatics. 2008, 9: 1-13. 10.1186/1471-2105-9-1.
- Provata A, Katsaloulis P: Hierarchical multifractal representation of symbolic sequences and application to human chromosomes. Phys Rev E Stat Nonlin Soft Matter Phys. 2010, 81: 026102.
- Vélez PE, Garreta LE, Martínez E, Díaz N, Amador S, Tischer I, Gutiérrez JM, Moreno PA: The Caenorhabditis elegans genome: a multifractal analysis. Genet and Mol Res. 2010, 9 (2): 949-965. 10.4238/vol9-2gmr756.
- HUGO. [http://www.hugo.org]
- Genetics home reference. [http://ghr.nlm.nih.gov/geneFamily]
- Bolzer A, Kreth G, Solovei I, Koehler D, Saracoglu K, Fauth C, Müller S, Eils R, Cremer C, Speicher MR, Cremer T: Three dimensional maps of all chromosomes in human male fibroblast nuclei and prometaphase rosettes. PLoS Biology. 2005, 3: e157. 10.1371/journal.pbio.0030157.
- Terrenoire E, McRonald F, Halsall JA, Page P, Illingworth RS, Taylor AMR, Davison V, O'Neill LP, Turner BM: Immunostaining of modified histones defines high-level features of the human metaphase epigenome. Genome Biology. 2010, 11: R110. 10.1186/gb-2010-11-11-r110.
- Roy-Engel AM, Carroll ML, Vogel E, Garber RK, Nguyen SV, Salem A-H, Batzer MA, Deininger PL: Alu Insertion Polymorphisms for the Study of Human Genomic Diversity. Genetics. 2001, 159: 279-290.
- Grover D, Mukerji M, Bhatnagar P, Kannan K, Brahmachari SK: Alu repeat analysis in the complete human genome: trends and variations with respect to genomic composition. Bioinformatics. 2004, 20: 813-817. 10.1093/bioinformatics/bth005.
- Wang J, Song L, Gonder MK, Azrak S, Ray DA, Batzer MA, Tishkoff SA, Liang P: Whole genome computational comparative genomics: A fruitful approach for ascertaining Alu insertion polymorphisms. Gene. 2006, 365: 11-20.
- Myers JS, Vincent BJ, Udall H, Watkins WS, Morrish TA, Kilroy GE, Swergold GD, Henke J, Henke L, Moran JV, Jorde LB, Batzer MA: A Comprehensive Analysis of Recently Integrated Human Ta L1 Elements. Am J Hum Genet. 2002, 71: 312-326. 10.1086/341718.
- Tobar F, Moreno PA, Vélez PE: Highly conserved regions in the 5' region of the human olfactory receptor genes. Genet and Mol Res. 2009, 8: 117-128. 10.4238/vol8-1gmr550.
- Nussbaum RL, McInnes RR, Willard HF: Thompson and Thompson Genetics and Medicine. 2007, USA: Saunders Elsevier
- Ross M, Grafham DV, Coffey AJ, Scherer S, McLay K, Muzny D, Platzer M, Howell GR, Burrows C, Bird CP, Frankish A, Lovell FL, Howe KL, Ashurst JL, Fulton RS, Sudbrak R, Wen G, Jones MC, Hurles ME, Andrews TD, Scott CE, Searle S, Ramser J, Whittaker A, Deadman R, Carter NP, Hunt SE, Chen R, Cree A, Gunaratne P, et al: The DNA sequence of the human X chromosome. Nature. 2005, 434: 251-252. 10.1038/nj7030-251a.
- Nusbaum C, Mikkelsen TS, Zody MC, Asakawa S, Taudin S, Garber M, Kodira CD, Schueler MG, Shimizu A, Whittaker CA, Chang JL, Cuomo CA, Dewar K, FitzGerald MG, Yang X, Allen NR, Anderson S, Asakawa T, Blechschmidt K, Bloom T, Borowsky ML, Butler J, Cook A, Corum B, DeArellano K, DeCaprio D, Dooley KT, Dorris L, Engels R, Glöckner G, et al: DNA sequence and analysis of human chromosome 8. Nature. 2006, 439: 139-143. 10.1038/439139a.
- Krane DE, Raymer ML: Fundamental concepts of bioinformatics. 2003, San Francisco: Benjamin Cummings
- Häsler J, Strub K: Alu elements as regulators of gene expression. Nucl Acids Res. 2006, 34: 5491-5497. 10.1093/nar/gkl706.
- Huda A, Mariño-Ramírez L, Jordan IK: Epigenetic histone modifications of human transposable elements: genome defense versus exaptation. Mobile DNA. 2010, 1: 1-12. 10.1186/1759-8753-1-1.
- Goldberger AL: Non-linear dynamics for clinicians: chaos theory, fractals, and complexity at the bedside. Lancet. 1996, 347: 1312-1314. 10.1016/S0140-6736(96)90948-4.
- Burgos JD, Moreno-Tovar P: Zipf-scaling behavior in the immune system. Biosystems. 1996, 39: 227-232. 10.1016/0303-2647(96)01618-8.
- NCBI. [http://www.ncbi.nlm.nih.gov]
- Jeffrey HJ: Chaos game representation of gene structure. Nucl Acids Res. 1990, 18: 2163-2175. 10.1093/nar/18.8.2163.
- Yu ZG, Anh V, Lau KS: Chaos game representation of protein sequences based on the detailed HP model and their multifractal and correlation analyses. J Theor Biol. 2004, 226: 341-348. 10.1016/j.jtbi.2003.09.009.
- Halsey TC, Jensen MH, Kadanoff LP, Procaccia I, Shraiman BI: Fractal measures and their singularities: The characterization of strange sets. Phys Rev A. 1986, 33: 1141-1151. 10.1103/PhysRevA.33.1141.
- Dewey TG, Strait BJ: Multifractals, encoded walks and the ergodicity of protein sequences. Pac Symp Biocomput. 1996, 216.
- PASW Statistic 18. Chicago. [http://www.spss.com/statistic/]
- Hierarchical Clustering Explorer version 3.5. [http://www.cs.umd.edu/hcil/hce/]
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.