Genomic samples and computation of recent human mutations ("fixed substitutions")
Taking human-chimp (human build 36.1 and chimp build 2 version 1) and human-macaque (macaque build v1 edit4) whole-genome pairwise alignments from the UCSC Genome Browser [21]http://hgdownload.cse.ucsc.edu/downloads.html as input, we generated a Perl script for the identification of the common genomic regions for these three species. The process involved the usage of the human genomic sequence as the reference for the location with the chimp and macaque sequences being extracted only in areas where the sequences of all three species were represented. We then invoked the ClustalW (v1.83) program with default parameters to obtain a whole-genome human-chimpanzee-macaque triple alignment. The obtained alignment is available at our website http://bpg.utoledo.edu/human_chimp_macaque.html. This triple alignment was used to calculate the dataset of recent mutations in humans. We considered a recent substitution at a particular position (for example T → C at position 23456719 on chromosome 7) to be valid if the human genome has a C base while both chimp and macaque have a T base in the corresponding aligned positions. In addition, we required that the quality of the alignment in the vicinity of the mutation be reliable (more than 70% similarity between human and macaque in the 20 bp flanking region [-10, +10]). The frequency table of all inferred recent human mutations is presented in the Additional file 3. We analyzed these recent substitutions together with the SNP datasets and call the former mutations "fixed substitutions," assuming that the majority of them occurred less than 10 million years ago and were already fixed across all human populations. In the same manner we processed indels in the triple alignments and computed all unambiguous cases of human insertions and deletions with sizes from 1 to 49 nucleotides.
Processing of SNP data
Over 4.62 million human SNPs from all chromosomes were obtained (dbSNP build 128 [22], ftp://ftp.ncbi.nih.gov/snp/), filtered for completeness and correctness annotations (676499 records discarded total), and mapped onto the whole-genome human-chimpanzee alignment. SNP allele frequencies were averaged from the frequency data of all populations of that allele. However, only those SNPs that were successfully located within the alignment were processed further. For each SNP site we verified the existence of the particular polymorphic bases in the specified position of the human genome reference sequence and also in the corresponding aligned position on the chimp genomic sequence. If any of these two species had different bases than the SNP alleles, the SNP was discarded (20469 SNPs discarded total).
Otherwise, we defined the origin of the polymorphism based on the chimpanzee nucleotide. Consider the following example to illustrate this process: suppose one has an A/G polymorphism located at position 34567812 of chromosome 5 with an average A allele frequency of 0.6 and a G allele frequency of 0.4. Then at position 34567812 of chromosome 5 of the human genome reference sequence (Genbank build 36.1), we would first examine if the A or G allele is present at that position and discard the SNP if not. Next, using the flanking region of that SNP we could align the chimp genomic sequence. If the chimp nucleotide were T or C then the SNP would also be discarded because those alleles are not a part of the human haplotype at that position. However supposing that the chimp nucleotide were G, then the polymorphism would be declared as a G → A polymorphism with G being declared the ancestral allele that at some point in human evolution mutated into an A allele within some human population(s). From the frequency data we may finally characterize this example SNP more precisely as a 0.4G → 0.6A polymorphism.
Using this approach we successfully characterized 3.93 million human SNPs. This last group of SNPs was divided into four subgroups based on the abundance of the mutant allele in the given human populations:
I. rare polymorphisms with the frequency of the mutated allele being less than 3%;
II. minor polymorphisms with frequencies ranging from 3% to 20%;
III. medium polymorphisms with frequencies going from 20% to 80%; and
IV. major polymorphisms with the frequency being above 80%.
For our method, misidentification of the ancestral allele might arise when the site for the human SNP is also polymorphic in chimp populations (e.g. A/G polymorphism) or for the possible case that this site had a recent substitution in chimps (A → G) after their divergence from humans. Human and chimpanzee genomes only differ by 1.23% due to single nucleotide substitutions with 1.06% being due to fixed substitutions and the rest (0.17%) being due to polymorphisms in human and chimp [20]. Moreover, according to the Chimpanzee Sequencing and Analysis Consortium the average estimated error rate of human alleles being misidentified due to chimp polymorphisms is only ~1.6% across all typical SNPs, which is acceptably low. It is also observed, however, that in the mutational hotspot of the CpG dinucleotide, there is an increased error rate for ancestral misidentification. If the human alleles are C/T in the CpG and TpG context and the chimp allele is T (in the TpG context) then the estimated error rate is actually 9.8% [20]. Thus, in the context of studying our MRI regions, any substitution (especially in GC-rich MRI regions since they contain an overabundance of G and C) going from TpG → CpG could have the ancestral allele misidentified, which would mean that the substitution would actually be CpG → TpG, although in the case of GC-rich MRI regions where such dinucleotides are more likely, an error rate of 9.8% is not sufficient to change the trend or conclusion of our results.
X-rich MRI genomic regions and control regions with average base composition
Any base or combination of bases can be described by a parameter X. For example, X could be G-base; C+T-bases; or A+T+G bases, et cetera. It is also useful to refer to nonX base(s) as all bases not X. Thus, X + nonX must represent all four nucleotides A, G, T, and C. For the examples above, nonX are A+T+C-bases; G+A-bases, and C-base, respectively. MRI is characterized by a specific base composition within a region under analysis. We characterize X-rich MRI regions based on an overabundance of the X base(s) within a region of a certain length (the so-called window), where the percentage of X should be above a certain threshold (Bechtel et al 2008). We calculated MRI regions in the human genome for single nucleotides and various nucleotide combinations using a stretchy window of 100+ nucleotides with the following threshold parameters: for A or T the threshold was 49%; for G or C we used 40%; for G+C it was 70%; for A+T the threshold was at 80%; for G+T, C+A, G+A, and C+T were at 70%; nonA or nonT was 87%; and non G or non C the threshold was 93%. These thresholds were chosen experimentally in such a way that MRI regions should represent about 2% of the whole human genome. A stretchy window of N + nucleotides means that we scan genomic sequence with an N-size window to find a genomic MRI region that fits the threshold criterion, then we extend the window above the detected region by 10 nt steps until the criterion is no longer met. After registering the full MRI region we jump beyond the current MRI region and continue with the default N-size window. Using this approach we characterized all MRI regions in the triple human-chimp-macaque alignments using the human sequence for calculating nucleotide composition and MRI features. We also discarded those MRI regions in the alignments where the indel composition exceeded 50%. For the collection of control regions with average base compositions we used the same stretchy window approach with the nucleotide composition corresponding to the following average genomic frequencies: for A, T between 30 and 31% thresholds; for G, C between 20-21%; for G+C between 40-42%; A+T at 58-60%; G+T, C+A, G+A, or C+T were at 49-51%.
Note that control regions with genome-average X-composition also have genome-averaged nonX-composition. Therefore, their subsitution ratios are in inverse proportion to each other: S
X
= 1/S
nonX
. Due to this only one ratio for X and nonX pair is shown in Figures 1 and 2.
Calculation of the substitution ratios in MRI and control regions
Studying SNPs and fixed substitutions in X-rich MRI regions we measured the number of changes from X to nonX (denoted as N
X→nonX
) and also the number of changes from nonX to X (denoted as N
nonX→X
). The fluctuations in the observed distribution of N
X→nonX
and N
nonX→X
are well-known as Poisson noise. Thus, the standard deviation for the true values for N
X→nonX
and N
nonX→X
is calculated according to v the Poisson distribution, that is: σ
N
=
. For each X-rich MRI region we measured the substitution S
X
-ratios with S
X
= N
X→nonX
/N
nonX→X
shown in Figures 1 and 2. The propagation of uncertainty for a ratio f = A/B can be calculated using the formula (σ
f
/f)2 = (σ
A
/A)2 + (σ
B
/B)2 - 2(σ
A
·σ
B
)/(A·B)·ρ
AB
, where ρ
AB
is the correlation coefficient for A and B variables. Because the observed frequency of having a SNP at a genomic site in humans is less than 1%, it is correct to assume that the correlation between N
X→nonX
and N
nonX→X
is negligible. Therefore, the standard deviation for the S
X
ratio was calculated by the following formula:
Calculation of base composition equilibrium for the observed substitution rates
As described in the previous paragraphs, for studying SNPs and fixed substitutions in X-rich MRI regions of the human genome we measured the number of changes from X to nonX (denoted as N
X→nonX
) and also the number of changes from nonX to X (denoted as N
nonX→X
). These N
X→nonX
and N
nonX→X
helped us to estimate the frequencies of these two types of mutations per X or nonX site, named here as F
X→nonX
and F
nonX→X
, correspondingly. Suppose one has a sample of MRI regions with a total nucleotide sequence length of L and a composition of X with the region richness given as P
X
being measured in numbers from 0 to 1. Then, the total number of X sites in this sample will be L·P
X
, and the total number of nonX sites will be L·(1 - P
X
). During a certain time interval called ΔT there will be ΔN
X→nonX
and ΔN
nonX→X
substitutions. Therefore the frequency of substitutions per site is F
X→nonX
= ΔN
X→nonX
/(ΔT·L·P
X
) and F
nonX→X
= ΔN
nonX→X
/(ΔT·L·(1 - P
X
)). It is impossible to measure directly these ΔN values for a specific time interval of ΔT. However, with a good approximation we can assume that the frequencies are proportional to the observed numbers N
X→nonX
and N
nonX→X
and can be represented by the simple formula: F
X→nonX
= A·N
X→nonX
/P
X
and F
nonX→X
= A·N
nonX→X
/(1 - P
X
), where A is a scaling factor having the same value for F
X→nonX
and F
nonX→X
, since N
X→nonX
and N
nonX→X
are counted from the same sample. In a gedanken experiment, let's assume that the current F
X→nonX
and F
nonX→X
values will stay unchangeable forever for our MRI sample. Then, in time, mutations should alter the base composition of our sample until it reaches an equilibrium composition with a new percentage for X-bases denoted here as Q
X
. This equilibrium composition Q
X
can be computed using the observed parameters of P
X
, N
X→nonX
, and N
nonX→X
. Indeed, under the equilibrium, the number of changes from X to nonX must be equal to the number of reverse changes from nonX to X, or:
We can compute these ΔN
X→nonX
and ΔN
nonX→X
values from frequencies in such a way:
also in a similar way
By putting these transformations into Equation 1 we get:
or
Finally, simple transformation of Equation 2 gives us the final Equation 3 for calculation of equilibrium percentage:
In the Results section, Formula 3 is used to compute the equilibrium percentage for X-bases in the studied MRI regions.