Changes in predicted protein disorder tendency may contribute to disease risk
© Hu et al. licensee BioMed Central Ltd 2011
Published: 23 December 2011
Recent studies suggest that many proteins or regions of proteins lack 3D structure. Defined as intrinsically disordered proteins, these proteins/peptides are functionally important. Recent advances in next generation sequencing technologies enable genome-wide identification of novel nucleotide variations in a specific population or cohort.
Using the exonic single nucleotide variations (SNVs) identified in the 1,000 Genomes Project and distributed by the Genetic Analysis Workshop 17, we systematically analysed the genetic features and predicted disorder potential of the non-synonymous variations. Our results suggest that a significant SNV-induced change in the tendency of a protein region to be structured or disordered may lead to malfunction of the protein and contribute to disease risk.
After validation with functional SNVs on the traits distributed by GAW17, we conclude that it is valuable to consider structure/disorder tendencies while prioritizing and predicting mechanistic effects arising from novel genetic variations.
"Sequence → Structure → Function" is the traditional view that the amino acid sequence determines the structure of a protein molecule and that a definite protein structure is a prerequisite for biological function. This view has been amended by the finding that more and more proteins possess no definite ordered three-dimensional structure but are still involved in key biological processes, including cell cycle and gene regulation, molecular recognition, assembly of complexes, and signalling in general [1, 2]. Indeed, over 33% of eukaryotic proteins contain structure-lacking regions. Such proteins are often called "intrinsically disordered proteins" (IDPs). Several studies have shown a strong correlation between disease-associated proteins and proteins containing significant amounts of intrinsic disorder, leading to the D2 concept of "disorder in disease". Complex diseases such as cancer, neurodegenerative diseases, cardiovascular diseases, and diabetes are often associated with IDPs, likely because errors in signalling and regulation arising from IDPs are important for these disease associations. Among mutations in disordered regions, those that flip a disorder tendency to a structure tendency were found to be the most likely to be disease-causing.
Recent advances in genetic studies have enabled the discovery of many genetic regions linked to or associated with complex diseases, using array-based genotyping technology or, more recently, next generation sequencing technology. Many known or novel single nucleotide variations have been identified, but their potential roles in disease pathogenesis are unknown. Many bioinformatics tools, including FastSNP, Panther, PolyPhen2, SIFT, SNPs3D and SPOT, have been developed; many of them prioritize SNV functions based on their roles in affecting protein structures.
Of particular interest here is that one study reported that 114 out of 122 (93%) single amino acid polymorphisms (SAPs) located in disordered regions are associated with disease. Thus SAPs occurring in disordered regions are highly likely to affect the functions of the proteins and be associated with disease.
In the present study, we systematically evaluate the potential disease risk of SNVs whose resultant amino acid changes (SAPs) can change structure/disorder tendencies, based on the single nucleotide variants derived from the 1,000 Genomes Project and distributed by the Genetic Analysis Workshop 17 (GAW17).
SNV genotypes for 697 individuals were obtained from the sequence alignment files identified in the 1,000 Genomes Project and distributed by the Genetic Analysis Workshop 17 (GAW17). A total of 24,487 exonic SNVs within 3,205 autosomal genes were included in the data, regardless of whether they are synonymous or non-synonymous. In this study, we focus our analysis on the missense non-synonymous variations that can cause single amino acid polymorphisms.
Three quantitative phenotypes (i.e. Q1, Q2, Q4) are generated for each of the unrelated individuals, and a total of 200 simulations are available to us. The disease model of Q1 includes 39 SNPs in 9 genes from VEGF pathway; Q2 is influenced by 72 SNPs in 13 genes related to cardiovascular risk and inflammation; Q4 is not affected by any of the available SNPs. Liability to disease is defined as latent liability+Q1+Q2-Q4, where latent liability is determined by 51 SNPs in 15 genes involved in the VEGF pathway.
Since UniProtKB/Swiss-Prot is a high quality, manually annotated and non-redundant protein sequence database, we chose it as the source of protein sequences. For every gene in the list, we retrieved both the canonical sequence and the isoform data from the UniProtKB/Swiss-Prot database, using the gene symbol in the GAW17 data set as the query. Among the 3,205 genes provided in the data set, amino acid sequences of 2,893 genes were available in UniProtKB/Swiss-Prot and were retrieved for further analysis. Among the 24,487 SNVs provided in the GAW17 data, 18,075 SNVs were mapped to sequences in the UniProtKB/Swiss-Prot database. We excluded synonymous and nonsense mutations from further analysis, since the former cause no change in disorder score and the latter lead to truncation of the protein.
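The filtering step above (keeping missense changes, dropping synonymous and nonsense ones) can be illustrated with a minimal sketch. It assumes the reference and variant codons are already known for each SNV; the function name `classify_snv` and the extra `stop_lost` label are illustrative choices, not part of the original pipeline.

```python
from itertools import product

# Standard genetic code, built from the conventional T, C, A, G ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_snv(ref_codon: str, alt_codon: str) -> str:
    """Label a codon change as synonymous, nonsense, stop_lost, or missense."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        return "synonymous"      # no amino acid change, so no change in disorder score
    if alt_aa == "*":
        return "nonsense"        # premature stop codon truncates the protein
    if ref_aa == "*":
        return "stop_lost"       # hypothetical extra label; not discussed in the text
    return "missense"            # single amino acid polymorphism (SAP)


if __name__ == "__main__":
    print(classify_snv("CCT", "CTT"))  # proline to leucine
```

Only variants labelled `missense` would be carried forward to the disorder analysis under this reading.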
To map the single nucleotide variations onto the appropriate amino acid sequences, we used the BLASTX algorithm downloaded from the National Center for Biotechnology Information. BLASTX translates the query nucleotide sequences in all six possible reading frames and provides combined significance statistics for hits to different frames. The nucleotide sequences from human genome build 36 (hg18) were used to retrieve the nucleotide query for each gene, and the reference sequences were the amino acid sequences from UniProtKB/Swiss-Prot. For ~10% of the genes, one segment of amino acid sequence may be mapped to different parts of the nucleotide coding regions. To ensure the accuracy of the mapping, we further confirmed that paired segments in the query nucleotide sequence and the resultant amino acid sequences are in the same sequential order. We found that this dramatically increased the mapping accuracy.
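The sequential-order check described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: each aligned segment is reduced to a hypothetical `(nucleotide_start, protein_start)` pair, and the gene is assumed to lie on the forward strand so that both coordinates should increase together.

```python
from typing import List, Tuple

# One aligned block: (start in the nucleotide query, start in the protein sequence).
Segment = Tuple[int, int]


def segments_are_collinear(segments: List[Segment]) -> bool:
    """Return True if ordering segments by nucleotide position also orders
    them by protein position, i.e. the paired segments appear in the same
    sequential order in both sequences."""
    ordered = sorted(segments)  # sort by nucleotide start
    protein_starts = [protein_start for _, protein_start in ordered]
    return all(a < b for a, b in zip(protein_starts, protein_starts[1:]))


if __name__ == "__main__":
    consistent = [(100, 1), (900, 45), (2400, 120)]
    shuffled = [(100, 45), (900, 1), (2400, 120)]
    print(segments_are_collinear(consistent), segments_are_collinear(shuffled))
```

A mapping that fails this check would be a candidate for rejection or manual review; for genes on the reverse strand the nucleotide coordinates would need to be flipped first.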
To evaluate the potential of an SNV to alter the disorder or structure tendency, we first predict the disorder score at the SNV position in the specific sequence. In this study, the per-residue disorder predictor PONDR-VSL2 was used. VSL2 is composed of a set of support vector machines trained on datasets containing structured and disordered regions of various lengths. VSL2 provides one disorder prediction score between 0 and 1 for each residue. A score above or below 0.5 indicates that the target amino acid is located in a predicted region of disorder or structure, respectively. Overall, VSL2 achieves close to 80% accuracy and is one of the more accurate disorder predictors currently available. Nevertheless, two sources of uncertainty remain in disorder prediction: model uncertainty and data uncertainty. These have been discussed in a recent study.
The change in disorder score caused by an SNV is then defined as

$$\Delta DS = DS_{min} - DS_{maj}$$

where $DS_{min}$ and $DS_{maj}$ represent the SNV's disorder scores for the minor allele and the major allele, respectively. A positive or negative ΔDS value indicates that the minor allele will be associated with an increased or decreased disorder potential, respectively.
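As a minimal sketch of this calculation, ΔDS can be taken as the minor-allele score minus the major-allele score at the substituted position, which matches the sign convention stated above. The predictor is passed in as a callable because no particular VSL2 programming interface is assumed here; `toy_predict` is a deliberately crude stand-in (a windowed fraction of residues often described as disorder-promoting) and is not PONDR-VSL2.

```python
from typing import Callable, List

# Any function mapping a protein sequence to one score in [0, 1] per residue.
Predictor = Callable[[str], List[float]]

# Toy assumption for illustration only: residues treated as disorder-promoting.
DISORDER_PROMOTING = set("AEGKPQRS")


def toy_predict(sequence: str, half_window: int = 2) -> List[float]:
    """Stand-in predictor: fraction of 'disorder-promoting' residues in a window."""
    scores = []
    for i in range(len(sequence)):
        window = sequence[max(0, i - half_window): i + half_window + 1]
        scores.append(sum(aa in DISORDER_PROMOTING for aa in window) / len(window))
    return scores


def delta_ds(sequence: str, position: int, minor_aa: str, predict: Predictor) -> float:
    """Return DS_min - DS_maj at a 0-based position.

    `sequence` carries the major allele; the minor-allele sequence is built by
    substituting `minor_aa` at `position`, and both are scored in full so that
    the surrounding sequence context is taken into account."""
    mutant = sequence[:position] + minor_aa + sequence[position + 1:]
    ds_maj = predict(sequence)[position]
    ds_min = predict(mutant)[position]
    return ds_min - ds_maj


if __name__ == "__main__":
    # W -> S at the middle of a short all-tryptophan peptide.
    print(delta_ds("WWWWW", 2, "S", toy_predict))
```

Scoring both full-length sequences, rather than looking up a per-residue constant, is what makes the result depend on sequence context.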
The hypothesis that we want to test is that SNVs that cause changes in the structure/disorder tendency are deleterious. Destabilization of a structured protein domain by a mutation is often harmful. A positive ΔDS in a structured protein corresponds to this case. As indicated above, 114/122 mutations in disordered regions were associated with disease, so we also need to consider the possibility that SNVs causing negative ΔDS might be deleterious. Of course, for a mutation it would likely be the change in the structure ↔ disorder tendency, ΔDS, that is important, not necessarily the absolute value of the score.
Using the VSL2 software, we evaluated how a given amino acid change alters the potential for protein disorder by calculating the ΔDS as described above. The output of the VSL2 software depends not only on the amino acid at a given position but also on the amino acids surrounding that position. Thus, a particular ΔDS value will depend both on the amino acid change and on the sequence context of that change. Here we compare two scenarios: when the same amino acid substitution occurs in two isoforms of related proteins and when the same amino acid substitution occurs in unrelated proteins.
To determine context-dependence for different isoforms, we identified SNVs in 1,170 genes that have isoform records in the UniProtKB/Swiss-Prot database. In total, 1,082 SNVs were found to potentially exist in 3,229 different isoforms. Pairwise comparison of all of the related isoforms yields a distribution showing how much each ΔDS value changes as the sequence context changes in a different isoform. This distribution shows a very strong peak very close to a shift of 0.0003 in the ΔDS value followed by an extended tail, giving an average of ~0.01 for the data. Such a small context-dependent change for most of the sequences likely results from the very similar amino acid sequences of the different isoforms. Given such a small context-dependent change, it is not surprising that only 0.2% of these amino acid substitutions lead to a change in the sign of the ΔDS value.
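The pairwise isoform comparison can be sketched as below. The input layout (a dictionary from a hypothetical SNV identifier to the ΔDS values computed in each isoform) is an assumption, and counting a sign change per SNV rather than per isoform pair is one possible reading of the 0.2% figure.

```python
from itertools import combinations
from statistics import mean
from typing import Dict, List, Tuple


def isoform_context_shifts(delta_by_snv: Dict[str, List[float]]) -> Tuple[List[float], float]:
    """Return (pairwise |ΔDS difference| values, fraction of SNVs whose ΔDS
    takes both a positive and a negative value across isoforms)."""
    shifts: List[float] = []
    flipped = 0
    for values in delta_by_snv.values():
        # Every unordered pair of isoforms carrying the same SNV.
        for a, b in combinations(values, 2):
            shifts.append(abs(a - b))
        if values and min(values) < 0 < max(values):
            flipped += 1
    flip_fraction = flipped / len(delta_by_snv) if delta_by_snv else 0.0
    return shifts, flip_fraction


if __name__ == "__main__":
    example = {
        "snv_a": [0.050, 0.051],         # nearly identical context
        "snv_b": [0.010, -0.010, 0.020]  # sign changes between isoforms
    }
    shifts, flip_fraction = isoform_context_shifts(example)
    print(mean(shifts), flip_fraction)
```

The list of shifts is what would be plotted as the distribution; its mean corresponds to the average context-dependent shift quoted in the text.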
Rank of top 10 amino acid (AA) changes with + mean ΔDS (entries recovered: W → S, V → D)
Rank of top 10 amino acid (AA) changes with - mean ΔDS
Rank of top 10 codon changes with + mean ΔDS
Rank of top 10 codon changes with - mean ΔDS
As expected from the large standard deviations in Tables 1 and 2, pairwise comparisons of the same amino acid changes in different sequences give a very broad distribution. This distribution has a very weak peak near 0.006 and a mean value of ~0.036. Large sequence differences between unrelated proteins account for these much larger context-dependent changes in the ΔDS values for a particular type of amino acid change, compared with the much smaller changes observed for the protein isoform data discussed above.
To illustrate the disorder/structure potential of SNVs, we focused our analysis on 10,254 missense non-synonymous SNVs, with 4,345 and 5,909 showing positive and negative ΔDS, respectively.
Small values of ΔDS are expected to be less important, so partitioning the data into subsets having larger and smaller ΔDS values should be helpful. Examination of Figure 2 suggests that a threshold of |ΔDS| ≥ 0.04 would eliminate most of the small peaks in the figure. This value is also slightly larger than the mean value of ~ 0.036 for the context-dependent shifts of the ΔDS distribution as mentioned above. So, for both of these reasons, 0.04 was chosen as a threshold for our initial studies.
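The sign counts and the threshold-based partition can be sketched in a few lines. The function name and the returned keys are illustrative; the default of 0.04 follows the threshold chosen above, and a value exactly at the threshold is counted as passing, which is an assumption.

```python
from typing import Dict, Iterable


def summarize_delta_ds(values: Iterable[float], threshold: float = 0.04) -> Dict[str, int]:
    """Count ΔDS values by sign and by whether |ΔDS| reaches the threshold."""
    counts = {"positive": 0, "negative": 0, "zero": 0, "at_or_above_threshold": 0}
    for d in values:
        if d > 0:
            counts["positive"] += 1
        elif d < 0:
            counts["negative"] += 1
        else:
            counts["zero"] += 1
        if abs(d) >= threshold:
            counts["at_or_above_threshold"] += 1
    return counts


if __name__ == "__main__":
    print(summarize_delta_ds([0.05, -0.01, 0.0, -0.2, 0.04]))
```

Re-running the summary with different `threshold` values is a simple way to see how sensitive the retained subset is to the cut-off.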
Number of SNVs in synonymous/non-synonymous region (columns: |ΔDS| > 0.04; |ΔDS| > 0)
Summary of results on GAW17 simulated data (columns: # of genes; # of SNVs with |ΔDS| > 0.04; # of SNVs with |ΔDS| > 0.04 in the answer sheet; # of SNVs in the answer sheet)
SNVs with |ΔDS| > 0.04 in Q1, Q2 and disease liability (columns include Major Allele Score; amino acid changes listed: P → L, V → E, S → F, S → L, M → R, P → L)
In a previous study of deleterious mutations in structured proteins, energy calculations on 3-D structures showed that the deleterious mutations were those that destabilized the structure. A + ΔDS for a region of protein predicted to be structured (e.g. with a prediction score < 0.5) would correspond to the potential destabilization of that region's 3-D structure. Interestingly, 6/7 of the examples with + ΔDS values have scores < 0.5, with the remaining example having a score of 0.51. It would be interesting to try this approach on real data to determine whether it could substitute for the previously described method for identifying deleterious mutations in structured proteins. The advantage here is that the 3-D structures would not be needed.
Of the 15 examples, 8 involved - ΔDS values. Five of these corresponded to regions likely to be structured (e.g. score < 0.5), so from the energetic point of view described above it is not clear how these changes could affect structure. On the other hand, the observed changes (P→L, D→Y and S→L) could certainly lead to protein malfunction.
The last three examples involved amino acid changes that would increase the structural tendency within a region of disorder. Interestingly, recent work suggests that mutations causing disorder tendencies to change to structure tendencies occur much more often than the reverse [5, 19]. Furthermore, a very high fraction of disease-associated mutations mapping to regions of disorder exhibit such tendencies, while a significant fraction of these mutations are also associated with the loss of sites of posttranslational modification. Note that posttranslational modifications very commonly occur in disordered regions, perhaps much more often than in ordered regions, especially those modifications involving phosphorylation [21, 22].
Also note the earlier work cited above indicating that 114/122 mutations in disordered regions proved to be deleterious. Thus, it would be worthwhile to obtain these data and determine the disorder prediction threshold that best identifies these harmful mutations. Such a new result could then be included in any future work.
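The cases discussed above (a shift toward disorder in a predicted structured region, a shift toward structure in a structured region, and a shift toward structure in a disordered region) can be separated with a small helper. This is a sketch under assumptions: the 0.5 cut-off follows the predictor convention described earlier, a score of exactly 0.5 is treated as disordered, and the labels are illustrative.

```python
def classify_shift(major_score: float, delta_ds: float, cutoff: float = 0.5) -> str:
    """Describe an SNV by its predicted region type and the direction of ΔDS.

    major_score: disorder score of the major allele at the SNV position.
    delta_ds:    minor-allele score minus major-allele score.
    """
    # Assumption: a score exactly at the cut-off counts as disordered.
    region = "disordered" if major_score >= cutoff else "structured"
    if delta_ds > 0:
        direction = "toward disorder"
    elif delta_ds < 0:
        direction = "toward structure"
    else:
        direction = "no change"
    return f"{region} region, {direction}"


if __name__ == "__main__":
    print(classify_shift(0.30, 0.08))   # potential destabilization of structure
    print(classify_shift(0.70, -0.05))  # disorder-to-structure tendency
```

Grouping SNVs by these labels makes it easy to tabulate how many examples fall into each case.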
The use of mutation-induced changes in disorder prediction scores, called ΔDS, has been studied here. The overall idea is that a significant change in the tendency of a protein region to be structured or disordered could lead to malfunction of such a protein. The initial findings on the data provided by GAW17 give insight with regard to the directions to explore on real data. For example, with real data one could explore whether some particular threshold for ΔDS would give significant separation of harmless versus harmful mutations. If the data of reference  turns out to be a general finding, or even if only applicable to certain diseases, this observation, the discovery of which came out of this work, certainly points towards important new directions to try.
This study has some limitations. A single SNV may not change the disorder properties very much, since changes in the disorder score depend more on changes across a segment than on an individual mutation. In further work, we will therefore investigate combinations or patterns of SNVs near positions with high ΔDS.
The authors thank Dr. Iakoucheva at the University of California San Diego for valuable discussion and suggestions. This work was supported in part by the grants R01 LM007688 (to AKD), GM071714 (to AKD), R21 AA017941 (to YL) and P01 AG018397 (to YL) from the US National Institutes of Health, and the grant EF 0849803 (to AKD) from the US National Science Foundation. The Genetic Analysis Workshops are supported by NIH grant R01 GM031575. Preparation of the Genetic Analysis Workshop 17 Simulated Exome Data Set was supported in part by NIH grant R01 MH059490 and used sequencing data from the 1000 Genomes Project (http://www.1000genomes.org).
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.