- Open Access
BASILIScan: a tool for high-throughput analysis of intrinsic disorder patterns in homologous proteins
BMC Genomicsvolume 19, Article number: 902 (2018)
Intrinsic structural disorder is a common property of many proteins, especially in eukaryotic and virus proteomes. The tendency of some proteins or protein regions to exist in a disordered state usually precludes their structural characterisation and renders them especially difficult for experimental handling after recombinant expression.
A new intuitive, publicly-available computational resource, called BASILIScan, is presented here. It provides a BLAST-based search for close homologues of the protein of interest, integrated with a simultaneous prediction of intrinsic disorder together with a robust data viewer and interpreter. This allows for a quick, high-throughput screening, scoring and selection of closely-related yet highly structured homologues of the protein of interest. Comparative parallel analysis of the conservation of extended regions of disorder in multiple sequences is also offered. The use of BASILIScan and its capacity for yielding biologically applicable predictions is demonstrated. Using a high-throughput BASILIScan screen it is also shown that a large proportion of the human proteome displays homologous sequences of superior intrinsic structural order in many related species.
Through the swift identification of intrinsically stable homologues and poorly conserved disordered regions by the BASILIScan software, the chances of successful recombinant protein expression and compatibility with downstream applications such as crystallisation can be greatly increased.
The incidence of intrinsic disorder – defined as the lack of a fixed three-dimensional conformation of a protein - has become increasingly appreciated and examined in recent years. Many such proteins have been characterised, and their disorder, as well as order-disorder transitions upon ligand and partner binding demonstrated with a range of biophysical techniques . It is now widely acknowledged that protein disorder is a remarkably common phenomenon – especially in complex organisms and viruses. An estimated 30–40% of the human proteome is disordered to a significant degree . This has far-reaching consequences for the structural characterisation and experimental handling of many proteins from the human and other disorder-enriched proteomes.
Depending on the extent and location of flexibility, the intrinsically-disordered protein (IDP) can exhibit a number of different behaviours. Although the highly-charged nature of disordered regions may confer high solubility , presence of proteases in IDP preparations often causes severe proteolytic degradation in the case of fully-disordered IDPs or digestion of long connecting loops within and between otherwise structured domains [4, 5]. Ultimately, biochemical and biophysical characterisation of such proteins is usually difficult because of challenging protein expression and purification and lack of sample homogeneity. Crystallisation of IDPs for X-ray crystallography is only feasible with the flexible regions removed, bound to a co-factor, or entropically-stabilised otherwise. Multidimensional nuclear magnetic resonance (NMR) spectroscopy remains the only high-resolution biophysical technique for studying some disordered protein systems, although the sample has to meet stringent compatibility standards (small and globular proteins, homogeneity, stability under low ionic strength for long periods of time) .
Since IDPs display characteristic patterns of amino acid content and distribution, the presence of disordered regions can be predicted from their primary sequence with high confidence. Many predictive algorithms have been devised and used successfully to predict the probability of residues in a given protein sequence to exist in an ordered or disordered state [7,8,9,10,11]. Such predictions have been repeatedly confirmed experimentally with multi-dimensional NMR [12, 13]. It is currently becoming common practice to take the intrinsic disorder predictions into account while designing a protein construct for recombinant expression. The N- or C-terminal disordered tails/domains can be truncated (for example: [14, 15]), while long loops connecting neighbouring domains or elements of secondary structure can be shortened to aid conformational stability . Unfortunately, the thin line between limiting disorder and affecting the function and correct folding of the protein can easily be crossed and the experimental trial-and-error process often takes long before an improved construct is found.
The software presented here, called BASILIScan, offers an alternative approach to streamline and simplify the construct design process. The core mechanism relies on a BLASTP search with the user’s amino acid sequence linked with simultaneous intrinsic disorder prediction of all hits. A specialised scoring system, called the FLEX score, is then employed to identify closest homologues exhibiting lowest disorder content. A hit possessing a FLEX score superior to the submitted protein is very likely to also show improved in vitro behaviour. Analytical features of BASILIScan, such as multiple sequence disorder overlay and alignment, allow for analysis of disorder conservation patterns between selected homologous hits in parallel. In addition to identifying a more suitable homologue, such analysis can guide rational construct design by exposing intrinsically-disordered regions of low or limited conservation which could be truncated from the expression construct. The adjustability of numerous search and analysis parameters provides compatibility with protein sequences of a wide range of intrinsic disorder, length and inter-species sequence variation.
The main framework of the BASILIScan software is centred upon the connection of the modules outlined in Fig. 1, written in Python 2.7. Where possible, Biopython (ver. 1.67)  libraries or their derivatives are used. In order to promote the multi-disciplinary use of BASILIScan, a graphical user interface (GUI) has been created in Tkinter ver. 8.5.9 for Python.
Sequence-based similarity search
After the plain unformatted amino acid sequence is provided, a BLASTP  search is conducted against the selected sequence database (UniprotKB/Swissprot being the default and recommended one ) and the results displayed according to the filters set by the user – such as an expect value (E-value) threshold or a maximum number of hits. The Bio.Blast functionality of Biopython is used to run on-line BLAST. The .xml output file is parsed with the NCBIXML function in Biopython.
Users may perform searches on custom, remote, FASTA-formatted libraries instead, by selecting the library file through the “Advanced properties/Select database” option. Remote search on the selected library is then performed by BLAST+ ver 2.7.1, including conversion of the library by makeblastdb. Sequence identifiers used in remote libraries should be Uniprot-derived in order for all BASILIScan modules to work correctly.
Handling of viral polyproteins
In the case of the resulting sequences constituting a viral polyprotein, BASILIScan will perform virtual processing of the polyprotein and will apply all its analysis and metrics tools to the appropriate fragment only. This functionality is only offered when UniprotKB/Swissprot database is selected for search, due to lack of proteolytic processing information in non-manually curated databases. This option can also be disabled in “Edit/Advanced preferences”.
Prediction of intrinsic disorder
For BLAST hits fulfilling the criteria set by the user, intrinsic disorder is calculated by the IUPRED algorithm , which has been used extensively in other publications for predicting intrinsic disorder in silico [20,21,22,23]. Two alternative IUPRED modes are available: “long disorder” or “short disorder”. For most applications, the “long disorder” mode is suggested. The intrinsic disorder score (IDS) of each entry is then calculated in the following way:
where l is the length of the protein and s is the residue’s IUPRED score.
FLEX score computation
Since identification of a homologue with superior intrinsic disorder properties requires scoring at least two parameters simultaneously, the hybrid FLEX score has been implemented. The FLEX score incorporates a weighted average of the intrinsic order and a hyperbolic transform of the E-value parameter in the following fashion:
The weight is determined by the FLEX coefficient (η), which is set by the user before the homology search is run. Allowed values are between 0 and 1 and will shift the contribution ratio of intrinsic structural order (1 - IDS) to E-value transform. The hyperbolic transform of the E-value is meant to converge the extreme low-end E-value range while resolving the high-end and middle ranges. Consequently, the function of the logarithm of the E-value is sigmoidal, and bound from 0 to 1. It is characterised by a near-linear relationship between arguments corresponding to E-values of 10− 3 and 10− 14, while either tail approaches 0 or 1, respectively (Additional file 1: Figure S1).
Visualisation of results
Results of a sequence query are presented in a table, for each hit showing the UniProt identifier, the GeneID, the expect value (E), sequence identity, similarity, the IDS score and the FLEX score. By default, the result hits are sorted from the lowest to the highest E-value. Sorting priority can be adjusted at any time from the main menu. The right-hand-side menu allows for more in-depth analysis of results. The ‘View’ option acquires the most important parameters of the selected item from the UniProt repository, displaying information such as sequence length, molecular mass and organism taxonomy.
The “Details” button draws a detailed trace of intrinsic disorder of the selected record within an interactive two-coordinate environment, implemented with Matplotlib. The environment allows for enlargement of selected parts of the trace, as well as for its translation. The option of exporting the graph as an image file is also provided. Traces can be overlayed on top of each other and therefore the intrinsic disorder can be explicitly compared between multiple protein records simultaneously.
Importantly, if the “enable disorder trace alignment” setting is switched on, the multiple disorder traces for the selected protein records will be automatically aligned on the axis, according to a multiple sequence alignment conducted in CLUSTALW . Default CLUSTALW parameters can be adjusted in Edit/Advanced properties. Any gaps inserted through the alignment algorithm will be visible in the aligned traces as residues with the IUPRED disorder score of 0.0 – an extremely unlikely occurrence for a protein residue otherwise.
Windows and OSX binary distributions of BASILIScan were packaged with Py2exe and Py2app, respectively. Packages for both platforms are freely available for academic use under the GNU distribution license and can be downloaded at www.basilisc.com/downloads. Open-source version is also available. Please consult the ReadMe file for further instructions on installation and running of BASILIScan, as well as for the dependencies required to run the open-source version (www.basilisc.com/readme/).
Case scenario: Human CDC7 protein kinase
Every BASILIScan job starts with providing a raw amino acid sequence and a job title. Next, the IUPRED mode is selected. Unless one is looking for very short stretches of disorder, or within very short protein sequences, the “long disorder” mode should be selected. The “E-value threshold” indicates the upper limit of the BLAST E-value, above which any hits will be ignored. The number of homology hits found can also be capped. Lastly, the “FLEX score priority coefficient” has to be set. This parameter will not influence which results are shown, but is meant to adjust the sensitivity of the FLEX score by shifting the contribution of homology versus structural order towards the final score. This means, for instance, that when the “priority coefficient” is set to 100% (all the way towards structural order), the FLEX score will only reflect the intrinsic order (1-IDS) of a given hit and will ignore the homology score component. The adjustment of the “priority coefficient” is particularly useful in cases where the BASILIScan run yields many hits containing one or more subpopulations clustered around particular values of IDS or E-value. The default priority coefficient value is 50%, in which case both terms will be taken into account equally.
To showcase the functioning of BASILIScan, the human cell division cycle protein kinase 7 (hCDC7, Uniprot identifier: O00311) was chosen as the sequence of interest. The protein is a 63.9 kDa serine/threonine protein kinase, which is an essential S-phase kinase implicated in DNA replication control and cell proliferation . The choice of hCDC7 kinase was dictated by availability of subject literature and structural information, as well as its functional conservation across many species. The 574-amino acid sequence of hCDC7 was submitted to BASILIScan for search in “long disorder” IUPRED mode, with an E-value threshold of 10− 10, maximum number of hits of 100 and a FLEX score priority coefficient of 50%. Figure 2 and Additional file 1: Table S1 show, respectively, the way results are displayed by the BASILIScan GUI and a full results dataset from the above search.
The BASILIScan search yielded 18 hits, including the search query entry which would usually appear as the first row in the results window and always returns the E-value of 0. It is immediately apparent that although many homologues with low E-values were found, their intrinsic disorder varies considerably (from 0 to 72%). Human CDC7 kinase shows an IDS score of about 14%, with many residues oscillating around the 0.5 disorder threshold. This can be seen by using the module “Trace” which graphs the disorder scores for every residue in a given sequence (Fig. 3a). Indeed, lowering the IUPRED score calculation threshold from the default 0.5 to 0.4, leads to a dramatic increase in IDS of hCDC7 up to 35%. The FLEX score, being a weighted average of the E-value transform and the IDS score, should in most cases serve as the easiest measure of identifying suitable hits. The most promising hit returned by this BASILIScan search, possessing the highest FLEX score, is a probable CDC7 kinase homologue from a fungus Encephalitozoon cuniculi. In addition, a remote search with BASILIScan against a TrEMBL database encompassing all available vertebrate proteomes has also found numerous evolutionarily-closer homologous sequences with IDS and FLEX scores superior to the search query statistics (Additional file 2).
Overlaying the calculated intrinsic disorder trace of the submitted sequence (O00311) on top of the BASILIScan hit of the highest FLEX score (Q8SR85), clearly shows the conservation of some intrinsic disorder patterns (residues 363–390) and confirms that there are regions of diminished intrinsic disorder (e.g. residues 420–446, 532–574) as well as significant deletions of particularly disordered segments (residues 1–40, 202–331) in the fungal and vertebrate homologues (Fig. 3a-c). The “align” feature can also be used to perform a multiple sequence alignment on multiple selected hits and overlay the calculated intrinsic disorder scores for each residue in the alignment as a heatmap, in order to compare the conservation of disorder between many sequences at once (Additional file 1: Figure S2).
The crystal structure of human CDC7 kinase has been reported . Crystallisation was achieved through identification of disordered regions by limited proteolysis and NMR spectroscopy, with their subsequent deletion. The crystallised protein encompasses the three ordered regions predicted by BASILIScan (highlighted in cyan, purple and green in Fig. 3c-d).
The parametrisation offered by BASILIScan, in particular the “priority coefficient”, allows for a quick assessment of the results, targeted to one’s needs. This is especially useful when many result hits are returned. Resubmitting the above CDC7 kinase query with the E-value threshold lowered to − 5, the maximum set number of 100 hits are obtained (Table 1). Since many homologues with a low E-value are returned, it is easier to assess the results by shifting the priority coefficient towards intrinsic stability (i.e. higher value). In this instance, there will be many more hits with low IDS scores among the top FLEX score results. Changing the priority coefficient in the opposite direction will have an opposite effect.
Occurrence of structured homologues of human proteins in other species
In order to learn about the scope of BASILIScan applicability in finding structured homologues of disorder-enriched proteins, every sequence of the human proteome (all 20,243 sequences available in UniprotKB/Swissprot) was subjected to an individual BASILIScan search with the same parameters as used for hCDC7 above. All human sequences were removed from the results database. This high-throughput search resulted in disorder prediction and scoring of nearly 100,000 protein sequences, and showed that a surprisingly large extent of the human proteome has homologues of superior intrinsic stability in other species. BASILIScan identified at least one homologue exhibiting a FLEX score higher than that of the query sequence for the vast majority of the human proteome (14,865 sequences or 78%) and ten or more such homologues for over 30% (Fig. 4a).
Importantly, in just under half of hits scored as promising – possessing FLEX scores higher than the query – the improvement of intrinsic disorder represented by the difference in IDS scores was very significant. The IDS score improvement of 10% or more was found in 47% of such BASILIScan-identified hits, while 27% of hits showed at least 20% improvement (Fig. 4b). The top hits resulting from the BASILIScan search belonged to 631 unique species, but by far the highest occurrence was displayed by mouse homologues (Fig. 4c). The improvement in IDS score appeared not to correlate with the hit frequency of occurrence in a given species.
As structural biology is evolving from tackling thermodynamically-stable, well-folded protein domains into much larger, multimeric and multicomponent protein systems, the need for rapid assessment and minimisation of intrinsic disorder becomes increasingly important. Software developed to assist recombinant expression construct design for predicting the regions of disorder (such as GlobPlot , or IUPRED ), maximising protein stability and crystallisibility through computing of parameters sourced from previously-solved crystal structures (XtalPred, ) and surface entropy reduction (SERp, ) have been used extensively. However, to the author’s knowledge, user-friendly and widely-accessible tools for high-throughput comparative analysis of intrinsic disorder patterns in related proteins have not been developed and explored for the purpose of recombinant expression construct design. This is tackled by the novel homologue search and scoring system offered by BASILIScan, with the disorder predictions based on the widely-adopted IUPRED algorithm. The key advantage of BASILIScan over the pre-exisiting software is the parallel analysis of disorder predictions for multiple related proteins at a time: both globally (the “trace” function) and residue-by-residue, as a heatmap (the “align” function). When the aim is to identify regions of disorder, such alignments can provide information on the conservation of the putatively-disordered region – and hence, how likely it is that the removal of such a fragment will not affect the structure or function of the protein.
The functionalities implemented in BASILIScan serve two main experimental purposes: identification of homologues with diminished intrinsic disorder; and recognition of disordered regions suitable for deletion due to limited conservation. For both applications, the accessible user interface and the adjustable parametrisation are invaluable for identification of the most promising candidates and regions. Furthermore, as demonstrated by the high-throughput BASILIScan screen of the human proteome, identification of a more intrinsically-stable homologue of an IDP is likely to be feasible in most cases. Therefore, BASILIScan should be a valuable and easily-accessible resource for more streamlined rational expression construct design approaches and study of IDPs.
Basic local alignment search tool
Cell division cycle protein kinase 7
Graphical user interface
Intrinsic disorder score
Nuclear magnetic resonance
Uversky VN. Dancing protein clouds: The strange biology and chaotic physics of intrinsically disordered proteins. J Biol Chem. 2016;291(13):6681–8.
Pentony MM, Jones DT. Modularity of intrinsic disorder in the human proteome. Proteins. 2010;78(1):212–21.
Tokmakov AA, et al. Content of intrinsic disorder influences the outcome of cell-free protein synthesis. Sci Rep. 2015;5:14079.
Paz A, et al. Biophysical characterization of the unstructured cytoplasmic domain of the human neuronal adhesion protein neuroligin 3. Biophys J. 2008;95(4):1928–44.
Hinds MG, et al. Bim, Bad and Bmf: intrinsically unstructured BH3-only proteins that undergo a localized conformational change upon binding to prosurvival Bcl-2 targets. Cell Death Differ. 2007;14(1):128–36.
Uversky VN. A decade and a half of protein intrinsic disorder: biology still waits for physics. Protein Sci. 2013;22(6):693–724.
Li X, Romero P, Rani M, Dunker AK, Obradovic Z. Predicting protein disorder for N-, C-,and Internal Regions. Genome Inform Ser Workshop Genome Inform. 1999;10:30–40.
Prilusky J, et al. FoldIndex: a simple tool to predict whether a given protein sequence is intrinsically unfolded. Bioinformatics. 2005;21(16):3435–8.
Linding R, Russell RB, Neduva V, Gibson TJ. GlobPlot: exploring protein sequences for globularity and disorder. Nucleic Acids Res. 2003;31(13):3701–8.
Jones DT, Ward JJ. Prediction of disordered regions in proteins from position specific score matrices. Proteins. 2003;53(Suppl 6):573–8.
Dosztanyi Z, Csizmok V, Tompa P, Simon I. IUPred: web server for the prediction of intrinsically unstructured regions of proteins based on estimated energy content. Bioinformatics. 2005;21(16):3433–4.
Haynes C, et al. Intrinsic disorder is a common feature of hub proteins from four eukaryotic interactomes. PLoS Comput Biol. 2006;2(8):e100.
Habchi J, Longhi S. Structural disorder within Paramyxoviral nucleoproteins and phosphoproteins in their free and bound forms: from predictions to experimental assessment. Int J Mol Sci. 2015;16(7):15688–726.
Cho Y, Gorina S, Jeffrey PD, Pavletich NP. Crystal structure of a p53 tumor suppressor-DNA complex: understanding tumorigenic mutations. Science. 1994;265(5170):346–55.
Barski M, et al. Rift Valley fever phlebovirus NSs protein core domain structure suggests molecular basis for nuclear filaments. Elife. 2017;6.
Nagi AD, Regan L. An inverse correlation between loop length and stability in a four-helix-bundle protein. Fold Des. 1997;2(1):67–75.
Cock PJ, et al. Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics. 2009;25(11):1422–3.
Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol. 1990;215(3):403–10.
The UniProt C. UniProt: the universal protein knowledgebase. Nucleic Acids Res. 2017;45(D1):D158–69.
Pushker R, Mooney C, Davey NE, Jacque JM, Shields DC. Marked variability in the extent of protein disorder within and between viral families. PLoS One. 2013;8(4):e60724.
Pajkos M, Meszaros B, Simon I, Dosztanyi Z. Is there a biological cost of protein disorder? Analysis of cancer-associated mutations. Mol BioSyst. 2012;8(1):296–307.
Meszaros B, Dosztanyi Z, Simon I. Disordered binding regions and linear motifs--bridging the gap between two models of molecular recognition. PLoS One. 2012;7(10):e46829.
Edwards RJ, Davey NE, Shields DC. SLiMFinder: a probabilistic method for identifying over-represented, convergently evolved, short linear motifs in proteins. PLoS One. 2007;2(10):e967.
Thompson JD, Higgins DG, Gibson TJ. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 1994;22(22):4673–80.
Labib K. How do Cdc7 and cyclin-dependent kinases trigger the initiation of chromosome replication in eukaryotic cells? Genes Dev. 2010;24(12):1208–19.
Hughes S, et al. Crystal structure of human CDC7 kinase in complex with its activator DBF4. Nat Struct Mol Biol. 2012;19(11):1101–7.
Slabinski L, et al. XtalPred: a web server for prediction of protein crystallizability. Bioinformatics. 2007;23(24):3403–5.
Goldschmidt L, Cooper DR, Derewenda ZS, Eisenberg D. Toward rational protein crystallization: a web server for the design of crystallizable protein variants. Protein Sci. 2007;16(8):1569–76.
I would like to acknowledge Dr. Goedele Maertens as well as Mr. Michal Kosicki, Dr. George Gerogiokas and Dr. Robert White for fruitful discussions, suggestions, a critical review of the manuscript and testing of the initial versions of BASILIScan.
Availability of data and materials
Project name: BASILIScan.
Project home page: http://www.basilisc.com
Archived version: 1.4
Operating system: OSX, Windows
Programming language: Python 2.7
Other requirements: none for binary; multiple dependencies for the source package
License: GNU for academic use only
Any restrictions to use by non-academics: non-academics need to obtain a license by contacting the author directly.
Ready-to-use binary distributions of BASILIScan (currently version 1.4) are available for free for academic users. Packages for OSX and Windows systems can be downloaded from the project website: www.basiliscan.com/download. Open-source Python code will be released through the website upon publication. Instructions on which dependencies need to be installed for compiling the open-source version are under www.basilisc.com/readme. Please also consult the website for tutorials, FAQs and updates.
All datasets generated and supporting the conclusions of this article will be available for download from www.basiliscan.com upon publication of the paper and are also included with this publication.
Ethics approval and consent to participate
Consent for publication
The author declares that he/she has no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary tables and figures. (DOCX 270 kb)
BASILIScan search with human CDC7 kinase (Uniprot ID: O00311) against all vertebrate sequences available from UniprotKB (both Swissprot and TrEMBL). (CSV 263 kb)