BASILIScan: a tool for high-throughput analysis of intrinsic disorder patterns in homologous proteins

Background Intrinsic structural disorder is a common property of many proteins, especially in eukaryotic and virus proteomes. The tendency of some proteins or protein regions to exist in a disordered state usually precludes their structural characterisation and renders them especially difficult for experimental handling after recombinant expression. Results A new intuitive, publicly-available computational resource, called BASILIScan, is presented here. It provides a BLAST-based search for close homologues of the protein of interest, integrated with a simultaneous prediction of intrinsic disorder together with a robust data viewer and interpreter. This allows for a quick, high-throughput screening, scoring and selection of closely-related yet highly structured homologues of the protein of interest. Comparative parallel analysis of the conservation of extended regions of disorder in multiple sequences is also offered. The use of BASILIScan and its capacity for yielding biologically applicable predictions is demonstrated. Using a high-throughput BASILIScan screen it is also shown that a large proportion of the human proteome displays homologous sequences of superior intrinsic structural order in many related species. Conclusion Through the swift identification of intrinsically stable homologues and poorly conserved disordered regions by the BASILIScan software, the chances of successful recombinant protein expression and compatibility with downstream applications such as crystallisation can be greatly increased. Electronic supplementary material The online version of this article (10.1186/s12864-018-5322-5) contains supplementary material, which is available to authorized users.


Background
The incidence of intrinsic disorderdefined as the lack of a fixed three-dimensional conformation of a proteinhas become increasingly appreciated and examined in recent years. Many such proteins have been characterised, and their disorder, as well as order-disorder transitions upon ligand and partner binding demonstrated with a range of biophysical techniques [1]. It is now widely acknowledged that protein disorder is a remarkably common phenomenonespecially in complex organisms and viruses. An estimated 30-40% of the human proteome is disordered to a significant degree [2]. This has far-reaching consequences for the structural characterisation and experimental handling of many proteins from the human and other disorder-enriched proteomes.
Depending on the extent and location of flexibility, the intrinsically-disordered protein (IDP) can exhibit a number of different behaviours. Although the highly-charged nature of disordered regions may confer high solubility [3], presence of proteases in IDP preparations often causes severe proteolytic degradation in the case of fully-disordered IDPs or digestion of long connecting loops within and between otherwise structured domains [4,5]. Ultimately, biochemical and biophysical characterisation of such proteins is usually difficult because of challenging protein expression and purification and lack of sample homogeneity. Crystallisation of IDPs for X-ray crystallography is only feasible with the flexible regions removed, bound to a co-factor, or entropically-stabilised otherwise. Multidimensional nuclear magnetic resonance (NMR) spectroscopy remains the only high-resolution biophysical technique for studying some disordered protein systems, although the sample has to meet stringent compatibility standards (small and globular proteins, homogeneity, stability under low ionic strength for long periods of time) [6].
Since IDPs display characteristic patterns of amino acid content and distribution, the presence of disordered regions can be predicted from their primary sequence with high confidence. Many predictive algorithms have been devised and used successfully to predict the probability of residues in a given protein sequence to exist in an ordered or disordered state [7][8][9][10][11]. Such predictions have been repeatedly confirmed experimentally with multi-dimensional NMR [12,13]. It is currently becoming common practice to take the intrinsic disorder predictions into account while designing a protein construct for recombinant expression. The N-or C-terminal disordered tails/domains can be truncated (for example: [14,15]), while long loops connecting neighbouring domains or elements of secondary structure can be shortened to aid conformational stability [16]. Unfortunately, the thin line between limiting disorder and affecting the function and correct folding of the protein can easily be crossed and the experimental trial-and-error process often takes long before an improved construct is found.
The software presented here, called BASILIScan, offers an alternative approach to streamline and simplify the construct design process. The core mechanism relies on a BLASTP search with the user's amino acid sequence linked with simultaneous intrinsic disorder prediction of all hits. A specialised scoring system, called the FLEX score, is then employed to identify closest homologues exhibiting lowest disorder content. A hit possessing a FLEX score superior to the submitted protein is very likely to also show improved in vitro behaviour. Analytical features of BASILIScan, such as multiple sequence disorder overlay and alignment, allow for analysis of disorder conservation patterns between selected homologous hits in parallel. In addition to identifying a more suitable homologue, such analysis can guide rational construct design by exposing intrinsically-disordered regions of low or limited conservation which could be truncated from the expression construct. The adjustability of numerous search and analysis parameters provides compatibility with protein sequences of a wide range of intrinsic disorder, length and inter-species sequence variation.

Implementation
The main framework of the BASILIScan software is centred upon the connection of the modules outlined in Fig. 1, written in Python 2.7. Where possible, Biopython (ver. 1.67) [17] libraries or their derivatives are used. In order to promote the multi-disciplinary use of BASILIScan, a graphical user interface (GUI) has been created in Tkinter ver. 8.5.9 for Python.

Sequence-based similarity search
After the plain unformatted amino acid sequence is provided, a BLASTP [18] search is conducted against the selected sequence database (UniprotKB/Swissprot being the default and recommended one [19]) and the results displayed according to the filters set by the usersuch as an expect value (E-value) threshold or a maximum number of hits. The Bio.Blast functionality of Biopython is used to run on-line BLAST. The .xml output file is parsed with the NCBIXML function in Biopython. Users may perform searches on custom, remote, FASTA-formatted libraries instead, by selecting the library file through the "Advanced properties/Select database" option. Remote search on the selected library is then performed by BLAST+ ver 2.7.1, including conversion of the library by makeblastdb. Sequence identifiers used in remote libraries should be Uniprot-derived in order for all BASILIScan modules to work correctly.

Handling of viral polyproteins
In the case of the resulting sequences constituting a viral polyprotein, BASILIScan will perform virtual processing of the polyprotein and will apply all its analysis and metrics tools to the appropriate fragment only. This functionality is only offered when UniprotKB/Swissprot database is selected for search, due to lack of proteolytic processing information in non-manually curated databases. This option can also be disabled in "Edit/Advanced preferences".

Prediction of intrinsic disorder
For BLAST hits fulfilling the criteria set by the user, intrinsic disorder is calculated by the IUPRED algorithm [11], which has been used extensively in other publications for predicting intrinsic disorder in silico [20][21][22][23]. Two alternative IUPRED modes are available: "long disorder" or "short disorder". For most applications, the "long disorder" mode is suggested. The intrinsic disorder score (IDS) of each entry is then calculated in the following way: where l is the length of the protein and s is the residue's IUPRED score.

FLEX score computation
Since identification of a homologue with superior intrinsic disorder properties requires scoring at least two parameters simultaneously, the hybrid FLEX score has been implemented. The FLEX score incorporates a weighted average of the intrinsic order and a hyperbolic transform of the E-value parameter in the following fashion: The weight is determined by the FLEX coefficient (η), which is set by the user before the homology search is run. Allowed values are between 0 and 1 and will shift the contribution ratio of intrinsic structural order (1 -IDS) to E-value transform. The hyperbolic transform of the E-value is meant to converge the extreme low-end E-value range while resolving the high-end and middle ranges. Consequently, the function of the logarithm of the E-value is sigmoidal, and bound from 0 to 1. It is characterised by a near-linear relationship between arguments corresponding to E-values of 10 − 3 and 10 − 14 , while either tail approaches 0 or 1, respectively (Additional file 1: Figure S1).

Visualisation of results
Results of a sequence query are presented in a table, for each hit showing the UniProt identifier, the GeneID, the expect value (E), sequence identity, similarity, the IDS score and the FLEX score. By default, the result hits are sorted from the lowest to the highest E-value. Sorting priority can be adjusted at any time from the main menu. The right-hand-side menu allows for more in-depth analysis of results. The 'View' option acquires the most important parameters of the selected item from the UniProt repository, displaying information such as sequence length, molecular mass and organism taxonomy.
The "Details" button draws a detailed trace of intrinsic disorder of the selected record within an interactive two-coordinate environment, implemented with Matplotlib. The environment allows for enlargement of selected parts of the trace, as well as for its translation. The option of exporting the graph as an image file is also provided. Traces can be overlayed on top of each other and therefore the intrinsic disorder can be explicitly compared between multiple protein records simultaneously.
Importantly, if the "enable disorder trace alignment" setting is switched on, the multiple disorder traces for the selected protein records will be automatically aligned on the axis, according to a multiple sequence alignment conducted in CLUSTALW [24]. Default CLUSTALW parameters can be adjusted in Edit/Advanced properties. Any gaps inserted through the alignment algorithm will be visible in the aligned traces as residues with the IUPRED disorder score of 0.0an extremely unlikely occurrence for a protein residue otherwise.

Distribution
Windows and OSX binary distributions of BASILIScan were packaged with Py2exe and Py2app, respectively. Packages for both platforms are freely available for academic use under the GNU distribution license and can be downloaded at www.basilisc.com/downloads. Open-source version is also available. Please consult the ReadMe file for further instructions on installation and running of BASI-LIScan, as well as for the dependencies required to run the open-source version (www.basilisc.com/readme/).

Results
Case scenario: Human CDC7 protein kinase Every BASILIScan job starts with providing a raw amino acid sequence and a job title. Next, the IUPRED mode is selected. Unless one is looking for very short stretches of disorder, or within very short protein sequences, the "long disorder" mode should be selected. The "E-value threshold" indicates the upper limit of the BLAST E-value, above which any hits will be ignored. The number of homology hits found can also be capped. Lastly, the "FLEX score priority coefficient" has to be set. This parameter will not influence which results are shown, but is meant to adjust the sensitivity of the FLEX score by shifting the contribution of homology versus structural order towards the final score. This means, for instance, that when the "priority coefficient" is set to 100% (all the way towards structural order), the FLEX score will only reflect the intrinsic order (1-IDS) of a given hit and will ignore the homology score component. The adjustment of the "priority coefficient" is particularly useful in cases where the BASILIScan run yields many hits containing one or more subpopulations clustered around particular values of IDS or E-value. The default priority coefficient value is 50%, in which case both terms will be taken into account equally.
To showcase the functioning of BASILIScan, the human cell division cycle protein kinase 7 (hCDC7, Uniprot identifier: O00311) was chosen as the sequence of interest. The protein is a 63.9 kDa serine/threonine protein kinase, which is an essential S-phase kinase implicated in DNA replication control and cell proliferation [25]. The choice of hCDC7 kinase was dictated by availability of subject literature and structural information, as well as its functional conservation across many species. The 574-amino acid sequence of hCDC7 was submitted to BASILIScan for search in "long disorder" IUPRED mode, with an E-value threshold of 10 − 10 , maximum number of hits of 100 and a FLEX score priority coefficient of 50%. Figure 2 and Additional file 1: Table  S1 show, respectively, the way results are displayed by the BASILIScan GUI and a full results dataset from the above search.
The BASILIScan search yielded 18 hits, including the search query entry which would usually appear as the first row in the results window and always returns the E-value of 0. It is immediately apparent that although many homologues with low E-values were found, their intrinsic disorder varies considerably (from 0 to 72%). Human CDC7 kinase shows an IDS score of about 14%, with many residues oscillating around the 0.5 disorder threshold. This can be seen by using the module "Trace" which graphs the disorder scores for every residue in a given sequence (Fig. 3a). Indeed, lowering the IUPRED score calculation threshold from the default 0.5 to 0.4, leads to a dramatic increase in IDS of hCDC7 up to Other parameters can be accessed through the "Advanced Properties" menu button. The middle panel shows the most important characteristics and calculated parameters of each result hit. Upon selection of one or more hit identifiers, the right panel buttons will activate different analysis modules 35%. The FLEX score, being a weighted average of the E-value transform and the IDS score, should in most cases serve as the easiest measure of identifying suitable hits. The most promising hit returned by this BASILIScan search, possessing the highest FLEX score, is a probable CDC7 kinase homologue from a fungus Encephalitozoon cuniculi. In addition, a remote search with BASILIScan against a TrEMBL database encompassing all available vertebrate proteomes has also found numerous evolutionarily-closer homologous sequences with IDS and FLEX scores superior to the search query statistics (Additional file 2).
Overlaying the calculated intrinsic disorder trace of the submitted sequence (O00311) on top of the BASI-LIScan hit of the highest FLEX score (Q8SR85), clearly shows the conservation of some intrinsic disorder patterns (residues 363-390) and confirms that there are regions of diminished intrinsic disorder (e.g. residues 420-446, 532-574) as well as significant deletions of particularly disordered segments (residues 1-40, 202-331) in the fungal and vertebrate homologues (Fig. 3a-c). The "align" feature can also be used to perform a multiple sequence alignment on multiple selected hits and overlay the calculated intrinsic disorder scores for each residue in the alignment as a heatmap, in order to compare the conservation of disorder between many sequences at once (Additional file 1: Figure S2).
The crystal structure of human CDC7 kinase has been reported [26]. Crystallisation was achieved through identification of disordered regions by limited proteolysis  and NMR spectroscopy, with their subsequent deletion. The crystallised protein encompasses the three ordered regions predicted by BASILIScan (highlighted in cyan, purple and green in Fig. 3c-d).
The parametrisation offered by BASILIScan, in particular the "priority coefficient", allows for a quick assessment of the results, targeted to one's needs. This is especially useful when many result hits are returned. Resubmitting the above CDC7 kinase query with the E-value threshold lowered to − 5, the maximum set number of 100 hits are obtained (Table 1). Since many homologues with a low E-value are returned, it is easier to assess the results by shifting the priority coefficient towards intrinsic stability (i.e. higher value). In this instance, there will be many more hits with low IDS scores among the top FLEX score results. Changing the priority coefficient in the opposite direction will have an opposite effect.

Occurrence of structured homologues of human proteins in other species
In order to learn about the scope of BASILIScan applicability in finding structured homologues of disorderenriched proteins, every sequence of the human proteome (all 20,243 sequences available in UniprotKB/Swissprot) was subjected to an individual BASILIScan search with the same parameters as used for hCDC7 above. All human sequences were removed from the results database. This high-throughput search resulted in disorder prediction and scoring of nearly 100,000 protein sequences, and showed that a surprisingly large extent of the human proteome has homologues of superior intrinsic stability in other species. BASILIScan identified at least one homologue exhibiting a FLEX score higher than that of the query sequence for the vast majority of the human proteome (14,865 sequences or 78%) and ten or more such homologues for over 30% (Fig. 4a). Fig. 4 Results of a high-throughput BASILIScan search with every available ORF in the human proteome. The success of this search can be assessed by the number of homologous sequences found with a FLEX score higher than that of the query sequence (a). Limiting the results dataset to the queries which yielded at least one such homologue, the extent of improvement in IDS score can be plotted in 5% increments (b). Information about the most common species of the highest-scoring homologue for each query was also plotted (blue), along with the corresponding average difference in IDS (orange) (c). For full raw data, see Additional file 3 Importantly, in just under half of hits scored as promisingpossessing FLEX scores higher than the querythe improvement of intrinsic disorder represented by the difference in IDS scores was very significant. The IDS score improvement of 10% or more was found in 47% of such BASILIScan-identified hits, while 27% of hits showed at least 20% improvement (Fig. 4b). The top hits resulting from the BASILIScan search belonged to 631 unique species, but by far the highest occurrence was displayed by mouse homologues (Fig. 4c). The improvement in IDS score appeared not to correlate with the hit frequency of occurrence in a given species.

Discussion
As structural biology is evolving from tackling thermo dynamically-stable, well-folded protein domains into much larger, multimeric and multicomponent protein systems, the need for rapid assessment and minimisation of intrinsic disorder becomes increasingly important. Software developed to assist recombinant expression construct design for predicting the regions of disorder (such as GlobPlot [9], or IUPRED [11]), maximising protein stability and crystallisibility through computing of parameters sourced from previously-solved crystal structures (XtalPred, [27]) and surface entropy reduction (SERp, [28]) have been used extensively. However, to the author's knowledge, user-friendly and widely-accessible tools for high-throughput comparative analysis of intrinsic disorder patterns in related proteins have not been developed and explored for the purpose of recombinant expression construct design. This is tackled by the novel homologue search and scoring system offered by BASI-LIScan, with the disorder predictions based on the widely-adopted IUPRED algorithm. The key advantage of BASILIScan over the pre-exisiting software is the parallel analysis of disorder predictions for multiple related proteins at a time: both globally (the "trace" function) and residue-by-residue, as a heatmap (the "align" function). When the aim is to identify regions of disorder, such alignments can provide information on the conservation of the putatively-disordered regionand hence, how likely it is that the removal of such a fragment will not affect the structure or function of the protein.