The BASILIScan software is built around the interconnected modules outlined in Fig. 1 and is written in Python 2.7. Where possible, Biopython (ver. 1.67) [17] libraries or their derivatives are used. To promote multi-disciplinary use of BASILIScan, a graphical user interface (GUI) has been created with Tkinter (ver. 8.5.9) for Python.
Sequence-based similarity search
After a plain, unformatted amino acid sequence is provided, a BLASTP [18] search is conducted against the selected sequence database (UniProtKB/Swiss-Prot being the default and recommended one [19]), and the results are displayed according to the filters set by the user, such as an expect value (E-value) threshold or a maximum number of hits. The Bio.Blast functionality of Biopython is used to run on-line BLAST. The .xml output file is parsed with the NCBIXML module of Biopython.
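As an illustration only, the search-and-filter step could be sketched as follows. The function names, the default threshold of 1e-3 and the cap of 50 hits are hypothetical and are not taken from the BASILIScan source; the on-line call uses Biopython's `NCBIWWW.qblast` and `NCBIXML.read`, which require Biopython and network access.

```python
def filter_hits(hits, evalue_threshold=1e-3, max_hits=50):
    """Keep (title, evalue) hits at or below the threshold, best first.

    The default threshold and hit cap are illustrative assumptions.
    """
    kept = [hit for hit in hits if hit[1] <= evalue_threshold]
    kept.sort(key=lambda hit: hit[1])  # lowest E-value first
    return kept[:max_hits]


def run_online_blastp(sequence, database="swissprot",
                      evalue_threshold=1e-3, max_hits=50):
    """Run BLASTP on the NCBI servers and return filtered (title, evalue) hits."""
    # Imported here so that filter_hits() works without Biopython installed.
    from Bio.Blast import NCBIWWW, NCBIXML

    handle = NCBIWWW.qblast("blastp", database, sequence,
                            expect=evalue_threshold, hitlist_size=max_hits)
    record = NCBIXML.read(handle)  # a single query yields a single record
    # Represent each hit by the best (lowest) E-value among its HSPs.
    hits = [(alignment.title, min(hsp.expect for hsp in alignment.hsps))
            for alignment in record.alignments]
    return filter_hits(hits, evalue_threshold, max_hits)
```

Separating the filter from the network call keeps the user-facing thresholds testable without contacting the BLAST servers.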
Users may instead perform searches on custom, local, FASTA-formatted libraries by selecting the library file through the “Advanced properties/Select database” option. The local search on the selected library is then performed by BLAST+ ver. 2.7.1, including conversion of the library by makeblastdb. Sequence identifiers used in local libraries should be UniProt-derived for all BASILIScan modules to work correctly.
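A minimal sketch of the BLAST+ step on a custom library is given below. It assumes that the `makeblastdb` and `blastp` executables are on the system path; the file names, database name and default filter values are hypothetical, and BASILIScan may invoke BLAST+ differently.

```python
import subprocess


def build_local_blast_commands(library_fasta, query_fasta,
                               db_name="custom_db", out_xml="results.xml",
                               evalue=1e-3, max_hits=50):
    """Return the makeblastdb and blastp command lines as argument lists."""
    # Convert the FASTA library into a protein BLAST database.
    makeblastdb_cmd = ["makeblastdb", "-in", library_fasta,
                       "-dbtype", "prot", "-out", db_name]
    # Search the database; "-outfmt 5" requests XML output.
    blastp_cmd = ["blastp", "-query", query_fasta, "-db", db_name,
                  "-evalue", str(evalue), "-max_target_seqs", str(max_hits),
                  "-outfmt", "5", "-out", out_xml]
    return makeblastdb_cmd, blastp_cmd


def run_local_blast(library_fasta, query_fasta, **kwargs):
    """Build the database, then run the search (requires BLAST+ installed)."""
    for cmd in build_local_blast_commands(library_fasta, query_fasta, **kwargs):
        subprocess.check_call(cmd)
```

The XML file written by `blastp` can then be parsed in the same way as the output of the on-line search.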
Handling of viral polyproteins
If a resulting sequence constitutes a viral polyprotein, BASILIScan performs virtual processing of the polyprotein and applies all its analysis and metrics tools to the appropriate fragment only. This functionality is offered only when the UniProtKB/Swiss-Prot database is selected for the search, owing to the lack of proteolytic processing information in databases that are not manually curated. This option can also be disabled in “Edit/Advanced preferences”.
Prediction of intrinsic disorder
For BLAST hits fulfilling the criteria set by the user, intrinsic disorder is calculated by the IUPRED algorithm [11], which has been used extensively in other publications for predicting intrinsic disorder in silico [20,21,22,23]. Two alternative IUPRED modes are available: “long disorder” or “short disorder”. For most applications, the “long disorder” mode is suggested. The intrinsic disorder score (IDS) of each entry is then calculated in the following way:
$$ IDS=\frac{\sum_{r=1}^{l}{\gamma}_r}{l}\times 100\%, \quad \text{where } {\gamma}_r=1 \text{ if } s_r>0.5 \text{ (or the user-selected threshold); otherwise } {\gamma}_r=0 $$
where l is the length of the protein, r indexes its residues and s_r is the IUPRED score of residue r.
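The IDS formula above amounts to the percentage of residues whose IUPRED score exceeds the threshold. A minimal sketch, in which the function name and the list-of-floats input are assumptions made for illustration:

```python
def intrinsic_disorder_score(residue_scores, threshold=0.5):
    """Return the IDS: the percentage of residues scoring above the threshold.

    residue_scores: per-residue IUPRED scores for one protein.
    threshold: 0.5 by default, or a user-selected value.
    """
    if not residue_scores:
        raise ValueError("residue_scores must not be empty")
    # A residue counts as disordered only if its score is strictly above the threshold.
    disordered = sum(1 for score in residue_scores if score > threshold)
    return 100.0 * disordered / len(residue_scores)
```

For example, a four-residue trace of 0.1, 0.6, 0.9 and 0.5 contains two residues above 0.5, giving an IDS of 50%.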
FLEX score computation
Since identification of a homologue with superior intrinsic disorder properties requires scoring at least two parameters simultaneously, the hybrid FLEX score has been implemented. The FLEX score incorporates a weighted average of the intrinsic order and a hyperbolic transform of the E-value parameter in the following fashion:
$$ FLEX\ score=\left(1-\eta \right)\left(1-{0.99}^{{\left(-\log \left(E\text{-}value\right)\right)}^2}\right)+\eta \left(1- IDS\right) $$
The weight is determined by the FLEX coefficient (η), which is set by the user before the homology search is run. Allowed values lie between 0 and 1 and shift the ratio between the contribution of intrinsic structural order (1 - IDS) and that of the E-value transform. The hyperbolic transform of the E-value is meant to compress the extreme low-end E-value range while resolving the high-end and middle ranges. Consequently, the function of the logarithm of the E-value is sigmoidal and bounded between 0 and 1. It is characterised by a near-linear relationship between arguments corresponding to E-values of 10⁻³ and 10⁻¹⁴, while the tails approach 0 and 1, respectively (Additional file 1: Figure S1).
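The FLEX score can be sketched in a few lines. Three assumptions are made here that the text does not state: the logarithm is taken to base 10, the IDS is supplied as a percentage and converted to a fraction before use, and an E-value of 0 (as BLAST may report for very strong hits) is mapped to the upper limit of the transform. The function names are illustrative.

```python
import math


def evalue_transform(evalue):
    """Transform an E-value into the range 0..1 (assumes a base-10 logarithm)."""
    if evalue <= 0.0:
        # Assumption: log(0) is undefined, so an E-value of 0 maps to the limit 1.
        return 1.0
    return 1.0 - 0.99 ** ((-math.log10(evalue)) ** 2)


def flex_score(evalue, ids_percent, eta):
    """Weighted combination of the E-value transform and structural order.

    ids_percent: IDS as a percentage (0-100), converted to a fraction here.
    eta: FLEX coefficient in [0, 1]; higher values favour structural order.
    """
    if not 0.0 <= eta <= 1.0:
        raise ValueError("eta must lie between 0 and 1")
    order = 1.0 - ids_percent / 100.0
    return (1.0 - eta) * evalue_transform(evalue) + eta * order
```

Under the base-10 assumption, the transform evaluates to roughly 0.09 at an E-value of 10⁻³ and roughly 0.86 at 10⁻¹⁴, in line with the near-linear range described above.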
Visualisation of results
Results of a sequence query are presented in a table that shows, for each hit, the UniProt identifier, the GeneID, the expect value (E), sequence identity, similarity, the IDS and the FLEX score. By default, the hits are sorted from the lowest to the highest E-value. Sorting priority can be adjusted at any time from the main menu. The right-hand-side menu allows for more in-depth analysis of results. The “View” option retrieves the most important parameters of the selected item from the UniProt repository, displaying information such as sequence length, molecular mass and organism taxonomy.
The “Details” button draws a detailed trace of the intrinsic disorder of the selected record within an interactive two-coordinate environment, implemented with Matplotlib. The environment allows selected parts of the trace to be enlarged and translated. The graph can also be exported as an image file. Traces can be overlaid on top of each other, so that intrinsic disorder can be compared directly between multiple protein records.
Importantly, if the “enable disorder trace alignment” setting is switched on, the disorder traces for the selected protein records will be automatically aligned on the axis, according to a multiple sequence alignment conducted in CLUSTALW [24]. Default CLUSTALW parameters can be adjusted in Edit/Advanced properties. Any gaps inserted by the alignment algorithm will be visible in the aligned traces as residues with an IUPRED disorder score of 0.0 – an extremely unlikely occurrence for a protein residue otherwise.
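The mapping of a disorder trace onto an alignment can be sketched as below, assuming that each aligned sequence is available as a gapped string with "-" marking gaps and that the per-residue scores are held in a list. The function name and arguments are hypothetical.

```python
def align_trace(aligned_sequence, residue_scores, gap_char="-", gap_score=0.0):
    """Expand a per-residue disorder trace to alignment coordinates.

    Gap positions receive gap_score (0.0); residue positions keep their
    original score, in order.
    """
    n_residues = sum(1 for char in aligned_sequence if char != gap_char)
    if n_residues != len(residue_scores):
        raise ValueError("number of residues and number of scores differ")
    scores = iter(residue_scores)
    return [gap_score if char == gap_char else next(scores)
            for char in aligned_sequence]
```

For instance, the aligned sequence "A-CD" with scores 0.1, 0.2 and 0.3 yields the trace 0.1, 0.0, 0.2, 0.3, so that traces from different records share a common x-axis.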
Distribution
Windows and OSX binary distributions of BASILIScan were packaged with Py2exe and Py2app, respectively. Packages for both platforms are freely available for academic use under the GNU distribution license and can be downloaded at www.basilisc.com/downloads. An open-source version is also available. Please consult the ReadMe file for further instructions on installing and running BASILIScan, as well as for the dependencies required to run the open-source version (www.basilisc.com/readme/).