Logo2PWM: a tool to convert sequence logo to position weight matrix
© The Author(s) 2017
Published: 3 October 2017
Position weight matrix (PWM) and sequence logo are the most widely used representations of transcription factor binding sites (TFBSs) in biological sequences. The sequence logo, a graphical representation of a PWM, has been widely used in scientific publications and reports because it is easy for humans to read, rich in information, and simple in format. The PWM, in contrast, is a precise and compact digital form that can be used directly by a variety of motif analysis software. A few tools are available to generate sequence logos from PWMs; however, no tool does the reverse. A tool that converts a sequence logo back to a PWM is needed to scan for a TFBS that is represented as a logo in a publication where the PWM is not provided or is hard to obtain. A major difficulty in developing such a tool is dealing with the diversity of sequence logo images.
We propose logo2PWM for reconstructing PWMs from a large variety of sequence logo images. Evaluation on over one thousand logos from three sources with different logo formats shows that the correlation between the reconstructed PWMs and the original PWMs is consistently high, with a median correlation greater than 0.97.
Because of its high recognition accuracy, its ease of use, and its availability as both a web-based service and a stand-alone application, we believe that logo2PWM can readily benefit the study of transcription by filling the gap between sequence logo and PWM.
The position weight matrix (PWM), introduced by Stormo et al. [1], is widely used for representing transcription factor binding sites (TFBSs) in biological sequences. PWMs are often computed from a list of aligned sequences that are potentially functionally related, and they have replaced consensus sequences as the most commonly used TFBS representation in motif discovery software and biological publications. The sequence logo, presented by Schneider et al. [2], is a successful graphical representation of a PWM or sequence pattern. From a sequence logo, readers can easily perceive the information content and the relative frequency of each nucleotide at each position of the consensus sequence, and can therefore distinguish subtle sequence patterns and significant residues [2, 3]. While the sequence logo is well suited to human perception and understanding, the PWM retains advantages in computation, such as its precision and its compactness in computer storage; in particular, the PWM is the standard format for motif finding and scanning [4].
A few tools are available for generating sequence logos from PWMs or aligned sequences [5–7]; however, there is currently no tool to convert a sequence logo back to a PWM. In biology publications, the PWM corresponding to a sequence logo may not be easy to find. Such a tool is especially needed to scan for a TFBS represented as a sequence logo in an older publication where the original PWM is very hard to obtain. Even when the PWMs are provided by a publication, a tool that converts logos to PWMs could save time and speed up the motif finding workflow.
In this work, we propose logo2PWM to reconstruct PWMs from sequence logo images and to overcome the major difficulty of handling a large variety of sequence logo images. Evaluation on over one thousand logos from three sources with different logo formats shows that the correlation between the reconstructed PWMs and the original PWMs is consistently high, which further supports that logo2PWM can be readily used to benefit the study of transcriptional regulatory networks.
Sequence logo and PWM
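In a sequence logo, the total height of the letter stack at position j is the information content $I_j = 2 - H_j = 2 + \sum_{i=1}^{4} P_{ij} \log_2 P_{ij}$ (in bits), where $H_j$ represents the entropy of position j, $P_{ij}$ is the probability of nucleotide i at position j, and i = 1 to 4 represents nucleotides A, C, G and T [3, 8]. The height of each nucleotide is then calculated as $P_{ij} \cdot I_j$. By convention, green, blue, yellow and red usually denote A, C, G and T, respectively.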
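The height rule can be illustrated with a minimal Python sketch. It assumes the standard definition of information content for DNA, I_j = 2 − H_j with H_j the Shannon entropy in bits, and ignores any small-sample correction; the function name `column_heights` is chosen here for illustration only.

```python
import math

def column_heights(probs):
    """Letter heights (in bits) for one logo column.

    probs: probabilities of A, C, G, T at one position (should sum to 1).
    """
    # Entropy H_j in bits; terms with p == 0 contribute nothing.
    entropy = -sum(p * math.log2(p) for p in probs if p > 0)
    info = 2.0 - entropy                  # information content I_j
    return [p * info for p in probs]      # height of letter i is P_ij * I_j

# A fully conserved column, a two-letter column, and a uniform column.
for column in ([1.0, 0.0, 0.0, 0.0],
               [0.5, 0.5, 0.0, 0.0],
               [0.25, 0.25, 0.25, 0.25]):
    print(column, "->", column_heights(column))
```

A fully conserved column reaches the maximum stack height of 2 bits, while a uniform column has zero height and therefore leaves no letters to measure in the image.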
Reconstructing the probabilities of nucleotides A, C, G and T from a sequence logo is not as straightforward as it looks. Here we focus only on the image of a single logo-column j.
First of all, directly measuring the proportion of each letter within the whole letter stack at position j does not work: the lower the letter stack, the lower the resolution, and therefore the harder it is to measure the proportion of each letter's height. When one big letter dominates a position, as in the logo-columns at positions 1, 3, 4, 6 and 9 in Fig. 1, a difference of a few pixels in the bottom letters severely affects the accuracy. The problem is worse for columns that have two similar-sized letters and lower information content, such as columns 2, 5 and 7, where taking the measured letter heights directly as probabilities would give the reconstructed PWM a much higher information content. The effect is even stronger when the image resolution is low.
To reduce the computation time of solving the above formula, we pre-calculated a lookup table B that maps $I$ to $p_{1st}$, with $p_{1st}$ sampled from 0 to 1 at intervals of 0.01.
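The lookup-table idea can be sketched in Python as follows. The formula that relates I to p_1st is not reproduced in this text, so the sketch substitutes a stand-in and should not be read as the authors' equation: it assumes that the three letters other than the top one share the remaining probability equally. The names `info_from_p1` and `lookup_p1` are hypothetical.

```python
import math

def info_from_p1(p1):
    """Information content (bits) of a column whose top letter has probability p1.

    ASSUMPTION (not from the source): the other three letters share 1 - p1 equally.
    """
    rest = (1.0 - p1) / 3.0
    entropy = -sum(p * math.log2(p) for p in (p1, rest, rest, rest) if p > 0)
    return 2.0 - entropy

# Table B: p_1st sampled from 0 to 1 in steps of 0.01, as described in the text.
B = [(k / 100.0, info_from_p1(k / 100.0)) for k in range(101)]

def lookup_p1(info):
    """Return the tabulated p_1st whose information content is closest to info."""
    # Only p_1st >= 0.25 can be the largest of four probabilities, and on that
    # range the assumed mapping is monotonic, so the nearest entry is unique.
    candidates = [(p, i) for p, i in B if p >= 0.25]
    return min(candidates, key=lambda entry: abs(entry[1] - info))[0]

print(lookup_p1(2.0))   # fully conserved column
print(lookup_p1(0.0))   # uniform column
```

Building the table once replaces a numerical root search per column with a nearest-entry search over at most 101 values, at the cost of a 0.01 resolution in p_1st.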
Concatenating the estimated probabilities for all positions, we obtained the reconstructed PWM.
Secondly, the algorithm determines whether the logo contains X and Y axes using the black-pixel feature; we assume that the axes in the image are black. If the logo has both X and Y axes, the algorithm determines the full length and height of the logo and then cuts out the pure logo area. If the logo image contains no axes, the algorithm uses the logo boundary to cut out the pure logo area and assumes that the highest letter has the maximum information content, 2 bits. If the logo image contains a Y axis, the height of the Y axis is used later to estimate the information content. The program then cuts out the pure logo region and removes noise in the image, such as the horizontal dashed lines of background patterns in some logo images.
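One possible reading of the black-pixel test is sketched below in Python. It is an illustrative assumption, not the published implementation: an axis is taken to be a pixel row or pixel column in which at least half of the pixels are dark. The function name `find_axes` and both thresholds are hypothetical.

```python
def find_axes(gray, dark=64, min_fraction=0.5):
    """Locate axis lines in a grayscale image given as a list of rows (0-255).

    Returns (y_axis_x, x_axis_y); an entry is None when that axis is absent.
    The threshold values are illustrative assumptions.
    """
    height, width = len(gray), len(gray[0])

    def black_fraction(values):
        return sum(v < dark for v in values) / len(values)

    # A Y axis shows up as a pixel column that is mostly black (scan left to right).
    y_axis = next((x for x in range(width)
                   if black_fraction([gray[y][x] for y in range(height)]) >= min_fraction),
                  None)
    # An X axis shows up as a pixel row that is mostly black (scan bottom to top).
    x_axis = next((y for y in range(height - 1, -1, -1)
                   if black_fraction(gray[y]) >= min_fraction),
                  None)
    return y_axis, x_axis

# Toy 6 x 8 image: white background, Y axis in column 1, X axis in row 4.
image = [[255] * 8 for _ in range(6)]
for y in range(6):
    image[y][1] = 0
image[4] = [0] * 8
print(find_axes(image))
```

When both entries are None, the cropping step would fall back to the logo boundary, as described in the text.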
a) Sum the matrix to a 1-D array on X, then count the number of peaks,
b) Vote for consensus gap distances and determine the letter-column width,
c) Use the letter width to determine the number of letter-columns in the logo, then fine-tune the letter width.
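Steps a) to c) can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the authors' code: the logo is already reduced to a binary mask, neighbouring letter stacks are separated by at least one blank pixel column, each run of inked pixel columns is treated as one peak, and the final fine-tuning of the width is omitted. The function name `estimate_columns` is hypothetical.

```python
from collections import Counter

def estimate_columns(mask):
    """Estimate (column_width, n_columns) from a binary logo mask.

    mask: list of rows; 1 marks a letter pixel, 0 marks background.
    """
    width = len(mask[0])
    # a) Project the image onto the X axis: count letter pixels per pixel column.
    profile = [sum(row[x] for row in mask) for x in range(width)]
    # Each run of inked pixel columns is treated as one peak; record its start.
    starts = [x for x in range(width)
              if profile[x] > 0 and (x == 0 or profile[x - 1] == 0)]
    if len(starts) < 2:
        return width, len(starts)
    # b) Vote: the most common distance between neighbouring peaks is the width.
    gaps = Counter(b - a for a, b in zip(starts, starts[1:]))
    col_width = gaps.most_common(1)[0][0]
    # c) Derive the number of letter-columns from the inked span and the width.
    span = starts[-1] - starts[0] + col_width
    return col_width, round(span / col_width)

# Five positions of width 5 (4 inked + 1 blank); the middle position is empty,
# as happens when a position carries almost no information.
row = []
for filled in [1, 1, 0, 1, 1]:
    row += [filled] * 4 + [0]
mask = [row] * 3
print(estimate_columns(mask))
```

In the toy mask only four peaks are visible, yet voting on the gap distance still recovers five letter-columns, which is why the count of peaks alone is not used as the number of columns.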
Then, for each pixel in the image, the algorithm determines its color by finding the nearest color central point. White pixels are background, black pixels are axes and labels or marks, and pixels of the other four colors belong to the four nucleotides in DNA sequences. We use consensus voting to determine the color of each letter sub-image. We chose not to use optical character recognition (OCR) for the letter recognition task because OCR is slow and inaccurate in this case, mainly because of the uncommon font shapes and the variety of colors.
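The nearest-centre assignment and the consensus vote can be sketched in Python as follows. The RGB values of the colour centres are illustrative assumptions (real logos use varying shades, and the source does not list its centres), and the names `CENTRES`, `classify_pixel` and `vote_letter` are hypothetical.

```python
from collections import Counter

# Hypothetical colour centres (RGB); real logos may use other shades.
CENTRES = {
    "background": (255, 255, 255),
    "axis": (0, 0, 0),
    "A": (0, 128, 0),      # green
    "C": (0, 0, 255),      # blue
    "G": (255, 200, 0),    # yellow
    "T": (255, 0, 0),      # red
}
LETTERS = {"A", "C", "G", "T"}

def classify_pixel(rgb):
    """Label a pixel with the class whose colour centre is nearest in RGB space."""
    return min(CENTRES,
               key=lambda name: sum((a - b) ** 2 for a, b in zip(rgb, CENTRES[name])))

def vote_letter(pixels):
    """Consensus vote over a letter sub-image; returns None if no letter pixel."""
    votes = Counter(classify_pixel(p) for p in pixels)
    letters = {name: n for name, n in votes.items() if name in LETTERS}
    return max(letters, key=letters.get) if letters else None

# A sub-image that is mostly background with a few reddish pixels and one stray
# blue pixel: the vote among letter classes selects T.
sub_image = [(255, 255, 255)] * 5 + [(240, 10, 10)] * 3 + [(0, 0, 250)]
print(vote_letter(sub_image))
```

Voting over the whole sub-image makes the result robust to a few misclassified pixels, such as anti-aliased edges or a neighbouring letter that bleeds into the crop.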
Stand-alone application
The core functions of logo2PWM are written in MATLAB 8.6 with the Image Processing Toolbox. The complete MATLAB program takes the file name of the sequence logo image as input and writes three files to the folder that contains the original logo image: the reconstructed PWM in .csv format, the reconstructed PWM in enoLOGOS format, and the position-specific scoring matrix (PSSM) file for the MEME Suite. The program also accepts an optional parameter, the number of columns, which gives it a higher chance of returning a good result.
The source code of the stand-alone application can be accessed at http://www.cs.utsa.edu/~jruan/logo2pwm_sa.
logo2PWM is available at http://www.cs.utsa.edu/~jruan/logo2pwm.
We evaluate our algorithm by computing the correlation between the estimated and true PWMs, and by visually comparing the original sequence logos with sequence logos regenerated from the estimated PWMs. Three systematic evaluations were performed on 1946 TFBS logos from Zhu et al. [9], MacIsaac et al. [10], and the JASPAR-2016 database [11, 12]. There are 179 sequence logo-PWM pairs from Zhu et al. [9], 124 sequence logo-PWM pairs from MacIsaac et al. [10], and 1643 available logo-PWM pairs from the JASPAR-2016 database.
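The correlation measure can be sketched in Python as follows. The text does not state which correlation coefficient is used, so this sketch assumes the Pearson correlation taken over all matrix entries; the function name `pwm_correlation` and the example matrices are made up for illustration.

```python
import math

def pwm_correlation(pwm_a, pwm_b):
    """Pearson correlation between two PWMs of identical shape (lists of rows)."""
    x = [v for row in pwm_a for v in row]
    y = [v for row in pwm_b for v in row]
    if len(x) != len(y):
        raise ValueError("PWMs must have the same shape")
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    # Undefined (division by zero) if either matrix is constant, e.g. all 0.25.
    return cov / math.sqrt(var_x * var_y)

# Rows are A, C, G, T; columns are positions. Values are invented for the example.
true_pwm = [[0.7, 0.1], [0.1, 0.1], [0.1, 0.7], [0.1, 0.1]]
estimated = [[0.68, 0.12], [0.11, 0.10], [0.10, 0.69], [0.11, 0.09]]
print(round(pwm_correlation(true_pwm, estimated), 4))
```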
Results and Discussion
Currently, the execution time for one logo image is around 2 s in the MATLAB environment and around 15 s on the web server.
logo2PWM has been tested on a large variety of sequence logo images to validate and improve our algorithm. We also performed three systematic evaluations on over one thousand TFBS logos from Zhu et al. [9], MacIsaac et al. [10], and the JASPAR-2016 database [11, 12]; the results for all three data sets are good.
Evaluation on 179 logos from Zhu et al. [9]
Evaluation on 124 logos from MacIsaac et al. [10]
Evaluation on 1,643 logos from the JASPAR-2016 database
We proposed logo2PWM to reconstruct PWMs from sequence logo images. Based on the good evaluation results on over one thousand logo images in a variety of logo formats, its ease of use, and its availability as both a web-based service and a stand-alone application, we believe that logo2PWM can readily benefit the study of TF-DNA interaction.
This research and this article’s publication costs were supported by NSF grants IIS-1218201 and ABI-1565076 and NIH grant G12MD007591.
Availability of data and materials
The logo2PWM web application, the source code of the stand-alone application, and the test data are available at http://www.cs.utsa.edu/~jruan/logo2pwm.
About this supplement
This article has been published as part of BMC Genomics Volume 18 Supplement 6, 2017: Selected articles from the International Conference on Intelligent Biology and Medicine (ICIBM) 2016: genomics. The full contents of the supplement are available online at https://bmcgenomics.biomedcentral.com/articles/supplements/volume-18-supplement-6.
ZG proposed the study, designed the application, performed the data analysis, and drafted the manuscript. LL was involved in the application design and deployed the web service. JR conceived of the study, participated in its design and helped to draft the manuscript. All authors read and approved the final manuscript.
Ethics approval and consent to participate
Consent for publication
The authors declare that they have no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
1. Stormo GD, Schneider TD, Gold L, Ehrenfeucht A. Use of the ‘perceptron’ algorithm to distinguish translational initiation sites in E. coli. Nucleic Acids Res. 1982; 10(9):2997–3011.
2. Schneider TD, Stephens RM. Sequence logos: a new way to display consensus sequences. Nucleic Acids Res. 1990; 18(20):6097–100.
3. Schneider TD, Stormo GD, Gold L, Ehrenfeucht A. Information content of binding sites on nucleotide sequences. J Mol Biol. 1986; 188(3):415–31.
4. Bailey TL, Boden M, Buske FA, Frith M, Grant CE, Clementi L, Ren J, Li WW, Noble WS. MEME Suite: tools for motif discovery and searching. Nucleic Acids Res. 2009; 37(suppl_2):W202–W208.
5. Workman CT, Yin Y, Corcoran DL, Ideker T, Stormo GD, Benos PV. enoLOGOS: a versatile web tool for energy normalized sequence logos. Nucleic Acids Res. 2005; 33(suppl 2):389–92.
6. Crooks GE, Hon G, Chandonia JM, Brenner SE. WebLogo: a sequence logo generator. Genome Res. 2004; 14(6):1188–90.
7. Schneider TD. Consensus sequence zen. Appl Bioinforma. 2002; 1(3):111.
8. Schneider TD. Information theory primer with an appendix on logarithms (postscript file). Natl Cancer Inst. 2000; 26:1–7.
9. Zhu C, Byers KJ, McCord RP, Shi Z, Berger MF, Newburger DE, Saulrieta K, Smith Z, Shah MV, Radhakrishnan M, et al. High-resolution DNA-binding specificity analysis of yeast transcription factors. Genome Res. 2009; 19(4):556–66.
10. MacIsaac KD, Wang T, Gordon DB, Gifford DK, Stormo GD, Fraenkel E. An improved map of conserved regulatory sites for Saccharomyces cerevisiae. BMC Bioinforma. 2006; 7(1):1.
11. Mathelier A, Fornes O, Arenillas DJ, Chen C-y, Denay G, Lee J, Shi W, Shyr C, Tan G, Worsley-Hunt R, et al. JASPAR 2016: a major expansion and update of the open-access database of transcription factor binding profiles. Nucleic Acids Res. 2016; 44(D1):D110–D115.
12. Sandelin A, Alkema W, Engström P, Wasserman WW, Lenhard B. JASPAR: an open-access database for eukaryotic transcription factor binding profiles. Nucleic Acids Res. 2004; 32(suppl 1):91–4.