Sequence logo and PWM
A PWM for DNA sequences has four rows, one per nucleotide, and multiple columns, one per TFBS position. Figure 1 shows an example of a PWM and its corresponding sequence logo. Here we denote the PWM as P. Each element $P_{ij}$ represents the probability of nucleotide i∈{A,C,G,T} at position j, and the probabilities in each column sum to 1. To convert a PWM to a sequence logo, the height of the whole letter stack at each column j is determined by the information content $I_j$ of the column, which is calculated by
$$\begin{array}{*{20}l} I_{j} &= H_{j, max}\left(P_{1}, P_{2}, P_{3}, P_{4}\right) - H_{j}\left(P_{1j}, P_{2j}, P_{3j}, P_{4j}\right) \\&= 2 + \sum_{i=1}^{4}P_{ij}\cdot log_{2}{P_{ij}} \end{array} $$
where $H_j$ represents the entropy of position j, $H_{j,max} = \log_2 4 = 2$ bits is the maximum possible entropy of a column, and indices 1 to 4 represent nucleotides A, C, G and T [3, 8]. The height of each nucleotide is then $P_{ij} \cdot I_j$. By convention, green, blue, yellow and red usually represent A, C, G and T, respectively.
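The forward direction, from a PWM column to letter heights, can be sketched in a few lines of Python. This is only an illustration of the formula above; the function names `column_information` and `letter_heights` are hypothetical and are not part of logo2PWM.

```python
import math

NUCLEOTIDES = "ACGT"

def column_information(probs):
    """Information content in bits of one PWM column: 2 + sum(p * log2(p))."""
    # Terms with p == 0 contribute nothing (the limit of p*log2(p) is 0).
    return 2.0 + sum(p * math.log2(p) for p in probs if p > 0)

def letter_heights(probs):
    """Height in bits of each letter in the stack: P_ij * I_j."""
    info = column_information(probs)
    return {letter: p * info for letter, p in zip(NUCLEOTIDES, probs)}

if __name__ == "__main__":
    # A column dominated by A: the stack is tall and A takes most of it.
    print(letter_heights([0.7, 0.1, 0.1, 0.1]))
    # A uniform column carries no information, so the stack has zero height.
    print(column_information([0.25, 0.25, 0.25, 0.25]))
```

Because every letter height is scaled by the same $I_j$, the heights of a column always sum to its information content.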
Reconstructing the probabilities of nucleotides A, C, G and T from the sequence logo is not as straightforward as it looks. Here we focus on the logo image of a single logo-column j.
First, directly measuring the proportion of each letter within the letter stack at position j does not work: the lower the letter stack, the lower the resolution, and the harder it is to measure the proportion of each letter's height. When one big letter dominates a position, as in logo-columns 1, 3, 4, 6 and 9 in Fig. 1, a difference of a few pixels in the bottom letters severely affects the accuracy. The problem is worse for columns that have two similarly sized letters and lower information content, such as columns 2, 5 and 7, where using the measured letter heights directly as probabilities would give the reconstructed PWM a much higher information content. The effect is even stronger when the image resolution is low.
Thus, we use the formula for information content to calculate the probabilities. When only one strong letter is present (its probability is denoted $p_{1st}$), we assume that the three weak letters have equal probability, $p_{weak} = (1 - p_{1st})/3$. $p_{1st}$ is then estimated from the information content (the height of the logo column). In this case, the information content is:
$$I = 2 + p_{1st}\cdot log_{2}p_{1st} + 3 \cdot p_{weak}\cdot log_{2}p_{weak} $$
Solving the above formula for $p_{1st}$ numerically is slow, so we pre-calculated a lookup table, B, from I to $p_{1st}$, with $p_{1st}$ ranging from 0 to 1 in steps of 0.01.
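A lookup table of this kind can be sketched as follows. The table is built over the full 0 to 1 grid as described, but the search is restricted to $p_{1st} \geq 0.25$. That restriction is an assumption of this sketch, not something stated for logo2PWM: the most probable of four letters cannot have a probability below 0.25, and I increases monotonically with $p_{1st}$ on that range, so the nearest entry is unambiguous. The names `info_one_strong`, `TABLE` and `lookup_p1` are hypothetical.

```python
import math

def info_one_strong(p1):
    """I = 2 + p1*log2(p1) + 3*pw*log2(pw), with pw = (1 - p1) / 3."""
    pw = (1.0 - p1) / 3.0
    total = 2.0
    for p in (p1, pw, pw, pw):
        if p > 0:  # p*log2(p) tends to 0 as p tends to 0
            total += p * math.log2(p)
    return total

# Pre-computed table of (p1, I) pairs for p1 = 0.00, 0.01, ..., 1.00.
TABLE = [(i / 100.0, info_one_strong(i / 100.0)) for i in range(101)]

def lookup_p1(info):
    """Return the tabulated p1 whose information content is closest to info."""
    # Assumption of this sketch: only p1 >= 0.25 is a valid dominant letter.
    candidates = [(p1, i) for p1, i in TABLE if p1 >= 0.25]
    return min(candidates, key=lambda entry: abs(entry[1] - info))[0]

if __name__ == "__main__":
    p1 = lookup_p1(1.2)  # a column measured at 1.2 bits
    print(p1, (1.0 - p1) / 3.0)
```

With a 0.01 grid, the recovered $p_{1st}$ is accurate to about half a grid step; a finer grid trades memory for precision.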
When there are two strong letters in the letter-column (such as logo-columns 2, 5 and 7 in Fig. 1), we consider the heights of both the top and the secondary letter, $h_{1st}$ and $h_{2nd}$, so that $p_{2nd} = p_{1st} \cdot h_{2nd} / h_{1st}$. Here we consider a letter with at least 3 pixel-lines to be a strong secondary letter, and we assume that the other two letters have the same probability: $p_{weak} = (1 - p_{1st} - p_{2nd})/2$. The probabilities can be obtained by solving the following equation:
$$I = 2 + p_{1st}\cdot log_{2}{ p_{1st}} + p_{2nd}\cdot log_{2}{ p_{2nd}} + 2 \cdot p_{weak} \cdot log_{2}{p_{weak}} $$
Concatenating the estimated probabilities for all positions gives the reconstructed PWM.
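The two-strong-letter equation has a single unknown, $p_{1st}$, once the height ratio $h_{2nd}/h_{1st}$ is fixed. A minimal sketch of one way to solve it is bisection; the choice of bisection and the search bounds are assumptions of this sketch, and the source does not state which solver logo2PWM uses. The bounds follow from requiring $p_{weak} \geq 0$ and $p_{weak} \leq p_{2nd}$, and I increases with $p_{1st}$ between them. The function names are hypothetical.

```python
import math

def info_two_strong(p1, ratio):
    """I = 2 + p1*log2(p1) + p2*log2(p2) + 2*pw*log2(pw), with p2 = ratio * p1."""
    p2 = ratio * p1
    pw = (1.0 - p1 - p2) / 2.0
    total = 2.0
    for p in (p1, p2, pw, pw):
        if p > 0:
            total += p * math.log2(p)
    return total

def solve_two_strong(info, h1, h2, iterations=60):
    """Estimate (p1, p2, p_weak) from column information and two letter heights."""
    ratio = h2 / h1
    lo = 1.0 / (1.0 + 3.0 * ratio)  # below this, p_weak would exceed p2
    hi = 1.0 / (1.0 + ratio)        # above this, p_weak would be negative
    for _ in range(iterations):
        mid = (lo + hi) / 2.0
        if info_two_strong(mid, ratio) < info:
            lo = mid
        else:
            hi = mid
    p1 = (lo + hi) / 2.0
    p2 = ratio * p1
    pw = max(0.0, (1.0 - p1 - p2) / 2.0)
    return p1, p2, pw

if __name__ == "__main__":
    # Round trip: build a column from known probabilities, then recover them.
    true_info = info_two_strong(0.5, 0.6)  # p1 = 0.5, p2 = 0.3, p_weak = 0.1
    print(solve_two_strong(true_info, 0.5 * true_info, 0.3 * true_info))
```

If the measured I falls outside the range reachable for the given ratio, the search simply converges to the nearer bound.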
Algorithm
Figure 2 shows the workflow of the main algorithm. First, the image pre-processing module converts the image file to a common three-channel RGB format. Currently supported file formats are ‘png’, ‘jpg’, ‘jpeg’ and ‘gif’; ‘png’ files with an alpha (transparency) channel are not supported.
Second, the algorithm determines whether the logo contains X and Y coordinate axes by looking for black pixels; we assume that the axes in the image are black. If the logo has X and Y axes, the algorithm determines the full length and height of the logo and then crops the pure logo area. If the logo image contains a Y axis, the height of that axis is used later to estimate the information content. If the logo image does not contain X and Y axes, the algorithm uses the logo boundary to crop the pure logo area and assumes that the highest letter has the maximum information content, 2 bits. The program then removes noise from the cropped logo region, such as the horizontal dashed lines of background patterns found in some logo images.
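The conversion from pixel heights to bits can be illustrated with a short sketch. It assumes that a detected Y axis spans exactly 0 to 2 bits and that, without an axis, the tallest letter stack stands for 2 bits; the first assumption is ours and the second is our reading of the text above. The function name `column_information_bits` and its parameters are hypothetical.

```python
def column_information_bits(stack_heights_px, y_axis_height_px=None, max_bits=2.0):
    """Convert letter-stack heights in pixels to information content in bits."""
    if y_axis_height_px:
        # Assumption: the Y axis covers the full 0..max_bits range.
        reference = y_axis_height_px
    else:
        # No axis found: treat the tallest stack as max_bits.
        reference = max(stack_heights_px)
    return [max_bits * height / reference for height in stack_heights_px]

if __name__ == "__main__":
    heights = [100, 50, 25]
    print(column_information_bits(heights))                        # no Y axis
    print(column_information_bits(heights, y_axis_height_px=200))  # with Y axis
```

Without an axis, the estimate is only as good as the assumption that at least one position in the logo is fully conserved.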
Third, based on the pure logo area, the algorithm determines the sub-image of each letter-column using several image processing steps:
a) Sum the matrix into a 1-D array along X, then count the number of peaks,
b) Vote for consensus gap distances and determine the letter-column width,
c) Use the letter width to determine the number of letter-columns in the logo, then fine-tune the letter width.
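The three steps above can be sketched on a toy binary mask (1 for letter pixels, 0 for background). This is a simplified stand-in rather than the logo2PWM implementation: it treats each run of inked columns as a peak, votes on the spacing between run starts, and omits the final fine adjustment. All function names are hypothetical.

```python
from collections import Counter

def column_profile(mask):
    """Step a: sum a 2-D 0/1 mask over rows to get a 1-D profile along X."""
    return [sum(column) for column in zip(*mask)]

def run_starts(profile):
    """Step a: X positions where ink begins after a blank column (one per peak)."""
    starts = []
    previous_blank = True
    for x, value in enumerate(profile):
        if value > 0 and previous_blank:
            starts.append(x)
        previous_blank = (value == 0)
    return starts

def consensus_width(starts):
    """Step b: most frequent spacing between consecutive peaks (needs >= 2 peaks)."""
    spacings = [b - a for a, b in zip(starts, starts[1:])]
    return Counter(spacings).most_common(1)[0][0]

def count_columns(profile, width):
    """Step c: number of letter-columns that fit in the inked span."""
    inked = [x for x, value in enumerate(profile) if value > 0]
    span = inked[-1] - inked[0] + 1
    return max(1, round(span / width))

if __name__ == "__main__":
    row = [1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1]  # three letters, 1-px gaps
    profile = column_profile([row, row])
    width = consensus_width(run_starts(profile))
    print(width, count_columns(profile, width))
```

Voting on spacings makes the width estimate robust to a few letters that touch or that contain internal gaps, since those produce outlier spacings that lose the vote.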
Lastly, for each letter-column image, the algorithm estimates the probabilities of letters A, C, G and T. There are two important sub-tasks: letter recognition and probability estimation. To determine the main letter(s) in the sub-image, we use a ‘nearest neighbor’-like algorithm to infer the letter from its color. We use the most common color code, in which green is A, blue is C, yellow is G and red is T. Six color central points are pre-determined to speed up the computation:
$$\begin{array}{llllllr}\textsf{black}: &&{\phantom{00}} \textsf{(0}, &&{\phantom{00}} \textsf{0}, &&{\phantom{00}} \textsf{0)}\\ \textsf{white}: &&{\phantom{00}} \textsf{(255}, &&{\phantom{00}} \textsf{255}, &&{\phantom{00}} \textsf{255)}\\ \textsf{red}: &&{\phantom{00}} \textsf{(200}, &&{\phantom{00}} \textsf{25}, &&{\phantom{00}} \textsf{32)}\\ \textsf{green}: &&{\phantom{00}} \textsf{(57}, &&{\phantom{00}} \textsf{178}, &&{\phantom{00}} \textsf{65)}\\ \textsf{blue}: &&{\phantom{00}} \textsf{(43}, &&{\phantom{00}} \textsf{60}, &&{\phantom{00}} \textsf{147)}\\ \textsf{yellow}: &&{\phantom{00}} \textsf{(240}, &&{\phantom{00}} \textsf{173}, &&{\phantom{00}} \textsf{10)}\\ \end{array} $$
Then, for each pixel in the image, the algorithm determines its color by finding the nearest color central point. White pixels are background, black pixels are axes and labels/marks, and pixels of the other four colors are the four nucleotides of the DNA sequence. We use consensus voting to determine the color of the letter sub-image. We chose not to use optical character recognition (OCR) for the letter recognition task because OCR is slow and inaccurate in this setting, mainly because of the uncommon font shapes and the variety of colors.
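The pixel classification and voting can be sketched as follows, using the six central points listed above. Squared Euclidean distance in RGB space is an assumption of this sketch, since the distance metric is not specified; `nearest_color` and `vote_letter` are hypothetical names.

```python
from collections import Counter

# The six pre-determined color central points (R, G, B).
CENTERS = {
    "black": (0, 0, 0),
    "white": (255, 255, 255),
    "red": (200, 25, 32),
    "green": (57, 178, 65),
    "blue": (43, 60, 147),
    "yellow": (240, 173, 10),
}
COLOR_TO_LETTER = {"green": "A", "blue": "C", "yellow": "G", "red": "T"}

def nearest_color(pixel):
    """Name of the central point closest to an (R, G, B) pixel."""
    def squared_distance(name):
        return sum((a - b) ** 2 for a, b in zip(pixel, CENTERS[name]))
    return min(CENTERS, key=squared_distance)

def vote_letter(pixels):
    """Majority vote over nucleotide-colored pixels; None if there are none."""
    votes = Counter(nearest_color(pixel) for pixel in pixels)
    letter_votes = {c: n for c, n in votes.items() if c in COLOR_TO_LETTER}
    if not letter_votes:
        return None
    return COLOR_TO_LETTER[max(letter_votes, key=letter_votes.get)]

if __name__ == "__main__":
    # Mostly background, some slightly off-green pixels, a little red noise.
    sub_image = [(255, 255, 255)] * 20 + [(60, 170, 70)] * 5 + [(200, 20, 30)] * 2
    print(vote_letter(sub_image))
```

Background and axis pixels are excluded from the vote, so a letter is still recognized when it covers only a small part of its sub-image.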
Implementation
Stand-alone application. The core functions of logo2PWM are written in MATLAB 8.6 with the Image Processing Toolbox. The complete MATLAB program takes the file name of the sequence logo image as input and writes three files to the same folder as the original logo image: the reconstructed PWM in .csv format, the reconstructed PWM in enologo format, and the Position-specific Scoring Matrix (PSSM) file for the MEME suite. The program also accepts an optional parameter, the ‘number of columns’, which gives it a higher chance of returning a good result.
The source code of the stand-alone application can be accessed at http://www.cs.utsa.edu/~jruan/logo2pwm_sa.
Web-based service. As shown in Fig. 3, the software architecture of the web-based service has four layers. From bottom to top, these layers are the MATLAB source code, the MATLAB Compiler Runtime executable, the web framework, and deployment.
logo2PWM is available at http://www.cs.utsa.edu/~jruan/logo2pwm.
Evaluation
We evaluate our algorithm by computing the correlation between estimated and true PWMs, and by visually comparing the original sequence logos with sequence logos regenerated from the estimated PWMs. Three systematic evaluations were performed on 1946 TFBS logos from Zhu et al. [9], MacIsaac et al. [10], and the JASPAR-2016 database [11, 12]: 179 sequence logo-PWM pairs from Zhu et al. [9], 124 from MacIsaac et al. [10], and 1643 available logo-PWM pairs from the JASPAR-2016 database.
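As an illustration of the correlation measure, the sketch below computes the Pearson correlation between the flattened entries of a true and an estimated PWM. The text does not say which correlation coefficient or which flattening is used, so both are assumptions here, and the example matrices are made up for demonstration only.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    # Undefined (division by zero) if either matrix is constant, e.g. all 0.25.
    return cov / math.sqrt(var_x * var_y)

def pwm_correlation(true_pwm, estimated_pwm):
    """Correlation over all entries of two 4 x L matrices (rows A, C, G, T)."""
    flat_true = [p for row in true_pwm for p in row]
    flat_est = [p for row in estimated_pwm for p in row]
    return pearson(flat_true, flat_est)

if __name__ == "__main__":
    # Made-up 4 x 2 example: rows are A, C, G, T; columns are positions.
    true_pwm = [[0.70, 0.10], [0.10, 0.60], [0.10, 0.20], [0.10, 0.10]]
    est_pwm = [[0.68, 0.12], [0.12, 0.58], [0.10, 0.20], [0.10, 0.10]]
    print(pwm_correlation(true_pwm, est_pwm))
```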