Two fundamental elements of word-based genomic signatures are created with the approach presented in [42, 43]. SIG1 identifies the set of statistically overrepresented words, while SIG2 represents sets of words from SIG1 that are similar to individual elements of SIG1, based on a specific distance measure.
The set SIG1 is computed as described in [42, 43], which is summarized as follows:
1. Identify maximally repeated words of length [m, n].
2. Remove low complexity words, redundant words, and words that are contained in repeat elements.
3. For each word compute a 'score' that characterizes the statistical overrepresentation of the word.
4. Select the words with the highest scores.
The set SIG2 is found by taking each of the elements of SIG1 and performing 'word clustering'. For each word w ∈ SIG1, this involves a two-step process:
1. Construct a set (cluster) of words from RGS that have a 'distance' of no more than h from word w. Hamming distance and edit distance are used for this step.
2. Construct a motif that characterizes the set of words found in step 1.
Word-based signature (SIG1)
As the foundation of the signature generation it is necessary to compute the set of distinct words $W$ in a set of input sequences $S$. In order to determine the statistical significance of $w \in W$ it is necessary to count the total number of occurrences of a given word $w$, $o_w$, as well as the number of sequences containing the word, $s_w$. The occurrence information is modelled as a set of tuples $(w, o_w, s_w)$. Assuming a binomial model for the distribution of words across the input sequences, it is possible to model the total occurrence of a word $w$ by introducing the random variable $X_w = \sum_{i=1}^{l-v+1} Y_i$, where $l$ is the complete sequence length, $v$ the length of $w$, and $Y_i$ a binary random variable indicating whether the word occurs at position $i$ or not, leading to a series of yes/no Bernoulli experiments. An expected value for the number of occurrences of a word $w$ can then be computed as $E[X_w] = (l - v + 1)\, p_w$, where $p_w$ is the probability of word $w$. Following a similar modelling approach, the expected number of sequences a word occurs in, $E[s_w]$, is derived. The actual probabilities are determined by a homogeneous Markov chain model of a specific order $m$. Based on the expected values we compute multiple scores for each word:
SlnSES: This scoring function enables the inclusion of sequence coverage into the score. A highly scored word occurs in a large percentage of sequences in the data set. It does not necessarily have to be highly significant if the overall number of occurrences is taken into account, but it is of particular use for the discovery of shared regulatory elements across multiple sequences.
p-Value: The p-value is defined as the probability of obtaining at least as many words as the actual observed number of words: $P(X_w \ge o_w) = \sum_{k=o_w}^{N} \binom{N}{k}\, p_w^{k} (1 - p_w)^{N-k}$ with $N = \sum_{j=1}^{|S|} (l_j - v + 1)$, where $|S|$ represents the number of sequences in $S$ and $l_j$ is the length of sequence $j$.
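The binomial model described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names are hypothetical, the word probability is passed in as a plain number (in the paper it comes from a Markov chain model), and the number of trials is assumed to be the sum of valid start positions over all sequences.

```python
from math import comb


def expected_occurrences(p_w, total_length, word_length):
    """Expected count of a word under the binomial model: (l - v + 1) * p_w."""
    positions = total_length - word_length + 1
    return positions * p_w


def binomial_p_value(observed, p_w, seq_lengths, word_length):
    """P(X >= observed) for X ~ Binomial(N, p_w).

    Assumption: N is the number of valid start positions summed over
    all sequences, N = sum_j (l_j - v + 1).
    """
    n = sum(max(length - word_length + 1, 0) for length in seq_lengths)
    # Sum the lower tail exactly and subtract from 1.
    # This is exact but only practical for small N.
    lower_tail = sum(
        comb(n, k) * p_w**k * (1.0 - p_w) ** (n - k) for k in range(observed)
    )
    return max(0.0, 1.0 - lower_tail)
```

For realistic data set sizes a numerically stable tail function from a statistics library would be preferable to the exact summation used here.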
Word-based clusters (SIG2)
Two methods are employed for the detection of similarities between the words that make up SIG1: Hamming distance and Levenshtein distance (also called edit distance). Hamming distance is defined as the number of positions at which the corresponding characters of two words of the same length differ. Edit distance allows the comparison of words of different lengths and accounts for three edit operations (insert, delete and substitute), rather than the plain mismatch (which corresponds to substitute) employed by the Hamming distance.
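The two distance measures and the clustering step can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the tool's code: the function names and the `use_edit_distance` switch are hypothetical, and a cluster is taken to be every candidate word within distance h of the seed word.

```python
def hamming_distance(a, b):
    """Number of mismatching positions between two equal-length words."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires words of equal length")
    return sum(x != y for x, y in zip(a, b))


def edit_distance(a, b):
    """Levenshtein distance with unit costs for insert, delete, substitute."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,              # delete x
                            curr[j - 1] + 1,          # insert y
                            prev[j - 1] + (x != y)))  # substitute or match
        prev = curr
    return prev[-1]


def word_cluster(seed, candidates, h, use_edit_distance=False):
    """Collect all candidate words within distance h of the seed word."""
    cluster = []
    for word in candidates:
        if use_edit_distance:
            d = edit_distance(seed, word)
        elif len(word) == len(seed):
            d = hamming_distance(seed, word)
        else:
            continue  # Hamming distance is undefined for unequal lengths
        if d <= h:
            cluster.append(word)
    return cluster
```

With a seed of `"ACGT"` and h = 1, the Hamming variant keeps only same-length words with at most one mismatch, while the edit-distance variant also admits words such as `"ACG"` that differ by a single deletion.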
The biological reasoning for employing distance metrics to group similar words together can be found in the evolution of sequences. A biological structure is constantly exposed to mutation pressure. These mutations can occur as insertions, deletions or substitutions. However, insertions and deletions are deleterious in most cases, so although edit distance provides a more detailed model of the mutations, Hamming distance is a reasonable abstraction and works well for this case. The motif logos for the Hamming distance clusters were constructed using the TFBS Perl module by Lenhard and Wasserman. ClustalW2 was used to align the words of the edit distance clusters.
The sequence clustering conducted in this research is focussed on the words shared between elements of a set of sequences. A set of words is taken as the input for the clustering. A binary vector $\mathbf{s}_i = (s_{i,1}, \ldots, s_{i,n})$ is created for each sequence $s_i$, where $n$ is the number of words used to distinguish the sequences and $s_{i,k}$, with $1 \le k \le n$, marks the $k$-th word. The element $s_{i,k}$ of the vector is populated with a '1' if the word $k$ is found in sequence $i$, and '0' if it is not. The similarity between sequences is determined by the dot product between the binary sequence vectors, and the distance is obtained by deducting this similarity from the complete number of words in the vector space. In order to determine the distance between $k$ sequences (with $k \ge 2$), the dot product is extended to accommodate multiple sequences.
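A minimal Python sketch of this word-sharing distance is given below. The function names are hypothetical, and the extension of the dot product to more than two sequences is assumed here to be the count of words present in every sequence of the group; the original extension may differ.

```python
from itertools import combinations


def binary_vector(sequence, words):
    """One entry per word: 1 if the word occurs in the sequence, else 0."""
    return [1 if word in sequence else 0 for word in words]


def shared_word_distance(vectors):
    """Distance among k >= 2 binary vectors.

    For k = 2 the shared count equals the dot product of the two vectors.
    Assumption: for k > 2 it is the number of words found in all sequences.
    """
    n = len(vectors[0])
    shared = sum(1 for column in zip(*vectors) if all(column))
    return n - shared


def closest_pair(vectors):
    """Index pair of the two sequences with the smallest distance."""
    return min(
        combinations(range(len(vectors)), 2),
        key=lambda pair: shared_word_distance([vectors[pair[0]], vectors[pair[1]]]),
    )
```

Two sequences that share every word have distance 0, and two sequences that share none have distance n.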
The cluster with the smallest distance is visualized using GMOD's GBrowse framework. For each of the sequences contained in the cluster, the words pertaining to SIG1 are displayed.
Biological significance (lookup)
Once genomic signatures are identified, the next step is to discern their biological role. One important aspect of this role, crucial to understanding gene regulation, is the location of the preferred binding sites for certain proteins (transcription factor binding sites or TFBSs). To locate these sites, the signatures are compared to a set of known binding sites, which are usually represented as weighted matrices. However, a simple scoring scheme can misclassify results when applied to the typically short sequences produced by signature finders. In this simple approach, short signatures are aligned to each matrix by ignoring the parts of the matrices that are longer than the signature. This results in erroneous scores since a signature could match just the very end of a large matrix, which is often of little significance (the core of the matrix generally represents the sites of strongest binding).
To give a more significant measure of similarity, we developed a tool that uses a window around the original sequences (those which the signature is based upon) to improve the comparison. The naive implementation of this approach is to use a window of base pairs around each signature and find the optimal alignment to each TFBS matrix by scoring every possible sub-sequence containing the signature. For instance, if a signature is located 10 times within the set of sequences, each matrix is aligned to each of the 10 loci containing the signature. Our tool uses a faster approach: it finds all occurrences of TFBSs meeting the desired threshold in every sequence, and subsequently uses this information to quickly score the signatures. As a benefit, the list of TFBSs can be reused to quickly score new signatures or to redo the analysis with interesting subsets of sequences, such as all sequences that are highly expressed in liver cells.
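The idea of scanning once and reusing the hit list can be sketched in Python as follows. This is only an illustration under assumptions: a matrix is represented as a list of per-position dictionaries mapping bases to scores, a hit is scored by summing column scores, and a signature is matched to a hit when the hit's span fully contains the signature locus. None of these details, nor the function names, are confirmed by the source.

```python
def scan_matrix(sequence, matrix, threshold):
    """Return (start, score) for every window scoring at least threshold.

    matrix is a list of dicts, one per position, mapping base -> score.
    Bases missing from a column (for example 'N') contribute 0.
    """
    width = len(matrix)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(col.get(base, 0) for col, base in zip(matrix, window))
        if score >= threshold:
            hits.append((start, score))
    return hits


def best_hit_covering(hits, width, sig_start, sig_length):
    """Best precomputed hit whose span fully contains the signature locus.

    Returns (score, start) of the best covering hit, or None if no hit
    covers the signature. The hit list is computed once per sequence and
    can be reused for any number of signatures.
    """
    sig_end = sig_start + sig_length
    covering = [
        (score, start)
        for start, score in hits
        if start <= sig_start and sig_end <= start + width
    ]
    return max(covering) if covering else None
```

Because `scan_matrix` runs once per sequence and matrix, scoring a new signature reduces to a lookup over the stored hits rather than a fresh alignment.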
The co-occurrence analysis aims to determine the expected number of sequences containing a given pair of not necessarily distinct words at least once. If $n$ denotes the word length, $m$ the number of sequences, $p_i$ the probability for a word $i$ to occur anywhere in the sequence, and $l_k$ the length of sequence $k$, the expected number of sequences containing a given pair of words can be calculated from these quantities.
score is used as the main scoring function in the co-occurrence analysis.
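As a rough illustration of the expected-count idea in the co-occurrence analysis, the Python sketch below sums, over all sequences, the probability that both words occur at least once. It rests on simplifying assumptions that are not taken from the source: the two words are distinct, their occurrences are independent, and each start position is an independent Bernoulli trial. The function names are hypothetical.

```python
def prob_at_least_once(p, seq_length, word_length):
    """Probability that a word with per-position probability p occurs
    at least once among the valid start positions of one sequence."""
    positions = max(seq_length - word_length + 1, 0)
    return 1.0 - (1.0 - p) ** positions


def expected_cooccurrence(p_i, p_j, seq_lengths, word_length):
    """Expected number of sequences containing both words at least once.

    Assumption: words i and j are distinct and occur independently.
    A pair of identical words would need a different treatment.
    """
    return sum(
        prob_at_least_once(p_i, length, word_length)
        * prob_at_least_once(p_j, length, word_length)
        for length in seq_lengths
    )
```

Comparing the observed number of sequences containing both words against such an expectation is what allows a pair to be called overrepresented.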
Sequence conservation was mapped using PhastCons conservation scores calculated on 28 species, which are based on a two-state (conserved vs. non-conserved) phylo-HMM. PhastCons scores were obtained from the UCSC Human Genome Browser. The scores reported by the UCSC Human Genome Browser are transformed log-odds scores ranging from 0 to 1000. Conserved regions were required to cover the majority of the word length.
Words can have significantly different scores for each of the data sets in which they occur. In order to analyze the words based on their impact on the data sets it is useful to assign a distance metric that determines which data set is described best by a given word. Plotting the score of a word in one data set against its score in the other gives a point, and a graphical analysis yields three points of interest on the main diagonal: the point where the perpendicular to the x-axis through the given point crosses the main diagonal, the point where the perpendicular to the main diagonal through the given point crosses the main diagonal, and the point where the perpendicular to the y-axis through the given point crosses the main diagonal. Based on the conventional techniques of fold-change detection in microarray analysis, we consider the perpendicular to the main diagonal. The resulting distance formula is:
$d = \frac{|y_0 - x_0|}{\sqrt{2}}$, with $y_0$ being the score for the word within the unidirectional data set, and $x_0$ being the score of the word in the bidirectional data set.
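The perpendicular distance of a point to the main diagonal follows from elementary geometry and can be sketched in Python as below. The function names are hypothetical, and the sign convention of the signed variant (positive when the unidirectional score is higher) is an assumption made here for illustration.

```python
from math import sqrt


def signed_diagonal_distance(x0, y0):
    """Signed perpendicular distance of (x0, y0) from the line y = x.

    Assumed convention: positive when y0 (unidirectional score) exceeds
    x0 (bidirectional score), negative otherwise.
    """
    return (y0 - x0) / sqrt(2)


def diagonal_distance(x0, y0):
    """Unsigned perpendicular distance of (x0, y0) from the line y = x."""
    return abs(signed_diagonal_distance(x0, y0))
```

A word with equal scores in both data sets lies on the diagonal and has distance 0; the sign of the signed variant indicates which data set the word scores higher in.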