Figure 1 | BMC Genomics

Figure 1

From: Unsupervised genome-wide recognition of local relationship patterns

Saguaro’s workflow. (a) Saguaro takes genome-wide aligned nucleotide sites, in genomic order, as input. The filtering of non-informative nucleotides is configurable. A Hidden Markov Model (HMM) scores each site against the cacti and segments the genome by the best statistical fit. Each cactus is then re-computed from all the sites it represents. This step can be repeated to refine both the segmentation and the cacti. To augment the set of cacti, Saguaro trains one Self-Organizing Map (SOM) per cactus from all the sites residing in segments represented by that cactus. In each SOM, Saguaro finds the neuron that best represents its sites and takes the neuron farthest from it as the worst representative of those sites. Saguaro then picks the SOM with the largest distance between its best and worst neurons and hypothesises a new cactus from this worst neuron. This cactus is passed back to the HMM, which assigns segments to it. (b) A low-level schematic of how Saguaro processes input into output. During the SOM stage, input sites are translated into binary vectors of 1s (mismatches, black) and 0s (matches, white), relative to a randomly chosen genome that serves as the ‘reference’. The SOM is then presented with these vectors in random order, so that the continuous-valued vectors held by the neurons model the input space. The most common input pattern results in the highest density of neurons, whereas patterns not well modelled by these neurons form their own cluster (shown in red). These neurons are then expanded into a cactus and added to the HMM’s set. The HMM then re-segments the genome and re-trains the cacti. This process is repeated iteratively to build a set of cacti that model different subsets of the genome via their patterns.
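Two steps in the caption can be illustrated with a minimal Python sketch: the translation of aligned sites into binary mismatch vectors, and the choice of the best and worst neuron in one SOM. This is not Saguaro's code. The function names (`encode_site`, `best_and_worst_neuron`), the use of Euclidean distance, and the reading of "best represents its sites" as "smallest summed distance to all sites" are assumptions made for illustration, and the neuron weights are hypothetical values standing in for an already trained SOM.

```python
from math import sqrt


def encode_site(column, ref_index):
    """Encode one aligned site (one nucleotide per genome) as a 0/1 vector.

    0 marks a match with the reference genome, 1 marks a mismatch.
    """
    ref = column[ref_index]
    return [0 if base == ref else 1 for base in column]


def distance(a, b):
    """Euclidean distance between two equal-length vectors (an assumed metric)."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def best_and_worst_neuron(neurons, sites):
    """Return (best index, worst index, distance between them).

    Assumption: the best neuron has the smallest summed distance to all
    sites; the worst neuron is the one farthest from the best neuron.
    """
    totals = [sum(distance(n, s) for s in sites) for n in neurons]
    best = min(range(len(neurons)), key=totals.__getitem__)
    worst = max(range(len(neurons)),
                key=lambda i: distance(neurons[i], neurons[best]))
    return best, worst, distance(neurons[best], neurons[worst])


# Four aligned sites across four genomes; genome 0 plays the 'reference'.
columns = ["AAAA", "ACAA", "AAGG", "ACAA"]
sites = [encode_site(c, ref_index=0) for c in columns]

# Hypothetical neuron weights, as if a SOM had already been trained.
neurons = [
    [0.0, 0.9, 0.0, 0.0],
    [0.0, 0.1, 0.1, 0.0],
    [0.0, 0.0, 0.9, 1.0],
]

best, worst, gap = best_and_worst_neuron(neurons, sites)
print(sites)
print(best, worst, round(gap, 3))
```

Under these assumptions, the worst neuron is the one closest to the rare pattern shared by the last two genomes. Repeating the computation for every SOM and keeping the one with the largest `gap` would correspond to the step in which a new cactus is hypothesised.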
