A singular value decomposition approach for improved taxonomic classification of biological sequences
- Anderson R Santos†^{1},
- Marcos A Santos†^{2},
- Jan Baumbach^{3},
- John A McCulloch^{1},
- Guilherme C Oliveira^{4},
- Artur Silva^{5},
- Anderson Miyoshi^{1} and
- Vasco Azevedo^{1}
https://doi.org/10.1186/1471-2164-12-S4-S11
© Santos et al; licensee BioMed Central Ltd. 2011
Published: 22 December 2011
Abstract
Background
Singular value decomposition (SVD) is a powerful technique for information retrieval; it helps uncover relationships between elements that are not prima facie related. SVD was initially developed to reduce the time needed for information retrieval and analysis of very large data sets in the complex internet environment. Since information retrieval from large-scale genome and proteome data sets has a similar level of complexity, SVD-based methods could also facilitate data analysis in this research area.
Results
We found that SVD applied to amino acid sequences demonstrates relationships and provides a basis for producing clusters and cladograms, demonstrating evolutionary relatedness of species that correlates well with Linnaean taxonomy. The choice of a reasonable number of singular values is crucial for SVD-based studies. We found that fewer singular values are needed to produce biologically significant clusters when SVD is employed. Subsequently, we developed a method to determine the lowest number of singular values and fewest clusters needed to guarantee biological significance; this system was developed and validated by comparison with Linnaean taxonomic classification.
Conclusions
By using SVD, we can reduce uncertainty concerning the appropriate rank value necessary to perform accurate information retrieval analyses. In tests, clusters that we developed with SVD perfectly matched what was expected based on Linnaean taxonomy.
Background
We developed a methodology, based on singular value decomposition (SVD), for improved inference of evolutionary relationships between amino acid sequences of different species [1]. SVD produces a revised distance matrix for a set of related elements. Our SVD-based computations provide results that are close to the internationally accepted scientific gold standard, Linnaean taxonomy.
The reason we chose this methodology is the proven capacity that SVD has to establish non-obvious, relevant relationships among clustered elements [2][3][4][5], providing a deterministic method for grouping related species. A distance matrix derived from SVD can be used by cladogram software to produce a "phylogenetic tree", yielding a visual overview of the relationships. We compared species grouping by this method with Linnaean taxonomy grouping and found that the species clusters were similar.
The justification for using only D_{ k } is that it has k rows instead of the m rows of A_{ k }, so D_{ k } is made up of linear combinations from U_{ k } columns, which in turn provides the relationship A ≈ A_{ k } ≈ D_{ k }.
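For reference, the rank-k notation used above can be written out explicitly. This is only a sketch: it assumes the standard truncated-SVD convention described by Eldén [3][4], with A an m × n matrix, and the explicit form of D_{ k } shown here is an assumption based on that convention.

```latex
A = U \Sigma V^{T}, \qquad
A_{k} = U_{k} \Sigma_{k} V_{k}^{T} = U_{k} D_{k}, \qquad
D_{k} = \Sigma_{k} V_{k}^{T} = U_{k}^{T} A_{k}
% Dimensions: A and A_k are m x n, U_k is m x k, D_k is k x n.
```

Under this assumption each column of D_{ k } holds the k coordinates of one species in the basis formed by the columns of U_{ k }, which is why distances can be computed on D_{ k } rather than on the larger A_{ k }.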
The quality of the clusters that were generated was measured by the number of Linnaean taxonomy levels each species within the cluster bore in common with the other species; this was calculated as a function of an increasing rank value. When certain rank values are reached, larger values do not improve cluster quality, because there is no increase in taxonomic levels that the species have in common; in some cases a decrease is observed. The cluster quality obtained from a certain rank value maintains the number of shared common Linnaean taxonomy levels constant. This is evidence that there is an intrinsic relationship between these species that is mirrored in the distance matrix derived from these clusters; this quality helps build relevant cladograms.
Results and discussion
Singular value decomposition and number of clusters matter
Using the distance matrix that correctly separated the Aves cluster: K-Means compared to ASAP
| | Number of species joined by clusters | | | | Linnaean Taxonomy levels in common by clusters | | | | common Linnaean taxonomy levels frequency (cLtlf) by cluster | | | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Cluster | K-means with rank 60 | SNJ with rank 60 | K-means with rank 09 | SNJ with rank 09 | K-means with rank 60 | SNJ with rank 60 | K-means with rank 09 | SNJ with rank 09 | K-means with rank 60 | SNJ with rank 60 | K-means with rank 09 | SNJ with rank 09 |
1 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 100 | 100 | 100 | 100 |
2 | 14 | 27 | 14 | 25 | 10 | 9 | 10 | 9 | 140 | 243 | 140 | 225 |
3 | 4 | 1 | 4 | 7 | 12 | 13 | 12 | 8 | 48 | 13 | 48 | 56 |
4 | 7 | 17 | 4 | 7 | 8 | 8 | 10 | 8 | 56 | 136 | 40 | 56 |
5 | 2 | 2 | 9 | 2 | 13 | 11 | 9 | 12 | 26 | 22 | 81 | 24 |
6 | 6 | 1 | 4 | 4 | 10 | 13 | 10 | 10 | 60 | 13 | 40 | 40 |
7 | 5 | 1 | 6 | 4 | 9 | 13 | 10 | 12 | 45 | 13 | 60 | 48 |
8 | 11 | 1 | 8 | 1 | 9 | 13 | 8 | 13 | 99 | 13 | 64 | 13 |
Inferring quality from clustering methods
Algorithm/ software | Rank | N | Min cLtlf | Max cLtlf | Mean cLtlf | cLtlf clusters sum (∑cLtlf) | cLtlf standard deviation (σ) | Linnaean clusters quality (∑cLtlf/σ) | Linnaean clusters quality gain (K09/K60)% | cLtlf median | Median clusters quality gain (K09/K60)% |
---|---|---|---|---|---|---|---|---|---|---|---|
AQBC-javaml | K09 | 8 | 32 | 180 | 71.25 | 570 | 52.27 | 10.90 | 49.58% | 42.50 | 26.87% |
AQBC-javaml | K60 | 8 | 0 | 220 | 64.38 | 515 | 70.64 | 7.29 | | 33.50 | |
EM-weka | K09 | 8 | 40 | 120 | 70.12 | 561 | 31.53 | 17.79 | 48.99% | 57.00 | 1.79% |
EM-weka | K60 | 8 | 16 | 160 | 70.25 | 562 | 47.06 | 11.94 | | 56.00 | |
Kmeans-weka | K09 | 8 | 30 | 180 | 69.38 | 555 | 46.70 | 11.88 | 9.26% | 61.50 | -2.38% |
Kmeans-weka | K60 | 8 | 16 | 180 | 69.88 | 559 | 51.39 | 10.88 | | 63.00 | |
Kmeans-R | K09 | 8 | 40 | 140 | 71.62 | 573 | 34.48 | 16.62 | 9.21% | 62.00 | 6.90% |
Kmeans-R | K60 | 8 | 26 | 140 | 71.75 | 574 | 37.72 | 15.22 | | 58.00 | |
K-Medoids-R | K09 | 8 | 24 | 160 | 70.12 | 561 | 44.37 | 12.64 | 15.92% | 60.00 | 13.21% |
K-Medoids-R | K60 | 8 | 26 | 180 | 68.50 | 548 | 50.24 | 10.91 | | 53.00 | |
MDBC-weka | K09 | 8 | 30 | 180 | 69.38 | 555 | 46.70 | 11.88 | 9.26% | 61.50 | -2.38% |
MDBC-weka | K60 | 8 | 16 | 180 | 69.88 | 559 | 51.39 | 10.88 | | 63.00 | |
ASAP-in house | K09 | 8 | 13 | 225 | 70.25 | 562 | 67.68 | 8.30 | 27.51% | 52.00 | 197.14% |
ASAP-in house | K60 | 8 | 13 | 243 | 69.12 | 553 | 84.92 | 6.51 | | 17.50 | |
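
The gain columns in the table above are not defined explicitly. The reported percentages appear consistent with a simple relative change of the rank 9 value over the rank 60 value, and the following minimal sketch checks that reading against the AQBC-javaml rows. The helper name `percent_gain` and the formula itself are assumptions made for illustration.

```python
# Values copied from the AQBC-javaml rows of the table above.
sum_k09, sd_k09, median_k09 = 570, 52.27, 42.50
sum_k60, sd_k60, median_k60 = 515, 70.64, 33.50


def percent_gain(new, old):
    """Relative change of `new` over `old`, in percent (assumed formula)."""
    return (new / old - 1.0) * 100.0


# Linnaean clusters quality is the cLtlf sum divided by its standard deviation.
quality_k09 = sum_k09 / sd_k09
quality_k60 = sum_k60 / sd_k60

quality_gain = percent_gain(quality_k09, quality_k60)
median_gain = percent_gain(median_k09, median_k60)

print(f"quality K09={quality_k09:.2f}, K60={quality_k60:.2f}, gain={quality_gain:.2f}%")
print(f"median gain={median_gain:.2f}%")
```

Under this reading, a positive gain means that the smaller rank produced clusters of higher quality than rank 60.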
In the remainder of this paper, we show preliminary findings and methods that helped us reach our final conclusions, including how we arrived at an adequate number of singular values that allowed us to separate a set of species into groups with biological significance. To this end, we found that using arrays of trigram frequencies of amino acids to determine statistical properties was as good as using 4-gram frequencies [19]. We show that the size of the sequences that are analyzed can affect the separation of elements into clusters. We also present measures that allow us to infer the biological significance of a cluster and measure the quality of the clustering methods compared to Linnaean taxonomic classification of species.
Algorithm kdcSearch: parameterizing rank and number of partitions
Linnaean taxonomy levels
Linnaean Taxonomy levels | ||
---|---|---|
Number | Name | Value |
14 | Species | Aythya americana |
13 | Genus | Aythya |
12 | Familia | Anatidae |
11 | Ordo | Anseriformes |
10 | Subclassis | Carinatae |
9 | Classis | Aves |
8 | Infraphylum | Gnathostomata |
7 | Subphylum | Vertebrata |
6 | Phylum | Chordata |
5 | Cladus2 | Deuterostomia |
4 | Cladus1 | Bilateria |
3 | Subregnum | Eumetazoa |
2 | Regnum | Animalia |
1 | Superregnum | Eukaryota |
Function Finalize: sample data
06clusters k03 | 06clusters k06 | 08clusters k06 | 08clusters k09 | 08clusters k12(-) | 08clusters k45 | 10clusters k30(-) | 12clusters k12 | 14clusters k18 | 14clusters k21 | 14clusters k36(-) | 14clusters k60 |
---|---|---|---|---|---|---|---|---|---|---|---|
100 | 100 | 100 | 100 | 248 | 100 | 144 | 100 | 100 | 100 | 88 | 100 |
243 | 243 | 200 | 225 | 13 | 243 | 252 | 216 | 240 | 250 | 240 | 220 |
45 | 64 | 56 | 56 | 180 | 13 | 13 | 13 | 13 | 13 | 13 | 13 |
96 | 100 | 13 | 56 | 30 | 136 | 13 | 64 | 90 | 88 | 96 | 112 |
13 | 13 | 100 | 24 | 13 | 22 | 56 | 22 | 22 | 22 | 22 | 22 |
40 | 40 | 45 | 40 | 32 | 13 | 13 | 13 | 13 | 13 | 20 | 20 |
| | | 24 | 48 | 13 | 13 | 13 | 16 | 16 | 13 | 13 | 13 |
| | | 40 | 13 | 13 | 13 | 13 | 24 | 24 | 13 | 13 | 24 |
| | | | | | | 13 | 40 | 40 | 30 | 13 | 13 |
| | | | | | | 13 | 48 | 13 | 13 | 13 | 13 |
| | | | | | | | 13 | 13 | 13 | 13 | 13 |
| | | | | | | | 13 | 13 | 13 | 13 | 13 |
| | | | | | | | | 13 | 13 | 13 | 13 |
| | | | | | | | | 13 | 13 | 13 | 13 |
Function Finalize: sample statistics
ASAP/ Clusters | Rank | N | Min cLtlf | Max cLtlf | Mean cLtlf | cLtlf clusters sum (ΣcLtlf) | cLtlf standard deviation (σ) | Linnaean clusters quality (ΣcLtlf/σ) | cLtlf median |
---|---|---|---|---|---|---|---|---|---|
06clusters | K03 | 6 | 13 | 243 | 89.50 | 537 | 82.46 | 6.51 | 70.50 |
06clusters | K06 | 6 | 13 | 243 | 93.33 | 560 | 80.81 | 6.93 | 82.00 |
08clusters | K06 | 8 | 13 | 200 | 72.25 | 578 | 60.65 | 9.53 | 50.50 |
08clusters | K09 | 8 | 13 | 225 | 70.25 | 562 | 67.68 | 8.30 | 52.00 |
08clusters(-) | K12 | 8 | 13 | 248 | 67.75 | 542 | 92.41 | 5.87 | 21.50 |
08clusters | K45 | 8 | 13 | 243 | 69.12 | 553 | 84.92 | 6.51 | 17.50 |
10clusters(-) | K30 | 10 | 13 | 252 | 54.30 | 543 | 81.02 | 6.70 | 13.00 |
12clusters | K12 | 12 | 13 | 216 | 48.50 | 582 | 59.10 | 9.85 | 23.00 |
14clusters | K18 | 14 | 13 | 240 | 44.50 | 623 | 63.29 | 9.84 | 14.50 |
14clusters | K21 | 14 | 13 | 250 | 43.36 | 607 | 66.12 | 9.18 | 13.00 |
14clusters(-) | K36 | 14 | 13 | 240 | 41.64 | 583 | 63.66 | 9.16 | 13.00 |
14clusters | K60 | 14 | 13 | 220 | 43.00 | 602 | 60.68 | 9.92 | 13.00 |
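
As a worked check of how the statistics in this table relate to the sample data, the sketch below recomputes the "08clusters K09" row from the eight cLtlf values listed for that configuration. The function name `summarize` is hypothetical, and the use of the sample (n - 1) standard deviation is an assumption that happens to reproduce the reported value.

```python
import statistics

# cLtlf values of the eight clusters obtained with rank 9 (column "08clusters k09").
cltlf_values = [100, 225, 56, 56, 24, 40, 48, 13]


def summarize(values):
    """Summary statistics in the layout of the table above (hypothetical helper)."""
    total = sum(values)
    sd = statistics.stdev(values)  # sample standard deviation, divisor n - 1
    return {
        "n": len(values),
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "sum": total,
        "sd": sd,
        "quality": total / sd,  # Linnaean clusters quality
        "median": statistics.median(values),
    }


stats = summarize(cltlf_values)
for name, value in stats.items():
    print(f"{name}: {value:.2f}")
```

A low standard deviation relative to the sum rewards partitions whose clusters have similar cLtlf values, which is why the quality score penalizes configurations dominated by one very large cluster.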
From 76 to 60 species and eight clusters
Analyses were then carried out on only the 60 species from the data set that were joined as a single cluster; the ASAP algorithm was run with 15 clusters and a rank value of 39. When the ASAP algorithm was run with the original 64 species data set, some elements were separated into isolated clusters despite actually sharing several Linnaean taxonomy levels in common with all of the other species.
Eight clusters from 60 data set
Cluster | Number of species joined | Linnaean taxonomy levels in common | Deepest Linnaean taxonomy level |
---|---|---|---|
1 | 10 | 10 | Carinatae |
2 | 25 | 9 | Mammalia |
3 | 7 | 8 | Gnathostomata |
4 | 7 | 8 | Gnathostomata |
5 | 2 | 12 | Hominidae |
6 | 4 | 10 | Elasmobranchii |
7 | 4 | 12 | Salmonidae |
8 | 1 | 13 | Rattus |
Conclusions
Clusters and cladistic trees drawn from distance matrices generated with SVD showed a good correlation with Linnaean taxonomy. Considering the best estimate, a difference, when found, does not necessarily indicate a strong divergence from taxonomic methods; it may instead give a more accurate picture of the relationship between the species that clustered together. This was demonstrated by clusters that were separated from mammalian clusters due to their greater protein sequence relatedness, and it was reinforced by Linnaean taxonomy information.
The similarity between the clusters generated by our distance matrix and Linnaean taxonomy indicates that distance matrices generated by SVD can demonstrate evolutionary relationships among species and can be used to construct better quality clusters and phylogenetic trees. These clusters and phylogenetic trees benefit from amino acid trigrams and from the property of Euclidean distance of being proportional to the number of edits needed to perform a global sequence alignment, all within a polynomial execution time.
Methods
Datasets
The set of species used in this work is not original [8]. We opted to use a previously known data set to allow comparisons with other studies that use the same data. We named this set of 13 mitochondrial proteins from 64 vertebrate species dataset1. Within dataset1, a group of 10 species belonging to the class Aves was chosen as the positive control group. We built a negative control group from the mitochondrial proteins of 12 other species. Figure 1 schematically represents dataset2 as a data set composed of dataset1 and these 12 additional species. The 12 additional species were selected based on the criterion of being at least one level above the Linnaean level common to all of the species in dataset1. Two species were randomly selected for each Linnaean taxonomic level, from Phylum to Superregnum. The same 13 mitochondrial proteins used in dataset1 were selected for these 12 additional species, and the additional amino acid sequences were obtained from the NCBI site. The union of these sequences with those in dataset1 yielded dataset2, which includes both positive and negative control groups of species. For a partitioning method to be successful, the positive control group needs to stay together in one partition, and no other partition can be contaminated by the negative control group.
Positive control group and statistics
To show how rank values and the number of imposed clusters affect SVD, we ran the ASAP algorithm with different rank values and numbers of clusters. Figure 5 shows the results of these runs for a single cluster, named cluster 1, which contains the species belonging to the Linnaean class Aves. This taxon is ideal for testing our hypothesis, because only a few, closely related species in our data belong to it. Furthermore, the Aves species in our data set tended to mix with less evolutionarily related species when the algorithm was incorrectly calibrated or the number of clusters was too small. To evaluate the quality of a generated cluster, we multiplied the number of Linnaean taxa shared by all clustered elements by the number of clustered elements. This indicator gives a good measure of cluster quality, as it assesses the frequency of commonality within the cluster. We named this indicator "common Linnaean taxonomy level frequency", or cLtlf, and used it to show how cluster quality varies as a function of the rank value or the maximum number of clusters used. Figure 5 shows the quality of cluster 1 generated by the algorithm as the rank value increases, for different numbers of clusters used to group the entire 76 species data set.
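The cLtlf indicator described above can be sketched in a few lines. This is an illustration only: the function names are hypothetical, and the lineages are shortened to five levels for readability instead of the 14 Linnaean levels considered in this study.

```python
def common_levels(lineages):
    """Count the leading taxonomy levels shared by every lineage in a cluster."""
    shared = 0
    for names in zip(*lineages):  # walk the levels from the most general downwards
        if len(set(names)) != 1:
            break
        shared += 1
    return shared


def cltlf(lineages):
    """cLtlf = levels in common multiplied by the number of clustered species."""
    return common_levels(lineages) * len(lineages)


# Shortened, illustrative lineages (superregnum, regnum, phylum, classis, ordo).
duck = ["Eukaryota", "Animalia", "Chordata", "Aves", "Anseriformes"]
fowl = ["Eukaryota", "Animalia", "Chordata", "Aves", "Galliformes"]
primate = ["Eukaryota", "Animalia", "Chordata", "Mammalia", "Primates"]

birds_only = cltlf([duck, fowl])
mixed = cltlf([duck, fowl, primate])
print(birds_only, mixed)
```

The same product appears in the result tables: a cluster of 27 species sharing 9 levels has a cLtlf of 243.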
Figure 5 shows that, independent of the maximum number of clusters chosen to represent the 76 species data set, an increase in rank value does not improve cluster quality; consequently, we can safely use a considerably smaller number of singular values than the theoretical maximum. It is possible to roughly estimate an optimal value for rank value from this particular data set. If we consider 15 clusters, a rank value over 39 will not dramatically increase the quality of each cluster (Figure 5).
When we evaluate cluster quality as measured by cLtlf (Figure 5), we see that there is no significant improvement in cluster quality beyond the rank value of 39. This rank is sufficient for a good representation of our original data set. For cluster 1, the number of elements clustered together and the number of Linnaean taxonomy levels in common, as functions of the rank value, can be seen in Figures 6 and 7, respectively. The maximum number of Linnaean taxonomy levels in common obtained within cluster 1 was 10. Table 3 offers another interpretation of this graph, relating these 10 levels in common within the cluster to the 14 Linnaean taxonomy levels considered in our study. This shows that the stringency of the data representation provided by SVD is sufficient to infer Linnaean taxonomy levels. On the other hand, if a less stringent fit is used, such as an inappropriate number of clusters and rank value, a panoply of unrelated species is included in a cluster. It must be pointed out that our main task in this study was to learn and exemplify the calibration of our algorithm in order to retrieve the desired information. With the data set we used, the information to be retrieved was Linnaean taxa; with other data sets, this calibration should be tuned to the desired objective.
Table 3 characterizes a bird species, Aythya americana. The taxonomy levels shared by cluster 1 species in our algorithm executions with 20, 25, and 30 clusters and rank value 24, are levels lower than level number 11, namely the order (Ordo). Levels numbered as 11 (order) and 12 (family) were not shared among the 10 bird species in the data set. As more non-Aves species are added to this bird set, there is a decrease in cluster quality.
Euclidean distance
We can produce a distance matrix that contains a measure of how each species is related to every other species. To construct this matrix, the set of rank values of each species is treated as a vector in a k-dimensional space. One can choose the most suitable measure of distance between vectors, depending on the particular characteristics of a data set. We decided to use Euclidean distance instead of the cosine distance used by Stuart [8], because there is data indicating that Euclidean distance produces better cluster quality results than cosine distance. There is evidence [20], obtained with the same 64 species data set that we present here, that Euclidean distance is proportional to the number of edits needed to perform a global sequence alignment. Consequently, it gives a more accurate measure of evolutionary relatedness than cosine distance, without the need for a global sequence alignment. There is evidence that the superiority of the Euclidean distance calculation is due to intrinsic evolutionary differences that affect the length of the vectors. This is easy to see when one considers two vectors with the same cosine distance but with significant differences in length.
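The last point can be illustrated with two vectors that point in the same direction but differ in length. The sketch below uses made-up three-dimensional vectors, not data from this study, and the function names are chosen for illustration.

```python
import math


def euclidean_distance(u, v):
    """Straight-line distance between two vectors of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_distance(u, v):
    """One minus the cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


# Made-up vectors: same direction, but the second is twice as long.
short_vector = [1.0, 2.0, 3.0]
long_vector = [2.0, 4.0, 6.0]

cos_d = cosine_distance(short_vector, long_vector)
euc_d = euclidean_distance(short_vector, long_vector)
print(f"cosine distance: {cos_d:.6f}")
print(f"Euclidean distance: {euc_d:.6f}")
```

Cosine distance treats the two vectors as identical (up to floating-point rounding), whereas Euclidean distance still separates them, so any information carried by vector length is retained.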
ASAP algorithm: in house agglomerative clustering
We implemented a clustering algorithm called ASAP (As Simple As Possible) and showed that even a naive algorithm can benefit from data adequately treated by SVD. Thus, it is not our intention to demonstrate the merit of this particular clustering algorithm; rather, our message is that, regardless of the algorithm, it is worth using SVD combined with positive controls in information retrieval, as an initial filter against noise [10][18].
ASAP is an algorithm designed to facilitate measuring the impact of SVD on clustering algorithms. It somewhat resembles single-linkage clustering, with two differences: clustering does not start from the two elements with the lowest Euclidean distance but from a random element, and no new entry is inserted in the matrix of Euclidean distances for each cluster created between the algorithm iterations.
(1) Repeat as long as the number of columns in the distance matrix is greater than one:
1.1. Fix the first column as the pivotal element;
1.2. Create a cluster of the elements whose Euclidean distance to the pivotal element is smaller than a value 'd';
1.3. Remove the elements of the new cluster (rows and columns) from the distance matrix;
1.4. End repeat.
This algorithm was implemented using Scilab 5.2.1 run on Ubuntu GNU/Linux, kernel 2.6.22-16. This implementation is available in Additional file 2, accompanied by data and raw results.
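The steps listed above can be sketched as follows. This is a minimal Python illustration and not the Scilab implementation from Additional file 2: the function name `asap_clusters`, the parameter `threshold` (the value 'd'), and the decision to place a single leftover element in its own cluster are assumptions. Instead of deleting rows and columns, the sketch keeps a list of the indices that remain in the matrix, which has the same effect.

```python
def asap_clusters(distances, labels, threshold):
    """Greedy pivot clustering on a symmetric distance matrix (illustrative sketch)."""
    remaining = list(range(len(labels)))  # indices still present in the matrix
    clusters = []
    while len(remaining) > 1:
        pivot = remaining[0]  # step 1.1: the first remaining column is the pivot
        # Step 1.2: the pivot plus every element closer to it than the threshold.
        members = [i for i in remaining
                   if i == pivot or distances[pivot][i] < threshold]
        clusters.append([labels[i] for i in members])
        # Step 1.3: drop the clustered elements from further consideration.
        remaining = [i for i in remaining if i not in members]
    if remaining:  # assumption: a single leftover element forms its own cluster
        clusters.append([labels[remaining[0]]])
    return clusters


# Made-up distances between four hypothetical species.
labels = ["sp1", "sp2", "sp3", "sp4"]
distances = [
    [0.0, 1.0, 5.0, 6.0],
    [1.0, 0.0, 4.5, 5.5],
    [5.0, 4.5, 0.0, 1.5],
    [6.0, 5.5, 1.5, 0.0],
]

loose = asap_clusters(distances, labels, threshold=2.0)
strict = asap_clusters(distances, labels, threshold=1.0)
print(loose)
print(strict)
```

Because each pass only compares elements with the current pivot, the resulting partition depends on the order of the columns as well as on the threshold, which is one reason a positive control group is useful for calibration.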
Clustering algorithms evaluated
K-Means-R
The K-Means algorithm implemented [11] in the R statistical software aims to partition points into k groups such that the sum of squares from points to the assigned cluster centers is minimized. At the minimum, all cluster centers are at the mean of the set of data points which are nearest to the cluster center [16].
K-Means-WEKA
The K-Means algorithm implemented in the WEKA software is denominated SimpleKMeans. This implementation can use either the Euclidean distance or the Manhattan distance. If the Manhattan distance is used, then centroids are computed as the component-wise median rather than mean [15].
Expectation Maximization (EM)
The EM algorithm [12] creates partitions by assigning a probability distribution to each instance. EM can decide how many clusters to create by cross validation, or the number of clusters to generate can be specified a priori [15].
Adaptive Quality-based Clustering Algorithm (AQBC)
AQBC is a heuristic, iterative two-step algorithm with approximately linear computational complexity. The first step consists of finding a sphere in the high-dimensional representation of the data where the density of expression profiles is locally maximal. In the second step, an optimal radius of the cluster is calculated based only on the significantly coexpressed items that are included in the cluster. Because the radius is inferred from the data itself, there is no need to find an optimal value for it manually by trial and error [13].
K-Medoids
This is an exact algorithm based on a binary linear programming formulation of the optimization problem [21], using 'lp' from the package 'lpSolve' as solver [16]. Depending on the available hardware resources, it may not be possible to obtain clustering solutions, owing to the quadratic order of the program. K-Medoids is an NP-hard optimization problem. Partitioning Around Medoids (PAM) [14] is a very popular heuristic for obtaining optimal K-Medoids partitions [16].
MakeDensityBasedClusterer (MDBC)
MDBC is an algorithm that wraps SimpleKMeans, and possibly other clustering algorithms, so that the wrapped clusterer returns a distribution and density. It fits normal distributions and discrete distributions within each cluster produced by the wrapped clusterer. For SimpleKMeans, the number of clusters can be specified [15].
Cladograms
The clustering operations were made by calculating the Euclidean distance from the first alphabetically ordered species, defined as the pivotal species, to all the other species. Therefore, when ASAP created the clusters, it already had a symmetric distance matrix covering all the species in the data set. All that remained was to create a phylogenetic tree expressed in Newick format. We built an unrooted tree with the NEIGHBOR program from the PHYLIP package, using all default parameters. The unrooted tree in Figure 7 represents the eight clusters of the 60 species from dataset2.
Declarations
Acknowledgements
Funding: FAPEMIG - Fundação de Amparo à Pesquisa de Minas Gerais, Brazil and CNPq - Conselho Nacional de Desenvolvimento Científico e Tecnológico - Brazil.
This article has been published as part of BMC Genomics Volume 12 Supplement 4, 2011: Proceedings of the 6th International Conference of the Brazilian Association for Bioinformatics and Computational Biology (X-meeting 2010). The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2164/12?issue=S4
References
1. Golub G, Kahan W: Calculating the Singular Values and Pseudo-Inverse of a Matrix. Journal of the Society for Industrial and Applied Mathematics, Series B: Numerical Analysis. 1965, 2: 205-224. 10.1137/0702016.
2. Berry MW, Dumais ST, O'Brien GW: Using Linear Algebra for Intelligent Information Retrieval. SIAM Review. 1995, 37: 573-595. 10.1137/1037127.
3. Eldén L: Numerical linear algebra in data mining. Acta Numerica. 2006, 15: 327-384.
4. Eldén L: Matrix Methods in Data Mining and Pattern Recognition. 2007, Society for Industrial and Applied Mathematics.
5. Fogolari F, Tessari S, Molinari H: Singular value decomposition analysis of protein sequence alignment score data. Proteins. 2002, 46: 161-170. 10.1002/prot.10032.
6. Del-Castillo-Negrete D, Hirshman SP, Spong DA, D'Azevedo EF: Compression of magnetohydrodynamic simulation data using singular value decomposition. Journal of Computational Physics. 2007, 222: 265-286. 10.1016/j.jcp.2006.07.022.
7. Deerwester SC, Dumais ST, Furnas GW, Harshman RA, Landauer TK, Lochbaum KE, Streeter LA: Computer information retrieval using latent semantic structure. U. S. Patent: 4839853. 1989.
8. Stuart GW, Moffett K, Leader JJ: A comprehensive vertebrate phylogeny using vector representations of protein sequences from whole genomes. Mol Biol Evol. 2002, 19: 554-562. 10.1093/oxfordjournals.molbev.a004111.
9. Vries JK, Liu X: Subfamily specific conservation profiles for proteins based on n-gram patterns. BMC Bioinformatics. 2008, 9: 72. 10.1186/1471-2105-9-72.
10. Ider YZ, Onart S: Algebraic reconstruction for 3D magnetic resonance-electrical impedance tomography (MREIT) using one component of magnetic flux density. Physiol Meas. 2004, 25: 281-294. 10.1088/0967-3334/25/1/032.
11. Hartigan JA, Wong MA: Algorithm AS 136: A K-Means Clustering Algorithm. Journal of the Royal Statistical Society. Series C (Applied Statistics). 1979, 28: 100-108.
12. Dempster AP, Laird NM, Rubin DB: Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society. 1977, 39: 1-38.
13. De Smet F, Mathys J, Marchal K, Thijs G, De Moor B, Moreau Y: Adaptive quality-based clustering of gene expression profiles. Bioinformatics. 2002, 18: 735-746. 10.1093/bioinformatics/18.5.735.
14. Kaufman L, Rousseeuw P: Finding Groups in Data: An Introduction to Cluster Analysis. 1990, Wiley Interscience.
15. Witten IH, Frank E, Hall MA: Data Mining: Practical Machine Learning Tools and Techniques. 2011, Morgan Kaufmann.
16. R Development Core Team: R: A Language and Environment for Statistical Computing. 2006.
17. Abeel T, Van de Peer Y, Saeys Y: Java-ML: A Machine Learning Library. Journal of Machine Learning Research. 2009, 10: 931-934.
18. Liu Q, Zhang Y, Xu Y, Ye X: Fuzzy kernel clustering of RNA secondary structure ensemble using a novel similarity metric. J Biomol Struct Dyn. 2008, 25: 685-696.
19. Vries JK, Munshi R, Tobi D, Klein-Seetharaman J, Benos PV, Bahar I: A sequence alignment-independent method for protein classification. Appl Bioinformatics. 2004, 3: 137-148. 10.2165/00822942-200403020-00008.
20. Couto BRGM, Ladeira AP, Santos MA: Application of latent semantic indexing to evaluate the similarity of sets of sequences without multiple alignments character-by-character. Genet Mol Res. 2007, 6: 983-999.
21. Gordon AD, Vichi M: Partitions of Partitions. Journal of Classification. 1998, 15: 265-285. 10.1007/s003579900034.
Copyright
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.