cluster

Validate clusters in phylogenetic tree

Syntax

LeafClusters = cluster(Tree, Threshold) [LeafClusters, NodeClusters] = cluster(Tree, Threshold) [LeafClusters, NodeClusters, Branches] = cluster(Tree, Threshold) cluster(..., 'Criterion', CriterionValue, ...) cluster(..., 'MaxClust', MaxClustValue, ...) cluster(..., 'Distances', DistancesValue, ...)

Input Arguments

`Tree`	Phylogenetic tree object created, such as created with the `phytree` constructor function.
`Threshold`	Scalar specifying a threshold value.
`CriterionValue`	Character vector or string specifying the criterion to determine the number of clusters as a function of the species pairwise distances. Choices are: `'maximum'` (default) — Maximum within cluster pairwise distance (W_max). Cluster splitting stops when W_max ≤ `Threshold`. `'median'` — Median within cluster pairwise distance (W_med). Cluster splitting stops when W_med ≤ `Threshold`. `'average'` — Average within cluster pairwise distance (W_avg). Cluster splitting stops when W_avg ≤ `Threshold`. `'ratio'` — Between/within cluster pairwise distance ratio, defined as BW_rat = (trace(B)/(k - `1`)) / (trace(W)/(n - k)) where B and W are the between- and within-scatter matrices, respectively. k is the number of clusters, and n is the number of species in the tree. Cluster splitting stops when BW_rat ≥ `Threshold`. `'gain'` — Within cluster pairwise distance gain, defined as W_gain = (trace(W_old)/ (trace(W) - 1) * (n - k - 1)) where W and W_old are the within-scatter matrices for k and k - 1, respectively. k is the number of clusters, and n is the number of species in the tree. Cluster splitting stops when W_gain ≤ `Threshold`. `'silhouette'` — Average silhouette width (SW_avg). SW_avg ranges from `-1` to `+1`. Cluster splitting stops when SW_avg ≥ `Threshold`. For more information, see `silhouette`.
`MaxClustValue`	Positive integer specifying the maximum number of possible clusters for the tested partitions. Default is the number of leaves in the tree. Tip When using the `'maximum'`, `'median'`, or `'average'` criteria, set `Threshold` to `[]` (empty) to force `cluster` to return `MaxClustValue` clusters. It does so because such metrics monotonically decrease as k increases. Tip When using the `'ratio'`, `'gain'`, or `'silhouette'` criteria, you may find it hard to estimate an appropriate `Threshold` in advance. Set `Threshold` to `[]` (empty) to find the optimal number of clusters below `MaxClustValue`. Also, set `MaxClustValue` to a small value to avoid expensive computation due to testing all possible number of clusters.
`DistancesValue`	Matrix of pairwise distances, such as returned by the `seqpdist` function, containing biological distances between each pair of sequences. `cluster` substitutes this matrix for the patristic distances in `Tree`. For example, this matrix can contain the real sample pairwise distances.

Output Arguments

`LeafClusters`	Column vector containing a cluster index for each species (leaf) in `Tree`, a phylogenetic tree object.
`NodeClusters`	Column vector containing the cluster index for each leaf node and branch node in `Tree`. Tip Use the `LeafClusters` or `NodeClusters` output vectors with the handle returned by the `plot` method to modify graphic elements of the phylogenetic tree object. For more information, see Examples.
`Branches`	Two-column matrix containing, for each step in the algorithm, the index of the branch being considered and the value of the criterion. Each row corresponds to a step in the algorithm. The first column contains branch indices, and the second column contains criterion values. Tip To obtain the whole curve of the criterion versus the number of clusters in `Branches`, set `Threshold` to `[]` (empty) and do not specify a `MaxClustValue`. Be aware that computation of some criteria can be computationally intensive.

Description

LeafClusters = cluster(Tree, Threshold) returns a column vector containing a cluster index for each species (leaf) in a phylogenetic tree object. It determines the optimal number of clusters as follows:

Starting with two clusters (k = 2), selects the partition that optimizes the criterion specified by the 'Criterion' property
Increments k by 1 and again selects the optimal partition
Continues incrementing k and selecting the optimal partition until a criterion value = Threshold or k = the maximum number of clusters (that is, number of leaves)
From all possible k values, selects the k value whose partition optimizes the criterion

[LeafClusters, NodeClusters] = cluster(Tree, Threshold) returns a column vector containing the cluster index for each leaf node and branch node in Tree.

[LeafClusters, NodeClusters, Branches] = cluster(Tree, Threshold) returns a two-column matrix containing, for each step in the algorithm, the index of the branch being considered and the value of the criterion. Each row corresponds to a step in the algorithm. The first column contains branch indices, and the second column contains criterion values.

cluster(..., 'PropertyName', PropertyValue, ...) calls cluster with optional properties that use property name/property value pairs. You can specify one or more properties in any order. Enclose each PropertyName in single quotation marks. Each PropertyName is case insensitive. These property name/property value pairs are as follows:.

cluster(..., 'Criterion', CriterionValue, ...) specifies the criterion to determine the number of clusters as a function of the species pairwise distances.

cluster(..., 'MaxClust', MaxClustValue, ...) specifies the maximum number of possible clusters for the tested partitions. Default is the number of leaves in the tree.

cluster(..., 'Distances', DistancesValue, ...) substitutes the patristic distances in Tree with a user-provided pairwise distance matrix.

Examples

Validate the clusters in a phylogenetic tree:

% Read sequences from a multiple alignment file into a MATLAB
% structure
gagaa = multialignread('aagag.aln');

% Build a phylogenetic tree from the sequences
gag_tree = seqneighjoin(seqpdist(gagaa),'equivar',gagaa);

% Validate the clusters in the tree and find the best partition
% using the 'gain' criterion
[i,j] = cluster(gag_tree,[],'criterion','gain','maxclust',10);

% Use the returned vector of indices to color the branches of each
% cluster in a plot of the tree
h = plot(gag_tree);
set(h.BranchLines(j==2),'Color','b')
set(h.BranchLines(j==1),'Color','r')

References

[1] Dudoit, S. and Fridlyan, J. (2002). A prediction-based resampling method for estimating the number of clusters in a dataset. Genome Biology 3(7), research 0036.1–0036.21.

[2] Theodoridis, S. and Koutroumbas, K. (1999). Pattern Recognition (Academic Press), pp. 434–435.

[3] Kaufman, L. and Rousseeuw, P.J. (1990). Finding Groups in Data: An Introduction to Cluster Analysis (New York, Wiley).

[4] Calinski, R. and Harabasz, J. (1974). A dendrite method for cluster analysis. Commun Statistics 3, 1–27.

[5] Hartigan, J.A. (1985). Statistical theory in clustering. J Classification 2, 63–76.