ACGT: Clustering

Clustering

The goal of clustering is to organize a set of objects or individuals (e.g. cities, companies, patients, genes, proteins) characterized by a set of "attributes" or "variables" into classes and sub-classes of similarity. For instance, a city can be described by attributes such as population, total area, presence or absence of an airport or railway station, number of hospitals, quality of living and so on.

The attributes may thus be of different types: numeric (quantitative), nominal (categorical) or ordinal. The notion of similarity or dissimilarity between individuals is crucial in clustering, since the aim is to place individuals that resemble each other in the same cluster. Clustering algorithms divide a set of objects into groups so that objects within the same group are more similar to each other than to objects in other groups. In the particular case of gene expression microarray experiments, clustering algorithms can be used to discover groups of genes with similar profiles, i.e. similar patterns of expression across samples. There are two approaches to clustering: hierarchical and non-hierarchical. Hierarchical methods provide a series of nested clusters, whereas non-hierarchical methods find a single partition into a predetermined number of classes.
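
To make the notion of dissimilarity between individuals described by mixed attribute types concrete, here is a minimal Python sketch. It uses a Gower-style average of per-attribute dissimilarities, which is an assumption chosen for illustration and is not the measure used in ACGT. The city names, attribute values, ranges and ordinal levels are all made up.

```python
# Hypothetical attribute metadata (assumed for this sketch): value ranges for
# numeric attributes and ordered levels for ordinal ones.
NUMERIC_RANGES = {"population": 800_000, "hospitals": 10}
ORDINAL_LEVELS = {"quality_of_living": ["low", "medium", "high"]}


def mixed_dissimilarity(a, b):
    """Average per-attribute dissimilarity in [0, 1] (Gower-style)."""
    total = 0.0
    for name in a:
        if name in NUMERIC_RANGES:
            # Numeric: absolute difference scaled by the attribute's range.
            total += abs(a[name] - b[name]) / NUMERIC_RANGES[name]
        elif name in ORDINAL_LEVELS:
            # Ordinal: difference in rank scaled by the number of steps.
            levels = ORDINAL_LEVELS[name]
            gap = abs(levels.index(a[name]) - levels.index(b[name]))
            total += gap / (len(levels) - 1)
        else:
            # Nominal (including presence/absence): simple mismatch.
            total += 0.0 if a[name] == b[name] else 1.0
    return total / len(a)


# Made-up cities described by numeric, nominal and ordinal attributes.
cities = {
    "A": {"population": 100_000, "hospitals": 2, "airport": False,
          "quality_of_living": "medium"},
    "B": {"population": 120_000, "hospitals": 3, "airport": False,
          "quality_of_living": "medium"},
    "C": {"population": 900_000, "hospitals": 12, "airport": True,
          "quality_of_living": "high"},
}

d_ab = mixed_dissimilarity(cities["A"], cities["B"])
d_ac = mixed_dissimilarity(cities["A"], cities["C"])
print(f"d(A, B) = {d_ab:.3f}, d(A, C) = {d_ac:.3f}")
```

With these values, cities A and B come out much closer to each other than either is to C, so a clustering algorithm working from this dissimilarity would tend to group A and B first.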

In clinico-genomic trials, both gene expression data and clinical data are available for a set of patients. Clustering the genes on the basis of the expression data produces groups of genes with similar expression patterns, while clustering the patients on the basis of the clinical data produces groups of similar patients. The link between the gene clusters and the patient clusters can then be studied with the help of statistical measures of association.
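
The text does not say which measures of association are meant. As one possible illustration, the sketch below summarizes each gene cluster by its mean expression per patient and computes the correlation ratio (eta squared) between that summary and the patient cluster labels. The choice of measure, the gene names, the expression values and the cluster labels are all assumptions made for this example.

```python
from statistics import mean


def correlation_ratio(values, groups):
    """Eta squared: share of the variance of `values` explained by `groups`."""
    overall = mean(values)
    total_ss = sum((v - overall) ** 2 for v in values)
    if total_ss == 0:
        return 0.0
    by_group = {}
    for value, group in zip(values, groups):
        by_group.setdefault(group, []).append(value)
    between_ss = sum(len(vs) * (mean(vs) - overall) ** 2
                     for vs in by_group.values())
    return between_ss / total_ss


# Made-up expression values: one list per gene, one entry per patient.
expression = {
    "g1": [2.0, 2.2, 1.9, 0.1, 0.2, 0.0],
    "g2": [1.8, 2.1, 2.0, 0.3, 0.1, 0.2],
    "g3": [1.0, 0.2, 0.9, 0.3, 1.1, 0.5],
}
# Hypothetical results of two earlier clustering runs.
gene_clusters = {"G1": ["g1", "g2"], "G2": ["g3"]}
patient_clusters = ["P1", "P1", "P1", "P2", "P2", "P2"]

association = {}
for label, genes in gene_clusters.items():
    # Mean expression of the gene cluster in each patient.
    per_patient = [mean(expression[g][i] for g in genes)
                   for i in range(len(patient_clusters))]
    association[label] = correlation_ratio(per_patient, patient_clusters)

for label, eta2 in association.items():
    print(f"gene cluster {label}: eta^2 = {eta2:.3f}")
```

A value near 1 indicates that the expression level of the gene cluster separates the patient clusters almost perfectly; a value near 0 indicates no visible link.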

The crucial role played by cluster analysis in genomics was first demonstrated by Eisen et al. (1998) and Brown and Botstein (1999).

The most commonly used clustering methods are the hierarchical ones (Single Linkage, Complete Linkage, Average Linkage and Centroid). For numerical data, they most often rely on the Euclidean distance between genes or samples.
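
The four linkage rules named above differ only in how the distance between two clusters is derived from the Euclidean distances between their members. The following naive Python sketch shows this; it is written for readability rather than speed, and the expression profiles are made up.

```python
from itertools import combinations
from math import dist
from statistics import mean


def cluster_distance(a, b, method):
    """Distance between two clusters given as lists of points."""
    if method == "centroid":
        # Euclidean distance between the two cluster centroids.
        centroid_a = [mean(column) for column in zip(*a)]
        centroid_b = [mean(column) for column in zip(*b)]
        return dist(centroid_a, centroid_b)
    pairwise = [dist(p, q) for p in a for q in b]
    if method == "single":      # closest pair of members
        return min(pairwise)
    if method == "complete":    # farthest pair of members
        return max(pairwise)
    if method == "average":     # mean over all pairs of members
        return mean(pairwise)
    raise ValueError(f"unknown linkage method: {method}")


def agglomerate(points, method="average"):
    """Merge the two closest clusters until one remains.

    Returns the merges in order as (members_a, members_b, distance),
    where members are indices into `points`.
    """
    clusters = [[i] for i in range(len(points))]
    merges = []

    def separation(pair):
        i, j = pair
        return cluster_distance([points[k] for k in clusters[i]],
                                [points[k] for k in clusters[j]], method)

    while len(clusters) > 1:
        i, j = min(combinations(range(len(clusters)), 2), key=separation)
        merges.append((clusters[i], clusters[j], separation((i, j))))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


# Made-up expression profiles of four genes across three samples.
profiles = [(1.0, 2.0, 3.0), (1.1, 2.1, 2.9), (5.0, 1.0, 0.5), (5.2, 0.9, 0.4)]

for method in ("single", "complete", "average", "centroid"):
    print(method)
    for left, right, height in agglomerate(profiles, method):
        print(f"  merge {left} + {right} at distance {height:.3f}")
```

The successive merges form the series of nested clusters that a hierarchical method produces. On this toy data all four rules pair the same genes first and differ only in the distance at which the final merge occurs.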

The clustering tools that will be developed for clinicians and researchers in ACGT are based on the Likelihood Link Analysis (LLA) methodology. Here, the similarity between two objects to be clustered (e.g. genes or patients) is defined as the "likelihood" of their observed resemblance under a hypothesis of no link, which gives a probabilistic measure of similarity devoid of bias. The criterion for joining clusters is defined in the same way, as the likelihood of the similarity between clusters, which makes it a kind of probabilistic aggregation criterion. This methodology provides two statistics of great practical importance: the "global index", which measures the quality of the partition obtained at each level of the hierarchy and enables the user to choose the most "coherent" partition, and the "local index", which makes it possible to detect the "significant nodes" associated with the "strong clusters". The tool is integrated in the R package LLAhclust, which can perform hierarchical clustering of a set of "individuals" or of a set of attributes of different types (numerical, nominal, ordinal, Boolean and many others).
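
The general idea of a probabilistic similarity under a no-link hypothesis can be illustrated for two Boolean attributes. The sketch below takes the number of individuals on which both attributes are present and returns the probability that a count this small or smaller would arise if one attribute were placed at random with respect to the other (a hypergeometric model). This is only an illustration of the principle under stated assumptions: the choice of model, the use of "less than or equal", and the function name are ours, and the code is not the formula or the API of LLAhclust.

```python
from math import comb


def no_link_similarity(a, b):
    """P(S <= s_obs) under a hypergeometric no-link model.

    `a` and `b` are equal-length sequences of 0/1 values, one entry per
    individual. S is the number of individuals on which both attributes
    are present when the positions of b's ones are drawn at random.
    """
    n = len(a)
    n_a, n_b = sum(a), sum(b)
    observed = sum(1 for x, y in zip(a, b) if x and y)
    favourable = sum(comb(n_a, k) * comb(n - n_a, n_b - k)
                     for k in range(observed + 1))
    return favourable / comb(n, n_b)


# Made-up Boolean attributes observed on eight individuals.
attr_1 = [1, 1, 1, 1, 0, 0, 0, 0]
attr_2 = [1, 1, 1, 0, 1, 0, 0, 0]   # overlaps attr_1 on three individuals
attr_3 = [0, 0, 0, 0, 1, 1, 1, 1]   # never co-occurs with attr_1

print(no_link_similarity(attr_1, attr_1))
print(no_link_similarity(attr_1, attr_2))
print(no_link_similarity(attr_1, attr_3))
```

Because the result is a probability, values computed for attribute pairs with very different frequencies remain on a common scale between 0 and 1, which is the property that motivates a likelihood-based similarity.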

References:

Cluster analysis and display of genome-wide expression patterns, Eisen MB, Spellman PT, Brown PO, Botstein D., 1998, PNAS USA, 95(25
Exploring the new world of the genome with DNA microarrays, Brown P.O., Botstein D., 1999, Nature Genetics, 21
Foundations of the likelihood linkage analysis classification method, I. C. Lerman, 1991, Applied Stochastic Models and Data Analysis, 7, pp 63-76
Likelihood linkage analysis classification method: An example treated by hand, I. C. Lerman, 1993, Biochimie, 75, pp 379-397

ACGT Newsletter
