Motivation
- One of the important goals in the post-genomic era is to discover the functions of genes.
- High-throughput technologies allow us to speed up the process of finding the functions of genes.
- But there are tens of thousands of genes involved in a microarray experiment.
- Questions:
- How do we analyze the data?
- Which genes should we start exploring?
- Let’s look at the problem from a different angle
- The issue here is dealing with high-dimensional data
- How do people deal with high-dimensional data?
- Start by finding interesting patterns associated with the data
- Clustering is a well-known technique for finding patterns, with successful applications across a wide range of domains
- Some successes in applying clustering on microarray data
- Golub et al. (1999) used clustering techniques to discover subclasses of AML and ALL from microarray data
- Eisen et al. (1998) used clustering techniques that are able to group genes of similar function together
- But what is clustering?
Introduction
- The goal of clustering is to
- group data points that are close (or similar) to each other
- identify such groupings (or clusters) in an unsupervised manner
- Unsupervised: no information is provided to the algorithm on which data belong to which clusters
- One of the major applications of clustering in bioinformatics is on microarray data to cluster similar genes
- Suppose genes A and B are grouped in the same cluster; then we hypothesize that genes A and B are involved in similar functions.
- If we know the role of gene A is apoptosis
- but we do not know if gene B is involved in apoptosis
- we can do experiments to confirm if gene B indeed is involved in apoptosis.
Hypotheses:
- Genes with similar expression patterns are said to be coexpressed
- Coexpressed genes can imply that
- they are involved in similar functions
- they are somehow related, for instance because their proteins directly/indirectly interact with each other
- It is widely believed that coexpression of genes implies that they are involved in similar functions
- But still, what can we really gain from doing clustering?
- Suppose genes A and B are grouped in the same cluster, then we hypothesize that proteins A and B might interact with each other.
- So we can do experiments to confirm if such interaction exists.
- So clustering microarray data helps us generate hypotheses about:
- potential functions of genes
- potential protein-protein interactions
- Do coexpressed genes always imply that they have similar functions?
- Not necessarily
- housekeeping genes
- genes that are always expressed, or never expressed, regardless of the experimental conditions
- there can be noise in microarray data
- But clustering is useful in:
- visualization of data
- hypothesis generation
- From the paper “Data Clustering: A Review”
- Feature Selection
- identifying the most effective subset of the original features to use in clustering
- Feature Extraction
- transformations of the input features to produce new salient features.
- Interpattern Similarity
- measured by a distance function defined on pairs of patterns.
- Grouping
- methods to group similar patterns in the same cluster
- Various clustering algorithms
- hierarchical
- k-means
- k-medoid
- fuzzy c-means
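As one example from the list above, k-means can be sketched in a few lines of Python. This is a minimal illustration on made-up one-dimensional expression values (the data, the choice of k = 2, and the iteration count are all assumptions for the demo, not part of the source):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    # Naive Lloyd-style k-means on 1-D values, for illustration only.
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        groups = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: abs(p - centroids[c]))
            groups[i].append(p)
        # Update step: move each centroid to the mean of its group.
        centroids = [sum(g) / len(g) if g else centroids[i]
                     for i, g in enumerate(groups)]
    return sorted(centroids)

# Two obvious groups of expression values (hypothetical data).
data = [1.0, 1.2, 0.9, 8.0, 8.3, 7.9]
print(kmeans(data, 2))
```

On this toy data the two centroids converge to the group means regardless of the random initialization; on real microarray data k-means is run on full expression profiles rather than single values.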
- Different ways of measuring similarity
- Measure validity of clusters
- How can we tell the generated clusters are good?
- How can we judge if the clusters are biologically meaningful?
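Two similarity measures often applied to expression profiles can be sketched in Python (the example profiles are made up). Euclidean distance compares profiles as points, while Pearson correlation compares their shape: two genes with proportional profiles are maximally similar under correlation even when their Euclidean distance is large.

```python
import math

def euclidean(x, y):
    # Straight-line distance between two expression profiles.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson(x, y):
    # Pearson correlation: 1 = identical shape, -1 = opposite shape.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [2.0, 4.0, 6.0, 8.0]      # same shape, twice the scale
print(euclidean(gene_a, gene_b))   # large distance
print(pearson(gene_a, gene_b))     # correlation = 1.0
```

The choice of measure changes which genes end up clustered together, which is one reason validity checks on the resulting clusters matter.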
- There are two styles of hierarchical clustering algorithms to build a tree from the input set S:
- Agglomerative (bottom-up):
- Beginning with singletons (sets with 1 element)
- Merging them until S is achieved as the root.
- It is the most common approach.
- Divisive (top-down):
- Beginning with S itself as a single cluster (the root).
- Recursively partitioning S until singleton sets are reached.
- Input: a pairwise distance matrix involving all instances in S
- Algorithm
- Step 1: Place each instance of S in its own cluster (singleton), creating the list of clusters L (initially, the leaves of T):
L = {S1, S2, S3, …, Sn-1, Sn}
- Step 2: Compute a merging cost function between every pair of elements in L to find the two closest clusters {Si, Sj}, which will be the cheapest pair to merge.
- Step 3: Remove Si and Sj from L.
- Step 4: Merge Si and Sj to create a new internal node Sij in T, which will be the parent of Si and Sj in the resulting tree.
- Step 2 can be done in different ways, which is what distinguishes single-linkage from complete-linkage and average-linkage clustering:
- In single-linkage clustering (also called the connectedness or minimum method), the distance between two clusters is the shortest distance from any member of one cluster to any member of the other cluster.
- In complete-linkage clustering (also called the diameter or maximum method), the distance between two clusters is the greatest distance from any member of one cluster to any member of the other cluster.
- In average-linkage clustering, the distance between two clusters is the average distance from any member of one cluster to any member of the other cluster.
- Step 5: Go to Step 2 until there is only one set remaining.
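The agglomerative procedure above can be sketched in Python. This is a naive O(n³) illustration (the example points and the `agglomerative` function name are assumptions for the demo), not an optimized implementation:

```python
import math

def agglomerative(points, linkage="single"):
    """Naive agglomerative clustering; returns the merge history.
    points: list of (x, y) tuples. Illustrative sketch only."""
    dist = lambda p, q: math.dist(p, q)
    clusters = [[i] for i in range(len(points))]   # Step 1: singletons
    merges = []
    while len(clusters) > 1:
        # Step 2: find the cheapest pair of clusters to merge.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                ds = [dist(points[a], points[b])
                      for a in clusters[i] for b in clusters[j]]
                if linkage == "single":       # minimum / connectedness
                    d = min(ds)
                elif linkage == "complete":   # maximum / diameter
                    d = max(ds)
                else:                         # average linkage
                    d = sum(ds) / len(ds)
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        # Steps 3-4: remove Si and Sj, add the merged node Sij.
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
        merges.append((sorted(merged), round(d, 3)))
    return merges

# Two tight pairs of points (hypothetical data); the pairs merge first,
# then the two resulting clusters merge last.
for m in agglomerative([(0, 0), (0, 1), (5, 5), (5, 6)]):
    print(m)
```

Swapping the `linkage` argument between "single", "complete", and "average" changes only how the inter-cluster distance in Step 2 is computed, matching the three variants described above.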