Clustering

Motivation

  • One of the important goals in the post-genomic era is to discover the functions of genes.
  • High-throughput technologies allow us to speed up the process of finding the functions of genes.
  • But there are tens of thousands of genes involved in a microarray experiment.
  • Questions:
    • How do we analyze the data?
    • Which genes should we start exploring?
  • Let’s look at the problem from a different angle
  • The issue here is dealing with high-dimensional data
  • How do people deal with high-dimensional data?
  • Start by finding interesting patterns associated with the data
  • Clustering is a well-known technique with successful applications in many domains for finding patterns
  • Some successes in applying clustering to microarray data:
    • Golub et al. (1999) used clustering techniques to discover subclasses of AML and ALL from microarray data
    • Eisen et al. (1998) used clustering techniques to group genes of similar function together
  • But what is clustering?

Introduction

  • The goal of clustering is to
    • group data points that are close (or similar) to each other
    • identify such groupings (or clusters) in an unsupervised manner
      • Unsupervised: no information is provided to the algorithm on which data belong to which clusters
      • One of the major applications of clustering in bioinformatics is to microarray data, to cluster similar genes
      • Suppose genes A and B are grouped in the same cluster; then we hypothesize that genes A and B are involved in similar functions.
        • If we know the role of gene A is apoptosis
        • but we do not know if gene B is involved in apoptosis
        • we can do experiments to confirm if gene B indeed is involved in apoptosis.

Hypotheses:

  • Similar expression patterns in genes imply that the genes are coexpressed
  • Coexpressed genes can imply that
    • they are involved in similar functions
    • they are somehow related, for instance because their proteins directly/indirectly interact with each other
  • It is widely believed that coexpression implies that genes are involved in similar functions
  • But still, what can we really gain from doing clustering?
    • Suppose genes A and B are grouped in the same cluster; then we hypothesize that proteins A and B might interact with each other.
      • So we can do experiments to confirm whether such an interaction exists.
    • So clustering microarray data in a way helps us make hypotheses about:
      • potential functions of genes
      • potential protein-protein interactions
  • Do coexpressed genes always imply that they have similar functions?
    • Not necessarily
      • housekeeping genes: genes that are always expressed, or never expressed, regardless of the conditions
      • there can be noise in microarray data
    • But clustering is still useful for:
      • visualization of data
      • hypothesis generation
    • From the paper “Data Clustering: A Review” (Jain et al.), a clustering task involves:
    • Feature Selection
      • identifying the most effective subset of the original features to use in clustering
    • Feature Extraction
      • transformations of the input features to produce new salient features.
    • Interpattern Similarity
      • measured by a distance function defined on pairs of patterns.
    • Grouping
      • methods to group similar patterns in the same cluster
      • Various clustering algorithms
        • hierarchical
        • k-means
        • k-medoid
        • fuzzy c-means
      • Different ways of measuring similarity
      • Measure validity of clusters
        • How can we tell the generated clusters are good?
        • How can we judge if the clusters are biologically meaningful?
  • There are two styles of hierarchical clustering algorithms to build a tree from the input set S:
    • Agglomerative (bottom-up):
      • Beginning with singletons (sets with 1 element)
      • Merging them until S is achieved as the root.
      • It is the most common approach.
    • Divisive (top-down):
      • Recursively partitioning S until singleton sets are reached.
  • The agglomerative algorithm:
    • Input: a pairwise distance matrix involving all instances in S
    • Algorithm:
      1. Place each instance of S in its own cluster (singleton), creating the list of clusters L (initially, the leaves of T):

         L = S1, S2, S3, …, Sn-1, Sn.

      2. Compute a merging cost function between every pair of elements in L to find the two closest clusters {Si, Sj}, which will be the cheapest pair to merge.
      3. Remove Si and Sj from L.
      4. Merge Si and Sj to create a new internal node Sij in T, which will be the parent of Si and Sj in the resulting tree.
      5. Go to Step 2 until only one cluster remains.
    • Step 2 can be done in different ways, which is what distinguishes single-linkage from complete-linkage and average-linkage clustering:
      • In single-linkage clustering (also called the connectedness or minimum method), the distance between two clusters is the shortest distance from any member of one cluster to any member of the other cluster.
      • In complete-linkage clustering (also called the diameter or maximum method), the distance between two clusters is the greatest distance from any member of one cluster to any member of the other cluster.
      • In average-linkage clustering, the distance between two clusters is the average distance from any member of one cluster to any member of the other cluster.
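
As a concrete illustration, here is a minimal Python sketch of the agglomerative procedure above, with the linkage choice controlling the merging cost. The point data, function names, and parameters are illustrative assumptions, not part of the original notes.

```python
# A minimal sketch of agglomerative hierarchical clustering with a
# selectable linkage. Data and names are invented for illustration.
import numpy as np

def agglomerative(points, linkage="single"):
    """Merge singleton clusters until one remains; return the merges,
    i.e. the internal nodes of the resulting tree T."""
    # Step 1: pairwise distance matrix over all instances in S.
    dist = np.linalg.norm(points[:, None] - points[None, :], axis=-1)
    clusters = [[i] for i in range(len(points))]   # the list L of singletons
    merges = []

    def cluster_distance(a, b):
        pair_dists = [dist[i, j] for i in a for j in b]
        if linkage == "single":    # minimum / connectedness method
            return min(pair_dists)
        if linkage == "complete":  # maximum / diameter method
            return max(pair_dists)
        return sum(pair_dists) / len(pair_dists)   # average linkage

    # Steps 2-5: repeatedly merge the cheapest pair of clusters.
    while len(clusters) > 1:
        i, j = min(
            ((a, b) for a in range(len(clusters))
                    for b in range(a + 1, len(clusters))),
            key=lambda ab: cluster_distance(clusters[ab[0]], clusters[ab[1]]),
        )
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]          # new internal node S_ij
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

# Tiny usage example with made-up 2-D points:
pts = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
print(agglomerative(pts, linkage="single"))
```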

Machine Learning introduction

Machine learning is programming computers to optimize a performance criterion using example data or past experience.

There is no need to “learn” to calculate payroll

Learning is used when:

Human expertise does not exist (navigating on Mars),

Humans are unable to explain their expertise (speech recognition)

Solution changes in time (routing on a computer network)

Solution needs to be adapted to particular cases (user biometrics)

Learning general models from data of particular examples

Data is cheap and abundant (data warehouses, datamarts); knowledge is expensive and scarce.

Example in retail: Customer transactions to consumer behavior:

People who bought “Da Vinci Code” also bought “The Five People You Meet in Heaven”

Data Mining and KDD

Definition := “KDD is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data” (Fayyad)

Applications

Retail: Market basket analysis, Customer relationship management (CRM)

Finance: Credit scoring, fraud detection

Manufacturing: Optimization, troubleshooting

Medicine: Medical diagnosis

Telecommunications: Quality of service optimization

Bioinformatics: Motifs, alignment

Web mining: Search engines

Machine Learning

Study of algorithms that improve their performance at some task with experience

Optimize a performance criterion using example data or past experience.

Role of Statistics: Inference from a sample

Role of Computer science: Efficient algorithms to solve the optimization problem

Representing and evaluating the model for inference

Learning Associations

Basket analysis:

P(Y | X): the probability that somebody who buys X also buys Y, where X and Y are products/services.

Example: P ( chips | beer ) = 0.7
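
As a minimal sketch of how such a conditional probability could be estimated from transaction data (the baskets below are invented; the 0.7 figure above is the notes’ own example):

```python
# Estimate P(Y | X) by relative frequency over baskets containing X.
transactions = [
    {"beer", "chips"}, {"beer", "chips", "salsa"},
    {"beer"}, {"beer", "chips"}, {"milk", "bread"},
]

def conditional_prob(y, x, baskets):
    """P(y | x) = count(baskets with x and y) / count(baskets with x)."""
    with_x = [b for b in baskets if x in b]
    if not with_x:
        return 0.0
    return sum(1 for b in with_x if y in b) / len(with_x)

print(conditional_prob("chips", "beer", transactions))  # 3/4 = 0.75
```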

Classification

Example: Credit scoring

Differentiating between low-risk and high-risk customers from their income and savings

Prediction: Regression

Example: Price of a used car

x : car attributes

y : price

y = g(x | θ)

g(·): the model

θ: the parameters
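
A minimal sketch of fitting such a model when g is linear, i.e. g(x | w, b) = w·x + b, using least squares; the used-car data and parameter names are invented for illustration:

```python
import numpy as np

age = np.array([1.0, 3.0, 5.0, 8.0])        # x: a single car attribute (years)
price = np.array([18.0, 14.0, 11.0, 6.0])   # y: price in thousands

# Least squares: theta = (w, b) minimizing sum of (y - g(x | theta))^2.
X = np.column_stack([age, np.ones_like(age)])
theta, *_ = np.linalg.lstsq(X, price, rcond=None)
w, b = theta
print(f"g(x) = {w:.2f}*x + {b:.2f}")
print("predicted price at x=4:", w * 4 + b)   # predict a new car's price
```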

Supervised Learning: Uses

Example: decision trees, tools that create rules

Prediction of future cases: Use the rule to predict the output for future inputs

Knowledge extraction: The rule is easy to understand

Compression: The rule is simpler than the data it explains

Outlier detection: Exceptions that are not covered by the rule, e.g., fraud

Unsupervised Learning:

Learning “what normally happens”

No output

Clustering: Grouping similar instances

Other applications: Summarization, Association Analysis

Reinforcement Learning:

Topics:

Policies: what actions should an agent take in a particular situation

Utility estimation: how good is a state (→used by policy)

No supervised output but delayed reward

Credit assignment problem (what was responsible for the outcome)

Applications:

Game playing

Robot in a maze

Multiple agents, partial observability

Example applications

Customer segmentation in CRM

Image compression: Color quantization

 

Bioinformatics: Learning motifs

Data Mining – Classification (Continued)

  • Bayesian Classification
  • Classification by backpropagation
  • Classification based on concepts from association rule mining
  • Probabilistic learning:  Calculate explicit probabilities for hypothesis, among the most practical approaches to certain types of learning problems
  • Incremental: Each training example can incrementally increase/decrease the probability that a hypothesis is correct.  Prior knowledge can be combined with observed data.
  • Probabilistic prediction:  Predict multiple hypotheses, weighted by their probabilities

  • Standard: Even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured

 

  • The classification problem may be formalized using a-posteriori probabilities:
  •  P(C|X)  = prob. that the sample tuple X=<x1,…,xk> is of class C.

 

  • E.g. P(class=N | outlook=sunny,windy=true,…)

 

  • Idea: assign to sample X the class label C such that P(C|X) is maximal
  • Bayes theorem:

P(C|X) = P(X|C)·P(C) / P(X)

  • P(X) is constant for all classes
  • P(C) = relative freq of class C samples
  • C such that P(C|X) is maximum =
    C such that P(X|C)·P(C) is maximum
  • Problem: computing P(X|C) is infeasible!
  • Naïve assumption: attribute independence

A Bayesian belief network allows a subset of the variables to be conditionally independent

  • A graphical model of causal relationships
  • Several approaches exist for learning Bayesian belief networks

P(x1,…,xk|C) = P(x1|C)·…·P(xk|C)

  • If i-th attribute is categorical:
    P(xi|C) is estimated as the relative freq of samples having value xi as i-th attribute in class C
  • If i-th attribute is continuous:
    P(xi|C) is estimated through a Gaussian density function
  • Computationally easy in both cases
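
A minimal sketch of a naïve Bayes classifier for categorical attributes, estimating P(C) and P(xi|C) by relative frequency as described above; the toy weather samples are invented (a continuous attribute would use a Gaussian density instead):

```python
# Choose C maximizing P(C) * prod_i P(x_i | C).
from collections import Counter, defaultdict

samples = [
    (("sunny", "hot"), "N"), (("sunny", "mild"), "N"),
    (("rain", "mild"), "P"), (("overcast", "hot"), "P"),
    (("rain", "cool"), "P"),
]

class_counts = Counter(label for _, label in samples)
attr_counts = defaultdict(Counter)  # (class, attr index) -> value counts
for x, c in samples:
    for i, v in enumerate(x):
        attr_counts[(c, i)][v] += 1

def classify(x):
    best_class, best_score = None, -1.0
    for c, n_c in class_counts.items():
        score = n_c / len(samples)                 # P(C): relative frequency
        for i, v in enumerate(x):
            score *= attr_counts[(c, i)][v] / n_c  # P(x_i | C)
        if score > best_score:
            best_class, best_score = c, score
    return best_class

print(classify(("sunny", "hot")))   # -> "N"
```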

Neural Networks

  • Advantages
    • prediction accuracy is generally high
    • robust, works when training examples contain errors
    • output may be discrete, real-valued, or a vector of several discrete or real-valued attributes
    • fast evaluation of the learned target function
  • Criticism
    • long training time
    • difficult to understand the learned function (weights)
    • not easy to incorporate domain knowledge

Network Training

 

  • The ultimate objective of training
    • obtain a set of weights that classifies almost all the tuples in the training data correctly
  • Steps
    • Initialize weights with random values
    • Feed the input tuples into the network one by one
    • For each unit
      • Compute the net input to the unit as a linear combination of all the inputs to the unit
      • Compute the output value using the activation function
      • Compute the error
      • Update the weights and the bias
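
A minimal sketch of this training loop for a single hidden layer, using sigmoid activations and squared error; the layer sizes, learning rate, and XOR-style data are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

# Initialize weights and biases with small random values.
W1, b1 = rng.normal(0, 0.5, (2, 4)), np.zeros(4)
W2, b2 = rng.normal(0, 0.5, (4, 1)), np.zeros(1)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
lr = 1.0

for epoch in range(5000):
    # Forward: net input = linear combination; output = activation(net).
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # Backward: error at the output, then propagated to the hidden layer.
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    # Update the weights and the biases.
    W2 -= lr * h.T @ d_out;  b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * X.T @ d_h;    b1 -= lr * d_h.sum(axis=0)

# Outputs should approach [0, 1, 1, 0] (convergence depends on the seed).
print(out.round(2))
```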

Network Pruning and Rule Extraction

  • Network pruning
    • A fully connected network is hard to articulate
    • N input nodes, h hidden nodes and m output nodes lead to h(m+N) weights
    • Pruning: Remove some of the links without affecting classification accuracy of the network
  • Extracting rules from a trained network
    • Discretize activation values; replace individual activation value by the cluster average maintaining the network accuracy
    • Enumerate the output from the discretized activation values to find rules between activation value and output
    • Find the relationship between the input and activation value
    • Combine the above two to have rules relating the output to input

Association Based Classification

  • Several methods for association-based classification
    • ARCS: Quantitative association mining and clustering of association rules (Lent et al’97)
      • It beats C4.5 in (mainly) scalability and also accuracy
    • Associative classification: (Liu et al’98)  
      • It mines high support and high confidence rules in the form of “cond_set => y”, where y is a class label
    • CAEP (Classification by aggregating emerging patterns) (Dong et al’99)
      • Emerging patterns (EPs): the itemsets whose support increases significantly from one class to another
      • Mine EPs based on minimum support and growth rate

 

Data Mining Classification

  • What is classification? What is prediction?
  • Issues regarding classification and prediction
  • Classification by decision tree induction
  • Bayesian Classification
  • Classification:  
    • predicts categorical class labels
    • classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute and uses it in classifying new data
  • Prediction:  
    • models continuous-valued functions, i.e., predicts unknown or missing values
  • Typical Applications
    • credit approval
    • target marketing
    • medical diagnosis
    • treatment effectiveness analysis
  • Model construction: describing a set of predetermined classes
    • Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
    • The set of tuples used for model construction: training set
    • The model is represented as classification rules, decision trees, or mathematical formulae
  • Model usage: for classifying future or unknown objects
    • Estimate accuracy of the model
      • The known label of each test sample is compared with the classified result from the model
      • Accuracy rate is the percentage of test set samples that are correctly classified by the model (a sketch follows this list)
      • Test set is independent of training set; otherwise over-fitting will occur
  • Supervised learning (classification)
    • Supervision: The training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
    • New data is classified based on the training set
  • Unsupervised learning (clustering)
    • The class labels of training data are unknown
    • Given a set of measurements, observations, etc. with the aim of establishing the existence of classes or clusters in the data
  • Data cleaning
    • Preprocess data in order to reduce noise and handle missing values
  • Relevance analysis (feature selection)
    • Remove the irrelevant or redundant attributes
  • Data transformation
    • Generalize and/or normalize data
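
As referenced above, a minimal sketch of the accuracy estimate on an independent test set; the classifier and data below are invented placeholders:

```python
# Accuracy = fraction of test samples whose known label matches the
# model's prediction on a test set held out from training.
def accuracy(predict, test_set):
    correct = sum(1 for x, label in test_set if predict(x) == label)
    return correct / len(test_set)

# Usage with a trivial rule-based classifier on invented samples:
test = [((1, 0), "P"), ((0, 1), "N"), ((1, 1), "P")]
predict = lambda x: "P" if x[0] == 1 else "N"
print(accuracy(predict, test))  # -> 1.0
```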

Decision tree

  • A flow-chart-like tree structure
  • Internal node denotes a test on an attribute
  • Branch represents an outcome of the test
  • Leaf nodes represent class labels or class distribution
    • Decision tree generation consists of two phases
      • Tree construction
        • At start, all the training examples are at the root
        • Partition examples recursively based on selected attributes
      • Tree pruning
        • Identify and remove branches that reflect noise or outliers
    • Use of decision tree: Classifying an unknown sample
      • Test the attribute values of the sample against the decision tree
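
A minimal sketch of classifying an unknown sample by testing its attribute values against a decision tree; the nested-dict tree encoding and the sample are illustrative assumptions:

```python
# Internal nodes test an attribute; branches are outcomes; leaves hold labels.
tree = {
    "attribute": "outlook",
    "branches": {
        "sunny": {"label": "N"},
        "overcast": {"label": "P"},
        "rain": {
            "attribute": "windy",
            "branches": {True: {"label": "N"}, False: {"label": "P"}},
        },
    },
}

def classify(node, sample):
    """Follow the branch matching the sample's attribute value until a leaf."""
    while "label" not in node:
        node = node["branches"][sample[node["attribute"]]]
    return node["label"]

print(classify(tree, {"outlook": "rain", "windy": False}))  # -> "P"
```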

Basic algorithm (a greedy algorithm)

  • Tree is constructed in a top-down recursive divide-and-conquer manner
  • At start, all the training examples are at the root
  • Attributes are categorical (if continuous-valued, they are discretized in advance)
  • Examples are partitioned recursively based on selected attributes
  • Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain; see the sketch below)
  • Conditions for stopping partitioning
    • All samples for a given node belong to the same class
    • There are no remaining attributes for further partitioning – majority voting is employed for classifying the leaf
    • There are no samples left
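
As noted above, a minimal sketch of the information-gain measure used to select the test attribute: gain(A) = entropy(S) minus the weighted entropy after splitting on A. The toy samples are invented:

```python
import math
from collections import Counter

def entropy(labels):
    counts = Counter(labels)
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def information_gain(samples, attr):
    """samples: list of (attribute dict, class label) pairs."""
    labels = [c for _, c in samples]
    remainder = 0.0
    for value in {x[attr] for x, _ in samples}:
        subset = [c for x, c in samples if x[attr] == value]
        remainder += len(subset) / len(samples) * entropy(subset)
    return entropy(labels) - remainder

data = [
    ({"outlook": "sunny", "windy": True}, "N"),
    ({"outlook": "sunny", "windy": False}, "N"),
    ({"outlook": "rain",  "windy": True}, "N"),
    ({"outlook": "rain",  "windy": False}, "P"),
    ({"outlook": "overcast", "windy": False}, "P"),
]
# The attribute with the highest gain becomes the test at the root.
print({a: round(information_gain(data, a), 3) for a in ("outlook", "windy")})
```
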
  • Classification—a classical problem extensively studied by statisticians and machine learning researchers
  • Scalability: Classifying data sets with millions of examples and hundreds of attributes with reasonable speed
  • Why decision tree induction in data mining?
    • relatively faster learning speed (than other classification methods)
    • convertible to simple and easy to understand classification rules
    • can use SQL queries for accessing databases
    • comparable classification accuracy with other methods
  • SLIQ (EDBT’96 — Mehta et al.)
    • builds an index for each attribute and only class list and the current attribute list reside in memory
  • SPRINT (VLDB’96 — J. Shafer et al.)
    • constructs an attribute list data structure
  • PUBLIC (VLDB’98 — Rastogi & Shim)
    • integrates tree splitting and tree pruning: stop growing the tree earlier
  • RainForest  (VLDB’98 — Gehrke, Ramakrishnan & Ganti)
    • separates the scalability aspects from the criteria that determine the quality of the tree
    • builds an AVC-list (attribute, value, class label)
  • Integration of generalization with decision-tree induction (Kamber et al’97)
    • Classification at primitive concept levels
      • E.g., precise temperature, humidity, outlook, etc.
      • Low-level concepts, scattered classes, bushy classification trees
      • Semantic interpretation problems
    • Cube-based multi-level classification
      • Relevance analysis at multiple levels
      • Information-gain analysis with dimension + level


Artificial Intelligence: can it be creative?

This is based on some of the open source code that uses NLP for artificial intelligence. Yes, it is based on the training sets created, and on rules that construct a tree which branches out based on the weights of its edges.

This makes me wonder: can Artificial Intelligence be creative, or do the kind of lateral thinking that produces out-of-the-box solutions? After all, it works from data points, facts, and experience captured during past scenarios.

I believe that if it is to be creative, it should start with a node that closely resembles an existing one, and recompute the edge weights based on the new starting point and the path created.

If this kind of intelligence could be built into Artificial Intelligence, it would be very helpful where the context changes a lot with the environment.

In IoT, when machines connect and use analytics, they could leverage Artificial Intelligence to come to quick conclusions based on the data provided and the context.

Until such a system is in place, we could leverage abstractions of events and context in terms of patterns and find the nearest match.

Changes in the display of web pages

Presently, blogs and their contents are flat text. It would be great if a 3D helper appeared when we visit a blog, one that first listens to the creator and narrates the contents with voice variation.

The 3D helper could gauge whether the user is understanding the context, ask relevant questions, and arrange the information according to its relevance.

It would be even better if the 3D helper could sense the reader's facial expressions, understand their interest in the contents, and adjust the display of information accordingly.

I believe this kind of display would really help the reader engage with and understand the content, along with the questions posed by the author. The author, in turn, could gather more feedback and understand what new ideas the shared content has sparked.

Applying analytics to these observations would help adapt the displayed information to customer preferences, making it more effective and helping readers connect with and share the information.

How our old generation disappears

The rapid expansion of new technologies such as IoT, Big Data, Analytics, 3D Printing, on-demand computing, payment technologies, SCM, social media, mobile computing, and Cloud is reshaping how we think, how we live, and how we carry out normal everyday activities.

Businesses that proactively understand these technologies and their impact can leverage them, for example using IoT and Analytics for sensor detection and for creating intelligent devices.

3D printing could be used for prototyping, creating three-dimensional objects from imagined to physical form without much investment.

Flipkart and Amazon have created very good supply chain management networks, using cashless transactions and reaching out to a wider set of customers.

Social media has already started making an impact, bringing the world together to share its thoughts.

Mobile has already shrunk the way we interact, letting us connect from any place.

I expect that in the near future we will forget our old styles of travelling by car and flight and of using mobile for all communication; desktops may go out of use. Big shopping markets, our currency formats, and the way business is done will survive if they adapt to these technologies; if not, they will lose their competitive edge.

Future of web applications

Looking at current IT trends, mobile applications, IoT devices, and interactive consoles (ATMs, device interactions, …) are slowly replacing web applications.

I expect that in the near future these devices will leverage the capabilities of sensors, IoT, analytics, and facial recognition. They would sense the mood of customers to gather inputs, and analytics would make the devices more intelligent at getting the right inputs.

As the front end, such as the web page, changes, we need to look at how the middleware layer would interact with different front-end devices and carry out the transactions.

Some areas, such as payments where cashless transactions are done, are already in the process of moving to this approach.

Google cars interact directly with GPS and use their intelligence to drive the car. Intelligent remote interactions that make devices behave according to the interaction would open up a lot of opportunities.

These ideas are implemented in only a few areas; there are many more areas where implementing them would create a bigger trend in how customers interact and provide inputs.

Customer irrationality to products reduced through Analytics

Every human wants to make decisions rationally, based on the information available. The decision to purchase a product is mostly guided by the list of attributes the customer expects in the product and how well the product satisfies them. But customers are moved from rational to irrational by the sudden influence of factors that affect the purchase decision.

This post discusses how to handle the irrationality of customers that influences the buying decision, using guided factors of influence.

The major products that have swept the market are based on identifying the attributes essential to customers and on how those attributes are incorporated into the product. Aligning these attributes with the product also helps customers build a brand association, connecting the brand to the product.

Analytics has already started playing a major role in identifying these attributes, based on interactions on social media and the web, purchase styles, and the list of features customers expected while trying to purchase.

Applying an analytics model to the data and understanding the relationships between the attributes that most strongly influence the customer decision is the primary task. From that, decisions about channels, product customization, and market factors can be distilled, helping decision makers use analytics tools to identify the variance.

Why Games are so interesting

Even though there are plenty of algorithms and analyses that help build predictive models and suggest strategy and tactics for playing a game better, the chaos and dynamics still make the game interesting.

In football, the strategy and tactics to adopt keep changing based on past play and team formation, yet the dynamics still play a major role.

In cricket and baseball, even when predictions for the bowler and batter are defined based on curve and angle, there is still unpredictability that changes the dynamics of the game.