
1 A quick run-through about Machine Learning

2 Classification (Supervised)

3 Simplest model of data: Y = α + βX + noise, a deterministic (functional) relationship between X and Y plus additive noise.

4 Simplest model of data: Y = α + βX + noise. "Learning" = estimating the parameters α, β, σ from (x, y) pairs. α and β can be estimated by least squares (β̂ = Σ(x_i − x̄)(y_i − ȳ) / Σ(x_i − x̄)^2, α̂ = ȳ − β̂ x̄, so the fitted line passes through the empirical mean), and σ̂^2 is the residual variance of the fit.
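A minimal numerical sketch of this least-squares fit in plain NumPy; the data are synthetic and the "true" α, β, σ values below are made up purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data from Y = alpha + beta*X + noise (illustrative values).
alpha_true, beta_true, sigma_true = 1.0, 2.0, 0.5
x = rng.uniform(0, 10, size=200)
y = alpha_true + beta_true * x + rng.normal(0, sigma_true, size=200)

# Least-squares estimates of the parameters.
x_bar, y_bar = x.mean(), y.mean()
beta_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
alpha_hat = y_bar - beta_hat * x_bar           # line passes through (x_bar, y_bar)
residuals = y - (alpha_hat + beta_hat * x)
sigma2_hat = np.mean(residuals ** 2)           # residual variance

print(alpha_hat, beta_hat, sigma2_hat)
```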

5 Learning with ignorance Latent “switch” variable – hidden process at work

6 Decision trees (example tree: test "blue?", then "big?", then "oval?", with yes/no leaves)

7 Decision trees
+ Handles mixed variables
+ Handles missing data
+ Efficient for large data sets
+ Handles irrelevant attributes
+ Easy to understand
- Predictive power
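A toy sketch of a decision-tree classifier, assuming scikit-learn is available. The 0/1 features mirror the slide's "blue?/big?/oval?" example, and the data and labels are made up for illustration only:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical objects described by three binary attributes: blue?, big?, oval?
X = [[1, 1, 0],   # blue, big, not oval
     [1, 0, 1],   # blue, small, oval
     [0, 1, 1],   # not blue, big, oval
     [0, 0, 0]]   # not blue, small, not oval
y = [1, 1, 0, 0]  # made-up class labels

tree = DecisionTreeClassifier(max_depth=3).fit(X, y)
print(export_text(tree, feature_names=["blue", "big", "oval"]))  # readable tree
print(tree.predict([[1, 1, 1]]))                                 # classify a new object
```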

8 Feedforward neural network: input layer → hidden layer → output layer, with a weight on each arc and a sigmoid function at each node.

9 Feedforward neural network
- Handles mixed variables
- Handles missing data
- Efficient for large data sets
- Handles irrelevant attributes
- Easy to understand
+ Predictive power
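A minimal sketch of the forward pass described above (one hidden layer, sigmoid at each node); the layer sizes are arbitrary and the weights are random placeholders rather than learned values:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# Network shape: 3 inputs -> 4 hidden units -> 1 output (arbitrary sizes).
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)   # weights on input->hidden arcs
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)   # weights on hidden->output arcs

def forward(x):
    h = sigmoid(W1 @ x + b1)      # sigmoid at each hidden node
    return sigmoid(W2 @ h + b2)   # sigmoid at the output node

print(forward(np.array([0.5, -1.0, 2.0])))
```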

10 Nearest Neighbor
– Remember all your data
– When someone asks a question, find the nearest old data point and return the answer associated with it

11 Nearest Neighbor
- Handles mixed variables
+ Handles missing data
- Efficient for large data sets
- Handles irrelevant attributes
- Easy to understand
+ Predictive power
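A bare-bones sketch of the 1-nearest-neighbour rule described above, using plain NumPy and Euclidean distance; the stored points and labels are made-up toy values:

```python
import numpy as np

# "Remember all your data": store the training points and their labels.
X_train = np.array([[1.0, 1.0], [2.0, 1.5], [8.0, 8.0], [9.0, 7.5]])
y_train = np.array([0, 0, 1, 1])          # hypothetical labels

def predict_1nn(x_query):
    # Find the nearest stored point and return the answer associated with it.
    dists = np.linalg.norm(X_train - x_query, axis=1)
    return y_train[np.argmin(dists)]

print(predict_1nn(np.array([1.5, 1.2])))  # -> 0
print(predict_1nn(np.array([8.5, 8.0])))  # -> 1
```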

12 Support vector machine
– Training data: pairs (x_i, y_i), where x_i is an l-dimensional vector and y_i ∈ {+1, -1} is a true/false flag
– Separating hyperplane: w·x + b = 0
– Inequalities: y_i (w·x_i + b) ≥ 1 for all i
– Support vectors: the training points for which the inequality holds with equality (they lie on the margin)
– Support vector expansion: w = Σ_i α_i y_i x_i
– Decision: f(x) = sign( Σ_i α_i y_i (x_i·x) + b )
– The margin (the distance between the hyperplanes w·x + b = ±1) equals 2/||w||

13 Support Vector Machines (SVMs) Two key ideas:
– Large margins are good
– Kernel trick

14 SVMs: summary
- Handles mixed variables
- Handles missing data
- Efficient for large data sets
- Handles irrelevant attributes
- Easy to understand
+ Predictive power
General lessons from SVM success:
– Large margin classifiers are good
– The kernel trick can be used to make many linear methods non-linear, e.g., kernel PCA, kernelized mutual information
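A small sketch of the two key ideas in practice, assuming scikit-learn is available: an RBF-kernel SVM (large margin + kernel trick) fits a toy XOR-style data set that no linear separator can handle. The data and parameter values are made up for illustration:

```python
import numpy as np
from sklearn.svm import SVC

# Toy XOR-style data: not linearly separable in the original space.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])

# Large-margin classifier + kernel trick (RBF kernel).
clf = SVC(kernel="rbf", C=10.0, gamma=2.0).fit(X, y)
print(clf.predict(X))            # should recover [0, 1, 1, 0]
print(clf.support_vectors_)      # the points that define the margin
```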

15 Boosting
– Can boost any weak learner
– Most commonly: boosted decision "stumps"
+ Handles mixed variables
+ Handles missing data
- Efficient for large data sets
+ Handles irrelevant attributes
- Easy to understand
+ Predictive power
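A short sketch of boosted decision stumps, assuming scikit-learn is available: its AdaBoostClassifier uses depth-1 decision trees (stumps) as the default weak learner, and the two-moons data set here is just a convenient toy example:

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import AdaBoostClassifier

# Toy two-class data set.
X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

# Boosted decision stumps: the default weak learner is a depth-1 tree.
boost = AdaBoostClassifier(n_estimators=100, random_state=0).fit(X, y)
print(boost.score(X, y))   # training accuracy of the boosted ensemble
```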

16 Supervised learning: semi-summary
– Learn a mapping F from inputs to outputs using a training set of (x, t) pairs
– F can be drawn from different hypothesis spaces, e.g., decision trees, linear separators, linear in high dimensions, mixtures of linear
– Algorithms offer a variety of tradeoffs
– Many good books, e.g., "The Elements of Statistical Learning", Hastie, Tibshirani, Friedman, 2001; "Pattern Classification", Duda, Hart, Stork, 2001

17 Probabilistic graphical models: probabilistic models with graphical structure. Directed graphical models (Bayes nets, DBNs) include the hidden Markov model (HMM), the naïve Bayes classifier, mixtures of experts, and the Kalman filter model; undirected graphical models (MRFs) include the Ising model.

18 Family of Alarm Bayesian Networks
– Qualitative part: a directed acyclic graph (DAG); nodes are random variables, edges are direct influences (e.g., Earthquake and Burglary → Alarm, Earthquake → Radio, Alarm → Call)
– Quantitative part: a set of conditional probability distributions, e.g., a table P(A | E, B) at the Alarm node
– Compact representation of probability distributions via conditional independence
– Together they define a unique distribution in factored form
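A tiny sketch of that factored form for the alarm example. The graph structure follows the slide, but the slide's own probability table did not survive extraction, so all CPT numbers below are hypothetical placeholders:

```python
# Factored joint: P(B, E, A, R, C) = P(B) P(E) P(A | B, E) P(R | E) P(C | A)
# All probability values below are made-up placeholders.
p_burglary = 0.01                                     # P(Burglary = True)
p_earthquake = 0.02                                   # P(Earthquake = True)
p_alarm = {(True, True): 0.95, (True, False): 0.90,   # P(Alarm=True | B, E)
           (False, True): 0.20, (False, False): 0.01}
p_radio = {True: 0.80, False: 0.05}                   # P(Radio=True | Earthquake)
p_call = {True: 0.70, False: 0.01}                    # P(Call=True | Alarm)

def bern(p_true, value):
    """P(X = value) for a binary variable with P(X = True) = p_true."""
    return p_true if value else 1.0 - p_true

def joint(b, e, a, r, c):
    # Product of one local factor per node, given its parents.
    return (bern(p_burglary, b) * bern(p_earthquake, e)
            * bern(p_alarm[(b, e)], a) * bern(p_radio[e], r)
            * bern(p_call[a], c))

# P(B=T, E=F, A=T, R=F, C=T) under the factored model.
print(joint(True, False, True, False, True))
```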

19 Example: "ICU Alarm" network. Domain: monitoring intensive-care patients; 37 variables and 509 parameters, instead of 2^54. (figure: the full network over variables such as HR, BP, CO, SAO2, PCWP, VENTLUNG, …)

20 Probabilistic Inference Posterior probabilities: the probability of any event given any evidence, P(X | E). (figure: the ICU Alarm network again)

21 Bayesian inference
+ Elegant: no distinction between parameters and other hidden variables
+ Can use priors to learn from small data sets (c.f. one-shot learning by humans)
- Math can get hairy
- Often computationally intractable

22 Structure learning
– Adding an arc: increases the number of parameters to be estimated; wrong assumptions about the domain structure
– Missing an arc: wrong assumptions about the domain structure that cannot be compensated for by fitting parameters
(figure: variants of the Earthquake/Burglary/Alarm/Set/Sound network with an extra or a missing arc)

23 Score-based Learning
– Define a scoring function that evaluates how well a structure matches the data
– Search for a structure that maximizes the score
(figure: data over E, B, A and several candidate structures over those variables)

24 Problems with local search: easy to get stuck in local optima of the score S(G | D). (figure: the score landscape, with the search stuck at a local optimum far from the "truth")

25 Problems with local search II: picking a single best model can be misleading
– Small sample size ⇒ many high-scoring models
– An answer based on one model is often useless
– Want features common to many models
(figure: several structures over E, R, B, A, C with comparable posterior P(G | D))

26 Structure learning: other issues
– Discovering latent variables
– Learning causal models
– Learning from interventional data
– Active learning

27 Discovering latent variables. (figure: two models of the same observed variables, a) with 17 parameters and b) with 59 parameters.) There are some techniques for automatically detecting the possible presence of latent variables.

28 Learning causal models So far, we have only assumed that X -> Y -> Z means that Z is independent of X given Y. However, we often want to interpret directed arrows causally. This is uncontroversial for the arrow of time. But can we infer causality from static observational data?

29 Learning causal models We can infer causality from static observational data if we have at least four measured variables and certain "tetrad" conditions hold. See books by Pearl and Spirtes et al. However, we can only learn up to Markov equivalence, no matter how much data we have. (figure: the Markov-equivalent orientations of a three-variable chain over X, Y, Z)

30 Learning from interventional data The only way to distinguish between Markov equivalent networks is to perform interventions, e.g., gene knockouts. We need to (slightly) modify our learning algorithms: cut the arcs coming into nodes which were set by intervention. Example (smoking → yellow fingers): P(smoker | observe(yellow)) >> prior, but P(smoker | do(paint yellow)) = prior.

31 Active learning Which experiments (interventions) should we perform to learn structure as efficiently as possible? This problem can be modeled using decision theory. Exact solutions are wildly computationally intractable. Can we come up with good approximate decision making techniques? Can we implement hardware to automatically perform the experiments? “AB: Automated Biologist”

32 Hidden Markov Model Another probabilistic network model

33 Outline Hidden Markov Models (HMM) Three principal problems of HMMs
– The "Evaluation Problem"
– The "Decoding Problem"
– The "Learning Problem"

34 Hidden Markov Models (Sean R. Eddy, Nature Biotechnology 22(10), 1315)

35 Notation
– S_i denotes a state; q_t denotes the current state (the state at time t)
– N: the number of states of the model
– M: the number of observation symbols in the alphabet, i.e. M = 20 for proteins and M = 4 for nucleotides
– The hidden Markov model is written λ = (A, B, π)

36 Mathematical Definition
– Transition probabilities (A): a_ij = P(q_t+1 = S_j | q_t = S_i)
– Emission probabilities (B): b_j(k) = P(O_t = v_k | q_t = S_j)
– Initial state distribution (π): π_i = P(q_1 = S_i)

37 Principal Problems of HMMs
– The Evaluation Problem: given an HMM and a sequence of observations, what is the probability that the observations are generated by the model? (e.g., given a protein family model, what is the probability that a protein sequence belongs to that family?)
– The Decoding Problem: given a model and a sequence of observations, what is the most likely state sequence in the model that produced the observations? (e.g., given an intron-splicing sequence model, which is the most likely intron splicing site in a long sequence?)
– The Learning Problem: train HMM models from unaligned sequences, or build HMM models from aligned sequences

38 Evaluation Problem Forward and backward variables
– Forward variable: alpha_t(i) = P(O_1 … O_t, q_t = S_i | λ), computed recursively over t
– Backward variable: beta_t(i) = P(O_t+1 … O_T | q_t = S_i, λ)
– Answer: P(O | λ) = Σ_i alpha_T(i)
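A compact NumPy sketch of the forward algorithm for the evaluation problem; the two-state model and the observation sequence are made-up toy values:

```python
import numpy as np

# Toy HMM: 2 hidden states, 2 observation symbols (all numbers illustrative).
A = np.array([[0.7, 0.3],     # A[i, j] = P(q_{t+1} = j | q_t = i)
              [0.4, 0.6]])
B = np.array([[0.9, 0.1],     # B[j, k] = P(O_t = k | q_t = j)
              [0.2, 0.8]])
pi = np.array([0.5, 0.5])     # initial state distribution

def forward(obs):
    """Return P(O | lambda) by summing the final forward variables."""
    alpha = pi * B[:, obs[0]]             # alpha_1(i)
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]     # recursion over t
    return alpha.sum()                    # P(O | lambda) = sum_i alpha_T(i)

print(forward([0, 1, 1, 0]))
```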

39 Decoding Problem Viterbi Algorithm Dynamic programming
– Compute delta_t(j) = max_i [ delta_t-1(i) a_ij ] b_j(O_t) recursively, always keeping a pointer to the "winning state" in the maximum-finding operation
– Finally the best final state is found, q_T* = argmax_i delta_T(i)
– Starting from this state, the sequence of states is back-tracked as the pointer in each state indicates; this gives the required set of states
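A matching sketch of the Viterbi recursion with back-tracking, reusing the same toy model as the forward-algorithm sketch above (all parameter values illustrative):

```python
import numpy as np

A = np.array([[0.7, 0.3], [0.4, 0.6]])     # transition probabilities
B = np.array([[0.9, 0.1], [0.2, 0.8]])     # emission probabilities
pi = np.array([0.5, 0.5])                  # initial state distribution

def viterbi(obs):
    """Return the most likely state sequence for the observations."""
    n_states, T = A.shape[0], len(obs)
    delta = pi * B[:, obs[0]]                     # delta_1(i)
    back = np.zeros((T, n_states), dtype=int)     # pointers to the winning states
    for t in range(1, T):
        scores = delta[:, None] * A               # scores[i, j] = delta_{t-1}(i) * a_ij
        back[t] = scores.argmax(axis=0)           # winning predecessor for each state j
        delta = scores.max(axis=0) * B[:, obs[t]]
    # Back-track from the best final state.
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

print(viterbi([0, 1, 1, 0]))
```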

40 Learning Problem
– Build an HMM: in case you know the "hidden" state path (i.e., pre-aligned sequences with a known state path), directly compute the state emission and transition probabilities from counts
– Train an HMM: in case only observations are available (i.e., unaligned sequences); objective: maximize P(O | λ); approaches: Baum-Welch (an expectation-maximization (EM) algorithm) or gradient-based methods

41 HMM Advantages Statistical Grounding – Statisticians are comfortable with the theory behind hidden Markov models – Freedom to manipulate the training and verification processes – Mathematical / theoretical analysis of the results and processes – HMMs are still very powerful modeling tools – far more powerful than many statistical methods

42 HMM Advantages continued Modularity – HMMs can be combined into larger HMMs Transparency of the Model – Assuming an architecture with a good design – People can read the model and make sense of it – The model itself can help increase understanding

43 HMM Advantages continued Incorporation of Prior Knowledge – Incorporate prior knowledge into the architecture – Initialize the model close to something believed to be correct – Use prior knowledge to constrain training process

44 HMM Disadvantages Markov Chains
– States are supposed to be independent: P(y) must be independent of P(x), and vice versa
– This usually isn't true
– Can get around it when relationships are local
– Not good for RNA folding problems, where distant positions x and y interact

45 Clustering (Unsupervised)

46 An Illustration

47 Concepts of Clustering Clusters Different ways of representing clusters
– Division with boundaries
– Spheres
– Probabilistic (e.g., a table giving each item I1 … In a membership probability in clusters 1, 2, 3, such as 0.5, 0.2, 0.3)
– Dendrograms
– …

48 Clustering Clustering quality
– Inter-cluster distance: maximized
– Intra-cluster distance: minimized
The quality of a clustering result depends on both the similarity measure used by the method and its application. The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns.
Clustering vs. classification
– Which one is more difficult? Why?
– There are a huge number of clustering techniques.

49 Dissimilarity/Distance Measure Dissimilarity/similarity metric: similarity is expressed in terms of a distance function d(i, j), which is typically a metric. The definitions of distance functions are usually very different for interval-scaled, boolean, categorical, ordinal and ratio variables. Weights should be associated with different variables based on applications and data semantics. It is hard to define "similar enough" or "good enough"; the answer is typically highly subjective.

50 Types of data in clustering analysis Interval-scaled variables Binary variables Nominal, ordinal, and ratio variables Variables of mixed types

51 Interval-valued variables Continuous measurements on a roughly linear scale, e.g., weight, height, temperature, etc. Standardize the data (depending on the application):
– Calculate the mean absolute deviation s_f = (1/n)(|x_1f − m_f| + |x_2f − m_f| + … + |x_nf − m_f|), where m_f is the mean of variable f
– Calculate the standardized measurement (z-score) z_if = (x_if − m_f) / s_f

52 Similarity Between Objects Distance: measure the similarity or dissimilarity between two data objects. Some popular ones include the Minkowski distance:
d(i, j) = ( |x_i1 − x_j1|^q + |x_i2 − x_j2|^q + … + |x_ip − x_jp|^q )^(1/q)
where (x_i1, x_i2, …, x_ip) and (x_j1, x_j2, …, x_jp) are two p-dimensional data objects, and q is a positive integer. If q = 1, d is the Manhattan distance.

53 Similarity Between Objects (Cont.) If q = 2, d is the Euclidean distance:
d(i, j) = sqrt( |x_i1 − x_j1|^2 + |x_i2 − x_j2|^2 + … + |x_ip − x_jp|^2 )
– Properties: d(i, j) ≥ 0; d(i, i) = 0; d(i, j) = d(j, i); d(i, j) ≤ d(i, k) + d(k, j)
Also, one can use weighted distance, and many other similarity/distance measures.
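A small NumPy sketch of the Minkowski distance family described in the two slides above; the two example vectors are arbitrary:

```python
import numpy as np

def minkowski(x, y, q):
    """Minkowski distance of order q between two p-dimensional objects."""
    return np.sum(np.abs(x - y) ** q) ** (1.0 / q)

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])

print(minkowski(x, y, 1))   # q = 1: Manhattan distance -> 5.0
print(minkowski(x, y, 2))   # q = 2: Euclidean distance -> ~3.606
```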

54 Binary Variables A contingency table for binary data on objects i and j: a = number of attributes where both are 1, b = where i is 1 and j is 0, c = where i is 0 and j is 1, d = where both are 0
– Simple matching coefficient (invariant, if the binary variable is symmetric): d(i, j) = (b + c) / (a + b + c + d)
– Jaccard coefficient (noninvariant, if the binary variable is asymmetric): d(i, j) = (b + c) / (a + b + c)
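A short sketch computing both coefficients from two binary attribute vectors; the vectors are made up, and the variable names follow the contingency-table counts above:

```python
def binary_dissimilarity(i, j):
    """Simple matching and Jaccard dissimilarities for two 0/1 vectors."""
    a = sum(1 for x, y in zip(i, j) if x == 1 and y == 1)  # both 1
    b = sum(1 for x, y in zip(i, j) if x == 1 and y == 0)
    c = sum(1 for x, y in zip(i, j) if x == 0 and y == 1)
    d = sum(1 for x, y in zip(i, j) if x == 0 and y == 0)  # both 0
    simple_matching = (b + c) / (a + b + c + d)
    jaccard = (b + c) / (a + b + c) if (a + b + c) else 0.0
    return simple_matching, jaccard

# Two objects described by five binary (e.g., symptom) attributes.
print(binary_dissimilarity([1, 0, 1, 0, 0], [1, 1, 0, 0, 0]))  # -> (0.4, 0.666...)
```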

55 Dissimilarity of Binary Variables Example – gender is a symmetric attribute (not used below) – the remaining attributes are asymmetric attributes – let the values Y and P be set to 1, and the value N be set to 0

56 Nominal Variables A generalization of the binary variable in that it can take more than 2 states, e.g., red, yellow, blue, green, etc.
– Method 1: simple matching, d(i, j) = (p − m) / p, where m is the number of matches and p is the total number of variables
– Method 2: use a large number of binary variables, creating a new binary variable for each of the M nominal states

57 Ordinal Variables An ordinal variable can be discrete or continuous. Order is important, e.g., rank. Can be treated like interval-scaled variables (f is a variable):
– replace x_if by its rank r_if ∈ {1, …, M_f}
– map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by z_if = (r_if − 1) / (M_f − 1)
– compute the dissimilarity using methods for interval-scaled variables

58 Ratio-Scaled Variables Ratio-scaled variable: a measurement on a nonlinear scale, approximately at exponential scale, such as Ae^(Bt) or Ae^(−Bt), e.g., growth of a bacteria population. Methods:
– treat them like interval-scaled variables: not a good idea! (why? the scale can be distorted)
– apply a logarithmic transformation y_if = log(x_if)
– treat them as continuous ordinal data and then treat their ranks as interval-scaled

59 Variables of Mixed Types A database may contain all six types of variables: symmetric binary, asymmetric binary, nominal, ordinal, interval and ratio. One may use a weighted formula to combine their effects:
– f is binary or nominal: d_ij(f) = 0 if x_if = x_jf, otherwise d_ij(f) = 1
– f is interval-based: use the normalized distance
– f is ordinal or ratio-scaled: compute the ranks r_if, set z_if = (r_if − 1) / (M_f − 1), and treat z_if as interval-scaled

60 Major Clustering Techniques
– Partitioning algorithms: construct various partitions and then evaluate them by some criterion
– Hierarchical algorithms: create a hierarchical decomposition of the set of data (or objects) using some criterion
– Density-based: based on connectivity and density functions
– Model-based: a model is hypothesized for each of the clusters, and the idea is to find the best fit of the models to the data

61 Partitioning Algorithms: Basic Concept Partitioning method: construct a partition of a database D of n objects into a set of k clusters. Given k, find a partition of k clusters that optimizes the chosen partitioning criterion
– Global optimum: exhaustively enumerate all partitions
– Heuristic methods: k-means and k-medoids algorithms
– k-means: each cluster is represented by the center of the cluster
– k-medoids or PAM (Partitioning Around Medoids): each cluster is represented by one of the objects in the cluster

62 The K-Means Clustering Given k, the k-means algorithm is as follows:
1) Choose k cluster centers to coincide with k randomly chosen points
2) Assign each data point to the closest cluster center
3) Recompute the cluster centers using the current cluster memberships
4) If a convergence criterion is not met, go to 2). Typical convergence criteria are: no (or minimal) reassignment of data points to new cluster centers, or a minimal decrease in the squared error Σ_i Σ_{p ∈ C_i} |p − m_i|^2, where p is a point and m_i is the mean of cluster C_i

63 Example For simplicity, one-dimensional data and k = 2. Data: 1, 2, 5, 6, 7. K-means:
– Randomly select 5 and 6 as initial centroids
– ⇒ two clusters {1, 2, 5} and {6, 7}; meanC1 = 8/3, meanC2 = 6.5
– ⇒ {1, 2} and {5, 6, 7}; meanC1 = 1.5, meanC2 = 6
– ⇒ no change
– Aggregate dissimilarity = 0.5^2 + 0.5^2 + 1^2 + 0^2 + 1^2 = 2.5
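A minimal NumPy sketch of these steps; with the same data and the same initial centroids (5 and 6) it converges to the means 1.5 and 6 with aggregate dissimilarity 2.5:

```python
import numpy as np

def kmeans_1d(data, centers, max_iter=100):
    data = np.asarray(data, dtype=float)
    centers = np.asarray(centers, dtype=float)
    for _ in range(max_iter):
        # Step 2: assign each point to the closest cluster center.
        labels = np.argmin(np.abs(data[:, None] - centers[None, :]), axis=1)
        # Step 3: recompute the centers from the current memberships
        # (assumes no cluster becomes empty, true for this toy data).
        new_centers = np.array([data[labels == k].mean() for k in range(len(centers))])
        if np.allclose(new_centers, centers):      # Step 4: convergence check
            break
        centers = new_centers
    sse = np.sum((data - centers[labels]) ** 2)    # aggregate dissimilarity
    return centers, labels, sse

print(kmeans_1d([1, 2, 5, 6, 7], centers=[5, 6]))
# -> centers [1.5, 6.0], clusters {1, 2} and {5, 6, 7}, SSE 2.5
```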

64 Comments on K-Means Strength: efficient, O(tkn), where n is the number of data points, k the number of clusters, and t the number of iterations; normally k, t << n.
Comment: often terminates at a local optimum. The global optimum may be found using techniques such as deterministic annealing and genetic algorithms.
Weaknesses
– Applicable only when a mean is defined; difficult for categorical data
– Need to specify k, the number of clusters, in advance
– Sensitive to noisy data and outliers
– Not suitable for discovering clusters with non-convex shapes
– Sensitive to initial seeds

65 Variations of the K-Means Method A few variants of k-means differ in
– Selection of the initial k seeds
– Dissimilarity measures
– Strategies to calculate cluster means
Handling categorical data: k-modes
– Replacing means of clusters with modes
– Using new dissimilarity measures to deal with categorical objects
– Using a frequency-based method to update modes of clusters

66 k-Medoids clustering method The k-means algorithm is sensitive to outliers, since an object with an extremely large value may substantially distort the distribution of the data. Medoid: the most centrally located point in a cluster, used as a representative point of the cluster. In contrast, a centroid is not necessarily inside a cluster. (figure: an example data set with the initial medoids marked)

67 Partition Around Medoids PAM:
1. Given k
2. Randomly pick k instances as initial medoids
3. Assign each data point to the nearest medoid x
4. Calculate the objective function: the sum of dissimilarities of all points to their nearest medoids (squared-error criterion)
5. Randomly select a point y
6. Swap x with y if the swap reduces the objective function
7. Repeat (3–6) until no change

68 Comments on PAM PAM is more robust than k-means in the presence of noise and outliers, because a medoid is less influenced by outliers or other extreme values than a mean (why?). PAM works well for small data sets but does not scale well to large data sets: O(k(n − k)^2) for each change, where n is the number of data points and k the number of clusters. (figure: an outlier 100 units away barely moves the medoid)

69 Hierarchical Clustering Uses a distance matrix for clustering. This method does not require the number of clusters k as an input, but needs a termination condition. (figure: points a–e merged step by step agglomeratively, and split in the reverse order divisively)

70 Agglomerative Clustering At the beginning, each data point forms a cluster (also called a node). Merge the nodes/clusters that have the least dissimilarity, and go on merging; eventually all nodes belong to the same cluster.

71 A Dendrogram Shows How the Clusters are Merged Hierarchically Decompose data objects into several levels of nested partitioning (a tree of clusters), called a dendrogram. A clustering of the data objects is obtained by cutting the dendrogram at the desired level; then each connected component forms a cluster.
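A brief sketch of agglomerative clustering and dendrogram cutting, assuming SciPy is available; the five 2-D points are arbitrary toy data:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Five toy points forming two obvious groups.
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [8.0, 8.0], [8.3, 7.9]])

# Agglomerative clustering: repeatedly merge the least-dissimilar clusters.
Z = linkage(X, method="average", metric="euclidean")

# "Cut the dendrogram at the desired level": here, ask for 2 clusters.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)   # e.g. [1 1 1 2 2]
```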

72 Divisive Clustering Inverse order of agglomerative clustering; eventually each node forms a cluster on its own.

73 More on Hierarchical Methods Major weaknesses of agglomerative clustering methods
– do not scale well: time complexity at least O(n^2), where n is the total number of objects
– can never undo what was done previously
Integration of hierarchical with distance-based clustering to scale up these methods
– BIRCH (1996): uses a CF-tree and incrementally adjusts the quality of sub-clusters
– CURE (1998): selects well-scattered points from the cluster and then shrinks them towards the center of the cluster by a specified fraction

74 Summary Cluster analysis groups objects based on their similarity and has wide applications. Measures of similarity can be computed for various types of data. Clustering algorithms can be categorized into partitioning methods, hierarchical methods, density-based methods, etc. Clustering can also be used for outlier detection, which is useful for fraud detection. What is the best clustering algorithm?

75 Thank you Q & A

