# Motivations Social Bookmarking Socialized Bookmarks Tags.

Machine Learning and Statistical Analysis
Jong Youl Choi Computer Science Department

Motivations: social bookmarking, socialized bookmarks, tags.

Collaborative Tagging System
Motivations: social indexing, or collaborative annotation; collecting knowledge from people; extracting information.

Challenges: a vast amount of data calls for an efficient indexing scheme; very dynamic data calls for temporal analysis; unsupervised data calls for clustering and inference.

Outline

Principles of Machine Learning: Bayes’ theorem and maximum likelihood.

Machine Learning Algorithms: clustering analysis, dimension reduction, classification.

Parallel Computing: general parallel computing architecture, parallel algorithms.

Machine Learning

Definition: algorithms or techniques that enable a computer (machine) to “learn” from data. Related to many areas, such as data mining, statistics, and information theory.

Algorithm types: unsupervised learning, supervised learning, reinforcement learning.

Topics: models such as the Artificial Neural Network (ANN) and the Support Vector Machine (SVM); optimization methods such as Expectation-Maximization (EM) and Deterministic Annealing (DA).

Inductive learning: extracts rules, patterns, or information out of massive data (e.g., decision tree, clustering, …).

Deductive learning: requires no additional input, but improves performance gradually (e.g., advice taker, …).

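Bayes’ Theorem

Posterior probability of $\theta_i$, given $X$:

$$P(\theta_i \mid X) = \frac{P(X \mid \theta_i)\, P(\theta_i)}{P(X)}, \qquad P(X) = \sum_j P(X \mid \theta_j)\, P(\theta_j)$$

Here $\theta_i \in \Theta$ is a parameter, $X$ is the set of observations, $P(\theta_i)$ is the prior (or marginal) probability, and $P(X \mid \theta_i)$ is the likelihood.

Maximum Likelihood (ML): used to find the most plausible $\theta_i \in \Theta$, given $X$. Computing the maximum likelihood (or log-likelihood) is an optimization problem.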
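As a small illustration of the posterior and of picking the most plausible parameter, here is a minimal sketch for a finite set of candidate parameters. The function name `posterior` and the numbers in the example are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def posterior(priors, likelihoods):
    """Return P(theta_i | X) for each candidate theta_i via Bayes' theorem."""
    joint = [p * l for p, l in zip(priors, likelihoods)]
    evidence = sum(joint)  # P(X) = sum_j P(X | theta_j) P(theta_j)
    return [j / evidence for j in joint]


# Hypothetical example: two candidate parameters with equal priors.
priors = [0.5, 0.5]
likelihoods = [0.8, 0.2]  # P(X | theta_1), P(X | theta_2)

post = posterior(priors, likelihoods)

# Maximum likelihood picks the candidate with the largest P(X | theta_i).
ml_index = max(range(len(likelihoods)), key=lambda i: likelihoods[i])
```

With equal priors the posterior is proportional to the likelihood, so the ML choice and the most probable posterior choice coincide in this example.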

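Maximum Likelihood (ML) Estimation

Problem: estimate the hidden parameters $\theta = \{\mu, \sigma\}$ from given data drawn from $k$ Gaussian distributions.

Gaussian distribution:

$$N(x \mid \mu, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$

Maximum likelihood:

$$\theta_{ML} = \arg\max_{\theta} \sum_i \ln P(x_i \mid \theta)$$

With a Gaussian ($P = N$), solve by either brute force or a numerical method (Mitchell, 1997).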
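For the simplest case of a single Gaussian (k = 1, an assumption made here to keep the example short), the maximum of the log-likelihood has a closed form: the sample mean and the (biased) sample standard deviation. The sketch below computes both and also exposes the log-likelihood so the optimum can be checked numerically. Function names and the data are illustrative.

```python
import math


def gaussian_log_likelihood(data, mu, sigma):
    """Log-likelihood of i.i.d. data under N(mu, sigma^2)."""
    n = len(data)
    return (-0.5 * n * math.log(2 * math.pi * sigma ** 2)
            - sum((x - mu) ** 2 for x in data) / (2 * sigma ** 2))


def gaussian_mle(data):
    """Closed-form ML estimates for a single Gaussian."""
    n = len(data)
    mu = sum(data) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in data) / n)
    return mu, sigma


# Hypothetical data set.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
mu_hat, sigma_hat = gaussian_mle(data)
```

For k > 1 components no such closed form exists, which is why iterative methods are needed.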

EM Algorithm

Problems in ML estimation: the observation $X$ is often incomplete; a latent (hidden) variable $Z$ exists; it is hard to explore the whole parameter space.

Expectation-Maximization algorithm. Objective: find the ML estimate over the latent distribution $P(Z \mid X, \theta)$.

Steps:
0. Init: choose a random $\theta_{old}$.
1. E-step: compute the expectation under $P(Z \mid X, \theta_{old})$.
2. M-step: find the $\theta_{new}$ that maximizes the likelihood.
3. Update $\theta_{old} \leftarrow \theta_{new}$ and go to step 1.
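The E-step and M-step can be sketched for a one-dimensional Gaussian mixture, where the latent variable Z is the unknown component that generated each point. This is a minimal sketch, not the presenter's implementation: the function names, the synthetic data, and the fixed (rather than random) starting values are assumptions made for reproducibility.

```python
import math
import random


def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))


def em_gmm_1d(data, mus, sigmas, weights, n_iter=50):
    """EM for a 1-D Gaussian mixture; returns updated (mus, sigmas, weights)."""
    mus, sigmas, weights = list(mus), list(sigmas), list(weights)
    k = len(mus)
    for _ in range(n_iter):
        # E-step: responsibilities P(Z = j | x, theta_old) for every point.
        resp = []
        for x in data:
            p = [weights[j] * normal_pdf(x, mus[j], sigmas[j]) for j in range(k)]
            total = sum(p)
            resp.append([pj / total for pj in p])
        # M-step: re-estimate each component from its weighted points.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            mus[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - mus[j]) ** 2 for r, x in zip(resp, data)) / nj
            sigmas[j] = max(math.sqrt(var), 1e-6)  # guard against collapse
            weights[j] = nj / len(data)
    return mus, sigmas, weights


# Synthetic data: two well-separated Gaussians (illustrative only).
rng = random.Random(0)
data = [rng.gauss(0.0, 1.0) for _ in range(200)] + [rng.gauss(5.0, 1.0) for _ in range(200)]
mus, sigmas, weights = em_gmm_1d(data, [1.0, 4.0], [1.0, 1.0], [0.5, 0.5])
```

Like any EM run, the result depends on the starting point; a poor initialization can converge to a local optimum.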

Clustering Analysis

Definition: grouping unlabeled data into clusters in order to infer hidden structures or information.

Dissimilarity measurement: distance (Euclidean (L2), Manhattan (L1), …); angle (inner product, …); non-metric (rank, intensity, …).

Types of clustering: hierarchical (agglomerative or divisive); partitioning (K-means, VQ, MDS, …) (Matlab help page).
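The distance-based and angle-based measures named above are short enough to write out. The sketch below uses cosine similarity as one common way to turn the inner product into an angle measure; that choice, and the function names, are assumptions for illustration.

```python
import math


def euclidean(a, b):
    """L2 distance."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """L1 distance."""
    return sum(abs(x - y) for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Inner product normalized by vector lengths (cosine of the angle)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```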

K-Means

Find K partitions with the total intra-cluster variance minimized.

Iterative method: initialization with randomized $y_i$; assignment of $x$ (with $y_i$ fixed); update of $y_i$ (with $x$ fixed).

Problem: the method can be trapped in local minima (MacKay, 2003).
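The three steps (random initialization, assignment, update) can be sketched as follows. The optional `init` argument is an addition for reproducible runs and is not part of the slide; the data set is hypothetical.

```python
import random


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))


def kmeans(points, k, init=None, n_iter=100, seed=0):
    """Plain k-means; returns (centers, clusters)."""
    rng = random.Random(seed)
    # Initialization: random data points unless explicit centers are given.
    centers = list(init) if init is not None else rng.sample(points, k)
    clusters = []
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center (centers fixed).
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sq_dist(p, centers[j]))
            clusters[nearest].append(p)
        # Update step: each center moves to the mean of its points (points fixed).
        new_centers = [mean_point(c) if c else centers[j]
                       for j, c in enumerate(clusters)]
        if new_centers == centers:
            break  # converged, possibly to a local minimum
        centers = new_centers
    return centers, clusters


# Hypothetical data: two small groups of points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, clusters = kmeans(points, 2)
```

Running the function with several seeds and keeping the result with the lowest total variance is a common, if crude, way to reduce the local-minimum problem.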

Deterministic Annealing (DA)
Deterministically avoids local minima: no stochastic process (random walk) is involved, and the global solution is traced by changing the level of randomness.

Statistical mechanics: Gibbs distribution $P(E_x) = \frac{1}{Z} e^{-E_x / T}$; Helmholtz free energy $F = D - TS$, with average energy $D = \langle E_x \rangle$ and entropy $S = -\sum_x P(E_x) \ln P(E_x)$, which gives $F = -T \ln Z$.

In DA, $F$ is minimized (Maxima and Minima, Wikipedia).

Deterministic Annealing (DA)
Analogy to the physical annealing process: energy (randomness) is controlled by the temperature, going from high to low.

Starting with a high temperature ($T = \infty$): soft (or fuzzy) association probability; smooth cost function with one global minimum.

Lowering the temperature ($T \to 0$): hard association; the full complexity is revealed and clusters emerge.

$F$ is minimized iteratively, using $E(x, y_j) = \|x - y_j\|^2$.
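The annealing loop can be sketched with the commonly used update rules for DA clustering: association probabilities proportional to exp(-E(x, y_j) / T), and centers set to the probability-weighted means. Those update rules, the cooling schedule, and the small deterministic perturbation that lets coincident centers separate are all assumptions of this sketch, since the slide does not spell them out.

```python
import math


def da_cluster(points, k, t_start=1000.0, t_min=0.01, cooling=0.9,
               inner=20, eps=1e-3):
    """Deterministic-annealing clustering sketch; returns k centers."""
    dim = len(points[0])
    centroid = [sum(p[d] for p in points) / len(points) for d in range(dim)]
    centers = [list(centroid) for _ in range(k)]  # all centers start together
    t = t_start
    while t > t_min:
        # Small fixed (non-random) offset so centers can split below the
        # critical temperature. This is a simplification for the sketch.
        for j, c in enumerate(centers):
            for d in range(dim):
                c[d] += eps * (j - (k - 1) / 2)
        for _ in range(inner):
            sums = [[0.0] * dim for _ in range(k)]
            mass = [0.0] * k
            for p in points:
                energy = [sum((p[d] - c[d]) ** 2 for d in range(dim))
                          for c in centers]
                e_min = min(energy)  # shift to avoid underflow at low T
                w = [math.exp(-(e - e_min) / t) for e in energy]
                z = sum(w)
                for j in range(k):
                    pj = w[j] / z  # soft association P(y_j | x)
                    mass[j] += pj
                    for d in range(dim):
                        sums[j][d] += pj * p[d]
            centers = [[sums[j][d] / mass[j] for d in range(dim)]
                       if mass[j] > 0 else centers[j] for j in range(k)]
        t *= cooling  # lower the temperature
    return centers


# Hypothetical data: two small groups of points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers = da_cluster(points, 2)
```

At high temperature every point is associated almost equally with every center, so the centers sit at the data centroid; as the temperature drops the associations harden and the procedure ends in the same fixed point that k-means would reach from a good start.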

Dimension Reduction

Definition: a process that transforms high-dimensional data into low-dimensional data in order to improve accuracy, aid understanding, or remove noise.

Curse of dimensionality: complexity grows exponentially in volume as extra dimensions are added.

Types: feature selection chooses representatives (e.g., filter, …); feature extraction maps to a lower dimension (e.g., PCA, MDS, …) (Koppen, 2000).

Principal Component Analysis (PCA)

Finds a map of the principal components (PCs) of the data into an orthogonal space, such that $y = W x$, where $W \in \mathbb{R}^{d \times h}$ ($h \gg d$).

PCs: the variables with the largest variances. Properties: orthogonality; linearity, which gives the optimal least mean-square error.

Limitations: strict linearity; a specific distribution; the large-variance assumption.

[Figure: PC1 and PC2 drawn over data in the $x_1$, $x_2$ plane.]
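To make the idea of a largest-variance direction concrete, the sketch below finds the first principal component by power iteration on the covariance matrix. Power iteration is a swapped-in technique chosen to avoid external libraries (a full PCA would normally use an eigendecomposition or SVD), and the data are centered first, which the slide leaves implicit. All names are illustrative.

```python
import math


def first_principal_component(data, n_iter=200):
    """Return (mean, w) where w is the unit direction of largest variance."""
    n, h = len(data), len(data[0])
    mean = [sum(row[j] for row in data) / n for j in range(h)]
    centered = [[row[j] - mean[j] for j in range(h)] for row in data]
    cov = [[sum(r[a] * r[b] for r in centered) / n for b in range(h)]
           for a in range(h)]
    # Power iteration: repeatedly apply cov and renormalize. This assumes the
    # start vector is not orthogonal to the leading eigenvector.
    w = [1.0 / math.sqrt(h)] * h
    for _ in range(n_iter):
        v = [sum(cov[a][b] * w[b] for b in range(h)) for a in range(h)]
        norm = math.sqrt(sum(c * c for c in v))
        w = [c / norm for c in v]
    return mean, w


def project(x, mean, w):
    """One-dimensional coordinate y = w . (x - mean)."""
    return sum((xi - mi) * wi for xi, mi, wi in zip(x, mean, w))


# Hypothetical data lying exactly on the line x2 = 2 * x1.
data = [(t, 2.0 * t) for t in (-2.0, -1.0, 0.0, 1.0, 2.0)]
mean, pc1 = first_principal_component(data)
```

Because the example data lie on a line, a single component captures all of the variance; real data would need further components, found for instance by removing the first component's contribution and repeating.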

Random Projection

Like PCA, reduces dimension by $y = R x$, where $R$ is a random matrix with i.i.d. columns and $R \in \mathbb{R}^{d \times p}$ ($p \gg d$).

Johnson-Lindenstrauss lemma: when projecting to a randomly selected subspace, distances are approximately preserved.

Generating $R$: an orthogonalized $R$ is hard to obtain; a Gaussian $R$ is one option; a simple approach is to choose $r_{ij} \in \{+\sqrt{3}, 0, -\sqrt{3}\}$ with probabilities $1/6$, $4/6$, $1/6$, respectively.
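The simple sparse construction can be tried directly. In the sketch below the projection is additionally scaled by 1/sqrt(d) so that distances are preserved in expectation; that scaling, the dimensions, and the helper names are assumptions of the example rather than something stated in the slide.

```python
import math
import random


def sparse_random_matrix(d, p, rng):
    """d x p matrix with entries +sqrt(3), 0, -sqrt(3) at odds 1/6, 4/6, 1/6."""
    s = math.sqrt(3.0)
    return [rng.choices((+s, 0.0, -s), weights=(1, 4, 1), k=p)
            for _ in range(d)]


def project(R, x):
    """y = R x / sqrt(d); the scaling keeps expected squared lengths equal."""
    scale = 1.0 / math.sqrt(len(R))
    return [scale * sum(r * xi for r, xi in zip(row, x)) for row in R]


def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


rng = random.Random(0)
p, d = 1000, 400  # original and reduced dimensions (illustrative)
R = sparse_random_matrix(d, p, rng)
u = [rng.uniform(-1, 1) for _ in range(p)]
v = [rng.uniform(-1, 1) for _ in range(p)]

# Ratio of projected distance to original distance; should be close to 1.
ratio = distance(project(R, u), project(R, v)) / distance(u, v)
```

Two thirds of the entries are zero, so the projection needs roughly a third of the multiplications a dense Gaussian matrix would.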

Multi-Dimensional Scaling (MDS)

Dimension reduction that preserves the distance proximities observed in the original data set.

Loss functions: inner product, distance, squared distance.

Classical MDS: minimizes STRAIN, given the dissimilarity matrix $\Delta$. From $\Delta$, find the inner product matrix $B$ (double centering). From $B$, recover the coordinates $X'$ (i.e., $B = X' X'^T$).
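The two steps of classical MDS can be written out as follows. This is the standard textbook form, given here as a sketch: $\Delta^{(2)}$ denotes the matrix of squared dissimilarities, $n$ the number of points, and $V_d$, $\Lambda_d$ the eigenvectors and eigenvalues of $B$ for the $d$ largest eigenvalues. These symbols are introduced for the illustration and do not appear in the slide.

```latex
J = I - \tfrac{1}{n}\,\mathbf{1}\mathbf{1}^{T}
\qquad \text{(centering matrix)}

B = -\tfrac{1}{2}\, J\, \Delta^{(2)} J
\qquad \text{(double centering)}

B = V \Lambda V^{T}, \qquad X' = V_{d}\, \Lambda_{d}^{1/2}
\qquad \text{(so that } B \approx X' X'^{T}\text{)}
```

When the dissimilarities are exact Euclidean distances, $B$ is positive semidefinite and the recovered coordinates reproduce the distances up to rotation and translation.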

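Multi-Dimensional Scaling (MDS)

SMACOF: minimizes STRESS.

Majorization: for a complex $f(x)$, find an auxiliary simple $g(x, y)$ such that $f(x) \le g(x, y)$ and $f(y) = g(y, y)$.

Majorization for STRESS: the majorizing function involves the term $\mathrm{tr}(X^T B(Y) Y)$, and minimizing it yields the update known as the Guttman transform (Cox, 2001).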

Self-Organizing Map (SOM)

A competitive, unsupervised learning process for clustering and visualization. Result: similar data end up closer together in the model space.

Learning: choose the model vector $m_j$ most similar to $x_i$; then update the winner and its neighbors by $m_k = m_k + \alpha(t)\, \lambda(t)\, (x_i - m_k)$, where $\alpha(t)$ is the learning rate and $\lambda(t)$ is the neighborhood size.

[Figure: input space and model space.]
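A single learning step can be sketched for model vectors arranged on a one-dimensional chain. The Gaussian neighborhood function and the parameter names `alpha` and `radius` are assumptions for the example; the slide only states that the winner and its neighbors are pulled toward the input.

```python
import math


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def som_step(models, x, alpha, radius):
    """Update the model vectors in place for one input x; return the winner."""
    # Competition: the winner is the model vector closest to x.
    winner = min(range(len(models)), key=lambda j: sq_dist(models[j], x))
    for k, m in enumerate(models):
        # Neighborhood weight decays with grid distance from the winner.
        h = math.exp(-((k - winner) ** 2) / (2 * radius ** 2))
        models[k] = [mi + alpha * h * (xi - mi) for mi, xi in zip(m, x)]
    return winner


# Hypothetical map with three model vectors and one input.
models = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
winner = som_step(models, [1.2, 1.2], alpha=0.5, radius=1.0)
```

In a full training loop both `alpha` and `radius` would shrink over time, so that early steps organize the map globally and later steps fine-tune it.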

Classification

Definition: a procedure that divides data into a given set of categories, based on a training set, in a supervised way.

Generalization vs. specification: it is hard to achieve both. Avoid overfitting (overtraining) with early stopping, holdout validation, K-fold cross-validation, or leave-one-out cross-validation.

[Figure: validation error and training error curves, with underfitting and overfitting regions (Overfitting, Wikipedia).]
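The validation schemes differ mainly in how the data are split. A minimal sketch of K-fold index generation is shown below; leave-one-out is the special case where K equals the number of samples. The function name is illustrative, and shuffling the data beforehand is left to the caller.

```python
def k_fold_indices(n, k):
    """Yield (train_indices, validation_indices) for each of k folds."""
    # Spread the remainder so fold sizes differ by at most one.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, val
        start += size


folds = list(k_fold_indices(10, 3))          # 3-fold split of 10 samples
leave_one_out = list(k_fold_indices(5, 5))   # K = n gives leave-one-out
```

Each sample is used for validation exactly once, so the averaged validation error uses all of the data without ever scoring a model on points it was trained on.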

Artificial Neural Network (ANN)

Perceptron: a computational unit with a binary threshold.

Abilities: a linearly separable decision surface; can represent Boolean functions (AND, OR, NOT).

Network (multilayer) of perceptrons: various network architectures and capabilities.

[Figure: perceptron with weighted sum and activation function (Jain, 1996).]
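A single perceptron and the classic error-correction training rule can be sketched in a few lines. The training function, learning rate, and epoch limit are illustrative choices; the Boolean truth tables are the only data used.

```python
def predict(w, b, x):
    """Binary threshold applied to the weighted sum."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0


def train_perceptron(samples, lr=1.0, epochs=100):
    """Perceptron learning rule; stops early once an epoch has no errors."""
    w = [0.0] * len(samples[0][0])
    b = 0.0
    for _ in range(epochs):
        errors = 0
        for x, target in samples:
            err = target - predict(w, b, x)
            if err != 0:
                w = [wi + lr * err * xi for wi, xi in zip(w, x)]
                b += lr * err
                errors += 1
        if errors == 0:
            break
    return w, b


AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
OR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]

w_and, b_and = train_perceptron(AND)
w_or, b_or = train_perceptron(OR)
```

AND and OR are linearly separable, so the rule converges. XOR is not, which is one motivation for multilayer networks.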

Artificial Neural Network (ANN)

Learning weights: random initialization and updating.

Error-correction training rules: based on the difference between training data and output, $E(t, o)$.

Gradient descent (batch learning): with $E = \sum_i E_i$, update $w_i \leftarrow w_i - \eta \frac{\partial E}{\partial w_i}$.

Stochastic approach (on-line learning): update the gradient for each result.

Various error functions: add a weight regularization term ($\lambda \sum w_i^2$) to avoid overfitting; add momentum ($\alpha \Delta w_i(n-1)$) to expedite convergence.
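Batch gradient descent with the regularization and momentum terms can be sketched for a single linear unit with squared error. The symbols `eta` (step size), `lam` (regularization weight), and `momentum` are named here for the example, and the zero initialization (instead of a random one) is chosen only to make the run reproducible.

```python
def train_linear_unit(samples, eta=0.05, lam=0.0, momentum=0.9, epochs=500):
    """Minimize sum_i 0.5 * (t_i - w.x_i)^2 + lam * sum_j w_j^2 by batch descent."""
    n = len(samples[0][0])
    w = [0.0] * n
    prev_delta = [0.0] * n
    for _ in range(epochs):
        # Gradient of the regularization term lam * sum w_j^2.
        grad = [2.0 * lam * wj for wj in w]
        # Batch learning: accumulate the error gradient over all samples.
        for x, t in samples:
            o = sum(wj * xj for wj, xj in zip(w, x))
            for j in range(n):
                grad[j] += -(t - o) * x[j]
        # Momentum: reuse a fraction of the previous weight change.
        delta = [-eta * g + momentum * d for g, d in zip(grad, prev_delta)]
        w = [wj + dj for wj, dj in zip(w, delta)]
        prev_delta = delta
    return w


# Hypothetical targets generated by t = 2 * x1 - 1 * x2.
samples = [((1, 0), 2), ((0, 1), -1), ((1, 1), 1), ((2, 1), 3)]
w_plain = train_linear_unit(samples)
w_reg = train_linear_unit(samples, lam=1.0)
```

The regularized run ends with smaller weights than the plain run, which is the intended effect: large weights are penalized even if they fit the training data slightly better.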

Support Vector Machine

Q: How to draw the optimal linear separating hyperplane? A: Maximize the margin.

Margin maximization: the distance between $H_{+1}$ and $H_{-1}$ is $\frac{2}{\|w\|}$; thus, $\|w\|$ should be minimized.

[Figure: the margin.]

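Support Vector Machine

Constrained optimization problem. Given the training set $\{x_i, y_i\}$ ($y_i \in \{+1, -1\}$):

Minimize $\frac{1}{2}\|w\|^2$ subject to $y_i (w \cdot x_i + b) \ge 1$ for all $i$.

Lagrangian equation with saddle points: $L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_i \alpha_i \left[ y_i (w \cdot x_i + b) - 1 \right]$.

Minimized w.r.t. the primal variables $w$ and $b$: $w = \sum_i \alpha_i y_i x_i$ and $\sum_i \alpha_i y_i = 0$.

Maximized w.r.t. the dual variables $\alpha_i$ (all $\alpha_i \ge 0$).

An $x_i$ with $\alpha_i > 0$ (not $\alpha_i = 0$) is called a support vector (SV).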
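For reference, the dual problem that results from this saddle-point treatment is sketched below. It assumes the standard hard-margin formulation, with primal objective $\frac{1}{2}\|w\|^2$ and constraints $y_i (w \cdot x_i + b) \ge 1$; the sign convention for $b$ is an assumption, since the slide's own equations did not survive. Substituting the stationarity conditions $w = \sum_i \alpha_i y_i x_i$ and $\sum_i \alpha_i y_i = 0$ into the Lagrangian eliminates $w$ and $b$.

```latex
\max_{\alpha} \;\; \sum_i \alpha_i
  \;-\; \tfrac{1}{2} \sum_i \sum_j \alpha_i \alpha_j \, y_i y_j \, (x_i \cdot x_j)
\qquad \text{subject to} \qquad
\alpha_i \ge 0, \quad \sum_i \alpha_i y_i = 0
```

The training data enter only through the inner products $x_i \cdot x_j$, which is what later allows a kernel function to replace them.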

Support Vector Machine

Soft margin (non-separable case): slack variables $\xi_i$ and a bound $C$; optimization with an additional constraint.

Non-linear SVM: map the non-linear input to a feature space. Kernel function: $k(x, y) = \langle \phi(x), \phi(y) \rangle$. Kernel classifier with support vectors $s_i$.

[Figure: input space and feature space.]
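A kernel classifier evaluates a weighted sum of kernel values between the input and the support vectors. The sketch below uses the common decision form sign(sum_i alpha_i y_i k(s_i, x) + b) with an RBF kernel. The kernel choice, the hand-picked support vectors, and the coefficients are all hypothetical; they were not produced by training an SVM.

```python
import math


def rbf_kernel(x, y, gamma=1.0):
    """k(x, y) = exp(-gamma * ||x - y||^2), one commonly used kernel."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def kernel_classify(x, support_vectors, labels, alphas, b=0.0, kernel=rbf_kernel):
    """Return +1 or -1 from the kernel expansion over the support vectors."""
    score = sum(a * y * kernel(s, x)
                for s, y, a in zip(support_vectors, labels, alphas)) + b
    return 1 if score >= 0 else -1


# Hand-picked XOR-like configuration (not the output of a trained SVM).
support_vectors = [(0, 0), (1, 1), (0, 1), (1, 0)]
labels = [-1, -1, +1, +1]
alphas = [1.0, 1.0, 1.0, 1.0]

predictions = [kernel_classify(s, support_vectors, labels, alphas)
               for s in support_vectors]
```

The four points are not linearly separable in the input space, yet the kernel expansion separates them, which illustrates the effect of the implicit mapping to a feature space.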

Parallel Computing

Decomposition strategy: task (e.g., Word, IE, …); data (scientific problems); pipelining (task + data).

Memory architecture (Barney, 2007):

Shared memory: Symmetric Multiprocessor (SMP); programmed with OpenMP, POSIX threads (pthreads), or MPI; easy to manage but expensive.

Distributed memory: commodity, off-the-shelf processors; programmed with MPI; cost effective but hard to maintain.
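Data decomposition can be sketched with Python's standard `concurrent.futures` module: the input is split into chunks, each worker processes one chunk, and the partial results are combined. Threads are used here only to illustrate the shared-memory pattern; in CPython the global interpreter lock means this particular example will not run faster than a serial loop. The function names and the workload are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_sum_of_squares(chunk):
    """Work done independently by one worker on its share of the data."""
    return sum(v * v for v in chunk)


def parallel_sum_of_squares(data, n_workers=4):
    """Split the data, process the chunks concurrently, merge the results."""
    size = max(1, (len(data) + n_workers - 1) // n_workers)
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return sum(pool.map(partial_sum_of_squares, chunks))


total = parallel_sum_of_squares(list(range(1000)))
```

On a distributed-memory system the same split, compute, and merge structure would typically be expressed with message passing, with each chunk sent to a different node.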

Parallel SVM

Shrinking. Recall: only support vectors ($\alpha_i > 0$) are used in SVM optimization. Predict whether each data point is an SV or a non-SV, and remove the non-SVs from the problem space.

Parallel SVM (Graf, 2005): partition the problem; merge the data hierarchically; each unit finds support vectors; loop until convergence.

Thank you!! Questions?