Advanced Artificial Intelligence Lecture 4: Clustering And Recommender Systems
Outline Clustering: K-Means, EM, K-Means++, Mean-Shift, Spectral Clustering. Recommender Systems: Collaborative Filtering
K-Means Many data points, no labels
K-Means Choose a fixed number of clusters. Choose cluster centers and point-cluster allocations to minimize error; we can't do this by exhaustive search, because there are too many possible allocations. Algorithm: fix the cluster centers and allocate each point to its closest cluster; then fix the allocation and compute the best cluster centers. x can be any set of features for which we can compute a distance (be careful about scaling).
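The two alternating steps above can be sketched in a few lines of NumPy (a minimal illustration under our own naming, not production code):

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Plain Lloyd's algorithm: alternate the two steps from the slide."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        # Step 1: fix centers; allocate each point to the closest center.
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Step 2: fix allocation; recompute each center as the cluster mean.
        new_centers = np.array([X[labels == j].mean(axis=0)
                                if np.any(labels == j) else centers[j]
                                for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```

Each iteration can only decrease the total squared error, which is why the loop terminates at a (possibly local) minimum.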
K-Means
K-Means * From Marc Pollefeys COMP 256 2003
K-Means Is an approximation to EM. Model (hypothesis space): mixture of N Gaussians. Latent variables: correspondence between data points and Gaussians. We notice: given the mixture model, it is easy to calculate the correspondence; given the correspondence, it is easy to estimate the mixture model.
Expectation Maximization: Idea Data are generated from a mixture of Gaussians. Latent variables: correspondence between data items and Gaussians
Generalized K-Means (EM)
Gaussians
ML Fitting Gaussians
Learning a Gaussian Mixture (with known covariance) E-Step M-Step
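A minimal sketch of the E-step and M-step for this known-covariance case (spherical Gaussians with a shared, known variance; only the means are learned — function and parameter names are ours):

```python
import numpy as np

def em_gaussian_means(X, k, sigma=1.0, iters=50, seed=0):
    """EM for a mixture of k spherical Gaussians with known variance
    sigma^2; only the means and the soft assignments are estimated."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        # E-step: responsibility of each Gaussian for each point.
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        w = np.exp(-d2 / (2 * sigma ** 2))
        w /= w.sum(axis=1, keepdims=True)
        # M-step: each mean is the responsibility-weighted data average.
        mu = (w.T @ X) / w.sum(axis=0)[:, None]
    return mu, w
```

Replacing the soft responsibilities w with hard 0/1 assignments recovers exactly the K-Means allocation step, which is the sense in which K-Means approximates EM.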
Expectation Maximization Converges! Proof [Neal/Hinton, McLachlan/Krishnan]: the E and M steps never decrease the data likelihood, so EM converges to a local maximum or saddle point of the likelihood. But it remains subject to local optima.
Practical EM The number of clusters is unknown, and EM suffers (badly) from local minima. Heuristic algorithm: start a new cluster center if many points are "unexplained"; kill a cluster center that doesn't contribute. (Use an AIC/BIC criterion for all this, if you want to be formal.)
K-Means++ Can we prevent arbitrarily bad local minima? Randomly choose the first center. Pick each new center with probability proportional to its contribution to the total error (its squared distance to the nearest chosen center). Repeat until k centers are chosen. Expected error = O(log k) × optimal.
K-Means++ In the second step, compute for each data point the distance to the nearest seed point (cluster center) chosen so far, giving a set D = {D(1), D(2), ..., D(n)}, where n is the size of the data set. To avoid being misled by noise, do not simply select the point with the largest D value; instead, draw a random value weighted by D and use the corresponding data point as the next "seed point" (roulette-wheel selection).
K-Means++ There are 8 samples in the data set; their distribution and serial numbers are shown below.
K-Means++ Suppose point 6 is selected as the first initial cluster center. Then D(x) for each sample and the probability of selecting the second cluster center in step 2 are shown in the following table. P(x) is the probability of each sample being selected as the next cluster center; the Sum row is the cumulative sum of P(x), used to select the second cluster center by the roulette-wheel method.
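The seeding procedure described above can be sketched as follows (a minimal illustration; names are ours):

```python
import numpy as np

def kmeanspp_seeds(X, k, seed=0):
    """K-Means++ seeding: pick each new center with probability
    proportional to its squared distance to the nearest chosen center
    (the roulette-wheel step from the slide)."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]          # first center: uniform
    while len(centers) < k:
        # D(x)^2 for every point: squared distance to nearest seed so far.
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        probs = d2 / d2.sum()                    # the P(x) column of the table
        centers.append(X[rng.choice(len(X), p=probs)])
    return np.array(centers)
```

The resulting seeds are then handed to ordinary K-Means; the distance weighting makes it very likely that every true cluster receives at least one seed.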
Mean-Shift Given n sample points x_i in d-dimensional space R^d, i = 1, ..., n, and a point x in that space, the basic mean-shift vector is defined as M_h(x) = (1/k) Σ_{x_i ∈ S_h} (x_i − x), where S_h is the sphere of radius h centered at x, and k is the number of the n sample points x_i falling inside S_h. The drift vector is computed and the center x of the sphere is updated by x ← x + M_h(x), so that the center moves toward the region of highest data density.
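The update x ← x + M_h(x) with a flat (sphere) kernel can be sketched as follows (a minimal illustration; names are ours):

```python
import numpy as np

def mean_shift_point(X, x, h, iters=100, tol=1e-6):
    """Iterate the update x <- x + M_h(x): move x to the mean of the
    sample points within radius h, until it stops moving (a mode)."""
    for _ in range(iters):
        in_window = np.linalg.norm(X - x, axis=1) < h
        if not in_window.any():
            break
        shift = X[in_window].mean(axis=0) - x   # the mean-shift vector M_h(x)
        x = x + shift
        if np.linalg.norm(shift) < tol:
            break
    return x
```

Running this from every data point and grouping points whose trajectories converge to the same mode yields the mean-shift clustering described on the following slides.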
Mean-Shift Source of slide: http://vision.stanford.edu/teaching/cs131_fall1314_nope/lectures/lecture13_kmeans_cs131.pdf
Mean-Shift Clustering
Mean-Shift Bandwidth: radius h
Mean-Shift Clustering Cluster: all data points in the attraction basin of a mode Attraction basin: the region for which all trajectories lead to the same mode Source of slide: http://vision.stanford.edu/teaching/cs131_fall1314_nope/lectures/lecture13_kmeans_cs131.pdf
Spectral Clustering
Spectral Clustering: Overview Data Similarities Block-Detection
Eigenvectors and Blocks Block matrices have block eigenvectors, and near-block matrices have near-block eigenvectors. For an exact two-block matrix of ones, the eigensolver gives λ1 = λ2 = 2 and λ3 = λ4 = 0, with eigenvectors that are indicators of the blocks (entries ≈ .71 on one block, 0 on the other). Slightly perturbing the matrix (off-block entries of .2) gives λ1 = λ2 ≈ 2.02 and λ3 = λ4 ≈ −0.02, with eigenvectors that are still nearly block indicators (entries ≈ .71, .69, .14, −.14). [Ng et al., NIPS 02]
Spectral Space Can put items into blocks by their eigenvector coordinates: points in the same block have similar coordinates in the top eigenvectors e1, e2 (e.g. entries ≈ .71, .69 versus .14, −.14). The resulting clusters are independent of row ordering.
The Spectral Advantage The key advantage of spectral clustering is the spectral space representation of the data.
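For the two-block case, a concrete sketch is the classic Laplacian sign-split (a standard variant of spectral partitioning, not necessarily the exact method on the slides):

```python
import numpy as np

def spectral_bipartition(A):
    """Two-way spectral partition: split by the sign of the eigenvector
    for the second-smallest eigenvalue of the graph Laplacian L = D - A.
    On a near-block affinity matrix this recovers the two blocks."""
    L = np.diag(A.sum(axis=1)) - A
    vals, vecs = np.linalg.eigh(L)
    fiedler = vecs[:, np.argsort(vals)[1]]   # second-smallest eigenvector
    return (fiedler > 0).astype(int)         # block membership per point
```

For k > 2 blocks, the same idea generalizes: embed each point by its coordinates in the top k eigenvectors (the spectral space above) and run K-Means on that embedding.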
Measuring Affinity Intensity, distance, texture. Here c(x) is a vector of filter outputs. A natural thing to do is to square the outputs of a range of different filters at different scales and orientations, smooth the result, and stack these into a vector.
Scale affects affinity This is figure 14.18
Clustering
Recommender Systems
Collaborative Filtering Memory-based: this approach uses user rating data to compute the similarity between users or items; typical examples are neighborhood-based CF and item-based/user-based top-N recommendation. Model-based: models are developed using different algorithms to predict users' ratings of unrated items, e.g. matrix factorization. Hybrid: a number of applications combine the memory-based and the model-based CF algorithms.
Collaborative Filtering The workflow of a collaborative filtering system: a user expresses his or her preferences by rating items (e.g. books, movies, or CDs) in the system. These ratings can be viewed as an approximate representation of the user's interest in the corresponding domain. The system matches this user's ratings against other users' and finds the people with the most "similar" tastes. It then recommends items that the similar users have rated highly but that this user has not yet rated (the absence of a rating is often taken to mean unfamiliarity with an item).
Collaborative Filtering assumption: users with similar taste in the past will have similar taste in the future requires only a matrix of ratings → applicable in many domains widely used in practice
Collaborative Filtering input: matrix of user-item ratings (with missing values, often very sparse) output: predictions for the missing values
Collaborative Filtering Consider a user x. Find the set N of other users whose ratings are similar to x's ratings. Estimate x's missing ratings based on the ratings of the users in N.
Similar Users Jaccard similarity measure Cosine similarity measure Pearson correlation coefficient
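The three similarity measures can be sketched directly on rating vectors (a minimal illustration, assuming 0 encodes "unrated"; names are ours):

```python
import numpy as np

def jaccard(a, b):
    """Jaccard: overlap of the sets of items each user has rated."""
    A, B = set(np.nonzero(a)[0]), set(np.nonzero(b)[0])
    return len(A & B) / len(A | B)

def cosine(a, b):
    """Cosine similarity between raw rating vectors (missing = 0)."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def pearson(a, b, mask):
    """Pearson correlation over co-rated items only (mask = both rated)."""
    a, b = a[mask], b[mask]
    a, b = a - a.mean(), b - b.mean()
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
```

Jaccard ignores rating values entirely, cosine treats missing ratings as zeros, and Pearson corrects for each user's mean rating, which is why it is usually preferred for user-user CF.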
Collaborative Filtering
Collaborative Filtering The user-based top-N recommendation algorithm uses a similarity-based vector model to identify the k users most similar to the active user. After the k most similar users are found, their corresponding rows of the user-item matrix are aggregated to identify the set of items to be recommended.
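A minimal sketch of that procedure with cosine similarity (an illustration under our own naming; 0 encodes "unrated"):

```python
import numpy as np

def user_based_topn(R, user, k=2, n=3):
    """User-based top-N: find the k users most cosine-similar to `user`,
    aggregate (sum) their rating rows, and recommend the n best items
    `user` has not rated. R is a user-item matrix with 0 = unrated."""
    norms = np.linalg.norm(R, axis=1)
    sims = R @ R[user] / (norms * norms[user] + 1e-12)
    sims[user] = -np.inf                      # exclude the user themself
    neighbors = np.argsort(sims)[-k:]         # k nearest neighbors
    scores = R[neighbors].sum(axis=0)         # aggregate their rows
    scores[R[user] > 0] = -np.inf             # only unseen items
    return np.argsort(scores)[::-1][:n]
```

In practice the aggregation is usually similarity-weighted rather than a plain sum, but the structure of the algorithm is the same.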
Collaborative Filtering
Collaborative Filtering The item-based top-N recommendation algorithm uses a similarity-based vector model to identify, for each item, the k most similar items. The similar items of everything the active user has rated are then aggregated to identify the set of items to be recommended.
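The item-based variant can be sketched by comparing item columns instead of user rows (a minimal illustration; names are ours):

```python
import numpy as np

def item_based_topn(R, user, n=3):
    """Item-based top-N: score each unseen item by its cosine similarity
    (between item rating columns) to the items the user already rated,
    weighted by the user's own ratings. R has 0 = unrated."""
    norms = np.linalg.norm(R, axis=0) + 1e-12
    S = (R.T @ R) / np.outer(norms, norms)    # item-item cosine similarity
    np.fill_diagonal(S, 0)                    # an item shouldn't vote for itself
    rated = R[user] > 0
    scores = S[:, rated] @ R[user, rated]     # aggregate over rated items
    scores[rated] = -np.inf                   # only unseen items
    return np.argsort(scores)[::-1][:n]
```

Because the item-item similarity matrix S can be precomputed offline, this variant scales better than user-based top-N when there are many more users than items.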
Matrix Factorization Given a matrix X ∈ R^(m×n), m ≤ n, we want to find matrices U and V such that X ≈ UV. For a given number of factors k, SVD gives us the optimal factorization, globally minimizing the mean squared prediction error over all user-item pairs.
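For a fully observed matrix, the optimal rank-k factorization comes straight from the truncated SVD (a small illustration on a made-up 3×3 rating matrix):

```python
import numpy as np

# Rank-k truncated SVD: the best rank-k approximation of a fully
# observed matrix X in the squared-error sense (Eckart-Young theorem).
X = np.array([[5., 4., 1.],
              [4., 5., 1.],
              [1., 1., 5.]])
U, s, Vt = np.linalg.svd(X)
k = 2
Xk = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]   # X ≈ UV with k latent factors
```

For real recommender data X is mostly missing, so SVD cannot be applied directly; the usual remedy is to minimize the error over observed entries only, as on the following slides.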
Latent Factors
Matrix Factorization We want to minimize the squared errors over the observed ratings, with regularization to avoid overfitting: min Σ_(u,i) (r_ui − p_u · q_i)² + λ(‖p_u‖² + ‖q_i‖²). How do we find the minimum? Stochastic gradient descent.
Matrix Factorization Prediction error: e_ui = r_ui − p_u · q_i. Updates (in parallel), with learning rate γ and regularization λ: p_u ← p_u + γ(e_ui · q_i − λ · p_u), q_i ← q_i + γ(e_ui · p_u − λ · q_i).
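The SGD loop above can be sketched as follows (a minimal illustration with our own names and hyperparameters; 0 encodes "unobserved"):

```python
import numpy as np

def mf_sgd(R, k=2, lr=0.01, lam=0.1, epochs=200, seed=0):
    """SGD for matrix factorization: for each observed rating, compute
    the prediction error e_ui and update both factor vectors."""
    rng = np.random.default_rng(seed)
    m, n = R.shape
    U = 0.1 * rng.standard_normal((m, k))     # user factors p_u
    V = 0.1 * rng.standard_normal((n, k))     # item factors q_i
    obs = list(zip(*np.nonzero(R)))           # observed (user, item) pairs
    for _ in range(epochs):
        for u, i in obs:
            e = R[u, i] - U[u] @ V[i]                 # prediction error e_ui
            U[u] += lr * (e * V[i] - lam * U[u])      # gradient step on p_u
            V[i] += lr * (e * U[u] - lam * V[i])      # gradient step on q_i
    return U, V
```

Only the observed entries drive the updates, which is exactly what lets this method handle the sparse rating matrices where plain SVD does not apply.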
Matrix Factorization
Top-N Recommendation After matrix factorization, you can reconstruct all the ratings; for a given user, you can obtain a predicted rating for every item. Sort them, then recommend the top-N rated items that the user has not bought before.