
DATA MINING LECTURE 8: Sequence Segmentation and Dimensionality Reduction

SEQUENCE SEGMENTATION

Why deal with sequential data? Because all data is sequential: all data items arrive in the data store in some order. Examples: transaction data, documents and words. In some (many) cases the order does not matter; in many cases the order is of interest.

Time-series data: financial time series, process monitoring, …

Sequence Segmentation. Goal: discover structure in the sequence and provide a concise summary. Given a sequence T, segment it into K contiguous segments that are as homogeneous as possible. Similar to clustering, but now we require the points in the cluster to be contiguous.

Example (figure): a time series with values R plotted over time t, and its segmentation.

Basic definitions. A K-segmentation of a sequence T partitions it into K contiguous segments; each segment is summarized by a single representative value (e.g., the segment mean), and the error of the segmentation is the sum of squared errors (SSE) of the points from their segment representatives.

The K-segmentation problem. Similar to K-means clustering, but now we need the points in the clusters to respect the order of the sequence. This actually makes the problem easier. Given a sequence T of length N and a value K, find a K-segmentation S = {s_1, s_2, …, s_K} of T such that the SSE error E is minimized.
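Spelled out (a standard formulation consistent with the definitions above; the symbols x_t for the points and μ_i for the segment means are mine, not from the slides), the error being minimized is:

```latex
E(S) = \sum_{i=1}^{K} \sum_{t \in s_i} \left( x_t - \mu_i \right)^2,
\qquad \mu_i = \frac{1}{|s_i|} \sum_{t \in s_i} x_t .
```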

Optimal solution for the k-segmentation problem [Bellman '61]: the k-segmentation problem can be solved optimally using a standard dynamic-programming algorithm. Dynamic Programming: construct the solution of the problem by using solutions to problems of smaller size. Define the dynamic programming recursion, build the solution bottom up from smaller to larger instances, and define the dynamic programming table that stores the solutions to the sub-problems.

Dynamic Programming Solution. Let E[k, n] be the minimum SSE of segmenting the first n points of the sequence into k segments. Then E[k, n] = min over m < n of { E[k-1, m] + SSE(m+1, n) }, where SSE(i, j) is the error of a single segment covering points i through j; the answer is E[K, N].

Dynamic programming solution (figure): the DP table has rows k = 1, …, K and columns n = 1, …, N; each cell holds E[k, n].

Algorithm Complexity. The table has O(NK) cells and filling each cell examines O(N) candidate breakpoints, so the dynamic program runs in O(N^2 K) time and O(NK) space; a sketch of the algorithm follows.
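A minimal Python sketch of this dynamic program (my own implementation, not the lecture's pseudocode; it assumes each segment is summarized by its mean, and the function name k_segmentation and the NumPy prefix sums are illustrative choices):

```python
# Optimal K-segmentation of a 1-D sequence by dynamic programming,
# in the spirit of Bellman '61. E[k][n] = min SSE of segmenting the
# first n points into k segments.
import numpy as np

def k_segmentation(x, K):
    x = np.asarray(x, dtype=float)
    N = len(x)
    # Prefix sums let us compute the SSE of any segment x[i:j] in O(1):
    # SSE(i, j) = sum(x^2) - (sum(x))^2 / (j - i)
    s1 = np.concatenate(([0.0], np.cumsum(x)))
    s2 = np.concatenate(([0.0], np.cumsum(x ** 2)))

    def sse(i, j):            # error of the segment covering x[i:j], j > i
        n = j - i
        return s2[j] - s2[i] - (s1[j] - s1[i]) ** 2 / n

    E = np.full((K + 1, N + 1), np.inf)   # E[k][n]: best error, k segments, n points
    back = np.zeros((K + 1, N + 1), dtype=int)
    E[0][0] = 0.0
    for k in range(1, K + 1):
        for n in range(k, N + 1):
            # the last segment is x[m:n]; try every possible breakpoint m
            for m in range(k - 1, n):
                cand = E[k - 1][m] + sse(m, n)
                if cand < E[k][n]:
                    E[k][n], back[k][n] = cand, m
    # Recover the breakpoints by walking back through the table
    bounds, n = [N], N
    for k in range(K, 0, -1):
        n = back[k][n]
        bounds.append(n)
    return E[K][N], list(reversed(bounds))   # total SSE, segment boundaries
```

The triple loop makes the O(N^2 K) running time visible: K values of k, N values of n, and up to N candidate breakpoints m per cell.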

Heuristics. Bottom-up greedy (BU), O(N log N): merge adjacent segments (initially single points), each time selecting the pair whose merge causes the smallest increase in the error, until K segments remain [Keogh and Smyth '97, Keogh and Pazzani '98]. Top-down greedy (TD), O(NK): introduce breakpoints so that you get the largest decrease in error, until K segments are created [Douglas and Peucker '73, Shatkay and Zdonik '96, Lavrenko et al. '00]. Local Search Heuristics, O(NKI): assign the breakpoints randomly and then move them so that you reduce the error [Himberg et al. '01].
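For illustration, a simple sketch of the bottom-up heuristic (my own code, not from the referenced papers; this naive version rescans all adjacent pairs at each step, so it runs in O(N^2) rather than the O(N log N) achievable with a priority queue):

```python
# Bottom-up greedy segmentation: start with every point as its own segment and
# repeatedly merge the adjacent pair whose merge increases the SSE the least,
# until only K segments remain.
import numpy as np

def segment_sse(seg):
    seg = np.asarray(seg, dtype=float)
    return float(((seg - seg.mean()) ** 2).sum())

def bottom_up(x, K):
    segments = [[v] for v in x]                   # one segment per point
    while len(segments) > K:
        # cost of merging each adjacent pair = SSE(merged) - SSE(left) - SSE(right)
        costs = [segment_sse(segments[i] + segments[i + 1])
                 - segment_sse(segments[i]) - segment_sse(segments[i + 1])
                 for i in range(len(segments) - 1)]
        i = int(np.argmin(costs))
        segments[i:i + 2] = [segments[i] + segments[i + 1]]
    return segments
```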

DIMENSIONALITY REDUCTION

The curse of dimensionality. Real data usually have thousands, or millions, of dimensions. E.g., web documents, where the dimensionality is the vocabulary of words; the Facebook graph, where the dimensionality is the number of users. A huge number of dimensions causes many problems: data becomes very sparse, some algorithms become meaningless (e.g., density-based clustering), and the complexity of several algorithms depends on the dimensionality, so they become infeasible.

Dimensionality Reduction. Usually the data can be described with fewer dimensions, without losing much of the meaning of the data: the data reside in a space of lower dimensionality. Essentially, we assume that some of the data is noise, and we can approximate the useful part with a lower-dimensional space. Dimensionality reduction does not just reduce the amount of data; it often brings out the useful part of the data.

Data in the form of a matrix. We are given n objects and d attributes describing the objects; each object has d numeric values describing it. We will represent the data as an n × d real matrix A. We can now use tools from linear algebra to process the data matrix. Our goal is to produce a new n × k matrix B such that it preserves as much of the information in the original matrix A as possible, and it reveals something about the structure of the data in A.

Example: Document matrices. n documents, d terms (e.g., theorem, proof, etc.). A_ij = frequency of the j-th term in the i-th document. Find subsets of terms that bring documents together.

Example: Recommendation systems. n customers, d movies. A_ij = rating of the j-th movie by the i-th customer. Find subsets of movies that capture the behavior of the customers.

Some linear algebra basics

Singular Value Decomposition: A = U Σ V^T, with U of size [n×r], Σ of size [r×r], and V^T of size [r×n]. r: rank of matrix A. σ_1 ≥ σ_2 ≥ … ≥ σ_r: singular values (square roots of the eigenvalues of AA^T and A^T A). u_1, …, u_r: left singular vectors (eigenvectors of AA^T). v_1, …, v_r: right singular vectors (eigenvectors of A^T A).

Singular Value Decomposition: equivalently, A = σ_1 u_1 v_1^T + σ_2 u_2 v_2^T + … + σ_r u_r v_r^T, a sum of r rank-one matrices.
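As a concrete illustration of the decomposition defined above (my own NumPy example, not from the slides; the small document-term matrix is made up):

```python
# Compute the SVD of a small document-term matrix with NumPy and check the
# shapes and the reconstruction A = U * diag(sigma) * Vt.
import numpy as np

A = np.array([[1., 1., 0., 0.],     # "blue" documents use the first two terms
              [2., 2., 0., 0.],
              [0., 0., 1., 3.],     # "red" documents use the last two terms
              [0., 0., 2., 6.]])

U, sigma, Vt = np.linalg.svd(A, full_matrices=False)
print(sigma)                                     # singular values, in decreasing order
print(np.allclose(A, U @ np.diag(sigma) @ Vt))   # True: A = U Σ V^T
```

Only two of the singular values are (numerically) non-zero here, because the rows come in two linearly dependent groups, which is exactly the situation in the next example.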

An (extreme) example: document-term matrix. The blue and red rows (columns) are linearly dependent. There are two types of documents (words): blue and red. To describe the data it is enough to describe the two types, and the projection weights for each row. A is a rank-2 matrix (the matrix A is shown on the slide).

A (more realistic) example: document-term matrix. There are two types of documents and words, but they are mixed. We now have more than two singular vectors, but the strongest ones are still about the two types. By keeping the two strongest singular vectors we obtain most of the information in the data; this is a rank-2 approximation of the matrix A (shown on the slide).

SVD and Rank-k approximations (figure): A = U Σ V^T, viewed as an objects × features matrix; the top singular values and vectors carry the significant part of the data, while the remaining ones mostly capture noise.

Rank-k approximations (A_k): A_k = U_k Σ_k V_k^T, where U_k (V_k) is the orthogonal matrix containing the top k left (right) singular vectors of A, and Σ_k is the diagonal matrix containing the top k singular values of A. A_k approximates the n × d matrix A by factors of size n × k, k × k, and k × d. A_k is the best rank-k approximation of A.
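A short sketch of building A_k (my own code; it reuses the small matrix from the earlier SVD example, which happens to have rank 2, so the rank-2 approximation is exact):

```python
# Rank-k approximation: keep only the top k singular values and vectors.
import numpy as np

def rank_k_approximation(A, k):
    U, sigma, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] @ np.diag(sigma[:k]) @ Vt[:k, :]

A = np.array([[1., 1., 0., 0.],
              [2., 2., 0., 0.],
              [0., 0., 1., 3.],
              [0., 0., 2., 6.]])
A2 = rank_k_approximation(A, 2)
print(np.round(A2, 3))               # equals A here, because A has rank 2
print(np.linalg.norm(A - A2))        # ~0: nothing is lost at k = 2
```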

SVD as an optimization: among all matrices B of rank at most k, the rank-k approximation A_k minimizes the approximation error ||A − B||_F (the sum of squared differences between the entries of A and B).

What does this mean?

Two applications. Latent Semantic Indexing (LSI): apply SVD on the document-term matrix, and index the k-dimensional vectors. When a query comes, project it onto the low-dimensional space and compute cosine similarity in this space. The singular vectors capture the main topics, and enrich the document representation. Recommender systems and collaborative filtering: in a movie-rating system there are just a few types of users; what we observe is an incomplete and noisy version of the true data. The rank-k approximation reconstructs the "true" matrix, and we can provide ratings for movies that are not rated.
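A hedged sketch of the LSI idea (my own code; the function name lsi_rank and the way the query is folded into the concept space via V_k are illustrative choices, not prescribed by the slides):

```python
# Project documents and a query into the k-dimensional singular-vector space
# and rank documents by cosine similarity there.
import numpy as np

def lsi_rank(A, query, k):
    """A: n x d document-term matrix; query: length-d term vector."""
    U, sigma, Vt = np.linalg.svd(A, full_matrices=False)
    docs_k = U[:, :k] * sigma[:k]       # documents in the k-dim concept space (= A V_k)
    q_k = query @ Vt[:k, :].T           # project the query with the same map, q V_k
    sims = (docs_k @ q_k) / (
        np.linalg.norm(docs_k, axis=1) * np.linalg.norm(q_k) + 1e-12)
    return np.argsort(-sims)            # document indices, best match first
```

Because both documents and query are mapped through V_k, a query term can match a document that never contains that term, as long as the term and the document load on the same singular vectors (topics).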

SVD and PCA. PCA is a special case of SVD: apply SVD to the centered data matrix (equivalently, eigen-decompose its covariance matrix).
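A small illustration of this (my own example; the synthetic 2-D data and variable names are not from the slides): PCA computed as the SVD of the centered data.

```python
# PCA via SVD: the right singular vectors of the centered data matrix are the
# principal components, and the squared singular values give the variances.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) @ np.array([[3., 0.], [1., 0.5]])  # correlated 2-D data

Xc = X - X.mean(axis=0)                   # center each attribute
U, sigma, Vt = np.linalg.svd(Xc, full_matrices=False)

components = Vt                           # rows: principal directions
explained_variance = sigma ** 2 / (len(X) - 1)
print(components[0])                      # direction of maximal variance
print(explained_variance / explained_variance.sum())  # fraction of variance explained
```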

Covariance matrix. For centered data X (n × d, each attribute with zero mean), the covariance matrix is C = X^T X / (n − 1); its eigenvectors are the principal directions and its eigenvalues measure the variance of the data along them.

PCA: Principal Component Analysis

PCA. Input: 2-dimensional points. Output: the 1st (right) singular vector, the direction of maximal variance; and the 2nd (right) singular vector, the direction of maximal variance after removing the projection of the data along the first singular vector.

Singular values (figure showing the 1st and 2nd right singular vectors): σ_1 measures how much of the data variance is explained by the first singular vector; σ_2 measures how much of the data variance is explained by the second singular vector.

Another property of PCA/SVD: the chosen vectors minimize the sum of squared differences between the data vectors and their low-dimensional projections (figure: points projected onto the 1st right singular vector).
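A quick numerical check of this property (an illustrative sketch with synthetic data; the helper projection_error and the random comparison directions are my own):

```python
# Projecting the centered data onto the 1st right singular vector gives a
# smaller total squared reconstruction error than projecting onto other
# unit directions.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 2)) @ np.array([[2., 0.], [1., 0.4]])
Xc = X - X.mean(axis=0)

def projection_error(Xc, v):
    v = v / np.linalg.norm(v)
    proj = np.outer(Xc @ v, v)              # project each point onto direction v
    return float(((Xc - proj) ** 2).sum())  # sum of squared residuals

_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
best = projection_error(Xc, Vt[0])
random_dirs = rng.normal(size=(100, 2))
print(best <= min(projection_error(Xc, v) for v in random_dirs))   # True
```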

SVD is “the Rolls-Royce and the Swiss Army Knife of Numerical Linear Algebra.”* *Dianne O’Leary, MMDS ’06