
CS 599: Social Media Analysis, University of Southern California. Elementary Text Analysis & Topic Modeling. Kristina Lerman, University of Southern California.

Why topic modeling? The volume of text document collections is growing exponentially, necessitating methods for automatically organizing, understanding, searching and summarizing them. Uncover hidden topical patterns in collections. Annotate documents according to topics. Use the annotations to organize, summarize and search.

Topic Modeling NIH Grants: Topic Map 2011, NIH Map Viewer.

Brief history of text analysis
1960s – Electronic documents come online; vector space models (Salton); 'bag of words', tf-idf
1990s – Mathematical analysis tools become widely available; latent semantic indexing (LSI); singular value decomposition (SVD, PCA)
2000s – Probabilistic topic modeling (LDA); probabilistic matrix factorization (PMF)

Readings. Blei, D. M. (2012). Probabilistic topic models. Communications of the ACM, 55(4) – Latent Dirichlet Allocation (LDA). Yehuda Koren, Robert Bell and Chris Volinsky. Matrix Factorization Techniques for Recommender Systems. IEEE Computer, 2009.

Vector space model. Term frequencies for an example document: genes 5, organism 3, survive 1, life 1, computer 1, organisms 1, genomes 2, predictions 1, genetic 1, numbers 1, sequenced 1, genome 2, computational 1, …

Vector space models: reducing noise. Original term counts: genes 5, organism 3, survive 1, life 1, computer 1, organisms 1, genomes 2, predictions 1, genetic 1, numbers 1, sequenced 1, genome 2, computational 1. After stemming words and removing stopwords (and, or, but, also, to, too, as, can, I, you, he, she, …): gene 6, organism 4, survive 1, life 1, comput 2, predictions 1, numbers 1, sequenced 1, genome 4.
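A minimal sketch of this preprocessing step, assuming the NLTK package (and its downloaded 'stopwords' corpus) is available; the stopword list and the Porter stemmer are illustrative choices, not necessarily the ones that produced the counts above.

from collections import Counter
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

def preprocess(text):
    stop = set(stopwords.words('english'))            # stopwords: and, or, but, to, ...
    stemmer = PorterStemmer()                         # collapses word forms, e.g. genes -> gene
    tokens = [w.lower() for w in text.split() if w.isalpha()]
    stems = [stemmer.stem(w) for w in tokens if w not in stop]
    return Counter(stems)                             # stemmed term frequencies

print(preprocess("Genes in the genome of an organism can survive"))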

Vector space model. Each document is a point in high-dimensional space, with one axis per term (gene, organism, …). Document 1: gene 6, organism 4, survive 1, life 1, comput 2, predictions 1, numbers 1, sequenced 1, genome 4, … Document 2: gene 0, organism 6, survive 1, life 1, comput 2, predictions 1, numbers 1, sequenced 1, genome 4, …

Vector space model. With documents represented as points in this high-dimensional term space (Document 1 and Document 2 as above), compare two documents by their similarity ~ cos(θ), where θ is the angle between the two document vectors.
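A small sketch of this comparison in plain Python; the two term-count dictionaries are the illustrative documents from the slide, truncated to a few terms.

import math

def cosine(d1, d2):
    # similarity ~ cos(theta) between two sparse term-frequency vectors
    dot = sum(d1[t] * d2.get(t, 0) for t in d1)
    norm1 = math.sqrt(sum(v * v for v in d1.values()))
    norm2 = math.sqrt(sum(v * v for v in d2.values()))
    return dot / (norm1 * norm2)

doc1 = {'gene': 6, 'organism': 4, 'genome': 4, 'comput': 2, 'survive': 1}
doc2 = {'gene': 0, 'organism': 6, 'genome': 4, 'comput': 2, 'survive': 1}
print(cosine(doc1, doc2))   # close to 1 => the documents use similar terms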

Improving the vector space model. Use tf-idf, instead of raw term frequency (tf), in the document vector: term frequency * inverse document frequency. E.g., 'computer' occurs 3 times in a document but is present in 80% of documents, so its tf-idf score is 3*1/.8 = 3.75; 'gene' occurs 2 times in a document but is present in only 20% of documents, so its tf-idf score is 2*1/.2 = 10.
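A sketch of the weighting used in the example above, with idf taken as the inverse of the document-frequency fraction as on the slide (many implementations use log(N/df) instead):

def tf_idf(tf, doc_fraction):
    # tf: raw count in the document; doc_fraction: share of documents containing the term
    return tf * (1.0 / doc_fraction)

print(tf_idf(3, 0.8))   # 'computer': 3 * 1/0.8 = 3.75
print(tf_idf(2, 0.2))   # 'gene': 2 * 1/0.2 = 10.0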

Some problems with the vector space model. Synonymy: each unique term corresponds to a dimension in term space, so synonyms ('kid' and 'child') are different dimensions. Polysemy: different meanings of the same term are improperly conflated; e.g., a document about river 'banks' will be improperly judged to be similar to a document about financial 'banks'.

Latent Semantic Indexing. Identifies the subspace of tf-idf space that captures most of the variance in a corpus: a smaller subspace suffices to represent the document corpus, and this subspace captures the topics that exist in the corpus (topic = set of related words). Handles polysemy and synonymy: synonyms will belong to the same topic since they tend to co-occur with the same related words.

LSI, the method. Form the document-term matrix A. Decompose A by singular value decomposition (SVD), a linear algebra technique. Approximate A using a truncated SVD, which captures the most important relationships in A and ignores the rest; rebuild the matrix A using just those important relationships.
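A sketch of this pipeline assuming scikit-learn is available; the tiny corpus and the number of components k=2 are placeholders for illustration only.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

corpus = ["genes and genomes of the organism",
          "computational predictions of sequenced genomes",
          "the bank raised interest rates"]

A = TfidfVectorizer(stop_words='english').fit_transform(corpus)  # document-term tf-idf matrix
lsi = TruncatedSVD(n_components=2)                               # keep k=2 latent dimensions
doc_topics = lsi.fit_transform(A)                                # documents mapped into LSI space
print(doc_topics)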

LSI, the method (cont.). Each row and column of A gets mapped into the k-dimensional LSI space by the SVD.

Singular value decomposition (SVD): A = U Σ Vᵀ, where U and V have orthonormal columns and Σ is a diagonal matrix of singular values.

Lower rank decomposition. Usually the effective rank of the matrix A is small: r << min(m,n). Only a few of the largest singular vectors (those associated with the largest singular values) matter. These r vectors define a lower-dimensional subspace that captures the most important characteristics of the document corpus. All operations (document comparison, similarity, etc.) can be done in this reduced-dimension subspace.

Probabilistic modeling. Generative probabilistic modeling treats the data as observations and contains hidden variables; the hidden variables reflect the themes that pervade a corpus of documents. Infer the hidden thematic structure: analyze the words in the documents to discover the topics in the corpus. A topic is a distribution over words. This gives a large reduction in description length: only a few topics (about 100) are needed to represent the themes in a document corpus.

LDA – Latent Dirichlet Allocation (Blei 2003). Intuition: documents have multiple topics.

Topics. A topic is a distribution over words. A document is a distribution over topics. A word in a document is drawn from one of those topics.

Generative model of LDA. Each topic is a distribution over words. Each document is a mixture of corpus-wide topics. Each word is drawn from one of those topics.
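A sketch of this generative process with numpy; the vocabulary, number of topics, and Dirichlet parameters are made-up values for illustration (real corpora are observed, not generated, but this makes the model's assumptions concrete).

import numpy as np

rng = np.random.default_rng(0)
vocab = ['gene', 'genome', 'organism', 'bank', 'loan', 'river']
K, alpha, eta = 2, 0.5, 0.1                      # topics, doc-topic and topic-word Dirichlet priors

beta = rng.dirichlet([eta] * len(vocab), size=K) # each topic: a distribution over words
theta = rng.dirichlet([alpha] * K)               # one document: a mixture of topics
doc = []
for _ in range(10):                              # draw 10 words for the document
    z = rng.choice(K, p=theta)                   # pick a topic for this word position
    w = rng.choice(len(vocab), p=beta[z])        # pick a word from that topic
    doc.append(vocab[w])
print(doc)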

LDA inference. We observe only the documents; the rest of the structure consists of hidden variables.

LDA inference. Our goal is to infer the hidden variables, i.e., to compute their distribution conditioned on the documents: p(topics, proportions, assignments | documents).

Posterior distribution. Only the documents are observable; infer the underlying topic structure: the topics that generated the documents, for each document its distribution over topics, and for each word the topic that generated it. Algorithmic challenge: finding the conditional distribution of all the latent variables given the observations.

LDA as a graphical model: encodes the model's assumptions and defines a factorization of the joint distribution.

LDA as a graphical model. Nodes are random variables; edges indicate dependence. Shaded nodes are observed; unshaded nodes are hidden. Plates indicate replicated variables.

Posterior distribution. This joint defines a posterior p(θ, z, β | W). From a collection of documents W, infer: the per-word topic assignments z_{d,n}, the per-document topic proportions θ_d, and the per-corpus topic distributions β_k.
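Written out in the notation above (a standard statement of the LDA model, reconstructed rather than copied from the slide), the joint distribution factorizes as

p(\beta_{1:K}, \theta_{1:D}, z, W) = \prod_{k=1}^{K} p(\beta_k) \prod_{d=1}^{D} \left[ p(\theta_d) \prod_{n=1}^{N_d} p(z_{d,n} \mid \theta_d)\, p(w_{d,n} \mid \beta_{z_{d,n}}) \right]

and the posterior is this joint divided by the marginal probability of the documents, p(W).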

Posterior distribution. Evaluate p(z | W), the posterior distribution over the assignment of words to topics; θ and β can then be estimated. Computing p(z | W) involves evaluating a probability distribution over a very large discrete space.

Approximate posterior inference algorithms: mean field variational methods, expectation propagation, Gibbs sampling, distributed sampling, … Efficient packages exist for solving this problem.
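One widely used package (not one named on these slides) is gensim, which implements online variational inference; a sketch assuming it is installed, with a toy corpus and illustrative parameter values:

from gensim import corpora
from gensim.models import LdaModel

texts = [['gene', 'genome', 'organism'],
         ['bank', 'loan', 'interest'],
         ['gene', 'sequenced', 'genome', 'organism']]

dictionary = corpora.Dictionary(texts)                    # word <-> id mapping
bow = [dictionary.doc2bow(t) for t in texts]              # bag-of-words counts per document
lda = LdaModel(bow, num_topics=2, id2word=dictionary, passes=10)
print(lda.print_topics())                                 # top words per learned topic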

Example. Data: a collection of Science articles: 17K documents, 11M words, 20K unique words (stop words and rare words removed). Model: 100-topic LDA.

Extensions to LDA. Extensions to LDA relax assumptions made by the model. The 'bag of words' assumption: the order of words does not matter, but in reality the order of words in a document is not arbitrary. The order of documents does not matter, but in a historical document collection new topics arise over time. The number of topics is known and fixed; hierarchical Bayesian models infer the number of topics.

How useful are learned topic models? Model evaluation: how well do the learned topics describe unseen (test) documents, and how well can the model be used for personalization? Model checking: given a new corpus of documents, what model should be used, and how many topics? Visualization and user interfaces. Topic models for exploratory data analysis.

Recommender systems. Personalization tools filter large collections of movies, music, TV shows, … to recommend only relevant items to people: build a taste profile for a user, build a topic profile for an item, and recommend items that fit the user's taste profile. Probabilistic modeling techniques model people instead of documents, learning their profiles from observed actions. Commercially successful (Netflix competition).

The intuition

User-item rating prediction: a partially observed matrix of ratings (e.g., 4.0, 5.0, …) with users as rows and items as columns; the missing entries are to be predicted.

Collaborative filtering. Collaborative filtering analyzes users' past behavior and the relationships between users and items to identify new user-item associations: recommend new items that "similar" users liked. But the "cold start" problem makes it hard to make recommendations to new users. Approaches: neighborhood methods and latent factor models.

Neighborhood methods. Identify similar users who like the same movies; use their ratings of other movies to recommend new movies to the user.
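A minimal user-based neighborhood sketch with a toy ratings dictionary; the similarity measure (cosine over co-rated movies) is one common choice, assumed here rather than taken from the slides.

import math

ratings = {                       # toy data: user -> {movie: rating}
    'alice': {'Up': 5, 'Heat': 1, 'Alien': 4},
    'bob':   {'Up': 4, 'Heat': 1, 'Brave': 5},
    'carol': {'Heat': 5, 'Alien': 1},
}

def sim(u, v):
    shared = set(ratings[u]) & set(ratings[v])            # co-rated movies
    if not shared:
        return 0.0
    dot = sum(ratings[u][m] * ratings[v][m] for m in shared)
    nu = math.sqrt(sum(ratings[u][m] ** 2 for m in shared))
    nv = math.sqrt(sum(ratings[v][m] ** 2 for m in shared))
    return dot / (nu * nv)

# Recommend to alice the movies rated by her most similar user that she has not seen
neighbor = max((u for u in ratings if u != 'alice'), key=lambda u: sim('alice', u))
print(neighbor, [m for m in ratings[neighbor] if m not in ratings['alice']])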

Latent factor models Characterize users and items by 20 to 100 factors, inferred from the ratings patterns

Probabilistic Matrix Factorization (PMF). The rating matrix R is factored into user and item matrices over latent topics: R = UᵀV. Each item is a distribution over topics (e.g., Marvel hero, classic, action; TV series, classic, action; drama, family, …), and each user is a distribution over topics.

Singular Value Decomposition

Probabilistic formulation. R = UᵀV, where U holds users' topic vectors and V holds items' topic vectors. PMF [Salakhutdinov & Mnih 08]: "PMF is a probabilistic linear model with Gaussian observation noise that handles very large and possibly sparse data."

Inference. Minimize the regularized error, e.g. by stochastic gradient descent: compute the prediction error for the current parameters, find the gradient (slope) with respect to the parameters, and modify the parameters by a magnitude proportional to the negative of the gradient. Alternatively, alternating least squares: when one set of parameters is held fixed, the objective becomes an easy quadratic function that can be solved with least squares, so fix U and find V by least squares, then fix V and find U by least squares, and repeat.
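A sketch of the stochastic-gradient update for R ≈ UᵀV with L2 regularization; the learning rate, regularization weight, factor dimension, and toy ratings below are assumptions for illustration, not values from the systems discussed here.

import numpy as np

ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (2, 1, 1.0)]   # (user, item, rating) toy data
n_users, n_items, k = 3, 2, 4
lr, reg = 0.01, 0.1                                              # learning rate, regularization weight

rng = np.random.default_rng(0)
U = 0.1 * rng.standard_normal((n_users, k))                      # user factor vectors
V = 0.1 * rng.standard_normal((n_items, k))                      # item factor vectors

for epoch in range(200):
    for u, i, r in ratings:
        err = r - U[u] @ V[i]                                    # prediction error for one rating
        U[u] += lr * (err * V[i] - reg * U[u])                   # step along the negative gradient
        V[i] += lr * (err * U[u] - reg * V[i])

print(U @ V.T)                                                   # reconstructed rating matrix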

Application: Netflix challenge. A 2006 contest to improve movie recommendations. Data: 500K (anonymized) Netflix users, 17K movies, 100M ratings on a scale of 1-5 stars. Evaluation: a test set of 3M ratings (ground-truth labels withheld), scored by root-mean-square error (RMSE). Prize: $1M for beating the Netflix algorithm by 10% on RMSE; if no winner, a $50K prize to the leading team.
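The evaluation metric, as a one-line sketch over made-up prediction and ground-truth arrays:

import numpy as np

def rmse(pred, truth):
    return float(np.sqrt(np.mean((np.asarray(pred) - np.asarray(truth)) ** 2)))

print(rmse([3.5, 4.0, 2.0], [4, 4, 1]))   # root-mean-square error on a toy test set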

Factorization models in the Netflix competition. Factorization models gave the leading teams an advantage: they discover the most descriptive "dimensions" for predicting movie preferences.

Performance of factorization models. Model performance depends on complexity. Netflix's own algorithm (Cinematch): RMSE = 0.9514. Grand prize target: RMSE = 0.8563 (a 10% improvement).

Summary. Hidden factors create relationships among observed data: document topics give rise to correlations among words, and a user's tastes give rise to correlations among her movie ratings. Methods for inferring hidden (latent) factors from observations: latent semantic indexing (SVD), topic models (LDA, etc.), matrix factorization (SVD, PMF, etc.). There is a trade-off between model complexity, performance and computational efficiency.

Tools
Topic modeling:
1. Blei's LDA with the variational method (http://cran.r-project.org/web/packages/lda/)
2. Gibbs sampling method
PMF:
1. Matlab implementation
2. Blei's CTR code