Contextual Text Mining. Qiaozhu Mei, University of Illinois at Urbana-Champaign (2009).



Knowledge Discovery from Text: text goes through a text mining system to produce knowledge.

Trend of Text Content (Ramakrishnan and Tomkins 2007), amount per day by content type:
- Published content: 3-4 GB/day
- Professional web content: ~2 GB/day
- User-generated content: 8-10 GB/day
- Private text content: ~3 TB/day

Text on the Web (unconfirmed figures): per-site volume estimates (~750k/day, ~3M/day, ~150k/day) and collection sizes (1M, 10B, 6M, ~100B). Where to start? Where to go? Is there gold?

Context Information in Text: author, time, source, author's occupation, language, social network, location, sentiment; illustrated with an annotated post ("Check Lap Kok, HK"; "self designer, publisher, editor …"; "3:53 AM Jan 28th from Ping.fm").

Rich Context in Text: 102M blogs; ~3M msgs/day; ~150k bookmarks/day; ~300M words/month; ~2M users; 5M users; 500M URLs; 8M contributors; 100+ languages; 750K posts/day; 100M users; >1M groups; 73 years; ~400k authors; ~4k sources; 1B queries (per hour? per IP?).

Text + Context = ? Context serves as guidance: "I have a guide!"

Query + User = Personalized Search. The query "MSR" is ambiguous: Modern System Research, Medical simulation, Montessori School of Raleigh, Mountain Safety Research, MSR Racing; Wikipedia definitions include Metropolis Street Racer, Molten salt reactor, Mars sample return, Magnetic Stripe Reader. How much can personalization help? If you know me, you should give me Microsoft Research…

Customer Review + Brand = Comparative Product Summary. Can we compare products? Common themes across IBM / APPLE / DELL laptop reviews:
- Battery life: Long (4-3 hrs) / Medium (3-2 hrs) / Short (2-1 hrs)
- Hard disk: Large (GB) / Small (5-10 GB) / Medium (GB)
- Speed: Slow (MHz) / Very fast (3-4 GHz) / Moderate (1-2 GHz)

Literature + Time = Topic Trends: hot topics in SIGMOD. What's hot in literature?

Blogs + Time & Location = Spatiotemporal Topic Diffusion ("one week later"). How does discussion spread?

Blogs + Sentiment = Faceted Opinion Summary (The Da Vinci Code). What is good and what is bad? Example snippets: "Tom Hanks, who is my favorite movie star act the leading role."; "protesting... will lose your faith by watching the movie."; "a good book to past time."; "... so sick of people making such a big deal about a fiction book".

Publications + Social Network = Topical Community: a coauthor network with topics such as information retrieval, machine learning, and data mining. Who works together on what?

A General Solution for All: Text + Context = Contextual Text Mining
- Query log + User = Personalized Search
- Literature + Time = Topic Trends
- Review + Brand = Comparative Opinion
- Blog + Time & Location = Spatiotemporal Topic Diffusion
- Blog + Sentiment = Faceted Opinion Summary
- Publications + Social Network = Topical Community
- …

Contextual Text Mining
- Generative Model of Text
- Modeling Simple Context
- Modeling Implicit Context
- Modeling Complex Context
- Applications of Contextual Text Mining

Generative Model of Text: a word distribution (the, is, harry, potter, movie, plot, time, rowling, …) generates text such as "the movie harry potter is based on j.k. rowling"; inference/estimation goes the other way, recovering the distribution from observed text.

Contextualized Models: each context (Year = 1998 or 2008, Location = US or China, Source = official, Sentiment = +, …) has its own version of a topic (e.g., "harry potter book rowling" vs. "harry potter movie director"). Generation: how to select contexts? How to model the relations of contexts? Inference: how to estimate contextual models? How to reveal contextual patterns?

Topics in Text
- Topic (theme) = the subject of a discourse
- A topic covers multiple documents; a document has multiple topics
- A topic is a soft cluster of documents, and a multinomial distribution over words, e.g., Web Search: search 0.2, engine 0.15, query 0.08, user 0.07, ranking 0.06, … (other examples: Data Mining, Machine Learning)
- Many text mining tasks: extracting topics from text; revealing contextual topic patterns

Probabilistic Topic Models: Topic 1 "Apple iPod" (ipod 0.15, nano, music, download, apple, …) and Topic 2 "Harry Potter" (harry 0.09, movie, potter, actress, music, …) jointly generate the text "I downloaded the music of the movie harry potter to my ipod nano"; each word comes from one of the topics.
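This generative story can be sketched in a few lines; the topic word distributions and mixing weights below are hypothetical stand-ins for the slide's "Apple iPod" and "Harry Potter" topics, not estimated values:

```python
import random

# Two hypothetical topics as multinomial word distributions.
topics = {
    "ipod":  {"ipod": 0.3, "nano": 0.2, "music": 0.2, "download": 0.15, "apple": 0.15},
    "harry": {"harry": 0.3, "potter": 0.25, "movie": 0.2, "actress": 0.15, "music": 0.1},
}
mixing = {"ipod": 0.5, "harry": 0.5}  # document-level topic weights

def generate(n_words, rng):
    """Generate n_words: pick a topic, then a word from that topic's multinomial."""
    words = []
    for _ in range(n_words):
        z = rng.choices(list(mixing), weights=list(mixing.values()))[0]
        dist = topics[z]
        words.append(rng.choices(list(dist), weights=list(dist.values()))[0])
    return words

doc = generate(8, random.Random(0))
```

Inference (the arrow pointing back on the slide) is the reverse problem: given `doc`, recover `topics` and `mixing`.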

Parameter Estimation: maximize the data likelihood; estimate the parameters with the EM algorithm. Iterate two steps: guess the topic affiliation of each word occurrence (yielding pseudo-counts), then re-estimate the topic parameters from those pseudo-counts.
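A minimal sketch of this EM loop for a two-topic unigram mixture; the document and the biased initializations are made up for illustration:

```python
from collections import Counter

def normalize(d):
    s = sum(d.values())
    return {w: v / s for w, v in d.items()}

def em_two_topics(doc_words, p1, p2, pi=0.5, iters=30):
    """Fit two unigram topic models to one document with EM.
    E-step: guess each word's topic affiliation; M-step: re-estimate params."""
    counts = Counter(doc_words)
    vocab = set(counts)
    for _ in range(iters):
        # E-step: P(topic 1 | word) for each word type (the soft affiliation)
        z1 = {w: pi * p1[w] / (pi * p1[w] + (1 - pi) * p2[w] + 1e-12) for w in vocab}
        # M-step: pseudo-counts c(w) * z1(w) re-estimate the distributions
        c1 = {w: counts[w] * z1[w] for w in vocab}
        c2 = {w: counts[w] * (1 - z1[w]) for w in vocab}
        p1, p2 = normalize(c1), normalize(c2)
        pi = sum(c1.values()) / len(doc_words)
    return p1, p2, pi

doc = "ipod music ipod nano harry potter movie harry".split()
vocab = set(doc)
# Slightly biased initializations break the symmetry between the two topics.
p1, p2, pi = em_two_topics(
    doc,
    normalize({w: (4 if w == "ipod" else 1) for w in vocab}),
    normalize({w: (4 if w == "harry" else 1) for w in vocab}),
)
```

After a few iterations the two distributions separate, one favoring "ipod" words and the other "harry" words, which is exactly the guess-then-re-estimate cycle on the slide.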

How Context Affects Topics: topics in science literature differ between the 16th century and the 21st century. When do a computer scientist and a gardener use "tree, root, prune" in text? What does "tree" mean near "algorithm"? In Europe, "football" appears a lot in a soccer report; what about in the US? Text is generated according to the context!

Simple Contextual Topic Model: the same topics vary across contexts. Context 1 (2004): Topic 1 "Apple iPod" = ipod, mini, 4gb, …; Topic 2 "Harry Potter" = harry, prisoner, azkaban, …. Context 2 (2007): ipod, iphone, nano, …; potter, order, phoenix, …. Generated text: "I downloaded the music of the movie harry potter to my iphone".

Contextual Topic Patterns: compare contextualized versions of topics. Contextual topic patterns are conditional distributions (z: topic; c: context; w: word): p(z|c), the strength of topics in a context, and p(w|z,c), the content variation of topics across contexts.

Example: Topic Life Cycles (Mei and Zhai KDD'05). Context = time; compare topic strength across time.

Example: Spatiotemporal Theme Pattern (Mei et al. WWW'06). Context = time & location; the theme of government response in Hurricane Katrina, compared across weeks:
- Week 1: the theme is the strongest along the Gulf of Mexico
- Week 2: the discussion moves towards the north and west
- Week 3: the theme distributes more uniformly over the states
- Week 4: the theme is again strong along the east coast and the Gulf of Mexico
- Week 5: the theme fades out in most states

Example: Evolutionary Topic Graph (Mei and Zhai KDD'05). Context = time; themes in KDD literature from 1999 on, e.g., SVM (criteria, classification, linear), decision tree (classifier, class, Bayes), text classification (unlabeled, document, labeled, learning), web/social information (retrieval, distance, networks), web classification (features, topic), mixture clustering (random, cluster, variables), and topic models (mixture, LDA, semantic), linked as they evolve over time.

Example: Event Impact Analysis (Mei and Zhai KDD'06). Context = event; the theme "retrieval models" in SIGIR papers (term, relevance, weight, feedback, model, probabilistic, document, …). Two events, the start of the TREC conferences (1992) and the publication of the paper "A language modeling approach to information retrieval" (1998), shift the theme: towards vector-space and boolean models (vector, concept, model, space, boolean, function), xml and judgment/rank models, probabilistic and logic models (probabilist, logic, boolean, algebra, estimate, weight), and language models (model, language, estimate, parameter, distribution, smooth, likelihood).

Implicit Context in Text
- Some contexts are hidden: sentiments, intents, impact, etc.
- The document-to-context affiliation is not known for sure; it must be inferred from the data
- Train a model M for each implicit context
- Provide M to the topic model as guidance

Modeling Implicit Context: sentiment as a hidden context. Sentiment models, Positive (good, like, perfect) and Negative (hate, awful, disgust), guide the topics: Topic 1 "Apple iPod" (color, size, quality, price, scratch, problem) and Topic 2 "Harry Potter" (actress, music, visual, director, accent, plot), for text like "I like the song of the movie on my ipod; perfect, but hate the accent".

Semi-supervised Topic Model (Mei et al. WWW'07): move from Maximum Likelihood Estimation (MLE) to Maximum A Posteriori (MAP) estimation by adding Dirichlet priors to the topics θ_1, θ_2, …, θ_k (with document mixing weights π_d1, π_d2, …, π_dk). Prior word distributions (e.g., "love, great" vs. "hate, awful") encode guidance from the user; the effect is similar to adding pseudo-counts to the observations.
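The pseudo-count view of the MAP update can be sketched directly; the counts, prior words, and prior strength below are hypothetical, standing in for one M-step of the semi-supervised model:

```python
def map_topic_estimate(word_counts, prior_words, prior_strength):
    """MAP estimate of a topic's word distribution: observed (pseudo-)counts
    plus Dirichlet-prior pseudo-counts concentrated on user-specified words."""
    vocab = set(word_counts) | set(prior_words)
    pseudo = prior_strength / len(prior_words)
    posterior = {w: word_counts.get(w, 0) + (pseudo if w in prior_words else 0)
                 for w in vocab}
    total = sum(posterior.values())
    return {w: c / total for w, c in posterior.items()}

# Hypothetical counts from one EM iteration, plus a "positive sentiment" prior.
counts = {"good": 3, "ipod": 5, "hate": 2}
model = map_topic_estimate(counts,
                           prior_words={"good", "like", "perfect"},
                           prior_strength=6)
```

The prior pulls the topic toward the user's seed words ("good", "like", "perfect") even when they are rare or absent in the data, which is the sense in which the user's guidance acts like extra observations.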

Example: Faceted Opinion Summarization (Mei et al. WWW'07). Context = topic & sentiment.
Topic 1: Movie
- Neutral: "... Ron Howards selection of Tom Hanks to play Robert Langdon." "Directed by: Ron Howard Writing credits: Akiva Goldsman ..." "After watching the movie I went online and some research on ... Anybody is interested in it?"
- Positive: "Tom Hanks stars in the movie, who can be mad at that?" "Tom Hanks, who is my favorite movie star act the leading role."
- Negative: "But the movie might get delayed, and even killed off if he loses." "protesting ... will lose your faith by ... watching the movie." "... so sick of people making such a big deal about a FICTION book and movie."
Topic 2: Book
- Neutral: "I'm reading 'Da Vinci Code' now. …"
- Positive: "I remembered when i first read the book, I finished the book in two days. Awesome book." "... So still a good book to past time."
- Negative: "This controversy book cause lots conflict in west society." "... so sick of people making such a big deal about a FICTION book and movie."

Results: Sentiment Dynamics
- Facet: the book "the da vinci code" (bursts during the movie; Pos > Neg)
- Facet: the impact on religious beliefs (bursts during the movie; Neg > Pos)

Results: Topics with User's Guidance. Topics for iPod, without and with a user-specified prior ("I know two topics should look like this"):
No prior:
- "Battery, nano": battery, shuffle, charge, nano, dock, itune, usb, hour
- "Marketing": apple, microsoft, market, zune, device, company, consumer, sale
- "Ads, spam": free, sign, offer, freepay, complete, virus, free ipod, trial
With prior:
- "Nano": nano, color, thin, hold, model, 4gb, dock, inch
- "Battery": battery, shuffle, charge, usb, hour, mini, life, rechargable

Complex Context in Text
- Complex context = structure of contexts
- Many contexts have latent structure: time; location; social network
- Why model context structure? To reveal novel contextual patterns; to regularize contextual models; to alleviate data sparseness (smoothing)

Modeling Complex Context: contexts A and B are closely related. Two intuitions: Regularization: Model(A) and Model(B) should be similar. Smoothing: look at B if A doesn't have enough data.

Applications of Contextual Text Mining
- Personalized Search: personalization with backoff
- Social Network Analysis (for schools): finding topical communities
- Information Retrieval (for industry labs): smoothing language models

Application I: Personalized Search

Personalization with Backoff (Mei and Church WSDM'08)
- Ambiguous query: MSG (Madison Square Garden vs. Monosodium Glutamate)
- Disambiguate based on the user's prior clicks
- We don't have enough data for everyone! Back off to classes of users
- Proof of concept: context = segments defined by IP addresses
- Other market segmentation (demographics)

Apply Contextual Text Mining to Personalized Search
- The text data: query logs
- The generative model: P(Url | Query)
- The context: users (IP addresses)
- The contextual model: P(Url | Query, IP)
- The structure of context: the hierarchical structure of IP addresses

Evaluation Metric: Entropy (H)
- The difficulty of encoding information (a distribution): the size of the search space; the difficulty of a task
- H = 20 bits for 1 million items distributed uniformly
- A powerful tool for sizing challenges and opportunities: how hard is search? How much does personalization help?
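As a sanity check of the "20 bits for a million uniform items" claim, a sketch of the entropy computation:

```python
import math

def entropy(probs):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# One million equally likely items take about log2(10^6) ~= 19.93 bits to encode.
h_uniform = entropy([1e-6] * 10**6)
# A fair coin takes exactly one bit; a biased one takes less.
h_coin = entropy([0.5, 0.5])
```

Conditional entropies such as H(URL | Query) on the next slide follow the same idea, measuring how much uncertainty about the clicked URL remains once the query is known.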

How Hard Is Search?
Entropies: H(Query) = 21.1; H(URL) = 22.1; H(IP) = 22.1; H(All But IP) = 23.9; H(All But URL) = 26.0; H(All But Query) = 27.1; H(All Three) = 27.2.
- Traditional search: H(URL | Query) = 2.8 (= 23.9 - 21.1)
- Personalized search: H(URL | Query, IP) = 1.2 (= 27.2 - 26.0)
Personalization cuts H in half!

Context = First k Bytes of the IP Address (e.g., 156.*.*.*)
- Full personalization: every context has a different model; sparse data!
- No personalization: all contexts share the same model
- Personalization with backoff: similar contexts have similar models

Backing Off by IP
- λ_4: weight for the first 4 bytes of the IP; λ_3: first 3 bytes; λ_2: first 2 bytes; …
- The λs are estimated with EM
- A little bit of personalization is better than too much (sparse data) or too little (missed opportunity)
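The backoff mixture can be sketched as a λ-weighted sum over IP prefixes of decreasing length; the click models, URLs, and λ values below are hypothetical (in the paper the λs come from EM), using the MSG example from the earlier slide:

```python
def backoff_prob(url, query, ip, click_models, lambdas):
    """P(url | query, ip) as a lambda-weighted mixture of models trained on
    IP prefixes of length 4, 3, 2, 1, and 0 bytes (0 = no personalization)."""
    parts = ip.split(".")
    total = 0.0
    for k, lam in zip(range(4, -1, -1), lambdas):
        prefix = ".".join(parts[:k])          # "" for k=0: the global model
        model = click_models.get((prefix, query), {})
        total += lam * model.get(url, 0.0)
    return total

# Hypothetical click models keyed by (IP prefix, query).
models = {
    ("156.111.188", "msg"): {"madisonsquaregarden.com": 0.9, "glutamate.example": 0.1},
    ("156", "msg"):         {"madisonsquaregarden.com": 0.5, "glutamate.example": 0.5},
    ("", "msg"):            {"glutamate.example": 0.7, "madisonsquaregarden.com": 0.3},
}
lambdas = [0.0, 0.4, 0.0, 0.3, 0.3]  # weights for 4-, 3-, 2-, 1-, 0-byte prefixes
p = backoff_prob("madisonsquaregarden.com", "msg", "156.111.188.243", models, lambdas)
```

Missing prefixes contribute nothing, so a user in a neighborhood with no data still gets a sensible estimate from the coarser segments.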

Context = Market Segmentation
- The traditional goal of marketing: segment customers (e.g., business vs. consumer) by need and value proposition
- Need: segments ask different questions at different times; value: different advertising opportunities
- Segmentation variables: queries, URL clicks, IP addresses; geography and demographics (age, gender, income); time of day and day of week

Business Days vs. Weekends: More Clicks and Easier Queries

Harder Queries at TV Time

Application II: Information Retrieval

Application: Text Retrieval
- Document d (a text mining paper), doc language model θ_d, p(w|θ_d): text 4/100 = 0.04; mining 3/100 = 0.03; clustering 1/100 = 0.01; …; data = 0; computing = 0
- Query q = "data mining", query language model θ_q, p(w|θ_q): data 1/2 = 0.5; mining 1/2 = 0.5; expanded model θ_q': data 0.4; mining 0.4; clustering 0.1; …
- Smoothed doc LM θ_d', p(w|θ_d'): text 0.039; mining 0.028; clustering 0.01; …; data and computing now nonzero
- A similarity function compares the query model with the smoothed document model

Smoothing a Document Language Model
- Retrieval performance depends on the estimated LM, which depends on smoothing
- MLE from the document: text 4/100 = 0.04; mining 3/100 = 0.03; Assoc. 1/100 = 0.01; clustering 1/100 = 0.01; …; data = 0; computing = 0
- Smoothing assigns non-zero probability to unseen words (data, computing) and estimates a more accurate distribution from sparse data
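One standard way to do this (a common scheme, not necessarily the exact variant on the slide) is Dirichlet-prior smoothing against a collection model; a sketch with toy numbers echoing the slide, where "data" is unseen in the document:

```python
def dirichlet_smooth(doc_counts, collection_model, mu=2000):
    """Dirichlet-smoothed document LM:
    p(w|d) = (c(w,d) + mu * p(w|C)) / (|d| + mu)."""
    doc_len = sum(doc_counts.values())
    return {w: (doc_counts.get(w, 0) + mu * pc) / (doc_len + mu)
            for w, pc in collection_model.items()}

# Toy counts and a hypothetical collection model.
doc = {"text": 4, "mining": 3, "clustering": 1}
collection = {"text": 0.01, "mining": 0.005, "clustering": 0.002,
              "data": 0.01, "computing": 0.003}
smoothed = dirichlet_smooth(doc, collection, mu=100)
```

Unseen words inherit mass from the collection model, while frequent document words keep most of theirs, which is exactly the two goals listed above.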

Apply Contextual Text Mining to Smoothing Language Models
- The text data: a collection of documents
- The generative model: P(word)
- The context: the document
- The contextual model: P(w|d)
- The structure of context: the graph structure of documents
- Goal: use the graph of documents to estimate a good P(w|d)

Traditional Document Smoothing in Information Retrieval: interpolate the MLE with a reference language model θ_ref estimated from the collection (corpus) [Ponte & Croft 98], from document clusters [Liu & Croft 04], or from the document's nearest neighbors [Kurland & Lee 04].

Graph-based Smoothing for Language Models in Retrieval (Mei et al. SIGIR 2008): a novel and general view of smoothing. The collection is a graph of documents (it can also be a word graph). The unsmoothed MLE p(w|d) is a surface on top of the graph, with documents d_1, d_2, … as its projection on a plane; a smoothed LM is a smoothed surface.

The General Objective of Smoothing: trade fidelity to the MLE against smoothness of the surface, e.g., minimize O(f) = Σ_u w(u) (f_u - f̂_u)² + λ Σ_(u,v)∈E w(u,v) (f_u - f_v)², where w(u) is the importance of vertex u, w(u,v) is the weight of edge (u,v) (e.g., 1/distance), and f̂ is the MLE surface.

Smoothing Language Models using a Document Graph: construct a kNN graph of documents; set f_u = p(w|d_u) as the document language model, vertex importance w(u) = Deg(u), and edge weight w(u,v) = cosine similarity; apply additional Dirichlet smoothing on top.
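The "act as neighbors do" intuition can be sketched as iterative neighbor averaging; this is a simplified stand-in for the paper's solver, with made-up values f_u = p(w|d_u) for one word over three documents, where d3 never saw the word:

```python
def smooth_on_graph(f, edges, alpha=0.5, iters=50):
    """Pull each vertex toward the weighted average of its neighbors:
    f_u <- (1 - alpha) * f_hat_u + alpha * weighted_avg(f_v for neighbors v)."""
    f_hat = dict(f)                      # the original (MLE) surface
    nbrs = {u: [] for u in f}
    for u, v, w in edges:
        nbrs[u].append((v, w))
        nbrs[v].append((u, w))
    for _ in range(iters):
        new_f = {}
        for u in f:
            if nbrs[u]:
                z = sum(w for _, w in nbrs[u])
                avg = sum(w * f[v] for v, w in nbrs[u]) / z
                new_f[u] = (1 - alpha) * f_hat[u] + alpha * avg
            else:
                new_f[u] = f_hat[u]
        f = new_f
    return f

f0 = {"d1": 0.04, "d2": 0.03, "d3": 0.0}       # p(w|d) before smoothing
edges = [("d1", "d2", 1.0), ("d2", "d3", 1.0)]  # a tiny kNN graph
smoothed = smooth_on_graph(f0, edges)
```

The zero-probability vertex d3 picks up mass from its neighbor while staying below the documents that actually contain the word, the behavior the fidelity-versus-smoothness objective encodes.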

Effectiveness of the Framework (improvement over the Dirichlet baseline; Wilcoxon test: *, **, *** = significance levels 0.1, 0.05, 0.01):
- AP: DMDG +17.1%***, DMWG +16.1%***, DSDG +10.1%***, QMWG +10.1%
- LA: +4.5%**, +4.5%**, +1.6%**
- SJMN: DMDG +13.2%***, DMWG +12.3%***, DSDG +10.3%***, QMWG +7.4%
- TREC: DMDG +5.4%***, DMWG +5.4%**, DSDG +1.6%, QMWG +1.2%
† DMWG reranks the top 3000 results, which usually yields reduced performance compared to ranking all the documents.
Graph-based smoothing >> baseline smoothing; smoothing the doc LM >> smoothing the relevance score >> smoothing the query LM.

Intuitive Interpretation: smoothing with the document graph computes the absorption probability of reaching the "1" state. Writing a word w in a document is a random walk on the document Markov chain: write down w upon reaching "1". Documents act as their neighbors do.

Application III: Social Network Analysis

Topical Community Analysis: topic modeling helps community extraction, and network analysis helps topic extraction. E.g., communities of physicists (physicist, physics, scientist, theory, gravitation, …) vs. writers (writer, novel, best-sell, book, language, film, …). In computer science literature, is the right grouping Information Retrieval + Data Mining + Machine Learning + … (by domain), or Review + Algorithm + Evaluation + …?

Apply Contextual Text Mining to Topical Community Analysis
- The text data: publications of researchers
- The generative model: a topic model
- The context: the author
- The contextual model: an author-topic model
- The structure of context: a social network, the coauthor network of researchers

Intuitions
- People working on the same topic belong to the same "topical community"
- A good community: coherent topic + well connected
- A topic is semantically coherent if people working on it also collaborate a lot
- Is an author more likely to be an IR person or a compiler person? Intuition: my topics are similar to my neighbors'

Social Network Context for Topic Modeling: context = author; coauthors are similar contexts (e.g., in a coauthor network). Intuition: I work on topics similar to my neighbors'. Smooth the topic distributions P(θ_j | author) over the network.

Topic Modeling with Network Regularization (NetPLSA). Basic assumption (e.g., on a co-author graph): related authors work on similar topics. The objective combines the PLSA log-likelihood with a graph harmonic regularizer (a generalization of [Zhu '03]): maximize O = (1 - λ) L_PLSA - (λ/2) Σ_(u,v)∈E w(u,v) Σ_j (p(θ_j|d_u) - p(θ_j|d_v))², where w(u,v) is the importance (weight) of an edge, the squared term is the difference of topic distributions on neighboring vertices, λ trades off topic fit against smoothness, and p(θ_j|d) is the topic distribution of a document.
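A sketch of evaluating this regularized objective, under the assumed simplified form (1 - λ) times the log-likelihood minus λ times the graph harmonic penalty; the topic distributions, edges, and log-likelihood value are hypothetical:

```python
def netplsa_objective(log_lik, theta, edges, lam):
    """(1 - lam) * PLSA log-likelihood minus lam/2 * graph harmonic penalty.
    theta[u] is the topic distribution of document/author u."""
    penalty = 0.0
    for u, v, w in edges:
        penalty += w * sum((theta[u][k] - theta[v][k]) ** 2
                           for k in range(len(theta[u])))
    return (1 - lam) * log_lik - lam * 0.5 * penalty

# Author "c" disagrees sharply with its neighbor "b" in the first assignment.
theta = {"a": [0.9, 0.1], "b": [0.8, 0.2], "c": [0.1, 0.9]}
edges = [("a", "b", 1.0), ("b", "c", 1.0)]
score = netplsa_objective(-100.0, theta, edges, lam=0.5)
```

Holding the likelihood fixed, an assignment where connected authors agree scores higher, which is how the regularizer pushes coauthors toward the same topical community.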

Topics & Communities without Regularization (noisy community assignment; labels unclear):
- Topic 1 (?): term 0.02, question 0.02, protein 0.01, training 0.01, weighting 0.01, multiple 0.01, recognition 0.01, relations 0.01, library 0.01
- Topic 2 (?): peer 0.02, patterns 0.01, mining 0.01, clusters 0.01, stream 0.01, frequent 0.01, e 0.01, page 0.01, gene 0.01
- Topic 3 (?): visual 0.02, analog 0.02, neurons 0.02, vlsi 0.01, motion 0.01, chip 0.01, natural 0.01, cortex 0.01, spike 0.01
- Topic 4 (?): interface 0.02, towards 0.02, browsing 0.02, xml 0.01, generation 0.01, design 0.01, engine 0.01, service 0.01, social

Topics & Communities with Regularization (coherent community assignment):
- Topic 1 (Information Retrieval): retrieval 0.13, information 0.05, document 0.03, query 0.03, text 0.03, search 0.03, evaluation 0.02, user 0.02, relevance 0.02
- Topic 2 (Data Mining): mining 0.11, data 0.06, discovery 0.03, databases 0.02, rules 0.02, association 0.02, patterns 0.02, frequent 0.01, streams 0.01
- Topic 3 (Machine Learning): neural 0.06, learning 0.02, networks 0.02, recognition 0.02, analog 0.01, vlsi 0.01, neurons 0.01, gaussian 0.01, network 0.01
- Topic 4 (Web): web 0.05, services 0.03, semantic 0.03, services 0.03, peer 0.02, ontologies 0.02, rdf 0.02, management 0.01, ontology 0.01

Topic Modeling and SNA Improve Each Other
- Methods compared: PLSA, NetPLSA, and NCut (spectral clustering with normalized cut, J. Shi et al.; pure network-based community finding)
- Metrics: cut edge weights, ratio cut / normalized cut (the smaller the better), and the sizes of communities 1-4
- Network regularization helps extract coherent communities (the network keeps topics focused)
- Topic modeling helps balance community sizes (text implicitly bridges authors)

Smoothed Topic Map: map a topic onto the network, e.g., using p(θ|a). For the topic "information retrieval", PLSA mixes core contributors with irrelevant authors, while NetPLSA separates core contributors, intermediate authors, and irrelevant authors.

Summary of My Talk
- Text + Context = Contextual Text Mining: a new paradigm of text mining
- A novel framework for contextual text mining: probabilistic topic models, contextualized by simple, implicit, and complex context
- Applications of contextual text mining

A Roadmap of My Work: contextual topic models (KDD 05, KDD 06a, KDD 06b, WWW 06, WWW 07, KDD 07, ACL 08) and contextual text mining applied to information retrieval and web search (SIGIR 07, SIGIR 08, WSDM 08, WWW 08, CIKM 08).

Research Discipline: text information management, spanning text mining, information retrieval, data mining, natural language processing, databases, bioinformatics, machine learning, applied statistics, social networks, and information science.

End Note

Thank You