Contextual Text Mining. Qiaozhu Mei, University of Illinois at Urbana-Champaign. © 2009 Qiaozhu Mei.


Slide 1: Contextual Text Mining. Qiaozhu Mei (qmei2@uiuc.edu), University of Illinois at Urbana-Champaign.

Slide 2: Knowledge Discovery from Text. [Figure: a text mining system turning text into knowledge.]

Slide 3: Trend of Text Content (Ramakrishnan and Tomkins 2007). Amount produced per day, by content type:
- Published content: 3-4 GB/day
- Professional web content: ~2 GB/day
- User-generated content: 8-10 GB/day
- Private text content: ~3 TB/day

Slide 4: Text on the Web (unconfirmed figures). [Figure: text volumes across web sites, e.g. ~750k/day, ~3M/day, ~150k/day, and collection sizes of 1M, 6M, 10B, ~100B items.] Where to start? Where to go? Is there gold?

Slide 5: Context Information in Text. [Figure: a post annotated with its context.] Context includes: author, time, source, author's occupation, language, social network, location, sentiment. Example annotations: "Chek Lap Kok, HK"; "self designer, publisher, editor ..."; "3:53 AM Jan 28th from Ping.fm".

Slide 6: Rich Context in Text. [Figure: context statistics across many sources: 102M blogs; ~3M msgs/day; ~150k bookmarks/day; ~300M words/month; ~2M users; 5M users; 500M URLs; 8M contributors; 100+ languages; 750K posts/day; 100M users; >1M groups; 73 years; ~400k authors; ~4k sources; 1B queries (per hour? per IP?).]

Slide 7: Text + Context = ? [Figure: text + context = guidance. "I have a guide!"]

Slide 8: Query + User = Personalized Search. The query "MSR" is ambiguous: Microsoft Research, Modern System Research, medical simulation, Montessori School of Raleigh, Mountain Safety Research, MSR Racing; Wikipedia definitions include Metropolis Street Racer, molten salt reactor, Mars sample return, and magnetic stripe reader. "If you know me, you should give me Microsoft Research." How much can personalization help?

Slide 9: Customer Review + Brand = Comparative Product Summary. Can we compare products? Common themes across IBM, Apple, and Dell laptop reviews:
- Battery life: IBM long (3-4 hrs); Apple medium (2-3 hrs); Dell short (1-2 hrs)
- Hard disk: IBM large (80-100 GB); Apple small (5-10 GB); Dell medium (20-50 GB)
- Speed: IBM slow (100-200 MHz); Apple very fast (3-4 GHz); Dell moderate (1-2 GHz)

Slide 10: Literature + Time = Topic Trends. What's hot in the literature? [Figure: hot topics in the SIGMOD literature over time.]

Slide 11: Blogs + Time & Location = Spatiotemporal Topic Diffusion. How does discussion spread? [Figure: topic intensity maps, one week apart.]

Slide 12: Blogs + Sentiment = Faceted Opinion Summary. What is good and what is bad? Example opinions about The Da Vinci Code: "Tom Hanks, who is my favorite movie star act the leading role."; "protesting ... will lose your faith by watching the movie."; "a good book to past time."; "... so sick of people making such a big deal about a fiction book".

Slide 13: Publications + Social Network = Topical Community. Who works together on what? [Figure: a coauthor network with communities for information retrieval, machine learning, and data mining.]

Slide 14: A General Solution for All. Text + Context = Contextual Text Mining:
- Query log + User = Personalized Search
- Literature + Time = Topic Trends
- Review + Brand = Comparative Opinion
- Blog + Time & Location = Spatiotemporal Topic Diffusion
- Blog + Sentiment = Faceted Opinion Summary
- Publications + Social Network = Topical Community
- ...

Slide 15: Outline. Contextual Text Mining:
- Generative Model of Text
- Modeling Simple Context
- Modeling Implicit Context
- Modeling Complex Context
- Applications of Contextual Text Mining

Slide 16: Generative Model of Text. Generation: a multinomial word distribution over the, is, harry, potter, movie, plot, time, rowling, ... (probabilities 0.1, 0.07, 0.05, 0.04, 0.02, 0.01, ...) generates text such as "the movie harry potter is based on j. k. rowling". Inference/estimation runs in the opposite direction: recover the distribution from observed text.

Slide 17: Contextualized Models. The same topic has different word distributions under different contexts, e.g. Year = 1998: book 0.15, harry 0.10, potter 0.08, rowling 0.05; Year = 2008: movie 0.18, harry 0.09, potter 0.08, director 0.04; other contexts: Location = US, Location = China, Source = official, Sentiment = +. Generation questions: how to select contexts? How to model the relations of contexts? Inference questions: how to estimate contextual models? How to reveal contextual patterns?

Slide 18: Topics in Text. Topic (theme) = the subject of a discourse. A topic covers multiple documents; a document has multiple topics. Topic = a soft cluster of documents; topic = a multinomial distribution over words, e.g. "web search": search 0.2, engine 0.15, query 0.08, user 0.07, ranking 0.06, ... (other topics: data mining, machine learning). Many text mining tasks reduce to extracting topics from text and revealing contextual topic patterns.

Slide 19: Probabilistic Topic Models. Two topics: Topic 1 "Apple iPod" (ipod 0.15, nano 0.08, music 0.05, download 0.02, apple 0.01) and Topic 2 "Harry Potter" (movie 0.10, harry 0.09, potter 0.05, actress 0.04, music 0.02). A document mixes the two: "I downloaded the music of the movie harry potter to my ipod nano" draws "ipod" from Topic 1 (0.15) and "harry" from Topic 2 (0.09).
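The two-topic generation on this slide can be written out directly. A minimal sketch in Python: the word probabilities are the ones shown on the slide, but the topic mixing weights are an assumption (the slide does not give them).

```python
# Two topics as multinomial word distributions (probabilities from the slide).
topic_ipod  = {"ipod": 0.15, "nano": 0.08, "music": 0.05, "download": 0.02, "apple": 0.01}
topic_harry = {"movie": 0.10, "harry": 0.09, "potter": 0.05, "actress": 0.04, "music": 0.02}

def word_prob(w, mix=(0.5, 0.5)):
    """P(w) under a mixture of the two topics; the mixing weights are assumed."""
    return mix[0] * topic_ipod.get(w, 0.0) + mix[1] * topic_harry.get(w, 0.0)

# "music" can be generated by either topic, so both contribute:
p_music = word_prob("music")  # 0.5 * 0.05 + 0.5 * 0.02 = 0.035
```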

Slide 20: Parameter Estimation. Maximize the data likelihood; estimate the parameters with the EM algorithm. Given the document "I downloaded the music of the movie harry potter to my ipod nano" and unknown topic parameters, EM alternates two steps: guess each word's topic affiliation, then re-estimate the parameters from the resulting pseudo-counts.
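The guess-affiliation / re-estimate loop can be sketched as EM for a two-topic mixture. This is a toy illustration, not the exact model of the talk: the document, the initialization, and the fixed 50/50 mixing weights are all assumptions.

```python
from collections import Counter

def em_two_topics(doc, p_w_z, pi=(0.5, 0.5), iters=20):
    """EM for a mixture of two unigram topics (mixing weights pi held fixed).
    E-step: guess each word's topic affiliation P(z|w) (the slide's '??' step).
    M-step: re-estimate P(w|z) from the resulting pseudo-counts."""
    vocab = set(doc)
    counts = Counter(doc)
    for _ in range(iters):
        # E-step: posterior topic affiliation for each word
        post = {}
        for w in vocab:
            scores = [pi[z] * p_w_z[z].get(w, 1e-12) for z in (0, 1)]
            s = sum(scores)
            post[w] = [x / s for x in scores]
        # M-step: pseudo-counts -> new multinomials
        new_p = [{}, {}]
        for z in (0, 1):
            total = sum(counts[w] * post[w][z] for w in vocab)
            for w in vocab:
                new_p[z][w] = counts[w] * post[w][z] / total
        p_w_z = new_p
    return p_w_z

doc = "i downloaded the music of the movie harry potter to my ipod nano".split()
init = [{"ipod": 0.15, "nano": 0.08, "music": 0.05, "download": 0.02, "apple": 0.01},
        {"movie": 0.10, "harry": 0.09, "potter": 0.05, "actress": 0.04, "music": 0.02}]
p = em_two_topics(doc, init)  # p[0] pulls in ipod/nano, p[1] pulls in harry/potter
```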

Slide 21: How Context Affects Topics. Topics in the science literature: 16th century vs. 21st century. When do a computer scientist and a gardener use "tree, root, prune" in text? What does "tree" mean next to "algorithm"? In Europe, "football" appears a lot in a soccer report; what about in the US? Text is generated according to the context!

Slide 22: Simple Contextual Topic Model. The same two topics shift with a time context. Context 1 (2004): Topic 1 "Apple iPod" favors ipod, mini, 4gb; Topic 2 "Harry Potter" favors harry, prisoner, azkaban. Context 2 (2007): Topic 1 favors ipod, iphone, nano; Topic 2 favors potter, order, phoenix. Example document: "I downloaded the music of the movie harry potter to my iphone".

Slide 23: Contextual Topic Patterns. Compare contextualized versions of topics: contextual topic patterns are conditional distributions (z: topic; c: context; w: word). p(z|c) gives the strength of topics in a context; p(w|z,c) gives the content variation of topics across contexts.
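Both conditional distributions fall straight out of co-occurrence counts of (context, topic, word). A sketch; the counts below are illustrative assumptions, not data from the talk.

```python
from collections import defaultdict

# Toy counts n(context, topic, word).
counts = {
    ("2004", "ipod", "mini"): 8, ("2004", "ipod", "4gb"): 4,
    ("2007", "ipod", "iphone"): 9, ("2007", "ipod", "nano"): 3,
    ("2004", "harry", "azkaban"): 5, ("2007", "harry", "phoenix"): 6,
}

def p_topic_given_context(c):
    """Strength of each topic z in context c: p(z|c)."""
    totals = defaultdict(float)
    for (ci, z, w), n in counts.items():
        if ci == c:
            totals[z] += n
    s = sum(totals.values())
    return {z: n / s for z, n in totals.items()}

def p_word_given_topic_context(z, c):
    """Content variation of topic z in context c: p(w|z,c)."""
    totals = {w: n for (ci, zi, w), n in counts.items() if ci == c and zi == z}
    s = sum(totals.values())
    return {w: n / s for w, n in totals.items()}

pz_2004 = p_topic_given_context("2004")                    # topic strength in 2004
pw_ipod_2007 = p_word_given_topic_context("ipod", "2007")  # iphone dominates in 2007
```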

Slide 24: Example: Topic Life Cycles (Mei and Zhai, KDD'05). Context = time; compare the strength of each topic over time. [Figure: topic life-cycle curves.]

Slide 25: Example: Spatiotemporal Theme Patterns (Mei et al., WWW'06). Context = time & location; theme: government response to Hurricane Katrina. Week 1: the theme is strongest along the Gulf of Mexico. Week 2: the discussion moves toward the north and west. Week 3: the theme distributes more uniformly over the states. Week 4: the theme is again strong along the east coast and the Gulf of Mexico. Week 5: the theme fades out in most states.

Slide 26: Example: Evolutionary Topic Graph (Mei and Zhai, KDD'05). Context = time; KDD literature, 1999-2004. [Figure: topics as word distributions linked over time, e.g. "SVM 0.007, criteria 0.007, classification 0.006, linear 0.005, ..." and "decision 0.006, tree 0.006, classifier 0.005, class 0.005, Bayes 0.005, ..." evolving into "classification 0.015, text 0.013, unlabeled 0.012, document 0.008, labeled 0.008, learning 0.007, ..." and "information 0.012, web 0.010, social 0.008, retrieval 0.007, distance 0.005, networks 0.004, ..."; later nodes: "web 0.009, classification 0.007, features 0.006, topic 0.005, ...", "mixture 0.005, random 0.006, cluster 0.006, clustering 0.005, variables 0.005, ...", "topic 0.010, mixture 0.008, LDA 0.006, semantic 0.005, ...".]

Slide 27: Example: Event Impact Analysis (Mei and Zhai, KDD'06). Context = event; theme: retrieval models in SIGIR papers, and how two events changed it: the start of the TREC conferences (1992) and the 1998 publication of "A language modeling approach to information retrieval". [Figure: the theme's word distribution in several branches, e.g. term 0.1599, relevance 0.0752, weight 0.0660, feedback 0.0372, model 0.0310, probabilistic 0.0188, document 0.0173, ...; vector 0.0514, concept 0.0298, model 0.0291, space 0.0236, boolean 0.0151, function 0.0123, ...; xml 0.0678, email 0.0197, model 0.0191, collect 0.0187, judgment 0.0102, rank 0.0097, ...; probabilist 0.0778, model 0.0432, logic 0.0404, boolean 0.0281, algebra 0.0200, estimate 0.0119, weight 0.0111, ...; model 0.1687, language 0.0753, estimate 0.0520, parameter 0.0281, distribution 0.0268, smooth 0.0198, likelihood 0.0059, ...]

Slide 28: Implicit Context in Text. Some contexts are hidden: sentiments, intents, impact, etc. The document-to-context affiliation is not known for sure and must be inferred from the data. Approach: train a model M for each implicit context, then provide M to the topic model as guidance.

Slide 29: Modeling Implicit Context. Sentiment models guide the topics: Positive (good 0.10, like 0.05, perfect 0.02) and Negative (hate 0.21, awful 0.03, disgust 0.01). Example: "I like the song of the movie on my ipod, perfect, but hate the accent". Topic 1 "Apple iPod" facets: color, size, quality, price, scratch, problem; Topic 2 "Harry Potter" facets: actress, music, visual, director, accent, plot.

Slide 30: Semi-supervised Topic Model (Mei et al., WWW'07). Move from maximum likelihood estimation (MLE) to maximum a posteriori (MAP) estimation by adding Dirichlet priors to the topics θ1 ... θk. The prior encodes guidance from the user (e.g. a positive sentiment model built from "love, great" and a negative one from "hate, awful") and acts like adding pseudo-counts to the observations.
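The "Dirichlet prior = pseudo-counts" equivalence can be sketched numerically; the counts and the guidance distribution below are assumptions for illustration.

```python
def mle(counts):
    """Maximum likelihood estimate of a multinomial from observed counts."""
    total = sum(counts.values())
    return {w: n / total for w, n in counts.items()}

def map_estimate(counts, prior, strength=10.0):
    """MAP estimate with a Dirichlet prior: the prior acts as `strength`
    pseudo-counts distributed according to the user's guidance."""
    vocab = set(counts) | set(prior)
    pseudo = {w: counts.get(w, 0) + strength * prior.get(w, 0.0) for w in vocab}
    total = sum(pseudo.values())
    return {w: n / total for w, n in pseudo.items()}

obs = {"great": 3, "love": 2, "screen": 5}       # observed counts (assumed)
guidance = {"love": 0.5, "great": 0.5}           # user: the positive topic looks like this
pos = map_estimate(obs, guidance)                # "love": (2 + 10*0.5) / (10 + 10) = 0.35
```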

Slide 31: Example: Faceted Opinion Summarization (Mei et al., WWW'07). Context = topic & sentiment. Example sentences about The Da Vinci Code, organized by facet and sentiment:
Topic 1 (Movie). Neutral: "... Ron Howards selection of Tom Hanks to play Robert Langdon."; "Directed by: Ron Howard. Writing credits: Akiva Goldsman ..."; "After watching the movie I went online and some research on ..." Positive: "Tom Hanks stars in the movie, who can be mad at that?"; "Tom Hanks, who is my favorite movie star act the leading role." Negative: "But the movie might get delayed, and even killed off if he loses."; "protesting ... will lose your faith by ... watching the movie."; "... so sick of people making such a big deal about a FICTION book and movie."
Topic 2 (Book). Neutral: "I'm reading 'Da Vinci Code' now. ..."; "Anybody is interested in it?" Positive: "I remembered when i first read the book, I finished the book in two days. Awesome book."; "... So still a good book to past time." Negative: "... so sick of people making such a big deal about a FICTION book and movie."; "This controversy book cause lots conflict in west society."

Slide 32: Results: Sentiment Dynamics. Facet "the book 'The Da Vinci Code'": bursts during the movie release, positive > negative. Facet "the impact on religious beliefs": bursts during the movie release, negative > positive.

Slide 33: Results: Topics with User's Guidance. Topics for iPod, learned with and without a prior. Guidance from the user: "I know two topics should look like this." With no prior, the learned topics are noisy mixtures (labeled "Battery, nano", "Marketing", and "Ads, spam"; top words mix battery, shuffle, charge, nano, dock, itune, usb, hour with apple, microsoft, market, zune, device, company, consumer, sale and free, sign, offer, complete, virus, trial). With the prior, the topics separate cleanly into Nano (nano, color, thin, model, 4gb, ipod, inch, ...) and Battery (battery, shuffle, charge, hold, hour, mini, dock, life, rechargable, ...).

Slide 34: Complex Context in Text. Complex context = structure over contexts. Many contexts have latent structure: time, location, social networks. Why model context structure? To reveal novel contextual patterns; to regularize contextual models; to alleviate data sparseness via smoothing.

Slide 35: Modeling Complex Context. Suppose contexts A and B are closely related. Two intuitions: regularization (Model(A) and Model(B) should be similar) and smoothing (look at B if A doesn't have enough data).

Slide 36: Applications of Contextual Text Mining. Personalized search: personalization with backoff. Social network analysis (for schools): finding topical communities. Information retrieval (for industry labs): smoothing language models.

Slide 37: Application I: Personalized Search.

Slide 38: Personalization with Backoff (Mei and Church, WSDM'08). Ambiguous query "MSG": Madison Square Garden vs. monosodium glutamate. Disambiguate based on the user's prior clicks, but we don't have enough data for everyone, so back off to classes of users. Proof of concept: context = segments defined by IP addresses; other market segmentations (demographics) also apply.

Slide 39: Applying Contextual Text Mining to Personalized Search. Text data: query logs. Generative model: P(URL | query). Context: users (IP addresses). Contextual model: P(URL | query, IP). Structure of context: the hierarchical structure of IP addresses.

Slide 40: Evaluation Metric: Entropy (H). Entropy measures the difficulty of encoding information (a distribution): the size of the search space, the difficulty of a task. H = 20 bits corresponds to 1 million items distributed uniformly. It is a powerful tool for sizing challenges and opportunities: how hard is search? How much does personalization help?
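The "H = 20 bits ≈ one million equally likely items" rule of thumb is easy to verify directly:

```python
import math

def entropy(probs):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# One million items distributed uniformly -> about 20 bits.
h_million = entropy([1 / 1_000_000] * 1_000_000)  # log2(1e6) ~ 19.93
```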

Slide 41: How Hard Is Search? Measured entropies (bits): Query 21.1; URL 22.1; IP 22.1; all but IP (Query, URL) 23.9; all but URL (Query, IP) 26.0; all but Query (URL, IP) 27.1; all three 27.2. Traditional search: H(URL | Query) = 23.9 - 21.1 = 2.8. Personalized search: H(URL | Query, IP) = 27.2 - 26.0 = 1.2. Personalization cuts H roughly in half!
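The slide's arithmetic follows from the chain rule H(X | Y) = H(X, Y) - H(Y); plugging in the table's joint entropies reproduces both numbers.

```python
# Entropies in bits, taken from the slide's table.
H = {"Query": 21.1, "URL": 22.1, "IP": 22.1,
     "Query,URL": 23.9, "Query,IP": 26.0, "URL,IP": 27.1, "Query,URL,IP": 27.2}

def cond_entropy(joint, given):
    """Chain rule: H(X | Y) = H(X, Y) - H(Y)."""
    return H[joint] - H[given]

h_traditional  = cond_entropy("Query,URL", "Query")        # H(URL | Query)     = 2.8
h_personalized = cond_entropy("Query,URL,IP", "Query,IP")  # H(URL | Query, IP) = 1.2
```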

Slide 42: Context = First k Bytes of the IP. Example hierarchy: 156.111.188.243, 156.111.188.*, 156.111.*.*, 156.*.*.*, *.*.*.*. Full personalization (every context has its own model) suffers from sparse data; no personalization (all contexts share one model) misses the opportunity; personalization with backoff lets similar contexts have similar models.

Slide 43: Backing Off by IP. λ4: weight for the first 4 bytes of the IP; λ3: for the first 3 bytes; λ2: for the first 2 bytes; and so on. The λs are estimated with EM. The estimated weights say that a little bit of personalization is better than too much (sparse data) or too little (missed opportunity).
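Backing off by IP amounts to interpolating P(url | query, prefix) models at every prefix level with λ weights. A sketch; the example models, the `.example` URLs, and the λ values are illustrative assumptions (the talk estimates the λs with EM).

```python
def ip_prefixes(ip):
    """All backoff levels for an IP, from the full address down to '*'."""
    parts = ip.split(".")
    return [".".join(parts[:k]) if k else "*" for k in range(len(parts), -1, -1)]

def backoff_prob(url, query, ip, models, lambdas):
    """P(url | query, ip) as a lambda-weighted mix of per-prefix models.
    models[prefix] maps (query, url) -> probability; missing prefixes contribute 0."""
    return sum(lam * models.get(pre, {}).get((query, url), 0.0)
               for lam, pre in zip(lambdas, ip_prefixes(ip)))

models = {
    "156.111.188.243": {("msg", "madison-square-garden.example"): 1.0},
    "*": {("msg", "madison-square-garden.example"): 0.3,
          ("msg", "monosodium-glutamate.example"): 0.7},
}
lambdas = [0.1, 0.1, 0.2, 0.2, 0.4]  # illustrative; estimated with EM in the paper
p = backoff_prob("madison-square-garden.example", "msg", "156.111.188.243",
                 models, lambdas)    # 0.1 * 1.0 + 0.4 * 0.3 = 0.22
```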

Slide 44: Context = Market Segmentation. The traditional goal of marketing: segment customers (e.g., business vs. consumer) by need and value proposition. Need: segments ask different questions at different times. Value: different advertising opportunities. Segmentation variables: queries, URL clicks, IP addresses; geography and demographics (age, gender, income); time of day and day of week.

Slide 45: Business Days vs. Weekends: More Clicks and Easier Queries. [Figure: clicks and query entropy by day of week; business days show more clicks and easier (lower-entropy) queries.]

Slide 46: Harder Queries at TV Time. [Figure: query entropy by time of day; queries get harder (higher-entropy) during TV hours.]

Slide 47: Application II: Information Retrieval.

Slide 48: Application: Text Retrieval. A document d (a text mining paper) yields a document language model θd: p(w|θd), e.g. text 4/100 = 0.04, mining 3/100 = 0.03, clustering 1/100 = 0.01, data = 0, computing = 0. A query q ("data mining") yields a query language model θq: p(w|θq), e.g. data 1/2 = 0.5, mining 1/2 = 0.5 (or smoothed: data 0.4, mining 0.4, clustering 0.1). Retrieval scores documents with a similarity function between θq and a smoothed document LM θd': p(w|θd'), e.g. text 0.039, mining 0.028, clustering 0.01, data 0.001, computing 0.0005.

Slide 49: Smoothing a Document Language Model. Retrieval performance depends on estimating the LM, which depends on smoothing it. Smoothing does two things: it assigns nonzero probability to unseen words (data and computing move from 0 to 0.001 and 0.0005), and it estimates a more accurate distribution from sparse data (text 0.04 to 0.039 to 0.038; mining 0.03 to 0.028 to 0.026; Assoc. 0.01 to 0.009 to 0.008; ...).
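A minimal version of the interpolation behind these numbers is Jelinek-Mercer smoothing with λ = 0.1; the collection probabilities below are assumptions, chosen so that the unseen-word values match the slide's.

```python
def smooth_jm(doc_counts, doc_len, coll_probs, lam=0.1):
    """Jelinek-Mercer smoothing: p(w|d') = (1-lam) * p_mle(w|d) + lam * p(w|C).
    Unseen words (p_mle = 0) get nonzero probability from the collection."""
    vocab = set(doc_counts) | set(coll_probs)
    return {w: (1 - lam) * doc_counts.get(w, 0) / doc_len
               + lam * coll_probs.get(w, 0.0) for w in vocab}

doc  = {"text": 4, "mining": 3, "clustering": 1}   # counts in a 100-word document
coll = {"text": 0.01, "mining": 0.005, "clustering": 0.001,
        "data": 0.01, "computing": 0.005}          # assumed collection model
p_d = smooth_jm(doc, 100, coll)
# "data": 0.1 * 0.01 = 0.001; "computing": 0.1 * 0.005 = 0.0005 (as on the slide)
```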

Slide 50: Applying Contextual Text Mining to Smoothing Language Models. Text data: a collection of documents. Generative model: P(word). Context: the document. Contextual model: P(w|d). Structure of context: a graph of documents. Goal: use the document graph to estimate a good P(w|d).

Slide 51: Traditional Document Smoothing in Information Retrieval. Estimate a reference language model θref and interpolate the MLE with it, where the reference is the whole collection [Ponte & Croft 98], the document's cluster [Liu & Croft 04], or its nearest neighbors [Kurland & Lee 04].

Slide 52: Graph-based Smoothing for Language Models in Retrieval (Mei et al., SIGIR 2008). A novel and general view of smoothing: treat the collection as a graph of documents (it can also be a word graph). P(w|d) is then a surface on top of the graph, with the MLE as its projection on a plane, and a smoothed LM is simply a smoothed surface.

Slide 53: The General Objective of Smoothing. Trade off fidelity to the MLE against smoothness of the surface, weighting by the importance of vertices and the weights of edges (1/distance).
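The objective itself did not survive the transcript; from the ingredients it names (fidelity, smoothness, vertex importance w(u), edge weights w(u,v)), it plausibly has the standard quadratic form below, where f_u is the smoothed value at document u and \hat{f}_u its MLE. This is a hedged reconstruction, not a quote from the paper.

```latex
O(\mathbf{f}) \;=\; (1-\lambda)\underbrace{\sum_{u \in V} w(u)\,\bigl(f_u - \hat{f}_u\bigr)^2}_{\text{fidelity to the MLE}}
\;+\; \lambda\underbrace{\sum_{\langle u,v\rangle \in E} w(u,v)\,\bigl(f_u - f_v\bigr)^2}_{\text{smoothness of the surface}}
```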

Slide 54: Smoothing Language Models using a Document Graph. Construct a kNN graph of documents with vertex weight w(u) = Deg(u) and edge weight w(u,v) = cosine similarity; take f_u = p(w|d_u) as the document language model, smooth it over the graph, and apply additional Dirichlet smoothing on top.
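One simple way to realize this smoothing is fixed-point iteration: pull each document's value toward the weighted average of its neighbors while keeping it anchored to its MLE. A sketch; the update rule, the anchor weight alpha, and the toy graph are assumptions, not the paper's exact algorithm.

```python
def smooth_on_graph(f_hat, edges, alpha=0.5, iters=10):
    """Smooth per-document values f_u = p(w|d_u) over a document graph:
    each document moves toward the weighted average of its neighbors
    while staying tied to its own MLE (the fidelity term)."""
    f = dict(f_hat)
    nbrs = {}
    for u, v, w in edges:
        nbrs.setdefault(u, []).append((v, w))
        nbrs.setdefault(v, []).append((u, w))
    for _ in range(iters):
        new = {}
        for u in f:
            ws = nbrs.get(u, [])
            if ws:
                avg = sum(w * f[v] for v, w in ws) / sum(w for _, w in ws)
                new[u] = (1 - alpha) * f_hat[u] + alpha * avg
            else:
                new[u] = f_hat[u]
        f = new
    return f

# p(w|d) for one word across three documents; d1 never contains the word (MLE 0),
# but its neighbors do, so smoothing assigns it nonzero probability.
f_hat = {"d1": 0.0, "d2": 0.04, "d3": 0.03}
edges = [("d1", "d2", 0.8), ("d1", "d3", 0.5), ("d2", "d3", 0.9)]
f = smooth_on_graph(f_hat, edges)
```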

Slide 55: Effectiveness of the Framework. Mean average precision by data set and method:
- AP88-90: Dirichlet 0.217; DMDG 0.254*** (+17.1%); DMWG† 0.252*** (+16.1%); DSDG 0.239*** (+10.1%); QMWG 0.239 (+10.1%)
- LA: Dirichlet 0.247; DMDG 0.258** (+4.5%); DMWG† 0.257** (+4.5%); DSDG 0.251** (+1.6%); QMWG 0.247
- SJMN: Dirichlet 0.204; DMDG 0.231*** (+13.2%); DMWG† 0.229*** (+12.3%); DSDG 0.225*** (+10.3%); QMWG 0.219 (+7.4%)
- TREC8: Dirichlet 0.257; DMDG 0.271*** (+5.4%); DMWG† 0.271** (+5.4%); DSDG 0.261 (+1.6%); QMWG 0.260 (+1.2%)
† DMWG reranks the top 3000 results, which usually yields lower performance than ranking all documents. Wilcoxon test: *, **, *** denote significance levels 0.1, 0.05, 0.01. Conclusions: graph-based smoothing beats the baseline smoothing; smoothing the document LM beats smoothing relevance scores, which beats smoothing the query LM.

Slide 56: Intuitive Interpretation: Smoothing using a Document Graph. Writing a word w in a document corresponds to a random walk on the document Markov chain: write down w on reaching the absorbing "1" state. The smoothed probability is the absorption probability to the "1" state; each document acts as its neighbors do.
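The absorption probability has exactly the harmonic property the slide alludes to ("act as neighbors do"): at every non-absorbing vertex it equals the weighted average of its neighbors' values. A sketch on a tiny chain; the graph and its weights are assumptions.

```python
def absorption_prob(nbrs, boundary, iters=200):
    """Probability of being absorbed at the '1' state for each vertex, computed
    by iterating the harmonic property: every interior vertex takes the
    weighted average of its neighbors' values."""
    p = {u: 0.0 for u in nbrs}
    p.update(boundary)  # absorbing states: 1.0 at the '1' state, 0.0 at the '0' state
    for _ in range(iters):
        for u in nbrs:
            if u not in boundary:
                total = sum(w for _, w in nbrs[u])
                p[u] = sum(w * p[v] for v, w in nbrs[u]) / total
    return p

# Chain: one -- d1 -- d2 -- zero, all edge weights 1 (assumed).
nbrs = {"d1": [("one", 1.0), ("d2", 1.0)],
        "d2": [("d1", 1.0), ("zero", 1.0)],
        "one": [], "zero": []}
p = absorption_prob(nbrs, {"one": 1.0, "zero": 0.0})
# d1, closer to the '1' state, absorbs there with probability 2/3; d2 with 1/3.
```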

Slide 57: Application III: Social Network Analysis.

Slide 58: Topical Community Analysis. Communities in a network come with topics (e.g., "physicist, physics, scientist, theory, gravitation ..." vs. "writer, novel, best-sell, book, language, film ..."). For the computer science literature, is the right partition Information Retrieval + Data Mining + Machine Learning + ... (by domain), or Review + Algorithm + Evaluation + ... (by role)? Topic modeling can help community extraction, and network analysis can help topic extraction.

Slide 59: Applying Contextual Text Mining to Topical Community Analysis. Text data: publications of researchers. Generative model: a topic model. Context: the author. Contextual model: an author-topic model. Structure of context: a social network, the coauthor network of researchers.

Slide 60: Intuitions. People working on the same topic belong to the same "topical community". A good community combines a coherent topic with a well-connected subgraph. A topic is semantically coherent if people working on it also collaborate a lot. Intuition: my topics are similar to my neighbors' topics (is an author surrounded by IR people more likely to be an IR person or a compiler person?).

Slide 61: Social Network Context for Topic Modeling. Context = author; coauthors are similar contexts (e.g., in the coauthor network). Intuition: I work on topics similar to my neighbors', so the topic distributions p(θj|author) are smoothed over the network.

Slide 62: Topic Modeling with Network Regularization (NetPLSA). Basic assumption (e.g., on the coauthor graph): related authors work on similar topics. The objective combines the PLSA likelihood with a graph harmonic regularizer (a generalization of [Zhu '03]) built from the importance (weight) of each edge and the difference of the topic distributions on neighboring vertices, with a tradeoff parameter between topic fit and smoothness of each document's topic distribution.
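The regularized objective is garbled in this transcript; from the listed ingredients (PLSA log-likelihood, graph harmonic regularizer, edge weights w(u,v), differences of topic distributions, tradeoff λ), it plausibly reads as below, to be maximized. A hedged reconstruction, not a verbatim quote.

```latex
O(C, G) \;=\; (1-\lambda)\underbrace{\sum_{d}\sum_{w} c(w,d)\,\log\!\sum_{j=1}^{k} p(\theta_j \mid d)\, p(w \mid \theta_j)}_{\text{PLSA log-likelihood}}
\;-\; \lambda\,\underbrace{\frac{1}{2}\sum_{\langle u,v\rangle \in E} w(u,v) \sum_{j=1}^{k} \bigl(p(\theta_j \mid u) - p(\theta_j \mid v)\bigr)^2}_{\text{graph harmonic regularizer}}
```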

Slide 63: Topics & Communities without Regularization. The community assignment is noisy; the four topics are hard to interpret:
- Topic 1: term 0.02, question 0.02, protein 0.01, training 0.01, weighting 0.01, multiple 0.01, recognition 0.01, relations 0.01, library 0.01
- Topic 2: peer 0.02, patterns 0.01, mining 0.01, clusters 0.01, stream 0.01, frequent 0.01, e 0.01, page 0.01, gene 0.01
- Topic 3: visual 0.02, analog 0.02, neurons 0.02, vlsi 0.01, motion 0.01, chip 0.01, natural 0.01, cortex 0.01, spike 0.01
- Topic 4: interface 0.02, towards 0.02, browsing 0.02, xml 0.01, generation 0.01, design 0.01, engine 0.01, service 0.01, social 0.01

Slide 64: Topics & Communities with Regularization. The community assignment is coherent:
- Topic 1 (Information Retrieval): retrieval 0.13, information 0.05, document 0.03, query 0.03, text 0.03, search 0.03, evaluation 0.02, user 0.02, relevance 0.02
- Topic 2 (Data Mining): mining 0.11, data 0.06, discovery 0.03, databases 0.02, rules 0.02, association 0.02, patterns 0.02, frequent 0.01, streams 0.01
- Topic 3 (Machine Learning): neural 0.06, learning 0.02, networks 0.02, recognition 0.02, analog 0.01, vlsi 0.01, neurons 0.01, gaussian 0.01, network 0.01
- Topic 4 (Web): web 0.05, services 0.03, semantic 0.03, services 0.03, peer 0.02, ontologies 0.02, rdf 0.02, management 0.01, ontology 0.01

Slide 65: Topic Modeling and SNA Improve Each Other. Cut edge weights, ratio cut / normalized cut (the smaller the better), and community sizes:
- PLSA: cut edge weights 4831; ratio/norm. cut 2.14 / 1.25; community sizes 2280, 2178, 2326, 2257
- NetPLSA: cut edge weights 662; ratio/norm. cut 0.29 / 0.13; community sizes 2636, 1989, 3069, 1347
- NCut: cut edge weights 855; ratio/norm. cut 0.23 / 0.12; community sizes 2699, 6323, 8, 11
NCut is spectral clustering with normalized cut (J. Shi et al. 2000), a purely network-based community finding method. Network regularization helps extract coherent communities (the network keeps topics focused); topic modeling helps balance communities (text implicitly bridges authors).

Slide 66: Smoothed Topic Map. Map a topic onto the network (e.g., using p(θ|a)). [Figure: the "information retrieval" topic mapped on the coauthor network under PLSA vs. NetPLSA; NetPLSA cleanly separates core contributors, intermediate authors, and irrelevant authors.]

Slide 67: Summary of the Talk. Text + Context = Contextual Text Mining, a new paradigm of text mining. A novel framework for it: probabilistic topic models, contextualized by simple context, implicit context, and complex context. Plus applications of contextual text mining.

Slide 68: A Roadmap of My Work. [Figure: papers (KDD 05, KDD 06a, KDD 06b, WWW 06, WWW 07, WWW 08, SIGIR 07, SIGIR 08, KDD 07, WSDM 08, CIKM 08, ACL 08) spanning contextual topic models, contextual text mining, and information retrieval & web search.]

Slide 69: Research Discipline. Text information management and text mining sit at the intersection of information retrieval, data mining, natural language processing, databases, bioinformatics, machine learning, applied statistics, social networks, and information science.

Slide 70: End Note. [Figure.]

Slide 71: Thank You.

