Analyzing unstructured text with topic models Mark Steyvers Dep. of Cognitive Sciences & Dep. of Computer Science University of California, Irvine collaborators:


1 Analyzing unstructured text with topic models
Mark Steyvers, Dep. of Cognitive Sciences & Dep. of Computer Science, University of California, Irvine
Collaborators: Padhraic Smyth, UC Irvine; Tom Griffiths, UC Berkeley

2 Analyzing Unstructured Text
- NYT: 330,000 articles
- Enron: 250,000 emails
- Medline: 16 million articles
- NSF/NIH: 100,000 grants
- AOL queries: 20,000,000 queries, 650,000 users
- Pennsylvania Gazette (1728-1800): 80,000 articles

3 Topic Models and Text Analysis
Can answer a number of questions:
- What is in this corpus?
- What is in this document, paragraph, or sentence?
- What does this person/group of people write about?
- What tags are appropriate for this document?
- What are the topical trends over time?

4 Topic Models
Automatic and unsupervised extraction of semantic themes from large text collections.
Widely used models in machine learning and text mining:
- pLSI model: Hofmann (1999)
- LDA model: Blei, Ng, and Jordan (2001, 2003)
- LDA with Gibbs sampling: Griffiths and Steyvers (2003, 2004)

5 Basic Assumptions
- Each topic is a distribution over words
- Each document is a mixture of topics
- Each word in a document originates from a single topic

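6 Model
$P(\text{words} \mid \text{document}) = \sum_{\text{topic}} P(\text{words} \mid \text{topic})\, P(\text{topic} \mid \text{document})$
- $P(\text{words} \mid \text{topic})$: topic = probability distribution over words
- $P(\text{topic} \mid \text{document})$: topic weights for each document
- Automatically learned from text corpus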
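The mixture on this slide can be sketched in a few lines of Python. The two topics, their word probabilities, and the document's topic weights below are made-up illustration values, not numbers from the talk.

```python
# Hypothetical topics: each one is a probability distribution over words.
topics = {
    "finance": {"money": 0.4, "loan": 0.3, "bank": 0.3},
    "nature": {"river": 0.4, "stream": 0.3, "bank": 0.3},
}

# Hypothetical topic weights for one document, i.e. P(topic | document).
doc_weights = {"finance": 0.6, "nature": 0.4}


def word_prob(word, topics, doc_weights):
    """P(word | document) = sum over topics of P(word | topic) * P(topic | document)."""
    return sum(
        topics[t].get(word, 0.0) * weight for t, weight in doc_weights.items()
    )


vocab = sorted({w for dist in topics.values() for w in dist})
probs = {w: word_prob(w, topics, doc_weights) for w in vocab}
for w in vocab:
    print(f"P({w} | document) = {probs[w]:.2f}")
```

Because "bank" has probability 0.3 under both hypothetical topics, its probability in the document stays 0.3 whatever the topic weights are; the other words scale with the weight of the one topic that contains them.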

7 Toy Example
Topic weights shown in the figure: .4, 1.0, .6, 1.0
Documents and topic assignments (the number after each word is the topic it was generated from):
- Document 1: MONEY 1 BANK 1 BANK 1 LOAN 1 BANK 1 MONEY 1 BANK 1 MONEY 1 BANK 1 LOAN 1 LOAN 1 BANK 1 MONEY 1 ....
- Document 2: RIVER 2 MONEY 1 BANK 2 STREAM 2 BANK 2 BANK 1 MONEY 1 RIVER 2 MONEY 1 BANK 2 LOAN 1 MONEY 1 ....
- Document 3: RIVER 2 BANK 2 STREAM 2 BANK 2 RIVER 2 BANK 2 ....
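A minimal sketch of the generative process behind the toy example, in Python. The uniform word probabilities inside each topic and the 0.6/0.4 split used for the mixed document are assumptions chosen for illustration; the slide does not state which weight belongs to which topic.

```python
import random

# Two toy topics over the example vocabulary (assumed uniform probabilities).
TOPICS = {
    1: {"MONEY": 1 / 3, "LOAN": 1 / 3, "BANK": 1 / 3},
    2: {"RIVER": 1 / 3, "STREAM": 1 / 3, "BANK": 1 / 3},
}


def generate_document(topic_weights, length, rng):
    """Sample (word, topic) pairs: pick a topic for each word, then a word from it."""
    topic_ids = list(topic_weights)
    doc = []
    for _ in range(length):
        z = rng.choices(topic_ids, weights=[topic_weights[t] for t in topic_ids])[0]
        words = list(TOPICS[z])
        w = rng.choices(words, weights=[TOPICS[z][x] for x in words])[0]
        doc.append((w, z))
    return doc


rng = random.Random(0)
doc = generate_document({1: 0.6, 2: 0.4}, length=12, rng=rng)
print(" ".join(f"{w} {z}" for w, z in doc))
```

Note that BANK can come from either topic, which is what makes the topic assignment of an observed BANK token ambiguous until the rest of the document is taken into account.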

8 Statistical Inference
Topics, topic weights, and topic assignments are all unknown (?) and must be inferred from the documents:
- Document 1: MONEY ? BANK ? BANK ? LOAN ? BANK ? MONEY ? BANK ? MONEY ? BANK ? LOAN ? LOAN ? BANK ? MONEY ? ....
- Document 2: RIVER ? MONEY ? BANK ? STREAM ? BANK ? BANK ? MONEY ? RIVER ? MONEY ? BANK ? LOAN ? MONEY ? ....
- Document 3: RIVER ? BANK ? STREAM ? BANK ? RIVER ? BANK ? ....

9 Statistical Inference
- Exact inference is intractable
- Markov chain Monte Carlo (MCMC) with Gibbs sampling
- Scalable to large document collections (e.g. all of Wikipedia)
- Parallelizable
- Form of dimensionality reduction
  - Number of topics T = 50…2000
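As a rough illustration of Gibbs sampling for a topic model, here is a small collapsed Gibbs sampler for LDA in plain Python. It is a sketch under stated assumptions, not the implementation used in the talk: the function name `gibbs_lda`, the hyperparameter values `alpha` and `beta`, the iteration count, and the three toy documents are all chosen for illustration.

```python
import random


def gibbs_lda(docs, num_topics, alpha=0.5, beta=0.1, iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA.

    Returns the per-token topic assignments plus the document-topic
    and topic-word count tables.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    n_dt = [[0] * num_topics for _ in docs]   # tokens per (document, topic)
    n_tw = [{} for _ in range(num_topics)]    # tokens per (topic, word)
    n_t = [0] * num_topics                    # tokens per topic
    # Start from a random assignment of every token to a topic.
    z = [[rng.randrange(num_topics) for _ in doc] for doc in docs]
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = z[d][i]
            n_dt[d][t] += 1
            n_tw[t][w] = n_tw[t].get(w, 0) + 1
            n_t[t] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current token from all counts.
                t = z[d][i]
                n_dt[d][t] -= 1
                n_tw[t][w] -= 1
                n_t[t] -= 1
                # Conditional probability of each topic, up to a constant.
                weights = [
                    (n_dt[d][k] + alpha)
                    * (n_tw[k].get(w, 0) + beta)
                    / (n_t[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                t = rng.choices(range(num_topics), weights=weights)[0]
                # Add the token back under its newly sampled topic.
                z[d][i] = t
                n_dt[d][t] += 1
                n_tw[t][w] = n_tw[t].get(w, 0) + 1
                n_t[t] += 1
    return z, n_dt, n_tw


docs = [
    "money bank bank loan bank money bank money bank loan".split(),
    "river money bank stream bank bank money river money bank loan".split(),
    "river bank stream bank river bank".split(),
]
z, n_dt, n_tw = gibbs_lda(docs, num_topics=2)
for d, counts in enumerate(n_dt):
    print(f"document {d}: tokens per topic = {counts}")
```

Each token's update reads only a handful of counts, which is one reason this style of sampler is practical on large collections; topic indices are arbitrary, so which index ends up holding the financial words can differ between seeds.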

10 Example Topics from New York Times
- Stock Market: WEEK DOW_JONES POINTS 10_YR_TREASURY_YIELD PERCENT CLOSE NASDAQ_COMPOSITE STANDARD_POOR CHANGE FRIDAY DOW_INDUSTRIALS GRAPH_TRACKS EXPECTED BILLION NASDAQ_COMPOSITE_INDEX EST_02 PHOTO_YESTERDAY YEN 10 500_STOCK_INDEX
- Wall Street Firms: WALL_STREET ANALYSTS INVESTORS FIRM GOLDMAN_SACHS FIRMS INVESTMENT MERRILL_LYNCH COMPANIES SECURITIES RESEARCH STOCK BUSINESS ANALYST WALL_STREET_FIRMS SALOMON_SMITH_BARNEY CLIENTS INVESTMENT_BANKING INVESTMENT_BANKERS INVESTMENT_BANKS
- Terrorism: SEPT_11 WAR SECURITY IRAQ TERRORISM NATION KILLED AFGHANISTAN ATTACKS OSAMA_BIN_LADEN AMERICAN ATTACK NEW_YORK_REGION NEW MILITARY NEW_YORK WORLD NATIONAL QAEDA TERRORIST_ATTACKS
- Bankruptcy: BANKRUPTCY CREDITORS BANKRUPTCY_PROTECTION ASSETS COMPANY FILED BANKRUPTCY_FILING ENRON BANKRUPTCY_COURT KMART CHAPTER_11 FILING COOPER BILLIONS COMPANIES BANKRUPTCY_PROCEEDINGS DEBTS RESTRUCTURING CASE GROUP

11 Learning multiple meanings of words
- PRINTING PAPER PRINT PRINTED TYPE PROCESS INK PRESS IMAGE PRINTER PRINTS PRINTERS COPY COPIES FORM OFFSET GRAPHIC SURFACE PRODUCED CHARACTERS
- PLAY PLAYS STAGE AUDIENCE THEATER ACTORS DRAMA SHAKESPEARE ACTOR THEATRE PLAYWRIGHT PERFORMANCE DRAMATIC COSTUMES COMEDY TRAGEDY CHARACTERS SCENES OPERA PERFORMED
- TEAM GAME BASKETBALL PLAYERS PLAYER PLAY PLAYING SOCCER PLAYED BALL TEAMS BASKET FOOTBALL SCORE COURT GAMES TRY COACH GYM SHOT
- JUDGE TRIAL COURT CASE JURY ACCUSED GUILTY DEFENDANT JUSTICE EVIDENCE WITNESSES CRIME LAWYER WITNESS ATTORNEY HEARING INNOCENT DEFENSE CHARGE CRIMINAL
- HYPOTHESIS EXPERIMENT SCIENTIFIC OBSERVATIONS SCIENTISTS EXPERIMENTS SCIENTIST EXPERIMENTAL TEST METHOD HYPOTHESES TESTED EVIDENCE BASED OBSERVATION SCIENCE FACTS DATA RESULTS EXPLANATION
- STUDY TEST STUDYING HOMEWORK NEED CLASS MATH TRY TEACHER WRITE PLAN ARITHMETIC ASSIGNMENT PLACE STUDIED CAREFULLY DECIDE IMPORTANT NOTEBOOK REVIEW

12 Demographic Analysis of Search Queries

13 AOL dataset
Dataset:
- 20,000,000+ web queries
- 650,000+ users
Each user was given an “anonymous” user ID
- No demographics in this dataset

14 Example query log from user #2178
ID | Query | Date/Time | URL clicked
2178 | dog eats uncooked pasta | 2006-05-26 15:31:56 |
2178 | inducing dog vomiting | 2006-05-26 15:32:46 | http://www.twodogpress.com
2178 | inducing dog vomiting | 2006-05-26 15:32:46 | http://www.canismajor.com
2178 | inducing dog vomiting | 2006-05-26 15:32:46 | http://kitchen.robbiehaf.com
2178 | inducing dog vomiting | 2006-05-26 15:32:46 | http://www.dog-first-aid-101.com
2178 | inducing dog vomiting | 2006-05-26 15:38:36 |
2178 | walmart | 2006-05-12 12:39:52 | http://www.walmart.com
2178 | sears | 2006-05-12 12:44:22 | http://www.sears.com
2178 | target | 2006-05-12 17:05:36 | http://www.target.com
2178 | babycenter.com | 2006-05-12 17:43:59 | http://www.babycenter.com
2178 | google | 2006-05-16 10:54:39 | http://www.google.com
2178 | fit pregnancy | 2006-05-16 15:34:23 |
2178 | baby center | 2006-05-16 15:37:22 |
2178 | yahoo.com | 2006-05-18 17:11:05 | http://www.yahoo.com
2178 | applebee's carside | 2006-05-19 19:21:08 | http://www.applebees.com
2178 | baby names | 2006-05-20 15:02:38 | http://www.babynames.com
2178 | baby names | 2006-05-20 15:02:38 | http://www.babynamesworld.com
2178 | baby names | 2006-05-20 15:02:38 | http://www.thinkbabynames.com
2178 | mortgage calculator | 2006-05-24 14:39:05 | http://www.bankrate.com
2178 | us zip codes | 2006-05-25 21:26:47 | http://www.usps.com

15 Another Query Database…
Not publicly available
Dataset:
- 250,000+ users
- 411,000+ queries
Age and gender of users are known:
- Age brackets: 0-12, 13-17, 18-20, 21-24, 25-29, 30-34, 35-44, 45-54, 55-64, 65+

16 Topic modeling of queries
- Each user searches for a mixture of topics
- Each topic is a probability distribution over query words

17 Four example topics (out of 200)
Each topic is a probability distribution over words. Most likely words are listed first:
- auto car parts cars used ford honda truck toyota
- webmd cymbalta xanax gout vicodin effexor prednisone lexapro ambien
- party store wedding birthday jewelry ideas cards cake gifts
- hannah montana zac efron disney high school musical miley cyrus hilary duff

18 User = mixture of topics
Topics:
- auto car parts cars used ford honda truck toyota
- hannah montana zac efron disney high school musical miley cyrus hilary duff
- webmd cymbalta xanax gout vicodin effexor prednisone lexapro ambien
- party store wedding birthday jewelry ideas cards cake gifts
Users:
- User #7654: 80%, 20%
- User #246: 100%

19 Topic Analysis
- Find likely topics for each demographic bucket
- Find likely demographics given topics
- What’s on the mind of people in different age groups?
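Both directions of this analysis can be sketched from per-user topic weights. The Python below is only one plausible way to do it, assuming every user counts equally; the users, brackets, and weights are invented for illustration, and the function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical users: (age bracket, topic weights). All values are made up.
users = [
    ("13-17", {"myspace": 0.7, "recipes": 0.3}),
    ("13-17", {"myspace": 0.9, "recipes": 0.1}),
    ("35-44", {"myspace": 0.2, "recipes": 0.8}),
    ("35-44", {"myspace": 0.0, "recipes": 1.0}),
]


def topic_given_bracket(users):
    """P(topic | bracket): average topic weights over the users in each bracket."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for bracket, weights in users:
        counts[bracket] += 1
        for topic, w in weights.items():
            sums[bracket][topic] += w
    return {
        b: {t: s / counts[b] for t, s in topic_sums.items()}
        for b, topic_sums in sums.items()
    }


def bracket_given_topic(users, topic):
    """P(bracket | topic) by Bayes' rule, weighting every user equally.

    Assumes at least one user has non-zero weight on the topic.
    """
    mass = defaultdict(float)
    for bracket, weights in users:
        mass[bracket] += weights.get(topic, 0.0)
    total = sum(mass.values())
    return {b: m / total for b, m in mass.items()}


by_bracket = topic_given_bracket(users)
by_topic = bracket_given_topic(users, "myspace")
print(by_bracket)
print(by_topic)
```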

20 “poems” topic

21 “myspace” topic

22 “sports” topic

23 “MTV” topic

24 “Clothing Stores” topic

25 “Hairstyles” topic

26 “recipes” topic

27 Results
Topic models give quick summaries of demographic trends in query datasets
Other potential applications:
- e.g. blogs, social networking sites, email, etc.
- clinical data, e.g. therapy discussions

28 Analyzing Emails: who writes on what topics?

29 Enron email data
- 250,000 emails
- 5000 authors
- 1999-2002

30 Author-topic models
- We can learn the association between authors of documents and topics
- Assume each author works on a mixture of topics

31 ENRON Email: who writes on certain topics?... But also over senders (authors) of email. Most likely authors listed at the top

32 Enron email: two example topics (T=100)

33 Detecting Papers on Unusual Topics for Authors
- We can calculate perplexity (unusualness) for words in a document given an author
- Papers ranked by perplexity for M. Jordan:
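Perplexity here can be read as the exponential of the negative average log-probability of a document's words under an author's topic mixture. The Python sketch below assumes that standard definition; the topics, word probabilities, and author weights are hypothetical and not taken from the talk.

```python
import math

# Hypothetical word distributions for two topics (each sums to 1).
topics = {
    "graphical_models": {"inference": 0.5, "bayesian": 0.4, "protein": 0.1},
    "biology": {"inference": 0.1, "bayesian": 0.1, "protein": 0.8},
}
# Hypothetical topic weights for one author.
author_weights = {"graphical_models": 0.9, "biology": 0.1}


def word_prob(word, author_weights, topics):
    """P(word | author) as a mixture over the author's topics."""
    return sum(w * topics[t].get(word, 0.0) for t, w in author_weights.items())


def perplexity(words, author_weights, topics):
    """exp of the negative mean log-probability of the words given the author."""
    log_prob = sum(math.log(word_prob(w, author_weights, topics)) for w in words)
    return math.exp(-log_prob / len(words))


typical = ["inference", "bayesian", "inference", "bayesian"]
unusual = ["protein", "protein", "protein", "inference"]
print(perplexity(typical, author_weights, topics))
print(perplexity(unusual, author_weights, topics))
```

The second document leans on words the author's dominant topic rarely produces, so it scores a higher perplexity; ranking an author's papers by this number surfaces the ones that are unusual for that author.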

34 Author Separation
Can the model attribute words to the correct authors within a document?

35 Application: Faculty Browser

36 Faculty Browser
- Automatically analyzes computer science papers by UC San Diego and UC Irvine researchers
- Finds topically related researchers

37 One topic: most prolific researchers in this topic

38 One researcher: topics this researcher is interested in; other researchers with similar topical interests

39 Inferred network of researchers connected through topics

40 Modeling Extensions

41 Entity-topic modeling
- 330,000 articles, 2000-2002
- Who is mentioned in what context?

42 Extracted Named Entities
Example text: Three investigations began Thursday into the securities and exchange_commission's choice of william_webster to head a new board overseeing the accounting profession. house and senate_democrats called for the resignations of both judge_webster and harvey_pitt, the commission's chairman. The white_house expressed support for judge_webster as well as for harvey_pitt, who was harshly criticized Thursday for failing to inform other commissioners before they approved the choice of judge_webster that he had led the audit committee of a company facing fraud accusations. “The president still has confidence in harvey_pitt,” said dan_bartlett, bush's communications director …
Used standard algorithms to extract named entities:
- People
- Places
- Organizations

43 Standard Topic Model with Entities

44

45 Example of Extracted Entity-Topic Network

46 Topic Trends
[Figure: trends over time for three example topics: Tour-de-France, Anthrax, Quarterly Earnings. Vertical axis: proportion of words assigned to topic for that time slice]
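The quantity plotted here, the proportion of words assigned to a topic within a time slice, is simple to compute once every word token carries a topic assignment. A minimal Python sketch, with invented time slices and assignments standing in for real model output:

```python
from collections import Counter, defaultdict

# Hypothetical token-level model output: one (time slice, topic) pair per word.
assignments = [
    ("2001-07", "tour_de_france"), ("2001-07", "tour_de_france"),
    ("2001-07", "quarterly_earnings"), ("2001-07", "other"),
    ("2001-10", "anthrax"), ("2001-10", "anthrax"),
    ("2001-10", "anthrax"), ("2001-10", "quarterly_earnings"),
]


def topic_trends(assignments):
    """Proportion of words assigned to each topic within each time slice."""
    per_slice = defaultdict(Counter)
    for time_slice, topic in assignments:
        per_slice[time_slice][topic] += 1
    return {
        s: {t: n / sum(counts.values()) for t, n in counts.items()}
        for s, counts in per_slice.items()
    }


trends = topic_trends(assignments)
for s in sorted(trends):
    print(s, trends[s])
```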

47 Learning Topic Hierarchies (example: Psych Review abstracts)
[Figure: topic hierarchy. Node word lists as extracted; node boundaries are not recoverable]
RESPONSE STIMULUS REINFORCEMENT RECOGNITION STIMULI RECALL CHOICE CONDITIONING SPEECH READING WORDS MOVEMENT MOTOR VISUAL WORD SEMANTIC ACTION SOCIAL SELF EXPERIENCE EMOTION GOALS EMOTIONAL THINKING GROUP IQ INTELLIGENCE SOCIAL RATIONAL INDIVIDUAL GROUPS MEMBERS SEX EMOTIONS GENDER EMOTION STRESS WOMEN HEALTH HANDEDNESS REASONING ATTITUDE CONSISTENCY SITUATIONAL INFERENCE JUDGMENT PROBABILITIES STATISTICAL IMAGE COLOR MONOCULAR LIGHTNESS GIBSON SUBMOVEMENT ORIENTATION HOLOGRAPHIC CONDITIONIN STRESS EMOTIONAL BEHAVIORAL FEAR STIMULATION TOLERANCE RESPONSES A MODEL MEMORY FOR MODELS TASK INFORMATION RESULTS ACCOUNT SELF SOCIAL PSYCHOLOGY RESEARCH RISK STRATEGIES INTERPERSONAL PERSONALITY SAMPLING MOTION VISUAL SURFACE BINOCULAR RIVALRY CONTOUR DIRECTION CONTOURS SURFACES DRUG FOOD BRAIN AROUSAL ACTIVATION AFFECTIVE HUNGER EXTINCTION PAIN THE OF AND TO IN A IS

48

49 Hidden Markov Topics Model
- Syntactic dependencies ⇒ short-range dependencies
- Semantic dependencies ⇒ long-range dependencies
- Semantic state: generate words from topic model
- Syntactic states: generate words from HMM
[Figure: graphical model with document-level θ and, for each word position, nodes z, w, and s]
(Griffiths, Steyvers, Blei, & Tenenbaum, 2004)
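The division of labour between the semantic state and the syntactic states can be illustrated with a toy generator in Python. Everything below is an assumption made for illustration: the state names, the transition table, and the tiny word lists are invented, and uniform word choice stands in for learned distributions.

```python
import random

# Hypothetical HMM over word classes; the "SEM" state defers to the topic model.
TRANSITIONS = {
    "START": {"DET": 1.0},
    "DET": {"SEM": 1.0},
    "SEM": {"VERB": 0.6, "END": 0.4},
    "VERB": {"DET": 1.0},
}
SYNTAX_WORDS = {"DET": ["the", "a"], "VERB": ["is", "shows", "uses"]}
TOPIC_WORDS = {
    "vision": ["image", "pixel", "object"],
    "kernels": ["kernel", "svm", "vector"],
}


def sample(dist, rng):
    """Draw one key from a {key: probability} dictionary."""
    keys = list(dist)
    return rng.choices(keys, weights=[dist[k] for k in keys])[0]


def generate_sentence(topic_weights, rng):
    """Walk the HMM; emit class words in syntactic states, topic words in SEM."""
    state, words = "START", []
    while True:
        state = sample(TRANSITIONS[state], rng)
        if state == "END":
            return words
        if state == "SEM":
            topic = sample(topic_weights, rng)  # long-range: depends on the document
            words.append(rng.choice(TOPIC_WORDS[topic]))
        else:
            words.append(rng.choice(SYNTAX_WORDS[state]))  # short-range: depends on state


rng = random.Random(0)
sentence = generate_sentence({"vision": 0.7, "kernels": 0.3}, rng)
print(" ".join(sentence))
```

The transition table only ever looks one state back, while the topic weights are shared by the whole document, which mirrors the short-range versus long-range split described on the slide.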

50 NIPS Semantics and NIPS Syntax
NIPS Semantics:
- EXPERTS EXPERT GATING HME ARCHITECTURE MIXTURE LEARNING MIXTURES FUNCTION GATE
- DATA GAUSSIAN MIXTURE LIKELIHOOD POSTERIOR PRIOR DISTRIBUTION EM BAYESIAN PARAMETERS
- STATE POLICY VALUE FUNCTION ACTION REINFORCEMENT LEARNING CLASSES OPTIMAL *
- MEMBRANE SYNAPTIC CELL * CURRENT DENDRITIC POTENTIAL NEURON CONDUCTANCE CHANNELS
- IMAGE IMAGES OBJECT OBJECTS FEATURE RECOGNITION VIEWS # PIXEL VISUAL
- KERNEL SUPPORT VECTOR SVM KERNELS # SPACE FUNCTION MACHINES SET
- NETWORK NEURAL NETWORKS OUPUT INPUT TRAINING INPUTS WEIGHTS # OUTPUTS
NIPS Syntax:
- MODEL ALGORITHM SYSTEM CASE PROBLEM NETWORK METHOD APPROACH PAPER PROCESS
- IS WAS HAS BECOMES DENOTES BEING REMAINS REPRESENTS EXISTS SEEMS
- SEE SHOW NOTE CONSIDER ASSUME PRESENT NEED PROPOSE DESCRIBE SUGGEST
- USED TRAINED OBTAINED DESCRIBED GIVEN FOUND PRESENTED DEFINED GENERATED SHOWN
- IN WITH FOR ON FROM AT USING INTO OVER WITHIN
- HOWEVER ALSO THEN THUS THEREFORE FIRST HERE NOW HENCE FINALLY
- # * I X T N - C F P

51 Random sentence generation
LANGUAGE:
[S] RESEARCHERS GIVE THE SPEECH
[S] THE SOUND FEEL NO LISTENERS
[S] WHICH WAS TO BE MEANING
[S] HER VOCABULARIES STOPPED WORDS
[S] HE EXPRESSLY WANTED THAT BETTER VOWEL

52 Software Public-domain MATLAB toolbox for topic modeling on the Web: http://psiexp.ss.uci.edu/research/programs_data/toolbox.htm

