

1 Extracting Semantic Representations with Probabilistic Topic Models
Mark Steyvers (UC Irvine), Tom Griffiths (Brown University), Padhraic Smyth and Dave Newman (UC Irvine)

2 Extracting Statistical Regularities from Text
Sources: EMAIL, BOOKS/JOURNALS, NEWSPAPERS
Computer Science/Statistics: information retrieval, text mining, data mining
Psychology: semantic cognition, episodic memory, psycholinguistics
?

3 Overview
I Probabilistic Topic Models
II Computer Science Applications
   Analyzing Scientific Topics: PNAS
   Analyzing NSF and NIH funding
   Analyzing Enron Email
III Theory for semantic cognition
   Word Association
   Free Recall
IV Conclusion

4 Probabilistic Topic Models
Originated in the domain of statistics & machine learning (e.g., Hofmann, 2001; Blei, Ng, & Jordan, 2003)
Extract topics from large collections of text
No use of dictionaries or thesauri
Topic extraction is unsupervised

5 DATA: corpus of text
Topic Model: find parameters that “reconstruct” the data
Model is generative

6 Probabilistic Topic Models
Each document is a probability distribution over topics
Each topic is a probability distribution over words

7 Document generation as a probabilistic process
1. For each document, choose a mixture of topics
2. For every word slot, sample a topic [1..T] from the mixture
3. Sample a word from the topic
Diagram: TOPICS MIXTURE → TOPIC ... TOPIC → WORD ... WORD
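The three generation steps can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `generate_document`, the two toy topics, their uniform word probabilities, and the symmetric Dirichlet parameter `alpha=1.0` are all assumptions made for the example.

```python
import random

def generate_document(topics, alpha, n_words, rng):
    """Generate one document; returns (theta, [(word, topic_index), ...])."""
    T = len(topics)
    # Step 1: choose a mixture of topics. Normalized Gamma(alpha, 1) draws
    # are a sample from a symmetric Dirichlet(alpha).
    gammas = [rng.gammavariate(alpha, 1.0) for _ in range(T)]
    total = sum(gammas)
    theta = [g / total for g in gammas]
    tokens = []
    for _ in range(n_words):
        # Step 2: sample a topic for this word slot from the mixture.
        z = rng.choices(range(T), weights=theta)[0]
        # Step 3: sample a word from that topic's distribution over words.
        words = list(topics[z])
        w = rng.choices(words, weights=[topics[z][v] for v in words])[0]
        tokens.append((w, z))
    return theta, tokens

# Hypothetical topics; the word probabilities are made up for this sketch.
TOPICS = [
    {"money": 1 / 3, "loan": 1 / 3, "bank": 1 / 3},
    {"river": 1 / 3, "stream": 1 / 3, "bank": 1 / 3},
]

rng = random.Random(0)
theta, tokens = generate_document(TOPICS, alpha=1.0, n_words=16, rng=rng)
print([round(p, 2) for p in theta])
# Print each word followed by the (1-based) topic that generated it.
print(" ".join(f"{w} {z + 1}" for w, z in tokens))
```

Note that "bank" can be produced by either topic, which is what lets the model represent both senses of the word.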

8 Example
Mixture components:
TOPIC 1: money, loan, bank
TOPIC 2: river, stream, bank
Mixture weights: .3, .8, .2, .7
(The number after each word is the topic that generated it.)
DOCUMENT 1: money 1 bank 1 bank 1 loan 1 river 2 stream 2 bank 1 money 1 river 2 bank 1 money 1 bank 1 loan 1 money 1 stream 2 bank 1 money 1 bank 1 bank 1 loan 1 river 2 stream 2 bank 1 money 1 river 2 bank 1 money 1 bank 1 loan 1 bank 1 money 1 stream 2
DOCUMENT 2: river 2 stream 2 bank 2 stream 2 bank 2 money 1 loan 1 river 2 stream 2 loan 1 bank 2 river 2 bank 2 bank 1 stream 2 river 2 loan 1 bank 2 stream 2 bank 2 money 1 loan 1 river 2 stream 2 bank 2 stream 2 bank 2 money 1 river 2 stream 2 loan 1 bank 2 river 2 bank 2 money 1 bank 1 stream 2 river 2 bank 2 stream 2 bank 2 money 1
Bayesian approach: use priors
Mixture weights ~ Dirichlet(α)
Mixture components ~ Dirichlet(β)

9 Inverting (“fitting”) the model
Mixture components: TOPIC 1 ?, TOPIC 2 ?
Mixture weights: ?
DOCUMENT 1: money ? bank ? bank ? loan ? river ? stream ? bank ? money ? river ? bank ? money ? bank ? loan ? money ? stream ? bank ? money ? bank ? bank ? loan ? river ? stream ? bank ? money ? river ? bank ? money ? bank ? loan ? bank ? money ? stream ?
DOCUMENT 2: river ? stream ? bank ? stream ? bank ? money ? loan ? river ? stream ? loan ? bank ? river ? bank ? bank ? stream ? river ? loan ? bank ? stream ? bank ? money ? loan ? river ? stream ? bank ? stream ? bank ? money ? river ? stream ? loan ? bank ? river ? bank ? money ? bank ? stream ? river ? bank ? stream ? bank ? money ?

10 Inverting the generative model
Inverting the model involves extracting topics and mixing proportions per document from the corpus
Bayesian inference techniques (MCMC with Gibbs sampling)
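One common way to do this inference is a collapsed Gibbs sampler that repeatedly resamples the topic of each word token given all other assignments. The sketch below is an illustration under assumptions, not the implementation behind these slides: the function name `gibbs_lda`, the hyperparameter values `alpha=0.5` and `beta=0.1`, the iteration count, and the three toy documents are all made up for the example.

```python
import random

def gibbs_lda(docs, T, V, alpha=0.5, beta=0.1, n_iter=100, seed=0):
    """docs: lists of word ids in [0, V). Returns (z, n_dt, n_tw)."""
    rng = random.Random(seed)
    n_dt = [[0] * T for _ in docs]      # topic counts per document
    n_tw = [[0] * V for _ in range(T)]  # word counts per topic
    n_t = [0] * T                       # total tokens per topic
    # Start from a random topic assignment for every token.
    z = [[rng.randrange(T) for _ in doc] for doc in docs]

    def update(d, w, t, delta):
        n_dt[d][t] += delta
        n_tw[t][w] += delta
        n_t[t] += delta

    for d, doc in enumerate(docs):
        for w, t in zip(doc, z[d]):
            update(d, w, t, +1)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token, then resample its topic from
                # P(z = k | all other assignments), known up to a constant.
                update(d, w, z[d][i], -1)
                weights = [
                    (n_dt[d][k] + alpha) * (n_tw[k][w] + beta) / (n_t[k] + V * beta)
                    for k in range(T)
                ]
                z[d][i] = rng.choices(range(T), weights=weights)[0]
                update(d, w, z[d][i], +1)
    return z, n_dt, n_tw

# Toy corpus (made up): word ids index into VOCAB.
VOCAB = ["money", "loan", "bank", "river", "stream"]
docs = [[0, 2, 2, 1, 0, 2, 1], [3, 4, 2, 4, 3, 2, 4], [0, 1, 2, 3, 4, 2, 2]]
z, n_dt, n_tw = gibbs_lda(docs, T=2, V=len(VOCAB))
for t, row in enumerate(n_tw):
    print(f"topic {t + 1}:", {VOCAB[w]: c for w, c in enumerate(row) if c})
```

The count tables `n_tw` and `n_dt` give point estimates of the topics and of the per-document mixing proportions once the sampler has run long enough; a real analysis would use more iterations and average over several samples.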

11 Example: topics from an educational corpus (TASA)
37K docs, 26K words, 1700 topics, e.g.:
PRINTING PAPER PRINT PRINTED TYPE PROCESS INK PRESS IMAGE PRINTER PRINTS PRINTERS COPY COPIES FORM OFFSET GRAPHIC SURFACE PRODUCED CHARACTERS
PLAY PLAYS STAGE AUDIENCE THEATER ACTORS DRAMA SHAKESPEARE ACTOR THEATRE PLAYWRIGHT PERFORMANCE DRAMATIC COSTUMES COMEDY TRAGEDY CHARACTERS SCENES OPERA PERFORMED
TEAM GAME BASKETBALL PLAYERS PLAYER PLAY PLAYING SOCCER PLAYED BALL TEAMS BASKET FOOTBALL SCORE COURT GAMES TRY COACH GYM SHOT
JUDGE TRIAL COURT CASE JURY ACCUSED GUILTY DEFENDANT JUSTICE EVIDENCE WITNESSES CRIME LAWYER WITNESS ATTORNEY HEARING INNOCENT DEFENSE CHARGE CRIMINAL
HYPOTHESIS EXPERIMENT SCIENTIFIC OBSERVATIONS SCIENTISTS EXPERIMENTS SCIENTIST EXPERIMENTAL TEST METHOD HYPOTHESES TESTED EVIDENCE BASED OBSERVATION SCIENCE FACTS DATA RESULTS EXPLANATION
STUDY TEST STUDYING HOMEWORK NEED CLASS MATH TRY TEACHER WRITE PLAN ARITHMETIC ASSIGNMENT PLACE STUDIED CAREFULLY DECIDE IMPORTANT NOTEBOOK REVIEW

12 Polysemy
PRINTING PAPER PRINT PRINTED TYPE PROCESS INK PRESS IMAGE PRINTER PRINTS PRINTERS COPY COPIES FORM OFFSET GRAPHIC SURFACE PRODUCED CHARACTERS
PLAY PLAYS STAGE AUDIENCE THEATER ACTORS DRAMA SHAKESPEARE ACTOR THEATRE PLAYWRIGHT PERFORMANCE DRAMATIC COSTUMES COMEDY TRAGEDY CHARACTERS SCENES OPERA PERFORMED
TEAM GAME BASKETBALL PLAYERS PLAYER PLAY PLAYING SOCCER PLAYED BALL TEAMS BASKET FOOTBALL SCORE COURT GAMES TRY COACH GYM SHOT
JUDGE TRIAL COURT CASE JURY ACCUSED GUILTY DEFENDANT JUSTICE EVIDENCE WITNESSES CRIME LAWYER WITNESS ATTORNEY HEARING INNOCENT DEFENSE CHARGE CRIMINAL
HYPOTHESIS EXPERIMENT SCIENTIFIC OBSERVATIONS SCIENTISTS EXPERIMENTS SCIENTIST EXPERIMENTAL TEST METHOD HYPOTHESES TESTED EVIDENCE BASED OBSERVATION SCIENCE FACTS DATA RESULTS EXPLANATION
STUDY TEST STUDYING HOMEWORK NEED CLASS MATH TRY TEACHER WRITE PLAN ARITHMETIC ASSIGNMENT PLACE STUDIED CAREFULLY DECIDE IMPORTANT NOTEBOOK REVIEW

13 Overview
I Probabilistic Topic Models
II Computer Science Applications
   Analyzing Scientific Topics: PNAS
   Analyzing NSF and NIH funding
   Analyzing Enron Email
III Theory for semantic cognition
   Word Association
   Free Recall
IV Conclusion

14 How do topics change over time?
Analysis of dynamics: perform linear trend analysis for each topic
“Hot topics” go up, “cold topics” go down
Topic 37: CDNA AMINO SEQUENCE ACID PROTEIN ISOLATED ENCODING CLONED ACIDS IDENTITY CLONE EXPRESSED ENCODES RAT HOMOLOGY
Topic 289: KDA PROTEIN PURIFIED MOLECULAR MASS CHROMATOGRA.. POLYPEPTIDE GEL SDS BAND APPARENT LABELED IDENTIFIED FRACTION DETECTED
Topic 75: ANTIBODY ANTIBODIES MONOCLONAL ANTIGEN IGG MAB SPECIFIC EPITOPE HUMAN MABS RECOGNIZED SERA EPITOPES DIRECTED NEUTRALIZING
Topic 2: SPECIES GLOBAL CLIMATE CO2 WATER ENVIRONMENTAL YEARS MARINE CARBON DIVERSITY OCEAN EXTINCTION TERRESTRIAL COMMUNITY ABUNDANCE
Topic 134: MICE DEFICIENT NORMAL GENE NULL MOUSE TYPE HOMOZYGOUS ROLE KNOCKOUT DEVELOPMENT GENERATED LACKING ANIMALS REDUCED
Topic 179: APOPTOSIS DEATH CELL INDUCED BCL CELLS APOPTOTIC CASPASE FAS SURVIVAL PROGRAMMED MEDIATED INDUCTION CERAMIDE EXPRESSION
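A linear trend analysis of this kind can be sketched as an ordinary least-squares slope per topic, with the sign of the slope separating hot from cold topics. The helper name `linear_trend_slope`, the topic labels, and the yearly proportions below are hypothetical and only illustrate the computation.

```python
def linear_trend_slope(years, values):
    """Ordinary least-squares slope of values regressed on years."""
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(values) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, values))
    sxx = sum((x - mean_x) ** 2 for x in years)
    return sxy / sxx

# Hypothetical mean topic proportions per year (made-up numbers).
years = [1991, 1992, 1993, 1994, 1995]
topic_share = {
    "topic_a": [0.010, 0.012, 0.014, 0.016, 0.018],
    "topic_b": [0.020, 0.018, 0.016, 0.014, 0.012],
}

slopes = {name: linear_trend_slope(years, ys) for name, ys in topic_share.items()}
hot = [name for name, s in slopes.items() if s > 0]   # rising topics
cold = [name for name, s in slopes.items() if s < 0]  # falling topics
print(hot, cold)
```

In practice one would also check that each slope is statistically distinguishable from zero before calling a topic hot or cold.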

15 Cold topics:
Topic 37: CDNA AMINO SEQUENCE ACID PROTEIN ISOLATED ENCODING CLONED ACIDS IDENTITY CLONE EXPRESSED ENCODES RAT HOMOLOGY
Topic 289: KDA PROTEIN PURIFIED MOLECULAR MASS CHROMATOGRA.. POLYPEPTIDE GEL SDS BAND APPARENT LABELED IDENTIFIED FRACTION DETECTED
Topic 75: ANTIBODY ANTIBODIES MONOCLONAL ANTIGEN IGG MAB SPECIFIC EPITOPE HUMAN MABS RECOGNIZED SERA EPITOPES DIRECTED NEUTRALIZING
Hot topics:
Topic 2: SPECIES GLOBAL CLIMATE CO2 WATER ENVIRONMENTAL YEARS MARINE CARBON DIVERSITY OCEAN EXTINCTION TERRESTRIAL COMMUNITY ABUNDANCE
Topic 134: MICE DEFICIENT NORMAL GENE NULL MOUSE TYPE HOMOZYGOUS ROLE KNOCKOUT DEVELOPMENT GENERATED LACKING ANIMALS REDUCED
Topic 179: APOPTOSIS DEATH CELL INDUCED BCL CELLS APOPTOTIC CASPASE FAS SURVIVAL PROGRAMMED MEDIATED INDUCTION CERAMIDE EXPRESSION
Labels: NOBEL 1987, NOBEL 2002

16 Overview
I Probabilistic Topic Models
II Computer Science Applications
   Analyzing Scientific Topics: PNAS
   Analyzing NSF and NIH funding
   Analyzing Enron Email
III Theory for semantic cognition
   Word Association
   Free Recall
IV Conclusion

17 NSF & NIH grant abstracts
Analyze 22,000+ grants active during 2002
NIH: NIMH, NCI
NSF: BIO, SBE
Visualize topic similarity between funding programs
What topics are funded?

18 Example topics

19 2D visualization of funding programs: nearby programs support similar topics
Programs shown: NIH, NSF BIO, NSF SBE

20 Funding Amounts per Topic
We have $ funding per grant
We have a distribution of topics for each grant
Solve for the $ amount per topic → What are the expensive topics?
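The slide does not say how the $ amount per topic is solved for. One simple reading, assumed here purely for illustration, is to split each grant's dollars across topics in proportion to the grant's topic distribution and then sum per topic. The function name `dollars_per_topic` and all numbers below are hypothetical.

```python
def dollars_per_topic(funding, theta):
    """funding[g]: dollars for grant g; theta[g][t]: topic share of grant g."""
    n_topics = len(theta[0])
    totals = [0.0] * n_topics
    for amount, mix in zip(funding, theta):
        for t, share in enumerate(mix):
            # Assumption: a grant's dollars are divided by its topic shares.
            totals[t] += amount * share
    return totals

# Two made-up grants over two topics.
funding = [100000.0, 300000.0]
theta = [[0.5, 0.5], [0.25, 0.75]]

totals = dollars_per_topic(funding, theta)
# Dollars per unit of topic weight: a rough measure of how "expensive"
# a topic is, independent of how often it occurs.
weight = [sum(mix[t] for mix in theta) for t in range(len(totals))]
per_unit = [totals[t] / weight[t] for t in range(len(totals))]
print(totals, per_unit)
```

An alternative under the same inputs would be to regress grant amounts on topic proportions; which of the two the authors used is not stated in the transcript.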

21 High $$$ topics | Low $$$ topics

22 Overview
I Probabilistic Topic Models
II Computer Science Applications
   Analyzing Scientific Topics: PNAS
   Analyzing NSF and NIH funding
   Analyzing Enron Email
III Theory for semantic cognition
   Word Association
   Free Recall
IV Conclusion

23 Enron email data
500,000 emails
5000 authors
1999-2002

24 Enron topics
TEXANS WIN FOOTBALL FANTASY SPORTSLINE PLAY TEAM GAME SPORTS GAMES
GOD LIFE MAN PEOPLE CHRIST FAITH LORD JESUS SPIRITUAL VISIT
ENVIRONMENTAL AIR MTBE EMISSIONS CLEAN EPA PENDING SAFETY WATER GASOLINE
FERC MARKET ISO COMMISSION ORDER FILING COMMENTS PRICE CALIFORNIA FILED
POWER CALIFORNIA ELECTRICITY UTILITIES PRICES MARKET PRICE UTILITY CUSTOMERS ELECTRIC
STATE PLAN CALIFORNIA DAVIS RATE BANKRUPTCY SOCAL POWER BONDS MOU
TIMELINE: May 22, 2000, start of California energy crisis

25 Overview
I Probabilistic Topic Models
II Computer Science Applications
   Analyzing Scientific Topics: PNAS
   Analyzing NSF and NIH funding
   Analyzing Enron Email
III Theory for semantic cognition
   Word Association
   Free Recall
IV Conclusion

26 Semantic Memory
The semantic memory system might arise from the need to
1) predict what concepts are needed in what contexts
2) disambiguate uncertain information
Useful perspective for understanding various language and memory tasks

27 Word Association
CUE: PLAY
RESPONSES: FUN, BALL, GAME, WORK, GROUND, MATE, CHILD, ENJOY, WIN, ACTOR

28 Modeling Word Association
Word association modeled as prediction
Given that a single word is observed, which other words are likely to occur next?
Under a single topic assumption: [equation not preserved in the transcript; its terms are labeled Response and Cue]
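A standard way to write this prediction in a topic model, and the form assumed in this sketch, is P(response | cue) = Σ_z P(response | z) P(z | cue), where P(z | cue) ∝ P(cue | z) P(z) by Bayes' rule. The function name `association`, the two toy topics, and all probabilities below are made up for illustration.

```python
def association(phi, p_z, cue):
    """P(response | cue) for every word, assuming cue and response share one topic."""
    n_topics = len(phi)
    # Bayes' rule: P(z | cue) is proportional to P(cue | z) P(z).
    joint = [phi[z][cue] * p_z[z] for z in range(n_topics)]
    norm = sum(joint)
    p_z_given_cue = [j / norm for j in joint]
    # Sum over topics: P(w | cue) = sum_z P(w | z) P(z | cue).
    return {
        w: sum(phi[z][w] * p_z_given_cue[z] for z in range(n_topics))
        for w in phi[0]
    }

# Hypothetical theater and sports topics; "play" belongs to both.
phi = [
    {"play": 0.4, "stage": 0.3, "theater": 0.3, "ball": 0.0, "game": 0.0},
    {"play": 0.2, "stage": 0.0, "theater": 0.0, "ball": 0.4, "game": 0.4},
]
p_z = [0.5, 0.5]

pred = association(phi, p_z, "play")
ranked = sorted((w for w in pred if w != "play"), key=pred.get, reverse=True)
print(ranked)
```

Because "play" is more probable under the theater topic in this toy setup, theater words rank above sports words as associates, while both senses still contribute.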

29 Observed associates for the cue “play”

30 Model predictions from TASA corpus RANK 9

31 Median rank of first associate (axis label: Median Rank)

32 Latent Semantic Analysis (Landauer & Dumais, 1997)
word-document counts → SVD → high-dimensional space
Example words: RIVER, STREAM, MONEY, BANK
Each word is a single point in semantic space
Similarity is measured by the cosine of the angle between word vectors
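The LSA pipeline can be sketched with NumPy: take a truncated SVD of the word-document count matrix and compare words by cosine. This is a bare-bones illustration; the count matrix is made up, the helper names are hypothetical, and the weighting step (such as a log-entropy transform) usually applied before the SVD is omitted.

```python
import numpy as np

def lsa_word_vectors(counts, k):
    """Rows of counts are words, columns are documents; keep k dimensions."""
    U, s, _ = np.linalg.svd(counts, full_matrices=False)
    # Each word becomes a point: its row of U scaled by the singular values.
    return U[:, :k] * s[:k]

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up counts for four words in four documents.
words = ["river", "stream", "money", "bank"]
counts = np.array([
    [2, 3, 0, 0],
    [3, 2, 0, 0],
    [0, 0, 3, 2],
    [1, 1, 2, 2],
], dtype=float)

vecs = lsa_word_vectors(counts, k=2)
river, stream, money, bank = vecs
print(cosine(river, stream), cosine(river, money), cosine(bank, money))
```

Unlike in the topic model, "bank" here is a single point, so its two senses have to be accommodated by one position between the river words and the money words.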

33 Median rank of first associate (axis label: Median Rank)

34 Triangle Inequality in Spatial Representations
Points $w_1$, $w_2$, $w_3$; example words: PLAY, SOCCER, THEATER
Cosine similarity: $\cos(w_1, w_3) \ge \cos(w_1, w_2)\cos(w_2, w_3) - \sin(w_1, w_2)\sin(w_2, w_3)$

35 Testing violation of triangle inequality
Look for triplets of associates $w_1$, $w_2$, $w_3$ such that $P(w_2 \mid w_1) > \tau$ and $P(w_3 \mid w_2) > \tau$, and measure $P(w_3 \mid w_1)$
Vary the threshold $\tau$
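The triplet search can be sketched as a triple loop over an association table. The function name `triplet_third_links`, the nested-dictionary layout `P[a][b] = P(b | a)`, and the toy probabilities are assumptions made for this example.

```python
def triplet_third_links(P, tau):
    """Return (w1, w2, w3, P(w3 | w1)) for triplets whose first two links exceed tau."""
    words = list(P)
    found = []
    for w1 in words:
        for w2 in words:
            if w2 == w1 or P[w1].get(w2, 0.0) <= tau:
                continue
            for w3 in words:
                if w3 in (w1, w2) or P[w2].get(w3, 0.0) <= tau:
                    continue
                # Both links are strong; record how strong the third link is.
                found.append((w1, w2, w3, P[w1].get(w3, 0.0)))
    return found

# Made-up association probabilities.
P = {
    "soccer": {"play": 0.3},
    "play": {"theater": 0.2, "soccer": 0.2},
    "theater": {"play": 0.3},
}

triplets = triplet_third_links(P, tau=0.1)
for w1, w2, w3, p in triplets:
    print(w1, w2, w3, p)
```

In this toy table SOCCER and THEATER are both strongly linked to PLAY but not to each other, which is the pattern a purely spatial representation has trouble producing.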

36

37 Small-World Structure of Associations (Steyvers & Tenenbaum, 2005)
Example network: BASEBALL, BAT, BALL, GAME, PLAY, STAGE, THEATER
Properties:
1) Short path lengths
2) Clustering
3) Power law degree distributions
Small-world graphs arise elsewhere: internet, social relations, biology

38 Power law degree distribution → some words are very often used as an associate
The number of incoming links has a power law distribution, γ = -2.25
Example network: BASEBALL, BAT, BALL, GAME, PLAY, STAGE, THEATER

39 Creating Association Networks
TOPICS MODEL: Calculate the conditional probabilities of all word pairs i and j; connect i to j when P(w=j | w=i) > threshold
LSA: For each word, generate K associates by picking the K nearest neighbors in semantic space
γ = -2.05
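The thresholding rule for the topics model and the resulting in-degree counts can be sketched as follows. The function names, the nested-dictionary layout `P[i][j] = P(w=j | w=i)`, the threshold of 0.25, and the probabilities are all hypothetical.

```python
from collections import Counter

def topic_model_network(P, threshold):
    """Directed edges i -> j whenever P(w=j | w=i) > threshold."""
    return [
        (i, j)
        for i, row in P.items()
        for j, p in row.items()
        if i != j and p > threshold
    ]

def in_degree(edges):
    """Number of incoming links per word."""
    return Counter(j for _, j in edges)

# Made-up conditional probabilities for a handful of words.
P = {
    "baseball": {"bat": 0.3, "ball": 0.4, "game": 0.2},
    "bat": {"ball": 0.5, "baseball": 0.3},
    "stage": {"play": 0.5, "theater": 0.3},
    "theater": {"play": 0.4, "stage": 0.4},
    "game": {"play": 0.3, "ball": 0.2},
}

edges = topic_model_network(P, threshold=0.25)
deg = in_degree(edges)
print(deg.most_common())
```

A histogram of these in-degree counts on log-log axes is what one would inspect to estimate the power law exponent on a full vocabulary.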

40 Paradigmatic/ Syntagmatic Associations

41 Associations in free recall
STUDY THESE WORDS: Bed, Rest, Awake, Tired, Dream, Wake, Snooze, Blanket, Doze, Slumber, Snore, Nap, Peace, Yawn, Drowsy
RECALL WORDS .....
FALSE RECALL: “Sleep” 61%

42 Recall as a reconstructive process
Reconstruct the study list based on the stored “gist”
The gist can be represented by a distribution over topics
Under a single topic assumption: [equation not preserved in the transcript; its terms are labeled Retrieved word and Study list]
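One way to make the gist idea concrete, assumed here for illustration, is to infer a distribution over topics from the whole study list and then predict words from it: P(w | list) = Σ_z P(w | z) P(z | list), with P(z | list) ∝ P(z) Π_i P(w_i | z) when all study words are taken to come from one topic. The function name `predict_recall` and the toy topics and probabilities are made up.

```python
import math

def predict_recall(phi, p_z, study_list):
    """P(word | study list), treating the list as generated by a single topic."""
    n_topics = len(phi)
    # Gist: log P(z) + sum_i log P(w_i | z), kept in log space for stability.
    log_post = [
        math.log(p_z[z]) + sum(math.log(phi[z][w]) for w in study_list)
        for z in range(n_topics)
    ]
    top = max(log_post)
    post = [math.exp(lp - top) for lp in log_post]
    norm = sum(post)
    gist = [p / norm for p in post]
    # Reconstruct: words that are probable under the gist are retrieved.
    return {w: sum(phi[z][w] * gist[z] for z in range(n_topics)) for w in phi[0]}

# Hypothetical sleep-related and unrelated topics (all probabilities nonzero).
phi = [
    {"bed": 0.30, "rest": 0.20, "dream": 0.20, "sleep": 0.25, "bank": 0.05},
    {"bed": 0.05, "rest": 0.05, "dream": 0.05, "sleep": 0.05, "bank": 0.80},
]
p_z = [0.5, 0.5]

pred = predict_recall(phi, p_z, ["bed", "rest", "dream"])
extra_list = {w: p for w, p in pred.items() if w not in ("bed", "rest", "dream")}
print(max(extra_list, key=extra_list.get))
```

The unstudied word "sleep" receives a high probability because it is typical of the topic that best explains the study list, mirroring the false recall effect described above.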

43 Predictions for the “Sleep” list
Panels: STUDY LIST, EXTRA LIST (top 8)

44 Psychology/Comp.Sci Connections
Research on human memory is useful for developing better text mining algorithms
Models for information retrieval might be helpful in understanding human memory

45 Integrating Topics and Syntax
Syntactic dependencies → short-range dependencies
Semantic dependencies → long-range dependencies
Graphical model: θ, with nodes z, w, and s at each of four word positions
Semantic state: generate words from topic model
Syntactic states: generate words from HMM
(Griffiths, Steyvers, Blei, & Tenenbaum, 2004)

46 ...
IN BY WITH ON AS FROM TO FOR
THE A AN THIS THEIR ITS EACH ONE
IS ARE BE HAS HAVE WAS WERE AS
BASED PRESENTED DISCUSSED PROPOSED DESCRIBED SUCH USED DERIVED
THEORY MODEL PROCESSES MODELS SYSTEM PROCESS EFFECTS INFORMATION
ATTENTION SEARCH VISUAL PROCESSING TASK PERFORMANCE INFORMATION ATTENTIONAL
MEMORY TERM LONG SHORT RETRIEVAL STORAGE MEMORIES AMNESIA
IQ BEHAVIOR EVOLUTIONARY ENVIRONMENT GENES HERITABILITY GENETIC SELECTION
DRUG AROUSAL NEURAL BRAIN HABITUATION BIOLOGICAL TOLERANCE BEHAVIORAL
SOCIAL SELF ATTITUDE IMPLICIT ATTITUDES PERSONALITY JUDGMENT PERCEPTION
(S) THE SEARCH IN LONG TERM MEMORY ……
(S) A MODEL OF VISUAL ATTENTION ……

47 Random sentence generation
LANGUAGE:
[S] RESEARCHERS GIVE THE SPEECH
[S] THE SOUND FEEL NO LISTENERS
[S] WHICH WAS TO BE MEANING
[S] HER VOCABULARIES STOPPED WORDS
[S] HE EXPRESSLY WANTED THAT BETTER VOWEL

48 Topic Hierarchies
In the regular topic model, there are no relations between topics
Alternative: hierarchical topic organization
Diagram: topic 1, topic 2, topic 3, topic 4, topic 5, topic 6, topic 7
Nested Chinese Restaurant Process: Blei, Griffiths, Jordan, Tenenbaum (2004)
Learn the hierarchical structure, as well as the topics within the structure

49 Example: Psych Review Abstracts
RESPONSE STIMULUS REINFORCEMENT RECOGNITION STIMULI RECALL CHOICE CONDITIONING
SPEECH READING WORDS MOVEMENT MOTOR VISUAL WORD SEMANTIC
ACTION SOCIAL SELF EXPERIENCE EMOTION GOALS EMOTIONAL THINKING
GROUP IQ INTELLIGENCE SOCIAL RATIONAL INDIVIDUAL GROUPS MEMBERS
SEX EMOTIONS GENDER EMOTION STRESS WOMEN HEALTH HANDEDNESS
REASONING ATTITUDE CONSISTENCY SITUATIONAL INFERENCE JUDGMENT PROBABILITIES STATISTICAL
IMAGE COLOR MONOCULAR LIGHTNESS GIBSON SUBMOVEMENT ORIENTATION HOLOGRAPHIC
CONDITIONIN STRESS EMOTIONAL BEHAVIORAL FEAR STIMULATION TOLERANCE RESPONSES
A MODEL MEMORY FOR MODELS TASK INFORMATION RESULTS ACCOUNT
SELF SOCIAL PSYCHOLOGY RESEARCH RISK STRATEGIES INTERPERSONAL PERSONALITY SAMPLING
MOTION VISUAL SURFACE BINOCULAR RIVALRY CONTOUR DIRECTION CONTOURS SURFACES
DRUG FOOD BRAIN AROUSAL ACTIVATION AFFECTIVE HUNGER EXTINCTION PAIN
THE OF AND TO IN A IS

50 Generative Process
RESPONSE STIMULUS REINFORCEMENT RECOGNITION STIMULI RECALL CHOICE CONDITIONING
SPEECH READING WORDS MOVEMENT MOTOR VISUAL WORD SEMANTIC
ACTION SOCIAL SELF EXPERIENCE EMOTION GOALS EMOTIONAL THINKING
GROUP IQ INTELLIGENCE SOCIAL RATIONAL INDIVIDUAL GROUPS MEMBERS
SEX EMOTIONS GENDER EMOTION STRESS WOMEN HEALTH HANDEDNESS
REASONING ATTITUDE CONSISTENCY SITUATIONAL INFERENCE JUDGMENT PROBABILITIES STATISTICAL
IMAGE COLOR MONOCULAR LIGHTNESS GIBSON SUBMOVEMENT ORIENTATION HOLOGRAPHIC
CONDITIONIN STRESS EMOTIONAL BEHAVIORAL FEAR STIMULATION TOLERANCE RESPONSES
A MODEL MEMORY FOR MODELS TASK INFORMATION RESULTS ACCOUNT
SELF SOCIAL PSYCHOLOGY RESEARCH RISK STRATEGIES INTERPERSONAL PERSONALITY SAMPLING
MOTION VISUAL SURFACE BINOCULAR RIVALRY CONTOUR DIRECTION CONTOURS SURFACES
DRUG FOOD BRAIN AROUSAL ACTIVATION AFFECTIVE HUNGER EXTINCTION PAIN
THE OF AND TO IN A IS

