Published by Vivian Booker. Modified over 9 years ago.
1
Structured Topic Models: Jointly Modeling Words and Their Accompanying Modalities
Xuerui Wang
Computer Science Department, University of Massachusetts Amherst
Joint work with Andrew McCallum, Andres Corrada-Emmanuel, Chris Pal, Xing Wei and Natasha Mohanty.
2
2 Probabilistic topic models
Main assumptions:
–Documents are mixtures of topics
–Topics are distributions over words that capture word co-occurrence
Objectives:
–Understand text using learned topics
–Represent documents in topic space
3
3 Clustering words into topics with Latent Dirichlet Allocation [Blei, Ng, Jordan 2003]
Generative process:
For each document:
–Sample a distribution over topics
–For each word in the document:
  –Sample a topic, z
  –Sample a word from the topic, w
Example: distribution over topics 70% finance, 30% environment; sampled topic: finance; sampled word: “bank”.
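The generative process on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the helper names, the four-word vocabulary, the topic-word probabilities and the Dirichlet parameter are all made up for the example.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def sample_categorical(probs, rng):
    """Return an index drawn according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def generate_document(topic_word, alpha, doc_length, rng):
    """Generate one document as a list of (topic index, word index) pairs."""
    theta = sample_dirichlet(alpha, rng)            # per-document distribution over topics
    doc = []
    for _ in range(doc_length):
        z = sample_categorical(theta, rng)          # sample a topic for this word
        w = sample_categorical(topic_word[z], rng)  # sample a word from that topic
        doc.append((z, w))
    return theta, doc

# Hypothetical toy setup: two topics (finance, environment) over four words.
vocab = ["bank", "loan", "river", "forest"]
topic_word = [
    [0.5, 0.4, 0.1, 0.0],  # finance
    [0.2, 0.0, 0.4, 0.4],  # environment
]
rng = random.Random(0)
theta, doc = generate_document(topic_word, alpha=[0.5, 0.5], doc_length=10, rng=rng)
words = [vocab[w] for _, w in doc]
```

Note that “bank” has non-zero probability under both toy topics, which is the kind of ambiguity the slide's example points at.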
4
4 Example topics induced from a large collection of text [Tenenbaum et al]
STORY STORIES TELL CHARACTER CHARACTERS AUTHOR READ TOLD SETTING TALES PLOT TELLING SHORT FICTION ACTION TRUE EVENTS TELLS TALE NOVEL
MIND WORLD DREAM DREAMS THOUGHT IMAGINATION MOMENT THOUGHTS OWN REAL LIFE IMAGINE SENSE CONSCIOUSNESS STRANGE FEELING WHOLE BEING MIGHT HOPE
WATER FISH SEA SWIM SWIMMING POOL LIKE SHELL SHARK TANK SHELLS SHARKS DIVING DOLPHINS SWAM LONG SEAL DIVE DOLPHIN UNDERWATER
DISEASE BACTERIA DISEASES GERMS FEVER CAUSE CAUSED SPREAD VIRUSES INFECTION VIRUS MICROORGANISMS PERSON INFECTIOUS COMMON CAUSING SMALLPOX BODY INFECTIONS CERTAIN
FIELD MAGNETIC MAGNET WIRE NEEDLE CURRENT COIL POLES IRON COMPASS LINES CORE ELECTRIC DIRECTION FORCE MAGNETS BE MAGNETISM POLE INDUCED
SCIENCE STUDY SCIENTISTS SCIENTIFIC KNOWLEDGE WORK RESEARCH CHEMISTRY TECHNOLOGY MANY MATHEMATICS BIOLOGY FIELD PHYSICS LABORATORY STUDIES WORLD SCIENTIST STUDYING SCIENCES
BALL GAME TEAM FOOTBALL BASEBALL PLAYERS PLAY FIELD PLAYER BASKETBALL COACH PLAYED PLAYING HIT TENNIS TEAMS GAMES SPORTS BAT TERRY
JOB WORK JOBS CAREER EXPERIENCE EMPLOYMENT OPPORTUNITIES WORKING TRAINING SKILLS CAREERS POSITIONS FIND POSITION FIELD OCCUPATIONS REQUIRE OPPORTUNITY EARN ABLE
6
6 Documents are not just text!
Multiple modalities:
–Research papers (author, venue, words, etc.)
–Email messages (sender, recipients, time, words, etc.)
–Legislative resolutions (voting record, words, etc.)
–And many more
Most previous work: one modality at a time
–Learn topics from words
–Discover groups from relations
–Etc.
7
7 Scientific Questions
–How can model structures be designed to capture information from multiple modalities simultaneously?
–Does joint inference improve on treating each modality separately?
8
8 Outline
Introduction
Role and Topic Discovery in Social Networks
Group and Topic Discovery from Voting Records
Topics over Time
Topical Phrase with Markov Assumption
Conclusions
9
9 All possible “topic models” with one latent topic, two observed modalities and two conditional dependencies
11
11 From LDA to Author-Recipient-Topic
12
12 All possible “topic models” with two observed modalities
13
13 Inference and Estimation
Gibbs Sampling:
- Easy to implement
- Reasonably fast
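To give a feel for why Gibbs sampling is described as easy to implement, here is a minimal collapsed Gibbs sampler for plain LDA. It is a sketch under stated assumptions: the slide concerns ART, whose sampler also conditions on author and recipient, so this is the simpler LDA case, and the function name, hyperparameters and toy corpus are invented for illustration.

```python
import random

def lda_gibbs(docs, num_topics, vocab_size, alpha=0.1, beta=0.01, iters=50, seed=0):
    """Collapsed Gibbs sampling for plain LDA (not ART); docs are lists of word ids."""
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]                # tokens of doc d in topic k
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # times word w is in topic k
    topic_total = [0] * num_topics                              # tokens assigned to topic k
    assign = []
    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(num_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assign.append(zs)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assign[d][i]
                # Remove the current token from all counts.
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Conditional probability of each topic, up to a constant.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights, k=1)[0]
                # Record the new assignment.
                assign[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    return assign, doc_topic, topic_word

# Hypothetical toy corpus: word ids 0-1 and 2-3 tend to co-occur.
docs = [[0, 1, 0, 1], [2, 3, 2, 3], [0, 1, 2, 3]]
assign, doc_topic, topic_word = lda_gibbs(docs, num_topics=2, vocab_size=4)
```

The whole sampler is count bookkeeping plus one weighted draw per token, which is what keeps each sweep cheap.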
14
14 Enron email corpus
250k email messages
147 people
Example message:
Date: Wed, 11 Apr 2001 06:56:00 -0700 (PDT)
From: debra.perlingiere@enron.com
To: steve.hooser@enron.com
Subject: Enron/TransAltaContract dated Jan 1, 2001
Please see below. Katalin Kiss of TransAlta has requested an electronic copy of our final draft? Are you OK with this? If so, the only version I have is the original draft without revisions.
DP
Debra Perlingiere
Enron North America Corp.
Legal Department
1400 Smith Street, EB 3885
Houston, Texas 77002
dperlin@enron.com
15
15 Topics, and prominent senders / receivers discovered by ART
(Topic names assigned by hand)
16
16 Topics, and prominent senders / receivers discovered by ART
Beck = “Chief Operations Officer”
Dasovich = “Government Relations Executive”
Shapiro = “Vice President of Regulatory Affairs”
Steffes = “Vice President of Government Affairs”
17
17 Comparing role discovery
connection strength (A,B) =
Traditional SNA: distribution over recipients
Author-Topic: distribution over authored topics
ART: distribution over authored topics
18
18 Comparing role discovery: Tracy Geaconne and Dan McCarty
Traditional SNA, Author-Topic, ART
Similar roles, Different roles
Geaconne = “Secretary”
McCarty = “Vice President”
19
19 Comparing Role Discovery: Tracy Geaconne and Rod Hayslett
Traditional SNA: Different roles
Author-Topic: Very similar
ART: Not very similar
Geaconne = “Secretary”
Hayslett = “Vice President & CTO”
20
20 Comparing role discovery: Lynn Blair and Kimberly Watson
Traditional SNA: Different roles
Author-Topic: Very different
ART: Very similar
Blair = “Gas pipeline logistics”
Watson = “Pipeline facilities planning”
21
21 McCallum Email Corpus 2004
January - October 2004
23k email messages
825 people
Example message:
From: kate@cs.umass.edu
Subject: NIPS and....
Date: June 14, 2004 2:27:41 PM EDT
To: mccallum@cs.umass.edu
There is pertinent stuff on the first yellow folder that is completed either travel or other things, so please sign that first folder anyway. Then, here is the reminder of the things I'm still waiting for: NIPS registration receipt. CALO registration receipt. Thanks, Kate
22
22 McCallum Email Blockstructure
23
23 Four most prominent topics in discussions with ____?
24
24
25
25 Two most prominent topics in discussions with ____?
26
26 ART: Roles but not Groups
Enron TransWestern Division
Traditional SNA, Author-Topic, ART
Block structured, Not
28
28 Groups and Topics
Input:
–Observed relations between people
–Attributes on those relations (text, or categorical)
Output:
–Attributes clustered into “topics”
–Groups of people, varying depending on topic
29
29 Discovering groups from observed set of relations
Admiration relations among six high school students.
Student Roster: Adams, Bennett, Carter, Davis, Edwards, Frederking
Academic Admiration: Acad(A, B), Acad(C, B), Acad(A, D), Acad(C, D), Acad(B, E), Acad(D, E), Acad(B, F), Acad(D, F), Acad(E, A), Acad(F, A), Acad(E, C), Acad(F, C)
30
30 Adjacency matrix representing relations
[Figure: adjacency matrices of the relations, first with entities in the order A B C D E F, then with group labels attached, then reordered as A C B D E F so that groups G1, G2, G3 are contiguous]
Student Roster: Adams, Bennett, Carter, Davis, Edwards, Frederking
Academic Admiration: Acad(A, B), Acad(C, B), Acad(A, D), Acad(C, D), Acad(B, E), Acad(D, E), Acad(B, F), Acad(D, F), Acad(E, A), Acad(F, A), Acad(E, C), Acad(F, C)
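The effect of reordering can be reproduced directly from the relations listed on the slide. The sketch below builds the adjacency matrix in alphabetical order and again in the order A C B D E F; the helper name `adjacency` is made up for the example.

```python
students = ["A", "B", "C", "D", "E", "F"]
# Academic admiration relations from the slide, as (admirer, admired) pairs.
acad = [("A", "B"), ("C", "B"), ("A", "D"), ("C", "D"),
        ("B", "E"), ("D", "E"), ("B", "F"), ("D", "F"),
        ("E", "A"), ("F", "A"), ("E", "C"), ("F", "C")]

def adjacency(order, relations):
    """Build a 0/1 adjacency matrix with rows and columns in the given order."""
    index = {name: i for i, name in enumerate(order)}
    matrix = [[0] * len(order) for _ in order]
    for src, dst in relations:
        matrix[index[src]][index[dst]] = 1
    return matrix

alphabetical = adjacency(students, acad)
# Reordering so that members of the same group are adjacent exposes the blocks.
grouped = adjacency(["A", "C", "B", "D", "E", "F"], acad)
for row in grouped:
    print(row)
```

In the reordered matrix each pair of rows is identical: A and C admire only B and D, B and D admire only E and F, and E and F admire only A and C.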
31
31 Group Model: partitioning entities into groups
Stochastic Blockstructures for Relations [Nowicki, Snijders 2001]
S: number of entities
G: number of groups
Enhanced with arbitrary number of groups in [Kemp, Griffiths, Tenenbaum 2004]
[Graphical model; distribution labels: Beta, Dirichlet, Binomial, Multinomial]
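One plausible reading of the four distribution labels is sketched below as a generative process. The slide does not say which distribution attaches to which variable, so the mapping here (Dirichlet over group proportions, multinomial for group labels, Beta over per-group-pair relation probabilities, single-trial binomial for each relation) is an assumption, as are the function name and the hyperparameter values.

```python
import random

def sample_blockmodel(num_entities, num_groups, seed=0):
    """Sample group labels and a directed 0/1 relation matrix (hypothetical helper)."""
    rng = random.Random(seed)
    # Dirichlet draw (normalized Gamma draws): mixing proportions over groups.
    raw = [rng.gammavariate(1.0, 1.0) for _ in range(num_groups)]
    proportions = [r / sum(raw) for r in raw]
    # Multinomial draw: one group label per entity.
    groups = rng.choices(range(num_groups), weights=proportions, k=num_entities)
    # Beta draw: probability of a relation for each ordered pair of groups.
    link_prob = [[rng.betavariate(0.5, 0.5) for _ in range(num_groups)]
                 for _ in range(num_groups)]
    # Single-trial binomial draw: is entity i related to entity j?
    relations = [
        [int(i != j and rng.random() < link_prob[groups[i]][groups[j]])
         for j in range(num_entities)]
        for i in range(num_entities)
    ]
    return groups, link_prob, relations

# S = 6 entities and G = 3 groups, matching the six-student example.
groups, link_prob, relations = sample_blockmodel(num_entities=6, num_groups=3)
```

Inference runs in the opposite direction: given an observed relation matrix, recover the group labels that make the blocks as homogeneous as possible.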
32
32 Two relations with different attributes
[Figure: two adjacency matrices. Academic admiration with entities ordered A C B D E F and groups G1, G2, G3; social admiration with entities ordered A C E B D F and groups G1, G2]
Student Roster: Adams, Bennett, Carter, Davis, Edwards, Frederking
Academic Admiration: Acad(A, B), Acad(C, B), Acad(A, D), Acad(C, D), Acad(B, E), Acad(D, E), Acad(B, F), Acad(D, F), Acad(E, A), Acad(F, A), Acad(E, C), Acad(F, C)
Social Admiration: Soci(A, B), Soci(A, D), Soci(A, F), Soci(B, A), Soci(B, C), Soci(B, E), Soci(C, B), Soci(C, D), Soci(C, F), Soci(D, A), Soci(D, C), Soci(D, E), Soci(E, B), Soci(E, D), Soci(E, F), Soci(F, A), Soci(F, C), Soci(F, E)
33
33 Goal: Model relations and their (textual) attributes simultaneously to obtain better groups and more meaningful topics.
Example attribute words: budget, funding, annual, cash
Example attribute words: document, corrections, review, annual
34
34 The Group-Topic model: discovering groups and topics simultaneously
[Graphical model; distribution labels: Dirichlet, Multinomial, Uniform, Beta, Dirichlet, Binomial, Multinomial]
35
35 All possible “topic models” with two observed modalities
36
36 Inference and Estimation
Gibbs Sampling:
- Many random variables can be integrated out
- Easy to implement
- Reasonably fast
We assume the relationship is symmetric.
37
37 U.S. Senate data set
16 years of voting records in the US Senate (1989 – 2005)
A Senator may respond Yea or Nay to a resolution
3423 resolutions with text attributes (index terms)
191 Senators in total across 16 years
Example resolution: S.543
Title: An Act to reform Federal deposit insurance, protect the deposit insurance funds, recapitalize the Bank Insurance Fund, improve supervision and regulation of insured depository institutions, and for other purposes.
Sponsor: Sen Riegle, Donald W., Jr. [MI] (introduced 3/5/1991)
Cosponsors (2)
Latest Major Action: 12/19/1991 Became Public Law No: 102-242.
Index terms: Banks and banking, Accounting, Administrative fees, Cost control, Credit, Deposit insurance, Depressed areas, and other 110 terms
Votes: Adams (D-WA), Nay; Akaka (D-HI), Yea; Bentsen (D-TX), Yea; Biden (D-DE), Yea; Bond (R-MO), Yea; Bradley (D-NJ), Nay; Conrad (D-ND), Nay; ……
38
38 Topics discovered (U.S. Senate)
Mixture of Unigrams
Education | Energy | Military Misc. | Economic
education | energy | government | federal
school | power | military | labor
aid | water | foreign | insurance
children | nuclear | tax | aid
drug | gas | congress | tax
students | petrol | aid | business
elementary | research | law | employee
prevention | pollution | policy | care
Group-Topic Model
Education + Domestic | Foreign | Economic | Social Security + Medicare
education | foreign | labor | social
school | trade | insurance | security
federal | chemicals | tax | insurance
aid | tariff | congress | medical
government | congress | income | care
tax | drugs | minimum | medicare
energy | communicable | wage | disability
research | diseases | business | assistance
39
39 Groups discovered (US Senate) Groups from topic Education + Domestic
40
40 Senators Who Change Coalition the Most Dependent on Topic
e.g. Senator Shelby (D-AL) votes
–with the Republicans on Economic
–with the Democrats on Education + Domestic
–with a small group of maverick Republicans on Social Security + Medicare
41
41 Dataset #2: The UN General Assembly
Voting records of the UN General Assembly (1990 - 2003)
A country may choose to vote Yes, No or Abstain
931 resolutions with text attributes (titles)
192 countries in total
Also experiments later with resolutions from 1960-2003
Example: Vote on Permanent Sovereignty of Palestinian People, 87th plenary meeting
The draft resolution on permanent sovereignty of the Palestinian people in the occupied Palestinian territory, including Jerusalem, and of the Arab population in the occupied Syrian Golan over their natural resources (document A/54/591) was adopted by a recorded vote of 145 in favour to 3 against with 6 abstentions:
In favour: Afghanistan, Argentina, Belgium, Brazil, Canada, China, France, Germany, India, Japan, Mexico, Netherlands, New Zealand, Pakistan, Panama, Russian Federation, South Africa, Spain, Turkey, and other 126 countries.
Against: Israel, Marshall Islands, United States.
Abstain: Australia, Cameroon, Georgia, Kazakhstan, Uzbekistan, Zambia.
42
42 Topics Discovered (UN)
Mixture of Unigrams
Everything Nuclear | Human Rights | Security in Middle East
nuclear | rights | occupied
weapons | human | israel
use | palestine | syria
implementation | situation | security
countries | israel | calls
Group-Topic Model
Nuclear Non-proliferation | Nuclear Arms Race | Human Rights
nuclear | rights
states | arms | human
united | prevention | palestine
weapons | race | occupied
nations | space | israel
43
43 Groups Discovered (UN)
The countries listed for each group are ordered by their 2005 GDP (PPP), and only 5 countries are shown for groups that have more than 5 members.
44
44 Do we get better groups with the GT model?
Baseline Model:
1. Cluster bills into topics using mixture of unigrams;
2. Apply group model on topic-specific subsets of bills.
GT Model:
1. Jointly cluster topics and groups at the same time using the GT model.
Agreement Index (AI) measures group cohesion. Higher is better.
Datasets | Avg. AI for Baseline | Avg. AI for GT | p-value
Senate | 0.8198 | 0.8294 | <.01
UN | 0.8548 | 0.8664 | <.01
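The slide does not give the formula for the Agreement Index. A commonly used definition for roll-call data, assumed here, scores one group on one vote as (max(y, n, a) - ½[(y + n + a) - max(y, n, a)]) / (y + n + a), where y, n and a count the group's Yes, No and Abstain votes. The sketch below implements that assumed definition; the function name is made up.

```python
def agreement_index(yes, no, abstain=0):
    """Agreement Index of one group on one vote, under the assumed definition."""
    total = yes + no + abstain
    if total == 0:
        raise ValueError("the group cast no votes")
    top = max(yes, no, abstain)  # size of the largest bloc within the group
    return (top - 0.5 * (total - top)) / total

print(agreement_index(10, 0, 0))  # unanimous group
print(agreement_index(3, 3, 3))   # evenly split three ways
print(agreement_index(5, 5))      # evenly split Yea/Nay, as in Senate votes
```

Under this definition the index is 1 for a unanimous group and 0 for a three-way even split, so the averages near 0.82 to 0.87 in the table would indicate fairly cohesive groups.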
45
45 Groups and Topics, Trends over Time (UN)
47
47 Groups and Topics, Trends over Time (UN)
48
48 Want to model trends over time
Is prevalence of topic growing or waning?
Pattern appears only briefly
–Capture its statistics in focused way
–Don’t confuse it with patterns elsewhere in time
How do roles, groups, influence shift over time?
49
49 Topics Over Time (TOT)
[Two graphical-model diagrams with the same node labels: Dirichlet prior, multinomial over topics, topic index, word, multinomial over words, Dirichlet prior, time stamp, Beta over time]
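Reading the node labels as a generative story, each token gets a topic index from the document's multinomial over topics, a word from that topic's multinomial over words, and a time stamp from that topic's Beta distribution over (normalized) time. The sketch below follows that reading. The function name, the toy parameters and the choice to draw a time stamp per token are assumptions for illustration.

```python
import random

def generate_tot_document(theta, topic_word, topic_time, num_tokens, rng):
    """Generate (topic, word, time stamp) triples for one document."""
    tokens = []
    for _ in range(num_tokens):
        # Topic index from the document's multinomial over topics.
        z = rng.choices(range(len(theta)), weights=theta, k=1)[0]
        # Word from the topic's multinomial over words.
        w = rng.choices(range(len(topic_word[z])), weights=topic_word[z], k=1)[0]
        # Time stamp in [0, 1] from the topic's Beta distribution over time.
        a, b = topic_time[z]
        t = rng.betavariate(a, b)
        tokens.append((z, w, t))
    return tokens

rng = random.Random(0)
theta = [0.5, 0.5]                             # hypothetical document-topic mixture
topic_word = [[0.7, 0.3, 0.0], [0.0, 0.3, 0.7]]
topic_time = [(2.0, 8.0), (8.0, 2.0)]          # topic 0 peaks early, topic 1 peaks late
tokens = generate_tot_document(theta, topic_word, topic_time, num_tokens=20, rng=rng)
```

Because each topic carries its own Beta parameters, a topic can be concentrated in a narrow span of time, which is how the model keeps a briefly appearing pattern from being confused with patterns elsewhere in time.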
50
50 All possible “topic models” with two observed modalities
51
51 State of the Union addresses
208 Addresses delivered between January 8, 1790 and January 29, 2002.
To increase the number of documents, we split the addresses into paragraphs and treated them as ‘documents’. One-line paragraphs were excluded. Stop words were removed.
17,156 ‘documents’
21,534 words
669,425 tokens
Example ‘document’:
Our scheme of taxation, by means of which this needless surplus is taken from the people and put into the public Treasury, consists of a tariff or duty levied upon importations from abroad and internal-revenue taxes levied upon the consumption of tobacco and spirituous and malt liquors. It must be conceded that none of the things subjected to internal-revenue taxation are, strictly speaking, necessaries. There appears to be no just complaint of this taxation by the consumers of these articles, and there seems to be nothing so well able to bear the burden without hardship to any portion of the people.
1910
52
52 Comparing TOT against LDA
53
53 TOT on 17 years of NIPS proceedings
54
54 TOT versus LDA on McCallum email
55
55 Topic Distributions Conditioned on Time
[Figure: topic mass (in vertical height) over time in NIPS conference papers]
56
56 TOT on 17 years of NIPS proceedings TOT LDA
57
57 TOT improves ability to predict time
Predicting the year of a State-of-the-Union address.
L1 = distance between predicted year and actual year.
59
59 Topic Interpretability
LDA: algorithms, algorithm, genetic, problems, efficient
Topical N-grams: genetic algorithms, genetic algorithm, evolutionary computation, evolutionary algorithms, fitness function
60
60 Topics modeling phrases
Topics based only on unigrams are often difficult to interpret.
Topic discovery itself is confused because important meaning / distinctions are carried by phrases.
Significant opportunity to provide improved language models to ASR, MT, IR, etc.
61
61 Topical N-Gram model
[Graphical model: nodes z1 to z4, w1 to w4 and y1 to y4; plates D, T and W]
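The diagram itself is lost, so the following is only a simplified sketch of how a model with topics z, words w and per-position indicators y could generate text. It assumes y marks whether a word forms a bigram with the previous word, that y depends on the previous topic and word, and that a bigram word is drawn from a distribution specific to the current topic and the previous word. All names and parameter values are hypothetical.

```python
import random

def generate_tng(theta, unigram, bigram, bigram_prob, length, rng):
    """Generate (topic, indicator, word) triples under the simplified assumptions above."""
    tokens = []
    prev_z, prev_w = None, None
    for _ in range(length):
        z = rng.choices(range(len(theta)), weights=theta, k=1)[0]
        # y = 1 means this word continues a phrase started by the previous word.
        y = 0
        if prev_w is not None and rng.random() < bigram_prob[prev_z][prev_w]:
            y = 1
        if y == 1:
            dist = bigram[z][prev_w]  # word distribution given topic and previous word
        else:
            dist = unigram[z]         # ordinary topic-word distribution
        w = rng.choices(range(len(dist)), weights=dist, k=1)[0]
        tokens.append((z, y, w))
        prev_z, prev_w = z, w
    return tokens

# Hypothetical single-topic setup in which "white" is usually followed by "house".
vocab = ["white", "house", "policy"]
theta = [1.0]
unigram = [[0.4, 0.2, 0.4]]
bigram = [[[0.0, 1.0, 0.0], [1 / 3, 1 / 3, 1 / 3], [1 / 3, 1 / 3, 1 / 3]]]
bigram_prob = [[0.9, 0.0, 0.0]]  # only "white" can start a phrase here
rng = random.Random(0)
tokens = generate_tng(theta, unigram, bigram, bigram_prob, length=30, rng=rng)
```

With several topics, the phrase-start probabilities and the bigram distributions can differ by topic, which is what would let a phrase be treated as a unit in one topic and not in another.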
62
62 All possible “topic models” with two observed modalities
63
63 Features of Topical N-Grams model
Easily trained by Gibbs sampling
–Can run efficiently on millions of words
Topic-specific phrase discovery
–“white house” has special meaning as a phrase in the politics topic,
–... but not in the real estate topic.
64
64 NIPS research papers
Full text of NIPS papers between 1987-1999.
1,740 research papers in total.
13,649 unique words and 2,301,375 word tokens.
Stop words removed and no stemming.
65
65 “Reinforcement Learning”
LDA: state, learning, policy, action, reinforcement, states, time, optimal, actions, function, algorithm, reward, step, dynamic, control, sutton, rl, decision, algorithms, agent
Topical N-grams (2+): reinforcement learning, optimal policy, dynamic programming, optimal control, function approximator, prioritized sweeping, finite-state controller, learning system, reinforcement learning RL, function approximators, markov decision problems, markov decision processes, local search, state-action pair, markov decision process, belief states, stochastic policy, action selection, upright position, reinforcement learning methods
Topical N-grams (1): policy, action, states, actions, function, reward, control, agent, q-learning, optimal, goal, learning, space, step, environment, system, problem, steps, sutton, policies
66
66 “Support Vector Machines”
LDA: kernel, linear, vector, support, set, nonlinear, data, algorithm, space, pca, function, problem, margin, vectors, solution, training, svm, kernels, matrix, machines
Topical N-grams (2+): support vectors, test error, support vector machines, training error, feature space, training examples, decision function, cost functions, test inputs, kkt conditions, leave-one-out procedure, soft margin, bayesian transduction, training patterns, training points, maximum margin, strictly convex, regularization operators, base classifiers, convex optimization
Topical N-grams (1): kernel, training, support, margin, svm, solution, kernels, regularization, adaboost, test, data, generalization, examples, cost, convex, algorithm, working, feature, sv, functions
67
67 Word dependencies in information retrieval
Long-distance dependency: topical (semantic) dependency helps [Hofmann, 1999; Wei and Croft, 2006].
Short-distance dependency: phrases (usually discovered by separate modules) can boost IR performance [Fagan, 1989; Evans et al., 1991; Strzalkowski, 1995; Mitra et al., 1997].
TNG simultaneously captures both.
68
68 San Jose Mercury News (TREC)
Covers materials from San Jose Mercury News in 1991
With TREC queries 51-150
90,257 documents in total, 255,686 unique words and 17,574,989 word tokens.
Stop words removed and no stemming.
Example document:
SJMN91-06364022 06364022
Photo; PHOTO: Associated Press; MONSTER MASH -- Kentucky's Jamal Mash Burn shows his stuff in the Wildcats' 103-89 victory over state rival Louisville on Saturday. Mashburn had 25 points.
COLLEGE; BASKETBALL; GAME; RESULT; RANKING; SCHOOL
Arizona had a 24-point night from Sean Rooks, a height advantage and strong defense, but still struggled to an 83-76 victory over Evansville in the Fiesta Bowl Classic in Tucson, Ariz., on Saturday.; The victory moved the No. 6 Wildcats into the championship of their tournament for the seventh straight time.
Sports
ARIZONA EDGES EVANSVILLE ……
69
69 Ad-hoc retrieval on SJMN
Clearly contain phrases
No phrases due to stopping and punctuation removal
Mixed results on many other queries.
70
70 Ad-hoc retrieval on SJMN * indicates statistically significant differences in performance with 95% confidence according to the Wilcoxon test
72
72 All possible “topic models” with two observed modalities (revisit) ART GT TOT TNG
73
73 Conclusions
With carefully designed model structures, we can utilize multi-modality information.
Choices of configuration are task dependent.
Better results are obtained from joint inference on various tasks.