Topic Models for Social Network Analysis and Bibliometrics

Presentation on theme: "Topic Models for Social Network Analysis and Bibliometrics"— Presentation transcript:

1 Topic Models for Social Network Analysis and Bibliometrics
4/19/2017 Topic Models for Social Network Analysis and Bibliometrics Andrew McCallum Computer Science Department University of Massachusetts Amherst Joint work with Xuerui Wang, Natasha Mohanty, Andres Corrada, Chris Pal, Wei Li, David Mimno and Gideon Mann.

2 Goal: Mine actionable knowledge from unstructured text.
I want to help people make good decisions by leveraging the knowledge on the Web and in other bodies of text. Sometimes document retrieval is enough for this, but sometimes you need to find patterns in structured data that span many pages. Let me give you some examples of what I mean.

3 From Text to Actionable Knowledge
[Pipeline diagram: document collection → spider, filter → IE (segment, classify, associate, cluster) → database → data mining (discover patterns: entity types, links/relations, events; prediction, outlier detection, decision support) → actionable knowledge]

4 Joint Inference IE Data Mining Uncertainty Info Emerging Patterns
[The same pipeline diagram, annotated for joint inference: IE passes uncertainty info forward to data mining, and data mining feeds emerging patterns back to IE]

5 Unified Model IE Complex Inference and Learning Data Mining Spider
A single probabilistic model replaces the separate IE and data mining stages of the pipeline: Conditional Random Fields [Lafferty, McCallum, Pereira], Conditional PRMs [Koller…], [Jensen…], [Getoor…], [Domingos…]; discriminatively-trained undirected graphical models. Complex inference and learning: just what we researchers like to sink our teeth into!

6 (Linear Chain) Conditional Random Fields
[Lafferty, McCallum, Pereira 2001] An undirected graphical model, trained to maximize the conditional probability of the output sequence given the input sequence.
[Figure: finite state model and its graphical model; FSM states y_{t-1}, y_t, y_{t+1}, … (output sequence, e.g. OTHER, PERSON, ORG, TITLE) over observations x_{t-1}, x_t, x_{t+1}, … (input sequence, e.g. "said Jones a Microsoft VP …")]
A CRF is simply an undirected graphical model trained to maximize a conditional probability. First explorations with these models centered on finite state models, represented as linear-chain graphical models, with the conditional distribution over the state sequence y calculated as a normalized product over potentials on cliques of the graph:
p(y|x) = (1/Z(x)) ∏_t exp( Σ_k λ_k f_k(y_{t-1}, y_t, x, t) )
As is traditional in NLP and other application areas, these potentials are defined as a log-linear combination of weights on features of the clique values. The chief excitement from an application point of view is the ability to use rich and arbitrary features of the input without complicating inference. This yielded many good results and an explosion of interest across conferences, with wide-spread positive experimental results in many applications: noun phrase and named entity recognition [HLT'03], [CoNLL'03]; Asian word segmentation [COLING'04], [ACL'04]; IE from research papers [HLT'04]; object classification in images [CVPR '04]; protein structure prediction [ICML'04]; IE from bioinformatics text [Bioinformatics '04], …
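The normalized-product definition above can be made concrete with a toy sketch: score one state sequence, then normalize by the partition function computed with the forward algorithm. The transition/emission score matrices below stand in for the log-linear feature sums and hold random, illustrative values, not a trained model.

```python
import numpy as np

def crf_log_prob(y, trans, emit):
    """log P(y | x) for a linear-chain CRF.
    trans[i, j]: score for state i -> j; emit[t, j]: score for state j at
    position t (emit already folds in the features of the input x)."""
    T, S = emit.shape
    # unnormalized score of the given state sequence
    score = emit[0, y[0]] + sum(trans[y[t - 1], y[t]] + emit[t, y[t]]
                                for t in range(1, T))
    # log partition function Z(x), via the forward algorithm in log space
    alpha = emit[0].copy()
    for t in range(1, T):
        alpha = emit[t] + np.logaddexp.reduce(alpha[:, None] + trans, axis=0)
    logZ = np.logaddexp.reduce(alpha)
    return score - logZ

rng = np.random.default_rng(0)
trans = rng.normal(size=(3, 3))   # 3 states, e.g. OTHER / PERSON / ORG
emit = rng.normal(size=(5, 3))    # a 5-token input
p = np.exp(crf_log_prob([0, 1, 1, 0, 2], trans, emit))
```

Summing `p` over all 3^5 possible state sequences gives exactly 1, which is what "normalized product over clique potentials" means.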

7 1. Jointly labeling cascaded sequences Factorial CRFs
[Sutton, Khashayar, McCallum, ICML 2004] Layers: named-entity tags, noun-phrase boundaries, and part-of-speech tags over the English words. In some situations we want to output not a single label sequence but several. Rather than performing each of these labeling tasks serially in a cascade, Factorial CRFs perform all three tasks jointly. In joint prediction of part-of-speech and noun-phrase boundaries in newswire, this matched cascaded accuracy with only 50% of the training data. Inference: tree reparameterization BP [Wainwright et al, 2002].

8 2. Jointly labeling distant mentions Skip-chain CRFs
[Sutton, McCallum, SRL 2004] Example: "Senator Joe Green said today … Green ran for …" Skip-chain CRFs allow the model to capture dependencies among the labels of distant mentions of the same word, such as this first occurrence of "Green", which has strong local evidence of being a person's name, and the later one, which doesn't. Result: 14% reduction in error on the most-repeated field in seminar announcements. Inference: tree reparameterization BP [Wainwright et al, 2002].

9 3. Joint co-reference among all pairs Affinity Matrix CRF
Also known as "entity resolution" or "object correspondence." [Figure: mentions ". . . Mr Powell . . .", ". . . Powell . . .", ". . . she . . ." connected by Y/N coreference decisions with affinity scores such as 45 and -99.] Traditionally in NLP, coreference has been performed by making independent coreference decisions on each pair of entity mentions. An Affinity Matrix CRF jointly makes all coreference decisions together, accounting for multiple constraints. Result: ~25% reduction in error on coreference of proper nouns in newswire. Inference: correlational clustering graph partitioning [McCallum, Wellner, IJCAI WS 2003, NIPS 2004], [Bansal, Blum, Chawla, 2002].

10 4. Joint segmentation and co-reference
Extraction from, and matching of, research paper citations. [Figure: two citation strings, "Laurel, B. Interface Agents: Metaphors with Character, in The Art of Human-Computer Interface Design, B. Laurel (ed), Addison-Wesley, 1990." and "Brenda Laurel. Interface Agents: Metaphors with Character, in Laurel, The Art of Human-Computer Interface Design, , 1990.", with variables for observations (o), segmentation (s), citation attributes (c), coreference decisions (y), and database field values, plus world knowledge.] Recently we have been working with models that have one component for segmenting a citation string into fields and another for coreference (citation matching), performing both jointly. Results: 35% reduction in coreference error by using segmentation uncertainty; 6-14% reduction in segmentation error by using coreference. Inference: a variant of Iterated Conditional Modes [Wellner, McCallum, Peng, Hay, UAI 2004]; see also [Marthi, Milch, Russell, 2003], [Besag, 1986].

11 Leveraging Text in Social Network Analysis
[The pipeline diagram again, for context: joint inference among the detailed steps, now leveraging text in social network analysis]

12 Outline Social Network Analysis with Topic Models
Outline. Social Network Analysis with Topic Models: Role Discovery (Author-Recipient-Topic Model, ART); Group Discovery (Group-Topic Model, GT). Enhanced Topic Models: Correlations among Topics (Pachinko Allocation, PAM); Time-Localized Topics (Topics-over-Time Model, TOT); Markov Dependencies in Topics (Topical N-Grams Model, TNG). Bibliometric Impact Measures enabled by Topics. Multi-Conditional Mixtures.

13 Social Network in an Email Dataset
Social Network in an Email Dataset

14 Clustering words into topics with Latent Dirichlet Allocation
[Blei, Ng, Jordan 2003] Generative process, with an example: for each document, sample a distribution over topics, θ (e.g. 70% "Iraq war", 30% "US election"); then for each word in the document, sample a topic z (e.g. "Iraq war") and sample a word w from that topic (e.g. "bombing").
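The generative process above fits in a few lines of code. In the sketch below the two topics, tiny vocabulary, and topic-word probabilities are illustrative made-up numbers, not learned from data.

```python
import numpy as np

rng = np.random.default_rng(7)
vocab = ["troops", "bombing", "ballot", "election"]
# phi[k]: topic k's distribution over the vocabulary (illustrative values)
phi = np.array([[0.5, 0.4, 0.05, 0.05],   # an "Iraq war"-like topic
                [0.05, 0.05, 0.5, 0.4]])  # a "US election"-like topic
alpha = np.array([1.0, 1.0])              # Dirichlet prior on topic mixtures

def generate_document(n_words):
    theta = rng.dirichlet(alpha)          # per-document distribution over topics
    words = []
    for _ in range(n_words):
        z = rng.choice(2, p=theta)        # sample a topic for this word
        w = rng.choice(4, p=phi[z])       # sample a word from that topic
        words.append(vocab[w])
    return theta, words

theta, doc = generate_document(10)
```

Inference (e.g. Gibbs sampling) runs this process in reverse: given only the documents, recover plausible theta and phi.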

15 Example topics induced from a large collection of text
Example topics induced from a large collection of text:
DISEASE BACTERIA DISEASES GERMS FEVER CAUSE CAUSED SPREAD VIRUSES INFECTION VIRUS MICROORGANISMS PERSON INFECTIOUS COMMON CAUSING SMALLPOX BODY INFECTIONS CERTAIN
WATER FISH SEA SWIM SWIMMING POOL LIKE SHELL SHARK TANK SHELLS SHARKS DIVING DOLPHINS SWAM LONG SEAL DIVE DOLPHIN UNDERWATER
MIND WORLD DREAM DREAMS THOUGHT IMAGINATION MOMENT THOUGHTS OWN REAL LIFE IMAGINE SENSE CONSCIOUSNESS STRANGE FEELING WHOLE BEING MIGHT HOPE
STORY STORIES TELL CHARACTER CHARACTERS AUTHOR READ TOLD SETTING TALES PLOT TELLING SHORT FICTION ACTION TRUE EVENTS TELLS TALE NOVEL
FIELD MAGNETIC MAGNET WIRE NEEDLE CURRENT COIL POLES IRON COMPASS LINES CORE ELECTRIC DIRECTION FORCE MAGNETS BE MAGNETISM POLE INDUCED
SCIENCE STUDY SCIENTISTS SCIENTIFIC KNOWLEDGE WORK RESEARCH CHEMISTRY TECHNOLOGY MANY MATHEMATICS BIOLOGY FIELD PHYSICS LABORATORY STUDIES WORLD SCIENTIST STUDYING SCIENCES
BALL GAME TEAM FOOTBALL BASEBALL PLAYERS PLAY FIELD PLAYER BASKETBALL COACH PLAYED PLAYING HIT TENNIS TEAMS GAMES SPORTS BAT TERRY
JOB WORK JOBS CAREER EXPERIENCE EMPLOYMENT OPPORTUNITIES WORKING TRAINING SKILLS CAREERS POSITIONS FIND POSITION FIELD OCCUPATIONS REQUIRE OPPORTUNITY EARN ABLE
[Tenenbaum et al.]


17 From LDA to Author-Recipient-Topic
[McCallum et al 2005] (ART) All of these models select a word given a topic; the difference is in how the topic is selected. ART models author and recipient asymmetrically: topic discovery is guided by the social network in which text messages are sent and received.

18 Inference and Estimation
Gibbs sampling: easy to implement, reasonably fast.
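To show why Gibbs sampling is "easy to implement", here is a collapsed Gibbs sampler for plain LDA; the ART sampler adds author/recipient bookkeeping but has the same shape. The toy corpus and hyperparameter values below are made up for illustration.

```python
import numpy as np

def lda_gibbs(docs, V, K, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA. docs: lists of word ids in [0, V)."""
    rng = np.random.default_rng(seed)
    ndk = np.zeros((len(docs), K))    # document-topic counts
    nkw = np.zeros((K, V))            # topic-word counts
    nk = np.zeros(K)                  # tokens per topic
    z = [rng.integers(K, size=len(d)) for d in docs]
    for d, doc in enumerate(docs):    # initialize counts from random topics
        for i, w in enumerate(doc):
            ndk[d, z[d][i]] += 1; nkw[z[d][i], w] += 1; nk[z[d][i]] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]           # remove this token's current assignment
                ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
                # resample from the collapsed conditional distribution
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + V * beta)
                k = rng.choice(K, p=p / p.sum())
                z[d][i] = k
                ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    return nkw  # normalize rows to get the topic-word distributions

# toy corpus: two "themes", words {0,1} vs words {2,3}
docs = [[0, 0, 1, 1], [0, 1, 0, 1], [2, 3, 2, 3], [3, 2, 3, 2]]
nkw = lda_gibbs(docs, V=4, K=2)
```

The Dirichlet parameters theta and phi are integrated out ("collapsed"), which is why the update only needs the count arrays.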

19 Enron Email Corpus 250k email messages 23k people
Enron Email Corpus: 250k email messages, 23k people.
Date: Wed, 11 Apr :56: (PDT)
From:
To:
Subject: Enron/TransAltaContract dated Jan 1, 2001
Please see below. Katalin Kiss of TransAlta has requested an electronic copy of our final draft? Are you OK with this? If so, the only version I have is the original draft without revisions. DP
Debra Perlingiere, Enron North America Corp., Legal Department, 1400 Smith Street, EB 3885, Houston, Texas 77002

20 Topics, and prominent senders / receivers discovered by ART
Topic names assigned by hand.

21 Topics, and prominent senders / receivers discovered by ART
Beck = “Chief Operations Officer” Dasovich = “Government Relations Executive” Shapiro = “Vice President of Regulatory Affairs” Steffes = “Vice President of Government Affairs”

22 Comparing Role Discovery
How each model characterizes a person: Traditional SNA: connection strength (A,B) = distribution over recipients. ART: distribution over authored topics. Author-Topic: distribution over authored topics.
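In every column of this comparison, deciding whether two people have similar roles comes down to comparing two probability distributions (over recipients, or over authored topics). One standard symmetric measure is Jensen-Shannon divergence; the sketch below uses hypothetical topic distributions, not values from the Enron experiments.

```python
import numpy as np

def jsd(p, q):
    """Jensen-Shannon divergence, base 2, so the result lies in [0, 1]."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = 0.5 * (p + q)

    def kl(a, b):                    # KL(a || b), skipping zero entries of a
        mask = a > 0
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# hypothetical per-person distributions over 4 topics (illustrative only)
person_a = [0.6, 0.3, 0.1, 0.0]
person_b = [0.1, 0.1, 0.3, 0.5]
d = jsd(person_a, person_b)          # 0 = identical roles, 1 = disjoint
```

Swapping in distributions over recipients instead of topics gives the traditional-SNA notion of role similarity under the same measure.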

23 Comparing Role Discovery Tracy Geaconne  Dan McCarty
Tracy Geaconne vs. Dan McCarty. Traditional SNA: similar roles. ART: different roles. Author-Topic: different roles. Geaconne = “Secretary”; McCarty = “Vice President”.

24 Comparing Role Discovery Lynn Blair  Kimberly Watson
Lynn Blair vs. Kimberly Watson. Traditional SNA: different roles. ART: very similar. Author-Topic: very different. Blair = “Gas pipeline logistics”; Watson = “Pipeline facilities planning”.

25 McCallum Email Corpus 2004 January - October 2004 23k email messages
McCallum Email Corpus 2004: January - October 2004, 23k email messages, 825 people.
From:
Subject: NIPS and ....
Date: June 14, :27:41 PM EDT
To:
There is pertinent stuff on the first yellow folder that is completed either travel or other things, so please sign that first folder anyway. Then, here is the reminder of the things I'm still waiting for: NIPS registration receipt. CALO registration receipt. Thanks, Kate

26 Four most prominent topics in discussions with ____?

28 Two most prominent topics in discussions with ____?

30 Role-Author-Recipient-Topic Models

31 Results with RART: People in “Role #3” in Academic Email
People in “Role #3” in academic email:
olc: lead Linux sysadmin
gauthier: sysadmin for CIIR group
irsystem: mailing list for CIIR sysadmins
system: mailing list for dept. sysadmins
allan: Prof., chair of “computing committee”
valerie: second Linux sysadmin
tech: mailing list for dept. hardware
steve: head of dept. I.T. support

32 Roles for allan (James Allan)
Roles for allan (James Allan): Role #3 (I.T. support), Role #2 (Natural Language researcher).
Roles for pereira (Fernando Pereira): Role #2 (Natural Language researcher), Role #4 (SRI CALO project participant), Role #6 (Grant proposal writer), Role #10 (Grant proposal coordinator), Role #8 (Guests at McCallum’s house).

33 ART: Roles but not Groups
Traditional SNA: block structured. ART: not. Author-Topic: not. (Enron TransWestern Division.)

34 Outline Social Network Analysis with Topic Models
Outline. Social Network Analysis with Topic Models: Role Discovery (Author-Recipient-Topic Model, ART); Group Discovery (Group-Topic Model, GT). Enhanced Topic Models: Correlations among Topics (Pachinko Allocation, PAM); Time-Localized Topics (Topics-over-Time Model, TOT); Markov Dependencies in Topics (Topical N-Grams Model, TNG). Bibliometric Impact Measures enabled by Topics. Multi-Conditional Mixtures.

35 Groups and Topics
Input: observed relations between people, and attributes on those relations (text, or categorical). Output: attributes clustered into “topics”; groups of people, varying depending on topic.

36 Discovering Groups from Observed Set of Relations
Student roster: Adams, Bennett, Carter, Davis, Edwards, Frederking. Academic admiration relations: Acad(A, B), Acad(C, B), Acad(A, D), Acad(C, D), Acad(B, E), Acad(D, E), Acad(B, F), Acad(D, F), Acad(E, A), Acad(F, A), Acad(E, C), Acad(F, C). We want to discover latent groups from our observations of relations between entities. Assume we have six high school students who filled out a survey. A student may admire another student in some way, defining a relation. The table lists all relations that hold between two students; for example, Adams admires Bennett, Carter admires Bennett, etc.

37 Adjacency Matrix Representing Relations
[Figure: adjacency matrices for the academic admiration relations, one in roster order A-F and one permuted to A, C, B, D, E, F with groups G1 = {A, C}, G2 = {B, D}, G3 = {E, F}, revealing block structure.] It is common to visualize relations with an adjacency matrix: if cell (i, j) is colored, student i admires student j. We want a generative model of these colorings. We could have parameters for each pair of person indices; instead, the model maps each person to a group id and conditions the coloring on the pair of group ids. This gives the model an incentive to discover group mappings that help explain the observed relation colorings. Here we have used 3 groups. It is common to rearrange the matrix by permutation to reveal block structures.
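The adjacency-matrix view can be reproduced directly from the listed relations. The sketch below builds the matrix, then permutes rows and columns by the 3-group mapping shown on the slide (G1 = {A, C}, G2 = {B, D}, G3 = {E, F}) to expose the block structure.

```python
import numpy as np

students = ["A", "B", "C", "D", "E", "F"]
relations = [("A", "B"), ("C", "B"), ("A", "D"), ("C", "D"),
             ("B", "E"), ("D", "E"), ("B", "F"), ("D", "F"),
             ("E", "A"), ("F", "A"), ("E", "C"), ("F", "C")]

idx = {s: i for i, s in enumerate(students)}
adj = np.zeros((6, 6), dtype=int)
for a, b in relations:          # adj[i, j] = 1 iff student i admires student j
    adj[idx[a], idx[b]] = 1

group = {"A": 0, "C": 0, "B": 1, "D": 1, "E": 2, "F": 2}
order = sorted(students, key=lambda s: group[s])   # A, C, B, D, E, F
perm = [idx[s] for s in order]
blocked = adj[np.ix_(perm, perm)]                  # permute rows and columns
```

In `blocked`, the off-diagonal blocks (A,C)->(B,D), (B,D)->(E,F), and (E,F)->(A,C) are solid while the diagonal blocks are empty, which is exactly the block structure the group model is rewarded for finding.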

38 Group Model: Partitioning Entities into Groups
Stochastic Blockstructures for Relations [Nowicki, Snijders 2001]. [Graphical model: group assignments drawn from a Multinomial with a Dirichlet prior; relations drawn from Binomials with Beta priors, one per group pair. S: number of entities; G: number of groups.] The generation of the observed relations v is conditioned on one of G² Binomials, selected by the group assignments of the two entities: whether a relation holds for two entities depends entirely on whether it holds for their two groups. Enhanced with an arbitrary number of groups in [Kemp, Griffiths, Tenenbaum 2004].

39 Two Relations with Different Attributes
Social admiration relations (same roster): Soci(A, B), Soci(A, D), Soci(A, F), Soci(B, A), Soci(B, C), Soci(B, E), Soci(C, B), Soci(C, D), Soci(C, F), Soci(D, A), Soci(D, C), Soci(D, E), Soci(E, B), Soci(E, D), Soci(E, F), Soci(F, A), Soci(F, C), Soci(F, E). But what if we have two different relations, with different attributes? Doing the same analysis for social admiration yields a quite different grouping (here G1 = {A, C, E}, G2 = {B, D, F}): the attributes of relations can influence grouping. The focus in this talk is on text as the attribute associated with relations. The vocabulary of this text can be very large, so we want to cluster these textual attributes as well; topic models are widely used for this, and next we quickly review a simple topic model for text.

40 The Group-Topic Model: Discovering Groups and Topics Simultaneously
[Wang, Mohanty, McCallum 2006] [Graphical model combining the blockstructures model (Beta/Binomial over group-pair relations, Dirichlet/Multinomial over group assignments) with a mixture-of-unigrams topic model (Uniform/Dirichlet/Multinomial over topics and words).] Starting from the blockstructures model, we build the GT model by adding the mixture of unigrams and enhancing the blockstructures with multiple topics. We first generate the topic-wise group assignments for all entities. For each relation (each document), whether it holds depends entirely on the group assignments of the two entities. Note that we generate the topic-wise group assignments only once for the whole dataset.

41 Inference and Estimation
Gibbs sampling: many r.v.s can be integrated out; easy to implement; reasonably fast. We use Gibbs sampling for inference and assume the relations are symmetric (asymmetric relations would not increase the complexity of the model). The upper formula relates to the topic of a document; the lower one corresponds to the group assignment of an entity for a given topic.

42 Dataset #1: U.S. Senate. 16 years of voting records in the US Senate (1989 - 2005). A Senator may respond Yea or Nay to a resolution. 3423 resolutions with text attributes (index terms); 191 Senators in total across the 16 years.
Example: S.543. Title: An Act to reform Federal deposit insurance, protect the deposit insurance funds, recapitalize the Bank Insurance Fund, improve supervision and regulation of insured depository institutions, and for other purposes. Sponsor: Sen Riegle, Donald W., Jr. [MI] (introduced 3/5/1991). Cosponsors (2). Latest Major Action: 12/19/1991 Became Public Law No: Index terms: Banks and banking, Accounting, Administrative fees, Cost control, Credit, Deposit insurance, Depressed areas, and 110 other terms.
Sample votes: Adams (D-WA), Nay; Akaka (D-HI), Yea; Bentsen (D-TX), Yea; Biden (D-DE), Yea; Bond (R-MO), Yea; Bradley (D-NJ), Nay; Conrad (D-ND), Nay; ……
We model agreement between two senators as a relation instead of the explicit yes/no votes, because the meaning of yes or no can change with the simple addition of a “not” to the text of the bill. The textual attributes we model are the index terms, of which there are typically many.
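The preprocessing step the notes describe, turning a roll call into agreement relations between pairs of senators, can be sketched directly. The votes below are the example roll call from this slide, abbreviated to the seven senators shown.

```python
votes = {"Adams": "Nay", "Akaka": "Yea", "Bentsen": "Yea", "Biden": "Yea",
         "Bond": "Yea", "Bradley": "Nay", "Conrad": "Nay"}

def agreement_pairs(votes):
    """All unordered pairs of senators who cast the same vote on this bill."""
    names = sorted(votes)
    return [(a, b) for i, a in enumerate(names) for b in names[i + 1:]
            if votes[a] == votes[b]]

pairs = agreement_pairs(votes)
```

Each such pair is one observed relation for the Group-Topic model; the bill's index terms are the textual attributes attached to it.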

43 Topics Discovered (U.S. Senate)
Mixture of Unigrams topics: Education; Energy; Military; Misc.; Economic. [Table of top words per topic, columns interleaved in transcription: education energy government federal school power military labor aid water foreign insurance children nuclear tax drug gas congress students petrol business elementary research law employee prevention pollution policy care]
Group-Topic Model topics: Education + Domestic; Foreign; Economic; Social Security + Medicare. [Top words per topic, columns interleaved in transcription: education foreign labor social school trade insurance security federal chemicals tax congress medical government income care drugs minimum medicare energy communicable wage disability research diseases business assistance]
First, here are the topics found by the traditional mixture of unigrams on the text alone, without the votes. They are quite salient: “education, school, aid, children… prevention”; “energy, power, water… pollution”. But these two topics have very similar voting patterns, so in our Group-Topic model they get collapsed. The Group-Topic model has also discovered a new topic, Social Security and Medicare, that didn’t appear in the traditional model at all. The word co-occurrences simply weren’t strong enough to create a topic for this from the words alone, but this issue has a very distinctive voting pattern, so our Group-Topic model found that creating such a topic helped it predict vote agreement. Thus, the traditional mixture of unigrams discovers topics that help predict word co-occurrence; our Group-Topic model discovers topics that also help predict people’s behavior and relations. Something more useful!

44 Groups Discovered (US Senate)
Groups from the topic Education + Domestic. [Table of senators per group.] The Democrats are more unified on this topic, while the Republicans split into 3 groups. The first group is the core of the Republican party, along with one Democrat (but, OK, he is from Texas). Group 2 contains all but two of the Democrats. Group 3 contains Republican Senators from more liberal states, like Maine, Oregon and Pennsylvania. Group 4 are the conservative but independently-minded Republicans, like John McCain. This group also contains Zell Miller from Georgia, a very conservative Democrat who frequently criticizes his own party, and who even backed Republican George Bush over Democrat John Kerry in the 2004 presidential election.

45 Senators Who Change Coalition the most Dependent on Topic
Our model can also be used to determine which Senators change their group membership the most across topics. For example, Senator Shelby (D-AL), a Democrat from a fairly conservative state, votes with the Republicans on Economic, with the Democrats on Education + Domestic, and with a small group of maverick Republicans on Social Security + Medicare.

46 Dataset #2: The UN General Assembly
Voting records of the UN General Assembly ( ). A country may choose to vote Yes, No or Abstain. 931 resolutions with text attributes (titles); 192 countries in total. Also experiments later with resolutions from
Example: Vote on Permanent Sovereignty of Palestinian People, 87th plenary meeting. The draft resolution on permanent sovereignty of the Palestinian people in the occupied Palestinian territory, including Jerusalem, and of the Arab population in the occupied Syrian Golan over their natural resources (document A/54/591) was adopted by a recorded vote of 145 in favour to 3 against, with 6 abstentions. In favour: Afghanistan, Argentina, Belgium, Brazil, Canada, China, France, Germany, India, Japan, Mexico, Netherlands, New Zealand, Pakistan, Panama, Russian Federation, South Africa, Spain, Turkey, and 126 other countries. Against: Israel, Marshall Islands, United States. Abstain: Australia, Cameroon, Georgia, Kazakhstan, Uzbekistan, Zambia.

47 Topics Discovered (UN)
Mixture of Unigrams topics: Everything Nuclear (nuclear, weapons, use, implementation, countries); Human Rights (rights, human, palestine, situation, calls); Security in Middle East (occupied, israel, syria, security).
Group-Topic Model topics: Nuclear Non-proliferation; Nuclear Arms Race; Human Rights. [Top words per topic, columns interleaved in transcription: nuclear rights states arms human united prevention palestine weapons race occupied nations space israel]
Again, here we inspect the topics from the two models. The traditional mixture of unigrams, using word co-occurrence alone, puts all nuclear issues together into one topic. But there are two very distinct voting patterns on nuclear issues: one is nuclear non-proliferation; the other is the historic arms race between the U.S. and U.S.S.R. As you can see, our Group-Topic model has successfully found this important separation of issues.

48 Groups Discovered (UN)
The country lists for each group are ordered by their 2005 GDP (PPP), and only 5 countries are shown for groups with more than 5 members. [Table: the 3 GT topics as columns, with five groups of countries per topic.] As just mentioned, on the topic of Nuclear Non-proliferation, USA, Japan, Germany, UK and Russia all vote together. But on the Arms Race, India was a big ally of Russia; Japan, Germany, Italy, Poland and Hungary vote together; and USA and Israel vote together, with Palau perhaps just trying to “curry favor”. On the topic of Human Rights we see USA, Japan, Germany, UK and Russia voting together, and on the other hand we see Nicaragua, Papua, Rwanda, … voting together. Ouch!

49 Do We Get Better Groups with the GT Model?
Baseline model: cluster bills into topics using a mixture of unigrams, then apply the group model on topic-specific subsets of bills. GT model: jointly cluster topics and groups at the same time.
Dataset | Avg. AI (Baseline) | Avg. AI (GT) | p-value
Senate | 0.8198 | 0.8294 | < .01
UN | 0.8548 | 0.8664 |
The Agreement Index (AI) measures group cohesion; higher is better. The groups found by GT look good so far, and we also evaluate group quality numerically. We compare the GT model with a baseline that first clusters topics, divides the bills into per-topic subsets, and runs the group model separately on each subset. As the table shows, group cohesion under the GT model is significantly better than even this fairly expressive baseline, showing that joint inference helps in discovering groups and topics.

50 Groups and Topics, Trends over Time (UN)
The last experiment runs the GT model on overlapping time windows of 15 years, each shifted by 5 years. [Figure: topics found by the GT model for each period (left) and the group distribution for Topic 3 (right).] For example, regarding African independence in the first period, it is satisfying to see that, during the Cold War, the Western bloc countries form one group and the Eastern bloc countries form another. Throughout the history of the UN, the US is usually in the same group as Europe; however, during the window when the Israeli-Palestinian conflict was the most dominant topic, the US and Israel form a group of their own, separate from Europe.

51 Outline Social Network Analysis with Topic Models
Outline. Social Network Analysis with Topic Models: Role Discovery (Author-Recipient-Topic Model, ART); Group Discovery (Group-Topic Model, GT). Enhanced Topic Models: Correlations among Topics (Pachinko Allocation, PAM); Time-Localized Topics (Topics-over-Time Model, TOT); Markov Dependencies in Topics (Topical N-Grams Model, TNG). Bibliometric Impact Measures enabled by Topics. Multi-Conditional Mixtures.

52 Latent Dirichlet Allocation
LDA 20 topic (“images, motion, eyes”): visual model motion field object image images objects fields receptive eye position spatial direction target vision multiple figure orientation location
LDA 100 topic (“motion, some junk”): motion detection field optical flow sensitive moving functional detect contrast light dimensional intensity computer mt measures occlusion temporal edge real
[Blei, Ng, Jordan, 2003]
[LDA graphical model: for each of N documents, θ ~ Dir(α); for each of its n words, a topic z ~ θ and a word w from the topic multinomial φ_z, with T topic multinomials φ ~ Dir(β)]

53 Correlated Topic Model
[Blei, Lafferty, 2005] [Graphical model: like LDA, but the per-document topic proportions are drawn from a logistic normal rather than a Dirichlet, with a square matrix of pairwise correlations.]

54 Pachinko Machine

55 Pachinko Allocation Model
[Li, McCallum, 2005] (Thanks to Michael Jordan for suggesting the name.) [Model structure, not the graphical model: a DAG with root 11; interior nodes 21, 22; 31, 32, 33; 41-45; and leaves word1 … word8.] Given: a directed acyclic graph (DAG), with a Dirichlet over its children at each interior node and words at the leaves. For each document: sample a multinomial from each Dirichlet. For each word in the document: starting from the root, sample a child at each successive node, down to a leaf, and generate the word at that leaf. Like a Polya tree, but DAG-shaped, with an arbitrary number of children.
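The per-word sampling walk above can be made concrete as a runnable sketch on a tiny hand-built DAG. The structure, node names, Dirichlet parameter, and uniform word choice at the leaves are all illustrative simplifications, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(3)

# children of each interior node; any node not listed here is a leaf topic
children = {"root": ["s1", "s2"], "s1": ["t1", "t2"], "s2": ["t2", "t3"]}
# each leaf topic's word list (a stand-in for its word distribution)
topics = {"t1": ["word1", "word2"], "t2": ["word3", "word4"], "t3": ["word5"]}

def sample_document(n_words, dir_alpha=1.0):
    # per-document step: one multinomial per interior node, from its Dirichlet
    mults = {n: rng.dirichlet([dir_alpha] * len(c)) for n, c in children.items()}
    words = []
    for _ in range(n_words):
        node = "root"
        while node in children:                 # walk the DAG down to a leaf
            node = rng.choice(children[node], p=mults[node])
        k = rng.integers(len(topics[node]))     # uniform word here, for brevity
        words.append(topics[node][k])
    return words

doc = sample_document(8)
```

Note that "t2" has two parents, which is what makes this a DAG rather than a tree: the shared child is how the model represents correlation between the two paths.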

56 Pachinko Allocation Model
[Li, McCallum, 2005] [Model structure, not the graphical model.] The DAG may have arbitrary structure: arbitrary depth, any number of children per node, sparse connectivity; edges may skip layers.

57 Pachinko Allocation Model
[Li, McCallum, 2005] [Model structure, not the graphical model.] Upper interior nodes hold distributions over distributions over topics…; lower interior nodes hold distributions over topics (mixtures, representing topic correlations); leaves hold distributions over words (like “LDA topics”). Some interior nodes could contain one multinomial used for all documents (i.e. a very peaked Dirichlet).

58 Pachinko Allocation Model
[Li, McCallum, 2005] [Model structure, not the graphical model.] Estimate all these Dirichlets from data. Estimate the model structure from data (number of nodes, and connectivity).

59 Pachinko Allocation Special Cases
Latent Dirichlet Allocation: [a two-level structure, with a single root node holding a Dirichlet over topic nodes, each of which is a distribution over words word1 … word8].

60 Pachinko Allocation Special Cases
Hierarchical Latent Dirichlet Allocation (HLDA): a very low variance Dirichlet at the root; each leaf of the HLDA topic hierarchy has a distribution over the nodes on its path to the root. [Model structure: the HLDA hierarchy, nodes 11; 21-24; 31-34; 41, 42; 51; over word1 … word8.]

61 Pachinko Allocation on a Topic Hierarchy
Combining the best of HLDA and Pachinko Allocation: the PAM DAG sits above the HLDA hierarchy, representing correlations among the topic leaves.

62 Pachinko Allocation Model
...with two layers, no skipped layers, fully connected from one layer to the next: a root, a layer of "super-topics", a layer of "sub-topics", and fixed multinomials over words. Another special case would select only one super-topic per document.

63 Graphical Models: LDA vs. PAM (with fixed multinomials for topics)
(Plate diagrams. LDA: Dirichlet prior α, per-document topic multinomial θ, one topic variable z per word, word w drawn from a topic's word multinomial φ with prior β; N documents, T topics. PAM: the same, except each word has a chain of topic variables z1, z2, ..., zm, one per level of the DAG.)

64 Pachinko Allocation Model
Likelihood: estimate the z's by Gibbs sampling; estimate the Dirichlet parameters by moment matching.
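The moment-matching step can be illustrated with a short sketch: given topic proportions sampled for a set of documents, choose Dirichlet parameters whose mean and variance match the sample. This is the standard method-of-moments idea, not the paper's exact estimator, and the input proportions are invented.

```python
# Moment-matching estimate of Dirichlet parameters from observed
# topic proportions -- a sketch of the idea, not the paper's exact code.
def dirichlet_moment_match(thetas):
    n, k = len(thetas), len(thetas[0])
    mean = [sum(t[j] for t in thetas) / n for j in range(k)]
    var = [sum((t[j] - mean[j]) ** 2 for t in thetas) / n for j in range(k)]
    # Each component gives an estimate of the precision s = sum(alpha);
    # average the per-component estimates.
    s_est = [mean[j] * (1 - mean[j]) / var[j] - 1 for j in range(k) if var[j] > 0]
    s = sum(s_est) / len(s_est)
    return [s * m for m in mean]   # alpha_j = s * mean_j

# Topic proportions (hard-coded here) clustered around (0.7, 0.3):
alpha = dirichlet_moment_match([[0.7, 0.3], [0.6, 0.4], [0.8, 0.2]])
```

Tight proportions yield a large precision (peaked Dirichlet); spread-out proportions yield a small one.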

65 Preliminary Experimental Results
Topic coherence; likelihood on held-out data; document classification.

66 NIPS Dataset
NIPS Conference Papers, Volumes 0-12, spanning through 1999. Prepared by Sam Roweis. 1740 papers; 13,649 unique words; 2,301,375 tokens.

67 Topic Coherence Comparison
PAM 100 ("estimation"): estimation bayesian parameters data methods estimate maximum probabilistic distributions noise variable variables noisy inference variance entropy models framework statistical estimating
LDA 20 ("models, estimation, stopwords"): models model parameters distribution bayesian probability estimation data gaussian methods likelihood em mixture show approach paper density framework approximation markov
LDA 100 ("estimation, some junk"): estimation likelihood maximum noisy estimates mixture scene surface normalization generated measurements surfaces estimating estimated iterative combined figure divisive sequence ideal
Example PAM super-topic (Dirichlet weight: sub-topic):
33: input hidden units function number
27: estimation bayesian parameters data methods
24: distribution gaussian markov likelihood mixture
11: exact kalman full conditional deterministic
1: smoothing predictive regularizers intermediate slope

68 Topic Coherence Comparison
LDA 20 ("images, motion, eyes"): visual model motion field object image images objects fields receptive eye position spatial direction target vision multiple figure orientation location
LDA 100 ("motion, some junk"): motion detection field optical flow sensitive moving functional detect contrast light dimensional intensity computer mt measures occlusion temporal edge real
PAM 100 ("motion"): motion video surface surfaces figure scene camera noisy sequence activation generated analytical pixels measurements assigne advance lated shown closed perceptual
PAM 100 ("eyes"): eye head vor vestibulo oculomotor vestibular vary reflex vi pan rapid semicircular canals responds streams cholinergic rotation topographically detectors ning
PAM 100 ("images"): image digit faces pixel surface interpolation scene people viewing neighboring sensors patches manifold dataset magnitude transparency rich dynamical amounts tor
Example PAM super-topic (Dirichlet weight: sub-topic):
69: network neural time system networks
7: motion video surface surfaces figure
4: user final validate resolve oam

69 Topic Coherence Comparison
PAM 100 ("neural networks, much less junk"): input hidden units function number functions networks output linear layer single results weight inputs basis parameters standard network patterns study
LDA 20 ("neural networks, some junk"): architecture network input output structure paper level task work sequences sequence multiple problem shows connectionist networks context perform scale learn
LDA 100 ("neural networks, some junk"): network layer multi trained high perceptron layers give type nonlinearity perceptrons module modified matched performed provided designed samples study mode
Additional topic shown on the slide: input hidden units function number weights size elements depth polynomial numbers perceptron average geometric uniform code vector derive population codes ments exhibits substantially specifically physics

70 Blind Topic Evaluation
Randomly select 25 similar pairs of topics generated from PAM and LDA. 5 people, each asked to "select the topic in each pair that you find most semantically coherent."
Result: evaluators prefer PAM. Topic counts:
5 votes: PAM 5
>= 4 votes: LDA 3, PAM 8
>= 3 votes: LDA 9, PAM 16

71 Example Topic Pairs with Human Evaluation
Three pairs: 1. optimization, 2. neuro-biology, 3. adaptive control systems.

72 Topic Correlations in PAM
5000 research paper abstracts, from across all of computer science. Numbers on the edges are the super-topics' Dirichlet parameters.

73 Likelihood on Held-Out Data
Likelihood comparison on NIPS abstracts: train the model on 75% of the data, then calculate likelihood on the remaining 25%. The held-out likelihood is calculated by sampling many, many documents from the model, estimating a simple mixture of multinomials from these samples, and calculating the likelihood of the held-out data under this simple mixture.

74 Likelihood Comparison
Varying the number of topics.

75 Document Classification
"Comp5" subset of the 20 Newsgroups corpus; train on 25%, test on 75%. Like naive Bayes, but using an LDA/PAM model per class instead of a multinomial. Result: roughly a 2.5% increase in test accuracy (%).
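The decision rule can be sketched as follows. For brevity, the per-class "model" here is a plain word multinomial (i.e. naive Bayes); the slide's point is that substituting a per-class LDA or PAM likelihood into the same argmax rule improves accuracy. Class names and probabilities are invented.

```python
import math

# Pick the class whose per-class model assigns the document the highest
# log-likelihood plus log prior. Toy multinomial stand-ins for LDA/PAM:
class_models = {
    "graphics": {"image": 0.5, "render": 0.4, "cpu": 0.1},
    "hardware": {"image": 0.1, "render": 0.1, "cpu": 0.8},
}
priors = {"graphics": 0.5, "hardware": 0.5}

def classify(doc_words):
    def score(c):
        return math.log(priors[c]) + sum(
            math.log(class_models[c].get(w, 1e-9)) for w in doc_words)
    return max(class_models, key=score)

label = classify(["cpu", "cpu", "image"])
```

Replacing the inner sum with a per-class topic-model likelihood leaves the rest of the classifier unchanged.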

76 Outline
Social Network Analysis with Topic Models
  Role Discovery (Author-Recipient-Topic Model, ART)
  Group Discovery (Group-Topic Model, GT)
Enhanced Topic Models
  Correlations among Topics (Pachinko Allocation, PAM)
  Time Localized Topics (Topics-over-Time Model, TOT)
  Markov Dependencies in Topics (Topical N-Grams Model, TNG)
Bibliometric Impact Measures enabled by Topics
Multi-Conditional Mixtures

77 Want to Model Trends over Time
Is the prevalence of a topic growing or waning? A pattern may appear only briefly: capture its statistics in a focused way, and don't confuse it with patterns elsewhere in time. How do roles, groups, and influence shift over time?

78 Topics over Time (TOT)
(Graphical model. For each of D documents: a multinomial over topics θ drawn from a Dirichlet prior α. For each of the document's Nd words: a topic index z; a word w drawn from that topic's multinomial over words φ, one of T, with Dirichlet prior β; and a time stamp t drawn from that topic's Beta distribution over time, with a uniform prior.)
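In TOT, each topic's Beta distribution over (normalized) time stamps is re-fit by the method of moments between Gibbs sweeps. A minimal sketch of that fit, with invented time stamps:

```python
# Method-of-moments fit of a Beta(a, b) distribution to the (normalized,
# in (0,1)) time stamps of the words assigned to one topic.
def beta_moment_match(times):
    n = len(times)
    m = sum(times) / n
    v = sum((t - m) ** 2 for t in times) / n
    common = m * (1 - m) / v - 1    # estimate of a + b
    return m * common, (1 - m) * common

# Time stamps of words assigned to a topic that peaks late in the period:
a, b = beta_moment_match([0.80, 0.85, 0.90, 0.95])
```

A topic concentrated late in the time span gets a > b, i.e. a Beta skewed toward 1; a short-lived topic gets a large a + b, i.e. a sharply peaked Beta.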

79 State of the Union Address
208 addresses delivered between January 8, 1790 and January 29, 2002. To increase the number of documents, we split the addresses into paragraphs and treated them as "documents"; one-line paragraphs were excluded, and stopwords were removed. 17,156 "documents"; 21,534 unique words; 669,425 tokens.
Example "document" (1910): "Our scheme of taxation, by means of which this needless surplus is taken from the people and put into the public Treasury, consists of a tariff or duty levied upon importations from abroad and internal-revenue taxes levied upon the consumption of tobacco and spirituous and malt liquors. It must be conceded that none of the things subjected to internal-revenue taxation are, strictly speaking, necessaries. There appears to be no just complaint of this taxation by the consumers of these articles, and there seems to be nothing so well able to bear the burden without hardship to any portion of the people."
(The State of the Union Address is an annual event in which the President of the United States reports on the status of the country, normally to a joint session of the U.S. Congress, i.e. the House of Representatives and the Senate. The address is also used to outline the President's legislative proposals for the upcoming year.)

80 Comparing TOT with LDA

81 Sample Topic: Cold War
world nations united states peace free economic military soviet international security strength defense freedom europe force peoples efforts aggression today

82 Comparing TOT against LDA

83 TOT on 17 years of NIPS proceedings

84 TOT on 17 years of NIPS proceedings
LDA

85 TOT versus LDA on my email

86 TOT improves ability to Predict Time
Predicting the year of a State-of-the-Union address. Error metric: L1 distance between the predicted year and the actual year.

87 Outline
Social Network Analysis with Topic Models
  Role Discovery (Author-Recipient-Topic Model, ART)
  Group Discovery (Group-Topic Model, GT)
Enhanced Topic Models
  Correlations among Topics (Pachinko Allocation, PAM)
  Time Localized Topics (Topics-over-Time Model, TOT)
  Markov Dependencies in Topics (Topical N-Grams Model, TNG)
Bibliometric Impact Measures enabled by Topics
Multi-Conditional Mixtures

88 Topics Modeling Phrases
Topics based only on unigrams are often difficult to interpret. Topic discovery itself is confused, because important meaning and distinctions are carried by phrases. There is a significant opportunity to provide improved language models to ASR, MT, IR, etc.

89 Topical N-gram Model
(Graphical model. For each of D documents: a sequence of topic variables z1, z2, z3, z4, ...; bigram-status variables y1, y2, y3, y4, ...; and words w1, w2, w3, w4, .... Each yi switches whether wi is drawn from its topic's unigram distribution over the W-word vocabulary or from a distribution conditioned on the previous word; T topics, with priors on all distributions.)
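A toy sketch of the generative step: a binary "bigram status" decides whether the next word comes from the topic's unigram distribution or from a distribution conditioned on the previous word. All tables and the switch probability below are invented; in the real model these are learned, and the status variable's distribution is conditioned on the previous topic and word.

```python
import random

# Invented toy tables for one topic:
unigram = {"T1": {"genetic": 0.5, "fitness": 0.3, "function": 0.2}}
bigram = {("T1", "genetic"): {"algorithms": 0.9, "function": 0.1}}

def next_word(topic, prev_word, rng=random):
    cond = bigram.get((topic, prev_word))
    p_bigram = 0.8 if cond else 0.0          # invented switch probability
    if rng.random() < p_bigram:              # y = 1: continue a phrase
        words = list(cond)
        return rng.choices(words, weights=[cond[w] for w in words])[0]
    words = list(unigram[topic])             # y = 0: topic unigram
    return rng.choices(words, weights=[unigram[topic][w] for w in words])[0]

random.seed(0)
sample = [next_word("T1", "genetic") for _ in range(200)]
```

After "genetic", the model usually emits "algorithms", which is exactly how phrases like "genetic algorithms" become topical units.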

90 LDA vs. Topical N-grams: a genetic-algorithms topic
LDA: algorithms algorithm genetic problems efficient
Topical N-grams: genetic algorithms, genetic algorithm, evolutionary computation, evolutionary algorithms, fitness function

91 Sample Topical N-gram topics
(Figures: sample LDA topics beside sample Topical N-gram topics.)

92 Topic Comparison (reinforcement learning)
LDA: learning optimal reinforcement state problems policy dynamic action programming actions function markov methods decision rl continuous spaces step policies planning
Topical N-grams (2): reinforcement learning, optimal policy, dynamic programming, optimal control, function approximator, prioritized sweeping, finite-state controller, learning system, reinforcement learning rl, function approximators, markov decision problems, markov decision processes, local search, state-action pair, markov decision process, belief states, stochastic policy, action selection, upright position, reinforcement learning methods
Topical N-grams (1): policy action states actions function reward control agent q-learning optimal goal learning space step environment system problem steps sutton policies

93 Topic Comparison (motion)
LDA: motion visual field position figure direction fields eye location retina receptive velocity vision moving system flow edge center light local
Topical N-grams (2): receptive field, spatial frequency, temporal frequency, visual motion, motion energy, tuning curves, horizontal cells, motion detection, preferred direction, visual processing, area mt, visual cortex, light intensity, directional selectivity, high contrast, motion detectors, spatial phase, moving stimuli, decision strategy, visual stimuli
Topical N-grams (1): motion response direction cells stimulus figure contrast velocity model responses stimuli moving cell intensity population image center tuning complex directions

94 Topic Comparison (speech recognition)
LDA: word system recognition hmm speech training performance phoneme words context systems frame trained speaker sequence speakers mlp frames segmentation models
Topical N-grams (2): speech recognition, training data, neural network, error rates, neural net, hidden markov model, feature vectors, continuous speech, training procedure, continuous speech recognition, gamma filter, hidden control, speech production, neural nets, input representation, output layers, training algorithm, test set, speech frames, speaker dependent
Topical N-grams (1): speech word training system recognition hmm speaker performance phoneme acoustic words context systems frame trained sequence phonetic speakers mlp hybrid

95 Outline
Social Network Analysis with Topic Models
  Role Discovery (Author-Recipient-Topic Model, ART)
  Group Discovery (Group-Topic Model, GT)
Enhanced Topic Models
  Correlations among Topics (Pachinko Allocation, PAM)
  Time Localized Topics (Topics-over-Time Model, TOT)
  Markov Dependencies in Topics (Topical N-Grams Model, TNG)
Bibliometric Impact Measures enabled by Topics
Multi-Conditional Mixtures

96 Social Networks in Research Literature
Better understand the structure of our own research area. Structure helps us learn a new field. Aid collaboration. Map how ideas travel through social networks of researchers. Aids for hiring and finding reviewers!

97 Traditional Bibliometrics
Analyzes a small amount of data (e.g. 19 articles from a single issue of a journal). Uses "journal" as a proxy for "research topic" (but there is no journal for information extraction). Uses impact measures almost exclusively based on simple citation counts. How can we use topic models to create new, interesting impact measures?

98 Our Data
Over 1 million research papers, gathered as part of the Rexa.info portal. Cross-linked references / citations.

99 Finding Topics with TNG
Traditional unigram LDA run on 1 million titles / abstracts (200 topics); ...select ~300k papers on ML, NLP, robotics, and vision...; find 200 TNG topics among those papers.

100 Topical Bibliometric Impact Measures
Topical Citation Counts; Topical Impact Factors; Topical Longevity; Topical Diversity; Topical Precedence; Topical Transfer.

101 Topical Diversity
Entropy of the topic distribution among papers that cite this paper (this topic). (Figures: example citation patterns with low diversity and with high diversity.)
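The measure itself is one line: the entropy (here in bits) of the distribution of topics among citing papers. The topic labels and counts below are invented.

```python
import math

# Topical diversity: entropy of the distribution of citing topics.
def topical_diversity(citing_topic_counts):
    total = sum(citing_topic_counts.values())
    ps = [c / total for c in citing_topic_counts.values() if c > 0]
    return -sum(p * math.log2(p) for p in ps)

narrow = topical_diversity({"speech": 98, "nlp": 2})       # cited from one field
broad = topical_diversity({"speech": 25, "nlp": 25,
                           "vision": 25, "robotics": 25})  # cited from many
```

A paper cited evenly by four topics gets 2 bits of diversity; one cited almost entirely from a single topic gets close to 0.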

102 Topical Diversity Can also be measured on particular papers...

103 Topical Precedence
"Early-ness": within a topic, what are the earliest papers that received more than n citations?
Information Retrieval:
On Relevance, Probabilistic Indexing and Information Retrieval, Kuhns and Maron (1960)
Expected Search Length: A Single Measure of Retrieval Effectiveness Based on the Weak Ordering Action of Retrieval Systems, Cooper (1968)
Relevance feedback in information retrieval, Rocchio (1971)
Relevance feedback and the optimization of retrieval effectiveness, Salton (1971)
New experiments in relevance feedback, Ide (1971)
Automatic Indexing of a Sound Database Using Self-organizing Neural Nets, Feiten and Gunzel (1982)
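Computationally, topical precedence is a simple filter-and-sort over a topic's papers: keep those with more than n citations, then take the earliest. The paper records below are invented.

```python
# "Early-ness" sketch: within a topic, the earliest papers that
# received more than n citations.
papers = [
    {"title": "A", "year": 1960, "citations": 120},
    {"title": "B", "year": 1958, "citations": 3},    # early but barely cited
    {"title": "C", "year": 1971, "citations": 90},
    {"title": "D", "year": 1968, "citations": 45},
]

def precedence(papers, n=10, k=3):
    cited = [p for p in papers if p["citations"] > n]
    return sorted(cited, key=lambda p: p["year"])[:k]

earliest = [p["title"] for p in precedence(papers)]
```

The citation threshold is what keeps obscure early papers (like "B" above) from dominating the list.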

104 Topical Precedence
"Early-ness": within a topic, what are the earliest papers that received more than n citations?
Speech Recognition:
Some experiments on the recognition of speech, with one and two ears, E. Colin Cherry (1953)
Spectrographic study of vowel reduction, B. Lindblom (1963)
Automatic Lipreading to enhance speech recognition, Eric D. Petajan (1965)
Effectiveness of linear prediction characteristics of the speech wave for..., B. Atal (1974)
Automatic Recognition of Speakers from Their Voices, B. Atal (1976)

105 Topical Transfer
Transfer from Digital Libraries to other topics (citations; paper title):
Web Pages: 31; Trawling the Web for Emerging Cyber-Communities, Kumar, Raghavan, ...
Computer Vision: 14; On being 'Undigital' with digital cameras: extending the dynamic...
Video: 12; Lessons learned from the creation and deployment of a terabyte digital video...
Graphs: Trawling the Web for Emerging Cyber-Communities
11; WebBase: a repository of Web pages

106 Topical Transfer
Citation counts from one topic to another. Maps "producers and consumers" of ideas.

107 Outline
Social Network Analysis with Topic Models
  Role Discovery (Author-Recipient-Topic Model, ART)
  Group Discovery (Group-Topic Model, GT)
Enhanced Topic Models
  Correlations among Topics (Pachinko Allocation, PAM)
  Time Localized Topics (Topics-over-Time Model, TOT)
  Markov Dependencies in Topics (Topical N-Grams Model, TNG)
Bibliometric Impact Measures enabled by Topics
Multi-Conditional Mixtures

108 Want a "topic model" with the advantages of CRFs
Use arbitrary, overlapping features of the input. Undirected graphical model, so we don't have to think about avoiding cycles. Integrate naturally with our other CRF components. Train "discriminatively". Natural semi-supervised training. What does this mean? Topic models are unsupervised!

109 "Multi-Conditional Mixtures": Latent Variable Models fit by Multi-way Conditional Probability [McCallum, Wang, Pal, 2005], [McCallum, Pal, Wang, 2006]
For clustering structured data, a la Latent Dirichlet Allocation and its successors, but an undirected model, like the Harmonium [Welling, Rosen-Zvi, Hinton, 2005], trained with a "multi-conditional" objective: O = P(A|B,C) P(B|A,C) P(C|A,B), where e.g. A, B, C are different modalities.

110 Objective Functions for Parameter Estimation
Traditional, joint training (e.g. naive Bayes, most topic models); traditional mixture models (e.g. LDA). Traditional, conditional training (e.g. MaxEnt classifiers, CRFs); conditional mixtures (e.g. Jebara's CEM). New, multi-conditional: mostly conditional, with generative regularization; multi-conditional for semi-supervised learning; multi-conditional for transfer learning (two tasks, shared hidden variables).
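What the multi-conditional objective measures can be shown on a tiny two-modality example, O = P(A|B) P(B|A), computed here from an explicit joint table. In the actual models the joint is parameterized and O is maximized; the numbers below are invented.

```python
import math

# An explicit joint P(A, B) over two binary-ish modalities:
joint = {("a1", "b1"): 0.4, ("a1", "b2"): 0.1,
         ("a2", "b1"): 0.1, ("a2", "b2"): 0.4}

def multi_conditional_log(a, b):
    # log O = log P(a|b) + log P(b|a), from the joint and its marginals.
    p_ab = joint[(a, b)]
    p_a = sum(v for (x, _), v in joint.items() if x == a)
    p_b = sum(v for (_, y), v in joint.items() if y == b)
    return math.log(p_ab / p_b) + math.log(p_ab / p_a)

good = multi_conditional_log("a1", "b1")   # strongly associated pair
bad = multi_conditional_log("a1", "b2")    # weakly associated pair
```

Unlike the joint likelihood, this objective rewards each modality being predictable from the others, which is the sense in which it is "mostly conditional."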

111 “Multi-Conditional Learning” (Regularization)
[McCallum, Pal, Wang, 2006]

112 Predictive Random Fields: mixture of Gaussians on synthetic data [McCallum, Wang, Pal, 2005]
(Figure panels: the data, classified by color; generatively trained; multi-conditional; conditionally trained [Jebara 1998].)

113 Multi-Conditional Mixtures vs. Harmonium on a document retrieval task
[McCallum, Wang, Pal, 2005]
(Curves: multi-conditional, multi-way conditionally trained; conditionally trained to predict class labels; Harmonium, trained jointly with class labels and words; Harmonium, trained jointly with words only, no labels.)

114 Outline
Social Network Analysis with Topic Models
  Role Discovery (Author-Recipient-Topic Model, ART)
  Group Discovery (Group-Topic Model, GT)
Enhanced Topic Models
  Correlations among Topics (Pachinko Allocation, PAM)
  Time Localized Topics (Topics-over-Time Model, TOT)
  Markov Dependencies in Topics (Topical N-Grams Model, TNG)
Bibliometric Impact Measures enabled by Topics
Multi-Conditional Mixtures

115 Summary

116 Assigning topics to documents
Build a 200-topic n-gram topic model on 300k documents. Remove stopword or methodological topics (e.g. "efficient, fast, speed"). For each document d: if more than 10% of d's tokens are assigned to topic t, and that amounts to more than two tokens, assign d to t. Each topic is now an intellectual "domain" that includes some number of documents. We can substitute topic for journal in most traditional bibliometric indicators, and we can also define several new indicators.
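The assignment rule on this slide is easy to state in code. The topic names and token counts below are invented.

```python
# Assign document d to topic t when t covers more than 10% of d's tokens
# AND more than two tokens in absolute terms.
def assign_topics(topic_token_counts, total_tokens):
    return {t for t, c in topic_token_counts.items()
            if c / total_tokens > 0.10 and c > 2}

domains = assign_topics({"speech": 30, "hmm": 12, "misc": 2}, total_tokens=100)
```

The absolute floor of two tokens keeps very short documents from being assigned to a topic on the strength of a single token.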

117 Impact Factor
Journal Impact Factor: citations from articles published in 2004 to articles Cell published in the two preceding years, divided by the number of articles Cell published in those years.
Impact factors from JCR: Nature 32.182; Cell 28.389; JMLR 5.952; Machine Learning 3.258.

118 Topic Impact Factor

119 Broad Impact: Diffusion
Journal Diffusion: the number of journals citing Cell, divided by the total number of citations to Cell over a given time period, times 100. Problem: relatively brittle at low citation counts. If a topic/journal is cited only twice, by two different topics/journals, it will have high diffusion.

120 Broad Impact: Diversity
Topic Diversity: entropy of the distribution of citing topics. Better at capturing the broad end of the impact spectrum: under diffusion, the high-diffusion topics are identical to the least frequently cited topics.

121 Broad Impact: Diversity
Topic Diversity: entropy of the distribution of citing topics. Topic diversity can also be measured for papers:

122 Longevity: Cited Half-Life
Two views: given a paper, what is the median age of the citations to that paper? And, what is the median age of the citations made from the current literature?
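The first view is essentially a one-liner: the median age of a paper's received citations. The publication and citation years below are invented.

```python
import statistics

# Cited half-life, first view: median age of the citations a paper
# has received.
def cited_half_life(pub_year, citing_years):
    return statistics.median(y - pub_year for y in citing_years)

half_life = cited_half_life(1990, [1991, 1992, 1995, 1999, 2001])
```

A long half-life indicates work that keeps being cited years after publication; substituting topics for journals gives per-topic longevity the same way.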

123 History: Topical Precedence
Within a topic, what are the earliest papers that received more than n citations?
Information Retrieval (138):
On Relevance, Probabilistic Indexing and Information Retrieval, Kuhns and Maron (1960)
Expected Search Length: A Single Measure of Retrieval Effectiveness Based on the Weak Ordering Action of Retrieval Systems, Cooper (1968)
Relevance feedback in information retrieval, Rocchio (1971)
Relevance feedback and the optimization of retrieval effectiveness, Salton (1971)
New experiments in relevance feedback, Ide (1971)
Automatic Indexing of a Sound Database Using Self-organizing Neural Nets, Feiten and Gunzel (1982)

