
1 Building Topic Models in a Federated Digital Library Through Selective Document Exclusion ASIST 2011 New Orleans, LA October 10, 2011 Miles Efron, Peter Organisciak, Katrina Fenlon Graduate School of Library & Information Science, University of Illinois, Urbana-Champaign Supported by IMLS LG-06-07-0020.

2 The Setting: IMLS DCC collection(s) [Diagram: data providers (IMLS NLG & LSTA) expose metadata, which the service provider (DCC) harvests via OAI-PMH to build the DCC services.]

3 High-Level Research Interest Improve “access” to data harvested for federated digital libraries by enhancing: – the representation of documents – the representation of document aggregations – and by capitalizing on the relationship between aggregations and documents. PS: by “document” I mean a single metadata (usually Dublin Core) record.

4 Motivation for our Work Most empirical approaches to this type of problem rely on some form of term-count analysis. This is unreliable for our data: – vocabulary mismatch – poor probability estimates

5 The Setting: IMLS DCC

6 The Problem: Supporting End-User Experience Current: full-text search; browse by “subject.” Desired: – improved browsing – support for high-level aggregation understanding and resource discovery Approach: empirically induced “topics” using established methods, e.g. latent Dirichlet allocation (LDA).

7

8

9

10

11 Research Question Can we improve induced models by mitigating the influence of noisy data, common in federated digital library settings? Hypothesis: Harvested records are not all useful for training a model of corpus-level topics. Approach: Identify and remove “weakly topical” documents during model training.

12 Latent Dirichlet Allocation Given a corpus of documents C and an empirically chosen integer k, assume that a generative process involving k latent topics generated the word occurrences in C. End result, for a given word w and a given document D: – Pr(w | T_i) – Pr(D | T_i) – Pr(T_i) for each topic T_1 … T_k

13 Latent Dirichlet Allocation (continued) The generative process: 1. Choose the document length N ~ Poisson(mu). 2. Choose a topic-proportion vector theta ~ Dir(alpha). 3. For each word position n in 1:N: a) choose a topic z_n ~ Multinomial(theta); b) choose the word w_n from Pr(w_n | z_n, beta).
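To make the generative story concrete, here is a minimal NumPy sketch that samples one toy document by exactly these three steps. The parameter values (k, the vocabulary size, mu, alpha, beta) are illustrative assumptions, not values drawn from the DCC collection.

```python
import numpy as np

rng = np.random.default_rng(0)

k = 5                # number of latent topics (chosen empirically)
vocab_size = 1000    # toy vocabulary size
mu = 50              # mean document length for the Poisson draw
alpha = np.full(k, 0.1)                                   # Dirichlet prior over topic proportions
beta = rng.dirichlet(np.full(vocab_size, 0.01), size=k)   # per-topic word distributions

def generate_document():
    n = rng.poisson(mu)                        # 1. choose document length N ~ Poisson(mu)
    theta = rng.dirichlet(alpha)               # 2. choose topic proportions theta ~ Dir(alpha)
    words = []
    for _ in range(n):                         # 3. for each word position:
        z = rng.choice(k, p=theta)             #    a) choose a topic z_n ~ Multinomial(theta)
        w = rng.choice(vocab_size, p=beta[z])  #    b) choose a word w_n from Pr(w_n | z_n, beta)
        words.append(int(w))
    return words

print(generate_document()[:10])   # first ten word ids of a sampled toy document
```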

14 Latent Dirichlet Allocation (continued) Calculate the estimates Pr(w | T_i), Pr(D | T_i), and Pr(T_i) via iterative methods: MCMC / Gibbs sampling.
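The slides do not include the estimator itself; for reference, below is a generic collapsed Gibbs sampler for LDA. It is a textbook sketch with assumed hyperparameters (alpha, eta) and iteration count, not the code used for the DCC corpus.

```python
import numpy as np

def gibbs_lda(docs, vocab_size, k, alpha=0.1, eta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA. `docs` is a list of lists of word ids."""
    rng = np.random.default_rng(seed)
    ndk = np.zeros((len(docs), k))    # document-topic counts
    nkw = np.zeros((k, vocab_size))   # topic-word counts
    nk = np.zeros(k)                  # total tokens assigned to each topic
    z = []                            # current topic assignment of every token
    for d, doc in enumerate(docs):    # random initialization
        zd = rng.integers(k, size=len(doc))
        z.append(zd)
        for w, t in zip(doc, zd):
            ndk[d, t] += 1; nkw[t, w] += 1; nk[t] += 1
    for _ in range(iters):            # resample each token's topic in turn
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                t = z[d][n]
                ndk[d, t] -= 1; nkw[t, w] -= 1; nk[t] -= 1
                p = (ndk[d] + alpha) * (nkw[:, w] + eta) / (nk + vocab_size * eta)
                t = rng.choice(k, p=p / p.sum())
                z[d][n] = t
                ndk[d, t] += 1; nkw[t, w] += 1; nk[t] += 1
    phi = (nkw + eta) / (nk[:, None] + vocab_size * eta)             # Pr(w | T)
    theta = (ndk + alpha) / (ndk.sum(1, keepdims=True) + k * alpha)  # Pr(T | D); Pr(D | T) and Pr(T) follow via Bayes' rule
    return phi, theta
```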

15 Full Corpus

16 [Diagram: the full corpus is passed through the proposed algorithm.]

17 [Diagram: the reduced corpus is used to train the model, yielding Pr(w | T), Pr(D | T), and Pr(T).]

18 [Diagram: the trained estimates Pr(w | T), Pr(D | T), and Pr(T) are then used for inference over the full corpus.]
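A minimal sketch of this train-on-the-reduced-corpus, infer-on-the-full-corpus pattern, using gensim as an assumed toolkit (gensim's LdaModel estimates by online variational Bayes rather than Gibbs sampling, but yields the same kinds of estimates). The texts and the toy topic count are placeholders.

```python
from gensim import corpora, models

# reduced_texts: tokenized records kept for training; excluded_texts: "stop documents"
reduced_texts = [["civil", "war", "regiment"], ["quilt", "cotton", "pattern"]]   # toy data
excluded_texts = [["photograph", "untitled", "unknown"]]

dictionary = corpora.Dictionary(reduced_texts + excluded_texts)
train_corpus = [dictionary.doc2bow(t) for t in reduced_texts]

# k chosen empirically; a toy value here
lda = models.LdaModel(train_corpus, id2word=dictionary, num_topics=2, passes=10)

# After training, fold the excluded documents back in via LDA inference.
for text in excluded_texts:
    bow = dictionary.doc2bow(text)
    print(lda.get_document_topics(bow))
```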

19 Sample Topics Induced from “Raw” Data

20 Documents’ Topical Strength Hypothesis: Harvested records are not all useful for training a model of corpus-level topics.

21 Documents’ Topical Strength Hypothesis: Harvested records are not all useful for training a model of corpus-level topics. Proposal: Improve the induced topic model by removing “weakly topical” documents during training. After training, use the inferential apparatus of LDA to assign topics to these “stop documents.”

22 Identifying “Stop Documents” The time at which documents enter a repository is often informative (e.g. bulk uploads). Score each document by log Pr(d_i | M_C), where M_C is the collection language model and d_i comprises the words of the i-th document.
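A minimal sketch of this score, assuming an unsmoothed maximum-likelihood unigram collection model (the slide does not specify the estimation or smoothing details):

```python
from collections import Counter
import math

def collection_lm_score(doc_tokens, collection_counts, collection_len):
    """log Pr(d_i | M_C): log-likelihood of a document's tokens under the
    collection unigram language model M_C."""
    # every token in d_i appears in the collection by construction,
    # so the maximum-likelihood probability is never zero here
    return sum(math.log(collection_counts[t] / collection_len) for t in doc_tokens)

docs = [["civil", "war", "regiment"], ["civil", "war", "letter"], ["quilt", "pattern"]]
coll = Counter(t for d in docs for t in d)
total = sum(coll.values())
print([round(collection_lm_score(d, coll, total), 2) for d in docs])
```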

23 Identifying “Stop Documents” Our paper outlines an algorithm for accomplishing this. Intuition: – Given a document d_i, decide whether it is part of a “run” of near-identical records. – Remove all records that occur within a run. – The amount of homogeneity required to identify a run is governed by a parameter tol, a cumulative-normal confidence level (e.g. 95% or 99%).
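The algorithm itself is not reproduced on the slide; the sketch below is only one plausible reading of the intuition, in which documents are scored in harvest order and a “run” is a stretch of consecutive records whose scores are nearly identical, with “nearly” set by the cumulative-normal tol parameter. The function name, the gap test, and the minimum run length are all assumptions.

```python
import numpy as np
from scipy.stats import norm

def run_members(scores, tol=0.95, min_run=3):
    """Flag documents that sit inside a 'run' of near-identical records.

    scores  : log Pr(d_i | M_C) for documents in harvest order
    tol     : cumulative-normal confidence level (e.g. 0.95 or 0.99)
    min_run : smallest number of consecutive documents counted as a run
    """
    gaps = np.abs(np.diff(scores))
    # treat a gap as negligible if it falls in the lowest (1 - tol) tail of a
    # normal distribution fitted to all observed gaps (a hypothetical criterion)
    negligible = norm.cdf(gaps, loc=gaps.mean(), scale=gaps.std()) < (1 - tol)
    runs, start = [], None
    for i, neg in enumerate(negligible):
        if neg and start is None:
            start = i                                   # run begins at document i
        elif not neg and start is not None:
            runs.append(range(start, i + 1)); start = None
    if start is not None:
        runs.append(range(start, len(scores)))
    return sorted({d for r in runs if len(r) >= min_run for d in r})
```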

24

25

26 Sample Topics Induced from Groomed Data

27 Experimental Assessment Question: Are topics built from “sampled” corpora more coherent than topics induced from raw corpora? Intrusion detection: – Find the 10 most probable words for topic T_i. – Replace one of these 10 with a word chosen from the corpus with uniform probability. – Ask human assessors to identify the “intruder” word.
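A sketch of how such an intrusion task could be assembled from a trained model. The gensim API is assumed (lda is a trained LdaModel and dictionary its Dictionary, e.g. from the sketch after slide 18), and the function name is hypothetical.

```python
import random

def make_intrusion_task(lda, dictionary, topic_id, rng=None):
    """Build one word-intrusion task: the topic's 10 most probable words with
    one of them replaced by an 'intruder' drawn uniformly from the vocabulary."""
    rng = rng or random.Random(0)
    top = [w for w, _ in lda.show_topic(topic_id, topn=10)]
    candidates = [dictionary[i] for i in range(len(dictionary)) if dictionary[i] not in top]
    intruder = rng.choice(candidates)
    words = list(top)
    words[rng.randrange(len(words))] = intruder   # replace one top word with the intruder
    rng.shuffle(words)                            # hide the intruder's position
    return words, intruder
```

The returned word list is shown to an assessor, who tries to pick out the intruder; the second return value is the answer key.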

28 Experimental Assessment For each topic T_i, have 20 assessors try to find an intruder (20 different intruders). Repeat for both the “sampled” and “raw” models, i.e. 20 × 2 × 100 = 4,000 assessments. Let A_si be the percentage of workers who correctly found the intruder in the i-th topic of the sampled model, and A_ri the analogous quantity for the raw model. Testing the hypothesis A_si > A_ri yields p < 0.001.
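The slide does not name the significance test. As an illustration only, a one-sided paired t-test over the per-topic accuracies could be run as below; the accuracy arrays are synthetic stand-ins, not the study's data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic per-topic accuracies: fraction of the 20 assessors who found the
# intruder in each of 100 topics, for the sampled and raw models.
A_s = rng.binomial(20, 0.75, size=100) / 20
A_r = rng.binomial(20, 0.55, size=100) / 20

t_stat, p_value = stats.ttest_rel(A_s, A_r, alternative="greater")  # requires SciPy >= 1.6
print(f"t = {t_stat:.2f}, one-sided p = {p_value:.3g}")
```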

29 Experimental Assessment For each topic T_i, have 20 workers subjectively assess the topic’s “coherence,” reporting on a 4-point Likert scale.

30 Current & Future Work – Testing breadth of coverage – Assessing the value of induced topics – Topic information for document priors in the language-modeling IR framework [next slide] – Massive document expansion for improved language-model estimation [under review]
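As a sketch of what topic-informed document priors could look like in the query-likelihood framework: the prior enters as an additive log term in the retrieval score. The Dirichlet smoothing and the form of the prior are assumptions; the slides do not specify them.

```python
import math

def query_likelihood(query_terms, doc_tf, doc_len, coll_tf, coll_len,
                     log_prior=0.0, mu=2000):
    """score(Q, D) = log Pr(D) + sum_q log Pr(q | D), with a Dirichlet-smoothed
    document language model; log_prior could encode a document's topical strength."""
    score = log_prior
    for q in query_terms:
        p_coll = coll_tf.get(q, 0) / coll_len
        p = (doc_tf.get(q, 0) + mu * p_coll) / (doc_len + mu)
        if p > 0:                      # skip terms unseen in the whole collection
            score += math.log(p)
    return score
```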

31 Weak Topicality and Document Priors

32 Weak Topicality and Document Priors

33 Thank You ASIST 2011 New Orleans, LA October 10, 2011 Miles Efron, Peter Organisciak, Katrina Fenlon Graduate School of Library & Information Science, University of Illinois, Urbana-Champaign

