Topic Models


1 Topic Models
John Unsworth, “Scholarly Primitives” (2000): Discovering, Annotating, Comparing, Referring, Sampling, Illustrating, Representing
Evidentiary Primitives: Digitizing (and OCR); Collecting (and cleaning: “scrubbing,” “wrangling,” “munging,” etc.); Organizing (Clustering, Classifying, etc.), including topic modeling
Hermeneutic Primitives: Narrating, Arguing, Reframing (altering the context, etc.)

2 Topic Models
Idea of Topic Modeling (simplified)
Generally: an “unsupervised” method of creating a simplified representation of a body of materials.
Specifically: an unsupervised method of representing each document in a corpus as a mixture of topics (a probability distribution over a set of topics).

3 Topic Models
Goals of Topic Modeling (simplified): Clustering; Hypothesis Forming; Exploring; Verifying/Proving?

4 Topic Models
Logical Process of Topic Modeling (simplified)
Edwin Chen, "Introduction to Latent Dirichlet Allocation" (2011):
Suppose you have the following set of sentences:
1. I like to eat broccoli and bananas.
2. I ate a banana and spinach smoothie for breakfast.
3. Chinchillas and kittens are cute.
4. My sister adopted a kitten yesterday.
5. Look at this cute hamster munching on a piece of broccoli.
What is latent Dirichlet allocation? It’s a way of automatically discovering topics that these sentences contain. For example, given these sentences and asked for 2 topics, LDA might produce something like:
Sentences 1 and 2: 100% Topic A
Sentences 3 and 4: 100% Topic B
Sentence 5: 60% Topic A, 40% Topic B
Topic A: 30% broccoli, 15% bananas, 10% breakfast, 10% munching, … (at which point, you could interpret topic A to be about food)
Topic B: 20% chinchillas, 20% kittens, 20% cute, 15% hamster, … (at which point, you could interpret topic B to be about cute animals)
The question, of course, is: how does LDA perform this discovery?
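The discovery Chen describes can be sketched with a tiny collapsed Gibbs sampler run on his five sentences. This is only an illustrative sketch of how LDA-style inference works, not Chen's code or MALLET's implementation; the stop word list, priors, and iteration count are assumptions made for the example.

```python
# A minimal collapsed Gibbs sampler for LDA, run on Edwin Chen's five
# example sentences. Illustrative only: real tools (e.g., MALLET) add
# hyperparameter optimization and far more iterations.
import random
from collections import defaultdict

docs = [
    "i like to eat broccoli and bananas",
    "i ate a banana and spinach smoothie for breakfast",
    "chinchillas and kittens are cute",
    "my sister adopted a kitten yesterday",
    "look at this cute hamster munching on a piece of broccoli",
]
stop = {"i", "to", "and", "a", "for", "are", "my", "at", "this", "on", "of"}
tokenized = [[w for w in d.split() if w not in stop] for d in docs]

K, alpha, beta = 2, 0.1, 0.01               # topics and Dirichlet priors
vocab = sorted({w for doc in tokenized for w in doc})
V = len(vocab)

random.seed(0)
# z[d][i] = topic currently assigned to the i-th token of document d
z = [[random.randrange(K) for _ in doc] for doc in tokenized]
ndk = [[0] * K for _ in tokenized]          # doc-topic counts
nkw = [defaultdict(int) for _ in range(K)]  # topic-word counts
nk = [0] * K                                # total tokens per topic
for d, doc in enumerate(tokenized):
    for i, w in enumerate(doc):
        k = z[d][i]
        ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1

for _ in range(200):                        # Gibbs sweeps
    for d, doc in enumerate(tokenized):
        for i, w in enumerate(doc):
            k = z[d][i]                     # remove current assignment
            ndk[d][k] -= 1; nkw[k][w] -= 1; nk[k] -= 1
            # resample proportional to P(topic | doc) * P(word | topic)
            weights = [(ndk[d][j] + alpha) * (nkw[j][w] + beta) / (nk[j] + V * beta)
                       for j in range(K)]
            k = random.choices(range(K), weights=weights)[0]
            z[d][i] = k
            ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1

# Doc-topic proportions, analogous to "Sentence 5: 60% Topic A, 40% Topic B"
theta = [[(ndk[d][k] + alpha) / (len(doc) + K * alpha) for k in range(K)]
         for d, doc in enumerate(tokenized)]
for d, dist in enumerate(theta):
    print(f"sentence {d + 1}:", [round(p, 2) for p in dist])
```

With only five short sentences the exact percentages vary by random seed, but the shape of the output matches Chen's example: each sentence gets a distribution over the two topics, and each topic a distribution over words.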

5 Topic Models
Logical Process of Topic Modeling (simplified)
Ted Underwood, "Topic Modeling Made Just Simple Enough" (2012)

6 Topic Models
Logical Process of Topic Modeling (even more simplified)
Treat a document or set of documents as a “bag of words.”
Use the Latent Dirichlet Allocation (LDA) algorithm to hypothesize the generation of the documents from sub-”bags” of words (“topics”) that tend to collocate (by means of MALLET, the MAchine Learning for LanguagE Toolkit).
Show each topic: for each topic, show the 10 or so words belonging to the topic that occur most frequently. Visualize in a word cloud (or by other means).
Assume that a “topic” (sub-bag of words) is a “theme.”
Andrew Goldstone's interface (Dfr-Browser) for browsing topic models created from JSTOR journals: topic model of PMLA, 1889–2007
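The "show the 10 or so most frequent words per topic" step can be sketched in a few lines of Python. The topic distribution below is invented for illustration (modeled on Chen's "food" topic); MALLET would estimate these probabilities from a corpus.

```python
# Sketch of the "top words per topic" step: given a topic's word
# probabilities (invented here for illustration), list the n most
# probable words, as in a topic-keys listing or word-cloud input.
def top_words(topic_dist, n=10):
    """Return the n most probable words in a topic, highest first."""
    return [w for w, p in sorted(topic_dist.items(),
                                 key=lambda item: item[1], reverse=True)[:n]]

# An invented "food" topic like Topic A in Chen's example.
topic_a = {"broccoli": 0.30, "bananas": 0.15, "breakfast": 0.10,
           "munching": 0.10, "smoothie": 0.08, "spinach": 0.07,
           "eat": 0.06, "banana": 0.05, "ate": 0.05, "like": 0.02,
           "piece": 0.02}

print(top_words(topic_a, 5))
# → ['broccoli', 'bananas', 'breakfast', 'munching', 'smoothie']
```

Interpreting such a list as a "theme" (here, food) is the hermeneutic leap the slide describes: the algorithm supplies only the word list.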

7 Topic Models (plus other text analysis)

8 Topic Models
Hermeneutical Moves at a Low Level in the Topic Modeling Process (not covered in today’s workshop): Scrubbing and Creating a Stop Word List; Chunking Texts; Predefining the Number of Topics to Look for
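These low-level moves can be sketched in Python. The stop word list and chunk size below are illustrative assumptions, which is precisely the slide's point: each such choice is an interpretive act that shapes the resulting model.

```python
# Sketch of the preparatory moves named above: scrubbing, stop word
# removal, and chunking. The stop word list and chunk size are
# illustrative choices, not standard values.
import re

STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in"}  # tiny sample list

def scrub(text):
    """Lowercase and strip everything but letters and spaces."""
    return re.sub(r"[^a-z\s]", "", text.lower())

def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]

def chunk(tokens, size):
    """Split a token list into fixed-size 'documents' for the model."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

raw = "The OLD WOMAN, and the axe; Raskolnikov in the doorway."
tokens = remove_stop_words(scrub(raw).split())
print(tokens)            # → ['old', 'woman', 'axe', 'raskolnikov', 'doorway']
print(chunk(tokens, 2))  # → [['old', 'woman'], ['axe', 'raskolnikov'], ['doorway']]
```

Chunking matters because LDA treats each chunk as one "document": chunks that are too long blur topics together, while chunks that are too short starve the model of co-occurrence evidence.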

9 Topic Models
Matt Burton, "The Joy of Topic Modeling": “… the brown squiggles along the bottom represent a vocabulary of words and the grey peaks represent individual word’s probability density…. The list of top words, words that are “heavy” with more probabilistic mass, are the interesting group of words to examine because they are the co-occurring words in that topic distribution.”

10 Topic Models – A Probabilistic Universe

11 Boris Tomashevsky’s example of a narrative motif (theme) (“Thematics,” 1925): “Raskolnikov kills the old woman”
Probabilistic rewriting: “There is a 74% chance that in this document Raskolnikov kills (82%) / wounds (15%) / ignores (3%) the old woman (68%) / young woman (23%) / other (9%).”
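The probabilistic rewriting above can be sketched as sampling from categorical distributions. The percentages come from the slide; the function and variable names are illustrative inventions.

```python
# Sketch of the "probabilistic universe": each element of the motif is a
# categorical distribution, and one reading is one sample from it.
# Percentages are the slide's own; the code structure is illustrative.
import random

ACTIONS = [("kills", 0.82), ("wounds", 0.15), ("ignores", 0.03)]
OBJECTS = [("the old woman", 0.68), ("the young woman", 0.23), ("other", 0.09)]

def sample(dist):
    """Draw one outcome from a list of (value, probability) pairs."""
    values, weights = zip(*dist)
    return random.choices(values, weights=weights)[0]

random.seed(2025)
if random.random() < 0.74:  # 74% chance the motif occurs in this document
    print(f"Raskolnikov {sample(ACTIONS)} {sample(OBJECTS)}.")
else:
    print("The motif does not occur in this document.")
```

Run repeatedly with different seeds, this mostly prints "Raskolnikov kills the old woman," which is the point: the deterministic motif becomes the most probable reading rather than the only one.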

