More Text Analytics University of Illinois at Urbana-Champaign.

1 More Text Analytics University of Illinois at Urbana-Champaign

2 Outline Concept Tracking –Emotion Tracking Topic Modeling Attendee Project Work

3 Concept Tracking

4 Text Analytics: Concept Tracking Given: a set of documents. Given: a set of concepts and related words. Find: the concepts in the set of documents, using the related words and a synonym network. Concepts can then be displayed with additional metadata from the documents, for timeline or GIS mapping. A specific example is Emotion Tracking.
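
The idea on this slide can be sketched in a few lines. This is only a minimal illustration: the concept-to-word lists are hypothetical, the synonym network is left out, and only exact word matches are counted.

```python
import re
from collections import Counter

# Hypothetical concept -> related-words lists (illustrative, not the SEASR data).
CONCEPT_WORDS = {
    "joy": {"happy", "joyful", "delightful"},
    "fear": {"afraid", "fearful", "horrible"},
}

def track_concepts(text, concept_words=CONCEPT_WORDS):
    """Count how often each concept's related words occur in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for token in tokens:
        for concept, words in concept_words.items():
            if token in words:
                counts[concept] += 1
    return counts

doc = "It was a delightful morning, but the horrible storm made us afraid."
print(track_concepts(doc))
```

Counts like these, taken per document or per segment, are what a timeline or map display would then plot.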

5 SEASR @ Work – Emotion Tracking Goal: a visualization of this type that tracks emotions across a text document

6 Text Analytics: Emotion Tracking Sentiment Analysis

7 Classifying text based on its sentiment –Determining the attitude of a speaker or a writer –Determining whether a review is positive/negative Ask: What emotion is being conveyed within a body of text? –Look only at adjectives (lots of issues and challenges) Need to answer: –What emotions to track? –How to assign an adjective to one of the selected emotions? –How to visualize the results?

8 Sentiment Analysis: Emotion Selection Which emotions: Parrott’s classification (2001) –six core emotions –Love, Joy, Surprise, Anger, Sadness, Fear

9 Sentiment Analysis: Emotions

10 Sentiment Analysis: Using Adjectives How to classify adjectives: –Lots of metrics we could use … Lists of adjectives already classified –Need a “nearness” metric for missing adjectives –Using a thesaurus to find a path between words Need a metric to compare the paths –Assume the longer the path, the “farther away” the two words are –No antonyms –No colloquialisms or slang

11 Ontological Association (WordNet) As of 2006, the database contains about 150,000 words organized in over 115,000 synsets for a total of 207,000 word-sense pairs
POS         Unique Strings   Synsets   Total Word-Sense Pairs
Noun        117798           82115     146312
Verb        11529            13767     25047
Adjective   21479            18156     30002
Adverb      4481             3621      5580
Totals      155287           117659    206941

12 Ontological Association (WordNet) Search for table
Noun
–S: (n) table, tabular array (a set of data arranged in rows and columns) "see table 1"
–S: (n) table (a piece of furniture having a smooth flat top that is usually supported by one or more vertical legs) "it was a sturdy table"
–S: (n) table (a piece of furniture with tableware for a meal laid out on it) "I reserved a table at my favorite restaurant"
–S: (n) mesa, table (flat tableland with steep edges) "the tribe was relatively safe on the mesa but they had to descend into the valley for water"
–S: (n) table (a company of people assembled at a table for a meal or game) "he entertained the whole table with his witty remarks"
–S: (n) board, table (food or meals in general) "she sets a fine table"; "room and board"
Verb
–S: (v) postpone, prorogue, hold over, put over, table, shelve, set back, defer, remit, put off (hold back to a later time) "let's postpone the exam"
–S: (v) table, tabularize, tabularise, tabulate (arrange or enter in tabular form)

13 Sentiment Analysis For example, how would you get from delightful to rainy?

14 SEASR: Sentiment Analysis How to get from delightful to rainy? –['delightful', 'fair', 'balmy', 'moist', 'rainy'] sexy to joyless? –['sexy', 'provocative', 'blue', 'joyless'] bitter to lovable? –['bitter', 'acerbic', 'tangy', 'sweet', 'lovable']

15 SEASR: Sentiment Analysis Introducing SynNet: a traversable graph of synonyms (adjectives)

16 Thesaurus Network (SynNet) Used a thesaurus to create a link between every term and its synonyms Created a large network Need to determine a metric for assigning each adjective to one of our selected terms –Is there a path? –How to evaluate the best paths?
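
A small sketch of how such a network could be built, assuming the thesaurus is available as a mapping from each headword to its listed synonyms. The entries below are made up for illustration, and the directed-graph representation is an assumption rather than SynNet's actual storage.

```python
from collections import defaultdict

# Hypothetical thesaurus entries: headword -> listed synonyms (illustrative data).
THESAURUS = {
    "rainy": ["moist", "wet"],
    "moist": ["rainy", "watery"],
    "watery": ["moist", "bland"],
    "bland": ["watery", "pleasant"],
    "pleasant": ["bland"],
}

def build_synnet(thesaurus):
    """Return a directed graph with an edge term -> synonym for every entry."""
    graph = defaultdict(set)
    for term, synonyms in thesaurus.items():
        for synonym in synonyms:
            graph[term].add(synonym)
            graph.setdefault(synonym, set())  # make sure every word is a node
    return dict(graph)

synnet = build_synnet(THESAURUS)
print(sorted(synnet["rainy"]))    # ['moist', 'wet']
print("rainy" in synnet["wet"])   # False: "wet" has no entry linking back
```

Keeping the links directed preserves the information needed later to tell symmetric links (both words list each other) from one-way links.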

17 SynNet Metrics Path length; number of paths; common nodes; symmetry (a → b and b → a); unique nodes in all paths
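
As one possible reading of the symmetry metric, a link counts as symmetric when each word lists the other as a synonym, and a path counts as symmetric when every link on it does. The sketch below assumes that interpretation and uses a made-up directed graph.

```python
# Hypothetical directed synonym links: word -> words its thesaurus entry lists.
SYNNET = {
    "rainy": {"moist", "damp"},
    "moist": {"rainy", "watery"},
    "watery": {"moist"},
    "damp": set(),  # no entry pointing back to "rainy"
}

def is_symmetric_link(a, b, graph=SYNNET):
    """True when a lists b as a synonym and b lists a."""
    return b in graph.get(a, set()) and a in graph.get(b, set())

def is_symmetric_path(path, graph=SYNNET):
    """True when every consecutive pair on the path is linked in both directions."""
    return all(is_symmetric_link(a, b, graph) for a, b in zip(path, path[1:]))

print(is_symmetric_path(["rainy", "moist", "watery"]))  # True
print(is_symmetric_path(["rainy", "damp"]))             # False
```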

18 SynNet Metrics: Path Length Rainy to Pleasant –Shortest path length is 4 (blue) Rainy, Moist, Watery, Bland, Pleasant –Green path has length of 3 but is not reachable via symmetry –Blue nodes are nodes 2 hops away
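
The path-length metric can be sketched with a breadth-first search. The link list below is hypothetical and was chosen only so that the search reproduces the Rainy to Pleasant path from this slide; links are treated as undirected for simplicity.

```python
from collections import deque

# Hypothetical undirected synonym links, chosen to reproduce the slide's example path.
LINKS = [
    ("rainy", "moist"), ("moist", "watery"), ("watery", "bland"),
    ("bland", "pleasant"), ("rainy", "wet"), ("wet", "damp"),
]

def neighbors(word):
    """All words linked to `word`, in either direction."""
    return sorted({b for a, b in LINKS if a == word} | {a for a, b in LINKS if b == word})

def shortest_path(start, goal):
    """Breadth-first search; returns the first shortest path found, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in neighbors(path[-1]):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

path = shortest_path("rainy", "pleasant")
print(path, "length:", len(path) - 1)
```

Path length here is the number of hops, so the five-word path above has length 4, matching the slide.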

19 SynNet Metrics: Common Nodes Common Nodes –depth of common nodes Example –Top shows happy –Bottom shows delightful –Common nodes shown in center cluster
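
A sketch of the common-nodes metric, assuming it means the words reachable from both adjectives within a fixed number of hops, together with their depth from each side. The graph is invented for illustration.

```python
# Hypothetical undirected synonym neighbourhoods (illustrative data only).
SYNNET = {
    "happy": {"glad", "cheerful"},
    "delightful": {"pleasing", "cheerful"},
    "glad": {"happy", "pleased"},
    "cheerful": {"happy", "delightful"},
    "pleasing": {"delightful", "pleased"},
    "pleased": {"glad", "pleasing"},
}

def within_depth(word, depth, graph=SYNNET):
    """Map every node reachable in at most `depth` hops to its hop count."""
    reached = {word: 0}
    frontier = {word}
    for hop in range(1, depth + 1):
        frontier = {n for w in frontier for n in graph.get(w, ()) if n not in reached}
        for node in frontier:
            reached[node] = hop
    return reached

def common_nodes(a, b, depth, graph=SYNNET):
    """Nodes within `depth` hops of both words, with their depth from each."""
    ra, rb = within_depth(a, depth, graph), within_depth(b, depth, graph)
    return {n: (ra[n], rb[n]) for n in ra.keys() & rb.keys()}

print(common_nodes("happy", "delightful", depth=2))
```

A larger set of shallow common nodes would suggest that two adjectives sit in the same region of the network.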

20 SynNet Metrics: Symmetry Symmetry of path in common nodes

21 Concept Tracking: Sentiment Step 1: list your sentiments/concepts –joy, sad, anger, surprise, love, fear Step 2: for each concept, list adjectives –joy: joyful, happy, hopeful –surprise: surprising, amazing, wonderful, unbelievable Step 3: for each adjective in the text, calculate all the paths to each adjective in step 2 Step 4: pick the best adjective (using metrics)
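
The four steps can be sketched as follows. This simplifies Step 4 to a single metric (fewest hops to any seed adjective) instead of combining several metrics over all paths, and the synonym links are hypothetical.

```python
from collections import deque

# Steps 1 and 2: concepts and their seed adjectives (two concepts from the slide).
SEEDS = {
    "joy": ["joyful", "happy", "hopeful"],
    "surprise": ["surprising", "amazing", "wonderful", "unbelievable"],
}

# Hypothetical undirected synonym links (illustrative, not the real SynNet).
LINKS = [
    ("incredible", "unbelievable"), ("incredible", "marvelous"),
    ("marvelous", "delightful"), ("delightful", "happy"),
]

def distance(start, goal):
    """Number of hops on a shortest path, or None if the words are not connected."""
    graph = {}
    for a, b in LINKS:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    queue, seen = deque([(start, 0)]), {start}
    while queue:
        word, hops = queue.popleft()
        if word == goal:
            return hops
        for nxt in graph.get(word, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, hops + 1))
    return None

def label_adjective(adjective):
    """Steps 3 and 4: pick the concept whose seed word is the fewest hops away."""
    best = None
    for concept, seeds in SEEDS.items():
        for seed in seeds:
            hops = distance(adjective, seed)
            if hops is not None and (best is None or hops < best[0]):
                best = (hops, concept, seed)
    return best

print(label_adjective("incredible"))  # (1, 'surprise', 'unbelievable')
```

An adjective with no path to any seed returns None, which corresponds to a word that would end up unlabelled.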

22 SynNet: Sentiment Analysis Example: –Which emotion is the adjective incredible most like?

23 SynNet: Sentiment Analysis Incredible to loving (concept: love) Blue paths are symmetric paths

24 SynNet: Sentiment Analysis Incredible to surprising (concept: surprise) Blue paths are symmetric paths

25 SynNet: Sentiment Analysis Incredible to joyful (concept: joy)

26 SynNet: Sentiment Analysis Incredible to joyless (concept: sad)

27 SynNet: Sentiment Analysis Incredible to fearful (concept: fear)

28 SynNet: Sentiment Analysis Incredible to wonderful (concept: joy)

29 SynNet: Sentiment Analysis Try it yourself: – /synnet/path/white/afraid – /synnet/path/white/afraid?format=xml – /synnet/path/white/afraid?format=json – /synnet/path/white/afraid?format=flash –Database contains only adjectives –More APIs and visualizations coming soon

30 Sentiment Analysis: Issues Not a perfect solution –still need context to get quality Vain –['vain', 'insignificant', 'contemptible', 'hateful'] –['vain', 'misleading', 'puzzling', 'surprising'] Animal –['animal', 'sensual', 'pleasing', 'joyful'] –['animal', 'bestial', 'vile', 'hateful'] –['animal', 'gross', 'shocking', 'fearful'] –['animal', 'gross', 'grievous', 'sorrowful'] Negation –“My mother was not a hateful person.”
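
The negation issue can be illustrated with a deliberately naive check that flags an adjective when a negating word appears shortly before it. The negator list and the three-token window are assumptions made for this sketch, not part of the SEASR flow.

```python
import re

NEGATORS = {"not", "never", "no"}  # assumed, deliberately tiny list

def adjectives_with_negation(sentence, adjectives, window=3):
    """Return (adjective, negated?) pairs; negated means a negator occurs
    within `window` tokens before the adjective."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    results = []
    for i, token in enumerate(tokens):
        if token in adjectives:
            preceding = tokens[max(0, i - window):i]
            results.append((token, any(t in NEGATORS for t in preceding)))
    return results

print(adjectives_with_negation("My mother was not a hateful person.", {"hateful"}))
# [('hateful', True)]
```

A window heuristic like this misses many constructions, which is consistent with the slide's point that context is still needed for quality.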

31 Sentiment Analysis: Process Process Overview (2 flows) –Create Concept Cache & Ignore Cache Load the documents Extract the adjectives (POS analysis) Find the unique adjectives Label each adjective (SynNet Service) –Apply Concepts Load the document(s) Segment the document for single document Extract the adjectives (POS analysis) Summarize adjectives across segments or documents Visualize the concepts by segments
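
The second flow (Apply Concepts) might look roughly like this for a single document. The sketch assumes a concept cache already exists as a word-to-concept dictionary, skips the POS step by letting the cache act as the adjective filter, and uses made-up data.

```python
import re
from collections import Counter

# Hypothetical concept cache (word -> concept), as produced by the first flow.
CONCEPT_CACHE = {"great": "joy", "dark": "fear", "anonymous": "surprise"}

def segment(tokens, n_segments):
    """Split tokens into n_segments nearly equal, contiguous chunks."""
    size, extra = divmod(len(tokens), n_segments)
    chunks, start = [], 0
    for i in range(n_segments):
        end = start + size + (1 if i < extra else 0)
        chunks.append(tokens[start:end])
        start = end
    return chunks

def apply_concepts(text, n_segments=2):
    """Per-segment concept counts; the cache stands in for POS filtering."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [Counter(CONCEPT_CACHE[t] for t in chunk if t in CONCEPT_CACHE)
            for chunk in segment(tokens, n_segments)]

text = "A great day and a great feast. Then a dark night in a dark wood."
for i, counts in enumerate(apply_concepts(text)):
    print(i, dict(counts))
```

The list of per-segment counters is the kind of summary a segment-by-segment visualization would consume.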

32 Sentiment Analysis: Visualization SEASR visualization component –Originally based on Flash, using the Flare ActionScript library – wer/emotions.html

33 Sentiment Analysis: 911 Corpus Concepts for each story were identified Mapping was done by using additional metadata for each story

34 Concept Mapping of an Author 5 books by Charles Dickens 1. Tale of Two Cities 2. Great Expectations 3. Christmas Carol 4. Oliver Twist 5. David Copperfield

35 Concept Mapping for Multi Documents

36 Concept Mapping of a Single Document Tale of Two Cities; Great Expectations

37 Concept Mapping of a Single Document

38 Concept Mapping: Creating Cache Files Two cache files –Concept cache: stores the word, concept, POS, seed word mapping and some numbers –great joy JJ 031 wonderful 2 –anonymous surprise JJ 3561 unbelievable 4 –dark fear JJ 81502 horrible 2 –Ignore cache: stores the words that do not map to a concept
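
A sketch of how the two caches could be populated. The `label` argument is a hypothetical stand-in for the SynNet labelling service, and the in-memory dictionary and set are assumptions; the actual cache file format is not reproduced here.

```python
def build_caches(adjectives, label):
    """Split unique adjectives into a concept cache and an ignore cache.

    `label` is any function returning a concept name or None (a hypothetical
    stand-in for the SynNet labelling service).
    """
    concept_cache, ignore_cache = {}, set()
    for word in sorted(set(adjectives)):
        concept = label(word)
        if concept is None:
            ignore_cache.add(word)
        else:
            concept_cache[word] = concept
    return concept_cache, ignore_cache

# Toy labeller for illustration only.
toy_labels = {"great": "joy", "dark": "fear"}
concepts, ignored = build_caches(["great", "dark", "wooden", "great"], toy_labels.get)
print(concepts, ignored)
```

Recording unmapped words in the ignore cache means they do not have to be sent to the labelling step again on later runs.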

39 Concept Mapping: Create Cache Flow

40 Concept Mapping Notes If the list of concepts and seed words has not changed, you can continue to use the same cache files for all documents. But you will need to change the cache files if you want to define new concept mappings. –E.g. Emotion Tracking: 6 concepts and their seed words –E.g. Positive/Negative: 2 concepts and seeds like (yes, yeah, ok, etc.) (no, nay, not, etc.) –E.g. Male/Female: 2 concepts and seeds like (he, his, him, mr, etc.) (she, her, mrs, etc.) Copy cache files to your machine for starters

41 Topic Modeling

42 Text Analytics: Topic Modeling Given: a set of documents Find: the semantic content of a large collection of documents Usage: Mallet Topic Modeling tools Output: –Shows the percentage of relevance for each document in each cluster –Shows the key words and their counts for each topic
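
To make the two outputs concrete, the sketch below derives them from token-level topic assignments. The input triples are invented, and this is not Mallet's file format or API, only an illustration of what the percentages and keyword counts mean.

```python
from collections import Counter

# Hypothetical token-level topic assignments: (document, word, topic id).
ASSIGNMENTS = [
    ("doc1", "birth", 0), ("doc1", "infant", 0), ("doc1", "corn", 1), ("doc1", "birth", 0),
    ("doc2", "corn", 1), ("doc2", "seeds", 1),
]

def summarize(assignments):
    """Return per-document topic percentages and per-topic keyword counts."""
    doc_topics, topic_words = {}, {}
    for doc, word, topic in assignments:
        doc_topics.setdefault(doc, Counter())[topic] += 1
        topic_words.setdefault(topic, Counter())[word] += 1
    percentages = {
        doc: {t: 100.0 * n / sum(counts.values()) for t, n in counts.items()}
        for doc, counts in doc_topics.items()
    }
    return percentages, topic_words

percentages, topic_words = summarize(ASSIGNMENTS)
print(percentages)
print({t: words.most_common(2) for t, words in topic_words.items()})
```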

43 Topic Modeling: LDA Model LDA Model from Blei (2011) LDA assumes that there are K topics shared by the collection. Each document exhibits the topics in different proportions. Each word is drawn from one topic. We discover the structure that best explains a corpus.
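
The assumptions on this slide correspond to the usual generative description of LDA. The notation below is the common textbook one and is an assumption here, since the slide gives no symbols: \beta_k is the word distribution of topic k, \theta_d the topic proportions of document d, z_{d,n} the topic of the n-th word in document d, and w_{d,n} the word itself.

```latex
\begin{align*}
\beta_k &\sim \mathrm{Dirichlet}(\eta), & k &= 1,\dots,K \\
\theta_d &\sim \mathrm{Dirichlet}(\alpha), & d &= 1,\dots,D \\
z_{d,n} \mid \theta_d &\sim \mathrm{Multinomial}(\theta_d), & n &= 1,\dots,N_d \\
w_{d,n} \mid z_{d,n}, \beta_{1:K} &\sim \mathrm{Multinomial}(\beta_{z_{d,n}})
\end{align*}

p(\beta_{1:K}, \theta_{1:D}, z, w)
  = \prod_{k=1}^{K} p(\beta_k)
    \prod_{d=1}^{D} p(\theta_d)
    \prod_{n=1}^{N_d} p(z_{d,n} \mid \theta_d)\, p(w_{d,n} \mid \beta_{1:K}, z_{d,n})
```

Only the words w are observed; fitting the model means inferring the topics, proportions and assignments that best explain them.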

44 Topic Modeling: Martha Ballard’s Diary
Label: Words
MIDWIFERY: birth deld safe morn receivd calld left cleverly pm labour fine reward arivd infant expected recd shee born patient
CHURCH: meeting attended afternoon reverend worship foren mr famely performd vers attend public supper st service lecture discoarst administred supt
DEATH: day yesterday informd morn years death ye hear expired expird weak dead las past heard days drowned departed evinn
GARDENING: gardin sett worked clear beens corn warm planted matters cucumbers gatherd potatoes plants ou sowd door squash wed seeds
SHOPPING: lb made brot bot tea butter sugar carried oz chees pork candles wheat store pr beef spirit churnd flower
ILLNESS: unwell mr sick gave dr rainy easier care head neighbor feet relief made throat poorly takeing medisin ts stomach

45 Topic Modeling: Martha Ballard’s Diary

46 Topic Modeling: Pennsylvania Gazette
Label: Words
RUNAWAY: away reward servant old whoever named year feet jacket high paid hair pair secure coat run inches
GOVT – U.S.: state government constitution law united power citizen people public congress right legislature
REAL ESTATE: good house acre sold land meadow well mile premise plantation stone containing mill dwelling orchard
GOVT – REVOLT: country america war great liberty nation people american men let cause peace enemy present state she
CLOTH: silk cotton ditto white black linen cloth women blue worsted men fine thread plain coloured

47 Topic Modeling: Historical Newspapers
Topics: black* price* worth* white* goods* yard* silk* made* lot* week ladies wool* inch* ladles* sale* prices* pair* suits* fine*
Explanation: Reflects discussion of the market and sales of goods, with some words that relate to cotton and others that reflect other goods being sold alongside cotton (such as wool).
Topics: state* people* states* bill* law* made united* party* men* country* government* county* public* president* money* committee* general* great question*
Explanation: Political language associated with the political debates that dominated much of newspaper content during this era. The association of the topic “money” is particularly telling, as economic and fiscal policy were particularly important discussion during the era.
Topics: market* cotton* york* good* steady* closed* prices* corn* texas* wheat* fair* stock* choice* year* lower* receipts* ton* crop* higher*
Explanation: All these topics reflect market-driven language related to the buying and selling cotton and, to a much smaller extent, other crops such as corn.

48 Topic Modeling: Mining the Dispatch Topic words –negro, years, reward, boy, man, named, jail, delivery, give, left, black, paid, pay, ran, color, richmond, subscriber, high, apprehension, age, ranaway, free, feet, delivered Advertisement Ranaway.—$10 reward. —Ranaway from the subscriber, on the 3d inst., my slave woman Parthena. Had on a dark brown and white calico dress. She is of a ginger-bread color; medium size; the right fore-finger shortened and crooked, from a whitlow. I think she is harbored somewhere in or near Duvall's addition. For her delivery to me I will pay $10. de 6—ts G. W. H. Tyler.

49 Topic Modeling: Mining the Dispatch

50 Topic Modeling: Link-Node Visualization

51 Topic Modeling: Link-Node Visualization Extract the tokens, word counts, and their connections from the Mallet topic model files into a graph file that generates edges and nodes, allowing us to view the ten topics as a network model in Gephi
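
One way to produce such a graph file is a CSV edge list with Source, Target and Weight columns, a layout Gephi's spreadsheet import is generally able to read. The sketch assumes the topic keywords and counts have already been parsed into a dictionary; the data is invented and the Mallet file parsing is not shown.

```python
import csv
import io

# Hypothetical topic keywords with counts (illustrative, not real Mallet output).
TOPIC_WORDS = {
    "topic0": {"cotton": 12, "market": 9},
    "topic1": {"market": 4, "law": 7},
}

def edges_csv(topic_words):
    """Return a CSV edge list with Source, Target, Weight columns."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Source", "Target", "Weight"])
    for topic, words in topic_words.items():
        for word, count in words.items():
            writer.writerow([topic, word, count])
    return buffer.getvalue()

print(edges_csv(TOPIC_WORDS))
```

Words shared by several topics (here "market") become the nodes that connect topics in the network view.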

52 Topic Modeling: Matrix Visualization

53 Topic Modeling Uses Mallet Topic Modeling to cluster nouns from over 4000 19th-century documents, with 10 segments per document Top 10 topics, showing at most 200 keywords for each topic

54 Topic Modeling Process Load the documents Segment the documents Extract nouns (POS analysis) Create the Mallet data structures for each segment Mallet for topic modeling Save results Parse keyword results Create tagclouds of keywords
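
The last step, turning keyword counts into a tag cloud, can be sketched as a linear scaling from counts to font sizes. The size range and the linear rule are assumptions for illustration; the SEASR tag cloud component may weight words differently.

```python
def tagcloud_sizes(keyword_counts, min_size=10, max_size=40):
    """Linearly scale counts to font sizes between min_size and max_size.

    Expects a non-empty mapping of keyword -> count.
    """
    low, high = min(keyword_counts.values()), max(keyword_counts.values())
    span = high - low or 1  # avoid dividing by zero when all counts are equal
    return {word: round(min_size + (count - low) * (max_size - min_size) / span)
            for word, count in keyword_counts.items()}

print(tagcloud_sizes({"birth": 30, "infant": 10, "safe": 20}))
# {'birth': 40, 'infant': 10, 'safe': 25}
```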

55 Topic Modeling Flow

56 HTRC Topics Search for “dickens” in subcollections 1148 documents 100 topics, showing 2 below

57 Topic Model Explorer

58 Additional Topic Modeling Variations Topics over time Connections between topics Hierarchy of topics

59 Topic Modeling References
–or-latent-dirichlet-allocation-for-english-majors/
–Proceedings of the 5th ACL-HLT Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities, pages 96–104, Portland, OR, USA, 24 June 2011. © 2011 Association for Computational Linguistics
–Termite: Visualization Techniques for Assessing Textual Topic Models, Jason Chuang, Christopher D. Manning, Jeffrey Heer, Advanced Visual Interfaces, 2012

60 Demonstration Concept Tracking –Emotion Tracking for single document –Emotion Tracking comparison for multiple documents Topic Modeling –Tagclouds of topic keywords

61 Learning Exercises Open the flow for tracking concepts –Modify the flow to load your data –Modify the flow to track concepts of interest to you Open the flow for topic modeling –Modify the flow to load your data –Review the results and decide if there are modifications that you need to make For instance, maybe you also want to look at verbs

62 Learning Exercises – Concepts Apply Emotion Tracking with existing seeds –Download the caches as a starting point –Custom Concept Mapping Apply Cache For Single Document Change Attribute Cache Lookup Component –Properties for cache need to use a full path

63 Learning Exercises – Concepts Output will show other adjectives that were not tagged with a concept… If there are words in this list, you will want to run the Custom Concept Mapping Create Cache Flow…

64 Learning Exercises – Create Cache Change Attribute Cache Lookup Component –Properties for cache need to use a full path Update Tuple Cache Property for cache needs to use a full path Update Ignore List Property for cache needs to use a full path

65 Attendee Project Plan Study/Project Title Team Members and their Affiliation Procedural Outline of Study/Project –Research Question/Purpose of Study –Data Sources –Analysis Tools Activity Timeline or Milestones Report or Project Outcome(s) Ideas on what your team needs from SEASR staff to help you achieve your goal. Identify Analytics

66 Discussion Questions What part of these applications can be useful to your research?
