
1 SEASR Applications National Center for Supercomputing Applications University of Illinois at Urbana-Champaign

2 Outline –Audio Analysis with NEMA –Text Analysis with MONK –Emotion Tracking –Hands-On

3 Defining Music Information Retrieval Music Information Retrieval (MIR) is the process of searching for, and finding, music objects, or parts of music objects, via a query framed musically and/or in musical terms. Music objects: scores, parts, recordings (WAV, MP3, etc.) Musically framed query: singing, humming, keyboard, notation, MIDI file, sound file, etc. Musical terms: genre, style, tempo, etc.

4 NEMA Networked Environment for Music Analysis –UIUC, McGill (CA), Goldsmiths (UK), Queen Mary (UK), Southampton (UK), Waikato (NZ) –Multiple geographically distributed locations with access to different audio collections –Distributed computation to extract a set of features and/or build and apply models

5 SEASR @ Work – NEMA Executes a SEASR flow for each run –Loads audio data –Extracts features from every 10-second moving window of audio –Loads models –Applies the models –Sends results back to the WebUI
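As a rough illustration of the windowing step, a minimal Python sketch, assuming 16 kHz mono audio already loaded into a NumPy array; the real flow runs SEASR components with much richer audio features, so the RMS energy below is only a stand-in.

import numpy as np

def windowed_features(samples, sr=16000, window_s=10, hop_s=5):
    """Slide a 10-second window over the audio and compute a
    placeholder feature (RMS energy) per window."""
    win, hop = window_s * sr, hop_s * sr
    feats = []
    for start in range(0, len(samples) - win + 1, hop):
        frame = samples[start:start + win]
        feats.append(np.sqrt(np.mean(frame ** 2)))  # RMS energy
    return np.array(feats)

# Example: one minute of synthetic audio -> 11 overlapping windows
audio = np.random.randn(60 * 16000).astype(np.float32)
print(windowed_features(audio).shape)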

6 NEMA Flow – Blinkie

7 NEMA Vision Researchers at Lab A can easily build a virtual collection from Library B and Lab C, acquire the necessary ground truth from Lab D, incorporate a feature extractor from Lab E, combine the extracted features with those provided by Lab F, build a set of models based on a pair of classifiers from Labs G and H, and validate the results against another virtual collection taken from Lab I and Library J. Once completed, the results and newly created feature sets would, in turn, be made available for others to build upon.

8 Do It Yourself (DIY) 1

9 DIY Options

10 DIY Job List

11 DIY Job View

12 Nester: Cardinal Annotation Audio tagging environment. Green boxes indicate a tag by a researcher. Given these tags, automated pattern-learning approaches are applied to find untagged occurrences of the pattern.

13 Nester: Cardinal Catalog View

14 Examining Audio Collection Tagged a set of examples as Male and Female

15 SEASR @ Work: MONK MONK: a case study –Texts as data –Texts from multiple sources –Texts reprocessed into a new representation –Different tools using the same data Slides from John Unsworth, "Tools for Textual Data", May 20, 2009

16 MONK Project MONK provides: –1400 works of literature in English from the 16th to 19th centuries = 108 million words, POS-tagged, TEI-tagged, in a MySQL database –Several different open-source interfaces for working with this data –A public API to the datastore –SEASR under the hood, for analytics Slides from John Unsworth, "Tools for Textual Data", May 20, 2009

17 MONK "A word token is the spelling or surface form of a word. MONK performs a variety of operations that supply each token with additional 'metadata'. –Take something like 'hee louyd hir depely'. –This comes to exist in the MONK textbase as something like hee_pns31_he louyd_vvd_love hir_pno31_she depely_av-j_deep –Because the textbase 'knows' that the surface 'louyd' is the past tense of the verb 'love', the individual token can be seen as an instance of several types: the spelling, the part of speech, and the lemma or dictionary entry form of a word." (Martin Mueller) Slides from John Unsworth, "Tools for Textual Data", May 20, 2009
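A small sketch of reading adorned tokens of the spelling_pos_lemma form shown above; the field names come from the slide's example, not from the actual MONK textbase schema.

def parse_adorned(text):
    """Split MONK-style adorned tokens into spelling, POS, lemma."""
    tokens = []
    for item in text.split():
        spelling, pos, lemma = item.split("_")
        tokens.append({"spelling": spelling, "pos": pos, "lemma": lemma})
    return tokens

adorned = "hee_pns31_he louyd_vvd_love hir_pno31_she depely_av-j_deep"
for tok in parse_adorned(adorned):
    print(tok)
# {'spelling': 'hee', 'pos': 'pns31', 'lemma': 'he'} ...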

18 Text Data Texts represent language, which changes over time (spellings). Comparison of texts as data requires some normalization (lemma). Counting as a means of comparison requires units to count (tokens). Treating texts as data will usually entail a new representation of those texts, to make them comparable and to make their features countable. Slides from John Unsworth, "Tools for Textual Data", May 20, 2009

19 Text from Multiple Sources Five aphorisms about textual data (causing tool-builders to weep): –Scholars are interested in texts first, data second –Tools are only useful if they can be applied to texts that are of interest –No single collection has all texts –No two collections will be identical in format –No one collection will be internally consistent in format Slides from John Unsworth, "Tools for Textual Data", May 20, 2009

20 Public MONK Texts Documenting the American South from UNC-Chapel Hill –(1.5 Gb, 8.5 M words) Early American Fiction from the University of Virginia –(930 Mb, 5.2 M words) Wright American Fiction from Indiana University –(4 Gb, 23 M words) Shakespeare from Northwestern University –(170 Mb, 850 K words) About 7 Gigabytes, 38 M words Slides from John Unsworth, "Tools for Textual Data", May 20, 2009

21 Restricted MONK Texts Eighteenth-Century Collections Online (ECCO) from the Text Creation Partnership –(6 Gb, 34 M words) Early English Books Online (EEBO) from the Text Creation Partnership –(7 Gb, 39 M words) Nineteenth-Century Fiction (NCF) from Chadwyck-Healey –(7 Gb, 39 M words) About 20 Gb, 112 M words Slides from John Unsworth, "Tools for Textual Data", May 20, 2009

22 MONK Ingest Process Texts reprocessed into a new representation. TEI source files (from various collections, with various idiosyncrasies) go through Abbot, a series of XSL routines that transform the input format into TEI-Analytics (TEI-A for short), with some curatorial interaction. "Unadorned" TEI-A files go through Morphadorner, a trainable part-of-speech tagger that tokenizes the texts into sentences, words, and punctuation, assigns ids to the words and punctuation marks, and adorns the words with morphological tagging data (lemma, part of speech, and standard spelling). Slides from John Unsworth, "Tools for Textual Data", May 20, 2009

23 MONK Ingest Process Adorned TEI-A files go through Acolyte, a script that adds curator-prepared bibliographic data. Bibadorned files are processed by Prior, using a pair of files defining the parts of speech and word classes, to produce tab-delimited text files in MySQL import format, one file for each table in the MySQL database. cdb.csh creates a MONK MySQL database and imports the tab-delimited text files. Slides from John Unsworth, "Tools for Textual Data", May 20, 2009
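To make the last step concrete, a minimal sketch of what Prior-style output might look like: adorned tokens flattened into a tab-delimited file ready for MySQL import. The column names and layout are illustrative assumptions, not the actual MONK schema.

import csv, io

adorned = [("hee", "pns31", "he"), ("louyd", "vvd", "love"),
           ("hir", "pno31", "she"), ("depely", "av-j", "deep")]

out = io.StringIO()
writer = csv.writer(out, delimiter="\t")
writer.writerow(["token_id", "spelling", "pos", "lemma"])
for i, (spelling, pos, lemma) in enumerate(adorned, start=1):
    writer.writerow([i, spelling, pos, lemma])
print(out.getvalue())
# A file like this could then be pulled in with MySQL's LOAD DATA INFILE.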

24 MONK Tools MONK Datastore, Flamenco Faceted Browsing, MONK extension for Zotero, TeksTale Clustering and Word Clouds, FeatureLens, SEASR, The MONK Workbench (Public), The MONK Workbench (Restricted) Slides from John Unsworth, "Tools for Textual Data", May 20, 2009

25 SEASR @ Work – MONK Workbench Executes flows for each analysis requested –Predictive modeling using Naïve Bayes –Predictive modeling using Support Vector Machines (SVM) –Feature comparison (Dunning log-likelihood)

26 Feature Lens "The discussion of the children introduces each of the short internal narratives. This champions the view that her method of repetition was patterned: controlled, intended, and a measured means to an end. It would have been impossible to discern through traditional reading."

27 Dunning Log-likelihood Tag Cloud Words that are under-represented in writings by Victorian women as compared to Victorian men. Results are loaded into Wordle for the tag cloud. —Sara Steger
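The comparison behind this tag cloud relies on Dunning's log-likelihood statistic. A minimal sketch of the G2 computation with made-up counts; the actual MONK flow computes this inside SEASR components.

import math

def dunning_g2(a, b, n1, n2):
    """G2 for a word with counts a, b in corpora of sizes n1, n2."""
    e1 = n1 * (a + b) / (n1 + n2)   # expected count in corpus 1
    e2 = n2 * (a + b) / (n1 + n2)   # expected count in corpus 2
    g2 = 0.0
    if a > 0:
        g2 += a * math.log(a / e1)
    if b > 0:
        g2 += b * math.log(b / e2)
    return 2 * g2

# Toy example: a word appearing 30 times in 1M words by women
# and 120 times in 1M words by men scores as strongly skewed.
print(round(dunning_g2(30, 120, 1_000_000, 1_000_000), 2))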

28 SEASR @ Work – Emotion Tracking The goal is to have this type of visualization to track emotions across a text document (leveraging flare.prefuse.org)

29 UIMA Structured data Two SEASR examples using UIMA POS data –Frequent patterns (association rules) of nouns (FP-Growth) –Sentiment analysis of adjectives

30 UIMA Unstructured Information Management Applications

31 UIMA + P.O.S. tagging Analysis engines analyze a document to record part-of-speech information: OpenNLP Tokenizer, OpenNLP PosTagger, OpenNLP SentenceDetector, POSWriter (serialization of the UIMA CAS)
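The slide's pipeline is UIMA analysis engines wrapping OpenNLP in Java. As a rough, swapped-in Python analogue (not the actual UIMA API), a sketch using NLTK's tokenizer and tagger; resource names may vary across NLTK versions.

import nltk
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

text = "Tom appeared on the sidewalk with a bucket of whitewash."
tokens = nltk.word_tokenize(text)   # roughly the Tokenizer step
tagged = nltk.pos_tag(tokens)       # roughly the PosTagger step
adjectives = [w for w, pos in tagged if pos.startswith("JJ")]
print(tagged)
print(adjectives)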

32 UIMA to SEASR: Experiment I Finding patterns

33 SEASR + UIMA: Frequent Patterns Frequent pattern analysis on nouns Goal: –Discover a cast of characters within the text –Discover nouns that frequently occur together (character relationships); see the sketch below
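A minimal sketch of the co-occurrence idea: the real flow uses FP-Growth, but plain pair counting over a paragraph window is shown here for brevity, with toy data standing in for UIMA POS output.

from collections import Counter
from itertools import combinations

# nouns per paragraph (toy data)
paragraphs = [["tom", "aunt", "fence"], ["tom", "huck"],
              ["huck", "river"], ["tom", "huck", "river"]]

window = 2          # the slide's analysis uses a 10-paragraph window
min_support = 0.5   # the slide's analysis uses 10%

pair_counts = Counter()
n_windows = len(paragraphs) - window + 1
for i in range(n_windows):
    nouns = set().union(*paragraphs[i:i + window])
    pair_counts.update(combinations(sorted(nouns), 2))

# report noun pairs whose support clears the threshold
for pair, count in pair_counts.items():
    if count / n_windows >= min_support:
        print(pair, count)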

34 Frequent Patterns: Visualization Analysis of Tom Sawyer, 10-paragraph window, support set to 10%

35 UIMA to SEASR: Experiment II Sentiment Analysis

36 UIMA + SEASR: Sentiment Analysis Classifying text based on its sentiment –Determining the attitude of a speaker or a writer –Determining whether a review is positive/negative Ask: What emotion is being conveyed within a body of text? –Look at only adjectives (UIMA POS) lots of issues and challenges Need to Answer: –What emotions to track? –How to measure/classify an adjective to one of the selected emotions? –How to visualize the results?

37 Sentiment Analysis: Emotion Selection Which emotions? –http://en.wikipedia.org/wiki/List_of_emotions –http://changingminds.org/explanations/emotions/basic%20emotions.htm –http://www.emotionalcompetency.com/recognizing.htm Parrott's classification (2001) –six core emotions –Love, Joy, Surprise, Anger, Sadness, Fear

38 Sentiment Analysis: Emotions

39 Sentiment Analysis: Using Adjectives How to classify adjectives: –Lots of metrics we could use… –Lists of adjectives already classified: http://www.derose.net/steve/resources/emotionwords/ewords.html –Need a "nearness" metric for missing adjectives –How about the thesaurus game?

40 SEASR: Sentiment Analysis Using only a thesaurus, find a path between two words –no antonyms –no colloquialisms or slang

41 SEASR: Sentiment Analysis For example, how would you get from delightful to rainy? (answer coming soon, unless you find it first)

42 SEASR: Sentiment Analysis How to get from delightful to rainy? ['delightful', 'fair', 'balmy', 'moist', 'rainy'] And sexy to joyless? ['sexy', 'provocative', 'blue', 'joyless'] Bitter to lovable? ['bitter', 'acerbic', 'tangy', 'sweet', 'lovable']

43 SEASR: Sentiment Analysis Use this game as a metric for measuring how close a given adjective is to each of the six emotions. Assume the longer the path, the "farther away" the two words are.

44 SEASR: Sentiment Analysis Introducing SynNet: a traversable graph of synonyms (adjectives)

45 Thesaurus Network (SynNet) Using thesaurus.com, create a link between every term and its synonyms, yielding a large network. Then determine a metric for assigning each adjective to one of our selected terms –Is there a path? –How to evaluate the best paths?
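A minimal sketch of the "thesaurus game" on a SynNet-style graph, assuming a hand-built adjacency list rather than real thesaurus.com data; breadth-first search finds the shortest synonym path.

from collections import deque

synnet = {
    "delightful": ["fair", "pleasant"],
    "fair":       ["delightful", "balmy"],
    "balmy":      ["fair", "moist"],
    "moist":      ["balmy", "rainy"],
    "rainy":      ["moist"],
    "pleasant":   ["delightful"],
}

def shortest_path(graph, start, goal):
    """BFS over synonym links; returns the word path or None."""
    queue, seen = deque([[start]]), {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in graph.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

print(shortest_path(synnet, "delightful", "rainy"))
# ['delightful', 'fair', 'balmy', 'moist', 'rainy']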

46 SynNet: rainy to pleasant

47 SynNet Metrics Path length Number of paths Common nodes Symmetric: a → b and b → a Unique nodes in all paths

48 SynNet Metrics: Path Length Rainy to Pleasant –Shortest path length is 4 (blue): Rainy, Moist, Watery, Bland, Pleasant –The green path has length 3 but is not reachable via symmetry –Blue nodes are two hops away

49 SynNet Metrics: Common Nodes Common Nodes –depth of common nodes Example –Top shows happy –Bottom shows delightful –Common nodes shown in center cluster

50 SynNet Metrics: Symmetry Symmetry of path in common nodes

51 SynNet: Sentiment Analysis Step 1: list your sentiments/concepts –joy, sad, anger, surprise, love, fear Step 2: for each concept, list adjectives –joy: joyful, happy, hopeful –surprise: surprising, amazing, wonderful, unbelievable Step 3: for each adjective in the text, calculate all the paths to each adjective in Step 2 Step 4: pick the best adjective (using the metrics); see the sketch below
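A sketch of Steps 3 and 4 under the same toy-graph assumption as above: score an adjective against each concept by its shortest synonym path to any of that concept's seed adjectives, then pick the concept with the shortest path.

from collections import deque

def path_len(graph, start, goal):
    """BFS shortest-path length (number of nodes), or None."""
    queue, seen = deque([(start, 1)]), {start}
    while queue:
        word, n = queue.popleft()
        if word == goal:
            return n
        for nxt in graph.get(word, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, n + 1))
    return None

concepts = {
    "joy":      ["joyful", "happy", "hopeful"],
    "surprise": ["surprising", "amazing", "wonderful", "unbelievable"],
}

def classify(graph, adjective):
    scores = {}
    for concept, seeds in concepts.items():
        lengths = [path_len(graph, adjective, s) for s in seeds]
        lengths = [n for n in lengths if n is not None]
        if lengths:
            scores[concept] = min(lengths)  # shortest path wins
    return min(scores, key=scores.get) if scores else None

toy = {"incredible": ["amazing"], "amazing": ["incredible", "wonderful"],
       "wonderful": ["amazing"]}
print(classify(toy, "incredible"))  # -> 'surprise'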

52 SynNet: Sentiment Analysis Example: the adjective to score is incredible

53 SynNet: Sentiment Analysis Incredible to loving (concept: love) Blue paths are symmetric paths

54 SynNet: Sentiment Analysis Incredible to surprising (concept: surprise) Blue paths are symmetric paths

55 SynNet: Sentiment Analysis Incredible to joyful (concept: joy)

56 SynNet: Sentiment Analysis Incredible to joyless (concept: sad)

57 SynNet: Sentiment Analysis Incredible to fearful (concept: fear) Winner!

58 SynNet: Sentiment Analysis Try it yourself: http://services.seasr.org/synnet –/synnet/path/white/afraid –/synnet/path/white/afraid?format=xml –/synnet/path/white/afraid?format=json –/synnet/path/white/afraid?format=flash –Database contains only adjectives –More API endpoints and visualizations coming soon
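A hedged sketch of calling the service from Python; the endpoint and format parameter come straight from the slide, but the service may no longer be online, so treat this as illustrative.

import requests

resp = requests.get("http://services.seasr.org/synnet/path/white/afraid",
                    params={"format": "json"}, timeout=10)
if resp.ok:
    print(resp.json())  # the xml and flash formats work the same way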

59 Sentiment Analysis: Issues Not a perfect solution –still need context to get quality results Vain –['vain', 'insignificant', 'contemptible', 'hateful'] –['vain', 'misleading', 'puzzling', 'surprising'] Animal –['animal', 'sensual', 'pleasing', 'joyful'] –['animal', 'bestial', 'vile', 'hateful'] –['animal', 'gross', 'shocking', 'fearful'] –['animal', 'gross', 'grievous', 'sorrowful'] Negation –"My mother was not a hateful person."

60 Sentiment Analysis: Process Process Overview –Extract the adjectives (SEASR, POS analysis) –Read in adjectives (SEASR) –Label each adjective (SEASR, SynNet) –Summarize windows of adjectives (lots of experimentation here); see the sketch below –Visualize the windows
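A sketch of the "summarize windows of adjectives" step, assuming each adjective has already been labeled with an emotion by SynNet; the window size and toy labels are illustrative, and as the slide notes, this step took real experimentation.

from collections import Counter

# (adjective, emotion label) pairs in document order (toy data)
labeled = [("happy", "joy"), ("dark", "fear"), ("gloomy", "sad"),
           ("amazing", "surprise"), ("joyful", "joy"), ("vile", "anger")]

window = 3  # adjectives per window
for i in range(0, len(labeled), window):
    counts = Counter(label for _, label in labeled[i:i + window])
    print(f"window {i // window}: {dict(counts)}")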

61 Sentiment Analysis: Visualization New SEASR visualization component –Based on Flash using the flare ActionScript library: http://flare.prefuse.org/ –Still in development: http://demo.seasr.org/public/resources/data/viewer/emotions.html

62 Demonstration –Son of Blinkie from the NEMA Project –MONK –Emotion Tracking

63 Learning Exercises

64 Discussion Questions What part of these applications can be useful to your research?

