SEASR Applications National Center for Supercomputing Applications University of Illinois at Urbana-Champaign.



Outline
–Audio Analysis with NEMA
–Text Analysis with MONK
–Emotion Tracking
–Hands-On

Defining Music Information Retrieval
Music Information Retrieval (MIR) is the process of searching for, and finding, music objects, or parts of music objects, via a query framed musically and/or in musical terms.
–Music objects: scores, parts, recordings (WAV, MP3, etc.)
–Musically framed query: singing, humming, keyboard, notation-based, MIDI file, sound file, etc.
–Musical terms: genre, style, tempo, etc.

NEMA: Networked Environment for Music Analysis
–UIUC, McGill (CA), Goldsmiths (UK), Queen Mary (UK), Southampton (UK), Waikato (NZ)
–Multiple geographically distributed locations with access to different audio collections
–Distributed computation to extract a set of features and/or build and apply models

Work – NEMA
Executes a SEASR flow for each run:
–Loads audio data
–Extracts features from every 10-second moving window of audio
–Loads models
–Applies the models
–Sends results back to the WebUI
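The feature-extraction step in the NEMA flow can be illustrated with a minimal Python sketch. It is not the NEMA or SEASR implementation: the 5-second hop between windows, the function names, and the RMS-energy feature are all assumptions chosen for illustration; only the 10-second window length comes from the slide.

```python
from typing import Iterator, List, Tuple


def moving_windows(samples: List[float], sample_rate: int,
                   window_seconds: float = 10.0,
                   hop_seconds: float = 5.0) -> Iterator[Tuple[int, List[float]]]:
    """Yield (start_index, window) pairs over the signal.

    window_seconds follows the slide (10 s); hop_seconds is an assumption.
    A signal shorter than one window is returned as a single short window.
    """
    window = int(window_seconds * sample_rate)
    hop = int(hop_seconds * sample_rate)
    if window <= 0 or hop <= 0:
        raise ValueError("window and hop must be positive")
    last_start = max(len(samples) - window, 0)
    for start in range(0, last_start + 1, hop):
        yield start, samples[start:start + window]


def rms_energy(window: List[float]) -> float:
    """Stand-in feature (hypothetical): root-mean-square energy of a window."""
    if not window:
        return 0.0
    return (sum(x * x for x in window) / len(window)) ** 0.5


def extract_features(samples: List[float], sample_rate: int) -> List[Tuple[float, float]]:
    """Return (start_time_in_seconds, feature_value) for every window."""
    return [(start / sample_rate, rms_energy(w))
            for start, w in moving_windows(samples, sample_rate)]
```

In a real flow, each window's feature vector would then be passed to the loaded models; here a single scalar per window keeps the windowing logic visible.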

NEMA Flow – Blinkie

NEMA Vision
The vision is for researchers at Lab A to easily build a virtual collection from Library B and Lab C, acquire the necessary ground truth from Lab D, incorporate a feature extractor from Lab E, combine the extracted features with those provided by Lab F, build a set of models based on a pair of classifiers from Labs G and H, and validate the results against another virtual collection taken from Lab I and Library J. Once completed, the results and newly created feature sets would, in turn, be made available for others to build upon.

Do It Yourself (DIY) 1

DIY Options

DIY Job List

DIY Job View

Nester: Cardinal Annotation
–Audio tagging environment
–Green boxes indicate a tag by a researcher
–Given tags, automated approaches that learn the pattern are applied to find untagged patterns

Nester: Cardinal Catalog View

Examining Audio Collection
–Tagged a set of examples: male and female

Work: MONK
MONK: a case study
–Texts as data
–Texts from multiple sources
–Texts reprocessed into a new representation
–Different tools using the same data
This and the following MONK slides, through “MONK Tools”, are from John Unsworth, “Tools for Textual Data”, May 20, 2009.

MONK Project
MONK provides:
–1400 works of literature in English from the 16th to 19th centuries = 108 million words, POS-tagged, TEI-tagged, in a MySQL database
–Several different open-source interfaces for working with this data
–A public API to the datastore
–SEASR under the hood, for analytics

MONK
“A word token is the spelling or surface form of a word. MONK performs a variety of operations that supply each token with additional 'metadata'.
–Take something like 'hee louyd hir depely'.
–This comes to exist in the MONK textbase as something like
hee_pns31_he louyd_vvd_love hir_pno31_she depely_av-j_deep
Because the textbase 'knows' that the surface 'louyd' is the past tense of the verb 'love', the individual token can be seen as an instance of several types: the spelling, the part of speech, and the lemma or dictionary entry form of a word.” (Martin Mueller)
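The adorned token notation quoted above (spelling, part of speech, lemma joined by underscores) can be unpacked with a short Python sketch. The `Token` type and function names are hypothetical and not part of the MONK API; the sketch assumes the part-of-speech tag and lemma never contain an underscore.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    spelling: str  # surface form as it appears in the text
    pos: str       # part-of-speech tag, e.g. "vvd"
    lemma: str     # dictionary entry form


def parse_token(adorned: str) -> Token:
    """Split 'louyd_vvd_love' into its three types."""
    parts = adorned.rsplit("_", 2)
    if len(parts) != 3:
        raise ValueError(f"expected spelling_pos_lemma, got {adorned!r}")
    return Token(*parts)


def parse_line(line: str) -> List[Token]:
    """Parse a whitespace-separated run of adorned tokens."""
    return [parse_token(item) for item in line.split()]


tokens = parse_line("hee_pns31_he louyd_vvd_love hir_pno31_she depely_av-j_deep")
lemmas = [t.lemma for t in tokens]  # normalized forms, usable for counting
```

Counting by `lemma` rather than `spelling` is what lets 'louyd' and 'loved' be treated as the same word across texts.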

Text Data
–Texts represent language, which changes over time (spellings)
–Comparison of texts as data requires some normalization (lemma)
–Counting as a means of comparison requires units to count (tokens)
Treating texts as data will usually entail a new representation of those texts, to make them comparable and to make their features countable.

Text from Multiple Sources
Five aphorisms about textual data (causing tool-builders to weep):
–Scholars are interested in texts first, data second
–Tools are only useful if they can be applied to texts that are of interest
–No single collection has all texts
–No two collections will be identical in format
–No one collection will be internally consistent in format

Public MONK Texts
–Documenting the American South from UNC-Chapel Hill (1.5 GB, 8.5 M words)
–Early American Fiction from the University of Virginia (930 MB, 5.2 M words)
–Wright American Fiction from Indiana University (4 GB, 23 M words)
–Shakespeare from Northwestern University (170 MB, 850 K words)
About 7 GB, 38 M words in total

Restricted MONK Texts
–Eighteenth Century Collections Online (ECCO) from the Text Creation Partnership (6 GB, 34 M words)
–Early English Books Online (EEBO) from the Text Creation Partnership (7 GB, 39 M words)
–Nineteenth-Century Fiction (NCF) from Chadwyck-Healey (7 GB, 39 M words)
About 20 GB, 112 M words in total

MONK Ingest Process
Texts reprocessed into a new representation:
–TEI source files (from various collections, with various idiosyncrasies) go through Abbot, a series of XSL routines that transform the input format into TEI-Analytics (TEI-A for short), with some curatorial interaction.
–“Unadorned” TEI-A files go through MorphAdorner, a trainable part-of-speech tagger that tokenizes the texts into sentences, words and punctuation, assigns ids to the words and punctuation marks, and adorns the words with morphological tagging data (lemma, part of speech, and standard spelling).

MONK Ingest Process (continued)
–Adorned TEI-A files go through Acolyte, a script that adds curator-prepared bibliographic data.
–Bibadorned files are processed by Prior, using a pair of files defining the parts of speech and word classes, to produce tab-delimited text files in MySQL import format, one file for each table in the MySQL database.
–cdb.csh creates a MONK MySQL database and imports the tab-delimited text files.

MONK Tools
–MONK Datastore
–Flamenco Faceted Browsing
–MONK extension for Zotero
–TeksTale Clustering and Word Clouds
–FeatureLens
–SEASR
–The MONK Workbench (Public)
–The MONK Workbench (Restricted)

Work – MONK Workbench
Executes flows for each analysis requested:
–Predictive modeling using Naïve Bayes
–Predictive modeling using Support Vector Machines (SVM)
–Feature comparison (Dunning log-likelihood)

FeatureLens
“The discussion of the children introduces each of the short internal narratives. This champions the view that her method of repetition was patterned: controlled, intended, and a measured means to an end. It would have been impossible to discern through traditional reading.”

Dunning Log-likelihood Tag Cloud
–Words that are under-represented in writings by Victorian women as compared to Victorian men
–Results are loaded into Wordle for the tag cloud
(Sara Steger)
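As a rough illustration of the feature comparison behind such a tag cloud, the sketch below computes Dunning's log-likelihood statistic, G² = 2 Σ O ln(O/E), over the 2×2 table of a word's count versus all other tokens in two corpora. The function names and the example counts are invented for illustration; this is not the MONK or SEASR code.

```python
import math


def log_likelihood(count_a: int, total_a: int, count_b: int, total_b: int) -> float:
    """G^2 for one word: count_a occurrences among total_a tokens in corpus A,
    count_b occurrences among total_b tokens in corpus B."""
    observed = [[count_a, total_a - count_a],
                [count_b, total_b - count_b]]
    row_totals = [total_a, total_b]
    grand = total_a + total_b
    col_totals = [count_a + count_b, grand - (count_a + count_b)]
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # a zero cell contributes nothing to the sum
                expected = row_totals[i] * col_totals[j] / grand
                g2 += o * math.log(o / expected)
    return 2.0 * g2


def is_under_represented(count_a: int, total_a: int, count_b: int, total_b: int) -> bool:
    """True when the word's relative frequency is lower in corpus A than in B."""
    return count_a / total_a < count_b / total_b


# Invented counts: a word seen 5 times in 1000 tokens (A) vs 50 in 1000 (B).
score = log_likelihood(5, 1000, 50, 1000)
under = is_under_represented(5, 1000, 50, 1000)
```

The statistic itself is unsigned, so a separate comparison of relative frequencies is needed to tell under-represented words from over-represented ones; a tag cloud could then size each word by its score.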

Work – Emotion Tracking
The goal is a visualization of this type that tracks emotions across a text document (leveraging flare.prefuse.org).

UIMA
Structured data: two SEASR examples using UIMA POS data
–Frequent patterns (association rules) of nouns (fpgrowth)
–Sentiment analysis of adjectives

UIMA: Unstructured Information Management Architecture

UIMA + P.O.S. tagging
Analysis engines analyze each document and record part-of-speech information:
–OpenNLP Tokenizer
–OpenNLP PosTagger
–OpenNLP SentenceDetector
–POSWriter
–Serialization of the UIMA CAS

UIMA to SEASR: Experiment I Finding patterns

SEASR + UIMA: Frequent Patterns
Frequent pattern analysis on nouns
Goal:
–Discover a cast of characters within the text
–Discover nouns that frequently occur together (character relationships)

Frequent Patterns: visualization
Analysis of Tom Sawyer
–10-paragraph window
–Support set to 10%
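The window and support settings can be made concrete with a small Python sketch. It deliberately swaps in plain pair counting rather than the FP-growth algorithm, and it assumes non-overlapping windows and that the nouns of each paragraph have already been extracted; the function name and the sample data are hypothetical.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, List, Set, Tuple


def frequent_noun_pairs(paragraph_nouns: List[Set[str]],
                        window: int = 10,
                        min_support: float = 0.10) -> Dict[Tuple[str, str], float]:
    """Return noun pairs whose support (fraction of windows containing both
    nouns) is at least min_support. Windows are consecutive, non-overlapping
    runs of `window` paragraphs (an assumption)."""
    windows = [set().union(*paragraph_nouns[i:i + window])
               for i in range(0, len(paragraph_nouns), window)]
    counts: Counter = Counter()
    for nouns in windows:
        # Each unordered pair is counted once per window.
        counts.update(combinations(sorted(nouns), 2))
    n = len(windows)
    return {pair: c / n for pair, c in counts.items() if c / n >= min_support}


# Invented toy input: the nouns found in four paragraphs.
paragraphs = [{"tom", "becky"}, {"tom", "huck"}, {"aunt"}, {"tom", "becky"}]
pairs = frequent_noun_pairs(paragraphs, window=2, min_support=0.6)
```

Pairs that survive the support threshold are candidates for character relationships; a full frequent-pattern miner would also report larger itemsets, not only pairs.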

UIMA to SEASR: Experiment II Sentiment Analysis

UIMA + SEASR: Sentiment Analysis
Classifying text based on its sentiment
–Determining the attitude of a speaker or a writer
–Determining whether a review is positive/negative
Ask: What emotion is being conveyed within a body of text?
–Look at only adjectives (UIMA POS); lots of issues and challenges
Need to answer:
–What emotions to track?
–How to measure/classify an adjective to one of the selected emotions?
–How to visualize the results?

Sentiment Analysis: Emotion Selection
Which emotions:
–http://changingminds.org/explanations/emotions/basic%20emotions.htm
Parrott’s classification (2001)
–Six core emotions
–Love, Joy, Surprise, Anger, Sadness, Fear

Sentiment Analysis: Emotions

Sentiment Analysis: Using Adjectives
How to classify adjectives:
–Lots of metrics we could use …
–Lists of adjectives already classified (http://…ds/ewords.html)
–Need a “nearness” metric for missing adjectives
–How about the thesaurus game?

SEASR: Sentiment Analysis
Using only a thesaurus, find a path between two words
–No antonyms
–No colloquialisms or slang

SEASR: Sentiment Analysis For example, how would you get from delightful to rainy? (answer coming soon, unless you find it first)

SEASR: Sentiment Analysis
How to get from delightful to rainy?
–['delightful', 'fair', 'balmy', 'moist', 'rainy']
sexy to joyless?
–['sexy', 'provocative', 'blue', 'joyless']
bitter to lovable?
–['bitter', 'acerbic', 'tangy', 'sweet', 'lovable']

SEASR: Sentiment Analysis Use this game as a metric for measuring a given adjective to one of the six emotions. Assume the longer the path, the “farther away” the two words are.
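The thesaurus game amounts to a shortest-path search over a graph of synonyms, which a breadth-first search finds directly. The Python sketch below uses a tiny hand-made graph: the delightful-to-rainy links follow the example path from the slides, while the `pleasant` entry and the function name are invented for illustration. It is not the SEASR implementation.

```python
from collections import deque
from typing import Dict, List, Optional


def shortest_path(graph: Dict[str, List[str]], start: str, goal: str) -> Optional[List[str]]:
    """Breadth-first search: returns one shortest synonym chain, or None."""
    if start == goal:
        return [start]
    previous: Dict[str, Optional[str]] = {start: None}
    queue = deque([start])
    while queue:
        word = queue.popleft()
        for neighbor in graph.get(word, ()):
            if neighbor in previous:
                continue  # already reached by a path at least as short
            previous[neighbor] = word
            if neighbor == goal:
                # Walk the predecessor links back to the start.
                path = [goal]
                while previous[path[-1]] is not None:
                    path.append(previous[path[-1]])
                return path[::-1]
            queue.append(neighbor)
    return None


# Toy thesaurus (illustrative only).
THESAURUS = {
    "delightful": ["fair", "pleasant"],
    "fair": ["balmy", "delightful"],
    "balmy": ["moist", "fair"],
    "moist": ["rainy", "balmy"],
    "rainy": ["moist"],
    "pleasant": ["delightful"],
}

path = shortest_path(THESAURUS, "delightful", "rainy")
distance = len(path) - 1 if path else None  # number of hops between the words
```

Under the "longer path means farther away" assumption, `distance` is the nearness score: fewer hops means the two adjectives are treated as closer in meaning.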

SEASR: Sentiment Analysis Introducing SynNet: a traversable graph of synonyms (adjectives)

Thesaurus Network (SynNet)
–Used thesaurus.com to create a link between every term and its synonyms
–Created a large network
–Determine a metric to use to assign the adjectives to one of our selected terms: Is there a path? How to evaluate best paths?

SynNet: rainy to pleasant

SynNet Metrics
–Path length
–Number of paths
–Common nodes
–Symmetric: a → b, b → a
–Unique nodes in all paths

SynNet Metrics: Path Length
Rainy to Pleasant
–Shortest path length is 4 (blue): Rainy, Moist, Watery, Bland, Pleasant
–Green path has length of 3 but is not reachable via symmetry
–Blue nodes are nodes 2 hops away
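One way to read the symmetry metric is that a path counts as symmetric only when every link is reciprocal, that is, each word lists the next as a synonym and vice versa. The Python sketch below encodes that reading as an assumption. The blue path uses the words from the slide; the one-way green path through `damp` and `cool` is invented, since the slide does not name its nodes.

```python
from typing import Dict, List, Set


def path_length(path: List[str]) -> int:
    """Number of hops (edges) along a path."""
    return len(path) - 1


def is_symmetric_path(graph: Dict[str, Set[str]], path: List[str]) -> bool:
    """True when every consecutive pair is linked in both directions."""
    return all(b in graph.get(a, set()) and a in graph.get(b, set())
               for a, b in zip(path, path[1:]))


# Directed toy synonym graph: word -> words listed as its synonyms.
SYNONYMS: Dict[str, Set[str]] = {
    "rainy": {"moist", "damp"},
    "moist": {"rainy", "watery"},
    "watery": {"moist", "bland"},
    "bland": {"watery", "pleasant"},
    "pleasant": {"bland"},
    "damp": {"cool"},      # invented one-way links: no entry points back
    "cool": {"pleasant"},
}

blue = ["rainy", "moist", "watery", "bland", "pleasant"]
green = ["rainy", "damp", "cool", "pleasant"]

blue_ok = is_symmetric_path(SYNONYMS, blue)    # reciprocal at every hop
green_ok = is_symmetric_path(SYNONYMS, green)  # shorter, but one-way
```

Under this reading, a shorter path can still lose to a longer one if its links only run in one direction.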

SynNet Metrics: Common Nodes
Common nodes
–Depth of common nodes
Example
–Top shows happy
–Bottom shows delightful
–Common nodes shown in center cluster

SynNet Metrics: Symmetry Symmetry of path in common nodes

SynNet: Sentiment Analysis
Step 1: list your sentiments/concepts
–joy, sad, anger, surprise, love, fear
Step 2: for each concept, list adjectives
–joy: joyful, happy, hopeful
–surprise: surprising, amazing, wonderful, unbelievable
Step 3: for each adjective in the text, calculate all the paths to each adjective in step 2
Step 4: pick the best adjective (using metrics)
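The four steps can be sketched end to end in Python. This is a simplified illustration under stated assumptions: it scores by shortest path length only (the slides mention several other metrics), the synonym graph is a made-up toy, and the function names are hypothetical. The concept lists for joy and surprise are the ones given in step 2.

```python
from collections import deque
from typing import Dict, List, Optional, Tuple


def hops(graph: Dict[str, List[str]], start: str, goal: str) -> Optional[int]:
    """Breadth-first search; returns the hop count, or None if unreachable."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        word = queue.popleft()
        if word == goal:
            return seen[word]
        for neighbor in graph.get(word, ()):
            if neighbor not in seen:
                seen[neighbor] = seen[word] + 1
                queue.append(neighbor)
    return None


# Steps 1 and 2: concepts and their seed adjectives (from the slide).
CONCEPTS = {
    "joy": ["joyful", "happy", "hopeful"],
    "surprise": ["surprising", "amazing", "wonderful", "unbelievable"],
}


def classify(graph: Dict[str, List[str]], adjective: str,
             concepts: Dict[str, List[str]]) -> Optional[Tuple[int, str, str]]:
    """Steps 3 and 4: return (hops, concept, seed) for the nearest seed
    adjective, or None when no seed is reachable."""
    best: Optional[Tuple[int, str, str]] = None
    for concept, seeds in concepts.items():
        for seed in seeds:
            d = hops(graph, adjective, seed)
            if d is not None and (best is None or d < best[0]):
                best = (d, concept, seed)
    return best


# Invented toy synonym links, for illustration only.
TOY_GRAPH = {
    "sunny": ["cheerful", "bright"],
    "cheerful": ["happy"],
    "bright": ["brilliant"],
    "brilliant": ["amazing"],
}

result = classify(TOY_GRAPH, "sunny", CONCEPTS)
```

In this toy graph `sunny` reaches `happy` in two hops and `amazing` in three, so it is labeled with the concept joy. Ties go to whichever concept is listed first, which is one of the places where additional metrics would matter.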

SynNet: Sentiment Analysis Example: the adjective to score is incredible

SynNet: Sentiment Analysis Incredible to loving (concept: love) Blue paths are symmetric paths

SynNet: Sentiment Analysis Incredible to surprising (concept: surprise) Blue paths are symmetric paths

SynNet: Sentiment Analysis Incredible to joyful (concept: joy)

SynNet: Sentiment Analysis Incredible to joyless (concept: sad)

SynNet: Sentiment Analysis Incredible to fearful (concept: fear) Winner!

SynNet: Sentiment Analysis
Try it yourself:
–/synnet/path/white/afraid
–/synnet/path/white/afraid?format=xml
–/synnet/path/white/afraid?format=json
–/synnet/path/white/afraid?format=flash
–Database is only adjectives
–More API and visualizations coming soon

Sentiment Analysis: Issues
Not a perfect solution
–Still need context to get quality
Vain
–['vain', 'insignificant', 'contemptible', 'hateful']
–['vain', 'misleading', 'puzzling', 'surprising']
Animal
–['animal', 'sensual', 'pleasing', 'joyful']
–['animal', 'bestial', 'vile', 'hateful']
–['animal', 'gross', 'shocking', 'fearful']
–['animal', 'gross', 'grievous', 'sorrowful']
Negation
–“My mother was not a hateful person.”

Sentiment Analysis: Process
Process overview
–Extract the adjectives (SEASR, POS analysis)
–Read in adjectives (SEASR)
–Label each adjective (SEASR, SynNet)
–Summarize windows of adjectives (lots of experimentation here)
–Visualize the windows
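The "summarize windows" step could take many forms, and the slides note it involved a lot of experimentation. One simple possibility, sketched below in Python, is to report the share of each emotion among the labeled adjectives in consecutive windows. The window size, the use of `None` for adjectives that could not be labeled, and the function name are all assumptions.

```python
from collections import Counter
from typing import Dict, List, Optional


def summarize_windows(labels: List[Optional[str]],
                      window: int = 50) -> List[Dict[str, float]]:
    """For each run of `window` adjectives, return emotion -> share of the
    labeled adjectives in that run. Unlabeled adjectives (None) are skipped;
    a window with no labeled adjectives yields an empty dict."""
    summaries: List[Dict[str, float]] = []
    for start in range(0, len(labels), window):
        counts = Counter(label for label in labels[start:start + window]
                         if label is not None)
        total = sum(counts.values())
        summaries.append({emotion: c / total for emotion, c in counts.items()}
                         if total else {})
    return summaries


# Invented label sequence, in text order, for illustration only.
labels = ["joy", "joy", "fear", None, "sadness", "sadness"]
summary = summarize_windows(labels, window=3)
```

Each dictionary in `summary` corresponds to one window of the text, which is the shape of data a stacked or stream-style visualization of emotions over the document would need.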

Sentiment Analysis: Visualization
New SEASR visualization component
–Based on Flash, using the flare ActionScript library (…e.org/)
–Still in development (…/data/viewer/emotions.html)

Demonstration
–Son of Blinkie from the NEMA Project
–MONK
–Emotion Tracking

Learning Exercises

Discussion Questions
–Which parts of these applications could be useful to your research?