LECTURE 10: TEXT AS DATA
April 13, 2015
SDS 136: Communicating with Data
Portions of this slide deck adapted from J. Chuang, University of Washington.

Outline
- What is text data?
- Why visualize text?
- Techniques
- Lab

What is text data?
Documents
- Articles, books and novels
- Emails, web pages, blogs
Text snippets
- Tweets, SMS messages
- Tags, comments, profiles
And more...
- Computer programs, logs
- Collections of documents
- This slide!

Discussion
Question: what are some characteristics of text data?
Answer:
- Often high dimensional (over 1 million* words in the English language)
- Packed with meaning and relationships:
  - Correlations: Hong Kong, San Francisco, Bay Area
  - Order: April, February, January, June, March, May
  - Membership: Tennis, Running, Swimming, Hiking, Piano
  - Hierarchy, antonyms and synonyms, entities, ...
* As of 2009, according to languagemonitor.com

Why visualize text data?
- Understand – read a document
- Summarize – get the "gist" of a document
- Cluster – group together similar contents
- Quantify – convert to numerical measures
- Correlate – compare patterns in text to those in other data, e.g., test scores with conversations on social media

"Bag of words" model
Ignore ordering relationships within the text
A document ≈ vector of term weights
- Each dimension corresponds to a term (10,000+)
- Each value represents the relevance of that term, for example, simple term counts
Aggregate into a document-term matrix
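The bag-of-words idea on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the method used in the lecture or lab: the tokenizer (lowercase, letters and apostrophes only), the function names, and the two toy documents are all assumptions made for the example.

```python
from collections import Counter
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens (a simplifying assumption)."""
    return re.findall(r"[a-z']+", text.lower())

def document_term_matrix(docs):
    """Return (vocabulary, matrix), where matrix[i][j] is the count of term j in document i."""
    counts = [Counter(tokenize(d)) for d in docs]
    vocab = sorted(set().union(*counts))           # one dimension per distinct term
    matrix = [[c[t] for t in vocab] for c in counts]
    return vocab, matrix

# Toy documents, invented for illustration only.
docs = [
    "the cat sat on the mat",
    "the dog sat",
]
vocab, dtm = document_term_matrix(docs)
for row in dtm:
    print(row)
```

Each row is one document's vector of simple term counts; word order is discarded, which is exactly the information the model ignores.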

Example: health care reform
Recent history
- Initiatives by President Clinton
- Overhaul by President Obama
Text data
- News articles
- Speech transcriptions
- Legal documents
What questions might you want to answer?

A concrete example

New York Times: Obama 2009
economix.blogs.nytimes.com/2009/09/09/obama-in-09-vs-clinton-in-93

New York Times: Clinton 1993
economix.blogs.nytimes.com/2009/09/09/obama-in-09-vs-clinton-in-93

Comparison
economix.blogs.nytimes.com/2009/09/09/obama-in-09-vs-clinton-in-93
Obama 2009, Clinton 1993, Rep. Charles Boustany of Louisiana 2009

Word clouds
Strengths
- Familiar to many people
- Can help with "gisting" and initial query formation
Weaknesses
- Does not show the structure of the text
- Sub-optimal visual encoding (position is not meaningful)
- Inaccurate size encoding (long words are bigger)
- May not facilitate comparison (unstable layout)
- Term frequency may not be meaningful

Flashback
economix.blogs.nytimes.com/2009/09/09/obama-in-09-vs-clinton-in-93
Obama 2009, Clinton 1993, Rep. Charles Boustany of Louisiana 2009

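Weighting words
Term Frequency
$\mathrm{tf}_{t,d} = \text{number of times term } t \text{ appears in document } d$
TF-IDF: Term Frequency by Inverse Document Frequency
$\mathrm{tf.idf}_{t,d} = \dfrac{\text{number of times term } t \text{ appears in document } d}{\text{number of times term } t \text{ appears in all documents}}$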
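As a rough sketch, the two weights defined on this slide can be computed as follows. The whitespace tokenizer, the function names, and the three toy documents are assumptions for illustration. The code follows the slide's simple ratio; many libraries instead use a logarithmic inverse document frequency, which is a different formula from the one shown here.

```python
from collections import Counter

def term_frequencies(docs):
    """tf[t, d]: number of times term t appears in document d."""
    return [Counter(d.lower().split()) for d in docs]

def tf_idf_ratio(docs):
    """Slide's weighting: count of t in d divided by count of t across all documents."""
    tfs = term_frequencies(docs)
    totals = Counter()
    for tf in tfs:
        totals.update(tf)                          # corpus-wide count of each term
    return [{t: n / totals[t] for t, n in tf.items()} for tf in tfs]

# Toy documents, invented for illustration only.
docs = ["harry saw the owl", "the owl saw the cat", "the cat sat"]
weights = tf_idf_ratio(docs)
for w in weights:
    print(w)
```

A term that occurs only in one document (here "harry") gets the maximum weight of 1.0, while a term spread across every document (here "the") is pushed down.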

Frequency example: Harry Potter

TF-IDF example: Harry Potter

Limitations of frequency statistics
Typically focus on unigrams (single terms)
Often favor frequent (TF) or rare (IDF) terms
- Still not clear that these provide the best description
A "bag of words" ignores additional information
- Grammar / part-of-speech
- Position within document
- Recognizable entities

Example: Yelp reviews Yatani 2011

Tips: descriptive keyphrases
Understand the limitations of your language model
Bag of words:
- Easy to compute
- Single words
- Loss of word ordering
Select appropriate model and visualization
- Generate longer, more meaningful phrases
- Adjective-noun word pairs for reviews
- Show keyphrases in context
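One simple way to get phrases longer than single words is to count n-grams (runs of n adjacent words). The sketch below counts bigrams; it is only an illustration, with an assumed tokenizer and an invented review sentence, and it does not do the part-of-speech tagging that adjective-noun pairs would require.

```python
from collections import Counter
import re

def ngrams(text, n=2):
    """Return all runs of n adjacent word tokens, joined with spaces."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Review text invented for illustration only.
review = "Great food and great service, but the great food was cold"
bigram_counts = Counter(ngrams(review, 2))
top = bigram_counts.most_common(1)
print(top)
```

Unlike single-word counts, the bigram "great food" keeps a small piece of word order, which often makes the resulting keyphrase easier to read in context.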

Discussion
What are some other ways we might visualize text data?

Lab 9: working with text in Tableau
Instructions for today's lab are available at:
We'll be working with data from several famous novels available through Project Gutenberg:
- "A Tale of Two Cities" by Charles Dickens
- "Little Women" by Louisa May Alcott
- "Alice's Adventures in Wonderland" by Lewis Carroll
- "Jane Eyre" by Charlotte Brontë
- "The Arabian Nights" as translated by Sir Richard Burton
- "Don Quixote" by Miguel de Cervantes
To get credit for this lab, use your visualization to identify one interesting feature or trend in this dataset and post it to Piazza.