Intro to Practical Natural Language Processing. Wharton Data Camp, Session 8. Agenda: 1) Tasks in NLP 2) Use NLTK.


Quick Overview of Resources For Machine Learning, NLP, and Econometrics: Wharton Specific and Books

Machine Learning/ NLP classes CIS 520 Machine Learning CIS 530 NLP CIS 630 Machine Learning for NLP There are more classes for theory in STAT/CIS

Awesome Machine Learning Books! The Elements of Statistical Learning – Hastie, Tibshirani, and Friedman – ML bible 1 Pattern Recognition and Machine Learning – Chris Bishop – ML bible 2 IF YOU WANT DEEP UNDERSTANDING OF THE MATERIALS Statistical Learning Theory – Theory of ML Bible 1 A Probabilistic Theory of Pattern Recognition – Theory of ML Bible 2

If you are going to do any sort of empirical work STAT 500 – If you have never taken a course in Econometrics STAT 520 – basic econometrics STAT 521 – Use of R for applied econometrics (this course went through a major change) STAT 541 – Andreas Buja on multivariate stat and writing (There exists one and only one required textbook in this course and that’s a writing book) STAT 542 – Shane Jensen Bayesian stat (Jensen is the man) STAT 921 – Dylan Small Observational study (Required if you are doing any empirical work) Econ for theory

Subjective Econometric Books Recs William H. Greene is great “Mostly Harmless Econometrics” is great Edward Frees’ longitudinal and panel data: analysis and applications in the social sciences IS one of my favorite econometric books Lot more based on usage but ask me separately

ML in Business & Combining the two Data Science for Business: What you need to know about data mining and data-analytic thinking (For Intro & overview) science-for-biz.com/ – Foster Provost: Great researcher in IS at NYU – 72 Reviews on Amazon, 4.7 average! Targeted Learning – Springer Series in Statistics (AKA Serious Series) – Incorporate Machine Learning into Causal Inference – UCLA Statisticians

Good Quick Cook-book style NLP books FREE BOOK online Jurafsky & Martin “Speech and Language Processing” for deep theory Bing Liu’s two books:

There are many tasks that NLP can do and many are hard Machine translation – Very hard – Funny – Hilarious Video (Fresh Prince of Bel-Air theme after it was translated several times into different languages) Sentiment detection Automatic summarization Etc

Today Supervised Learning + NLP – Identifying certain content (this is what we will probably use the most). Content-coding. – A Research Example – Sentiment Analysis Example

Supervised Learning + NLP Given: – a set of texts (corpus), – and labels (together comprising the training set) – A label can be: certain content exists, negative/positive sentiment, etc. Goal: – create an algorithm that mimics the labels

Imagine a task: You are an NSA agent OR you are a hacker. You are given a job OR you are on a mission and are looking for fellow hackers. Train an NLP algorithm to tell whether a sentence or short text on the internet contains plans for a hacking/DDoS attack. “Greetings, fellow anons, we have a new target in our movement against RIAA [...] WE WILL NOT TOLERATE ANY LONGER!”

What do we humans do to recognize that the content is present? “Greetings, fellow anons, we have a new target in our movement against RIAA [...] WE WILL NOT TOLERATE ANY LONGER!” Key words: target, movement, anons, RIAA, not, tolerate. Bigrams: new target, our movement, against RIAA, not tolerate. Use of upper case and “!”. Etc.
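The cues listed on this slide can be turned into numbers. The following is a minimal sketch, not the course's nlp.py: the function name `extract_features` and the hard-coded keyword and bigram lists are assumptions for illustration, copied from the slide's example.

```python
import re

MESSAGE = ("Greetings, fellow anons, we have a new target in our movement "
           "against RIAA [...] WE WILL NOT TOLERATE ANY LONGER!")

# Hypothetical cue lists, copied from the slide's example (lower-cased).
KEYWORDS = {"target", "movement", "anons", "riaa", "not", "tolerate"}
BIGRAMS = {("new", "target"), ("our", "movement"),
           ("against", "riaa"), ("not", "tolerate")}


def extract_features(text):
    """Count the cues a human reader would notice in a short text."""
    words = re.findall(r"[A-Za-z']+", text)       # alphabetic words only
    lower = [w.lower() for w in words]
    pairs = list(zip(lower, lower[1:]))           # adjacent word pairs
    return {
        "keyword_hits": sum(1 for w in lower if w in KEYWORDS),
        "bigram_hits": sum(1 for p in pairs if p in BIGRAMS),
        # Words written entirely in capitals (single letters ignored).
        "upper_case_words": sum(1 for w in words if len(w) > 1 and w.isupper()),
        "exclamations": text.count("!"),
    }


features = extract_features(MESSAGE)
print(features)
```

Each dictionary entry becomes one x-variable for the message; a real project would learn or curate the cue lists rather than hard-code them.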

Narrow and Specific NLP Example I can only show you one very specific example of NLP today You need to take at least a machine learning course and an NLP course to be able to do this type of processing comfortably – 2 courses will probably suffice for applying ML + NLP for your research

Overview of 1 Example in NLP: Identifying certain content in text, e.g., positive/negative sentiment
1. Find text data (short text or a sentence – a review, for example)
2. Break the sentence down into basic building blocks using NLP techniques I’ll show – the outcome is an ordered list of building blocks
3. Process the ordered list of building blocks and come up with many sentence-level patterns – these will be the x-variables or sentence-level attributes (e.g., for content = “positive review or not”) – Count the occurrences of the word “great”: X1 – Count the number of laudatory words: X2 – Etc. Recording certain patterns: Xn
4. Obtain text data with labels (positive or not): this is called the gold set and comes with y-variable {positive, negative} tags
5. Use machine learning techniques on the gold set to learn the relationship between the x-variables from step 3 and the y-variable from step 4. This part is training the machine learning algorithm.

Basic idea in NLP: identifying certain content in text
1. Find text data
2. Breaking Sentence: break the sentence down into understandable building blocks (e.g., words or lemmas)
3. Sentence Attribute Generation: identify different sentence-attributes just as humans do when reading (many to be explained)
4. Gold Set Generation: obtain a set of training sentences with labels identifying if the sentences do or do not have certain content from a reliable source (gold data set)
5. Training: use statistical tools to infer which sentence-attributes are correlated with certain content outcomes, thereby “learning” to identify content in sentences.

NLP uses machine learning Machine Learning (Classification) – Supervised Learning – given training data with x-vars & y-vars, infer the function “f” in y=f(x). Curve fitting is a basic example of supervised learning. You need labeled training data, i.e., X-Y pairs. – Unsupervised Learning – the problem of finding hidden structure in unlabeled data (just x-vars). E.g., clustering. – NLP uses both; in our context it’s supervised learning
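To make "infer f from X-Y pairs" concrete, here is a small sketch of the curve-fitting case mentioned above: ordinary least squares for a straight line. The function name `fit_line` and the toy data are assumptions for illustration only.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept to labeled pairs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Toy training data: the x-vars and their labels (y-vars).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

slope, intercept = fit_line(xs, ys)


def predict(x):
    """The learned function f, applied to an unseen x."""
    return slope * x + intercept


print(slope, intercept, predict(4.0))
```

Text classification follows the same pattern: the x-vars are sentence attributes and the y-var is a tag such as positive or negative.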

Supervised Learning Taken from nltk.com

Breaking Sentence Stop-words removal: removing punctuation and words with low information, such as the definite article “the”. Tokenizing: the process of breaking a sentence into words, phrases, and symbols, or “tokens”. Stemming: the process of reducing inflected words to their root form, e.g., “playing” to “play”. Part-of-speech tagging: determining the part of speech (noun, verb, etc.) of each word.

Sentence Attribute Generation Bag of words: collect the words. Counted bag of words: the words and how often each occurs. Bigram: a bigram is formed by two adjacent words (e.g., “Bigram is” and “is formed” are bigrams). N-gram: the same idea with n adjacent words. Specific keywords (“like”, “love”, “bad”). Frequency count of certain parts of speech. The location of certain words. Count of the use of !, ?, etc. SO MANY MORE! In big projects, engineers develop algorithms to generate attributes automatically!
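A short sketch of the first few attribute types on this slide, using only the standard library. The helper name `ngrams` and the sample sentences are assumptions for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n adjacent tokens, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "a bigram is formed by two adjacent words".split()

bag = set(tokens)              # bag of words: which words occur
bigrams = ngrams(tokens, 2)    # pairs of adjacent words
trigrams = ngrams(tokens, 3)   # n-grams with n = 3

# Counted bag of words: each word with its number of occurrences.
counts = Counter("i love this phone and i love the screen".split())

print(bigrams[:3])
print(counts.most_common(2))
```

Any of these counts can serve directly as an x-variable, for example `counts["love"]` as the number of times "love" appears.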

Gold Set Generation Get example sentences or text data – You tag them – Or get RAs – Or use Amazon Mechanical Turk. Or there may be an existing database – Online tagged corpora. Speaking of databases for NLP, they are not used in this context, but great resources exist – Check out WordNet and FrameNet + more

Training the classifiers You are done breaking the sentences and generating sentence attributes – these are x variables Y-variables are the tags you obtained Use your favorite ML algorithm or combinations – Regular GLM – SVM – Naïve Bayes – Neural Network – Decision Tree – Conditional Random Forest – Ensemble Learning: Boosting and Bagging – ETC
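As one instance of the algorithms listed above, here is a minimal Naive Bayes sketch written from scratch, with add-one smoothing over a bag of words. It is an illustration under stated assumptions, not the course code: the function names and the four-sentence gold set are made up, and in practice a library implementation (NLTK ships one as `nltk.NaiveBayesClassifier`) would be the usual choice.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(examples):
    """examples: list of (tokens, label) pairs, i.e. the gold set."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in examples:
        label_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return label_counts, word_counts, vocab


def classify(model, tokens):
    """Return the label with the highest log-probability for the tokens."""
    label_counts, word_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        score = math.log(count / total)                   # log prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for token in tokens:
            if token in vocab:                            # skip unseen words
                # Add-one (Laplace) smoothed word likelihood.
                score += math.log((word_counts[label][token] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Tiny hypothetical gold set: x-variables are tokens, y-variables are tags.
gold = [
    ("great phone love it".split(), "positive"),
    ("love the great screen".split(), "positive"),
    ("bad battery hate it".split(), "negative"),
    ("terrible screen bad phone".split(), "negative"),
]
model = train_naive_bayes(gold)
print(classify(model, "love this great phone".split()))
print(classify(model, "bad terrible battery".split()))
```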

Let’s go deeper into each stage First, Breaking Sentence

Natural Language Processing Tasks “Hurricane Sandy grounded 3,200 flights scheduled for today and tomorrow, prompted New York to suspend subway and bus service and forced the evacuation of the New Jersey shore as it headed toward land with life-threatening wind and rain. The system, which killed as many as 65 people in the Caribbean on its path north, may be capable of inflicting as much as $18 billion in damage when it barrels into New Jersey tomorrow and knock out power to millions for a week or more, according to forecasters and risk experts.” (Bloomberg article on Sandy) Slides Taken from: Bommarito Consulting

What kind of questions can we ask? Basic – What is the structure of the text? Paragraphs, sentences, tokens/words – What are the words that appear in this text? Nouns (subjects, direct objects), verbs Advanced – What are the concepts that appear in this text? – How does this text compare to other text?

Segmentation and Tokenization (applied to the Hurricane Sandy passage above) Segment types: paragraphs, sentences, tokens

Segmentation and Tokenization: But how does it work? Paragraphs – Two consecutive line breaks – A hard line break followed by an indent Sentences – Period, except abbreviation, ellipsis within quotation, etc. Tokens and Words – Whitespace – Punctuation
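The rules on this slide can be sketched with regular expressions. This is a deliberately naive version: it implements the blank-line paragraph rule but not the indent rule, and the sentence splitter does not handle the abbreviation and quotation exceptions the slide mentions. The function names and the shortened sample text are assumptions. NLTK offers `nltk.sent_tokenize` and `nltk.word_tokenize` for real use.

```python
import re


def split_paragraphs(text):
    """Paragraph rule: two consecutive line breaks."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def split_sentences(paragraph):
    """Naive sentence rule: end punctuation, whitespace, then a capital."""
    return re.split(r"(?<=[.!?])\s+(?=[A-Z])", paragraph)


def tokenize(sentence):
    """Token rule: split on whitespace and punctuation, keeping numbers
    such as 3,200 or $18 and hyphenated words in one piece."""
    return re.findall(r"\$?\d[\d,.]*\d|\$?\d|\w+(?:[-']\w+)*|[^\w\s]", sentence)


text = ("Hurricane Sandy grounded 3,200 flights. It headed toward land "
        "with life-threatening wind and rain.\n\n"
        "The system may cause $18 billion in damage.")

paragraphs = split_paragraphs(text)
sentences = split_sentences(paragraphs[0])
tokens = tokenize(sentences[0])
print(len(paragraphs), len(sentences), tokens)
```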

Segmentation and Tokenization (results for the Hurricane Sandy passage above) Paragraphs: 2 Sentences: 2 Words: 561. – ['Hurricane', 'Sandy', 'grounded', '3,200', 'flights', 'scheduled', 'for', 'today', 'and', 'tomorrow', …]

What kind of questions can we ask? We now have an ordered list of tokens. ['Hurricane', 'Sandy', 'grounded', '3,200', 'flights', 'scheduled', 'for', 'today', 'and', 'tomorrow', …]

Stop Words Removal (applied to the Hurricane Sandy passage above) Result: Hurricane Sandy grounded 3,200 flights scheduled today tomorrow, prompted New York suspend subway bus service forced evacuation New Jersey shore headed toward land life-threatening wind rain. System, killed many 65 people Caribbean path north, may capable inflicting much $18 billion damage barrels New Jersey tomorrow knock power millions week, according forecasters risk experts.
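Stop-word removal is a simple filter over the token list. In this sketch the stop-word set is a small hypothetical sample; NLTK ships fuller lists under `nltk.corpus.stopwords`.

```python
# A small, hypothetical stop-word list; real lists are much longer.
STOP_WORDS = {"the", "and", "for", "to", "of", "as", "it", "with",
              "which", "in", "on", "its", "be", "when", "into", "a", "or"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop low-information words, comparing case-insensitively."""
    return [t for t in tokens if t.lower() not in stop_words]


tokens = "prompted New York to suspend subway and bus service".split()
filtered = remove_stop_words(tokens)
print(" ".join(filtered))
```

The order of the remaining tokens is preserved, so later steps such as bigram generation still work on the filtered list.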

Stop Words Removal + Stemming (applied to the Hurricane Sandy passage above) Result: Hurrican Sandi ground 3,200 flight schedul today tomorrow, prompt New York suspend subway bu servic forc evacu New Jersey shore head toward land life-threaten wind rain. System, kill mani 65 peopl Caribbean path north, may capabl inflict much $18 billion damag barrel New Jersey tomorrow knock power million week, accord forecast risk expert.
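To show the idea behind stemming, here is a crude suffix-stripping sketch. It is an assumption-laden toy, far simpler than the Porter-style stemmer that appears to have produced the slide's output (for example it leaves "bus" and "Sandy" alone and lower-cases everything); NLTK provides `nltk.stem.PorterStemmer` for real use.

```python
# Suffixes are tried in this order; the first match is stripped.
SUFFIXES = ("ing", "ed", "es", "s")


def crude_stem(word):
    """Strip one common suffix, keeping at least three characters."""
    lower = word.lower()
    for suffix in SUFFIXES:
        if lower.endswith(suffix) and len(lower) - len(suffix) >= 3:
            return lower[:-len(suffix)]
    return lower


for word in ["playing", "flights", "grounded", "scheduled", "bus"]:
    print(word, "->", crude_stem(word))
```

The point of stemming is that "play", "playing", and "played" end up as the same attribute, which shrinks the vocabulary the classifier has to learn from.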

Part of Speech Tagging (applied to the Hurricane Sandy passage above) Result: [('Hurricane', 'NNP'), ('Sandy', 'NNP'), ('grounded', 'VBD'), ('3,200', 'CD'), ('flights', 'NNS'), ('scheduled', 'VBN'), ('for', 'IN'), ('today', 'NN'), ('and', 'CC'), ('tomorrow', 'NN'), …] NNP: Proper Noun, Singular; NNS: Noun, Plural; VBD: Verb, Past Tense; VBN: Verb, Past Participle; CD: Cardinal Number; IN: Preposition/subordinating conjunction; etc. For more
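A toy rule-based tagger shows where tags like these can come from. This sketch is only a baseline built on a handful of hypothetical patterns; it cannot tell VBD from VBN, for instance, so it tags "scheduled" differently from the slide. The tags on the slide would typically come from a trained tagger such as `nltk.pos_tag`.

```python
import re

# (pattern, tag) pairs, tried in order; hypothetical rules for illustration.
PATTERNS = [
    (r"[0-9][0-9,.]*", "CD"),            # cardinal numbers such as 3,200
    (r"and|or|but", "CC"),               # coordinating conjunctions
    (r"for|in|of|on|with|into", "IN"),   # a few prepositions
    (r"[A-Z].*", "NNP"),                 # capitalized word: proper noun
    (r".*ed", "VBD"),                    # -ed ending: past-tense verb
    (r".*s", "NNS"),                     # -s ending: plural noun
]


def regex_tag(tokens):
    """Tag each token with the first matching rule, defaulting to NN."""
    tagged = []
    for token in tokens:
        tag = "NN"
        for pattern, candidate in PATTERNS:
            if re.fullmatch(pattern, token):
                tag = candidate
                break
        tagged.append((token, tag))
    return tagged


tokens = ["Hurricane", "Sandy", "grounded", "3,200", "flights",
          "scheduled", "for", "today", "and", "tomorrow"]
tagged = regex_tag(tokens)
print(tagged)
```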

Let’s go deeper into each stage Second, Sentence Attribute Generation

Remember one thing When you read sentences yourself, what do you notice about what you notice? Make those into attributes! The goal is to mimic what we humans do

Let’s go deeper into each stage Third, Gold Set Generation

Resources for Gold Set Generation – Yourself – RA: pretty expensive – AMT: Amazon Mechanical Turk. Obtain multiple tags per item and check inter-rater agreement to be robust
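One common way to check agreement between two raters is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The slide does not say which statistic to use, so treat this as one possible choice; the function name and the toy tags are assumptions.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who tagged the same items."""
    n = len(rater_a)
    observed = sum(1 for a, b in zip(rater_a, rater_b) if a == b) / n
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    labels = set(counts_a) | set(counts_b)
    # Agreement expected if both raters tagged at random with their own rates.
    expected = sum(counts_a[l] * counts_b[l] for l in labels) / (n * n)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical tags from two raters on the same eight sentences.
rater_a = ["pos", "pos", "neg", "neg", "pos", "neg", "pos", "neg"]
rater_b = ["pos", "pos", "neg", "pos", "pos", "neg", "neg", "neg"]

kappa = cohen_kappa(rater_a, rater_b)
print(kappa)
```

A kappa of 1 means perfect agreement and 0 means no better than chance; with more than two raters, a multi-rater statistic such as Fleiss' kappa is the usual alternative.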

Research Example ssrn.com/abstract=

Introduction & Motivation: Research Question What content attributes of social media messages elicit greater consumer response & engagement? E.g., 1. What’s the comparative effect of informative advertising (product, price information, etc.) vs. persuasive advertising (emotion, humor, etc.) on engagement? 2. Differences across industries?

Data: Sample Messages from Walmart (Dec Score an iPad 3 for an iPad2 price! Now at your local store, $50 off the iPad 3. Plus, get a $30 iTunes Gift Card. Offer good through 12/31 or while supplies last. (Product Advertisement + Deal + Product Location + Product Stock Availability) Rollback with Vizio. Select models have lower prices ranging from $228 for a 32" (diagonal screen size 31.5") LCD TV to $868 on a 55" (diagonal screen size 54.6") LED TV. (Product Advertisement + Price info + Brand Mention + Link) Maria’s mission is helping veterans and their families find employment. Like this and watch Maria’s story. (Philanthropic Message + Explicit Like solicitation + Link)

Data Post-level panel data on messages posted by many companies from Sep 2011 to July 2012: message content; impressions, likes and comments on a daily basis. Page-level panel data on each page: page statistics on a daily basis (e.g., fan number, industry type); aggregate demographics of fans and post viewers (impressions demographics). After cleaning: 106,316 unique messages posted by 782 companies. Daily likes & comments: 1.3 million rows of post-level snapshots recording about 450 million page fans’ responses.

Empirical Strategy: Variables Engagement Metric (Dependent Variable): comments, likes. Variables that affect engagement (Independent Variables): Informative Ad Content: brand and product mention, price, deals, product availability, etc. Persuasive Ad Content: emotion, humor, philanthropic, emoticon, small talk, etc. Message Type: photo, video, status update, app, link. Controls: impressions, industry type, days since post, reading complexity, message length, etc.

Data: Message Content Tagging At least 9 different workers per message + majority vote. Used to train natural language processing algorithm to tag remaining posts – 7 statistical classifiers + rule-based method combined by ensemble learning – Greater than 99% accuracy, precision, and recall for most variables (10-fold CV). Worker Eligibility Criteria – Must have > 97% accuracy – Must have > 100 previously approved tasks – Location: US only. Criteria for using the input – Question for detecting if the worker is paying attention – Completion duration > 30 seconds (avg took 3 min) – Plus, 5+ more protocols
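Two pieces of this slide are easy to sketch: combining several workers' tags by majority vote, and scoring predictions with accuracy, precision, and recall. The function names and the toy votes below are assumptions; the sketch does not reproduce the study's classifiers, its ensemble, or its 10-fold cross-validation.

```python
from collections import Counter


def majority_vote(labels):
    """Return the most frequent label among one message's worker tags."""
    return Counter(labels).most_common(1)[0][0]


def precision_recall_accuracy(gold, predicted, positive=True):
    """Score predicted tags against gold tags for one binary variable."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    correct = sum(1 for g, p in pairs if g == p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, correct / len(pairs)


# Hypothetical tags from 9 workers for each of three messages
# ("does this message mention a price?").
worker_votes = [
    [True] * 6 + [False] * 3,
    [False] * 7 + [True] * 2,
    [True] * 5 + [False] * 4,
]
gold_tags = [majority_vote(votes) for votes in worker_votes]

# Hypothetical classifier output scored against hypothetical gold tags.
precision, recall, accuracy = precision_recall_accuracy(
    gold=[True, False, True, True],
    predicted=[True, True, True, False],
)
print(gold_tags, precision, recall, accuracy)
```

With an odd number of workers and a yes/no tag, a majority always exists, which is one practical reason to collect an odd number of votes per message.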

NLP Algorithm Process

NLP Algorithm Performance

NLTK

Open up nlp.py

WITH BAD NLP | WITH GOOD NLP “COMPUTER, HOT EARL GREY TEA” “COMPUTER, TEA, EARL GREY, HOT”

This Concludes the 2014 Wharton Tech/Data Camp Please help me and give feedback on this course for improvement. Thank you! =SV_agzfeKZvPQD0hUN