Intro to Practical Natural Language Processing Wharton Data Camp Sessions 8 Agenda 1)Tasks in NLP 2)Use NLTK.


1 Intro to Practical Natural Language Processing Wharton Data Camp Sessions 8 Agenda 1)Tasks in NLP 2)Use NLTK

2 Quick Overview of Resources For Machine Learning, NLP, and Econometrics: Wharton Specific and Books

3 Machine Learning / NLP classes
CIS 520 Machine Learning
CIS 530 NLP
CIS 630 Machine Learning for NLP
There are more classes for theory in STAT/CIS

4 Awesome Machine Learning Books!
The Elements of Statistical Learning – Hastie, Tibshirani, and Friedman – ML bible 1
Pattern Recognition and Machine Learning – Chris Bishop – ML bible 2
IF YOU WANT DEEP UNDERSTANDING OF THE MATERIALS
Statistical Learning Theory – Theory of ML Bible 1
A Probabilistic Theory of Pattern Recognition – Theory of ML Bible 2

5 If you are going to do any sort of empirical work
STAT 500 – if you have never taken a course in econometrics
STAT 520 – basic econometrics
STAT 521 – use of R for applied econometrics (this course went through a major change)
STAT 541 – Andreas Buja on multivariate stat and writing (there is exactly one required textbook in this course, and it is a writing book)
STAT 542 – Shane Jensen, Bayesian stat (Jensen is the man)
STAT 921 – Dylan Small, observational studies (required if you are doing any empirical work)
ECON 705-706 for theory

6 Subjective Econometric Book Recs
William H. Greene is great
“Mostly Harmless Econometrics” is great
Edward Frees’ “Longitudinal and Panel Data: Analysis and Applications in the Social Sciences” IS one of my favorite econometric books
Lots more depending on usage, but ask me separately

7 ML in Business & Combining the two
Data Science for Business: What You Need to Know about Data Mining and Data-Analytic Thinking (for intro & overview) http://data-science-for-biz.com/
– Foster Provost: great researcher in IS at NYU
– 72 reviews on Amazon, 4.7 average!
Targeted Learning
– Springer Series in Statistics (AKA Serious Series)
– Incorporates machine learning into causal inference
– UCLA statisticians

8 Good Quick Cook-book style NLP books
http://www.nltk.org/
FREE BOOK online: http://nltk.org/book/
Jurafsky & Martin, “Speech and Language Processing”, for deep theory
Bing Liu’s two books: http://www.cs.uic.edu/~liub/

9 There are many tasks that NLP can do and many are hard
Machine translation
– Very hard
– Funny: http://translationparty.com/
– Hilarious video (Fresh Prince of Bel-Air theme after it was translated several times into different languages): http://www.youtube.com/watch?v=LMkJuDVJdTw
Sentiment detection
Automatic summarization
Etc.

10 Today
Supervised Learning + NLP
– Identifying certain content (this is what we will probably use the most). Content-coding.
– A Research Example
– Sentiment Analysis Example

11 Supervised Learning + NLP
Given:
– a set of texts (corpus),
– and labels (together comprising the training set)
– A label can mark that certain content exists, negative/positive sentiment, etc.
Goal:
– create an algorithm that mimics the labels

12 Imagine a task
You are an NSA agent OR you are a hacker
You are given a job OR you are on a mission and are looking for fellow hackers
Train an NLP algorithm to tell whether a sentence or short text on the internet contains plans for a hacking/DDoS attack
“Greetings, fellow anons, we have a new target in our movement against RIAA [...] WE WILL NOT TOLERATE ANY LONGER!”

13 What do we, humans, do in realizing the existence of the content?
“Greetings, fellow anons, we have a new target in our movement against RIAA [...] WE WILL NOT TOLERATE ANY LONGER!”
Key words: target, movement, anons, RIAA, not, tolerate
Bigrams: new target, our movement, against RIAA, not tolerate
Use of upper case and “!”
ETC
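The cues listed on this slide can be turned into numbers. The following is a minimal sketch in plain Python, not the code from the session; the function name `surface_features` and the keyword set are illustrative assumptions.

```python
import re

def surface_features(text):
    """Turn the cues a human reader notices into numeric attributes."""
    tokens = re.findall(r"[A-Za-z']+", text)
    lowered = [t.lower() for t in tokens]
    bigrams = list(zip(lowered, lowered[1:]))
    # Hand-picked keywords taken from the slide; a real project would learn these.
    keywords = {"target", "movement", "anons", "riaa", "tolerate"}
    upper_tokens = [t for t in tokens if len(t) > 1 and t.isupper()]
    return {
        "keyword_hits": sum(1 for t in lowered if t in keywords),
        "has_new_target": ("new", "target") in bigrams,
        "has_not_tolerate": ("not", "tolerate") in bigrams,
        "upper_ratio": len(upper_tokens) / len(tokens) if tokens else 0.0,
        "exclamations": text.count("!"),
    }

message = ("Greetings, fellow anons, we have a new target in our movement "
           "against RIAA [...] WE WILL NOT TOLERATE ANY LONGER!")
features = surface_features(message)
```

Each dictionary entry is one candidate x-variable for the classifier discussed later.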

14 Narrow and Specific NLP Example I can only show you one very specific example of NLP today You need to take at least a machine learning course and an NLP course to be able to do this type of processing comfortably – 2 courses will probably suffice for applying ML + NLP for your research

15 Overview of 1 Example in NLP: Identifying certain content in text, e.g., positive/negative sentiment
1. Find text data (short text or a sentence, a review for example)
2. Break the sentence down into basic building blocks using the NLP techniques I’ll show; the outcome is an ordered list of building blocks
3. Process the ordered list of building blocks and come up with many sentence-level patterns; these will be the x-variables, or sentence-level attributes (e.g., content = “positive review or not”)
– Count the occurrences of the word “great”: X_1
– Count the number of laudatory words: X_2
– Etc. Recording certain patterns: X_n
4. Obtain text data with labels (positive or not): this is called the gold set and comes with y-var {positive, negative} tags
5. Use machine learning techniques on the gold set to learn the relationship between the X-vars from step 3 and the Y-var from step 4. This part is training the machine learning algorithm.
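Steps 3 and 4 can be sketched in a few lines of plain Python. The word list `LAUDATORY`, the two example reviews, and the function name are made up for illustration; they are assumptions, not material from the session.

```python
import re

# Hypothetical list of laudatory words; a real list would be much longer.
LAUDATORY = {"great", "excellent", "wonderful", "love", "best"}

def sentence_attributes(sentence):
    """Return [X_1, X_2] for one sentence."""
    words = re.findall(r"[a-z']+", sentence.lower())
    x1 = words.count("great")                       # X_1: occurrences of "great"
    x2 = sum(1 for w in words if w in LAUDATORY)    # X_2: number of laudatory words
    return [x1, x2]

# A toy gold set: text paired with a y-var tag.
gold_set = [
    ("Great phone, great battery, love it", "positive"),
    ("The screen cracked after a week", "negative"),
]
X = [sentence_attributes(text) for text, _ in gold_set]
y = [label for _, label in gold_set]
```

`X` and `y` are then what step 5 hands to a machine learning algorithm.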

16 Basic idea in NLP: identifying certain content in text
1. Find text data
2. Breaking Sentence: break the sentence down into understandable building blocks (e.g., words or lemmas)
3. Sentence Attribute Generation: identify different sentence-attributes just as humans do when reading (many to be explained)
4. Gold Set Generation: obtain a set of training sentences with labels identifying if the sentences do or do not have certain content from a reliable source (gold data set)
5. Training: use statistical tools to infer which sentence-attributes are correlated with certain content outcomes, thereby “learning” to identify content in sentences.

17 NLP uses machine learning
Machine Learning (Classification)
– Supervised Learning – given training data (x-vars and y-vars), infer a function f such that y = f(x). Curve fitting is a basic example of supervised learning. You need labeled training data, i.e., X-Y pairs.
– Unsupervised Learning – the problem of finding hidden structure in unlabeled data (x-vars only), e.g., clustering.
– NLP uses both; in our context it is supervised learning.
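Curve fitting as supervised learning can be shown with an ordinary least-squares line. This is a minimal sketch on made-up X-Y pairs; the names `fit_line` and `f` are illustrative.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept from labeled pairs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Toy training data: x-vars with their y-var labels.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
slope, intercept = fit_line(xs, ys)

def f(x):
    """The learned function, usable on x values not seen in training."""
    return slope * x + intercept
```

The same pattern holds for text: the x-vars become sentence attributes and the y-var becomes a content tag.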

18 Supervised Learning
Taken from nltk.org

19 Breaking Sentence
Stop-word removal: removing punctuation and words with low information, such as the definite article “the”
Tokenizing: the process of breaking a sentence into words, phrases, and symbols, or “tokens”
Stemming: the process of reducing inflected words to their root form, e.g., “playing” to “play”
Part-of-speech tagging: determining the part of speech of each word, such as noun, verb, etc.

20 Sentence Attribute Generation
Bag of words: collect the words
Counted bag of words: the words and how often each occurs
Bigram: two adjacent words (e.g., “Bigram is” and “is formed” are bigrams)
N-gram: the same idea for n adjacent words
Specific keywords (“like”, “love”, “bad”)
Frequency count of certain parts of speech
Location of certain words
Count of “!”, “?”, etc.
SO MANY MORE! In big projects, engineers develop algorithms to generate attributes automatically!
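The first few attribute types are easy to compute by hand. Below is a small sketch using only the standard library; NLTK also ships `nltk.bigrams` and `nltk.ngrams` for the same purpose. The helper name `ngrams` and the sample sentence are illustrative.

```python
from collections import Counter

def ngrams(tokens, n):
    """All runs of n adjacent tokens, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "a bigram is formed by two adjacent words".split()

bag = set(tokens)               # bag of words: which words occur
counted_bag = Counter(tokens)   # counted bag of words: word -> frequency
bigrams = ngrams(tokens, 2)     # n = 2
trigrams = ngrams(tokens, 3)    # n = 3
```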

21 Gold Set Generation
Get example sentences or text data
– Tag them yourself
– Or get RAs to tag them
– Or use Amazon Mechanical Turk
Or a tagged database may already exist
– Online tagged corpora
Speaking of databases for NLP: they are not used in this context, but great resources exist
– Check out WordNet and FrameNet + more

22 Training the classifiers
You are done breaking the sentences and generating sentence attributes – these are the x-variables
The y-variables are the tags you obtained
Use your favorite ML algorithm or a combination
– Regular GLM
– SVM
– Naïve Bayes
– Neural Network
– Decision Tree
– Conditional Random Forest
– Ensemble Learning: Boosting and Bagging
– ETC
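To make the training step concrete, here is a minimal sketch of one of the listed options, a multinomial Naïve Bayes classifier with add-one smoothing, written from scratch. The toy gold set and function names are assumptions for illustration; in practice a library implementation such as `nltk.NaiveBayesClassifier` would normally be used.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(examples):
    """examples: list of (tokens, label) pairs. Returns the counts the model needs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in examples:
        label_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return {"labels": label_counts, "words": word_counts, "vocab": vocab}

def predict(model, tokens):
    """Return the label with the highest log-probability score."""
    total_docs = sum(model["labels"].values())
    vocab_size = len(model["vocab"])
    best_label, best_score = None, float("-inf")
    for label, doc_count in model["labels"].items():
        counts = model["words"][label]
        total_words = sum(counts.values())
        score = math.log(doc_count / total_docs)          # log prior
        for t in tokens:
            if t in model["vocab"]:                       # skip words never seen in training
                score += math.log((counts[t] + 1) / (total_words + vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy gold set (made up): tokenized text with its tag.
gold = [
    ("great phone love it".split(), "positive"),
    ("love the great screen".split(), "positive"),
    ("bad battery terrible screen".split(), "negative"),
    ("terrible phone bad support".split(), "negative"),
]
model = train_naive_bayes(gold)
```

Calling `predict(model, "great battery love".split())` scores each label and returns the more likely one.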

23 Let’s go deeper into each stage
First, Breaking Sentence

24 Natural Language Processing Tasks
“Hurricane Sandy grounded 3,200 flights scheduled for today and tomorrow, prompted New York to suspend subway and bus service and forced the evacuation of the New Jersey shore as it headed toward land with life-threatening wind and rain.
The system, which killed as many as 65 people in the Caribbean on its path north, may be capable of inflicting as much as $18 billion in damage when it barrels into New Jersey tomorrow and knock out power to millions for a week or more, according to forecasters and risk experts.” (Bloomberg article on Sandy)
Slides Taken from: Bommarito Consulting

25 What kind of questions can we ask?
Basic
– What is the structure of the text? Paragraphs, sentences, tokens/words
– What are the words that appear in this text? Nouns (subjects, direct objects), verbs
Advanced
– What are the concepts that appear in this text?
– How does this text compare to other text?
Natural Language Processing Tasks
Slides Taken from: Bommarito Consulting

26 Segmentation and Tokenization
“Hurricane Sandy grounded 3,200 flights scheduled for today and tomorrow, prompted New York to suspend subway and bus service and forced the evacuation of the New Jersey shore as it headed toward land with life-threatening wind and rain.
The system, which killed as many as 65 people in the Caribbean on its path north, may be capable of inflicting as much as $18 billion in damage when it barrels into New Jersey tomorrow and knock out power to millions for a week or more, according to forecasters and risk experts.”
Segment types: paragraphs, sentences, tokens
Natural Language Processing Tasks
Slides Taken from: Bommarito Consulting

27 Segmentation and Tokenization
But how does it work?
Paragraphs
– Two consecutive line breaks
– A hard line break followed by an indent
Sentences
– Period, except abbreviation, ellipsis within quotation, etc.
Tokens and Words
– Whitespace
– Punctuation
Natural Language Processing Tasks
Slides Taken from: Bommarito Consulting
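These rules translate almost directly into regular expressions. The sketch below is deliberately naive: it does not handle the abbreviation and ellipsis exceptions mentioned above, which is why trained tokenizers such as NLTK's `nltk.sent_tokenize` and `nltk.word_tokenize` exist. The function names and the shortened sample text are illustrative assumptions.

```python
import re

def split_paragraphs(text):
    """Paragraph boundary: two consecutive line breaks (blank line)."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

def split_sentences(paragraph):
    """Sentence boundary: ., ! or ? followed by whitespace and a capital letter."""
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", paragraph)
    return [s.strip() for s in parts if s.strip()]

def tokenize(sentence):
    """Split on whitespace and punctuation, keeping numbers like 3,200 or $18 whole."""
    pattern = r"\$?\d[\d,.]*\d|\$?\d|\w+(?:[-']\w+)*|[^\w\s]"
    return re.findall(pattern, sentence)

# Shortened, made-up variant of the example text.
text = ("Hurricane Sandy grounded 3,200 flights scheduled for today and tomorrow.\n\n"
        "The system may inflict $18 billion in damage. Power could be out for a week.")

paragraphs = split_paragraphs(text)
sentences = split_sentences(paragraphs[1])
tokens = tokenize(paragraphs[0])
```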

28 Segmentation and Tokenization
“Hurricane Sandy grounded 3,200 flights scheduled for today and tomorrow, prompted New York to suspend subway and bus service and forced the evacuation of the New Jersey shore as it headed toward land with life-threatening wind and rain.
The system, which killed as many as 65 people in the Caribbean on its path north, may be capable of inflicting as much as $18 billion in damage when it barrels into New Jersey tomorrow and knock out power to millions for a week or more, according to forecasters and risk experts.”
Paragraphs: 2
Sentences: 2
Words: 561.
– ['Hurricane', 'Sandy', 'grounded', '3,200', 'flights', 'scheduled', 'for', 'today', 'and', 'tomorrow', …]
Natural Language Processing Tasks
Slides Taken from: Bommarito Consulting

29 What kind of questions can we ask?
We now have an ordered list of tokens.
['Hurricane', 'Sandy', 'grounded', '3,200', 'flights', 'scheduled', 'for', 'today', 'and', 'tomorrow', …]
Natural Language Processing Tasks
Slides Taken from: Bommarito Consulting

30 Stop Words Removal
Before:
Hurricane Sandy grounded 3,200 flights scheduled for today and tomorrow, prompted New York to suspend subway and bus service and forced the evacuation of the New Jersey shore as it headed toward land with life-threatening wind and rain.
The system, which killed as many as 65 people in the Caribbean on its path north, may be capable of inflicting as much as $18 billion in damage when it barrels into New Jersey tomorrow and knock out power to millions for a week or more, according to forecasters and risk experts.
After:
Hurricane Sandy grounded 3,200 flights scheduled today tomorrow, prompted New York suspend subway bus service forced evacuation New Jersey shore headed toward land life-threatening wind rain.
System, killed many 65 people Caribbean path north, may capable inflicting much $18 billion damage barrels New Jersey tomorrow knock power millions week, according forecasters risk experts.
Natural Language Processing Tasks
Slides Taken from: Bommarito Consulting
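Stop-word removal is a simple filter over the token list. The stop list below is a tiny illustrative subset chosen for this example, not a standard list; NLTK provides a fuller one via `nltk.corpus.stopwords.words("english")` once the corpus is downloaded.

```python
# Small illustrative stop list (an assumption, not a standard resource).
STOP_WORDS = {"the", "for", "and", "to", "of", "as", "it", "with", "in", "on",
              "a", "an", "which", "its", "be", "into", "out", "or", "more", "when"}

def remove_stop_words(tokens):
    """Keep only tokens whose lowercase form is not in the stop list."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

tokens = ["Hurricane", "Sandy", "grounded", "3,200", "flights",
          "scheduled", "for", "today", "and", "tomorrow"]
kept = remove_stop_words(tokens)
```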

31 Natural language processing Tasks
Stop Words Removal + Stemming
Before:
Hurricane Sandy grounded 3,200 flights scheduled for today and tomorrow, prompted New York to suspend subway and bus service and forced the evacuation of the New Jersey shore as it headed toward land with life-threatening wind and rain.
The system, which killed as many as 65 people in the Caribbean on its path north, may be capable of inflicting as much as $18 billion in damage when it barrels into New Jersey tomorrow and knock out power to millions for a week or more, according to forecasters and risk experts.
After:
Hurrican Sandi ground 3,200 flight schedul today tomorrow, prompt New York suspend subway bu servic forc evacu New Jersey shore head toward land life-threaten wind rain.
System, kill mani 65 peopl Caribbean path north, may capabl inflict much $18 billion damag barrel New Jersey tomorrow knock power million week, accord forecast risk expert.
Slides Taken from: Bommarito Consulting
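The idea behind stemming can be illustrated with a crude suffix stripper. This is not the stemmer that produced the output on the slide, and it will not reproduce forms like “Sandi” or “bu”; it is a hypothetical toy with a made-up suffix list. A real project would use something like `nltk.stem.PorterStemmer`.

```python
# Suffixes are tried in this order; the first match is stripped.
SUFFIXES = ("ing", "ed", "es", "s")

def crude_stem(word):
    """Strip one common suffix, leaving at least three characters."""
    w = word.lower()
    for suffix in SUFFIXES:
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[:-len(suffix)]
    return w

stems = [crude_stem(w) for w in ["playing", "grounded", "flights", "scheduled", "bus"]]
```

Even this toy version shows the trade-off: different inflections collapse to one attribute, at the cost of stems that are not always real words.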

32 Natural language processing
Part of Speech Tagging
Hurricane Sandy grounded 3,200 flights scheduled for today and tomorrow, prompted New York to suspend subway and bus service and forced the evacuation of the New Jersey shore as it headed toward land with life-threatening wind and rain.
The system, which killed as many as 65 people in the Caribbean on its path north, may be capable of inflicting as much as $18 billion in damage when it barrels into New Jersey tomorrow and knock out power to millions for a week or more, according to forecasters and risk experts.
[('Hurricane', 'NNP'), ('Sandy', 'NNP'), ('grounded', 'VBD'), ('3,200', 'CD'), ('flights', 'NNS'), ('scheduled', 'VBN'), ('for', 'IN'), ('today', 'NN'), ('and', 'CC'), ('tomorrow', 'NN'), …]
NNP: Proper noun, singular
NNS: Noun, plural
VBD: Verb, past tense
VBN: Verb, past participle
CD: Cardinal number
IN: Preposition/subordinating conjunction
etc.
For more: http://www.mozart-oz.org/mogul/doc/lager/brill-tagger/penn.html
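Once tokens carry tags, part-of-speech frequencies become easy sentence attributes. The sketch below starts from the tagged list shown on the slide (a tagger such as `nltk.pos_tag` would produce this kind of output) and counts tags with the standard library; the variable names are illustrative.

```python
from collections import Counter

# Tagged tokens copied from the slide.
tagged = [("Hurricane", "NNP"), ("Sandy", "NNP"), ("grounded", "VBD"),
          ("3,200", "CD"), ("flights", "NNS"), ("scheduled", "VBN"),
          ("for", "IN"), ("today", "NN"), ("and", "CC"), ("tomorrow", "NN")]

tag_counts = Counter(tag for _, tag in tagged)

# Penn Treebank noun tags start with "NN", verb tags with "VB".
noun_count = sum(c for tag, c in tag_counts.items() if tag.startswith("NN"))
verb_count = sum(c for tag, c in tag_counts.items() if tag.startswith("VB"))
proper_nouns = [word for word, tag in tagged if tag == "NNP"]
```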

33 Let’s go deeper into each stage
Second, Sentence Attribute Generation

34 Remember one thing When you read sentences yourself, what do you notice about what you notice? Make those into attributes! The goal is to mimic what we humans do

35 Let’s go deeper into each stage
Third, Gold Set Generation

36 Resources for Gold Set Generation
Yourself
RA: pretty expensive
AMT: Amazon Mechanical Turk
Obtain multiple tags per item, and check inter-rater agreement to make sure the labels are robust
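One common way to check inter-rater agreement between two taggers is Cohen's kappa, which corrects raw agreement for agreement expected by chance. The slide does not say which statistic to use, so this choice is an assumption, and the two tag lists are made up. With more than two raters, a multi-rater statistic such as Fleiss' kappa would be the analogous choice.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters tagging the same items."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    # Probability that both raters pick the same label by chance.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1 - expected)

# Made-up tags from two raters on eight sentences.
rater_a = ["pos", "pos", "neg", "neg", "pos", "neg", "pos", "neg"]
rater_b = ["pos", "pos", "neg", "pos", "pos", "neg", "neg", "neg"]
kappa = cohens_kappa(rater_a, rater_b)
```

Here the raters agree on 6 of 8 items, chance agreement is 0.5, and kappa comes out at 0.5.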

37 Research Example ssrn.com/abstract=2290802

38 Research Question
What content attributes of social media messages elicit greater consumer response & engagement?
E.g.,
1. What’s the comparative effect of informative advertising (product, price information, etc.) vs. persuasive advertising (emotion, humor, etc.) on engagement?
2. Differences across industries?
Introduction & Motivation

39 Sample Messages from Walmart (Dec 2012 - https://www.facebook.com/walmart)
Score an iPad 3 for an iPad2 price! Now at your local store, $50 off the iPad 3. Plus, get a $30 iTunes Gift Card. Offer good through 12/31 or while supplies last. (Product Advertisement + Deal + Product Location + Product Stock Availability)
Rollback with Vizio. Select models have lower prices ranging from $228 for a 32" (diagonal screen size 31.5") LCD TV to $868 on a 55" (diagonal screen size 54.6") LED TV. http://walmarturl.com/10oZ6yS (Product Advertisement + Price info + Brand Mention + Link)
Maria’s mission is helping veterans and their families find employment. Like this and watch Maria’s story. http://walmarturl.com/VzWFlh (Philanthropic Message + Explicit Like solicitation + Link)
Data

40 Data
Post-level panel data on messages posted by many companies from Sep 2011 to July 2012
– Message content
– Impressions, likes and comments on a daily basis
Page-level panel data on each page
– Page statistics on a daily basis (e.g., fan number, industry type)
– Aggregate demographics of fans and post viewers (impressions demographics)
After cleaning: 106,316 unique messages posted by 782 companies
Daily likes & comments: 1.3 million rows of post-level snapshots recording about 450 million page fans’ responses.

41 Variables
Engagement Metric (Dependent Variable): Comments, Likes
Variables that affect engagement (Independent Variables)
– Informative Ad Content: brand and product mention, price, deals, product availability, etc.
– Persuasive Ad Content: emotion, humor, philanthropic, emoticon, small talk, etc.
– Message Type: photo, video, status update, app, link
– Controls: impressions, industry type, days since post, reading complexity, message length, etc.
Empirical Strategy

42 Message Content Tagging
At least 9 different workers per message + majority vote
Used to train a natural language processing algorithm to tag the remaining posts
– 7 statistical classifiers + rule-based method combined by ensemble learning
– Greater than 99% accuracy, precision, and recall for most variables (10-fold CV)
Worker Eligibility Criteria
– Must have > 97% accuracy
– Must have > 100 previously approved tasks
– Location: US only
Criteria for using the input
– Question for detecting if the worker is paying attention
– Completion duration > 30 seconds (avg took 3 min)
– Plus, 5+ more protocols
Data
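Majority voting over worker tags, and the precision and recall used to evaluate a classifier against the resulting gold tags, can be sketched as follows. This is not the pipeline from the paper; the post IDs, votes, and predictions are made up for illustration.

```python
from collections import Counter

def majority_vote(votes):
    """Return the most frequent tag among the workers' votes."""
    return Counter(votes).most_common(1)[0][0]

def precision_recall(gold, predicted, positive=True):
    """Precision and recall of predicted tags against gold tags."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Nine hypothetical worker votes per post: does the post mention a price?
worker_tags = {
    "post_1": [True] * 7 + [False] * 2,
    "post_2": [False] * 6 + [True] * 3,
    "post_3": [True] * 5 + [False] * 4,
}
gold = [majority_vote(votes) for votes in worker_tags.values()]

# Hypothetical classifier output for the same three posts.
predicted = [True, True, True]
precision, recall = precision_recall(gold, predicted)
```

Using an odd number of workers, as with nine here, avoids ties on a yes/no tag.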

43 NLP Algorithm Process

44 NLP Algorithm Performance

45 NLTK

46 Open up nlp.py

47 WITH BAD NLP / WITH GOOD NLP
“COMPUTER, HOT EARL GREY TEA” / “COMPUTER, TEA, EARL GREY, HOT”

48 This Concludes the 2014 Wharton Tech/Data Camp
Please help me and give feedback on this course for improvement. Thank you!
http://wharton.qualtrics.com/SE/?SID=SV_agzfeKZvPQD0hUN

