
1 Introduction to Data Science Lecture 5 Natural Language Processing CS 194 Fall 2015 John Canny with some slides from David Hall

2 Outline BoW, N-grams Part-of-Speech Tagging Named-Entity Recognition Grammars Parsing Dependencies

3 Natural Language Processing (NLP) In a recent survey (KDNuggets blog) of data scientists, 62% reported working “mostly or entirely” with data about people. Much of this data is text. In the suggested CS194-16 projects (a random sample of data science projects around campus), nearly half involve natural language text processing. NLP is a central part of mining large datasets.

4 Natural Language Processing Some basic terms:
Syntax: the allowable structures in the language: sentences, phrases, affixes (-ing, -ed, -ment, etc.).
Semantics: the meaning(s) of texts in the language.
Part-of-Speech (POS): the category of a word (noun, verb, preposition, etc.).
Bag-of-words (BoW): a featurization that uses a vector of word counts (or binary indicators), ignoring order.
N-gram: for a fixed, small N (2-5 is common), an n-gram is a consecutive sequence of N words in a text.

5 Bag of words Featurization Assuming we have a dictionary mapping words to unique integer ids, a bag-of-words featurization of a sentence could look like this:
Sentence: The cat sat on the mat
Word ids: 1 12 5 3 1 14
The BoW featurization is the count vector (nonzero at positions 1, 3, 5, 12, 14):
2, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1
In practice this would be stored as a sparse vector of (id, count) pairs: (1,2), (3,1), (5,1), (12,1), (14,1)
Note that the original word order is lost, replaced by the order of the ids.
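As a concrete illustration, here is a minimal Python sketch of this sparse BoW featurization (the word-to-id dictionary below is just the one implied by the example; in practice it would be built from the corpus):

```python
from collections import Counter

# Word-to-id dictionary implied by the example (normally built from the corpus).
word_ids = {"the": 1, "on": 3, "sat": 5, "cat": 12, "mat": 14}

def bow_sparse(sentence, word_ids):
    """Return a sparse bag-of-words: a sorted list of (id, count) pairs."""
    ids = [word_ids[w] for w in sentence.lower().split() if w in word_ids]
    return sorted(Counter(ids).items())

print(bow_sparse("The cat sat on the mat", word_ids))
# [(1, 2), (3, 1), (5, 1), (12, 1), (14, 1)]
```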

6 Bag of words Featurization BoW featurization is probably the most common representation of text for machine learning. Most algorithms expect numerical vector inputs and BoW provides this. These include kNN, Naïve Bayes, Logistic Regression, k-Means clustering, topic models, and collaborative filtering (using text input).

7 Bag of words Featurization One challenge is the size of the representation. Since words follow a power-law distribution, vocabulary size grows almost linearly with corpus size. Since rare words aren't observed often enough to inform the model, it's common to discard infrequent words, e.g. those occurring fewer than 5 times. One can also use the feature selection methods mentioned last time (mutual information and chi-squared tests), but these often have false-positive problems and are less reliable than frequency culling on power-law data.
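A minimal sketch of frequency culling, assuming the corpus has already been tokenized (the function name and threshold are illustrative, not from the slides):

```python
from collections import Counter

def build_vocab(tokenized_docs, min_count=5):
    """Map words seen at least min_count times to integer ids; drop the rest."""
    counts = Counter(w for doc in tokenized_docs for w in doc)
    kept = sorted(w for w, c in counts.items() if c >= min_count)
    return {w: i for i, w in enumerate(kept)}

# Usage: vocab = build_vocab(docs, min_count=5), where docs is a list of token lists.
```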

8 Bag of words Featurization The feature selection methods mentioned last time (mutual information and chi-squared tests) often have false-positive problems on power-law data. Frequency-based selection usually works better.
[Figure: log-log plot of word frequency vs. word rank. Each factor-of-10 block of ranks holds 10x more features but 10x fewer observations of each; each rare feature is a potential false positive, and it is difficult to control the false-positive rate for rare features.]

9 N-grams Because word order is lost, the sentence meaning is weakened. This sentence has quite a different meaning but the same BoW vector:
Sentence: The mat sat on the cat
Word ids: 1 14 5 3 1 12
BoW featurization: 2, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1
But word order is important, especially the order of nearby words. N-grams capture this by modeling tuples of consecutive words.

10 N-grams
Sentence: The cat sat on the mat
2-grams: the-cat, cat-sat, sat-on, on-the, the-mat
Notice how even these short n-grams "make sense" as linguistic units. For the other sentence we would have different features:
Sentence: The mat sat on the cat
2-grams: the-mat, mat-sat, sat-on, on-the, the-cat
We can go still further and construct 3-grams:
Sentence: The cat sat on the mat
3-grams: the-cat-sat, cat-sat-on, sat-on-the, on-the-mat
which capture still more of the meaning:
Sentence: The mat sat on the cat
3-grams: the-mat-sat, mat-sat-on, sat-on-the, on-the-cat
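A small sketch of n-gram extraction in Python (the hyphen-joined format simply mirrors the slides):

```python
def ngrams(sentence, n):
    """Return the list of n-grams (as hyphenated strings) of a sentence."""
    words = sentence.lower().split()
    return ["-".join(words[i:i + n]) for i in range(len(words) - n + 1)]

print(ngrams("The cat sat on the mat", 2))
# ['the-cat', 'cat-sat', 'sat-on', 'on-the', 'the-mat']
print(ngrams("The cat sat on the mat", 3))
# ['the-cat-sat', 'cat-sat-on', 'sat-on-the', 'on-the-mat']
```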

11 N-gram Features Typically, it's advantageous to use multiple n-gram features in machine learning models with text, e.g. unigrams + bigrams (2-grams) + trigrams (3-grams). The unigrams have higher counts and are able to detect weak influences, while bigrams and trigrams capture strong influences that are more specific. E.g. "the white house" will generally have a very different influence from the sum of the influences of "the", "white", "house".

12 N-grams as Features Using n-grams + BoW improves accuracy over BoW alone for classification and other text problems. A typical ordering of performance (e.g. classifier accuracy) looks like this: 3-grams < 2-grams < 1-grams < 1+2-grams < 1+2+3-grams. A few percent improvement is typical for text classification.
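In practice, libraries can build combined 1+2+3-gram features directly; for example, a sketch using scikit-learn's CountVectorizer (the toy two-sentence corpus is made up for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer

# ngram_range=(1, 3) extracts unigrams, bigrams and trigrams in one pass;
# min_df discards rare features, as discussed on the earlier slides.
vectorizer = CountVectorizer(ngram_range=(1, 3), min_df=1)
X = vectorizer.fit_transform(["the cat sat on the mat",
                              "the mat sat on the cat"])

print(X.shape)                             # (2, number of distinct 1-3 grams)
print(sorted(vectorizer.vocabulary_)[:5])  # a few of the n-gram feature names
```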

13 N-grams size N-grams pose some challenges in feature set size. If the original vocabulary size is |V|, the number of possible 2-grams is |V|^2, while for 3-grams it is |V|^3. Luckily, natural language n-grams (including single words) have a power-law frequency structure. This means that most of the n-grams you see are common. A dictionary that contains the most common n-grams will cover most of the n-grams you see.

14 Power laws for N-grams N-grams also follow a power law distribution:

15 N-grams size Because of this you may see dictionary sizes like these:
Unigram dictionary size: 40,000
Bigram dictionary size: 100,000
Trigram dictionary size: 300,000
with coverage of > 80% of the features occurring in the text.

16 N-gram Language Models N-grams can be used to build statistical models of texts. When this is done, they are called n-gram language models. An n-gram language model associates a probability with each n-gram, such that the sum over all n-grams (for fixed n) is 1. You can then determine the overall likelihood of a particular sentence: "The cat sat on the mat" is much more likely than "The mat sat on the cat".
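A toy sketch of scoring sentences with a bigram language model; the probability table below is invented purely for illustration, and a real model would be estimated from corpus counts with smoothing:

```python
import math

# Toy conditional probabilities P(next | prev); invented for illustration only.
bigram_prob = {
    ("the", "cat"): 0.20, ("the", "mat"): 0.02,
    ("cat", "sat"): 0.30, ("mat", "sat"): 0.01,
    ("sat", "on"): 0.40, ("on", "the"): 0.50,
}

def log_likelihood(sentence, probs, floor=1e-6):
    """Sum of log P(w_i | w_{i-1}) over the sentence's bigrams."""
    words = sentence.lower().split()
    return sum(math.log(probs.get(bg, floor)) for bg in zip(words, words[1:]))

print(log_likelihood("The cat sat on the mat", bigram_prob))  # higher score
print(log_likelihood("The mat sat on the cat", bigram_prob))  # lower score
```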

17 Skip-grams We can also analyze the meaning of a particular word by looking at the contexts in which it occurs. The context is the set of words that occur near the word, i.e. at displacements of …,-3,-2,-1,+1,+2,+3,… in each sentence where the word occurs. A skip-gram is a set of non-consecutive words (with specified offsets) that occur in some sentence. We can construct a BoSG (bag of skip-grams) representation for each word from the skip-gram table.
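A sketch of collecting the context (skip-gram) words within a fixed window around each word; the window size and example sentence are arbitrary choices:

```python
from collections import defaultdict

def context_counts(sentences, window=2):
    """Map each word to counts of words at offsets -window..+window (excluding 0)."""
    contexts = defaultdict(lambda: defaultdict(int))
    for sentence in sentences:
        words = sentence.lower().split()
        for i, w in enumerate(words):
            for j in range(max(0, i - window), min(len(words), i + window + 1)):
                if j != i:
                    contexts[w][words[j]] += 1
    return contexts

ctx = context_counts(["the cat sat on the mat"])
print(dict(ctx["sat"]))   # {'the': 2, 'cat': 1, 'on': 1}
```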

18 Skip-grams Then with a suitable embedding (DNN or linear projection) of the skip-gram features, we find that word meaning has an algebraic structure: Man + (King – Man) + (Woman – Man) = Queen. Tomáš Mikolov et al. (2013), "Efficient Estimation of Word Representations in Vector Space". [Figure: vector-space diagram relating Man, Woman, King and Queen]
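With pretrained vectors, this algebraic structure can be queried directly; a sketch using gensim's downloader (the model name and its availability are assumptions, and any word2vec/GloVe model should behave similarly):

```python
import gensim.downloader as api

# Assumes the "glove-wiki-gigaword-100" vectors can be fetched via gensim's
# downloader (a sizeable download); returns a KeyedVectors object.
wv = api.load("glove-wiki-gigaword-100")

# king - man + woman should land near "queen"
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```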

19 Outline BoW, N-grams Part-of-Speech Tagging Named-Entity Recognition Grammars Parsing Dependencies

20 Parts of Speech Thrax’s original list (c. 100 B.C): Noun Verb Pronoun Preposition Adverb Conjunction Participle Article

21 Parts of Speech Thrax’s original list (c. 100 B.C): Noun (boat, plane, Obama) Verb (goes, spun, hunted) Pronoun (She, Her) Preposition (in, on) Adverb (quietly, then) Conjunction (and, but) Participle (eaten, running) Article (the, a)

22 Parts of Speech (Penn Treebank 2014)
1. CC - Coordinating conjunction
2. CD - Cardinal number
3. DT - Determiner
4. EX - Existential there
5. FW - Foreign word
6. IN - Preposition or subordinating conjunction
7. JJ - Adjective
8. JJR - Adjective, comparative
9. JJS - Adjective, superlative
10. LS - List item marker
11. MD - Modal
12. NN - Noun, singular or mass
13. NNS - Noun, plural
14. NNP - Proper noun, singular
15. NNPS - Proper noun, plural
16. PDT - Predeterminer
17. POS - Possessive ending
18. PRP - Personal pronoun
19. PRP$ - Possessive pronoun
20. RB - Adverb
21. RBR - Adverb, comparative
22. RBS - Adverb, superlative
23. RP - Particle
24. SYM - Symbol
25. TO - to
26. UH - Interjection
27. VB - Verb, base form
28. VBD - Verb, past tense
29. VBG - Verb, gerund or present participle
30. VBN - Verb, past participle
31. VBP - Verb, non-3rd person singular present
32. VBZ - Verb, 3rd person singular present
33. WDT - Wh-determiner
34. WP - Wh-pronoun
35. WP$ - Possessive wh-pronoun
36. WRB - Wh-adverb

23 POS tags Two sources of evidence for a word's tag:
Morphology: "liked," "follows," "poke"
Context: "can" can be "trash can," "can do," "can it"
Therefore taggers should look at the neighborhood of a word.

24 Constraint-Based Tagging Based on a table of words with morphological and context features:

25 POS taggers Taggers typically use HMMs (Hidden Markov Models) or MaxEnt (Maximum Entropy) sequence models, with the latter giving state-of-the-art performance. Accuracy on test data is around 97%. A representative is the Stanford POS tagger: http://nlp.stanford.edu/software/tagger.shtml Tagging is fast: around 300k words/second is typical (Speedread); the Stanford tagger is about 10x slower.
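For a quick experiment, NLTK (covered later under Systems) ships a pretrained tagger; a sketch, assuming the tagger data has been downloaded:

```python
import nltk

# One-time downloads: tokenizer models and the pretrained perceptron tagger.
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

tokens = nltk.word_tokenize("The cat sat on the mat")
print(nltk.pos_tag(tokens))
# Typically: [('The', 'DT'), ('cat', 'NN'), ('sat', 'VBD'), ('on', 'IN'),
#             ('the', 'DT'), ('mat', 'NN')]
```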

26 Outline BoW, N-grams Part-of-Speech Tagging Named-Entity Recognition Grammars Parsing Dependencies

27 Named Entity Recognition People, places, (specific) things:
"Chez Panisse": Restaurant
"Berkeley, CA": Place
"Churchill-Brenneis Orchards": Company(?)
Not "Page" or "Medjool": each is part of a non-rigid designator.

28 Named Entity Recognition Why a sequence model? "Page" can be "Larry Page" or "Page mandarins" or just a page. "Berkeley" can be a person.

29 Named Entity Recognition Example: "Chez Panisse, Berkeley, CA - A bowl of Churchill-Brenneis Orchards Page mandarins and Medjool dates"
States: Restaurant, Place, Company, GPE, Movie, "Outside". Usually paired states such as BeginRestaurant and InsideRestaurant are used. Why?
Emissions: words. Estimating these probabilities is harder.

30 Named Entity Recognition Both (hand-written) grammar methods and machine-learning methods are used. The learning methods use sequence models: HMMs or CRFs (Conditional Random Fields), with the latter giving state-of-the-art performance. State-of-the-art accuracy on standard test data is about 93.4%; human performance is about 97%. Speed is around 200k words/sec (Speedread). Example: http://nlp.stanford.edu/software/CRF-NER.shtml
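NLTK also wraps a simple pretrained NE chunker (not the Stanford CRF tagger linked above, but enough to see the states-over-words idea); a sketch, assuming the listed NLTK data packages are available:

```python
import nltk

# One-time downloads for tokenizing, tagging and the pretrained NE chunker.
for pkg in ("punkt", "averaged_perceptron_tagger", "maxent_ne_chunker", "words"):
    nltk.download(pkg)

tokens = nltk.word_tokenize("Larry Page founded Google in Menlo Park")
tree = nltk.ne_chunk(nltk.pos_tag(tokens))
print(tree)  # a Tree with chunks such as (PERSON Larry/NNP Page/NNP), (GPE ...)
```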

31 Outline BoW, N-grams Part-of-Speech Tagging Named-Entity Recognition Grammars Parsing Dependencies

32 Grammars Grammars comprise rules that specify acceptable sentences in the language (S is the sentence or root node):
S → NP VP
S → NP VP PP
NP → DT NN
VP → VB NP
VP → VBD
PP → IN NP
DT → "the"
NN → "mat", "cat"
VBD → "sat"
IN → "on"

33 Grammars The same rules, applied to "the cat sat on the mat":
S → NP VP
S → NP VP PP : (the cat) (sat) (on the mat)
NP → DT NN : (the cat), (the mat)
VP → VB NP
VP → VBD
PP → IN NP
DT → "the"
NN → "mat", "cat"
VBD → "sat"
IN → "on"
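This toy grammar can be written down and run directly; a sketch using NLTK's CFG and chart parser:

```python
import nltk

# The toy grammar from the slide, in NLTK's CFG notation.
grammar = nltk.CFG.fromstring("""
S   -> NP VP | NP VP PP
NP  -> DT NN
VP  -> VB NP | VBD
PP  -> IN NP
DT  -> 'the'
NN  -> 'mat' | 'cat'
VBD -> 'sat'
IN  -> 'on'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("the cat sat on the mat".split()):
    print(tree)   # prints the parse in bracket notation
```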

34 Grammars English grammars are context-free: the productions do not depend on any words before or after the production. The reconstruction of the sequence of grammar productions that generates a sentence is called "parsing" the sentence. It is most conveniently represented as a tree:

35 Parse Trees “The cat sat on the mat”

36 Parse Trees In bracket notation: (ROOT (S (NP (DT the) (NN cat)) (VP (VBD sat) (PP (IN on) (NP (DT the) (NN mat))))))
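The bracket notation can be read back into a tree object, e.g. with NLTK's Tree class (a sketch):

```python
from nltk import Tree

s = """(ROOT
  (S
    (NP (DT the) (NN cat))
    (VP (VBD sat)
      (PP (IN on)
        (NP (DT the) (NN mat))))))"""

tree = Tree.fromstring(s)
print(tree.label())    # ROOT
print(tree.leaves())   # ['the', 'cat', 'sat', 'on', 'the', 'mat']
tree.pretty_print()    # draws the parse tree as ASCII art
```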

37 Grammars There are typically multiple ways to produce the same sentence. Consider the statement by Groucho Marx: “While I was in Africa, I shot an elephant in my pajamas” “How he got into my pajamas, I don’t know”

38 Parse Trees "…I shot an elephant in my pajamas" - what people hear first

39 Parse Trees Groucho’s version

40 Grammars Recursion is common in grammar rules, e.g. NP → NP RC. Because of this, sentences of arbitrary length are possible.

41 Recursion in Grammars “Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo”.

42 Grammars It's also possible to have "sentences" inside other sentences:
S → NP VP
VP → VB NP SBAR
SBAR → IN S

43 Recursion in Grammars “Nero played his lyre while Rome burned”.

44 5-min Break Please talk to us if you haven’t finalized your group! We will get you your data in the next day or two. We’re in this room for lecture and lab from now on.

45 Outline BoW, N-grams Part-of-Speech Tagging Named-Entity Recognition Grammars Parsing Dependencies

46 PCFGs Complex sentences can be parsed in many ways, most of which make no sense or are extremely improbable (like Groucho's example). Probabilistic Context-Free Grammars (PCFGs) associate and learn probabilities for each rule:
S → NP VP 0.3
S → NP VP PP 0.7
The parser then tries to find the most likely sequence of productions that generates the given sentence. This adds more realistic "world knowledge" and generally gives much better results. Most state-of-the-art parsers these days use PCFGs.
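A toy PCFG for the example sentence, parsed with NLTK's Viterbi PCFG parser; the rule probabilities here are invented for illustration:

```python
import nltk

# Toy PCFG: each left-hand side's rule probabilities sum to 1.
grammar = nltk.PCFG.fromstring("""
S   -> NP VP            [1.0]
NP  -> DT NN            [1.0]
VP  -> VBD PP           [0.7] | VBD [0.3]
PP  -> IN NP            [1.0]
DT  -> 'the'            [1.0]
NN  -> 'cat'            [0.5] | 'mat' [0.5]
VBD -> 'sat'            [1.0]
IN  -> 'on'             [1.0]
""")

parser = nltk.ViterbiParser(grammar)
for tree in parser.parse("the cat sat on the mat".split()):
    print(tree)   # the most probable parse, annotated with its probability
```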

47 CKY parsing A CKY (Cocke-Younger-Kasami) table is a dynamic programming parser (most parsers use some form of this). [Figure: CKY table over "John hit the ball", with entries N(John), V(hit), D(the), N(ball), NP, VP, S]

48-52 CKY parsing Not every internal node encodes a symbol (there are N-1 internal nodes in the tree, but O(N^2) cells in the table). [These slides repeat the point while the table over "John hit the ball" is filled in step by step: the word cells N(John), V(hit), D(the), N(ball), then NP, then VP, then S.]

53-55 CKY parsing For an internal node x at height k, there are k possible pairs of symbols (B, C) into which x can decompose. [These slides illustrate the different split points for x.]

56 Viterbi Parse Given the probabilities of cells in the table, we can find the most probable parse using a generalized Viterbi scan: find the most likely parse at the root, then return it and the left and right nodes which generated it. We can do this as part of the scoring process by maintaining a set of backpointers.
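A compact sketch of the CKY/Viterbi idea for a binarized (CNF) grammar; the grammar, probabilities and sentence are toy values, not from the slides:

```python
import math
from collections import defaultdict

# Toy CNF grammar: binary rules A -> B C and lexical rules A -> word,
# each with a made-up probability.
lexical = {"John": {"N": 1.0}, "hit": {"V": 1.0},
           "the": {"D": 1.0}, "ball": {"N": 1.0}}
binary = [("NP", "D", "N", 0.8), ("S", "N", "VP", 1.0), ("VP", "V", "NP", 1.0)]

def cky(words):
    n = len(words)
    # best[(i, j)][A] = (log prob, backpointer) for symbol A spanning words[i:j]
    best = defaultdict(dict)
    for i, w in enumerate(words):
        for sym, p in lexical[w].items():
            best[(i, i + 1)][sym] = (math.log(p), w)
    for span in range(2, n + 1):                  # span length
        for i in range(0, n - span + 1):
            j = i + span
            for k in range(i + 1, j):             # split point
                for A, B, C, p in binary:
                    if B in best[(i, k)] and C in best[(k, j)]:
                        score = math.log(p) + best[(i, k)][B][0] + best[(k, j)][C][0]
                        if A not in best[(i, j)] or score > best[(i, j)][A][0]:
                            best[(i, j)][A] = (score, (B, k, C))  # backpointer
    return best

chart = cky("John hit the ball".split())
print(chart[(0, 4)]["S"])   # log prob of the best S parse, plus its backpointer
```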

57 Systems NLTK: a Python-based NLP system. Many modules, good visualization tools, but not quite state-of-the-art performance. Stanford Parser: another comprehensive suite of tools (including a POS tagger), with state-of-the-art accuracy. Has the definitive dependency module. Berkeley Parser: slightly higher parsing accuracy (than Stanford) but not as many modules. Note: high-quality constituency parsing is usually very slow, but see: https://github.com/dlwh/puck

58 Outline BoW, N-grams Part-of-Speech Tagging Named-Entity Recognition Grammars Parsing Dependencies

59 Dependency Parsing Dependency structures are rooted in the words themselves, i.e. there are no non-terminal nodes, and a tree on n words has exactly n nodes. A dependency tree has about half as many total nodes as a binary (Chomsky Normal Form) constituency tree.

60 Dependency Parsing Dependency parses may be non-binary, and structure type is encoded in links rather than nodes:

61 Constituency vs. Dependency Grammars Dependency parses are faster to compute and are more compact descriptions. They are not simplifications of constituency parses, although they can be derived from lexicalized constituency grammars. Lexicalized grammars assign a head word to each non-terminal node.

62 Constituency vs. Dependency Grammars Both dependency and constituency parses capture attachment information (e.g. sentiments directed at particular targets). Constituency parses are probably better for abstracting semantic features, i.e. constituency units often correspond to semantic units. Constituency grammars have a much larger space to search and take longer, especially as sentence length grows (parsing is O(N^3) in sentence length). However, if the goal is to find semantic groups, it's possible to proceed bottom-up with (small) bounded N.

63 Dependencies "The cat sat on the mat" [Figure: the dependency tree alongside the constituency parse tree; constituency labels appear at the leaf nodes of the dependency tree.]

64 Dependencies From the dependency tree we can obtain a "sketch" of the sentence: starting at the root and looking down one level gives "cat sat on", and then looking for the object of the prepositional child gives "cat sat on mat". We can easily ignore determiners ("a", "the"). And importantly, adjectival and adverbial modifiers generally connect to their targets, as in the examples on the next two slides (a short parser sketch follows them):

65 Dependencies “Brave Merida prepared for a long, cold winter”

66 Dependencies “Russell reveals himself here as a supremely gifted director of actors”
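The "sketch" extraction described above can be scripted with an off-the-shelf dependency parser; a sketch using spaCy (not one of the tools named on these slides; the model name is an assumption):

```python
import spacy

# Assumes the small English model has been installed:
#   python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

doc = nlp("The cat sat on the mat")
for token in doc:
    print(token.text, token.dep_, "<-", token.head.text)

# A crude "sketch": the root verb plus its non-determiner children.
root = [t for t in doc if t.dep_ == "ROOT"][0]
print(root.text, [child.text for child in root.children if child.dep_ != "det"])
```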

67 Dependencies Stanford dependencies can be constructed from the output of a lexicalized constituency parser (so you can in principle use other parsers). The mapping is based on hand-written regular expressions. Dependency grammars have been widely used for sentiment analysis and for semantic embeddings of sentences. Examples: http://nlp.stanford.edu/software/lex-parser.shtml http://www.maltparser.org/

68 Parser Performance MaltParser is widely used for dependency parsing because of its speed (around 10k words/sec). High-quality constituency parsing on a fast machine usually runs at under 2,000 words/sec. Puck (with a GPU) raises this to around 8,000 words/sec, similar to MaltParser. Parsers are very complicated pieces of software and are quite sensitive to tuning. Make sure your parser is tuned properly: evaluate it on sample text. Parsers are often trained on sanitized data (news articles) and will not work as well on informal text like social media. If possible, train on similar data (e.g. LDC's English web treebank).

69 Summary BoW, N-grams Part-of-Speech Tagging Named-Entity Recognition Grammars Parsing Dependencies

