
1 Natural language processing
Rohit Apte

2 Natural Language Processing
NLP sits at the intersection of Artificial Intelligence, Linguistics, and Computer Science. It is a way for computers to analyze language in a useful way: performing tasks (making appointments, buying things), language translation, question answering, and information extraction (analyzing documents, reading emails, chatbots, etc.).

3 Basic NLP techniques
Sentence segmentation – split a paragraph into sentences.
Word tokenization – split a sentence into token words.
Stemming and lemmatization (text normalization) – reduce inflected words to their root forms. Stemming tends to be cruder (usually truncating word endings), while lemmatization uses vocabulary and morphological analysis of words. For example, wait, waited, waits, and waiting all reduce to the root wait.
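A minimal sketch of these steps with NLTK (assuming the punkt and wordnet resources have already been downloaded):

```python
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

# One-time setup: nltk.download("punkt"), nltk.download("wordnet")
text = "I waited at the station. The train waits for no one."
sentences = nltk.sent_tokenize(text)       # sentence segmentation
tokens = nltk.word_tokenize(sentences[0])  # word tokenization

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
for word in ["wait", "waited", "waits", "waiting"]:
    # stemming truncates; lemmatization uses vocabulary + morphology
    print(word, stemmer.stem(word), lemmatizer.lemmatize(word, pos="v"))
```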

4 Original
Kim arrived on Tuesday morning at the border station of Dong Dang and was welcomed with a military salute reminiscent of the fanfare that greeted his grandfather Kim Il-sung more than half a century ago. His heavily armoured train pulled into the station just after 8am, after a 65-hour, 4,500km journey from Pyongyang via mainland China. Just as they did during Kim Il-sung’s visit 55 years ago, a dozen Vietnamese soldiers in white and green uniforms saluted the 35-year-old leader as he emerged from the train and walked on the red carpet.

Stemmed
kim arriv on tuesday morn at the border station of dong dang and wa welcom with a militari salut reminisc of the fanfar that greet hi grandfath kim il-sung more than half a centuri ago . hi heavili armour train pull into the station just after 8am , after a 65-hour , 4,500km journey from pyongyang via mainland china . just as they did dure kim il-sung ’ s visit 55 year ago , a dozen vietnames soldier in white and green uniform salut the 35-year-old leader as he emerg from the train and walk on the red carpet .

Lemmatized
Kim arrive on Tuesday morning at the border station of Dong Dang and be welcome with a military salute reminiscent of the fanfare that greet his grandfather Kim Il-sung more than half a century ago . His heavily armour train pull into the station just after 8am , after a 65-hour , 4,500km journey from Pyongyang via mainland China . Just a they do during Kim Il-sung ’ s visit 55 year ago , a dozen Vietnamese soldier in white and green uniform salute the 35-year-old leader a he emerge from the train and walk on the red carpet .

5 Basic NLP techniques (cont.)
Part of Speech (POS) tagging – tag words as nouns, adjectives, participles, pronouns, etc.
Named Entity Recognition – identify Organization, Person, Location, Date, Time, etc.
Dependency parsing – analyze the grammatical structure of a sentence, establishing relationships between “head” words and the words that modify those heads.
Coreference resolution – find all expressions that refer to the same entity in a text.
Relation extraction – extract semantic relationships from a text.
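Several of these steps are one-liners in spaCy; a sketch, assuming the en_core_web_sm model is installed:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Kim arrived on Tuesday morning at the border station of Dong Dang.")

for token in doc:
    # coarse POS tag, Penn Treebank tag, dependency relation, and head word
    print(token.text, token.pos_, token.tag_, token.dep_, token.head.text)

for ent in doc.ents:
    # named entities with their labels (PERSON, DATE, GPE, ...)
    print(ent.text, ent.label_)
```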

6 POS and NER tags for the example sentence:

Word:  Kim     arrived  on   Tuesday  morning  at   the  border  station  of
POS:   NNP     VBD      IN   NNP      NN       IN   DT   NN      NN       IN
NER:   PERSON  O        O    DATE     TIME     O    O    O       O        O

Word:  Dong    Dang    and  was  welcomed  with  a   military  salute  reminiscent
POS:   NNP     NNP     CC   VBD  VBN       IN    DT  JJ        NN      JJ
NER:   PERSON  PERSON  O    O    O         O     O   O         O       O

Word:  of  the  fanfare  that  greeted  his   grandfather  Kim     Il-sung  more
POS:   IN  DT   NN       WDT   VBD      PRP$  NN           NNP     NNP      JJR
NER:   O   O    O        O     O        O     O            PERSON  PERSON   O

Word:  than  half  a   century  ago  .
POS:   IN    PDT   DT  NN       RB   .
NER:   O     O     O   O        O    O

7 Penn Treebank POS tags

CC – Coordinating conjunction
CD – Cardinal number
DT – Determiner
EX – Existential there
FW – Foreign word
IN – Preposition/subordinating conjunction
JJ – Adjective
JJR – Adjective, comparative
JJS – Adjective, superlative
LS – List item marker
MD – Modal
NN – Noun, singular or mass
NNS – Noun, plural
NNP – Proper noun, singular
NNPS – Proper noun, plural
PDT – Predeterminer
POS – Possessive ending
PRP – Personal pronoun
PRP$ – Possessive pronoun
RB – Adverb
RBR – Adverb, comparative
RBS – Adverb, superlative
RP – Particle
SYM – Symbol
TO – to
UH – Interjection
VB – Verb, base form
VBD – Verb, past tense
VBG – Verb, gerund or present participle
VBN – Verb, past participle
VBP – Verb, non-3rd person singular present
VBZ – Verb, 3rd person singular present
WDT – Wh-determiner
WP – Wh-pronoun
WP$ – Possessive wh-pronoun
WRB – Wh-adverb

8 [image-only slide]

9 Advanced NLP techniques
Question Answering
Sentiment Analysis
Dialog systems (chatbots, reasoning over a knowledge base)
Document Summarization
Text Generation
Machine Translation

10 Common libraries/tools for NLP
NLTK
spaCy
Gensim
Stanford CoreNLP (Java)
Apache OpenNLP (Java)
Word embeddings – word2vec, GloVe, fastText

11 Machine learning vs deep learning
The traditional approach to NLP was statistical (machine learning) based: it required lots of feature engineering (and domain knowledge), and the task of optimizing weights was fairly trivial relative to the feature engineering. Deep learning attempts to transform raw data into representations of increasing complexity: it requires little or no feature engineering and provides a very flexible learning framework. Around 2010, deep learning techniques started outperforming classical ML techniques.

12 Finding collocations using ML techniques
A collocation is a pair or group of words that are habitually juxtaposed. Examples: crystal clear, middle management, nuclear option, etc. The traditional method is to compute, from statistical quantities of the words, a score for every word pair. Candidate collocations are often filtered by part-of-speech tag patterns (A = adjective, N = noun, P = preposition):

Tag pattern  Example
A N          linear function
N N          regression coefficients
A A N        Gaussian random variable
A N N        cumulative distribution function
N A N        mean squared error
N N N        class probability function
N P N        degrees of freedom
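A sketch of scoring candidate collocations with NLTK's collocation finder; corpus.txt is a hypothetical plain-text corpus:

```python
import nltk
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

words = nltk.word_tokenize(open("corpus.txt").read().lower())  # hypothetical corpus
finder = BigramCollocationFinder.from_words(words)
finder.apply_freq_filter(3)  # drop pairs seen fewer than 3 times

measures = BigramAssocMeasures()
# rank word pairs by pointwise mutual information
print(finder.nbest(measures.pmi, 10))
```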

13 [image-only slide]

14 NLP is hard! Natural languages are complex and ambiguous (unlike programming languages), and their rules are often broken. Most tech giants (Google, Alibaba, Microsoft, Amazon) are tackling “BIG” NLP problems like machine comprehension and inference, which take years of research before any commercial gains are realized. There is also value in addressing “SMALLER” problems – summarizing resumes, analyzing central bank statements, etc.

15 Resume summarization Resumes are often excessively detailed, and most of the detail is irrelevant. We can simplify the process by scanning the text from the PDF document and extracting named entities for each line of text. This provides a quick summary of education, work experience, etc. We can build a basic prototype in under 10 minutes (and less than 20 lines of code!), as sketched below.
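A sketch of such a prototype, assuming the text is pulled out of the PDF with pdfminer.six and the en_core_web_sm spaCy model is installed:

```python
import spacy
from pdfminer.high_level import extract_text  # pip install pdfminer.six

nlp = spacy.load("en_core_web_sm")

def summarize_resume(path):
    """Group the named entities found on each line of the resume text."""
    summary = {}
    for line in extract_text(path).splitlines():
        for ent in nlp(line).ents:
            summary.setdefault(ent.label_, set()).add(ent.text)
    return summary

print(summarize_resume("resume.pdf"))  # hypothetical file
```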

16 Example entities extracted from a sample resume:
Harvard Asian Healthcare Caucasian MD/PhD McKinsey Christoph Westphal S&P China Healthcare – Barclays Equity Research Fidelity, Capital SAC Mann Biotechnology Ventures Joined Barclays SciClone Pharmaceuticals SciClone NASDAQ CFO VC McKinsey & Company Joined SciClone Harvard Research Associate Harvard Business School Harvard Medical School/Boston Children’s Vanderbilt University Summa D Vanderbilt University Harvard College Raymond DuBois American Association for Cancer M.D. Anderson Cancer Center Discovered Renaissance Weekend Walter Annenberg PharmaChina Executive Retreat China Healthcare Investment Conference Co Boston Biotech Conference – Moderated Asia Healthcare Panel Biotech

17 Parsing FOMC statements
The Federal Open Market Committee (FOMC) manages monetary policy and sets interest rates that have a major impact on the world economy. The Fed releases statements along with interest rate targets at 2pm NY time on selected dates (per a fixed calendar schedule). These statements tend to move financial markets, especially if the Fed acts contrary to market expectations. Traders position around these statements, and the market is often in a race to digest the information in the statement.

18 [image-only slide]

19 Can we tackle this using NLP?
Crawl the Fed website for the latest (and historical) statements. Extract common things the Fed speaks about via POS tagging and collocation extraction. Analyze changes in the counts for each phrase. This uses no domain knowledge of interest rates or the Fed! A sketch of the phrase-counting step follows.
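A sketch of the phrase-counting step: extract adjective/noun phrases from two statements and diff the counts. The file names are hypothetical; NLTK's punkt and averaged_perceptron_tagger resources must be downloaded first.

```python
from collections import Counter
import nltk

def phrase_counts(text):
    # count adjective-noun and noun-noun bigrams as candidate phrases
    tagged = nltk.pos_tag(nltk.word_tokenize(text.lower()))
    return Counter(
        f"{w1} {w2}"
        for (w1, t1), (w2, t2) in zip(tagged, tagged[1:])
        if t1.startswith(("JJ", "NN")) and t2.startswith("NN")
    )

old = phrase_counts(open("fomc_previous.txt").read())  # hypothetical files
new = phrase_counts(open("fomc_latest.txt").read())
print((new - old).most_common(10))  # phrases the Fed now mentions more often
```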

20 Can we tackle this using NLP? (cont.)
Extract the Fed rate target (the most important part of the statement). React to changes in the Fed statement vs market expectations. See which members voted for the resolution (certain members are known hawks or doves). For example:

For: Janet L. Yellen, William C. Dudley, Lael Brainard, Charles L. Evans, Stanley Fischer, Patrick Harker, Robert S. Kaplan, Jerome H. Powell, Daniel K. Tarullo
Against: Neel Kashkari
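A sketch of pulling the rate target and dissenting voters out of the statement text with regular expressions; the patterns mirror the boilerplate wording the Fed typically uses and may need adjusting:

```python
import re

statement = open("fomc_latest.txt").read()  # hypothetical file

# Statements typically read "...the target range for the federal funds rate to/at X..."
target = re.search(r"target range for the federal funds rate (?:to|at) ([^.;]+)", statement)
dissent = re.search(r"Voting against the action (?:was|were) ([^.;]+)", statement)

print("Target:", target.group(1) if target else "not found")
print("Dissenters:", dissent.group(1) if dissent else "none")
```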

21 Can we do more? What about sentiment analysis on sentences containing keywords (inflation, unemployment, etc.) to give an overall confidence score? Popular platforms (Microsoft, Google, Amazon) don’t give very good results here because they are trained on different datasets: they work well for movie/product reviews but not for our task. We could train our own sentiment classifier, but the challenge is getting enough labelled data.
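If we did assemble a labelled set, training a classifier is straightforward; a minimal sketch where the tiny inline dataset and the hawkish/dovish labels are purely illustrative:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Illustrative stand-in for a hand-labelled set of FOMC sentences
texts = ["inflation pressures are rising", "the labor market remains weak"]
labels = ["hawkish", "dovish"]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(texts, labels)
print(clf.predict(["inflation remains elevated"]))
```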

22 [image-only slide]

23 Working with text in machine learning
We need to convert text to numbers that can be fed into ML models. Traditionally there were a few approaches:
Bag of Words – a sentence is represented as a bag (multiset) of its words.
One-hot encoding – each word is encoded as a vector of zeros with a single 1 indicating the word.
TF-IDF – Term Frequency (how often a word occurs within a document) weighted by Inverse Document Frequency (which downscales words that appear across a lot of documents).
Hashing – convert each token to a hash and use that as input to your models (hashing is a one-way function).

24 Encoding values
Bag of words (CountVectorizer): [ ]
One-hot encoding (OneHotEncoder): The [ ], quick [ ], brown [ ]
TF-IDF (TfidfVectorizer): [ ]
Hashing (HashingVectorizer): [ ]
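A sketch of these encodings with scikit-learn's vectorizers on a toy two-document corpus:

```python
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfVectorizer, HashingVectorizer)

docs = ["The quick brown fox", "The lazy dog"]

bow = CountVectorizer().fit_transform(docs)                   # bag-of-words counts
tfidf = TfidfVectorizer().fit_transform(docs)                 # TF-IDF weights
hashed = HashingVectorizer(n_features=8).fit_transform(docs)  # fixed-size hashed features

print(bow.toarray())
print(tfidf.toarray())
print(hashed.toarray())
```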

25 But there is a major drawback with this
As our vocabulary grows, the size of our vectors gets very large. For a general problem like machine comprehension or chatbots, a vocabulary of 2 million words is quite common. Storing large vectors that are mostly zeros is memory intensive; sparse vectors can help, but we are still dealing with large vectors that can slow down our ML models. Stemming and lemmatization can shrink the vocabulary, but we may lose critical information through these transformations. One-hot representations are also mutually orthogonal, so there is no natural notion of similarity between one-hot vectors.

26 Is there another way? YES! Word Embeddings.
Three popular types – word2vec, GloVe, and fastText. Core idea: a word’s meaning is given by the words that frequently appear close by (its co-occurrence contexts). Build a (dense) vector representation for each word, chosen so that it is similar to the vectors of other words that appear in similar contexts.

27 Word embeddings
word2vec – developed by Tomas Mikolov and colleagues at Google in 2013. Captures co-occurrence information using a predictive model. Skip-gram and CBOW models (the two resulting vectors are usually averaged).
GloVe – developed by Pennington, Socher, and Manning at Stanford in 2014. Also captures co-occurrence information, but uses a count-based model (with dimensionality reduction).
fastText – developed by Tomas Mikolov and colleagues at Facebook in 2015. An extension of word2vec that improves embeddings for rare words and can construct a vector for out-of-vocabulary words from their character n-grams (subword information).
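Pretrained vectors for these families can be pulled down via gensim's downloader; a sketch using the 100-dimensional Wikipedia-trained GloVe vectors:

```python
import gensim.downloader as api

glove = api.load("glove-wiki-gigaword-100")    # downloads on first use
print(glove.most_similar("king", topn=5))      # nearest neighbors in embedding space
print(glove.similarity("inflation", "prices")) # cosine similarity between two words
```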

28 Word embedding created using Word2vec | Source: https://www

29 Word embeddings (cont.)
Word embeddings are usually a good starting point for deep learning (and vanilla) models: convert words to vectors using embeddings, then apply deep learning models to the result. Note that most published vectors are trained on Wikipedia (GloVe also has vectors trained on Twitter data). If working in a different domain (crypto, medical records, etc.), we need to train our own embeddings. Gensim provides a framework to train word2vec on custom datasets, as sketched below.
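A sketch of training word2vec on a custom corpus with gensim (4.x API; the toy sentences stand in for real domain data):

```python
from gensim.models import Word2Vec

# Toy corpus: each sentence is a list of tokens; replace with your own domain data
sentences = [
    ["interest", "rates", "remain", "low"],
    ["inflation", "pressures", "are", "rising"],
    ["the", "committee", "raised", "interest", "rates"],
]

# sg=1 selects the skip-gram model; sg=0 would select CBOW
model = Word2Vec(sentences, vector_size=50, window=5, min_count=1, sg=1)
print(model.wv["inflation"][:5])               # first few embedding dimensions
print(model.wv.most_similar("rates", topn=3))  # nearest neighbors in the toy space
```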

