
1 Natural Language Processing

2 Seminar webpage: https://www.cs.princeton.edu/courses/archive/fall16/cosIW05/

3 Introductions Instructor: Christiane Fellbaum (fellbaum) COS 412 Office hours Tue 10-12 and by appointment (drop-ins usually welcome) AI: Kiran Vodrahalli (knv) Office hours Friday 9-11, 3-5 in COS 402 or vicinity

4 Seminar participants Richard Chu Thomas Clark Ben Cohen Rohan Doshi Gudrun Jonsdottir Jacob Kaplan Stefan Keselj Avinash Nayak Shefali Nayak Kwasi Oppong-Badu Jonathan Zhang

5 What this seminar is all about  Natural Language Processing:  An area of Artificial Intelligence (modeling high-level human behavior)—Turing Test  Computational approach to understanding and generating human (natural) language  Difficult but fascinating; lots of challenges  Many applications; human-computer interaction

6 What you will all do/Formalities  Write, share, (revise,) submit a proposal (due Oct. 5)  Deliver an oral presentation (week of Dec. 12)  Prepare a final written report (due Jan. 10)  LaTeX template on COS IW website (other formats are OK)  Present a poster (Jan. 12)  For specifics and deadlines see http://www.cs.princeton.edu/ugrad/independent-work/important-steps-and-deadlines and http://www.cs.princeton.edu/ugrad/independent-work/guidelines-and-useful-information

7 How the IW seminar works  We’ll all meet every week  No “lecture” but interaction  You’ll work on a project alone or with a partner  Try to balance experience, competencies  Communicate via Piazza

8 How the IW seminar works  Individual deadlines/schedules  Prepare four slides for presentation in each meeting  Where you are now  Where you are going  Success, obstacles, ….

9 How to pick a project  Get inspired (e.g., by previous projects)  Take a walk through the COS building and study the posters on the walls  Find a topic that interests (excites?) you; scope it right  Do some background research  What is the SOTA (state of the art)?  What are open questions or areas for improvement?  Very important: getting a dataset  Pre-existing text corpora  Data mining: construct a customized dataset from blogs, reviews, etc.  Scraping tools

10 Your project  Formulate (a) specific question(s)  Formulate a hypothesis  If I find X, it means Y  What would a not-X result mean?  Getting a “positive” result or beating previous systems’ performance is not necessarily a mark of better work!  Negative results are *very* important for research

11 Broad topic areas  Build or improve an NLP tool for English or another language  Part-of-speech tagger  Stemmer or Lemmatizer  Syntactic parser  Word sense disambiguation  Build a new/improve current WordNet

12 Broad topic areas  Try your hand on a core NLP application (with new dataset/method):  Text summarization  Paraphrasing  Named Entity Recognition  Coreference Resolution  Question Answering  Factoids: who/what/when vs. open-ended: how/why  Dialog Systems  Sentiment/Humor/Sarcasm Detection  Machine Translation

13 Broad Topic Areas  Ask a question that can be approached by analyzing specific data. Somewhat randomly chosen illustrative examples:  Do political candidates from different parties use measurably different language?  Authorship attribution  Can we identify dishonest bloggers?  Can sentiment analysis of product reviews predict numerical ratings?  Can analyses of “salon” comments on textbook passages lead to better networking among students? (Dr. Gunawardena)  Generate puns, jokes; stories  Identify medical/drug issues by analyzing user forum posts

14 Broad topic areas  Build an app where natural language is a major component  Scheduling meetings, events for people with different interests and constraints  Improved spell or grammar checker; malapropism detector  Improved plagiarism detector

15 Broad topic areas  Language and other modalities  Labeling of images  Resources: ImageNet, ShapeNet

16 Tools and resources  Python tutorial  www.learnpython.org  Python packages for NLP:  Natural Language Toolkit (NLTK)  www.nltk.org  spaCy  https://spacy.io/  Google Cloud Natural Language API (ML-based)  https://cloud.google.com/natural-language/  Syntactic parsing, Named Entity Recognition, Sentiment Analysis

17 Tools and Resources  Stanford CoreNLP tools  stanfordnlp.github.io/CoreNLP/  Written in Java, with a Python interface

18 Text Corpora  Corpora from Brigham Young University (nice interfaces)  http://corpus.byu.edu/  Princeton has a license for COCA (Corpus of Contemporary American English)  Google Books corpus  Google n-gram corpus  https://books.google.com/ngrams

19 Corpora with Linguistic Annotations  Penn Treebank https://www.cis.upenn.edu/~treebank/

20 NLP tools for Word Sense Disambiguation  Princeton’s own WordNet  wordnet.princeton.edu/  There’s much more to it than what you learned in COS 226!  Wordnets in many other languages:  http://compling.hss.ntu.edu.sg/omw/  Ted Pedersen’s WN::Similarity  http://www.d.umn.edu/~tpederse/similarity.html

21 More specific Tasks of NLP  Building tools for automatically analyzing language (we’ll limit ourselves to written language)  These steps are sometimes called the “NLP pipeline”  You might want to build a pipeline component in a language other than English  Or improve on an existing tool for English using a different method

22 The NLP Pipeline  Character Recognition (unlikely to be addressed in the seminar)  Segmentation, Tokenization (where does a word end and the next one begin?)  Relatively easy in English but hard in Chinese, for example
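
For English, segmentation/tokenization can be sketched in a few lines, since whitespace and punctuation carry most of the signal. The regex below is a toy illustration (not a production tokenizer, and exactly the kind of thing that fails for Chinese, where words are not whitespace-delimited):

```python
import re

def tokenize(text):
    """Split text into word and punctuation tokens (a toy English tokenizer)."""
    # \w+ matches runs of letters/digits; [^\w\s] matches single punctuation marks
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("Where does a word end, and the next one begin?"))
# → ['Where', 'does', 'a', 'word', 'end', ',', 'and', 'the', 'next', 'one', 'begin', '?']
```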

23 The NLP Pipeline  Stemming: stripping off endings (proposal, propose => propos)  Doesn’t necessarily leave a word  This can be a major challenge in some languages   Lemmatizing: stripping off endings so that a word is left, which can then be looked up in a dictionary  does, done, doing => do  mice => mouse  heavier => heavy
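
The contrast between stemming and lemmatizing can be sketched in a few lines. The suffix list and lemma table below are invented toy examples; a real project would reach for NLTK's PorterStemmer and WordNetLemmatizer instead:

```python
# Stemming: crude suffix stripping (the result need not be a word).
def stem(word):
    for suffix in ("ing", "al", "es", "ed", "er", "s", "e"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word

# Lemmatizing: needs a lexicon, especially for irregular forms.
LEMMAS = {"does": "do", "done": "do", "doing": "do",
          "mice": "mouse", "heavier": "heavy"}

def lemmatize(word):
    return LEMMAS.get(word, stem(word))

print(stem("proposal"), stem("propose"))  # both reduce to 'propos'
print(lemmatize("mice"))                  # → 'mouse'
```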

24 The NLP Pipeline  Part-of-Speech Tagging: determine the lexical category of a word  “ship sails today”  Ship (verb) the sails (noun) today!  OR  The ship (noun) sails (verb) today  There are standard POS tag sets for a given language  English: https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html  Not all languages share all categories
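
A "most frequent tag" baseline is a common first step before trying a real tagger (such as NLTK's). The tiny hand-tagged corpus below is invented for illustration, using Penn Treebank tags:

```python
from collections import Counter, defaultdict

# Toy training data: (word, Penn Treebank tag) pairs.
tagged = [("the", "DT"), ("ship", "NN"), ("sails", "VBZ"), ("today", "NN"),
          ("ship", "VB"), ("the", "DT"), ("sails", "NNS"), ("ship", "NN")]

counts = defaultdict(Counter)
for word, pos in tagged:
    counts[word][pos] += 1

def tag(word):
    # Pick the tag seen most often for this word; back off to NN for unseen words,
    # as many simple baselines do.
    return counts[word].most_common(1)[0][0] if word in counts else "NN"

print([(w, tag(w)) for w in ["the", "ship", "sails", "today"]])
```

Such a baseline cannot resolve "Ship the sails" vs. "The ship sails": that requires context, which is what sequence models (e.g., Markov models, below) add.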

25 The NLP Pipeline  Parsing : Assigning structure to a phrase or sentence  Hierarchical “tree” or nested brackets  [ [the cat]n [chased]v [the dog]n]s  Can be flatter or deeper (more or less fine-grained), depending on one’s goals or theory
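
Nested brackets are easy to manipulate programmatically. The sketch below reads a Penn-Treebank-style bracketed parse (the "(S (NP ...))" convention rather than the slide's subscripted brackets) into nested Python lists; the sentence is the slide's own example:

```python
def read_tree(s):
    """Turn a bracketed parse string into nested lists: [label, child, ...]."""
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()

    def parse(i):
        assert tokens[i] == "("
        node = [tokens[i + 1]]          # the token after '(' is the node label
        i += 2
        while tokens[i] != ")":
            if tokens[i] == "(":
                child, i = parse(i)
                node.append(child)
            else:
                node.append(tokens[i])  # a leaf word
                i += 1
        return node, i + 1

    return parse(0)[0]

tree = read_tree("(S (NP the cat) (VP chased (NP the dog)))")
print(tree)
# → ['S', ['NP', 'the', 'cat'], ['VP', 'chased', ['NP', 'the', 'dog']]]
```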

26 The NLP Pipeline  Parsing is important for:  Co-reference resolution: Who is the most likely antecedent of his?  John called another student and told him his grade  Prepositional Phrase attachment: Who has the binoculars?  The agent saw the suspected spy with binoculars  The answers depend on the particular parse

27 The NLP Pipeline  The hardest part: Meaning (semantics)  Lexical: Which meaning of a word is the intended one in the context?  Sentence level: Given candidates for word meanings, what does the sentence mean?  Discourse/text level: what does a paragraph/chapter mean?  Is there implicit meaning that’s not overtly expressed (sentiment)?

28 Word Sense Disambiguation  Really hard!  The most frequent words have the most meanings (Zipf distribution)  beat, check, pitch, bar, state, like, can…  Related: Named Entity Recognition  G.W. Bush vs. the burning bush  G. Washington vs. Washington, D.C.

29 Word Sense Disambiguation  Prerequisite for most NLP applications:  Information retrieval  Question Answering  Paraphrasing  Text Summarization  Sentiment/opinion analysis  Machine Translation

30 Approaches to WSD  “ symbolic ” (linguistic): WordNet (wordnet.princeton.edu)  The WordNet graph is much richer than presented in COS 226  WN-based algorithms for WSD rely on a shared intuition:  Words in a context are semantically similar/related.  The sense (node in WN) within the shortest distance to other words in the same context is likely to be the intended one. E.g., bank most likely has the financial_institution sense when it occurs near money. (The path from {money} to {bank, financial_institution} is shorter than from {money} to {bank, sloping_land}.)
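
The shortest-path intuition can be sketched with breadth-first search over a miniature WordNet-like graph. The graph below is invented for illustration; a real project would query WordNet itself (e.g., via NLTK's wordnet interface or WN::Similarity):

```python
from collections import deque

# A toy sense graph: two senses of "bank" and their neighborhoods.
GRAPH = {
    "bank#finance": ["financial_institution"],
    "financial_institution": ["bank#finance", "money"],
    "money": ["financial_institution"],
    "bank#river": ["sloping_land"],
    "sloping_land": ["bank#river", "land"],
    "land": ["sloping_land"],
}

def distance(a, b):
    """Shortest path length from a to b via breadth-first search."""
    queue, seen = deque([(a, 0)]), {a}
    while queue:
        node, d = queue.popleft()
        if node == b:
            return d
        for nbr in GRAPH.get(node, []):
            if nbr not in seen:
                seen.add(nbr)
                queue.append((nbr, d + 1))
    return float("inf")

# Disambiguate "bank" in a context containing "money": pick the nearest sense.
senses = ["bank#finance", "bank#river"]
best = min(senses, key=lambda s: distance(s, "money"))
print(best)  # → 'bank#finance'
```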

31 Statistical approaches to word meaning and WSD  Semantic vector space models  Basic intuition: words with similar meanings share similar/same contexts  Requires large corpus to compute word co-occurrence statistics  Cosine similarity between two vectors reflects semantic similarity  GloVe (Global Vectors for Word Representation)  nlp.stanford.edu/projects/glove
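
The vector-space intuition can be made concrete with raw co-occurrence counts and cosine similarity. The three-sentence "corpus" below is invented and far too small to be meaningful; real models (e.g., GloVe) are trained on billions of tokens:

```python
import math
from collections import Counter

def context_vector(word, sentences, window=2):
    """Count the words co-occurring with `word` within +/- `window` tokens."""
    vec = Counter()
    for sent in sentences:
        toks = sent.split()
        for i, tok in enumerate(toks):
            if tok == word:
                for j in range(max(0, i - window), min(len(toks), i + window + 1)):
                    if j != i:
                        vec[toks[j]] += 1
    return vec

def cosine(u, v):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

sents = ["the cat drinks milk", "the dog drinks water", "the cat chases the dog"]
cat, dog = context_vector("cat", sents), context_vector("dog", sents)
print(round(cosine(cat, dog), 3))  # cat and dog share similar contexts
```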

32 Understanding Text  Current “hot” topic: sentiment analysis  What is the writer’s (often implicit) feeling/stance/bias?  Product evaluation (does the language correlate with a star or numerical rating?)  Travel and entertainment (flights, hotels, restaurants, movies,…)  Politics (blogs, forums, tweets,….)  Education (course ratings, SAT essays)  Medical (depression detection; personality assessment; drug use)  Legal, forensic (court transcripts)

33 WordNets for Sentiment Analysis  SentiWordNet  Bing Liu’s tutorial:  https://www.cs.uic.edu/~liub/FBS/Sentiment-Analysis-tutorial-AAAI-2011.pdf

34 Models, Algorithms,…

35 Approaches (no longer opposed)  Symbolic (rule-based)  Build a system based entirely on linguistic/grammar rules  Statistical  E.g., Bayesian

36 Models: overview  Rule-based models  Grammars  State machines  Finite state automata  Transducers  Probabilistic models  Markov models  Vector space models

37 Algorithms: overview  State space searches  Dynamic programming  Machine learning  Classifiers  Expectation Maximization

38 Statistical Approach  Machine Learning  Build a classifier  Supervised: train on human-annotated data (“gold standard”)  Collect data via crowd-sourcing (Amazon Mechanical Turk)  Unsupervised (unannotated data)  Low(er) quality (e.g., Google Translate)
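
A supervised classifier can be surprisingly small. Below is a bare-bones Naive Bayes sentiment classifier trained on a tiny invented "gold standard"; real projects would use far more data and a library implementation (e.g., NLTK's NaiveBayesClassifier):

```python
import math
from collections import Counter, defaultdict

# Invented training data: (text, label) pairs.
train = [("great movie loved it", "pos"), ("wonderful acting great fun", "pos"),
         ("terrible plot hated it", "neg"), ("boring and terrible", "neg")]

word_counts = defaultdict(Counter)
label_counts = Counter()
for text, label in train:
    label_counts[label] += 1
    word_counts[label].update(text.split())

vocab = {w for c in word_counts.values() for w in c}

def classify(text):
    def score(label):
        # Log prior plus log likelihood of each word under this label.
        s = math.log(label_counts[label] / sum(label_counts.values()))
        total = sum(word_counts[label].values())
        for w in text.split():
            # Laplace (add-one) smoothing so unseen words don't zero out the score.
            s += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        return s
    return max(label_counts, key=score)

print(classify("loved the acting"))  # → 'pos'
print(classify("hated it boring"))   # → 'neg'
```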

39 Working with NL data  Regular expressions (string/sequence searches)  Implemented with finite state automata, N-grams, Markov models  Useful for stemming, for example
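
Regular expressions make these string/sequence searches one-liners in Python. The patterns below are deliberately crude toy examples (a `\b\w+ed\b` search is not a real past-tense detector, just as naive stemming is not real morphology):

```python
import re

text = "The parsers tokenized the sentences and tagged the words."

# Crude search for past-tense-looking forms (words ending in 'ed').
print(re.findall(r"\b\w+ed\b", text))  # → ['tokenized', 'tagged']

# Crude search for plural-looking forms (words ending in 's').
print(re.findall(r"\b\w+s\b", text))   # → ['parsers', 'sentences', 'words']
```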

40 Working with NL data  Statistical language models of word sequences  n-grams  Entropy, information  Pointwise Mutual Information (powerful computer vs. strong tea)
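
Pointwise mutual information compares how often two words actually co-occur with how often they would co-occur by chance: PMI(x, y) = log2(p(x, y) / (p(x) p(y))). The counts below are invented to illustrate the "powerful computer" vs. "strong tea" collocation contrast:

```python
import math

N = 10000  # total bigrams in the (hypothetical) corpus
count_x = {"powerful": 100, "strong": 120}
count_y = {"computer": 200, "tea": 400}
count_xy = {("powerful", "computer"): 40, ("strong", "computer"): 2,
            ("powerful", "tea"): 1, ("strong", "tea"): 200}

def pmi(x, y):
    """log2 of observed vs. chance co-occurrence probability."""
    p_xy = count_xy[(x, y)] / N
    p_x, p_y = count_x[x] / N, count_y[y] / N
    return math.log2(p_xy / (p_x * p_y))

# The idiomatic collocations score high; the odd pairings score low or negative.
print(round(pmi("powerful", "computer"), 2))
print(round(pmi("strong", "tea"), 2))
print(round(pmi("powerful", "tea"), 2))  # negative: rarer than chance
```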

41 Evaluation  Standard measures for ML approach:  Precision, Recall, f-measure  Error analysis  For Machine Translation, paraphrasing, summarizing:  BLEU, ROUGE
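
Precision, recall, and F1 (the balanced f-measure) fall out directly from counting true positives, false positives, and false negatives. The gold and predicted labels below are invented toy data:

```python
gold = ["pos", "pos", "neg", "neg", "pos", "neg"]
pred = ["pos", "neg", "neg", "pos", "pos", "neg"]

# Treat "pos" as the target class.
tp = sum(g == p == "pos" for g, p in zip(gold, pred))            # hits
fp = sum(g == "neg" and p == "pos" for g, p in zip(gold, pred))  # false alarms
fn = sum(g == "pos" and p == "neg" for g, p in zip(gold, pred))  # misses

precision = tp / (tp + fp)              # of the items we flagged, how many were right?
recall = tp / (tp + fn)                 # of the true items, how many did we find?
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(precision, recall, round(f1, 3))  # precision = recall = 2/3 here
```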

