1
Natural Language Processing
2
Seminar webpage: https://www.cs.princeton.edu/courses/archive/fall16/cosIW05/
3
Introductions
Instructor: Christiane Fellbaum (fellbaum), COS 412; office hours Tue 10-12 and by appointment (drop-ins usually welcome)
A.I.: Kiran Vodrahalli (knv); office hours Friday 9-11, 3-5 in COS 402 or vicinity
4
Seminar participants Richard Chu Thomas Clark Ben Cohen Rohan Doshi Gudrun Jonsdottir Jacob Kaplan Stefan Keselj Avinash Nayak Shefali Nayak Kwasi Oppong-Badu Jonathan Zhang
5
What this seminar is all about
- Natural Language Processing: an area of Artificial Intelligence (modeling high-level human behavior; cf. the Turing Test)
- A computational approach to understanding and generating human (natural) language
- Difficult but fascinating; lots of challenges
- Many applications; human-computer interaction
6
What you will all do / Formalities
- Write, share, (revise,) submit a proposal (due Oct. 5)
- Deliver an oral presentation (week of Dec. 12)
- Prepare a final written report (due Jan. 10); LaTeX template on the COS IW website (other formats are OK)
- Present a poster (Jan. 12)
- For specifics and deadlines see:
  http://www.cs.princeton.edu/ugrad/independent-work/important-steps-and-deadlines
  http://www.cs.princeton.edu/ugrad/independent-work/guidelines-and-useful-information
7
How the IW seminar works
- We’ll all meet every week
- No “lecture” but interaction
- You’ll work on a project alone or with a partner
- Try to balance experience, competencies
- Communicate via Piazza
8
How the IW seminar works
- Individual deadlines/schedules
- Prepare four slides for presentation in each meeting:
  - Where you are now
  - Where you are going
  - Successes, obstacles, ...
9
How to pick a project
- Get inspired (e.g., by previous projects); take a walk through the COS building and study the posters on the walls
- Find a topic that interests (excites?) you; scope it right
- Do some background research: What is the state of the art (SOTA)? What are open questions or areas for improvement?
- Very important: getting a dataset
  - Pre-existing text corpora
  - Data mining: construct a customized dataset from blogs, reviews, etc. (scraping tools)
10
Your project
- Formulate (a) specific question(s)
- Formulate a hypothesis: If I find X, it means Y. What would a not-X result mean?
- Getting a “positive” result or beating previous systems’ performance is not necessarily a mark of better work! Negative results are *very* important for research.
11
Broad topic areas
Build or improve an NLP tool for English or another language:
- Part-of-speech tagger
- Stemmer or lemmatizer
- Syntactic parser
- Word sense disambiguation
- Build a new wordnet, or improve the current WordNet
12
Broad topic areas
Try your hand at a core NLP application (with a new dataset/method):
- Text summarization
- Paraphrasing
- Named Entity Recognition
- Coreference Resolution
- Question Answering (factoids: who/what/when vs. open-ended: how/why)
- Dialog systems
- Sentiment/humor/sarcasm detection
- Machine Translation
13
Broad topic areas
Ask a question that can be approached by analyzing specific data. Somewhat randomly chosen illustrative examples:
- Do political candidates from different parties use measurably different language?
- Authorship attribution
- Can we identify dishonest bloggers?
- Can sentiment analysis of product reviews predict numerical ratings?
- Can analyses of “salon” comments on textbook passages lead to better networking among students? (Dr. Gunawardena)
- Generate puns, jokes, stories
- Identify medical/drug issues by analyzing user forum posts
14
Broad topic areas
Build an app where natural language is a major component:
- Scheduling meetings and events for people with different interests and constraints
- Improved spell or grammar checker; malapropism detector
- Improved plagiarism detector
15
Broad topic areas
Language and other modalities:
- Labeling of images
- Resources: ImageNet, ShapeNet
16
Tools and resources
- Python tutorial: www.learnpython.org
- Python packages for NLP:
  - Natural Language Toolkit (NLTK): www.nltk.org
  - spaCy: https://spacy.io/
- Google Cloud Natural Language API (ML-based): https://cloud.google.com/natural-language/ (syntactic parser, Named Entity Recognition, sentiment analysis)
17
Tools and Resources
- Stanford CoreNLP tools: stanfordnlp.github.io/CoreNLP/ (Java; has a Python interface)
18
Text Corpora
- Corpora from Brigham Young University (nice interfaces): http://corpus.byu.edu/
- PU has a license for COCA (the Corpus of Contemporary American English)
- Google Books corpus; Google n-gram corpus: https://books.google.com/ngrams
19
Corpora with Linguistic Annotations Penn Treebank https://www.cis.upenn.edu/~treebank/
20
NLP tools for Word Sense Disambiguation
- Princeton’s own WordNet: wordnet.princeton.edu/ (there’s much more to it than what you learned in COS 226!)
- Wordnets in many other languages: http://compling.hss.ntu.edu.sg/omw/
- Ted Pedersen’s WordNet::Similarity: http://www.d.umn.edu/~tpederse/similarity.html
21
More specific tasks of NLP
- Building tools for automatically analyzing language (we’ll limit ourselves to written language)
- These steps are sometimes called the “NLP pipeline”
- You might want to build a pipeline component in a language other than English, or improve on an existing tool for English using a different method
22
The NLP Pipeline
- Character recognition (unlikely to be addressed in the seminar)
- Segmentation, tokenization: where does a word end and the next one begin? Relatively easy in English but hard in Chinese, for example.
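Tokenization for English can be sketched with a regular expression. This is a naive approach (it mishandles clitics like "don't" and multiword units); real tokenizers such as NLTK's use more elaborate rules:

```python
import re

def tokenize(text):
    """Split text into word and punctuation tokens.

    A naive regex tokenizer: \\w+ grabs runs of word characters,
    and any remaining non-space character becomes its own token.
    """
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("The ship sails today."))
# ['The', 'ship', 'sails', 'today', '.']
```

Note how even this simple rule already splits "don't" into three tokens, one of the decisions a real tokenizer must make deliberately.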
23
The NLP Pipeline
- Stemming: stripping off endings (proposal, propose => propos). Doesn’t necessarily leave a word; this can be a major challenge in some languages.
- Lemmatizing: stripping off endings so that a word is left, which can then be looked up in a dictionary (does, done, doing => do; mice => mouse; heavier => heavy)
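A toy suffix-stripping stemmer, using a hand-picked suffix list (an assumption for illustration; real stemmers such as Porter's apply ordered rewrite rules). Note that, as the slide says, the result need not be a dictionary word:

```python
# Hypothetical suffix list, longest-first so "ing" wins over "s".
SUFFIXES = ["ation", "ing", "al", "es", "ed", "e", "s"]

def stem(word):
    """Strip the first matching suffix, keeping a minimal stem length."""
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return word[: -len(suf)]
    return word

print(stem("proposal"), stem("propose"))  # propos propos
```

Irregular forms (does => do, mice => mouse) show why lemmatization needs a dictionary rather than suffix rules alone.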
24
The NLP Pipeline
- Part-of-speech tagging: determine the lexical category of a word
  - “ship sails today”: Ship (verb) the sails (noun) today! OR The ship (noun) sails (verb) today
- There are standard POS tag sets for a given language; English: https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
- Not all languages share all categories
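The ambiguity above can be sketched with a toy lexicon (the entries and Penn-style tags here are assumptions for illustration; real taggers learn tag probabilities from treebanks):

```python
from itertools import product

# Toy lexicon: each word maps to its possible POS tags.
LEXICON = {
    "ship": {"NN", "VB"},    # noun or verb
    "sails": {"NNS", "VBZ"}, # plural noun or 3sg verb
    "today": {"NN"},
}

def taggings(words):
    """Enumerate every tag sequence the lexicon allows."""
    return list(product(*(sorted(LEXICON[w]) for w in words)))

for seq in taggings(["ship", "sails", "today"]):
    print(seq)
# Four candidate sequences; a tagger's job is to pick the right one in context.
```

Even a three-word sentence is four ways ambiguous here, which is why taggers score sequences rather than words in isolation.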
25
The NLP Pipeline
- Parsing: assigning structure to a phrase or sentence
- Hierarchical “tree” or nested brackets: [ [the cat]n [chased]v [the dog]n ]s
- Can be flatter or deeper (more or less fine-grained), depending on one’s goals or theory
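A minimal reader for the bracket notation above, assuming labels follow closing brackets as in the example (a sketch, not a parser that *finds* the structure):

```python
import re

def parse_brackets(s):
    """Parse labeled brackets like '[[the cat]n [chased]v [the dog]n]s'
    into nested (label, children) tuples; leaves are plain word strings."""
    tokens = re.findall(r"\[|\][a-z]*|[^\[\]\s]+", s)
    pos = 0

    def node():
        nonlocal pos
        pos += 1                 # consume '['
        children = []
        while not tokens[pos].startswith("]"):
            if tokens[pos] == "[":
                children.append(node())
            else:
                children.append(tokens[pos])
                pos += 1
        label = tokens[pos][1:]  # ']n' -> 'n'
        pos += 1
        return (label, children)

    return node()

print(parse_brackets("[[the cat]n [chased]v [the dog]n]s"))
# ('s', [('n', ['the', 'cat']), ('v', ['chased']), ('n', ['the', 'dog'])])
```

The nested tuples are exactly the "tree" the slide describes; a syntactic parser's hard problem is choosing among many such trees for one sentence.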
26
The NLP Pipeline
Parsing is important for:
- Coreference resolution: Who is the most likely antecedent of “his”? John called another student and told him his grade.
- Prepositional phrase attachment: Who has the binoculars? The agent saw the suspected spy with binoculars.
The answers depend on the particular parse.
27
The NLP Pipeline
The hardest part: meaning (semantics)
- Lexical: Which meaning of a word is the intended one in the context?
- Sentence level: Given candidates for word meanings, what does the sentence mean?
- Discourse/text level: What does a paragraph/chapter mean? Is there implicit meaning that’s not overtly expressed (sentiment)?
28
Word Sense Disambiguation
- Really hard! The most frequent words have the most meanings (Zipf distribution): beat, check, pitch, bar, state, like, can, ...
- Related: Named Entity Recognition (G.W. Bush vs. the burning bush; G. Washington vs. Washington, D.C.)
29
Word Sense Disambiguation
A prerequisite for most NLP applications:
- Information retrieval
- Question Answering
- Paraphrasing
- Text summarization
- Sentiment/opinion analysis
- Machine Translation
30
Approaches to WSD
“Symbolic” (linguistic): WordNet (wordnet.princeton.edu)
- The WordNet graph is much richer than presented in COS 226
- WordNet-based algorithms for WSD rely on a shared intuition: words in a context are semantically similar/related. The sense (node in WordNet) within the shortest distance of the other words in the same context is likely to be the intended one.
- E.g., bank most likely has the financial_institution sense when it occurs near money. (The path from {money} to {bank, financial_institution} is shorter than from {money} to {bank, sloping_land}.)
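The shortest-path intuition can be sketched on a toy graph. All nodes and edges below are hypothetical stand-ins; real WordNet has ~120,000 synsets connected by typed relations (hypernymy, meronymy, etc.):

```python
from collections import deque

# Toy undirected "WordNet fragment" (edges are illustrative assumptions).
EDGES = [
    ("money", "currency"),
    ("currency", "bank#finance"),
    ("bank#finance", "institution"),
    ("institution", "entity"),
    ("entity", "land"),
    ("land", "slope"),
    ("slope", "bank#river"),
]

GRAPH = {}
for a, b in EDGES:
    GRAPH.setdefault(a, set()).add(b)
    GRAPH.setdefault(b, set()).add(a)

def distance(src, dst):
    """Breadth-first search: number of edges on the shortest path."""
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        node, d = queue.popleft()
        if node == dst:
            return d
        for nb in GRAPH[node]:
            if nb not in seen:
                seen.add(nb)
                queue.append((nb, d + 1))
    return None

# Near "money", the financial sense of "bank" wins: it is the closer node.
print(distance("money", "bank#finance"))  # 2
print(distance("money", "bank#river"))    # 7
```

Ted Pedersen's WordNet::Similarity package implements several refinements of this path-length idea over the real graph.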
31
Statistical approaches to word meaning and WSD
Semantic vector space models
- Basic intuition: words with similar meanings share similar/same contexts
- Requires a large corpus to compute word co-occurrence statistics
- Cosine similarity between two vectors reflects semantic similarity
- GloVe (Global Vectors for Word Representation): nlp.stanford.edu/projects/glove
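Cosine similarity over co-occurrence counts, as a minimal sketch (the three-context vectors and their counts are invented for illustration; GloVe vectors have hundreds of dimensions learned from billions of tokens):

```python
import math

def cosine(u, v):
    """Cosine of the angle between two co-occurrence vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Hypothetical co-occurrence counts over contexts (money, river, water):
bank_fin = [9, 1, 0]
bank_geo = [1, 8, 6]
cash     = [8, 0, 1]

print(cosine(bank_fin, cash) > cosine(bank_geo, cash))  # True
```

The financial "bank" vector points in nearly the same direction as "cash", which is exactly the similar-words-share-contexts intuition in numerical form.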
32
Understanding Text
Current “hot” topic: sentiment analysis. What is the writer’s (often implicit) feeling/stance/bias?
- Product evaluation (does the language correlate with a star or numerical rating?)
- Travel and entertainment (flights, hotels, restaurants, movies, ...)
- Politics (blogs, forums, tweets, ...)
- Education (course ratings, SAT essays)
- Medical (depression detection; personality assessment; drug use)
- Legal, forensic (court transcripts)
33
WordNets for Sentiment Analysis
- SentiWordNet
- Bing Liu’s tutorial: https://www.cs.uic.edu/~liub/FBS/Sentiment-Analysis-tutorial-AAAI-2011.pdf
34
Models, Algorithms,…
35
Approaches (no longer opposed)
- Symbolic (rule-based): build a system based entirely on linguistic/grammar rules
- Statistical: e.g., Bayesian
36
Models: overview
- Rule-based models
  - Grammars
  - State machines: finite state automata, transducers
- Probabilistic models
  - Markov models
  - Vector space models
37
Algorithms: overview
- State space searches
- Dynamic programming
- Machine learning: classifiers, Expectation Maximization
38
Statistical Approach
Machine Learning: build a classifier
- Supervised: train on human-annotated data (“gold standard”); collect data via crowd-sourcing (Amazon Mechanical Turk)
- Unsupervised: train on unannotated data; low(er) quality (e.g., Google Translate)
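A minimal supervised classifier, sketched as a Naive Bayes sentiment model with add-one (Laplace) smoothing. The four "gold standard" training texts are invented for illustration; a real project would train on thousands of annotated examples:

```python
import math
from collections import Counter

# Tiny hand-labeled training set (hypothetical gold-standard data).
train = [
    ("great movie loved it", "pos"),
    ("wonderful acting great fun", "pos"),
    ("terrible plot hated it", "neg"),
    ("boring and terrible", "neg"),
]

counts = {"pos": Counter(), "neg": Counter()}  # per-class word counts
docs = Counter()                               # per-class document counts
for text, label in train:
    docs[label] += 1
    counts[label].update(text.split())

vocab = set(w for c in counts.values() for w in c)

def predict(text):
    """Pick the class maximizing log P(class) + sum log P(word|class)."""
    best, best_lp = None, -math.inf
    for label in counts:
        lp = math.log(docs[label] / sum(docs.values()))
        total = sum(counts[label].values())
        for w in text.split():
            # Add-one smoothing so unseen words don't zero out the score.
            lp += math.log((counts[label][w] + 1) / (total + len(vocab)))
        if lp > best_lp:
            best, best_lp = label, lp
    return best

print(predict("loved the great acting"))  # pos
print(predict("hated the boring plot"))   # neg
```

Smoothing is what lets the model handle words like "the" that never appeared in training, a constant issue with small annotated datasets.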
39
Working with NL data
- Regular expressions (string/sequence searches): implemented with finite state automata
- N-grams, Markov models
- Useful for stemming, for example
40
Working with NL data
- Statistical language models of word sequences: n-grams
- Entropy, information
- Pointwise Mutual Information (powerful computer vs. strong tea)
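Pointwise Mutual Information compares how often two words occur together against how often they would co-occur by chance. The counts below are assumed numbers for illustration, not real corpus statistics:

```python
import math

# Hypothetical corpus counts: bigram and unigram frequencies.
N = 1_000_000  # total observations
count = {
    ("strong", "tea"): 30,
    ("powerful", "tea"): 1,
    "strong": 1_500, "powerful": 1_200, "tea": 400,
}

def pmi(w1, w2):
    """PMI(w1, w2) = log2( P(w1, w2) / (P(w1) * P(w2)) )."""
    p_xy = count[(w1, w2)] / N
    p_x, p_y = count[w1] / N, count[w2] / N
    return math.log2(p_xy / (p_x * p_y))

# "strong tea" co-occurs far more than chance predicts; "powerful tea" barely.
print(pmi("strong", "tea"))
print(pmi("powerful", "tea"))
```

High PMI for "strong tea" but not "powerful tea" captures the collocation preference the slide mentions, even though the two adjectives are near-synonyms.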
41
Evaluation
- Standard measures for ML approaches: precision, recall, F-measure
- Error analysis
- For Machine Translation, paraphrasing, summarizing: BLEU, ROUGE
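Precision, recall, and F1 from true/false positive and false negative counts, as a quick sketch (the counts are hypothetical):

```python
def prf(tp, fp, fn):
    """Precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)               # of what we returned, how much was right
    recall = tp / (tp + fn)                  # of what was there, how much we found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

# Hypothetical NER run: the system returned 8 entities, 6 correct, and missed 4.
p, r, f = prf(tp=6, fp=2, fn=4)
print(p, r, round(f, 3))  # 0.75 0.6 0.667
```

The harmonic mean penalizes lopsided systems: a tagger that returns everything gets perfect recall but terrible precision, and F1 exposes that trade-off in one number.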