Cognitive Computation Group Resources for Semantic Similarity


Textual Inference
Given a task like Question Answering…
- …you have a large set of documents, e.g. all articles from the New York Times for 2011 and 2012
- …and a set of questions, e.g. “Who participated in the gubernatorial debates in January 2012?”
- …you must return excerpts of the documents that answer the questions.
What are the challenges?
Page 2

QA Example
Consider the following example question, and a sample document excerpt that might answer it:
Q. Where is the headquarters of the parent company of Solahart Services?
A. Aztec Solar, Inc. recently acquired Solahart Services of Stockton, California. Aztec Solar, a Sacramento-based residential and commercial solar company, is excited about acquiring Solahart's regional and national solar customers.
Page 3

QA Example
Given the QA pair on the previous page, a human reader might make the following inference steps:
1. Aztec Solar, Inc. recently acquired Solahart Services → Aztec Solar, Inc. is the parent company of Solahart Services.
2. “Aztec Solar, Inc.” looks like a company name
3. Aztec Solar, a Sacramento-based residential and commercial solar company → Aztec Solar is based in Sacramento
4. “Aztec Solar” == “Aztec Solar, Inc.”
Page 4

QA Example
An automated system may use a matching process like this:
1. Rewrite the question: “The headquarters of the parent company of Solahart Services is in <LOCATION>”
2. Match question entities and tokens: LOCATION → Sacramento; company → Aztec Solar, Inc.
3. Apply structure-mapping rules: “-based” → “headquarters in”
4. This example can be easily perturbed to be more difficult (to thwart a shallow system)
Page 5

Outline
- Introduction: Textual Inference example
- Semantic Textual Similarity task
- LLM: a baseline system
- Comparators
  - Overview
  - Instances: WNSim, NESim
- Annotators
  - POS, Chunk, NER, Coreference, SRL
- Curator
- Edison
  - Data structures
  - Calling Curator
  - Feature Extraction
Page 6

Textual Inference: Semantic Similarity
Grand NLP challenge: work at the level of meaning of text
- Do these two sentences mean the same thing?
  1. John said he is considered a witness but not a suspect.
  2. "He is not a suspect anymore," John said.
- If they are different, how different are they?
- …rate similarity on a scale 0…5: 0 == different topic; 5 == paraphrase
  (see the STS training data readme: …/task6/data/uploads/datasets/train-readme.txt)
Page 7

Examples from STS training corpus
Pair 1:
- Nationally, the federal Centers for Disease Control and Prevention recorded 4,156 cases of West Nile, including 284 deaths.
- There were 293 human cases of West Nile in Indiana in 2002, including 11 deaths statewide.
- Score:
Pair 2:
- Chavez said investigators feel confident they've got "at least one of the fires resolved in that regard."
- Albuquerque Mayor Martin Chavez said investigators felt confident that with the arrests they had "at least one of the fires resolved."
- Score:
Page 8

CANDIDATE BASELINE: LEXICAL LEVEL MATCHING (LLM) Page 9

Words Matter
Approximate similarity of meaning via lexical overlap – how many words in common
But this isn’t exactly fool-proof…
- Pair: “John Smith bought three cakes and two oranges” vs. “John bought two oranges”
- Pair: “John Smith bought three cakes and two oranges” vs. “John bought three oranges”
Page 10

LLM Scoring
Designed for Textual Entailment (inherently asymmetric)
Proportion of matched Hypothesis tokens, normalized by length of shorter text
- Let T be the Text, containing tokens indexed by j
- Let H be the Hypothesis, with tokens indexed by i
- Let S(word1, word2) be a lexical similarity function that returns a value in the range [0,1]
Page 11
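The scoring formula itself did not survive the transcript; a standard formulation consistent with the definitions above (a reconstruction, not copied from the slide) is:

  LLM(T, H) = (1 / |H|) * Σ_i max_j S(h_i, t_j)

i.e., each Hypothesis token h_i is credited with its best match among the Text tokens t_j, and the total is normalized by the Hypothesis length (the shorter text in the entailment setting).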

LLM code

import edu.illinois.cs.cogcomp.mrcs.comparators.LlmComparator;

String source = "Of the three kings referred to by their last names, Atawanaba was the oldest.";
String target = "Three kings were named in the lawsuit.";

// 'config' is the comparator configuration (not shown on the slide)
LlmComparator llm = new LlmComparator( config );
double result = llm.compareStrings( source, target );

Page 12

Can we do better?
Depends on the application… a more advanced task may require more sophisticated patterns to separate classes
Sparsity of features
- Many words/sequences of words may not occur very often
- This means a learned classifier may not generalize well
- A more abstract representation can help
Ambiguity of words – e.g. “terminal”, “moving”
- Additional information may help
Meaning encoded in structure – e.g. “Matthew Smith, the Maverick’s last hope…”
NLP annotation tools generally abstract over underlying words so that features generalize better
Page 13

COMPARATORS Page 14

So you want to compare some text….
How similar are two words? Two strings? Two paragraphs?
- Depends on what they are
- String edit distance is usually a weak measure
- … think about coreference resolution…
Solution: specialized metrics

  String 1             String 2       Norm. edit sim.
  Shiite               Shi’ ‘ite      0.667
  Mr. Smith            Mrs. Smith     0.900
  Wilbur T. Gobsmack   Mr. Gobsmack   0.611
  Frigid               Cold           0.167
  Wealth               Wreath         0.667
  Paris                France         0.167

Page 15
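As a concrete illustration of why normalized edit similarity is a weak measure, here is a minimal sketch (my own helper, not the CCG code; it assumes similarity = 1 − Levenshtein distance / longer length, which reproduces the values in the table above):

import java.util.Arrays;

// Hypothetical helper: normalized edit similarity via the standard Levenshtein DP.
public final class EditSim {
    public static double normalizedEditSim(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);
            }
        }
        int longer = Math.max(a.length(), b.length());
        return longer == 0 ? 1.0 : 1.0 - (double) d[a.length()][b.length()] / longer;
    }

    public static void main(String[] args) {
        System.out.println(normalizedEditSim("Frigid", "Cold"));   // ~0.167, low although the meanings match
        System.out.println(normalizedEditSim("Wealth", "Wreath")); // ~0.667, high although the meanings differ
    }
}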

WNSim
Generate table mapping terms linked in WordNet ontology
- Synonymy, Hypernymy, Meronymy
Score reflects distance (up to 3 edges, undirected – e.g. via lowest common subsumer)
Score is symmetric

  String 1             String 2       WNSim similarity
  Shiite               Shi’ ‘ite      0
  Mr. Smith            Mrs. Smith     0
  Wilbur T. Gobsmack   Mr. Gobsmack   0
  Frigid               Cold           1
  Wealth               Wreath         0
  Paris                France         0

Page 16
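A minimal sketch of the table-lookup idea described above (the class name, decay function, and case handling are my assumptions, not the actual WNSim code):

import java.util.HashMap;
import java.util.Map;

// Hypothetical precomputed table: for each term, the terms reachable within 3 undirected
// WordNet edges (synonymy/hypernymy/meronymy), with a score that decays with distance.
public final class WordNetSimTable {
    private final Map<String, Map<String, Double>> table = new HashMap<>();

    public void addLink(String a, String b, int edgeDistance) {
        double score = Math.max(0.0, 1.0 - 0.25 * (edgeDistance - 1));  // assumed decay with distance
        String x = a.toLowerCase(), y = b.toLowerCase();
        table.computeIfAbsent(x, k -> new HashMap<>()).put(y, score);
        table.computeIfAbsent(y, k -> new HashMap<>()).put(x, score);   // symmetric, as the slide notes
    }

    public double score(String a, String b) {
        if (a.equalsIgnoreCase(b)) return 1.0;
        Map<String, Double> row = table.get(a.toLowerCase());
        if (row == null) return 0.0;
        return row.getOrDefault(b.toLowerCase(), 0.0);
    }
}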

Using WNSim
Install and run the WNSim code
- Sets up an xmlrpc server
- Expects xmlrpc ‘struct’ data structure (analogous to Dictionary)
    STRUCT { FIRST_STRING: aString; SECOND_STRING: anotherString }
- Returns another xmlrpc data structure:
    STRUCT { SCORE: aDouble; REASON: aString }
USE: call and cache (reduce network latency overhead; a caching sketch follows the Metric-interface example below)
NOTE: LLM code has Java client…
Page 17

WNSim via Metric interface

String metricHost = "…";
int metricPort = …;
XmlRpcMetricClient client = new XmlRpcMetricClient( "WNSim", metricHost, metricPort );
MetricResponse response = client.compareStrings( source_, target_ );
double score = response.score;

Page 18
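Following the “call and cache” advice from the previous slide, a minimal sketch of a caching wrapper around this client (the cache design is my own; XmlRpcMetricClient and MetricResponse are used exactly as shown above, and their imports are not shown on the slides):

import java.util.HashMap;
import java.util.Map;

// Minimal in-memory cache around the xmlrpc client to reduce network round trips.
public class CachingWnsimClient {
    private final XmlRpcMetricClient client;
    private final Map<String, Double> cache = new HashMap<>();

    public CachingWnsimClient(XmlRpcMetricClient client) {
        this.client = client;
    }

    public double score(String first, String second) {
        String key = first + "\t" + second;   // WNSim is symmetric, so the key order could also be normalized
        Double cached = cache.get(key);
        if (cached != null) return cached;
        MetricResponse response = client.compareStrings(first, second);
        cache.put(key, response.score);
        return response.score;
    }
}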

NESim
Set of entity-type-specific measures
- Acronyms, Prefix/Title rules, distance metric
Score reflects similarity based on type information
Score is asymmetric

  String 1             String 2       NESim similarity
  Shiite               Shi’ ‘ite      0.922
  Joan Smith           John Smith     0
  Wilbur T. Gobsmack   Mr. Gobsmack   0.95
  Frigid               Cold           0
  Wealth               Wreath         0.900
  Paris                France         0.411

Page 19

Using NESim
NESim package from CCG web site
- NESim can use context to help determine similarity
- Specify token offsets of NE string to indicate context (optional)
- Specify Type as one of PER, LOC, ORG (optional)
    [ #] [# # ]
- Note: offsets are inclusive, token-based, zero offset
Uses specialized resources depending on the type (if specified)
- Rules/gazetteers for People’s names
- Acronyms for Organizations
Page 20

Using NESim (cont’d)
Returns a score in [0, 1]
- Threshold of 0.8 or higher is advised
- Weakly similar names are generally not semantically close
Put jar on classpath, call programmatically
- Loads large lists, so instantiate once only

import edu.illinois.cs.cogcomp.entityComparison.core.EntityComparison;
EntityComparison entityComparator = new EntityComparison();
entityComparator.compare( aName, anotherName );
double currentScore = entityComparator.getScore();

Problem: identifying NE boundaries, types
Page 21
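A short usage sketch applying the advised 0.8 threshold (the wrapper class is illustrative and not part of the NESim API; only compare and getScore from the snippet above are assumed):

import edu.illinois.cs.cogcomp.entityComparison.core.EntityComparison;

// Illustrative wrapper: treat two name strings as the same entity only above the advised threshold.
public class NameMatcher {
    private final EntityComparison entityComparator = new EntityComparison(); // loads large lists; instantiate once

    public boolean sameEntity(String aName, String anotherName) {
        entityComparator.compare(aName, anotherName);
        return entityComparator.getScore() >= 0.8;  // weakly similar names are generally not semantically close
    }
}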

ANNOTATORS Page 22

Available from CCG
- Tokenization/Sentence Splitting
- Part Of Speech
- Chunking
- Named Entity Recognition
- Coreference
- Semantic Role Labeling
Page 23

Tokenization and Sentence Segmentation
Given a document, find the sentence and token boundaries
  The police chased Mr. Smith of Pink Forest, Fla. all the way to Bethesda, where he lived. Smith had escaped after a shoot-out at his workplace, Machinery Inc.
Why?
- Word counts may be important features
- Words may themselves be the object you want to classify
- “lived.” and “lived” should give the same information
- Different analyses need to align if you want to leverage multiple annotators from different sources/tasks
Page 24

Tokenization and Sentence Segmentation ctd.
Believe it or not, this is an open problem
No agreed standard for token-level segmentation
- e.g. “American-led” vs. “American - led”?
- e.g. “$ 32 M” vs. “$32 M” and “$32M”?
- Different tasks may use different standards
No wildly successful sentence segmenter exists (see the excerpts in news aggregators for some nice errors)
Noisier text (e.g. online consumer reviews) → poorer performance (for reasons like inconsistent capitalization)
The LBJ distribution includes the Illinois tokenizer and sentence segmenter
Page 25

Part of Speech (POS)
Allows simple abstraction for pattern detection
- Disambiguate a target, e.g. “make (a cake)” vs. “make (of car)”
- Specify more abstract patterns, e.g. Noun Phrase: ( DT JJ* NN )
- Specify context in abstract way
  - e.g. “DT boy VBX” for “actions boys do”
  - This expression will catch “a boy cried”, “some boy ran”, …

  Word:  The  boy  stood  on  the  burning  deck
  POS:   DT   NN   VBD    PP  DT   JJ       NN

  Word:  A    boy  rode   on  a    red      bicycle
  POS:   DT   NN   VBD    PP  DT   JJ       NN

Page 26
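As a sketch of how an abstract pattern like “DT JJ* NN” can be applied once POS tags are available (the space-joined tag string and the regex are illustrative conventions of mine, not a CCG API):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Match the noun-phrase pattern "DT JJ* NN" over a space-joined tag sequence.
public final class PosPatternDemo {
    public static void main(String[] args) {
        String[] tags = {"DT", "NN", "VBD", "PP", "DT", "JJ", "NN"};  // "The boy stood on the burning deck"
        String tagString = String.join(" ", tags);

        Pattern nounPhrase = Pattern.compile("DT( JJ)* NN");
        Matcher m = nounPhrase.matcher(tagString);
        while (m.find()) {
            // Token indices can be recovered by counting spaces before the match start.
            System.out.println("Noun phrase tag span: '" + m.group() + "' at char offset " + m.start());
        }
    }
}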

Chunking
Identifies phrase-level constituents in sentences
  [NP Boris] [ADVP regretfully] [VP told] [NP his wife] [SBAR that] [NP their child] [VP could not attend] [NP night school] [PP without] [NP permission].
Useful for filtering: identify e.g. only noun phrases, or only verb phrases
- Groups modifiers with heads
- Useful for e.g. Mention Detection
Used as source of features, e.g. distance (abstracts away determiners, adjectives, for example), sequence, …
- More efficient to compute than full syntactic parse
- Applications in e.g. Information Extraction – getting (simple) information about concepts of interest from text documents
Page 27

Named Entity Recognition
Identifies and classifies strings of characters representing proper nouns
  [PER Neil A. Armstrong], the 38-year-old civilian commander, radioed to earth and the mission control room here: “[LOC Houston], [ORG Tranquility] Base here; the Eagle has landed.”
Useful for filtering documents
- “I need to find news articles about organizations in which Bill Gates might be involved…”
Disambiguate tokens: “Chicago” (team) vs. “Chicago” (city)
Source of abstract features
- E.g. “Verbs that appear with entities that are Organizations”
- E.g. “Documents that have a high proportion of Organizations”
Page 28

Coreference
Identify all phrases that refer to each entity of interest – i.e., group mentions of concepts
  [Neil A. Armstrong], [the 38-year-old civilian commander], radioed to [earth]. [He] said the famous words, “[the Eagle] has landed.”
The Named Entity recognizer only gets us part-way…
- …if we ask, “what actions did Neil Armstrong perform?”, we will miss many instances (e.g. “He said…”)
Coreference resolver abstracts over different ways of referring to the same person
- Useful in feature extraction, information extraction
Page 29

Semantic Role Labeler
SRL reveals relations and arguments in the sentence (where relations are expressed as verbs)
Cannot abstract over variability of expressing the relations – e.g. kill vs. murder vs. slay…
Page 30

CURATOR Page 31

Big NLP
We introduced a lot of tools, some of them quite sophisticated
- The more complex, the bigger the memory requirement
  - NER: 1G; Coref: 1G; SRL: 4G …
If you use tools from different sources, they may be…
- In different languages
- Using different data structures
If you run a lot of experiments on a single corpus, it would be nice to cache the results
- …and for your colleagues, nice if they can access that cache
Curator is our solution to these problems.
Page 32

Curator
Supports distributed NLP resources
- Central point of contact
- Single set of interfaces
- Code generation in many languages (using Thrift)
Programmatic interface
- Defines set of common data structures used for interaction
Caches processed data
Enables highly configurable NLP pipeline
Overhead:
- Annotation is all at the level of character offsets: normalization/mapping to token level required
- Need to wrap tools to provide requisite data structures
Page 33

Curator
  [Architecture diagram: the Curator connects client code to annotators (NER; SRL; POS, Chunker) and a Cache]
Page 34

Using Curator for Flexible NLP Pipeline
For this class only: dedicated Curator instance
- Temporary instance with host, port accessible to class members
Recommended: access using Edison library (next)
Page 35

Edison: An NLP Library
Convenient interface to Curator
- Converts to token-level indexing (often more convenient)
Supports feature extraction over trees
- Apply to syntactic parse/dependency, and to SRL/NOM
- E.g. see the Edison documentation for examples of dependency path features
Page 36

Serializing TextAnnotations

public void serializeAnnotations( List<TextAnnotation> annotations_, String outputFile_ ) throws Exception {
    try {
        ObjectOutputStream objOut = new ObjectOutputStream( new FileOutputStream( outputFile_ ) );
        objOut.writeObject( new Integer( annotations_.size() ) );
        for ( TextAnnotation ta : annotations_ ) {
            System.err.println( "serializing TA for text '" + ta.getText() + "'..." );
            objOut.writeObject( ta );
        }
        objOut.close();
    } catch ( IOException e ) {
        …
    }
    return;
}

Page 37
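The matching read side is not on the slide; a minimal sketch under the same conventions (count written first, then one object per TextAnnotation) might look like this:

import java.io.FileInputStream;
import java.io.ObjectInputStream;
import java.util.ArrayList;
import java.util.List;

// Hypothetical counterpart to serializeAnnotations: read the count, then each TextAnnotation.
public List<TextAnnotation> deserializeAnnotations( String inputFile_ ) throws Exception {
    List<TextAnnotation> annotations = new ArrayList<TextAnnotation>();
    ObjectInputStream objIn = new ObjectInputStream( new FileInputStream( inputFile_ ) );
    int numAnnotations = ( (Integer) objIn.readObject() ).intValue();
    for ( int i = 0; i < numAnnotations; ++i ) {
        annotations.add( (TextAnnotation) objIn.readObject() );
    }
    objIn.close();
    return annotations;
}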

K-best Views in Curator
The Charniak and Stanford parsers can be run in K-best mode
These will be added to Curator with k=50
- This will be quite disk-hungry
- These components will probably *not* be cached
Curator uses a MultiParser interface for k-best parsers
- Generates a parse view in Record
- The parse view is a List of Forests: the k-th Forest contains the k-th best parse for all sentences in the record
Edison does NOT yet directly support getting k-best parses from Curator, BUT…
Page 38

K-best views in Edison
Edison supports k-best views

List topKParses = ...;  // A list of top-k parses, say from Charniak
ta.addView(ViewNames.PARSE_CHARNIAK, topKParses);

List<View> parses = ta.getTopKViews(ViewNames.PARSE_CHARNIAK);

int tokenId = 17;  // some token
Page 39

Edison k-best example cont’d

Constituent c = new Constituent("", "", ta, tokenId, tokenId + 1);

int treeId = 0;
for (View parseTree : parses) {
    for (Constituent parseConstituent :
             parseTree.where(Queries.containsConstituent(c))) {
        // do something with parseConstituent belonging to tree "treeId"
    }
    treeId++;
}

Page 40

A FINAL WORD Page 41

LLM and Semantic Similarity
LLM was designed for Textual Entailment, and is asymmetric by design
This task is a little different – trying to assess the level of semantic equivalence of two sentences S1 and S2
Still want to normalize (don’t want all short sentence pairs to have lower scores than long sentence pairs), but consider evaluating for both (S1, S2) and for (S2, S1)
Page 42
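One simple way to follow this advice is to score both directions and combine them (a sketch; the averaging choice is my own, and LlmComparator is used only as in the earlier LLM code example):

import edu.illinois.cs.cogcomp.mrcs.comparators.LlmComparator;

// Symmetrize LLM for the STS setting by scoring both (S1, S2) and (S2, S1).
public class SymmetricLlm {
    private final LlmComparator llm;

    public SymmetricLlm(LlmComparator llm) {
        this.llm = llm;
    }

    public double similarity(String s1, String s2) throws Exception {
        double forward = llm.compareStrings(s1, s2);
        double backward = llm.compareStrings(s2, s1);
        return 0.5 * (forward + backward);  // could also take the min or max, depending on the application
    }
}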

FIN Page 43