Features, Formalized Stephen Mayhew Hyung Sul Kim 1

Outline What are features? How are they defined in NLP tasks in general? How are they defined specifically for relation extraction? (Kernel methods) 2

What are features? 3

Feature Extraction Pipeline 1. Define Feature Generation Functions (FGF) 2. Apply FGFs to data to make a lexicon 3. Translate examples into feature space 4. Learning with vectors 4

Feature Generation Functions 5


Feature Extraction Pipeline 1. Define Feature Generation Functions (FGF) 2. Apply FGFs to data to make a lexicon 3. Translate examples into feature space 4. Learning with vectors 8

Lexicon Apply our FGF to all input data. This creates grounded features and indexes them: … 3534: hasWord(stark) 3535: hasWord(stamp) 3536: hasWord(stampede) 3537: hasWord(starlight) … 9
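A minimal Python sketch of this step, assuming a hasWord-style FGF over whitespace tokens; the names (Lexicon, has_word) are illustrative, not from the talk, and the toy corpus gives different indices than the slide:

class Lexicon:
    """Maps grounded feature strings to integer indices."""
    def __init__(self):
        self.feature_to_id = {}

    def lookup(self, feature, add=True):
        # Register an unseen feature and return its index; return None for
        # unseen features when add=False (used at test time).
        if feature not in self.feature_to_id:
            if not add:
                return None
            self.feature_to_id[feature] = len(self.feature_to_id)
        return self.feature_to_id[feature]

def has_word(example):
    # An FGF: emits one grounded feature per token, e.g. hasWord(stark).
    return ["hasWord({})".format(tok) for tok in example.split()]

lexicon = Lexicon()
for example in ["In the stark starlight", "the stampede started"]:
    for feature in has_word(example):
        lexicon.lookup(feature)          # e.g. hasWord(stark) -> 2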

Feature Extraction Pipeline 1. Define Feature Generation Functions (FGF) 2. Apply FGFs to data to make a lexicon 3. Translate examples into feature space 4. Learning with vectors 10

Translate examples to feature space From the Lexicon: … 98: hasWord(In) … 241: hasWord(the) … 3534: hasWord(stark) 3535: hasWord(stamp) 3536: hasWord(stampede) 3537: hasWord(starlight) … "In the stark starlight" 11
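Continuing the sketch above, translating an example then amounts to looking up each grounded feature in the lexicon and keeping the set of active indices as a sparse vector; features never seen at lexicon-building time are dropped:

def to_sparse_vector(example, lexicon):
    # Collect the lexicon indices of the features that fire for this example,
    # silently ignoring features that are not in the lexicon.
    ids = {lexicon.lookup(f, add=False) for f in has_word(example)}
    return sorted(i for i in ids if i is not None)

print(to_sparse_vector("In the stark starlight", lexicon))
# -> [0, 1, 2, 3] with the toy lexicon above; with the lexicon on the slide it
#    would be something like [98, 241, 3534, 3537]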

Feature Extraction Pipeline 1. Define Feature Generation Functions (FGF) 2. Apply FGFs to data to make a lexicon 3. Translate examples into feature space 4. Learning with vectors. Easy. 12

Feature Extraction Pipeline (Testing) 1. FGFs are already defined 2. Lexicon is already defined 3. Translate examples into feature space 4. Learning with vectors. No surprises here. 13

Structured Pipeline - Training Exactly the same as before! 14

Structured Pipeline - Testing 15

Automatic Feature Generation Two ways to look at this: 1. Creating an FGF: this is a black art, not even intuitive for humans to do. 2. Choosing the best subset of a closed set: this is possible, and algorithms exist (see the sketch below). 16
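One standard member of that family is greedy forward selection. The sketch below assumes a caller-supplied evaluate(subset) function that trains and scores a model on held-out data; evaluate is a placeholder, not something defined in the talk:

def greedy_forward_selection(all_features, evaluate, max_size=20):
    # Repeatedly add the single feature that most improves the evaluation
    # score, stopping when no remaining feature helps or the budget is hit.
    selected, best_score = [], float("-inf")
    while len(selected) < max_size:
        remaining = [f for f in all_features if f not in selected]
        if not remaining:
            break
        best_f = max(remaining, key=lambda f: evaluate(selected + [f]))
        score = evaluate(selected + [best_f])
        if score <= best_score:
            break
        selected.append(best_f)
        best_score = score
    return selected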

Exploiting Syntactico-Semantic Structures for Relation Extraction Before doing the hard task of relation classification, apply some easy heuristics to recognize: Premodifiers: [the [Seattle] Zoo]; Possessives: [[California's] Governor]; Prepositions: [officials] in [California]; Formulaics: [Medford], [Massachusetts]. These 4 structures cover 80% of the mention pairs (in ACE 2004). Chan and Roth, ACL

Kernels for Relation Extraction Hyung Sul Kim 18

Kernel Tricks A few slides borrowed from the ACL 2012 tutorial on kernels in NLP by Moschitti 19


All We Need is K(x1, x2) = ϕ(x1) · ϕ(x2) Computing K(x1, x2) is possible without explicitly mapping x to ϕ(x) 24
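For instance (a toy numeric example, not from the slides): for the quadratic kernel K(x1, x2) = (x1 · x2)^2, ϕ(x) lists all pairwise products of coordinates, yet K can be evaluated without ever constructing ϕ(x). A minimal Python sketch:

import itertools

def phi(x):
    # Explicit quadratic feature map: all ordered pairwise products x_i * x_j.
    return [xi * xj for xi, xj in itertools.product(x, repeat=2)]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

x1, x2 = [1.0, 2.0, 3.0], [4.0, 5.0, 6.0]

explicit = dot(phi(x1), phi(x2))   # work in the expanded feature space
kernel   = dot(x1, x2) ** 2        # kernel trick: never build phi(x)
assert abs(explicit - kernel) < 1e-9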

Linear Kernels with Features (Zhou et al., 2005) Pairwise binary SVM training. Features: words, entity types, mention level, overlap, base phrase chunking, dependency tree, parse tree, semantic resources. 25

Word Features
Feature | Description | Example
WM1 | bag-of-words in M1 | {they}
HM1 | head word of M1 | they
WM2 | bag-of-words in M2 | {their, children}
HM2 | head word of M2 | children
HM12 | combination of HM1 and HM2 |
WBNULL | when no word in between | 0
WBFL | the only word in between when only one word in between | 0
WBF | first word in between when at least two words in between | do
WBL | last word in between when at least two words in between | put
WBO | other words in between except first and last words when at least three words in between | not
BM1F | first word before M1 | 0
BM1L | second word before M1 | 0
AM2F | first word after M2 | in
AM2L | second word after M2 | a
26
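A hedged sketch of how a few of these word features could be computed, assuming tokenized input and mentions given as (start, end) token spans; the tail of the example sentence is invented for illustration:

def word_features(tokens, m1, m2):
    # m1 and m2 are (start, end) token spans, with m1 before m2.
    feats = {}
    feats["WM1"] = set(tokens[m1[0]:m1[1]])      # bag-of-words in M1
    feats["WM2"] = set(tokens[m2[0]:m2[1]])      # bag-of-words in M2
    between = tokens[m1[1]:m2[0]]                # words strictly between the mentions
    feats["WBNULL"] = (len(between) == 0)
    if len(between) == 1:
        feats["WBFL"] = between[0]
    elif len(between) >= 2:
        feats["WBF"], feats["WBL"] = between[0], between[-1]
        feats["WBO"] = set(between[1:-1])
    after = tokens[m2[1]:m2[1] + 2]              # first two words after M2
    if len(after) > 0:
        feats["AM2F"] = after[0]
    if len(after) > 1:
        feats["AM2L"] = after[1]
    return feats

# M1 = "they", M2 = "their children"; words after "a" are invented.
tokens = "they do not put their children in a nursery".split()
print(word_features(tokens, m1=(0, 1), m2=(4, 6)))
# WBF = 'do', WBL = 'put', WBO = {'not'}, AM2F = 'in', AM2L = 'a'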

Entity Types, Mention Level, Overlap
Feature | Description | Example 1 | Example 2
ET12 | combination of mention entity types (PER, ORG, FAC, LOC, GPE) | |
ML12 | combination of mention levels (NAME, NOMINAL, PRONOUN) | |
#MB | number of other mentions in between | 0 | 0
#WB | number of words in between | 3 | 0
M1>M2 | 1 if M2 is included in M1 | 0 | 1
M1<M2 | 1 if M1 is included in M2 | 0 | 0
27

Base Phrase Chunking
Feature | Description | Example
CPHBNULL | when no phrase in between | 0
CPHBFL | the only phrase head when only one phrase in between | 0
CPHBF | first phrase head in between when at least two phrases in between | JAPAN
CPHBL | last phrase head in between when at least two phrase heads in between | KILLED
CPHBO | other phrase heads in between except first and last phrase heads when at least three phrases in between | 0
CPHBM1F | first phrase head before M1 | 0
CPHBM1L | second phrase head before M1 | 0
CPHAM2F | first phrase head after M2 | 0
CPHAM2L | second phrase head after M2 | 0
28

Dependency Trees
That's because Israel was expected to retaliate against Hezbollah forces in areas controlled by Syrian troops.
Feature | Description | Example
ET1DW1 | combination of the entity type and the dependent word for M1 |
H1DW1 | combination of the head word and the dependent word for M1 |
ET2DW2 | combination of the entity type and the dependent word for M2 |
H2DW2 | combination of the head word and the dependent word for M2 |
ET12SameNP | combination of ET12 and whether M1 and M2 are included in the same NP | 0
ET12SamePP | combination of ET12 and whether M1 and M2 are included in the same PP | 0
ET12SameVP | combination of ET12 and whether M1 and M2 are included in the same VP | 1
29

Performance of Features (F1 Measure) 30

Performance Comparison
Year | Authors | Method | F-Measure
2005 | Zhou et al. | Linear Kernels with Handcrafted Features |

Syntactic Kernels (Zhao and Grishman, 2005) A composite of 5 kernels: Argument Kernel, Bigram Kernel, Link Sequence Kernel, Dependency Path Kernel, Local Dependency Kernel. 32

Bigram Kernel
All unigrams and bigrams in the text from M1 to M2
Unigram | Bigram
they | they do
do | do not
not | not put
put | put their
their | their children
children |
33
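One simple way to realize such a kernel (a sketch, not necessarily Zhao and Grishman's exact formulation) is a linear kernel over the unigram and bigram counts of the words from M1 to M2:

from collections import Counter

def ngram_features(tokens_between):
    # Unigrams and bigrams of the text from M1 to M2, as in the table above.
    unigrams = tokens_between
    bigrams = [" ".join(pair) for pair in zip(tokens_between, tokens_between[1:])]
    return Counter(unigrams + bigrams)

def bigram_kernel(tokens1, tokens2):
    # Linear kernel: dot product of the two unigram+bigram count vectors.
    f1, f2 = ngram_features(tokens1), ngram_features(tokens2)
    return sum(f1[k] * f2[k] for k in f1.keys() & f2.keys())

span = "they do not put their children".split()
print(ngram_features(span))
print(bigram_kernel(span, "do not put".split()))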

Dependency Path Kernel That's because Israel was expected to retaliate against Hezbollah forces in areas controlled by Syrian troops. 34

Performance Comparison
Year | Authors | Method | F-Measure
2005 | Zhou et al. | Linear Kernels with Handcrafted Features |
2005 | Zhao and Grishman | Syntactic Kernels (Composite of 5 Kernels) |

Composite Kernel (Zhang et al., 2006) A composite of two kernels: an Entity Kernel (a linear kernel over entity-related features given by the ACE datasets) and a Convolution Tree Kernel (Collins and Duffy, 2001). Two ways to combine the two kernels: linear combination and polynomial expansion (see the sketch below). 36
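Both composition strategies can be sketched generically: any convex combination of valid kernels is again a valid kernel, and so is a polynomial with non-negative coefficients applied to one. The exact form used by Zhang et al. may differ; alpha and degree below are illustrative parameters:

def linear_combination(k_entity, k_tree, alpha=0.4):
    # K(x, y) = alpha * K_entity(x, y) + (1 - alpha) * K_tree(x, y)
    return lambda x, y: alpha * k_entity(x, y) + (1 - alpha) * k_tree(x, y)

def polynomial_expansion(k_entity, k_tree, alpha=0.4, degree=2):
    # Polynomially expand the entity kernel before combining it with the tree
    # kernel; (K + 1) ** degree implicitly adds feature conjunctions.
    return lambda x, y: alpha * (k_entity(x, y) + 1) ** degree + (1 - alpha) * k_tree(x, y)

# usage (entity_kernel and tree_kernel assumed defined elsewhere):
#   K = linear_combination(entity_kernel, tree_kernel, alpha=0.4)
#   score = K(relation_instance_1, relation_instance_2)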

Convolution Tree Kernel (Collins and Duffy, 2001) An example tree. K(x1, x2) can be computed efficiently in O(|x1|·|x2|) time. 37
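A compact sketch of the Collins and Duffy (2001) recursion: K(x1, x2) sums, over all pairs of nodes, a decayed count of the common subtrees rooted at that pair, memoizing one value per node pair, hence the O(|x1|·|x2|) cost. The Node class and the decay factor lam are illustrative assumptions:

class Node:
    def __init__(self, label, children=()):
        self.label, self.children = label, list(children)

    def production(self):
        # The grammar rule at this node, e.g. ('NP', ('DT', 'NN')).
        return (self.label, tuple(c.label for c in self.children))

def nodes(tree):
    yield tree
    for child in tree.children:
        yield from nodes(child)

def tree_kernel(t1, t2, lam=0.4):
    memo = {}

    def common_subtrees(n1, n2):
        # Decayed number of common subtrees rooted at n1 and n2.
        key = (id(n1), id(n2))
        if key not in memo:
            if not n1.children or not n2.children or n1.production() != n2.production():
                memo[key] = 0.0          # leaves, or different grammar rules
            else:
                score = lam
                for c1, c2 in zip(n1.children, n2.children):
                    score *= 1.0 + common_subtrees(c1, c2)
                memo[key] = score
        return memo[key]

    return sum(common_subtrees(a, b) for a in nodes(t1) for b in nodes(t2))

# Two small constituent trees differing only in the final noun.
t1 = Node("NP", [Node("DT", [Node("the")]), Node("NN", [Node("dog")])])
t2 = Node("NP", [Node("DT", [Node("the")]), Node("NN", [Node("cat")])])
print(tree_kernel(t1, t2))   # counts the shared DT->the and NP->DT NN fragments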

Relation Instance Spaces

Performance Comparison
Year | Authors | Method | F-Measure
2005 | Zhou et al. | Linear Kernels with Handcrafted Features |
2005 | Zhao and Grishman | Syntactic Kernels (Composite of 5 Kernels) |
2006 | Zhang et al. | Entity Kernel + Convolution Tree Kernel |

Context-Sensitive Tree Kernel (Zhou et al., 2007) Motivating example: "John and Mary got married", the so-called predicate-linked category (10%). PT: 63.6, Context-Sensitive Tree Kernel:

Performance Comparison
Year | Authors | Method | F-Measure
2005 | Zhou et al. | Linear Kernels with Handcrafted Features |
2005 | Zhao and Grishman | Syntactic Kernels (Composite of 5 Kernels) |
2006 | Zhang et al. | Entity Kernel + Convolution Tree Kernel |
2007 | Zhou et al. | (Zhou et al., 2005) + Context-sensitive Tree Kernel |

Best Kernel (Nguyen et al., 2009) Uses multiple kernels on constituent trees, dependency trees, and sequential structures. Designs 5 different kernel composites from 4 tree kernels and 6 sequential kernels. 42

Convolution Tree Kernels on 4 Special Trees: PET, DW, GR, GRW. PET + GR = 70.5, DW + GR =

Word Sequence Kernels on 6 Special Sequences
SK1. Sequence of terminals (lexical words) in the PET, e.g. T2-LOC washington, U.S. T1-PER officials
SK2. Sequence of part-of-speech (POS) tags in the PET, e.g. T2-LOC NN, NNP T1-PER NNS
SK3. Sequence of grammatical relations in the PET, e.g. T2-LOC pobj, nn T1-PER nsubj
SK4. Sequence of words in the DW, e.g. Washington T2-LOC In working T1-PER officials GPE U.S.
SK5. Sequence of grammatical relations in the GR, e.g. pobj T2-LOC prep ROOT T1-PER nsubj GPE nn
SK6. Sequence of POS tags in the DW, e.g. NN T2-LOC IN VBP T1-PER NNS GPE NNP
SK1 + SK2 + SK3 + SK4 + SK5 + SK6 =

Word Sequence Kernels (Cancedda et al., 2003) Extended sequence kernels: map to high-dimensional spaces using every subsequence, with penalties for common subsequences (using IDF), for longer subsequences, and for non-contiguous subsequences. 45
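A brute-force sketch of the underlying subsequence feature space (without the IDF-based extension): every length-n word subsequence is a feature, weighted by lam raised to the number of positions it spans, so longer and non-contiguous matches are penalized. Explicit enumeration is only feasible for short spans; in practice a dynamic program computes the same kernel:

from collections import defaultdict
from itertools import combinations

def subsequence_weights(tokens, n, lam=0.5):
    # phi_u(s) = sum over occurrences of subsequence u of lam ** span,
    # where span = (last index - first index + 1) of the occurrence.
    weights = defaultdict(float)
    for idx in combinations(range(len(tokens)), n):
        u = tuple(tokens[i] for i in idx)
        weights[u] += lam ** (idx[-1] - idx[0] + 1)
    return weights

def sequence_kernel(s, t, n=2, lam=0.5):
    ws, wt = subsequence_weights(s, n, lam), subsequence_weights(t, n, lam)
    return sum(ws[u] * wt[u] for u in ws.keys() & wt.keys())

print(sequence_kernel("officials in washington".split(),
                      "officials working in washington".split()))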

Performance Comparison
Year | Authors | Method | F-Measure
2005 | Zhou et al. | Linear Kernels with Handcrafted Features |
2005 | Zhao and Grishman | Syntactic Kernels (Composite of 5 Kernels) |
2006 | Zhang et al. | Entity Kernel + Convolution Tree Kernel |
2007 | Zhou et al. | (Zhou et al., 2005) + Context-sensitive Tree Kernel |
2009 | Nguyen et al. | Multiple Tree Kernels + Multiple Sequence Kernels | 71.5
(Zhang et al., 2006): F-measure 68.9 in our settings. (Zhou et al., 2007): "Such heuristics expand the tree and remove unnecessary information allowing a higher improvement on RE. They are tuned on the target RE task so although the result is impressive, we cannot use it to compare with pure automatic learning approaches, such us our models."
46

Topic Kernel (Wang et al., 2011)
Use the Wikipedia infobox to learn topics of relations (like topics of words) based on co-occurrences.
Topics | Top Relations
Topic 1 | active_years_end_date, career_end, final_year, retired
Topic 2 | commands, part_of, battles, notable_commanders
Topic 3 | influenced, school_tradition, notable_ideas, main_interests
Topic 4 | destinations, end, through, post_town
Topic 5 | prizes, award, academy_awards, highlights
Topic 6 | inflow, outflow, length, maxdepth
Topic 7 | after, successor, ending_terminus
Topic 8 | college, almamater, education
…
47

Overview 48

Performance Comparison
Year | Authors | Method | F-Measure
2005 | Zhou et al. | Linear Kernels with Handcrafted Features |
2005 | Zhao and Grishman | Syntactic Kernels (Composite of 5 Kernels) |
2006 | Zhang et al. | Entity Kernel + Convolution Tree Kernel |
2007 | Zhou et al. | (Zhou et al., 2005) + Context-sensitive Tree Kernel |
2009 | Nguyen et al. | Multiple Tree Kernels + Multiple Sequence Kernels |
2011 | Wang et al. | Entity Features + Word Features + Dependency Path + Topic Kernels |