1 Fast Methods for Kernel-based Text Analysis
Taku Kudo 工藤 拓, Yuji Matsumoto 松本 裕治
NAIST (Nara Institute of Science and Technology)
41st Annual Meeting of the Association for Computational Linguistics, Sapporo, Japan

2 Background
- Kernel methods (e.g., SVM) have become popular.
- Prior knowledge can be incorporated independently of the learning algorithm by supplying a task-dependent kernel (a generalized dot product).
- High accuracy.

3 Problem
- Kernel-based text analyzers are too slow to use in real NL applications (e.g., QA or text mining) because of their inefficiency in testing.
- Some kernel-based parsers run at only seconds per sentence.

4 Goals
- Build fast but still accurate kernel-based text analyzers.
- Make it possible to use them in a wider range of NL applications.

5 Outline
- Polynomial Kernel of degree d
- Fast Methods for the Polynomial Kernel
  - PKI
  - PKE
- Experiments
- Conclusions and Future Work

6 Outline
- Polynomial Kernel of degree d
- Fast Methods for the Polynomial Kernel
  - PKI
  - PKE
- Experiments
- Conclusions and Future Work

7 Kernel Methods
- No need to represent an example as an explicit feature vector.
- The complexity of testing is O(L · |X|), where L is the number of training examples.
[Figure: a classifier built from the training data via the kernel.]

8 Kernels for Sets (1/3)
- Focus on the special case where examples are represented as sets.
- Instances in NLP are usually represented as sets (e.g., bag-of-words).
- Feature set: F = {f_1, ..., f_N}; training examples: X_j ⊆ F.

9 Kernels for Sets (2/3)
- Use combinations (subsets) of features as the implicit feature space.
- A simple definition counts the feature subsets shared by two sets: a 2nd-order kernel counts shared pairs of features, a 3rd-order kernel shared triples.

10 Kernels for Sets (3/3)
Example: "I ate a cake" (PRP VBD DT NN); is the modifier "cake" dependent (+1) on the head "ate", or independent (-1)?
- Basic features: X = {Head-word: ate, Head-POS: VBD, Modifier-word: cake, Modifier-POS: NN}
- After heuristic selection of combinations: X = {Head-word: ate, Head-POS: VBD, Modifier-word: cake, Modifier-POS: NN, Head-POS/Modifier-POS: VBD/NN, Head-word/Modifier-POS: ate/NN, ...}
- Subsets (combinations) of basic features are critical to improving overall accuracy in many NL tasks.
- Previous approaches select such combinations heuristically.
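To see why heuristic selection scales poorly, here is a minimal sketch (the feature names come from the slide; the pairing scheme is only an illustration): adding just the pairwise combinations of the four basic features already more than doubles the feature set, and triples and larger combinations grow combinatorially.

```python
from itertools import combinations

# Basic features from the slide's dependency example.
basic = ["Head-word:ate", "Head-POS:VBD", "Modifier-word:cake", "Modifier-POS:NN"]

# Heuristic expansion: add every pair of basic features as a combined feature.
combined = basic + ["/".join(pair) for pair in combinations(basic, 2)]
print(len(basic), "->", len(combined))  # 4 -> 10
```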

11 Polynomial Kernel of degree d
- Implicit form: K(X, X') = (|X ∩ X'| + 1)^d
- Explicit form: K(X, X') = Σ_{s=0}^{d} c_d(s) · |P_s(X) ∩ P_s(X')|
- P_s(X) is the set of all subsets of X with exactly s elements.
- c_d(s) is the prior weight given to the subsets of size s (the subset weight).

12 Example (Cubic Kernel, d = 3)
- Implicit form: K(X, X') = (|X ∩ X'| + 1)^3
- Explicit form: K(X, X') = 1 + 7·|P_1(X) ∩ P_1(X')| + 12·|P_2(X) ∩ P_2(X')| + 6·|P_3(X) ∩ P_3(X')|
- Subsets of up to 3 elements are used as new features.
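As a sanity check of the explicit form, a short sketch (assuming only the lemma stated above) that solves for the subset weights c_d(s) and confirms the two forms agree; it uses the identity |P_s(X) ∩ P_s(X')| = C(|X ∩ X'|, s).

```python
from math import comb

def subset_weights(d):
    """Solve for c_d(s) such that sum_s c_d(s) * C(n, s) = (n + 1)^d for all n."""
    c = []
    for s in range(d + 1):
        c.append((s + 1) ** d - sum(c[t] * comb(s, t) for t in range(s)))
    return c

def implicit(X, Y, d):
    return (len(X & Y) + 1) ** d

def explicit(X, Y, d):
    # Shared s-subsets are exactly the s-subsets of X & Y, so count C(|X & Y|, s).
    c = subset_weights(d)
    return sum(c[s] * comb(len(X & Y), s) for s in range(d + 1))

X, Y = {"a", "c", "e"}, {"a", "b", "c"}
print(subset_weights(3))                     # [1, 7, 12, 6]
print(implicit(X, Y, 3), explicit(X, Y, 3))  # 27 27
```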

13 Outline
- Polynomial Kernel of degree d
- Fast Methods for the Polynomial Kernel
  - PKI
  - PKE
- Experiments
- Conclusions and Future Work

14 Toy Example
- Feature set: F = {a, b, c, d, e}
- Support vectors (L = 3): X_1 = {a, b, c} with α_1 = 1; X_2 = {a, b, d} with α_2 = 0.5; X_3 = {b, c, d} with α_3 = -2
- Test example: X = {a, c, e}
- Kernel: K(X, X') = (|X ∩ X'| + 1)^3

15 PKB (Baseline)
- K(X, X') = (|X ∩ X'| + 1)^3
- f(X) = Σ_j α_j K(X_j, X) = 1·(2+1)^3 + 0.5·(1+1)^3 - 2·(1+1)^3 = 27 + 4 - 16 = 15
- Complexity is always O(L · |X|).
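The baseline is a direct transcription of the formula above; a minimal sketch using the toy support vectors from slide 14:

```python
# Toy support vectors and their alpha weights (slide 14).
SVS = [({"a", "b", "c"}, 1.0), ({"a", "b", "d"}, 0.5), ({"b", "c", "d"}, -2.0)]

def pkb(X, d=3):
    """Baseline: evaluate the kernel against every support vector, O(L * |X|)."""
    return sum(alpha * (len(X & sv) + 1) ** d for sv, alpha in SVS)

print(pkb({"a", "c", "e"}))  # 15.0
```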

16 PKI (Inverted Representation)
- K(X, X') = (|X ∩ X'| + 1)^3
- Build an inverted index from each feature to the support vectors containing it: a → {1, 2}, b → {1, 2, 3}, c → {1, 3}, d → {2, 3}
- For the test example X = {a, c, e}, only the features of X are looked up to obtain the overlap counts, giving f(X) = 1·(2+1)^3 + 0.5·(1+1)^3 - 2·(1+1)^3 = 15 as before.
- Average complexity is O(B · |X| + L), where B is the average size of the inverted lists.
- Efficient if the feature space is sparse; suitable for many NL tasks.
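A sketch of the same computation through an inverted index; only the features that actually occur in the test example are touched:

```python
from collections import defaultdict

SVS = [({"a", "b", "c"}, 1.0), ({"a", "b", "d"}, 0.5), ({"b", "c", "d"}, -2.0)]

# Inverted index: feature -> indices of the support vectors containing it.
index = defaultdict(list)
for j, (sv, _) in enumerate(SVS):
    for feat in sorted(sv):
        index[feat].append(j)

def pki(X, d=3):
    """Accumulate overlap counts via the index: O(B * |X| + L) on average."""
    overlap = defaultdict(int)
    for feat in X:
        for j in index.get(feat, []):
            overlap[j] += 1
    return sum(alpha * (overlap[j] + 1) ** d
               for j, (_, alpha) in enumerate(SVS))

print(pki({"a", "c", "e"}))  # 15.0, identical to the baseline
```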

17 PKE (Expanded Representation)
- Convert the classifier into a linear form by calculating a weight vector w over feature subsets in advance.
- The expanded representation projects X into its subset space: f(X) = Σ_{s ⊆ X, |s| ≤ d} w(s).

18 PKE (Expanded Representation)
- K(X, X') = (|X ∩ X'| + 1)^3, with subset weights c_3(0) = 1, c_3(1) = 7, c_3(2) = 12, c_3(3) = 6
- Expansion table W: for every subset s over the support-vector features (φ, {a}, {b}, {c}, {d}, {a, b}, ..., {b, c, d}), w(s) = c_3(|s|) · Σ_{j: s ⊆ X_j} α_j; e.g., w({b, d}) = 12·(0.5 - 2) = -18
- The test example X = {a, c, e} expands to {φ, {a}, {c}, {e}, {a, c}, {a, e}, {c, e}, {a, c, e}}
- f(X) = w(φ) + w({a}) + w({c}) + w({a, c}) = -0.5 + 10.5 - 7 + 12 = 15
- Complexity is O(|X|^d), independent of the number of SVs (L); efficient when the number of SVs is large.
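A sketch of PKE on the toy problem: build the expansion table once, then classify by looking up the subsets of the test example.

```python
from itertools import combinations
from collections import defaultdict

SVS = [({"a", "b", "c"}, 1.0), ({"a", "b", "d"}, 0.5), ({"b", "c", "d"}, -2.0)]
C3 = [1, 7, 12, 6]  # subset weights c_3(s) for the cubic kernel

# Expansion table: w(s) = c_3(|s|) * sum of the alphas of the SVs containing s.
w = defaultdict(float)
for sv, alpha in SVS:
    for size in range(len(C3)):
        for sub in combinations(sorted(sv), size):
            w[sub] += C3[size] * alpha

def pke(X):
    """Classify by table lookup over the subsets of X, independent of L."""
    return sum(w[sub] for size in range(len(C3))
               for sub in combinations(sorted(X), size))

print(w[("b", "d")])         # -18.0, as computed on the slide
print(pke({"a", "c", "e"}))  # 15.0, the same value as PKB and PKI
```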

19 PKE in Practice
- It is hard to calculate the expansion table exactly, so an approximated expansion table is used.
- Subsets with smaller |w| can be removed, since |w| represents a subset's contribution to the final classification.
- A subset mining (a.k.a. basket mining) algorithm makes this calculation efficient.

20 Subset Mining Problem
- Transaction database (id → set): 1 → {a, c, d}, 2 → {a, b, c}, 3 → {a, b, d}, 4 → {b, c, e}
- Task: extract all subsets that occur in no fewer than σ sets of the transaction database (here σ = 2).
- Results: {a}: 3, {b}: 3, {c}: 3, {d}: 2, {a, b}: 2, {a, c}: 2, {a, d}: 2, {b, c}: 2
- With no size constraint the problem is NP-hard, but efficient algorithms have been proposed (e.g., Apriori, PrefixSpan).
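A brute-force version of the mining step on the slide's database; real systems use Apriori- or PrefixSpan-style algorithms, but this sketch keeps the key anti-monotonicity shortcut (if nothing of size k is frequent, nothing larger can be):

```python
from itertools import combinations

db = [{"a", "c", "d"}, {"a", "b", "c"}, {"a", "b", "d"}, {"b", "c", "e"}]
sigma = 2  # minimum support

items = sorted(set().union(*db))
frequent = {}
for size in range(1, len(items) + 1):
    found = False
    for cand in combinations(items, size):
        support = sum(1 for t in db if set(cand) <= t)
        if support >= sigma:
            frequent[cand] = support
            found = True
    if not found:
        break  # anti-monotonicity: nothing frequent at this size, stop

print(frequent)
# {('a',): 3, ('b',): 3, ('c',): 3, ('d',): 2,
#  ('a', 'b'): 2, ('a', 'c'): 2, ('a', 'd'): 2, ('b', 'c'): 2}
```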

21 Feature Selection as Mining
- Exhaustively generating all subsets and testing their weights is impractical.
- Instead, generate the approximated table W directly with subset mining; σ controls the rate of approximation.
- In the toy example, with a threshold of roughly σ = 10 only the subsets {a}, {d}, {a, b}, {a, c}, {b, c}, {b, d}, {c, d}, {b, c, d} survive pruning.
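On the toy problem the table is small enough to build exhaustively and then prune, which makes the effect of σ visible (in practice PKE mines the surviving subsets directly instead):

```python
from itertools import combinations
from collections import defaultdict

SVS = [({"a", "b", "c"}, 1.0), ({"a", "b", "d"}, 0.5), ({"b", "c", "d"}, -2.0)]
C3 = [1, 7, 12, 6]

# Exhaustive expansion table (feasible only at toy scale).
w = defaultdict(float)
for sv, alpha in SVS:
    for size in range(len(C3)):
        for sub in combinations(sorted(sv), size):
            w[sub] += C3[size] * alpha

sigma = 10.0  # pruning threshold on |w(s)|
w_approx = {s: v for s, v in w.items() if abs(v) >= sigma}

def score(table, X):
    return sum(table.get(sub, 0.0) for size in range(len(C3))
               for sub in combinations(sorted(X), size))

print(sorted(w_approx))  # the eight surviving subsets listed on the slide
print(score(w, {"a", "c", "e"}), score(w_approx, {"a", "c", "e"}))  # 15.0 22.5
```

The raw score changes (22.5 instead of 15.0), but its sign does not, and the sign is what decides the classification.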

22 Outline
- Polynomial Kernel of degree d
- Fast Methods for the Polynomial Kernel
  - PKI
  - PKE
- Experiments
- Conclusions and Future Work

23 Experimental Settings
- Three NL tasks: English Base-NP Chunking (EBC), Japanese Word Segmentation (JWS), Japanese Dependency Parsing (JDP)
- Kernel settings: a quadratic kernel (d = 2) for EBC; a cubic kernel (d = 3) for JWS and JDP

24 Results (English Base-NP Chunking)
[Table: time (sec./sent.), speed-up ratio, and F-score for PKB, PKI, and PKE at σ = .01, .005, .001, .0005; the numeric values did not survive in this transcript.]

25 Results (Japanese Word Segmentation)
[Table: time (sec./sent.), speed-up ratio, and accuracy (%) for PKB, PKI, and PKE at σ = .01, .005, .001, .0005; the numeric values did not survive in this transcript.]

26 Results (Japanese Dependency Parsing)
[Table: time (sec./sent.), speed-up ratio, and accuracy (%) for PKB, PKI, and PKE at σ = .01, .005, .001, .0005; the numeric values did not survive in this transcript.]

27 Results
- 2-12 fold speed-up with PKI.
- A substantially larger speed-up with PKE.
- Accuracy is preserved when an appropriate σ is set.

28 Comparison with Related Work
- XQK [Isozaki et al. 02]: the same concept as PKE, but designed only for the quadratic kernel, and it creates the expansion table exhaustively.
- PKE: designed for general polynomial kernels, and uses subset mining algorithms to create the expansion table.

29 Conclusions
- Proposed two fast methods for the polynomial kernel of degree d: PKI (inverted) and PKE (expanded).
- 2-12 fold speed-up with PKI, and a substantially larger speed-up with PKE.
- Accuracy is preserved.

30 Future Work
- Examine the effectiveness on general machine learning datasets.
- Apply PKE to other convolution kernels, e.g., the Tree Kernel [Collins 00]: a dot product between trees whose feature space is the set of all subtrees; apply a subtree mining algorithm [Zaki 02] to build the table.

31 English Base-NP Chunking
- Extract non-overlapping noun phrases from text: [NP He] reckons [NP the current account deficit] will narrow to [NP only # 1.8 billion] in [NP September].
- BIO representation (chunking seen as a tagging task): B = beginning of a chunk, I = non-initial token of a chunk, O = outside any chunk.
- The pairwise method is applied to the 3-class problem.
- Training: WSJ sections 15-18; test: WSJ section 20 (the standard set).
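For concreteness, a small sketch that converts the bracketed example into BIO tags (the input format here is simplified for illustration):

```python
def to_bio(bracketed):
    """Convert '[NP ... ]' bracketing into (token, tag) pairs with B/I/O tags."""
    pairs, in_np, begin = [], False, False
    for tok in bracketed.split():
        if tok == "[NP":
            in_np, begin = True, True
        elif tok == "]":
            in_np = False
        elif in_np:
            pairs.append((tok, "B" if begin else "I"))
            begin = False
        else:
            pairs.append((tok, "O"))
    return pairs

print(to_bio("[NP He ] reckons [NP the current account deficit ] will narrow"))
# [('He', 'B'), ('reckons', 'O'), ('the', 'B'), ('current', 'I'),
#  ('account', 'I'), ('deficit', 'I'), ('will', 'O'), ('narrow', 'O')]
```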

32 Japanese Word Segmentation
- Example sentence: 太郎は花子に本を読ませた ("Taro made Hanako read a book"); candidate boundaries lie between adjacent characters.
- Label a position +1 if there is a word boundary between characters c_i and c_{i+1}, -1 otherwise.
- Features distinguish the relative positions of the surrounding characters, and the Japanese character types are also used.
- Training: KUC 01-08; test: KUC 09.
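The slide does not give the exact feature templates, so the following is only a plausible sketch of the idea: for each candidate boundary, collect the surrounding characters with their relative positions plus a coarse character type (kanji, hiragana, katakana, other).

```python
import unicodedata

def char_type(ch):
    """Coarse Japanese character type from the Unicode name (an approximation)."""
    name = unicodedata.name(ch, "")
    if "HIRAGANA" in name:
        return "H"
    if "KATAKANA" in name:
        return "K"
    if "CJK" in name:
        return "C"  # kanji
    return "O"      # anything else (Latin, digits, punctuation, ...)

def boundary_features(sent, i, window=2):
    """Features for the candidate boundary between sent[i] and sent[i+1]."""
    feats = []
    for off in range(-window + 1, window + 1):
        pos = i + off
        if 0 <= pos < len(sent):
            feats.append(f"c{off}:{sent[pos]}")             # character, relative position
            feats.append(f"t{off}:{char_type(sent[pos])}")  # its character type
    return feats

# The boundary between '郎' and 'は' in the slide's example sentence.
print(boundary_features("太郎は花子に本を読ませた", 1))
```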

33 Japanese Dependency Parsing
- Example: 私は ケーキを 食べる (I-top cake-acc. eat: "I eat a cake").
- Identify the correct dependency relations between two bunsetsu (roughly, base phrases in English).
- Linguistic features are related to the modifier and the head (word, POS, POS-subcategory, inflections, punctuation, etc.).
- Binary classification: +1 dependent, -1 independent.
- Cascaded Chunking Model [Kudo et al. 02].
- Training: KUC 01-08; test: KUC 09.

34 Kernel Methods (1/2)
- Suppose a learning task with training examples X_1, ..., X_L.
- X: the example to be classified; X_i: the training examples; α_i: the weights of the examples; φ: a function mapping examples into another vector space.
- The classifier takes the form f(X) = Σ_{i=1}^{L} α_i · φ(X_i) · φ(X) = Σ_{i=1}^{L} α_i K(X_i, X).

35 PKE (Expanded Representation)
- If we calculate w(s) = c_d(|s|) · Σ_{i=1}^{L} α_i · I(s ⊆ X_i) in advance for all subsets s (I is the indicator function), then f(X) reduces to the linear form Σ_{s ⊆ X, |s| ≤ d} w(s).

36 TRIE Representation
- Store the approximated expansion table ({a}, {d}, {a, b}, {a, c}, {b, c}, {b, d}, {c, d}, {b, c, d}) in a TRIE keyed by features in sorted order, compressing redundant structures.
- Classification can be done by simply traversing the TRIE.
[Figure: the TRIE built from the table, with feature-labeled edges from the root.]
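A minimal sketch of trie-based classification, reusing the pruned table derived earlier (the slide shows only which subsets survive; the weights below were computed from the toy support vectors):

```python
# Pruned expansion table: subset (features in sorted order) -> weight.
w_approx = {("a",): 10.5, ("d",): -10.5, ("a", "b"): 18.0, ("a", "c"): 12.0,
            ("b", "c"): -12.0, ("b", "d"): -18.0, ("c", "d"): -24.0,
            ("b", "c", "d"): -12.0}

# Build the trie; each subset's weight lives at the node where the subset ends.
trie = {}
for subset, weight in w_approx.items():
    node = trie
    for feat in subset:
        node = node.setdefault(feat, {})
    node["$w"] = weight

def score(node, feats):
    """Sum the weights of every stored subset of feats by walking the trie."""
    total = node.get("$w", 0.0)
    for i, feat in enumerate(feats):
        if feat in node:
            total += score(node[feat], feats[i + 1:])
    return total

print(score(trie, sorted({"a", "c", "e"})))  # 22.5, matching the flat pruned table
```

Because shared prefixes are stored once, the trie both compresses the table and lets classification enumerate only those subsets of X that actually exist in it.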
