
1 Fast Methods for Kernel-based Text Analysis
Taku Kudo (工藤拓), Yuji Matsumoto (松本裕治)
NAIST (Nara Institute of Science and Technology)
41st Annual Meeting of the Association for Computational Linguistics, Sapporo, Japan

2 Background
- Kernel methods (e.g., SVMs) have become popular
- Prior knowledge can be incorporated independently of the machine learning algorithm by supplying a task-dependent kernel (a generalized dot product)
- High accuracy

3 Problem
- Kernel-based text analyzers are too slow for real NL applications (e.g., QA or text mining) because they are inefficient at test time
- Some kernel-based parsers take 2-3 seconds per sentence

4 Goals
- Build fast but still accurate kernel-based text analyzers
- Make it possible to use them in a wider range of NL applications

5 Outline
- Polynomial Kernel of degree d
- Fast Methods for Polynomial Kernels
  - PKI
  - PKE
- Experiments
- Conclusions and Future Work


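7 Kernel Methods
- Decision function over the training data $X_1, \dots, X_L$: $f(X) = \sum_{i=1}^{L} \alpha_i K(X_i, X)$
- No need to represent an example as an explicit feature vector
- Complexity of testing is $O(L \cdot |X|)$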

8 Kernels for Sets (1/3)
- Focus on the special case where examples are represented as sets
- Instances in NLP are usually represented as sets (e.g., bag-of-words)
- Feature set: $F$
- Training data: $X_1, \dots, X_L$, where each $X_j \subseteq F$

9 Kernels for Sets (2/3)
- Combinations (subsets) of features
- Simple definition:
- 2nd order:
- 3rd order:

10 Kernels for Sets (3/3)
- Example: "I ate a cake" (I/PRP ate/VBD a/DT cake/NN). Is the pair head = "ate", modifier = "cake" dependent (+1) or independent (-1)?
- Basic features: X = {Head-word: ate, Head-POS: VBD, Modifier-word: cake, Modifier-POS: NN}
- After heuristic selection: X = {Head-word: ate, Head-POS: VBD, Modifier-word: cake, Modifier-POS: NN, Head-POS/Modifier-POS: VBD/NN, Head-word/Modifier-POS: ate/NN, …}
- Subsets (combinations) of basic features are critical for improving overall accuracy in many NL tasks
- Previous approaches select combinations heuristically

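11 Polynomial Kernel of degree d
- Implicit form: $K_d(X, X') = (|X \cap X'| + 1)^d$
- Explicit form: $K_d(X, X') = \sum_{r=0}^{d} c_d(r) \cdot |P_r(X \cap X')|$
- $P_r(X)$ is the set of all subsets of $X$ with exactly $r$ elements
- $c_d(r)$ is the prior weight given to subsets of size $r$ (the subset weight)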

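12 Example (Cubic Kernel, d=3)
- Implicit form: $K_3(X, X') = (|X \cap X'| + 1)^3$
- Explicit form: $K_3(X, X') = 1 + 7\,|P_1(X \cap X')| + 12\,|P_2(X \cap X')| + 6\,|P_3(X \cap X')|$
- Subsets of up to 3 features are used as new features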
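The equivalence of the implicit and explicit forms can be checked numerically. The sketch below is not from the slides. It assumes a closed form for the subset weight c_d(r) that the transcript does not state: an inclusion-exclusion count of the ways the d factors of (1 + n)^d can cover exactly r chosen features. The names `subset_weight`, `implicit_kernel` and `explicit_kernel` are hypothetical. For d = 3 the assumed formula reproduces the weights 1, 7, 12, 6 quoted on slide 18.

```python
from math import comb


def subset_weight(d, r):
    """Assumed closed form for c_d(r): choose l of the d factors to carry a
    feature, then count surjections of those l factors onto r features."""
    return sum(
        comb(d, l)
        * sum((-1) ** (r - m) * m ** l * comb(r, m) for m in range(r + 1))
        for l in range(r, d + 1)
    )


def implicit_kernel(x, y, d):
    """K_d(X, X') = (|X ∩ X'| + 1)^d."""
    return (len(x & y) + 1) ** d


def explicit_kernel(x, y, d):
    """Sum over r of c_d(r) * |P_r(X ∩ X')|; |P_r(S)| equals C(|S|, r)."""
    n = len(x & y)
    return sum(subset_weight(d, r) * comb(n, r) for r in range(d + 1))


cubic_weights = [subset_weight(3, r) for r in range(4)]
x, y = {"a", "b", "c", "d"}, {"b", "c", "d", "e"}
agree = implicit_kernel(x, y, 3) == explicit_kernel(x, y, 3)
```

Because |P_r(S)| depends only on the size of the intersection, the explicit form never has to enumerate subsets to evaluate the kernel itself; enumeration only matters once the weights are pushed into a table, as PKE does.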

14 Toy Example
- Feature set: F = {a, b, c, d, e}
- Examples (#SVs L = 3):

| j | X_j | α_j |
|---|---|---|
| 1 | {a, b, c} | 1 |
| 2 | {a, b, d} | 0.5 |
| 3 | {b, c, d} | -2 |

- Kernel: $K(X, X') = (|X \cap X'| + 1)^3$
- Test example: X = {a, c, e}

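15 PKB (Baseline)
- Support vectors: X_1 = {a, b, c} (α_1 = 1), X_2 = {a, b, d} (α_2 = 0.5), X_3 = {b, c, d} (α_3 = -2)
- Kernel: $K(X, X') = (|X \cap X'| + 1)^3$
- Test example: X = {a, c, e}
- $f(X) = \sum_j \alpha_j K(X_j, X) = 1 \cdot (2+1)^3 + 0.5 \cdot (1+1)^3 - 2 \cdot (1+1)^3 = 15$
- Complexity is always $O(L \cdot |X|)$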
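A minimal sketch of the baseline computation on the toy example, assuming the bias term is omitted as on the slide. The function name `pkb_classify` and the data layout (a list of feature-set/weight pairs) are illustrative choices, not taken from the original.

```python
# Toy support vectors from the slide: (feature set, alpha).
SUPPORT_VECTORS = [
    ({"a", "b", "c"}, 1.0),
    ({"a", "b", "d"}, 0.5),
    ({"b", "c", "d"}, -2.0),
]


def pkb_classify(x, support_vectors, d=3):
    """Evaluate the kernel against every support vector, one by one."""
    return sum(
        alpha * (len(x & features) + 1) ** d
        for features, alpha in support_vectors
    )


score = pkb_classify({"a", "c", "e"}, SUPPORT_VECTORS)  # 27 + 4 - 16 = 15.0
```

Every support vector is visited for every test example, which is why the cost grows with L regardless of how sparse the data is.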

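16 PKI (Inverted Representation)
- Support vectors: X_1 = {a, b, c} (α_1 = 1), X_2 = {a, b, d} (α_2 = 0.5), X_3 = {b, c, d} (α_3 = -2)
- Kernel: $K(X, X') = (|X \cap X'| + 1)^3$
- Inverted index (feature → ids of the support vectors that contain it): a → {1, 2}, b → {1, 2, 3}, c → {1, 3}, d → {2, 3}
- Test example: X = {a, c, e}
- $f(X) = 1 \cdot (2+1)^3 + 0.5 \cdot (1+1)^3 - 2 \cdot (1+1)^3 = 15$
- Average complexity is $O(B \cdot |X| + L)$, where B is the average size of an inverted-index entry
- Efficient if the feature space is sparse
- Suitable for many NL tasks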
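The inverted representation can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: `build_inverted_index` and `pki_classify` are hypothetical names, and support vectors are numbered from 0 here rather than from 1 as on the slide.

```python
from collections import defaultdict

SUPPORT_VECTORS = [
    ({"a", "b", "c"}, 1.0),
    ({"a", "b", "d"}, 0.5),
    ({"b", "c", "d"}, -2.0),
]


def build_inverted_index(support_vectors):
    """Map each feature to the ids of the support vectors that contain it."""
    index = defaultdict(list)
    for j, (features, _alpha) in enumerate(support_vectors):
        for feature in features:
            index[feature].append(j)
    return index


def pki_classify(x, support_vectors, index, d=3):
    """Count |X ∩ X_j| through the index, then apply the kernel once per SV."""
    overlap = [0] * len(support_vectors)
    for feature in x:
        for j in index.get(feature, ()):  # unseen features cost nothing
            overlap[j] += 1
    return sum(
        alpha * (n + 1) ** d
        for (_features, alpha), n in zip(support_vectors, overlap)
    )


index = build_inverted_index(SUPPORT_VECTORS)
score = pki_classify({"a", "c", "e"}, SUPPORT_VECTORS, index)
```

The intersection sizes are accumulated only for support vectors that share a feature with the test example, which is where the B · |X| term comes from; the final sum over all support vectors accounts for the + L term.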

17 PKE (Expanded Representation)
- Convert the decision function into linear form by calculating the weight vector w in advance
- The feature mapping projects X into the space of its subsets

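18 PKE (Expanded Representation)
- Kernel: $K(X, X') = (|X \cap X'| + 1)^3$, with subset weights $c_3(0) = 1$, $c_3(1) = 7$, $c_3(2) = 12$, $c_3(3) = 6$
- Support vectors: X_1 = {a, b, c} (α_1 = 1), X_2 = {a, b, d} (α_2 = 0.5), X_3 = {b, c, d} (α_3 = -2)
- Expansion table W:

| Subset | c | w |
|---|---|---|
| φ | 1 | -0.5 |
| {a} | 7 | 10.5 |
| {b} | 7 | -3.5 |
| {c} | 7 | -7 |
| {d} | 7 | -10.5 |
| {a,b} | 12 | 18 |
| {a,c} | 12 | 12 |
| {a,d} | 12 | 6 |
| {b,c} | 12 | -12 |
| {b,d} | 12 | -18 |
| {c,d} | 12 | -24 |
| {a,b,c} | 6 | 6 |
| {a,b,d} | 6 | 3 |
| {a,c,d} | 6 | 0 |
| {b,c,d} | 6 | -12 |

- Example entry: w({b,d}) = 12 · (0.5 - 2) = -18
- Test example: X = {a, c, e}, whose subsets are {φ, {a}, {c}, {e}, {a,c}, {a,e}, {c,e}, {a,c,e}}
- f(X) = -0.5 + 10.5 - 7 + 12 = 15
- Complexity is $O(|X|^d)$, independent of the number of SVs (L)
- Efficient if the number of SVs is large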
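The expansion table for the toy example can be rebuilt with a short sketch. It generates the table exhaustively, which is only feasible because the example is tiny; the names `subsets_up_to`, `build_expansion_table` and `pke_classify` are hypothetical, and the cubic weights are hard-coded from the slide.

```python
from itertools import combinations

C3 = {0: 1, 1: 7, 2: 12, 3: 6}  # subset weights c_3(r) quoted on the slide

SUPPORT_VECTORS = [
    ({"a", "b", "c"}, 1.0),
    ({"a", "b", "d"}, 0.5),
    ({"b", "c", "d"}, -2.0),
]


def subsets_up_to(features, d):
    """Yield every subset of `features` with at most d elements."""
    items = sorted(features)
    for r in range(d + 1):
        for combo in combinations(items, r):
            yield frozenset(combo)


def build_expansion_table(support_vectors, weights, d=3):
    """w(s) = c_d(|s|) * (sum of alpha_j over support vectors containing s)."""
    table = {}
    for features, alpha in support_vectors:
        for s in subsets_up_to(features, d):
            table[s] = table.get(s, 0.0) + alpha * weights[len(s)]
    return table


def pke_classify(x, table, d=3):
    """Sum the stored weights of all subsets of x; no support vector is touched."""
    return sum(table.get(s, 0.0) for s in subsets_up_to(x, d))


table = build_expansion_table(SUPPORT_VECTORS, C3)
score = pke_classify({"a", "c", "e"}, table)  # -0.5 + 10.5 - 7 + 12 = 15.0
```

The subset {a,c,d} does not occur in any support vector, so this sketch simply leaves it out of the dictionary; that corresponds to the weight 0 shown in the table.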

19 PKE in Practice
- Hard to calculate the Expansion Table exactly
- Use an Approximated Expansion Table
  - Subsets with smaller |w| can be removed, since |w| represents the contribution to the final classification
  - Use a subset mining (a.k.a. basket mining) algorithm for efficient calculation

20 Subset Mining Problem
- Transaction database:

| id | set |
|---|---|
| 1 | {a c d} |
| 2 | {a b c} |
| 3 | {a b d} |
| 4 | {b c e} |

- Results: {a}:3, {b}:3, {c}:3, {d}:2, {a b}:2, {b c}:2, {a c}:2, {a d}:2
- Extract all subsets that occur in no fewer than a given number of sets of the transaction database (the results shown correspond to a threshold of 2)
- With no size constraints, the problem is NP-hard
- Efficient algorithms have been proposed (e.g., Apriori, PrefixSpan)
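The mining problem on this slide can be illustrated with a small level-wise search in the style of Apriori. This is a simplified sketch for the four-transaction example, not a tuned implementation of Apriori or PrefixSpan; the function name `frequent_subsets` and the threshold of 2 are assumptions chosen to match the results listed above.

```python
from itertools import combinations

TRANSACTIONS = [
    {"a", "c", "d"},
    {"a", "b", "c"},
    {"a", "b", "d"},
    {"b", "c", "e"},
]


def frequent_subsets(transactions, min_support):
    """Return {subset: count} for every subset found in >= min_support sets."""
    items = sorted({item for t in transactions for item in t})
    current = [frozenset([item]) for item in items]
    result = {}
    k = 1
    while current:
        # Count how many transactions contain each candidate of size k.
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        frequent = {c: n for c, n in counts.items() if n >= min_support}
        result.update(frequent)
        k += 1
        # Join step: merge frequent sets whose union has exactly k items.
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
        # Prune step: keep a candidate only if all its (k-1)-subsets are frequent.
        current = [
            c for c in candidates
            if all(frozenset(s) in frequent for s in combinations(c, k - 1))
        ]
    return result


found = frequent_subsets(TRANSACTIONS, min_support=2)
```

The prune step is what keeps the search practical: a subset cannot be frequent unless all of its smaller subsets are, so most of the exponential candidate space is never counted.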

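21 Feature Selection as Mining
- Can efficiently build the approximated table
- σ controls the rate of approximation
- Support vectors: X_1 = {a, b, c} (α_1 = 1), X_2 = {a, b, d} (α_2 = 0.5), X_3 = {b, c, d} (α_3 = -2)
- Exhaustive generation and testing (impractical): build the full table, then test each entry against σ = 10
  - Full table: φ: -0.5, {a}: 10.5, {b}: -3.5, {c}: -7, {d}: -10.5, {a,b}: 18, {a,c}: 12, {a,d}: 6, {b,c}: -12, {b,d}: -18, {c,d}: -24, {a,b,c}: 6, {a,b,d}: 3, {a,c,d}: 0, {b,c,d}: -12
- Direct generation with subset mining produces the approximated table W:

| s | w |
|---|---|
| {a} | 10.5 |
| {d} | -10.5 |
| {a,b} | 18 |
| {a,c} | 12 |
| {b,c} | -12 |
| {b,d} | -18 |
| {c,d} | -24 |
| {b,c,d} | -12 |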
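The effect of σ can be illustrated on the toy table. The sketch below filters an exhaustively built table, which is the route the slide calls impractical at scale; it is meant only to show what the approximation keeps and what it costs. It assumes entries are kept when |w| ≥ σ, copies the exact weights from slide 18, and uses hypothetical names (`prune`, `classify`) with subsets encoded as sorted strings.

```python
from itertools import combinations

# Exact expansion table from slide 18; keys are subsets written as sorted strings.
EXACT_TABLE = {
    "": -0.5, "a": 10.5, "b": -3.5, "c": -7.0, "d": -10.5,
    "ab": 18.0, "ac": 12.0, "ad": 6.0, "bc": -12.0, "bd": -18.0, "cd": -24.0,
    "abc": 6.0, "abd": 3.0, "acd": 0.0, "bcd": -12.0,
}


def prune(table, sigma):
    """Keep only the subsets whose weight magnitude reaches sigma (assumed rule)."""
    return {s: w for s, w in table.items() if abs(w) >= sigma}


def classify(x, table, d=3):
    """Sum the weights of every subset of x (up to size d) present in the table."""
    items = sorted(x)
    return sum(
        table.get("".join(combo), 0.0)
        for r in range(d + 1)
        for combo in combinations(items, r)
    )


approx_table = prune(EXACT_TABLE, sigma=10)
exact_score = classify({"a", "c", "e"}, EXACT_TABLE)    # 15.0
approx_score = classify({"a", "c", "e"}, approx_table)  # 10.5 + 12 = 22.5
```

On this test example the pruned table changes the score from 15 to 22.5 but not its sign, which is the sense in which a suitable σ trades table size against fidelity.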

23 Experimental Settings
- Three NL tasks
  - English Base-NP Chunking (EBC)
  - Japanese Word Segmentation (JWS)
  - Japanese Dependency Parsing (JDP)
- Kernel settings
  - Quadratic kernel is applied to EBC
  - Cubic kernel is applied to JWS and JDP

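24 Results (English Base-NP Chunking)

| Method | Time (sec./sent.) | Speedup ratio | F-score |
|---|---|---|---|
| PKB | 0.164 | 1.0 | 93.84 |
| PKI | 0.020 | 8.3 | 93.84 |
| PKE (σ=0.01) | 0.0016 | 105.2 | 93.79 |
| PKE (σ=0.005) | 0.0016 | 101.3 | 93.85 |
| PKE (σ=0.001) | 0.0017 | 97.7 | 93.84 |
| PKE (σ=0.0005) | 0.0017 | 96.8 | 93.84 |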

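25 Results (Japanese Word Segmentation)

| Method | Time (sec./sent.) | Speedup ratio | Accuracy (%) |
|---|---|---|---|
| PKB | 0.85 | 1.0 | 97.94 |
| PKI | 0.49 | 1.7 | 97.94 |
| PKE (σ=0.01) | 0.0024 | 358.2 | 97.93 |
| PKE (σ=0.005) | 0.0028 | 300.1 | 97.95 |
| PKE (σ=0.001) | 0.0034 | 242.6 | 97.94 |
| PKE (σ=0.0005) | 0.0035 | 238.8 | 97.94 |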

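26 Results (Japanese Dependency Parsing)

| Method | Time (sec./sent.) | Speedup ratio | Accuracy (%) |
|---|---|---|---|
| PKB | 0.285 | 1.0 | 89.29 |
| PKI | 0.0226 | 12.6 | 89.29 |
| PKE (σ=0.01) | 0.0042 | 66.8 | 88.91 |
| PKE (σ=0.005) | 0.0060 | 47.8 | 89.05 |
| PKE (σ=0.001) | 0.0086 | 33.3 | 89.26 |
| PKE (σ=0.0005) | 0.0090 | 31.8 | 89.29 |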

27 Results
- 2-12 fold speed-up with PKI
- 30-300 fold speed-up with PKE
- Accuracy is preserved when an appropriate σ is set

28 Comparison with related work
- XQK [Isozaki et al. 02]
  - Same concept as PKE
  - Designed only for the Quadratic Kernel
  - Exhaustively creates the expansion table
- PKE
  - Designed for general Polynomial Kernels
  - Uses subset mining algorithms to create the expansion table

29 Conclusions
- Proposed two fast methods for the polynomial kernel of degree d
  - PKI (Inverted)
  - PKE (Expanded)
- 2-12 fold speed-up with PKI, 30-300 fold speed-up with PKE
- Accuracy is preserved

30 Future Work
- Examine the effectiveness on general machine learning datasets
- Apply PKE to other convolution kernels
  - Tree Kernel [Collins 00]
    - Dot product between trees
    - Feature space is all sub-trees
    - Apply a sub-tree mining algorithm [Zaki 02]

31 English Base-NP Chunking
- Extract non-overlapping noun phrases from text
  - [NP He ] reckons [NP the current account deficit ] will narrow to [NP only # 1.8 billion ] in [NP September ] .
- BIO representation (treating chunking as a tagging task)
  - B: beginning of a chunk
  - I: non-initial word of a chunk
  - O: outside any chunk
- Pair-wise method for the 3-class problem
- Training: wsj15-18, test: wsj20 (standard set)
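The BIO encoding described above can be sketched on the slide's example sentence. The input layout (a list of token groups flagged as NP or not) and the function name `to_bio` are assumptions made for this illustration.

```python
# Example sentence from the slide, split into NP chunks and non-chunk spans.
SEGMENTS = [
    (["He"], True),
    (["reckons"], False),
    (["the", "current", "account", "deficit"], True),
    (["will", "narrow", "to"], False),
    (["only", "#", "1.8", "billion"], True),
    (["in"], False),
    (["September"], True),
    (["."], False),
]


def to_bio(segments):
    """Tag each token B (chunk start), I (inside a chunk) or O (outside)."""
    tagged = []
    for tokens, is_np in segments:
        for position, token in enumerate(tokens):
            if not is_np:
                tag = "O"
            elif position == 0:
                tag = "B"
            else:
                tag = "I"
            tagged.append((token, tag))
    return tagged


tags = [tag for _token, tag in to_bio(SEGMENTS)]
```

Once every token carries one of three tags, chunking becomes a per-token 3-class problem, which the pair-wise method then breaks into binary classifications.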

32 Japanese Word Segmentation
- Sentence: 太郎は花子に本を読ませた (Taro made Hanako read a book)
- Boundaries: marked with ↑ between characters (8 in this sentence)
- Classify whether or not there is a boundary between two adjacent characters
- Distinguish the relative position
- Also use the Japanese character types
- Training: KUC 01-08, Test: KUC 09

33 Japanese Dependency Parsing
- Example: 私はケーキを食べる (I-top cake-acc. eat; "I eat a cake")
- Identify the correct dependency relations between two bunsetsu (roughly, base phrases in English)
- Linguistic features related to the modifier and head (word, POS, POS-subcat, inflections, punctuation, etc.)
- Binary classification (+1 dependent, -1 independent)
- Cascaded Chunking Model [Kudo et al. 02]
- Training: KUC 01-08, Test: KUC 09

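34 Kernel Methods (1/2)
- Suppose a learning task with $L$ training examples
- $X$: example to be classified
- $X_i$: training examples
- $\alpha_i$: weight for examples
- $\phi$: a function to map examples to another vectorial space
- Decision function: $f(X) = \sum_{i=1}^{L} \alpha_i \, \phi(X_i) \cdot \phi(X) = \sum_{i=1}^{L} \alpha_i K(X_i, X)$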

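35 PKE (Expanded Representation)
- If we calculate $w(s) = \sum_{j=1}^{L} \alpha_j \, c_d(|s|) \, I(s \subseteq X_j)$ in advance for all subsets $s$ ($I$ is the indicator function), then the decision function becomes $f(X) = \sum_{s \subseteq X,\ |s| \le d} w(s)$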

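36 TRIE representation
- Approximated table w: {a}: 10.5, {d}: -10.5, {a,b}: 18, {a,c}: 12, {b,c}: -12, {b,d}: -18, {c,d}: -24, {b,c,d}: -12
- The same table stored as a TRIE (weights in parentheses):
  - root
    - a (10.5)
      - b (18)
      - c (12)
    - b
      - c (-12)
        - d (-12)
      - d (-18)
    - c
      - d (-24)
    - d (-10.5)
- Compress redundant structures
- Classification can be done by simply traversing the TRIE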
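Classification by traversal can be sketched with nested dictionaries standing in for the TRIE. This is an illustrative layout rather than the compressed structure the authors used; `build_trie` and `trie_classify` are hypothetical names, subsets are encoded as sorted strings, and the weight 18 for {a,b} is taken from the exact table on slide 18.

```python
# Approximated table: subsets as sorted strings mapped to their weights.
APPROX_TABLE = {
    "a": 10.5, "d": -10.5, "ab": 18.0, "ac": 12.0,
    "bc": -12.0, "bd": -18.0, "cd": -24.0, "bcd": -12.0,
}


def build_trie(table):
    """Insert each subset along its sorted features; shared prefixes share nodes."""
    root = {"w": 0.0, "children": {}}
    for subset, weight in table.items():
        node = root
        for feature in sorted(subset):
            node = node["children"].setdefault(feature, {"w": 0.0, "children": {}})
        node["w"] = weight
    return root


def trie_classify(x, root, d=3):
    """Walk only the branches whose features all occur in x, summing weights."""
    items = sorted(x)

    def walk(node, start, depth):
        total = node["w"]
        if depth < d:
            for i in range(start, len(items)):
                child = node["children"].get(items[i])
                if child is not None:
                    total += walk(child, i + 1, depth + 1)
        return total

    return walk(root, 0, 0)


trie = build_trie(APPROX_TABLE)
score = trie_classify({"a", "c", "e"}, trie)  # {a} + {a,c} = 22.5
```

Branches that do not exist in the TRIE are abandoned immediately, so subsets of the test example that were pruned from the table are never enumerated.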

