String Kernels on Slovenian documents Blaž Fortuna Dunja Mladenić Marko Grobelnik.


1 String Kernels on Slovenian documents Blaž Fortuna Dunja Mladenić Marko Grobelnik

2 Outline of the talk: Bag-of-words and String Kernel; Datasets; Experiments; Conclusions

3 Representation of text Vector-space model (bag-of-words): most commonly used. Each document is encoded as a feature vector with word frequencies as elements. IDF weighting, normalized. Similarity is the inner product (cosine similarity).
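The representation on this slide can be sketched in a few lines. This is a minimal illustration, assuming whitespace tokenization and a plain log(N/df) IDF; the function names `tfidf_vectors` and `cosine` are hypothetical and not taken from the talk.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one L2-normalized sparse TF-IDF vector (word -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]  # assumption: whitespace tokens
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each word.
    df = Counter(word for tokens in tokenized for word in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vec = {w: tf[w] * math.log(n_docs / df[w]) for w in tf}
        norm = math.sqrt(sum(v * v for v in vec.values()))
        vectors.append({w: v / norm for w, v in vec.items()} if norm > 0 else {})
    return vectors

def cosine(u, v):
    """Inner product of two normalized sparse vectors, i.e. their cosine similarity."""
    return sum(weight * v.get(word, 0.0) for word, weight in u.items())

docs = ["computer science", "computing science", "garden portal"]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]))  # only "science" is shared
print(cosine(vecs[0], vecs[2]))  # no shared words
```

Note that "computer" and "computing" count as unrelated features here; without stemming, only the exact word "science" contributes to the similarity.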

4 Idea behind String Kernels Words -> Substrings. Each document is encoded as a feature vector with substring frequencies as elements. More contiguous substrings receive higher weighting (through the decay parameter λ). Example with substrings of length 2 (Lodhi et al., 2002):

       ca    ar    cr    ba    br    ap    cp
car    λ^2   λ^2   λ^3   0     0     0     0
bar    0     λ^2   0     λ^2   λ^3   0     0
cap    λ^2   0     0     0     0     λ^2   λ^3
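The feature vectors for the three example words can be reproduced by brute-force enumeration. The sketch below assumes the weighting of Lodhi et al. (2002), where each occurrence of a substring contributes the decay parameter raised to the length of the span it covers; `subsequence_features` is a hypothetical helper name.

```python
from collections import defaultdict
from itertools import combinations

def subsequence_features(s, k, lam):
    """Explicit feature map: every length-k (possibly gapped) substring of s,
    weighted by lam ** (length of the span it covers in s)."""
    feats = defaultdict(float)
    for idx in combinations(range(len(s)), k):
        u = "".join(s[i] for i in idx)
        span = idx[-1] - idx[0] + 1  # contiguous occurrences have span == k
        feats[u] += lam ** span
    return dict(feats)

lam = 0.5  # illustrative value only
for word in ("car", "bar", "cap"):
    print(word, subsequence_features(word, 2, lam))
```

For "car" this gives weight λ^2 to the contiguous "ca" and "ar" and λ^3 to the gapped "cr". The enumeration grows combinatorially with string length, so it is only practical for toy inputs.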

5 String Kernel Explicit computation of the feature vectors from the previous slide is very expensive. An efficient dynamic programming algorithm exists that takes two strings as input and calculates the inner product between their feature vectors. This can be used as a kernel for SVM!
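A minimal sketch of such a dynamic program is given below. It follows the recursion published by Lodhi et al. (2002) as far as I can reconstruct it, computes the unnormalized kernel, and runs in O(n·|s|·|t|) time; the function name `ssk` and the variable names are illustrative, not the authors' implementation.

```python
def ssk(s, t, n, lam):
    """Unnormalized string subsequence kernel of order n with decay lam."""
    ls, lt = len(s), len(t)
    # kp[i][a][b] holds the auxiliary value K'_i(s[:a], t[:b]); K'_0 is 1 everywhere.
    kp = [[[0.0] * (lt + 1) for _ in range(ls + 1)] for _ in range(n)]
    for a in range(ls + 1):
        for b in range(lt + 1):
            kp[0][a][b] = 1.0
    for i in range(1, n):
        for a in range(i, ls + 1):
            kpp = 0.0  # running K''_i(s[:a], t[:b]) as b grows
            for b in range(i, lt + 1):
                if s[a - 1] == t[b - 1]:
                    kpp = lam * (kpp + lam * kp[i - 1][a - 1][b - 1])
                else:
                    kpp = lam * kpp
                kp[i][a][b] = lam * kp[i][a - 1][b] + kpp
    # Sum over all pairs of matching final characters.
    k = 0.0
    for a in range(n, ls + 1):
        for b in range(n, lt + 1):
            if s[a - 1] == t[b - 1]:
                k += lam * lam * kp[n - 1][a - 1][b - 1]
    return k

print(ssk("car", "bar", 2, 0.5))  # shared feature "ar": lam**4
print(ssk("car", "car", 2, 0.5))  # 2 * lam**4 + lam**6
```

In practice the kernel value is usually divided by sqrt(ssk(s, s) * ssk(t, t)) so that document length does not dominate; the slides do not say whether this normalization was applied.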

6 Advantage of String Kernel No need to stem or lemmatize words. Example: Computer, Computing, Microcomputer, Computational (all share the substring "comput", so they have features in common without any stemming). This should help on highly inflected languages like Slovenian or Croatian.

7 Disadvantage of string kernel compared to bag-of-words: Slower. Linear speed-up cannot be used for training the SVM. Features are not explicitly visible, so the model is harder to analyse.

8 Datasets (1/2) Mat’kurja: Slovenian internet directory. www.hr: Croatian internet directory. Each web site has a short description and is assigned to a topic from the hierarchy.
Web site: Vrtnar.com. Topic: Science/Biology. Description: Obnovljen mini vrtnarski portal s kratkimi informacijami. [Renewed mini gardening portal with short information.]
Web site: Elastik. Topic: Arts/Architecture. Description: Multidiciplinarna mreza arhitetkov, urbanistov in novomedijskih avtorjev med Amsterdamom in Ljubljano. [Multidisciplinary network of architects, urban planners and new-media authors between Amsterdam and Ljubljana.]

9 Datasets (2/2) M- categories are Slovenian, H- categories are Croatian. Unbalanced!

Category    Subcategory   Documents
M-Arts      Music         45 %
            Painting       7 %
            Theatre        4 %
M-Science   Schools       25 %
            Medicine      14 %
            Students      12 %
H-Arts      Music         66 %
            Painting      10 %
            Film           6 %

10 Experimental setting No pre-processing of documents. Documents for each domain were randomly split into a training part (30%) and a testing part (70%). Results were averaged over 5 different splits. Break Even Point as success measure. SVM cost parameter C = 1.0. String kernel decay parameter λ = 0.2 and length 5.

Category    train   test
M-Arts      1067    2490
M-Science   1214    2832
H-Arts       366     853
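The break-even point is the value at which precision equals recall. One common way to compute it, sketched below, is to rank the test documents by classifier score and evaluate the top P of them, where P is the number of positive examples; at that cut-off precision and recall coincide. The function name and the +1/-1 label convention are assumptions, and the slides do not say which variant was used.

```python
def break_even_point(scores, labels):
    """Precision (= recall) among the top-P ranked examples, P = number of positives.

    scores: classifier outputs, higher means more likely positive.
    labels: +1 for positive examples, -1 for negative examples.
    """
    n_pos = sum(1 for y in labels if y == 1)
    if n_pos == 0:
        raise ValueError("break-even point is undefined without positive examples")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    true_pos = sum(1 for _, y in ranked[:n_pos] if y == 1)
    return true_pos / n_pos

# Two positives; only one of them is ranked in the top two.
print(break_even_point([0.9, 0.8, 0.3, 0.1], [1, -1, 1, -1]))
```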

11 Experiments

Category    Subcategory   BoW [%]     SK [%]
M-Arts      Music         80 ± 1.9    88 ± 0.4
            Painting      22 ± 5.5    60 ± 2.6
            Theatre       24 ± 3.1    61 ± 6.6
M-Science   Schools       81 ± 3.8    78 ± 2.6
            Medicine      32 ± 1.9    75 ± 2.0
            Students      30 ± 4.0    59 ± 1.1
H-Arts      Music         76 ± 3.7    82 ± 1.3
            Painting      36 ± 9.1    70 ± 2.6
            Film          17 ± 9.2    82 ± 2.7

12 Unbalanced datasets (1/3) Higher difference on unbalanced categories!

13 Unbalanced datasets (2/3) We tried SVM with different cost parameters for positive and for negative examples (parameter j). Results for bag-of-words increase. No significant difference for string kernel.
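The effect of parameter j can be illustrated on the error term of the SVM objective. The sketch below assumes j acts as a cost factor in the style of SVMlight's `-j` option, multiplying the penalty for errors on positive examples by j while negatives keep weight 1; the slides do not name the SVM implementation, and `weighted_hinge_loss` is a hypothetical helper.

```python
def weighted_hinge_loss(decision_values, labels, C=1.0, j=1.0):
    """Error term of a soft-margin SVM with cost factor j on positive examples.

    decision_values: w.x + b for each example.
    labels: +1 or -1.
    """
    total = 0.0
    for value, y in zip(decision_values, labels):
        slack = max(0.0, 1.0 - y * value)  # hinge loss of this example
        weight = j if y == 1 else 1.0      # positives cost j times more
        total += C * weight * slack
    return total

values, labels = [0.5, -0.5], [1, -1]
print(weighted_hinge_loss(values, labels, j=1.0))  # both errors weighted equally
print(weighted_hinge_loss(values, labels, j=5.0))  # error on the positive dominates
```

With j > 1 the optimizer is pushed to classify the rare positive examples correctly, which is why it matters most on the small subcategories.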

14 Unbalanced datasets (3/3) Variation of parameter j on bag-of-words. Bag-of-words with j = 5.0 compared to String Kernels with j = 1.0.

15 Conclusions String kernel significantly outperforms bag-of-words on highly inflected natural languages. The difference is higher on categories with a small number of positive examples. SVM support for unbalanced data helps bag-of-words, but its performance is still lower than that of the string kernel.

16 Questions?

