Automatic Text Processing: Cross-Lingual Text Categorization
Dipartimento di Ingegneria dell’Informazione, Università degli Studi di Siena
Dottorato di Ricerca in Ingegneria dell’Informazione, XVII cycle
Candidate: Leonardo Rigutini
Advisor: Prof. Marco Maggini

Artificial Intelligence Research Group of Siena Leonardo Rigutini – Dipartimento Ingegneria dell’Informazione
Outline
− Introduction to Cross-Lingual Text Categorization
  ▪ Relationships with Cross-Lingual Information Retrieval
  ▪ Possible approaches
− Text Categorization
  ▪ Multinomial Naive Bayes models
  ▪ Distance distribution and term filtering
  ▪ Learning with labeled and unlabeled data
− The algorithm
  ▪ The basic solution
  ▪ The modified algorithm
− Experimental results and conclusions

Cross-Lingual Text Categorization
− The problem has arisen in recent years because of the large number of documents available in many different languages
− Many companies would like to categorize new documents according to their existing class structure without building a separate text management system for each language
− CLTC is closely related to Cross-Lingual Information Retrieval (CLIR):
  ▪ Many works in the literature deal with CLIR
  ▪ Very little work has been done on CLTC

Cross-Lingual Information Retrieval
a) Poly-Lingual:
  ▪ The data are composed of documents in different languages
  ▪ The dictionary contains terms from the dictionaries of the different languages
  ▪ A large learning set containing enough documents for each language is needed
  ▪ A single classifier is trained
b) Cross-Lingual:
  ▪ The language of each document is identified and the document is translated into a different language
  ▪ A new classifier is trained for each language

a) Poly-Lingual
− Drawbacks:
  ▪ Requires many documents in the learning set for each language
  ▪ High dimensionality of the dictionary:
    ▪ n vocabularies
    ▪ Many terms shared between two languages
  ▪ Difficult feature selection due to the coexistence of many different languages
− Advantages:
  ▪ Conceptually simple method
  ▪ A single classifier is used
  ▪ Quite good performance

b) Cross-Lingual
− Drawbacks:
  ▪ Use of a translation step:
    ▪ Very low performance
    ▪ Named Entity Recognition (NER)
    ▪ Time consuming
  ▪ In some approaches experts for each language are needed
− Advantages:
  ▪ It does not need experts for each language
− Three different approaches:
  1. Training set translation
  2. Test set translation
  3. “Esperanto”

1. Training set translation
− The classifier is trained with documents in language L_2 translated from the L_1 learning set:
  ▪ L_2 is the language of the unlabeled data
  ▪ The learning set is highly noisy and the classifier may perform poorly
− The system works on the documents in language L_2:
  ▪ Fewer translations than in the test set translation approach
− Rarely used in CLIR

2. Test set translation
− The model is trained on documents in language L_1 without translation:
  ▪ The training data are not corrupted by translation noise
− The unlabeled documents in language L_2 are translated into language L_1:
  ▪ The translation step is highly time consuming
  ▪ It performs very poorly and introduces much noise
  ▪ A filtering phase on the test data is needed after the translation
− The translated documents are categorized by the classifier trained in language L_1:
  ▪ Possible inconsistency between training and unlabeled data

3. “Esperanto”
− All documents in every language are translated into a new universal language, “Esperanto” (L_E):
  ▪ The new language should preserve all the semantic features of each language
  ▪ Very difficult to design
  ▪ A large amount of knowledge about each language is needed
− The system works in this new universal language:
  ▪ It needs the translation of both the training set and the test set
  ▪ Very time consuming
− Rarely used in CLIR

From CLIR to CLTC
Following the CLIR approaches:
a) Poly-Lingual approach:
  ▪ n mono-lingual text categorization problems, one for each language
  ▪ It requires a labeled set for each language: experts must label the documents in each language
b) Cross-Lingual:
  1. Test set translation:
    ▪ It requires the translation of the test set
    ▪ Time consuming
  2. Esperanto:
    ▪ It is very time consuming and requires a large amount of knowledge about each language
  3. Training set translation:
    ▪ No proposals using this technique

CLTC problem formulation
− “Given a predefined category organization for documents in language L_1, the task is to classify documents in language L_2 according to that organization without manually labeling the data in L_2, since that requires experts in that language and is expensive.”
− The Poly-Lingual approach is not usable in this case, since it requires a learning set in the unknown language L_2
− The “Esperanto” approach is not possible either, since it needs knowledge about all the languages
− Only the training set translation and test set translation approaches can be used for this type of problem

Naive Bayes classifier
− Two of the most successful techniques for text categorization:
  ▪ Naive Bayes
  ▪ SVM
− Naive Bayes:
  ▪ A document d_i is assigned to the class C_j that maximizes the posterior probability P(C_r | d_i) over all classes C_r
  ▪ Using Bayes' rule, this probability can be expressed in terms of the likelihood P(d_i | C_r), the prior P(C_r) and the evidence P(d_i)
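
The formulas of this slide are not reproduced in the transcript. In standard notation, and assuming C_r denotes a generic class, the decision rule and Bayes' rule described above would read roughly as follows:

```latex
C_j = \arg\max_{C_r} P(C_r \mid d_i),
\qquad
P(C_r \mid d_i) = \frac{P(d_i \mid C_r)\, P(C_r)}{P(d_i)}
```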

Multinomial Naive Bayes
− Since P(d_i) is a common factor for all classes, it can be neglected
− P(C_r) can be easily estimated from the distribution of the documents in the training set; otherwise it can be considered constant
− The naive assumption is that the presence of each word in a document is an independent event that does not depend on the other words. It allows the likelihood P(d_i | C_r) to be written as the product of the word probabilities P(w_t | C_r), each raised to the power N(w_t, d_i), where N(w_t, d_i) is the number of occurrences of word w_t in document d_i

Multinomial Naive Bayes
− Assuming that each document is drawn from a multinomial distribution of words, the probability of w_t in class C_r can be estimated as the number of occurrences of w_t in the training documents of C_r divided by the total number of word occurrences in those documents
− This method is very simple and is one of the most widely used in text categorization
− Despite the strong naive assumption, it yields good performance in most cases
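
As a sketch of the two quantities discussed in the Multinomial Naive Bayes slides, written in standard notation: here N(w_t, d_i) is assumed to denote the number of occurrences of word w_t in document d_i and |V| the vocabulary size, both symbols being assumptions because the original formulas are not preserved in the transcript.

```latex
P(d_i \mid C_r) \propto \prod_{t=1}^{|V|} P(w_t \mid C_r)^{N(w_t, d_i)},
\qquad
P(w_t \mid C_r) = \frac{\sum_{d_i \in C_r} N(w_t, d_i)}
                       {\sum_{s=1}^{|V|} \sum_{d_i \in C_r} N(w_s, d_i)}
```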

Smoothing techniques
− A typical problem in probabilistic models is zero values:
  ▪ If a feature was never observed during training, its estimated probability is 0. When it is observed during classification, the 0 value cannot be used, since it makes the whole likelihood zero
− The two main methods to avoid zeros are:
  ▪ Additive smoothing (add-one or Laplace)
  ▪ Good-Turing smoothing
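
A minimal sketch of a multinomial Naive Bayes classifier with add-one (Laplace) smoothing, under the assumption that documents are already tokenized into lists of words. The function names `train_nb` and `classify_nb` and the toy documents are illustrative only and do not come from the slides. The smoothed estimate used here is (1 + N(w_t, C_r)) / (|V| + total word count of C_r).

```python
import math
from collections import Counter

def train_nb(docs, labels):
    """Estimate log P(C_r) and add-one smoothed log P(w_t | C_r)."""
    classes = sorted(set(labels))
    vocab = sorted({w for d in docs for w in d})
    counts = {c: Counter() for c in classes}      # word counts per class
    n_docs = Counter(labels)                      # documents per class
    for d, c in zip(docs, labels):
        counts[c].update(d)
    log_prior = {c: math.log(n_docs[c] / len(docs)) for c in classes}
    log_cond = {}
    for c in classes:
        total = sum(counts[c].values())
        # Add-one smoothing: no word of the vocabulary gets probability 0.
        log_cond[c] = {w: math.log((1 + counts[c][w]) / (len(vocab) + total))
                       for w in vocab}
    return log_prior, log_cond

def classify_nb(doc, log_prior, log_cond):
    """Pick the class with the highest log posterior; unknown words are skipped."""
    def score(c):
        return log_prior[c] + sum(log_cond[c][w] for w in doc if w in log_cond[c])
    return max(log_prior, key=score)

# Toy data (hypothetical), one document per class.
docs = [["engine", "wheel", "engine"], ["cpu", "ram", "disk"], ["goal", "match", "team"]]
labels = ["auto", "hw", "sport"]
log_prior, log_cond = train_nb(docs, labels)
prediction = classify_nb(["engine", "ram", "engine"], log_prior, log_cond)
print(prediction)
```

Working in log space avoids numerical underflow when the product runs over many words.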

Distance distribution
− The distribution of documents in the space is nearly uniform and does not form clouds
− The distance between two similar documents and the distance between two different documents are very close
− This depends on:
  ▪ The high number of dimensions
  ▪ The high number of non-discriminative words, which dominate the others in the evaluation of the distances
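
The effect of the number of dimensions can be illustrated with a small experiment. The sketch below is an assumption-laden toy: it uses uniformly random points rather than real document vectors, and the helper `relative_contrast` is a hypothetical name. It measures how much farther the farthest point is than the nearest one, relative to the nearest distance.

```python
import math
import random

def relative_contrast(dim, n_points=50, seed=0):
    """(farthest - nearest) / nearest distance from one random query point."""
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    query = [rng.random() for _ in range(dim)]
    distances = [math.dist(query, p) for p in points]
    return (max(distances) - min(distances)) / min(distances)

low = relative_contrast(2)     # few dimensions: nearest and farthest differ a lot
high = relative_contrast(500)  # many dimensions: all distances look alike
print(low, high)
```

With many dimensions the contrast shrinks, which is one reason why similar and dissimilar documents end up at comparable distances.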

Distance distribution (figure)

Information Gain
− Term filtering:
  ▪ Stopword list
  ▪ Luhn reduction
  ▪ Information gain
− Information gain: each term is scored by how much knowing its presence or absence reduces the uncertainty (entropy) about the class
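
A minimal sketch of information-gain term filtering. It assumes the common presence/absence formulation IG(w) = H(C) - P(w) H(C | w) - P(not w) H(C | not w); the slide's own formula is not preserved in the transcript, so this exact form, the function names and the toy data are assumptions.

```python
import math
from collections import Counter

def entropy(class_counts):
    """Entropy (in bits) of a class distribution given as counts."""
    total = sum(class_counts.values())
    return -sum((n / total) * math.log2(n / total)
                for n in class_counts.values() if n > 0)

def information_gain(docs, labels):
    """Return {term: IG} using term presence per document."""
    n = len(docs)
    all_counts = Counter(labels)
    h_c = entropy(all_counts)
    gains = {}
    for w in {w for d in docs for w in d}:
        with_w = Counter(c for d, c in zip(docs, labels) if w in d)
        without_w = all_counts - with_w
        n_w = sum(with_w.values())
        h_cond = (n_w / n) * entropy(with_w)
        if n_w < n:
            h_cond += ((n - n_w) / n) * entropy(without_w)
        gains[w] = h_c - h_cond
    return gains

def top_k_terms(docs, labels, k):
    """Keep the k terms with the highest IG (ties broken alphabetically)."""
    gains = information_gain(docs, labels)
    return sorted(gains, key=lambda w: (-gains[w], w))[:k]

# Toy data (hypothetical): "the" occurs everywhere and carries no information.
docs = [["motor", "the"], ["motor", "the", "wheel"],
        ["cpu", "the"], ["cpu", "the", "ram"]]
labels = ["auto", "auto", "hw", "hw"]
gains = information_gain(docs, labels)
selected = top_k_terms(docs, labels, 2)
print(selected)
```

Terms that occur in every document get a gain of zero and are filtered out first, which is the behavior term filtering relies on.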

Learning from labeled and unlabeled data
− New research area in Automatic Text Processing:
  ▪ Building a large labeled dataset is usually time consuming and very expensive
− Learning from labeled and unlabeled examples:
  ▪ Use a small initial labeled dataset
  ▪ Extract information from a large unlabeled dataset
− The idea is:
  ▪ Use the labeled data to initialize a labeling process on the unlabeled data
  ▪ Use the newly labeled data to build the classifier

Learning from labeled and unlabeled data
− EM algorithm:
  ▪ E step: the data are labeled using the current parameter configuration
  ▪ M step: the model is updated assuming the labels to be correct
− The model is initialized using the small labeled dataset
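
The E and M steps described above can be sketched with a small Naive Bayes model and hard label assignments. This is a toy under stated assumptions, not the author's implementation: the helpers `fit`, `predict` and `em` are hypothetical names, the M step is assumed to re-estimate the model from the unlabeled data only, and iteration stops when the labels no longer change.

```python
import math
from collections import Counter

def fit(docs, labels, vocab):
    """Multinomial Naive Bayes with add-one smoothing over a fixed vocabulary."""
    model = {}
    for c in sorted(set(labels)):
        counts = Counter(w for d, y in zip(docs, labels) if y == c
                         for w in d if w in vocab)
        total = sum(counts.values())
        cond = {w: math.log((1 + counts[w]) / (len(vocab) + total)) for w in vocab}
        model[c] = (math.log(labels.count(c) / len(labels)), cond)
    return model

def predict(model, doc):
    """Return the class with the highest log posterior."""
    def score(c):
        prior, cond = model[c]
        return prior + sum(cond[w] for w in doc if w in cond)
    return max(sorted(model), key=score)

def em(labeled_docs, labels, unlabeled_docs, max_iter=20):
    vocab = {w for d in labeled_docs + unlabeled_docs for w in d}
    model = fit(labeled_docs, labels, vocab)       # initialization
    assigned = None
    for _ in range(max_iter):
        # E step: label the unlabeled data with the current model.
        new_assigned = [predict(model, d) for d in unlabeled_docs]
        if new_assigned == assigned:               # labels are stable
            break
        assigned = new_assigned
        # M step: re-estimate the model assuming the labels are correct.
        model = fit(unlabeled_docs, assigned, vocab)
    return model, assigned

# Toy data (hypothetical): one seed document per class.
labeled = [["engine"], ["cpu"]]
seed_labels = ["auto", "hw"]
unlabeled = [["engine", "wheel"], ["wheel", "engine", "brake"],
             ["cpu", "ram"], ["ram", "cpu", "disk"]]
model, assigned = em(labeled, seed_labels, unlabeled)
print(assigned, predict(model, ["disk", "ram"]))
```

In this toy run the final model classifies a document containing only "disk" and "ram", words that never occur in the labeled seed set, because their association with the class was learned from the unlabeled data.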

Cross-Lingual Text Categorization
− The problem can be stated as:
  ▪ We have a small labeled dataset in language L_1
  ▪ We want to categorize a large unlabeled dataset in language L_2
  ▪ We do not want to use experts for language L_2
− The idea is:
  ▪ We can translate the training set into language L_2
  ▪ We can initialize an EM algorithm with these very noisy data
  ▪ We can reinforce the behavior of the classifier using the unlabeled data in language L_2

Notation
− L_1, L_2 and L_{1→2} indicate language 1, language 2 and language 1 translated into language 2
− The same subscripts are used for the training set Tr, the test set Ts and the classifier C:
  ▪ C_{1→2} indicates the classifier trained with Tr_{1→2}, that is, the training set Tr_1 translated into language L_2

The basic algorithm
Diagram: Tr_1 → Translation 1→2 → Tr_{1→2} → start → C_{1→2}; EM iterations on Ts_2: E step, E(t), M step; output: results

The basic algorithm
− Once the classifier is trained, it can be used to label a larger dataset
− The algorithm can start from a small initial dataset, which is an advantage since our initial dataset is very noisy
− Problems:
  ▪ Data
  ▪ Translation
  ▪ Algorithm

Data
− Temporal dependency:
  ▪ Documents on the same topic written at different times deal with different themes
− Geographical dependency:
  ▪ Documents on the same topic written in different places deal with different people, facts, etc.
− Find the discriminative terms of each topic that are independent of time and place

Translation
− The translator performs very poorly, especially when the text is badly written:
  ▪ Named Entity Recognition (NER):
    ▪ words that should not be translated
    ▪ different words referring to the same entity
  ▪ Word-sense disambiguation:
    ▪ It is a fundamental problem in translation

Algorithm
− The EM algorithm has some important limitations:
  ▪ The trivial solution is a good solution for EM:
    ▪ all documents in a single cluster
    ▪ all the other clusters empty
  ▪ It usually tends to form a few large central clusters and many small peripheral clusters:
    ▪ This depends on the starting point and on the noise in the data added to each cluster at each EM step

Improved algorithm using IG
Diagram: Tr_{1→2} → IG k_1 → start → C_{1→2}; EM iterations on Ts_2: E step, E(t), IG k_2, M step; output: results

The filter k_1
− Highly selective, since the data consist of translated text and are very noisy
− It initializes the EM process by selecting the most informative words in the data
Diagram detail: the IG k_1 filter is applied to Tr_{1→2}

The filter k_2
− It has a regularization effect on the EM algorithm:
  ▪ It selects the most discriminative words at each EM iteration
  ▪ The non-significant words do not influence the update of the centroids in the EM iterations
− The parameter should be higher than the previous one (k_2 > k_1):
  ▪ It works on the original data
Diagram detail: the IG k_2 filter is applied inside the EM iterations on Ts_2
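
One plausible reading of the filtered EM scheme can be sketched as follows. Everything here is an assumption made for illustration: the exact position of the k_2 filter (taken to be after the E step, before the M step), the presence/absence form of the information gain, the hard label assignments, the helper names and the toy documents. The translation step itself is assumed to have happened already.

```python
import math
from collections import Counter

def entropy(counts):
    total = sum(counts.values())
    return -sum(n / total * math.log2(n / total) for n in counts.values() if n)

def top_k_by_ig(docs, labels, k):
    """Return the k terms with the highest information gain (term presence vs. class)."""
    n, all_counts = len(docs), Counter(labels)
    gains = {}
    for w in {w for d in docs for w in d}:
        with_w = Counter(y for d, y in zip(docs, labels) if w in d)
        n_w = sum(with_w.values())
        cond = (n_w / n) * entropy(with_w)
        if n_w < n:
            cond += ((n - n_w) / n) * entropy(all_counts - with_w)
        gains[w] = entropy(all_counts) - cond
    return set(sorted(gains, key=lambda w: (-gains[w], w))[:k])

def fit(docs, labels, vocab):
    """Multinomial Naive Bayes with add-one smoothing, restricted to vocab."""
    model = {}
    for c in sorted(set(labels)):
        counts = Counter(w for d, y in zip(docs, labels) if y == c
                         for w in d if w in vocab)
        total = sum(counts.values())
        cond = {w: math.log((1 + counts[w]) / (len(vocab) + total)) for w in vocab}
        model[c] = (math.log(labels.count(c) / len(labels)), cond)
    return model

def predict(model, doc):
    def score(c):
        prior, cond = model[c]
        return prior + sum(cond[w] for w in doc if w in cond)
    return max(sorted(model), key=score)

def filtered_em(translated_train, train_labels, test_docs, k1, k2, max_iter=20):
    vocab = top_k_by_ig(translated_train, train_labels, k1)    # filter k_1
    model = fit(translated_train, train_labels, vocab)          # start
    assigned = None
    for _ in range(max_iter):
        new_assigned = [predict(model, d) for d in test_docs]   # E step
        if new_assigned == assigned:
            break
        assigned = new_assigned
        vocab = top_k_by_ig(test_docs, assigned, k2)            # filter k_2
        model = fit(test_docs, assigned, vocab)                 # M step
    return model, assigned

# Toy data (hypothetical): "machine" and "card" mimic translation noise.
train = [["motor", "the", "machine"], ["motor", "the"],
         ["processor", "the"], ["processor", "the", "card"]]
train_labels = ["auto", "auto", "hw", "hw"]
test = [["motor", "wheel", "the"], ["motor", "wheel", "brake", "the"],
        ["processor", "ram", "the"], ["processor", "ram", "disk", "the"]]
model, assigned = filtered_em(train, train_labels, test, k1=2, k2=4)
print(assigned)
```

Choosing k_1 smaller than k_2 mirrors the slides: the first filter is strict because it works on noisy translated text, while the second works on the original test documents.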

Previous works
− Nuria et al. used the ILO corpus and two languages (English and Spanish) to test three different approaches to CLTC:
  ▪ Poly-lingual
  ▪ Test set translation
  ▪ Profile-based translation
− They used the Winnow (ANN) and Rocchio algorithms
− They compared the results with the monolingual test
− Low performance: 70%-75%

Multi-lingual Dataset
− Very few multi-lingual datasets are available:
  ▪ None of them includes the Italian language
− We built the dataset by crawling newsgroups
− Newsgroups:
  ▪ The same groups are available in different languages
  ▪ Large number of available messages
  ▪ Different levels for each topic

Multi-lingual Dataset
− Multi-lingual dataset composition:
  ▪ Two languages: Italian (L_I) and English (L_E)
  ▪ Three groups: auto, hardware and sport

           TRAIN (Tr_I, Tr_E)   TEST (Ts_I)
  Auto     1,000                6,988
  Hw       1,000                6,991
  Sports   1,000                6,984
  total    3,000                20,963

Multi-lingual Dataset
− Drawbacks:
  ▪ Short messages
  ▪ Informal documents:
    ▪ Slang terms
    ▪ Badly written words
  ▪ Often transversal topics:
    ▪ advertising, spam, other current topics (elections)
  ▪ Temporal dependency: the same topic at two different moments deals with different problems
  ▪ Geographical dependency: the same topic in two different places deals with different people, facts, etc.

Monolingual test
− No translation
− Training set and test set in the Italian language
− Results are averaged over a ten-fold cross-validation
Diagram: Tr_I trains the classifier C_I, which labels Ts_I to produce the results

           Ts_I test set   Recall           Precision
  Auto     6,988           94.01 ± 1.03%    93.76 ± 1.09%
  Hw       6,991           96.21 ± 0.93%    93.01 ± 0.45%
  Sports   6,984           92.89 ± 1.12%    96.74 ± 1.24%
  total    20,963          94.43 ± 0.90%
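
For reference, per-class recall and precision and their fold averages can be computed as in the sketch below. It assumes that the ± values in the result tables are standard deviations over the ten folds, which the slides do not state; the function names and the toy labels are illustrative only.

```python
from statistics import mean, stdev

def precision_recall(true_labels, predicted_labels, cls):
    """Precision and recall of one class from two parallel label lists."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(t == cls and p == cls for t, p in pairs)
    predicted = sum(p == cls for _, p in pairs)   # documents assigned to cls
    actual = sum(t == cls for t, _ in pairs)      # documents truly in cls
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    return precision, recall

def summarize(fold_values):
    """Mean and sample standard deviation over folds, as percentages."""
    return 100 * mean(fold_values), 100 * stdev(fold_values)

# Toy labels (hypothetical).
true = ["auto", "auto", "hw", "hw", "sport", "sport"]
pred = ["auto", "hw", "hw", "hw", "sport", "auto"]
hw_precision, hw_recall = precision_recall(true, pred, "hw")
avg, spread = summarize([0.90, 0.92, 0.94])
print(hw_precision, hw_recall, avg, spread)
```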

Baseline multilingual test
− Translation from English to Italian
− Results are averaged over a ten-fold cross-validation
Diagram: Tr_E → Translation E→I → Tr_{E→I} trains the classifier C_{E→I}, which labels Ts_I to produce the results

           Ts_I test set   Recall           Precision
  Auto     6,988           69.56 ± 5.34%    66.56 ± 4.76%
  Hw       6,991           87.24 ± 2.02%    63.35 ± 3.72%
  Sports   6,984           50.95 ± 6.28%    88.22 ± 4.36%
  total    20,963          69.26 ± 4.22%

Simple EM Algorithm
− Translation from English to Italian
− Results are averaged over a ten-fold cross-validation
Diagram: Tr_E → Translation E→I → Tr_{E→I} → start → C_{E→I}; EM iterations on Ts_I: E step, E(t), M step; output: results

           Ts_I test set   Recall           Precision
  Auto     6,988           71.32 ± 1.05%    51.40 ± 1.00%
  Hw       6,991           98.04 ± 1.01%    61.55 ± 0.98%
  Sports   6,984           0.73 ± 0.41%     65.41 ± 0.05%
  total    20,963          56.32 ± 1.10%

Filtered EM algorithm
− Translation from English to Italian
− Results are averaged over a ten-fold cross-validation
− k_1 = 300, k_2 = 1000
Diagram: Tr_{E→I} → IG k_1 → start → C_{E→I}; EM iterations on Ts_I: E step, E(t), IG k_2, M step; output: results

           Ts_I test set   Recall           Precision
  Auto     6,988           92.59 ± 1.05%    87.07 ± 1.02%
  Hw       6,991           87.88 ± 0.98%    92.78 ± 0.88%
  Sports   6,984           91.01 ± 1.03%    92.28 ± 0.90%
  total    20,963          90.64 ± 0.96%

Conclusions
− The filtered EM algorithm outperforms the other tested approaches (90.64% overall, against 69.26% for the translation baseline and 56.32% for the simple EM) and exceeds the 70%-75% reported in previous works
− It does not need an initial labeled dataset in the target language:
  ▪ No other algorithm with this feature has been proposed
− It achieves good results starting from few translated documents:
  ▪ It does not require much time for translation
