The use of machine translation tools for cross-lingual text-mining
Blaz Fortuna, Jozef Stefan Institute, Ljubljana
John Shawe-Taylor, Southampton University

Outline
- Cross-lingual text mining
- Kernel CCA
- Machine translation
- Information retrieval experiment
- Classification experiment
- Conclusions

Cross-lingual text mining
When text mining is applied to a multilingual text corpus, language-specific issues appear:
- Information retrieval: retrieved documents should depend only on the meaning of the query, not on its language.
- Classification: a single classifier should be learned, not a separate classifier for each language.
- Clustering: documents should be grouped into clusters by their content, not by the language they are written in.

KCCA (Kernel Canonical Correlation Analysis)
KCCA learns a semantic representation of text from a corpus of unlabeled paired documents.
- Input: a set of paired documents (for each document we have a version in each language).
- Output: a set of mappings from each native language space into a "language independent space", a subspace with semantic dimensions [Vinokourov et al., 2002].
Figure (KCCA semantic dimensions): example word sets for two semantic dimensions.
- English: loss, income, company, quarter; German: verlust, einkommen, firma, viertel
- English: wage, payment, negotiations, union; German: zahlung, volle, gewerkschaft, verhandlungsrunde
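For reference, the regularised kernel CCA objective as it is commonly written in the literature is sketched below. The slides do not say which variant or regularisation was used, so the symbols are assumptions: K_x and K_y are the kernel matrices of the n paired training documents in the two languages, alpha and beta are dual coefficient vectors, and kappa is a regularisation parameter.

```latex
\rho = \max_{\alpha,\beta}
  \frac{\alpha^{\top} K_x K_y \beta}
       {\sqrt{\left(\alpha^{\top} K_x^{2} \alpha + \kappa\, \alpha^{\top} K_x \alpha\right)
              \left(\beta^{\top} K_y^{2} \beta + \kappa\, \beta^{\top} K_y \beta\right)}}

% Coordinate of a new document d (language x) along one semantic dimension,
% where d_1, ..., d_n are the training documents in that language:
f_x(d) = \sum_{i=1}^{n} \alpha_i \, k_x(d_i, d)
```

Each maximising pair (alpha, beta) gives one semantic dimension; a document in either language is mapped into the shared space by evaluating its kernel against the training documents of its own language.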

Paired training set and machine translation
KCCA needs a paired dataset for training. When no paired dataset is available, we have two options:
- Use a human-made dataset from some other domain. This can be unreliable because of the large semantic and vocabulary gap between domains.
- Use machine translation tools to generate a paired dataset. In our experiments we used Google Language Tools to translate the documents.
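The second option can be sketched as follows. This is a minimal illustration, not the authors' pipeline: `build_paired_corpus`, `toy_translate` and the tiny dictionary are hypothetical names introduced here, and the translator is a word-lookup stand-in for a real machine translation service.

```python
from typing import Callable, List, Tuple


def build_paired_corpus(
    documents: List[str],
    translate: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Pair every source document with its machine translation."""
    return [(doc, translate(doc)) for doc in documents]


# Hypothetical toy dictionary standing in for a real translation system.
TOY_EN_DE = {"loss": "verlust", "income": "einkommen", "company": "firma"}


def toy_translate(text: str) -> str:
    # Word-by-word lookup; unknown words are passed through unchanged.
    return " ".join(TOY_EN_DE.get(word, word) for word in text.split())


pairs = build_paired_corpus(["company loss", "income"], toy_translate)
for source, target in pairs:
    print(source, "<->", target)
```

Passing the translator in as a callable keeps the pairing step independent of whichever translation tool is actually used.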

Experiments
We investigated how a training set generated by machine translation compares in quality with a true human-generated paired corpus. Two major issues are addressed. How much do we win or lose by using machine translation when a human-generated corpus is available:
- for the target domain?
- only for a different domain?

Experiment #1 – Information retrieval
We compared two paired corpora:
- Hansard corpus: aligned pairs of text chunks from the official records of the 36th Canadian Parliament proceedings [Germann, 2001].
- Artificial corpus: half of the English and half of the French translations from the Hansard corpus were replaced by machine translation.
Queries were generated from each test document by extracting the 5 words with the highest TFIDF weights. The goal was to retrieve the paired document.
Experimental procedure (for each corpus):
(1) KCCA was trained on 1500 paired documents.
(2) All 896 test documents (in both languages) were projected into the KCCA semantic space.
(3) Each query was projected into the KCCA semantic space, and documents were retrieved by nearest neighbour search based on cosine distance to the query.
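The query generation and retrieval steps above can be sketched in a few lines. This is a simplified illustration under stated assumptions: the KCCA projection is omitted, so retrieval here runs in the raw TFIDF space within one language; the weighting (raw count times log(N/df)) is one common TFIDF variant and not necessarily the one used in the experiment; all function names and the toy documents are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf_vectors(docs: List[List[str]]) -> List[Dict[str, float]]:
    """TFIDF weights with tf = raw count and idf = log(N / df)."""
    n_docs = len(docs)
    df = Counter(word for doc in docs for word in set(doc))
    return [
        {w: c * math.log(n_docs / df[w]) for w, c in Counter(doc).items()}
        for doc in docs
    ]


def top_words(vector: Dict[str, float], k: int = 5) -> List[str]:
    # Highest weight first; ties are broken alphabetically for determinism.
    ranked = sorted(vector.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:k]]


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query_words: List[str], doc_vectors: List[Dict[str, float]]) -> List[int]:
    """Return document indices ranked by cosine similarity to the query."""
    query = {word: 1.0 for word in query_words}
    scores = [cosine(query, vec) for vec in doc_vectors]
    return sorted(range(len(doc_vectors)), key=lambda i: -scores[i])


docs = [
    "budget tax budget minister vote".split(),
    "fishery quota fishery coast vote".split(),
    "health hospital nurse health vote".split(),
]
vectors = tfidf_vectors(docs)
for i, vec in enumerate(vectors):
    query = top_words(vec, k=5)
    print(i, query, retrieve(query, vectors)[0])
```

In the cross-lingual setting both the query and the documents would first be mapped into the shared semantic space, and the same nearest neighbour ranking would be applied there.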

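Results

            En-En     En-Fr     Fr-En     Fr-Fr
Hansard     87 / 99   66 / 96   65 / 95   84 / 99
Artificial  86 / 99   58 / 91   59 / 90   83 / 99

Each cell gives two percentages: queries for which the correct document appeared in first place / queries for which it appeared among the first 10 results. For example, with the Hansard corpus and Fr-En, the correct document was ranked first for 65% of queries and appeared among the first 10 results for 95% of queries.
- When query and document are in the same language, there is practically no difference between the two corpora.
- When query and document are in different languages, retrieval accuracy drops by around 5-10% with the artificial corpus.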
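The two numbers in each cell are top-1 and top-10 accuracies. A minimal sketch of how such a figure can be computed from ranked result lists is given below; the function name and the toy rankings are illustrative assumptions, not data from the experiment.

```python
from typing import List


def top_k_accuracy(rankings: List[List[int]], targets: List[int], k: int) -> float:
    """Percentage of queries whose target document is among the first k results."""
    hits = sum(
        1 for ranking, target in zip(rankings, targets) if target in ranking[:k]
    )
    return 100.0 * hits / len(targets)


# One ranked list of document ids per query; targets[i] is the paired
# document that query i is supposed to retrieve (toy values).
rankings = [[0, 1, 2, 3], [2, 1, 0, 3], [3, 0, 2, 1], [1, 0, 2, 3]]
targets = [0, 1, 2, 3]

top1 = top_k_accuracy(rankings, targets, k=1)
top3 = top_k_accuracy(rankings, targets, k=3)
print(top1, top3)
```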

Experiment #2 – Classification
The Reuters multilingual corpus (English and French) was used as the dataset [Reuters, 2004].
- The first paired training set, Hansard, was taken from the previous experiment; it comes from a different domain than news articles.
- The second paired training set was generated from the Reuters dataset using machine translation (Google).
Experimental procedure (for each paired training set):
(1) KCCA was trained on 1500 paired documents.
(2) The whole Reuters corpus was projected into the KCCA semantic space.
(3) A linear SVM classifier was trained in the KCCA semantic space on a subset of 3000 documents and tested on a subset of [...] (results are averaged over 5 random splits).
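Step (3) can be illustrated with a toy linear SVM. The slides do not name a solver, so the sketch below swaps in plain batch subgradient descent on the L2-regularised hinge loss, without a bias term. The two-dimensional vectors are made-up stand-ins for documents already projected into the shared semantic space: training points play the role of one language, test points the role of the other.

```python
from typing import List, Sequence


def train_linear_svm(
    xs: Sequence[Sequence[float]],
    ys: Sequence[int],
    lam: float = 0.01,
    lr: float = 0.1,
    epochs: int = 50,
) -> List[float]:
    """Batch subgradient descent on lam/2*|w|^2 + mean hinge loss (no bias)."""
    n, dim = len(xs), len(xs[0])
    w = [0.0] * dim
    for _ in range(epochs):
        grad = [lam * wj for wj in w]  # gradient of the regulariser
        for x, y in zip(xs, ys):
            margin = y * sum(wj * xj for wj, xj in zip(w, x))
            if margin < 1.0:  # only margin violators contribute
                for j in range(dim):
                    grad[j] -= y * x[j] / n
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w


def predict(w: Sequence[float], x: Sequence[float]) -> int:
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) >= 0.0 else -1


# Toy "semantic space" vectors (hypothetical values).
train_x = [(2.0, 0.5), (1.5, 1.0), (-1.0, -1.5), (-2.0, -0.5)]
train_y = [1, 1, -1, -1]
test_x = [(1.8, 0.2), (-1.2, -0.9)]
test_y = [1, -1]

w = train_linear_svm(train_x, train_y)
accuracy = sum(predict(w, x) == y for x, y in zip(test_x, test_y)) / len(test_y)
print(w, accuracy)
```

Because the classifier only sees coordinates in the shared space, a model trained on documents from one language can be applied unchanged to documents from the other.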

Results
Number of KCCA dimensions: 800
Legend: FE … French training set, English testing set.
The artificial paired training set generates a significantly better semantic space than a training set taken from a different domain!

Conclusions
- We have shown that machine translation can be used to generate a training set for Kernel CCA that gives almost as good performance as a training set made by human translators.
- When no hand-made translations are available, this can significantly decrease the cost of multilingual text mining.
We would also like to thank Miha Grcar for making an automated interface to Google Language Tools!

Questions?