Language Identification of Web Data for Building Linguistic Corpora Marija Stupar, Tereza Jurić, Nikola Ljubešić Faculty of Humanities and Social Sciences.

Presentation transcript:

Language Identification of Web Data for Building Linguistic Corpora
Marija Stupar, Tereza Jurić, Nikola Ljubešić
Faculty of Humanities and Social Sciences, University of Zagreb, Croatia
INFuture2011: “Information Sciences and e-Society”, Zagreb, 10 November 2011

Overview
Introduction
Experimental setup
▫ Languages observed
Methods used
▫ Main approaches
▫ Hybrid approaches
Results
▫ Document level
▫ Paragraph level
Conclusion
Stupar, Jurić, Ljubešić, Language Identification of Web Data for Building Linguistic Corpora

Introduction
The Web is a rich source of linguistic material
Web sources often contain more than one natural language
Aim: define a method for identifying the language of data collected from the Web
▫ Comparison of two main and two hybrid approaches
Ultimate goal
▫ Using Web resources as a basis for constructing corpora, specifically hrWaC, the Croatian Web corpus

Experimental setup
Twelve languages observed: cs, de, en, es, fr, hr, hu, it, pl, sk, sl, sv
Table 1: A snippet from Language Similarity Table (Scannell, 2007)
[Table 1 is a matrix with the twelve language codes as row and column labels; the cell values are not preserved in this transcript.]

Methods used
Main approaches
▫ Function word distributions
▫ Second-order Markov models
Hybrid approaches
▫ Harmonic balance
▫ Sophisticated method
Language identification on document and paragraph level
Table 2: Amount of data collected for each basic method
[Columns: Function words, Character count. Rows: Czech, German, English, Spanish, French, Croatian, Hungarian, Italian, Polish, Slovak, Slovenian, Swedish. The cell values are not preserved in this transcript.]

Methods used - main approaches
Function word distributions
▫ Lists of function words are compiled for all languages in question
▫ The algorithm chooses the language for which the highest percentage of the text's words can be identified as function words of that language
Second-order Markov models
▫ Conditional probabilities of a character given the two preceding characters, estimated from the distributions of character bigrams and trigrams calculated on a training set
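The two main approaches above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function-word lists are tiny hypothetical samples, the function names are invented for this example, and the add-alpha smoothing in the Markov model is an assumption, since the slides do not say how unseen trigrams were handled.

```python
import math
import re
from collections import Counter

# Hypothetical toy lists; the study used full function-word lists per language.
FUNCTION_WORDS = {
    "en": {"the", "of", "and", "is", "in", "to"},
    "hr": {"i", "je", "u", "na", "se", "da"},
}


def function_word_scores(text):
    """Return, per language, the share of tokens that are its function words."""
    tokens = re.findall(r"\w+", text.lower())
    if not tokens:
        return {lang: 0.0 for lang in FUNCTION_WORDS}
    return {
        lang: sum(tok in words for tok in tokens) / len(tokens)
        for lang, words in FUNCTION_WORDS.items()
    }


def train_markov(text):
    """Count character trigrams and their two-character contexts."""
    padded = "  " + text.lower()  # two spaces give the first character a context
    positions = range(len(padded) - 2)
    trigrams = Counter(padded[i:i + 3] for i in positions)
    contexts = Counter(padded[i:i + 2] for i in positions)
    return trigrams, contexts, set(padded)


def markov_log_prob(text, model, alpha=1.0):
    """Log-probability of text under a second-order character model.

    Add-alpha smoothing is an assumption made for this sketch.
    """
    trigrams, contexts, alphabet = model
    vocab = len(alphabet) + 1  # one extra slot for characters never seen in training
    padded = "  " + text.lower()
    total = 0.0
    for i in range(len(padded) - 2):
        tri = padded[i:i + 3]
        total += math.log(
            (trigrams[tri] + alpha) / (contexts[tri[:2]] + alpha * vocab)
        )
    return total
```

In use, one model would be trained per language, and the language whose model assigns the highest log-probability to a document or paragraph would be chosen; for the function-word method the language with the highest share wins.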

Methods used - hybrid approaches
Harmonic balance
▫ Harmonic mean of the certainty of the function words method and the certainty of the Markov model method
▫ Certainty is calculated as a/(a+b), where a is the best result and b the second best result
Sophisticated hybrid method
▫ Takes into account the strengths of each main method
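The certainty and harmonic mean described above might be computed as in the following sketch. It assumes each method yields a non-negative score per language (for example, the share of function words); the function names are hypothetical, and the slide does not specify how the combined value is then used to pick a language, so that step is left out.

```python
def certainty(scores):
    """Certainty a / (a + b), with a the best and b the second best score.

    Assumes non-negative scores and at least two candidate languages.
    """
    ranked = sorted(scores.values(), reverse=True)
    a, b = ranked[0], ranked[1]
    return a / (a + b) if (a + b) > 0 else 0.0


def harmonic_mean(x, y):
    """Harmonic mean of two non-negative values (0.0 if both are zero)."""
    return 2 * x * y / (x + y) if (x + y) > 0 else 0.0


# Example with made-up scores for two methods:
fw_certainty = certainty({"hr": 0.6, "sl": 0.2, "en": 0.0})      # 0.6 / 0.8
markov_certainty = certainty({"hr": 0.5, "sl": 0.5, "en": 0.1})  # 0.5 / 1.0
combined = harmonic_mean(fw_certainty, markov_certainty)
```

A certainty of 0.5 means the top two candidates are tied, while values near 1.0 mean the best candidate clearly dominates; the harmonic mean stays low unless both methods are confident.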

Methods used - hybrid approaches
Sophisticated hybrid method algorithm
▫ If the Markov model and function words method give the same result, the result is accepted
▫ If the results of the two methods differ, but the second best result of the Markov model method is identical to the first result of the function words method and its certainty is over 0.6, the result of the function words method is accepted
▫ Otherwise the result of the Markov model method is accepted
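The decision rule above can be written as a short function. This is a sketch under two assumptions: "its certainty" is read as the certainty of the function words method, computed as a/(a+b) from its two best scores, and each method is represented by a hypothetical dictionary mapping language codes to scores where higher is better.

```python
def ranked_languages(scores):
    """Language codes ordered from best to worst score."""
    return sorted(scores, key=scores.get, reverse=True)


def sophisticated_hybrid(markov_scores, fw_scores, threshold=0.6):
    """Combine the two main methods following the three-step rule."""
    markov_ranked = ranked_languages(markov_scores)
    fw_ranked = ranked_languages(fw_scores)

    # Step 1: both methods agree on the best language.
    if markov_ranked[0] == fw_ranked[0]:
        return markov_ranked[0]

    # Certainty of the function words method (assumed reading of "its certainty").
    a, b = fw_scores[fw_ranked[0]], fw_scores[fw_ranked[1]]
    fw_certainty = a / (a + b) if (a + b) > 0 else 0.0

    # Step 2: the function words winner is Markov's runner-up and is confident.
    if markov_ranked[1] == fw_ranked[0] and fw_certainty > threshold:
        return fw_ranked[0]

    # Step 3: fall back to the Markov model.
    return markov_ranked[0]
```

Only the ranking of the Markov scores matters here, so log-probabilities can be passed in directly; the function word scores need to be non-negative for the certainty to be meaningful.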

Methods used - evaluation
Document level
▫ 20 documents per language
▫ Documents in which no single language accounts for at least 70% of the content are considered unsolvable
Paragraph level
▫ Paragraphs in 50 documents were labeled by the language they are written in
▫ 750 paragraphs in total
Evaluation measure is accuracy
▫ (a+d)/(a+b+c+d)
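The accuracy formula above is a one-line computation. In the sketch below, a and d are read as the two counts of correct decisions and b and c as the two counts of incorrect ones, following the usual contingency-table convention; the slide itself does not define the letters, so this reading is an assumption, and the numbers in the example are made up.

```python
def accuracy(a, b, c, d):
    """Accuracy (a + d) / (a + b + c + d) from four contingency-table counts."""
    total = a + b + c + d
    if total == 0:
        raise ValueError("accuracy is undefined for an empty evaluation set")
    return (a + d) / total


# Made-up counts: 95 correct decisions out of 100.
example = accuracy(a=45, b=3, c=2, d=50)
```

The parentheses matter: without them, a+d/a+b+c+d would be evaluated as a + (d/a) + b + c + d.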

Results - main approaches
Table 3: Results of the evaluation of the traditional approaches
[Columns: Function words and Markov model, each at document level and at paragraph level. Rows: Positive, Negative, Accuracy. Only the digit run "6153" in the Negative row survives in this transcript, without column boundaries; the other values are not preserved.]

Results - hybrid approaches
Table 4: Results of the evaluation of hybrid methods
[Columns: Harmonic balance and Sophisticated method, each at document level and at paragraph level. Rows: Positive, Negative, Accuracy. Only the digit run "1043" in the Negative row survives in this transcript, without column boundaries; the other values are not preserved.]

Conclusion
The Markov model outperforms the function words method
Hybrid approaches proved more effective on the document level (mixed language content)
The distribution of languages roughly follows a power law
▫ Three languages account for 99% of the data
Around 96% of documents are written in only one language
▫ 4% have mixed content
