Language Identification of Web Data for Building Linguistic Corpora
Marija Stupar, Tereza Jurić, Nikola Ljubešić
Faculty of Humanities and Social Sciences
University of Zagreb, Croatia
INFuture2011: “Information Sciences and e-Society”
Zagreb, 10 November 2011

Overview
Introduction
Experimental setup
  ▫Languages observed
Methods used
  ▫Main approaches
  ▫Hybrid approaches
Results
  ▫Document level
  ▫Paragraph level
Conclusion
Stupar, Jurić, Ljubešić, Language Identification of Web Data for Building Linguistic Corpora

Introduction
Web as a rich source of linguistic material
More than one natural language within such sources
Defining the method for language identification of the data collected from the Web
  ▫Comparison of two main and two hybrid approaches
Ultimate goal
  ▫Using Web resources as a basis for constructing corpora: building hrWaC, the Croatian Web corpus

Experimental setup
Twelve languages observed: cs, de, en, es, fr, hr, hu, it, pl, sk, sl, sv
Table 1: A snippet from Language Similarity Table (Scannell, 2007)
[Table 1 rows and columns: cs, de, en, es, fr, hr, hu, it, pl, sk, sl, sv; cell values missing]

Methods used
Main approaches
  ▫Function word distributions
  ▫Second-order Markov models
Hybrid approaches
  ▫Harmonic balance
  ▫Sophisticated method
Language identification on document and paragraph level
Table 2: Amount of data collected for each basic method
[Table 2 columns: Function words, Character count; rows: Czech, German, English, Spanish, French, Croatian, Hungarian, Italian, Polish, Slovak, Slovenian, Swedish; cell values missing]

Methods used - main approaches
Function word distributions
  ▫Lists of function words from all languages in question
  ▫The algorithm chooses the language for which the highest percentage of words could be identified as function words of the respective language
Second-order Markov models
  ▫Conditional probability of each character given the two preceding characters, estimated from the distributions of character bigrams and trigrams calculated on a training set

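The two main approaches can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the word lists, training sentences, whitespace tokenization, add-one smoothing and all names (`function_word_scores`, `CharTrigramModel`, `best_language`) are hypothetical choices made for the example.

```python
import math
from collections import Counter


def function_word_scores(text, function_words):
    """Share of tokens that are function words of each language."""
    tokens = text.lower().split()  # assumption: plain whitespace tokenization
    if not tokens:
        return {lang: 0.0 for lang in function_words}
    return {
        lang: sum(tok in words for tok in tokens) / len(tokens)
        for lang, words in function_words.items()
    }


class CharTrigramModel:
    """Second-order Markov model: P(char | two preceding chars)."""

    def __init__(self, training_text):
        padded = "  " + training_text.lower()  # two spaces as start context
        positions = range(len(padded) - 2)
        self.trigrams = Counter(padded[i:i + 3] for i in positions)
        self.contexts = Counter(padded[i:i + 2] for i in positions)
        # assumption: add-one smoothing, one extra slot for unseen characters
        self.vocab_size = len(set(padded)) + 1

    def log_prob(self, text):
        """Log-probability of the text under this language's model."""
        padded = "  " + text.lower()
        return sum(
            math.log((self.trigrams[padded[i:i + 3]] + 1)
                     / (self.contexts[padded[i:i + 2]] + self.vocab_size))
            for i in range(len(padded) - 2)
        )


def best_language(scores):
    """Language with the highest score."""
    return max(scores, key=scores.get)


# Toy data, for illustration only (not the lists or corpora from the slides).
FUNCTION_WORDS = {
    "en": {"the", "and", "of", "is", "in"},
    "hr": {"i", "je", "u", "na", "se"},
}
TRAINING = {
    "en": "the quick brown fox jumps over the lazy dog and the cat is in the house",
    "hr": "brza smeđa lisica skače preko lijenog psa i mačka je u kući na brdu",
}
models = {lang: CharTrigramModel(text) for lang, text in TRAINING.items()}

sample = "the dog is in the house"
fw_scores = function_word_scores(sample, FUNCTION_WORDS)
mm_scores = {lang: model.log_prob(sample) for lang, model in models.items()}
print(best_language(fw_scores), best_language(mm_scores))
```

The function word score is the percentage described on the slide. The Markov score is a sum of log-probabilities, so the least negative value marks the most likely language.
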
Methods used - hybrid approaches
Harmonic balance
  ▫Harmonic mean of the certainty of the function words method and the certainty of the Markov model method
  ▫Certainty is calculated as a/(a+b), where a is the score of the best result and b the score of the second-best result
Sophisticated hybrid method
  ▫Takes into account the strengths of each main method

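The certainty formula and the harmonic mean can be written out as a small sketch. It assumes non-negative scores per language; the slides do not say how Markov model scores are scaled before certainty is computed, nor how the final language is chosen from the combined value, so both are left out. The names `certainty` and `harmonic_mean` and the sample scores are hypothetical.

```python
import math


def certainty(scores):
    """Return (best_language, a / (a + b)) for non-negative scores.

    a is the best score and b the second-best score.
    """
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    (best_lang, a), (_, b) = ranked[0], ranked[1]
    if a + b == 0:
        return best_lang, 0.5  # assumption: no evidence for either candidate
    return best_lang, a / (a + b)


def harmonic_mean(x, y):
    """Harmonic mean of two certainties."""
    return 0.0 if x + y == 0 else 2 * x * y / (x + y)


# Illustrative scores only.
fw_lang, fw_cert = certainty({"hr": 3.0, "sl": 1.0, "en": 0.0})  # 3 / (3 + 1)
mm_lang, mm_cert = certainty({"hr": 1.0, "sl": 1.0, "en": 0.5})  # 1 / (1 + 1)
combined = harmonic_mean(fw_cert, mm_cert)
print(fw_lang, fw_cert, mm_cert, combined)
```

Because the harmonic mean is pulled towards the smaller of its two inputs, the combined value is high only when both methods are certain.
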
Methods used - hybrid approaches
Sophisticated hybrid method algorithm
  ▫If the Markov model method and the function words method give the same result, that result is accepted
  ▫If the results differ, but the second-best result of the Markov model method is identical to the best result of the function words method and its certainty is over 0.6, the result of the function words method is accepted
  ▫Otherwise the result of the Markov model method is accepted

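The three rules above translate directly into a decision function. The sketch below reads "its certainty" as the certainty of the function words method, which is an assumption; the function name, its parameters and the sample inputs are hypothetical.

```python
def sophisticated_hybrid(markov_ranked, fw_best, fw_certainty, threshold=0.6):
    """Combine the two main methods following the three rules on the slide.

    markov_ranked: languages ordered from best to worst by the Markov model.
    fw_best:       best language according to the function words method.
    fw_certainty:  certainty a/(a+b) of the function words method
                   (assumption: this is the certainty the rule refers to).
    """
    markov_best = markov_ranked[0]
    # Rule 1: both methods agree.
    if markov_best == fw_best:
        return markov_best
    # Rule 2: the function words winner is the Markov runner-up and is certain enough.
    if (len(markov_ranked) > 1
            and markov_ranked[1] == fw_best
            and fw_certainty > threshold):
        return fw_best
    # Rule 3: fall back to the Markov model.
    return markov_best


# Illustrative calls only.
agree = sophisticated_hybrid(["hr", "sl", "en"], "hr", 0.55)
override = sophisticated_hybrid(["sl", "hr", "en"], "hr", 0.80)
fallback = sophisticated_hybrid(["sl", "hr", "en"], "hr", 0.55)
print(agree, override, fallback)
```
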
Methods used - evaluation
Document level
  ▫20 documents per language
  ▫Documents in which no single language accounts for at least 70% of the content are considered unsolvable
Paragraph level
  ▫Paragraphs in 50 documents were labeled with the language they are written in
  ▫750 paragraphs in total
Evaluation measure is accuracy
  ▫(a+d)/(a+b+c+d)

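The evaluation rules can be expressed as two small helpers. The slide does not define a, b, c and d; the sketch assumes the usual contingency-table reading in which a and d count correct decisions and b and c count errors. The function names, the reading of the 70% threshold and the numbers in the example are hypothetical.

```python
def accuracy(a, b, c, d):
    """Accuracy as (a + d) / (a + b + c + d).

    Assumption: a and d are correct decisions, b and c are errors.
    """
    total = a + b + c + d
    if total == 0:
        raise ValueError("no decisions to evaluate")
    return (a + d) / total


def is_unsolvable(language_shares, threshold=0.7):
    """True if no single language reaches the threshold share of a document."""
    return max(language_shares.values(), default=0.0) < threshold


# Illustrative numbers only, not results from the slides.
print(accuracy(45, 3, 2, 50))
print(is_unsolvable({"hr": 0.6, "en": 0.4}))
print(is_unsolvable({"hr": 0.9, "en": 0.1}))
```
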
Results
Main approaches
Table 3: Results of the evaluation of the traditional approaches
[Table 3 columns: Function words and Markov model at document level, Function words and Markov model at paragraph level; rows: Positive, Negative, Accuracy; only the run-together digits 6153 survive in the Negative row]

Results
Hybrid approaches
Table 4: Results of the evaluation of hybrid methods
[Table 4 columns: Harmonic balance and Sophisticated method at document level, Harmonic balance and Sophisticated method at paragraph level; rows: Positive, Negative, Accuracy; only the run-together digits 1043 survive in the Negative row]

Conclusion
The Markov model outperforms the function words method
Hybrid approaches proved more effective on the document level (mixed-language content)
Power-law-like distribution of languages
Three languages account for 99% of the data
Around 96% of documents are written in only one language
  ▫4% have mixed content
