1 Language Identification of Search Engine Queries. Hakan Ceylan, Department of Computer Science, University of North Texas, Denton, TX 76203, hakan@unt.edu. Yookyung Kim, Yahoo! Inc., 2821 Mission College Blvd., Santa Clara, CA 95054, ykim@yahoo-inc.com. ACL 2009

2 Outline: Introduction; Data Generation; Language Identification; Conclusions and Future Work

3 Introduction (1): Language identification means deciding in which language a given text is written. The problem is heavily studied. It is of critical importance to search engines, which must identify the language of queries. Challenge: the lack of any standard or publicly available data set.

4 Introduction (2): There are cases where a correct identification of the language is not necessary. Example: the query "homo sapiens" entered by a user in Spain. For such cases, add a non-linguistic feature to the system.

5 Introduction(3)

6 Data Generation (1): Data set: constructed from queries with clicked URLs. Source: the Yahoo! Search Engine, for each language. Time: a three-month period.

7 Data Generation (2): Preprocessing: remove any numbers, special characters, and extra spaces; lowercase all letters of the queries. Calculate the frequencies of the URLs for each query. A web page is 474 words long on average. Identify the language of each web page using one of the existing methods.
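
The preprocessing and click-counting steps above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names `normalize_query` and `count_clicks` and the sample click log are hypothetical, and the slide does not state exactly which characters count as "special", so the regular expression is an assumption.

```python
import re
from collections import Counter

def normalize_query(query):
    """Lowercase a query and drop digits, punctuation, and extra spaces."""
    lowered = query.lower()
    # [\W\d_]+ matches runs of non-word characters, digits, and underscores;
    # each run collapses to a single space (an assumed reading of the slide).
    cleaned = re.sub(r"[\W\d_]+", " ", lowered)
    return cleaned.strip()

def count_clicks(click_log):
    """Count how often each URL was clicked for each normalized query."""
    counts = Counter()
    for query, url in click_log:
        counts[(normalize_query(query), url)] += 1
    return counts

# Hypothetical click log: (raw query, clicked URL) pairs.
click_log = [
    ("Homo  Sapiens!", "http://example.org/a"),
    ("homo sapiens", "http://example.org/a"),
    ("homo sapiens 2009", "http://example.org/b"),
]
clicks = count_clicks(click_log)
```

After normalization the three raw queries collapse to the same key, so their clicks are pooled per URL.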

8 Data Generation (3): Use Table 1 (T1) and Table 2 (T2) to store the above information. T1: [q, u, f_u], where q is a query, u is a unique URL, and f_u is the frequency of u. T2: [u, l], where u is a URL and l is the language identified for u. Combine T1 and T2 into T3: [q, l, f_l, c_{u,l}], where l is a language, f_l is the count of clicks for l, and c_{u,l} is the count of unique URLs in language l.
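
One way to picture the combination of T1 and T2 into T3 is a join on the URL followed by a per-language aggregation. The sketch below assumes T1 is a list of (q, u, f_u) tuples and T2 is a dictionary from URL to language; the function name `build_t3`, the sample rows, and the decision to skip URLs without an identified language are all assumptions made for illustration.

```python
from collections import defaultdict

def build_t3(t1, t2):
    """Join T1 rows (q, u, f_u) with T2 (u -> l) into T3 rows (q, l, f_l, c_ul)."""
    clicks = defaultdict(int)  # (q, l) -> f_l, total clicks in language l
    urls = defaultdict(set)    # (q, l) -> unique URLs in language l
    for q, u, f_u in t1:
        lang = t2.get(u)
        if lang is None:
            continue  # assumption: URLs with no identified language are skipped
        clicks[(q, lang)] += f_u
        urls[(q, lang)].add(u)
    return [(q, l, clicks[(q, l)], len(urls[(q, l)])) for (q, l) in sorted(clicks)]

# Hypothetical sample data.
t1 = [("casa", "u1", 4), ("casa", "u2", 1), ("casa", "u3", 2)]
t2 = {"u1": "es", "u2": "es", "u3": "pt"}
t3 = build_t3(t1, t2)
```

Here the query "casa" ends up with one T3 row per language, which is exactly the situation the noise filtering on the following slides has to deal with.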

9 Data Generation (4): The data contains a lot of noise. 1. A query maps to more than one language. Solution: give each query a weight w_{q,l} for each language l and set a threshold parameter W; if w_{q,l} < W, remove the query. 2. Navigational queries. Example: "ACL 2009".

10 Data Generation (5): Solution for navigational queries: set two threshold parameters F and U; if F_q > F or U_q < U, remove the query. Algorithm
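
The two noise filters can be sketched together as a single pass over T3 rows of the form (q, l, f_l, c_{u,l}). The slides do not define w_{q,l}, F_q, or U_q, so this sketch assumes that F_q is the total click count of a query, U_q is its total number of unique URLs, and w_{q,l} is the share of the query's clicks that fall in language l. The function name `filter_queries` and the sample rows are hypothetical; the default thresholds are the values reported later in the talk.

```python
def filter_queries(t3, W=1.0, F=50, U=5):
    """Keep (query, language) pairs that pass both noise filters."""
    total_clicks = {}  # q -> F_q (assumed: sum of f_l over languages)
    total_urls = {}    # q -> U_q (assumed: sum of c_ul over languages)
    for q, _lang, f_l, c_ul in t3:
        total_clicks[q] = total_clicks.get(q, 0) + f_l
        total_urls[q] = total_urls.get(q, 0) + c_ul

    kept = []
    for q, lang, f_l, _c_ul in t3:
        F_q, U_q = total_clicks[q], total_urls[q]
        w = f_l / F_q  # assumed definition of w_{q,l}: click share of language l
        if w < W:
            continue  # query maps to more than one language
        if F_q > F or U_q < U:
            continue  # treated as navigational, as on the slide
        kept.append((q, lang))
    return kept

# Hypothetical T3 rows: (q, l, f_l, c_ul).
t3 = [
    ("recetas de paella", "es", 12, 6),
    ("casa", "es", 5, 2),
    ("casa", "pt", 2, 1),
    ("acl", "en", 400, 2),
]
kept = filter_queries(t3)
```

With W = 1 under this reading, only queries whose clicks all land in a single language survive the first filter.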

11 Data Generation (6): How to tune the parameters? They depend on the size of the data set (Silverstein et al., 1999). Values used: W = 1, F = 50, U = 5. How many queries are filtered? 5% to 10% of the queries. Pick 500 queries at random and have them annotated by humans. Category-1: the query does not contain any foreign terms. Category-2: the query contains some foreign terms but would still be expected to bring up web pages in the same language. Category-3: the query belongs to another language, or all of its terms are foreign to the annotator.

12 Data Generation (7): How much of the multilinguality does this parameter selection eliminate? Result: Category-1: 47.6%; Category-1+2: 60.2%

13 Language Identification (1): Implement three models, each using a different existing feature: 1. statistical model; 2. knowledge-based model; 3. morphological model. Training corpus: EuroParl Corpora. Combine all three models in a machine learning framework using a novel approach. Add a non-linguistic feature.

14 Language Identification (2): Test set: 3500 human-annotated queries

15 Statistical model: Character-based n-gram features (n = 1 to 7). Vocabulary from the training corpus (EuroParl). Generate a probability distribution from these counts. This can be done with the SRILM Toolkit, using Kneser-Ney discounting and interpolation.
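
To make the character n-gram idea concrete, here is a deliberately simplified sketch. It is not the SRILM setup named on the slide: it scores a query as a bag of character n-grams with add-one smoothing instead of an interpolated Kneser-Ney language model, uses n = 1 to 3 rather than 1 to 7, and trains on a few made-up sentences instead of EuroParl. The names `CharNgramModel` and `identify` are hypothetical.

```python
import math
from collections import Counter

def char_ngrams(text, n_max):
    """Yield all character n-grams of length 1..n_max, with space padding."""
    padded = f" {text} "
    for n in range(1, n_max + 1):
        for i in range(len(padded) - n + 1):
            yield padded[i:i + n]

class CharNgramModel:
    """Bag-of-n-grams scorer with add-one smoothing (a stand-in for Kneser-Ney)."""

    def __init__(self, corpus, n_max=3):
        self.n_max = n_max
        self.counts = Counter()
        for sentence in corpus:
            self.counts.update(char_ngrams(sentence.lower(), n_max))
        self.total = sum(self.counts.values())
        self.vocab = len(self.counts)

    def log_prob(self, text):
        # One extra vocabulary slot is reserved for unseen n-grams.
        denom = self.total + self.vocab + 1
        return sum(
            math.log((self.counts[g] + 1) / denom)
            for g in char_ngrams(text.lower(), self.n_max)
        )

def identify(text, models):
    """Return the language whose model gives the text the highest score."""
    return max(models, key=lambda lang: models[lang].log_prob(text))

# Toy training data, invented for illustration only.
models = {
    "en": CharNgramModel(["the house is big", "where is the station", "how to cook rice"]),
    "es": CharNgramModel(["la casa es grande", "donde esta la estacion", "como cocinar arroz"]),
}
guess_en = identify("the big station", models)
guess_es = identify("la casa grande", models)
```

The per-language scores produced by `log_prob` are also the kind of distribution from which a confidence measure could later be derived.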

16 Knowledge-based model: Word-based n-gram features (n = 1). Vocabulary from the training corpus (EuroParl). Generate a probability distribution from these counts.

17 Morphological model: Gather the affix information from the corpora in an unsupervised fashion (Harald Hammarström, 2006). Give a score to each affix.

18 Language Identification(3) Performance

19 Decision tree classification: Each model can complement the others in certain cases. Training data: the automatically annotated data set. Feature: confidence score, computed using the kurtosis measure.

20 Decision tree classification: An example: for the query "the sovereign individual", the statistical model identifies the language as English with k = 7.6 ≥ (4.47 + 1.96) = 6.43, so the confidence score of this query is "en-HIGH". The DT classifier is implemented with the Weka Machine Learning Toolkit (Witten and Frank, 2005).
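
The confidence feature in this example can be sketched as a kurtosis computation followed by a threshold test. Several things here are assumptions, because the slide does not spell them out: that 4.47 and 1.96 are a mean and a standard deviation of kurtosis values, that the Pearson (non-excess) definition of kurtosis is used, and that labels other than HIGH exist (the "LOW" and "AVERAGE" names below are hypothetical). The function names are also illustrative.

```python
def kurtosis(values):
    """Pearson kurtosis: fourth central moment divided by squared variance."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    if m2 == 0:
        raise ValueError("kurtosis is undefined for constant values")
    return m4 / (m2 ** 2)

def confidence_label(lang, k, center, spread):
    """Map a kurtosis value k to a label such as 'en-HIGH'.

    center and spread are assumed to be the mean and standard deviation
    of kurtosis values; only the HIGH case is confirmed by the slide.
    """
    if k >= center + spread:
        return f"{lang}-HIGH"
    if k <= center - spread:
        return f"{lang}-LOW"      # hypothetical label
    return f"{lang}-AVERAGE"      # hypothetical label

# The numbers from the slide's example.
label = confidence_label("en", 7.6, 4.47, 1.96)
```

A sharply peaked score distribution gives a large kurtosis, which is why a high value is read as a confident prediction.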

21 Decision tree classification: Outperforms all the models for each size on average

22 Decision tree classification: M_{l_i,l_j}: language l_i misclassified by the system as l_j
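
A small sketch of how such misclassification entries could be tallied from gold and predicted labels follows. Whether the slide's M holds raw counts or rates is not stated, so raw counts are an assumption here, and the function name and sample labels are hypothetical.

```python
from collections import Counter

def misclassification_counts(gold, predicted):
    """Count (l_i, l_j) pairs where gold language l_i was predicted as l_j."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    return Counter((g, p) for g, p in zip(gold, predicted) if g != p)

# Hypothetical labels for four queries.
gold = ["en", "es", "pt", "es"]
predicted = ["en", "pt", "pt", "pt"]
M = misclassification_counts(gold, predicted)
```

In this toy example Spanish is confused with Portuguese twice, and correct predictions are not counted at all.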

23 Non-linguistic feature: The non-linguistic feature is the language information of the country from which the query is entered. It helps the search engine guess the language. Example: the query "how to tape for plantar fasciitis" (labelled as Category-2). It is classified as a Portuguese query.

24 Non-linguistic feature: Increase the test set size to 430 queries

25 Conclusions: A completely automated method to generate a reliable data set. Built a decision tree classifier that improves the results on average. Built a second classifier that takes into account the geographical information of the users.

26 Future Work: Improve the accuracy of data generation through a more careful examination of the parameter values. Extend the number of languages in the data set. Consider other alternatives to the decision tree framework.

