
1 Automatically Identifying Localizable Queries
Michael J. Welch, Junghoo Cho (University of California, Los Angeles), SIGIR 2008
Presented by Nam, Kwang-hyun
Intelligent Database Systems Lab, Center for E-Business Technology
School of Computer Science & Engineering, Seoul National University, Seoul, Korea

2 Contents
Copyright © 2009 by CEBT
- Introduction
- Motivation
- Our Approach
  – Identify candidate localizable queries
  – Select a set of relevant features
  – Train and evaluate supervised classifier performance
- Evaluation
  – Individual Classifiers
  – Ensemble Classifiers
- Conclusion and Future Work
- Discussion
IDS Lab Seminar

3 Introduction
- Typical queries are often insufficient to fully specify a user's information need
- Localizable queries: some queries are location sensitive
  – "italian restaurant" -> "[city] italian restaurant"
  – "courthouse" -> "[county] courthouse"
  – "drivers license" -> "[state] drivers license"
  – They are submitted by a user with the goal of finding information or services relevant to the user's current location
- Our task: identify the queries that contain locations as contextual modifiers

4 Motivation
Why automatically localize?
- Reduce the burden on the user
  – No special "local" or "mobile" site needed
- Improve search result relevance
  – Not all information is relevant to every user
- Increase clickthrough rate
- Improve local sponsored content matching

5 Motivation
- A significant fraction of queries are localizable: roughly 30%
  – But users explicitly localize them only about half of the time
  – As a result, about 16% of queries would benefit from automatic localization
- Users agree on which queries are localizable
  – Queries for goods and services, e.g. "food supplies", "home health care providers"
  – But "calories coffee" and "eye chart" are not

6 Our Approach
- Identify candidate localizable queries
- Select a set of relevant features
- Train and evaluate supervised classifier performance

7 Identifying Base Queries
- Queries are short and unformatted
- Use string matching
  – Compare against locations of interest, using U.S. Census Bureau data
  – Extract the base query, where the matched portion of text is tagged with the detected location type (state, county, or city)
- To ensure accuracy, false positives are filtered out later by the classifier
- Simple, yet effective
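The string-matching step above can be sketched as follows. This is a minimal illustration: the tiny `LOCATIONS` gazetteer is a hypothetical stand-in for the U.S. Census Bureau data, and the greedy longest-match scan is an assumption rather than the paper's exact algorithm.

```python
# Tiny hypothetical gazetteer standing in for U.S. Census Bureau data.
LOCATIONS = {
    "malibu": "city",
    "california": "state",
    "los angeles county": "county",
}

def tag_locations(query):
    """Return (base_query, tags): the query with matched location strings
    removed, plus a "type:name" tag for each detected location."""
    tokens = query.lower().split()
    tags = []
    base_tokens = []
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest multi-word location match starting at position i.
        for j in range(len(tokens), i, -1):
            candidate = " ".join(tokens[i:j])
            if candidate in LOCATIONS:
                tags.append(f"{LOCATIONS[candidate]}:{candidate}")
                i = j
                matched = True
                break
        if not matched:
            base_tokens.append(tokens[i])
            i += 1
    return " ".join(base_tokens), tags

base, tags = tag_locations("public libraries in malibu california")
# base == "public libraries in", tags == ["city:malibu", "state:california"]
```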

8 Example: Identifying Base Queries
Query: "public libraries in malibu california"
- Match "malibu" (city:malibu) -> base "public libraries in california"
- Match "california" (state:california) -> base "public libraries in malibu"
- Match both (city:malibu, state:california) -> base "public libraries in"

9 Example: Identifying Base Queries
- Three distinct base queries
- Remove stop words and group by base
- Allows us to compute aggregate statistics

Base                        | Tag
public libraries california | city:malibu
public libraries malibu     | state:california
public libraries            | city:malibu, state:california
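The grouping step in this example can be sketched as follows; the stop-word list and the tagged log entries are hypothetical illustrations, not the paper's data.

```python
from collections import Counter, defaultdict

# Hypothetical stop-word list for illustration.
STOP_WORDS = {"in", "the", "of", "at", "a"}

def base_form(tokens):
    """Drop stop words so variants of the same base group together."""
    return " ".join(t for t in tokens if t not in STOP_WORDS)

# (base query tokens, location tag) pairs as produced by the tagging step;
# hypothetical log entries.
tagged_log = [
    (["public", "libraries", "in", "california"], "city:malibu"),
    (["public", "libraries", "in", "malibu"], "state:california"),
    (["public", "libraries", "in"], "city:malibu, state:california"),
    (["public", "libraries", "in", "california"], "city:malibu"),
]

# Group by stop-word-free base and count each location tag per base,
# giving the aggregate statistics the features are built from.
tag_counts = defaultdict(Counter)
for tokens, tag in tagged_log:
    tag_counts[base_form(tokens)][tag] += 1

# tag_counts["public libraries california"]["city:malibu"] == 2
```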

10 Our Approach
- Identify candidate localizable queries
- Select a set of relevant features
- Train and evaluate supervised classifier performance

11 Distinguishing Features
Hypothesis: localizable queries should
- Be explicitly localized by some users
- Occur several times, from different users
- Occur with several different locations, each with about equal probability

12 Localization Ratio
- Users vote for the localizability of query q_i by contextualizing it with a location l
- r_i = Q_i(L) / Q_i
  – r_i : localization ratio for q_i, r_i ∈ [0, 1]
  – Q_i : the count of all instances of q_i
  – Q_i(L) : the count of all instances of q_i tagged with some location l ∈ L
- Drawbacks
  – Susceptible to small sample sizes
  – Unable to identify false positives resulting from incorrectly tagged locations
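From the definitions above, the ratio works out to r_i = Q_i(L) / Q_i, the fraction of a base query's instances that users explicitly localized. A minimal sketch, using a hypothetical per-instance log format:

```python
def localization_ratio(instances):
    """instances: log entries for one base query q_i, each a dict with a
    'location' key (None when the instance was not localized).
    Returns r_i = Q_i(L) / Q_i, which lies in [0, 1]."""
    total = len(instances)                                        # Q_i
    localized = sum(1 for inst in instances if inst["location"])  # Q_i(L)
    return localized / total if total else 0.0

log = [
    {"location": "city:malibu"},
    {"location": None},
    {"location": "state:california"},
    {"location": None},
]
# localization_ratio(log) == 0.5
```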

13 Location Distribution
- Informally: given an instance of any localized query q_l with base q_b, the probability that q_l contains location l is approximately equal across all locations that occur with q_b
- To estimate the distribution, we calculate several measures: the mean, median, min, max, and standard deviation of the occurrence counts
  – q_l : localized query
  – q_b : base query
  – L(q_b) : the set of location tags

14 Location Distribution
The "fried chicken" problem: the location counts are heavily skewed because "kentucky fried chicken" is a brand name, not a genuine localization.

Tag                   | Count
city:chester          | 6
city:colorado springs | 1
city:cook             | 1
city:crown            | 1
city:lousiana         | 4
city:louisville       | 2
city:rice             | 2
city:waxahachie       | 1
state:kentucky        | 163
state:louisiana       | 4
state:maryland        | 2
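The distribution measures can be computed directly from the per-location counts; a minimal sketch using a subset of the "fried chicken" counts above:

```python
import statistics

def location_distribution_features(tag_counts):
    """tag_counts: mapping of location tag -> occurrence count for one
    base query. Returns the summary statistics used as features:
    mean, median, min, max, and standard deviation."""
    counts = list(tag_counts.values())
    return {
        "mean": statistics.mean(counts),
        "median": statistics.median(counts),
        "min": min(counts),
        "max": max(counts),
        "stdev": statistics.stdev(counts) if len(counts) > 1 else 0.0,
    }

# A few of the counts for the base query "fried chicken" (table above):
fried_chicken = {"city:chester": 6, "state:kentucky": 163, "state:louisiana": 4}
feats = location_distribution_features(fried_chicken)
# A max far above the median signals a skewed distribution, i.e. the
# dominant "location" is really part of a brand name.
```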

15 Clickthrough Rates
- Assumption: a greater clickthrough rate is indicative of higher user satisfaction
  – T. Joachims et al., "Accurately Interpreting Clickthrough Data as Implicit Feedback", SIGIR '05
- Calculated clickthrough rates for both the base query and its localized forms, using a binary clickthrough function
- The clickthrough rate for localized instances was 17% higher than for nonlocalized instances
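A binary clickthrough rate of the kind described can be sketched as follows; the log-entry format is a hypothetical assumption.

```python
def clickthrough_rate(instances):
    """Binary clickthrough: an instance counts as 1 if it received at
    least one click, 0 otherwise. Returns the fraction of instances
    with a click."""
    if not instances:
        return 0.0
    clicked = sum(1 for inst in instances if inst["clicks"] > 0)
    return clicked / len(instances)

base_instances = [{"clicks": 0}, {"clicks": 2}, {"clicks": 0}, {"clicks": 1}]
# clickthrough_rate(base_instances) == 0.5
```

Comparing this rate for a base query against its localized forms gives the feature the slide refers to.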

16 Our Approach
- Identify candidate localizable queries
- Select a set of relevant features
- Train and evaluate supervised classifier performance

17 Classifier Training Data
- Selected a random sample of 200 base queries generated by the tagging step
- Filtered out base queries where
  – n_L <= 1 (only one distinct location modifier)
  – u_q = 1 (only issued by a single user)
  – q = 0 (the base form was never issued to the search engine)
- From the remaining 102 queries
  – 48 positive (localizable) examples
  – 54 negative (non-localizable) examples
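The three filters can be sketched as follows; the per-base record format (`n_L`, `u_q`, `q` keys) is a hypothetical encoding of the counts named on the slide.

```python
def keep_for_training(stats):
    """Apply the slide's filters: drop bases with at most one distinct
    location modifier (n_L <= 1), issued by only a single user
    (u_q == 1), or whose base form was never issued on its own (q == 0)."""
    return stats["n_L"] > 1 and stats["u_q"] > 1 and stats["q"] > 0

sample = [
    {"n_L": 5, "u_q": 12, "q": 40},  # kept
    {"n_L": 1, "u_q": 12, "q": 40},  # dropped: one location modifier
    {"n_L": 5, "u_q": 1,  "q": 40},  # dropped: single user
    {"n_L": 5, "u_q": 12, "q": 0},   # dropped: base never issued
]
kept = [s for s in sample if keep_for_training(s)]
# len(kept) == 1
```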

18 Evaluation Setup
- Evaluated supervised classifiers on precision and recall using 10-fold cross validation
  – Precision: accuracy of queries classified as localizable
  – Recall: percent of localizable queries identified
- Focused attention on positive precision
  – False positives are more harmful than false negatives
  – Recall scores account for manual filtering
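Precision and recall over the positive ("localizable") class can be computed as follows; the example labels are illustrative only.

```python
def precision_recall(predicted, actual):
    """predicted, actual: parallel lists of booleans, where True means
    'localizable' (the positive class)."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(a and not p for p, a in zip(predicted, actual))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

pred = [True, True, False, False, True]
gold = [True, False, True, False, True]
p, r = precision_recall(pred, gold)
# p == 2/3 (one false positive), r == 2/3 (one missed localizable query)
```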

19 Individual Classifiers
- Naïve Bayes
  – The Gaussian assumption doesn't hold for all features, so a kernel-based naïve Bayes classifier is used
- Decision Trees
  – Emphasized localization ratio, location distribution measures, and clickthrough rates

Classifier                                  | Precision | Recall
Naïve Bayes                                 | 64%       | 43%
Decision Tree (Information Gain)            | 67%       | 57%
Decision Tree (Normalized Information Gain) | 64%       | 56%
Decision Tree (Gini Coefficient)            | 68%       | 51%

20 Individual Classifiers
- SVM (Support Vector Machine)
  – A family of supervised learning methods used for classification and regression
  – An improvement over NB and DT, but opaque
- Neural Network
  – The best individual classifier, but also opaque

Classifier     | Precision | Recall
SVM            | 75%       | 62%
Neural Network | 85%       | 52%

21 Ensemble Classifiers
- Observation: false positive classifications didn't fully overlap across the individual classifiers
- Combined DT, SVM, and NN using a majority voting scheme

Classifier | Precision | Recall
Combined   | 94%       | 46%
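The majority voting scheme can be sketched as follows: a query is classified positive only when at least two of the three classifiers agree, which is why non-overlapping false positives get filtered out.

```python
def majority_vote(dt_pred, svm_pred, nn_pred):
    """Combine three binary classifier outputs (True = localizable)
    by simple majority vote."""
    return [sum(votes) >= 2 for votes in zip(dt_pred, svm_pred, nn_pred)]

# Illustrative predictions for four queries from the three classifiers.
dt  = [True,  True,  False, True]
svm = [True,  False, False, True]
nn  = [False, False, True,  True]
# majority_vote(dt, svm, nn) == [True, False, False, True]
# Queries 2 and 3 are rejected: each was a "false positive" for only
# one classifier, so the majority outvotes it.
```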

22 Conclusion
- A method for classifying queries as localizable
  – Scalable, language-independent tagging
  – Determined useful features for classification
  – Demonstrated that simple components can make a highly accurate system
- Exploited variation among classifiers by applying majority voting

23 Future Work
- Optimize feature computation for real-time use
  – Many features fit into the MapReduce framework
- Investigate using dynamic features
  – Updating classifier models
  – Explicit feedback loops
- Generalize the definition of "location"
  – Landmarks, relative locations, GPS
- Integration with the search system

24 Discussion
- Pros
  – An interesting problem, helpful for web search
  – Good performance
- Cons
  – The paper lacks some content needed to understand it
    – One of the equations is omitted
    – No explanation of the terms
  – No explanation of why 'localizable' is treated as the 'positive' class
  – False positives

