Detecting Dominant Locations from Search Queries. Lee Wang, Chuang Wang, Xing Xie, Josh Forman, Yansheng Lu, Wei-Ying Ma, Ying Li. SIGIR 2005.

2 INTRODUCTION
QDL: a query's dominant location.
A QDL is the geographical location (or locations) associated with a query in collective human knowledge.
Challenge: a location name contained in the query string may not refer to a geographical location.
Running entity extraction algorithms based on geographical dictionary look-up is not good enough.

4 RELATED WORK
Named Entity Recognition (NER): "grounding" algorithms use a gazetteer to verify geographic names, and use context information in the text to pick the correct sense of a name. Too slow.
Tagging locations for a web page: uses ZIP codes, phone numbers, languages, and HTML links.
Most related work: "Categorizing Web Queries according to Geographical Locality" classified web queries into two types, local and global. Low precision and recall.

5 Google and Yahoo!’s local search sites

6 QDL DETECTION
We use three types of information sources: queries, search results, and query logs.
Search results: text blobs (snippets) and returned web URLs (result pages).
Query log: user locations and the web pages on the result list that users clicked on.

7 QDL DETECTION

8 Detecting QDL from Queries
Search engines do their best to return the most up-to-date, relevant, and popular content and documents in the top portion of the returned results.
We developed a query tokenization algorithm that breaks a query into atomic parts (tokens) according to how the query is used in the top search results.

9 Detecting QDL from Queries
We formulate the problem as follows: for a given query Q, split it into the most probable token list TL = {t1, t2, …, tn}, that is, the one that maximizes the conditional probability Pr(TL|Q).
According to Bayes' law, we have:
$$\Pr(TL \mid Q) = \frac{\Pr(Q \mid TL)\,\Pr(TL)}{\Pr(Q)}$$

10 Detecting QDL from Queries
Pr(Q) is the same for all possible TLs, and Pr(Q|TL) equals one, so maximizing Pr(TL|Q) reduces to maximizing Pr(TL).
We estimate Pr(TL) as follows: [equation omitted in transcript]
Here TF(tj) and TF(si) stand for the frequency of token tj or si in the result snippets, m is the number of all possible tokens for a given query, and n is the number of tokens in TL.
For example, m is 15 for the query "kentucky fried chicken in seattle", which is consistent with counting every contiguous word sequence of a five-word query (5 + 4 + 3 + 2 + 1 = 15), and n is 3 if the query is split into "kentucky fried chicken | in | seattle".
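
The slide does not say how the m candidate tokens are enumerated. One reading that reproduces m = 15 for the example query is to take every contiguous word sequence. The sketch below assumes that reading; `candidate_tokens` is a hypothetical helper name, not something from the paper.

```python
def candidate_tokens(query):
    """Return every contiguous word sequence of the query (assumed definition)."""
    words = query.split()
    return [
        " ".join(words[i:j])
        for i in range(len(words))
        for j in range(i + 1, len(words) + 1)
    ]


tokens = candidate_tokens("kentucky fried chicken in seattle")
print(len(tokens))  # 15 candidates for the five-word example query
```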

11 Detecting QDL from Queries
Step 1: Submit the query to the search engine and collect a list of tokens (sub-queries) from the top result snippets it returns.
TF% is the number of occurrences of a token divided by the total number of occurrences of all tokens in the top search results.
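
The TF% definition can be sketched as follows. This is only an illustration: the snippets are made up, `tf_percent` is a hypothetical name, and counting whole-word matches with a regular expression is an assumption, since the slides do not describe how occurrences are counted.

```python
import re
from collections import Counter


def tf_percent(tokens, snippets):
    """Share of all token occurrences in the snippets that belongs to each token."""
    counts = Counter()
    for snippet in snippets:
        # Lowercase and collapse whitespace so multi-word tokens can match.
        text = " ".join(snippet.lower().split())
        for token in tokens:
            pattern = r"\b" + re.escape(token.lower()) + r"\b"
            counts[token] += len(re.findall(pattern, text))
    total = sum(counts.values())
    if total == 0:
        return {}
    # most_common() orders tokens from most to least frequent.
    return {tok: n / total for tok, n in counts.most_common() if n > 0}


# Made-up snippets standing in for real search results.
snippets = [
    "Kentucky Fried Chicken in Seattle",
    "Kentucky Fried Chicken menu",
    "Seattle restaurants",
]
ranked = tf_percent(["kentucky fried chicken", "seattle", "in"], snippets)
```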

12 Detecting QDL from Queries
Step 2: Assemble the tokens from Step 1 back into the original query, starting from the top one. A token cannot be reused in the assembly process.
For our example, we obtained the following token lists.

13 Detecting QDL from Queries
Step 3: Pick the top token list from Step 2. For our example, we pick "kentucky fried chicken | in | seattle."
Step 4: For each token in the Step 3 outcome, repeat Steps 1-3 until it cannot be broken further. For our example, we send "kentucky fried chicken" to the search engine and find that it cannot be broken further, because the first sub-token on the returned list is the input token itself.
Step 5: Output the final token list, which contains only atomic tokens and has the largest Pr(TL). For our example, the final output of the algorithm is "kentucky fried chicken | in | seattle."
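
The assembly in Steps 2 and 3 can be sketched as a greedy cover of the query. This is a simplified illustration, not the authors' implementation: it builds a single token list rather than several ranked ones, it skips the recursive check of Step 4, and both the ranking passed in and the `assemble` name are hypothetical.

```python
def assemble(query, ranked_tokens):
    """Greedily cover the query with tokens, taking the highest-ranked first."""
    words = query.split()
    covered = [False] * len(words)
    placed = []  # (start position, token)
    for token in ranked_tokens:
        parts = token.split()
        n = len(parts)
        for start in range(len(words) - n + 1):
            free = not any(covered[start:start + n])
            if free and words[start:start + n] == parts:
                covered[start:start + n] = [True] * n
                placed.append((start, token))
                break  # each token is used at most once
        if all(covered):
            break
    # Words that no ranked token covered fall back to single-word tokens.
    placed.extend((i, words[i]) for i, done in enumerate(covered) if not done)
    return [token for _, token in sorted(placed)]


ranking = ["kentucky fried chicken", "seattle", "in", "fried chicken", "kentucky"]
print(" | ".join(assemble("kentucky fried chicken in seattle", ranking)))
```

With this hypothetical ranking the phrase "kentucky fried chicken" is placed first, so "kentucky" is never considered as a standalone token.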

14 Detecting QDL from Queries
Because "kentucky" is always used together with "fried chicken" in this query's results, it cannot be a geographical location by itself.
The token "seattle" is atomic and unambiguous, so the QDL of this query is "Seattle, WA, USA."
Another advantage of our tokenization algorithm is that, because it is based entirely on live search results, search queries are always broken up according to current popular usage.

15 Detecting QDL from Query Logs
User IPs and clicked URLs are used.
For user IPs, we map them to user locations.
We set a minimum number of log items that a query must have before we calculate its QDL-log.
For clicked URLs, we retrieve their content and merge it into one page. Based on our gazetteer, we extract all location names from the page and then calculate the dominant location.
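
A minimal sketch of the clicked-URL path, under several assumptions that the slides do not confirm: the gazetteer is a tiny hypothetical dictionary, each clicked page is treated as one log item, and the "dominant" location is simply the most frequent gazetteer name in the merged text.

```python
import re
from collections import Counter

# Hypothetical gazetteer: surface name -> canonical location.
GAZETTEER = {
    "seattle": "Seattle, WA, USA",
    "portland": "Portland, OR, USA",
}


def location_frequencies(pages, gazetteer=GAZETTEER):
    """Count gazetteer names in the merged text of all pages."""
    merged = " ".join(pages).lower()
    freq = Counter()
    for name, canonical in gazetteer.items():
        pattern = r"\b" + re.escape(name) + r"\b"
        freq[canonical] += len(re.findall(pattern, merged))
    return freq


def qdl_log_url(pages, min_log_items=3):
    """Most frequent location, or None when there is too little evidence."""
    if len(pages) < min_log_items:
        return None
    freq = location_frequencies(pages)
    if not freq:
        return None
    location, count = freq.most_common(1)[0]
    return location if count > 0 else None
```

The `min_log_items` threshold mirrors the minimum number of log items mentioned on the slide; its value here is arbitrary.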

16 Detecting QDL from Query Logs
We combine the QDLs from user locations (QDL-log-IP) and from clicked URLs (QDL-log-URL): [equation omitted in transcript]
Here 0 < α < 1 and l is the location node. f(l, QDL-log-URL) stands for the frequency of l in the clicked pages, while f(l, QDL-log-IP) represents its frequency in the IP locations.
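
The combining formula is missing from the transcript. The sketch below assumes a plain linear interpolation weighted by α, which fits the stated constraint 0 < α < 1 but is not confirmed by the slides. Which term receives α is also an assumption, as is the `combine_log_frequencies` name.

```python
def combine_log_frequencies(freq_url, freq_ip, alpha=0.5):
    """Assumed form: alpha * f(l, QDL-log-URL) + (1 - alpha) * f(l, QDL-log-IP)."""
    if not 0 < alpha < 1:
        raise ValueError("alpha must lie strictly between 0 and 1")
    locations = set(freq_url) | set(freq_ip)
    return {
        loc: alpha * freq_url.get(loc, 0) + (1 - alpha) * freq_ip.get(loc, 0)
        for loc in locations
    }


combined = combine_log_frequencies({"Seattle": 4}, {"Seattle": 2, "Portland": 6}, alpha=0.25)
```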

17 Detecting QDL from Search Results
We merge the snippets or page content from the top search results into one page and use the same method to calculate QDL-result.

18 EXPERIMENTAL RESULTS
Data and Settings:
Our data set is a recent MSN Search log covering a 30-day period.
We randomly selected 10,000 unique US English queries.
Geographical thesauruses: location entities are collected from various sources, including USA ZIP codes, telephone numbers, and geographical names.
In this paper, queries with locations outside the USA are defined as having no QDL.
The search results and snippets are obtained by sending the queries to the MSN search engine.

19 EXPERIMENTAL RESULTS
Evaluation Methodology:
We evaluated the outcome of our QDL detection solution using labeled queries.
A computed outcome for a query is said to be correct only when all of its QDLs exactly match the labeled results, or when both the computed and the labeled QDLs are null.
We also report the average running time per query.
We separate the computational time cost from the page downloading time.
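
The correctness criterion above amounts to an exact set match. A minimal sketch, assuming QDLs are represented as lists of location strings and that `None` or an empty list both mean "no QDL" (a representation chosen here for illustration):

```python
def is_correct(predicted, labeled):
    """True when the QDL sets match exactly, including the both-empty case."""
    return set(predicted or ()) == set(labeled or ())


def accuracy(pairs):
    """Fraction of (predicted, labeled) pairs judged correct."""
    pairs = list(pairs)
    return sum(is_correct(p, l) for p, l in pairs) / len(pairs)


results = [
    (["Seattle, WA, USA"], ["Seattle, WA, USA"]),          # exact match
    (None, []),                                             # both null
    (["Seattle, WA, USA", "Portland, OR, USA"], ["Seattle, WA, USA"]),  # extra QDL
]
score = accuracy(results)
```

Note that a partial match, as in the third pair, counts as an error under this criterion.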

20 Tuning Parameters

21 Tuning Parameters

22 EXPERIMENTAL RESULTS CT stands for Computational Time in milliseconds (excluding page downloading time), and DP stands for number of Downloaded Pages per query.

23 QUERY TYPES
                         no QDL    have QDL
No location keywords     Type-1    Type-3
Have location keywords   Type-4    Type-2

24 Error Distributions
QDL-combined (using QDL-query and QDL-log-combined)
EFP: false positives (returns a QDL when none exists)
EFN: false negatives (returns no QDL when there is one)
ELoc: correctly detects that there is a QDL but returns one different from the labeled QDL
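
The three error types can be expressed as a small classifier over predicted and labeled QDLs. This is a sketch under the same assumed representation as before (lists of location strings, with `None` or an empty list meaning no QDL); `error_type` is a hypothetical name.

```python
def error_type(predicted, labeled):
    """Return None for a correct result, else 'EFP', 'EFN', or 'ELoc'."""
    pred, gold = set(predicted or ()), set(labeled or ())
    if pred == gold:
        return None
    if pred and not gold:
        return "EFP"   # a QDL was returned although none exists
    if gold and not pred:
        return "EFN"   # no QDL was returned although one exists
    return "ELoc"      # a QDL exists and one was returned, but they differ
```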

25 CONCLUSIONS AND FUTURE WORK
Knowing a query's dominant location will effectively improve local search relevance.
We presented a novel solution for detecting dominant locations from search queries.
Our outcome will always be up to date, capturing the correct and current locations for queries.
Experimental results show that our QDL detection algorithms achieved high performance in both accuracy and speed.
Future work: we will measure a query's local search intention. Knowing that a query is local but does not have a QDL is also very important for improving search relevance.