Geographically Focused Collaborative Crawling Hyun Chul Lee University of Toronto & Genieknows.com Joint work with Weizheng Gao (Genieknows.com) Yingbo Miao (Genieknows.com)

Outline Introduction/Motivation Crawling Strategies Evaluation Criteria Experiments Conclusion

Evolution of Search Engines Due to the large number of web users with different search needs, current search engines are not sufficient to satisfy all needs. Search engines are evolving! Some possible evolution paths: Personalization of search engines. Examples: MyAsk, Google Personalized Search, My Yahoo Search, etc. Localization of search engines. Examples: Google Local, Yahoo Local, Citysearch, etc. Specialization of search engines. Examples: Kosmix, IMDB, Scirus, Citeseer, etc. Others (Web 2.0, multimedia, blogs, etc.)

Search Engine Localization Yahoo estimates that 20-25% of all search queries have a local component, either stated explicitly (e.g. Home Depot Boston; Washington acupuncturist) or implicitly (e.g. flowers, doctors). Source: SearchEngineWatch, Aug 3, 2004.

Local Search Engine Objective: allow the user to perform a search according to his/her keyword input as well as the geographical location of his/her interest. The location can be a city, state, or country (e.g. find restaurants in Los Angeles, CA, USA), a specific address (e.g. find Starbucks near 100 millam street houston, tx), or a point of interest (e.g. find restaurants near LAX).

Web Based Local Search Engine A precise definition of a local search engine is not possible, as numerous Internet Yellow Pages (IYP) claim to be local search engines. Certainly, a true local search engine should be Web based. Crawling of geographically-relevant deep web data is also desirable.

Motivation for geographically focused crawling The 1st step toward building a local search engine is to collect/crawl geographically sensitive pages. There are two possible approaches: (1) target pages that are geographically sensitive during crawling, or (2) crawl the Web generally, then filter out pages that are not geographically sensitive. We study this problem as part of building the Genieknows local search engine.

Outline Introduction/Motivation Crawling Strategies Evaluation Criteria Experiments Conclusion

Problem Description Geographically Sensitive Crawling: given a set of targeted locations (e.g. a list of cities), collect as many pages as possible that are geographically relevant to the given locations. For simplicity, in our experiment we assume that the targeted locations are given in the form of city-state pairs.

Basic Assumption (Example) (Diagram: a link graph with clusters of pages about Boston, pages about Houston, and non-relevant pages, with links among them.)

Basic Strategy (Single Crawling Node Case) Exploit features that might potentially lead to the desired geographically-sensitive pages, and guide the behavior of the crawler using such features. Note that similar ideas are used in topical focused crawling. (Diagram: given a page and the target location Boston, URLs are extracted and the URL matching the target location is crawled.)

Extension to the multiple crawling nodes case (Diagram: geographically-sensitive crawling nodes assigned to cities such as Boston, Chicago, and Houston work alongside general crawling nodes to crawl the Web.)

Extension to the multiple crawling nodes case (cont.) (Diagram: URL 1, about Chicago, is routed to the Chicago crawling node; URL 2, not geographically sensitive, is routed to a general crawling node.)

Extension to the multiple crawling nodes case (cont.) (Diagram: a URL extracted from Page 1 and recognized as pertaining to Houston is forwarded to the Houston crawling node.)

Crawling Strategies (URL Based) If the considered URL contains the targeted city-state pair A, assign the corresponding URL to the crawling node responsible for city-state pair A.
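The URL based assignment can be sketched as follows. This is a minimal illustration, not the paper's implementation; the targeted pairs and node names are placeholders.

```python
import re

# Hypothetical targeted city-state pairs mapped to crawling nodes
# (the paper's experiments used the top 100 US cities).
TARGETS = {
    ("boston", "ma"): "node_boston",
    ("chicago", "il"): "node_chicago",
    ("houston", "tx"): "node_houston",
}

def assign_by_url(url):
    """Return the crawling node whose city-state pair appears in the URL,
    or None if the URL is not geographically sensitive."""
    tokens = re.split(r"[^a-z0-9]+", url.lower())
    for (city, state), node in TARGETS.items():
        if city in tokens and state in tokens:
            return node
    return None

print(assign_by_url("http://www.example.com/boston-ma/restaurants"))  # node_boston
print(assign_by_url("http://www.example.com/news"))                   # None
```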

Crawling Strategies (Extended Anchor Text) Extended anchor text refers to the set of prefix and suffix tokens around the link text. If the considered extended anchor text contains the targeted city-state pair A, assign the corresponding URL to the crawling node responsible for city-state pair A. When multiple city-state pairs are found, choose the one closest to the link text.
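A sketch of the extended-anchor-text lookup, including the tie-breaking by distance to the link text. The window size and tokenization are assumptions, not the paper's parameters.

```python
def city_from_extended_anchor(tokens, anchor_start, anchor_end, targets, window=5):
    """Look for targeted city-state pairs among the tokens surrounding a link's
    anchor text; when several are found, pick the one closest to the anchor.
    `tokens` is the tokenized page text, `anchor_start`/`anchor_end` the
    inclusive token span of the link text, `targets` a set of (city, state) pairs."""
    hits = []
    lo = max(0, anchor_start - window)
    hi = min(len(tokens) - 1, anchor_end + window)
    for i in range(lo, hi):
        pair = (tokens[i].lower(), tokens[i + 1].lower())
        if pair in targets:
            # Token distance from the found pair to the anchor span.
            dist = max(anchor_start - (i + 1), i - anchor_end, 0)
            hits.append((dist, pair))
    return min(hits)[1] if hits else None

tokens = "visit our boston ma office click here for details chicago il jobs".split()
targets = {("boston", "ma"), ("chicago", "il")}
# The anchor text "click here" spans tokens 5..6; Boston is the closer pair.
print(city_from_extended_anchor(tokens, 5, 6, targets))  # ('boston', 'ma')
```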

Crawling Strategies (Full Content Based) Consider the number of times each city-state pair is found in the content, and the number of times the city name alone is found in the content. 1. Compute the probability that a page is about a city-state pair using the full content. 2. Assign all extracted URLs to the crawling node responsible for the most probable city-state pair.
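The full-content scoring can be sketched as below. The weights and the simple regex matching are illustrative assumptions; the paper's actual probability estimator is not reproduced here.

```python
import re
from collections import Counter

# Hypothetical targeted pairs; placeholders for the paper's city list.
CITY_STATES = [("boston", "ma"), ("chicago", "il"), ("houston", "tx")]

def most_probable_pair(text, pair_weight=2.0, city_weight=1.0):
    """Score each targeted city-state pair by how often the full pair and the
    bare city name appear in the page content; return the best-scoring pair,
    or None if no targeted city is mentioned."""
    text = text.lower()
    scores = Counter()
    for city, state in CITY_STATES:
        pair_hits = len(re.findall(rf"\b{city}\s*,?\s*{state}\b", text))
        city_hits = len(re.findall(rf"\b{city}\b", text))
        scores[(city, state)] = pair_weight * pair_hits + city_weight * city_hits
    pair, score = scores.most_common(1)[0]
    return pair if score > 0 else None

print(most_probable_pair("Welcome to Boston, MA. Boston has many restaurants."))
```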

Crawling Strategies (Classification Based) Classification was shown to be useful for the topical collaborative crawling strategy (Chung et al., 2002). A Naive Bayes classifier was used for simplicity; training data were obtained from DMOZ. 1. The classifier determines whether a page is relevant to a given city-state pair. 2. Assign all extracted URLs to the crawling node responsible for the most probable city-state pair.
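A minimal multinomial Naive Bayes text classifier with Laplace smoothing, in the spirit of the slide. The toy documents and labels are fabricated stand-ins for the DMOZ training data.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """Collect per-label word frequencies, label counts, and the vocabulary."""
    word_counts = defaultdict(Counter)   # label -> word frequencies
    label_counts = Counter(labels)
    vocab = set()
    for text, label in zip(docs, labels):
        words = text.lower().split()
        word_counts[label].update(words)
        vocab.update(words)
    return word_counts, label_counts, vocab

def classify(text, word_counts, label_counts, vocab):
    """Return the most probable label; words outside the vocabulary are ignored."""
    total_docs = sum(label_counts.values())
    best_label, best_logp = None, float("-inf")
    for label in label_counts:
        logp = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for w in text.lower().split():
            if w in vocab:
                logp += math.log((word_counts[label][w] + 1) /
                                 (total_words + len(vocab)))
        if logp > best_logp:
            best_label, best_logp = label, logp
    return best_label

# Toy training set standing in for the DMOZ-derived data.
docs = [
    "boston red sox fenway park",
    "boston harbor freedom trail",
    "chicago cubs wrigley field",
    "chicago loop lake michigan",
]
labels = ["boston_ma", "boston_ma", "chicago_il", "chicago_il"]
model = train_nb(docs, labels)
print(classify("hotels near fenway park in boston", *model))  # boston_ma
```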

Crawling Strategies (IP-Address Based) The hostip.info API was employed as the IP-address mapping tool. 1. Associate the IP address of the web host with the corresponding city-state pair. 2. Assign all extracted URLs to the crawling node responsible for the obtained city-state pair.
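Conceptually, the IP-based mapping is a lookup from an address range to a city-state pair. The sketch below uses a hypothetical in-memory table rather than the hostip.info service, whose API is not reproduced here.

```python
import ipaddress

# Hypothetical IP-range-to-city table standing in for the hostip.info lookup
# (the ranges below are documentation-only example networks).
IP_RANGES = [
    (ipaddress.ip_network("192.0.2.0/24"), ("boston", "ma")),
    (ipaddress.ip_network("198.51.100.0/24"), ("houston", "tx")),
]

def city_for_host_ip(ip):
    """Return the city-state pair whose range contains the host IP, else None."""
    addr = ipaddress.ip_address(ip)
    for net, pair in IP_RANGES:
        if addr in net:
            return pair
    return None

print(city_for_host_ip("192.0.2.17"))   # ('boston', 'ma')
print(city_for_host_ip("203.0.113.5"))  # None
```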

Normalization and Disambiguation of City Names Following previous research (Amitay et al. 2004, Ding et al. 2000): Aliasing: the problem of different names for the same city; the United States Postal Service (USPS) database was used for this purpose. Ambiguity: city names with different meanings; for the full content based strategy, we resolve it through the analysis of other city-state pairs found within the page; for the remaining strategies, we simply assume the name refers to the city with the largest population.
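The alias normalization and population-based disambiguation can be sketched as below. The alias and population tables are illustrative; the paper used USPS data.

```python
# Hypothetical alias and population tables (placeholders for USPS data).
ALIASES = {"nyc": "new york", "la": "los angeles", "philly": "philadelphia"}
POPULATION = {
    ("springfield", "il"): 116_000,
    ("springfield", "ma"): 153_000,
    ("springfield", "mo"): 159_000,
}

def normalize(city):
    """Map an alias to its canonical city name."""
    city = city.lower()
    return ALIASES.get(city, city)

def disambiguate(city):
    """Resolve a bare city name to the city-state pair with the largest
    population, as the non-content-based strategies do."""
    city = normalize(city)
    candidates = [p for p in POPULATION if p[0] == city]
    return max(candidates, key=POPULATION.get) if candidates else None

print(disambiguate("Springfield"))  # ('springfield', 'mo')
```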

Outline Introduction/Motivation Crawling Strategies Evaluation Criteria Experiments Conclusion

Evaluation Model Standard metrics (Cho et al. 2002): Overlap: (N - I)/N, where N is the total number of pages downloaded by the overall crawler and I is the number of unique downloaded pages. Diversity: S/N, where S is the number of unique domain names among the downloaded pages and N is the total number of downloaded pages. Communication Overhead: exchanged URLs per downloaded page.
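The standard metrics above reduce to simple ratios; the numbers in the example are illustrative, not the paper's results.

```python
def overlap(n_total, n_unique):
    """Overlap = (N - I) / N: the fraction of downloads that were duplicates."""
    return (n_total - n_unique) / n_total

def diversity(n_unique_domains, n_total):
    """Diversity = S / N: unique domain names per downloaded page."""
    return n_unique_domains / n_total

def communication_overhead(n_exchanged_urls, n_total):
    """Exchanged URLs per downloaded page."""
    return n_exchanged_urls / n_total

print(overlap(10_000_000, 9_900_000))  # 0.01
```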

Evaluation Models (Cont.) Geographically sensitive metrics use extracted geo-entities (address information) to evaluate: Geo-Coverage: the number of pages with at least one geo-entity. Geo-Focus: the number of retrieved pages that contain at least one geo-entity matching the assigned city-state pairs of the crawling node. Geo-Centrality: how central the retrieved pages are relative to the pages that contain geographic entities.
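Geo-Coverage and Geo-Focus can be computed directly from the extracted geo-entities; a minimal sketch, with a toy page-to-entities mapping as the assumed input format.

```python
def geo_coverage(pages):
    """Geo-Coverage: pages containing at least one extracted geo-entity.
    `pages` maps page id -> set of (city, state) geo-entities found in it."""
    return sum(1 for ents in pages.values() if ents)

def geo_focus(pages, assigned_pair):
    """Geo-Focus: pages containing a geo-entity matching the node's assigned pair."""
    return sum(1 for ents in pages.values() if assigned_pair in ents)

pages = {
    "p1": {("boston", "ma")},
    "p2": set(),
    "p3": {("boston", "ma"), ("houston", "tx")},
}
print(geo_coverage(pages))                 # 2
print(geo_focus(pages, ("boston", "ma")))  # 2
```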

Outline Introduction/Motivation Crawling Strategies Evaluation Criteria Experiment Conclusion

Experiment Objective: crawl pages pertinent to the top 100 US cities. Crawling nodes: 5 geographically sensitive crawling nodes and 1 general node, on 2 servers (3.2 GHz dual P4s, 1 GB RAM). Around 10 million pages were crawled for each crawling strategy. Standard hash-based crawling was also considered for the purpose of comparison.

Result (Geo-Coverage)

Result (Geo-Focus)

Result (Communication-Overhead)

Outline Introduction/Motivation Crawling Strategies Evaluation Criteria Experiment Geographic Locality

Question: how close (in graph-based distance) are geographically-sensitive pages? Define two probabilities: the probability that a pair of linked pages, chosen uniformly at random, is pertinent to the same city-state pair under the considered collaborative strategy; and the probability that a pair of unlinked pages, chosen uniformly at random, is pertinent to the same city-state pair under the considered collaborative strategy.
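Both probabilities can be estimated by sampling page pairs from the crawled graph; a minimal sketch over a toy graph (the page-to-city mapping and edges are fabricated for illustration).

```python
import random

def prob_same_city(pairs, city_of):
    """Fraction of page pairs pertinent to the same city-state pair.
    `city_of` maps page -> its city-state pair (or None if not geo-sensitive)."""
    same = sum(1 for a, b in pairs
               if city_of.get(a) is not None and city_of.get(a) == city_of.get(b))
    return same / len(pairs)

# Toy crawl graph: `linked` lists edges; unlinked pairs are sampled at random.
city_of = {"p1": "boston", "p2": "boston", "p3": "houston", "p4": None}
linked = [("p1", "p2"), ("p2", "p3"), ("p1", "p4")]
pages = list(city_of)
random.seed(0)
unlinked = [tuple(random.sample(pages, 2)) for _ in range(1000)]

print(prob_same_city(linked, city_of))    # 1/3: only (p1, p2) share a city
print(prob_same_city(unlinked, city_of))
```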

Results (Table: geographic-locality probabilities for the URL based, classification based, and full content based crawling strategies.)

Conclusion We showed the feasibility of the geographically focused collaborative crawling approach for targeting geographically-sensitive pages. We proposed several evaluation criteria for geographically focused collaborative crawling. Extended anchor text and URL are valuable features to exploit for this particular type of crawling; it would be interesting to look at other, more sophisticated features. There are many open problems related to local search, including ranking, indexing, retrieval, and crawling.