Web Search Engines and Information Retrieval on the World-Wide Web Torsten Suel CIS Department Overview: introduction.


1 Web Search Engines and Information Retrieval on the World-Wide Web
Torsten Suel, CIS Department
suel@poly.edu
http://cis.poly.edu/suel
Overview:
  - introduction and motivation
  - research: improving cluster-based search engines
  - research: future peer-to-peer search engine architectures

2 1. Introduction and Motivation
Web search engines

3 1. Introduction and Motivation (cont.)
Basic structure of a search engine:
[Diagram: Crawler -> disks -> indexing -> Index; Search.com receives the query “computer” and looks it up in the Index]
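The indexing and look-up steps in the diagram can be illustrated with a minimal inverted index. This is only a sketch under stated assumptions: the names `build_index` and `look_up` and the sample pages are hypothetical and do not come from the slides, and a real engine would use disk-resident, compressed index structures rather than an in-memory dictionary.

```python
from collections import defaultdict


def build_index(docs):
    """Map each term to the sorted list of IDs of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def look_up(index, query):
    """Return the IDs of documents that contain every term of the query."""
    terms = query.lower().split()
    if not terms:
        return []
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        result &= set(index.get(term, []))
    return sorted(result)


# Hypothetical pages standing in for what a crawler stored on disk.
docs = {
    1: "computer science department",
    2: "web search engines",
    3: "computer networks and web protocols",
}
index = build_index(docs)
print(look_up(index, "computer"))  # prints [1, 3]
```

The index is built once from the crawled pages, so answering a query only touches the posting lists of the query terms and not the pages themselves.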

4 1. Introduction and Motivation (cont.)
Challenges for search engines:
coverage (need to cover large part of the web): need to crawl and store massive data sets
good ranking (in the case of broad queries): smart information retrieval techniques
freshness (need to update content): frequent recrawling of content
user load (up to 10000 queries/sec - Google): many queries on massive data
manipulation (sites want to be listed first): most techniques will be exploited quickly

5 1. Introduction and Motivation (cont.)
more than 3 billion web pages and 10 million web sites
need to crawl, store, and process terabytes of data
10000 queries / second (Google)
cluster of more than 5000 Linux servers (Google)
“planetary-scale web service” (google, hotmail, yahoo, aol web caches, akamai)
proprietary code and secret recipes

6 1. Introduction and Motivation (cont.)
Other types of web search tools:
Web directories (yahoo, open directory project)
Specialized search engines (cora, citeseer, achoo, findlaw)
Local search engines (for one site)
Meta search engines (dogpile, mamma, search.com)
Personal search assistants (alexa, google toolbar)
Image search (ditto, visoo)
Database search (completeplanet, brightplanet)

7 1. Introduction and Motivation (cont.)
Data collection, extraction & mining tools:
trademark and copyright enforcement
  - track down mp3 and video files
  - track down images with logos (Cobion)
comparison shopping and auction bots
competitive intelligence
national security: monitoring certain websites
Example: Whizbang job database:
  - collects job announcements on company web sites
  - focused crawling to track down job announcements
  - sorts job announcements by type, location, etc.

8 1. Introduction and Motivation (cont.)
[Diagram labels: algorithms, systems, information retrieval, databases, machine learning, natural language processing, AI]

9 2. Cluster-Based Search Engines
Research Challenges:
efficiency and scaling with query load
  - per-node performance
  - scaling cluster size
data size and scaling with the web
  - data acquisition: crawling and refresh
  - index size and performance
  - index updates
better ranking for improved results
  - link-based ranking
  - topic- and context-specific ranking

10 Polybot crawler: (with Vlad Shkapenyuk)
scalable web crawler
runs on cluster of servers
300 pages/sec (and beyond)

11 Storage and Indexing: (Alex Okulov and Xiaohui Long)
storing and indexing terabytes on network of workstations
fast compression techniques for storage
index performance and index updates
index partitioning
[Diagram labels: high-speed LAN or SAN; Linux servers with several disks each]
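The slide does not say which compression techniques are meant. As one widely used illustration, and not necessarily the method used in this project, the sketch below stores a sorted posting list as gaps between document IDs and encodes each gap with variable-byte coding. All function names here are hypothetical.

```python
def vbyte_encode(numbers):
    """Encode non-negative integers, 7 bits per byte, low-order bits first."""
    out = bytearray()
    for n in numbers:
        while n >= 128:
            out.append(n & 0x7F)
            n >>= 7
        out.append(n | 0x80)  # the high bit marks the last byte of a number
    return bytes(out)


def vbyte_decode(data):
    """Invert vbyte_encode."""
    numbers, n, shift = [], 0, 0
    for b in data:
        if b & 0x80:
            n |= (b & 0x7F) << shift
            numbers.append(n)
            n, shift = 0, 0
        else:
            n |= b << shift
            shift += 7
    return numbers


def compress_postings(doc_ids):
    """Compress a sorted list of document IDs as variable-byte coded gaps."""
    if not doc_ids:
        return b""
    gaps = [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]
    return vbyte_encode(gaps)


def decompress_postings(data):
    """Rebuild the document IDs by summing the decoded gaps."""
    doc_ids, total = [], 0
    for gap in vbyte_decode(data):
        total += gap
        doc_ids.append(total)
    return doc_ids


postings = [3, 7, 300, 100000]
packed = compress_postings(postings)
assert decompress_postings(packed) == postings
```

Gaps between neighboring document IDs tend to be much smaller than the IDs themselves, so most of them fit in one or two bytes, and decoding needs only shifts and masks, which keeps decompression fast.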

12 Link-based ranking (Yenyu Chen and Qingqing Gan)
PageRank (Brin & Page / Google): “significance of a page depends on significance of those referencing it”
improving link-based ranking
integration of term- and link-based methods
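The quoted idea can be made concrete with a small power-iteration sketch of the textbook PageRank formulation. This is not the improved ranking studied in the project. The damping factor of 0.85, the iteration count, and the toy link graph are assumptions chosen for illustration, and the sketch assumes every link target also appears as a key of `links`.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively compute a rank for each page in a {page: [outlinks]} graph."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small base share regardless of its in-links.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if outlinks:
                # A page passes its rank on in equal parts to the pages it links to.
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # A page without outlinks spreads its rank evenly over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: "c" is referenced by three other pages.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(links)
print(max(ranks, key=ranks.get))  # prints c
```

Page "a" ends up ranked second even though only one page links to it, because that one page is the highly ranked "c". This is the sense in which significance depends on the significance of the referencing pages.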

13 3. Peer-to-peer Search Engine Architectures
Future Search Engines and Search Tools:
expect powerful user interfaces beyond browser
  - browsing assistants
  - search and navigation tools
many more search engine accesses
most access programmatic in nature
idea: split search engine into upper and lower tier
  - lower tier: crawling, indexing, index queries (dumb, big data)
  - upper tier: ranking, interface, analysis (smart stuff)
idea: lower layer as highly distributed substrate to support search and navigation tools
  - open and agnostic “let a thousand flowers bloom”
  - scalable “let a million queries fly”
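One way to picture the proposed split is a lower tier that only stores postings and answers index queries, with ranking kept entirely in an upper tier built on top of it. The sketch below runs in a single process and uses hypothetical class and method names (`LowerTier`, `UpperTier`, `index_query`, `search`). The slides do not specify an interface, and the term-frequency scoring is only a placeholder for whatever ranking method an upper tier chooses.

```python
from collections import Counter


class LowerTier:
    """Indexing and index queries only; no ranking logic lives here."""

    def __init__(self):
        self.postings = {}  # term -> {doc_id: term frequency}

    def add_document(self, doc_id, text):
        for term, tf in Counter(text.lower().split()).items():
            self.postings.setdefault(term, {})[doc_id] = tf

    def index_query(self, term):
        """Return a copy of the raw postings for one term."""
        return dict(self.postings.get(term.lower(), {}))


class UpperTier:
    """Ranking and interface; any IR method could be substituted here."""

    def __init__(self, lower):
        self.lower = lower

    def search(self, query, top_k=10):
        scores = Counter()
        for term in query.split():
            for doc_id, tf in self.lower.index_query(term).items():
                scores[doc_id] += tf  # placeholder score: summed term frequency
        # Highest score first; ties are broken by document ID.
        ordered = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
        return [doc_id for doc_id, _ in ordered[:top_k]]


lower = LowerTier()
lower.add_document(1, "web search web")
lower.add_document(2, "search engine")
lower.add_document(3, "peer to peer")
print(UpperTier(lower).search("web search"))  # prints [1, 2]
```

Because `UpperTier` depends only on `index_query`, several different search tools could share one lower tier, which is the "open and agnostic" property the slide asks for.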

14 P2P web search architecture:
thousands of powerful machines all over the internet
machines can join or leave
agnostic: can implement many IR methods on top
[Diagram labels: four boxes, each labeled "search engine"]

15 West Exploration and Search Technology Lab:
about 10 grad and undergrad students
more information: http://cis.poly.edu/westlab
courses on web search, IR, web protocols
Showcase slides at http://cis.poly.edu/showcase/

