Web Search Engines and Information Retrieval on the World-Wide Web

Presentation transcript:

Web Search Engines and Information Retrieval on the World-Wide Web
Torsten Suel, CIS Department
Overview:
- introduction and motivation
- research: improving cluster-based search engines
- research: future peer-to-peer search engine architectures

1. Introduction and Motivation
Web search engines:

1. Introduction and Motivation (cont.)
Basic structure of a search engine:
[Figure labels: Crawler, disks, indexing, Index, Search.com, Query: “computer”, look up]
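The crawl, index, and look-up pipeline on this slide can be illustrated with a minimal sketch. It assumes a toy in-memory inverted index; the `pages` dictionary and the function names `build_index` and `look_up` are hypothetical and do not come from the slides.

```python
from collections import defaultdict


def build_index(pages):
    """Map each lowercase term to the sorted list of page ids that contain it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for term in text.lower().split():
            index[term].add(page_id)
    return {term: sorted(ids) for term, ids in index.items()}


def look_up(index, query):
    """Return the ids of pages that contain every term of the query."""
    terms = query.lower().split()
    if not terms:
        return []
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        result &= set(index.get(term, []))
    return sorted(result)


# Hypothetical pages standing in for what a crawler would have stored on disk.
pages = {
    1: "computer science department",
    2: "buy a new computer",
    3: "web search engines",
}
index = build_index(pages)
print(look_up(index, "computer"))
```

A real engine would keep the index on disk and rank the matching pages; this sketch only returns the unranked set of matches.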

1. Introduction and Motivation (cont.)
Challenges for search engines:
- coverage (need to cover large part of the web): need to crawl and store massive data sets
- good ranking (in the case of broad queries): smart information retrieval techniques
- freshness (need to update content): frequent recrawling of content
- user load (up to queries/sec - Google): many queries on massive data
- manipulation (sites want to be listed first): most techniques will be exploited quickly

1. Introduction and Motivation (cont.)
- more than 3 billion web pages and 10 million web sites
- need to crawl, store, and process terabytes of data
- queries / second (Google)
- cluster of more than 5000 Linux servers (Google)
- “planetary-scale web service” (google, hotmail, yahoo, aol web caches, akamai)
- proprietary code and secret recipes

1. Introduction and Motivation (cont.)
Other types of web search tools:
- Web directories (yahoo, open directory project)
- Specialized search engines (cora, citeseer, achoo, findlaw)
- Local search engines (for one site)
- Meta search engines (dogpile, mamma, search.com)
- Personal search assistants (alexa, google toolbar)
- Image search (ditto, visoo)
- Database search (completeplanet, brightplanet)

1. Introduction and Motivation (cont.)
Data collection, extraction & mining tools:
- trademark and copyright enforcement
  - track down mp3 and video files
  - track down images with logos (Cobion)
- comparison shopping and auction bots
- competitive intelligence
- national security: monitoring certain websites
Example: Whizbang job database
  - collects job announcements on company web sites
  - focused crawling to track down job announcements
  - sorts job announcements by type, locations, etc.

1. Introduction and Motivation (cont.)
- algorithms
- systems
- information retrieval
- databases
- machine learning
- natural language processing
- AI

2. Cluster-Based Search Engines
Research Challenges:
- efficiency and scaling with query load
  - per-node performance
  - scaling cluster size
- data size and scaling with the web
  - data acquisition: crawling and refresh
  - index size and performance
  - index updates
- better ranking for improved results
  - link-based ranking
  - topic- and context-specific ranking

Polybot crawler: (with Vlad Shkapenyuk)
- scalable web crawler
- runs on cluster of servers
- 300 pages/sec (and beyond)

Storage and Indexing: (Alex Okulov and Xiaohui Long)
- storing and indexing terabytes on network of workstations
- fast compression techniques for storage
- index performance and index updates
- index partitioning
[Figure labels: high-speed LAN or SAN; Linux servers with several disks each]
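The slide does not say which compression techniques are used. As one common illustration of fast index compression, the sketch below stores a sorted posting list as gaps between document ids and variable-byte encodes each gap. The choice of variable-byte coding and the names `encode_postings` and `decode_postings` are assumptions, not the lab's method.

```python
def encode_postings(doc_ids):
    """Delta-encode a sorted list of doc ids, then variable-byte encode each gap."""
    out = bytearray()
    previous = 0
    for doc_id in doc_ids:
        gap = doc_id - previous
        previous = doc_id
        # Emit 7 bits per byte, low-order bits first.
        # A set high bit means "more bytes of this gap follow".
        while gap >= 128:
            out.append((gap & 0x7F) | 0x80)
            gap >>= 7
        out.append(gap)
    return bytes(out)


def decode_postings(data):
    """Invert encode_postings: rebuild the sorted list of doc ids."""
    doc_ids = []
    current = 0  # running doc id (sum of the gaps decoded so far)
    gap = 0
    shift = 0
    for byte in data:
        gap |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            current += gap
            doc_ids.append(current)
            gap = 0
            shift = 0
    return doc_ids


postings = [3, 7, 300, 100000]
encoded = encode_postings(postings)
print(len(encoded), decode_postings(encoded))
```

Small gaps take a single byte, so dense posting lists shrink well below four bytes per entry, and decoding needs only shifts and masks.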

Link-based ranking (Yenyu Chen and Qingqing Gan)
- PageRank (Brin & Page/Google): “significance of a page depends on significance of those referencing it”
- improving link-based ranking
- integration of term- and link-based methods
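The quoted idea behind PageRank can be made concrete with a small power-iteration sketch. It assumes the textbook formulation with a damping factor of 0.85 and uniform redistribution of rank from pages without outgoing links. The function name, parameters, and example graph are illustrative and not taken from the slides.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Compute PageRank scores by power iteration.

    links maps each page to the list of pages it links to.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on to the pages it references.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical three-page web: a and b both reference c, and c references a.
ranks = pagerank({"a": ["c"], "b": ["c"], "c": ["a"]})
print(sorted(ranks, key=ranks.get, reverse=True))
```

In this example page c ranks highest because two pages reference it, and a outranks b because a is referenced by the significant page c.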

3. Peer-to-peer Search Engine Architectures
Future Search Engines and Search Tools:
- expect powerful user interfaces beyond browser
  - browsing assistants
  - search and navigation tools
- many more search engine accesses
- most access programmatic in nature
- idea: split search engine into upper and lower tier
  - lower tier: crawling, indexing, index queries (dumb, big data)
  - upper tier: ranking, interface, analysis (smart stuff)
- idea: lower layer as highly distributed substrate to support search and navigation tools
  - open and agnostic: “let a thousand flowers bloom”
  - scalable: “let a million queries fly”
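The two-tier split can be sketched as follows. The lower tier only stores postings and answers raw index queries, while an upper-tier tool decides how to rank. The class `LowerTier`, its methods, and the term-frequency scoring are hypothetical choices made for illustration.

```python
class LowerTier:
    """Dumb, big-data layer: stores postings and answers raw index queries."""

    def __init__(self):
        self.postings = {}  # term -> {doc_id: term frequency}

    def index_document(self, doc_id, text):
        for term in text.lower().split():
            by_doc = self.postings.setdefault(term, {})
            by_doc[doc_id] = by_doc.get(doc_id, 0) + 1

    def index_query(self, term):
        """Return an unranked copy of the postings for one term."""
        return dict(self.postings.get(term.lower(), {}))


def rank_by_term_frequency(lower, query):
    """One possible upper-tier tool: score each document by summed term frequency."""
    scores = {}
    for term in query.split():
        for doc_id, frequency in lower.index_query(term).items():
            scores[doc_id] = scores.get(doc_id, 0) + frequency
    # Highest score first; ties broken by document id for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


lower = LowerTier()
lower.index_document("d1", "web search web crawler")
lower.index_document("d2", "web server")
print(rank_by_term_frequency(lower, "web search"))
```

Because the lower tier exposes only `index_query`, a different upper-tier tool could apply another ranking method to the same substrate without changing it.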

P2P web search architecture:
- thousands of powerful machines all over the internet
- machines can join or leave
- agnostic: can implement many IR methods on top
[Figure: four boxes labeled “search engine”]
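The slide does not describe how index data is placed on machines that can join or leave. One standard way to handle such churn is consistent hashing, sketched below purely as an assumption. The `Ring` class, the node names, and the use of MD5 as the hash function are all hypothetical.

```python
import bisect
import hashlib


def _position(key):
    """Hash a string to a position on the ring."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Assigns each index term to the first node at or after its hash position."""

    def __init__(self):
        self._positions = []  # sorted node positions
        self._nodes = {}      # position -> node name

    def join(self, node):
        position = _position(node)
        if position not in self._nodes:
            bisect.insort(self._positions, position)
            self._nodes[position] = node

    def leave(self, node):
        position = _position(node)
        if position in self._nodes:
            self._positions.remove(position)
            del self._nodes[position]

    def node_for(self, term):
        if not self._positions:
            raise LookupError("no nodes in the ring")
        i = bisect.bisect_right(self._positions, _position(term))
        # Wrap around to the first node when the term hashes past the last one.
        return self._nodes[self._positions[i % len(self._positions)]]


ring = Ring()
for name in ("node-a", "node-b", "node-c"):
    ring.join(name)
terms = ["computer", "search", "engine", "crawler", "index", "web"]
before = {term: ring.node_for(term) for term in terms}
ring.leave("node-b")
after = {term: ring.node_for(term) for term in terms}
```

When `node-b` leaves, only the terms it held move to another node; every other term stays where it was, which keeps the cost of churn proportional to the departing node's share.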

Web Exploration and Search Technology Lab:
- about 10 grad and undergrad students
- more information:
- courses on web search, IR, web protocols
Showcase slides at