Searching the web

Presentation transcript:

Searching the web
Enormous amount of information
  – In 1994, 100 thousand pages indexed
  – In 1997, 100 million pages indexed
  – In June 2000, 500 million pages indexed
  – In January 2001, 1.3 billion pages indexed
Estimate is that the 1.3 billion indexed pages are around 10-15% of the web
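As a quick worked check of that estimate, the sketch below computes the total web size implied by the figures on this slide. It assumes the 1.3 billion count and the 10-15% coverage range describe the same moment; the function name is illustrative and not from the slides.

```python
def implied_web_size(indexed_pages, coverage_fraction):
    """Total page count implied if indexed_pages is coverage_fraction of the web."""
    if not 0 < coverage_fraction <= 1:
        raise ValueError("coverage_fraction must be in (0, 1]")
    return indexed_pages / coverage_fraction

indexed = 1.3e9                          # pages indexed in January 2001 (from the slide)
low = implied_web_size(indexed, 0.15)    # if the index covers 15% of the web
high = implied_web_size(indexed, 0.10)   # if the index covers 10% of the web
print(f"Implied web size: {low / 1e9:.1f} to {high / 1e9:.1f} billion pages")
```

Under those assumptions the web of early 2001 would have held very roughly 9 to 13 billion pages.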

Finding Out About
People interact with all that information because they want to KNOW something; there is a question they are trying to answer or a piece of information they want.
Simplest approach:
  – Knowledge is organized into chunks (pages)
  – Goal is to return appropriate chunks

Search Engines
Goal of a search engine is to return appropriate chunks.
Steps involved include:
  – asking a question
  – finding answers
  – evaluating answers
  – presenting answers
Value of a search engine depends on how well it does on all of these.

Asking a question
Reflects some information need
Query syntax needs to allow the information need to be expressed
  – Keywords
  – Combining terms
      – Simple: “required” and NOT (+ and -)
      – Boolean expressions with and/or/not and nested parentheses
      – Variations: strings, NEAR, capitalization
  – Simplest syntax that works
  – Typically more acceptable if predictable
Another set of problems when information isn’t text: graphics, music
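The simple “required”/NOT syntax above can be sketched as follows. This is a minimal illustration, not how any particular search engine parses queries: it assumes whitespace-separated terms, treats `+term` as required and `-term` as excluded, and requires at least one plain term to match when nothing is marked required. The function names are hypothetical.

```python
def parse_query(query):
    """Split a query into required (+), excluded (-), and optional terms."""
    required, excluded, optional = set(), set(), set()
    for token in query.lower().split():
        if token.startswith("+") and len(token) > 1:
            required.add(token[1:])
        elif token.startswith("-") and len(token) > 1:
            excluded.add(token[1:])
        else:
            optional.add(token)
    return required, excluded, optional

def matches(query, text):
    """Return True if the text satisfies the +/- constraints of the query."""
    required, excluded, optional = parse_query(query)
    words = set(text.lower().split())
    if not required <= words:      # every required term must be present
        return False
    if excluded & words:           # no excluded term may be present
        return False
    if not required and optional:  # otherwise at least one plain term must match
        return bool(optional & words)
    return True

print(matches("+jaguar -car", "the jaguar is a big cat"))
print(matches("+jaguar -car", "jaguar car dealer"))
```

Full Boolean expressions with nested parentheses would need a real expression parser, which is part of why the simpler syntax tends to be more predictable for users.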

Finding the Information
Goal is to retrieve all relevant chunks.
Too time-consuming to do in real time, so search engines index pages.
Two basic approaches:
  – Index and classify by hand
  – Automate
For BOTH approaches, deciding what to index on (e.g., what is a keyword) is a significant issue.
Most major search sites now provide both.

Indexing by Hand
Indexing by hand involves having a person look at web pages and assign them to categories.
Assumes a hierarchy of categories exists into which pages are placed
Each document can go into multiple categories
Produces very high quality indices
Can retrieve by browsing the hierarchy
Very expensive to create
  – YAHOO is best-known early example

Automated Indexing
Automated indexing involves parsing documents to pull out keywords and creating a table which links keywords to documents.
Doesn’t have any predefined categories or keywords
Can cover a much higher proportion of the web
Can update more quickly
Much lower quality, therefore important to have some kind of relevance ranking
  – AltaVista was a well-known early example
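The keyword-to-document table described above can be sketched as a small inverted index. This is a minimal illustration under stated assumptions: every alphanumeric word counts as a keyword (real systems decide this far more carefully), and the document IDs and function names are made up for the example.

```python
import re
from collections import defaultdict

def extract_keywords(text):
    """Lowercase the text and pull out alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(documents):
    """Map each keyword to {doc_id: number of occurrences in that document}."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for word in extract_keywords(text):
            index[word][doc_id] = index[word].get(doc_id, 0) + 1
    return index

# Hypothetical sample documents.
documents = {
    "page1": "Search engines index the web",
    "page2": "The web is big, and the web is dynamic",
}
index = build_index(documents)
print(sorted(index["web"].items()))  # [('page1', 1), ('page2', 2)]
```

Storing occurrence counts rather than bare document IDs gives a later relevance-ranking step something to work with.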

Automating Search
We will focus on automated search and indexing.
Always balancing various factors:
  – Recall and Precision
      – If there are 100 relevant documents and you find 50, your recall is 50%.
      – If you find 100 documents, and 10 of them are on topic, your precision is 10%.
      – Which is more important varies with query and with coverage.
  – Speed, storage, completeness, timeliness
      – How fast can you locate and index documents? Answer a query?
      – How much room do you need on your server?
      – What percent of the web do you cover?
      – How many dead links do you have?
      – How long before information is found by your search engine?
  – Ease of use vs power of queries
      – Full Boolean queries are very rich but very confusing (e.g., AltaVista Advanced Search).
      – Simplest is “and”ing together keywords; fast, straightforward (e.g., Google).
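The recall and precision figures above follow from two simple ratios. The sketch below reproduces both examples from the slide; the document IDs are invented for illustration, and the function name is hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Precision = hits / retrieved, recall = hits / relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

relevant = set(range(100))  # 100 relevant documents, IDs 0..99

# Example 1: you find 50 of the 100 relevant documents.
p1, r1 = precision_recall(range(50), relevant)
print(f"recall = {r1:.0%}")      # recall = 50%

# Example 2: you find 100 documents, but only 10 are on topic.
found = list(range(10)) + list(range(1000, 1090))
p2, r2 = precision_recall(found, relevant)
print(f"precision = {p2:.0%}")   # precision = 10%
```

Note that the first result set has perfect precision but only half the recall, which is the trade-off the slide refers to.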

Search Engine Basics
A spider or crawler starts at a web page, identifies all links on it, and follows them to new web pages.
A parser processes each web page and extracts individual words.
An indexer creates/updates a hash table which connects words with documents.
A searcher uses the hash table to retrieve documents based on words.
A ranking system decides the order in which to present the documents: their relevance.
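The five components above can be sketched end to end on a toy example. Everything here is an assumption made for illustration: the "web" is an in-memory dictionary rather than real HTTP fetches, the URLs are invented, queries are treated as an AND of keywords, and ranking is a plain count of query-word occurrences, which is far simpler than what production engines use.

```python
import re
from collections import defaultdict, deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
WEB = {
    "a.example": ("web search engines crawl the web", ["b.example", "c.example"]),
    "b.example": ("a crawler follows links", ["a.example"]),
    "c.example": ("ranking orders web search results", []),
}

def crawl(start):
    """Spider: visit pages breadth-first, following links from the start page."""
    seen, queue = {start}, deque([start])
    while queue:
        url = queue.popleft()
        text, links = WEB[url]
        yield url, text
        for link in links:
            if link in WEB and link not in seen:
                seen.add(link)
                queue.append(link)

def parse(text):
    """Parser: extract individual lowercase words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(pages):
    """Indexer: hash table mapping word -> {url: occurrence count}."""
    index = defaultdict(lambda: defaultdict(int))
    for url, text in pages:
        for word in parse(text):
            index[word][url] += 1
    return index

def search(index, query):
    """Searcher and ranker: AND the query words, order by total occurrences."""
    words = parse(query)
    if not words:
        return []
    candidates = None
    for word in words:
        urls = set(index[word]) if word in index else set()
        candidates = urls if candidates is None else candidates & urls

    def score(url):
        return sum(index[word][url] for word in words)

    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(candidates, key=lambda url: (-score(url), url))

index = build_index(crawl("a.example"))
print(search(index, "web"))      # ['a.example', 'c.example']
print(search(index, "crawler"))  # ['b.example']
```

The `seen` set is what keeps the spider from looping forever when pages link back to each other, as `b.example` does here.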

Search Engines: Not So Basic
Summary of document
Cache of document
Format filters (pdf, postscript, etc.)
Duplicate identification and removal
“More like this”
Content filters
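The slides do not say how duplicates are identified. One simple approach, shown here purely as an assumption, is to hash a normalized copy of each page's text and keep only the first page seen for each hash. This catches exact copies (ignoring case and whitespace) but not near-duplicates; the URLs and function names are hypothetical.

```python
import hashlib

def fingerprint(text):
    """Hash the text after lowercasing and collapsing runs of whitespace."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def remove_duplicates(pages):
    """Keep the first URL for each distinct fingerprint, in input order."""
    seen = set()
    unique = []
    for url, text in pages:
        fp = fingerprint(text)
        if fp not in seen:
            seen.add(fp)
            unique.append(url)
    return unique

pages = [
    ("a.example", "Searching the Web"),
    ("mirror.example", "searching   the web"),  # same text, different spacing/case
    ("b.example", "Evaluating search engines"),
]
print(remove_duplicates(pages))  # ['a.example', 'b.example']
```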

Evaluating Search Engines
Generally, the usability of a search engine includes several factors:
  – Coverage. Most important: if it doesn’t even look at a page, it can’t retrieve it. Matters more when looking for rare information.
  – Relevance ranking. If precision is low and relevance ranking is poor, it takes too much wading to find the desired result. Matters more in noisy domains.
  – Ease of use. This includes the query syntax, and also factors like how crowded the page is. A lot of personal preference here.
  – Other features. In some settings special features may be critical.

Some Well-Known Search Engines
Google
AltaVista
Yahoo
Lycos
...