How Search Engines Work: A Technology Overview


Avi Rappoport, Search Tools Consulting
www.searchtools.com
consult1@searchtools.com
UC Berkeley SIMS class 202, September 16, 2004

Purpose of Search Engines
- Helping people find what they’re looking for
  - Starts with an “information need”
  - Converts it to a query
  - Gets results
- In the materials available
  - Web pages
  - Other formats
  - Deep Web
UCB SIMS 202, Sept. 2004 Avi Rappoport, Search Tools Consulting

Search is Not a Panacea
- Search can’t find what’s not there
- The content is hugely important
- Information Architecture is vital
- Usable sites have good navigation and structure

Search Looks Simple

But It's Not
- Index ahead of time
  - Find files or records
  - Open each one and read it
  - Store each word in a searchable index
- Provide search forms
  - Match the query terms with words in the index
  - Sort documents by relevance
- Display results
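The index-then-match steps above can be sketched in a few lines. This is a minimal illustration, not any particular engine's design: the function names, the sample documents, and the "every query word must match" rule are all assumptions, and relevance sorting is left out.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Index ahead of time: map each word to the set of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Match query terms against the index; return documents containing every term."""
    words = tokenize(query)
    if not words:
        return []
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return sorted(result)


# Hypothetical two-page site
docs = {
    "a.html": "Search engines index pages ahead of time",
    "b.html": "Users type a query into a search form",
}
index = build_index(docs)
print(search(index, "search index"))
```

The point of building the index first is that a query only touches the word lists, never the original files.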

Search Processing

Search is Mostly Invisible
- Like an iceberg, 2/3 below water
  - user interface
  - search functionality
  - content

Text Search vs. Database Query
- Text search works for structured content
- Keyword search vs. SQL queries
- Approximate vs. exact match
- Multiple sources of content
- Response time and database resources
- Relevance ranking is very important
- Works in the real world (e.g. eBay)

Search is Only as Good as the Content
- Users blame the search engine
  - Even when the content is unavailable
- Understand the scope of the site or intranet
  - Kinds of information
  - Divided sites: products / corporate info
  - Dates
  - Languages
  - Sources and data silos: CMSs, databases...
  - Update processes

Making a Searchable Index
- Store text to search it later
- Many ways to gather text
  - Crawl (spider) via HTTP
  - Read files on file servers
  - Access databases (HTTP or API)
  - Data silos via local APIs
  - Applications, CMSs, via Web Services
- Security and Access Control

Robot Indexing Diagram
Source: James Ghaphery, VCU

What the Index Needs
- Basic information for document or record
  - File name / URL / record ID
  - Title or equivalent
  - Size, date, MIME type
- Full text of item
- More metadata
  - Product name, picture ID
  - Category, topic, or subject
  - Other attributes, for relevance ranking and display
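One way to picture the per-item data listed above is as a small record type. The field names and the example values below are assumptions chosen for illustration; real engines store this information in their own formats.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class IndexRecord:
    """Hypothetical shape of what an index might keep for one document."""
    url: str                      # file name / URL / record ID
    title: str                    # title or equivalent
    size_bytes: int
    modified: date
    mime_type: str
    full_text: str
    metadata: dict = field(default_factory=dict)  # product name, category, etc.


# Example values are made up
record = IndexRecord(
    url="http://www.example.com/products/widget.html",
    title="Widget Product Page",
    size_bytes=18432,
    modified=date(2004, 9, 1),
    mime_type="text/html",
    full_text="The widget is our best-selling product ...",
    metadata={"product_name": "Widget", "category": "hardware"},
)
print(record.title, record.metadata["category"])
```

Keeping the extra attributes in a separate `metadata` mapping leaves room for site-specific fields used in ranking and display.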

Simple Index Diagram

More Complex Index Processing

Index Issues
- Stopwords
- Stemming
- Metadata
  - Explicit (tags)
  - Implicit (context)
- Semantics
  - CMS and Database fields
  - XML tags and attributes
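Stopword removal and stemming can be illustrated with a toy indexer step. The stopword list and the suffix-stripping rule here are deliberately crude assumptions; production systems typically use much larger lists and a real stemming algorithm.

```python
import re

# Tiny illustrative stopword list (an assumption, not a standard list)
STOPWORDS = {"a", "an", "and", "of", "the", "to", "in"}


def naive_stem(word):
    """Strip a few common English suffixes; a crude stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_terms(text):
    """Lowercase, split into words, drop stopwords, then stem what is left."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [naive_stem(w) for w in words if w not in STOPWORDS]


print(index_terms("Indexing the pages of a site"))
```

Whatever rules the indexer applies, the query processor has to apply the same ones, or query words will not line up with index words.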

Search Query Processing
What happens after you click the search button, and before retrieval starts. Usually in this order:
1. Handle character set, maybe language
2. Look for operators and organize the query
3. Look for field names or metadata
4. Extract words (just like the indexer)
5. Deal with letter casing
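A rough sketch of those steps, under stated assumptions: the query syntax (double quotes for phrases, `field:value` for field restrictions) is hypothetical, the word pattern only handles ASCII letters and digits, and language detection is skipped entirely.

```python
import re
import unicodedata


def parse_query(raw):
    """Split a raw query string into phrases, field filters, and plain terms."""
    # Step 1: normalize the character representation
    text = unicodedata.normalize("NFKC", raw)
    # Step 2: pull out quoted phrases (the only operator this sketch knows)
    phrases = [p.lower() for p in re.findall(r'"([^"]+)"', text)]
    text = re.sub(r'"[^"]+"', " ", text)
    fields = {}
    terms = []
    for token in text.split():
        # Step 3: look for field names such as title:engines
        if ":" in token:
            name, _, value = token.partition(":")
            if name and value:
                fields[name.lower()] = value.lower()
                continue
        # Steps 4 and 5: extract words and fold case
        terms.extend(re.findall(r"[a-z0-9]+", token.lower()))
    return {"phrases": phrases, "fields": fields, "terms": terms}


print(parse_query('title:Engines "search tools" Ranking'))
```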

Search and Retrieval
- Retrieval: find files with query terms
  - Not the same as relevance ranking
- Recall: find all relevant items
- Precision: find only relevant items
- Increasing one tends to decrease the other
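A small worked example may help show the trade-off. The document IDs and relevance judgments are invented; the formulas are the usual ones (precision = relevant retrieved / retrieved, recall = relevant retrieved / relevant).

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a result list against known relevant items."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical collection in which four documents are truly relevant
relevant = {"d1", "d2", "d3", "d4"}

narrow = ["d1", "d2"]                          # strict matching
broad = ["d1", "d2", "d3", "d5", "d6", "d7"]   # looser matching

print(precision_recall(narrow, relevant))  # (1.0, 0.5)
print(precision_recall(broad, relevant))   # (0.5, 0.75)
```

The looser search finds more of the relevant items but pads the list with irrelevant ones, which is the tension the slide describes.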

Retrieval = Matching
- Single-word queries
  - Find items containing that word
- Multi-word queries: combine lists
  - Any: every item with any query word
  - All: only items with every word
  - Phrases: find only items with all words in order
- Boolean and complex queries
  - Use algorithm to combine lists
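The list-combining idea can be sketched with a positional index, where each word maps to the documents and word positions it occurs at. Function names and sample documents are assumptions; "Any" is a set union, "All" is an intersection, and a phrase additionally requires consecutive positions.

```python
import re
from collections import defaultdict


def build_positional_index(docs):
    """Map word -> {doc_id: [positions]} so phrase order can be checked."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, word in enumerate(re.findall(r"[a-z0-9]+", text.lower())):
            index[word].setdefault(doc_id, []).append(pos)
    return index


def match_any(index, words):
    """Any: every item containing at least one query word (union)."""
    return set().union(*(index.get(w, {}).keys() for w in words))


def match_all(index, words):
    """All: only items containing every query word (intersection)."""
    lists = [set(index.get(w, {})) for w in words]
    return set.intersection(*lists) if lists else set()


def match_phrase(index, words):
    """Phrase: items where the words appear consecutively, in order."""
    hits = set()
    for doc_id in match_all(index, words):
        starts = index[words[0]][doc_id]
        if any(all(p + i in index[w][doc_id] for i, w in enumerate(words))
               for p in starts):
            hits.add(doc_id)
    return hits


docs = {
    "d1": "search engine technology",
    "d2": "engine for search",
    "d3": "database query",
}
idx = build_positional_index(docs)
print(sorted(match_any(idx, ["search", "query"])))
print(sorted(match_all(idx, ["search", "engine"])))
print(sorted(match_phrase(idx, ["search", "engine"])))
```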

Why Searches Fail
- Empty search
- Nothing on the site on that topic (scope)
- Misspelling or typing mistakes
- Vocabulary differences
- Restrictive search defaults
- Restrictive search choices
- Software failure

LII.org No-Matches Page

Relevance Ranking
- Theory: sort the matching items, so the most relevant ones appear first
- Can't really know what the user wants
  - Relevance is hard to define and situational
  - Short queries tend to be deeply ambiguous
  - What do people mean when they type “bank”?
- First 10 results are the most important
- The more transparent, the better

Relevance Processing
- Sorting documents on various criteria
- Start with words matching query terms
- Citation and link analysis
  - Like old library Citation Indexes
  - Ted Nelson: not only hypertext, but the links
  - Google PageRank
    - Incoming links
    - Authority of linkers
- Taxonomies and external metadata
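The "incoming links" and "authority of linkers" ideas can be illustrated with a simplified, textbook-style PageRank iteration. This is a sketch of the general technique, not a description of Google's production system; the damping value, iteration count, and sample link graph are assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively share each page's score among the pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its authority to the pages it links to
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # A page with no outgoing links spreads its score evenly
                for t in pages:
                    new_rank[t] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical four-page site: every page links back to "home"
links = {
    "home": ["about", "products"],
    "about": ["home"],
    "products": ["home"],
    "blog": ["home"],
}
ranks = pagerank(links)
for page, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(f"{page}: {score:.3f}")
```

Pages with many incoming links from well-ranked pages end up with higher scores, while a page nobody links to keeps only the small baseline share.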

TF-IDF Ranking Algorithm
- Term frequency in the item
- Inverse document frequency of term
- Rare words are likely to be more important
- w_ij = weight of Term T_j in Document D_i
- tf_ij = frequency of Term T_j in Document D_i
- N = number of Documents in collection
- n = number of Documents where term T_j occurs at least once
From Salton 1989
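The formula itself did not survive in this transcript. The definitions above are consistent with the commonly cited form w_ij = tf_ij * log(N / n), and the sketch below assumes that form; the function name and sample documents are made up for illustration.

```python
import math
import re
from collections import Counter


def tf_idf_weights(docs):
    """Compute w_ij = tf_ij * log(N / n) for every term in every document."""
    tokenized = {d: re.findall(r"[a-z0-9]+", t.lower()) for d, t in docs.items()}
    N = len(tokenized)                      # number of documents in the collection
    doc_freq = Counter()                    # n: documents containing each term
    for words in tokenized.values():
        doc_freq.update(set(words))
    weights = {}
    for doc_id, words in tokenized.items():
        tf = Counter(words)                 # tf_ij: term frequency in this document
        weights[doc_id] = {
            term: count * math.log(N / doc_freq[term])
            for term, count in tf.items()
        }
    return weights


docs = {
    "d1": "search engine search",
    "d2": "search results page",
    "d3": "database engine",
}
weights = tf_idf_weights(docs)
for term, w in sorted(weights["d1"].items()):
    print(f"{term}: {w:.3f}")
```

Under this form, a term that appears in every document gets weight zero, which matches the slide's point that rare words carry more importance.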

Other Algorithms
- Vector space
- Probabilistic (binary independence)
- Fuzzy set theory
- Bayesian statistical analysis
- Latent semantic indexing
- Neural networks
- Machine learning
- All require sophisticated queries
- See MIR, chapter 2

Relevance Heuristics
- Heuristics are rules of thumb
  - Not algorithms, not math
- Search Relevance Ranking Heuristics
  - Documents containing all search words
  - Search words as a phrase
  - Matches in title tag
  - Matches in other metadata
- Based on real-world user behavior
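Heuristics like these are often implemented as additive score boosts. The sketch below is one hypothetical way to do it: the weights (2.0, 3.0, 1.5) are arbitrary assumptions that a site would tune, and documents are plain dicts with `title` and `body` keys.

```python
def heuristic_score(query_words, doc):
    """Add up rule-of-thumb boosts; all weights are illustrative assumptions."""
    title = doc["title"].lower().split()
    body = doc["body"].lower().split()
    words = [w.lower() for w in query_words]
    score = 0.0
    # Boost documents containing all search words
    if all(w in body or w in title for w in words):
        score += 2.0
    # Boost documents where the search words appear as a phrase in the body
    if f" {' '.join(words)} " in f" {' '.join(body)} ":
        score += 3.0
    # Boost each search word that matches in the title
    score += 1.5 * sum(1 for w in words if w in title)
    return score


doc_a = {"title": "Search Tools", "body": "reviews of search tools for web sites"}
doc_b = {"title": "Web Tools", "body": "tools that help you search a site"}
query = ["search", "tools"]

print(heuristic_score(query, doc_a))  # 8.0
print(heuristic_score(query, doc_b))  # 3.5
```

Both documents contain both words, but the phrase and title matches push the first one well ahead.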

Search Results Interface
- What users see after they click the Search button
- The most visible part of search
- Elements of the results page
  - Page layout and navigation
  - Results header
  - List of results items
  - Results footer

Many Experiments in Interface

Back to Simplicity

Search Suggestions (aka Best Bets)
- Human judgment beats algorithms
- Great for frequent, ambiguous searches
- Use search log to identify best candidates
- Recommend good starting pages
  - Product information, FAQs, etc.
- Requires human resources
  - That means money and time
- More static than algorithmic search
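In its simplest form, a best-bets feature is a hand-maintained lookup table consulted before the algorithmic results are shown. The table contents, URLs, and function name below are hypothetical.

```python
# Hand-curated table: normalized query -> recommended starting page (made-up data)
BEST_BETS = {
    "returns": "/support/returns-policy.html",
    "jobs": "/about/careers.html",
}


def results_with_best_bet(query, algorithmic_results, best_bets=BEST_BETS):
    """Put the editor-chosen page first, without listing it twice."""
    key = " ".join(query.lower().split())   # fold case and extra whitespace
    bet = best_bets.get(key)
    results = [r for r in algorithmic_results if r != bet]
    return ([bet] if bet else []) + results


print(results_with_best_bet(
    "  Returns ",
    ["/blog/returns-story.html", "/support/returns-policy.html"],
))
```

Because the table is edited by people, it stays correct only as long as someone maintains it, which is the cost the slide points out.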

MSU Keywords

Siemens Results

Cooks.com Results

Salon.com Results

Faceted Metadata Search & Browse
- Leverage content structure
  - database fields (e.g. cruise amenities)
  - document metadata (news article bylines)
- Provide both search and browse
- Support information foraging
- Integrate navigation with results
  - Not just subject taxonomies
  - Display only fruitful paths, no dead ends
- Supported by academic research
  - Marti Hearst, UCB SIMS, flamenco.berkeley.edu
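The "no dead ends" property falls out naturally if facet counts are computed from the current result set only. A minimal sketch, with invented cruise records and hypothetical function names:

```python
from collections import Counter


def facet_counts(items, facet_names):
    """Count how many of the current items carry each facet value."""
    counts = {name: Counter() for name in facet_names}
    for item in items:
        for name in facet_names:
            value = item.get(name)
            if value is not None:
                counts[name][value] += 1
    return counts


def refine(items, **selected):
    """Keep only items matching every selected facet value."""
    return [it for it in items
            if all(it.get(k) == v for k, v in selected.items())]


# Made-up catalog records
cruises = [
    {"name": "Alaska Explorer", "region": "Alaska", "amenity": "spa"},
    {"name": "Glacier Bay", "region": "Alaska", "amenity": "pool"},
    {"name": "Island Hopper", "region": "Caribbean", "amenity": "pool"},
]

print(facet_counts(cruises, ["region", "amenity"]))
alaska = refine(cruises, region="Alaska")
print(facet_counts(alaska, ["region", "amenity"]))
```

After refining to Alaska, only values that still have matching items appear in the counts, so every link offered leads to at least one result.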

Faceted Search: Information

Faceted Search: Online Catalog

Search Metrics and Analytics
- Number of searches
- Number of no-matches searches
- Traffic from search to high-value pages
- Relate search changes to other metrics
- Search Log Analysis
  - Top 5% searches: phrases and words
  - Top no-matches searches
  - Use as market research
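A basic log summary along these lines takes only a few lines of code. The log format (pairs of query text and result count) and the sample entries are assumptions; real search logs vary by product.

```python
from collections import Counter


def summarize_log(entries):
    """Summarize (query, result_count) pairs from a hypothetical search log."""
    normalized = [(" ".join(q.lower().split()), n) for q, n in entries]
    total = len(normalized)
    queries = Counter(q for q, _ in normalized)
    no_matches = Counter(q for q, n in normalized if n == 0)
    return {
        "total": total,
        "no_match_rate": sum(no_matches.values()) / total if total else 0.0,
        "top_queries": queries.most_common(3),
        "top_no_matches": no_matches.most_common(3),
    }


# Made-up log entries
log = [
    ("Pricing", 12),
    ("pricing", 12),
    ("refund", 0),
    ("Refund ", 0),
    ("careers", 4),
]
summary = summarize_log(log)
print(summary)
```

The top no-matches list is the part that works as market research: it shows what visitors expect to find and the site does not yet provide.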

Search Will Never Be Perfect
- Search engines can’t read minds
- User queries are short and ambiguous
- Some things will help
  - Design a usable interface
  - Show match words in context
  - Keep index current and complete
  - Adjust heuristic weighting
  - Maintain suggestions and synonyms
  - Consider faceted metadata search

Search Engines, sorta Rocket Science
- Questions and discussion
- Contact me
  - consult1@searchtools.com
  - www.searchtools.com
- This presentation: www.searchtools.com/slides/sims/202-04/