Presentation is loading. Please wait.

Presentation is loading. Please wait.

How Search Engines Work: A Technology Overview Avi Rappoport Search Tools Consulting UC Berkeley SIMS class.

Similar presentations


Presentation on theme: "How Search Engines Work: A Technology Overview Avi Rappoport Search Tools Consulting UC Berkeley SIMS class."— Presentation transcript:

1 How Search Engines Work: A Technology Overview Avi Rappoport Search Tools Consulting www.searchtools.com consult1@searchtools.com UC Berkeley SIMS class 202 September 16, 2004

2 2 UCB SIMS 202, Sept. 2004 Avi Rappoport, Search Tools Consulting Purpose of Search Engines  Helping people find what they’re looking for Starts with an “information need” Convert to a query Gets results  In the materials available Web pages Other formats Deep Web

3 3 UCB SIMS 202, Sept. 2004 Avi Rappoport, Search Tools Consulting Search is Not a Panacea  Search can’t find what’s not there The content is hugely important  Information Architecture is vital  Usable sites have good navigation and structure

4 4 UCB SIMS 202, Sept. 2004 Avi Rappoport, Search Tools Consulting Search Looks Simple

5 5 UCB SIMS 202, Sept. 2004 Avi Rappoport, Search Tools Consulting But It's Not  Index ahead of time Find files or records Open each one and read it Store each word in a searchable index  Provide search forms Match the query terms with words in the index Sort documents by relevance  Display results

6 6 UCB SIMS 202, Sept. 2004 Avi Rappoport, Search Tools Consulting Search Processing

7 7 UCB SIMS 202, Sept. 2004 Avi Rappoport, Search Tools Consulting content search functionality user interface Search is Mostly Invisible Like an iceberg, 2/3 below water

8 8 UCB SIMS 202, Sept. 2004 Avi Rappoport, Search Tools Consulting Text Search vs. Database Query  Text search works for structured content  Keyword search vs. SQL queries  Approximate vs. exact match  Multiple sources of content  Response time and database resources  Relevance ranking, very important  Works in the real world (e.g. EBay)

9 9 UCB SIMS 202, Sept. 2004 Avi Rappoport, Search Tools Consulting Search is Only as Good as the Content  Users blame the search engine Even when the content is unavailable  Understand the scope of site or intranet Kinds of information Divided sites: products / corporate info Dates Languages Sources and data silos: CMSs, databases... Update processes

10 10 UCB SIMS 202, Sept. 2004 Avi Rappoport, Search Tools Consulting Making a Searchable Index  Store text to search it later  Many ways to gather text Crawl (spider) via HTTP Read files on file servers Access databases (HTTP or API) Data silos via local APIs Applications, CMSs, via Web Services  Security and Access Control

11 11 UCB SIMS 202, Sept. 2004 Avi Rappoport, Search Tools Consulting Robot Indexing Diagram Source:James Ghaphery, VCU

12 12 UCB SIMS 202, Sept. 2004 Avi Rappoport, Search Tools Consulting What the Index Needs  Basic information for document or record File name / URL / record ID Title or equivalent Size, date, MIME type  Full text of item  More metadata Product name, picture ID Category, topic, or subject Other attributes, for relevance ranking and display

13 13 UCB SIMS 202, Sept. 2004 Avi Rappoport, Search Tools Consulting Simple Index Diagram

14 14 UCB SIMS 202, Sept. 2004 Avi Rappoport, Search Tools Consulting More Complex Index Processing

15 15 UCB SIMS 202, Sept. 2004 Avi Rappoport, Search Tools Consulting Index Issues  Stopwords  Stemming  Metadata Explicit (tags) Implicit (context)  Semantics CMS and Database fields XML tags and attributes

16 16 UCB SIMS 202, Sept. 2004 Avi Rappoport, Search Tools Consulting Search Query Processing  What happens after you click the search button, and before retrieval starts.  Usually in this order Handle character set, maybe language Look for operators and organize the query Look for field names or metadata Extract words (just like the indexer) Deal with letter casing

17 17 UCB SIMS 202, Sept. 2004 Avi Rappoport, Search Tools Consulting Search and Retrieval  Retrieval: find files with query terms  Not the same as relevance ranking  Recall: find all relevant items  Precision: find only relevant items  Increasing one decreases the other

18 18 UCB SIMS 202, Sept. 2004 Avi Rappoport, Search Tools Consulting Retrieval = Matching  Single-word queries Find items containing that word  Multi-word queries: combine lists Any: every item with any query word All: only items with every word Phrases: find only items with all words in order  Boolean and complex queries Use algorithm to combine lists

19 19 UCB SIMS 202, Sept. 2004 Avi Rappoport, Search Tools Consulting Why Searches Fail  Empty search  Nothing on the site on that topic (scope)  Misspelling or typing mistakes  Vocabulary differences  Restrictive search defaults  Restrictive search choices  Software failure

20 20 UCB SIMS 202, Sept. 2004 Avi Rappoport, Search Tools Consulting LII.org No-Matches Page

21 21 UCB SIMS 202, Sept. 2004 Avi Rappoport, Search Tools Consulting Relevance Ranking  Theory: sort the matching items, so the most relevant ones appear first  Can't really know what the user wants  Relevance is hard to define and situational  Short queries tend to be deeply ambiguous What do people mean when they type “bank”?  First 10 results are the most important  The more transparent, the better

22 22 UCB SIMS 202, Sept. 2004 Avi Rappoport, Search Tools Consulting Relevance Processing  Sorting documents on various criteria  Start with words matching query terms  Citation and link analysis Like old library Citation Indexes Ted Nelson - not only hypertext, but the links Google PageRank Incoming links Authority of linkers  Taxonomies and external metadata

23 23 UCB SIMS 202, Sept. 2004 Avi Rappoport, Search Tools Consulting TF-IDF Ranking Algorithm  Term frequency in the item  Inverse document frequency of term Rare words are likely to be more important  w ij = weight of Term T j in Document D i  tf ij = frequency of Term T j in Document D j  N = number of Documents in collection  n = number of Documents where term Tj occurs at least once From Salton 1989

24 24 UCB SIMS 202, Sept. 2004 Avi Rappoport, Search Tools Consulting Other Algorithms  Vector space  Probabilistic (binary interdependence)  Fuzzy set theory  Bayesian statistical analysis  Latent semantic indexing  Neural networks  Machine learning  All require sophisticated queries  See MIR, chapter 2

25 25 UCB SIMS 202, Sept. 2004 Avi Rappoport, Search Tools Consulting Relevance Heuristics  Heuristics are rules of thumb Not algorithms, not math  Search Relevance Ranking Heuristics Documents containing all search words Search words as a phrase Matches in title tag Matches in other metadata  Based on real-word user behavior

26 26 UCB SIMS 202, Sept. 2004 Avi Rappoport, Search Tools Consulting Search Results Interface  What users see after they click the Search button  The most visible part of search  Elements of the results page Page layout and navigation Results header List of results items Results footer

27 27 UCB SIMS 202, Sept. 2004 Avi Rappoport, Search Tools Consulting Many Experiments in Interface

28 28 UCB SIMS 202, Sept. 2004 Avi Rappoport, Search Tools Consulting Back to Simplicity

29 29 UCB SIMS 202, Sept. 2004 Avi Rappoport, Search Tools Consulting Search Suggestions (aka Best Bets)  Human judgment beats algorithms  Great for frequent, ambiguous searches Use search log to identify best candidates  Recommend good starting pages Product information, FAQs, etc.  Requires human resources That means money and time  More static than algorithmic search

30 30 UCB SIMS 202, Sept. 2004 Avi Rappoport, Search Tools Consulting MSU Keywords

31 31 UCB SIMS 202, Sept. 2004 Avi Rappoport, Search Tools Consulting Siemens Results

32 32 UCB SIMS 202, Sept. 2004 Avi Rappoport, Search Tools Consulting Cooks.com Results

33 33 UCB SIMS 202, Sept. 2004 Avi Rappoport, Search Tools Consulting Salon.com Results

34 34 UCB SIMS 202, Sept. 2004 Avi Rappoport, Search Tools Consulting Faceted Metadata Search & Browse  Leverage content structure database fields (i.e. cruise amenities) document metadata (news article bylines)  Provide both search and browse Support information foraging Integrate navigation with results Not just subject taxonomies Display only fruitful paths, no dead ends  Supported by academic research Marti Hearst, UCB SIMS, flamenco.berkeley.eduflamenco.berkeley.edu

35 35 UCB SIMS 202, Sept. 2004 Avi Rappoport, Search Tools Consulting Faceted Search: Information

36 36 UCB SIMS 202, Sept. 2004 Avi Rappoport, Search Tools Consulting Faceted Search: Online Catalog

37 37 UCB SIMS 202, Sept. 2004 Avi Rappoport, Search Tools Consulting Search Metrics and Analytics  Metrics Number of searches Number of no-matches searches Traffic from search to high-value pages Relate search changes to other metrics  Search Log Analysis Top 5% searches: phrases and words Top no-matches searches Use as market research

38 38 UCB SIMS 202, Sept. 2004 Avi Rappoport, Search Tools Consulting Search Will Never Be Perfect  Search engines can’t read minds User queries are short and ambiguous  Some things will help Design a usable interface Show match words in context Keep index current and complete Adjust heuristic weighting Maintain suggestions and synonyms Consider faceted metadata search

39 39 UCB SIMS 202, Sept. 2004 Avi Rappoport, Search Tools Consulting Search Engines, sorta Rocket Science  Questions and discussion  Contact me consult1@searchtools.com www.searchtools.com  This presentation: www.searchtools.com/slides/sims/202-04/


Download ppt "How Search Engines Work: A Technology Overview Avi Rappoport Search Tools Consulting UC Berkeley SIMS class."

Similar presentations


Ads by Google