Presentation is loading. Please wait.

Presentation is loading. Please wait.

Search Engines. Search Strategies Define the search topic(s) and break it down into its component parts What terms, words or phrases do you use to describe.

Similar presentations


Presentation on theme: "Search Engines. Search Strategies Define the search topic(s) and break it down into its component parts What terms, words or phrases do you use to describe."— Presentation transcript:

1 Search Engines

2 Search Strategies Define the search topic(s) and break it down into its component parts What terms, words or phrases do you use to describe the topic? ‰What other terms might be used for this topic? If you need ideas to help you think of terms to describe your topic: – Use encyclopedias or handbooks for background information and terminology – Look at words in the Catalogue records, from Subject headings, Contents, Notes Use Boolean Logic

3 Web directories What: Directories are a collection of web sites that are organized by broad subject categories. When you search a directory, you are only searching the web sites identified and included in that directory. Directories may be broad (covering all subject areas), or subject specific (focusing on a particular subject). How: Directories are created by people. People have identified web sites and then organized them for you. A good directory will also evaluate web sites for quality and reliability before including them in the directory.

4 Meta-Search Engines Meta-search engines don't create their own database of information. They search the databases of other search engines. The major advantage of using a meta-search engine is that it allows the user to search several search engines simultaneously. Examples of meta-search engines include · Metacrawler www.metacrawler.com/ · Inference www.infind.com · Savvysearch www.savvysearch.com · Mamma www.mamma.com · Dogpile www.dogpile.com/ · Search www.search.com/ · C4 www.c4.com/ · Profusion www.profusion.com/

5 Search Engine An Internet tool which will search for sites containing the words that you designate as a search term Search engines search their own databases of information When you use a search engine to search the web, you are searching only the web pages and web objects (i.e. PDF files) that the search engine has identified at a given point in time. You are not searching the web in "real time" - only a given point in time.

6 Search Engines vs. Directories Search Engines Computer built index of information on web More inclusive Used to find specific resources Searchable by keyword Excessive “hits” Every page of a Website is indexed Better for general searches, but can be used to find specific information Directories Human aided, organized list May be general or subject- specific May be able to “search” directory – Google - general – NetTech Educational Technology Coordinator Website - subject specific User has control of browsing Fixed vocabulary Links go to Website home pages only Better at general searches

7 Search Engine Deploys a robot program called a spider or robot designed to track down Web pages, follow the links these pages contain, and add information to their own database Each search engine has its own way of doing things

8 Two major functions The index process: building data structures that enable searching The query process: using those data structures to produce a ranked list of documents for a user’s query

9 Architecture of a Search Engine The Web Ad indexes Web spider Indexer Indexes Search User

10 10 UCB SIMS 202, Sept. 2004 Avi Rappoport, Search Tools Consulting What the Index Needs Basic information for document or record – File name / URL / record ID – Title or equivalent – Size, date, MIME type Full text of item More metadata – Product name, picture ID – Category, topic, or subject – Other attributes, for relevance ranking and display

11 11 UCB SIMS 202, Sept. 2004 Avi Rappoport, Search Tools Consulting Simple Index Diagram

12 The Indexing Process

13 Text Acquisition Identifying and making available the documents that will be searched How? – Crawling or scanning the web, a corporate intranet, or other sources of information – Building a document data store containing the text and metadata for all the documents Metadata: document type, document structure, document length, …

14 Crawlers Identifying and acquiring documents for the search engine Web crawler: following links on web pages to discover new pages – Efficiency: how to handle the huge volume of new pages and updated pages Web crawler restricted to a single site supports site search Topic-based/focused crawlers: using classification techniques to restrict pages that are likely relevant to a specific topic – Used in vertical or topical search Enterprise document crawler: following links to discover both internal and external pages

15 Conversion Converting a variety of formats (e.g., HTML,XML, PDF, …) into a consistent text and metadata format Resolving encoding problem – Using ASCII (7 bits) or extended ASCII (8 bits) for English – Using Unicode (16 bits) for international languages

16 Document Data Stores A simple database to manage large numbers of documents and structured data Document components are typically stored in a compressed form for efficiency Structured data consists of document metadata and other information extracted from the documents such as links and anchor text

17 Text Transformation / Index Creation Text transformation: transforming documents into index terms or features – Index terms: the parts of a document that are stored in the index and used in searching – Features: parts of a text document that is used to represent its content – Examples: phrases, names, dates, links, … – Index vocabulary: the set of all the terms that are indexed for a document collection Index creation: creating the indexes or data structures – Example: building inverted list indexes

18 Parsers, Stopping, and Stemming Processing the sequence of text tokens in the document to recognize structural elements such as titles, figures, links, and headings – Tokenization: identifying units to be indexed – Using syntax of markup languages to identify structures Stopping: removing common words from the stream of tokens, e.g., the, of, to, … – Reducing index size considerably Stemming: group words that are derived from a common stem – Example: fish, fishes, fishing – Increase the likelihood that words used in queries and documents will match

19 Collecting Document Statistics Gathering and recording statistical information about words, features, and documents – Statistics will be used to compute scores of documents – Stored in lookup tables Examples – Counts of index term occurrences – Positions in the documents where the index terms occurred – Counts of occurrences over groups of documents – Lengths of documents in terms of the number of tokens Actual data depends on the retrieval model and the associated ranking method

20 Weighting Calculating index term weights using the document statistics and storing them in lookup tables – Pre-computation can improve query answering efficiency TF/IDF weighting – TF (the term frequency): the frequency of index term occurrences in a document – IDF (inverse document frequency): the inverse of the frequency of index term occurrences in all documents – N/n (N: # documents indexed, n: # documents containing a particular term)

21 21 UCB SIMS 202, Sept. 2004 Avi Rappoport, Search Tools Consulting Search Query Processing What happens after you click the search button, and before retrieval starts. Usually in this order – Handle character set, maybe language – Look for operators and organize the query – Look for field names or metadata – Extract words (just like the indexer) – Deal with letter casing

22 22 UCB SIMS 202, Sept. 2004 Avi Rappoport, Search Tools Consulting Retrieval = Matching Single-word queries – Find items containing that word Multi-word queries: combine lists – Any: every item with any query word – All: only items with every word – Phrases: find only items with all words in order Boolean and complex queries – Use algorithm to combine lists


Download ppt "Search Engines. Search Strategies Define the search topic(s) and break it down into its component parts What terms, words or phrases do you use to describe."

Similar presentations


Ads by Google