Presentation is loading. Please wait.

Presentation is loading. Please wait.

Introduction to Search Engines

Similar presentations


Presentation on theme: "Introduction to Search Engines"— Presentation transcript:

1 Introduction to Search Engines

2 Search Engine Overview
Query (질의) 1 Searchable Index (색인) Search Results 2 3 Search Data (0) (1) Query Indexing (2) Document Ranking (3) Result Display 1. Document Collection - e.g., spider/crawler 2. Document Indexing - term indexing (tokenizing, stop & stem) - term weighting USER: Has information need 1.     Identify the Information Need: - Think (Reflection), Talk (Discussion), Learn about it (Info. Processing) 2.     Communicate the Information Need - Say it in my own words and hope for the best - Express it in the system’s query language 3.     Give Feedback to refine and update #1 & #2 - This isn’t what I am looking for. - This is it. Give me more like it. - This could be it. I’m not sure. - This is what I asked for, but now I would like this. INTERMEDIARY: Knows how to find Information  1.     Discover User’s Information Need - Questions to guide (e.g. expand, focus, clarify), - Dialogues to discover (e.g. motivation, background) - Provide, or suggest potentially useful information 2.     Query the Database for appropriate Information - Formulate a query (Information in the form of question), Translate into system’s language 3.     Process Feedback to refine and update #1 & #2 - You wanted to find out about “…,” right? (NO  redo #1; YES  redo #2) - Reformulate the query to emphasize, de-emphasize, fuzzy-emphasize the “importance” of information contents relative to the user. - Update the query to accommodate the change in Information Need  User Intermediary Information What am I looking for? - Identification of info. need What question do I ask? - Query formulation What is the searcher looking for? - Discovery of user’s info. need How should the question be posed? - Query representation Where is the relevant information? - Query-document matching What data to collect? - Collection development What information to index? - Indexing/Representation How to represent it? - Data structure Search Engines

3 Search Engine: Data Document Collection Document Indexing
Select target data sources – e.g., domain, corpus, WWW Harvest data – e.g., data entry, data import, spider/crawler Document Indexing Select indexing sources (색인어) – e.g., metadata, keywords, content Extract indexing terms – e.g., tokenization, stop & stem Assign term weights – e.g., tf-idf, okapi “The frequency of word occurrence in an article furnishes a useful measurement of word significance.” 문헌에 출현한 단어들은 문헌의 내용 분석을 위해 사용될 수 있으며, 단어의 출현빈도가 이 단어의 주제어로서의 중요성을 측정하는 기준이 된다 . Luhn, H.P. (1958). The automatic creation of literature abstracts. IBM Journal of Research and Development, 2, Search Engines

4 Search Engine: Indexing Process
Documents (Text) INVERTED INDEX Term Weighting Tokenization Tokens Tokens SEQUENTIAL INDEX Tokens Token Selection Tokens Tokens Tokens Tokens Token Normalization Select Tokens D1: Information retrieval seminars D2: Retrieval Models and Information Retrieval D3: Information Model D1 D2 D3 wd1 (information) 1 wd2 (model) wd3 (retrieval) 2 wd4 (seminar) D1 information 1, retrieval 1, seminar 1 D2 information 1, model 1, retrieval 2 D3 information 1, model 1 D1: information, retrieval, seminar(s) D2: retrieval, model(s), and, information, retrieval D3: information, model Search Engines

5 Search Engine: Search Query Indexing Document Ranking Result Display
Tokenization Stop & Stem Term Weighting Document Ranking Query-Document matching Document Score computation Result Display Content - e.g., title & snippets Layout - e.g., grouped by category Toppings - e.g., related searches Query: What is information retrieval? Q: Information 1, retrieval 1 Index Term D1 D2 D3 wd1 (information) 1 wd2 (model) wd3 (retrieval) 2 wd4 (seminar) Rank docID score 1 D2 3 2 D1 D3 Search Engines

6 8 1 9 2 10 11 3 4 12 5 13 6 14 7 Search Engines

7 Result Categories 15 16 17 Proprietary (Naver-specific) content
Encyclopedia Naver Books Q&A DB (지식iN) Magazine Café Blog Book Map Website Advertisement (파워링크) Image Webpage Naver News Library Video Naver AppStore Naver Scholar Naver Post Naver Shopping News Naver Dictionary 15 16 17 Proprietary (Naver-specific) content Dynamic category order Toppings Search by Category Related Searches Popular Searches (by category) 18 Query: 정보검색 (Information Retrieval) Query: 검색엔진 (Search Engine) 19 20 Search Engines

8 Result Categories 1 Webpage-centric content Dynamic category order
Advertisement 1 Webpage-centric content Dynamic category order Toppings Search by Category Related Searches 2 Query: Information Retrieval Query: Search Engine Search Engines

9 Search Engine vs. Database vs. Directories
Corpus Type General Specific General/Specific Data Collection Automatic - crawler/spider Manual - data entry/import - classification Data Quality Not controlled Controlled Data Organization None (bag-of-words) Structured - Relational - Hierarchical Query Input Text box Field-specific - Boolean Category Tree Search Result Ranked - documents Not ranked - records - categories Search Index Document text Database Tables e.g. Google Library Search dmoz.org USER: Has information need 1.     Identify the Information Need: - Think (Reflection), Talk (Discussion), Learn about it (Info. Processing) 2.     Communicate the Information Need - Say it in my own words and hope for the best - Express it in the system’s query language 3.     Give Feedback to refine and update #1 & #2 - This isn’t what I am looking for. - This is it. Give me more like it. - This could be it. I’m not sure. - This is what I asked for, but now I would like this. INTERMEDIARY: Knows how to find Information  1.     Discover User’s Information Need - Questions to guide (e.g. expand, focus, clarify), - Dialogues to discover (e.g. motivation, background) - Provide, or suggest potentially useful information 2.     Query the Database for appropriate Information - Formulate a query (Information in the form of question), Translate into system’s language 3.     Process Feedback to refine and update #1 & #2 - You wanted to find out about “…,” right? (NO  redo #1; YES  redo #2) - Reformulate the query to emphasize, de-emphasize, fuzzy-emphasize the “importance” of information contents relative to the user. - Update the query to accommodate the change in Information Need  Search Engines

10 WIDIT 2003: Web IR System Body Index Anchor Index Header Index
Sub-indexes Sub-indexes Sub-indexes Documents System Training Fusion Module Indexing Module Retrieval Module Search Results queries queries Topics Fusion Result Simple Queries Phrase Queries Dynamic Tuning Final Result Reranking Module Search Engines

11 WIDIT 2004: Web IR w/ Query Classification
Body Index Anchor Index Header Index Static Tuning Sub-indexes Sub-indexes Sub-indexes Documents Indexing Module Fusion Module Retrieval Module Search Results Topics Queries Queries Simple Queries Expanded Queries Fusion Result Dynamic Tuning Query Classification Module Query Types Re-ranking Module Final Result Search Engines

12 WIDIT 2004: Dynamic Tuning Search Engines

13 WIDIT 2005: Web HARD IR System
Topics WordNet NLP Module Web CF Documents OSW WebX Indexing Inverted Index Synonym Definition Noun Phrase Web Terms OSW Phrase Search Results Retrieval Module Fusion Module Automatic Tuning Baseline Result CF Terms Post-CF Result Re-ranking Module Final User Search Engines

14 WIDIT 2006: Blog IR System Search Engines


Download ppt "Introduction to Search Engines"

Similar presentations


Ads by Google