Introduction to Search Engines

Slides:



Advertisements
Similar presentations
Chapter 5: Introduction to Information Retrieval
Advertisements

Introduction to Information Retrieval
Multimedia Database Systems
Query Languages. Information Retrieval Concerned with the: Representation of Storage of Organization of, and Access to Information items.
UCLA : GSE&IS : Department of Information StudiesJF : 276lec1.ppt : 5/2/2015 : 1 I N F S I N F O R M A T I O N R E T R I E V A L S Y S T E M S Week.
Search Engines. 2 What Are They?  Four Components  A database of references to webpages  An indexing robot that crawls the WWW  An interface  Enables.
Information Retrieval in Practice
Search Engines and Information Retrieval
T.Sharon - A.Frank 1 Internet Resources Discovery (IRD) Classic Information Retrieval (IR)
Best Web Directories and Search Engines Order Out of Chaos on the World Wide Web.
Intelligent Information Retrieval CS 336 –Lecture 2: Query Language Xiaoyan Li Spring 2006 Modified from Lisa Ballesteros’s slides.
Parametric search and zone weighting Lecture 6. Recap of lecture 4 Query expansion Index construction.
Information Retrieval Concerned with the: Representation of Storage of Organization of, and Access to Information items.
Information retrieval Finding relevant data using irrelevant keys Example: database of photographic images sorted by number, date. DBMS: Well structured.
Information retrieval: overview. Information Retrieval and Text Processing Huge literature dating back to the 1950’s! SIGIR/TREC - home for much of this.
Chapter 5: Information Retrieval and Web Search
Overview of Search Engines
Search Engines and Information Retrieval Chapter 1.
Reyyan Yeniterzi Weakly-Supervised Discovery of Named Entities Using Web Search Queries Marius Pasca Google CIKM 2007.
©2008 Srikanth Kallurkar, Quantum Leap Innovations, Inc. All rights reserved. Apollo – Automated Content Management System Srikanth Kallurkar Quantum Leap.
Chapter 2 Architecture of a Search Engine. Search Engine Architecture n A software architecture consists of software components, the interfaces provided.
Question Answering.  Goal  Automatically answer questions submitted by humans in a natural language form  Approaches  Rely on techniques from diverse.
Web Searching Basics Dr. Dania Bilal IS 530 Fall 2009.
Thanks to Bill Arms, Marti Hearst Documents. Last time Size of information –Continues to grow IR an old field, goes back to the ‘40s IR iterative process.
Topical Crawlers for Building Digital Library Collections Presenter: Qiaozhu Mei.
1 Information Retrieval Acknowledgements: Dr Mounia Lalmas (QMW) Dr Joemon Jose (Glasgow)
Xiaoying Gao Computer Science Victoria University of Wellington Intelligent Agents COMP 423.
IL Step 3: Using Bibliographic Databases Information Literacy 1.
Chapter 6: Information Retrieval and Web Search
Search Engines. Search Strategies Define the search topic(s) and break it down into its component parts What terms, words or phrases do you use to describe.
Introduction to Digital Libraries hussein suleman uct cs honours 2003.
Information Retrieval Model Aj. Khuanlux MitsophonsiriCS.426 INFORMATION RETRIEVAL.
Search Engine Architecture
IR Theory: Relevance Feedback. Relevance Feedback: Example  Initial Results Search Engine2.
How Do We Find Information?. Key Questions  What are we looking for?  How do we find it?  Why is it difficult? “A prudent question is one-half of wisdom”
Information Resources Libraries & the Open Web Frederic Murray Assistant Professor MLIS, University of British Columbia BA, Political Science, University.
Information Retrieval
Information Retrieval Transfer Cycle Dania Bilal IS 530 Fall 2007.
WIDIT at TREC-2005 HARD Track Kiduk Yang, Ning Yu, Hui Zhang, Ivan Record, Shahrier Akram WIDIT Laboratory School of Library & Information Science Indiana.
Fusion-based Approach to Web Search Optimization Kiduk Yang, Ning Yu WIDIT Laboratory SLIS, Indiana University AIRS2005 Kiduk Yang, Ning Yu WIDIT Laboratory.
Xiaoying Gao Computer Science Victoria University of Wellington COMP307 NLP 4 Information Retrieval.
Chapter 20 Asking Questions, Finding Sources. Characteristics of a Good Research Paper Poses an interesting question and significant problem Responds.
IR Theory: Web Information Retrieval. Web IRFusion IR Search Engine 2.
Introduction to Information Retrieval. What is IR? Sit down before fact as a little child, be prepared to give up every conceived notion, follow humbly.
WHIM- Spring ‘10 By:-Enza Desai. What is HCIR? Study of IR techniques that brings human intelligence into search process. Coined by Gary Marchionini.
Information Retrieval in Practice
Searching for Information
Information Storage and Retrieval Fall Lecture 1: Introduction and History.
Designing Cross-Language Information Retrieval System using various Techniques of Query Expansion and Indexing for Improved Performance  Hello everyone,
Search Engine Architecture
Search Engine Architecture
Information Retrieval and Web Search
CS 430: Information Discovery
Prepared by Rao Umar Anwar For Detail information Visit my blog:
INFORMATION RETRIEVAL
Boolean Retrieval Term Vocabulary and Posting Lists Web Search Basics
Search Techniques and Advanced tools for Researchers
Thanks to Bill Arms, Marti Hearst
Information Retrieval
Introduction to Search Engines
Searching EIT, Author Gay Robertson, 2017.
IL Step 3: Using Bibliographic Databases
Introduction to Information Retrieval
CSE 635 Multimedia Information Retrieval
Introduction to Information Retrieval
Chapter 5: Information Retrieval and Web Search
Search Engine Architecture
Cell Biology and Genetics
Information Retrieval and Web Design
Information Retrieval and Web Design
Presentation transcript:

Introduction to Search Engines

Search Engine Overview Query (질의) 1 Searchable Index (색인) Search Results 2 3 Search Data (0) (1) Query Indexing (2) Document Ranking (3) Result Display 1. Document Collection - e.g., spider/crawler 2. Document Indexing - term indexing (tokenizing, stop & stem) - term weighting USER: Has information need 1.     Identify the Information Need: - Think (Reflection), Talk (Discussion), Learn about it (Info. Processing) 2.     Communicate the Information Need - Say it in my own words and hope for the best - Express it in the system’s query language 3.     Give Feedback to refine and update #1 & #2 - This isn’t what I am looking for. - This is it. Give me more like it. - This could be it. I’m not sure. - This is what I asked for, but now I would like this. INTERMEDIARY: Knows how to find Information  1.     Discover User’s Information Need - Questions to guide (e.g. expand, focus, clarify), - Dialogues to discover (e.g. motivation, background) - Provide, or suggest potentially useful information 2.     Query the Database for appropriate Information - Formulate a query (Information in the form of question), Translate into system’s language 3.     Process Feedback to refine and update #1 & #2 - You wanted to find out about “…,” right? (NO  redo #1; YES  redo #2) - Reformulate the query to emphasize, de-emphasize, fuzzy-emphasize the “importance” of information contents relative to the user. - Update the query to accommodate the change in Information Need  User Intermediary Information What am I looking for? - Identification of info. need What question do I ask? - Query formulation What is the searcher looking for? - Discovery of user’s info. need How should the question be posed? - Query representation Where is the relevant information? - Query-document matching What data to collect? - Collection development What information to index? - Indexing/Representation How to represent it? - Data structure Search Engines

Search Engine: Data Document Collection Document Indexing Select target data sources – e.g., domain, corpus, WWW Harvest data – e.g., data entry, data import, spider/crawler Document Indexing Select indexing sources (색인어) – e.g., metadata, keywords, content Extract indexing terms – e.g., tokenization, stop & stem Assign term weights – e.g., tf-idf, okapi “The frequency of word occurrence in an article furnishes a useful measurement of word significance.” 문헌에 출현한 단어들은 문헌의 내용 분석을 위해 사용될 수 있으며, 단어의 출현빈도가 이 단어의 주제어로서의 중요성을 측정하는 기준이 된다 . Luhn, H.P. (1958). The automatic creation of literature abstracts. IBM Journal of Research and Development, 2, 159-165. Search Engines

Search Engine: Indexing Process Documents (Text) INVERTED INDEX Term Weighting Tokenization Tokens Tokens SEQUENTIAL INDEX Tokens Token Selection Tokens Tokens Tokens Tokens Token Normalization Select Tokens D1: Information retrieval seminars D2: Retrieval Models and Information Retrieval D3: Information Model D1 D2 D3 wd1 (information) 1 wd2 (model) wd3 (retrieval) 2 wd4 (seminar) D1 information 1, retrieval 1, seminar 1 D2 information 1, model 1, retrieval 2 D3 information 1, model 1 D1: information, retrieval, seminar(s) D2: retrieval, model(s), and, information, retrieval D3: information, model Search Engines

Search Engine: Search Query Indexing Document Ranking Result Display Tokenization Stop & Stem Term Weighting Document Ranking Query-Document matching Document Score computation Result Display Content - e.g., title & snippets Layout - e.g., grouped by category Toppings - e.g., related searches Query: What is information retrieval? Q: Information 1, retrieval 1 Index Term D1 D2 D3 wd1 (information) 1 wd2 (model) wd3 (retrieval) 2 wd4 (seminar) Rank docID score 1 D2 3 2 D1 D3 Search Engines

8 1 9 2 10 11 3 4 12 5 13 6 14 7 Search Engines

Result Categories 15 16 17 Proprietary (Naver-specific) content Encyclopedia Naver Books Q&A DB (지식iN) Magazine Café Blog Book Map Website Advertisement (파워링크) Image Webpage Naver News Library Video Naver AppStore Naver Scholar Naver Post Naver Shopping News Naver Dictionary 15 16 17 Proprietary (Naver-specific) content Dynamic category order Toppings Search by Category Related Searches Popular Searches (by category) 18 Query: 정보검색 (Information Retrieval) Query: 검색엔진 (Search Engine) 19 20 Search Engines

Result Categories 1 Webpage-centric content Dynamic category order Advertisement 1 Webpage-centric content Dynamic category order Toppings Search by Category Related Searches 2 Query: Information Retrieval Query: Search Engine Search Engines

Search Engine vs. Database vs. Directories Corpus Type General Specific General/Specific Data Collection Automatic - crawler/spider Manual - data entry/import - classification Data Quality Not controlled Controlled Data Organization None (bag-of-words) Structured - Relational - Hierarchical Query Input Text box Field-specific - Boolean Category Tree Search Result Ranked - documents Not ranked - records - categories Search Index Document text Database Tables e.g. Google Library Search dmoz.org USER: Has information need 1.     Identify the Information Need: - Think (Reflection), Talk (Discussion), Learn about it (Info. Processing) 2.     Communicate the Information Need - Say it in my own words and hope for the best - Express it in the system’s query language 3.     Give Feedback to refine and update #1 & #2 - This isn’t what I am looking for. - This is it. Give me more like it. - This could be it. I’m not sure. - This is what I asked for, but now I would like this. INTERMEDIARY: Knows how to find Information  1.     Discover User’s Information Need - Questions to guide (e.g. expand, focus, clarify), - Dialogues to discover (e.g. motivation, background) - Provide, or suggest potentially useful information 2.     Query the Database for appropriate Information - Formulate a query (Information in the form of question), Translate into system’s language 3.     Process Feedback to refine and update #1 & #2 - You wanted to find out about “…,” right? (NO  redo #1; YES  redo #2) - Reformulate the query to emphasize, de-emphasize, fuzzy-emphasize the “importance” of information contents relative to the user. - Update the query to accommodate the change in Information Need  Search Engines

WIDIT 2003: Web IR System Body Index Anchor Index Header Index Sub-indexes Sub-indexes Sub-indexes Documents System Training Fusion Module Indexing Module Retrieval Module Search Results queries queries Topics Fusion Result Simple Queries Phrase Queries Dynamic Tuning Final Result Reranking Module Search Engines

WIDIT 2004: Web IR w/ Query Classification Body Index Anchor Index Header Index Static Tuning Sub-indexes Sub-indexes Sub-indexes Documents Indexing Module Fusion Module Retrieval Module Search Results Topics Queries Queries Simple Queries Expanded Queries Fusion Result Dynamic Tuning Query Classification Module Query Types Re-ranking Module Final Result Search Engines

WIDIT 2004: Dynamic Tuning Search Engines

WIDIT 2005: Web HARD IR System Topics WordNet NLP Module Web CF Documents OSW WebX Indexing Inverted Index Synonym Definition Noun Phrase Web Terms OSW Phrase Search Results Retrieval Module Fusion Module Automatic Tuning Baseline Result CF Terms Post-CF Result Re-ranking Module Final User Search Engines

WIDIT 2006: Blog IR System Search Engines