Web Search – Summer Term 2006 I. General Introduction (c) Wolfgang Hürst, Albert-Ludwigs-University.


Introduction: Search
What is “search” (by machine)?
  Databases (relational databases, SQL, …): search in structured data
  Information Retrieval: search in un- (or semi-)structured data
Example: e-mail archive
  ‘All e-mails with sender from April 1st-3rd, 2006’: search in exactly specified (meta) data
  ‘All e-mails that are somehow related to project x’: search in unspecified and unstructured body
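The contrast in the e-mail archive example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the archive contents, the field names (`sender`, `date`, `body`), and both function names are hypothetical and do not come from the lecture. The first function filters on exactly specified metadata; the second counts keyword hits in the free-text body and returns the best matches first.

```python
import re
from datetime import date

# Hypothetical mini-archive; every field and value is made up for illustration.
ARCHIVE = [
    {"sender": "alice@example.org", "date": date(2006, 4, 2),
     "body": "Kickoff notes for project x, budget attached."},
    {"sender": "bob@example.org", "date": date(2006, 4, 2),
     "body": "Lunch on Friday?"},
    {"sender": "alice@example.org", "date": date(2006, 5, 9),
     "body": "The x prototype needs another review round."},
]

def search_metadata(mails, sender, start, end):
    """Structured search: exact conditions on well-defined fields."""
    return [m for m in mails
            if m["sender"] == sender and start <= m["date"] <= end]

def search_body(mails, keywords):
    """Unstructured search: count keyword hits in free text, best match first."""
    scored = []
    for m in mails:
        words = re.findall(r"[a-z0-9]+", m["body"].lower())
        score = sum(words.count(k) for k in keywords)
        if score > 0:
            scored.append((score, m))
    # Sort on the score only; a mail either matches partially or not at all.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [m for _, m in scored]
```

The metadata search either matches or it does not, while the body search returns partial matches in an order that depends on a (here very crude) scoring rule.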

Information Retrieval (IR)
Information Retrieval (IR) deals with the representation, storage, organization of, and access to information items. (Page 1, Baeza-Yates and Ribeiro-Neto [1])
Information Retrieval (IR) = Part of computer science which studies the retrieval of information (not data) from a collection of written documents. The retrieved documents aim at satisfying a user information need usually expressed in natural language. (Glossary, page 444, Baeza-Yates & Ribeiro-Neto [1])
Note: Many other definitions exist. Generally, all share this common view:
[Figure labels: INFORMATION NEED, QUERY, DATA / DOCUMENTS, INFORMATION]

[Figure, built up over three slides: the search process. Labels: USER, INFORMATION NEED, QUERY, RESULT, INFORMATION RETRIEVAL SYSTEM (QUERY PROCESSING & SEARCHING & RANKING, INDEX, INDEXING), DATA / DOCUMENTS]

Information Retrieval (IR)
Main problem: Unstructured, imprecisely and imperfectly defined data
But also: The whole search process can be characterized as uncertain and vague
Hence: Information is often returned in the form of a sorted list (docs ranked by relevance).
[Figure labels: INFORMATION NEED, QUERY, DATA / DOCUMENTS, INFORMATION]
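A sorted result list needs some score to sort by. As a minimal sketch, the following Python uses TF*IDF, which the course outline names as one ranking method; the exact weighting (raw term frequency times log(N / document frequency)), the tokenizer, and the function names are assumptions made for illustration, not the scheme used in the lecture.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lower-case the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())

def rank(query, docs):
    """Return (doc_id, score) pairs sorted by descending TF*IDF score."""
    tokenized = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized.values() for term in set(tokens))
    results = []
    for doc_id, tokens in tokenized.items():
        tf = Counter(tokens)
        # Sum tf * idf over the query terms that occur in this document.
        score = sum(tf[t] * math.log(n_docs / df[t])
                    for t in tokenize(query) if t in tf)
        results.append((doc_id, score))
    results.sort(key=lambda pair: pair[1], reverse=True)
    return results
```

Every document receives a score, including documents that match only part of the query, so the output is an ordering rather than a yes/no answer.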

‘Data Retrieval’ vs. ‘IR’

                       DATA RETRIEVAL    INFORMATION RETRIEVAL
MATCHING               EXACT MATCH       PARTIAL / BEST MATCH
INFERENCE              DEDUCTION         INDUCTION
MODEL                  DETERMINISTIC     PROBABILISTIC
CLASSIFICATION         MONOTHETIC        POLYTHETIC
QUERY LANGUAGE         ARTIFICIAL        NATURAL
QUERY SPECIFICATION    COMPLETE          INCOMPLETE
ITEMS WANTED           MATCHING          RELEVANT
ERROR RESPONSE         SENSITIVE         INSENSITIVE

Source: C. J. van Rijsbergen: Information Retrieval

Summary of the most important terms
Query = The expression of the user information need in the input language provided by the information system. The most common type of input language simply allows the specification of keywords and of a few Boolean connectives. (Glossary, page 449, Baeza-Yates & Ribeiro-Neto [1])
Index = A data structure built on the text to speed up searching. (Glossary, page 443, Baeza-Yates & Ribeiro-Neto [1])
The concept of relevance = Measure to quantify the relevance of a particular document for a particular user in a particular situation.
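To make the definitions of index and query concrete, here is a minimal Python sketch of an inverted index, one common data structure of this kind. The function names, the tokenizer, and the choice of a purely conjunctive (AND) keyword query are assumptions for illustration; the glossary definition does not prescribe any particular structure.

```python
import re
from collections import defaultdict

def build_index(docs):
    """Map each term to the set of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in re.findall(r"[a-z0-9]+", text.lower()):
            index[term].add(doc_id)
    return index

def search_and(index, query):
    """Keyword query with an implicit Boolean AND between all terms."""
    terms = re.findall(r"[a-z0-9]+", query.lower())
    if not terms:
        return set()
    # Look up one posting set per term; unknown terms yield an empty set.
    postings = [index.get(t, set()) for t in terms]
    return set.intersection(*postings)
```

The index is built once over the whole collection, so answering a query only touches the posting sets of the query terms instead of scanning every document.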

IR Process: Tasks Involved
[Figure labels: INFORMATION NEED, USER INTERFACE, QUERY, QUERY PROCESSING (PARSING & TERM PROCESSING), LOGICAL VIEW OF THE INFORMATION NEED, DOCUMENTS, SELECT DATA FOR INDEXING, PARSING & TERM PROCESSING, LOGICAL VIEW OF THE DOCUMENTS (INDEX), SEARCHING, RANKING, RESULTS, DOCS., RESULT REPRESENTATION, PERFORMANCE EVALUATION]

Evaluation of IR Systems
Standard approaches for algorithm and computer system evaluation:
  Speed / processing time
  Storage requirements
  Correctness of used algorithms
But most importantly: Performance, effectiveness
Questions: What is a good / better search engine? How to measure search engine quality? Etc.
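One common way to put a number on effectiveness is precision and recall, which the course outline lists under evaluation. The sketch below assumes that the set of relevant documents for a query is already known (for example from human judgments); the function name and the convention of returning 0.0 for empty sets are illustrative choices, not part of the lecture.

```python
def precision_recall(retrieved, relevant):
    """Compute precision and recall for one query.

    precision = fraction of the retrieved documents that are relevant
    recall    = fraction of the relevant documents that were retrieved
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

For instance, if a system returns four documents of which two are relevant, and three relevant documents exist in total, precision is 2/4 and recall is 2/3.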

Evaluation of IR Systems
Another important issue: Usability, users’ perception
Example:
  User 1 & system 1: ‘It took me 10 min to find the information.’
  User 2 & system 2: ‘It took me 14 min to find the information.’

Evaluation of IR Systems
Another important issue: Usability, users’ perception
Example:
  User 1 & system 1: ‘It took me 10 min to find the information. Those were the worst 10 minutes of my life. I really hate this system!’
  User 2 & system 2: ‘It took me 14 min to find the information. I never had so much fun using any search engine before!’

Some Historical Remarks
1950s: Basic idea of searching text with a computer
1960s: Key developments, e.g.
  The SMART system (G. Salton, Harvard/Cornell)
  The Cranfield evaluations
1970s and 1980s: Advancements of basic ideas
  But: mainly with small test collections
1990s: Establishment of the TREC (Text Retrieval Conference) series (running since 1992)
  Large text collections, expansion to other fields and areas, e.g. spoken document retrieval, non-English or multi-lingual retrieval, information filtering, user interactions, WWW, video retrieval, etc.
Source: Amit Singhal, ‘Modern Information Retrieval: A Brief Overview’ (Ch. 1), IEEE Bulletin

Information Retrieval & Web Search
Historically, IR was mainly motivated by text search (libraries, etc.)
Today: Various other areas and data, e.g. multimedia (images, video, etc.), WWW, etc.
Web search: perfect example for an IR system
Goal: Find best possible results (web pages) based on
  a) Unstructured, heterogeneous, semi-structured data
  b) Imprecise, ambiguous, short queries
  (Note: ‘Best possible results’ is also a very vague specification of the ultimate goal)
But: Very different from traditional IR tasks!

Characteristics of the Web
Size: The web is big! And there are lots of users!
Documents: Extreme variety regarding formats, structure, quality, etc.
Users: Very different skills & intentions, e.g.
  Find all information about related patents
  Find some good tourist information about Paris
  Find the phone no. of the tourist office
Location: The web is a distributed system
Spam: Expect manipulation instead of cooperation from the document providers
Dynamic: The web keeps growing & changing

Web Search
Web search is an active research area with high economic impact
Many open questions & challenges for research: Improving existing systems, adapting to new scenarios (more data, spam, …), new challenges (different data formats, multimedia, …), new tasks (desktop search, personalization, …), etc.
Many other approaches & techniques exist, e.g. clustering, specialized search engines, meta search engines, etc.
We will cover some of this here, i.e. …

Web Search Course: Rough Outline
Traditional (text) retrieval: Index generation (data structures), text processing, ranking (TF*IDF, …), models (Boolean, Vector Space, Probabilistic), evaluation (precision & recall, TREC, …)
  Only the most important concepts, as required for the main part of the course, i.e.:
Web search (special case of IR): Special characteristics of the web, ranking (PageRank, HITS, …), crawling (spiders, robots), indexing, and some selected topics
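As a preview of link-based ranking, the following Python sketch shows the basic PageRank idea as a power iteration over a tiny link graph. It is a simplified illustration and not the algorithm as run by any real search engine: the damping factor of 0.85, the fixed number of iterations, the even redistribution of rank from pages without outgoing links, and the function name are all assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank scores for a graph given as {page: [linked pages]}."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start from a uniform distribution
    for _ in range(iterations):
        # Every page receives a small base share (the "random jump").
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on in equal parts to the pages it links to.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Page without outgoing links: spread its rank over all pages.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank
```

In a graph such as `{"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}`, page `c` collects links from three pages and ends up with the highest score, while `d`, which nobody links to, keeps only the base share.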

Text books about (text) IR
[1] Ricardo Baeza-Yates, Berthier Ribeiro-Neto: ‘Modern Information Retrieval’, Addison Wesley, 1999
[2] William B. Frakes, Ricardo Baeza-Yates (Eds.): ‘Information Retrieval – Data Structures and Algorithms’, P T R Prentice Hall, 1992
[3] C. J. van Rijsbergen: ‘Information Retrieval’, 1979, available online at
[4] I. Witten, A. Moffat, T. Bell: ‘Managing Gigabytes’, Morgan Kaufmann Publishing, 1999
Excerpts from a new book ‘Introduction to Information Retrieval’ by C. Manning, P. Raghavan, H. Schütze (to appear 2007) are available online at
Only certain topics will be covered in this course. There are no books on web search, but selected articles will be recommended in the lecture.