Web Search – Summer Term 2006 I. General Introduction (c) Wolfgang Hürst, Albert-Ludwigs-University.


1 Web Search – Summer Term 2006 I. General Introduction (c) Wolfgang Hürst, Albert-Ludwigs-University

2 Introduction: Search
What is "search" (by machine)?
- Databases: relational databases, SQL, … → search in structured data
- Information Retrieval → search in un- (or semi-)structured data
Example: email archive
- 'All emails with sender x@y.z from April 1st–3rd, 2006' → search in exactly specified (meta)data
- 'All emails that are somehow related to project x' → search in an unspecified and unstructured body
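The contrast on this slide can be sketched in a few lines of Python; the email archive, field names, and the keyword-overlap score below are purely illustrative, not part of the lecture:

```python
from datetime import date

# Hypothetical mini email archive (all names and fields are made up).
emails = [
    {"sender": "x@y.z", "date": date(2006, 4, 2), "body": "status report for project x"},
    {"sender": "a@b.c", "date": date(2006, 4, 10), "body": "lunch on friday?"},
    {"sender": "a@b.c", "date": date(2006, 4, 1), "body": "notes about project x milestones"},
]

# Database-style search: exact match on exactly specified metadata fields.
structured = [e for e in emails
              if e["sender"] == "x@y.z"
              and date(2006, 4, 1) <= e["date"] <= date(2006, 4, 3)]

# IR-style search: score the unstructured body by naive keyword overlap
# with the vague request 'related to project x'.
query = {"project", "x"}
ranked = sorted(emails,
                key=lambda e: len(query & set(e["body"].split())),
                reverse=True)
```

The structured query has one exact answer; the IR-style query only produces a ranking of candidates, which is precisely the distinction the slide draws.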

3 Information Retrieval (IR)
'Information Retrieval (IR) deals with the representation, storage, organization of, and access to information items.' (Page 1, Baeza-Yates and Ribeiro-Neto [1])
'Information Retrieval (IR) = part of computer science which studies the retrieval of information (not data) from a collection of written documents. The retrieved documents aim at satisfying a user information need usually expressed in natural language.' (Glossary, page 444, Baeza-Yates & Ribeiro-Neto [1])
Note: Many other definitions exist. Generally, all share this common view:
[Diagram: INFORMATION NEED → QUERY → DATA / DOCUMENTS → INFORMATION]

4–6 The Search Process (diagram, built up over three slides)
[Diagram: A USER with an INFORMATION NEED poses a QUERY to the INFORMATION RETRIEVAL SYSTEM. On the data side, the DATA / DOCUMENTS are turned into an INDEX (INDEXING); QUERY PROCESSING & SEARCHING & RANKING over the index produce the RESULT returned to the user.]
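The indexing, query processing, searching, and ranking stages in the diagram can be sketched as a toy inverted index in Python; the documents, whitespace tokenization, and term-overlap ranking are illustrative simplifications, not the system described in the lecture:

```python
from collections import defaultdict

# Toy document collection (doc id -> text).
docs = {
    1: "web search engines index the web",
    2: "information retrieval ranks documents",
    3: "search engines rank results by relevance",
}

# INDEXING: build an inverted index mapping term -> set of doc ids.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

# QUERY PROCESSING & SEARCHING: look up each query term in the index.
query = "search engines".split()
candidates = set().union(*(index[t] for t in query))

# RANKING: order candidates by how many query terms they contain.
result = sorted(candidates,
                key=lambda d: sum(d in index[t] for t in query),
                reverse=True)
```

Real systems refine every stage (term processing, better scoring functions such as TF*IDF, efficient index structures), but the data flow matches the diagram.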

7 Information Retrieval (IR)
Main problem: unstructured, imprecisely and imperfectly defined data.
But also: the whole search process can be characterized as uncertain and vague.
Hence: information is often returned in the form of a sorted list (docs ranked by relevance).
[Diagram: INFORMATION NEED → QUERY → DATA / DOCUMENTS → INFORMATION]

8 'Data Retrieval' vs. 'IR'
Source: C. J. van Rijsbergen: Information Retrieval (http://www.dcs.gla.ac.uk/Keith/Chapter.1/Ch.1.html)

                      DATA RETRIEVAL    INFORMATION RETRIEVAL
Matching              exact match       partial / best match
Inference             deduction         induction
Model                 deterministic     probabilistic
Classification        monothetic        polythetic
Query language        artificial        natural
Query specification   complete          incomplete
Items wanted          matching          relevant
Error response        sensitive         insensitive

9 Summary of the most important terms
Query = 'The expression of the user information need in the input language provided by the information system. The most common type of input language simply allows the specification of keywords and of a few boolean connectives.' (Glossary, page 449, Baeza-Yates & Ribeiro-Neto [1])
Index = 'A data structure built on the text to speed up searching.' (Glossary, page 443, Baeza-Yates & Ribeiro-Neto [1])
The concept of relevance = a measure to quantify the relevance of a particular document for a particular user in a particular situation.
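A keyword query with a few boolean connectives, as in the glossary definition of a query, can be evaluated directly on an inverted index with set operations; the index below is a hand-built toy example:

```python
# Hand-built inverted index: term -> set of doc ids containing it.
index = {
    "web":    {1, 3},
    "search": {1, 2, 3},
    "video":  {2},
}

# "search AND web": intersection of posting sets.
and_result = index["search"] & index["web"]
# "search AND NOT video": set difference.
not_result = index["search"] - index["video"]
# "web OR video": union.
or_result = index["web"] | index["video"]
```

Boolean retrieval of this kind is an exact-match model; later lectures contrast it with ranked models such as the vector space model.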

10 IR Process: Tasks Involved
[Diagram: On the document side, SELECT DATA FOR INDEXING and PARSING & TERM PROCESSING yield the LOGICAL VIEW OF THE DOCUMENTS (the INDEX). On the user side, the INFORMATION NEED enters through the User Interface as a QUERY; QUERY PROCESSING (PARSING & TERM PROCESSING) yields the LOGICAL VIEW OF THE INFORMATION NEED. SEARCHING and RANKING over the index produce RESULTS (DOCS.), delivered via RESULT REPRESENTATION; PERFORMANCE EVALUATION spans the whole process.]

11 Evaluation of IR Systems
Standard approaches for algorithm and computer system evaluation:
- Speed / processing time
- Storage requirements
- Correctness of the algorithms used
But most importantly:
- Performance / effectiveness
Questions: What is a good / better search engine? How do we measure search engine quality? Etc.
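One standard answer to "how do we measure quality?", previewed here and treated with the evaluation material later in the course, is precision and recall for a single query; the retrieved and relevant document sets below are hypothetical:

```python
# Hypothetical judgment for one query: which docs the system returned,
# and which docs are actually relevant to the information need.
retrieved = {1, 2, 5, 7}
relevant = {2, 3, 7}

hits = retrieved & relevant              # relevant docs that were found
precision = len(hits) / len(retrieved)   # fraction of results that are relevant
recall = len(hits) / len(relevant)       # fraction of relevant docs that were found
```

Here precision is 2/4 = 0.5 and recall is 2/3: the system returns some noise and also misses a relevant document, which is the typical trade-off these measures expose.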

12–13 Evaluation of IR Systems
Another important issue: usability and the users' perception.
Example:
- User 1 & system 1: 'It took me 10 min to find the information. Those were the worst 10 minutes of my life. I really hate this system!'
- User 2 & system 2: 'It took me 14 min to find the information. I never had so much fun using any search engine before!'

14 Some Historical Remarks
Source: Amit Singhal, 'Modern Information Retrieval: A Brief Overview' (Ch. 1), IEEE Bulletin, 2001
- 1950s: Basic idea of searching text with a computer
- 1960s: Key developments, e.g. the SMART system (G. Salton, Harvard/Cornell) and the Cranfield evaluations
- 1970s and 1980s: Advancements of the basic ideas, but mainly with small test collections
- 1990s: Establishment of the TREC (Text REtrieval Conference) series (from 1992 until today): large text collections; expansion to other fields and areas, e.g. spoken document retrieval, non-English or multilingual retrieval, information filtering, user interactions, WWW, video retrieval, etc.

15 Information Retrieval & Web Search
Historically, IR was mainly motivated by text search (libraries, etc.). Today it covers various other areas and kinds of data, e.g. multimedia (images, video, etc.), the WWW, etc.
Web search: a perfect example of an IR system.
Goal: find the best possible results (web pages) based on
a) unstructured, heterogeneous, semistructured data
b) imprecise, ambiguous, short queries
(Note: 'best possible results' is itself a very vague specification of the ultimate goal.)
But: very different from traditional IR tasks!

16 Characteristics of the Web
- Size: The web is big! And there are lots of users!
- Documents: Extreme variety regarding formats, structure, quality, etc.
- Users: Very different skills & intentions, e.g. find all information about related patents; find some good tourist information about Paris; find the phone number of the tourist office
- Location: The web is a distributed system
- Spam: Expect manipulation instead of cooperation from the document providers
- Dynamics: The web keeps growing & changing

17 Web Search
Web search is an active research area with high economic impact.
Many open questions & challenges for research: improving existing systems, adapting to new scenarios (more data, spam, …), new challenges (different data formats, multimedia, …), new tasks (desktop search, personalization, …), etc.
Many other approaches & techniques exist, e.g. clustering, specialized search engines, meta search engines, etc.
We will cover some of this here, i.e. …

18 Web Search Course: Rough Outline
Traditional (text) retrieval: index generation (data structures), text processing, ranking (TF*IDF, …), models (Boolean, vector space, probabilistic), evaluation (precision & recall, TREC, …)
Only the most important concepts, as required for the main part of the course, i.e.:
Web search (a special case of IR): special characteristics of the web, ranking (PageRank, HITS, …), crawling (spiders, robots), indexing, and some selected topics
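As a small preview of the web-search ranking part, PageRank can be sketched as a short power iteration on a toy link graph; the graph and the damping factor 0.85 are illustrative choices, not values from the lecture:

```python
# Toy web graph: page -> pages it links to.
links = {1: [2, 3], 2: [3], 3: [1]}
pages = list(links)
n = len(pages)
d = 0.85  # damping factor (a common illustrative default)

# Start from a uniform distribution and iterate:
# each page distributes d * (its rank) evenly over its out-links,
# plus a uniform (1 - d) / n teleportation share.
rank = {p: 1.0 / n for p in pages}
for _ in range(50):
    new = {p: (1 - d) / n for p in pages}
    for p, outs in links.items():
        share = d * rank[p] / len(outs)
        for q in outs:
            new[q] += share
    rank = new
```

The ranks stay a probability distribution (they sum to 1), and page 3, which receives links from both other pages, ends up with the highest score.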

19 Text books about (text) IR
[1] Ricardo Baeza-Yates, Berthier Ribeiro-Neto: 'Modern Information Retrieval', Addison Wesley, 1999
[2] William B. Frakes, Ricardo Baeza-Yates (eds.): 'Information Retrieval – Data Structures and Algorithms', P T R Prentice Hall, 1992
[3] C. J. van Rijsbergen: 'Information Retrieval', 1979, available online at http://www.dcs.gla.ac.uk/Keith/Preface.html
[4] I. Witten, A. Moffat, T. Bell: 'Managing Gigabytes', Morgan Kaufmann Publishing, 1999
Excerpts from a new book, 'Introduction to Information Retrieval' by C. Manning, P. Raghavan, H. Schütze (to appear 2007), are available online at http://www-csli.stanford.edu/~schuetze/information-retrieval-book.html
Only certain topics will be covered in this course. There are no books on web search, but selected articles will be recommended in the lecture.

