Web- and Multimedia-based Information Systems Lecture 2.



Vector Model
– Non-binary weights
– Degree of similarity
– Result ranking possible
– Fast & good results

Vector Model
– Document vector with weights for every index term
– Query vector with weights for every index term
– Both vectors have as many dimensions as there are index terms in the collection

Documents in Vector Space
[Figure: documents D1 to D11 plotted in a space spanned by the term axes t1, t2 and t3]

Vector Model
– Position 1 corresponds to term 1, position 2 to term 2, position t to term t
– Each position stores the weight of the corresponding term

Vector Model
– The cosine of the angle between the document vector and the query vector is taken as the similarity measure
– Sorting/ranking of results
– Threshold for results
– More precise answer, with the more relevant documents at the top

Similarity Function
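
The cosine measure named on the previous slide is commonly written as sim(d, q) = (d · q) / (|d| |q|). The following is a minimal Python sketch of that measure together with the ranking and threshold steps; the function names, the zero-vector convention and the sample vectors are assumptions made for illustration, not part of the lecture.

```python
import math


def cosine_similarity(doc_vec, query_vec):
    """Cosine of the angle between two equal-length weight vectors."""
    if len(doc_vec) != len(query_vec):
        raise ValueError("vectors must cover the same index terms")
    dot = sum(d * q for d, q in zip(doc_vec, query_vec))
    norm_d = math.sqrt(sum(d * d for d in doc_vec))
    norm_q = math.sqrt(sum(q * q for q in query_vec))
    if norm_d == 0 or norm_q == 0:
        # Convention assumed here: an all-zero vector matches nothing.
        return 0.0
    return dot / (norm_d * norm_q)


def rank(docs, query_vec, threshold=0.0):
    """Return (doc_id, score) pairs with score > threshold, best first."""
    scored = [(doc_id, cosine_similarity(vec, query_vec))
              for doc_id, vec in docs.items()]
    kept = [pair for pair in scored if pair[1] > threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Hypothetical three-term collection and a query containing only term 1.
docs = {"d1": [1, 0, 0], "d2": [1, 1, 0], "d3": [0, 0, 1]}
ranking = rank(docs, [1, 0, 0])
```

Because the score depends only on the angle, a long document is not favoured over a short one merely for repeating the same terms more often.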

Vector Model: Index Term Weighting
– Binary weights
– Raw term weights
– Term frequency x inverse document frequency

Binary Weights
– Only the presence (1) or absence (0) of a term is recorded in the vector

Raw Term Weights
– The number of occurrences of the term in each document is recorded in the vector
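
To make the difference between binary and raw term weights concrete, here is a small Python sketch that builds both kinds of vector over a shared vocabulary. The helper names and the two toy documents are assumptions for illustration.

```python
from collections import Counter


def build_vocabulary(tokenized_docs):
    """Sorted list of all distinct terms; its order fixes the vector positions."""
    return sorted({term for doc in tokenized_docs for term in doc})


def raw_vector(tokens, vocabulary):
    """Raw term weights: how often each vocabulary term occurs in the document."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]


def binary_vector(tokens, vocabulary):
    """Binary weights: 1 if the term occurs in the document, otherwise 0."""
    present = set(tokens)
    return [1 if term in present else 0 for term in vocabulary]


# Hypothetical, already tokenized documents.
docs = [["web", "search", "web"], ["search", "engine"]]
vocabulary = build_vocabulary(docs)
```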

Term frequency x Inverse document frequency

IDF Example
– IDF yields high values for rare words and low values for common words
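
The behaviour described for IDF can be checked with a short Python sketch. It assumes one common variant, idf(t) = log(N / n_t) with N documents in the collection and n_t documents containing t, multiplied by the raw term frequency; the lecture may use a different normalisation, and the sample collection is made up.

```python
import math
from collections import Counter


def idf(term, tokenized_docs):
    """Inverse document frequency, assuming the variant log(N / n_t)."""
    n_docs = len(tokenized_docs)
    doc_freq = sum(1 for doc in tokenized_docs if term in doc)
    if doc_freq == 0:
        return 0.0  # term unknown to the collection
    return math.log(n_docs / doc_freq)


def tf_idf_vector(tokens, vocabulary, tokenized_docs):
    """Weight per vocabulary term: raw term frequency times IDF."""
    counts = Counter(tokens)
    return [counts[term] * idf(term, tokenized_docs) for term in vocabulary]


# Hypothetical collection: "the" is in every document, "crawler" in only one.
docs = [
    ["the", "web"],
    ["the", "robot"],
    ["the", "index"],
    ["the", "web", "crawler"],
]
```

With this collection, "the" gets an IDF of 0 because it appears everywhere, while "crawler" gets the highest value.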

Probabilistic Model
– Based on probability
– For every document, two probabilities are calculated:
  – the document being relevant to the query
  – the document being irrelevant to the query
– Documents that are more likely relevant than not are ranked in decreasing order of relevance

Text Operations in Detail
– Goal: automated generation of index terms
– Trade-off: all terms conveying meaning vs. space requirements
– Rules for extraction from documents
  – Rules for division of terms: punctuation, dashes
  – List of stop words: articles, prepositions, conjunctions
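
The extraction rules above can be sketched in a few lines of Python. The stop list here is a tiny illustrative sample, and splitting on every non-alphanumeric character (including dashes) is just one possible division rule, chosen for brevity.

```python
import re

# Illustrative stop list only: a few articles, prepositions and conjunctions.
STOP_WORDS = {"a", "an", "the", "in", "of", "on", "to", "and", "or", "but"}


def extract_index_terms(text, stop_words=STOP_WORDS):
    """Lower-case the text, split it into terms and drop stop words."""
    # Every run of punctuation, dashes or whitespace acts as a separator.
    tokens = re.split(r"[^a-z0-9]+", text.lower())
    return [token for token in tokens if token and token not in stop_words]


terms = extract_index_terms("Web-based search: the robot and the indexer.")
```

Treating the dash as a separator turns "Web-based" into two terms; keeping hyphenated words whole would be the alternative rule.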

Word-oriented Reduction Schemes
– Lemmatisation
– Smaller term lists
– Generalization of terms
– Methods
  – Reduction to the infinitive
  – Reduction to a stem
– Algorithmic methods for English
– German:
  – Biggest problems: prefixes and compounds
  – Only with dictionaries: explicit listing of all forms, or rules to derive forms

Stemming
– Different methods
– Most efficient: affix removal
  – Porter algorithm
  – To be implemented later
  – Series of rules to strip suffixes, e.g. s -> nil, sses -> ss
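
The two suffix rules on the slide can be tried out with the Python sketch below. It is not the full Porter algorithm: it covers only the plural rules shown, plus the "ss -> ss" guard from Porter's step 1a, which is added here so that words such as "caress" are left alone. The rule for "ies" and all later steps are omitted.

```python
# Rules are tried in order; the first matching suffix wins.
SUFFIX_RULES = [
    ("sses", "ss"),  # from the slide: caresses -> caress
    ("ss", "ss"),    # guard from Porter step 1a (not on the slide): caress -> caress
    ("s", ""),       # from the slide: cats -> cat
]


def strip_plural_suffix(word):
    """Apply the first matching suffix rule; return the word unchanged otherwise."""
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word
```

Checking the longest suffix first matters: if "s -> nil" ran before "sses -> ss", "caresses" would become "caresse".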

Word Type Index Term Selection
– Nouns usually convey most of the meaning
– Elimination of other word types
– Clustering of compounds (e.g. "computer science")
  – Noun groups
  – Maximum distance between terms

Thesauri
– "Treasury of words"
– For every entry
  – Definition
  – Synonyms
– Useful within a specific knowledge domain where a controlled vocabulary can easily be obtained
– Difficult with a large and dynamic document collection such as the web

Creation of Inverted List
– Create the vocabulary
– Note the document and the position in the document for each term
– Sort the list (first by term, then by position)
– Split terms and positions
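
The four steps above can be sketched in Python as follows. The in-memory dictionary layout and the function name are assumptions; a real index would typically keep the vocabulary and the position lists in separate files.

```python
from collections import defaultdict


def build_inverted_list(tokenized_docs):
    """Map each term to a sorted list of (doc_id, position) pairs."""
    # Note document and position for every term occurrence.
    occurrences = []
    for doc_id, tokens in tokenized_docs.items():
        for position, term in enumerate(tokens):
            occurrences.append((term, doc_id, position))

    # Sort by term first, then by document and position.
    occurrences.sort()

    # Split into vocabulary (the keys) and positions (the values).
    index = defaultdict(list)
    for term, doc_id, position in occurrences:
        index[term].append((doc_id, position))
    return dict(index)


# Hypothetical collection: document id -> list of index terms.
docs = {1: ["web", "robot", "web"], 2: ["robot", "index"]}
index = build_inverted_list(docs)
```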

Basic Query
– Isolate the terms of the query
– Get the pointer to the positions for every term
– Conduct set operations
– Get the result documents and present them
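
A basic query can then be answered with plain set operations on the position lists. The Python sketch below assumes an inverted list stored as a dictionary from term to (doc_id, position) pairs; that layout, the function names and the sample data are illustrative.

```python
def documents_for(term, index):
    """Set of documents in which the term occurs."""
    return {doc_id for doc_id, _position in index.get(term, [])}


def boolean_and(query, index):
    """Documents containing every term of the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = documents_for(terms[0], index)
    for term in terms[1:]:
        result &= documents_for(term, index)
    return result


def boolean_or(query, index):
    """Documents containing at least one term of the query."""
    result = set()
    for term in query.lower().split():
        result |= documents_for(term, index)
    return result


# Hypothetical inverted list: term -> [(doc_id, position), ...]
index = {
    "web": [(1, 0), (1, 2), (3, 0)],
    "robot": [(1, 1), (2, 0)],
    "index": [(2, 1)],
}
```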

Advanced Query Functionality
– Comparison operators for metadata
– String of multiple terms
– More general: take into account distance and order of terms
– Truncation (wildcards)
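
Distance, order and truncation all rely on the stored positions and the vocabulary. The following Python sketch shows one possible way to evaluate them, assuming an inverted list that maps each term to (doc_id, position) pairs and supporting only right truncation such as "comput*"; names and data are made up for illustration.

```python
def within_distance(term_a, term_b, index, max_distance=1, ordered=True):
    """Documents where term_b occurs at most max_distance positions from term_a.

    With ordered=True, term_b must come after term_a, so max_distance=1
    corresponds to the phrase "term_a term_b".
    """
    hits = set()
    for doc_a, pos_a in index.get(term_a, []):
        for doc_b, pos_b in index.get(term_b, []):
            if doc_a != doc_b:
                continue
            gap = pos_b - pos_a
            if not ordered:
                gap = abs(gap)
            if 0 < gap <= max_distance:
                hits.add(doc_a)
    return hits


def expand_truncation(pattern, index):
    """Vocabulary terms matching a right-truncated pattern such as 'comput*'."""
    prefix = pattern.rstrip("*")
    return sorted(term for term in index if term.startswith(prefix))


# Hypothetical inverted list: term -> [(doc_id, position), ...]
index = {
    "computer": [(1, 0), (2, 3)],
    "science": [(1, 1), (2, 0)],
    "computing": [(3, 0)],
}
```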

Information Retrieval System Evaluation
– Functionality analysis
– Performance
  – Time
  – Space
– Retrieval performance
  – Batch vs. interactive mode

Retrieval Performance Measures
– Recall
  – The fraction of the relevant documents that has been retrieved
– Precision
  – The fraction of the retrieved documents that is relevant

Precision vs. Recall
– The user usually does not inspect all results
– Example: relevant documents R = {d2, d5}
– Result ranking returned by the system: 1. d1, 2. d5, 3. d2
– At the second result, recall is 50% and precision is also 50%
– At the third result, recall is 100% and precision is 67%
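
The numbers in this example can be reproduced with a short Python sketch that evaluates precision and recall after the first k results of a ranking. The function name is an assumption; the ranking and the relevant set are the ones from the slide.

```python
def precision_recall_at(ranking, relevant, k):
    """Precision and recall after inspecting the first k results."""
    retrieved = ranking[:k]
    hits = sum(1 for doc in retrieved if doc in relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


ranking = ["d1", "d5", "d2"]
relevant = {"d2", "d5"}

for k in range(1, len(ranking) + 1):
    precision, recall = precision_recall_at(ranking, relevant, k)
    print(f"after {k} results: precision {precision:.0%}, recall {recall:.0%}")
```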

Programming Assignment

– Different part each week
– Web Search Engine

WWW Search Engine
[Figure: architecture diagram with the components WWW-Client, WWW-Server, Search Engine, Indexer, Robot, DB and Index; arrows labelled Query, Result List, Query, Results, Files, Request and Documents]

Assignment Part 1
– Program a web robot
– Starts at a user-defined URL
– Navigates the Web via hypertext links
– Speaks HTTP (see RFC 1945)
– Stores the path it took (URLs), preferably in a tree-like data structure
– Stores the result code and important header fields for every request to disk, in a format suitable for further processing
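
One possible shape for the tree-like data structure is sketched below. It is written in Python purely for brevity and is not a solution to the assignment; the class name, the field names and the choice of JSON as the on-disk format are all assumptions.

```python
import json


class CrawlNode:
    """One visited URL; children are the pages reached by following its links."""

    def __init__(self, url, status=None, headers=None):
        self.url = url
        self.status = status            # HTTP result code, e.g. 200
        self.headers = headers or {}    # selected response header fields
        self.children = []

    def add_child(self, url, status=None, headers=None):
        child = CrawlNode(url, status, headers)
        self.children.append(child)
        return child

    def to_dict(self):
        return {
            "url": self.url,
            "status": self.status,
            "headers": self.headers,
            "children": [child.to_dict() for child in self.children],
        }


def save_crawl(root, path):
    """Write the whole crawl tree to disk as JSON (format chosen for illustration)."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(root.to_dict(), handle, indent=2)
```

A nested format keeps the path the robot took explicit, and any later stage can read it back without custom parsing.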

Assignment Part 1 (cont.)
– Implementation in Java
– Pure TCP socket communication
– No need to save documents in this assignment
– Robot shall identify itself via the HTTP User-Agent header
– Extensibility required for future assignments

Example HTTP session
TCP connection:
  telnet www 80
HTTP request:
  GET / HTTP/1.0
Response headers:
  HTTP/ Document follows
  Date: Tue, 10 Sep :34:06 GMT
  Server: NCSA/1.4.2
  Content-type: image/gif
  Last-modified: Tue, 10 Sep :25:26 GMT
  Content-length: 9755
Start of content
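
The same exchange can be scripted over a plain TCP socket. The sketch below is in Python for brevity (the assignment itself asks for Java) and only illustrates the request and response-header handling. The robot name `ExampleRobot/0.1` and the function names are made up, and the `Host` header is an addition that RFC 1945 does not define but that servers hosting several sites commonly expect.

```python
import socket

USER_AGENT = "ExampleRobot/0.1"  # hypothetical robot name


def build_request(path, host, user_agent=USER_AGENT):
    """HTTP/1.0 GET request; the empty line terminates the header block."""
    lines = [
        f"GET {path} HTTP/1.0",
        f"Host: {host}",
        f"User-Agent: {user_agent}",
        "",
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def parse_response_head(raw):
    """Split the status line and header fields off a raw HTTP response."""
    head, _, _body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")
    parts = lines[0].split(" ", 2)
    version, status = parts[0], int(parts[1])
    reason = parts[2] if len(parts) > 2 else ""
    headers = {}
    for line in lines[1:]:
        name, sep, value = line.partition(":")
        if sep:
            headers[name.strip().lower()] = value.strip()
    return version, status, reason, headers


def fetch(host, path="/", port=80, timeout=10.0):
    """Open a TCP connection, send the request, read until the server closes."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(build_request(path, host))
        chunks = []
        while True:
            chunk = sock.recv(4096)
            if not chunk:
                break
            chunks.append(chunk)
    return parse_response_head(b"".join(chunks))
```

Keeping request building and response parsing separate from the socket code makes both testable without network access.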