Web- and Multimedia-based Information Systems Lecture 2.

1 Web- and Multimedia-based Information Systems Lecture 2

2 Vector Model
Non-binary Weights
Degree of similarity
Result ranking possible
Fast & Good results

3 Vector Model
Document Vector with weights for every index term
Query Vector with weights for every index term
Both vectors have as many dimensions as there are index terms in the collection

4 Documents in Vector Space
[Figure: documents D1 to D11 plotted in a space spanned by the term axes t1, t2 and t3]

5 Vector Model
Position 1 corresponds to term 1, position 2 to term 2, position t to term t
The weight of the term is stored in each position

6 Vector Model
Cosine of the angle between the vectors taken as similarity measure
Sorting/Ranking of results
Threshold for results
More precise answers, with the more relevant documents at the top

7 Similarity Function
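
The formula on this slide did not survive in the transcript. For the measure named on slide 6, the cosine of the angle between a document vector d and a query vector q is the standard sim(d, q) = (d · q) / (|d| |q|). The following is a minimal sketch of that measure with ranking and a threshold; the function names and the zero-vector convention are assumptions made for illustration, not taken from the lecture.

```python
import math

def cosine_similarity(doc, query):
    """Cosine of the angle between two equal-length weight vectors."""
    if len(doc) != len(query):
        raise ValueError("vectors must have the same dimension")
    dot = sum(d * q for d, q in zip(doc, query))
    norm_doc = math.sqrt(sum(d * d for d in doc))
    norm_query = math.sqrt(sum(q * q for q in query))
    if norm_doc == 0 or norm_query == 0:
        return 0.0  # assumed convention: an all-zero vector matches nothing
    return dot / (norm_doc * norm_query)

def rank(docs, query, threshold=0.0):
    """Return (name, score) pairs scoring above the threshold, best first."""
    scored = [(name, cosine_similarity(vec, query)) for name, vec in docs.items()]
    kept = [pair for pair in scored if pair[1] > threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)
```

Calling `rank` with a dictionary of document vectors and a query vector yields the result list in decreasing order of similarity, which is the ranking step slide 6 refers to.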

8 Vector Model
Index Terms Weighting
  – Binary Weights
  – Raw Term Weights
  – Term frequency x Inverse document frequency

9 Binary Weights
Only the presence (1) or absence (0) of a term is included in the vector

10 Raw Term Weights
The frequency of occurrence for the term in each document is included in the vector
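
The two weighting schemes on slides 9 and 10 can be sketched as follows. This is an illustrative example only: the function names, the token list and the vocabulary order are assumptions, and position i of each vector corresponds to term i of the vocabulary, as described on slide 5.

```python
from collections import Counter

def binary_vector(tokens, vocabulary):
    """1 if the term occurs in the document, 0 otherwise."""
    present = set(tokens)
    return [1 if term in present else 0 for term in vocabulary]

def raw_vector(tokens, vocabulary):
    """Number of occurrences of each vocabulary term in the document."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]
```

For the tokens `["web", "search", "web"]` and the vocabulary `["web", "search", "engine"]`, the binary vector records only presence, while the raw vector also records that "web" occurs twice.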

11 Term frequency x Inverse document frequency

12 IDF Example
IDF provides high values for rare words and low values for common words
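
The exact formula and example from slides 11 and 12 are not preserved in the transcript. A common formulation, assumed here, multiplies the raw term frequency by log(N / n), where N is the number of documents and n is the number of documents containing the term. The sketch below uses the natural logarithm and invented sample documents; both are assumptions.

```python
import math
from collections import Counter

def idf(term, documents):
    """Inverse document frequency: log(N / n), or 0.0 for an unseen term."""
    n_containing = sum(1 for doc in documents if term in doc)
    if n_containing == 0:
        return 0.0
    return math.log(len(documents) / n_containing)

def tf_idf(term, doc, documents):
    """Raw term frequency in doc multiplied by the term's IDF."""
    return Counter(doc)[term] * idf(term, documents)
```

A word that occurs in every document gets an IDF of log(1) = 0, so it contributes nothing to the weight, while a word found in only one document gets the highest IDF in the collection.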

13 Probabilistic Model
Based on Probability
For every document, a probability is calculated for:
  – Document being relevant to the query
  – Document being irrelevant to the query
Documents more likely to be relevant than not are ranked in decreasing order of relevance

14 Text Operations in Detail
Goal: Automated Generation of Index Terms
All terms conveying meaning vs. Space requirements
Rules for extraction from documents
  – Rules for division of terms
      Punctuation
      Dashes
  – List of Stop Words
      Articles, prepositions, conjunctions
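
The extraction rules above can be sketched as a small tokenizer. This is an illustrative example under stated assumptions: the stop word list is a tiny invented sample, and treating every non-alphanumeric character (including dashes) as a separator is only one possible division rule.

```python
import re

# Illustrative sample only; a real stop word list is much longer.
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "and", "or", "but"}

def index_terms(text):
    """Split text into lowercase terms and drop stop words."""
    # Anything that is not a letter or digit separates terms,
    # so punctuation and dashes both act as division points.
    tokens = re.split(r"[^a-z0-9]+", text.lower())
    return [t for t in tokens if t and t not in STOP_WORDS]
```

With this rule a hyphenated expression such as "state-of-the-art" is divided into separate terms, and the articles and prepositions inside it are dropped. Keeping such expressions whole would be the alternative design choice.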

15 Word-oriented Reduction Schemes
Lemmatisation
Smaller term lists
Generalization of terms
Methods
  – Reduction to the infinitive
  – Reduction to a stem
Algorithmic Methods for English
German:
  – Biggest Problems: Prefixes & Compounds
  – Only with dictionaries
      Explicit listing of all forms
      Or rules to derive forms

16 Stemming
Different Methods
Most efficient: Affix removal
  – Porter Algorithm
  – Implement later
  – Series of rules to strip suffixes
      s -> nil
      sses -> ss
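
The two rules shown on the slide belong to step 1a of the Porter algorithm, which also contains the rules ies -> i and ss -> ss. The sketch below covers only that one step, not the full algorithm; the function and constant names are made up for this example.

```python
# Step 1a of the Porter algorithm; longer suffixes are tried first.
STEP_1A_RULES = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]

def strip_plural(word):
    """Apply the first matching step 1a rule, or return the word unchanged."""
    for suffix, replacement in STEP_1A_RULES:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word
```

The rule order matters: because "ss -> ss" is checked before "s -> nil", a word such as "caress" keeps its ending instead of losing the final "s".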

17 Word Type Index Term Selection
Nouns usually convey most meaning
Elimination of other word types
Clustering of compounds (computer science)
  – Noun groups
  – Maximum distance between terms

18 Thesauri
"Treasury of words"
For every entry
  – Definition
  – Synonyms
Useful within a specific knowledge domain where a controlled vocabulary can easily be obtained
Difficult with a large and dynamic document collection such as the web

19 Creation of Inverted List
Create Vocabulary
Note document and position in document for each term
Sort List (first by terms, then by positions)
Split Terms & Positions
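
The four steps above can be sketched as follows. This is a minimal in-memory illustration, assuming the documents have already been reduced to lists of index terms; the function name and the shape of the postings (pairs of document id and position) are assumptions.

```python
from collections import defaultdict

def build_inverted_index(documents):
    """documents maps a document id to its list of index terms."""
    # Note (term, document, position) for every occurrence.
    entries = []
    for doc_id, terms in documents.items():
        for position, term in enumerate(terms):
            entries.append((term, doc_id, position))
    # Sort first by term, then by document and position.
    entries.sort()
    # Split into a vocabulary and one list of positions per term.
    postings = defaultdict(list)
    for term, doc_id, position in entries:
        postings[term].append((doc_id, position))
    vocabulary = sorted(postings)
    return vocabulary, dict(postings)
```

The vocabulary and the postings are returned separately, mirroring the split on the slide: the vocabulary is small enough to search quickly, and each entry points to the positions of its term.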

20 Basic Query
Terms of the query are isolated
Get pointer to positions for every term
Conduct Set Operations
Get result documents and present them
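
A sketch of the set operations step, assuming the postings are stored as a dictionary from term to a list of (document id, position) pairs. That layout and the function names are illustrative assumptions, not taken from the lecture.

```python
def documents_for(term, postings):
    """Set of document ids in which the term occurs."""
    return {doc_id for doc_id, _position in postings.get(term, [])}

def query_and(terms, postings):
    """Documents containing every query term (set intersection)."""
    sets = [documents_for(t, postings) for t in terms]
    return set.intersection(*sets) if sets else set()

def query_or(terms, postings):
    """Documents containing at least one query term (set union)."""
    sets = [documents_for(t, postings) for t in terms]
    return set.union(*sets) if sets else set()
```

A term that is missing from the index yields an empty set, so an AND query containing it returns no documents, while an OR query simply ignores it.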

21 Advanced Query Functionality
Comparison Operators for Metadata
String of multiple terms
More general: take into account distance and order of terms
Truncation (Wildcards)

22 Information Retrieval System Evaluation
Functionality Analysis
Performance
  – Time
  – Space
Retrieval Performance
  – Batch vs. Interactive mode

23 Retrieval Performance Measures
Recall
  – The fraction of the relevant documents which has been retrieved
Precision
  – The fraction of the retrieved documents which is relevant

24 Precision vs. Recall
User usually does not inspect all results
Example:
Relevant documents R = {d2, d5}
Result ranking returned by system
  1. d1
  2. d5
  3. d2
For the second result, recall is at 50%, precision is also 50%
For the third result, recall is 100%, precision is 67%
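
The figures in the example can be checked with a short sketch that computes both measures after the first k results. The function name is an assumption, and k is expected to be at least 1.

```python
def precision_recall_at(k, ranking, relevant):
    """Precision and recall after inspecting the first k results (k >= 1)."""
    retrieved = ranking[:k]
    hits = sum(1 for doc in retrieved if doc in relevant)
    precision = hits / len(retrieved)  # share of retrieved docs that are relevant
    recall = hits / len(relevant)      # share of relevant docs that were retrieved
    return precision, recall

ranking = ["d1", "d5", "d2"]
relevant = {"d2", "d5"}
```

After two results, one of the two retrieved documents is relevant and one of the two relevant documents has been found, giving 50% for both measures. After three results, two of three retrieved documents are relevant (67% precision) and both relevant documents have been found (100% recall).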

25 Programming Assignment

26 Different part each week Web Search Engine

27 WWW Search Engine
[Figure: architecture diagram with the components WWW-Client, WWW-Server, Search Engine, Indexer, Robot, DB and Index; the arrows are labelled Query, Result List, Results, Request, Files and Documents]

28 Assignment Part 1
Program a web robot
Starts at a user-defined URL
Navigates the Web via Hypertext links
Speaks HTTP (see RFC 1945)
Stores the path it took (URLs)
  – preferably in a tree-like data structure
Stores result code & important header fields for every request to disk in a format suitable for further processing

29 Assignment Part 1 (cont.)
Implementation in Java
Pure TCP socket communications
No need to save documents in this assignment
Robot shall identify itself via HTTP User-Agent header
Extensibility required for future assignments
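
The assignment itself asks for Java. The sketch below uses Python only to illustrate the protocol steps in a few lines, and the same structure should carry over to a plain Java socket. Everything named here is hypothetical: the robot name `LectureRobot/0.1`, the function names and the `CrawlNode` class. The `Host` header is not required by HTTP/1.0 and is added on the assumption that the target server hosts several sites.

```python
import socket
from urllib.parse import urlsplit

USER_AGENT = "LectureRobot/0.1"  # hypothetical robot name

def build_request(url):
    """Return (host, port, request bytes) for an HTTP/1.0 GET of url."""
    parts = urlsplit(url)
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    lines = [
        f"GET {path} HTTP/1.0",
        f"Host: {parts.hostname}",      # optional in HTTP/1.0, see lead-in
        f"User-Agent: {USER_AGENT}",    # the robot identifies itself
        "",
        "",                             # blank line ends the request
    ]
    return parts.hostname, parts.port or 80, "\r\n".join(lines).encode("ascii")

def fetch(url, timeout=10.0):
    """Send the request over a plain TCP socket and return the raw response."""
    host, port, request = build_request(url)
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(request)
        chunks = []
        while True:
            data = conn.recv(4096)
            if not data:  # HTTP/1.0 server closes the connection when done
                break
            chunks.append(data)
    return b"".join(chunks)

class CrawlNode:
    """One visited URL; children are URLs reached by following its links."""
    def __init__(self, url, parent=None):
        self.url = url
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)

    def path(self):
        """URLs from the start URL down to this node."""
        node, urls = self, []
        while node is not None:
            urls.append(node.url)
            node = node.parent
        return list(reversed(urls))
```

Keeping request construction separate from the socket code makes it possible to check the request text without network access, and gives later assignment parts a place to add further headers.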

30 Example HTTP session
TCP connection:
    telnet www 80
HTTP Request:
    GET / HTTP/1.0
Response Headers:
    HTTP/1.0 200 Document follows
    Date: Tue, 10 Sep 1996 14:34:06 GMT
    Server: NCSA/1.4.2
    Content-type: image/gif
    Last-modified: Tue, 10 Sep 1996 13:25:26 GMT
    Content-length: 9755
Start of content
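
A response like the one in this session can be split into the result code and header fields that the robot has to store. The sketch below is an illustration only: the function name and the choice to lowercase header names are assumptions, and header lines continued over several lines are not handled.

```python
def parse_response_head(raw):
    """Return (status code, reason phrase, headers) from raw response bytes."""
    # The first blank line separates the headers from the content.
    head, _, _body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")
    parts = lines[0].split(" ", 2)  # e.g. "HTTP/1.0 200 Document follows"
    status = int(parts[1])
    reason = parts[2] if len(parts) > 2 else ""
    headers = {}
    for line in lines[1:]:
        # Split at the first colon only, so times like 14:34:06 stay intact.
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return status, reason, headers
```

Applied to the session above, this yields the status 200, the reason phrase "Document follows" and a dictionary with entries such as `content-type` and `content-length`, which could then be written to disk in whatever format the later assignment parts need.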

