INFO 320: Information Needs, Searching, and Presentation (aka… Search)

1 INFO 320: Information Needs, Searching, and Presentation (aka… Search)
Instructor: William Jones
TA: Brennen Smith
Lectures: Tuesdays & Thursdays, 1:30 – 3:20 pm, MGH 238
Labs: Wednesdays, 1:30 – 2:20 pm, MGH 030

2 For this Week 3 (10/13/2013) (Basics of Search)
Add Word exercise in class
Boolean search vs. the vector space model
B-trees
2.2 W: One-minute madness. Each team gets one minute to describe progress on lab exercises and issues encountered. Ongoing work in lab.

3 And also for this Week 3 (of 10/13)
Cool tool presentations; essay review
Wrap-up
Guest speaker on SEO
2.2 F: Quiz on Module 2.

5 Components of a web crawler
From Büttcher, Clarke & Cormack, 2010, Information Retrieval, Chapter 15

6 Parsing a document
What format is it in? pdf/word/excel/html?
What language is it in? How to handle “and”?
What character set is in use?
…

7 What you see...
[Image not preserved in transcript.]

8 Is not what the crawler gets
[Image not preserved in transcript.]

9 What character set is in use?
ISO-8859-1 (ISO Latin alphabet part 1): covers North America, Western Europe, Latin America, the Caribbean, Canada, Africa; historically the default for Web pages.
UTF-8: a character encoding of Unicode.
  A character in UTF-8 can be from 1 to 4 bytes long.
  UTF-8 can represent any character in the Unicode standard.
  UTF-8 is backwards compatible with ASCII.
  UTF-8 is the preferred encoding for web pages.
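The points about UTF-8 can be checked directly in Python. This is a small sketch; the sample characters are chosen for illustration and are not from the slides.

```python
# Illustrative characters (assumed, not from the slides): one per UTF-8 length.
samples = ["A", "é", "€", "😀"]

def utf8_length(ch: str) -> int:
    """Number of bytes UTF-8 uses to encode a single character."""
    return len(ch.encode("utf-8"))

lengths = {ch: utf8_length(ch) for ch in samples}

# Backwards compatibility: pure ASCII text has identical bytes in both encodings.
ascii_same = "search".encode("ascii") == "search".encode("utf-8")

# The same character has different bytes in ISO-8859-1 and in UTF-8 ...
latin1_bytes = "é".encode("iso-8859-1")
utf8_bytes = "é".encode("utf-8")

# ... so decoding with the wrong character set garbles the text.
misread = utf8_bytes.decode("iso-8859-1")

print(lengths)
print(ascii_same, latin1_bytes, utf8_bytes, misread)
```

This is why a parser needs to know the character set before it tokenizes: the bytes alone do not say which decoding is the right one.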

10 An HTML sample
[Image not preserved in transcript.]

11 Typical Stop Word List

12 Ambiguity of Natural Language (NL)
Synonymy: different words, same meaning
  “car” ~= “automobile”
  “stomach pain after eating” ~= “post-prandial abdominal discomfort”
Polysemy: same words, different meanings
  “jaguar” as an animal vs. a kind of automobile
  “juvenile victims of crime” vs. “victims of juvenile crime”
  Venetian blinds vs. blind Venetians

13 How to handle synonyms? car = automobile
Expand the index. When the document contains “automobile”, index it under “car” as well (and vice versa).
Or expand the query. When the query contains “automobile”, look under “car” too.
Or form a concept, <automobile>. When “car” is encountered in a document, index it under “<automobile>” (and under “car” too?). Likewise for “automobile”. When either “car” or “automobile” is encountered in a query, add the term “<automobile>”.
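The concept approach can be sketched in a few lines of Python. The concept table, the documents, and the function names below are all made up for illustration; this is one possible arrangement, not a prescribed design.

```python
from collections import defaultdict

# Hypothetical concept table: each synonym maps to a shared concept term.
CONCEPTS = {"car": "<automobile>", "automobile": "<automobile>"}

def with_concepts(tokens):
    """Keep each token and also emit its concept term, if it has one."""
    out = []
    for token in tokens:
        out.append(token)
        if token in CONCEPTS:
            out.append(CONCEPTS[token])
    return out

def build_index(docs):
    """Map each term (and concept term) to the set of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in with_concepts(text.lower().split()):
            index[term].add(doc_id)
    return index

def search(index, query):
    """OR together the postings of every query term and its concept term."""
    hits = set()
    for term in with_concepts(query.lower().split()):
        hits |= index.get(term, set())
    return hits

# Toy documents, made up for this example.
docs = {1: "red car for sale", 2: "automobile repair manual", 3: "venetian blinds"}
index = build_index(docs)
print(search(index, "car"))
```

Because both the index and the query pass through the same concept table, a query for either word reaches documents that use the other, while the original words remain indexed for exact matching.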

14 Term Weighting: TF·IDF
Binary: presence or absence of the term
TF (term frequency)
  Simple count
  “Sublinear” TF scaling
IDF (inverse document frequency)
TF·IDF
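The weighting options listed above can be written out as small functions. This is a sketch under common textbook conventions that the slide does not spell out: sublinear scaling is taken as 1 + log10(tf), and IDF as log10(N / df). The corpus is made up.

```python
import math
from collections import Counter

# Toy corpus, made up for illustration.
docs = {
    "d1": "search engine search index",
    "d2": "search query",
    "d3": "index structure",
}
doc_counts = {doc_id: Counter(text.split()) for doc_id, text in docs.items()}

def tf_binary(count: int) -> int:
    """Binary weight: 1 if the term is present, else 0."""
    return 1 if count > 0 else 0

def tf_sublinear(count: int) -> float:
    """Sublinear TF scaling (assumed form): 1 + log10(count), or 0 if absent."""
    return 1.0 + math.log10(count) if count > 0 else 0.0

def idf(term: str) -> float:
    """Inverse document frequency (assumed form): log10(N / df)."""
    df = sum(1 for counts in doc_counts.values() if term in counts)
    return math.log10(len(doc_counts) / df) if df else 0.0

def tf_idf(term: str, doc_id: str) -> float:
    """Simple-count TF multiplied by IDF."""
    return doc_counts[doc_id][term] * idf(term)

print(tf_idf("search", "d1"), tf_idf("structure", "d3"))
```

A term that appears in fewer documents gets a larger IDF, so “structure” (one document) outweighs “search” (two documents) at equal counts.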

15 A matrix as a way to understand the index, the vector model and more.
[Table: term-by-document matrix with rows Term 1 to Term 6 and columns Doc 1 to Doc 6. Only one cell value survives in the transcript: a 1 in the Term 1 row.]
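A minimal sketch of the matrix view, using made-up documents: rows are terms, columns are documents. Reading along rows gives Boolean search; reading down columns gives document vectors for the vector space model. The function names are illustrative only.

```python
import math

# Toy documents, made up for illustration.
docs = ["boolean search model", "vector space model", "search engine index"]
terms = sorted({t for d in docs for t in d.split()})

# matrix[i][j] = number of times terms[i] occurs in docs[j]
matrix = [[d.split().count(t) for d in docs] for t in terms]

def row(term):
    """Index view: which documents contain this term."""
    return matrix[terms.index(term)]

def column(doc_index):
    """Vector view: one document as a vector over all terms."""
    return [r[doc_index] for r in matrix]

def boolean_and(a, b):
    """Boolean search: documents where both terms are present."""
    return [j for j in range(len(docs)) if row(a)[j] and row(b)[j]]

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv) if nu and nv else 0.0

def rank(query):
    """Vector model: order documents by cosine similarity to the query."""
    q = [query.split().count(t) for t in terms]
    scored = [(cosine(q, column(j)), j) for j in range(len(docs))]
    return [j for score, j in sorted(scored, key=lambda s: (-s[0], s[1])) if score > 0]

print(boolean_and("search", "model"), rank("search model"))
```

The Boolean query returns only the document that has both terms, while the vector model also returns partial matches, ranked lower.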

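16 Cells can have weights. Terms can be composites. Documents can have sections…
[Table: weighted matrix with columns Doc 1.1, Doc 1.2, Doc 2.1, Doc 2.2, Doc 6 and rows William.title, William.abstract, William.intro. Surviving cell values, column positions not recoverable: William.title 3, 4; William.abstract 2, 1; William.intro 7.]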

17 The index has 3 essential components
1. A term list, structured for fast access to individual terms.

18 The index has 3 essential components
2. For each term, a list of associations to documents.

19 The index has 3 essential components
3. A list of the documents that are indexed.

20 The index can store information for each component
For terms: overall frequency in the corpus (used for IDF), methods to identify or compute the term (e.g., variations in spelling, sound wave transformations, checksums, etc.).
For term-to-doc associations: weights, number of occurrences, occurrence offsets, etc.
For documents: address (by which to access content), summary, overall popularity (e.g., using PageRank).
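One way to picture the three components and what each can store is the sketch below. The class names, fields, and the `doc://` addresses are hypothetical; a real index would add weights, popularity scores, and on-disk structures.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Posting:
    """Term-to-document association: where the term occurs in one document."""
    doc_id: int
    offsets: List[int]

    @property
    def count(self) -> int:
        return len(self.offsets)

@dataclass
class DocInfo:
    """Per-document information: an address and a short summary."""
    address: str
    summary: str

class Index:
    def __init__(self) -> None:
        self.terms: Dict[str, List[Posting]] = {}  # 1. term list, 2. postings per term
        self.docs: Dict[int, DocInfo] = {}         # 3. list of indexed documents

    def add(self, doc_id: int, address: str, text: str) -> None:
        self.docs[doc_id] = DocInfo(address, text[:40])
        positions: Dict[str, List[int]] = {}
        for offset, token in enumerate(text.lower().split()):
            positions.setdefault(token, []).append(offset)
        for token, offsets in positions.items():
            self.terms.setdefault(token, []).append(Posting(doc_id, offsets))

    def doc_frequency(self, term: str) -> int:
        """Per-term statistic: number of documents containing the term."""
        return len(self.terms.get(term, []))

index = Index()
index.add(1, "doc://1", "to be or not to be")  # hypothetical addresses
index.add(2, "doc://2", "be brief")
print(index.terms["to"], index.doc_frequency("be"))
```

Storing offsets in each posting is what makes phrase and proximity queries possible later; the document frequency is the per-term number that IDF is computed from.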

24 Methods for fast access to terms
Simple sort: if updates are few, or the term list can reside in RAM.
Hashing
B-trees (more next Thursday)
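The first two methods can be compared with Python's standard library: `bisect` for binary search over a sorted list and a `dict` for hashing. The term list and helper names are made up; B-trees are not sketched here.

```python
import bisect

# Toy term list, made up for illustration; kept sorted for binary search.
term_list = sorted(["index", "search", "seo", "vector", "boolean", "btree"])

def find_sorted(terms, term):
    """Binary search: position of term in the sorted list, or -1 if absent."""
    i = bisect.bisect_left(terms, term)
    return i if i < len(terms) and terms[i] == term else -1

def prefix_range(terms, prefix):
    """A sorted list also answers prefix queries with two binary searches."""
    lo = bisect.bisect_left(terms, prefix)
    hi = bisect.bisect_left(terms, prefix + "\uffff")
    return terms[lo:hi]

# Hashing: expected constant-time exact lookup, but no ordering among keys.
term_hash = {term: position for position, term in enumerate(term_list)}

print(find_sorted(term_list, "search"), prefix_range(term_list, "se"), term_hash["seo"])
```

The trade-off: inserting into a sorted list means shifting elements, which is acceptable only if updates are few; a hash table updates cheaply but cannot serve prefix or range lookups.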

26 Zipf’s law
If the terms of a corpus are ranked (r) by their frequency of occurrence (f), then r · f ≈ k for some constant k. Relates to the Pareto principle, aka the “80-20 rule”.
Manning, Christopher D.; Prabhakar Raghavan; Hinrich Schütze (2008) Introduction to Information Retrieval

27 A sample Zipf distribution
The graph “hugs” the y- and x-axes. Much is accounted for by the top-ranked items, but much is also hidden in a long tail.
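The rank-times-frequency relation can be checked on any table of counts. The counts below are idealized and made up so that the product is exactly constant; real text only approximates this.

```python
from collections import Counter

def rank_frequency(counts):
    """Return (rank, term, frequency, rank * frequency) rows, most frequent first."""
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(r, term, f, r * f) for r, (term, f) in enumerate(ordered, start=1)]

def top_share(counts, n):
    """Fraction of all occurrences accounted for by the n most frequent terms."""
    freqs = sorted(counts.values(), reverse=True)
    return sum(freqs[:n]) / sum(freqs)

# Idealized counts (not real data), built so that f = k / r with k = 1200.
ideal = Counter({"the": 1200, "of": 600, "and": 400, "to": 300, "a": 240})

rows = rank_frequency(ideal)
products = [row[3] for row in rows]

for rank, term, freq, product in rows:
    print(rank, term, freq, product)
print(round(top_share(ideal, 1), 2))
```

On a real corpus, build the counts with `Counter(tokens)` and the products will hover around a constant rather than match it exactly.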

28 Questions?

