INFO 320: Information Needs, Searching, and Presentation (aka… Search)

Instructor: William Jones
TA: Brennen Smith
Lectures: Tuesdays & Thursdays, 1:30 – 3:20 pm, MGH 238
Labs: Wed., 1:30 – 2:20 pm, MGH 030

For this Week 3 (10/13/2013) (Basics of Search)
Add Word exercise in class
Boolean search vs. the vector space model
B-trees
2.2 W One-minute madness – each team gets one minute to describe progress on lab exercises and issues encountered.
Ongoing work in lab.

And also for this Week 3 (of 10/13)
Cool tool presentations; essay review
Wrap-up
Guest speaker on SEO
2.2 F Quiz on Module 2.

Components of a web crawler
From Büttcher, Clarke & Cormack, 2010, Information Retrieval, Chapter 15

Parsing a document
What format is it in? pdf/word/excel/html?
What language is it in?
How to handle “and”?
What character set is in use?
…

What you see.. *from

Is not what the crawler gets
*from

What character set is in use?
ISO Latin alphabet part 1 (ISO-8859-1) covers North America, Western Europe, Latin America, the Caribbean, Canada, and Africa; it is the default for Web pages.
UTF-8 is a character encoding of Unicode.
A character in UTF-8 can be from 1 to 4 bytes long.
UTF-8 can represent any character in the Unicode standard.
UTF-8 is backwards compatible with ASCII.
UTF-8 is the preferred encoding for web pages.
*from
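The byte-length and ASCII-compatibility claims above can be checked directly. The following is a minimal sketch; the sample characters and the word "café" are illustrative choices, not taken from the slides. It also shows what a crawler ends up with if it guesses the wrong character set.

```python
# Illustrative characters only; none of them come from the original slides.
samples = ["A", "é", "€", "😀"]


def utf8_length(ch: str) -> int:
    """Return how many bytes UTF-8 needs to encode a single character."""
    return len(ch.encode("utf-8"))


for ch in samples:
    print(repr(ch), utf8_length(ch))

# Backwards compatibility: pure ASCII text produces identical bytes in both encodings.
ascii_text = "crawler"
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")

# A wrong character-set guess garbles the text.
raw = "café".encode("utf-8")
misread = raw.decode("latin-1")  # wrong guess: every byte becomes one Latin-1 character
correct = raw.decode("utf-8")    # right guess: the two-byte sequence becomes "é"
print(same_bytes, misread, correct)
```

This is why the parsing step has to settle the character set before tokenizing: the same bytes yield different terms under different decodings.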

An HTML sample *from

Typical Stop Word List

Ambiguity of Natural Language (NL)
Synonymy: Different Words, Same Meaning
“car” ~= “automobile”
“stomach pain after eating” ~= “post-prandial abdominal discomfort”
Polysemy: Same Words, Different Meanings
“jaguar” as animal vs. kind of automobile.
“juvenile victims of crime” vs. “victims of juvenile crime”
Venetian blinds vs. blind Venetians

How to handle synonyms? car = automobile
Expand the index: when the document contains “automobile”, index it under “car” as well (and vice versa).
Or expand the query: when the query contains “automobile”, look under “car” too.
Or form a concept, <automobile>: when “car” is encountered in a document, index it under “<automobile>” (and “car” too?). Likewise for “automobile”. When either “car” or “automobile” is encountered in a query, add the term “<automobile>”.
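The concept-based option can be sketched as follows. This is a minimal illustration under stated assumptions: the synonym table CONCEPTS, the three documents, and the function names are all hypothetical, and matching is a plain OR over whitespace-split terms.

```python
from collections import defaultdict

# Hypothetical synonym table and documents, for illustration only.
CONCEPTS = {"car": "<automobile>", "automobile": "<automobile>"}
DOCS = {
    1: "the car was parked outside",
    2: "an automobile dealer opened downtown",
    3: "the jaguar slept in the tree",
}


def build_index(docs, concepts):
    """Index each document under its literal terms and under any concept term."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
            if term in concepts:
                index[concepts[term]].add(doc_id)
    return index


def expand_query(terms, concepts):
    """Add the concept term for every query term that has one."""
    expanded = list(terms)
    for term in terms:
        concept = concepts.get(term)
        if concept and concept not in expanded:
            expanded.append(concept)
    return expanded


def search(index, terms):
    """Return ids of documents matching any of the terms (a simple OR query)."""
    hits = set()
    for term in terms:
        hits |= index.get(term, set())
    return sorted(hits)


index = build_index(DOCS, CONCEPTS)
query = expand_query(["automobile"], CONCEPTS)
print(query, search(index, query))
```

Without expansion, the query "automobile" matches only document 2; with the concept term added, it also reaches document 1, which says "car".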

Term Weighting: TF·IDF
Binary: presence or absence of the term
TF: simple count, or “sublinear” TF scaling
IDF
TF·IDF
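The slide's own formulas did not survive extraction, so the sketch below assumes the common textbook forms: sublinear TF as 1 + log10(tf), and IDF as log10(N / df), where N is the number of documents and df is the number of documents containing the term. The toy corpus and function names are hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy corpus, for illustration only.
DOCS = [
    "search engines index the web",
    "the web the whole web",
    "b-trees index terms for search",
]
TOKENS = [doc.split() for doc in DOCS]
N = len(TOKENS)


def binary(term, tokens):
    """Binary weight: 1 if the term is present, else 0."""
    return 1 if term in tokens else 0


def tf(term, tokens):
    """Term frequency as a simple count."""
    return Counter(tokens)[term]


def sublinear_tf(term, tokens):
    """Assumed sublinear scaling: 1 + log10(tf), or 0 when the term is absent."""
    count = tf(term, tokens)
    return 1 + math.log10(count) if count > 0 else 0.0


def idf(term):
    """Assumed inverse document frequency: log10(N / df)."""
    df = sum(1 for tokens in TOKENS if term in tokens)
    return math.log10(N / df) if df else 0.0


def tf_idf(term, tokens):
    """Combined weight: frequent in this document, rare in the corpus."""
    return tf(term, tokens) * idf(term)


print(tf("web", TOKENS[1]), sublinear_tf("web", TOKENS[1]), idf("web"), tf_idf("web", TOKENS[1]))
```

A term that appears in only one document, such as "b-trees" here, gets the highest IDF; a term spread across most documents is weighted down.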

A matrix as a way to understand the index, the vector model and more.
Columns: Doc 1, Doc 2, Doc 3, Doc 4, Doc 5, Doc 6, …
Rows: Term 1 (with one cell set to 1), Term 2, Term 3, Term 4, Term 5, Term 6
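The matrix view can be made concrete with a small sketch. The documents and helper names below are hypothetical. Rows are terms, columns are documents, and cells hold raw counts; a Boolean AND query only asks whether cells are non-zero, while the vector space model treats each column as a vector and ranks documents by cosine similarity to the query.

```python
import math

# Hypothetical documents, for illustration only.
DOCS = {
    "Doc 1": "boolean search uses boolean operators",
    "Doc 2": "the vector space model ranks documents",
    "Doc 3": "search engines rank documents",
}

terms = sorted({t for text in DOCS.values() for t in text.split()})
doc_names = list(DOCS)

# matrix[term][doc] holds the number of times the term occurs in the document.
matrix = {
    term: {name: DOCS[name].split().count(term) for name in doc_names}
    for term in terms
}


def boolean_and(*query_terms):
    """Documents whose cells are non-zero for every query term."""
    return [
        name for name in doc_names
        if all(matrix.get(t, {}).get(name, 0) > 0 for t in query_terms)
    ]


def column(name):
    """The document's column of the matrix, as a vector over all terms."""
    return [matrix[t][name] for t in terms]


def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank(query):
    """Rank documents by cosine similarity to the query, best match first."""
    q = [query.split().count(t) for t in terms]
    scored = [(cosine(q, column(name)), name) for name in doc_names]
    return [name for score, name in sorted(scored, reverse=True) if score > 0]


print(boolean_and("search", "documents"))
print(rank("search documents"))
```

The Boolean query returns only the document containing both terms, with no ordering; the vector query returns every partially matching document, ordered by similarity.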

Cells can have weights. Terms can be composites. Documents can have sections…
Columns: Doc 1.1, Doc 1.2, Doc 2.1, Doc 2.2, Doc 6, …
William.title: 3, 4
William.abstract: 2, 1
William.intro: 7

The index has 3 essential components
1. A term list, structured for fast access to individual terms.
2. For each term, a list of associations to documents.
3. A list of the documents that are indexed.

The index can store information for each component
For terms: overall frequency in the corpus (IDF), and methods to identify or compute the term (e.g., variations in spelling, sound wave transformations, checksums, etc.).
For term-to-doc associations: weights, number of occurrences, occurrence offsets, etc.
For documents: address (by which to access content), summary, overall popularity (e.g., using PageRank).
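The three components and the information attached to each can be sketched as a small in-memory structure. This is an illustration under assumptions, not a production index: the class names Posting and Index, the example.org addresses, and the sample texts are all hypothetical, and only document frequency, occurrence offsets, and document address are stored.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Posting:
    """One term-to-document association, with the offsets where the term occurs."""
    doc_id: int
    offsets: list = field(default_factory=list)

    @property
    def count(self):
        """Number of occurrences of the term in this document."""
        return len(self.offsets)


@dataclass
class Index:
    # 1. Term list: term -> document frequency (the input to IDF).
    terms: dict = field(default_factory=dict)
    # 2. Associations: term -> {doc_id: Posting}.
    postings: dict = field(default_factory=lambda: defaultdict(dict))
    # 3. Document list: doc_id -> address by which to access the content.
    documents: dict = field(default_factory=dict)

    def add(self, doc_id, address, text):
        """Register a document and record every term occurrence with its offset."""
        self.documents[doc_id] = address
        for offset, term in enumerate(text.lower().split()):
            by_doc = self.postings[term]
            if doc_id not in by_doc:
                by_doc[doc_id] = Posting(doc_id)
                self.terms[term] = self.terms.get(term, 0) + 1
            by_doc[doc_id].offsets.append(offset)

    def lookup(self, term):
        """Return (address, count, offsets) for each document containing the term."""
        found = self.postings.get(term, {})
        return [(self.documents[d], p.count, p.offsets) for d, p in sorted(found.items())]


index = Index()
index.add(1, "https://example.org/a", "to be or not to be")
index.add(2, "https://example.org/b", "to search is to find")
print(index.terms["to"], index.lookup("be"))
```

Storing offsets rather than bare counts is what makes phrase and proximity queries possible later, at the cost of a larger index.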

Methods for fast access to terms
Simple sort: if updates are few, or the term list can reside in RAM.
Hashing*
B-trees (more next Thursday)
*From
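The first two methods can be contrasted in a few lines. This is a sketch with a hypothetical term list; B-trees are not shown. A sorted list supports binary search and prefix ranges, but each insert shifts elements, which is why it suits term lists with few updates. A hash table gives fast exact lookups but keeps no order.

```python
import bisect

# Hypothetical term list, kept in sorted order.
TERMS = sorted(["b-tree", "boolean", "crawler", "index", "vector", "zipf"])


def find_sorted(term):
    """Binary search in the sorted list; returns the position, or -1 if absent."""
    i = bisect.bisect_left(TERMS, term)
    return i if i < len(TERMS) and TERMS[i] == term else -1


def prefix_range(prefix):
    """Sorted order keeps all terms sharing a prefix next to each other."""
    lo = bisect.bisect_left(TERMS, prefix)
    hi = bisect.bisect_left(TERMS, prefix + "\uffff")
    return TERMS[lo:hi]


# Hashing: average constant-time exact lookup, but no ordering and no prefix ranges.
HASHED = {term: position for position, term in enumerate(TERMS)}

print(find_sorted("index"), prefix_range("b"), HASHED.get("index"))
```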


Zipf’s law
If the terms of a corpus are ranked (r) by the frequency (f) of their occurrence, then…
r · f ≈ k, where k is a constant.
Relates to the Pareto principle, aka the “80-20 rule”.
Manning, Christopher D.; Prabhakar Raghavan; Hinrich Schütze (2008). Introduction to Information Retrieval.

A sample Zipf distribution
The graph “hugs” the y and x-axes. Much is accounted for by top-ranked items but much is also hidden in a looong tail. *from
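The rank-frequency relation can be checked on any token stream. In the sketch below, the token list is hypothetical and deliberately constructed so that rank times frequency equals 12 at every rank; real text follows the relation only approximately.

```python
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, term, frequency, rank * frequency) rows, most frequent first."""
    counts = Counter(tokens).most_common()
    return [(r, term, f, r * f) for r, (term, f) in enumerate(counts, start=1)]


# Hypothetical token stream, built so that rank * frequency is constant.
tokens = ["the"] * 12 + ["of"] * 6 + ["and"] * 4 + ["to"] * 3
rows = rank_frequency(tokens)

for rank, term, freq, product in rows:
    print(rank, term, freq, product)

# Share of all occurrences accounted for by the single top-ranked term.
top_share = rows[0][2] / len(tokens)
print(top_share)
```

Here the top-ranked term alone accounts for nearly half of all occurrences, which is the head of the curve; in a real corpus the remaining mass is spread over a very long tail of rare terms.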

Questions?
