Search and Ye Shall Find (maybe) Seminar on Emergent Information Technology August 20, 2007 Douglas W. Oard.


What Do People Search For?
Searchers often don’t clearly understand
–The problem they are trying to solve
–What information is needed to solve the problem
–How to ask for that information
The query results from a clarification process
Dervin’s “sense making”: Need, Gap, Bridge

Process/System Co-Design

Design Strategies
Foster human-machine synergy
–Exploit complementary strengths
–Accommodate shared weaknesses
Divide-and-conquer
–Divide task into stages with well-defined interfaces
–Continue dividing until problems are easily solved
Co-design related components
–Iterative process of joint optimization

Human-Machine Synergy
Machines are good at:
–Doing simple things accurately and quickly
–Scaling to larger collections in sublinear time
People are better at:
–Accurately recognizing what they are looking for
–Evaluating intangibles such as “quality”
Both are pretty bad at:
–Mapping consistently between words and concepts

Supporting the Search Process
[Diagram. Stages: Source Selection, Query Formulation, Search, Selection, Examination, Delivery. Artifacts passed between stages: Query, Ranked List, Document. System: IR System. Feedback loops: Query Reformulation and Relevance Feedback, Source Reselection. Roles: Nominate, Choose, Predict.]

Supporting the Search Process
[Diagram. Stages: Source Selection, Query Formulation, Search, Selection, Examination, Delivery. Artifacts passed between stages: Query, Ranked List, Document. System side: IR System, Indexing, Index, Acquisition, Collection.]

Taylor’s Model of Question Formation
Q1 Visceral Need
Q2 Conscious Need
Q3 Formalized Need
Q4 Compromised Need (Query)
[Diagram labels: End-user Search, Intermediated Search]

Search Component Model
[Diagram. Query side: Information Need, Query Formulation, Query, Query Processing (Representation Function), Query Representation. Document side: Document, Document Processing (Representation Function), Document Representation. Matching: Comparison Function, Retrieval Status Value. Evaluation: Human Judgment, Utility.]

Relevance
Relevance relates a topic and a document
–Duplicates are equally relevant, by definition
–Constant over time and across users
Pertinence relates a task and a document
–Accounts for quality, complexity, language, …
Utility relates a user and a document
–Accounts for prior knowledge

“Bag of Terms” Representation
Bag = a “set” that can contain duplicates
“The quick brown fox jumped over the lazy dog’s back” → {back, brown, dog, fox, jump, lazy, over, quick, the, the}
Vector = values recorded in any consistent order
{back, brown, dog, fox, jump, lazy, over, quick, the, the} → [1 1 1 1 1 1 1 1 2]
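The bag and vector on this slide can be reproduced with a short Python sketch. The `NORMALIZE` table is a hypothetical stand-in for a real stemmer (it only covers the two word forms in the example sentence), and alphabetical order is just one possible "consistent order" for the vector.

```python
from collections import Counter

# Hypothetical lookup standing in for a stemmer; covers only this example.
NORMALIZE = {"jumped": "jump", "dog's": "dog"}

def bag_of_terms(text):
    """Return a Counter (a bag) of normalized terms in the text."""
    tokens = text.lower().replace("’", "'").split()
    tokens = [t.strip(".,") for t in tokens]
    return Counter(NORMALIZE.get(t, t) for t in tokens)

def to_vector(bag, vocabulary):
    """Record the bag's counts in the order given by the vocabulary."""
    return [bag[term] for term in vocabulary]

bag = bag_of_terms("The quick brown fox jumped over the lazy dog's back")
vocabulary = sorted(bag)            # alphabetical order, as on the slide
vector = to_vector(bag, vocabulary)
print(vocabulary)
print(vector)
```

Any other fixed ordering of the vocabulary would serve equally well, as long as every document and query uses the same one.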

Bag of Terms Example
Document 1: The quick brown fox jumped over the lazy dog’s back.
Document 2: Now is the time for all good men to come to the aid of their party.
Stopword List: for, is, of, the, to

Term    Document 1  Document 2
aid     0           1
all     0           1
back    1           0
brown   1           0
come    0           1
dog     1           0
fox     1           0
good    0           1
jump    1           0
lazy    1           0
men     0           1
now     0           1
over    1           0
party   0           1
quick   1           0
their   0           1
time    0           1

Advantages of Ranked Retrieval
Closer to the way people think
–Some documents are better than others
Enriches browsing behavior
–Decide how far down the list to go as you read it
Allows more flexible queries
–Long and short queries can produce useful results

Counting Terms
Terms tell us about documents
–If “rabbit” appears a lot, it may be about rabbits
Documents tell us about terms
–“the” is in every document -- not discriminating
Documents are most likely described well by rare terms that occur in them frequently
–Higher “term frequency” is stronger evidence
–Low “document frequency” makes it stronger still
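A minimal sketch of how term frequency and document frequency might be combined. The three toy documents are invented for illustration, and `tf * log(N / df)` is one common weighting assumed here, not necessarily the one used in the seminar.

```python
import math
from collections import Counter

# Toy collection, invented for illustration.
docs = {
    1: ["rabbit", "rabbit", "the", "hole"],
    2: ["the", "quick", "fox"],
    3: ["the", "rabbit", "runs"],
}

def document_frequency(docs):
    """Count, for each term, how many documents contain it."""
    df = Counter()
    for terms in docs.values():
        df.update(set(terms))
    return df

def tf_idf(docs):
    """Weight each term by tf * log(N / df), an assumed common form."""
    n = len(docs)
    df = document_frequency(docs)
    return {
        doc_id: {t: tf * math.log(n / df[t]) for t, tf in Counter(terms).items()}
        for doc_id, terms in docs.items()
    }

weights = tf_idf(docs)
print(weights[1])
```

In this toy collection "the" appears in every document and so receives weight zero, while "rabbit" scores higher in document 1, where it occurs twice, than in document 3.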

Document Length Normalization
Long documents have an unfair advantage
–They use a lot of terms
  So they get more matches than short documents
–And they use the same words repeatedly
  So they have much higher term frequencies
Normalization seeks to remove these effects

“Okapi” Term Weights
[Formula with two labelled parts: TF component, IDF component]
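The formula itself did not survive in this transcript. As a hedged illustration only, the sketch below uses the widely published Okapi BM25 form, which has the same two-part shape: a length-normalized TF component multiplied by an IDF component. The parameter names and the defaults `k1=1.2` and `b=0.75` are conventional choices assumed here, not values taken from the slide.

```python
import math

def okapi_weight(tf, df, doc_len, avg_doc_len, n_docs, k1=1.2, b=0.75):
    """Okapi BM25 style weight for one term in one document (assumed form)."""
    # TF component: grows with tf but saturates, and is damped for
    # documents longer than average.
    length_norm = (1 - b) + b * doc_len / avg_doc_len
    tf_component = (tf * (k1 + 1)) / (tf + k1 * length_norm)
    # IDF component: larger for terms that occur in few documents.
    idf_component = math.log((n_docs - df + 0.5) / (df + 0.5))
    return tf_component * idf_component

rare = okapi_weight(tf=1, df=10, doc_len=100, avg_doc_len=100, n_docs=1000)
common = okapi_weight(tf=1, df=400, doc_len=100, avg_doc_len=100, n_docs=1000)
print(rare, common)
```

Doubling `tf` less than doubles the weight, and the same term in a longer-than-average document scores lower, which is the length normalization the previous slide asks for.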

Summary
Goal: find documents most similar to the query
Compute normalized document term weights
–Some combination of TF, DF, and Length
Optionally, get query term weights from the user
–Estimate of term importance
Compute inner product of query and doc vectors
–Multiply corresponding elements and then add
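The last step, the inner product of query and document vectors, can be sketched as follows. The document weights and query weights are made-up numbers, and storing each vector as a sparse dictionary is an implementation choice assumed here.

```python
def inner_product(query_weights, doc_weights):
    """Multiply corresponding term weights and add the products."""
    return sum(w * doc_weights.get(term, 0.0) for term, w in query_weights.items())

def rank(query_weights, index):
    """Return (doc_id, score) pairs sorted from best to worst match."""
    scores = {doc_id: inner_product(query_weights, dw) for doc_id, dw in index.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical normalized term weights for three documents.
doc_weights = {
    "d1": {"quick": 0.8, "fox": 0.6},
    "d2": {"good": 0.7, "party": 0.7},
    "d3": {"fox": 0.9, "dog": 0.4},
}
query = {"fox": 1.0, "dog": 0.5}   # user-supplied term importance
ranking = rank(query, doc_weights)
print(ranking)
```

Only terms that appear in both the query and the document contribute to the score, so a document sharing no terms with the query scores zero.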

Some Questions for Indexing
How long will it take to find a document?
–Is there any work we can do in advance?
  If so, how long will that take?
How big a computer will I need?
–How much disk space? How much RAM?
What if more documents arrive?
–How much of the advance work must be repeated?
–Will searching become slower?
–How much more disk space will be needed?

The Indexing Process

Term   Doc 1  Doc 2  Doc 3  Doc 4  Doc 5  Doc 6  Doc 7  Doc 8  Postings
aid    0      0      0      1      0      0      0      1      4, 8
all    0      1      0      1      0      1      0      0      2, 4, 6
back   1      0      1      0      0      0      1      0      1, 3, 7
brown  1      0      1      0      1      0      1      0      1, 3, 5, 7
come   0      1      0      1      0      1      0      1      2, 4, 6, 8
dog    0      0      1      0      1      0      0      0      3, 5
fox    0      0      1      0      1      0      1      0      3, 5, 7
good   0      1      0      1      0      1      0      1      2, 4, 6, 8
jump   0      0      1      0      0      0      0      0      3
lazy   1      0      1      0      1      0      1      0      1, 3, 5, 7
men    0      1      0      1      0      0      0      1      2, 4, 8
now    0      1      0      0      0      1      0      1      2, 6, 8
over   1      0      1      0      1      0      1      1      1, 3, 5, 7, 8
party  0      0      0      0      0      1      0      1      6, 8
quick  1      0      1      0      0      0      0      0      1, 3
their  1      0      0      0      1      0      1      0      1, 5, 7
time   0      1      0      1      0      1      0      0      2, 4, 6

Inverted File (term index): A B C F D G J L M N O P Q T; AI AL BA BR TH TI

The Finished Product

Term   Postings
aid    4, 8
all    2, 4, 6
back   1, 3, 7
brown  1, 3, 5, 7
come   2, 4, 6, 8
dog    3, 5
fox    3, 5, 7
good   2, 4, 6, 8
jump   3
lazy   1, 3, 5, 7
men    2, 4, 8
now    2, 6, 8
over   1, 3, 5, 7, 8
party  6, 8
quick  1, 3
their  1, 5, 7
time   2, 4, 6

Inverted File (term index): A B C F D G J L M N O P Q T; AI AL BA BR TH TI
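A minimal sketch of building such a postings structure in memory. The per-document term sets are derived from the postings lists on the slide, while the function names and the use of a Python dictionary in place of the slide's letter-indexed inverted file are assumptions made for illustration.

```python
from collections import defaultdict

# Term sets per document, derived from the slide's postings lists.
docs = {
    1: "back brown lazy over quick their",
    2: "all come good men now time",
    3: "back brown dog fox jump lazy over quick",
    4: "aid all come good men time",
    5: "brown dog fox lazy over their",
    6: "all come good now party time",
    7: "back brown fox lazy over their",
    8: "aid come good men now over party",
}

def build_inverted_index(docs):
    """Map each term to the sorted list of document ids containing it."""
    postings = defaultdict(list)
    for doc_id in sorted(docs):              # ascending ids keep lists sorted
        for term in set(docs[doc_id].split()):
            postings[term].append(doc_id)
    return dict(postings)

def boolean_and(index, term_a, term_b):
    """Intersect two sorted postings lists with a linear merge."""
    a, b = index.get(term_a, []), index.get(term_b, [])
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return result

index = build_inverted_index(docs)
print(index["over"])
print(boolean_and(index, "fox", "their"))
```

Because postings are kept in ascending document order, a Boolean AND needs only one pass over the two lists rather than a scan of the documents.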

Building an Inverted Index
Simplest solution is a single sorted array
–Fast lookup using binary search
–But sorting large files on disk is very slow
–And adding one document means starting over
Tree structures allow easy insertion
–But the worst case lookup time is linear
Balanced trees provide the best of both
–Fast lookup and easy insertion
–But they require 45% more disk space

(Uncompressed) Index Size
Very compact for Boolean retrieval
–About 10% of the size of the documents
  If an aggressive stopword list is used
Not much larger for ranked retrieval
–Perhaps 20%
Enormous for proximity operators
–Sometimes larger than the documents!

Index Compression
CPUs are much faster than disks
–A disk can transfer 1,000 bytes in ~20 ms
–The CPU can do ~10 million instructions in that time
Compressing the postings file is a big win
–Trades decompression time for fewer disk reads
Key idea: reduce redundancy
–Trick 1: store relative offsets (some will be the same)
–Trick 2: use an optimal coding scheme
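Both tricks can be illustrated with a short sketch. Trick 1 is shown as gap (relative offset) encoding. For Trick 2 the sketch substitutes variable-byte coding, a simple scheme that is not an optimal code but shows how small gaps take fewer bytes; the function names and the sample postings list are made up for the example.

```python
def to_gaps(postings):
    """Trick 1: replace each document id by its offset from the previous one."""
    gaps, prev = [], 0
    for doc_id in postings:
        gaps.append(doc_id - prev)
        prev = doc_id
    return gaps

def from_gaps(gaps):
    """Undo gap encoding with a running sum."""
    postings, total = [], 0
    for gap in gaps:
        total += gap
        postings.append(total)
    return postings

def vbyte_encode(numbers):
    """Variable-byte code: 7 data bits per byte, high bit = more bytes follow."""
    out = bytearray()
    for n in numbers:
        while n >= 128:
            out.append((n & 0x7F) | 0x80)   # low 7 bits, continuation flag set
            n >>= 7
        out.append(n)                       # final byte, flag clear
    return bytes(out)

def vbyte_decode(data):
    """Rebuild the integers from a variable-byte encoded byte string."""
    numbers, n, shift = [], 0, 0
    for byte in data:
        n |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            numbers.append(n)
            n, shift = 0, 0
    return numbers

postings = [1000, 1003, 1004, 1500]        # hypothetical postings list
encoded = vbyte_encode(to_gaps(postings))
print(to_gaps(postings), len(encoded))
```

Small gaps fit in a single byte each, so the four document ids above occupy six bytes instead of the sixteen that fixed 32-bit integers would need.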

Problems with “Free Text” Search
Homonymy
–Terms may have many unrelated meanings
–Polysemy (related meanings) is less of a problem
Synonymy
–Many ways of saying (nearly) the same thing
Anaphora
–Alternate ways of referring to the same thing

Two Ways of Searching
Content-Based Query-Document Matching
–Author: Write the document using terms to convey meaning (Document Terms)
–Free-Text Searcher: Construct query from terms that may appear in documents (Query Terms)
Metadata-Based Query-Document Matching
–Indexer: Choose appropriate concept descriptors (Document Descriptors)
–Controlled Vocabulary Searcher: Construct query from available concept descriptors (Query Descriptors)
Output of matching: Retrieval Status Value

Supervised Learning
[Diagram: labelled training examples (v_1 … v_N with class C_v; w_1 … w_N with class C_w) over features f_1 … f_N feed a Learner, which produces a Classifier; a new example (x_1 … x_N) with unknown class C_x is assigned class C_w.]

Example: kNN Classifier
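The slide's own example is not preserved in this transcript. As a stand-in, here is a minimal k-nearest-neighbor sketch: it labels a new example by majority vote among the k closest labelled training examples. The two-feature training points, Euclidean distance, and `k=3` are all assumptions chosen for illustration.

```python
import math
from collections import Counter

def knn_classify(training, x, k=3):
    """Return the majority label among the k training examples nearest to x."""
    nearest = sorted(training, key=lambda example: math.dist(example[0], x))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical labelled training examples: (feature vector, class label).
training = [
    ((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
    ((8, 8), "B"), ((8, 9), "B"), ((9, 8), "B"),
]

print(knn_classify(training, (2, 2)))
print(knn_classify(training, (7, 9)))
```

There is no separate training step: the "learner" simply stores the labelled examples, and all the work happens when a new example is classified.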

Problems with Controlled Vocabulary
New concepts
Users and indexers may think differently
Using thesauri effectively requires training

Index Spam
Goal: Manipulate rankings of an IR system
Multiple strategies:
–Create bogus user-assigned metadata
–Add invisible text (font in background color, …)
–Alter your text to include desired query terms
–“Link exchanges” create links to your page

Some Observable Behaviors

Behavior Category

Minimum Scope

Some Examples
Read/Ignored, Saved/Deleted, Replied to (Stevens, 1993)
Reading time (Morita & Shinoda, 1994; Konstan et al., 1997)
Hypertext Link (Brin & Page, 1998)

Estimating Authority from Links
[Diagram: link graph with nodes labelled Authority and Hub]
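The Authority and Hub labels match the vocabulary of Kleinberg's HITS algorithm. Assuming that is what the diagram depicts, the sketch below alternates two updates: a page's authority score is the sum of the hub scores of pages linking to it, and its hub score is the sum of the authority scores of pages it links to. The tiny link graph and the fixed iteration count are invented for illustration.

```python
def normalize(scores):
    """Scale scores to unit Euclidean length (left unchanged if all zero)."""
    norm = sum(v * v for v in scores.values()) ** 0.5 or 1.0
    return {node: v / norm for node, v in scores.items()}

def hits(links, iterations=50):
    """Iterate hub and authority updates over a dict of page -> linked pages."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority: sum of hub scores of pages that link to this page.
        auth = normalize({
            n: sum(hub[s] for s, targets in links.items() if n in targets)
            for n in nodes
        })
        # Hub: sum of authority scores of the pages this page links to.
        hub = normalize({
            n: sum(auth[t] for t in links.get(n, ())) for n in nodes
        })
    return hub, auth

# Hypothetical link graph: three pages pointing at two targets.
links = {"h1": ["a1", "a2"], "h2": ["a1", "a2"], "h3": ["a1"]}
hub, auth = hits(links)
print(max(auth, key=auth.get))
```

In this graph the page with the most in-links from good hubs ends up with the highest authority score, and pages that only link out receive an authority score of zero.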

Collecting Click Streams
Browsing histories are easily captured
–Make all links initially point to a central site
  Encode the desired URL as a parameter
–Build a time-annotated transition graph for each user
  Cookies identify users (when they use the same machine)
–Redirect the browser to the desired page
Reading time is correlated with interest
–Can be used to build individual profiles
–Used to target advertising by doubleclick.com
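A rough sketch of the redirect mechanism and of estimating reading time from a time-annotated click sequence. The tracker address `https://tracker.example/redirect`, the `url` parameter name, and the sample click log are hypothetical; only standard `urllib.parse` functions are used.

```python
from urllib.parse import urlencode, urlparse, parse_qs

TRACKER = "https://tracker.example/redirect"   # hypothetical central site

def wrap_link(target_url):
    """Point a link at the central site, carrying the real URL as a parameter."""
    return TRACKER + "?" + urlencode({"url": target_url})

def unwrap_link(tracking_url):
    """Recover the desired URL so the browser can be redirected to it."""
    return parse_qs(urlparse(tracking_url).query)["url"][0]

def reading_times(clicks):
    """Estimate seconds spent on each page from consecutive click timestamps.

    clicks: list of (timestamp_seconds, url) for one user, in time order.
    The last page has no following click, so it gets no estimate.
    """
    return [(url, next_t - t) for (t, url), (next_t, _) in zip(clicks, clicks[1:])]

wrapped = wrap_link("http://example.org/a?b=1&c=2")
clicks = [(0, "page1"), (50, "page2"), (82, "page3")]   # hypothetical log
print(wrapped)
print(reading_times(clicks))
```

The time between two consecutive clicks is only an upper bound on reading time, since the user may have been away from the screen.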

[Figure: Full Text Articles (Telecommunications). Reading Time (seconds) by Rating (No Interest, Low Interest, Moderate Interest, High Interest). Data labels: 50, 32, 58, 43.]

Problems with Observed Behavior
Protecting privacy
–What absolute assurances can we provide?
–How can we make remaining risks understood?
Scalable rating servers
–Is a fully distributed architecture practical?
Non-cooperative users
–How can the effect of spamming be limited?

Putting It All Together
[Table. Columns: Free Text, Behavior, Metadata. Rows: Topicality, Quality, Reliability, Cost, Flexibility.]
