Beyond Content-Based Retrieval: Modeling Domains, Users and Interaction Susan Dumais Microsoft Research IEEE ADL’99.

Beyond Content-Based Retrieval: Modeling Domains, Users and Interaction Susan Dumais Microsoft Research http://research.microsoft.com/~sdumais IEEE ADL’99 - May 21, 1998

Research in IR at MS zMicrosoft Research (http://research.microsoft.com) yAdaptive Systems and Interaction - IR/UI yMachine Learning and Applied Statistics yData Mining yNatural Language Processing yCollaboration and Education yMSR Cambridge; MSR Beijing zMicrosoft Product Groups … many IR-related

IR Themes & Directions zImprovements in representation and content- matching yProbabilistic/Bayesian models - p(Relevant|Document), p(Concept|Words) yNLP: Truffle, MindNet zBeyond content-matching yUser/Task modeling yDomain/Object modeling yAdvances in presentation and manipulation WWW8 Panel Finding Anything in the Billion-Page Web: Are Algorithms the Answer? Moderator: Prabhakar Raghavan, IBM Almaden

Traditional View of IR Query Words Ranked List

What’s Missing? Query Words Ranked List User Modeling Domain Modeling Information Use

Domain/Obj Modeling zNot all objects are equal … potentially big win yA priori importance yInformation use (“readware”; collab filtering) zInter-object relationships yLink structure / hypertext ySubject categories - e.g., text categorization. text clustering zMetadata yE.g., reliability, recency, cost -> combining

User/Task Modeling zDemographics zTask -- What’s the user’s goal? ye.g., Lumiere zShort and long-term content interests ye.g., Implicit queries yInterest model = f(content_similarity, time, interest) ye.g., Letiza, WebWatcher, Fab

Information Use zBeyond batch IR model (“query->results”) yConsider larger task context yKnowledge management zHuman attention is critical resource yTechniques for automatic information summarization, organization, discover, filtering, mining, etc. zAdvanced UIs and interaction techniques yE.g, tight coupling of search, browsing to support information management

The Broader View of IR Query Words Ranked List User Modeling Domain Modeling Information Use

Beyond Content Matching zDomain/Object modeling èA priori importance èText classification and clustering zUser/Task modeling yImplicit queries and Lumiere zAdvances in presentation and manipulation yCombining structure and search (e.g., DM)

Estimating Priors zAccess Count ye.g., DirectHit yCollaborative Filtering zLink Structure (citation analysis) yClever (Hubs/Authorities) - Kleinberg et al. yGoogle - Brin & Page yWeb Archeologist - Bharat, Henzinger

New Relevance Ranking zRelevance ranking can include: ycontent-matching … of couse ypage/site popularity ypage quality yspam or porn or other downweighting yetc. zCombining these - relative weighting of these factors tricky and subjective

Text Classification zText Classification: assign objects to one or more of a predefined set of categories using text features yE.g., News feeds, Web data, OHSUMED, Email - spam/no-spam zApproaches: yHuman classification (e.g., LCSH, MeSH, Yahoo!, CyberPatrol) yHand-crafted knowledge engineered systems (e.g., CONSTRUE) *Inductive learning methods x(Semi-) automatic classification

Inductive Learning Methods zSupervised learning from examples yExamples are easy for domain experts to provide yModels easy to learn, update, and customize zExample learning algorithms yRelevance Feedback, Decision Trees, Naïve Bayes, Bayes Nets, Support Vector Machines (SVMs) zText representation yLarge vector of features (words, phrases, hand-crafted)

Support Vector Machine zOptimization Problem yFind hyperplane, h, separating positive and negative examples yOptimization for maximum margin: yClassify new items using: support vectors

Reuters Data Set ( 21578 - ModApte split) z9603 training articles; 3299 test articles zExample “interest” article 2-APR-1987 06:35:19.50 west-germany b f BC-BUNDESBANK-LEAVES-CRE 04-02 0052 FRANKFURT, March 2 The Bundesbank left credit policies unchanged after today's regular meeting of its council, a spokesman said in answer to enquiries. The West German discount rate remains at 3.0 pct, and the Lombard emergency financing rate at 5.0 pct. REUTER zAverage article 200 words long

Example: Reuters news z118 categories (article can be in more than one category) zMost common categories (#train, #test) zOverall Results yLinear SVM most accurate: 87% precision at 87% recall Trade (369,119) Interest (347, 131) Ship (197, 89) Wheat (212, 71) Corn (182, 56) Earn (2877, 1087) Acquisitions (1650, 179) Money-fx (538, 179) Grain (433, 149) Crude (389, 189)

Reuters ROC - Category Grain Precision Recall LSVM Decision Tree Naïve Bayes Find Similar Recall: % labeled in category among those stories that are really in category Precision: % really in category among those stories labeled in category

Text Categ Summary z Accurate classifiers can be learned automatically from training examples z Linear SVMs are efficient and provide very good classification accuracy zWidely applicable, flexible, and adaptable representations yEmail spam/no-spam, Web, Medical abstracts, TREC

Text Clustering zDiscovering structure yVector-based document representation yEM algorithm to identify clusters zInteractive user interface

Text Clustering

Beyond Content Matching zDomain/Object modeling yText classification and clustering zUser/Task modeling èImplicit queries and Lumiere zAdvances in presentation and manipulation èCombining structure and search (e.g., DM)

Lumiere zInferring beliefs about user’s goals and ideal actions under uncertainty * User query User activity User profile Data structures Pr(Goals, Needs) E. Horvitz, et al.

Bayesian Models for Information Access Current General Interests User’s actions Explicit User Query User’s Acute Interests / Needs Applications, Docs, Data Structures Acute Goals/Tasks Previously Demonstrated Interests Context

Lumiere

Lumiere Office Assistant

Implicit Queries (IQ) zExplicit queries: ySearch is a separate, discrete task yUser types query, Gets results, Tries again … zImplicit queries: ySearch as part of normal information flow yOngoing query formulation based on user activities yNon-intrusive results display

Data Mountain with Implicit Query results shown (highlighted pages to left of selected page).

IQ Study: Experimental Details zStore 100 Web pages y50 popular Web pages; 50 random pages yWith or without Implicit Query xIQ1: Co-occurrence based IQ xIQ2: Content-based IQ zRetrieve 100 Web pages yTitle given as retrieval cue -- e.g., “CNN Home Page” yNo implicit query highlighting at retrieval

Find: “CNN Home Page”

Results: Information Storage zFiling strategies

Results: Information Storage zNumber of categories ( for semantic organizers ) 7 Categ 23 Categ

Results: Retrieval Time

Implicit Query Highlights zIQ built by tracking user’s reading behavior yNo explicit search required yGood matches returned zIQ user model: yCombines present context + previous interests zNew interfaces for tightly coupling search results with structure -- user study

Summary zImproving content-matching zAnd, beyond... yDomain/Object Models yUser/Task Models yInformation Presentation and Use zAlso … y non-text, multi-lingual, distributed zhttp://research.microsoft.com/~sdumais

The View from Wired, May 1996 z“information retrieval seemed like the easiest place to make progress … information retrieval is really only a problem for people in library science -- if some computer scientists were to put their heads together, they’d probably have it solved before lunchtime” [GS]

DLI-1 zUC Berkeley - Environmental planning and GIS zUC Santa Barbara - The Alexandria project spatially referenced map information zCMU - Informedia digital video library zUIllinois - Federating repositories of scientific literature zUMichigan - Intelligent agents for information location zStanford - Interoperation mechanisms among heterogeneous services

DLI-2 zOHSU/OGI: Tracking footprints through an information space: Leveraging the document selections of expert problem solvers zStanford: Image filtering for secure distribution of medical information zUWashington: Automatic reference librarians for the WWW zUKentucky: The digital atheneum: New techniques for restoring, searching and editing humanities collection zUSCarolina: A software and data library for experiments, simulation and archiving zJohns Hopkins: Digital workflow management: The Lester S. Levy collection of Music zUArizona: High-performance digital library classification systems: From information retrieval to knowledge management

zPopularity rankings - e.g., DirectHit zLink-based rankings - e.g., Google, Clever zQuestion answering - e.g., AskJeeves zPaid advertising - e.g.,GoTo zSteve Kirsch’s talk at SIGIR

Example Web Searches 161858lion lions 163041lion facts 163919picher of lions 164040lion picher 165002lion pictures 165100 pictures of lions 165211 pictures of big cats 165311lion photos 170013video in lion 172131pictureof a lioness 172207picture of a lioness 172241lion pictures 172334lion pictures cat 172443lions 172450lions 150052lion 152004lions 152036lions lion 152219lion facts 153747roaring 153848lions roaring 160232africa lion 160642lions, tigers, leopards and cheetahs 161042lions, tigers, leopards and cheetahs cats 161144wild cats of africa 161414africa cat 161602africa lions 161308africa wild cats 161823 mane 161840lion user = A1D6F19DB06BD694date = 970916excite log

Earlier IR Accomplishments in DTAS zBayesian IR yAnswer Wizard: words --> p(Concept i | Words) zUser Modeling yLumiere, Office Assistant: user words, user profile, user behavior (events), active data structures --> p(User needs, Goals | Evidence) zCollaborative Filtering

Current DTAS IR Projects zDomain/Object Modeling *Text Classification (Sue, David, Eric, John, Mehran) yCombining content and collaborative matches (Carl) zUser/Task Modeling yImplicit query (Sue, Eric, Krishna) yPrefetching information (Eric) yCost-benefit analysis of a search result (Jack)

User Modeling for IQ/IR zIQ: Model of user interests based on actions yExplicit search activity (query or profile) yPatterns of scroll / dwell on text yCopying and pasting actions yInteraction with multiple applications Explicit Queries or Profile Copy and Paste Scroll/Dwell on Text User’s Short- and Long-Term Interests / Needs “Implicit Query (IQ)” Other Applications

Improvements: Using Probabilistic Model zMSR-Cambridge (Steve Robertson) zProbabilistic Retrieval (e.g., Okapi) yTheory-driven derivation of matching function yEstimate: P Q (r i =Rel or NotRel | d=document) xUsing Bayes Rule and assuming conditional independence given Rel/NotRel

Improvements: Using Probabilistic Model zGood performance for uniform length document surrogates (e.g., abstracts) zEnhanced to take into account term frequency and document y“BM25” one of the best ranking function at TREC zEasy to incorporate relevance feedback zNow looking at adaptive filtering/routing

Improvements: Using NLP zCurrent search techniques use word forms zImprovements in content-matching will come from: -> Identifying relations between words -> Identifying word meanings zAdvanced NLP can provide these zhttp:/research.microspft.com/nlp

Dictionary MindNet Morphology Sketch Logical Form Portrait NL Text Discourse Generation NL Text NLP System Architecture Machine Translation ProjectsTechnology Search and Retrieval Meaning Representation Grammar & Style Checking Document Understanding Intelligent Summarizing Smart Selection Word Breaking Indexing

Result: 2-3 times as many relevant documents in the top 10 with Microsoft NLP 21.5% 33.1% 63.7% Engine X X+ NLP Relevant hits “Truffle”: Word Relations % Relevant In Top Ten Docs

“MindNet”: Word Meanings zA huge knowledge base zAutomatically created from dictionaries zWords (nodes) linked by relationships z7 million links and growing

MindNet

zP1 Panel (WWW8) - Finding Anything in the Billion-Page Web: yAre Algorithms the Answer?  Moderator: Prabhakar Raghavan, IBM Almaden Research Center,USA  Panelists: - Andrei Z. Broder, Compaq SRC - Monika R. Henzinger, Compaq SRC - Udi Manber, Yahoo! Inc. - Brian Pinkerton, Excite Inc.

zWeb queries yaverage length 2.2 words y10% use syntax (often incorrectly) y1% use advanced search ynoun phrase only zPrecision more important than recall?

zComputer Search Today: large central indices zHuman Search Today: yask friends - 6dos yhuman, decentralized

History-Enriched Digital Objs Hill, W., Hollan, J., Wroblewski, D., McCandless, T. Edit Wear and Read Wear: Their Theory and Generalization. ACM CHI'92 Human Factors in Computing Systems Proceedings. ACM Press (1992), 3-9.

Beyond Content-Based Retrieval: Modeling Domains, Users and Interaction Susan Dumais Microsoft Research IEEE ADL’99.

Similar presentations

Presentation on theme: "Beyond Content-Based Retrieval: Modeling Domains, Users and Interaction Susan Dumais Microsoft Research IEEE ADL’99."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Beyond Content-Based Retrieval: Modeling Domains, Users and Interaction Susan Dumais Microsoft Research IEEE ADL’99.

Similar presentations

Presentation on theme: "Beyond Content-Based Retrieval: Modeling Domains, Users and Interaction Susan Dumais Microsoft Research IEEE ADL’99."— Presentation transcript:

Similar presentations

About project

Feedback