Download presentation
Presentation is loading. Please wait.
Published bySolomon Page Modified over 8 years ago
1
Beyond Content-Based Retrieval: Modeling Domains, Users and Interaction Susan Dumais Microsoft Research http://research.microsoft.com/~sdumais IEEE ADL’99 - May 21, 1998
2
Research in IR at MS zMicrosoft Research (http://research.microsoft.com) yAdaptive Systems and Interaction - IR/UI yMachine Learning and Applied Statistics yData Mining yNatural Language Processing yCollaboration and Education yMSR Cambridge; MSR Beijing zMicrosoft Product Groups … many IR-related
3
IR Themes & Directions zImprovements in representation and content- matching yProbabilistic/Bayesian models - p(Relevant|Document), p(Concept|Words) yNLP: Truffle, MindNet zBeyond content-matching yUser/Task modeling yDomain/Object modeling yAdvances in presentation and manipulation WWW8 Panel Finding Anything in the Billion-Page Web: Are Algorithms the Answer? Moderator: Prabhakar Raghavan, IBM Almaden
4
Traditional View of IR Query Words Ranked List
5
What’s Missing? Query Words Ranked List User Modeling Domain Modeling Information Use
6
Domain/Obj Modeling zNot all objects are equal … potentially big win yA priori importance yInformation use (“readware”; collab filtering) zInter-object relationships yLink structure / hypertext ySubject categories - e.g., text categorization. text clustering zMetadata yE.g., reliability, recency, cost -> combining
7
User/Task Modeling zDemographics zTask -- What’s the user’s goal? ye.g., Lumiere zShort and long-term content interests ye.g., Implicit queries yInterest model = f(content_similarity, time, interest) ye.g., Letiza, WebWatcher, Fab
8
Information Use zBeyond batch IR model (“query->results”) yConsider larger task context yKnowledge management zHuman attention is critical resource yTechniques for automatic information summarization, organization, discover, filtering, mining, etc. zAdvanced UIs and interaction techniques yE.g, tight coupling of search, browsing to support information management
9
The Broader View of IR Query Words Ranked List User Modeling Domain Modeling Information Use
10
Beyond Content Matching zDomain/Object modeling èA priori importance èText classification and clustering zUser/Task modeling yImplicit queries and Lumiere zAdvances in presentation and manipulation yCombining structure and search (e.g., DM)
15
Estimating Priors zAccess Count ye.g., DirectHit yCollaborative Filtering zLink Structure (citation analysis) yClever (Hubs/Authorities) - Kleinberg et al. yGoogle - Brin & Page yWeb Archeologist - Bharat, Henzinger
16
New Relevance Ranking zRelevance ranking can include: ycontent-matching … of couse ypage/site popularity ypage quality yspam or porn or other downweighting yetc. zCombining these - relative weighting of these factors tricky and subjective
17
Text Classification zText Classification: assign objects to one or more of a predefined set of categories using text features yE.g., News feeds, Web data, OHSUMED, Email - spam/no-spam zApproaches: yHuman classification (e.g., LCSH, MeSH, Yahoo!, CyberPatrol) yHand-crafted knowledge engineered systems (e.g., CONSTRUE) *Inductive learning methods x(Semi-) automatic classification
18
Inductive Learning Methods zSupervised learning from examples yExamples are easy for domain experts to provide yModels easy to learn, update, and customize zExample learning algorithms yRelevance Feedback, Decision Trees, Naïve Bayes, Bayes Nets, Support Vector Machines (SVMs) zText representation yLarge vector of features (words, phrases, hand-crafted)
19
Support Vector Machine zOptimization Problem yFind hyperplane, h, separating positive and negative examples yOptimization for maximum margin: yClassify new items using: support vectors
20
Reuters Data Set ( 21578 - ModApte split) z9603 training articles; 3299 test articles zExample “interest” article 2-APR-1987 06:35:19.50 west-germany b f BC-BUNDESBANK-LEAVES-CRE 04-02 0052 FRANKFURT, March 2 The Bundesbank left credit policies unchanged after today's regular meeting of its council, a spokesman said in answer to enquiries. The West German discount rate remains at 3.0 pct, and the Lombard emergency financing rate at 5.0 pct. REUTER zAverage article 200 words long
21
Example: Reuters news z118 categories (article can be in more than one category) zMost common categories (#train, #test) zOverall Results yLinear SVM most accurate: 87% precision at 87% recall Trade (369,119) Interest (347, 131) Ship (197, 89) Wheat (212, 71) Corn (182, 56) Earn (2877, 1087) Acquisitions (1650, 179) Money-fx (538, 179) Grain (433, 149) Crude (389, 189)
22
Reuters ROC - Category Grain Precision Recall LSVM Decision Tree Naïve Bayes Find Similar Recall: % labeled in category among those stories that are really in category Precision: % really in category among those stories labeled in category
23
Text Categ Summary z Accurate classifiers can be learned automatically from training examples z Linear SVMs are efficient and provide very good classification accuracy zWidely applicable, flexible, and adaptable representations yEmail spam/no-spam, Web, Medical abstracts, TREC
24
Text Clustering zDiscovering structure yVector-based document representation yEM algorithm to identify clusters zInteractive user interface
25
Text Clustering
26
Beyond Content Matching zDomain/Object modeling yText classification and clustering zUser/Task modeling èImplicit queries and Lumiere zAdvances in presentation and manipulation èCombining structure and search (e.g., DM)
27
Lumiere zInferring beliefs about user’s goals and ideal actions under uncertainty * User query User activity User profile Data structures Pr(Goals, Needs) E. Horvitz, et al.
28
Bayesian Models for Information Access Current General Interests User’s actions Explicit User Query User’s Acute Interests / Needs Applications, Docs, Data Structures Acute Goals/Tasks Previously Demonstrated Interests Context
29
Lumiere
30
Lumiere Office Assistant
31
Implicit Queries (IQ) zExplicit queries: ySearch is a separate, discrete task yUser types query, Gets results, Tries again … zImplicit queries: ySearch as part of normal information flow yOngoing query formulation based on user activities yNon-intrusive results display
34
Data Mountain with Implicit Query results shown (highlighted pages to left of selected page).
35
IQ Study: Experimental Details zStore 100 Web pages y50 popular Web pages; 50 random pages yWith or without Implicit Query xIQ1: Co-occurrence based IQ xIQ2: Content-based IQ zRetrieve 100 Web pages yTitle given as retrieval cue -- e.g., “CNN Home Page” yNo implicit query highlighting at retrieval
36
Find: “CNN Home Page”
37
Results: Information Storage zFiling strategies
38
Results: Information Storage zNumber of categories ( for semantic organizers ) 7 Categ 23 Categ
39
Results: Retrieval Time
40
Implicit Query Highlights zIQ built by tracking user’s reading behavior yNo explicit search required yGood matches returned zIQ user model: yCombines present context + previous interests zNew interfaces for tightly coupling search results with structure -- user study
41
Summary zImproving content-matching zAnd, beyond... yDomain/Object Models yUser/Task Models yInformation Presentation and Use zAlso … y non-text, multi-lingual, distributed zhttp://research.microsoft.com/~sdumais
42
The View from Wired, May 1996 z“information retrieval seemed like the easiest place to make progress … information retrieval is really only a problem for people in library science -- if some computer scientists were to put their heads together, they’d probably have it solved before lunchtime” [GS]
44
DLI-1 zUC Berkeley - Environmental planning and GIS zUC Santa Barbara - The Alexandria project spatially referenced map information zCMU - Informedia digital video library zUIllinois - Federating repositories of scientific literature zUMichigan - Intelligent agents for information location zStanford - Interoperation mechanisms among heterogeneous services
45
DLI-2 zOHSU/OGI: Tracking footprints through an information space: Leveraging the document selections of expert problem solvers zStanford: Image filtering for secure distribution of medical information zUWashington: Automatic reference librarians for the WWW zUKentucky: The digital atheneum: New techniques for restoring, searching and editing humanities collection zUSCarolina: A software and data library for experiments, simulation and archiving zJohns Hopkins: Digital workflow management: The Lester S. Levy collection of Music zUArizona: High-performance digital library classification systems: From information retrieval to knowledge management
46
zPopularity rankings - e.g., DirectHit zLink-based rankings - e.g., Google, Clever zQuestion answering - e.g., AskJeeves zPaid advertising - e.g.,GoTo zSteve Kirsch’s talk at SIGIR
51
Example Web Searches 161858lion lions 163041lion facts 163919picher of lions 164040lion picher 165002lion pictures 165100 pictures of lions 165211 pictures of big cats 165311lion photos 170013video in lion 172131pictureof a lioness 172207picture of a lioness 172241lion pictures 172334lion pictures cat 172443lions 172450lions 150052lion 152004lions 152036lions lion 152219lion facts 153747roaring 153848lions roaring 160232africa lion 160642lions, tigers, leopards and cheetahs 161042lions, tigers, leopards and cheetahs cats 161144wild cats of africa 161414africa cat 161602africa lions 161308africa wild cats 161823 mane 161840lion user = A1D6F19DB06BD694date = 970916excite log
54
Earlier IR Accomplishments in DTAS zBayesian IR yAnswer Wizard: words --> p(Concept i | Words) zUser Modeling yLumiere, Office Assistant: user words, user profile, user behavior (events), active data structures --> p(User needs, Goals | Evidence) zCollaborative Filtering
55
Current DTAS IR Projects zDomain/Object Modeling *Text Classification (Sue, David, Eric, John, Mehran) yCombining content and collaborative matches (Carl) zUser/Task Modeling yImplicit query (Sue, Eric, Krishna) yPrefetching information (Eric) yCost-benefit analysis of a search result (Jack)
56
User Modeling for IQ/IR zIQ: Model of user interests based on actions yExplicit search activity (query or profile) yPatterns of scroll / dwell on text yCopying and pasting actions yInteraction with multiple applications Explicit Queries or Profile Copy and Paste Scroll/Dwell on Text User’s Short- and Long-Term Interests / Needs “Implicit Query (IQ)” Other Applications
57
Improvements: Using Probabilistic Model zMSR-Cambridge (Steve Robertson) zProbabilistic Retrieval (e.g., Okapi) yTheory-driven derivation of matching function yEstimate: P Q (r i =Rel or NotRel | d=document) xUsing Bayes Rule and assuming conditional independence given Rel/NotRel
58
Improvements: Using Probabilistic Model zGood performance for uniform length document surrogates (e.g., abstracts) zEnhanced to take into account term frequency and document y“BM25” one of the best ranking function at TREC zEasy to incorporate relevance feedback zNow looking at adaptive filtering/routing
59
Improvements: Using NLP zCurrent search techniques use word forms zImprovements in content-matching will come from: -> Identifying relations between words -> Identifying word meanings zAdvanced NLP can provide these zhttp:/research.microspft.com/nlp
60
Dictionary MindNet Morphology Sketch Logical Form Portrait NL Text Discourse Generation NL Text NLP System Architecture Machine Translation ProjectsTechnology Search and Retrieval Meaning Representation Grammar & Style Checking Document Understanding Intelligent Summarizing Smart Selection Word Breaking Indexing
61
Result: 2-3 times as many relevant documents in the top 10 with Microsoft NLP 21.5% 33.1% 63.7% Engine X X+ NLP Relevant hits “Truffle”: Word Relations % Relevant In Top Ten Docs
62
“MindNet”: Word Meanings zA huge knowledge base zAutomatically created from dictionaries zWords (nodes) linked by relationships z7 million links and growing
63
MindNet
70
zP1 Panel (WWW8) - Finding Anything in the Billion-Page Web: yAre Algorithms the Answer? Moderator: Prabhakar Raghavan, IBM Almaden Research Center,USA Panelists: - Andrei Z. Broder, Compaq SRC - Monika R. Henzinger, Compaq SRC - Udi Manber, Yahoo! Inc. - Brian Pinkerton, Excite Inc.
71
zWeb queries yaverage length 2.2 words y10% use syntax (often incorrectly) y1% use advanced search ynoun phrase only zPrecision more important than recall?
72
zComputer Search Today: large central indices zHuman Search Today: yask friends - 6dos yhuman, decentralized
73
History-Enriched Digital Objs Hill, W., Hollan, J., Wroblewski, D., McCandless, T. Edit Wear and Read Wear: Their Theory and Generalization. ACM CHI'92 Human Factors in Computing Systems Proceedings. ACM Press (1992), 3-9.
Similar presentations
© 2024 SlidePlayer.com Inc.
All rights reserved.