Presentation is loading. Please wait.

Presentation is loading. Please wait.

Research in Information Retrieval and Management Susan Dumais Microsoft Research Library of Congress Feb 8, 1999.

Similar presentations


Presentation on theme: "Research in Information Retrieval and Management Susan Dumais Microsoft Research Library of Congress Feb 8, 1999."— Presentation transcript:

1 Research in Information Retrieval and Management Susan Dumais Microsoft Research Library of Congress Feb 8, 1999

2 Research in IR at MS zMicrosoft Research (http://research.microsoft.com) yDecision Theory and Adaptive Systems yNatural Language Processing yMSR Cambridge yUser Interface yDatabase yWeb Companion yPaperless Office zMicrosoft Product Groups … many IR-related

3 IR Themes & Directions zImprovements in representation and content-matching yProbabilistic/Bayesian models xp(Relevant|Document), p(Concept|Words) yNLP: Truffle, MindNet zBeyond content-matching yUser/Task modeling yDomain/Object modeling yAdvances in presentation and manipulation

4 Improvements: Using Probabilistic Model zMSR-Cambridge (Steve Robertson) zProbabilistic Retrieval (e.g., Okapi) yTheory-driven derivation of matching function yEstimate: P Q (r i =Rel or NotRel | d=document) xUsing Bayes Rule and assuming conditional independence given Rel/NotRel

5 Improvements: Using Probabilistic Model zGood performance for uniform length document surrogates (e.g., abstracts) zEnhanced to take into account term frequency and document yBM25 one of the best ranking function at TREC zEasy to incorporate relevance feedback zNow looking at adaptive filtering/routing

6 Improvements: Using NLP zCurrent search techniques use word forms zImprovements in content-matching will come from: -> Identifying relations between words -> Identifying word meanings zAdvanced NLP can provide these zhttp:/research.microspft.com/nlp

7 Dictionary MindNet Morphology Sketch Logical Form Portrait NL Text Discourse Generation NL Text NLP System Architecture Machine Translation ProjectsTechnology Search and Retrieval Meaning Representation Grammar & Style Checking Document Understanding Intelligent Summarizing Smart Selection Word Breaking Indexing

8 Result: 2-3 times as many relevant documents in the top 10 with Microsoft NLP 21.5% 33.1% 63.7% Engine X X+ NLP Relevant hits Truffle: Word Relations % Relevant In Top Ten Docs

9 MindNet: Word Meanings zA huge knowledge base zAutomatically created from dictionaries zWords (nodes) linked by relationships z7 million links and growing

10 MindNet

11 Beyond Content Matching zDomain/Object modeling yText classification and clustering zUser/Task modeling yImplicit queries and Lumiere zAdvances in presentation and manipulation yCombining structure and search (e.g., DM)

12 Broader View of IR Query Words Ranked List User Modeling Domain Modeling Information Use

13 Beyond Content Matching zDomain/Object modeling èText classification and clustering zUser/Task modeling yImplicit queries and Lumiere zAdvances in presentation and manipulation yCombining structure and search (e.g., DM)

14 Text Classification zText Classification: assign objects to one or more of a predefined set of categories using text features yE.g., News feeds, Web data, OHSUMED, Email - spam/no-spam zApproaches: yHuman classification (e.g., LCSH, MeSH, Yahoo!, CyberPatrol) yHand-crafted knowledge engineered systems (e.g., CONSTRUE) *Inductive learning methods x(Semi-) automatic classification

15 Classifiers zA classifier is a function: f(x) = conf(class) yfrom attribute vectors, x=(x 1,x 2, … x d ) yto target values, confidence(class) zExample classifiers yif (interest AND rate) OR (quarterly), then confidence(interest) = 0.9 yconfidence(interest) = 0.3*interest + 0.4*rate + 0.1*quarterly

16 Inductive Learning Methods zSupervised learning from examples yExamples are easy for domain experts to provide yModels easy to learn, update, and customize zExample learning algorithms yRelevance Feedback, Decision Trees, Naïve Bayes, Bayes Nets, Support Vector Machines (SVMs) zText representation yLarge vector of features (words, phrases, hand-crafted)

17 Text Classification Process text files word counts per file data set Decision tree Index Server Feature selection Naïve Bayes Find similar Bayes nets Support vector machine Learning Methods test classifier

18 Support Vector Machine zOptimization Problem yFind hyperplane, h, separating positive and negative examples yOptimization for maximum margin: yClassify new items using: support vectors

19 Support Vector Machines zExtendable to: yNon-separable problems (Cortes & Vapnik, 1995) yNon-linear classifiers (Boser et al., 1992) zGood generalization performance yHandwriting recognition (LeCun et al.) yFace detection (Osuna et al.) yText classification (Joachims, Dumais et al.) zPlatts Sequential Minimal Optimization algorithm very efficient

20 Reuters Data Set ( 21578 - ModApte split) z9603 training articles; 3299 test articles zExample interest article 2-APR-1987 06:35:19.50 west-germany b f BC-BUNDESBANK-LEAVES-CRE 04-02 0052 FRANKFURT, March 2 The Bundesbank left credit policies unchanged after today's regular meeting of its council, a spokesman said in answer to enquiries. The West German discount rate remains at 3.0 pct, and the Lombard emergency financing rate at 5.0 pct. REUTER zAverage article 200 words long

21 Example: Reuters news z118 categories (article can be in more than one category) zMost common categories (#train, #test) zOverall Results yLinear SVM most accurate: 87% precision at 87% recall Trade (369,119) Interest (347, 131) Ship (197, 89) Wheat (212, 71) Corn (182, 56) Earn (2877, 1087) Acquisitions (1650, 179) Money-fx (538, 179) Grain (433, 149) Crude (389, 189)

22 Reuters ROC - Category Grain Precision Recall LSVM Decision Tree Naïve Bayes Find Similar Recall: % labeled in category among those stories that are really in category Precision: % really in category among those stories labeled in category

23 Text Categ Summary z Accurate classifiers can be learned automatically from training examples z Linear SVMs are efficient and provide very good classification accuracy zWidely applicable, flexible, and adaptable representations yEmail spam/no-spam, Web, Medical abstracts, TREC

24 Text Clustering zDiscovering structure yVector-based document representation yEM algorithm to identify clusters zInteractive user interface

25 Text Clustering

26 Beyond Content Matching zDomain/Object modeling yText classification and clustering zUser/Task modeling èImplicit queries and Lumiere zAdvances in presentation and manipulation èCombining structure and search (e.g., DM)

27 Implicit Queries (IQ) zExplicit queries: ySearch is a separate, discrete task yUser types query, Gets results, Tries again … zImplicit queries: ySearch as part of normal information flow yOngoing query formulation based on user activities, and non-intrusive results display yCan include explicit query or push profile, but doesnt require either

28

29

30 User Modeling for IQ/IR zIQ: Model of user interests based on actions yExplicit search activity (query or profile) yPatterns of scroll / dwell on text yCopying and pasting actions yInteraction with multiple applications Explicit Queries or Profile Copy and Paste Scroll/Dwell on Text Users Short- and Long-Term Interests / Needs Implicit Query (IQ) Other Applications

31 Implicit Query Highlights zIQ built by tracking users reading behavior yNo explicit search required yGood matches returned zIQ user model: yCombines present context + previous interests zNew interfaces for tightly coupling search results with structure -- user study

32

33 Data Mountain with Implicit Query results shown (highlighted pages to left of selected page).

34 IQ Study: Experimental Details zStore 100 Web pages y50 popular Web pages; 50 random pages yWith or without Implicit Query xIQ1: Co-occurrence based IQ xIQ2: Content-based IQ zRetrieve 100 Web pages yTitle given as retrieval cue -- e.g., CNN Home Page yNo implicit query highlighting at retrieval

35 Find: CNN Home Page

36 Results: Information Storage zFiling strategies zNumber of categories

37 Results: Retrieval Time

38 Example Web Searches 161858lion lions 163041lion facts 163919picher of lions 164040lion picher 165002lion pictures 165100 pictures of lions 165211 pictures of big cats 165311lion photos 170013video in lion 172131pictureof a lioness 172207picture of a lioness 172241lion pictures 172334lion pictures cat 172443lions 172450lions 150052lion 152004lions 152036lions lion 152219lion facts 153747roaring 153848lions roaring 160232africa lion 160642lions, tigers, leopards and cheetahs 161042lions, tigers, leopards and cheetahs cats 161144wild cats of africa 161414africa cat 161602africa lions 161308africa wild cats 161823 mane 161840lion user = A1D6F19DB06BD694date = 970916excite log

39

40

41 Summary zRich IR research tapestry zImproving content-matching zAnd, beyond... yDomain/Object Models yUser/Task Models yInformation Presentation and Use zhttp://research.microsoft.com/~sdumais


Download ppt "Research in Information Retrieval and Management Susan Dumais Microsoft Research Library of Congress Feb 8, 1999."

Similar presentations


Ads by Google