APPLICATIONS OF DATA MINING IN INFORMATION RETRIEVAL

Nearest Neighbor Classifiers
- Basic intuition: similar documents should have the same class label.
- We can use the vector space model and cosine similarity.
- Training is simple: remember the class value of each training document and index the documents using an inverted index structure.
- Testing is also simple: use each test document d_t as a query, fetch the k training documents most similar to it, and use majority voting to determine the class of d_t.
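
A minimal sketch of this index-then-query scheme in Python. The tiny document lists, the labels, and the use of scikit-learn's TfidfVectorizer are illustrative assumptions, not part of the slides:

from collections import Counter

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Placeholder training data: documents with known class labels.
train_docs = ["stock market rally", "bond yields fall sharply", "football match tonight"]
train_labels = ["finance", "finance", "sports"]

# "Training" is just indexing: vectorize the documents and remember their labels.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_docs)

def knn_classify(doc, k=2):
    # Use the test document as a query against the indexed training documents.
    sims = cosine_similarity(vectorizer.transform([doc]), X_train).ravel()
    top_k = np.argsort(-sims)[:k]                    # k most similar training docs
    votes = Counter(train_labels[i] for i in top_k)  # majority vote
    return votes.most_common(1)[0][0]

print(knn_classify("market opens higher"))           # "finance" for this toy data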

Nearest Neighbor Classifiers
- Instead of pure counts of classes, we can weight the votes by similarity: if training document d has label c_d, then class c_d accumulates a score of s(d_q, d).
- The class with the maximum score is selected.
- Per-class offsets b_c could be used and tuned later on, so that the score takes the form
  score(c, d_q) = b_c + Σ s(d_q, d), summed over the retrieved training documents d with c_d = c.

Nearest Neighbor Classifiers
- Choosing the value of k:
  - Try various values of k and use a portion of the documents for validation.
  - Cluster the documents and choose a value of k proportional to the size of small clusters.
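
A rough sketch of the validation approach, assuming the knn_classify() function from the earlier sketch; the candidate values and the validation split are placeholders:

def choose_k(val_docs, val_labels, candidate_ks=(1, 3, 5, 7)):
    # Pick the k with the highest accuracy on a held-out validation set.
    best_k, best_acc = candidate_ks[0], -1.0
    for k in candidate_ks:
        preds = [knn_classify(d, k=k) for d in val_docs]
        acc = sum(p == y for p, y in zip(preds, val_labels)) / len(val_labels)
        if acc > best_acc:
            best_k, best_acc = k, acc
    return best_k

# Example with a placeholder validation split:
# best_k = choose_k(["bond prices rise"], ["finance"])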

Nearest Neighbor Classifiers
- kNN is a lazy strategy compared to decision trees.
- Advantages:
  - No training is needed to construct a model.
  - When properly tuned for k and b_c, kNN classifiers are comparable in accuracy to other classifiers.
- Disadvantages:
  - Classification may involve many inverted index lookups, and scoring, sorting, and picking the best k results takes time (since k is small compared to the number of retrieved documents, such queries are called iceberg queries).

Nearest Neighbor Classifiers
- For better performance, some effort is spent during training:
  - Documents are clustered, and only a few statistical parameters are stored per cluster.
  - A test document is first compared with the cluster representatives, then with the individual documents from the appropriate clusters.
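
A sketch of this cluster-based speedup, reusing the imports, X_train, train_labels, and vectorizer from the first sketch; the use of KMeans and the parameter values are assumptions for illustration only:

from sklearn.cluster import KMeans

# Cluster the training documents once, up front.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_train)

def knn_classify_pruned(doc, k=2, n_probe=1):
    q = vectorizer.transform([doc])
    # Step 1: compare the query only with the cluster representatives (centroids).
    centroid_sims = cosine_similarity(q, km.cluster_centers_).ravel()
    probe = np.argsort(-centroid_sims)[:n_probe]
    # Step 2: score only the documents belonging to the most promising cluster(s).
    candidates = np.where(np.isin(km.labels_, probe))[0]
    sims = cosine_similarity(q, X_train[candidates]).ravel()
    top_k = candidates[np.argsort(-sims)[:k]]
    return Counter(train_labels[i] for i in top_k).most_common(1)[0][0]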

Measures of Accuracy
- We may have one of the following settings:
  - Each document is associated with exactly one class.
  - Each document is associated with a subset of classes.
- The ideas of precision and recall can be used to measure the accuracy of the classifier: calculate the average precision and recall over all classes.
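
For the single-label case, one way to compute this per-class average is scikit-learn's macro-averaged precision and recall (the library choice and the label lists below are placeholders, not from the slides):

from sklearn.metrics import precision_score, recall_score

y_true = ["finance", "sports", "finance", "politics"]
y_pred = ["finance", "finance", "finance", "politics"]

# Macro-averaging: compute precision/recall per class, then average over the classes.
print(precision_score(y_true, y_pred, average="macro", zero_division=0))
print(recall_score(y_true, y_pred, average="macro", zero_division=0))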

Hypertext Classification
- An HTML document can be thought of as a hierarchy of regions represented by a tree-structured Document Object Model (DOM).
- The DOM tree consists of:
  - Internal nodes: HTML elements (tags).
  - Leaf nodes: segments of text.
  - Hyperlinks to other nodes.

Hypertext Classification
- An example DOM in XML format: a resume whose publication title is "Statistical models for web-surfing" and whose hobbies include the item "Wind-surfing".
- It is important to distinguish the two occurrences of the term "surfing", which can be achieved by prefixing the term with the sequence of tags on its path in the DOM tree:
  - resume.publication.title.surfing
  - resume.hobbies.item.surfing
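
A small sketch of how such tag-path-prefixed terms could be generated with Python's xml.etree; the XML below reconstructs the slide's resume example, and the tokenization is a simplifying assumption:

import xml.etree.ElementTree as ET

xml = """<resume>
  <publication><title>Statistical models for web-surfing</title></publication>
  <hobbies><item>Wind-surfing</item></hobbies>
</resume>"""

def path_prefixed_terms(elem, path=""):
    # Prefix every text term with the tag path leading to it in the DOM tree.
    here = f"{path}.{elem.tag}" if path else elem.tag
    if elem.text and elem.text.strip():
        for term in elem.text.lower().replace("-", " ").split():
            yield f"{here}.{term}"
    for child in elem:
        yield from path_prefixed_terms(child, here)

print(list(path_prefixed_terms(ET.fromstring(xml))))
# ... 'resume.publication.title.surfing' ... 'resume.hobbies.item.surfing' ...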

Hypertext Classification
- Use relations to give meaning to textual features, such as:
  - contains-text(domNode, term)
  - part-of(domNode1, domNode2)
  - tagged(domNode, tagName)
  - links-to(srcDomNode, dstDomNode)
  - contains-anchor-text(srcDomNode, dstDomNode, term)
  - classified(domNode, label)
- Discover rules from a collection of such relations, for example:
  classified(A, facultyPage) :- contains-text(A, professor), contains-text(A, phd), links-to(B, A), contains-text(B, faculty)
  where ":-" means "if" and the comma stands for conjunction.

Hypertext Classification
- Rule induction in the two-class setting: FOIL (First-Order Inductive Learner, Quinlan, 1993) is a greedy algorithm that learns rules to distinguish positive examples from negative ones.
- It repeatedly searches for the current best rule and removes all the positive examples covered by that rule, until all the positive examples in the data set are covered.
- It tries to maximize the gain of adding a literal p to a rule r, where P is the set of positive and N the set of negative examples; when p is added to r, P* positive and N* negative examples satisfy the new rule.

Hypertext Classification
FOIL-style rule learning (D+ = positive examples, D- = negative examples):

    Let R be the set of rules learned, initially empty
    while D+ != EmptySet do                  // learn a new rule
        let r be "true" (the new rule)
        while some d in D- satisfies r do
            // add a new, possibly negated, literal to r to specialize it
            add the "best possible" literal p as a conjunct to r
        endwhile
        R <- R U {r}
        remove from D+ all instances for which r evaluates to true
    endwhile
    return R

Hypertext Classification
- Types of literals explored when adding the "best possible" literal in the loop above:
  - Xi = Xj, Xi = c, Xi > Xj, etc., where Xi and Xj are variables and c is a constant
  - Q(X1, X2, ..., Xk), where Q is a relation and the Xi are variables
  - not(L), where L is a literal of one of the above forms
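
A simplified, propositional Python sketch of the rule-learning loop above: here a literal is simply "the page contains term t", and the quantity being maximized is the standard FOIL information gain (an assumption; the slides only say "gain"). The example pages and terms are invented placeholders:

import math

def covers(rule, page):
    # A rule is a set of required terms; a page is represented as a set of terms.
    return rule <= page

def foil_gain(rule, literal, pos, neg):
    p0 = sum(covers(rule, d) for d in pos)
    n0 = sum(covers(rule, d) for d in neg)
    new_rule = rule | {literal}
    p1 = sum(covers(new_rule, d) for d in pos)
    n1 = sum(covers(new_rule, d) for d in neg)
    if p1 == 0:
        return float("-inf")           # the specialized rule covers no positives
    return p1 * (math.log2(p1 / (p1 + n1)) - math.log2(p0 / (p0 + n0)))

def learn_rules(pos, neg, vocab):
    rules, remaining = [], list(pos)
    while remaining:                   # learn a new rule
        rule = set()                   # the empty rule plays the role of "true"
        while any(covers(rule, d) for d in neg):
            best = max(vocab, key=lambda t: foil_gain(rule, t, remaining, neg))
            if foil_gain(rule, best, remaining, neg) <= 0:
                break                  # no literal improves the rule; stop specializing
            rule.add(best)
        rules.append(rule)
        remaining = [d for d in remaining if not covers(rule, d)]
    return rules

pos = [{"professor", "phd", "publications"}, {"professor", "teaching"}]
neg = [{"student", "homework"}, {"course", "syllabus", "teaching"}]
print(learn_rules(pos, neg, set().union(*pos, *neg)))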

Hypertext Classification
- With relational learning, we can learn class labels for individual pages as well as relationships between them, e.g.:
  - Member(homePage, department)
  - Teaches(homePage, coursePage)
  - Advises(homePage, homePage)
- We can also incorporate other classifiers, such as naïve Bayes, into the rule learning.

RETRIEVAL UTILITIES

Retrieval Utilities
- Relevance feedback
- Clustering
- Passage-based retrieval
- Parsing
- N-grams
- Thesauri
- Semantic networks
- Regression analysis

Relevance Feedback
- Do the retrieval in multiple steps; the user refines the query at each step with respect to the results of the previous queries.
- The user tells the IR system which documents are relevant.
- New terms are added to the query based on this feedback, and term weights may also be updated.

Relevance Feedback
- The user can be bypassed (pseudo-relevance feedback): assume the top-k results in the ranked list are relevant and modify the original query as before.

Relevance Feedback
- Example: "find information surrounding the various conspiracy theories about the assassination of John F. Kennedy" (example from the textbook).
- If a highly ranked document contains the term "Oswald", then this term should be added to the initial query.
- If the term "assassination" appears in a top-ranked document, then its weight should be increased.

Relevance Feedback in the Vector Space Model
- Q is the original query.
- R is the set of relevant and S the set of non-relevant documents selected by the user.
- |R| = n1, |S| = n2.

Relevance Feedback in the Vector Space Model
- With Q, R, S, n1, and n2 as above, the query is updated (in its general form) as
  Q' = α·Q + (β/n1)·Σ_{d ∈ R} d − (γ/n2)·Σ_{d ∈ S} d
- The weights α, β, and γ are referred to as the Rocchio weights.
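
A small numpy sketch of this update. The default values for alpha, beta, and gamma and the clipping of negative weights to zero are common choices assumed here, not prescribed by the slides:

import numpy as np

def rocchio(query_vec, relevant_vecs, nonrelevant_vecs, alpha=1.0, beta=0.75, gamma=0.15):
    # query_vec: 1-D term-weight vector; relevant_vecs / nonrelevant_vecs: 2-D arrays
    # whose rows are the vectors of documents judged relevant / non-relevant.
    q_new = alpha * query_vec
    if len(relevant_vecs):
        q_new = q_new + (beta / len(relevant_vecs)) * relevant_vecs.sum(axis=0)
    if len(nonrelevant_vecs):
        q_new = q_new - (gamma / len(nonrelevant_vecs)) * nonrelevant_vecs.sum(axis=0)
    return np.maximum(q_new, 0.0)   # negative term weights are usually clipped to zero

# Pseudo-relevance feedback (previous slide): pass the top-k retrieved document
# vectors as relevant_vecs and an empty array as nonrelevant_vecs, then re-run the query.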

Relevance Feedback in the Vector Space Model
- What if the original query retrieves only non-relevant documents (as judged by the user)?
- Then increase the weight of the most frequently occurring term in the document collection.

Relevance Feedback in the Vector Space Model
- Result-set clustering can be used as a utility for relevance feedback.
- Hierarchical clustering can be used for this purpose, with the distance between documents defined via cosine similarity.
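
A sketch of such clustering with SciPy's hierarchical clustering under cosine distance (1 − cosine similarity). The library choice, the linkage method, and the threshold are assumptions; result_vectors is assumed to be a dense array whose rows are the tf-idf vectors of the retrieved documents:

from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

def cluster_results(result_vectors, threshold=0.5):
    # Pairwise cosine distances between the retrieved documents' vectors.
    dists = pdist(result_vectors, metric="cosine")
    tree = linkage(dists, method="average")                    # agglomerative clustering
    return fcluster(tree, t=threshold, criterion="distance")   # one cluster label per document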