Information Retrieval Ch 23.2

Information retrieval
- Goal: finding documents
- Example: search engines on the World Wide Web
- Characteristics of an IR system:
  - Document collection
  - Query language
  - Result set
  - Presentation of the result set

Evaluating an IR system
- Precision = (relevant docs in result set) / (docs in result set)
- Recall = (relevant docs in result set) / (relevant docs in the collection)
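The two ratios can be checked on a toy example. This is a minimal sketch: the function name `precision_recall` and the document ids are made up for illustration, and an empty result set or empty relevant set is assumed to score 0.

```python
def precision_recall(result_set, relevant):
    """Return (precision, recall) for a result set against the known relevant docs."""
    result_set, relevant = set(result_set), set(relevant)
    hits = len(result_set & relevant)  # relevant docs that were actually returned
    precision = hits / len(result_set) if result_set else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical document ids, for illustration only.
retrieved = {"d1", "d2", "d3", "d4"}
relevant = {"d2", "d4", "d7"}
p, r = precision_recall(retrieved, relevant)
print(p, r)  # 2 of 4 returned docs are relevant; 2 of 3 relevant docs were returned
```

Returning more documents tends to raise recall and lower precision, which is why the two are usually reported together.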

Presentation of result sets
- Relevance feedback: the user says which documents are relevant
- Document classification: assign documents to a preexisting taxonomy of topics (Ch 18)
- Document clustering: a tree of categories is created from scratch (Ch 20.3)
  - Agglomerative clustering: repeatedly merge the two nearest documents (or clusters)
  - K-means clustering: assign the documents to k categories
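Agglomerative clustering can be sketched as a loop that keeps merging the two nearest clusters until a single tree remains. The slides do not say how the distance between clusters is measured; the sketch below assumes documents are numeric vectors and uses the Euclidean distance between cluster centroids. The names `centroid` and `agglomerate` are illustrative.

```python
import math
from itertools import combinations


def centroid(cluster):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(cluster) for col in zip(*cluster)]


def agglomerate(vectors):
    """Merge the two nearest clusters until one remains.

    Returns the merge tree as nested tuples of document indices.
    Assumes at least one input vector.
    """
    # Each cluster is (tree, member vectors); leaves are document indices.
    clusters = [(i, [v]) for i, v in enumerate(vectors)]
    while len(clusters) > 1:
        # Find the pair of clusters whose centroids are closest (assumed linkage).
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda ab: math.dist(
                centroid(clusters[ab[0]][1]), centroid(clusters[ab[1]][1])
            ),
        )
        merged = ((clusters[a][0], clusters[b][0]), clusters[a][1] + clusters[b][1])
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)] + [merged]
    return clusters[0][0]


tree = agglomerate([[0, 0], [0, 1], [5, 5], [5, 6]])
print(tree)  # ((0, 1), (2, 3))
```

The nested tuples are the "tree of categories": documents 0 and 1 form one branch, documents 2 and 3 the other.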

K-means clustering
1. Pick k documents at random to represent the k categories.
2. Assign every document to the closest category.
3. Compute the mean of each cluster and use the k means as the new representatives of the k categories.
4. Repeat steps 2 and 3 until convergence.
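The four steps map directly onto a short loop. This is a sketch under stated assumptions: documents are numeric vectors, "closest" means Euclidean distance, and convergence means the assignments stop changing. The `seed` and `max_iter` parameters are additions for reproducibility and safety, not part of the slides.

```python
import math
import random


def mean(points):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(points) for col in zip(*points)]


def k_means(docs, k, seed=0, max_iter=100):
    """Return (assignment, centers), where assignment[i] is the category of docs[i]."""
    rng = random.Random(seed)
    # Step 1: pick k documents at random to represent the k categories.
    centers = rng.sample(docs, k)
    assignment = None
    for _ in range(max_iter):
        # Step 2: assign every document to the closest category.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(d, centers[c])) for d in docs
        ]
        # Step 4: stop once the assignments no longer change.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 3: the mean of each cluster becomes the new representative.
        for c in range(k):
            members = [d for d, a in zip(docs, assignment) if a == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = mean(members)
    return assignment, centers


docs = [[0, 0], [0, 1], [10, 10], [10, 11]]
assignment, centers = k_means(docs, k=2)
print(assignment, centers)
```

The result depends on the random starting documents, so the category numbers may swap between runs with different seeds even when the grouping is the same.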

Implementing IR systems
- Lexicon
- Stop words
- Inverted file
- Vector space model
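The first three components fit together as follows: the lexicon is the set of indexed terms, stop words are skipped, and the inverted file maps each term to the documents that contain it. The sketch below is one possible layout, not a reference design; the stop-word list, the whitespace tokenizer, and the names `build_inverted_file` and `search` are all illustrative assumptions.

```python
from collections import defaultdict

# Illustrative stop-word list, not a standard one.
STOP_WORDS = {"a", "an", "the", "of", "and", "is", "in", "to"}


def build_inverted_file(docs):
    """Map each lexicon term to {doc_id: term count}, skipping stop words."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for word in text.lower().split():  # naive whitespace tokenizer
            if word in STOP_WORDS:
                continue
            index[word][doc_id] = index[word].get(doc_id, 0) + 1
    return index


def search(index, query):
    """Return ids of documents containing every non-stop-word query term."""
    terms = [t for t in query.lower().split() if t not in STOP_WORDS]
    if not terms:
        return set()
    postings = [set(index.get(t, {})) for t in terms]
    return set.intersection(*postings)


# Hypothetical documents, for illustration only.
docs = {1: "the cat sat in the hat", 2: "the dog and the cat", 3: "a dog is a dog"}
index = build_inverted_file(docs)
print(sorted(index))               # the lexicon
print(search(index, "the cat"))    # documents 1 and 2
```

Storing the count per document, rather than only the document id, keeps the information needed to build term-count vectors later.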

Vector space model
- Transform each document into a vector of term counts: over the terms A, B, C, Di = ABC and Dj = BBC give Di = {1, 1, 1}, Dj = {0, 2, 1}
- Measure the distance between two documents: Dist(Di, Dj) = sqrt((1-0)^2 + (1-2)^2 + (1-1)^2) = sqrt(2)
- Retrieve the documents with the smallest distance
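The worked example can be reproduced in a few lines. In this sketch each character stands for one term, as in the slide, and the distance is Euclidean. Treating the query as just another term-count vector is an assumption, and the names `to_vector` and `rank` are illustrative.

```python
import math
from collections import Counter


def to_vector(terms, vocabulary):
    """Count how often each vocabulary term occurs in a sequence of terms."""
    counts = Counter(terms)
    return [counts[t] for t in vocabulary]


def rank(query_terms, docs, vocabulary):
    """Return document names ordered by increasing distance from the query."""
    q = to_vector(query_terms, vocabulary)
    return sorted(
        docs, key=lambda name: math.dist(q, to_vector(docs[name], vocabulary))
    )


vocabulary = ["A", "B", "C"]
di = to_vector("ABC", vocabulary)  # [1, 1, 1]
dj = to_vector("BBC", vocabulary)  # [0, 2, 1]
print(math.dist(di, dj))           # sqrt(2), about 1.414

# Hypothetical query "BB": Dj is nearer than Di.
print(rank("BB", {"Di": "ABC", "Dj": "BBC"}, vocabulary))
```

Raw Euclidean distance favours documents of similar length to the query; practical systems often normalize the vectors or use cosine similarity instead, which is outside what the slide covers.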