Advanced Multimedia Text Retrieval/Classification Tamara Berg.


Announcements: Matlab basics lab – Feb 7. Matlab string processing lab – Feb 12. If you are unfamiliar with Matlab, attendance at the labs is crucial!

Slide from Dan Klein

Today!

What does categorization/classification mean?

Slide from Dan Klein

Slide from Min-Yen Kan

Slide from Dan Klein

Slide from Dan Klein

Slide from Min-Yen Kan

Machine Learning – how to select a model on the basis of data/experience: learning parameters (e.g. probabilities), learning structure (e.g. dependencies), learning hidden concepts (e.g. clustering).

Representing Documents

Document Vectors

Represent a document as a “bag of words”

Example Doc1 = “the quick brown fox jumped” Doc2 = “brown quick jumped fox the”

Example Doc1 = “the quick brown fox jumped” Doc2 = “brown quick jumped fox the” Would a bag of words model represent these two documents differently?
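The question above can be checked directly. A minimal Python sketch (illustrative only, not part of the lecture materials):

```python
from collections import Counter

doc1 = "the quick brown fox jumped"
doc2 = "brown quick jumped fox the"

# A bag of words is a multiset of tokens: word -> count.
# Word order is thrown away, so these two bags are identical.
bag1 = Counter(doc1.split())
bag2 = Counter(doc2.split())

print(bag1 == bag2)  # True
```

So no: a bag-of-words model represents these two documents identically.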

Document Vectors: Documents are represented as “bags of words”. Represented as vectors when used computationally. Each vector holds a place for every term in the collection; therefore, most vectors are sparse. Lexicon – the vocabulary set that you consider to be valid words in your documents. Usually stemmed (e.g. running -> run). Slide from Mitch Marcus
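As a sketch of turning a document into a vector over a lexicon (the lexicon, document, and toy suffix-stripping stemmer below are all hypothetical, for illustration only — a real system would use a proper stemmer):

```python
# Build a term-count vector over a fixed lexicon.
lexicon = ["run", "fox", "jump", "quick"]  # hypothetical stemmed vocabulary

def stem(word):
    # Toy stemmer: strips a few common suffixes (e.g. "running" -> "run").
    for suffix in ("ning", "ed", "s"):
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word

def to_vector(doc, lexicon):
    counts = {}
    for word in doc.lower().split():
        w = stem(word)
        if w in lexicon:
            counts[w] = counts.get(w, 0) + 1
    # One slot per lexicon term; most entries are 0, hence "sparse".
    return [counts.get(term, 0) for term in lexicon]

print(to_vector("the fox was running and jumped", lexicon))  # -> [1, 1, 1, 0]
```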

Document Vectors: One location for each word. [Term–document matrix: terms nova, galaxy, heat, h’wood, film, role, diet, fur; documents A–I.] “Nova” occurs 10 times in text A. “Galaxy” occurs 5 times in text A. “Heat” occurs 3 times in text A. (Blank means 0 occurrences.) Slide from Mitch Marcus

Document Vectors [Term–document matrix: terms nova, galaxy, heat, h’wood, film, role, diet, fur; document ids A–I.] Slide from Mitch Marcus

Vector Space Model: Documents are represented as vectors in term space. Terms are usually stems; documents are represented by vectors of terms. A vector distance measures similarity between documents: document similarity is based on the length and direction of their vectors. Terms in a vector can be “weighted” in many ways. Slide from Mitch Marcus

Comparing Documents

Similarity between documents A = [ ]; G = [ ]; E = [ ];

Similarity between documents A = [ ]; G = [ ]; E = [ ]; Treat the vectors as binary: similarity = number of words in common. Sb(A,G) = ? Sb(A,E) = ? Sb(G,E) = ? Which pair of documents is the most similar?
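A sketch of the binary-overlap similarity Sb (the vectors A, G, E below are made-up stand-ins, since the slide’s actual values are not shown):

```python
# Binary (set) similarity: count of terms both documents contain.
# A, G, E are hypothetical term-count vectors.
A = [10, 5, 3, 0, 0, 0, 0, 0]
G = [5, 0, 7, 0, 0, 9, 0, 0]
E = [0, 0, 0, 0, 10, 0, 2, 0]

def binary_similarity(u, v):
    # A term counts as shared if it appears (count > 0) in both vectors.
    return sum(1 for a, b in zip(u, v) if a > 0 and b > 0)

print(binary_similarity(A, G))  # -> 2 (two terms in common)
```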

Similarity between documents A = [ ]; G = [ ]; E = [ ]; Sum of Squared Distances (SSD): SSD(u,v) = Σᵢ (uᵢ − vᵢ)². SSD(A,G) = ? SSD(A,E) = ? SSD(G,E) = ? (Note that SSD is a distance, so smaller values mean more similar documents.)
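A sketch of SSD on made-up example vectors (again, the slide’s actual values are not shown):

```python
# Sum of squared distances between two term-count vectors.
A = [10, 5, 3, 0]
G = [5, 0, 7, 0]

def ssd(u, v):
    # Sum over coordinates of the squared difference.
    return sum((a - b) ** 2 for a, b in zip(u, v))

print(ssd(A, G))  # 25 + 25 + 16 + 0 = 66
```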

Similarity between documents A = [ ]; G = [ ]; E = [ ]; Angle between vectors: cos(θ) = (u · v) / (‖u‖ ‖v‖), where u · v is the dot product and ‖u‖ = √(Σᵢ uᵢ²) is the length (Euclidean norm).
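The cosine similarity can be sketched as follows (with made-up example vectors):

```python
import math

# Cosine of the angle between two term-count vectors:
# cos(theta) = (u . v) / (||u|| * ||v||)
A = [10, 5, 3, 0]
G = [5, 0, 7, 0]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    # Euclidean norm (vector length).
    return math.sqrt(dot(u, u))

def cosine(u, v):
    return dot(u, v) / (norm(u) * norm(v))

print(cosine(A, G))
```

Because term counts are non-negative, cosine similarity here ranges from 0 (no shared terms) to 1 (same direction).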

Some words give more information than others. Does the fact that two documents both contain the word “the” tell us anything? How about “and”? Stop words (noise words): words that are probably not useful for processing, filtered out before natural language processing is applied. Other words can be more or less informative. There is no definitive list, but it might include words like “the” and “and”.
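A sketch of stop-word filtering (the stop list here is a small illustrative sample, not a definitive one):

```python
# Filter stop words before building document vectors.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in"}

def remove_stop_words(doc):
    # Drop noise words; keep the potentially informative ones.
    return [w for w in doc.lower().split() if w not in STOP_WORDS]

print(remove_stop_words("The quick brown fox jumped over the lazy dog"))
# -> ['quick', 'brown', 'fox', 'jumped', 'over', 'lazy', 'dog']
```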

Classifying Documents

Slide from Min-Yen Kan Here the vector space is illustrated as having 2 dimensions. How many dimensions would the data actually live in?

Slide from Min-Yen Kan Query document – which class should you label it with?

Classification by Nearest Neighbor Classify the test document as the class of the document “nearest” to the query document (use vector similarity to find most similar doc) Slide from Min-Yen Kan

Classification by kNN Classify the test document as the majority class of the k documents “nearest” to the query document. Slide from Min-Yen Kan
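A sketch of kNN classification with cosine similarity (the training vectors, labels, and query below are hypothetical; with k = 1 this reduces to the nearest-neighbor rule from the previous slide):

```python
from collections import Counter
import math

# Hypothetical training set: (term-count vector, class label).
train = [
    ([10, 5, 3, 0], "astronomy"),
    ([8, 4, 0, 1], "astronomy"),
    ([0, 1, 9, 7], "film"),
    ([1, 0, 8, 9], "film"),
]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def knn_classify(query, train, k=3):
    # Rank training documents by similarity to the query document...
    ranked = sorted(train, key=lambda item: cosine(query, item[0]), reverse=True)
    # ...then take a majority vote among the k nearest.
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

print(knn_classify([9, 4, 2, 0], train, k=3))  # 2 of the 3 nearest are "astronomy"
```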

Classification by kNN: What are the features? What’s the training data? Testing data? Parameters?