Machine Learning and Information Retrieval Zoubin Ghahramani Department of Engineering University of Cambridge.

Slides:



Advertisements
Similar presentations
Support.ebsco.com Nursing Reference Center Tutorial.
Advertisements

Recommender Systems & Collaborative Filtering
An Introduction to Data Mining
Pseudo-Relevance Feedback For Multimedia Retrieval By Rong Yan, Alexander G. and Rong Jin Mwangi S. Kariuki
Information Retrieval in Practice
A PowerPoint Presentation
INFO624 - Week 2 Models of Information Retrieval Dr. Xia Lin Associate Professor College of Information Science and Technology Drexel University.
Personalized Query Classification Bin Cao, Qiang Yang, Derek Hao Hu, et al. Computer Science and Engineering Hong Kong UST.
Bringing Order to the Web: Automatically Categorizing Search Results Hao Chen SIMS, UC Berkeley Susan Dumais Adaptive Systems & Interactions Microsoft.
Prof. Carolina Ruiz Computer Science Department Bioinformatics and Computational Biology Program WPI WELCOME TO BCB4003/CS4803 BCB503/CS583 BIOLOGICAL.
Application of Unstructured Learning in Computational Biology Tony C Smith Department of Computer Science University of Waikato
Active Collaborative Filtering Machine Learning Group Department of Computer Science University of Toronto.
Content Based Image Clustering and Image Retrieval Using Multiple Instance Learning Using Multiple Instance Learning Xin Chen Advisor: Chengcui Zhang Department.
Search Engines and Information Retrieval
April 22, Text Mining: Finding Nuggets in Mountains of Textual Data Jochen Doerre, Peter Gerstl, Roland Seiffert IBM Germany, August 1999 Presenter:
Bayesian Content-Based Image Retrieval research with: Katherine A. Heller based on (Heller and Ghahramani, 2006) part IB, paper 8, Lent.
CHAPTER 6 SECONDARY DATA SOURCES. Important Topics of This Chapter Success of secondary data. To understand how to create an internal database. To distinguish.
Fundamentals of Information Systems, Second Edition 1 Organizing Data and Information Chapter 3.
Information Retrieval in Practice
Recommender systems Ram Akella February 23, 2011 Lecture 6b, i290 & 280I University of California at Berkeley Silicon Valley Center/SC.
Recommender systems Ram Akella November 26 th 2008.
Scalable Text Mining with Sparse Generative Models
Statistical Learning: Pattern Classification, Prediction, and Control Peter Bartlett August 2002, UC Berkeley CIS.
CS157A Spring 05 Data Mining Professor Sin-Min Lee.
Text Mining: Finding Nuggets in Mountains of Textual Data Jochen Dijrre, Peter Gerstl, Roland Seiffert Presented by Huimin Ye.
Text Mining: Finding Nuggets in Mountains of Textual Data Jochen Dijrre, Peter Gerstl, Roland Seiffert Presented by Drew DeHaas.
Overview of Web Data Mining and Applications Part I
CS Machine Learning. What is Machine Learning? Adapt to / learn from data  To optimize a performance function Can be used to:  Extract knowledge.
Utilising software to enhance your research Eamonn Hynes 5 th November, 2012.
9/30/2004TCSS588A Isabelle Bichindaritz1 Introduction to Bioinformatics.
CS598CXZ Course Summary ChengXiang Zhai Department of Computer Science University of Illinois, Urbana-Champaign.
Search Engines and Information Retrieval Chapter 1.
Citation Recommendation 1 Web Technology Laboratory Ferdowsi University of Mashhad.
Adaptive News Access Daniel Billsus Presented by Chirayu Wongchokprasitti.
Chapter 7 Web Content Mining Xxxxxx. Introduction Web-content mining techniques are used to discover useful information from content on the web – textual.
Machine Learning An Introduction. What is Learning?  Herbert Simon: “Learning is any process by which a system improves performance from experience.”
Bayesian Sets Zoubin Ghahramani and Kathertine A. Heller NIPS 2005 Presented by Qi An Mar. 17 th, 2006.
Recommendation system MOPSI project KAROL WAGA
UOS 1 Ontology Based Personalized Search Zhang Tao The University of Seoul.
Data Mining Chapter 1 Introduction -- Basic Data Mining Tasks -- Related Concepts -- Data Mining Techniques.
Lecture 10: 8/6/1435 Machine Learning Lecturer/ Kawther Abas 363CS – Artificial Intelligence.
Google News Personalization: Scalable Online Collaborative Filtering
Presenter: Shanshan Lu 03/04/2010
Chapter 1 Foundations of Information Systems in Business.
1 Machine Learning 1.Where does machine learning fit in computer science? 2.What is machine learning? 3.Where can machine learning be applied? 4.Should.
CS157B Fall 04 Introduction to Data Mining Chapter 22.3 Professor Lee Yu, Jianji (Joseph)
PSEUDO-RELEVANCE FEEDBACK FOR MULTIMEDIA RETRIEVAL Seo Seok Jun.
Recommender Systems Debapriyo Majumdar Information Retrieval – Spring 2015 Indian Statistical Institute Kolkata Credits to Bing Liu (UIC) and Angshul Majumdar.
Data Mining: Knowledge Discovery in Databases Peter van der Putten ALP Group, LIACS Pre-University College LAPP-Top Computer Science February 2005.
Measuring How Good Your Search Engine Is. *. Information System Evaluation l Before 1993 evaluations were done using a few small, well-known corpora of.
1 Information Retrieval LECTURE 1 : Introduction.
Named Entity Recognition in Query Jiafeng Guo 1, Gu Xu 2, Xueqi Cheng 1,Hang Li 2 1 Institute of Computing Technology, CAS, China 2 Microsoft Research.
Foundations of Information Systems in Business
Bringing Order to the Web : Automatically Categorizing Search Results Advisor : Dr. Hsu Graduate : Keng-Wei Chang Author : Hao Chen Susan Dumais.
Chapter 9 : Application Areas. 2 Some Advance Application Areas of Computers  Software Development  Artificial Intelligence  Robotics  Industrial.
Copyright © 2011 Pearson Education, Inc. Publishing as Pearson Addison-Wesley Chapter 28 Data Mining Concepts.
Introduction to Machine Learning Original Version by Prof. Nati Srebro-Bartom Modifications by Nir Ailon Lecture 0: What is Machine Learning? What.
Information Retrieval in Practice
Data Mining, Machine Learning, Data Analysis, etc. scikit-learn
Recommender Systems & Collaborative Filtering
Lecture 1: Introduction and the Boolean Model Information Retrieval
Information Retrieval (in Practice)
Machine Learning With Python Sreejith.S Jaganadh.G.
What is Pattern Recognition?
Batyr Charyyev.
Course Summary ChengXiang “Cheng” Zhai Department of Computer Science
Data Mining, Machine Learning, Data Analysis, etc. scikit-learn
Data Mining, Machine Learning, Data Analysis, etc. scikit-learn
CHAPTER 7: Information Visualization
Information Retrieval and Web Design
Presentation transcript:

Machine Learning and Information Retrieval Zoubin Ghahramani Department of Engineering University of Cambridge

Machine Learning Machine learning is an interdisciplinary field focusing on both the mathematical foundations and practical applications of systems that learn, reason and act. Other related terms: Pattern Recognition, Neural Networks, Data Mining, Adaptive Control, Decision Theory, Statistical Modelling...

Applications of Machine Learning Bioinformatics Robotics Computer vision Modelling Brain Imaging and Neural Data Financial prediction Collaborative filtering Information retrieval…

What is Information Retrieval? finding material from within a large unstructured collection (e.g. the internet) that satisfies the users information need (e.g. expressed via a query). well known examples… …but there are many specialist search systems as well:

Traditional approach to information retrieval user types a text query system returns an ordered list of items

A new approach to retrieval user inputs a small set of items system returns a larger set of items that belong in the concept or category exemplified by the query set a simple example: –query = {Monday, Wednesday} –return = {Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday} (work with Katherine A. Heller, UCL)

Universe of items being searched… Imagine a universe of items: …. The items could be: images, music, documents, websites, publications, proteins, news stories, customer profiles, products, medical records, … or any other type of item one might want to query.

Query set: Result: Query set: Result: Illustrative example

Ranking items Rank each item in the universe by how well it would fit into a set which includes the query set Limit output to the top few items Query: Ranking: BestWorst

Key Technical Step Information retrieval using Bayesian model comparison: : item being scored : query set of items

Key Advantages of Our Approach Novel search paradigm for retrieval –queries are a small set of examples Based on: –principled statistical methods (Bayesian machine learning) –recent psychological research into models of human categorization and generalization Extremely fast –search >100,000 records per second on a laptop computer –uses sparse matrix methods –easy to parallelize and use inverted indices to search billions of records/sec

Some Example Applications literature search: searching scientific literature, patent databases, or news articles by giving a small set of example articles targeted advertising: finding similar customers as represented by their buying patterns biomedical search: searching for sets of similar patients based on medical records drug discovery: searching for similar sets of proteins based on sequence, annotations, structure, literature collaborative filtering: finding similar movies, music, books, based on matching your preferences to other peoples preferences online shopping: searching for products by giving a few examples online dating services / social networks: searching for people based on profiles finance: finding similar companies / stocks based on patterns of transactions

Prototype systems that we have built movie search automatic thesaurus finding similar authors of scientific articles searching academic literature image retrieval protein search

Movie Search 1813 people rating 1532 movies query is a small set of movies system searches for other movies that would fit into this set based on the ratings

Movie Search Example Results Query: –Gone with the wind –Casablanca Result (top hits): –Gone with the wind (1939) –Casablanca (1942) –The African Queen (1951) –The Philadelphia Story (1940) –My Fair Lady (1964) –The Adventures of Robin Hood (1938) –The Maltese Falcon (1941) –Rebecca (1940) –Singing in the Rain (1952) –It Happened One Night (1934)

Movie Search Example Results Query: –Mary Poppins –Toy Story Result (top hits): –Mary Poppins –Toy Story –Winnie the Pooh –Cinderella –The Love Bug –Bedknobs and Broomsticks –Davy Crockett –The Parent Trap –Dumbo –The Sound of Music

Searching Academic Literature Query: –A. Smola –B. Scholkopf Result (top hits): –A. Smola –B. Scholkopf –S. Mika –G. Ratsch –R. Williamson –K. Muller –J. Weston –J. Shawe-Taylor –V. Vapnik –T. Onoda these two researchers published conference papers in the area of support vector machines these are additional researchers who published conference papers in the area of support vector machines

Image Retrieval Results for Query: sunset These are the top 9 images returned. Our system finds images of sunsets using only the color and texture features of these unlabelled images.

Results for Query: sign These are the top 9 images returned. It finds images of signs using only the color and texture features of these unlabelled images.

Results for Query: fireworks These are the top 9 images returned.

Protein Search Proteins are the fundamental building blocks of life; our genes code for proteins Understanding the functions of and relationships between proteins is essential for bioscience, biomedicine, and drug discovery (a multi-billion dollar industry). We have built a protein retrieval system to search UniProt, an annotated database of 200,000+ proteins

Summary We have a new approach to searching based on providing a small set of examples. Our approach is based on sound statistical theory, machine learning methods, and psychological research. Answering queries is extremely fast, the system is very easy to implement, and search can be parallelized easily over a grid of computers. We have built several prototype systems to illustrate the applicability of this approach. There are many other very practical real-world applications.

Appendix

A Prototype Image Retrieval System A system for searching large collections of unlabelled images. You enter a word, e.g. sunset, and it retrieves images that match this label, using only color and texture features of the images A database of 32,000 images –Labelled Images: 10,000 images with about 3-10 text labels per image –Unlabelled Images: 22,000 images –Each image is represented by 240 binary color and texture features, no other information is used A vocabulary of about 2000 keywords Goal: we want to search the unlabelled images using queries which are subsets of the labelled images associated with keywords

The Image Retrieval Prototype System The Algorithm: 1.Input query word: e.g. w=sunset 2.Find all labelled images with label w 3.Use those images as a query set 4.Return the unlabelled images with the highest probability of belonging with the query set The algorithm is very fast: about 0.2 sec on a laptop to query 22,000 test images code can be further optimized and parallelized

Example Labelled Images for sunset These are 9 random images that were labelled sunset in the labelled training data. Notice that these images are quite variable, and the labelling subjective and somewhat noisy. Our retrieval system does very well and is quite robust to ambiguous categories and poor labelling.