Utilising software to enhance your research
Eamonn Hynes
5th November, 2012



Basic statistics and some parallel computing

Basic statistics
- Probability
- Mean
- Standard deviation
Simple examples:
- Probability of just one six from three throws of a die?
- Probability of winning the Lotto?
Tougher problems:
- Transcribing speech into words
- Poker robot that plays optimally
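The two simple examples can be checked exactly. The sketch below treats the die question as a binomial probability (exactly one six in three throws). The Lotto format, 6 numbers drawn from 45, is an assumption, since the slide does not say which lottery is meant.

```python
from fractions import Fraction
from math import comb

def exactly_k_successes(n, k, p):
    """Binomial probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Exactly one six in three throws of a fair die.
one_six = exactly_k_successes(3, 1, Fraction(1, 6))

# Assumed lottery format (hypothetical): match all 6 numbers drawn from 45.
lotto_jackpot = Fraction(1, comb(45, 6))

print(one_six, float(one_six))
print(lotto_jackpot)
```

Using `Fraction` keeps both answers exact, which makes it easier to compare them with a hand calculation.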

Hands on
- Mean of column 1?
- Mean of row 4?
- Standard deviation of column 3?
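The exercise data is not part of the transcript, so the sketch below uses a small made-up table to show how the three quantities might be computed with Python's standard library. The table values and the helper `column` are illustrative assumptions. `stdev` gives the sample standard deviation; `pstdev` would be the population version.

```python
from statistics import mean, stdev

# Hypothetical 4x3 table; the real exercise data is not in the transcript.
table = [
    [2.0, 4.0, 1.0],
    [3.0, 5.0, 3.0],
    [4.0, 6.0, 5.0],
    [5.0, 7.0, 7.0],
]

def column(rows, index):
    """Return column `index` (1-based, as on the slide) as a list."""
    return [row[index - 1] for row in rows]

mean_col1 = mean(column(table, 1))   # mean of column 1
mean_row4 = mean(table[3])           # mean of row 4 (Python index 3)
std_col3 = stdev(column(table, 3))   # sample standard deviation of column 3

print(mean_col1, mean_row4, std_col3)
```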

Standard deviation 13.6%

A billion numbers?
[Diagram labels: single-core, multi-core (eight cores), memory]
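One way to picture the multi-core approach is to split the numbers into chunks, sum each chunk on its own worker, and then add the partial sums. The sketch below is illustrative only: the function names are made up, and it uses a thread pool to stay short and portable. In standard CPython, threads do not run CPU-bound code in parallel, so a process pool (`ProcessPoolExecutor`) would be the usual swap for real speedups.

```python
from concurrent.futures import ThreadPoolExecutor

def chunked(values, n_chunks):
    """Split `values` into at most n_chunks contiguous slices of near-equal size."""
    size = max(1, -(-len(values) // n_chunks))  # ceiling division, at least 1
    return [values[i:i + size] for i in range(0, len(values), size)]

def parallel_sum(values, workers=8):
    """Sum each chunk on its own worker, then combine the partial sums."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_sums = pool.map(sum, chunked(values, workers))
        return sum(partial_sums)

numbers = list(range(1_000_000))  # a million, not a billion, to keep the demo quick
total = parallel_sum(numbers)
print(total)
```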

More interesting example
- Again, a large sequence of numbers
- Speech signal
- ~56 different sounds
- Task is to calculate the most likely sequence of words
- Over 50 years of research

Moore’s Law

Demise of Moore’s Law Reality

Moore’s Law
The solution:
- Parallel architectures
- Hybrid architectures
- New software (harder to write)
- New programming paradigms
- Dedicated hardware
- Beyond silicon

Amdahl’s Law
Limitations on parallel code:
- Thankfully, a large number of problems are parallel in nature (rendering 3D graphics, weather prediction, image processing, DNA matching)
- But many problems are sequential in nature, e.g. a card game, a legal process, ordering a laptop
- For the sequential part, there is nothing we can do except increase the clock rate!
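Amdahl's Law is usually written as speedup = 1 / ((1 - p) + p / n), where p is the fraction of the work that can run in parallel and n is the number of processors. The function below is a small sketch of that formula. The 90% parallel fraction in the loop is an arbitrary example value, not a figure from the talk.

```python
def amdahl_speedup(parallel_fraction, n_processors):
    """Upper bound on speedup when only part of a program can run in parallel."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_processors)

# Example: 90% of the work is parallelisable (an assumed figure).
for cores in (2, 8, 64, 1024):
    print(cores, round(amdahl_speedup(0.9, cores), 2))
```

However many cores are added, the speedup in this example stays below 10, because the 10% sequential part still has to run on one core.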

Clustering

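- Categorise data into groups
- Important in many fields: speech, medical statistics, data mining, etc.
Very loose algorithm (labelled k-means clustering on the slide; the merge steps listed here are closer to bottom-up agglomerative clustering, whereas standard k-means starts from k centroids and alternates assignment and update steps):
- Let each point be a cluster centroid
- Pick a random point
- Get the point closest to this chosen point
- Merge the two and calculate their centroid
- Repeat until just k centroids remain
Big limitation: k must be specified in advance…
Example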
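For reference, here is a minimal sketch of standard k-means (Lloyd's algorithm) in plain Python. It is not the merge-based procedure listed on the slide: it starts from k initial centroids, assigns each point to its nearest centroid, and recomputes the centroids until they stop moving. The initial centroids are passed in explicitly, an assumption made here to keep the example deterministic. Real implementations usually pick them at random or with a seeding heuristic.

```python
import math

def kmeans(points, initial_centroids, max_iter=100):
    """Lloyd's algorithm: alternate an assignment step and an update step."""
    centroids = [tuple(c) for c in initial_centroids]
    clusters = [[] for _ in centroids]
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for old, members in zip(centroids, clusters):
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                new_centroids.append(old)  # leave an empty cluster's centroid in place
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return centroids, clusters

# Hypothetical 2D points forming two obvious groups.
points = [(0, 0), (0, 2), (10, 10), (10, 12)]
centroids, clusters = kmeans(points, initial_centroids=[(0, 0), (10, 10)])
print(centroids)
```

As the slide notes, k has to be chosen up front; here it is fixed by the number of initial centroids.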

Clustering
- Not just for points on a 2D surface
- Pixels of an image
Example

Support Vector Machines
Support vector machines (SVMs):
- Popular in the 1990s/2000s (Vapnik et al. 1992)
- Non-linear classification
- Beautiful maths
Find a nonlinear boundary between k sets of points
Example

Text analysis

Searching documents task
Naïve search:
- SQL query: “SELECT * FROM articles WHERE body LIKE '%$keyword%';”
- Works fine for small document collections
Large databases:
- Better to index all documents
- tf-idf

Text analysis
- Process each document
- Calculate the frequency of each word
- Store the index, not the entire document
- Much faster document retrieval
- Intuitive to pick the document with the highest term count
- But each term must be weighted by its inverse document frequency

Text analysis
Example: Simple Boolean logic
- Searching for “rose”
- If the word appears, then the document is relevant

Text analysis Taking term frequencies into account
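The indexing idea can be sketched as an inverted index that maps each word to the documents containing it, along with a count. Boolean search then becomes a lookup, and ranking by term frequency is a sort on the stored counts. The three documents and all function names below are made up for illustration.

```python
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase a string and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def build_index(documents):
    """Map each word to {doc_id: count} so lookups avoid rescanning the text."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for word, count in Counter(tokenize(text)).items():
            index[word][doc_id] = count
    return index

docs = {  # hypothetical three-document collection
    "d1": "A rose is a rose is a rose",
    "d2": "The rose garden opened in spring",
    "d3": "Parallel computing on eight cores",
}
index = build_index(docs)

# Boolean search: a document is relevant if the word appears at all.
boolean_hits = sorted(index.get("rose", {}))

# Term-frequency ranking: documents with more occurrences come first.
by_count = sorted(index.get("rose", {}).items(), key=lambda kv: -kv[1])

print(boolean_hits)
print(by_count)
```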

Text analysis
TFIDF = TF * IDF, where:
- TF = C / T, where C = number of times a given word appears in a document and T = total number of words in that document
- IDF = D / DF, where D = total number of documents in the corpus and DF = number of documents containing the given word (in practice the logarithm of this ratio, log(D / DF), is commonly used)
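A minimal sketch of these definitions, using the slide's symbols (C, T, D, DF). By default `idf` returns the plain ratio D / DF as defined above; the `use_log` flag is an addition here that switches to the commonly used logarithmic form. The tiny corpus is hypothetical and already split into words.

```python
import math

def tf(word, doc_words):
    """TF = C / T: occurrences of `word` over total words in the document."""
    return doc_words.count(word) / len(doc_words)

def idf(word, corpus, use_log=False):
    """IDF = D / DF; with use_log=True, return log(D / DF) instead."""
    df = sum(1 for doc_words in corpus if word in doc_words)
    if df == 0:
        return 0.0  # word absent from the corpus: treat its weight as zero
    ratio = len(corpus) / df
    return math.log(ratio) if use_log else ratio

def tfidf(word, doc_words, corpus, use_log=False):
    """TFIDF = TF * IDF for one word in one document."""
    return tf(word, doc_words) * idf(word, corpus, use_log)

corpus = [  # hypothetical, already tokenised
    ["a", "rose", "is", "a", "rose"],
    ["the", "garden", "is", "open"],
    ["eight", "cores", "share", "memory"],
]

print(tfidf("rose", corpus[0], corpus))  # rare word: higher weight
print(tfidf("is", corpus[0], corpus))    # word shared by two documents: lower weight
```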

Text analysis Natural language follows a Zipfian distribution
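Zipf's law is usually stated as a power law relating a word's frequency to its rank. In the sketch below, $f(r)$ is the frequency of the $r$-th most common word and $s$ is the exponent, which is close to 1 for many natural-language corpora. Both symbols are introduced here and do not come from the slides.

```latex
f(r) \propto \frac{1}{r^{s}}, \qquad s \approx 1
```

With $s = 1$, the second most common word appears about half as often as the first, and the third about a third as often.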

Finally

Deep belief networks
- Given a document, how to find similar documents?
- Deep belief networks (DBNs)
- State-of-the-art in machine learning
- More advanced than Latent Semantic Analysis (LSA), Principal Component Analysis (PCA) and clustering

Deep belief networks
- 2000 most common word stems fed into base layer
- Gradual reduction in number of neurons
- Left with a 30-digit binary representation (30 bits) of a document that started as a 2000-dimensional feature vector
- Super fast document retrieval (“semantic hashing”)
Images from G. Hinton, Science (2006)
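Once each document is reduced to a short binary code, finding similar documents can be as simple as comparing bits. The sketch below shows only that retrieval step, using Hamming distance over integer codes. The codes are made up for illustration; in semantic hashing they would come from the trained network. The function names and the `max_distance` threshold are likewise assumptions.

```python
def hamming(a, b):
    """Number of bit positions in which two integer codes differ."""
    return bin(a ^ b).count("1")

def nearest_documents(query_code, codes, max_distance=2):
    """Return (doc_id, distance) pairs within max_distance bits of the query."""
    hits = [(doc_id, hamming(query_code, code)) for doc_id, code in codes.items()]
    return sorted((h for h in hits if h[1] <= max_distance), key=lambda h: h[1])

# Hypothetical 30-bit codes; a trained network would produce the real ones.
codes = {
    "doc_a": int("10" * 15, 2),
    "doc_b": int("10" * 14 + "11", 2),  # differs from doc_a in one bit
    "doc_c": int("01" * 15, 2),         # differs from doc_a in all 30 bits
}

print(nearest_documents(codes["doc_a"], codes))
```

This version scans every code. The appeal of short binary codes is that they can also be used directly as memory addresses, so that nearby codes can be looked up without a scan.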