Correlation of Term Count and Document Frequency for Google N-Grams

Presentation transcript:

Correlation of Term Count and Document Frequency for Google N-Grams
Martin Klein and Michael L. Nelson
Old Dominion University
{mklein,mln}@cs.odu.edu
ECIR 2009, Toulouse, France, 04/08/2009

Background & Motivation
Term frequency (TF) – inverse document frequency (IDF) is a well-known term weighting concept
Used (among others) to generate lexical signatures (LSs)
TF is not hard to compute, but IDF is, since it depends on global knowledge about the corpus
→ When the entire web is the corpus, IDF can only be estimated!
Most text corpora provide term count (TC) values
D1 = “Please, Please Me”
D2 = “Can’t Buy Me Love”
D3 = “All You Need Is Love”
D4 = “Long, Long, Long”
Term  All  Buy  Can’t  Is  Love  Me  Need  Please  You  Long
TC    1    1    1      1   2     2   1     2       1    3
DF    1    1    1      1   2     2   1     1       1    1
TC >= DF, but is there a correlation? Can we use TC to estimate DF?
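The TC and DF values for the four example documents can be recomputed with a short script. This is a minimal sketch, not the authors' tooling: the tokenizer (lowercasing, splitting on anything other than letters and apostrophes) and the names `docs`, `tokenize`, `tc`, and `df` are assumptions made for illustration.

```python
import re
from collections import Counter

# The four example documents from the slide.
docs = {
    "D1": "Please, Please Me",
    "D2": "Can't Buy Me Love",
    "D3": "All You Need Is Love",
    "D4": "Long, Long, Long",
}

def tokenize(text):
    # Assumed tokenizer: lowercase, keep runs of letters and apostrophes.
    return re.findall(r"[a-z']+", text.lower())

tc = Counter()  # term count: total occurrences across the corpus
df = Counter()  # document frequency: number of documents containing the term

for text in docs.values():
    tokens = tokenize(text)
    tc.update(tokens)       # every occurrence counts toward TC
    df.update(set(tokens))  # each document counts at most once toward DF

for term in sorted(tc):
    print(f"{term:8s} TC={tc[term]} DF={df[term]}")
```

TC and DF differ only for terms that repeat within a single document ("please" and "long" here), which is why TC >= DF always holds.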

Experimental Setup & Results
Investigate correlation between TC and DF within “Web as Corpus” (WaC)
[Figure: Rank similarity of all terms]

Experimental Setup & Results
Investigate correlation between TC and DF within “Web as Corpus” (WaC)
[Figure: Spearman’s ρ and Kendall’s τ]
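Both rank correlation measures named on this slide can be computed from paired TC and DF values. The following is a dependency-free sketch: the sample counts are hypothetical (not data from the talk), Spearman's ρ is computed as the Pearson correlation of average ranks, and Kendall's τ uses the tau-b variant, which is one common way to handle ties and an assumption about what the authors used.

```python
from itertools import combinations
from math import sqrt

def average_ranks(values):
    """Rank values in descending order; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j (0-based) share this 1-based rank
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    # Plain Pearson correlation; undefined (division by zero) if either input is constant.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman_rho(x, y):
    # Spearman's rho is the Pearson correlation of the rank vectors.
    return pearson(average_ranks(x), average_ranks(y))

def kendall_tau_b(x, y):
    concordant = discordant = only_x_tied = only_y_tied = 0
    for i, j in combinations(range(len(x)), 2):
        dx, dy = x[i] - x[j], y[i] - y[j]
        if dx == 0 and dy == 0:
            continue  # tied in both lists: counted in neither term
        if dx == 0:
            only_x_tied += 1
        elif dy == 0:
            only_y_tied += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    pairs = concordant + discordant
    return (concordant - discordant) / sqrt((pairs + only_x_tied) * (pairs + only_y_tied))

# Hypothetical TC and DF values for five terms (illustration only).
tc_values = [120, 90, 60, 30, 10]
df_values = [40, 35, 12, 15, 5]

rho = spearman_rho(tc_values, df_values)
tau = kendall_tau_b(tc_values, df_values)
print(f"Spearman rho = {rho:.2f}, Kendall tau-b = {tau:.2f}")
```

In the sample data only the third and fourth terms swap places between the two rankings, so both coefficients stay close to 1.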

Experimental Setup & Results
Show similarity between WaC-based TC and Google N-Gram-based TC
[Figure: TC frequencies]

Experimental Setup & Results
Top 10 terms in decreasing order of their TF-IDF values
Rank WaC-DF WaC-TC Google N-Grams
1 IR
2 RETRIEVAL IRSG
3
4 BCS IRIT CONFERENCE
5 EUROPEAN
6 2009 GRANT
7 GOOGLE FILTERING
8
9 ACM
10 ARIA PAPERS
Google: screen scraping DF (?) values from the Google web interface
∪ = 14, ∩ = 6
Strong indicator that TC can be used to estimate DF for web pages!
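The comparison on this slide, ranking a page's terms by TF-IDF once with DF and once with TC substituted for DF and then measuring how much the top-k lists overlap, can be sketched as follows. Everything here is an illustrative assumption rather than the authors' setup: the IDF form log(N / count), the use of the same normalizing constant `n` for both count sources, the function names, and all the numbers in the toy data.

```python
from math import log

def top_k_by_tfidf(tf, counts, n, k=10):
    """Rank terms by tf * log(n / count); `counts` may hold DF or TC values."""
    scores = {
        term: freq * log(n / counts[term])
        for term, freq in tf.items()
        if counts.get(term, 0) > 0  # skip terms the count source does not know
    }
    # Sort by descending score, breaking ties alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]

def overlap(*term_lists):
    """Return (size of union, size of intersection) of several term lists."""
    sets = [set(terms) for terms in term_lists]
    return len(set.union(*sets)), len(set.intersection(*sets))

# Hypothetical data: term frequencies in one page, plus corpus-level counts.
tf = {"retrieval": 5, "conference": 3, "paper": 2, "the": 20}
df_counts = {"retrieval": 100, "conference": 1_000, "paper": 5_000, "the": 9_000_000}
tc_counts = {"retrieval": 400, "conference": 3_000, "paper": 20_000, "the": 9_500_000}
n = 10_000_000  # assumed normalizing constant, kept >= every count in the toy data

top_df = top_k_by_tfidf(tf, df_counts, n, k=3)
top_tc = top_k_by_tfidf(tf, tc_counts, n, k=3)
union_size, intersection_size = overlap(top_df, top_tc)
print(top_df, top_tc, union_size, intersection_size)
```

A small union together with a large intersection means the count sources produce nearly the same top terms, which is the sense in which the slide reports ∪ = 14 and ∩ = 6.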

Thank You & Come See My Poster!!!
Questions