A Repetition Based Measure for Verification of Text Collections and for Text Categorization. Dmitry V. Khmelev, Department of Mathematics, University of Toronto.


A Repetition Based Measure for Verification of Text Collections and for Text Categorization. Dmitry V. Khmelev, Department of Mathematics, University of Toronto; William J. Teahan, School of Informatics, University of Wales, Bangor.

2 Abstract We suggest a way of locating duplicates and plagiarisms in a text collection using an R-measure: the normalized sum of the lengths of all suffixes of a text that are repeated in other documents of the collection. We applied the technique to several standard text collections and found that they contain a significant number of duplicate and plagiarized documents.

3 Abstract A reformulation of the method leads to an algorithm that can be applied to supervised multi-class categorization. Using Reuters Corpus Volume 1 (RCV1), the results show that the method outperforms SVM at multi-class categorization.

4 1. Motivation Text collections are used intensively in scientific research for many purposes, such as text categorization, text mining, natural language processing, information retrieval and so on. Every creator of a text collection is faced at some stage with the task of verifying its contents, e.g. are there duplicate documents? The R-measure is defined as a number between 0 and 1 that characterizes this "repeatedness".

5 1. Motivation The R-measure can be computed efficiently using the suffix array data structure. The computation procedure can be improved to locate the sets of duplicate or plagiarized documents, and to identify "non-typical" documents, such as those in a foreign language. Another reformulation leads to an algorithm that can be applied to supervised classification. The suggested techniques are character-based and do not require a priori knowledge about the representation of the documents.

6 2. R-measure Suppose the collection consists of m documents, each document being a string T_i = T_i[1…|T_i|]. The squared R-measure of a document T is defined as R^2(T|T_1,…,T_m) = ( Σ_{i=1…|T|} Q(T[i…|T|] | T_1,…,T_m) ) / ( |T|(|T|+1)/2 ), where Q(S|T_1,…,T_m) is the length of the longest prefix of S repeated in one of the documents T_1,…,T_m.

7 2. R-measure Example: T = "catΔsatΔon", T_1 = "catΔsat", T_2 = "theΔcatΔonΔaΔmat". Then R^2(T|T_1,T_2) = (7+6+5+4+3)/55 from the suffixes matching inside "catΔsat", plus (5+4+3+2+1)/55 from those matching "atΔon", i.e. 40/55 ≈ 0.73. An alternative L-measure uses only the longest repeated substring: L(T|T_1,…,T_m) = max_{1≤i≤|T|} Q(T[i…|T|] | T_1,…,T_m) / |T|, so L(T|T_1,T_2) = 7/10.

8 2. R-measure The R-measure seems a more "intuitive" measure than the L-measure, since substrings other than "catΔsat" are also repeated. It can be computed efficiently using a suffix array, a full-text indexing structure (let S_C = T_0$T_1$…T_m$ and construct a suffix array for S_C). The time complexity is O(|S_C|) and the space complexity is O(|S_C|) + O(m) + O(M), where M = max_{j=0,…,m} |T_j|.
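The definitions above can be checked with a direct, naive implementation (quadratic time, unlike the linear suffix-array algorithm quoted on the slide). This is our own sketch, not the paper's code; the function names are ours, and '_' stands for the Δ delimiter:

```python
def q_len(s, docs):
    """Q(S | T_1..T_m): length of the longest prefix of s that
    occurs as a substring of some document in docs."""
    k = 0
    while k < len(s) and any(s[:k + 1] in d for d in docs):
        k += 1
    return k

def r_squared(t, docs):
    """Squared R-measure: sum of Q over every suffix of t,
    normalized by the maximum possible sum |t|(|t|+1)/2."""
    total = sum(q_len(t[i:], docs) for i in range(len(t)))
    return total / (len(t) * (len(t) + 1) / 2)

def l_measure(t, docs):
    """L-measure: longest substring of t repeated in the
    collection, normalized by |t|."""
    return max(q_len(t[i:], docs) for i in range(len(t))) / len(t)

# The slide's example:
T, T1, T2 = "cat_sat_on", "cat_sat", "the_cat_on_a_mat"
print(r_squared(T, [T1, T2]))  # 40/55 ≈ 0.727
print(l_measure(T, [T1, T2]))  # 7/10 = 0.7
```

Running this reproduces the worked example: the ten suffix matches contribute 7+6+5+4+3 and 5+4+3+2+1, giving R^2 = 40/55.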

9 3. Applications of R-measure Locating the duplicate sets; supervised classification; identifying foreign and/or non-typical documents.

10 3. Applications of R-measure Supervised classification. There exist two distinct types of classification: topic categorization and multi-class categorization (binary classifier vs. multi-class classifier). To select the correct class for a document T among m classes represented by texts S_1, …, S_m, the source is guessed using the following estimate: θ(T) = argmax_i R(T|S_i).

11 3. Applications of R-measure Identifying foreign and/or non-typical documents. Non-typical documents can be located simply by examining the documents with the lowest R-measures. When there is a predominant language associated with the collection as a whole, documents in a different language can be identified by constructing a sample text for each language and then proceeding as in multi-class categorization.
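Both applications reduce to the same argmax rule θ(T) = argmax_i R(T|S_i). A minimal sketch, reusing a naive R-measure; the toy language samples below are made up purely for illustration and are far smaller than the per-language samples the paper used:

```python
def q_len(s, docs):
    # longest prefix of s occurring as a substring of some doc
    k = 0
    while k < len(s) and any(s[:k + 1] in d for d in docs):
        k += 1
    return k

def r_measure(t, docs):
    # square root of the normalized sum of repeated-suffix lengths
    total = sum(q_len(t[i:], docs) for i in range(len(t)))
    return (total / (len(t) * (len(t) + 1) / 2)) ** 0.5

def classify(t, class_texts):
    """theta(T) = argmax_i R(T | S_i): assign t to the class whose
    sample text repeats the most of t."""
    return max(class_texts, key=lambda label: r_measure(t, [class_texts[label]]))

# Illustrative language identification with toy class samples:
samples = {
    "en": "the cat sat on the mat and the dog ran away",
    "fr": "le chat est sur le tapis et le chien est parti",
}
print(classify("the dog sat on the mat", samples))  # en
```

The English sample shares long substrings with the query ("the dog", " sat on the mat"), while the French sample shares only short character runs, so the argmax picks "en".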

12 4. Experiments and Results Analysis of various text collections; multi-class categorization.

13 Analysis of Various Text Collections Reuters-21578: contains 579 (2.7%) duplicate documents with R = 1.0. The collection is partitioned into a training/testing split called ModApte. Two pairs of duplicates are shared between the training and testing splits (one pair is 12495 and 18011; the other involves 14913).

14 Analysis of Various Text Collections 20Newsgroups: the original 20news collection contains many duplicated messages. A later version was derived from it with the purpose of removing duplicates, yet 6 repeated documents remain (indistinguishable to classifiers that rely on word-based feature extraction). Two documents differ only by an extra new-line character and are assigned two different classes.

15 Analysis of Various Text Collections Russian-416: comprises 416 texts from 102 Russian writers of the 19th and 20th centuries. Only two documents have R ≥ 0.1; all other books fall below that. The overall level of R-measure values is much lower than in other collections, since the average document length is much larger.

16 Analysis of Various Text Collections Reuters Corpus Version 1: a significant proportion of the articles in RCV1 are duplicated (3.4%, or 27,754 articles) or extensively plagiarized (7.9% with R ≥ 0.5). Checking the percentage of matching fields (e.g. topics, headlines and dates) in duplicated documents gives: headlines 56.9% matched, dates 78.1%, countries 86.8%, industries 80.1%, topics 52.3%. 40% of the 50 lowest-scoring articles consist almost entirely of names and numbers (non-typical documents).

17 Analysis of Various Text Collections Reuters Corpus Version 1, identifying foreign-language documents: several class models were constructed from a small sampling of English, French, German, Dutch and Belgian text obtained from a popular search engine. This found 410 French articles, 6 Dutch, 5 Belgian and 1 German article, with 100% precision and an estimated 98% recall.

18 Multi-class Categorization Authorship attribution using RCV1: an important application for IR, with benefits such as user modeling, determining context, efficiently partitioning the collection for distributed retrieval, and so on. An experimental collection was formed from 1813 articles of the top 50 authors with respect to total size of articles. A 10-fold split was made on subsets with the conditions: R < 0.25 (873 documents), R < 0.5 (1161), R < 0.75 (1255), R < 1.0 (1316), R ≤ 1.0 (1813).

19 Multi-class Categorization Table 2: How well the R-measure performs at determining the top 50 authors in RCV1 compared to SVM and compression-based methods. Results are the percentage of correct guesses at first rank.

20 5. Conclusion and Discussion In this paper, we have highlighted the need for verifying a text collection, that is, ensuring the collection is both valid and consistent. The R-measure is suggested for collection verification, and it can be computed efficiently using a suffix array. The implication for text categorization research is that splitting a collection into training and testing sets requires a more careful approach than random selection.

21 5. Conclusion and Discussion There exists a class of problems more suitable for the R-measure and PPM approaches than for SVM, such as the classification of texts coming from a single source, e.g. papers written by a single author or in a single language. In cases where the classification depends on the presence of one or two words (as for Reuters-21578), SVM would be the preferred method.