A Repetition Based Measure for Verification of Text Collections and for Text Categorization Dmitry V. Khmelev Department of Mathematics, University of Toronto.

A Repetition Based Measure for Verification of Text Collections and for Text Categorization. Dmitry V. Khmelev, Department of Mathematics, University of Toronto. William J. Teahan, School of Informatics, University of Wales, Bangor.

2 Abstract We suggest a way of locating duplicates and plagiarism in a text collection using an R-measure. The R-measure is the normalized sum of the lengths of all suffixes of the text repeated in other documents of the collection. We applied the technique to several standard text collections and found that they contained a significant number of duplicate and plagiarized documents.

3 Abstract A reformulation of the method leads to an algorithm that can be applied to supervised multi-class categorization. Using Reuters Corpus Volume 1 (RCV1), the results show that the method outperforms SVM at multi-class categorization.

4 1. Motivation Text collections are used intensively in scientific research for many purposes, such as text categorization, text mining, natural language processing and information retrieval. Every creator of a text collection is faced at some stage with the task of verifying its contents, e.g. are there duplicate documents? The R-measure is defined as a number between 0 and 1 that characterizes the “repeatedness” of a document.

5 1. Motivation The R-measure can be computed efficiently using the suffix array data structure. The computation procedure can be extended to locate sets of duplicate or plagiarized documents, and to identify “non-typical” documents, such as those in a foreign language. Another reformulation leads to an algorithm that can be applied to supervised classification. The suggested techniques are character-based and do not require a priori knowledge about the representation of the documents.

6 2. R-measure Suppose the collection consists of $m$ documents, each document being a string $T_i = T_i[1 \ldots |T_i|]$. The squared R-measure $R^2$ of a document $T$ of length $l = |T|$ is defined as: $R^2(T \mid T_1, \ldots, T_m) = \frac{2}{l(l+1)} \sum_{i=1}^{l} Q(T[i \ldots l] \mid T_1, \ldots, T_m)$, where $Q(S \mid T_1, \ldots, T_m)$ is the length of the longest prefix of $S$ repeated in one of the documents $T_1, \ldots, T_m$.

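7 2. R-measure Example: $T$ = “catΔsatΔon”, $T_1$ = “catΔsat”, $T_2$ = “theΔcatΔonΔaΔmat”, where Δ stands for the space character, so $l = |T| = 10$. Then $R^2(T \mid T_1, T_2) = \frac{2}{10 \cdot 11} \big( (7+6+5+4+3) + (5+4+3+2+1) \big) = \frac{40}{55} \approx 0.727$, where the terms $7+6+5+4+3$ come from “catΔsat” and the terms $5+4+3+2+1$ come from “atΔon”. Alternative L-measure, based only on the longest repeated substring: $L(T \mid T_1, \ldots, T_m) = \frac{1}{l} \max_{i=1,\ldots,l} Q(T[i \ldots l] \mid T_1, \ldots, T_m)$, so that $L(T \mid T_1, T_2) = 7/10 = 0.7$.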
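The worked example can be checked with a few lines of Python. This is a minimal brute-force sketch, not the suffix-array procedure the slides refer to: the function names `longest_repeated_prefix` and `r_squared` are hypothetical, Δ is written as an ordinary space, and the normalization assumes that the largest possible sum of suffix matches for a text of length l is l(l+1)/2.

```python
def longest_repeated_prefix(s, docs):
    """Q(S | T_1..T_m): length of the longest prefix of s found in any doc."""
    best = 0
    for doc in docs:
        # If a prefix occurs in doc, so do all shorter prefixes, so it is
        # enough to keep extending from the best length found so far.
        while best < len(s) and s[:best + 1] in doc:
            best += 1
    return best


def r_squared(text, docs):
    """Normalized sum of Q over all suffixes of text (brute force)."""
    l = len(text)
    if l == 0:
        return 0.0
    total = sum(longest_repeated_prefix(text[i:], docs) for i in range(l))
    return 2.0 * total / (l * (l + 1))


if __name__ == "__main__":
    T = "cat sat on"
    collection = ["cat sat", "the cat on a mat"]
    # Per-suffix match lengths: 7, 6, 5, 4, 3 and then 5, 4, 3, 2, 1.
    print([longest_repeated_prefix(T[i:], collection) for i in range(len(T))])
    print(r_squared(T, collection))
```

The per-suffix list makes the two groups of terms in the example visible: the first five suffixes match inside “cat sat” and the last five inside “at on”.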

8 2. R-measure The R-measure seems a more “intuitive” measure, since substrings other than “catΔsat” are also repeated. The R-measure can be computed efficiently using a suffix array, a full-text indexing structure: let $S_C = T_0 \$ T_1 \$ \ldots T_m \$$ and construct a suffix array for $S_C$. The time complexity is $O(|S_C|)$ and the space complexity is $O(|S_C|) + O(m) + O(M)$, where $M = \max_{j=0,\ldots,m} |T_j|$.
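To illustrate how a suffix array answers the "longest repeated prefix" query, here is a small Python sketch under stated assumptions. It is not the linear-time procedure the slide describes: the suffix array is built by plain sorting (far slower than O(|S_C|)), only the reference documents are indexed, and the separator `"\x00"` is assumed not to occur in any text. The names `index_collection`, `q_via_suffix_array` and `r_squared_sa` are hypothetical.

```python
def index_collection(docs, sep="\x00"):
    """Concatenate docs with a separator and build a naive suffix array."""
    text = sep.join(docs) + sep
    # Naive construction by sorting suffixes; fine for a toy example only.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    return text, sa


def _common_prefix_len(s, text, start):
    n = 0
    while n < len(s) and start + n < len(text) and s[n] == text[start + n]:
        n += 1
    return n


def q_via_suffix_array(s, text, sa):
    """Length of the longest prefix of s occurring somewhere in text."""
    # Binary search for the first suffix that is >= s in sorted order.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:] < s:
            lo = mid + 1
        else:
            hi = mid
    # The best match is always one of the two neighbours of that position.
    best = 0
    for k in (lo - 1, lo):
        if 0 <= k < len(sa):
            best = max(best, _common_prefix_len(s, text, sa[k]))
    return best


def r_squared_sa(t, text, sa):
    """Squared R-measure of t against the indexed collection."""
    l = len(t)
    if l == 0:
        return 0.0
    total = sum(q_via_suffix_array(t[i:], text, sa) for i in range(l))
    return 2.0 * total / (l * (l + 1))


if __name__ == "__main__":
    text, sa = index_collection(["cat sat", "the cat on a mat"])
    print(q_via_suffix_array("cat sat on", text, sa))
    print(r_squared_sa("cat sat on", text, sa))
```

Because the query text never contains the separator, a match cannot run across a document boundary, which is the role the `$` symbols play in $S_C$.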

9 3. Applications of R-measure Locating the duplicate sets. Supervised classification. Identifying foreign and/or non-typical documents.

10 3. Applications of R-measure Supervised Classification. There exist two distinct types of classification: topic categorization and multi-class categorization (binary classifier, multi-class classifier). The R-measure can be used as follows: to select the correct class for the document $T$ among $m$ classes represented by texts $S_1, \ldots, S_m$, the source is guessed using the estimate $\theta(T) = \arg\max_i R(T \mid S_i)$.
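The argmax rule can be sketched directly in Python. This is an illustrative brute-force version under assumptions: R is taken as the square root of the squared R-measure with normalization 2/(l(l+1)), each class is represented by a single concatenated string, and the names `r_measure` and `classify` as well as the toy class texts are invented for the example.

```python
from math import sqrt


def _q(s, ref):
    """Length of the longest prefix of s that occurs in the string ref."""
    n = 0
    while n < len(s) and s[:n + 1] in ref:
        n += 1
    return n


def r_measure(text, ref):
    """R(T | S): square root of the normalized sum of suffix match lengths."""
    l = len(text)
    if l == 0:
        return 0.0
    total = sum(_q(text[i:], ref) for i in range(l))
    return sqrt(2.0 * total / (l * (l + 1)))


def classify(text, class_texts):
    """theta(T) = argmax_i R(T | S_i) over a dict {label: class text}."""
    return max(class_texts, key=lambda label: r_measure(text, class_texts[label]))


if __name__ == "__main__":
    classes = {
        "en": "the cat sat on the mat and the dog sat on the log",
        "fr": "le chat est assis sur le tapis et le chien dort",
    }
    print(classify("the dog sat on the mat", classes))
    print(classify("le chien est sur le tapis", classes))
```

No tokenization or feature extraction is involved; the decision depends only on how much of the document's character sequence reappears in each class text.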

11 3. Applications of R-measure Identifying foreign and/or non-typical documents. Non-typical documents can be located simply by examining those documents which have the lowest R-measures. For foreign documents, there is a predominant language associated with the collection as a whole, and we want to identify documents that are in a different language: construct a sample text for each language, then proceed as in multi-class categorization.
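Ranking documents by their R-measure against the rest of the collection can be sketched as follows. This is a brute-force illustration, not the suffix-array implementation: it assumes R is the square root of the squared R-measure with normalization 2/(l(l+1)), and the name `leave_one_out_r` and the toy collection are hypothetical.

```python
from math import sqrt


def _q(s, docs):
    """Length of the longest prefix of s occurring in any of docs."""
    best = 0
    for doc in docs:
        while best < len(s) and s[:best + 1] in doc:
            best += 1
    return best


def leave_one_out_r(docs):
    """R-measure of every document against all other documents."""
    scores = []
    for i, doc in enumerate(docs):
        others = docs[:i] + docs[i + 1:]
        l = len(doc)
        total = sum(_q(doc[j:], others) for j in range(l))
        scores.append(sqrt(2.0 * total / (l * (l + 1))) if l else 0.0)
    return scores


if __name__ == "__main__":
    collection = [
        "the cat sat on the mat",
        "the cat sat on the mat",   # exact duplicate of the first document
        "the dog sat on the log",
        "zzz qqq xxx",              # shares almost nothing with the others
    ]
    scores = leave_one_out_r(collection)
    # Lowest scores first: candidates for non-typical documents.
    for score, doc in sorted(zip(scores, collection)):
        print(f"{score:.3f}  {doc!r}")
```

In this toy collection the two identical documents score exactly 1.0, which is how fully duplicated documents show up, while the document with almost no shared substrings lands at the bottom of the ranking.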

12 4. Experiments and Results Analysis of various text collections. Multi-class categorization.

13 4.1 Analysis of Various Text Collections Reuters-21578: Contains 579 (2.7%) duplicate documents with R = 1.0. Partitioned into a training/testing split called ModApte. Two pairs of duplicates are shared between the training and testing splits (12495 and 18011, 14779 and 14913).

14 4.1 Analysis of Various Text Collections 20Newsgroups: 20news-19997 contains many duplicated messages. 20news-18828 was derived from 20news-19997 with the purpose of removing duplicates. There are still 6 repeated documents (indistinguishable to classifiers that rely on word-based feature extraction). Two documents differ by an extra new-line character and are assigned two different classes.

15 4.1 Analysis of Various Text Collections Russian-416: Comprises 416 texts from 102 Russian writers of the 19th and 20th centuries. Only two documents have R ≥ 0.1 and all other books have R < 0.01. The level of R-measure values is much lower than in other collections since the average document length is much larger (284800 characters).

16 4.1 Analysis of Various Text Collections Reuters Corpus Volume 1: A significant proportion of the articles in RCV1 are duplicated (3.4% or 27,754 articles) or extensively plagiarized (7.9% with R ≥ 0.5). Checking the percentage of matching fields (e.g. topics, headlines and dates) in duplicated documents: headlines 56.9% matched, dates 78.1%, countries 86.8%, industries 80.1%, topics 52.3%. 40% of the 50 lowest scoring articles consist almost entirely of names and numbers (non-typical documents).

17 4.1 Analysis of Various Text Collections Reuters Corpus Volume 1: Identifying foreign language documents. Several class models were constructed from a small sampling (100-120 KB) of English, French, German, Dutch and Belgian text obtained from a popular search engine. The method found 410 French articles, 6 Dutch, 5 Belgian and 1 German article, with 100% precision and an estimated 98% recall.

18 4.2 Multi-class Categorization Authorship attribution using RCV1: An important application for IR, with benefits such as user modeling, determining context and efficient partitioning of the collection for distributed retrieval. An experimental collection was formed from 1813 articles of the top 50 authors with respect to total size of articles. A 10-fold split was applied to subsets with the conditions R < 0.25 (873), R < 0.5 (1161), R < 0.75 (1255), R < 1.0 (1316) and R ≤ 1.0 (1813).

19 4.2 Multi-class Categorization Table 2: How well the R-measure performs at determining the top 50 authors in RCV1 compared to SVM and compression-based methods. Results are percentage of correct guesses at first rank.

20 5. Conclusion and Discussion In this paper, we have highlighted the need for verifying a text collection, that is, ensuring the collection is both valid and consistent. The R-measure is suggested for collection verification and it can be computed efficiently using the suffix array. The implication for text categorization research is that splitting the collection into training and testing sets requires a more careful approach than random selection.

21 5. Conclusion and Discussion There exists a class of problems which is more suitable for the R-measure and PPM approach than for SVM, such as the classification of texts coming from a single source, like the papers written by a single author or in a single language. In cases where the classification depends on the presence of one or two words (as for Reuters-21578), SVM would be the preferred method.
