
1  Applying Syntactic Similarity Algorithms for Enterprise Information Management
Lucy Cherkasova, Kave Eshghi, Brad Morrey, Joseph Tucek, Alistair Veitch
Hewlett-Packard Labs
© 2006 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.

2  New Applications in the Enterprise
Document deletion and compliance rules
− How do you identify all the users who might have a copy of these files?
E-Discovery
− Identify and retrieve a complete set of related documents (all earlier or later versions of the same document).
− Simplify the review process: within the set of semantically similar documents (returned to the expert), identify clusters of syntactically similar documents.
Keep document repositories with up-to-date information
− Identify and filter out the documents that are largely duplicates of newer versions, in order to improve the quality of the collection.

3  Syntactic Similarity
Syntactic similarity is useful for identifying documents with a large textual intersection.
Syntactic similarity algorithms are defined entirely by the syntactic (text) properties of the document.
Shingling technique (Broder et al.)
− Goal: identify near-duplicates on the web.
− Document A is represented by its set of shingles (sequences of adjacent words).

4  Shingling Technique
S(A) = {w_1, w_2, …, w_j, …, w_N} is the set of all shingles in document A.
Parameter: the shingle size (the moving window).
Traditionally, the shingle size is defined as a number of words. In our work, we define the shingle size (moving window) via the number of bytes.
[Figure: a moving window w_j of size 6 bytes sliding over document A, producing shingles w_1, w_2, …, w_N.]
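As a concrete illustration, here is a minimal Python sketch of byte-window shingling. The 6-byte window matches the figure above; the sample document and function names are our own, not from the paper.

```python
# A minimal sketch of byte-based shingling: slide a `window`-byte moving
# window over the document, one byte at a time.
def shingles(data: bytes, window: int = 6) -> list[bytes]:
    return [data[i:i + window] for i in range(len(data) - window + 1)]

doc = b"the quick brown fox jumps over the lazy dog"
S = set(shingles(doc))  # S(A): the set of all shingles in document A
print(len(S), "distinct 6-byte shingles")
```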

5  Basic Metrics
Similarity metric (documents A and B are ~similar):
sim(A, B) = |S(A) ∩ S(B)| / |S(A) ∪ S(B)|
Containment metric (document A is ~contained in B):
cont(A, B) = |S(A) ∩ S(B)| / |S(A)|
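A direct Python sketch of the two metrics over shingle sets, following the standard Broder definitions written out above (the function names are ours):

```python
# Both metrics operate on the shingle sets S(A) and S(B).
def similarity(sa: set, sb: set) -> float:
    """|S(A) ∩ S(B)| / |S(A) ∪ S(B)| -- symmetric resemblance of A and B."""
    return len(sa & sb) / len(sa | sb)

def containment(sa: set, sb: set) -> float:
    """|S(A) ∩ S(B)| / |S(A)| -- fraction of A's shingles also occurring in B."""
    return len(sa & sb) / len(sa)
```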

6  Shingling-Based Approach
Instead of comparing shingles (sequences of words), it is more convenient to deal with fingerprints (hashes) of shingles.
64-bit Rabin fingerprints are used due to their fast software implementation.
To further simplify the computation of the similarity metric, one can sample the document shingles to build a more compact document signature:
− e.g., instead of 1000 shingles, take a sample of 100 shingles.
Different ways of sampling the shingles lead to different syntactic similarity algorithms.
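The paper uses 64-bit Rabin fingerprints; the sketch below substitutes a truncated cryptographic hash as a stand-in. This preserves the interface (shingle → 64-bit integer) but not Rabin's efficient rolling update. The per-algorithm sketches that follow assume this fingerprint() helper.

```python
import hashlib

# Stand-in for a 64-bit Rabin fingerprint: truncate blake2b to 8 bytes.
# A real implementation would use Rabin's rolling polynomial hash.
def fingerprint(shingle: bytes) -> int:
    return int.from_bytes(hashlib.blake2b(shingle, digest_size=8).digest(), "big")

fps = {fingerprint(s) for s in shingles(doc)}  # fingerprinted shingles of A
```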

7  Four Algorithms
We will compare the performance and properties of four syntactic similarity algorithms:
− three shingling-based algorithms (Min_n, Mod_n, Sketch_n), and
− a chunking-based algorithm (BSW_n).
The three shingling-based algorithms (Min_n, Mod_n, Sketch_n) differ in how they sample the set of document shingles and build the document signature.

8  Min_n Algorithm
Let S(A) = {f(w_1), f(w_2), …, f(w_N)} be all fingerprinted shingles for document A.
Min_n selects the n numerically smallest fingerprinted shingles.
Documents are represented by fixed-size signatures.
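A minimal sketch of Min_n, assuming the fingerprint() stand-in from the slide 6 sketch:

```python
# Min_n: keep the n numerically smallest fingerprints, giving a
# fixed-size signature regardless of document length.
def min_n_signature(fps: set[int], n: int = 100) -> list[int]:
    return sorted(fps)[:n]
```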

9  Mod_n Algorithm
Let S(A) = {f(w_1), f(w_2), …, f(w_N)} be all fingerprinted shingles for A.
Mod_n selects all fingerprints whose value modulo n is zero.
− Example: if n = 100 and A is 1000 bytes, then Mod_100(A) is represented by approximately 10 fingerprints.
Documents are represented by variable-size signatures (proportional to the document size).
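A minimal sketch of Mod_n under the same stand-in fingerprints:

```python
# Mod_n: keep every fingerprint f with f % n == 0. The expected signature
# size is |S(A)| / n, so it grows in proportion to the document.
def mod_n_signature(fps: set[int], n: int = 100) -> set[int]:
    return {f for f in fps if f % n == 0}
```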

10  Sketch_n Algorithm
Each shingle is fingerprinted with a family of independent hash functions f_1, …, f_n.
For each f_i, the fingerprint with the smallest value is retained in the sketch:
min f_i(A) = min {f_i(w_1), f_i(w_2), …, f_i(w_N)}
Documents are represented by fixed-size signatures: {min f_1(A), min f_2(A), …, min f_n(A)}.
This algorithm has an elegant theoretical justification: the percentage of common entries in the sketches of A and B accurately approximates the percentage of common shingles in A and B.
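A minimal min-hash sketch of Sketch_n. Seeding blake2b with the function index is an illustrative stand-in for a family of independent hash functions, not the authors' construction:

```python
import hashlib

# Sketch_n (min-hash): for each of n seeded hash functions, keep its
# minimum value over all shingles of the document.
def sketch_n_signature(shingle_set: set[bytes], n: int = 100) -> list[int]:
    sig = []
    for i in range(n):
        seed = i.to_bytes(4, "big")  # seed distinguishes the n hash functions
        sig.append(min(
            int.from_bytes(hashlib.blake2b(seed + s, digest_size=8).digest(), "big")
            for s in shingle_set))
    return sig

# The fraction of positions where two sketches agree estimates sim(A, B).
```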

11  BSW_n (Basic Sliding Window) Algorithm
The document is represented by chunks: a window slides over the document, and a position where the fingerprint satisfies f(w_k) mod n = 0 is a chunk boundary (the chunk boundary condition).
Each chunk is represented by the smallest fingerprint within the chunk: min {f(w_1), f(w_2), …, f(w_k)}.
Documents are represented by variable-size signatures (the signature size is proportional to the document size).
− Example: if n = 100 and A is 1000 bytes, then BSW_100(A) is represented by approximately 10 fingerprints.
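A minimal sketch of BSW_n chunking, again using the truncated-hash stand-in for Rabin fingerprints (illustrative, not the authors' implementation):

```python
import hashlib

def fingerprint(shingle: bytes) -> int:  # same 64-bit stand-in as slide 6
    return int.from_bytes(hashlib.blake2b(shingle, digest_size=8).digest(), "big")

def bsw_signature(data: bytes, window: int = 6, n: int = 100) -> list[int]:
    sig, chunk_min = [], None
    for i in range(len(data) - window + 1):
        f = fingerprint(data[i:i + window])
        chunk_min = f if chunk_min is None else min(chunk_min, f)
        if f % n == 0:             # chunk boundary condition: f(w_k) mod n == 0
            sig.append(chunk_min)  # chunk -> its smallest fingerprint
            chunk_min = None
    if chunk_min is not None:      # close the trailing partial chunk
        sig.append(chunk_min)
    return sig
```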

12  Algorithm Properties and Parameters
Algorithm properties: fixed-size vs. variable-size signatures (as summarized on the preceding slides).
Algorithm parameters:
− sliding window size
− sampling frequency
− published papers use very different values for these
Questions:
− sensitivity of the similarity metric to different values of the algorithm parameters
− comparison of the four algorithms

13  Objective and Fair Comparison
How can we objectively compare the algorithms?
− While one document collection might favor a particular algorithm, another collection might show better results for a different algorithm.
− Can we design a framework for fair comparison?
− Can the same framework be used for sensitivity analysis of the parameters?

14  Methodology
Controlled set of modifications over a given document set:
− add/remove words in the documents a predefined number of times

15  Methodology
Research corpus RC_orig: 100 different HP Labs TRs from 2007, converted to a text format.
Introduce modifications to the documents in a controlled way:
− add/remove words to/from the document a predefined number of times;
− modifications can be made at random positions or spread uniformly through the document.
RC_i^a = {RC_orig, where the word "a" is inserted into each document i times}
New average similarity metric: the similarity metric between each original document and its modified version, averaged over all documents in the collection (see the sketch below).
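A sketch of the controlled-modification step and the averaged metric, under our reading of the slide (the helper names, the uniform-spread computation, and the pairwise averaging are illustrative assumptions):

```python
import random

def insert_word(text: str, word: str = "a", times: int = 10,
                randomly: bool = False) -> str:
    """Insert `word` into `text` `times` times, at random or uniformly spread positions."""
    words = text.split()
    if randomly:
        positions = [random.randrange(len(words) + 1) for _ in range(times)]
    else:
        positions = [len(words) * k // times for k in range(times)]
    for p in sorted(positions, reverse=True):  # reverse order keeps indices valid
        words.insert(p, word)
    return " ".join(words)

def avg_similarity(originals, modified, sim) -> float:
    """Average sim(D, D') over corresponding original/modified documents."""
    return sum(sim(a, b) for a, b in zip(originals, modified)) / len(originals)
```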

16  Sensitivity to Sliding Window Size
A window of 20 bytes is a good choice (~4 words).
A larger window size significantly decreases the similarity metric.

17  Frequency Sampling
There is large variance in the similarity metric values for different documents under lower-frequency (sparser) sampling.
The frequency sampling parameter depends on the document length distribution and should be tuned accordingly.
There is a trade-off between accuracy and storage requirements.
(Results shown for RC_50^a.)

18  Comparison of Similarity Algorithms
Sketch_n and BSW_n are more sensitive to the number of changes in the documents (especially short ones) than Mod_n and Min_n are.

19  Case Study Using Enterprise Collections
Two enterprise collections:
− Collection_1 with 5040 documents;
− Collection_2 with 2500 documents.

20  Results
The Mod_n and Min_n algorithms identified a higher number of similar documents (with Mod_n being the leader). However, Mod_n has a higher number of false positives.
For longer documents, the difference between the algorithms is smaller. Moreover, for long documents (>100 KB), BSW_n and related chunking-based algorithms might be a better choice (accuracy- and storage-wise).

21  Runtime Comparison
Executing Sketch_n is more expensive, especially for larger window sizes.

22  Conclusion
Syntactic similarity is useful for identifying documents with a large textual intersection.
We designed a useful framework for fair algorithm comparison:
− compared the performance of four syntactic similarity algorithms, and
− identified a useful range for their parameters.
Future work: modify, refine, and optimize the BSW algorithm.
− Chunking-based algorithms are actively used for deduplication in enterprise backup and storage solutions.

23  Sensitivity to Sliding Window Size
Potentially, the Mod_n algorithm might have a higher rate of false positives.

