
1 dg.o conference 2006
Near-Duplicate Detection for eRulemaking
Hui Yang, Jamie Callan, Language Technologies Institute, School of Computer Science, Carnegie Mellon University
Stuart Shulman, Library and Information Science, School of Information Sciences, University of Pittsburgh

2 U.S. regulatory agencies must solicit, consider, and respond to public comments.
Special interest groups make form letters available for generating comments via email and the Web
–Moveon.org, http://www.moveon.org
–GetActive, http://www.getactive.org
Modifying a form letter is very easy

3 [Screenshot of moveon.org showing the form letter and the enter-your-comment-here box. Labels: Form Letter, Individual Information, Personal Notes]

4

5

6

7

8

9

10 Group near-duplicates based on
–Text similarity
  Similar vocabulary
  Similar word frequencies
–Editing patterns
–Metadata
Hints to the clustering algorithm about how to group documents
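The two text-similarity signals on this slide can be illustrated with a minimal sketch. The slides do not say which measures were used, so Jaccard overlap (for vocabulary) and cosine similarity (for word frequencies) are assumptions here, as are the tokenizer and the function names.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (a simplifying assumption)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def vocabulary_similarity(a, b):
    """Jaccard overlap between the sets of distinct words in two documents."""
    vocab_a, vocab_b = set(tokenize(a)), set(tokenize(b))
    if not vocab_a and not vocab_b:
        return 1.0
    return len(vocab_a & vocab_b) / len(vocab_a | vocab_b)


def frequency_similarity(a, b):
    """Cosine similarity between the raw word-frequency vectors of two documents."""
    freq_a, freq_b = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(freq_a[w] * freq_b[w] for w in freq_a.keys() & freq_b.keys())
    norm_a = math.sqrt(sum(c * c for c in freq_a.values()))
    norm_b = math.sqrt(sum(c * c for c in freq_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

A form letter with a personal note appended keeps a high score on both measures, while an unrelated comment scores near zero.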

11 Two instances must be in the same cluster
Created when
–complete containment of the reference copy (key block),
–word overlap > 95% (minor change).
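As a rough sketch of how the two must-link conditions could be tested: the slide gives the 95% threshold but not the exact definition of word overlap, so the definition below (shared distinct words divided by the larger vocabulary) is an assumption, as are the tokenizer and all function names.

```python
import re


def words(text):
    """Lowercase word tokens (a simplifying assumption)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def contains_reference(document, reference):
    """True if the reference copy's word sequence appears contiguously in the document."""
    doc, ref = words(document), words(reference)
    n = len(ref)
    if n == 0:
        return False
    return any(doc[i:i + n] == ref for i in range(len(doc) - n + 1))


def word_overlap(document, reference):
    """Shared distinct words divided by the larger vocabulary (assumed definition)."""
    doc, ref = set(words(document)), set(words(reference))
    if not doc or not ref:
        return 0.0
    return len(doc & ref) / max(len(doc), len(ref))


def must_link(document, reference, threshold=0.95):
    """Must-link if the reference copy is fully contained or word overlap exceeds the threshold."""
    return (contains_reference(document, reference)
            or word_overlap(document, reference) > threshold)
```

Containment covers a form letter wrapped in a greeting and signature; the overlap test covers a letter with a word or two changed.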

12 Two instances cannot be in the same cluster
Created when two documents
–cite different docket identification numbers
People submitted comments to the wrong place
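One way to picture the cannot-link check is to pull docket identification numbers out of each comment and compare them. The regular expression below is a hypothetical pattern, modeled only on the two identifiers that appear in this deck (USEPA-OAR-2002-0056 and USDOT-2003-16128); real docket numbering may differ.

```python
import re

# Hypothetical pattern: "US" + agency letters, optional office code, 4-digit year, number.
DOCKET_ID = re.compile(r"\bUS[A-Z]+(?:-[A-Z]+)*-\d{4}-\d+\b")


def docket_ids(text):
    """Return the set of docket identifiers cited in the text (case-insensitive)."""
    return set(DOCKET_ID.findall(text.upper()))


def cannot_link(doc_a, doc_b):
    """Cannot-link when both documents cite dockets and share none of them."""
    ids_a, ids_b = docket_ids(doc_a), docket_ids(doc_b)
    return bool(ids_a) and bool(ids_b) and ids_a.isdisjoint(ids_b)
```

In this sketch a document that cites no docket at all produces no constraint, since nothing can be concluded from a missing identifier.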

13 Two instances are likely to be in the same cluster
Created when two documents have
–the same email relayer,
–the same docket identification number,
–similar file sizes, or
–the same footer block.
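The four metadata hints above could be checked along these lines. This is only a sketch: the `Comment` record, its field names, and the 10% file-size tolerance are all hypothetical, since the slide does not say how "similar file sizes" was defined.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Comment:
    """Hypothetical metadata record for one public comment."""
    relayer: Optional[str]
    docket_id: Optional[str]
    size_bytes: int
    footer: Optional[str]


def similar_size(a, b, tolerance=0.10):
    """True if the two sizes differ by at most `tolerance` of the larger one (assumed rule)."""
    larger = max(a, b)
    return larger > 0 and abs(a - b) / larger <= tolerance


def may_link_reasons(x: Comment, y: Comment) -> List[str]:
    """List which of the four hints suggest the two comments belong together."""
    reasons = []
    if x.relayer and x.relayer == y.relayer:
        reasons.append("same email relayer")
    if x.docket_id and x.docket_id == y.docket_id:
        reasons.append("same docket identification number")
    if similar_size(x.size_bytes, y.size_bytes):
        reasons.append("similar file sizes")
    if x.footer and x.footer == y.footer:
        reasons.append("same footer block")
    return reasons
```

Any non-empty result would be treated as a soft hint to the clustering step rather than a hard rule.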

14 Comparing with human-human intercoder agreement (measured in AC1)
USEPA-OAR-2002-0056 (EPA Mercury dataset)
USDOT-2003-16128 (DOT SUV dataset)
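For readers unfamiliar with the measure, here is a small sketch that assumes AC1 refers to Gwet's first-order agreement coefficient for two coders: AC1 = (pa - pe) / (1 - pe), where pa is the observed agreement and pe = (1 / (q - 1)) * sum of pi_k * (1 - pi_k) over the q categories, with pi_k the average share of items the two coders assign to category k. The function name and the example labels are illustrative, not taken from the study.

```python
from collections import Counter


def gwet_ac1(coder_a, coder_b):
    """Gwet's AC1 for two coders who each labeled the same items."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("both coders must label the same non-empty set of items")
    n = len(coder_a)
    categories = set(coder_a) | set(coder_b)
    q = len(categories)
    if q < 2:
        raise ValueError("AC1 needs at least two categories")
    # Observed agreement: share of items given the same label by both coders.
    pa = sum(1 for a, b in zip(coder_a, coder_b) if a == b) / n
    # Chance agreement from the average category prevalence across both coders.
    counts = Counter(coder_a) + Counter(coder_b)
    pe = sum((counts[k] / (2 * n)) * (1 - counts[k] / (2 * n)) for k in categories) / (q - 1)
    return (pa - pe) / (1 - pe)
```

With four items labeled `[1, 1, 0, 0]` by one coder and `[1, 1, 0, 1]` by the other, pa is 0.75, pe is 0.46875, and AC1 works out to 9/17.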

15 Comparing with other duplicate detection algorithms (measured in F1)
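The slide does not say how F1 was computed for clusters. One common choice, assumed here purely for illustration, is pairwise F1: a pair of documents counts as a positive when both are placed in the same cluster, and F1 = 2PR / (P + R) over those pairs.

```python
from itertools import combinations


def same_cluster_pairs(labels):
    """All index pairs (i, j), i < j, whose items share a cluster label."""
    return {(i, j) for i, j in combinations(range(len(labels)), 2)
            if labels[i] == labels[j]}


def pairwise_f1(predicted, truth):
    """Pairwise F1 between a predicted clustering and a reference clustering."""
    pred_pairs = same_cluster_pairs(predicted)
    true_pairs = same_cluster_pairs(truth)
    true_positives = len(pred_pairs & true_pairs)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred_pairs)
    recall = true_positives / len(true_pairs)
    return 2 * precision * recall / (precision + recall)
```

Cluster labels only need to be consistent within each list; the predicted and reference labelings do not have to use the same names.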

16 Number of constraints vs. F1

