Email Data Cleaning (KDD’05) Jie Tang 1, Hang Li 2, Yunbo Cao 2, Zhaohui Tang 3 1 Tsinghua University 2 Microsoft Research Asia 3 Microsoft Corporation.


1 Email Data Cleaning (KDD’05)
Jie Tang (1), Hang Li (2), Yunbo Cao (2), Zhaohui Tang (3)
(1) Tsinghua University; (2) Microsoft Research Asia; (3) Microsoft Corporation

2 Outline
- Motivation and Problem Description
- Related Work
- Our Approach
- Implementation
- Experimental Results
- Summary

3 Motivation
- Email is one of the most common modes of communication
- Text mining applications on emails
  * Email classification
  * Email summarization
  * Term extraction from email
  * …

4 Term Extraction
Example email:
From: SY - Find messages by this author
Date: Mon, 4 Apr :29:
Subject: Re:..How to do addition??
Hi Ranger, Your design of Matrix class is not good. what are you doing with two matrices in a single class?make class Matrix as follows
import java.io.*;
class Matrix {
  public static int AnumberOfRows;
  public static int AnumberOfColumns;
  private int matrixA[][];
  public void inputArray() throws IOException {
    InputStreamReader input = new InputStreamReader(System.in);
    BufferedReader keyboardInput = new BufferedReader(input)
}
-- Sandeep Yadav
Tel:
Homepage:
On Apr 3, :33 PM, ranger wrote:
> Hi... I want to perform the addtion in my Matrix class. I got the program to
> enter 2 Matricx and diaplay them. Hear is the code of the Matrix class and
> TestMatrix class. I'm glad If anyone can let me know how to do the addition.....Tnx
Noise marked in the body text:
- Extra line break
- Missing space
- Extra space
- Missing period
- Case errors
Cleaned body text:
Hi Ranger, Your design of Matrix class is not good. What are you doing with two matrices in a single class? Make class Matrix as follows:

6 Related Work -- Data Mining
- Email cleaning: several products offer rule-based email cleaning. E.g. eClean (2000), WinPure ListCleaner Pro (2004)
- Information extraction from email: extracting contact information, etc. E.g. Kristjansson and Culotta (2004), Culotta, Bekkerman, and McCallum (2004), Viola (2005)
- Web page cleaning: removing banner ads and decoration pictures. E.g. Yi and Liu (2003), Lin and Ho (2002)
- Tabular data cleaning: detecting and removing duplicate information. E.g. Hernández and Stolfo (1998), Rahm and Do (2000), SQL Server 2005

7 Related Work -- Language Processing
- Sentence boundary detection: Palmer and Hearst (1997)
- Case restoration: Lita and Ittycheriah (2003), Mikheev (2002)
- Spelling error correction: Golding and Roth (1996)

9 Our Approach -- Cascaded Approach
Cleaning = non-text block filtering + text normalization
Non-text block filtering
- Quotation detection
- Header detection
- Signature detection
- Program code detection
Text normalization
- Paragraph normalization
  * Extra line break detection
- Sentence normalization
  * Missing period detection
- Word normalization
  * Case restoration
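The cascade can be read as a chain of passes, each working on the output of the previous one. The following is a minimal Python sketch of that idea only. Every function name is hypothetical, and each pass is a simple hand-written rule standing in for the detectors described in the slides (which use SVM models for header, signature, program code, and extra line break detection). Program code detection is left out to keep the sketch short.

```python
import re

# Hypothetical header prefixes; a stand-in for a trained header detector.
HEADER_PREFIXES = ("From:", "Date:", "Subject:", "To:")

def drop_quotations(lines):
    # Stand-in rule (assumption): quoted lines start with ">".
    return [ln for ln in lines if not ln.lstrip().startswith(">")]

def drop_header(lines):
    return [ln for ln in lines if not ln.startswith(HEADER_PREFIXES)]

def drop_signature(lines):
    # Stand-in rule (assumption): a line starting with "--" opens the signature.
    for i, ln in enumerate(lines):
        if ln.strip().startswith("--"):
            return lines[:i]
    return lines

def join_paragraph(lines):
    # Stand-in for extra line break removal: merge all remaining lines.
    return " ".join(ln.strip() for ln in lines if ln.strip())

def fix_sentences(text):
    # Naive rules: add a missing space after . ? !, collapse extra spaces,
    # and add a final period when the text ends without punctuation.
    text = re.sub(r"([.?!])(?=[A-Za-z])", r"\1 ", text)
    text = re.sub(r"\s{2,}", " ", text).strip()
    if text and text[-1] not in ".?!:":
        text += "."
    return text

def restore_case(text):
    # Stand-in for case restoration: capitalize the first letter of a sentence.
    return re.sub(r"(^|[.?!]\s+)([a-z])",
                  lambda m: m.group(1) + m.group(2).upper(), text)

def clean_email(raw):
    lines = raw.splitlines()
    for step in (drop_quotations, drop_header, drop_signature):
        lines = step(lines)          # non-text block filtering
    text = join_paragraph(lines)     # paragraph normalization
    text = fix_sentences(text)       # sentence normalization
    return restore_case(text)        # word normalization
```

Because each pass only sees what the earlier passes kept, a particular application can skip a filter (for example, keep the quotation) by removing it from the tuple in `clean_email`.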

10 Cascaded Approach
Input email:
From: SY - Find messages by this author
Date: Mon, 4 Apr :29:
Subject: Re:..How to do addition??
Hi Ranger, Your design of Matrix class is not good. what are you doing with two matrices in a single class?make class Matrix as follows
import java.io.*;
class Matrix {
  public static int AnumberOfRows;
  public static int AnumberOfColumns;
  private int matrixA[][];
  public void inputArray() throws IOException {
    InputStreamReader input = new InputStreamReader(System.in);
    BufferedReader keyboardInput = new BufferedReader(input)
}
-- Sandeep Yadav
Tel:
Homepage:
On Apr 3, :33 PM, ranger wrote:
> Hi... I want to perform the addtion in my Matrix class. I got the program to
> enter 2 Matricx and diaplay them. Hear is the code of the Matrix class and
> TestMatrix class. I'm glad If anyone can let me know how to do the addition..
Processing steps:
- Quotation Detection
- Header Detection
- Signature Detection
- Program Code Detection
- Extra Line Break Detection
- Missing Period and Missing Space Detection
- Extra Space Detection
- Case Restoration
Body text as the steps are applied:
- Hi Ranger, Your design of Matrix class is not good. what are you doing with two matrices in a single class?make class Matrix as follows
- Hi Ranger, Your design of Matrix class is not good. what are you doing with two matrices in a single class? make class Matrix as follows.
- Hi Ranger, Your design of Matrix class is not good. What are you doing with two matrices in a single class? Make class Matrix as follows.
In a particular text mining application, we can retain some of the blocks.

12 Technical Issues
Non-text filtering
- Quotation detection
- Header detection
- Signature detection
- Program code detection
Text normalization
- Extra line break detection
- Sentence normalization
- Case restoration

13 Non-text Filtering Using SVMs
- Header detection
- Signature detection
- Program code detection

14 Features Used in Header Detection
- Position Feature: Is the first line?
- Positive Word Features: Begins with “From:”, “Re:”, “In article”, etc. Contains “original message”, “Fwd:”, etc. Ends with “wrote:”, “said:”, etc.
- Negative Word Features: Contains “Hi”, “dear”, “thank you”, “best regards”, etc.
- Number of Words Feature: Number of words in the current line
- Person Name Feature: Contains a person name?
- Ending Character Features: Ends with colon, semicolon, quotation mark, question mark, exclamation mark, etc.
- Special Pattern Features: Contains one type of special pattern: email address, date, number, URL, percentage, etc.
- Number of Line Breaks Feature: Number of line breaks that exist before the current line
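To make the feature list concrete, here is a minimal Python sketch of how one line might be turned into a feature dictionary before it is handed to a classifier. The name `header_features`, the regular expressions, and the reading of "line breaks before the current line" as the number of blank lines directly above it are assumptions for illustration. The word lists contain only the examples named on the slide, and the person name feature is omitted because it needs a name lexicon.

```python
import re

BEGINS = ("from:", "re:", "in article")
CONTAINS = ("original message", "fwd:")
ENDS = ("wrote:", "said:")
NEGATIVE = ("hi", "dear", "thank you", "best regards")
# Illustrative patterns only; real special-pattern recognizers would be richer.
SPECIAL = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "url": re.compile(r"https?://\S+"),
    "percentage": re.compile(r"\d+(?:\.\d+)?%"),
}

def header_features(lines, i):
    """Return a feature dictionary for line i of an email."""
    line = lines[i].strip()
    low = line.lower()
    # Count the blank lines directly above line i (assumed meaning of
    # "number of line breaks before the current line").
    breaks, j = 0, i - 1
    while j >= 0 and not lines[j].strip():
        breaks += 1
        j -= 1
    feats = {
        "is_first_line": i == 0,
        "pos_begins": low.startswith(BEGINS),
        "pos_contains": any(w in low for w in CONTAINS),
        "pos_ends": low.endswith(ENDS),
        "neg_word": any(re.search(r"\b" + re.escape(w) + r"\b", low)
                        for w in NEGATIVE),
        "num_words": len(line.split()),
        "ends_with_mark": line.endswith((":", ";", '"', "?", "!")),
        "line_breaks_before": breaks,
    }
    for name, pattern in SPECIAL.items():
        feats["has_" + name] = bool(pattern.search(line))
    return feats
```

Each dictionary would then be converted to a numeric vector (booleans as 0/1) for training; that step depends on the SVM package used and is not shown.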

15 Header Detection
Two SVM models are employed to respectively identify the start line and end line.
Example email:
From: SY - Find messages by this author
Date: Mon, 4 Apr :29:
Subject: Re:..How to do addition??
Hi Ranger, Your design of Matrix class is not good. what are you doing with two matrices in a single class?make class Matrix as follows
import java.io.*;
class Matrix {
  public static int AnumberOfRows;
  public static int AnumberOfColumns;
  private int matrixA[][];
  public void inputArray() throws IOException {
    InputStreamReader input = new InputStreamReader(System.in);
    BufferedReader keyboardInput = new BufferedReader(input)
}
-- Sandeep Yadav
Tel:
Homepage:
On Apr 3, :33 PM, ranger wrote:
> Hi... I want to perform the addtion in my Matrix class. I got the program to
> enter 2 Matricx and diaplay them. Hear is the code of the Matrix class and
> TestMatrix class. I'm glad If anyone can let me know how to do the addition.....Tnx
Features for the start line (the “From:” line):
- Position Feature
- Positive Word Features (“From:”)
- Negative Word Features
- Number of Words Feature
- Person Name Feature
- Ending Character Features
- Special Pattern Features (“ ”)
- Number of Line Breaks Feature
Features for the end line (the “Subject:” line):
- Position Feature
- Positive Word Features (“Subject:”)
- Negative Word Features
- Number of Words Feature
- Person Name Feature
- Ending Character Features (“??”)
- Special Pattern Features
- Number of Line Breaks Feature
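One way to use a start-line model and an end-line model together is to take the first predicted start line, scan forward to the next predicted end line, and treat everything in between as the block. The slides do not say how the two predictions are combined, so the Python sketch below is an assumption. `find_block` and `remove_block` are hypothetical names, and the two predicate functions are trivial stand-ins for trained SVM models.

```python
def find_block(lines, is_start, is_end):
    """Return (start, end) indices of the first detected block, or None.

    is_start and is_end take (lines, index) and return True or False;
    in the slides these decisions come from two separate SVM models.
    """
    for s in range(len(lines)):
        if is_start(lines, s):
            for e in range(s, len(lines)):
                if is_end(lines, e):
                    return s, e
            return None  # a start without a matching end: report nothing
    return None

def remove_block(lines, span):
    """Return a copy of lines with the inclusive span removed."""
    if span is None:
        return list(lines)
    s, e = span
    return lines[:s] + lines[e + 1:]

# Stand-in predicates (assumptions), used only to exercise the functions.
def looks_like_start(lines, i):
    return lines[i].startswith("From:")

def looks_like_end(lines, i):
    return lines[i].startswith("Subject:")
```

The same two-model scheme carries over to other blocks by swapping in different start and end predicates.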

16 Automatic Feature Generation
- Feature definition is tedious.
- Can we automate the feature generation?
- Input: An annotated dataset.
- Output: Discovered features.
- Algorithm:
Step 1: Preprocessing. This step first processes emails by using hard rules. It replaces several special patterns by a tag. For example, an email address is replaced by a tag.
Step 2: Learning patterns. This step takes the header lines as positive samples and the other lines as negative samples. It employs a pattern learning tool to discover the patterns. An example of the discovered patterns is: “ Date: ”.
Step 3: Generating features. This step generates features according to the learned patterns by using heuristic rules. For the above example, the corresponding feature can be: “^\s*Date: \s*$”. The feature represents whether or not the current line contains the pattern.
Generated Features: From: Subject: (.*?) Re: > wrote in message Date: Subject: Date: -----Original Message----- To: ….
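Steps 1 and 3 can be sketched in a few lines of Python. The sketch assumes that a learned pattern is a whitespace-separated token sequence and that special patterns are replaced by tags named `<EMAIL>` and `<URL>`. These tag names, the regular expressions, and the function names are hypothetical, since the transcript does not preserve the actual tags. Step 2 (the pattern learning tool) is not reproduced.

```python
import re

# Hypothetical hard rules for Step 1; the tag names are assumptions.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
URL_RE = re.compile(r"https?://\S+")

def preprocess(line):
    """Step 1: replace special patterns by tags."""
    line = URL_RE.sub("<URL>", line)
    return EMAIL_RE.sub("<EMAIL>", line)

def pattern_to_feature(pattern):
    """Step 3: turn a learned pattern into a binary feature function.

    The pattern is anchored to the whole line, mirroring the
    "^\\s*Date: ...\\s*$" form of the example feature.
    """
    tokens = [re.escape(tok) for tok in pattern.split()]
    regex = re.compile(r"^\s*" + r"\s+".join(tokens) + r"\s*$")

    def feature(line):
        return 1 if regex.search(preprocess(line)) else 0

    return feature
```

For example, `pattern_to_feature("To: <EMAIL>")` yields a function that returns 1 for a line consisting of `To:` followed by a single email address and 0 for other lines.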

17 Example Features Used in Signature Detection
- Position Feature: Is the first line or the last line?
- Positive Word Features: Contains “Best Regards”, “Thanks”, “Sincerely”, “Good luck”, etc.
- Number of Words Feature: Number of words in the current line
- Person Name Feature: Contains a person name?
- Ending Character Features: Ends with colon, semicolon, quotation mark, question mark, exclamation mark…
- Special Symbol Pattern Features: Contains consecutive special symbols such as “ ”, “======”, “******”.
- Case Features: Whether the tokens are all in upper case, all in lower case, all capitalized, or only the first token is capitalized
- Number of Line Breaks Feature: Number of line breaks that exist before the current line

18 Example Features Used in Program Code Detection
- Position Feature: Position of the current line
- Declaration Keyword Features: Starts with “string”, “char”, “double”, “dim”, “typedef struct”, “#include”, “import”, “#define”, “#undef”, etc.
- Statement Keyword Features: There are four kinds of statement keyword features:
  * “i++”;
  * “if”, “else if”, “switch”, and “case”;
  * “while”, “do{”, “for”, and “foreach”;
  * “goto”, “continue;”, “next;”, “break;”
- Equation Pattern Features: There are four kinds of equation pattern features:
  * “=”, “<=” and “<<=”
  * “a=b+/*-c;”
  * “a=B(bb,cc);”
  * “a=b;”
- Function Pattern Feature: Contains a function pattern? E.g., a pattern covering “fread(pbBuffer,1, LOCK_SIZE, hSrcFile);”
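Several of these features are naturally expressed as regular expressions. The Python sketch below covers a subset of them and is an illustration under assumptions: the expressions are my own approximations of the listed keywords and patterns (the assignment pattern, for instance, does not cover “<=” or “<<=”), the position feature is omitted, and `code_features` is a hypothetical name.

```python
import re

# Declaration keywords from the slide; matched at the start of the line.
DECLARATION = re.compile(
    r"^\s*(?:string|char|double|dim|typedef struct|#include|import|#define|#undef)\b",
    re.IGNORECASE,
)
# Statement keywords from the slide. Words such as "if" and "for" also occur
# in ordinary English, so on its own this feature is only a weak signal.
STATEMENT = re.compile(
    r"\b(?:if|else if|switch|case|while|for|foreach|goto)\b"
    r"|\bdo\s*\{"
    r"|\b(?:continue|next|break);"
    r"|\w\+\+"
)
# Approximation of "a=b;"-style lines: a name, "=", a value, and a final ";".
ASSIGNMENT = re.compile(r"[\w\[\].]+\s*=[^=].*;\s*$")
# Approximation of a call statement such as "f(x, y);" without nested parentheses.
FUNCTION_CALL = re.compile(r"\b\w+\s*\([^()]*\)\s*;")

def code_features(line):
    """Return boolean program-code indicators for one line."""
    return {
        "declaration": bool(DECLARATION.search(line)),
        "statement": bool(STATEMENT.search(line)),
        "assignment": bool(ASSIGNMENT.search(line)),
        "function_call": bool(FUNCTION_CALL.search(line)),
    }
```

On the example email, `import java.io.*;` triggers the declaration feature, while the sentence "Your design of Matrix class is not good." triggers none of the four.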

19 Extra Line Break Detection Using SVMs

20 Features Used in Extra Line Break Detection
- Position Feature: Is the first line or the last line?
- Greeting Word Features: Contains “Hi”, “Dear”, etc.
- Ending Character Features: Ends with colon, semicolon, quotation mark, question mark, exclamation mark, etc.
- Case Features: Whether the current line ends with a word in lower-case letters and whether or not the next line starts with a word in lower-case letters
- Bullet Features: Is the next line one kind of bullet of a list item like “1.” and “a)”?
- Number of Line Breaks Feature: Number of line breaks that exist after the current line

21 Extra Line Break Detection
One SVM model is employed to identify whether a line break is an extra one or not.
Example text:
Hi Ranger, Your design of Matrix class is not good. what are you doing with two matrices in a single class?make class Matrix as follows
Features:
- Position Feature
- Greeting Word Features
- Ending Character Features
- Case Features
- Bullet Features
- Number of Line Breaks Feature
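The decision made for each line break can be sketched as a feature extractor plus a yes/no rule, followed by a pass that merges lines across breaks judged to be extra. In the Python sketch below the rule `is_extra_break` is a hand-written stand-in for the SVM model, the function names are hypothetical, counting a period among the ending characters is an assumption (the slide says "etc."), and "line breaks after the current line" is read as the number of blank lines that follow.

```python
import re

# Bullets of the kinds named on the slide: "1." and "a)".
BULLET = re.compile(r"^(?:\d+\.|[a-zA-Z]\))\s")

def line_break_features(lines, i):
    """Features describing the line break that follows line i."""
    cur = lines[i].rstrip()
    j = i + 1
    while j < len(lines) and not lines[j].strip():
        j += 1                      # skip blank lines after line i
    nxt = lines[j].strip() if j < len(lines) else ""
    cur_words, nxt_words = cur.split(), nxt.split()
    return {
        "is_first_line": i == 0,
        "is_last_line": i == len(lines) - 1,
        "greeting": bool(re.search(r"\b(?:hi|dear)\b", cur, re.IGNORECASE)),
        "ends_with_mark": cur.endswith((":", ";", '"', "?", "!", ".")),
        "cur_ends_lower": bool(cur_words) and cur_words[-1].islower(),
        "next_starts_lower": bool(nxt_words) and nxt_words[0].islower(),
        "next_is_bullet": bool(BULLET.match(nxt)),
        "blank_lines_after": j - (i + 1),
    }

def is_extra_break(f):
    # Hand-written stand-in for the SVM decision (assumption).
    return (not f["is_last_line"] and f["blank_lines_after"] == 0
            and not f["ends_with_mark"] and not f["next_is_bullet"]
            and f["cur_ends_lower"] and f["next_starts_lower"])

def merge_extra_breaks(lines, is_extra=is_extra_break):
    """Join each line to the previous one when the break between them is extra."""
    out = []
    for i, line in enumerate(lines):
        if out and is_extra(line_break_features(lines, i - 1)):
            out[-1] = out[-1].rstrip() + " " + line.lstrip()
        else:
            out.append(line)
    return out
```

Passing a trained classifier's decision function as `is_extra` would replace the stand-in rule without changing the merging pass.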

22 Case Restoration
- Tri-gram language model + sentence-level decoding
- Backoff scheme: [not preserved in this transcript]
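Sentence-level decoding with a language model can be illustrated with a small Viterbi search over the case variants of each word. The Python sketch below is a simplification and not the method from the slides: it uses a bigram model instead of a tri-gram model, and its backoff (a constant `alpha` times an add-one smoothed unigram estimate) is an assumption, since the actual backoff scheme is not preserved. All names are hypothetical.

```python
import math
from collections import Counter

def train(sentences):
    """Count unigrams, bigrams, contexts, and observed case variants."""
    uni, bi, ctx, variants = Counter(), Counter(), Counter(), {}
    for sent in sentences:
        toks = ["<s>"] + sent.split()
        for prev, cur in zip(toks, toks[1:]):
            uni[cur] += 1
            bi[(prev, cur)] += 1
            ctx[prev] += 1
            variants.setdefault(cur.lower(), set()).add(cur)
    return uni, bi, ctx, variants

def log_prob(prev, cur, model, alpha=0.4):
    """Bigram log-probability, backing off to a smoothed unigram estimate."""
    uni, bi, ctx, _ = model
    if bi[(prev, cur)] > 0:
        return math.log(bi[(prev, cur)] / ctx[prev])
    total = sum(uni.values())
    return math.log(alpha * (uni[cur] + 1) / (total + len(uni) + 1))

def restore_case(sentence, model):
    """Pick the best-scoring case variant sequence for the whole sentence."""
    variants = model[3]
    # best maps the last chosen variant to (log score, path so far).
    best = {"<s>": (0.0, [])}
    for word in sentence.split():
        candidates = sorted(variants.get(word.lower(), {word}))
        step = {}
        for cand in candidates:
            step[cand] = max(
                ((score + log_prob(prev, cand, model), path + [cand])
                 for prev, (score, path) in best.items()),
                key=lambda item: item[0],
            )
        best = step
    return " ".join(max(best.values(), key=lambda item: item[0])[1])
```

Because the whole sentence is scored at once, the same word can come out capitalized at the start of a sentence and lower-cased in the middle of one, given training text that shows both uses.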

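25 Datasets in Experiments
- 73.2% of the emails contain extra line breaks
- 85.4% need sentence normalization
- 47.1% contain case errors
- Only 1.6% are absolutely clean
Table columns: Data Set, # of Emails, Containing Header, Containing Signature, Containing Prog. Code, Text Only
Data sets: DC, Ontology, NLP, ML, Jena, Weka, Protégé, OWL, Mobility, WinServer, Windows, PSS, BR, J2EE
[Per-data-set values were not preserved in this transcript]
Surviving row (count and proportion): Containing Header 3256 (0.585); Containing Signature 4229 (0.760); Containing Prog. Code 401 (0.072); Text Only 768 (0.138)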

26 Cleaning Results -- 5-fold Cross Validation
Table columns: Cleaning Task, Precision, Recall, F1-Measure
Rows: Header (Our Method, Baseline), Signature (Our Method, Baseline), Quotation, Program Code, Extra Line Break (Our Method, Baseline), Sentence
[Table values were not preserved in this transcript]
Baseline methods:
- Header detection: eClean 2000
- Signature detection: rule based
- Extra line break detection: eClean 2000
For case restoration:
- Our method can reach 98.15% in terms of accuracy
- The accuracy of TrueCasing is about 97.7%

27 Automatic Features vs. Manual Features
Table columns: Detection Task, Precision, Recall, F1-Measure
Rows: Header (Manual, Automatic), Signature (Manual, Automatic)
[Table values were not preserved in this transcript]

28 Term Extraction Using Cleaning
Data sets: BR, J2EE

29 How Cleaning Processing Helps Term Extraction
Data set: BR
Chart labels: +74.2%, +6.4%, +41%

30 How Cleaning Processing Helps Term Extraction (cont.)
Data set: J2EE
Chart labels: +42.4%, +2.3%, +24.7%

32 Summary
- Formalized data cleaning as non-text filtering and text normalization
- Conducted cleaning in a ‘cascaded’ approach
- Used SVM models for header, signature, program code, and extra line break detection
- Our approach significantly outperforms baseline methods
- When applied to term extraction, a significant improvement in extraction accuracy can be obtained

33 Thanks!

34 Examples of List and Table in Email
List:
Supppoesdly this is what it does:
*New Layout
*Pentium-safe Compiled
*Fixed Various bugs in ftp scan
*Fixed Resume Scan Bug...
Table:
I have the following grocery store data in the CSV file:
item1, item2, item3
bread, jelly, peanut_butter
bread, peanut_butter,
bread, milk, peanut_butter
beer, bread,
beer, milk,

35 Future Work (2) -- Domain Adaptation Problems
- Semi-supervised

