India Research Lab Auto-grouping Emails for Faster eDiscovery Sachindra Joshi, Danish Contractor, Kenney Ng*, Prasad M Deshpande, and Thomas Hampp* IBM.


1 India Research Lab
Auto-grouping Emails for Faster eDiscovery
Sachindra Joshi, Danish Contractor, Kenney Ng*, Prasad M Deshpande, and Thomas Hampp*
IBM Research – India; *IBM Software Group

2 Outline of the Talk
- eDiscovery Process
- A new way of eDiscovery Review: Group Level Review
- Creating Syntactic Groups
- Creating Semantic Groups
- Experiments and Conclusion

3 eDiscovery Process
- Discovery: Process in pre-trial phase
  - Produce relevant information
- eDiscovery: FRCP 2006 amendment
  - Produce relevant Electronically Stored Information (ESI)
    - Emails, chats, word docs, presentations etc.
- Huge volumes of ESI
  - Process is expensive
  - 60% of cases warrant some form of eDiscovery
  - 4.8 billion dollar industry in 2011

4 eDiscovery Process
- High cost due to review stage
  - Lawsuit between Clinton administration and tobacco companies (U.S. vs. Philip Morris)
Apply text mining techniques to reduce the high costs involved in the eDiscovery process

5 Architecture of eDiscovery Review Systems
[Figure: architecture diagram. Labeled components: Named entity annotator, Language Annotator, Signature Annotator]

6 Group Level Review
- Review groups of documents that are “related” instead of individual documents
  - Mark whole group as responsive/unresponsive or privileged
  - Efficient and consistent
  - Syntactically Similar Documents
    - Automated messages, near and exact duplicates
  - Semantically Similar Documents
    - Threads, semantic categories

7 Detecting Syntactic Groups: Automated Messages

8 Detecting Near Duplicates
- S1: I am away from 17/2/2011 to 19/2/2011. Please mail xyz@in.ibm.com in case of any need
- S2: I am away from 26/7/2011 to 31/7/2011. Please mail abc@us.ibm.com in case of any need
- Notion of Similarity: Resemblance
- Use fingerprinting (Rabin) instead of actual chunks.
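The transcript does not preserve how resemblance is computed. A minimal sketch, assuming the usual shingling definition (the Jaccard ratio of the two documents' sets of K-word chunks), applied to S1 and S2. The function names and the choice K = 3 are illustrative and not from the slides.

```python
def shingles(text: str, k: int = 3) -> set[tuple[str, ...]]:
    """Return the set of k-word chunks (shingles) of a text."""
    words = text.split()
    # A text of n words has n - k + 1 sliding windows of k words.
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def resemblance(a: str, b: str, k: int = 3) -> float:
    """Assumed definition: |S(a) & S(b)| / |S(a) | S(b)| over k-word chunks."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 1.0  # two texts too short to chunk are treated as identical
    return len(sa & sb) / len(sa | sb)


s1 = "I am away from 17/2/2011 to 19/2/2011. Please mail xyz@in.ibm.com in case of any need"
s2 = "I am away from 26/7/2011 to 31/7/2011. Please mail abc@us.ibm.com in case of any need"
print(round(resemblance(s1, s2), 3))
```

With whitespace tokens and K = 3, each message has 13 chunks and the two share 5 of the 21 distinct ones, because every changed date or address touches up to three overlapping windows. A real system would replace the tuples with fingerprints, as the last bullet suggests.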

9 Efficient Detection of Near Duplicates
- For a document of length n words there would be
  - n-K+1 chunks with a window size of K
- It suffices to keep for each document a relatively small fixed size signature
- Let S_n be the set of permutations of [n]
- And let P be chosen uniformly at random over S_n
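The formula these definitions lead up to is not preserved in the transcript. It is presumably the standard min-wise hashing property, stated here as an assumption. S_A and S_B are introduced only for this sketch and denote the chunk fingerprint sets of documents A and B, viewed as subsets of [n]:

```latex
\Pr_{P \in S_n}\bigl[\, \min P(S_A) = \min P(S_B) \,\bigr]
  \;=\; \frac{|S_A \cap S_B|}{|S_A \cup S_B|}
```

Under this property, the fraction of independent random permutations on which the two minima agree is an unbiased estimate of the resemblance, which is why a small fixed-size signature can suffice.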

10 Signature Annotator
- In practice choosing the permutations randomly is hard
- Use a set of n one-to-one functions f_i and keep only the smallest value for each f_i
- Keep only the j least significant bits of each value
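A minimal sketch of such a signature, under several assumptions not stated on the slide: the one-to-one functions are f_i(x) = (a_i·x + b_i) mod p for a fixed prime p, CRC32 stands in for the Rabin fingerprint, and chunks are 3-word windows. All names and default values are hypothetical.

```python
import random
import zlib

# Mersenne prime modulus: x -> (a*x + b) % PRIME is one-to-one on [0, PRIME) when a != 0.
PRIME = (1 << 61) - 1


def make_functions(n: int, seed: int = 0) -> list[tuple[int, int]]:
    """Draw n (a, b) pairs, each defining one function f_i(x) = (a*x + b) % PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(PRIME)) for _ in range(n)]


def signature(text: str, funcs: list[tuple[int, int]], k: int = 3, j: int = 8) -> list[int]:
    """For each f_i keep the smallest value over all chunks, truncated to j low bits."""
    words = text.split()
    # Fall back to a single chunk when the text has fewer than k words.
    chunks = {tuple(words[i:i + k]) for i in range(len(words) - k + 1)} or {tuple(words)}
    # CRC32 is only a stand-in for the Rabin fingerprint named in the talk.
    prints = [zlib.crc32(" ".join(c).encode("utf-8")) for c in chunks]
    mask = (1 << j) - 1
    return [min((a * x + b) % PRIME for x in prints) & mask for a, b in funcs]


funcs = make_functions(16)
sig1 = signature("I am away from 17/2/2011 to 19/2/2011. Please mail xyz@in.ibm.com in case of any need", funcs)
sig2 = signature("I am away from 26/7/2011 to 31/7/2011. Please mail abc@us.ibm.com in case of any need", funcs)
agree = sum(x == y for x, y in zip(sig1, sig2))
print(f"{agree} of {len(funcs)} signature positions agree")
```

Keeping fewer bits shrinks the index but lets unrelated values collide, so the number of agreeing positions tends to overestimate similarity when j is small.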

11 Discovering Automated Messages
- Generating groups of near duplicates: Index Based Clustering
  - For each document d in index I do
    - If d is not covered
      - Let S = {S_1, S_2, …, S_n} be the signature of document d
      - D = Query(I, atleast(S, k))
      - For each document d' in D
        - d' is covered
- Discovering Groups of Automated Messages
  - Automated messages, groups of bulk emails, groups of forwarded emails
- Use MD5 to detect bulk emails. Emails with one segment are automated messages
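The clustering loop above can be sketched as follows. This assumes that atleast(S, k) means "agrees with S in at least k signature positions" and that the index is an in-memory inverted index. Both are illustrative guesses, as is the choice to leave already covered documents out of later groups.

```python
from collections import Counter, defaultdict


def cluster_by_signature(signatures: dict[str, list[int]], k: int) -> list[list[str]]:
    """Group documents whose signatures agree in at least k positions.

    Assumes 1 <= k <= signature length, so every document matches itself.
    """
    # Inverted index: (position i, value S_i) -> ids of documents with that value there.
    index: dict[tuple[int, int], set[str]] = defaultdict(set)
    for doc_id, sig in signatures.items():
        for pos, value in enumerate(sig):
            index[(pos, value)].add(doc_id)

    covered: set[str] = set()
    groups: list[list[str]] = []
    for doc_id, sig in signatures.items():
        if doc_id in covered:
            continue
        # Emulate Query(I, atleast(S, k)) by counting agreeing positions per document.
        hits: Counter = Counter()
        for pos, value in enumerate(sig):
            hits.update(index[(pos, value)])
        group = sorted(d for d, count in hits.items() if count >= k and d not in covered)
        covered.update(group)
        groups.append(group)
    return groups


sigs = {"a": [1, 2, 3, 4], "b": [1, 2, 3, 9], "c": [7, 7, 7, 7], "d": [7, 7, 7, 4]}
print(cluster_by_signature(sigs, k=3))  # [['a', 'b'], ['c', 'd']]
```

Each document triggers at most one query, and documents already covered by an earlier group are skipped, which is what keeps the pass over the index cheap.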

12 Detecting Semantic Groups: Email Threads
- A tree-like structure
- A link denotes that the child node was written as a reply to the parent node.
- Captures the context in which an email was written

13 Detecting Email Threads
- Metadata based methods
  - Headers are not consistently used
- Content of old mail remains in the new mail
  - A segment contains text of only one communication
- An email e_i contains e_j iff e_i approximately contains all the segments of e_j

14 Method for Thread Detection (© 2007 IBM Corporation)
- Email Segment Generator (ESG):
  - Creates segments of an email, where each segment contains the content of only one email.
- Segment Signature Generator (SSG):
  - Generates a signature for a segment
  - Use near duplicate signatures
- For practical implementation, we limit the number of segment signatures (N) that can be associated with an email, e.g. 20 segments.
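The slides do not say how the ESG finds segment boundaries. A rough sketch, assuming quoted messages are introduced by separator lines such as "-----Original Message-----" or "On ... wrote:". The pattern, the function name, and the choice to keep the first N segments are all hypothetical.

```python
import re

# Hypothetical separator lines; real mail clients produce many more variants.
SEPARATOR = re.compile(
    r"^[ \t]*(?:-+[ \t]*Original Message[ \t]*-+|On .+ wrote:)[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)


def email_segments(body: str, max_segments: int = 20) -> list[str]:
    """Split an email body into segments, newest first, at most max_segments of them."""
    parts = [part.strip() for part in SEPARATOR.split(body)]
    # Drop empty pieces, then apply the cap N on segments per email.
    return [part for part in parts if part][:max_segments]


body = (
    "Sounds good.\n\n"
    "-----Original Message-----\n"
    "Can we meet Friday?\n\n"
    "-----Original Message-----\n"
    "Project kickoff is next week."
)
print(email_segments(body))
```

Each returned segment would then be passed to the SSG, for example by computing a near duplicate signature over its text.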

15 Method: Processing at Indexing Time
[Figure: indexing pipeline. Labels: words w1, w2, wn; Word index; ESG; SSG; Meta index; Signature index]

16 Method: Processing at Query Time
[Figure: query pipeline. Labels: query q; words w1, w2, wn; Word index; Meta index; Signature index; Generating Candidate Thread Set; Use Signature Of First Segment]

17 Detecting Email Threads
- Given a Candidate Thread Set
  - Identify the email with only the root segment
  - An email e_c is a child of an email e_p if e_c minimally contains e_p
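One way to read these rules, sketched under assumptions: each email is a list of segment signatures, exact set inclusion stands in for approximate containment, and "minimally contains" is taken to mean that the largest email contained in e_c is its direct parent. The function and variable names are illustrative.

```python
from typing import Optional


def build_thread(candidates: dict[str, list[str]]) -> dict[str, Optional[str]]:
    """Map each email id to its parent id, or None when no contained email exists."""
    seg_sets = {eid: frozenset(segs) for eid, segs in candidates.items()}
    parents: dict[str, Optional[str]] = {}
    for child, child_segs in seg_sets.items():
        # Emails whose segments all occur in the child (proper subset, so not the child itself).
        contained = [eid for eid, segs in seg_sets.items() if segs < child_segs]
        # Assumed reading of "minimally contains": the largest contained email is the parent.
        parents[child] = max(contained, key=lambda eid: len(seg_sets[eid]), default=None)
    return parents


# Toy candidate thread set: segment signatures are listed newest first, root segment last.
thread = {
    "e1": ["s1"],
    "e2": ["s2", "s1"],
    "e3": ["s3", "s2", "s1"],
    "e4": ["s4", "s1"],
}
print(build_thread(thread))
```

In this toy set e1 holds only the root segment and gets no parent, e2 and e4 both reply to e1, and e3 replies to e2 rather than to e1 because e2 is the larger contained email.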

18 Creating Semantic Categories
- Focus Categories
  - Documents that are likely to be responsive
  - Legal Content, Financial Communication, Intellectual Property
  - High recall
- Filter Categories
  - Documents that are likely to be unresponsive
  - Bulk emails, Private communication, Jokes
  - High precision

19 Creating Semantic Categories
- Email Segmentation
- Pattern based annotation: Use System T based method
- Consolidation
  - Each concept is independent
  - Apply additional constraints over concepts

20 Experiments – Near Duplicate Detection
- Enron Corpus
  - 517K emails from 150 users
- Measuring precision
  - Manually evaluated near duplicate set for 500 queries
  - With more bits precision is 100% even with 40% similarity threshold
- Only 33.3% of emails are unique

21 Experiments – Email Thread Detection
- No ground truth for threads
- Subject approximation method: Based on “Re:”, “Fw:” etc. in subject
- Manually verified the thread results for our method and the subject approximation method
  - The union of correct emails in thread for both approaches is treated as ground truth.
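The transcript does not preserve the result tables or name the metrics. Assuming precision and recall are reported, the pooled ground truth described above could be used as in this sketch. The email ids are made-up placeholders, not data from the experiments.

```python
def evaluate(results: dict[str, tuple[set[str], set[str]]]) -> dict[str, tuple[float, float]]:
    """results maps method name -> (returned emails, manually verified correct emails).

    Returns method name -> (precision, recall), with recall measured against the
    union of the correct emails found by all methods.
    """
    truth: set[str] = set()
    for _, correct in results.values():
        truth |= correct  # pooled ground truth

    scores = {}
    for name, (returned, correct) in results.items():
        precision = len(correct) / len(returned) if returned else 0.0
        recall = len(correct) / len(truth) if truth else 0.0
        scores[name] = (precision, recall)
    return scores


# Placeholder ids for one thread, for illustration only.
example = {
    "segment_based": ({"a", "b", "c", "d"}, {"a", "b", "c"}),
    "subject_approximation": ({"a", "e"}, {"a", "e"}),
}
print(evaluate(example))
```

Because the ground truth is pooled from only two methods, recall computed this way is an upper bound: emails that both methods miss are never counted.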

22 Experiments – Semantic Group
- Ground truth: Sampled 2200 emails using generic keywords and then manually labeled them

23 Conclusions
- We developed a framework that allows group level review of documents
- We developed methods for finding syntactic groups, such as automated messages
- We developed methods for finding email threads and semantic groups
- We showed a significant reduction in review time by using group level review, and integrated the proposed techniques with the IBM InfoSphere eDiscovery Analyzer product

