Presentation is loading. Please wait.

Presentation is loading. Please wait.

The Role of Metadata in Machine Learning for TAR Amanda Jones Marzieh Bazrafshan Fernando Delgado Tania Lihatsh Tami Schuyler

Similar presentations


Presentation on theme: "The Role of Metadata in Machine Learning for TAR Amanda Jones Marzieh Bazrafshan Fernando Delgado Tania Lihatsh Tami Schuyler"— Presentation transcript:

1 The Role of Metadata in Machine Learning for TAR Amanda Jones Marzieh Bazrafshan Fernando Delgado Tania Lihatsh Tami Schuyler ajones@h5.com mbazrafshan@h5.com fdelgado@h5.com tlihatsh@h5.com tschuyler@h5.com

2 Metadata Use in TAR – Lack of Consensus 2  It is generally agreed across the industry that metadata is a critical component of ESI for eDiscovery. Some view incorporation of metadata into machine learning algorithm development for TAR as a matter of course. Others view it as atypical, if not incompatible, with machine learning approaches to document classification.

3 Metadata in TAR – Goals of the Study 3  If metadata provides information that is vital for manual document review in eDiscovery, though, why would it be any less valuable for TAR?  Goals of the current study: 1.Establish the potential benefit of incorporating metadata into TAR algorithm development processes. 2.Establish the potential benefit of leveraging widely inclusive sets of metadata, as opposed to limited pre-determined sets. 3.Establish the potential benefit of integrating metadata using techniques that preserve the added layer of information associated with metadata values.

4 Metadata in TAR – Data & Methods 4  3 Distinct Data Sets: 1 drawn from Topic 301 of the TREC 2010 Interactive Task 2 proprietary business data sets Random sample of 4500 individually labeled documents for each Split into a 3000 document Control Set and a 1500-document Training Set  All machine learning models developed using an open source Support Vector Machine (SVM) implementation  Performance Metric: Area Under the Receiver Operating Characteristic Curve (AUROC)

5 Metadata in TAR – Metadata Choices 5  Metadata availability varied across data sets  Fields were chosen opportunistically based on availability and amenability to feature transformation Fields that were populated for fewer than 5% of the documents were omitted Continuous metadata values were transformed into categorical values. o For example, date values were collapsed into simple Month-Year values ; file size values were assigned to categories ranging from very small to very large.

6 Metadata in TAR – Metadata Choices 6  Standard Metadata: Author Sender, Recipient, Cc Subject, Title, File Name Document Type, File Extension Sent Date, Created Date Sender Domain, Recipient Domain  Extended Metadata (all of the above, plus): All Custodians, Primary Custodian Record Type Attachment Name Bates Prefix Drop Id Company/Organization Native File Size, Text Size Normalized Date, Parent Date Family Count, Attachment Count Recipient Count, Cc Count, Combined Recipient Count Page Count

7 Metadata in TAR – Incremental Testing 7  Hypothesis 1: Incorporating metadata into the machine learning process will lead to improved model performance. Text from Standard Metadata added to the body text of documents o There was a general trend of improvement across the three data sets. The improvement was highly significant for Data Set 3.  Hypothesis 2: Incorporating the text from Extended Metadata will lead to superior results as compared to incorporating Standard Metadata alone. Text from Extended Metadata added to the body text of documents – compared to models based on addition of Standard Metadata o There was a general trend of improvement across the three data sets. The improvement was highly significant for Data Sets 1 and 3.

8 Metadata in TAR – Incremental Testing 8  Hypothesis 3: Using metadata values in ways that preserve both attribute and value information will result in superior performance. Extended Metadata values prefixed to indicate their field origins added to body text – compared to models with Extended Metadata added as plain text o Improvements varied across the three data sets, but significant for Data Set 2. Dual modeling – prefixed metadata values and simple body text modeled independently, scores from two models multiplied to arrive at a final score – dual models compared to single models with prefixed Extended Metadata o There was a general trend of improvement across the three data sets. The improvement was highly significant for Data Sets 2 and 3.

9 9  Stepping back from incremental pairwise comparisons - clearer answers and more striking differences emerge  Models incorporating Extended Metadata significantly outperformed models based on body text alone in each condition for every data set. Overall Findings – MD Can Improve TAR

10 10  Similarly strong trends can be observed when each model created using Standard Metadata is compared to its Extended Metadata counterpart.  Extended Metadata improvements were highly significant in all cases for Data Sets 1 and 3 and significant for the dual model in Data Set 2. Overall Findings – More MD Is Better

11 11  Incorporating metadata as an integral component of machine learning processes for TAR in eDiscovery will benefit the community of practice. Neglecting this resource is – at best – a missed opportunity. In an information retrieval effort, why leave information on the table?  To realize the full potential of using metadata in machine learning for TAR, practitioners should not rely solely on a limited intuitive set of metadata. Examining the contributions of specific metadata fields at a more granular level could be very worthwhile. o Is “all available” always the best choice?  There are still more questions than answers when it comes to the use of metadata in modeling for TAR. o More effective algorithms? o Better techniques for capturing the full metadata contribution? Metadata in TAR – Conclusions

12 12 Questions?


Download ppt "The Role of Metadata in Machine Learning for TAR Amanda Jones Marzieh Bazrafshan Fernando Delgado Tania Lihatsh Tami Schuyler"

Similar presentations


Ads by Google