Planning for the TREC 2008 Legal Track Douglas Oard Stephen Tomlinson Jason Baron.

Agenda
Track goals
Deciding on a document collection
“Beating Boolean”
Handling nasty OCR
Making the best use of the metadata
Ad hoc task design
Interactive task design
Relevance feedback task design
Other issues

Track Goals
Develop a reusable test collection
–Documents, topics, evaluation measures
Foster formation of a research community
Establish baseline results

Choosing a Collection
FERC Enron (w/attachments, full headers)
–Somewhat larger than CMU
– is the real killer app for E-discovery
IIT CDIP version 1.0 (same as 2006/07)
–We have 83 topics. Do we need more?
State Department Cables
–Task model would be FOIA, not E-Discovery

TREC Topic Number: 1
Title: Marketers or Traders of Electricity on the Financial Market
Description: Identify Enron employees who bought and sold electricity on California’s financial (long-term sales) energy market, solely for the purpose of re-buying/re-selling this energy later for a profit.
Narrative: A relevant document must at a minimum identify the name and address of the marketer, as well as the Enron subsidiary to which he/she belonged. The marketer’s phone number would be helpful as well, to help analysis of the corresponding Enron voice dataset. Hint: Enron Power Marketing, Inc. (EPMI), Enron Energy Services, Inc. and Enron Energy Marketing Corporation all appear to have conducted long-term marketing services for Enron. This observation is based on the fact that Enron submitted information for all three of these subsidiaries in its reply to FERC’s data request 2 (DR2). (DR2 asked Enron to submit information about its short-term and long-term sales. Enron replied with data from these three subsidiaries.) (38, pp. 1-2, plus personal analysis.) It would be good, however, to know for sure which entities or persons did marketing at Enron.
Query Possibilities:
(marketer or marketers or “Enron Power Marketing” or EPMI or “Enron Energy Services” or “Enron Energy Marketing Corporation”)
(marketer or marketers or “Enron Power Marketing” or EPMI or “Enron Energy Services” or “Enron Energy Marketing Corporation”) and (MW or KW or watt* or MwH or KwH)
o This is to target electricity sales rather than natural gas sales. All the subsequent electricity queries can be similarly modified.
(marketer or marketers or EPMI) and (short or long)
o As in have a long or short position in sales/purchases.
(marketer or marketers or EPMI) and (NYMEX or CBOT or “Mid-Columbia” or COB or “California-Oregon Border” or “Four Corners” or “Palo Verde” or EOL)
o The electricity futures hubs were Mid-Columbia, COB, Four Corners, and Palo Verde, as best the author can tell. (85) NYMEX and CBOT ran these. (89; 15, p. 78)
o EOL was the forward market trading place. (36, p. 3)
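The second query possibility above can be sketched in code. This is a minimal illustration only, not track infrastructure: `boolean_query` and the term lists are hypothetical names, matching is naive case-insensitive substring search, and the `watt*` wildcard is approximated as a prefix substring.

```python
def matches(text, terms):
    """True if any of the given terms/phrases occurs in the text (case-insensitive)."""
    lowered = text.lower()
    return any(term.lower() in lowered for term in terms)

# The first OR-group: marketer terms and Enron subsidiary names.
MARKETER_TERMS = ["marketer", "marketers", "Enron Power Marketing", "EPMI",
                  "Enron Energy Services", "Enron Energy Marketing Corporation"]
# The second OR-group: electricity units, to exclude natural-gas sales.
ELECTRICITY_TERMS = ["MW", "KW", "watt", "MwH", "KwH"]

def boolean_query(text):
    """Conjunction of the two OR-groups, as in the second query possibility."""
    return matches(text, MARKETER_TERMS) and matches(text, ELECTRICITY_TERMS)
```

A ranked-retrieval run could then be compared against the document set this query admits.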

Identity Modeling in Enron
(Figure: mention variants for one person, e.g. “susan m scott”, “suebob”, “susan scott”, “sue”, “sscott5”, “susan”)
Statistics: members 66,715 · models 82,084 · addr-name 3,151 · addr-nickname 19,708 · addr-addr

Enron Identity Test Collections

Collection    | Identities | Mention Candidates | Queries | Min. | Avg. | Max.
Sager         | 1,…        | …                  | …       | …    | …    | …
Shapiro       | …          | …                  | …       | …    | …    | …
Enron-subset  | 54,018     | 27,…               | …       | …    | …    | …
Enron-all     | 248,451    | 123,…              | …       | …    | …    | …

Example Document
Metadata:
Title: CIGNA WELL-BEING NEWSLETTER - FUTURE STRATEGY
Organization Authors: PMUSA, PHILIP MORRIS USA
Person Authors: HALLE, L
Document Date:
Document Type: MEMO, MEMORANDUM
Bates Number: /9377
Page Count: 2
Collection: Philip Morris
OCR text (errors verbatim): Philip Moxx's. U.S.A. x.dr~am~c. cvrrespoaa.aa Benffrts Departmext Rieh>pwna, Yfe&ia Ta: Dishlbutfon Data aday 90,1997. From: Lisa Fislla Sabj.csr CIGNA WeWedng Newsbttsr - Yntsre StratsU During our last CIGNA Aatfoa Plan meadng, tlu iasuo of wLetSae to i0op per'Irw+ng artieles aod discontinue mndia6 CIGNA Well-Being aawslener to om employees was a msiter of disanision. I Imvm done somme reaearc>>, and wanted to pruedt you with my Sadings and pcdiminary recwmmeadatioa for PM's atratezy Ieprding l4aas aewelattee*. I believe.vayone'a input is valusble, and would epproolate hoarlng fmaa aaeh of you on whetlne you concur with my reeommendatioa …
(Slide panels: Scanned | OCR | Metadata)

State Department Cables
791,857 records – 550,983 of which are full text


Handling Nasty OCR
Index pruning
Error estimation
Character n-grams
Duplicate detection
Expansion using a cleaner collection
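Of the options listed for nasty OCR, character n-grams are a common way to soften exact-match failures on noisy text like the Philip Morris example above. A minimal sketch (the function names are illustrative, not from any track tool):

```python
def char_ngrams(text, n=4):
    """Set of overlapping character n-grams, with case and whitespace normalized."""
    s = " ".join(text.lower().split())
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def ngram_similarity(a, b, n=4):
    """Dice coefficient over character n-gram sets: tolerant of scattered
    OCR character errors, since most n-grams still match."""
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    if not ga or not gb:
        return 0.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))
```

Indexing n-grams instead of whole words lets a query term partially match its OCR-corrupted variants (e.g. “newsletter” still shares most 4-grams with “newsbttsr”).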

How to “Beat Boolean”
Work from reference Boolean?
–Swap out low-ranked-in for high-ranked-out
Relax Boolean somehow?
–Cover density, proximity perturbation, …
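The “swap out low-ranked-in for high-ranked-out” idea can be sketched as follows, assuming a hypothetical `scores` mapping from some ranked-retrieval model (all names here are illustrative):

```python
def swap_strategy(boolean_hits, scores, k):
    """Start from the reference Boolean result set, then replace its k
    lowest-scored documents with the k highest-scored documents outside it.
    scores: dict mapping docid -> retrieval score."""
    inside = sorted(boolean_hits, key=lambda d: scores.get(d, 0.0))
    outside = sorted((d for d in scores if d not in boolean_hits),
                     key=lambda d: scores[d], reverse=True)
    kept = set(inside[k:])          # drop the k weakest Boolean hits
    swapped_in = set(outside[:k])   # admit the k strongest non-hits
    return kept | swapped_in
```

The result set stays the same size as the Boolean set, which makes set-based comparison against the reference run straightforward.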

Using Metadata
Title (term match)
Author (social network)
Bates number (sequence)

Ad Hoc Task Design
Evaluation measures
Index size?
–Error bars / statistical significance testing
–Limits on post-hoc use of the collection?
–What are “meaningful” differences?
Topic design
–Negotiation transcript?
Inter-annotator agreement
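For the significance-testing bullet, a paired randomization (sign-flip) test over per-topic scores is one standard option. This sketch uses only the standard library; the score lists are hypothetical inputs, not track data:

```python
import random

def randomization_test(scores_a, scores_b, trials=10000, seed=0):
    """Two-sided paired randomization test on per-topic score differences.
    Returns an estimated p-value for the observed mean difference."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs) / len(diffs))
    extreme = 0
    for _ in range(trials):
        # Under the null hypothesis, each per-topic difference's sign is arbitrary.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(sum(flipped) / len(flipped)) >= observed:
            extreme += 1
    return extreme / trials
```

With enough topics, this gives error bars without the normality assumption of a paired t-test.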

Interactive Track Design
Evaluation measure
–Precision-oriented?
–Recall-oriented?
–Effect of assessor disagreement
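Assessor disagreement is often quantified with Cohen’s kappa over paired binary relevance judgments; a minimal sketch (not a prescribed track measure):

```python
def cohens_kappa(judgments_a, judgments_b):
    """Cohen's kappa for two assessors' binary relevance judgments
    (parallel lists of 0/1 labels over the same documents)."""
    n = len(judgments_a)
    agree = sum(1 for a, b in zip(judgments_a, judgments_b) if a == b)
    p_o = agree / n  # observed agreement
    # Chance agreement from each assessor's marginal label rates
    pa1 = sum(judgments_a) / n
    pb1 = sum(judgments_b) / n
    p_e = pa1 * pb1 + (1 - pa1) * (1 - pb1)
    if p_e == 1.0:
        return 1.0
    return (p_o - p_e) / (1 - p_e)
```

Kappa of 1.0 is perfect agreement; 0.0 is agreement no better than chance, which matters when recall-oriented measures depend on exhaustive assessment.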

Relevance Feedback Task
Evaluation measure
–Residual recall at B_Residual?
Two-stage feedback?
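Residual evaluation removes the documents already judged in the feedback round before scoring the second-stage run. One plausible reading of “residual recall at B” can be sketched as follows (hypothetical function, assuming set-valued judged/relevant inputs):

```python
def residual_recall_at_b(ranking, judged, relevant, b):
    """Residual recall at cutoff b: drop documents already seen in the
    feedback round (judged), then measure what fraction of the remaining
    relevant documents appear in the top b of the residual ranking."""
    residual = [d for d in ranking if d not in judged]
    remaining_relevant = relevant - judged
    if not remaining_relevant:
        return 0.0
    found = sum(1 for d in residual[:b] if d in remaining_relevant)
    return found / len(remaining_relevant)
```

Scoring only the residual prevents a system from earning credit for simply replaying the documents it was told were relevant.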

Some Open Questions
Test collection reusability
–Unbiased estimates? Tight error bars?
Why can’t we beat Boolean???
–Different strategies? Detailed failure analysis?
Can we improve topic formulation?
–Structured relevance feedback?
Is OCR masking effects we need to see?
–Is it time for a new collection?
–Must it be de-duped? Is metadata needed?
Does Δscope invalidate the interactive task?