October 6, 2008 iSchool Colloquium No, Not That PMI: Creating Search Technology for E-Discovery Jason Baron, 1,4 Douglas W. Oard, 1,3 Tamer Elsayed 2,3.


1 October 6, 2008 iSchool Colloquium No, Not That PMI: Creating Search Technology for E-Discovery Jason Baron, 1,4 Douglas W. Oard, 1,3 Tamer Elsayed 2,3 and Lidan Wang 2,3 1 College of Information Studies 2 Computer Science Department 3 Institute for Advanced Computer Studies University of Maryland, College Park Plus thanks to: Simon Attfield, David Lewis, Paul Thompson, Stephen Tomlinson, Feng Zhou

2 U.S. v. Philip Morris et al. Civil lawsuit brought by Clinton Administration against tobacco companies in 1999 Racketeering allegation that companies have conspired since 1953 to defraud the American public as to the true health effects of smoking 1,726 Requests to Produce from tobacco companies for tobacco-related records (including email) from 30 federal agencies 32 million Clinton-era email records held by National Archives

3 Query Terms tobacco cigarette smoking nicotine Smokeless Synar Amendment Philip Morris R.J. Reynolds BAT Industries Liggett group Brown and Williamson Liggett PMI –(Philip Morris Institute) MSA –(Master Settlement Agreement) ETS –(Environmental Tobacco Smoke) B&W –(Brown & Williamson) TI (Tobacco Institute) … Round 1 Round 2

4 Suppressing False Positives Upper Marlboro, Maryland Presidential Management Intern (PMI) program Medical Savings Accounts (MSA) Metropolitan Standard Area (MSA) Educational Testing Service (ETS) Black & White photos (B&W) TI...

5 False Positives Relevant Smoking Policy Emails OMB VP Chief of Staff Ron Klain Office of the U.S. Trade Rep. White House Counsel

6 Final Boolean Query ( ((master settlement agreement OR msa) AND NOT (medical savings account OR metropolitan standard area)) OR s. 1415 OR (ets AND NOT educational testing service) OR (liggett AND NOT sharon a. liggett) OR atco OR lorillard OR (pmi AND NOT presidential management intern) OR pm usa OR rjr OR (b&w AND NOT photo*) OR phillip morris OR batco OR ftc test method OR star scientific OR vector group OR joe camel OR (marlboro AND NOT upper marlboro) … )
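The "AND NOT" clauses in this query can be illustrated with a minimal Python sketch. It assumes document-level matching on whole words, ignoring case; the helper names `has` and `matches` are hypothetical, and only three of the slide's clauses are reproduced.

```python
import re

def has(text: str, phrase: str) -> bool:
    """True if the phrase occurs in the text as whole words (case-insensitive)."""
    pattern = r"(?<!\w)" + re.escape(phrase) + r"(?!\w)"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None

def matches(text: str) -> bool:
    """Evaluate three clauses of the slide's query against one document."""
    return (
        (has(text, "pmi") and not has(text, "presidential management intern"))
        or (has(text, "marlboro") and not has(text, "upper marlboro"))
        or (has(text, "ets") and not has(text, "educational testing service"))
    )

print(matches("PMI funded the ETS study"))
print(matches("Apply to the Presidential Management Intern (PMI) program"))
print(matches("Marlboro sales near Upper Marlboro"))
```

Under this reading, the third example is rejected even though it mentions the brand, because the excluded phrase appears anywhere in the document. That is the recall cost of suppressing false positives with AND NOT.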

7 National Archives: Clinton White House Tobacco Policy search request. 32 million emails, 200,000, 80,000; hired 25 persons for 6 months …

8 Federal Rules of Civil Procedure (as amended 12/1/06) Rule 26(f) At the parties’ planning meeting, issues expected to be discussed include: –“Any issues relating to disclosure or discovery of electronically stored information, including the form or forms in which it should be produced” –“Any issues relating to preserving discoverable information”

9 Recent Case Law Ameriwood Industries, Inc. v. Liberman, 2007 WL 685623 (E.D. Mo.) (court orders expert report with number of “hits” based on negotiated search terms, with expectation that parties will continue to meet and confer to refine search based on false positives) Williams v. Taser Intern, Inc., 2007 WL 1630875 (N.D. Ga.) (court adjudicates search protocol with keywords plus use of simple Boolean operators) 6/1/07: First published legal opinion in U.S. discussing difference between “keyword” and “concept” searching. Disability Rights Council of Greater Washington, et al. v. Washington Metropolitan Transit Authority, 242 F.R.D. 139 (D.D.C. 2007)

10 Text Retrieval Conference (TREC) Goals –Foster development of research communities –Create “benchmark” evaluation resources –Establish “baseline” results History –Sponsored by NIST since 1992 –“Legal Track” started in 2006 with an E-Discovery focus

11 Desiderata in the Legal Realm Two-party –Negotiated (not one-sided) information needs Recall-oriented –“Smoking gun detection” + completeness Explainable –Quantifiable comparison to present best practice Affordable –Minimize amount of human review on back end

12 IIT CDIP Document Collection UCSF Legacy Tobacco Documents Library –6,877,327 documents released in lawsuits –Variety of corporate document types –Range of printing technologies and handwriting IIT CDIP v1.0 Document Collection –OCR: 50 GB –Metadata: 5 GB (XML) Scanned documents used for assessment –42 million TIFF page images: 1.5 TB

13 Sample Document


15 Example Document (panels: Scanned, OCR, Metadata)
Metadata: Title: CIGNA WELL-BEING NEWSLETTER - FUTURE STRATEGY Organization Authors: PMUSA, PHILIP MORRIS USA Person Authors: HALLE, L Document Date: 19970530 Document Type: MEMO, MEMORANDUM Bates Number: 2078039376/9377 Page Count: 2 Collection: Philip Morris
OCR (errors as in the original output): Philip Moxx's. U.S.A. x.dr~am~c. cvrrespoaa.aa Benffrts Departmext Rieh>pwna, Yfe&ia Ta: Dishlbutfon Data aday 90,1997. From: Lisa Fislla Sabj.csr CIGNA WeWedng Newsbttsr - Yntsre StratsU During our last CIGNA Aatfoa Plan meadng, tlu iasuo of wLetSae to i0op per'Irw+ng artieles aod discontinue mndia6 CIGNA Well-Being aawslener to om employees was a msiter of disanision. I Imvm done somme reaearc>>, and wanted to pruedt you with my Sadings and pcdiminary recwmmeadatioa for PM's atratezy Ieprding l4aas aewelattee*. I believe.vayone'a input is valusble, and would epproolate hoarlng fmaa aaeh of you on whetlne you concur with my reeommendatioa …

16 “Complaints” Drafted by The Sedona Conference® lawyers: (1) Wrongful death and products liability action based on the use of a certain type of radioactive phosphates resulting in contaminated candy as well as in drinking water; (2) Patent infringement action on a device named “Suck out the Bad, Blow in the Good,” designed to ventilate smoke; (3) Shareholder class action suit alleging securities fraud and false advertising in connection with a fictional “Smoke Longer, Feel Younger” campaign relying on ’60s-era folk music; (4) Fictional Justice Department antitrust investigation looking into a planned merger and acquisition of a casualty and property insurance company by a tobacco company.

17 No. 2006-3 July 1, 2006 THE DISTRICT COURT OF THE COMMONWEALTH OF NEW SEARCHLAND John Doe, et al. v. Echinoderm Cigarettes, et al. John Doe, on behalf of the Organization of Concerned Parents, brings this action to force the defendant tobacco companies to cease all placement of tobacco products, brands, and logos in television, film, live theater and rock concerts (collectively referred to as the "public media"). The historical placement of tobacco products and branding in the public media has forced an increase in product awareness, particularly among young adults and children, by providing consistent and recurring exposure to on-screen situations that generally glamorize smoking and other tobacco use. Plaintiff John Doe brings this action on behalf of a nationwide class of individuals injured in childhood and adulthood by defendants' actions. Mr. Doe resides at 1004 Public Avenue, Commonwealth of New Searchland. Defendants are Echinoderm Cigarettes and other unnamed tobacco companies, with principal places of business in the Commonwealth of New Searchland. This Court has jurisdiction pursuant to 1 Comm. New Searchland, Sec. 1956. According to information and belief, Echinoderm Cigarettes and other companies have a long history of placement of tobacco products and brand images in the public media. These media, including television (network and cable), film, live theater, and rock concerts, are regularly viewed by children, teen-agers, and young adults. Such individuals are at the most impressionable time of their lives, and are unknowingly exposed to de facto advertising for tobacco and tobacco-related products simply by watching such media. In particular, the glamorous manner in which smoking and other tobacco use are portrayed on the screen adds a cachet to the habit that encourages young people to try smoking for the first time. 
Thus is exposed the true motivation for product placement - inducing non-smokers to become smokers with blatant disregard for the long term effects and public health risks associated with tobacco use. Echinoderm Cigarettes and other unnamed companies have represented that they do not pay for product placement in the public media. This representation is patently false. Tobacco concerns regularly pay for placement of their products via direct monetary compensation, exchange of goods and services, and other considerations. COUNT I Defendants have engaged in a pattern of misleading practices in violation of state and federal statutes by providing compensation to television networks, production companies, film production companies, providers of live theater and rock concerts in exchange for placement of products and brand images. COUNT II By exposing children and young adults to tobacco products in the public media, and by glamorizing the use of products known to cause health issues, defendants' actions are in violation of applicable law. Declare that Echinoderm Cigarettes and other unnamed defendants are in violation of law by providing compensation for placement of products and brand images, and by exposing children and young adults to tobacco products in the media. Enter an order requiring defendants to disgorge all monies, reimbursement, and payments received as a result of product placement in the public media. Defendants to pay costs and expenses, including attorneys' fees, in connection with the investigation and litigation of this matter.

18 A “Production Request”
Request Number: 52
Request Text: Please produce any and all documents that discuss the use or introduction of high-phosphate fertilizers (HPF) for the specific purpose of boosting crop yield in commercial agriculture.
Proposal: "high-phosphate fertilizer!" AND (boost! w/5 "crop yield") AND (commercial w/5 agricultur!)
Rejoinder: (phosphat! OR hpf OR phosphorus OR fertiliz!) AND (yield! OR output OR produc! OR crop OR crops)
Final Query: (("high-phosphat! fertiliz!" OR hpf) OR ((phosphat! OR phosphorus) w/15 (fertiliz! OR soil))) AND (boost! OR increas! OR rais! OR augment! OR affect! OR effect! OR multipl! OR doubl! OR tripl! OR high! OR greater) AND (yield! OR output OR produc! OR crop OR crops)
B: 3078
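The two operators that distinguish these negotiated queries from plain keyword lists, truncation (`!`) and proximity (`w/k`), can be sketched in Python as follows. This is an illustration under stated assumptions, not the reference Boolean engine used for the track: it assumes `!` means "any token beginning with this stem" and `w/k` means "within k tokens in either direction". The function names are hypothetical.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def term_matches(token, term):
    """'boost!' matches any token starting with 'boost'; other terms match exactly."""
    if term.endswith("!"):
        return token.startswith(term[:-1])
    return token == term

def positions(tokens, term):
    """Token offsets at which the term matches."""
    return [i for i, tok in enumerate(tokens) if term_matches(tok, term)]

def within(tokens, term_a, term_b, k):
    """term_a w/k term_b: some match of each lies within k tokens of the other."""
    pos_b = positions(tokens, term_b)
    return any(abs(i - j) <= k for i in positions(tokens, term_a) for j in pos_b)

tokens = tokenize("Phosphorus was added to the soil to increase crop yields")
print(within(tokens, "phosphorus", "soil", 15))        # clause from the final query
print(within(tokens, "commercial", "agricultur!", 5))  # clause from the proposal
```

The example sentence satisfies the broadened proximity clause from the final query but not the narrower clause from the original proposal, which is the kind of difference the negotiation is about.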

19 - 52 Please produce any and all documents that discuss the use or introduction of high-phosphate fertilizers (HPF) for the specific purpose of boosting crop yield in commercial agriculture. - (("high-phosphat! fertiliz!" OR hpf) OR ((phosphat! OR phosphorus) w/15 (fertiliz! OR soil))) AND (boost! OR increas! OR rais! OR augment! OR affect! OR effect! OR multipl! OR doubl! OR tripl! OR high! OR greater) AND (yield! OR output OR produc! OR crop OR crops) - "high-phosphate fertilizer!" AND (boost! w/5 "crop yield") AND (commercial w/5 agricultur!) (phosphat! OR hpf OR phosphorus OR fertiliz!) AND (yield! OR output OR produc! OR crop OR crops) 3078 2007-A-1 - 1. These requests require the production of all responsive documents within the sole or joint possession, custody or control of the Defendant, including their agents, departments, attorneys, directors, officers, employees, consultants, investigators, insurance companies, or other persons subject to Defendant's custody or control. 2. All documents that respond, in whole or in part, to any portion of these Requests must be produced in their entirety, including all attachments and enclosures. 3. For purposes of these requests, the words used are considered to have, or should be understood to have their ordinary, everyday meanings. Plaintiffs refer Defendant to any dictionary in the event that Defendant asserts that the wording of a request is vague, ambiguous, unintelligible, or confusing. - 4. The words "and," "or," "each," "any," "all," "refer," and "discuss," shall be construed in their broadest form and the singular shall include the plural and the plural shall include the singular whenever necessary so as to bring within the scope of these Requests all documents (defined below) that might otherwise be construed to be outside their scope. 5. 
Solely for the purpose of the TREC 2007 legal track, the term "Defendant" shall include the named defendant companies in this complaint as well as all other companies whose records are found in the TREC collection database. 6. Solely for the purpose of the TREC 2007 legal track, "document" means all data, information or writings stored in the TREC legal database, including, without limitation: any written, electronic or computerized files, data or software; memoranda, emails, correspondence, OCR scanned images, communications, reports, summaries, studies, analyses, evaluations, notes or notebooks, indices, spreadsheets, logs, books, pamphlets, binders, calendar or diary entries, ledger entries, press clippings, graphs, tables, charts, printouts, drawings, maps, meeting minutes, and transcripts. The term document encompasses all metadata associated with the document. The term also includes all drafts associated with any particular document. The term is also intended to include all electronically stored information as the term is used in the Federal Rules of Civil Procedure. 7. The terms "relating to," "regarding," "discussing," or "concerning," shall be synonymous and should be taken to mean in whole or in part constituting, containing, concerning, discussing, describing, analyzing, identifying or stating. 8. The term "high-phosphate fertilizers" (HPF) shall refer to any high phosphate fertilizer, including, but not limited to calcium phosphate fertilizers and superphosphate fertilizers. In some instances, "high-phosphate" fertilizers will be subsumed in the definition of "phosphatic fertilizers." However, phosphatic fertilizers are a more general term for fertilizers containing phosphate and the phosphate concentration of various phosphatic fertilizers is likely to vary. 9. The term "Maleic Hydrazide" (MH) refers to a pesticide that is sprayed on sugar beets for the purpose of decreasing sugar loss in beet roots. - 2007-A July 1, 2007 U.S. 
DISTRICT COURT SOUTHERN DISTRICT OF GLADSHEIM MR & MRS. N. EINHERJAR, individually and on behalf of the Estate of DRIFA EINHERJAR, a minor, and the CITY AND COUNTY OF VALHALLA, a government entity. GULLINKAMBI CANDY CO., a Gladsheim corporation; VIKING SUGAR FARMS, a Gladsheim corporation; and U.S. BEET SUGAR ASSOCIATION, a nationwide association with local chapters in Gladsheim. - 1. Plaintiffs Mr. and Mrs. N. Einherjar bring this action individually and on behalf of the estate of their deceased daughter Drifa Einherjar. These plaintiffs and the City and County of Valhalla (collectively referred to as "Plaintiffs") bring this action against Defendants Gullinkambi Candy Co. (GCC), Viking Sugar Farms (VSF), and the U.S. Beet Sugar Association (BSA) (hereinafter referred to collectively as "Defendants," or individually by their respective acronyms). This complaint seeks equitable and injunctive relief for the use of lethal substances in the production of VSF sugar, resulting in the death of a child and contamination of the Valhalla County groundwater. This complaint additionally seeks damages for strict products liability and failure to warn against GCC for the use of and failure to disclose lethal substances contained in its candy. Finally, this complaint seeks treble and punitive damages for fraud and conspiracy in violation of the Racketeer Influenced and Corrupt Organizations Act (RICO), 18 U.S.C. (sec) 1962 for Defendants' collective and organized concealment of lethal substances from Plaintiffs, resulting in the death of a child and massive contamination of Valhalla County's sole source of drinking water. - Plaintiffs, Mr. and Mrs. N. Einherjar, are residents of Valhalla, Gladsheim, and their deceased daughter, on whose behalf they are suing, was also a Valhalla resident. 2. Defendants GCC and VSF are both Gladsheim Corporations with principal places of business in Valhalla, Gladsheim. The U.S. 
Beet Sugar Association has local chapters in Valhalla, Gladsheim, and directs the actions of VSF. - All events giving rise to this incident took place in Valhalla, Gladsheim. Therefore, jurisdiction of this court is proper. - 3. Defendant VSF uses high-phosphate fertilizers (HPF) (sometimes referenced as phosphate fertilizers) to increase the flavor of its sugar beets. HPF contains traces of radioactive elements that remain as a byproduct of phosphate extraction. Phosphate used in HPF is taken from a rock mineral called Apatite which also contains radioactive radium. The resulting Apatite powder therefore contains traces of radioactive elements that become incorporated into HPF. Studies have shown that health problems caused by HPF include immune disorders, toxic myopathy, chronic fatigue syndrome, liver dysfunctions, irregular heart-beat, reactive depression, and memory loss. In addition to using HPF, VSF sprays its sugar beets with Maleic Hydrazide (MH) to decrease the loss of sugar content in its sugar beet crop. MH has been shown to cause renal dysfunction in laboratory mice and to eventually lead to death. 4. In 1933, the U.S. Beet Sugar Association conspired with cane-growers in Hawaii to form a powerful sugar cartel that controlled Congress through a strong sugar lobby. Together, the American sugar growers united to create an underground sugar-trade brotherhood secretly referred to as "The Sugar Program." Members of the brotherhood contributed large sums of money to hire sugar-interest lobbyists who successfully brought about a series of favorable Sugar Acts beginning in 1934 and continuing to the present day. The Sugar Program brotherhood has also been successful in preventing Congress from regulating HPF or MH. 5. 
For the past five years, the BSA has served as elected leader of The Sugar Program, and has been given the responsibility for regulating the actions of the brotherhood members and for approving all major contracts and actions taken by members under its control. 6. Defendant GCC is a candy company that uses VSF sugar in all of its candy. As part of its contract with VSF, GCC agreed to conceal the levels of HPF and MH contained in VSF sugar from its consumers in exchange for an exclusivity provision and a discount on the wholesale price of its sugar. GCC therefore omitted warnings about HPF and MH from its candy labels. 7. As a result of Defendants' collective actions and omissions an eight-year old girl died from consuming a piece of GCC candy and the Valhalla community as a whole has been harmed by the contamination of their drinking water with HPF and MH. - FIRST CAUSE OF ACTION Wrongful Death 8. On March 23, 2007, decedent Drifa Einherjar (hereinafter "Decedent") purchased a piece of GCC candy for $0.67 from the GCC store on Main Street, Valhalla, Gladsheim. At the time of purchase, Decedent was not warned or informed of any dangers of eating the candy and there were no warnings on the candy wrapper or labels of the candy bag. 9. GCC knew that VSF used HPF and MH in its sugar production process. Despite this knowledge, GCC contractually agreed to conceal the presence of HPF and MH in its candy as a condition of its agreement with VSF, in exchange for a discount on its bulk sugar purchases. 10. As a direct and proximate result of these stated acts and omissions, Decedent consumed a piece of GCC candy containing HPF and MH, resulting in her death on March 24, 2007. Decedent ate the candy in a manner in which it was intended to be eaten, and received no instructions from any agents of GCC to exercise caution or to eat the candy in any other way. SECOND CAUSE OF ACTION Strict Tort Liability 11. 
The aforementioned candy and VSF sugar used as a primary ingredient in the candy were unreasonably dangerous to human health due to their high content of HPF and MH. 12. Defendants GCC and VSF knew of this health risk and notwithstanding that knowledge, concealed these dangers from the consuming public. 13. As a result of the HPF and MH contained in GCC candy, Decedent died within 24 hours of consuming a single piece of GCC candy. THIRD CAUSE OF ACTION Public Nuisance (Against Defendant VSF only) 14. Defendant VSF's method of sugar beet farming creates a public nuisance that unreasonably endangers the health of all Valhalla residents by contaminating their groundwater. 15. By continuing to use HPF and MH in its sugar beet production and by failing to use the standard method of limestone quicklime phosphate precipitation in the treatment of its waste-water, VSF continues to contaminate the groundwater and will continue to endanger the health of Valhalla residents. The harm to Valhalla residents will continue until an injunction is issued to stop the use of HPF and MH or to require implementation of the limestone quicklime wastewater treatment to minimize contamination. 16. As a direct and proximate cause of Defendant's acts and omissions, residents of Valhalla have unknowingly ingested harmful substances from their contaminated water supply. FOURTH CAUSE OF ACTION Failure to Warn 17. VSF, as a sugar beet farm that uses HPF and MH, had a duty to issue warnings to Plaintiffs and the general public about the presence of HPF and MH in its sugar and the corresponding health risks that these substances posed in groundwater or direct consumption. 18. Defendants VSF and GCC knew, or with the exercise of reasonable care, should have known that HPF contained radioactive substances and that MH added to the diet of mice, resulted in renal dysfunction and eventual death. 
Despite this knowledge, no information was offered to the Valhalla Community about the potential hazards of HPF, the lethal nature of MH used in VSF's sugar production, or the presence of HPF or MH in GCC candy. 19. At all times relevant to this litigation, Defendants VSF and GCC had actual and/or constructive knowledge of the dangers mentioned above. Despite this knowledge, VSF continued to operate its sugar beet plant with reckless disregard for the community around it by contaminating their groundwater and GCC continued to sell candy containing HPF and MH in reckless disregard for the life of children whom it targeted in its advertising campaigns and who therefore could be expected to purchase and consume GCC candy. 20. VSF breached its duty to warn the community about HPF and MH groundwater contamination and GCC breached its duty to warn consumers of the HPF and MH in its candy. 21. Defendant VSF's failure to warn has resulted in the contamination of Valhalla County's drinking water and the endangerment of the health of Valhalla residents. 22. GCC's failure to warn resulted in the death of a child and the illness of several others. FIFTH CAUSE OF ACTION Conspiracy and Fraud in Violation of the Racketeer Influenced and Corrupt Organizations Act (RICO), 18 U.S.C. (sec) 1962, and Request for Treble Damages. 23. Defendants VSF, GCC, and BSA engaged in a conspiracy to defraud by collectively agreeing to conceal the presence and adverse health effects of HPF and MH from the American public, the Valhalla community and Plaintiffs in particular. 24. In 1933, Defendants formed a sugar cartel secretly known as "The Sugar Program" which successfully lobbied Congress in passing favorable sugar laws and prevented the regulation of HPF and MH in commercial agriculture. 25. All three Defendants contributed financially to a lobbying fund aimed at fighting HPF and MH regulation and obtaining the passage of favorable "Sugar Acts." 26. 
For the past five years, the BSA has led lobbying efforts and approved all actions of The Sugar Program brotherhood. 27. BSA spearheaded the movement to discourage written warnings about HPF and MH, and approved the VSF contract with GCC which provided for a reduction of GCC's wholesale sugar price, and a favorable exclusivity provision between VSF and GCC, under the condition that GCC refrain from publishing warnings about HPF and MH on its product labels. 28. As a result of this collective action to defraud the public, Plaintiffs have suffered injuries indicated above. Treble damages are therefore appropriate under RICO to punish the conspiratorial nature of Defendants' planned concealment of known health risks presented by HPF and MH from the Valhalla community and from Plaintiffs, resulting in the death of a child. SIXTH CAUSE OF ACTION Negligence 29. Defendant VSF had a duty to the Valhalla community and to Plaintiffs to refrain from contaminating their groundwater and to provide warnings about the known health hazards associated with HPF and MH which it used in the production of its sugar beets. 30. Defendant GCC had a duty to the Valhalla community and to Plaintiffs to disclose the known levels of HPF and MH in VSF sugar which it used as a primary ingredient in its candy. 31. Defendant BSA had a duty to compel members of the brotherhood under its control to require lawful disclosures of HPF and MH. 32. All Defendants breached their respective duties to the Valhalla community and to Plaintiffs. As a result, Plaintiffs have suffered damages indicated above. Punitive Damages 33. The conduct of Defendants described above is outrageous. Defendants' conduct demonstrates a reckless disregard for human life and a conscious disregard for public safety. The acts and omissions described above were willful and performed with actual or implied malice. Punitive and exemplary damages are therefore appropriate and should be imposed in this instance. 
- WHEREFORE, Plaintiffs respectfully pray for a judgment against Defendants for: 1. Injunctive and equitable relief as the Court deems appropriate including: i) Requiring Defendant VSF to test and to monitor the water near its sugar plant; ii) Requiring Defendant VSF to use the quicklime limestone method for processing wastewater to minimize phosphate contamination of Valhalla groundwater, if it is permitted to continue operation of its plant and to continue use of HPF and MH in its sugar beet production; iii) Compelling Defendant VSF to remove existing HPF from the groundwater by any means necessary; and 2. Compensatory damages to be paid by all Defendants, according to proof at trial; 3. Punitive damages as the court deems appropriate; 4. Costs and attorneys fees of this lawsuit, with interest; 5. Any other relief as the court deems appropriate. The Resulting “Topic”

20 2006/07 Research Teams Carnegie Mellon U Dartmouth College Long Island U Sabir Research, Inc. U Iowa U Massachusetts U Maryland U Missouri, Kansas City U Washington Ursinus College Fudan U (CN) National U of Singapore (SG) Open Text Corporation (CA) U Amsterdam (NL) U Waterloo (CA)

21 Deconstructing “Concept Search” “Features” “Method” “Specification” “Result”

22 Representing “Documents” Content –Count the words –Weight the words –Ascribe meaning to words Context –Who said this? –When was it said? –Who did they say it to? Description –What was said about it? Behavior –What was done with it? “Features”

23 Controlling the Search System Proactive –“Keyword” query –“Boolean” query –Query by example Reactive –Ranked list selection –“More like this” query –Category exploration –Social network exploration Iterative –Query refinement –“Search within” query “Specification”

24 Generating Results Logic Similarity Probability Result set Ranked list Classification Clustering Visualization “Features” “Method” “Specification” “Result”

25 2006 Experiments 31 “official” runs from 6 sites –Judged top-100 main site run, top-10 for others –Scored top-5000 Reference Boolean run –Judged stratified sample of 200 documents –Judged to B Expert manual searcher “run” – ~100 documents/topic –Tried to find documents systems would miss

26 2006/07 “Relevancy” Assessors Bank of America Department of Justice FTI Consulting H5 Technologies Inc. NARA Lewis & Roca LLP New Mexico Attorney General Preston Gates LLP Reasonable Discovery LLC SAIC Private individuals (CA, UK) Law Schools Boston University Case Western Reserve George Mason George Washington Loyola-Los Angeles Loyola-New Orleans U Dayton U Indiana-Indianapolis U Maryland U Texas

27 2006 Inter-Assessor Agreement

28 b a c 2 x

29 2006: Nobody Finds Everything Source: TREC 2006 Legal Track

30 2006: Precision@R Automatic Ranked Runs Manual run for pool enrichment Reference Boolean run
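For readers unfamiliar with the measure named on this slide, Precision@R is the precision of a ranked list cut off at rank R, where R is the number of relevant documents for the topic. A minimal Python sketch, assuming complete relevance judgments (the track itself worked from pooled and sampled judgments, which this does not model):

```python
def precision_at_r(ranked_doc_ids, relevant_ids):
    """Precision of the top-R results, where R = number of relevant documents."""
    r = len(relevant_ids)
    if r == 0:
        return 0.0
    top_r = ranked_doc_ids[:r]
    return sum(1 for doc_id in top_r if doc_id in relevant_ids) / r

# Three relevant documents, so only the top three ranks are scored.
print(precision_at_r(["a", "x", "b", "y", "c"], {"a", "b", "c"}))
```

At cutoff R, precision and recall are equal by construction, which makes the measure convenient for comparing ranked runs with a set-based run.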

31 Sampling for Affordable Evaluation

32 TREC 2007 Experiments Making “pools”: 68 runs from 12 groups –Up to 25,000 documents per run per topic –Plus 100 random unsubmitted documents –Before sampling: 195,688-476,252 docs/topic Bin 1 (“required”) –500 documents, done by 43 of 50 assessors Bins 2 through 6 (optional) –100 documents each –8 of 43 assessors did at least one, 5 did all

33 Estimated # of Rel Docs in Pool Mean per Topic: Relevant: 16,904 Non-rel.: 298,678 Gray: 4,303 Topic 71 (bromhidrosis): Relevant: 77,467 Topic 63 (sugar contract): Relevant: 18

34 Boolean Run Estimated Recall Mean EstR@B: 0.22 Boolean run missed 78% of the relevant documents (on average per topic) Topic 84 (1960’s films) EstR@B=100% Topic 77 (smoke NOT tobacco) EstR@B= 0%
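The slides do not spell out how estimated recall is computed from sampled judgments. One common approach, shown here only as a hedged sketch, weights each judged document by the inverse of its probability of having been sampled for assessment. The tuple layout and the function name `estimated_recall` are assumptions made for illustration.

```python
def estimated_recall(judged, retrieved):
    """Inverse-probability-weighted recall estimate.

    judged: list of (doc_id, inclusion_prob, is_relevant) for sampled documents.
    retrieved: set of doc_ids returned by the run being scored.
    """
    # Each sampled relevant document stands in for 1/p documents like it.
    est_total = sum(1.0 / p for _, p, rel in judged if rel)
    est_found = sum(1.0 / p for doc_id, p, rel in judged if rel and doc_id in retrieved)
    return est_found / est_total if est_total else 0.0

judged = [("d1", 0.5, True), ("d2", 0.1, True), ("d3", 0.1, False), ("d4", 0.5, True)]
print(estimated_recall(judged, retrieved={"d1", "d3"}))
```

In this toy example the run retrieves one of three judged relevant documents, but the estimate is lower than one third because the missed document "d2" was sampled sparsely and therefore represents many unjudged documents.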

35 Median vs. Boolean (EstR@B) Median won 8 of 43 Boolean won 31 of 43 (4 tied) Topic 99: 0.31 vs. 0.21 (natural disasters) Topic 58: 0.07 vs. 0.94 (phosphates and health) Boolean run had higher mean EstR@B than all submitted runs. Boolean Better Median Better

36 Median vs. Boolean (EstR@25000) Median won 33 of 43 Boolean won 9 of 43 (1 tied) Topic 60: 0.91 vs. 0.07 (phosphate precip.) Topic 58: 0.09 vs. 0.94 (phosphates and health) Highest mean EstR@25000 47% Boolean Better Median Better

37 Marginal Precision by Depth Band Depths 1-5000: median Precision=18% Depths 5001-10000: median Precision=13% Depths 10001-15000: median Precision=11% Depths 15001-20000: median Precision=10% Depths 20001-25000: median Precision=10% 3 of 446 (0.7%) random (unsubmitted) documents were judged relevant –On average, another 50,000 relevant docs per topic?
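Both ideas on this slide can be sketched in a few lines of Python. The first function computes precision separately within each depth band of a ranked list, assuming every ranked document has a judgment (the track estimated these values from samples). The second reproduces the back-of-the-envelope extrapolation from the random sample. Combining the collection size from the IIT CDIP slide with the mean pool counts from the pool-size slide is an assumption made here, not a calculation shown in the deck.

```python
def marginal_precision(ranked_relevance, band_size):
    """Precision within each consecutive band of a ranked list of booleans."""
    bands = []
    for start in range(0, len(ranked_relevance), band_size):
        band = ranked_relevance[start:start + band_size]
        bands.append(sum(band) / len(band))
    return bands

def extrapolate_unsubmitted(relevant_in_sample, sample_size, unsubmitted_docs):
    """Scale the relevant rate seen in a random sample up to all unsubmitted docs."""
    return relevant_in_sample / sample_size * unsubmitted_docs

COLLECTION_SIZE = 6_877_327                 # documents in the collection
MEAN_POOL_SIZE = 16_904 + 298_678 + 4_303   # relevant + non-relevant + gray, mean per topic

print(marginal_precision([True, False, False, False, True, True], band_size=2))
print(round(extrapolate_unsubmitted(3, 446, COLLECTION_SIZE - MEAN_POOL_SIZE)))
```

Under these assumptions the extrapolation lands in the mid-40,000s, the same order of magnitude as the slide's rough figure of 50,000. With only 3 relevant documents in the sample, the uncertainty around that figure is large.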

38 Median “Run” Marginal Precision (Depths 20,001-25,000, by Topic) only 6 of 43 topics Marg. Prec. > 10% Topic 69: MP = 100% (indoor smoke vent.) Topic 74: MP = 46% (indoor air quality) Topic 71: MP = 21% (bromhidrosis)

39 2008 Legal Track Interactive task models commercial practice –Recall-oriented (classify every document) –“Topic authority” available for clarification –Fewer topics with much richer sampling Relevance feedback task –Models multi-stage meet and confer Third set of ad hoc task topics –Completes development of reusable(?) collection

40 Hill Climbing the Boolean Set Boolean run On OCR Ranked Run on OCR Ranked Run on Metadata Extract “good” metadata and add to query Remove least likely from Boolean Add most likely from Ranked
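One way to read the "remove least likely, add most likely" step is as a size-preserving swap between the Boolean result set and a ranked run. The Python sketch below is an interpretation, not the authors' implementation: the slide does not say how many documents are swapped or how the scores are produced, so `n` and the `scores` dictionary are hypothetical inputs.

```python
def swap_step(boolean_set, scores, n):
    """Swap up to n weak Boolean hits for stronger documents from a ranked run.

    boolean_set: doc_ids returned by the Boolean query.
    scores: {doc_id: ranked-run score}; documents not scored count as 0.0.
    """
    result = set(boolean_set)
    # Lowest-scoring documents currently inside the Boolean set.
    weakest = sorted(result, key=lambda d: scores.get(d, 0.0))[:n]
    # Highest-scoring documents currently outside it.
    strongest = sorted((d for d in scores if d not in result),
                       key=lambda d: scores[d], reverse=True)[:n]
    for out_doc, in_doc in zip(weakest, strongest):
        # Only swap when it improves the set, so each step climbs.
        if scores[in_doc] > scores.get(out_doc, 0.0):
            result.discard(out_doc)
            result.add(in_doc)
    return result

scores = {"a": 0.9, "b": 0.1, "c": 0.5, "d": 0.8, "e": 0.2}
print(sorted(swap_step({"a", "b", "c"}, scores, n=2)))
```

Keeping the set size fixed at B means the modified set can be compared with the Boolean baseline using the same measure at the same cutoff.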

41 Metadata-Based Expansion CorruptedRetrievable Document image from archives Retrievable Query Corrupted How to retrieve corrupted documents? Expand query with author and recipient names
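The expansion idea can be sketched as follows: take the documents that the query does retrieve, collect the author and recipient names from their metadata, and add the most frequent names to the query so that documents with corrupted OCR text can still match on metadata. This is a minimal Python sketch under stated assumptions. The field names `authors` and `recipients`, the `max_names` cutoff, and the example names are all hypothetical.

```python
from collections import Counter

def expand_query(query_terms, top_docs, max_names=2):
    """Append the most frequent author/recipient names from top_docs to the query."""
    counts = Counter()
    for doc in top_docs:
        counts.update(doc.get("authors", []))
        counts.update(doc.get("recipients", []))
    names = [name for name, _ in counts.most_common(max_names)]
    return list(query_terms) + [n for n in names if n not in query_terms]

top_docs = [
    {"authors": ["SMITH, A"], "recipients": ["JONES, B"]},
    {"authors": ["SMITH, A"], "recipients": ["BROWN, C"]},
    {"authors": ["JONES, B"], "recipients": ["SMITH, A"]},
]
print(expand_query(["newsletter", "strategy"], top_docs))
```

The expanded query would then be run against the metadata fields rather than the OCR text, since names are the part of the record least likely to be damaged by scanning.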

42 TREC-2006/07 “Training Topics” for 2008 “Beating Boolean” (but not by much yet!) +63%

43 Meet-and-Confer Alternatives

44 Incremental Disclosure Benefit 5-then-10 10-then-5 15-then-0 50 Topics, Title Queries, TREC-2005 Robust Track Collection 0.56 0-then-0

45 Other Recent Developments ICAIL Workshop on Discovery of Electronically Stored Information (DESI), Stanford, June 2007; Sedona Best Practices Commentary on the Use of Search and Information Retrieval Methods in E-Discovery (August 2007 public draft); DESI-2 Workshop, London, June 2008

46 Taking the Larger View Jack G. Conrad, “E-Discovery Revisited: A Broader Perspective for Researchers,” DESI-1

47 E-Discovery as Sensemaking Simon Attfield and Ann Blandford, “E-Discovery Viewed as Integrated Human-Computer Sensemaking,” DESI-2

48 Identity Resolution in Email Date: Wed Dec 20 08:57:00 EST 2000 From: Kay Mann To: Suzanne Adams Subject: Re: GE Conference Call has be rescheduled Did Sheila want Scott to participate? Looks like the call will be too late for him. Sheila WHO? WHO?

49 3-Step Solution (1) Identity Modeling Posterior Distribution (3) Mention Resolution (2) Context Reconstruction

50 Where to Look for Evidence Socially-related Conversations This Message This Conversation Contextual Space On-Topic

51 Contextual Resolution “Sheila” social conversational social topical social topical “Sheila Tweed” “sheila” “” “sg” “Sheila Walton” “Sheila” “Sheila Tweed” “sheila” “” “sg” “Sheila Walton” “Sheila” Context-Free Resolution Elsayed, Oard and Namata ACL/HLT 2008
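The contrast between context-free and contextual resolution can be illustrated with a toy Python sketch. It is not the model from the cited Elsayed, Oard and Namata paper, whose details are not given on the slides. It assumes a context-free prior over candidate identities for the mention, multiplies it by a hypothetical context support score, and normalizes the result into a posterior distribution. All numbers are made up for illustration.

```python
def resolve_mention(prior, context_support):
    """Combine a context-free prior with context support into a posterior.

    prior: {candidate: P(candidate | mention string)}.
    context_support: {candidate: nonnegative evidence from surrounding context}.
    """
    # Assumed combination rule: candidates with no context support keep their
    # prior weight; supported candidates are boosted proportionally.
    unnormalized = {c: p * (1.0 + context_support.get(c, 0.0)) for c, p in prior.items()}
    total = sum(unnormalized.values())
    return {c: v / total for c, v in unnormalized.items()}

prior = {"Sheila Tweed": 0.5, "Sheila Walton": 0.5}   # context-free: a tie
support = {"Sheila Tweed": 3.0}                        # hypothetical conversational evidence
print(resolve_mention(prior, support))
```

With no context the two candidates are indistinguishable. Evidence from the surrounding conversation, social, or topical context is what breaks the tie.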

52 Test Collections
Collection      Emails    Identities  Mention Queries  Candidates (Min. / Avg. / Max.)
Sager            1,628         627          51          1 / 4 / 11
Shapiro            974         855          49          1 / 8 / 21
Enron-subset    54,018      27,340          78          1 / 152 / 489
Enron-all      248,451     123,783          78          3 / 518 / 1,785

53 Which Context Is the Best? [Chart by collection: Sager, Shapiro, Enron-sub, Enron-all]

54 Accomplishments Unique test collection –7 million documents with OCR and metadata –83 rich topics (Boolean, free text, context) –Recall-oriented evaluation measure Moderately robust research community –16 research teams from 4 countries –Attracting attention (and investment) in the law
