Experiments with the Negotiated Boolean Queries of the TREC 2007 Legal Discovery Track Stephen Tomlinson Open Text Corporation 2007 Nov 8.

Presentation transcript:

1 Experiments with the Negotiated Boolean Queries of the TREC 2007 Legal Discovery Track
Stephen Tomlinson
Open Text Corporation
2007 Nov 8

2 Overview
who won the boolean query “negotiations”?
can dropping the boolean operators improve on the boolean run’s Recall@B?
did the boolean keywords (synonyms) improve on the natural language request text?
can just relaxing the proximity constraints improve Recall@B?
can blind feedback improve Recall@B?
can a fusion of vector and boolean approaches improve Recall@B?

3 3 Boolean Queries
Defendant
–initial boolean query proposed by the defendant
Plaintiff
–rejoinder boolean query from the plaintiff
Final
–final negotiated boolean query

4 Topic 74: “All scientific studies expressly referencing health effects tied to indoor air quality.”
Defendant: "health effect!" w/10 "air quality"
Plaintiff: (scien! OR stud! OR research) AND ("air quality" OR health)
Final: (scien! OR stud! OR research) AND ("air quality" w/15 health)
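To make the three queries concrete, here is a minimal Python sketch of a matcher for them. It assumes that a trailing `!` means prefix truncation, that a quoted string is an exact phrase, and that `w/N` means "within N words, in either order". The slides do not define these operators, so these semantics, the tokenizer, and every function name are illustrative assumptions rather than the system actually used.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def term_positions(tokens, term):
    """Start positions where `term` matches.

    Assumption: a trailing '!' is prefix truncation, and a multi-word
    term is matched as an exact phrase.
    """
    words = term.lower().split()

    def word_matches(token, word):
        if word.endswith("!"):
            return token.startswith(word[:-1])
        return token == word

    return [
        i for i in range(len(tokens) - len(words) + 1)
        if all(word_matches(tokens[i + k], w) for k, w in enumerate(words))
    ]

def within(tokens, left, right, distance):
    """Assumed w/N semantics: start positions at most `distance` tokens apart."""
    lp = term_positions(tokens, left)
    rp = term_positions(tokens, right)
    return any(abs(a - b) <= distance for a in lp for b in rp)

def has_any(tokens, terms):
    """True if at least one of the terms occurs (an OR group)."""
    return any(term_positions(tokens, t) for t in terms)

STUDY_TERMS = ("scien!", "stud!", "research")

def defendant_query(text):
    return within(tokenize(text), "health effect!", "air quality", 10)

def plaintiff_query(text):
    tokens = tokenize(text)
    return has_any(tokens, STUDY_TERMS) and has_any(tokens, ("air quality", "health"))

def final_query(text):
    tokens = tokenize(text)
    return has_any(tokens, STUDY_TERMS) and within(tokens, "air quality", "health", 15)
```

Under these assumptions, a passage such as "study funded by the Center for Indoor Air Research" fails all three queries, because it contains neither the phrase "air quality" nor the word "health".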

5 Topic 74 Boolean Results
Defendant: "health effect!" w/10 "air quality"
–2,691 matches, 82% precision, 3% recall
Plaintiff: (scien! OR stud! OR research) AND ("air quality" OR health)
–858,700 matches, 64% precision@25000 (ranked), 25% recall@25000 (ranked)
Final: (scien! OR stud! OR research) AND ("air quality" w/15 health)
–20,516 matches, 77% precision, 22% recall
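The percentages on this slide can be related back to document counts with a little arithmetic. The sketch below multiplies the match count by precision to get the implied number of relevant documents retrieved, then divides by recall to get the implied total number of relevant documents. The function name is hypothetical, and because the slide's percentages are rounded, the outputs are only rough back-of-envelope figures, not reported results.

```python
def implied_counts(matches, precision, recall):
    """Back out rough counts from a match count and rounded precision/recall.

    precision = relevant_retrieved / matches
    recall    = relevant_retrieved / total_relevant
    """
    relevant_retrieved = matches * precision
    total_relevant = relevant_retrieved / recall
    return relevant_retrieved, total_relevant

# Figures taken from the Topic 74 slide (percentages are rounded there).
defendant = implied_counts(2691, 0.82, 0.03)
plaintiff_at_25000 = implied_counts(25000, 0.64, 0.25)
final = implied_counts(20516, 0.77, 0.22)

for name, (retrieved, total) in [
    ("Defendant", defendant),
    ("Plaintiff@25000", plaintiff_at_25000),
    ("Final", final),
]:
    print(f"{name}: ~{retrieved:.0f} relevant retrieved, ~{total:.0f} relevant in total")
```

The three rows imply totals of the same order of magnitude, which is a sanity check on the slide's numbers rather than an additional finding.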

6 Topic 74: Missed Relevant Documents
Final Boolean: (scien! OR stud! OR research) AND ("air quality" w/15 health)
Passages in Missed Relevant Documents:
“… Lowrey A.H. (1980). Indoor air pollution …”
“assessment … entitled “Respiratory Health Effects of Passive Smoking …”
“study … funded by the Center for Indoor Air Research”

7 Defendant vs. Final Boolean: Precision
Def. Boolean won 20
Boolean won 22 (1 tied)
Mean in (-0.09, 0.15)
Topic 63: 1.00 vs. 0.02 (sugar contract)
Topic 69: 0.00 vs. 0.97 (indoor smoke ventilation)
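The comparison slides all share this layout: per-topic wins for each run, ties, and an interval for the mean. A minimal sketch of how such a summary could be computed from two lists of per-topic scores follows. It assumes the interval is a confidence interval for the mean per-topic difference and uses a normal approximation (z = 1.96); the slides do not say how the interval was derived, so that choice and the function name are assumptions.

```python
import math
import statistics

def compare_runs(scores_a, scores_b, z=1.96):
    """Summarise per-topic differences between run A and run B.

    Assumption: the interval is mean(diff) +/- z * stdev(diff) / sqrt(n),
    a normal approximation that may differ from the method behind the slides.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    mean = statistics.mean(diffs)
    half_width = z * statistics.stdev(diffs) / math.sqrt(len(diffs))
    return {
        "a_won": sum(d > 0 for d in diffs),
        "b_won": sum(d < 0 for d in diffs),
        "tied": sum(d == 0 for d in diffs),
        "mean_interval": (mean - half_width, mean + half_width),
    }

# Toy per-topic precision values, NOT the track's data.
run_a = [1.00, 0.00, 0.50, 0.50]
run_b = [0.02, 0.97, 0.50, 0.40]
print(compare_runs(run_a, run_b))
```

An interval that straddles zero, as on this slide, means the sketch's approximation would not separate the two runs on average, even when individual topics differ sharply.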

8 Defendant vs. Final Boolean: Recall
Def. Boolean won 0
Boolean won 42 (1 tied)
Mean in (-0.27, -0.11)
Topic 77: 0.00 vs. 0.00 (smoke NOT tobacco)
Topic 52: 0.00 vs. 0.98 (boosting crop yields)

9 Plaintiff vs. Final Boolean: Recall@25000
Pl. Boolean won 35
Boolean won 6 (2 tied)
Mean in (0.03, 0.19)
Topic 59: 0.76 vs. 0.01 (limestone treatment)
Topic 58: 0.24 vs. 0.94 (phosphates and health)

10 Plaintiff vs. Final Boolean: Recall@B
Pl. Boolean won 15
Boolean won 27 (1 tied)
Mean in (-0.09, 0.04)
Topic 63: 0.73 vs. 0.27 (sugar contract)
Topic 58: 0.18 vs. 0.94 (phosphates and health)
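Recall@B measures a ranked run's recall at depth B, where B is the number of documents matched by the topic's final boolean query (for example, B=8183 for Topic 58 in a later slide). A minimal sketch, assuming complete relevance judgments are available as a set of document ids; the track itself worked from sampled estimates (the "estRel" figures), which this sketch does not model. The function name and toy ids are hypothetical.

```python
def recall_at(ranked_doc_ids, relevant_doc_ids, depth):
    """Fraction of relevant documents found in the top `depth` of a ranking."""
    relevant = set(relevant_doc_ids)
    if not relevant:
        return 0.0
    retrieved = set(ranked_doc_ids[:depth])
    return len(retrieved & relevant) / len(relevant)

# Toy example: the boolean query matched 3 documents, so B = 3.
boolean_matches = ["d1", "d2", "d3"]
ranked_run = ["d4", "d1", "d9", "d2", "d7"]
relevant = {"d1", "d2", "d4", "d8"}

B = len(boolean_matches)
print(recall_at(ranked_run, relevant, B))       # ranked run cut off at depth B
print(recall_at(boolean_matches, relevant, B))  # the boolean run itself
```

Cutting every run at the same depth B is what makes a ranked run directly comparable to the unranked boolean result set.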

11 Vector vs. Boolean (Example)
Boolean: (scien! OR stud! OR research) AND ("air quality" w/15 health)
Vector: scien! OR stud! OR research OR air OR quality OR health
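The boolean-to-vector conversion in this example can be sketched as a small text transformation: drop the proximity operators, parentheses, and quotes, discard the boolean operators, and OR the remaining terms together. The function name is hypothetical, and the handling of NOT (the negated term is simply kept as an ordinary term) is an assumption, since the slides only show this one example.

```python
import re

BOOLEAN_OPERATORS = {"AND", "OR", "NOT"}

def boolean_to_vector(query):
    """Flatten a boolean query into an OR of its individual terms."""
    # Remove proximity operators such as w/15.
    stripped = re.sub(r"\bw/\d+\b", " ", query, flags=re.IGNORECASE)
    # Remove grouping and phrase delimiters so phrases split into words.
    stripped = re.sub(r'[()"]', " ", stripped)
    terms = []
    for token in stripped.split():
        # Assumption: operators are dropped; a term after NOT is kept as-is.
        if token.upper() in BOOLEAN_OPERATORS:
            continue
        if token not in terms:  # keep first occurrence, preserve order
            terms.append(token)
    return " OR ".join(terms)

final_boolean = '(scien! OR stud! OR research) AND ("air quality" w/15 health)'
print(boolean_to_vector(final_boolean))
# scien! OR stud! OR research OR air OR quality OR health
```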

12 Relevance Ranking
term frequency dampening (BM25)
–wildcard variants treated as same term
–for boolean proximity constraints, only count term occurrences satisfying proximity
–metadata + OCR included in document length
inverse document frequency (log)
–based on most common variant for wildcards
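A rough sketch of how these ranking rules might fit together for a single wildcard term. It pools all variants matching the prefix into one term frequency, applies a textbook BM25 dampening, and takes the document frequency from the most common variant. The parameter values (k1 = 1.2, b = 0.75), the plain log(N/df) form of the inverse document frequency, and all names are assumptions; the slide names the ingredients but not the exact formula, and the proximity-filtered counting is not modelled here.

```python
import math

def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    """Textbook BM25-style score for one term (assumed parameters)."""
    idf = math.log(n_docs / doc_freq)
    length_norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1) / (tf + length_norm)

def wildcard_term_score(doc_tokens, prefix, variant_doc_freqs, avg_doc_len, n_docs):
    """Score a wildcard term such as stud! (passed here as the prefix 'stud').

    All variants count toward a single term frequency, and the document
    frequency is that of the most common matching variant.
    """
    tf = sum(1 for token in doc_tokens if token.startswith(prefix))
    if tf == 0:
        return 0.0
    doc_freq = max(
        df for variant, df in variant_doc_freqs.items() if variant.startswith(prefix)
    )
    return bm25_term_score(tf, len(doc_tokens), avg_doc_len, n_docs, doc_freq)

# Toy collection statistics, invented for illustration only.
variant_doc_freqs = {"study": 100, "studies": 40, "student": 5}
doc = ["study", "studies", "science", "air"]
print(wildcard_term_score(doc, "stud", variant_doc_freqs, avg_doc_len=4, n_docs=1000))
```

The dampening means a second occurrence adds less than the first: with these toy numbers, a term frequency of 2 scores about 1.4 times a term frequency of 1, not twice as much.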

13 Vector vs. Boolean: Recall@B
Vector won 16
Boolean won 26 (1 tied)
Mean in (-0.13, 0.02)
Topic 63: 0.79 vs. 0.27 (sugar contract)
Topic 58: 0.08 vs. 0.94 (phosphates and health)

14 Topic 58: “… health problems caused by HPF …”
Vector R@B=0.08, Boolean R@B=0.94 (B=8183, estRel = 1151)
Phosphat! w/75 (caus! OR relat! OR assoc! OR derive! OR correlat!) w/75 (health OR disorder! OR toxic! OR "chronic fatigue" OR dysfunction! OR irregular OR memor! OR immun! OR myopath! OR liver! OR kidney! OR heart! OR depress! OR loss OR lost)
vector matches often didn’t mention “Phosphat!”

15 Topic 72: “… chemical process(es) which result in onions … making persons cry”
Vector R@B=0.03, Boolean R@B=0.78 (B=119, estRel = 98)
((scien! OR research! OR chemical) w/25 onion!) AND (cries OR cry! OR tear!)
proximity clause found some long documents with just one reference to onions’ effects

16 Topic 63: “… exclusivity clause in a sugar contract …”
Vector R@B=0.79, Boolean R@B=0.27 (B=294, estRel = 18)
(Sugar w/20 (contract! OR agreement! OR deal!)) AND exclusiv!
boolean missed “U.S. sugar quota law”

17 Request vs. Vector: R@25000
Req. Vector won 21
Vector won 22 (0 tied)
Mean in (0.00, 0.13)
Topic 87: 1.00 vs. 0.13 (SEC reporting)
Topic 84: 0.64 vs. 0.91 (1960s films)

18 Impact of Doubling Proximity Distances: Recall@B
2x-Prox Boolean won 14
Boolean won 8 (21 tied)
Mean in (-0.03, 0.02)
Topic 61: 0.49 vs. 0.44 (waste treatment)
Topic 72: 0.39 vs. 0.78 (onions effect)
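Doubling the proximity distances amounts to rewriting every `w/N` operator in the query as `w/2N`. A minimal sketch, assuming the operator always appears in the `w/N` form used on these slides; the function name is hypothetical.

```python
import re

def scale_proximity(query, factor=2):
    """Multiply the distance in every w/N operator by `factor`."""
    return re.sub(
        r"\bw/(\d+)\b",
        lambda match: f"w/{int(match.group(1)) * factor}",
        query,
        flags=re.IGNORECASE,
    )

topic_72 = "((scien! OR research! OR chemical) w/25 onion!) AND (cries OR cry! OR tear!)"
print(scale_proximity(topic_72))
# ((scien! OR research! OR chemical) w/50 onion!) AND (cries OR cry! OR tear!)
```

Queries without a proximity operator are returned unchanged by this rewrite, which would produce ties of the kind counted on this slide.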

19 Impact of Blind Feedback: Recall@B
Boolean+BF won 16
Boolean won 21 (6 tied)
Mean in (-0.12, 0.03)
Topic 90: 0.64 vs. 0.10 (sales in England)
Topic 58: 0.01 vs. 0.94 (phosphates and health)

20 Fusion of Boolean, Request and Vector: Recall@B
Fusion won 20
Boolean won 20 (3 tied)
Mean in (-0.08, 0.03)
Topic 65: 0.88 vs. 0.67 (candy packaging)
Topic 58: 0.10 vs. 0.94 (phosphates and health)
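The slides do not say which fusion method combined the three runs. As an illustration only, the sketch below uses reciprocal rank fusion, a common generic technique swapped in here as a stand-in: each document earns 1/(k + rank) from every ranking it appears in, and documents are re-sorted by the total. The constant k = 60, the function name, and the toy rankings are all assumptions.

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several rankings of document ids (stand-in method, not the author's)."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

# Toy rankings, invented for illustration.
boolean_run = ["d1", "d2", "d3"]
request_run = ["d2", "d4", "d1"]
vector_run = ["d2", "d1", "d5"]

fused = reciprocal_rank_fusion([boolean_run, request_run, vector_run])
print(fused)
```

To score a fused run with Recall@B, the fused list would be cut at depth B in the same way as any other ranked run.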

21 Conclusions
final negotiated boolean query often had substantially lower recall than the plaintiff boolean query
boolean operators (AND, proximity) often have value
blind feedback and fusion did not improve the boolean run’s Recall@B (on average)

