©2007 H5 Simultaneous Achievement of high Precision and high Recall through Socio-Technical Information Retrieval Robert S. Bauer, Teresa Jade www.H5technologies.com.

1 ©2007 H5 STIR: Simultaneous Achievement of high Precision and high Recall through Socio-Technical Information Retrieval
Robert S. Bauer, Teresa Jade (www.H5technologies.com) & Mitchell P. Marcus (www.cis.upenn.edu/~mitch/)
June 7, 2007

2 The e-Discovery IDEAL: High Precision (P) with High Recall (R)
Find every relevant document, and only those documents that are relevant
Desired: P = 0.8 (or better) @ R = 0.8 (or better)
Acceptable: P = 2/3 (or better) @ R = 2/3 (or better)
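
The Desired and Acceptable targets above can be illustrated with a minimal Python sketch. It assumes the standard set-based definitions of precision and recall; the function names, the document IDs, and the "below acceptable" label are hypothetical and do not come from the slides.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a retrieved set against a relevant set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def rating(precision, recall):
    """Classify a result against the slide's thresholds (labels are illustrative)."""
    if precision >= 0.8 and recall >= 0.8:
        return "desired"
    if precision >= 2 / 3 and recall >= 2 / 3:
        return "acceptable"
    return "below acceptable"


# Toy example: 5 relevant documents, 5 retrieved, 4 of them relevant.
relevant = {"d1", "d2", "d3", "d4", "d5"}
retrieved = ["d1", "d2", "d3", "d4", "d9"]
p, r = precision_recall(retrieved, relevant)
print(p, r, rating(p, r))
```

Both thresholds must hold at once: a result with P = 0.9 and R = 0.3 falls below the Acceptable level despite its high precision.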

3 The e-Discovery REALITY
High P & Low R = RISK (important docs not retrieved)
Low P & High R = COST (many more documents must be reviewed)
Text REtrieval Conference (TREC)

4 Agenda
Results: TREC ad hoc (= typical); Queries typifying Communities of Practice (CoPs)
e-Discovery Approaches: 5 Dimensions; Linguistics of CoPs
Research Issues: TREC; AI; Linguists; Lawyers

5 Typical Results – ad hoc queries
(from Chapter 3, “Retrieval System Evaluation” by Chris Buckley and Ellen M. Voorhees, in TREC: Experiment and Evaluation in Information Retrieval, Voorhees & Harman, eds., MIT Press, 2005, p. 62, Fig. 3.1)
[Figure labels: 22 Topics; Average]
Desired is Rare; Acceptable < 10%

6 Accuracy Metrics
Most accurate TREC results for 20 of 22 topics in one test case, compared with STIR topical avg in 4 cases (I-IV) encompassing 42 topics
$F_1 = 2 \cdot (P \cdot R) / (P + R)$
[Chart labels: Ideal; Acceptable; TREC avg; I, II, III, IV]
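
As a worked illustration of the F1 formula on this slide, the sketch below evaluates it at the Desired (P = R = 0.8) and Acceptable (P = R = 2/3) levels from the earlier slide, plus one unbalanced case. The function name and the unbalanced input values are assumptions chosen for illustration, not results from the presentation.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall: F1 = 2 * (P * R) / (P + R)."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when nothing relevant is retrieved
    return 2 * precision * recall / (precision + recall)


# When P equals R, F1 equals that common value (up to floating-point rounding).
print(round(f1(0.8, 0.8), 3))      # Desired level
print(round(f1(2 / 3, 2 / 3), 3))  # Acceptable level

# An unbalanced result (hypothetical values) scores well below its higher component.
print(round(f1(0.9, 0.2), 3))
```

Because F1 is a harmonic mean, it stays low unless precision and recall are both high, which is why it suits the "high P with high R" goal.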

7 STIR compared with TREC IR
Topical P & R results for one TREC and 4 STIR cases
Average P & R for each case
[Chart: Precision vs. Recall; series: STIR, TREC]

8 Recall Improvement
Sampled Corpus Tests for 12 Topics in case I during STIR Training
● STIR training provides substantial Recall improvement with acceptable Precision reduction
Retrieval Acceptable to lowest limit of statistical uncertainty
[Chart: Precision vs. Recall]
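
The slide does not say how the "lowest limit of statistical uncertainty" was computed. One way to read it is that the lower confidence bound of a sampled estimate must still meet the Acceptable threshold of 2/3. The sketch below assumes a simple random sample and a 95% Wilson score interval; that method, the function name, and the sample counts are all assumptions, not the authors' procedure.

```python
import math


def wilson_lower_bound(successes, n, z=1.96):
    """Lower end of the Wilson score interval for a binomial proportion.

    z = 1.96 corresponds to a two-sided 95% interval (an assumed choice).
    """
    if n == 0:
        return 0.0
    p_hat = successes / n
    denom = 1 + z * z / n
    centre = p_hat + z * z / (2 * n)
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n))
    return (centre - margin) / denom


# Hypothetical sample: 100 relevant documents drawn from the corpus, 80 retrieved.
lower = wilson_lower_bound(80, 100)
print(round(lower, 3), lower >= 2 / 3)
```

With these hypothetical counts the point estimate of recall is 0.8, and the test is whether the lower bound, not the point estimate, clears 2/3.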

9 Agenda
Results: TREC ad hoc (= typical); Queries typifying Communities of Practice (CoPs)
e-Discovery Approaches: 5 Dimensions; Linguistics of CoPs
Research Issues: TREC; AI; Linguists; Lawyers

10 Dimensions of e-Discovery
Subject Matter; Legal Case; Linguistics; Documents; Community

11 Dimensions of e-Discovery: Document Review
Dimensions: Legal Case; Documents
Example Systems:
● Manual (human) review conducted by attorneys
● Basic keyword searches targeted to legal issues
● Supervised learning with relevance feedback

12 Dimensions of e-Discovery: Expert Search
Dimensions: Subject Matter; Legal Case; Documents
Example Systems:
● Subject matter experts review results under legal team direction
● Domain-specific lexicons used

13 Dimensions of e-Discovery: Model Meaning
Dimensions: Subject Matter; Legal Case; Linguistics; Documents
Example Systems:
● Supervised learning with
–relevance feedback
–semantic analysis
● Semantic search

14 Dimensions of e-Discovery: Model Communities
Dimensions: Subject Matter; Legal Case; Linguistics; Documents; Community
Example System:
● Socio-Technical-IR

15 Dimensions of e-Discovery: Socio-Technical-IR
Dimensions: Linguistics; Community
Non-computational Linguistic Disciplines
–Pragmatics
–Socio-Linguistics
–Ethno-Methodology
–Discourse Analysis
A community of practice is
–a diverse group of people
–engaged in real work
–over a significant period of time
–developing their own tools, language, and processes
–during which they build things, solve problems, learn and invent
–evolving a practice that is highly skilled and highly creative

16 Agenda
Results: TREC ad hoc (= typical); Queries typifying Communities of Practice (CoPs)
e-Discovery Approaches: 5 Dimensions; Linguistics of CoPs
Research Issues: TREC; AI; Linguists; Lawyers

17 Research Issues
TREC
–Nature of the relatively rare high P with high R queries
–Measuring both recall and precision effectively
AI
–Knowledge-Based (Expert) Systems that codify linguistic expertise
–Characterize practice communities of subject matter experts
–Investigate combination systems applied to different types of topics
Linguists
–Identify and characterize different types of topics and map to system types
–Language patterns in communities as well as subject matter fields
–Defining categories in concrete terms
Lawyers
–Defining categories in concrete terms
–Integration of technology and processes

18 Back-Up

19 STIR Analysis: CoPs’ Enunciatory language
[Diagram labels: Relevant Document Text; State of Affairs; Object; Process; Action; Fact; Event]

