Structured Queries for Legal Search
TREC 2007 Legal Track
Yangbo Zhu, Le Zhao, Jamie Callan, Jaime Carbonell
Language Technologies Institute, School of Computer Science, Carnegie Mellon University
11/06/2007
Agenda
- Introduction
- Main task – ad hoc search
- Routing task – relevance feedback
What is legal search?
- Goal: retrieve all documents responsive to production requests.
- Production request: describes a set of documents that the defendant must produce for the plaintiff.
- Recall-oriented: missing an important document carries high risk, so finding one carries high value.
- Sample request text: All documents discussing, referencing, or relating to company guidelines, strategies, or internal approval for placement of tobacco products in movies that are mentioned as G-rated.
- [Final Query diagram: the terms guide, strategy, approval, family, "G rated", movie, and film, combined with AND, OR, and W/5 operators]
Data set
- 7 million business records from tobacco companies and research institutes.
- Metadata: title, author, organizations, etc.
- OCR text: contains errors.
- 50 topics generated from four hypothetical complaints created by lawyers.
Main task – ad hoc search
Indri query formulation:
- Without boolean constraint: #combine(ranking function)
- With boolean constraint: #filreq( #band(boolean constraint) #combine(ranking function) )
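The two formulations above can be assembled mechanically. A minimal sketch in Python, assuming the boolean constraint and the ranking function have already been rendered as Indri operator strings (the function name is illustrative, not from the original system):

```python
def indri_query(ranking, constraint=None):
    """Wrap a ranking function in Indri syntax, optionally gated by a
    boolean constraint via the #filreq/#band form shown above."""
    if constraint is None:
        return f"#combine({ranking})"
    return f"#filreq( #band({constraint}) #combine({ranking}) )"

print(indri_query("guide strategy approval"))
print(indri_query("guide strategy approval", "guide #syn(movie film)"))
```

With no constraint the ranking function runs on the whole collection; with one, only documents matching the #band clause are ranked.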
Boolean constraint
Translate the Final Query into Indri operators:

Original expression      Indri operator
x AND y                  #band(x y)
x OR y                   #syn(x y)
x BUT NOT y              #filrej(y x)
Phrase: "x y"            #1(x y)
Proximity: x W/k y       #uw(k+2)(x y)
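The translation table can be expressed as a small function. This is an illustrative sketch, not the authors' code; it assumes AND maps onto Indri's #band (the boolean-constraint operator on the previous slide), and renders W/k as an unordered window of width k+2:

```python
def translate(op, x, y, k=None):
    """Map one Final Query operator onto its Indri equivalent."""
    if op == "AND":
        return f"#band({x} {y})"
    if op == "OR":
        return f"#syn({x} {y})"
    if op == "BUT NOT":
        return f"#filrej({y} {x})"      # reject docs matching y, keep x
    if op == "PHRASE":
        return f"#1({x} {y})"           # exact adjacency
    if op == "W/k":
        return f"#uw{k + 2}({x} {y})"   # unordered window of width k+2
    raise ValueError(f"unknown operator: {op}")

print(translate("W/k", "family", "movie", k=5))  # #uw7(family movie)
```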
Ranking functions
Three variants of the ranking function:
- Bag of words: (guide strategy approval family G rated movie film)
- Respect phrase operators: (guide strategy approval family #1(G rated) movie film)
- Group synonyms together: (#syn(guide strategy approval) #syn(family #1(G rated)) #syn(movie film))
Experiments and findings
- Boolean constraints improve both recall and precision.
- Structured queries outperform bag-of-words queries.
* B is the number of documents matching the Final Query; its average value is 5000.
Per-topic performance
[Figure: per-topic est_RB and est_PB, shown as differences from the median of the 29 manual runs]
Routing task of Legal track 2007
- Structured queries are known to be hard to construct; with supervision, however, they need not be.
- Questions: Does a weighted query help? Do metadata and annotations help?
- Supervised structured query construction gives a definitive answer.
Supervised Structured Query Construction
- Relevance feedback => supervised learning: train a linear SVM with keyword and keyword.field features.
- Feature values f_i: tf-idf / language-model scores for the terms.
- The trained SVM supplies per-term weights w_i.
- Retrieval: #weight( w1 t1 w2 t2 … )
- Advantage: given enough training data, we know for sure whether a given type of feature helps.
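The retrieval step above can be sketched as follows. This is a minimal illustration, assuming the per-term weights have already been learned (e.g. taken from a linear SVM's coefficients); it is not the authors' actual pipeline:

```python
def weight_query(term_weights):
    """Render learned per-term weights as an Indri #weight query.
    Terms with non-positive weight are dropped, since #weight expects
    positive weights."""
    kept = [(w, t) for t, w in term_weights.items() if w > 0]
    kept.sort(reverse=True)  # highest-weighted terms first, for readability
    body = " ".join(f"{w:.3f} {t}" for w, t in kept)
    return f"#weight( {body} )"

q = weight_query({"cigarette": 1.25, "chocolate": 0.8, "memo": -0.3})
print(q)  # #weight( 1.250 cigarette 0.800 chocolate )
```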
Example
Query 13: All documents to or from employees of a tobacco company or tobacco organization referring to the marketing, placement, or sale of chocolate candies in the form of cigarettes.
Final Query: (cand! OR chocolate) w/10 cigarette!