An Information Retrieval Approach based on Discourse Type. D. Y. Wang, R. W. P. Luk, K. F. Wong (1) and K. L. Kwok (2). NLDB 2006. Department of Computing, The Hong Kong Polytechnic University; (1) Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong; (2) Department of Computer Science, City University of New York.
Content: Introduction; Motivation; Discourse Type; Information Unit; Problem Formulation; Score of topic terms; Score of discourse type; Document Re-ranking; Experimental Results; Conclusion.
Motivation: The effectiveness of information retrieval (IR) systems varies substantially from one topic to another. One reason: users' information needs are very diverse. Our approach: find the discourse type of the topic and adopt an appropriate strategy for it.
Discourse Type. Definition of discourse type: the functions (including properties and relations that cannot exist independently) of the independent entities. Examples:
Q No. 654. Information need (TREC query): What are the advantages and disadvantages of same-sex schools? Independent entity: same-sex school. Discourse type: advantages and disadvantages.
Q No. 436. Information need (TREC query): What are the causes of railway accidents throughout the world? Independent entity: railway accident. Discourse type: cause.
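Performance Difference. Table columns: Discourse Type, Number, MAP, Variance. Rows (discourse types): Treatment; Concrete things; Advantage / Disadvantage; Reason; Objection; Number; General Information; Steps (solution); Abstraction; Impact; Procedure. Average = 0.2768. (The per-row values are not preserved in this transcript.)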
Why choose "Advantage / Disadvantage" as our example? Its performance is worse than the average. It is relatively abstract compared with concrete things (e.g., people, country), and is therefore unlikely to have been investigated before. It is related to cue phrases (e.g., "more than") that are composed of stop words, and conventional IR ignores stop words.
Why choose "Advantage / Disadvantage" as our example? (cont.) It is a popular discourse type of information need: we found at least 40 questions asking about the advantages and disadvantages of something on one website (http://www.answerbag.com). It has a reasonable number (i.e., eight) of TREC topics for investigation (see the next slide).
Eight queries with discourse type Advantage / Disadvantage (Query No., Query Title):
308, Implant Dentistry
605, Great Britain health care
608, taxing social security
624, SDI Star Wars
637, human growth hormone (HGH)
654, same-sex schools
690, college education advantage
699, term limits
Information Unit (IU). [Figure: a document drawn as lines of text containing several occurrences of a term; figure labels: "A document", "t", "w words".]
Why IU? Assumption: terms inside an IU (around topic terms) are more important to the relevance of a document than terms outside the IU. IUs also simplify the processing of documents: compute a score for each IU, then aggregate the scores of all IUs as the score of the document.
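The IU idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the slides do not say whether "w words" is the total window size or the size on each side, whether overlapping windows are merged, or how IU scores are aggregated. The sketch assumes w tokens on each side of a topic term, one IU per occurrence, and a plain sum. The names `extract_ius` and `document_score` and the sample sentence are hypothetical.

```python
from typing import Callable, List, Set


def extract_ius(tokens: List[str], topic_terms: Set[str], w: int) -> List[List[str]]:
    """Return one IU per topic-term occurrence.

    Assumption: an IU is the topic term plus up to w tokens on each side.
    """
    ius = []
    for i, tok in enumerate(tokens):
        if tok in topic_terms:
            start = max(0, i - w)
            ius.append(tokens[start:i + w + 1])
    return ius


def document_score(tokens: List[str], topic_terms: Set[str], w: int,
                   iu_score: Callable[[List[str]], float]) -> float:
    """Score every IU and aggregate the scores (assumption: by summing)."""
    return sum(iu_score(iu) for iu in extract_ius(tokens, topic_terms, w))


# Hypothetical sentence, used only to show the windows.
tokens = "some critics say same-sex schools offer more focus than mixed schools".split()
ius = extract_ius(tokens, {"schools"}, w=2)
total_length = document_score(tokens, {"schools"}, 2, iu_score=len)
```

Here `iu_score=len` is a stand-in scoring function; any per-IU score (topic terms, discourse type, or both) could be passed instead.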
Example: Score of Discourse Type.
Text (FT): "more companies to adopt a high-performance model of work organization giving more responsibility to entry-level employees it has also backed reforms aimed at improving preparation for work mr clinton differs only in supporting more radical efforts to make employers train"
more (comparative words) = 3
support = ['back', 'confirm', 'contest', 'contrari', 'defend', 'encourag', 'endors', 'object', 'oppon', 'oppos', 'opposit', 'prove', 'quibbl', 'refer', 'sponsor', 'support'] (from www.answers.com)
support = 2
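The two counts in this example can be reproduced with a short Python sketch. It rests on several assumptions that the slide does not confirm: the trailing "more" before "(comparative words)" is read as the label of the count rather than as part of the text; cue words are matched by testing whether a token starts with one of the listed stems, which is a crude stand-in for whatever stemmer the authors used; and the helper names are hypothetical. The sketch only counts cue words; how the counts are turned into a discourse-type score is not given on the slide.

```python
import re
from typing import Iterable, List

# Stem list copied from the slide (source given there as www.answers.com).
SUPPORT_STEMS = ["back", "confirm", "contest", "contrari", "defend", "encourag",
                 "endors", "object", "oppon", "oppos", "opposit", "prove",
                 "quibbl", "refer", "sponsor", "support"]

IU_TEXT = ("more companies to adopt a high- performance model of work "
           "organization giving more responsibility to entry-level employees "
           "it has also backed reforms aimed at improving preparation for work "
           "mr clinton differs only in supporting more radical efforts to make "
           "employers train")


def tokenize(text: str) -> List[str]:
    """Lower-case the text and keep alphabetic runs only."""
    return re.findall(r"[a-z]+", text.lower())


def count_exact(tokens: Iterable[str], words: Iterable[str]) -> int:
    """Count tokens that are exactly one of the given words."""
    wanted = set(words)
    return sum(1 for t in tokens if t in wanted)


def count_stem_matches(tokens: Iterable[str], stems: Iterable[str]) -> int:
    """Count tokens that start with any stem (assumption: prefix match, not real stemming)."""
    stem_list = list(stems)
    return sum(1 for t in tokens if any(t.startswith(s) for s in stem_list))


tokens = tokenize(IU_TEXT)
more_count = count_exact(tokens, ["more"])
support_count = count_stem_matches(tokens, SUPPORT_STEMS)
```

With the text above, the tokens matching the support stems are "backed" and "supporting".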
Document Re-ranking. IU score before re-ranking: S0, the similarity score of the document that contains the IU. IU re-ranking score S', in three variants: (1) S' = S0 * score of topic terms; (2) S' = S0 * score of discourse type; (3) S' = S0 * score of topic terms * score of discourse type. Aggregate the re-ranking scores of all IUs in a document as the final score of the document, then re-rank the documents by the final score.
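The three re-ranking variants can be sketched as follows. This is a minimal illustration, not the authors' code: the slide does not say how IU scores are aggregated, so the sketch assumes a sum; documents without any IU end up with a final score of 0, which is also an assumption. The `IU` record, the `rerank` function, the mode names, and the numbers in the usage lines are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class IU:
    doc_id: str             # document that contains this IU
    topic_score: float      # score of topic terms for this IU
    discourse_score: float  # score of discourse type for this IU


def rerank(s0: Dict[str, float], ius: List[IU], mode: str = "both") -> List[Tuple[str, float]]:
    """Compute S' = S0 * factor per IU, sum per document, sort by final score."""
    if mode not in ("topic", "discourse", "both"):
        raise ValueError(f"unknown mode: {mode}")
    final = {doc_id: 0.0 for doc_id in s0}
    for iu in ius:
        factor = 1.0
        if mode in ("topic", "both"):
            factor *= iu.topic_score
        if mode in ("discourse", "both"):
            factor *= iu.discourse_score
        final[iu.doc_id] += s0[iu.doc_id] * factor
    return sorted(final.items(), key=lambda kv: kv[1], reverse=True)


# Made-up scores, only to show that the variants can order documents differently.
s0 = {"d1": 2.0, "d2": 1.0}
ius = [IU("d1", 1.0, 0.5), IU("d2", 0.5, 2.0), IU("d2", 0.5, 1.0)]
by_topic = rerank(s0, ius, "topic")
by_discourse = rerank(s0, ius, "discourse")
by_both = rerank(s0, ius, "both")
```

In this toy input, re-ranking by topic terms alone keeps d1 first, while the discourse-type and combined variants move d2 to the top.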
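Re-ranking Results in MAP. Table column groups: Original, Topic, Discourse, Both. Column labels as extracted: QID, "BM25atS3cd4-8c2S2dc2S2d". Final rows: Mean; p <= (Wilcoxon). (The cell values are not preserved in this transcript.)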
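For readers unfamiliar with the measure, mean average precision (MAP) can be computed as in the sketch below. This is the standard textbook definition, not necessarily the exact evaluation tool used for these results, and the rankings and relevance sets in the usage lines are made-up toy inputs.

```python
from typing import Iterable, List, Set, Tuple


def average_precision(ranked: List[str], relevant: Set[str]) -> float:
    """Average of precision values at the rank of each relevant document retrieved."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    # Relevant documents that were never retrieved contribute zero.
    return total / len(relevant)


def mean_average_precision(runs: Iterable[Tuple[List[str], Set[str]]]) -> float:
    """Mean of the per-query average precision values."""
    aps = [average_precision(ranked, relevant) for ranked, relevant in runs]
    return sum(aps) / len(aps)


# Toy rankings for two hypothetical queries.
ap_q1 = average_precision(["d2", "d1", "d3"], {"d2", "d3"})
ap_q2 = average_precision(["d1", "d2"], {"d2"})
map_score = mean_average_precision([(["d2", "d1", "d3"], {"d2", "d3"}),
                                    (["d1", "d2"], {"d2"})])
```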
Conclusion: Re-ranking based on topic terms and re-ranking based on discourse type can each improve retrieval performance. Combining the two improves the results most significantly (at the 95% confidence level, already taking the sample size into account). This approach is promising and is worth further investigation. Acknowledgement: We thank the Center for Intelligent Information Retrieval, University of Massachusetts, for facilitating Robert Luk's development of the basic IR system while he was on leave there. This work is supported by the CERG Project # PolyU 5226/05E.