Text Information Management ChengXiang Zhai, Tao Tao, Xuehua Shen, Hui Fang, Azadeh Shakery, Jing Jiang.


1 Text Information Management ChengXiang Zhai, Tao Tao, Xuehua Shen, Hui Fang, Azadeh Shakery, Jing Jiang

2 Research Goals
General problem
– How can we manage large amounts of text information?
– Examples of text information: Web, email, scientific literature…
Our research aims at developing better techniques for text information access, text organization, and text mining.
Specific research questions
– How can we find useful information accurately? (text retrieval) (e.g., Google)
– How can we organize information automatically? (text classification) (e.g., automatically sort email messages into different folders)
– How can we discover knowledge from text? (text mining) (e.g., discover similarities and differences in opinions on the same event from different news sources)

3 Elements of Text Information Management Technologies
[Diagram labels: Search, Text Filtering, Categorization, Summarization, Clustering, Extraction, Mining, Visualization, Natural Language Content Analysis; Retrieval Applications, Mining Applications; Information Access, Knowledge Acquisition, Information Organization]
Road Map & Current Research Focus
Current focus:
– Personalized
– Complex utility
– Robust retrieval models
– Comparative text mining

4 Project 1: User-Centered Adaptive IR (UCAIR)
Traditional search engines:
– Users are not distinguished
– The meaning of a query depends on the user, e.g., “IR” can mean “information retrieval” or “infrared”
– A user gets mixed results
UCAIR search engines:
– Use more user information (e.g., knowing “infrared” never occurred in any documents the user viewed in the past 3 months)
– Use more search context (e.g., knowing the previous query is “retrieval algorithms” as opposed to “infrared applications”)
– Search results are individually optimized for each user
UCAIR Architecture [diagram]:
– Server side (good at picking top-N): query, search engine, documents, top-N results
– Client side (good at fine tuning of ranking): user, query, personalized agent, search context, user model, local collection, results
Progress so far:
– Developed a Bayesian decision framework
– Demonstrated ranking can be improved by exploiting query history [ACM SIGIR 2003 poster]
– Studied active user feedback [TREC 2003]
Ongoing work:
– UCAIR prototype systems [ACM SIGIR 2004 poster]
– Language models for context-sensitive search
– Collaborative information retrieval
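
The client-side idea on this slide (fine tuning a server's top-N list with search context) can be illustrated with a small sketch. This is not the UCAIR implementation or its Bayesian decision framework; it is a minimal example that assumes a unigram query model interpolated with the previous queries, and every name here (`unigram_model`, `context_query_model`, `rerank`, the weight `alpha`, the floor `eps`) is hypothetical.

```python
import math
from collections import Counter


def unigram_model(texts):
    """Maximum-likelihood unigram distribution over whitespace tokens."""
    counts = Counter(tok for t in texts for tok in t.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def context_query_model(current_query, previous_queries, alpha=0.7):
    """Interpolate the current query with the query history.

    alpha is a hypothetical weight on the current query; with no history
    the current query is used unchanged.
    """
    cur = unigram_model([current_query])
    if not previous_queries:
        return cur
    hist = unigram_model(previous_queries)
    vocab = set(cur) | set(hist)
    return {w: alpha * cur.get(w, 0.0) + (1 - alpha) * hist.get(w, 0.0)
            for w in vocab}


def rerank(top_n_docs, query_model, eps=1e-6):
    """Re-rank server results by the expected log-probability of the
    query model's words under each document's unigram model."""
    def score(doc):
        doc_model = unigram_model([doc])
        return sum(p * math.log(doc_model.get(w, 0.0) + eps)
                   for w, p in query_model.items())
    return sorted(top_n_docs, key=score, reverse=True)


# Toy top-N list for the ambiguous query "IR" (made-up documents).
docs = ["infrared sensors and infrared applications",
        "information retrieval algorithms for text search"]
model = context_query_model("IR", ["retrieval algorithms"])
ranked = rerank(docs, model)
print(ranked[0])
```

With the previous query “retrieval algorithms” in the history, the retrieval document moves to the top even though neither document contains the token “IR”. Without history the two documents tie and the server order is kept.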

5 Project 2: Retrieval with Complex Preferences
Limitations of traditional models:
– Optimize single objective function
– Assume independent, topical relevance
Realistic retrieval involves multi-objective utility functions and dependent relevance.
Examples of multi-objective retrieval:
– In distributed IR, retrieved documents need to be relevant and have a communication cost below a threshold
– A child user would like to find documents that are relevant and have appropriate levels of readability
– In Web retrieval, a useful page is either relevant by itself or pointing to other relevant pages
Example of dependent relevance:
– Subtopic retrieval: find documents to cover as many distinct subtopics as possible. E.g., Query = “Find different applications of robotics”
Progress so far:
– Developed a general multi-objective optimal retrieval framework [UIUC Tech Report 04]
  – Proposed a general query language, e.g.,
      SELECT document set X ABOUT “information retrieval”
      MAXIMIZE relevance(X)
      SUBJECTTO responsetime(X) < 30 (seconds)
                redundancy(X) < 20%
      RANKBY length(X)
  – Proposed an efficient algorithm for executing a multi-objective query
– Proposed methods and evaluation metrics for subtopic retrieval [ACM SIGIR 2003]
– Studied how to exploit link information in Web search [TREC 2003]
Ongoing work:
– Multi-objective retrieval as integer programming
– Applications of multi-objective retrieval in distributed IR and peer-to-peer IR
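
Subtopic retrieval, as described on this slide, asks for documents that together cover as many distinct subtopics as possible, so the value of a document depends on what has already been selected. A minimal sketch of that dependence is a greedy coverage ranking. This is a standard greedy heuristic, not necessarily the method of the cited SIGIR 2003 paper, and it assumes the subtopics of each document are already known; the function name, document IDs, and subtopic labels are made up for illustration.

```python
def greedy_subtopic_ranking(doc_subtopics, k):
    """Rank up to k documents, each time taking the one that adds the most
    subtopics not yet covered (ties go to the earlier document).

    doc_subtopics: dict mapping a document ID to its set of subtopic labels
    (assumed to be given, e.g. by human judgments).
    """
    covered = set()
    ranking = []
    remaining = dict(doc_subtopics)
    while remaining and len(ranking) < k:
        best = max(remaining, key=lambda d: len(remaining[d] - covered))
        gain = remaining[best] - covered
        if not gain:
            break  # no remaining document adds a new subtopic
        covered |= gain
        ranking.append(best)
        del remaining[best]
    return ranking, covered


# Hypothetical judgments for the query "Find different applications of robotics".
judged = {
    "d1": {"manufacturing", "surgery"},
    "d2": {"manufacturing"},
    "d3": {"space exploration", "surgery"},
    "d4": {"agriculture"},
}
ranking, covered = greedy_subtopic_ranking(judged, k=3)
print(ranking, sorted(covered))
```

Note that `d2` is relevant on its own but is never selected, because `d1` already covers its only subtopic. That is the sense in which relevance here is dependent rather than independent.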

6 Project 3: Robust & Accurate Retrieval Models
Traditional retrieval models:
– Involve ad hoc experimental parameter tuning
– No guarantee of optimality
– Return whole documents or pre-segmented passages
Need to develop more robust and more accurate retrieval models:
– How can we predict retrieval performance analytically without empirical experimentation?
– How can we optimize parameter setting automatically?
– How can we retrieve and feedback with passages of query-specific boundaries?
Progress so far:
– Formalized major retrieval heuristics with constraints [ACM SIGIR 2004], making it possible to
  – Study optimality of a retrieval formula analytically
  – Derive bounds of parameters
  – Develop retrieval models in a non-traditional way
– Developed statistical language models that can tune parameters completely automatically [ACM SIGIR 2002]
– Studied the use of Hidden Markov Models (HMMs) for passage retrieval, resulting in more accurate methods for dynamic passage boundary detection
– Developed more robust mixture language models for pseudo feedback and for modeling semi-structured queries [TREC 2003, ACM SIGIR 2004 poster]
Ongoing work:
– A novel framework for developing retrieval models based on constraint-satisfaction
– Robust and accurate language models for pseudo feedback
– Accurate passage retrieval and feedback using HMMs
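
To make the idea of formalizing retrieval heuristics as constraints more concrete, the sketch below checks one simple constraint against a scoring function. Both pieces are assumptions for illustration: the constraint (among documents of equal length, more occurrences of the query term should give a strictly higher score) is a simplified stand-in for the constraints in the cited work, and the scorer is ordinary query likelihood with Dirichlet prior smoothing, whose parameter `mu` is the kind of value usually tuned by hand. The collection probabilities are made up.

```python
import math
from collections import Counter


def dirichlet_score(query, doc, collection_prob, mu=2000.0):
    """Query log-likelihood with Dirichlet prior smoothing.

    collection_prob maps each query word to its (assumed) probability in
    the whole collection; mu is the smoothing parameter.
    """
    tokens = doc.split()
    tf = Counter(tokens)
    n = len(tokens)
    return sum(math.log((tf[w] + mu * collection_prob[w]) / (n + mu))
               for w in query.split())


def satisfies_tf_constraint(score_fn, query_term, filler, length=10):
    """Empirical check of a simplified term-frequency constraint:
    among documents of equal length, each additional occurrence of the
    query term must strictly increase the score."""
    scores = []
    for k in range(length + 1):
        doc = " ".join([query_term] * k + [filler] * (length - k))
        scores.append(score_fn(query_term, doc))
    return all(a < b for a, b in zip(scores, scores[1:]))


# Hypothetical collection statistics.
collection_prob = {"retrieval": 0.001, "the": 0.05}

lm_ok = satisfies_tf_constraint(
    lambda q, d: dirichlet_score(q, d, collection_prob), "retrieval", "the")

# A binary "term present or not" scorer ignores repeated occurrences.
binary_ok = satisfies_tf_constraint(
    lambda q, d: float(q in d.split()), "retrieval", "the")

print(lm_ok, binary_ok)
```

A check like this only samples a few synthetic documents. The analytical approach on the slide aims at something stronger: showing for which parameter values a formula satisfies a constraint for all inputs.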

7 Project 4: Comparative Text Mining
Problem definition:
– Given a comparable set of text collections
– Discover & analyze their similarities and differences
[Diagram: a general topic spans Collection C1, Collection C2, …, Collection Ck; the output is a set of common themes plus C1-specific, C2-specific, …, Ck-specific themes]
Possible applications (opinion summarization, business intelligence, text federation,…):
– Compare customer reviews of similar products to better summarize users’ opinions
– Compare opinions of different political parties on similar issues
– Compare computer science course web pages from different departments to reveal core CS topics and unique strengths of each department
– Compare literature in different time periods to reveal how research topics have been evolving
Progress so far:
– Developed a cross-collection mixture model [UIUC Tech Report 04]
– Applied the model to summarize laptop review data, automatically discovering themes such as “battery life”, “memory”, “port”, etc.
Ongoing work:
– More accurate mixture models (e.g., addressing proximity, more informative semantic units)
– Automatic generation of hyperlinks to reveal structures in unstructured text collections
– Comparative text mining prototype systems
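
The problem definition above (find what comparable collections share and what is specific to each) can be illustrated with a deliberately simple frequency-based sketch. This is not the cross-collection mixture model from the tech report; it replaces the probabilistic model with a threshold on relative word frequency, and the function names, the threshold `min_prob`, the collection names, and the review snippets are all made up.

```python
from collections import Counter


def word_dist(texts):
    """Relative frequency of each whitespace token in a collection."""
    counts = Counter(tok for t in texts for tok in t.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def compare_collections(collections, min_prob=0.2):
    """Split frequent words into common ones (frequent in every collection)
    and collection-specific ones (frequent in exactly that collection)."""
    frequent = {}
    for name, texts in collections.items():
        dist = word_dist(texts)
        frequent[name] = {w for w, p in dist.items() if p >= min_prob}
    common = set.intersection(*frequent.values())
    specific = {}
    for name, words in frequent.items():
        others = set().union(*(f for n, f in frequent.items() if n != name))
        specific[name] = words - others
    return common, specific


# Hypothetical review snippets for two laptops.
reviews = {
    "laptop_a": ["battery life is long", "battery life great"],
    "laptop_b": ["battery memory is small", "battery memory upgrade"],
}
common, specific = compare_collections(reviews)
print(common, specific)
```

On this toy input the shared word is “battery”, while “life” and “memory” come out as specific to one collection each. A mixture model addresses the same split but estimates word distributions for common and specific themes instead of applying a hard frequency cutoff.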

