Carnegie Mellon School of Computer Science, Language Technologies Institute, CMU Team-1 in TDT 2004 Workshop: CMU TEAM-A in TDT 2004 Topic Tracking. Yiming Yang.

Similar presentations
Diversified Retrieval as Structured Prediction Redundancy, Diversity, and Interdependent Document Relevance (IDR ’09) SIGIR 2009 Workshop Yisong Yue Cornell.

1 Evaluation Rong Jin. 2 Evaluation  Evaluation is key to building effective and efficient search engines usually carried out in controlled experiments.
Integrated Instance- and Class-based Generative Modeling for Text Classification Antti Puurula, University of Waikato; Sung-Hyon Myaeng, KAIST 5/12/2013 Australasian.
Search Engines Information Retrieval in Practice All slides ©Addison Wesley, 2008.
Proactive Learning: Cost- Sensitive Active Learning with Multiple Imperfect Oracles Pinar Donmez and Jaime Carbonell Pinar Donmez and Jaime Carbonell Language.
Partitioned Logistic Regression for Spam Filtering Ming-wei Chang University of Illinois at Urbana-Champaign Wen-tau Yih and Christopher Meek Microsoft.
A Maximum Coherence Model for Dictionary-based Cross-language Information Retrieval Yi Liu, Rong Jin, Joyce Y. Chai Dept. of Computer Science and Engineering.
Query Dependent Pseudo-Relevance Feedback based on Wikipedia SIGIR ‘09 Advisor: Dr. Koh Jia-Ling Speaker: Lin, Yi-Jhen Date: 2010/01/24 1.
Systems Engineering and Engineering Management The Chinese University of Hong Kong Parameter Free Bursty Events Detection in Text Streams Gabriel Pui Cheong.
Evaluating Search Engine
Introduction to Automatic Classification Shih-Wen (George) Ke 7 th Dec 2005.
Carnegie Mellon 1 Maximum Likelihood Estimation for Information Thresholding Yi Zhang & Jamie Callan Carnegie Mellon University
Generative and Discriminative Models in Text Classification David D. Lewis Independent Consultant Chicago, IL, USA
Reference Collections: Task Characteristics. TREC Collection Text REtrieval Conference (TREC) –sponsored by NIST and DARPA (1992-?) Comparing approaches.
1 LM Approaches to Filtering Richard Schwartz, BBN LM/IR ARDA 2002 September 11-12, 2002 UMASS.
Presented by Zeehasham Rasheed
Carnegie Mellon Exact Maximum Likelihood Estimation for Word Mixtures Yi Zhang & Jamie Callan Carnegie Mellon University Wei Xu.
Exploration & Exploitation in Adaptive Filtering Based on Bayesian Active Learning Yi Zhang, Jamie Callan Carnegie Mellon Univ. Wei Xu NEC Lab America.
An investigation of query expansion terms Gheorghe Muresan Rutgers University, School of Communication, Information and Library Science 4 Huntington St.,
Scalable Text Mining with Sparse Generative Models
Federated Search of Text Search Engines in Uncooperative Environments Luo Si Language Technology Institute School of Computer Science Carnegie Mellon University.
The Relevance Model  A distribution over terms, given information need I, (Lavrenko and Croft 2001). For term r, P(I) can be dropped w/o affecting the.
1 Probabilistic Language-Model Based Document Retrieval.
Transfer Learning From Multiple Source Domains via Consensus Regularization Ping Luo, Fuzhen Zhuang, Hui Xiong, Yuhong Xiong, Qing He.
Topic Detection and Tracking Introduction and Overview.
Processing of large document collections Part 2 (Text categorization) Helena Ahonen-Myka Spring 2006.
1 Information Filtering & Recommender Systems (Lecture for CS410 Text Info Systems) ChengXiang Zhai Department of Computer Science University of Illinois,
1 Logistic Regression Adapted from: Tom Mitchell’s Machine Learning Book Evan Wei Xiang and Qiang Yang.
1 Formal Models for Expert Finding on DBLP Bibliography Data Presented by: Hongbo Deng Co-worked with: Irwin King and Michael R. Lyu Department of Computer.
Transfer Learning Task. Problem Identification Dataset : A Year: 2000 Features: 48 Training Model ‘M’ Testing 98.6% Training Model ‘M’ Testing 97% Dataset.
Overview of the TDT-2003 Evaluation and Results Jonathan Fiscus NIST Gaithersburg, Maryland November 17-18, 2002.
Mining Social Network for Personalized Prioritization Language Technology Institute School of Computer Science Carnegie Mellon University Shinjae.
Detection, Classification and Tracking in a Distributed Wireless Sensor Network Presenter: Hui Cao.
Less is More Probabilistic Models for Retrieving Fewer Relevant Documents Harr Chen, David R. Karger MIT CSAIL ACM SIGIR 2006 August 9, 2006.
Greedy is not Enough: An Efficient Batch Mode Active Learning Algorithm Chen, Yi-wen( 陳憶文 ) Graduate Institute of Computer Science & Information Engineering.
LANGUAGE MODELS FOR RELEVANCE FEEDBACK Lee Won Hee.
November 10, 2004, Dmitriy Fradkin, CIKM'04 1 A Design Space Approach to Analysis of Information Retrieval Adaptive Filtering Systems Dmitriy Fradkin, Paul.
PSEUDO-RELEVANCE FEEDBACK FOR MULTIMEDIA RETRIEVAL Seo Seok Jun.
Paired Sampling in Density-Sensitive Active Learning Pinar Donmez joint work with Jaime G. Carbonell Language Technologies Institute School of Computer.
Chapter 8 Evaluating Search Engine. Evaluation n Evaluation is key to building effective and efficient search engines  Measurement usually carried out.
Computing & Information Sciences Kansas State University IJCAI HINA 2015: 3 rd Workshop on Heterogeneous Information Network Analysis KSU Laboratory for.
CMU at TDT 2004 — Novelty Detection Jian Zhang and Yiming Yang Carnegie Mellon University.
Probabilistic Latent Query Analysis for Combining Multiple Retrieval Sources Rong Yan Alexander G. Hauptmann School of Computer Science Carnegie Mellon.
Carnegie Mellon Novelty and Redundancy Detection in Adaptive Filtering Yi Zhang, Jamie Callan, Thomas Minka Carnegie Mellon University {yiz, callan,
Results of the 2000 Topic Detection and Tracking Evaluation in Mandarin and English Jonathan Fiscus and George Doddington.
Threshold Setting and Performance Monitoring for Novel Text Mining Wenyin Tang and Flora S. Tsai School of Electrical and Electronic Engineering Nanyang.
A DYNAMIC APPROACH TO THE SELECTION OF HIGH ORDER N-GRAMS IN PHONOTACTIC LANGUAGE RECOGNITION Mikel Penagarikano, Amparo Varona, Luis Javier Rodriguez-
Relevance-Based Language Models Victor Lavrenko and W. Bruce Croft Department of Computer Science University of Massachusetts, Amherst, MA SIGIR 2001.
Ping-Tsun Chang Intelligent Systems Laboratory Computer Science and Information Engineering National Taiwan University Combining Unsupervised Feature Selection.
1 Adaptive Subjective Triggers for Opinionated Document Retrieval (WSDM 09’) Kazuhiro Seki, Kuniaki Uehara Date: 11/02/09 Speaker: Hsu, Yu-Wen Advisor:
1 13/05/07 1/20 LIST – DTSI – Interfaces, Cognitics and Virtual Reality Unit The INFILE project: a crosslingual filtering systems evaluation campaign Romaric.
Virtual Examples for Text Classification with Support Vector Machines Manabu Sassano Proceedings of the 2003 Conference on Empirical Methods in Natural.
Personalization Services in CADAL Zhang yin Zhuang Yuting Wu Jiangqin College of Computer Science, Zhejiang University November 19,2006.
CMU TDT Report November 2001 The CMU TDT Team: Jaime Carbonell, Yiming Yang, Ralf Brown, Chun Jin, Jian Zhang Language Technologies Institute, CMU.
 Effective Multi-Label Active Learning for Text Classification Bishan Yang, Jun-Tao Sun, Tengjiao Wang, Zheng Chen KDD '09 Supervisor: Koh Jia-Ling Presenter:
Text Information Management ChengXiang Zhai, Tao Tao, Xuehua Shen, Hui Fang, Azadeh Shakery, Jing Jiang.
TDT 2004 Unsupervised and Supervised Tracking Hema Raghavan UMASS-Amherst at TDT 2004.
Using Asymmetric Distributions to Improve Text Classifier Probability Estimates Paul N. Bennett Computer Science Dept. Carnegie Mellon University SIGIR.
A Review of Information Filtering Part I: Adaptive Filtering Chengxiang Zhai Language Technologies Institiute School of Computer Science Carnegie Mellon.
Hierarchical Topic Detection UMass - TDT 2004 Ao Feng James Allan Center for Intelligent Information Retrieval University of Massachusetts Amherst.
Item 4 - Intrusion Detection and Prevention Yuh-Jye Lee Dept. of Computer Science and Information Engineering National Taiwan University of Science and.
A Study of Smoothing Methods for Language Models Applied to Ad Hoc Information Retrieval Chengxiang Zhai, John Lafferty School of Computer Science Carnegie.
Using Blog Properties to Improve Retrieval Gilad Mishne (ICWSM 2007)
CMPS 142/242 Review Section Fall 2011 Adapted from Lecture Slides.
Semi-Supervised Clustering
An Empirical Study of Learning to Rank for Entity Search
Proposed Formative Evaluation Adaptive Topic Tracking Systems
John Lafferty, Chengxiang Zhai School of Computer Science
Jonathan Elsas LTI Student Research Symposium Sept. 14, 2007
Machine Learning – a Probabilistic Perspective
Presentation transcript:

Slide 1: CMU TEAM-A in TDT 2004 Topic Tracking. Yiming Yang, School of Computer Science, Carnegie Mellon University.

Slide 2: CMU Team A
– Jaime Carbonell (PI)
– Yiming Yang (Co-PI)
– Ralf Brown
– Jian Zhang
– Nianli Ma
– Shinjae Yoo
– Bryan Kisiel, Monica Rogati, Yi Chang

Slide 3: Tasks in TDT 2004 in Which CMU Participated
 Topic Tracking (Nianli Ma et al.)
 Supervised Adaptive Tracking (Yiming Yang et al.)
 New Event Detection (Jian Zhang et al.)
 Link Detection (Ralf Brown)
 Hierarchical Topic Detection (CMU did not participate)

Slide 4: Topic Tracking with Supervised Adaptation ("Adaptive Filtering" in TREC). [Timeline diagram: past training documents (on-topic and off-topic) for Topics 1, 2, 3, ...; unlabeled test documents arrive over time; relevance feedback is received on the current document.]

Slide 5: Topic Tracking with Pseudo-Relevance Feedback ("Topic Tracking" in TDT). [Same timeline diagram, but no true labels are available on test documents: the system must decide whether each current document is on-topic using pseudo-relevance feedback (PRF).]

Slide 6: Adaptive Rocchio with PRF. [Formulas for the conventional and improved versions, shown as images.]
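The slide's formulas survive only as images, but the general shape of an adaptive Rocchio profile with weighted pseudo-relevance feedback can be sketched as below. This is an illustrative sketch: the parameter names and values (alpha, beta, gamma) are placeholders, not the slide's tuned settings.

```python
import numpy as np

def rocchio_profile(train_on, train_off, prf_docs, prf_weights,
                    alpha=1.0, beta=0.25, gamma=0.1):
    """Build a topic profile vector from labeled training documents plus
    pseudo-relevant documents. Inputs are lists of term-weight vectors.

    The "improved" (weighted PRF) idea: each pseudo-relevant document
    contributes in proportion to its tracking score, rather than equally.
    """
    profile = alpha * np.mean(train_on, axis=0)
    if train_off:
        profile -= beta * np.mean(train_off, axis=0)
    if prf_docs:
        w = np.asarray(prf_weights, dtype=float)
        profile += gamma * np.average(prf_docs, axis=0, weights=w)
    return profile

def track(profile, doc, threshold=0.5):
    """Cosine similarity against the profile decides on/off topic."""
    sim = profile @ doc / (np.linalg.norm(profile) * np.linalg.norm(doc) + 1e-12)
    return sim, sim >= threshold
```

In a tracking run, the profile would be rebuilt (or incrementally updated) as new documents clear the PRF threshold, which is what makes the Rocchio variant adaptive.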

Slide 7: Rocchio in Tracking on TDT 2003 Data. Weighted PRF reduced Ctrk by 12%. Ctrk is the tracking cost: a weighted combination of the miss rate and the false-alarm rate.
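For reference, the TDT tracking cost combines the two error rates linearly and is normalized by the cost of the better trivial system (accept everything or reject everything). The constants below (Cmiss = 1.0, Cfa = 0.1, Ptarget = 0.02) are the values commonly used in TDT evaluations and are stated here as an assumption, not taken from the slide.

```python
def tracking_cost(p_miss, p_fa, c_miss=1.0, c_fa=0.1, p_target=0.02):
    """Normalized TDT-style tracking cost (lower is better).

    raw cost = Cmiss * P(miss) * P(target) + Cfa * P(fa) * P(non-target),
    normalized so that the best trivial system scores 1.0.
    Constants are the commonly cited TDT defaults (an assumption here).
    """
    raw = c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)
    norm = min(c_miss * p_target, c_fa * (1.0 - p_target))
    return raw / norm
```

Because on-topic documents are rare (small Ptarget) and the miss rate is normalized by the small number of on-topic documents, each individual miss ends up far more expensive than each individual false alarm, which is where a per-document penalty ratio on the order of 1270:1 (cited later in the deck) comes from.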

Slide 8: Primary Tracking Results in TDT 2004. [Results shown as a chart.]

Slide 9: DET Curves of Methods on TDT 2004 Data. [DET-curve plot; annotation: "Charles' target".]

Slide 10: Supervised Adaptive Tracking
"Adaptive filtering" in TREC (since 1997):
– Rocchio with threshold-calibration strategies (Yang et al., CIKM 2003)
– Probabilistic models assuming Gaussian/exponential distributions (Arampatzis et al., TREC 2001)
– Combined use of Rocchio and logistic regression (Yi Zhang, SIGIR 2004)
A new task in TDT 2004:
– Topics are narrower and typically shorter-lived than TREC topics

Slide 11: Our Experiments
4 methods:
– Rocchio with a fixed threshold (Roc.fix)
– Rocchio with an adaptive threshold set by Margin-based Local Regression (Roc.MLR)
– Nearest neighbor (Ralf's variant) with a fixed threshold (kNN.fix)
– Logistic regression regularized by a complexity penalty (LR)
3 corpora:
– TDT5, the evaluation set for TDT 2004
– TDT4, a validation set for parameter tuning
– TREC11 (2002), a reference set for robustness analysis
2 optimization criteria:
– Ctrk: the TDT standard, equivalent to setting the penalty ratio for miss vs. false alarm to approximately 1270:1
– T11SU: the TREC standard, equivalent to a penalty ratio of 2:1

Slide 12: Outline of Our Methods
Roc.fix and kNN.fix:
– Non-probabilistic models that generate ad hoc scores for documents with respect to each topic
– Fixed global threshold, tuned on a retrospective corpus
Roc.MLR:
– Non-probabilistic model, ad hoc scores
– Threshold locally optimized using incomplete relevance judgments over a sliding window of documents
LR:
– Probabilistic model of Pr(topic | x)
– Fixed global threshold that optimizes the utility
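The sliding-window thresholding idea can be sketched roughly as below. This is an illustrative simplification, not the authors' Margin-based Local Regression: it simply picks, from a window of recently scored documents with possibly incomplete judgments, the cutoff that minimizes a miss/false-alarm penalty (the penalty values are placeholders).

```python
def local_threshold(window, miss_penalty=1.0, fa_penalty=0.5):
    """Pick a score cutoff from a sliding window of (score, label) pairs.

    label is True (judged on-topic), False (judged off-topic), or None
    (unjudged; ignored here, reflecting incomplete relevance judgments).
    Returns the candidate threshold with the lowest penalized error count.
    """
    judged = [(s, y) for s, y in window if y is not None]
    best_t, best_cost = 0.0, float("inf")
    for t in sorted({s for s, _ in judged}):
        miss = sum(1 for s, y in judged if y and s < t)
        fa = sum(1 for s, y in judged if not y and s >= t)
        cost = miss_penalty * miss + fa_penalty * fa
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t
```

As the window slides forward in time, the threshold is re-estimated, so it can drift with the topic instead of staying fixed globally.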

Slide 13: Regularized Logistic Regression
– The objective is to find the optimal regression coefficients [formula shown as an image]
– This is equivalent to Maximum A Posteriori (MAP) estimation under a prior distribution on the coefficients [formula shown as an image]
– The model predicts the probability of a topic given the data
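The slide's formulas survive only as images; a standard form of a regularized logistic-regression objective consistent with the slide's description is given below. The notation is generic (labels $y_i \in \{-1, +1\}$, regularization weight $\lambda$) and may differ from the slide's exact symbols.

```latex
\hat{\mathbf{w}}
  = \arg\min_{\mathbf{w}}
    \sum_{i=1}^{n} \log\!\bigl(1 + e^{-y_i\,\mathbf{w}^{\top}\mathbf{x}_i}\bigr)
    + \lambda \,\lVert \mathbf{w} \rVert_2^2,
\qquad
\Pr(\text{topic} \mid \mathbf{x})
  = \frac{1}{1 + e^{-\mathbf{w}^{\top}\mathbf{x}}}
```

Minimizing this objective is MAP estimation under a zero-mean Gaussian prior on the coefficients, since the penalty term $\lambda \lVert \mathbf{w} \rVert_2^2$ is, up to an additive constant, the negative log of such a prior.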

Slide 14: Roc.fix on the TDT3 Corpus. Relevance feedback on 1.6% of documents gave a 25% reduction in minimum cost. [Chart legend: Base = no RF or PRF; PRF = weighted PRF; MLR = partial RF; FRF = complete RF.]

Slide 15: Effect of SA vs. PRF on the TDT5 Corpus. With Rocchio.fix, supervised adaptation (SA) reduced Ctrk by 54% compared to PRF; with nearest neighbors, SA reduced Ctrk by 48%.

Slide 16: SATracking Results on the TDT5 Corpus. For each team, the best score (with respect to Ctrk or T11SU) among the submitted runs is shown. [Charts: Ctrk, where lower is better, and T11SU, where higher is better.]

Slide 17: Relative Performance of Our Methods. [Charts comparing the methods under two criteria: TREC utility (T11SU), with a miss vs. false-alarm penalty ratio of 2:1, and TDT cost (Ctrk), with a penalty ratio of approximately 1270:1.]

Slide 18: Main Observations
– Encouraging results: a small amount of relevance feedback (on 1–2% of the documents) yielded significant performance improvements.
– Puzzling point: Rocchio without any threshold calibration works surprisingly well on both Ctrk and T11SU, which is inconsistent with our observations on TREC data. Why?
– Scaling issue: the scale of the TDT domain is a significant challenge for learning algorithms such as LR and MLR.

Slide 19: Temporal Nature of Topics/Events. [Plots contrasting, over time, a TREC topic (Elections), a TDT event (the Nov. APEC Meeting), and a broadcast-news topic (Kidnappings).]

Slide 20: Topics for Future Research
– Keep up with new algorithms and theories
– Exploit domain knowledge, e.g., predefined topics (and super-topics) in a hierarchical setting
– Investigate topic-conditioned event tracking with predictive features (including named entities)
– Develop algorithms to detect and exploit temporal trends
– TDT in cross-lingual settings

Slide 21: References
 Y. Yang and B. Kisiel. Margin-based Local Regression for Adaptive Filtering. ACM CIKM 2003 (Conference on Information and Knowledge Management).
 J. Zhang and Y. Yang. Robustness of regularized linear classification methods in text categorization. ACM SIGIR 2003.
 J. Zhang, R. Jin, Y. Yang and A. Hauptmann. Modified logistic regression: an approximation to SVM and its application in large-scale text categorization. ICML 2003 (International Conference on Machine Learning).
 N. Ma, Y. Yang and M. Rogati. Cross-Language Event Tracking. Asia Information Retrieval Symposium (AIRS), 2004.