
UMass Amherst at TDT 2003 James Allan, Alvaro Bolivar, Margie Connell, Steve Cronen-Townsend, Ao Feng, FangFang Feng, Leah Larkey, Giridhar Kumaran, Victor Lavrenko, Ramesh Nallapati, and Hema Raghavan Center for Intelligent Information Retrieval Department of Computer Science University of Massachusetts Amherst

What we did
- Tasks:
  - Story Link Detection
  - Topic Tracking
  - New Event Detection
  - Cluster Detection

Outline
- Rule of Interpretation (ROI) classification
- ROI-based vocabulary reduction
- Cross-language techniques
  - Dictionary translation of Arabic stories
  - Native language comparisons
  - Adaptive tracking
- Relevance models

ROI motivation
- Analyzed vector space similarity measures
  - They failed to distinguish between similar topics
  - e.g. two “health care” stories from different topics: different locations and individuals, but similarity dominated by “health care” terms (drugs, cost, coverage, plan, prescription)
- Possible solution: first categorize stories
  - Different category → different topics (mostly true)
  - Use within-category statistics: “health care” may be less confusing
- Rules of Interpretation provide natural categories

ROI intuition
- Each document in the corpus is classified into one of the ROI categories
- Stories in different ROIs are less likely to be in the same topic
- If two stories belong to the same ROI: sim_new(s1, s2) = sim_old(s1, s2)
- If two stories belong to different ROIs, we should trust their similarity less: sim_new(s1, s2) < sim_old(s1, s2)
- [Diagram: an ROI-tagged corpus]

ROI classifiers
- Naïve Bayes
- BoosTexter [Schapire and Singer, 2000]
  - Decision-tree classifier that generates and combines simple rules
  - Features are terms, with tf as weights
- Used the most likely single class
  - Explored using the distribution over all classes, but were unable to do so successfully

Training Data for Classification
- Experiments: train on TDT-2, test on TDT-3
- Submissions: train on TDT-2 plus TDT-3
- Training data prepared the same way:
  - Stories in each topic tagged with the topic’s ROI
  - Remove duplicate stories (in topics with the same ROI)
  - Remove all stories with more than one ROI
- Worst case: a single story relevant to…
  - “Chinese Labor Activists”, with ROI Legal/Criminal Cases
  - “Blair Visits China in October”, with ROI Political/Diplomatic Mtgs.
  - “China will not allow Opposition Parties”, with ROI Miscellaneous
- Experiments with removing named entities for training

Naïve Bayes vs. BoosTexter
- Similar classification accuracy
  - Overall accuracy is the same, but the errors are substantially different
- Our training results (TDT-3): BoosTexter beat Naïve Bayes for SLD and NED
  - BoosTexter was therefore used in most tasks for the submission
- Evaluation results: in link detection, Naïve Bayes proved more useful

ROI classes in link detection
- Given a story pair and their estimated ROIs:
  - If the estimated ROIs are the same, leave the score alone
  - If they are different, reduce the score to 1/3 of its original value (a factor chosen based on training runs)
- Used four different ROI classifiers:
  - ROI-BT, ne: BoosTexter with named entities
  - ROI-BT, no-ne: BoosTexter without named entities
  - ROI-NB, ne: Naïve Bayes with named entities
  - ROI-NB, no-ne: Naïve Bayes without named entities
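The mismatch rule above is simple enough to sketch directly. This is a minimal illustration, not the submitted system; the function name and the `penalty` parameter are ours, with 1/3 as the trained value mentioned on the slide:

```python
def roi_adjusted_score(sim, roi_a, roi_b, penalty=1.0 / 3.0):
    """Down-weight a story-pair similarity when the predicted ROIs of
    the two stories disagree; leave it untouched when they agree."""
    return sim if roi_a == roi_b else sim * penalty


# Same ROI: score unchanged; different ROI: reduced to 1/3.
print(roi_adjusted_score(0.9, "Legal/Criminal Cases", "Legal/Criminal Cases"))
print(roi_adjusted_score(0.9, "Legal/Criminal Cases", "Miscellaneous"))
```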

Training effectiveness (TDT-3)
- Story link detection, minimum normalized cost
- [Table: original score vs. ROI-BT and ROI-NB (each with and without named entities) under the 1Dcos, 4Dcos, and UDcos conditions; numeric values not preserved in the transcript]

Evaluation results
- Story link detection
- [Table: original score vs. ROI-BT and ROI-NB (each with and without named entities) under the 1Dcos, 4Dcos, and UDcos conditions; numeric values not preserved in the transcript]

ROI for tracking
- Compare the story to the centroid of the topic, built from the training stories
- If the ROI does not match, drop the score based on how bad the mismatch is
- Used the ROI-BT,ne classifier only

Training for tracking
- Topic tracking on TDT-3, minimum normalized cost
- ROI: BoosTexter with named entities only
- [Table: Nt=1 and Nt=4, original score vs. ROI-BT,ne, under the 1Dcos, 4Dcos, ADcos, and UDcos conditions; numeric values not preserved in the transcript]

Evaluation results
- Topic tracking on TDT-3, minimum normalized cost
- ROI: BoosTexter with named entities only
- [Table: Nt=1 and Nt=4, original score vs. ROI-BT,ne, under the 1Dcos, 4Dcos, ADcos, and UDcos conditions; numeric values not preserved in the transcript]

ROI-based vocabulary pruning
- New Event Detection only
- Create a “stop list” for each ROI:
  - The 300 most frequent terms in stories within the ROI
  - Obtained from the TDT-2 corpus
- When a story is classified into an ROI, remove those terms from the story’s vector
- ROI determined by the BoosTexter classifier
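The per-ROI stop lists can be sketched as follows. This is a simplified illustration (the function names and data layout are ours; the slide's 300-term cutoff is exposed as `n`):

```python
from collections import Counter


def build_roi_stoplists(docs_by_roi, n=300):
    """For each ROI, collect the n most frequent terms across its
    stories; docs_by_roi maps an ROI label to a list of token lists."""
    stoplists = {}
    for roi, docs in docs_by_roi.items():
        counts = Counter(token for doc in docs for token in doc)
        stoplists[roi] = {term for term, _ in counts.most_common(n)}
    return stoplists


def prune_vector(tokens, roi, stoplists):
    """Drop the ROI's frequent terms from a story's term vector."""
    return [t for t in tokens if t not in stoplists.get(roi, set())]
```

In the actual system the lists were built from the TDT-2 corpus and the ROI came from the BoosTexter classifier; here both are supplied by the caller.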

New Event Detection approach
- Cosine similarity measure, with:
  - ROI-based vocabulary pruning
  - Score normalization
  - Incremental IDF
  - Removal of short documents
- Preprocessing:
  - Train BoosTexter on TDT-2 & TDT-3
  - Include named entities while training

NED Results
- [Table: TDT-3 and TDT-4 results; numeric values not preserved in the transcript]

ROI Conclusions
- Both uses of ROI helped in training:
  - Score reduction for ROI mismatch (tracking and link detection)
  - Vocabulary pruning (new event detection)
- Score reduction failed in evaluation
  - Named entities are important in the ROI classifier, and TDT-4 has a different set of entities (time gap)
  - Possible overfitting to TDT-3?
- Preliminary work applying ROI to detection has been unsuccessful to date

Outline
- Rule of Interpretation (ROI) classification
- ROI-based vocabulary reduction
- Cross-language techniques
  - Dictionary translation of Arabic stories
  - Native language comparisons
  - Adaptive tracking
- Relevance models

Comparing multilingual stories
- Baseline: all stories converted to English using the provided machine translations
- New approaches:
  - Dictionary translation of Arabic stories
  - Native language comparisons
  - Adaptation in tracking

Dictionary Translation of Arabic
- Probabilistic translation model
- Each Arabic word has multiple English translations
  - Obtain P(e|a) from the UN Arabic-English parallel corpus
- Forms a pseudo-story in English representing the Arabic story
  - It can get large due to multiple translations per word
  - Keep the English words whose summed probabilities are the greatest
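The pseudo-story construction can be sketched as below. This is an illustration, not the submitted system: the function name and data layout are ours, and the `"A1"`/`"A2"` tokens in the example stand in for Arabic words; P(e|a) would come from the parallel corpus:

```python
from collections import defaultdict


def translate_story(arabic_tokens, p_e_given_a, keep=100):
    """Build an English pseudo-story for an Arabic story.

    p_e_given_a maps each Arabic word to a {english_word: probability}
    dict.  Every Arabic token spreads its probability mass over all of
    its translations; only the English words with the largest summed
    mass are kept, bounding the pseudo-story's size."""
    mass = defaultdict(float)
    for a in arabic_tokens:
        for e, p in p_e_given_a.get(a, {}).items():
            mass[e] += p
    ranked = sorted(mass.items(), key=lambda kv: -kv[1])
    return [e for e, _ in ranked[:keep]]
```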

Language-specific comparisons
- Language representations:
  - Arabic: CP1256 encoding and light stemming
  - English: stopped and stemmed with kstem
  - Chinese: segmented if necessary, with overlapping bigrams
- Linking task: if the stories are in the same language, use that language
- All other comparisons are done with all stories translated into English

Adaptation in tracking
- Stories are added to the topic when their similarity score is high
- A topic representation is established in each language as soon as an added story in that language appears
- The similarity of an Arabic story is then computed against the Arabic topic representation, and so on for each language
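The per-language adaptation step can be sketched as follows. This is a simplified illustration under our own assumptions (the function names, story layout, and threshold are ours; `sim` is any sparse-vector similarity supplied by the caller, and the centroid here is an unweighted average rather than the system's exact topic representation):

```python
def centroid(vectors):
    """Average a list of sparse term-weight vectors."""
    out = {}
    for v in vectors:
        for t, w in v.items():
            out[t] = out.get(t, 0.0) + w / len(vectors)
    return out


def track_story(story, members_by_lang, sim, threshold=0.3):
    """Compare a story to the topic representation in its own language
    when one exists, falling back to English otherwise.  On a high
    score the story is added, which establishes a representation in
    its language for later comparisons."""
    lang, vec = story["lang"], story["vector"]
    pool = members_by_lang.get(lang) or members_by_lang["eng"]
    score = sim(vec, centroid(pool))
    if score > threshold:
        members_by_lang.setdefault(lang, []).append(vec)
    return score
```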

Cross-Lingual Link Detection Results
- Minimum cost, relative to the 1DcosIDF baseline:
  - TDT-3: UDcosIDF −8%, 4DcosIDF −28%
  - TDT-4: UDcosIDF −1%, 4DcosIDF −20%
  - (absolute cost values not preserved in the transcript)
- Translation conditions:
  - 1DcosIDF: baseline; all stories in English using the provided translations
  - UDcosIDF: all stories in English, but using the dictionary translation of Arabic
  - 4DcosIDF: compare a pair of stories in their native language if both stories are in the same language; otherwise compare them in English using the dictionary translation of Arabic

Cross-Lingual Topic Tracking Results (required condition: Nt=1, bnman)
- Minimum cost, relative to the 1DcosIDF baseline:
  - TDT-3: UDcosIDF −2%, 4DcosIDF −4%, ADcosIDF −26%
  - TDT-4: UDcosIDF +3%, 4DcosIDF +3%, ADcosIDF +2%
  - (absolute cost values not preserved in the transcript)
- Translation conditions:
  - 1DcosIDF: baseline
  - UDcosIDF: dictionary translation of Arabic
  - 4DcosIDF: compare a pair of stories in their native language
  - ADcosIDF: baseline plus adaptation; a story is added to the centroid vector if its similarity score exceeds the adapting threshold, the vector is limited to the top 100 terms, and at most 100 stories can be added to the centroid

Cross-Lingual Topic Tracking Results (alternate condition: Nt=4, bnasr)
- Minimum cost, relative to the 1DcosIDF baseline:
  - TDT-3: UDcosIDF −7%, 4DcosIDF −9%, ADcosIDF −25%
  - TDT-4: UDcosIDF −5%, 4DcosIDF −10%, ADcosIDF −14% (cost 0.1463)
  - (other absolute cost values not preserved in the transcript)
- Translation conditions:
  - 1DcosIDF: baseline
  - UDcosIDF: dictionary translation of Arabic
  - 4DcosIDF: compare a pair of stories in their native language
  - ADcosIDF: baseline plus adaptation

Outline
- Rule of Interpretation (ROI) classification
- ROI-based vocabulary reduction
- Cross-language techniques
  - Dictionary translation of Arabic stories
  - Native language comparisons
  - Adaptive tracking
- Relevance models

Relevance Models for SLD
- Relevance Model (RM): a “model of stories relevant to a query”
- Algorithm, given stories A and B:
  1. Compute “queries” Q_A and Q_B
  2. Estimate relevance models P(w|Q_A) and P(w|Q_B)
  3. Compute the divergence between the two relevance models
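Steps 2 and 3 can be sketched as below. This is a deliberately simplified illustration, not the estimator used in the submission: the relevance model here is just a mixture of document language models weighted by normalized retrieval scores, and KL divergence with additive smoothing (the `eps` constant is ours) stands in for whichever divergence the system used:

```python
import math


def relevance_model(doc_models, doc_scores):
    """Estimate P(w|Q) as a mixture of document language models,
    weighted by their (normalized) retrieval scores for the query."""
    z = sum(doc_scores)
    rm = {}
    for model, s in zip(doc_models, doc_scores):
        for w, p in model.items():
            rm[w] = rm.get(w, 0.0) + p * s / z
    return rm


def kl_divergence(p, q, eps=1e-9):
    """KL(p || q) over sparse distributions, with eps-smoothing so that
    terms missing from q do not produce infinities."""
    return sum(pw * math.log(pw / (q.get(w, 0.0) + eps))
               for w, pw in p.items() if pw > 0)
```

A low divergence between the two stories' relevance models indicates a likely link; a high one indicates different topics.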

Results: Story Link Detection
- [Table: cosine/tf.idf vs. Relevance Model vs. Relevance Model + ROI, on TDT-3 and TDT-4; numeric values not preserved in the transcript]

Relevance Models for Tracking
1. Initialize:
   - Set P(M|Q) = 1/Nt if M is a training document
   - Compute the relevance model as before
2. For each incoming story D:
   - score = divergence between P(w|D) and the RM
   - If score > threshold, add D to the training set and recompute the RM
   - Allow no more than k adaptations
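The adaptive loop above can be sketched as follows. This is an illustration under our own assumptions: the function name, parameters, and thresholds are ours, `estimate_rm` and `divergence` are supplied by the caller, and the slide's "score > threshold" is written here as "divergence < threshold", treating a low divergence as a good match:

```python
def adaptive_track(stories, training_docs, estimate_rm, divergence,
                   threshold=5.0, max_adapt=10):
    """Start from the Nt training documents, score each incoming story
    by its divergence from the current relevance model, and on a good
    enough score add the story and re-estimate the model, allowing no
    more than max_adapt adaptations."""
    docs = list(training_docs)
    rm = estimate_rm(docs)
    adaptations = 0
    decisions = []
    for story in stories:
        on_topic = divergence(story, rm) < threshold
        decisions.append(on_topic)
        if on_topic and adaptations < max_adapt:
            docs.append(story)
            rm = estimate_rm(docs)   # recompute RM with the new story
            adaptations += 1
    return decisions
```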

Results: Topic Tracking
- [Table: cosine/tf.idf, Language Model, Adaptive tf.idf, and Relevance Model, on TDT-3 and TDT-4; numeric values not preserved in the transcript]

Conclusions
- Rule of Interpretation (ROI) classification
- ROI-based vocabulary reduction
- Cross-language techniques
  - Dictionary translation of Arabic stories
  - Native language comparisons
  - Adaptive tracking
- Relevance models