Domain Adaptation with Structural Correspondence Learning
John Blitzer
Joint work with Shai Ben-David, Koby Crammer, Mark Dredze, Ryan McDonald, Fernando Pereira

Statistical models, multiple domains

Different Domains of Text
Huge variation in vocabulary & style: tech blogs, sports blogs, politics blogs (e.g., Yahoo).
"Ok, I'll just build models for each domain I encounter"

Sentiment Classification for Product Reviews
A product review goes into a classifier (SVM, Naïve Bayes, etc.), which outputs Positive or Negative.
Multiple domains: books, kitchen appliances, ... ??

Books & kitchen appliances
Running with Scissors: A Memoir. Title: Horrible book, horrible. "This book was horrible. I read half of it, suffering from a headache the entire time, and eventually i lit it on fire. One less copy in the world...don't waste your money. I wish i had the time spent reading this book back so i could use it for better purposes. This book wasted my life"
Avante Deep Fryer, Chrome & Black. Title: lid does not work well... "I love the way the Tefal deep fryer cooks, however, I am returning my second one due to a defective lid closure. The lid may close initially, but after a few uses it no longer stays closed. I will not be purchasing this one again."
Error increase: 13% → 26%

Part of Speech Tagging
Error increase: 3% → 12%
Wall Street Journal (WSJ): The/DT clash/NN is/VBZ a/DT sign/NN of/IN a/DT new/JJ toughness/NN and/CC divisiveness/NN in/IN Japan/NNP 's/POS once-cozy/JJ financial/JJ circles/NNS ./.
MEDLINE Abstracts (biomed): The/DT oncogenic/JJ mutated/VBN forms/NNS of/IN the/DT ras/NN proteins/NNS are/VBP constitutively/RB active/JJ and/CC interfere/VBP with/IN normal/JJ signal/NN transduction/NN ./.

Features & Linear Models
Sparse binary features such as horrible, read_half, waste, ...
Problem: if we've only trained on book reviews, then w(defective) = 0.
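
A minimal sketch of this problem (hypothetical data, not the talk's code): a bag-of-words linear model trained only on book reviews has learned no weight for target-only features such as "defective", so those features contribute nothing when classifying kitchen reviews.

```python
# Hypothetical illustration of the zero-weight problem for unseen target features.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

books_train = ["horrible book , i read half and wasted my money",
               "a wonderful memoir , loved it , highly recommended"]
books_labels = [0, 1]  # 0 = negative, 1 = positive

vec = CountVectorizer(ngram_range=(1, 2))
X_books = vec.fit_transform(books_train)
clf = LogisticRegression().fit(X_books, books_labels)

kitchen_review = ["the lid is defective and the fryer leaks"]
X_kitchen = vec.transform(kitchen_review)  # "defective", "leaks" were never seen: dropped
print(X_kitchen.nnz)  # only the handful of words shared with book reviews remain
```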

Structural Correspondence Learning (SCL)
– Cuts adaptation error by more than 40%
– Uses unlabeled data from the target domain
– Induces correspondences among different features: read_half, headache ↔ defective, returned
– Labeled data for the source domain will help us build a good classifier for the target domain
– Analogy: maximum likelihood linear regression (MLLR) for speaker adaptation (Leggetter & Woodland, 1995)

SCL: 2-Step Learning Process
Step 1 (unlabeled data): learn the correspondence mapping θ.
Step 2 (labeled data): learn the weight vector w.
θ should make the domains look as similar as possible, but should also allow us to classify well.

SCL: Making Domains Look Similar
Incorrect classification of a kitchen review: "Do not buy the Shark portable steamer …. Trigger mechanism is defective."
Unlabeled kitchen contexts: "the very nice lady assured me that I must have a defective set …. What a disappointment!"; "Maybe mine was defective …. The directions were unclear"
Unlabeled books contexts: "The book is so repetitive that I found myself yelling …. I will definitely not buy another."; "A disappointment …. Ender was talked about for pages altogether. it's unclear …. It's repetitive and boring"

SCL: Pivot Features
Pivot features:
– occur frequently in both domains
– characterize the task we want to do
– number in the hundreds or thousands
– are chosen using labeled source data and unlabeled source & target data
SCL pivots: words & bigrams that occur frequently in both domains, e.g. book, one, so, all, very, about, they, like, good, when
SCL-MI pivots: as in SCL, but also selected by mutual information with the labels, e.g. a_must, a_wonderful, loved_it, weak, don't_waste, awful, highly_recommended, and_easy
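
As a rough sketch of how pivots might be chosen (my own illustration under assumed data structures, not the authors' code): count candidate features in the unlabeled source and target data, keep those frequent in both, and for SCL-MI rank the survivors by mutual information with the source labels.

```python
# Sketch: frequency-in-both-domains filter (SCL) plus a mutual-information
# ranking against the source labels (SCL-MI). All names here are hypothetical.
from collections import Counter
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import mutual_info_classif

def frequent_in_both(source_docs, target_docs, min_count=5):
    def word_counts(docs):
        c = Counter()
        for doc in docs:
            c.update(doc.split())
        return c
    src, tgt = word_counts(source_docs), word_counts(target_docs)
    return [w for w in src if src[w] >= min_count and tgt[w] >= min_count]

def scl_mi_pivots(labeled_source_docs, source_labels, candidates, k=1000):
    vec = CountVectorizer(vocabulary=candidates, binary=True)
    X = vec.fit_transform(labeled_source_docs)
    mi = mutual_info_classif(X, source_labels, discrete_features=True)
    top = np.argsort(-mi)[:k]
    return [candidates[i] for i in top]
```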

SCL Unlabeled Step: Pivot Predictors
– Use pivot features to align the other features
– Mask and predict pivot features using the other features
– Train N linear predictors, one for each binary problem
– Each pivot predictor implicitly aligns non-pivot features from the source & target domains
Binary problem: does "not buy" appear here?
(1) The book is so repetitive that I found myself yelling …. I will definitely not buy another.
(2) Do not buy the Shark portable steamer …. Trigger mechanism is defective.
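
A sketch of this unlabeled step under assumed matrix shapes (not the authors' implementation): for each pivot, build the binary target "does this pivot occur in the document?" from unlabeled source + target data, and fit a linear predictor from the non-pivot features only.

```python
# Sketch: one linear predictor per pivot, trained on non-pivot features of
# unlabeled documents from both domains. Inputs are assumed dense 0/1 arrays.
import numpy as np
from sklearn.linear_model import SGDClassifier

def pivot_predictor_weights(X_nonpivot, pivot_presence):
    """X_nonpivot: (n_docs, n_nonpivot) 0/1 array of non-pivot features.
    pivot_presence: (n_docs, n_pivots) 0/1 array, 1 if pivot j occurs in doc i.
    Returns W with shape (n_nonpivot, n_pivots): one weight column per pivot."""
    n_pivots = pivot_presence.shape[1]
    W = np.zeros((X_nonpivot.shape[1], n_pivots))
    for j in range(n_pivots):
        y = pivot_presence[:, j]
        if y.min() == y.max():  # pivot absent (or present) everywhere: nothing to learn
            continue
        clf = SGDClassifier(loss="modified_huber", max_iter=20, tol=None)
        clf.fit(X_nonpivot, y)
        W[:, j] = clf.coef_.ravel()
    return W
```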

SCL: Dimensionality Reduction
The pivot predictors give N new features; the value of the i-th feature is the propensity to see the i-th pivot (e.g. "not buy") in the same document.
We still want fewer new features (1000 is too many): many pivot predictors give similar information, e.g. "horrible", "terrible", "awful".
Compute an SVD of the predictor weights & use the top left singular vectors.
Related: Latent Semantic Indexing (LSI) (Deerwester et al., 1990); Latent Dirichlet Allocation (LDA) (Blei et al., 2003).
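
Continuing the sketch (same assumed shapes as above): stack the pivot-predictor weight columns into a matrix and keep only the top-k left singular vectors as the projection.

```python
# Sketch: SVD of the pivot-predictor weight matrix; the top left singular
# vectors define the low-dimensional projection theta.
import numpy as np

def scl_projection(W, k=50):
    """W: (n_nonpivot, n_pivots) pivot-predictor weights.
    Returns theta with shape (k, n_nonpivot); new features of a doc x are theta @ x."""
    U, _, _ = np.linalg.svd(W, full_matrices=False)
    return U[:, :k].T
```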

Back to Linear Classifiers
Source training: learn the weights on the original features and on the new SCL features together.
Target testing: first apply the projection θ to get the SCL features, then apply both sets of weights.
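
A sketch of how the projection could be wired into a linear classifier (hypothetical glue code; the talk learns the two weight blocks jointly on the augmented representation). It assumes theta is defined over the same feature indexing as X.

```python
# Sketch: source training and target testing on original features augmented
# with the k SCL features theta @ x.
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.linear_model import SGDClassifier

def augment(X, theta):
    # X: (n_docs, n_features) sparse; theta: (k, n_features) dense
    scl_feats = csr_matrix(X @ theta.T)  # (n_docs, k) low-dimensional features
    return hstack([X, scl_feats]).tocsr()

def train_on_source(X_src, y_src, theta):
    clf = SGDClassifier(loss="modified_huber", max_iter=50, tol=None)
    return clf.fit(augment(X_src, theta), y_src)

def predict_on_target(clf, X_tgt, theta):
    return clf.predict(augment(X_tgt, theta))
```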

Inspirations for SCL
1. Alternating Structural Optimization (ASO): Ando & Zhang (JMLR 2005), inducing structures for semi-supervised learning.
2. Correspondence Dimensionality Reduction: Verbeek, Roweis, & Vlassis (NIPS 2003); Ham, Lee, & Saul (AISTATS 2003). Learn a low-dimensional representation from high-dimensional correspondences.

Sentiment Classification Data
Product reviews from Amazon.com
– Books, DVDs, Kitchen Appliances, Electronics
– 2000 labeled reviews from each domain
– 3000–6000 unlabeled reviews
Binary classification problem
– Positive if 4 stars or more, negative if 2 or fewer
Features: unigrams & bigrams
Pivots: SCL & SCL-MI
At train time: minimize the Huberized hinge loss (Zhang, 2004)
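
For reference, one common smoothed ("Huberized") variant of the hinge loss, written as a function of the margin m = y·(w·x); this is a generic form, not necessarily the exact one used in these experiments.

```python
# Sketch: quadratically smoothed hinge loss. Zero beyond margin 1+h, quadratic
# near the hinge point, linear for badly misclassified examples.
import numpy as np

def huberized_hinge(margin, h=0.5):
    m = np.asarray(margin, dtype=float)
    return np.where(m >= 1 + h, 0.0,
           np.where(m <= 1 - h, 1.0 - m,
                    (1 + h - m) ** 2 / (4 * h)))
```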

Visualizing the projection (books & kitchen): negative vs. positive
Books features: plot, _pages, predictable, fascinating, engaging, must_read, grisham
Kitchen features: the_plastic, poorly_designed, leaking, awkward_to, espresso, are_perfect, years_now, a_breeze

Empirical Results: books & DVDs
Baseline loss due to adaptation: 7.6%
SCL-MI loss due to adaptation: 0.7%

Empirical Results: electronics & kitchen

Empirical Results: books & DVDs
Sometimes SCL can cause increases in error: with only unlabeled data, we may misalign features.

Using Labeled Data
– 50 instances of labeled target-domain data
– On source data, save the weight vector learned for the SCL features
– On target data, regularize the weight vector to stay close to the source weight vector (Chelba & Acero, EMNLP 2004)
– Huberized hinge loss
– Avoid using the high-dimensional original features; keep the SCL weights close to the source weights
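
A sketch of the Chelba & Acero-style prior (my own illustration with a plain logistic loss for brevity): when fine-tuning on the small labeled target set, penalize distance from the source-trained weights rather than distance from zero.

```python
# Sketch: gradient descent on logistic loss + lam * ||w - w_src||^2,
# so the target weights stay close to the source weights.
import numpy as np

def finetune_toward_source(X, y, w_src, lam=1.0, lr=0.1, epochs=200):
    """X: (n, d) dense array, y in {-1, +1}, w_src: (d,) source weight vector."""
    w = w_src.copy()
    for _ in range(epochs):
        margins = y * (X @ w)
        sigma = 1.0 / (1.0 + np.exp(margins))          # -dloss/dmargin
        grad_loss = -(X * (y * sigma)[:, None]).mean(axis=0)
        grad = grad_loss + 2.0 * lam * (w - w_src)
        w -= lr * grad
    return w
```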

Empirical Results: Labeled Data
With 50 labeled target instances, SCL-MI always improves over the baseline.

Average Improvements
Table: average adaptation loss for base, base+targ, scl, scl-mi, and scl-mi+targ.
– scl-mi reduces error due to transfer by 36%
– adding 50 target instances [Chelba & Acero 2004] without SCL does not help
– scl-mi+targ reduces error due to transfer by 46%

PoS Tagging: Data & Model
Data:
– 40k Wall Street Journal (WSJ) training sentences
– 100k unlabeled biomedical sentences
– 100k unlabeled WSJ sentences
Supervised learner: MIRA CRF, an online max-margin learner that separates the correct label sequence from the top k=5 incorrect ones (Crammer et al., JMLR 2006)
Pivots: common left/middle/right words

Visualizing PoS Tagging: nouns vs. adjectives & determiners
MEDLINE features: receptors, mutation, assays, lesions, metastatic, neuronal, transient, functional
Wall Street Journal features: company, transaction, investors, officials, political, short-term, your, pretty

Empirical Results
561 MEDLINE test sentences; accuracy on all words and on unknown words, as a function of # of WSJ training sentences, for MXPOST, supervised, semi-supervised ASO, and SCL.
Significance (McNemar's test): semi vs. super, SCL vs. super, and SCL vs. semi.

Results: Some Labeled Target-Domain Data
Use the source tagger output as a feature (Florian et al. 2004); compare SCL with a supervised source tagger; 561 MEDLINE test sentences, varying # of MEDLINE training sentences.
Accuracy: 1k-SCL 95.0; 1k-super 94.5; no source 94.5

Adaptation & Machine Translation
Source: domain-specific parallel corpora (news, legal text)
Target: similar corpora from the web (e.g. blogs)
Learn translation rules / language model parameters for the new domain
Pivots: common contexts

Adaptation & Ranking
Input: a query & a list of top-ranked documents
Output: a ranking
Score documents based on editorial or click-through data
Adaptation: different markets or query types
Pivots: common relevant features

Learning Theory & Adaptation
– Analysis of Representations for Domain Adaptation. Shai Ben-David, John Blitzer, Koby Crammer, Fernando Pereira. NIPS 2006.
– Learning Bounds for Domain Adaptation. John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, Jenn Wortman. NIPS 2007 (to appear).
Bounds on the error of models in new domains.

Pipeline Adaptation: Tagging & Parsing
Dependency parsing (McDonald et al.) uses part-of-speech tags as features.
Train on WSJ, test on MEDLINE, using different taggers to produce the MEDLINE input features.
Figure: parsing accuracy vs. # of WSJ training sentences for the different tagger inputs.

Measuring Adaptability
Given limited resources, which domains should we label?
Idea: train a classifier to distinguish instances from the different domains; the error of this classifier is an estimate of the loss due to adaptation.
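
A sketch of this idea (hypothetical wiring; the 2·(1 − 2·err) form is the proxy A-distance used in the related theory work): train a domain classifier on instances from the two domains and read off its cross-validated error.

```python
# Sketch: a domain classifier's error as a proxy for how different two
# domains are (and hence how costly adaptation is likely to be).
import numpy as np
from scipy.sparse import vstack
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def proxy_a_distance(X_domain_a, X_domain_b):
    X = vstack([X_domain_a, X_domain_b])
    y = np.r_[np.zeros(X_domain_a.shape[0]), np.ones(X_domain_b.shape[0])]
    acc = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5).mean()
    err = 1.0 - acc
    return 2.0 * (1.0 - 2.0 * err)  # high value = easy to tell apart = hard to adapt
```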

A-distance vs. Adaptation Loss
Suppose we can afford to label 2 domains: then we should label one of electronics/kitchen and one of books/DVDs.

Features & Linear Models (tagging)
Sparse binary features for the word "signal" in "normal signal transduction": LW=normal, MW=signal, RW=transduction.
Problem: if we've only trained on financial news, then w(RW=transduction) = 0.

Future Work
SCL for other problems & modalities
– named entity recognition
– vision (aligning SIFT features)
– speaker / acoustic environment adaptation
Learning low-dimensional representations for multi-part prediction problems
– natural language parsing, machine translation, sentence compression