Domain Adaptation in Natural Language Processing
Jing Jiang
Department of Computer Science, University of Illinois at Urbana-Champaign


Textual Data in the Information Age
- Contains much useful information: e.g., >85% of corporate data is stored as text
- Hard to handle
  – Large amount: e.g., by 2002, 2.5 billion documents on the surface Web, growing by 7.3 million per day
  – Diversity: e-mails, news, digital libraries, Web logs, etc.
  – Unstructured: in contrast to relational databases
- How do we manage textual data?

Information Retrieval
- Information retrieval: ranks documents based on their relevance to keyword queries
- Not always satisfactory
  – More sophisticated services are desired

Automatic Text Summarization

Question Answering

Information Extraction

Company | Founder
------- | ----------
Google  | Larry Page
…       | …

Beyond Information Retrieval
- Automatic text summarization
- Question answering
- Information extraction
- Sentiment analysis
- Machine translation
- Etc.
All rely on Natural Language Processing (NLP) techniques to deeply understand and analyze text.

Typical NLP Tasks
"Larry Page was Google's founding CEO"
- Part-of-speech tagging: Larry/noun Page/noun was/verb Google/noun 's/possessive-end founding/adjective CEO/noun
- Chunking: [NP: Larry Page] [V: was] [NP: Google 's founding CEO]
- Named entity recognition: [person: Larry Page] was [organization: Google] 's founding CEO
- Relation extraction: Founder(Larry Page, Google)
- Word sense disambiguation: "Larry Page" vs. "Page 81"
State-of-the-art solution: supervised machine learning
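
These tasks are easy to try in code. A minimal sketch using NLTK (not part of the original talk; assumes the standard punkt, averaged_perceptron_tagger, maxent_ne_chunker, and words models have been downloaded via nltk.download):

```python
import nltk

sentence = "Larry Page was Google's founding CEO"

# Part-of-speech tagging: assign a syntactic category to each token
tokens = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(tokens)   # e.g., [('Larry', 'NNP'), ('Page', 'NNP'), ('was', 'VBD'), ...]

# Named entity recognition: group tagged tokens into typed entities
tree = nltk.ne_chunk(tagged)    # e.g., (PERSON Larry/NNP Page/NNP) ... Google labeled as an entity

print(tagged)
print(tree)
```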

Supervised Learning for NLP
[Diagram: part-of-speech tagging on news articles]
- WSJ articles (representative corpus) → human annotation → POS-tagged WSJ articles
  (e.g., Larry/NNP Page/NNP was/VBD Google/NNP 's/POS founding/ADJ CEO/NN)
- POS-tagged WSJ articles → training with a standard supervised learning algorithm → trained POS tagger

In Reality…
[Diagram: part-of-speech tagging on biomedical articles]
- MEDLINE articles (representative corpus) → human annotation → POS-tagged MEDLINE articles
  (e.g., We/PRP analyzed/VBD the/DT mutations/NNS of/IN the/DT H-ras/NN genes/NNS)
- But human annotation is expensive, so POS-tagged MEDLINE articles are unavailable; only POS-tagged WSJ articles exist for training

Many Other Examples
- Named entity recognition
  – News articles → personal blogs
  – Organism A → organism B
- Spam filtering
  – Public collection → personal inboxes
- Sentiment analysis of product reviews (positive vs. negative)
  – Movies → books
  – Cell phones → digital cameras
What is the problem with this non-standard setting, where the training and test domains differ?

Domain Difference → Performance Degradation
- Ideal setting: POS tagger trained and tested on MEDLINE → accuracy ~96%
- Realistic setting: POS tagger trained on WSJ, tested on MEDLINE → accuracy ~86%

Another Example
- Ideal setting: gene name recognizer trained and tested on the same organism → 54.1%
- Realistic setting: gene name recognizer trained on one organism, tested on another → 28.1%

Domain Adaptation
[Diagram: labeled source-domain data and (mostly unlabeled) target-domain data feed into a domain adaptive learning algorithm]
Goal: to design learning algorithms that are aware of domain difference and exploit all available data to adapt to the target domain.

With Domain Adaptation Techniques…
- Standard learning: train on Fly + Mouse, test on Yeast → gene name recognizer 63.3%
- Domain adaptive learning: train on Fly + Mouse, test on Yeast → gene name recognizer 75.9%

Roadmap
- What is domain adaptation in NLP?
- Our work
  – Overview
  – Instance weighting
  – Feature selection
- Summary and future work

Overview
[Diagram: source domain and target domain in instance space]

Ideal Goal
[Diagram: source domain and target domain]

Standard Supervised Learning
[Diagram: source domain and target domain]

Standard Semi-Supervised Learning
[Diagram: source domain and target domain]

Idea 1: Generalization
[Diagram: source domain and target domain]

Idea 2: Adaptation
[Diagram: source domain and target domain]

How do we formally formulate these ideas?
[Diagram: source domain and target domain]

Instance Weighting
[Diagram: instance space over the source and target domains; each point represents an observed instance]
Goal: to find appropriate weights for different instances.

Feature Selection
[Diagram: feature space over the source and target domains; each point represents a useful feature]
Goal: to separate generalizable features from domain-specific features.

Roadmap
- What is domain adaptation in NLP?
- Our work
  – Overview
  – Instance weighting
  – Feature selection
- Summary and future work

Observation
[Diagram: source domain and target domain]

Analysis of Domain Difference
p(x, y) = p(x) p(y | x), where x is an observed instance and y is the class label (to be predicted)
- Labeling difference: p_s(y | x) ≠ p_t(y | x) → labeling adaptation
- Instance difference: p_s(x) ≠ p_t(x) → instance adaptation
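
The link between these two differences and instance weighting can be made explicit with a standard importance-weighting identity (a reconstruction consistent with the slide's decomposition, not taken verbatim from the talk):

```latex
\mathbb{E}_{(x,y)\sim p_t}\big[L(x,y,\theta)\big]
= \mathbb{E}_{(x,y)\sim p_s}\Big[
    \underbrace{\tfrac{p_t(x)}{p_s(x)}}_{\text{instance difference}}\;
    \underbrace{\tfrac{p_t(y\mid x)}{p_s(y\mid x)}}_{\text{labeling difference}}\;
    L(x,y,\theta)\Big]
```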

Labeling Adaptation
[Diagram: source domain and target domain]
Where p_t(y | x) ≠ p_s(y | x): remove or demote those source instances.

Instance Adaptation (p_t(x) < p_s(x))
[Diagram: source domain and target domain]
Where p_t(x) < p_s(x): remove or demote those source instances.

Instance Adaptation (p_t(x) > p_s(x))
[Diagram: source domain and target domain]
Where p_t(x) > p_s(x): promote those instances; target domain instances are especially useful here.

Empirical Risk Minimization with Three Sets of Instances
- Three sets of instances: D_s (labeled source data), D_t,l (labeled target data), D_t,u (unlabeled target data)
- Define a loss function; the optimal classification model minimizes the expected loss
- Use the empirical loss over the available instances to replace the expected loss
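
Written out, this is the usual ERM objective. A minimal sketch, with per-instance weights α_i standing in for the adaptation heuristics discussed next, and with y_i taken to be a predicted label for instances from D_t,u (the exact weighting scheme in the underlying papers may differ in detail):

```latex
\theta^{*}
= \arg\min_{\theta}\; \mathbb{E}_{(x,y)\sim p_t}\big[L(x,y,\theta)\big]
\;\approx\;
\arg\min_{\theta} \sum_{(x_i,y_i)\,\in\, D_s \cup D_{t,l} \cup D_{t,u}} \alpha_i\, L(x_i, y_i, \theta)
```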

Using D_s
For each x ∈ D_s, account for:
- the instance difference, via the ratio p_t(x) / p_s(x) (hard to estimate for high-dimensional data)
- the labeling difference, p_t(y | x) vs. p_s(y | x) (needs labeled target data)

Using D_t,l
For x ∈ D_t,l: the sample size is small, so the estimation is not accurate.

Using D_t,u
For x ∈ D_t,u: use predicted labels (bootstrapping).

Combined Framework
A flexible setup covering both standard methods and new domain adaptive methods.
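
One way to write the combined objective (a sketch consistent with the three data sets above; the λ terms trade off the three sources of supervision and are what the heuristics below tune, and ŷ_k denotes a predicted, bootstrapped label):

```latex
\hat{\theta} = \arg\min_{\theta}\Big[
  \lambda_s \sum_{(x_i,y_i)\in D_s} \alpha_i\, L(x_i,y_i,\theta)
+ \lambda_{t,l} \sum_{(x_j,y_j)\in D_{t,l}} L(x_j,y_j,\theta)
+ \lambda_{t,u} \sum_{x_k\in D_{t,u}} L\big(x_k,\hat{y}_k,\theta\big)
\Big]
```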

Experiments
- NLP tasks
  – POS tagging: WSJ (Penn TreeBank) → Oncology (biomedical) text (Penn BioIE)
  – NE type classification: newswire → conversational telephone speech (CTS) and web log (WL) (ACE 2005)
  – Spam filtering: public collection → personal inboxes (u01, u02, u03) (ECML/PKDD 2006)
- Three heuristics to partially explore the parameter settings

Instance Pruning
- Removes "misleading" instances from D_s
- Tasks: POS tagging (Oncology), NE type classification (CTS, WL), spam filtering (Users 1-3)
- [Results table not recoverable from the transcript]
- Useful in most cases, but failed in some cases
- When is it guaranteed to work? (future work)

D_t,l with Larger Weights
- Methods compared: D_s; D_s + D_t,l; D_s + 5 D_t,l; D_s + 10 D_t,l (and D_s + 20 D_t,l for POS tagging)
- Tasks: NE type (CTS, WL), POS tagging (Oncology), spam filtering (Users 1-3)
- [Results tables not recoverable from the transcript]
- D_t,l is very useful; promoting D_t,l is even more useful

Bootstrapping with Larger Weights
- Promote bootstrapped target instances until D_s and D_t,u are balanced
- Methods compared: supervised; standard bootstrapping; balanced bootstrapping
- Tasks: POS tagging (Oncology), NE type (CTS, WL), spam filtering (Users 1-3)
- [Results tables not recoverable from the transcript]
- Promoting target instances is useful, even with predicted labels
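
A minimal sketch of the balanced-bootstrapping idea (hypothetical helper names; rebalancing via sample weights is one way to realize "until D_s and D_t,u are balanced", not necessarily the exact scheme used in the experiments; assumes dense NumPy feature matrices):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def balanced_bootstrap(X_s, y_s, X_tu, n_rounds=5):
    """Self-train on unlabeled target data, weighting the bootstrapped
    target instances so they collectively balance the source data."""
    model = LogisticRegression(max_iter=1000)
    model.fit(X_s, y_s)                      # supervised baseline on D_s
    for _ in range(n_rounds):
        y_pred = model.predict(X_tu)         # predicted labels for D_t,u
        X = np.vstack([X_s, X_tu])
        y = np.concatenate([y_s, y_pred])
        # Give the target instances total weight equal to the source set
        w = np.concatenate([
            np.ones(len(X_s)),
            np.full(len(X_tu), len(X_s) / max(len(X_tu), 1)),
        ])
        model.fit(X, y, sample_weight=w)
    return model
```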

Roadmap
- What is domain adaptation in NLP?
- Our work
  – Overview
  – Instance weighting
  – Feature selection
- Summary and future work

Observation 1
- Domain-specific features: wingless, daughterless, eyeless, apexless, …
  – These describe phenotype, following fly gene nomenclature; the feature "-less" is useful for this organism
- Is the feature still useful for other organisms (e.g., CD38, PABPC5, …)? No!

Observation 2
- Generalizable features, e.g., "X be expressed":
  – "…decapentaplegic and wingless are expressed in analogous patterns in each …"
  – "…that CD38 is expressed by both neurons and glial cells…"
  – "…that PABPC5 is expressed in fetal brain and in a range of adult tissues."

Assume Multiple Source Domains
[Diagram: labeled data from several source domains and unlabeled data from the target domain feed into a domain adaptive learning algorithm]

Detour: Logistic Regression Classifiers
- Input x: a vector of binary features (e.g., "-less", "X be expressed") extracted from text such as "… and wingless are expressed in …"
- Model: p(y | x) ∝ exp(w_y^T x), with one weight vector w_y per class

Learning a Logistic Regression Classifier
Maximize the log likelihood of the training data minus a regularization term:
w* = argmax_w [ Σ_i log p(y_i | x_i; w) − λ ||w||² ]
- The regularization term penalizes large weights and controls model complexity
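
As a concrete, hedged illustration (not from the talk), scikit-learn's LogisticRegression optimizes this kind of L2-regularized log likelihood, with C = 1/λ controlling the penalty strength; the toy corpus and labels below are invented for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy corpus standing in for gene-mention contexts (illustrative only)
docs = ["wingless are expressed in analogous patterns",
        "CD38 is expressed by both neurons and glial cells",
        "the meeting is scheduled for Monday",
        "please review the attached report"]
labels = [1, 1, 0, 0]   # 1 = gene-mention context, 0 = not

X = CountVectorizer(binary=True).fit_transform(docs)  # binary features, as on the slide
clf = LogisticRegression(C=1.0, penalty="l2", max_iter=1000)  # C = 1/λ
clf.fit(X, labels)
print(clf.predict(X))
```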

Generalizable Features in Weight Vectors
[Diagram: weight vectors w_1, w_2, …, w_K learned from K source domains D_1, D_2, …, D_K; generalizable features receive consistent weights across domains, while domain-specific features do not]

Decomposition of w_k for Each Source Domain
w_k = A^T v + u_k
- A^T v: shared by all domains; A is a matrix that selects generalizable features
- u_k: domain-specific

Framework for Generalization
Fix A, then optimize:
- the log likelihood of the labeled data from the K source domains,
- minus a regularization term on each w_k, with λ_s >> 1 to penalize domain-specific features.

Framework for Adaptation
Fix A, then optimize:
- the log likelihood of target domain examples with predicted labels,
- with λ_t = 1 << λ_s, to pick up domain-specific features in the target domain.
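
In equation form, a plausible reconstruction of the two objectives (the exact regularizers in the underlying papers may differ in detail; ŷ denotes a predicted label):

```latex
% Generalization: learn v (and u_k) from K labeled source domains
\max_{v,\{u_k\}} \sum_{k=1}^{K} \sum_{(x,y)\in D_k} \log p\big(y \mid x;\, A^{\top}v + u_k\big)
  \;-\; \lambda_s \sum_{k=1}^{K} \lVert u_k \rVert^2

% Adaptation: pick up target-specific features from predicted labels
\max_{u_t} \sum_{(x,\hat{y})\in D_{t,u}} \log p\big(\hat{y} \mid x;\, A^{\top}v + u_t\big)
  \;-\; \lambda_t \lVert u_t \rVert^2 ,\qquad \lambda_t = 1 \ll \lambda_s
```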

How to Find A? (1)
- Joint optimization

How to Find A? (2)
- Domain cross validation
  – Idea: train on (K − 1) source domains and validate on the held-out source domain
  – Approximation:
    w_f^k: weight for feature f learned from domain k
    w_f^(−k): weight for feature f learned from the other domains
    Rank features by the agreement (e.g., the product) of w_f^k and w_f^(−k)

Intuition for Domain Cross Validation
[Diagram: feature weights learned across domains D_1, D_2, …, D_{k−1}, D_k (fly); a generalizable feature such as "expressed" gets consistently high weights, so the product of w_1 and w_2 is large, while a domain-specific feature such as "-less" gets a high weight only in the fly domain]
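
A minimal sketch of this ranking (hypothetical helper name; assumes binary classification with dense NumPy feature matrices, and uses the held-out-domain weight times the remaining-domains weight as the agreement score):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def rank_generalizable_features(domains):
    """domains: list of (X, y) pairs, one per source domain.
    Returns feature indices sorted from most to least generalizable."""
    scores = np.zeros(domains[0][0].shape[1])
    for k in range(len(domains)):
        X_k, y_k = domains[k]
        # Pool all domains except the held-out domain k
        X_rest = np.vstack([X for i, (X, _) in enumerate(domains) if i != k])
        y_rest = np.concatenate([y for i, (_, y) in enumerate(domains) if i != k])
        w_k = LogisticRegression(max_iter=1000).fit(X_k, y_k).coef_[0]
        w_rest = LogisticRegression(max_iter=1000).fit(X_rest, y_rest).coef_[0]
        scores += w_k * w_rest   # large when both weights agree in sign and magnitude
    return np.argsort(-scores)
```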

Experiments
- Data set
  – BioCreative Challenge Task 1B
  – Gene/protein name recognition
  – 3 organisms/domains: fly, mouse, and yeast
- Experimental setup
  – Train on 2 organisms, test on the third
  – F1 as the performance measure

Experiments: Generalization
- Methods: BL (baseline), DA-1 (joint optimization), DA-2 (domain cross validation)
- Settings: F+M → Y, M+Y → F, Y+F → M (F: fly, M: mouse, Y: yeast)
- [Results table not recoverable from the transcript]
- Using generalizable features is effective
- Domain cross validation is more effective than joint optimization

Experiments: Adaptation
- Methods: BL-SSL (regular bootstrapping), DA-2-SSL (domain-adaptive bootstrapping)
- Settings: F+M → Y, M+Y → F, Y+F → M (F: fly, M: mouse, Y: yeast)
- [Results table not recoverable from the transcript]
- Domain-adaptive bootstrapping is more effective than regular bootstrapping

Related Work
- The problem is relatively new to the NLP and ML communities
  – Most related work was developed concurrently with our work

Instances used      | Setting                  | Instance Weighting | Feature Selection | IW + FS
D_s                 | supervised learning      | Shimodaira 00      | Blitzer et al. 06 | our future work
D_s + D_t,l         | supervised learning      | Daumé III & Marcus 06; Daumé III 07 | |
D_s + D_t,u         | semi-supervised learning | ACL'07             | HLT'06, CIKM'07   |
D_s + D_t,l + D_t,u | semi-supervised learning |                    |                   |

Roadmap
- What is domain adaptation in NLP?
- Our work
  – Overview
  – Instance weighting
  – Feature selection
- Summary and future work

Summary
- Domain adaptation is a critical, novel problem in natural language processing and machine learning
- Contributions
  – First systematic formal analysis of domain adaptation
  – Two novel general frameworks, both shown to be effective
  – Potentially applicable to classification problems outside of NLP
- Future work
  – A domain difference measure
  – Unifying the two frameworks
  – Incorporating domain knowledge into the adaptation process
  – Leveraging domain adaptation for large-scale information extraction on scientific literature and on the Web

Information Extraction System
[Diagram: an information extraction system (entity recognition + relation extraction) supported by three strategies — domain adaptive learning from labeled data in related domains, knowledge resources exploitation using existing knowledge bases, and interactive expert supervision from a domain expert]

Applications
[Diagram: biomedical literature (MEDLINE abstracts, full-text articles, etc.), e.g., "DWnt-2 is expressed in somatic cells of the gonad throughout development.", flows through the information extraction system (entity recognition, relation extraction) into extracted facts, e.g., an expression relation (gene: DWnt-2; tissue/position: gonad); an inference engine then supports pathway construction, hypothesis generation, and knowledge base curation]

Applications (cont.)
Similar ideas apply to Web text mining, e.g., product reviews:
- Existing annotated reviews are limited (certain products from certain sources)
- Large amounts of semi-structured reviews are available on review websites
- Unstructured reviews appear in personal blogs

Selected Publications
This talk:
- J. Jiang & C. Zhai. "A two-stage approach to domain adaptation for statistical classifiers." In CIKM'07.
- J. Jiang & C. Zhai. "Instance weighting for domain adaptation in NLP." In ACL'07.
- J. Jiang & C. Zhai. "Exploiting domain structure for named entity recognition." In HLT-NAACL'06.
Feature exploration for relation extraction:
- J. Jiang & C. Zhai. "A systematic exploration of the feature space for relation extraction." In NAACL-HLT'07.
Information retrieval:
- J. Jiang & C. Zhai. "Extraction of coherent relevant passages using hidden Markov models." ACM Transactions on Information Systems (TOIS), Jul.
- J. Jiang & C. Zhai. "An empirical study of tokenization strategies for biomedical information retrieval." Information Retrieval, Oct.
Gene summarization:
- X. Ling, J. Jiang, X. He, Q. Mei, C. Zhai & B. Schatz. "Generating semi-structured gene summaries from biomedical literature." Information Processing & Management, Nov.
- X. Ling, J. Jiang, X. He, Q. Mei, C. Zhai & B. Schatz. "Automatically generating gene summaries from biomedical literature." In PSB'06.