Employing Active Learning to Cross-Lingual Sentiment Classification with Data Quality Controlling. Shoushan Li †‡, Rong Wang †, Huanhuan Liu †, Chu-Ren Huang ‡.


Employing Active Learning to Cross-Lingual Sentiment Classification with Data Quality Controlling Shoushan Li †‡ Rong Wang † Huanhuan Liu † Chu-Ren Huang ‡ † Soochow University ‡ Hong Kong Polytechnic University

Outline  Introduction  Inadequacies of the Existing Work  Our Methods  Experimental Results  Conclusion

Introduction  Sentiment classification is the task of predicting the sentiment orientation (e.g., positive or negative) of a given text.  However, resources are highly imbalanced across languages.  For example, owing to the dominance of studies on English sentiment classification, labeled data in English is available on a large scale, while labeled data in many other languages is quite limited.

Introduction (Cont.)  Cross-lingual sentiment classification aims to predict the sentiment orientation of a text in one language (the target language) with the help of resources from another language (the source language).

Inadequacies of the Existing Work  The classification performance achieved using only the labeled data in the source language remains far from satisfactory, owing to large differences in linguistic expression and social culture.  One challenge in active learning-based cross-lingual sentiment classification lies in the highly imbalanced amounts of labeled data from the source and target languages.  Such an imbalance easily drowns the small amount of labeled target-language data in the abundance of labeled source-language data, largely reducing the contribution of the labeled data in the target language.

Our Methods  We propose a certainty-based quality measurement (the intra-quality measurement), combined with cross-validation, to select high-quality samples in the source language.  We propose a similarity measurement (the extra-quality measurement) to select the samples in the source language that are similar to those in the target language.  For a given data set in the target language, these two measurements are integrated to select high-quality samples from the source language.  After obtaining the high-quality source-language samples, we employ standard uncertainty sampling for active learning-based cross-lingual sentiment classification.

Intra-quality Measurement  It employs only the data in the source language to measure the quality of the source-language samples.  We first split the labeled data from the source language into two parts: one serves as the training data and the other as the validation data.  Then we use the training data to train a classifier, which is used to predict the samples in the validation data.  After the prediction step, we assume that the samples with high posterior probabilities are capable of representing the classification knowledge in the training data.
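The steps above can be sketched as follows. This is an illustrative reconstruction, not the paper's exact classifier: a tiny add-one-smoothed Naive Bayes stands in for whatever model was actually used, and the certainty score is the posterior probability the classifier assigns to a validation sample's gold label.

```python
import math
from collections import Counter, defaultdict

def train_nb(texts, labels):
    # Per-class word counts for a multinomial Naive Bayes model.
    counts = defaultdict(Counter)
    class_n = Counter(labels)
    for t, y in zip(texts, labels):
        counts[y].update(t.split())
    vocab = {w for c in counts.values() for w in c}
    return counts, class_n, vocab

def posterior(counts, class_n, vocab, text):
    # log P(y) + sum_w log P(w|y) with add-one smoothing,
    # normalized into posterior probabilities over the classes.
    total = sum(class_n.values())
    logp = {}
    for y in class_n:
        denom = sum(counts[y].values()) + len(vocab)
        lp = math.log(class_n[y] / total)
        for w in text.split():
            lp += math.log((counts[y][w] + 1) / denom)
        logp[y] = lp
    m = max(logp.values())
    z = sum(math.exp(v - m) for v in logp.values())
    return {y: math.exp(v - m) / z for y, v in logp.items()}

def intra_quality(train_texts, train_labels, valid_texts, valid_labels):
    # Certainty score of each validation sample: the posterior
    # probability the trained classifier assigns to its gold label.
    model = train_nb(train_texts, train_labels)
    return [posterior(*model, t)[y] for t, y in zip(valid_texts, valid_labels)]
```

Samples scoring near 1.0 are the ones the training split "explains" well and are kept as high-quality.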

Intra-quality Measurement

Extra-quality Measurement
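The slide does not spell out the similarity formula, so the following is only a plausible sketch under an assumption of my own: each (translated) source sample is compared, via cosine similarity over bag-of-words vectors, against the aggregate word distribution of the target-language data.

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two bag-of-words count vectors.
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def extra_quality(source_texts, target_texts):
    # Similarity of each translated source sample to the pooled
    # word distribution (centroid) of the target-language data.
    target_centroid = Counter()
    for t in target_texts:
        target_centroid.update(t.split())
    return [cosine(Counter(t.split()), target_centroid) for t in source_texts]
```

Source samples whose vocabulary overlaps the target-language data score high; samples with no shared words score 0.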

Integrating Intra- and Extra-Quality Measurements  When designing the integration, we treat the certainty measurement as the main ranking factor and the similarity measurement as a supplementary one. Input: translated training data from the source language; testing data from the target language. Output: the selected data set.
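One way to realize "certainty as the main factor, similarity as a supplement" is sketched below; the coarse binning used for tie-breaking is my own illustrative choice, not necessarily the paper's integration rule.

```python
def select_high_quality(samples, certainty, similarity, k, bins=10):
    # Rank mainly by certainty: certainty scores are rounded into
    # coarse bins, and similarity only breaks ties within a bin.
    def key(i):
        return (round(certainty[i] * bins), similarity[i])
    order = sorted(range(len(samples)), key=key, reverse=True)
    return [samples[i] for i in order[:k]]
```

With this scheme, a clearly more certain sample always wins, while similarity to the target language decides among samples of roughly equal certainty.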

Integrating Intra- and Extra- Quality Measurements

Active Learning-based Cross-lingual Sentiment Classification

Active Learning-based Cross-lingual Sentiment Classification
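The final step uses standard pool-based uncertainty sampling, which can be sketched as follows; the `clf_factory` and `oracle` interfaces are hypothetical placeholders for the trained classifier and the human annotator.

```python
def uncertainty_sampling(clf_factory, labeled, unlabeled, oracle, rounds, batch):
    """Pool-based active learning with uncertainty sampling.

    clf_factory(labeled) -> a function mapping a sample to P(positive);
    oracle(x) -> the true label of x (a human annotator in practice).
    """
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(rounds):
        clf = clf_factory(labeled)
        # Samples whose posterior is closest to 0.5 are the least
        # certain, so they are queried first.
        pool.sort(key=lambda x: abs(clf(x) - 0.5))
        queried, pool = pool[:batch], pool[batch:]
        labeled += [(x, oracle(x)) for x in queried]
    return labeled
```

Each round retrains on the growing labeled set, so later queries reflect what the classifier has already learned.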

Experimental Settings Labeled Data in the Source Language: English reviews from four domains: Book (B), DVD (D), Electronics (E), and Kitchen (K). Each domain contains 1000 positive and 1000 negative reviews. All labeled samples are translated into Chinese with Google Translate. Testing Data in the Target Language: Chinese reviews from IT168 and from 360BUY, together with 2000 unlabeled reviews. Unlabeled Data in the Target Language: We select 500 positive and 500 negative reviews as the unlabeled samples for active learning.

Experimental Results Table 1: Classification accuracy on the IT168 and 360BUY domains when using all 8000 samples in the source domain. Four approaches are compared: Random + No_source, Uncertainty + No_source, Uncertainty + All_source, and Uncertainty + Selected_source.

Experimental Results ( Cont. )

Conclusion We propose an active learning approach for cross-lingual sentiment classification and address the challenge of severe data imbalance by controlling the quality of the data in the source language. Experiments verify the appropriateness of active learning for cross-lingual sentiment classification. In future work, we would like to improve the extra-quality measurement to make it more effective at selecting high-quality samples. We will also try data quality controlling in other cross-lingual NLP tasks.

Thank You !