CoCQA: Co-Training Over Questions and Answers with an Application to Predicting Question Subjectivity Orientation. Baoli Li, Yandong Liu, and Eugene Agichtein.

Presentation transcript:

CoCQA: Co-Training Over Questions and Answers with an Application to Predicting Question Subjectivity Orientation
Baoli Li, Yandong Liu, and Eugene Agichtein
Emory University

Community Question Answering
- An effective way of seeking information from other users
- Can be searched for resolved questions

Community Question Answering (CQA)
- Yahoo! Answers
- Users
  - Asker: post questions
  - Answerer: post answers
  - Voter: vote for existing answers
- Questions
  - Subject
  - Detail
- Answers
  - Answer text
  - Votes
- Archive: millions of questions and answers

Lifecycle of a Question in CQA
[Flowchart] The asker chooses a category, composes the question, and opens it; answerers post answers, which the asker examines. If the asker finds the answer, they close the question, choose the best answer, and give ratings; otherwise the question is eventually closed by the system and the best answer is chosen by voters.

Problem Statement
- How can we exploit the structure of CQA to improve question classification?
- Case study: question subjectivity prediction
  - Subjective questions seek answers containing private states such as personal opinion, judgment, and experience.
  - Objective questions are expected to be answered with reliable or authoritative information.

Example Questions
- Subjective: Has anyone got one of those home blood pressure monitors? and if so what make is it and do you think they are worth getting?
- Objective: What is the difference between chemotherapy and radiation treatments?

Motivation
- Guiding the CQA engine to process questions more intelligently
- Some applications:
  - Ranking/filtering answers
  - Improving question archive search
  - Evaluating answers provided by users
  - Inferring user intent

Challenges
- Some challenges in analyzing real online questions:
  - Typically complex and subjective
  - Can be ill-phrased and vague
  - Not enough annotated data

Key Observation
- Can we exploit the inherent structure of CQA interactions, and the virtually unlimited amounts of unlabeled data, to improve classification performance?

Natural Approach: Co-Training
- Introduced in "Combining labeled and unlabeled data with co-training" (Blum and Mitchell, 1998)
- Uses two views of the data
  - E.g., content and hyperlinks in web pages
  - The views provide complementary information for each other
- Iteratively constructs additional labeled data
- Can often significantly improve accuracy

Questions and Answers: Two Views
- Example:
  - Q: Has anyone got one of those home blood pressure monitors? and if so what make is it and do you think they are worth getting?
  - A: My mom has one as she is diabetic so its important for her to monitor it she finds it useful.
- Answers usually match/fit the question
  - "My mom… she finds…"
- Askers can usually identify matching answers by selecting the "best answer"

CoCQA: A Co-Training Framework over Questions and Answers
[Diagram] Two classifiers are trained on the labeled data: C_Q over the question text and C_A over the answer text. Both classify the unlabeled data, the most confidently labeled examples are moved into the labeled pool, and a validation set held out from the training data determines when to stop.
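
The loop in the diagram can be summarized in a short sketch. The following is an illustrative Python version of a CoCQA-style co-training loop, not the authors' implementation: scikit-learn's LinearSVC stands in for LibSVM, plain term-frequency vectors stand in for the full feature set, and the validation-based stopping rule is reduced to a fixed iteration cap.

# Illustrative sketch of a CoCQA-style co-training loop (not the authors' code).
# Assumptions: binary labels (e.g. "subjective"/"objective"), LinearSVC in place
# of LibSVM, and raw term frequencies as term weights.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

def cocqa_cotrain(q_lab, a_lab, y_lab, q_unlab, a_unlab, k=20, max_iter=10):
    q_lab, a_lab, y_lab = list(q_lab), list(a_lab), list(y_lab)
    q_unlab, a_unlab = list(q_unlab), list(a_unlab)
    for _ in range(max_iter):
        # Train one classifier per view: question text and answer text.
        vec_q, vec_a = CountVectorizer(), CountVectorizer()
        clf_q = LinearSVC().fit(vec_q.fit_transform(q_lab), y_lab)
        clf_a = LinearSVC().fit(vec_a.fit_transform(a_lab), y_lab)
        if not q_unlab:
            break
        # Confidence = absolute distance from the SVM hyperplane (margin value).
        conf_q = clf_q.decision_function(vec_q.transform(q_unlab))
        conf_a = clf_a.decision_function(vec_a.transform(a_unlab))
        # Each view proposes its k most confident unlabeled examples.
        top = set(np.argsort(-np.abs(conf_q))[:k]) | set(np.argsort(-np.abs(conf_a))[:k])
        for i in sorted(top, reverse=True):
            # The more confident view assigns the label; the example then moves
            # from the unlabeled pool into the labeled pool for both views.
            clf, score = (clf_q, conf_q[i]) if abs(conf_q[i]) >= abs(conf_a[i]) else (clf_a, conf_a[i])
            label = clf.classes_[1] if score > 0 else clf.classes_[0]
            q_lab.append(q_unlab.pop(i))
            a_lab.append(a_unlab.pop(i))
            y_lab.append(label)
    return (clf_q, vec_q), (clf_a, vec_a)

In the full framework, the loop would instead stop once accuracy on the held-out validation split stops improving, as the diagram indicates.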

Details of the CoCQA Implementation
- Base classifier: LibSVM
- Term frequency as term weight
  - Also tried binary and TF*IDF weighting
- Select the top K examples with highest confidence
  - Confidence is taken from the margin value in the SVM
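
The term-weighting options above map onto standard text vectorizers. A small sketch using scikit-learn names (an assumption for illustration; the paper used LibSVM with its own feature files):

# The three weighting schemes mentioned on the slide, expressed with
# scikit-learn vectorizers (illustrative only).
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

weighting_schemes = {
    "term_frequency": CountVectorizer(),         # raw counts (the setting used in CoCQA)
    "binary": CountVectorizer(binary=True),      # term presence/absence
    "tf_idf": TfidfVectorizer(),                 # TF*IDF weights
}

texts = ["Has anyone got one of those home blood pressure monitors?",
         "What is the difference between chemotherapy and radiation treatments?"]
for name, vectorizer in weighting_schemes.items():
    X = vectorizer.fit_transform(texts)
    print(name, X.shape)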

Feature Set
- Character 3-grams
  - has, any, nyo, yon, one, …
- Words
  - Has, anyone, got, mom, she, finds, …
- Word with character 3-grams
- Word n-grams (n <= 3, i.e. W_i, W_i W_i+1, W_i W_i+1 W_i+2)
  - Has anyone got, anyone got one, she finds it, …
- Word and POS n-grams (n <= 3, i.e. W_i, W_i W_i+1, W_i POS_i+1, POS_i W_i+1, POS_i POS_i+1, etc.)
  - NP VBP, She PRP, VBP finds, …
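
A rough sketch of how these feature families can be generated is shown below. It assumes NLTK for tokenization and POS tagging and is only illustrative; it is not the feature extractor used in the paper.

# Illustrative extractors for the feature families listed above (not the
# authors' code). Requires NLTK data: nltk.download("punkt") and
# nltk.download("averaged_perceptron_tagger").
from itertools import product
from nltk import word_tokenize, pos_tag

def char_3grams(text):
    s = text.lower()
    return [s[i:i + 3] for i in range(len(s) - 2)]

def word_ngrams(tokens, n_max=3):
    return [" ".join(tokens[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(tokens) - n + 1)]

def word_pos_ngrams(text, n_max=3):
    # For each n-gram window, emit every word-or-POS combination per slot,
    # e.g. "she finds", "PRP finds", "she VBZ", "PRP VBZ".
    tagged = pos_tag(word_tokenize(text))
    feats = []
    for n in range(1, n_max + 1):
        for i in range(len(tagged) - n + 1):
            window = tagged[i:i + n]
            feats += [" ".join(choice) for choice in product(*window)]
    return feats

question = "Has anyone got one of those home blood pressure monitors?"
print(char_3grams(question)[:5])
print(word_ngrams(word_tokenize(question))[:5])
print(word_pos_ngrams("she finds it useful")[:8])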

Overview of Experimental Setup
- Datasets
  - From Yahoo! Answers
  - Manually labeled via Amazon Mechanical Turk
- Metrics
- Compare CoCQA to a state-of-the-art semi-supervised method

Dataset
- 1,000 labeled questions from Yahoo! Answers
  - 5 categories (Arts, Education, Science, Health, and Sports)
  - 200 questions from each category
- 10,000 unlabeled questions from Yahoo! Answers
  - 2,000 questions from each category
- Data available online

Manual Labeling
- Annotated using Amazon's Mechanical Turk service
- Each question was judged by 5 Mechanical Turk workers
- 25 questions included in each HIT
- Workers had to pass a qualification test
- Majority vote used to derive the gold standard
- A small fraction of nonsensical questions (22 out of 1,000), such as "Upward Soccer Shorts?" and "1+1=?fdgdgdfg", was discarded by manual inspection
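
The gold-standard derivation above is a plain majority vote. A minimal sketch follows; the label names and the five-judgment example are illustrative assumptions, not the authors' pipeline.

# Deriving a gold-standard label by majority vote over worker judgments.
from collections import Counter

def majority_label(judgments):
    label, _count = Counter(judgments).most_common(1)[0]
    return label

# Example: five Mechanical Turk judgments for one question.
votes = ["subjective", "subjective", "objective", "subjective", "objective"]
print(majority_label(votes))  # -> subjective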

Example HIT task

Subjectivity Statistics by Category
[Chart: proportion of objective vs. subjective questions in each category]

Evaluation Metric
- Macro-averaged F1
  - Prediction performance on subjective and objective questions is equally important
  - F1 is averaged over the subjective and objective classes
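
For reference, with precision P_c and recall R_c for class c, the metric is the unweighted mean of the two per-class F1 scores:

F_1^{(c)} = \frac{2\,P_c R_c}{P_c + R_c}, \qquad \text{Macro-}F_1 = \frac{1}{2}\left(F_1^{(\mathrm{subjective})} + F_1^{(\mathrm{objective})}\right)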

Experimental Settings
- 5-fold cross validation
- Methods compared:
  - Supervised: LibSVM (Chang and Lin, 2001)
  - Generalized Expectation (GE) (Mann and McCallum, 2007)
  - CoCQA: our method
    - Base classifier: LibSVM
    - View 1: question text; View 2: answer text
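
A minimal sketch of this evaluation protocol (5-fold cross validation scored with macro-F1), expressed with scikit-learn conventions as an assumption rather than the paper's exact tooling; the toy texts and labels are hypothetical.

# 5-fold cross validation of a supervised text classifier with macro-F1.
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

# Hypothetical data: question (or question + best answer) texts and labels.
texts = ["is it worth buying a hybrid car?", "what is the boiling point of water?"] * 10
labels = ["subjective", "objective"] * 10

model = make_pipeline(CountVectorizer(), LinearSVC())
scores = cross_val_score(model, texts, labels, cv=5, scoring="f1_macro")
print(scores.mean())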

F1 for Supervised Learning
[Table: F1 with different sets of features (Char 3-gram, Word, Word + Char 3-gram, Word + POS n-gram with n <= 3) over question, best answer, and question + best answer text, compared against a naïve (majority class) baseline]

Semi-Supervised Learning: Adding Unlabeled Data
Comparison between Supervised, GE, and CoCQA (change in F1 relative to the supervised baseline):
- Question only: GE -0.7%, CoCQA +1.9%
- Question + best answer: GE +3.2%, CoCQA +7.2%

CoCQA with varying K (number of new examples added in each iteration)

CoCQA for varying number of iterations

CoCQA for varying amounts of labeled data

Conclusions and Future Work
- Problem: non-topical text classification in CQA
- CoCQA: a co-training framework that can exploit information from both questions and answers
- Case study: subjectivity classification for real questions in CQA
- We plan to explore:
  - more sophisticated features
  - related variants of semi-supervised learning
  - other applications (e.g., sentiment classification)

Thank you!
Baoli Li, Yandong Liu, Eugene Agichtein

Performance of Subjective vs. Objective Classes
- Subjective class: 80%
- Objective class: 60%

Related Work
- Question classification: (Zhang and Lee, 2003), (Tri et al., 2006)
- Sentiment analysis: (Pang and Lee, 2004), (Yu and Hatzivassiloglou, 2003), (Somasundaran et al., 2007)

Important words for the Subjective and Objective classes by Information Gain
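
As a reference for how such a ranking can be computed, here is a minimal sketch of information gain for a binary word-presence feature with respect to the subjective/objective label. It uses the standard definition; the toy documents and labels are hypothetical, not the paper's data.

# Information gain of a word-presence feature against the class label.
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in (labels.count(l) for l in set(labels)))

def information_gain(word, documents, labels):
    # IG(label; word) = H(label) - H(label | word present / word absent)
    has = [y for doc, y in zip(documents, labels) if word in doc.lower().split()]
    not_has = [y for doc, y in zip(documents, labels) if word not in doc.lower().split()]
    n = len(labels)
    conditional = sum(len(part) / n * entropy(part) for part in (has, not_has) if part)
    return entropy(labels) - conditional

docs = ["do you think it is worth it", "what is the boiling point of water"]
ys = ["subjective", "objective"]
print(information_gain("you", docs, ys))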