
1 Exploiting Associations between Word Clusters and Document Classes for Cross-domain Text Categorization Fuzhen Zhuang, Ping Luo, Hui Xiong, Qing He, Yuhong Xiong, Zhongzhi Shi

2 Outline: Introduction; Problem Formulation; Solution for Optimization Problem and Analysis of Algorithm Convergence; Experimental Validation; Related Works; Conclusions. (Fuzhen Zhuang et al., SDM 2010)

3 Introduction. Many traditional learning techniques work well only under the assumption that training and test data follow the same distribution. Counter-example, enterprise news classification (classes: Product Announcement, Business Scandal, Acquisition, ...): a classifier is trained on labeled HP news ("HP's just-released LaserJet Pro P1100 printer and the LaserJet Pro M1130 and M1210 multifunction printers, price ... performance ...") and tested on unlabeled Lenovo news ("Announcement for Lenovo ThinkPad ThinkCentre - price $150 off Lenovo K300 desktop using coupon code ...", "Lenovo ThinkPad ThinkCentre - price $200 off Lenovo IdeaPad U450p laptop using ... their performance"). News from different companies follows a different distribution, so the classifier fails.

4 Motivation (1). Example analysis: both the HP news and the Lenovo news express the product word concept, HP through LaserJet, printer, announcement, price and Lenovo through ThinkPad, ThinkCentre, announcement, price; both documents are related to the Product Announcement document class.

5 Motivation (2). Example analysis: HP expresses the product word concept with LaserJet, printer, price, performance, etc.; Lenovo with ThinkPad, ThinkCentre, price, performance, etc. The words expressing the same word concept are domain-dependent, while the association between word concepts and document classes (the product word concept indicates the Product Announcement class) is domain-independent. Can we model this observation for classification? We study how to model it for cross-domain classification.

6 Motivation (3). Example analysis: besides their domain-specific vocabulary (LaserJet, printer vs. ThinkPad, ThinkCentre), the HP and Lenovo product announcements also share some common words: announcement, price, performance, ...

7 Outline: Introduction; Problem Formulation; Solution for Optimization Problem and Analysis of Algorithm Convergence; Experimental Validation; Related Works; Conclusions.

8 Preliminary Knowledge. Basic formula of matrix tri-factorization: X ≈ F S G^T, where the input X is the word-document co-occurrence matrix. F denotes the word concept information, which may vary across domains; G denotes the document classification information; S is the association between word concepts and document classes, which may remain stable across domains.
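To make the factorization form concrete, here is a minimal numpy sketch with toy dimensions (the variable names F, S, G follow the slide; the shapes and random data are illustrative assumptions, not the paper's setup):

```python
import numpy as np

def tri_factorize_residual(X, F, S, G):
    """Frobenius reconstruction error ||X - F S G^T||_F^2 for a
    word-document matrix X (words x docs), word-concept matrix F,
    concept-class association S, and document-class matrix G."""
    return np.linalg.norm(X - F @ S @ G.T) ** 2

# Toy shapes: 6 words, 4 documents, 2 word concepts, 2 classes.
rng = np.random.default_rng(0)
X = rng.random((6, 4))   # word-document co-occurrence matrix
F = rng.random((6, 2))   # domain-dependent word concepts
S = rng.random((2, 2))   # domain-independent concept-class association
G = rng.random((4, 2))   # document classification information
print(tri_factorize_residual(X, F, S, G))
```

The point of the shapes is that S is small (concepts x classes) and independent of the vocabulary, which is what lets it be shared across domains.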

9 Problem Formulation (1). Input: source domain X_s, target domain X_t. Matrix tri-factorization based classification framework: Two-step Optimization Framework (MTrick0); Joint Optimization Framework (MTrick).

10 Problem Formulation (2). Sketch of the two-step optimization (diagram): first step, factorize the source domain X_s into F_s, S_s, G_s; second step, factorize the target domain X_t into F_t, G_t with S_s carried over.

11 Problem Formulation (3). First step: the optimization problem in the source domain, where G_0 is used as the supervision information; the goal is to obtain F_s, G_s and S_s. Second step: the optimization problem in the target domain, where S_s is the solution obtained from the source domain; the goal is to obtain F_t and G_t.
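The two objectives themselves did not survive extraction. Under the tri-factorization form and the supervision G_0 described above, a plausible reconstruction (the weight β and the exact constraints are assumptions, not the paper's verbatim formulas) is:

```latex
\text{First step (source): }\;
\min_{F_s,\, S_s,\, G_s \ge 0}\;
\|X_s - F_s S_s G_s^{\top}\|_F^2 \;+\; \beta\,\|G_s - G_0\|_F^2
\qquad
\text{Second step (target): }\;
\min_{F_t,\, G_t \ge 0}\;
\|X_t - F_t S_s G_t^{\top}\|_F^2
```

In the second step S_s is held fixed, which is exactly why this variant (MTrick0) transfers knowledge only one way, from source to target.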

12 Problem Formulation (4). Sketch of the joint optimization (diagram): the source domain X_s factorizes into F_s, S, G_s and the target domain X_t into F_t, S, G_t; the shared S realizes the knowledge transfer.

13 Problem Formulation (5). The joint optimization problem over the source and target domains, where G_0 is the supervision information and the association S is shared as the bridge to transfer knowledge.
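The joint objective was also lost in extraction. A plausible form consistent with the slide, with α weighting the target-domain term and β the supervision term (the symbols α and β are assumed notation), is:

```latex
\min_{F_s,\, G_s,\, F_t,\, G_t,\, S \ge 0}\;
\|X_s - F_s S G_s^{\top}\|_F^2
\;+\; \alpha\,\|X_t - F_t S G_t^{\top}\|_F^2
\;+\; \beta\,\|G_s - G_0\|_F^2
```

Because the single S appears in both reconstruction terms, the two domains are optimized simultaneously rather than sequentially.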

14 Outline: Introduction; Problem Formulation; Solution for Optimization Problem and Analysis of Algorithm Convergence; Experimental Validation; Related Works; Conclusions.

15 Solution for Optimization. An alternating iterative algorithm with multiplicative update formulas is developed; this is the solution for the joint optimization problem.
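The slide's update formulas did not survive extraction. As a rough illustration of what multiplicative updates for a tri-factorization look like, here is a single-domain sketch for X ≈ F S G^T (a generic Lee-Seung-style scheme, not the paper's exact two-domain MTrick formulas):

```python
import numpy as np

def trinmf_step(X, F, S, G, eps=1e-12):
    """One round of multiplicative updates for X ~ F S G^T with
    nonnegative factors. Each factor is updated with the others
    fixed; eps guards against division by zero."""
    F *= (X @ G @ S.T) / (F @ S @ G.T @ G @ S.T + eps)
    G *= (X.T @ F @ S) / (G @ S.T @ F.T @ F @ S + eps)
    S *= (F.T @ X @ G) / (F.T @ F @ S @ G.T @ G + eps)
    return F, S, G

rng = np.random.default_rng(1)
X = rng.random((8, 5))
F, S, G = rng.random((8, 2)), rng.random((2, 2)), rng.random((5, 2))
err = [np.linalg.norm(X - F @ S @ G.T)]
for _ in range(50):
    F, S, G = trinmf_step(X, F, S, G)
    err.append(np.linalg.norm(X - F @ S @ G.T))
# The reconstruction error is non-increasing under these updates,
# which is the property the convergence theorem on the next slide
# establishes for the full joint objective.
```

Multiplicative updates preserve nonnegativity automatically, since factors are only ever multiplied by nonnegative ratios.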

16 Analysis of Algorithm Convergence. Following the convergence analysis methodology of [Lee et al., NIPS01] and [Ding et al., KDD06], the following theorem holds. Theorem (Convergence): after each round of applying the iterative update formulas, the objective function of the joint optimization is non-increasing, so the algorithm converges.

17 Outline: Introduction; Problem Formulation; Solution for Optimization Problem and Analysis of Algorithm Convergence; Experimental Validation; Related Works; Conclusions.

18 Experimental Preparation (1). Constructing classification tasks for rec vs. sci, where rec and sci denote the positive and negative classes, respectively. rec subcategories: rec.autos, rec.motorcycles, rec.baseball, rec.hockey; sci subcategories: sci.crypt, sci.electronics, sci.med, sci.space. A task pairs a source domain (e.g., rec.autos + sci.med) with a target domain drawn from the remaining subcategories (e.g., rec.motorcycles + sci.space): 4 x 4 source choices times 3 x 3 target choices give 144 tasks for this data set.
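The 144 count follows directly from the construction and can be checked by enumeration (subcategory names are as abbreviated on the slide; the full 20 Newsgroups names differ, e.g. rec.sport.baseball):

```python
from itertools import product

rec = ["rec.autos", "rec.motorcycles", "rec.baseball", "rec.hockey"]
sci = ["sci.crypt", "sci.electronics", "sci.med", "sci.space"]

tasks = []
for src_r, src_s in product(rec, sci):       # 4 x 4 source choices
    for tgt_r, tgt_s in product(rec, sci):   # 3 x 3 remaining targets
        # Target must use different subcategories on both sides,
        # so source and target distributions differ.
        if tgt_r != src_r and tgt_s != src_s:
            tasks.append(((src_r, src_s), (tgt_r, tgt_s)))
print(len(tasks))  # 144
```

Each source choice leaves 3 rec and 3 sci subcategories for the target, hence 16 x 9 = 144 tasks.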

19 Experimental Preparation (2). Data sets: 20 Newsgroups (three top categories selected: rec with rec.autos, rec.motorcycles, rec.baseball, rec.hockey; sci with sci.crypt, sci.electronics, sci.med, sci.space; talk with talk.guns, talk.mideast, talk.misc, talk.religion), yielding two data sets for binary classification, rec vs. sci (144 tasks) and sci vs. talk (144 tasks); and Reuters-21578 (the problems constructed in [Gao et al., KDD08]).

20 Experimental Preparation (3). Compared algorithms. Supervised learning: Logistic Regression (LG) [David et al., 00], Support Vector Machine (SVM) [Joachims, ICML99]. Semi-supervised learning: TSVM [Joachims, ICML99]. Cross-domain learning: CoCC [Dai et al., KDD07], LWE [Gao et al., KDD08]. Our methods: MTrick0 (two-step optimization framework), MTrick (joint optimization framework). Measure: classification accuracy.

21 Experimental Results (1). Comparison of MTrick, MTrick0, CoCC, TSVM, SVM and LG on the rec vs. sci data set. MTrick performs well even when the accuracy of LG is lower than 65%.

22 Experimental Results (2). Comparison of MTrick, MTrick0, CoCC, TSVM, SVM and LG on the sci vs. talk data set. Similar to rec vs. sci, MTrick also achieves the best results on this data set.

23 Experimental Results (3). Performance comparison of MTrick, LWE, CoCC, TSVM, SVM and LG on Reuters-21578. MTrick also performs very well on this data set.

24 Experimental Results Summary. The systematic experiments show that MTrick outperforms all compared algorithms. In particular, MTrick performs well even when the accuracy of LG is low (< 65%), which indicates that MTrick still works when the transfer learning problem is hard. We also find that the joint optimization is better than the two-step optimization.

25 Overview: Introduction; Problem Formulation; Solution for Optimization Problem and Analysis of Algorithm Convergence; Experimental Validation; Related Works; Conclusions.

26 Related Work (1). Cross-domain learning solves the distribution mismatch between the training and testing data. Instance-weighting based approaches: boosting-based learning by Dai et al. [ICML07]; an instance-weighting framework for NLP tasks by Jiang et al. [ACL07]. Feature-selection based approaches: a two-phase feature selection framework by Jiang et al. [CIKM07]; a dimensionality reduction approach by Pan et al. [AAAI08], which finds a latent feature space that serves as bridge knowledge between the source and target domains; the Co-Clustering based Classification method by Dai et al. [KDD07].

27 Related Work (2). Nonnegative matrix factorization (NMF): weighted nonnegative matrix factorization (WNMF) by Guillamet et al. [PRL03]; incorporating word-space knowledge for document clustering by Li et al. [SIGIR08]; orthogonality-constrained NMF by Ding et al. [KDD06]; cross-domain collaborative filtering by Li et al. [IJCAI09]; transferring label information by sharing word clusters, proposed by Li et al. [SIGIR09]. However, the word clusters are not exactly the same across domains due to the distribution difference.

28 Outline: Introduction; Problem Formulation; Solution for Optimization Problem and Analysis of Algorithm Convergence; Experimental Validation; Related Works; Conclusions.

29 Conclusions. We propose a nonnegative matrix factorization based classification framework (MTrick), which explicitly considers the domain-dependent word concepts and the domain-independent association between concepts and document classes. We develop an alternating iterative algorithm to solve the optimization problem and theoretically analyze its convergence. Experiments on real-world text data sets show the effectiveness of the proposed approach.

30 Thank you! Q & A. Acknowledgements.

