1 Exploiting Associations between Word Clusters and Document Classes for Cross-domain Text Categorization
Fuzhen Zhuang, Ping Luo, Hui Xiong, Qing He, Yuhong Xiong, Zhongzhi Shi
SDM 2010

2 Outline
Introduction
Problem Formulation
Solution for Optimization Problem and Analysis of Algorithm Convergence
Experimental Validation
Related Works
Conclusions

3 Introduction
Many traditional learning techniques work well only under the assumption that training and test data follow the same distribution. When the distributions differ, a classifier trained on one domain can fail on the other.
Example: enterprise news classification, with classes such as “Product Announcement”, “Business Scandal”, “Acquisition”, …
Training data (labeled) is HP news and test data (unlabeled) is Lenovo news. The two sets come from different companies and follow different distributions.
HP news: Product announcement: HP's just-released LaserJet Pro P1100 printer and the LaserJet Pro M1130 and M1210 multifunction printers, price … performance ...
Lenovo news: Announcement for Lenovo ThinkPad ThinkCentre – price $150 off Lenovo K300 desktop using coupon code ... Lenovo ThinkPad ThinkCentre – price $200 off Lenovo IdeaPad U450p laptop using. ...their performance

4 Motivation (1)
Example analysis (the same HP and Lenovo news snippets as on slide 3):
Related document class: Product announcement
Product word concept:
HP news: LaserJet, printer, announcement, price
Lenovo news: ThinkPad, ThinkCentre, announcement, price

5 Motivation (2)
Example analysis: the words expressing the same word concept are domain-dependent.
HP: LaserJet, printer, price, performance, etc.
Lenovo: ThinkPad, ThinkCentre, price, performance, etc.
The association between word concepts and document classes is domain-independent: the word concept "Product" indicates the document class "Product announcement" in both domains.
Can we model this observation for classification? We model it for cross-domain classification through:
Domain-dependent word concepts
Domain-independent association between word concepts and document classes

6 Motivation (3)
Example analysis (the same HP and Lenovo news snippets as on slide 3):
The two domains share some common words: announcement, price, performance …
Related document class: Product announcement
Product word concept:
HP news: LaserJet, printer, announcement, price …
Lenovo news: ThinkPad, ThinkCentre, announcement, price …

8 Preliminary Knowledge
Basic formula of matrix tri-factorization: X ≈ F S G^T, where
X, the input, is the word-document co-occurrence matrix;
F denotes the word concept information, which may vary across domains;
S is the association between word concepts and document classes, which may remain stable across domains;
G denotes the document classification information.
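
To make the roles of the three factors concrete, here is a minimal NumPy sketch of the shapes involved in a tri-factorization X ≈ F S G^T. The toy sizes, the random values, the row normalization, and the argmax read-out of a class from G are all assumptions made for illustration; the slide itself only names the factors.

```python
import numpy as np

rng = np.random.default_rng(0)

m, n = 6, 4   # m words, n documents (toy sizes, chosen for illustration)
k, c = 3, 2   # k word concepts (word clusters), c document classes


def row_normalize(a):
    """Scale each row to sum to 1 so it reads as a soft membership."""
    return a / a.sum(axis=1, keepdims=True)


F = row_normalize(rng.random((m, k)))  # word -> word-concept membership
S = rng.random((k, c))                 # word-concept / document-class association
G = row_normalize(rng.random((n, c)))  # document -> class membership

X_hat = F @ S @ G.T                    # reconstructed word-document matrix, shape (m, n)

# Assumed read-out: a document's class is the largest entry in its row of G.
predicted_class = G.argmax(axis=1)
```

In this picture only F and G are tied to a particular vocabulary usage and document collection, while S has a fixed k x c shape, which is what makes it a candidate for sharing between domains.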

9 Problem Formulation (1)
Input: source domain Xs, target domain Xt
Matrix tri-factorization based classification framework:
Two-step Optimization Framework (MTrick0)
Joint Optimization Framework (MTrick)

10 Problem Formulation (2)
Sketch map of two-step optimization
[Figure: in the first step, the source domain Xs is factorized into Fs, Ss and Gs; in the second step, the target domain Xt is factorized into Ft and Gt, with Ss carried over from the first step.]

11 Problem Formulation (3)
The optimization problem in the source domain (first step):
G0 is used as the supervision information for this optimization.
Our goal: to obtain Fs, Gs and Ss.
The optimization problem in the target domain (second step):
Ss is the solution obtained from the source domain.
Our goal: to obtain Ft and Gt.

12 Problem Formulation (4)
Sketch map of joint optimization
[Figure: the source domain Xs is factorized into Fs, S and Gs, and the target domain Xt into Ft, S and Gt; the shared S carries the knowledge transfer between the two domains.]

13 Problem Formulation (5)
The joint optimization problem over the source and target domains:
The association S is shared as a bridge to transfer knowledge.
G0 is the supervision information.
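
The formula for the joint problem is not preserved in this transcript, so the following is only a plausible sketch of what such an objective could look like: a reconstruction error for each domain with a shared S, plus a penalty tying the source document factor Gs to the label matrix G0. The weights `alpha` and `beta`, the one-hot form of G0, and the omission of any normalization constraints are assumptions, not the authors' stated formulation.

```python
import numpy as np


def joint_objective(Xs, Xt, Fs, Gs, Ft, Gt, S, G0, alpha=1.0, beta=1.0):
    """Hypothetical joint loss with one association matrix S shared by both domains."""
    source_fit = np.linalg.norm(Xs - Fs @ S @ Gs.T, "fro") ** 2
    target_fit = np.linalg.norm(Xt - Ft @ S @ Gt.T, "fro") ** 2
    supervision = np.linalg.norm(Gs - G0, "fro") ** 2  # keep Gs close to the labels
    return source_fit + alpha * supervision + beta * target_fit


rng = np.random.default_rng(0)
Fs, Ft = rng.random((5, 3)), rng.random((5, 3))      # 5 words, 3 word concepts
S = rng.random((3, 2))                               # 3 word concepts, 2 classes
G0 = np.array([[1, 0], [0, 1], [1, 0], [0, 1]], dtype=float)  # 4 labeled source documents
Gt = rng.random((6, 2))                              # 6 unlabeled target documents

# Build data that the factors reproduce exactly, so the loss should vanish.
Xs = Fs @ S @ G0.T
Xt = Ft @ S @ Gt.T

perfect = joint_objective(Xs, Xt, Fs, G0, Ft, Gt, S, G0)
# Flipping every source label violates the supervision term and raises the loss.
worse = joint_objective(Xs, Xt, Fs, 1.0 - G0, Ft, Gt, S, G0)
```

Because S appears in both reconstruction terms, any change to S that helps the labeled source domain also changes how the target documents are explained, which is the sense in which S acts as a bridge.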

15 Solution for Optimization
An alternating iterative algorithm is developed to solve the joint optimization problem: each factor is updated in turn by its own update formula while the others are held fixed.
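
The update formulas themselves are not preserved in this transcript. As a hedged illustration of what an alternating scheme of this kind looks like, the sketch below applies generic Lee-Seung style multiplicative updates to a single-domain problem, minimizing ||X - F S G^T||^2 over nonnegative F, S and G. It is not the MTrick update: there is no second domain, no supervision term and no normalization constraint, and the function name `tri_factorize` and its parameters are made up for this example.

```python
import numpy as np


def tri_factorize(X, k, c, n_iter=50, eps=1e-12, seed=0):
    """Alternating multiplicative updates for X ~ F S G^T with nonnegative factors."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    F = rng.random((m, k)) + 0.1
    S = rng.random((k, c)) + 0.1
    G = rng.random((n, c)) + 0.1
    losses = []
    for _ in range(n_iter):
        # Update F with S and G held fixed.
        F *= (X @ G @ S.T) / (F @ S @ G.T @ G @ S.T + eps)
        # Update G with F and S held fixed.
        G *= (X.T @ F @ S) / (G @ S.T @ F.T @ F @ S + eps)
        # Update S with F and G held fixed.
        S *= (F.T @ X @ G) / (F.T @ F @ S @ G.T @ G + eps)
        # Record the squared reconstruction error after each full round.
        losses.append(np.linalg.norm(X - F @ S @ G.T, "fro") ** 2)
    return F, S, G, losses


X = np.random.default_rng(1).random((8, 6))   # toy nonnegative word-document matrix
F, S, G, losses = tri_factorize(X, k=3, c=2)
```

Each update multiplies a factor by a ratio of nonnegative terms, so the factors stay nonnegative without any projection step, and the recorded `losses` can be inspected to see how the reconstruction error evolves from round to round.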

16 Analysis of Algorithm Convergence
Following the methodology of convergence analysis in [Lee et al., NIPS’01] and [Ding et al., KDD’06], the following theorem holds.
Theorem (Convergence): Over the rounds of iterative updates, the objective function of the joint optimization converges monotonically.

18 Experimental Preparation (1)
Construct classification tasks
Data set rec vs. sci: rec and sci denote the positive and negative classes, respectively.
rec: rec.autos, rec.motorcycles, rec.baseball, rec.hockey
sci: sci.crypt, sci.electronics, sci.med, sci.space
For source domain: one subcategory from each class, e.g. rec.autos + sci.med (4 x 4 cases)
For target domain: one of the remaining subcategories from each class, e.g. rec.motorcycles + sci.space (3 x 3 cases)
144 (= 4 x 4 x 3 x 3) tasks can be constructed from this data set.
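
The task count can be checked by enumeration. The sketch below pairs every ordered choice of two distinct rec subcategories (source, target) with every ordered choice of two distinct sci subcategories. The subcategory names are the abbreviated ones used on the slide, and the dictionary layout of a task is just an illustrative choice.

```python
from itertools import permutations

rec = ["rec.autos", "rec.motorcycles", "rec.baseball", "rec.hockey"]
sci = ["sci.crypt", "sci.electronics", "sci.med", "sci.space"]

# Each task uses one rec and one sci subcategory as the source domain,
# and a different rec and a different sci subcategory as the target domain.
tasks = [
    {"source": (rec_src, sci_src), "target": (rec_tgt, sci_tgt)}
    for rec_src, rec_tgt in permutations(rec, 2)
    for sci_src, sci_tgt in permutations(sci, 2)
]

print(len(tasks))  # 144
```

There are 4 x 3 ordered pairs per class, and 12 x 12 gives the same 144 as the slide's 4 x 4 source cases times 3 x 3 target cases.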

19 Experimental Preparation (2)
Data sets
20 Newsgroups (three top categories are selected):
rec: rec.autos, rec.motorcycles, rec.baseball, rec.hockey
sci: sci.crypt, sci.electronics, sci.med, sci.space
talk: talk.guns, talk.mideast, talk.misc, talk.religion
Two data sets for binary classification: rec vs. sci (144 tasks) and sci vs. talk (144 tasks)
Reuters (the problems constructed in [Gao et al., KDD’08])

20 Experimental Preparation (3)
Compared algorithms
Supervised learning: Logistic Regression (LG) [David et al., 00]; Support Vector Machine (SVM) [Joachims, ICML’99]
Semi-supervised learning: TSVM [Joachims, ICML’99]
Cross-domain learning: CoCC [Dai et al., KDD’07]; LWE [Gao et al., KDD’08]
Our methods: MTrick0 (two-step optimization framework); MTrick (joint optimization framework)
Measure: classification accuracy
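
For reference, classification accuracy is the fraction of test documents whose predicted class matches the true class. A minimal sketch follows; the function name and the toy labels are illustrative only.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("accuracy is undefined for an empty label list")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Three of the four toy predictions are correct.
score = accuracy(["rec", "sci", "rec", "sci"], ["rec", "sci", "sci", "sci"])
print(score)  # 0.75
```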

21 Experimental Results (1)
Comparisons among MTrick, MTrick0, CoCC, TSVM, SVM and LG on the data set rec vs. sci
MTrick performs well even when the accuracy of LG is lower than 65%.

22 Experimental Results (2)
Comparisons among MTrick, MTrick0, CoCC, TSVM, SVM and LG on the data set sci vs. talk
As on rec vs. sci, MTrick also achieves the best results on this data set.

23 Experimental Results (3)
Performance comparison of MTrick, LWE, CoCC, TSVM, SVM and LG on Reuters-21578
MTrick also performs very well on this data set.

24 Experimental Results Summary
The systematic experiments show that MTrick outperforms all the compared algorithms.
In particular, MTrick performs very well when the accuracy of LG is low (< 65%), which indicates that MTrick still works when the transfer learning task is difficult.
The joint optimization is also better than the two-step optimization.

26 Related Work (1)
Cross-domain learning: solves the distribution mismatch problem between the training and testing data.
Instance weighting based approaches:
  Boosting based learning by Dai et al. [ICML’07]
  Instance weighting framework for NLP tasks by Jiang et al. [ACL’07]
Feature selection based approaches:
  Two-phase feature selection framework by Jiang et al. [CIKM’07]
  Dimensionality reduction approach by Pan et al. [AAAI’08], which focuses on finding the latent feature space regarded as the bridge knowledge between the source and target domains
  Co-clustering based classification method by Dai et al. [KDD’07]

27 Related Work (2)
Nonnegative Matrix Factorization (NMF)
Weighted nonnegative matrix factorization (WNMF) by Guillamet et al. [PRL’03]
Incorporating word space knowledge for document clustering by Li et al. [SIGIR’08]
Orthogonal constrained NMF by Ding et al. [KDD’06]
Cross-domain collaborative filtering by Li et al. [IJCAI’09]
Transferring the label information by sharing the information of word clusters, proposed by Li et al. [SIGIR’09]. However, the word clusters are not exactly the same across domains because of the distribution difference.

29 Conclusions
We propose a nonnegative matrix factorization based classification framework (MTrick), which explicitly considers the domain-dependent concepts and the domain-independent association between concepts and document classes.
We develop an alternating iterative algorithm to solve the optimization problem and theoretically analyze its convergence.
Experiments on real-world text data sets show the effectiveness of the proposed approach.

30 Thank you! Q & A
Acknowledgement

