A Cross-Collection Mixture Model for Comparative Text Mining
ChengXiang Zhai (1), Atulya Velivelli (2), Bei Yu (3)
(1) Department of Computer Science, (2) Department of Electrical and Computer Engineering, (3) Graduate School of Library and Information Science
University of Illinois, Urbana-Champaign, U.S.A.
2 Motivation
Many applications involve a comparative analysis of several comparable text collections, e.g.,
– Given news articles from different sources (about the same event), can we extract what is common to all the sources and what is unique to one specific source?
– Given customer reviews of 3 different brands of laptops, can we extract the common themes (e.g., battery life, speed, warranty) and compare the three brands in terms of each common theme?
– Given web sites of companies selling similar products, can we analyze the strengths/weaknesses of each company?
Existing work in text mining has focused on a single collection of text and is thus inadequate for comparative text analysis.
We aim to develop methods for comparing multiple collections of text, i.e., performing comparative text mining.
3 Problem Definition: Comparative Text Mining (CTM)
Given a comparable set of text collections (a pool of collections C1, C2, …, Ck),
Discover and analyze their common and unique properties: the common themes shared by all collections, plus the C1-specific, C2-specific, …, Ck-specific themes.
[Figure: a pool of text collections C1 … Ck mapped to common themes and collection-specific themes]
4 Example: Summarizing Customer Reviews
Ideal results from comparative text mining, given IBM, APPLE, and DELL laptop reviews:

Common Themes | IBM-specific       | APPLE-specific      | DELL-specific
Battery Life  | Long, 4-3 hrs      | Medium, 3-2 hrs     | Short, 2-1 hrs
Hard disk     | Large, GB          | Small, 5-10 GB      | Medium, GB
Speed         | Slow, Mhz          | Very Fast, 3-4 Ghz  | Moderate, 1-2 Ghz
5 A More Realistic Setup of CTM
The output is word distributions rather than structured summaries: for each theme, a common word distribution plus collection-specific word distributions (IBM / APPLE / DELL laptop reviews). Most probabilities are lost in the transcript:

Battery theme — common: battery, hours, life, …; IBM-specific: long, hours, …; APPLE-specific: reasonable 0.10, medium, hours, …; DELL-specific: short 0.05, poor, hours, …
Hard-disk theme — common: disk, IDE, drive, …; IBM-specific: large, GB, …; APPLE-specific: small, GB, …; DELL-specific: medium, GB, …
Speed theme — common: Pentium, processor, …; IBM-specific: slow, Mhz, …; APPLE-specific: fast, Ghz, …; DELL-specific: moderate, Ghz, …
6 A Basic Approach: Simple Clustering
Pool all documents together and perform clustering.
Hopefully, some clusters reflect common themes and others reflect specific themes.
However, we can't force a common theme to cover all collections.
[Figure: clusters for the background distribution θB and themes θ1, θ2, θ3, θ4]
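The pooled-clustering baseline on this slide can be sketched as follows. This is a minimal illustration, not the setup used in the experiments: the review snippets are made up, and TF-IDF vectors with k-means stand in for whatever vectorization and clustering algorithm one might choose.

```python
# Baseline sketch: pool all reviews from every collection together,
# then cluster them, hoping clusters line up with themes.
# The documents below are hypothetical stand-ins for laptop reviews.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "battery life is long, about four hours",      # from collection 1
    "battery dies fast, short battery life",       # from collection 2
    "hard disk is large, plenty of storage",       # from collection 1
    "small hard disk, not enough storage space",   # from collection 2
]
X = TfidfVectorizer().fit_transform(docs)   # pooled term matrix
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
```

Note the limitation the slide points out: nothing in this objective forces any cluster to draw documents from every collection, so "common" themes are not guaranteed to emerge.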
7 Improved Clustering: Cross-Collection Mixture Models
Explicitly distinguish and model common themes and specific themes.
Fit a mixture model to the text data; estimate parameters using EM.
Clusters are more meaningful.
[Figure: a background distribution θB; for each theme j = 1, …, k, a common distribution θj plus collection-specific distributions θj,1, …, θj,m for collections C1, …, Cm]
8 Details of the Mixture Model
Generating word w in document d in collection Ci:
– With probability λB, w is drawn from the background distribution θB, which accounts for noise (common non-informative words).
– Otherwise, a theme j is chosen with probability πd,j; then w is drawn from the common distribution θj with probability λC, or from the collection-specific distribution θj,i with probability 1 − λC.
Parameters: λB = noise level (manually set); λC = common-specific tradeoff (manually set); the θ's and π's are estimated with Maximum Likelihood.
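The generative process above can be turned into a small EM routine. The following is a minimal sketch, not the authors' implementation: it assumes a background distribution fixed to overall word frequencies, random initialization of the theme distributions, and the E/M updates implied by the mixture structure; all variable names and the toy smoothing constant are my own.

```python
import numpy as np

def cross_collection_em(counts, coll_ids, k, lam_b=0.9, lam_c=0.5,
                        n_iter=50, seed=0):
    """EM sketch for the cross-collection mixture model.

    counts   : (n_docs, vocab) term-count matrix
    coll_ids : collection index of each document
    k        : number of themes
    lam_b    : noise level (manually set, as on the slide)
    lam_c    : common-vs-specific tradeoff (manually set)
    """
    rng = np.random.default_rng(seed)
    n_docs, vocab = counts.shape
    n_coll = coll_ids.max() + 1
    # Background theta_B: overall word frequencies, held fixed.
    theta_b = counts.sum(0) / counts.sum()
    # Common theme distributions theta_j and collection-specific theta_{j,i}.
    theta = rng.random((k, vocab)); theta /= theta.sum(1, keepdims=True)
    theta_s = rng.random((n_coll, k, vocab))
    theta_s /= theta_s.sum(2, keepdims=True)
    pi = np.full((n_docs, k), 1.0 / k)   # per-document theme weights pi_{d,j}

    for _ in range(n_iter):
        pi_new = np.zeros_like(pi)
        th_new = np.zeros_like(theta)
        ths_new = np.zeros_like(theta_s)
        for d in range(n_docs):
            i = coll_ids[d]
            # E-step: posterior over (background, theme-common, theme-specific)
            p_b = lam_b * theta_b                                 # (vocab,)
            p_c = (1 - lam_b) * lam_c * pi[d, :, None] * theta    # (k, vocab)
            p_s = (1 - lam_b) * (1 - lam_c) * pi[d, :, None] * theta_s[i]
            norm = p_b + p_c.sum(0) + p_s.sum(0) + 1e-12
            r_c, r_s = p_c / norm, p_s / norm   # responsibilities
            # M-step accumulators, weighted by the observed counts
            w = counts[d]
            pi_new[d] = ((r_c + r_s) * w).sum(1)
            th_new += r_c * w
            ths_new[i] += r_s * w
        pi = pi_new / (pi_new.sum(1, keepdims=True) + 1e-12)
        theta = th_new / (th_new.sum(1, keepdims=True) + 1e-12)
        theta_s = ths_new / (ths_new.sum(2, keepdims=True) + 1e-12)
    return pi, theta, theta_s
```

After convergence, the rows of `theta` are candidate common themes and `theta_s[i]` holds the collection-specific variants, matching the output format of slide 5.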
9 Experiments Two Data Sets –War news (2 collections) Iraq war: A combination of 30 articles from CNN and BBC websites Afghan war: A combination of 26 articles from CNN and BBC websites –Laptop customer reviews (3 collections) Apple iBook Mac: 34 reviews downloaded from epinions.com Dell Inspiron: 22 reviews downloaded from epinions.com IBM Thinkpad: 42 reviews downloaded from epinions.com On each data set, we compare a simple mixture model with the cross-collection mixture model
10 Comparison of Simple and Cross-Collection Clustering
[Table of top cluster words for simple vs. cross-collection clustering, omitted in the transcript]
Results from cross-collection clustering are more meaningful.
11 Cross-Collection Clustering Results (Laptop Reviews)
Top words serve as labels for common themes (e.g., [sound, speakers], [battery, hours], [cd, drive]).
These word distributions can be used to segment text and add hyperlinks between documents.
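The segmentation idea on this slide can be sketched by scoring each text segment under the learned theme word distributions and labeling it with the best-scoring theme. The vocabulary and probabilities below are made up for illustration; they are not values learned in the paper.

```python
import numpy as np

# Hypothetical theme word distributions (columns follow `vocab` order).
vocab = {"battery": 0, "hours": 1, "sound": 2, "speakers": 3, "cd": 4, "drive": 5}
themes = {
    "battery": np.array([0.45, 0.35, 0.05, 0.05, 0.05, 0.05]),
    "sound":   np.array([0.05, 0.05, 0.45, 0.35, 0.05, 0.05]),
    "cd":      np.array([0.05, 0.05, 0.05, 0.05, 0.40, 0.40]),
}

def label_segment(words):
    """Assign a segment to the theme with the highest log-likelihood."""
    ids = [vocab[w] for w in words if w in vocab]   # ignore out-of-vocab words
    scores = {name: np.log(dist[ids]).sum() for name, dist in themes.items()}
    return max(scores, key=scores.get)

label_segment("the battery lasts many hours".split())   # -> "battery"
```

Segments labeled with the same theme across documents could then be hyperlinked to each other, as the slide suggests.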
12 Summary and Future Work
We defined a new text mining problem, referred to as comparative text mining (CTM), which has many applications.
We proposed and evaluated a cross-collection mixture model for CTM.
Experiment results show that the proposed cross-collection model is more effective for CTM than a simple mixture model.
Future work:
– Further improve the mixture model and estimation method (e.g., consider proximity, MAP estimation).
– Use the model to segment documents and create hyperlinks between segments (e.g., feed the learned word distributions into HMMs for segmentation).