1 A Cross-Collection Mixture Model for Comparative Text Mining
ChengXiang Zhai (1), Atulya Velivelli (2), Bei Yu (3)
(1) Department of Computer Science
(2) Department of Electrical and Computer Engineering
(3) Graduate School of Library and Information Science
University of Illinois, Urbana-Champaign, U.S.A.
2 Motivation
- Many applications involve a comparative analysis of several comparable text collections, e.g.:
  - Given news articles from different sources about the same event, can we extract what is common to all the sources and what is unique to one specific source?
  - Given customer reviews of three different brands of laptops, can we extract the common themes (e.g., battery life, speed, warranty) and compare the three brands in terms of each common theme?
  - Given web sites of companies selling similar products, can we analyze the strengths and weaknesses of each company?
- Existing work in text mining has conceptually focused on a single collection of text and is thus inadequate for comparative text analysis.
- We aim to develop methods for comparing multiple collections of text and performing comparative text mining.
3 Comparative Text Mining (CTM)
- Problem definition:
  - Given a comparable set of text collections
  - Discover and analyze their common and unique properties
- [Diagram: a pool of text collections (Collection C1, Collection C2, …, Collection Ck), with common themes and C1-specific, C2-specific, …, Ck-specific themes]
5 A More Realistic Setup of CTM
Collections: IBM laptop reviews, APPLE laptop reviews, DELL laptop reviews
The first column is the common word distribution of each theme; the other three columns are collection-specific word distributions.

Common themes | “IBM” specific | “APPLE” specific | “DELL” specific
Battery 0.129, Hours 0.080, Life 0.060, … | Long 0.120, 4hours 0.010, 3hours 0.008 | Reasonable 0.10, Medium 0.08, 2hours 0.002 | Short 0.05, Poor 0.01, 1hours 0.005, ..
Disk 0.015, IDE 0.010, Drive 0.005 | Large 0.100, 80GB 0.050 | Small 0.050, 5GB ... | Medium 0.123, 20GB ….
Pentium 0.113, Processor 0.050 | Slow 0.114, 200Mhz 0.080 | Fast 0.151, 3Ghz 0.100 | Moderate 0.116, 1Ghz
6 A Basic Approach: Simple Clustering
- Pool all documents together and perform clustering.
- Ideally, some clusters reflect common themes and others collection-specific themes.
- However, we can’t “force” a common theme to cover all collections.
- [Diagram: a background model θ_B and themes θ_1, θ_2, θ_3, θ_4, …]
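The pooled clustering described on this slide can be sketched as a plain mixture model fitted with EM. This is a minimal illustration and not the authors' code: the function name `simple_mixture_em`, the default noise level `lambda_b=0.9`, the random initialisation, and the toy documents are all assumptions made for the example. The background model is taken to be the pooled word frequencies and is held fixed.

```python
import random
from collections import Counter


def simple_mixture_em(docs, k, lambda_b=0.9, iters=30, seed=0):
    """EM for a pooled k-theme mixture with a fixed background model.

    docs: list of non-empty token lists (all collections pooled together).
    Returns (themes, weights) where themes[j][w] = p(w | theme j) and
    weights[d][j] = mixing weight of theme j in document d.
    """
    rng = random.Random(seed)
    counts = [Counter(doc) for doc in docs]
    vocab = sorted({w for c in counts for w in c})
    total = sum(sum(c.values()) for c in counts)
    # Background: pooled relative frequencies, not re-estimated by EM.
    background = {w: sum(c[w] for c in counts) / total for w in vocab}

    # Random theme word distributions and uniform document weights.
    themes = []
    for _ in range(k):
        raw = {w: rng.random() + 0.1 for w in vocab}
        z = sum(raw.values())
        themes.append({w: v / z for w, v in raw.items()})
    weights = [[1.0 / k] * k for _ in docs]

    for _ in range(iters):
        theme_acc = [dict.fromkeys(vocab, 0.0) for _ in range(k)]
        for d, c in enumerate(counts):
            doc_acc = [0.0] * k
            for w, n in c.items():
                mix = [weights[d][j] * themes[j][w] for j in range(k)]
                mix_sum = sum(mix)
                if mix_sum == 0.0:
                    continue  # no theme can explain w; leave it to the background
                # E-step: probability that w came from some theme, not the background.
                p_theme = (1 - lambda_b) * mix_sum
                not_bg = p_theme / (p_theme + lambda_b * background[w])
                for j in range(k):
                    r = n * not_bg * mix[j] / mix_sum  # expected count for theme j
                    doc_acc[j] += r
                    theme_acc[j][w] += r
            # M-step for this document's theme weights.
            z = sum(doc_acc)
            if z > 0:
                weights[d] = [a / z for a in doc_acc]
        # M-step for the theme word distributions.
        for j in range(k):
            z = sum(theme_acc[j].values())
            if z > 0:
                themes[j] = {w: v / z for w, v in theme_acc[j].items()}
    return themes, weights


# Toy usage: four short "reviews" pooled from different collections.
docs = [
    ["battery", "hours", "life", "battery"],
    ["disk", "drive", "disk", "large"],
    ["battery", "life", "hours"],
    ["disk", "drive", "ide"],
]
themes, weights = simple_mixture_em(docs, k=2)
```

Nothing in this model knows which collection a document came from, so a theme is free to draw all of its mass from a single collection. That is the limitation the slide points out.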
7 Improved Clustering: Cross-Collection Mixture Models
- Explicitly distinguish and model common themes and collection-specific themes.
- Fit a mixture model to the text data.
- Estimate parameters using EM.
- The resulting clusters are more meaningful.
- [Diagram: a background model θ_B; for each theme j = 1, …, k, a common distribution θ_j and one distribution θ_{j,i} specific to each collection C_i, i = 1, …, m]
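8 Details of the Mixture Model
- The background model θ_B accounts for noise (common non-informative words).
- “Generating” word w in doc d in collection C_i:
  - With probability λ_B, w is drawn from the background θ_B.
  - With probability 1 - λ_B, a theme j is chosen with weight π_{d,j} (j = 1, …, k); w is then drawn from the common distribution θ_j with probability λ_C, or from the collection-specific distribution θ_{j,i} with probability 1 - λ_C.
- Parameters:
  - λ_B = noise level (manually set)
  - λ_C = common-specific tradeoff (manually set)
  - The θ’s and π’s are estimated with maximum likelihood.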
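The generative process on this slide implies a probability for each word occurrence. The sketch below computes it, assuming the branches of the diagram combine as an ordinary two-level mixture. The function name `word_prob`, the dictionary-based data layout, and the toy numbers are illustrative assumptions and do not come from the slides.

```python
def word_prob(w, i, pi_d, common, specific, background, lambda_b, lambda_c):
    """Probability of word w in a document d that belongs to collection i.

    Assumed form of the mixture:
      p(w) = lambda_b * p(w | theta_B)
           + (1 - lambda_b) * sum_j pi_d[j] * (lambda_c * p(w | theta_j)
                                               + (1 - lambda_c) * p(w | theta_{j,i}))

    pi_d[j]           : weight of theme j in document d (sums to 1)
    common[j][w]      : common word distribution of theme j
    specific[j][i][w] : distribution of theme j specific to collection i
    background[w]     : background (noise) word distribution
    """
    themed = sum(
        pi_d[j] * (lambda_c * common[j].get(w, 0.0)
                   + (1 - lambda_c) * specific[j][i].get(w, 0.0))
        for j in range(len(pi_d))
    )
    return lambda_b * background.get(w, 0.0) + (1 - lambda_b) * themed


# Toy setup: one theme, two collections, three-word vocabulary.
background = {"battery": 0.5, "long": 0.25, "short": 0.25}
common = [{"battery": 1.0}]
specific = [[{"long": 1.0}, {"short": 1.0}]]
pi_d = [1.0]

# Same theme, different collections: "long" is likelier in collection 0,
# "short" is likelier in collection 1.
p_long_c0 = word_prob("long", 0, pi_d, common, specific, background, 0.5, 0.5)
p_long_c1 = word_prob("long", 1, pi_d, common, specific, background, 0.5, 0.5)
```

Setting `lambda_c` close to 1 pushes most of a theme's mass into the shared distribution, while a value close to 0 lets each collection keep its own vocabulary for the theme. This is the tradeoff the slide says is set manually.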
9 Experiments
Two data sets:
- War news (2 collections)
  - Iraq war: a combination of 30 articles from the CNN and BBC websites
  - Afghan war: a combination of 26 articles from the CNN and BBC websites
- Laptop customer reviews (3 collections)
  - Apple iBook Mac: 34 reviews downloaded from epinions.com
  - Dell Inspiron: 22 reviews downloaded from epinions.com
  - IBM Thinkpad: 42 reviews downloaded from epinions.com
On each data set, we compare a simple mixture model with the cross-collection mixture model.
10 Comparison of Simple and Cross-Collection Clustering
- Results from cross-collection clustering are more meaningful.
11 Cross-Collection Clustering Results (Laptop Reviews)
- Top words serve as “labels” for common themes (e.g., [sound, speakers], [battery, hours], [cd, drive]).
- These word distributions can be used to segment text and add hyperlinks between documents.
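Labeling a theme by its top words amounts to sorting its word distribution by probability. A minimal sketch follows; the helper name `top_words` is hypothetical, and the sample probabilities are the illustrative battery-theme values from the earlier setup slide, not experimental output.

```python
def top_words(dist, n=2):
    """Return the n most probable words of a word distribution.

    Ties are broken alphabetically so the result is deterministic.
    """
    ranked = sorted(dist.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:n]]


# Illustrative common-theme distribution (truncated to three words).
battery_theme = {"hours": 0.080, "battery": 0.129, "life": 0.060}
label = top_words(battery_theme, n=2)
```

Here `label` is `["battery", "hours"]`, matching the bracketed label style used on the slide.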
12 Summary and Future Work
- We defined a new text mining problem, comparative text mining (CTM), which has many applications.
- We proposed and evaluated a cross-collection mixture model for CTM.
- Experimental results show that the proposed cross-collection model is more effective for CTM than a simple mixture model.
- Future work:
  - Further improve the mixture model and estimation method (e.g., consider proximity, MAP estimation).
  - Use the model to segment documents and create hyperlinks between segments (e.g., feed the learned word distributions into HMMs for segmentation).