
1 Finding Structure in Noisy Text: Topic Classification and Unsupervised Clustering
Rohit Prasad, Prem Natarajan, Krishna Subramanian, Shirin Saleem, and Rich Schwartz
{rprasad,pnataraj}@bbn.com
Presented by Daniel Lopresti, 8th January 2007

2 Outline
Research objectives and challenges
Overview of supervised classification using HMMs
Supervised topic classification of newsgroup messages
Unsupervised topic discovery and clustering
Rejection of off-topic messages

3 Objectives
Develop a system that performs topic-based categorization of newsgroup messages in two modes
Mode 1 – Supervised classification: topics of interest to the user are known a priori
–Spot messages that are on topics of interest to a user
–Requires rejecting off-topic messages to ensure low false-alarm rates
Mode 2 – Unsupervised classification: topics of interest to the user are not known
–Discover topics in a large corpus without human supervision
–Automatically organize/cluster the messages to support efficient navigation

4 Challenges Posed by Newsgroup Messages
Text in newsgroup messages tends to be noisy
–Abbreviations and misspellings
–Colloquial (non-grammatical) language
–Discursive structure with frequent switching between topics
–Lack of context in some messages makes them impossible to understand without access to the complete thread
Supervised classification requires annotating newsgroup messages with a set of topic labels
–Every non-trivial message contains multiple topics
–No completely annotated corpus of newsgroup messages exists (by complete annotation we mean tagging each message with ALL relevant topics)

5 Outline
Research objectives and challenges
Overview of supervised classification using HMMs
Supervised topic classification of newsgroup messages
Unsupervised topic discovery and clustering
Rejection of off-topic messages

6 Supervised Topic Classification
[Pipeline diagram] Audio or images (e.g., CNN, NBC, CBS, NPR broadcasts) are converted to text by ASR or OCR and passed to the topic classifier, whose topic models are trained on a topic-labeled broadcast news corpus (e.g., Primary Source Media: 4-5 topics per story, 40,000 stories per year, 5,000 topics). Example story: "President Clinton dumped his embattled Mexican bailout today. Instead, he announced another plan that doesn't need congressional approval." is labeled with topics such as Clinton, Bill; Mexico; Money; Economic assistance, American.
Applications: news sorting, information retrieval, detection of key events, improved speech recognition

7 OnTopic™ HMM Topic Model
A probabilistic hidden Markov model (HMM) that attempts to capture the generation of a story
Assumes a story can be on multiple topics, with different words related to different topics
Uses an explicit state for General Language because most words in a story are not related to any topic
Scalable to a large number of topics; training requires only the topic labels for each story
Language-independent methodology
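As a rough illustration of the model above (not BBN's implementation), the sketch below scores a story under a mixture of a General Language state and one topic state; every probability is invented for the example:

```python
import math
from collections import defaultdict

# Toy sketch of an OnTopic-style mixture: each word is emitted either by the
# General Language state or by a topic state. All probabilities are invented.
def story_log_likelihood(words, state_priors, state_word_probs):
    """log P(words) under the mixture of topic states."""
    total = 0.0
    for w in words:
        # P(w) = sum over states of P(state) * P(w | state)
        total += math.log(sum(state_priors[s] * state_word_probs[s][w]
                              for s in state_priors))
    return total

state_word_probs = {
    "GeneralLanguage": defaultdict(lambda: 0.01),   # flat fallback distribution
    "Economics": defaultdict(lambda: 1e-6,
                             {"bailout": 0.05, "money": 0.03}),
}
priors = {"GeneralLanguage": 0.8, "Economics": 0.2}  # most words are general
score = story_log_likelihood(["bailout", "money"], priors, state_word_probs)
```

Comparing such scores across candidate topic sets is what lets the classifier rank topics for a story.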

8 Outline
Research objectives and challenges
Overview of supervised classification using HMMs
Supervised topic classification of newsgroup messages
Unsupervised topic discovery and clustering
Rejection of off-topic messages

9 Experiment Setup
Performed experiments with two newsgroup corpora
–Automated Front End (AFE) newsgroup corpus collected by Washington Univ.
–20 Newsgroups (NG) corpus from http://people.csail.mit.edu/jrennie/20Newsgroups/
Assumed the name of the newsgroup is the ONLY topic associated with each message
Although cost-effective, this assumption leads to inaccuracies in estimating system performance
–Messages typically contain multiple topics, some of which may be related to the dominant theme of another newsgroup

10 AFE Newsgroups Corpus
Google newsgroups data collected by Washington University from 12 diverse newsgroups
Messages posted to 11 newsgroups are considered in-topic; all messages posted to the talk.origins newsgroup are considered off-topic
Message headers were stripped to exclude the newsgroup name from training and test messages
Split the corpus into training, test, and validation sets according to the distribution specified in the config.xml file provided by Washington University
–But since the filenames were truncated, we could not select the same messages as Washington University

11 AFE Newsgroups Corpus

Newsgroup | #Training | #Test | #Validation
Alt.sports.baseball.stl_cardinals | 21 | 33 | 10
Comp.ai.neural_nets | 15 | 25 | 7
Comp.programming.threads | 31 | 47 | 15
Humanities.musics.composers.wagner | 19 | 31 | 9
Misc.consumers.frugal_living | 10 | 17 | 5
Misc.writing.moderated | 24 | 37 | 12
Rec.Equestrian | 27 | 41 | 13
Rec.martial_arts.moderated | 18 | 29 | 9
Sci.archaelogy.moderated | 46 | 69 | 23
Sci.logic | 20 | 30 | 10
Soc.libraries.talk | 10 | 17 | 5
Talk.origins (chaff) | 245 | 10401 | 122
Total messages (w/o chaff) | 241 | 376 | 118
Total messages (w/ chaff) | 486 | 10777 | 240
Total words (w/o chaff) | 103K | 118K | 32K
Total words (w/ chaff) | 187K | 3.4M | 63K

12 Closed-set Classification Accuracy on AFE
Trained OnTopic models on 11 newsgroups
–Excluded messages from the talk.origins newsgroup because they are off-topic w.r.t. the topics of interest
–Used stemming since some newsgroups had only a few training messages
Classified 376 in-topic messages
Achieved overall top-choice accuracy of 91.2%
–Top-choice accuracy: percentage of times the top-choice (best) topic returned by OnTopic was the correct answer
Top-choice accuracy was worse on newsgroups with fewer training examples

13 Closed-set Classification Accuracy (Contd.)

Newsgroup | #Training Messages | %Top-Choice Accuracy
Misc.consumers.frugal_living | 10 | 47.1%
Soc.libraries.talk | 10 | 58.8%
Comp.ai.neural_nets | 15 | 80.0%
Rec.martial_arts.moderated | 18 | 86.2%
Humanities.musics.composers.wagner | 19 | 100.0%
Sci.logic | 20 | 96.7%
Alt.sports.baseball.stl_cardinals | 21 | 100.0%
Misc.writing.moderated | 24 | 91.9%
Rec.Equestrian | 27 | 97.6%
Comp.programming.threads | 31 | 100.0%
Sci.archaelogy.moderated | 46 | 95.7%
Overall | 241 | 91.2%

14 20 Newsgroups Corpus
Downloaded the 20 Newsgroups corpus (20 NG) from http://people.csail.mit.edu/jrennie/20Newsgroups/
Corpus characteristics:
–Messages from 20 newsgroups, with an average of 941 messages per newsgroup
–Average of 350 threads in each newsgroup
–Average message length of 300 words (170 words after headers and replied-to text are excluded)
–Some newsgroups are similar – the 20 newsgroups span 6 broad subjects
Data pre-processing
–Stripped message headers, e-mail IDs, and signatures to exclude newsgroup-related information
Corpus was split into training, development, and validation sets for topic classification experiments
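A minimal sketch of the pre-processing step above; the specific rules (headers end at the first blank line, "> " marks quoted replied-to text, "-- " delimits signatures) are common Usenet conventions assumed here, not the authors' exact recipe:

```python
import re

# Hypothetical cleanup: drop headers, quoted (replied-to) lines,
# e-mail addresses, and a trailing signature block.
def clean_message(raw):
    # Headers conventionally end at the first blank line
    body = raw.split("\n\n", 1)[1] if "\n\n" in raw else raw
    # Drop quoted replied-to lines (prefixed with ">")
    body = "\n".join(l for l in body.splitlines()
                     if not l.lstrip().startswith(">"))
    body = body.split("\n-- \n")[0]        # conventional signature delimiter
    body = re.sub(r"\S+@\S+", "", body)    # strip e-mail IDs
    return body.strip()

msg = ("From: a@b.edu\nNewsgroups: sci.space\n\n"
       "> old text\nOrbit decay is slow.\n-- \nsig")
print(clean_message(msg))
```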

15 Distribution of Messages Across Newsgroups

Newsgroup | Total Messages | Unique Threads | Messages Per Thread
alt.atheism | 799 | 87 | 9.2
comp.graphics | 973 | 532 | 1.8
comp.os.ms-windows.misc | 985 | 479 | 2.1
comp.sys.ibm.pc.hardware | 982 | 536 | 1.8
comp.sys.mac.hardware | 961 | 467 | 2.1
comp.windows.x | 980 | 773 | 1.3
misc.forsale | 972 | 877 | 1.1
rec.autos | 990 | 260 | 3.8
rec.motorcycles | 994 | 177 | 5.6
rec.sport.baseball | 994 | 272 | 3.7
rec.sport.hockey | 999 | 346 | 2.9
sci.crypt | 991 | 216 | 4.6
sci.electronics | 981 | 395 | 2.5
sci.med | 990 | 314 | 3.2
sci.space | 987 | 296 | 3.3
soc.religion.christian | 997 | 295 | 3.4
talk.politics.guns | 910 | 145 | 6.3
talk.politics.mideast | 940 | 307 | 3.1
talk.politics.misc | 775 | 133 | 5.8
talk.religion.misc | 628 | 103 | 6.1
Average | 941 | 350 | 3.7

16 Organization of Newsgroups by Subject Matter
Computers: comp.graphics, comp.os.ms-windows.misc, comp.sys.ibm.pc.hardware, comp.sys.mac.hardware, comp.windows.x
Recreation: rec.autos, rec.motorcycles, rec.sport.baseball, rec.sport.hockey
Science: sci.crypt, sci.electronics, sci.med, sci.space
For sale: misc.forsale
Politics: talk.politics.misc, talk.politics.guns, talk.politics.mideast
Religion: talk.religion.misc, alt.atheism, soc.religion.christian

17 Splits for Training and Testing
80:20 split between training and test/validation sets for three different partitioning schemes
Thread partitioning: the entire thread is assigned to one of the training, development, or validation sets
Chronological partitioning: messages in each thread are split between training, test, and validation; the first 80% go to training, the rest to test and validation
Random partitioning: 80:20 split between training and test/validation, without regard to thread or chronology
–Prior work by other researchers with 20 NG used random partitioning
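The three partitioning schemes can be sketched as follows, assuming each message is a (thread_id, timestamp, text) tuple (a layout invented for illustration):

```python
import random

def thread_partition(messages):
    # Whole threads go to one side of the split
    threads = sorted({m[0] for m in messages})
    train_threads = set(threads[:int(0.8 * len(threads))])
    train = [m for m in messages if m[0] in train_threads]
    test = [m for m in messages if m[0] not in train_threads]
    return train, test

def chronological_partition(messages):
    # First 80% of each thread (by timestamp) goes to training
    by_thread = {}
    for m in messages:
        by_thread.setdefault(m[0], []).append(m)
    train, test = [], []
    for msgs in by_thread.values():
        msgs.sort(key=lambda m: m[1])
        cut = int(0.8 * len(msgs))
        train += msgs[:cut]
        test += msgs[cut:]
    return train, test

def random_partition(messages, seed=0):
    # Ignore thread structure and chronology entirely
    msgs = messages[:]
    random.Random(seed).shuffle(msgs)
    cut = int(0.8 * len(msgs))
    return msgs[:cut], msgs[cut:]
```

Thread partitioning is the hardest condition because no message from a test thread is ever seen in training.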

18 Closed-set Classification Results

Test Message Type | %Top-Choice Accuracy (Thread) | (Chronological) | (Random)
w/o replied-to text | 74.5 | 77.8 | 79.7
w/ replied-to text | 76.0 | 79.6 | 83.2

Trained an OnTopic model set consisting of 20 topics
Classified 2K test messages
–Two test conditions: one where replied-to text (from previous messages) is included, and one where it is stripped from the test message
Classification accuracy is low due to the following
–Significant subject overlap between newsgroups
–Lack of useful a priori probabilities due to the almost uniform distribution of topics, unlike the AFE newsgroup data

19 Detailed Results for Thread Partitioned

Newsgroup | %Top-choice Accuracy | Top Confusion
talk.religion.misc | 29.3 | talk.politics.guns
misc.forsale | 51.0 | comp.os.ms-windows.misc
talk.politics.misc | 57.5 | talk.politics.guns
sci.electronics | 58.3 | rec.autos
comp.os.ms-windows.misc | 62.0 | comp.sys.mac.hardware
alt.atheism | 63.4 | soc.religion.christian
comp.graphics | 68.6 | comp.windows.x
comp.sys.ibm.pc.hardware | 72.6 | comp.os.ms-windows.misc
comp.sys.mac.hardware | 74.5 | comp.sys.ibm.pc.hardware
comp.windows.x | 77.1 | comp.sys.ibm.pc.hardware
rec.motorcycles | 81.9 | rec.autos
talk.politics.guns | 82.9 | sci.crypt
talk.politics.mideast | 84.6 | rec.motorcycles
soc.religion.christian | 87.4 | sci.med
sci.crypt | 89.0 | talk.politics.guns
rec.sport.baseball | 90.7 | rec.sport.hockey
rec.autos | 93.4 | misc.forsale
sci.med | 93.6 | misc.forsale
rec.sport.hockey | 94.2 | rec.sport.baseball
sci.space | 94.6 | rec.autos
Overall | 76.0 |

20 Detailed Results for Chronological

Newsgroup | %Top-choice Accuracy | Top Confusion
talk.religion.misc | 35.0 | alt.atheism
misc.forsale | 53.8 | comp.sys.ibm.pc.hardware
comp.os.ms-windows.misc | 62.9 | comp.windows.x
comp.graphics | 63.9 | comp.os.ms-windows.misc
sci.electronics | 64.3 | rec.autos
talk.politics.misc | 71.4 | talk.politics.guns
alt.atheism | 72.2 | soc.religion.christian
comp.sys.ibm.pc.hardware | 73.2 | comp.os.ms-windows.misc
comp.sys.mac.hardware | 75.8 | comp.sys.ibm.pc.hardware
comp.windows.x | 81.4 | comp.os.ms-windows.misc
rec.motorcycles | 86.7 | rec.autos
sci.med | 86.9 | sci.space
rec.autos | 88.7 | comp.os.ms-windows.misc
talk.politics.guns | 90.1 | talk.politics.misc
talk.politics.mideast | 90.2 | alt.atheism
sci.space | 91.9 | comp.graphics
rec.sport.baseball | 92.9 | rec.sport.hockey
soc.religion.christian | 94.0 | alt.atheism
sci.crypt | 96.0 | sci.electronics
rec.sport.hockey | 98.0 | sci.med
Overall | 79.6 |

21 Manual Clustering and Human Review
Manually clustered newsgroups into 12 topics after reviewing the content of training messages
Recomputed top-choice classification accuracy using the cluster information

Clustering | %Top-Choice Accuracy (Thread) | (Chronological) | (Random)
w/o clustering | 76.0 | 79.6 | 83.2
w/ clustering | 81.5 | 84.8 | 88.2

Effect of multiple topics in a message and an incomplete reference topic label set
–Manually reviewed messages from the 4 categories with the lowest performance for the chronological split
–Accuracy increases to 88.0% (from 84.8%) after manual rescoring

22 Cluster Table

Topic Cluster | Newsgroup(s)
Autos | rec.autos, rec.motorcycles
Graphics | comp.graphics
Macintosh | comp.sys.mac.hardware
Misc.forsale | misc.forsale
Politics | talk.politics.guns, talk.politics.mideast, talk.politics.misc
Windows | comp.os.ms-windows.misc, comp.sys.ibm.pc.hardware, comp.windows.x
Religion | soc.religion.christian, talk.religion.misc, alt.atheism
Sports | rec.sport.baseball, rec.sport.hockey
Sci.crypt | sci.crypt
Sci.electronics | sci.electronics
Sci.med | sci.med
Sci.space | sci.space

23 Outline
Research objectives and challenges
Overview of supervised classification using HMMs
Supervised topic classification of newsgroup messages
Unsupervised topic discovery and clustering
Rejection of off-topic messages

24 The Problem
Why unsupervised topic discovery and clustering?
–Topics of interest may not be known a priori
–It may not be feasible to annotate documents with a large number of topics
Goals
–Discover topics and meaningful topic names
–Cluster topics instead of messages to automatically organize messages/documents for navigation at multiple levels

25 Unsupervised Topic Discovery [3]
Pipeline over the input documents:
1. Add phrases: frequent phrases, using an MDL criterion; names, using IdentiFinder™ (produces augmented documents)
2. Initial topics for each document: select words/phrases with the highest tf-idf; keep topic names that occur in >3 documents
3. Topic training (key step): associate many words/phrases with topics, using EM training in the OnTopic™ system (produces topic models and topic names)
4. Topic classification: assign topics to all documents, yielding a topic-annotated corpus

3. S. Sista et al. An Algorithm for Unsupervised Topic Discovery from Broadcast News Stories. In Proceedings of ACM HLT, San Diego, CA, 2002.
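The tf-idf selection step of the UTD pipeline might look like this minimal sketch; the toy corpus, the cutoff k, and the plain log idf weighting are illustrative assumptions:

```python
import math
from collections import Counter

# For each document, keep the k terms (words or pre-extracted phrases)
# with the highest tf-idf weight as candidate initial topic labels.
def top_tfidf_terms(docs, k=2):
    df = Counter()
    for doc in docs:
        df.update(set(doc))            # document frequency of each term
    n = len(docs)
    labels = []
    for doc in docs:
        tf = Counter(doc)
        scored = {t: tf[t] * math.log(n / df[t]) for t in tf}
        labels.append(sorted(scored, key=scored.get, reverse=True)[:k])
    return labels

docs = [["mexico", "bailout", "plan", "plan"],
        ["bailout", "congress", "plan"],
        ["hockey", "playoffs", "plan"]]
labels = top_tfidf_terms(docs)
```

Note how "plan", which occurs in every document, gets idf 0 and is never selected, while rarer, more topical terms rise to the top.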

26 UTD Output (English document). News source: Associated Press, November 2001

27 UTD Output (Arabic document). News source: Al-Hayat (Aug-Nov 2001)

28 Unsupervised Topic Clustering
Organize automatically discovered topics (rather than documents) into a hierarchical topic tree
Leaves of the topic tree are the fine-grained topics discovered by the UTD process
Intermediate nodes are logical collections of topics
Each node in the topic tree has a set of messages associated with it
–A message can be assigned to multiple topic clusters by virtue of the multiple topic labels assigned to it by the UTD process
–Overcomes the single-cluster assignment of a document prevalent in most document clustering approaches
The resulting topic tree enables browsing of a large corpus at multiple levels of granularity
–One can find a message through different sequences of logical actions

29 Topic Clustering Algorithm
Agglomerative clustering for organizing topics in a hierarchical tree structure
Topic clustering algorithm:
Step 1: Assign each topic to its own individual cluster
Step 2: For every pair of clusters, compute the distance between the two clusters
Step 3: If the distance of the closest pair is below a threshold, merge that pair into a single cluster and go to Step 2; else stop clustering
Modification: merge more than two clusters at each iteration to limit the number of levels in the tree
–Also add other constraints, e.g., limiting the branching factor and the number of levels
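The three steps above can be sketched as follows; the linkage rule (single linkage, i.e., minimum pairwise topic distance) is an assumption, since the slide does not specify how inter-cluster distance is derived from topic distances:

```python
# Agglomerative clustering: start with one cluster per topic and repeatedly
# merge the closest pair until no pair is closer than `threshold`.
def agglomerate(topics, distance, threshold):
    clusters = [[t] for t in topics]              # Step 1: singleton clusters
    while len(clusters) > 1:
        # Step 2: distance between every pair of clusters (single linkage)
        pairs = [(min(distance(a, b) for a in ci for b in cj), i, j)
                 for i, ci in enumerate(clusters)
                 for j, cj in enumerate(clusters) if i < j]
        d, i, j = min(pairs)
        if d >= threshold:                        # Step 3: stop when too far
            break
        clusters[i] = clusters[i] + clusters[j]   # ...else merge closest pair
        del clusters[j]
    return clusters

# Toy usage: four "topics" on a number line, distance = absolute difference
clusters = agglomerate([0, 1, 10, 11], lambda a, b: abs(a - b), threshold=5)
```

With threshold 5, the toy run merges 0 with 1 and 10 with 11, then stops because the two remaining clusters are 9 apart.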

30 Distance Metrics for Topic Clustering
Metrics computed from topic co-occurrences:
–Co-occurrence probability
–Mutual information
Metrics computed from support/key word distributions:
–Support-word overlap between T_i and T_j
–Kullback-Leibler (KL) and J-divergence between the two probability mass functions
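Hedged sketches of the listed metrics (the estimation and smoothing details are assumptions; the slide names the quantities but not their exact formulas):

```python
import math

def cooccurrence_prob(docs_with_i, docs_with_j, n_docs):
    # Fraction of documents labeled with both topics
    return len(docs_with_i & docs_with_j) / n_docs

def mutual_information(p_ij, p_i, p_j):
    # Pointwise MI of the two topics; positive when they co-occur
    # more often than chance would predict
    return math.log(p_ij / (p_i * p_j)) if p_ij > 0 else float("-inf")

def kl_divergence(p, q, eps=1e-12):
    # KL(p || q) between two support-word distributions (dicts word -> prob),
    # with a tiny floor to avoid division by zero
    words = set(p) | set(q)
    return sum(p.get(w, eps) * math.log(p.get(w, eps) / q.get(w, eps))
               for w in words)

def j_divergence(p, q):
    # Symmetric variant: KL(p||q) + KL(q||p), usable as a distance
    return kl_divergence(p, q) + kl_divergence(q, p)
```

KL itself is asymmetric, which is presumably why the symmetric J-divergence is listed alongside it for clustering.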

31 Clustering Example

32 Evaluation of UTC
Initial topic clustering experiments performed on the 20 NG corpus
–3,343 topics discovered from 19K messages
–Allowed a maximum of 4 topics to be clustered at each iteration
Evaluation of UTC has been mostly subjective, with a few objective metrics used to evaluate the clustering
Clustering rate (the rate of increase of clusters with more than one topic) seems to be well correlated with subjective judgments
The combination of J-divergence and topic co-occurrence seems to produce the most uniform, logical clusters

33 Key Statistics of the UTC Topic Tree for 20 NG Corpus
Measured some key features of the topic tree that could have a significant impact on user experience

Key Feature | Average | Maximum
Number of levels | - | 6
Branching factor | 2.4 | 4
No. of topics in a cluster | 2.7 | 22

34 Screenshot of the UTC-based Message Browser

35 Outline
Research objectives and challenges
Overview of supervised classification using HMMs
Supervised topic classification of newsgroup messages
Unsupervised topic discovery and clustering
Rejection of off-topic messages

36 Off-topic Message Rejection
A significant fraction of the messages processed by the topic classification system are likely to be off-topic
Rejection problem: design a binary classifier for accepting or rejecting the top-choice topic
–Accepting a message means asserting that the message contains the top-choice topic
–Rejecting a message means asserting that the message does not contain the top-choice topic

37 Rejection Algorithm
Use the General Language (GL) topic model as the model for off-topic messages
Compute the ratio of the log-posteriors of the top-choice topic T_j and the GL topic as a relevance score
Accept the top-choice topic T_j if its relevance score exceeds a threshold
The threshold can be topic-independent or topic-specific
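One plausible reading of the acceptance rule (the slide's exact formula is not in the transcript, so the functional form below, the difference of log-posteriors, i.e., the log of the posterior ratio, is an assumption):

```python
# Relevance score of a message: how much more probable the top-choice topic
# T_j is than the General Language (off-topic) model, in log space.
def relevance_score(log_post_topic, log_post_gl):
    return log_post_topic - log_post_gl   # log of the posterior ratio

def accept(log_post_topic, log_post_gl, threshold):
    # Accept the top-choice topic when the score clears the threshold,
    # which may be global or topic-specific (see the next two slides)
    return relevance_score(log_post_topic, log_post_gl) >= threshold
```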

38 Parametric Topic-Specific Threshold Estimation
Compute the empirical distribution (mean and standard deviation) of the log-likelihood-ratio score over a large corpus of off-topic documents
–Can assume most messages in the corpus are off-topic
–More reliable statistics than if computed over on-topic messages
Normalize the score for a test message before comparing it to a topic-independent threshold
Can be thought of as a transformation of the topic-independent threshold rather than score normalization
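A sketch of the parametric scheme, assuming the normalization is a z-score against the per-topic off-topic mean and standard deviation (consistent with the "several standard deviations away from the off-topic mean" reasoning on the following slide):

```python
import statistics

# Fit per-topic off-topic score statistics: topic -> (mean, std deviation)
def fit_offtopic_stats(scores_by_topic):
    return {t: (statistics.mean(s), statistics.stdev(s))
            for t, s in scores_by_topic.items()}

def normalized_score(score, topic, stats):
    # How many standard deviations the score sits above the off-topic mean
    mu, sigma = stats[topic]
    return (score - mu) / sigma

def accept(score, topic, stats, global_threshold):
    # A single global threshold now applies across all topics
    return normalized_score(score, topic, stats) >= global_threshold
```

Equivalently, one can view this as leaving scores untouched and transforming the global threshold into a topic-specific one, mu + global_threshold * sigma, which matches the last bullet above.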

39 Parametric Topic-Specific Threshold Estimation (Contd.)
Perform a null-hypothesis test using the score distribution of the off-topic messages
–A message that is not off-topic is on-topic
–A message several standard deviations away from the off-topic mean is very likely to be on-topic
[Example histogram of normalized test scores, showing the off-topic and on-topic score distributions; the y-axis is scaled to magnify the view for on-topic messages]

40 Non-Parametric Threshold Estimation
Accept the top-choice topic T_j if its relevance score exceeds a topic-specific threshold θ(T_j)
Select θ(T_j) by constrained optimization
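Since the slide's equations are missing from the transcript, here is one assumed reading of the constrained optimization: choose θ(T_j) to minimize false rejections of on-topic messages subject to a cap on the false-acceptance rate over off-topic messages:

```python
# Non-parametric threshold search over observed scores (an assumed objective):
# minimize false rejections subject to a false-acceptance budget.
def pick_threshold(on_topic_scores, off_topic_scores, max_false_accept=0.01):
    candidates = sorted(set(on_topic_scores) | set(off_topic_scores))
    best = None
    for theta in candidates:
        fa = sum(s >= theta for s in off_topic_scores) / len(off_topic_scores)
        fr = sum(s < theta for s in on_topic_scores) / len(on_topic_scores)
        if fa <= max_false_accept and (best is None or fr < best[0]):
            best = (fr, theta)
    # If no candidate satisfies the constraint, fall back to the strictest one
    return best[1] if best else max(candidates)
```

Unlike the parametric scheme, this makes no distributional assumption about the scores; it only needs held-out on-topic and off-topic examples per topic.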

41 Experiment Configuration

Message Type | #Train + Dev. | #Validation
In-topic | 5.6K | 2.8K
Off-topic/chaff | 9.6K | 76K

In-topic messages from 14 newsgroups of the 20 NG corpus
–Messages from six newsgroups were discarded due to significant subject overlap with off-topic messages
Off-topic/chaff messages are from two sources:
–the talk.origins newsgroup from the AFE corpus
–a large collection of messages from 4 Yahoo! groups
Used jack-knifing to estimate rejection thresholds on the Train+Dev set and then applied them to the validation set

42 Comparison of Threshold Estimation Techniques

43 Comparison of Threshold Estimation Techniques (Contd.)

44 Comparison of Threshold Estimation Techniques (Contd.)

Rejection Method | %False Rejections @ 1% False Acceptances
Topic-independent thresholds | 31.4
Topic-specific thresholds (parametric) | 27.4
Topic-specific thresholds (non-parametric) | 23.7

45 Conclusions
HMM-based topic classification delivers performance on the 20 NG and AFE corpora comparable to [1], [2]
Closed-set classification accuracy on 20 NG data after clustering is slightly worse than on AFE data
–A key reason is significant subject overlap between the newsgroups
Clustered categories still exhibited significant subject overlap across clusters
–The data set creators assign only six different subjects (topics) to the 20 NG set

1. J. D. M. Rennie, L. Shih, J. Teevan, and D. R. Karger. Tackling the Poor Assumptions of Naive Bayes Text Classifiers. In Proceedings of ICML 2003, Washington, D.C., 2003.
2. S. Eick, J. Lockwood, R. Loui, J. Moscola, C. Kastner, A. Levine, and D. Weishar. Transformation Algorithms for Data Streams. In Proceedings of IEEE AAC, March 2005.

46 Conclusions (Contd.)
Novel estimation of topic-specific thresholds outperforms a topic-independent threshold for rejection of off-topic messages
Introduced a novel concept of unsupervised topic clustering for organizing messages
–Built a demonstration prototype for topic-tree-based browsing of a large corpus of archived messages
Future work will focus on measuring the utility of UTC for user experience and on objective metrics for evaluating UTC performance

