Presentation on theme: "Preserving Semantic Content in Text Mining Using Multigrams Yasmin H. Said Department of Computational and Data Sciences George Mason University QMDNS."— Presentation transcript:
Preserving Semantic Content in Text Mining Using Multigrams Yasmin H. Said Department of Computational and Data Sciences George Mason University QMDNS May 26, 2010 This is joint work with Edward J. Wegman
Outline Background on Text Mining Bigrams –Term-Document and Bigram- Document Matrices –Term-Term and Document- Document Associations Example using 15,863 Documents To read between the lines is easier than to follow the text. -Henry James
Text Data Mining Synthesis of … –Information Retrieval Focuses on retrieving documents from a fixed database May be multimedia including text, images, video, audio –Natural Language Processing Usually more challenging questions Bag-of-words methods Vector space models –Statistical Data Mining Pattern recognition, classification, clustering
Natural Language Processing Key elements are: –Morphology (grammar of word forms) –Syntax (grammar of word combinations to form sentences) –Semantics (meaning of word or sentence) –Lexicon (vocabulary or set of words) Time flies like an arrow –Time passes speedily like an arrow passes speedily or –Measure the speed of a fly like you would measure the speed of an arrow Ambiguity of nouns and verbs Ambiguity of meaning
Text Mining Tasks Text Classification –Assigning a document to one of several pre-specified classes Text Clustering –Unsupervised learning Text Summarization –Extracting a summary for a document –Based on syntax and semantics Author Identification/Determination –Based on stylistics, syntax, and semantics Automatic Translation –Based on morphology, syntax, semantics, and lexicon Cross Corpus Discovery –Also known as Literature Based Discovery
Preprocessing Denoising –Means removing stopper words … words with little semantic meaning such as the, an, and, of, by, that and so on. –Stopper words may be context dependent, e.g. Theorem and Proof in a mathematics document Stemming –Means removal suffixes, prefixes and infixes to root –An example: wake, waking, awake, woke wake
Bigrams and Trigrams A bigram is a word pair where the order of words is preserved. –The first word is the reference word. –The second is the neighbor word. A trigram is a word triple where order is preserved. Bigrams and trigrams are useful because they can capture semantic content.
Example Hell hath no fury like a woman scorned. Denoised: Hell hath no fury like woman scorned. Stemmed: Hell has no fury like woman scorn. Bigrams: –Hell has, has no, no fury, fury like, like woman, woman scorn, scorn. –Note that the “.” (any sentence ending punctuation) is treated as a word
Bigram Proximity Matrix.furyhashelllikenoscorn wom a n. fury1 has1 hell1 like1 no1 scorn1 wom a n 1
Bigram Proximity Matrix The bigram proximity matrix (BPM) is computed for an entire document –Entries in the matrix may be either binary or a frequency count The BPM is a mathematical representation of a document with some claim to capturing semantics –Because bigrams capture noun- verb, adjective-noun, verb- adverb, verb-subject structures –Martinez (2002)
Vector Space Methods The classic structure in vector space text mining methods is a term-document matrix where –Rows correspond to terms, columns correspond to documents, and –Entries may be binary or frequency counts A simple and obvious generalization is a bigram- document matrix where –Rows correspond to bigrams, columns to documents, and again entries are either binary or frequency counts
Example Data The text data were collected by the Linguistic Data Consortium in 1997 and were originally used in Martinez (2002) –The data consisted of 15,863 news reports collected from Reuters and CNN from July 1, 1994 to June 30, 1995 –The full lexicon for the text database included 68,354 distinct words In all 313 stopper words are removed after denoising and stemming, there remain 45,021 words in the lexicon –The example that I report here is based on the full set of 15,863 documents. This is the same basic data set that Dr. Wegman reported on in his keynote talk although he considered a subset of 503 documents.
Vector Space Methods A document corpus we have worked with has 45,021 denoised and stemmed entries in its lexicon and 1,834,123 bigrams –Thus the TDM is 45,021 by 15,863 and the BDM is 1,834,123 by 15,863 –The term vector is 45,021 dimensional and the bigram vector is 1,834,123 dimensional –The BPM for each document is 1,834,123 by 1,834,123 and, of course, very sparse.
Term-Document Matrix Analysis Zipf’s Law
Term-Document Matrix Analysis
Text Example - Clusters A portion of the hierarchical agglomerative tree for the clusters
Text Example - Clusters A portion of repeated bisections for clustering the dataset
Text Example - Clusters Cluster 0, Size: 157, ISim: 0.142, ESim: Descriptive: ireland 12.2%, ira 9.1%, northern.ireland 7.6%, irish 5.5%, fein 5.0%, sinn 5.0%, sinn.fein 5.0%, northern 3.2%, british 3.2%, adam 2.4% Discriminating: ireland 7.7%, ira 5.9%, northern.ireland 4.9%, irish 3.5%, fein 3.2%, sinn 3.2%, sinn.fein 3.2%, northern 1.6%, british 1.5%, adam 1.5% Phrases 1: ireland 121, northern 119, british 116, irish 111, ira 110, peac 107, minist 104, govern 104, polit 104, talk 102 Phrases 2: northern.ireland 115, sinn.fein 95, irish.republican 94, republican.armi 91, ceas.fire 87, polit.wing 76, prime.minist 71, peac.process 66, gerri.adam 59, british.govern 50 Phrases 3: irish.republican.armi 91, prime.minist.john 47, minist.john.major 43, ira.ceas.fire 35, ira.polit.wing 34, british.prime.minist 34, sinn.fein.leader 30, rule.northern.ireland 27, british.rule.northern 27, declar.ceas.fire 26
Text Example - Clusters Cluster 1, Size: 323, ISim: 0.128, ESim: Descriptive: korea 19.8%, north 13.2%, korean 11.2%, north.korea 10.8%, kim 5.8%, north.korean 3.7%, nuclear 3.5%, pyongyang 2.0%, south 1.9%, south.korea 1.5% Discriminating: korea 12.7%, north 7.4%, korean 7.2%, north.korea 7.0%, kim 3.8%, north.korean 2.4%, nuclear 1.7%, pyongyang 1.3%, south.korea 1.0%, simpson 0.8% Phrases 1: korea 305, north 303, korean 285, south 243, unit 215, nuclear 204, offici 196, pyongyang 179, presid 167, talk 165 Phrases 2: north.korea 291, north.korean 233, south.korea 204, south.korean 147, kim.sung 108, presid.kim 83, nuclear.program 79, kim.jong 74, light.water 71, presid.clinton 69 Phrases 3: light.water.reactor 56, unit.north.korea 55, north.korea.nuclear 53, chief.warrant.offic 49, presid.kim.sung 46, leader.kim.sung 39, presid.kim.sam 37, north.korean.offici 36, warrant.offic.bobbi 35, bobbi.wayn.hall 29
Text Example - Clusters Cluster 24, Size: 1788, ISim: 0.012, ESim: Descriptive: school 2.2%, film 1.3%, children 1.2%, student 1.0%, percent 0.8%, compani 0.7%, kid 0.7%, peopl 0.7%, movi 0.7%, music 0.6% Discriminating: school 2.3%, simpson 1.8%, film 1.7%, student 1.1%, presid 1.0%, serb 0.9%, children 0.8%, clinton 0.8%, movi 0.8%, music 0.8% Phrases 1: cnn 1034, peopl 920, time 893, report 807, don 680, dai 650, look 630, call 588, live 535, lot 498 Phrases 2: littl.bit 99, lot.peopl 90, lo.angel 85, world.war 71, thank.join 67, million.dollar 60, 000.peopl 54, york.citi 50, garsten.cnn 48, san.francisco 47 Phrases 3: jeann.moo.cnn 41, cnn.entertain.new 36, cnn.jeann.moo 32, norma.quarl.cnn 30, cnn.norma.quarl 28, cnn.jeff.flock 28, jeff.flock.cnn 27, brian.cabel.cnn 26, pope.john.paul 25, lisa.price.cnn 25
Bigrams Cluster 1
Cluster Size Distribution
Document by Cluster Plot
Cluster Identities Cluster 02: Comet Shoemaker Levy Crashing into Jupiter. Cluster 08: Oklahoma City Bombing. Cluster 11: Bosnian-Serb Conflict. Cluster 12: Court-Law, O.J. Simpson Case. Cluster 15: Cessna Plane Crashed onto South Lawn White House. Cluster 19: American Army Helicopter Emergency Landing in North Korea. Cluster 24: Death of North Korean Leader (Kim il Sung) and North Korea’s Nuclear Ambitions. Cluster 26: Shootings at Abortion Clinics in Boston. Cluster 28: Two Americans Detained in Iraq. Cluster 30: Earthquake that Hit Japan.
Bigram-Document Matrix for 50 Documents
Bigram-Bigram Matrix for 50 Documents
Bigram-Bigram Matrix Using the Top 253 Bigrams
Closing Remarks Text mining presents great challenges, but is amenable to statistical/mathematical approaches –Text mining using vector space methods challenges both the mathematical and visualization issues especially in terms of dimensionality, sparsity, and scalability.
Acknowledgments Dr. Angel Martinez Dr. Jeff Solka and Avory Bryant Dr. Walid Sharabati Funding Sources –National Institute on Alcohol Abuse and Alcoholism (Grant Number F32AA015876) –Army Research Office (Contract W911NF ) –Army Research Laboratory (Contract W911NF ) –Isaac Newton Institute
Contact Information Yasmin H. Said Department of Computational and Data Sciences Phone: The length of this document defends it well against the risk of its being read. -Winston Churchill