Preserving Semantic Content in Text Mining Using Multigrams Yasmin H. Said Department of Computational and Data Sciences George Mason University QMDNS 2010 - May 26, 2010 This is joint work with Edward J. Wegman
Outline
– Background on Text Mining
– Bigrams
   – Term-Document and Bigram-Document Matrices
   – Term-Term and Document-Document Associations
– Example using 15,863 Documents

"To read between the lines is easier than to follow the text." – Henry James
Text Data Mining
A synthesis of:
– Information Retrieval
   – Focuses on retrieving documents from a fixed database
   – May be multimedia, including text, images, video, and audio
– Natural Language Processing
   – Usually more challenging questions
   – Bag-of-words methods
   – Vector space models
– Statistical Data Mining
   – Pattern recognition, classification, clustering
Natural Language Processing
Key elements:
– Morphology (grammar of word forms)
– Syntax (grammar of word combinations that form sentences)
– Semantics (meaning of a word or sentence)
– Lexicon (vocabulary, or set of words)

An example of ambiguity: "Time flies like an arrow"
– Time passes speedily, as an arrow passes speedily, or
– Measure the speed of a fly as you would measure the speed of an arrow
This illustrates both ambiguity of nouns and verbs and ambiguity of meaning.
Text Mining Tasks
– Text Classification: assigning a document to one of several pre-specified classes
– Text Clustering: unsupervised learning
– Text Summarization: extracting a summary for a document, based on syntax and semantics
– Author Identification/Determination: based on stylistics, syntax, and semantics
– Automatic Translation: based on morphology, syntax, semantics, and lexicon
– Cross-Corpus Discovery: also known as Literature-Based Discovery
Preprocessing
Denoising
– Means removing stopper words: words with little semantic meaning such as the, an, and, of, by, that, and so on
– Stopper words may be context dependent, e.g. theorem and proof in a mathematics document
Stemming
– Means removing suffixes, prefixes, and infixes to reduce a word to its root
– An example: wake, waking, awake, woke → wake
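The denoise-then-stem pipeline above can be sketched in a few lines. This is only an illustrative toy: the stopper-word set and suffix rules here are hypothetical stand-ins, not the 313-word list or the stemmer used in the study.

```python
# Toy sketch of denoising and stemming; the stopper-word list and
# suffix rules are illustrative assumptions, not the study's own.

STOPPER_WORDS = {"the", "an", "a", "and", "of", "by", "that"}
SUFFIXES = ("ing", "ed", "es", "s")  # checked in this order

def denoise(tokens):
    """Drop stopper words (words with little semantic meaning)."""
    return [t for t in tokens if t.lower() not in STOPPER_WORDS]

def stem(token):
    """Crude suffix stripping; a real system would use e.g. a Porter-style stemmer."""
    for suf in SUFFIXES:
        if token.endswith(suf) and len(token) > len(suf) + 2:
            return token[: -len(suf)]
    return token

def preprocess(text):
    tokens = text.lower().split()
    return [stem(t) for t in denoise(tokens)]

print(preprocess("waking by the river"))  # -> ['wak', 'river']
```

Note that naive suffix stripping over-stems ("waking" becomes "wak", not "wake"); production stemmers add recoding rules for exactly this reason.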
Bigrams and Trigrams
A bigram is a word pair where the order of the words is preserved.
– The first word is the reference word.
– The second is the neighbor word.
A trigram is a word triple where order is preserved.
Bigrams and trigrams are useful because they can capture semantic content.
Example
Original: Hell hath no fury like a woman scorned.
Denoised: Hell hath no fury like woman scorned.
Stemmed: Hell has no fury like woman scorn.
Bigrams:
– Hell has, has no, no fury, fury like, like woman, woman scorn, scorn .
Note that the "." (any sentence-ending punctuation) is treated as a word.
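Extracting the ordered (reference, neighbor) pairs from the stemmed sentence is a one-liner; a minimal sketch, with sentence-ending punctuation kept as its own "word" as the slide notes:

```python
# Sketch of bigram extraction: ordered (reference, neighbor) pairs.
# The "." token is retained as a word, per the slide's convention.

def bigrams(tokens):
    """Return the ordered word pairs of a token sequence."""
    return list(zip(tokens, tokens[1:]))

stemmed = ["hell", "has", "no", "fury", "like", "woman", "scorn", "."]
print(bigrams(stemmed))
# -> [('hell', 'has'), ('has', 'no'), ('no', 'fury'), ('fury', 'like'),
#     ('like', 'woman'), ('woman', 'scorn'), ('scorn', '.')]
```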
Bigram Proximity Matrix
Rows are reference words, columns are neighbor words; a 1 marks each bigram of the stemmed sentence:

          .   fury  has  hell  like  no  scorn  woman
 .        0    0    0    0     0     0    0      0
 fury     0    0    0    0     1     0    0      0
 has      0    0    0    0     0     1    0      0
 hell     0    0    1    0     0     0    0      0
 like     0    0    0    0     0     0    0      1
 no       0    1    0    0     0     0    0      0
 scorn    1    0    0    0     0     0    0      0
 woman    0    0    0    0     0     0    1      0
Bigram Proximity Matrix
The bigram proximity matrix (BPM) is computed for an entire document.
– Entries in the matrix may be either binary or a frequency count.
The BPM is a mathematical representation of a document with some claim to capturing semantics
– Because bigrams capture noun-verb, adjective-noun, verb-adverb, and verb-subject structures
– Martinez (2002)
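A minimal sketch of a frequency-count BPM, stored sparsely as a counter keyed by (reference, neighbor) pairs rather than as the full square matrix; this is one of the two entry types (binary or count) mentioned above:

```python
# Frequency-count bigram proximity matrix as a sparse Counter:
# bpm(doc)[(ref, neighbor)] is the number of times that bigram occurs.

from collections import Counter

def bpm(tokens):
    """Sparse BPM: only nonzero (reference, neighbor) cells are stored."""
    return Counter(zip(tokens, tokens[1:]))

doc = ["hell", "has", "no", "fury", "like", "woman", "scorn", "."]
m = bpm(doc)
print(m[("hell", "has")])  # -> 1
print(m[("has", "hell")])  # -> 0: order is preserved, so the BPM is not symmetric
```

A binary BPM is obtained by replacing each count with `min(count, 1)`.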
Vector Space Methods
The classic structure in vector space text mining methods is the term-document matrix, where
– Rows correspond to terms, columns correspond to documents, and
– Entries may be binary or frequency counts
A simple and obvious generalization is the bigram-document matrix, where
– Rows correspond to bigrams, columns to documents, and again entries are either binary or frequency counts
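A minimal sketch of a frequency-count term-document matrix over a tiny two-document corpus (the documents here are made up for illustration); a bigram-document matrix is built the same way with bigrams as the row index:

```python
# Term-document matrix sketch: rows are terms, columns are documents,
# entries are frequency counts. The two toy documents are illustrative.

from collections import Counter

docs = [
    ["hell", "has", "no", "fury"],
    ["fury", "like", "woman", "scorn"],
]

terms = sorted({t for d in docs for t in d})   # row index: the lexicon
counts = [Counter(d) for d in docs]            # one counter per document
tdm = [[c[t] for c in counts] for t in terms]  # one row per term

for term, row in zip(terms, tdm):
    print(f"{term:6s} {row}")
```

Replacing `c[t]` with `min(c[t], 1)` yields the binary variant.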
Example Data
The text data were collected by the Linguistic Data Consortium in 1997 and were originally used in Martinez (2002).
– The data consist of 15,863 news reports collected from Reuters and CNN from July 1, 1994 to June 30, 1995.
– The full lexicon for the text database included 68,354 distinct words.
– In all, 313 stopper words are removed; after denoising and stemming, 45,021 words remain in the lexicon.
– The example reported here is based on the full set of 15,863 documents. This is the same basic data set that Dr. Wegman reported on in his keynote talk, although he considered a subset of 503 documents.
Vector Space Methods
A document corpus we have worked with has 45,021 denoised and stemmed entries in its lexicon and 1,834,123 bigrams.
– Thus the TDM is 45,021 by 15,863 and the BDM is 1,834,123 by 15,863.
– The term vector is 45,021-dimensional and the bigram vector is 1,834,123-dimensional.
– The BPM for each document is 1,834,123 by 1,834,123 and, of course, very sparse.
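A back-of-envelope check of why the per-document BPM must be stored sparsely, using the corpus figures quoted above (the one-byte-per-binary-entry storage cost is an assumption for the estimate):

```python
# Why dense BPM storage is infeasible: the full matrix over all
# 1,834,123 bigrams has ~3.36e12 cells per document. Assuming even
# 1 byte per binary entry, that is thousands of gigabytes.

n_bigrams = 1_834_123
dense_cells = n_bigrams ** 2           # cells in one dense BPM
dense_gb = dense_cells / 1e9           # at an assumed 1 byte per cell

print(f"dense cells: {dense_cells:,}")
print(f"dense size:  {dense_gb:,.0f} GB per document")

# By contrast, a document of W words has at most W - 1 nonzero BPM
# entries, so sparse (dict-of-keys or CSR) storage is essentially free.
```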
Cluster Identities
Cluster 02: Comet Shoemaker-Levy Crashing into Jupiter
Cluster 08: Oklahoma City Bombing
Cluster 11: Bosnian-Serb Conflict
Cluster 12: Court-Law, O.J. Simpson Case
Cluster 15: Cessna Plane Crashed onto South Lawn of White House
Cluster 19: American Army Helicopter Emergency Landing in North Korea
Cluster 24: Death of North Korean Leader (Kim Il Sung) and North Korea's Nuclear Ambitions
Cluster 26: Shootings at Abortion Clinics in Boston
Cluster 28: Two Americans Detained in Iraq
Cluster 30: Earthquake that Hit Japan
Closing Remarks
Text mining presents great challenges, but is amenable to statistical and mathematical approaches.
– Text mining with vector space methods raises both mathematical and visualization challenges, especially in terms of dimensionality, sparsity, and scalability.
Acknowledgments
Dr. Angel Martinez
Dr. Jeff Solka and Avory Bryant
Dr. Walid Sharabati
Funding Sources
– National Institute on Alcohol Abuse and Alcoholism (Grant F32AA015876)
– Army Research Office (Contract W911NF-04-1-0447)
– Army Research Laboratory (Contract W911NF-07-1-0059)
– Isaac Newton Institute
Contact Information
Yasmin H. Said
Department of Computational and Data Sciences
Email: email@example.com@hotmail.com
Phone: 301-538-7478

"The length of this document defends it well against the risk of its being read." – Winston Churchill