COMP527: Data Mining
Text Mining: Challenges, Basics
M. Sulaiman Khan, Dept. of Computer Science, University of Liverpool, March 24, 2009

Slide 1: COMP527: Data Mining
M. Sulaiman Khan (mskhan@liv.ac.uk), Dept. of Computer Science, University of Liverpool
Text Mining: Challenges, Basics. March 24, 2009

Slide 2: Course Outline
Introduction to the Course; Introduction to Data Mining; Introduction to Text Mining; General Data Mining Issues; Data Warehousing; Classification: Challenges, Basics; Classification: Rules; Classification: Trees; Classification: Trees 2; Classification: Bayes; Classification: Neural Networks; Classification: SVM; Classification: Evaluation; Classification: Evaluation 2; Regression, Prediction; Input Preprocessing; Attribute Selection; Association Rule Mining; ARM: A Priori and Data Structures; ARM: Improvements; ARM: Advanced Techniques; Clustering: Challenges, Basics; Clustering: Improvements; Clustering: Advanced Algorithms; Hybrid Approaches; Graph Mining, Web Mining; Text Mining: Challenges, Basics; Text Mining: Text-as-Data; Text Mining: Text-as-Language; Revision for Exam

Slide 3: Today's Topics
Text Representation; What is a word?; Dimensionality Reduction; Text Mining vs Data Mining on Text

Slide 4: Representation of Documents
Basic goal: data mining on documents, where each document is an instance. First problem: what are the attributes of a document? Easy attributes: format, length in bytes, and any metadata extractable from headers or file properties (author, date, etc.). Harder: how to usefully represent the text itself? The basic idea is to treat each word as an attribute, either boolean (present / not present) or numeric (number of times the word occurs).
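The word-as-attribute idea can be sketched in a few lines of Python. This is a minimal illustration, not code from the course, and the naive whitespace tokenisation glosses over all of the term-extraction issues discussed on later slides:

```python
from collections import Counter

def bag_of_words(text, binary=False):
    """Represent a document by its words.

    binary=True  -> boolean presence attributes
    binary=False -> word-frequency attributes
    """
    tokens = text.lower().split()  # naive tokenisation for illustration
    counts = Counter(tokens)
    if binary:
        return {word: 1 for word in counts}
    return dict(counts)
```

Only the words that actually occur appear as keys, which already hints at the sparse representations discussed on the next two slides.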

Slide 5: Representation of Documents
Second problem: instances will have a LOT of false/0 attribute values, giving a very sparse matrix that requires a LOT of storage space. 1,000,000 documents with 500,000 distinct words = 500,000,000,000 entries: roughly 60 gigabytes storing each entry as a single bit, or roughly 950 gigabytes as a two-byte short integer. Google's dictionary has 5 million words, times 18 billion web pages... (Process that, WEKA!)
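The storage arithmetic on this slide can be checked directly (sizes computed in binary gigabytes):

```python
docs = 1_000_000
vocab = 500_000
entries = docs * vocab              # cells in a dense document-word matrix

GB = 1024 ** 3
one_bit_gb = entries / 8 / GB       # one bit per cell
short_int_gb = entries * 2 / GB     # two bytes (short integer) per cell

print(entries)       # 500,000,000,000 entries
print(one_bit_gb)    # ~58 GB, i.e. roughly the slide's "~60 gigabytes"
print(short_int_gb)  # ~931 GB, roughly the slide's "~950 gigabytes"
```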

Slide 6: Representation of Documents
Store only the true values: (1, 4, 100, 212, 13948). Or true values with frequency: (1:3, 4:1, 100:1, 212:3, 13948:4). For compressed storage, we can store the differences (gaps) between successive attribute ids:
Attributes in order: 1, 2, 4, 5, 7, 10, 15, 18, ... 2348651, ...
Gaps: 1, 1, 2, 1, 2, 3, 5, 3, ... 6, ...
Interleaved with frequencies: 1,4, 1,3, 2,5, 1,3, 2,6, 3,3, ...
The gaps can be stored in short integers, and regular compression algorithms work efficiently on such a sequence. Reordering attributes by frequency rather than alphabetically keeps the gaps small.
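The gap (difference) encoding described above is easy to sketch, assuming the attribute ids are sorted:

```python
def gap_encode(attribute_ids):
    # Replace each sorted attribute id with its distance from the
    # previous one; small gaps compress well.
    gaps, previous = [], 0
    for attr in attribute_ids:
        gaps.append(attr - previous)
        previous = attr
    return gaps

def gap_decode(gaps):
    # A running sum recovers the original attribute ids.
    ids, total = [], 0
    for gap in gaps:
        total += gap
        ids.append(total)
    return ids
```

Applied to the slide's example, `gap_encode([1, 2, 4, 5, 7, 10, 15, 18])` yields `[1, 1, 2, 1, 2, 3, 5, 3]`, matching the "Gaps" row above.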

Slide 7: Input?
That's nice... but WEKA needs ARFF! Problem with toolkits: many won't accept a sparse input format. How do classification algorithms cope with high-dimensional text?
- Rules: possible, but unlikely to work well
- Trees: less likely than unlikely
- Bayes: fine, especially multinomial naive Bayes
- Bayesian networks: maybe... but too many possible networks
- SVM: fine
- Neural networks: not fine! Far too many nodes
- Perceptron/Winnow: see NN, but more feasible as there is no hidden layer
- kNN: very slow without index data structures, due to the number of comparisons

Slide 8: Input?
Overall problem for text classification: accurate models over high-dimensional data (e.g. SVM, multinomial naive Bayes) are impossible for humans to understand. Association rule mining: fine for the presence of a word, but how to represent word frequency? Classification association rule mining may be a good solution for understandability. Clustering: very high dimensionality is a problem for many algorithms, especially those making many comparisons (e.g. partitioning algorithms).

Slide 9: Document Types
First problem: we need to extract the text from the file, and very different processing is required for different file types, e.g. XML, HTML, RSS, Word, Open Document, PDF, RTF, LaTeX, ... We may want to treat different parts of a document separately (e.g. title vs authors vs abstract vs body text vs references) and to normalise texts into semantic areas across formats: an abstract in a PDF is just several lines of text, in ODF it is an XML element, and in LaTeX it is delimited by one or more markup commands...

Slide 10: Term Extraction
Requirement: extract words from text. But what is a 'word'?
- Obvious 'words' (e.g. consecutive non-space characters)
- Numbers: 1,000 55.6 $10 10^12 64.000 vs 64.0
- Hyphenation: book case vs book-case vs bookcase, but also ranges: 32-45 or "New York-New Jersey"
- URIs: http://www.liv.ac.uk/ and more complicated forms
- Punctuation: Rob's vs 'Robs' vs Robs' vs Robs
- Dates as a single token?
- Non-alphanumeric characters: AT&T, Yahoo! ...

Slide 11: Term Extraction
Requirement: extract 'words' from text (continued).
- The period character is problematic: end of sentence, end of abbreviation, internal to acronyms (but not always present), internal to numbers (with two different meanings), dotted-quad notation (e.g. 138.253.81.72)
- Emoticons: :( :) >:( =>
- Extra processing needed for diacritics? e.g. é ë ç etc.
- Might want to use phrases as attributes ('with respect to', 'data mining', etc.), but determining appropriate phrases is complicated
- Expand abbreviations? Expand acronyms?
- Expand ranges? (1999-2007 means all the years, not just the end points)
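A regular-expression tokeniser handling a few of the cases from these two slides might look like the sketch below. This is illustrative only; a real term extractor needs many more rules (sentence-final periods, emoticons, diacritics, phrases):

```python
import re

# Alternatives are tried in order at each position: URIs first,
# then numbers (decimals, ranges, dotted quads), then words with
# internal hyphens or apostrophes.
TOKEN_RE = re.compile(
    r"https?://\S+"            # URIs: http://www.liv.ac.uk/
    r"|\d[\d.,:-]*\d|\d"       # numbers: 1,000  55.6  32-45  138.253.81.72
    r"|\w+(?:[-']\w+)*"        # words: bookcase, book-case, Rob's
)

def tokenize(text):
    return TOKEN_RE.findall(text)
```

For example, `tokenize("Rob's book-case")` keeps both the possessive and the hyphenated form as single tokens.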

Slide 12: Dimensionality Reduction
Requirement: reduce the number of words (dimensionality reduction). Many words are useless for distinguishing a document, and we don't want to store non-useful words:
- a, an, the, these, those, them, they, ...
- of, with, to, in, towards, on, ...
- while, however, because, also, who, when, where, ...
A long list of words to ignore, called 'stopwords'. BUT... "The Who": band or stopwords? Part-of-speech filtering is more accurate but more expensive.
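Stopword filtering is a simple set lookup. The list below is a tiny illustrative subset (real stopword lists run to hundreds of words), and the test case shows exactly the pitfall the slide warns about:

```python
# Small illustrative stopword list, nothing like a complete one.
STOPWORDS = {
    "a", "an", "the", "these", "those", "them", "they",
    "of", "with", "to", "in", "towards", "on",
    "while", "however", "because", "also", "who", "when", "where",
}

def remove_stopwords(tokens):
    return [t for t in tokens if t.lower() not in STOPWORDS]
```

Note that naive filtering makes the band name "The Who" vanish entirely ("the" and "who" are both stopwords), which is why part-of-speech filtering can be worth its extra cost.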

Slide 13: Dimensionality Reduction
Requirement: normalise terms for consistency and dimensionality reduction.
- Normally we want to ignore case: 'computer' and 'Computer' should be the same attribute. But acronyms differ: ram vs RAM, us vs US.
- Normally we want to use word stems: 'computer' and 'computers' should be the same attribute. The Porter algorithm relies on suffix matching, but note that 'ram' could be a noun or a verb... Also, stems can be meaningless: "datum mine".
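Stemming can be sketched with crude suffix stripping. This toy stand-in is nothing like the real Porter algorithm, which applies ordered rule sets with conditions on the measure of the remaining stem, but it shows the basic idea of conflating inflected forms:

```python
def naive_stem(word):
    # Toy suffix stripping for illustration only: try a few common
    # suffixes, longest first, and keep at least a 3-letter stem.
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word
```

Here 'computers' and 'computer' collapse to the same attribute, as the slide wants; the slide's caveats still apply, since such rules cannot tell the noun 'ram' from the verb.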

Slide 14: Dimensionality Reduction
Simple statistics can also reduce dimensionality. If a word appears evenly across all classes, it doesn't distinguish any particular class, is not useful for classification, and can be ignored (e.g. 'the'). Equally, a word that appears in only one document will be perfectly discriminating, but also probably over-fitting. Words that appear in most documents (regardless of class distribution) are also unlikely to be useful.
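The document-frequency heuristics on this slide translate into a simple filter. This is a sketch; the threshold values are illustrative choices, not values from the course:

```python
from collections import Counter

def select_by_document_frequency(docs, min_df=2, max_df_ratio=0.5):
    """Keep words appearing in at least min_df documents but in no
    more than max_df_ratio of all documents.

    docs is a list of token lists, one per document.
    """
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # count each document at most once per word
    return {word for word, count in df.items()
            if count >= min_df and count / n_docs <= max_df_ratio}
```

Words appearing everywhere (like 'the') fail the upper threshold; words appearing only once fail the lower one, discarding the perfectly discriminating but over-fitting singletons.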

Slide 15: Text Mining vs Data Mining
Data mining: discover hidden models that describe the data. Text mining: discover hidden facts within bodies of text. Completely different approaches: DM tries to generalise all of the data into a single model without getting caught up in over-fitting; TM tries to understand the details and cross-reference between individual instances.

Slide 16: Text Mining vs Data Mining
Text mining uses natural language processing techniques to 'understand' the data: it tries to capture the semantics of the text (information) rather than treating it as a big bag of character sequences. Major processes:
- Part-of-speech tagging
- Phrase chunking
- Deep parsing
- Named entity recognition
- Information extraction

Slide 17: Text Mining vs Data Mining
Part-of-speech tagging: tag each word with its part of speech (noun, verb, adjective, etc.). This is itself a classification problem, but essential to understanding the text, especially the verbs. Phrase chunking: discover sequences of words that constitute phrases, e.g. noun phrase, verb phrase, prepositional phrase. Also essential, so that we can work with clauses rather than individual words.

Slide 18: Text Mining vs Data Mining
Deep parsing: discover the structure of the clauses and the participants in verbs etc., e.g. dog bites man, not man bites dog. Essential as the first step where the semantics are really used. Named entity recognition: discover 'entities' within the text and tag co-referring mentions with the same identifier: Magnesium and Mg are the same; Bush, President Bush, G.W. Bush, Dubya, and the President are all the same. Essential for correlating entities.

Slide 19: Text Mining vs Data Mining
Information extraction: with all of the previous information, find everything said about each entity from every occurrence in every clause, remove duplicates, and look for interesting correlations, perhaps according to some set of rules for what counts as interesting. In practice this is an impossibly large task on any reasonable body of text, and the interestingness of 'new' facts is often very low.

Slide 20: Text Mining vs Data Mining
DM is crucial for TM, e.g. for correct classification of parts of speech. But TM processes are also important for accurate dimensionality reduction when doing DM on texts. For example: using every word, an average of 100 attributes per vector gives 85.7% accuracy over 10 classes with an SVM; the same data, using linguistic stems filtered to nouns, verbs, and adjectives, averages 64 attributes per vector and reaches 87.2% accuracy.

Slide 21: Further Reading
- Baeza-Yates, Modern Information Retrieval
- Weiss, Chapters 2 and 4
- Berry, Survey of Text Mining, Chapter 5 (He gets around, doesn't he?!)
- Konchady
- Witten, Managing Gigabytes

