M. Sulaiman Khan Dept. of Computer Science University of Liverpool 2009 COMP527: Data Mining Text Mining: Text-as-Data March 25, 2009.

M. Sulaiman Khan Dept. of Computer Science University of Liverpool 2009 COMP527: Data Mining Text Mining: Text-as-Data March 25, 2009 Slide 1 COMP527: Data Mining

Introduction to the Course; Introduction to Data Mining; Introduction to Text Mining; General Data Mining Issues; Data Warehousing; Classification: Challenges, Basics; Classification: Rules; Classification: Trees; Classification: Trees 2; Classification: Bayes; Classification: Neural Networks; Classification: SVM; Classification: Evaluation; Classification: Evaluation 2; Regression, Prediction; Input Preprocessing; Attribute Selection; Association Rule Mining; ARM: A Priori and Data Structures; ARM: Improvements; ARM: Advanced Techniques; Clustering: Challenges, Basics; Clustering: Improvements; Clustering: Advanced Algorithms; Hybrid Approaches; Graph Mining, Web Mining; Text Mining: Challenges, Basics; Text Mining: Text-as-Data; Text Mining: Text-as-Language; Revision for Exam. COMP527: Data Mining Text Mining: Text-as-Data March 25, 2009 Slide 2 COMP527: Data Mining

Word Frequencies Relevance Scoring Latent Semantic Indexing Markov Models Today's Topics Text Mining: Text-as-Data March 25, 2009 Slide 3 COMP527: Data Mining

Unlike other 'regular' attributes, each word can appear multiple times in a single document. Words that occur more frequently within a single document are a good indication that they are important to that text. It is also interesting to see the overall distribution of words within the full data set, and within the vocabulary/attribute space/vector space. The distribution of parts of speech is interesting too. Even individual letter frequencies are potentially interesting when comparing different texts or sets of texts. Word Frequencies Text Mining: Text-as-Data March 25, 2009 Slide 4 COMP527: Data Mining

Distribution of letters in a text can potentially show, with very low dimensionality, some minimal features of the text. Eg:
Alice in Wonderland: E T A O I H N S R D L U W G C Y M F P B
Holy Bible: E T A H O N I S R D L F M U W C Y G B P
Reuters: E T A I N S O R L D C H U M P F G B Y W
Tale of Two Cities: E T A O N I H S R D L U M W C F G Y P B
Eg 'C' and 'S' are a lot more common in the Reuters news articles, 'H' very uncommon, and 'F' more common in the Bible. Letter Frequencies Text Mining: Text-as-Data March 25, 2009 Slide 5 COMP527: Data Mining
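
As an aside, a minimal Python sketch (not from the slides) of how such a letter ranking can be computed; the sample string is just a placeholder for a real text:

    from collections import Counter
    import string

    def letter_ranking(text):
        # Count only A-Z, ignoring case, then order letters by frequency
        counts = Counter(c for c in text.upper() if c in string.ascii_uppercase)
        return [letter for letter, _ in counts.most_common()]

    # Usage: compare the orderings produced for two different texts
    sample = "The quick brown fox jumps over the lazy dog"
    print(" ".join(letter_ranking(sample)))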

Distribution of letters in a text can potentially also help with language determination. 'E' is a lot more common in French (0.17) than Spanish (0.133) or English (0.125). 'V' and 'U' are also more common in French. The top three in Spanish are all vowels: 'E' then 'A' then 'O'. Quite possibly the distribution of letters in texts to be classified is also interesting, if they're from different styles, languages, or subjects. Don't rule out the very easy :) Letter Frequencies Text Mining: Text-as-Data March 25, 2009 Slide 6 COMP527: Data Mining

The distribution of words is also interesting. For vector construction, it would be nice to know approximately how many unique words there are likely to be. Heaps' Law: v = K * n^b, where n = number of words, K = a constant between 10 and 100, b = a constant between 0 and 1 (normally between 0.4 and 0.6), and v = the size of the vocabulary. While this seems very fuzzy, it often works in practice: it predicts a particular growth curve, which seems to hold up in experiments. Word Frequencies Text Mining: Text-as-Data March 25, 2009 Slide 7 COMP527: Data Mining
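
A small illustrative Python sketch of Heaps' law; the K and b values below are arbitrary picks from the ranges above, not fitted constants:

    def heaps_estimate(n_words, K=30, b=0.5):
        # v = K * n^b: predicted vocabulary size for a text of n_words tokens
        return K * n_words ** b

    def observed_vocabulary(words):
        # Actual number of unique words seen
        return len(set(words))

    # Usage: compare prediction and observation on a (far too small) toy text
    text = "the cat sat on the mat and the dog sat on the log".split()
    print(observed_vocabulary(text), heaps_estimate(len(text)))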

A second 'law': Zipf's Law. Idea: we use a few words very often, and most words very rarely, because it's more effort to use a rare word. Zipf's Law: the product of a word's frequency and its rank is [reasonably] constant. Also fuzzy, but also empirically demonstrable, and it holds up over different languages. A 'Power Law Distribution': few events occur often, and many events occur infrequently. Word Frequencies Text Mining: Text-as-Data March 25, 2009 Slide 8 COMP527: Data Mining
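
A quick Python sketch (illustrative only, not from the lecture) that tabulates rank * frequency, which should stay roughly constant on a reasonably large corpus:

    from collections import Counter

    def zipf_table(words, top=10):
        # (rank, word, frequency, rank * frequency) for the most common words
        counts = Counter(words)
        return [(rank, w, f, rank * f)
                for rank, (w, f) in enumerate(counts.most_common(top), 1)]

    # Usage: on a toy sentence; a real corpus is needed to see the effect clearly
    words = "the quick brown fox jumps over the lazy dog the fox".split()
    for row in zipf_table(words):
        print(row)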

Zipf's Law Example: Word Frequencies Text Mining: Text-as-Data March 25, 2009 Slide 9 COMP527: Data Mining

The frequencies of words can be used in relation to each class, in comparison to the entire document set. Eg words that are more frequent in a class than in the document set as a whole are discriminating. We can use this idea to generate weights for terms against each class, and then merge the weights for a general prediction of the class. This is also commonly used by search engines to predict relevance to a user's query. There are several different ways to create these weights... Word Frequencies Text Mining: Text-as-Data March 25, 2009 Slide 10 COMP527: Data Mining

Term Frequency, Inverse Document Frequency w(i, j) = tf(i, j) * log(N / df(i)) Weight of term i in document j is the frequency of term i in document j times the log of the total number of documents divided by the number of documents that contain term i. Eg, the more often the term occurs in the document, and the rarer the term is, the more likely that document is to be relevant. TF-IDF Text Mining: Text-as-Data March 25, 2009 Slide 11 COMP527: Data Mining

w(i, j) = tf(i, j) * log(N / df(i)) In 1000 documents, 20 contain the word 'lego'. It appears between 1 and 6 times in those 20. For the document where it appears 6 times: w('lego', doc) = 6 * log(1000 / 20) = 6 * log(50) ≈ 10.2 (assuming base-10 logs). TF-IDF Example Text Mining: Text-as-Data March 25, 2009 Slide 12 COMP527: Data Mining
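
A one-function Python sketch of this weighting; the log base is not stated on the slide, so base 10 is assumed here to match the worked value above:

    import math

    def tf_idf(tf, df, n_docs):
        # w(i, j) = tf(i, j) * log(N / df(i)), base-10 log assumed
        return tf * math.log10(n_docs / df)

    # Usage: the 'lego' example (tf 6, 20 matching docs out of 1000)
    print(tf_idf(tf=6, df=20, n_docs=1000))  # 6 * log10(50), roughly 10.2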

Then for multiple terms, merge the tf-idf weightings according to some function (sum, mean, etc). This sum can be generated for each class.
Pros:
- Easy to implement
- Easy to understand
Cons:
- Document size not taken into account
- Low document frequency overpowers term frequency
TF-IDF Text Mining: Text-as-Data March 25, 2009 Slide 13 COMP527: Data Mining

Jamie Callan of CMU proposed this algorithm: I = log((N + 0.5) / tf(i)) / log(N + 1.0) T = df(i) / (df(i) + 50 + 150 * size(j) / avgSize(N)) w(i, j) = 0.6 * T * I Takes into account document size, and average size of all documents. Otherwise a document with 6 matches in 100 words is treated the same as a document with 6 matches in 100,000 words. Vast improvement over simple TF-IDF, while still remaining easy to implement and understandable. CORI Text Mining: Text-as-Data March 25, 2009 Slide 14 COMP527: Data Mining

I = log((N + 0.5) / tf(i)) / log(N + 1.0) T = df(i) / (df(i) + 50 + 150 * size(j) / avgSize(N)) w(i, j) = 0.6 * T * I Given the same 20 matched docs, 6 occurrences in the doc, 1000 documents, 350 words in the doc, and an average of 500 words per doc in the collection: I = log(1000.5 / 6) / log(1001) = 0.74 T = 20 / (20 + 50 + 150 * 350 / 500) = 0.11 w('lego', doc) = 0.6 * T * I ≈ 0.05 For more explanations see his papers. CORI Example Text Mining: Text-as-Data March 25, 2009 Slide 15 COMP527: Data Mining
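
A small Python sketch following the formula exactly as written on the slide, reproducing the numbers above:

    import math

    def cori_weight(tf, df, n_docs, doc_size, avg_doc_size):
        # I rewards rare terms; T rewards matches relative to document length
        I = math.log((n_docs + 0.5) / tf) / math.log(n_docs + 1.0)
        T = df / (df + 50 + 150 * doc_size / avg_doc_size)
        return 0.6 * T * I

    # Usage: the 'lego' example (tf 6, df 20, 1000 docs, 350-word doc, 500-word average)
    print(cori_weight(tf=6, df=20, n_docs=1000, doc_size=350, avg_doc_size=500))
    # roughly 0.05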

Finds the relationships between words in the document/term matrix: the clusters of words that frequently co-occur in documents, and hence the 'latent semantic structure' of the document set. Doesn't depend on individual words, but instead on the clusters of words. Eg might use 'car' + 'automobile' + 'truck' + 'vehicle' instead of just 'car'. Twin problems: Synonymy: different words with the same meaning (car, auto). Polysemy: same spelling with different meaning (to ram, a ram). (We'll come back to word sense disambiguation too.) Latent Semantic Indexing Text Mining: Text-as-Data March 25, 2009 Slide 16 COMP527: Data Mining

Based on Singular Value Decomposition of the matrix (which is something best left to math toolkits). Basically: it transforms the dimensions of the vectors such that documents with similar sets of terms are closer together. Then use these groupings as clusters of documents. You end up with fractions of words being present in documents (eg 'automobile' is somehow present in a document containing 'car'). Then use these vectors for analysis, rather than straight frequency vectors. As each dimension represents multiple words, you end up with smaller vectors too. Latent Semantic Indexing Text Mining: Text-as-Data March 25, 2009 Slide 17 COMP527: Data Mining
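
A minimal numpy sketch of the idea (the tiny matrix and k=2 are made up for illustration): take the SVD of a term-document matrix and keep only the top k singular dimensions as the reduced document vectors:

    import numpy as np

    def lsi_document_vectors(term_doc_matrix, k=2):
        # Truncated SVD: keep the k strongest latent dimensions
        U, s, Vt = np.linalg.svd(term_doc_matrix, full_matrices=False)
        # Each row of the result is one document in the k-dimensional concept space
        return (np.diag(s[:k]) @ Vt[:k, :]).T

    # Usage: rows = terms (car, automobile, truck, banana), columns = documents
    X = np.array([[2, 0, 1, 0],
                  [0, 2, 1, 0],
                  [1, 1, 2, 0],
                  [0, 0, 0, 3]], dtype=float)
    print(lsi_document_vectors(X, k=2))  # 4 documents x 2 latent dimensions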

Patterns of letters in language don't happen at random. This sentence vs 'kqonw ldpwuf jghfmb edkfiu lpqwxz', which is obviously not language. Markov models try to learn the probabilities of one item following another, in this case letters. Eg: take all of the words we have and build a graph of which letters follow which other letters, plus the 'start' and 'end' of words. Then each arc between nodes has a weight for the probability. Using a letter-based Markov model we might end up with words like: annofed, mamigo, quarn, etc. Markov Models Text Mining: Text-as-Data March 25, 2009 Slide 18 COMP527: Data Mining
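
An illustrative Python sketch of such a letter-level model (the training vocabulary is made up): count which letter follows which, then walk the graph to generate new word-like strings:

    import random
    from collections import defaultdict

    def train_letter_model(words):
        # Transition counts, with ^ marking word start and $ marking word end
        transitions = defaultdict(lambda: defaultdict(int))
        for word in words:
            chars = ["^"] + list(word.lower()) + ["$"]
            for a, b in zip(chars, chars[1:]):
                transitions[a][b] += 1
        return transitions

    def generate_word(transitions, max_len=12):
        # Random walk over the arcs, weighted by the observed counts
        out, state = [], "^"
        while len(out) < max_len:
            nexts = transitions[state]
            state = random.choices(list(nexts), weights=list(nexts.values()))[0]
            if state == "$":
                break
            out.append(state)
        return "".join(out)

    # Usage: train on a few words and invent new ones
    model = train_letter_model(["manage", "mining", "margin", "quark", "annotate"])
    print([generate_word(model) for _ in range(5)])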

Sequences of words are also not random (in English). We can use a much, much larger Markov Model to show the probabilities of a word following another word or the start/end of a sentence. Equally, words clump together in short phrases, and we could use multi-word tokens as our graph nodes. Here we could see how likely 'states' is to follow 'united', for example. Markov Models Text Mining: Text-as-Data March 25, 2009 Slide 19 COMP527: Data Mining

Sequences of parts of speech for words are also not random (in English). But here we don't care about the probabilities themselves; we want to use the observations as a way to determine the actual part of speech of a word. This is a Hidden Markov Model (HMM), as it uses the observable patterns to predict some variable which is hidden. It uses a trellis of states and the observation sequence, eg two states (state 1, state 2) over an observation sequence O1 O2 O3 O4. Hidden Markov Models Text Mining: Text-as-Data March 25, 2009 Slide 20 COMP527: Data Mining

Stores the calculations towards the probabilities in the trellis arcs. Various clever algorithms make this computationally feasible:
- Compute the probability of a particular output sequence: the Forward-Backward algorithm.
- Find the most likely sequence of hidden states to generate an output sequence: the Viterbi algorithm (sketched below).
- Given an output sequence, find the most likely set of transition and output probabilities (train the model's parameters from a training set): the Baum-Welch algorithm.
Hidden Markov Models Text Mining: Text-as-Data March 25, 2009 Slide 21 COMP527: Data Mining
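
A compact Viterbi sketch in Python; the two states and all probabilities below are made-up toy values (a real tagger would train them, eg with Baum-Welch):

    def viterbi(observations, states, start_p, trans_p, emit_p):
        # trellis[t][s] = (best probability of being in state s at time t, back-pointer)
        trellis = [{s: (start_p[s] * emit_p[s][observations[0]], None) for s in states}]
        for obs in observations[1:]:
            row = {}
            for s in states:
                prob, prev = max(
                    (trellis[-1][p][0] * trans_p[p][s] * emit_p[s][obs], p)
                    for p in states)
                row[s] = (prob, prev)
            trellis.append(row)
        # Trace back from the most probable final state
        state = max(states, key=lambda s: trellis[-1][s][0])
        path = [state]
        for row in reversed(trellis[1:]):
            state = row[state][1]
            path.append(state)
        return list(reversed(path))

    # Usage with toy part-of-speech style parameters
    states = ["Noun", "Verb"]
    start_p = {"Noun": 0.6, "Verb": 0.4}
    trans_p = {"Noun": {"Noun": 0.3, "Verb": 0.7}, "Verb": {"Noun": 0.8, "Verb": 0.2}}
    emit_p = {"Noun": {"dogs": 0.6, "bark": 0.1}, "Verb": {"dogs": 0.1, "bark": 0.7}}
    print(viterbi(["dogs", "bark"], states, start_p, trans_p, emit_p))  # ['Noun', 'Verb']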

Konchady
Chou, Juang, Pattern Recognition in Speech and Language Processing, Chapters 8, 9
Weiss
Berry, Survey, Chapter 4
Han, Dunham 9.2
Further Reading Text Mining: Text-as-Data March 25, 2009 Slide 22 COMP527: Data Mining