Dept. of Computer Science University of Liverpool

COMP527: Data Mining
M. Sulaiman Khan
Dept. of Computer Science, University of Liverpool
Introduction to Text Mining, January 29, 2009, Slide 1

COMP527: Data Mining
Introduction to the Course
Introduction to Data Mining
Introduction to Text Mining
General Data Mining Issues
Data Warehousing
Classification: Challenges, Basics
Classification: Rules
Classification: Trees
Classification: Trees 2
Classification: Bayes
Classification: Neural Networks
Classification: SVM
Classification: Evaluation
Classification: Evaluation 2
Regression, Prediction
Input Preprocessing
Attribute Selection
Association Rule Mining
ARM: A Priori and Data Structures
ARM: Improvements
ARM: Advanced Techniques
Clustering: Challenges, Basics
Clustering: Improvements
Clustering: Advanced Algorithms
Hybrid Approaches
Graph Mining, Web Mining
Text Mining: Challenges, Basics
Text Mining: Text-as-Data
Text Mining: Text-as-Language
Revision for Exam

Today's Topics
Information Retrieval (IR)
  What is IR?
  Typical IR Process
Data Mining on Text
Text Mining
  What is Text Mining?
  Typical Text Mining Process
  Applications

What is Information Retrieval?
IR is concerned with retrieving textual records, not data items as in relational databases, nor (specifically) with finding patterns as in data mining.
Examples:
  SQL: Find rows where the text column LIKE “%information retrieval%”
  DM: Find a model in order to classify document topics.
  IR: Find documents with text that contains the word Information adjacent to Retrieval, Protocol or SRW, but not Google.

What is Information Retrieval?
IR focuses on finding the most appropriate or relevant records for the user's request. The supremacy of Google can be attributed primarily to its PageRank algorithm for ranking web pages in order of relevance to the user's query. $ (on , up from $ on ) a share says this topic is important to understand!
IR also focuses on finding these records as quickly as possible. Not only does Google find relevant pages, it finds them fast, for many thousands (maybe millions?) of concurrent users.

IR = Google??
So is “Google” the answer to the question of “Information Retrieval”?
No! Google has a good answer for how to search the web, but there are many more sources of data, and many more interesting questions.
Many other examples, including:
  Library catalogues
  XML searching
  Distributed searching
  Query languages

IR Processes: Discovery
Research topics exist for each box and arrow!
[Diagram boxes: Search Engine, User, Need, Query, Information]

IR Processes: Ingestion
Compare to the KDD process we looked at last time!
[Diagram boxes: Documents, Search Engine, Target Documents, Records, Preprocessed Documents, Information]

Document Indexing
What information do we need to store?
Query: Documents containing Information and Retrieval but not Protocol
We need to find which documents contain which words. We could perform this query using a document/term matrix, with one row per document and one column per term.
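A minimal sketch of such a document/term matrix, assuming a toy collection of three invented documents and simple whitespace tokenisation (neither comes from the slides). Each cell records only whether the term occurs in the document, which is enough to answer the boolean query above:

```python
# Toy collection; the document ids and texts are invented for illustration.
docs = {
    "d1": "information retrieval finds relevant documents",
    "d2": "a protocol for information retrieval",
    "d3": "data mining finds patterns",
}

# Vocabulary: every distinct term across all documents, in sorted order.
vocab = sorted({term for text in docs.values() for term in text.split()})

# Boolean document/term matrix: one row per document, one column per term.
matrix = {
    doc_id: [term in text.split() for term in vocab]
    for doc_id, text in docs.items()
}


def contains(doc_id, term):
    """Look up one cell of the matrix; unknown terms count as absent."""
    return term in vocab and matrix[doc_id][vocab.index(term)]


def boolean_query(required, excluded):
    """Return ids of documents with all required terms and no excluded ones."""
    return sorted(
        doc_id
        for doc_id in matrix
        if all(contains(doc_id, t) for t in required)
        and not any(contains(doc_id, t) for t in excluded)
    )


# Documents containing "information" and "retrieval" but not "protocol".
result = boolean_query(["information", "retrieval"], ["protocol"])
print(result)
```

A real index would not store the full matrix, since most cells are empty; this layout is only meant to make the query logic visible.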

Document Indexing
Also useful to know is the frequency of the term in the document.
Each row in the matrix is a vector, which is useful for data mining functions because the document has been reduced to a series of numbers rather than words.
In the new matrix, each cell holds the number of times the term occurs in the document.
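As a rough illustration of the frequency version of the matrix, the sketch below turns two invented documents into term-count vectors. The texts, ids and whitespace tokenisation are assumptions made for the example:

```python
from collections import Counter

# Invented example documents.
docs = {
    "d1": "information retrieval retrieval systems",
    "d2": "data mining and information",
}

# Fixed column order shared by every row of the matrix.
vocab = sorted({term for text in docs.values() for term in text.split()})


def tf_vector(text):
    """Count how often each vocabulary term occurs in the text."""
    counts = Counter(text.split())
    return [counts[term] for term in vocab]


# One numeric vector per document, aligned with vocab.
tf_matrix = {doc_id: tf_vector(text) for doc_id, text in docs.items()}

print(vocab)
for doc_id, row in tf_matrix.items():
    print(doc_id, row)
```

Because every row uses the same column order, the rows can be passed directly to classification or clustering code that expects numeric attributes.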

Common evaluation for IR relevance ranking: Precision and Recall
Precision: Number Relevant and Retrieved / Number Retrieved
Recall: Number Relevant and Retrieved / Number Relevant
F Score: recall * precision / ((recall + precision) / 2)
The ideal situation is that all and only the relevant documents are retrieved.
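The three measures can be computed directly from the set of retrieved documents and the set of relevant documents. The following is a small sketch; the function name `evaluate` and the document ids are made up for the example, and returning 0.0 for empty sets is a convention chosen here, not something the slides specify:

```python
def evaluate(retrieved, relevant):
    """Return (precision, recall, f_score) for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant AND retrieved

    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0

    # F score as written on the slide: product over the arithmetic mean.
    mean = (recall + precision) / 2
    f_score = recall * precision / mean if mean else 0.0
    return precision, recall, f_score


# Four documents retrieved, three actually relevant, two in common.
precision, recall, f_score = evaluate(
    retrieved=["d1", "d2", "d3", "d4"],
    relevant=["d1", "d2", "d5"],
)
print(precision, recall, f_score)
```

The slide's F score is algebraically the same as the more common form 2 * precision * recall / (precision + recall).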

Topics of Interest
Format Processing: Extraction of text from different file formats
Indexing: Efficient extraction/storage of terms from text
Query Languages: Formulation of queries against those indexes
Protocols: Transporting queries from client to server
Relevance Ranking: Determining the relevance of a document to the user's query
Metasearch: Cross-searching multiple document sets with the same query
GridIR: Using the grid (or other massively parallel infrastructure) to perform IR processes
Multimedia IR: IR techniques on multimedia objects, compound digital objects...

Data Mining on Text
All of the Data Mining functions can be applied to textual data, using term as the attribute and frequency as the value.
Classification: Classify a text into subjects, genres, quality, reading age, ...
Clustering: Cluster together similar texts
Association Rule Mining:
  Find words that frequently appear together
  Find texts that are frequently cited together
The key challenge is the very large number of terms (e.g. the number of different words across all documents).

So, we've looked at Data Mining and IR... What's Text Mining then?
Good question. There is no canonical definition yet, but a definition similar to the one for Data Mining could be applied:
  The non-trivial extraction of previously unknown, interesting facts from an (invariably large) collection of texts.
So it sounds like a combination of IR and Data Mining, but actually the process involves many other steps too.
Before we look at what actually happens, let's look at why it's different...

Text Mining vs Data Mining
Data Mining finds a model for the data based on the attributes of the items. The only attributes of text are the words that make up the text. As we saw for IR, this creates a very sparse matrix.
Even if we create that matrix, what sort of patterns could we find?
  Classification: We could classify texts into pre-defined classes (e.g. spam / not spam)
  Association Rule Mining: Finding frequent sets of words (e.g. if 'computer' appears 3+ times, then 'data' appears at least once)
  Clustering: Finding groups of similar documents (IR?)
None of these fit our definition of Text Mining.

Text Mining vs IR
Information Retrieval finds documents that match the user's query.
Even if we matched at the sentence level rather than the document level, all we would do is retrieve matching sentences; we would not be discovering anything new.
The relevance ranking is important, but it still just matches information we already knew... it just orders it appropriately.
IR (typically) treats a document as a big bag of words... it doesn't care about the meaning of the words, just whether they exist in the document.
IR doesn't fit our definition of Text Mining either.

Text Mining Process
How would one find previously unknown facts from a bunch of text?
Need to understand the meaning of the text!
  Part of speech of words
  Subject/Verb/Object/Preposition/Indirect Object
Need to determine that two entities are the same entity.
Need to find correlations of the same entity.
Form logical chains:
  Milk contains Magnesium. Magnesium stimulates receptor activity. Inactive receptors cause Headaches --> Milk is good for Headaches. (fictional example!)

Part of Speech Tagging
First we need to tag the text with the parts of speech for each word.
e.g.: Rob/noun teaches/verb the/article course/noun
How could we do this? By learning a model for the language!
Essentially a data mining classification problem: should the system classify the word as a noun, a verb, an adjective, etc.?
There are lots of different tags, often based on a set called the Penn Treebank. (NN = Noun, VB = Verb, JJ = Adjective, RB = Adverb, etc.)
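To make the "tagging as classification" idea concrete, here is a deliberately simple baseline: it learns, for each word, the tag it carried most often in a hand-tagged training set. The three training sentences and the fallback tag are invented for illustration, and real taggers also use the surrounding context rather than the word alone:

```python
from collections import Counter, defaultdict

# Tiny hand-tagged corpus (invented), using Penn Treebank style tags.
training = [
    [("Rob", "NNP"), ("teaches", "VBZ"), ("the", "DT"), ("course", "NN")],
    [("the", "DT"), ("students", "NNS"), ("like", "VBP"),
     ("the", "DT"), ("course", "NN")],
    [("Rob", "NNP"), ("likes", "VBZ"), ("the", "DT"), ("students", "NNS")],
]

# "Learning": count how often each word was seen with each tag.
tag_counts = defaultdict(Counter)
for sentence in training:
    for word, tag_label in sentence:
        tag_counts[word.lower()][tag_label] += 1


def tag(words, default="NN"):
    """Assign each word its most frequent training tag, or a default."""
    tagged = []
    for word in words:
        counts = tag_counts.get(word.lower())
        best = counts.most_common(1)[0][0] if counts else default
        tagged.append((word, best))
    return tagged


tagged = tag("Rob teaches the students".split())
print(" ".join(f"{word}/{label}" for word, label in tagged))
```

Unknown words fall back to `NN` here, which is an arbitrary choice for the sketch.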

Deep Parsing
Now we need to discover the phrases and parts of each clause.
Rob/noun teaches/verb the/article course/noun
(Subject: Rob Verb:teaches (Object: the+course))
The phrase sections are often expressed as trees:
( TOP
  ( S
    ( NP ( DT This ) ( JJ crazy ) ( NN sentence ) )
    ( VP ( VBD amused )
      ( NP ( NNP Rob ) )
      ( PP ( IN for )
        ( NP ( DT a ) ( JJ few ) ( NNS minutes ) ) ) ) ) )
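A bracketed tree like this is easy to read back into a nested structure. The sketch below is one possible reader, not a full Treebank parser: it assumes a balanced bracket string (the closing brackets for PP, VP, S and TOP are supplied here as an assumption) and the helper names are invented for the example:

```python
def parse_tree(text):
    """Parse a bracketed tree into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read_node():
        nonlocal pos
        assert tokens[pos] == "(", "expected an opening bracket"
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())  # nested phrase
            else:
                children.append(tokens[pos])  # a word
                pos += 1
        pos += 1  # consume the closing bracket
        return (label, children)

    return read_node()


def leaves(node):
    """Collect the words under a node, left to right."""
    words = []
    for child in node[1]:
        words.extend(leaves(child) if isinstance(child, tuple) else [child])
    return words


def phrases(node, wanted):
    """Return the text of every subtree whose label equals wanted."""
    found = [" ".join(leaves(node))] if node[0] == wanted else []
    for child in node[1]:
        if isinstance(child, tuple):
            found.extend(phrases(child, wanted))
    return found


tree = parse_tree(
    "( TOP ( S ( NP ( DT This ) ( JJ crazy ) ( NN sentence ) ) "
    "( VP ( VBD amused ) ( NP ( NNP Rob ) ) "
    "( PP ( IN for ) ( NP ( DT a ) ( JJ few ) ( NNS minutes ) ) ) ) ) )"
)
print(leaves(tree))
print(phrases(tree, "NP"))
```

Pulling out the `NP` subtrees gives the candidate noun phrases that later steps try to link to real-world entities.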

Entity Recognition
Once we've parsed the text for linguistic structure, we need to identify the real world objects referred to.
Rob teaches the course
  Rob: (Sanderson, Robert D. b.1976-07-20 Rangiora/New Zealand)
  the course: Comp /2007, University of Liverpool, UK
This is typically done via lookups in very large thesauri or 'ontologies', specific to the domain being processed (e.g. medical, historical, current events, etc.)
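The lookup step can be pictured as a dictionary from surface phrases to entity records. The sketch below uses a hypothetical two-entry "ontology" and naive substring matching; both are assumptions for illustration, and a real system would match against parsed noun phrases and a far larger, domain-specific resource:

```python
# Hypothetical mini-ontology: lower-cased surface form -> entity record.
ontology = {
    "rob": {"type": "person", "name": "Sanderson, Robert D."},
    "the course": {"type": "course", "name": "Comp527"},
}


def recognise(text):
    """Return (position, surface form, entity name) for each known phrase."""
    lowered = text.lower()
    found = []
    for surface, record in ontology.items():
        start = lowered.find(surface)  # naive: first occurrence only
        if start != -1:
            found.append((start, surface, record["name"]))
    return sorted(found)


entities = recognise("Rob teaches the course")
for start, surface, name in entities:
    print(f"{surface!r} at {start} -> {name}")
```

Substring matching would also fire on words that merely contain a surface form, which is one reason the parsing step comes first.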

Fact Extraction
There will normally be a lot more text to parse:
  Rob Sanderson, a lecturer at the University of Liverpool, teaches a masters level course on data mining (Comp527)
Extracted facts:
  Rob is a lecturer
  Rob is at the University of Liverpool
  Rob teaches a course
  The course is called Comp527
  The course is masters level
  The course is about data mining

Correlation
Rob Sanderson, a lecturer at the University of Liverpool, teaches a masters level course on data mining (Comp527)
Data mining is about finding models to describe data sets.
--> The University of Liverpool has a course about finding models to describe data sets.
(Not very interesting or novel in this case, but that's the process)
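One way to picture this correlation step is to store extracted facts as (subject, relation, object) triples and join two facts that share an entity. The triples below are written by hand from the example sentences, and the `chain` helper is a hypothetical name; it performs a single join pass rather than a full transitive closure:

```python
# Hand-written triples based on the example sentences.
facts = [
    ("Rob", "teaches", "Comp527"),
    ("Comp527", "is about", "data mining"),
    ("data mining", "is about", "finding models to describe data sets"),
]


def chain(facts, relation):
    """Derive (a, relation, c) from (a, relation, b) and (b, relation, c)."""
    known = set(facts)
    derived = []
    for a, rel1, b in facts:
        for b2, rel2, c in facts:
            joins = rel1 == relation and rel2 == relation and b == b2
            if joins and (a, relation, c) not in known:
                derived.append((a, relation, c))
    return derived


new_facts = chain(facts, "is about")
for subject, relation, obj in new_facts:
    print(subject, relation, obj)
```

The derived fact is only as trustworthy as the triples it was built from, so errors in tagging, parsing or entity recognition propagate into the conclusions.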

Applications
Search engines of all types are based on IR.
But where would you use text mining?
Most research so far is on medical data sets ... because this is the most profitable! If you could correlate facts to find a cure for cancer, you would be very VERY rich! So ... lots of people are trying to do just that for various values of 'cancer'.
It is also because of the wide availability of ontologies and datasets, in particular abstracts for medical journal articles (PubMed/Medline).

Applications
More application areas:
  News feeds
  Terrorism detection
  Social sciences analysis
  Historical text analysis
  Corpus linguistics
  'Net Nanny' filters
  etc.
