Text mining.

The Standard Data Mining process

Text Mining
- Machine learning on text data
- Text data mining
- Text analysis
- Part of Web mining
Typical tasks include:
- Text categorization (document classification)
- Text clustering
- Text summarization
- Opinion mining
- Entity/concept extraction
- Information retrieval: search engines
- Information extraction: question answering

Supervised learning algorithms
- Decision tree learning
- Naïve Bayes
- K-nearest neighbour
- Support Vector Machines
- Neural Networks
- Genetic algorithms

Supervised Machine Learning
1. Build or get a representative corpus
2. Label it
3. Define features
4. Represent documents
5. Learn and analyse
6. Go to 3 until accuracy is acceptable
First features to test: stemmed words. (A minimal workflow sketch follows.)
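
A minimal sketch of steps 2-5 in Python (scikit-learn and the four-document toy corpus are illustrative assumptions, not part of the original slides):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    # Step 2: a tiny labelled corpus
    docs = ["cheap pills buy now", "meeting agenda attached",
            "win money fast", "project report draft"]
    labels = ["spam", "ham", "spam", "ham"]

    # Steps 3-4: define features and represent documents as weighted term vectors
    # Step 5: learn a Naive Bayes classifier and analyse its predictions
    model = make_pipeline(TfidfVectorizer(), MultinomialNB())
    model.fit(docs, labels)
    print(model.predict(["win cheap money"]))  # -> ['spam']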

Unsupervised Learning
Document clustering:
- k-means
- Hierarchic Agglomerative Clustering (HAC)
- ...
- BIRCH
- Association Rule Hypergraph Partitioning (ARHP)
- Categorical clustering (CACTUS, STIRR)
- ...
- STC
- QDC
Interactive learning:
- Learning from unlabelled data
- Learning to label
- Two systems that teach each other
(A minimal k-means sketch follows.)
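
A minimal k-means sketch over tf-idf document vectors (scikit-learn assumed; the documents are toy examples):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    docs = ["apple banana fruit", "banana fruit salad",
            "python java code", "java compiler code"]
    X = TfidfVectorizer().fit_transform(docs)  # each document as a tf-idf vector

    # Partition the documents into two clusters
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    print(km.labels_)  # one cluster id per document, e.g. [0 0 1 1]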

Similarity measure
- There are many different ways to measure how similar two documents are, or how similar a document is to a query
- The result depends heavily on the choice of terms used to represent the text documents
- Euclidean distance (L2 norm)
- L1 norm
- Cosine similarity
(See the sketch below.)
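
A sketch of the three measures with NumPy (the two term-weight vectors are toy values):

    import numpy as np

    # Two documents represented as term-weight vectors over a shared vocabulary
    d1 = np.array([3.0, 1.0, 0.0, 1.0])
    d2 = np.array([3.0, 2.0, 1.0, 0.0])

    euclidean = np.linalg.norm(d1 - d2)  # L2 norm of the difference
    l1 = np.sum(np.abs(d1 - d2))         # L1 norm of the difference
    cosine = d1 @ d2 / (np.linalg.norm(d1) * np.linalg.norm(d2))
    print(euclidean, l1, cosine)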

Document Similarity Measures

Feature Extraction: Task (1)
Task: extract a good subset of words to represent documents.
Document collection → all unique words/phrases → Feature Extraction → all good words/phrases
(Some slides by Huaizhong Kou.)

Feature Extraction
- Task
- Indexing
- Weighting model
- Dimensionality reduction

Feature Extraction: Task (2)
While more and more textual information is available online, effective retrieval is difficult without good indexing of text content. For example, that 16-word sentence reduces after feature extraction to 5 index terms with counts: text (2), information (1), online (1), retrieval (1), index (1).
The naive feature space consists of the unique terms that occur in the documents, which can be tens or hundreds of thousands of terms for even a moderate-size text collection. Current techniques cannot deal with such a large set of terms, so it is desirable to reduce the naive space without information loss.

Feature Extraction: Indexing (1)
1. Training documents
2. Identification of all unique words
3. Removal of stop words: non-informative words, e.g. {the, and, when, more}
4. Word stemming: removal of suffixes to generate word stems, grouping related words and increasing relevance, e.g. {walker, walking} → walk
5. Naive terms
6. Term weighting: importance of each term in a document
(A small indexing sketch follows.)
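
A small indexing sketch in pure Python (the stop list and the crude suffix rules are toy assumptions; a real system would use a full stop list and a proper stemmer such as Porter's):

    STOP_WORDS = {"the", "and", "when", "more", "a", "is", "of"}

    def crude_stem(word):
        # Toy suffix removal, e.g. walker/walking -> walk
        for suffix in ("ing", "er", "s"):
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                return word[:-len(suffix)]
        return word

    def index_terms(document):
        words = document.lower().split()
        return [crude_stem(w) for w in words if w not in STOP_WORDS]

    print(index_terms("the walker and the walking man"))  # ['walk', 'walk', 'man']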

Feature Extraction: Indexing (2)
- The Vector Space Model (VSM) is one of the most commonly used text data models
- Any text document is represented by a vector of terms
- Terms are typically words and/or phrases
- Every term in the vocabulary becomes an independent dimension
- Each term occurring in a document is represented by a non-zero value in the corresponding dimension
- A document collection is represented as a matrix X = [xji], where xji is the weight of the ith term in the jth document
(A sketch of building such a matrix follows.)
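
A minimal sketch that builds such a matrix with raw counts as weights (the next slides replace raw counts with tf or tf-idf weights):

    from collections import Counter

    docs = ["text mining mines text", "web mining"]
    vocab = sorted({w for d in docs for w in d.split()})  # one dimension per term
    counts = [Counter(d.split()) for d in docs]

    # x[j][i] = weight of the ith term in the jth document
    x = [[c[term] for term in vocab] for c in counts]
    print(vocab)  # ['mines', 'mining', 'text', 'web']
    print(x)      # [[1, 1, 2, 0], [0, 1, 0, 1]]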

Feature Extraction: Weighting Model (1)
tf (term frequency) weighting: wij = Freqij
Freqij ::= the number of times the jth term occurs in document Di.
Drawback: does not reflect how well a term discriminates between documents.
Example: D1 = ABRTSAQWA XAO, D2 = RTABBAXA QSAK

         A  B  K  O  Q  R  S  T  W  X
    D1   3  1  0  1  1  1  1  1  1  1
    D2   3  2  1  0  1  1  1  1  0  1

Feature Extraction: Weighting Model (2)
tf-idf (inverse document frequency) weighting: wij = Freqij * log(N / DocFreqj)
N ::= the number of documents in the training document collection.
DocFreqj ::= the number of documents in which the jth term occurs.
Advantage: reflects how well a term discriminates between documents.
Assumption: terms with low DocFreq are better discriminators than ones with high DocFreq in the document collection.
Example (same documents as above; terms occurring in both documents get weight 0):

         A  B  K    O    Q  R  S  T  W    X
    D1   0  0  0    0.3  0  0  0  0  0.3  0
    D2   0  0  0.3  0    0  0  0  0  0    0

(A short sketch reproducing these weights follows.)
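
A short sketch of the tf-idf formula above, reproducing the example weights (log base 10, N = 2; spaces in the example strings are dropped, and only non-zero weights are printed):

    import math

    docs = {"D1": "ABRTSAQWAXAO", "D2": "RTABBAXAQSAK"}
    N = len(docs)
    vocab = sorted(set("".join(docs.values())))

    # DocFreq_j: the number of documents in which term j occurs
    doc_freq = {t: sum(t in d for d in docs.values()) for t in vocab}

    for name, d in docs.items():
        w = {t: d.count(t) * math.log10(N / doc_freq[t]) for t in vocab}
        print(name, {t: round(v, 1) for t, v in w.items() if v > 0})
    # D1 {'O': 0.3, 'W': 0.3}
    # D2 {'K': 0.3}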

Feature Extraction: Weighting Model
Besides tf-idf weighting (Ref: [11][22]), entropy weighting is also used (Ref: [13]). One common formulation is

    wij = log(Freqij + 1) * (1 + ei)
    ei = (1 / log N) * sum over k of (Freqik / ni) * log(Freqik / ni)

where ni is the total number of occurrences of the ith term and ei is its average entropy over the collection:
- ei = -1 if the word occurs once in every document
- ei = 0 if the word occurs in only one document
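
A sketch of the entropy term ei above, checking the two boundary cases stated on the slide:

    import math

    def entropy_term(freqs):
        # freqs[k] = Freq_ik, the term's frequency in each of the N documents
        N = len(freqs)
        n_i = sum(freqs)
        return sum((f / n_i) * math.log(f / n_i) for f in freqs if f > 0) / math.log(N)

    print(entropy_term([1, 1, 1, 1]))  # -1.0: the word occurs once in every document
    print(entropy_term([3, 0, 0, 0]))  #  0.0: the word occurs in only one document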

Feature Extraction: Dimension Reduction
- Document frequency thresholding
- χ² statistic
- Latent Semantic Indexing
- Information gain
- Mutual information

Dimension Reduction: DocFreq Thresholding
Document frequency thresholding is the simplest dimension-reduction method, with the lowest computational cost:
1. Start from the naive terms of the training documents D.
2. Calculate the document frequency DocFreq(w) for each term w in the training collection.
3. Set a threshold θ and remove every term with DocFreq(w) < θ.
The surviving terms are the feature terms. The rationale: rare terms are either non-informative for predictions or not influential in performance.
(A sketch follows.)
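
A minimal sketch of DocFreq thresholding (the threshold value and the documents are toy assumptions):

    docs = [["text", "mining", "rare1"],
            ["text", "web", "rare2"],
            ["text", "mining", "web"]]  # documents as lists of naive index terms

    theta = 2  # keep terms occurring in at least theta documents
    vocab = {t for d in docs for t in d}
    doc_freq = {t: sum(t in d for d in docs) for t in vocab}
    feature_terms = sorted(t for t in vocab if doc_freq[t] >= theta)
    print(feature_terms)  # ['mining', 'text', 'web']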