
1 Text mining

2 The Standard Data Mining process

3 Text Mining
Machine learning on text data; also called text data mining or text analysis. Part of Web mining.
Typical tasks include:
- Text categorization (document classification)
- Text clustering
- Text summarization
- Opinion mining
- Entity/concept extraction
- Information retrieval: search engines
- Information extraction: question answering

4 Supervised learning algorithms
- Decision tree learning
- Naïve Bayes
- K-nearest neighbour
- Support Vector Machines
- Neural Networks
- Genetic algorithms

5 Supervised machine learning
1. Build or get a representative corpus
2. Label it
3. Define features
4. Represent documents
5. Learn and analyse
6. Go to 3 until accuracy is acceptable
First features to try: stemmed words
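The loop above can be sketched end-to-end with a toy Naïve Bayes classifier over stemmed-word features. This is a minimal sketch in plain Python; the corpus, the labels, and the crude suffix-stripping "stemmer" are illustrative assumptions, not material from the slides.

```python
import math
from collections import Counter

def tokenize(text):
    # Steps 3-4: features = crudely "stemmed" words (a real system
    # would use a proper stemmer such as Porter's).
    words = text.lower().split()
    return [w.strip(".,").removesuffix("ing").removesuffix("s") for w in words]

def train(docs, labels):
    # Steps 1-2 happen outside: docs is the labelled corpus.
    counts = {}                      # per-class term counts
    priors = Counter(labels)         # class frequencies
    for text, label in zip(docs, labels):
        counts.setdefault(label, Counter()).update(tokenize(text))
    vocab = {w for c in counts.values() for w in c}
    return counts, priors, vocab

def predict(model, text):
    # Step 5: score each class with add-one-smoothed log probabilities.
    counts, priors, vocab = model
    total = sum(priors.values())
    best, best_lp = None, -math.inf
    for label, prior in priors.items():
        lp = math.log(prior / total)
        denom = sum(counts[label].values()) + len(vocab)
        for w in tokenize(text):
            lp += math.log((counts[label][w] + 1) / denom)
        if lp > best_lp:
            best, best_lp = label, lp
    return best

docs = ["the team won the game", "the players scored a goal",
        "stocks fell sharply today", "the market rallied on earnings"]
labels = ["sports", "sports", "finance", "finance"]
model = train(docs, labels)
print(predict(model, "the team scored"))   # → sports
```

If accuracy on held-out documents is unacceptable, step 6 sends you back to feature definition, e.g. adding phrases or better stemming.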

6 Unsupervised Learning
Document clustering:
- k-means
- Hierarchical Agglomerative Clustering (HAC)
- BIRCH
- Association Rule Hypergraph Partitioning (ARHP)
- Categorical clustering (CACTUS, STIRR)
- STC, QDC
- …
Interactive learning:
- Learning from unlabelled data
- Learning to label
- Two systems that teach each other
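The first method in the list, k-means over bag-of-words vectors, can be sketched as follows. The toy documents and the deterministic seeding are assumptions made for reproducibility; real k-means implementations use random initialization with restarts.

```python
import math
from collections import Counter

def bow(text, vocab):
    # Bag-of-words vector: one raw count per vocabulary term.
    c = Counter(text.lower().split())
    return [c[t] for t in vocab]

def kmeans(vectors, k, iters=10):
    # Naive deterministic seeding: pick evenly spaced input vectors.
    step = max(1, len(vectors) // k)
    centroids = vectors[::step][:k]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in vectors:
            nearest = min(range(k), key=lambda i: math.dist(v, centroids[i]))
            clusters[nearest].append(v)
        # Recompute each centroid as the mean of its cluster.
        centroids = [
            [sum(col) / len(c) for col in zip(*c)] if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    # Final assignment of each document to its nearest centroid.
    return [min(range(k), key=lambda i: math.dist(v, centroids[i])) for v in vectors]

docs = ["cat sat mat", "cat cat mat", "stock market fell", "stock market rose"]
vocab = sorted({w for d in docs for w in d.split()})
labels = kmeans([bow(d, vocab) for d in docs], k=2)
print(labels)  # the two cat documents share one label, the two stock documents the other
```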

7 Similarity measure
There are many different ways to measure how similar two documents are, or how similar a document is to a query. The result depends heavily on the choice of terms used to represent the documents.
- Euclidean distance (L2 norm)
- L1 norm (Manhattan distance)
- Cosine similarity
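The three measures listed above can be compared on term vectors directly; a small sketch (the example vectors are illustrative):

```python
import math

def euclidean(a, b):          # L2 norm of the difference
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):          # L1 norm of the difference
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine(a, b):             # cosine of the angle between the vectors
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# A document whose term counts are all doubled points in the same
# direction, so cosine calls it identical while the distances do not:
a, b = [1, 2, 0], [2, 4, 0]
print(cosine(a, b), euclidean(a, b), manhattan(a, b))  # 1.0 2.23606797749979 3
```

This length-invariance is why cosine similarity is the usual choice for documents of very different sizes.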

8 Document Similarity Measures

9 Document Similarity Measures

10 Feature Extraction: Task (1)
Task: extract a good subset of words to represent documents.
Document collection → all unique words/phrases → Feature Extraction → all good words/phrases
Some slides by Huaizhong Kou

11 Feature Extraction
- Task
- Indexing
- Weighting Model
- Dimensionality Reduction

12 Feature Extraction: Task (2)
While more and more textual information is available online, effective retrieval is difficult without good indexing of text content.
The naive feature space consists of the unique terms that occur in the documents, which can be tens or hundreds of thousands of terms even for a moderate-sized text collection. Current techniques cannot deal with such a large set of terms, so it is desirable to reduce the native space without information loss.
Ex. (16 terms) while-more-and-textual-information-is-available-online-effective-retrieval-difficult-without-good-indexing-text-content → Feature Extraction → (5 terms) text-information-online-retrieval-index

13 Feature Extraction: Indexing (1)
Training documents → identification of all unique words → removal of stop words → word stemming → term weighting
- Stop words are non-informative words, ex. {the, and, when, more}.
- Stemming removes suffixes to generate word stems; grouping words increases relevance, ex. {walker, walking} → walk.
- Term weighting estimates the importance of each term in a document.
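The stop-word removal and stemming steps above can be sketched as below. The tiny stop list and the crude suffix rules are illustrative assumptions; a real system uses a full stop list and a proper stemmer such as Porter's.

```python
STOP_WORDS = {"the", "and", "when", "more", "a", "is"}  # tiny illustrative list

def stem(word):
    # Crude suffix removal to generate word stems, e.g. walker/walking -> walk.
    for suffix in ("ing", "er", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def index_terms(text):
    # Lowercase, drop stop words, then stem what remains.
    words = [w.strip(".,").lower() for w in text.split()]
    return [stem(w) for w in words if w not in STOP_WORDS]

print(index_terms("The walker and the walking man"))  # ['walk', 'walk', 'man']
```

Note how the slide's example holds: walker and walking collapse to the same stem, so the two occurrences reinforce one feature instead of splitting across two.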

14 Feature Extraction: Indexing (2)
The Vector Space Model (VSM) is one of the most commonly used text data models:
- Any text document is represented by a vector of terms.
- Terms are typically words and/or phrases.
- Every term in the vocabulary becomes an independent dimension.
- Each term occurring in a document contributes a non-zero value in the corresponding dimension.
A document collection is represented as a matrix, where xji is the weight of the ith term in the jth document.
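Building that term-document matrix with raw counts as weights can be sketched in a few lines (the example documents are illustrative):

```python
from collections import Counter

def term_document_matrix(docs):
    # Every unique term in the collection becomes one dimension.
    vocab = sorted({w for d in docs for w in d.lower().split()})
    # Row j is document j; x[j][i] is the (raw-count) weight of term i in doc j.
    matrix = []
    for d in docs:
        counts = Counter(d.lower().split())
        matrix.append([counts[t] for t in vocab])
    return vocab, matrix

vocab, x = term_document_matrix(["to be or not to be", "be here now"])
print(vocab)  # ['be', 'here', 'not', 'now', 'or', 'to']
print(x)      # [[2, 0, 1, 0, 1, 2], [1, 1, 0, 1, 0, 0]]
```

The weighting schemes on the next slides (tf, tf-idf, entropy) replace the raw counts with more discriminative values.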

15 Feature Extraction: Weighting Model (1)
tf - Term Frequency weighting:
wij = Freqij
Freqij ::= the number of times the jth term occurs in document Di.
Drawback: does not reflect a term's importance for discriminating between documents.
Ex. D1 = ABRTSAQWAXAO, D2 = RTABBAXAQSAK
Term: A B K O Q R S T W X
D1:   4 1 0 1 1 1 1 1 1 1
D2:   4 2 1 0 1 1 1 1 0 1

16 Feature Extraction: Weighting Model (2)
tfidf - Inverse Document Frequency weighting:
wij = Freqij * log(N / DocFreqj)
N ::= the number of documents in the training document collection.
DocFreqj ::= the number of documents in which the jth term occurs.
Advantage: reflects a term's importance for discriminating between documents.
Assumption: terms with low DocFreq are better discriminators than terms with high DocFreq in the document collection.
Ex. (the documents D1, D2 and terms A B K O Q R S T W X of the previous slide)

17 Feature Extraction: Weighting Model (3)
Tf-idf weighting (previous slide). Ref: [11][22]
Entropy weighting: wij = log(Freqij + 1) * (1 + ei), where ei = (1 / log N) * Σj pij * log(pij) is the average entropy of the ith term, pij = Freqij / GlobalFreqi, and Freqij is the frequency of the ith term in the jth document.
ei = -1 if the word occurs once in every document; ei = 0 if the word occurs in only one document. Ref: [13]
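The endpoint behaviour stated on this slide can be checked numerically. This sketch assumes the standard log-entropy definition ei = (1 / log N) Σj pij log pij with pij = Freqij / GlobalFreqi; the exact formula is an assumption here, since the slide's equation image is not reproduced in the transcript.

```python
import math

def avg_entropy(freqs):
    # freqs[j] = frequency of the term in document j, across N documents.
    n = len(freqs)
    gf = sum(freqs)  # global frequency of the term
    return sum((f / gf) * math.log(f / gf) for f in freqs if f) / math.log(n)

print(avg_entropy([1, 1, 1, 1]))  # -1.0  (once in every document)
print(avg_entropy([5, 0, 0, 0]))  #  0.0  (in only one document)
```

So (1 + ei) ranges from 0 for a term spread evenly everywhere to 1 for a term concentrated in a single document, which is what makes the weight discriminative.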

18 Feature Extraction: Dimension Reduction
- Document Frequency Thresholding
- χ2-statistic
- Latent Semantic Indexing
- Information Gain
- Mutual Information

19 Dimension Reduction: DocFreq Thresholding
Training documents → naive terms → calculate DocFreq(w) → set threshold θ → remove all words with DocFreq < θ → feature terms
Calculate the document frequency DocFreq(w) for each term in the training collection. Set a threshold θ and remove every term with DocFreq(w) < θ.
Rationale: rare terms are either non-informative for category prediction or not influential in global performance. This is the simplest method, with the lowest computational cost.
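The whole method fits in a few lines; a sketch with an illustrative toy collection and threshold:

```python
from collections import Counter

def docfreq_threshold(docs, theta):
    # DocFreq(w) = number of training documents that contain w.
    df = Counter()
    for d in docs:
        df.update(set(d.lower().split()))
    # Remove every term with DocFreq(w) < theta; keep the rest as features.
    return {w for w, c in df.items() if c >= theta}

docs = ["apple banana", "apple cherry", "apple banana date"]
print(sorted(docfreq_threshold(docs, theta=2)))  # ['apple', 'banana']
```

One linear pass over the collection suffices, which is why this is the cheapest of the reduction techniques listed on the previous slide.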

