1 CSC 594 Topics in AI – Text Mining and Analytics Fall 2015/16 7. Topic Extraction.


2 Text Topics
Word association, represented by concept links, is useful for understanding the relationships between terms (as concepts). The same idea can be applied to understand how documents are associated with a topic.

3 Problems with “Term as Topic”
Using a single term to define a topic is problematic.
– Lack of expressive power: a single term can represent only simple topics, not complicated ones.
– Incomplete vocabulary coverage: a single term cannot capture variations in vocabulary (e.g. related terms).
– Word ambiguity: many words have more than one meaning or sense.

4 Multiple Terms as Topic
A solution is to use multiple terms to define a topic.
– Topic = {word1, word2, ...}
– A weight assigned to each term represents the importance or relevance of that term in the topic.
– Every document in the corpus can be given a score that represents the strength of its association with a topic.
– A document can contain zero, one, or many topics.
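As a minimal sketch of the idea on this slide, the Python below scores documents against topics defined as weighted term sets. The topic names, the weights, the length-normalized scoring function, and the cutoff are all assumptions made for illustration; the slide does not specify how the score is computed.

```python
import re
from collections import Counter

# Hypothetical topics: each maps a term to an illustrative weight.
TOPICS = {
    "soccer": {"soccer": 0.5, "play": 0.3, "kids": 0.2},
    "tablet": {"ipad": 0.7, "kids": 0.3},
}
CUTOFF = 0.05  # assumed threshold for "document contains topic"


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def topic_score(text, topic):
    """Weighted count of topic terms, divided by document length."""
    counts = Counter(tokenize(text))
    length = sum(counts.values())
    if length == 0:
        return 0.0
    return sum(weight * counts[term] for term, weight in topic.items()) / length


def topics_of(text):
    """Names of all topics whose score reaches the cutoff."""
    return sorted(
        name for name, topic in TOPICS.items()
        if topic_score(text, topic) >= CUTOFF
    )


docs = [
    "The weather is nice.",
    "iPad is great for kids.",
    "Kids love to play soccer.",
]
assigned = {doc: topics_of(doc) for doc in docs}
for doc, names in assigned.items():
    print(doc, "->", names)
```

With these assumed weights the three documents fall under zero, one, and two topics respectively, which mirrors the last bullet of the slide.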

5 Approach (1): Probabilistic Topic Mining (Coursera, Text Mining and Analytics, ChengXiang Zhai)

6 Topic as Word Distribution (Coursera, Text Mining and Analytics, ChengXiang Zhai)

7 Probabilistic Topic Mining (Coursera, Text Mining and Analytics, ChengXiang Zhai)

8 Techniques for Probabilistic Topic Mining
Several techniques have been used in probabilistic topic mining to extract topics.
– Maximum Likelihood
– Bayesian
– Mixture Model (where parameters are typically estimated using the Expectation-Maximization (EM) algorithm)

9 Mixture Model for Topic Extraction (1) (Coursera, Text Mining and Analytics, ChengXiang Zhai)

10 Mixture Model for Topic Extraction (2) (Coursera, Text Mining and Analytics, ChengXiang Zhai)

11 Mixture Model as a Generative Model (Coursera, Text Mining and Analytics, ChengXiang Zhai)

12 Mixture of Two Unigram Language Models (Coursera, Text Mining and Analytics, ChengXiang Zhai)

13–15 [slide bodies not captured in transcript] (Coursera, Text Mining and Analytics, ChengXiang Zhai)

16 Expectation-Maximization (EM) Algorithm (Coursera, Text Mining and Analytics, ChengXiang Zhai)

17 [slide body not captured in transcript] (Coursera, Text Mining and Analytics, ChengXiang Zhai)
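The slide bodies for the mixture model and the EM algorithm are not in this transcript, so the following is only a sketch of one common setup, not a reconstruction of the slides. It assumes a document is generated by a mixture of two unigram language models: a known background model and an unknown topic model, with a fixed mixing weight `lam` for the background. The word counts, the background probabilities, and the value of `lam` are made-up illustration values.

```python
import math


def em_topic_model(counts, background, lam=0.5, iters=100):
    """Estimate p(w | topic) for a two-component unigram mixture with EM.

    counts:     word -> count in the document
    background: word -> p(w | background), assumed known and non-zero
    lam:        assumed known probability of choosing the background model
    """
    words = list(counts)
    topic = {w: 1.0 / len(words) for w in words}  # uniform initial guess
    log_liks = []
    for _ in range(iters):
        # E-step: probability that each word was generated by the topic model.
        posterior = {}
        log_lik = 0.0
        for w in words:
            from_topic = (1.0 - lam) * topic[w]
            from_background = lam * background[w]
            posterior[w] = from_topic / (from_topic + from_background)
            log_lik += counts[w] * math.log(from_topic + from_background)
        log_liks.append(log_lik)
        # M-step: re-estimate the topic model from the expected counts.
        total = sum(counts[w] * posterior[w] for w in words)
        topic = {w: counts[w] * posterior[w] / total for w in words}
    return topic, log_liks


# Toy document and background model (illustrative values only).
counts = {"the": 10, "of": 6, "text": 4, "mining": 3, "topic": 2}
background = {"the": 0.5, "of": 0.3, "text": 0.1, "mining": 0.05, "topic": 0.05}

topic, log_liks = em_topic_model(counts, background)
n_words = sum(counts.values())
empirical = {w: c / n_words for w, c in counts.items()}
for w in counts:
    print(f"{w:8s} empirical={empirical[w]:.3f} topic={topic[w]:.3f}")
```

In this sketch the background model absorbs much of the probability mass of common words such as "the", so the estimated topic distribution gives relatively more weight to content words than the raw word frequencies do. The log-likelihood recorded at each iteration should never decrease, which is a useful sanity check on any EM implementation.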

18 Approach (2): Dimensionality Reduction for Topic Extraction
Reduced dimensions can also be considered topics. Singular Value Decomposition (SVD) derives singular vectors (SVD dimensions, analogous to principal components) → Topics.
D1: “I love iPad.”
D2: “iPad is great for kids.”
D3: “Kids love to play soccer.”
D4: “I play soccer at OSU.”
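To make the SVD idea concrete, here is a small sketch that builds a document-by-term count matrix from the four example documents and extracts the top two SVD dimensions. It uses plain-Python power iteration on AᵀA so that it runs without extra libraries; a linear algebra library's SVD routine would normally be used instead. The stop-word list, the raw-count weighting, and the choice of two dimensions are assumptions for illustration.

```python
import re

DOCS = [
    "I love iPad.",
    "iPad is great for kids.",
    "Kids love to play soccer.",
    "I play soccer at OSU.",
]
STOP_WORDS = {"i", "is", "for", "to", "at"}  # assumed stop list


def tokenize(text):
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matvec(M, v):
    return [dot(row, v) for row in M]


def top_singular_vectors(A, k, iters=300):
    """Top-k right singular vectors of A via power iteration on A^T A."""
    At = [list(col) for col in zip(*A)]
    found = []
    for _ in range(k):
        v = [1.0 + 0.1 * j for j in range(len(At))]  # fixed, non-random start
        for _ in range(iters):
            v = matvec(At, matvec(A, v))
            # Remove components along the vectors already found.
            for u in found:
                proj = dot(v, u)
                v = [a - proj * b for a, b in zip(v, u)]
            length = dot(v, v) ** 0.5
            v = [a / length for a in v]
        found.append(v)
    return found


tokens = [tokenize(d) for d in DOCS]
vocab = sorted({w for doc in tokens for w in doc})
A = [[doc.count(t) for t in vocab] for doc in tokens]  # documents x terms

term_weights = top_singular_vectors(A, k=2)         # one vector per topic
doc_coords = [matvec(A, v) for v in term_weights]   # document coordinates
sigmas = [dot(c, c) ** 0.5 for c in doc_coords]     # singular values

for i, (v, c) in enumerate(zip(term_weights, doc_coords), start=1):
    print(f"SVD dimension {i} (singular value {sigmas[i - 1]:.3f})")
    print("  terms:", {t: round(w, 2) for t, w in zip(vocab, v)})
    print("  docs: ", [round(x, 2) for x in c])
```

The sign of each singular vector is arbitrary, so only relative signs are meaningful. On this toy corpus the second dimension gives D1 and D2 (the iPad documents) one sign and D3 and D4 (the soccer documents) the other, which is the sense in which a reduced dimension behaves like a topic.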

19 Example: Topics extracted by SAS Enterprise Miner for the Yelp data

20 Term topic weight – relevance of the term in the topic
Each term is assigned a weight for each topic. Since each topic is an SVD dimension, the term topic weights for a term are the coordinates of the term in the SVD space. The term cutoff is used to determine whether a term belongs to a topic.
Document topic weight – relevance of the document to the topic
Every document in the corpus is assigned a weight for each topic. The document topic weight of a document for a topic is the normalized sum, over the terms in the document, of each term's TF*IDF weight multiplied by its term topic weight. The document cutoff is used to determine whether a document belongs to a topic.
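The weighting scheme on this slide can be sketched as follows. This is not SAS Enterprise Miner's actual implementation: the term topic weights, both cutoffs, the IDF formula (log of N over document frequency), and the normalization (dividing by the length of the document's TF*IDF vector) are all assumptions chosen for illustration, since the slide does not define them.

```python
import math
from collections import Counter

# Tokenized example documents (stop words already removed).
DOCS = {
    "D1": ["love", "ipad"],
    "D2": ["ipad", "great", "kids"],
    "D3": ["kids", "love", "play", "soccer"],
    "D4": ["play", "soccer", "osu"],
}

# Hypothetical term topic weights for a single "soccer" topic.
TERM_TOPIC_WEIGHTS = {"soccer": 0.6, "play": 0.6, "osu": 0.4, "kids": 0.2, "love": 0.1}
TERM_CUTOFF = 0.3  # assumed
DOC_CUTOFF = 0.5   # assumed


def tfidf_vectors(docs):
    """TF*IDF per document, with IDF = log(N / document frequency)."""
    n = len(docs)
    df = Counter(term for toks in docs.values() for term in set(toks))
    return {
        name: {t: tf * math.log(n / df[t]) for t, tf in Counter(toks).items()}
        for name, toks in docs.items()
    }


def document_topic_weight(vec, term_weights):
    """Sum of TF*IDF x term topic weight, normalized by the vector length."""
    length = math.sqrt(sum(x * x for x in vec.values()))
    if length == 0:
        return 0.0
    return sum(x * term_weights.get(t, 0.0) for t, x in vec.items()) / length


# Terms whose weight reaches the term cutoff belong to the topic.
topic_terms = sorted(t for t, w in TERM_TOPIC_WEIGHTS.items() if abs(w) >= TERM_CUTOFF)

# Documents whose weight reaches the document cutoff belong to the topic.
doc_weights = {
    name: document_topic_weight(vec, TERM_TOPIC_WEIGHTS)
    for name, vec in tfidf_vectors(DOCS).items()
}
topic_docs = sorted(name for name, w in doc_weights.items() if w >= DOC_CUTOFF)

print("topic terms:", topic_terms)
print("document weights:", {k: round(v, 3) for k, v in doc_weights.items()})
print("topic documents:", topic_docs)
```

Applying the two cutoffs turns the continuous weights into yes/no membership for terms and documents.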

21 Interpretability of Extracted Topics
A topic as a collection of weighted terms provides precise information about the topic. However, some analysts find binary topics, in which a term or document either belongs to the topic or does not, easier to understand.

