CSC 594 Topics in AI – Text Mining and Analytics


1 CSC 594 Topics in AI – Text Mining and Analytics
Fall 2015/16 11. Wrap Up

2 Text Mining [from Wikipedia]
- “Text mining refers to the process of deriving high-quality information from text.”
- “The overarching goal is, essentially, to turn text into data for analysis, via application of natural language processing (NLP) and analytical methods.”

3 Text Mining is Growing (1)
“North America text analytics market is expected to reach a value of $1,995.8 million by 2019 according to new research report” (Information Communications Media Technology Market News, October 15, 2015) This market is estimated to grow from $827.1 million in 2014 to $1,995.8 million by 2019, at a Compound Annual Growth Rate (CAGR) of 19.3% from 2014 to 2019.

4 Text Mining is Growing (2)
“Discover the text analytics market -- ” (Information Communications Media Technology Market News, November 4, 2015) Factors which are driving the growth of global text analytics service market are growing demand of social media analysis for effective brand building, development of multilingual text analytics to overcome language barriers, increasing concern of financial frauds and growing big data market. On the other hand, factors which are restraining the growth of global text analytics market are lack of awareness among end users about software handling, high deployment cost and compliance issue with present IT infrastructure. However, added advantage of predictive analytics and credibility to analyse big data is expected to create great opportunity for text analytics market in future.

5 Text Mining is Hard…(?)
Data Collection:
- Raw texts are ‘dirty’: markup tags, nonsense words/symbols, irregular punctuation, misspellings.
- Collected data becomes huge in size.
Text (Pre-)Processing: so many ‘options’
- Segmentation (text unit): whole document vs. paragraph vs. sentence vs. n-word context window, or specific patterns (e.g. <Adj><Noun>).
- Tokenization: stemming/lemmatization, case normalization, removing punctuation.
- Term: removing stop words, defining a ‘keep’ list, POS, synonyms.
- Transformation: various term weighting schemes, dimensionality reduction (by top N terms, PCA, model parameter coefficients, etc.).
- We don’t know how each option affects the result until we generate the result → need for iterative experiments (i.e., a feedback loop).
Mining/Analysis Step:
- The whole of machine learning and data mining comes after structured data is obtained.
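The pre-processing options above can be sketched in a few lines of Python. This is only an illustrative pipeline, not the course's tooling: the stop-word list, the function names `tokenize` and `tf_idf`, and the choice of tf × log(N/df) as the weighting scheme are all assumptions made for the example.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "and", "to", "in"}

def tokenize(text, lowercase=True, remove_stop_words=True):
    """Strip markup tags, split into word tokens, and optionally normalize."""
    text = re.sub(r"<[^>]+>", " ", text)        # drop markup tags
    if lowercase:
        text = text.lower()                      # case normalization
    tokens = re.findall(r"[A-Za-z]+", text)      # letters only: punctuation and digits are dropped
    if remove_stop_words:
        tokens = [t for t in tokens if t.lower() not in STOP_WORDS]
    return tokens

def tf_idf(docs_tokens):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    n_docs = len(docs_tokens)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in docs_tokens for term in set(tokens))
    weighted = []
    for tokens in docs_tokens:
        tf = Counter(tokens)
        weighted.append({t: c * math.log(n_docs / df[t]) for t, c in tf.items()})
    return weighted

docs = ["<p>The cat sat.</p>", "The dog sat!", "A cat and a dog"]
tokens = [tokenize(d) for d in docs]
weights = tf_idf(tokens)
```

Each keyword argument is one of the ‘options’: flipping `lowercase` or `remove_stop_words` changes the term set, which is why the effect can only be judged by rerunning the downstream analysis.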

6 Survey
In your midterm project, did you do..?
- Stemming
- Case normalization
- Removing punctuation
- Removing stop words
- POS-tagging
- Synonym creation
- Term weighting schemes
- Dimensionality reduction

7 Word Frequency
- The most naïve form of text mining is to look at word frequency.
- But surprisingly, word frequency provides a lot of useful information (when the data size is large)…
- A good article: “Where to start with text mining”
- Google Ngram Viewer
- Word Cloud
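As a minimal sketch of word-frequency counting, the following uses Python's standard `collections.Counter`. The two-sentence corpus and the helper name `word_frequencies` are made up for illustration; a real corpus would be far larger.

```python
import re
from collections import Counter

def word_frequencies(texts):
    """Count lowercase word tokens across all texts."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"[a-z]+", text.lower()))
    return counts

corpus = ["Text mining turns text into data.", "Mining data is iterative."]
freq = word_frequencies(corpus)
top = freq.most_common(3)   # the three most frequent words with their counts
```

The same counts are what a word cloud visualizes: font size is typically scaled by frequency.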

8 Word Association, Concept Linking
- A slightly more sophisticated analysis, but still based on frequency.
- Two words/concepts occur TOGETHER more often than chance would predict.
- Typically, pointwise mutual information (PMI) or a likelihood-based measure is used to quantify the strength of the co-occurrence.
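For reference, PMI compares the observed joint probability of two words with what independence would predict: PMI(x, y) = log2( P(x, y) / (P(x) P(y)) ). The sketch below estimates those probabilities at the document level (the fraction of documents containing a word or a pair), which is one common choice and an assumption here; the toy documents and the name `pmi_scores` are illustrative.

```python
import math
from collections import Counter
from itertools import combinations

def pmi_scores(docs_tokens):
    """Return {(x, y): PMI} for every word pair that co-occurs in a document."""
    n = len(docs_tokens)
    word_df = Counter()   # number of documents containing each word
    pair_df = Counter()   # number of documents containing both words
    for tokens in docs_tokens:
        vocab = sorted(set(tokens))              # sorted so each pair has one canonical order
        word_df.update(vocab)
        pair_df.update(combinations(vocab, 2))
    return {
        (x, y): math.log2((c / n) / ((word_df[x] / n) * (word_df[y] / n)))
        for (x, y), c in pair_df.items()
    }

docs = [["new", "york", "city"], ["new", "york"], ["new", "car"], ["old", "car"]]
scores = pmi_scores(docs)
```

A positive score (here for "new" and "york") means the pair co-occurs more often than chance; a negative score (here for "car" and "new") means less often. With counts this small the estimates are unreliable, so in practice a minimum frequency cutoff is usually applied.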

9 Clustering, Topic Extraction
- Discover the overall grouping of the corpus.
- Clustering: a document is assigned to exactly one cluster.
- Topic: a document can be assigned to multiple clusters/topics.
- Clusters/topics are defined through terms/words: look at cluster centroids or term-cluster relevancy scores.
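The difference between the two assignment styles can be illustrated with a small sketch. It assumes the centroids have already been computed by some clustering step; the centroid values, the 0.3 threshold, and the function names are hypothetical, and thresholding cosine similarity is only a crude stand-in for the per-document topic weights a real topic model would produce.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def hard_assign(doc, centroids):
    """Clustering view: the single most similar centroid wins."""
    return max(centroids, key=lambda name: cosine(doc, centroids[name]))

def soft_assign(doc, centroids, threshold=0.3):
    """Topic view: every centroid that is similar enough is kept."""
    return {name for name, c in centroids.items() if cosine(doc, c) >= threshold}

def top_terms(centroid, n=2):
    """Describe a cluster by its highest-weighted terms."""
    return [t for t, _ in sorted(centroid.items(), key=lambda kv: -kv[1])[:n]]

# Hypothetical centroids, as if produced by an earlier clustering run.
centroids = {
    "sports": {"game": 0.9, "team": 0.8, "score": 0.5},
    "finance": {"market": 0.9, "stock": 0.8, "score": 0.2},
}
doc = {"team": 1.0, "market": 1.0, "game": 1.0}
```

Here `hard_assign(doc, centroids)` picks only "sports", while `soft_assign(doc, centroids)` keeps both "sports" and "finance"; `top_terms` gives the term-based description of a cluster.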

10 Text Categorization
- Build a classification/prediction model for texts.
- Goal 1: An optimal classifier (for the purpose of classification/prediction).
- Goal 2: Learn the domain of the texts (e.g. important features for each target category, such as POS/NEG reviews).
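Both goals can be served by one model. As a rough sketch, the multinomial Naive Bayes classifier below (chosen here only for brevity; the slides do not name a specific algorithm) predicts a label for Goal 1 and ranks terms by log-odds between the classes for Goal 2. The training reviews and the class name `NaiveBayes` are invented for the example.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    def fit(self, docs_tokens, labels):
        self.class_counts = Counter(labels)
        self.term_counts = defaultdict(Counter)   # label -> term -> count
        for tokens, label in zip(docs_tokens, labels):
            self.term_counts[label].update(tokens)
        self.vocab = {t for c in self.term_counts.values() for t in c}
        return self

    def _log_likelihood(self, term, label):
        counts = self.term_counts[label]
        total = sum(counts.values())
        # Add-one (Laplace) smoothing so unseen terms do not zero out a class.
        return math.log((counts[term] + 1) / (total + len(self.vocab)))

    def predict(self, tokens):
        """Goal 1: return the most probable label for a tokenized text."""
        n = sum(self.class_counts.values())
        def score(label):
            prior = math.log(self.class_counts[label] / n)
            return prior + sum(self._log_likelihood(t, label)
                               for t in tokens if t in self.vocab)
        return max(self.class_counts, key=score)

    def top_features(self, label, other, n=2):
        """Goal 2: terms with the highest log-odds for `label` versus `other`."""
        log_odds = {t: self._log_likelihood(t, label) - self._log_likelihood(t, other)
                    for t in self.vocab}
        return sorted(log_odds, key=log_odds.get, reverse=True)[:n]

# Tiny made-up training set of tokenized reviews.
train = [
    (["great", "plot", "great", "acting"], "POS"),
    (["loved", "the", "acting"], "POS"),
    (["boring", "plot"], "NEG"),
    (["terrible", "boring", "acting"], "NEG"),
]
model = NaiveBayes().fit([t for t, _ in train], [y for _, y in train])
```

On this toy data, `model.predict(["great", "acting"])` yields "POS", and `model.top_features("NEG", "POS", n=1)` surfaces "boring" as the most NEG-indicative term, which is the kind of domain insight Goal 2 is after.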

