CSC 594 Topics in AI – Text Mining and Analytics


1 CSC 594 Topics in AI – Text Mining and Analytics
Fall 2015/16 1. Introduction

2 Unstructured Data
“80% of business-relevant information originates in unstructured form, primarily text.” (a quote in 2008)
“Based on the industry’s current estimations, unstructured data will occupy 90% of the data by volume in the entire digital space over the next decade.” (a quote in 2010)
“The possibilities for data mining from large text collections are virtually untapped. Text expresses a vast, rich range of information, but encodes this information in a form that is difficult to decipher automatically. For example, it is much more difficult to graphically display textual content than quantitative data.” (Marti Hearst, UC Berkeley, 2007)

3 IBM Watson Content Analytics: Discover Hidden Value in Your Unstructured Data

4 Text Mining and Analytics
We use the terms text analytics, text data mining, and text mining almost synonymously in this course.
Text analytics uses algorithms to turn free-form text (unstructured data) into data that can be analyzed (structured data) by applying statistical and machine learning methods, as well as Natural Language Processing (NLP) techniques.
Once structured data is obtained, the same mining and analytic techniques used for other structured data can be applied. So the most significant part of Text Mining/Analytics is how to convert texts into structured data.
This course covers SAS Enterprise Miner and SAS Text Miner. Other courses cover Teragram products such as SAS Content Categorization, SAS Sentiment Analysis, and SAS Ontology Management.
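As a rough illustration of what "structured data" can mean here, the sketch below turns a few free-form comments into a document-term count matrix using only the Python standard library. The comments, the function name `build_term_matrix`, and the crude letters-only tokenizer are assumptions made for this example; this is not how SAS Text Miner does it.

```python
import re
from collections import Counter

def build_term_matrix(docs):
    """Return (vocabulary, rows), where rows[i][j] counts term j in document i."""
    # Lowercase and keep runs of letters only: a deliberately crude tokenizer.
    tokenized = [re.findall(r"[a-z]+", doc.lower()) for doc in docs]
    vocabulary = sorted({tok for toks in tokenized for tok in toks})
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        rows.append([counts[term] for term in vocabulary])
    return vocabulary, rows

# Hypothetical free-form comments (invented for this example).
docs = [
    "The staff was friendly and helpful.",
    "Friendly doctor, fast service.",
]
vocabulary, rows = build_term_matrix(docs)
print(vocabulary)
for row in rows:
    print(row)
```

Each row is now an ordinary numeric record, so it could in principle be passed to the same clustering or predictive tools used for any other structured data.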

5 Converting Text into Structured Data
A huge amount of preprocessing is required to convert text.
Cleaning up ‘dirty’ texts
  Remove mark-up tags from web documents, encoded symbols such as emoticons/emoji, and extraneous strings such as “AHHHHHHHHHHHHHHHHHHHHH”
  Correct misspelled words
Tokenization
  Remove punctuation, normalize upper/lower case, etc.
Sentence splitting
Identifying multi-word expressions (e.g. “as well as”, “radio wave”) and Named Entities (e.g. “Allied Waste”, “Super Mario Bros.”)
Adding other linguistic information
  Parts-of-speech (e.g. noun, verb, adjective, adverb, preposition)
Filtering non-significant/irrelevant words – to reduce dimensions
  Filtering non-content words using a stop-list (e.g. “the”, “a”, “an”, “and”)
  Combining tokens by stemming/lemmatizing or using synonyms
Other NLP features/techniques, e.g. n-grams, syntax trees
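A few of these steps can be sketched with the Python standard library alone. The function names, the rule for spotting extraneous strings (any token with a character repeated four or more times), and the suffix-stripping "stemmer" are simplifying assumptions for illustration; a real stemmer or lemmatizer is considerably more careful.

```python
import html
import re

STOP_WORDS = {"the", "a", "an", "and"}  # the stop-list examples from the slide

def clean(text):
    """Drop mark-up tags and decode HTML entities."""
    text = re.sub(r"<[^>]+>", " ", text)
    return html.unescape(text)

def tokenize(text):
    """Lowercase and keep runs of letters, which also drops punctuation."""
    return re.findall(r"[a-z]+", text.lower())

def naive_stem(token):
    """Crude suffix stripping; a stand-in for a real stemmer (assumption)."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    tokens = tokenize(clean(text))
    # Treat tokens with a character repeated 4+ times as extraneous strings.
    tokens = [t for t in tokens if not re.search(r"(.)\1{3,}", t)]
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return [naive_stem(t) for t in tokens]

print(preprocess("<p>AHHHHHHHH the doctors and the staff helped!</p>"))
# ['doctor', 'staff', 'help']
```

The order of the steps matters: the stop-list is applied before stemming so that the stemmer never has to deal with function words.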

6 Text Mining Process Pipeline
The process is essentially a linear pipeline. However, feedback from the results of text mining might affect earlier preprocessing steps (parsing, or even data collection).

7 Text Mining Paradigm

8 Data Mining – Two Broad Areas
Pattern Discovery/Exploratory Analysis (Unsupervised Learning)
  There is no target variable, and some form of analysis is performed to do the following:
    identify or define homogeneous groups, clusters, or segments
    find links or associations between entities, as in market basket analysis
Prediction (Supervised Learning)
  A target variable is used, and some form of predictive or classification model is developed.
  Input variables are associated with values of a target variable, and the model produces a predicted target value for a given set of inputs.
Classification refers to identifying “natural” groups, such as identifying different breeds within a species. Anthropology and other sciences try to find clear boundaries between groups to help define natural classification schemes. On the other hand, the same algorithms that can identify natural groups can be applied to any data set, even if no natural grouping exists.

9 Text Mining Applications – Unsupervised
Information retrieval (IR)
  finding documents with relevant content of interest
  used for researching medical, scientific, legal, and news documents such as books and journal articles
Document categorization for organizing
  clustering documents into naturally occurring groups
  extracting themes or concepts
Anomaly detection
  identifying unusual documents that might be associated with cases requiring special handling, such as unhappy customers, fraud activity, and so on
Relate the major application areas to the two abstract goals.

10 Text Mining Applications – Unsupervised
Text clustering

  Cluster No.   Comment    Key Words
  1             1, 3, 4    doctor, staff, friendly, helpful
  2             5, 6, 8    treatment, results, time, schedule
  3             2, 7       service, clinic, fast

Trend analysis
  Figure: Trend for the term “text mining” from Google Trends

11 Text Mining Applications – Supervised
Many typical predictive modeling or classification applications can be enhanced by incorporating textual data in addition to traditional input variables.
  churn propensity models that include customer center notes, website forms, e-mails, and Twitter messages
  hospital admission prediction models incorporating medical records notes as a new source of information
  insurance fraud modeling using adjustor notes
  sentiment categorization (next page)
  stylometry or forensic applications that identify the author of a particular writing sample
Relate the major application areas to the two abstract goals.

12 Sentiment Analysis
The field of sentiment analysis deals with categorization (or classification) of opinions expressed in textual documents.
Figure: Green represents positive tone, red represents negative tone, and product features and model names are highlighted in blue and brown, respectively.
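One very simple way to approximate this kind of categorization is to count words from positive and negative word lists. The sketch below assumes a tiny invented lexicon and a hypothetical `classify_sentiment` function; it is not the method used by SAS Sentiment Analysis, and real systems rely on much larger lexicons or trained models.

```python
import re

# Hypothetical mini-lexicon, invented for this example.
POSITIVE = {"friendly", "helpful", "fast", "great"}
NEGATIVE = {"slow", "rude", "broken", "poor"}

def classify_sentiment(text):
    """Label a text by comparing counts of positive and negative words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(classify_sentiment("The staff was friendly and helpful."))
print(classify_sentiment("Slow service and a rude receptionist."))
```

A word-counting approach like this ignores negation ("not helpful") and context, which is one reason sentiment categorization is usually treated as a supervised classification problem.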

13 Structured + Text Data in Predictive Models
Use of both types of data in building predictive models.
Anomaly detection includes examining internet traffic looking for terrorist sources.
The course notes discuss how text mining references present general application areas in different ways.
Figure: ROC Chart of Models With and Without Textual Comments

14 Discussion
What sort of pattern discovery or predictive modeling application do you have in mind that can incorporate text data?

15 Typical Text Pre-processing Steps
Given a raw text (in a corpus), we typically pre-process the text by applying the following tasks in order:
1. Part-Of-Speech (POS) tagging – assign a POS to every word in a sentence in the text
2. Named Entity Recognition (NER) – identify named entities (proper nouns and some common nouns which are relevant in the domain of the text)
3. Shallow Parsing – identify the phrases (mostly verb phrases) which involve named entities
4. Information Extraction (IE) – identify relations between phrases, and extract the relevant/significant “information” described in the text
Source: Andrew McCallum, UMass Amherst

16 1. Part-Of-Speech (POS) Tagging
POS tagging is a process of assigning a POS or lexical class marker to each word in a sentence (and all sentences in a corpus).
Input: the lead paint is unsafe
Output: the/Det lead/N paint/N is/V unsafe/Adj
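The input/output pair above can be reproduced with a toy lexicon-based tagger. The lexicon, the tag preferences, and the single contextual rule (prefer a noun reading right after a determiner) are assumptions made for this sketch; practical taggers are trained statistically on annotated corpora.

```python
# Hypothetical lexicon: possible tags per word, default choice listed first.
LEXICON = {
    "the": ["Det"],
    "lead": ["V", "N"],   # ambiguous: "to lead" vs. the metal
    "paint": ["N", "V"],
    "is": ["V"],
    "unsafe": ["Adj"],
}

def tag(sentence):
    """Assign one tag per word using the lexicon plus one contextual rule."""
    tagged = []
    previous_tag = None
    for word in sentence.lower().split():
        candidates = LEXICON.get(word, ["N"])  # unknown words default to noun
        choice = candidates[0]
        # Contextual rule: right after a determiner, prefer the noun reading.
        if previous_tag == "Det" and "N" in candidates:
            choice = "N"
        tagged.append((word, choice))
        previous_tag = choice
    return tagged

print(" ".join(f"{word}/{pos}" for word, pos in tag("the lead paint is unsafe")))
# the/Det lead/N paint/N is/V unsafe/Adj
```

The word "lead" shows why context matters: a lexicon lookup alone would pick the verb reading here.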

17 2. Named Entity Recognition (NER)
NER processes a text and identifies the named entities in each sentence, e.g. “U.N. official Ekeus heads for Baghdad.”
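A minimal way to picture NER is a gazetteer lookup, sketched below for the example sentence. The gazetteer entries and their labels (ORG, PER, LOC) are assumptions for illustration, since the slide does not show the labels; real NER systems combine such lists with context-sensitive statistical models.

```python
# Hypothetical gazetteer; the entity labels are assumed, not given on the slide.
GAZETTEER = {"U.N.": "ORG", "Ekeus": "PER", "Baghdad": "LOC"}

def find_entities(sentence):
    """Return (entity, label) pairs for tokens found in the gazetteer."""
    entities = []
    for token in sentence.split():
        # Keep the token as-is if it is a known entry (e.g. "U.N."),
        # otherwise strip trailing sentence punctuation before the lookup.
        candidate = token if token in GAZETTEER else token.rstrip(".,;:!?")
        if candidate in GAZETTEER:
            entities.append((candidate, GAZETTEER[candidate]))
    return entities

print(find_entities("U.N. official Ekeus heads for Baghdad."))
# [('U.N.', 'ORG'), ('Ekeus', 'PER'), ('Baghdad', 'LOC')]
```

A pure lookup cannot handle names it has never seen or multi-word names, which is why it only serves as a starting point.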

18 3. Shallow Parsing
Shallow (or partial) parsing identifies the (base) syntactic phrases in a sentence.
  [NP He] [V saw] [NP the big dog]
After NEs are identified, dependency parsing is often applied to extract the syntactic/dependency relations between the NEs.
  [PER Bill Gates] founded [ORG Microsoft].
Dependency relations (head “found”, with dependents “Bill Gates” and “Microsoft”):
  nsubj(found, Bill Gates)
  dobj(found, Microsoft)

19 4. Information Extraction (IE)
Identify specific pieces of information (data) in an unstructured or semi-structured text
Transform unstructured information in a corpus of texts or web pages into a structured database (or templates)
Applied to various types of text, e.g.
  Newspaper articles
  Scientific articles
  Web pages
  etc.
Source: J. Choi, CSE842, MSU

20 Template Filling Example
Bridgestone Sports Co. said Friday it had set up a joint venture in Taiwan with a local concern and a Japanese trading house to produce golf clubs to be supplied to Japan. The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20 million new Taiwan dollars, will start production in January 1990 with production of 20,000 iron and “metal wood” clubs a month.

Template filling:

TIE-UP-1
  Relationship: TIE-UP
  Entities: “Bridgestone Sports Co.”, “a local concern”, “a Japanese trading house”
  Joint Venture Company: “Bridgestone Sports Taiwan Co.”
  Activity: ACTIVITY-1
  Amount: NT$

ACTIVITY-1
  Activity: PRODUCTION
  Company: “Bridgestone Sports Taiwan Co.”
  Product: “iron and ‘metal wood’ clubs”
  Start Date: DURING: January 1990
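To make the idea of template filling concrete, the sketch below fills two slots of an activity template from the second sentence of the passage using hand-written regular expressions. The `Activity` class, the slot names, and both patterns are assumptions tailored to this one passage; actual IE systems derive slot fillers from the tagging, NER, and parsing steps described earlier rather than from passage-specific patterns.

```python
import re
from dataclasses import dataclass
from typing import Optional

@dataclass
class Activity:
    """Hypothetical, simplified version of the ACTIVITY template."""
    company: Optional[str] = None
    start_date: Optional[str] = None

TEXT = (
    "The joint venture, Bridgestone Sports Taiwan Co., capitalized at "
    "20 million new Taiwan dollars, will start production in January 1990 "
    "with production of 20,000 iron and \"metal wood\" clubs a month."
)

def fill_activity(text):
    """Fill template slots with patterns written for this passage only."""
    activity = Activity()
    # Company: capitalized words ending in "Co." right after "joint venture,".
    company = re.search(r"joint venture, ((?:[A-Z]\w+ )+Co\.)", text)
    if company:
        activity.company = company.group(1)
    # Start date: "<Month> <year>" following "start production in".
    start = re.search(r"start production in ([A-Z][a-z]+ \d{4})", text)
    if start:
        activity.start_date = start.group(1)
    return activity

print(fill_activity(TEXT))
```

Slots whose pattern does not match stay empty (`None`), which mirrors how a template can be only partially filled.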

