CSC 594 Topics in AI – Text Mining and Analytics


CSC 594 Topics in AI – Text Mining and Analytics Fall 2015/16 11. Wrap Up

Text Mining [from Wikipedia]
- “Text mining refers to the process of deriving high-quality information from text.”
- “The overarching goal is, essentially, to turn text into data for analysis, via application of natural language processing (NLP) and analytical methods.”

Text Mining is Growing (1)
- “North America text analytics market is expected to reach a value of $1,995.8 million by 2019 according to new research report” (Information Communications Media Technology Market News, October 15, 2015)
- This market is estimated to grow from $827.1 million in 2014 to $1,995.8 million by 2019, at a Compound Annual Growth Rate (CAGR) of 19.3% from 2014 to 2019.
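As a quick sanity check on the figures quoted above, the growth rate can be recomputed from the start and end values. This is a minimal sketch; the function name `cagr` is our own and not taken from the report.

```python
def cagr(start_value, end_value, years):
    """Compound annual growth rate as a fraction (0.19 means 19%)."""
    return (end_value / start_value) ** (1.0 / years) - 1.0

# Figures quoted on the slide: $827.1M in 2014, $1,995.8M in 2019 (5 years).
rate = cagr(827.1, 1995.8, 2019 - 2014)
print(f"CAGR = {rate:.1%}")
```

The result agrees with the 19.3% stated in the report once rounded to one decimal place.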

Text Mining is Growing (2)
“Discover the text analytics market -- ” (Information Communications Media Technology Market News, November 4, 2015)
- Factors which are driving the growth of global text analytics service market are growing demand of social media analysis for effective brand building, development of multilingual text analytics to overcome language barriers, increasing concern of financial frauds and growing big data market.
- On the other hand, factors which are restraining the growth of global text analytics market are lack of awareness among end users about software handling, high deployment cost and compliance issue with present IT infrastructure.
- However, added advantage of predictive analytics and credibility to analyse big data is expected to create great opportunity for text analytics market in future.

Text Mining is Hard…(?)
Data Collection:
- Raw texts are ‘dirty’: markup tags, nonsense words/symbols, irregular punctuation, misspellings, etc.
- Collected data becomes huge in size.
Text (Pre-)Processing: So many ‘options’
- Segmentation (text unit): whole document vs. paragraph vs. sentence vs. n-word context window, or specific patterns (e.g. <Adj><Noun>).
- Tokenization: stemming/lemmatization, case normalization, removing punctuation.
- Term: removing stop words, defining a ‘keep’ list, POS, synonyms.
- Transformation: various term weighting schemes, dimensionality reduction (by top N terms, PCA, model parameter coefficients, etc.).
- We don’t know how each option affects the result until we generate the result, hence the need for iterative experiments (i.e., a feedback loop).
Mining/Analysis Step:
- The whole of machine learning and data mining comes only after structured data is obtained.
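The pre-processing ‘options’ listed above can be sketched as switches on a small pipeline, which makes the iterative experiments easy to rerun. This is an illustrative sketch only: the names `strip_markup`, `tokenize`, `preprocess` and the tiny `STOP_WORDS` set are our own assumptions, and stemming, POS tagging and synonym handling are left out.

```python
import re

# Illustrative stop-word list (an assumption); real projects use a much larger one.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "and", "to", "in"}

def strip_markup(text):
    """Replace anything that looks like an HTML/XML tag with a space."""
    return re.sub(r"<[^>]+>", " ", text)

def tokenize(text, lowercase=True):
    """Split into word tokens; punctuation is dropped along the way."""
    if lowercase:
        text = text.lower()  # case normalization
    return re.findall(r"[A-Za-z0-9']+", text)

def preprocess(text, lowercase=True, remove_stop_words=True, keep_list=frozenset()):
    """Apply the selected options; words in keep_list are never removed."""
    tokens = tokenize(strip_markup(text), lowercase=lowercase)
    if remove_stop_words:
        tokens = [t for t in tokens
                  if t in keep_list or t.lower() not in STOP_WORDS]
    return tokens

raw = "<p>The movie is GREAT, and the acting is great!!</p>"
print(preprocess(raw))
print(preprocess(raw, lowercase=False, remove_stop_words=False))
```

Toggling one keyword argument at a time and comparing the downstream results is one simple way to run the feedback loop the slide describes.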

Survey
In your midterm project, did you do..?
- Stemming
- Case normalization
- Removing punctuation
- Removing stop words
- POS-tagging
- Synonym creation
- Term weighting schemes
- Dimensionality reduction

Word Frequency
- The most naïve form of text mining is to look at word frequency.
- But surprisingly, word frequency provides a lot of useful information (when the data size is large)…
- A good article: “Where to start with text mining” (http://tedunderwood.com/2012/08/14/where-to-start-with-text-mining/)
- Google Ngram Viewer (https://books.google.com/ngrams/)
- Word Cloud
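Counting word frequency over a corpus takes only a few lines. The sketch below uses Python's standard `collections.Counter`; the function name `word_frequencies` and the two sample sentences are made up for illustration.

```python
import re
from collections import Counter

def word_frequencies(documents):
    """Count lower-cased word tokens across all documents."""
    counts = Counter()
    for doc in documents:
        counts.update(re.findall(r"[a-z']+", doc.lower()))
    return counts

docs = [
    "Text mining turns text into data.",
    "Mining text is hard, but text is everywhere.",
]
freq = word_frequencies(docs)
for word, count in freq.most_common(3):
    print(word, count)
```

The same counts are what a word cloud visualizes: font size is scaled by frequency.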

Word Association, Concept Linking
- Slightly more sophisticated analysis, but still based on frequency.
- Two words/concepts occurring TOGETHER more often than chance.
- Typically PMI (pointwise mutual information) or likelihood is used to measure the strength of the co-occurrence.
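A minimal sketch of PMI follows, assuming probabilities are estimated at the document level (the fraction of documents containing a word or a word pair). Other estimates, such as context windows, are equally common; the function name `pmi_scores` and the toy corpus are our own.

```python
import math
from collections import Counter
from itertools import combinations

def pmi_scores(documents):
    """PMI(x, y) = log2( P(x, y) / (P(x) * P(y)) ) for every co-occurring pair.

    Probabilities are document frequencies divided by the number of documents.
    Pair keys are alphabetically ordered tuples.
    """
    n_docs = len(documents)
    word_df = Counter()
    pair_df = Counter()
    for doc in documents:
        words = sorted(set(doc.lower().split()))
        word_df.update(words)
        pair_df.update(combinations(words, 2))
    scores = {}
    for (x, y), n_xy in pair_df.items():
        p_xy = n_xy / n_docs
        p_x = word_df[x] / n_docs
        p_y = word_df[y] / n_docs
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return scores

docs = ["new york city", "new york times", "new car", "old car"]
scores = pmi_scores(docs)
print(scores[("new", "york")])  # positive: together more often than chance
print(scores[("car", "new")])   # negative: together less often than chance
```

A positive score means the pair co-occurs more than independence would predict; with small counts PMI is known to be unstable, so a minimum frequency cutoff is usually applied in practice.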

Clustering, Topic Extraction
- Discover the overall grouping of the corpus.
- Clustering: a document is assigned to exactly one cluster.
- Topic: a document could be assigned to multiple clusters/topics.
- Cluster/topic definitions through terms/words: look at cluster centroids or term-cluster relevancy scores.
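The difference between the two assignment styles can be sketched with cosine similarity against fixed centroids. This is only an illustration: the centroids are hand-made rather than learned, and the thresholded `soft_assign` is a simple stand-in for a real topic model, not an implementation of one. All function names and weights here are assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def hard_assign(doc, centroids):
    """Clustering view: exactly one cluster, the most similar centroid."""
    return max(centroids, key=lambda name: cosine(doc, centroids[name]))

def soft_assign(doc, centroids, threshold=0.3):
    """Topic view: every group whose similarity reaches the threshold."""
    return sorted(name for name, c in centroids.items()
                  if cosine(doc, c) >= threshold)

def top_terms(centroid, n=2):
    """Describe a cluster by its highest-weighted terms."""
    ranked = sorted(centroid.items(), key=lambda kv: kv[1], reverse=True)
    return [term for term, _ in ranked[:n]]

# Hand-made centroids (hypothetical weights, for illustration only).
centroids = {
    "sports": {"game": 0.8, "team": 0.6, "score": 0.3},
    "finance": {"market": 0.8, "stock": 0.6, "score": 0.1},
}
doc = {"team": 1, "market": 1, "game": 1}

print(hard_assign(doc, centroids))
print(soft_assign(doc, centroids))
print(top_terms(centroids["sports"]))
```

Reading the top-weighted terms of each centroid is the simplest form of the "cluster definition through terms" mentioned on the slide.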

Text Categorization
- Build a classification/prediction model for texts.
- Goal 1: An optimal classifier (for the purpose of classification/prediction).
- Goal 2: Learn the domain of the texts (e.g. important features for each target category, such as POS/NEG reviews).
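Both goals can be illustrated with one small model. The sketch below is a multinomial Naive Bayes classifier with add-one smoothing, chosen here only because it is short; the slides do not prescribe a particular classifier. The class name `NaiveBayesText`, its methods and the four toy reviews are our own assumptions.

```python
import math
from collections import Counter, defaultdict

class NaiveBayesText:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def _log_likelihood(self, word, label):
        total = sum(self.word_counts[label].values())
        count = self.word_counts[label][word]
        return math.log((count + 1) / (total + len(self.vocab)))

    def predict(self, text):
        """Goal 1: pick the class with the highest log posterior."""
        n = sum(self.class_counts.values())
        words = [w for w in text.lower().split() if w in self.vocab]

        def score(label):
            prior = math.log(self.class_counts[label] / n)
            return prior + sum(self._log_likelihood(w, label) for w in words)

        return max(self.class_counts, key=score)

    def top_features(self, label, other, n=1):
        """Goal 2: words ranked by log-odds of `label` versus `other`."""
        odds = {w: self._log_likelihood(w, label) - self._log_likelihood(w, other)
                for w in self.vocab}
        return sorted(odds, key=odds.get, reverse=True)[:n]

texts = ["great movie loved it", "great acting great plot",
         "boring movie hated it", "boring plot terrible acting"]
labels = ["POS", "POS", "NEG", "NEG"]
model = NaiveBayesText().fit(texts, labels)

print(model.predict("great plot"))
print(model.top_features("POS", "NEG"))
print(model.top_features("NEG", "POS"))
```

`predict` serves Goal 1, while `top_features` serves Goal 2 by exposing which words most strongly separate the two categories.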