TEXT ANALYTICS - LABS
Maha Althobaiti, Udo Kruschwitz, Massimo Poesio

LABS
Basic text analytics: text classification using bags-of-words
– Sentiment analysis of tweets using Python's scikit-learn library
More advanced text analytics: information extraction using NLP pipelines
– Named Entity Recognition

Sentiment analysis using scikit-learn
Materials for this part of the tutorial:
– alyticsTutorial/SentimentLab
– Based on: chap. 6 of

TEXT ANALYTICS IN PYTHON
Text manipulation in Python is not quite as easy as in Perl, but there are a number of useful packages:
– SCIKIT-LEARN for machine learning, including basic text classification
– NLTK for NLP processing, including libraries for tokenization, POS tagging, chunking, parsing, and NE recognition; it also supports ML-based methods, e.g. for text classification

SCIKIT-LEARN
An open-source library supporting machine learning work
– Based on numpy, scipy, and matplotlib
Provides implementations of:
– Several supervised ML algorithms, including e.g. regression, Naïve Bayes, SVMs
– Clustering
– Dimensionality reduction
It also includes several facilities to support text classification, including e.g. ways to create NLP pipelines out of components
Website: http://scikit-learn.org/

REMINDER: SENTIMENT ANALYSIS (or opinion mining)
Develop algorithms that can identify the 'sentiment' expressed by a text:
– "Product X sucks"
– "I was mesmerized by film Y"

SENTIMENT ANALYSIS AS TEXT CATEGORIZATION
Sentiment analysis can be viewed as just another type of text categorization, like spam detection or topic classification.
Most successful approaches use SUPERVISED LEARNING:
– Use corpora annotated for subjectivity and/or sentiment
– To train models using supervised machine learning algorithms: Naïve Bayes, decision trees, SVMs
Good results can already be obtained using only WORDS as features.

TEXT CATEGORIZATION USING A NAÏVE BAYES, WORD-BASED APPROACH
Attributes are text positions, values are words.
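As a reminder of the underlying model (this is the standard Naïve Bayes formulation, not anything specific to this lab), the classifier picks the class that maximizes the class prior times the product of the per-position word likelihoods:

c_NB = argmax_c P(c) · ∏_i P(w_i | c)

with the probabilities estimated from counts in the annotated training corpus.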

SENTIMENT ANALYSIS OF TWEETS
A very popular application of sentiment analysis is trying to extract sentiment towards products or organizations from people's comments about them on Twitter.
Several datasets exist for this task:
– E.g., SEMEVAL-2014
In this lab: Nick Sanders's dataset
– 5000 tweets
– Annotated as positive / negative / neutral / irrelevant
– A list of ID / sentiment pairs, plus a script to download the tweets on the basis of their IDs

First Script
Open the file 01_start.py (but do not run it yet!)
Start an IDLE window

A word-based, Naïve Bayes sentiment analyzer using scikit-learn
The library sklearn.naive_bayes includes implementations of three Naïve Bayes classifiers:
– GaussianNB (for features that have a Gaussian distribution, e.g., physical traits such as height)
– MultinomialNB (when features are frequencies of words)
– BernoulliNB (for Boolean features)
For sentiment analysis: MultinomialNB
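To make the choice concrete, here is a tiny, self-contained illustration of MultinomialNB on word-count vectors (the counts and labels are invented purely for illustration):

from sklearn.naive_bayes import MultinomialNB

# Each row is a document represented by (made-up) word counts
X = [[2, 0, 1],
     [0, 3, 0],
     [1, 0, 2]]
y = ["positive", "negative", "positive"]

clf = MultinomialNB()            # alpha=1.0 by default (add-one smoothing)
clf.fit(X, y)
print(clf.predict([[1, 0, 1]]))  # -> ['positive'] on this toy data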

Creating the model
The words contained in the tweets are used as features. They are extracted and weighted using the function create_ngram_model:
– create_ngram_model uses TfidfVectorizer from scikit-learn's feature_extraction package to extract terms from the tweets
http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
– create_ngram_model uses MultinomialNB to learn a classifier
http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html
– scikit-learn's Pipeline class is used to combine the feature extractor and the classifier in a single object (an estimator) that can be used to extract features from data, create ('fit') a model, and use the model to classify
http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html

Tweet term extraction & classification (code slide)
Callouts on the code: extracts features and weights them · Naïve Bayes classifier · creates the Pipeline
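The code itself is not preserved in this transcript; below is a minimal sketch consistent with the description of create_ngram_model above (the ngram_range and other settings shown are assumptions, not necessarily the lab's exact values):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

def create_ngram_model():
    # Extract terms from the tweets and weight them with TF-IDF
    tfidf_ngrams = TfidfVectorizer(ngram_range=(1, 3), analyzer="word", binary=False)
    # Naive Bayes classifier suited to word-frequency features
    clf = MultinomialNB()
    # Combine vectorizer and classifier into a single estimator
    return Pipeline([("vect", tfidf_ngrams), ("clf", clf)])

The resulting object behaves like any other scikit-learn estimator: it can be fitted on raw tweet texts and then used to predict labels for new tweets.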

Training and evaluation
The function train_model:
– Uses ShuffleSplit, from scikit-learn's cross_validation module, to calculate the folds to use in cross-validation
– At each iteration, it creates a model using fit, then evaluates the results using score

Creating a model (code slide)
Callouts on the code: identifies the indices in each fold · trains the model
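Again the slide's code is not in the transcript; a rough sketch of such a train_model function, written against the current sklearn.model_selection API (the original lab used the older sklearn.cross_validation module) and with assumed fold settings:

import numpy as np
from sklearn.model_selection import ShuffleSplit

def train_model(clf_factory, X, Y):
    # X: numpy array of raw tweet texts, Y: numpy array of labels
    # 10 random splits with 30% of the tweets held out each time (assumed settings)
    cv = ShuffleSplit(n_splits=10, test_size=0.3, random_state=0)
    scores = []
    for train_idx, test_idx in cv.split(X):
        clf = clf_factory()                                 # a fresh Pipeline for each fold
        clf.fit(X[train_idx], Y[train_idx])                 # train on this fold
        scores.append(clf.score(X[test_idx], Y[test_idx]))  # accuracy on held-out tweets
    return np.mean(scores), np.std(scores)

# Example call:
# train_model(create_ngram_model, X, Y)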

Execution

Optimization
The program above uses the default values of the parameters for TfidfVectorizer and MultinomialNB.
In text analytics it's usually easy to build a first prototype, but lots of experimentation is needed to achieve good results.
Alternative choices for TfidfVectorizer:
– Using unigrams, bigrams, trigrams (ngram_range parameter)
– Removing stopwords (stop_words parameter)
– Using a binary representation of counts (binary parameter)
Alternative choices for MultinomialNB:
– Which type of SMOOTHING to use
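For example, a non-default configuration might look like this (the particular values are only illustrative, not the lab's tuned settings):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

tfidf = TfidfVectorizer(ngram_range=(1, 2),    # unigrams and bigrams
                        stop_words="english",  # remove English stopwords
                        binary=True)           # record presence/absence rather than counts
clf = MultinomialNB(alpha=0.5)                 # smoothing strength (1.0 = add-one/Laplace)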

Smoothing
Even a very large corpus remains a limited sample of language use, so many words, even common ones, will not be found in the training data.
– The problem is particularly acute with tweets, where a lot of 'creative' use of words is found.
Solution: SMOOTHING – redistribute the probability mass so that every word gets some.
Most used: ADD-ONE or LAPLACE smoothing.
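In MultinomialNB the smoothing strength is controlled by the alpha parameter; alpha=1.0 (the default) corresponds to add-one / Laplace smoothing. A toy add-one calculation, with made-up counts, shows why unseen words no longer get zero probability:

# Add-one (Laplace) estimate of P(word | class), with invented counts
count_in_class = 0          # the word was never seen in this class
total_words_in_class = 1000
vocab_size = 5000           # number of distinct words in the vocabulary

p_unsmoothed = count_in_class / total_words_in_class                # 0.0
p_laplace = (count_in_class + 1) / (total_words_in_class + vocab_size)
print(p_laplace)            # ~0.000167: small, but no longer zero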

Optimization
Looking for the best values of the parameters is a standard operation in machine learning.
scikit-learn, like Weka and similar packages, provides a facility (GridSearchCV) to explore the results that can be achieved with different parameter configurations.

Optimizing with GridSearchCV (code slide)
Callouts on the code: note the syntax to specify the values of the parameters · use the F metric to evaluate · which smoothing function to use
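The code on this slide is not preserved; a sketch of how such a grid search is typically set up over the pipeline from create_ngram_model (the parameter grid, the F1 scoring choice, and the cross-validation settings are assumptions based on the slide's callouts):

from sklearn.model_selection import GridSearchCV, ShuffleSplit

pipeline = create_ngram_model()          # TfidfVectorizer + MultinomialNB pipeline
param_grid = {
    # parameters of a pipeline step are addressed as <step name>__<parameter>
    "vect__ngram_range": [(1, 1), (1, 2), (1, 3)],
    "vect__stop_words": [None, "english"],
    "vect__binary": [False, True],
    "clf__alpha": [0.01, 0.05, 0.1, 0.5, 1.0],   # smoothing strength
}
grid = GridSearchCV(pipeline, param_grid,
                    cv=ShuffleSplit(n_splits=10, test_size=0.3, random_state=0),
                    scoring="f1")                # evaluate with the F measure
# grid.fit(X, Y)          # X: tweets, Y: labels encoded as 0 (negative) / 1 (positive)
# print(grid.best_params_, grid.best_score_)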

Second Script
Open the file 02_tuning.py (but do not run it yet!)
Start an IDLE window

Additional improvements: normalization, preprocessing
Further improvements may be possible by doing some form of NORMALIZATION.

Example of normalization: emoticons

Normalization: abbreviations
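The replacement tables shown on these two slides are not preserved in the transcript; a small sketch of the kind of normalization they describe (all emoticons, abbreviations, and replacement words below are made-up examples):

import re

# Map emoticons to sentiment-bearing words (illustrative entries only)
emo_repl = {
    ":)": " good ",
    ":-)": " good ",
    ":(": " bad ",
    ":-(": " bad ",
}

# Expand common Twitter abbreviations (illustrative entries only)
abbrev_repl = {
    r"\br\b": "are",
    r"\bu\b": "you",
    r"\bdont\b": "don't",
}

def normalize(tweet):
    text = tweet.lower()
    for emo, word in emo_repl.items():
        text = text.replace(emo, word)
    for pattern, word in abbrev_repl.items():
        text = re.sub(pattern, word, text)
    return text

print(normalize("u r great :)"))   # prints something like: you are great  good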

Adding a preprocessing step to TfidfVectorizer
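The slide's code is not in the transcript. One straightforward way to plug a normalization step such as the normalize() function above into the pipeline is TfidfVectorizer's preprocessor argument; the lab may have used a different mechanism (e.g. subclassing TfidfVectorizer), so treat this as an equivalent sketch rather than the tutorial's exact code:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

def create_ngram_model():
    # normalize() (see the sketch above) is applied to every tweet
    # before tokenization and TF-IDF weighting
    tfidf_ngrams = TfidfVectorizer(preprocessor=normalize,
                                   ngram_range=(1, 3), analyzer="word")
    clf = MultinomialNB()
    return Pipeline([("vect", tfidf_ngrams), ("clf", clf)])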

Other possible improvements
– Using NLTK's POS tagger
– Using a sentiment lexicon such as SentiWordNet (a copy is provided in the data/ directory)
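For the first suggestion, a minimal NLTK tagging example (the tokenizer and tagger models need to be downloaded once; the example sentence is made up):

import nltk

nltk.download("punkt")                        # tokenizer model (one-off download)
nltk.download("averaged_perceptron_tagger")   # POS tagger model (one-off download)

tokens = nltk.word_tokenize("I was mesmerized by this film")
print(nltk.pos_tag(tokens))
# e.g. [('I', 'PRP'), ('was', 'VBD'), ('mesmerized', 'VBN'), ('by', 'IN'), ...]

The resulting tags could then be added as extra features alongside the words.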

Third Script
Open and run the file 03_clean.py
(Start an IDLE window)

Overall results

TO LEARN MORE

SCIKIT-LEARN

NLTK