1 TEXT ANALYTICS - LABS Maha Althobaiti Udo Kruschwitz Massimo Poesio

2 LABS
Basic text analytics: text classification using bags-of-words
– Sentiment analysis of tweets using Python's SciKit-Learn library
More advanced text analytics: information extraction using NLP pipelines
– Named Entity Recognition

4 Sentiment analysis using SciKit-Learn
Materials for this part of the tutorial:
– http://csee.essex.ac.uk/staff/poesio/Teach/TextAnalyticsTutorial/SentimentLab
– Based on: chap. 6 of

5 TEXT ANALYTICS IN PYTHON
Text manipulation is not quite as easy in Python as in Perl, but there are a number of useful packages:
– SCIKIT-LEARN for machine learning, including basic text classification
– NLTK for NLP processing, including libraries for tokenization, POS tagging, chunking, parsing, and NE recognition; also support for ML-based methods, e.g., for text classification

7 SCIKIT-LEARN
An open-source library supporting machine learning work
– Based on numpy, scipy, and matplotlib
Provides implementations of:
– Several supervised ML algorithms, including e.g. regression, Naïve Bayes, SVMs
– Clustering
– Dimensionality reduction
It also includes several facilities to support text classification, including e.g. ways to create NLP pipelines out of components
Website:
– http://scikit-learn.org/stable/

8 REMINDER: SENTIMENT ANALYSIS (or opinion mining)
Develop algorithms that can identify the 'sentiment' expressed by a text
– "Product X sucks"
– "I was mesmerized by film Y"

9 SENTIMENT ANALYSIS AS TEXT CATEGORIZATION
Sentiment analysis can be viewed as just another type of text categorization, like spam detection or topic classification.
Most successful approaches use SUPERVISED LEARNING:
– Use corpora annotated for subjectivity and/or sentiment
– To train models using supervised machine learning algorithms: Naïve Bayes, decision trees, SVMs
Good results can already be obtained using only WORDS as features.

10 TEXT CATEGORIZATION USING A NAÏVE BAYES, WORD-BASED APPROACH Attributes are text positions, values are words.
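
The decision rule can be made concrete with a toy example (the vocabulary, priors, and word probabilities below are invented for illustration): Naïve Bayes picks the class c that maximizes P(c) · Π P(w|c) over the words in the text.

```python
from math import log

# Toy corpus statistics, invented for this sketch
prior = {"pos": 0.5, "neg": 0.5}
# P(word | class), assumed already smoothed
likelihood = {
    "pos": {"great": 0.4, "sucks": 0.1, "film": 0.25, "boring": 0.05},
    "neg": {"great": 0.1, "sucks": 0.4, "film": 0.25, "boring": 0.3},
}

def classify(words):
    """Return the class maximizing log P(c) + sum of log P(w|c)."""
    scores = {}
    for c in prior:
        scores[c] = log(prior[c]) + sum(log(likelihood[c][w]) for w in words)
    return max(scores, key=scores.get)

print(classify(["great", "film"]))   # -> pos
print(classify(["boring", "film"]))  # -> neg
```

Working in log space avoids numeric underflow when many word probabilities are multiplied.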

11 SENTIMENT ANALYSIS OF TWEETS
A very popular application of sentiment analysis is trying to extract sentiment towards products or organizations from people's comments about them on Twitter.
Several datasets exist for this task:
– E.g., SEMEVAL-2014
In this lab: Nick Sanders's dataset
– 5000 tweets
– Annotated as positive / negative / neutral / irrelevant
– A list of ID / sentiment pairs, plus a script to download the tweets on the basis of their IDs
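
Such a list of ID / sentiment pairs can be read with a few lines of Python. The exact file layout is not shown on the slides, so the two-column CSV below (with made-up IDs) is only an assumption for illustration.

```python
import csv
import io

# Hypothetical two-column layout; IDs are made up for the example
raw = io.StringIO("""tweet_id,sentiment
1001,positive
1002,negative
1003,neutral
1004,irrelevant
""")

# Map each tweet ID to its sentiment label
labels = {row["tweet_id"]: row["sentiment"] for row in csv.DictReader(raw)}
print(len(labels), labels["1001"])
```

A separate download step would then fetch the tweet texts by ID, since the texts themselves cannot be redistributed.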

12 First Script
Open the file 01_start.py (but do not run it yet!!)
Start an IDLE window

14 A word-based, Naïve Bayes sentiment analyzer using SciKit-Learn
The library sklearn.naive_bayes includes implementations of three Naïve Bayes classifiers:
– GaussianNB (for features that have a Gaussian distribution, e.g., physical traits such as height)
– MultinomialNB (when features are frequencies of words)
– BernoulliNB (for boolean features)
For sentiment analysis: MultinomialNB
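
A minimal sketch of MultinomialNB on word-frequency features (the training sentences are invented for this example; CountVectorizer supplies the counts):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training data, invented for this sketch
docs = ["great film loved it", "brilliant great acting",
        "boring film sucks", "awful boring plot"]
y = ["positive", "positive", "negative", "negative"]

vect = CountVectorizer()
X = vect.fit_transform(docs)        # word-frequency features
clf = MultinomialNB().fit(X, y)

print(clf.predict(vect.transform(["great acting"])))  # -> ['positive']
```

Note that new text must go through the same fitted vectorizer (transform, not fit_transform) so that the feature columns line up.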

15 Creating the model
The words contained in the tweets are used as features. They are extracted and weighted using the function create_ngram_model:
– create_ngram_model uses TfidfVectorizer from the feature_extraction package in scikit-learn to extract terms from tweets
http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
– create_ngram_model uses MultinomialNB to learn a classifier
http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html
– The Pipeline facility of scikit-learn is used to combine the feature extractor and the classifier in a single object (an estimator) that can be used to extract features from data, create ('fit') a model, and use the model to classify
http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html

16 Tweet term extraction & classification
[Code slide: create_ngram_model extracts features and weights them (TfidfVectorizer), creates the Naïve Bayes classifier, and combines the two in a Pipeline]
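
The function can be sketched roughly as follows; this is a reconstruction in current scikit-learn, not the literal code of 01_start.py, and the toy training tweets are invented:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

def create_ngram_model():
    """Combine term extraction/weighting and the classifier in one estimator."""
    tfidf_ngrams = TfidfVectorizer(ngram_range=(1, 3), analyzer="word")
    clf = MultinomialNB()
    return Pipeline([("vect", tfidf_ngrams), ("clf", clf)])

# The pipeline behaves like a single classifier over raw tweet text
model = create_ngram_model()
model.fit(["great film", "i loved it", "product sucks", "so boring"],
          ["pos", "pos", "neg", "neg"])
print(model.predict(["i loved it"]))  # -> ['pos']
```

Because the vectorizer and the classifier live inside one Pipeline, a single fit / predict call drives both stages in order.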

17 Training and evaluation
The function train_model:
– Uses a method from the cross_validation library in scikit-learn, ShuffleSplit, to calculate the folds to use in cross-validation
– At each iteration, the function creates a model using fit, then evaluates the results using score
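
A sketch of what train_model does, assuming a current scikit-learn where ShuffleSplit has moved from sklearn.cross_validation to sklearn.model_selection; the toy tweets are invented for the example:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
# In older scikit-learn this lived in sklearn.cross_validation
from sklearn.model_selection import ShuffleSplit

def train_model(clf_factory, X, Y):
    """At each split: build a fresh model, fit it, then score it."""
    cv = ShuffleSplit(n_splits=10, test_size=0.3, random_state=0)
    scores = []
    for train_idx, test_idx in cv.split(X):
        clf = clf_factory()
        clf.fit(X[train_idx], Y[train_idx])
        scores.append(clf.score(X[test_idx], Y[test_idx]))
    return np.mean(scores)

X = np.array(["great film", "loved it", "brilliant stuff", "awesome movie",
              "product sucks", "so boring", "total rubbish", "hated it"])
Y = np.array(["pos", "pos", "pos", "pos", "neg", "neg", "neg", "neg"])

factory = lambda: Pipeline([("vect", TfidfVectorizer()),
                            ("clf", MultinomialNB())])
print(train_model(factory, X, Y))
```

Creating a fresh estimator per fold (via the factory) keeps the folds independent: no state leaks from one split's model into the next.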

18 Creating a model
[Code slide: train_model identifies the indices in each fold, then trains the model]

19 Execution

20 Optimization
The program above uses the default values of the parameters for TfidfVectorizer and MultinomialNB.
In text analytics it's usually easy to build a first prototype, but lots of experimentation is needed to achieve good results.
Alternative choices for TfidfVectorizer:
– Using unigrams, bigrams, trigrams (ngram_range parameter)
– Removing stopwords (stop_words parameter)
– Using binary counts instead of frequencies (binary parameter)
Alternative choices for MultinomialNB:
– Which type of SMOOTHING to use

21 Smoothing
Even a very large corpus remains a limited sample of language use, so many words, even in common use, are not found.
– The problem is particularly common with tweets, where a lot of 'creative' use of words is found.
Solution: SMOOTHING – redistribute the probability mass so that every word gets some.
Most used: ADD-ONE or LAPLACE smoothing
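
Add-one (Laplace) smoothing in miniature, with toy counts invented for the example: each count is incremented by 1 and the denominator grows by the vocabulary size, so unseen words get a small non-zero probability while the distribution still sums to 1.

```python
# Toy word counts for one class, invented for illustration
counts = {"great": 3, "film": 2, "boring": 0}
V = len(counts)                 # vocabulary size
N = sum(counts.values())        # total tokens observed

def p_add_one(word):
    """Laplace-smoothed estimate of P(word | class)."""
    return (counts.get(word, 0) + 1) / (N + V)

print(p_add_one("boring"))                 # unseen word -> 1/8 = 0.125
print(sum(p_add_one(w) for w in counts))   # still sums to 1.0
```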

22 Optimization
Looking for the best values for the parameters is a standard operation in machine learning.
Scikit-learn, like Weka and similar packages, provides a facility (GridSearchCV) to explore the results that can be achieved with different parameter configurations.

23 Optimizing with GridSearchCV
[Code slide: note the syntax used to specify the values of the parameters; an F metric is used for evaluation; the grid includes the choice of smoothing function]
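
A sketch of the idea; the parameter names follow scikit-learn's step__parameter convention for reaching inside a pipeline, f1_macro stands in for the F metric mentioned on the slide, and the exact grid in 02_tuning.py may differ:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV, ShuffleSplit

pipeline = Pipeline([("vect", TfidfVectorizer()), ("clf", MultinomialNB())])

# step__parameter syntax addresses parameters of the named pipeline steps
param_grid = {
    "vect__ngram_range": [(1, 1), (1, 2), (1, 3)],
    "clf__alpha": [0.01, 0.1, 1.0],   # smoothing strength
}

# Toy data, invented for the example
X = ["great film", "loved it", "brilliant stuff", "awesome movie",
     "product sucks", "so boring", "total rubbish", "hated it"]
y = ["pos", "pos", "pos", "pos", "neg", "neg", "neg", "neg"]

grid = GridSearchCV(pipeline, param_grid,
                    cv=ShuffleSplit(n_splits=5, test_size=0.3, random_state=0),
                    scoring="f1_macro")
grid.fit(X, y)
print(grid.best_params_)
```

GridSearchCV cross-validates every combination in the grid and keeps the best-scoring configuration, refitting a final model on the full data.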

24 Second Script
Open the file 02_tuning.py (but do not run it yet!!)
Start an IDLE window

25 Additional improvements: normalization, preprocessing Further improvements may be possible by doing some form of NORMALIZATION

26 Example of normalization: emoticons

27 Normalization: abbreviations

28 Adding a preprocessing step to TfidfVectorizer

29 Other possible improvements
Using NLTK's POS tagger
Using a sentiment lexicon such as SentiWordNet
– http://sentiwordnet.isti.cnr.it/download.php
– (in the data/ directory)

30 Third Script Open and run the file: 03_clean.py (Start an IDLE window)

31 Overall results

32 TO LEARN MORE

33 SCIKIT-LEARN

34 NLTK http://www.nltk.org/book

