
1 Understanding unstructured texts via Latent Dirichlet Allocation Raphael Cohen DSaaS, EMC IT June 2015

2 Motivation

3 What is in free text?
What topics does a newspaper write about? Which topics matter to MK Hazan, judging by her statements? What are customers complaining about?
But most of the text is locked away and unusable! This leads to two scenarios:
– Ignoring the text (working with structured data only)
– Anecdotal examples (an example is not a proof; cf. "ישראלית שפה יפה")
Automatic understanding using rules or supervised learning is very expensive

4 Let's start small – one label per document

5 Bayesian Classification
Each document belongs to exactly one class. Assume a generative model: a document is generated like this:
1. Choose a topic t (sample randomly)
2. For each of the n word positions: W <- sample a word from the word distribution of topic t
How do we learn the topic's word-generating distribution? Count!

6 Bayesian Classification
Estimating the model (supervised): estimate the probability of a word given the labeled class.
Given these examples:
Sports: "great game", "good pass", "I love it"
Crime: "the game is the game", "drug dealers"
P(w=great | C=sports) = 1/7
P(w=great | C=crime) = 0/7
P(w=game | C=sports) = 1/7
P(w=game | C=crime) = 2/7
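A minimal sketch of the counting estimate above (my own illustration, not the presenter's code): P(w | C) is just the count of w in class C divided by the total number of word tokens in C.

```python
from collections import Counter

# Toy labeled corpus from the slide
docs = {
    "sports": ["great game", "good pass", "I love it"],
    "crime":  ["the game is the game", "drug dealers"],
}

# Per-class word counts and per-class totals
counts = {c: Counter(w for d in texts for w in d.split()) for c, texts in docs.items()}
totals = {c: sum(cnt.values()) for c, cnt in counts.items()}

def p_word_given_class(word, cls):
    return counts[cls][word] / totals[cls]

print(p_word_given_class("great", "sports"))  # 1/7
print(p_word_given_class("game", "crime"))    # 2/7
```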

7 Naïve Bayes - Prediction
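The slide's prediction formula is an image, so the following is my reconstruction of the standard naïve Bayes decision rule: pick argmax over classes of P(c) times the product of P(w | c). Add-one (Laplace) smoothing is my own addition so unseen words do not zero out a class.

```python
import math
from collections import Counter

docs = {"sports": ["great game", "good pass", "I love it"],
        "crime":  ["the game is the game", "drug dealers"]}
counts = {c: Counter(w for d in ts for w in d.split()) for c, ts in docs.items()}
totals = {c: sum(cnt.values()) for c, cnt in counts.items()}
vocab = {w for cnt in counts.values() for w in cnt}

def predict(text):
    best, best_score = None, float("-inf")
    for cls in counts:
        score = math.log(1.0 / len(counts))  # uniform class prior for simplicity
        for w in text.split():
            # Laplace-smoothed word likelihood
            score += math.log((counts[cls][w] + 1) / (totals[cls] + len(vocab)))
        if score > best_score:
            best, best_score = cls, score
    return best

print(predict("great pass"))  # -> "sports"
```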

8 Naïve Bayes - Unsupervised Sampling
Let's try Gibbs sampling:
– Label documents randomly (who has a coin?)
– Relabel one document at a time using the model, then re-estimate (see the sketch below)
Example corpus: "you can close this sr", "ok to close this", "you can close the ticket", "two disks replacement", "ok two disk drives replaced", "replaced the disks", "sr closed"
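A minimal sketch (my own, not the presenter's code) of the sampling loop described above: start with random labels, then repeatedly relabel one document at a time from P(c | doc) under the current counts. The number of classes, iterations, and the smoothing constant are assumptions.

```python
import random
from collections import Counter

corpus = ["you can close this sr", "ok to close this", "you can close the ticket",
          "two disks replacement", "ok two disk drives replaced",
          "replaced the disks", "sr closed"]
docs = [d.split() for d in corpus]
K, beta = 2, 0.1
vocab = {w for d in docs for w in d}

labels = [random.randrange(K) for _ in docs]  # random initial labels

def word_counts(skip=None):
    """Per-class word counts, optionally leaving one document out."""
    cnt = [Counter() for _ in range(K)]
    for i, d in enumerate(docs):
        if i != skip:
            cnt[labels[i]].update(d)
    return cnt

for _ in range(50):                       # sampling iterations
    for i, d in enumerate(docs):
        cnt = word_counts(skip=i)         # remove document i, then resample its label
        weights = []
        for c in range(K):
            total = sum(cnt[c].values()) + beta * len(vocab)
            p = 1.0
            for w in d:
                p *= (cnt[c][w] + beta) / total
            weights.append(p)
        labels[i] = random.choices(range(K), weights=weights)[0]

print(list(zip(corpus, labels)))          # "close"-type vs "disk"-type tickets
```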

9 But that assumed one topic per document…

10 Mixture of naïve Bayes classifiers
Each topic has its own distribution over the words. A document is generated like this:
1. Choose a topic distribution for the document (sample randomly)
2. For each of the n word positions:
   1. T <- sample a topic from the document's topic distribution
   2. W <- sample a word from topic T's word distribution
How to estimate?
– Very hard… (a sketch of the generative story follows)
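A sketch of the generative story on this slide (the topic names, word distributions, and Dirichlet parameter below are my own assumptions): per document, draw a topic mixture, then for every word position draw a topic and a word from that topic.

```python
import numpy as np

rng = np.random.default_rng(0)
topics = {
    "hardware": {"disk": 0.5, "replaced": 0.3, "drives": 0.2},
    "closure":  {"close": 0.4, "sr": 0.3, "ticket": 0.3},
}
topic_names = list(topics)

def generate_doc(n_words=6, alpha=1.0):
    theta = rng.dirichlet([alpha] * len(topic_names))   # document's topic mixture
    words = []
    for _ in range(n_words):
        t = topic_names[rng.choice(len(topic_names), p=theta)]  # sample a topic
        dist = topics[t]
        words.append(rng.choice(list(dist), p=list(dist.values())))  # sample a word
    return words

print(generate_doc())
```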

11 Topic Modeling - Latent Dirichlet Allocation
More than one topic per document: Police, Cannabis, Celebrities.
If we try to classify into any one category we get a lot of noise (consider the Bayesian model with words as features).
What if we don't know the categories in advance?

12 Topic Modeling Results (it really works)

13 Topic Modeling Results

14 Topic Modeling - Works in Hebrew just as well…

15 LDA Topic Modeling - Generative Model

16 Topic Modeling - Sampling
So we have a generative model; how does that help? We can learn the parameters of the model:
– The probability of each word in each topic
– The probability of the topic itself (in some LDA variants)
Gibbs Sampling:
– Start with random topic tags for all words
– Each step: remove one word's tag and recompute it based on the topic mixture in the same document and the topic probability of the word in the corpus (see the sketch below)
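A compact sketch of collapsed Gibbs sampling for LDA, my own illustration of the update described above rather than the presenter's code. The alpha and beta values are assumed symmetric priors; real implementations add many optimizations.

```python
import random
from collections import defaultdict

def lda_gibbs(docs, K, iters=200, alpha=0.1, beta=0.01):
    vocab = {w for d in docs for w in d}
    V = len(vocab)
    # topic tag per word occurrence, document-topic counts, topic-word counts
    z = [[random.randrange(K) for _ in d] for d in docs]
    n_dk = [[0] * K for _ in docs]
    n_kw = [defaultdict(int) for _ in range(K)]
    n_k = [0] * K
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            n_dk[d][k] += 1; n_kw[k][w] += 1; n_k[k] += 1

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]                                   # remove the current tag
                n_dk[d][k] -= 1; n_kw[k][w] -= 1; n_k[k] -= 1
                weights = [(n_dk[d][t] + alpha) *             # topic mix in this document
                           (n_kw[t][w] + beta) / (n_k[t] + beta * V)  # word prob in topic
                           for t in range(K)]
                k = random.choices(range(K), weights=weights)[0]
                z[d][i] = k                                   # put the new tag back
                n_dk[d][k] += 1; n_kw[k][w] += 1; n_k[k] += 1
    return z, n_kw

docs = [d.split() for d in ["you can close this sr", "ok two disk drives replaced",
                            "replaced the disks", "sr closed"]]
tags, topic_words = lda_gibbs(docs, K=2)
print(topic_words)
```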

17 Topic Modeling - Sampling
Let's try Gibbs sampling…
– With LDA (uniform priors)
Example corpus: "you can close this sr", "ok to close this", "you can close the ticket", "two disks replacement", "ok two disk drives replaced", "replaced the disks", "sr closed"
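In practice you would rarely hand-roll the sampler. Here is a library sketch using gensim on the toy corpus from the slide (gensim's LdaModel uses online variational Bayes rather than Gibbs sampling, but it yields the same kind of topic-word distributions); the parameter choices are my own assumptions.

```python
from gensim import corpora, models

texts = [d.split() for d in [
    "you can close this sr", "ok to close this", "you can close the ticket",
    "two disks replacement", "ok two disk drives replaced",
    "replaced the disks", "sr closed"]]

dictionary = corpora.Dictionary(texts)
bow = [dictionary.doc2bow(t) for t in texts]        # bag-of-words corpus
lda = models.LdaModel(bow, num_topics=2, id2word=dictionary, passes=20)

for topic_id, words in lda.show_topics(num_topics=2, num_words=5, formatted=False):
    print(topic_id, [w for w, _ in words])
```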

18 Topic Modeling - Applications in Data Science
– Data exploration of free text
– Summarization: state of the art uses topics to abstract over words
– Text classification: use the topics as features instead of the words (see the sketch below)
– Domain adaptation: supervised learning on a small annotated sample, then generalize with topics
– Go beyond text: cluster customers by their latent needs / preferences based on their installed applications
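A sketch of "topics as features" (my own illustration, with made-up labels): project each document onto the trained LDA's topic space and feed the dense vectors to a standard classifier.

```python
import numpy as np
from gensim import corpora, models
from sklearn.linear_model import LogisticRegression

texts = [d.split() for d in [
    "you can close this sr", "ok to close this", "you can close the ticket",
    "two disks replacement", "ok two disk drives replaced",
    "replaced the disks", "sr closed"]]
dictionary = corpora.Dictionary(texts)
bow = [dictionary.doc2bow(t) for t in texts]
lda = models.LdaModel(bow, num_topics=2, id2word=dictionary, passes=20)

def topic_features(doc_bow, num_topics=2):
    """Dense document-topic vector to use as classifier features."""
    vec = np.zeros(num_topics)
    for topic_id, weight in lda.get_document_topics(doc_bow, minimum_probability=0.0):
        vec[topic_id] = weight
    return vec

X = np.array([topic_features(d) for d in bow])
y = [1, 1, 1, 0, 0, 0, 1]   # made-up labels: "closure" (1) vs "hardware" (0) tickets
clf = LogisticRegression().fit(X, y)
print(clf.predict(X))
```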

19 LDA in Practice
Read: "Care and Feeding of Topic Models" by Boyd-Graber, Mimno, and Newman.
Asymmetric priors (Wallach):
– Vanilla LDA assumes all topics are equally likely a priori
– I've never encountered such a corpus
– Assume the prior for each topic is different and sample it as well (see the sketch below)
– Not supported by most off-the-shelf big data LDA solutions
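One way to get a non-uniform document-topic prior with an off-the-shelf library, shown as a gensim sketch (my own example, not from the talk): alpha="auto" asks gensim to learn a separate prior weight per topic from the data, while alpha="asymmetric" uses a fixed decreasing prior.

```python
from gensim import corpora, models

texts = [d.split() for d in [
    "you can close this sr", "ok to close this", "two disks replacement",
    "ok two disk drives replaced", "replaced the disks", "sr closed"]]
dictionary = corpora.Dictionary(texts)
bow = [dictionary.doc2bow(t) for t in texts]

# Learn a per-topic prior instead of assuming all topics are equally likely
lda = models.LdaModel(bow, num_topics=2, id2word=dictionary, passes=20, alpha="auto")
print(lda.alpha)   # one learned prior value per topic, no longer uniform
```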

20 Redundancy and LDA
In real life we encounter templates, copy-paste, and document duplication. A rare word which appears in a copied document will bias the topics:
– It will get higher weight in the topic
– Noisy topics (two topics mixed together)
Solutions:
– Remove redundant documents (see the sketch below)
– Build a more complex model
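A minimal deduplication sketch (my own simplification of what "remove redundant documents" might look like): drop documents whose normalized text hashes collide. Real pipelines typically use shingling or MinHash to also catch near-duplicates.

```python
import hashlib

def dedupe(docs):
    seen, kept = set(), []
    for d in docs:
        # Normalize whitespace and case, then hash the result
        key = hashlib.md5(" ".join(d.lower().split()).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(d)
    return kept

print(dedupe(["ok to close this", "Ok  to close   this", "replaced the disks"]))
```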

21 Text Analytics System - Topic modeling is not enough…

22 Algorithmic pipeline (Lightly Supervised Content Modeling)
Pros:
– Model the data you have, not the data you think is out there
– Quick onboarding: reduce time to insight from the several months typical of generic systems to a few days
– Inject SME domain knowledge into the model
But how? Especially without going down the rule-based path…
Pipeline (data-driven machine learning modeling): Preprocess -> Unsupervised clustering -> SME annotation -> Post processing -> Actionable insight

23 Preprocessing tricks

24 Preprocess - Smart dimensionality reduction
Cluster words to detect synonyms: Brown Clustering
– Create a representation of the words with Brown Clustering [Peter F. Brown et al., Class-Based n-gram Models of Natural Language, Computational Linguistics]
– Model each word by all the contexts it appeared in: word x -> ["Cluster x", "the x"]
– Cluster using the mutual information criterion
– Extract likely synonyms using heuristics (see the sketch below)
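The following is not a full Brown clustering implementation, only a hedged approximation of the same idea: represent each word by the contexts it appears in, then group words whose context vectors are similar. Real Brown clustering greedily merges word classes to maximize mutual information over class bigrams.

```python
from collections import defaultdict
from math import sqrt

sentences = ["the backup failed", "the bakcup failed", "the backup completed",
             "renew the license", "renew the licence"]

# Context signature: counts of (previous word, next word) pairs per word
contexts = defaultdict(lambda: defaultdict(int))
for s in sentences:
    toks = ["<s>"] + s.split() + ["</s>"]
    for i in range(1, len(toks) - 1):
        contexts[toks[i]][(toks[i - 1], toks[i + 1])] += 1

def cosine(a, b):
    keys = set(a) | set(b)
    dot = sum(a[k] * b[k] for k in keys)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

print(cosine(contexts["backup"], contexts["bakcup"]))   # high: likely the same word
print(cosine(contexts["backup"], contexts["license"]))  # low: different words
```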

25 Preprocess – Results (smart dimensionality reduction)
Cluster "backup": backup (15492), backups (4419), backed (386), backup's (32), bakcup (28), bakup (14), backuped (14), bacup (8), backp (8), buckup (7), backu (7), backus (5), backuo (5), backup (14), backkup (4)
Cluster "license": licenses (347), licence (119), licences (54)
Cluster "networker": networker (8703), netwoker (59), netwroker (22), netowrker (15), networke (13), neworker (10), netorker (5)

26 Injecting SME knowledge into a Machine Learning Model

27 SME Annotation - Inject the domain knowledge
Unsupervised approaches provide us with clusters of documents / words. How can we use this to benefit the business need?
Step 1: label the document groups (easy???)

28 SME Annotation - Inject the domain knowledge
Provide as many layers of information as possible to make labeling easy:
– Single-word information
– Bi-gram (two-word) information:
  – Remove stop words
  – Choose statistically interesting word pairs, not just two common words that co-occur by sheer abundance (see the sketch below)
– Whole-sentence extraction using statistical summarization
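A sketch of one common way to score "statistically interesting" bigrams; the presenter does not name a specific measure, so pointwise mutual information (PMI) is my own choice here. PMI penalizes pairs that co-occur only because both words are individually frequent.

```python
from collections import Counter
from math import log2

sentences = ["two disk drives replaced", "replaced the disks",
             "two disks replacement", "you can close the ticket"]
stop = {"the", "you", "can"}

tokens = [[w for w in s.split() if w not in stop] for s in sentences]
unigrams = Counter(w for t in tokens for w in t)
bigrams = Counter(b for t in tokens for b in zip(t, t[1:]))
n_uni, n_bi = sum(unigrams.values()), sum(bigrams.values())

def pmi(bigram):
    w1, w2 = bigram
    p_joint = bigrams[bigram] / n_bi
    return log2(p_joint / ((unigrams[w1] / n_uni) * (unigrams[w2] / n_uni)))

for bg, _ in bigrams.most_common():
    print(bg, round(pmi(bg), 2))
```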

29 Model Tuning - Inject the domain knowledge
Is it enough? No, business users require high accuracy.
Solution: allow the SME to tune the model.
– Make it easy and simple, much simpler than making up rules
– The result should be optimized as input to an accurate supervised machine learning model

