©2014 Experian Information Solutions, Inc. All rights reserved. Experian Confidential.

Presentation transcript:

1 ©2014 Experian Information Solutions, Inc. All rights reserved. Experian Confidential.

2 Distributed Representation for Unstructured Data and Applications
Kevin Chen, Chief Scientist | North America Data Lab
©2014 Experian Information Solutions, Inc. All rights reserved. Experian and the marks used herein are service marks or registered trademarks of Experian Information Solutions, Inc. Other product and company names mentioned herein are the trademarks of their respective owners. No part of this copyrighted work may be reproduced, modified, or distributed in any form or manner without the prior written permission of Experian. Experian Confidential.

3 People go to But NOT LIKE? __________
[Slide images not preserved in the transcript]

4 Representation of unstructured data
• Gartner (*) predicted enterprise data volume to grow by 800% in the next five years
• Unstructured data is growing 62% faster
• 80% of data will be unstructured data
• Structured data:
  ► Well-studied
  ► Interval / categorical / ordinal
(*) Forbes, "Big Data -- Big Money Says It Is A Paradigm Buster", June 2012

5 Representation of unstructured data
• Unstructured data:
  ► Diverse types of data (text, audio, image, video)
  ► Need to be able to search, compare, understand, and predict
• Key question:
  ► How do we represent words, sentences, phrases, concepts, and objects, and use them in predictive modeling?
• Applications in Transactional Behavior Modeling:
  ► Merchant grouping
  ► Merchant characteristics insight
  ► Behavior shift detection

6 Language model and challenges
• Language model:
  ► P(“He likes to run”) = P(He) × P(likes | He) × P(to | He likes) × P(run | He likes to)
  ► P(red | The color of rose is) = ?
• Discrete representation (n-gram model):
  ► Curse of dimensionality: e.g. 4-grams → 1.6×10^17 combinations assuming |V| = 20,000
  ► Unable to detect ‘similarity’ easily
    ● “The cat is walking in the bedroom” vs. “A dog was running in a room”
  ► Requires smoothing
• Analogy: categorical variables
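The chain-rule factorization and the 4-gram count on this slide can be checked with a few lines of Python. This is only a minimal sketch: the three-sentence corpus and the function names (`ngram_counts`, `conditional_prob`, `sentence_prob`) are made up for illustration, and the estimates are plain unsmoothed counts.

```python
from collections import Counter

# Hypothetical toy corpus; not data from the presentation.
CORPUS = ["he likes to run", "he likes to swim", "she likes to run"]

def ngram_counts(sentences, n):
    """Count every n-gram (as a tuple of words) in the sentences."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.split()
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return counts

def conditional_prob(word, history, sentences):
    """Unsmoothed estimate of P(word | history) = count(history + word) / count(history)."""
    history = tuple(history)
    joint = ngram_counts(sentences, len(history) + 1)[history + (word,)]
    if history:
        seen = ngram_counts(sentences, len(history))[history]
    else:
        # Empty history: fall back to the unigram frequency.
        seen = sum(len(s.split()) for s in sentences)
    return joint / seen if seen else 0.0

def sentence_prob(sentence, sentences):
    """Chain rule: P(w1..wn) = product of P(wi | w1..w(i-1))."""
    words = sentence.split()
    prob = 1.0
    for i, word in enumerate(words):
        prob *= conditional_prob(word, words[:i], sentences)
    return prob

# Number of distinct 4-grams for a 20,000-word vocabulary (the slide's 1.6 x 10^17).
VOCAB_SIZE = 20_000
FOUR_GRAM_COMBINATIONS = VOCAB_SIZE ** 4

print(sentence_prob("he likes to run", CORPUS))
print(sentence_prob("she likes to swim", CORPUS))
print(FOUR_GRAM_COMBINATIONS)
```

In this toy corpus, "she likes to swim" never occurs, so its unsmoothed probability is zero even though every word and most of its sub-sequences were seen. That is the situation smoothing is meant to address.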

7 Neural distributed representation
• Hinton (1986), Bengio (2003)
• Each word is associated with a point in a lower-dimensional space (e.g. 200 dimensions)
• Benefits:
  ► Close vectors ⇒ similar words
  ► Reduced dimensions enable near-real-time lookup of similar words and distances
    ● e.g. 8 TB (1,000,000 × 1,000,000 × 8 bytes) vs. 1.6 GB (200 × 1,000,000 × 8 bytes)
  ► Language models can be derived with much smaller training data
  ► Compositionality: ability to express negativity using dissimilarity
[Figure: the words Before, Prior, After, Called, Saying, Told, About, Around, with example vectors (0.31,0.12,…,0.20), (0.29,0.11,…,0.21), (0.15,0.82,…,0.57), (0.16,0.81,…,0.55)]
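The "close vectors, similar words" idea and the storage figures can be illustrated as follows. This is a sketch only: the 3-dimensional vectors and the helper names (`cosine`, `most_similar`) are invented for the example and are not the vectors from the slide's figure.

```python
import math

# Hypothetical 3-dimensional embeddings; real models use e.g. 200 dimensions.
EMBEDDINGS = {
    "before": [0.90, 0.10, 0.20],
    "prior":  [0.85, 0.15, 0.25],
    "told":   [0.10, 0.90, 0.40],
    "saying": [0.15, 0.85, 0.45],
}

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def most_similar(word, embeddings, k=1):
    """Return the k words whose vectors are closest to `word` by cosine similarity."""
    query = embeddings[word]
    scored = [(other, cosine(query, vec))
              for other, vec in embeddings.items() if other != word]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Storage arithmetic from the slide, assuming 8 bytes per stored value.
VOCAB = 1_000_000
PAIRWISE_BYTES = VOCAB * VOCAB * 8   # full word-by-word matrix: 8 x 10^12 bytes (8 TB)
EMBEDDING_BYTES = 200 * VOCAB * 8    # 200-dim vector per word: 1.6 x 10^9 bytes (1.6 GB)

print(most_similar("before", EMBEDDINGS))
print(PAIRWISE_BYTES, EMBEDDING_BYTES)
```

A linear scan like this is enough for a toy vocabulary; at a million entries an approximate nearest-neighbor index would likely be needed to keep lookups near real time.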

8 Recurrent Neural Network Language Model
• Tomas Mikolov (2010) (Google, Facebook)
• A recurrent neural network takes the previous state s(t-1) as part of its input
• w(t): current word at time t; y(t): next word
• U: distributed representation
• The current state s(t) takes into account the current word w(t) and the previous state s(t-1)
• Back-propagation is used to update V and U
• The recurrent weights W are updated by unfolding the network in time and training it as a deep neural network
[Figure: the network unfolded in time. Inputs w(t-2), w(t-1), w(t) enter through U; states s(t-3), s(t-2), s(t-1), s(t) are linked by W; output y(t) is produced through V. Label: Bi-gram neural network LM]
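One forward step of such a network can be sketched in plain Python. The sketch assumes the common formulation s(t) = sigmoid(U·w(t) + W·s(t-1)) and y(t) = softmax(V·s(t)) with a one-hot w(t); the sizes, random weights, and function names are illustrative, and training (back-propagation through time) is left out.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def rnn_step(word_id, prev_state, U, W, V):
    """Compute the new state s(t) and the next-word distribution y(t).

    U[word_id] is the word's distributed representation (the column of U
    selected by a one-hot w(t)); W is hidden x hidden; V is vocab x hidden.
    """
    hidden = len(prev_state)
    state = [
        sigmoid(U[word_id][j] + sum(W[j][k] * prev_state[k] for k in range(hidden)))
        for j in range(hidden)
    ]
    y = softmax([sum(V[i][j] * state[j] for j in range(hidden)) for i in range(len(V))])
    return state, y

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

# Hypothetical sizes: 5-word vocabulary, 3 hidden units.
rng = random.Random(0)
VOCAB_SIZE, HIDDEN = 5, 3
U = random_matrix(VOCAB_SIZE, HIDDEN, rng)
W = random_matrix(HIDDEN, HIDDEN, rng)
V = random_matrix(VOCAB_SIZE, HIDDEN, rng)

state = [0.0] * HIDDEN
for word_id in [0, 3, 1]:  # a toy sequence of word ids
    state, y = rnn_step(word_id, state, U, W, V)
print(state, y)
```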

9 word2vec
[Figure: two architectures, each with INPUT, PROJECTION, and OUTPUT layers. Skip-gram: input W(t); outputs W(t-2), W(t-1), W(t+1), W(t+2). CBOW: inputs W(t-2), W(t-1), W(t+1), W(t+2); output W(t). Weight labels: Syn1, W]
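The difference between the two architectures shows up in the training examples each one consumes. The sketch below uses a fixed window of 2 to match the figure; the function names are illustrative, and word2vec implementations typically also vary the effective window and subsample frequent words, which is omitted here.

```python
def skipgram_pairs(tokens, window=2):
    """Skip-gram: each center word W(t) predicts every word in its window."""
    pairs = []
    for t, center in enumerate(tokens):
        for offset in range(-window, window + 1):
            c = t + offset
            if offset != 0 and 0 <= c < len(tokens):
                pairs.append((center, tokens[c]))
    return pairs

def cbow_examples(tokens, window=2):
    """CBOW: the surrounding context words jointly predict the center word W(t)."""
    examples = []
    for t, target in enumerate(tokens):
        lo, hi = max(0, t - window), min(len(tokens), t + window + 1)
        context = [tokens[c] for c in range(lo, hi) if c != t]
        if context:
            examples.append((context, target))
    return examples

tokens = ["a", "b", "c", "d", "e"]
print(skipgram_pairs(tokens))
print(cbow_examples(tokens))
```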

10 Application – merchant similarity and grouping
• The DataLab has been working on plastic card transaction data with merchant information
• Merchant information:
  ► MCC (merchant category code): not sufficient for categorizing merchants or for identifying consumers’ behavior
  ► Merchant names: noisy, and not informative enough about the merchants’ business
• Neural distributed representation:
  ► Word: merchant ID
  ► Sentence: close sequence of merchants in transactions
  ► Model: skip-gram
  ► 1.3M unique merchants, 835M transactions
  ► Trained in 280 minutes using 30 threads
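One way to turn transactions into "sentences" of merchant IDs is sketched below. The slide does not say how a "close sequence" was defined, so the rule used here (same card, consecutive transactions no more than `max_gap_hours` apart) and the record layout `(card_id, hour, merchant_id)` are assumptions made for illustration.

```python
from collections import defaultdict

def merchant_sentences(transactions, max_gap_hours=24):
    """Group each card's transactions into sequences of merchant IDs.

    A new sequence starts whenever the gap to the previous transaction on
    the same card exceeds max_gap_hours (an assumed rule, not the slide's).
    """
    by_card = defaultdict(list)
    for card_id, hour, merchant_id in transactions:
        by_card[card_id].append((hour, merchant_id))

    sentences = []
    for card_id in sorted(by_card):
        current, last_hour = [], None
        for hour, merchant_id in sorted(by_card[card_id]):
            if last_hour is not None and hour - last_hour > max_gap_hours:
                sentences.append(current)
                current = []
            current.append(merchant_id)
            last_hour = hour
        if current:
            sentences.append(current)
    return sentences

# Hypothetical records: (card_id, hours since some reference time, merchant_id).
TRANSACTIONS = [
    ("c1", 0, "M1"), ("c1", 2, "M2"), ("c1", 100, "M3"),
    ("c2", 5, "M4"), ("c2", 6, "M1"),
]
print(merchant_sentences(TRANSACTIONS))
```

The resulting lists play the role of tokenized sentences and could be passed to any skip-gram trainer, with merchant IDs in place of words.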

11 Merchant group – international travel (selected merchants)
MCC  | Category            | Merchant
6300 | Insurance           | WORLD NOMADS
3572 | MIYAKO HOTELS       | SHERATON MIYAKO TOKTO
4814 | Telecomm            | ONESIMCARD.COM
9399 | Government Services | CNE US CONSULAT

12 Who’s Like Me -- Additive Compositionality
[Figure: merchants shown: POTTERY BARN KIDS, CHILDRENS PLACE, CRATE&BARREL, NORDSTROM, SEARS HOMETOWN, DOLLAR GENERAL]
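Additive compositionality is usually demonstrated by summing two vectors and looking up the nearest neighbors of the result. The transcript does not preserve the actual combinations shown on the slide, so the sketch below uses invented merchant labels and 2-dimensional vectors purely to show the mechanics.

```python
import math

# Hypothetical merchant embeddings (axis 0: home furnishing, axis 1: children).
MERCHANTS = {
    "HOME_STORE":     [1.0, 0.0],
    "KIDS_CLOTHING":  [0.0, 1.0],
    "KIDS_FURNITURE": [0.7, 0.7],
    "GOLF_SHOP":      [-1.0, 0.1],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def nearest_to_sum(names, embeddings, k=1):
    """Add the vectors of `names` and return the k closest other merchants."""
    dims = len(next(iter(embeddings.values())))
    query = [sum(embeddings[n][d] for n in names) for d in range(dims)]
    scored = [(other, cosine(query, vec))
              for other, vec in embeddings.items() if other not in names]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

print(nearest_to_sum(["HOME_STORE", "KIDS_CLOTHING"], MERCHANTS))
```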

13 Who’s Like Me -- More Examples
[Figure: merchant shown: GOLF GALAXY]

14 Application – Behavior Shift Detection
• People are creatures of habit: there should be a ‘language model’ to describe a consumer’s shopping patterns
• Helps financial institutions focus on consumers whose behavior is out of the ordinary
• (1) Potential fraud compromise, (2) Life-style change
[Figure: histogram of count vs. similarity to the past 20 transactions, comparing actual transactions with transactions randomly generated with the same ZIP distribution; outliers marked]
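A score of the kind plotted in the figure could be computed as below. The slide only names "similarity to past 20 transactions"; taking the mean cosine similarity and flagging scores under a fixed threshold are assumptions of this sketch, as are the toy embeddings and the threshold value.

```python
import math

# Hypothetical 2-dimensional merchant embeddings.
EMB = {
    "GROCERY": [1.0, 0.1],
    "GAS":     [0.9, 0.2],
    "CASINO":  [-0.2, 1.0],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def similarity_to_history(new_merchant, history, embeddings, window=20):
    """Mean cosine similarity between a new merchant and the last `window` merchants."""
    recent = history[-window:]
    if not recent:
        raise ValueError("history must contain at least one transaction")
    sims = [cosine(embeddings[new_merchant], embeddings[m]) for m in recent]
    return sum(sims) / len(sims)

def is_outlier(new_merchant, history, embeddings, threshold=0.5):
    """Flag a transaction whose similarity to recent history is below the threshold."""
    return similarity_to_history(new_merchant, history, embeddings) < threshold

HISTORY = ["GROCERY", "GAS"] * 10  # the consumer's past 20 transactions
print(similarity_to_history("GROCERY", HISTORY, EMB))
print(similarity_to_history("CASINO", HISTORY, EMB))
print(is_outlier("CASINO", HISTORY, EMB))
```

In practice the threshold would more plausibly be calibrated against a baseline such as the randomly generated transactions mentioned in the figure, rather than fixed by hand.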

15 Summary
• We have demonstrated that a neural distributed representation can be used to capture relationships among merchants in the transaction data
• Compositionality allows higher-order understanding of merchant relationships
• Reduced dimensions in the representation enable near-real-time lookup of similar merchants
• Future directions:
  ► Reduce the effect of localization by linking local merchants into a higher level of aggregation
  ► Further develop the behavior shift detection framework
  ► Deep learning of higher-order structures: Recurrent Neural Net, Convolutional Net, etc.

16 #FOIC2014

17 Kevin Chen, Chief Scientist, North America Data Lab, Experian
e: kevin.chen@experian.com

