3Josh Patterson Email: Twitter: @jpatanooga Github: Twitter: @jpatanoogaGithub:github.com/jpatanoo gaPastPublished in IAAI-09:“TinyTermite: A Secure Routing Algorithm”Grad work in Meta-heuristics, Ant- algorithmsTennessee Valley Authority (TVA)Hadoop and the SmartgridClouderaPrincipal Solution ArchitectToday: Patterson Consulting
4Overview What Is Deep Learning? Neural Nets and Optimization AlgorithmsImplementation on Hadoop/YARNResults
5What is Deep Learning?Machine perception, pattern recognition.
6What Is Deep Learning?Algorithms called neural nets that learn to recognize patterns:Nodes learn smaller features of larger patternsAnd combine them to recognize feature groupsUntil finally they can classify objects, faces, etc.Each node layer in net learns larger groups
7Properties of Deep Learning Small training sets, they learn unsupervised dataThey save data scientists months of workAnything you can vectorize, DL nets can learnThey can handle millions of parametersAfter training, DL models are one, small vector
8Chasing Nature Learning sparse representations of auditory signals Leads to filters that correspond to neurons in early audio processing in mammalsWhen applied to speechLearned representations show a resemblance to cochlear filters in the auditory cortex.
9Yann Lecun on Deep Learning DL is the dominant method for acoustic modeling in speech recognitionIt is becoming dominant in machine vision for:object recognitionobject detectionsemantic segmentation.
11Restricted Boltzmann Machines RBMs are building blocks for deeper nets. They deal with Binary and Continuous data differently.BinaryContinuousDL nets learn through repeated sampling of input data. By reconstructing data, they learn which features matter.
12What Is a Deep-Belief Network? A stack of restricted Boltzmann machinesA generative probabilistic model1) A visible (input) layer …2) Two or more hidden layers that learn more & more complex features…3) An output layer that classifies the input.
13A Recursive Neural Tensor Network? RNTN’s are top-down; DBN’s are feed-forwardA tensor is a multi-dimensional matrixRNTN’s handle multiplicityScene and sentence parsing, windows of eventsExplain the difference between top-down and feed-forward.
14A Deep Autoencoder? DA’s are good for QA systems like Watson They encode lots of data in 30-numeral vectorsThey do not require pretraining, saving time.They are useful in information retrieval, condensing data, and cryptography.Explain how cosine similarity works in a vector space to retrieve relevant information.
15A Convolutional Net? ConvNets are asymmetric RBMs. ConvNets learns images in patches from a gridThey handle 2D and 3D imagesExplain the symmetry of RBMs, the asymmetry of ConvNets.
16DeepLearning4J The most complete, production-ready open- source DL lib Written in Java: Uses Akka, Hazelcast and JblasDistributed to run fast, built for non-specialistsMore features than Theano-based toolsTalks to any data source, expects 1 format
17DL4J Serves IndustryNonspecialists can rely on its conventions to solve computationally intensive problemsUsability first – DL4J follows ML tool conventionsDL4J’s nets work equally well with text, image, sound and time-seriesDL4J will integrate with Python community through SDKs
18Vectorized Implementation Handles lots of data concurrently.Any number of examples at once, but the code does not change.Faster: Allows for native and GPU execution.One input format: Everything is a matrix.Image, sound, text, time series are vectorized.
19DL4J vs Theano vs TorchDL4J’s distributed nature means problems can be solved by “throwing CPUs at them.”Java ecosystem has GPU integration tools.Theano is not distributed, and Torch7 has not automated its distribution like DL4J.DL4J’s matrix multiplication is native w/ Jblas.
20What Are Good Applications for DL? Recommendation engines (e-commerce)DL can model consumer and user behaviorAnomaly detection (fraud, money laundering)DL can recognize early signals of bad outcomesSignal processing (CRM, ERP)DL has predictive capacity with time-series data
21DL4J Vectorizes & Analyzes Text Sentiment analysisLogsNews articlesSocial mediaExplain the uses of sentiment analysis. Lie detection. Contagion of feeling a la Facebook experiments. Text correlating with other events like stock movements and social unrest.
22DL on Hadoop and AWSBuild Your Own Google Brain …
23Past Work: Parallel Iterative Algos on YARN Started withParallel linear, logistic regressionParallel Neural Networks“Metronome” packages DL4J for Hadoop100% Java, ASF 2.0 Licensed, on Github
24MapReduce vs. Parallel Iterative ProcessorSuperstep 1Superstep 2InputOutputMapMapMapBenefits of data flow: runtime can decide where to run tasks and can automatically recover from failuresAcyclic data flow is a powerful abstraction, but is not efficient for applications that repeatedly reuse a working set of data:Iterative algorithms (many in machine learning)• No single programming model or framework can excel atevery problem; there are always tradeoﬀs between simplicity, expressivity, fault tolerance, performance, etc.ReduceReduce
25SGD: Serial vs Parallel Worker 1MasterPartial ModelGlobal ModelWorker 2Worker NSplit 1Split 2Split 3…ModelTraining DataPOLR: Parallel Online Logistic RegressionTalking points:wanted to start with a known tool to the Hadoop community, with expected characteristicsMahout’s SGD is well known, and so we used that as a base point
26Managing Resources Running through YARN on Hadoop is important Allows for workflow schedulingAllows for scheduler oversightAllows the jobs to be first-class citizens on HadoopAnd shares resources nicely
27Parallelizing Deep-Belief Networks Two-phase trainingPretrainFine-tuneEach phase can do multiple passes over datasetEntire network is averaged at master
28PreTrain and Lots of Data We’re exploring how to better leverage the unsupervised aspects of the PreTrain phase of Deep-Belief NetworksAllows for the use of far more unlabeled dataAllows us to more easily model the massive amounts of structured data in HDFS
35What’s Next? GPU integration in the cloud (AWS) Better vectorization tooling & data pipelinesMove YARN version back over to JBLAS for matricesSpark
36References “A Fast-Learning Algorithm for Deep Belief Nets” Hinton, G. E., Osindero, S. and Teh, Y. - Neural Computation (2006)“Large Scale Distributed Deep Networks”Dean, Corrado, Monga - NIPS (2012)“Visually Debugging Restricted Boltzmann Machine Training with a 3D Example”Yosinski, Lipson - Representation Learning Workshop (2012)
37Parameter Averaging McDonald, 2010 Langford, 2007 Distributed Training Strategies for the Structured PerceptronLangford, 2007Vowpal WabbitJeff Dean’s Work on Parallel SGDDownPour SGDBottou similar to Xu2010 in the 2010 paper