Hands on Deep Learning with Keras, TensorFlow, and Apache Spark™

©Databricks 2019

Schedule

Keras and Neural Network Fundamentals

MLflow and Spark UDFs

Horovod: Distributed Model Training

LIME, SHAP & Model Interpretability

CNNs and ImageNet

Transfer Learning


Course Non-Objectives

Demonstrate every new research technique/API

Detailed Math/CS behind the algorithms


Survey

Spark before?


Pandas/Numpy?


Machine Learning? Deep Learning?


Expectations?


Deep Learning Overview


What is Deep Learning?

Composing representations of data in a hierarchical manner


Why are you here?


Keras

High-level Python API to build neural networks

Official high-level API of TensorFlow

Supports multiple backends: TensorFlow, Theano, and CNTK

Has over 250,000 users

Released by François Chollet in 2015


Why Keras?


Neural Network Fundamentals


Layers

Input layer (fixed)

Zero or more hidden layers

Output layer (fixed)
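
A minimal tf.keras sketch of this structure; the layer sizes and the 4-feature input shape are illustrative, not from the course materials:

from tensorflow import keras

model = keras.Sequential([
    keras.layers.Dense(16, activation="relu", input_shape=(4,)),  # hidden layer; input size is fixed by the data (4 features)
    keras.layers.Dense(1),                                        # output layer; size is fixed by the task (1 regression target)
])
model.summary()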


Backpropagation

Calculate gradients to update weights
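
A toy NumPy sketch of one such update for a single linear unit with squared loss; all values are illustrative:

import numpy as np

x = np.array([1.0, 2.0])   # one input example with two features
w = np.array([0.1, -0.3])  # current weights
y = 1.5                    # true label
lr = 0.01                  # learning rate

y_hat = w @ x                    # forward pass
loss = (y - y_hat) ** 2          # squared loss
grad_w = -2.0 * (y - y_hat) * x  # dLoss/dw via the chain rule
w = w - lr * grad_w              # gradient-descent weight update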


Regression Evaluation

Measure "closeness" between label and prediction

  • When predicting someone's weight, it is better to be off by 2 lbs than by 20 lbs

Evaluation metrics:

  • Loss: $(y - \hat{y})$
  • Absolute loss: $|y - \hat{y}|$
  • Squared loss: $(y - \hat{y})^2$

Evaluation metric: MSE

$Error_{i} = (y_{i} - \hat{y_{i}})$

$SE_{i} = (y_{i} - \hat{y_{i}})^2$

$SSE = \sum_{i=1}^n (y_{i} - \hat{y_{i}})^2$

$MSE = \frac{1}{n}\sum_{i=1}^n (y_{i} - \hat{y_{i}})^2$

$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^n (y_{i} - \hat{y_{i}})^2}$
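
For concreteness, a small NumPy sketch of these quantities (the labels and predictions are made up):

import numpy as np

y = np.array([3.0, 5.0, 7.5])      # true labels
y_hat = np.array([2.5, 5.5, 9.0])  # predictions

errors = y - y_hat             # per-example error
sse = np.sum(errors ** 2)      # sum of squared errors
mse = np.mean(errors ** 2)     # mean squared error
rmse = np.sqrt(mse)            # root mean squared error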


Train vs. Test MSE

Which is more important? Why?


Linear Regression with Keras

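A minimal sketch of linear regression as a Keras model; the synthetic data, feature count, and epoch count below are placeholders rather than the lab's actual setup:

import numpy as np
from tensorflow import keras

# Toy data: 100 examples, 3 features
X = np.random.rand(100, 3)
y = (X @ np.array([1.5, -2.0, 0.5]) + 0.1 * np.random.randn(100)).reshape(-1, 1)

# A single Dense unit with no activation is linear regression
model = keras.Sequential([
    keras.layers.Dense(1, input_shape=(3,))
])
model.compile(optimizer="sgd", loss="mse")
model.fit(X, y, epochs=10, verbose=0)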

Activation Functions

Provide non-linearity in our neural networks to learn more complex relationships

Sigmoid

Tanh (hyperbolic tangent)

ReLU

Leaky ReLU

PReLU

ELU
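
As a rough reference, a few of these can be written in NumPy as follows (alpha is the small negative-side slope used by Leaky ReLU):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))       # squashes to (0, 1); not zero-centered

def tanh(x):
    return np.tanh(x)                      # zero-centered, but still saturates

def relu(x):
    return np.maximum(0.0, x)              # zero for all negative inputs

def leaky_relu(x, alpha=0.01):
    return np.where(x < 0, alpha * x, x)   # small slope for negative inputs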


Sigmoid

Saturates and kills gradients

Not zero-centered

Image credit A. Karpathy

Tanh (Hyperbolic Tangent)

Zero centered!

BUT, like the sigmoid, its activations saturate

Image credit A. Karpathy

ReLU

BUT, gradients can still go to zero (the gradient is exactly zero for negative inputs, so units can "die")

Image credit A. Karpathy

Leaky ReLU

$$f(x) = \begin{cases} x & \text{for } x \geq 0 \\ \alpha x & \text{for } x < 0 \end{cases}$$

Image credit A. Karpathy

Comparison


Optimizers


Stochastic Gradient Descent (SGD)

Choosing a proper learning rate can be difficult

Image credit F. Chollet

Stochastic Gradient Descent

Easy to get stuck in local minima

Image credit F. Chollet

Momentum

Accelerates SGD: Like pushing a ball down a hill

Keeps a running average of the direction we have been heading (a velocity term that accumulates past gradients)

Dampens back-and-forth oscillation and helps escape local minima


ADAM

Adaptive Moment Estimation (Adam)

Combines momentum with per-parameter adaptive learning rates
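
A sketch of how these optimizers are configured in tf.keras; the learning-rate and momentum values are illustrative:

from tensorflow import keras

# SGD with momentum: accumulates a velocity from past gradients
sgd = keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)

# Adam: momentum plus per-parameter adaptive learning rates
adam = keras.optimizers.Adam(learning_rate=0.001)

model = keras.Sequential([keras.layers.Dense(1, input_shape=(3,))])
model.compile(optimizer=adam, loss="mse")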



Keras


Keras Lab


Hyperparameter Selection


Hyperparameter Selection

Which dataset should we use to select hyperparameters? Train? Test?


Validation Dataset

Split the dataset into three!

  • Train on the training set
  • Select hyperparameters based on performance of the validation set
  • Test on test set
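
One way to do the three-way split, sketched here with scikit-learn's train_test_split (the 60/20/20 fractions are illustrative):

import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(1000, 3)  # placeholder features
y = np.random.rand(1000)     # placeholder labels

# Carve off the test set first, then split the remainder into train/validation
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=42)

# Keras reports validation loss each epoch via validation_data:
# model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=10)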

Advanced Keras & Lab


MLflow & Lab


Horovod


Horovod

Created by Alexander Sergeev of Uber, open-sourced in 2017

Simplifies distributed neural network training

Supports TensorFlow, Keras, and PyTorch


Classical Parameter Server


All-Reduce


					

# Only one line of code change!
optimizer = hvd.DistributedOptimizer(optimizer)
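
For context, a rough outline of the surrounding Horovod + Keras boilerplate, based on the horovod.tensorflow.keras API (details vary by version and cluster setup):

import horovod.tensorflow.keras as hvd
from tensorflow import keras

hvd.init()  # one process per GPU/worker

model = keras.Sequential([keras.layers.Dense(1, input_shape=(3,))])

# Scale the learning rate by the number of workers, then wrap the optimizer
optimizer = keras.optimizers.SGD(learning_rate=0.01 * hvd.size())
optimizer = hvd.DistributedOptimizer(optimizer)

model.compile(optimizer=optimizer, loss="mse")

# Broadcast initial weights from rank 0 so every worker starts in sync
callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]
# model.fit(X_train, y_train, callbacks=callbacks, epochs=10)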


Horovod Demo


Model Interpretability


Convolutional Neural Networks


Convolutions

Focus on Local Connectivity (fewer parameters to learn)

Filter/kernel slides across input image (often 3x3)

Image Kernels Visualization


CS 231 Convolutional Networks
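
A small tf.keras sketch of a convolutional stack; the filter counts, kernel sizes, and input shape are illustrative:

from tensorflow import keras

model = keras.Sequential([
    # 32 filters, each a 3x3 kernel sliding across the image
    keras.layers.Conv2D(32, (3, 3), activation="relu", input_shape=(224, 224, 3)),
    keras.layers.MaxPooling2D((2, 2)),   # downsample by keeping local maxima
    keras.layers.Conv2D(64, (3, 3), activation="relu"),
    keras.layers.MaxPooling2D((2, 2)),
    keras.layers.Flatten(),
    keras.layers.Dense(10, activation="softmax"),
])
model.summary()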

ImageNet Challenge

Classify images in one of 1000 categories

2012 Deep Learning breakthrough with AlexNet: 15.3% top-5 test error rate (the next-best entry was 26.2%)


VGG16 (2014)

One of the most widely used architectures for its simplicity


Max vs Avg. Pooling


Residual Connection


Inception


What do CNNs Learn?

Breaking Convnets

CNN Demo


Transfer Learning


Transfer Learning

IDEA: Intermediate representations learned for one task may be useful for other related tasks
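
One common recipe, sketched with tf.keras: load a network pretrained on ImageNet, freeze its convolutional base, and train only a new head (the head sizes and the 2-class target are illustrative):

from tensorflow import keras

# VGG16 convolutional base pretrained on ImageNet, without its classifier head
base = keras.applications.VGG16(weights="imagenet", include_top=False,
                                input_shape=(224, 224, 3))
base.trainable = False  # freeze the pretrained representations

model = keras.Sequential([
    base,
    keras.layers.Flatten(),
    keras.layers.Dense(256, activation="relu"),
    keras.layers.Dense(2, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])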


When to use Transfer Learning?


Transfer Learning


Ok, so how do I find the optimal neural network architecture?

Neural Architecture Search with Reinforcement Learning


Resources

Horovod Meetup Talk

MLflow

Deep Learning with Python

Stanford's CS 231

fast.ai

Blog posts & webinars

Databricks Runtime for ML


Pyception


Thank you!