Hands on Deep Learning with Keras, TensorFlow, and Apache Spark™

Schedule

Keras and Neural Network Fundamentals

MLflow and Spark UDFs

Horovod: Distributed Model Training

LIME, SHAP & Model Interpretability

CNNs and ImageNet

Transfer Learning

Course Non-Objectives

Demonstrate every new research technique/API

Detailed Math/CS behind the algorithms

Survey

Spark before?


Pandas/Numpy?


Machine Learning? Deep Learning?


Expectations?

Deep Learning Overview

Why Deep Learning?

Performs well on complex datasets like images, sequences, and natural language

Performance keeps improving as data size increases, where classical ML methods tend to plateau

Can theoretically approximate any continuous function (universal approximation theorem)

Open Source Landscape

Where Does DL Fit In?

What is Deep Learning?

Composing representations of data in a hierarchical manner

Keras

High-level Python API to build neural networks

Official high-level API of TensorFlow

Supports multiple backends: TensorFlow, Theano, and CNTK

Has over 250,000 users

Released by François Chollet in 2015

Why Keras?

Hardware Considerations

GPUs have many more cores and higher memory bandwidth

CPUs are generally acceptable for inference

The iterative nature of training makes parallelism challenging

Why DL on Databricks?

Neural Network Fundamentals

Layers

Input layer

Zero or more hidden layers

Output layer

Regression Evaluation

Measure "closeness" between label and prediction

  • When predicting someone's weight, it is better to be off by 2 lbs than by 20 lbs

Evaluation metrics:

  • Loss: $(y - \hat{y})$
  • Absolute loss: $|y - \hat{y}|$
  • Squared loss: $(y - \hat{y})^2$

Evaluation metric: MSE

$Error_{i} = (y_{i} - \hat{y_{i}})$

$SE_{i} = (y_{i} - \hat{y_{i}})^2$

$SSE = \sum_{i=1}^n (y_{i} - \hat{y_{i}})^2$

$MSE = \frac{1}{n}\sum_{i=1}^n (y_{i} - \hat{y_{i}})^2$

$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^n (y_{i} - \hat{y_{i}})^2}$
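
As a sanity check, here is a minimal NumPy sketch of these metrics on made-up weight predictions (the numbers are illustrative only):

```python
import numpy as np

y = np.array([150.0, 172.0, 168.0])      # true weights (lbs), hypothetical
y_hat = np.array([152.0, 170.0, 160.0])  # model predictions, hypothetical

errors = y - y_hat                # per-example error
sse = np.sum(errors ** 2)         # sum of squared errors
mse = np.mean(errors ** 2)        # mean squared error
rmse = np.sqrt(mse)               # root mean squared error, back in lbs
print(sse, mse, rmse)             # 72.0, 24.0, ~4.9
```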

Train vs. Test MSE

Which is more important? Why?

Backpropagation

Propagate the error backward through the network to calculate gradients and update weights
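
A minimal sketch of one gradient step for a linear model under MSE, on hypothetical toy data, to make "calculate gradients to update weights" concrete:

```python
import numpy as np

# Toy data where the true relationship is y = 2x
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])
w, b, lr = 0.0, 0.0, 0.1          # initial weights and learning rate

y_hat = w * x + b                  # forward pass
grad_w = np.mean(2 * (y_hat - y) * x)  # dMSE/dw
grad_b = np.mean(2 * (y_hat - y))      # dMSE/db
w -= lr * grad_w                   # step against the gradient
b -= lr * grad_b
```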

Linear Regression with Keras
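
A minimal sketch of linear regression in Keras on synthetic data; a single Dense unit with no activation is exactly a linear model:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Synthetic data: y = 3x + 2 plus noise
X = np.random.rand(100, 1)
y = 3 * X[:, 0] + 2 + 0.1 * np.random.randn(100)

model = keras.Sequential([layers.Dense(1, input_shape=(1,))])  # one linear unit
model.compile(optimizer="sgd", loss="mse")
model.fit(X, y, epochs=50, verbose=0)
print(model.get_weights())  # should approach weight ~3 and bias ~2
```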

Activation Functions

Provide non-linearity in our neural networks to learn more complex relationships

Sigmoid

Tanh

ReLU

Leaky ReLU

PReLU

ELU

Sigmoid

Saturates and kills gradients

Not zero-centered

Image credit A. Karpathy

Tanh

Zero-centered!

BUT, like the sigmoid, its activations saturate

Image credit A. Karpathy

ReLU

BUT, gradients are exactly zero for x < 0, so units can "die"

Image credit A. Karpathy

Leaky ReLU

$$f(x) = \begin{cases} \alpha x & \text{if } x < 0 \\ x & \text{if } x \ge 0 \end{cases}$$

Image credit A. Karpathy
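
To summarize the activations above, here is a minimal NumPy sketch of their defining formulas (not the Keras implementations):

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))           # saturates for large |x|; not zero-centered

def tanh(x):
    return np.tanh(x)                      # zero-centered, but still saturates

def relu(x):
    return np.maximum(0, x)                # gradient is exactly 0 for x < 0

def leaky_relu(x, alpha=0.01):
    return np.where(x < 0, alpha * x, x)   # small slope keeps gradients alive for x < 0
```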

Comparison

Optimizers

Stochastic Gradient Descent (SGD)

Choosing a proper learning rate can be difficult

Image credit F. Chollet

Stochastic Gradient Descent

Easy to get stuck in local minima

Image credit F. Chollet

Momentum

Accelerates SGD: Like pushing a ball down a hill

Takes a running average of the direction we've been heading: the gradient acts as acceleration on an accumulated velocity

Damps oscillation back and forth and helps escape local minima

Adam

Adaptive Moment Estimation (Adam)
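
A minimal sketch of how these optimizers are configured in Keras; `model` is assumed to be defined earlier, and the learning rates are illustrative defaults:

```python
from tensorflow import keras

# Plain SGD: sensitive to the learning rate choice
sgd = keras.optimizers.SGD(learning_rate=0.01)

# SGD with momentum: averages recent directions to damp oscillation
sgd_momentum = keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)

# Adam: per-parameter learning rates from first/second moment estimates
adam = keras.optimizers.Adam(learning_rate=0.001)

model.compile(optimizer=adam, loss="mse")  # assumes `model` from an earlier example
```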


Keras

Keras Lab

Hyperparameter Selection

Hyperparameter Selection

Which dataset should we use to select hyperparameters? Train? Test?

Validation Dataset

Split the dataset into three!

  • Train on the training set
  • Select hyperparameters based on performance of the validation set
  • Test on test set
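
A minimal sketch of the three-way split using scikit-learn's train_test_split; `X` and `y` are assumed to exist, and the 60/20/20 proportions are illustrative:

```python
from sklearn.model_selection import train_test_split

# First carve out the test set, then split the remainder into train/validation
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, test_size=0.25, random_state=42)
# Result: 60% train / 20% validation / 20% test
```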

Advanced Keras & Lab

MLflow & Lab

Horovod

Horovod

Created by Alexander Sergeev of Uber, open-sourced in 2017

Simplifies distributed neural network training

Supports TensorFlow, Keras, and PyTorch

Classical Parameter Server

All-Reduce



# Only one line of code change!
optimizer = hvd.DistributedOptimizer(optimizer)
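
For context, a minimal sketch of the surrounding Horovod/Keras setup; `model`, `X_train`, and `y_train` are assumed to be defined elsewhere:

```python
import horovod.tensorflow.keras as hvd
from tensorflow import keras

hvd.init()  # one process per GPU

# Scale the learning rate by the number of workers
optimizer = keras.optimizers.SGD(learning_rate=0.01 * hvd.size())
optimizer = hvd.DistributedOptimizer(optimizer)  # all-reduce gradient averaging

model.compile(optimizer=optimizer, loss="mse")
callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]  # sync initial weights from rank 0
model.fit(X_train, y_train, callbacks=callbacks,
          verbose=1 if hvd.rank() == 0 else 0)  # only rank 0 logs
```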

Horovod Demo

Model Interpretability

Convolutional Neural Networks

Convolutions

Focus on Local Connectivity (fewer parameters to learn)

Filter/kernel slides across input image (often 3x3)

Image Kernels Visualization


CS 231 Convolutional Networks
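
A minimal Keras sketch of a convolutional layer with 3x3 filters, assuming MNIST-shaped 28x28 grayscale inputs; note how few parameters it needs compared with a dense layer over all pixels:

```python
from tensorflow import keras
from tensorflow.keras import layers

# Each 3x3 filter slides across the input; local connectivity means each
# output unit depends only on a small patch, so far fewer weights to learn.
model = keras.Sequential([
    layers.Conv2D(32, kernel_size=(3, 3), activation="relu", input_shape=(28, 28, 1)),
    layers.MaxPooling2D(pool_size=(2, 2)),
    layers.Flatten(),
    layers.Dense(10, activation="softmax"),
])
model.summary()  # Conv2D layer: only 32 * (3*3*1 + 1) = 320 parameters
```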

ImageNet Challenge

Classify images into one of 1,000 categories

2012 Deep Learning breakthrough with AlexNet: 16% top-5 test error rate (next closest was 25%)

VGG16 (2014)

One of the most widely used architectures, thanks to its simplicity

Max vs Avg. Pooling
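
A tiny NumPy sketch contrasting the two pooling operations on one 2x2 patch:

```python
import numpy as np

patch = np.array([[1, 3],
                  [2, 8]])
print(patch.max())   # max pooling keeps the strongest activation: 8
print(patch.mean())  # average pooling smooths the response: 3.5
```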

Residual Connection
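
A minimal sketch of a residual (skip) connection using the Keras functional API; it assumes the input already has `filters` channels so the shapes match for the addition:

```python
from tensorflow.keras import layers

def residual_block(x, filters):
    shortcut = x  # identity path
    y = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    out = layers.Add()([shortcut, y])  # skip connection: add input back in
    return layers.Activation("relu")(out)
```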

Inception

What do CNNs Learn?

Breaking Convnets

CNN Demo

Transfer Learning

Transfer Learning

IDEA: Intermediate representations learned for one task may be useful for other related tasks
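
A minimal transfer-learning sketch in Keras: reuse VGG16's ImageNet features and train only a new head; the 10-class output and input size are assumptions for illustration:

```python
from tensorflow import keras
from tensorflow.keras import layers

# Pretrained convolutional base, without the original classifier
base = keras.applications.VGG16(weights="imagenet", include_top=False,
                                input_shape=(224, 224, 3))
base.trainable = False  # freeze the pretrained representations

model = keras.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(256, activation="relu"),
    layers.Dense(10, activation="softmax"),  # hypothetical 10-class target task
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
```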

When to use Transfer Learning?

Transfer Learning

Ok, so how do I find the optimal neural network architecture?

Neural Architecture Search with Reinforcement Learning

Resources

Horovod Meetup Talk

MLflow

Deep Learning with Python

Stanford's CS 231

fast.ai

Blog posts & webinars

Databricks Runtime for ML

Pyception

Thank you!