Keras and Neural Network Fundamentals

MLflow and Spark UDFs

Horovod: Distributed Model Training

LIME, SHAP & Model Interpretability

CNNs and ImageNet

Transfer Learning

Demonstrate every new research technique/API

Detailed Math/CS behind the algorithms

Spark before?

Pandas/Numpy?

Machine Learning? Deep Learning?

Expectations?

Performs well on complex datasets like images, sequences, and natural language

Scales better as data size increases

In theory, can approximate any continuous function (universal approximation theorem)

Composing representations of data in a hierarchical manner

High-level Python API to build neural networks

Official high-level API of TensorFlow

Supports: TensorFlow, Theano, and CNTK

Has over 250,000 users

Released by François Chollet in 2015

GPUs are preferred for training because of their computational speed, but transferring data to and from them is comparatively slow

CPUs are generally acceptable for inference

Input layer

Zero or more hidden layers

Output layer
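To make the input, hidden, and output structure concrete, here is a minimal pure-Python sketch of a forward pass through one hidden layer. The weights, biases, layer sizes, and the helper names `dense`, `relu`, and `forward` are all hypothetical, chosen only for illustration; a real Keras model would learn its parameters during training.

```python
def relu(values):
    # Element-wise ReLU over a list of activations.
    return [max(0.0, v) for v in values]

def dense(inputs, weights, biases):
    """One fully connected layer; `weights` has one row per output unit."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

# Hypothetical parameters: 2 inputs -> 3 hidden units -> 1 output.
W_hidden = [[0.5, -1.0], [1.0, 1.0], [-0.5, 0.25]]
b_hidden = [0.0, -1.0, 0.5]
W_out = [[1.0, 2.0, -1.0]]
b_out = [0.1]

def forward(x):
    hidden = relu(dense(x, W_hidden, b_hidden))  # hidden layer
    return dense(hidden, W_out, b_out)           # output layer (linear)

prediction = forward([2.0, 1.0])
```

Adding more hidden layers just means chaining more `dense` and activation calls between the input and the output.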

Measure "closeness" between label and prediction

- When predicting someone's weight, better to be off by 2 lbs instead of 20 lbs

Evaluation metrics:

- Loss: $(y - \hat{y})$
- Absolute loss: $|y - \hat{y}|$
- Squared loss: $(y - \hat{y})^2$

$Error_{i} = (y_{i} - \hat{y_{i}})$

$SE_{i} = (y_{i} - \hat{y_{i}})^2$

$SSE = \sum_{i=1}^n (y_{i} - \hat{y_{i}})^2$

$MSE = \frac{1}{n}\sum_{i=1}^n (y_{i} - \hat{y_{i}})^2$

$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^n (y_{i} - \hat{y_{i}})^2}$
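The formulas above translate directly into code. The sketch below is one way to compute them with only the standard library; the function name `regression_errors` and the sample weights are made up for illustration.

```python
import math

def regression_errors(y_true, y_pred):
    """Return (SSE, MSE, RMSE) for two equal-length sequences."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and the same length")
    squared = [(y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred)]
    sse = sum(squared)
    mse = sse / len(squared)
    rmse = math.sqrt(mse)
    return sse, mse, rmse

# Hypothetical weights in lbs: one prediction is off by 2, the other by 20.
sse, mse, rmse = regression_errors([150.0, 180.0], [152.0, 160.0])
```

Because the errors are squared, the 20 lb miss contributes 400 to the SSE while the 2 lb miss contributes only 4, which is why squared loss penalizes large errors so heavily.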

Which is more important? Why?

Calculate gradients to update weights
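As a rough illustration of using gradients to update weights, the sketch below fits a one-weight model $\hat{y} = w x$ by gradient descent on the MSE. The data, the learning rate, and the function names are assumptions for this example; a real network applies the same idea to every weight through backpropagation.

```python
def mse_loss(w, xs, ys):
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / len(xs)

def mse_gradient(w, xs, ys):
    """d(MSE)/dw for the model y_hat = w * x."""
    return sum(-2.0 * x * (y - w * x) for x, y in zip(xs, ys)) / len(xs)

# Hypothetical data generated by the true weight w = 2.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

w = 0.0
learning_rate = 0.05
for _ in range(200):
    # Step against the gradient to reduce the loss.
    w -= learning_rate * mse_gradient(w, xs, ys)

final_loss = mse_loss(w, xs, ys)
```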

Provide non-linearity in our neural networks to learn more complex relationships

Sigmoid

Hyperbolic tangent (tanh)

ReLU

Leaky ReLU

PReLU

ELU

Saturates and kills gradients

Not zero-centered

Zero centered!

BUT, like the sigmoid, its activations saturate

BUT, gradients can still go to zero

For $x < 0$: $$f(x) = \alpha x$$

For $x \geq 0$: $$f(x) = x$$

These functions are not differentiable at 0, so we set the derivative there to 0 or to the average of the left and right derivatives
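For reference, the activation functions listed above can be sketched in a few lines of standard-library Python. The default `alpha` values are common choices, not something this deck specifies, and `relu_grad` simply picks the "derivative is 0 at 0" convention. PReLU has the same form as Leaky ReLU, except that `alpha` is learned during training rather than fixed.

```python
import math

def sigmoid(x):
    # Output in (0, 1); not zero-centered. Overflows for very negative x.
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    # Output in (-1, 1); zero-centered, but still saturates.
    return math.tanh(x)

def relu(x):
    return max(0.0, x)

def leaky_relu(x, alpha=0.01):
    # Small slope alpha for x < 0 keeps some gradient flowing.
    return x if x >= 0 else alpha * x

def elu(x, alpha=1.0):
    return x if x >= 0 else alpha * (math.exp(x) - 1.0)

def relu_grad(x):
    # Convention: the derivative at exactly 0 is set to 0.
    return 1.0 if x > 0 else 0.0
```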

Choosing a proper learning rate can be difficult

Easy to get stuck in local minima

Accelerates SGD: Like pushing a ball down a hill

Take average of direction we’ve been heading (current velocity and acceleration)

Limits oscillating back and forth, gets out of local minima
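The momentum idea can be sketched as a running "velocity" that accumulates past gradients. This is a minimal illustration on a one-dimensional quadratic; the function name `sgd_momentum`, the hyperparameter values, and the target function are all assumptions for the example.

```python
def sgd_momentum(grad_fn, w0, learning_rate=0.1, momentum=0.9, steps=500):
    w, velocity = w0, 0.0
    for _ in range(steps):
        # Keep a fraction of the previous velocity, then add the new step.
        velocity = momentum * velocity - learning_rate * grad_fn(w)
        w += velocity
    return w

# Hypothetical objective f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w_final = sgd_momentum(lambda w: 2.0 * (w - 3.0), w0=0.0)
```

Setting `momentum=0.0` recovers plain gradient descent, which makes it easy to compare the two on the same objective.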

Adaptive Moment Estimation (Adam)
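For intuition, here is a hedged sketch of the Adam update for a single weight, following the commonly published form with bias-corrected first and second moment estimates. The function name `adam_minimize`, the hyperparameter defaults, and the toy objective are assumptions for illustration, not the internals of any particular library.

```python
import math

def adam_minimize(grad_fn, w0, learning_rate=0.1, beta1=0.9, beta2=0.999,
                  epsilon=1e-8, steps=500):
    w, m, v = w0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g        # running mean of gradients
        v = beta2 * v + (1 - beta2) * g * g    # running mean of squared gradients
        m_hat = m / (1 - beta1 ** t)           # bias correction
        v_hat = v / (1 - beta2 ** t)
        w -= learning_rate * m_hat / (math.sqrt(v_hat) + epsilon)
    return w

# Hypothetical objective f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
grad = lambda w: 2.0 * (w - 3.0)
w_after_one_step = adam_minimize(grad, w0=0.0, steps=1)
w_final = adam_minimize(grad, w0=0.0)
```

Note that the very first step has a size close to the learning rate regardless of the gradient's magnitude, because the update divides by the square root of the squared-gradient estimate.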

Which dataset should we use to select hyperparameters? Train? Test?

Split the dataset into three!

- Train on the *training set*
- Select hyperparameters based on performance on the *validation set*
- Test on the *test set*
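A three-way split like the one above can be sketched with the standard library alone. The function name `train_val_test_split`, the 60/20/20 proportions, and the seed are assumptions for illustration; in practice a library helper would normally do this.

```python
import random

def train_val_test_split(rows, val_fraction=0.2, test_fraction=0.2, seed=42):
    if not 0 < val_fraction + test_fraction < 1:
        raise ValueError("validation and test fractions must sum to less than 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # shuffle a copy, reproducibly
    n_test = int(len(shuffled) * test_fraction)
    n_val = int(len(shuffled) * val_fraction)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

train, val, test = train_val_test_split(range(100))
```

The test set is touched only once, after hyperparameters have been chosen on the validation set, so it gives an unbiased estimate of performance on unseen data.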

Created by Alexander Sergeev of Uber, open-sourced in 2017

Simplifies distributed neural network training

Supports TensorFlow, Keras, PyTorch, and Apache MXNet

```
# Only one line of code change!
# (assumes `hvd` is an already imported and initialized Horovod module)
optimizer = hvd.DistributedOptimizer(optimizer)
```

Focus on Local Connectivity (fewer parameters to learn)

Filter/kernel slides across input image (often 3x3)
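To show what "sliding a filter across the image" means, here is a minimal pure-Python sketch with no padding and a stride of 1. The function name `convolve2d`, the toy 4x4 image, and the vertical-edge kernel are made up for illustration. As in most deep learning libraries, the kernel is applied without flipping (technically cross-correlation).

```python
def convolve2d(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1)."""
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = len(image) - k_h + 1
    out_w = len(image[0]) - k_w + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Weighted sum of the patch under the kernel at position (i, j).
            total = sum(image[i + a][j + b] * kernel[a][b]
                        for a in range(k_h) for b in range(k_w))
            row.append(total)
        output.append(row)
    return output

# Hypothetical 4x4 image: dark on the left, bright on the right.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# A 3x3 kernel that responds to vertical edges.
vertical_edge = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]
feature_map = convolve2d(image, vertical_edge)
```

The same 9 kernel weights are reused at every position, which is where the saving in parameters compared with a fully connected layer comes from.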

Image Kernels Visualization

CS 231 Convolutional Networks

Classify images in one of 1000 categories

2012 Deep Learning breakthrough with AlexNet: 15.3% top-5 test error rate (next closest was 26.2%)

One of the most widely used architectures for its simplicity

IDEA: Intermediate representations learned for one task may be useful for other related tasks