M.S. Computer Science (Distributed Machine Learning)

Fluent in Chinese (中文)

brooke@databricks.com

Keras and Neural Network Fundamentals

MLflow

CNNs and ImageNet

Transfer Learning and Deep Learning Pipelines

Horovod: Distributed TensorFlow

Spark before?

Pandas/Numpy?

Machine Learning? Deep Learning?

Expectations?

Fundamentals of Deep Learning and best practices

Utilize Keras, Deep Learning Pipelines, and Horovod

Understand when/where to use transfer learning

List advantages/disadvantages of distributed neural network training

Demonstrate every new research technique/API

Detailed Math/CS behind the algorithms

Supervised Learning

Unsupervised Learning

Reinforcement Learning

Classification

Regression

Learn structure of the unlabeled data

Learning what to do to maximize reward

Explore and exploit
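The explore/exploit trade-off can be sketched with an epsilon-greedy bandit; the arm rewards, epsilon, and step count below are made-up illustrative values:

```python
import random

def epsilon_greedy(true_means, epsilon=0.1, steps=5000, seed=0):
    """With probability epsilon pick a random arm (explore),
    otherwise pick the arm with the best observed mean (exploit)."""
    rng = random.Random(seed)
    counts = [0] * len(true_means)
    totals = [0.0] * len(true_means)
    for _ in range(steps):
        if rng.random() < epsilon or 0 in counts:
            arm = rng.randrange(len(true_means))            # explore
        else:
            arm = max(range(len(true_means)),
                      key=lambda a: totals[a] / counts[a])  # exploit
        reward = rng.gauss(true_means[arm], 1.0)
        counts[arm] += 1
        totals[arm] += reward
    return counts

counts = epsilon_greedy([0.1, 0.5, 0.9])  # arm 2 has the highest true reward
```

With enough steps, exploration surfaces the best arm and exploitation then pulls it most often.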

"All models are wrong; some models are useful." — George Box

What if I told you I had a model that was 99% accurate in predicting brain cancer?

You ALWAYS want to have a baseline model to compare to

This should be a "dummy" model, e.g. always predicting the most common class, or a coin flip
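A minimal sketch of why a baseline matters on imbalanced data (the 99/1 class split below is illustrative):

```python
# 1000 patients, only 10 actually have cancer (1% positive class)
labels = [1] * 10 + [0] * 990

# "Dummy" baseline: always predict the majority class (no cancer)
predictions = [0] * len(labels)

accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
print(accuracy)  # 0.99 -- 99% accurate, yet it never detects a single case
```

Any real model has to beat this baseline to be worth anything; 99% accuracy alone tells you nothing here.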

Underlying data distribution

Some models are more costly to train

Need for interpretability?

General Data Protection Regulation

Composing representations of data in a hierarchical manner

Input layer (fixed)

Zero or more hidden layers

Output layer (fixed)
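As a sketch, one forward pass through a single hidden layer (the weights below are arbitrary; sigmoid is chosen just for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([1.0, 2.0])                  # input layer (fixed size)

W1 = np.array([[0.1, -0.2], [0.4, 0.3]])  # hidden layer weights
b1 = np.array([0.0, 0.1])
h = sigmoid(x @ W1 + b1)                  # hidden layer activations

W2 = np.array([[0.5], [-0.5]])            # output layer weights
b2 = np.array([0.2])
y_hat = sigmoid(h @ W2 + b2)              # output layer (fixed size)
```

Each layer composes a new representation of the previous one, which is the hierarchy referred to above.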

Calculate gradients to update weights

Measure "closeness" between label and prediction

- When predicting someone's weight, better to be off by 2 lbs instead of 20 lbs

Evaluation metrics:

- Loss: $(y - \hat{y})$
- Absolute loss: $|y - \hat{y}|$
- Squared loss: $(y - \hat{y})^2$

$Error = (y_{i} - \hat{y_{i}})$

$SE = (y_{i} - \hat{y_{i}})^2$

$SSE = \sum_{i=1}^n (y_{i} - \hat{y_{i}})^2$

$MSE = \frac{1}{n}\sum_{i=1}^n (y_{i} - \hat{y_{i}})^2$

$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^n (y_{i} - \hat{y_{i}})^2}$

Which is more important? Why?
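The error metrics above can be computed directly in NumPy; the toy labels and predictions below are made up:

```python
import numpy as np

y = np.array([10.0, 20.0, 30.0])        # true labels
y_hat = np.array([12.0, 18.0, 33.0])    # predictions

errors = y - y_hat                      # per-example error
se = errors ** 2                        # squared errors
sse = se.sum()                          # sum of squared errors
mse = se.mean()                         # mean squared error
rmse = np.sqrt(mse)                     # root mean squared error
```

Note that RMSE is in the same units as the label (e.g. lbs), which often makes it the easier one to interpret and report.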

High-level Python API to build neural networks

Official high-level API of TensorFlow

Supports multiple backends: TensorFlow, Theano, and CNTK

Has over 250,000 users

Released by François Chollet in 2015

Provide non-linearity in our neural networks to learn more complex relationships

Sigmoid

Tanh (hyperbolic tangent)

ReLU

Leaky ReLU

PReLU

ELU

Saturates and kills gradients

Not zero-centered

Zero centered!

BUT, like the sigmoid, its activations saturate

BUT, gradients can still go to zero

$$f(x) = \begin{cases} \alpha x & x < 0 \\ x & x \ge 0 \end{cases}$$
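A sketch of ReLU and Leaky ReLU in NumPy (α = 0.01 is a common default, chosen here for illustration):

```python
import numpy as np

def relu(x):
    # Gradient is exactly zero for x < 0, which can "kill" neurons
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # f(x) = alpha * x for x < 0, f(x) = x for x >= 0
    # The small negative slope keeps gradients from going to zero
    return np.where(x < 0, alpha * x, x)

x = np.array([-2.0, 0.0, 3.0])
r = relu(x)          # -2.0 maps to 0.0
lr = leaky_relu(x)   # -2.0 maps to -0.02
```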

Choosing a proper learning rate can be difficult

Easy to get stuck in local minima

Accelerates SGD: Like pushing a ball down a hill

Take average of direction we’ve been heading (current velocity and acceleration)

Limits oscillating back and forth, gets out of local minima

Adaptive Moment Estimation (Adam)
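The momentum update can be sketched on a toy quadratic loss (the learning rate and momentum values are illustrative; Adam additionally adapts a per-parameter learning rate from gradient moments):

```python
def grad(w):
    return 2 * (w - 3.0)    # gradient of loss (w - 3)^2, minimum at w = 3

w, v = 0.0, 0.0             # weight and velocity (the "ball")
lr, momentum = 0.1, 0.9
for _ in range(300):
    # velocity averages the direction we've been heading
    v = momentum * v - lr * grad(w)
    w = w + v               # w approaches 3.0
```

The accumulated velocity damps oscillation and carries the ball through shallow local minima that plain SGD can get stuck in.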

Which dataset should we use to select hyperparameters? Train? Test?

Split the dataset into three!

- Train on the *training set*
- Select hyperparameters based on performance on the *validation set*
- Test on the *test set*
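A sketch of a three-way split with NumPy (the 60/20/20 ratios are illustrative):

```python
import numpy as np

n = 100
indices = np.random.default_rng(seed=42).permutation(n)  # shuffle first

train_idx = indices[: int(0.6 * n)]             # fit model weights here
val_idx = indices[int(0.6 * n): int(0.8 * n)]   # pick hyperparameters here
test_idx = indices[int(0.8 * n):]               # touch once, for the final estimate
```

Tuning hyperparameters on the test set would leak information into the final evaluation, which is why the validation set exists.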

Classify images in one of 1000 categories

2012 Deep Learning breakthrough with AlexNet: 16% top-5 test error rate (next closest was 25%)

One of the most widely used architectures for its simplicity

Focus on Local Connectivity (fewer parameters to learn)

Filter/kernel slides across input image (often 3x3)
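The sliding filter can be sketched in plain NumPy (valid padding, stride 1; the 5x5 input and averaging filter are made-up examples):

```python
import numpy as np

def conv2d(image, kernel):
    """Slide kernel across image (valid padding, stride 1)."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # local connectivity: each output depends only on one 3x3 patch
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.ones((3, 3)) / 9.0   # simple 3x3 averaging filter
result = conv2d(image, kernel)
print(result.shape)  # (3, 3)
```

The same 9 kernel weights are reused at every position, which is why a conv layer has far fewer parameters than a fully connected one.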

CS 231 Convolutional Networks

IDEA: Intermediate representations learned for one task may be useful for other related tasks

- Hand curated
- Aggregates

Problems with these approaches?

k factors characterize the users and items (k << n)

Use the user + product factors as input to neural network

Can augment with additional features
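A sketch of the idea with made-up factors (k = 3 latent factors here; in practice these would come from a factorization such as ALS):

```python
import numpy as np

k = 3                                     # k latent factors, k << n
user_factors = np.array([0.9, 0.1, 0.4])  # one user's learned factors
item_factors = np.array([0.8, 0.3, 0.5])  # one item's learned factors

# Classic matrix factorization: predicted rating is the dot product
rating = user_factors @ item_factors

# Neural recommender: concatenate the factors (plus any extra
# features) and feed the result to a neural network
extra_features = np.array([1.0])          # e.g. a hypothetical context feature
nn_input = np.concatenate([user_factors, item_factors, extra_features])
```

Replacing the dot product with a neural network lets the model learn non-linear user-item interactions and fold in side information.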

Build distributed neural network for end-to-end scalability!

Created by Alexander Sergeev of Uber, open-sourced in 2017

Simplifies distributed neural network training

Supports TensorFlow, Keras, and PyTorch

```
import horovod.keras as hvd

hvd.init()
# Only one line of code change!
optimizer = hvd.DistributedOptimizer(optimizer)
```

Part of Databricks' Runtime for ML

Distributed TensorFlow training on Spark DataFrames

MLlib Estimator API

Specify model via tf.estimator model_fn

model_fn(features, labels, mode) → tf.estimator.EstimatorSpec

Shards data across nodes’ local disks

Trains a tf.estimator across nodes

Feeds TFRecord data to estimator

Automatic checkpointing, logging

Simultaneous model evaluation

Jules Damji and Brooke Wenig @ 15:20 Thursday!