# Linear Regression


Goal: Find line of best fit

$y \approx \hat{y} = w_{0} + w_{1}x$, where $y = \hat{y} + \epsilon$

x: feature

y: label

#### Learn weights that minimize the residuals

• Blue point: True value
• Red line: Positive residual
• Green line: Negative residual
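As a minimal sketch of "learning the weights", the closed-form least-squares solution for a single feature can be written in plain Python. The function name `fit_line` and the sample data are illustrative assumptions, not part of the original slides.

```python
def fit_line(xs, ys):
    """Return (w0, w1) that minimize the sum of squared residuals."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Slope: co-variation of x and y divided by the variation of x.
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    w1 = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    w0 = y_mean - w1 * x_mean
    return w0, w1


# Hypothetical data scattered around y = 1 + 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.1, 4.9, 7.2, 8.8]

w0, w1 = fit_line(xs, ys)
# Residual = true value minus prediction; positive when the point is above the line.
residuals = [y - (w0 + w1 * x) for x, y in zip(xs, ys)]
print(w0, w1)
print(residuals)
```

With an intercept in the model, the positive and negative residuals of a least-squares fit cancel out, so their sum is zero up to floating-point error.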

## Regression Evaluator

Measure "closeness" between label and prediction

Evaluation metrics:

• Loss: $(y - \hat{y})$
• Absolute loss: $|y - \hat{y}|$
• Squared loss: $(y - \hat{y})^2$

## Evaluation metric: RMSE

$Error = (y_{i} - \hat{y_{i}})$

$SE = (y_{i} - \hat{y_{i}})^2$

$SSE = \sum_{i=1}^n (y_{i} - \hat{y_{i}})^2$

$MSE = \frac{1}{n}\sum_{i=1}^n (y_{i} - \hat{y_{i}})^2$

$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^n (y_{i} - \hat{y_{i}})^2}$
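The chain from error to RMSE can be followed step by step in a few lines of Python. This is a sketch with made-up labels and predictions; the helper name `rmse` is an assumption for illustration.

```python
import math


def rmse(labels, predictions):
    """Root mean squared error between labels and predictions."""
    n = len(labels)
    # SSE: sum of the squared errors over all points.
    sse = sum((y - y_hat) ** 2 for y, y_hat in zip(labels, predictions))
    # MSE: average squared error.
    mse = sse / n
    # RMSE: back in the same units as the label.
    return math.sqrt(mse)


# Hypothetical values: errors are 1, 0, -1, -2, so SSE = 6 and MSE = 1.5.
labels = [3.0, 5.0, 7.0, 9.0]
predictions = [2.0, 5.0, 8.0, 11.0]
print(rmse(labels, predictions))
```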

## Train vs. Test RMSE

Which is more important? Why?
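One way to explore the question is to fit on one split and compute RMSE on both. The sketch below uses hypothetical data and helper names (`fit_line`, `rmse`); the test RMSE is the one that estimates how the model behaves on data it has not seen.

```python
import math


def fit_line(xs, ys):
    """Least-squares fit of y = w0 + w1 * x for a single feature."""
    n = len(xs)
    x_mean, y_mean = sum(xs) / n, sum(ys) / n
    w1 = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
          / sum((x - x_mean) ** 2 for x in xs))
    return y_mean - w1 * x_mean, w1


def rmse(xs, ys, w0, w1):
    """RMSE of the line (w0, w1) on the given points."""
    sse = sum((y - (w0 + w1 * x)) ** 2 for x, y in zip(xs, ys))
    return math.sqrt(sse / len(xs))


# Hypothetical split: the weights are learned from the training points only.
train_x, train_y = [1.0, 2.0, 3.0, 4.0], [3.1, 4.9, 7.2, 8.8]
test_x, test_y = [5.0, 6.0], [11.3, 12.6]

w0, w1 = fit_line(train_x, train_y)
train_rmse = rmse(train_x, train_y, w0, w1)
test_rmse = rmse(test_x, test_y, w0, w1)
print(train_rmse, test_rmse)
```

On this particular data the test RMSE comes out higher than the train RMSE, because the weights were chosen to minimize error on the training points.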

## $R^2$

Another measurement of "goodness of fit"

$SS_{\text{tot}}=\sum _{i}(y_{i}-{\bar {y}})^{2}$

$SS_{\text{res}}=\sum _{i}(y_{i}-\hat{y_{i}})^{2}$

$R^{2} = 1-{SS_{\rm {res}} \over SS_{\rm {tot}}}$

What is the range of $R^2$?
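To get a feel for the range, the formulas above can be evaluated on three hypothetical sets of predictions: a perfect model, a model that always predicts the mean, and a model that does worse than the mean. The function name `r_squared` is an assumption for this sketch.

```python
def r_squared(labels, predictions):
    """R^2 = 1 - SS_res / SS_tot, following the definitions above."""
    y_mean = sum(labels) / len(labels)
    ss_tot = sum((y - y_mean) ** 2 for y in labels)
    ss_res = sum((y - y_hat) ** 2 for y, y_hat in zip(labels, predictions))
    return 1 - ss_res / ss_tot


labels = [1.0, 2.0, 3.0, 4.0]

perfect = r_squared(labels, labels)                 # 1.0
baseline = r_squared(labels, [2.5, 2.5, 2.5, 2.5])  # 0.0 (always predicts the mean)
worse = r_squared(labels, [4.0, 3.0, 2.0, 1.0])     # -3.0 (worse than the mean)
print(perfect, baseline, worse)
```

So $R^2$ is at most 1, equals 0 for a model that always predicts the mean, and can go negative when predictions are worse than that baseline.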

## Machine Learning Libraries

Scikit-learn (single machine)

Can train multiple models in parallel with Spark-Sklearn, but what if our data or model get big...

## ML Libraries in Spark

### MLlib

Original Spark ML API based on RDDs

Spark 2.0: Entered maintenance mode

### SparkML

Newer API based on DataFrames

Supported API moving forward

## How to Handle Non-Numeric Features?

Categorical

• No intrinsic ordering
• e.g. Gender, Country, Occupation

Ordinal

• Relative ordering, but inconsistent spacing between categories
• e.g. Excellent, good, poor

## One idea

Create single numerical feature to represent non-numeric one

Categorical features:

• Animals = {'Dog', 'Cat', 'Fish'}
• 'Dog' = 1, 'Cat' = 2, 'Fish' = 3

Implies Cats are 2x dogs!

## One Hot Encoding (OHE)

Create a ‘dummy’ feature for each category

With features (isDog, isCat, isFish): 'Dog' => [1, 0, 0], 'Cat' => [0, 1, 0], 'Fish' => [0, 0, 1]

No spurious relationships!
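A minimal sketch of one hot encoding in plain Python, assuming a fixed, known list of categories. The helper `one_hot_encode` is hypothetical and not a library function.

```python
def one_hot_encode(value, categories):
    """Return a list with a 1 in the position of `value` and 0 elsewhere."""
    return [1 if value == category else 0 for category in categories]


# The position in the list defines the dummy features: isDog, isCat, isFish.
categories = ['Dog', 'Cat', 'Fish']
encoded = {animal: one_hot_encode(animal, categories) for animal in categories}
print(encoded)
# {'Dog': [1, 0, 0], 'Cat': [0, 1, 0], 'Fish': [0, 0, 1]}
```

Each category gets its own 0/1 feature, so no category is treated as a multiple of another.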

## Storage Space

Ok, so that works if we only have a few animal types, but what if we had a zoo?

## Sparse Vectors

Size of vector, indices of non-zero elements, values


DenseVector(0, 0, 0, 7, 0, 2, 0, 0, 0, 0)
SparseVector(10, [3, 5], [7, 2])
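The sparse representation can be illustrated without Spark. The sketch below uses plain lists and tuples; `to_sparse` and `to_dense` are hypothetical helpers, not the Spark `SparseVector` API.

```python
def to_sparse(dense):
    """Return (size, indices of non-zero elements, their values)."""
    indices = [i for i, value in enumerate(dense) if value != 0]
    values = [dense[i] for i in indices]
    return len(dense), indices, values


def to_dense(size, indices, values):
    """Rebuild the full list from the sparse triple."""
    dense = [0] * size
    for i, value in zip(indices, values):
        dense[i] = value
    return dense


dense = [0, 0, 0, 7, 0, 2, 0, 0, 0, 0]
sparse = to_sparse(dense)
print(sparse)  # (10, [3, 5], [7, 2])
```

The sparse form stores only the size plus two entries per non-zero element, which is what keeps one-hot vectors with many categories small.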


## Assumptions?

Linear relationship between X and Y

Features not correlated

Errors only in Y

##### Which ones are suited to Linear Regression?