Machine Learning Engineering

BDA-602 - Machine Learning Engineering

Dr. Julien Pierret

Lecture 7

Lecture 7 Agenda

Feature Engineering
- Categorical vs Continuous
- Coming up with features
- Analyzing Feature Performance
  - Plotting
  - p-values
  - Binning and difference with Mean (weighted / unweighted)
  - Random Forest Variable Importance
Tree based models
- Decision Trees
- Random Forests

Features?

Features are inputs that help us predict some outcome
- Features = Predictors
- Outcome = Response / Label
Feature types
- Continuous
  - money
  - time
  - height
  - distance
- Categorical (Discrete)
  - gender
  - zipcode
  - state
  - color
  - Social Security Number

›

Continuous Features

Easy to use
Nothing really needs to be done
- Transformations
- Just improving it

›

Categorical Features

Harder to use
Two different types
- Ordinal
  - Only belong to one category at a time
  - There's some kind of order to the categories
    - Low
    - Medium
    - High
  - Can be fudged into a continuous variable
- Nominal
  - Only belong to one category at a time

›

Categorical to usable

RED, GREEN, BLUE
One-hot encoding
- Most common - for model building
  - RED: [1, 0, 0]
  - GREEN: [0, 1, 0]
  - BLUE: [0, 0, 1]
  - PURPLE: [1, 0, 1]
Dummy encoding
- Have a control group - testing for differences among groups

›

Categorical to usable

Effects Encoding

Comparing one group to all groups
Comparison made at the mean of all groups combined
Group of least interest coded with -1

Nationality	C1	C2	C3
French	0	0	1
Italian	1	0	0
German	0	1	0
Other	−1	−1	−1

›

Categorical to usable

Contrast Coding

The sum of the contrast coef. equal zero.
The difference between the sum of the + coef. and the sum of the - coef. equals 1.

Coded variables should be orthogonal

Nationality	C1	C2
French	+0.25	+0.50
Italian	+0.25	−0.50
German	−0.50	0

Hypothesis 1: French and Italian persons will score higher on optimism than Germans (French = +0.25, Italian = +0.25, German = −0.50).
Hypothesis 2: French and Italians are expected to differ on their optimism scores (French = +0.50, Italian = −0.50, German = 0).

Nonsense Coding

›

Categorical - Binary encoding

Color	Binary 1	Binary 2	Binary 3
Blue	0	0	0
Red	0	0	1
Orange	0	1	0
Green	0	1	1
Yellow	1	0	0
Purple	1	0	1
Pink	1	1	1

›

Categorical - Ranking

Figure out some way to rank them that makes sense

Color	Ranking
Red	1
Orange	2
Yellow	3
Green	4
Blue	5
Purple	6

Pattern? 🌈

›

Categorical - Embedding

Useful in Neural Networks
Encode categoricals into a set of continuous numbers
- You get to pick the number of continuous numbers
- Rule of thumb $\sqrt[4]{n}$ (where $n$ number of distinct categories)
Neural Networks don't work well with one-hot encodings
Word Embedding

›

Categorical - Vector

Something new
- Take all categorical fields and generate a vector
- Sort of like word2vec
- Hopefully I'll find a good coding of it

›

Coming up with Features

Feature engineering is more important than modeling!
A better model can raise accuracy by centimeters
Better predictors can raise accuracy by meters
Feature Engineering
- Model only as good as what you put into it
  - New ways to look at the data
  - New data
- One of my favorite parts of the job
  - Let your imagination run free
  - When I interview I dig heavily into this
- Unsupervised learning
- Story-time
  - "6k+ Feature Bank"
  - "Data Bank"

›

Horse Racing

Running Style
- Early (E)
- Early Presser (EP)
- Presser (P)
- Sustained? (S)
Look at historical races 🎰:
- Starting Pos, 1st / 2nd / final Call Position
- 1st / 2nd / Final Call Beaten Length
- Assign the horses one of the running styles
Labeled 40 races this way
A decision tree modeled it perfectly!

›

Baseball

Elo rating system
Baseball statistics
- Baseball nerds everywhere 🤓
- Use it to your advantage

›

Automated Valuation Models (AVM)

Predicting apartment complex values in Florida
Address to lat/lon
Shapefiles with coasts
- Line segments of the coastline
Calculated shortest distance to coastline
- AVM goes up closer to coast
- Dependant on area

›

Document Scanning Predictions

Predicting "Date Due" from an invoice
Optical Character Recognition (OCR)
- Know where every character/word is on the page
Find correct "Date Due" on the document
Generate a Heatmap of correct "Date Due" pixels
- Scale all points so 1 is the highest number
Predictor: For any candidate date
- average the value of the "pixels" it covers from the heat map
- Higher the number the better

›

Too many Categories

Fraud Model
Unsupervised on all the predictors
- Number of clusters in 1000s
  - Crazy!
- Crazy Brilliant
  - Fraud is a rare event
  - Most clusters were garbage
  - Others full of fraud
  - Inspected them
- Grouped these bad clusters together
- Extra boolean feature fed into the final fraud model

Bonus - Proximity to an internation airport

›

Failures > Success

What features are good?
Plot it!
Rank features from best to worst
- p-value / Z-score (with caveats)
- Binning and difference with Mean (weighted / unweighted)
- Random Forest Variable Importance Ranking

›

Plotting

Crazy Important 🧼📦:
See the actual relationship
Predictor / Response type Dependant
- Response: Boolean / Categorical
  - Predictor: Boolean / Categorical
    - Heatplot
  - Predictor: Continous
    - Violin plot on predictor grouped by response
    - Distribution plot on predictor grouped by response
- Response: Continuous
  - Predictor: Boolean / Categorical
    - Violin plot on response grouped by predictor
    - Distribution plot on response grouped by predictor
  - Predictor: Continuous
    - Scatter plot with trendline

›

Categorical Response / Categorical Predictor

Plot, Code

Categorical Response / Categorical Predictor

Plot, code

Categorical Response / Continuous Predictor

Plot, Code

Categorical Response / Continuous Predictor

Plot, Code

Continuous Response / Categorical Predictor

Plot, Code

Continuous Response / Categorical Predictor

Plot, Code

Continuous Response / Continuous Predictor

Plot, Code

An extra reason to plot

Target Leakage
- Nostradamus Variables
- Accidentally leak the response into the predictor
Obvious when plotted
- If model is too good
- Always check the best performing predictors
Need to rank predictors

›

p-value and Z/t-score

Statistics (statsmodels)
- Linear Regression
  - Continuous response
- Logistic Regression
  - Boolean response

›

Regression p-value & t-score

Source

Dataset information

Diabetes Dataset
Ten baseline variables, age, sex, body mass index, average blood pressure, and six blood serum measurements were obtained for each of n = 442 diabetes patients, as well as the response of interest, a quantitative measure of disease progression one year after baseline.
Predictors
- age in years
- sex
- bmi body mass index
- bp average blood pressure
- s1 tc, T-Cells (a type of white blood cells)
- s2 ldl, low-density lipoproteins
- s3 hdl, high-density lipoproteins
- s4 tch, thyroid stimulating hormone

›

Regression p-value & t-score

Source

Diabetes - Age Summary Statistics

Diabetes - Regression Code - Plot

Source

Regression p-value & t-score

Source

Diabetes - Sex Summary Statistics

Diabetes - BMI Summary Statistics

Diabetes - BP Summary Statistics

Not Perfect

Linear Regression
- The problem is in it's name!
- Non-linear relationships?

›

Difference with mean of response - Aggregation

Formula
- Bin candidate predictor variable
- Compute the average value of response for the predictors in each bin
- Compare this with the population average response
- Square the difference, sum them up and divide by number of bins
Bigger the number, better the predictor
- Plot it! ($\mu_{i}-\mu_{pop}$ vs. bins)
- Could just be noise
- Look for patterns

›

Difference with mean of response

plot

Difference with mean table

($i$)	LowerBin	UpperBin	BinCenters	BinCount	BinMeans ($\mu_{i}$)	PopulationMean ($\mu_{pop}$)	MeanSquaredDiff
0	-3.563517	-2.930781	-3.247149	1	-3.616800	0.014081	13.183300
1	-2.930781	-2.298045	-2.614413	2	-2.641231	0.014081	7.050685
2	-2.298045	-1.665308	-1.981676	22	-1.900161	0.014081	3.664322
3	-1.665308	-1.032572	-1.348940	43	-1.310229	0.014081	1.753797
4	-1.032572	-0.399836	-0.716204	94	-0.671962	0.014081	0.470656
5	-0.399836	0.232900	-0.083468	129	-0.070665	0.014081	0.007182
6	0.232900	0.865636	0.549268	120	0.487407	0.014081	0.224037
7	0.865636	1.498372	1.182004	59	1.158775	0.014081	1.310324
8	1.498372	2.131108	1.814740	25	1.875793	0.014081	3.465971
9	2.131108	2.763844	2.447476	5	2.522430	0.014081	6.291814
							37.42208/10 = 3.74

Difference with mean of response - Issues

Some bins have very small populations
- Weigh the squared difference by population proportion before $ \sum{} $
Adjust the tails of the bins
- Force the tails to contain a certain population proportion
  - $ \geq 5\% $
If you have NULLs
- It could be predictive in itself
- Make a separate bin for it
- Include it's calculation

›

Difference with mean table (weighted)

($i$)	LowerBin	UpperBin	BinCenters	BinCount	BinMeans ($\mu_{i}$)	PopulationMean ($\mu_{pop}$)	MeanSquaredDiff	PopulationProportion ($w_{i}$)	MeanSquaredDiffWeighted
0	-3.563517	-2.930781	-3.247149	1	-3.616800	0.014081	13.183300	0.002	0.026367
1	-2.930781	-2.298045	-2.614413	2	-2.641231	0.014081	7.050685	0.004	0.028203
2	-2.298045	-1.665308	-1.981676	22	-1.900161	0.014081	3.664322	0.044	0.161230
3	-1.665308	-1.032572	-1.348940	43	-1.310229	0.014081	1.753797	0.086	0.150827
4	-1.032572	-0.399836	-0.716204	94	-0.671962	0.014081	0.470656	0.188	0.088483
5	-0.399836	0.232900	-0.083468	129	-0.070665	0.014081	0.007182	0.258	0.001853
6	0.232900	0.865636	0.549268	120	0.487407	0.014081	0.224037	0.240	0.053769
7	0.865636	1.498372	1.182004	59	1.158775	0.014081	1.310324	0.118	0.154618
8	1.498372	2.131108	1.814740	25	1.875793	0.014081	3.465971	0.050	0.173299
9	2.131108	2.763844	2.447476	5	2.522430	0.014081	6.291814	0.010	0.062918
							37.42208/10=3.74	1.0	0.901566

Random Forest

Variable Importance
- Great for variable ranking
- sklearn has this as feature_importances_
Good segwe into decision trees

›

Decision Trees

How do they work?
- Look at all the data for a branch
  - Look at all the predictors, how do we maximize
  - Split on it
- Repeat until we have everything categorized correctly or hit stopping criteria
  - Overfits!
    - Fit to the noise
    - Don't over grow
  - Pruning
    - Cross Validation
    - Where to cut branches
sklearn.tree.DecisionTreeClassifier

›

Decision Tree - Building

›

Fisher's Iris Data set

150 observations
- 50 samples from each of the species of Iris
4 predictors
- Sepal length
- Sepal width
- Petal length
- Petal width

›

Decision Tree - Building

Decision Tree - Original Dataset


              ********************************************************************************
              Original Dataset
              ********************************************************************************
                  sepal_length  sepal_width  petal_length  petal_width           class
              0             5.1          3.5           1.4          0.2     Iris-setosa
              1             4.9          3.0           1.4          0.2     Iris-setosa
              2             4.7          3.2           1.3          0.2     Iris-setosa
              3             4.6          3.1           1.5          0.2     Iris-setosa
              4             5.0          3.6           1.4          0.2     Iris-setosa
              ..            ...          ...           ...          ...             ...
              145           6.7          3.0           5.2          2.3  Iris-virginica
              146           6.3          2.5           5.0          1.9  Iris-virginica
              147           6.5          3.0           5.2          2.0  Iris-virginica
              148           6.2          3.4           5.4          2.3  Iris-virginica
              149           5.9          3.0           5.1          1.8  Iris-virginica

              [150 rows x 5 columns]

Decision Tree - Building

Decision Tree - Unpruned Tree

as PDF

Decision Tree - Pruning

Decision Tree - Cross Validation Score

	criterion	max_depth	score
0	gini	1	0.666667
1	gini	2	0.933333
2	gini	3	0.960000
3	gini	4	0.966667
4	gini	5	0.960000
5	gini	6	0.960000
6	entropy	1	0.666667
7	entropy	2	0.933333
8	entropy	3	0.960000
9	entropy	4	0.953333
10	entropy	5	0.953333
11	entropy	6	0.953333

Decision Tree - Plotting the score

Decision Tree - Pruning Plot

source

Decision Tree - Plotting the optimal tree

Decision Tree - Pruned Tree

as PDF

Decision Tree - Overfitting

Probably the biggest problem
Running this on the titanic dataset

Full Tree

as PDF

Prunned Tree

as PDF

Titanic - Cross Val Negative Mean Squared Error

source

Bootstrap Aggregating (Bagging)

Bootstrap
- Sample with replacement from dataset
Aggregating
- Build many trees and agregate their results
sklearn.ensemble.BaggingClassifier
- Wrapper for any classifier

›

Bagging - Code

Bagging - Trees

PDF

Random Forest

Original Paper
Builds on bagging
- Many trees (aggregate)
  - Each tree built from a sample of values (bootstrap)
  - Each split picks from a random subset of predictors
    - Usually $ \sqrt{n} $
    - $n = $ number of predictors

›

Random Forest - Variable Importance

Impurity Based feature importance (Gini importance)
- sklean.ensemble.RandomForestClassifier.feature_importances_
- Uses the out-of-bag error 🤷
- Biased toward categorical predictors with high cardinality
Permutation based feature importance
- sklearn.inspection.permutation_importance
- Will actually work with any classifier!

›

In Summary

Features
- Continuous: Easy
- Categorical: Harder
  - One-hot
  - Dummy
  - Ranking
  - Embedding
  - Vector
Feature Engineering
- Best way to increase a model's accuracy
- Always plot predictons!
- Variable Importance rankings
  - p-value & z/t score
  - Difference with mean of response
  - Random Forest variable importance

Homework - Assignment 4 - (1/2) Due Marth 17th

Branch must be named hw_04

git checkout -b hw_04

Given a pandas dataframe
- Contains both a response and predictors
Given a list of predictors and the response columns
- Determine if response is continuous or boolean (don't worry about >2 category responses)
- Loop through each predictor column
  - Determine if the predictor is cat/cont (no tricks)
    - "string" == Categorical
    - number == Continuous
  - Automatically generate the necessary plot(s) to inspect it
  - ...

Homework - Assignment 4 - (2/2) Due Marth 17th

Loop through each predictor column
- ...
- Calculate the different ranking algos
  - p-value & t-score (continuous predictors only) along with it's plot
    - Regression: Continuous response
    - Logistic Regression: Boolean response
  - Difference with mean of response along with it's plot (weighted and unweighted)
  - Random Forest Variable importance ranking (continuous predictors only)
Generate a table with all the variables and their rankings
I'm going to grade this by giving it some random dataset and seeing if it outputs everything
Desire: html (hint 1 or hint 2) based rankings report with links to the plots