BDA-602 - Machine Learning Engineering

Dr. Julien Pierret

Lecture 10

Continuous - Scikit-Learn Continuous Metrics


Continuous - Mean Squared Error

$$MSE = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$
  • Most common loss function in statistics
  • Sensitive to outliers
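As a quick sketch (toy values, not from the lecture), scikit-learn's `mean_squared_error` matches the formula above:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# toy values for illustration
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

mse = mean_squared_error(y_true, y_pred)
mse_manual = np.mean((y_true - y_pred) ** 2)  # the formula, directly
```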

Continuous - Mean Absolute Error

$$MAE = \frac{1}{n}\sum_{i=1}^{n}{\left|y_i-\hat{y}_i\right|}$$
  • Not as sensitive to outliers
  • Some prefer this over $MSE$
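A sketch of that outlier point, with made-up values: corrupting a single prediction inflates MSE far more than MAE, because the squared term grows quadratically:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_clean = np.array([1.1, 1.9, 3.2, 3.8])
y_outlier = y_clean.copy()
y_outlier[-1] = 14.0  # one large miss

# how much each metric inflates when the outlier appears
mae_ratio = mean_absolute_error(y_true, y_outlier) / mean_absolute_error(y_true, y_clean)
mse_ratio = mean_squared_error(y_true, y_outlier) / mean_squared_error(y_true, y_clean)
```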

Continuous - Coefficient of Determination ($R^2$)

$$R^2(y, \hat{y}) = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$
$$R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$$
  • Closely related to the correlation coefficient
  • For simple linear regression with an intercept: $r^2 = R^2$
  • [0, 1] for least-squares fits; can be negative for arbitrary predictions
  • Considered a goodness-of-fit statistic: measures how well the predictions approximate the real data
  • One of the common outputs when running a linear/logistic regression
  • Not a good measure when checking for overfitting
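With the same kind of toy values, `r2_score` agrees with the $SS_{res}/SS_{tot}$ definition above:

```python
import numpy as np
from sklearn.metrics import r2_score

# toy values for illustration
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

r2 = r2_score(y_true, y_pred)

ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2_manual = 1 - ss_res / ss_tot
```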

$R^2$ - Issues


Continuous - Adjusted Coeff. of Determination (Adj. $R^2$)

$$ \bar{R}^{2}=1-(1-R^{2})\frac{n-1}{n-p-1} $$
  • Where $p$ is number of features and $n$ number of data points
  • Also one of the common outputs when running a linear/logistic regression
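scikit-learn does not ship an adjusted $R^2$ (as far as I know), but it is a one-liner on top of `r2_score`; the values and feature count below are illustrative:

```python
import numpy as np
from sklearn.metrics import r2_score

# hypothetical fitted values
y_true = np.array([3.0, -0.5, 2.0, 7.0, 4.5, 1.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0, 4.0, 1.5])
n = len(y_true)  # number of data points
p = 2            # hypothetical number of features in the model

r2 = r2_score(y_true, y_pred)
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)  # penalizes extra features
```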

Continuous - Explained Variance

$$explained\:variance\left(y, \hat{y}\right) =1- \frac{Var\{y-\hat{y}\}}{Var\{y\}}$$
  • Remember PCA?
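A quick comparison with $R^2$ (toy values): explained variance ignores any constant bias in the residuals, so it is at least as large as $R^2$, and the two coincide when the residuals have zero mean:

```python
import numpy as np
from sklearn.metrics import explained_variance_score, r2_score

# toy values; the residuals here have a nonzero mean
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

ev = explained_variance_score(y_true, y_pred)
r2 = r2_score(y_true, y_pred)
```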

Continuous - Residual Sum of Squares

$$ RSS = \sum_{i=1}^{n}{\left(y_i - \hat{y}_i\right)^2}$$
  • Careful
    • Not normalized: it grows with any change to the dataset size
    • Only compare models fit on the same data
  • Residuals are important
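A minimal sketch with toy values; note there is no division by $n$, which is why the raw RSS is not comparable across datasets:

```python
import numpy as np

# toy values for illustration
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

residuals = y_true - y_pred
rss = np.sum(residuals ** 2)  # grows with n: not comparable across datasets
```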

Categorical - Scikit-Learn Classification Metrics


Categorical - Accuracy

$$accuracy \left(y,\hat{y}\right) = \frac{1}{n}\sum_{i=1}^{n} 1\left(\hat{y}_i = y_i\right) $$
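The indicator-sum above is exactly what `accuracy_score` computes (toy labels):

```python
from sklearn.metrics import accuracy_score

# toy labels for illustration
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 0]

acc = accuracy_score(y_true, y_pred)  # fraction of exact matches
```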

Confusion Matrix

                    Actual
                    1                      0
Predicted   1       500                    24
            0       19                     687

                    Actual
                    1                      0
Predicted   1       500 - True Positives   24 - False Positives
            0       19 - False Negatives   687 - True Negatives

Confusion Matrix

  • Classification - Categorical Responses
  • Basically - Right vs Wrong
  • Two dimensions
    • Rows represent predictions
    • Columns represent actuals
  • Can be done for > 2 categorical predictions
    • Table for each category
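The 2x2 table above can be reproduced with scikit-learn. One caution: `confusion_matrix` puts actuals on the rows and predictions on the columns, the transpose of the layout on these slides.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# rebuild the slide's counts: TP=500, FP=24, FN=19, TN=687
y_true = np.concatenate([np.ones(500), np.zeros(24), np.ones(19), np.zeros(687)])
y_pred = np.concatenate([np.ones(500), np.ones(24), np.zeros(19), np.zeros(687)])

# rows = actual [1, 0], columns = predicted [1, 0]
cm = confusion_matrix(y_true, y_pred, labels=[1, 0])
```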

Confusion Matrix - Multiple Categories

                    Actual
                    A    B    C    D
Predicted   A       20   1    2    3
            B       4    21   5    6
            C       7    8    22   9
            D       10   11   12   23

Confusion Matrix - Category A

                    Actual
                    A    B    C    D
Predicted   A       20   1    2    3
            B       4    21   5    6
            C       7    8    22   9
            D       10   11   12   23

                    Actual
                    A     Not A
Predicted   A       20    6
            Not A   21    117

Confusion Matrix - Category B

                    Actual
                    A    B    C    D
Predicted   A       20   1    2    3
            B       4    21   5    6
            C       7    8    22   9
            D       10   11   12   23

                    Actual
                    B     Not B
Predicted   B       21    15
            Not B   20    108

Confusion Matrix - Category C

                    Actual
                    A    B    C    D
Predicted   A       20   1    2    3
            B       4    21   5    6
            C       7    8    22   9
            D       10   11   12   23

                    Actual
                    C     Not C
Predicted   C       22    24
            Not C   19    99

Confusion Matrix - Category D

                    Actual
                    A    B    C    D
Predicted   A       20   1    2    3
            B       4    21   5    6
            C       7    8    22   9
            D       10   11   12   23

                    Actual
                    D     Not D
Predicted   D       23    33
            Not D   18    90
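The per-category collapse shown above is what `multilabel_confusion_matrix` automates: it emits one 2x2 table per class (toy labels below, not the slide's data):

```python
from sklearn.metrics import confusion_matrix, multilabel_confusion_matrix

# toy labels for illustration
y_true = ["A", "A", "B", "B", "C", "C", "D", "D", "A", "B"]
y_pred = ["A", "B", "B", "B", "C", "A", "D", "C", "A", "B"]
labels = ["A", "B", "C", "D"]

cm = confusion_matrix(y_true, y_pred, labels=labels)  # one 4x4 table
# one 2x2 table per class, laid out as [[TN, FP], [FN, TP]]
per_class = multilabel_confusion_matrix(y_true, y_pred, labels=labels)
```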

Confusion Matrix - Calculating

Confusion Matrix - Metrics

Accuracy (ACC)
$$ ACC = \dfrac{\color{green}{TP} + \color{pink}{TN}}{\color{green}{TP} + \color{pink}{TN} + \color{blue}{FP} + \color{red}{FN}} $$

True Positive Rate (TPR), Recall, Sensitivity
$$ TPR = \dfrac{\sum{\color{green}{TP}}}{\sum{\color{purple}{CP}}} $$

False Positive Rate (FPR)
$$ FPR = \dfrac{\sum{\color{blue}{FP}}}{\sum{\color{brown}{CN}}} $$

False Negative Rate (FNR)
$$ FNR = \dfrac{\sum{\color{red}{FN}}}{\sum{\color{purple}{CP}}} $$

Specificity (SPC), Selectivity, True Negative Rate (TNR)
$$ TNR = \dfrac{\sum{\color{pink}{TN}}}{\sum{\color{brown}{CN}}} $$
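All four rates fall out of `confusion_matrix` directly (illustrative labels; the `ravel()` order below assumes the default `[0, 1]` label ordering):

```python
from sklearn.metrics import confusion_matrix

# toy labels for illustration
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# with default label order [0, 1], ravel() yields TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

tpr = tp / (tp + fn)  # recall / sensitivity
fpr = fp / (fp + tn)
fnr = fn / (fn + tp)  # = 1 - TPR
tnr = tn / (tn + fp)  # specificity, = 1 - FPR
```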

Confusion Matrix - Metrics

Positive Predictive Value (PPV), Precision
$$ PPV = \dfrac{\sum{\color{green}{TP}}}{\sum{\color{orange}{PP}}} $$

False Discovery Rate (FDR)
$$ FDR = \dfrac{\sum{\color{blue}{FP}}}{\sum{\color{orange}{PP}}} $$

False Omission Rate (FOR)
$$ FOR = \dfrac{\sum{\color{red}{FN}}}{\sum{\color{yellow}{PN}}} $$

Negative Predictive Value (NPV)
$$ NPV = \dfrac{\sum{\color{pink}{TN}}}{\sum{\color{yellow}{PN}}} $$
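The predictive values work the same way, conditioning on the predicted class rather than the actual one (same toy labels as before):

```python
from sklearn.metrics import confusion_matrix

# toy labels for illustration
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

ppv = tp / (tp + fp)   # precision
fdr = fp / (tp + fp)   # = 1 - PPV
npv = tn / (tn + fn)
fomr = fn / (tn + fn)  # false omission rate ("for" is a Python keyword)
```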

Confusion Matrix - Common Measures


Categorical Response Metrics
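Many of the per-class measures come bundled in scikit-learn's `classification_report` (toy labels):

```python
from sklearn.metrics import classification_report

# toy labels for illustration
y_true = ["A", "A", "B", "B", "C", "C"]
y_pred = ["A", "B", "B", "B", "C", "C"]

# precision / recall / F1 per class, plus overall accuracy
report = classification_report(y_true, y_pred, output_dict=True)
```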


In Summary

  • Many different evaluation techniques
  • Continuous
    • Mean Squared Error / Mean Absolute Error
    • $R^2$
    • Residuals
  • Categorical
    • Accuracy
    • Confusion Matrix
      • Tons of metrics
    • Receiver Operating Characteristic
      • Area Under the ROC
  • Model calibration
    • Many methods
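The ROC / AUC item above, sketched with scikit-learn (toy scores; note it takes predicted scores or probabilities, not hard labels):

```python
from sklearn.metrics import roc_auc_score

# toy example: two negatives, two positives, with predicted scores
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc_score(y_true, y_score)  # area under the ROC curve
```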

Mid-Term Due October 28th

  • Finish up your mid-term
  • I will download all repos at midnight on Friday
  • No late assignments accepted
    • You will get a zero