BDA-602 - Machine Learning Engineering

Dr. Julien Pierret

Lecture 1

Syllabus - Overview

This course introduces practical machine learning model building techniques with a strong emphasis on bringing these models to a production environment.

Course Description (What we're going to do)

  • Work with code
    • Introduce proper ways to work with code
  • Prepare data
    • We will cover preparing unstructured data for easier model ingestion.
  • Extract features
    • Feature engineering and its importance!!
    • Creating reusable pipeline transformations
  • Build machine learning models
    • Which features are important?
    • Which model should I use?
    • How do I know if the model is any good?
  • Deploy them to production
    • Wrap these models into an online application so predictions can be made in real time

Student Learning Outcomes

  • Organize data and extract meaningful features for producing machine learning models
  • Build and analyze machine learning models in a way that allows them to be deployed to production environments
  • Deploy machine learning models to a production environment

My goals for this class

  • To teach you what they didn't teach me when I was in school
  • All the topics I want to cover I learned on my own or at my various jobs while going to school
  • This class is ambitious
  • This class is HARD
    • You'll need to work hard
    • Not an easy ride
    • We will move fast
    • Do not fall behind
  • This class is similar to a boot camp

Prerequisites

  • This is a graduate-level course
  • Proficiency in Python 3.x
    • I won't spend any time teaching Python
      • Lower-level languages (C/C++, Java, etc.)
      • Will do a small IDE review session
    • You should be familiar with coding
      • main, class, list comprehension
  • Familiarity with the command line
    • I won't teach this
    • Resources to learn provided
  • The Engineering in the class title should give it away. You will code... a lot
  • 2-3 hours of study per unit per week (6-9 hours minimum)
  • Curiosity, Grit, Growth Mindset
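
As a quick self-check for the Python prerequisite, the constructs named above (main, class, list comprehension) fit in a few lines. This is only an illustrative sketch: the `Student` class, the `on_track_names` helper, and the sample names are hypothetical and not part of the course material. If every line reads naturally, the Python prerequisite is probably met.

```python
"""Self-check: the Python constructs named in the prerequisites."""


class Student:
    """A tiny class with an initializer and one method (hypothetical example)."""

    def __init__(self, name, hours_per_week):
        self.name = name
        self.hours_per_week = hours_per_week

    def meets_minimum(self, minimum=6):
        # The slide lists 6-9 hours per week as the minimum study time.
        return self.hours_per_week >= minimum


def on_track_names(students):
    # List comprehension: filter and transform in a single expression.
    return [s.name for s in students if s.meets_minimum()]


def main():
    students = [Student("Ada", 9), Student("Bob", 3), Student("Cy", 6)]
    print(on_track_names(students))


if __name__ == "__main__":
    # Runs only when the file is executed directly, not when it is imported.
    main()
```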

Equipment

  • Computer with a Linux / Bash environment required

Required Textbooks

  • We are going to touch on so many topics that I don't know of "a" good book
  • Renaissance man / woman
  • Web articles
  • Books I've read recently
    • Not required

Syllabus - Homework

  • We are each going to build a library that we can use to aid in building machine learning models
  • Throughout the semester you will be assigned new features or abilities to add to this library
  • Assignments may be directly or indirectly related to building out this library
  • These assignments will be submitted as pull requests (PRs) in GitHub
  • You will need to select a peer / friend / acquaintance in the class to review your PR and give meaningful and insightful comments 🤓
  • As a reviewer, if you think changes need to be made to the code, you can and should reject the PR asking for fixes
    • If you do reject, make sure to re-review once changes have been made
  • You will be graded as both a reviewer and a submitter of PRs
  • The instructor will do a final review of the PR and determine a grade based on it

Syllabus - Homework (cont.)

  • It is important not to fall behind on these tasks as they will build on each other!
  • You will rarely get a PR accepted on its first try
  • Other eyes will always see something you’ve missed
  • PR must be created and submitted for review on the Wednesday before the homework is due.
    • Gives 🤓 time to review
  • Make sure code runs!!
    • I will not be forgiving if it doesn't
  • Branches must have the correct (case sensitive) name
    • If it doesn't, automated download processes won't find it and you'll get a 0
  • PR must be finished by midnight on the due date

Final Project - What to do

  • Pick a fun dataset (here are a few) and predict something of importance from it.
    • I will provide some
    • I must approve dataset
    • You'll need to make a good case if you want to use your own
    • Most likely: I won't approve it, but sometimes I do
  • You will
    • Analyze the dataset
    • Do feature engineering
    • Build predictive models
    • Analyze model performance

Final Project - What you should code

  • Provide a way for me (the instructor) to run this code from beginning to end
    • Using: docker-compose
    • I should be able to
      • Check out the GitHub code
      • Run a single script that will download the necessary data and go through the workflow of cleaning, organizing, and preparing the data
      • Do all the necessary transformations
      • Generate whatever results you need
      • Build a predictive model
      • Output the predictive model
      • ...
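
The "single script" idea above can be sketched as one entry point that calls each stage in order. This is a minimal sketch under stated assumptions, not the required project layout: every function name is hypothetical, the data is a small in-memory placeholder instead of a real download, and the "model" is a one-parameter least-squares fit chosen only to keep the example dependency-free. A script like this would be the command the docker-compose setup runs.

```python
"""Sketch of a single entry-point script that runs every stage in order."""


def load_data():
    # Placeholder: a real project would download its dataset here.
    return [
        {"x": 1.0, "y": 2.0},
        {"x": 2.0, "y": 4.0},
        {"x": None, "y": 5.0},  # deliberately dirty row
        {"x": 3.0, "y": 6.0},
    ]


def clean(rows):
    # Drop rows that have a missing value in any column.
    return [r for r in rows if all(v is not None for v in r.values())]


def transform(rows):
    # Illustrative engineered feature: the square of x.
    return [{**r, "x_squared": r["x"] ** 2} for r in rows]


def build_model(rows):
    # Least-squares slope through the origin: sum(x * y) / sum(x * x).
    numerator = sum(r["x"] * r["y"] for r in rows)
    denominator = sum(r["x_squared"] for r in rows)
    return {"slope": numerator / denominator}


def run_pipeline():
    rows = load_data()
    rows = clean(rows)
    rows = transform(rows)
    return build_model(rows)


if __name__ == "__main__":
    # One command reproduces everything from raw data to the fitted model.
    print(run_pipeline())
```

Keeping each stage as its own function means the stages can be tested separately and reused, while `run_pipeline` stays the one place that defines the order.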

Final Project - Written Report

  • Discuss what you wanted to predict. What's in the source dataset?
  • Discuss data conversions: Cleaned / Organized / ETL (Extract, Transform, and Load) Procedure
  • Feature Analysis
    • Which features were predictive - why?
    • Feature Engineering
      • What did you think would be predictive?
      • Was it? Why or why not? Discuss, even if the feature was a failure
      • Charts / Graphs can be very helpful 📊
  • Model building
    • Which models did you try? Why? Results?
    • How does accuracy compare between these models?
    • Final model performance metrics

Final Project - Written Report (cont.)

  • The final report should be of sufficient length to cover your modeling process in enough depth that someone without access to your code could recreate your work
  • The final report must be a wiki on your GitHub project

Final Project - Presentations

  • You will each present your project to the class
  • We'll see how many students are still in the class and allocate time accordingly
  • I'll randomly assign time slots ON THE DAY(S) OF PRESENTATIONS!
    • Need to be prepared on the first day
    • Volunteers welcome

Syllabus - Grading

  • Midterm exam: 15%
  • Homework: 50% (40% for PRs, 10% as reviewer)
  • Class participation: 5%
  • Final project: 30%

Syllabus - University Policies

  • Accommodations
    • Simple Accommodations - Talk to me
    • Complicated Accommodations - Your responsibility to contact the Student Ability Success Center at (619) 594-6473
  • Religious observances - Tell me ahead of time
  • Academic Honesty - Don't cheat 🧼📦
    • Instant F for the course and you'll be reported
    • Add comments with links to wherever you got code from
  • Medical-related absences - Talk to me. Talk to Student Health Services ⚕️
  • Student privacy (FERPA) and intellectual property policies - Communications between me and students are private. Your grades and feedback will be kept private. I reserve the right to keep your assignments after the course has completed.
  • SDSU Economic Crisis Response Team - Well-being & Health Promotion 3rd floor Calpulli Center

COVID

  • I expect you to be vaccinated*
  • If you are sick - do not come to class!
    • Recordings will be made available from Fall 2020

Spring Syllabus - Weekly Schedule (subject to change)

Week 10

  • Spring Break 🏖️ No Class
    • I'll be available for questions and help

Last Week

  • Final Project Presentations

Introductions

I've introduced myself, now it's time for everyone to introduce themselves

  • Name
  • What you want to get out of this course
  • Something no one here would know about you if you didn't tell us

In Summary

  • We pull code from our repo
  • We create a branch to work on
  • We write code
  • We write unit tests
  • We commit code to git (locally, not GitHub)
    • Unit tests will be run
    • Code will be checked for errors
    • Code will be formatted
    • Code will be scanned for other issues
  • We push our branch up to GitHub
  • We make a pull-request (PR)
  • There may be some back and forth on the PR
  • PR is eventually accepted and we merge our branch to master
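
To make the "we write code" and "we write unit tests" steps concrete, here is a minimal sketch of a small function together with its tests, using Python's built-in `unittest` module. The `min_max_scale` function is a hypothetical stand-in for something the homework library might contain; the course's actual test framework and commit checks are not specified here.

```python
import unittest


def min_max_scale(values):
    """Scale a non-empty list of numbers to [0, 1]; constant input maps to zeros."""
    low, high = min(values), max(values)
    if high == low:
        # Avoid dividing by zero when every value is identical.
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


class TestMinMaxScale(unittest.TestCase):
    def test_scales_to_unit_range(self):
        self.assertEqual(min_max_scale([2, 4, 6]), [0.0, 0.5, 1.0])

    def test_constant_input_maps_to_zeros(self):
        self.assertEqual(min_max_scale([3, 3]), [0.0, 0.0])

    def test_empty_input_raises(self):
        # min() raises ValueError on an empty sequence.
        with self.assertRaises(ValueError):
            min_max_scale([])
```

Saved in a file such as `test_scaling.py` (a hypothetical name), these tests can be run with `python -m unittest test_scaling`, which is the kind of check that would run on each commit.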