BDA-602 - Machine Learning Engineering

Dr. Julien Pierret

Lecture 4

Data - Why is Data so important?

  • Data is King
  • Models are built off data!
  • Google

    • #1 in web search
    • What if all their top talent left and formed a competitor?
    • Could they compete?
    • Why or why not?
  • Data Flywheel / Data network effect

“Data network effects occur when your product, generally powered by machine learning, becomes smarter as it gets more data from your users.”
Matt Turck

Data Flywheel



Data - Putting it to use

  • Different Models, different data requirements
  • What is available?
    • What do we have now?
    • What can I get?
      • Is it free?
      • How much is it?
    • How much do we have?
    • Will we get more?

Data - For model building

  • Data preparation for model building
    • Gathering
    • Verifying
    • Cleaning
    • Visualizing
    • Transforming

Data - Where you spend the most time

  • ~80% of time is spent dealing with data (Data Wrangling with R)
  • Garbage in, garbage out!
    • Mislabeled?
    • Missing values?
    • What are you predicting?
    • Unbalanced?
    • Complex Data signals?
    • Noise?
    • Usable?
    • Needed?
  • Different types of data...
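
A minimal sketch of how two of the checks above (missing values, unbalanced classes) might look in plain Python. The toy rows, the column names, and the helper functions `missing_counts` and `class_balance` are all hypothetical, made up for illustration only.

```python
from collections import Counter

# Hypothetical toy dataset: each dict is one observation.
rows = [
    {"height_cm": 170.0, "weight_kg": 65.0, "label": "healthy"},
    {"height_cm": None, "weight_kg": 80.0, "label": "healthy"},
    {"height_cm": 182.0, "weight_kg": None, "label": "healthy"},
    {"height_cm": 160.0, "weight_kg": 55.0, "label": "sick"},
]


def missing_counts(rows):
    """Count missing (None) values per column."""
    counts = Counter()
    for row in rows:
        for column, value in row.items():
            if value is None:
                counts[column] += 1
    return dict(counts)


def class_balance(rows, response):
    """Fraction of observations falling in each class of the response."""
    labels = Counter(row[response] for row in rows)
    total = sum(labels.values())
    return {label: n / total for label, n in labels.items()}


missing = missing_counts(rows)
balance = class_balance(rows, "label")
print(missing)
print(balance)
```

Checks like these are cheap to run before any model is built, and a heavily skewed `balance` or a column that is mostly missing is usually a reason to go back to the data rather than forward to the model.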

Data - Structured vs Unstructured

  • What is structured data?
    • Data that can fit into a spreadsheet
    • Fixed number of fields
    • Predefined schema
    • Easy to search
  • What is unstructured data?
    • No schema
    • Hard to search
      • Text
      • Images
      • Audio
      • Video
    • There are ways to get structure from the above data

Data - ML Requirements

  • ML models require structured data
    • Fixed number of predictors
    • Numerical
  • Two types of data
    • Continuous
      • Examples: Temperature, Height, Weight, Speed, Counts,...
      • Easy to implement into a model
    • Categorical
      • Examples: Gender, Color, Brand, Zipcode...
      • Requires a fixed number of categories
      • Harder to implement into a model
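
To illustrate why categorical data needs a fixed number of categories, here is a minimal one-hot encoding sketch. The `COLORS` list, the sample rows, and the `one_hot` helper are assumptions made for the example.

```python
# Hypothetical fixed list of categories, decided before modeling.
COLORS = ["red", "green", "blue"]


def one_hot(value, categories):
    """Encode one categorical value as a list of 0/1 indicators."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == category else 0 for category in categories]


rows = [
    {"temperature": 21.5, "color": "green"},
    {"temperature": 30.0, "color": "red"},
]

# Continuous values pass through as-is; the categorical value expands
# into one numeric column per category.
encoded = [[row["temperature"]] + one_hot(row["color"], COLORS) for row in rows]
print(encoded)
```

The continuous predictor needs no work, while the categorical one turns into as many columns as there are categories. A value outside the fixed list has no column to go into, which is why the sketch raises an error for it.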

Data - Types

  • Example file formats I've used
    • .txt - Text ...
    • .csv / .tsv - Comma/Tab Separated Values
    • .xls / .xlsx - Excel
    • .xml - Extensible Markup Language
    • .yaml - YAML Ain't Markup Language
    • .json - JavaScript Object Notation
    • .shp - Shapefiles
    • .netcdf - Network Common Data Form
    • .las / .laz - Lidar data
    • .dem - Digital Elevation Model
    • .sql - SQL dump formats
    • .log - log files from servers
    • .avro - Avro
    • .parquet - Parquet
    • ...

One filetype reigns supreme

  • Snappy compressed .parquet data is the best format currently out there
    • Compressed
    • Fast decompression. With multiple cores, just as fast as uncompressed data
    • Schema
    • Splitting Friendly
      • Almost all compression schemes don't let you split data. Exceptions:
        • lzo - Requires a generated index file 💩

End Goal

  • We need to organize our data for use in a model
  • Guess what?
    • Data is already available!
    • Most companies already have loads of data
    • In databases
    • It's done, the data is already structured
    • You need to know how to get at it

Databases

  • SQL - Structured ✔️
    • Structured Query Language
    • Common
    • Established
    • Data already how we need it
  • NoSQL - Unstructured ❌
    • Uncommon
    • Newer
    • Data needs to be wrangled

Databases - NoSQL

Databases - SQL

  • MySQL
    • Most of the internet runs on this
    • Bought out by Oracle... open source work plummeted
  • MariaDB
    • Made by the guy who started MySQL
    • All open source activity moved here
    • Tries to stay compatible with MySQL
  • SQL Server
    • Microsoft's SQL implementation
    • Surprisingly very capable 🧼📦
  • Oracle
    • Expensive
    • Powerful
  • PostgreSQL
    • More features than MySQL, not as easy

Databases - Getting the data

  • Even if you can get access to the data
  • ... it won't be organized in a way that is easy to model with
  • Need to learn how to query the data

SQL - Table structures


  • Table: widget
    • ID: id, widget_id, WidgetId
      • Helps identify specific rows
      • id's for rows in other tables
  • Columns
    • key with unique values
    • index setup for faster querying
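
A small sketch of such a table, using Python's built-in `sqlite3` module with an in-memory database. Only the table name `widget` and the ID column `widget_id` come from the slide; the `name` and `maker_id` columns, the index name, and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# widget_id identifies each row; maker_id is an id for a row that
# would live in another (hypothetical) table.
conn.execute(
    """
    CREATE TABLE widget (
        widget_id INTEGER PRIMARY KEY,
        name      TEXT NOT NULL,
        maker_id  INTEGER
    )
    """
)

# An index on a column we expect to filter or join on often.
conn.execute("CREATE INDEX idx_widget_maker ON widget (maker_id)")

conn.executemany(
    "INSERT INTO widget (name, maker_id) VALUES (?, ?)",
    [("sprocket", 1), ("gear", 1), ("cog", 2)],
)

# Look up one specific row through its ID.
widget_row = conn.execute(
    "SELECT widget_id, name FROM widget WHERE widget_id = ?", (2,)
).fetchone()
print(widget_row)
```

In SQLite an `INTEGER PRIMARY KEY` column is numbered automatically when no value is supplied, so the three inserted rows get ids 1, 2, and 3.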

SQL - Indexes

  • Speed up searching through tables
  • Types of Keys
    • PRIMARY KEY
      • Main key for the table
      • Must be unique, can use multiple columns
      • Cannot be NULL
      • Usually numeric and automatically numbered
    • INDEX
      • Whatever you want
    • UNIQUE INDEX
      • Combination of all columns involved must be unique
      • Can contain NULL
    • FOREIGN KEY
      • Restrictions so value must exist in another table 🧼📦: 🐢
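
The key types above can be tried out with a short `sqlite3` sketch. The `maker` and `widget` tables, their columns, and the `try_insert` helper are hypothetical. Note that SQLite only enforces foreign keys after `PRAGMA foreign_keys = ON`; other databases behave differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves enforcement off by default

conn.executescript(
    """
    CREATE TABLE maker (
        maker_id INTEGER PRIMARY KEY,      -- PRIMARY KEY: unique, auto-numbered
        name     TEXT NOT NULL
    );
    CREATE TABLE widget (
        widget_id INTEGER PRIMARY KEY,
        sku       TEXT,
        color     TEXT,
        maker_id  INTEGER,
        FOREIGN KEY (maker_id) REFERENCES maker (maker_id)
    );
    CREATE INDEX idx_widget_color ON widget (color);      -- plain INDEX
    CREATE UNIQUE INDEX idx_widget_sku ON widget (sku);   -- UNIQUE INDEX
    """
)

conn.execute("INSERT INTO maker (name) VALUES ('Acme')")
conn.execute("INSERT INTO widget (sku, color, maker_id) VALUES ('A-1', 'red', 1)")
# A UNIQUE INDEX can contain NULL, here even more than once.
conn.execute("INSERT INTO widget (sku, color, maker_id) VALUES (NULL, 'red', 1)")
conn.execute("INSERT INTO widget (sku, color, maker_id) VALUES (NULL, 'blue', 1)")


def try_insert(sql):
    """Run an INSERT and report whether a constraint rejected it."""
    try:
        conn.execute(sql)
        return "ok"
    except sqlite3.IntegrityError:
        return "rejected"


# Same sku twice violates the UNIQUE INDEX.
duplicate_sku = try_insert(
    "INSERT INTO widget (sku, color, maker_id) VALUES ('A-1', 'blue', 1)"
)
# maker_id 99 does not exist in maker, so the FOREIGN KEY rejects it.
missing_maker = try_insert(
    "INSERT INTO widget (sku, color, maker_id) VALUES ('B-2', 'blue', 99)"
)
null_sku_count = conn.execute(
    "SELECT COUNT(*) FROM widget WHERE sku IS NULL"
).fetchone()[0]
print(duplicate_sku, missing_maker, null_sku_count)
```

The foreign key check is extra work on every write, which is the trade-off the turtle on the slide hints at.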

SQL - Entity Relationship Diagram

SQL - Functions


              

SQL - Other JOINs

  • There are other JOINs you need to learn
    • INNER JOIN
      • Same as a JOIN
    • CROSS JOIN
      • All rows joined with one another
    • LEFT JOIN / LEFT OUTER JOIN
      • Contains all rows from the LEFT table
      • When no matching row is found in the RIGHT table, its values are filled with NULLs
    • RIGHT JOIN
      • Same as LEFT but reversed
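
The JOIN types above can be compared side by side on two tiny tables. This is a sketch using `sqlite3`; the `maker` and `widget` tables and their rows are hypothetical. Because older SQLite versions lack RIGHT JOIN, the reversed case is written as a LEFT JOIN with the tables swapped, which returns the same rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE maker (maker_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE widget (widget_id INTEGER PRIMARY KEY, name TEXT, maker_id INTEGER);
    INSERT INTO maker VALUES (1, 'Acme'), (2, 'Globex');
    INSERT INTO widget VALUES (1, 'gear', 1), (2, 'cog', 1), (3, 'orphan', NULL);
    """
)

# INNER JOIN: only widgets that have a matching maker.
inner_rows = conn.execute(
    """
    SELECT w.name, m.name
    FROM widget w
    INNER JOIN maker m ON m.maker_id = w.maker_id
    ORDER BY w.widget_id
    """
).fetchall()

# LEFT JOIN: every widget; maker columns are NULL when nothing matches.
left_rows = conn.execute(
    """
    SELECT w.name, m.name
    FROM widget w
    LEFT JOIN maker m ON m.maker_id = w.maker_id
    ORDER BY w.widget_id
    """
).fetchall()

# "RIGHT JOIN" written as a LEFT JOIN with the tables swapped:
# every maker; widget columns are NULL when nothing matches.
right_rows = conn.execute(
    """
    SELECT w.name, m.name
    FROM maker m
    LEFT JOIN widget w ON w.maker_id = m.maker_id
    ORDER BY m.maker_id, w.widget_id
    """
).fetchall()

# CROSS JOIN: every widget paired with every maker (3 x 2 rows).
cross_count = conn.execute(
    "SELECT COUNT(*) FROM widget CROSS JOIN maker"
).fetchone()[0]

print(inner_rows)
print(left_rows)
print(right_rows)
print(cross_count)
```

SQL NULL arrives in Python as `None`, so unmatched rows are easy to spot in the results.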

SQL - What Normally Happens

  • SQL as Data Bank
    • Use SQL queries to generate new predictors
    • Pull data from SQL into Python. Perform operations then store results back in SQL
  • Model building
    • Gather all the needed variables from SQL
      • Predictors
      • Response(s)
      • Each row one observation for training on
    • Combine it with data from other sources
      • Images
      • Unstructured Data that will be processed
    • Do some data preparation
    • Build a model
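
The "pull into Python, operate, store back" loop can be sketched with `sqlite3` as below. The `person` table, its columns, and the BMI calculation are hypothetical stand-ins for whatever predictor is actually being derived.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE person (
        person_id INTEGER PRIMARY KEY,
        height_m  REAL,
        weight_kg REAL,
        bmi       REAL          -- new predictor, empty for now
    );
    INSERT INTO person (height_m, weight_kg) VALUES (2.0, 100.0), (1.5, 45.0);
    """
)

# 1. Pull the raw columns from SQL into Python.
rows = conn.execute(
    "SELECT person_id, height_m, weight_kg FROM person"
).fetchall()

# 2. Perform the operation in Python (here: weight / height squared).
updates = [(weight / height**2, person_id) for person_id, height, weight in rows]

# 3. Store the results back in SQL.
conn.executemany("UPDATE person SET bmi = ? WHERE person_id = ?", updates)
conn.commit()

stored = conn.execute(
    "SELECT person_id, bmi FROM person ORDER BY person_id"
).fetchall()
print(stored)
```

A calculation this simple could also be done directly in SQL. The round trip through Python pays off when the operation needs a library or logic that SQL does not offer.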

In Summary

  • Data is King
    • Get it wrong, everything else fails
  • ML models require structured data
  • As file formats go, parquet is the best one
  • SQL > NoSQL
  • SQL
    • Learn how to query data from a SQL database
      • WHERE
      • GROUP BY
      • JOIN
      • CASE
      • functions
      • ...
    • Know how to get data from SQL into Python
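
As a rough illustration of the query pieces listed above, the sketch below combines WHERE, GROUP BY, JOIN, CASE, and aggregate functions in one query and reads the result into Python with `sqlite3`. The tables, columns, and price threshold are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE maker (maker_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE widget (widget_id INTEGER PRIMARY KEY, maker_id INTEGER, price REAL);
    INSERT INTO maker VALUES (1, 'Acme'), (2, 'Globex');
    INSERT INTO widget VALUES (1, 1, 5.0), (2, 1, 15.0), (3, 2, 50.0), (4, 2, 0.0);
    """
)

query = """
SELECT m.name,
       COUNT(*)     AS n_widgets,                                  -- function
       AVG(w.price) AS avg_price,                                  -- function
       SUM(CASE WHEN w.price >= 10 THEN 1 ELSE 0 END) AS n_expensive  -- CASE
FROM widget w
JOIN maker m ON m.maker_id = w.maker_id                            -- JOIN
WHERE w.price > 0                                                  -- WHERE
GROUP BY m.name                                                    -- GROUP BY
ORDER BY m.name
"""

# Each result row is one maker, ready to be used as an observation.
summary = conn.execute(query).fetchall()
print(summary)
```

The result arrives as a list of tuples, one per group, which can be passed on to whatever modeling code comes next.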

Homework - References 📚 and Tutorials 📓