BDA-602 - Machine Learning Engineering

Dr. Julien Pierret

Lecture 5

Transactional vs Data Warehouse Databases

  • Transactional
    • Stores day to day operations
    • Table rows are constantly updated
      • INSERT
      • UPDATE
  • Data Warehouse
    • Long term storage of data
    • Final values
    • Mostly getting data
      • SELECT

Database Design

  • Database design is about identifying relationships in your data
  • A good database design should
    • Divides information into subject based tables to reduce redundancy
    • Make it obvious how to join tables together
    • Ensure the accuracy and integrity of the information
  • Design process
    • Determine the purpose of your database
    • Find and organize the information required
    • Divide the information into tables
    • Turn information items into columns
    • Specify primary keys
    • Set up the table relationships
    • Refine your design
    • Apply the normalization rules

Good Table Design

  • Spend some upfront time
  • It'll pay back dividends
  • 4dvd.sdsu.edu
    • SQL has never been updated
    • Code gets updated all the time
    • Front-end interacts with it via stored procedures
    • Code

Best Practices

  • Table names: singular vs plural
    • Customers = Costumer
    • People = Person
  • Always have a primary key in each table
    • "id", "ID"
    • "TableName_id", "_ID"
    • "TableNameId"
    • Keep it consistent
    • Could also be a combination of other keys
  • Key INT type to use?
    • Go for the smallest one possible, makes lookups faster
    • SMALLINT UNSIGNED from 0 to 65,535
    • MEDIUMINT UNSIGNED from 0 to 16,777,215
    • INT UNSIGNED from 0 to 4,294,967,295
    • BIGINT UNSIGNED from 0 to 18,446,744,073,709,551,615

Linking tables

  • One-to-one
    • Normally bad
    • Data could be on the same table
    • Sometimes done for performance
  • One-to-many
  • Many-to-many
  • Foreign Keys
    • Protects data from getting invalid values
    • Slows things down

Data - Verifying

  • Data preperation for model building
    • Gathering
    • Verifying
    • Cleaning
    • Visualizing
    • Transforming
  • Garbage in Garbage out

Verifying Data - Constraining Input

  • Catch errors in data before modeling
  • SQL tables can constrain the input
    • You can specify field types (Strongly Typed) in columns
    • Enforce uniqueness
    • Constrain length
    • Catch errors in data like:
      • Entering text where we expect a number
      • A number for a date (invalid date)
      • Two users with the same username where it should be unique
      • Entering in a zip code with 100 numbers
    • Maintains data integrity

Verifying Data - Normalizing Data

  • Database Normalization (single source of truth)
    • Organizing data in logical groups or sets
    • Finding relationships between sets of data
    • Minimizing redundancy
  • Why?
    • Easy to do JOINs
      • Within your database
      • With new datasets 🧼📦: zip-codes
    • Save space
    • One place to get data
    • One place to update/fix
    • Don't need to keep different tables in sync
    • Don't duplicate data
      • Create an ID and put the values into another table
  • The Six Normal Forms

Indexes - Review

  • PRIMARY KEY
    • Main key for the table
    • Must be unique, can use multiple columns
    • Cannot be NULL
    • Usually numeric and automatically numbered
  • INDEX
    • Whatever you want
  • UNIQUE INDEX
    • Combination of all columns involved must be unique
    • Can contain NULL
  • FOREIGN KEY
    • Restrictions so value must exist in another table
  • Why indexes matter
    • Makes lookups fast

Creating Tables - Code


              

Creating Tables - Notes

  • AUTO_INCREMENT
    • Insertions into the table do not give an ID
    • ID will be auto generated
  • NULL
    • An unknown value
    • When defining table can state if these are allowed or not
  • ENGINE
    • How the table stores it's data
    • How keys are handled
  • DEFAULT CHARSET
    • Character set to use in columns
    • Some character sets use more space than others

Indexes - Code


              

Adding Indexes

  • Indexes can also be added after the fact
    
                  ALTER TABLE GridData ADD INDEX `TimeseriesData_Grid_IX`(`GridBox_ID`); 
                  
  • This is important when you have a large data insert that has to take place
    • Delete the index first
    • Add the index back in
    • If you don't the table insert may be slow
  • ALTER TABLE
  • FOREIGN KEY
  • OPTIMIZE TABLE TableName;
    • Run after a lot of inserts / modifications to a table

INSERT - code


              

Inserting - Notes

Extract, Transform, Load (ETL)

  • Getting data into a data warehouse
  • Usually talked in terms of databases
    • Doesn't need to be

  • Extract
    • Get the raw data
  • Transform
    • Prepare the data
  • Load
    • Put data where users will use it

Extract

  • Get the data from all the sources
    • Databases
    • CSV, JSON,...
  • Loaded into DB


Methods

  Full Extraction Partial Extraction
Implementation Easier Harder
Speed Slower Faster
Efficiency Less More
Data set size Large Small
Errors Less More

Extract - Staging Tables

  • Data is usually loaded into staging tables
    • May not be a schema
    • No keys
  • Better to put staging tables in other databases
    • Database server can hold multiple databases
    • Not always possible
    • Preface table names with "s_" for staging

Transform

  • Work with the data loaded in staging tables
    • Clean it
      • Schema setup
      • Select what to load
      • Type check
      • Check for errors / consistency
      • Deduplication
      • Normalization
      • Setup indexes/key
    • Results from above may go into more staging tables
    • Run business logic
      • Aggregations
      • Data transformations

Load

  • Final step
  • Data is loaded where users will use it
    • System may slow down
    • Run off-hours
  • Run checks
    • Loaded correctly?
    • Key fields missing?
    • Row counts go up/down?
  • Keys and indexes should be updated
  • If staging tables in same database as final data
    • Preface table names with "f_" if final table
    • Better for them to be elsewhere, but not always possible

Summary

  • ETL code example
  • Batches
    • ETL described here done in batches
    • Hourly / Daily / etc...
    • Not realtime

Stored Procedures and Views

  • Stored Procedure
    • Accepts parameters
    • Can NOT be used as building block in a larger query
    • Can contain several statements, loops, IF ELSE, etc.
    • Can perform modifications to one or several tables
    • Can NOT be used as the target of an INSERT, UPDATE or DELETE statement.
  • View
    • Does NOT accept parameters
    • Can be used as building block in a larger query
    • Can contain only one single SELECT query
    • Can NOT perform modifications to any table
    • But can (sometimes) be used as the target of an INSERT, UPDATE or DELETE statement.

Why bother with them?

  • Communicate with a database in production
    • Prefer stateless
    • May not be avoidabe
  • Split logic between them
    • Application doesn't need to know SQL structure
    • Database structure can change without involving application
  • User with limited access
  • Access specific stored procedures / views
  • Better security!
    • SELECT * FROM {table_name};
    • SELECT * FROM users;
    • SELECT * FROM {users; SHOW TABLES;}

Which to use?

  • Stored Procedures
    • Better for system to system communication
    • Application needs to get data
  • Views
    • Better for users
    • Can get data but restrict access to a table

Stored Procedure - Code


              

In Summary

  • We deal with Data Warehouse database design styles
  • Take time to properly setup your database
  • Use schema types to your advantage to find errors in data
  • Use unique indexes to help find errors
  • deduplicate
  • Normalize your data
    • Don't need to memorize the normal forms
    • Know what duplicating data looks like
    • Single source of truth
  • Indexes help make lookups fast
  • Extract, Transform, and Load
  • Stored Procedures > Views

Homework - Assignment

  • No new homework!!! 😌
  • Don't forget to work on assignment 2 from lecture 3