SJSU - Summer 2022 - Big Data Analytics

Session 1

Dr. Julien Pierret

Today's Agenda

  • Hadoop: Just background, we won't code in it
  • Spark
    • Background
    • Resilient Distributed Dataset
    • Spark SQL
    • User Defined Functions
    • Pipelines
    • Partitions

In Summary

  • Hadoop: slow, hard to use
    • MapReduce
  • Spark: fast, easier to use
    • Partitions
    • Lazy execution
    • Resilient Distributed Dataset
    • Dataframes
    • SQL
    • User Defined Functions
    • Pipelines
    • Transformers
    • Estimators
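
To make the MapReduce bullet concrete, here is a minimal sketch of the map, shuffle, and reduce phases as a word count in plain Python. It does not use Hadoop or Spark; the function names (map_phase, shuffle_phase, reduce_phase, word_count) and the sample lines are made up for illustration, and everything runs on a single machine rather than being distributed across a cluster.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in order over an iterable of text lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical sample input, one string per "record".
    sample = ["spark is fast", "hadoop is slow", "spark is easier to use"]
    print(word_count(sample))
```

In a real cluster the map and reduce steps run in parallel on different machines, and the shuffle moves data between them over the network; this sketch only shows the order of the phases.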

Homework - Assignment

  • Tomorrow will be lab day
  • Sign up for a GitHub account (if you do not already have one)
  • Set up software so you can run Python 3 code
  • Let's go over this...
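
One quick way to confirm the setup works is a small script like the sketch below, which reports the interpreter version and checks that it is Python 3. The helper name is_python3 is hypothetical, and the course may ask for a specific minor version that this check does not enforce.

```python
import sys


def is_python3(version_info=sys.version_info):
    """Return True when the given interpreter version is Python 3."""
    return version_info[0] == 3


if __name__ == "__main__":
    # Show the running interpreter's version, then the result of the check.
    print("Python", sys.version.split()[0])
    print("OK: Python 3" if is_python3() else "Not Python 3: check your setup")
```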

Software Setup - Demo

Please work! 🤞

See you tomorrow!