SJSU - Summer 2022 - Big Data Analytics
Session 1
Dr. Julien Pierret
Today's Agenda
Hadoop: just background, we won't code in it
Spark
Background
Resilient Distributed Dataset
Spark SQL
User Defined Functions
Pipelines
Partitions
In Summary
Hadoop: slow, hard to use
MapReduce
Spark: fast, easier to use
Partitions
Lazy execution
Resilient Distributed Dataset
Dataframes
SQL
User Defined Functions
Pipelines
Transformers
Estimators
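To make the MapReduce, partition, and lazy-execution bullets concrete, here is a minimal pure-Python sketch of a word count. It does not use Hadoop or Spark; every function name below (`split_into_partitions`, `map_phase`, `shuffle`, `reduce_phase`, `word_count`) is hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def split_into_partitions(records, num_partitions):
    """Deal records round-robin into num_partitions lists."""
    partitions = [[] for _ in range(num_partitions)]
    for i, record in enumerate(records):
        partitions[i % num_partitions].append(record)
    return partitions


def map_phase(partition):
    """Emit (word, 1) pairs. This is a generator, so no work
    happens until something iterates over it (lazy execution)."""
    for line in partition:
        for word in line.lower().split():
            yield (word, 1)


def shuffle(mapped_partitions):
    """Group values by key across all partitions."""
    grouped = defaultdict(list)
    for pairs in mapped_partitions:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Combine the values for each key into one result."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(lines, num_partitions=2):
    partitions = split_into_partitions(lines, num_partitions)
    # Still lazy: the map step only runs when shuffle() consumes it.
    mapped = (map_phase(p) for p in partitions)
    return reduce_phase(shuffle(mapped))


lines = ["Spark is fast", "Hadoop is slow", "Spark is easier to use"]
counts = word_count(lines)
print(counts["spark"], counts["is"])
```

In a real cluster each partition would live on a different machine and the shuffle would move data over the network; here everything runs in one process so the three phases are easy to follow.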
Homework - Assignment
Tomorrow will be lab day
Sign up for a GitHub account (if you do not already have one)
Needed to download a template repo
Setup software so you can run Python 3 code
PyCharm 🥰
Visual Studio Code 🙂
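Once the software is installed, one quick way to confirm that it runs Python 3 code is a short script like the sketch below. The helper name `python_3_ready` is hypothetical; the course template repo may provide its own check.

```python
import sys


def python_3_ready(version_info=sys.version_info):
    """Return True when the interpreter's major version is 3 or later."""
    return version_info[0] >= 3


# Report which interpreter the editor is actually using.
status = "OK" if python_3_ready() else "too old"
print("Python", sys.version.split()[0], status)
```

Run it from inside PyCharm or Visual Studio Code so the check covers the interpreter the editor is configured to use, not just the one on the system path.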
Let's go over this...
Software Setup - Demo
Please work! 🤞
See you tomorrow!