SJSU - Big Data Analytics - Session 1

SJSU - Summer 2022 - Big Data Analytics

Session 1

Dr. Julien Pierret

Today's Agenda

Hadoop: background only; we won't code in it
Spark
- Background
- Resilient Distributed Dataset
- Spark SQL
- User Defined Functions
- Pipelines
- Partitions

In Summary

Hadoop: slow (intermediate results are written to disk between stages), hard to use
- MapReduce
Spark: fast (keeps working data in memory), easier to use
- Resilient Distributed Dataset
- Dataframes
- SQL
- User Defined Functions
- Pipelines
- Transformers
- Estimators
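
To make the MapReduce idea from the summary concrete, here is a minimal word-count sketch in plain Python. It uses neither the Hadoop nor the Spark API; the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) and the sample sentences are made up for illustration. It only shows the map, shuffle, and reduce steps on a single machine.

```python
from collections import defaultdict
from functools import reduce


def map_phase(lines):
    """Map step: emit a (word, 1) pair for every word in every line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle_phase(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce step: combine the values of each key into a single count."""
    return {
        key: reduce(lambda a, b: a + b, values)
        for key, values in groups.items()
    }


def word_count(lines):
    """Chain the three steps together (hypothetical helper)."""
    return reduce_phase(shuffle_phase(map_phase(lines)))


# Illustrative input, not course data.
counts = word_count(["spark is fast", "hadoop is slow"])
print(counts)
```

In PySpark, the same word count is usually expressed with the RDD methods `flatMap`, `map`, and `reduceByKey`, and the work is spread across a cluster instead of running in one process.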

Homework - Assignment

Tomorrow will be lab day
Sign up for a GitHub account (if you do not already have one)
- Needed to download a template repo
Set up software so you can run Python 3 code
- PyCharm 🥰
- Visual Studio Code 🙂
Let's go over this...
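
One quick way to confirm the setup works is to run a tiny script from your editor and check that it reports a Python 3 interpreter. This is only a suggested sanity check, not part of the assignment; the helper name `is_python3` is made up for illustration.

```python
import sys


def is_python3(version_info=sys.version_info):
    """Return True when the major version is 3 or later."""
    return version_info[0] >= 3


# Show which interpreter the editor is actually using.
print("Python", sys.version.split()[0])
print("OK" if is_python3() else "Please install Python 3")
```

If the script prints a version starting with 2, the editor is pointing at an old interpreter and its interpreter setting needs to be changed.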

Software Setup - Demo

Please work! 🤞

See you tomorrow!