BDA-602 - Machine Learning Engineering

Dr. Julien Pierret

Lecture 6

Spark - Parquet Files

In Summary

  • Hadoop: slow, hard to use
    • MapReduce
  • Spark: fast, easier to use
    • Partitions
    • Lazy execution
    • Resilient Distributed Dataset
    • Dataframes
    • SQL
    • User Defined Functions
    • Pipelines
    • Transformers
    • Estimators
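
The "Partitions" and "Lazy execution" bullets can be illustrated with a small sketch. This is a toy model in plain Python, not Spark's API: `LazyDataset`, `from_list`, and `square` are hypothetical names made up for this example. It assumes a simple round-robin split into partitions. Transformations (`map`, `filter`) are only recorded, and nothing runs until an action (`collect`) is called, which is roughly how transformations and actions behave on a Spark RDD.

```python
class LazyDataset:
    """Toy stand-in for an RDD (hypothetical, not Spark's implementation)."""

    def __init__(self, partitions, ops=()):
        self.partitions = partitions  # list of lists, one list per partition
        self.ops = ops                # recorded transformations, not yet run

    @classmethod
    def from_list(cls, data, num_partitions):
        # Assumption: round-robin split; real Spark partitioning differs.
        return cls([data[i::num_partitions] for i in range(num_partitions)])

    def map(self, fn):
        # Transformation: record the step and return a new dataset.
        return LazyDataset(self.partitions, self.ops + (("map", fn),))

    def filter(self, pred):
        # Transformation: record the step and return a new dataset.
        return LazyDataset(self.partitions, self.ops + (("filter", pred),))

    def _run_partition(self, part):
        # Apply every recorded step to a single partition.
        rows = iter(part)
        for kind, fn in self.ops:
            rows = map(fn, rows) if kind == "map" else filter(fn, rows)
        return list(rows)

    def collect(self):
        # Action: only now is each partition actually processed.
        out = []
        for part in self.partitions:
            out.extend(self._run_partition(part))
        return out


calls = []  # records every value that square() is really applied to


def square(x):
    calls.append(x)
    return x * x


ds = LazyDataset.from_list(list(range(10)), num_partitions=3)
pending = ds.map(square).filter(lambda v: v % 2 == 0)
calls_before_collect = len(calls)   # still 0: nothing has executed yet
result = sorted(pending.collect())  # the action triggers the work
```

Each partition is processed independently inside `collect`, which is the property that lets a real cluster hand partitions to different workers. Here they simply run one after another in a loop.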