Spark Batch Processing - Managing Spark Jobs and Notebooks: Challenges, Caching, and Performance Optimization (Day 2 Lecture)

Week 4: Batch Pipelines with Apache Spark
39 mins

Description

In this lecture, Zach discusses managing Spark jobs and notebooks, including the difficulty of keeping code modular and the nuances of JAR submission for Scala Spark and PySpark. He compares notebooks with spark-submit for production jobs, then covers caching, the auto broadcast join threshold, and the effective use of UDFs in PySpark. He closes with tips on performance optimization and on choosing a language for Spark jobs. [Recorded on Dec 7, 2023]