Anatomy of a PySpark script
Jaideep Ganguly ◼︎ Last Updated: Oct 29, 2019

Problem

In the table below, find the max of each row. Let's assume there are 10 columns after the key column.

Key   Col 1   Col 2   Col 3   ...   Col n
A        50     100      20   ...      80
B        25      92      99   ...      10
...     ...     ...     ...   ...     ...
Z        12     155     300   ...      88


Step 1 - Session

Step 2 - Define the function to be passed to map. All global data is available to this function
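One way to write that function, sketched in plain Python; the names `data` and `row_max` and the sample values are assumptions. Because the function receives only a key, it reaches out to the module-level (global) dict for the row's values.

```python
# Module-level (global) data: key -> list of that row's column values.
# Spark ships this dict to the executors along with the function below.
data = {
    "A": [50, 100, 20, 80],
    "B": [25, 92, 99, 10],
}

def row_max(key):
    """Function passed to map(): return (key, max of that row's values)."""
    return key, max(data[key])
```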

Step 3 - Load all Data

Step 4 - select, filter & collect into a list

Step 5 - Create list and/or dict from data frame for processing

Step 6 - Parallelize and process via map function

Important Note:

Every global variable referenced by the map function is serialized and shipped to the executors by Spark. In other words, all of that global data is copied to the workers.

The map function must be defined before the call that uses it, because Python is interpreted. All global data is available inside this function.

The data cannot live inside Python's main function: in that case it would be local to that function, and it would not be shipped to the workers.