Anatomy of a PySpark script
Jaideep Ganguly ◼︎ Last Updated: Oct 29, 2019

Problem

In the table below, find the max of each row. Let's assume there are 10 columns after the key column.

Key   Col 1   Col 2   Col 3   ...   Col n
A        50     100      20   ...      80
B        25      92      99   ...      10
...     ...     ...     ...   ...     ...
Z        12     155     300   ...      88


Step 1 - Session

Step 2 - Define the function to be passed to map. All global data is available to this function
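One way to write that function, sketched in plain Python; the names `data` and `row_max` and the sample values are assumptions. Because the function receives only a key, it reaches out to the module-level (global) dict for the row's values.

```python
# Module-level (global) data: key -> list of that row's column values.
# Spark ships this dict to the executors along with the function below.
data = {
    "A": [50, 100, 20, 80],
    "B": [25, 92, 99, 10],
}

def row_max(key):
    """Function passed to map(): return (key, max of that row's values)."""
    return key, max(data[key])
```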

Step 3 - Load all Data

Step 4 - select, filter & collect into a list

Step 5 - Create list and/or dict from data frame for processing

Step 6 - Parallelize and process via map function

Important Note:

Every global variable referenced by the map function is serialized and shipped to the executors by Spark. In other words, all of that global data is copied to the workers.

The map function must be defined before the call that uses it, because Python is interpreted. All global data is available inside this function.

The data cannot live inside Python's main function: in that case it would be local to that function, and it would not be shipped to the workers.