Data transformation with pandas vs. pyspark - Jingwen Zheng

In recent work, I transformed data with pyspark. Since I often did this with pandas in my previous work, I will compare these two packages on data transformation with the following points:

Checking dataframe size
Checking unique values of a column
Creating a new column
Filtering
Selecting a list of columns
Aggregating
Rename column
Joining 2 dataframes
Creating a new dataframe
Creating a pivot table

Packages

import pandas as pd

import pyspark
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

Data

20200908-data

pyspark vs. pandas

Checking dataframe size

20200908-df-size

.count() counts the number of rows in pyspark.
pandas.DataFrame.shape returns a tuple representing the dimensionality of the DataFrame.

Checking unique values of a column

20200908-unique-value

.select().distinct(): distinct value of the column in pyspark is obtained by using select() function along with distinct() function. select() takes up the column name as argument, followed by distinct() will give distinct value of the column.
pandas.unique(): uniques are returned in order of appearance.

Creating a new column

20200908-new-col

withColumn(colName, col) returns a new dataframe by adding a column or replacing the existing column that has the same name.

Filtering

20200908-filtering

filter(condition) filters rows using the given condition.
We can apply df[condition] to get only the rows that satisfy the condition.

Selecting a list of columns

20200908-select-cols

select(*cols) projects a set of expressions and returns a new DataFrame.
[[*cols]]: we can pass a list of columns to [] to select columns in that order.

Aggregating

20200908-aggregating

agg(*exprs): aggregates on the entire dataframe without groups (shorthand for df.groupBy.agg()).

Renaming columns

20200908-rename-cols

withColumnRenamed(existing, new) returns a new DataFrame by renaming an existing column. This is a no-op if the schema doesn’t contain the given column name.
DataFrame.rename(**kwargs) alters axes labels.

Joining 2 dataframes

20200908-joining

join(other, on=None, how=None) joins with another DataFrame, using the given join expression.
DataFrame.merge(): merges DataFrame or named Series objects with a database-style join.

Creating a new dataframe

20200908-new-df

pandas.DataFrame() creates two-dimensional, size-mutable, potentially heterogeneous tabular data.
createDataFrame() creates a dataframe from an RDD, a list or a pandas.DataFrame.

Creating a pivot table

20200908-pivot-table

pivot(pivot_col, values=None): pivots a column of the current DataFrame and perform the specified aggregation.
DataFrame.pivot_table() creates a spreadsheet-style pivot table as a dataframe.

Reference

“DataScience Made Simple”, datasciencemadesimple.com. [Online]. Available: https://www.datasciencemadesimple.com/
“pandas”, pandas.pydata.org. [Online]. Available: https://pandas.pydata.org/pandas-docs/stable/index.html
“pyspark”, spark.apache.org. [Online]. Available: https://spark.apache.org/docs/latest/api/python/index.html
mordilla-net, “mammal animal world cute animal”, pixabay.com. [Online]. Available: https://pixabay.com/photos/mammal-animal-world-cute-animal-3162194/

PREVIOUSHow to apply mock with python unittest module?

NEXTHow COVID-19 impacts people's working in France?