In recent work, I transformed data with pyspark. Since I often did this with pandas in my previous work, I will compare these two packages on data transformation with the following points:
- Checking dataframe size
- Checking unique values of a column
- Creating a new column
- Filtering
- Selecting a list of columns
- Aggregating
- Renaming columns
- Joining 2 dataframes
- Creating a new dataframe
- Creating a pivot table
Packages
import pandas as pd
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
Data
pyspark vs. pandas
Checking dataframe size
- .count() counts the number of rows in pyspark.
- pandas.DataFrame.shape returns a tuple representing the dimensionality of the DataFrame.
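A minimal sketch of both size checks. The sample data and the helper name spark_shape are hypothetical, and the pyspark half is wrapped in a function that expects an existing pyspark DataFrame.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration.
pdf = pd.DataFrame({
    "name": ["Ann", "Bob", "Cid", "Dee"],
    "dept": ["IT", "HR", "IT", "HR"],
    "salary": [100, 80, 120, 90],
})

# pandas: shape is an attribute that returns (n_rows, n_columns).
n_rows, n_cols = pdf.shape
print(n_rows, n_cols)  # 4 3


def spark_shape(sdf):
    """Return (n_rows, n_columns) for a pyspark DataFrame `sdf`."""
    # count() runs a Spark job; len(sdf.columns) only reads the schema.
    return sdf.count(), len(sdf.columns)
```

With a running SparkSession named spark (an assumption), spark_shape(spark.createDataFrame(pdf)) should give the same pair of numbers as pdf.shape.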
Checking unique values of a column
- .select().distinct(): the distinct values of a column in pyspark are obtained by using the select() function along with the distinct() function. select() takes the column name as its argument, and the distinct() that follows gives the distinct values of that column.
- pandas.unique(): uniques are returned in order of appearance.
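As a rough illustration, the two approaches might look like this. The data and the helper spark_unique are hypothetical. Note that pandas keeps the order of appearance, while pyspark's distinct() does not guarantee any order.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration.
pdf = pd.DataFrame({
    "name": ["Ann", "Bob", "Cid", "Dee"],
    "dept": ["IT", "HR", "IT", "HR"],
})

# pandas: unique values of one column, in order of appearance.
depts = list(pdf["dept"].unique())
print(depts)  # ['IT', 'HR']


def spark_unique(sdf, col_name):
    """Return the distinct values of `col_name` as a Python list."""
    # select() keeps one column, distinct() drops duplicate rows, and
    # collect() brings the remaining rows back to the driver.
    rows = sdf.select(col_name).distinct().collect()
    return [row[0] for row in rows]
```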
Creating a new column
- withColumn(colName, col) returns a new dataframe by adding a column or replacing the existing column that has the same name.
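A small sketch of adding a derived column in both packages. The column name salary_x2 and the helper spark_add_column are made up for this example. On the pandas side, assign() is used because it also returns a new dataframe, which is close to how withColumn() behaves.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration.
pdf = pd.DataFrame({"name": ["Ann", "Bob"], "salary": [100, 80]})

# pandas: assign() returns a new dataframe and leaves pdf unchanged.
pdf_new = pdf.assign(salary_x2=pdf["salary"] * 2)


def spark_add_column(sdf):
    """Return a new pyspark DataFrame with a `salary_x2` column."""
    # Imported here so the pandas half runs without a Spark installation.
    from pyspark.sql import functions as F

    return sdf.withColumn("salary_x2", F.col("salary") * 2)
```

In pandas, writing pdf["salary_x2"] = pdf["salary"] * 2 is also common, but it modifies pdf in place.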
Filtering
- filter(condition) filters rows using the given condition.
- In pandas, we can apply df[condition] to get only the rows that satisfy the condition.
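The same filter could be sketched in both packages as follows. The threshold of 90 and the helper spark_filter are illustrative assumptions.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration.
pdf = pd.DataFrame({
    "name": ["Ann", "Bob", "Cid", "Dee"],
    "salary": [100, 80, 120, 90],
})

# pandas: a boolean mask inside [] keeps only the matching rows.
high_paid = pdf[pdf["salary"] > 90]
print(list(high_paid["name"]))  # ['Ann', 'Cid']


def spark_filter(sdf):
    """Keep rows of a pyspark DataFrame where salary is above 90."""
    # Imported here so the pandas half runs without a Spark installation.
    from pyspark.sql import functions as F

    return sdf.filter(F.col("salary") > 90)
```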
Selecting a list of columns
- select(*cols) projects a set of expressions and returns a new DataFrame.
- df[[*cols]]: in pandas, we can pass a list of columns to [] to select columns in that order.
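A short sketch of column selection in both packages, with hypothetical data and a hypothetical helper spark_select. In both cases the result follows the order of the requested columns.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration.
pdf = pd.DataFrame({
    "name": ["Ann", "Bob"],
    "dept": ["IT", "HR"],
    "salary": [100, 80],
})

cols = ["salary", "name"]

# pandas: a list inside [] selects those columns in the given order.
subset = pdf[cols]
print(list(subset.columns))  # ['salary', 'name']


def spark_select(sdf, cols):
    """Select `cols` from a pyspark DataFrame, in the given order."""
    # The * unpacks the list into separate arguments for select().
    return sdf.select(*cols)
```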
Aggregating
- agg(*exprs): aggregates on the entire dataframe without groups (shorthand for df.groupBy().agg()).
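As an illustration, a whole-dataframe aggregation might be written like this. The pandas counterpart shown here (Series.agg with a list of function names) is one of several possible choices, and the helper spark_aggregate is hypothetical.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration.
pdf = pd.DataFrame({
    "name": ["Ann", "Bob", "Cid", "Dee"],
    "salary": [100, 80, 120, 90],
})

# pandas: agg() with a list of names returns one value per function.
stats = pdf["salary"].agg(["mean", "max"])
print(stats["mean"], stats["max"])


def spark_aggregate(sdf):
    """Return a one-row pyspark DataFrame with the mean and max salary."""
    # Imported here so the pandas half runs without a Spark installation.
    from pyspark.sql import functions as F

    return sdf.agg(
        F.avg("salary").alias("mean"),
        F.max("salary").alias("max"),
    )
```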
Renaming columns
- withColumnRenamed(existing, new) returns a new DataFrame by renaming an existing column. This is a no-op if the schema doesn’t contain the given column name.
- DataFrame.rename(**kwargs) alters axes labels.
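A minimal sketch of renaming one column in each package. The new name department and the helper spark_rename are assumptions made for the example.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration.
pdf = pd.DataFrame({"name": ["Ann", "Bob"], "dept": ["IT", "HR"]})

# pandas: map old column names to new ones; returns a new dataframe.
renamed = pdf.rename(columns={"dept": "department"})
print(list(renamed.columns))  # ['name', 'department']


def spark_rename(sdf):
    """Rename `dept` to `department` in a pyspark DataFrame."""
    # If `dept` is not in the schema, the DataFrame comes back unchanged.
    return sdf.withColumnRenamed("dept", "department")
```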
Joining 2 dataframes
- join(other, on=None, how=None) joins with another DataFrame, using the given join expression.
- DataFrame.merge(): merges DataFrame or named Series objects with a database-style join.
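A left join on a shared key could be sketched as below. Both sample tables, the key column dept, and the helper spark_join are hypothetical.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration.
people = pd.DataFrame({
    "name": ["Ann", "Bob", "Cid", "Dee"],
    "dept": ["IT", "HR", "IT", "HR"],
})
offices = pd.DataFrame({"dept": ["IT", "HR"], "city": ["Oslo", "Rome"]})

# pandas: database-style left join on the shared `dept` column.
merged = people.merge(offices, on="dept", how="left")
print(list(merged["city"]))  # ['Oslo', 'Rome', 'Oslo', 'Rome']


def spark_join(left_sdf, right_sdf):
    """Left-join two pyspark DataFrames on the `dept` column."""
    return left_sdf.join(right_sdf, on="dept", how="left")
```

Passing the key as a column name in on= keeps a single dept column in the result in both packages.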
Creating a new dataframe
- pandas.DataFrame() creates two-dimensional, size-mutable, potentially heterogeneous tabular data.
- createDataFrame() creates a dataframe from an RDD, a list or a pandas.DataFrame.
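A small sketch of building the same table in both packages. The rows, column names, and helper functions are made up for illustration, and the pyspark helpers assume an existing SparkSession is passed in as spark.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration.
rows = [("Ann", 100), ("Bob", 80)]
columns = ["name", "salary"]

# pandas: build a dataframe from a list of tuples plus column names.
pdf = pd.DataFrame(rows, columns=columns)
print(pdf.shape)  # (2, 2)


def spark_from_list(spark, rows, columns):
    """Build a pyspark DataFrame from a list of tuples."""
    # The second argument gives the column names; types are inferred.
    return spark.createDataFrame(rows, columns)


def spark_from_pandas(spark, pdf):
    """Build a pyspark DataFrame from an existing pandas DataFrame."""
    return spark.createDataFrame(pdf)
```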
Creating a pivot table
- pivot(pivot_col, values=None): called on grouped data (after groupBy()), pivots a column of the current DataFrame and performs the specified aggregation.
- DataFrame.pivot_table() creates a spreadsheet-style pivot table as a dataframe.
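A rough sketch of the same pivot in both packages, summing a value per department and year. The data, column names, and helper spark_pivot are hypothetical. In pyspark, pivot() is chained between groupBy() and agg().

```python
import pandas as pd

# Hypothetical sample data, used only for illustration.
pdf = pd.DataFrame({
    "dept": ["IT", "IT", "HR", "HR", "IT"],
    "year": [2020, 2021, 2020, 2021, 2020],
    "sales": [10, 20, 5, 7, 1],
})

# pandas: one row per dept, one column per year, cells hold summed sales.
table = pdf.pivot_table(
    index="dept", columns="year", values="sales", aggfunc="sum"
)
print(table)


def spark_pivot(sdf):
    """Pivot a pyspark DataFrame: dept rows, year columns, summed sales."""
    # Imported here so the pandas half runs without a Spark installation.
    from pyspark.sql import functions as F

    return sdf.groupBy("dept").pivot("year").agg(F.sum("sales"))
```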
Reference
- “DataScience Made Simple”, datasciencemadesimple.com. [Online]. Available: https://www.datasciencemadesimple.com/
- “pandas”, pandas.pydata.org. [Online]. Available: https://pandas.pydata.org/pandas-docs/stable/index.html
- “pyspark”, spark.apache.org. [Online]. Available: https://spark.apache.org/docs/latest/api/python/index.html
- mordilla-net, “mammal animal world cute animal”, pixabay.com. [Online]. Available: https://pixabay.com/photos/mammal-animal-world-cute-animal-3162194/