Pandas Dataframe vs PySpark Dataframe

In many ways, PySpark dataframes are similar to Pandas dataframes, but there are some key differences as well. What follows is not an exhaustive list of every action / transformation in the two dataframes, just a side-by-side comparison of the most common operations.

1. Loading & Viewing Data

# Pandas
df = pd.read_csv("csv_file.csv")

# PySpark
df = spark.read \
    .options(header=True, inferSchema=True) \
    .csv("csv_file.csv")
# Pandas
df
df.head(10)

# PySpark
df.show()
df.show(10)
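
One difference worth noting: df.show() prints the rows to the console and returns nothing, whereas the Pandas calls return data you can keep working with. For a Pandas-style preview (assuming the previewed rows comfortably fit on the driver), one option is:

# PySpark (collects only the first 10 rows back as a Pandas dataframe)
df.limit(10).toPandas()
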
# Pandas
df.columns
df.dtypes

# PySpark
df.columns
df.dtypes
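
The attribute names match, but the return types differ: Pandas dtypes is a Series indexed by column name, while PySpark dtypes is a list of (column, type) tuples. PySpark also offers a schema printer:

# PySpark (prints column names, types and nullability as a tree)
df.printSchema()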

2. Transformations / Filtering

# Pandas
df["coln"] = 100

# PySpark (a constant must be wrapped in a literal column; see the note below)
df = df.withColumn("coln", F.lit(100))
# Pandas
df.drop("col1", axis=1)

# PySpark
df.drop("col1")
# Pandas
df.columns = ["col1","col2","col3"]
df.rename(columns = {"old" : "new"})

# PySpark
df.toDF("col1","col2","col3")
df.withColumnRenamed("old","new)
# Pandas
df[df.col > 30]
df[(df.col1 >20) & (df.col2 == 1)]

# PySpark
df[df.col > 30]
df[(df.col1 >20) & (df.col2 == 1)]
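
The bracket syntax works in both libraries, but in PySpark the filter() and where() methods (they are aliases) are the more common idiom, and a SQL-style string expression is also accepted:

# PySpark
df.filter(df.col > 30)
df.where((df.col1 > 20) & (df.col2 == 1))
df.filter("col1 > 20 AND col2 = 1")
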
# Pandas
df.fillna(0)

# PySpark
df.fillna(0)
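
One PySpark detail: a scalar fill value only applies to columns of a matching type, so fillna(0) fills numeric columns and leaves string columns untouched. Both libraries also accept a dict to give each column its own fill value; the names and values below are placeholders:

# Pandas / PySpark
df.fillna({"col1": 0, "col2": "unknown"})
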
# Pandas
df.groupby(['col1','col2']) \
    .agg({'col3': 'mean', 'col4': 'min'})

# PySpark
df.groupby(['col1','col2']) \
    .agg({'col3': 'mean', 'col4': 'min'})
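
The dict form is convenient, but in PySpark it produces generated output names such as avg(col3). Spelling out the aggregations gives control over the result column names in both libraries; the columns below are placeholders:

# Pandas (named aggregation)
df.groupby(['col1','col2']) \
    .agg(avg_col3=('col3', 'mean'), min_col4=('col4', 'min'))

# PySpark (F is pyspark.sql.functions)
df.groupby(['col1','col2']) \
    .agg(F.mean('col3').alias('avg_col3'), F.min('col4').alias('min_col4'))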