In many ways, PySpark DataFrames are similar to Pandas DataFrames, but there are some key differences. What follows is not an exhaustive list of every action and transformation available for the two DataFrame types; it is a side-by-side comparison of the most common operations.
1. Loading & Viewing Data
# Pandas
df = pd.read_csv("csv_file.csv")
# PySpark
df = spark.read \
    .options(header=True, inferSchema=True) \
    .csv("csv_file.csv")
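The PySpark snippets here assume an existing SparkSession named spark, which shells such as pyspark and managed notebooks create automatically. In a standalone script it has to be built explicitly; a minimal sketch:
# PySpark (sketch: creating the `spark` session used throughout)
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("pandas-vs-pyspark") \
    .getOrCreate()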
# Pandas
df
df.head(10)
# PySpark
df.show()
df.show(10)
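Note that df.show() prints the first 20 rows by default and truncates long cell values; df.show(10, truncate=False) prints full values. For richer notebook display, a small PySpark DataFrame can be converted back to Pandas, assuming it fits in driver memory:
# PySpark (sketch: bring a small sample back to the driver as a Pandas DataFrame)
df.limit(10).toPandas()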
# Pandas
df.columns
df.dtypes
# PySpark
df.columns
df.dtypes
2. Transformations / Filtering
# Pandas
df["coln"] = 100
# PySpark (withColumn needs a Column expression, so wrap the literal with lit;
# DataFrames are immutable, so reassign the result)
from pyspark.sql.functions import lit
df = df.withColumn("coln", lit(100))
# Pandas
df.drop("col1", axis=1)
# PySpark
df.drop("col1")
# Pandas
df.columns = ["col1","col2","col3"]
df.rename(columns = {"old" : "new"})
# PySpark
df.toDF("col1","col2","col3")
df.withColumnRenamed("old", "new")
# Pandas
df[df.col > 30]
df[(df.col1 > 20) & (df.col2 == 1)]
# PySpark
df[df.col > 30]
df[(df.col1 > 20) & (df.col2 == 1)]
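The bracket syntax above happens to work in PySpark as well, which is why the two examples look identical; the more idiomatic PySpark spellings use filter or its alias where:
# PySpark (equivalent, more idiomatic forms)
df.filter(df.col > 30)
df.where((df.col1 > 20) & (df.col2 == 1))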
# Pandas
df.fillna(0)
# PySpark
df.fillna(0)
# Pandas
df.groupby(['col1', 'col2']) \
    .agg({'col3': 'mean', 'col4': 'min'})
# PySpark
df.groupby(['col1', 'col2']) \
    .agg({'col3': 'mean', 'col4': 'min'})
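One behavioral difference worth remembering: every PySpark expression above is a lazy transformation that returns a new DataFrame without doing any work; nothing executes on the cluster until an action such as show(), count(), or collect() is called. A minimal sketch of the pattern, assuming the hypothetical column names used above:
# PySpark (sketch: chained lazy transformations, materialized by an action)
from pyspark.sql.functions import col

result = df.fillna(0) \
    .filter(col('col1') > 20) \
    .groupBy('col1', 'col2') \
    .agg({'col3': 'mean'})

result.show()  # nothing runs until this action is called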