Getting started with PySpark

What is PySpark?

PySpark is the Python API for Apache Spark. It allows Python users to work with data distributed across a cluster of nodes. PySpark provides DataFrames, an abstraction over Spark's underlying RDDs (Resilient Distributed Datasets), and supports most Spark features, including Spark SQL, DataFrames, Structured Streaming, MLlib, and Spark Core.

Install PySpark

There are multiple ways of installing PySpark, such as pip, conda, or a manual download. The simplest way to get started is to install it from PyPI with pip:

pip install pyspark

Note: A Java installation is required to run PySpark.

Initialize SparkSession

SparkSession is the entry point for PySpark. It can be initialized as follows:

from pyspark.sql import SparkSession

# Create a new SparkSession, or return the existing one if already created
spark = SparkSession.builder.getOrCreate()

Load and read a CSV file
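A minimal sketch of loading a CSV file into a DataFrame. The file name data.csv and its columns are assumptions for illustration; header=True treats the first row as column names, and inferSchema=True asks Spark to guess each column's type.

# "data.csv" is a hypothetical file used for illustration
df = spark.read.csv("data.csv", header=True, inferSchema=True)

# Inspect the loaded rows and the inferred schema
df.show()
df.printSchema()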

Run SQL queries on a DataFrame
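To query a DataFrame with SQL, first register it as a temporary view. A short sketch, assuming the df loaded above has name and age columns (both hypothetical):

# Register the DataFrame as a temporary view so SQL can reference it
df.createOrReplaceTempView("people")

# spark.sql returns the query result as a new DataFrame
result = spark.sql("SELECT name, age FROM people WHERE age > 30")
result.show()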

Column Operations

1. Add Column
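A new column can be added with withColumn, which returns a new DataFrame. A minimal sketch; the is_adult and age column names are assumptions:

from pyspark.sql import functions as F

# Add a boolean column derived from an existing column
df = df.withColumn("is_adult", F.col("age") >= 18)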

2. Select Column(s)
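select returns a new DataFrame containing only the listed columns; the column names here are illustrative:

# Keep only the "name" and "age" columns
df.select("name", "age").show()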

3. Rename Column(s)
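withColumnRenamed returns a new DataFrame with the given column renamed; the names are illustrative:

# Rename the "age" column to "age_in_years"
df = df.withColumnRenamed("age", "age_in_years")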