How to Import PySpark Pandas?
import pyspark.pandas as ps
Run Pandas API DataFrame on PySpark (Spark with Python)
Use the pandas DataFrame created above and run it on PySpark. To do so, you need to use import pyspark.pandas as ps instead of import pandas as pd, and use ps.DataFrame() to create a DataFrame.
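For example, here's a minimal sketch (the Fee and Duration columns and their values below are assumed sample data, not taken from the original dataset):

import pyspark.pandas as ps

# Assumed sample data; the values are illustrative.
technologies = {
    'Courses': ["Spark", "PySpark", "Hadoop", "Python"],
    'Fee': [20000, 25000, 26000, 22000],
    'Duration': ['30days', '40days', '35days', '40days'],
}

# ps.DataFrame() creates a pandas-on-Spark DataFrame,
# not a pandas one and not a Spark one.
df = ps.DataFrame(technologies)
print(type(df))   # <class 'pyspark.pandas.frame.DataFrame'>
print(df.head())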
Note that here, the df and df2 objects are not Spark DataFrames; instead, they are objects of pyspark.pandas.frame.DataFrame. This DataFrame corresponds to a pandas DataFrame logically, but it holds a Spark DataFrame internally. In other words, it is a wrapper class around a Spark DataFrame that behaves like a pandas DataFrame.
Run this program in an environment that has PySpark installed, and you should get the same output as above. I executed the above code in the PySpark shell; you can refer to the output below.

If you try to run df.show() or df.printSchema(), you will get errors, because these are Spark DataFrame methods and the pandas-on-Spark DataFrame does not expose them.
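For instance, a quick sketch of what you'd see, reusing the df object from the example above:

# show() is a Spark DataFrame method; a pandas-on-Spark DataFrame
# does not have it, so attribute lookup fails.
try:
    df.show()
except AttributeError as e:
    print(e)   # 'DataFrame' object has no attribute 'show'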

9. Convert DataFrame between Pandas, Spark & Pandas API on Spark
After completing all your operations running on Spark, you might need to convert the result to a pandas DataFrame for further processing, to return it to a UI, etc. You can convert this pyspark.pandas.frame.DataFrame object to a pandas.core.frame.DataFrame (convert pandas API on Spark to a pandas DataFrame) by using the to_pandas() method.
Note that to_pandas() loads all data from multiple machines into the Spark driver's memory; hence, it should only be used when the resulting pandas DataFrame is small enough to fit in the driver's memory.
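A minimal sketch, reusing the df object from above:

# Collects all partitions to the driver; only safe for small results.
pdf = df.to_pandas()
print(type(pdf))   # <class 'pandas.core.frame.DataFrame'>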
To convert a pandas.core.frame.DataFrame to a pyspark.pandas.frame.DataFrame (convert a pandas DataFrame to a pandas API on Spark DataFrame), use ps.from_pandas().
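For example (the pandas DataFrame below is assumed sample data):

import pandas as pd
import pyspark.pandas as ps

pdf = pd.DataFrame({'Courses': ["Spark", "PySpark"],
                    'Fee': [20000, 25000]})

# Distributes the local pandas DataFrame across the cluster.
psdf = ps.from_pandas(pdf)
print(type(psdf))   # <class 'pyspark.pandas.frame.DataFrame'>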
We can also convert a pandas API on Spark DataFrame into a Spark DataFrame by using to_spark(). It converts the object from type pyspark.pandas.frame.DataFrame to pyspark.sql.dataframe.DataFrame.
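A short sketch, continuing from the psdf object above:

# Converts pandas-on-Spark to a Spark (SQL) DataFrame.
sdf = psdf.to_spark()
print(type(sdf))   # <class 'pyspark.sql.dataframe.DataFrame'>
sdf.show()         # Spark DataFrame methods work again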
Similarly, use pandas_api() to convert a pyspark.sql.dataframe.DataFrame back to a pyspark.pandas.frame.DataFrame.
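A short sketch, continuing from the sdf object above (pandas_api() is the method name in recent Spark releases; older code may use to_pandas_on_spark() instead):

# Converts a Spark DataFrame to a pandas-on-Spark DataFrame.
psdf2 = sdf.pandas_api()
print(type(psdf2))   # <class 'pyspark.pandas.frame.DataFrame'>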
10. Compare Pandas API on Spark vs PySpark
In this section let’s see some operations using pandas API and compare that with PySpark.
In PySpark, you first need to create a SparkSession; let's create one using SparkSession.builder. If you are using Azure Databricks, you don't have to create a session object, as the Databricks runtime provides the spark object by default, similar to the PySpark shell.
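A minimal sketch (the app name is an arbitrary choice, and the Spark DataFrame below reuses the assumed sample data from earlier so the comparisons that follow have both a df and an sdf object to work with):

from pyspark.sql import SparkSession

# On Databricks the `spark` object already exists; elsewhere,
# create one through the builder.
spark = SparkSession.builder.appName('pandas-vs-pyspark').getOrCreate()

# A Spark DataFrame built from the same assumed sample data as above.
sdf = spark.createDataFrame(
    [("Spark", 20000, "30days"), ("PySpark", 25000, "40days"),
     ("Hadoop", 26000, "35days"), ("Python", 22000, "40days")],
    schema=["Courses", "Fee", "Duration"],
)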
10.1 Select Columns
Let’s select columns from both these approaches and see the difference in Syntax.
10.2 Select or Filter Rows
Similarly, select rows from the DataFrame. In both APIs, the == operator is used to check whether a value matches.
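Continuing with the same assumed data:

# Pandas API on Spark: boolean indexing, as in pandas.
print(df[df['Courses'] == 'Spark'].head())

# PySpark: filter() (or its alias where()) with a column condition.
sdf.filter(sdf.Courses == 'Spark').show()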
10.3 Count
The count() method exists in both APIs, but behaves slightly differently: on the pandas API on Spark it returns the number of non-null values per column (as in pandas), whereas in PySpark it returns the total number of rows in the DataFrame.
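A sketch illustrating the difference:

# Pandas API on Spark: non-null counts per column (pandas semantics).
print(df.count())

# PySpark: a single integer, the number of rows.
print(sdf.count())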
10.4 Sort Rows
Sort rows based on two columns.
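A sketch with the assumed Courses and Fee columns:

# Pandas API on Spark: sort_values(), as in pandas.
print(df.sort_values(by=['Courses', 'Fee']).head())

# PySpark: orderBy() (or sort()) with column names.
sdf.orderBy('Courses', 'Fee').show()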
10.5 Rename Columns
Let’s see how to rename columns on pandas using this API vs pyspark rename columns with examples.
10.6 Group By
Let’s do a group by on Courses
and get the count for each group.
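A sketch with the assumed sample data:

# Pandas API on Spark: groupby() with count(), as in pandas.
print(df.groupby(['Courses']).count())

# PySpark: groupBy() followed by count().
sdf.groupBy('Courses').count().show()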
10.7 Access CSV File
Finally, let’s read a CSV file in both frameworks. For details examples on pandas refer to read CSV file in Pandas.
11. Data Types of Pandas API on Spark vs PySpark
When converting a pandas API on Spark DataFrame from/to a PySpark DataFrame, the data types are automatically cast to the appropriate type.
Note that a pandas API on Spark DataFrame and a pandas DataFrame hold the same data types, so you won't see any difference in types when converting between those two. However, you need to keep a close eye on types when you convert from/to PySpark.
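One way to inspect the mapping, reusing the df object from above:

# pandas-style dtypes on the pandas API on Spark side...
print(df.dtypes)             # e.g. Courses: object, Fee: int64

# ...and the Spark SQL types after conversion.
print(df.to_spark().dtypes)  # e.g. [('Courses', 'string'), ('Fee', 'bigint')]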