
PySpark Code with Examples

 


How to Import PySpark Pandas?

# Import pandas API on Spark
import pyspark.pandas as ps


# Import pandas
import pandas as pd

How to Create a pandas DataFrame?
technologies = {
    'Courses':["Spark","PySpark","Hadoop","Python","Pandas","Hadoop","Spark","Python","NA"],
    'Fee':[22000,25000,23000,24000,26000,25000,25000,22000,1500],
    'Duration':['30days','50days','55days','40days','60days','35days','30days','50days','40days'],
    'Discount':[1000,2300,1000,1200,2500,None,1400,1600,0]
}
df = pd.DataFrame(technologies)
print(df)

Run Pandas API DataFrame on PySpark (Spark with Python)

Now take the pandas DataFrame created above and run it on PySpark. To do so, use import pyspark.pandas as ps instead of import pandas as pd, and create the DataFrame with ps.DataFrame().


# Import pyspark.pandas
import pyspark.pandas as ps

# Create pandas API on Spark DataFrame
technologies = {
    'Courses':["Spark","PySpark","Hadoop","Python","Pandas","Hadoop","Spark","Python","NA"],
    'Fee':[22000,25000,23000,24000,26000,25000,25000,22000,1500],
    'Duration':['30days','50days','55days','40days','60days','35days','30days','50days','40days'],
    'Discount':[1000,2300,1000,1200,2500,None,1400,1600,0]
}
df = ps.DataFrame(technologies)
print(df)

# Use groupby() to compute the sum
df2 = df.groupby(['Courses']).sum()
print(df2)

Note that df and df2 here are not Spark DataFrames; they are objects of pyspark.pandas.frame.DataFrame. This DataFrame corresponds logically to a pandas DataFrame while holding a Spark DataFrame internally. In other words, it is a wrapper class that makes a Spark DataFrame behave like a pandas DataFrame.
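You can verify this by checking the object type:


# Verify the object type of the pandas API on Spark DataFrame
print(type(df))

# Output
# <class 'pyspark.pandas.frame.DataFrame'>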

Run this program in an environment that has PySpark installed and you should get the same output as above. I executed the above code in the pyspark shell.


If you try to run df.show() or df.printSchema() you will get errors, because df is not a Spark DataFrame and does not expose Spark DataFrame methods.
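For example, a minimal sketch of what works and what does not (the exact error message may vary by version):


# Spark DataFrame methods are not available on a pandas-on-Spark DataFrame
# df.show()          # AttributeError: 'DataFrame' object has no attribute 'show'
# df.printSchema()   # AttributeError

# Use the pandas-style equivalents instead
print(df.head())
print(df.dtypes)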


9. Convert DataFrame between Pandas, Spark & Pandas API on Spark

After completing all your operations on Spark, you may need to convert the result to a pandas DataFrame for further processing, to return it to a UI, etc. You can convert this pyspark.pandas.frame.DataFrame object to pandas.core.frame.DataFrame (convert Pandas API on Spark to a pandas DataFrame) by using to_pandas().


# Convert Pandas API on Spark to Pandas DataFrame
pdf = df.to_pandas()
print(type(pdf))

# Output
#<class 'pandas.core.frame.DataFrame'>

Note that to_pandas() collects all data from the executors into the Spark driver’s memory; hence, it should only be used when the resulting pandas DataFrame is expected to be small enough to fit in the driver’s memory.
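If the full result might be large, one option is to reduce it on Spark before converting; a minimal sketch:


# Keep at most 100 rows on Spark before collecting to the driver
small_pdf = df.head(100).to_pandas()
print(len(small_pdf))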

To convert pandas.core.frame.DataFrame to pyspark.pandas.frame.DataFrame (convert a pandas DataFrame to a Pandas API on Spark DataFrame), use ps.from_pandas().


# Convert Pandas DataFrame to Pandas API on Spark DataFrame
psdf = ps.from_pandas(pdf)
print(type(psdf))

# Output
# <class 'pyspark.pandas.frame.DataFrame'>

We can also convert a Pandas API on Spark DataFrame into a Spark DataFrame by using to_spark(). It converts the object from pyspark.pandas.frame.DataFrame to pyspark.sql.dataframe.DataFrame.


# Convert Pandas API on Spark DataFrame into a Spark DataFrame
sdf = df.to_spark()
print(type(sdf))
sdf.show()

# Output
#<class 'pyspark.sql.dataframe.DataFrame'>
#+-------+-----+--------+--------+
#|Courses|  Fee|Duration|Discount|
#+-------+-----+--------+--------+
#|  Spark|22000|  30days|  1000.0|
#|PySpark|25000|  50days|  2300.0|
#| Hadoop|23000|  55days|  1000.0|
#| Python|24000|  40days|  1200.0|
#| Pandas|26000|  60days|  2500.0|
#| Hadoop|25000|  35days|    null|
#|  Spark|25000|  30days|  1400.0|
#| Python|22000|  50days|  1600.0|
#|     NA| 1500|  40days|     0.0|
#+-------+-----+--------+--------+

Similarly, use pandas_api() to convert a pyspark.sql.dataframe.DataFrame to a pyspark.pandas.frame.DataFrame.


# Convert a Spark Dataframe into a Pandas API on Spark Dataframe
psdf = sdf.pandas_api()
print(type(psdf))

# (or)
# to_pandas_on_spark() is deprecated; prefer pandas_api()
psdf = sdf.to_pandas_on_spark()
print(type(psdf))

# Output
# <class 'pyspark.pandas.frame.DataFrame'>
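
To summarize the conversion paths covered so far:


# Conversion quick reference
# pandas              -> Pandas API on Spark : ps.from_pandas(pdf)
# Pandas API on Spark -> pandas              : psdf.to_pandas()
# Pandas API on Spark -> Spark               : psdf.to_spark()
# Spark               -> Pandas API on Spark : sdf.pandas_api()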

10. Compare Pandas API on Spark vs PySpark

In this section, let’s look at some operations using the pandas API and compare them with PySpark.

In PySpark, you first need to create a SparkSession; let’s create one using the builder. If you are using Azure Databricks, you don’t have to create a session object, as the Databricks runtime environment provides the spark object by default, similar to the PySpark shell.


# Create SparkSession
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .master("local[1]") \
    .appName("SparkByExamples.com") \
    .getOrCreate()

10.1 Select Columns

Let’s select columns using both approaches and see the difference in syntax.


# Pandas API on Spark
df[["Courses","Fee"]]

# PySpark
sdf.select("Courses","Fee").show()

10.2 Select or Filter Rows

Similarly, select rows from the DataFrame. In both APIs, the == operator is used to check whether a value matches.


# Pandas API on Spark
df2 = df.loc[ (df.Courses == "Python")]

# PySpark
sdf2 = sdf.filter(sdf.Courses == "Python")
sdf2.show()
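
You can also combine conditions in both APIs; a minimal sketch using the same columns:


# Filter on multiple conditions
# Pandas API on Spark
df3 = df.loc[(df.Courses == "Python") & (df.Fee >= 23000)]

# PySpark
sdf3 = sdf.filter((sdf.Courses == "Python") & (sdf.Fee >= 23000))
sdf3.show()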

10.3 Count

The count() method behaves differently in the two APIs: on the Pandas API on Spark, df.count() returns the number of non-null values for each column (as in pandas), while in PySpark, sdf.count() returns the total number of rows in the DataFrame. See the sketch after the examples below.


# Pandas API on Spark
df.count()

# PySpark
sdf.count()
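
To get a single total row count from the Pandas API on Spark, use len(); a minimal sketch:


# Total number of rows from the Pandas API on Spark
print(len(df))      # 9

# PySpark already returns the total row count
print(sdf.count())  # 9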

10.4 Sort Rows

Sort rows based on two columns.


# Pandas API on Spark
df2 = df.sort_values(["Courses", "Fee"])

# PySpark
sdf2 = sdf.sort("Courses", "Fee")
sdf2.show()
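
Both APIs also support descending order; a minimal sketch:


# Sort in descending order
# Pandas API on Spark
df2 = df.sort_values(["Courses", "Fee"], ascending=False)

# PySpark
sdf2 = sdf.sort(sdf.Courses.desc(), sdf.Fee.desc())
sdf2.show()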

10.5 Rename Columns

Let’s see how to rename columns with the Pandas API on Spark vs PySpark, with examples.


# Pandas API on Spark
df2 = df.rename(columns={'Fee': 'Courses_Fee'})

# PySpark
sdf2 = sdf.withColumnRenamed("Fee", "Courses_Fee")
sdf2.show()
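
Renaming several columns at once works similarly; a minimal sketch:


# Rename multiple columns
# Pandas API on Spark
df2 = df.rename(columns={'Fee': 'Courses_Fee', 'Duration': 'Courses_Duration'})

# PySpark
sdf2 = sdf.withColumnRenamed("Fee", "Courses_Fee") \
          .withColumnRenamed("Duration", "Courses_Duration")
sdf2.show()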

10.6 Group By

Let’s do a group by on Courses and get the count for each group.


# Pandas API on Spark
df.groupby(['Courses']).count()

# PySpark
sdf.groupBy("Courses").count().show()
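
To aggregate a specific column per group, a minimal sketch (Fee_Sum is an illustrative alias):


# Sum of Fee per course
# Pandas API on Spark
print(df.groupby(['Courses'])['Fee'].sum())

# PySpark
from pyspark.sql import functions as F
sdf.groupBy("Courses").agg(F.sum("Fee").alias("Fee_Sum")).show()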

10.7 Access CSV File

Finally, let’s read a CSV file in both frameworks. For detailed pandas examples, refer to reading a CSV file in Pandas.


# Pandas API on Spark
pdf = ps.read_csv('/tmp/resources/courses.csv')

# PySpark (header=True treats the first row as column names, matching ps.read_csv)
sdf = spark.read.csv("/tmp/resources/courses.csv", header=True)
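
Writing back to CSV works similarly; a minimal sketch (the output path is a placeholder, and both write a directory of part files since the data is distributed):


# Pandas API on Spark
df.to_csv('/tmp/resources/courses_out')

# PySpark
sdf.write.mode("overwrite").csv("/tmp/resources/courses_out")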

11. Data Types of Pandas API on Spark vs PySpark

When converting a pandas-on-Spark DataFrame to or from a PySpark DataFrame, the data types are automatically cast to the appropriate type.

Note that a pandas API on Spark DataFrame and a pandas DataFrame contain the same data types, so the conversion between the two shows no type differences. However, keep a close eye on types when you convert from/to PySpark.


# Pandas API on Spark
print(df.dtypes)

# Output
#Courses      object
#Fee           int64
#Duration     object
#Discount    float64
#dtype: object

# PySpark
sdf.printSchema()

# Output
#root
# |-- Courses: string (nullable = false)
# |-- Fee: long (nullable = false)
# |-- Duration: string (nullable = false)
# |-- Discount: double (nullable = true)
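
If a conversion does not give you the type you need, you can cast a column explicitly; a minimal sketch:


# Explicitly cast a column
# Pandas API on Spark
df['Discount'] = df['Discount'].astype('float64')
print(df.dtypes)

# PySpark
from pyspark.sql.functions import col
sdf2 = sdf.withColumn("Discount", col("Discount").cast("double"))
sdf2.printSchema()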
