
PySpark Code with Examples

 


How to Import PySpark Pandas?

# Import pandas API on Spark
import pyspark.pandas as ps


# Import pandas
import pandas as pd

How to Create a pandas DataFrame?
technologies = {
    'Courses':["Spark","PySpark","Hadoop","Python","Pandas","Hadoop","Spark","Python","NA"],
    'Fee':[22000,25000,23000,24000,26000,25000,25000,22000,1500],
    'Duration':['30days','50days','55days','40days','60days','35days','30days','50days','40days'],
    'Discount':[1000,2300,1000,1200,2500,None,1400,1600,0]
}
df = pd.DataFrame(technologies)
print(df)

Run Pandas API DataFrame on PySpark (Spark with Python)

Now take the pandas DataFrame created above and run it on PySpark. To do so, use import pyspark.pandas as ps instead of import pandas as pd, and create the DataFrame with ps.DataFrame().


# Import pyspark.pandas
import pyspark.pandas as ps

# Create pandas API on Spark DataFrame
technologies = {
    'Courses':["Spark","PySpark","Hadoop","Python","Pandas","Hadoop","Spark","Python","NA"],
    'Fee':[22000,25000,23000,24000,26000,25000,25000,22000,1500],
    'Duration':['30days','50days','55days','40days','60days','35days','30days','50days','40days'],
    'Discount':[1000,2300,1000,1200,2500,None,1400,1600,0]
}
df = ps.DataFrame(technologies)
print(df)

# Use groupby() to compute the sum
df2 = df.groupby(['Courses']).sum()
print(df2)

Note that df and df2 here are not Spark DataFrames; they are objects of pyspark.pandas.frame.DataFrame. This DataFrame corresponds logically to a pandas DataFrame while holding a Spark DataFrame internally. In other words, it is a wrapper class that makes a Spark DataFrame behave like a pandas DataFrame.
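You can verify this by checking the object type:


# Verify the object type of the pandas API on Spark DataFrame
print(type(df))

# Output
# <class 'pyspark.pandas.frame.DataFrame'>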

Run this program in an environment that has PySpark installed and you should get the same output as above. I executed the above code in the pyspark shell.


If you try to run df.show() or df.printSchema() you will get errors, because df is not a Spark DataFrame and does not expose Spark DataFrame methods.
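For example, a minimal sketch of what works and what does not (the exact error message may vary by version):


# Spark DataFrame methods are not available on a pandas-on-Spark DataFrame
# df.show()          # AttributeError: 'DataFrame' object has no attribute 'show'
# df.printSchema()   # AttributeError

# Use the pandas-style equivalents instead
print(df.head())
print(df.dtypes)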


9. Convert DataFrame between Pandas, Spark & Pandas API on Spark

After completing all your operations on Spark, you may need to convert the result to a pandas DataFrame for further processing, to return it to a UI, etc. You can convert this pyspark.pandas.frame.DataFrame object to pandas.core.frame.DataFrame (convert Pandas API on Spark to a pandas DataFrame) by using to_pandas().


# Convert Pandas API on Spark to Pandas DataFrame
pdf = df.to_pandas()
print(type(pdf))

# Output
#<class 'pandas.core.frame.DataFrame'>

Note that to_pandas() collects all data from the executors into the Spark driver’s memory; hence, it should only be used when the resulting pandas DataFrame is expected to be small enough to fit in the driver’s memory.
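If the full result might be large, one option is to reduce it on Spark before converting; a minimal sketch:


# Keep at most 100 rows on Spark before collecting to the driver
small_pdf = df.head(100).to_pandas()
print(len(small_pdf))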

To convert pandas.core.frame.DataFrame to pyspark.pandas.frame.DataFrame (convert a pandas DataFrame to a Pandas API on Spark DataFrame), use ps.from_pandas().


# Convert Pandas DataFrame to Pandas API on Spark DataFrame
psdf = ps.from_pandas(pdf)
print(type(psdf))

# Output
# <class 'pyspark.pandas.frame.DataFrame'>

We can also convert a Pandas API on Spark DataFrame into a Spark DataFrame by using to_spark(). It converts the object from pyspark.pandas.frame.DataFrame to pyspark.sql.dataframe.DataFrame.


# Convert Pandas API on Spark DataFrame into a Spark DataFrame
sdf = df.to_spark()
print(type(sdf))
sdf.show()

# Output
#<class 'pyspark.sql.dataframe.DataFrame'>
#+-------+-----+--------+--------+
#|Courses|  Fee|Duration|Discount|
#+-------+-----+--------+--------+
#|  Spark|22000|  30days|  1000.0|
#|PySpark|25000|  50days|  2300.0|
#| Hadoop|23000|  55days|  1000.0|
#| Python|24000|  40days|  1200.0|
#| Pandas|26000|  60days|  2500.0|
#| Hadoop|25000|  35days|    null|
#|  Spark|25000|  30days|  1400.0|
#| Python|22000|  50days|  1600.0|
#|     NA| 1500|  40days|     0.0|
#+-------+-----+--------+--------+

Similarly, use pandas_api() to convert a pyspark.sql.dataframe.DataFrame to a pyspark.pandas.frame.DataFrame.


# Convert a Spark Dataframe into a Pandas API on Spark Dataframe
psdf = sdf.pandas_api()
print(type(psdf))

# (or)
# to_pandas_on_spark() is deprecated; prefer pandas_api()
psdf = sdf.to_pandas_on_spark()
print(type(psdf))

# Output
# <class 'pyspark.pandas.frame.DataFrame'>
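
To summarize the conversion paths covered so far:


# Conversion quick reference
# pandas              -> Pandas API on Spark : ps.from_pandas(pdf)
# Pandas API on Spark -> pandas              : psdf.to_pandas()
# Pandas API on Spark -> Spark               : psdf.to_spark()
# Spark               -> Pandas API on Spark : sdf.pandas_api()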

10. Compare Pandas API on Spark vs PySpark

In this section, let’s look at some operations using the pandas API and compare them with PySpark.

In PySpark, you first need to create a SparkSession; let’s create one using the builder. If you are using Azure Databricks, you don’t have to create a session object, as the Databricks runtime environment provides the spark object by default, similar to the PySpark shell.


# Create SparkSession
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .master("local[1]") \
    .appName("SparkByExamples.com") \
    .getOrCreate()

10.1 Select Columns

Let’s select columns using both approaches and see the difference in syntax.


# Pandas API on Spark
df[["Courses","Fee"]]

# PySpark
sdf.select("Courses","Fee").show()

10.2 Select or Filter Rows

Similarly, select rows from the DataFrame. In both APIs, the == operator is used to check whether a value matches.


# Pandas API on Spark
df2 = df.loc[ (df.Courses == "Python")]

# PySpark
sdf2 = sdf.filter(sdf.Courses == "Python")
sdf2.show()
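
You can also combine conditions in both APIs; a minimal sketch using the same columns:


# Filter on multiple conditions
# Pandas API on Spark
df3 = df.loc[(df.Courses == "Python") & (df.Fee >= 23000)]

# PySpark
sdf3 = sdf.filter((sdf.Courses == "Python") & (sdf.Fee >= 23000))
sdf3.show()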

10.3 Count

The count() method behaves differently in the two APIs: on the Pandas API on Spark, df.count() returns the number of non-null values for each column (as in pandas), while in PySpark, sdf.count() returns the total number of rows in the DataFrame. See the sketch after the examples below.


# Pandas API on Spark
df.count()

# PySpark
sdf.count()
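
To get a single total row count from the Pandas API on Spark, use len(); a minimal sketch:


# Total number of rows from the Pandas API on Spark
print(len(df))      # 9

# PySpark already returns the total row count
print(sdf.count())  # 9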

10.4 Sort Rows

Sort rows based on two columns.


# Pandas API on Spark
df2 = df.sort_values(["Courses", "Fee"])

# PySpark
sdf2 = sdf.sort("Courses", "Fee")
sdf2.show()
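
Both APIs also support descending order; a minimal sketch:


# Sort in descending order
# Pandas API on Spark
df2 = df.sort_values(["Courses", "Fee"], ascending=False)

# PySpark
sdf2 = sdf.sort(sdf.Courses.desc(), sdf.Fee.desc())
sdf2.show()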

10.5 Rename Columns

Let’s see how to rename columns with the Pandas API on Spark vs PySpark, with examples.


# Pandas API on Spark
df2 = df.rename(columns={'Fee': 'Courses_Fee'})

# PySpark
sdf2 = sdf.withColumnRenamed("Fee", "Courses_Fee")
sdf2.show()
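
Renaming several columns at once works similarly; a minimal sketch:


# Rename multiple columns
# Pandas API on Spark
df2 = df.rename(columns={'Fee': 'Courses_Fee', 'Duration': 'Courses_Duration'})

# PySpark
sdf2 = sdf.withColumnRenamed("Fee", "Courses_Fee") \
          .withColumnRenamed("Duration", "Courses_Duration")
sdf2.show()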

10.6 Group By

Let’s do a group by on Courses and get the count for each group.


# Pandas API on Spark
df.groupby(['Courses']).count()

# PySpark
sdf.groupBy("Courses").count().show()
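
To aggregate a specific column per group, a minimal sketch (Fee_Sum is an illustrative alias):


# Sum of Fee per course
# Pandas API on Spark
print(df.groupby(['Courses'])['Fee'].sum())

# PySpark
from pyspark.sql import functions as F
sdf.groupBy("Courses").agg(F.sum("Fee").alias("Fee_Sum")).show()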

10.7 Access CSV File

Finally, let’s read a CSV file in both frameworks. For detailed pandas examples, refer to reading a CSV file in Pandas.


# Pandas API on Spark
pdf = ps.read_csv('/tmp/resources/courses.csv')

# PySpark (header=True treats the first row as column names, matching ps.read_csv)
sdf = spark.read.csv("/tmp/resources/courses.csv", header=True)
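
Writing back to CSV works similarly; a minimal sketch (the output path is a placeholder, and both write a directory of part files since the data is distributed):


# Pandas API on Spark
df.to_csv('/tmp/resources/courses_out')

# PySpark
sdf.write.mode("overwrite").csv("/tmp/resources/courses_out")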

11. Data Types of Pandas API on Spark vs PySpark

When converting a pandas-on-Spark DataFrame to or from a PySpark DataFrame, the data types are automatically cast to the appropriate type.

Note that a pandas API on Spark DataFrame and a pandas DataFrame contain the same data types, so the conversion between the two shows no type differences. However, keep a close eye on types when you convert from/to PySpark.


# Pandas API on Spark
print(df.dtypes)

# Output
#Courses      object
#Fee           int64
#Duration     object
#Discount    float64
#dtype: object

# PySpark
sdf.printSchema()

# Output
#root
# |-- Courses: string (nullable = false)
# |-- Fee: long (nullable = false)
# |-- Duration: string (nullable = false)
# |-- Discount: double (nullable = true)
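
If a conversion does not give you the type you need, you can cast a column explicitly; a minimal sketch:


# Explicitly cast a column
# Pandas API on Spark
df['Discount'] = df['Discount'].astype('float64')
print(df.dtypes)

# PySpark
from pyspark.sql.functions import col
sdf2 = sdf.withColumn("Discount", col("Discount").cast("double"))
sdf2.printSchema()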
