Thursday 4 April 2019

Pivot a DataFrame

How to pivot data so that a single column with multiple rows becomes multiple columns.
spark-shell --queue=*

To adjust logging level use sc.setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.6.0
Spark context available as sc.
SQL context available as sqlContext.

scala>  val sqlcontext = new org.apache.spark.sql.SQLContext(sc)
sqlcontext: org.apache.spark.sql.SQLContext = org.apache.spark.sql.SQLContext@4f9a8d71  

scala> val BazarDF = Seq(
     | ("Veg", "tomato", 1.99),
     | ("Veg", "potato", 0.45),
     | ("Fruit", "apple", 0.99),
     | ("Fruit", "pineapple", 2.59),
     | ("Fruit", "apple", 1.99)
     | ).toDF("Type", "Item", "Price")
BazarDF: org.apache.spark.sql.DataFrame = [Type: string, Item: string, Price: double]

scala> BazarDF.show()
+-----+---------+-----+
| Type|     Item|Price|
+-----+---------+-----+
|  Veg|   tomato| 1.99|
|  Veg|   potato| 0.45|
|Fruit|    apple| 0.99|
|Fruit|pineapple| 2.59|
|Fruit|    apple| 1.99|
+-----+---------+-----+

A pivot can be thought of as translating rows into columns while applying one or more aggregations.
Let's see how we can achieve this with the DataFrame above. We will pivot the data on the "Item" column, so that each distinct item becomes its own column.
scala> BazarDF.groupBy("Type").pivot("Item").agg(min("Price")).show()
+-----+-----+---------+------+------+
| Type|apple|pineapple|potato|tomato|
+-----+-----+---------+------+------+
|  Veg| null|     null|  0.45|  1.99|
|Fruit| 0.99|     2.59|  null|  null|
+-----+-----+---------+------+------+
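
The pivot call above makes an extra pass over the data to discover the distinct values of the "Item" column. In Spark 1.6 you can also pass those values explicitly as a second argument to pivot, which skips that pass and fixes the column order, and the resulting nulls can then be filled with na.fill. A minimal sketch against the same BazarDF (the value list and the 0.0 fill are just illustrative choices):

// Explicit pivot values avoid the extra distinct-values pass and fix column order.
// min comes from org.apache.spark.sql.functions (spark-shell imports it for you).
import org.apache.spark.sql.functions.min

BazarDF
  .groupBy("Type")
  .pivot("Item", Seq("apple", "pineapple", "potato", "tomato"))
  .agg(min("Price"))
  .na.fill(0.0)   // replace the nulls where a Type has no such Item
  .show()

You can also hand agg more than one aggregation, for example agg(min("Price"), max("Price")), which produces one column per item per aggregate.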

You are done!