Thursday 4 April 2019

Pivot a DataFrame

How to pivot data to create multiple columns out of one column with multiple rows.
spark-shell --queue= *;

To adjust logging level use sc.setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.6.0
Spark context available as sc 
SQL context available as sqlContext.

scala>  val sqlcontext = new org.apache.spark.sql.SQLContext(sc)
sqlcontext: org.apache.spark.sql.SQLContext = org.apache.spark.sql.SQLContext@4f9a8d71  

scala> val BazarDF = Seq(
     | ("Veg", "tomato", 1.99),
     | ("Veg", "potato", 0.45),
     | ("Fruit", "apple", 0.99),
     | ("Fruit", "pineapple", 2.59),
     | ("Fruit", "apple", 1.99)
     | ).toDF("Type", "Item", "Price")
BazarDF: org.apache.spark.sql.DataFrame = [Type: string, Item: string, Price: double]

scala> BazarDF.show()
+-----+---------+-----+
| Type|     Item|Price|
+-----+---------+-----+
|  Veg|   tomato| 1.99|
|  Veg|   potato| 0.45|
|Fruit|    apple| 0.99|
|Fruit|pineapple| 2.59|
|Fruit|    apple| 1.99|
+-----+---------+-----+

A pivot can be thought of as translating rows into columns while applying one or more aggregations.
Let's see how we can achieve this using the above DataFrame.
We will pivot the data based on the "Item" column.

scala> BazarDF.groupBy("Type").pivot("Item").agg(min("Price")).show()
+-----+-----+---------+------+------+
| Type|apple|pineapple|potato|tomato|
+-----+-----+---------+------+------+
|  Veg| null|     null|  0.45|  1.99|
|Fruit| 0.99|     2.59|  null|  null|
+-----+-----+---------+------+------+
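To see what the pivot is doing under the hood, here is a plain-Scala sketch (no Spark needed; the names are illustrative): it groups the same rows by Type, turns each distinct Item into a column, and aggregates each cell with min, leaving None where no matching row exists.

```scala
// Plain-Scala sketch of pivot: group rows by Type, make one column per
// distinct Item, and aggregate each cell with min (None = null cell).
val rows = List(
  ("Veg", "tomato", 1.99), ("Veg", "potato", 0.45),
  ("Fruit", "apple", 0.99), ("Fruit", "pineapple", 2.59),
  ("Fruit", "apple", 1.99))

val items = rows.map(_._2).distinct.sorted   // the pivoted column names

val pivoted: Map[String, Map[String, Option[Double]]] =
  rows.groupBy(_._1).map { case (typ, rs) =>
    typ -> items.map { item =>
      val prices = rs.collect { case (_, i, p) if i == item => p }
      item -> prices.reduceOption(_ min _)   // min aggregation per cell
    }.toMap
  }
```

Note how "apple" appears twice under Fruit, so the min aggregation picks 0.99, matching the pivoted table above.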

You are done!





Sunday 3 March 2019

How to select multiple columns from a spark data frame using List[Column]


Let us create an example DataFrame to explain how to select a list of columns of type Column from a DataFrame.

spark-shell --queue= *;

To adjust logging level use sc.setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.6.0
Spark context available as sc 
SQL context available as sqlContext.

scala>  val sqlcontext = new org.apache.spark.sql.SQLContext(sc)
sqlcontext: org.apache.spark.sql.SQLContext = org.apache.spark.sql.SQLContext@4f9a8d71  

scala> val BazarDF = Seq(
     | ("Veg", "tomato", 1.99),
     | ("Veg", "potato", 0.45),
     | ("Fruit", "apple", 0.99),
     | ("Fruit", "pineapple", 2.59),
     | ("Fruit", "apple", 1.99)
     | ).toDF("Type", "Item", "Price")
BazarDF: org.apache.spark.sql.DataFrame = [Type: string, Item: string, Price: double]

scala> BazarDF.show()
+-----+---------+-----+
| Type|     Item|Price|
+-----+---------+-----+
|  Veg|   tomato| 1.99|
|  Veg|   potato| 0.45|
|Fruit|    apple| 0.99|
|Fruit|pineapple| 2.59|
|Fruit|    apple| 1.99|
+-----+---------+-----+

Create a List[Column] with column names.

scala> var selectExpr : List[Column] = List("Type","Item","Price")
<console>:25: error: not found: type Column
         var selectExpr : List[Column] = List("Type","Item","Price")
                               ^

The error occurs because the Column type is not in scope, and plain strings are not Columns. Import Column and build the list with col():

scala> import org.apache.spark.sql.Column
import org.apache.spark.sql.Column

scala> import org.apache.spark.sql.functions.col
import org.apache.spark.sql.functions.col

scala> var selectExpr : List[Column] = List(col("Type"),col("Item"),col("Price"))
selectExpr: List[org.apache.spark.sql.Column] = List(Type, Item, Price)

Using the : _* annotation, select the columns from the DataFrame.

scala> var dfNew = BazarDF.select(selectExpr: _*)
dfNew: org.apache.spark.sql.DataFrame = [Type: string, Item: string, Price: double]

scala> dfNew.show()
+-----+---------+-----+
| Type|     Item|Price|
+-----+---------+-----+
|  Veg|   tomato| 1.99|
|  Veg|   potato| 0.45|
|Fruit|    apple| 0.99|
|Fruit|pineapple| 2.59|
|Fruit|    apple| 1.99|
+-----+---------+-----+
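The : _* in select(selectExpr: _*) is not Spark-specific; it is plain Scala varargs expansion. A minimal sketch (the function name here is made up for illustration):

```scala
// `xs: _*` expands a sequence into a varargs parameter list.
def joinAll(cols: String*): String = cols.mkString(", ")

val selectExpr = List("Type", "Item", "Price")

// Same as calling joinAll("Type", "Item", "Price")
val expanded = joinAll(selectExpr: _*)
```

Because select accepts Column*, the same annotation lets a List[Column] be passed where individual Column arguments are expected.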


Tuesday 12 February 2019

How to select multiple columns from a spark data frame using List[String]


Let's see how to select multiple columns from a Spark DataFrame.
Create an example DataFrame:
spark-shell --queue= *;

To adjust logging level use sc.setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.6.0
Spark context available as sc 
SQL context available as sqlContext.

scala>  val sqlcontext = new org.apache.spark.sql.SQLContext(sc)
sqlcontext: org.apache.spark.sql.SQLContext = org.apache.spark.sql.SQLContext@4f9a8d71  

scala> val BazarDF = Seq(
     | ("Veg", "tomato", 1.99),
     | ("Veg", "potato", 0.45),
     | ("Fruit", "apple", 0.99),
     | ("Fruit", "pineapple", 2.59),
     | ("Fruit", "apple", 1.99)
     | ).toDF("Type", "Item", "Price")
BazarDF: org.apache.spark.sql.DataFrame = [Type: string, Item: string, Price: double]

scala> BazarDF.show()
+-----+---------+-----+
| Type|     Item|Price|
+-----+---------+-----+
|  Veg|   tomato| 1.99|
|  Veg|   potato| 0.45|
|Fruit|    apple| 0.99|
|Fruit|pineapple| 2.59|
|Fruit|    apple| 1.99|
+-----+---------+-----+

Now our example DataFrame is ready.
Create a List[String] with column names.
scala> var selectExpr : List[String] = List("Type","Item","Price")
selectExpr: List[String] = List(Type, Item, Price)

Now our list of column names is also created.
Let's select these columns from our DataFrame.
Use .head and .tail to pass every value in the List to select.

scala> var dfNew = BazarDF.select(selectExpr.head,selectExpr.tail: _*)
dfNew: org.apache.spark.sql.DataFrame = [Type: string, Item: string, Price: double]

scala> dfNew.show()
+-----+---------+-----+
| Type|     Item|Price|
+-----+---------+-----+
|  Veg|   tomato| 1.99|
|  Veg|   potato| 0.45|
|Fruit|    apple| 0.99|
|Fruit|pineapple| 2.59|
|Fruit|    apple| 1.99|
+-----+---------+-----+
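The .head/.tail split is needed because the string version of select has the shape select(col: String, cols: String*): one explicit first name, then varargs for the rest. A plain-Scala sketch of the same pattern (the function name is illustrative):

```scala
// Mimics select(col: String, cols: String*): one required first argument,
// then the remaining names expanded from the list as varargs.
def selectLike(first: String, rest: String*): List[String] = first :: rest.toList

val selectExpr = List("Type", "Item", "Price")

// head supplies the required first argument; tail: _* expands the rest.
val chosen = selectLike(selectExpr.head, selectExpr.tail: _*)
```

This is why select(selectExpr: _*) alone will not compile against the String overload, while the head/tail form does.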

I will also explain how to select multiple columns from a Spark DataFrame using List[Column] in the next post.

Wednesday 2 January 2019

How to add Scalastyle plugin in eclipse


Scalastyle is a style checker for Scala. It checks your Scala code against a number of configurable rules, and marks the code which violates these rules with warning or error markers in your source code.


Let's add the Scalastyle plugin in 4 steps.

  1. Install Scalastyle plugin
  2. Add Scalastyle Nature to your project
  3. Set up a configuration for Scalastyle
  4. Enable Scalastyle for that project


  1. Install Scalastyle plugin
    1. From the Eclipse Marketplace, install Scala Style 0.9.0.
    2. Accept the terms and conditions. Scalastyle is now installed in your Eclipse.

  2. Add Scalastyle Nature to your project
               Right click the project → ScalaStyle → Add Scalastyle nature

  3. Set up a configuration for Scalastyle
    1. Download the configuration rules from scalastyle.org.
    2. Place this file in your project, where Eclipse can access it, e.g. project/src/main/resource/scalastyle_config.xml.
    3. Add the configuration file to the list of available configurations:
                     In Eclipse, Window → Preferences → select the Scalastyle
                     configuration page → add the file path of the configuration file.

  4. Enable Scalastyle for that project
    1. Finally, select the configuration for the project:
                      In project properties → select the configuration in the
                      drop-down list → enable the check box.
    2. Do a project clean and build, and you should see Scalastyle errors appear in the Problems view or in the source code.
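
As a starting point, a trimmed-down scalastyle_config.xml might look like the sketch below. The check class names are from the standard Scalastyle rule set, but treat this as an illustration and prefer the full default file from scalastyle.org.

```xml
<scalastyle>
 <name>Scalastyle standard configuration</name>
 <!-- Flag tab characters in source files -->
 <check level="warning" class="org.scalastyle.file.FileTabChecker" enabled="true"/>
 <!-- Warn on lines longer than the configured maximum -->
 <check level="warning" class="org.scalastyle.file.FileLineLengthChecker" enabled="true">
  <parameters>
   <parameter name="maxLineLength"><![CDATA[160]]></parameter>
  </parameters>
 </check>
</scalastyle>
```

Each rule can be switched on or off with the enabled attribute, and level controls whether a violation shows up as a warning or an error in the Problems view.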