How to pivot data: turning multiple rows in one column into multiple columns.
spark-shell --queue=<your-queue>
To adjust logging level use sc.setLogLevel(newLevel).
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 1.6.0
Spark context available as sc
SQL context available as sqlContext.
scala> val sqlcontext = new org.apache.spark.sql.SQLContext(sc)
sqlcontext: org.apache.spark.sql.SQLContext = org.apache.spark.sql.SQLContext@4f9a8d71
scala> val BazarDF = Seq(
| ("Veg", "tomato", 1.99),
| ("Veg", "potato", 0.45),
| ("Fruit", "apple", 0.99),
| ("Fruit", "pineapple", 2.59),
| ("Fruit", "apple", 1.99)
| ).toDF("Type", "Item", "Price")
BazarDF: org.apache.spark.sql.DataFrame = [Type: string, Item: string, Price: double]
scala> BazarDF.show()
+-----+---------+-----+
| Type| Item|Price|
+-----+---------+-----+
| Veg| tomato| 1.99|
| Veg| potato| 0.45|
|Fruit| apple| 0.99|
|Fruit|pineapple| 2.59|
|Fruit| apple| 1.99|
+-----+---------+-----+
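Note that `toDF` on a local `Seq` works in spark-shell because the SQLContext implicits are imported for you. If you adapt this transcript into a standalone application, you need to bring them into scope yourself — a minimal sketch, assuming an existing `sc`:

```scala
import org.apache.spark.sql.SQLContext

val sqlcontext = new SQLContext(sc)
// Brings toDF (and the $"col" syntax) into scope for local collections.
import sqlcontext.implicits._

val BazarDF = Seq(
  ("Veg", "tomato", 1.99)
).toDF("Type", "Item", "Price")
```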
A pivot can be thought of as translating rows into columns while applying one or more aggregations.
Let's see how we can achieve this using the above DataFrame. We will pivot the data on the "Item" column.
scala> BazarDF.groupBy("Type").pivot("Item").agg(min("Price")).show()
+-----+-----+---------+------+------+
| Type|apple|pineapple|potato|tomato|
+-----+-----+---------+------+------+
| Veg| null| null| 0.45| 1.99|
|Fruit| 0.99| 2.59| null| null|
+-----+-----+---------+------+------+
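Two refinements worth knowing. First, `pivot` has an overload that takes the list of values explicitly; this skips the extra pass Spark otherwise makes over the data to discover the distinct values, and fixes the column order. Second, you can apply more than one aggregation, producing one output column per (value, aggregation) pair. A sketch, assuming the same `BazarDF` as above:

```scala
// In spark-shell these are already in scope; in an application, import them.
import org.apache.spark.sql.functions.{min, max}

// Explicit pivot values: no distinct-scan, deterministic column order.
BazarDF.groupBy("Type")
  .pivot("Item", Seq("apple", "pineapple", "potato", "tomato"))
  .agg(min("Price"))
  .show()

// Multiple aggregations: columns like apple_min(Price), apple_max(Price), ...
BazarDF.groupBy("Type")
  .pivot("Item")
  .agg(min("Price"), max("Price"))
  .show()
```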
You're done!