Friday, 26 October 2018

How to add multiple withColumn to Spark Dataframe




In order to explain, Lets create a dataframe with 3 columns

spark-shell --queue= *;

To adjust logging level use sc.setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.6.0
Spark context available as sc 
SQL context available as sqlContext.

scala>  val sqlcontext = new org.apache.spark.sql.SQLContext(sc)
sqlcontext: org.apache.spark.sql.SQLContext = org.apache.spark.sql.SQLContext@4f9a8d71  
 
scala> import org.apache.spark.sql.Column
scala> val BazarDF = Seq(
        ("Veg", "tomato", 1.99),
        ("Veg", "potato", 0.45),
        ("Fruit", "apple", 0.99),
        ("Fruit", "pineapple", 2.59)
         ).toDF("Type", "Item", "Price")
BazarDF: org.apache.spark.sql.DataFrame = [Type: string, Item: string, Price: double]

Lets see how to add 3 new columns into this dataframe with dummy values.

One way of doing this is

scala> var BazarWithColumnDF = BazarDF.withColumn("Retailer",lit("null").as("StringType"))
          .withColumn("Quantity",lit(0.0).as("DoubleType"))
BazarWithColumnDF: org.apache.spark.sql.DataFrame = 
 [Type: string, Item: string, Price: double, Retailer: string, Quantity: double]

We can use the same method using foldLeft aswell

Now we need to define a list with new columnName to do this.

scala> var ColNameWithDatatype: List[(String, Column)] = List()
ColNameWithDatatype: List[(String, org.apache.spark.sql.Column)] = List()

scala> ColNameWithDatatype = List(("Retailer", lit("null").as("StringType")),
     ("Quantity", lit(0.0).as("DoubleType")))
ColNameWithDatatype: List[(String, org.apache.spark.sql.Column)] = 
 List((Retailer,null AS StringType#91), 
      (Quantity,0.0 AS DoubleType#92))

scala> var BazarWithColumnDF1 = ColNameWithDatatype.foldLeft(BazarDF) 
  { (tempDF, colName) =>
     |       tempDF.withColumn(colName._1, colName._2)
     |     }
BazarWithColumnDF1: org.apache.spark.sql.DataFrame = 
       [Type: string, Item: string, Price: double, 
              Retailer: string, Quantity: double]
scala> BazarWithColumnDF1.show()
+-----+---------+-----+--------+--------+
| Type|     Item|Price|Retailer|Quantity|
+-----+---------+-----+--------+--------+
|  Veg|   tomato| 1.99|    null|     0.0|
|  Veg|   potato| 0.45|    null|     0.0|
|Fruit|    apple| 0.99|    null|     0.0|
|Fruit|pineapple| 2.59|    null|     0.0|
+-----+---------+-----+--------+--------+

6 comments:

  1. Super helpful :) Thanks.

    ReplyDelete
  2. awesome blog you shared with us and its very helpful for my training institute candidates. keep update more techniques on spark online training Hyderabad

    ReplyDelete
  3. You are doing a great job by sharing useful information about apache spark scala course. It is one of the post to read and imporove my knowledge in apache spark scala.apache spark scala training institute in bangalore is also one of the training institute which provides best training in apache spark scala, big data, data science and many other courses.

    ReplyDelete