Sunday 3 March 2019

How to select multiple columns from a Spark DataFrame using List[Column]


Let us create an example DataFrame to show how to select a list of columns of type "Column" from a DataFrame.

spark-shell --queue= *;

To adjust logging level use sc.setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.6.0
Spark context available as sc 
SQL context available as sqlContext.

scala>  val sqlcontext = new org.apache.spark.sql.SQLContext(sc)
sqlcontext: org.apache.spark.sql.SQLContext = org.apache.spark.sql.SQLContext@4f9a8d71  

scala> val BazarDF = Seq(
     | ("Veg", "tomato", 1.99),
     | ("Veg", "potato", 0.45),
     | ("Fruit", "apple", 0.99),
     | ("Fruit", "pineapple", 2.59),
     | ("Fruit", "apple", 1.99)
     | ).toDF("Type", "Item", "Price")
BazarDF: org.apache.spark.sql.DataFrame = [Type: string, Item: string, Price: double]

scala> BazarDF.show()
+-----+---------+-----+
| Type|     Item|Price|
+-----+---------+-----+
|  Veg|   tomato| 1.99|
|  Veg|   potato| 0.45|
|Fruit|    apple| 0.99|
|Fruit|pineapple| 2.59|
|Fruit|    apple| 1.99|
+-----+---------+-----+

Create a List[Column] with column names.

scala> var selectExpr : List[Column] = List("Type","Item","Price")
<console>:25: error: not found: type Column
         var selectExpr : List[Column] = List("Type","Item","Price")
                               ^
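
One way to get past this error is sketched below. It assumes the same spark-shell session as above and uses the col function from org.apache.spark.sql.functions to build each element. That choice is an assumption made for illustration, not necessarily what the author had in mind.

```scala
// Run in the same spark-shell session as the transcript above.
import org.apache.spark.sql.Column          // brings the Column type into scope
import org.apache.spark.sql.functions.col   // builds a Column from a column name

// Each element is a Column built from a name, not a plain String.
val selectExpr: List[Column] = List(col("Type"), col("Item"), col("Price"))
```

Importing Column alone would not be enough here, because a list of plain strings does not type-check as List[Column].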

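If you get the same error, please take a look at this page. The error means that the Column type has not been imported (import org.apache.spark.sql.Column). Note also that the list elements must be Column values rather than plain strings.
Once selectExpr is a valid List[Column], use the : _* type ascription to expand the list into the variable-length arguments that select expects.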

scala> var dfNew = BazarDF.select(selectExpr: _*)
dfNew: org.apache.spark.sql.DataFrame = [Type: string, Item: string, Price: double]

scala> dfNew.show()
+-----+---------+-----+
| Type|     Item|Price|
+-----+---------+-----+
|  Veg|   tomato| 1.99|
|  Veg|   potato| 0.45|
|Fruit|    apple| 0.99|
|Fruit|pineapple| 2.59|
|Fruit|    apple| 1.99|
+-----+---------+-----+
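
The same approach works when the column names arrive as plain strings, or when only some of the columns are wanted. The following is a minimal sketch, assuming the BazarDF defined above is still in scope. The names wanted, wantedCols, subsetDF and subsetByName are hypothetical and chosen for illustration.

```scala
// Run in the same spark-shell session, after BazarDF has been created.
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.col

// Hypothetical list of column names to keep.
val wanted: List[String] = List("Item", "Price")

// Turn each name into a Column, then expand the list with : _*.
val wantedCols: List[Column] = wanted.map(name => col(name))
val subsetDF = BazarDF.select(wantedCols: _*)

// Alternative: select also has a (String, String*) overload,
// so the names can be passed directly as head plus expanded tail.
val subsetByName = BazarDF.select(wanted.head, wanted.tail: _*)
```

Both forms keep only the Item and Price columns. The List[Column] form is the more flexible one, since each element can also be an expression or an aliased column.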

