val results = fruits.
  map(fruit => Seq(("aaa", "bbb", fruit)).toDF("aCol","bCol","name")).
  reduce(_.union(_))
results.show()

I believe Steffen Schmitz's answer is the most concise one. Below is a more detailed answer if you need more customization (of field types, etc.):

import org.apache.spark.sql.types.{StructType, StructField, StringType}
import org.apache.spark.sql.Row
//initialize DF
val schema = StructType(
  StructField("aCol", StringType, true) ::
  StructField("bCol", StringType, true) ::
  StructField("name", StringType, true) :: Nil)
var initialDF = spark.createDataFrame(sc.emptyRDD[Row], schema)
//list to iterate through
val fruits = List(
    "apple"
    ,"orange"
    ,"melon"
)
for (x <- fruits) {
  //union returns a new dataset; columns are matched by position
  initialDF = initialDF.union(Seq(("aaa", "bbb", x)).toDF)
}
//initialDF.show()
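
The loop above relies on `union` resolving columns by position, which is why the unnamed `toDF` rows still line up with the schema. A minimal sketch of the same idea without a `var`, using `foldLeft`; it assumes a local SparkSession (in spark-shell, `spark` already exists), and the names `emptyDF` and `result` are only illustrative:

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Assumption: a local session for illustration
val spark = SparkSession.builder().master("local[*]").appName("fold-union").getOrCreate()
import spark.implicits._

val schema = StructType(Seq(
  StructField("aCol", StringType, nullable = true),
  StructField("bCol", StringType, nullable = true),
  StructField("name", StringType, nullable = true)))

// Empty DataFrame that carries the desired column names and types
val emptyDF: DataFrame = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)

val fruits = List("apple", "orange", "melon")

// Each step unions one single-row DataFrame onto the accumulator.
// The row's columns are named _1, _2, _3, but union matches by position,
// so the result keeps the names from emptyDF.
val result: DataFrame = fruits.foldLeft(emptyDF) { (acc, fruit) =>
  acc.union(Seq(("aaa", "bbb", fruit)).toDF())
}

result.show()
```

If the column order of the incoming DataFrames may differ, `unionByName` is the safer call, provided both sides carry the same column names.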

references:

  • How to create an empty DataFrame with a specified schema?
  • https://spark.apache.org/docs/2.0.1/api/java/org/apache/spark/sql/Dataset.html
  • https://docs.databricks.com/spark/latest/faq/append-a-row-to-rdd-or-dataframe.html
    If you have multiple DataFrames, you can use the code below, which is efficient.

    val newDFs = Seq(DF1,DF2,DF3)
    newDFs.reduce(_ union _)
                    How can I keep adding new DataFrames to the Seq using a loop? I would like to do a union at the end, but the DataFrames in my Seq are to be added using a loop. Is it doable?
    – Regressor
                    Jul 2, 2019 at 6:55
                    Why is this efficient? If you are applying a reduce function to a Scala Seq, you are not making use of cluster parallelism or distributed computing at all, right?
    – Borja_042
                    Sep 18, 2019 at 11:38
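
    One possible way to address both comments, as a sketch: collect the per-iteration DataFrames in a mutable buffer inside the loop, then union once at the end. The `reduce` does run on the driver, but it only chains lazy `union` calls into a query plan; the distributed work happens when an action such as `show()` runs. The helper `buildFrame` and the name `parts` are hypothetical, and a local SparkSession is assumed (in spark-shell, `spark` already exists).

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import scala.collection.mutable.ArrayBuffer

// Assumption: a local session for illustration
val spark = SparkSession.builder().master("local[*]").appName("union-in-loop").getOrCreate()
import spark.implicits._

// Hypothetical stand-in for whatever produces one DataFrame per iteration
def buildFrame(name: String): DataFrame =
  Seq(("aaa", "bbb", name)).toDF("aCol", "bCol", "name")

// Collect the DataFrames inside the loop
val parts = ArrayBuffer.empty[DataFrame]
for (fruit <- List("apple", "orange", "melon")) {
  parts += buildFrame(fruit)
}

// reduce only chains lazy union calls on the driver; no data is processed yet
val combined: DataFrame = parts.reduce(_ union _)

// The action triggers the distributed job
combined.show()
```

    Note that `reduce` throws on an empty collection, so a loop that may produce no DataFrames would need `reduceOption` or an explicit empty check.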
    

    You can first create a sequence and then use toDF to create a DataFrame.

    scala> var dseq : Seq[(String,String,String)] = Seq[(String,String,String)]()
    dseq: Seq[(String, String, String)] = List()
    scala> for ( x <- fruits){
         |  dseq = dseq :+ ("aaa","bbb",x)
         | }
    scala> dseq
    res2: Seq[(String, String, String)] = List((aaa,bbb,apple), (aaa,bbb,orange), (aaa,bbb,melon))
    scala> val df = dseq.toDF("aCol","bCol","name")
    df: org.apache.spark.sql.DataFrame = [aCol: string, bCol: string, name: string]
    scala> df.show
    +----+----+------+
    |aCol|bCol|  name|
    +----+----+------+
    | aaa| bbb| apple|
    | aaa| bbb|orange|
    | aaa| bbb| melon|
    +----+----+------+
                    Actually, what I tried was to create a Seq and convert it to a DataFrame. Since I'm iterating through the list of fruits and appending to the same variable, I declared it as a var.
    – Rajat Mishra
                    Apr 19, 2017 at 9:47
                    The OP has used var but he did not actually need it. And, you could have just mapped the fruits into your dseq. The important thing to note here is that your dseq is a List. And then you are appending to this list in your for "loop". The problem with this is that append on List is O(n) making your whole dseq generation O(n^2), which will just kill performance on large data.
    – sarveshseri
                    Apr 19, 2017 at 9:51
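
    The point about append cost can be illustrated without Spark. A minimal sketch, assuming the same three-element `fruits` list: `map` builds the rows in a single pass, and where a loop is genuinely needed, a `ListBuffer` appends in constant time. The names `rowsViaMap` and `rowsViaBuffer` are only illustrative.

```scala
import scala.collection.mutable.ListBuffer

val fruits = List("apple", "orange", "melon")

// One pass, no mutable state: each fruit becomes one row tuple
val rowsViaMap: List[(String, String, String)] =
  fruits.map(fruit => ("aaa", "bbb", fruit))

// If a loop is required, ListBuffer appends in constant time,
// unlike `:+` on an immutable List, which copies the list each time
val buffer = ListBuffer.empty[(String, String, String)]
for (fruit <- fruits) {
  buffer += (("aaa", "bbb", fruit))
}
val rowsViaBuffer: List[(String, String, String)] = buffer.toList
```

    Either result can then be passed to `toDF("aCol", "bCol", "name")` as in the answer above.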
    

    Well... I think your question is a bit misguided.

    As per my limited understanding of what you are trying to do, you should be doing the following:

    val fruits = List(
      "apple",
      "orange",
      "melon"
    )
    val df = fruits
      .map(x => ("aaa", "bbb", x))
      .toDF("aCol", "bCol", "name")
    

    And this should be sufficient.

                    Thanks Sarvesh, but I have to build the union DataFrame in a loop, because there are various operations such as join and withColumn in the loop. I will get the DataFrame from hiveSql in the loop.
    – J.soo
                    Apr 19, 2017 at 8:54
                    "union data-frame in loop": well... just this one statement leaves me unable to answer this question. Why do you need this "union data-frame in loop"? Can you elaborate in your question with more details about "various operation such as join, withColumn in Loop"?
    – sarveshseri
                    Apr 19, 2017 at 9:42
