
I am using Spark 1.3.1 (PySpark) and I have generated a table using a SQL query. I now have an object that is a DataFrame. I want to export this DataFrame object (I have called it "table") to a csv file so I can manipulate it and plot the columns. How do I export the DataFrame "table" to a csv file?

Thanks!

If the data frame fits in driver memory and you want to save it to the local file system, you can convert the Spark DataFrame to a local Pandas DataFrame using the toPandas method and then simply use to_csv:

df.toPandas().to_csv('mycsv.csv')

Otherwise you can use spark-csv:

  • Spark 1.3

    df.save('mycsv.csv', 'com.databricks.spark.csv')
    
  • Spark 1.4+

    df.write.format('com.databricks.spark.csv').save('mycsv.csv')
    

  • Spark 2.0+ (you can use the csv data source directly)

    df.write.csv('mycsv.csv')
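A minimal sketch of common writer options on that call (header and mode are standard DataFrameWriter.csv parameters; the values here are illustrative):

    # write a header row and overwrite any existing output directory
    df.write.csv('mycsv.csv', header=True, mode='overwrite')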
If you have Spark dataframes you can use df.write.csv('/tmp/lookatme/') and that will drop a set of csv files in /tmp/lookatme. Using Spark is significantly faster than serializing it in pandas. The only drawback is that you'll end up with a set of csvs instead of a single one, and if the destination tool doesn't know how to concatenate them you will need to do it yourself (a concatenation sketch follows these comments). – Txangel, Jul 19, 2017

What a big deal it is to get a csv out of Spark. Something interesting about that first solution is that to_csv works without needing to import Pandas; .toPandas is part of Spark, so maybe it implicitly imports it. – cardamom, Sep 14, 2017

You should be able to use df.coalesce(1).write.csv('mycsv.csv') if you insist on having a single output file. – MichaelChirico, Feb 21, 2018

@Txangel thanks for your answer. However, when I use that it runs without any error but I can't find any csv created in the target location. Any thoughts? – Rotail, Jun 13, 2019

Using df.write.csv('mycsv.csv') exports the csv to the HDFS environment. How can I get it in my local environment? – Tracy, Jun 10, 2020
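Following up on Txangel's point about concatenating the part files yourself, here is a minimal sketch (the paths /tmp/lookatme/ and /tmp/lookatme.csv are just the example from the comment, assumed to be on the local file system):

    import glob

    # Spark writes one part file per partition into the output directory;
    # stitch them together into a single local csv.
    # Note: if the parts were written with header=True, each part repeats the header.
    with open('/tmp/lookatme.csv', 'w') as out:
        for part in sorted(glob.glob('/tmp/lookatme/part-*')):
            with open(part) as f:
                out.write(f.read())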
    

For Apache Spark 2+, in order to save the dataframe into a single csv file, use the following command:

    query.repartition(1).write.csv("cc_out.csv", sep='|')

Here 1 indicates that I need only one partition for the csv; you can change it according to your requirements.

As indicated here: spark.apache.org/docs/2.2.0/api/python/… it is recommended to use coalesce() instead of repartition() to increase performance ("If you are decreasing the number of partitions in this RDD, consider using coalesce, which can avoid performing a shuffle."). – Seastar, Nov 28, 2018

@Seastar: While coalescing might have advantages in several use cases, your comment does not apply in this special case. If you want to have a .csv in your HDFS (or whatever), you will usually want one file and not dozens of files spread across your cluster (the whole point of doing repartition(1)). You need to shuffle the data for this either way, so coalescing will not help at all in the bigger picture. – Markus, Jun 24, 2020

If you cannot use spark-csv, you can do the following:

    # naive CSV line: breaks if any field contains a comma or newline (see below)
    df.rdd.map(lambda x: ",".join(map(str, x))).coalesce(1).saveAsTextFile("file.csv")
    

If you need to handle strings with linebreaks or commas, that will not work. Use this:

    import csv
    import cStringIO
    def row2csv(row):
        buffer = cStringIO.StringIO()
        writer = csv.writer(buffer)
        writer.writerow([str(s).encode("utf-8") for s in row])
        buffer.seek(0)
        return buffer.read().strip()
    df.rdd.map(row2csv).coalesce(1).saveAsTextFile("file.csv")
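Note that cStringIO is Python 2 only; an equivalent sketch for Python 3 uses io.StringIO and skips the manual encode:

    import csv
    import io

    def row2csv(row):
        # csv.writer takes care of quoting embedded commas and linebreaks
        buffer = io.StringIO()
        writer = csv.writer(buffer)
        writer.writerow([str(s) for s in row])
        return buffer.getvalue().strip()

    df.rdd.map(row2csv).coalesce(1).saveAsTextFile("file.csv")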
    

You need to repartition the DataFrame into a single partition and then define the format, path and other parameters for the file in Unix file-system format, and here you go:

    df.repartition(1).write.format('com.databricks.spark.csv').save("/path/to/file/myfile.csv", header='true')
    

Read more about the repartition function. Read more about the save function.

However, repartition is a costly function and toPandas() is the worst. Try using .coalesce(1) instead of .repartition(1) in the previous syntax for better performance, as in the sketch below.
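A minimal sketch of that substitution (the same spark-csv call as above, with coalesce(1) in place of repartition(1)):

    # coalesce(1) collapses to one partition and can avoid a full shuffle
    df.coalesce(1).write.format('com.databricks.spark.csv').save("/path/to/file/myfile.csv", header='true')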

Read more on the repartition vs coalesce functions.

Using PySpark

The easiest way to write to csv in Spark 3.0+:

    sdf.write.csv("/path/to/csv/data.csv")
    

This can generate multiple files based on the number of partitions in your dataframe. If you want to get it in a single file, use repartition:

    sdf.repartition(1).write.csv("/path/to/csv/data.csv")
    

Using Pandas

If your data is not too large and can be held in local Python memory, then you can make use of pandas too:

    sdf.toPandas().to_csv("/path/to/csv/data.csv", index=False)
    

Using Koalas

    sdf.to_koalas().to_csv("/path/to/csv/data.csv", index=False)
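Since Spark 3.2, Koalas has been merged into PySpark as the pandas API on Spark, so on recent versions an equivalent sketch (using the pandas_api() accessor available in Spark 3.3+) is:

    # pandas-on-Spark frame; to_csv writes through Spark to a directory of part files
    sdf.pandas_api().to_csv("/path/to/csv/data.csv")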
omg I have been looking for something like .repartition(1) for HOURS to write into only 1 csv file, thank you so so so much!!!!!!!!!! – sweetmusicality, Apr 25, 2022
    

How about this (in case you don't want a one-liner)?

    # assumes f is a file object opened earlier, e.g. f = open("out.tsv", "w");
    # "int_column" and "string_column" are placeholder column names
    for row in df.collect():
        d = row.asDict()
        s = "%d\t%s\t%s\n" % (d["int_column"], d["string_column"], d["string_column"])
        f.write(s)
    

f is an opened file descriptor. Also, the separator is a TAB character, but it's easy to change to whatever you want.

I am late to the party, but this will let me rename the file, move it to a desired directory, and delete the unwanted additional directory Spark made:

    import shutil
    import os
    import glob

    path = 'test_write'
    # write single csv
    students.repartition(1).write.csv(path)
    # rename and relocate the csv
    shutil.move(glob.glob(os.getcwd() + '\\' + path + '\\' + r'*.csv')[0],
                os.getcwd() + '\\' + path + '.csv')
    # remove additional directory
    shutil.rmtree(os.getcwd() + '\\' + path)

I used the method with pandas and it gave me horrible performance. In the end it took so long that I stopped it and looked for another method.

If you are looking for a way to write to one csv instead of multiple csvs, this is what you are looking for:

    df.coalesce(1).write.csv("train_dataset_processed", header=True)
    

It reduced processing of my dataset from 2+ hours to 2 minutes.
