Spark is designed to write output as multiple part files in parallel, so there are cases where we need to merge the part files, remove the success/commit marker files, and end up with the content in a single file.
This blog shows you how to write Spark output to a single file.
Using df.coalesce(1), we can write the data to a single file:
result_location = "dbfs:///mnt/datalake/unmesha/output/"
df.coalesce(1).write.format("csv").options(header='true').mode("overwrite").save(result_location)
However, you will still see the _SUCCESS and other marker files in the output directory, and the part file itself gets an auto-generated name.
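To see this, you can list the output directory after the write (using dbutils, which we will also use below); the exact marker files you see depend on your environment:
# list the coalesce(1) output directory: one part-*.csv file plus marker files such as _SUCCESS
for f in dbutils.fs.ls(result_location):
    print(f.name)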
Adding coalesce alone isn't sufficient when you want to write the data to a single file with a specific name. We can achieve this using dbutils:
# write the dataframe as a single part file inside the result directory
result_location = "dbfs:///mnt/datalake/unmesha/output/"
df.coalesce(1).write.format("csv").options(header='true').mode("overwrite").save(result_location)
# locate the generated part-*.csv file inside that directory
files = dbutils.fs.ls(result_location)
csv_file = [x.path for x in files if x.path.endswith(".csv")][0]
# move it one level up under the desired name, then remove the directory
# (which still contains the _SUCCESS and other marker files)
dbutils.fs.mv(csv_file, result_location.rstrip('/') + ".csv")
dbutils.fs.rm(result_location, recurse=True)
The above snippet writes the dataframe output to a single file with a specific name (here output.csv, derived from the result directory name).
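If you need this pattern in more than one notebook, it can be wrapped in a small helper that takes the target file name explicitly. This is a minimal sketch assuming a Databricks environment where dbutils and the dataframe are already available; write_single_csv and the paths in the usage line are illustrative names, not part of any library:
def write_single_csv(df, directory, file_name):
    # hypothetical helper: writes df as one CSV named file_name,
    # using directory as a temporary output location
    directory = directory.rstrip('/')
    df.coalesce(1).write.format("csv").options(header='true').mode("overwrite").save(directory)
    part_file = [x.path for x in dbutils.fs.ls(directory) if x.path.endswith(".csv")][0]
    target = directory.rsplit('/', 1)[0] + '/' + file_name
    dbutils.fs.mv(part_file, target)
    dbutils.fs.rm(directory, recurse=True)
    return target

# example usage (paths and names are illustrative)
write_single_csv(df, "dbfs:///mnt/datalake/unmesha/tmp_output/", "report.csv")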