How to... export to a flat file?
There are, generally speaking, two ways to handle exporting data to a file in Databricks:
- The default is to use Spark, which writes a folder containing part files alongside metadata files that enable versioning and other Spark features.
  - Depending on the size of your dataset, the folder will contain one or more part files.
- Alternatively, use a different engine, such as Pandas, to output a single file.
  - This is only recommended for small datasets, as the whole DataFrame must fit into memory on a single worker node.
The examples below assume saving to CSV; for other formats, change the writer method from `.csv`, along with any other options Spark makes available. See here for reference
Default Spark export
#Load the table into a dataframe
df = spark.read.table('table_name')
#Set the output file path (volume paths start with /Volumes)
output_file_path = '/Volumes/catalog_name/schema_name/folder/file.csv'
#Write to a folder of CSV part files using Spark; header=True includes column names
df.write.csv(output_file_path, header=True)
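Because the Spark writer produces a folder of part files rather than a single CSV, you may later want to stitch downloaded parts back together. Below is a minimal plain-Python sketch; `combine_part_files` is a hypothetical helper (not part of Databricks or Spark) and assumes every part file shares the same schema:

```python
import glob
import os

def combine_part_files(folder, output_path, has_header=True):
    """Concatenate Spark CSV part files from `folder` into one CSV.

    Assumes all parts share the same schema; when has_header=True,
    the header line of every part after the first is skipped so it
    appears only once in the combined output.
    """
    part_paths = sorted(glob.glob(os.path.join(folder, 'part-*.csv')))
    with open(output_path, 'w', encoding='utf-8') as out:
        for i, path in enumerate(part_paths):
            with open(path, encoding='utf-8') as part:
                lines = part.readlines()
                if has_header and i > 0:
                    lines = lines[1:]
                out.writelines(lines)
```

Sorting the part paths keeps the rows in Spark's part-file order, though Spark itself makes no ordering guarantee across partitions.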
If you have a small dataset and would like to save it to a single CSV file, there are multiple ways to do this:
- Using the Databricks UI: click the 'export all' button at the bottom left of any table displayed in a notebook.
  - The same can be done for small tables from the Catalog Explorer.
- Using Pandas to export to a single CSV file.
Below is an example of using Pandas to export to a single file.
Pandas Export
#Load the table into a dataframe
df = spark.read.table('table_name')
#Set the output file path (volume paths start with /Volumes)
output_file_path = '/Volumes/catalog/schema/folder/file.csv'
#Convert the Spark dataframe to a Pandas dataframe
pd_df = df.toPandas()
#Save the Pandas dataframe; Pandas writes an index column by default, hence index=False
pd_df.to_csv(output_file_path, index=False)
Converting a large DataFrame to a Pandas DataFrame with the df.toPandas() method is not recommended, as it loads all the data into a single worker's memory; only use it for small DataFrames.
For other options when using Pandas see https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html
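To see exactly what index=False changes, here is a small self-contained sketch that writes to an in-memory buffer instead of a file (the DataFrame contents are made up for illustration):

```python
import io
import pandas as pd

# Illustrative DataFrame (hypothetical data)
pd_df = pd.DataFrame({'id': [1, 2], 'value': ['a', 'b']})

# Default behaviour: Pandas writes its row index as the first column
with_index = io.StringIO()
pd_df.to_csv(with_index)
print(with_index.getvalue())
# ,id,value
# 0,1,a
# 1,2,b

# With index=False the output contains only the data columns
without_index = io.StringIO()
pd_df.to_csv(without_index, index=False)
print(without_index.getvalue())
# id,value
# 1,a
# 2,b
```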
Saving large files as a single CSV
We recommend one of two options for saving a large dataset to a single CSV file:
- Use Spark's coalesce(1) method to output a single part file, then rename and download that file.
- Use a custom function that performs the steps above, as demonstrated below:
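The coalesce(1) route can be sketched as follows. The commented Spark calls assume a Databricks notebook (with `spark` and `dbutils` in scope), and `locate_part_file` is a hypothetical helper, but the `part-*.csv` naming pattern is standard Spark output:

```python
def locate_part_file(file_names):
    """Return the single Spark part file (part-*.csv) from a folder listing.

    Spark names its output files like 'part-00000-<uuid>.csv'; after
    coalesce(1) there should be exactly one such file in the folder.
    """
    parts = [n for n in file_names if n.startswith('part-') and n.endswith('.csv')]
    if len(parts) != 1:
        raise ValueError(f'expected exactly one part file, found {len(parts)}')
    return parts[0]

# In a Databricks notebook the workflow could look like
# (paths and names here are assumptions for illustration):
#
# tmp_folder = '/Volumes/catalog_name/schema_name/folder/tmp_export'
# df.coalesce(1).write.csv(tmp_folder, header=True)
# part_name = locate_part_file([f.name for f in dbutils.fs.ls(tmp_folder)])
# dbutils.fs.mv(tmp_folder + '/' + part_name,
#               '/Volumes/catalog_name/schema_name/folder/file.csv')
# dbutils.fs.rm(tmp_folder, recurse=True)
```

Note that coalesce(1) forces all the data through a single task, so it carries the same single-node memory pressure caveat as the Pandas approach.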
Firstly, import the function from its definition notebook (this must be in a cell on its own):
%run /Workspace/production_team/Functions/_exporting
Secondly, run the function:
#Using the custom function from the Production Team Library to export a file
output_file_path = '/Volumes/path_to_file'
export_dataframe_to_single_csv_databricks(df=df, output_path=output_file_path)
See Also: