How to... export to a flat file?
There are, generally speaking, two ways to handle exporting data to a file in Databricks:
- The default is to use Spark, which writes a folder containing part files alongside metadata files that enable versioning and other Spark features.
  - Depending on the size of your dataset, the folder will contain one or more part files.
- Alternatively, use a different engine, such as Pandas, to output a single file.
  - This is only recommended for small datasets, as the whole DataFrame must fit into memory on a single worker node.
The examples below assume saving to CSV; for other formats, change the writer method from `.csv`, along with any other options Spark makes available. See here for reference
Default Spark export
#Load the table into a dataframe
df = spark.read.table('table_name')
#Set the output file path (volume paths start with /Volumes)
output_file_path = '/Volumes/catalog_name/schema_name/folder/file.csv'
#Write to a folder of CSV part files using Spark; header=True includes column names
df.write.csv(output_file_path, header=True)
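Because the Spark writer produces a folder of part files rather than a single CSV, you may later want to stitch downloaded parts back together. Below is a minimal plain-Python sketch; `combine_part_files` is a hypothetical helper (not part of Databricks or Spark) and assumes every part file shares the same schema:

```python
import glob
import os

def combine_part_files(folder, output_path, has_header=True):
    """Concatenate Spark CSV part files from `folder` into one CSV.

    Assumes all parts share the same schema; when has_header=True,
    the header line of every part after the first is skipped so it
    appears only once in the combined output.
    """
    part_paths = sorted(glob.glob(os.path.join(folder, 'part-*.csv')))
    with open(output_path, 'w', encoding='utf-8') as out:
        for i, path in enumerate(part_paths):
            with open(path, encoding='utf-8') as part:
                lines = part.readlines()
                if has_header and i > 0:
                    lines = lines[1:]
                out.writelines(lines)
```

Sorting the part paths keeps the rows in Spark's part-file order, though Spark itself makes no ordering guarantee across partitions.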
If you have a small dataset and would like to save it to a single CSV file, there are multiple ways to do this:
- Using the Databricks UI: click the 'export all' button at the bottom left of any table displayed in a notebook.
  - The same can be done for small tables from the Catalog Explorer.
- Using Pandas to export to a single CSV file.
Below is an example of using Pandas to export to a single file.
Pandas Export
#Load the table into a dataframe
df = spark.read.table('table_name')
#Set the output file path (volume paths start with /Volumes)
output_file_path = '/Volumes/catalog/schema/folder/file.csv'
#Convert the Spark dataframe to a Pandas dataframe
pd_df = df.toPandas()
#Save the Pandas dataframe; Pandas writes an index column by default, hence index=False
pd_df.to_csv(output_file_path, index=False)
Converting a large DataFrame to a Pandas DataFrame with the df.toPandas() method is not recommended, as it loads all the data into a single worker's memory; only use it for small DataFrames.
For other options when using Pandas see https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html
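To see exactly what index=False changes, here is a small self-contained sketch that writes to an in-memory buffer instead of a file (the DataFrame contents are made up for illustration):

```python
import io
import pandas as pd

# Illustrative DataFrame (hypothetical data)
pd_df = pd.DataFrame({'id': [1, 2], 'value': ['a', 'b']})

# Default behaviour: Pandas writes its row index as the first column
with_index = io.StringIO()
pd_df.to_csv(with_index)
print(with_index.getvalue())
# ,id,value
# 0,1,a
# 1,2,b

# With index=False the output contains only the data columns
without_index = io.StringIO()
pd_df.to_csv(without_index, index=False)
print(without_index.getvalue())
# id,value
# 1,a
# 2,b
```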
Saving large files as a single CSV
We recommend one of two options for saving a large dataset to a single CSV file:
- Use Spark's coalesce(1) method to output a single part file, then rename and download that file.
- Use a custom function that performs the steps above, as demonstrated below:
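The coalesce(1) route can be sketched as follows. The commented Spark calls assume a Databricks notebook (with `spark` and `dbutils` in scope), and `locate_part_file` is a hypothetical helper, but the `part-*.csv` naming pattern is standard Spark output:

```python
def locate_part_file(file_names):
    """Return the single Spark part file (part-*.csv) from a folder listing.

    Spark names its output files like 'part-00000-<uuid>.csv'; after
    coalesce(1) there should be exactly one such file in the folder.
    """
    parts = [n for n in file_names if n.startswith('part-') and n.endswith('.csv')]
    if len(parts) != 1:
        raise ValueError(f'expected exactly one part file, found {len(parts)}')
    return parts[0]

# In a Databricks notebook the workflow could look like
# (paths and names here are assumptions for illustration):
#
# tmp_folder = '/Volumes/catalog_name/schema_name/folder/tmp_export'
# df.coalesce(1).write.csv(tmp_folder, header=True)
# part_name = locate_part_file([f.name for f in dbutils.fs.ls(tmp_folder)])
# dbutils.fs.mv(tmp_folder + '/' + part_name,
#               '/Volumes/catalog_name/schema_name/folder/file.csv')
# dbutils.fs.rm(tmp_folder, recurse=True)
```

Note that coalesce(1) forces all the data through a single task, so it carries the same single-node memory pressure caveat as the Pandas approach.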
Firstly, import the function from its definition notebook (this must be in a cell on its own):
%run /Workspace/production_team/Functions/_exporting
Secondly, run the function:
#Using the custom function from the Production Team Library to export a file
output_file_path = '/Volumes/path_to_file'
export_dataframe_to_single_csv_databricks(df=df, output_path=output_file_path)
See Also: