What is... a DataFrame?
A DataFrame is a two-dimensional, tabular data structure commonly used in data analysis and manipulation tasks. It is a core component in libraries like Pandas/Spark (Python) and R for managing and analyzing data. Conceptually, a DataFrame is similar to a spreadsheet or SQL table, where data is arranged in rows and columns.
Load sample data:
%python
# Use the Spark CSV datasource with options specifying:
# - First line of file is a header
# - Automatically infer the schema of the data
data = spark.read.format("csv")
.option("header", "true")
.option("inferSchema", "true")
.load("/databricks-datasets/samples/population-vs-price/data_geo.csv")
data.cache() # Cache data for faster reuse
data = data.dropna() # drop rows with missing values
Loading a DataFrame from a VolumePath:
file_location = "/Volumes/<catalog>/<schema>/<table>/data_geo.csv"
df = (spark.read.option("header", "true" #header record true/false)
.option("delimiter", "," #Define delimiter here)
.csv(file_location)
)
df.cache() # Cache data for faster reuse
df = df.dropna() # drop rows with missing values
Joining multiple DataFrames together, converting col_types and formatting:
from pyspark.sql.functions import col, round
# Read the data into PySpark DataFrames and filter based on the ID range
dataframe_1 = (
spark.table("table_1")
.filter((col("ID") >= start_id) & (col("ID") <= end_id))
)
dataframe_2 = (
spark.table("table_2")
.filter((col("ID") >= start_id) & (col("ID") <= end_id))
)
# Perform the join and transformation
dataframe_3 = (
dataframe_1
.join(
dataframe_2,
dataframe_1.ID == dataframe_2.ID
)
.select(
"table_1.*",
"i_HouseholdSize",
(col("i_HouseholdSize") == 1).cast("int").alias("IVIB_HouseholdSize_1"),
(col("i_HouseholdSize") == 2).cast("int").alias("IVIB_HouseholdSize_2"),
(col("i_HouseholdSize") == 3).cast("int").alias("IVIB_HouseholdSize_3"),
(col("i_HouseholdSize") == 4).cast("int").alias("IVIB_HouseholdSize_4"),
(col("i_HouseholdSize") == 5).cast("int").alias("IVIB_HouseholdSize_5"),
round(col("d_HouseholdSize_1"), 5).alias("IVIC_HouseholdSize_1"),
round(col("d_HouseholdSize_2"), 5).alias("IVIC_HouseholdSize_2"),
round(col("d_HouseholdSize_3"), 5).alias("IVIC_HouseholdSize_3"),
round(col("d_HouseholdSize_4"), 5).alias("IVIC_HouseholdSize_4"),
round(col("d_HouseholdSize_5"), 5).alias("IVIC_HouseholdSize_5")
)
)
See Also: