What is... Databricks?
Databricks is an integrated, multi-workload Intelligence Platform that sits on top of our AWS account and Databricks' own Data Lake technology. This lets Sagacity run all of its big data processing, analytics and modelling efficiently from one platform.
Databricks was founded by the creators of Apache Spark, so its platform works as a fully hosted Spark solution. Spark processes big data at scale on clusters, whether for ML training or analytics pipelines. In addition to Spark, the platform comes with a host of other (generally) open-source components:
- Unity Catalog - this replaced the Hive metastore for all things metadata, access control, lineage and data governance.
- MLflow - ML lifecycle management, from experiments to real-time ML serving.
- Delta storage format (Delta Lake) - extends Parquet to leverage cloud storage and enable efficient updates and very fast access.
- SQL Warehouses - these expose Databricks as a SQL engine for analytics.
- Jobs/Workflows - Databricks' own orchestration engine for scheduling and templating.
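As a rough illustration of how a few of these components fit together, the sketch below uses Spark to read a table registered in Unity Catalog, aggregate it, and write the result back as a Delta table. The table names (`main.demo.events`, `main.demo.daily_totals`), the column names (`event_date`, `amount`) and the helper functions are all hypothetical and are not part of our setup; treat this as a minimal sketch, not a recommended pipeline.

```python
def full_table_name(catalog: str, schema: str, table: str) -> str:
    """Build a Unity Catalog three-level name: catalog.schema.table."""
    return f"{catalog}.{schema}.{table}"


def daily_totals(spark, table_name: str):
    """Sum `amount` per `event_date` for a (hypothetical) events table.

    Assumes the table exists and has `event_date` and `amount` columns.
    """
    # Imported lazily so the module can be loaded without PySpark installed.
    from pyspark.sql import functions as F

    events = spark.read.table(table_name)
    return events.groupBy("event_date").agg(F.sum("amount").alias("total_amount"))


def save_as_delta(df, table_name: str) -> None:
    """Write a Spark DataFrame as a managed Delta table, replacing any existing data."""
    df.write.format("delta").mode("overwrite").saveAsTable(table_name)
```

In a Databricks notebook the `spark` session is already defined, so usage might look like `totals = daily_totals(spark, full_table_name("main", "demo", "events"))` followed by `save_as_delta(totals, full_table_name("main", "demo", "daily_totals"))`.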
Whilst Databricks can perform very well as a “Data Warehousing” tool, it can also handle very large-scale streaming, machine learning, processing of unstructured data such as text, audio or images, and BI applications. In summary, it’s a full-blown, multi-faceted data platform for many kinds of use cases and company sizes.
See Also: