How to... set up a Scheduled Task and a Job Cluster in Databricks?
In Databricks, a Scheduled Task refers to a task or job that is set to run automatically at specified intervals or times, without manual intervention. Scheduled tasks are typically part of a Databricks Job, which can contain one or more tasks. These tasks can include executing a notebook, running a Spark job, submitting a Python script, or performing any operation on the Databricks platform.
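The same kind of scheduled job can also be defined programmatically. Below is a minimal, hypothetical sketch using the Databricks Jobs REST API (2.1); the workspace URL, token, notebook path, job name and cluster values are placeholders, not values from this guide.

```python
# Minimal sketch of creating a scheduled job through the Databricks Jobs API (2.1).
# The workspace URL, token, paths, and cluster values below are placeholders.
import requests

host = "https://<your-workspace>.cloud.databricks.com"  # placeholder workspace URL
token = "<personal-access-token>"                        # placeholder PAT

payload = {
    "name": "nightly-example-job",                       # hypothetical job name
    "schedule": {
        "quartz_cron_expression": "0 0 2 * * ?",         # run at 02:00 every day
        "timezone_id": "UTC",
        "pause_status": "UNPAUSED",
    },
    "tasks": [
        {
            "task_key": "main",
            "notebook_task": {"notebook_path": "/Workspace/example/etl"},  # placeholder path
            "new_cluster": {                              # job cluster; detailed in step 4 below
                "spark_version": "13.3.x-scala2.12",
                "node_type_id": "m5.large",               # placeholder node type
                "num_workers": 1,
            },
        }
    ],
}

resp = requests.post(
    f"{host}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {token}"},
    json=payload,
)
print(resp.json())  # returns {"job_id": ...} on success
```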
All scheduled tasks should use job clusters, as they reduce cost and improve performance for two reasons (see the sketch after this list):
- They only run when needed and shut down when the job finishes
- All cluster resources are dedicated to that job, i.e. there is no competition for resources.
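If the job is ever defined through the API rather than the UI, this difference shows up in which cluster field the task carries. The fragments below are a hypothetical sketch with placeholder values only.

```python
# Hypothetical task fragments contrasting the two cluster choices (Jobs API 2.1 fields).

# All-purpose cluster: references a cluster that keeps running (and billing) between jobs.
task_on_all_purpose = {
    "task_key": "main",
    "notebook_task": {"notebook_path": "/Workspace/example/etl"},  # placeholder path
    "existing_cluster_id": "0101-123456-abcdefgh",                 # placeholder cluster id
}

# Job cluster: created when the run starts and terminated when it finishes.
task_on_job_cluster = {
    "task_key": "main",
    "notebook_task": {"notebook_path": "/Workspace/example/etl"},  # placeholder path
    "new_cluster": {
        "spark_version": "13.3.x-scala2.12",  # placeholder runtime version
        "node_type_id": "m5.large",           # placeholder node type
        "num_workers": 1,
    },
}
```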
Walkthrough
1) Open the notebook or script you wish to schedule and click Schedule
2) Give the job a name and set the schedule
3) Select whether the script is held in Git or in the workspace. If it is held in Git you will need the repository URL and branch (see the sketch below)
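When the code lives in Git, the job definition carries a Git source block instead of a workspace path. The sketch below uses Jobs API 2.1 field names with a placeholder repository and branch.

```python
# Hypothetical Git source for a job; the URL, provider and branch are placeholders.
git_source = {
    "git_url": "https://github.com/example-org/example-repo",
    "git_provider": "gitHub",
    "git_branch": "main",
}

# The notebook task then resolves its path relative to the repo root.
task = {
    "task_key": "main",
    "notebook_task": {
        "notebook_path": "notebooks/etl",  # path inside the repo (placeholder)
        "source": "GIT",
    },
}
```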
4) Click the pencil icon on the cluster to set up the cluster settings. Each setup will need to be tailored to the task at hand, but here are a few pointers (a cluster spec sketch follows this list):
- If Photon is not needed, untick it
- Use the smallest worker type possible to reduce cost
- Enable both autoscaling options so that:
- The cluster auto-sizes its worker count depending on load
- Any overspill from memory is caught to disk, with local storage autoscaling as needed
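Assuming the job cluster is ever specified via the API instead of the UI, the pointers above roughly map to the fields sketched below; the runtime version, node type and worker counts are placeholders to adapt per task.

```python
# Hypothetical job cluster spec illustrating the pointers above (Jobs/Clusters API fields).
job_cluster_spec = {
    "spark_version": "13.3.x-scala2.12",   # placeholder runtime version
    "runtime_engine": "STANDARD",          # "PHOTON" only if Photon is actually needed
    "node_type_id": "m5.large",            # smallest worker type that fits the task (placeholder)
    "autoscale": {                         # cluster auto-sizes between these bounds
        "min_workers": 1,
        "max_workers": 4,
    },
    "enable_elastic_disk": True,           # autoscaling local storage: adds disk for spill
}
```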
5) Populate the tags so that tracking of job details is enabled. The tags we require are (see the tag sketch after this list):
- Env (Environment)
- Project
- Client
- map-migrated (needs to be set to comm51EWWBPV67)
- Owner
- Billing (this is the billing code provided by finance)
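On a cluster these tags sit in the custom_tags field. The sketch below uses placeholder values throughout, except for the map-migrated value given above.

```python
# Hypothetical custom_tags block carrying the required tags; all values except
# map-migrated are placeholders to be filled in per project.
custom_tags = {
    "Env": "dev",
    "Project": "example-project",
    "Client": "example-client",
    "map-migrated": "comm51EWWBPV67",   # fixed value from this guide
    "Owner": "jane.doe@example.com",
    "Billing": "0000",                  # billing code provided by finance
}
```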
6) Once everything is OK, click Confirm on the cluster tab, then Create on the Schedule tab.
7) You can then see the job details on the Workflows tab
- This is also where you check whether the job is running and review any issues that arise from a run (the same information can be pulled via the API, as sketched below)
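For completeness, the run status shown in the Workflows tab can also be listed programmatically; this is a hypothetical sketch with a placeholder host, token and job id.

```python
# Hypothetical sketch: listing recent runs for a job via the Databricks Jobs API (2.1).
import requests

host = "https://<your-workspace>.cloud.databricks.com"  # placeholder workspace URL
token = "<personal-access-token>"                        # placeholder PAT

resp = requests.get(
    f"{host}/api/2.1/jobs/runs/list",
    headers={"Authorization": f"Bearer {token}"},
    params={"job_id": 123456789, "limit": 5},            # placeholder job id
)
for run in resp.json().get("runs", []):
    # life_cycle_state: PENDING / RUNNING / TERMINATED; result_state: SUCCESS / FAILED
    print(run["run_id"], run["state"].get("life_cycle_state"), run["state"].get("result_state"))
```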
See Also: