Continuous integration and continuous delivery (CI/CD) refers to the process of developing and delivering software in short, frequent cycles through the use of automation pipelines. CI/CD is common in software development and is becoming increasingly necessary in data engineering and data science. By automating the building, testing, and deployment of code, development teams can deliver releases more reliably than with manual processes.
Common tools are available for developing CI/CD pipelines, but implementations and approaches may differ slightly from organization to organization due to unique aspects of each organization's software development lifecycle. This page describes three approaches to CI/CD on Databricks (Databricks Asset Bundles, a production Git folder, and Git with jobs) and the pros and cons of each approach.
For an overview of CI/CD for machine learning projects on Azure Databricks, see How does Databricks support CI/CD for machine learning?.
Databricks Asset Bundles (Recommended)
Databricks Asset Bundles are the recommended approach to CI/CD on Databricks. Use Databricks Asset Bundles to describe Databricks resources such as jobs and pipelines as source files, and bundle them together with other assets to provide an end-to-end definition of a deployable project. These bundles of files can be source controlled, and you can use external CI/CD automation such as GitHub Actions to trigger deployments.
Pros | Cons |
---|---|
|
|
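As a rough illustration of how external CI/CD automation might trigger a bundle deployment, the following Python sketch wraps the Databricks CLI `bundle validate` and `bundle deploy` commands. It assumes the Databricks CLI is installed on the CI runner and already authenticated, and that the script runs from the bundle root. The function names and the `prod` target name are hypothetical.

```python
import subprocess
from typing import List


def bundle_commands(target: str) -> List[List[str]]:
    """Return the CLI invocations that validate and then deploy a bundle."""
    return [
        ["databricks", "bundle", "validate", "--target", target],
        ["databricks", "bundle", "deploy", "--target", target],
    ]


def deploy_bundle(target: str, bundle_dir: str = ".") -> None:
    """Run each command in order from the bundle root, stopping at the first failure."""
    for command in bundle_commands(target):
        # check=True raises CalledProcessError on a non-zero exit code,
        # which in turn fails the CI step that runs this script.
        subprocess.run(command, cwd=bundle_dir, check=True)
```

A CI step that runs on merge could then call `deploy_bundle("prod")`, where `prod` stands in for whichever target the bundle configuration defines.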
Production Git folder
If you are not yet ready to adopt Databricks Asset Bundles, but want your code to be source controlled, you can set up a production Git folder. Then use external CI/CD tools such as GitHub Actions to pull the latest changes into the Git folder on merge. If you do not have access to external CI/CD pipelines, create a scheduled job that pulls the latest changes into a Git folder in the workspace.
Pros | Cons |
---|---|
|
|
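The pull step described above could be sketched in Python against the Repos REST API, which updates a Git folder to the latest commit of a branch through `PATCH /api/2.0/repos/{repo_id}`. This is a minimal sketch using only the standard library. The function names are hypothetical, and the workspace URL, token, and Git folder ID are values you would supply yourself.

```python
import json
import urllib.request


def build_git_folder_update(host: str, token: str, repo_id: int, branch: str) -> urllib.request.Request:
    """Build a PATCH request that moves a Git folder to the latest commit of `branch`."""
    body = json.dumps({"branch": branch}).encode("utf-8")
    return urllib.request.Request(
        url=f"{host.rstrip('/')}/api/2.0/repos/{repo_id}",
        data=body,
        method="PATCH",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


def pull_git_folder(host: str, token: str, repo_id: int, branch: str = "main") -> dict:
    """Send the update request and return the parsed JSON response."""
    request = build_git_folder_update(host, token, repo_id, branch)
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

The same function could run from an external CI/CD tool on merge or from a scheduled job in the workspace. In both cases the token should belong to a service principal rather than a user.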
Git with jobs
If you only need CI/CD for jobs, Git with jobs enables you to configure some job types to use a remote Git repository as the source. When a job run begins, Databricks takes a snapshot commit of the remote repository and ensures that the entire job runs against the same version of the code.
Pros | Cons |
---|---|
|
|
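To make the idea of a remote Git repository as the job source more concrete, the following Python sketch builds a job definition of the kind accepted by the Jobs API, with a `git_source` block and a notebook task whose `source` is `GIT`. The helper name, job name, repository URL, and notebook path are all hypothetical. Compute settings are omitted for brevity and would need to be added for workspaces that require them.

```python
import json
from typing import Any, Dict


def git_job_settings(name: str, git_url: str, branch: str, notebook_path: str) -> Dict[str, Any]:
    """Return job settings whose single notebook task reads its code from a Git branch."""
    return {
        "name": name,
        "git_source": {
            "git_url": git_url,
            "git_provider": "gitHub",
            "git_branch": branch,
        },
        "tasks": [
            {
                "task_key": "main",
                "notebook_task": {
                    # Path is relative to the repository root.
                    "notebook_path": notebook_path,
                    "source": "GIT",
                },
            }
        ],
    }


settings = git_job_settings(
    name="nightly-etl",
    git_url="https://github.com/example-org/example-repo",
    branch="main",
    notebook_path="notebooks/etl",
)
payload = json.dumps(settings, indent=2)  # request body for a job-creation call
```

Because the job references a branch rather than a workspace copy of the code, each run picks up whatever commit is current on that branch when the run begins.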
Other CI/CD recommendations
Regardless of the CI/CD approach that you choose, use service principals for CI/CD. See Service principals for CI/CD.
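One way a pipeline might guard against accidentally running with user credentials is to check for service principal settings before it deploys anything. The sketch below assumes OAuth machine-to-machine authentication configured through the `DATABRICKS_HOST`, `DATABRICKS_CLIENT_ID`, and `DATABRICKS_CLIENT_SECRET` environment variables. Your organization may use a different authentication method, and the helper name is hypothetical.

```python
from typing import List, Mapping

# Environment variables assumed for service principal (OAuth M2M) authentication.
REQUIRED_VARS = ("DATABRICKS_HOST", "DATABRICKS_CLIENT_ID", "DATABRICKS_CLIENT_SECRET")


def missing_service_principal_vars(env: Mapping[str, str]) -> List[str]:
    """Return the names of required variables that are unset or empty in `env`."""
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

A pipeline could call `missing_service_principal_vars(os.environ)` as its first step and stop with an error if the returned list is not empty.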
Databricks also recommends that you use the Databricks Terraform provider to manage your Databricks workspaces and the associated cloud infrastructure.
Related links
For more information on managing the lifecycle of Azure Databricks assets and data, see the following documentation about CI/CD and data pipeline tools.
Area | Use these tools when you want to… |
---|---|
Databricks Asset Bundles | Programmatically define, deploy, and run Azure Databricks jobs, DLT pipelines, and MLOps Stacks by using CI/CD best practices and workflows. |
Databricks Terraform provider | Provision and manage Databricks workspaces and infrastructure using Terraform. |
CI/CD workflows with Git and Databricks Git folders | Use GitHub and Databricks Git folders for source control and CI/CD workflows. |
Authenticate with Azure DevOps on Azure Databricks | Authenticate with Azure DevOps. |
Use a Microsoft Entra service principal to authenticate access to Azure Databricks Git folders | Use a Microsoft Entra service principal to authenticate access to Databricks Git folders. |
Continuous integration and delivery on Azure Databricks using Azure DevOps | Develop a CI/CD pipeline for Azure Databricks that uses Azure DevOps. |
GitHub Actions | Include a GitHub Action developed for Azure Databricks in your CI/CD workflow. |
CI/CD with Jenkins on Azure Databricks | Develop a CI/CD pipeline for Azure Databricks that uses Jenkins. |
Orchestrate Azure Databricks jobs with Apache Airflow | Manage and schedule a data pipeline that uses Apache Airflow. |
Service principals for CI/CD | Use service principals, instead of users, with CI/CD systems. |