Azure DevOps CI/CD for Data Platform-Azure Databricks

ISHMEET KAUR
7 min read · Jan 30, 2021

To get the maximum value from data products, they should be delivered in a timely manner, and consumers should have confidence in the validity of their outcomes. By automating the building, testing, and deployment of code, development teams can deliver releases more frequently and more reliably than with manual processes.

So, in short, the definition of CI/CD is:

“Continuous integration is the practice of testing each change made to your codebase automatically and as early as possible.

Continuous delivery follows the testing that happens during continuous integration and pushes changes to a staging or production system.”

In the case of an Azure Databricks workspace deployment, the steps below are performed as part of CI/CD.

Continuous integration:

  1. Develop code and unit tests in an Azure Databricks notebook or an external IDE, and integrate them with an Azure DevOps repository.

a. Launch your workspace from the portal

b. Select User Settings from the user profile on the right. In User Settings, select Git Integration. Under Git Integration, select Azure DevOps Services as the Git provider.

c. Go to Home and create a new notebook in the workspace.

d. Create a new project in Azure DevOps and initialize the repository (if you wish to create a new one; otherwise, an existing project/repo can also be used).

e. Create a new branch named featurebranch in the Azure DevOps repository. Copy its HTTPS link to paste into the Git preferences in Azure Databricks.

Copy the repo link, paste it into the notebook's Git preferences, and select the branch.

Perform development work in the feature branch and save it to the Azure DevOps repository.

2. Manually run tests

After completing development work in the feature branch and saving it, create a pull request from the repo in Azure DevOps to merge into the main (master) branch.

After creating the pull request, merge the feature branch into the main branch.
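If you prefer to script this step, the pull request can also be raised from the command line with the Azure DevOps CLI extension. A minimal sketch, where the organization, project, and repository names are placeholders you would replace with your own:

# One-time setup: install the Azure DevOps extension for the Azure CLI
az extension add --name azure-devops
# Create a pull request from featurebranch into main (org/project/repo names are assumptions)
az repos pr create `
  --org https://dev.azure.com/<your organization> `
  --project <your project> `
  --repository <your repository> `
  --source-branch featurebranch `
  --target-branch main `
  --title "Merge feature work into main"

The merge itself can then be completed from the pull request page in Azure DevOps as described above.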

3. Create a build.

Create a variable group from the Library under Pipelines in Azure DevOps. Add the Azure subscription and the Key Vault name where the Azure Databricks secret is shared.
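For reference, the secret is simply a Databricks personal access token stored in Key Vault, which the variable group then links to. A minimal sketch using the Azure CLI (the vault name and the secret name Databricks are assumptions; the secret name must match what the pipeline references later):

# Store the Databricks personal access token in Key Vault (vault name is an assumption)
az keyvault secret set `
  --vault-name <your key vault name> `
  --name Databricks `
  --value <your Databricks personal access token>
# In the Azure DevOps variable group, choose "Link secrets from an Azure key vault as variables"
# and select this secret so the pipeline can reference it as $(Databricks).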

Under Pipelines, create a new pipeline and use the classic editor.

Select Azure Repos Git as the source, choose the repository name, and continue.

Select Empty job from the template selection.

Provide the name of the build artifact and the agent pool with its agent specification.

Add the Azure Databricks variable group that was created in the Library.

In continuous integration triggers, enable CI.

Release: Generate a release artifact by saving the build artifact.

Continuous Delivery:

  1. Create a release pipeline with multiple stages that deploys once a build is triggered on the main branch. Notebook changes are first deployed to the test stage; after testing there, the preprod supervisor approves promotion to preprod. The same happens for production: once testing is complete in preprod, the production supervisor can approve the deployment to production.

a. Click the Add artifact button and then select your build pipeline, which will show that it last created an artifact called notebooks.

b. Click the lightning icon next to the artifact to enable continuous deployment.

c. Click Variables on the menu and add in the variable group so that your pipeline can find the secret we set up earlier.

d. For the test stage, add an agent job that runs a PowerShell script with the script details below.

Configure the PowerShell task to use inline code and paste in the code below:

# Upload a notebook to Azure Databricks
# Docs at https://docs.microsoft.com/en-us/azure/databricks/dev-tools/api/latest/workspace#--import
$fileName = "$(System.DefaultWorkingDirectory)/<path to file in artifact>/<filename>.py"
$newNotebookName = "<PhasebasedNotebookName>"
# Get our secret from the Key Vault-backed variable group
$Secret = "Bearer " + "$(Databricks)"
# Set the URI of the workspace and the API endpoint
$Uri = "https://<your region>.azuredatabricks.net/api/2.0/workspace/import"
# Read the notebook file and base64-encode it for the import API
$BinaryContents = [System.IO.File]::ReadAllBytes($fileName)
$EncodedContents = [System.Convert]::ToBase64String($BinaryContents)
# Build the request body expected by the workspace import endpoint
$Body = @{
    content   = "$EncodedContents"
    language  = "PYTHON"
    overwrite = $true
    format    = "SOURCE"
    path      = "/Users/<your user>/" + "$newNotebookName"
}
# Convert body to JSON
$BodyText = $Body | ConvertTo-Json
$headers = @{
    Authorization = $Secret
}
# Call the API, sending the body explicitly as JSON
Invoke-RestMethod -Uri $Uri -Method Post -Headers $headers -Body $BodyText -ContentType "application/json"

Change <path to file in artifact>/<filename>.py to your path inside the artifact. If you want a different notebook name in another stage, change <PhasebasedNotebookName> for the test/preprod/prod phase. Add the Databricks URL (https://<your region>.azuredatabricks.net) that you will use in test/preprod/prod.

Change <your user> to your user ID in Azure Databricks. The variable set from Key Vault will automatically be downloaded, so you don't need to do anything to use it; just reference it by name. Make sure your URI is correct for your workspace; you can see it on the Overview pane in the Azure Portal when looking at the workspace.
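To avoid hand-editing the inline script for every stage, those hard-coded values can instead come from release pipeline variables scoped to each stage. A minimal sketch, assuming you define DatabricksUrl, NotebookPath, and NotebookName as stage-scoped variables (these variable names are assumptions, not part of the original script):

# Stage-scoped release variables (assumed names) replace the hard-coded values
$fileName = "$(System.DefaultWorkingDirectory)/$(NotebookPath)"
$newNotebookName = "$(NotebookName)"
$Uri = "$(DatabricksUrl)/api/2.0/workspace/import"
# The secret is still pulled from the Key Vault-backed variable group
$Secret = "Bearer " + "$(Databricks)"

With that in place, the same inline script can be reused across the test, preprod, and prod stages by changing only the stage variables.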

This script could be extended to handle multiple files using a foreach loop over a folder, as sketched below. Using a different workspace URI or a different token would deploy to different workspaces; you'll probably want one for testing, one for QA or preprod, and one for production.
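As an example of that extension, here is a minimal sketch that loops over every .py file in a folder of the artifact and imports each one as a notebook (the folder path, region, and user are placeholders as before):

# Import every .py file in the artifact folder as a notebook (folder path is an assumption)
$notebookFolder = "$(System.DefaultWorkingDirectory)/<path to folder in artifact>"
$Secret = "Bearer " + "$(Databricks)"
$Uri = "https://<your region>.azuredatabricks.net/api/2.0/workspace/import"
$headers = @{ Authorization = $Secret }
foreach ($file in Get-ChildItem -Path $notebookFolder -Filter *.py) {
    # Base64-encode each notebook and import it under the user's folder, named after the file
    $EncodedContents = [System.Convert]::ToBase64String([System.IO.File]::ReadAllBytes($file.FullName))
    $Body = @{
        content   = "$EncodedContents"
        language  = "PYTHON"
        overwrite = $true
        format    = "SOURCE"
        path      = "/Users/<your user>/" + $file.BaseName
    } | ConvertTo-Json
    Invoke-RestMethod -Uri $Uri -Method Post -Headers $headers -Body $Body -ContentType "application/json"
}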

e. Finally, click Create Release on the menu. In the future this won't be necessary since we set up the trigger, but since we won't have another build to start that trigger, we need to start this one manually. Just click Create, and your tasks will run and deploy the notebook to your workspace using the $newNotebookName variable as the name. Since we only have one workspace in the demo, that's where the notebook will go.

f. The same set of steps is repeated when creating the preprod and prod stages, taking care to use the correct filename, the Azure Databricks URL, the notebook path, and the username linked with Azure Databricks.

g. Moreover, in a production scenario, your code should be placed into libraries and deployed to the workspace, with notebooks calling those libraries of tested code. The notebook in that scenario simply acts as a placeholder for the job's parameters. I showed a single workspace here, but you should end up with multiple workspaces, each acting as the trigger for the next in the release pipeline.
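As a rough illustration of that library-based approach, the build could package the tested code as a wheel and the release could copy it to DBFS with the DBFS API before the placeholder notebook runs. A minimal sketch (the wheel name and DBFS path are assumptions; note that the inline contents parameter of dbfs/put is limited to about 1 MB, so larger files need the streaming create/add-block/close endpoints):

# Upload a built wheel of tested code to DBFS (wheel name and DBFS path are assumptions)
# Docs at https://docs.microsoft.com/en-us/azure/databricks/dev-tools/api/latest/dbfs#--put
$wheelPath = "$(System.DefaultWorkingDirectory)/<path to wheel in artifact>/mylib-0.1-py3-none-any.whl"
$Secret = "Bearer " + "$(Databricks)"
$Uri = "https://<your region>.azuredatabricks.net/api/2.0/dbfs/put"
$Body = @{
    path      = "/FileStore/libraries/mylib-0.1-py3-none-any.whl"
    contents  = [System.Convert]::ToBase64String([System.IO.File]::ReadAllBytes($wheelPath))
    overwrite = $true
} | ConvertTo-Json
Invoke-RestMethod -Uri $Uri -Method Post -Headers @{ Authorization = $Secret } -Body $Body -ContentType "application/json"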

h. If you want more control when deploying to business stages, you can enable pre-deployment approvals, so that before deploying to the preprod environment the pipeline asks designated users for approval. The same feature can be enabled for any other stage.
