ML Ops on Azure

Testing CI/CD pipelines for data science using GitHub Actions and Azure’s ML Workspace.

Paul Bruffett
May 21, 2021

Context

For my hobby data science projects I’ve come to like Paperspace Gradient. The machines (containers) start up quickly, the machine types are very affordable on an hourly basis, data directories are pre-mounted for you, and you’re made to configure an auto-shutdown time from the start, avoiding billing surprises. All of that good stuff comes with a downside: for hobbyists, they don’t really support CI/CD; those features appear to be reserved for Enterprise customers. So I set it up using Azure.

Objective

I want a pipeline connected to my GitHub repo that automatically trains my model, logs its performance and, if necessary, packages and deploys it every time I check code into the main branch.

I used the Azure documentation and modified their SKLearn example to be a minimum viable TensorFlow demonstration.

Setup

This will require an Azure subscription with an ML Workspace provisioned. When creating a new resource, look under AI + Machine Learning and select Machine Learning, right there at the top.

All of the required fields relate to naming; name things whatever you like, but remember the name of the Workspace and the resource group you put it in.

Once the Workspace is created, we’ll need credentials for our GitHub Actions to use. These can be generated in the Azure Cloud Shell with the following command:
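Here’s a sketch of that command; exact flags can vary by CLI version, but the --sdk-auth switch is what makes it emit the full JSON credential blob GitHub will need:

```bash
# Create a service principal scoped to the Workspace's resource group.
# --sdk-auth emits the JSON object we'll store as a GitHub secret.
az ad sp create-for-rbac --name <service-principal-name> \
    --role contributor \
    --scopes /subscriptions/<subscription-id>/resourceGroups/<resource-group-name> \
    --sdk-auth
```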

Supply whatever service-principal-name you’d like; you can find your subscription-id under ‘Subscriptions’, and the resource-group-name is the one you selected when creating the Workspace.

GitHub Actions

My repository with the reference implementation I’ll use is here.

Now we’re ready to set up Actions. Create your repository and register the full JSON response from the Cloud Shell command as a repository secret named “AZURE_CREDENTIALS”. This will be used by the deploy scripts.

The primary file for all Actions is the main.yml under .github/workflows.

This file will connect to AML, select a compute target and train the model:
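A trimmed sketch of what that workflow looks like, assuming the Azure-published ML actions (Azure/aml-workspace, Azure/aml-compute, Azure/aml-run); step names and versions here are illustrative, so see my repo for the real file:

```yaml
name: train-model
on:
  push:
    branches:
      - main

jobs:
  train:
    runs-on: ubuntu-latest
    steps:
      # Pull the repository contents onto the runner
      - uses: actions/checkout@v2

      # Connect to the Workspace named in .cloud/.azure/workspace.json
      - name: Connect to AML Workspace
        uses: Azure/aml-workspace@v1
        with:
          azure_credentials: ${{ secrets.AZURE_CREDENTIALS }}

      # Provision the compute target described in .cloud/.azure/compute.json
      - name: Provision compute target
        uses: Azure/aml-compute@v1
        with:
          azure_credentials: ${{ secrets.AZURE_CREDENTIALS }}

      # Submit the training run defined by the configs in code/training
      - name: Submit training run
        uses: Azure/aml-run@v1
        with:
          azure_credentials: ${{ secrets.AZURE_CREDENTIALS }}
```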

This script should work as-is for your use case, because all of the environment-specific variables are contained in separate files. In the full repo you can see additional steps to package and deploy the model; this smaller version serves as a minimum viable product.

This .yml depends on configuration files in the .cloud/.azure directory of your repository, and these depend on your specific environment. The first is “workspace.json”, which contains your AML Workspace name and resource group. Modify these values to match the names you configured; if you don’t, the pipeline will create resources using my naming.
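For example (the key names here follow the workspace action’s parameter schema; check them against the version you use):

```json
{
    "name": "<your-workspace-name>",
    "resource_group": "<your-resource-group>"
}
```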

Compute

Next we require a compute target. You can create a compute cluster yourself by logging into the Azure ML Workspace, selecting “Compute” from the left sidebar and then “Compute Clusters”, or leave the defaults in my repo and it will create a cluster named “pbml” that auto-scales between 0 and 4 non-GPU nodes. This configuration is contained in “compute.json”.
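A sketch of that file; the key names are illustrative of the compute action’s schema and the VM size is an assumption (any non-GPU size works):

```json
{
    "name": "pbml",
    "compute_type": "amlcluster",
    "vm_size": "STANDARD_DS3_V2",
    "min_nodes": 0,
    "max_nodes": 4
}
```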

That’s it for global environment configuration; now it’s down to the training environment and packages.

Training

Files for our model are contained in code/training. The pipeline looks for two .yml config files in this directory: “run_config.yml” and “environment.yml”. The environment file is straightforward; it lists our software and package dependencies. The run config is more bespoke to Azure: it defines the environment (Docker or VM), the framework (TensorFlow, Python, PyTorch) and other associated dependencies. It also references the compute target, so make sure it matches the one in “compute.json” or you could be in for some weird errors.
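For illustration, the environment file is a standard conda specification, and the run config might look roughly like the second sketch below; the exact run-config field names follow AML’s RunConfiguration format and the package versions are assumptions:

```yaml
# environment.yml: a standard conda environment specification
name: mnist-training
dependencies:
  - python=3.8
  - pip:
      - azureml-defaults
      - tensorflow==2.4.1
```

```yaml
# run_config.yml (abridged; illustrative fields only)
script: train.py        # the training entry point
framework: TensorFlow   # the framework AML should set up
target: pbml            # must match the name in compute.json
```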

That said, this is TensorFlow, so I’ve configured the dependencies and run_config for it. The run_config also specifies the script to run for training, in this case “train.py”. Finally, we get to the algorithm: I used Keras for a basic MNIST implementation. The only real difference from a standalone script is getting the run context.

This context is what allows us to log back to the experiment Azure ML (AML) creates for us. The ‘run’ object allows logging arbitrary key/value pairs and tracks our experiment’s progress.

Additionally, we will save the model to “./outputs/model/”. Other artifacts like images can be saved and will be available for viewing in the AML interface.
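A minimal sketch of what train.py boils down to; the architecture and epoch count are placeholders, but the run-context and outputs pattern is the important part:

```python
from tensorflow import keras
from azureml.core import Run
import numpy as np

# Grab the run context; when submitted through the pipeline this binds
# to the experiment AML creates (run locally it returns an offline run).
run = Run.get_context()

# Keras datasets avoid the token errors I hit with tensorflow-datasets
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

# Placeholder architecture for a basic MNIST classifier
model = keras.Sequential([
    keras.layers.Flatten(input_shape=(28, 28)),
    keras.layers.Dense(128, activation="relu"),
    keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=3)

# Log arbitrary key/value pairs back to the experiment
loss, acc = model.evaluate(x_test, y_test)
run.log("test_loss", float(loss))
run.log("test_accuracy", float(acc))

# Save to ./outputs/model/ so AML captures the artifacts
model.save("./outputs/model/")
```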

Now all that’s left is to check into the main branch and see how it goes.

Lessons Learned so Far

The short of it is: I really like the ability to check in code and let a training run execute autonomously, and this could be enhanced with hyperparameter tuning, more complex deployment models and so on. The bad: this pipeline and solution are fiddly and unreliable.

Multiple Debugging Surfaces

My testing loop starts in Visual Studio Code, which has a great feature for running code interactively; this is my first test and is extremely effective as an initial check.

Next, you must debug the pipeline and any macro config issues in GitHub, from the Actions tab.

The problem is, unless it’s a credential or workflow config issue, the errors you find there are generally unhelpful for training issues. Here’s a case in point.

A pipeline ran and executed training, but was then reported as unauthorized to complete. So we must go to a third place to look for a resolution: the Azure ML Workspace, where in this case the run shows as having completed successfully.

It is worth pointing out that you can (and I have) connect VS Code to the AML Workspace, which lets you perform that last check in the IDE and somewhat alleviates the complaint. I found that the tensorflow-datasets package produced erratic errors relating to tokens; replacing it with Keras datasets seemed to alleviate many of these.

Fiddly

The challenges I encountered with tensorflow-datasets presenting errors (on occasion a zip checksum failure) and the unauthorized error above are just a few examples of the kinds of things that rapidly suck away enough time that you’re no longer saving time with your pipeline; you’re spending more time maintaining it than it saves you. As evidence: the error detailed above was resolved when I… added a line to my readme and checked the repo in.

You can see it in the Actions history: this commit succeeded, the other failed, and nothing about the identity or secrets changed. Not a deal breaker, but frustrating to spend time on these issues.

Deployment

I won’t detail all of the config files and steps for deployment; know that one additional JSON file is required for registering the model, “registermodel.json”. This file usually takes the name of the file we’ve serialized our model to, but in this case we need the entire path, because Keras saves and loads using multiple files in a folder; hence our model name is the path with no file names. The additional steps are in “.github/workflows/main.yml”.
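Illustratively (key names here are assumptions modeled on the register-model action’s schema), the file points at the directory rather than a single file:

```json
{
    "model_name": "mnist-model",
    "model_file_name": "outputs/model"
}
```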

The complexity is contained in our “code/deploy/score.py” file. This is the Python wrapper that tells Azure how to wrap our model with a RESTful interface. Key things to note:
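A sketch of the first half of score.py, following the standard AML scoring-script pattern; the data-collector setup is elided to a comment:

```python
import os
import numpy as np
from tensorflow import keras

def init():
    # Runs once when the service container starts: load the model.
    # AZUREML_MODEL_DIR points at the registered model's root; the
    # Keras artifacts live in the "model" folder inside it.
    global model
    model_dir = os.path.join(os.environ["AZUREML_MODEL_DIR"], "model")
    model = keras.models.load_model(model_dir)
    # (the full script also sets up the input/output data collectors here)
```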

First, init() is run once at startup; this is good for loading model binaries and other setup work. Here you can see we’re loading our Keras model using the environment variable “AZUREML_MODEL_DIR”; our artifacts are actually in the “model” folder contained in that directory. Otherwise we’re creating the objects that will allow us to save inputs and outputs to object storage.

The run() function is invoked whenever our API is accessed:
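Roughly, continuing score.py (the decorators come from the inference-schema package; the sample shapes are illustrative):

```python
from inference_schema.schema_decorators import input_schema, output_schema
from inference_schema.parameter_types.numpy_parameter_type import NumpyParameterType
from inference_schema.parameter_types.standard_py_parameter_type import StandardPythonParameterType

@input_schema("data", NumpyParameterType(np.zeros((1, 28, 28))))
@output_schema(StandardPythonParameterType({"predict": 0}))
def run(data):
    # Score the batch, then collapse the 10 per-digit probabilities
    # into the single digit with the highest probability.
    probabilities = model.predict(data)
    return {"predict": int(np.argmax(probabilities, axis=1)[0])}
```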

Here we are using the optional input and output schema decorators, which feed the Swagger documentation Microsoft produces for our API and validate the input to it. Otherwise, we’re inferencing on the data payload provided in the API call and returning our object. The return contains some machinery for transforming the model’s output (an array of 10 numbers: the probability assigned to each digit, 0–9) into the single digit the model assigned the highest probability.

Testing our Service

Now, if your run has completed successfully, you should see one (or more) registered models in the Studio under “Models”, and a deployment under “Endpoints”.

Clicking on the endpoint’s name gives us our service information and lets us test the API.

The payload required by the API consists of the key “data” with the value being our MNIST image data. We must enclose the image array in another array to produce a (1, 28, 28) shape, because the model was trained on batches of shape (N, 28, 28).
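As a sketch, calling the endpoint from Python looks something like this; the scoring URI placeholder is whatever the endpoint’s details page shows:

```python
import json
import requests
import numpy as np

scoring_uri = "http://<your-endpoint>/score"  # from the endpoint's details page

# One 28x28 image wrapped in an outer list -> shape (1, 28, 28)
image = np.zeros((28, 28)).tolist()
payload = json.dumps({"data": [image]})

response = requests.post(scoring_uri, data=payload,
                         headers={"Content-Type": "application/json"})
print(response.json())  # e.g. {"predict": 5}
```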

So the inferencing works and the deployment pipeline (mostly) works. Despite some of the minor irritations I outlined (another failure happened between updating my .gitignore and deleting a folder, for no obvious reason), I will follow up with hyperparameter tuning in the service and better instrumentation for logging from the models.
