AWS Machine Learning Blog

Testing approaches for Amazon SageMaker ML models

This post was co-written with Tobias Wenzel, Software Engineering Manager for the Intuit Machine Learning Platform.

We all appreciate the importance of a high-quality and reliable machine learning (ML) model when using autonomous driving or interacting with Alexa, for example. ML models also play an important role in less obvious ways: they’re used by business applications, healthcare providers, financial institutions, amazon.com, TurboTax, and more.

As ML-enabled applications become core to many businesses, models need to be developed and operated with the same rigor and discipline as software applications. An important aspect of MLOps is to deliver a new version of a previously developed ML model to production by using established DevOps practices such as testing, versioning, continuous delivery, and monitoring.

There are several prescriptive guidelines around MLOps. This post gives an overview of the testing process you can follow and the tools to use, based on a collaboration between Intuit and AWS. We have been working together to implement the recommendations explained in this post in practice and at scale. Intuit’s goal of becoming an AI-driven expert platform depends heavily on a strategy of increasing the velocity of initial model development as well as the testing of new versions.

Requirements

The following are the main areas of consideration while deploying new model versions:

  1. Model accuracy performance – It’s important to keep track of model evaluation metrics like accuracy, precision, and recall, and ensure that the objective metrics remain relatively the same or improve with a new version of the model. In most cases, deploying a new version of the model doesn’t make sense if the experience of end-users won’t improve.
  2. Test data quality – Data in non-production environments, whether simulated or a point-in-time copy, should be representative of the data that the model will receive when fully deployed, in terms of both volume and distribution. If not, your testing processes won’t be representative, and your model may behave differently in production.
  3. Feature importance and parity – Feature importance in the newer version of the model should be comparable to that of the older model, even though new features might be introduced. This is to ensure that the model isn’t becoming biased.
  4. Business process testing – It’s important that a new version of a model can fulfill your required business objectives within acceptable parameters. For example, one of the business metrics can be that the end-to-end latency for any service must not be more than 100 milliseconds, or the cost to host and retrain a particular model can’t be more than $10,000 per year.
  5. Cost – A simple approach to testing is to replicate the whole production environment as a test environment. This is a common practice in software development. However, for ML models, such an approach might not yield the right ROI depending on the size of the data, and it may impact the model in terms of the business problem it’s addressing.
  6. Security – Test environments are often expected to have sample data instead of real customer data and as a result, data handling and compliance rules can be less strict. Just like cost though, if you simply duplicate the production environment into a test environment, you could introduce security and compliance risks.
  7. Feature store scalability – If an organization decides to not create a separate test feature store because of cost or security reasons, then model testing needs to happen on the production feature store, which can cause scalability issues as traffic is doubled during the testing period.
  8. Online model performance – Online evaluations differ from offline evaluations and can be important in some cases, such as recommendation models, because they measure user satisfaction in real time rather than perceived satisfaction. It’s hard to simulate real traffic patterns in non-production environments due to seasonality or other user behavior, so evaluation of online model performance can only be done in production.
  9. Operational performance – As models get bigger and are increasingly deployed in a decentralized manner on different hardware, it’s important to test the model for your desired operational performance like latency, error rate, and more.

Most ML teams have a multi-pronged approach to model testing. In the following sections, we provide ways to address these challenges during various testing stages.

Offline model testing

The goal of this testing phase is to validate new versions of an existing model from an accuracy standpoint. This should be done in an offline fashion to not impact any predictions in the production system that are serving real-time predictions. By ensuring that the new model performs better for applicable evaluation metrics, this testing addresses challenge 1 (model accuracy performance). Also, by using the right dataset, this testing can address challenges 2 and 3 (test data quality, feature importance and parity), with the additional benefit of tackling challenge 5 (cost).

This phase is done in the staging environment.

You should capture production traffic, which you can replay in offline back testing. It’s preferable to use past production traffic instead of synthetic data. The Amazon SageMaker Model Monitor data capture feature allows you to capture production traffic for models hosted on Amazon SageMaker. This allows model developers to test their models with data from peak business days or other significant events. The captured data is then replayed against the new model version in a batch fashion using SageMaker batch transform. This means that a batch transform run can test with data that was collected over weeks or months in just a few hours. This can significantly speed up the model evaluation process compared to running two or more versions of a real-time model side by side and sending duplicate prediction requests to each endpoint. In addition to finding a better-performing version faster, this approach also uses the compute resources for a shorter amount of time, reducing the overall cost.
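The following sketch (Python, using the SageMaker Python SDK) shows one way to wire this up, assuming hypothetical bucket paths, role ARN, image URI, and endpoint names: data capture is enabled when the current model is deployed, and the captured payloads, after being converted into a batch-friendly format, are replayed against the candidate version with a batch transform job.

```python
from sagemaker.model import Model
from sagemaker.model_monitor import DataCaptureConfig

# Hypothetical names and URIs for illustration only.
ROLE = "arn:aws:iam::111122223333:role/SageMakerExecutionRole"
IMAGE_URI = "<inference-image-uri>"

# 1) Enable data capture on the live endpoint so production requests
#    and responses are logged to Amazon S3.
current_model = Model(
    image_uri=IMAGE_URI,
    model_data="s3://my-bucket/models/current/model.tar.gz",
    role=ROLE,
)
predictor = current_model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",
    endpoint_name="my-model-prod",
    data_capture_config=DataCaptureConfig(
        enable_capture=True,
        sampling_percentage=100,          # capture all requests
        destination_s3_uri="s3://my-bucket/datacapture/",
    ),
)

# 2) Later, replay the captured payloads (prepared as CSV) against the
#    candidate model version with a batch transform job.
candidate_model = Model(
    image_uri=IMAGE_URI,
    model_data="s3://my-bucket/models/candidate/model.tar.gz",
    role=ROLE,
)
transformer = candidate_model.transformer(
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://my-bucket/backtest/output/",
)
transformer.transform(
    data="s3://my-bucket/backtest/input/",  # prepared from the captured data
    content_type="text/csv",
    split_type="Line",
)
transformer.wait()
```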

A challenge with this approach to testing is that the feature set changes from one model version to another. In this scenario, we recommend creating a feature set with a superset of features for both versions so that all features can be queried at once and recorded through the data capture. Each prediction call can then work on only those features necessary for the current version of the model.
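As a simple illustration of this pattern, assuming each captured record is a dictionary keyed by feature name (the feature names here are hypothetical), each model version selects only the subset of features it actually consumes before building its prediction payload:

```python
# Superset of features recorded through data capture (hypothetical names).
captured_record = {
    "age": 42, "income": 55000, "tenure_days": 310, "clicks_7d": 18,
}

# Each model version declares the features it actually consumes.
FEATURES_V1 = ["age", "income", "tenure_days"]
FEATURES_V2 = ["age", "income", "tenure_days", "clicks_7d"]

def payload_for(record: dict, feature_names: list) -> str:
    """Build a CSV payload containing only the features a given version needs."""
    return ",".join(str(record[name]) for name in feature_names)

payload_v1 = payload_for(captured_record, FEATURES_V1)
payload_v2 = payload_for(captured_record, FEATURES_V2)
```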

As an added bonus, by integrating Amazon SageMaker Clarify into your offline model testing, you can check the new version of the model for bias and also compare its feature attributions with those of the previous version. With Amazon SageMaker Pipelines, you can orchestrate the entire workflow such that after training, a quality check step analyzes the model metrics and feature importance. These metrics are stored in the SageMaker Model Registry for comparison in the next training run.
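As a rough sketch (with hypothetical role, S3 paths, model name, and feature names), a SageMaker Clarify explainability job can compute SHAP feature attributions for the candidate model so they can be compared against the previous version’s report; an analogous run_bias call covers the bias checks.

```python
from sagemaker import clarify

# Hypothetical role, paths, and model name for illustration.
ROLE = "arn:aws:iam::111122223333:role/SageMakerExecutionRole"

clarify_processor = clarify.SageMakerClarifyProcessor(
    role=ROLE,
    instance_count=1,
    instance_type="ml.m5.xlarge",
)

data_config = clarify.DataConfig(
    s3_data_input_path="s3://my-bucket/validation/validation.csv",
    s3_output_path="s3://my-bucket/clarify/candidate/",
    label="label",
    headers=["label", "age", "income", "tenure_days", "clicks_7d"],
    dataset_type="text/csv",
)

model_config = clarify.ModelConfig(
    model_name="my-model-candidate",   # the new version, already created in SageMaker
    instance_type="ml.m5.xlarge",
    instance_count=1,
    accept_type="text/csv",
)

# SHAP-based feature attributions for the candidate model; the resulting
# report can be compared with the previous version's attributions.
shap_config = clarify.SHAPConfig(
    baseline=[[40, 50000, 300, 10]],   # hypothetical baseline record
    num_samples=100,
    agg_method="mean_abs",
)

clarify_processor.run_explainability(
    data_config=data_config,
    model_config=model_config,
    explainability_config=shap_config,
)
```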

Integration and performance testing

Integration testing is needed to validate end-to-end business processes from a functional as well as a runtime performance perspective. Within this process, the whole pipeline should be tested, including fetching and calculating features in the feature store and running the ML application. This should be done with a variety of payloads to cover different scenarios and requests and achieve high coverage for all possible code paths. This addresses challenges 4 and 9 (business process testing and operational performance) to ensure none of the business processes are broken with the new version of the model.

This testing should be done in a staging environment.

Both integration testing and performance testing need to be implemented by individual teams using their MLOps pipeline. For the integration testing, we recommend the tried and tested method of maintaining a functionally equivalent pre-production environment and testing with a few different payloads. The testing workflow can be automated as shown in this workshop. For the performance testing, you can use Amazon SageMaker Inference Recommender, which offers a great starting point to determine which instance type and how many of those instances to use. For this, you’ll need to use a load generator tool, such as the open-source projects perfsizesagemaker and perfsize that Intuit has developed. Perfsizesagemaker allows you to automatically test model endpoint configurations with a variety of payloads, response times, and peak transactions per second requirements. It generates detailed test results that compare different model versions. Perfsize is the companion tool that tries different configurations given only the peak transactions per second and the expected response time.
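As a minimal example of the functional and runtime checks described above (with a hypothetical staging endpoint name, payloads, and latency budget), a test can invoke the endpoint with several payloads and assert both the response content and the observed latency. A dedicated load generator such as perfsizesagemaker is still needed for realistic throughput testing.

```python
import time
import boto3

# Hypothetical staging endpoint and payloads for illustration.
ENDPOINT_NAME = "my-model-staging"
LATENCY_BUDGET_MS = 100
TEST_PAYLOADS = [
    "42,55000,310,18",
    "23,18000,5,0",
    "67,120000,2900,3",
]

runtime = boto3.client("sagemaker-runtime")

for payload in TEST_PAYLOADS:
    start = time.perf_counter()
    response = runtime.invoke_endpoint(
        EndpointName=ENDPOINT_NAME,
        ContentType="text/csv",
        Body=payload,
    )
    elapsed_ms = (time.perf_counter() - start) * 1000

    prediction = response["Body"].read().decode("utf-8")
    # Functional check: the endpoint returns a non-empty prediction.
    assert prediction.strip() != "", f"Empty prediction for payload: {payload}"
    # Runtime check: stay within the end-to-end latency budget.
    assert elapsed_ms <= LATENCY_BUDGET_MS, (
        f"Latency {elapsed_ms:.1f} ms exceeds {LATENCY_BUDGET_MS} ms budget"
    )
```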

A/B testing

In many cases where user reaction to the immediate output of the model is required, such as ecommerce applications, offline functional evaluation of the model isn’t sufficient. In these scenarios, you need to A/B test models in production before deciding to update them. A/B testing also has its risks because there could be real customer impact. This testing method serves as the final ML performance validation as well as a lightweight engineering sanity check. It also addresses challenges 8 and 9 (online model performance and operational performance).

A/B testing should be performed in a production environment.

With SageMaker, you can easily perform A/B testing on ML models by running multiple production variants on an endpoint. Traffic can be shifted to the new version in increments to limit the impact that a badly behaving model could have on production. If the results of the A/B test look good, more traffic is routed to the new version until it eventually takes over 100% of traffic. We recommend using deployment guardrails to transition from model A to model B. For a more complete discussion of A/B testing using Amazon Personalize models as an example, refer to Using A/B testing to measure the efficacy of recommendations generated by Amazon Personalize.
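The following boto3 sketch (with hypothetical model and endpoint names) shows the basic mechanics: an endpoint config with two production variants, initially weighted 90/10, and a later call that shifts more traffic to the candidate. In practice, you would wrap these steps in deployment guardrails rather than invoke them manually.

```python
import boto3

sm = boto3.client("sagemaker")

# Hypothetical model and endpoint names for illustration.
sm.create_endpoint_config(
    EndpointConfigName="my-model-ab-config",
    ProductionVariants=[
        {
            "VariantName": "model-a",          # current champion
            "ModelName": "my-model-v1",
            "InstanceType": "ml.m5.xlarge",
            "InitialInstanceCount": 2,
            "InitialVariantWeight": 0.9,       # ~90% of traffic
        },
        {
            "VariantName": "model-b",          # new candidate
            "ModelName": "my-model-v2",
            "InstanceType": "ml.m5.xlarge",
            "InitialInstanceCount": 1,
            "InitialVariantWeight": 0.1,       # ~10% of traffic
        },
    ],
)

# Once the candidate looks healthy, shift more traffic to it in place.
sm.update_endpoint_weights_and_capacities(
    EndpointName="my-model-prod",
    DesiredWeightsAndCapacities=[
        {"VariantName": "model-a", "DesiredWeight": 0.5},
        {"VariantName": "model-b", "DesiredWeight": 0.5},
    ],
)
```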

Online model testing

In this scenario, the new version of a model is significantly different from the one already serving live traffic in production, so the offline testing approach is no longer suitable to determine the efficacy of the new model version. The most prominent reason for this is a change in the features required to produce the prediction, so previously recorded transactions can’t be used to test the model. In this scenario, we recommend using shadow deployments. Shadow deployments offer the capability to deploy a shadow (or challenger) model alongside the production (or champion) model that is currently serving predictions. This lets you evaluate how the shadow model performs on production traffic. The predictions of the shadow model aren’t served to the requesting application; they’re logged for offline evaluation. With the shadow approach for testing, we address challenges 4, 5, 6, and 7 (business process testing, cost, security, and feature store scalability).

Online model testing should be done in staging or production environments.

This method of testing new model versions should be used as a last resort, when none of the other methods can be used. We recommend it as a last resort because duplexing calls to multiple models generates additional load on all downstream services in production, which can lead to performance bottlenecks as well as increased cost. The most obvious impact is on the feature serving layer. For use cases that share features from a common pool of physical data, we need to be able to simulate multiple use cases concurrently accessing the same data table to ensure no resource contention exists before transitioning to production. Wherever possible, duplicate queries to the feature store should be avoided, and features needed for both versions of the model should be reused for the second inference. Feature stores based on Amazon DynamoDB, such as the one Intuit has built, can use Amazon DynamoDB Accelerator (DAX) to cache reads and avoid doubling the I/O to the database. These and other caching options can mitigate challenge 7 (feature store scalability).
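As a simple illustration of reusing features for both inferences (with a hypothetical feature group, record identifier, and endpoint names), the features are read from SageMaker Feature Store once and the same payload is sent to both the champion and the challenger, avoiding a second query against the feature store.

```python
import boto3

featurestore = boto3.client("sagemaker-featurestore-runtime")
runtime = boto3.client("sagemaker-runtime")

# Hypothetical feature group and record identifier for illustration.
record = featurestore.get_record(
    FeatureGroupName="customer-features",
    RecordIdentifierValueAsString="customer-123",
)

# Build one payload from the single Feature Store read...
features = {f["FeatureName"]: f["ValueAsString"] for f in record["Record"]}
payload = ",".join(features[name] for name in ["age", "income", "tenure_days"])

# ...and reuse it for both the champion and the challenger inference.
for endpoint in ["my-model-champion", "my-model-challenger"]:
    runtime.invoke_endpoint(
        EndpointName=endpoint,
        ContentType="text/csv",
        Body=payload,
    )
```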

To address challenge 5 (cost) as well as 7, we propose using shadow deployments to sample the incoming traffic. This gives model owners another layer of control to minimize impact on the production systems.
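One way to implement this on SageMaker (a sketch with hypothetical names, not necessarily how Intuit’s platform does it) is a shadow production variant on the endpoint: requests continue to be answered by the champion, while a sampled copy is sent to the challenger and its responses are logged rather than returned to callers. The assumption here is that the shadow variant’s relative weight controls the sampled share of production requests.

```python
import boto3

sm = boto3.client("sagemaker")

# Hypothetical names; the shadow variant receives a copy of production
# traffic, and its responses are only logged, never served to callers.
sm.create_endpoint_config(
    EndpointConfigName="my-model-shadow-config",
    ProductionVariants=[
        {
            "VariantName": "champion",
            "ModelName": "my-model-v1",
            "InstanceType": "ml.m5.xlarge",
            "InitialInstanceCount": 2,
            "InitialVariantWeight": 1.0,
        },
    ],
    ShadowProductionVariants=[
        {
            "VariantName": "challenger",
            "ModelName": "my-model-v2",
            "InstanceType": "ml.m5.xlarge",
            "InitialInstanceCount": 1,
            # Assumption: relative weight samples ~50% of production requests.
            "InitialVariantWeight": 0.5,
        },
    ],
)
```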

Shadow deployments should be onboarded to Model Monitor just like regular production deployments in order to observe the improvements of the challenger version.

Conclusion

This post illustrates the building blocks to create a comprehensive set of processes and tools to address various challenges with model testing. Although every organization is unique, this should help you get started and narrow down your considerations when implementing your own testing strategy.


About the authors

Tobias Wenzel is a Software Engineering Manager for the Intuit Machine Learning Platform in Mountain View, California. He has been working on the platform since its inception in 2016 and has helped design and build it from the ground up. In his job, he has focused on the operational excellence of the platform and bringing it successfully through Intuit’s seasonal business. In addition, he is passionate about continuously expanding the platform with the latest technologies.

Shivanshu Upadhyay is a Principal Solutions Architect in the AWS Business Development and Strategic Industries group. In this role, he helps the most advanced adopters of AWS transform their industry by effectively using data and AI.

Alan Tan is a Senior Product Manager with SageMaker, leading efforts on large model inference. He’s passionate about applying machine learning to the area of analytics. Outside of work, he enjoys the outdoors.