AWS Machine Learning Blog

Minimize real-time inference latency by using Amazon SageMaker routing strategies

Amazon SageMaker makes it straightforward to deploy machine learning (ML) models for real-time inference and offers a broad selection of ML instances spanning CPUs and accelerators such as AWS Inferentia. Because SageMaker is a fully managed service, you can scale your model deployments, minimize inference costs, and manage your models more effectively in production with reduced operational burden. A SageMaker real-time inference endpoint consists of an HTTPS endpoint and ML instances that are deployed across multiple Availability Zones for high availability. SageMaker application auto scaling can dynamically adjust the number of ML instances provisioned for a model in response to changes in workload. The endpoint distributes incoming requests across these ML instances; by default, requests are routed uniformly at random.

When ML models deployed on instances receive API calls from a large number of clients, a random distribution of requests works well when there is not much variability in your requests and responses. But with generative AI workloads, requests and responses can be extremely variable. In these cases, it's often desirable to route requests based on the capacity and utilization of each instance rather than distributing them randomly.

In this post, we discuss the SageMaker least outstanding requests (LOR) routing strategy and how it can minimize latency for certain types of real-time inference workloads by taking into account the capacity and utilization of ML instances. We talk about its benefits over the default routing mechanism and how you can enable LOR for your model deployments. Finally, we present a comparative analysis of latency improvements with LOR over the default random routing strategy.

SageMaker LOR strategy

By default, SageMaker endpoints use a random routing strategy. SageMaker now also supports a LOR strategy, which routes each request to the instance that is best suited to serve it. SageMaker makes this possible by monitoring the load on the instances behind your endpoint and the models or inference components deployed on each instance.

The following diagram shows the default routing policy, where requests coming to the model endpoint are forwarded at random to the ML instances.

The following diagram shows the LOR routing strategy, where SageMaker routes each request to the instance that currently has the fewest outstanding requests.

In general, LOR routing works well for foundational models or generative AI models when your model responds in hundreds of milliseconds to minutes. If your model response has lower latency (up to hundreds of milliseconds), you may benefit more from random routing. Regardless, we recommend that you test and identify the best routing algorithm for your workloads.
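To build intuition for why the choice matters, the following toy simulation (our own illustration; it does not model SageMaker's internals, and the arrival rate and service-time distribution are arbitrary assumptions) compares the P99 latency of random routing with LOR-style routing when a small fleet of instances serves requests with heavy-tailed response times.

    import random

    # Toy discrete-event simulation (an illustration only, not SageMaker's
    # internals) comparing tail latency under random routing vs. routing to
    # the instance with the fewest outstanding requests (LOR).
    def simulate(strategy, num_instances=4, num_requests=5000, seed=42):
        rng = random.Random(seed)
        # in_flight[i] holds completion times of requests still running on instance i.
        in_flight = [[] for _ in range(num_instances)]
        latencies = []
        clock = 0.0
        for _ in range(num_requests):
            clock += rng.expovariate(4.0)            # about 4 requests arrive per second
            service = rng.lognormvariate(-1.0, 1.0)  # heavy-tailed response times
            for queue in in_flight:                  # drop requests that already finished
                queue[:] = [t for t in queue if t > clock]
            if strategy == "RANDOM":
                i = rng.randrange(num_instances)
            else:  # LEAST_OUTSTANDING_REQUESTS
                i = min(range(num_instances), key=lambda j: len(in_flight[j]))
            start = max([clock] + in_flight[i])      # each instance serves requests one at a time
            finish = start + service
            in_flight[i].append(finish)
            latencies.append(finish - clock)
        latencies.sort()
        return latencies[int(0.99 * (len(latencies) - 1))]

    for strategy in ("RANDOM", "LEAST_OUTSTANDING_REQUESTS"):
        print(f"{strategy}: P99 latency {simulate(strategy):.2f}s")

With heavy-tailed response times, random routing occasionally stacks several slow requests on the same instance, which inflates the latency tail; picking the instance with the fewest outstanding requests avoids most of that queueing, and this is the effect LOR exploits.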

How to set SageMaker routing strategies

SageMaker now allows you to set the RoutingStrategy parameter when you create the EndpointConfiguration for an endpoint. SageMaker supports the following RoutingStrategy values:

  • LEAST_OUTSTANDING_REQUESTS
  • RANDOM

The following is an example deployment of a model on an inference endpoint that has LOR enabled:

  1. Create the endpoint configuration by setting RoutingStrategy as LEAST_OUTSTANDING_REQUESTS:
    endpoint_config_response = sm_client.create_endpoint_config(
        EndpointConfigName=endpoint_config_name,
        ProductionVariants=[
            {
                "VariantName": "variant1",
                "ModelName": model_name,
                "InstanceType": instance_type,
                "InitialInstanceCount": initial_instance_count,
                # ..... other variant parameters as needed
                "RoutingConfig": {
                    "RoutingStrategy": "LEAST_OUTSTANDING_REQUESTS"
                },
            },
        ],
    )
  2. Create the endpoint using the endpoint configuration (no change from a standard deployment); a sketch for verifying the routing configuration follows this step:
    create_endpoint_response = sm_client.create_endpoint(
        EndpointName=endpoint_name,
        EndpointConfigName=endpoint_config_name,
    )
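
After the endpoint is created, you can optionally confirm that the production variant carries the intended routing configuration and send a test request. The following is a minimal sketch that reuses the endpoint_name and endpoint_config_name variables from the previous steps; the sample payload is a placeholder and assumes a container that accepts a JSON inputs field.

    import json

    import boto3

    sm_client = boto3.client("sagemaker")
    smr_client = boto3.client("sagemaker-runtime")

    # Wait until the endpoint is deployed and ready to serve traffic.
    sm_client.get_waiter("endpoint_in_service").wait(EndpointName=endpoint_name)

    # Confirm the routing strategy recorded on the endpoint configuration.
    config = sm_client.describe_endpoint_config(EndpointConfigName=endpoint_config_name)
    print(config["ProductionVariants"][0]["RoutingConfig"]["RoutingStrategy"])

    # Send a sample request; adjust the payload and content type to whatever
    # your model container expects (this payload is only a placeholder).
    response = smr_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=json.dumps({"inputs": "def hello_world():"}),
    )
    print(response["Body"].read().decode("utf-8"))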

Performance results

We ran performance benchmarking to measure the end-to-end inference latency and throughput of the codegen2-7B model hosted on ml.g5.24xlarge instances behind a default routing endpoint and a smart routing (LOR) endpoint. The CodeGen2 model belongs to a family of autoregressive language models and generates executable code when given English prompts.

In our analysis, we increased the number of ml.g5.24xlarge instances behind each endpoint for each test run as the number of concurrent users was increased, as shown in the following table.

Test    Number of Concurrent Users    Number of Instances
1       4                             1
2       20                            5
3       40                            10
4       60                            15
5       80                            20
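
For reference, a client-side comparison like this can be driven with a simple load generator: concurrent workers repeatedly invoke the endpoint, and the per-request latencies are aggregated into a P99 value. The following sketch is illustrative only; the endpoint name, payload shape, and load profile are placeholder assumptions rather than the exact harness used to produce the results that follow.

    import json
    import time
    from concurrent.futures import ThreadPoolExecutor

    import boto3

    smr_client = boto3.client("sagemaker-runtime")

    # Placeholder values; substitute your own endpoint, payload, and load profile.
    ENDPOINT_NAME = "codegen2-7b-endpoint"
    PAYLOAD = json.dumps({"inputs": "def fibonacci(n):", "parameters": {"max_new_tokens": 128}})
    CONCURRENT_USERS = 20
    REQUESTS_PER_USER = 25

    def timed_request(_):
        start = time.perf_counter()
        smr_client.invoke_endpoint(
            EndpointName=ENDPOINT_NAME,
            ContentType="application/json",
            Body=PAYLOAD,
        )
        return time.perf_counter() - start

    wall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=CONCURRENT_USERS) as pool:
        latencies = sorted(pool.map(timed_request, range(CONCURRENT_USERS * REQUESTS_PER_USER)))
    wall_clock = time.perf_counter() - wall_start

    p99 = latencies[int(0.99 * (len(latencies) - 1))]
    print(f"P99 latency: {p99:.2f}s  throughput: {len(latencies) / wall_clock:.2f} requests/sec")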

We measured the end-to-end P99 latency for both endpoints and observed a 4–33% improvement in latency when the number of instances was increased from 5 to 20, as shown in the following graph.

Similarly, we observed a 15–16% improvement in throughput per minute per instance when the number of instances was increased from 5 to 20.

This illustrates that smart routing improves the traffic distribution among the instances behind an endpoint, leading to improvements in end-to-end latency and overall throughput.

Conclusion

In this post, we explained SageMaker routing strategies and the new option to enable LOR routing. We showed how to enable LOR and how it can benefit your model deployments. Our performance tests showed latency and throughput improvements during real-time inference. To learn more about SageMaker routing features, refer to the documentation. We encourage you to evaluate your inference workloads and determine whether you have configured the optimal routing strategy for them.


About the Authors

James Park is a Solutions Architect at Amazon Web Services. He works with Amazon.com to design, build, and deploy technology solutions on AWS, and has a particular interest in AI and machine learning. In his spare time he enjoys seeking out new cultures, new experiences, and staying up to date with the latest technology trends. You can find him on LinkedIn.

Venugopal Pai is a Solutions Architect at AWS. He lives in Bengaluru, India, and helps digital-native customers scale and optimize their applications on AWS.

David Nigenda is a Senior Software Development Engineer on the Amazon SageMaker team, currently working on improving production machine learning workflows, as well as launching new inference features. In his spare time, he tries to keep up with his kids.

Deepti Ragha is a Software Development Engineer in the Amazon SageMaker team. Her current work focuses on building features to host machine learning models efficiently. In her spare time, she enjoys traveling, hiking and growing plants.

Alan Tan is a Senior Product Manager with SageMaker, leading efforts on large model inference. He’s passionate about applying machine learning to the area of analytics. Outside of work, he enjoys the outdoors.

Dhawal Patel is a Principal Machine Learning Architect at AWS. He has worked with organizations ranging from large enterprises to mid-sized startups on problems related to distributed computing and artificial intelligence. He focuses on deep learning, including the NLP and computer vision domains. He helps customers achieve high-performance model inference on SageMaker.

Dylan Chow is a Software Development Engineer on the Amazon SageMaker team, currently working on building features that enable customers to run generative AI workloads on AWS. In his spare time, he enjoys the outdoors, art, and music.