
HuggingFace Inference Endpoints

Rapid production-grade deployment of Transformers models

Image from Unsplash by Towfiqu barbhuiya

A constant theme in my articles has been the deployment of Machine Learning models. As Machine Learning has grown in popularity, so has the range of model deployment options available to users. HuggingFace in particular has become a leader in the Machine Learning space, and if you're a Data Science practitioner it's very likely you've worked with a Transformers model at some point.

HuggingFace has partnerships with both AWS and Azure and provides deployment options on both cloud providers. While deploying Transformers models on these providers is a relatively easy process, it still requires some knowledge of their ecosystems. How could HuggingFace provide production-level infrastructure for model hosting while letting users focus on their model?

Enter HuggingFace Inference Endpoints. This hosting option still runs on infrastructure provided by both cloud providers, but abstracts away the work needed with their ML services such as Amazon SageMaker and Azure ML Endpoints.

In this article we'll take a look at how you can spin up your first HuggingFace Inference Endpoint. We'll set up a sample endpoint, show how to invoke it, and show how to monitor its performance.

NOTE: This article assumes basic knowledge of HuggingFace/Transformers and Python. You will also need to create a HuggingFace account and add your billing information. Make sure to delete your endpoint when you are done so you do not incur further charges.

Table of Contents

  1. Setup/Endpoint Creation
  2. Endpoint Invocation/Monitoring
  3. Other Deployment Options
  4. Additional Resources & Conclusion

Setup/Endpoint Creation

As noted earlier, make sure to create a HuggingFace account; you will need to add your billing information because you will be creating an endpoint backed by dedicated compute infrastructure. We can go to the Inference Endpoints home page to get started on deploying a model.

With Inference Endpoint creation there are three main steps to consider:

  1. Model Selection
  2. Cloud Provider/Infrastructure Selection
  3. Endpoint Security Level

To create an endpoint, you need to select a model from the Hugging Face hub. For this use case we'll take a RoBERTa model that has been fine-tuned on a Twitter dataset for Sentiment Analysis.

Model Selection (Screenshot by Author)
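
Before spinning up paid infrastructure, you can optionally sanity-check the model locally with a quick transformers pipeline call. The checkpoint name below is an assumption (the popular Cardiff NLP Twitter sentiment model); substitute whichever model you actually select on the hub.

from transformers import pipeline

# Local smoke test before deployment; the model ID is an assumption --
# swap in the exact checkpoint you selected on the Hugging Face hub.
classifier = pipeline(
    "sentiment-analysis",
    model="cardiffnlp/twitter-roberta-base-sentiment-latest",
)
print(classifier("I like you. I love you"))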

After choosing the model for your endpoint deployment, you need to select a cloud provider. For this example we'll select AWS as our provider, and we can then see which hardware options are available for both CPU and GPU.

A more advanced feature is setting an AutoScaling configuration. You can set a minimum and maximum instance count to scale up and down based on traffic load and hardware utilization.

Along with this, the advanced configuration lets you control the Task of your model, the source Framework, and a custom container image. This image can include additional dependencies you install or scripts you mount onto the image, and you can point to an image on Docker Hub or your cloud provider's image registry, such as AWS ECR.

Advanced Configuration (Screenshot by Author)
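
If you would rather script this setup than click through the UI, newer versions of the huggingface_hub client expose a create_inference_endpoint helper that covers the same options, including the AutoScaling replica counts. Treat the following as a rough sketch: the endpoint name, model ID, and instance type/size are placeholders and depend on what your account has access to.

from huggingface_hub import create_inference_endpoint

# Rough programmatic equivalent of the UI flow above; all values are placeholders.
endpoint = create_inference_endpoint(
    "twitter-sentiment-demo",                                       # hypothetical endpoint name
    repository="cardiffnlp/twitter-roberta-base-sentiment-latest",  # assumed model ID
    framework="pytorch",
    task="text-classification",
    vendor="aws",
    region="us-east-1",
    accelerator="cpu",
    instance_size="x2",         # placeholder -- check the sizes offered to your account
    instance_type="intel-icl",  # placeholder -- check the types offered to your account
    min_replica=1,              # AutoScaling lower bound
    max_replica=2,              # AutoScaling upper bound
)
endpoint.wait()  # block until the endpoint is provisioned
print(endpoint.url)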

Lastly, you can also define the security level of your endpoint. A private endpoint requires AWS PrivateLink; for an end-to-end guide, follow Julien Simon's example here. For simplicity's sake, in this example we will create a public endpoint.

Security Level of Endpoint (Screenshot by Author)

Now you can create the endpoint and it should be provisioned within a few minutes.

Endpoint Running (Screenshot by Author)

Endpoint Invocation/Monitoring

Invoking our endpoint is simple: the Inference Endpoints UI provides a generated curl command.

Test Endpoint (Screenshot by Author)
curl https://ddciyc4dikwsl6kg.us-east-1.aws.endpoints.huggingface.cloud \
-X POST \
-d '{"inputs": "I like you. I love you"}' \
-H "Authorization: Bearer PYVevWdShZXpmWWixcYZtxsZRzCDNVaLillyyxeclCIlvNxCnyYhDwNQGtfmyQfciOhYpXRxcEFyiRppXAurMLafbPLroPrGUCmLsqAauOVhvMVbukAqJQYtKBrltUix" \
-H "Content-Type: application/json"

Using a curl-to-Python converter, we can get the equivalent Python code to test the endpoint from our local development environment.

import requests
import time

headers = {
    'Authorization': 'Bearer PYVevWdShZXpmWWixcYZtxsZRzCDNVaLillyyxeclCIlvNxCnyYhDwNQGtfmyQfciOhYpXRxcEFyiRppXAurMLafbPLroPrGUCmLsqAauOVhvMVbukAqJQYtKBrltUix',
    # Already added when you pass json=
    # 'Content-Type': 'application/json',
}

json_data = {
    'inputs': 'I like you. I love you',
}

def invoke_ep(headers, json_data):
    # POST the JSON payload to the Inference Endpoint and return the raw response body
    response = requests.post('https://ddciyc4dikwsl6kg.us-east-1.aws.endpoints.huggingface.cloud', headers=headers, json=json_data)
    return response.text
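
A single call confirms the endpoint is up and returns the label/score predictions as a JSON string:

# One-off test call -- prints the raw JSON response from the endpoint
print(invoke_ep(headers, json_data))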

We can further stress test the endpoint by sending requests for an extended duration of time.

request_duration = 100 #adjust for length of test
end_time = time.time() + request_duration
print(f"test will run for {request_duration} seconds")
while time.time() < end_time:
    invoke_ep(headers, json_data)

We can observe these requests and the endpoint's performance using the Inference Endpoints Analytics UI. The analytics dashboard provides request count and latency metrics so we can understand our traffic and the corresponding endpoint performance.

In case you need to debug your endpoint, you can also view container logs in the UI. Here we can track individual request durations, and any logging you add in a Custom Inference Handler or Custom Container Image will be reflected here as well.

Container Logs (Screenshot by Author)
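
If you go the Custom Inference Handler route, the handler lives in a handler.py file inside your model repository and exposes an EndpointHandler class that the endpoint loads on startup and calls per request. Below is a minimal sketch, with a hypothetical log line to show where your own logging would surface in the container logs.

# handler.py -- minimal Custom Inference Handler sketch
import logging
from typing import Any, Dict, List

from transformers import pipeline

logger = logging.getLogger(__name__)

class EndpointHandler:
    def __init__(self, path: str = ""):
        # 'path' points at the model repository contents on the endpoint
        self.pipeline = pipeline("sentiment-analysis", model=path)

    def __call__(self, data: Dict[str, Any]) -> List[Dict[str, Any]]:
        inputs = data["inputs"]
        # Anything logged here shows up in the container logs view
        logger.info("received request: %s", inputs)
        return self.pipeline(inputs)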

To update or delete your endpoint go to the Settings tab to manage your resources as necessary.
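
If you scripted the endpoint creation with huggingface_hub, the cleanup can be scripted too; the name below refers to the hypothetical endpoint from the earlier sketch.

from huggingface_hub import delete_inference_endpoint, pause_inference_endpoint

# Pause the endpoint while you're not using it...
pause_inference_endpoint("twitter-sentiment-demo")

# ...or delete it entirely once you're done testing
delete_inference_endpoint("twitter-sentiment-demo")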

Other Deployment Options

Within HuggingFace there are other hosting options you can use as well. There's the free Hosted Inference API, which you can use to test your models before adopting Inference Endpoints. There's also Amazon SageMaker, with which HuggingFace is strongly integrated: supported container images for HuggingFace are available for both training and inference on SageMaker. Finally, there's HuggingFace Spaces, which you can use to build quick UIs for your ML models via the Streamlit and Gradio frameworks.
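
For context, calling the free Hosted Inference API looks almost identical to calling a dedicated Inference Endpoint; you just point at the shared api-inference host with your regular hub token. The model ID below mirrors the checkpoint assumed earlier, and the token is a placeholder.

import requests

# Free Hosted Inference API -- shared infrastructure, so expect cold starts and rate limits
API_URL = "https://api-inference.huggingface.co/models/cardiffnlp/twitter-roberta-base-sentiment-latest"
headers = {"Authorization": "Bearer <YOUR_HF_TOKEN>"}

response = requests.post(API_URL, headers=headers, json={"inputs": "I like you. I love you"})
print(response.json())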

Additional Resources & Conclusion

GitHub – RamVegiraju/HuggingFace-Examples: A repository of HuggingFace examples/features/

For the code from this example, please click on the link above. For further HuggingFace-related content, please access the list here. To get started on your own with HuggingFace Inference Endpoints, follow the official documentation. I hope this article was a useful guide for those getting started with HuggingFace Inference Endpoints; stay tuned for more content in this area.


If you enjoyed this article feel free to connect with me on LinkedIn and subscribe to my Medium Newsletter. If you’re new to Medium, sign up using my Membership Referral.

