How to deploy Hugging Face models with Run:ai

June 1, 2023

In this blog post, we will demonstrate how to deploy GPT-2 using Run:ai, a platform designed to simplify the deployment and orchestration of AI workloads in Kubernetes clusters.

Why Hugging Face?

Hugging Face, established in 2016, has become a prominent AI community and machine learning platform. It set out on a mission to democratize Natural Language Processing (NLP), making it accessible to data scientists, AI practitioners, and engineers. One of Hugging Face's key offerings is its vast collection of over 20,000 pre-trained models based on the state-of-the-art transformer architecture, enabling users to tackle a wide range of NLP tasks. These models can handle text in over 100 languages, covering tasks such as classification, information extraction, question answering, generation, and translation. Moreover, Hugging Face extends its capabilities to speech recognition and audio classification, as well as vision tasks such as object detection, image classification, and segmentation. Even tabular data for regression and classification problems can be effectively addressed with Hugging Face models.

Taking things a step further, Hugging Face introduced Hugging Face Spaces, a feature that simplifies and accelerates the creation and deployment of ML applications. With the latest release, Docker Spaces, users can build custom applications of their choice simply by providing a Dockerfile.

Why Run:ai?

With the recent wave of interest in large language models (LLMs), many teams have started deploying these models for various use cases. However, deploying them in production, especially in a Kubernetes cluster, can be challenging. Managing scalability, resource allocation, and version control of models within a dynamic cluster environment requires careful planning and implementation. This is where Run:ai comes to the rescue, providing a solution for seamless deployment and management of Hugging Face models in a Kubernetes cluster.

First, Run:ai streamlines the deployment process by providing a user-friendly interface and intuitive workflows. With Run:ai, you can easily deploy your models anywhere, whether on-premises or in the cloud, with just a few clicks (or via the CLI) once your fine-tuning is done.

Second, Run:ai improves scalability, allowing you to scale your Hugging Face models efficiently based on demand. With the auto-scaling feature, you can automatically adjust the number of replicas based on threshold metrics defined when you deploy a model, ensuring optimal performance during high-demand periods without manual intervention. Additionally, Run:ai supports scale-to-zero, so that resources are consumed only when requests arrive, keeping resource utilization in check.

Furthermore, Run:ai enhances resource management in a Kubernetes cluster. It provides dynamic control over GPU and CPU allocations, allowing you to allocate resources according to your model's requirements, from small models running on CPUs or fractions of a GPU to large language models spanning multiple GPUs. This flexibility ensures efficient utilization of resources and maximizes the performance of your Hugging Face models.

Deployment on Run:ai

Step 1: Creating the Docker Space with GPT-2

Of course, you can use your favorite model for your own use case. To get started, go to the GPT-2 model page on Hugging Face and select 'Spaces' under the 'Deploy' button.

Next, give your Space a name and select Docker as your Space SDK. After creating the Docker Space, you will see three files in your repository, including instructions for cloning the repo and creating the Dockerfile for your application. You can either use the Hugging Face interface to make changes in your repository or clone it to your local machine to work on it. For more information about Hugging Face Spaces, please refer to the documentation: https://huggingface.co/docs/hub/spaces-sdks-docker-first-demo

You can find the repo that we used for this demo here: https://huggingface.co/spaces/ekinnk/gpt2_demo/tree/main
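For reference, a Docker Space like this typically pairs the Dockerfile with a small Python application that loads the model and serves a web interface. Below is a minimal sketch of what such an application file could look like, using the transformers text-generation pipeline and a Gradio interface; the file name, generation parameters, and interface layout are illustrative assumptions and may differ from the actual demo repository linked above.

# app.py -- illustrative sketch of a GPT-2 Space application (not the exact demo code)
import gradio as gr
from transformers import pipeline

# Load the pre-trained GPT-2 text-generation pipeline from the Hugging Face Hub.
generator = pipeline("text-generation", model="gpt2")

def generate(prompt: str) -> str:
    # Generate a continuation of the prompt; max_length is an illustrative choice.
    outputs = generator(prompt, max_length=100, num_return_sequences=1)
    return outputs[0]["generated_text"]

demo = gr.Interface(fn=generate, inputs="text", outputs="text", title="GPT-2 Demo")

# Gradio listens on port 7860 by default; bind to 0.0.0.0 so the port can be exposed from the container.
demo.launch(server_name="0.0.0.0", server_port=7860)

Port 7860 is the one you will expose from the container and reference again in the deployment step.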

Step 2: Create the Docker Image

Once you have made the necessary changes to your application, it's time to create the Docker image. If you haven't cloned the repository yet, go back to the first step and clone it to your machine. After cloning the repository, navigate to the directory where your Dockerfile resides. To create the image locally, you only need to run three commands:

$ docker login -u YOUR-USER-NAME 

$ docker build -t YOUR-USER-NAME/gpt2 .

$ docker push YOUR-USER-NAME/gpt2

Note: If you are creating the Docker image on an ARM-based machine (e.g., a MacBook with Apple Silicon), you will need to build the image for the amd64 platform. To do so, use the --platform flag:

$ docker login -u YOUR-USER-NAME 

$ docker build --platform linux/amd64 -t YOUR-USER-NAME/gpt2_amd64 .

$ docker push YOUR-USER-NAME/gpt2_amd64

If you are using another registry instead of Docker Hub, go ahead and push your image there. 

Here is the image that we created for this demo: https://hub.docker.com/r/ekink/gpt2_amd64/tags
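Before moving on to deployment, it can be useful to sanity-check the image locally. The snippet below is a minimal sketch that assumes you have started the container on your machine with docker run -p 7860:7860 YOUR-USER-NAME/gpt2_amd64 (substitute whichever tag you pushed); it simply confirms that the application responds on port 7860.

# check_local.py -- quick sanity check against the locally running container
import requests

# The application inside the container listens on port 7860 (mapped to localhost above).
response = requests.get("http://localhost:7860", timeout=10)
print("Status code:", response.status_code)  # 200 means the web UI is being served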

Step 3: Deploy 🚀

Now it's time to deploy your Hugging Face model using Run:ai. Sign in to the cluster and navigate to Deployments in the left sidebar. Be aware that you can only deploy a model if your account has the correct permissions; if that is not the case, please reach out to your admin.

After clicking 'New Deployment', fill out the required information for the deployment, including the Docker image that you pushed to Docker Hub or another registry. The name specified here will be used in the URL of your online application. For resources, define the desired GPU and CPU allocations.

Under the container definition, specify the port that your container exposes for the application. In this case, it is port 7860, which is defined in the Python file of the application.

Lastly, you can enable auto-scaling based on a threshold metric of your choice. This is useful for automatically scaling your application up and down to meet demand, ensuring stable performance under high loads. Set the minimum number of replicas to 1 to keep at least one replica of the application running and avoid cold-start latency.

After clicking 'Deploy', your application will be up and running at the specified URL. In this example, the domain of the cluster is runai-poc.com. Access your application by navigating to the URL and see it in action.
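You can also verify the endpoint programmatically. The snippet below is a minimal sketch; the URL is a hypothetical example based on this demo's cluster domain, and the actual address depends on the deployment name you chose and your cluster's configuration, so substitute the URL shown in the Run:ai interface.

# check_deployment.py -- verify that the deployed application is reachable
import requests

# Hypothetical example URL; replace with the address shown for your deployment.
deployment_url = "https://gpt2-demo.runai-poc.com"

response = requests.get(deployment_url, timeout=10)
print("Status code:", response.status_code)  # 200 indicates the application is up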

Conclusion

Deploying Hugging Face models with Run:ai offers numerous benefits for data science teams. By leveraging its capabilities, you can simplify the deployment process, improve scalability, and enhance resource management in your Kubernetes cluster. With Run:ai, you can focus on developing and fine-tuning your favorite Hugging Face models without worrying about infrastructure complexities.