Large Language Models (LLMs) dominate our lives. I don’t mean that literally (at least for the time being), but rather that every day seems to bring a wealth of new models and new use cases. What started as a much-hyped proprietary service from OpenAI exploded into the open-source domain after the release of Meta’s LLaMA and the open-source models that followed it.
If you’re anything like me, you may have found yourself wanting to use an open-source LLM. Personally, part of my day job is analyzing and evaluating different models, so I wanted an effective way to deploy an LLM and its interface so I could play with its capabilities.
Before diving into the implementation’s details, here are the main requirements I put down:
- Ease of access: accessing the model should be done via a web UI
- Persistence: the setup should survive a restart after rebooting my laptop/cloud instance
- Scalability: I want to be able to scale the solution’s throughput by adding more computing power to run the model
- Availability: while not a strict must, I’d like to be able to provide minimal service guarantees by protecting the setup from an occasional node failure
- Foolproof: ideally, I’d like to type a single command to automatically spin the whole thing up.
What I came up with serves me well (pun intended). However, I figured it can also help anyone who wishes to set up a similar service for personal or even inter-organizational use. If you’re that person, read on.
My solution relies on Kubernetes for all the heavy lifting of infrastructure, including state persistence, availability, and scalability. As a bonus, this approach gives me total flexibility in deploying it locally, to the cloud, or to a managed K8s service.
For demonstration purposes, the application uses GPT-2 from Hugging Face, but you can choose any other model in its stead. I chose this LLM because larger models require more resources and may not perform well on the average laptop.
Lastly, the application’s UI is implemented with Gradio. Gradio automatically generates a no-fuss, fully functional web interface tailored to the selected open-source model.
To get started, you’ll need to be able to create container images. I’m using Docker for that, but you can use any alternative. The instructions for downloading and installing Docker can be found at: https://docs.docker.com/engine/install/.
Next, you’ll need access to a K8s cluster with a `cluster-admin` role, via a `kubeconfig` file that `kubectl` can use. I like using kind for creating and managing my local dockerized cluster, but that’s just me. You can definitely use the tooling you’re used to, but if you want to give kind a try, just follow the installation quick start at: https://kind.sigs.k8s.io/docs/user/quick-start/#installation.
At this point, we can create our minimal cluster, which consists of control plane and worker nodes. To do that, first, create a “cluster.yaml” file with the following contents:
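A minimal kind configuration along these lines should do (the worker count here is just an illustrative choice; adjust it to taste):

```yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  # One control-plane node to run the cluster components
  - role: control-plane
  # Two workers so the deployment can spread across nodes
  - role: worker
  - role: worker
```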
Then, create and export the kind cluster with these commands from your shell prompt:
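Something like the following should work; the cluster name `llm` is an arbitrary choice used throughout these examples:

```shell
# Create the cluster from the config file
kind create cluster --name llm --config cluster.yaml

# Point the current kubeconfig context at the new cluster
kind export kubeconfig --name llm
```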
Let’s create a containerized Python application that provides the interface and runs the GPT-2 model. First, create a file called “app.py” with the following contents:
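Here’s a minimal sketch of such an app, using the Hugging Face `transformers` pipeline API and a basic Gradio `Interface`; the generation parameters are illustrative, not prescriptive:

```python
# app.py -- a minimal Gradio front end for GPT-2
import gradio as gr
from transformers import pipeline

# Download (on first run) and load the GPT-2 model
generator = pipeline("text-generation", model="gpt2")

def generate(prompt: str) -> str:
    # Generate a single continuation of the prompt
    result = generator(prompt, max_new_tokens=100, num_return_sequences=1)
    return result[0]["generated_text"]

demo = gr.Interface(fn=generate, inputs="text", outputs="text", title="GPT-2")

# Bind to all interfaces so the app is reachable from outside the container
demo.launch(server_name="0.0.0.0", server_port=7860)
```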
And then a “requirements.txt” file:
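At a minimum, it needs to list the application’s dependencies (unpinned here for brevity; pinning versions is a good idea in practice):

```
transformers
torch
gradio
```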
Next, define the application’s container image by adding a new “Dockerfile” file with the following:
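A straightforward Dockerfile along these lines should do; the base image version is an arbitrary choice:

```dockerfile
FROM python:3.11-slim

WORKDIR /app

# Install dependencies first so this layer is cached across code changes
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY app.py .

# Gradio's default port, matched by the launch() call in app.py
EXPOSE 7860

CMD ["python", "app.py"]
```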
Now, create the image by pasting this command to the prompt and running it:
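Assuming you tag the image `gpt2-gradio` (an arbitrary name used consistently in these examples):

```shell
docker build -t gpt2-gradio:latest .
```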
Finally, load the image to the cluster’s nodes:
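With kind, that’s a single command (adjust the cluster name if yours differs):

```shell
kind load docker-image gpt2-gradio:latest --name llm
```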
Application and service deployment
Start by defining the application’s deployment by creating the “deployment.yaml” file:
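A minimal deployment might look like the following. The names and replica count are illustrative; note `imagePullPolicy: Never`, which tells the kubelet to use the image loaded onto the nodes instead of pulling from a registry:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpt2
spec:
  replicas: 2
  selector:
    matchLabels:
      app: gpt2
  template:
    metadata:
      labels:
        app: gpt2
    spec:
      containers:
        - name: gpt2
          image: gpt2-gradio:latest
          # Use the image loaded into the kind nodes; don't try to pull it
          imagePullPolicy: Never
          ports:
            - containerPort: 7860
```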
And applying it to the K8s cluster:
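Applying it is the standard one-liner:

```shell
kubectl apply -f deployment.yaml
```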
Now, define the service by creating a “service.yaml” file:
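A plain ClusterIP service is enough here, since we’ll reach it through port forwarding; the selector must match the labels on the deployment’s pods:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: gpt2
spec:
  selector:
    app: gpt2
  ports:
    - port: 7860
      targetPort: 7860
```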
And, again, apply it to the cluster:
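Same as before:

```shell
kubectl apply -f service.yaml
```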
Because spinning up the pod may take a few moments to complete, you can monitor the progress with the following command:
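A simple watch on the pods does the trick; press Ctrl-C once the STATUS column shows Running:

```shell
kubectl get pods --watch
```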
Last, but not least, you’ll need to enable port forwarding for the service so it can be accessed from the browser:
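Forwarding local port 7860 to the service (named `gpt2` in the sketches above) keeps the URL matching Gradio’s default:

```shell
kubectl port-forward service/gpt2 7860:7860
```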
That’s practically all there is to it. You can now point your browser to http://localhost:7860/ and start using the GPT-2 LLM:
Of course, you can customize the build to use other open-source models, for example, LLaMA by Meta or Falcon by TII. If you run into any issues or have suggestions for improvements, I accept issues and pull requests in the repository :)