How to deploy a GPT chatbot on K8s with Gradio Interface

July 27, 2023


Large Language Models (LLMs) dominate our lives. I don’t mean that literally (at least for the time being), but rather that every day seems to bring a wealth of new models and new use cases. What started as a much-hyped proprietary service from OpenAI has exploded into the open-source domain since the release of Meta’s LLaMA and the open-source models that followed.

If you’re anything like me, you may have found yourself wanting to use an open-source LLM. Personally, part of my day job is analyzing and evaluating different models, so I wanted an effective way to deploy an LLM and its interface and play with its capabilities.

Before diving into the implementation’s details, here are the main requirements I put down:

  • Ease of access: accessing the model should be done via a web UI
  • Persistence: the setup can be restarted after rebooting my laptop/cloud instance
  • Scalability: I want to be able to scale the solution’s throughput by adding more computing power to run the model
  • Availability: while not a strict must, I’d like to be able to provide minimal service guarantees by protecting the setup from an occasional node failure
  • Foolproof: ideally, I’d like to type a single command to automatically spin the whole thing up.

What I came up with serves me well (pun intended). However, I figured it can also help anyone who wishes to set up a similar service for personal or even inter-organizational use. If you’re that person, read on.

My solution relies on Kubernetes for all the heavy lifting of infrastructure, including state persistence, availability, and scalability. As a bonus, this approach gives me total flexibility in deploying it locally, to the cloud, or to a managed K8s service.
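
For example, once the Deployment defined later in this post is running, serving capacity can be scaled out with a single command. This is just a sketch of what Kubernetes buys you here, and it assumes the cluster has enough resources for the extra replicas:


kubectl scale deployment/model --replicas=3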

For demonstration purposes, the application uses GPT-2 from Hugging Face, but you can choose any other model in its stead. I chose this LLM because larger models require more resources and may not perform well on the average laptop.

Lastly, the application’s UI is implemented with Gradio. Gradio automatically generates a no-fuss fully-functional web interface that’s tailored to the selected open-source model.

Prerequisites

To get started, you’ll need to be able to create container images. I’m using Docker for that, but you can use any alternative. The instructions for downloading and installing Docker can be found at: https://docs.docker.com/engine/install/.

Next, you’ll need access to a K8s cluster via `kubectl` (configured through a kubeconfig file) and a `cluster-admin` role. I like using kind for creating and managing my local dockerized cluster, but that’s just me. You can definitely use the tooling you’re used to, but if you want to give kind a try, just follow the installation quick start at: https://kind.sigs.k8s.io/docs/user/quick-start/#installation.
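
If you want a quick sanity check that the tools are installed and on your path, the standard version commands for Docker, kubectl, and kind should all succeed:


docker --version
kubectl version --client
kind version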

Cluster creation

At this point, we can create our minimal cluster, which consists of a control-plane node and a worker node. To do that, first, create a “cluster.yaml” file with the following contents:


kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
- role: worker

Then, create and export the kind cluster with these commands from your shell prompt:


kind create cluster --name llm --config cluster.yaml
kind export kubeconfig --name llm
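
Once the commands return, you can confirm that kubectl points at the new cluster and that both nodes are up; with the cluster name used above, kind should name the nodes llm-control-plane and llm-worker:


kubectl cluster-info
kubectl get nodes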

Application container

Let’s create a containerized Python application that provides the interface and runs the GPT-2 model. First, create a file called “app.py” with the following contents:


from fastapi import FastAPI
import gradio as gr


# Create the FastAPI application that uvicorn serves
app = FastAPI()
# Load a Gradio interface for the gpt2 model hosted on Hugging Face
io = gr.Interface.load('models/gpt2')
# Mount the Gradio UI at the root path of the FastAPI app
app = gr.mount_gradio_app(app, io, '/')

And then a “requirements.txt” file:


torch==1.11.*
transformers==4.*
gradio
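
Note that FastAPI and uvicorn aren’t listed explicitly; they’re pulled in as dependencies of the gradio package. If you’d like to try the app outside the cluster first, a quick local run (in a virtual environment, say) might look like this:


pip install -r requirements.txt
uvicorn app:app --host 0.0.0.0 --port 7860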

Next, define the application’s container image by adding a new “Dockerfile” file with the following:


FROM python:3.9
WORKDIR /code
# Install the dependencies first so this layer is cached between builds
COPY ./requirements.txt /code/requirements.txt
RUN pip install --no-cache-dir --upgrade -r /code/requirements.txt
# Copy the application code (app.py) into the image
COPY . .
# Serve the FastAPI/Gradio app on port 7860 on all interfaces
CMD ["uvicorn", "--host", "0.0.0.0", "--port", "7860", "app:app"]

Now, create the image by running this command at the prompt:


docker build . -t model
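
If you want to make sure the image works before handing it to Kubernetes, you can run it directly with Docker and point your browser to http://localhost:7860/:


docker run --rm -p 7860:7860 model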

Finally, load the image to the cluster’s nodes:


kind load docker-image model --name llm
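
To verify the image landed on the worker node, you can list the images known to the node’s container runtime; kind nodes ship with crictl, so something like this should show the model image:


docker exec llm-worker crictl images | grep model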

Application and service deployment

Start by defining the application’s deployment by creating the “deployment.yaml” file:


apiVersion: apps/v1
kind: Deployment
metadata:
  name: model
  labels:
    app: model
spec:
  replicas: 1
  selector:
    matchLabels:
      app: model
  template:
    metadata:
      labels:
        app: model
    spec:
      containers:
      - name: model
        image: model:latest
        imagePullPolicy: IfNotPresent
        ports:
        - containerPort: 7860

Then apply it to the K8s cluster:


kubectl apply -f deployment.yaml

Now, define the service by creating a “service.yaml” file:


apiVersion: v1
kind: Service
metadata:
  name: model
spec:
  selector:
    app: model
  ports:
    - name: http
      protocol: TCP
      port: 7860
      targetPort: 7860

And, again, apply it to the cluster:


kubectl apply -f service.yaml

Because spinning up the pod may take a few moments, you can monitor its progress with the following command:


kubectl get pods
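
If you prefer to block until the pod reports ready and then check the application logs, the standard kubectl commands below work as well (the 300-second timeout is an arbitrary choice):


kubectl wait --for=condition=ready pod -l app=model --timeout=300s
kubectl logs -l app=model -f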

Last, but not least, you’ll need to enable port forwarding for the service so it can be accessed from the browser:


kubectl port-forward svc/model 7860:7860

That’s practically all there is to it. You can now point your browser to http://localhost:7860/ and start using the GPT-2 LLM.
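
If you prefer the command line, a quick way to confirm the service is answering through the forwarded port is to fetch the UI page; any HTML response means the app is up:


curl -s http://localhost:7860/ | head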

Of course, you can customize the build to use other open-source models, for example, LLaMA by Meta or Falcon by TII. If you run into any issues or have suggestions for improvements, I accept issues and pull requests in the repository :)