Model Inference with vLLM

Overview

Running your own inference endpoint gives you full control over model selection, GPU allocation, and cost. Chalk Compute supports autoscaling containers with GPU access — deploy a vLLM server once and let Chalk scale it from zero to your peak traffic.

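This tutorial deploys a Gemma model using vLLM with autoscaling and persistent model caching. Resource names in the examples use a gemma4 prefix, while the model ID passed to vLLM is google/gemma-3-27b-it; substitute the checkpoint you intend to serve.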


Cache model weights with a volume

Large model files (multi-GB) should live in a Volume so they persist across container restarts and are shared across replicas. This avoids re-downloading weights every time a container scales up.

from chalkcompute import Volume

vol = Volume(name="gemma4-weights")

On first boot, vLLM downloads the model into the Hugging Face cache directory. Because the volume is mounted at that path, later containers, including new replicas created by autoscaling, find the weights already cached and skip the download. They only need to load the weights from the volume before serving.
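The mount path matters because the Hugging Face libraries choose their cache location from environment variables. The sketch below approximates the documented lookup order (`HF_HUB_CACHE`, then `HF_HOME`, then `XDG_CACHE_HOME`, then `~/.cache/huggingface`). The helper name `resolve_hf_hub_cache` is hypothetical, and this is an illustration of the precedence rather than the library's own code.

```python
import os
from pathlib import Path


def resolve_hf_hub_cache(env: dict[str, str], home: str) -> Path:
    """Approximate where huggingface_hub stores downloaded model files."""
    # An explicit hub cache path wins over everything else.
    if env.get("HF_HUB_CACHE"):
        return Path(env["HF_HUB_CACHE"])
    # HF_HOME is the root for Hugging Face data; the hub cache lives in "hub".
    if env.get("HF_HOME"):
        return Path(env["HF_HOME"]) / "hub"
    # Without HF_HOME, the root falls back to the XDG cache dir or ~/.cache.
    cache_root = env.get("XDG_CACHE_HOME") or str(Path(home) / ".cache")
    return Path(cache_root) / "huggingface" / "hub"


if __name__ == "__main__":
    # Resolve for the current process, for example inside the running container.
    print(resolve_hf_hub_cache(dict(os.environ), str(Path.home())))
```

Assuming the container runs as root and sets none of these variables, the result falls under /root/.cache/huggingface, which is the mount path used in this tutorial. If your image sets `HF_HOME` or `HF_HUB_CACHE`, mount the volume at that location instead.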


Define the container

Create deploy_gemma4.py:

from chalkcompute import Container, Image, Volume

vol = Volume(name="gemma4-weights")

image = (
    Image.base("vllm/vllm-openai:latest")
    .run_commands(
        "pip install huggingface_hub",
    )
)

container = Container(
    image=image,
    name="gemma4-vllm",
    env={
        "HF_TOKEN": "hf_...",                # Hugging Face access token (placeholder)
        "HUGGING_FACE_HUB_TOKEN": "hf_...",  # legacy name for the same token
    },
    port=8000,
    volumes={"gemma4-weights": "/root/.cache/huggingface"},
    min_instances=1,
    max_instances=4,
    max_concurrent_requests=32,
    entrypoint=[
        "python", "-m", "vllm.entrypoints.openai.api_server",
        "--model", "google/gemma-3-27b-it",
        "--host", "0.0.0.0",
        "--port", "8000",
        "--tensor-parallel-size", "1",
        "--max-model-len", "8192",
        "--dtype", "auto",
    ],
).run()

print(f"Inference endpoint: {container.info.web_url}")
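The `hf_...` values in `env` are placeholders, and a real token is easy to commit by accident if it is pasted into the script. One option, sketched here, is to read the token from the shell environment at deploy time. The helper `hf_token_env` is hypothetical and not part of `chalkcompute`; it only builds the dictionary that the example passes as `env`.

```python
import os


def hf_token_env(environ=os.environ) -> dict[str, str]:
    """Build the container env mapping from a token in the local environment."""
    token = environ.get("HF_TOKEN", "").strip()
    if not token:
        raise RuntimeError("Set HF_TOKEN in your shell before deploying.")
    # HUGGING_FACE_HUB_TOKEN is the legacy name; set both, as the example does.
    return {"HF_TOKEN": token, "HUGGING_FACE_HUB_TOKEN": token}
```

With this in deploy_gemma4.py, the container definition would take `env=hf_token_env()` in place of the literal dictionary.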

Key parameters

Parameter                     Purpose
min_instances=1               Keep at least one replica warm to avoid cold starts.
max_instances=4               Scale up to 4 replicas under load.
max_concurrent_requests=32    Each replica handles up to 32 concurrent requests before Chalk routes to another.
volumes={...}                 Mount the weight cache so new replicas skip the download.

Deploy it

chalk compute deploy deploy_gemma4.py
# ✓ Container created successfully
# Container ID: c9d4e71a-5f23-48b6-a0e3-7824bc19d5f6
# Name: gemma4-vllm
# Status: Running
# Pod Name: chalk-container-gemma4-vllm
# URL: https://c9d4e71a-5f23-48b6-a0e3-7824bc19d5f6.compute.chalk.ai

Query the endpoint

vLLM exposes an OpenAI-compatible API. Point any OpenAI client at your container URL:

from openai import OpenAI

client = OpenAI(
    base_url="https://c9d4e71a-5f23-48b6-a0e3-7824bc19d5f6.compute.chalk.ai/v1",
    api_key="not-needed",  # no auth required within Chalk
)

response = client.chat.completions.create(
    model="google/gemma-3-27b-it",
    messages=[
        {"role": "user", "content": "Explain feature stores in two sentences."},
    ],
)

print(response.choices[0].message.content)

Or with curl:

curl https://c9d4e71a-5f23-48b6-a0e3-7824bc19d5f6.compute.chalk.ai/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/gemma-3-27b-it",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
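The `model` field in these requests has to match the name vLLM is serving. A quick way to check is the OpenAI-style GET /v1/models route, which the vLLM server also exposes. The sketch below uses only the standard library; the function names are hypothetical, and it assumes the endpoint is reachable without authentication, as in the examples above.

```python
import json
import urllib.request


def served_model_ids(models_response: dict) -> list[str]:
    """Extract model IDs from an OpenAI-style GET /v1/models response body."""
    return [entry["id"] for entry in models_response.get("data", [])]


def list_models(base_url: str, timeout: float = 10.0) -> list[str]:
    """Fetch the model IDs the server reports; raises on network or HTTP errors."""
    url = base_url.rstrip("/") + "/v1/models"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return served_model_ids(json.load(resp))
```

Calling `list_models` with your container URL (without the /v1 suffix) should return a list containing the value passed to `--model`. A successful response also confirms that the replica has finished loading.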

Scaling behavior

Chalk monitors the request queue across all replicas. When max_concurrent_requests is reached on every running instance, a new replica spins up — pulling weights from the shared volume instead of re-downloading them.

When traffic drops, Chalk scales back down to min_instances. Set min_instances=0 for development workloads where cold starts are acceptable.

# Dev configuration — scale to zero when idle
container = Container(
    image=image,
    name="gemma4-dev",
    port=8000,
    volumes={"gemma4-weights": "/root/.cache/huggingface"},
    min_instances=0,
    max_instances=2,
    max_concurrent_requests=8,
    entrypoint=[...],
).run()
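To reason about these settings, the scaling rule described above can be modeled as a small function. This is a simplified sketch, not Chalk's actual autoscaler: it assumes replicas are added as soon as in-flight requests exceed total capacity, and it ignores scale-up delay, scale-down cooldowns, and the exact threshold Chalk uses. The name `desired_replicas` is hypothetical.

```python
import math


def desired_replicas(in_flight: int, max_concurrent_requests: int,
                     min_instances: int, max_instances: int) -> int:
    """Simplified model: enough replicas to hold in-flight requests, clamped."""
    needed = math.ceil(in_flight / max_concurrent_requests)
    return max(min_instances, min(max_instances, needed))


# Production settings from this tutorial: 1 to 4 replicas, 32 requests each.
for load in (0, 32, 33, 200):
    print(load, desired_replicas(load, 32, 1, 4))

# Dev settings: 0 to 2 replicas, 8 requests each; idle traffic needs no replica.
print(desired_replicas(0, 8, 0, 2))
```

Under this model the production configuration tops out at 4 × 32 = 128 concurrent requests, and the dev configuration at 2 × 8 = 16. How requests beyond that ceiling are handled is not covered here.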