Model Platform
Learn how to deploy and manage machine learning models in Chalk
With Chalk, you can deploy machine learning models as isolated services running in dedicated scaling groups. This approach allows your models to run with their own compute resources, auto-scaling policies, and independent lifecycle management—separate from the Chalk engine itself.
This is different from the traditional approach of including models directly in Chalk feature resolvers. Instead of embedding model inference within your feature computation, model deployments host your models as standalone services that can be called from resolvers or external applications.
Model deployments are ideal when you want to:

- run models on dedicated compute resources, separate from the Chalk engine
- scale inference independently, with its own auto-scaling policies
- manage a model's release lifecycle independently of your feature definitions
- call the same model from feature resolvers and from external applications
To deploy models to scaling groups, register them with a Docker image reference instead of local model files or Python model objects.
```python
from chalk.client import ChalkClient
import pyarrow as pa

client = ChalkClient()

# Register the model version with a Docker image
client.register_model_version(
    name="my-model",
    input_schema={"text": pa.large_string()},
    output_schema={"entities": pa.large_string()},
    model_image="my-model-image:latest",
)
```

Once registered, deploy a model version to a scaling group with resource specifications and auto-scaling policies.
```python
from chalk.client import ChalkClient
from chalk.scaling import AutoScalingSpec, ScalingGroupResourceRequest

client = ChalkClient()

# Deploy the model version to a scaling group
scaling_group = client.deploy_model_version_to_scaling_group(
    name="my-model-scaling-group",
    model_name="my-model",
    model_version=1,
    scaling=AutoScalingSpec(
        min_replicas=1,
        max_replicas=2,
        target_cpu_utilization_percentage=70,
    ),
    resources=ScalingGroupResourceRequest(
        cpu="2",
        memory="4Gi",
    ),
)
```

Control how your model deployment scales based on demand using AutoScalingSpec.
```python
from chalk.scaling import AutoScalingSpec

# Configure auto-scaling behavior
scaling = AutoScalingSpec(
    min_replicas=1,                        # Minimum number of replicas
    max_replicas=5,                        # Maximum number of replicas
    target_cpu_utilization_percentage=70,  # Target CPU utilization (optional)
)
```

Chalk automatically scales the number of replicas based on inference request load and CPU utilization, staying within your min/max bounds. This ensures your models handle traffic spikes efficiently without wasting resources during quiet periods.
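Chalk does not document the exact scaling formula, so as a mental model only, here is a sketch of the standard Kubernetes-style target-utilization rule that this kind of autoscaler typically follows (the function name and formula are my assumptions, not Chalk's API):

```python
import math


def desired_replicas(current_replicas: int,
                     current_cpu_pct: float,
                     target_cpu_pct: float,
                     min_replicas: int,
                     max_replicas: int) -> int:
    """Illustrative target-utilization scaling rule (not Chalk's actual code).

    Scale proportionally to how far current CPU utilization is from the
    target, then clamp the result to the configured replica bounds.
    """
    raw = math.ceil(current_replicas * current_cpu_pct / target_cpu_pct)
    return max(min_replicas, min(max_replicas, raw))


# With the spec above (min=1, max=5, target=70%): two replicas running at
# 140% CPU would scale out to four; two replicas at 20% would scale in to one.
scale_out = desired_replicas(2, 140, 70, 1, 5)  # 4
scale_in = desired_replicas(2, 20, 70, 1, 5)    # 1
```

The key property is the clamp: no matter how far utilization drifts from the target, the replica count never leaves the `min_replicas`/`max_replicas` range you configured.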
Specify CPU, memory, and GPU resources for each replica of your model using ScalingGroupResourceRequest.
```python
from chalk.scaling import ScalingGroupResourceRequest

# Request resources per replica
resources = ScalingGroupResourceRequest(
    cpu="2",                  # CPU allocation per replica
    memory="4Gi",             # Memory allocation per replica
    gpu="nvidia-tesla-t4:1",  # Optional: GPU type and count
)
```

Each replica gets the specified resources. When Chalk scales from 1 to 3 replicas, total resource usage is multiplied accordingly (e.g., 3 replicas × 2 CPU = 6 CPU total).
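Following that arithmetic, a quick back-of-the-envelope sizing helper (a sketch — the parsing assumes the plain-number CPU and `"…Gi"` memory string formats shown in the example above):

```python
def worst_case_footprint(cpu_per_replica: str,
                         memory_per_replica: str,
                         max_replicas: int) -> tuple[float, float]:
    """Total CPU cores and memory (GiB) if the group scales to max_replicas."""
    total_cpu = float(cpu_per_replica) * max_replicas
    total_mem_gib = float(memory_per_replica.removesuffix("Gi")) * max_replicas
    return total_cpu, total_mem_gib


# The example above: 3 replicas at 2 CPU / 4Gi each.
footprint = worst_case_footprint("2", "4Gi", 3)  # (6.0, 12.0)
```

Sizing against `max_replicas` rather than `min_replicas` tells you the peak capacity your cluster must be able to provide, not just the steady-state cost.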
Models deployed to scaling groups can be called from Chalk feature resolvers using the catalog_call function with the scaling group name.
```python
from chalk.features import features, _
from chalk import functions as F


@features
class Document:
    id: int
    text: str
    entities: str = F.catalog_call(
        "model.my-model-scaling-group",
        _.text,
    )
```

The catalog call format is `model.{scaling_group_name}`.
You can pass multiple inputs by providing them as additional arguments:
```python
@features
class Request:
    id: int
    input_a: str
    input_b: str
    output: str = F.catalog_call(
        "model.my-model-scaling-group",
        _.input_a,
        _.input_b,
    )
```

The order of arguments must match the order of fields in your model's input_schema.
Deploy a new version of a model to an existing scaling group:
```python
# Register a new model version
new_version = client.register_model_version(
    name="my-model",
    model_image="my-model:v2.0",
    input_schema={"text": pa.large_string()},
    output_schema={"entities": pa.large_string()},
)

# Update the scaling group with the new version
client.deploy_model_version_to_scaling_group(
    name="my-model-scaling-group",
    model_name="my-model",
    model_version=new_version.model_version,
)
```

For more information on listing, inspecting, and deleting scaling groups, see the Scaling Groups page.
Model registration and deployment should be controlled manually and separately from your feature definitions. Either:

- keep your registration and deployment scripts outside the directory synced by chalk apply, or
- add them to .chalkignore to prevent them from running during chalk apply.

Your chalk apply will fail if it tries to run model registration and deployment code.
Organize your project to keep model management separate from feature definitions:
```
my-chalk-project/
|- models/                 # Model deployment code (add to .chalkignore)
|  |- Dockerfile
|  |- model.py
|  |- requirements.txt
|  `- deploy_model.py      # Registration + deployment script
|
|- features/               # Feature definitions (synced with chalk apply)
|  |- __init__.py
|  `- user_features.py
|
|- .chalkignore
`- chalk.yaml
```

Put the following line in your .chalkignore so chalk apply skips everything under models/.
```
models/
```

Model deployments use the chalk-remote-call-python shim to handle request routing and PyArrow serialization. Your Docker image should:

- install the chalk-remote-call-python package
- include your handler module (model.py in the example below)
- launch the chalk-remote-call entrypoint, pointing it at your handler function and port
Here’s a complete example using spaCy for named entity recognition:
Dockerfile:
```dockerfile
FROM python:3.11-slim

WORKDIR /app

RUN pip install --no-cache-dir chalk-remote-call-python spacy
RUN python -m spacy download en_core_web_sm

COPY model.py /app/model.py

ENV PYTHONPATH=/app
EXPOSE 8080

ENTRYPOINT ["chalk-remote-call", "--handler", "model.handler", "--port", "8080"]
```

Build and push to a registry:
```bash
docker build --platform linux/amd64 -t my-model:latest .
docker push my-model:latest
```

model.py:
"""Model using spaCy — chalk-remote-call handler convention.
This is an example customer model that uses chalk-remote-call-python.
The handler receives PyArrow Arrays and returns results.
"""
import json
import pyarrow as pa
import spacy
nlp = None
def on_startup():
"""Load the spaCy model once at startup."""
global nlp
print("Loading model...")
nlp = spacy.load("en_core_web_sm")
print("Model loaded!")
def handler(event: dict[str, pa.Array], context: dict) -> pa.Array:
"""Extract named entities from text.
Parameters
----------
event
Dictionary of PyArrow Arrays. Keys match your input_schema.
Example: {"text": pa.Array of strings}
context
Request metadata (peer address, headers, etc.)
Returns
-------
pa.Array
Output must be a single PyArrow Array matching your output_schema
"""
texts = event["text"].to_pylist()
results = []
for text, doc in zip(texts, nlp.pipe(texts, batch_size=32)):
if text is None:
results.append(None)
continue
entities = [
{
"text": ent.text,
"label": ent.label_,
"start": ent.start_char,
"end": ent.end_char,
}
for ent in doc.ents
]
results.append(json.dumps({"text": text, "entities": entities}))
return pa.array(results, type=pa.utf8())Your handler function must follow this signature:
```python
def handler(event: dict[str, pa.Array], context: dict) -> pa.Array:
    """
    Parameters
    ----------
    event : dict[str, pa.Array]
        Input data as PyArrow Arrays. Keys correspond to your input_schema fields.
    context : dict
        Request metadata (peer address, headers, etc.)

    Returns
    -------
    pa.Array
        Single PyArrow Array output matching your output_schema
    """
    # Your model inference logic here
    pass
```

Define an on_startup() function to initialize resources when the container starts:
```python
def on_startup():
    """Called once when the model service starts."""
    global nlp
    print("Initializing model...")
    nlp = spacy.load("en_core_web_sm")
    print("Model ready!")
```

This is useful for loading model weights into memory, downloading artifacts, and creating clients or connections that should be initialized once rather than on every request.