Model Platform
Learn how to deploy and manage machine learning models in Chalk
With Chalk, you can deploy machine learning models as isolated services running in dedicated scaling groups. This approach allows your models to run with their own compute resources, auto-scaling policies, and independent lifecycle management—separate from the Chalk engine itself.
This is different from the traditional approach of including models directly in Chalk feature resolvers. Instead of embedding model inference within your feature computation, model deployments host your models as standalone services that can be called from resolvers or external applications.
Model deployments are ideal when you want to:

- run models on dedicated compute resources, separate from the Chalk engine
- scale inference independently, with its own auto-scaling policies
- manage a model's release lifecycle independently of your feature definitions
- call the same model from feature resolvers and from external applications
To deploy models to scaling groups, register them with a Docker image reference instead of local model files or Python model objects.
```python
from chalk.client import ChalkClient
import pyarrow as pa

client = ChalkClient()

# Register the model version with a Docker image
client.register_model_version(
    name="my-model",
    input_schema={"text": pa.large_string()},
    output_schema={"entities": pa.large_string()},
    model_image="my-model-image:latest",
)
```

Once registered, deploy a model version to a scaling group with resource specifications and auto-scaling policies.
```python
from chalk.client import ChalkClient
from chalk.scaling import AutoScalingSpec, ScalingGroupResourceRequest

client = ChalkClient()

# Deploy the model version to a scaling group
scaling_group = client.deploy_model_version_to_scaling_group(
    name="my-model-scaling-group",
    model_name="my-model",
    model_version=1,
    scaling=AutoScalingSpec(
        min_replicas=1,
        max_replicas=2,
        target_cpu_utilization_percentage=70,
    ),
    resources=ScalingGroupResourceRequest(
        cpu="2",
        memory="4Gi",
    ),
)
```

Control how your model deployment scales based on demand using AutoScalingSpec.
```python
from chalk.scaling import AutoScalingSpec

# Configure auto-scaling behavior
scaling = AutoScalingSpec(
    min_replicas=1,                        # Minimum number of replicas
    max_replicas=5,                        # Maximum number of replicas
    target_cpu_utilization_percentage=70,  # Target CPU utilization (optional)
)
```

Chalk automatically scales the number of replicas based on inference request load and CPU utilization, staying within your min/max bounds. This ensures your models handle traffic spikes efficiently without wasting resources during quiet periods.
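Chalk does not document the exact scaling formula, so as a mental model only, here is a sketch of the standard Kubernetes-style target-utilization rule that this kind of autoscaler typically follows (the function name and formula are my assumptions, not Chalk's API):

```python
import math


def desired_replicas(current_replicas: int,
                     current_cpu_pct: float,
                     target_cpu_pct: float,
                     min_replicas: int,
                     max_replicas: int) -> int:
    """Illustrative target-utilization scaling rule (not Chalk's actual code).

    Scale proportionally to how far current CPU utilization is from the
    target, then clamp the result to the configured replica bounds.
    """
    raw = math.ceil(current_replicas * current_cpu_pct / target_cpu_pct)
    return max(min_replicas, min(max_replicas, raw))


# With the spec above (min=1, max=5, target=70%): two replicas running at
# 140% CPU would scale out to four; two replicas at 20% would scale in to one.
scale_out = desired_replicas(2, 140, 70, 1, 5)  # 4
scale_in = desired_replicas(2, 20, 70, 1, 5)    # 1
```

The key property is the clamp: no matter how far utilization drifts from the target, the replica count never leaves the `min_replicas`/`max_replicas` range you configured.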
Specify CPU, memory, and GPU resources for each replica of your model using ScalingGroupResourceRequest.
```python
from chalk.scaling import ScalingGroupResourceRequest

# Request resources per replica
resources = ScalingGroupResourceRequest(
    cpu="2",                  # CPU allocation per replica
    memory="4Gi",             # Memory allocation per replica
    gpu="nvidia-tesla-t4:1",  # Optional: GPU type and count
)
```

Each replica gets the specified resources. When Chalk scales from 1 to 3 replicas, total resource usage is multiplied accordingly (e.g., 3 replicas × 2 CPU = 6 CPU total).
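Following that arithmetic, a quick back-of-the-envelope sizing helper (a sketch — the parsing assumes the plain-number CPU and `"…Gi"` memory string formats shown in the example above):

```python
def worst_case_footprint(cpu_per_replica: str,
                         memory_per_replica: str,
                         max_replicas: int) -> tuple[float, float]:
    """Total CPU cores and memory (GiB) if the group scales to max_replicas."""
    total_cpu = float(cpu_per_replica) * max_replicas
    total_mem_gib = float(memory_per_replica.removesuffix("Gi")) * max_replicas
    return total_cpu, total_mem_gib


# The example above: 3 replicas at 2 CPU / 4Gi each.
footprint = worst_case_footprint("2", "4Gi", 3)  # (6.0, 12.0)
```

Sizing against `max_replicas` rather than `min_replicas` tells you the peak capacity your cluster must be able to provide, not just the steady-state cost.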
Models deployed to scaling groups can be called from Chalk feature resolvers using the catalog_call function with the scaling group name.
```python
from chalk.features import features, _
from chalk import functions as F


@features
class Document:
    id: int
    text: str
    entities: str = F.catalog_call(
        "model.my-model-scaling-group",
        _.text,
    )
```

The catalog call format is `model.{scaling_group_name}`.
You can pass multiple inputs by providing them as additional arguments:
```python
@features
class Request:
    id: int
    input_a: str
    input_b: str
    output: str = F.catalog_call(
        "model.my-model-scaling-group",
        _.input_a,
        _.input_b,
    )
```

The order of arguments must match the order of fields in your model's input_schema.
Deploy a new version of a model to an existing scaling group:
```python
# Register a new model version
new_version = client.register_model_version(
    name="my-model",
    model_image="my-model:v2.0",
    input_schema={"text": pa.large_string()},
    output_schema={"entities": pa.large_string()},
)

# Update the scaling group with the new version
client.deploy_model_version_to_scaling_group(
    name="my-model-scaling-group",
    model_name="my-model",
    model_version=new_version.model_version,
)
```

For more information on listing, inspecting, and deleting scaling groups, see the Scaling Groups page.
Model registration and deployment should be controlled manually and separately from your feature definitions. Either:

- keep your registration and deployment scripts outside the directory synced by chalk apply, or
- add them to .chalkignore to prevent them from running during chalk apply.

Your chalk apply will fail if it tries to run model registration and deployment code.
Organize your project to keep model management separate from feature definitions:
```
my-chalk-project/
|- models/                 # Model deployment code (add to .chalkignore)
|  |- Dockerfile
|  |- model.py
|  |- requirements.txt
|  `- deploy_model.py      # Registration + deployment script
|
|- features/               # Feature definitions (synced with chalk apply)
|  |- __init__.py
|  `- user_features.py
|
|- .chalkignore
`- chalk.yaml
```

Put the following line in your .chalkignore so chalk apply skips everything under models/.
```
models/
```

Model deployments use the chalk-remote-call-python shim to handle request routing and PyArrow serialization. Your Docker image should:

- install the chalk-remote-call-python package
- include your handler module (model.py in the example below)
- launch the chalk-remote-call entrypoint, pointing it at your handler function and port
Here’s a complete example using spaCy for named entity recognition:
Dockerfile:
```dockerfile
FROM python:3.11-slim

WORKDIR /app

RUN pip install --no-cache-dir chalk-remote-call-python spacy
RUN python -m spacy download en_core_web_sm

COPY model.py /app/model.py

ENV PYTHONPATH=/app
EXPOSE 8080

ENTRYPOINT ["chalk-remote-call", "--handler", "model.handler", "--port", "8080"]
```

Build and push to a registry:
```bash
docker build --platform linux/amd64 -t my-model:latest .
docker push my-model:latest
```

model.py:
"""Model using spaCy — chalk-remote-call handler convention.
This is an example customer model that uses chalk-remote-call-python.
The handler receives PyArrow Arrays and returns results.
"""
import json
import pyarrow as pa
import spacy
nlp = None
def on_startup():
"""Load the spaCy model once at startup."""
global nlp
print("Loading model...")
nlp = spacy.load("en_core_web_sm")
print("Model loaded!")
def handler(event: dict[str, pa.Array], context: dict) -> pa.Array:
"""Extract named entities from text.
Parameters
----------
event
Dictionary of PyArrow Arrays. Keys match your input_schema.
Example: {"text": pa.Array of strings}
context
Request metadata (peer address, headers, etc.)
Returns
-------
pa.Array
Output must be a single PyArrow Array matching your output_schema
"""
texts = event["text"].to_pylist()
results = []
for text, doc in zip(texts, nlp.pipe(texts, batch_size=32)):
if text is None:
results.append(None)
continue
entities = [
{
"text": ent.text,
"label": ent.label_,
"start": ent.start_char,
"end": ent.end_char,
}
for ent in doc.ents
]
results.append(json.dumps({"text": text, "entities": entities}))
return pa.array(results, type=pa.utf8())Your handler function must follow this signature:
```python
def handler(event: dict[str, pa.Array], context: dict) -> pa.Array:
    """
    Parameters
    ----------
    event : dict[str, pa.Array]
        Input data as PyArrow Arrays. Keys correspond to your input_schema fields.
    context : dict
        Request metadata (peer address, headers, etc.)

    Returns
    -------
    pa.Array
        Single PyArrow Array output matching your output_schema
    """
    # Your model inference logic here
    pass
```

Define an on_startup() function to initialize resources when the container starts:
```python
def on_startup():
    """Called once when the model service starts."""
    global nlp
    print("Initializing model...")
    nlp = spacy.load("en_core_web_sm")
    print("Model ready!")
```

This is useful for loading model weights into memory, downloading artifacts, and creating clients or connections that should be initialized once rather than on every request.