# Scaling Groups
source: https://docs.chalk.ai/docs/compute/scaling-groups

## Deploy replicated, HTTP-fronted services with autoscaling, GPU support, and automatic DNS routing.

A scaling group is a replicated, HTTP-fronted service managed by Chalk. Each
group is backed by a Kubernetes Deployment with rolling updates, automatic
service discovery, and CPU-utilization-based autoscaling between configurable
min and max replica counts. Use scaling groups for ML inference servers,
internal APIs, agent backends, and any long-lived service that needs to be
reachable from outside the cluster.

For workloads that don't need replication or HTTP fronting, use a
Container instead. For serverless, function-shaped
invocations, use Functions.

You can drive scaling groups in two ways: the Python SDK
(`chalkcompute.ScalingGroup`) or the `chalk scaling-group` CLI. Both surfaces
target the same underlying service, so groups created one way are visible and
manageable from the other.

### Quick start (SDK)

```
from chalkcompute import ScalingGroup, Image

img = (
    Image.debian_slim("3.12")
    .pip_install(["flask"])
    .add_local_file("./app.py", "/app/app.py")
    .entrypoint(["python", "/app/app.py"])
)

sg = ScalingGroup(
    image=img,
    name="hello-api",
    port=8080,
    min_replicas=1,
    max_replicas=3,
).deploy().wait_ready()

resp = sg.call("/health", method="GET")
print(resp.status_code, resp.text)

sg.delete()
```

`ScalingGroup(...).deploy()` builds the image, uploads any local files declared
via `add_local_file` / `add_local_dir`, creates the scaling group, and waits
until it reaches `Running`. Chain `.wait_ready()` to also wait for at least one
replica to be ready to serve traffic.

### Quick start (CLI)

```
$ chalk scaling-group create \
    --image=my-registry/my-inference-server:v1 \
    --name=inference \
    --port=8080
```

This creates a Kubernetes Deployment with a single replica running your image,
a headless Service, and an HTTPRoute that exposes the service over HTTPS.

You can tune replicas and resource limits at creation time:

```
$ chalk scaling-group create \
    --image=my-registry/my-inference-server:v1 \
    --name=inference \
    --port=8080 \
    --replicas=3 \
    --cpu=4 \
    --memory=8Gi
```

### Replicas and autoscaling

The SDK exposes the autoscaler controls directly on the ScalingGroup
constructor:

| Parameter                           | Default                 | Description                                                        |
| ----------------------------------- | ----------------------- | ------------------------------------------------------------------ |
| `min_replicas`                      | `1`                     | Lower bound for the autoscaler. Set to `0` to allow scale-to-zero. |
| `max_replicas`                      | `1`                     | Upper bound for the autoscaler.                                    |
| `target_cpu_utilization_percentage` | unset (cluster default) | Average CPU utilization the autoscaler targets.                    |
| `scaling_interval_seconds`          | unset                   | How often the autoscaler re-evaluates.                             |
| `shutdown_delay_seconds`            | unset                   | Grace period when removing replicas, to drain in-flight requests.  |

```
sg = ScalingGroup(
    image=img,
    name="inference",
    port=8080,
    min_replicas=2,
    max_replicas=20,
    target_cpu_utilization_percentage=70,
    shutdown_delay_seconds=30,
).deploy().wait_ready()
```

From the CLI, --replicas=N sets a fixed replica count. For dynamic
autoscaling, use the SDK or edit the underlying spec.
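
To make the autoscaler parameters concrete, the sketch below applies the
standard Kubernetes Horizontal Pod Autoscaler rule,
`desired = ceil(current * currentUtilization / targetUtilization)`, clamped to
the replica bounds. It assumes Chalk's autoscaler follows that documented
Kubernetes behavior, including the default 10% tolerance; `desired_replicas` is
a hypothetical helper, not part of `chalkcompute`.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_cpu_percent: float,
    target_cpu_percent: float,
    min_replicas: int,
    max_replicas: int,
    tolerance: float = 0.1,
) -> int:
    """Estimate the replica count a CPU-based autoscaler would pick.

    Assumption: mirrors the Kubernetes HPA formula. The real controller also
    applies stabilization windows and readiness checks that are omitted here.
    """
    ratio = current_cpu_percent / target_cpu_percent
    if abs(ratio - 1.0) <= tolerance:
        # Close enough to the target: leave the replica count unchanged.
        desired = current_replicas
    else:
        desired = math.ceil(current_replicas * ratio)
    # Clamp to the configured min_replicas / max_replicas bounds.
    return max(min_replicas, min(max_replicas, desired))


# 4 replicas averaging 90% CPU against a 70% target.
print(desired_replicas(4, 90, 70, min_replicas=2, max_replicas=20))  # 6
# Heavy overload is capped at max_replicas.
print(desired_replicas(10, 200, 70, min_replicas=2, max_replicas=20))  # 20
```

A lower `target_cpu_utilization_percentage` therefore scales out earlier, at
the cost of running more replicas for the same load.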

### CPU and memory resources

The `--cpu` and `--memory` flags (CLI) and the `cpu` / `memory` constructor
arguments (SDK) control the resource requests and limits for each replica.
When specified, both the Kubernetes request and limit are set to the same
value, guaranteeing your workload the resources it asks for.

```
$ chalk scaling-group create \
    --image=my-registry/my-app:v1 \
    --name=my-service \
    --port=8080 \
    --cpu=2 \
    --memory=4Gi
```

If you omit these flags, the following defaults are applied:

| Resource | Request | Limit |
| -------- | ------- | ----- |
| CPU      | `100m`  | `1`   |
| Memory   | `256Mi` | `1Gi` |

CPU values follow Kubernetes conventions: 1 means one full core, 500m means
half a core. Memory values use standard suffixes: Mi (mebibytes) and Gi
(gibibytes).
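
As a quick check on these units, the sketch below converts the quantity strings
into plain numbers. `parse_cpu` and `parse_memory` are hypothetical helpers
written for this page, and they handle only the millicore (`m`) and binary
(`Ki`, `Mi`, `Gi`, `Ti`) suffixes discussed here, not the full Kubernetes
quantity grammar.

```python
# Binary suffixes and their multipliers in bytes.
_BINARY_SUFFIXES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def parse_cpu(value: str) -> float:
    """Return CPU cores; "500m" means 500 millicores, i.e. half a core."""
    if value.endswith("m"):
        return float(value[:-1]) / 1000
    return float(value)


def parse_memory(value: str) -> int:
    """Return bytes for a quantity such as "256Mi" or "4Gi"."""
    for suffix, factor in _BINARY_SUFFIXES.items():
        if value.endswith(suffix):
            return int(float(value[: -len(suffix)]) * factor)
    # No suffix: the value is already a byte count.
    return int(value)


# The defaults from the table above: (request, limit).
print(parse_cpu("100m"), parse_cpu("1"))          # 0.1 1.0
print(parse_memory("256Mi"), parse_memory("1Gi"))  # 268435456 1073741824
```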

### GPU support

Scaling groups support GPU-accelerated workloads. The SDK and CLI both accept a
`type:count` value (or just `count`) for the GPU request:

```
sg = ScalingGroup(
    image=img,
    name="gpu-inference",
    port=8080,
    cpu="8",
    memory="32Gi",
    gpu="nvidia-l4:1",
    min_replicas=1,
    max_replicas=4,
).deploy().wait_ready()
```

```
$ chalk scaling-group create \
    --image=my-registry/my-gpu-model:v1 \
    --name=gpu-inference \
    --port=8080 \
    --gpu=nvidia-tesla-t4:1 \
    --cpu=4 \
    --memory=16Gi
```

The type portion (e.g. `nvidia-tesla-t4`) selects the right node pool via a
Kubernetes node selector (`cloud.google.com/gke-accelerator` on GKE), while
the count sets the `nvidia.com/gpu` resource request and limit on the
container. A toleration for `nvidia.com/gpu` is applied automatically so the
pods can schedule onto GPU-tainted nodes.

If you don't need to target a specific GPU type (for example on EKS where node
selection is handled differently), pass just the count:

```
$ chalk scaling-group create \
    --image=my-registry/my-gpu-model:v1 \
    --name=gpu-inference \
    --port=8080 \
    --gpu=1
```
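
The mapping from a GPU value to scheduling settings can be sketched as follows.
This is an illustration under assumptions, not Chalk's implementation:
`parse_gpu` and `gpu_pod_fragment` are hypothetical names, the dictionary only
mirrors the relevant Kubernetes pod-spec fields, and the `NoSchedule` effect on
the toleration is assumed.

```python
from typing import Optional, Tuple


def parse_gpu(value: str) -> Tuple[Optional[str], int]:
    """Split "type:count" (or a bare "count") into (type, count)."""
    gpu_type, sep, count = value.rpartition(":")
    if not sep:
        # No colon: the whole value is the count and no type is targeted.
        return None, int(value)
    return gpu_type, int(count)


def gpu_pod_fragment(value: str) -> dict:
    """Build the GPU-related pieces of a pod spec for a GPU value."""
    gpu_type, count = parse_gpu(value)
    fragment = {
        # The count becomes both the request and the limit.
        "resources": {
            "requests": {"nvidia.com/gpu": count},
            "limits": {"nvidia.com/gpu": count},
        },
        # Assumed toleration shape for GPU-tainted nodes.
        "tolerations": [
            {"key": "nvidia.com/gpu", "operator": "Exists", "effect": "NoSchedule"}
        ],
    }
    if gpu_type is not None:
        # Only a typed request pins the pod to a specific GKE node pool.
        fragment["nodeSelector"] = {"cloud.google.com/gke-accelerator": gpu_type}
    return fragment


print(parse_gpu("nvidia-tesla-t4:1"))  # ('nvidia-tesla-t4', 1)
print(parse_gpu("1"))                  # (None, 1)
```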

Available GPU types depend on what node pools are configured in your cluster.
Common GKE values include:

| GPU type            | Description                                           |
| ------------------- | ----------------------------------------------------- |
| `nvidia-tesla-t4`   | NVIDIA T4 (cost-effective inference)                  |
| `nvidia-tesla-a100` | NVIDIA A100 (high-performance training and inference) |
| `nvidia-l4`         | NVIDIA L4 (balanced inference)                        |
| `nvidia-tesla-v100` | NVIDIA V100 (training)                                |

To request multiple GPUs, increase the count:

```
$ chalk scaling-group create \
    --image=my-registry/my-training-server:v1 \
    --name=multi-gpu \
    --port=8080 \
    --gpu=nvidia-tesla-a100:4
```

### DNS and routing

Every scaling group is automatically assigned a TLS-terminated DNS hostname
through the cluster's gateway. The pattern is:

```
<environment-id>-<scaling-group-name>.<gateway-domain>
```

For example, a scaling group named `inference` in environment `abc123` with
gateway domain `gw.chalk.ai` would be reachable at:

```
https://abc123-inference.gw.chalk.ai
```

From the SDK, read it off the deployed group:

```
print(sg.web_url)
# https://abc123-inference.gw.chalk.ai
```

From the CLI, the URL is shown as `Web URL` in the output of
`chalk scaling-group get` and `chalk scaling-group list`. Traffic arriving at
this hostname is routed to the port you specified with `--port` (or the `port`
constructor argument).

Because the hostname includes the scaling group name, you can reference the
service by a stable, human-readable URL rather than an opaque UUID.
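
The hostname pattern can be written out as a small helper, which is handy when
a config file needs the URL before the group exists. `scaling_group_url` is a
hypothetical function for illustration; for a deployed group, prefer reading
`sg.web_url`.

```python
def scaling_group_hostname(environment_id: str, name: str, gateway_domain: str) -> str:
    """Apply the <environment-id>-<scaling-group-name>.<gateway-domain> pattern."""
    return f"{environment_id}-{name}.{gateway_domain}"


def scaling_group_url(environment_id: str, name: str, gateway_domain: str) -> str:
    """Return the HTTPS URL for the hostname (TLS terminates at the gateway)."""
    return "https://" + scaling_group_hostname(environment_id, name, gateway_domain)


print(scaling_group_url("abc123", "inference", "gw.chalk.ai"))
# https://abc123-inference.gw.chalk.ai
```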

### Invoking from the SDK

The `.call()` helper composes paths against `web_url` and returns an
`httpx.Response`. Use it for ad-hoc invocations from the SDK; production
traffic should hit `web_url` directly.

```
resp = sg.call(
    "/predict",
    method="POST",
    json={"text": "the cat sat on the mat"},
    timeout=60.0,
)
print(resp.json())
```
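
For production traffic that goes to `web_url` directly, a client needs nothing
more than an HTTP library. The sketch below builds the same `/predict` POST with
Python's standard `urllib`. `build_predict_request` is a hypothetical helper,
the URL is the example hostname from above, and any authentication headers your
gateway may require are not covered here.

```python
import json
import urllib.request


def build_predict_request(web_url: str, payload: dict) -> urllib.request.Request:
    """Build a JSON POST to <web_url>/predict without sending it."""
    url = web_url.rstrip("/") + "/predict"
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_predict_request(
    "https://abc123-inference.gw.chalk.ai",
    {"text": "the cat sat on the mat"},
)
print(req.get_method(), req.full_url)

# To actually send it (requires a reachable service):
#   with urllib.request.urlopen(req, timeout=60) as resp:
#       print(json.load(resp))
```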

### Environment, secrets, and volumes

These are wired identically to containers:

```
from chalkcompute import ScalingGroup, Image, Secret

sg = ScalingGroup(
    image=img,
    name="api",
    port=8080,
    env={"LOG_LEVEL": "INFO"},
    secrets=[Secret.from_env("OPENAI_API_KEY")],
    volumes=[("training-data", "/data")],
    min_replicas=1,
    max_replicas=3,
).deploy().wait_ready()
```

Secrets are resolved at deploy time and injected as environment variables in
every replica. Volume mounts are shared across all replicas, so use them for
read-mostly data. If multiple replicas write to the same volume path
concurrently, the last writer wins on the next sync.
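
Inside the service, both `env` entries and secrets then show up as ordinary
environment variables. The sketch below shows one way application code might
read them at startup. It assumes the injected variable carries the same name
that was passed to `Secret.from_env`; `load_settings` is a hypothetical helper.

```python
import os
from typing import Mapping


def load_settings(environ: Mapping[str, str] = os.environ) -> dict:
    """Read the service configuration from environment variables."""
    api_key = environ.get("OPENAI_API_KEY")
    if not api_key:
        # Fail fast at startup instead of on the first request.
        raise RuntimeError("OPENAI_API_KEY is not set; was the secret attached?")
    return {
        "log_level": environ.get("LOG_LEVEL", "INFO"),
        "openai_api_key": api_key,
    }


# Passing a plain dict keeps the example independent of the real environment.
settings = load_settings({"LOG_LEVEL": "DEBUG", "OPENAI_API_KEY": "sk-example"})
print(settings["log_level"])  # DEBUG
```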

### Managing scaling groups

#### Listing

```
$ chalk scaling-group list
```

Displays a table of all scaling groups in your environment with their ID,
name, image, status, replica counts, URL, and creation time.

```
from chalk.client import ChalkClient

client = ChalkClient()
response = client.list_scaling_groups()

for scaling_group in response.scaling_groups:
    print(f"Scaling group: {scaling_group.name}")
    print(f"  Status: {scaling_group.status}")
```

#### Inspecting

```
$ chalk scaling-group get --name=inference
```

Shows detailed information including the spec, replica status (desired, ready,
available), tags, and URL.

```
from chalkcompute import ScalingGroup

# Look up an existing group by name...
sg = ScalingGroup.from_name("inference")
# ...or by ID.
sg = ScalingGroup.from_id("scaling-group-id")

info = sg.refresh()
print(info.status)             # 'Running', 'Available', 'Failed', ...
print(info.ready_replicas)
print(info.available_replicas)
print(sg.web_url)
```

#### Deleting

```
$ chalk scaling-group delete --name=inference
```

```
sg.delete()
```

Both surfaces remove the Kubernetes Deployment, Service, and HTTPRoute
associated with the scaling group. The SDK additionally cleans up any
temporary volumes created from `add_local_file` uploads. The database record
is soft-deleted so the group still appears in history.

### Tags

You can attach arbitrary key-value tags to a scaling group for organization
and filtering:

```
$ chalk scaling-group create \
    --image=my-registry/my-app:v1 \
    --name=my-service \
    --port=8080 \
    --tags="team=ml,version=2.1"
```

Tags are applied as Kubernetes labels with the prefix `chalk.ai/tag-`, making
them visible in standard Kubernetes tooling.
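
As an illustration of that mapping, the sketch below turns a `--tags` string
into the label keys you would expect to see on the Kubernetes objects.
`tags_to_labels` is a hypothetical helper; it only restates the prefix rule
above and does not validate Kubernetes label syntax.

```python
TAG_LABEL_PREFIX = "chalk.ai/tag-"


def tags_to_labels(tags: str) -> dict:
    """Convert "k1=v1,k2=v2" into labels prefixed with chalk.ai/tag-."""
    labels = {}
    for pair in tags.split(","):
        pair = pair.strip()
        if not pair:
            continue  # Ignore empty segments such as a trailing comma.
        key, sep, value = pair.partition("=")
        if not sep or not key:
            raise ValueError(f"expected key=value, got {pair!r}")
        labels[TAG_LABEL_PREFIX + key] = value
    return labels


print(tags_to_labels("team=ml,version=2.1"))
# {'chalk.ai/tag-team': 'ml', 'chalk.ai/tag-version': '2.1'}
```

With labels in that shape, a standard label selector such as
`chalk.ai/tag-team=ml` should match the group's resources.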

### Entrypoint override

To override the image's default entrypoint, pass `entrypoint` from the SDK or
`--entrypoint` from the CLI:

```
sg = ScalingGroup(
    image="python:3.12",
    name="custom-server",
    port=8000,
    entrypoint=["python", "-m", "uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"],
).deploy()
```

```
$ chalk scaling-group create \
    --image=python:3.12 \
    --name=custom-server \
    --port=8000 \
    --entrypoint="python,-m,uvicorn,main:app,--host,0.0.0.0,--port,8000"
```

The CLI takes comma-separated arguments; the SDK takes a list.
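
Converting between the two forms is a join and a split. The helpers below are
hypothetical and assume the CLI offers no way to escape a comma inside a single
argument, so such arguments are rejected rather than silently split.

```python
from typing import List


def entrypoint_to_cli(args: List[str]) -> str:
    """Join an SDK-style entrypoint list into the CLI's comma-separated form."""
    for arg in args:
        if "," in arg:
            # Assumption: no escaping is available for commas.
            raise ValueError(f"argument contains a comma: {arg!r}")
    return ",".join(args)


def entrypoint_from_cli(value: str) -> List[str]:
    """Split the CLI form back into an SDK-style list."""
    return value.split(",")


sdk_entrypoint = [
    "python", "-m", "uvicorn", "main:app",
    "--host", "0.0.0.0", "--port", "8000",
]
print(entrypoint_to_cli(sdk_entrypoint))
# python,-m,uvicorn,main:app,--host,0.0.0.0,--port,8000
```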

### Full example

A GPU-accelerated inference service with custom resource limits, a fixed count
of two replicas, tags, and an entrypoint override:

```
$ chalk scaling-group create \
    --image=my-registry/llm-serving:v3 \
    --name=llm-inference \
    --port=8080 \
    --replicas=2 \
    --gpu=nvidia-l4:1 \
    --cpu=8 \
    --memory=32Gi \
    --tags="model=llama3,team=ml-platform" \
    --entrypoint="python,-m,vllm.entrypoints.openai.api_server,--model,/models/llama3"
```

After creation, the service is reachable at its assigned DNS name and can be
referenced by resolvers or other services running in the same Chalk
environment.

### When to use a Scaling Group vs other compute primitives

See "Choosing the right primitive" on the Compute overview for the full
decision table.





