Learn how to debug and troubleshoot feature persistence issues in Chalk
Chalk uses background persistence writers to asynchronously persist query results to the online store, offline store, and metrics store. When these writers encounter issues, features may appear stale, offline queries may return incomplete data, or online store reads may return unexpected results. This guide walks through how to identify and resolve common persistence problems.
Understanding the data flow is essential for debugging. When a query executes on the engine, results flow through a message bus before reaching their final storage destinations:
The offline store write path is a two-stage pipeline: the offline writer creates table schemas, transforms result bus messages, and republishes them to the streaming insert topic; the streaming insert writer then writes the data to the database, and this path also populates the query_log table in the offline store. Note that other components (such as the query engine) may also publish messages directly to the streaming insert topic, so the streaming insert writer handles more than just the output of the offline writer. Both writers must be deployed for offline persistence to function.
The bulk insert writer is independent of this pipeline. When offline queries run with store_offline=True, they write result parquet files to a cloud storage bucket and publish a notification to a bulk upload topic. The bulk insert writer picks up these notifications and loads the files into the offline store using COPY INTO operations. It requires a configured upload bucket (BQ_UPLOAD_BUCKET) with the correct cloud prefix (s3:// for AWS, gs:// for GCP).
Which offline writers you need depends on your offline store backend: for BigQuery, dedicated BigQuery streaming writers handle this role; for all other backends (Snowflake, Redshift, Databricks), use the bulk insert writer and the streaming insert writer.
A failure at any stage of this pipeline can cause data to stop flowing to one or more stores.
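To make the bulk upload path above concrete, here is a minimal sketch of triggering it from the Python client. It assumes a ChalkClient.offline_query call that accepts the store_offline flag described above (the exact parameter placement may differ), and the feature names are hypothetical.

```python
from chalk.client import ChalkClient

client = ChalkClient()

# Sketch: run an offline query whose results should be persisted.
# With store_offline=True (per the flow above), the engine writes result
# parquet files to the upload bucket and notifies the bulk upload topic;
# the bulk insert writer then loads the files into the offline store.
# Feature names ("user.id", "user.credit_score") are hypothetical.
dataset = client.offline_query(
    input={"user.id": [1, 2, 3]},
    output=["user.credit_score"],
    store_offline=True,  # assumption: exact parameter placement may differ
)
```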
When you suspect a persistence issue, follow this decision tree:
Start by determining which store is affected:
Navigate to Infrastructure > Kube Events in the Chalk dashboard and filter by the background-persistence namespace.
If the background-persistence namespace does not exist, the writers may not have been deployed yet; create them in the dashboard by navigating
to Settings > Shared Resources > Background Persistence.
Look for pods in an unhealthy state:
If writers appear healthy but data is not flowing, the result bus itself may be the bottleneck. Navigate to Infrastructure > Kubernetes and expand the Background persistence writers dropdown. Click on a result-bus-offline-writer or result-bus-online-writer pod to view logs that can indicate why data is not being ingested.
You can also filter logs by component:"background-persistence" to look for consumer lag or
error messages on the rust-result-bus-online-writer and result-bus-offline-writer pods.
Signs of a clogged result bus include:
This is the most common persistence issue. When a writer's memory usage exceeds its Kubernetes memory limit,
the pod is terminated with an OOMKilled status. The pod is restarted automatically, but repeated OOM kills cause
the writer to fall behind, creating a backlog on the result bus.
Symptoms: pods cycling between Running and OOMKilled, increasing consumer lag on the result bus,
stale or missing feature values.
Solution: increase the memory request and limit for the affected writer. See Raising memory requests.
If writers were down for a period (due to OOM kills, crashes, or scaling issues), a backlog of unprocessed messages may accumulate on the result bus. Once writers are healthy again, they will work through the backlog, but this can take time depending on the volume.
Symptoms: writers are healthy and running, but data is still delayed. Consumer lag is high but decreasing.
Solution: wait for the backlog to drain. If you need to accelerate processing, you can temporarily increase the replica count for the affected writer in the background persistence configuration.
If a writer is in CrashLoopBackOff, it is repeatedly failing to start.
Symptoms: pod restarts with increasing backoff intervals.
Solution: check the pod logs under Infrastructure > Kubernetes by expanding the Background persistence writers dropdown.
If online queries return correct values but offline data is missing or stale:
Verify that skip_offline is not set to True on the relevant stream resolvers.
If offline query results (e.g. from queries run with store_offline=True) are not appearing in the offline store, the bulk insert writer may be misconfigured or missing. This writer is required for non-BigQuery backends. Verify that BQ_UPLOAD_BUCKET is set to a valid cloud storage path with the correct prefix (s3:// for AWS, gs:// for GCP), and that the associated storage integration has the correct permissions; a quick sanity check is sketched after the next section.
If a writer pod is stuck in ContainerCreating for an extended period, it is typically unable to mount a required volume or secret.
Symptoms: pod remains in ContainerCreating state for more than a few minutes. Kube Events may show FailedMount errors.
Solution: check the pod logs under Infrastructure > Kubernetes by expanding the Background persistence writers dropdown for a possible cause.
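For the bulk insert writer check above, a quick, hypothetical sanity check of the upload bucket value (run wherever the writer's environment variables are visible) might look like this:

```python
import os

# Hypothetical check: BQ_UPLOAD_BUCKET must be a cloud storage path with
# the prefix matching your cloud provider (s3:// for AWS, gs:// for GCP).
bucket = os.environ.get("BQ_UPLOAD_BUCKET", "")
if not bucket.startswith(("s3://", "gs://")):
    raise ValueError(f"BQ_UPLOAD_BUCKET has an unexpected value: {bucket!r}")
```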
If online queries unexpectedly get cache misses:
Confirm that the relevant features are configured with a max_staleness.
Background persistence writers run as Kubernetes deployments with configurable CPU and memory requests and limits. When a writer is OOMKilled, you need to increase its memory allocation. In the background persistence configuration, find the writer whose pods are running out of memory and increase its memory value in both the request and limit fields.
See max_staleness to control how stale features are served.
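To illustrate the cache-miss check above, here is a minimal sketch of a feature configured with max_staleness, assuming Chalk's Python feature API and a hypothetical User feature class:

```python
from chalk.features import feature, features


@features
class User:
    id: str
    # Cached in the online store for up to 30 minutes. Without a
    # max_staleness, online reads of this feature are not served from
    # the cache, producing the cache misses described above.
    credit_score: float = feature(max_staleness="30m")
```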