The Chalk Dataset class manages metadata for offline queries, supports revisions to those queries over time, and makes it easy to retrieve query output data from the cloud.

Datasets from offline query

Dataset instances are obtained by calling ChalkClient.offline_query(), which computes feature values from the offline store. If inputs are given, the method returns the feature values corresponding to those inputs. Otherwise, it returns a random sample of up to max_samples rows, drawn from within the time bounds specified by lower_bound and upper_bound.

from datetime import datetime, timedelta

from chalk.client import ChalkClient, Dataset

# User is a feature class defined in your project.
uids = [1, 2, 3, 4]
at = datetime.now()
dataset: Dataset = ChalkClient().offline_query(
    input={
        User.id: uids,
    },
    input_times=[at] * len(uids),
    output=[
        User.id,
        User.fullname,
        User.email,
        User.name_email_match_score,
    ],
    dataset_name='my_dataset',
)

sample_dataset: Dataset = ChalkClient().offline_query(
    output=[
        User.id,
        User.fullname,
        User.email,
        User.name_email_match_score,
    ],
    max_samples=10,
    lower_bound=datetime.now() - timedelta(days=7),
    upper_bound=datetime.now(),
    dataset_name='my_sample',
)

Here, we attach a unique name to the Dataset. Whenever we send additional queries with the same name, a new DatasetRevision instance will be created and attached to the existing dataset. If a dataset_name is not given, the output data won’t be retrievable beyond the current session.
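
For example, re-running a query under the same name appends a new revision instead of creating a separate dataset. A minimal sketch, reusing the client and User features from above:

# A second query with dataset_name='my_dataset' attaches a new
# DatasetRevision to the existing dataset rather than creating a new one.
dataset_v2: Dataset = ChalkClient().offline_query(
    input={User.id: [5, 6]},
    output=[User.id, User.email],
    dataset_name='my_dataset',
)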

A Dataset's revisions can be inspected via Dataset.revisions; each revision holds useful metadata about the offline query job and the data itself. Also check Dataset.errors for any errors raised when the query was submitted.
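
For instance, you can log each revision's metadata and surface any submission errors. A short sketch, assuming the dataset from above:

# Print metadata for every revision of the dataset.
for revision in dataset.revisions:
    print(revision)

# Surface any errors raised when the query was submitted.
if dataset.errors:
    for error in dataset.errors:
        print(error.message)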

Retrieving output data

Offline queries do not run in real time, so the returned Dataset instance is not guaranteed to contain the query's outputs immediately. Loading the data may therefore take some time.

The data can be accessed programmatically by calling Dataset.get_data_as_pandas(), Dataset.get_data_as_polars(), or Dataset.get_data_as_dataframe(). If the offline query job is still running, the Dataset will poll the engine until the results are ready.

from datetime import datetime

import pandas as pd
import polars as pl

import chalk.features
from chalk.client import ChalkClient, Dataset

uids = [1, 2, 3, 4]
at = datetime.now()
dataset: Dataset = ChalkClient().offline_query(
    input={
        User.id: uids,
    },
    input_times=[at] * len(uids),
    output=[
        User.id,
        User.fullname,
        User.email,
        User.name_email_match_score,
    ],
    dataset_name='my_dataset',
)

# Each accessor polls until the offline query results are ready.
pandas_df: pd.DataFrame = dataset.get_data_as_pandas()
polars_df: pl.LazyFrame = dataset.get_data_as_polars()
chalk_df: chalk.features.DataFrame = dataset.get_data_as_dataframe()

The query's output files themselves can also be downloaded to a specified directory.

from datetime import datetime

from chalk.client import ChalkClient, Dataset

uids = [1, 2, 3, 4]
at = datetime.now()
dataset: Dataset = ChalkClient().offline_query(
    input={
        User.id: uids,
    },
    input_times=[at] * len(uids),
    output=[
        User.id,
        User.fullname,
        User.email,
        User.name_email_match_score,
    ],
    dataset_name='my_dataset',
)

# Write the query's output files into 'my_directory'.
dataset.download_data('my_directory')

By default, Dataset instances fetch the output data from their most recent revision. A specific DatasetRevision’s output data can be fetched using the same methods.

from datetime import datetime

from chalk.client import ChalkClient, Dataset

uids = [1, 2, 3, 4]
at = datetime.now()
dataset: Dataset = ChalkClient().offline_query(
    input={
        User.id: uids,
    },
    input_times=[at] * len(uids),
    output=[
        User.id,
        User.fullname,
        User.email,
        User.name_email_match_score,
    ],
    dataset_name='my_dataset',
)

# Fetch the output data for each revision, not just the latest one.
for revision in dataset.revisions:
    print(revision.get_data_as_pandas())

Dataset Inputs

Dataset objects also store the inputs for each revision.

from datetime import datetime

from chalk.client import ChalkClient, Dataset

uids = [1, 2, 3, 4]
at = datetime.now()
dataset: Dataset = ChalkClient().offline_query(
    input={
        User.id: uids,
    },
    input_times=[at] * len(uids),
    output=[
        User.id,
        User.fullname,
        User.email,
        User.name_email_match_score,
    ],
    dataset_name='my_dataset',
)

# Retrieve the inputs that were submitted with this revision.
df = dataset.get_input_dataframe()

Managing Datasets

Renaming Datasets

You can rename a dataset from the Chalk dashboard to better organize your datasets or reflect changes in your workflow.

To rename a dataset:

  1. From the Datasets page under the Offline section in the sidebar, click on a dataset to open its detail page
  2. Click the Edit button in the page header next to the dataset name
  3. Enter the new name and click Save, or press Enter

The dataset name is used when retrieving datasets via the API, so update any references in your code after renaming.
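
For example, if your code loads a dataset by name, that reference must be updated to match. A minimal sketch, assuming the dataset was renamed to 'my_renamed_dataset' in the dashboard:

from chalk.client import ChalkClient

# Fetch the dataset by its new name; the old name will no longer resolve.
dataset = ChalkClient().get_dataset(dataset_name='my_renamed_dataset')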

Archiving Dataset Revisions

As datasets accumulate revisions over time, you can archive older revisions to keep your dataset history organized. Archived revisions are hidden in the dashboard by default but can still be viewed there if needed.

To archive a revision:

  1. From the Datasets page under the Offline section in the sidebar, click on a dataset to open its detail page
  2. Select the Revisions tab
  3. Click the Archive button in the table row for the revision you want to archive
  4. Confirm the action in the dialog

To view archived revisions, check the Include archived revisions checkbox above the revisions table.

Archiving a revision cannot be undone. Archived revisions will no longer be accessible via the API.

Recompute a Dataset

Datasets expose a recompute method that lets you see how updates to resolvers or features play out in the context of an existing dataset. recompute takes as an argument the list of features to recompute; any other required input features are sampled from the offline store.
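
For example, after updating the resolver behind name_email_match_score, you might recompute just that feature against the dataset's existing inputs. A minimal sketch, reusing the dataset and User features from above:

# Recompute one feature with the updated resolver code; the other
# required input features are sampled from the offline store.
dataset.recompute(features=[User.name_email_match_score])

# Inspect the recomputed values.
print(dataset.get_data_as_pandas())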