Create PyTorch Datasets from Chalk Datasets
Chalk Datasets expose the following methods, which create PyTorch datasets from the Chalk dataset's contents:

- Dataset.create_torch_map_dataset(...), which creates a map-style PyTorch dataset.
- Dataset.create_torch_iter_dataset(...), which creates an iterable PyTorch dataset.
These methods let you feed the results of a Chalk query directly into your PyTorch workflows. See the MNIST Dataset Example for sample code.
These methods create datasets that return values in pydict format: {"column_name": torch_tensor}. By default, the keys are the dataset's column names, but they can be remapped by passing the columns= kwarg to either method. Note that when iterating over a data loader with a batch_size >= 1, each batch is a mapping from column names to tensors whose leading dimension is batch_size.
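The pydict batching behavior described above can be sketched with plain PyTorch (this sketch is illustrative only and does not use Chalk; the dataset class, column names, and shapes are hypothetical): a map-style dataset whose __getitem__ returns a dict of tensors, which the DataLoader's default collate function stacks column-by-column.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class PydictDataset(Dataset):
    """Map-style dataset returning one row as a column-name -> tensor dict."""

    def __init__(self, features: torch.Tensor, labels: torch.Tensor):
        self.features = features
        self.labels = labels

    def __len__(self) -> int:
        return len(self.labels)

    def __getitem__(self, idx: int) -> dict:
        return {"features": self.features[idx], "label": self.labels[idx]}

ds = PydictDataset(torch.randn(10, 4), torch.arange(10))
loader = DataLoader(ds, batch_size=5)

for batch in loader:
    # The default collate function stacks each column across the batch,
    # so every value is a tensor whose leading dimension is batch_size.
    assert batch["features"].shape == (5, 4)
    assert batch["label"].shape == (5,)
```

Because each row is a dict, no custom collate function is needed: PyTorch's default collation handles mappings by stacking values under the same key.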
In PyTorch, the map-style torch.utils.data.Dataset and the iterable torch.utils.data.IterableDataset are the two types of datasets that can be used with DataLoaders.
Map-style datasets are optimized for random access to any given row; in Chalk, Dataset.create_torch_map_dataset(...) therefore materializes the entire dataset once the underlying operation completes. If you do not need random access, Dataset.create_torch_iter_dataset(...) may be more appropriate: by default, it materializes only one row group of the Chalk dataset at a time.
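The trade-off above can be sketched with plain PyTorch (again illustrative, not Chalk's implementation; the chunk size and column name are hypothetical, with chunks standing in for row groups): an iterable dataset streams rows one chunk at a time instead of requiring a fully materialized dataset with random access.

```python
import torch
from torch.utils.data import IterableDataset, DataLoader

class StreamingPydictDataset(IterableDataset):
    """Iterable dataset that yields rows lazily, one chunk at a time."""

    def __init__(self, num_rows: int, chunk_size: int = 4):
        self.num_rows = num_rows
        self.chunk_size = chunk_size

    def __iter__(self):
        # Only one chunk (analogous to a row group) is held in memory
        # at a time; rows are yielded in pydict format.
        for start in range(0, self.num_rows, self.chunk_size):
            chunk = torch.arange(start, min(start + self.chunk_size, self.num_rows))
            for value in chunk:
                yield {"id": value}

loader = DataLoader(StreamingPydictDataset(num_rows=10), batch_size=5)
batches = list(loader)
assert len(batches) == 2            # 10 rows / batch_size 5
assert batches[0]["id"].shape == (5,)
```

Note that an IterableDataset does not support len() or indexed access, so samplers that shuffle by index cannot be used with it; that is the cost of avoiding full materialization.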