Create PyTorch Datasets from Chalk Datasets
Chalk Datasets expose the following methods, which create PyTorch datasets from the Chalk dataset's contents:

- Dataset.create_torch_map_dataset(...), which creates a map-style PyTorch dataset.
- Dataset.create_torch_iter_dataset(...), which creates an iterable PyTorch dataset.
These methods let you feed the results of a Chalk query directly into your PyTorch workflows. See the MNIST Dataset Example for sample code.
These methods create datasets that return values in pydict format: {"column_name": torch_tensor}. By default, the keys are the dataset's column names, but they can be remapped by passing the columns= kwarg to either method. Note that when iterating over a data loader with a batch_size >= 1, each batch is a mapping from column names to tensors whose leading dimension is batch_size.
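The pydict batching behavior described above can be sketched with plain PyTorch (this sketch is illustrative only and does not use Chalk; the dataset class, column names, and shapes are hypothetical): a map-style dataset whose __getitem__ returns a dict of tensors, which the DataLoader's default collate function stacks column-by-column.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class PydictDataset(Dataset):
    """Map-style dataset returning one row as a column-name -> tensor dict."""

    def __init__(self, features: torch.Tensor, labels: torch.Tensor):
        self.features = features
        self.labels = labels

    def __len__(self) -> int:
        return len(self.labels)

    def __getitem__(self, idx: int) -> dict:
        return {"features": self.features[idx], "label": self.labels[idx]}

ds = PydictDataset(torch.randn(10, 4), torch.arange(10))
loader = DataLoader(ds, batch_size=5)

for batch in loader:
    # The default collate function stacks each column across the batch,
    # so every value is a tensor whose leading dimension is batch_size.
    assert batch["features"].shape == (5, 4)
    assert batch["label"].shape == (5,)
```

Because each row is a dict, no custom collate function is needed: PyTorch's default collation handles mappings by stacking values under the same key.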
In PyTorch, the map-style torch.utils.data.Dataset and the iterable torch.utils.data.IterableDataset are the two types of datasets that can be used with DataLoaders.
Map-style datasets are optimized for random access to any given row; in Chalk, Dataset.create_torch_map_dataset(...) therefore materializes the entire dataset once the underlying operation completes. If you do not need random access, Dataset.create_torch_iter_dataset(...) may be more appropriate: by default, it materializes only one row group of the Chalk dataset at a time.
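The trade-off above can be sketched with plain PyTorch (again illustrative, not Chalk's implementation; the chunk size and column name are hypothetical, with chunks standing in for row groups): an iterable dataset streams rows one chunk at a time instead of requiring a fully materialized dataset with random access.

```python
import torch
from torch.utils.data import IterableDataset, DataLoader

class StreamingPydictDataset(IterableDataset):
    """Iterable dataset that yields rows lazily, one chunk at a time."""

    def __init__(self, num_rows: int, chunk_size: int = 4):
        self.num_rows = num_rows
        self.chunk_size = chunk_size

    def __iter__(self):
        # Only one chunk (analogous to a row group) is held in memory
        # at a time; rows are yielded in pydict format.
        for start in range(0, self.num_rows, self.chunk_size):
            chunk = torch.arange(start, min(start + self.chunk_size, self.num_rows))
            for value in chunk:
                yield {"id": value}

loader = DataLoader(StreamingPydictDataset(num_rows=10), batch_size=5)
batches = list(loader)
assert len(batches) == 2            # 10 rows / batch_size 5
assert batches[0]["id"].shape == (5,)
```

Note that an IterableDataset does not support len() or indexed access, so samplers that shuffle by index cannot be used with it; that is the cost of avoiding full materialization.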