This reference documents the Chalk DataFrame API for offline feature computation and data processing.
Lightweight DataFrame wrapper around Chalk's execution engine.
The DataFrame class constructs query plans backed by libchalk and
can materialize them into Arrow tables. Operations build a lazy query plan
that executes only when you call run or to_arrow.
Column expressions can be written with _ (underscore) attribute syntax
or using F functions — see the
F function reference for the
full list.
Logical representation of tabular data for query operations.
DataFrame provides a lazy evaluation model where operations build up a query
plan that executes only when materialized via run or
to_arrow. Each operation returns a new DataFrame, leaving the
original unchanged.
Most users should use class methods like from_dict,
from_arrow, or scan to create DataFrames rather than
calling the constructor directly.
Column expressions use _ (underscore) or F function syntax.
from chalkdf import DataFrame
from chalk.features import _, F
df = DataFrame.from_dict({"x": [1, 2, 3], "price": [10.0, 20.0, 30.0]})
# Underscore syntax
doubled = df.with_columns({"x2": _.x * 2})
# F function syntax
capped = df.with_columns({"price": F.least(_.price, 25.0)})
from chalkdf import DataFrame
from chalk.features import _
df = DataFrame.from_dict({"x": [1, 2, 3], "y": ["a", "b", "c"]})
filtered = df.filter(_.x > 1)
result = filtered.run()
Create a DataFrame from a dictionary, Arrow table, or query plan.
For most use cases, prefer using class methods like from_dict,
from_arrow, or scan instead of calling this constructor directly.
from chalkdf import DataFrame
# Simple dictionary input
df = DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]})
# Or use the explicit class method (recommended)
df = DataFrame.from_dict({"x": [1, 2, 3], "y": [4, 5, 6]})
Return the number of rows if this DataFrame has already been materialized.
Raising TypeError for non-materialized frames matches Python's default
behavior while avoiding implicitly executing the plan.
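The contract can be sketched with a minimal stand-in class (names hypothetical; this is an illustration of the documented behavior, not the chalkdf implementation):

```python
class LazyFrame:
    """Toy stand-in illustrating the __len__ contract described above."""

    def __init__(self, rows=None):
        self._rows = rows  # None means the plan has not been executed

    def run(self):
        # Pretend execution materializes three rows.
        return LazyFrame(rows=[{"x": 1}, {"x": 2}, {"x": 3}])

    def __len__(self):
        if self._rows is None:
            # Mirrors Python's default: raise TypeError rather than
            # silently executing the plan.
            raise TypeError("DataFrame is not materialized; call run() first")
        return len(self._rows)
```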
Create a schema-only placeholder DataFrame for a named table.
The returned DataFrame contains no data; it is a logical reference
that must be supplied with actual Arrow data at execution time via the
tables argument of run or to_arrow. This is
useful when you want to build a reusable query plan against a
well-known schema and inject different data at runtime.
import pyarrow as pa
from chalkdf import DataFrame
schema = pa.schema([("user_id", pa.int64()), ("score", pa.float64())])
df = DataFrame.named_table("users", schema)
# Build a query plan
from chalk.features import _
result_plan = df.filter(_.score > 0.5)
# Inject real data at execution time
import pyarrow as pa
data = pa.table({"user_id": [1, 2, 3], "score": [0.3, 0.8, 0.6]})
result = result_plan.run(tables={"users": data})
Create a DataFrame from a Python async generator function.
This method allows you to create a DataFrame by streaming data from a custom Python async generator. The generator can yield data as PyArrow RecordBatches, pydicts, or pylists, and the method will handle conversion and schema alignment automatically. If the UDF yields an invalid batch, no further batches will be processed.
An async generator function that yields data batches. Each yielded value
can be a pyarrow.RecordBatch, a dictionary (will be converted using
pyarrow.RecordBatch.from_pydict), or a list (will be converted using
pyarrow.RecordBatch.from_pylist). The generator should yield None
or complete iteration to signal completion.
The expected PyArrow schema for the output data. If yielded batches have columns in a different order, they will be automatically reordered to match this schema.
Maximum time in seconds to wait for the output handler to accept each
batch. Prevents deadlocks when the consumer is blocked. Default is 300
seconds (5 minutes). Set to None to disable timeout (not recommended).
If sending a batch to the output handler exceeds the timeout.
import pyarrow as pa
from chalkdf import DataFrame
async def generate_data():
    for i in range(3):
        yield {"x": [i * 10, i * 10 + 1], "y": [i, i]}
schema = pa.schema([("x", pa.int64()), ("y", pa.int64())])
df = DataFrame.from_python_udf(generate_data, schema)
result = df.run()
# Example with PyArrow RecordBatches
async def generate_batches():
    batch1 = pa.RecordBatch.from_pydict({"a": [1, 2], "b": [3, 4]})
    batch2 = pa.RecordBatch.from_pydict({"a": [5, 6], "b": [7, 8]})
    yield batch1
    yield batch2
schema = pa.schema([("a", pa.int64()), ("b", pa.int64())])
df = DataFrame.from_python_udf(generate_batches, schema)
# Example with custom timeout
df = DataFrame.from_python_udf(generate_data, schema, output_timeout=60.0)
Scan files and return a DataFrame.
Currently supports CSV (with headers), Parquet, and Delta.
from chalkdf import DataFrame
# Scan Parquet files
df = DataFrame.scan(["data/sales_2024.parquet"], name="sales_data")
# Scan CSV with explicit schema
import pyarrow as pa
schema = pa.schema([("id", pa.int64()), ("name", pa.string())])
df = DataFrame.scan(["data/users.csv"], schema=schema)
Create a DataFrame from the result of executing a SQL query (DuckDB dialect).
Pass DataFrames or Arrow tables as keyword arguments to make them
available as named tables inside the query. If no keyword arguments
are provided, from_sql will attempt to auto-register any
DataFrames found in the calling scope.
from chalkdf import DataFrame
orders = DataFrame.from_dict({"order_id": [1, 2, 3], "amount": [10.0, 20.0, 5.0]})
result = DataFrame.from_sql(
    "SELECT order_id, amount FROM orders WHERE amount > 8",
    orders=orders,
)
Join two DataFrames with SQL:
users = DataFrame.from_dict({"id": [1, 2], "name": ["Alice", "Bob"]})
purchases = DataFrame.from_dict({"user_id": [1, 1, 2], "item": ["a", "b", "c"]})
result = DataFrame.from_sql(
    "SELECT u.name, p.item FROM users u JOIN purchases p ON u.id = p.user_id",
    users=users,
    purchases=purchases,
)
import pyarrow as pa
from chalkdf import DataFrame
from chalk.sql import PostgreSQLSource
source = PostgreSQLSource(...)
schema = pa.schema([("user_id", pa.int64()), ("name", pa.string())])
df = DataFrame.from_datasource(source, "SELECT * FROM users", schema)
Create a DataFrame by executing a SQL UNLOAD query and scanning the resulting parquet files.
The SQL query should be a data warehouse UNLOAD/EXPORT command (e.g., BigQuery
EXPORT DATA, Snowflake COPY INTO) that writes parquet files to
output_uri_prefix. The files are then read by a deferred TableScan node.
import pyarrow as pa
from chalkdf import DataFrame
from libchalk.sql import ConnectionPool
from libchalk.sql.snowflake import make_snowflake_connection_factory
factory = make_snowflake_connection_factory(uri="snowflake://user:pass@acct/db")
pool = ConnectionPool(factory, max_pool_size=1)
schema = pa.schema([("user_id", pa.int64()), ("name", pa.string())])
df = DataFrame.scan_from_sql(
    "COPY INTO @stage/out/ FROM (SELECT user_id, name FROM users)",
    pool=pool,
    output_uri_prefix="s3://my-bucket/unload-output/",
    schema=schema,
)
Create a DataFrame by pulling messages from a streaming source.
This method connects to a Kafka, Kinesis, or PubSub source and pulls up to n
messages, returning them as a DataFrame.
A streaming source configuration. Can be one of:
KafkaSource: Kafka topic configuration
KinesisSource: Kinesis stream configuration
PubSubSource: Google PubSub subscription configuration
from chalkdf import DataFrame
from chalk.streams import KafkaSource
source = KafkaSource(
    name="my_kafka",
    bootstrap_server="localhost:9092",
    topic="my_topic",
)
# Pull 100 messages, just the raw bytes
df = DataFrame.from_stream_source(source, n=100)
# Pull with full metadata
df = DataFrame.from_stream_source(source, n=100, include_metadata=True)
Compile the current plan if necessary.
Configuration is resolved from multiple sources in priority order:
config parameter (highest priority)
compilation_config context manager
set_compilation_defaults
environment variables (e.g., CHALK_USE_VELOX_PARQUET_READER)
If a different configuration is provided than the previous compilation, the plan will be automatically recompiled.
from chalkdf import DataFrame
from chalkdf.config import CompilationConfig
df = DataFrame.from_dict({"x": [1, 2, 3], "y": [4, 5, 6]})
compiled = df.compile(config=CompilationConfig(use_online_hash_join=True))
print(compiled.explain_logical())
Add or replace columns while keeping all existing columns.
Unlike project, which returns only the columns you specify,
with_columns keeps every existing column and either adds new ones
or replaces columns whose names match.
Accepts multiple forms:
a dict mapping column names to expressions
(name, expression) tuples
expressions with .alias(<name>)
from chalkdf import DataFrame
from chalk.features import _, F
df = DataFrame.from_dict({"x": [1, 2, 3], "y": [4, 5, 6]})
# Add a new column using underscore syntax
df2 = df.with_columns({"z": _.x + _.y})
# Add a column using an F function
df3 = df.with_columns({"z_capped": F.least(_.x + _.y, 8)})
# Add a column using .alias()
df4 = df.with_columns((_.x * 2).alias("x_doubled"))
# Both df2, df3, df4 still contain x and y in addition to the new column
Return a column expression for the named column.
df.col("name") is equivalent to _.name but validates that
"name" exists in the DataFrame's schema at call time and is
therefore useful when the column name is a runtime string variable
rather than a literal attribute access.
If column is not present in the DataFrame's schema.
from chalkdf import DataFrame
df = DataFrame.from_dict({"x": [1, 2, 3], "y": [4, 5, 6]})
# Reference a column by name to build expressions
col_x = df.col("x")
df_filtered = df.filter(col_x > 1)
# Useful when the column name comes from a variable
target = "y"
df2 = df.with_columns({"doubled": df.col(target) * 2})
Return a column expression for the named column.
Alias for col.
If column is not present in the DataFrame's schema.
from chalkdf import DataFrame
df = DataFrame.from_dict({"x": [1, 2, 3], "y": [4, 5, 6]})
# Compute a sum from two columns referenced by name
df2 = df.with_columns({"sum": df.col("x") + df.col("y")})
Combine this DataFrame with one or more others by stacking rows.
All DataFrames must have the same schema (different column order is
allowed - the output will have the same column order as self).
Duplicates are retained. Row order is not preserved.
If no other DataFrames are provided, or if schemas don't match.
df1 = DataFrame({"x": [1, 2], "y": [10, 20]})
df2 = DataFrame({"x": [3, 4], "y": [30, 40]})
df3 = DataFrame({"x": [5], "y": [50]})
result = df1.union_all(df2, df3)
# result contains all 5 rows from df1, df2, and df3, in any order
Combine this DataFrame with another by stacking rows.
Convenience method for unioning with a single DataFrame.
Equivalent to union_all(other).
Both DataFrames must have the same schema (different column order is
allowed - the output will have the same column order as self).
Duplicates are retained. Row order is not preserved.
union_all : Union with multiple DataFrames at once.
If schemas don't match.
df1 = DataFrame({"x": [1, 2], "y": [10, 20]})
df2 = DataFrame({"x": [3, 4], "y": [30, 40]})
result = df1.union(df2)
# result contains all 4 rows from df1 and df2, in any order
Project to an exact set of output columns using expressions.
Unlike with_columns, which keeps all existing columns and only
adds or replaces the ones you name, project returns only the
columns you specify. Columns not listed in columns are dropped.
Use project when you want to reshape or rename the schema
entirely; use with_columns when you only want to augment it.
from chalkdf import DataFrame
from chalk.features import _
df = DataFrame.from_dict({"x": [1, 2, 3], "y": [4, 5, 6], "z": [7, 8, 9]})
# Keep only "sum" and "x"; z is dropped
projected = df.project({"sum": _.x + _.y, "x": _.x})
Join this DataFrame with another.
Join keys. Can be specified in multiple ways:
on=["col1", "col2"]
on={"left_col": "right_col"}
left_on and right_on separately.
Join type. Supported values:
"inner": Keep only rows that match in both DataFrames (default)
"left": Keep all rows from left DataFrame
"right": Keep all rows from right DataFrame
"outer" or "full": Keep all rows from both DataFrames
"semi": Return rows from left that have matches in right (no right columns)
"anti": Return rows from left that have no matches in right
"cross": Cartesian product (do not pass in on)
Perform an as-of join with another DataFrame.
An as-of join is similar to a left join, but instead of matching on equality, it matches on the nearest key from the right DataFrame. This is commonly used for time-series data where you want to join with the most recent observation.
Important: Both DataFrames must be sorted by the on (or left_on/right_on)
column before calling this method. Use .order_by(on) to sort if needed.
Column name to use as the as-of join key (must be sorted).
This column is used for both left and right DataFrames.
The join finds the nearest match according to the strategy.
Either on or both left_on and right_on must be specified.
Column name in left DataFrame for the as-of join key. Only used when on
is None. Must be paired with right_on.
Column name in right DataFrame for the as-of join key. Can be used with on
(to specify a different right column name) or with left_on (when on is None).
Additional exact-match columns (optional). These columns must match exactly
before performing the as-of match on the on column. Can be specified as:
by=["col1", "col2"]
by={"left_col": "right_col"}
left_by and right_by separately.
Column names in left DataFrame for exact-match conditions. Only used when
by is None. Must be paired with right_by.
Column names in right DataFrame for exact-match conditions. Only used when
by is None. Must be paired with left_by.
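The backward (most-recent-observation) matching described above can be illustrated in plain Python — a sketch of the semantics on sorted inputs, not the chalkdf implementation:

```python
import bisect

def asof_join_backward(left, right, on):
    """For each left row, attach the right row with the largest `on`
    value that is <= the left row's value. Both inputs must already
    be sorted by `on`, mirroring the requirement documented above."""
    right_keys = [r[on] for r in right]
    joined = []
    for row in left:
        i = bisect.bisect_right(right_keys, row[on]) - 1
        match = right[i] if i >= 0 else {}
        joined.append({**match, **row})
    return joined

trades = [{"t": 2, "qty": 5}, {"t": 7, "qty": 3}]
quotes = [{"t": 1, "px": 10.0}, {"t": 5, "px": 11.0}]
result = asof_join_backward(trades, quotes, on="t")
# the trade at t=2 picks the quote at t=1; the trade at t=7 picks t=5
```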
Compute window (analytic) expressions partitioned by by and ordered by order_by.
Window operations evaluate each WindowExpr over a partition of
rows (defined by by) sorted within that partition (by order_by).
The result columns are appended to the existing schema; original columns
are preserved.
Overlap between by and order_by columns is not allowed.
Column names that define the partition boundaries. Rows with the same combination of values in these columns form one partition.
Column names (or (name, direction) tuples) that define the sort
order within each partition. Direction can be "asc" (default)
or "desc".
from chalkdf import DataFrame
from libchalk.chalktable import WindowExpr
df = DataFrame.from_dict({
    "idx": [1, 1, 2, 2],
    "v": [10, 20, 30, 40],
})
# Partition by "idx", sort by "v" ascending, shift "v" by -1 into "v_shifted"
result = df.window(["idx"], ["v"], WindowExpr.shift("v", "v_shifted", -1))
# result schema: idx, v, v_shifted
# v_shifted contains the *next* value of v within each idx partition
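As a cross-check on those shift semantics, here is a plain-Python sketch (helper name hypothetical) of partitioning by key, sorting within each partition, and shifting by -1:

```python
from collections import defaultdict

def shift_column(rows, by, order, value, periods):
    """Shift `value` within each `by` partition after sorting by `order`.
    periods=-1 pulls the next row's value into the new column."""
    parts = defaultdict(list)
    for row in rows:
        parts[row[by]].append(row)
    out = []
    for part in parts.values():
        part.sort(key=lambda r: r[order])
        vals = [r[value] for r in part]
        for i, row in enumerate(part):
            j = i - periods  # periods=-1 -> next row's value
            shifted = vals[j] if 0 <= j < len(vals) else None
            out.append({**row, f"{value}_shifted": shifted})
    return out

rows = [{"idx": 1, "v": 10}, {"idx": 1, "v": 20},
        {"idx": 2, "v": 30}, {"idx": 2, "v": 40}]
shifted = shift_column(rows, by="idx", order="v", value="v", periods=-1)
# each row carries the next v within its idx partition, None at the end
```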
Create a GroupBy object for chained aggregation operations.
This method returns a GroupBy object that can be used to apply
aggregation expressions via the .agg() method. This provides
an alternative syntax to df.agg(by, *aggregations).
from chalkdf import DataFrame
from chalk.features import _
df = DataFrame.from_dict({"group": ["A", "A", "B"], "value": [1, 2, 3]})
grouped = df.group_by("group").agg(_.value.sum().alias("total"))
Multiple grouping columns:
df2 = DataFrame.from_dict({"g1": ["A", "A", "B"], "g2": ["X", "Y", "X"], "val": [1, 2, 3]})
result = df2.group_by("g1", "g2").agg(_.val.sum().alias("sum"))
Using underscore expressions:
result = df.group_by(_.group).agg(_.value.mean().alias("avg"))
from chalkdf import DataFrame
from chalk.features import _
df = DataFrame.from_dict({"group": ["A", "A", "B"], "value": [1, 2, 3]})
agg_df = df.agg(["group"], _.value.sum().alias("total"))
# Or with a single column:
agg_df = df.agg("group", _.value.sum().alias("total"))
Remove duplicate rows based on the specified partition columns.
For each unique combination of values in columns, exactly one
row is emitted. Which row is kept within a partition is not
guaranteed — the engine may choose any row. If you need a
deterministic choice, sort the DataFrame first with order_by
before calling distinct_on.
If no columns are provided.
from chalkdf import DataFrame
df = DataFrame.from_dict({"x": [1, 1, 2], "y": [10, 20, 30]})
unique = df.distinct_on("x") # one row per unique x value
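The "sort first for a deterministic pick" recipe above can be sketched in plain Python — a semantic illustration (function name hypothetical), not the engine's algorithm:

```python
def distinct_on_first(rows, key, order=None, reverse=False):
    """Keep one row per distinct `key`. Sorting first by `order`
    makes the kept row deterministic (the first after sorting)."""
    if order is not None:
        rows = sorted(rows, key=lambda r: r[order], reverse=reverse)
    seen, kept = set(), []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            kept.append(row)
    return kept

rows = [{"x": 1, "y": 10}, {"x": 1, "y": 20}, {"x": 2, "y": 30}]
# Sort by y descending, then deduplicate on x: x=1 keeps y=20
deduped = distinct_on_first(rows, key="x", order="y", reverse=True)
```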
Build a lazy write plan without executing it.
Returns a new DataFrame whose query plan ends with a
TableWrite operator. No files are written until you call
run or to_arrow on the returned DataFrame.
For immediate execution use write instead.
Execute the DataFrame plan and write the output files immediately.
This is the eager counterpart to write_lazy: it builds the
write plan and runs it in one step.
By default (return_table_write_result=False) the method returns
None after the write completes. Pass
return_table_write_result=True to receive the raw
TableWrite result DataFrame instead.
Write the DataFrame as Parquet files using an auto-configured connector.
Convenience method that simplifies writing Parquet files compared to
the more general write. The connector is selected
automatically based on the URI scheme.
By default (return_table_write_result=False) the method returns
None after the write completes. Pass
return_table_write_result=True to receive the raw
TableWrite result DataFrame instead.
skip_planning_time_validation
Whether to skip validation at planning time (default: False).
return_table_write_result
If True, return the raw TableWrite result DataFrame.
If False (default), return None.
URI prefix where Parquet files will be written.
Supports local (file://), S3 (s3://), and GCS (gs://) URIs.
from chalkdf import DataFrame
df = DataFrame.from_dict({"x": [1, 2, 3], "y": [4, 5, 6]})
df.write_parquet("file:///tmp/output/") # returns None
result = df.write_parquet("gs://my-bucket/output/", return_table_write_result=True)
Class methods for constructing new DataFrame instances from various sources.
named_table (data injected at run time)
Create a schema-only placeholder DataFrame for a named table.
The returned DataFrame contains no data; it is a logical reference
that must be supplied with actual Arrow data at execution time via the
tables argument of run or to_arrow. This is
useful when you want to build a reusable query plan against a
well-known schema and inject different data at runtime.
import pyarrow as pa
from chalkdf import DataFrame
schema = pa.schema([("user_id", pa.int64()), ("score", pa.float64())])
df = DataFrame.named_table("users", schema)
# Build a query plan
from chalk.features import _
result_plan = df.filter(_.score > 0.5)
# Inject real data at execution time
import pyarrow as pa
data = pa.table({"user_id": [1, 2, 3], "score": [0.3, 0.8, 0.6]})
result = result_plan.run(tables={"users": data})
Scan files and return a DataFrame.
Currently supports CSV (with headers), Parquet, and Delta.
from chalkdf import DataFrame
# Scan Parquet files
df = DataFrame.scan(["data/sales_2024.parquet"], name="sales_data")
# Scan CSV with explicit schema
import pyarrow as pa
schema = pa.schema([("id", pa.int64()), ("name", pa.string())])
df = DataFrame.scan(["data/users.csv"], schema=schema)
Create a DataFrame from the result of executing a SQL query (DuckDB dialect).
Pass DataFrames or Arrow tables as keyword arguments to make them
available as named tables inside the query. If no keyword arguments
are provided, from_sql will attempt to auto-register any
DataFrames found in the calling scope.
from chalkdf import DataFrame
orders = DataFrame.from_dict({"order_id": [1, 2, 3], "amount": [10.0, 20.0, 5.0]})
result = DataFrame.from_sql(
    "SELECT order_id, amount FROM orders WHERE amount > 8",
    orders=orders,
)
Join two DataFrames with SQL:
users = DataFrame.from_dict({"id": [1, 2], "name": ["Alice", "Bob"]})
purchases = DataFrame.from_dict({"user_id": [1, 1, 2], "item": ["a", "b", "c"]})
result = DataFrame.from_sql(
    "SELECT u.name, p.item FROM users u JOIN purchases p ON u.id = p.user_id",
    users=users,
    purchases=purchases,
)
Create a DataFrame by pulling messages from a streaming source.
This method connects to a Kafka, Kinesis, or PubSub source and pulls up to n
messages, returning them as a DataFrame.
A streaming source configuration. Can be one of:
KafkaSource: Kafka topic configuration
KinesisSource: Kinesis stream configuration
PubSubSource: Google PubSub subscription configuration
from chalkdf import DataFrame
from chalk.streams import KafkaSource
source = KafkaSource(
    name="my_kafka",
    bootstrap_server="localhost:9092",
    topic="my_topic",
)
# Pull 100 messages, just the raw bytes
df = DataFrame.from_stream_source(source, n=100)
# Pull with full metadata
df = DataFrame.from_stream_source(source, n=100, include_metadata=True)
import pyarrow as pa
from chalkdf import DataFrame
from chalk.sql import PostgreSQLSource
source = PostgreSQLSource(...)
schema = pa.schema([("user_id", pa.int64()), ("name", pa.string())])
df = DataFrame.from_datasource(source, "SELECT * FROM users", schema)
Methods for selecting, transforming, and manipulating columns.
Column expressions accept _ (underscore) syntax or F functions — for
example _.price * _.qty or F.coalesce(_.value, 0). See the
F function reference for
available functions.
select / drop — pick or remove columns by name
with_columns — add or replace columns while keeping existing ones
project — replace all columns with a new set of expressions
col / column — reference a column by runtime name string
rename — rename one or more columns
explode — expand a list column into one row per element
with_unique_id — append a monotonically increasing ID column
Return a column expression for the named column.
df.col("name") is equivalent to _.name but validates that
"name" exists in the DataFrame's schema at call time and is
therefore useful when the column name is a runtime string variable
rather than a literal attribute access.
If column is not present in the DataFrame's schema.
from chalkdf import DataFrame
df = DataFrame.from_dict({"x": [1, 2, 3], "y": [4, 5, 6]})
# Reference a column by name to build expressions
col_x = df.col("x")
df_filtered = df.filter(col_x > 1)
# Useful when the column name comes from a variable
target = "y"
df2 = df.with_columns({"doubled": df.col(target) * 2})
Return a column expression for the named column.
Alias for col.
If column is not present in the DataFrame's schema.
from chalkdf import DataFrame
df = DataFrame.from_dict({"x": [1, 2, 3], "y": [4, 5, 6]})
# Compute a sum from two columns referenced by name
df2 = df.with_columns({"sum": df.col("x") + df.col("y")})
Add or replace columns while keeping all existing columns.
Unlike project, which returns only the columns you specify,
with_columns keeps every existing column and either adds new ones
or replaces columns whose names match.
Accepts multiple forms:
a dict mapping column names to expressions
(name, expression) tuples
expressions with .alias(<name>)
from chalkdf import DataFrame
from chalk.features import _, F
df = DataFrame.from_dict({"x": [1, 2, 3], "y": [4, 5, 6]})
# Add a new column using underscore syntax
df2 = df.with_columns({"z": _.x + _.y})
# Add a column using an F function
df3 = df.with_columns({"z_capped": F.least(_.x + _.y, 8)})
# Add a column using .alias()
df4 = df.with_columns((_.x * 2).alias("x_doubled"))
# Both df2, df3, df4 still contain x and y in addition to the new column
Project to an exact set of output columns using expressions.
Unlike with_columns, which keeps all existing columns and only
adds or replaces the ones you name, project returns only the
columns you specify. Columns not listed in columns are dropped.
Use project when you want to reshape or rename the schema
entirely; use with_columns when you only want to augment it.
from chalkdf import DataFrame
from chalk.features import _
df = DataFrame.from_dict({"x": [1, 2, 3], "y": [4, 5, 6], "z": [7, 8, 9]})
# Keep only "sum" and "x"; z is dropped
projected = df.project({"sum": _.x + _.y, "x": _.x})
Methods for filtering, ordering, and deduplicating rows.
filter — keep rows matching a boolean expression
order_by — sort rows by one or more columns
slice — select a positional range of rows
distinct_on — deduplicate by a set of key columns
Remove duplicate rows based on the specified partition columns.
For each unique combination of values in columns, exactly one
row is emitted. Which row is kept within a partition is not
guaranteed — the engine may choose any row. If you need a
deterministic choice, sort the DataFrame first with order_by
before calling distinct_on.
If no columns are provided.
from chalkdf import DataFrame
df = DataFrame.from_dict({"x": [1, 1, 2], "y": [10, 20, 30]})
unique = df.distinct_on("x") # one row per unique x value
Methods for combining rows from multiple DataFrames.
Both DataFrames must share the same schema (column order may differ). Duplicates are retained; row order is not guaranteed.
Combine this DataFrame with another by stacking rows.
Convenience method for unioning with a single DataFrame.
Equivalent to union_all(other).
Both DataFrames must have the same schema (different column order is
allowed - the output will have the same column order as self).
Duplicates are retained. Row order is not preserved.
union_all : Union with multiple DataFrames at once.
If schemas don't match.
df1 = DataFrame({"x": [1, 2], "y": [10, 20]})
df2 = DataFrame({"x": [3, 4], "y": [30, 40]})
result = df1.union(df2)
# result contains all 4 rows from df1 and df2, in any order
Combine this DataFrame with one or more others by stacking rows.
All DataFrames must have the same schema (different column order is
allowed - the output will have the same column order as self).
Duplicates are retained. Row order is not preserved.
If no other DataFrames are provided, or if schemas don't match.
df1 = DataFrame({"x": [1, 2], "y": [10, 20]})
df2 = DataFrame({"x": [3, 4], "y": [30, 40]})
df3 = DataFrame({"x": [5], "y": [50]})
result = df1.union_all(df2, df3)
# result contains all 5 rows from df1, df2, and df3, in any order
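The schema and column-order rules above can be illustrated with a plain-Python sketch over column-oriented dicts (a semantic illustration, not the engine's implementation):

```python
def union_all_columns(first, *others):
    """Stack column-oriented tables row-wise. Other tables may list
    their columns in any order; the output follows `first`'s column
    order, and duplicate rows are retained."""
    if not others:
        raise ValueError("union_all requires at least one other table")
    cols = list(first)
    out = {c: list(first[c]) for c in cols}
    for table in others:
        if set(table) != set(cols):
            raise ValueError("schemas don't match")
        for c in cols:
            out[c].extend(table[c])
    return out

left = {"x": [1, 2], "y": [10, 20]}
right = {"y": [30], "x": [3]}  # different column order is fine
combined = union_all_columns(left, right)
```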
Methods for combining two DataFrames based on matching keys.
join — standard equality joins (inner, left, right, outer, semi, anti, cross)
join_asof — temporal / nearest-key join for time-series data
Join this DataFrame with another.
Join keys. Can be specified in multiple ways:
on=["col1", "col2"]
on={"left_col": "right_col"}
left_on and right_on separately.
Join type. Supported values:
"inner": Keep only rows that match in both DataFrames (default)
"left": Keep all rows from left DataFrame
"right": Keep all rows from right DataFrame
"outer" or "full": Keep all rows from both DataFrames
"semi": Return rows from left that have matches in right (no right columns)
"anti": Return rows from left that have no matches in right
"cross": Cartesian product (do not pass in on)
Perform an as-of join with another DataFrame.
An as-of join is similar to a left join, but instead of matching on equality, it matches on the nearest key from the right DataFrame. This is commonly used for time-series data where you want to join with the most recent observation.
Important: Both DataFrames must be sorted by the on (or left_on/right_on)
column before calling this method. Use .order_by(on) to sort if needed.
Column name to use as the as-of join key (must be sorted).
This column is used for both left and right DataFrames.
The join finds the nearest match according to the strategy.
Either on or both left_on and right_on must be specified.
Column name in left DataFrame for the as-of join key. Only used when on
is None. Must be paired with right_on.
Column name in right DataFrame for the as-of join key. Can be used with on
(to specify a different right column name) or with left_on (when on is None).
Additional exact-match columns (optional). These columns must match exactly
before performing the as-of match on the on column. Can be specified as:
by=["col1", "col2"]
by={"left_col": "right_col"}
left_by and right_by separately.
Column names in left DataFrame for exact-match conditions. Only used when
by is None. Must be paired with right_by.
Column names in right DataFrame for exact-match conditions. Only used when
by is None. Must be paired with left_by.
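The semi and anti join semantics from the join-type list can be illustrated in plain Python — a sketch of the row-filtering behavior (function names hypothetical), not the engine's hash join:

```python
def semi_join(left, right, on):
    """Rows from left with at least one match in right; no right columns."""
    keys = {r[on] for r in right}
    return [row for row in left if row[on] in keys]

def anti_join(left, right, on):
    """Rows from left with no match in right."""
    keys = {r[on] for r in right}
    return [row for row in left if row[on] not in keys]

users = [{"id": 1}, {"id": 2}, {"id": 3}]
purchases = [{"id": 1}, {"id": 1}, {"id": 3}]  # keyed by purchaser id
buyers = semi_join(users, purchases, on="id")      # users 1 and 3
non_buyers = anti_join(users, purchases, on="id")  # user 2
```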
Methods for computing group summaries and window (analytic) expressions.
group_by — returns a GroupBy object for chained aggregations
agg — group by columns and apply aggregation expressions directly
window — compute analytic (window) expressions partitioned by key columns
Create a GroupBy object for chained aggregation operations.
This method returns a GroupBy object that can be used to apply
aggregation expressions via the .agg() method. This provides
an alternative syntax to df.agg(by, *aggregations).
from chalkdf import DataFrame
from chalk.features import _
df = DataFrame.from_dict({"group": ["A", "A", "B"], "value": [1, 2, 3]})
grouped = df.group_by("group").agg(_.value.sum().alias("total"))
Multiple grouping columns:
df2 = DataFrame.from_dict({"g1": ["A", "A", "B"], "g2": ["X", "Y", "X"], "val": [1, 2, 3]})
result = df2.group_by("g1", "g2").agg(_.val.sum().alias("sum"))
Using underscore expressions:
result = df.group_by(_.group).agg(_.value.mean().alias("avg"))
from chalkdf import DataFrame
from chalk.features import _
df = DataFrame.from_dict({"group": ["A", "A", "B"], "value": [1, 2, 3]})
agg_df = df.agg(["group"], _.value.sum().alias("total"))
# Or with a single column:
agg_df = df.agg("group", _.value.sum().alias("total"))
Compute window (analytic) expressions partitioned by by and ordered by order_by.
Window operations evaluate each WindowExpr over a partition of
rows (defined by by) sorted within that partition (by order_by).
The result columns are appended to the existing schema; original columns
are preserved.
Overlap between by and order_by columns is not allowed.
Column names that define the partition boundaries. Rows with the same combination of values in these columns form one partition.
Column names (or (name, direction) tuples) that define the sort
order within each partition. Direction can be "asc" (default)
or "desc".
from chalkdf import DataFrame
from libchalk.chalktable import WindowExpr
df = DataFrame.from_dict({
    "idx": [1, 1, 2, 2],
    "v": [10, 20, 30, 40],
})
# Partition by "idx", sort by "v" ascending, shift "v" by -1 into "v_shifted"
result = df.window(["idx"], ["v"], WindowExpr.shift("v", "v_shifted", -1))
# result schema: idx, v, v_shifted
# v_shifted contains the *next* value of v within each idx partition
Methods for executing query plans and inspecting DataFrame structure.
run — execute and return a materialized DataFrame
to_arrow — execute and return a pyarrow.Table
write / write_parquet — execute and persist output files
explain_logical / explain_physical — inspect the query plan
get_plan / get_tables — access internal plan and table state
Execute the DataFrame plan and write the output files immediately.
This is the eager counterpart to write_lazy: it builds the
write plan and runs it in one step.
By default (return_table_write_result=False) the method returns
None after the write completes. Pass
return_table_write_result=True to receive the raw
TableWrite result DataFrame instead.
Write the DataFrame as Parquet files using an auto-configured connector.
Convenience method that simplifies writing Parquet files compared to
the more general write. The connector is selected
automatically based on the URI scheme.
By default (return_table_write_result=False) the method returns
None after the write completes. Pass
return_table_write_result=True to receive the raw
TableWrite result DataFrame instead.
skip_planning_time_validation
Whether to skip validation at planning time (default: False).
return_table_write_result
If True, return the raw TableWrite result DataFrame.
If False (default), return None.
URI prefix where Parquet files will be written.
Supports local (file://), S3 (s3://), and GCS (gs://) URIs.
from chalkdf import DataFrame
df = DataFrame.from_dict({"x": [1, 2, 3], "y": [4, 5, 6]})
df.write_parquet("file:///tmp/output/") # returns None
result = df.write_parquet("gs://my-bucket/output/", return_table_write_result=True)