Overview

Underscore expressions are used to derive features from operations on other features. In feature definitions, _ represents a reference to the containing feature class and is used to access other features in the same instance. In addition to arithmetic operations, underscore expressions can also be used to filter and aggregate data.

Underscore expressions are useful because Chalk statically analyzes and optimizes them, leading to better performance when compared to equivalent Python resolvers. They are also useful for succinctly defining common features, such as the number of times a user failed a login attempt over the past 30 days or the total amount spent at a given merchant.

Arithmetic

In this example, we have a Transaction feature class with total and sales_tax features and we want to define subtotal as total minus sales_tax. Instead of writing a Python resolver, we can resolve subtotal with an underscore expression:

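from chalk import online
from chalk.features import _, features

@features
class Transaction:
  id: int
  total: float
  sales_tax: float
  # subtotal is resolved by the underscore expression on the right-hand side
  subtotal: float = _.total - _.sales_tax

# Equivalent Python resolver, shown for comparison only.
# It is not needed when the underscore expression above is used.
@online
def get_subtotal(total: Transaction.total, sales_tax: Transaction.sales_tax) -> Transaction.subtotal:
    return total - sales_tax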
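
To illustrate why an expression such as _.total - _.sales_tax can be inspected before any data exists, here is a minimal plain-Python sketch of a deferred expression. It is not Chalk's implementation: the Underscore, Ref, Sub, and evaluate names are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass


class Expr:
    """Base class: subtraction builds a tree node instead of computing a value."""

    def __sub__(self, other: "Expr") -> "Expr":
        return Sub(self, other)


@dataclass(frozen=True)
class Ref(Expr):
    name: str  # name of the referenced feature, e.g. "total"


@dataclass(frozen=True)
class Sub(Expr):
    left: Expr
    right: Expr


class Underscore:
    """Hypothetical stand-in for `_`: attribute access yields a Ref node."""

    def __getattr__(self, name: str) -> Ref:
        return Ref(name)


def evaluate(expr: Expr, row: dict) -> float:
    """Walk the tree and compute a value from one row of feature values."""
    if isinstance(expr, Ref):
        return row[expr.name]
    if isinstance(expr, Sub):
        return evaluate(expr.left, row) - evaluate(expr.right, row)
    raise TypeError(f"unsupported expression: {expr!r}")


_ = Underscore()
subtotal = _.total - _.sales_tax  # a tree, not a number
result = evaluate(subtotal, {"total": 108.0, "sales_tax": 8.0})
```

Because subtotal is a data structure rather than opaque Python code, a system can examine which features it references before running anything. An ordinary resolver function body cannot be inspected in the same way.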

Conditions and filters

DataFrame features can be filtered with underscore expressions.

Extending our Transaction example, we can create a User feature class with a has-many relationship to Transaction. Then, we can define a feature representing the number of large purchases by referencing the existing User.transactions feature:

from chalk.features import _, features, DataFrame

@features
class Transaction:
  id: int
  user_id: "User.id"
  total: float
  sales_tax: float
  subtotal: float = _.total - _.sales_tax

@features
class User:
  id: int
  # implicit has-many relationship with Transaction due to `user_id` above
  transactions: DataFrame[Transaction]
  num_large_transactions: int = _.transactions[_.total > 1000].count()

The object referenced by _ changes depending on its current scope. In this code, the _ in _.transactions references the User object. Within the DataFrame filter, the _ in _.total references each Transaction object as each one is evaluated. The count aggregation is covered in the next section.
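
As a rough analogy for the two scopes, the plain-Python sketch below computes the same count with explicit variable names. It does not use Chalk. The dataclasses and the num_large_transactions function are hypothetical stand-ins in which user plays the role of the outer _ and txn plays the role of the _ inside the filter.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Transaction:
    id: int
    total: float


@dataclass
class User:
    id: int
    transactions: List[Transaction] = field(default_factory=list)


def num_large_transactions(user: User) -> int:
    # Outer scope: `user` corresponds to the `_` in `_.transactions`.
    # Inner scope: `txn` corresponds to the `_` in `_.total > 1000`,
    # rebound to each Transaction as the filter is evaluated.
    return sum(1 for txn in user.transactions if txn.total > 1000)


user = User(
    id=1,
    transactions=[
        Transaction(id=1, total=250.0),
        Transaction(id=2, total=1500.0),
        Transaction(id=3, total=1000.0),  # not counted: the filter is strictly greater than
        Transaction(id=4, total=4200.0),
    ],
)
large = num_large_transactions(user)
```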

Projections and aggregations

DataFrame features support projection with underscore expressions, which produce a new DataFrame scoped down to the referenced columns. DataFrames can be aggregated after eligible columns are selected.

With our Transaction example, we already saw a count aggregation for counting the number of large transactions. We can add another aggregation for computing the user’s total spend:

from chalk.features import _, features, DataFrame

@features
class Transaction:
  id: int
  user_id: "User.id"
  total: float
  sales_tax: float
  subtotal: float = _.total - _.sales_tax

@features
class User:
  id: int
  # implicit has-many relationship with Transaction due to `user_id` above
  transactions: DataFrame[Transaction]
  num_large_transactions: int = _.transactions[_.total > 1000].count()
  total_spend: float = _.transactions[_.total].sum()

To compute User.total_spend, we first project the User.transactions DataFrame down to the Transaction.total column so that the sum aggregation has a single column to operate on. In contrast, the count aggregation in num_large_transactions needs no projection because count works on DataFrames with any number of columns.
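
The difference between the two aggregations can be sketched in plain Python, with a list of dictionaries standing in for the transactions DataFrame. This is only an analogy with made-up sample rows, not how Chalk evaluates the expressions.

```python
# Hypothetical sample rows standing in for User.transactions.
transactions = [
    {"id": 1, "total": 20.0, "sales_tax": 1.5},
    {"id": 2, "total": 35.5, "sales_tax": 2.5},
    {"id": 3, "total": 4.5, "sales_tax": 0.0},
]

# Projection: keep only the "total" column, so sum has one column to add up.
totals = [row["total"] for row in transactions]
total_spend = sum(totals)

# Counting rows needs no projection: it works whatever columns are present.
row_count = len(transactions)
```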

All supported operations

Arithmetic

  • Addition: +
  • Subtraction: -
  • Multiplication: *
  • Division: /

Conditions

  • Greater than: >
  • Greater than or equal: >=
  • Less than: <
  • Less than or equal: <=
  • Equal: ==
  • Not equal: !=
  • Boolean and: &
  • Boolean or: |

Do not use Python’s and, or, not, or is operators in underscore expressions. Python does not allow these operators to be overridden, so they will not work with Chalk’s underscore expressions.
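
The following plain-Python sketch shows the underlying language behavior. The Column and Cond classes are hypothetical and unrelated to Chalk's internals. They only demonstrate that & and | dispatch to overridable methods, while and short-circuits on truthiness and silently discards one operand.

```python
class Cond:
    """Hypothetical deferred condition that records its own description."""

    def __init__(self, text: str) -> None:
        self.text = text

    def __and__(self, other: "Cond") -> "Cond":
        return Cond(f"({self.text} AND {other.text})")

    def __or__(self, other: "Cond") -> "Cond":
        return Cond(f"({self.text} OR {other.text})")


class Column:
    """Hypothetical column reference whose comparisons return Cond objects."""

    def __init__(self, name: str) -> None:
        self.name = name

    def __gt__(self, value: float) -> Cond:
        return Cond(f"{self.name} > {value}")

    def __lt__(self, value: float) -> Cond:
        return Cond(f"{self.name} < {value}")


total = Column("total")
sales_tax = Column("sales_tax")

# `&` calls Cond.__and__, so both conditions are preserved.
combined = (total > 1000) & (sales_tax < 50)

# `and` cannot be overridden: Python tests the truthiness of the left operand
# (objects are truthy by default) and returns the right operand unchanged.
short_circuited = (total > 1000) and (sales_tax < 50)
```

Here combined describes both conditions, while short_circuited has lost the total > 1000 condition without raising an error. Note also that & and | bind more tightly than comparison operators in Python, so each comparison should be wrapped in parentheses as shown.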

Aggregations

Aggregation functions differ in how they handle None values and empty DataFrames. If the table below lists None values as skipped for an aggregation, that aggregation treats a DataFrame containing only None values as empty.

Function               None values  Empty DataFrame  Notes
sum                    Skipped      Returns 0
min                    Skipped      Returns None
max                    Skipped      Returns None
mean                   Skipped      Returns None     Feature type must be float or float | None. None values are not included in the mean calculation.
count                  Included     Returns 0
any                    Skipped      Returns False
all                    Skipped      Returns True
std                    Skipped      See notes        Standard deviation. Requires at least 2 values; returns None for DataFrames with fewer than 2 values. Aliases: stddev, stddev_sample, std_sample.
var                    Skipped      See notes        Variance. Same requirements as std. Alias: var_sample.
approx_count_distinct  Skipped      Returns 0
approx_percentile      Skipped      Returns None     Takes one argument, quantile, expected to be a float in the range [0, 1]. Example: approx_percentile(0.75) returns a value approximately equal to the 75th percentile of the non-None values in the DataFrame.

For aggregations that can return None, either mark the feature as optional (for example, by setting the feature type to float | None) or use coalesce to fall back to a default value.
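
The None and empty-DataFrame rules for a few of these aggregations can be modeled in plain Python as follows. This is a sketch of the documented behavior, not Chalk's code. The agg_* helpers and the local coalesce function are hypothetical, and using the sample standard deviation is an assumption based on the stddev_sample alias.

```python
import statistics
from typing import List, Optional

Values = List[Optional[float]]


def non_null(values: Values) -> List[float]:
    """Drop None entries, mirroring the "Skipped" behavior."""
    return [v for v in values if v is not None]


def agg_sum(values: Values) -> float:
    return sum(non_null(values))  # no usable values -> 0


def agg_count(values: Values) -> int:
    return len(values)  # None values are included in the count


def agg_mean(values: Values) -> Optional[float]:
    kept = non_null(values)
    return statistics.fmean(kept) if kept else None


def agg_std(values: Values) -> Optional[float]:
    kept = non_null(values)
    # Assumption: sample standard deviation, which needs at least 2 values.
    return statistics.stdev(kept) if len(kept) >= 2 else None


def coalesce(*candidates):
    """Local stand-in: return the first argument that is not None."""
    for candidate in candidates:
        if candidate is not None:
            return candidate
    return None


values: Values = [10.0, None, 20.0]
total = agg_sum(values)
count = agg_count(values)
mean = agg_mean(values)
mean_or_default = coalesce(agg_mean([None, None]), 0.0)
```

A feature backed by agg_mean alone would need an optional type, because the all-None and empty cases return None. Wrapping the result in coalesce removes that case by supplying a default.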

Additional functions

The chalk.functions module exposes several helpful functions that can be used in combination with underscore references to transform features: