Underscore - Chalk

Overview

Underscore expressions are used to derive features from operations on other features. In feature definitions, _ represents a reference to the containing feature class and is used to access other features in the same instance. In addition to arithmetic operations, underscore expressions can also be used to filter and aggregate data.

Underscore expressions are useful because Chalk statically analyzes and optimizes them, leading to better performance when compared to equivalent Python resolvers. They are also useful for succinctly defining common features, such as the number of times a user failed a login attempt over the past 30 days or the total amount spent at a given merchant.

Arithmetic

In this example, we have a Transaction feature class with total and sales_tax features and we want to define subtotal as total minus sales_tax. Instead of writing a Python resolver, we can resolve subtotal with an underscore expression:

from chalk import online
from chalk.features import _, features

@features
class Transaction:
  id: int
  total: float
  sales_tax: float
  subtotal: float = _.total - _.sales_tax

@online
def get_subtotal(total: Transaction.total, sales_tax: Transaction.sales_tax) -> Transaction.subtotal:
    return total - sales_tax

Conditions and filters

DataFrame features can be filtered with underscore expressions.

Extending our Transaction example, we can create a User feature class with a has-many relationship to Transaction. Then, we can define a feature representing the number of large purchases by referencing the existing User.transactions feature:

from chalk.features import _, features, DataFrame

@features
class Transaction:
  id: int
  user_id: "User.id"
  total: float
  sales_tax: float
  subtotal: float = _.total - _.sales_tax

@features
class User:
   id: int
   # implicit has-many relationship with Transaction due to `user_id` above
   transactions: DataFrame[Transaction]
   num_large_transactions: int = _.transactions[_.total > 1000].count()

The object referenced by _ changes depending on its current scope. In this code, the _ in _.transactions references the User object. Within the DataFrame filter, the _ in _.total references each Transaction object as each one is evaluated. The count aggregation is covered in the next section.

Projections and aggregations

DataFrame features support projection with underscore expressions, which produce a new DataFrame scoped down to the referenced columns. DataFrames can be aggregated after eligible columns are selected.

With our Transaction example, we already saw a count aggregation for counting the number of large transactions. We can add another aggregation for computing the user’s total spend:

from chalk.features import _, features, DataFrame

@features
class Transaction:
  id: int
  user_id: "User.id"
  total: float
  sales_tax: float
  subtotal: float = _.total - _.sales_tax

@features
class User:
  id: int
  # implicit has-many relationship with Transaction due to `user_id` above
  transactions: DataFrame[Transaction]
  num_large_transactions: int = _.transactions[_.total > 1000].count()
  total_spend: float = _.transactions[_.total].sum()

To compute User.total_spend, we needed to create a projection of the User.transactions DataFrame limited to only the Transaction.total column so that the sum aggregation could work. In contrast, no projection was needed for num_large_transaction’s count aggregation because count works on DataFrames with any number of columns.

All supported operations

Arithmetic

Addition: +
Subtraction: -
Multiplication: *
Division: /

Conditions

Greater than: >
Greater than or equal: >=
Less than: <
Less than or equal: <=
Equal: ==
Not equal: !=
Boolean and: &
Boolean or: |

Do not use Python’s and, or, not, or is operators in underscore expressions. Python does not allow these operators to be overridden, so they will not work with Chalk’s underscore expressions.

Aggregations

Aggregation functions have varying behavior when handling None values and empty DataFrames. If an aggregation function says None values are skipped in the table below, it will consider a DataFrame with only None values as empty.

Function	`None` values	Empty DataFrame	Notes
`sum`	Skipped	Returns `0`
`min`	Skipped	Returns `None`
`max`	Skipped	Returns `None`
`mean`	Skipped	Returns `None`	Feature type must be `float` or `float \| None`. `None` values are skipped, meaning they are not included in the mean calculation.
`count`	Included	Returns `0`
`any`	Skipped	Returns `False`
`all`	Skipped	Returns `True`
`std`	Skipped	See notes	Standard deviation. Requires at least 2 values. For DataFrames with less than 2 values, returns `None`. Aliases: `stddev`, `stddev_sample`, `std_sample`.
`var`	Skipped	See notes	Variance. Same requirements as `std`. Alias: `var_sample`.
`approx_count_distinct`	Skipped	Returns `0`
`approx_percentile`	Skipped	Returns `None`	Takes one argument, `quantile`, expected to be a float in the range `[0, 1]`. Example: `approx_percentile(0.75)` returns a value approximately equal to the 75th percentile of the not-`None` values in the DataFrame.

For aggregations that can return None, either mark the feature as optional (for example, by setting the feature type to float | None) or use coalesce to fall back to a default value.

Additional functions

The chalk.functions module exposes several helpful functions that can be used in combination with underscore references to transform features:

​Overview

​Arithmetic

​Conditions and filters

​Projections and aggregations

​All supported operations

​Arithmetic

​Conditions

​Aggregations

​Additional functions

On this page

Overview

Arithmetic

Conditions and filters

Projections and aggregations

All supported operations

Arithmetic

Conditions

Aggregations

Additional functions