# Expressions
source: https://docs.chalk.ai/docs/expression

## Using Chalk expressions to define features

### Overview

Chalk Expressions let you define features declaratively, using symbolic computation over your data.
While you write expressions in idiomatic Python, they are compiled and executed as vectorized C++,
enabling low-latency computation at serve time and high-throughput processing at train time.

Expressions support a wide range of operations, including arithmetic, filtering, aggregations,
and built-in functions like .

For example, in a Transaction feature class, we can compute the subtotal of a transaction as the
difference between its total and sales tax:

```
from chalk.features import features, Primary
from chalk import _

@features
class Transaction:
    id: int
    total: float
    sales_tax: float
    subtotal: float = _.total - _.sales_tax
```

The _ symbol refers to the current scope (here, the feature class Transaction)
and is used to reference other fields on the same instance.
Expressions like _.total - _.sales_tax are compiled into native execution
plans that run efficiently in production.

In addition to referencing fields on the same object, you can traverse relationships.
If each Transaction is associated with a User, for example, you can compute a
string similarity between the user’s name and the transaction memo:

```
  from chalk.features import _, features
+ import chalk.functions as F

  @features
  class Transaction:
      ...
      amount: float
      memo: str
      user: "User"
+     name_match_score: float = F.jaccard_similarity(
+       _.user.name, _.memo
+     )
```

Here, _.user.name follows the foreign key relationship from Transaction to User.
The function F.jaccard_similarity is one of many built-in Chalk functions that
operate on symbolic expressions.

Expressions can also perform aggregations over related records.
In a User feature class, we can compute aggregates like the
number of large transactions or the total amount spent:

```
from chalk import _
from chalk.features import DataFrame, features

@features
class User:
    id: int
    name: str
    transactions: DataFrame[Transaction]

    # Total spend is nullable because the sum of an empty DataFrame is null
    total_spend: float | None = _.transactions[_.total].sum()

    # The count is never null, because an empty DataFrame has count 0
    num_large_txns: int = _.transactions[_.total > 1000].count()
```

In this context, _ refers to the User instance when referring to _.transactions.
But when you apply a filter, like _.transactions[_.total > 1000],
the expression inside the brackets is evaluated in the context of each individual Transaction.
That means _.total refers to the total field on each Transaction, not on the User.
This scoped evaluation makes it easy to filter, project, and aggregate over related data.

All expressions are statically analyzed, optimized to eliminate redundant computation,
and executed as high-performance C++ at runtime.

### Scalar Functions

Chalk expressions support a wide range of built-in functions for manipulating data, performing calculations,
and transforming features. These functions can be used in expressions to operate on feature values, DataFrames, and other data types.

```
- from chalk import features, online
+ from chalk.features import _, features

@features
class Transaction:
    id: int
    total: float
    sales_tax: float
+   subtotal: float = _.total - _.sales_tax

- @online
- def get_subtotal(total: Transaction.total, sales_tax: Transaction.sales_tax) -> Transaction.subtotal:
-     return total - sales_tax
```

### Infix Operators

Chalk expressions support a variety of infix operators for arithmetic, conditions, and boolean logic.

| Operator | Description           | Example                       |
| -------- | --------------------- | ----------------------------- |
| `+`      | Addition              | `_.total + _.sales_tax`       |
| `-`      | Subtraction           | `_.total - _.sales_tax`       |
| `*`      | Multiplication        | `_.quantity * _.price`        |
| `/`      | Division              | `_.total / _.quantity`        |
| `>`      | Greater than          | `_.total > 1000`              |
| `>=`     | Greater than or equal | `_.total >= 1000`             |
| `<`      | Less than             | `_.total < 1000`              |
| `<=`     | Less than or equal    | `_.total <= 1000`             |
| `==`     | Equal                 | `_.status == "completed"`     |
| `!=`     | Not equal             | `_.status != "completed"`     |
| `&`      | Boolean and           | `_.is_active & _.is_verified` |
| \|       | Boolean or            | _.is_active \| _.is_verified  |
| `~`      | Boolean not           | `~_.is_active`                |

Python does not allow these operators to be overridden, so they will not work with Chalk's expressions.
Instead, use the infix operators &, |, and ~ for boolean logic,
and use == and != for equality comparisons.

### Builtin Functions

The chalk.functions module exposes several helpful functions that can be used in
combination with expression references to transform features. These functions are meant
to be used in expressions and are not available as standalone functions. To view all
available functions, see our SDK docs.

### Structs

Expressions can be used to access nested attributes from other features in a feature class, whether these
other features are struct dataclasses, features, or DataFrames.

```
import chalk.functions as F
from chalk.features import features
from dataclasses import dataclass

@dataclass
class LatLon:
    lat: float | None
    lon: int | None

@features
class User:
    id: int
    home: LatLon
    work: LatLon
    commute_distance: float = F.haversine(
        lat1=_.home.lat,
        lon1=_.home.lon,
        lat2=_.work.lat,
        lon2=_.work.lon,
    )
```

### Custom Functions

You can create custom functions to encapsulate complex logic or reusable computations in your expressions.
For example, if you wanted to apply consistent windows across many features, you could define a custom function
like this:

Use helper functions to create expressions

```
from chalk import _
from chalk.features import features, DataFrame

def count_where(*filters):
    return _.transactions[_.created_at > _.chalk_window, *filters].count()

@features
class User:
    id: int
    transactions: DataFrame[Transaction]
    num_large_transactions: int = count_where(_.total > 1000)
    num_small_transactions: int = count_where(_.total < 100)
```

Expressions are not supported in Python resolvers, so you cannot use Chalk functions like F.jaccard_similarity
in a Python resolver.

Don't use expressions in Python resolvers

```
@features
class User:
    id: int
    name: str
    email: str
    name_email_match_score: float

@online
def get_score(
    name: User.name,
    email: User.email,
) -> User.name_email_match_score:
    # Don't do this!! Expressions don't run in Python resolvers
    return F.jaccard_similarity(name, email)
```

Instead, use expressions to define the feature directly in the feature class:

Use expressions in feature classes

```
@features
class User:
    id: int
    name: str
    email: str
    name_email_match_score: float = F.jaccard_similarity(
        _.name, _.email
    )
```

### DataFrame Functions

### Conditions and filters

DataFrame features can be filtered with expressions.

Extending our Transaction example, we can create a User feature class with a has-many relationship
to Transaction. Then, we can define a feature representing the number of large purchases by referencing the existing
User.transactions feature:

```
  from chalk.features import _, features, DataFrame

  @features
  class Transaction:
      id: int
+     user_id: "User.id"
      total: float
      sales_tax: float
      subtotal: float = _.total - _.sales_tax

+ @features
+ class User:
+    id: int
+    # implicit has-many relationship with Transaction due to `user_id` above
+    transactions: DataFrame[Transaction]
+    num_large_transactions: int = _.transactions[_.total > 1000].count()
```

The object referenced by _ changes depending on its current scope. In this code, the _ in _.transactions
references the User object.
Within the DataFrame filter, the _ in _.total references each Transaction object as each one is evaluated.
The count aggregation is covered in the next section.

### Projections and aggregations

DataFrame features support projection with expressions, which produce a new
DataFrame scoped down to the referenced columns. DataFrames can be aggregated after eligible columns are selected.

With our Transaction example, we already saw a count aggregation for counting the number of large transactions. We
can add another aggregation for computing the user's total spend:

```
from chalk.features import _, features, DataFrame

@features
class Transaction:
    id: int
    user_id: "User.id"
    sales_tax: float
    subtotal: float
    total: float = _.subtotal + _.sales_tax

@features
class User:
    id: int
    transactions: DataFrame[Transaction]
    num_large_transactions: int = _.transactions[_.total > 1000].count()
+   total_spend: float = _.transactions[_.total].sum()
```

To compute User.total_spend, we needed to create a projection of the User.transactions
DataFrame limited to only the Transaction.total column so that the sum aggregation could work.
In contrast, no projection was needed for the num_large_transactions aggregation because count
works on DataFrames with any number of columns.

For computing low-latency aggregations over high volumes of data, Chalk also offers
materialized windowed aggregations
that uses materialization of buckets of data to compute large aggregations efficiently.

### Aggregation functions

Aggregation functions have varying behavior when handling None values and empty DataFrames.
If an aggregation function says None values are skipped in the table below,
it will consider a DataFrame with only None values as empty.

| Function                | `None` values | Empty DataFrame | Notes                                                                                                                                                                                                                    |
| ----------------------- | ------------- | --------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `sum`                   | Skipped       | Returns `0`     |                                                                                                                                                                                                                          |
| `min`                   | Skipped       | Returns `None`  |                                                                                                                                                                                                                          |
| `max`                   | Skipped       | Returns `None`  |                                                                                                                                                                                                                          |
| `mean`                  | Skipped       | Returns `None`  | Feature type must be `float` or `float \| None`. `None` values are skipped, meaning they are not included in the mean calculation.                                                                                       |
| `count`                 | Included      | Returns `0`     |                                                                                                                                                                                                                          |
| `any`                   | Skipped       | Returns `False` |                                                                                                                                                                                                                          |
| `all`                   | Skipped       | Returns `True`  |                                                                                                                                                                                                                          |
| `std`                   | Skipped       | See notes       | Standard deviation. Requires at least 2 values. For DataFrames with less than 2 values, returns `None`. ; Aliases: `stddev`, `stddev_sample`, `std_sample`.                                                              |
| `var`                   | Skipped       | See notes       | Variance. Same requirements as `std`. ; Alias: `var_sample`.                                                                                                                                                             |
| `approx_count_distinct` | Skipped       | Returns `0`     |                                                                                                                                                                                                                          |
| `approx_percentile`     | Skipped       | Returns `None`  | Takes one argument, `quantile`, expected to be a float in the range `[0, 1]`. ; Example: `approx_percentile(0.75)` returns a value approximately equal to the 75th percentile of the not-`None` values in the DataFrame. |
| `approx_top_k`          | Skipped       | Returns `None`  | Takes one argument, `k`, expected to be a positive integer, as a keyword argument. ; Example: `approx_top_k(k=25)` returns the 25 or fewer approximately-most common values.                                             |

For aggregations that can return None, either mark the feature as optional
(for example, by setting the feature type to float | None) or use coalesce
to fall back to a default value.

### Run conditions and Execution

To specify run conditions such as environment, tags, and versions for a feature
that is resolved through an expression,
you can use the feature function and pass in the expression as an argument.

```
from chalk.features import Primary, features, feature

@features
class User:
    id: int

    purchases: DataFrame[Purchase]
    # Uses a default value of 0 when one cannot be computed.
    num_purchases: int = feature(
        expression=_.purchases.count(),
        default=0,
        environment=["staging", "dev"],
        tags=["fraud", "credit"],
        version=1,
    )
```

You may also want to define expressions that are only meant to be used for the offline pathway, say
if you would like to compute values or aggregations dependent on offline resolvers. To do so, you can
specify a distinct offline_expression as an offline resolver for your feature.

```
from chalk.features import features, feature 

@features 
class Merchant: 
    id: int 
    
    transactions: DataFrame[Transaction]

    merchant_risk_score: float = feature(
        expression=_.transactions[_.status=="approved"].count() / _.transactions.count(),
        offline_expression=(0.50*_.transactions[_.is_chargeback==True].count()+0.25*_.transactions[_.is_refund==True].count()+0.25*_.transactions[_.status=="approved"].count())/_.transactions.count()
    )
```

### Testing

To test your expressions, we recommend setting up integration tests or
iterating on a branch.

### Dynamic Expressions

In some cases, you may want to build expressions dynamically.
For example, if you have a rules engine that generates expressions
and stores them in a database, you can load those expressions at runtime
and ask Chalk to compute their values.

Consider this User and Transaction model, which would be checked in to the code:

```
from chalk.features import features, DataFrame

@features
class 
Transaction:
    id: int
    user_id: "User.id"
    user: "User"
    total: float
    sales_tax: float

@features
class User:
    id: int
    transactions: DataFrame[Transaction]
    name: str
    email: str
```

Using the Golang
or Python
SDKs, you can compute both scalar and aggregate features on demand.

For example, in Golang, if we wanted to compare the lowercased name and email
by their Jaccard similarity, we could build the expression dynamically and ask
Chalk to compute it:

```
package main

import (
	"testing"
	"github.com/zeebo/assert"
	"github.com/chalk-ai/chalk-go/expr"
	"github.com/chalk-ai/chalk-go"
)

func TestQueryingExpressions(t *testing.T) {
	// Picks up the ambient credentials from the `chalk login` run on the CLI.
	client, err := chalk.NewGRPCClient(t.Context())
	assert.NoError(t, err)
	result, err := client.OnlineQueryBulk(
		t.Context(),
		chalk.OnlineQueryParams{}.
			WithInput("user.id", []int{1}).
			WithOutputs("user.name").
			WithOutputs("user.email").
			WithOutputExprs(
				expr.FunctionCall(
					"jaccard_similarity",
					expr.FunctionCall("lower", expr.Col("name")),
					expr.FunctionCall("lower", expr.Col("email")),
				).
					As("user.name_email_sim"),
			),
	)
	assert.NoError(t, err)
	row, err := result.GetRow(0)
	assert.NoError(t, err)
	for feature, value := range row.Features {
		t.Logf("Feature: %s, Value: %+v", feature, value.Value)
	}
}
```

This program produces output like this:

```
=== RUN   TestChalkClient
    chalk_test.go:46: Feature: user.name, Value: Nicole Mann
    chalk_test.go:46: Feature: user.email, Value: nicoleam@nasa.gov
    chalk_test.go:46: Feature: user.name_email_sim, Value: 0.5714285714285714
--- PASS: TestChalkClient (0.43s)
```

The feature user.name_email_sim was computed using the expression
jaccard_similarity(lower(name), lower(email)), which was built dynamically
using the expr package.

For a complete list of functions, see our SDK docs.
All the functions available in expressions are also available in the expr package.

You can also compute aggregations dynamically.
Using the Golang SDK, we can compute the number of large transactions for a user:

```
result, err := client.OnlineQueryBulk(
    t.Context(),
    chalk.OnlineQueryParams{}.
        WithInput("user.id", []int{1}).
        WithOutputExprs(
            expr.DataFrame("transactions").
                Filter(expr.Col("amount").Gt(expr.Float(0.))).
                Agg("count").
                As("user.positive_transaction_count"),
        ),
)
```

For this program, the output would look like this:

```
=== RUN   TestChalkClient
    chalk_test.go:42: Feature: user.positive_transaction_count, Value: 33
--- PASS: TestChalkClient (0.45s)
```

In a Python expression, the above SDK call is equivalent to:

```
User.positive_transaction_count = _.transactions[_.amount > 0].count()
```

Both the scalar functions and aggregations can be computed in Python SDK.
See this guide for more details.





