Temporal Consistency

Introduction

Temporal consistency is crucial for ensuring that your model’s training performance reflects how it will perform in production. When creating datasets for model training, your system must consistently retrieve data as it would have appeared at a specific point in time. If your model trains on data that only became available after the point at which it would have made a prediction, it will not perform consistently in production.

Chalk performs point-in-time lookups on your training data so you can train your models knowing they won’t receive data from the future, even across complex relationships.

Scenario 1: Historical loan decisions

In this scenario, we will consider a machine learning model for determining whether to issue loans to businesses based on their financial performance.

For each business, we track sales and COGS (cost of goods sold) over time in millions of dollars. We define a Business feature class with sales and cogs features:

from chalk.features import features, FeatureTime

@features
class Business:
    id: int
    sales: float     # sales, in millions of dollars
    cogs: float      # cost of goods sold, in millions of dollars
    ts: FeatureTime  # when each observation of this business was made

After collecting historical data on the performance of businesses we previously considered, we want to retrain our model.

For a business with id=123, we gave loans at t1 and t2:

[Timeline of observations for the business with id=123: Business.sales is observed at 1.3 before t1 and at 1.0 between t1 and t2; Business.cogs is observed at 0.5 before t1, then at 0.3 and 0.4 between t1 and t2. Samples are taken at t1 and t2.]

In training, you want to know the sales and COGS observed for the business at the time you made the loan, without knowing the future values of those features.

For example, at t1, when we issued the loan, we had observed sales of $1.3M. It would be inconsistent for this model to train with the $1M data point because this data would not have existed at t1.

At t1 and t2, the temporally consistent values of sales and cogs would be as follows:

Feature          Value at t1   Value at t2
Business.sales   1.3           1.0
Business.cogs    0.5           0.4

Each of these values occurred at or before the sample time and is valid to use in training.
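
To make this rule concrete, here is a minimal sketch (not Chalk's implementation) of an as-of lookup over the observations in the timeline above: for each sample time, take the most recent observation of each feature made at or before that time. The observation timestamps are illustrative.

from datetime import datetime, timedelta

# Hypothetical observation log for business id=123, mirroring the timeline above.
# Each entry is (observation_time, value), stored in time order.
now = datetime.now()
t1 = now - timedelta(days=365)
t2 = now - timedelta(days=30)

observations = {
    "sales": [
        (t1 - timedelta(days=10), 1.3),
        (t1 + timedelta(days=100), 1.0),
    ],
    "cogs": [
        (t1 - timedelta(days=20), 0.5),
        (t1 + timedelta(days=50), 0.3),
        (t1 + timedelta(days=200), 0.4),
    ],
}


def as_of(feature: str, sample_time: datetime) -> float:
    """Return the most recently observed value at or before sample_time."""
    eligible = [value for ts, value in observations[feature] if ts <= sample_time]
    return eligible[-1]  # observations are stored in time order


assert as_of("sales", t1) == 1.3 and as_of("cogs", t1) == 0.5
assert as_of("sales", t2) == 1.0 and as_of("cogs", t2) == 0.4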

Sample code

Chalk allows you to control the time your query considers to be “now”, which is covered in greater detail in our time documentation.

To set the query’s “now” time, pass input_times to your offline queries:

from datetime import datetime, timedelta

from chalk.client import ChalkClient

t1 = datetime.now() - timedelta(days=365)
t2 = datetime.now() - timedelta(days=30)

dataset = ChalkClient().offline_query(
    input={
        # Sample business 123 twice because we have two input_times
        Business.id: [123, 123],
    },
    # Each element of `input_times` corresponds to the element
    # with the same index in `input`
    input_times=[t1, t2],
    # Sample all features of Business.
    # Alternatively, sample only the features you need:
    #   output=[Business.sales, Business.cogs]
    output=[Business],
)

This query will return a Dataset with temporally consistent values for each input time.
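
To inspect the results, you can load the Dataset into a DataFrame, for example with its to_pandas() helper (a sketch; column names typically follow the fully qualified feature names, e.g. business.sales):

# Each returned row corresponds to one (Business.id, input_time) pair,
# with feature values resolved as of that input time.
df = dataset.to_pandas()
print(df)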

Scenario 2: Backfilling new features

Temporal consistency is especially difficult when you want to build new features. Building on the first scenario, imagine you’ve observed Business.sales and Business.cogs many times in the past, and for each of the businesses that you track, these values have changed over time.

Now, you want to compute a new feature, Business.gross_profit, which is the difference between Business.sales and Business.cogs. We can add an underscore expression to compute this new feature:

from chalk.features import _, features, FeatureTime

@features
class Business:
    id: int
    sales: float
    cogs: float
    ts: FeatureTime
    gross_profit: float = _.sales - _.cogs

With the same historical data from the previous scenario, we can determine the correct inputs for computing Business.gross_profit:

[The same timeline of observations for the business with id=123 as in Scenario 1, annotated with the inputs to Business.gross_profit at each sample time: (1.3, 0.5) at t1 and (1, 0.4) at t2.]

Business.gross_profit at t1 and t2 would be computed with the following values:

Query time   Business.sales   Business.cogs   Business.gross_profit
t1           1.3              0.5             0.8
t2           1.0              0.4             0.6

In your code, you can compute historical values for updated features by running an offline query with recompute_features=True. You may also consider one of our other backfill options for other use cases.
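
For example, a backfill of Business.gross_profit for the same two loan decisions could look like the following sketch, reusing the query from the first scenario:

from datetime import datetime, timedelta

from chalk.client import ChalkClient

t1 = datetime.now() - timedelta(days=365)
t2 = datetime.now() - timedelta(days=30)

dataset = ChalkClient().offline_query(
    input={Business.id: [123, 123]},
    input_times=[t1, t2],
    output=[Business.gross_profit],
    # Recompute the requested features from their definitions instead of
    # sampling previously stored values, so the new feature is backfilled
    # with temporally consistent inputs.
    recompute_features=True,
)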

As you start layering more resolvers or using has-many relationships, temporal consistency becomes even more difficult to reason about. Chalk manages the complexity under the hood so that you can write any query and expect temporally consistent results.