Chalk home page
Docs
API
CLI
  1. Time
  2. Time

Overview

Chalk uses two main timestamps you should be aware of as you build your Chalk project:

  1. Feature time (FeatureTime): The time at which a feature was observed, which is used in query time filters and aggregation time windows.
  2. Query time (Now): The time a query should assume is “now” when retrieving data. Features whose feature time comes after a given query time will never be returned for those queries, in online or offline contexts.

Feature time is returned in the __observed_at__ column in Chalk query results. Query time is returned in the __ts__ column.

Feature time

Feature time is the time at which a feature was observed. By default, Chalk sets a feature’s time to the feature’s resolver execution time. The feature time can be overridden for a feature class, accessed from resolver parameters, and requested in query inputs and outputs.

You may have multiple timestamps associated with your data. It’s important to set the feature time to the value that most closely represents when your system would have accessed the data in production.

For example, in an asynchronous streaming system, you may have one timestamp for when an event was added to your task queue and another timestamp for when the event was removed from the queue and processed. We recommend using the latter timestamp as your feature time to make your training most closely resemble production. If you use the timestamp for when an event was added to your task queue, you would be training your system with data it would not have been able to access in production.

Setting feature time for a feature class

Each individual feature has its own feature time, which is used to retrieve point-in-time correct data for temporal consistency.

To access the latest time associated with any feature in a feature class, use the special FeatureTime feature throughout Chalk.

By default, if it is present, Chalk will treat a feature named ts with datetime.datetime type as the FeatureTime value. Otherwise, you can use the FeatureTime type annotation to set a different name:

from chalk.features import features, FeatureTime

@features
class User:
    id: int
    name: str
    timestamp: FeatureTime  # datetime.datetime under the hood

Using your FeatureTime feature, you can access and override feature time for the whole feature class.

You can access the FeatureTime as a resolver input. In this example, ts will be set to the maximum feature time for all features passed as resolver parameters:

@offline
def fn(name: User.name, ts: User.timestamp) -> ...:

You can directly set the FeatureTime value by returning it from a resolver:

@offline
def fn(...) -> Features[User.name, User.timestamp]:
    return User(
        name="Maryam Mirzakhani",
        ts=datetime(2014, 8, 12, tzinfo=timezone.utc)
    )

You can include the FeatureTime feature in query output. Its value will be set to the maximum timestamp across all features in its feature class.

Time filtering in resolvers

has-many features create DataFrames. These DataFrames can be filtered with before and after.

Regardless of which time filters you use, Chalk will never return features where the feature time is strictly greater than the current query time (Now), in order to maintain temporal consistency.

Examples

To compute the number of transfers a user made in the last seven days, use after:

from chalk.features import after, ...

@online
def fn(transfers: User.transfers[after(days_ago=7)]) -> ...:
    return transfers.count()

To compute the number of transfers a user made more than seven days ago, use before:

from chalk.features import before, ...

@online
def fn(transfers: User.transfers[before(days_ago=7)]) -> ...:
    return transfers.count()

Combine before and after to retrieve transfers made 1-2 weeks ago:

from chalk.features import before, after, ...

@online
def fn(transfers: User.transfers[after(days_ago=14), before(days_ago=7)]) -> ...:
    return transfers.count()

All of these examples can be used in combination with other DataFrame projections and filters. You may also find windowed aggregations useful.

Interaction with the online store

Features with overriden observation timestamps are treated specially when inserted into the online store. In particular, Chalk will always check for existing “newer” feature values in the online store before inserting historically dated feature values. This means that you can safely ingest large quantities of backdated features without accidentally ingesting stale data into the online store.

Additionally, once features are inserted into the online store, Chalk tracks the source observation timestamps when these feature values are returned as part of online queries. Chalk uses these source timestamps to compute the “feature staleness” metric. Staleness in this context is defined as “query time - observation time”.

Interaction with the offline store

Features with overriden observation timestamps are inserted into the offline store with the timestamp that you specify. The observation timestamp works like an “effective as of” timestamp, so if you insert something like this:

| id | feature | value | timestamp            |
|---------------------------------------------|
| 1  | age     | 7     | 2022-02-01T00:00:00Z |

into an offline store that already contained these observations:

| id | feature | value | timestamp            |
|---------------------------------------------|
| 1  | age     | 6     | 2022-01-01T00:00:00Z |
| 1  | age     | 8     | 2022-03-01T00:00:00Z |

then the observation will be interleaved “in between” the existing observations, and you would see the following query results:

id, age, <= 2022-02-01
output: 7

id, age, <= 2022-03-02
output: 8

id, age, <= 2022-01-02
output: 6

Enforcing a TTL

Features in the offline store have an optional TTL (time to live). When a feature has a TTL value, it will never be returned at any time later than FeatureTime + TTL. For example, you may not want to consider credit scores which were retrieved more than a year ago. Setting offline_ttl will make credit_score return None if last observed credit score is more than one year old in comparison to the current query time.

@features
class User:
    id: str
    credit_score: int = feature(offline_ttl=timedelta(years=1))

Query time

Query time is the time treated as “now” within a query context. For online queries, Now is equal to datetime.now(). For offline queries, you can pass one or more timestamps that will be used as the query time for each input row.

Setting query time

In training, you will likely want to retrieve data as if you are at a point in the past to create the most accurate predictions. We cover this idea in greater detail in our temporal consistency documentation.

To set the query’s “now” time, pass input_times as either a single timestamp or as a list corresponding to the Now times to use for each entry in input:

from datetime import timezone

ChalkClient().offline_query(
    # Pass id 1 multiple times because we want to
    # request it with multiple input_times
    input={User.id: [1, 1, 1]},  
    input_times=[
        datetime.now(tz=timezone.utc) - timedelta(days=365 * 10),
        datetime.now(tz=timezone.utc) - timedelta(days=365),
        datetime.now(tz=timezone.utc) - timedelta(days=0),
    ],
    output=[User.age_in_years],
)

## Output:

# | id | age_in_years |
# | 1  | <age> - 10   |
# | 1  | <age> - 1    |
# | 1  | <age> - 0    |

Accessing query time in resolvers

To access the query time in your resolvers, you can reference a special feature called Now, which is a datetime.datetime object.

You can pass Now in Python resolvers:

from chalk import Now

@online
def get_age_in_years(birthday: User.birthday, now: Now) -> User.age_in_years:
    return (now - birthday).years

Now can be used in DataFrame resolvers as well in order to compute bulk values:

@online
def batch_get_age_in_years(df: DataFrame[User.id, User.birthday, Now]) -> DataFrame[User.id, User.age_in_years]:
    return (
        df.to_polars()
            .select(
                pl.col(User.id),
                pl.col(str(User.birthday) - pl.col(str(Now))).alias(str(User.age_in_years))
            )
    )

You can also reference ${now} in SQL file resolvers. If Now is used in a resolver to compute a has-many join, then the Now feature must be passed as input.

-- source: sql_file_resolver_temp_db
-- resolves: tv_episode
select id,
       name,
       season_no,
       episode_no,
       show_name,
       air_date
from tv_episodes
where air_date < ${now}
  and id = ${tv_episode.id}

Accessing query time in query results

Chalk Datasets return the query time in the __ts__ column.

When converting Chalk Datasets to Polars or Pandas DataFrames, you may want to include the query time column. To do so, pass output_ts to your to_polars or to_pandas calls. You may pass a column name to output_ts to set the name of the query time column. If you pass True, the query time column name will be __chalk__.CHALK_TS.

Be careful to not mix up __ts__ and ts. __ts__ represents the query time, or the time the query treats as "now" during query execution. ts is a common name for the feature representingFeatureTime, the time at which a feature was observed.

Timezone handling for naive datetimes

Chalk stores UTC as the timezone for naive datetime objects. Additionally, Chalk assumes UTC if retrieving naive datetimes from data stores.

We recommend that you include timezone information on all datetime objects you work with to avoid ambiguity.