Integrate with your BigQuery data warehouse.
Chalk has an integration with BigQuery that makes it easy to read queries and tables into your feature store.
To use BigQuery in your resolvers, you first need to add the Chalk GCP integration to the environments where you would like to use BigQuery.
When you query a BigQuery data source, Chalk pushes filters down into your queries to reduce the amount of data read from your tables. For larger queries, Chalk does not interpolate values directly into the SQL string, because BigQuery limits query length. Instead, it loads the values into a temporary table and queries against that table.
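The choice between inlining values and staging them in a temporary table can be sketched as follows. This is an illustration of the idea only, not Chalk's implementation: the function `plan_pushdown`, the `FilterPlan` type, the `max_inline` threshold, and the table name `tmp_dataset.chalk_keys` are all hypothetical, and keys are assumed to be integers.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class FilterPlan:
    sql: str                                # query to run against BigQuery
    temp_table_rows: Optional[List[int]]    # keys to stage first, or None if inlined


def plan_pushdown(
    base_sql: str,
    key_column: str,
    keys: Sequence[int],
    max_inline: int = 1000,                         # hypothetical cutoff
    temp_table: str = "tmp_dataset.chalk_keys",     # hypothetical table name
) -> FilterPlan:
    """Wrap base_sql with a filter on key_column, inline or via a staged table."""
    if not keys:
        raise ValueError("at least one key is required")
    if len(keys) <= max_inline:
        # Few keys: interpolate them directly into an IN (...) filter.
        inline = ", ".join(str(int(k)) for k in keys)
        sql = f"SELECT * FROM ({base_sql}) AS q WHERE q.{key_column} IN ({inline})"
        return FilterPlan(sql=sql, temp_table_rows=None)
    # Many keys: stage them in a table and join, keeping the SQL string short.
    sql = (
        f"SELECT q.* FROM ({base_sql}) AS q "
        f"JOIN `{temp_table}` AS k ON q.{key_column} = k.{key_column}"
    )
    return FilterPlan(sql=sql, temp_table_rows=list(keys))
```

In the second branch the SQL text stays the same length regardless of how many keys are requested, which is why a temporary table avoids the query length limit.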
The service account that you register in the data source needs the following permissions for Chalk to integrate fully with BigQuery. Here, the “target dataset” is the temporary project and temporary dataset if they are specified in the data source. Otherwise, the target dataset is the project and dataset of the data source itself.
bigquery.readsessions.create on the target dataset
bigquery.readsessions.getData on the target dataset
bigquery.tables.getData on the target dataset and all referenced datasets in queries
bigquery.tables.get on the target dataset and all referenced datasets in queries
bigquery.tables.create on the target dataset
bigquery.tables.updateData on the target dataset
bigquery.jobs.create on the target project
bigquery.jobs.get on the target project
bigquery.datasets.get on all projects, for Chalk SQL compatibility
Using BigQuery’s predefined IAM roles, you can get these permissions by ensuring that the service account has the following:
roles/bigquery.jobUser on the target project
roles/bigquery.dataEditor on the target dataset
roles/bigquery.dataViewer on the target dataset and all referenced datasets in queries
You can learn more about BigQuery IAM roles and permissions in Google Cloud's BigQuery IAM documentation.
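As a rough aid when auditing a service account, the permission list above can be written down as data and compared against what the account actually holds. The sketch below assumes you have already collected the granted permissions per scope by some other means; the helper `missing_permissions` and the scope labels are hypothetical and are not part of Chalk or the BigQuery client library.

```python
from typing import Dict, List, Set

# Permissions from the list above, grouped by the scope they apply to.
REQUIRED: Dict[str, Set[str]] = {
    "target dataset": {
        "bigquery.readsessions.create",
        "bigquery.readsessions.getData",
        "bigquery.tables.getData",
        "bigquery.tables.get",
        "bigquery.tables.create",
        "bigquery.tables.updateData",
    },
    "referenced datasets": {
        "bigquery.tables.getData",
        "bigquery.tables.get",
    },
    "target project": {
        "bigquery.jobs.create",
        "bigquery.jobs.get",
    },
    "all projects": {
        "bigquery.datasets.get",
    },
}


def missing_permissions(granted: Dict[str, Set[str]]) -> Dict[str, List[str]]:
    """Return, per scope, the required permissions absent from `granted`."""
    missing: Dict[str, List[str]] = {}
    for scope, needed in REQUIRED.items():
        gap = sorted(needed - granted.get(scope, set()))
        if gap:
            missing[scope] = gap
    return missing
```

An empty result means every permission in the list is covered for every scope; any other result names exactly what still needs to be granted and where.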
After adding the GCP integration, define your BigQuery data sources in Python:
from chalk.sql import BigQuerySource
risk = BigQuerySource(name="RISK")
marketing = BigQuerySource(name="MARKETING")
You can then reference them in SQL file resolvers using the name parameter. For example, to query from the RISK source:
-- type: online
-- resolves: User
-- source: RISK
SELECT id, credit_score FROM users
And to query from the MARKETING source:
-- type: online
-- resolves: User
-- source: MARKETING
SELECT id, email, campaign_status FROM users
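Both resolver files follow the same convention: leading `-- key: value` comment lines, then the query. To make that convention concrete, here is a minimal sketch that splits such a file into its header fields and SQL body. It is not Chalk's parser, whose behavior may differ; the function `split_resolver` and the regular expression are assumptions made for illustration.

```python
import re
from typing import Dict, Tuple

# Matches comment lines of the form "-- key: value".
HEADER_LINE = re.compile(r"^--\s*(\w+)\s*:\s*(.+?)\s*$")


def split_resolver(sql_text: str) -> Tuple[Dict[str, str], str]:
    """Split a SQL file resolver into (header fields, query body)."""
    headers: Dict[str, str] = {}
    lines = sql_text.strip().splitlines()
    body_start = len(lines)  # default: the file contains only header lines
    for i, line in enumerate(lines):
        match = HEADER_LINE.match(line.strip())
        if match is None:
            # First non-header line: everything from here on is the query.
            body_start = i
            break
        headers[match.group(1)] = match.group(2)
    body = "\n".join(lines[body_start:]).strip()
    return headers, body


example = """\
-- type: online
-- resolves: User
-- source: RISK
SELECT id, credit_score FROM users
"""

headers, body = split_resolver(example)
print(headers["source"])  # the name given to BigQuerySource(name=...)
print(body)
```

The `source` field is what ties a SQL file to a data source: its value is the same string passed as `name` when the `BigQuerySource` was defined.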