# AWS Cloud Deployment IAM Requirements
source: https://docs.chalk.ai/docs/aws-cloud-deployment

## Deploying Chalk to your AWS account.

Chalk's feature platform with best-in-class developer experience enables
machine learning teams to focus on building the unique products and models
that make their business stand out. Chalk provides a feature store so that
you can deploy production machine learning pipelines for real time data
in minutes.

Chalk is both a framework and a platform — developers can write code
using familiar Python packages, and deploy their feature and data pipeline definitions
to Chalk’s platform. In the Customer Cloud deployment, Chalk runs &
administers its platform on the customer’s cloud account. Chalk's managed
infrastructure then executes the customer defined pipelines to compute
feature
data for machine learning applications. Chalk then serves this data
back to customer applications for online inference and to customer
data teams for training set generation.

### Architecture

### IAM Role Permissions

In order to manage infrastructure in your cloud account, Chalk requires certain
IAM permissions. At a high level, Chalk needs the ability
to provision the key components of your infrastructure:

- Storage resources (buckets for Dataset storage and bulk insertion into offline
storage)
- Networking resources (LBs, VPCs, etc.)
- IAM resources (e.g. creating service accounts for workload identity, etc.)
- Kubernetes resources (EKS)
- Online storage (RDS, or Elasticache, or Dynamo)
- Offline storage (typically Snowflake, Iceberg, or Redshift)

### Setup Instructions

We typically recommend the following steps for enabling Chalk to provision and manage AWS
infrastructure in your cloud account:

- Create a new account in your AWS organization.
- Create a new role in the new account that enables Chalk's server
to perform actions in this account. Ensure that AssumeRole is granted.

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::754784422779:role/chalk-api-server"
            },
            "Action": "sts:AssumeRole",
            "Condition": {
                "StringEquals": {
                    "sts:ExternalId": "<your-chalk-external-id>"
                }
            }
        }
    ]
}
```

- Create an IAM policy using the JSON document below and attach this IAM policy
to the created role.

```
{
    "Statement": [
        {
            "Resource": "*",
            "Action": [
                "sqs:*",
                "sns:*",
                "acm:*",
                "secretsmanager:*",
                "s3:*",
                "redshift:*",
                "redshift-data:*",
                "redshift-serverless:*",
                "rds:*",
                "logs:*",
                "kms:*",
                "kafkaconnect:*",
                "kafka:*",
                "kafka-cluster:*",
                "iam:CreateServiceLinkedRole",
                "iam:UploadServerCertificate",
                "iam:UploadSSHPublicKey",
                "iam:UpdateServerCertificate",
                "iam:UpdateRoleDescription",
                "iam:UpdateRole",
                "iam:UpdateOpenIDConnectProviderThumbprint",
                "iam:UpdateAssumeRolePolicy",
                "iam:RemoveRoleFromInstanceProfile",
                "iam:RemoveClientIDFromOpenIDConnectProvider",
                "iam:PutRolePolicy",
                "iam:PassRole",
                "iam:ListSSHPublicKeys",
                "iam:ListRoles",
                "iam:ListRolePolicies",
                "iam:ListRoleTags",
                "iam:ListPolicyVersions",
                "iam:ListPolicyTags",
                "iam:ListPolicies",
                "iam:ListOpenIDConnectProviders",
                "iam:ListOpenIDConnectProviderTags",
                "iam:ListAttachedRolePolicies",
                "iam:GetServerCertificate",
                "iam:GetSSHPublicKey",
                "iam:GetRolePolicy",
                "iam:GetRole",
                "iam:GetPolicyVersion",
                "iam:GetPolicy",
                "iam:GetOpenIDConnectProvider",
                "iam:GetInstanceProfile",
                "iam:DetachRolePolicy",
                "iam:DeleteServiceLinkedRole",
                "iam:DeleteServerCertificate",
                "iam:DeleteRolePolicy",
                "iam:DeleteRole",
                "iam:DeleteOpenIDConnectProvider",
                "iam:DeleteInstanceProfile",
                "iam:CreateServiceLinkedRole",
                "iam:CreateRole",
                "iam:CreatePolicyVersion",
                "iam:CreatePolicy",
                "iam:CreateOpenIDConnectProvider",
                "iam:CreateInstanceProfile",
                "iam:AttachRolePolicy",
                "iam:TagPolicy",
                "iam:TagRole",
                "iam:TagOpenIDConnectProvider",
                "iam:AddRoleToInstanceProfile",
                "iam:AddClientIDToOpenIDConnectProvider",
                "iam:ListInstanceProfilesForRole",
                "iam:DeletePolicy",
                "iam:UntagPolicy",
                "iam:UntagRole",
                "iam:UntagOpenIDConnectProvider",
                "elasticloadbalancing:*",
                "eks:*",
                "ecr:*",
                "ec2:*",
                "cloudwatch:*",
                "autoscaling:*",
                "dynamodb:*",
                "dax:*",
                "application-autoscaling:*",
                "elasticache:CreateCacheSubnetGroup",
                "elasticache:DeleteCacheSubnetGroup",
                "elasticache:DescribeCacheSubnetGroups",
                "elasticache:ModifyCacheSubnetGroup",
                "elasticache:CreateReplicationGroup",
                "elasticache:DeleteReplicationGroup",
                "elasticache:DescribeReplicationGroups",
                "elasticache:ModifyReplicationGroup",
                "elasticache:DescribeCacheClusters",
                "elasticache:AddTagsToResource",
                "elasticache:RemoveTagsFromResource",
                "elasticache:ListTagsForResource",
                "glue:*",
                "route53:*"
            ],
            "Effect": "Allow"
        }
    ],
    "Version": "2012-10-17"
}
```

### What are these permissions for?

- cloudwatch: The metrics viewer in the web UI
- logs: Logs for the web UI
- acm: Provisioning SSL certs for client<>server encryption
- rds: RDS Online store
- redshift: Redshift offline store
- redshift-data: Redshift offline store
- redshift-serverless: Redshift offline store
- kafka: Asynchronous persistence queues for metrics & feature storage
- kafka-cluster: Asynchronous persistence queues for metrics & feature storage
- dynamodb: DynamoDB online store
- dax: DynamoDB online store
- application-autoscaling: DynamoDB online store, if auto-scaling is required
- ec2: ALB and EKS node pool management
- ecr: ECR image management for deployments
- eks: EKS cluster management for running feature engineering workloads
- kms: Encryption keys for secrets
- secretsmanager: Encrypting data source secrets from the web UI
- glue: Iceberg Glue offline store

### Ongoing permissions

Chalk's support team will work with you to scope permissions down to what are needed for ongoing maintenance.
The precise details depend on the level of ongoing support that your team needs and your compliance requirements.
In principle, Chalk does not require ongoing access to data, or to the ability to edit IAM permissions,
but Chalk requires ongoing access to update the software deployed in your environment. For managing resource
configurations such as autoscaling, the Chalk team can use an execution role that is scoped to the specific
resources that Chalk manages.

The scoped execution role uses resource tagging
to restrict permissions to only resources managed by Chalk (tagged with chalk.ai/managed-by: chalk).
This provides a secure way to limit Chalk's access to only the infrastructure it creates and maintains.

The IAM policy for the scoped execution role consists of three statements:

```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ChalkManagedResources",
      "Effect": "Allow",
      "Action": [
        "eks:*",
        "ecr:*",
        "cloudwatch:*",
        "iam:GetPolicy",
        "iam:GetPolicyVersion",
        "iam:GetRole",
        "iam:GetRolePolicy",
        "iam:ListRolePolicies",
        "iam:ListAttachedRolePolicies",
        "iam:ListPolicyVersions",
        "iam:GetOpenIDConnectProvider",
        "iam:ListInstanceProfilesForRole",
        "ec2:*",
        "secretsmanager:*",
        "dynamodb:*",
        "kms:*",
        "logs:*",
        "kafka:*",
        "elasticloadbalancing:*",
        "sqs:*",
        "rds:*",
        "elasticache:*",
        "acm:*",
        "application-autoscaling:*",
        "glue:*"
      ],
      "Resource": [
        "arn:aws:eks:${AWS_REGION}:${AWS_ACCOUNT_ID}:*",
        "arn:aws:ecr:*",
        "arn:aws:iam::${AWS_ACCOUNT_ID}:policy/*",
        "arn:aws:iam::aws:policy/*",
        "arn:aws:iam::${AWS_ACCOUNT_ID}:role/*",
        "arn:aws:iam::${AWS_ACCOUNT_ID}:oidc-provider/*",
        "arn:aws:cloudwatch:${AWS_REGION}:${AWS_ACCOUNT_ID}:chalk*",
        "arn:aws:ec2:${AWS_REGION}:${AWS_ACCOUNT_ID}:*",
        "arn:aws:secretsmanager:${AWS_REGION}:${AWS_ACCOUNT_ID}:*",
        "arn:aws:dynamodb:${AWS_REGION}:${AWS_ACCOUNT_ID}:table/chalk*",
        "arn:aws:kms:${AWS_REGION}:${AWS_ACCOUNT_ID}:*",
        "arn:aws:logs:${AWS_REGION}:${AWS_ACCOUNT_ID}:log-group:*",
        "arn:aws:kafka:${AWS_REGION}:${AWS_ACCOUNT_ID}:cluster/chalk*",
        "arn:aws:elasticloadbalancing:${AWS_REGION}:${AWS_ACCOUNT_ID}:k8s-*",
        "arn:aws:rds:${AWS_REGION}:${AWS_ACCOUNT_ID}:*:chalk*",
        "arn:aws:sqs:${AWS_REGION}:${AWS_ACCOUNT_ID}:chalk*",
        "arn:aws:elasticache:${AWS_REGION}:${AWS_ACCOUNT_ID}:*:chalk*",
        "arn:aws:acm:${AWS_REGION}:${AWS_ACCOUNT_ID}:*",
        "arn:aws:application-autoscaling:${AWS_REGION}:${AWS_ACCOUNT_ID}:*",
        "arn:aws:glue:${AWS_REGION}:${AWS_ACCOUNT_ID}:*"
      ],
      "Condition": {
        "StringEquals": {
          "aws:ResourceTag/chalk.ai/managed-by": "chalk"
        }
      }
    },
    {
      "Sid": "ChalkUntaggedResources",
      "Effect": "Allow",
      "Action": [
        "iam:GetPolicy",
        "iam:GetPolicyVersion",
        "kafka:DescribeConfiguration",
        "kafka:DescribeConfigurationRevision",
        "s3:*"
      ],
      "Resource": [
        "arn:aws:iam::aws:policy/*",
        "arn:aws:kafka:${AWS_REGION}:${AWS_ACCOUNT_ID}:configuration/chalk*",
        "arn:aws:s3:::chalk-*"
      ]
    },
    {
      "Sid": "ChalkDescribeResources",
      "Effect": "Allow",
      "Action": [
        "eks:Describe*",
        "dynamodb:Describe*",
        "dynamodb:ListTables*",
        "elasticache:Describe*",
        "acm:Describe*",
        "route53:List*",
        "route53:Get*"
      ],
      "Resource": "*",
      "Condition": {
        "StringEquals": {
          "aws:ResourceTag/chalk.ai/managed-by": "chalk"
        }
      }
    },
    {
      "Sid": "ChalkGlobalDescribe",
      "Effect": "Allow",
      "Action": [
        "ecr:GetAuthorizationToken",
        "ec2:Describe*",
        "logs:DescribeLogGroups"
      ],
      "Resource": "*"
    }
  ]
}
```

Note that most permissions are restricted by the aws:ResourceTag/chalk.ai/managed-by condition,
ensuring Chalk can only manage resources it has created. Some resources (like MSK configurations and S3 buckets)
do not support tag-based conditions and are instead restricted by resource name patterns.

### Kubernetes

Chalk requires the following Kubernetes roles to manage the resources in your cluster:

### Managed Components

Depending on cluster settings and enabled Chalk features, AWS customer-cloud deployments manage the following Kubernetes platform components in EKS.

### EKS Add-ons

- CoreDNS
- kube-proxy
- Amazon VPC CNI

### Helm Charts

- Argo Workflows
- CloudNativePG
- Envoy Gateway
- Karpenter
- KEDA
- external-dns
- metrics-server
- aws-mountpoint-s3-csi-driver
- cert-manager
- AWS Load Balancer Controller

### EKS Version Support

Chalk supports EKS versions up to two Kubernetes minor versions back, and Chalk updates managed clusters approximately one month before the end of AWS standard support for the in-use EKS version.

```
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: chalk-cluster-management-role
rules:
  - apiGroups:
      - ""
    resources:
      - nodes # required to track usage
    verbs:
      - get
      - list
      - watch
  # For allowing the web UI to manage cluster scaling
  - apiGroups:
      - "karpenter.sh"
    resources:
      - nodepools
    verbs:
      - get
      - list
      - create
      - update
      - patch
      - delete
  - apiGroups:
      - "karpenter.k8s.aws"
    resources:
      - ec2nodeclasses
    verbs:
      - get
      - list
      - create
      - update
      - patch
      - delete

  # For read/list access to Karpenter NodeClaims
  - apiGroups:
      - "karpenter.sh"
    resources:
      - nodeclaims
    verbs:
      - get
      - list

  - apiGroups:
      - ""
    resources:
      - persistentvolumes
    verbs:
      - get
      - list
      - create
      - update
      - patch
      - watch
      - delete

  - apiGroups:
      - "storage.k8s.io"
    resources:
      - storageclasses
    verbs:
      - get
      - list

---
kind: Role
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: chalk-management-role
  namespace: <namespace> # replace <namespace> with your actual namespace
rules:
  - apiGroups:
      - ""
    resources:
      - configmaps
      - pods
      - services
    verbs:
      - get
      - list
      - create
      - update
      - patch
      - watch

  # For support w/ debugging & rendering logs in the dashboard.
  - apiGroups:
      - ""
    resources:
      - pods/log
    verbs:
      - get
      - list
      - watch
  - apiGroups:
      - "extensions"
      - "networking.k8s.io"
    resources:
      - ingresses
      - ingresses/status
      - ingressclasses
    verbs:
      - get
      - list
      - watch
      - delete
      - update
      - create
  - apiGroups:
      - "extensions"
      - "networking.k8s.io"
    resources:
      - ingresses # can be self-managed if necessary
    verbs:
      - get
      - create
      - delete
      - update
      - list
      - patch
  - apiGroups:
      - ""
    resources:
      - secrets
    verbs:
      - get
      - list
      - watch
      - create
      - patch
      - update
      - delete
  - apiGroups:
      - ""
    resources:
      - events
    verbs:
      - get
      - list
      - watch
  - apiGroups:
      - "apps"
    resources:
      - replicasets
      - statefulsets
      - deployments
      - daemonsets
    verbs:
      - get
      - list
      - update
      - patch
      - create
      - delete
  - apiGroups:
      - "batch"
    resources:
      - cronjobs
    verbs:
      - get
      - list
      - create
      - update
      - patch
      - watch
      - delete

  # For managing Jobs
  - apiGroups:
      - "batch"
    resources:
      - jobs
    verbs:
      - get
      - list
      - create
      - update
      - patch
      - watch
      - delete

  # For durable storage management.
  - apiGroups:
      - ""
    resources:
      - persistentvolumeclaims
    verbs:
      - get
      - list
      - create
      - update
      - patch
      - watch
      - delete
  # For managing KEDA objects for autoscaling.
  - apiGroups:
      - keda.sh
    resources:
      - scaledobjects
      - scaledjobs
    verbs:
      - get
      - list
      - create
      - update
      - patch
      - watch
      - delete
  # For managing PodDisruptionBudgets.
  - apiGroups:
      - policy
    resources:
      - poddisruptionbudgets
    verbs:
      - get
      - list
      - create
      - update
      - patch
      - watch

  # OPTIONAL: For showing thread dumps & profiling for batch backfills + some support use-cases.
  - apiGroups:
      - ""
    resources:
      - pods/exec
    verbs:
      - create
      - get

  # OPTIONAL: For support with debugging k8s rbac, grant read-only access to this namespaces' rbac configuration.
  - apiGroups: ["", "rbac.authorization.k8s.io"]
    resources: ["roles", "serviceaccounts", "rolebindings"]
    verbs: ["get", "list"]
```

Additionally, Chalk workload service accounts require the following role to be able to manage batch workloads:

```
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: chalk-job-reader
  namespace: <namespace>  # replace <namespace> with your actual namespace
rules:
- apiGroups: ["batch"]
  resources: ["jobs"]
  verbs: ["get", "list", "watch"]
```

If you use in-cluster docker image building powered by kaniko and Argo, Chalk requires this role on
the namespace where the docker image building is happening:

```
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: argo-workflows-role
  namespace: <namespace>  # replace <namespace> with your actual namespace
rules:
- apiGroups: ["argoproj.io"]
  resources: ["workflows"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
```

This role must be granted to the management service account that Chalk uses to manage the cluster, and to the workload
service account that runs the docker image building.

### Logging

In order to display logs in the Chalk web UI, Chalk requires permissions to be able to read Cloudwatch logs.
If you use a separate logging service, you can set up Fluent Bit as a DaemonSet to send logs to CloudWatch Logs.
Then, to view these logs we need the following permissions:

```
{
    "Statement": [
        {
            "Action": [
                "logs:TestMetricFilter",
                "logs:StartQuery",
                "logs:StartLiveTail",
                "logs:List*",
                "logs:Get*",
                "logs:FilterLogEvents",
                "logs:Describe*"
            ],
            "Effect": "Allow",
            "Resource": "arn:aws:logs:us-east-1:<your account id>:log-group:<the log group path>:*",
            "Sid": "readlogs"
        },
        {
            "Action": [
                "logs:StopQuery",
                "logs:StopLiveTail"
            ],
            "Effect": "Allow",
            "Resource": "*",
            "Sid": "stopActions"
        }
    ],
    "Version": "2012-10-17"
}
```

### VPC CIDR Blocks

Chalk will create the underlying VPC and
subnets for the EKS cluster(s) for your Chalk deployment. Typically, Chalk will request a /16 address block and
work within it. However, the number of addresses will depend on the size of your implementation. The Chalk team will
work with you to determine the appropriate CIDR sections to use.

Our default configuration is:

```
vpc_cidr_block = "10.130.0.0/16"
vpc_subnets = [
  {
    name              = "primary"
    cidr_block        = "10.130.0.0/20"
    az                = "a"
    public_cidr_block = "10.130.16.0/20"
  },
  {
    name              = "secondary"
    cidr_block        = "10.130.32.0/20"
    az                = "b"
    public_cidr_block = "10.130.48.0/20"
  },
  {
    name              = "tertiary"
    cidr_block        = "10.130.64.0/20"
    az                = "c"
    public_cidr_block = "10.130.80.0/20"
  }
]
```





