Setup Guide

End-to-end instructions to wire this platform to a GitHub repository and an Azure subscription, ready for the first Provision Infrastructure run. The guide assumes you have Owner rights on the target Azure subscription and admin rights on the GitHub repository.

What you’ll end up with

By the end you will have:

  - this repository on GitHub with dev, staging, and prod Environments configured (steps 1–2),
  - an Azure App Registration and Service Principal with four OIDC federated credentials and three subscription-scope RBAC roles (steps 3–5),
  - a GH_PAT secret that lets the platform bootstrap the application repository (step 7), and
  - a first green Provision Infrastructure run, including the generated application repo (step 8).

Estimated time: 15–20 minutes the first time.


Prerequisites

Tooling on your workstation:

Tool            Minimum version   Notes
git             2.30              Push the repo to GitHub
gh (optional)   2.40              Convenient for env/secret commands
az              2.60              App Registration + RBAC + federated credentials
jq (optional)   1.6               Useful for inspecting az output
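
A quick way to confirm the toolchain is present (exact version-output formats vary by tool):

git --version
gh --version
az version
jq --version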

Azure access:

  - Owner (or an equivalent role combination) on the target subscription, with rights to create App Registrations in the tenant.
  - A directory admin available for the Graph admin-consent grant in step 5 (often an internal request in corporate tenants).

GitHub access:

  - Admin rights on this repository.
  - Permission to create repositories in the target organization (used by the application-repo bootstrap in step 7).

Step 1 — Push the repository to GitHub

  1. Create an empty repository on GitHub (e.g. your-org/workshop-platform-eng), without initial README, license, or .gitignore.
  2. From the local checkout of this project:

    git init
    git add .
    git commit -m "feat: initial platform engineering scaffold"
    git branch -M main
    git remote add origin https://github.com/<your-org>/<repo-name>.git
    git push -u origin main
    

Note. All federated credentials below tie OIDC tokens to this exact repository slug and to the main branch. If you push to a different branch or rename the repo later, you must update the federated credentials too.


Step 2 — Create the GitHub Environments

GitHub Environments are referenced by the environment: key on the plan, apply, and verify jobs, which is what makes per-environment OIDC subjects work. Create them even if you don’t add protection rules yet.

In the repository: Settings → Environments → New environment, and create:

Environment   Suggested protection rules
dev           (none)
staging       (none for now)
prod          Required reviewers: at least one trusted reviewer

You can also create them from the CLI if gh is set up:

gh api -X PUT repos/<your-org>/<repo-name>/environments/dev
gh api -X PUT repos/<your-org>/<repo-name>/environments/staging
gh api -X PUT repos/<your-org>/<repo-name>/environments/prod

Step 3 — Create the Azure App Registration

# Sign in to the right tenant if you have several
az login

# Make sure you're operating against the intended subscription
az account set --subscription "<your-subscription-id-or-name>"

# Create the App Registration
az ad app create --display-name "sp-platform-eng-github"

# Capture the appId — this is the value you'll pass as `azure_client_id`
APP_ID=$(az ad app list \
  --display-name "sp-platform-eng-github" \
  --query "[0].appId" -o tsv)

# Create the matching Service Principal in your tenant
az ad sp create --id "$APP_ID"

# Capture the SP object ID — needed for RBAC role assignments
SP_OBJECT_ID=$(az ad sp show --id "$APP_ID" --query id -o tsv)

# Capture the tenant ID — you'll pass this as `azure_tenant_id`
TENANT_ID=$(az account show --query tenantId -o tsv)

# Capture the subscription ID — you'll pass this as `subscription_id`
SUBSCRIPTION_ID=$(az account show --query id -o tsv)

cat <<EOF

Save these three values — you'll feed them to the workflow as inputs:

  azure_client_id : $APP_ID
  azure_tenant_id : $TENANT_ID
  subscription_id : $SUBSCRIPTION_ID

  (SP object ID, only used in the next steps: $SP_OBJECT_ID)
EOF

Tip. The App Registration’s appId and the Service Principal’s objectId are different identifiers: RBAC role assignments target the SP’s object ID, while federated credentials are created on the App Registration. Keep both handy.
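
If the two ever get mixed up, they can be printed side by side; note that --id accepts the appId in both commands:

az ad app show --id "$APP_ID" --query "{appId:appId, appObjectId:id}" -o json
az ad sp show --id "$APP_ID" --query "{appId:appId, spObjectId:id}" -o json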


Step 4 — Configure federated credentials (OIDC)

The workflows authenticate to Azure with short-lived OIDC tokens issued by GitHub Actions. Azure validates each token against a federated credential on the App Registration. The token’s subject claim must match exactly.

You need four credentials because two different subject formats apply:

Branch-scoped
  Used by every job in provision-infrastructure.yml that has no environment: key (resolve-inputs, fmt, checkov, bootstrap-tfstate, create-app-repo, create-run-issue, configure-environments, configure-federated-credentials, observe-ci, finalize), plus the standalone bootstrap-tfstate.yml.
  Subject: repo:<org>/<repo>:ref:refs/heads/main

Environment dev
  Used by every job pinned to environment: dev (plan, apply, and the verify-infrastructure.yml reusable workflow).
  Subject: repo:<org>/<repo>:environment:dev

Environment staging
  Same set of jobs, pinned to environment: staging.
  Subject: repo:<org>/<repo>:environment:staging

Environment prod
  Same set of jobs, pinned to environment: prod.
  Subject: repo:<org>/<repo>:environment:prod

Create all four:

REPO="<your-org>/<repo-name>"   # e.g. deors/workshop-platform-eng

# 1. Branch-scoped credential (bootstrap jobs)
az ad app federated-credential create --id "$APP_ID" --parameters '{
  "name":     "github-main-branch",
  "issuer":   "https://token.actions.githubusercontent.com",
  "subject":  "repo:'"$REPO"':ref:refs/heads/main",
  "audiences": ["api://AzureADTokenExchange"]
}'

# 2-4. One credential per GitHub Environment
for ENV in dev staging prod; do
  az ad app federated-credential create --id "$APP_ID" --parameters '{
    "name":     "github-env-'"$ENV"'",
    "issuer":   "https://token.actions.githubusercontent.com",
    "subject":  "repo:'"$REPO"':environment:'"$ENV"'",
    "audiences": ["api://AzureADTokenExchange"]
  }'
done

# Verify
az ad app federated-credential list --id "$APP_ID" \
  --query "[].{name:name, subject:subject}" -o table

You should see exactly four rows.
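
Illustrative output, with placeholder org/repo:

Name                Subject
------------------  --------------------------------------------
github-main-branch  repo:your-org/your-repo:ref:refs/heads/main
github-env-dev      repo:your-org/your-repo:environment:dev
github-env-staging  repo:your-org/your-repo:environment:staging
github-env-prod     repo:your-org/your-repo:environment:prod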


Step 5 — Assign Azure RBAC roles

The Service Principal needs three roles at subscription scope. The third one — Storage Blob Data Contributor — is the easy-to-miss one: the bootstrap script creates the state storage account with allow-shared-key-access=false, so the only way the script can then create the container is via RBAC. The role must be in place before the first run.

SCOPE="/subscriptions/$SUBSCRIPTION_ID"

# Manage control-plane resources (RGs, App Service, networking, …)
az role assignment create \
  --assignee-object-id      "$SP_OBJECT_ID" \
  --assignee-principal-type ServicePrincipal \
  --role  "Contributor" \
  --scope "$SCOPE"

# Read/write state blobs (Contributor does NOT cover the data plane)
az role assignment create \
  --assignee-object-id      "$SP_OBJECT_ID" \
  --assignee-principal-type ServicePrincipal \
  --role  "Storage Blob Data Contributor" \
  --scope "$SCOPE"

# Create role assignments — needed for the webapp module's ACR pull and
# Key Vault access policy resources
az role assignment create \
  --assignee-object-id      "$SP_OBJECT_ID" \
  --assignee-principal-type ServicePrincipal \
  --role  "User Access Administrator" \
  --scope "$SCOPE"

# Verify
az role assignment list --assignee "$SP_OBJECT_ID" --scope "$SCOPE" \
  --query "[].roleDefinitionName" -o table

Expected output:

Result
-------------------------------
Contributor
Storage Blob Data Contributor
User Access Administrator

Why all three at subscription scope? During bootstrap the resource group and storage account don’t exist yet, so any role on a narrower scope wouldn’t apply. Once we evolve the platform to provision infrastructure for many apps in many subscriptions, this RBAC model will be revisited (likely a per-subscription identity rather than a single shared SP).

Allow the SP to manage its own federated credentials

After the platform provisions infrastructure for a new app, it must register three additional federated credentials on this same App Registration — one per environment, scoped to the new app repo (subjects repo:<owner>/<app>:environment:{dev,staging,prod}). Without these, deploy workflows in the new repo fail at azure/login with AADSTS70021.

The platform workflow does this automatically (see job configure-federated-credentials), but the SP needs two things to be allowed to write to its own App Registration:

  1. Self-ownership of the App Registration object (directory-level), and
  2. The Application.ReadWrite.OwnedBy application permission on Microsoft Graph, with admin consent.

Ownership alone is sufficient for user-delegated flows, but not for application-only flows such as the OIDC token a workflow runs under. Even in your own tenant, the default Entra policy denies the call with Insufficient privileges to complete the operation.

1. Add the SP as owner of its own App Registration

APP_OBJECT_ID=$(az ad app show --id "$APP_ID" --query id -o tsv)

az ad app owner add \
  --id              "$APP_OBJECT_ID" \
  --owner-object-id "$SP_OBJECT_ID"

# Verify
az ad app owner list --id "$APP_OBJECT_ID" --query "[].id" -o tsv

2. Grant Application.ReadWrite.OwnedBy on Microsoft Graph

This step requires admin consent in the tenant: a Global Administrator, Privileged Role Administrator, Cloud Application Administrator, or Application Administrator must run it (or grant consent in the portal). In a corporate tenant this typically means filing an internal request.

# Microsoft Graph's well-known appId
GRAPH_APP_ID="00000003-0000-0000-c000-000000000000"
GRAPH_SP_ID=$(az ad sp show --id "$GRAPH_APP_ID" --query id -o tsv)

# AppRoleId for Application.ReadWrite.OwnedBy on Graph
ROLE_ID=$(az ad sp show --id "$GRAPH_APP_ID" \
  --query "appRoles[?value=='Application.ReadWrite.OwnedBy'].id | [0]" -o tsv)

# Grant it (admin consent required to execute this call)
az rest --method POST \
  --uri "https://graph.microsoft.com/v1.0/servicePrincipals/${SP_OBJECT_ID}/appRoleAssignments" \
  --headers "Content-Type=application/json" \
  --body "{
    \"principalId\": \"${SP_OBJECT_ID}\",
    \"resourceId\":  \"${GRAPH_SP_ID}\",
    \"appRoleId\":   \"${ROLE_ID}\"
  }"

# Verify — should list one row with role 'Application.ReadWrite.OwnedBy'
az rest --method GET \
  --uri "https://graph.microsoft.com/v1.0/servicePrincipals/${SP_OBJECT_ID}/appRoleAssignments" \
  --query "value[].{resource:resourceDisplayName, roleId:appRoleId}" -o table

Portal alternative

Entra ID → App registrations → your app → API permissions → Add a permission → Microsoft Graph → Application permissions → Application.ReadWrite.OwnedBy → Add. Then click Grant admin consent for <tenant>.

Why OwnedBy and not All? Application.ReadWrite.OwnedBy only lets the SP write to App Registrations where it is an owner (set in step 1 above). Application.ReadWrite.All would let it write to any App Registration in the tenant — a much wider blast radius.

Bootstrap storage account — security model

The state storage account is created with:

  - shared-key access disabled (allow-shared-key-access=false), so every data-plane call is authorized through Entra ID RBAC, and
  - network default action Allow, so GitHub-hosted runners, which have no fixed egress IP, can reach the public blob endpoint.

If your threat model requires network-level isolation, switch to a Private Endpoint and run the workflows on a self-hosted runner inside the VNet. That trade-off is intentionally out of scope for the workshop baseline.

Backend implication. Because the SA forbids shared-key auth, the azurerm backend must also be told to use Azure AD against the blob endpoint (not just for credential acquisition). The workflow sets both use_oidc=true and use_azuread_auth=true (plus ARM_USE_AZUREAD=true). Without the second flag, terraform init hits 403 KeyBasedAuthenticationNotPermitted even with a valid OIDC token.
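
For reference, a minimal sketch of what that init call looks like; the storage-account placeholder and the state-key naming here are illustrative, while the two auth flags are the load-bearing part:

terraform init \
  -backend-config="resource_group_name=rg-tfstate-test-webapp" \
  -backend-config="storage_account_name=<account>" \
  -backend-config="container_name=tfstate" \
  -backend-config="key=test-webapp.dev.tfstate" \
  -backend-config="use_oidc=true" \
  -backend-config="use_azuread_auth=true"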

Web App network exposure — per-environment policy

Application Web Apps follow a deliberate per-env split:

dev
  Private Endpoint enabled, public endpoint enabled. GitHub-hosted runners are not in the VNet and have no fixed egress IP, so dev intentionally accepts public traffic: the application repo’s CI/CD can run an HTTP smoke test against https://<webapp>.azurewebsites.net/health after each deploy.

staging
  Private Endpoint enabled, public endpoint disabled. PE-only; mirrors the production posture so data flowing through staging is treated with the same network sensitivity as prod.

prod
  Private Endpoint enabled, public endpoint disabled. PE-only; the only path in is from the integration subnet via the private endpoint.

The toggle is a module variable, public_network_access_enabled (default false). Dev sets it to true explicitly; staging/prod inherit the secure default. CKV_AZURE_222 (Public network access disabled) is enforced for prod and skipped for dev/staging in .checkov.nonprod.yaml so the dev exception doesn’t fail policy.
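
A sketch of how that split can be exercised locally, assuming the Terraform code lives in a directory you substitute for <terraform-dir> and that prod runs against the base .checkov.yaml (the config-file names are this repo's; the flags are standard Checkov CLI):

# dev/staging policy: CKV_AZURE_222 is skipped via the nonprod config
checkov -d <terraform-dir> --config-file .checkov.nonprod.yaml

# prod policy: the full baseline, no skip
checkov -d <terraform-dir> --config-file .checkov.yaml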

Deploy validation strategy

The application repository ships two top-level workflows, both delegating to a reusable deploy.yml.

Operators who need real HTTP smoke tests against PE-only environments should run the deploy/release workflows on a self-hosted runner inside webapp_integration_subnet (or a peered VNet). Out of scope for the workshop baseline.


Step 6 — Handle GitHub Advanced Security (optional)

The checkov job uploads its findings as SARIF to Security → Code scanning. Code scanning requires GitHub Advanced Security, which is:

  - included for free on public repositories, and
  - a paid add-on for private and internal repositories (GitHub Enterprise plans).

If you can’t enable it, the upload step will fail. Either:

  1. Make the repository public (recommended for this workshop; see the one-liner after this list), or
  2. Disable the SARIF upload by adding if: false to the Upload SARIF to GitHub Security tab step in .github/workflows/provision-infrastructure.yml. The Checkov scan itself still runs and still fails the build on findings.
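
If you go with option 1 and gh is configured, the visibility flip is a single command:

# Recent gh versions additionally require --accept-visibility-change-consequences
gh repo edit "<your-org>/<repo-name>" --visibility public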

Step 7 — Provide a GH_PAT secret for cross-repo operations

After the infrastructure is provisioned and verified, the workflow continues into application-repo bootstrap: it creates a new repo from a template, opens a tracking issue, configures GitHub Environments + variables, dispatches the app’s CI workflow and posts a summary back to the issue.

All of those operations write to a different repository than the one the workflow runs in. The default GITHUB_TOKEN is scoped to this repo only and cannot create repositories or write to other repos’ environments/variables.

Provide a Personal Access Token (or a GitHub App installation token) as a repository secret named GH_PAT, with these scopes:

Scope                        Used for
repo                         Read/write the application repository (creation, issues, comments)
workflow                     Dispatch the CI workflow in the application repo
admin:repo_hook (optional)   Future drift-detection wiring

Create one at https://github.com/settings/tokens?type=beta (fine-grained, recommended) with the target organization and Administration: Read and write, Contents: Read and write, Issues: Read and write, Actions: Read and write, Variables: Read and write, Environments: Read and write repository permissions. Save it as the GH_PAT secret on this platform repo.
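
With gh configured, the secret can be stored without the browser; the command prompts for the value (or reads it from stdin):

gh secret set GH_PAT --repo "<your-org>/<repo-name>"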

Why a PAT and not the workflow token? GitHub deliberately scopes GITHUB_TOKEN to the repository running the workflow. Cross-repo writes require a token whose installation/owner has access to the target.


Step 8 — Trigger the first run

In the GitHub UI: Actions → Provision Infrastructure → Run workflow, and provide:

Input                   Value for the first test
environment             dev
app_name                test-webapp (3–22 chars, lowercase, digits, hyphens)
subscription_id         the GUID captured in step 3
azure_client_id         the appId captured in step 3
azure_tenant_id         the tenant GUID captured in step 3
container_image         mcr.microsoft.com/appsvc/staticsite:latest
container_registry_url  (leave empty — public image)
template_repo           the <owner>/<name> of the application template repo
ci_workflow_file        (leave empty — defaults to ci.yml)
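
The same run can also be dispatched from the CLI; this sketch assumes the workflow file is named provision-infrastructure.yml (as referenced in step 4) and reuses the shell variables captured in step 3:

gh workflow run provision-infrastructure.yml \
  -R "<your-org>/<repo-name>" \
  -f environment=dev \
  -f app_name=test-webapp \
  -f subscription_id="$SUBSCRIPTION_ID" \
  -f azure_client_id="$APP_ID" \
  -f azure_tenant_id="$TENANT_ID" \
  -f container_image=mcr.microsoft.com/appsvc/staticsite:latest \
  -f template_repo="<owner>/<name>"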

What you should observe

Each row below names the job exactly as it appears in the run’s UI:

Resolve inputs                    ✓ validated inputs, derived sttf<app><sub>
Checkov · {env}                   ✓ no findings
Terraform fmt check               ✓ formatting clean
Bootstrap tfstate storage         ✓ rg-tfstate-test-webapp + storage account + container
Plan · {env}                      ✓ terraform plan generated, artifact uploaded
Apply · {env}                     ✓ terraform apply succeeded
Verify · {env}                    ✓ control-plane assertions passed
Create application repo           ✓ <owner>/<app_name> created from template (or skipped)
Create run issue                  ✓ per-run tracking issue opened
Configure env · {env}             ✓ GitHub Environment + variables set
Federated credential · {env}      ✓ AAD subject registered on the SP
Observe CI in app repo            ✓ template auto-triggered CI watched, build+test+dev-deploy succeeded
Summarize and comment             ✓ summary posted as issue comment

The exact storage account name shows up in the bootstrap-tfstate job logs as TFSTATE_STORAGE_ACCOUNT=.... The plan output (and the binary tfplan file) is attached as a workflow artifact named tfplan-test-webapp-dev, retained for 7 days. The plan is then consumed by the apply job, which provisions the resources for real, after which verify runs control-plane assertions against the live infrastructure.

The full run also creates the application repository from your template, configures its GitHub Environments + variables, registers the per-env federated credentials on the platform SP, observes the auto-triggered CI in the new repo, and posts a summary comment on the per-run tracking issue.


Troubleshooting

AADSTS70021: No matching federated identity record found

The OIDC token’s subject doesn’t match any federated credential on the App Registration. Re-check:

  - the repository slug in the subjects matches the current <owner>/<name> exactly (a rename breaks it, per the note in step 1),
  - branch-scoped jobs ran from main (subject ref:refs/heads/main, not another branch), and
  - environment-scoped jobs use an environment: value that exactly matches a credential subject (dev, staging, prod).
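
Listing the registered subjects usually pinpoints the mismatch quickly:

az ad app federated-credential list --id "$APP_ID" \
  --query "[].subject" -o tsv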

Insufficient privileges to complete the operation in configure-federated-credentials

The job calls az ad app federated-credential create, which hits Microsoft Graph (POST /applications/{id}/federatedIdentityCredentials). Two things are required and people commonly stop after the first:

  1. The SP is an owner of its own App Registration (az ad app owner add …).
  2. The SP has the Application.ReadWrite.OwnedBy Graph application permission with admin consent.

Without (2), even a fully-owning SP gets Insufficient privileges. Run the two-step procedure in step 5 — Allow the SP to manage its own federated credentials. Step (2) requires a directory-role admin (Global, Privileged Role, Cloud Application, or Application Administrator) — in a corporate tenant this is usually an internal request.

AuthorizationFailed during bootstrap-tfstate

The Service Principal lacks one of the three RBAC roles, or propagation hasn’t finished yet. Re-run after a minute. If it persists, re-run the az role assignment list command from step 5 and confirm all three roles are listed at subscription scope.

Failed to query container 'tfstate' on '<account>' during bootstrap-tfstate

The script (scripts/bootstrap-tfstate.sh) traps this on the az storage container exists call. Two possible causes:

  1. RBAC: the SP has Contributor (control plane) but not Storage Blob Data Contributor (data plane). Re-check step 5.
  2. Network rules: the storage account has defaultAction = Deny (e.g. created by an earlier version of the script, or modified manually). The GitHub-hosted runner has no fixed egress IP and is blocked. Fix:

    az storage account update \
      --name <account> --resource-group <rg> --default-action Allow
    

    The current bootstrap script keeps defaultAction = Allow by design — see the security-model note in step 5.

terraform init fails with Error refreshing state

Most often a missing Storage Blob Data Contributor assignment. Same fix as above. If RBAC is correct, double-check that the workflow passes both use_oidc=true and use_azuread_auth=true in the -backend-config flags (it does, by default).

CI in the new app repo fails with denied: permission_denied: write_package

The container push to GHCR (docker push ghcr.io/<owner>/<repo>:<tag>) is rejected even though the platform workflow set the new repo’s default workflow permissions to write. Common causes, in rough order of frequency:

  1. The CI workflow declares its own permissions: block that omits packages: write. The block replaces the default — it doesn’t merge with it. The workflow must include all the scopes it needs, e.g. contents: read, packages: write, id-token: write.

  2. The login step uses the wrong token or username. For docker login ghcr.io, expect username: ${{ github.actor }} and password: ${{ secrets.GITHUB_TOKEN }} — typos or a stale PAT will fail with the same denied error.

  3. Org-level setting overrides the repo setting. Org admins can lock workflow permissions at Settings → Actions → General with override disabled. The repo-level PUT is silently ignored. Ask the org admin to allow per-repo overrides or set the org default to write.

  4. Image namespace mismatch. GHCR only accepts pushes to ghcr.io/<owner>/<name> where <owner> matches the repo owner. A tag computed against a different org/user is rejected.

  5. A pre-existing GHCR package linked to a different repo (or unlinked). If a package with the same name already exists in the owner’s namespace from a deleted repo or earlier experiment, GHCR refuses pushes from this repo even with correct permissions. Visit https://github.com/orgs/<owner>/packages (or /users/<owner>/packages), open the package’s settings and either delete it or use Manage Actions access to link it to the new repository.

Useful diagnostic command:

gh run view <run-id> -R <owner>/<app> --log-failed
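
If cause 5 is suspected, the package’s current repository linkage can also be checked via the REST API (org-owned namespace shown; for a user account use users/<owner> instead):

# Prints the linked repo, or null if the package is unlinked
gh api "orgs/<owner>/packages/container/<app_name>" --jq '.repository.full_name'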

Checkov reports new findings after a Terraform change

Either fix the finding or, if you’ve judged it a false positive or not-applicable, add a justified entry to .checkov.yaml documenting why the check is skipped. See CONTRIBUTING.md for the rules around skips.


What’s next

With the first run green end-to-end, the typical follow-up workshop topics are: