Anthos Service Mesh Workshop: Lab Guide

1. ALPHA WORKSHOP

Link to the workshop codelab: bit.ly/asm-workshop

2. Overview

Architecture Diagram

[Architecture diagram]

This workshop is a hands-on immersive experience that goes through how to set up globally distributed services on GCP in production. The main technologies used are Google Kubernetes Engine (GKE) for compute and Istio service mesh to create secure connectivity, observability, and advanced traffic shaping. All the practices and tools used in this workshop are what you would use in production.

Agenda

  • Module 0 - Introduction and Platform Setup
  • Intro and architecture
  • Intro to Service Mesh and Istio/ASM
  • Lab: Infrastructure Setup: User workflow
  • Break
  • Q&A
  • Module 1 - Install, secure and monitor applications with ASM
  • Repo model: Infrastructure and Kubernetes repos explained
  • Lab: Deploy sample application
  • Distributed Services and Observability
  • Lunch
  • Lab: Observability with Stackdriver
  • Q&A
  • Module 2 - DevOps - Canary rollouts, policy/RBAC
  • Multi cluster service discovery and security/policy
  • Lab: Mutual TLS
  • Canary Deployments
  • Lab: Canary Deployments
  • Secure multi cluster global load balancing
  • Break
  • Lab: Authorization Policy
  • Q&A
  • Module 3 - Infra Ops - Platform upgrades
  • Distributed Service building blocks
  • Lab: Infrastructure Scaling
  • Next steps

Slides

The slides for this workshop can be found at the following link:

ASM Workshop Slides

Prerequisites

The following are required before you proceed with this workshop:

  1. A GCP Organization node
  2. A billing account ID (your user must be Billing Admin on this billing account)
  3. Organization Administrator IAM role at the Org level for your user

3. Infrastructure Setup - Admin Workflow

Bootstrap workshop script explained

A script called bootstrap_workshop.sh is used to set up the initial environment for the workshop. You can use this script to set up a single environment for yourself, or multiple environments if you are giving this workshop as training to multiple users.

The bootstrap workshop script requires the following as inputs:

  • Organization name (for example yourcompany.com) - This is the organization in which you create environments for the workshop.
  • Billing ID (for example 12345-12345-12345) - This billing ID is used to bill all resources used during the workshop.
  • Workshop number (for example 01) - A two-digit number. This is used in case you are doing multiple workshops in one day and want to keep track of them separately. Workshop numbers are also used to derive project IDs. Having separate workshop numbers makes it easier to ensure you get unique project IDs each time. In addition to the workshop number, the current date (formatted as YYMMDD) is also used in project IDs. The combination of date and workshop number provides unique project IDs (see the sketch below).
  • Start user number (for example 1) - This number signifies the first user in the workshop. For example, if you want to create a workshop for 10 users, you may have a start user number of 1 and an end user number of 10.
  • End user number (for example 10) - This number signifies the last user in the workshop. If you are setting up a single environment (for yourself, for example), make the start and end user numbers the same; this creates a single environment.
  • Admin GCS bucket (for example my-gcs-bucket-name) - A GCS bucket is used to store workshop related info. This information is used by the cleanup_workshop.sh script to gracefully delete all resources created during the bootstrap workshop script. Admins creating workshops must have read/write permissions to this bucket.

The bootstrap workshop script uses the values provided above and acts as a wrapper script that calls the setup-terraform-admin-project.sh script. The setup-terraform-admin-project.sh script creates the workshop environment for a single user.
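
To make the derived names concrete, here is a rough sketch of how the workshop ID and per-user project IDs are put together (the exact naming lives inside the bootstrap scripts; the project ID format shown is inferred from the examples later in this guide):

export WORKSHOP_NUMBER=01
# Workshop ID is <YYMMDD>-<workshop number>, for example 200131-01
WORKSHOP_ID="$(date '+%y%m%d')-${WORKSHOP_NUMBER}"
echo "Workshop ID: ${WORKSHOP_ID}"
# A user's terraform admin project ID then looks like
# user001-200131-01-tf-abcde  (user number + date + workshop number + suffix)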

Admin permissions required for bootstrapping the workshop

There are two types of users in this workshop. The first is ADMIN_USER, who creates and deletes resources for the workshop. The second is MY_USER, who performs the steps in the workshop. MY_USER has access only to their own resources, while ADMIN_USER has access to all user setups. If you are creating this setup for yourself, then ADMIN_USER and MY_USER are the same. If you are an instructor creating this workshop for multiple students, then ADMIN_USER and MY_USER will be different.

The following organization level permissions are required for the ADMIN_USER:

  • Owner - Project Owner permission to all projects in the Organization.
  • Folder Admin - Ability to create and delete folders in the Organization. Every user gets a single folder containing all of their resources.
  • Organization Administrator
  • Project Creator - Ability to create projects in the Organization.
  • Project Deleter - Ability to delete projects in the Organization.
  • Project IAM Admin - Ability to create IAM rules in all projects in the Organization.

In addition to these, ADMIN_USER must also be the Billing Administrator for the Billing ID used for the workshop.
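
If your admin user is missing any of these roles, an existing Organization Administrator can grant them with gcloud. The following is a minimal sketch using the standard IAM role IDs for the roles listed above; substitute your own organization ID and admin email. Billing Administrator on the billing account is granted separately (for example from the Billing page in the Cloud Console).

gcloud organizations list
export ORG_ID=<YOUR ORGANIZATION ID>
export ADMIN_USER=<ADMIN USER EMAIL>

for role in roles/owner \
  roles/resourcemanager.folderAdmin \
  roles/resourcemanager.organizationAdmin \
  roles/resourcemanager.projectCreator \
  roles/resourcemanager.projectDeleter \
  roles/resourcemanager.projectIamAdmin; do
  gcloud organizations add-iam-policy-binding ${ORG_ID} \
    --member="user:${ADMIN_USER}" --role="${role}"
done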

User schema and permissions performing the workshop

If you are planning to create this workshop for users (other than yourself) in your organization, you must follow a specific user naming scheme for MY_USERs. During the bootstrap_workshop.sh script, you provide a start and an end user number. These numbers are used to create the following user names:

  • user<3 digit user number>@<organization_name>

For example, if you run the bootstrap workshop script with the start user number of 1 and end user number of 3, in your organization called yourcompany.com, the workshop environments for the following users are created:

  • user001@yourcompany.com
  • user002@yourcompany.com
  • user003@yourcompany.com

These usernames are assigned project Owner roles for their specific projects created by the setup_terraform_admin_project.sh script. You must adhere to this user naming schema when using the bootstrap script. Refer to how to add several users at once in GSuite.
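
As an illustration only (the bootstrap script does this for you), the user names above can be generated from the start and end user numbers with three-digit padding:

export ORGANIZATION_NAME=yourcompany.com
export START_USER_NUMBER=1
export END_USER_NUMBER=3

for i in $(seq ${START_USER_NUMBER} ${END_USER_NUMBER}); do
  printf "user%03d@%s\n" "${i}" "${ORGANIZATION_NAME}"
done
# user001@yourcompany.com
# user002@yourcompany.com
# user003@yourcompany.com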

Tools required for the workshop

This workshop is intended to be bootstrapped from Cloud Shell. The following tools are required for this workshop.

Set up workshop for yourself (single user setup)

  1. Open Cloud Shell by clicking the link below. Perform all of the following actions in Cloud Shell.

CLOUD SHELL

  2. Verify you are logged into gcloud with the intended Admin user.
gcloud config list
 
  3. Create a WORKDIR and clone the workshop repo.
mkdir asm-workshop
cd asm-workshop
export WORKDIR=`pwd`
git clone https://github.com/GoogleCloudPlatform/anthos-service-mesh-workshop.git asm
 
  4. Define your Organization name, Billing ID, workshop number and an admin GCS bucket to use for the workshop. Review permissions required for setting up the workshop in the sections above.
gcloud organizations list
export ORGANIZATION_NAME=<ORGANIZATION NAME>

gcloud beta billing accounts list
export ADMIN_BILLING_ID=<ADMIN_BILLING ID>

export WORKSHOP_NUMBER=<two digit number for example 01>

export ADMIN_STORAGE_BUCKET=<ADMIN CLOUD STORAGE BUCKET>
 
  5. Run the bootstrap_workshop.sh script. This script can take a few minutes to complete.
cd asm
./scripts/bootstrap_workshop.sh --org-name ${ORGANIZATION_NAME} --billing-id ${ADMIN_BILLING_ID} --workshop-num ${WORKSHOP_NUMBER} --admin-gcs-bucket ${ADMIN_STORAGE_BUCKET} --set-up-for-admin 
 

After the bootstrap_workshop.sh script completes, a GCP folder is created for each user within the organization. Within the folder, a terraform admin project is created. The terraform admin project is used to create the remainder of the GCP resources required for this workshop. The script enables the required APIs in the terraform admin project, sets up Cloud Build to apply Terraform plans, grants the Cloud Build service account the IAM roles it needs to create resources on GCP, and configures a remote backend in a Google Cloud Storage (GCS) bucket to store Terraform state for all GCP resources.

In order to view the Cloud Build tasks in the terraform admin project, you need the terraform admin project ID. This is stored in the vars/vars.sh file under your asm directory. This file is only created if you are setting up the workshop for yourself as the administrator.

  6. Source the variables file to set environment variables.
echo "export WORKDIR=$WORKDIR" >> $WORKDIR/asm/vars/vars.sh
source $WORKDIR/asm/vars/vars.sh 
 

Set up workshop for multiple users (multi user setup)

  1. Open Cloud Shell by clicking the link below. Perform all of the following actions in Cloud Shell.

CLOUD SHELL

  2. Verify you are logged into gcloud with the intended Admin user.
gcloud config list
 
  3. Create a WORKDIR and clone the workshop repo.
mkdir asm-workshop
cd asm-workshop
export WORKDIR=`pwd`
git clone https://github.com/GoogleCloudPlatform/anthos-service-mesh-workshop.git asm
 
  4. Define your Organization name, Billing ID, workshop number, start and end user number, and an admin GCS bucket to use for the workshop. Review permissions required for setting up the workshop in the sections above.
gcloud organizations list
export ORGANIZATION_NAME=<ORGANIZATION NAME>

gcloud beta billing accounts list
export ADMIN_BILLING_ID=<BILLING ID>

export WORKSHOP_NUMBER=<two digit number for example 01>

export START_USER_NUMBER=<number for example 1>

export END_USER_NUMBER=<number greater or equal to START_USER_NUM>

export ADMIN_STORAGE_BUCKET=<ADMIN CLOUD STORAGE BUCKET>
 
  5. Run the bootstrap_workshop.sh script. This script can take a few minutes to complete.
cd asm
./scripts/bootstrap_workshop.sh --org-name ${ORGANIZATION_NAME} --billing-id ${ADMIN_BILLING_ID} --workshop-num ${WORKSHOP_NUMBER} --start-user-num ${START_USER_NUMBER} --end-user-num ${END_USER_NUMBER} --admin-gcs-bucket ${ADMIN_STORAGE_BUCKET}
 
  6. Get the workshop.txt file from the admin GCS bucket to retrieve the terraform project IDs.
export WORKSHOP_ID="$(date '+%y%m%d')-${WORKSHOP_NUMBER}"
gsutil cp gs://${ADMIN_STORAGE_BUCKET}/${ORGANIZATION_NAME}/${WORKSHOP_ID}/workshop.txt .
 

4. Lab Setup and Prep

Choose your lab path

The labs in this workshop can be performed in one of two ways:

  • The "easy fast track interactive scripts" way
  • The "manual copy-and-paste each instruction" way

The fast track scripts method lets you run a single interactive script for each lab that walks you through the lab by automatically running its commands. The commands are run in batches with concise explanations of each step and what it accomplishes. After each batch, you are prompted to move on to the next batch of commands. This way you can run the labs at your own pace. The fast track scripts are idempotent, meaning you can run them multiple times and get the same outcome.

The fast track scripts will appear at the top of every lab in a green box as shown below.

The copy-and-paste method is the traditional way of copy-and-pasting individual command blocks with explanations of the commands. This method is only intended to be run once. There is no guarantee that re-running commands in this method will give you the same results.

When performing the labs, please pick one of the two methods.

Fast Track Script Setup

Get User info

This workshop is performed using a temporary user account (or a lab account) created by the administrator of the workshop. The lab account owns all of the projects in the workshop. The workshop administrator provides the lab account credentials (username and password) to the user performing the workshop. All of the user's projects are prefixed by the lab account's username; for example, for the lab account user001@yourcompany.com, the terraform admin project ID would be user001-200131-01-tf-abcde, and so on for the rest of the projects. Each user must log in with the lab account provided by the workshop administrator and perform the workshop using that account.

  1. Open Cloud Shell by clicking on the link below.

CLOUD SHELL

  2. Log in with the lab account credentials (do not log in with your corporate or personal account). The lab account looks like userXYZ@<workshop_domain>.com.
  3. Since this is a new account, you are prompted to accept the Google Terms of Service. Click Accept.

  4. In the next screen, select the checkbox to agree to the Google Terms of Service and click Start Cloud Shell.


This step provisions a small Debian Linux VM for you to use to access GCP resources. Each account gets a Cloud Shell VM. Logging in with the lab account provisions the VM and logs you in using the lab account credentials. In addition to Cloud Shell, a Code Editor is also provisioned, making it easier to edit configuration files (Terraform, YAML etc.). By default, the Cloud Shell screen is split into the shell environment (at the bottom) and the Code Editor (at the top). The pencil and shell prompt icons at the top right corner allow you to toggle between the two. You can also drag the middle separator bar up or down to resize each pane.

  5. Create a WORKDIR for this workshop. The WORKDIR is a folder from which you perform all of the labs for this workshop. Run the following commands in Cloud Shell to create the WORKDIR.

mkdir -p ${HOME}/asm-workshop
cd ${HOME}/asm-workshop
export WORKDIR=`pwd` 
 
  6. Export the lab account user as a variable to be used for this workshop. This is the same account you logged into Cloud Shell with.
export MY_USER=<LAB ACCOUNT EMAIL PROVIDED BY THE WORKSHOP ADMIN>
# For example export MY_USER=user001@gcpworkshops.com 
 
  7. Echo the WORKDIR and MY_USER variables to ensure both are set correctly by running the following commands.
echo "WORKDIR set to ${WORKDIR}" && echo "MY_USER set to ${MY_USER}"
 
  8. Clone the workshop repo.
git clone https://github.com/GoogleCloudPlatform/anthos-service-mesh-workshop.git ${WORKDIR}/asm
 

5. Infrastructure Setup - User Workflow

Objective: Verify infrastructure and Istio installation

  • Install workshop tools
  • Clone workshop repo
  • Verify Infrastructure install
  • Verify k8s-repo install
  • Verify Istio installation

Copy-and-Paste Method Lab Instructions

Get User info

The administrator setting up the workshop needs to provide the username and password information to the user. All of the user's projects are prefixed by the username; for example, for the user user001@yourcompany.com, the terraform admin project ID would be user001-200131-01-tf-abcde, and so on for the rest of the projects. Each user only has access to their own workshop environment.

Tools required for the workshop

This workshop is intended to be bootstrapped from Cloud Shell. The following tools are required for this workshop.

Access terraform admin project

After the bootstrap_workshop.sh script completes, a GCP folder is created for each user within the organization. Within the folder, a terraform admin project is created. The terraform admin project is used to create the remainder of the GCP resources required for this workshop. The setup-terraform-admin-project.sh script enables the required APIs in the terraform admin project. Cloud Build is used to apply Terraform plans. Through the script, you give the Cloud Build service account proper IAM roles for it to be able to create resources on GCP. Lastly, a remote backend is configured in a Google Cloud Storage (GCS) bucket to store Terraform states for all GCP resources.

In order to view the Cloud Build tasks in the terraform admin project, you need the terraform admin project ID. This is stored in the admin GCS bucket that was specified in the bootstrap script. If you run the bootstrap script for multiple users, all terraform admin project IDs are in the GCS bucket.

  1. Open Cloud Shell (if it's not already open from the Lab Setup and Prep section) by clicking on the link below.

CLOUD SHELL

  2. Install kustomize (if not already installed) in the $HOME/bin folder and add the $HOME/bin folder to $PATH.
mkdir -p $HOME/bin
cd $HOME/bin
curl -s "https://raw.githubusercontent.com/\
kubernetes-sigs/kustomize/master/hack/install_kustomize.sh"  | bash
cd $HOME
export PATH=$PATH:${HOME}/bin
echo "export PATH=$PATH:$HOME/bin" >> $HOME/.bashrc
 
  3. Install pv and move it to $HOME/bin/pv.
sudo apt-get update && sudo apt-get -y install pv
sudo mv /usr/bin/pv ${HOME}/bin/pv
 
  4. Update your bash prompt.
cp $WORKDIR/asm/scripts/krompt.bash $HOME/.krompt.bash
echo "export PATH=\$PATH:\$HOME/bin" >> $HOME/.asm-workshop.bash
echo "source $HOME/.krompt.bash" >> $HOME/.asm-workshop.bash

echo "alias asm-init='source \$HOME/.asm-workshop.bash'" >> $HOME/.bashrc
echo "source $HOME/.asm-workshop.bash" >> $HOME/.bashrc
source $HOME/.bashrc
 
  5. Verify you are logged into gcloud with the intended user account.
echo "Check logged in user output from the next command is $MY_USER"
gcloud config list account --format=json | jq -r .core.account
 
  6. Get your Terraform admin project ID by running the following command:
export TF_ADMIN=$(gcloud projects list | grep tf- | awk '{ print $1 }')
echo $TF_ADMIN
 
  7. All of the resources associated with the workshop are stored as variables in a vars.sh file stored in a GCS bucket in the terraform admin project. Get the vars.sh file for your terraform admin project.
mkdir $WORKDIR/asm/vars
gsutil cp gs://$TF_ADMIN/vars/vars.sh $WORKDIR/asm/vars/vars.sh
echo "export WORKDIR=$WORKDIR" >> $WORKDIR/asm/vars/vars.sh
 
  8. Click the link displayed to open the Cloud Build page for the Terraform admin project and verify the build completed successfully.
source $WORKDIR/asm/vars/vars.sh
echo "https://console.cloud.google.com/cloud-build/builds?project=${TF_ADMIN}"
 

If accessing Cloud Console for the first time, agree to the Google Terms of Service.

  9. Now that you're looking at the Cloud Build page, click on the History link in the left-hand nav and click on the latest build to view the details of the initial Terraform apply. The following resources are created as part of the Terraform script. You can also refer to the architecture diagram above.
  • 4 GCP Projects in the Org. The provided Billing account is associated with each project.
  • One project is the network host project for the shared VPC. No other resources are created in this project.
  • One project is the ops project used for Istio control plane GKE clusters.
  • Two projects represent two different development teams working on their respective services.
  • Two GKE clusters are created in each of the three ops, dev1 and dev2 projects.
  • A CSR repo named k8s-repo is created which contains six folders for Kubernetes manifest files, one folder per GKE cluster. This repo is used to deploy Kubernetes manifests to the clusters in a GitOps fashion.
  • A Cloud Build trigger is created so that anytime there is a commit to the master branch of the k8s-repo, it deploys the Kubernetes manifests to GKE clusters from their respective folders.
  10. Once the build completes in the terraform admin project, another build starts in the ops project. Click the link displayed to open the Cloud Build page for the ops project and verify the k8s-repo Cloud Build finished successfully.
echo "https://console.cloud.google.com/cloud-build/builds?project=${TF_VAR_ops_project_name}"
 

Verify installation

  1. Create kubeconfig files for all clusters. Run the following script.
$WORKDIR/asm/scripts/setup-gke-vars-kubeconfig.sh
 

This script creates a new kubeconfig file called kubemesh in the gke folder.

  2. Change the KUBECONFIG variable to point to the new kubeconfig file.
source $WORKDIR/asm/vars/vars.sh
export KUBECONFIG=$WORKDIR/asm/gke/kubemesh
 
  3. Add the vars.sh and KUBECONFIG variables to the .bashrc in Cloud Shell so they are sourced every time Cloud Shell is restarted.
echo "source ${WORKDIR}/asm/vars/vars.sh" >> $HOME/.bashrc
echo "export KUBECONFIG=${WORKDIR}/asm/gke/kubemesh" >> $HOME/.bashrc
 
  4. List your cluster contexts. You should see six clusters.
kubectl config view -ojson | jq -r '.clusters[].name'
 
    `Output (do not copy)`
gke_tf05-01-ops_us-central1_gke-asm-2-r2-prod
gke_tf05-01-ops_us-west1_gke-asm-1-r1-prod
gke_tf05-02-dev1_us-west1-a_gke-1-apps-r1a-prod
gke_tf05-02-dev1_us-west1-b_gke-2-apps-r1b-prod
gke_tf05-03-dev2_us-central1-a_gke-3-apps-r2a-prod
gke_tf05-03-dev2_us-central1-b_gke-4-apps-r2b-prod

Verify Istio Installation

  1. Ensure Istio is installed on both ops clusters by checking that all pods are running and jobs have completed.
kubectl --context ${OPS_GKE_1} get pods -n istio-system
kubectl --context ${OPS_GKE_2} get pods -n istio-system
 
    `Output (do not copy)`
NAME                                      READY   STATUS    RESTARTS   AGE
grafana-5f798469fd-z9f98                  1/1     Running   0          6m21s
istio-citadel-568747d88-qdw64             1/1     Running   0          6m26s
istio-egressgateway-8f454cf58-ckw7n       1/1     Running   0          6m25s
istio-galley-6b9495645d-m996v             2/2     Running   0          6m25s
istio-ingressgateway-5df799fdbd-8nqhj     1/1     Running   0          2m57s
istio-pilot-67fd786f65-nwmcb              2/2     Running   0          6m24s
istio-policy-74cf89cb66-4wrpl             2/2     Running   1          6m25s
istio-sidecar-injector-759bf6b4bc-mw4vf   1/1     Running   0          6m25s
istio-telemetry-77b6dfb4ff-zqxzz          2/2     Running   1          6m24s
istio-tracing-cd67ddf8-n4d7k              1/1     Running   0          6m25s
istiocoredns-5f7546c6f4-g7b5c             2/2     Running   0          6m39s
kiali-7964898d8c-5twln                    1/1     Running   0          6m23s
prometheus-586d4445c7-xhn8d               1/1     Running   0          6m25s
    `Output (do not copy)`
NAME                                      READY   STATUS    RESTARTS   AGE
grafana-5f798469fd-2s8k4                  1/1     Running   0          59m
istio-citadel-568747d88-87kdj             1/1     Running   0          59m
istio-egressgateway-8f454cf58-zj9fs       1/1     Running   0          60m
istio-galley-6b9495645d-qfdr6             2/2     Running   0          59m
istio-ingressgateway-5df799fdbd-2c9rc     1/1     Running   0          60m
istio-pilot-67fd786f65-nzhx4              2/2     Running   0          59m
istio-policy-74cf89cb66-4bc7f             2/2     Running   3          59m
istio-sidecar-injector-759bf6b4bc-grk24   1/1     Running   0          59m
istio-telemetry-77b6dfb4ff-6zr94          2/2     Running   4          60m
istio-tracing-cd67ddf8-grs9g              1/1     Running   0          60m
istiocoredns-5f7546c6f4-gxd66             2/2     Running   0          60m
kiali-7964898d8c-nhn52                    1/1     Running   0          59m
prometheus-586d4445c7-xr44v               1/1     Running   0          59m
  2. Ensure Istio is installed on both dev1 clusters. Only Citadel, the sidecar injector and coredns run in the dev1 clusters. They share an Istio control plane running in the ops-1 cluster.
kubectl --context ${DEV1_GKE_1} get pods -n istio-system
kubectl --context ${DEV1_GKE_2} get pods -n istio-system
 
  3. Ensure Istio is installed on both dev2 clusters. Only Citadel, the sidecar injector and coredns run in the dev2 clusters. They share an Istio control plane running in the ops-2 cluster.
kubectl --context ${DEV2_GKE_1} get pods -n istio-system
kubectl --context ${DEV2_GKE_2} get pods -n istio-system
 
    `Output (do not copy)`
NAME                                      READY   STATUS    RESTARTS   AGE
istio-citadel-568747d88-4lj9b             1/1     Running   0          66s
istio-sidecar-injector-759bf6b4bc-ks5br   1/1     Running   0          66s
istiocoredns-5f7546c6f4-qbsqm             2/2     Running   0          78s

Verify service discovery for shared control planes

  1. Optionally, verify the secrets are deployed.
kubectl --context ${OPS_GKE_1} get secrets -l istio/multiCluster=true -n istio-system
kubectl --context ${OPS_GKE_2} get secrets -l istio/multiCluster=true -n istio-system
 
    `Output (do not copy)`
For OPS_GKE_1:
NAME                  TYPE     DATA   AGE
gke-1-apps-r1a-prod   Opaque   1      8m7s
gke-2-apps-r1b-prod   Opaque   1      8m7s
gke-3-apps-r2a-prod   Opaque   1      44s
gke-4-apps-r2b-prod   Opaque   1      43s

For OPS_GKE_2:
NAME                  TYPE     DATA   AGE
gke-1-apps-r1a-prod   Opaque   1      40s
gke-2-apps-r1b-prod   Opaque   1      40s
gke-3-apps-r2a-prod   Opaque   1      8m4s
gke-4-apps-r2b-prod   Opaque   1      8m4s

In this workshop, you use a single shared VPC in which all GKE clusters are created. To discover services across clusters, you use kubeconfig files (one for each application cluster) stored as secrets in the ops clusters. Pilot uses these secrets to discover Services by querying the Kube API servers of the application clusters, authenticating with the credentials in those secrets. The output above shows that both ops clusters can authenticate to all app clusters using these kubeconfig-created secrets, so the ops clusters can discover services in the app clusters automatically. This requires that Pilot in the ops clusters has access to the Kube API servers of all of the other clusters. If Pilot could not reach the Kube API servers, you would manually add remote services as ServiceEntries. You can think of ServiceEntries as DNS entries in your service registry: a ServiceEntry defines a service using a fully qualified domain name (FQDN) and an IP address where it can be reached. See the Istio Multicluster docs for more information.
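
For illustration only, manually registering a remote service would look roughly like the following. This is a hypothetical ServiceEntry with a made-up host and endpoint IP; you do not need to apply it in this workshop.

kubectl --context ${OPS_GKE_1} apply -f - <<EOF
apiVersion: networking.istio.io/v1alpha3
kind: ServiceEntry
metadata:
  name: example-remote-svc
  namespace: istio-system
spec:
  hosts:
  - exampleservice.remote.svc.cluster.local   # FQDN the mesh uses for the service
  location: MESH_INTERNAL
  ports:
  - number: 80
    name: http
    protocol: HTTP
  resolution: STATIC
  endpoints:
  - address: 10.0.0.12                        # IP where the remote service is reachable
EOF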

6. Infrastructure Repo explained

Infrastructure Cloud Build

The GCP resources for the workshop are built using Cloud Build and an infrastructure CSR repo. The bootstrap script you (or your workshop administrator) ran earlier (located at scripts/bootstrap_workshop.sh) creates a GCP folder, a terraform admin project, and the appropriate IAM permissions for the Cloud Build service account. The terraform admin project is used to store Terraform state, logs and miscellaneous scripts. It contains the infrastructure CSR repo; the k8s_repo, described in the next section, lives in the ops project. No other workshop resources are built in the terraform admin project. The Cloud Build service account in the terraform admin project is used to build resources for the workshop.

A cloudbuild.yaml file located in the infrastructure folder is used to build GCP resources for the workshop. It creates a custom builder image with all of the tools required to create GCP resources. These tools include gcloud SDK, terraform and other utilities like python, git, jq etc. The custom builder image runs the terraform plan and apply for each resource. Each resource's terraform files are located in separate folders (details in the next section). The resources are built one at a time and in order of how they would typically be built (for example, a GCP project is built before the resources are created in the project). Please review the cloudbuild.yaml file for more detail.
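
Conceptually, for each resource folder the build runs something like the following inside the custom builder image (a simplified sketch of the flow, not the actual cloudbuild.yaml steps):

cd <resource folder>        # for example, the ops project folder
terraform init
terraform plan -var-file=variables.tfvars -out=terraform.tfplan
terraform apply -input=false terraform.tfplan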

Cloud Build is triggered any time there is a commit to the infrastructure repo. Any change that is made to the infrastructure is stored as Infrastructure as code (IaC) and committed to the repo. The state of your workshop is always stored in this repo.

Folder structure - teams, environments and resources

The Infrastructure repo sets up the GCP infrastructure resources for the workshop. It is structured into folders and subfolders. The base folders within the repo represent the teams that own specific GCP resources. The next layer of folders represents the specific environment for the team (for example dev, stage, prod). The next layer of folders within the environment represents the specific resource (for example host_project, gke_clusters etc). The required scripts and terraform files exist within the resource folders.

[Infrastructure repo folder structure diagram]

The following four types of teams are represented in this workshop:

  1. infrastructure - represents the cloud infrastructure team. They are responsible for creating the GCP resources for all other teams. They use the Terraform admin project for their resources. The infrastructure repo itself is in the Terraform admin project, as well as the Terraform state files (explained below). These resources are created by a bash script during the bootstrap process (see Module 0 - Administrator Workflow for details).
  2. network - represents the networking team. They are responsible for VPC and networking resources. They own the following GCP resources.
     • host project - represents the shared VPC host project.
     • shared VPC - represents the shared VPC, the subnets, secondary IP ranges, routes and firewall rules.
  3. ops - represents the operations/devops team. They own the following resources.
     • ops project - represents a project for all ops resources.
     • gke clusters - an ops GKE cluster per region. The Istio control plane is installed in each of the ops GKE clusters.
     • k8s-repo - a CSR repo that contains GKE manifests for all GKE clusters.
  4. apps - represents the application teams. This workshop simulates two teams, namely app1 and app2. They own the following resources.
     • app projects - every app team gets their own set of projects. This allows them to control billing and IAM for their specific projects.
     • gke clusters - these are application clusters where the application containers/Pods run.
     • gce instances - optionally, if they have applications that run on GCE instances. In this workshop, app1 has a couple of GCE instances where part of the application runs.

In this workshop, the same app (Hipster shop app) represents both app1 and app2.

Provider, States and Outputs - Backends and shared states

The google and google-beta providers are located at gcp/[environment]/gcp/provider.tf. The provider.tf file is symlinked in every resource folder. This allows you to change the provider in one place instead of individually managing providers for each resource.

Every resource contains a backend.tf file which defines the location for the resource's tfstate file. This backend.tf file is generated from a template (located at templates/backend.tf_tmpl) using a script (located at scripts/setup_terraform_admin_project) and then placed in the respective resource folder. Google Cloud Storage (GCS) buckets are used for backends. The GCS bucket folder name matches the resource name. All of the resource backends reside in the terraform admin project.

Resources with interdependent values contain an output.tf file. The required output values are stored in the tfstate file defined in the backend for that particular resource. For example, in order to create a GKE cluster in a project, you need to know the project ID. The project ID is written via output.tf to the tfstate file and can then be consumed through a terraform_remote_state data source in the GKE cluster resource.

The shared_state file is a terraform_remote_state data source pointing to a resource's tfstate file. A shared_state_[resource_name].tf file (or files) exists in the resource folders that require outputs from other resources. For example, in the ops_gke resource folder, there are shared_state files from the ops_project and shared_vpc resources, because you need the project ID and the VPC details to create GKE clusters in the ops project. The shared_state files are generated from a template (located at templates/shared_state.tf_tmpl) using a script (located at scripts/setup_terraform_admin_project). All resources' shared_state files are placed in the gcp/[environment]/shared_states folder. The required shared_state files are symlinked into the respective resource folders. Placing all of the shared_state files in one folder and symlinking them into the appropriate resource folders makes it easy to manage all state files in a single place.
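
For illustration, a generated backend.tf and shared_state file might look roughly like the following. This is a sketch based on the standard Terraform GCS backend and terraform_remote_state data source; the real templates in the repo may differ in naming.

# backend.tf in a resource folder (for example ops_gke); the prefix matches the resource name
cat <<'EOF' > backend.tf
terraform {
  backend "gcs" {
    bucket = "<terraform admin project bucket>"
    prefix = "ops_gke"
  }
}
EOF

# shared_state_ops_project.tf reads the ops project's outputs (such as its project ID)
cat <<'EOF' > shared_state_ops_project.tf
data "terraform_remote_state" "ops_project" {
  backend = "gcs"
  config = {
    bucket = "<terraform admin project bucket>"
    prefix = "ops_project"
  }
}
EOF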

Variables

All resource values are stored as environment variables. These variables are stored (as export statements) in a file called vars.sh located in a GCS bucket in the terraform admin project. The file contains the organization ID, billing account, project IDs, GKE cluster details etc. You can download and source vars.sh from any terminal to get the values for your setup.

Terraform variables are stored in vars.sh as TF_VAR_[variable name]. These variables are used to generate a variables.tfvars file in the respective resource folder. The variables.tfvars file contains all the variables with their values. The variables.tfvars file is generated from a template file in the same folder using a script (located at scripts/setup_terraform_admin_project).
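
A minimal sketch of that conversion (hypothetical; the real logic lives in scripts/setup_terraform_admin_project): every exported TF_VAR_ variable becomes a name = "value" line in variables.tfvars.

source vars.sh
env | grep '^TF_VAR_' \
  | sed 's/^TF_VAR_\([^=]*\)=\(.*\)$/\1 = "\2"/' \
  > variables.tfvars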

K8s Repo explained

k8s_repo is a CSR repo (separate from the infrastructure repo) located in the ops project. It is used to store and apply GKE manifests to all GKE clusters. k8s_repo is created by the infrastructure Cloud Build (see the previous section for details). During the initial infrastructure Cloud Build process, a total of six GKE clusters are created. In the k8s_repo, six folders are created; each folder (named after a GKE cluster) corresponds to that cluster and contains its resource manifest files. Similar to building infrastructure, Cloud Build is used to apply the Kubernetes manifests to all the GKE clusters using the k8s_repo. Cloud Build is triggered anytime there is a commit to the k8s_repo. Similar to infrastructure, all Kubernetes manifests are stored as code in the k8s_repo repository, and the state of each GKE cluster is always stored in its respective folder.

As part of the initial infrastructure build, k8s_repo is created and Istio is installed on all clusters.

Projects, GKE clusters, and namespaces

The resources in this workshop are divided into different GCP projects. Projects should match your company's organizational (or team) structure. Teams (in your organization) responsible for different projects/products/resources use different GCP projects. Having separate projects allows you to create separate sets of IAM permissions and manage billing at a project level. Additionally, quotas are also managed at a project level.

Five teams are represented in this workshop, each with their own project.

  1. The infrastructure team that builds GCP resources uses the Terraform admin project. They manage the infrastructure as code in a CSR repo (called infrastructure) and store all Terraform state information pertaining to resources built in GCP in GCS buckets. They control access to the CSR repo and Terraform state GCS buckets.
  2. The network team that builds the shared VPC uses the host project. This project contains the VPC, subnets, routes and firewall rules. Having a shared VPC allows them to centrally manage networking for GCP resources. All projects use this single shared VPC for networking.
  3. The ops/platform team that builds GKE clusters and ASM/Istio control planes uses the ops project. They manage the life cycle of the GKE clusters and service mesh. They are responsible for hardening the clusters, and for managing the resilience and scale of the Kubernetes platform. In this workshop, you use the gitops method of deploying resources to Kubernetes. A CSR repo (called the k8s_repo) exists in the ops project.
  4. Lastly, the dev1 and dev2 teams (representing two development teams) that build applications use their own dev1 and dev2 projects. These are the applications and services you provide to your customers. These are built on the platform that the ops team manages. The resources (Deployments, Services etc) are pushed to the k8s_repo and get deployed to the appropriate clusters. It is important to note that this workshop does not focus on CI/CD best practices and tooling. You use Cloud Build to automate deploying Kubernetes resources to the GKE clusters directly. In real world production scenarios, you would use a proper CI/CD solution to deploy applications to GKE clusters.

There are two types of GKE clusters in this workshop.

  1. Ops clusters - used by the ops team to run devops tools. In this workshop, they run the ASM/Istio control plane to manage the service mesh.
  2. Application (apps) clusters - used by the development teams to run applications. In this workshop, the Hipster shop app is used.

Separating the ops/admin tooling from the clusters running the application allows you to manage the life cycle of each resource independently. The two types of clusters also live in different projects, pertaining to the team/product that uses them, which also makes IAM permissions easier to manage.

There are a total of six GKE clusters. Two regional ops clusters are created in the ops project. The ASM/Istio control plane is installed on both ops clusters. Each ops cluster is in a different region. In addition, there are four zonal application clusters. These are created in their own projects. This workshop simulates two development teams, each with their own project. Each project contains two app clusters. App clusters are zonal clusters in different zones. The four app clusters are located in two regions and four zones. This way you get regional and zonal redundancy.

The application used in this workshop, the Hipster Shop app, is deployed on all four app clusters. Each microservice lives in its own namespace in every app cluster. Hipster Shop app Deployments (Pods) are not deployed on the ops clusters. However, the namespaces and Service resources for all microservices are also created in the ops clusters. The ASM/Istio control plane uses the Kubernetes service registries for service discovery. In the absence of Services (in the ops clusters), you would have to manually create ServiceEntries for each service running in the app clusters.
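
After you deploy the app in a later module, you can see this split for yourself. A hedged check, using the frontend namespace as an example:

kubectl --context ${OPS_GKE_1} get svc -n frontend          # Service exists in the ops cluster for discovery
kubectl --context ${OPS_GKE_1} get deployments -n frontend  # no app Deployments in the ops clusters
kubectl --context ${DEV1_GKE_1} get deployments -n frontend # frontend Pods run in the app clusters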

You deploy a 10-tier microservices application in this workshop. The application is a web-based e-commerce app called "Hipster Shop" where users can browse items, add them to the cart, and purchase them.

Kubernetes manifests and k8s_repo

You use the k8s_repo to add Kubernetes resources to all GKE clusters. You do this by copying Kubernetes manifests and committing to the k8s_repo. All commits to the k8s_repo trigger a Cloud Build job which deploys the Kubernetes manifests to the respective cluster. Each cluster's manifest is located in a separate folder named the same as the cluster name.

The six cluster names are:

  1. gke-asm-1-r1-prod - the regional ops cluster in region 1
  2. gke-asm-2-r2-prod - the regional ops cluster in region 2
  3. gke-1-apps-r1a-prod - the app cluster in region 1 zone a
  4. gke-2-apps-r1b-prod - the app cluster in region 1 zone b
  5. gke-3-apps-r2a-prod - the app cluster in region 2 zone a
  6. gke-4-apps-r2b-prod - the app cluster in region 2 zone b

The k8s_repo has folders corresponding to these clusters. Any manifest placed in these folders gets applied to the corresponding GKE cluster. Manifests for each cluster are placed in sub-folders (within the cluster's main folder) for ease of management. In this workshop, you use Kustomize to keep track of resources that get deployed. Please refer to the Kustomize official documentation for more details.
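
Conceptually, the k8s-repo Cloud Build does something like the following for each cluster folder on every commit (a simplified sketch, not the actual cloudbuild.yaml; it assumes the OPS_GKE_1 context points at the gke-asm-1-r1-prod cluster):

cd ${WORKDIR}/k8s-repo
# Render one cluster's folder with Kustomize and apply it to that cluster's context;
# the build repeats this for all six cluster folders.
kustomize build gke-asm-1-r1-prod | kubectl --context ${OPS_GKE_1} apply -f -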

7. Deploy the Sample App

Objective: Deploy Hipster shop app on apps clusters

  • Clone k8s-repo
  • Copy Hipster shop manifests to all apps clusters
  • Create Services for Hipster shop app in the ops clusters
  • Set up loadgenerators in the ops clusters to test global connectivity
  • Verify secure connectivity to the Hipster shop app

Copy-and-Paste Method Lab Instructions

Clone the ops project source repo

As part of the initial Terraform infrastructure build, the k8s-repo is already created in the ops project.

  1. Create an empty directory for git repo:
mkdir $WORKDIR/k8s-repo
 
  2. Init the git repo and add the remote:
cd $WORKDIR/k8s-repo
git init && git remote add origin \
https://source.developers.google.com/p/$TF_VAR_ops_project_name/r/k8s-repo
 
  3. Set the local git configuration and pull master from the remote repo.
git config --local user.email $MY_USER
git config --local user.name "K8s repo user"
git config --local \
credential.'https://source.developers.google.com'.helper gcloud.sh
git pull origin master

Copy manifests, commit and push

  1. Copy the Hipster Shop namespaces and services to the source repo for all clusters.
cp -r $WORKDIR/asm/k8s_manifests/prod/app/namespaces \
$WORKDIR/k8s-repo/$DEV1_GKE_1_CLUSTER/app/.
cp -r $WORKDIR/asm/k8s_manifests/prod/app/namespaces \
$WORKDIR/k8s-repo/$DEV1_GKE_2_CLUSTER/app/.
cp -r $WORKDIR/asm/k8s_manifests/prod/app/namespaces \
$WORKDIR/k8s-repo/$DEV2_GKE_1_CLUSTER/app/.
cp -r $WORKDIR/asm/k8s_manifests/prod/app/namespaces \
$WORKDIR/k8s-repo/$DEV2_GKE_2_CLUSTER/app/.
cp -r $WORKDIR/asm/k8s_manifests/prod/app/namespaces \
$WORKDIR/k8s-repo/$OPS_GKE_1_CLUSTER/app/.
cp -r $WORKDIR/asm/k8s_manifests/prod/app/namespaces \
$WORKDIR/k8s-repo/$OPS_GKE_2_CLUSTER/app/.

cp -r $WORKDIR/asm/k8s_manifests/prod/app/services \
$WORKDIR/k8s-repo/$DEV1_GKE_1_CLUSTER/app/.
cp -r $WORKDIR/asm/k8s_manifests/prod/app/services \
$WORKDIR/k8s-repo/$DEV1_GKE_2_CLUSTER/app/.
cp -r $WORKDIR/asm/k8s_manifests/prod/app/services \
$WORKDIR/k8s-repo/$DEV2_GKE_1_CLUSTER/app/.
cp -r $WORKDIR/asm/k8s_manifests/prod/app/services \
$WORKDIR/k8s-repo/$DEV2_GKE_2_CLUSTER/app/.
cp -r $WORKDIR/asm/k8s_manifests/prod/app/services \
$WORKDIR/k8s-repo/$OPS_GKE_1_CLUSTER/app/.
cp -r $WORKDIR/asm/k8s_manifests/prod/app/services \
$WORKDIR/k8s-repo/$OPS_GKE_2_CLUSTER/app/.
 
  2. Copy the app folder kustomization.yaml to all clusters.
cp $WORKDIR/asm/k8s_manifests/prod/app/kustomization.yaml \
$WORKDIR/k8s-repo/$DEV1_GKE_1_CLUSTER/app/
cp $WORKDIR/asm/k8s_manifests/prod/app/kustomization.yaml \
$WORKDIR/k8s-repo/$DEV1_GKE_2_CLUSTER/app/
cp $WORKDIR/asm/k8s_manifests/prod/app/kustomization.yaml \
$WORKDIR/k8s-repo/$DEV2_GKE_1_CLUSTER/app/
cp $WORKDIR/asm/k8s_manifests/prod/app/kustomization.yaml \
$WORKDIR/k8s-repo/$DEV2_GKE_2_CLUSTER/app/
cp $WORKDIR/asm/k8s_manifests/prod/app/kustomization.yaml \
$WORKDIR/k8s-repo/$OPS_GKE_1_CLUSTER/app/
cp $WORKDIR/asm/k8s_manifests/prod/app/kustomization.yaml \
$WORKDIR/k8s-repo/$OPS_GKE_2_CLUSTER/app/
 
  3. Copy the Hipster Shop Deployments, RBAC and PodSecurityPolicy to the source repo for the apps clusters.
cp -r $WORKDIR/asm/k8s_manifests/prod/app/deployments \
$WORKDIR/k8s-repo/$DEV1_GKE_1_CLUSTER/app/
cp -r $WORKDIR/asm/k8s_manifests/prod/app/deployments \
$WORKDIR/k8s-repo/$DEV1_GKE_2_CLUSTER/app/
cp -r $WORKDIR/asm/k8s_manifests/prod/app/deployments \
$WORKDIR/k8s-repo/$DEV2_GKE_1_CLUSTER/app/
cp -r $WORKDIR/asm/k8s_manifests/prod/app/deployments \
$WORKDIR/k8s-repo/$DEV2_GKE_2_CLUSTER/app/

cp -r $WORKDIR/asm/k8s_manifests/prod/app/rbac \
$WORKDIR/k8s-repo/$DEV1_GKE_1_CLUSTER/app/
cp -r $WORKDIR/asm/k8s_manifests/prod/app/rbac \
$WORKDIR/k8s-repo/$DEV1_GKE_2_CLUSTER/app/
cp -r $WORKDIR/asm/k8s_manifests/prod/app/rbac \
$WORKDIR/k8s-repo/$DEV2_GKE_1_CLUSTER/app/
cp -r $WORKDIR/asm/k8s_manifests/prod/app/rbac \
$WORKDIR/k8s-repo/$DEV2_GKE_2_CLUSTER/app/
cp -r $WORKDIR/asm/k8s_manifests/prod/app/podsecuritypolicies \
$WORKDIR/k8s-repo/$DEV1_GKE_1_CLUSTER/app/
cp -r $WORKDIR/asm/k8s_manifests/prod/app/podsecuritypolicies \
$WORKDIR/k8s-repo/$DEV1_GKE_2_CLUSTER/app/
cp -r $WORKDIR/asm/k8s_manifests/prod/app/podsecuritypolicies \
$WORKDIR/k8s-repo/$DEV2_GKE_1_CLUSTER/app/
cp -r $WORKDIR/asm/k8s_manifests/prod/app/podsecuritypolicies \
$WORKDIR/k8s-repo/$DEV2_GKE_2_CLUSTER/app/
  4. Remove the cartservice deployment, rbac and podsecuritypolicy from all but one dev cluster. Hipster Shop was not built for multi-cluster deployment, so to avoid inconsistent results, we use just one cartservice.
rm $WORKDIR/k8s-repo/$DEV1_GKE_2_CLUSTER/app/deployments/app-cart-service.yaml
rm $WORKDIR/k8s-repo/$DEV1_GKE_2_CLUSTER/app/podsecuritypolicies/cart-psp.yaml
rm $WORKDIR/k8s-repo/$DEV1_GKE_2_CLUSTER/app/rbac/cart-rbac.yaml

rm $WORKDIR/k8s-repo/$DEV2_GKE_1_CLUSTER/app/deployments/app-cart-service.yaml
rm $WORKDIR/k8s-repo/$DEV2_GKE_1_CLUSTER/app/podsecuritypolicies/cart-psp.yaml
rm $WORKDIR/k8s-repo/$DEV2_GKE_1_CLUSTER/app/rbac/cart-rbac.yaml

rm $WORKDIR/k8s-repo/$DEV2_GKE_2_CLUSTER/app/deployments/app-cart-service.yaml
rm $WORKDIR/k8s-repo/$DEV2_GKE_2_CLUSTER/app/podsecuritypolicies/cart-psp.yaml
rm $WORKDIR/k8s-repo/$DEV2_GKE_2_CLUSTER/app/rbac/cart-rbac.yaml
 
  5. Add the cartservice deployment, rbac and podsecuritypolicy to kustomization.yaml in the first dev cluster only.
cd ${WORKDIR}/k8s-repo/${DEV1_GKE_1_CLUSTER}/app
cd deployments && kustomize edit add resource app-cart-service.yaml
cd ../podsecuritypolicies && kustomize edit add resource cart-psp.yaml
cd ../rbac && kustomize edit add resource cart-rbac.yaml
cd ${WORKDIR}/asm
 
  6. Remove the podsecuritypolicies, deployments and rbac directories from the ops clusters' kustomization.yaml files.
sed -i -e '/- deployments\//d' -e '/- podsecuritypolicies\//d' \
  -e '/- rbac\//d' \
$WORKDIR/k8s-repo/$OPS_GKE_1_CLUSTER/app/kustomization.yaml
sed -i -e '/- deployments\//d' -e '/- podsecuritypolicies\//d' \
  -e '/- rbac\//d' \
$WORKDIR/k8s-repo/$OPS_GKE_2_CLUSTER/app/kustomization.yaml
  7. Replace the PROJECT_ID in the RBAC manifests.
sed -i 's/\${PROJECT_ID}/'${TF_VAR_dev1_project_name}'/g' \
${WORKDIR}/k8s-repo/${DEV1_GKE_1_CLUSTER}/app/rbac/*
sed -i 's/\${PROJECT_ID}/'${TF_VAR_dev1_project_name}'/g' \
${WORKDIR}/k8s-repo/${DEV1_GKE_2_CLUSTER}/app/rbac/*
sed -i 's/\${PROJECT_ID}/'${TF_VAR_dev2_project_name}'/g' \
${WORKDIR}/k8s-repo/${DEV2_GKE_1_CLUSTER}/app/rbac/*
sed -i 's/\${PROJECT_ID}/'${TF_VAR_dev2_project_name}'/g' \
${WORKDIR}/k8s-repo/${DEV2_GKE_2_CLUSTER}/app/rbac/*
  
  8. Copy the IngressGateway and VirtualService manifests to the source repo for the ops clusters.
cp -r $WORKDIR/asm/k8s_manifests/prod/app-ingress/* \
$WORKDIR/k8s-repo/$OPS_GKE_1_CLUSTER/app-ingress/
cp -r $WORKDIR/asm/k8s_manifests/prod/app-ingress/* \
$WORKDIR/k8s-repo/$OPS_GKE_2_CLUSTER/app-ingress/
 
  9. Copy the Config Connector resources to one of the clusters in each project.
cp -r $WORKDIR/asm/k8s_manifests/prod/app-cnrm/* \
$WORKDIR/k8s-repo/$OPS_GKE_1_CLUSTER/app-cnrm/
cp -r $WORKDIR/asm/k8s_manifests/prod/app-cnrm/* \
$WORKDIR/k8s-repo/$DEV1_GKE_1_CLUSTER/app-cnrm/
cp -r $WORKDIR/asm/k8s_manifests/prod/app-cnrm/* \
$WORKDIR/k8s-repo/$DEV2_GKE_1_CLUSTER/app-cnrm/
 
  10. Replace the PROJECT_ID in the Config Connector manifests.
sed -i 's/${PROJECT_ID}/'$TF_VAR_ops_project_name'/g' \
$WORKDIR/k8s-repo/$OPS_GKE_1_CLUSTER/app-cnrm/*
sed -i 's/${PROJECT_ID}/'$TF_VAR_dev1_project_name'/g' \
$WORKDIR/k8s-repo/$DEV1_GKE_1_CLUSTER/app-cnrm/*
sed -i 's/${PROJECT_ID}/'$TF_VAR_dev2_project_name'/g' \
$WORKDIR/k8s-repo/$DEV2_GKE_1_CLUSTER/app-cnrm/*
 
  11. Copy loadgenerator manifests (Deployment, PodSecurityPolicy and RBAC) to the ops clusters. The Hipster Shop app is exposed using a global Google Cloud Load Balancer (GCLB). GCLB receives client traffic (destined for the frontend service) and sends it to the closest instance of the Service. Putting the loadgenerator on both ops clusters ensures traffic is sent to both Istio Ingress gateways running in the ops clusters. Load balancing is explained in detail in the following section.
cp -r $WORKDIR/asm/k8s_manifests/prod/app-loadgenerator/. \
$WORKDIR/k8s-repo/$OPS_GKE_1_CLUSTER/app-loadgenerator/.
cp -r $WORKDIR/asm/k8s_manifests/prod/app-loadgenerator/. \
$WORKDIR/k8s-repo/$OPS_GKE_2_CLUSTER/app-loadgenerator/. 
 
  12. Replace the ops project ID in the loadgenerator manifests for both ops clusters.
sed -i 's/OPS_PROJECT_ID/'$TF_VAR_ops_project_name'/g'  \
$WORKDIR/k8s-repo/$OPS_GKE_1_CLUSTER/app-loadgenerator/loadgenerator-deployment.yaml
sed -i 's/OPS_PROJECT_ID/'$TF_VAR_ops_project_name'/g' \
$WORKDIR/k8s-repo/$OPS_GKE_1_CLUSTER/app-loadgenerator/loadgenerator-rbac.yaml
sed -i 's/OPS_PROJECT_ID/'$TF_VAR_ops_project_name'/g' \
$WORKDIR/k8s-repo/$OPS_GKE_2_CLUSTER/app-loadgenerator/loadgenerator-deployment.yaml
sed -i 's/OPS_PROJECT_ID/'$TF_VAR_ops_project_name'/g' \
$WORKDIR/k8s-repo/$OPS_GKE_2_CLUSTER/app-loadgenerator/loadgenerator-rbac.yaml
 

  13. Add the loadgenerator resources to kustomization.yaml for both ops clusters.
cd $WORKDIR/k8s-repo/$OPS_GKE_1_CLUSTER/app-loadgenerator/
kustomize edit add resource loadgenerator-psp.yaml
kustomize edit add resource loadgenerator-rbac.yaml
kustomize edit add resource loadgenerator-deployment.yaml

cd $WORKDIR/k8s-repo/$OPS_GKE_2_CLUSTER/app-loadgenerator/
kustomize edit add resource loadgenerator-psp.yaml
kustomize edit add resource loadgenerator-rbac.yaml
kustomize edit add resource loadgenerator-deployment.yaml
 

  14. Commit to k8s-repo.
cd $WORKDIR/k8s-repo
git add . && git commit -am "create app namespaces and install hipster shop"
git push --set-upstream origin master 
 
  15. View the status of the Ops project Cloud Build in a previously opened tab or by clicking the following link:
echo "https://console.cloud.google.com/cloud-build/builds?project=${TF_VAR_ops_project_name}"
  

Verify Application deployment

  1. Verify pods in all application namespaces except cart are in Running state in all dev clusters.
for ns in ad checkout currency email frontend payment product-catalog recommendation shipping; do
  kubectl --context $DEV1_GKE_1 get pods -n $ns;
  kubectl --context $DEV1_GKE_2 get pods -n $ns;
  kubectl --context $DEV2_GKE_1 get pods -n $ns;
  kubectl --context $DEV2_GKE_2 get pods -n $ns;
done;
 

Output (do not copy)

NAME                               READY   STATUS    RESTARTS   AGE
currencyservice-5c5b8876db-pvc6s   2/2     Running   0          13m
NAME                               READY   STATUS    RESTARTS   AGE
currencyservice-5c5b8876db-xlkl9   2/2     Running   0          13m
NAME                               READY   STATUS    RESTARTS   AGE
currencyservice-5c5b8876db-zdjkg   2/2     Running   0          115s
NAME                               READY   STATUS    RESTARTS   AGE
currencyservice-5c5b8876db-l748q   2/2     Running   0          82s

NAME                            READY   STATUS    RESTARTS   AGE
emailservice-588467b8c8-gk92n   2/2     Running   0          13m
NAME                            READY   STATUS    RESTARTS   AGE
emailservice-588467b8c8-rvzk9   2/2     Running   0          13m
NAME                            READY   STATUS    RESTARTS   AGE
emailservice-588467b8c8-mt925   2/2     Running   0          117s
NAME                            READY   STATUS    RESTARTS   AGE
emailservice-588467b8c8-klqn7   2/2     Running   0          84s

NAME                        READY   STATUS    RESTARTS   AGE
frontend-64b94cf46f-kkq7d   2/2     Running   0          13m
NAME                        READY   STATUS    RESTARTS   AGE
frontend-64b94cf46f-lwskf   2/2     Running   0          13m
NAME                        READY   STATUS    RESTARTS   AGE
frontend-64b94cf46f-zz7xs   2/2     Running   0          118s
NAME                        READY   STATUS    RESTARTS   AGE
frontend-64b94cf46f-2vtw5   2/2     Running   0          85s

NAME                              READY   STATUS    RESTARTS   AGE
paymentservice-777f6c74f8-df8ml   2/2     Running   0          13m
NAME                              READY   STATUS    RESTARTS   AGE
paymentservice-777f6c74f8-bdcvg   2/2     Running   0          13m
NAME                              READY   STATUS    RESTARTS   AGE
paymentservice-777f6c74f8-jqf28   2/2     Running   0          117s
NAME                              READY   STATUS    RESTARTS   AGE
paymentservice-777f6c74f8-95x2m   2/2     Running   0          86s

NAME                                     READY   STATUS    RESTARTS   AGE
productcatalogservice-786dc84f84-q5g9p   2/2     Running   0          13m
NAME                                     READY   STATUS    RESTARTS   AGE
productcatalogservice-786dc84f84-n6lp8   2/2     Running   0          13m
NAME                                     READY   STATUS    RESTARTS   AGE
productcatalogservice-786dc84f84-gf9xl   2/2     Running   0          119s
NAME                                     READY   STATUS    RESTARTS   AGE
productcatalogservice-786dc84f84-v7cbr   2/2     Running   0          86s

NAME                                     READY   STATUS    RESTARTS   AGE
recommendationservice-5fdf959f6b-2ltrk   2/2     Running   0          13m
NAME                                     READY   STATUS    RESTARTS   AGE
recommendationservice-5fdf959f6b-dqd55   2/2     Running   0          13m
NAME                                     READY   STATUS    RESTARTS   AGE
recommendationservice-5fdf959f6b-jghcl   2/2     Running   0          119s
NAME                                     READY   STATUS    RESTARTS   AGE
recommendationservice-5fdf959f6b-kkspz   2/2     Running   0          87s

NAME                              READY   STATUS    RESTARTS   AGE
shippingservice-7bd5f569d-qqd9n   2/2     Running   0          13m
NAME                              READY   STATUS    RESTARTS   AGE
shippingservice-7bd5f569d-xczg5   2/2     Running   0          13m
NAME                              READY   STATUS    RESTARTS   AGE
shippingservice-7bd5f569d-wfgfr   2/2     Running   0          2m
NAME                              READY   STATUS    RESTARTS   AGE
shippingservice-7bd5f569d-r6t8v   2/2     Running   0          88s
  2. Verify pods in the cart namespace are in Running state in the first dev cluster only.
kubectl --context $DEV1_GKE_1 get pods -n cart;
 

Output (do not copy)

NAME                           READY   STATUS    RESTARTS   AGE
cartservice-659c9749b4-vqnrd   2/2     Running   0          17m

Access the Hipster Shop app

Global load balancing

You now have the Hipster Shop app deployed to all four app clusters. These clusters are in two regions and four zones. Clients access the Hipster Shop app through the frontend service, which runs on all four app clusters. A Google Cloud Load Balancer (GCLB) is used to get client traffic to all four instances of the frontend service.

Istio Ingress gateways only run in the ops clusters and act as a regional load balancer to the two zonal application clusters within the region. GCLB uses the two Istio ingress gateways (running in the two ops clusters) as backends to the global frontend service. The Istio Ingress gateways receive the client traffic from the GCLB and then send the client traffic onwards to the frontend Pods running in the application clusters.

[Diagram: GCLB routing to the Istio ingress gateways and frontend Pods]

Alternatively, you can put Istio Ingress gateways on the application clusters directly and the GCLB can use those as backends.

GKE Autoneg controller

The Istio Ingress gateway Kubernetes Service registers itself as a backend to the GCLB using Network Endpoint Groups (NEGs). NEGs allow for container-native load balancing with GCLBs. NEGs are created through a special annotation on a Kubernetes Service, which registers the Service with the NEG controller. The autoneg controller is a special GKE controller that automates the creation of NEGs and attaches them as backends to a GCLB, driven by Service annotations. The Istio control planes (including the Istio ingress gateways), the GCLB, and the autoneg configuration are all deployed during the initial Terraform infrastructure Cloud Build.
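
If you want to poke at the result, you can look at the annotations on the ingress gateway Service and list the NEGs and GCLB backend services that the autoneg controller wires together. These are read-only checks; the exact annotation keys and resource names are whatever the workshop's Terraform and manifests created in your ops project.

kubectl --context ${OPS_GKE_1} -n istio-system get svc istio-ingressgateway \
  -o jsonpath='{.metadata.annotations}'

gcloud compute network-endpoint-groups list --project ${TF_VAR_ops_project_name}
gcloud compute backend-services list --project ${TF_VAR_ops_project_name}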

Secure Ingress using Cloud Endpoints and managed certs

GCP managed certificates are used to secure client traffic to the frontend GCLB service. GCLB uses a managed certificate for the global frontend service, and the certificate is terminated at the GCLB. In this workshop, you use Cloud Endpoints as the domain for the managed certificate. Alternatively, you can use your own domain and a DNS name for the frontend to create GCP managed certificates.

  1. To access the Hipster shop, click on the link output of the following command.
echo "https://frontend.endpoints.$TF_VAR_ops_project_name.cloud.goog" 
 
  2. You can check that the certificate is valid by clicking the lock symbol in the URL bar of your Chrome tab.


Verify global load balancing

As part of the application deployment, load generators that generate test traffic to the GCLB Hipster Shop Cloud Endpoints link were deployed in both ops clusters. Verify that the GCLB is receiving traffic and sending it to both Istio Ingress gateways.

  1. Get the GCLB > Monitoring link for the ops project where the Hipster shop GCLB is created.
echo "https://console.cloud.google.com/net-services/loadbalancing/details/http/istio-ingressgateway?project=$TF_VAR_ops_project_name&cloudshell=false&tab=monitoring&duration=PT1H" 
 
  2. Change from All backends to istio-ingressgateway from the Backend dropdown menu as shown below.

6697c9eb67998d27.png

  1. Note traffic going to both istio-ingressgateways.

ff8126e44cfd7f5e.png

There are three NEGs created per istio-ingressgateway. Since the ops clusters are regional clusters, one NEG is created for each zone in the region. The istio-ingressgateway Pods, however, run in a single zone per region. Traffic is shown going to the istio-ingressgateway Pods.
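If you want to see these NEGs yourself, you can list them with gcloud. They live in the ops project (where the ingress gateways and GCLB are); you should see NEGs named after the istio-ingressgateway Service, spread across the zones of each ops region.

gcloud compute network-endpoint-groups list --project $TF_VAR_ops_project_name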

Load generators are running in both ops clusters, simulating client traffic from the two regions they are in. The load generated in the ops cluster in region 1 is sent to the istio-ingressgateway in region 1. Likewise, the load generated in the ops cluster in region 2 is sent to the istio-ingressgateway in region 2.

8. Observability with Stackdriver

Objective: Connect Istio telemetry to Stackdriver and validate.

  • Install istio-telemetry resources
  • Create/update Istio Services dashboards
  • View container logs
  • View distributed tracing in Stackdriver

Copy-and-Paste Method Lab Instructions

One of Istio's major features is built-in observability ("o11y"). This means that even with black-box, uninstrumented containers, operators can still observe the traffic going in and out of these containers, providing services to customers. This observation takes the shape of a few different methods: metrics, logs, and traces.

We will also utilize the built-in load generation system in Hipster Shop. Observability doesn't work very well in a static system with no traffic, so load generation helps us see how it works. This load is already running; now we'll just be able to see it.

  1. Install the Istio-to-Stackdriver telemetry config file.
cd $WORKDIR/k8s-repo/$OPS_GKE_1_CLUSTER/istio-telemetry
kustomize edit add resource istio-telemetry.yaml

cd $WORKDIR/k8s-repo/$OPS_GKE_2_CLUSTER/istio-telemetry
kustomize edit add resource istio-telemetry.yaml
 
  1. Commit to k8s-repo.
cd $WORKDIR/k8s-repo
git add . && git commit -am "Install istio to stackdriver configuration"
git push 
 
  1. View the status of the Ops project Cloud Build in a previously opened tab or by clicking the following link:
echo "https://console.cloud.google.com/cloud-build/builds?project=${TF_VAR_ops_project_name}"
 
  1. Verify the Istio → Stackdriver integration. Get the Stackdriver Handler CRD.
kubectl --context $OPS_GKE_1 get handler -n istio-system
 

The output should show a handler named stackdriver:

NAME            AGE
kubernetesenv   12d
prometheus      12d
stackdriver     69s      # <== NEW!
  1. Verify that the Istio metrics export to Stackdriver is working. Click the link output from this command:
echo "https://console.cloud.google.com/monitoring/metrics-explorer?cloudshell=false&project=$TF_VAR_ops_project_name"
 

You will be prompted to create a new Workspace, named after the Ops project; just choose OK. If it prompts you about the new UI, dismiss the dialog.

In the Metrics Explorer, under "Find resource type and metric" type "istio" to see there are options like "Server Request Count" on the "Kubernetes Container" resource type. This shows us that the metrics are flowing from the mesh into Stackdriver.

(You will have to Group By the destination_service_name label if you want to see the lines shown below.)

b9b59432ee68e695.png
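If you prefer the command line, you can confirm the same data is flowing with a direct Monitoring API query. The sketch below assumes the metric type exported by the Istio Stackdriver adapter is istio.io/service/server/request_count; if Metrics Explorer shows a different name in your environment, substitute it in the filter.

OAUTH_TOKEN=$(gcloud auth application-default print-access-token)
START=$(date -u -d '10 minutes ago' +%Y-%m-%dT%H:%M:%SZ)
END=$(date -u +%Y-%m-%dT%H:%M:%SZ)
# List recent time series for the Istio server request count metric
curl -s -H "Authorization: Bearer $OAUTH_TOKEN" \
 "https://monitoring.googleapis.com/v3/projects/$TF_VAR_ops_project_name/timeSeries?filter=metric.type%3D%22istio.io%2Fservice%2Fserver%2Frequest_count%22&interval.startTime=$START&interval.endTime=$END" \
 | head -40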

Visualizing metrics with Dashboards:

Now that our metrics are in the Stackdriver APM system, we want a way to visualize them. In this section, we will install a pre-built dashboard which shows us three of the four "Golden Signals" of metrics: Traffic (requests per second), Latency (in this case, 99th and 50th percentile), and Errors (we're excluding Saturation in this example).

Istio's Envoy proxy gives us several metrics, but these are a good set to start with (an exhaustive list is here). Note that each metric has a set of labels that can be used for filtering and aggregating, such as destination_service, source_workload_namespace, response_code, istio_tcp_received_bytes_total, etc.

  1. Now let's add our pre-canned metrics dashboard. We are going to use the Dashboard API directly. This is something you wouldn't normally do by hand-generating API calls; it would be part of an automation system, or you would build the dashboard manually in the web UI. Going through the API will get us started quickly:
sed -i 's/OPS_PROJECT/'${TF_VAR_ops_project_name}'/g' \
$WORKDIR/asm/k8s_manifests/prod/app-telemetry/services-dashboard.json
OAUTH_TOKEN=$(gcloud auth application-default print-access-token)
curl -X POST -H "Authorization: Bearer $OAUTH_TOKEN" -H "Content-Type: application/json" \
https://monitoring.googleapis.com/v1/projects/$TF_VAR_ops_project_name/dashboards \
 -d @$WORKDIR/asm/k8s_manifests/prod/app-telemetry/services-dashboard.json
 
  1. Navigate to the output link below to view the newly added "Services dashboard".
echo "https://console.cloud.google.com/monitoring/dashboards/custom/servicesdash?cloudshell=false&project=$TF_VAR_ops_project_name"
 
 

We could edit the dashboard in-place using the UI, but in our case we are going to quickly add a new graph using the API. To do that, you pull down the latest version of the dashboard, apply your edits, then push it back up using the HTTP PATCH method.

  1. You can get an existing dashboard by querying the monitoring API. Get the existing dashboard that was just added:
curl -X GET -H "Authorization: Bearer $OAUTH_TOKEN" -H "Content-Type: application/json" \
https://monitoring.googleapis.com/v1/projects/$TF_VAR_ops_project_name/dashboards/servicesdash > /tmp/services-dashboard.json
 
  1. Add a new graph (50th percentile latency): [API reference] Now we can add a new graph widget to our dashboard in code. This change can be reviewed by peers and checked into version control. Here is a widget to add that shows 50th-percentile (median) latency.

Try editing the dashboard you just got, adding a new stanza:

NEW_CHART=${WORKDIR}/asm/k8s_manifests/prod/app-telemetry/new-chart.json
jq --argjson newChart "$(<$NEW_CHART)" '.gridLayout.widgets += [$newChart]' /tmp/services-dashboard.json > /tmp/patched-services-dashboard.json
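For reference, the widget stanza being appended has roughly the following shape (an illustrative sketch of new-chart.json, not its exact contents; field names follow the Cloud Monitoring Dashboards API, and the metric type is assumed to be istio.io/service/server/response_latencies):

(illustrative example - do not copy)

{
  "title": "50th Percentile Latency",
  "xyChart": {
    "dataSets": [{
      "timeSeriesQuery": {
        "timeSeriesFilter": {
          "filter": "metric.type=\"istio.io/service/server/response_latencies\" resource.type=\"k8s_container\"",
          "aggregation": {
            "alignmentPeriod": "60s",
            "perSeriesAligner": "ALIGN_PERCENTILE_50",
            "crossSeriesReducer": "REDUCE_MEAN",
            "groupByFields": ["metric.label.\"destination_service_name\""]
          }
        }
      }
    }]
  }
}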
 
  1. Update the existing services dashboard:
curl -X PATCH -H "Authorization: Bearer $OAUTH_TOKEN" -H "Content-Type: application/json" \
https://monitoring.googleapis.com/v1/projects/$TF_VAR_ops_project_name/dashboards/servicesdash \
 -d @/tmp/patched-services-dashboard.json
 
  1. View the updated dashboard by navigating to the following output link:
echo "https://console.cloud.google.com/monitoring/dashboards/custom/servicesdash?cloudshell=false&project=$TF_VAR_ops_project_name"
 
  1. Do some simple Logs Analysis.

Istio provides a set of structured logs for all in-mesh network traffic and uploads them to Stackdriver Logging to allow cross-cluster analysis in one powerful tool. Logs are annotated with service-level metadata such as the cluster, container, app, connection_id, etc.

An example log entry (in this case, Envoy proxy's accesslog) might look like this (trimmed):

*** DO NOT PASTE *** 
 logName: "projects/PROJECTNAME-11932-01-ops/logs/server-tcp-accesslog-stackdriver.instance.istio-system" 
labels: {
  connection_id: "fbb46826-96fd-476c-ac98-68a9bd6e585d-1517191"   
  destination_app: "redis-cart"   
  destination_ip: "10.16.1.7"   
  destination_name: "redis-cart-6448dcbdcc-cj52v"   
  destination_namespace: "cart"   
  destination_owner: "kubernetes://apis/apps/v1/namespaces/cart/deployments/redis-cart"   
  destination_workload: "redis-cart"   
  source_ip: "10.16.2.8"   
  total_received_bytes: "539"   
  total_sent_bytes: "569" 
...  
 }

View your logs here:

echo "https://console.cloud.google.com/logs/viewer?cloudshell=false&project=$TF_VAR_ops_project_name"
 

You can view Istio's control plane logs by selecting Resource > Kubernetes Container, and searching on "pilot" —

6f93b2aec6c4f520.png

Here, we can see the Istio Control Plane pushing proxy config to the sidecar proxies for each sample app service. "CDS," "LDS," and "RDS" represent different Envoy APIs ( more information).

Beyond Istio's logs, you can also find container logs as well as infrastructure or other GCP services logs all in the same interface. Here are some sample logs queries for GKE. The logs viewer also allows you to create metrics out of logs (eg: "count every error that matches some string") which can be used on a dashboard or as part of an alert. Logs can also be streamed to other analysis tools such as BigQuery.

Some sample filters for hipster shop:

resource.type="k8s_container" labels.destination_app="productcatalogservice"

resource.type="k8s_container" resource.labels.namespace_name="cart"
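As a quick example of the logs-based metrics mentioned above, a command along the following lines would turn a log filter into a counter metric that you could chart or alert on (the metric name and filter here are illustrative, not part of the workshop setup):

gcloud logging metrics create cart-namespace-errors \
  --project=$TF_VAR_ops_project_name \
  --description="Count of error-level log entries from the cart namespace" \
  --log-filter='resource.type="k8s_container" resource.labels.namespace_name="cart" severity>=ERROR'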

  1. Check out Distributed Traces.

Now that you're working with a distributed system, debugging needs a new tool: Distributed Tracing. This tool allows you to discover statistics about how your services are interacting (such as finding outlying slow events in the picture below), as well as dive into raw sample traces to investigate the details of what is really going on.

The Timeline View shows all requests over time, graphed by their latency, or time spent between initial request, through the Hipster stack, to finally respond to the end user. The higher up the dots, the slower (and less-happy!) the user's experience.

You can click on a dot to find the detailed Waterfall View of that particular request. This ability to find the raw details of a particular request (not just aggregate statistics) is vital to understanding the interplay between services, especially when hunting down rare, but bad, interactions between services.

The Waterfall View should be familiar to anyone who has used a debugger, but in this case instead of showing time spent in different processes of a single application, this is showing time spent traversing our mesh, between services, running in separate containers.

Here you can find your Traces:

echo "https://console.cloud.google.com/traces/overview?cloudshell=false&project=$TF_VAR_ops_project_name"
 

An example screenshot of the tool:

5ee238836dc9047f.png

9. Mutual TLS Authentication

Objective: Secure connectivity between microservices (AuthN).

  • Enable mesh wide mTLS
  • Verify mTLS by inspecting logs

Copy-and-Paste Method Lab Instructions

Now that our apps are installed and Observability is set up, we can start securing the connections between services and making sure everything keeps working.

For example, we can see on the Kiali dashboard that our services are not using mTLS (no "lock" icon). But the traffic is flowing and the system is working fine. Our Stackdriver Golden Signals dashboard gives us some peace of mind that things are working overall.

  1. Check the MeshPolicy in the ops clusters. Note that mTLS is PERMISSIVE, allowing both mTLS and plain-text traffic.
kubectl --context $OPS_GKE_1 get MeshPolicy -o json | jq '.items[].spec'
kubectl --context $OPS_GKE_2 get MeshPolicy -o json | jq '.items[].spec'
 
    `Output (do not copy)`
{
  "peers": [
    {
      "mtls": {
        "mode": "PERMISSIVE"
      }
    }
  ]
}

Istio is configured on all clusters using the Istio operator, which uses the IstioControlPlane custom resource (CR). We will configure mTLS in all clusters by updating the IstioControlPlane CR and pushing the change to the k8s-repo. Setting global > mTLS > enabled: true in the IstioControlPlane CR results in the following two changes to the Istio control plane (a sketch of the patch follows the list below):

  • MeshPolicy is set to turn on mTLS mesh wide for all Services running in all clusters.
  • A DestinationRule is created to allow ISTIO_MUTUAL traffic between Services running in all clusters.
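Before applying the patch in the next step, you can inspect it to see exactly what it changes:

cat $WORKDIR/asm/k8s_manifests/prod/app-mtls/mtls-kustomize-patch-replicated.yaml

# Conceptually, the patch sets something like the following on the
# IstioControlPlane CR (an illustrative sketch; the exact field layout
# may differ, so treat the real file above as authoritative):
#
#   spec:
#     values:
#       global:
#         mtls:
#           enabled: true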
  1. We will apply a kustomize patch to the IstioControlPlane CR to enable mTLS cluster wide. Copy the patch to the relevant directory for all clusters and add a kustomize patch.
cp -r $WORKDIR/asm/k8s_manifests/prod/app-mtls/mtls-kustomize-patch-replicated.yaml \
$WORKDIR/k8s-repo/$OPS_GKE_1_CLUSTER/istio-controlplane/mtls-kustomize-patch.yaml
cd $WORKDIR/k8s-repo/$OPS_GKE_1_CLUSTER/istio-controlplane
kustomize edit add patch mtls-kustomize-patch.yaml

cp -r $WORKDIR/asm/k8s_manifests/prod/app-mtls/mtls-kustomize-patch-replicated.yaml \
$WORKDIR/k8s-repo/$OPS_GKE_2_CLUSTER/istio-controlplane/mtls-kustomize-patch.yaml
cd $WORKDIR/k8s-repo/$OPS_GKE_2_CLUSTER/istio-controlplane
kustomize edit add patch mtls-kustomize-patch.yaml

cp -r $WORKDIR/asm/k8s_manifests/prod/app-mtls/mtls-kustomize-patch-shared.yaml \
$WORKDIR/k8s-repo/$DEV1_GKE_1_CLUSTER/istio-controlplane/mtls-kustomize-patch.yaml
cd $WORKDIR/k8s-repo/$DEV1_GKE_1_CLUSTER/istio-controlplane
kustomize edit add patch mtls-kustomize-patch.yaml

cp -r $WORKDIR/asm/k8s_manifests/prod/app-mtls/mtls-kustomize-patch-shared.yaml \
$WORKDIR/k8s-repo/$DEV1_GKE_2_CLUSTER/istio-controlplane/mtls-kustomize-patch.yaml
cd $WORKDIR/k8s-repo/$DEV1_GKE_2_CLUSTER/istio-controlplane
kustomize edit add patch mtls-kustomize-patch.yaml

cp -r $WORKDIR/asm/k8s_manifests/prod/app-mtls/mtls-kustomize-patch-shared.yaml \
$WORKDIR/k8s-repo/$DEV2_GKE_1_CLUSTER/istio-controlplane/mtls-kustomize-patch.yaml
cd $WORKDIR/k8s-repo/$DEV2_GKE_1_CLUSTER/istio-controlplane
kustomize edit add patch mtls-kustomize-patch.yaml

cp -r $WORKDIR/asm/k8s_manifests/prod/app-mtls/mtls-kustomize-patch-shared.yaml \
$WORKDIR/k8s-repo/$DEV2_GKE_2_CLUSTER/istio-controlplane/mtls-kustomize-patch.yaml
cd $WORKDIR/k8s-repo/$DEV2_GKE_2_CLUSTER/istio-controlplane
kustomize edit add patch mtls-kustomize-patch.yaml
 
  1. Commit to k8s-repo.
cd $WORKDIR/k8s-repo
git add . && git commit -am "turn mTLS on"
git push
 
  1. View the status of the Ops project Cloud Build in a previously opened tab or by clicking the following link:
echo "https://console.cloud.google.com/cloud-build/builds?project=${TF_VAR_ops_project_name}"

 

Verify mTLS

  1. Check the MeshPolicy once more in the ops clusters. Note that mTLS is no longer PERMISSIVE and now allows only mTLS traffic.
kubectl --context $OPS_GKE_1 get MeshPolicy -o json | jq .items[].spec
kubectl --context $OPS_GKE_2 get MeshPolicy -o json | jq .items[].spec
 

Output (do not copy):

{
  "peers": [
    {
      "mtls": {}
    }
  ]
}
  1. Describe the DestinationRule created by the Istio operator controller.
kubectl --context $OPS_GKE_1 get DestinationRule default -n istio-system -o json | jq '.spec'
kubectl --context $OPS_GKE_2 get DestinationRule default -n istio-system -o json | jq '.spec'

Output (do not copy):

{
  "host": "*.local",
  "trafficPolicy": {
    "tls": {
      "mode": "ISTIO_MUTUAL"
    }
  }
}

We can also see the move from HTTP to HTTPS in the logs.

We can expose this particular field from the logs in the UI by clicking on one log entry and then clicking on the value of the field we want to display; in our case, click on "http" next to "protocol":

d92e0c88cd5b2132.png

This results in a nice way to visualize the changeover:

ea3d0240fa6fed81.png

10. Canary Deployments

Objective: Rollout a new version of the frontend Service.

  • Rollout frontend-v2 (next production version) Service in one region
  • Use DestinationRules and VirtualServices to slowly steer traffic to frontend-v2
  • Verify GitOps deployment pipeline by inspecting series of commits to the k8s-repo

Copy-and-Paste Method Lab Instructions

A canary deployment is a progressive rollout of a new service. In a canary deployment, you send an increasing amount of traffic to the new version, while still sending the remaining percentage to the current version. A common pattern is to perform a canary analysis at each stage of traffic splitting, and compare the "golden signals" of the new version (latency, error rate, saturation) against a baseline. This helps prevent outages, and ensure the stability of the new "v2" service at every stage of traffic splitting.

In this section, you will learn how to use Cloud Build and Istio traffic policies to create a basic canary deployment for a new version of the frontend service.

First, we'll run the Canary pipeline in the DEV1 region (us-west1), and roll out frontend v2 on both clusters in that region. Second, we'll run the Canary pipeline in the DEV2 region (us-central1), and deploy v2 onto both clusters in that region. Running the pipeline on regions in order, rather than in parallel across all regions, helps avoid global outages caused by bad configuration, or by bugs in the v2 app itself.

Note: we'll manually trigger the Canary pipeline in both regions, but in production, you would use an automated trigger, for instance based on a new Docker image tag pushed to a registry.

  1. From Cloud Shell, define some env variables to simplify running the rest of the commands.
CANARY_DIR="$WORKDIR/asm/k8s_manifests/prod/app-canary/"
K8S_REPO="$WORKDIR/k8s-repo"
 
  1. Run the repo-setup.sh script to copy the baseline manifests into the k8s-repo.
$CANARY_DIR/repo-setup.sh 
 

The following manifests are copied:

  • frontend-v2 deployment
  • frontend-v1 patch (to include the "v1" label, and an image with a "/version" endpoint)
  • respy, a small pod that will print HTTP response distribution, and help us visualize the canary deployment in real time.
  • frontend Istio DestinationRule - splits the frontend Kubernetes Service into two subsets, v1 and v2, based on the "version" deployment label
  • frontend Istio VirtualService - routes 100% of traffic to frontend v1. This overrides the Kubernetes Service default round-robin behavior, which would immediately send 50% of all Dev1 regional traffic to frontend v2. (A sketch of the DestinationRule and VirtualService follows this list.)
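The DestinationRule and VirtualService sketched below are illustrative of what repo-setup.sh copies for the frontend; the actual manifests live in the app-canary directory and may differ in detail.

(illustrative example - do not copy)

apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: frontend
spec:
  host: frontend.frontend.svc.cluster.local
  subsets:
  - name: v1
    labels:
      version: v1
  - name: v2
    labels:
      version: v2
---
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: frontend
spec:
  hosts:
  - frontend.frontend.svc.cluster.local
  http:
  - route:
    - destination:
        host: frontend.frontend.svc.cluster.local
        subset: v1
      weight: 100
    - destination:
        host: frontend.frontend.svc.cluster.local
        subset: v2
      weight: 0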
  1. Commit changes to k8s_repo:
cd $K8S_REPO 
git add . && git commit -am "frontend canary setup"
git push
 
  1. View the status of the Ops project Cloud Build in a previously opened tab or by clicking the following link:
echo "https://console.cloud.google.com/cloud-build/builds?project=${TF_VAR_ops_project_name}" 
 
  1. Navigate to Cloud Build in the console for the OPS1 project. Wait for the Cloud Build pipeline to complete, then get pods in the frontend namespace in both DEV1 clusters. You should see the following:
watch -n 1 kubectl --context $DEV1_GKE_1 get pods -n frontend 
 

Output (do not copy)

NAME                           READY   STATUS    RESTARTS   AGE
frontend-578b5c5db6-h9567      2/2     Running   0          59m
frontend-v2-54b74fc75b-fbxhc   2/2     Running   0          2m26s
respy-5f4664b5f6-ff22r         2/2     Running   0          2m26s

We will use tmux to split our cloudshell window into 2 panes:

  • The bottom pane will be running the watch command to observe the HTTP response distribution for the frontend service.
  • The top pane will be running the actual canary pipeline script.
  1. Run the command to split the cloud shell window and execute the watch command in the bottom pane.
RESPY_POD=$(kubectl --context $DEV1_GKE_1 get pod \
-n frontend -l app=respy -o jsonpath='{..metadata.name}')
export TMUX_SESSION=$(tmux display-message -p '#S')
tmux split-window -d -t $TMUX_SESSION:0 -p33 \
-v "export KUBECONFIG=$WORKDIR/asm/gke/kubemesh; \
kubectl --context $DEV1_GKE_1 exec -n frontend -it \
$RESPY_POD -c respy /bin/sh -- -c 'watch -n 1 ./respy \
--u http://frontend:80/version --c 10 --n 500'; sleep 2"
 

Output (do not copy)

500 requests to http://frontend:80/version...
+----------+-------------------+
| RESPONSE | % OF 500 REQUESTS |
+----------+-------------------+
| v1       | 100.0%            |
|          |                   |
+----------+-------------------+
  1. Execute the canary pipeline on the Dev1 region. We provide a script that updates frontend-v2 traffic percentages in the VirtualService (updating weights to 20%, 50%, 80%, then 100%). Between updates, the script waits for the Cloud Build pipeline to complete. Run the canary deployment script for the Dev1 region. Note - this script takes about 10 minutes to complete.
K8S_REPO=$K8S_REPO CANARY_DIR=$CANARY_DIR \
OPS_DIR=$OPS_GKE_1_CLUSTER OPS_CONTEXT=$OPS_GKE_1 \
${CANARY_DIR}/auto-canary.sh
 

You can see traffic splitting in real time in the bottom window where you're running the respy command. For instance, at the 20% mark :

Output (do not copy)

500 requests to http://frontend:80/version...
+----------+-------------------+
| RESPONSE | % OF 500 REQUESTS |
+----------+-------------------+
| v1       | 79.4%             |
|          |                   |
| v2       | 20.6%             |
|          |                   |
+----------+-------------------+
  1. Once the Dev1 rollout of frontend-v2 completes, you should see a success message at the end of the script:
     Output (do not copy) 
    
✅ 100% successfully deployed
🌈 frontend-v2 Canary Complete for gke-asm-1-r1-prod
  1. And all frontend traffic from a Dev1 pod should be going to frontend-v2:
     Output (do not copy) 
    
500 requests to http://frontend:80/version...
+----------+-------------------+
| RESPONSE | % OF 500 REQUESTS |
+----------+-------------------+
| v2       | 100.0%            |
|          |                   |
+----------+-------------------+
  1. Close the split pane.
tmux respawn-pane -t ${TMUX_SESSION}:0.1 -k 'exit'
 
  1. Navigate to Cloud Source Repos at the link generated.
echo https://source.developers.google.com/p/$TF_VAR_ops_project_name/r/k8s-repo

You should see a separate commit for each traffic percentage, with the most recent commit at the top of the list:

b87b85f52fd2ff0f.png

Now, you will repeat the same process for the Dev2 region. Note that the Dev2 region is still "locked" on v1. This is because in the baseline repo_setup script, we pushed a VirtualService to explicitly send all traffic to v1. This way, we were able to safely do a regional canary on Dev1, and make sure it ran successfully before rolling out the new version globally.

  1. Run the command to split the cloud shell window and execute the watch command in the bottom pane.
RESPY_POD=$(kubectl --context $DEV2_GKE_1 get pod \
-n frontend -l app=respy -o jsonpath='{..metadata.name}')
export TMUX_SESSION=$(tmux display-message -p '#S')
tmux split-window -d -t $TMUX_SESSION:0 -p33 \
-v "export KUBECONFIG=$WORKDIR/asm/gke/kubemesh; \
kubectl --context $DEV2_GKE_1 exec -n frontend -it \
$RESPY_POD -c respy /bin/sh -- -c 'watch -n 1 ./respy \
--u http://frontend:80/version --c 10 --n 500'; sleep 2"
 

Output (do not copy)

500 requests to http://frontend:80/version...
+----------+-------------------+
| RESPONSE | % OF 500 REQUESTS |
+----------+-------------------+
| v1       | 100.0%            |
|          |                   |
+----------+-------------------+
  1. Execute the canary pipeline on the Dev2 region. We provide a script that updates frontend-v2 traffic percentages in the VirtualService (updating weights to 20%, 50%, 80%, then 100%). Between updates, the script waits for the Cloud Build pipeline to complete. Run the canary deployment script for the Dev2 region. Note - this script takes about 10 minutes to complete.
K8S_REPO=$K8S_REPO CANARY_DIR=$CANARY_DIR \
OPS_DIR=$OPS_GKE_2_CLUSTER OPS_CONTEXT=$OPS_GKE_2 \
${CANARY_DIR}/auto-canary.sh
 

Output (do not copy)

500 requests to http://frontend:80/version...
+----------+-------------------+
| RESPONSE | % OF 500 REQUESTS |
+----------+-------------------+
| v1       | 100.0%            |
|          |                   |
+----------+-------------------+
  1. From the Respy pod in Dev2, watch traffic from Dev2 pods move progressively from frontend v1 to v2. Once the script completes, you should see:

Output (do not copy)

500 requests to http://frontend:80/version...
+----------+-------------------+
| RESPONSE | % OF 500 REQUESTS |
+----------+-------------------+
| v2       | 100.0%            |
|          |                   |
+----------+-------------------+
  1. Close the split pane.
tmux respawn-pane -t ${TMUX_SESSION}:0.1 -k 'exit'

This section introduced how to use Istio for regional canary deployments. In production, instead of a manual script, you might automatically trigger this canary script as a Cloud Build pipeline, using a trigger such as a new tagged image pushed to a container registry. You would also want to add canary analysis in between each step, analyzing v2's latency and error rate against a predefined safety threshold, before sending over more traffic.

11. Authorization Policies

Objective: Set up RBAC between microservices (AuthZ).

  • Create AuthorizationPolicy to DENY access to a microservice
  • Create AuthorizationPolicy to ALLOW specific access to a microservice

Copy-and-Paste Method Lab Instructions

Unlike a monolithic application that might be running in one place, globally-distributed microservices apps make calls across network boundaries. This means more points of entry into your applications, and more opportunities for malicious attacks. And because Kubernetes pods have transient IPs, traditional IP-based firewall rules are no longer adequate to secure access between workloads. In a microservices architecture, a new approach to security is needed. Building on Kubernetes security building blocks like service accounts, Istio provides a flexible set of security policies for your applications.

Istio policies cover both authentication and authorization. Authentication verifies identity (is this server who they say they are?), and authorization verifies permissions (is this client allowed to do that?). We covered Istio authentication in the mutual TLS section in Module 1 (MeshPolicy). In this section, we will learn how to use Istio authorization policies to control access to one of our application workloads, currencyservice.

First, we'll deploy an AuthorizationPolicy across all 4 Dev clusters, closing off all access to currencyservice, and triggering an error in the frontend. Then, we will allow only the frontend service to access currencyservice.

  1. Inspect the contents of currency-deny-all.yaml. This policy uses Deployment label selectors to restrict access to the currencyservice. Notice that there are no rules in the spec - this means the policy will deny all access to the selected service.
cat $WORKDIR/asm/k8s_manifests/prod/app-authorization/currency-deny-all.yaml
 

Output (do not copy)

apiVersion: "security.istio.io/v1beta1"
kind: "AuthorizationPolicy"
metadata:
  name: "currency-policy"
  namespace: currency
spec:
  selector:
    matchLabels:
      app: currencyservice
  1. Copy the currency policy into k8s-repo, for the ops clusters in both regions.
cp $WORKDIR/asm/k8s_manifests/prod/app-authorization/currency-deny-all.yaml \
$WORKDIR/k8s-repo/$OPS_GKE_1_CLUSTER/app-authorization/currency-policy.yaml
cd $WORKDIR/k8s-repo/$OPS_GKE_1_CLUSTER/app-authorization
kustomize edit add resource currency-policy.yaml
cp $WORKDIR/asm/k8s_manifests/prod/app-authorization/currency-deny-all.yaml \
$WORKDIR/k8s-repo/$OPS_GKE_2_CLUSTER/app-authorization/currency-policy.yaml
cd $WORKDIR/k8s-repo/$OPS_GKE_2_CLUSTER/app-authorization
kustomize edit add resource currency-policy.yaml
  1. Push changes.
cd $WORKDIR/k8s-repo 
git add . && git commit -am "AuthorizationPolicy - currency: deny all"
git push 
  1. Check the status of the Ops project Cloud Build in a previously opened tab or by clicking the following link:
echo https://console.cloud.google.com/cloud-build/builds?project=$TF_VAR_ops_project_name 
 
  1. After the build finishes successfully, try to reach the hipstershop frontend in a browser on the following link:
echo "https://frontend.endpoints.$TF_VAR_ops_project_name.cloud.goog" 
 

You should see an Authorization error from currencyservice:

f120f3d30d6ee9f.png

  1. Let's investigate how the currency service is enforcing this AuthorizationPolicy. First, enable trace-level logs on the Envoy proxy for one of the currency pods, since blocked authorization calls aren't logged by default.
CURRENCY_POD=$(kubectl --context $DEV1_GKE_2 get pod -n currency | grep currency| awk '{ print $1 }')
kubectl --context $DEV1_GKE_2 exec -it $CURRENCY_POD -n \
currency -c istio-proxy -- curl -X POST \
"http://localhost:15000/logging?level=trace"
 
  1. Get the RBAC (authorization) logs from the currency service's sidecar proxy. You should see an "enforced denied" message, indicating that the currencyservice is set to block all inbound requests.
kubectl --context $DEV1_GKE_2 logs -n currency $CURRENCY_POD \
-c istio-proxy | grep -m 3 rbac
 

Output (do not copy)

[Envoy (Epoch 0)] [2020-01-30 00:45:50.815][22][debug][rbac] [external/envoy/source/extensions/filters/http/rbac/rbac_filter.cc:67] checking request: remoteAddress: 10.16.5.15:37310, localAddress: 10.16.3.8:7000, ssl: uriSanPeerCertificate: spiffe://cluster.local/ns/frontend/sa/frontend, subjectPeerCertificate: , headers: ':method', 'POST'
[Envoy (Epoch 0)] [2020-01-30 00:45:50.815][22][debug][rbac] [external/envoy/source/extensions/filters/http/rbac/rbac_filter.cc:118] enforced denied
[Envoy (Epoch 0)] [2020-01-30 00:45:50.815][22][debug][http] [external/envoy/source/common/http/conn_manager_impl.cc:1354] [C115][S17310331589050212978] Sending local reply with details rbac_access_denied
  1. Now, let's allow the frontend – but not the other backend services – to access currencyservice. Open currency-allow-frontend.yaml and inspect its contents. Note that we've added the following rule:
cat ${WORKDIR}/asm/k8s_manifests/prod/app-authorization/currency-allow-frontend.yaml

Output (do not copy)

rules:
 - from:
   - source:
       principals: ["cluster.local/ns/frontend/sa/frontend"]

Here, we are whitelisting a specific source.principal (client) to access the currency service. This source.principal is defined by its Kubernetes Service Account. In this case, the service account we are whitelisting is the frontend service account in the frontend namespace.

Note: when using Kubernetes Service Accounts in Istio AuthorizationPolicies, you must first enable cluster-wide mutual TLS, as we did in Module 1. This ensures that the client's service account identity is carried in the certificate it presents with each request.

  1. Copy over the updated currency policy
cp $WORKDIR/asm/k8s_manifests/prod/app-authorization/currency-allow-frontend.yaml \
$WORKDIR/k8s-repo/$OPS_GKE_1_CLUSTER/app-authorization/currency-policy.yaml
cp $WORKDIR/asm/k8s_manifests/prod/app-authorization/currency-allow-frontend.yaml \
$WORKDIR/k8s-repo/$OPS_GKE_2_CLUSTER/app-authorization/currency-policy.yaml
 
  1. Push changes.
cd $WORKDIR/k8s-repo
git add . && git commit -am "AuthorizationPolicy - currency: allow frontend"
git push
 
  1. View the status of the Ops project Cloud Build in a previously opened tab or by clicking the following link:
echo https://console.cloud.google.com/cloud-build/builds?project=$TF_VAR_ops_project_name
  1. After the build finishes successfully, open the Hipstershop frontend again. This time you should see no errors on the homepage - this is because the frontend is explicitly allowed to access the currency service.
  2. Now, try to execute a checkout, by adding items to your cart and clicking "place order." This time, you should see a price-conversion error from currency service - this is because we have only whitelisted the frontend, so the checkoutservice is still unable to access currencyservice.

7e30813d693675fe.png

  1. Finally, let's allow the checkout service access to currency, by adding another rule to our currencyservice AuthorizationPolicy. Note that we are only opening up currency access to the two services that need to access it - frontend and checkout. The other backends will still be blocked.
  2. Open currency-allow-frontend-checkout.yaml and inspect its contents. Notice that the list of rules functions as a logical OR - currency will accept only requests from workloads with either of these two service accounts.
cat ${WORKDIR}/asm/k8s_manifests/prod/app-authorization/currency-allow-frontend-checkout.yaml
 

Output (do not copy)

apiVersion: "security.istio.io/v1beta1"
kind: "AuthorizationPolicy"
metadata:
  name: "currency-policy"
  namespace: currency
spec:
  selector:
    matchLabels:
      app: currencyservice
  rules:
  - from:
    - source:
        principals: ["cluster.local/ns/frontend/sa/frontend"]
  - from:
    - source:
        principals: ["cluster.local/ns/checkout/sa/checkout"]
  1. Copy the final authorization policy to k8s-repo.
cp $WORKDIR/asm/k8s_manifests/prod/app-authorization/currency-allow-frontend-checkout.yaml \
$WORKDIR/k8s-repo/$OPS_GKE_1_CLUSTER/app-authorization/currency-policy.yaml
cp $WORKDIR/asm/k8s_manifests/prod/app-authorization/currency-allow-frontend-checkout.yaml \
$WORKDIR/k8s-repo/$OPS_GKE_2_CLUSTER/app-authorization/currency-policy.yaml
 
  1. Push changes
cd $WORKDIR/k8s-repo 
git add . && git commit -am "AuthorizationPolicy - currency: allow frontend and checkout"
git push
 
  1. View the status of the Ops project Cloud Build in a previously opened tab or by clicking the following link:
echo https://console.cloud.google.com/cloud-build/builds?project=$TF_VAR_ops_project_name
 
  1. After the build finishes successfully, try to execute a checkout - it should work successfully.

This section walked through how to use Istio Authorization Policies to enforce granular access control at the per-service level. In production, you might create one AuthorizationPolicy per service, and (for instance) use an allow-all policy to let all workloads in the same namespace access each other.

12. Infrastructure Scaling

Objective: Scale infrastructure by adding new region, project, and clusters.

  • Clone the infrastructure repo
  • Update the terraform files to create new resources
  • 2 subnets in the new region (one for the ops project and one for the new project)
  • New ops cluster in new region (in the new subnet)
  • New Istio control plane for the new region
  • 2 apps clusters in the new project in the new region
  • Commit to infrastructure repo
  • Verify installation

Copy-and-Paste Method Lab Instructions

There are a number of ways to scale a platform. You can add more compute by adding nodes to existing clusters. You can add more clusters in a region. Or you can add more regions to the platform. The decision on what aspect of the platform to scale depends upon the requirements. For example, if you have clusters in all three zones in a region, perhaps adding more nodes (or node pools) to existing clusters may suffice. However, if you have clusters in two of three zones in a single region, then adding a new cluster in the third zone gives you scaling and an additional fault domain (i.e. a new zone). Another reason for adding a new cluster in a region might be the need to create a single tenant cluster - for regulatory or compliance reasons (for example PCI, or a database cluster that houses PII information). As your business and services expand, adding new regions becomes inevitable to provide services closer to the clients.

The current platform consists of two regions and clusters in two zones per region. You can think of scaling the platform in two ways:

  • Vertically - within each region by adding more compute. This is done either by adding more nodes (or node pools) to existing clusters or by adding new clusters within the region. This is done via the infrastructure repo. The simplest path is adding nodes to existing clusters. No additional configuration is required. Adding new clusters may require additional subnets (and secondary ranges), adding appropriate firewall rules, adding the new clusters to the regional ASM/Istio service mesh control plane and deploying application resources to the new clusters.
  • Horizontally - by adding more regions. The current platform gives you a regional template. It consists of a regional ops cluster, where the ASM/Istio control plane resides, and two (or more) zonal application clusters where application resources are deployed.

In this workshop, you scale the platform "horizontally", since that path encompasses the vertical use case steps as well. To scale the platform horizontally by adding a new region (r3), the following resources need to be added:

  1. Subnets in the host project shared VPC in region r3 for the new ops and application clusters.
  2. A regional ops cluster in region r3 where the ASM/Istio control plane resides.
  3. Two zonal application clusters in two zones on region r3.
  4. Updates to the k8s-repo:
  • Deploy ASM/Istio control plane resources to the ops cluster in region r3.
  • Deploy ASM/Istio shared control plane resources to the app clusters in region r3.
  5. While you don't need to create a new project, the steps in the workshop demonstrate adding a new project dev3 to cover the use case of adding a new team to the platform.

The infrastructure repo is used to add the new resources listed above.

  1. In Cloud Shell, navigate to WORKDIR and clone the infrastructure repo.
mkdir -p $WORKDIR/infra-repo
cd $WORKDIR/infra-repo
git init && git remote add origin https://source.developers.google.com/p/${TF_ADMIN}/r/infrastructure
git config --local user.email ${MY_USER}
git config --local user.name "infra repo user"
git config --local credential.'https://source.developers.google.com'.helper gcloud.sh
git pull origin master
  1. Clone the workshop source repo add-proj branch into the add-proj-repo directory.
cd $WORKDIR
git clone https://github.com/GoogleCloudPlatform/anthos-service-mesh-workshop.git add-proj-repo -b add-proj

 
  1. Copy files from the add-proj branch in the source workshop repo. The add-proj branch contains the changes for this section.
cp -r $WORKDIR/add-proj-repo/infrastructure/* $WORKDIR/infra-repo/
 
  1. Replace the infrastructure directory in the add-proj repo directory with a symlink to the infra-repo directory to allow the scripts on the branch to run.
rm -rf $WORKDIR/add-proj-repo/infrastructure
ln -s $WORKDIR/infra-repo $WORKDIR/add-proj-repo/infrastructure
 
  1. Run the add-project.sh script to copy the shared states and vars to the new project directory structure.
$WORKDIR/add-proj-repo/scripts/add-project.sh app3 $WORKDIR/asm $WORKDIR/infra-repo
  1. Commit and push changes to create new project
cd $WORKDIR/infra-repo
git add .
git status
git commit -m "add new project" && git push origin master
 

  1. The commit triggers the infrastructure repo to deploy the infrastructure with the new resources. View the Cloud Build progress by clicking on the output of the following link and navigating to the latest build at the top.
echo "https://console.cloud.google.com/cloud-build/builds?project=${TF_ADMIN}"
 

The last step of the infrastructure Cloud Build creates new Kubernetes resources in the k8s-repo. This triggers the Cloud Build in the k8s-repo (in the ops project). The new Kubernetes resources are for the three new clusters added in the previous step. ASM/Istio control plane and shared control plane resources are added to the new clusters with the k8s-repo Cloud Build.

  1. After the infrastructure Cloud Build successfully finishes, navigate to the k8s-repo latest Cloud Build run by clicking on the following output link.
echo "https://console.cloud.google.com/cloud-build/builds?project=${TF_VAR_ops_project_name}"
 
  1. Run the following script to add the new clusters to the vars and kubeconfig file.
$WORKDIR/add-proj-repo/scripts/setup-gke-vars-kubeconfig-add-proj.sh $WORKDIR/asm
 
  1. Change the KUBECONFIG variable to point to the new kubeconfig file.
source $WORKDIR/asm/vars/vars.sh
export KUBECONFIG=$WORKDIR/asm/gke/kubemesh
 
  1. List your cluster contexts. You should see nine clusters.
kubectl config view -ojson | jq -r '.clusters[].name'
 
    `Output (do not copy)`
gke_user001-200204-05-dev1-49tqc4_us-west1-a_gke-1-apps-r1a-prod
gke_user001-200204-05-dev1-49tqc4_us-west1-b_gke-2-apps-r1b-prod
gke_user001-200204-05-dev2-49tqc4_us-central1-a_gke-3-apps-r2a-prod
gke_user001-200204-05-dev2-49tqc4_us-central1-b_gke-4-apps-r2b-prod
gke_user001-200204-05-dev3-49tqc4_us-east1-b_gke-5-apps-r3b-prod
gke_user001-200204-05-dev3-49tqc4_us-east1-c_gke-6-apps-r3c-prod
gke_user001-200204-05-ops-49tqc4_us-central1_gke-asm-2-r2-prod
gke_user001-200204-05-ops-49tqc4_us-east1_gke-asm-3-r3-prod
gke_user001-200204-05-ops-49tqc4_us-west1_gke-asm-1-r1-prod

Verify Istio Installation

  1. Ensure Istio is installed on the new ops cluster by checking all pods are running and jobs have completed.
kubectl --context $OPS_GKE_3 get pods -n istio-system
 
    `Output (do not copy)`
NAME                                      READY   STATUS    RESTARTS   AGE
grafana-5f798469fd-72g6w                  1/1     Running   0          5h12m
istio-citadel-7d8595845-hmmvj             1/1     Running   0          5h12m
istio-egressgateway-779b87c464-rw8bg      1/1     Running   0          5h12m
istio-galley-844ddfc788-zzpkl             2/2     Running   0          5h12m
istio-ingressgateway-59ccd6574b-xfj98     1/1     Running   0          5h12m
istio-pilot-7c8989f5cf-5plsg              2/2     Running   0          5h12m
istio-policy-6674bc7678-2shrk             2/2     Running   3          5h12m
istio-sidecar-injector-7795bb5888-kbl5p   1/1     Running   0          5h12m
istio-telemetry-5fd7cbbb47-c4q7b          2/2     Running   2          5h12m
istio-tracing-cd67ddf8-2qwkd              1/1     Running   0          5h12m
istiocoredns-5f7546c6f4-qhj9k             2/2     Running   0          5h12m
kiali-7964898d8c-l74ww                    1/1     Running   0          5h12m
prometheus-586d4445c7-x9ln6               1/1     Running   0          5h12m
  1. Ensure Istio is installed on both dev3 clusters. Only Citadel, the sidecar injector, and coredns run in the dev3 clusters. They share the Istio control plane running in the ops-3 cluster.
kubectl --context $DEV3_GKE_1 get pods -n istio-system
kubectl --context $DEV3_GKE_2 get pods -n istio-system
 
    `Output (do not copy)`
NAME                                      READY   STATUS    RESTARTS   AGE
istio-citadel-568747d88-4lj9b             1/1     Running   0          66s
istio-sidecar-injector-759bf6b4bc-ks5br   1/1     Running   0          66s
istiocoredns-5f7546c6f4-qbsqm             2/2     Running   0          78s

Verify service discovery for shared control planes

  1. Verify the secrets are deployed in all ops clusters for all six application clusters.
kubectl --context $OPS_GKE_1 get secrets -l istio/multiCluster=true -n istio-system
kubectl --context $OPS_GKE_2 get secrets -l istio/multiCluster=true -n istio-system
kubectl --context $OPS_GKE_3 get secrets -l istio/multiCluster=true -n istio-system
 
    `Output (do not copy)`
NAME                  TYPE     DATA   AGE
gke-1-apps-r1a-prod   Opaque   1      14h
gke-2-apps-r1b-prod   Opaque   1      14h
gke-3-apps-r2a-prod   Opaque   1      14h
gke-4-apps-r2b-prod   Opaque   1      14h
gke-5-apps-r3b-prod   Opaque   1      5h12m
gke-6-apps-r3c-prod   Opaque   1      5h12m
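Each of these secrets holds a kubeconfig that the shared control plane uses to reach the corresponding app cluster's API server for service and endpoint discovery. You can peek at one of them, for example the secret for the new gke-5-apps-r3b-prod cluster (the data is base64-encoded, so this just confirms the entry exists):

kubectl --context $OPS_GKE_1 get secret gke-5-apps-r3b-prod \
  -n istio-system -o yaml | head -20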

13. Circuit Breaking

Objective: Implement a Circuit Breaker for the shipping Service.

  • Create a DestinationRule for the shipping Service to implement a circuit breaker
  • Use fortio (a load gen utility) to validate circuit breaker for the shipping Service by force tripping the circuit

Fast Track Script Lab Instructions

Fast Track Script Lab is coming soon!!

Copy-and-Paste Method Lab Instructions

Now that we've learned some basic monitoring and troubleshooting strategies for Istio-enabled services, let's look at how Istio helps you improve the resilience of your services, reducing the amount of troubleshooting you'll have to do in the first place.

A microservices architecture introduces the risk of cascading failures, where the failure of one service can propagate to its dependencies, and the dependencies of those dependencies, causing a "ripple effect" outage that can potentially affect end-users. Istio provides a Circuit Breaker traffic policy to help you isolate services, protecting downstream (client-side) services from waiting on failing services, and protecting upstream (server-side) services from a sudden flood of downstream traffic when they do come back online. Overall, using Circuit Breakers can help you avoid all your services failing their SLOs because of one backend service that is hanging.

The Circuit Breaker pattern is named for an electrical switch that can "trip" when too much electricity flows through, protecting devices from overload. In an Istio setup, this means that Envoy is the circuit breaker, keeping track of the number of pending requests for a service. In this default closed state, requests flow through Envoy uninterrupted.

But when the number of pending requests exceeds your defined threshold, the circuit breaker trips (opens), and Envoy immediately returns an error. This allows the server to fail fast for the client, and prevents the server application code from receiving the client's request when overloaded.

Then, after your defined timeout, Envoy moves to a half open state, where the server can start receiving requests again in a probationary way, and if it can successfully respond to requests, the circuit breaker closes again, and requests to the server begin to flow again.

This diagram summarizes the Istio circuit breaker pattern. The blue rectangles represent Envoy, the blue-filled circle represents the client, and the white-filled circles represent the server container:

2127a0a172ff4802.png

You can define Circuit Breaker policies using Istio DestinationRules. In this section, we'll apply the following policy to enforce a circuit breaker for the shipping service:

Output (do not copy)

apiVersion: "networking.istio.io/v1alpha3"
kind: "DestinationRule"
metadata:
  name: "shippingservice-shipping-destrule"
  namespace: "shipping"
spec:
  host: "shippingservice.shipping.svc.cluster.local"
  trafficPolicy:
    tls:
      mode: ISTIO_MUTUAL
    connectionPool:
      tcp:
        maxConnections: 1
      http:
        http1MaxPendingRequests: 1
        maxRequestsPerConnection: 1
    outlierDetection:
      consecutiveErrors: 1
      interval: 1s
      baseEjectionTime: 10s
      maxEjectionPercent: 100

There are two DestinationRule fields to note here. connectionPool defines the number of connections this service will allow. The outlierDetection field is where we configure how Envoy determines the threshold at which to open the circuit breaker. Here, every second (interval), Envoy counts the number of errors it received from the server container. If the count exceeds the consecutiveErrors threshold, the Envoy circuit breaker opens, and 100% of shippingservice pods are shielded from new client requests for 10 seconds. Once the Envoy circuit breaker is open (i.e. active), clients receive 503 (Service Unavailable) errors. Let's see this in action.

  1. Set environment variables for the k8s-repo and asm dir to simplify commands.
export K8S_REPO="${WORKDIR}/k8s-repo"
export ASM="${WORKDIR}/asm" 
 
  1. Update the k8s-repo
cd $WORKDIR/k8s-repo
git pull
cd $WORKDIR
  1. Update the shipping service DestinationRule on both Ops clusters.
cp $ASM/k8s_manifests/prod/istio-networking/app-shipping-circuit-breaker.yaml ${K8S_REPO}/${OPS_GKE_1_CLUSTER}/istio-networking/app-shipping-circuit-breaker.yaml
cp $ASM/k8s_manifests/prod/istio-networking/app-shipping-circuit-breaker.yaml ${K8S_REPO}/${OPS_GKE_2_CLUSTER}/istio-networking/app-shipping-circuit-breaker.yaml

cd ${K8S_REPO}/${OPS_GKE_1_CLUSTER}/istio-networking/; kustomize edit add resource app-shipping-circuit-breaker.yaml
cd ${K8S_REPO}/${OPS_GKE_2_CLUSTER}/istio-networking/; kustomize edit add resource app-shipping-circuit-breaker.yaml
 
  1. Copy a Fortio load generator pod into the GKE_1 cluster in the Dev1 region. This is the client pod we'll use to "trip" the circuit breaker for shippingservice.
cp $ASM/k8s_manifests/prod/app/deployments/app-fortio.yaml ${K8S_REPO}/${DEV1_GKE_1_CLUSTER}/app/deployments/
cd ${K8S_REPO}/${DEV1_GKE_1_CLUSTER}/app/deployments; kustomize edit add resource app-fortio.yaml
 
  1. Commit changes.
cd $K8S_REPO 
git add . && git commit -am "Circuit Breaker: shippingservice"
git push
cd $ASM
 
  1. Wait for Cloud Build to complete.
  2. Back in Cloud Shell, use the fortio pod to send gRPC traffic to shippingservice with 1 concurrent connection, 1000 requests total - this will not trip the circuit breaker, because we have not exceeded the connectionPool settings yet.
FORTIO_POD=$(kubectl --context ${DEV1_GKE_1} get pod -n shipping | grep fortio | awk '{ print $1 }')

kubectl --context ${DEV1_GKE_1} exec -it $FORTIO_POD -n shipping -c fortio /usr/bin/fortio -- load -grpc -c 1 -n 1000 -qps 0 shippingservice.shipping.svc.cluster.local:50051 
 

Output (do not copy)

Health SERVING : 1000
All done 1000 calls (plus 0 warmup) 4.968 ms avg, 201.2 qps
  1. Now run fortio again, increasing the number of concurrent connections to 2, but keeping the total number of requests constant. We should see up to two-thirds of the requests return an "overflow" error, because the circuit breaker has been tripped: in the policy we defined, only 1 concurrent connection is allowed in a 1-second interval.
kubectl --context ${DEV1_GKE_1} exec -it $FORTIO_POD -n shipping -c fortio /usr/bin/fortio -- load -grpc -c 2 -n 1000 -qps 0 shippingservice.shipping.svc.cluster.local:50051 
 

Output (do not copy)

18:46:16 W grpcrunner.go:107> Error making grpc call: rpc error: code = Unavailable desc = upstream connect error or disconnect/reset before headers. reset reason: overflow
...

Health ERROR : 625
Health SERVING : 375
All done 1000 calls (plus 0 warmup) 12.118 ms avg, 96.1 qps
  1. Envoy keeps track of the number of requests it dropped while the circuit breaker was active, with the upstream_rq_pending_overflow metric. Let's find this in the fortio pod:
kubectl --context ${DEV1_GKE_1} exec -it $FORTIO_POD -n shipping -c istio-proxy  -- sh -c 'curl localhost:15000/stats' | grep shipping | grep pending
 

Output (do not copy)

cluster.outbound|50051||shippingservice.shipping.svc.cluster.local.circuit_breakers.default.rq_pending_open: 0
cluster.outbound|50051||shippingservice.shipping.svc.cluster.local.circuit_breakers.high.rq_pending_open: 0
cluster.outbound|50051||shippingservice.shipping.svc.cluster.local.upstream_rq_pending_active: 0
cluster.outbound|50051||shippingservice.shipping.svc.cluster.local.upstream_rq_pending_failure_eject: 9
cluster.outbound|50051||shippingservice.shipping.svc.cluster.local.upstream_rq_pending_overflow: 565
cluster.outbound|50051||shippingservice.shipping.svc.cluster.local.upstream_rq_pending_total: 1433
  1. Clean up by removing the circuit breaker policy from both regions.
kubectl --context ${OPS_GKE_1} delete destinationrule shippingservice-shipping-destrule -n shipping 
rm ${K8S_REPO}/${OPS_GKE_1_CLUSTER}/istio-networking/app-shipping-circuit-breaker.yaml
cd ${K8S_REPO}/${OPS_GKE_1_CLUSTER}/istio-networking/; kustomize edit remove resource app-shipping-circuit-breaker.yaml
 

kubectl --context ${OPS_GKE_2} delete destinationrule shippingservice-shipping-destrule -n shipping 
rm ${K8S_REPO}/${OPS_GKE_2_CLUSTER}/istio-networking/app-shipping-circuit-breaker.yaml
cd ${K8S_REPO}/${OPS_GKE_2_CLUSTER}/istio-networking/; kustomize edit remove resource app-shipping-circuit-breaker.yaml
cd $K8S_REPO; git add .; git commit -m "Circuit Breaker: cleanup"; git push origin master
 

This section demonstrated how to set up a single circuit breaker policy for a service. A best practice is to set up a circuit breaker for any upstream (backend) service that has the potential to hang. By applying Istio circuit breaker policies, you help isolate your microservices, build fault tolerance into your architecture, and reduce the risk of cascading failures under high load.

14. Fault Injection

Objective: Test the resilience of the recommendation Service by introducing delays (before it is pushed to production).

  • Create a VirtualService for the recommendation Service to introduce a 5s delay
  • Test the delay using fortio load generator
  • Remove the delay in the VirtualService and validate

Fast Track Script Lab Instructions

Fast Track Script Lab is coming soon!!

Copy-and-Paste Method Lab Instructions

Adding circuit breaker policies to your services is one way to build resilience against failing services in production. But circuit breaking results in faults — potentially user-facing errors — which is not ideal. To get ahead of these error cases, and better predict how your downstream services might respond when backends do return errors, you can adopt chaos testing in a staging environment. Chaos testing is the practice of deliberately breaking your services in order to analyze weak points in the system and improve fault tolerance. You can also use chaos testing to identify ways to mitigate user-facing errors when backends fail - for instance, by displaying a cached result in a frontend.

Using Istio for fault injection is helpful because you can use your production release images, and add the fault at the network layer, instead of modifying source code. In production, you might use a full-fledged chaos testing tool to test resilience at the Kubernetes/compute layer in addition to the network layer.

You can use Istio for chaos testing by applying a VirtualService with the "fault" field. Istio supports two kinds of faults: delay faults (inject a timeout) and abort faults (inject HTTP errors). In this example, we'll inject a 5-second delay fault into the recommendations service. But this time instead of using a circuit breaker to "fail fast" against this hanging service, we will force downstream services to endure the full timeout.
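For comparison, an abort fault (the other fault type mentioned above) would look roughly like the following illustrative sketch, returning HTTP 500 for half of the requests instead of delaying them; we won't apply it in this lab.

(illustrative example - do not copy)

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: recommendation-abort-fault
spec:
  hosts:
  - recommendationservice.recommendation.svc.cluster.local
  http:
  - route:
    - destination:
        host: recommendationservice.recommendation.svc.cluster.local
    fault:
      abort:
        percentage:
          value: 50
        httpStatus: 500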

  1. Navigate into the fault injection directory.
export K8S_REPO="${WORKDIR}/k8s-repo"
export ASM="${WORKDIR}/asm/" 
cd $ASM
 
  1. Open k8s_manifests/prod/istio-networking/app-recommendation-vs-fault.yaml to inspect its contents. Notice that Istio has an option to inject the fault into a percentage of the requests - here, we'll introduce a timeout into all recommendationservice requests.

Output (do not copy)

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: recommendation-delay-fault
spec:
  hosts:
  - recommendationservice.recommendation.svc.cluster.local
  http:
  - route:
    - destination:
        host: recommendationservice.recommendation.svc.cluster.local
    fault:
      delay:
        percentage:
          value: 100
        fixedDelay: 5s
  1. Copy the VirtualService into k8s_repo. We'll inject the fault globally, across both regions.
cp $ASM/k8s_manifests/prod/istio-networking/app-recommendation-vs-fault.yaml ${K8S_REPO}/${OPS_GKE_1_CLUSTER}/istio-networking/app-recommendation-vs-fault.yaml
cd ${K8S_REPO}/${OPS_GKE_1_CLUSTER}/istio-networking/; kustomize edit add resource app-recommendation-vs-fault.yaml

cp $ASM/k8s_manifests/prod/istio-networking/app-recommendation-vs-fault.yaml ${K8S_REPO}/${OPS_GKE_2_CLUSTER}/istio-networking/app-recommendation-vs-fault.yaml
cd ${K8S_REPO}/${OPS_GKE_2_CLUSTER}/istio-networking/; kustomize edit add resource app-recommendation-vs-fault.yaml
 
  1. Push changes
cd $K8S_REPO 
git add . && git commit -am "Fault Injection: recommendationservice"
git push
cd $ASM
 
  1. Wait for Cloud Build to complete.
  2. Exec into the fortio pod deployed in the circuit breaker section, and send some traffic to recommendationservice.
FORTIO_POD=$(kubectl --context ${DEV1_GKE_1} get pod -n shipping | grep fortio | awk '{ print $1 }')

kubectl --context ${DEV1_GKE_1} exec -it $FORTIO_POD -n shipping -c fortio /usr/bin/fortio -- load -grpc -c 100 -n 100 -qps 0 recommendationservice.recommendation.svc.cluster.local:8080
 
    Once the fortio command is complete, you should see responses averaging 5s:

Output (do not copy)

Ended after 5.181367359s : 100 calls. qps=19.3
Aggregated Function Time : count 100 avg 5.0996506 +/- 0.03831 min 5.040237641 max 5.177559818 sum 509.965055
  1. Another way to see the fault we injected in action is to open the frontend in a web browser and click on any product. A product page should take 5 extra seconds to load, since it fetches the recommendations that are displayed at the bottom of the page.
  2. Clean up by removing the fault injection service from both Ops clusters.
kubectl --context ${OPS_GKE_1} delete virtualservice recommendation-delay-fault -n recommendation 
rm ${K8S_REPO}/${OPS_GKE_1_CLUSTER}/istio-networking/app-recommendation-vs-fault.yaml
cd ${K8S_REPO}/${OPS_GKE_1_CLUSTER}/istio-networking/; kustomize edit remove resource app-recommendation-vs-fault.yaml

kubectl --context ${OPS_GKE_2} delete virtualservice recommendation-delay-fault -n recommendation 
rm ${K8S_REPO}/${OPS_GKE_2_CLUSTER}/istio-networking/app-recommendation-vs-fault.yaml
cd ${K8S_REPO}/${OPS_GKE_2_CLUSTER}/istio-networking/; kustomize edit remove resource app-recommendation-vs-fault.yaml
 
  1. Push changes:
cd $K8S_REPO 
git add . && git commit -am "Fault Injection cleanup / restore"
git push
cd $ASM
 

15. Monitoring the Istio Control Plane

ASM installs four important control plane components: Pilot, Mixer, Galley and Citadel. Each sends its relevant monitoring metrics to Prometheus, and ASM ships with Grafana dashboards that let operators visualize this monitoring data and assess the health and performance of the control plane.

Viewing the Dashboards

  1. Port-forward your Grafana service installed with Istio
kubectl --context ${OPS_GKE_1} -n istio-system port-forward svc/grafana 3000:3000 >> /dev/null &
 
  2. Open Grafana in your browser
  3. Click on the "Web Preview" icon on the top right corner of your Cloud Shell window
  4. Click Preview on port 3000 (Note: if the port is not 3000, click on Change port and select port 3000)
  5. This will open a tab in your browser with a URL similar to "BASE_URL/?orgId=1&authuser=0&environment_id=default"
  6. View available dashboards
  7. Modify the URL to "BASE_URL/dashboard"
  8. Click on the "istio" folder to view the available dashboards
  9. Click on any of those dashboards to view the performance of that component. We'll look at the important metrics for each component in the following sections.

Monitoring Pilot

Pilot is the control plane component that distributes networking and policy configuration to the data plane (the Envoy proxies). Pilot tends to scale with the number of workloads and deployments, although not necessarily with the amount of traffic to those workloads. An unhealthy Pilot can:

  • consume more resources than necessary (CPU and/or RAM)
  • result in delays in pushing updated configuration information to Envoys

Note: if Pilot is down, or if pushes are delayed, your workloads still serve traffic, but they do not receive any new configuration until Pilot recovers.
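A quick command-line check for Pilot push delays is istioctl's proxy sync status. This is an optional check rather than a lab step, and it assumes istioctl is available on your Cloud Shell PATH as in earlier modules:

istioctl --context ${OPS_GKE_1} proxy-status

Each sidecar should report SYNCED for CDS, LDS, EDS and RDS; STALE entries suggest Pilot is falling behind on configuration pushes.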

  1. Navigate to " BASE_URL/dashboard/db/istio-pilot-dashboard" in your browser to view Pilot metrics.

Important monitored metrics

Resource Usage

Use the Istio Performance and Scalability page as your guide for acceptable usage numbers. Contact GCP support if you see significantly more sustained resource usage than this.

5f1969f8e2c8b137.png

Pilot Push Information

This section monitors Pilot's pushes of configuration to your Envoy proxies.

  • Pilot Pushes shows the type of configuration pushed at any given time.
  • ADS Monitoring shows the number of Virtual Services, Services and Connected Endpoints in the system.
  • Clusters with no known endpoints shows endpoints that have been configured but do not have any instances running (which may indicate external services, such as *.googleapis.com).
  • Pilot Errors shows the number of errors encountered over time.
  • Conflicts shows the number of conflicts, which indicate ambiguous configuration on listeners.

If you have Errors or Conflicts, you have bad or inconsistent configuration for one or more of your services. See Troubleshooting the data plane for information.
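If you want the raw numbers behind these panels, you can query the same Pilot metrics from the Prometheus instance that Grafana reads from. The sketch below assumes the bundled Prometheus service is named prometheus in istio-system and listens on port 9090 (the Istio defaults), and that your Istio version exposes the pilot_xds_pushes metric:

kubectl --context ${OPS_GKE_1} -n istio-system port-forward svc/prometheus 9090:9090 >> /dev/null &

# Rate of Pilot configuration pushes over the last 5 minutes, broken down by push type
curl -sG http://localhost:9090/api/v1/query --data-urlencode 'query=sum(rate(pilot_xds_pushes[5m])) by (type)'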

Envoy Information

This section contains information about the Envoy proxies contacting the control plane. Contact GCP support if you see repeated XDS Connection Failures.

Monitoring Mixer

Mixer is the component that funnels telemetry from the Envoy proxies to telemetry backends (typically Prometheus, Stackdriver, etc.). In this capacity, it is not in the data plane. It is deployed as two Kubernetes Deployments (both running Mixer), exposed through two different service names (istio-telemetry and istio-policy).

Mixer can also be used to integrate with policy systems. In this capacity, Mixer does affect the data plane, because failed policy checks to Mixer block access to your services.

Mixer tends to scale with volume of traffic.
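For a quick spot check of Mixer's resource consumption outside Grafana, kubectl top works as well. This assumes the cluster's metrics API is available and that the telemetry and policy Deployments keep their default istio-telemetry and istio-policy names:

kubectl --context ${OPS_GKE_1} -n istio-system top pods | grep -E 'istio-(telemetry|policy)'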

  1. Navigate to " BASE_URL/dashboard/db/istio-mixer-dashboard" in your browser to view Mixer metrics.

Important monitored metrics

Resource Usage

Use the Istio Performance and Scalability page as your guide for acceptable usage numbers. Contact GCP support if you see significantly more sustained resource usage than this.

87ed83238f9addd8.png

Mixer Overview

  • Response Duration is an important metric. Although reports to Mixer telemetry are not in the data path, high latencies here will degrade sidecar proxy performance. You should expect the 90th percentile to be in the single-digit milliseconds, and the 99th percentile to be under 100ms.

e07bdf5fde4bfe87.png

  • Adapter Dispatch Duration indicates the latency Mixer experiences when calling adapters (through which it sends information to telemetry and logging systems). High latencies here will directly affect performance on the mesh. Again, p90 latencies should be under 10ms.

1c2ee56202b32bd9.png

Monitoring Galley

Galley is Istio's configuration validation, ingestion, processing and distribution component. It conveys configuration from the Kubernetes API server to Pilot. Like Pilot, it tends to scale with the number of services and endpoints in the system.

  1. Navigate to " BASE_URL/dashboard/db/istio-galley-dashboard" in your browser to view Galley metrics.

Important monitored metrics

Resource Validation

This is the most important metric to follow. It indicates the number of resources of various types (Destination Rules, Gateways, Service Entries, and so on) that are passing or failing validation.

Connected clients

Indicates how many clients are connected to Galley; typically this will be 3 (pilot, istio-telemetry, istio-policy) and will scale as those components scale.
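You can also confirm from the command line that Galley's validating admission webhook is registered, which is what rejects invalid Istio resources before they ever reach Pilot. The webhook configuration name below (istio-galley) is the default for Istio 1.x installs and may differ in your environment:

kubectl --context ${OPS_GKE_1} get validatingwebhookconfiguration istio-galley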

16. Troubleshooting Istio

Troubleshooting the data plane

If your Pilot dashboard indicates that you have configuration issues, you should examine Pilot logs or use istioctl to find configuration problems.

To examine Pilot logs, run kubectl -n istio-system logs istio-pilot-69db46c598-45m44 discovery, replacing istio-pilot-... with the pod identifier for the Pilot instance you want to troubleshoot.
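If you prefer not to copy the pod name by hand, a label selector can locate it for you. The sketch below assumes the Pilot pods carry the default istio=pilot label, which is the case for standard Istio/ASM installs:

PILOT_POD=$(kubectl --context ${OPS_GKE_1} -n istio-system get pods -l istio=pilot -o jsonpath='{.items[0].metadata.name}')
kubectl --context ${OPS_GKE_1} -n istio-system logs $PILOT_POD discovery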

In the resulting log, search for a Push Status message. For example:

2019-11-07T01:16:20.451967Z        info        ads        Push Status: {
    "ProxyStatus": {
        "pilot_conflict_outbound_listener_tcp_over_current_tcp": {
            "0.0.0.0:443": {
                "proxy": "cartservice-7555f749f-k44dg.hipster",
                "message": "Listener=0.0.0.0:443 AcceptedTCP=accounts.google.com,*.googleapis.com RejectedTCP=edition.cnn.com TCPServices=2"
            }
        },
        "pilot_duplicate_envoy_clusters": {
            "outbound|15443|httpbin|istio-egressgateway.istio-system.svc.cluster.local": {
                "proxy": "sleep-6c66c7765d-9r85f.default",
                "message": "Duplicate cluster outbound|15443|httpbin|istio-egressgateway.istio-system.svc.cluster.local found while pushing CDS"
            },
            "outbound|443|httpbin|istio-egressgateway.istio-system.svc.cluster.local": {
                "proxy": "sleep-6c66c7765d-9r85f.default",
                "message": "Duplicate cluster outbound|443|httpbin|istio-egressgateway.istio-system.svc.cluster.local found while pushing CDS"
            },
            "outbound|80|httpbin|istio-egressgateway.istio-system.svc.cluster.local": {
                "proxy": "sleep-6c66c7765d-9r85f.default",
                "message": "Duplicate cluster outbound|80|httpbin|istio-egressgateway.istio-system.svc.cluster.local found while pushing CDS"
            }
        },
        "pilot_eds_no_instances": {
            "outbound_.80_._.frontend-external.hipster.svc.cluster.local": {},
            "outbound|443||*.googleapis.com": {},
            "outbound|443||accounts.google.com": {},
            "outbound|443||metadata.google.internal": {},
            "outbound|80||*.googleapis.com": {},
            "outbound|80||accounts.google.com": {},
            "outbound|80||frontend-external.hipster.svc.cluster.local": {},
            "outbound|80||metadata.google.internal": {}
        },
        "pilot_no_ip": {
            "loadgenerator-778c8489d6-bc65d.hipster": {
                "proxy": "loadgenerator-778c8489d6-bc65d.hipster"
            }
        }
    },
    "Version": "o1HFhx32U4s="
}

The Push Status will indicate any issues that occurred when trying to push the configuration to Envoy proxies – in this case, we see several "Duplicate cluster" messages, which indicate duplicate upstream destinations.

For assistance in diagnosing problems, contact Google Cloud support.

Finding configuration errors

To use istioctl to analyze your configuration, run istioctl experimental analyze -k --context $OPS_GKE_1. This performs an analysis of the configuration in your system, indicating any problems along with suggested changes. See the documentation for a full list of configuration errors that this command can detect.
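Depending on your istioctl version, analyze can also be pointed at local YAML files before they are committed, which makes it a useful pre-commit check for the manifests in k8s_repo. Both commands below are sketches; the context and path simply mirror the variables used earlier in this workshop:

istioctl experimental analyze -k --context $OPS_GKE_2
istioctl experimental analyze ${K8S_REPO}/${OPS_GKE_1_CLUSTER}/istio-networking/*.yaml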

17. Cleanup

An administrator runs the cleanup_workshop.sh script to delete resources created by the bootstrap_workshop.sh script. You need the following pieces of information for the cleanup script to run.

  • Organization name - for example yourcompany.com
  • Workshop ID - in the form YYMMDD-NN for example 200131-01
  • Admin GCS bucket - defined in the bootstrap script.
  1. Open Cloud Shell by clicking on the link below. Perform all of the following actions in Cloud Shell.

CLOUD SHELL

  2. Verify you are logged into gcloud with the intended Admin user.
gcloud config list
 
  3. Navigate to the asm folder.
cd ${WORKDIR}/asm
 
  4. Define the Organization name, workshop ID, and admin GCS bucket for the workshop to be deleted.
export ORGANIZATION_NAME=<ORGANIZATION NAME>
export ASM_WORKSHOP_ID=<WORKSHOP ID>
export ADMIN_STORAGE_BUCKET=<ADMIN CLOUD STORAGE BUCKET>
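
For example, using the sample values listed above (the bucket name here is just a placeholder), the exports might look like this:

export ORGANIZATION_NAME=yourcompany.com
export ASM_WORKSHOP_ID=200131-01
export ADMIN_STORAGE_BUCKET=my-admin-gcs-bucket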
 
  5. Run the cleanup script as follows.
./scripts/cleanup_workshop.sh --workshop-id ${ASM_WORKSHOP_ID} --admin-gcs-bucket ${ADMIN_STORAGE_BUCKET} --org-name ${ORGANIZATION_NAME}