Site Reliability Engineer at Bestow
Using GitHub for version control in conjunction with Terraform Cloud enabled us to manage our infrastructure with standard version-control workflows from the start. All engineers at Bestow can review and collaborate on infrastructure decisions and changes, and work tasks are tracked and frequently linked to those changes as we develop our platform. All of our engineers either had previous Terraform experience or were quickly ramped up to support our infrastructure-as-code environment. Most new engineers absolutely loved that they could configure their own infrastructure programmatically and get it approved in a PR on GitHub.
HashiCorp has done a great job standardizing Terraform as an open-source tool for managing infrastructure as code. We were able to easily follow the GCP documentation for managing resources with some well-designed and reusable modules, and we provisioned multiple environments using a clear and concise folder structure built around those modules.
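A per-environment layout like the one described might look like the following sketch. The module name, path, and variables here are hypothetical, not Bestow's actual configuration:

```hcl
# environments/staging/main.tf -- one folder per environment,
# each invoking the same reusable modules with different inputs.
module "network" {
  source      = "../../modules/network" # hypothetical shared module path
  project_id  = var.project_id
  environment = "staging"
}
```

Keeping environment folders thin and pushing all real logic into modules is what makes the structure easy to read and reuse across environments.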
We chose Terraform Cloud because it gave us a secure, versioned platform for deploying our changes. We are able to limit access to the remote state while also integrating directly with our CI/CD on GitHub to test and deploy changes. The Terraform provider for GCP has been reliable, and our engineers are able to make changes without wasting much time. Terraform Cloud has its drawbacks, but in general we found its configurability outstanding, and the features it brings help boost productivity.
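Wiring a workspace to Terraform Cloud's remote state is a small block of configuration. The organization and workspace names below are hypothetical placeholders:

```hcl
# Route state and runs through Terraform Cloud; access to the
# remote state is then governed by workspace permissions.
terraform {
  cloud {
    organization = "example-org" # hypothetical organization name

    workspaces {
      name = "platform-prod" # hypothetical workspace name
    }
  }
}
```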
We have also been able to use other Terraform providers to manage things like Cloudflare, New Relic, Okta, GCP, AWS, and GitHub. The standard interface offers an easy way to version-control and manage the majority of our software platform, grant approvals via PR, and create a self-documenting audit trail of what work was done and how. This makes it easy to onboard new teammates and audit our existing workflows.
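Declaring that mix of providers in one configuration is a standard `required_providers` block; these are the published registry sources for the providers named above:

```hcl
# Pin every third-party provider alongside GCP so all of these
# services are managed through the same plan/apply workflow.
terraform {
  required_providers {
    google     = { source = "hashicorp/google" }
    aws        = { source = "hashicorp/aws" }
    cloudflare = { source = "cloudflare/cloudflare" }
    newrelic   = { source = "newrelic/newrelic" }
    okta       = { source = "okta/okta" }
    github     = { source = "integrations/github" }
  }
}
```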
Site Reliability Engineer at Bestow
At Bestow we engineer a life insurance platform with ease of use as the main focus. We want the process to be so easy that anybody can explore and purchase life insurance through a simple online portal. We know that security, scalability, and reliability are critical to a successful product, and we quickly landed on Kubernetes as our deployment platform. K8s gives us flexibility to manage and monitor our platform relatively easily without sacrificing configurability. GKE took those values a step further, offloading the burden of managing a control plane and giving us a cluster without scaling limits.
It was easy to get going with some core services such as DNS, certificate management, Nginx, and monitoring systems. We were able to migrate our platform quickly from simple containers running on an instance into Kubernetes deployments. In the early stages our engineers could easily scale deployments manually based on resource usage, aided by GKE's dashboards showing resource utilization; eventually we implemented custom autoscaling based on GCP metrics and RabbitMQ metrics.
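Autoscaling on an external metric like RabbitMQ queue depth can be sketched with the Terraform Kubernetes provider's HPA resource. This assumes the queue metric has already been exported to Cloud Monitoring via a metrics adapter; the deployment name, metric name, and targets are hypothetical:

```hcl
# Scale a worker deployment on queue depth rather than CPU.
resource "kubernetes_horizontal_pod_autoscaler_v2" "worker" {
  metadata {
    name = "worker" # hypothetical deployment name
  }

  spec {
    min_replicas = 2
    max_replicas = 20

    scale_target_ref {
      api_version = "apps/v1"
      kind        = "Deployment"
      name        = "worker"
    }

    metric {
      type = "External"
      external {
        metric {
          # Hypothetical custom metric exported to Cloud Monitoring
          name = "custom.googleapis.com|rabbitmq|queue_depth"
        }
        target {
          type          = "AverageValue"
          average_value = "30" # target messages per replica (assumed)
        }
      }
    }
  }
}
```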
GKE has seamless integration between GCP and the Kubernetes clusters, giving us flexibility to manage workloads across our cloud infrastructure. Most of our engineers had previous Kubernetes experience, both self-hosted (K8s in K8s) and AWS flavors. This familiarity made it easy to move between cloud providers and migrate workloads without interrupting anyone's workday. We migrated from AWS to GCP (K8s in K8s -> GKE) on a Sunday morning, and early Monday morning everyone just came to work as usual, with no impact to developer productivity or policy sales.
The “datacenter as a service” capabilities provided by GKE are a selling point for us to use Google Cloud Platform. They allow us to focus on delivering a platform with 100% uptime, without getting lost in the details of backups, parallelism, load balancing, security patches, kernel patches, software upgrades, and the other problems most data centers and home-grown K8s clusters must consider.
The Terraform documentation for implementing GKE is very solid, and the docs provided by Google are just as robust. Some of the challenges we faced stemmed from the maturity of the GKE platform relative to the virtual infrastructure we had already provisioned. The extra layers of complexity introduced by Kubernetes sometimes make troubleshooting infrastructure more difficult, so we rely on our monitoring and metrics platform to keep the clusters behind our insurance platform reliable.
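Provisioning a GKE cluster through the Google Terraform provider follows the documented pattern of a cluster resource plus a separately managed node pool. The names, region, machine type, and autoscaling bounds below are hypothetical:

```hcl
# A minimal GKE sketch: drop the default node pool and manage
# our own so node configuration lives in version control too.
resource "google_container_cluster" "primary" {
  name     = "primary" # hypothetical cluster name
  location = "us-central1"

  remove_default_node_pool = true
  initial_node_count       = 1
}

resource "google_container_node_pool" "default" {
  name     = "default-pool"
  cluster  = google_container_cluster.primary.name
  location = google_container_cluster.primary.location

  autoscaling {
    min_node_count = 1
    max_node_count = 10 # assumed upper bound
  }

  node_config {
    machine_type = "e2-standard-4" # assumed machine type
  }
}
```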
Possibly the largest benefit of GKE was the integration between K8s RBAC and GCP IAM. The ability to provision users on the cluster using Terraform and the Google Terraform provider created a very streamlined and simple way to give people access to the cluster where they need it, without limiting our ability to use RBAC in all its glory and grant additional permissions on the cluster where required. The same principle applies to service accounts: IAM service accounts created in Terraform via PR and provisioned automatically create a record of which actions each service can perform, both in the GCP environment and on the cluster. This gives us a very clear audit trail and helps with troubleshooting permission-related woes without over-permissioning for ease of use.
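The IAM-plus-RBAC pairing can be sketched in Terraform as an IAM grant for baseline cluster access, refined by a namespaced RBAC binding. The project variable, user, namespace, and Role name here are all hypothetical:

```hcl
# IAM grants baseline GKE access; RBAC narrows what the user
# can actually do inside a given namespace.
resource "google_project_iam_member" "developer" {
  project = var.project_id
  role    = "roles/container.developer"
  member  = "user:dev@example.com" # hypothetical user
}

resource "kubernetes_role_binding" "deployers" {
  metadata {
    name      = "deployers"
    namespace = "staging" # hypothetical namespace
  }

  role_ref {
    api_group = "rbac.authorization.k8s.io"
    kind      = "Role"
    name      = "deployer" # hypothetical Role defined elsewhere
  }

  subject {
    kind      = "User"
    name      = "dev@example.com"
    api_group = "rbac.authorization.k8s.io"
  }
}
```

Because both halves live in the same Terraform configuration, every access change arrives as a reviewable PR, which is what produces the audit trail described above.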
We’ve also updated our deployment pipeline to include testing and promotions through our different environment clusters. Developers can commit changes without worrying about how they get deployed into Kubernetes. At the same time, GKE + kubectl makes it easy for all engineers to access cluster information such as logs, pods, and deployments. Our usage is always evolving as we find new ways to take advantage of Kubernetes and our container infrastructure.