Transforming the Management of Application Configurations & Secrets at 24 Hour Fitness

HashiCorp
Powering the software-managed datacenter. Maker of Vagrant, Packer, Terraform, Consul, Serf, and Vault

At 24 Hour Fitness, for many years, operations and development teams have gone through the pain of trying to manage and deploy application configurations with data stored in many files and locations across the ecosystem. The DevOps team was tasked with architecting and implementing a simple, reliable, highly available, testable solution to meet the growing needs of their applications. Through the combined use of Consul and Vault, they have successfully transformed the business.

In this talk, 24 Hour Fitness Senior DevOps Engineer Jason Yoe will describe the challenges faced, the overall design, and the implementation of the solution.

Transcript

It was a Wednesday afternoon before Thanksgiving, and I was finishing a few things up before I was heading out, when the team received an email talking about an intermittent problem that was happening with one of our sales applications.

A user, when selecting an option in our web application, was receiving an error. Not everybody would receive an error when they'd select the option, and as a matter of fact, when this user clicked the option again, they were successful. From the email, it was uncertain when this problem had begun and how many sales had been affected.

Fast-forward an hour: the issue has been escalated. The DevOps team is on a WebEx call with many agitated people. Agitated managers are in our cubicle aisle, looking for answers.

It's understandable. It's the day before Thanksgiving, it's a production issue, and it's our sales application. So the team's frantically looking through logs, and we're checking deployments that happened in the previous days for any sign of what could have gone wrong. Finally, we stumble upon an email about a configuration change that had happened the previous week.

This change was made directly to an application instance, but unfortunately those instances were not bounced at the time of the change. When a code deploy occurred a few days later, the applications were restarted and the change was applied.

Unfortunately, the configuration change was missed on one of the application instances, hence the intermittent issue. Of course, this change had been applied in other environments and had been tested, but unfortunately the person applying this change in production had made a few small mistakes. So, we had to transform our configuration management.

Transformation is continuous improvement

How do we accomplish this transformation? Well, transformation is really all about continuous improvement. We all envision this glorious end state, where all of our application instances are containerized and we have systems that detect and fix errors while we go get more coffee.

But the reality is nowhere near that. We never really reach that nirvana where there are no challenges, there are no issues. There are always going to be challenges. The goal is to face better challenges.

When I was 15 years old, I had to walk everywhere. I had to walk to school, I had to walk to work, I had to walk to my friends' houses. When I improved my situation and saved up and bought a car, I no longer had to walk anywhere. I still had to pay for gas, I had to pay for car insurance, and I had to get my oil changed.

But these challenges were definitely better than having to walk everywhere.

Today I'm going to discuss the steps we use for continuous improvement. I will also dive into the 24 Hour Fitness case study, where we use Consul and Vault as an integral part of the solution, and then I'll go over some of the new, better challenges in our continuous improvement journey.

My name is Jason Yoe. I'm a senior DevOps engineer at 24 Hour Fitness. 24 Hour Fitness is the second-largest fitness chain in the world. In my previous roles, I worked at Cisco as a cloud engineer, and then as a technical architect at AT&T, so I have about 20 years' experience in the industry.

The continuous improvement process

The 4 steps in the continuous improvement process are:

  • Identify the challenges
  • Find the value
  • Define the path
  • Walk the path

The first step is to identify the challenges, and this step is really about communication. We had many sessions with development teams and operations teams. It's about gathering all the positives, the negatives, the issues, the roles and responsibilities, and all the processes that were in place at the time for configuration and secrets control.

And as you will see in our case study, we found some common challenges that came from these sessions.

The second step is to find the value. The idea is to make the simple changes that provide the greatest benefit. We don't have to solve for everything. Because in continuous improvement, we will eventually face those other challenges, and we'll overcome them. It's really about identifying those key changes that bring the most value right now.

Also remember that value changes are definitely momentum-builders, especially with executive leadership, and they help us along that journey of continuous improvement.

Once we find the value challenges, we end up defining the path, and this is really about building the requirements, it's about architecting the solution, it's about defining those processes, those business processes that other teams need to follow.

It's also defining the tasks that need to be executed. This step also includes the research that goes into it, the technology, the processes. The key point is to look at the successes that are inside your organization and outside your organization.

The next step is walk the path. This is obviously merely the execution of the tasks, it's implementing the technology, and it's socializing the business processes, so it's making sure everyone in the organization's on the same page, that we're all moving in the same direction.

And remember, the improvement journey doesn't end with this step; it continues back again.

The case study

Let us dive into the case study at 24 Hour Fitness. We held sessions across the organization with development teams and our operations teams, and we wound up with a pretty extensive list.

There were a lot of issues that people brought up, but we identified the 4 challenges whose resolution would provide the most benefit for our company:

  • Eliminate configuration sprawl
  • Eliminate secrets sprawl
  • Define a consistent lifecycle process for configuration management
  • Define a consistent lifecycle process for secrets management

The first challenge was to eliminate configuration sprawl. In our environment, we had configurations in many different files, in many different directories, and it wasn't consistent across applications. Different locations and different applications.

We also had different configuration management tools. Some configurations were managed by Chef, some were managed by SVN, some were stored locally on the instance and were managed there.

Our second challenge was to eliminate secrets sprawl. We had secrets that were stored in multiple files and multiple locations, and it wasn't consistent across applications.

One of the big problems was changing passwords. To stay compliant, we have to change our passwords every so often, particularly our database passwords, and our database passwords were stored on local application instances, albeit encrypted. But this process really was tough at 3:00 in the morning on a Sunday when an application wouldn't start because a password change had been missed.

We also had to define a consistent lifecycle process for both configurations and secrets. In talking to different teams, we had multiple processes for how to create, update, store, delete these configurations and secrets.

A lot of it had to do with the fact that each team had multiple tools to manage this, but there was also no clear role definition or responsibility definition for the teams.

What was needed in the solution

Once we identified the value challenges, we were able to define the path. So we built our requirements, we did our research, and we architected a solution with specific tasks to be executed. I've highlighted some of the high-level directives for the solution:

  • Implement a single source for configurations using the Consul KV store
  • Implement a single source for secrets using Vault
  • All applications reference the single source for configurations and secrets
  • Applications can receive runtime configuration changes
  • Implement consistent configuration and secrets data structure and naming
  • Consul agent running in client mode on all application instances
  • Secure access to Consul by implementing ACLs
    • Only admin team has access to UI
    • Apps have own token and policy to access specific key prefixes
  • Secure access to Vault
    • Only admin team has access to UI
    • Apps have own token and policy to access specific secrets

The first one is to implement a single source for configurations using the Consul KV store. The next one is to implement a single source for secrets using Vault. Together these resolve the sprawl issue, with configurations and secrets in multiple files and multiple locations.

When somebody needed to find out where a configuration was, there was no documentation, and we had to hunt through directories to find it. Now we're implementing a single source of truth, where we can go and make changes and manage these configurations.

To go along with that, all of our applications need to be able to access that single source of truth, so if you make a configuration change, all of the instances get that change, instead of having to go from instance to instance and validate that they've got that change.

Another requirement was these applications need to receive runtime configuration changes. In today's world, we want to be a 24-hour shop. Our applications always need to be up for our customers, especially at 24 Hour Fitness. We have people working out at the gym at 2:00 in the morning.

And so we want our applications available to everybody, and the idea is, if we're making configuration changes, we need these applications to take them hot; we don't have time to restart our applications for that change to take effect.

We also need to implement a consistent configuration and secrets data structure and naming. Different teams had different naming standards for their keys and their values and their configurations.

We also wanted to build a structure of, when we talk about a particular value, where is that value found? So that everybody understands and speaks the same language.

To implement all this, we need a Consul agent running on all our application instances, so that they can access their data. We also want to secure access to Consul by implementing ACLs (Access Control Lists).

With the Consul UI, anybody has access to change, modify, add, delete any kind of keys. This creates a problem. We don't have an audit trail, we don't know when it happened.

The idea is to lock the UI down and provide another management tool for developers to create configurations, delete configurations, add configurations.

Applications will access their data using a token and policy procedure.

We also want to secure access to Vault. We want to secure that access to the UI. We don't want everybody to come in and add secrets. We want a single admin team to be able to manage that, and then applications are going to use tokens and policies to access their data.

Being a 24-hour shop, we need our applications up all the time, and a key to that is to always have a functional path between our application instances and the Consul and Vault servers.

If an application needs to get a configuration change at 2:00 in the morning, we need to have that path available. During maintenance windows, we still need a path available from the instances to our Vault or Consul servers.

We also need to implement a consistent process for managing these configurations and secrets across all the teams, so everybody understands and knows this is how you accomplish this task.

Along those lines, we wanted to create a single point of management, one that had a historical view of configuration changes, so we know who changed what and when they changed it.

We needed the ability to roll back changes, and perform some sort of code review of the configurations before implementation.

And lastly, for our production environment, we wanted to insert some approval process, so that we get approval before we roll changes out to production.

The first iteration

Our initial implementation started in about November of 2018. That month we spent a week or 2 to implement the Consul and Vault architecture, then we proceeded on building the processes for configuration and secrets management.

Then we spent time socializing those processes across the organization. That took a little while.

Since then it's been onboarding the applications into our process and into our architecture. And of course the team continues to go through the continuous improvement process of identifying the challenges, finding the value, defining the path, and walking the path.

This slide shows an overview of our Consul architecture. Pretty simple.

We used the open-source version of Consul, version 1.4.2. We have 2 datacenters and a single Consul cluster that spans both those datacenters, so we have Consul servers in both Datacenter 1 and 2.

We have a global entry point that's load-balanced between the 2 datacenters, so any communication coming in can either go to Datacenter 1 or Datacenter 2, and hit Consul servers in either datacenter. So if we're doing maintenance work in Datacenter 1, or there are network issues, we still have that single path from application instances to Consul.

The same architecture applies with Vault. We use the open-source version, 0.11.2. Again, 2 datacenters. We have a single Consul cluster as the backend; it serves as the storage database for Vault.

We have Vault servers in Datacenter 1 and Vault servers in Datacenter 2, in active/standby mode, and a single global point of entry that's load-balanced between both datacenters, so that we always have a single point of access to Vault and applications can get their secrets.

Defining a naming convention

Once we built out the architecture, we had to define a consistent configuration-structure naming convention. This is the language that we communicate to all the teams: "When you're going to create key-values, your application's going to use these key-values. What paths do I access these on?"

We came up with a standard that was consistent across all the teams. In Consul, the KV store endpoints are referenced by a keypath, and for configurations across applications, we decided to use a keypath that starts with /default. These keys are referenced by any application, so any keys along this path, any application has access to them.

For configurations for specific applications, the keypath would start with /env, an environment name, and the application name. This would specify a path for an application for them to get their key-values.

For example, for app1 in dev, its keypath would start with /env/dev/app1, and that would be different in a different environment. For QA, it would be /env/qa/app1. So not only do we segment this by application, but we segment it by environment as well.

For host-specific configurations and those instances where a specific host needs something a little bit different from the application itself, we start with /host, and then the hostname and the key.

This is just a consistent naming convention that we wanted to put out there across all of our teams in the organization so we were speaking the same language. For our key names, it's the words separated by a dash, so the key name would be application-url.
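The convention above can be sketched as a few small helpers. This is only an illustration in Python, not code from the talk: the function names are hypothetical, and lowercasing the key words is an assumption.

```python
def to_key_name(*words: str) -> str:
    """Join words with dashes, e.g. ("Application", "URL") -> "application-url".

    Lowercasing is an assumption; the talk only says words are dash-separated.
    """
    return "-".join(word.strip().lower() for word in words)


def default_keypath(key: str) -> str:
    """Keypath for a value shared by every application."""
    return f"/default/{key}"


def app_keypath(env: str, app: str, key: str) -> str:
    """Keypath for a value that belongs to one application in one environment."""
    return f"/env/{env}/{app}/{key}"


def host_keypath(hostname: str, key: str) -> str:
    """Keypath for a value that only one host needs."""
    return f"/host/{hostname}/{key}"
```

For instance, `app_keypath("dev", "app1", to_key_name("application", "url"))` builds the dev keypath for app1, and swapping "dev" for "qa" gives the QA one.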

We went ahead and generated a similar situation for Vault. In Vault, our secrets engines simply store application secrets; we don't use it for anything else today. Each application is tied to a Vault secret.

The way we structured it, we have 4 secrets engines, 1 for each environment: /apps/dev, /apps/qa, /apps/staging, and /apps/prod. Within those secrets engines we have the secrets, which are tied to each application. For example, we have a secret for app1 and a secret for app2, and their values are stored as key-values within each secret.

The naming convention for key names is words separated by a dot. This is different from the configuration value names. For example, it's spring.datasource.properties.user. What this allows us to do is, when we have applications that are trying to get values from Vault or Consul, we can specify the actual keypaths that will locate the values that they need.
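One way to keep the two key-name conventions from drifting is a small validation check, for example in a review step. The sketch below is an assumption-laden illustration: the function names are made up, and restricting names to lowercase letters and digits is a guess, since the talk only specifies the separators.

```python
import re

# Configuration keys in Consul: words separated by dashes (application-url).
CONFIG_KEY = re.compile(r"[a-z0-9]+(-[a-z0-9]+)*")

# Secret keys in Vault: words separated by dots (spring.datasource.properties.user).
SECRET_KEY = re.compile(r"[a-z0-9]+(\.[a-z0-9]+)*")


def is_valid_config_key(name: str) -> bool:
    """True if the whole name follows the dash-separated convention."""
    return CONFIG_KEY.fullmatch(name) is not None


def is_valid_secret_key(name: str) -> bool:
    """True if the whole name follows the dot-separated convention."""
    return SECRET_KEY.fullmatch(name) is not None
```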

How applications get their configurations

This slide shows the configuration load at application startup. Our application instance starts up, and it's going to get 3 values. We have Java applications, so I'm showing you the Java opts.

We'll talk a little bit more about these 3 values that the application needs. Basically it's, How am I going to get my values from Consul, and what's going to happen if I need to update my configurations?

Once the application gets those values, it starts up, and it's going to contact the local Consul agent that's sitting on the application instance, and it's going to be able to pull its values. We're going to use these keypaths or key prefixes that then locate the specific configuration information for the application instance.

We want our application instances to get changes hot, so the bottom half of this slide shows the process for the configuration reload, when the application's running.

The way we implemented it was to use watches and handlers. A watch detects if there's a change along a keypath. These keypaths become really important in our architecture, because they identify what we're looking for and where our values are, so that the applications know, so that Consul knows, and so that our teams know.

This watch, when it detects a change—a create, an update, a delete along a specific keypath—it initiates a handler. This handler is simply a script. It's going to pass a signal to our application, our application's going to listen on a specific port, the handler is going to send this signal to that port, and it's going to tell the application, "Reload your configuration data."
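A handler along these lines could look like the sketch below. The talk does not say what the signal looks like on the wire, so the plain TCP connection and the `reload` message are assumptions, as is the function name.

```python
import socket


def send_reload(port: int, host: str = "127.0.0.1",
                message: bytes = b"reload\n", timeout: float = 2.0) -> None:
    """Tell the application listening on (host, port) to reload its configuration.

    The message format is hypothetical; the application would need to
    understand whatever the handler sends.
    """
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(message)
```

The watch would invoke a script that calls something like `send_reload(listener_port)` whenever a key under the watched keypath is created, updated, or deleted.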

This slide shows the 3 properties that I talked about that the application needs to pull at startup:

  • Server list
  • Key prefix
  • Listener port

The first one is the server list. This is just the connection that it looks for to find the data. Because we talk to the local Consul agents that are hosted on the application instance, we specify the localhost on port 8500. This could be the VIP for the Consul cluster.

We also specify a key prefix. This goes back to that idea of this standard path for the applications, of how they're going to pull their values. For each application instance, we can list a set of paths, of where they're going to look to get their data.

In this case the key prefix is /env/dev/app1. This would be the instance of application 1 in development, and it's going to look on this path to pull its key-values.

The last piece is the listener port. This is the port that the application's going to listen on when our handler sends a signal to it to reload its configuration data.
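To make the server list and key prefix concrete, here is a rough Python sketch of reading everything under a key prefix through the Consul KV HTTP API, which returns values base64-encoded. Our applications are Java, so this is purely illustrative; the function names are hypothetical, and the keys are assumed to be stored without a leading slash.

```python
import base64


def kv_url(server: str, key_prefix: str) -> str:
    """Build the KV read URL for every key under a prefix (recursive read)."""
    return f"http://{server}/v1/kv/{key_prefix.strip('/')}/?recurse=true"


def decode_kv_entries(entries: list, key_prefix: str) -> dict:
    """Turn the JSON list returned by the KV endpoint into {key-name: value}."""
    prefix = key_prefix.strip("/") + "/"
    config = {}
    for entry in entries:
        key = entry["Key"]
        if not key.startswith(prefix):
            continue
        name = key[len(prefix):]
        if not name:
            # The prefix itself can appear as an empty "folder" entry.
            continue
        raw = entry.get("Value")
        # Consul returns null for empty values and base64 text otherwise.
        config[name] = base64.b64decode(raw).decode("utf-8") if raw else ""
    return config
```

With the server list set to `localhost:8500` and the key prefix `/env/dev/app1`, `kv_url` points at the local agent, and `decode_kv_entries` strips the prefix so the application sees plain key names such as `application-url`.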

More on watches and handlers

A quick dive into watches and handlers for those that aren't familiar. Watches are a way of specifying a view of data which is monitored for updates.

In this case the watch is looking on /env/dev/app1, and if a key's been created or changed, it's going to initiate the handler, which then sends a signal to a port that the application's listening on to tell it to reload its configuration.

Here's the config.json, which is a configuration file for the Consul agent locally on the application instance, and here we specify the watch. We can see that the path name's been specified, so when any key-value's been changed on this path, the handler script gets executed.
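The slide itself isn't reproduced in this transcript, so as a stand-in, here is a small Python sketch that generates a watch definition of that general shape. It uses Consul's `keyprefix` watch type with a script handler; the handler path, the idea of passing the listener port as an argument, and the exact prefix are assumptions rather than our actual file.

```python
import json


def watch_config(key_prefix: str, handler_script: str, listener_port: int) -> dict:
    """Build an agent configuration fragment with one keyprefix watch."""
    return {
        "watches": [
            {
                "type": "keyprefix",
                "prefix": key_prefix.strip("/") + "/",
                "handler_type": "script",
                # Hypothetical: the handler receives the port to signal.
                "args": [handler_script, str(listener_port)],
            }
        ]
    }


def render(config: dict) -> str:
    """Serialize the fragment the way it would appear in config.json."""
    return json.dumps(config, indent=2)
```

Calling `render(watch_config("/env/dev/app1", "/opt/consul/handlers/reload.sh", 9000))` produces JSON that could be merged into the agent's configuration; both the script path and the port number here are made up for the example.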

How applications get their secrets

This is a similar process. The application starts up, it's going to get a few properties to know how to contact Vault, and get its information that it needs.

Once it starts up, it contacts the Vault URL to try to load its secrets. The way we manage that is token and policy and the keypath. We're giving it a path, and it's looking for specific secrets along that path, and we can lock that down with policies and tokens.

Taking a look at the properties, the first one is the Vault URI. This is how the application instance is going to connect to the Vault cluster to get its data.

The Vault path is simply that path to the data or the secrets that the application needs. And the Vault token is how we secure that data. So this application has a specific token to a specific path, and no other applications can access this data.

This slide shows a deeper dive into it. Each application has an associated secret, as we saw before, and the secrets engines are apps/dev, apps/staging, apps/qa. The secrets underneath are associated to an application. And each application has an ACL policy, which then says, "Who has access to this secret?"

There's a token associated with that. When the application calls Vault, it uses that token, which is then validated by the policy to say, "Yes, you have access to the data along this path."
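As a rough illustration of the token and policy pairing, the Python sketch below renders a read-only policy for one application's secret and builds the HTTP request an application would make. It assumes the secrets engines are key-value engines mounted at apps/<env> and read at the version 1 style path; with a versioned KV engine the path would include an extra `data/` segment. The function names and the example URI are hypothetical.

```python
def read_only_policy(env: str, app: str) -> str:
    """Render a Vault policy that only allows reading one application's secret."""
    return (
        f'path "apps/{env}/{app}" {{\n'
        '  capabilities = ["read"]\n'
        '}\n'
    )


def secret_read_request(vault_uri: str, env: str, app: str, token: str):
    """Return the (url, headers) pair for reading the application's secret.

    Vault checks the token's policies before returning any data.
    """
    url = f"{vault_uri.rstrip('/')}/v1/apps/{env}/{app}"
    headers = {"X-Vault-Token": token}
    return url, headers
```

A token created with only the app1 policy attached can read apps/dev/app1 and nothing else, which is what keeps one application from reading another's secrets.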

Updating, creating, and deleting configurations and secrets

For the new process, we wound up choosing GitLab for source control. It does branching really well, it allows a semblance of code review through merge requests, it has an audit history so we know who changed configurations and when they changed them, and it supports rollback, so we can roll back the changes.

The big piece, though, in our organization and in many organizations, is that there wasn't a steep learning curve, because developers already use it for source control for the code that they write.

And we chose Jenkins as the orchestrator to run the scripts that will call the Consul APIs to implement these changes. It integrates with GitLab, and the teams that would manage this are already familiar with Jenkins and the processes.

Let's take a look at the process of what we socialize to the teams about how you're going to add and update and delete these configurations.

A developer's going to come in and log into GitLab. We have a properties repository, and the developer's going to update a specific property file, which we segment by environment, for now at least.

The developer pushes the branch to origin. The developer then submits a merge request, which allows a peer review or a team review to come in and check and validate the configuration. The merge is committed, a webhook triggers a Jenkins job, and control's passed over to Jenkins.

The Jenkins job calls a script, and this script creates a list of all the changes that have just occurred: any additions, any modifications, any deletions, any actions that are going to be taken against the Consul database.

Once it generates that list, a second script calls the Consul APIs, and it's going to perform those actions. Regardless of how many additions, modifications, or deletions we have, the second script is going to use the Consul APIs to make those modifications.

Then control's handed over to Consul, where the APIs actually do the work.
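The two-script flow can be sketched roughly as follows. This is not our Jenkins code: it is a minimal Python illustration that assumes the property file has already been read into dictionaries before and after the merge, and it only describes the KV API calls (PUT to write a key, DELETE to remove one) instead of sending them.

```python
def plan_changes(old: dict, new: dict) -> list:
    """First script: list additions, modifications, and deletions between versions."""
    actions = []
    for key in sorted(new):
        if key not in old:
            actions.append(("add", key, new[key]))
        elif old[key] != new[key]:
            actions.append(("modify", key, new[key]))
    for key in sorted(old):
        if key not in new:
            actions.append(("delete", key, None))
    return actions


def to_consul_calls(actions: list, consul_addr: str = "http://localhost:8500") -> list:
    """Second script: map each action to a (method, url, body) KV API call."""
    calls = []
    for action, key, value in actions:
        url = f"{consul_addr}/v1/kv/{key.lstrip('/')}"
        if action == "delete":
            calls.append(("DELETE", url, None))
        else:
            # Adds and modifications are both writes to the same key.
            calls.append(("PUT", url, value))
    return calls
```

Keeping the plan separate from the execution means the list of pending actions can be logged or reviewed before anything touches Consul.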

The only difference in this process between environments is that, in our production environment, we want some sort of approval gate before these configurations are passed on. So an approval process goes through, people make the approvals, and then the merge happens and it kicks off the Jenkins job.

This slide shows an example of our property file that houses the key-values for the application. This is in a GitLab repository. Each one of these property files contains the key and value. Each environment has its own property file. Here's an example of the consul-config-dev.txt file.

As you can see, we have keys and values for app1 in the dev environment. The slide only shows 1 application, but our real file also includes app2, app3, app4, and so on.
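The file contents aren't reproduced in this transcript, so the sketch below assumes a simple layout: one `keypath=value` pair per line, with blank lines and `#` comments ignored. That layout and the function name are assumptions for illustration, not the actual format of consul-config-dev.txt.

```python
def parse_properties(text: str) -> dict:
    """Parse 'keypath=value' lines into a dictionary, skipping blanks and comments."""
    props = {}
    for line_no, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"line {line_no}: expected keypath=value")
        # Only the first '=' splits the line, so values may contain '='.
        props[key.strip()] = value.strip()
    return props
```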

The process for managing secrets

This is a little bit different. It starts with a requester sending an encrypted email with some information: the secrets engine, the secret, and a key. The key's going to be something like a username or client ID.

The great thing about the structure and naming convention that we implemented is, our secrets engine is the environment name, right? And the secret itself is the application name. So the developer doesn't have to figure out these random names for secrets engines and secrets.

They simply go, "I want to update app1 in dev, so..." There you go: secrets engine is dev, the secret is app1, and then they can provide the key. They send a second encrypted email containing the value, which in this case could be a password or client secret, and these emails go to the Vault admin team. Vault admin team gathers the data, they log into Vault, and they update and make the changes.

Improving the processes

As part of our continuous improvement journey, we constantly cycle through these steps that we've been going over. Here's a snippet of some of the new challenges that we face:

  • Issues with different Consul agents in different VLANs communicating with each other for health status (network segmentation)
  • Dynamic password generation for applications and database
  • Devise a better process for secrets administration
  • Split property files into individual app files instead of environment

One is issues with Consul agents in different VLANs communicating with each other for health status. Obviously, we have application instances in multiple VLANs, and a lot of them are protected by firewalls. We have some application instances in the DMZ. The problem becomes when the Consul agents are communicating with each other for health status.

We have all these instances with Consul agents, they're trying to talk to each other for health status, and they're being blocked by firewalls. We could open up the firewall ports, which we did initially, but one of the problems we saw with this is for our Vault cluster.

On the Vault servers, we have a Consul agent, because the Consul database is the backend for Vault. What we saw is, when applications in the DMZ couldn't communicate with the Consul agents on the Vault server, they were reporting them as unhealthy. And Vault was going through a process of electing a new active server.

Normally it's not a problem, except when the applications start up, they need to grab their secrets. So they go to the Vault cluster, they try to grab their secret, there's no active server at the moment, they can't start up, they can't get their secrets, applications fail.

Obviously, one of the resolutions is to implement some sort of network segmentation: create separate segments for the different VLANs so that agents only talk to other agents in their own segment, and not across segments.

Another challenge is that our password rotation is a manual process. Even though we now house passwords in Vault, the creation of those passwords, the sending of the email, the implementing of those passwords is a manual process.

So we need to move to a more dynamic password generation for our applications and database. And along with that, a better process for secrets administration. Sending encrypted emails is great, but when you're doing this a lot of times, it starts getting confusing. "Wait, this password goes to which other email?" It's a challenge that we have to overcome.

The last thing is with the property files. As we saw, we have our property files segmented by environment. All the key-values for dev are in a single file, and all the key-values for QA are in a single file.

In the beginning this was great, because it was the one spot developers could look, and they can say, "Oh yeah, here are all my values."

But as these configurations proliferated and we brought more applications on board, this file has become untenable; it's very big. One solution is to split these property files out per application. So app1 would have its own property file that would contain its own key-values. That way we'd use the same process, but now we'd have separate property files per application.
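A split like that could be done mechanically from the keypaths, since the application name is already part of each one. The Python sketch below assumes the per-environment file has been loaded into a dictionary keyed by full keypath; the function names and the per-application file name pattern are hypothetical.

```python
from collections import defaultdict


def split_by_app(props: dict, env: str) -> dict:
    """Group /env/<env>/<app>/<key> entries by application name.

    Entries outside /env/<env>/ (shared /default or /host keys) are left
    out here and would need a home of their own.
    """
    prefix = f"/env/{env}/"
    per_app = defaultdict(dict)
    for keypath, value in props.items():
        if not keypath.startswith(prefix):
            continue
        app, _, key = keypath[len(prefix):].partition("/")
        if app and key:
            per_app[app][keypath] = value
    return dict(per_app)


def app_file_name(env: str, app: str) -> str:
    """Hypothetical naming pattern for the per-application property file."""
    return f"consul-config-{env}-{app}.txt"
```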

The value of these processes

Today I talked about 24 Hour Fitness' transformation using Vault and Consul. We identified the challenges, we found the value, we defined the path, and we walked the path. But how did overcoming these challenges bring value to us? That's the key point, right? What value did we get from this?

Let's take a look back at the opening story, where a configuration change was missed on a single application instance. Because we implemented that single source of truth, today that scenario would never happen.

We aren't going to individual instances and making changes. We make a configuration change in Consul through our process, and all the application instances get that immediately. So that scenario doesn't even exist, and everybody's happy, they get to start vacation early and go home.

Of course, solving these challenges also allows us at 24 Hour Fitness to dedicate more time to what's important: working out. We're a fitness company; we're supposed to work out. That is, until the next challenge.

Thank you very much.

HashiCorp
Powering the software-managed datacenter. Maker of Vagrant, Packer, Terraform, Consul, Serf, and Vault
Tools mentioned in article
Open jobs at HashiCorp
Sr Manager, Data Engineering
India - Noida
<p>We are looking for a Sr. Manager, Data Engineering to be part of our FP&amp;A’s Digitization team in Noida, Uttar Pradesh, India. This role is expected to be 30% hands on execution building the solutions while the rest is overseeing the delivery and solutioning for the team.</p> <p><strong>In this role you can expect to drive the following-</strong></p> <p>Data Strategy and Alignment</p> <ul> <li>Work closely with Lead- business analysis and analytics to understand requirements and provide data ready for analysis and reporting.</li> <li>Apply, help define, and champion data governance : data quality, testing, documentation, coding best practices and peer reviews.</li> <li>Continuously discover, transform, test, deploy, and document data sources and data models.</li> <li>Develop and execute data roadmap (and sprints) - with a keen eye on industry trends and direction.</li> </ul> <p>Data Stores and System Development</p> <ul> <li>Design and implement high-performance, reusable, and scalable data models for our data warehouse to ensure our end-users get consistent and reliable answers when running their own analyses.</li> <li>Focus on test driven design and results for repeatable and maintainable processes and tools.</li> <li>Create and maintain optimal data pipeline architecture - and data flow logging framework.</li> <li>Build the data schema, features, tools, and frameworks that enable and empower BI and Analytics teams across FP&amp;A function.</li> </ul> <p>Project Management</p> <ul> <li>Drive project execution using effective prioritization and resource allocation.</li> <li>Resolve blockers through technical expertise, negotiation, and delegation.</li> <li>Strive for on-time complete solutions through stand-ups and course-correction.</li> </ul> <p>Team Management</p> <ul> <li>Manage and elevate team of 2 members.</li> <li>Do regular one-on-ones with teammates to ensure resource welfare.</li> <li>Periodic assessment and actionable feedback for progress.</li> 
<li>Recruit new members with a view to long-term resource planning through effective collaboration with the hiring team.</li> </ul> <p>Process design</p> <ul> <li>Set the bar for the quality of technical and data-based solutions the team ships.</li> <li>Enforce code quality standards and establish good code review practices - using this as a nurturing tool.</li> <li>Set up communication channels and feedback loops for knowledge sharing and stakeholder management.</li> <li>Explore the latest best practices and tools for constant up-skilling.</li> </ul> <p>Data Engineering Stack</p> <ul> <li>Programming : <strong>Python</strong> ( expert)level. Ability to create API’s on python.</li> <li>Database : PostgreSQL, Amazon Redshift</li> <li>Warehouse : <strong>Snowflake</strong>, S3</li> <li>ETL : <strong>DBT</strong> + Custom-made Python</li> <li>Business Intelligence / Visualization : M+ Google Data Studio</li> <li>Frameworks : Spark + Dash + <strong>Stream Lit</strong></li> <li>Collaboration : Git, Notion</li> <li>Cloud Platform- AWS</li> </ul> <p>Qualification Prerequisites</p> <ul> <li>Industry experience of minimum 12 years (2 years+ in snowflake)</li> <li>Experience managing a team of at least 4 developers end-to-end</li> <li>Strong hands-on data modelling and data warehousing skills</li> <li><strong>Snowflake Certification is mandatory</strong>.</li> <li>Strong experience applying software engineering best practices to data and analytics scope (e.g. version control, testing, and CI/CD)</li> <li>Strong attention to detail to highlight and address data quality issues</li> <li>Excellent time management and proactive problem-solving skills to meet critical deadlines <strong>#LI-Onsite #LI-SG1</strong></li> </ul> <p>&nbsp;</p>