At 24 Hour Fitness, for many years, operations and development teams have gone through the pain of trying to manage and deploy application configurations with data stored in many files and locations across the ecosystem. The DevOps team was tasked with architecting and implementing a simple, reliable, highly available, testable solution to meet the growing needs of their applications. Through the combined use of Consul and Vault, they have successfully transformed the business.
In this talk, 24 Hour Fitness Senior DevOps Engineer Jason Yoe will describe the challenges faced, the overall design, and the implementation of the solution.
Transcript
It was a Wednesday afternoon before Thanksgiving, and I was finishing a few things up before I was heading out, when the team received an email talking about an intermittent problem that was happening with one of our sales applications.
A user, when selecting an option in our web application, was receiving an error. Not everybody would receive an error when they'd select the option, and as a matter of fact, when this user clicked the option again, they were successful. From the email, it was uncertain when this problem had begun and how many sales had been affected.
Flash forward an hour later, the issue had been escalated. The DevOps team is on a WebEx call with many agitated people. Agitated managers are in our cubicle aisle; they're looking for answers.
It's understandable. It's the day before Thanksgiving, it's a production issue, and it's our sales application. So the team's frantically looking through logs, we're checking deployments that had happened the previous days, looking for any sign of what could have happened. Finally, we stumble upon an email about a configuration change that had happened the previous week.
This change was made directly to an application instance, but unfortunately those instances were not bounced at the time of the change. When a code deploy occurred a few days later, the applications were restarted and the change was applied.
Unfortunately, the configuration change was missed on one of the application instances, hence the intermittent issue. Of course, this change had been applied in other environments and had been tested, but unfortunately the person applying this change in production had made a few small mistakes. So, we had to transform our configuration management.
Transformation is continuous improvement
How do we accomplish this transformation? Well, transformation is really all about continuous improvement. We all envision this glorious end state, where all of our application instances are containerized and we have systems that detect and fix errors while we go get more coffee.
But the reality is nowhere near that. We never really reach that nirvana where there are no challenges, there are no issues. There are always going to be challenges. The goal is to face better challenges.
When I was 15 years old, I had to walk everywhere. I had to walk to school, I had to walk to work, I had to walk to my friends' houses. When I improved my situation and saved up and bought a car, I no longer had to walk anywhere. I still had to pay for gas, I had to pay for car insurance, and I had to get my oil changed.
But these challenges were definitely better than having to walk everywhere.
Today I'm going to discuss the steps we use for continuous improvement. I will also dive into the 24 Hour Fitness case study, where we use Consul and Vault as an integral part of the solution, and then I'll go over some of the new, better challenges in our continuous improvement journey.
My name is Jason Yoe. I'm a senior DevOps engineer at 24 Hour Fitness. 24 Hour Fitness is the second-largest fitness chain in the world. In my previous roles, I worked at Cisco as a cloud engineer, and then as a technical architect at AT&T, so I have about 20 years' experience in the industry.
The continuous improvement process
The 4 steps in the continuous improvement process are:
- Identify the challenges
- Find the value
- Define the path
- Walk the path
The first step is to identify the challenges, and this step is really about communication. We had many sessions with development teams and operations teams. It's really about gathering all the positives, the negatives, issues, roles, responsibilities, and all the processes that were currently in place for configuration and secrets control.
And as you will see in our case study, we found some common challenges that came from these sessions.
The second step is to find the value. The idea is to make the simple changes that provide the greatest benefit. We don't have to solve for everything. Because in continuous improvement, we will eventually face those other challenges, and we'll overcome them. It's really about identifying those key changes that bring the most value right now.
Also remember that value changes are definitely momentum-builders, especially with executive leadership, and they help us along that journey of continuous improvement.
Once we find the value challenges, we end up defining the path, and this is really about building the requirements, it's about architecting the solution, it's about defining those processes, those business processes that other teams need to follow.
It's also defining the tasks that need to be executed. This step also includes the research that goes into it, the technology, the processes. The key point is to look at the successes that are inside your organization and outside your organization.
The next step is walk the path. This is obviously merely the execution of the tasks, it's implementing the technology, and it's socializing the business processes, so it's making sure everyone in the organization's on the same page, that we're all moving in the same direction.
And remember, the improvement journey doesn't end with this step; it continues back again.
The case study
Let us dive into the case study at 24 Hour Fitness. We held sessions across the organization with development teams and our operations teams, and we wound up with a pretty extensive list.
There were a lot of issues that people brought up, but we identified the 4 challenges whose resolution would provide the most benefit for our company:
- Eliminate configuration sprawl
- Eliminate secrets sprawl
- Define a consistent lifecycle process for configuration management
- Define a consistent lifecycle process for secrets management
The first challenge was to eliminate configuration sprawl. In our environment, we had configurations in many different files, in many different directories, and it wasn't consistent across applications. Different locations and different applications.
We also had different configuration management tools. Some configurations were managed by Chef, some were managed by SVN, some were stored locally on the instance and were managed there.
Our second challenge was to eliminate secrets sprawl. We had secrets that were stored in multiple files and multiple locations, and it wasn't consistent across applications.
One of the big problems was changing passwords. To stay compliant, we have to change our passwords every so often, particularly our database passwords, and our database passwords were stored on local application instances, albeit encrypted. But this process really was tough at 3:00 in the morning on a Sunday when an application wouldn't start because a password change had been missed.
We also had to define a consistent lifecycle process for both configurations and secrets. In talking to different teams, we had multiple processes for how to create, update, store, delete these configurations and secrets.
A lot of it had to do with the fact that each team had multiple tools to manage this, but there was also no clear role definition or responsibility definition for the teams.
What was needed in the solution
Once we identified the value challenges, we were able to define the path. So we built our requirements, we did our research, we architected a solution with specific tasks to be executed. I've highlighted some of the high-level directives for the solution:
- Implement a single source for configurations using the Consul KV store
- Implement a single source for secrets using Vault
- All applications reference the single source for configurations and secrets
- Applications can receive runtime configuration changes
- Implement consistent configuration and secrets data structure and naming
- Consul agent running in client mode on all application instances
- Secure access to Consul by implementing ACLs
  - Only admin team has access to UI
  - Apps have own token and policy to access specific key prefixes
- Secure access to Vault
  - Only admin team has access to UI
  - Apps have own token and policy to access specific secrets
The first one is to implement a single source for configurations using the Consul KV store. The next one is to implement a single source for secrets using Vault. What these both resolve is that issue of configuration sprawl, with configurations in multiple files, multiple locations.
When somebody needed to find out where a configuration was, there was no documentation, and we had to hunt through directories to find it. Now we're implementing a single source of truth, where we can go and make changes and manage these configurations.
To go along with that, all of our applications need to be able to access that single source of truth, so if you make a configuration change, all of the instances get that change, instead of having to go from instance to instance and validate that they've got that change.
Another requirement was these applications need to receive runtime configuration changes. In today's world, we want to be a 24-hour shop. Our applications always need to be up for our customers, especially at 24 Hour Fitness. We have people working out at the gym at 2:00 in the morning.
And so we want our applications available to everybody, and the idea is, if we're making configuration changes, we need these applications to take them hot; we don't have time to restart our applications for that change to take effect.
We also need to implement a consistent configuration and secrets data structure and naming. Different teams had different naming standards for their keys and their values and their configurations.
We also wanted to build a structure of, when we talk about a particular value, where is that value found? So that everybody understands and speaks the same language.
To implement all this, we need a Consul agent running on all our application instances, so that they can access their data. We also want to secure access to Consul by implementing ACLs (Access Control Lists).
With the Consul UI, anybody has access to change, modify, add, delete any kind of keys. This creates a problem. We don't have an audit trail, we don't know when it happened.
The idea is to lock the UI down and provide another management tool for developers to create configurations, delete configurations, add configurations.
Applications will access their data using a token and policy procedure.
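As a rough illustration of the token and policy idea, the sketch below generates read-only Consul ACL rules for one application. It assumes the newer ACL system introduced in Consul 1.4, where rules are written as `key_prefix` blocks. The function name is hypothetical, and the `default/` and `env/<environment>/<application>/` prefixes follow the naming convention covered later in the talk. The talk does not show how the actual policies were written.

```python
def app_policy_rules(env: str, app: str) -> str:
    """Return read-only ACL rules for one application's key prefixes."""
    # Assumption: an app reads shared keys under default/ and its own keys
    # under env/<env>/<app>/; nothing here grants write access.
    prefixes = ["default/", f"env/{env}/{app}/"]
    blocks = [
        f'key_prefix "{prefix}" {{\n  policy = "read"\n}}'
        for prefix in prefixes
    ]
    return "\n\n".join(blocks) + "\n"


if __name__ == "__main__":
    print(app_policy_rules("dev", "app1"))
```

The generated text could then be attached to a policy, with a token created for the application against that policy.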
We also want to secure access to Vault. We want to secure that access to the UI. We don't want everybody to come in and add secrets. We want a single admin team to be able to manage that, and then applications are going to use tokens and policies to access their data.
Being a 24-hour shop, we need our applications up all the time, and a key to that is to always have a functional path between our application instances and the Consul and Vault servers.
If an application needs to get a configuration change at 2:00 in the morning, we need to have that path available. During maintenance windows, we still need a path available from the instances to our Vault or Consul servers.
We also need to implement a consistent process for managing these configurations and secrets across all the teams, so everybody understands and knows this is how you accomplish this task.
Along those lines, we wanted to create a single point of management, one that had a historical view of configuration changes, so we know who changed what and when they changed it.
We needed the ability to roll back changes, and perform some sort of code review of the configurations before implementation.
And lastly, for our production environment, we wanted to insert some approval process, so that we get approval before we roll changes out to production.
The first iteration
Our initial implementation started in about November of 2018. That month we spent a week or 2 to implement the Consul and Vault architecture, then we proceeded on building the processes for configuration and secrets management.
Then we spent time socializing those processes across the organization. That took a little while.
Since then it's been onboarding the applications into our process and into our architecture. And of course the team continues to go through the continuous improvement process of identifying the challenges, finding the value, defining the path, and walking the path.
This slide shows an overview of our Consul architecture. Pretty simple.
We used the open-source version of Consul, version 1.4.2. We have 2 datacenters and a single Consul cluster that spans both those datacenters, so we have Consul servers in both Datacenter 1 and 2.
We have a global entry point that's load-balanced between the 2 datacenters, so any communication coming in can either go to Datacenter 1 or Datacenter 2, and hit Consul servers in either datacenter. So if we're doing maintenance work in Datacenter 1, or there are network issues, we still have that single path from application instances to Consul.
The same architecture applies with Vault. We use the open-source version, 0.11.2. Again, 2 datacenters. We have a single Consul cluster as the storage backend for Vault.
We have Vault servers in Datacenter 1 and Vault servers in Datacenter 2, running in active/standby mode, and a single point of entry globally that's load-balanced between both datacenters, so that we always have that single point of access to Vault and applications can get their secrets.
Defining a naming convention
Once we built out the architecture, we had to define a consistent configuration-structure naming convention. This is the language that we communicate to all the teams: "When you're going to create key-values, your application's going to use these key-values. What paths do I access these on?"
We came up with a standard that was consistent across all the teams. In Consul, the KV store endpoints are referenced by a keypath, and for configurations across applications, we decided to use a keypath that starts with /default. These keys are referenced by any application, so any keys along this path, any application has access to them.
For configurations for specific applications, the keypath would start with /env, an environment name, and the application name. This would specify a path for an application for them to get their key-values.
For example, for app1, its keypath would start with /env/dev/app1, and that would be different in a different environment. For QA, it would be /env/qa/app1. So not only do we segment this by application, but we do so by environment as well.
For host-specific configurations and those instances where a specific host needs something a little bit different from the application itself, we start with /host, and then the hostname and the key.
This is just a consistent naming convention that we wanted to put out there across all of our teams in the organization so we were speaking the same language. For our key names, it's the words separated by a dash, so the key name would be application-url.
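The convention above can be sketched as a few small helpers. This is only an illustration: the function names are hypothetical, and the leading slash is dropped on the assumption that the keys are stored in Consul without one.

```python
import re

# Key names are lowercase words separated by dashes, e.g. "application-url".
_KEY_NAME = re.compile(r"^[a-z0-9]+(-[a-z0-9]+)*$")


def is_valid_key_name(name: str) -> bool:
    """Check a key name against the dash-separated convention."""
    return bool(_KEY_NAME.match(name))


def default_key(key: str) -> str:
    """Keys shared by every application."""
    return f"default/{key}"


def app_key(env: str, app: str, key: str) -> str:
    """Keys for one application in one environment."""
    return f"env/{env}/{app}/{key}"


def host_key(hostname: str, key: str) -> str:
    """Keys that apply to one specific host."""
    return f"host/{hostname}/{key}"


if __name__ == "__main__":
    print(app_key("dev", "app1", "application-url"))
    print(app_key("qa", "app1", "application-url"))
```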
We went ahead and generated a similar situation for Vault. In Vault, our secrets engines simply store application secrets; we don't use it for anything else today. Each application is tied to a Vault secret.
The way we structured it, we have 4 secrets engines, 1 for each environment: /apps/dev, /apps/qa, /apps/staging, and /apps/prod. Within those secrets engines we have the secrets, which are tied to each application. For example, we have a secret for app1 and a secret for app2, and their values are stored as key-values within each secret.
The naming convention for key names is words separated by a dot. This is different from the configuration value names. For example, it's spring.datasource.properties.user. What this allows us to do is, when we have applications that are trying to get values from Vault or Consul, we can specify the actual keypaths that will locate the values that they need.
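The Vault side of the convention can be sketched the same way. The helper names below are hypothetical, and the sketch assumes each environment's secrets engine is mounted at `apps/<environment>` with one secret per application, as described above.

```python
import re

# Secret key names are lowercase words separated by dots,
# e.g. "spring.datasource.properties.user".
_SECRET_KEY = re.compile(r"^[a-z0-9]+(\.[a-z0-9]+)*$")


def is_valid_secret_key(name: str) -> bool:
    """Check a secret key name against the dot-separated convention."""
    return bool(_SECRET_KEY.match(name))


def secret_path(env: str, app: str) -> str:
    """Path of an application's secret inside its environment's engine."""
    return f"apps/{env}/{app}"


if __name__ == "__main__":
    print(secret_path("dev", "app1"))
```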
How applications get their configurations
This slide shows the configuration load at application startup. Our application instance starts up, and it's going to get 3 values. We have Java applications, so I'm showing you the Java options.
We'll talk a little bit more about these 3 values that the application needs. Basically it's, How am I going to get my values from Consul, and what's going to happen if I need to update my configurations?
Once the application gets those values, it starts up, and it's going to contact the local Consul agent that's sitting on the application instance, and it's going to be able to pull its values. We're going to use these keypaths or key prefixes that then locate the specific configuration information for the application instance.
We want our application instances to get changes hot, so the bottom half of this slide shows the process for the configuration reload, when the application's running.
The way we implemented it was to use watches and handlers. A watch detects if there's a change along a keypath. These keypaths in our architecture become really important, because they identify what we're looking for and where our values are, so that the applications know, so that Consul knows, and so that our teams know.
This watch, when it detects a change—a create, an update, a delete along a specific keypath—it initiates a handler. This handler is simply a script. It's going to pass a signal to our application, our application's going to listen on a specific port, the handler is going to send this signal to that port, and it's going to tell the application, "Reload your configuration data."
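A handler along these lines could be as small as the sketch below. The talk does not say what the signal looks like, so the `reload` message, the function names, and the plain TCP connection to localhost are all assumptions made for illustration.

```python
import socket

# Assumption: the application treats this line as "reload your configuration".
RELOAD_MESSAGE = b"reload\n"


def write_reload(conn: socket.socket) -> None:
    """Write the reload message to an already connected socket."""
    conn.sendall(RELOAD_MESSAGE)


def send_reload(port: int, host: str = "127.0.0.1", timeout: float = 2.0) -> None:
    """Connect to the application's listener port and ask it to reload."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        write_reload(conn)
```

A watch would then run a small script that calls `send_reload` with the application's listener port.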
This slide shows the 3 properties that I talked about that the application needs to pull at startup:
- Server list
- Key prefix
- Listener port
The first one is the server list. This is just the connection that it looks for to find the data. Because we talk to the local Consul agents that are hosted on the application instance, we specify the localhost on port 8500. This could be the VIP for the Consul cluster.
We also specify a key prefix. This goes back to that idea of this standard path for the applications, of how they're going to pull their values. For each application instance, we can list a set of paths, of where they're going to look to get their data.
In this case the key prefix is /env/dev/app1. This would be the instance of application 1 in development, and it's going to look on this path to pull its key-values.
The last piece is the listener port. This is the port that the application's going to listen on when our handler sends a signal to it to reload its configuration data.
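To make the startup load concrete, here is a minimal sketch of how the three properties might be used to read an application's keys from the local agent. It relies on the Consul KV HTTP API, where a recursive read of a prefix returns a JSON list of entries with `Key` and base64-encoded `Value` fields. The class and function names are hypothetical, and the listener port value in the usage example is a placeholder, not one taken from the talk.

```python
import base64
import json
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class StartupProperties:
    server_list: str    # e.g. "localhost:8500" for the local Consul agent
    key_prefix: str     # e.g. "/env/dev/app1"
    listener_port: int  # port the app listens on for reload signals


def kv_url(props: StartupProperties) -> str:
    """URL for a recursive read of everything under the key prefix."""
    prefix = props.key_prefix.strip("/")
    return f"http://{props.server_list}/v1/kv/{prefix}/?recurse=true"


def decode_kv_response(body: str, key_prefix: str) -> Dict[str, str]:
    """Turn a KV API response body into a {key name: value} mapping."""
    prefix = key_prefix.strip("/") + "/"
    config: Dict[str, str] = {}
    for entry in json.loads(body):
        key, value = entry["Key"], entry["Value"]
        # Skip folder entries (null value) and anything outside the prefix.
        if value is None or not key.startswith(prefix):
            continue
        config[key[len(prefix):]] = base64.b64decode(value).decode("utf-8")
    return config


if __name__ == "__main__":
    props = StartupProperties("localhost:8500", "/env/dev/app1", 9000)
    print(kv_url(props))
```

Fetching the URL and passing the response body to `decode_kv_response` would give the application its key-values at startup, and the same call could be repeated when a reload signal arrives.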
More on watches and handlers
A quick dive into watches and handlers for those that aren't familiar. Watches are a way of specifying a view of data which is monitored for updates.
In this case the watcher's looking on /apps/env/dev/app1, and if a key's been created or changed, it's going to initiate the handler, which then sends a signal to a port that the application's listening on to tell it to reload its configuration.
Here's the config.json, which is a configuration file for the Consul agent locally on the application instance, and here we specify the watch. We can see that the path name's been specified, so when any key-value's been changed on this path, the handler script gets executed.
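The slide's file is not reproduced in the transcript, so the sketch below only shows what a watch definition of this kind could look like, generated from Python for consistency with the other examples. It uses Consul's `keyprefix` watch type with a script handler. The function name and the handler script path are hypothetical.

```python
import json


def watch_config(key_prefix: str, handler_script: str) -> dict:
    """Build an agent configuration fragment with one keyprefix watch."""
    return {
        "watches": [
            {
                "type": "keyprefix",
                # Fire on any create, update, or delete under this prefix.
                "prefix": key_prefix.strip("/") + "/",
                "handler_type": "script",
                "args": [handler_script],
            }
        ]
    }


if __name__ == "__main__":
    # The handler path is a placeholder for illustration only.
    fragment = watch_config("/apps/env/dev/app1", "/opt/consul/handlers/reload-app1.sh")
    print(json.dumps(fragment, indent=2))
```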
How applications get their secrets
This is a similar process. The application starts up, it's going to get a few properties to know how to contact Vault, and get its information that it needs.
Once it starts up, it contacts the Vault URL to try to load its secrets. The way we manage that is token and policy and the keypath. We're giving it a path, and it's looking for specific secrets along that path, and we can lock that down with policies and tokens.
Taking a look at the properties, the first one is the Vault URI. This is how the application instance is going to connect to the Vault cluster to get its data.
The Vault path is simply that path to the data or the secrets that the application needs. And the Vault token is how we secure that data. So this application has a specific token to a specific path, and no other applications can access this data.
This slide shows a deeper dive into it. Each application has an associated secret, as we saw before, and the secrets engines are apps/dev, apps/staging, apps/qa. The secrets underneath are associated to an application. And each application has an ACL policy, which then says, "Who has access to this secret?"
There's a token associated with that. When the application calls Vault, it uses that token, which is then validated by the policy to say, "Yes, you have access to the data along this path."
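A minimal sketch of that pairing follows: a read-only policy for one application's secret, and the HTTP request the application would make with its token. It assumes a version 1 key-value secrets engine mounted at `apps/<environment>`; with a version 2 engine the API path would include an extra `data/` segment. The function names, the Vault address, and the token value are placeholders.

```python
import urllib.request


def vault_policy(env: str, app: str) -> str:
    """Policy text that grants read access to one application's secret."""
    return f'path "apps/{env}/{app}" {{\n  capabilities = ["read"]\n}}\n'


def read_secret_request(vault_uri: str, env: str, app: str,
                        token: str) -> urllib.request.Request:
    """Build (but do not send) the request that reads the app's secret."""
    url = f"{vault_uri.rstrip('/')}/v1/apps/{env}/{app}"
    # Vault reads the client token from the X-Vault-Token header.
    return urllib.request.Request(url, headers={"X-Vault-Token": token}, method="GET")


if __name__ == "__main__":
    print(vault_policy("dev", "app1"))
    request = read_secret_request("https://vault.example.test:8200", "dev", "app1", "placeholder-token")
    print(request.get_method(), request.full_url)
```

A token issued against this policy can read `apps/dev/app1` and nothing else, which matches the idea that no other application can access the data.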
Updating, creating, and deleting configurations and secrets
For the new process, we wound up choosing GitLab for source control. It does branching really well, it allows a semblance of code review through merge requests, it has an audit history so we know who changed configurations and when they changed them, and it does rollback, so we can roll back the changes.
The big piece, though, in our organization and in many organizations, is there wasn't a high learning curve. Because developers already use this for source control for the code that they write.
And we chose Jenkins as the orchestrator to run the scripts that will call the Consul APIs to implement these changes. It integrates with GitLab, and the teams that would manage this are already familiar with Jenkins and the processes.
Let's take a look at the process of what we socialize to the teams about how you're going to add and update and delete these configurations.
A developer's going to come in and log into GitLab. We have a properties repository, and the developer's going to update a specific property file, which we segment by environment, for now at least.
The developer pushes the branch to origin. The developer then submits a merge request, which allows a peer review or a team review to come in and check and validate the configuration. The merge is committed, a webhook triggers a Jenkins job, and control's passed over to Jenkins.
The Jenkins job calls a script, and this script creates a list of all the changes that have just occurred: any additions, any modifications, any deletions, any actions that are going to be taken against the Consul database.
Once it generates that list, a second script calls the Consul APIs, and it's going to perform those actions. Regardless of how many additions, modifications, or deletions we have, the second script is going to use the Consul APIs to make those modifications.
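The two scripts are not shown in the talk, so the following is only a sketch of the idea: compare the key-values before and after the merge, then turn each difference into a Consul KV API call (`PUT /v1/kv/<key>` to create or update, `DELETE /v1/kv/<key>` to remove). All names here are hypothetical.

```python
import urllib.request
from typing import Dict, List, Tuple

# (HTTP method, Consul KV API path, value to store)
Action = Tuple[str, str, str]


def plan_changes(old: Dict[str, str], new: Dict[str, str]) -> List[Action]:
    """First step: list the additions, modifications, and deletions."""
    actions: List[Action] = []
    for key in sorted(new):
        if key not in old or old[key] != new[key]:
            actions.append(("PUT", f"/v1/kv/{key}", new[key]))
    for key in sorted(old.keys() - new.keys()):
        actions.append(("DELETE", f"/v1/kv/{key}", ""))
    return actions


def to_request(base_url: str, token: str, action: Action) -> urllib.request.Request:
    """Second step: build (but do not send) the API request for one action."""
    method, path, value = action
    data = value.encode("utf-8") if method == "PUT" else None
    return urllib.request.Request(
        base_url.rstrip("/") + path,
        data=data,
        method=method,
        headers={"X-Consul-Token": token},  # ACL token for the job
    )
```

Keeping the planning step separate from the API calls means the list of actions can be logged or reviewed before anything touches Consul.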
Then control's handed over to Consul, where the APIs actually do the work.
The only difference in this process between environments is that, in our production environment, we want some sort of approval gate before these configurations are passed on. So an approval process goes through, people make the approvals, and then the merge happens and it kicks off the Jenkins job.
This slide shows an example of our property file that houses the key-values for the application. This is in a GitLab repository. Each one of these property files contains the key and value. Each environment has its own property file. Here's an example of the consul-config-dev.txt file.
As you can see, we have keys and values for app1 in the dev environment. It only shows 1 application, but ours would also include app2, app3, app4, and so on.
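The file contents are not reproduced in the transcript, so the exact format is unknown. Purely as an assumption, the sketch below treats each line as `keypath=value`, ignores blank lines and `#` comments, and parses the file into a dictionary. The function name is hypothetical.

```python
from typing import Dict


def parse_properties(text: str) -> Dict[str, str]:
    """Parse 'keypath=value' lines into a dictionary (assumed format)."""
    entries: Dict[str, str] = {}
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first "=" only, so values may contain "=" themselves.
        key, separator, value = line.partition("=")
        if not separator or not key.strip():
            raise ValueError(f"line {number}: expected keypath=value")
        entries[key.strip()] = value.strip()
    return entries
```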
The process for managing secrets
This is a little bit different. It starts with a requester sending encrypted emails with some information: the secrets engine, the secret, and a key, where the key is going to be something like a username or client ID.
The great thing about the structure and naming convention that we implemented is, our secrets engine is the environment name, right? And the secret itself is the application name. So the developer doesn't have to figure out these random names for secrets engines and secrets.
They simply go, "I want to update app1 in dev, so..." There you go: secrets engine is dev, the secret is app1, and then they can provide the key. They send a second encrypted email containing the value, which in this case could be a password or client secret, and these emails go to the Vault admin team. Vault admin team gathers the data, they log into Vault, and they update and make the changes.
Improving the processes
As part of our continuous improvement journey, we constantly cycle through these steps that we've been going over. Here's a snippet of some of the new challenges that we face:
- Issues with different Consul agents in different VLANs communicating with each other for health status (network segmentation)
- Dynamic password generation for applications and database
- Devise a better process for secrets administration
- Split property files into individual app files instead of environment
One is issues with Consul agents in different VLANs communicating with each other for health status. Obviously, we have application instances in multiple VLANs, and a lot of them are protected by firewalls. We have some application instances in the DMZ. The problem arises when the Consul agents are communicating with each other for health status.
We have all these instances with Consul agents, they're trying to talk to each other for health status, and they're being blocked by firewalls. We could open up the firewall ports, which we did initially, but one of the problems we saw with this is for our Vault cluster.
On the Vault servers, we have a Consul agent, because the Consul database is the backend for Vault. What we saw is, when applications in the DMZ couldn't communicate with the Consul agents on the Vault server, they were reporting them as unhealthy. And Vault was going through a process of electing a new active server.
Normally it's not a problem, except when the applications start up, they need to grab their secrets. So they go to the Vault cluster, they try to grab their secret, there's no active server at the moment, they can't start up, they can't get their secrets, applications fail.
Obviously, one of the resolutions is to implement some sort of network segmentation to separate agents talking to other agents, create separate network segmentations in different VLANs so only those agents can talk to each other, and not across.
Another challenge is that our password rotation is a manual process. Even though we now house passwords in Vault, the creation of those passwords, the sending of the email, the implementing of those passwords is a manual process.
So we need to move to a more dynamic password generation for our applications and database. And along with that, a better process for secrets administration. Sending encrypted emails is great, but when you're doing this a lot of times, it starts getting confusing. "Wait, this password goes to which other email?" It's a challenge that we have to overcome.
The last thing is with the property files. As we saw, we have our property files segmented by environment. All the key-values for dev are in a single file, and all the key-values for QA are in a single file.
In the beginning this was great, because it was the one spot developers could look, and they can say, "Oh yeah, here are all my values."
But as these configurations proliferated and we brought more applications on board, this file is just untenable; it's very big. One solution is to split these property files out, per application. So app1 would have its own property file that would contain its own key-values. That way we'd use the same process, but now we'd have separate property files per application.
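One way the split could work, assuming the `env/<environment>/<application>/<key>` keypaths described earlier, is to group an environment's entries by application name. The function name and the `_shared` bucket for keys outside that layout are hypothetical.

```python
from collections import defaultdict
from typing import Dict


def split_by_app(entries: Dict[str, str]) -> Dict[str, Dict[str, str]]:
    """Group keypath -> value entries by the application in the keypath."""
    per_app: Dict[str, Dict[str, str]] = defaultdict(dict)
    for key, value in entries.items():
        parts = key.strip("/").split("/")
        if len(parts) >= 4 and parts[0] == "env":
            per_app[parts[2]][key] = value  # env/<env>/<app>/<key...>
        else:
            # Keys such as default/... or host/... do not belong to one app.
            per_app["_shared"][key] = value
    return dict(per_app)
```

Each group could then be written to its own property file, leaving the rest of the process unchanged.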
The value of these processes
Today I talked about 24 Hour Fitness' transformation using Vault and Consul. We identified the challenges, we found the value, we defined the path, and we walked the path. But how did overcoming these challenges bring value to us? That's the key point, right? What value did we get from this?
Let's take a look back at the opening story where a configuration change was missed on a single application instance. Because we implemented that single source of truth, today that scenario would never exist.
We aren't going to individual instances and making changes. We make a configuration change in Consul through our process, and all the application instances get that immediately. So that scenario doesn't even exist, and everybody's happy, they get to start vacation early and go home.
Of course, solving these challenges also allows us at 24 Hour Fitness to dedicate more time to what's important: working out. We're a fitness company; we're supposed to work out. That is, until the next challenge.
Thank you very much.