By Jessica Chan | Engineering Manager, MySQL & Key-Value Storage
Engineers hate migrations. What do engineers hate more than migrations? Data migrations. Especially critical, terabyte-scale, online serving migrations which, if done badly, could bring down the site, enrage customers, or cripple hundreds of critical internal services.
So why did the Key-Value Systems Team at Pinterest embark on a two-year realtime migration of all our online key-value serving data to a single unified storage system? Because the cost of not migrating was too high. In 2019, Pinterest had four separate key-value systems owned by different teams with different APIs and feature sets. This resulted in duplicated development effort, high operational overhead and incident counts, and confusion among engineering customers.
In unifying all of Pinterest’s 500+ key-value use cases (over 4PB of unique data serving 100Ms of QPS) onto one single interface, not only did we make huge gains in reducing system complexity and lowering operational overhead, we achieved a 40–90% performance improvement by moving to the most efficient storage engine, and we saved the company a significant amount in costs per year by moving to the most optimal replication and versioning architecture.
In this blog post, we dive into three of the many innovations that helped us notch all these wins.
But first, some background
Before this effort, Pinterest used to have four key-value storage systems:
- Terrapin: a read-only, batch-load, key-value storage built at Pinterest on top of HDFS and featured in Designing Data-Intensive Applications
- Rockstore: a multi-mode (readonly, read-write, streaming-write) key-value storage also built at Pinterest, based on the open-source Rocksplicator framework, written in C++, and using RocksDB as a storage engine
- UserMetaStore: a read-write key-value storage with a simplified thrift API on top of HBase
- Rocksandra: a read-write, key-value storage based on a version of Cassandra, which used RocksDB under the hood
One of the biggest challenges when consolidating to a single system is assessing the feasibility of both achieving feature parity across all systems and integrating those features well into a single platform. Another challenge is to determine which system to consolidate to, and whether to go with an existing system or to consider something that doesn’t already exist at Pinterest. And a final, nontrivial challenge is to convince leadership and hundreds of engineers that migrating in the first place is a good idea.
Before embarking on such a large undertaking, we had to step back. A working group dedicated a few months to deep-dive on requirements and technologies, analyze tradeoffs and benefits, and come up with a final proposal that was ultimately approved. Rockstore was chosen as the one storage system to rule them all: it was the most cost-efficient and performant, the simplest to operate and extend, and offered the lowest migration cost.
We won’t describe the entire migration project in this post, but we’ll highlight some of the best parts.
Innovation 1: API abstractions allow us to seamlessly migrate customer data
We know that in code, strong abstractions lead to cleaner interfaces and more flexibility to make changes “under the hood” without disruption. The same is true of organizations. Each of the four storage systems had its own thrift API abstraction, but having four interfaces made life difficult for both customers and platform owners, especially since some of them, like Terrapin, still required customers to know internal details about the architecture in order to use them (a leaky abstraction).
A diagram helps illustrate the complexity of maintaining four separate key-value storage systems. If you were a customer, which would you choose?
Figure 1: Four separate Key Value Systems at Pinterest, each with their own APIs, set of unique features and underlying architectures, and varying degrees of performance and cost.
We introduced a new API, aptly called the KVStore API, to be the new unified thrift interface that would absorb the rest. Once everyone is on a single unified API that is built with the intention to be general, the platform team can have the flexibility to make changes, even change storage engines, under the hood without involving customers. This is the ideal state:
Figure 2: The ideal state is a single unified Key-Value interface, reducing the complexity both for customers and for platform owners. When we can consolidate our resources as a company and invest in a single platform, we can move faster and build better.
The migration to get from four systems to the ideal one above was split into two phases: the first, targeting read-only data, and the second, targeting read-write data. Each phase required its own unique migration strategy to be the least disruptive to customers.
Phase 1: Read-only data migration (totally seamless)
The read-only phase came first because it was simpler (immutable data is easier to migrate than mutable data receiving live writes) and because it targeted the majority of customers (about 70% were using Terrapin). Because Terrapin was so pervasive and established in our code base, having everyone migrate their APIs to access KVStore would have taken a ton of time and effort with very little incremental value.
We decided instead to migrate most Terrapin customers seamlessly: no changes were required of users calling Terrapin APIs, but unbeknownst to callers, the Terrapin API service was augmented with an embedded KVStore API library to retrieve data from Rockstore. And because Terrapin is a batch-loaded system, we also found a central base class and rerouted workflows to double-load data into Rockstore in addition to Terrapin (and then eventually we cut Terrapin off).
Figure 3: By introducing a routing layer between the Terrapin APIs and the Terrapin leaf storage, we can achieve a data migration and eliminate the costly and less stable Terrapin storage system for immediate business impact, all without asking customers to take any action. The tradeoff here is the tech debt and layer of indirection: we are now asking customers to clean up their usage of the Terrapin API in order to directly call KVStore API.
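To make the routing layer concrete, here is a minimal Python sketch of the idea, not the production implementation. All names (`InMemoryBackend`, `TerrapinApiService`, `mark_migrated`) are hypothetical and invented for illustration; the real services are thrift-based and their signatures are not shown in this post. The point is only that the caller-facing method stays the same while the backend behind it changes per dataset.

```python
from typing import Dict, Optional, Set


class InMemoryBackend:
    """Hypothetical stand-in for a storage backend (Terrapin leaf storage or Rockstore)."""

    def __init__(self, data: Dict[str, Dict[bytes, bytes]]) -> None:
        self._data = data  # dataset name -> {key: value}

    def get(self, dataset: str, key: bytes) -> Optional[bytes]:
        return self._data.get(dataset, {}).get(key)


class TerrapinApiService:
    """Keeps the legacy call signature but routes migrated datasets to the new backend."""

    def __init__(self, terrapin: InMemoryBackend, kvstore: InMemoryBackend) -> None:
        self._terrapin = terrapin
        self._kvstore = kvstore  # plays the role of the embedded KVStore API library
        self._migrated: Set[str] = set()

    def mark_migrated(self, dataset: str) -> None:
        # Flipped by the platform team once the dataset is double-loaded and verified.
        self._migrated.add(dataset)

    def get(self, dataset: str, key: bytes) -> Optional[bytes]:
        backend = self._kvstore if dataset in self._migrated else self._terrapin
        return backend.get(dataset, key)


# With double-loading, both backends hold the same data. The values differ here
# only so that the routing decision is visible.
terrapin = InMemoryBackend({"pins": {b"1": b"served-by-terrapin"}})
rockstore = InMemoryBackend({"pins": {b"1": b"served-by-rockstore"}})
service = TerrapinApiService(terrapin, rockstore)

before = service.get("pins", b"1")
service.mark_migrated("pins")
after = service.get("pins", b"1")  # same call site, different storage system
```

The caller's code is identical before and after the cutover, which is what makes this phase seamless and also what leaves the layer of indirection behind as tech debt.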
Because Rockstore was more performant and cost-efficient than Terrapin, users saw a 30–90% decrease in latency. When we decommissioned the storage infrastructure of Terrapin, the company also saw $7M of annualized savings, all without users needing to lift a finger (with just a few exceptions). The tradeoff is that we now have some tech debt of ensuring that users clean up their code by moving off of deprecated Terrapin APIs and onto KVStore API so that we no longer have a layer of indirection.
Phase 2: Read-write data migration (partially seamless)
The read-write side presented a different picture: there were fewer than 200 use cases to tackle, and the number of call sites was less extreme, but building feature parity for a read-write system as opposed to read-only involved some serious development. In order to be on par with UserMetaStore (essentially HBase), Rockstore needed a brand new wide-column format, additional consistency modes, offline snapshot support, and higher durability guarantees.
While the team took the time to develop these features, we decided to “bite the bullet” and ask all users to migrate from UserMetaStore’s API to KVStore API from the get-go. The benefit of doing this is it’s a low-risk, low-effort move. Thanks again to the power of abstraction, we implemented a reverse proxy so that customers moving to KVStore API were actually still calling UserMetaStore under the hood. By making this small change now, customers were buying a lasting contract that wouldn’t require such changes again for the foreseeable future.
Figure 4: Instead of taking the same approach as we did with Terrapin in Figure 3, we decided asking customers to migrate their APIs up front made more sense for unifying the read-write storage systems. Once customers moved to our KVStore API abstraction layer, we were free to move their data from UserMetaStore to Rockstore under the hood.
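The API-first approach can be sketched as follows. This is a simplified illustration under stated assumptions: `KVStoreApi`, `DictBackend`, `swap_backend`, and `backfill` are hypothetical names, and the sketch ignores the hard part of a live migration, namely handling writes that arrive while the backfill is running.

```python
from typing import Dict, Optional, Protocol


class Backend(Protocol):
    def get(self, key: str) -> Optional[str]: ...
    def put(self, key: str, value: str) -> None: ...


class DictBackend:
    """Hypothetical in-memory stand-in for UserMetaStore or Rockstore."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.rows: Dict[str, str] = {}

    def get(self, key: str) -> Optional[str]:
        return self.rows.get(key)

    def put(self, key: str, value: str) -> None:
        self.rows[key] = value


class KVStoreApi:
    """The stable contract customers code against; the backend is an internal detail."""

    def __init__(self, backend: Backend) -> None:
        self._backend = backend

    def get(self, key: str) -> Optional[str]:
        return self._backend.get(key)

    def put(self, key: str, value: str) -> None:
        self._backend.put(key, value)

    def swap_backend(self, new_backend: Backend) -> None:
        # Platform-side operation; customers never call this.
        self._backend = new_backend


def backfill(source: DictBackend, target: DictBackend) -> None:
    """Copy every row from the old system to the new one (no live-write handling)."""
    for key, value in source.rows.items():
        target.put(key, value)


user_meta_store = DictBackend("UserMetaStore")
rockstore = DictBackend("Rockstore")

api = KVStoreApi(user_meta_store)      # step 1: customers adopt the new API
api.put("user:1", "profile-v1")        # ...but data still lands in UserMetaStore

backfill(user_meta_store, rockstore)   # step 2: platform team moves the data
api.swap_backend(rockstore)            # step 3: cut over with no customer changes

api.put("user:2", "profile-v1")        # new writes now land in Rockstore
```

Customers change their call sites once, in step 1, and steps 2 and 3 happen without their involvement.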
Some of the biggest challenges were actually not technical. Finding owners of the data was an archeological exercise, and holding hundreds of owners accountable for completing their part was difficult due to competing priorities. But when it was done, and when the Rockstore platform was ready, the team was completely unblocked to backfill the data from UserMetaStore to Rockstore without any customer involvement. We also vowed to make sure all data was attributed to owners going forward.
Innovation 2: A wide-column format eliminated both CPU and network load for large payloads
Some of the most popular Terrapin workloads had an interesting property: use cases would store values consisting of large blobs of thrift structures but only need to retrieve a very small piece of that data when read.
At first, these callers would download the huge values that they stored, deserialize them on the client side, and read the property they needed. This very quickly revealed itself to be inefficient in terms of unnecessary network load, throughput degradation, and wasteful client CPU utilization.
The Terrapin solution to this was to introduce an API feature called “trimmer,” where you could specify a Thrift struct and the fields you wanted from it in the request itself. Terrapin would not only retrieve the object, it would also deserialize it and return only the fields requested. This was better in that the network bandwidth was reduced, which is especially important for reducing cross-AZ traffic costs, but it was worse in terms of both platform cost and leaky abstractions. More CPU utilization meant more machines were needed, and business logic in the platform meant that Terrapin needed to know about required thrift structures. Performance also took a hit, since clients had to wait for this additional processing time.
To solve this in Rockstore and unblock the migration, the team decided against simply re-implementing the trimmer. Instead, we introduced a new file format that accommodated a wide-column access pattern. This means that instead of storing a binary blob of data that can be deserialized into a thrift structure, you can store and encode your data structure in a native format that can be retrieved like a key-value pair using a combination of primary keys and local keys. For example, if you have a struct UserData with 30 fields stored per user id, instead of storing a single key-value pair of (key: user id, value: UserData), you can store (key: user id, (local key: UserData field 1, local value: UserData value 1), (local key: UserData field 2, local value: UserData value 2), etc.).
The API is then designed to allow you to access either the entire row (all columns associated with a user id) or only certain properties (UserData fields 3 and 12 of a user id). Under the hood, Rockstore is performing a blazing fast range scan or single-point key-value lookup. This accounted for some of the more extreme performance improvements that we ultimately observed. Goodbye network and CPU costs!
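The access pattern can be illustrated with a small Python sketch. This is not Rockstore's actual file format or API: the composite-key encoding (primary key, a NUL separator, then local key), the `WideColumnStore` class, and its method names are all assumptions made for the example. It shows why a full-row read maps to a range scan over a sorted key space and a partial read maps to point lookups, with no server-side deserialization in either case.

```python
import bisect
from typing import Dict, Iterable, List

SEP = "\x00"  # assumed separator; primary keys must not contain it


class WideColumnStore:
    """Toy sorted key-value store addressed by (primary key, local key)."""

    def __init__(self) -> None:
        self._sorted_keys: List[str] = []   # composite keys, kept sorted
        self._values: Dict[str, bytes] = {}

    @staticmethod
    def _composite(primary: str, local: str) -> str:
        return f"{primary}{SEP}{local}"

    def put(self, primary: str, local: str, value: bytes) -> None:
        key = self._composite(primary, local)
        if key not in self._values:
            bisect.insort(self._sorted_keys, key)
        self._values[key] = value

    def get_columns(self, primary: str, local_keys: Iterable[str]) -> Dict[str, bytes]:
        """Partial read: one point lookup per requested column."""
        result: Dict[str, bytes] = {}
        for local in local_keys:
            value = self._values.get(self._composite(primary, local))
            if value is not None:
                result[local] = value
        return result

    def get_row(self, primary: str) -> Dict[str, bytes]:
        """Full-row read: range scan over all keys sharing the primary-key prefix."""
        prefix = primary + SEP
        start = bisect.bisect_left(self._sorted_keys, prefix)
        row: Dict[str, bytes] = {}
        for key in self._sorted_keys[start:]:
            if not key.startswith(prefix):
                break  # left the contiguous range for this primary key
            row[key[len(prefix):]] = self._values[key]
        return row


store = WideColumnStore()
store.put("user:42", "country", b"US")
store.put("user:42", "language", b"en")
store.put("user:42", "signup_ts", b"1546300800")
store.put("user:420", "country", b"CA")

partial = store.get_columns("user:42", ["country", "language"])
full_row = store.get_row("user:42")
```

Only the requested columns cross the network in the partial read, and the store never needs to know the schema of the struct.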
Innovation 3: A versioning system for batch-loaded, read-only data unblocked instant data migrations between clusters
One of the biggest pain points of the read-only mode of Rockstore was the inability to move data once it was loaded onto a cluster. If customer data grew beyond what was provisioned for it, or if a certain cluster became unstable, it took two weeks and two or three teams to coordinate changes to workflows, reconfigure thrift call sites, and budget time to double-upload, monitor, and read data to and from the new location.
Another pain point of Rockstore’s read-only mode was that it supported exactly two versions because of how it implemented versioning. This was incompatible with the requirements of Terrapin, which supported fewer than two versions for cost savings and more than two for critical datasets that require instant on-disk rollback.
The solution to this is what we call “timestamp-based versioning.” Rockstore read-only used to have “round-robin versioning,” where each new version uploaded into the system would either be version One or version Two. Once all the partitions of an uploaded version were online, the version map would simply flip. This created the exactly-two version constraint. Another constraint that bound customers to a specific cluster was the fact that customers needed to specify a serverset address that corresponded to the cluster on which their data lived. Another leaky abstraction! When the data moved, customers needed to make changes to follow it.
In timestamp-based versioning, every upload is assigned a timestamp and registered with a central metastore called Key-Value Store Manager (KVSM), which is used to coordinate cluster map configurations. Once more, the power of abstraction comes in: by calling KVStore APIs, you as a customer no longer need to know on which cluster your data lives. KVStore figures that out for you using the cluster map configuration.
Not only does this abstraction allow for as few as one version or as many as 10 to be stored on disk or in S3 (to trade off cost savings and rollback safety), but moving a dataset from one cluster to another is as simple as making a single API call to change the cluster metadata in KVSM and then kicking off a new upload. Once the metadata is updated, the new upload is automatically loaded onto the new cluster, and once it is online, all serving maps point requests to that location. Thanks to timestamp-based versioning, two weeks of effort has been reduced to a single API call.
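The bookkeeping behind timestamp-based versioning can be sketched in a few lines of Python. This is a toy model rather than KVSM itself: `KVStoreManager`, `DatasetMeta`, and every method name are hypothetical, and the sketch treats an upload as online the moment it is registered, whereas the real system waits for all partitions to load.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

Version = Tuple[int, str]  # (upload timestamp, cluster holding that version)


@dataclass
class DatasetMeta:
    cluster: str                 # where the NEXT upload will be loaded
    max_versions: int            # retention: 1 for cost savings, more for rollback safety
    versions: List[Version] = field(default_factory=list)  # oldest first


class KVStoreManager:
    """Toy central metastore mapping datasets to versions and clusters."""

    def __init__(self) -> None:
        self._datasets: Dict[str, DatasetMeta] = {}

    def create_dataset(self, name: str, cluster: str, max_versions: int = 2) -> None:
        if max_versions < 1:
            raise ValueError("at least one version must be retained")
        self._datasets[name] = DatasetMeta(cluster, max_versions)

    def set_cluster(self, name: str, cluster: str) -> None:
        """The single metadata call that moves a dataset; takes effect on the next upload."""
        self._datasets[name].cluster = cluster

    def register_upload(self, name: str, timestamp: int) -> str:
        meta = self._datasets[name]
        if meta.versions and timestamp <= meta.versions[-1][0]:
            raise ValueError("upload timestamps must be increasing")
        meta.versions.append((timestamp, meta.cluster))
        del meta.versions[:-meta.max_versions]  # drop versions beyond the retention limit
        return meta.cluster

    def resolve(self, name: str) -> Optional[Version]:
        """What a KVStore client would consult: latest version and its cluster."""
        versions = self._datasets[name].versions
        return versions[-1] if versions else None

    def rollback(self, name: str) -> Version:
        versions = self._datasets[name].versions
        if len(versions) < 2:
            raise ValueError("no older version retained to roll back to")
        versions.pop()
        return versions[-1]


kvsm = KVStoreManager()
kvsm.create_dataset("pins", cluster="cluster-a", max_versions=3)
kvsm.register_upload("pins", timestamp=100)
kvsm.register_upload("pins", timestamp=200)

kvsm.set_cluster("pins", "cluster-b")           # move the dataset
still_serving = kvsm.resolve("pins")            # old cluster serves until the next upload
kvsm.register_upload("pins", timestamp=300)     # next upload lands on cluster-b
now_serving = kvsm.resolve("pins")
rolled_back = kvsm.rollback("pins")             # instant rollback to the retained version
```

Because clients resolve the cluster through the metastore rather than a hard-coded serverset address, neither the move nor the rollback requires any change on the customer side.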
Thank you for reading about our journey to a single, abstracted key-value storage at Pinterest. I’d like to acknowledge all the people who contributed to this critical and technically challenging project: Rajath Prasad, Kangnan Li, Indy Prentice, Harold Cabalic, Madeline Nguyen, Jia Zhan, Neil Enriquez, Ramesh Kalluri, Tim Jones, Gopal Rajpurohit, Guodong Han, Prem Thangamani, Lianghong Xu, Alberto Ordonez Pereira, Kevin Lin, all our partners in SRE, security, and Eng Productivity, and all of our engineering customers at Pinterest, who span teams from ads to homefeed, machine learning to signal platform. None of this would have been possible without the teamwork and collaboration from everyone here.