This post describes how we migrated a large NoSql database from one vendor to another, in production, without any downtime or data loss. The methodology has worked for us and has proven to be safe and simple. There are surely several techniques for migrating NoSql databases, but here are some guidelines for doing so with minimal risk, while retaining the ability to roll back at almost any point in the process. This post is based on a talk I gave at DevopsDays in Nov 2017.
At AppsFlyer, we have many NoSql databases. AppsFlyer's data processing pipeline updates an internal state as it processes the events we receive from mobile devices. This state is mission critical and is processed at high throughput.
One of those databases is our app-installs database. It takes a write hit for roughly every new install that AppsFlyer attributes, and an additional read hit whenever AppsFlyer receives any launch or in-app event.
Some numbers (accurate at the time of writing):
And, yes, it is mission critical.
But the time had come to replace it. It brought us very far, and for that it deserves credit. But lately, as traffic and the resulting throughput demands have risen, it simply has not been able to deliver the stability we need.
Here are a few examples of that instability:

- Scale-out went out of control: we were running 45 r3.4xl instances.
- Daily backups could not keep up and took longer than a day to complete.
- XDCR (cross data center replication) stopped working.
- Losing a node (and adding one back) came with a performance hit and system instability, and with so many nodes on AWS, the probability that something will go wrong gets very high.
- We were not happy with the answers we got from the paid support.

These were all issues we could no longer live with.
Migrating data is always hard. It is hard enough when dealing with relational databases, and of course harder still in production, while the database(s) are in use. But doing so with NoSql databases is even more complex, because the dozens of NoSql technologies out there have almost nothing in common: no common way to serialize or represent data, no standard query language, no schema, and when it comes to out-of-the-box migration tools, you are on your own.
So after some research and trials we chose Aerospike. We had a few months until our enterprise plan with Couchbase was due to expire, and we started planning. Since most vendors, for obvious reasons, do not encourage migrating data out of their database, we were left with two basic options: write a migration tool ourselves ('out-of-band migration') or let the application migrate the data for us ('in-band migration'). In many cases, one of these methods alone is sufficient, for reasons I will enumerate shortly; in other cases, you will have to use a combination of both, as we did.
The basic idea here is to use your application code to duplicate both inserts and updates ('upserts') to your new database. In other words, every write or update made to the old database is also made to the new one. This effectively starts migrating all of your new data to the new cluster going forward. In many cases, a delete operation should be treated as a write here as well, at least from the application's point of view.
This has a few implications:
You can start with a small cluster and grow (scale out) as you go. This is a good opportunity to practice scaling out.
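The dual-write idea above can be sketched as a thin wrapper around the two database clients. This is a minimal illustration, not our actual code: `old_db` and `new_db` are assumed to be client objects exposing `put` and `delete` methods.

```python
class DualWriteStore:
    """Duplicates every write to both databases during an in-band migration.

    The application keeps a single write path; the old database remains
    the source of truth, while the new one gradually fills up.
    """

    def __init__(self, old_db, new_db):
        self.old_db = old_db
        self.new_db = new_db

    def upsert(self, key, record):
        self.old_db.put(key, record)   # source of truth during the migration
        self.new_db.put(key, record)   # duplicated write migrates data forward

    def delete(self, key):
        # A delete is also "a write" from the application's point of view.
        self.old_db.delete(key)
        self.new_db.delete(key)
```

In practice you would also decide how to handle a failure of one of the two writes (retry, log, or drop), depending on how strict your consistency requirements are during the migration window.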
Testing the first phase:
This is the time to do some data sanity tests:
Compare the migrated data between the two databases and verify it matches your expectations. Note that many databases have different semantics for record TTLs, if you use them.
Get to know the new technology:
This phase is a bit trickier, but still fairly simple. Once you have enough confidence in the new database, it is time to start serving from it. The idea is to serve data from the new database, and if you fail to find a record there, fall back to the old database and serve from it.
An important note here: configure a metric that counts the cases where a record was not found in the new database but was found in the old one. Such a record has not yet been migrated by the application. It also means that once this metric stops rising for long enough, the migration is over.
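The fallback read together with the miss-hit metric can be sketched as follows. This is illustrative only: `new_db`/`old_db` are assumed clients whose `get` returns `None` on a miss, and `metrics.incr` stands in for a StatsD-style counter.

```python
def read_with_fallback(key, new_db, old_db, metrics):
    """Serve from the new database; fall back to the old one on a miss.

    Every fallback hit increments a counter. When that counter stops
    rising for long enough, the in-band migration is complete.
    """
    record = new_db.get(key)
    if record is not None:
        return record

    record = old_db.get(key)
    if record is not None:
        # Found only in the old database: not migrated yet.
        metrics.incr("migration.miss_hit")
    return record
```

Tracking `migration.miss_hit` over time is exactly the signal used later to decide when the old cluster can be shut down.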
This phase has a few implications:
Whatever you tested after the first migration phase, test again. The load should now be much higher, and the cluster may also need to be larger to support the new read capacity. If you can afford it, it is highly recommended to let it run for a few days, or even weeks: traffic varies, and some errors only appear after some mileage.
For quite a lot of use cases this is enough. If your records are short-lived and/or the application naturally hits every record frequently, you just need to wait it out. Sit back and watch the miss-hit metric diminish to zero. To follow the frequency of that metric firing, we used our metrics stack of StatsD, Graphite and Grafana. The stack itself does not matter, as long as you can track the frequency of this metric over time.
For more sensitive or complex use cases, you will need to make sure the entire set of records has been migrated to the new database before sunsetting the old one. What we had to do next was enumerate the entire set of keys in the old database and migrate the missing ones.
Similar to views in relational databases, most NoSql databases provide some way to index and query the data. Usually, to query a hash-table-like data structure, they let the user create an efficient index over the data. To build the index, most APIs let you write some code that is executed on each key and corresponding record, emitting the value that eventually becomes the queryable data. For example, given a table of people's data, you could index by last name simply by writing a small function that extracts the last name from each record. You can then query the database for all people with a given last name.
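The mechanism described above can be illustrated with a toy sketch. In Couchbase itself the map function would be a JavaScript view; the Python below only demonstrates the concept, with hypothetical names (`map_last_name`, `build_index`).

```python
def map_last_name(key, record):
    """Map function run on each (key, record): emit (index_key, record_key)."""
    if "last_name" in record:
        yield (record["last_name"], key)

def build_index(db, map_fn):
    """Build an index by running the map function over every record.

    The emitted value becomes the queryable key; the record's own key
    is stored as the indexed data.
    """
    index = {}
    for key, record in db.items():
        for idx_key, value in map_fn(key, record):
            index.setdefault(idx_key, []).append(value)
    return index

# Querying the index for a last name then returns the matching record keys.
```

Note that the index holds only the emitted values, not the full records; that detail matters for the memory discussion below.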
It is important to understand here a few points:
This implies that creating an index can have a non-trivial performance impact on your database. Creating one just to dump the set of keys, if done recklessly, can easily bring a database to its knees. This is also why, although possible, you should never include the entire record in the index: doing so could effectively duplicate your entire database into RAM.
We (ab)used this feature to create an index with the record TTL as the key and the record id (key) as the data. This gave us a simple way to run a "select *" style query on the index, returning the entire set of keys (ids) in the database. As a bonus, Couchbase sorted the set by TTL, which was useful in our case.
The next phase was to create a simple setup composed of two services: a dumper and a loader. Decoupling those two services with Kafka as a buffer achieves two important goals:
Basically, the dumper queried the index and streamed the set of keys + TTLs to an Apache Kafka topic. The loader consumed the topic, applied a simple transformation to adapt the TTLs to Aerospike's format, fetched each key from Couchbase and wrote it to Aerospike with a CreateOnly policy. This ensured we would not overwrite records that had already been migrated by the application.
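The dumper/loader pair can be sketched roughly as below. This is an illustrative outline, not our production code: the TTL conversion assumes the old database reports an absolute expiry timestamp while the new one expects seconds-from-now (verify this against your SDK versions), and `put_if_absent` is a stand-in for a create-only write policy such as Aerospike's.

```python
import json
import time

def expiry_to_relative_ttl(expiry, now=None):
    """Convert an absolute expiry timestamp to a relative seconds-from-now TTL.

    Assumed semantics for illustration; clamps to at least 1 second so
    nearly-expired records are not written with a non-positive TTL.
    """
    now = time.time() if now is None else now
    return max(int(expiry - now), 1)

def dump_keys(index_rows, producer, topic):
    """Dumper: stream (key, expiry) pairs from the index query into Kafka."""
    for key, expiry in index_rows:
        producer.send(topic, json.dumps({"key": key, "expiry": expiry}).encode())

def load_records(messages, old_db, new_db):
    """Loader: consume the topic, fetch each record from the old database
    and write it to the new one create-only, so records already migrated
    in-band by the application are never overwritten."""
    for msg in messages:
        row = json.loads(msg)
        record = old_db.get(row["key"])
        if record is not None:
            new_db.put_if_absent(
                row["key"],
                record,
                ttl=expiry_to_relative_ttl(row["expiry"]),
            )
```

Because Kafka sits between the two services, the dumper can run flat out against the index while the loader is throttled to whatever write rate the new cluster can absorb, and either side can be restarted independently.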
All that was left was to sit back and watch the miss-hit metric diminish to 0. After a few days, the lights went down on the mighty Couchbase cluster.
This project took us about six weeks end to end. We made many mistakes along the way and had to restart several times. One of the most important takeaways is that this is a software project; it is no different from any other software project. It is important that someone owns it end to end, preferably an engineer who knows the code of the applications involved well. A dedicated devops engineer will probably also be required, especially when adopting a new database technology into the software stack. Additional stakeholders should be involved if the risks taken affect them, too.