In recent years, Goldman Sachs adopted a standard software development lifecycle (SDLC) for application build and deployment. This included managing database schema definitions in source control for both new and existing systems, a more difficult endeavor than managing application code. In this article, we will describe how Obevo, our recently open-sourced DB deployment utility, helped many of our enterprise applications get their databases under SDLC control.
Onboarding database definitions to a standard SDLC process is challenging for many reasons, notably the stateful nature of databases and the need to perform incremental migrations. Hence, many applications have not historically had an automated or smooth database deployment process. Our goal was to get database schemas managed under the same SDLC that applications go through: with all definitions committed to a version control system (VCS) and deployed via a standard build/release mechanism.
This effort was complicated by the diversity of use cases in our database estate. And regardless of the use case, the SDLC process itself must cope with a large number of geographically distributed developers actively making changes.
While existing open source tools could do the job for simpler cases, they could not handle the scale and complexity of some of our existing systems. And we could not just leave these existing systems without a proper SDLC; they are critical systems with active development and releases.
Thus, we developed Obevo (available under the Apache 2.0 License), a tool to handle all such use cases. Obevo’s key differentiator is the ability to maintain DB objects per file (similar to how class definitions are typically stored per file) while still handling incremental deployments.
In this article, we will describe the DB Deployment problem space and then demonstrate how the object-based project structure helps us elegantly manage hundreds and thousands of schema objects for a variety of object and environment types.
Quick Obevo Feature List: Something for Everyone!

| For New Systems | For Existing and Complex Systems |
| --- | --- |
| Simple to maintain, review, and deploy | Easy to reverse-engineer and onboard |
| Selectively deploy tables for easier usage in tests | Stateful (tables) and stateless (views, SPs, functions, …) objects supported |
| In-memory test database conversion of most table scripts | Easy to support hundreds or thousands of objects |
First, let’s review the deployment semantics for different DB object types, as this impacts the tooling design.
Quick terminology note for this article:
A stateful object requires incremental modifications to its definition, instead of a full definition replacement. Take the following example of adding two columns to MyTable.
Ideally, we could bring the database to the end state by specifying a SQL statement that defines a table with all four columns. Alas, SQL does not provide a workable solution for this:
Instead, relational database management systems (RDBMS) allow users to specify ALTER statements to modify an existing table.
Some column updates may require data migrations, such as moving column data from one table to another.
Thus, each object is a result of applying multiple scriptlets; the initial scriptlet will create the object, and subsequent ones will modify it.
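To make this concrete, here is a minimal sketch using Python's built-in sqlite3 module (not Obevo itself; the MyTable scriptlets are hypothetical) showing how an object's end state results from replaying its scriptlets in order:

```python
import sqlite3

# Hypothetical scriptlets for MyTable: the first creates the object,
# and each subsequent one modifies it incrementally.
SCRIPTLETS = [
    "CREATE TABLE MyTable (id INTEGER, name VARCHAR(32))",   # M1: initial create
    "ALTER TABLE MyTable ADD COLUMN created_date DATE",      # M2: add a column
    "ALTER TABLE MyTable ADD COLUMN status VARCHAR(8)",      # M3: add another column
]

conn = sqlite3.connect(":memory:")
for scriptlet in SCRIPTLETS:
    conn.execute(scriptlet)

# The table now has all four columns, even though no single
# statement ever defined them all at once.
columns = [row[1] for row in conn.execute("PRAGMA table_info(MyTable)")]
print(columns)  # ['id', 'name', 'created_date', 'status']
```

Note that the "full end-state" CREATE TABLE with all four columns never appears anywhere; the end state exists only as the cumulative result of the scriptlets.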
A stateless object, on the other hand, can be created and modified by specifying its full object definition.
We also consider static data files (i.e. code or reference data tables) as stateless objects. Though it involves table data, the data is fully defined in your scriptlet and deployable to the table, whether via bulk “delete + insert” or selective “insert + update + delete” operations.
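As a sketch of the bulk "delete + insert" style of static data deployment (again with sqlite3; the Currency table and its rows are invented for illustration):

```python
import sqlite3

# Hypothetical static data for a reference table; the scriptlet
# fully defines the table contents, like a stateless object.
CURRENCY_ROWS = [("USD", "US Dollar"), ("EUR", "Euro"), ("JPY", "Japanese Yen")]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Currency (code VARCHAR(3) PRIMARY KEY, name VARCHAR(32))")

def deploy_static_data(conn, table, rows):
    """Bulk 'delete + insert': the file contents fully replace the table contents."""
    conn.execute(f"DELETE FROM {table}")
    conn.executemany(f"INSERT INTO {table} VALUES (?, ?)", rows)

deploy_static_data(conn, "Currency", CURRENCY_ROWS)
deploy_static_data(conn, "Currency", CURRENCY_ROWS)  # redeploying is idempotent

count = conn.execute("SELECT COUNT(*) FROM Currency").fetchone()[0]
print(count)  # 3
```

Because the file is the complete source of truth, redeploying it any number of times converges to the same table contents, which is what makes static data behave like a stateless object.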
This article from Martin Fowler is an excellent read on the key principles that most source-control-based DB deploy tools follow; the following points are specifically relevant.
1) All database artifacts are version controlled with application code (versus managing from a UI)
UI-based management may be acceptable for less-technical users, but we recommend development teams manage their database scriptlets in source control (as they should with the rest of their application estate) and invoke deployments in an automated manner.
2) Explicitly coded incremental migrations are preferred for stateful objects (versus auto-calculated migrations)
In our opinion, auto-calculating migrations (for example, by diffing the current database table state against the complete definition in code) is risky in an enterprise environment.
We evaluate database deployment tools based on how they handle these requirements:
This is the primary functionality of a database deploy tool; most production deployments are executed like this. This methodology is also used in some non-production environments, particularly for testing deployments in a QA environment prior to production release.
Developers may want to deploy to an empty sandbox database for:
This is doable in a couple ways:
Prior to our database deployment overhaul, we saw a few teams maintain a file per database object containing the object's definition, even though those files were not used for deployments. At first this seemed to serve no purpose, but we gleaned some insights from it:
Given the principles above, most source-control-based DB Deploy tools (including Obevo) work as follows:
The object type has an impact on the changeset calculation semantics.
To demonstrate the stateful use case, let's deploy package version 1 to create the table.
Now let’s say someone modified M3 to rename the column, and then we re-deploy. What do we expect?
The tool detects a mismatch between the modified scriptlet and the version that was already deployed, and rejects the deployment.
Thus, the general rule: stateful object scriptlets cannot be modified once they are deployed.
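This rule can be sketched as follows. This is a simplified model, not Obevo's actual audit-table schema; the scriptlet names and the choice of SHA-256 hashing are assumptions for illustration:

```python
import hashlib

def plan_changeset(source, audit):
    """source: {scriptlet name: sql}; audit: {name: hash of sql as deployed}.
    Returns the scriptlets still to deploy; raises if a deployed one was edited."""
    to_deploy = []
    for name, sql in source.items():
        h = hashlib.sha256(sql.encode()).hexdigest()
        if name not in audit:
            to_deploy.append(name)   # new scriptlet: schedule it for deployment
        elif audit[name] != h:
            raise ValueError(f"{name} was modified after deployment")
    return to_deploy

# The audit store remembers the hash of each scriptlet as it was deployed.
audit = {"M1": hashlib.sha256(b"CREATE TABLE T (a INT)").hexdigest()}

# A new scriptlet M2 is picked up for deployment.
print(plan_changeset({"M1": "CREATE TABLE T (a INT)",
                      "M2": "ALTER TABLE T ADD b INT"}, audit))  # ['M2']

# But editing the already-deployed M1 is rejected.
try:
    plan_changeset({"M1": "CREATE TABLE T (a INT, renamed INT)"}, audit)
except ValueError as e:
    print(e)  # M1 was modified after deployment
```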
Certain opt-in functionalities let us work around this if needed, such as:
Given that their underlying algorithms are similar, deployment tools vary based on a couple of implementation points.
1) How to organize scriptlets into files
Options include grouping scriptlets together by:
2) How to order scriptlets for deployment
Options to consider:
Next, we will detail how Obevo addressed these two points.
Our main database deployment problems centered on managing the large number of objects in a schema: developing, maintaining, and deploying them. We also had developers actively working on their database objects in conjunction with coding their applications.
Hence, we wanted to provide an experience that would be relatable for developers, which led us to organize our scriptlets based on the object name. We will dive into those details in this section.
(This structure adds some challenges around ordering, which we will cover in the next section of this article.)
We organize scriptlets based on the objects to which they apply.
The file structure differs based on whether the object is stateful or stateless.
| | Stateful Object Representation | Stateless Object Representation |
| --- | --- | --- |
| Scriptlets per file | 1-to-many | 1 |
| Scriptlet naming convention | Each scriptlet named within the file | Not needed (one definition per file) |
The object-based structure is convenient for stateless objects as the full stateless object definition can be maintained in a file and modified easily in place.
For comparison, it is technically possible to handle stateless object deployments in an incremental stateful manner, e.g. as incremental scriptlets that persist over multiple releases. However, this leads to redundancy in database scriptlets as objects change over multiple releases.
This project structure gives us a few advantages from a maintenance perspective:
The database structure is very readable from the project structure.
For a comparison point, take an example project structure where each file is associated with a migration or release, as many tools support. This can lead to a few issues:
Lack of object readability: If an object is modified over many releases, its structure will either be dispersed across multiple files (for stateful objects) or simply redundant in many files (as demonstrated for stateless objects previously).
From a code review / release review perspective, our object-based structure means all the changes in a particular release are dispersed across files. At first blush, it may seem difficult to review the scriptlets to be deployed in the next release. However, we can still review code for the release by comparing VCS history and tags, the same way we would for application code.
Developers get a couple of benefits with the Obevo project structure.
As scriptlets for an object are colocated in a single file, we can easily deploy individual objects in tests, which proves useful for cases like unit testing a data access API in an in-memory database.
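For example, a unit test can deploy just the objects it needs into an in-memory database. This is a minimal sketch with sqlite3, not Obevo's API; the per-object files shown are hypothetical:

```python
import sqlite3

# Hypothetical per-object files (object name -> its scriptlets),
# mirroring the object-based project structure.
PROJECT = {
    "TABLE_A": ["CREATE TABLE TABLE_A (a INT)",
                "ALTER TABLE TABLE_A ADD COLUMN b INT"],
    "TABLE_B": ["CREATE TABLE TABLE_B (x INT)"],
}

def deploy_objects(conn, project, names):
    """Selectively deploy only the named objects into a test database."""
    for name in names:
        for scriptlet in project[name]:
            conn.execute(scriptlet)

# A unit test that touches only TABLE_A deploys just that one file.
conn = sqlite3.connect(":memory:")
deploy_objects(conn, PROJECT, ["TABLE_A"])
tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table'")]
print(tables)  # ['TABLE_A']
```

Because each object's scriptlets live in one file, picking out the subset a test needs is trivial; with release-based files, the definition of TABLE_A would be scattered across many files.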
Developers can also leverage ORM tools to generate DDLs from their application classes and reconcile them against their migration scriptlets. For brevity, we will not delve into this here, but you can read more in our documentation.
While choosing an object-based project organization gave many benefits detailed in the last section, it complicated the ordering problem.
Objects can depend on each other, and as we scale up to hundreds and thousands of objects in a schema, it becomes increasingly difficult to manually prescribe an order.
Let’s describe how we overcame these challenges.
Not all scriptlets depend on each other, and so we have some flexibility on our ordering within the obvious constraints declared by object dependencies. Hence, we devise a solution using a simple graph algorithm.
Take the following example statements:
Note the following:
This lends itself to a directed graph representation, where the graph nodes are the scriptlets and the edges are the ordering dependencies.
We can now use the topological sort algorithm to come up with an acceptable ordering that maintains these ordering constraints and will deploy our database successfully.
The topological sort can emit many acceptable orderings, but we tune our algorithm usage to give a single consistent ordering so that we have consistent behavior across environments.
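One way to get such a consistent ordering is Kahn's algorithm with an alphabetical tie-break, sketched below. The object names and dependency edges are hypothetical, and this is a simplified model rather than Obevo's actual implementation:

```python
import heapq

def deterministic_topo_sort(dependencies):
    """dependencies: {node: [nodes it depends on]}.
    Kahn's algorithm, using a heap so ties break alphabetically;
    this yields the same order in every environment."""
    nodes = set(dependencies)
    for deps in dependencies.values():
        nodes |= set(deps)
    indegree = {n: 0 for n in nodes}
    dependents = {n: [] for n in nodes}
    for node, deps in dependencies.items():
        for dep in deps:
            indegree[node] += 1
            dependents[dep].append(node)
    ready = [n for n in nodes if indegree[n] == 0]
    heapq.heapify(ready)
    order = []
    while ready:
        node = heapq.heappop(ready)   # smallest name among the ready nodes
        order.append(node)
        for dependent in dependents[node]:
            indegree[dependent] -= 1
            if indegree[dependent] == 0:
                heapq.heappush(ready, dependent)
    if len(order) != len(nodes):
        raise ValueError("cycle detected among scriptlets")
    return order

# Hypothetical scriptlets: a view reads TABLE_A; TABLE_B's foreign key
# needs both TABLE_A and TABLE_B to exist first.
deps = {"VIEW_A": ["TABLE_A"],
        "TABLE_B.fkA": ["TABLE_A", "TABLE_B"],
        "TABLE_A": [], "TABLE_B": []}
print(deterministic_topo_sort(deps))
# ['TABLE_A', 'TABLE_B', 'TABLE_B.fkA', 'VIEW_A']
```

The heap is the key design choice: a plain queue would also produce a valid topological order, but the order could differ run to run, whereas the heap makes it reproducible.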
Now for the last detail: how do we declare the dependencies in our scriptlets?
We found the simplest way is to declare the dependencies explicitly via a dependencies attribute on the scriptlet, e.g. a TABLE_B.fkA scriptlet declaring its dependency on the tables it references.
However, this is not developer-friendly for large databases (imagine annotating hundreds or thousands of objects), so we need a way to automatically detect dependencies while still allowing developer-overrides as needed.
We use two strategies to infer dependencies:
We allow stateful objects to define multiple scriptlets. It is natural to imply that the migrations within a file are deployed in the same order they are written, and so we infer such dependencies in our graph.
To detect cross-object dependencies, we need to search the scriptlet content to find dependent objects.
The technically ideal way would be to parse the SQL syntax to find those objects. However, this is very non-trivial as we’d have to understand the SQL syntax for all the DBMS types that we support.
Instead, Obevo takes a simple and naive approach: string-search each scriptlet for the names of the other objects in your project, and assume any matches are dependencies.
Notes on the implementation:
This algorithm produced the expected results for our previous example.
In case of false positive matches (e.g. due to comments) or false negatives, developers can specify exclusion or inclusion overrides as needed.
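The string-search detection can be sketched as follows. This is a simplified model of the approach; the word-boundary regex matching and the example scriptlets are assumptions for illustration, not Obevo's exact implementation:

```python
import re

def infer_dependencies(scriptlets):
    """scriptlets: {object name: sql text}.
    Naive detection: search each scriptlet body for the names of the other
    objects in the project and treat any whole-word match as a dependency."""
    names = set(scriptlets)
    deps = {}
    for name, sql in scriptlets.items():
        deps[name] = {
            other for other in names
            if other != name and re.search(r"\b%s\b" % re.escape(other), sql)
        }
    return deps

scriptlets = {
    "TABLE_A": "CREATE TABLE TABLE_A (a INT)",
    "TABLE_B": "CREATE TABLE TABLE_B (b INT, a_id INT)",
    "VIEW_A":  "CREATE VIEW VIEW_A AS SELECT a FROM TABLE_A",
}
print(infer_dependencies(scriptlets))
# {'TABLE_A': set(), 'TABLE_B': set(), 'VIEW_A': {'TABLE_A'}}
```

A real scriptlet mentioning TABLE_A inside a comment would produce a false positive here, which is exactly why the explicit include/exclude overrides mentioned above exist.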
At first glance, it may be hard to imagine this working for actual use cases, but we have used this technique successfully to deploy many complex schemas, some spanning thousands of objects such as tables, stored procedures, views, etc.
If you would like to see an example in practice, please see our kata lesson that goes through an example reverse engineering of a large database schema.
We would like to quickly mention this use case (i.e. moving data from an old column to a new column, then dropping the old column), as at first glance it may seem difficult to apply the object-based file structure concept to it.
Obevo can handle this; in short, we provide the concept of a “migration” object to help with such one-off data migration scriptlets.
For more information, please see the documentation.
Hopefully we have shown that you can deploy very complex schemas with Obevo. But to truly allow existing systems to onboard to Obevo, we must make it easy for developers to reverse-engineer their existing databases.
This is not a trivial problem to solve, as unfortunately there is no single API that works perfectly across different DBMS types.
Hence, we chose to integrate with vendor-provided reverse-engineering tools (see table below). Some tools simply emit the full schema to a single file, but we provide a utility that converts such files into Obevo’s object-based structure, using simple string-parsing and regular-expression techniques. We find this approach more reliable for existing systems, particularly since the core vendor tools know best how to expose their own schemas, compared to generic Java APIs.
Tooling Leveraged by Obevo
While many open source tools exist to help with database deployment, we felt that our more complex use cases required more powerful tooling.
With Obevo, we aim to support all types of systems: whether enhancing the productivity of modern systems with our testing features and easy object-based maintenance, or reinvigorating long-lived systems by facilitating their onboarding to SDLC control where they previously had none.
There are more features and database deployment activities that we haven’t covered here (e.g. rollback, phased deployments, in-memory DB testing, multiple schema management). Feel free to visit our Github page, documentation, and kata lessons to learn more about Obevo and how to apply the tool to your system.
We hope to write more articles about our practices and DB uplift experiences, so feel free to post comments and ask any questions you have.
Shant Stepanian is a Senior Engineer in the Platforms business unit at Goldman Sachs. Shant is the chief architect of Obevo and led the database SDLC overhaul at Goldman. He has worked on a variety of system architectures, from batch-based reporting applications based on databases to high-throughput OLTP systems leveraging distributed and partitioned in-memory data grids.