Database anonymization in Golang using views
A couple of years ago we faced an interesting problem: we wanted to recreate the state of our production system as closely as possible in a separate staging system, both for debugging and to allow testing changes with realistic data before deploying them to production.
A major part of the system state is the content of the database. The easiest solution would have been to simply copy a snapshot of the production data into staging. However, we generally try to collect as little personal information as possible, and we do not want to make it accessible unless necessary, even inside the company.
To prevent any sensitive data from ever leaving the production database and finding its way into the staging database, we had the idea of using database views. These views would return all columns of the underlying table, but some of the columns would not return the original data. Instead they would return data that resembles the original and keeps its general structure. For example, if a field contains an email address, the resulting value should still be an email address.
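To make the idea more concrete, here is a minimal sketch in Go of how such a view definition could be assembled. It is not the code we used back then: the function buildAnonymizedView, the users table, its columns and the view name suffix are all made up for illustration, and identifiers are not quoted to keep the sketch short.

```go
package main

import (
	"fmt"
	"strings"
)

// buildAnonymizedView returns a CREATE VIEW statement for the given table.
// columns lists every column of the table in order; overrides maps a column
// name to the SQL expression that replaces its value in the view.
// Hypothetical helper: identifiers are not quoted or validated here.
func buildAnonymizedView(table string, columns []string, overrides map[string]string) string {
	selected := make([]string, 0, len(columns))
	for _, col := range columns {
		if expr, ok := overrides[col]; ok {
			// Anonymized column: select the expression under the original name.
			selected = append(selected, fmt.Sprintf("%s AS %s", expr, col))
			continue
		}
		// Untouched column: pass the original value through.
		selected = append(selected, col)
	}
	return fmt.Sprintf(
		"CREATE VIEW %s_anonymized AS SELECT %s FROM %s",
		table, strings.Join(selected, ", "), table,
	)
}

func main() {
	stmt := buildAnonymizedView(
		"users",
		[]string{"id", "email", "created_at"},
		map[string]string{
			// Keep the shape of an email address, drop the real one.
			"email": "'user' || id::TEXT || '@example.com'",
		},
	)
	fmt.Println(stmt)
}
```

Dumping the data through a view like this instead of the table means the real email addresses never show up in the dump, while the staging system still sees a column that looks like one.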
For a time we created and updated these views manually whenever we made a change to the structure of our production database. Since we only deployed our monolith about once a week back then, this was manageable.
As we moved to an architecture with multiple separate services, each with its own database, and an increased rate of deployment, we had to reduce the number of manual steps involved in a deployment. The views therefore had to be created and updated automatically, anonymizing specific columns depending on how they were configured.
tidus - a solution for Ruby
Back then the applications that made up our system were written entirely in Ruby and used Rails' ActiveRecord from Sinatra services or Rails apps. The solution I came up with was to hook into ActiveRecord and automatically generate views for the available ActiveRecord models, with the anonymization strategies defined in the model the same way you would define a validation. This resulted in the Ruby gem tidus which I already wrote about here.
Since then we have implemented some of our new services in Golang. Not all of them have a database, but some do, so we faced the exact same problem once again. Fortunately we did not have to come up with a solution from scratch but could implement the concepts we built up in tidus in Golang.
The environment on which to build this is a bit different though. While it is very common to use a full ORM in Ruby, this is not necessarily the case with Golang. We, at least, are not using any ORM. Instead we use sql-migrate to execute database migrations and plain SQL queries to load and update data. To make the solution usable by other people, it should be independent of any specific way of accessing the database, executing migrations, and configuring applications. The only constraint is that the migrations are executed through a Golang program.
gotidus - tidus, but for Golang
This resulted in a library I named gotidus (in reference to the existing Ruby gem).
In contrast to the Ruby gem, gotidus has to be called explicitly; it is not automatically hooked into the execution of the migrations. Furthermore, to work without an ORM, the anonymization configuration has to be built up explicitly. The following example illustrates this:
fooTable := gotidus.NewTable()

// Define columns on the table to anonymize in a specific way.
// Other columns will just contain their normal value.
// Note: Any column defined but not actually in the table will be ignored.
fooTable.AddAnonymizer(
	"bar",
	gotidus.NewStaticAnonymizer("staticValue", "TEXT"),
)

generator := gotidus.NewGenerator(postgres.NewQueryBuilder())

// Define tables that should have specifically anonymized columns.
// Tables that are not supposed to be anonymized specifically
// do not have to be defined.
//
// Note: Any table defined but not actually in the database will be ignored.
generator.AddTable("foo", fooTable)

// Clear existing views
err := generator.ClearViews(db) // Pass an instance of *sql.DB
if err != nil {
	log.Fatal(err)
}

// ... database migration

// Create new views
err = generator.CreateViews(db) // Pass an instance of *sql.DB
if err != nil {
	log.Fatal(err)
}
gotidus automatically looks up all the tables in the given database and generates a view for each of them, named after the original table with _anonymized appended (this can be changed through configuration). Tables for which no column anonymization is configured are returned as is when queried through their views.
To anonymize a table, it has to be added to the view generator together with an anonymizer for each column that should be anonymized.
In the example above the column bar in the table foo will be overwritten by staticValue in each row.
The generator has to be called before executing any migration to clear the views, which might otherwise prevent a column from being removed or changed.
After the database migration has been executed the generator has to be called again to create the new views based on the configuration.
Once this has been set up, you only need to think about the configuration when adding a column that may have to be anonymized, or to clean it up once a column is no longer used.
Forgetting to remove it would not break the application though. If a column is not available in the actual table, the anonymizer for it will just be ignored.
Conclusion
gotidus is a port of the existing tidus Ruby gem and solves the same problem. It is a bit more flexible in terms of what other libraries it can work with. It definitely works for our current use case, and we can use the same infrastructure we already have to dump and restore data into our staging system.
Extendability
The database we are using is PostgreSQL. Therefore I only implemented a PostgreSQL version. It is possible to add support for other databases by implementing the gotidus.QueryBuilder interface.
It is also possible to extend the functionality through additional anonymizers by implementing the gotidus.Anonymizer interface.
Furthermore, it would be conceivable to build a wrapper around it that builds up a gotidus.Generator instance from a configuration or, for example, from struct tags.