How to set up simple and flexible ETL based anonymization - part 1
In this technical blog post, I want to talk about how to set up simple and flexible ETL based anonymization. Why? Well, I recently had the opportunity to do a small proof of concept for a customer. The customer wanted to know the options that were available that would enable them to take internal data, remove or anonymize any personally identifiable information (PII) and make it available in a simple way and form for external parties. After further requirements gathering, the context for this proof of concept was defined as:
- Whatever solution, it needs to be able to extract data from an on premise Oracle database.
- The end result should be a set of CSV files in an Amazon S3 bucket.
- In between ingesting the Oracle data and dumping it in CSV form on S3, there should be something that removes/anonymizes PII data.
- If possible, the chosen solution should be cloud native.
In this 3 part blog series I’ll explain how to set up simple and flexible ETL based anonymization with the following subjects:
- The research into products that might be used to solve the problem. Also, check how suitable they are for what the proof of concept needs to achieve.
- How the chosen product can be used to create an ETL pipeline that fits the requirements. Additionally, how to setup a local Oracle database in Docker that can be used as a data source for the data ingestion part of the proof of concept (just because this was such a PITA to do).
- And whether this can be done in a cloud native way.
The research part of the proof of concept consists of 2 parts:
- How to extract data from an Oracle database, anonymize it somehow and store it as a bunch of CSV files in an S3 bucket aka the ETL part.
- Figuring out the best way to accomplish the anonymization.
Extracting, transforming and storing the data
Straight off the bat, the customer’s problem sounded remarkably as something that you might solve with an ETL product: Extract Transform Load. So the research part for this part of the proof of concept would be concentrated on this type of product. I also got some input from someone in my team to have a look at singer.io as that was something they had used successfully in the past for this kind of problem.
When looking at the Singer homepage, there are a number things that immediately catch your eye:
- Singer powers data extraction and consolidation for all of the tools of your organization.
- The open-source standard for writing scripts that move data.
- Unix-inspired: Singer taps and targets are simple applications composed with pipes.
- JSON-based: Singer applications communicate with JSON, making them easy to work with and implement in any programming language.
So when getting down to the basics, Singer is a just a specification, albeit not an official one. It’s a simple JSON based data format and you can either produce something in this format (a tap in Singer terminology) or consume the format (a target). You’re able to chain these taps and targets together to extract data from one location and store it in another. Out of the box Singer already comes with a bunch of taps (100+) and targets (10). These taps and targets are written in Python. Because the central point of the system is just a data format, it’s pretty easy to write one yourself or adapt an existing one.
When checking out the taps, the default Oracle tap should cover the Extract part of our proof of concept. The same however doesn’t seem to be the case for the Load part when looking at the default targets. There is a CSV target, but it stores its results locally, not in an S3 bucket. There is the option of just using this target and do the S3 upload ourself after the ETL pipeline has finished. Another option would be to adapt the existing CSV target and change the file storage to S3. Some quick Googling turns up a community made S3 CSV Singer target. According to its documentation, this target should do exactly what we want.
Whoops, Singer doesn't transform
With the Extract and Load parts covered, this leaves us with just the Transform part of the ETL pipeline to figure out… and this is where it gets a bit weird. Even tough Singer is classified as an ETL tool, it doesn’t seem to have support for the transformation part?
Looking further into this I came by this ominously titled post: Why our ETL tool doesn’t do transformations. Reading this, it seems they consider their JSON specification/data format as the transformation part. So they support transformation to raw data and storing it, but don’t support other kinds of transformations. That part is up to yourself after it has been stored somewhere by a Singer target. So it turns out that Singer is more like the EL part of an ELT product than an “old school” ETL product.
At this point, Singer should at least be sufficient to extract the data from an Oracle database and to put it in an S3 bucket in CSV format. And because Singer is pretty simple, open and extendable, I’m going to leave it at that for now. Let’s continue by looking into the anonymization options that might fit in this Singer context.
Similarly to the ETL part, I also received some input for this part, pointing me to Microsoft Presidio.
When looking at the homepage we can read the following:
- It provides fast identification and anonymization modules for private entities in text and images such as credit card numbers, names and more.
- It facilitates both fully automated and semi-automated PII de-identification flows on multiple platforms.
- Customizability in PII identification and anonymization.
So there’s a lot of promising stuff in there that could help me solve my anonymization needs. Upon further investigation it looks like I’m evaluating this product during a major transformation (get it? 😉) from V1 to V2. V1 incorporated some ETL-like stuff like retrieving data from sources (even though Oracle support in the roadmap never seems to have materialized) and storing anonymized results in a number of forms/locations. However, V2 has completely dropped this approach to concentrate purely on the detection and replacement of PII data.
At its core, Presidio V2 is a Python based system built on top of an AI model. This enables it to automatically discover PII data in text and images and to replace it according to the rules you define. I did some testing using their online testing tool and it kind of works, but for our specific context it definitely needs tweaking. Also, when looking at the provided test data, it seems that it is mostly simple and short data but no large text blobs or images. This then begs the question: even if we’re able to configure Presidio to do what we want it to do, might we be hitting small nails with a big hammer?
Is Presidio too much?
So let’s rethink this. If we can easily know and define which simple columns in which tables need to be anonymized and when just nulling or hashing the column values is sufficient, we don’t need the auto detection part of Presidio. We also wouldn’t need the Presidio full text or image support and we also wouldn’t need fancy substitution support. Presidio could be a powerful library to create an automatic anonymization transformation step for our Singer based pipeline. It also helps that Presidio is Python based. However, my gut feeling says I maybe should first try to find a slightly simpler solution.
I started searching for something that’s can do a simple PII replace and that works in a Singer tap/target context. I found this Github repository: pipelinewise-transform-field. The documentation reads “Transformation component between Singer taps and targets”. Sounds suspiciously like the “T” part that Singer as an ETL was missing! Further down in the configuration section we even read:
“You need to define which columns have to be transformed by which method and in which condition the transformation needs to be applied.”
and the possible transformation types are:
- SET-NULL: Transforms any input to NULL
- HASH: Transforms string input to hash
- HASH-SKIP-FIRST-n: Transforms string input to hash skipping first n characters, e.g. HASH-SKIP-FIRST-2
- MASK-DATE: Replaces the months and day parts of date columns to be always 1st of Jan
- MASK-NUMBER: Transforms any numeric value to zero
- MASK-HIDDEN: Transforms any string to ‘hidden'
This seems to cover our simple anonymization requirements completely! We can even see how we need to use it in the context of Singer:
some-singer-tap | transform-field --config [config.json] | some-singer-target
We now have all the pieces of the puzzle on how to set up simple and flexible ETL based anonymization. In the next blog post we’ll show how they fit together and whether they produce the results the customer is looking for.