22/04/2021 • Jan Eerdekens

How to set up simple and flexible ETL based anonymization - part 2

In the first installment of this series on ETL based anonymization, I covered the research part of the problem I was trying to solve in a proof of concept. I settled on using Singer as the software solution to build an ETL pipeline. In this second part I will focus on how to set up Singer, the chosen taps & targets and how to chain them together.

As a small side quest, I’ll also cover how to run an Oracle database locally as the database that the Singer Oracle Tap can connect to and the issues I had with it.

Set up Singer

The first thing we need to do to start creating the ETL pipeline is to install Singer. Singer is basically just a collection of taps and targets. These are nothing more than simple Python programs that produce and consume data streams in the Singer JSON spec format and that can be piped into each other. So the only installation requirement is a working Python 3 installation on your system. To see if you already have Python 3 installed correctly, run the command below:

python --version

If this works and outputs a 3.x version then you’re already done. If this doesn’t work or outputs a 2.x version, you’ll first need to install Python 3. Instructions on how to do this for your operating system can be easily found by Googling.
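For example, on macOS (the platform I'm using for this POC) Python 3 can be installed with Homebrew, assuming Homebrew itself is already installed:

brew install python
python3 --version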

Set up Oracle

We will be using an Oracle database that we will run as a Docker container. I went through many trials to get this set up, which you can read about at the end of this post.

For the time being, we’ll start from the assumption that an Oracle database is running on localhost:1521 with an SID called OraDoc. In this database, there’ll be a schema called ETL and a user called etl. In this schema we’ll have a very simple test table called TEST with the structure you see below and a couple of rows of actual test data:

FIRST_NAME,LAST_NAME,PERSON_ID,PHONE
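To illustrate, the test table and its rows can be created with something like the SQL below, run through sqlplus. This is only a sketch: it assumes the etl user already exists with the necessary privileges and uses the same password and connect details that we'll put in the tap configuration later on.

sqlplus etl/admin123@"(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=localhost)(PORT=1521))(CONNECT_DATA=(SID=OraDoc)))" <<'SQL'
-- simple test table matching the structure above
CREATE TABLE TEST (
  FIRST_NAME VARCHAR2(100),
  LAST_NAME  VARCHAR2(100),
  PERSON_ID  NUMBER PRIMARY KEY,
  PHONE      VARCHAR2(20)
);
-- a couple of rows of test data
INSERT INTO TEST (FIRST_NAME, LAST_NAME, PERSON_ID, PHONE) VALUES ('John', 'Doe', 1, '0499010203');
INSERT INTO TEST (FIRST_NAME, LAST_NAME, PERSON_ID, PHONE) VALUES ('Jane', 'Doe', 2, '0499040506');
COMMIT;
SQL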

Set up the "Tap"

With a working Python 3 installation, we can now continue with the first step of our ETL pipeline: the ingestion of Oracle database data by the Singer Oracle Tap.

Even though the taps and targets are just Python code, the Oracle Tap uses the cx_Oracle library as its database driver. This library needs a native Oracle database client, called Oracle Instant Client, to be able to connect to the database. To install the client, follow the Oracle Instant Client installation instructions for your platform.
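On Linux, for example, this roughly boils down to downloading the Instant Client Basic package, unzipping it and making its libraries visible to cx_Oracle. This is only a sketch: the exact zip file name and extracted directory depend on the version you download.

mkdir -p /opt/oracle
unzip instantclient-basic-linuxx64.zip -d /opt/oracle
export LD_LIBRARY_PATH=/opt/oracle/instantclient_21_1:$LD_LIBRARY_PATH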

Once this client is installed, we can continue with setting up the actual Oracle Tap. Installing it entails nothing more than creating a Python Virtual Environment, activating it and then installing the tap dependency via pip. For the Oracle Tap you can do this with these commands:

python3 -m venv ~/.virtualenvs/tap-oracle
source ~/.virtualenvs/tap-oracle/bin/activate
pip3 install tap-oracle

Once the tap is installed and the virtual environment has been activated, the tap-oracle command will be available:

(tap-oracle) developer@laptop .virtualenvs % tap-oracle
usage: tap-oracle [-h] -c CONFIG [-s STATE] [-p PROPERTIES] [--catalog CATALOG] [-d]
tap-oracle: error: the following arguments are required: -c/--config

From the message we get when just trying to run the tap, we can see that the tap needs some configuration before it can actually work. This configuration is nothing more than a simple JSON file that contains the information about the database connection and some other parameters to define how the database will be synced. To keep things simple we will create this file, called config.json, in the bin directory of our tap-oracle virtual environment with the content below:

{
    "host": "localhost",
    "port": 1521,
    "user": "etl",
    "password": "admin123",
    "sid": "OraDoc",
    "filter_schemas": "ETL",
    "default_replication_method": "FULL_TABLE"
}

The first 5 configuration parameters are mandatory and pretty self-explanatory. The last two deserve a bit more explanation:

  • filter_schemas: an optional parameter containing a comma separated list of schema names, which allows you to replicate only the table data of specific schemas instead of all schemas.
  • default_replication_method: this defines how the selected schemas will be replicated. There are 3 possible values: LOG_BASED, FULL_TABLE & INCREMENTAL. For this POC we’ll be using FULL_TABLE as we want to keep things simple and will only be working with a small test database. For the LOG_BASED mode you’ll also need to make sure your Oracle database is configured accordingly, which is outside the scope of this POC and its Docker based Oracle database.

With this configuration in place, we’re ready to run the tap in discovery mode. In this mode, the tap uses the config file to connect to the database and query the selected schemas about the available tables and their structure. We need to save this data, called the catalog, to another JSON file. To create this file, catalog.json, run the command below from the bin directory of the tap-oracle virtual environment:

(tap-oracle) developer@laptop bin % tap-oracle -c config.json -d > catalog.json
INFO starting discovery
INFO dsn: (DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=localhost)(PORT=1521))(CONNECT_DATA=(SID=OraDoc)))
INFO fetching row counts
INFO fetching tables:
   SELECT owner, table_name
   FROM all_tables
   WHERE owner != 'SYS' AND owner IN (:0) ['ETL']
INFO fetching views
INFO fetching column info
INFO fetching pk constraints

If you run the tap with the config and the catalog, you’ll see that nothing is synced:

(tap-oracle) developer@laptop bin % tap-oracle -c config.json --catalog catalog.json
INFO Selected streams: []
INFO No currently_syncing found

To get the tap to actually sync something, edit the catalog file slightly by selecting the streams you want to sync:

{
  "streams": [
    {
      "tap_stream_id": "ETL-TEST",
      "table_name": "TEST",
      "schema": {
        "properties": {
          "FIRST_NAME": {
            "type": [
              "null",
              "string"
            ]
          },
          ...
        },
        "type": "object"
      },
      "stream": "TEST",
      "metadata": [
        {
          "breadcrumb": [],
          "metadata": {
            "table-key-properties": [
              "PERSON_ID"
            ],
            "schema-name": "ETL",
            "database-name": "ORADOC",
            "is-view": false,
            "row-count": 0,
            "selected": true
          }
        },
        {
          "breadcrumb": [
            "properties",
            "FIRST_NAME"
          ],
          "metadata": {
            "sql-datatype": "VARCHAR2",
            "inclusion": "available",
            "selected-by-default": true
          }
        },
        {
          "breadcrumb": [
            "properties",
            "LAST_NAME"
          ],
          "metadata": {
            "sql-datatype": "VARCHAR2",
            "inclusion": "available",
            "selected-by-default": true
          }
        },
        {
          "breadcrumb": [
            "properties",
            "PERSON_ID"
          ],
          "metadata": {
            "sql-datatype": "NUMBER",
            "inclusion": "automatic",
            "selected-by-default": true
          }
        },
        {
          "breadcrumb": [
            "properties",
            "PHONE"
          ],
          "metadata": {
            "sql-datatype": "VARCHAR2",
            "inclusion": "available",
            "selected-by-default": true
          }
        }
      ]
    }
  ]
}
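If you prefer not to edit the catalog by hand, a jq one-liner like the one below can set the selected flag on the stream's top-level metadata. This assumes you have jq installed; a manual edit works just as well.

jq '(.streams[] | select(.tap_stream_id == "ETL-TEST") | .metadata[] | select(.breadcrumb == []) | .metadata.selected) = true' catalog.json > catalog.tmp && mv catalog.tmp catalog.json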

If you now run the same command again, you’ll see the tap sync the content of the selected stream:

(tap-oracle) developer@laptop bin % tap-oracle -c config.json --catalog catalog.json
INFO Selected streams: ['ETL-TEST']
INFO No currently_syncing found
INFO Beginning sync of stream(ETL-TEST) with sync method(full)
INFO Stream ETL-TEST is using full_table replication
{"type": "SCHEMA", "stream": "TEST", "schema": {"properties": {"FIRST_NAME": {"type": ["null", "string"]}, "LAST_NAME": {"type": ["null", "string"]}, "PERSON_ID": {"format": "singer.decimal", "type": ["string"]}, "PHONE": {"type": ["null", "string"]}}, "type": "object"}, "key_properties": ["PERSON_ID"]}
INFO dsn: (DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=localhost)(PORT=1521))(CONNECT_DATA=(SID=OraDoc)))
{"type": "STATE", "value": {"bookmarks": {"ETL-TEST": {"last_replication_method": "FULL_TABLE", "version": 1618344226757}}, "currently_syncing": "ETL-TEST"}}
{"type": "ACTIVATE_VERSION", "stream": "TEST", "version": 1618344226757}
INFO select SELECT  "FIRST_NAME" , "LAST_NAME" , "PERSON_ID" , "PHONE" , ORA_ROWSCN
                                FROM ETL.TEST
                               ORDER BY ORA_ROWSCN ASC
{"type": "RECORD", "stream": "TEST", "record": {"FIRST_NAME": "John", "LAST_NAME": "Doe", "PERSON_ID": "1", "PHONE": "0499010203"}, "version": 1618344226757, "time_extracted": "2021-04-13T20:03:46.757794Z"}
{"type": "RECORD", "stream": "TEST", "record": {"FIRST_NAME": "Jane", "LAST_NAME": "Doe", "PERSON_ID": "1", "PHONE": "0499040506"}, "version": 1618344226757, "time_extracted": "2021-04-13T20:03:46.757794Z"}
INFO METRIC: {"type": "counter", "metric": "record_count", "value": 2, "tags": {}}
{"type": "ACTIVATE_VERSION", "stream": "TEST", "version": 1618344226757}
{"type": "STATE", "value": {"bookmarks": {"ETL-TEST": {"last_replication_method": "FULL_TABLE", "version": 1618344226757, "ORA_ROWSCN": null}}, "currently_syncing": null}}

That was already a lot of work, but luckily the hardest part is done!


Set up the "Target"

Now that we have a tap set up to retrieve database data in the Singer spec format, we’re going to skip the transformation step for the time being and set up a target to pipe it to first. This way we can quickly verify the Extract and the Load parts of the ETL pipeline and already see a readable CSV result instead of Singer spec JSON files.

We’ll be using the Pipelinewise S3 CSV Target. The setup of this target is very similar to how the tap was set up previously. Again, you’ll need to create a virtual environment for it, activate it and install the target dependency using pip:

python3 -m venv ~/.virtualenvs/target-s3-csv
source ~/.virtualenvs/target-s3-csv/bin/activate
pip3 install pipelinewise-target-s3-csv

This target also needs a small configuration file in the form of a config.json file that contains the name of the bucket we want to send the CSV files to and the credentials needed to access the bucket:

{
  "s3_bucket": "anonymized-data-bucket",
  "aws_access_key_id": "your_own_aws_access_key_id_value",
  "aws_secret_access_key": "your_own_aws_secret_access_key_value"
}
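If the bucket doesn’t exist yet, you can create it up front with the AWS CLI (assuming your AWS profile has the necessary S3 permissions):

aws s3 mb s3://anonymized-data-bucket --profile your_aws_profile_name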

This simple configuration is all we need to make the target work. It will enable us to pipe the result of the Oracle Tap into this target and get, depending on the selected streams, one or more (for now non-anonymized) CSV files in our S3 bucket. If we run the command below from the bin directory of the activated tap-oracle virtual environment, a full table sync will occur:

(tap-oracle) developer@laptop bin % tap-oracle --config config.json --catalog catalog.json | ~/.virtualenvs/target-s3-csv/bin/target-s3-csv --config ~/.virtualenvs/target-s3-csv/bin/config.json
INFO Selected streams: ['ETL-TEST']
INFO No currently_syncing found
INFO Beginning sync of stream(ETL-TEST) with sync method(full)
INFO Stream ETL-TEST is using full_table replication
INFO dsn: (DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=localhost)(PORT=1521))(CONNECT_DATA=(SID=OraDoc)))
time=2021-04-13 22:23:29 name=target_s3_csv level=INFO message=Attempting to create AWS session
INFO select SELECT  "FIRST_NAME" , "LAST_NAME" , "PERSON_ID" , "PHONE" , ORA_ROWSCN
                                FROM ETL.TEST
                               ORDER BY ORA_ROWSCN ASC
INFO METRIC: {"type": "counter", "metric": "record_count", "value": 2, "tags": {}}
time=2021-04-13 22:23:29 name=target_s3_csv level=INFO message=Uploading /var/folders/5k/86hrfqsd60b8pnh_81fqfy8h0000gn/T/TEST-20210413T222329.csv to bucket anonymized-data-bucket at TEST-20210413T222329.csv
{"bookmarks": {"ETL-TEST": {"last_replication_method": "FULL_TABLE", "version": 1618345409645, "ORA_ROWSCN": null}}, "currently_syncing": null}

You can check the result in the S3 web interface or using the AWS command line client. To see if the bucket contains the CSV files and to retrieve them, execute the commands below:

aws s3 ls s3://anonymized-data-bucket/ --profile your_aws_profile_name
aws s3 cp s3://anonymized-data-bucket/<a filename output by the previous command>.csv test.csv --profile your_aws_profile_name

Or you can stream the file straight to standard output, cat style:

aws s3 cp s3://anonymized-data-bucket/TEST-20210413T222329.csv --profile your_aws_profile_name - | cat
FIRST_NAME,LAST_NAME,PERSON_ID,PHONE
John,Doe,1,0499010203
Jane,Doe,2,0499040506


Set up the "Transformation" / anonymization

We’re now able to retrieve data from our Oracle database, select the streams we want and store them as CSV files in S3. We’re only missing one small, but very important step: anonymizing the data in transit between the tap and the target.

We’ll be using Pipelinewise Transform Field to achieve the transformation / anonymization. This piece of software basically represents the T that Singer as an EL tool is missing to make it an actual ETL tool.

To set up the transformation part, we’re staying in a familiar theme. Simply set up another virtual environment, activate it, install a dependency via pip and create a configuration file.

python3 -m venv ~/.virtualenvs/transform-field
source ~/.virtualenvs/transform-field/bin/activate
pip install pipelinewise-transform-field

Create a config.json file in the bin directory of this virtual environment. This file contains a simple list of transformations, where each transformation defines the type of transformation to apply to a specific field in a specific stream. The example configuration below hashes the FIRST_NAME field in the TEST stream:

{
  "transformations": [
      {
          "field_id": "FIRST_NAME",
          "tap_stream_name": "TEST",
          "type": "HASH"
      }
  ]
}

The available transformations in the Pipelinewise Transform Field step are listed below (a combined example config follows the list):

  • SET-NULL: transforms any input to NULL
  • HASH: transforms string input to hash
  • HASH-SKIP-FIRST-n: transforms string input to hash skipping first n characters, e.g. HASH-SKIP-FIRST-2
  • MASK-DATE: replaces the month and day parts of date columns so they always become the 1st of January
  • MASK-NUMBER: transforms any numeric value to zero
  • MASK-HIDDEN: transforms any string to ‘hidden’

While these transformations are sufficient for the anonymization needs of our POC case, they can be easily altered. You can even add custom transformations. To do this, edit the transform.py file found in the lib/python3.9/site-packages/transform_field directory of the current virtual environment.

Putting it all together

With these 3 virtual environments in place, we can pipe the tap, transformation and target together to produce the end result we’re looking for. When we run the command below, we should see the ETL pipeline sync the selected schema, anonymize specific fields and store the end result as a CSV file in our S3 bucket:

(tap-oracle) developer@laptop bin % tap-oracle --config config.json --catalog catalog.json | ~/.virtualenvs/transform-field/bin/transform-field --config ~/.virtualenvs/transform-field/bin/config.json | ~/.virtualenvs/target-s3-csv/bin/target-s3-csv --config ~/.virtualenvs/target-s3-csv/bin/config.json
INFO Selected streams: ['ETL-TEST']
INFO No currently_syncing found
INFO Beginning sync of stream(ETL-TEST) with sync method(full)
INFO Stream ETL-TEST is using full_table replication
INFO dsn: (DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=localhost)(PORT=1521))(CONNECT_DATA=(SID=OraDoc)))
time=2021-04-13 22:38:46 name=target_s3_csv level=INFO message=Attempting to create AWS session
INFO select SELECT  "FIRST_NAME" , "LAST_NAME" , "PERSON_ID" , "PHONE" , ORA_ROWSCN
                                FROM ETL.TEST
                               ORDER BY ORA_ROWSCN ASC
INFO METRIC: {"type": "counter", "metric": "record_count", "value": 2, "tags": {}}
time=2021-04-13 22:38:46 name=transform_field level=INFO message=Exiting normally
time=2021-04-13 22:38:46 name=target_s3_csv level=INFO message=Uploading /var/folders/5k/86hrfqsd60b8pnh_81fqfy8h0000gn/T/TEST-20210413T223846.csv to bucket anonymized-data-bucket at TEST-20210413T223846.csv
{"bookmarks": {"ETL-TEST": {"last_replication_method": "FULL_TABLE", "version": 1618346326464, "ORA_ROWSCN": null}}, "currently_syncing": null}

When we now check the CSV in our bucket:

aws s3 cp s3://anonymized-data-bucket/TEST-20210413T223846.csv --profile your_aws_profile_name - | cat
FIRST_NAME,LAST_NAME,PERSON_ID,PHONE
fd53ef835b15485572a6e82cf470dcb41fd218ae5751ab7531c956a2a6bcd3c7,Doe,1,0499010203
500501150a08713128839ca1465bfdc8a426268d6d0a576a16a80c13929f3faa,Doe,2,0499040506


Conclusion

In this blog post you’ve seen that we can easily use Singer to achieve the goals of the POC described in the previous blog post. Because it only requires some simple stuff like Python and an Oracle client, it can be installed and used on most systems and in most environments. It also uses some simple JSON based configuration files that can be easily tailored to a lot of use cases. The source of the taps and targets is also available and is usually easy to understand and adapt where needed. In case you can’t find a tap or target that suits you, writing one shouldn’t be too hard.

In the last installment of this series, I’ll show a way to simplify the whole system a bit more and package it in a way that’s more suited for cloud native usage/deployment.

Side quest: Oracle in Docker

Because I didn’t have access to an existing Oracle database, I needed to run one myself. You’d think that would be easy and quick, but as it turned out it was a bit more complicated and quite frustrating. Especially when compared to running a PostgreSQL or MySQL/MariaDB container using Docker.

Oracle XE was made available a long time ago. It’s a free version of an Oracle database with some limitations, but none of those were problematic for my use case. Oracle XE can be installed natively on Windows and Linux, but because I’m working on a Mac and run my other databases via Docker I wanted to run it using a Docker image.

To do this, I needed to build an Oracle Docker image myself. There is a GitHub repository that contains a lot of Dockerfiles and scripts provided by Oracle. In this repository there is also a section for Oracle databases: Oracle Database container images. You can check out this repository, go to the OracleDatabase/SingleInstance/dockerfiles directory and run the buildContainerImage.sh script to build an actual Oracle database container image.
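Checking out the repository and getting to the right directory looks like this:

git clone https://github.com/oracle/docker-images.git
cd docker-images/OracleDatabase/SingleInstance/dockerfiles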

I wanted to set up an Oracle 12 to match the customer’s installation, but for that one I needed to download some database installation files first. However, it looks like those aren’t publicly available anymore (you can get them via an Oracle contract). So I settled for Oracle 18.4.0 XE:

./buildContainerImage.sh -v 18.4.0 -x
# building the image takes about 5 minutes
docker run -p 1521:1521 -p 5500:5500 -e ORACLE_PWD=admin123 oracle/database:18.4.0-xe
# the first start of the container takes 30+ minutes


On my MacBook Pro starting this Docker container takes more than 30 minutes. On top of that, it usually stops working after a couple of hours, needing a restart that will again take that much time. I tried a number of things, like attaching volumes to store all the data that gets generated during the first start, but that didn’t seem to help. It also uses a lot of resources on my machine.

Even after getting the container running, connecting to the database inside it wasn’t really straightforward because of the pluggable database setup. Once connected, creating a user and schema specific to our POC and connecting with that user was again a lot of trouble: service name vs. SID, and no service name support in the Oracle Python library that is used by the tap.


So in the end, after Googling some more, I found a Docker repository from Oracle that contains an old Oracle 12 Docker container. You need a free Oracle account to be able to do the docker login. There is no volume support, so you’ll lose your data in a lot of cases, but it is easy to set up, starts consistently in a couple of minutes and you can connect to it via an SID. Adding a user and schema also wasn’t too difficult, so I ended up using this instead of the newer GitHub repository approach.

docker login container-registry.oracle.com
docker pull container-registry.oracle.com/database/standard:12.1.0.2
docker run -d --env-file ./env -p 1521:1521 -p 5500:5500 -it --name dockerDB --shm-size="4g" container-registry.oracle.com/database/standard:12.1.0.2
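You can follow the container logs to see when the database has finished initializing and is ready to accept connections:

docker logs -f dockerDB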