December 19, 2022

Anonymized data in the data pipeline

Author: Michael Schwenk

Data management is one of IT's crowning disciplines. More and more companies are turning to data pipelines to unleash the potential of their data as quickly as possible and to meet the needs of their customers.

In my blog post "What is a data pipeline?", I looked at the concept and meaning of a data pipeline from a theoretical perspective; in short, it is about managing the flow of data as a process. In the second part of the glossary post, "How does a data pipeline work?", we then looked in detail at how a data pipeline operates.

Data pipelines are thus nothing more than automated or automatable processes (see also the blog post) in which data is moved from one place to another. More specifically, not only the extraction and cleansing of data but also its provisioning can be completely automated.

Data pipelines essentially consist of three stages:

  • Stage 1: The source system
  • Stage 2: The processing or transformation of the data
  • Stage 3: The target system
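To make these three stages concrete, here is a minimal sketch in Python. SQLite and the table and column names are stand-ins chosen for illustration only; they are not taken from either of the projects described below.

    # Minimal sketch of the three stages; SQLite and the table/column names
    # are stand-ins chosen for illustration, not part of either project.
    import sqlite3

    def run_pipeline(source_db: str, target_db: str) -> None:
        # Stage 1: read raw records from the source system
        with sqlite3.connect(source_db) as src:
            rows = src.execute("SELECT id, email, amount FROM orders").fetchall()

        # Stage 2: process / transform the data (here: trivial cleansing)
        transformed = [(rid, email.strip().lower(), amount)
                       for rid, email, amount in rows]

        # Stage 3: load the result into the target system
        with sqlite3.connect(target_db) as tgt:
            tgt.execute("CREATE TABLE IF NOT EXISTS orders_clean "
                        "(id INTEGER, email TEXT, amount REAL)")
            tgt.executemany("INSERT INTO orders_clean VALUES (?, ?, ?)", transformed)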

Data pipelines are the responsibility of either Data Engineers or Data Scientists, depending on the organizational structure and size of a company.

A data pool is often defined as the target system; this can be a data warehouse, for example, but also a data lake. Depending on the purpose the data pipeline is meant to fulfill, it may be that no conclusions whatsoever may be drawn about real, existing persons in the subsequent analyses of the data. This is where solutions such as Libelle DataMasking come into play to mask the data through anonymization or pseudonymization. The advantage of our solution is that it can be integrated very easily into existing processes.
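To make the distinction tangible, here is a toy sketch in Python with a made-up e-mail address: anonymization replaces a value irreversibly, while pseudonymization replaces it with a consistent stand-in that can, in principle, be mapped back. This is only an illustration of the two concepts, not how Libelle DataMasking works internally.

    # Toy illustration of anonymization vs. pseudonymization; the value and
    # the mapping-table approach are for illustration, not Libelle internals.
    import secrets

    _pseudonyms: dict[str, str] = {}  # mapping table kept only for pseudonymization

    def anonymize(value: str) -> str:
        # Irreversible: no link back to the original value is kept.
        return "ANON-" + secrets.token_hex(4)

    def pseudonymize(value: str) -> str:
        # Reversible in principle: the mapping table links stand-in and original.
        if value not in _pseudonyms:
            _pseudonyms[value] = "PSEU-" + secrets.token_hex(4)
        return _pseudonyms[value]

    print(anonymize("jane.doe@example.com"))     # a new value on every call
    print(pseudonymize("jane.doe@example.com"))  # the same stand-in on every call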

Using two practical examples, I would like to explain how this integration can take place.

Practical example 1 - Data from different sources

In the first project, two completely different data sources form the basis of the data pipeline. On the one hand, there is an SAP system as the first system in the process chain; on the other hand, there is an application that is not SAP-specific. Although the data comes from different sources, it is internally consistent and meets defined criteria.

For stage 2 mentioned above, the data is extracted from the SAP system via a defined interface and imported into an intermediate database, where it is prepared for further processing.

In the case of the non-SAP-specific application, the data for stage 2 is extracted or exported from the tool, in this case to structured text files.
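A rough sketch of this intake into stage 2 is shown below, assuming a generic staging database and a tab-separated export file; the table layout, column names, and export format are placeholders, not the project's actual interfaces.

    # Sketch of stage 2 intake from the two sources; connection details,
    # table layout and export format are assumptions made for illustration.
    import csv
    import sqlite3

    def load_sap_extract(staging_db: str, extracted_rows: list[tuple]) -> None:
        """Write rows pulled from the SAP interface into the intermediate database."""
        with sqlite3.connect(staging_db) as con:
            con.execute("CREATE TABLE IF NOT EXISTS sap_staging "
                        "(customer_id TEXT, name TEXT, city TEXT)")
            con.executemany("INSERT INTO sap_staging VALUES (?, ?, ?)", extracted_rows)

    def read_non_sap_export(path: str) -> list[dict]:
        """Read the structured text file exported from the non-SAP application."""
        with open(path, newline="", encoding="utf-8") as f:
            return list(csv.DictReader(f, delimiter="\t"))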

In this second stage, in which the data from both sources must be transformed, one of the intermediate steps is the mandatory anonymization of the data with the help of Libelle DataMasking, because the group of people who will eventually work with this data must not see any real data.

As mentioned, the data originating from SAP is first loaded into another database, where it is prepared for further processing. After this preparation, certain attributes of the data are anonymized.

After the data has been exported from the non-SAP application, these files are anonymized immediately, in such a way that the relationship to the SAP-based data is preserved, i.e. the consistency of the data is maintained.
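The crucial point is that the same real value must be replaced by the same masked value in both the staging database and the exported files, otherwise the two data sets could no longer be related to each other. Below is a hedged sketch of such a deterministic replacement; the key, column names, and file layout are assumptions, and the keyed mapping merely stands in for the masking rules that Libelle DataMasking applies.

    # Sketch of consistency-preserving masking across both sources; the keyed
    # deterministic mapping is a stand-in for the product's masking rules.
    import csv
    import hashlib
    import hmac
    import sqlite3

    MASKING_KEY = b"project-specific-secret"  # assumed key for the example

    def mask(value: str) -> str:
        # Same input -> same output, so SAP and non-SAP records stay linkable.
        return hmac.new(MASKING_KEY, value.encode(), hashlib.sha256).hexdigest()[:10]

    def mask_staging_db(staging_db: str) -> None:
        with sqlite3.connect(staging_db) as con:
            rows = con.execute("SELECT rowid, customer_id FROM sap_staging").fetchall()
            con.executemany("UPDATE sap_staging SET customer_id = ? WHERE rowid = ?",
                            [(mask(cid), rid) for rid, cid in rows])

    def mask_export_file(path_in: str, path_out: str) -> None:
        with open(path_in, newline="", encoding="utf-8") as f_in, \
             open(path_out, "w", newline="", encoding="utf-8") as f_out:
            reader = csv.DictReader(f_in, delimiter="\t")
            writer = csv.DictWriter(f_out, fieldnames=reader.fieldnames, delimiter="\t")
            writer.writeheader()
            for row in reader:
                row["customer_id"] = mask(row["customer_id"])  # identical mapping
                writer.writerow(row)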

The final step in the second stage is the onward transport or transfer of the now masked data into the target system. In this project, that is a data warehouse in which the data can then be analyzed. However, the data can just as well be used for testing purposes.

Practical example 2 - Data pipeline with batch processing

The procedure in the second project is completely different. The data pipeline there is a highly automated batch-processing pipeline.

The source system is a productive database. In the second step of the pipeline, the processing and transformation of the data, this database is cloned first. This step is already script-controlled, fully automated, and takes place at a fixed time outside regular business and working hours.
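A hedged sketch of how such a scheduled, scripted clone step might be wired up follows. The clone command is a placeholder for the platform-specific tooling, and the 02:00 cron schedule is only an example of a fixed time outside business hours.

    # Sketch of the scripted clone step; "clone_production.sh" is a placeholder
    # for the platform-specific clone tooling. A crontab entry such as
    #   0 2 * * *  /usr/bin/python3 /opt/pipeline/run_clone.py
    # would trigger it at a fixed time outside business hours.
    import logging
    import subprocess
    import sys

    logging.basicConfig(level=logging.INFO, filename="pipeline.log")

    def run_step(name: str, command: list[str]) -> None:
        """Run one pipeline step and abort the whole run if it fails."""
        result = subprocess.run(command, capture_output=True, text=True)
        if result.returncode != 0:
            logging.error("%s failed: %s", name, result.stderr)
            sys.exit(result.returncode)
        logging.info("%s completed", name)

    if __name__ == "__main__":
        run_step("clone", ["/opt/pipeline/clone_production.sh"])  # placeholder command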

Due to internal regulations, the data analysts who will ultimately work with the data are not authorized to see real data. For this reason, the data is anonymized using Libelle DataMasking immediately after the clone is completed. This sub-step is also fully automated and scripted, and runs outside business hours. When the data pipeline was set up, the data to be anonymized was specified once; the browser-based front end is therefore not needed to trigger the anonymization automatically.
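Assuming the masking run can be started from the command line once the rules have been defined, the anonymization can simply be chained after the clone in the same scripted run. The command shown is a placeholder, not the actual Libelle DataMasking invocation, which depends on the environment.

    # Sketch of the fully scripted anonymization sub-step; "start_masking.sh"
    # is a placeholder for however the masking run is triggered here, since
    # the columns to mask were defined once during setup and no browser
    # front end is needed for the automated run.
    import subprocess

    def anonymize_clone() -> None:
        # Runs unattended right after the clone, outside business hours.
        subprocess.run(["/opt/pipeline/start_masking.sh"], check=True)

    if __name__ == "__main__":
        anonymize_clone()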

In stage 3, the data is made available to the responsible department for its intended purpose.

Conclusion

Both examples show how easily Libelle DataMasking can be integrated into existing workflows. The second example in particular also shows that the automated data pipeline is not slowed down by Libelle DataMasking but continues to run smoothly.

