December 6, 2022

TOP 10 mistakes in anonymization

Author: Michael Schwenk

Since the brand launch of Libelle DataMasking seven years ago, we have carried out many different and interesting customer projects. Each project presents us with new and exciting challenges. But it would not be us if we didn't accept and then master these challenges.

In this blog post, I have summarized the TOP 10 mistakes in anonymization.

❌ Different anonymization keys for different systems to be anonymized

One of the key features of Libelle DataMasking is the preservation of cross-system and cross-landscape consistency of data, regardless of whether only SAP, only other systems, or a combination of both worlds is considered in the anonymization process. This functionality is achieved by entering the so-called anonymization key.

In the majority of projects, great importance is attached to maintaining cross-system and cross-landscape consistency. For this reason, a uniform anonymization key is defined at a very early stage of the project and applies to all participants from then on.

It becomes problematic if, despite this definition, a different key is suddenly used; a deviating key in just one of the systems involved is enough. A test workflow can then quickly come to a standstill: relevant information about a test case can no longer be found because the anonymized data now diverges.
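The effect of a deviating key can be illustrated with a minimal sketch. It does not show how Libelle DataMasking works internally; it assumes a generic keyed pseudonymization based on HMAC, and the function name `pseudonymize` and the key values are hypothetical.

```python
import hashlib
import hmac


def pseudonymize(value: str, key: bytes) -> str:
    """Derive a replacement token from a value and an anonymization key.

    Assumption for this sketch: a keyed hash (HMAC-SHA256), truncated to
    16 hex characters. The same value and key always give the same token.
    """
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()[:16]


shared_key = b"project-wide-key"      # hypothetical key agreed for all systems
deviating_key = b"someone-elses-key"  # hypothetical key used by mistake

# Two systems that use the agreed key produce the same token for a customer,
# so a test case can follow the record from one system to the other.
token_system_1 = pseudonymize("Mustermann", shared_key)
token_system_2 = pseudonymize("Mustermann", shared_key)

# A third system with a different key produces a token that no longer matches.
token_system_3 = pseudonymize("Mustermann", deviating_key)
```

Here `token_system_1` and `token_system_2` are identical, while `token_system_3` differs, which is exactly the divergence that stops a cross-system test.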

❌ Incorrect order of tables during anonymization

In relational databases, there are usually dependencies between the tables based on referential integrity in order to ensure the consistency and integrity of the data. These dependencies must of course be maintained during the anonymization.

With Libelle DataMasking, a sequence can be defined in which the tables to be handled are anonymized. The software offers up to ten different levels for this purpose. An example: Tables A and B contain related data records.

In a first step (sequence 0), the fields of both tables are anonymized independently of each other. A second step then restores the relatedness of the data records. This is done by means of an SQL SELECT statement that JOINs the tables with each other. The JOIN must be performed in a different sequence (sequence 1 in the example). Otherwise, there is a risk that consistency will be violated. In extreme cases, a deadlock could even occur during anonymization, because two statements may try to access and change the very same data records.
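The two-step idea can be sketched with an in-memory SQLite database. This is only an illustration under assumptions: the table and column names (`table_a`, `table_b`, `a_id`, `name`), the masking expressions and the `STEPS` list are hypothetical and do not reflect the product's configuration format.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create two related tables; table_b repeats the name stored in table_a."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE table_a (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE table_b (id INTEGER PRIMARY KEY,
                              a_id INTEGER REFERENCES table_a(id),
                              name TEXT);
        INSERT INTO table_a VALUES (1, 'Anna Original'), (2, 'Bert Original');
        INSERT INTO table_b VALUES (10, 1, 'Anna Original'),
                                   (11, 2, 'Bert Original'),
                                   (12, 1, 'Anna Original');
    """)
    return conn


def anonymize_independently(conn: sqlite3.Connection) -> None:
    """Sequence 0: mask both tables without looking at the other one."""
    conn.execute("UPDATE table_a SET name = 'Person-' || id")
    conn.execute("UPDATE table_b SET name = 'Masked-' || id")


def restore_relation(conn: sqlite3.Connection) -> None:
    """Sequence 1: JOIN the tables and copy the masked name from A into B."""
    rows = conn.execute(
        "SELECT b.id, a.name FROM table_b AS b "
        "JOIN table_a AS a ON a.id = b.a_id"
    ).fetchall()
    conn.executemany(
        "UPDATE table_b SET name = ? WHERE id = ?",
        [(name, b_id) for b_id, name in rows],
    )


def count_mismatches(conn: sqlite3.Connection) -> int:
    """Number of table_b rows whose name differs from the related table_a row."""
    return conn.execute(
        "SELECT COUNT(*) FROM table_b AS b "
        "JOIN table_a AS a ON a.id = b.a_id WHERE a.name <> b.name"
    ).fetchone()[0]


# Steps are listed out of order on purpose; the sequence number decides.
STEPS = [(1, restore_relation), (0, anonymize_independently)]


def run(conn: sqlite3.Connection) -> None:
    for _, step in sorted(STEPS, key=lambda item: item[0]):
        step(conn)
        conn.commit()  # finish one sequence before the next one starts
```

Running only `anonymize_independently` leaves every row of `table_b` inconsistent with `table_a`; running both steps in sequence order brings `count_mismatches` back to zero. Completing one sequence before starting the next also avoids two statements working on the same records at the same time.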

❌ Missing reference file

The product scope of Libelle DataMasking includes the so-called reference database. It contains possible target values that anonymization can draw on. In addition to Libelle's own reference database, customer-specific reference files can also be used as a basis. Depending on the project concept, the reference files can be provided via a separate workflow or generated automatically with the help of the software.

The typical error is that the file is either not registered as a reference file or not activated in the configuration in which it is required. Another possible source of error is that, with automatic generation, the file was never created by the software as a so-called anonymization activity.

❌ Algorithm does not support the parameters used

Out of the box, the Libelle DataMasking software currently contains 40 anonymization algorithms. Some algorithms manage without additional parameters. These include, for example, the first name algorithm. Others, such as the address algorithm, require additional parameters. The parameters are used to specify exactly how a field is to be anonymized.

In our projects, we see again and again that parameters are defined incorrectly or that parameters are specified which the respective algorithm does not support. In addition, some algorithms require ID values to be assigned. For example, in the case of addresses, a data set consisting of street, house number, postal code and city forms a group. This group is defined by the ID value. If a table contains, for example, the primary and secondary residence, i.e. two address groups, a unique ID must be assigned to each group.
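A simple pre-flight check can catch such configuration errors before a run starts. The following is a minimal sketch under assumptions: the algorithm names, the parameter names `part` and `group_id`, and the function `validate_field` are hypothetical and are not the product's actual configuration schema.

```python
# Hypothetical catalogue: which parameters each algorithm accepts.
SUPPORTED_PARAMETERS = {
    "first_name": set(),             # works without additional parameters
    "address": {"part", "group_id"},  # needs to know the field role and group
}


def validate_field(field: str, algorithm: str, parameters: dict) -> list:
    """Return a list of human-readable configuration errors for one field."""
    if algorithm not in SUPPORTED_PARAMETERS:
        return [f"{field}: unknown algorithm '{algorithm}'"]

    errors = []
    unsupported = sorted(set(parameters) - SUPPORTED_PARAMETERS[algorithm])
    for name in unsupported:
        errors.append(f"{field}: '{algorithm}' does not support '{name}'")

    # Address fields must say which address group they belong to.
    if algorithm == "address" and "group_id" not in parameters:
        errors.append(f"{field}: address fields need a group_id")
    return errors


# Primary and secondary residence in one table: two groups, two IDs.
config = [
    ("STREET_1", "address", {"part": "street", "group_id": 1}),
    ("CITY_1", "address", {"part": "city", "group_id": 1}),
    ("STREET_2", "address", {"part": "street", "group_id": 2}),
    ("CITY_2", "address", {"part": "city"}),            # group_id forgotten
    ("FIRSTNAME", "first_name", {"part": "street"}),    # unsupported parameter
]

all_errors = [e for field, algo, params in config
              for e in validate_field(field, algo, params)]
```

In this example, `all_errors` contains one entry for the missing `group_id` and one for the parameter that the first name algorithm does not accept.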

❌ Incorrect assignment of search or matchcode fields in SAP

I count this phenomenon among the classics in SAP-specific projects. It is always interesting and astonishing to see how the search and matchcode fields are filled in customers' SAP systems and what kinds of values they can hold.

A fairly harmless example: only the first search field is filled, but with both the first and the last name. Our SAP standard templates assume a particular way of filling these fields. This setting must be adapted to the customer's specific needs.

❌ Anonymization of cluster tables requires import of the transport file

SAP systems that are not yet running on SAP HANA usually contain cluster and pool tables in addition to transparent tables. Cluster and pool tables can only be read at the ABAP level. To make this possible, a Libelle-specific function module must be transported into the systems.

The pitfall here is that software updates also include updates to the function module, and importing the new transport file is often forgotten. Libelle DataMasking checks the version of the function module and issues an error message in the event of a discrepancy.

❌ Continue anonymization without restart points

An anonymization run can fail for a variety of reasons. The causes can also lie outside the software, for example a full file system or a full tablespace. After the error has been corrected, the anonymization can be continued. Check whether restart points are activated. With their help, the anonymization can be continued at exactly the data record where the error occurred. If restart points are not active, the table that was being processed is anonymized again from the start, even though part of its data has already been anonymized.
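The difference between the two cases can be shown with a small sketch. It assumes a simple checkpoint that remembers the last processed key; the names `anonymize_table`, `checkpoint` and `mask` are hypothetical, and the masking function is deliberately not idempotent so that double processing becomes visible.

```python
def mask(value: str) -> str:
    """Stand-in transformation; applying it twice is visible in the result."""
    return f"X({value})"


def anonymize_table(rows: list, checkpoint: dict, fail_at=None) -> None:
    """Mask rows in key order and record the last finished key.

    rows: list of [key, value] lists, sorted by key.
    checkpoint: dict acting as the restart point (assumption of this sketch).
    fail_at: key at which a failure is simulated, or None.
    """
    for row in rows:
        key = row[0]
        last_key = checkpoint.get("last_key")
        if last_key is not None and key <= last_key:
            continue  # already anonymized in an earlier attempt
        if key == fail_at:
            raise RuntimeError("simulated failure, e.g. full tablespace")
        row[1] = mask(row[1])
        checkpoint["last_key"] = key


def run_with_one_failure(use_restart_points: bool) -> list:
    rows = [[1, "a"], [2, "b"], [3, "c"]]
    checkpoint = {}
    try:
        anonymize_table(rows, checkpoint, fail_at=3)  # first attempt fails
    except RuntimeError:
        pass
    if not use_restart_points:
        checkpoint = {}  # nothing remembered: the table starts over
    anonymize_table(rows, checkpoint)  # second attempt after the fix
    return [value for _, value in rows]


with_restart = run_with_one_failure(use_restart_points=True)
without_restart = run_with_one_failure(use_restart_points=False)
```

With restart points, every value is masked exactly once. Without them, the first two records are processed a second time, which in this sketch shows up as a nested `X(X(...))` value.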

❌ No update, but reinstallation

In some projects, we reach quite a high degree of customization. Extensions that deviate from the standard (e.g. scripts) are taken into account when the software is updated. However, if the software is instead reinstalled in parallel, the customizing settings have to be laboriously transferred to the new environment: additional files must be copied, and the settings within the tool itself must be adjusted again.

❌ Primary key index cannot be created due to duplicates

I also count this case among the classic errors in the projects. It happens that fields have to be anonymized which are part of the primary key of a table. A typical example is the table TIBAN in SAP. Using our algorithm, we validly recalculate the values of an IBAN in Libelle DataMasking. But often the quality of the original data throws a spanner into the works.

To stay with the example: a recurring problem is missing check digits. The same account then exists in the system once with and once without check digits. The algorithm, however, always calculates a valid check digit, even for the record in which it is actually missing. This produces duplicates, so the primary key, which was dropped beforehand so that these fields could be processed, can no longer be recreated.
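How two distinct source keys collapse into one can be sketched with the standard IBAN check digit calculation (mod 97). This is an illustration only: it leaves the account part unchanged instead of replacing it, it assumes German IBANs with an 18-digit BBAN, and the helper names are hypothetical.

```python
from collections import Counter


def iban_check_digits(country: str, bban: str) -> str:
    """Standard IBAN check digits: 98 minus (rearranged number mod 97)."""
    rearranged = f"{bban}{country}00"
    # Letters become numbers: A=10 ... Z=35; digits stay as they are.
    numeric = "".join(str(int(ch, 36)) for ch in rearranged)
    return f"{98 - int(numeric) % 97:02d}"


def with_valid_check_digits(raw: str, bban_length: int = 18) -> str:
    """Rebuild an IBAN with valid check digits, whether or not it had any.

    Assumption: the last bban_length characters are the BBAN (18 for DE).
    """
    country = raw[:2]
    bban = raw[-bban_length:]
    return f"{country}{iban_check_digits(country, bban)}{bban}"


def duplicate_keys(values: list) -> list:
    """Values that occur more than once and would violate a primary key."""
    counts = Counter(values)
    return sorted(value for value, n in counts.items() if n > 1)


# Two different key values in the source: one with, one without check digits.
source_keys = ["DE89370400440532013000", "DE370400440532013000"]

recalculated = [with_valid_check_digits(key) for key in source_keys]
duplicates = duplicate_keys(recalculated)
```

The source keys are distinct, but after recalculation both carry the same valid check digits, so `duplicates` is no longer empty and a unique index on that column cannot be rebuilt. Checking the source for such pairs before the run is one way to find the affected records early.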

❌ Address data no longer in original region

In many projects, it is important that address data is not altered arbitrarily but remains in the original region. This can also be implemented with Libelle DataMasking.

However, this requires high-quality original data. If the region (e.g. state) or country is not maintained cleanly, the anonymized values end up in a completely different area. If the data in the system is incomplete, customizing can be used to create a mapping so that the addresses remain in their original region even after anonymization.
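Such a mapping can be pictured as follows. The sketch is hypothetical throughout: the alias table, the postal code prefixes, the tiny city pools and the function names are assumptions for illustration, not the product's reference data or customizing format.

```python
import zlib

# Hypothetical reference pools: replacement cities per region code.
REFERENCE_CITIES = {
    "BY": ["Augsburg", "Regensburg"],
    "NW": ["Dortmund", "Essen"],
}

# Hypothetical customizing: inconsistent spellings mapped to one region code.
REGION_ALIASES = {"by": "BY", "bayern": "BY", "bavaria": "BY",
                  "nw": "NW", "nrw": "NW"}

# Hypothetical fallback for records without a region: postal code prefix.
POSTAL_PREFIX_TO_REGION = {"80": "BY", "50": "NW"}


def resolve_region(raw_region, postal_code=None):
    """Return a region code, or None if the record cannot be assigned."""
    cleaned = (raw_region or "").strip().lower()
    if cleaned in REGION_ALIASES:
        return REGION_ALIASES[cleaned]
    if postal_code:
        return POSTAL_PREFIX_TO_REGION.get(postal_code[:2])
    return None


def replacement_city(original_city, raw_region, postal_code=None):
    """Pick a replacement city from the pool of the record's own region."""
    region = resolve_region(raw_region, postal_code)
    if region is None:
        raise LookupError(f"no region mapping for {raw_region!r}/{postal_code!r}")
    pool = REFERENCE_CITIES[region]
    # Deterministic choice so that repeated runs give the same result.
    return pool[zlib.crc32(original_city.encode("utf-8")) % len(pool)]


clean_record = replacement_city("Munich", " Bayern ")
incomplete_record = replacement_city("Munich", "", postal_code="80331")
```

Both calls resolve to the same region, so the replacement comes from the same pool even though the second record has no region maintained. A record that matches neither table raises an error instead of silently landing in another area.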

With Libelle DataMasking, Libelle IT Group has developed a solution for the required anonymization and pseudonymization. The solution was designed to produce anonymized, logically consistent data on development, test and QA systems across all platforms.
