Test data anonymization: The challenge of GDPR-compliant CSV files

Author: Michael Schwenk

When it comes to the EU General Data Protection Regulation (GDPR), compliance requirements and test data management, one often thinks first of data stored in databases. However, companies also store data in a variety of other forms, such as JSON, XML, plain-text or CSV files. These files can contain sensitive data, such as personal data, as well.

In production environments, both the information stored in databases and the files at operating system level require a stringent authorization system so that only authorized people, or groups of people, have access to the sensitive data.

If personal data, e.g. in CSV files, is also to be used for test purposes, these files must be included in the test data management concept.

What exactly is a CSV file?

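The abbreviation CSV stands for "Comma-Separated Values". It describes a text file format for storing or exchanging simply structured data.

A CSV file can represent a table or a list. Within the text file, certain characters have a special function: a separator character divides the fields, a line break ends a record, and quotation marks typically enclose fields that themselves contain a separator.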
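To make the role of these special characters concrete, here is a minimal sketch using Python's standard csv module. The sample data and column names are invented for illustration and do not come from any real export.

```python
import csv
import io

# Hypothetical sample: a header line followed by two records.
# The quoted field contains a comma that must NOT be treated as a separator.
SAMPLE = 'id,name,city\n1,"Meier, Anna",Stuttgart\n2,Jürgen Groß,Köln\n'


def read_rows(text, delimiter=","):
    """Parse CSV text into a list of dicts keyed by the header line."""
    reader = csv.DictReader(io.StringIO(text, newline=""), delimiter=delimiter)
    return list(reader)


rows = read_rows(SAMPLE)
```

Each record becomes a dictionary whose keys are taken from the first line, which is why a missing header line changes how the whole file is interpreted.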

Challenges in anonymizing CSV files

In our projects with Libelle DataMasking, we have occasionally had to anonymize additional structured data at the operating system level as well. Thanks to the numerous interfaces the tool offers, it makes no difference whether the data is located in a database or in the file system. However, because the files are essentially interpreted as database tables, the column headings must be present.

In some projects, the files are created by an export from third-party software that does not write the first line with the header information and offers no option to add it. Initially, this information was added to the files manually. But this step can also be automated with Libelle DataMasking, which removes a possible source of errors.

Of course, this also applies in the reverse case: in the same project, the header information is not required for further processing of the files, so it is removed from the files immediately after anonymization with the help of Libelle DataMasking.
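The header round trip can be pictured with a short, tool-independent sketch in Python. This is not how Libelle DataMasking implements the step; the column names, the semicolon separator and both function names are assumptions made purely for illustration.

```python
import csv
import io

# Hypothetical column names; a real export would define its own.
HEADER = ["customer_id", "last_name", "email"]


def add_header(text, header, delimiter=";"):
    """Return the CSV text with a header line prepended."""
    out = io.StringIO(newline="")
    csv.writer(out, delimiter=delimiter, lineterminator="\n").writerow(header)
    return out.getvalue() + text


def strip_header(text):
    """Return the CSV text without its first line.

    Assumes the header line contains no quoted line breaks.
    """
    _first_line, _sep, rest = text.partition("\n")
    return rest


exported = "1;Meier;a.meier@example.com\n2;Schulz;b.schulz@example.com\n"
with_header = add_header(exported, HEADER)
restored = strip_header(with_header)
```

Prepending the header before anonymization and stripping it afterwards leaves the file in exactly the layout the downstream process expects.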

One challenge we always face in projects with CSV files is the character set in which the files are stored. Even though the same standardized software is always used to create the CSV files, in some projects it was not a given that the files always had the same character set. We often found that some files were exported from the third-party software in the ANSI character set (on Windows, this usually means a legacy code page such as Windows-1252). Libelle DataMasking assumes the UTF-8 character set by default. If a source file is in a different character set, the anonymized values may contain unreadable characters, for example where the original contained umlauts.

Here, too, the customer initially converted the files manually. But this step can also be completely automated with Libelle DataMasking.
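What such a conversion involves can be sketched in a few lines of Python. This is a generic illustration, not the mechanism used by Libelle DataMasking. It assumes that "ANSI" means the Windows-1252 code page (cp1252 in Python), which is typical for Western European Windows systems but is not guaranteed; the function name is hypothetical.

```python
def to_utf8(raw: bytes, fallback: str = "cp1252") -> bytes:
    """Return the content as UTF-8 bytes.

    If the input is already valid UTF-8 it is returned unchanged;
    otherwise it is decoded with the fallback code page (an assumption)
    and re-encoded as UTF-8.
    """
    try:
        raw.decode("utf-8")
        return raw
    except UnicodeDecodeError:
        return raw.decode(fallback).encode("utf-8")


ansi_bytes = "Müller;Köln\n".encode("cp1252")
utf8_bytes = to_utf8(ansi_bytes)
```

Trying UTF-8 first is a pragmatic choice: bytes for umlauts in Windows-1252 are almost never valid UTF-8, so a failed decode is a reasonable signal that the file needs converting. Where the source encoding is known for certain, stating it explicitly is safer than guessing.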

Especially if the CSV files are saved with Microsoft Excel, there may be differences depending on the regional settings under which Excel runs. In English-speaking countries, the comma is the standard separator in these files, which is also where the name "Comma-Separated Values" comes from.

In German-speaking countries, however, the semicolon has become the standard separator, since the comma serves as the decimal separator there. The tab character is another possible separator. No matter which separator is found in the files, Libelle DataMasking allows country-specific settings to be configured easily so that the information in the defined fields is correctly anonymized with the respective stored algorithm.
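For files of unknown origin, the separator can also be estimated from the content. The following Python sketch uses a simple heuristic: it tries each candidate and keeps the one that splits the first few lines into the same number of columns, with more than one column. The function name and the candidate list are assumptions; Python's standard library also offers csv.Sniffer for this purpose.

```python
import csv
import io

CANDIDATES = (",", ";", "\t")


def guess_delimiter(text, candidates=CANDIDATES, sample_lines=5):
    """Guess the separator from the first lines of CSV text.

    Assumes the sampled lines contain no quoted line breaks.
    """
    sample = "\n".join(text.splitlines()[:sample_lines])
    best, best_cols = None, 1
    for delim in candidates:
        reader = csv.reader(io.StringIO(sample, newline=""), delimiter=delim)
        widths = {len(row) for row in reader if row}
        # Accept only candidates that yield one consistent column count.
        if len(widths) == 1:
            cols = widths.pop()
            if cols > best_cols:
                best, best_cols = delim, cols
    if best is None:
        raise ValueError("could not determine the separator")
    return best


german_export = "id;name;amount\n1;Müller;12,50\n2;Meier;7,00\n"
separator = guess_delimiter(german_export)
```

The German-style sample shows why the consistency check matters: the decimal commas appear only in the data lines, so splitting on the comma gives uneven column counts and that candidate is rejected.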

Protect your personal data now

Whether at database level or in files at operating system level (e.g. CSV files), the Libelle DataMasking solution lets you handle the required anonymization and pseudonymization. The solution was designed to produce anonymized, logically consistent data on development, test and QA systems across all platforms. Meet the challenge of GDPR-compliant test data with Libelle DataMasking.
