December 18, 2022

Libelle IT Glossary Part 21: How does a data pipeline work?

Author: Michael Schwenk

According to an IDC study covering the period from 2018 to 2025, an estimated 88 to 97 percent of the data generated worldwide is no longer stored at all. The alternative to storing data, according to IDC, is to capture, process and analyze it in main memory in real time. This could be one of the reasons for the growing need for scalable data pipelines.

Other reasons include:

  • Accelerated data processing
  • A shortage of data engineers
  • The pace of innovation

We describe the types of data pipelines in more detail in our glossary article "What is a data pipeline?".

Here's how a data pipeline works

A data pipeline is the entire path that data takes through an organization. Within the pipeline, data passes through the following four steps:

1. Capture and extract the raw datasets

In this step, all data is captured and extracted. The results are referred to as raw datasets because the data is neither structured nor classified. A dataset here combines data that can come from several different sources, which are available in different forms, for example:

  • Database tables
  • File names
  • Topics (Kafka)
  • Queues (JMS)
  • File paths (HDFS)

No meaningful conclusions can be drawn yet from this huge amount of data.
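To make the capture step more concrete, here is a minimal Python sketch of how raw records from heterogeneous sources could be collected into one unclassified set. The table, file and source names are purely illustrative assumptions; a production pipeline would use the appropriate Kafka, JMS or HDFS clients instead of the simple stand-ins shown here.

```python
# Minimal sketch of the capture/extract step: pull raw records from
# heterogeneous sources into one list of (source, payload) tuples.
# "sales.db", the "orders" table and the "exports/" directory are
# hypothetical placeholders.
import csv
import sqlite3
from pathlib import Path


def extract_from_database(db_path: str) -> list[tuple[str, dict]]:
    """Read every row of a (hypothetical) 'orders' table as a raw record."""
    with sqlite3.connect(db_path) as conn:
        conn.row_factory = sqlite3.Row
        rows = conn.execute("SELECT * FROM orders").fetchall()
    return [("database:orders", dict(row)) for row in rows]


def extract_from_files(directory: str) -> list[tuple[str, dict]]:
    """Read every CSV file in a directory; the file path identifies the source."""
    records = []
    for path in Path(directory).glob("*.csv"):
        with path.open(newline="") as handle:
            for row in csv.DictReader(handle):
                records.append((f"file:{path.name}", row))
    return records


def capture_raw_datasets() -> list[tuple[str, dict]]:
    """Combine raw, unclassified records from all configured sources."""
    raw = []
    raw += extract_from_database("sales.db")   # database table
    raw += extract_from_files("exports/")      # file paths
    # A Kafka topic or JMS queue would be consumed here with its own client.
    return raw
```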

2. Data management

In the next phase of the data pipeline, the raw datasets are comprehensively organized using a defined method. The technical term for this step is data governance. It first puts the raw data into a business context; this is followed by data quality and security controls. The data is now organized for broad use.
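As an illustration of this governance step, the following Python sketch attaches a business context to each raw record and runs simple quality and security controls. The field names and rules (required fields, masked fields) are assumptions made for the example, not part of any specific product.

```python
# Minimal sketch of the data-governance step on the raw records from step 1:
# attach business context, then apply simple quality and security rules.
# REQUIRED_FIELDS and SENSITIVE_FIELDS are illustrative assumptions.
from dataclasses import dataclass, field


@dataclass
class GovernedRecord:
    source: str                  # business context: where the record came from
    domain: str                  # business context: e.g. "sales", "logistics"
    payload: dict
    quality_issues: list = field(default_factory=list)


REQUIRED_FIELDS = {"order_id", "amount"}   # illustrative quality rule
SENSITIVE_FIELDS = {"customer_email"}      # illustrative security rule


def govern(source: str, payload: dict, domain: str) -> GovernedRecord:
    record = GovernedRecord(source=source, domain=domain, payload=dict(payload))
    # Quality control: flag missing mandatory fields instead of silently dropping data.
    for name in REQUIRED_FIELDS - set(record.payload):
        record.quality_issues.append(f"missing field: {name}")
    # Security control: mask sensitive values before the data is shared widely.
    for name in SENSITIVE_FIELDS & set(record.payload):
        record.payload[name] = "***masked***"
    return record
```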

3. Data transformation

The third step is data transformation, in which the datasets are cleaned and converted into the appropriate reporting formats. The basis for the transformation is the set of rules and guidelines established by the company, according to which the data pipeline enriches the remaining data with information and deletes unnecessary or invalid data.

The following steps should be considered to ensure the quality and accessibility of the data (a short sketch follows the list):

  • Standardization: The company must define which data is useful and how it should be formatted and stored.

  • Deduplication: The company reports all duplicates to the data stewards. Redundant data must be deleted and/or excluded.

  • Checking: It is advisable to perform automated checks to compare similar information such as transaction times and access logs. Checks can further weed out unusable data and identify anomalies in systems, applications or data.

  • Sorting: Grouping items such as raw data or multimedia files into appropriate categories can increase the efficiency of the data pipeline. Transformation rules determine how each piece of data is classified and what step it goes through next. These transformation steps reduce the amount of unusable material and convert it into qualified data.

  • Sharing the data: After the transformation, the company receives reliable data that it can use. The data is often output to a cloud data warehouse or an application.
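The sketch below ties these transformation steps together in Python: it standardizes formats, excludes duplicates, applies a simple plausibility check and sorts each record into a category. All field names, formats and rules are illustrative assumptions; in practice they would come from the company's own guidelines and would typically be maintained outside the code, for example as configuration.

```python
# Minimal sketch of the transformation step on the governed records,
# following the list above: standardize, deduplicate, check, sort.
from datetime import datetime


def standardize(payload: dict) -> dict:
    """Normalize formats, e.g. ISO dates and amounts as floats (assumed fields)."""
    clean = dict(payload)
    if "order_date" in clean:
        clean["order_date"] = datetime.strptime(
            clean["order_date"], "%d.%m.%Y").date().isoformat()
    if "amount" in clean:
        clean["amount"] = float(str(clean["amount"]).replace(",", "."))
    return clean


def transform(records: list[dict]) -> dict[str, list[dict]]:
    seen = set()                  # deduplication on a business key
    categories: dict[str, list[dict]] = {"valid": [], "rejected": []}
    for payload in records:
        row = standardize(payload)
        key = row.get("order_id")
        if key in seen:           # duplicate: exclude instead of loading twice
            continue
        seen.add(key)
        # Checking: a simple plausibility rule weeds out unusable data.
        if row.get("amount", 0) <= 0:
            categories["rejected"].append(row)
        else:
            categories["valid"].append(row)   # sorted into the target category
    return categories
```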

4. Data processing and integration

Data integration is the goal of any data pipeline because consumers want actionable data in real time. Therefore, organizations should ideally use a repeatable process for the data pipeline.
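As a final illustration, here is a minimal Python sketch of the load side of this integration step, assuming records that have already been transformed as above. sqlite3 merely stands in for a cloud data warehouse, and the table and column names are assumptions; using INSERT OR REPLACE makes repeated runs of the same load idempotent, which supports the repeatable process mentioned above.

```python
# Minimal sketch of the integration/load step; sqlite3 stands in for a
# cloud data warehouse, and "orders_clean" and its columns are assumptions.
import sqlite3


def load(records: list[dict], target_db: str = "warehouse.db") -> int:
    """Write transformed records into the reporting table of the target store."""
    with sqlite3.connect(target_db) as conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS orders_clean "
            "(order_id TEXT PRIMARY KEY, order_date TEXT, amount REAL)")
        conn.executemany(
            "INSERT OR REPLACE INTO orders_clean "
            "VALUES (:order_id, :order_date, :amount)",
            records)
    return len(records)


if __name__ == "__main__":
    sample = [{"order_id": "A-1", "order_date": "2022-12-01", "amount": 19.90}]
    print(f"loaded {load(sample)} record(s)")
```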

Things to know about data pipelines

A wide variety of tools can be integrated into a data pipeline, for example for anonymizing data. In the blog post "Anonymized data in the data pipeline", you will find two practical examples that explain the advantages of seamlessly integrating Libelle DataMasking in more detail.

