December 18, 2022

Libelle IT Glossary Part 21: How does a data pipeline work?

Author: Michael Schwenk

According to an IDC study covering the period from 2018 to 2025, an estimated 88 to 97 percent of the data generated worldwide is no longer stored at all. The alternative to storing data, according to IDC, is to capture, process and analyze it in main memory in real time. This could be one of the reasons for the growing need for scalable data pipelines.

Other reasons include:

  • Accelerated data processing
  • A shortage of data engineers
  • The pace of innovation

We describe the types of data pipelines in more detail in our glossary article "What is a data pipeline?".

Here's how a data pipeline works

A data pipeline is the entire path that data takes through an organization. Within the pipeline, data passes through the following four steps:

1. Capture and extract the raw datasets

In this step, all data is captured and extracted. The results are referred to as raw datasets because the data is neither structured nor classified. A dataset here combines data that can come from several different sources, which are available in different forms, for example:

  • Database tables
  • File names
  • Topics (Kafka)
  • Queues (JMS)
  • File paths (HDFS)

No meaningful conclusions can be drawn yet from this huge amount of data.
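To make the capture step more concrete, here is a minimal Python sketch of how raw records from heterogeneous sources could be collected into one unclassified set. The table, file and source names are purely illustrative assumptions; a production pipeline would use the appropriate Kafka, JMS or HDFS clients instead of the simple stand-ins shown here.

```python
# Minimal sketch of the capture/extract step: pull raw records from
# heterogeneous sources into one list of (source, payload) tuples.
# "sales.db", the "orders" table and the "exports/" directory are
# hypothetical placeholders.
import csv
import sqlite3
from pathlib import Path


def extract_from_database(db_path: str) -> list[tuple[str, dict]]:
    """Read every row of a (hypothetical) 'orders' table as a raw record."""
    with sqlite3.connect(db_path) as conn:
        conn.row_factory = sqlite3.Row
        rows = conn.execute("SELECT * FROM orders").fetchall()
    return [("database:orders", dict(row)) for row in rows]


def extract_from_files(directory: str) -> list[tuple[str, dict]]:
    """Read every CSV file in a directory; the file path identifies the source."""
    records = []
    for path in Path(directory).glob("*.csv"):
        with path.open(newline="") as handle:
            for row in csv.DictReader(handle):
                records.append((f"file:{path.name}", row))
    return records


def capture_raw_datasets() -> list[tuple[str, dict]]:
    """Combine raw, unclassified records from all configured sources."""
    raw = []
    raw += extract_from_database("sales.db")   # database table
    raw += extract_from_files("exports/")      # file paths
    # A Kafka topic or JMS queue would be consumed here with its own client.
    return raw
```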

2. Data management

In the next phase of the data pipeline, the raw datasets are comprehensively organized using a defined method. The technical term for this step is data governance. It first puts the raw data into a business context; this is followed by data quality and security controls. The data is now organized for broad use.
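As an illustration of this governance step, the following Python sketch attaches a business context to each raw record and runs simple quality and security controls. The field names and rules (required fields, masked fields) are assumptions made for the example, not part of any specific product.

```python
# Minimal sketch of the data-governance step on the raw records from step 1:
# attach business context, then apply simple quality and security rules.
# REQUIRED_FIELDS and SENSITIVE_FIELDS are illustrative assumptions.
from dataclasses import dataclass, field


@dataclass
class GovernedRecord:
    source: str                  # business context: where the record came from
    domain: str                  # business context: e.g. "sales", "logistics"
    payload: dict
    quality_issues: list = field(default_factory=list)


REQUIRED_FIELDS = {"order_id", "amount"}   # illustrative quality rule
SENSITIVE_FIELDS = {"customer_email"}      # illustrative security rule


def govern(source: str, payload: dict, domain: str) -> GovernedRecord:
    record = GovernedRecord(source=source, domain=domain, payload=dict(payload))
    # Quality control: flag missing mandatory fields instead of silently dropping data.
    for name in REQUIRED_FIELDS - set(record.payload):
        record.quality_issues.append(f"missing field: {name}")
    # Security control: mask sensitive values before the data is shared widely.
    for name in SENSITIVE_FIELDS & set(record.payload):
        record.payload[name] = "***masked***"
    return record
```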

3. Data transformation

The third step is data transformation, in which the datasets are cleaned and converted into the appropriate reporting formats. The basis for the transformation is the set of rules and guidelines established by the company, according to which the data pipeline enriches the remaining data with information and deletes unnecessary or invalid data.

The following steps should be considered to ensure the quality and accessibility of the data (a short sketch follows the list):

  • Standardization: The company must define which data is useful and how it should be formatted and stored.

  • Deduplication: The company reports all duplicates to the data stewards. Redundant data must be deleted and/or excluded.

  • Checking: It is advisable to perform automated checks to compare similar information such as transaction times and access logs. Checks can further weed out unusable data and identify anomalies in systems, applications or data.

  • Sorting: Grouping items such as raw data or multimedia files into appropriate categories can increase the efficiency of the data pipeline. Transformation rules determine how each piece of data is classified and what step it goes through next. These transformation steps reduce the amount of unusable material and convert it into qualified data.

  • Sharing the data: After the transformation, the company receives reliable data that it can use. The data is often output to a cloud data warehouse or an application.
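The sketch below ties these transformation steps together in Python: it standardizes formats, excludes duplicates, applies a simple plausibility check and sorts each record into a category. All field names, formats and rules are illustrative assumptions; in practice they would come from the company's own guidelines and would typically be maintained outside the code, for example as configuration.

```python
# Minimal sketch of the transformation step on the governed records,
# following the list above: standardize, deduplicate, check, sort.
from datetime import datetime


def standardize(payload: dict) -> dict:
    """Normalize formats, e.g. ISO dates and amounts as floats (assumed fields)."""
    clean = dict(payload)
    if "order_date" in clean:
        clean["order_date"] = datetime.strptime(
            clean["order_date"], "%d.%m.%Y").date().isoformat()
    if "amount" in clean:
        clean["amount"] = float(str(clean["amount"]).replace(",", "."))
    return clean


def transform(records: list[dict]) -> dict[str, list[dict]]:
    seen = set()                  # deduplication on a business key
    categories: dict[str, list[dict]] = {"valid": [], "rejected": []}
    for payload in records:
        row = standardize(payload)
        key = row.get("order_id")
        if key in seen:           # duplicate: exclude instead of loading twice
            continue
        seen.add(key)
        # Checking: a simple plausibility rule weeds out unusable data.
        if row.get("amount", 0) <= 0:
            categories["rejected"].append(row)
        else:
            categories["valid"].append(row)   # sorted into the target category
    return categories
```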

4. Data processing and integration

Data integration is the goal of any data pipeline because consumers want actionable data in real time. Therefore, organizations should ideally use a repeatable process for the data pipeline.
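As a final illustration, here is a minimal Python sketch of the load side of this integration step, assuming records that have already been transformed as above. sqlite3 merely stands in for a cloud data warehouse, and the table and column names are assumptions; using INSERT OR REPLACE makes repeated runs of the same load idempotent, which supports the repeatable process mentioned above.

```python
# Minimal sketch of the integration/load step; sqlite3 stands in for a
# cloud data warehouse, and "orders_clean" and its columns are assumptions.
import sqlite3


def load(records: list[dict], target_db: str = "warehouse.db") -> int:
    """Write transformed records into the reporting table of the target store."""
    with sqlite3.connect(target_db) as conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS orders_clean "
            "(order_id TEXT PRIMARY KEY, order_date TEXT, amount REAL)")
        conn.executemany(
            "INSERT OR REPLACE INTO orders_clean "
            "VALUES (:order_id, :order_date, :amount)",
            records)
    return len(records)


if __name__ == "__main__":
    sample = [{"order_id": "A-1", "order_date": "2022-12-01", "amount": 19.90}]
    print(f"loaded {load(sample)} record(s)")
```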

Things to know about data pipelines

A wide variety of tools can be integrated into a data pipeline, for example for anonymizing data. In the blog post "Anonymized data in the data pipeline", you will find two practical examples that explain the advantages of seamlessly integrating Libelle DataMasking in more detail.

