December 14, 2022

Libelle IT Glossary Part 20: What is a data pipeline?

Author: Michael Schwenk

Every company works with large data sets on a daily basis, be it for triggering production chains, sending order confirmations or following up on existing contracts. Data also plays an important role in internal processes, especially in the area of human resource management.

Data management is one of the core disciplines of IT. Companies run a large number of applications, databases and other information sources, and these systems must be able to exchange information with each other. More and more companies are turning to data pipelines to unlock the potential of their data as quickly as possible and meet the needs of their customers.

What is a data pipeline?

As the name suggests, a data pipeline acts as a "pipeline system" for data: it is a method for moving data from one system to another. In many organizations, these pipelines are the foundation for data-driven work in IT.

Essentially, when data is being moved from the source system to the target system, it goes through the following steps:

  • Capturing and extracting the raw datasets
  • Data management
  • Data transformation
  • Data processing and integration

We have explained these steps in more detail in our blog post "How does a data pipeline work?". There are different types of data pipelines for performing these steps.

What are the different types of data pipelines?

Two main types of data pipelines are most commonly used to achieve the goal of data integration: batch processing and streaming data.

Batch processing

Batch processing is an important part of creating a reliable and scalable data infrastructure.

Batch processing, as the name implies, involves loading "batches" of data into a repository at specified time intervals. These jobs are typically scheduled outside peak business hours, because the large volume of data they move could negatively impact other workloads. Batch processing is usually the best fit when there is no immediate need to analyze a specific dataset (e.g., monthly accounting). It is commonly associated with the extract, transform, and load (ETL) data integration process.

A batch job is an automated workflow of commands that run in a fixed sequence, where the output of one command becomes the input of the next.

For example, a command starts a data ingest, then the next command triggers filtering of specific columns, and then the subsequent command handles aggregation. This series continues until the data is fully transformed.
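The chain described above can be sketched in a few lines of Python. This is only a minimal illustration, not a production batch job: the sample data, the column names and the function names (ingest, filter_columns, aggregate, run_batch) are all assumptions made for this example.

```python
import csv
import io
from collections import defaultdict

# Hypothetical raw input; in practice this would come from a file or database.
RAW_CSV = """order_id,product,quantity,customer_email
1,widget,2,a@example.com
2,gadget,1,b@example.com
3,widget,5,c@example.com
"""


def ingest(raw_text):
    """Step 1: read the raw rows from the source."""
    return list(csv.DictReader(io.StringIO(raw_text)))


def filter_columns(rows, columns):
    """Step 2: keep only the columns needed downstream."""
    return [{name: row[name] for name in columns} for row in rows]


def aggregate(rows):
    """Step 3: sum the quantity per product."""
    totals = defaultdict(int)
    for row in rows:
        totals[row["product"]] += int(row["quantity"])
    return dict(totals)


def run_batch(raw_text):
    """Run the steps in order; each output is the next step's input."""
    rows = ingest(raw_text)
    slim_rows = filter_columns(rows, ["product", "quantity"])
    return aggregate(slim_rows)


if __name__ == "__main__":
    print(run_batch(RAW_CSV))
```

In a real environment, a scheduler would start run_batch at the agreed interval, for example at night, and each step would typically read from and write to persistent storage rather than in-memory lists.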

Streaming data

Streaming data is used when data needs to be updated continuously. Real-time data is needed especially where apps or point-of-sale systems are in use.

Example: A company wants to keep the stock levels and sales history of its products up to date, so that salespeople can tell customers whether a product is in stock. Here, a single action, such as a product sale, is considered an "event", and related events, such as adding an item to the checkout, are typically grouped into a "topic" or "data stream". These events are then streamed using messaging systems or message brokers, such as the open source solution Apache Kafka.

Stream processing systems have lower latency than batch processing systems and are therefore more commonly used to process data events shortly after they occur.
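The idea of events and topics from the stock example can be illustrated with a small in-memory sketch. It deliberately does not use the Apache Kafka API: the classes InMemoryBroker and StockTracker, the topic name "sales" and the event fields are hypothetical stand-ins that only show the principle of handling each event as soon as it is published.

```python
from collections import defaultdict


class InMemoryBroker:
    """Toy stand-in for a message broker (not the Kafka API)."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        """Register a handler that is called for every event on a topic."""
        self._subscribers[topic].append(handler)

    def publish(self, topic, event):
        """Deliver an event to all handlers of its topic immediately."""
        for handler in self._subscribers[topic]:
            handler(event)


class StockTracker:
    """Keeps stock levels and sales history current, event by event."""

    def __init__(self, initial_stock):
        self.stock = dict(initial_stock)
        self.sales_history = []

    def on_sale(self, event):
        """Handle one sale event: reduce the stock and record the sale."""
        self.stock[event["product"]] -= event["quantity"]
        self.sales_history.append(event)

    def in_stock(self, product):
        """Return True if at least one unit of the product is available."""
        return self.stock.get(product, 0) > 0


if __name__ == "__main__":
    broker = InMemoryBroker()
    tracker = StockTracker({"widget": 3})
    broker.subscribe("sales", tracker.on_sale)

    # Each sale is one event on the hypothetical "sales" topic.
    broker.publish("sales", {"product": "widget", "quantity": 2})
    broker.publish("sales", {"product": "widget", "quantity": 1})
    print(tracker.in_stock("widget"))
```

A real message broker would additionally persist the events and deliver them across processes and machines; the sketch only shows why the stock information is current right after each sale instead of after the next batch run.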

Things to know about data pipelines

A wide variety of tools can be integrated into a data pipeline, for example to anonymize data. Our blog post "Anonymized data in the data pipeline" presents two practical examples that explain the advantages of seamlessly integrating Libelle DataMasking in more detail.
