How To Create A Pipeline In Python
Data Pipelines With Python And Pandas
Writing Readable and Reproducible Data Processing Code
Data processing issues
If you are not dealing with big data, you are probably using Pandas to write scripts for data processing. And if so, you are likely using Jupyter, because it lets you see the result of each transformation as you apply it. However, you may have noticed that notebooks can quickly become messy.
When the time comes to put the project into production, questions of reproducibility and maintenance arise. Tools such as Papermill let you put a notebook directly into production, but that guarantees neither reproducibility nor readability for the person who will maintain the code after you are gone.
While notebooks let you document your data processing in Markdown, doing so is time-consuming, and there is a risk that over successive iterations the code no longer matches the documentation.
What is needed is a framework that makes the code quick to refactor while letting readers quickly understand what it does.
Introducing genpipes
genpipes is a small library of decorators and generators that helps you write readable and reproducible pipelines. You can install it with pip install genpipes.
It integrates easily with pandas for writing data pipelines. Below is a simple example of how to use the library with pandas code for data processing.
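The original post embeds its code as screenshots, which did not survive here. As a stand-in, below is a self-contained sketch of the pattern the article describes, written in plain Python so it runs anywhere: the datasource, generator, and processor decorators and the Pipeline class are hand-rolled illustrations of the idea, not genpipes' actual API (which lives in the library itself and may differ in names and signatures), and plain dicts stand in for pandas DataFrames.

```python
import functools

# --- illustrative stand-ins for the decorators the article describes ----
def datasource(inputs):
    """Bind `inputs` as positional args so they are not hardcoded inside."""
    def decorator(func):
        return functools.partial(func, *inputs)
    return decorator

def generator(func):
    """Marks the step that initializes the stream (illustrative no-op)."""
    return func

def processor(func):
    """Marks a step consuming the stream as its first arg (illustrative)."""
    return func

class Pipeline:
    """Steps are (description, decorated function, kwargs dict) tuples."""
    def __init__(self, steps):
        self.steps = steps

    def run(self):
        stream = None
        for _, func, kwargs in self.steps:
            # First step starts the stream; later steps consume it.
            stream = func(**kwargs) if stream is None else func(stream, **kwargs)
        last = None
        for last in stream:  # pull everything, keep the last value
            pass
        return last

# --- usage: acquire data, feed the stream, process it -------------------
@datasource(inputs=["orders.csv"])
def read_orders(path):
    # Stand-in for something like pd.read_csv(path).
    return [{"id": 1, "amount": 10}, {"id": 1, "amount": 10}, {"id": 2, "amount": 5}]

@generator
def orders_stream():
    yield read_orders()

@processor
def drop_duplicates(stream):
    for rows in stream:
        seen, out = set(), []
        for row in rows:
            key = tuple(sorted(row.items()))
            if key not in seen:
                seen.add(key)
                out.append(row)
        yield out

pipe = Pipeline(steps=[
    ("read orders from file", orders_stream, {}),
    ("droping dups", drop_duplicates, {}),
])
print(pipe.run())  # [{'id': 1, 'amount': 10}, {'id': 2, 'amount': 5}]
```

With the real library you would use genpipes' own decorators and Pipeline in place of these stand-ins, and return real DataFrames from the data source.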
If you use scikit-learn, you may be familiar with its Pipeline class, which lets you build a machine learning pipeline. With genpipes you can do the same thing for data processing scripts. genpipes both makes the code readable and, thanks to its Pipeline class, lets you create functions that are pipeable. Let's look in more detail at how it works.
Declaring data sources
The first task in data processing is usually to write code to acquire data. The library provides a decorator to declare your data source.
The decorator takes a list of inputs that will be passed as positional arguments to the decorated function. This way you bind arguments to the function without hardcoding them inside it.
However, if you want to leave some arguments to be defined later, you can use keyword arguments.
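As a hand-rolled sketch of that idea (not genpipes' actual decorator): the bound inputs become fixed positional arguments, while anything declared as a keyword argument can still be supplied later, at call time.

```python
import functools

def datasource(inputs):
    """Illustrative decorator: bind `inputs` as positional arguments."""
    def decorator(func):
        return functools.partial(func, *inputs)
    return decorator

@datasource(inputs=["data/orders.csv"])
def read_orders(path, sep=","):  # `sep` is left to be defined later
    # Stand-in for an actual read, e.g. pd.read_csv(path, sep=sep).
    return f"reading {path} with sep={sep!r}"

print(read_orders())          # reading data/orders.csv with sep=','
print(read_orders(sep=";"))   # reading data/orders.csv with sep=';'
```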
This approach both encapsulates the data sources and makes the code more readable: having the inputs declared just above the function body is a bit like having a configuration file sitting next to the code that uses it.
But data sources are not yet part of the pipeline; we still need to declare a generator to feed the stream.
Declaring generator to feed the stream
Genpipes rely on generators to be able to create a series of tasks that take as input the output of the previous task. It means the first step of the pipeline should be a function that initializes the stream.
That is the purpose of the generator decorator. A function decorated with it is transformed into a generator object. You can decorate any function you want your stream to begin with, like a datasource.
Or a more complex function, like a merge between two data sources.
To test your generator-decorated functions, you need to pass in a Python generator object.
Because the decorator returns a function that creates a generator object, you can create many generator objects and feed several consumers.
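This property is plain Python generator behavior, which the following stand-alone snippet illustrates (no genpipes code involved): each call to a generator function builds a fresh, independent generator object, so two consumers can each exhaust their own copy of the stream.

```python
def numbers_stream():
    """Each call returns a new, independent generator object."""
    yield from [1, 2, 3]

consumer_a = numbers_stream()
consumer_b = numbers_stream()

assert list(consumer_a) == [1, 2, 3]
assert list(consumer_b) == [1, 2, 3]  # unaffected by consumer_a
assert list(consumer_a) == []         # a generator can only be consumed once
```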
The generator decorator lets us put data into the stream, but not work with values already in the stream. For that, we need processing functions.
Declaring functions for data processing
Now that we have seen how to declare data sources and how to initialize a stream with the generator decorator, let's see how to declare processing functions.
One big difference between generator and processor is that the function decorated with processor MUST BE a Python generator. In addition, the function must take the stream as its first argument.
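Sketched in plain Python (the processor decorator here is an illustrative no-op, not the library's), a processing function is itself written as a generator: it takes the stream as its first argument, pulls values from it, and yields transformed values back into the stream.

```python
def processor(func):
    """Illustrative stand-in for a processor decorator: no-op marker."""
    return func

@processor
def to_upper(stream):
    # The decorated function IS a generator: it takes the stream first,
    # then yields one transformed value per incoming value.
    for value in stream:
        yield value.upper()

source = (word for word in ["a", "b"])
assert list(to_upper(source)) == ["A", "B"]
```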
Composing pipelines
Even though we can use the decorated functions on their own, the library provides a Pipeline class that helps assemble functions decorated with generator and processor.
A pipeline object is composed of steps, each a tuple with 3 components:
1- The description of the step
2- The decorated function
3- The keyword arguments to forward, as a dict; if no keyword arguments are needed, pass in an empty dict
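Assembling such step tuples can be sketched as follows; the Pipeline class below is a hand-rolled illustration of the pattern (including a string representation like the one shown later in the article), not genpipes' own implementation, and the step descriptions are made up for the example.

```python
class Pipeline:
    """Illustrative: steps are (description, function, kwargs) tuples."""
    def __init__(self, steps):
        self.steps = steps

    def __str__(self):
        # Render the sequence of step descriptions at a glance.
        lines = ["---- Start ----"]
        lines += [f"{i}- {desc}" for i, (desc, _, _) in enumerate(self.steps, 1)]
        return "\n".join(lines + ["---- End ----"])

    def run(self):
        stream = None
        for _, func, kwargs in self.steps:
            stream = func(**kwargs) if stream is None else func(stream, **kwargs)
        last = None
        for last in stream:
            pass
        return last

def source():
    yield [3, 1, 2, 1]

def drop_dups(stream):
    for values in stream:
        yield sorted(set(values))

pipe = Pipeline(steps=[
    ("load raw values", source, {}),   # description, function, kwargs
    ("droping dups", drop_dups, {}),   # empty dict: no kwargs needed
])
print(pipe)
print(pipe.run())  # [1, 2, 3]
```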
The Pipeline class both describes the processing performed by the functions and shows the sequence of steps at a glance. Scrolling back up the file gives us the details of whichever functions interest us.
One key feature is that declaring a pipeline object does not evaluate it. This means we can import a pipeline without executing it, which allows you, for example, to write one file per data-processing domain and assemble them all in a main pipeline located at the entry point of a data processing script.
script/
  app.py                   # import from pipelines and do final processing
  pipelines/
    orders_processing.py   # import datasource
    customer_processing.py
  datasources/
    orders.py
    customers.py

Because readability is important, calling print on a pipeline object gives a string representation of the sequence of steps composing the pipeline instance. For instance, calling print on the pipe instance defined earlier gives this output:
>> print(pipe)
---- Start ----
1- data source is the merging of data one and data two
2- droping dups
---- End ----

To actually evaluate the pipeline, we need to call the run method. This method returns the last object pulled out of the stream. In our case, it will be the deduplicated data frame from the last defined step.
dedup_df = pipe.run()
We can run the pipeline multiple times; it will redo all the steps:
dedup_df = pipe.run()
dedup_df_bis = pipe.run()
assert dedup_df.equals(dedup_df_bis)  # True
Finally, pipeline objects can be used in other pipeline instances as a step:
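One hand-rolled way such nesting can work (illustrative, not genpipes' internals): if a pipeline object exposes the same interface as a processor, i.e. it accepts a stream and returns a generator, then an outer pipeline can use it as an ordinary step.

```python
class Pipeline:
    """Illustrative: a pipeline that can itself be used as a step."""
    def __init__(self, steps):
        self.steps = steps

    def __call__(self, stream=None, **kwargs):
        # Chain every step onto the incoming stream, like a processor would.
        for _, func, step_kwargs in self.steps:
            stream = func(**step_kwargs) if stream is None else func(stream, **step_kwargs)
        return stream

    def run(self):
        last = None
        for last in self():
            pass
        return last

def source():
    yield [" a ", " b "]

def strip(stream):
    for values in stream:
        yield [v.strip() for v in values]

def upper(stream):
    for values in stream:
        yield [v.upper() for v in values]

cleaning = Pipeline(steps=[("strip whitespace", strip, {})])
pipe = Pipeline(steps=[
    ("load raw values", source, {}),
    ("inner cleaning pipeline", cleaning, {}),  # a pipeline used as a step
    ("uppercase", upper, {}),
])
assert pipe.run() == ["A", "B"]
```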
Conclusion
If you are working with pandas on data that is not large, the genpipes library can help increase the readability and maintainability of your scripts, and it integrates easily.
Source: https://towardsdatascience.com/python-pandas-data-pipelines-515bcc678570