In software development, the term TDD – Test Driven Development – is well known and its application is already well established.
The practice is credited to Kent Beck, the creator of Extreme Programming, who popularized it in the late 1990s.
Nowadays, this practice is one of the main ways to maintain a healthy application. Tests greatly reduce – but do not eliminate – the risk of breaking the application.
Is it possible to apply TDD in a data pipeline? That’s what we’ll see next. I’ll go ahead and say yes, it’s completely possible.
First of all, what’s TDD?
Test Driven Development is a programming strategy that consists of describing known situations and their expected responses, in the form of tests, before developing the application itself.
In a very simplified way, to ensure that a function knows how to add correctly, we create situations such as:
- If the function receives 2 and 3 as arguments, it should return 5;
- If the function receives 1, 2 and 3 as arguments, it should return 6.
Instead of diving straight into developing the application, you first create scenarios that test what is yet to be developed, whether that is a class, a method or a function.
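Here is a minimal sketch of those two scenarios in Python. The function name `add` and the pytest-style `test_` functions are my own assumptions for illustration. In real TDD the tests would be written first and would fail until `add` exists.

```python
def add(*numbers):
    """Return the sum of all the numbers received (hypothetical example function)."""
    return sum(numbers)


def test_add_two_numbers():
    # If the function receives 2 and 3, it should return 5.
    assert add(2, 3) == 5


def test_add_three_numbers():
    # If the function receives 1, 2 and 3, it should return 6.
    assert add(1, 2, 3) == 6
```

A test runner such as pytest would pick up the `test_` functions automatically if they live in a file whose name starts with `test_`.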
And what would TDD look like in a Data Pipeline?
Before anything else, it would be good to know why we apply TDD in data engineering.
The original motivation for developing the scenario first and the solution afterwards comes down to three points that I consider the main ones:
- Your function – or class – will do ONLY what is necessary to pass the test, which keeps the code cleaner and free of unnecessary logic;
- Automated tests run much faster than any human could execute them;
- It prevents new functionality from breaking what was already working.
Now imagine a data pipeline, where information is extracted in various ways, goes through a series of processes that treat it, and is then inserted into a target database.
I don’t know about you, but to me, the more complex the flow, the more fragile it inherently becomes. If anything fails along the way, it can cause huge damage to the entire process.
It is a little reminiscent of Chaos Theory and the famous question posed by Edward Lorenz:
“Does the flap of a butterfly’s wings in Brazil set off a tornado in Texas?”
If one of the data sources changes its structure, it can end up bringing down the entire flow, or generating completely unexpected results.
So, to mitigate these risks, nothing is better than using a strategy already consolidated in the world of software development…
And how do we apply TDD with Data?
Read the following statement, and you’ll understand where I’m going…
- When we’re about to create a class/function, we test whether, given the parameters it receives, it returns the expected result.
Pay attention to three terms in that sentence: class/function, parameters, and result. They were chosen intentionally.
Changing to data, it would be like this:
- When we’re about to create a step in the process, we test whether, given the data it receives, it returns the expected structure.
In other words, each step is evaluated on whether it returns the expected structure when given another structure as input. Do you notice that the term “controlled environment” fits perfectly here?
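To make this concrete, here is a small sketch of a test for a single pipeline step. The step `clean_customers` and its fields are hypothetical, invented only for this illustration. The point is that both the input and the expected output are small, fixed structures.

```python
def clean_customers(rows):
    """Hypothetical step: tidy up names and cast ages to integers."""
    return [
        {"name": row["name"].strip().title(), "age": int(row["age"])}
        for row in rows
    ]


def test_clean_customers_returns_expected_structure():
    # Controlled input: the structure the step is expected to receive.
    raw = [
        {"name": "  ada lovelace ", "age": "36"},
        {"name": "alan turing", "age": "41"},
    ]
    # Expected output: the structure the step must hand to the next step.
    expected = [
        {"name": "Ada Lovelace", "age": 36},
        {"name": "Alan Turing", "age": 41},
    ]
    assert clean_customers(raw) == expected
```

Because the input is defined inside the test, the result does not depend on any real data source being available.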
And what can we put in each input structure?
If you’re used to TDD, you’re already very familiar with corner cases.
Corner Cases in Data
A corner case is a “limit” situation that your tests should cover.
Going back to the beginning, in the example of our function that should add, what would be the edge cases?
- If it receives only one number, it should return that number;
- It should be able to process any number of inputs, adding 1, 2, 3 … numbers;
- If it receives a negative number, it should return the correct result;
- If it receives 0 as input, the result should not be affected.
The first step of development is to create the edge cases that the function should handle.
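As a sketch, the four edge cases above could be written like this. The name `add` and the pytest-style tests are again my own assumptions, not a prescribed implementation.

```python
def add(*numbers):
    """Return the sum of all the numbers received (hypothetical example function)."""
    return sum(numbers)


def test_single_number_returns_itself():
    assert add(7) == 7


def test_any_number_of_inputs():
    assert add(1, 2, 3, 4, 5) == 15


def test_negative_number():
    assert add(-2, 3) == 1


def test_zero_does_not_affect_result():
    assert add(4, 0) == 4
```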
With data, it’s more or less the same thing. Just think about what kinds of corner cases a data structure can have.
Let’s see some examples:
- If the source is a CSV file, the step should recognize the column separator – such as comma, semicolon, or tab;
- In the same CSV file, the step should correctly recognize the type of each column, whether it is integers, strings, booleans, etc.;
- The possibility of receiving an empty CSV;
- A JSON payload that contains other keys besides those of interest to the process;
- A JSON payload that is missing one or more keys of interest;
- An API response with authorization errors, timeouts, etc.
Do you see that the possibilities are many? Well, the more your pipeline is prepared to deal with each of these situations, the more robust it will be. The greater the coverage of possibilities, the lower the risk of breaking the flow.
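A few of these situations can be sketched with nothing but Python’s standard library. Everything named here (`read_csv`, `parse_event`, the keys `id` and `amount`) is hypothetical and serves only as an illustration. The separator detection relies on `csv.Sniffer`, which is a heuristic and may not guess correctly for every file.

```python
import csv
import io
import json

REQUIRED_KEYS = {"id", "amount"}  # hypothetical keys of interest


def read_csv(text, delimiters=",;\t"):
    """Parse CSV text, guessing the separator among the candidates."""
    if not text.strip():
        return []  # an empty file yields no rows instead of an error
    dialect = csv.Sniffer().sniff(text, delimiters=delimiters)
    return list(csv.DictReader(io.StringIO(text), dialect=dialect))


def parse_event(payload):
    """Keep only the keys of interest and fail loudly if one is missing."""
    record = json.loads(payload)
    missing = REQUIRED_KEYS - set(record)
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    return {key: record[key] for key in REQUIRED_KEYS}


def test_semicolon_separator():
    assert read_csv("id;amount\n1;10\n") == [{"id": "1", "amount": "10"}]


def test_tab_separator():
    assert read_csv("id\tamount\n1\t10\n") == [{"id": "1", "amount": "10"}]


def test_empty_csv_returns_no_rows():
    assert read_csv("") == []


def test_json_extra_keys_are_ignored():
    payload = '{"id": 1, "amount": 9.5, "note": "not needed"}'
    assert parse_event(payload) == {"id": 1, "amount": 9.5}


def test_json_missing_key_is_rejected():
    try:
        parse_event('{"id": 1}')
    except ValueError:
        pass  # expected: the step refuses incomplete records
    else:
        raise AssertionError("a missing key should raise ValueError")
```

Note that `csv.DictReader` returns every value as a string, so recognizing column types would be a separate step with its own tests.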
Simplicity is the key
Have you heard of KISS?
No, neither the word nor the New York rock band.
It stands for “Keep it simple, stupid!”.
This means that each scenario in the corner cases should be as simple as possible.
It’s not necessary to create a CSV with 1000 lines to describe an expected structure.
The smaller the expected structure, whether it is CSV, JSON, or XML from an API, the better. It’s easier to analyze and to maintain in the future. The only criterion is that the structure correctly describes what the step can receive as input data.
Now that more and more data is available to inform business decisions, it is possible to create flows for the extraction, treatment and reading of this data much more easily than it was years ago.
However, such a flow is complex and can easily be interrupted when any of its steps breaks. It is crucial that each step is covered by tests that ensure the expected behavior, so as not to weaken the system as a whole.
I hope you liked it and if you have any questions, leave them in the comments below.