Tools like dbt make building SQL data pipelines easy and systematic. However, even with the added structure and well-defined data models, pipelines can still become complicated, which makes it difficult to debug issues and validate changes to data models.
The increased complexity of data transformation logic causes the following problems:
- Traditional code review only looks at the code changes and excludes the data impact of those changes.
- Tracking the data impact caused by code changes is difficult. In a huge DAG with nested dependencies, discovering how and where the impact on your data will occur is extremely time-consuming or nearly impossible.
GitLab's dbt DAG (shown in the featured image above) is a perfect example of a data project that is already a house of cards. Imagine trying to chase a simple SQL logic change to a single column through this entire lineage DAG. Reviewing data model updates can be a daunting task.
How would you approach this kind of review?
What is data validation?
Data validation is the process used to determine whether data is correct in terms of real-world requirements. In practice, this means confirming that the SQL logic in a data model behaves as intended by verifying that the data it produces is correct. Validation is usually performed after a data model is changed, for example to accommodate a new requirement or as part of a refactor.
A unique review challenge
Data has state and is directly affected by the transformations used to generate it. This is why reviewing data model changes is a unique challenge: both the code and the data need to be reviewed.
Therefore, a data model update should be checked not only for correctness but also for context. In other words, the new data must be correct, and existing data and metrics must not have changed unintentionally.
Two extremes of data validation
In most data teams, the person making the change relies on institutional knowledge, intuition, or past experience to assess the impact and validate the change.
"I made a change to X. I think I know what the impact should be. I'll run Y to check it."
The validation method usually falls into one of two extremes, and neither is ideal:
- Spot checks: high-level queries that check row counts and schema. They are fast, but risk missing the actual impact. Critical, silent errors can go unnoticed.
- Exhaustive checks: checking every downstream model. This is slow and resource-intensive, and it gets more costly as the pipeline grows.
The result is an unstructured, hard-to-repeat data review process that often lets silent errors through. A new method is needed to help engineers perform precise, targeted data validation.
A better approach through understanding data model dependencies
To validate a change to a data project, it is important to understand the relationships between models and how data flows through the project. These dependencies tell you how data is passed and transformed from one model to another.
Analyze the relationships between models
As we have seen, data project DAGs can be huge, but a change to a data model only affects a subset of the models. By isolating this subset and then analyzing the relationships between its models, you can peel back the layers of complexity and focus only on the models that need validating, given a specific SQL logic change.
The types of dependencies in a data project are:
Model-to-model
A structural dependency in which columns are selected from an upstream model.
-- downstream_model
select
    a,
    b
from {{ ref("upstream_model") }}
Column-to-column
A projection dependency that selects, renames, or transforms an upstream column.
-- downstream_model
select
    a,
    b as b2
from {{ ref("upstream_model") }}
Model-to-column
A filter dependency in which the downstream model uses an upstream column in a where clause, a join, or another conditional clause.
-- downstream_model
select
    a
from {{ ref("upstream_model") }}
where b > 0
Understanding the dependencies between models lets you define the impact radius of a change to your data model logic.
Identify the impact radius
When you change a data model's SQL, it is important to understand which other models might be affected (the models you must check). At a high level, this is determined by the model-to-model relationships. This subset of DAG nodes is known as the impact radius.
In the DAG below, the impact radius includes node B (the modified model) and node D (its downstream model). In dbt, these models can be identified using the modified+ selector (written as state:modified+).
Identifying the modified nodes and their downstream models is a great start, and isolating the change this way reduces the potential data validation area. However, it can still leave a large number of downstream models.
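To make the idea concrete, here is a minimal Python sketch of how an impact radius could be computed with a breadth-first walk over the DAG. The graph below is hypothetical (a parent A with children B and C, and D downstream of B) because the figure is not reproduced here, and `impact_radius` is an illustrative helper, not part of dbt.

```python
from collections import deque

# Hypothetical DAG for illustration: each key is a model, each value lists
# the models that select from it directly (its children).
CHILDREN = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": [],
    "D": [],
}


def impact_radius(children, modified):
    """Return the modified models plus every model downstream of them."""
    seen = set(modified)
    queue = deque(modified)
    while queue:
        node = queue.popleft()
        for child in children.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


print(sorted(impact_radius(CHILDREN, ["B"])))  # ['B', 'D']
```

In a real dbt project you would not need to write this yourself: the state:modified+ selection does the same job, provided dbt is given a previous manifest to compare against via --state.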
Classifying the type of SQL change goes further and helps you prioritize the models that actually require validation, by understanding the severity of the change and eliminating branches whose changes are known to be safe.
Classify SQL changes
Not all SQL changes carry the same level of risk to downstream data, so they should be categorized accordingly. Classifying SQL changes this way adds a systematic approach to the data review process.
A SQL change to a data model can be classified as one of the following:
Non-breaking change
Changes that do not affect the data in downstream models, such as adding new columns, adjusting SQL formatting, or adding comments.
-- Non-breaking change: New column added
select
    id,
    category,
    created_at,
    -- new column
    now() as ingestion_time
from {{ ref('a') }}
Partial-breaking change
Changes that affect only the downstream models that reference particular columns, such as removing or renaming a column, or modifying a column definition.
-- Partial-breaking change: `category` column renamed
select
    id,
    created_at,
    category as event_category
from {{ ref('a') }}
Breaking change
Changes that affect all downstream models, such as filtering, sorting, or otherwise changing the structure or meaning of the transformed data.
-- Breaking change: Filtered to exclude data
select
    id,
    category,
    created_at
from {{ ref('a') }}
where category != 'internal'
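The three categories can be expressed as a small decision rule. The sketch below is only an illustration under strong assumptions: it presumes each model's SQL has already been parsed into a hypothetical `ModelDef` (output columns mapped to their expressions, plus a normalized string for row-level logic such as where and join clauses). Real SQL needs a proper parser, and neither `ModelDef` nor `classify` comes from dbt.

```python
from dataclasses import dataclass
from enum import Enum


class Change(Enum):
    NON_BREAKING = "non-breaking"
    PARTIAL_BREAKING = "partial-breaking"
    BREAKING = "breaking"


@dataclass(frozen=True)
class ModelDef:
    """Hypothetical, already-parsed view of a model's SQL."""
    columns: dict        # output column name -> expression
    row_logic: str = ""  # where / join / group by clauses, normalized


def classify(old: ModelDef, new: ModelDef):
    """Return the change category and the set of affected columns."""
    if old.row_logic != new.row_logic:
        # Row-level logic changed: every downstream model may be affected.
        return Change.BREAKING, set(old.columns)
    # Columns that were removed, renamed, or redefined.
    affected = {
        name for name, expr in old.columns.items()
        if new.columns.get(name) != expr
    }
    if affected:
        return Change.PARTIAL_BREAKING, affected
    # Only additions (or formatting and comments) remain.
    return Change.NON_BREAKING, set()


base = ModelDef({"id": "id", "category": "category", "created_at": "created_at"})
added = ModelDef({**base.columns, "ingestion_time": "now()"})
renamed = ModelDef({"id": "id", "created_at": "created_at", "event_category": "category"})
filtered = ModelDef(base.columns, "where category != 'internal'")

for candidate in (added, renamed, filtered):
    print(classify(base, candidate)[0].value)
```

Run against the three example changes above, this prints non-breaking, partial-breaking, and breaking in that order. The rule is deliberately conservative: any change to row-level logic is treated as breaking, even if it happens to leave the data unchanged.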
Apply the classification to reduce scope
After applying these classifications, the impact radius, and with it the number of models that need to be validated, can be significantly reduced.

In the above DAG, nodes B, C, and F have been modified, leaving potentially seven nodes (B through H) that need to be validated. However, not every branch contains SQL changes that actually require validation. Let's take a look at each branch.
Node C: Non-breaking change
C is classified as a non-breaking change. Therefore, neither C nor H needs to be checked, and both can be eliminated.
Node B: Partial-breaking change
B is classified as a partial-breaking change because of the change to column B.C1. Therefore, D and E need to be checked only if they reference column B.C1.
Node F: Breaking change
The change to model F is classified as a breaking change. Therefore, all downstream nodes (G and E) need to be checked for impact. For example, model G might aggregate data from the modified upstream column.
The initial seven nodes have already been reduced to five that need to be checked for data impact (B, D, E, F, G). From here, the scope can be reduced further by inspecting the SQL changes at the column level.
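The branch-by-branch reasoning above can be sketched in a few lines of Python. The edges are an assumption reconstructed from the walkthrough (B feeds D, D feeds E, C feeds H, and F feeds both G and E), since the figure is not reproduced here, and the function names are illustrative only.

```python
from collections import deque

# Edges assumed from the walkthrough (the figure itself is not reproduced):
# B -> D -> E, C -> H, F -> G and F -> E.
CHILDREN = {"B": ["D"], "D": ["E"], "C": ["H"], "F": ["G", "E"]}

# Classification of each modified model, as described in the text.
CHANGES = {"B": "partial-breaking", "C": "non-breaking", "F": "breaking"}


def downstream(children, start):
    """Return start plus every model reachable from it."""
    seen, queue = {start}, deque([start])
    while queue:
        for child in children.get(queue.popleft(), []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def models_to_validate(children, changes):
    """Drop non-breaking branches; keep the rest of the impact radius."""
    keep = set()
    for model, kind in changes.items():
        if kind == "non-breaking":
            continue  # known to be safe, so this branch adds nothing
        # Partial-breaking branches are kept whole here; narrowing them
        # requires column-level information.
        keep |= downstream(children, model)
    return keep


print(sorted(models_to_validate(CHILDREN, CHANGES)))  # ['B', 'D', 'E', 'F', 'G']
```

Note that a model is only dropped if every path that reaches it starts from a non-breaking change. If H also sat downstream of F, it would stay in the result.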
Narrow the scope further with column-level analysis
Breaking and non-breaking changes are easy to classify, but when you are inspecting a partial-breaking change, the models need to be analyzed at the column level.
Let's take a closer look at the partial-breaking change in model B, in which the logic of column C1 has been modified. This change could potentially affect four downstream nodes: D, E, K, and J.

Following column B.C1 downstream, we can see that:
- B.C1 → D.C1 is a column-to-column (projection) dependency.
- D.C1 → E is a model-to-column dependency.
- D → K is a model-to-model dependency. However, because D.C1 is not used in K, this model can be eliminated.
Therefore, the models that need to be validated in this branch are B, D, and E. Together with the breaking change in F and its downstream model G, the models to validate in this diagram are F, G, B, D, and E: just 5 of the 9 potentially affected models in total.
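Tracing a single column downstream can be sketched in the same spirit. The mapping below encodes only the column-level edges named in the walkthrough; its shape is hypothetical, and a real project would derive it from parsed SQL or a column-lineage tool. Where J sits in the DAG is not spelled out in the text, so it simply has no entry here.

```python
# Column-level edges assumed from the walkthrough. Keys are (model, column);
# values are (downstream model, downstream column) pairs. A downstream column
# of None marks a model-to-column dependency: the column is used in a filter
# or join, so the downstream model as a whole is affected.
COLUMN_USES = {
    ("B", "C1"): [("D", "C1")],  # column-to-column (projection)
    ("D", "C1"): [("E", None)],  # model-to-column
}
# K selects from D but never references D.C1, so it has no entry above.


def affected_models(column_uses, model, column):
    """Follow one changed column downstream and collect affected models."""
    affected = {model}
    visited = set()
    stack = [(model, column)]
    while stack:
        current = stack.pop()
        if current in visited:
            continue
        visited.add(current)
        for child_model, child_column in column_uses.get(current, []):
            affected.add(child_model)
            if child_column is not None:
                # Keep following the column through the projection.
                stack.append((child_model, child_column))
    return affected


print(sorted(affected_models(COLUMN_USES, "B", "C1")))  # ['B', 'D', 'E']
```

This sketch stops at a model-to-column dependency. A fuller version would treat every column of such a model as changed and continue the walk from there.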
Conclusion
Data validation after a model change is difficult, especially in large and complex DAGs. It is easy to miss silent errors, and performing validation can be a daunting task. Data models often feel like black boxes when it comes to downstream impact.
A structured and repeatable process
Using this change-aware data validation technique brings structure and precision to the review process, making it systematic and repeatable. It reduces the number of models that need to be checked, simplifies the review, and lowers costs by validating only the models that actually require it.
Before you go…
Dave is a senior technology advocate at Recce, where we've built a toolkit to enable advanced data validation workflows. He is always happy to chat about SQL, data engineering, or helping teams navigate the challenges of data validation. Connect with Dave on LinkedIn.
The research for this article was made possible by my colleague Chen En Lu (popcorny).

