Use DAGs to make assumptions explicit

_Category: Model risk fundamentals | Regulation SS1/23 model documentation_

A directed acyclic graph (DAG) is a diagram in which nodes represent variables and arrows represent direct causal relationships. It is one of the most useful tools in statistics for making the assumptions of an analysis explicit, yet few data scientists use it for documentation and communication. This absence is worth examining, because PRA SS1/23’s documentation requirements are, in substance, requirements to draw and inspect this graph.

What a DAG represents

A DAG encodes causal assumptions, not statistical associations. An arrow from variable A to variable B means “A directly causes B” — not “A is correlated with B,” not “A predicts B,” but that intervening on A would change B, holding everything else in the system constant. The absence of an arrow between two variables is equally informative: it is the assumption that there is no direct causal path between them.

The DAG is where assumptions get written down so they can be discussed and disputed. Later, the relationships it encodes can be formalised in the analysis: the structure guides what to include as a variable and what to control for.

Even before that, though, hashing out the right DAG with executives, data scientists, domain experts, and anyone else who will use the outcome of an analysis is the most important step. It puts everyone on the same page about what is being modelled.

A regression model, read causally, implicitly assumes that its predictors do not cause one another: each coefficient is interpreted as the effect of one predictor with the others held fixed. If you draw a DAG first, you can state and scrutinise this assumption explicitly. If you are modelling credit with predictors such as income, employment duration, credit utilisation, and account age, and default as the outcome, you would hash out a DAG showing whether any of these cause one another and which of them cause default. You should also discuss whether anything you have not drawn causes the outcome. If something causes both the outcome and one of your existing predictors, it is a confounder, and it is very important to state it.
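
As a minimal sketch of what writing this down could look like in code, assuming networkx and an illustrative, disputable set of arrows rather than a recommended structure:

```python
# Illustrative credit-default DAG. Every edge below is an assumption to be
# debated with domain experts, not a statement of known fact.
from itertools import combinations

import networkx as nx

dag = nx.DiGraph()
dag.add_edges_from([
    ("employment_duration", "income"),   # assumed: tenure influences income
    ("income", "credit_utilisation"),    # assumed: income influences utilisation
    ("income", "default"),
    ("credit_utilisation", "default"),
    ("employment_duration", "default"),
    ("account_age", "default"),
])

# A DAG must be acyclic; worth asserting whenever the graph is hand-edited.
assert nx.is_directed_acyclic_graph(dag)

# The missing arrows are assumptions too: each pair printed here is a claim
# that neither variable directly causes the other.
for a, b in combinations(sorted(dag.nodes), 2):
    if not dag.has_edge(a, b) and not dag.has_edge(b, a):
        print(f"assumed no direct effect between {a} and {b}")
```

The loop over absent edges is the useful part for documentation: every pair it prints is a causal claim the team is implicitly making and should be able to defend.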

What a DAG reveals that regression documentation does not

Standard model documentation describes what variables are included, what functional form is used, and what the estimated coefficients are. It does not, in general, address the causal assumptions that determine whether those coefficients can be used as intended.

Drawing the DAG for a credit model typically reveals several things that standard documentation misses. First, it makes visible which confounders have been assumed absent, and prompts the question of whether that assumption is defensible. In most credit models, unobserved financial resilience (the customer’s ability to absorb financial shocks, which affects both their reported income stability and their default probability) is a plausible confounder that is rarely discussed explicitly.
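
A small synthetic simulation makes the consequence concrete. Nothing below comes from real data; `resilience` stands in for the unobserved confounder and the coefficients are arbitrary. The pattern is the point: omitting the confounder inflates the apparent effect of income stability on default risk.

```python
# Synthetic illustration of an unobserved confounder (not real data).
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

resilience = rng.normal(size=n)                         # unobserved confounder
income_stability = 0.8 * resilience + rng.normal(size=n)
default_risk = -1.0 * resilience - 0.2 * income_stability + rng.normal(size=n)

# Naive slope of default_risk on income_stability, omitting resilience.
naive_slope = np.polyfit(income_stability, default_risk, 1)[0]

# Slope after adjusting for resilience (ordinary least squares on both).
X = np.column_stack([income_stability, resilience, np.ones(n)])
adjusted_slope = np.linalg.lstsq(X, default_risk, rcond=None)[0][0]

print("direct effect used in simulation: -0.20")
print(f"naive estimate:    {naive_slope:+.2f}")     # roughly -0.69: badly biased
print(f"adjusted estimate: {adjusted_slope:+.2f}")  # close to -0.20
```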

Second, the DAG identifies which variables in the model may be colliders — variables caused by two or more other variables, such that conditioning on them opens spurious associations. For example, training on approved loans conditions on an approval variable that is itself caused by multiple predictors, creating the potential for Berkson’s paradox-style distortions.
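
The same point can be shown with a synthetic selection-on-approval example; `other_risk` is a hypothetical factor generated independently of income, yet the two become correlated once the sample is restricted to approved loans.

```python
# Synthetic illustration of collider bias from selecting on approval (not real data).
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

income = rng.normal(size=n)
other_risk = rng.normal(size=n)   # independent of income by construction

# Approval depends on both; keeping only approved rows conditions on a collider.
approved = (income + other_risk + rng.normal(scale=0.5, size=n)) > 0

print("correlation in full population: "
      f"{np.corrcoef(income, other_risk)[0, 1]:+.3f}")   # near zero
print("correlation among approved:     "
      f"{np.corrcoef(income[approved], other_risk[approved])[0, 1]:+.3f}")  # clearly negative
```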

Later we will discuss how controlling for variables works in these models, and what can be read off the graph using rules such as the backdoor criterion.

The SS1/23 correspondence

PRA SS1/23’s requirement that model assumptions be “clearly articulated, reasonable, and testable” maps directly onto the exercise of drawing and interrogating the DAG. Clearly articulated: the DAG writes down every causal assumption in a format a non-statistician can read. Reasonable: the causal structure implied by the graph should be consistent with knowledge of the domain in question, whether credit markets or the regulation of AI. Testable: many of the assumptions encoded in the graph imply conditional independencies that can be checked empirically, indicating whether the assumed structure is consistent with the observed data.
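
To illustrate the “testable” point using the illustrative DAG sketched earlier: that graph implies employment duration is independent of credit utilisation once income is controlled for, so a near-zero partial association is evidence the structure is at least consistent with the data. The data below are simulated purely as a stand-in for a real development sample.

```python
# Checking one testable implication of the illustrative DAG:
# employment_duration _||_ credit_utilisation | income.
# Simulated data stands in for the real development sample.
import numpy as np

rng = np.random.default_rng(2)
n = 50_000

employment_duration = rng.normal(size=n)
income = 0.6 * employment_duration + rng.normal(size=n)
credit_utilisation = -0.5 * income + rng.normal(size=n)

def slope_of(y, x, controls):
    """Least-squares coefficient on x when regressing y on x plus controls."""
    X = np.column_stack([x] + controls + [np.ones(len(y))])
    return np.linalg.lstsq(X, y, rcond=None)[0][0]

# Marginally associated, but conditionally independent given income
# if (and only to the extent that) the assumed structure is right.
print(f"slope, no controls:  {slope_of(credit_utilisation, employment_duration, []):+.3f}")
print(f"slope, given income: {slope_of(credit_utilisation, employment_duration, [income]):+.3f}")
```

A failed check of this kind does not say which arrow is wrong, only that the documented structure and the data disagree, which is precisely the kind of finding the documentation should surface.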

The DAG also makes the model’s limitations concrete. A limitation statement of the form “the model assumes that macroeconomic conditions are the primary driver of default and that individual-level resilience factors are adequately captured by the included predictors” is more specific than a generic “the model may not perform well under conditions significantly different from the training sample.” It identifies which specific assumptions you are operating under, not merely that an assumption exists.

In practice

You can draw a DAG for an existing model: you don’t need to start from scratch. You just need to express in graphical form the causal assumptions the model already makes.

This makes the main obstacle cultural rather than technical: the technical expertise exists, but if no one is responsible for joined-up conceptual clarity, it falls through the cracks. The problem is most acute when model developers are trained to think about statistical performance rather than causal structure, and validation teams are trained to review only what developers produce, so neither side is in the habit of expressing and agreeing on assumptions first. At a minimum, if the two teams sat down together and agreed on the causal structure of what is being modelled, the rest of the process would make sense.

When neither side of the table draws the DAG, the causal assumptions remain implicit, and the model risk they represent remains unexamined.