Partial Linear Equations ≠ Structural Causal Models

TL;DR. The partial linear equation (PLE) in causal inference [1]

\[Y=f_0(C)+g_0(C)X+\epsilon_Y\]

is a statistical decomposition, not a structural causal model (SCM). Its residual $\epsilon_Y$ is a regression remainder, not an exogenous variable in an SCM [2]. For $g_0(C)$ to equal the conditional average treatment effect (CATE), we still need No Unmeasured Confounders (NUC) and overlap. If we treat the residual $\epsilon_Y$ as an exogenous SCM variable that is invariant across interventions on the observables, we implicitly assume identification of unit-level effects, which is usually too strong.



Notation

  • $X\in\{0,1\}$: treatment.
  • $C$: vector of observed covariates.
  • $Y\in\mathbb{R}$: outcome.
  • $Y(0),Y(1)$: potential outcomes.
  • NUC: $(Y(0),Y(1))\perp X \mid C$.
  • Overlap: $\Pr(X=1\mid C=c)\in(0,1)$ for all relevant $c$.
  • $e_0(C):=\mathbb{E}[X\mid C]$ (propensity).
  • $f_0(C):=\mathbb{E}[Y(0)\mid C]$ (baseline under control).
  • $\tau_0(C):=\mathbb{E}[Y(1)-Y(0)\mid C]$ (CATE).
  • PLE residual (given any $f_0,g_0$):

    \[\epsilon_Y:=Y-f_0(C)-g_0(C)X.\]

1. PLE is not an SCM

In an SCM, one writes

\[Y \;=\; \underbrace{F(C,X)}_{\text{structural}}\;+\;\underbrace{U_Y}_{\text{exogenous, invariant}},\]

and $U_Y$ is a genuine disturbance: it is not caused by observables and is invariant to interventions on $X$.

By contrast, the PLE writes $Y=f_0(C)+g_0(C)X+\epsilon_Y$ for some $f_0,g_0$, where the residual is defined as $\epsilon_Y:=Y-f_0(C)-g_0(C)X$. By this definition, $\epsilon_Y$ is causally dependent on $X$: in general, $\epsilon_Y(X=1) \neq \epsilon_Y(X=0)$.

Do not read $\epsilon_Y$ as “exogenous noise” unless you have an independent structural argument.


2. Residuals should not be treated as exogenous variables

Suppose we (incorrectly) treat $\epsilon_Y$ as an exogenous variable invariant to interventions. Define

\[\epsilon_Y(x):=Y(x)-f_0(C)-g_0(C)\,x.\]

This implies

\[\epsilon_Y(1) = Y(1) - f_0(C) - g_0(C), \qquad \epsilon_Y(0) = Y(0) - f_0(C).\]

If we assume $\epsilon_Y(1) = \epsilon_Y(0)$, then

\[Y(1)-Y(0)=g_0(C)\qquad\text{a.s.}\]

i.e., the individual treatment effect (ITE) is deterministically $g_0(C)$ within each $C$-stratum; there is no residual heterogeneity. This is much stronger than NUC and is rarely warranted in practice.
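
To see how strong this is, here is a minimal simulation sketch (Python/numpy; the data-generating process is hypothetical). Unit-level effects vary within each $C$-stratum, so the residual is not invariant: $\epsilon_Y(1)$ and $\epsilon_Y(0)$ both have mean zero, yet they differ unit by unit.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

C = rng.normal(size=n)                  # observed covariate
X = rng.binomial(1, 0.5, size=n)        # randomized treatment (NUC holds)

f0 = C                                  # E[Y(0) | C]
tau0 = 1.0 + 0.5 * C                    # CATE = E[Y(1) - Y(0) | C]

U0 = rng.normal(size=n)                 # baseline noise
H = rng.normal(size=n)                  # unit-level effect heterogeneity
Y0 = f0 + U0                            # potential outcome under control
Y1 = f0 + tau0 + H + U0                 # potential outcome under treatment

# PLE residuals under the causal alignment (f0, g0) = (f0, tau0):
eps1 = Y1 - f0 - tau0                   # eps_Y(1) = U0 + H
eps0 = Y0 - f0                          # eps_Y(0) = U0

print("E[eps_Y(1)] ~", eps1.mean())    # ~ 0
print("E[eps_Y(0)] ~", eps0.mean())    # ~ 0
print("P(eps_Y(1) != eps_Y(0)) ~",
      np.mean(np.abs(eps1 - eps0) > 1e-12))  # ~ 1: residual is NOT invariant
```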


3. What NUC does give us

From the definition of the residual:

\[\begin{aligned} \mathbb{E}[\epsilon_Y \mid X=1,C] &= \mathbb{E}[Y \mid X=1,C] - f_0(C) - g_0(C), \\ \mathbb{E}[\epsilon_Y \mid X=0,C] &= \mathbb{E}[Y \mid X=0,C] - f_0(C). \end{aligned}\]

Recall that $f_0(C) := \mathbb{E}[Y(0) \mid C]$ and $g_0(C) := \mathbb{E}[Y(1)\mid C] - \mathbb{E}[Y(0)\mid C]$. Under NUC, we have

\[\begin{aligned} \mathbb{E}[\epsilon_Y \mid X=1,C] &= \mathbb{E}[Y(1) \mid C] - f_0(C) - g_0(C) = 0, \\ \mathbb{E}[\epsilon_Y \mid X=0,C] &= \mathbb{E}[Y(0) \mid C] - f_0(C) = 0. \end{aligned}\]

Thus NUC implies

\[\boxed{\; \mathbb{E}[\epsilon_Y \mid X,C] = 0 \;}.\]

By the law of iterated expectations, $\mathbb{E}[\epsilon_Y\mid C]=0$ as well.

Important asymmetry. The converse is false: $\mathbb{E}[\epsilon_Y\mid X,C]=0$ does not imply NUC. It is a regression moment condition that may hold without causal validity.
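
A quick numerical check of the boxed moment condition (numpy; hypothetical DGP with a discrete covariate, so the conditional means can be read off cell by cell). Treatment is randomized within each stratum of $C$, so NUC holds, and the residual from the causal alignment averages to zero in every $(X,C)$ cell; a concrete counterexample to the converse appears in the sketch after Section 5.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

C = rng.integers(0, 3, size=n)            # discrete covariate: 3 strata
e0 = np.array([0.3, 0.5, 0.7])[C]         # propensity depends on C only -> NUC holds
X = rng.binomial(1, e0)

f0 = np.array([0.0, 1.0, 2.0])[C]         # E[Y(0) | C]
tau0 = np.array([1.0, 0.5, -0.5])[C]      # CATE
Y = f0 + tau0 * X + rng.normal(size=n)    # outcome noise has mean 0 given (X, C)

eps = Y - f0 - tau0 * X                   # residual under the causal alignment
for c in range(3):
    for x in (0, 1):
        cell = (C == c) & (X == x)
        print(f"E[eps | X={x}, C={c}] ~ {eps[cell].mean():+.4f}")  # all ~ 0
```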


4. Robinson’s decomposition and the R-learner, carefully stated

Define

\[m_0(C):=\mathbb{E}[Y\mid C],\qquad e_0(C):=\mathbb{E}[X\mid C].\]

Under NUC and the causal alignment ($f_0=\mathbb{E}[Y(0)\mid C]$, $g_0=\tau_0$; see Section 5), taking $\mathbb{E}[\,\cdot\mid C]$ of $\mathbb{E}[Y\mid X,C]=f_0(C)+\tau_0(C)X$ gives

\[m_0(C)=f_0(C)+\tau_0(C)\,e_0(C).\]

Subtracting $m_0(C)$ from $Y$ and $e_0(C)$ from $X$ yields Robinson’s partial linear equation:

\[\underbrace{Y-m_0(C)}_{\tilde Y} \;=\; \tau_0(C)\,\underbrace{\big(X-e_0(C)\big)}_{\tilde X} \;+\;\varepsilon, \qquad \mathbb{E}[\varepsilon\mid X,C]=0.\]

The R-learner [3] estimates $\tau_0(\cdot)$ by:

  1. Learning $\hat m(C)\approx m_0(C)$ and $\hat e(C)\approx e_0(C)$ via flexible ML (with cross-fitting).
  2. Regressing residual outcome on residual treatment:

    \[\hat\tau \in \arg\min_{\tau\in\mathcal{H}} \frac{1}{n}\sum_{i=1}^n\Big( (Y_i-\hat m(C_i))-\tau(C_i)\,(X_i-\hat e(C_i))\Big)^2 +\Lambda_n(\tau).\]

Because $\mathbb{E}[\varepsilon\mid X,C]=0$, this loss is Neyman-orthogonal in the nuisances, which makes the estimator robust to small errors in $\hat m$ and $\hat e$.
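
A compact R-learner sketch following these two steps (Python with scikit-learn; the data-generating process, the choice of gradient boosting for the nuisances, and the linear basis for $\tau(\cdot)$ are illustrative assumptions, not prescriptions from [3]):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, GradientBoostingClassifier
from sklearn.model_selection import KFold

rng = np.random.default_rng(2)
n = 5_000
C = rng.normal(size=(n, 1))
e0 = 1 / (1 + np.exp(-C[:, 0]))              # propensity depends on C (confounding)
X = rng.binomial(1, e0)
tau0 = 1.0 + 0.5 * C[:, 0]                   # true CATE
Y = C[:, 0] + tau0 * X + rng.normal(size=n)  # f0(C) = C

# Step 1: cross-fitted nuisances m_hat(C) ~ E[Y|C], e_hat(C) ~ E[X|C]
m_hat = np.zeros(n)
e_hat = np.zeros(n)
for train, test in KFold(n_splits=5, shuffle=True, random_state=0).split(C):
    m_hat[test] = GradientBoostingRegressor().fit(C[train], Y[train]).predict(C[test])
    e_hat[test] = GradientBoostingClassifier().fit(C[train], X[train]).predict_proba(C[test])[:, 1]

# Step 2: residual-on-residual regression; tau(C) = theta0 + theta1 * C (linear sketch)
Y_t = Y - m_hat                              # tilde Y
X_t = X - e_hat                              # tilde X
Phi = np.column_stack([np.ones(n), C[:, 0]]) # basis for tau(.)
theta, *_ = np.linalg.lstsq(Phi * X_t[:, None], Y_t, rcond=None)
print("estimated tau(C) ~ %.2f + %.2f * C (truth: 1.00 + 0.50 * C)" % (theta[0], theta[1]))
```

The least-squares step is exactly the R-learner objective restricted to a linear $\tau$ with no penalty; the cross-fitting in step 1 is what keeps nuisance errors from biasing step 2.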


5. Two ways to choose $(f_0,g_0)$ — and their residual properties

(A) Causal alignment (preferred for identification). Set $g_0=\tau_0$, $f_0=\mathbb{E}[Y(0)\mid C]$. Then, under NUC:

\[\mathbb{E}[\epsilon_Y\mid X,C]=0.\]

Here $g_0$ is the CATE; identification also needs overlap.

(B) Linear projection alignment (associational). Define $(f_0(C),g_0(C))$ as the conditional linear projection of $Y$ on $(1,X)$ given $C$:

\[g_0(C)=\frac{\operatorname{Cov}(X,Y\mid C)}{\operatorname{Var}(X\mid C)},\quad f_0(C)=\mathbb{E}[Y\mid C]-g_0(C)\,\mathbb{E}[X\mid C].\]

Then the normal equations give

\[\mathbb{E}[\epsilon_Y\mid C]=0,\qquad \mathbb{E}\!\big[(X-e_0(C))\,\epsilon_Y\mid C\big]=0,\]

and, because $X$ is binary, the projection is saturated: $f_0(C)+g_0(C)\,x=\mathbb{E}[Y\mid X=x,C]$, so $\mathbb{E}[\epsilon_Y\mid X,C]=0$ holds here as well (overlap keeps $\operatorname{Var}(X\mid C)>0$, so the slope is well defined). The catch is interpretive, not statistical: $g_0(C)=\mathbb{E}[Y\mid X=1,C]-\mathbb{E}[Y\mid X=0,C]$ is the best-linear-predictor slope (associational), and it equals the CATE only under NUC. This is exactly the asymmetry of Section 3: the moment condition can hold while the causal reading fails.

One can always write $Y = f_0(C) + g_0(C)X + \epsilon_Y$, but without NUC, this decomposition has no causal meaning.
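
To make the “no causal meaning” point concrete, a hypothetical confounded DGP (numpy): an unobserved $U$ drives both $X$ and $Y$, so NUC fails. The projection-aligned residual still has zero mean in every $(X,C)$ cell, yet $g_0(C)$ (about $2.2$ in this design) is far from the true CATE of $1$.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500_000

C = rng.integers(0, 2, size=n)            # observed covariate
U = rng.binomial(1, 0.5, size=n)          # UNOBSERVED confounder -> NUC fails
X = rng.binomial(1, 0.2 + 0.6 * U)        # treatment driven by U
Y0 = C + 2.0 * U + rng.normal(size=n)     # potential outcomes
Y1 = Y0 + 1.0                             # true CATE = 1 everywhere
Y = np.where(X == 1, Y1, Y0)

for c in (0, 1):
    s = C == c
    # projection alignment: g0(C) = Cov(X,Y|C)/Var(X|C) = E[Y|X=1,C] - E[Y|X=0,C]
    g0 = Y[s & (X == 1)].mean() - Y[s & (X == 0)].mean()
    f0 = Y[s & (X == 0)].mean()
    eps = Y[s] - f0 - g0 * X[s]
    print(f"C={c}: g0 ~ {g0:.2f} (CATE = 1.00), "
          f"E[eps|X=0,C] ~ {eps[X[s] == 0].mean():+.3f}, "
          f"E[eps|X=1,C] ~ {eps[X[s] == 1].mean():+.3f}")
```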


Summary & Conclusion

Common pitfalls and how to avoid them:

  1. Treating $\epsilon_Y$ as exogenous noise.
    This implicitly assumes residual invariance across interventions, yielding unit-level effects $Y(1)-Y(0)=g_0(C)$. This is a very strong and usually unjustifiable condition.

  2. Confusing moment conditions with causal assumptions.
    $\mathbb{E}[\epsilon_Y \mid X,C]=0$ follows from NUC (with causal alignment), but it does not imply NUC on its own.

Conclusion.
Use the PLE as a tool for Robinson’s orthogonalization, not as a structural law. To interpret $g_0(C)$ causally as the CATE, you still need NUC and overlap. The residual is only a statistical artifact unless you have independent structural grounds to treat it as an invariant exogenous disturbance.


References

[1] Peter M. Robinson. “Root-N-consistent semiparametric regression.” Econometrica, 56(4), 1988, pp. 931–954.

[2] Judea Pearl. Causality: Models, Reasoning, and Inference. 2nd ed., Cambridge University Press, 2009.

[3] Xinkun Nie and Stefan Wager. “Quasi-oracle estimation of heterogeneous treatment effects.” Biometrika, 108(2), 2021, pp. 299–319.