Partial Linear Equations ≠ Structural Causal Models

TL;DR. The partial linear equation (PLE) in causal inference [1]

\[Y=f_0(C)+g_0(C)X+\epsilon_Y\]

is a statistical decomposition, not a structural causal model (SCM). Its residual $\epsilon_Y$ is a regression remainder, not an exogenous variable in an SCM [2]. For $g_0(C)$ to equal the conditional average treatment effect (CATE), we still need No Unmeasured Confounders (NUC) and overlap. If we treat the residual $\epsilon_Y$ as an exogenous SCM variable that is invariant across interventions on the observables, we implicitly assume identification of unit-level effects, which is usually too strong.



Notation

  • $X\in\{0,1\}$: treatment.
  • $C$: vector of observed covariates.
  • $Y\in\mathbb{R}$: outcome.
  • $Y(0),Y(1)$: potential outcomes.
  • NUC: $(Y(0),Y(1))\perp X \mid C$.
  • Overlap: $\Pr(X=1\mid C=c)\in(0,1)$ for all relevant $c$.
  • $e_0(C):=\mathbb{E}[X\mid C]$ (propensity).
  • $f_0(C):=\mathbb{E}[Y(0)\mid C]$ (baseline under control).
  • $\tau_0(C):=\mathbb{E}[Y(1)-Y(0)\mid C]$ (CATE).
  • PLE residual (given any $f_0,g_0$):

    \[\epsilon_Y:=Y-f_0(C)-g_0(C)X.\]

1. PLE is not an SCM

In an SCM, one writes

\[Y \;=\; \underbrace{F(C,X)}_{\text{structural}}\;+\;\underbrace{U_Y}_{\text{exogenous, invariant}},\]

and $U_Y$ is a genuine disturbance: it is not caused by observables and is invariant to interventions on $X$.

By contrast, the PLE writes $Y=f_0(C)+g_0(C)X+\epsilon_Y$ for some $f_0,g_0$, where the residual is defined as $\epsilon_Y:=Y-f_0(C)-g_0(C)X$. By this definition, $\epsilon_Y$ is causally dependent on $X$: in general, $\epsilon_Y(X=1) \neq \epsilon_Y(X=0)$.

Do not read $\epsilon_Y$ as “exogenous noise” unless you have an independent structural argument.


2. Residuals should not be treated as exogenous variables

Suppose we (incorrectly) treat $\epsilon_Y$ as an exogenous variable invariant to interventions. Define

\[\epsilon_Y(x):=Y(x)-f_0(C)-g_0(C)\,x.\]

This implies

\[\epsilon_Y(1) = Y(1) - f_0(C) - g_0(C), \qquad \epsilon_Y(0) = Y(0) - f_0(C).\]

If we assume $\epsilon_Y(1) = \epsilon_Y(0)$, then

\[Y(1)-Y(0)=g_0(C)\qquad\text{a.s.}\]

i.e., the individual treatment effect (ITE) is deterministically $g_0(C)$ within each $C$-stratum; there is no residual heterogeneity. This is much stronger than NUC and is rarely warranted in practice.
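
To see how strong this is, here is a minimal simulation sketch (Python/numpy; the data-generating process is hypothetical). Unit-level effects vary within each $C$-stratum, so the residual is not invariant: $\epsilon_Y(1)$ and $\epsilon_Y(0)$ both have mean zero, yet they differ unit by unit.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

C = rng.normal(size=n)                  # observed covariate
X = rng.binomial(1, 0.5, size=n)        # randomized treatment (NUC holds)

f0 = C                                  # E[Y(0) | C]
tau0 = 1.0 + 0.5 * C                    # CATE = E[Y(1) - Y(0) | C]

U0 = rng.normal(size=n)                 # baseline noise
H = rng.normal(size=n)                  # unit-level effect heterogeneity
Y0 = f0 + U0                            # potential outcome under control
Y1 = f0 + tau0 + H + U0                 # potential outcome under treatment

# PLE residuals under the causal alignment (f0, g0) = (f0, tau0):
eps1 = Y1 - f0 - tau0                   # eps_Y(1) = U0 + H
eps0 = Y0 - f0                          # eps_Y(0) = U0

print("E[eps_Y(1)] ~", eps1.mean())    # ~ 0
print("E[eps_Y(0)] ~", eps0.mean())    # ~ 0
print("P(eps_Y(1) != eps_Y(0)) ~",
      np.mean(np.abs(eps1 - eps0) > 1e-12))  # ~ 1: residual is NOT invariant
```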


3. What NUC does give us

From the definition of the residual:

\[\begin{aligned} \mathbb{E}[\epsilon_Y \mid X=1,C] &= \mathbb{E}[Y \mid X=1,C] - f_0(C) - g_0(C), \\ \mathbb{E}[\epsilon_Y \mid X=0,C] &= \mathbb{E}[Y \mid X=0,C] - f_0(C). \end{aligned}\]

Recall that $f_0(C) := \mathbb{E}[Y(0) \mid C]$ and $g_0(C) := \mathbb{E}[Y(1)\mid C] - \mathbb{E}[Y(0)\mid C]$. Under NUC, we have

\[\begin{aligned} \mathbb{E}[\epsilon_Y \mid X=1,C] &= \mathbb{E}[Y(1) \mid C] - f_0(C) - g_0(C) = 0, \\ \mathbb{E}[\epsilon_Y \mid X=0,C] &= \mathbb{E}[Y(0) \mid C] - f_0(C) = 0. \end{aligned}\]

Thus NUC implies

\[\boxed{\; \mathbb{E}[\epsilon_Y \mid X,C] = 0 \;}.\]

By the law of iterated expectations, $\mathbb{E}[\epsilon_Y\mid C]=0$ as well.

Important asymmetry. The converse is false: $\mathbb{E}[\epsilon_Y\mid X,C]=0$ does not imply NUC. It is a regression moment condition that may hold without causal validity.
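
A quick numerical check of the boxed moment condition (numpy; hypothetical DGP with a discrete covariate, so the conditional means can be read off cell by cell). Treatment is randomized within each stratum of $C$, so NUC holds, and the residual from the causal alignment averages to zero in every $(X,C)$ cell; a concrete counterexample to the converse appears in the sketch after Section 5.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

C = rng.integers(0, 3, size=n)            # discrete covariate: 3 strata
e0 = np.array([0.3, 0.5, 0.7])[C]         # propensity depends on C only -> NUC holds
X = rng.binomial(1, e0)

f0 = np.array([0.0, 1.0, 2.0])[C]         # E[Y(0) | C]
tau0 = np.array([1.0, 0.5, -0.5])[C]      # CATE
Y = f0 + tau0 * X + rng.normal(size=n)    # outcome noise has mean 0 given (X, C)

eps = Y - f0 - tau0 * X                   # residual under the causal alignment
for c in range(3):
    for x in (0, 1):
        cell = (C == c) & (X == x)
        print(f"E[eps | X={x}, C={c}] ~ {eps[cell].mean():+.4f}")  # all ~ 0
```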


4. Robinson’s decomposition and the R-learner, carefully stated

Define

\[m_0(C):=\mathbb{E}[Y\mid C],\qquad e_0(C):=\mathbb{E}[X\mid C].\]

Under NUC and the causal alignment ($f_0=\mathbb{E}[Y(0)\mid C]$, $g_0=\tau_0$; see Section 5), taking $\mathbb{E}[\,\cdot\mid C]$ of $\mathbb{E}[Y\mid X,C]=f_0(C)+\tau_0(C)X$ gives

\[m_0(C)=f_0(C)+\tau_0(C)\,e_0(C).\]

Subtracting $m_0(C)$ from $Y$ and $e_0(C)$ from $X$ yields Robinson’s partial linear equation:

\[\underbrace{Y-m_0(C)}_{\tilde Y} \;=\; \tau_0(C)\,\underbrace{\big(X-e_0(C)\big)}_{\tilde X} \;+\;\varepsilon, \qquad \mathbb{E}[\varepsilon\mid X,C]=0.\]

The R-learner [3] estimates $\tau_0(\cdot)$ by:

  1. Learning $\hat m(C)\approx m_0(C)$ and $\hat e(C)\approx e_0(C)$ via flexible ML (with cross-fitting).
  2. Regressing residual outcome on residual treatment:

    \[\hat\tau \in \arg\min_{\tau\in\mathcal{H}} \frac{1}{n}\sum_{i=1}^n\Big( (Y_i-\hat m(C_i))-\tau(C_i)\,(X_i-\hat e(C_i))\Big)^2 +\Lambda_n(\tau).\]

Because $\mathbb{E}[\varepsilon\mid X,C]=0$, this loss is Neyman-orthogonal in the nuisances, which makes the estimator robust to small errors in $\hat m$ and $\hat e$.
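
A compact R-learner sketch following these two steps (Python with scikit-learn; the data-generating process, the choice of gradient boosting for the nuisances, and the linear basis for $\tau(\cdot)$ are illustrative assumptions, not prescriptions from [3]):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, GradientBoostingClassifier
from sklearn.model_selection import KFold

rng = np.random.default_rng(2)
n = 5_000
C = rng.normal(size=(n, 1))
e0 = 1 / (1 + np.exp(-C[:, 0]))              # propensity depends on C (confounding)
X = rng.binomial(1, e0)
tau0 = 1.0 + 0.5 * C[:, 0]                   # true CATE
Y = C[:, 0] + tau0 * X + rng.normal(size=n)  # f0(C) = C

# Step 1: cross-fitted nuisances m_hat(C) ~ E[Y|C], e_hat(C) ~ E[X|C]
m_hat = np.zeros(n)
e_hat = np.zeros(n)
for train, test in KFold(n_splits=5, shuffle=True, random_state=0).split(C):
    m_hat[test] = GradientBoostingRegressor().fit(C[train], Y[train]).predict(C[test])
    e_hat[test] = GradientBoostingClassifier().fit(C[train], X[train]).predict_proba(C[test])[:, 1]

# Step 2: residual-on-residual regression; tau(C) = theta0 + theta1 * C (linear sketch)
Y_t = Y - m_hat                              # tilde Y
X_t = X - e_hat                              # tilde X
Phi = np.column_stack([np.ones(n), C[:, 0]]) # basis for tau(.)
theta, *_ = np.linalg.lstsq(Phi * X_t[:, None], Y_t, rcond=None)
print("estimated tau(C) ~ %.2f + %.2f * C (truth: 1.00 + 0.50 * C)" % (theta[0], theta[1]))
```

The least-squares step is exactly the R-learner objective restricted to a linear $\tau$ with no penalty; the cross-fitting in step 1 is what keeps nuisance errors from biasing step 2.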


5. Two ways to choose $(f_0,g_0)$ — and their residual properties

(A) Causal alignment (preferred for identification). Set $g_0=\tau_0$, $f_0=\mathbb{E}[Y(0)\mid C]$. Then, under NUC:

\[\mathbb{E}[\epsilon_Y\mid X,C]=0.\]

Here $g_0$ is the CATE; identification also needs overlap.

(B) Linear projection alignment (associational). Define $(f_0(C),g_0(C))$ as the conditional linear projection of $Y$ on $(1,X)$ given $C$:

\[g_0(C)=\frac{\operatorname{Cov}(X,Y\mid C)}{\operatorname{Var}(X\mid C)},\quad f_0(C)=\mathbb{E}[Y\mid C]-g_0(C)\,\mathbb{E}[X\mid C].\]

Then the normal equations give

\[\mathbb{E}[\epsilon_Y\mid C]=0,\qquad \mathbb{E}\!\big[(X-e_0(C))\,\epsilon_Y\mid C\big]=0,\]

and, because $X$ is binary, the projection is saturated: $f_0(C)+g_0(C)\,x=\mathbb{E}[Y\mid X=x,C]$, so $\mathbb{E}[\epsilon_Y\mid X,C]=0$ holds here as well (overlap keeps $\operatorname{Var}(X\mid C)>0$, so the slope is well defined). The catch is interpretive, not statistical: $g_0(C)=\mathbb{E}[Y\mid X=1,C]-\mathbb{E}[Y\mid X=0,C]$ is the best-linear-predictor slope (associational), and it equals the CATE only under NUC. This is exactly the asymmetry of Section 3: the moment condition can hold while the causal reading fails.

One can always write $Y = f_0(C) + g_0(C)X + \epsilon_Y$, but without NUC, this decomposition has no causal meaning.
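
To make the “no causal meaning” point concrete, a hypothetical confounded DGP (numpy): an unobserved $U$ drives both $X$ and $Y$, so NUC fails. The projection-aligned residual still has zero mean in every $(X,C)$ cell, yet $g_0(C)$ (about $2.2$ in this design) is far from the true CATE of $1$.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500_000

C = rng.integers(0, 2, size=n)            # observed covariate
U = rng.binomial(1, 0.5, size=n)          # UNOBSERVED confounder -> NUC fails
X = rng.binomial(1, 0.2 + 0.6 * U)        # treatment driven by U
Y0 = C + 2.0 * U + rng.normal(size=n)     # potential outcomes
Y1 = Y0 + 1.0                             # true CATE = 1 everywhere
Y = np.where(X == 1, Y1, Y0)

for c in (0, 1):
    s = C == c
    # projection alignment: g0(C) = Cov(X,Y|C)/Var(X|C) = E[Y|X=1,C] - E[Y|X=0,C]
    g0 = Y[s & (X == 1)].mean() - Y[s & (X == 0)].mean()
    f0 = Y[s & (X == 0)].mean()
    eps = Y[s] - f0 - g0 * X[s]
    print(f"C={c}: g0 ~ {g0:.2f} (CATE = 1.00), "
          f"E[eps|X=0,C] ~ {eps[X[s] == 0].mean():+.3f}, "
          f"E[eps|X=1,C] ~ {eps[X[s] == 1].mean():+.3f}")
```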


Summary & Conclusion

Common pitfalls and how to avoid them:

  1. Treating $\epsilon_Y$ as exogenous noise.
    This implicitly assumes residual invariance across interventions, yielding unit-level effects $Y(1)-Y(0)=g_0(C)$. This is a very strong and usually unjustifiable condition.

  2. Confusing moment conditions with causal assumptions.
    $\mathbb{E}[\epsilon_Y \mid X,C]=0$ follows from NUC (with causal alignment), but it does not imply NUC on its own.

Conclusion.
Use the PLE as a tool for Robinson’s orthogonalization, not as a structural law. To interpret $g_0(C)$ causally as the CATE, you still need NUC and overlap. The residual is only a statistical artifact unless you have independent structural grounds to treat it as an invariant exogenous disturbance.


References

[1] Peter M. Robinson. “Root-N-consistent semiparametric regression.” Econometrica, 56(4), 1988, pp. 931–954.

[2] Judea Pearl. Causality: Models, Reasoning, and Inference. 2nd ed., Cambridge University Press, 2009.

[3] Xinkun Nie and Stefan Wager. “Quasi-oracle estimation of heterogeneous treatment effects.” Biometrika, 108(2), 2021, pp. 299–319.