Causal Inference With Interference

AB Testing
Data Science
Statistics
Academic
Published

December 1, 2025

Lately I’ve been exploring research in causal inference. The last 15 years have seen remarkable progress on causal inference with interference. With the advent of large-scale social media platforms, there has been enormous interest in understanding causal inference in an environment that permits interactions spanning the entire population under consideration.

The goal of causal inference is to understand the causal relationship between an independent and a dependent variable. Fundamentally, we would like to understand the impact on the dependent variable \(Y_i\) for each individual of a change in the independent variable \(z = 0,1\). Mathematically, this is quantified by the Average Treatment Effect \[ \tau(1,0) = \frac1N\sum_{i=1}^N \left(Y_i(1) - Y_i(0) \right) \tag{1}\] where the difference \(Y_i(1) - Y_i(0)\) represents the change in the dependent variable due to the independent variable for the \(i^{th}\) unit. The fundamental problem of causal inference is that one can only observe a unit’s response under a single treatment. That is, we can measure \(Y_i(1)\) or \(Y_i(0)\), but not both. This issue is mitigated in experiments with random assignment of individuals to the control or treatment group. The Stable Unit Treatment Value Assumption (SUTVA) requires that no unit’s response depend on another unit’s treatment, and it is the primary historical basis for estimating a single, consistent treatment effect on the dependent variable.
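Under SUTVA with random assignment, (Equation 1) can be estimated by the difference in group means. A minimal simulation sketch, with made-up potential outcomes (the true effect here is set to 2.0 by construction):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000

# Simulated potential outcomes Y_i(0) and Y_i(1) for each unit;
# the true average treatment effect is 2.0 by construction.
y0 = rng.normal(loc=10.0, scale=1.0, size=N)
y1 = y0 + 2.0

# Random assignment: each unit reveals only one potential outcome.
z = rng.integers(0, 2, size=N)
y_obs = np.where(z == 1, y1, y0)

# Under SUTVA and randomization, the difference in group means
# is an unbiased estimator of the ATE.
tau_hat = y_obs[z == 1].mean() - y_obs[z == 0].mean()
print(tau_hat)  # close to 2.0
```
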

Unless individuals are completely isolated from one another, the problem of interference within and across assignment groups must immediately be considered. Classical approaches to controlling interference typically rely on partitioning the population into groups of interacting individuals with no interference between groups. In that case the assignment is made at the group level (Hudgens and Halloran 2008).

In a massively connected component of individuals, there is no hope of partitioning the population into noninteracting groups. Thus interference within the group itself must be controlled. There is an exciting story of the formulation of cluster methods to recover results on the treatment effect, that is, of the existence of estimators of (Equation 1) and the control of their bias and variance.

Background for this note includes the following …

Note: Estimator

In the language of statistics, an estimator is a formula or rule used to estimate a quantity of interest from observed data. An estimator is said to be unbiased if its expectation equals the quantity of interest.

Familiar examples include the Expectation and Variance estimators for a set of observations \((X_1,\ldots,X_n)\) of a random variable \(X\) with mean \(\mu\) and variance \(\sigma^2\).

Expectation Estimator

\[ \bar X_n = \frac{1}{n} \sum_{i=1}^n X_i \] The estimator is unbiased since \[ \mathbb{E} [\bar X_n] = \mu \]

It is useful to find (or at least bound above) the variance of the estimator, since that guarantees a good estimate if it can be shown to be small. The variance is \[ Var(\bar X_n) = \mathbb{E}[(\bar X_n - \mathbb{E}[\bar X_n])^2] = \frac{1}{n}\sigma^2 \] so the variance tends to zero with a large sample, assuring a quality estimator that is unbiased and converges to the quantity of interest.
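The \(1/n\) shrinkage of the variance of \(\bar X_n\) can be checked empirically. A small simulation sketch (seed and constants are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n, trials = 50, 20_000
sigma2 = 4.0  # true variance of X

# Draw many independent samples of size n and compute each sample mean.
samples = rng.normal(loc=0.0, scale=np.sqrt(sigma2), size=(trials, n))
means = samples.mean(axis=1)

# Empirically, Var(X̄_n) ≈ σ²/n.
empirical_var = means.var()
print(empirical_var, sigma2 / n)
```
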

Variance Estimator

\[ S_n^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar X_n)^2 \] \[ \mathbb{E} [S_n^2] = \sigma^2 \]

The variance is somewhat more complicated, \[ Var(S^2_n) = \frac1n \left( \mathbb{E}[(X - \mu)^4] -\frac{n-3}{n-1}\sigma^4 \right) \] nevertheless still tends to zero with a large sample.

Note we did not use the estimator \[ \tilde S_n^2 = \frac{1}{n}\sum_{i=1}^n (X_i - \bar X_n)^2 \] as this would have expectation \[ \mathbb{E} [\tilde S_n^2] = \frac{n-1}{n}\sigma^2 \] and would be a biased estimator.
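The bias of \(\tilde S_n^2\) is easy to see numerically; NumPy exposes both versions through its `ddof` argument. A quick simulation sketch:

```python
import numpy as np

rng = np.random.default_rng(2)
n, trials = 10, 200_000
# Standard normal draws, so the true variance is σ² = 1.
samples = rng.normal(size=(trials, n))

# ddof=1 gives the unbiased S_n^2; ddof=0 gives the biased version.
s2_unbiased = samples.var(axis=1, ddof=1).mean()
s2_biased = samples.var(axis=1, ddof=0).mean()

print(s2_unbiased)  # ≈ σ² = 1
print(s2_biased)    # ≈ (n-1)/n · σ² = 0.9
```
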

Note: Horvitz Thompson Estimator

Let \(\mathcal{K} = \{1,2,\ldots,N\}\) be a finite population. For each \(i\in\mathcal{K}\), let \(Y_i\) be an observable of unit \(i\). Our goal is to estimate the population total \[ Y = \sum_{i=1}^N Y_i \] and the population average \[ \tau = \frac{1}{N} \sum_{i=1}^N Y_i \]

To construct the estimator we take a random sample of \(n< N\) units from \(\mathcal{K}\). Any sampling method is allowed, but the obtained sample must have distinct units. Let \(\pi_i\) be the probability that the \(i^{th}\) unit is included in the sample (assume \(\pi_i > 0\)). Denote the random sample by \(\bf{S} = \{i_1,...,i_n\} \subset \mathcal{K}\). The Horvitz Thompson estimator is \[ \hat Y_{HT} = \sum_{i\in \bf{S}} \frac{Y_i}{\pi_i} = \sum_{i=1}^N \frac{Y_i 1_{\{i\in \bf S\}}}{\pi_i} \] and \[ \hat \tau_{HT} = \frac{1}{N} \hat Y_{HT} \]

The estimator is unbiased \[ \mathbb{E}[\hat Y_{HT}] = Y. \] \[ \mathbb{E}[\hat \tau_{HT}] = \tau. \]

To state the variance of \(\hat Y_{HT}\), define \(\pi_{ij}\) as the probability that both \(i\) and \(j\) belong to \(\bf{S}\); it follows from the definition that \(\pi_{ii} = \pi_i\). The variance of the Horvitz Thompson estimator is then \[ Var(\hat Y_{HT}) = \sum_{i=1}^N \sum_{j=1}^{N} \frac{\pi_{ij} - \pi_i\pi_j}{\pi_i\pi_j} Y_i Y_j \] The variance of the average estimator is \[ Var(\hat \tau_{HT}) = \frac{1}{N^2}\sum_{i=1}^N \sum_{j=1}^{N} \frac{\pi_{ij} - \pi_i\pi_j}{\pi_i\pi_j} Y_i Y_j \]

As we want to assure that the Horvitz Thompson estimator converges to the quantity of interest, controlling the variance is key. It is not hard to see the variance has an upper bound, \[ Var(\hat Y_{HT}) \leq \sum_{i,j=1}^N \left(\frac{1}{\pi_i^{1/2}\pi_j^{1/2}} - 1\right) Y_i Y_j \] but if some of the \(\pi_i\) are small this raises a small-denominator problem, so care must be taken in that case. This danger can be mitigated through a modification of the HT estimator, the Hajek estimator. (Also note that as the sample becomes larger, the \(\pi_i\) become larger, better controlling the variance.)
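A minimal sketch of the Horvitz Thompson estimator, here under Poisson sampling (each unit included independently with its own probability \(\pi_i\) — a design chosen for illustration, not required by the estimator), with a Monte Carlo check of unbiasedness:

```python
import numpy as np

rng = np.random.default_rng(3)
N = 200
y = rng.uniform(1.0, 5.0, size=N)  # made-up observables Y_i
Y_total = y.sum()

# Unequal inclusion probabilities, one per unit.
pi = rng.uniform(0.2, 0.8, size=N)

def horvitz_thompson(y, pi, included):
    """HT estimate of the population total: sum of Y_i / pi_i over the sample."""
    return (y[included] / pi[included]).sum()

# Monte Carlo check: the average of many HT estimates should be
# close to the true total Y, since the estimator is unbiased.
trials = 20_000
estimates = np.empty(trials)
for t in range(trials):
    included = rng.random(N) < pi  # Poisson sampling draw
    estimates[t] = horvitz_thompson(y, pi, included)

print(estimates.mean(), Y_total)
```
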

Example

Consider a collection of cities with populations greater than some lower bound \(b\) in some state. Enumerate the cities by \(\mathcal{K} = \{1,2,\ldots,N\}\), and let \(Y_i\) be the total number of hotels in the \(i^{th}\) city.

Sample \(n\) units from \(\mathcal{K}\) by Simple Random Sampling Without Replacement. The inclusion probabilities are \[ \pi_i = n/N \] \[ \pi_{ij} = \frac{n(n-1)}{N(N-1)} \]

The estimate of the total number of hotels in large cities is given by \(\hat Y_{HT}\) after sampling \[ \hat Y_{HT} = \sum_{i\in \bf{S}} \frac{Y_i}{\pi_i} = \frac{N}{n}\sum_{i\in \bf{S}} Y_i \] with variance \[ Var(\hat Y_{HT}) = N^2 \frac{1 - n/N}{n} S^2 \] where \(S^2\) is the population variance. Because of the factor \(N^2\), this variance does not tend to zero even as \(n\) tends to infinity. But for the Horvitz Thompson average, \[ Var(\hat \tau_{HT}) = \frac{1 - n/N}{n} S^2 \] Under the simple assumption \(n \propto N^\alpha\) for \(0 < \alpha < 1\) or \(n = \beta N\) for \(0 < \beta < 1\), the variance of the average estimator tends to zero.
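The hotel example can be simulated directly; the city count, sample size, and hotel counts below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(4)
N, n = 500, 50
# Hypothetical hotel counts for N large cities.
hotels = rng.poisson(lam=12, size=N).astype(float)

# Simple Random Sampling Without Replacement: pi_i = n/N for every unit.
sample = rng.choice(N, size=n, replace=False)
Y_ht = (N / n) * hotels[sample].sum()  # HT estimate of the total
tau_ht = Y_ht / N                      # HT estimate of the average

# Variance of the average estimator: (1 - n/N)/n * S^2,
# which shrinks as the sample grows with the population.
S2 = hotels.var(ddof=1)
var_tau = (1 - n / N) / n * S2

print(tau_ht, hotels.mean(), var_tau)
```
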

Note: Graph Theory Basics

A graph is defined as a pair \(G = (V, E)\) where \(V\) is a finite set of vertices and \(E \subseteq \{\{u,v\} : u, v \in V, u \neq v\}\) is a set of unordered edges.

  • Vertices and edges: Each element \(v \in V\) is a vertex. An edge \(\{u,v\} \in E\) connects two distinct vertices \(u\) and \(v\).
  • Adjacency: Two vertices \(u\) and \(v\) are adjacent if \(\{u,v\} \in E\).
  • Degree: The degree of a vertex \(v\) is \(\deg(v) = |\{u \in V : \{u,v\} \in E\}|\).

Neighborhoods

The (open) neighborhood of a vertex \(v\) is \[ N(v) = \{u \in V : \{u,v\} \in E\}. \] Including the vertex itself yields the closed neighborhood \(N[v] = N(v) \cup \{v\}\).

Balls in a Graph

Using graph distance \(d_G(u,v)\) (the length of a shortest path between \(u\) and \(v\)), the ball of radius \(r \ge 0\) around \(v\) is \[ B_r(v) = \{u \in V : d_G(u,v) \le r\}. \] When \(r = 1\), \(B_1(v) = N[v]\).

Annular Regions

An annulus (discrete ring) around \(v\) between radii \(r_1\) and \(r_2\) with \(0 \le r_1 < r_2\) is \[ A_{r_1,r_2}(v) = \{u \in V : r_1 < d_G(u,v) \le r_2\}. \] These layers let us analyze how structure or influence spreads outward from a vertex.
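All of these sets can be computed from breadth-first-search distances. A minimal sketch with an adjacency-list representation (the helper names `graph_distances`, `ball`, and `annulus` are my own):

```python
from collections import deque

def graph_distances(adj, source):
    """BFS from `source`; returns a dict vertex -> shortest-path distance."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def ball(adj, v, r):
    """B_r(v): vertices at graph distance <= r from v."""
    return {u for u, d in graph_distances(adj, v).items() if d <= r}

def annulus(adj, v, r1, r2):
    """A_{r1,r2}(v): vertices at graph distance in (r1, r2]."""
    return {u for u, d in graph_distances(adj, v).items() if r1 < d <= r2}

# A small path graph 0 - 1 - 2 - 3 - 4.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(ball(adj, 2, 1))       # the closed neighborhood N[2] = {1, 2, 3}
print(annulus(adj, 2, 0, 2)) # everything within distance 2, excluding 2 itself
```
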

Returning to the main problem, consider an experiment on a messaging platform. Enumerate the users \(\mathcal{V} = \{v_1,v_2,\ldots,v_N\}\). Typically, the network is modeled as a graph \(\mathcal{G} = (\mathcal{V},\mathcal{E})\); in our example two units are connected by an edge if they are friends. For the sake of the experiment, each user is assigned to one of two groups \(z_i = 0,1\), with \(0\) being the control and \(1\) being the treatment.

With an assignment \(z = (z_1,z_2,\ldots,z_N)\) the total of the quantity of interest is \[ Y(z) = \sum_{i=1}^N Y_i(z) \] where \(Y_i\) is the quantity of interest for the \(i^{th}\) individual. For our messaging example, \(Y_i\) might be the total number of messages the user sent within a two-week period. Notice that at this point we are allowing each individual’s behavior to depend on the entire assignment vector!

The ultimate goal is to estimate \[ \tau(\vec{1},\vec{0}) = \frac1N\sum_{i=1}^N \left(Y_i(\vec{1}) - Y_i(\vec{0}) \right) \tag{2}\] the Average Treatment Effect (ATE). Here \(\vec{1}\) (resp. \(\vec{0}\)) is the assignment of \(1\)s (resp. \(0\)s) to all individuals. Of course there is no world in which we can simultaneously assign all units both 1 and 0; hence the goal of our study.

It would be hard to make it very far without some assumption on the network effects. How can we draw conclusions about \(\tau(\vec{1},\vec{0})\) if we cannot isolate the assignment groups? The earliest paper I found in this direction is (Ugander et al. 2013). The authors assume a network exposure condition, which effectively says that if sufficiently many members of an individual’s neighborhood belong to a given assignment group, the individual behaves as if the entire network had that assignment. That is, for individual \(v_i\in\mathcal{V}\), if at least \(k_i\) individuals in \(A_{0,r}(v_i)\) are assigned \(z_j = 1\) (resp. \(0\)), then we may assume \(v_i\) behaves as if \(z = \vec{1}\) (resp. \(\vec{0}\)). Explicit choices of \(k_i\) can vary, but the authors consider the full neighborhood \(k_i = |A_{0,r}(v_i)|\), a proportion of the neighborhood \(k_i = \kappa |A_{0,r}(v_i)|\), or simply a constant minimum over the network \(k_i = \kappa\).
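A toy sketch of a fractional-neighborhood exposure check — the function name `fully_exposed`, the threshold \(\kappa = 0.75\), and the tiny graph are all illustrative choices of mine, not the paper's exact formulation:

```python
def fully_exposed(adj, z, i, kappa=0.75):
    """True if at least a fraction kappa of i's neighbors share i's assignment."""
    neighbors = adj[i]
    if not neighbors:
        return True  # an isolated unit trivially satisfies the condition
    same = sum(1 for j in neighbors if z[j] == z[i])
    return same / len(neighbors) >= kappa

# Toy star graph: unit 0's neighbors are 1, 2, 3.
adj = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
z = {0: 1, 1: 1, 2: 1, 3: 0}

# Only 2/3 of unit 0's neighbors match its assignment, below kappa = 0.75.
print(fully_exposed(adj, z, 0))  # False
```

Units passing the check are treated in the analysis as if they had received the global assignment \(\vec{1}\) or \(\vec{0}\).
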

The Ugander paper is notable because it tackles the problem of a globally connected network of individuals. Compare this to the more mature literature on estimating the ATE when there are isolated groups of interacting agents (Hudgens and Halloran 2008). In that paper, the population is stratified into groups, and interference is limited to individuals within the same group. For example, a group might be all students in a particular elementary school. The analogue of the network exposure condition is the assumption that any two assignments of a fixed group are equivalent if they assign the same numbers of individuals to 1 and 0 respectively.

Note: Hudgens and Halloran, 2008

The Hudgens and Halloran paper outlines detailed experiment designs. [Sampling procedures] The goal is to measure quantities like (Equation 2) by group and by individual. They differentiate a number of causal effects:

  • Direct - The impact on \(Y_i\) of altering a single individual’s assignment \(z_i=0 \to z_i=1\) while holding all other assignments fixed.
  • Indirect - The impact on \(Y_i\) of altering the assignments of all other individuals \(\{z_j\}_{j\neq i}\) in a group while holding \(z_i\) fixed.
  • Total - A combination of direct and indirect effects.

With this variety of quantities, they proceed to derive unbiased estimators for the ATEs. [State corollary of theorem 2]

Papers: (Hudgens and Halloran 2008), (Aronow and Samii 2017), (Ugander et al. 2013), (Eckles, Karrer, and Ugander 2017)

References

Aronow, Peter M., and Cyrus Samii. 2017. “Estimating Average Causal Effects Under General Interference, with Application to a Social Network Experiment.” The Annals of Applied Statistics 11 (4): 1912–47. http://www.jstor.org/stable/26362172.
Eckles, Dean, Brian Karrer, and Johan Ugander. 2017. “Design and Analysis of Experiments in Networks: Reducing Bias from Interference.” Journal of Causal Inference 5 (1): 20150021.
Hudgens, Michael G., and M. Elizabeth Halloran. 2008. “Toward Causal Inference with Interference.” Journal of the American Statistical Association 103 (482): 832–42. http://www.jstor.org/stable/27640105.
Ugander, Johan, Brian Karrer, Lars Backstrom, and Jon Kleinberg. 2013. “Graph Cluster Randomization: Network Exposure to Multiple Universes.” In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 329–37. KDD ’13. New York, NY, USA: Association for Computing Machinery. https://doi.org/10.1145/2487575.2487695.