Fixed Effects and Causal Inference#
So, you’re working with panel data. You have either multiple observations per entity (usually data over time), or you have nested data (data on individuals, each of whom belongs to a larger group, like kids in a school). And you want to add fixed effects for the entities or groups. But what does that mean from the perspective of causal inference?
The simple answer is that a fixed effect is a tool for controlling for a specific source of baseline differences (namely, baseline differences across entities). In that sense, fixed effects are no different from any other type of control variable we put in a regression.
But there is an important difference: with most variables we put in a regression, those variables are only controlling for a specific thing, the thing they measure. If you add age, you’re controlling for age; add income, you’re controlling for income. But fixed effects are a little more powerful: they control for any stable source of baseline differences between entities!
Uh, what?
Suppose you have weekly sales data for all your stores over several years, and you’re interested in better understanding the effect of each store’s local unemployment rate on sales. For simplicity, we’ll assume that in any week, a given store is said to experience high unemployment (\(D_i=1\)) or low unemployment (\(D_i=0\)), though this extends to using thinking of unemployment as a continuous variable.
If you just ran a regression of local unemployment rates on sales, you might find that sales are extremely sensitive to the local unemployment rate. But when you dig in, that’s because your New York City store in Times Square makes tons of sales, and NYC tends to always have low unemployment.
In our usual causal framework, we’d say that our NYC store had important baseline differences in our outcome (sales per week) that happened to be correlated with treatment assignment (NY tends to have low unemployment), though we don’t think it’s because of NYC has low unemployment (we think it’s because the store is in Times Square!). As a result, \(E(Y_i^0|D_i=0) \neq E(Y_i^0|D_i=1)\). We have a problem with baseline differences.
But what if you add store fixed effects to the model? You can think of those fixed effects as demeaning all sales for each store. In other words, you’re removing level differences across scores from the relationship you’re trying to estimate. As a result, instead of estimating how differences in unemployment are correlated with sales, you’re instead estimating how changes in unemployment relate to changes in sales.
That’s because after fixed effects effectively demean your dependent variable (for each store), the only variation in the data you’re left with are the deviations in your dependent variable above or below its mean value for a given store. We’re no longer estimating the relationship between unemployment and store sales, but rather how changes in unemployment are correlated with changes in sales.
From a causal perspective, now that our focus is on how changes in unemployment affect changes in sales, any differences between stores that don’t vary over time – no matter what caused those differences – have been accounted for.
Mathematically, we can think of this as a change in our \(Y\): instead of \(Y_i\) being absolute sales, it’s now deviations in sales above or below the average sales for a given store \(i\). Thus, baseline differences (\(E(Y_i^0|D_i=0) = E(Y_i^0|D_i=1)\)) has become a question about whether, in a world where unemployment is below the local average, sales for all groups tend to be the same amount above (or below) store average sales. And the fact that we have a store in Times Square is no longer an obvious threat to that condition!
The limit to this magic is that we’re only controlling for non-time-varying differences between stores. If store sales in NYC are always high because of their location, that’s something controlled for by the fixed effects. However, if in the middle of your data NYC experiences a hurricane, and so sales dip, that isn’t taken into account by your fixed effects (since it was “time-varying,” it only affected the store for a fixed period-of-time). As a result, you’d want a separate control for those kinds of disruptions in your regression.
One way you can see this is that if you try and stick a variable into your regression that is not time-varying (it has the same value for each store throughout the data), you’ll find that you don’t get back a regression coefficient. That’s because anything non-time varying is actually co-linear (in terms of the underlying linear algebra) with the fixed effects, meaning you literally can’t estimate a coefficient. It’d be like trying to include the variable “age” twice in the same regression.