Work in progress. This primer is still being written.
β ITSJUSTBETA.COM

Part 06 / 16

Estimating Factor Returns: The Cross-Sectional Regression

This chapter is the mathematical core of the series: one regression that turns measured exposures into factor returns. The fundamental model’s exposures are measured (Chapter 3) on a curated universe (Chapter 5). What remains unknown each period is the vector of factor returns, which is estimated by one regression per period across stocks.

6.1 Setup

At the end of period $t$, we observe: returns $r_t$ ($N \times 1$) realized over the period, and exposures $X \equiv X_{t-1}$ ($N \times K$) known at the period’s start. The model

$$r_t = X_{t-1} f_t + \epsilon_t$$

is, for this one period, a linear regression in which the stocks are the observations: $N$ data points, $K$ coefficients. The coefficient on the value column is “the return in this period to a unit of value exposure”, the value factor return. We estimate one such regression every period, building up the time series of factor returns $\{\hat f_t\}$ that Chapter 8 turns into a covariance matrix. Within this chapter the subscripts are dropped: $r = Xf + \epsilon$.

6.2 Ordinary least squares (OLS)

Derivation from first principles. Each stock’s residual is $r_i - X_i f$, the part of its return the factors miss. OLS picks $\hat f$ to minimize the sum of squared residuals $S(f)$:

$$S(f) = \sum_i (r_i - X_i f)^2 = (r - Xf)^\top (r - Xf).$$

As a function of $f$ this is a quadratic, an upward-curving bowl with one lowest point, so the minimum sits wherever the gradient is zero. Expanding the product $(r - Xf)^\top (r - Xf)$ gives

$$S(f) = r^\top r - 2 f^\top X^\top r + f^\top X^\top X f,$$

and its three terms differentiate to $0$, $-2 X^\top r$, and $2 X^\top X f$:

$$\nabla_f S = -2 X^\top r + 2 X^\top X f = 0.$$

Rearranging gives the normal equations $X^\top X \hat f = X^\top r$, and when $X^\top X$ is invertible,

$$\hat f = (X^\top X)^{-1} X^\top r.$$

Geometric interpretation: Picture this period’s $N$ stock returns as a single point $r$ in $N$-dimensional space. As $f$ ranges over every value, $Xf$ sweeps out a flat $K$-dimensional surface through the origin: all the return patterns the factors can reproduce. OLS finds the point on that surface nearest to $r$. That nearest point is the fitted return $X\hat f$, and the leftover $\hat\epsilon = r - X\hat f$ is the straight-line gap from $r$ down to the surface, meeting it at a right angle. So “estimating factor returns” is just dropping this period’s returns onto the space of characteristics, with $\hat f$ the coordinates of where they land. The residual is therefore orthogonal to every exposure column, $X^\top \hat\epsilon = 0$, which is to say it has zero in-sample correlation with every factor. That is the in-sample analog of Assumption A2 of Chapter 2, and it holds by construction whether or not the population assumption does.
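As a minimal sketch of the normal equations and the right-angle property, the NumPy snippet below fits OLS on synthetic data and checks that $X^\top \hat\epsilon$ vanishes. The sizes, the seed, and `f_true` are illustrative assumptions, not MiniModel inputs.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 50, 3                                  # hypothetical: 50 stocks, 3 factors
X = rng.normal(size=(N, K))                   # synthetic exposures
f_true = np.array([0.5, -1.0, 2.0])           # synthetic factor returns
r = X @ f_true + 0.1 * rng.normal(size=N)     # returns = common part + specific noise

# Normal equations: (X'X) f_hat = X'r. Solving is preferred to forming the inverse.
f_hat = np.linalg.solve(X.T @ X, X.T @ r)
resid = r - X @ f_hat

# The residual meets the column space of X at a right angle: X' resid = 0.
ortho = X.T @ resid

# Cross-check against NumPy's least-squares routine.
f_lstsq = np.linalg.lstsq(X, r, rcond=None)[0]
```

Up to floating-point rounding, `ortho` is a vector of zeros and `f_hat` agrees with `f_lstsq`.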

OLS is only the starting point. Two problems force refinements: heteroskedasticity (Section 6.3) and exact collinearity (Section 6.4).

6.3 Weighted least squares (WLS)

OLS treats every stock as equally informative, but specific variance differs enormously across stocks. A micro-cap’s residual return is several times noisier than a mega-cap’s, and regression theory (the Gauss–Markov/Aitken theorem) says efficiency requires weighting each observation by the inverse of its error variance. Weighting stock $i$ by $w_i$, minimize $\sum_i w_i (r_i - X_i f)^2$, i.e. $S_W(f) = (r - Xf)^\top W (r - Xf)$ with $W = \mathrm{diag}(w_1, \dots, w_N)$. These $w_i$ are regression weights, how much each stock’s residual counts in the fit. Despite the shared letter, they are not portfolio holdings like the $w_p, w_b$ used in other chapters. The same calculus gives

$$\boxed{\;\hat f = (X^\top W X)^{-1} X^\top W r\;}$$

The conventional weight: $w_i$ proportional to $\sqrt{\mathrm{cap}_i}$. The ideal weight is $w_i$ proportional to $1/\sigma^2_{\epsilon_i}$. Empirically, specific variance $\sigma^2_{\epsilon_i}$ scales roughly as the inverse square root of market cap, $\mathrm{cap}_i^{-1/2}$, so its inverse is proportional to $\sqrt{\mathrm{cap}_i}$, a serviceable one-parameter approximation. It also interpolates between equal weighting (small caps would dominate the count) and cap weighting (a handful of mega-caps would dominate the fit). Some vendors skip the proxy and use estimated inverse specific variance directly, but $\sqrt{\mathrm{cap}}$ is the standard choice in practice. In the MiniModel, the normalized $\sqrt{\mathrm{cap}}$ weights run from 0.168 (AXIOM) down to 0.044 (DIGIT): AXIOM gets 3.9x DIGIT’s influence, versus 15x under cap weighting.
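A short sketch of the weighting step, assuming hypothetical market caps and synthetic exposures (none of these numbers are MiniModel data; the helper names `sqrt_cap_weights` and `wls` are made up for illustration). It shows how a 15x cap ratio shrinks to roughly a 3.9x weight ratio, and that WLS is the same as OLS on rows rescaled by $\sqrt{w_i}$.

```python
import numpy as np

def sqrt_cap_weights(cap):
    """Regression weights proportional to sqrt(market cap), normalized to sum to 1."""
    w = np.sqrt(np.asarray(cap, dtype=float))
    return w / w.sum()

def wls(X, r, w):
    """Solve (X'WX) f = X'W r with W = diag(w)."""
    XtW = X.T * w                      # same as X.T @ np.diag(w), without forming W
    return np.linalg.solve(XtW @ X, XtW @ r)

# Hypothetical data: 8 stocks, a column of ones plus one style exposure.
rng = np.random.default_rng(0)
cap = np.array([150.0, 90.0, 60.0, 40.0, 25.0, 18.0, 12.0, 10.0])
X = np.column_stack([np.ones(8), rng.normal(size=8)])
r = rng.normal(scale=2.0, size=8)

w = sqrt_cap_weights(cap)
f_hat = wls(X, r, w)

# The largest stock is 15x the smallest by cap, but only sqrt(15) ~ 3.87x by weight.
influence_ratio = w[0] / w[-1]

# Equivalent formulation: plain least squares on rows rescaled by sqrt(w_i).
sw = np.sqrt(w)
f_check = np.linalg.lstsq(X * sw[:, None], r * sw, rcond=None)[0]
```

Multiplying `X.T` by `w` avoids building the $N \times N$ diagonal matrix, which matters once $N$ is in the thousands.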

6.4 Multicollinearity and constrained regression

The market column is exactly the industry columns added together. In the MiniModel, each stock has market exposure 1 and exactly one industry dummy equal to 1. Hence

$$X_{\text{TECH}} + X_{\text{FIN}} + X_{\text{CONS}} = X_{\text{MKT}} \quad \text{(as columns)},$$

so the columns of $X$ are linearly dependent, $X^\top W X$ is singular, and $\hat f$ is not unique: adding any constant $c$ to the market factor return and subtracting $c$ from every industry factor return leaves all fitted values $Xf$ unchanged. The data cannot distinguish “the market went up 1%” from “every industry went up 1%.” (In a global model, country dummies create a second identical degeneracy.)
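The degeneracy is easy to see numerically. The sketch below uses a hypothetical six-stock universe with three industries (the membership, factor returns, and shift are invented) and shows both the rank deficiency and the invariance of fitted values under the shift described above.

```python
import numpy as np

industry = np.array([0, 0, 1, 1, 2, 2])        # hypothetical industry membership
D = np.eye(3)[industry]                        # one-hot industry dummies, 6 x 3
X = np.column_stack([np.ones(6), D])           # columns: MKT, IND0, IND1, IND2

# Four columns but only three independent directions.
rank = np.linalg.matrix_rank(X)

# Move a constant from the industries to the market: fitted values do not change.
f = np.array([1.0, 0.5, -0.2, 0.3])
shift = 0.7
f_shifted = f + shift * np.array([1.0, -1.0, -1.0, -1.0])
same_fit = np.allclose(X @ f, X @ f_shifted)
```

Here `rank` is 3 rather than 4 and `same_fit` is `True`, so no regression on this `X` alone can choose between `f` and `f_shifted`.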

The fix: impose one identifying constraint per degeneracy. The convention, chosen for interpretability, not statistical necessity, is that the cap-weighted average of industry factor returns is zero:

$$\sum_{j \in \text{ind.}} c_j\, f_j = 0, \qquad c_j = \text{cap weight of industry } j \text{ in the estimation universe}.$$

This allocates common movement to the market factor and makes each industry factor return mean that industry’s return relative to the market. Combined with the style standardization of Chapter 3 (cap-weighted style exposures are zero), the market factor return becomes the cap-weighted market return, approximately under $\sqrt{\mathrm{cap}}$ weights and exactly under cap weights.

Why not just drop an industry column, the textbook fix for collinear dummies? That identifies the system too, but it makes the dropped industry a silent reference category: every remaining industry factor then reads as a return relative to it, not to the market, and the market factor is no longer the market return. The constraint keeps all industries symmetric and the market factor interpretable.

Solving the constrained problem: the restriction-matrix method. Write the constraint as $Cf = 0$ where $C$ is $1 \times K$ (here: zeros except $c_{\text{TECH}}, c_{\text{FIN}}, c_{\text{CONS}}$ in the industry slots). Parameterize the feasible set: express one industry’s factor return in terms of the others,

$$f_{\text{CONS}} = -\frac{c_{\text{TECH}}}{c_{\text{CONS}}} f_{\text{TECH}} - \frac{c_{\text{FIN}}}{c_{\text{CONS}}} f_{\text{FIN}},$$

and collect this into a $K \times (K-1)$ restriction matrix $R$ such that every feasible $f$ equals $Rg$ for an unconstrained reduced vector $g$ ($(K-1) \times 1$). Substitute into the WLS objective and solve the now-nonsingular problem:

$$\hat g = (R^\top X^\top W X R)^{-1} R^\top X^\top W\, r, \qquad \hat f = R\, \hat g.$$

The equivalent Lagrangian route, stationarity of $S_W(f) + 2\lambda\, C f$, gives the same solution. The derivation is in the appendix. The restriction form is how many production systems implement it.

MiniModel restriction matrix: Industry cap weights are $c = (0.4553,\, 0.3252,\, 0.2195)$ for (TECH, FIN, CONS), so the CONS row of $R$ reads $f_{\text{CONS}} = -2.0742\, f_{\text{TECH}} - 1.4815\, f_{\text{FIN}}$, and all other factors pass through untouched: $R$ maps the 6 free coordinates (MKT, TECH, FIN, VALUE, MOM, SIZE) into the 7 constrained ones.
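The restriction-matrix recipe can be sketched as follows. The industry cap weights are the ones quoted above; everything else (the factor ordering, the 12-stock synthetic universe, the random exposures, weights, and returns, and the helper name `constrained_wls`) is an assumption made for illustration and is not the Chapter 3 data.

```python
import numpy as np

# Factor order assumed here: MKT, TECH, FIN, CONS, VALUE, MOM, SIZE.
c_tech, c_fin, c_cons = 0.4553, 0.3252, 0.2195     # industry cap weights from the text

# R maps the 6 free coordinates (MKT, TECH, FIN, VALUE, MOM, SIZE) to all 7 factors.
R = np.zeros((7, 6))
R[0, 0] = 1.0                          # MKT passes through
R[1, 1] = 1.0                          # TECH passes through
R[2, 2] = 1.0                          # FIN passes through
R[3, 1] = -c_tech / c_cons             # CONS row: about -2.074 on f_TECH
R[3, 2] = -c_fin / c_cons              # CONS row: about -1.4815 on f_FIN
R[4, 3] = R[5, 4] = R[6, 5] = 1.0      # VALUE, MOM, SIZE pass through

C = np.array([0.0, c_tech, c_fin, c_cons, 0.0, 0.0, 0.0])   # constraint row, C f = 0

def constrained_wls(X, r, w, R):
    """Solve the reduced problem for g, then map back to f = R g."""
    XR = X @ R
    XRtW = XR.T * w                    # R'X'W without forming diag(w)
    g_hat = np.linalg.solve(XRtW @ XR, XRtW @ r)
    return R @ g_hat

# Synthetic stand-in data (hypothetical).
rng = np.random.default_rng(1)
N = 12
industry = np.repeat([0, 1, 2], 4)                 # 4 stocks per industry
X = np.column_stack([np.ones(N), np.eye(3)[industry], rng.normal(size=(N, 3))])
w = np.sqrt(rng.uniform(1.0, 50.0, size=N))
w /= w.sum()
r = rng.normal(scale=2.0, size=N)

f_hat = constrained_wls(X, r, w, R)
constraint_value = C @ f_hat                       # zero up to rounding
```

Because every column of `R` satisfies the constraint (`C @ R` is zero), any `f = R @ g` is feasible by construction, so the solver never has to enforce the constraint explicitly.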

6.5 Factor returns are portfolio returns: the key duality

Look at the solution’s shape:

$$\hat f = \underbrace{R\,(R^\top X^\top W X R)^{-1} R^\top X^\top W}_{P,\;\; K \times N}\; r \;=\; P\,r.$$

$\hat f$ is a linear function of returns: each factor return is $\hat f_k = \sum_i P_{ki}\, r_i$, a weighted sum of stock returns. A weighted sum of stock returns is the return of a portfolio. Row $k$ of $P$ is an investable long–short portfolio whose return is the estimated factor return. These are the pure factor portfolios. With the unconstrained $P = (X^\top W X)^{-1} X^\top W$ they satisfy $PX = I_K$ exactly: unit exposure to their own factor, zero to every other. Under the identifying constraint above (the $P$ written here), that identity holds in full only on the style rows, while the market and industry rows carry the constraint’s structure. Chapter 7 is devoted to factor portfolios, and shows the regression and the portfolios are one object seen from two sides.
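To make the duality concrete, the sketch below builds $P$ for a synthetic 12-stock universe and inspects $PX$. Only the industry cap weights come from the text; the factor ordering, exposures, weights, and returns are hypothetical.

```python
import numpy as np

c = np.array([0.4553, 0.3252, 0.2195])         # TECH, FIN, CONS cap weights from the text

# Restriction matrix, assumed factor order MKT, TECH, FIN, CONS, VALUE, MOM, SIZE.
R = np.zeros((7, 6))
R[[0, 1, 2, 4, 5, 6], [0, 1, 2, 3, 4, 5]] = 1.0   # pass-through factors
R[3, 1:3] = -c[:2] / c[2]                          # CONS expressed via TECH and FIN

# Synthetic stand-in data (hypothetical, not the Chapter 3 exposures).
rng = np.random.default_rng(2)
N = 12
industry = np.repeat([0, 1, 2], 4)
X = np.column_stack([np.ones(N), np.eye(3)[industry], rng.normal(size=(N, 3))])
w = np.sqrt(rng.uniform(1.0, 50.0, size=N))
w /= w.sum()
r = rng.normal(scale=2.0, size=N)

# P = R (R'X'WXR)^{-1} R'X'W, a K x N matrix of portfolio weights.
XR = X @ R
P = R @ np.linalg.solve((XR.T * w) @ XR, XR.T * w)

f_hat = P @ r          # each factor return is the return of one row-portfolio
PX = P @ X             # exposures of each row-portfolio to each factor
```

In this sketch the style rows of `PX` match the corresponding rows of the identity matrix, and the style rows of `P` sum to zero (long–short), while the market row of `PX` shows unit market exposure together with the cap weights `c` on the industry columns.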

6.6 Robustness refinements

  • Return outliers: A stock that gains 300% on a takeover this month is one observation with outsized influence on the fitted factor returns. Standard treatments: winsorize returns cross-sectionally before the regression, or use robust (Huber-weighted) regression, iteratively downweighting observations with large standardized residuals. Either way, the goal is that no single stock’s corporate event masquerades as a factor return.
  • Thin factors: An industry with 3 members produces a factor return that is mostly those stocks’ specific returns. Mitigations: coarser industry scheme, Bayesian shrinkage of thin-industry factor returns toward a parent sector, or explicit minimum-membership rules feeding back into factor-structure design (Chapter 15).
  • Heteroskedasticity beyond size: When specific-variance estimates exist (Chapter 8), a second pass with $W = \hat\Delta^{-1}$ tightens estimates. This is a feasible Generalized Least Squares (GLS) regression. Chapter 7 shows this choice also makes the implied factor portfolios minimum-variance.
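The Huber-weighted option from the first bullet can be sketched as an iteratively reweighted WLS. This is one plausible variant, not a description of any vendor's implementation: the tuning constant `k=1.345`, the MAD-based scale, the fixed iteration count, and the choice to standardize residuals by $\sqrt{w_i}$ are all assumptions, and the data are synthetic with one planted 300-point outlier.

```python
import numpy as np

def huber_wls(X, r, w, k=1.345, n_iter=25):
    """Iteratively reweighted WLS with Huber downweighting (illustrative sketch)."""
    u = np.ones_like(r)                              # robustness multipliers in (0, 1]
    for _ in range(n_iter):
        XtW = X.T * (w * u)
        f = np.linalg.solve(XtW @ X, XtW @ r)
        z = np.sqrt(w) * (r - X @ f)                 # residuals put on a common scale
        scale = 1.4826 * np.median(np.abs(z - np.median(z)))   # MAD estimate of sigma
        u = k / np.maximum(np.abs(z) / max(scale, 1e-12), k)   # 1 inside k, k/|z| outside
    return f, u

# Synthetic cross-section: 200 stocks, a column of ones plus one style exposure.
rng = np.random.default_rng(3)
N = 200
X = np.column_stack([np.ones(N), rng.normal(size=N)])
f_true = np.array([1.0, 0.5])
r = X @ f_true + rng.normal(size=N)
r[0] += 300.0                                        # one takeover-style outlier
w = np.full(N, 1.0 / N)                              # equal base weights for simplicity

XtW = X.T * w
f_plain = np.linalg.solve(XtW @ X, XtW @ r)          # ordinary WLS, outlier included
f_robust, u = huber_wls(X, r, w)
```

In this setup the outlier's multiplier `u[0]` ends up close to zero and `f_robust` sits much nearer `f_true` than `f_plain` does. Cross-sectional winsorization of `r` before the regression is the simpler alternative named in the same bullet.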

6.7 Diagnostics

Cross-sectional $R^2$ (weighted): $R^2 = 1 - \frac{\hat\epsilon^\top W \hat\epsilon}{(r - \bar r)^\top W (r - \bar r)}$, with $\bar r$ the weighted mean return. For real monthly single-stock cross-sections, values of 0.2–0.4 are typical and fine. Most of any one stock’s monthly return is specific. The model’s job is to capture the common part. Portfolio-level explanatory power is far higher because specific returns diversify (Chapter 2). A trend change in $R^2$ matters more than its level (Chapter 14).

Factor t-statistics: From WLS theory, $\widehat{\mathrm{Var}}(\hat g) = s^2 (R^\top X^\top W X R)^{-1}$ (with $s^2$ the weighted residual variance), mapped back to the full factor vector via $R$: $\widehat{\mathrm{Var}}(\hat f) = R\, \widehat{\mathrm{Var}}(\hat g)\, R^\top$. This form is exact only when $W = \Delta^{-1}$ (the true inverse specific variances). Under the $\sqrt{\mathrm{cap}}$ approximation the weights are misspecified, so production code uses heteroskedasticity-robust standard errors instead. The t-stat of $\hat f_k$ in a single period tests whether that factor’s return is distinguishable from zero this period. The more important statistic is longitudinal: the fraction of periods with $|t_k| > 2$. A factor that clears 2 in, say, a third or more of months is earning its place. One that almost never does is a candidate for removal (Chapter 15).
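A minimal sketch of both kinds of standard error, written for a full-rank $X$ with no identifying constraint (with the constraint, $XR$ would take the place of $X$ and the covariance would be mapped back through $R$). The robust variant shown is the HC0 sandwich, chosen here only as a common textbook option; the text does not say which robust estimator production systems use. Data and the function name `wls_tstats` are hypothetical.

```python
import numpy as np

def wls_tstats(X, r, w):
    """WLS estimate with classical and HC0 (sandwich) t-statistics."""
    n, k = X.shape
    XtW = X.T * w
    A_inv = np.linalg.inv(XtW @ X)              # (X'WX)^{-1}
    f = A_inv @ (XtW @ r)
    e = r - X @ f

    s2 = np.sum(w * e**2) / (n - k)             # weighted residual variance
    cov_classic = s2 * A_inv

    meat = (X.T * (w * e) ** 2) @ X             # X' W diag(e^2) W X
    cov_hc0 = A_inv @ meat @ A_inv

    t_classic = f / np.sqrt(np.diag(cov_classic))
    t_hc0 = f / np.sqrt(np.diag(cov_hc0))
    return f, t_classic, t_hc0

# Synthetic cross-section: the first factor has a real return, the second has none.
rng = np.random.default_rng(4)
N = 500
X = np.column_stack([np.ones(N), rng.normal(size=N)])
r = X @ np.array([2.0, 0.0]) + rng.normal(size=N)
w = np.full(N, 1.0 / N)

f_hat, t_classic, t_hc0 = wls_tstats(X, r, w)
```

Collecting `t_hc0` across periods and taking `np.mean(np.abs(t_history) > 2, axis=0)` (with `t_history` a hypothetical periods-by-factors array) would give the longitudinal fraction described above.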

Residual structure: Compute residual correlations across stocks over time. Clusters of correlated residuals (by theme, ownership, supply chain) are the signature of a missing factor, the diagnostic used in Chapter 15.

6.8 Worked example: the MiniModel month-1 regression, end to end

All inputs come from Chapter 3 and the appendix. Every number is reproducible from the example source code.

Inputs: $X$ ($10 \times 7$) from Chapter 3. Month-1 returns and $\sqrt{\text{cap}}$ regression weights:

| Stock | $r$ (%) | reg. weight | Stock | $r$ (%) | reg. weight |
|---|---|---|---|---|---|
| AXIOM | +4.2 | 0.168 | FIDELIS | −0.6 | 0.107 |
| BINARY | +2.8 | 0.123 | GUARDIAN | −1.8 | 0.062 |
| CIPHER | +0.5 | 0.087 | HARVEST | +1.2 | 0.130 |
| DIGIT | +6.0 | 0.044 | INDIGO | +2.0 | 0.075 |
| EVERGREEN | +0.8 | 0.151 | JUNIPER | −0.5 | 0.053 |

Step 1, build $R$: from industry cap weights $(0.4553, 0.3252, 0.2195)$ as in Section 6.4.

Step 2, solve: $\hat g = (R^\top X^\top W X R)^{-1} R^\top X^\top W r$, then $\hat f = R \hat g$:

| Factor | $\hat f$ (%) | Interpretation |
|---|---|---|
| MKT | +1.821 | the market rose ~1.8% (the cap-weighted market return was +1.819%; the near-match is the constraint + standardization doing their job) |
| TECH | +0.768 | tech beat the market by ~0.77% |
| FIN | −1.282 | financials lagged the market by ~1.28% |
| CONS | +0.306 | consumer slightly ahead |
| VALUE | +0.548 | each +1σ of cheapness paid +0.55% |
| MOM | +1.962 | a strong momentum month: +1.96% per σ of momentum |
| SIZE | +0.046 | size was flat |

Constraint check: $0.4553(0.768) + 0.3252(-1.282) + 0.2195(0.306) = 0.000$. That passes.
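The same check takes a few lines of plain Python, using only the rounded figures printed in this section. Because the inputs are rounded to three or four decimals, the sum lands near zero rather than exactly on it.

```python
# Industry cap weights and month-1 industry factor returns (%), as printed above.
c = {"TECH": 0.4553, "FIN": 0.3252, "CONS": 0.2195}
f_hat = {"TECH": 0.768, "FIN": -1.282, "CONS": 0.306}

# Cap-weighted average of industry factor returns: should be ~0.
constraint = sum(c[k] * f_hat[k] for k in c)

# Equivalent view: the CONS return implied by the other two via the CONS row of R.
f_cons_implied = -(c["TECH"] * f_hat["TECH"] + c["FIN"] * f_hat["FIN"]) / c["CONS"]
```

The implied CONS return agrees with the tabulated +0.306 to about three decimals, which is the restriction matrix at work.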

Step 3, residuals: $\hat\epsilon = r - X\hat f$ (%):

| AXIOM | BINARY | CIPHER | DIGIT | EVERGREEN | FIDELIS | GUARDIAN | HARVEST | INDIGO | JUNIPER |
|---|---|---|---|---|---|---|---|---|---|
| −0.10 | +0.02 | +0.14 | +0.03 | +0.36 | −0.24 | −0.46 | −0.44 | +1.09 | −0.46 |

Reading: INDIGO beat what its characteristics predicted by 1.09%, genuine stock-specific news. AXIOM’s big +4.2% month, by contrast, was almost entirely characteristic-driven (tech, high momentum): residual −0.10%. This distinction, which part of a return was “earned” by characteristics, is the entire basis of performance attribution (Chapter 10).

Step 4, diagnostics: Weighted $R^2$ (centered) = 0.956, far above realistic values because the toy has 10 observations and 6 free parameters (7 factors less the 1 constraint). In production cross-sections ($N$ in the thousands), expect 0.2–0.4 monthly.
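The weighted $R^2$ can be recomputed from the month-1 returns, regression weights, and residuals tabulated in this section. The sketch below uses those printed (rounded) values, so it should land close to, but not necessarily exactly on, the reported 0.956. The function name `weighted_r2` is an illustrative choice.

```python
import numpy as np

# Order: AXIOM, BINARY, CIPHER, DIGIT, EVERGREEN, FIDELIS, GUARDIAN, HARVEST, INDIGO, JUNIPER
r = np.array([4.2, 2.8, 0.5, 6.0, 0.8, -0.6, -1.8, 1.2, 2.0, -0.5])            # returns (%)
w = np.array([0.168, 0.123, 0.087, 0.044, 0.151, 0.107, 0.062, 0.130, 0.075, 0.053])
resid = np.array([-0.10, 0.02, 0.14, 0.03, 0.36, -0.24, -0.46, -0.44, 1.09, -0.46])

def weighted_r2(r, resid, w):
    """Centered weighted R^2: 1 - e'We / (r - rbar)'W(r - rbar)."""
    r_bar = np.sum(w * r) / np.sum(w)            # weighted mean return
    return 1.0 - np.sum(w * resid**2) / np.sum(w * (r - r_bar) ** 2)

r2 = weighted_r2(r, resid, w)

# Side check: the weighted residuals should sum to roughly zero, since the
# market column (all ones) is among the regressors.
weighted_resid_sum = np.sum(w * resid)
```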

What we now have: Repeating this regression every period produces the factor return history $\{\hat f_t\}$, the raw material for the factor covariance matrix, and a residual history per stock, the raw material for specific risk. That assembly is Chapter 8. First, Chapter 7 examines the portfolios this regression has implicitly constructed.

6.9 Summary

  • One cross-sectional regression per period recovers the factor returns, with the stocks as observations: $\hat f = (X^\top W X)^{-1} X^\top W r$ (identified by the constraint in the third bullet).
  • WLS, not OLS: Specific variance varies sharply across stocks, so weight each by $w_i$, proportional to $\sqrt{\mathrm{cap}_i}$, a one-parameter proxy for inverse specific variance that also keeps mega-caps from dominating the fit.
  • Constrain, don’t drop: The market and industry columns are exactly collinear. A cap-weighted “industry returns sum to zero” constraint identifies the system and makes the market factor return the cap-weighted market return.
  • $\hat f = Pr$: every factor return is the realized return of an investable long–short portfolio (Chapter 7). Run every period, the regression yields the $\{\hat f_t\}$ and residual histories that build $F$ and $\Delta$ (Chapter 8).