Estimating Factor Returns: The Cross-Sectional Regression

This chapter is the mathematical core of the series: one regression that turns measured exposures into factor returns. The fundamental model’s exposures are measured (Chapter 3) on a curated universe (Chapter 5). What remains unknown each period is the vector of factor returns, which is estimated by one regression per period across stocks.

6.1 Setup

At the end of period $t$, we observe returns $r_t$ ($N \times 1$) realized over the period, and exposures $X \equiv X_{t-1}$ ($N \times K$) known at the period’s start. The model

$r_t = X_{t-1} f_t + \epsilon_t$

is, for this one period, a linear regression in which the stocks are the observations: $N$ data points, $K$ coefficients. The coefficient on the value column is “the return in this period to a unit of value exposure”, the value factor return. We estimate one such regression every period, building up the time series of factor returns $\{\hat f_t\}$ that Chapter 8 turns into a covariance matrix. Within this chapter the subscripts are dropped: $r = Xf + \epsilon$ .

6.2 Ordinary least squares (OLS)

Derivation from first principles. Each stock’s residual is $r_i - X_i f$ , the part of its return the factors miss. OLS picks $\hat f$ to minimize the sum of squared residuals $S(f)$ :

$S(f) = \sum_i (r_i - X_i f)^2 = (r - Xf)^\top (r - Xf).$

As a function of $f$ this is a quadratic, an upward-curving bowl with one lowest point, so the minimum sits wherever the gradient is zero. Expanding the product $(r - Xf)^\top (r - Xf)$ gives

$S(f) = r^\top r - 2 f^\top X^\top r + f^\top X^\top X f,$

and its three terms differentiate to $0$ , $-2 X^\top r$ , and $2 X^\top X f$ :

$\nabla_f S = -2 X^\top r + 2 X^\top X f = 0.$

Rearranging gives the normal equations $X^\top X \hat f = X^\top r$ , and when $X^\top X$ is invertible,

$\hat f = (X^\top X)^{-1} X^\top r.$

Geometric interpretation: Picture this period’s $N$ stock returns as a single point $r$ in $N$-dimensional space. As $f$ ranges over every value, $Xf$ sweeps out a flat $K$-dimensional surface through the origin: all the return patterns the factors can reproduce. OLS finds the point on that surface nearest to $r$. That nearest point is the fitted return $X\hat f$, and the leftover $\hat\epsilon = r - X\hat f$ is the straight-line gap from $r$ down to the surface, meeting it at a right angle: $X^\top \hat\epsilon = 0$, which is the normal equations restated. So “estimating factor returns” is just projecting this period’s returns onto the space of characteristics, with $\hat f$ the coordinates of where they land. Because the residual is orthogonal to every column of $X$, it has zero in-sample correlation with every factor exposure (given a constant column such as the market, which also makes the residual mean zero). That is the in-sample analog of Assumption A2 of Chapter 2, and it holds by construction whether or not the population assumption does.
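
The derivation and the right-angle property can be checked numerically. The following is a minimal sketch on synthetic data: the sizes, the seed, and the names `f_true`, `f_hat`, and `resid` are illustrative assumptions, not part of the MiniModel.

```python
import numpy as np

# Synthetic, hypothetical cross-section: N stocks, K factor exposures.
rng = np.random.default_rng(0)
N, K = 200, 3
X = np.column_stack([np.ones(N), rng.standard_normal((N, K - 1))])
f_true = np.array([0.010, -0.020, 0.005])
r = X @ f_true + 0.05 * rng.standard_normal(N)

# Normal equations: (X'X) f = X'r. Solve the linear system rather than
# forming the inverse explicitly.
f_hat = np.linalg.solve(X.T @ X, X.T @ r)

# Cross-check against NumPy's least-squares routine.
f_lstsq, *_ = np.linalg.lstsq(X, r, rcond=None)

# The residual is orthogonal to every column of X (up to rounding error).
resid = r - X @ f_hat
print("f_hat:", f_hat)
print("max |X' resid|:", np.abs(X.T @ resid).max())
```

Solving the normal equations with `np.linalg.solve` avoids an explicit matrix inverse; `np.linalg.lstsq` is the more robust choice when $X^\top X$ is poorly conditioned.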

OLS is only the starting point. Two problems force refinements: heteroskedasticity (Section 6.3) and exact collinearity (Section 6.4).

6.3 Weighted least squares (WLS)

OLS treats every stock as equally informative, but specific variance differs enormously across stocks. A micro-cap’s residual return is several times noisier than a mega-cap’s, and regression theory (the Gauss–Markov/Aitken theorem) says efficiency requires weighting each observation by the inverse of its error variance. Weighting stock $i$ by $w_i$ , minimize $\sum_i w_i (r_i - X_i f)^2$ , i.e. $S_W(f) = (r - Xf)^\top W (r - Xf)$ with $W = \mathrm{diag}(w_1, \dots, w_N)$ . These $w_i$ are regression weights, how much each stock’s residual counts in the fit. Despite the shared letter, they are not portfolio holdings like the $w_p, w_b$ used in other chapters. The same calculus gives

$\boxed{\;\hat f = (X^\top W X)^{-1} X^\top W r\;}$

The conventional weight: $w_i$ proportional to $\sqrt{\mathrm{cap}_i}$ . The ideal weight is $w_i$ proportional to $1/\sigma^2_{\epsilon_i}$ . Empirically, specific variance $\sigma^2_{\epsilon_i}$ scales roughly as the inverse square root of market cap, $\mathrm{cap}_i^{-1/2}$ , so its inverse is proportional to $\sqrt{\mathrm{cap}_i}$ , a serviceable one-parameter approximation. It also interpolates between equal weighting (small caps would dominate the count) and cap weighting (a handful of mega-caps would dominate the fit). Some vendors skip the proxy and use estimated inverse specific variance directly, but $\sqrt{\mathrm{cap}}$ is the standard choice in practice. In the MiniModel, the normalized $\sqrt{\mathrm{cap}}$ weights run from 0.168 (AXIOM) down to 0.044 (DIGIT): AXIOM gets 3.9x DIGIT’s influence, versus 15x under cap weighting.
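
A short sketch of the weighting scheme and the boxed estimator follows. The helper names `sqrt_cap_weights` and `wls` and all market caps are hypothetical; the only facts taken from the text are the $\sqrt{\mathrm{cap}}$ rule and the WLS formula.

```python
import numpy as np

def sqrt_cap_weights(cap):
    """Regression weights proportional to sqrt(market cap), normalized to sum to 1."""
    w = np.sqrt(np.asarray(cap, dtype=float))
    return w / w.sum()

def wls(X, r, w):
    """Solve (X'WX) f = X'W r with W = diag(w), without building the N x N matrix."""
    Xw = X * w[:, None]                 # each row of X scaled by its weight
    return np.linalg.solve(X.T @ Xw, Xw.T @ r)

# Hypothetical caps: the largest is 15x the smallest.
cap = np.array([150.0, 60.0, 25.0, 10.0])
w = sqrt_cap_weights(cap)
ratio = w[0] / w[-1]                    # sqrt(15), about 3.87, versus 15 under cap weights

# Synthetic cross-section to exercise the solver.
rng = np.random.default_rng(1)
N = 50
caps = np.exp(rng.normal(8.0, 1.5, size=N))
X = np.column_stack([np.ones(N), rng.standard_normal(N)])
r = X @ np.array([0.01, 0.02]) + 0.03 * rng.standard_normal(N)
f_wls = wls(X, r, sqrt_cap_weights(caps))

# WLS is OLS on rows rescaled by sqrt(w_i); use that as a cross-check.
sw = np.sqrt(sqrt_cap_weights(caps))
f_check, *_ = np.linalg.lstsq(X * sw[:, None], r * sw, rcond=None)
```

Normalizing the weights changes nothing in $\hat f$, since any common scale cancels between $(X^\top W X)^{-1}$ and $X^\top W r$; it only makes the weights easier to read.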

6.4 Multicollinearity and constrained regression

The market column is exactly the industry columns added together. In the MiniModel, each stock has market exposure 1 and exactly one industry dummy equal to 1. Hence

$X_{\text{TECH}} + X_{\text{FIN}} + X_{\text{CONS}} = X_{\text{MKT}} \quad \text{(as columns)},$

so the columns of $X$ are linearly dependent, $X^\top W X$ is singular, and $\hat f$ is not unique: adding any constant $c$ to the market factor return and subtracting $c$ from every industry factor return leaves all fitted values $X f$ unchanged. The data cannot distinguish “the market went up 1%” from “every industry went up 1%.” (In a global model, country dummies create a second identical degeneracy.)
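
The degeneracy is easy to see on a toy design. The sketch below assumes a hypothetical six-stock universe with three industries (it is not the MiniModel) and shows that two different factor-return vectors give identical fitted values.

```python
import numpy as np

# Hypothetical universe: 6 stocks, 3 industries, market column of ones.
industry = np.array([0, 0, 1, 1, 2, 2])
D = np.eye(3)[industry]                  # N x 3 industry dummies
X = np.column_stack([np.ones(6), D])     # columns: MKT, IND0, IND1, IND2

# Four columns but only rank 3: the dummies sum to the market column.
print(np.linalg.matrix_rank(X))

# Shift the market up by c and every industry down by c: fitted values are unchanged.
f = np.array([0.02, 0.01, -0.01, 0.00])
c = 0.01
f_alt = f + c * np.array([1.0, -1.0, -1.0, -1.0])
print(np.allclose(X @ f, X @ f_alt))
```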

The fix: impose one identifying constraint per degeneracy. The convention, chosen for interpretability, not statistical necessity, is that the cap-weighted average of industry factor returns is zero:

$\sum_{j \in \text{ind.}} c_j\, f_j = 0, \qquad c_j = \text{cap weight of industry } j \text{ in the estimation universe}.$

This allocates common movement to the market factor and makes each industry factor return mean that industry’s return relative to the market. Combined with the style standardization of Chapter 3 (cap-weighted style exposures are zero), the market factor return becomes the cap-weighted market return, approximately under $\sqrt{\mathrm{cap}}$ weights and exactly under cap weights.

Why not just drop an industry column, the textbook fix for collinear dummies? That identifies the system too, but it makes the dropped industry a silent reference category: every remaining industry factor then reads as a return relative to it, not to the market, and the market factor is no longer the market return. The constraint keeps all industries symmetric and the market factor interpretable.

Solving the constrained problem: the restriction-matrix method. Write the constraint as $C f = 0$ where $C$ is $1 \times K$ (here: zeros except $c_{\text{TECH}}, c_{\text{FIN}}, c_{\text{CONS}}$ in the industry slots). Parameterize the feasible set: express one industry’s factor return in terms of the others,

$f_{\text{CONS}} = -\frac{c_{\text{TECH}}}{c_{\text{CONS}}} f_{\text{TECH}} - \frac{c_{\text{FIN}}}{c_{\text{CONS}}} f_{\text{FIN}},$

and collect this into a $K \times (K-1)$ restriction matrix $R$ such that every feasible $f$ equals $Rg$ for an unconstrained reduced vector $g$ ( $(K-1) \times 1$ ). Substitute into the WLS objective and solve the now-nonsingular problem:

$\hat g = (R^\top X^\top W X R)^{-1} R^\top X^\top W\, r, \qquad \hat f = R\, \hat g.$

The equivalent Lagrangian route, stationarity of $S_W(f) + 2\lambda\, C f$ , gives the same solution. The derivation is in the appendix. The restriction form is how many production systems implement it.

MiniModel restriction matrix: Industry cap weights are $c = (0.4553,\, 0.3252,\, 0.2195)$ for (TECH, FIN, CONS), so the CONS row of $R$ reads $f_{\text{CONS}} = -2.0742\, f_{\text{TECH}} - 1.4815\, f_{\text{FIN}}$ , and all other factors pass through untouched: $R$ maps the 6 free coordinates (MKT, TECH, FIN, VALUE, MOM, SIZE) into the 7 constrained ones.
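
The restriction-matrix method, with the Lagrangian route as a cross-check, can be sketched as below. The industry cap weights are the ones quoted above; everything else is an assumption made for illustration: the factor ordering, the helper names `restriction_matrix` and `constrained_wls`, and the synthetic exposures, weights, and returns (the MiniModel's $X$ from Chapter 3 is not reproduced here).

```python
import numpy as np

FACTORS = ["MKT", "TECH", "FIN", "CONS", "VALUE", "MOM", "SIZE"]   # assumed ordering
CAP_W = {"TECH": 0.4553, "FIN": 0.3252, "CONS": 0.2195}            # from the text

def restriction_matrix(factors, cap_w, solved="CONS"):
    """K x (K-1) matrix R with f = R g; `solved` is the industry eliminated."""
    free = [name for name in factors if name != solved]
    R = np.zeros((len(factors), len(free)))
    for row, name in enumerate(factors):
        if name == solved:
            for ind, c in cap_w.items():
                if ind != solved:
                    R[row, free.index(ind)] = -c / cap_w[solved]
        else:
            R[row, free.index(name)] = 1.0
    return R, free

def constrained_wls(X, r, w, R):
    """g = (R'X'WXR)^{-1} R'X'W r, then f = R g."""
    XR = X @ R
    XRw = XR * w[:, None]
    g = np.linalg.solve(XR.T @ XRw, XRw.T @ r)
    return R @ g

R, free = restriction_matrix(FACTORS, CAP_W)

# Synthetic 10-stock universe (NOT the MiniModel exposures).
rng = np.random.default_rng(2)
industry = np.array([0, 0, 0, 0, 1, 1, 1, 2, 2, 2])
X = np.column_stack([np.ones(10), np.eye(3)[industry],
                     rng.standard_normal((10, 3))])
w = rng.uniform(0.5, 2.0, size=10)
w /= w.sum()
r = 0.03 * rng.standard_normal(10)

f_hat = constrained_wls(X, r, w, R)

# Lagrangian cross-check: solve [[X'WX, C'], [C, 0]] [f; lam] = [X'W r; 0].
C = np.array([[0.0, 0.4553, 0.3252, 0.2195, 0.0, 0.0, 0.0]])
A = X.T @ (X * w[:, None])
kkt = np.block([[A, C.T], [C, np.zeros((1, 1))]])
rhs = np.concatenate([X.T @ (w * r), [0.0]])
f_kkt = np.linalg.solve(kkt, rhs)[:7]
```

Here `C @ f_hat` is zero up to rounding and `f_hat` agrees with `f_kkt`, which is the equivalence claimed above. With the four-decimal cap weights the CONS row of `R` comes out near $-2.074$ and $-1.482$; the last digit can differ from the values quoted in the text, which were presumably computed from unrounded weights.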

6.5 Factor returns are portfolio returns: the key duality

Look at the solution’s shape:

$\hat f = \underbrace{R\,(R^\top X^\top W X R)^{-1} R^\top X^\top W}_{P,\;\; K \times N}\; r \;=\; P\,r.$

$\hat f$ is a linear function of returns: each factor return is $\hat f_k = \sum_i P_{ki}\, r_i$ , a weighted sum of stock returns. A weighted sum of stock returns is the return of a portfolio. Row $k$ of $P$ is an investable long–short portfolio whose return is the estimated factor return. These are the pure factor portfolios. With the unconstrained $P = (X^\top W X)^{-1} X^\top W$ they satisfy $PX = I_K$ exactly: unit exposure to their own factor, zero to every other. Under the identifying constraint above (the $P$ written here), that identity holds in full only on the style rows, while the market and industry rows carry the constraint’s structure. Chapter 7 is devoted to factor portfolios, and shows the regression and the portfolios are one object seen from two sides.
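
To make the duality concrete, the sketch below builds $P$ for a small hypothetical universe (eight stocks, market, three industries, one style; all exposures and weights are synthetic) and inspects $PX$. Only the industry cap weights are taken from the text.

```python
import numpy as np

# Hypothetical 8-stock universe: market, three industries, one style.
rng = np.random.default_rng(3)
industry = np.array([0, 0, 0, 1, 1, 1, 2, 2])
style = rng.standard_normal(8)
X = np.column_stack([np.ones(8), np.eye(3)[industry], style])
w = rng.uniform(0.5, 2.0, size=8)
c = np.array([0.4553, 0.3252, 0.2195])          # industry cap weights from the text

# Restriction matrix: the third industry is solved from the other two.
R = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 0.0],
              [0.0, -c[0] / c[2], -c[1] / c[2], 0.0],
              [0.0, 0.0, 0.0, 1.0]])

XR = X @ R
XRw = XR * w[:, None]
P = R @ np.linalg.solve(XR.T @ XRw, XRw.T)      # K x N factor-portfolio weights

r = 0.03 * rng.standard_normal(8)
f_hat = P @ r                                   # factor returns as portfolio returns

exposures = P @ X                               # K x K: exposure of each portfolio to each factor
print(np.round(exposures, 3))
```

In the printed matrix the style row is a unit vector, so the style portfolio has unit exposure to its own factor and zero to the others. The market and industry rows are not unit vectors, which is the constraint's structure showing through.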

6.6 Practical refinements

Return outliers: A stock that gains 300% on a takeover this month is one observation with outsized influence on the fitted factor returns. Standard treatments: winsorize returns cross-sectionally before the regression, or use robust (Huber-weighted) regression, iteratively downweighting observations with large standardized residuals. Either way, the goal is that no single stock’s corporate event masquerades as a factor return.
Thin factors: An industry with 3 members produces a factor return that is mostly those stocks’ specific returns. Mitigations: coarser industry scheme, Bayesian shrinkage of thin-industry factor returns toward a parent sector, or explicit minimum-membership rules feeding back into factor-structure design (Chapter 15).
Heteroskedasticity beyond size: When specific-variance estimates exist (Chapter 8), a second pass with $W = \hat\Delta^{-1}$ tightens estimates. This is a feasible Generalized Least Squares (GLS) regression. Chapter 7 shows this choice also makes the implied factor portfolios minimum-variance.
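
The two outlier treatments can be sketched as follows. This is an illustration under stated assumptions, not a production recipe: the clipping threshold of 5 robust standard deviations, the Huber constant 1.345, and the single MAD-based residual scale are all choices made here, and a real system would more plausibly standardize each residual by that stock's own specific volatility. The function names and the synthetic data are hypothetical.

```python
import numpy as np

def winsorize_mad(r, n_mad=5.0):
    """Clip returns to median +/- n_mad robust standard deviations."""
    med = np.median(r)
    scale = 1.4826 * np.median(np.abs(r - med))   # MAD scaled to a normal sigma
    return np.clip(r, med - n_mad * scale, med + n_mad * scale)

def huber_wls(X, r, w, k=1.345, n_iter=25):
    """Iteratively reweighted WLS with Huber weights on standardized residuals."""
    u = np.ones(len(r))                            # robustness multipliers, start at 1
    for _ in range(n_iter):
        Xw = X * (w * u)[:, None]
        f = np.linalg.solve(X.T @ Xw, Xw.T @ r)
        e = r - X @ f
        scale = 1.4826 * np.median(np.abs(e - np.median(e)))
        z = np.abs(e) / max(scale, 1e-12)
        u = np.minimum(1.0, k / np.maximum(z, 1e-12))   # downweight large |z| only
    return f, u

# Synthetic cross-section with one takeover-style outlier.
rng = np.random.default_rng(4)
N = 300
X = np.column_stack([np.ones(N), rng.standard_normal(N)])
f_true = np.array([0.01, 0.02])
r = X @ f_true + 0.05 * rng.standard_normal(N)
X[0, 1] = 3.0        # a high-exposure stock ...
r[0] = 3.0           # ... that gains 300% this month
w = np.full(N, 1.0 / N)

Xw = X * w[:, None]
f_plain = np.linalg.solve(X.T @ Xw, Xw.T @ r)     # plain WLS, pulled by the outlier
f_robust, u = huber_wls(X, r, w)                  # outlier gets a small multiplier u[0]
r_wins = winsorize_mad(r)                         # alternative: clip before regressing
```

In this setup the plain estimate of the style factor return is dragged upward by the single stock, while the Huber fit assigns it a multiplier well below 1 and stays close to `f_true`.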

6.7 Diagnostics

Cross-sectional $R^2$ (weighted): $R^2 = 1 - \frac{\hat\epsilon^\top W \hat\epsilon}{(r - \bar r)^\top W (r - \bar r)}$ , with $\bar r$ the weighted mean return. For real monthly single-stock cross-sections, values of 0.2–0.4 are typical and fine. Most of any one stock’s monthly return is specific. The model’s job is to capture the common part. Portfolio-level explanatory power is far higher because specific returns diversify (Chapter 2). A trend change in $R^2$ matters more than its level (Chapter 14).

Factor t-statistics: From WLS theory, $\widehat{\mathrm{Var}}(\hat g) = s^2 (R^\top X^\top W X R)^{-1}$ (with $s^2$ the weighted residual variance), mapped back via $R$ . This form is exact only when $W = \Delta^{-1}$ (the true inverse specific variances). Under the $\sqrt{\mathrm{cap}}$ approximation the weights are misspecified, so production code uses heteroskedasticity-robust standard errors instead. The t-stat of $\hat f_k$ in a single period tests whether that factor’s return is distinguishable from zero this period. The more important statistic is longitudinal: the fraction of periods with $|t_k| > 2$ . A factor that clears 2 in, say, a third or more of months is earning its place. One that almost never does is a candidate for removal (Chapter 15).
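
A sketch of both standard-error forms and of the longitudinal statistic follows. The text does not say which robust estimator production systems use; the HC0 (White) sandwich below is one common choice and is an assumption here, as are the function name `wls_tstats`, the degrees-of-freedom correction $N - K$ in $s^2$, and the synthetic data.

```python
import numpy as np

def wls_tstats(Z, r, w):
    """WLS estimate with classical and HC0 (White) robust t-statistics.

    Z is a full-column-rank design, e.g. the reduced design XR of Section 6.4.
    """
    n, k = Z.shape
    Zw = Z * w[:, None]
    A_inv = np.linalg.inv(Z.T @ Zw)                 # (Z'WZ)^{-1}
    g = A_inv @ (Zw.T @ r)
    e = r - Z @ g
    s2 = np.sum(w * e**2) / (n - k)                 # weighted residual variance
    se_classical = np.sqrt(s2 * np.diag(A_inv))
    meat = Zw.T @ (Zw * (e**2)[:, None])            # Z'W diag(e^2) W Z
    se_robust = np.sqrt(np.diag(A_inv @ meat @ A_inv))
    return g, g / se_classical, g / se_robust

# Synthetic history: factor 0 has a sizeable return each month, factor 1 has none.
rng = np.random.default_rng(5)
T, N = 36, 500
t_hist = np.empty((T, 2))
for t in range(T):
    Z = np.column_stack([np.ones(N), rng.standard_normal(N)])
    w = rng.uniform(0.5, 2.0, size=N)
    noise = 0.05 * rng.standard_normal(N) / np.sqrt(w)   # variance proportional to 1/w
    r = Z @ np.array([0.02, 0.0]) + noise
    _, _, t_rob = wls_tstats(Z, r, w)
    t_hist[t] = t_rob

# Fraction of periods in which each factor clears |t| > 2.
frac_significant = np.mean(np.abs(t_hist) > 2.0, axis=0)
```

For the constrained model, the covariance of $\hat g$ computed this way is mapped back to $\hat f$ as $R\,\widehat{\mathrm{Var}}(\hat g)\,R^\top$.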

Residual structure: Compute residual correlations across stocks over time. Clusters of correlated residuals (by theme, ownership, supply chain) are the signature of a missing factor, the diagnostic used in Chapter 15.

6.8 Worked example: the MiniModel month-1 regression, end to end

All inputs from Chapter 3 and the appendix. Every number is reproducible from the example source code.

Inputs: $X$ ($10 \times 7$) from Chapter 3. Month-1 returns and $\sqrt{\text{cap}}$ regression weights:

Stock	$r$ (%)	reg. weight	Stock	$r$ (%)	reg. weight
AXIOM	+4.2	0.168	FIDELIS	−0.6	0.107
BINARY	+2.8	0.123	GUARDIAN	−1.8	0.062
CIPHER	+0.5	0.087	HARVEST	+1.2	0.130
DIGIT	+6.0	0.044	INDIGO	+2.0	0.075
EVERGREEN	+0.8	0.151	JUNIPER	−0.5	0.053

Step 1, build $R$ : from industry cap weights $(0.4553, 0.3252, 0.2195)$ as in Section 6.4.

Step 2, solve: $\hat g = (R^\top X^\top W X R)^{-1} R^\top X^\top W r$ , then $\hat f = R \hat g$ :

Factor	$\hat f$ (%)	Interpretation
MKT	+1.821	the market rose ~1.8% (the cap-weighted market return was +1.819%; the near-match is the constraint and the style standardization doing their job)
TECH	+0.768	tech beat the market by ~0.77%
FIN	−1.282	financials lagged the market by ~1.28%
CONS	+0.306	consumer slightly ahead
VALUE	+0.548	each +1σ of cheapness paid +0.55%
MOM	+1.962	a strong momentum month: +1.96% per σ of momentum
SIZE	+0.046	size was flat

Constraint check: $0.4553(0.768) + 0.3252(-1.282) + 0.2195(0.306) = -0.00007 \approx 0.000$. The constraint holds to the precision of the rounded inputs.
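
The same check in a few lines of Python, using only the cap weights and factor returns printed above. The variable names are illustrative.

```python
# Industry cap weights and estimated industry factor returns (%), from the text.
cap_w = {"TECH": 0.4553, "FIN": 0.3252, "CONS": 0.2195}
f_hat = {"TECH": 0.768, "FIN": -1.282, "CONS": 0.306}

# Cap-weighted average of industry factor returns: should be zero up to rounding.
constraint = sum(cap_w[k] * f_hat[k] for k in cap_w)

# CONS implied by the other two industries through the constraint.
cons_implied = -(cap_w["TECH"] * f_hat["TECH"]
                 + cap_w["FIN"] * f_hat["FIN"]) / cap_w["CONS"]

print(f"constraint = {constraint:.5f}, implied CONS = {cons_implied:.3f}")
```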

Step 3, residuals: $\hat\epsilon = r - X\hat f$ (%):

AXIOM	BINARY	CIPHER	DIGIT	EVERGREEN	FIDELIS	GUARDIAN	HARVEST	INDIGO	JUNIPER
−0.10	+0.02	+0.14	+0.03	+0.36	−0.24	−0.46	−0.44	+1.09	−0.46

Reading: INDIGO beat what its characteristics predicted by 1.09%, genuine stock-specific news. AXIOM’s big +4.2% month, by contrast, was almost entirely characteristic-driven (tech, high momentum): residual −0.10%. This distinction, which part of a return was “earned” by characteristics, is the entire basis of performance attribution (Chapter 10).

Step 4, diagnostics: Weighted $R^2$ (centered) = 0.956, far above realistic values because the toy has 10 observations and 6 free parameters (7 factors less the 1 constraint). In production cross-sections ( $N$ in the thousands), expect 0.2–0.4 monthly.
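
The weighted $R^2$ can be recomputed from the printed returns, weights, and residuals alone. The sketch below does that, with `weighted_r2` a hypothetical helper name. Because the inputs are rounded to the precision shown in the tables, the result should land close to the reported 0.956 rather than match it digit for digit.

```python
import numpy as np

# Month-1 returns (%), regression weights, and residuals (%) as printed above,
# in the order AXIOM, BINARY, CIPHER, DIGIT, EVERGREEN, FIDELIS, GUARDIAN,
# HARVEST, INDIGO, JUNIPER.
r = np.array([4.2, 2.8, 0.5, 6.0, 0.8, -0.6, -1.8, 1.2, 2.0, -0.5])
w = np.array([0.168, 0.123, 0.087, 0.044, 0.151, 0.107, 0.062, 0.130, 0.075, 0.053])
resid = np.array([-0.10, 0.02, 0.14, 0.03, 0.36, -0.24, -0.46, -0.44, 1.09, -0.46])

def weighted_r2(r, resid, w):
    """Centered weighted R^2: 1 - (e'We) / ((r - rbar)'W(r - rbar))."""
    r_bar = np.sum(w * r) / np.sum(w)            # weighted mean return
    return 1.0 - np.sum(w * resid**2) / np.sum(w * (r - r_bar) ** 2)

r2 = weighted_r2(r, resid, w)

# With a market column in the regression, the weighted residual mean is ~0.
weighted_resid_mean = np.sum(w * resid) / np.sum(w)
print(round(r2, 3), round(weighted_resid_mean, 4))
```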

What we now have: Repeating this regression every period produces the factor return history $\{\hat f_t\}$ , the raw material for the factor covariance matrix, and a residual history per stock, the raw material for specific risk. That assembly is Chapter 8. First, Chapter 7 examines the portfolios this regression has implicitly constructed.
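
The period-by-period loop can be sketched as below. Everything here is a simplifying assumption: the data are synthetic, exposures are held fixed across periods instead of being re-measured, the design has no industry block (so no constraint is needed), and the plain sample covariance stands in for whatever estimator Chapter 8 develops.

```python
import numpy as np

def cross_sectional_wls(X, r, w):
    """One period's regression: returns factor returns and residuals."""
    Xw = X * w[:, None]
    f = np.linalg.solve(X.T @ Xw, Xw.T @ r)
    return f, r - X @ f

rng = np.random.default_rng(6)
T, N, K = 60, 100, 3
f_hist = np.empty((T, K))        # factor return history, one row per period
resid_hist = np.empty((T, N))    # residual history, one column per stock

X_prev = np.column_stack([np.ones(N), rng.standard_normal((N, K - 1))])
w = rng.uniform(0.5, 2.0, size=N)

for t in range(T):
    # Simulate this period's returns from start-of-period exposures.
    f_t = 0.02 * rng.standard_normal(K)
    r_t = X_prev @ f_t + 0.05 * rng.standard_normal(N)
    f_hist[t], resid_hist[t] = cross_sectional_wls(X_prev, r_t, w)
    # In practice the exposures for the next period would be re-measured here.

F_sample = np.cov(f_hist, rowvar=False)        # K x K sample factor covariance
spec_var = resid_hist.var(axis=0, ddof=1)      # per-stock residual variance
```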

6.9 Summary

One cross-sectional regression per period recovers the factor returns, with the stocks as observations: $\hat f = (X^\top W X)^{-1} X^\top W r$ (identified by the constraint in the third bullet).
WLS, not OLS: Specific variance varies sharply across stocks, so weight each by $w_i$ , proportional to $\sqrt{\mathrm{cap}_i}$ , a one-parameter proxy for inverse specific variance that also keeps mega-caps from dominating the fit.
Constrain, don’t drop: The market and industry columns are exactly collinear. A cap-weighted “industry returns sum to zero” constraint identifies the system and makes the market factor return the cap-weighted market return.
$\hat f = Pr$ : every factor return is the realized return of an investable long–short portfolio (Chapter 7). Run every period, the regression yields the $\{\hat f_t\}$ and residual histories that build $F$ and $\Delta$ (Chapter 8).