# Chapter 6: Estimating Factor Returns: The Cross-Sectional Regression

_Previous: [Chapter 5: Estimation Universe and Coverage Universe](05-universes.md)_

---

This chapter is the mathematical core of the series: one regression that turns measured exposures into factor returns. The fundamental model's exposures are measured ([Chapter 3](03-factors-and-exposures.md)) on a curated universe ([Chapter 5](05-universes.md)). What remains unknown each period is the vector of factor returns, which is estimated by one regression per period, run _across stocks_.

## 6.1 Setup

At the end of period $t$, we observe: returns $r_t$ ($N \times 1$) realized over the period, and exposures $X \equiv X_{t-1}$ ($N \times K$) known at the period's start. The model

$$r_t = X_{t-1} f_t + \epsilon_t$$

is, for this one period, a linear regression in which the _stocks are the observations_: $N$ data points, $K$ coefficients. The coefficient on the value column is "the return in this period to a unit of value exposure", the value factor return. We estimate one such regression every period, building up the time series of factor returns $\{\hat f_t\}$ that [Chapter 8](08-risk-model-assembly.md) turns into a covariance matrix. Within this chapter the subscripts are dropped: $r = Xf + \epsilon$.

## 6.2 Ordinary least squares (OLS)

**Derivation from first principles.** Each stock's residual is $r_i - X_i f$, the part of its return the factors miss. OLS picks $\hat f$ to minimize the sum of squared residuals $S(f)$:

$$S(f) = \sum_i (r_i - X_i f)^2 = (r - Xf)^\top (r - Xf).$$

As a function of $f$ this is a quadratic, an upward-curving bowl with one lowest point, so the minimum sits wherever the gradient is zero. Expanding the product $(r - Xf)^\top (r - Xf)$ gives

$$S(f) = r^\top r - 2 f^\top X^\top r + f^\top X^\top X f,$$

and its three terms differentiate to $0$, $-2 X^\top r$, and $2 X^\top X f$:

$$\nabla_f S = -2 X^\top r + 2 X^\top X f = 0.$$

Rearranging gives the normal equations $X^\top X \hat f = X^\top r$, and when $X^\top X$ is invertible,

$$\hat f = (X^\top X)^{-1} X^\top r.$$

**Geometric interpretation:** Picture this period's $N$ stock returns as a single point $r$ in $N$-dimensional space. As $f$ ranges over every value, $Xf$ sweeps out a flat $K$-dimensional surface through the origin: all the return patterns the factors can reproduce. OLS finds the point on that surface nearest to $r$. That nearest point is the fitted return $X\hat f$, and the leftover $\hat\epsilon = r - X\hat f$ is the straight-line gap from $r$ down to the surface, meeting it at a right angle. So "estimating factor returns" is just _dropping this period's returns onto the space of characteristics_, with $\hat f$ the coordinates of where they land. Because the gap is perpendicular to the surface, the residual is orthogonal to every column of $X$ ($X^\top \hat\epsilon = 0$, the normal equations restated), so whenever $X$ contains a constant column such as the market, the residual has zero in-sample correlation with every factor's exposure column. That is the in-sample analog of Assumption A2 of [Chapter 2](02-the-factor-model-equation.md), and it holds by construction whether or not the population assumption does.
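The derivation and the right-angle picture can be checked numerically. The following is a minimal sketch on synthetic data (the sizes, exposures, and `f_true` are illustrative assumptions, not MiniModel inputs): it solves the normal equations, cross-checks against NumPy's least-squares routine, and confirms that the residual is orthogonal to every column of $X$.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 50, 3                                  # hypothetical: 50 stocks, 3 factors
X = rng.standard_normal((N, K))               # synthetic exposures
f_true = np.array([0.010, -0.005, 0.020])     # assumed "true" factor returns
r = X @ f_true + 0.03 * rng.standard_normal(N)

# Normal equations: solve (X'X) f = X'r instead of forming the inverse.
f_hat = np.linalg.solve(X.T @ X, X.T @ r)

# Cross-check against NumPy's least-squares solver.
f_lstsq, *_ = np.linalg.lstsq(X, r, rcond=None)

# The residual is the gap from r to the column space of X.
resid = r - X @ f_hat
print("f_hat:", f_hat)
print("max |X' resid|:", np.abs(X.T @ resid).max())   # zero up to rounding
```

Solving the linear system is generally preferred over computing $(X^\top X)^{-1}$ explicitly, since it is cheaper and numerically better behaved.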

OLS is only the starting point. Two problems force refinements: heteroskedasticity (Section 6.3) and exact collinearity (Section 6.4).

## 6.3 Weighted least squares (WLS)

OLS treats every stock as equally informative, but specific variance differs enormously across stocks. A micro-cap's residual return is several times noisier than a mega-cap's, and regression theory (the [Gauss–Markov/Aitken theorem](https://en.wikipedia.org/wiki/Gauss%E2%80%93Markov_theorem)) says efficiency requires weighting each observation by the inverse of its error variance. Weighting stock $i$ by $w_i$, minimize $\sum_i w_i (r_i - X_i f)^2$, i.e. $S_W(f) = (r - Xf)^\top W (r - Xf)$ with $W = \mathrm{diag}(w_1, \dots, w_N)$. These $w_i$ are regression weights, how much each stock's residual counts in the fit. Despite the shared letter, they are not portfolio holdings like the $w_p, w_b$ used in other chapters. The same calculus gives

$$\boxed{\;\hat f = (X^\top W X)^{-1} X^\top W r\;}$$

**The conventional weight: $w_i$ proportional to $\sqrt{\mathrm{cap}_i}$.** The ideal weight is $w_i$ proportional to $1/\sigma^2_{\epsilon_i}$. Empirically, specific variance $\sigma^2_{\epsilon_i}$ scales roughly as the inverse square root of market cap, $\mathrm{cap}_i^{-1/2}$, so its inverse is proportional to $\sqrt{\mathrm{cap}_i}$, a serviceable one-parameter approximation. It also interpolates between equal weighting (under which the far more numerous small caps would dominate the fit) and cap weighting (under which a handful of mega-caps would). Some vendors skip the proxy and use estimated inverse specific variance directly, but $\sqrt{\mathrm{cap}}$ is the standard choice in practice. In the MiniModel, the normalized $\sqrt{\mathrm{cap}}$ weights run from 0.168 (AXIOM) down to 0.044 (DIGIT): AXIOM gets 3.9x DIGIT's influence, versus 15x under cap weighting.
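As a rough illustration of the boxed formula with $\sqrt{\mathrm{cap}}$ weights, here is a sketch on synthetic data. The market caps, the specific-volatility rule `0.5 * cap ** -0.25` (chosen so that variance scales as $\mathrm{cap}^{-1/2}$), and the helper name `wls` are all assumptions made for the example.

```python
import numpy as np

def wls(X, r, w):
    """Weighted least squares: solve (X' W X) f = X' W r with W = diag(w)."""
    XtW = X.T * w                              # scales column i of X' by w_i
    return np.linalg.solve(XtW @ X, XtW @ r)

rng = np.random.default_rng(1)
N = 200
cap = np.exp(rng.normal(8.0, 1.5, N))          # hypothetical market caps
X = np.column_stack([np.ones(N), rng.standard_normal(N)])
sigma = 0.5 * cap ** -0.25                     # specific vol, so variance ~ cap^(-1/2)
r = X @ np.array([0.01, 0.02]) + sigma * rng.standard_normal(N)

w = np.sqrt(cap)
w = w / w.sum()                                # normalizing does not change f_hat
f_wls = wls(X, r, w)

# Equivalent view: plain OLS after scaling each row by sqrt(w_i).
s = np.sqrt(w)
f_scaled, *_ = np.linalg.lstsq(X * s[:, None], r * s, rcond=None)
print(f_wls, f_scaled)
```

The row-scaling view is often convenient in practice: any OLS routine becomes a WLS routine once returns and exposures are both multiplied by $\sqrt{w_i}$.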

## 6.4 Multicollinearity and constrained regression

**The market column is exactly the industry columns added together.** In the MiniModel, each stock has market exposure 1 and exactly one industry dummy equal to 1. Hence

$$X_{\text{TECH}} + X_{\text{FIN}} + X_{\text{CONS}} = X_{\text{MKT}} \quad \text{(as columns)},$$

so the columns of $X$ are linearly dependent, $X^\top W X$ is singular, and $\hat f$ is not unique: adding any constant $c$ to the market factor return and subtracting $c$ from every industry factor return leaves all fitted values $X f$ unchanged. The data cannot distinguish "the market went up 1%" from "every industry went up 1%." (In a global model, country dummies create a second degeneracy of exactly the same form.)
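The degeneracy is easy to see on a toy exposure matrix. The sketch below uses a hypothetical six-stock universe (two stocks per industry, no style factors) to show that the matrix is rank-deficient and that shifting a constant between the market and the industries leaves the fitted returns untouched.

```python
import numpy as np

# Hypothetical 6-stock exposure matrix with columns MKT, TECH, FIN, CONS.
industry = np.array([0, 0, 1, 1, 2, 2])        # industry index of each stock
dummies = np.eye(3)[industry]                  # one-hot industry columns
X = np.column_stack([np.ones(6), dummies])

print("columns:", X.shape[1], "rank:", np.linalg.matrix_rank(X))

f = np.array([0.010, 0.020, -0.010, 0.005])    # arbitrary factor returns
c = 0.01                                       # move 1% from industries to market
f_shifted = f + c * np.array([1.0, -1.0, -1.0, -1.0])

# Both coefficient vectors produce identical fitted returns.
print(np.allclose(X @ f, X @ f_shifted))
```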

**The fix: impose one identifying constraint per degeneracy.** The convention, chosen for interpretability, not statistical necessity, is that the cap-weighted average of industry factor returns is zero:

$$\sum_{j \in \text{ind.}} c_j\, f_j = 0, \qquad c_j = \text{cap weight of industry } j \text{ in the estimation universe}.$$

This allocates common movement to the market factor and makes each industry factor return mean _that industry's return relative to the market_. Combined with the style standardization of [Chapter 3](03-factors-and-exposures.md) (cap-weighted style exposures are zero), the market factor return becomes the cap-weighted market return, approximately under $\sqrt{\mathrm{cap}}$ weights and exactly under cap weights.

Why not just drop an industry column, the textbook fix for collinear dummies? That identifies the system too, but it makes the dropped industry a silent reference category: every remaining industry factor then reads as a return relative to it, not to the market, and the market factor is no longer the market return. The constraint keeps all industries symmetric and the market factor interpretable.

**Solving the constrained problem: the restriction-matrix method.** Write the constraint as $C f = 0$ where $C$ is $1 \times K$ (here: zeros except $c_{\text{TECH}}, c_{\text{FIN}}, c_{\text{CONS}}$ in the industry slots). Parameterize the feasible set: express one industry's factor return in terms of the others,

$$f_{\text{CONS}} = -\frac{c_{\text{TECH}}}{c_{\text{CONS}}} f_{\text{TECH}} - \frac{c_{\text{FIN}}}{c_{\text{CONS}}} f_{\text{FIN}},$$

and collect this into a $K \times (K-1)$ _restriction matrix_ $R$ such that every feasible $f$ equals $Rg$ for an unconstrained reduced vector $g$ ($(K-1) \times 1$). Substitute into the WLS objective and solve the now-nonsingular problem:

$$\hat g = (R^\top X^\top W X R)^{-1} R^\top X^\top W\, r, \qquad \hat f = R\, \hat g.$$

The equivalent Lagrangian route, stationarity of $S_W(f) + 2\lambda\, C f$, gives the same solution. The derivation is in the [appendix](17-appendix.md). The restriction form is how many production systems implement it.
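To make the equivalence concrete, the following sketch solves the same constrained WLS problem both ways on synthetic data. The universe, the regression weights, and the industry cap weights `c` are hypothetical, and the Lagrangian route is written as the bordered linear system that stationarity implies; it is one possible arrangement, not necessarily the one used in the appendix.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 60
industry = rng.integers(0, 3, N)
industry[:3] = [0, 1, 2]                       # ensure every industry is populated
X = np.column_stack([np.ones(N), np.eye(3)[industry], rng.standard_normal((N, 2))])
K = X.shape[1]                                 # MKT, 3 industries, 2 styles
w = rng.uniform(0.5, 2.0, N)                   # hypothetical regression weights
r = 0.05 * rng.standard_normal(N)

c = np.array([0.5, 0.3, 0.2])                  # hypothetical industry cap weights
C = np.zeros((1, K))
C[0, 1:4] = c                                  # constraint row: C f = 0

# Restriction matrix: eliminate the last industry from the free coordinates.
R = np.delete(np.eye(K), 3, axis=1)            # K x (K-1), identity pass-through
R[3, 1:3] = -c[:2] / c[2]                      # row for the eliminated industry

A = (X.T * w) @ X                              # X' W X (singular on its own)
b = (X.T * w) @ r                              # X' W r
f_restr = R @ np.linalg.solve(R.T @ A @ R, R.T @ b)

# Lagrangian route: A f + C' lam = b together with C f = 0.
kkt = np.block([[A, C.T], [C, np.zeros((1, 1))]])
sol = np.linalg.solve(kkt, np.concatenate([b, [0.0]]))
f_lagr = sol[:K]

print(np.allclose(f_restr, f_lagr), (C @ f_restr).item())
```

The restriction form works with a smaller $(K-1) \times (K-1)$ system and never carries the multiplier, which may be part of why it is convenient in production code.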

**MiniModel restriction matrix:** Industry cap weights are $c = (0.4553,\, 0.3252,\, 0.2195)$ for (TECH, FIN, CONS), so the CONS row of $R$ reads $f_{\text{CONS}} = -2.0742\, f_{\text{TECH}} - 1.4815\, f_{\text{FIN}}$, and all other factors pass through untouched: $R$ maps the 6 free coordinates (MKT, TECH, FIN, VALUE, MOM, SIZE) into the 7 constrained ones.
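The MiniModel $R$ can be assembled directly from the three cap weights quoted above. This is a sketch that assumes the factor ordering (MKT, TECH, FIN, CONS, VALUE, MOM, SIZE) used in the Section 6.8 results table; because the quoted weights are rounded to four decimals, the computed CONS coefficients may differ from $-2.0742$ and $-1.4815$ in the last digit.

```python
import numpy as np

factors = ["MKT", "TECH", "FIN", "CONS", "VALUE", "MOM", "SIZE"]
free = ["MKT", "TECH", "FIN", "VALUE", "MOM", "SIZE"]    # CONS is eliminated
c = {"TECH": 0.4553, "FIN": 0.3252, "CONS": 0.2195}      # cap weights quoted above

R = np.zeros((len(factors), len(free)))
for j, name in enumerate(free):
    R[factors.index(name), j] = 1.0                      # free factors pass through
cons = factors.index("CONS")
R[cons, free.index("TECH")] = -c["TECH"] / c["CONS"]
R[cons, free.index("FIN")] = -c["FIN"] / c["CONS"]

# Constraint row C, so that C f = 0 for every f = R g.
C = np.array([[0.0, c["TECH"], c["FIN"], c["CONS"], 0.0, 0.0, 0.0]])
print(np.round(R[cons], 4))        # CONS row of R
print(np.abs(C @ R).max())         # every column of R satisfies the constraint
```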

## 6.5 Factor returns are portfolio returns: the key duality

Look at the solution's shape:

$$\hat f = \underbrace{R\,(R^\top X^\top W X R)^{-1} R^\top X^\top W}_{P,\;\; K \times N}\; r \;=\; P\,r.$$

$\hat f$ is a _linear function of returns_: each factor return is $\hat f_k = \sum_i P_{ki}\, r_i$, a weighted sum of stock returns. A weighted sum of stock returns is the return of a portfolio. Row $k$ of $P$ is an investable long–short portfolio whose return _is_ the estimated factor return. These are the _pure factor portfolios_. With the unconstrained $P = (X^\top W X)^{-1} X^\top W$ they satisfy $PX = I_K$ exactly: unit exposure to their own factor, zero to every other. Under the identifying constraint above (the $P$ written here), that identity holds in full only on the style rows, while the market and industry rows carry the constraint's structure. [Chapter 7](07-factor-portfolios.md) is devoted to factor portfolios, and shows the regression and the portfolios are one object seen from two sides.
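The exposure pattern of these portfolios can be inspected by computing $PX$. The sketch below uses a synthetic universe with hypothetical industry cap weights (summing to one) and regression weights; it builds the constrained $P$ and prints the exposures of each row.

```python
import numpy as np

rng = np.random.default_rng(3)
N = 40
industry = np.arange(N) % 3                    # hypothetical industry assignment
X = np.column_stack([np.ones(N), np.eye(3)[industry], rng.standard_normal((N, 2))])
K = X.shape[1]                                 # MKT, 3 industries, 2 styles
w = rng.uniform(0.5, 2.0, N)                   # hypothetical regression weights
c = np.array([0.5, 0.3, 0.2])                  # hypothetical industry cap weights

R = np.delete(np.eye(K), 3, axis=1)            # restriction matrix, as in Section 6.4
R[3, 1:3] = -c[:2] / c[2]

XtW = X.T * w
P = R @ np.linalg.solve(R.T @ XtW @ X @ R, R.T @ XtW)   # K x N, one portfolio per row

r = 0.05 * rng.standard_normal(N)
f_hat = P @ r                                  # factor returns as portfolio returns
PX = P @ X                                     # exposures of each factor portfolio
print(np.round(PX, 3))
```

In this sketch the style rows of `PX` come out as unit vectors, while the market and industry rows pick up extra industry exposures from the constraint, which is the pattern described above.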

## 6.6 Robustness refinements

- **Return outliers:** A stock that gains 300% on a takeover this month is one observation with outsized influence on the fitted factor returns. Standard treatments: winsorize returns cross-sectionally before the regression, or use robust ([Huber-weighted](https://en.wikipedia.org/wiki/Huber_loss)) regression, iteratively downweighting observations with large standardized residuals. Either way, the goal is that no single stock's corporate event masquerades as a factor return.
- **Thin factors:** An industry with 3 members produces a factor return that is mostly those stocks' specific returns. Mitigations: coarser industry scheme, [Bayesian shrinkage](<https://en.wikipedia.org/wiki/Shrinkage_(statistics)>) of thin-industry factor returns toward a parent sector, or explicit minimum-membership rules feeding back into factor-structure design ([Chapter 15](15-modifying-the-model.md)).
- **Heteroskedasticity beyond size:** When specific-variance estimates exist ([Chapter 8](08-risk-model-assembly.md)), a second pass with $W = \hat\Delta^{-1}$ tightens estimates. This is a feasible Generalized Least Squares (GLS) regression. [Chapter 7](07-factor-portfolios.md) shows this choice also makes the implied factor portfolios minimum-variance.
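For the return-outlier treatment, a Huber-weighted regression can be sketched as iteratively reweighted WLS. The function name `huber_wls`, the tuning constant `k = 1.345`, the MAD-based scale, and the fixed iteration count are illustrative choices rather than a vendor specification; the data are synthetic, with one planted +300% return.

```python
import numpy as np

def huber_wls(X, r, w, k=1.345, n_iter=20):
    """Iteratively reweighted WLS with Huber weights (illustrative sketch)."""
    u = np.ones_like(r)                                   # robustness multipliers
    for _ in range(n_iter):
        XtW = X.T * (w * u)
        f = np.linalg.solve(XtW @ X, XtW @ r)
        z = (r - X @ f) * np.sqrt(w)                      # weight-scaled residuals
        scale = 1.4826 * np.median(np.abs(z - np.median(z)))   # MAD-based scale
        a = np.abs(z) / max(scale, 1e-12)                 # standardized magnitudes
        u = np.minimum(1.0, k / np.maximum(a, 1e-12))     # 1 inside k, k/|a| outside
    return f, u

rng = np.random.default_rng(4)
N = 200
X = np.column_stack([np.ones(N), rng.standard_normal(N)])
f_true = np.array([0.010, 0.005])
r = X @ f_true + 0.02 * rng.standard_normal(N)
X[0, 1] = 2.0                                  # stock 0 has a high style exposure
r[0] = 3.0                                     # and a +300% takeover return
w = np.ones(N)                                 # equal weights keep the demo simple

f_plain = np.linalg.solve(X.T @ X, X.T @ r)
f_robust, u = huber_wls(X, r, w)
print("plain :", f_plain)
print("robust:", f_robust, " weight on stock 0:", u[0])
```

The takeover stock stays in the regression but with a small multiplier, so its event shows up mostly in its own residual instead of in the fitted factor returns.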

## 6.7 Diagnostics

**Cross-sectional $R^2$** (weighted): $R^2 = 1 - \frac{\hat\epsilon^\top W \hat\epsilon}{(r - \bar r)^\top W (r - \bar r)}$, with $\bar r$ the weighted mean return. For real monthly single-stock cross-sections, values of 0.2–0.4 are typical and fine. Most of any one stock's monthly return is specific. The model's job is to capture the _common_ part. Portfolio-level explanatory power is far higher because specific returns diversify ([Chapter 2](02-the-factor-model-equation.md)). A _trend_ change in $R^2$ matters more than its level ([Chapter 14](14-model-evaluation.md)).

**Factor t-statistics:** From WLS theory, $\widehat{\mathrm{Var}}(\hat g) = s^2 (R^\top X^\top W X R)^{-1}$ (with $s^2$ the weighted residual variance), mapped back via $R$. This form is exact only when $W = \Delta^{-1}$ (the true inverse specific variances). Under the $\sqrt{\mathrm{cap}}$ approximation the weights are misspecified, so production code uses heteroskedasticity-robust standard errors instead. The t-stat of $\hat f_k$ in a single period tests whether that factor's return is distinguishable from zero _this period_. The more important statistic is longitudinal: the **fraction of periods with $|t_k| > 2$**. A factor that clears 2 in, say, a third or more of months is earning its place. One that almost never does is a candidate for removal ([Chapter 15](15-modifying-the-model.md)).
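The classical variance formula and the longitudinal statistic can be sketched as follows. Everything here is an assumption made for illustration: the synthetic panel, equal regression weights, the degrees-of-freedom convention $N - (K-1)$ for $s^2$, and the helper name `factor_tstats`. The sketch uses the classical form, not the heteroskedasticity-robust standard errors mentioned above.

```python
import numpy as np

def factor_tstats(X, r, w, R):
    """Classical WLS t-stats for f_hat = R g_hat (not heteroskedasticity-robust)."""
    XtW = X.T * w
    M_inv = np.linalg.inv(R.T @ XtW @ X @ R)
    f_hat = R @ (M_inv @ (R.T @ XtW @ r))
    resid = r - X @ f_hat
    dof = X.shape[0] - R.shape[1]              # N minus number of free parameters
    s2 = np.sum(w * resid**2) / dof            # weighted residual variance
    cov_f = R @ (s2 * M_inv) @ R.T             # map Var(g_hat) back through R
    return f_hat / np.sqrt(np.diag(cov_f))

rng = np.random.default_rng(5)
T, N, K = 120, 300, 6                          # months, stocks, factors
industry = np.arange(N) % 3
c = np.array([0.5, 0.3, 0.2])                  # hypothetical industry cap weights
R = np.delete(np.eye(K), 3, axis=1)
R[3, 1:3] = -c[:2] / c[2]

t_hist = np.empty((T, K))
for t in range(T):
    X = np.column_stack([np.ones(N), np.eye(3)[industry], rng.standard_normal((N, 2))])
    # First style has real factor returns; the second style is "dead" by design.
    f_t = np.array([0.01, 0.0, 0.0, 0.0, rng.normal(0.0, 0.02), 0.0])
    r = X @ f_t + 0.08 * rng.standard_normal(N)
    t_hist[t] = factor_tstats(X, r, np.ones(N), R)

frac_sig = (np.abs(t_hist) > 2).mean(axis=0)   # fraction of periods with |t| > 2
print(np.round(frac_sig, 2))
```

In a setup like this the live style factor should clear $|t| > 2$ far more often than the dead one, which is the contrast the longitudinal statistic is meant to expose.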

**Residual structure:** Compute residual correlations across stocks over time. Clusters of correlated residuals (by theme, ownership, supply chain) are the signature of a missing factor, the diagnostic used in [Chapter 15](15-modifying-the-model.md).
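A minimal version of this check computes the stock-by-stock correlation matrix of the residual history and flags large off-diagonal entries. The residual panel below is synthetic, with a hypothetical omitted "theme" shared by the first three stocks, and the 0.3 threshold is an arbitrary illustrative cutoff.

```python
import numpy as np

rng = np.random.default_rng(6)
T, N = 240, 8
resid = 0.05 * rng.standard_normal((T, N))     # hypothetical residual history (T x N)
theme = 0.05 * rng.standard_normal(T)          # an omitted common driver
resid[:, :3] += theme[:, None]                 # stocks 0, 1, 2 share it

corr = np.corrcoef(resid, rowvar=False)        # N x N residual correlation matrix
rows, cols = np.triu_indices(N, k=1)           # each pair once, above the diagonal
flagged = [(int(a), int(b)) for a, b in zip(rows, cols) if abs(corr[a, b]) > 0.3]
print(flagged)
```

In production the flagged pairs would then be grouped (for example by theme or ownership) to decide whether they point to a candidate factor.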

## 6.8 Worked example: the MiniModel month-1 regression, end to end

All inputs come from [Chapter 3](03-factors-and-exposures.md) and the appendix. Every number is reproducible from the [example source code](18-mini-example-source-code.md).

**Inputs:** $X$ ($10 \times 7$) from [Chapter 3](03-factors-and-exposures.md). Month-1 returns and $\sqrt{\mathrm{cap}}$ regression weights:

| Stock     | $r$ (%) | reg. weight |     | Stock    | $r$ (%) | reg. weight |
| --------- | ------: | ----------: | --- | -------- | ------: | ----------: |
| AXIOM     |    +4.2 |       0.168 |     | FIDELIS  |    −0.6 |       0.107 |
| BINARY    |    +2.8 |       0.123 |     | GUARDIAN |    −1.8 |       0.062 |
| CIPHER    |    +0.5 |       0.087 |     | HARVEST  |    +1.2 |       0.130 |
| DIGIT     |    +6.0 |       0.044 |     | INDIGO   |    +2.0 |       0.075 |
| EVERGREEN |    +0.8 |       0.151 |     | JUNIPER  |    −0.5 |       0.053 |

**Step 1, build $R$:** from industry cap weights $(0.4553, 0.3252, 0.2195)$ as in Section 6.4.

**Step 2, solve:** $\hat g = (R^\top X^\top W X R)^{-1} R^\top X^\top W r$, then $\hat f = R \hat g$:

| Factor | $\hat f$ (%) | Interpretation                                                                                                                         |
| ------ | -----------: | -------------------------------------------------------------------------------------------------------------------------------------- |
| MKT    |   **+1.821** | the market rose ~1.8% (the cap-weighted market return was +1.819%; the near-match is the constraint + standardization doing their job) |
| TECH   |   **+0.768** | tech beat the market by ~0.77%                                                                                                         |
| FIN    |   **−1.282** | financials lagged the market by ~1.28%                                                                                                 |
| CONS   |   **+0.306** | consumer slightly ahead                                                                                                                |
| VALUE  |   **+0.548** | each +1σ of cheapness paid +0.55%                                                                                                      |
| MOM    |   **+1.962** | a strong momentum month: +1.96% per σ of momentum                                                                                      |
| SIZE   |   **+0.046** | size was flat                                                                                                                          |

Constraint check: $0.4553(0.768) + 0.3252(-1.282) + 0.2195(0.306) = 0.000$. That passes.
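The same check can be scripted from the rounded table values. The snippet below is a small sketch that also recovers the CONS factor return implied by the other two industries; because the inputs are rounded to three or four decimals, the constraint sum is only zero to about that precision.

```python
# Industry cap weights and month-1 industry factor returns (%), from the tables above.
c = {"TECH": 0.4553, "FIN": 0.3252, "CONS": 0.2195}
f_hat = {"TECH": 0.768, "FIN": -1.282, "CONS": 0.306}

# Cap-weighted sum of industry factor returns: should be zero up to rounding.
constraint = sum(c[k] * f_hat[k] for k in c)

# CONS return implied by the constraint from the TECH and FIN returns.
f_cons_implied = -(c["TECH"] * f_hat["TECH"] + c["FIN"] * f_hat["FIN"]) / c["CONS"]

print(f"constraint sum: {constraint:+.4f}")
print(f"implied f_CONS: {f_cons_implied:+.3f}")
```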

**Step 3, residuals:** $\hat\epsilon = r - X\hat f$ (%):

| AXIOM | BINARY | CIPHER | DIGIT | EVERGREEN | FIDELIS | GUARDIAN | HARVEST |    INDIGO | JUNIPER |
| ----: | -----: | -----: | ----: | --------: | ------: | -------: | ------: | --------: | ------: |
| −0.10 |  +0.02 |  +0.14 | +0.03 |     +0.36 |   −0.24 |    −0.46 |   −0.44 | **+1.09** |   −0.46 |

Reading: INDIGO beat what its characteristics predicted by 1.09%, genuine stock-specific news. AXIOM's big +4.2% month, by contrast, was almost entirely characteristic-driven (tech, high momentum): residual −0.10%. This distinction, _which part of a return was "earned" by characteristics_, is the entire basis of performance attribution ([Chapter 10](10-performance-attribution.md)).

**Step 4, diagnostics:** Weighted $R^2$ (centered) = 0.956, far above realistic values because the toy has 10 observations and 6 free parameters (7 factors less the 1 constraint). In production cross-sections ($N$ in the thousands), expect 0.2–0.4 monthly.
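The weighted $R^2$ of Section 6.7 can be recomputed from the published returns, weights, and Step 3 residuals. This sketch copies the rounded table values, so it should land within rounding distance of the reported 0.956 without matching it exactly.

```python
import numpy as np

# Month-1 returns (%), regression weights, and Step 3 residuals (%),
# in the order AXIOM, BINARY, CIPHER, DIGIT, EVERGREEN, FIDELIS,
# GUARDIAN, HARVEST, INDIGO, JUNIPER.
r = np.array([4.2, 2.8, 0.5, 6.0, 0.8, -0.6, -1.8, 1.2, 2.0, -0.5])
w = np.array([0.168, 0.123, 0.087, 0.044, 0.151, 0.107, 0.062, 0.130, 0.075, 0.053])
resid = np.array([-0.10, 0.02, 0.14, 0.03, 0.36, -0.24, -0.46, -0.44, 1.09, -0.46])

r_bar = np.average(r, weights=w)               # weighted mean return
ss_res = np.sum(w * resid**2)                  # weighted residual sum of squares
ss_tot = np.sum(w * (r - r_bar) ** 2)          # weighted, centered total sum of squares
r2 = 1.0 - ss_res / ss_tot

print(f"weighted R^2: {r2:.3f}")
print(f"weighted mean residual: {np.sum(w * resid):+.4f}")   # near zero by construction
```

The weighted mean residual is close to zero because the market column passes through $R$ unchanged, so the fit forces the weighted residuals to sum to zero.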

**What we now have:** Repeating this regression every period produces the factor return history $\{\hat f_t\}$, the raw material for the factor covariance matrix, and a residual history per stock, the raw material for specific risk. That assembly is [Chapter 8](08-risk-model-assembly.md). First, [Chapter 7](07-factor-portfolios.md) examines the portfolios this regression has implicitly constructed.

## 6.9 Summary

- One cross-sectional regression per period recovers the factor returns, with the stocks as observations: $\hat f = (X^\top W X)^{-1} X^\top W r$ (identified by the constraint in the third bullet).
- **WLS, not OLS:** Specific variance varies sharply across stocks, so weight each by $w_i$, proportional to $\sqrt{\mathrm{cap}_i}$, a one-parameter proxy for inverse specific variance that also keeps mega-caps from dominating the fit.
- **Constrain, don't drop:** The market and industry columns are exactly collinear. A cap-weighted "industry returns sum to zero" constraint identifies the system and makes the market factor return the cap-weighted market return.
- $\hat f = Pr$: every factor return is the realized return of an investable long–short portfolio ([Chapter 7](07-factor-portfolios.md)). Run every period, the regression yields the $\{\hat f_t\}$ and residual histories that build $F$ and $\Delta$ ([Chapter 8](08-risk-model-assembly.md)).

---

_Next: [Chapter 7: Factor Portfolios](07-factor-portfolios.md)_
