# Chapter 3: Factors and Exposures

_Previous: [Chapter 2: The Factor Model Equation](02-the-factor-model-equation.md)_

---

A _factor_ is a common driver of returns shared across many stocks. An _exposure_ is one stock's sensitivity to a given factor. In the model equation, each column of $X$ is a factor and each row holds a single stock's exposures.

[Chapter 2](02-the-factor-model-equation.md) treated the exposure matrix $X$ as given. This chapter is about where it comes from. It explains the factors contained in a fundamental equity model, how raw company data turns into a _descriptor_ and then a _factor exposure_, and how those exposures are standardized so the numbers in $X$ mean the same thing across factors and across time.

## 3.1 The taxonomy of factors in a fundamental model

Commercial fundamental models share a standard factor structure. Most have 70–90 factors, grouped like this:

**Market/Country factors:** In a single-country model, every stock has exposure 1 to "the market". In a global model this generalizes to one factor per _country_ (a stock's country exposure is 1 for its country of membership), with or without a separate world factor. These capture the dominant common movement: broad rallies and sell-offs.

**Industry/Sector factors:** Dummy (0/1) exposures by industry membership under a classification scheme; [GICS](https://en.wikipedia.org/wiki/Global_Industry_Classification_Standard) and [ICB](https://en.wikipedia.org/wiki/Industry_Classification_Benchmark) are the common choices. Some models instead rely on industry groupings from proprietary clustering algorithms. Models typically use 10–60 industries. Two design choices matter:

- _Granularity_: more industries explain more co-movement but make sparse industries with few stocks noisy to estimate.
- _Single vs. multiple membership_: conglomerates straddle industries. For example, retail drives Amazon's revenue, but AWS/technology drives its operating income. Some models allow fractional exposures to several industries, summing to 1, so Amazon would be split between retail and technology rather than forced into a single bucket.
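
As a minimal sketch of the two membership schemes, the helper below builds one stock's industry row from a mapping of industry to weight. The function name `industry_row`, the three-industry list (borrowed from the MiniModel), and the 0.4/0.6 split are illustrative assumptions, not any vendor's scheme.

```python
import numpy as np

# Industry columns, in a fixed order (the MiniModel's three industries).
industries = ["TECH", "FIN", "CONS"]

def industry_row(membership):
    """Return one stock's industry exposures from an {industry: weight} mapping."""
    row = np.array([membership.get(name, 0.0) for name in industries])
    # Single or fractional, the exposures across industries must sum to 1.
    if not np.isclose(row.sum(), 1.0):
        raise ValueError("industry exposures must sum to 1")
    return row

single = industry_row({"FIN": 1.0})               # single membership: a 0/1 dummy row
split = industry_row({"TECH": 0.4, "CONS": 0.6})  # hypothetical conglomerate split
```

Single membership is simply the special case in which one weight equals 1.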

**Style factors:** Exposures built from company characteristics, and the main subject of this chapter. The canonical set of style factors:

| Style                   | What it measures                       | Typical descriptors                                      |
| ----------------------- | -------------------------------------- | -------------------------------------------------------- |
| Value                   | Cheapness vs. fundamentals             | Book/price, earnings/price, sales/price, cash-flow/price |
| Size                    | Large vs. small                        | log market cap                                           |
| Momentum                | Recent relative strength               | 12-month return excluding the last month                 |
| Volatility / Beta       | Market sensitivity & total variability | historical beta, daily return std, max drawdown          |
| Quality / Profitability | Earnings quality & profitability       | ROE, ROA, gross margin, accruals, earnings variability   |
| Growth                  | Expansion of the business              | EPS growth, sales growth, analyst forecast growth        |
| Leverage                | Balance-sheet gearing                  | debt/equity, debt/assets                                 |
| Liquidity               | How heavily traded                     | turnover ratio, illiquidity measures                     |
| Dividend yield          | Payout                                 | trailing dividend / price                                |

**Currency factors:** Separate the equity-in-local-currency decision from the currency decision. This is only relevant in multi-country models. [Chapter 15](15-modifying-the-model.md) covers the mechanics.

**Newer / Specialty factors:** ESG and carbon-intensity factors, crowding (hedge-fund ownership concentration, short interest), sentiment (news flow, analyst revisions), and machine-learned factors. [Chapter 15](15-modifying-the-model.md) walks through how new factors can be added to an existing model.

**The mini example's factor set:** The [MiniModel](18-mini-example-source-code.md) uses the smallest interesting version of this anatomy: 1 market factor (MKT), 3 industry factors (TECH, FIN, CONS), and 3 style factors (VALUE, MOM, SIZE), for $K = 7$ factors in total, over the 10-stock universe introduced below.

## 3.2 From raw data to descriptors to exposures

The pipeline for a style factor runs in three stages: compute one or more _descriptors_, blend them into a _raw factor_, then standardize that factor into the _exposure_ that enters $X$:

```mermaid
flowchart LR
    A1[raw data] -->|"Stage 1"| B1[descriptor]
    A2[raw data] -->|"Stage 1"| B2[descriptor]
    A3[raw data] -->|"Stage 1"| B3[descriptor]
    B1 -->|"Stage 2: z-score"| Z1[std descriptor]
    B2 -->|"Stage 2: z-score"| Z2[std descriptor]
    B3 -->|"Stage 2: z-score"| Z3[std descriptor]
    Z1 -->|"Stage 2: blend"| C[raw factor]
    Z2 -->|"Stage 2: blend"| C[raw factor]
    Z3 -->|"Stage 2: blend"| C[raw factor]
    C -->|"Stage 3: z-score"| D[standardized exposure]
```

**Stage 1: raw data -> descriptor:** A _descriptor_ is a single measurable quantity per stock, computed from prices, financial statements, or other sources. For example, book-to-price comes from the latest balance sheet and the latest market cap, while 12-month-minus-1-month total return comes from the price history. Descriptor design is where most of the craft lives, and most of the look-ahead bugs ([Chapter 16](16-practical-considerations.md)): which fiscal-data vintage to use, how to lag filings so only public information enters, and how to treat negative book values.

**Stage 2: descriptors -> one factor:** Most styles blend several descriptors. Value, for instance, might be

$$\text{RawValue}_i = 0.5\,z(\text{B/P}_i) + 0.3\,z(\text{E/P}_i) + 0.2\,z(\text{CF/P}_i)$$

where $z(\cdot)$ denotes the standardization of Section 3.3, applied to each descriptor _before_ blending. Standardizing first puts the descriptors on one common scale, so the weights (0.5, 0.3, 0.2) carry the meaning they appear to; blend the raw ratios and whichever has the widest spread would quietly dominate the mix. Blending also guards against the failure modes of any single ratio: E/P breaks for loss-makers, B/P breaks for asset-light businesses. A composite still gives a sensible number when one input is distorted. Descriptor weights are set by judgment, by maximizing the in-sample explanatory power of the resulting factor, or by equal-weighting. Vendors publish their recipes in model handbooks.
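
The effect of standardizing before blending can be sketched numerically. The five-stock descriptor values below are hypothetical, and `zscore` is a plain equal-weighted z-score standing in for the full Section 3.3 recipe.

```python
import numpy as np

def zscore(d):
    """Plain equal-weighted z-score (a simplification of the Section 3.3 recipe)."""
    return (d - d.mean()) / d.std()

# Hypothetical descriptor values for five stocks (not MiniModel data).
bp = np.array([0.2, 0.5, 0.9, 1.4, 0.6])         # book/price
ep = np.array([0.03, 0.06, 0.08, 0.02, 0.05])    # earnings/price
cfp = np.array([0.05, 0.09, 0.12, 0.07, 0.08])   # cash-flow/price
weights = np.array([0.5, 0.3, 0.2])

descriptors = np.column_stack([bp, ep, cfp])

# Blend the raw ratios directly, then blend the standardized ones.
raw_blend = descriptors @ weights
z_blend = np.column_stack([zscore(col) for col in descriptors.T]) @ weights

# How closely does each blend track B/P alone?
corr_raw = np.corrcoef(raw_blend, bp)[0, 1]
corr_z = np.corrcoef(z_blend, bp)[0, 1]
print(f"raw blend vs B/P: {corr_raw:.3f}, z-scored blend vs B/P: {corr_z:.3f}")
```

In this toy sample B/P has by far the widest spread, so the raw blend is almost perfectly correlated with B/P alone. The z-scored blend lets E/P and CF/P contribute in proportion to their weights.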

The [MiniModel](18-mini-example-source-code.md) keeps one descriptor per style: book-to-price for VALUE, 12-1 month return for MOM, log cap for SIZE. So the blending stage is trivial, but the standardization stage below is shown in full.

**Stage 3: raw factor -> standardized exposure:** The blended factor is standardized once more, by the same Section 3.3 recipe, so the exposure that lands in $X$ has cap-weighted mean zero and standard deviation one. That second pass is what makes "+1" mean "one standard deviation above the market" for every factor, and keeps exposures comparable across factors and over time. With a single descriptor, there is nothing to blend, so this is the only standardization.
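
A short calculation suggests why the second pass is needed. Assume nonnegative blend weights $w_j$ that sum to 1, and descriptors $z_j$ each standardized to equal-weighted variance 1, with pairwise cross-sectional correlations $\rho_{jl}$ (notation introduced here for illustration). The variance of the blend is then

$$\operatorname{Var}\Big(\sum_j w_j z_j\Big) = \sum_j w_j^2 + 2\sum_{j<l} w_j w_l\,\rho_{jl} \;\le\; \Big(\sum_j w_j\Big)^2 = 1,$$

with equality only when every pair of descriptors with positive weight is perfectly correlated. For the weights (0.5, 0.3, 0.2) and uncorrelated descriptors, for instance, the variance is $0.25 + 0.09 + 0.04 = 0.38$, a standard deviation of about 0.62. A "+1" on the raw factor would therefore not mean one standard deviation until the factor is rescaled.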

## 3.3 Standardization: the maths

Raw descriptors have arbitrary scales (B/P near 0.5, log-cap near 4, momentum near 0.1). To make exposures comparable across factors, so that "+1" always means "one standard deviation above the market", every style exposure is standardized. The recipe has three operations. Winsorizing and z-scoring act on the descriptor, in that order. Missing-data handling brackets them: it repairs the descriptor before the z-score where it can, and sets a neutral exposure after the z-score where it cannot.

**Winsorize (outlier control):** Cross-sectional fundamental data has wild outliers (a beaten-down, asset-heavy stock can trade at a B/P of 25). Before anything else, extremes are pulled in:

- _Hard clip_: pull any value more than 3 standard deviations from the mean back to that boundary.
- _MAD-based clip_ (more robust): same idea, but use the median and the median absolute deviation (MAD), scaled by 1.4826 to match a standard deviation, in place of the mean and std. Those aren't pulled around by the outliers you're trying to clip, so the cutoff is more reliable.
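
The two clipping rules can be sketched as follows. The function names, the `n_sigma` parameter, and the ten-stock sample with a single B/P of 25 are assumptions made for illustration.

```python
import numpy as np

def hard_clip(d, n_sigma=3.0):
    """Clip values to mean +/- n_sigma standard deviations."""
    mu, sd = d.mean(), d.std()
    return np.clip(d, mu - n_sigma * sd, mu + n_sigma * sd)

def mad_clip(d, n_sigma=3.0):
    """Clip values to median +/- n_sigma robust standard deviations."""
    med = np.median(d)
    # 1.4826 * MAD estimates the standard deviation for normally distributed data.
    robust_sd = 1.4826 * np.median(np.abs(d - med))
    return np.clip(d, med - n_sigma * robust_sd, med + n_sigma * robust_sd)

# Hypothetical B/P cross-section with one extreme value.
bp = np.array([0.15, 0.25, 0.40, 0.45, 0.55, 0.60, 0.70, 0.85, 0.95, 25.0])

print(hard_clip(bp).max())  # the outlier inflates the std and escapes the clip
print(mad_clip(bp).max())   # the robust bounds pull it in
```

In this small sample the hard clip leaves the 25.0 untouched, because the outlier itself inflates the mean and standard deviation that define the boundary. The MAD-based clip pulls it back to roughly 1.58.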

**Z-score against the right reference population:** The standardized exposure is

$$X_{ik} = \frac{d_{ik} - \mu_k}{\sigma_k}$$

with a deliberate asymmetry in how $\mu_k$ and $\sigma_k$ are computed:

- $\mu_k$ = **cap-weighted** mean of the descriptor over the estimation universe (see [Chapter 5](05-universes.md) for an explanation of the different universes),
- $\sigma_k$ = **equal-weighted** standard deviation over the estimation universe, taken about the equal-weighted mean (population form, dividing by $n$, not $n-1$; this is what makes each standardized column have std exactly 1).

Using the _cap-weighted mean_ is important: it makes the _cap-weighted market portfolio have exactly zero exposure to every style factor_. Consequently the market factor return cleanly means "the market," and style factor returns mean "payoff to a tilt away from the market". Without this, every style factor would be entangled with the market factor. The equal-weighted _std_ is used because a cap-weighted _std_ would let a handful of mega-caps set the scale for everyone.
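
The zero-exposure property follows in one line. Writing $w_i^{\text{cap}}$ for the cap weights over the estimation universe (they sum to 1), the market portfolio's exposure to style $k$ is

$$\sum_i w_i^{\text{cap}} X_{ik} = \frac{\sum_i w_i^{\text{cap}} d_{ik} - \mu_k \sum_i w_i^{\text{cap}}}{\sigma_k} = \frac{\mu_k - \mu_k}{\sigma_k} = 0.$$

The result holds for any positive $\sigma_k$, so the choice of an equal-weighted standard deviation does not disturb it.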

**Handle missing data:** Real coverage is never complete. Standard treatments, in descending order of preference (the first two repair the _descriptor_ before the z-score, the last sets the _exposure_ after):

1. _Fill in from peers_: assign the cap-weighted mean descriptor of the stock's industry/country bucket.
2. _Fill in by regression_: predict the missing descriptor from descriptors the stock does have.
3. _Set the standardized exposure to 0_: i.e., assume the stock is market-like in that dimension. Honest, neutral, and common as the final fallback, but it silently shrinks the model's opinion of that stock's risk, so production systems track how many stocks get 0 fill-in per factor.
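
A minimal sketch of the first and last treatments follows (the regression fill is omitted for brevity). The function names and the five-stock sample are hypothetical. One stock has peers to borrow from and one does not.

```python
import numpy as np

def fill_from_peers(d, cap, bucket):
    """Replace NaN descriptors with the cap-weighted mean of the stock's bucket."""
    filled = d.copy()
    for b in np.unique(bucket):
        known = (bucket == b) & ~np.isnan(d)
        missing = (bucket == b) & np.isnan(d)
        if known.any():
            filled[missing] = np.average(d[known], weights=cap[known])
    return filled  # still NaN where a bucket has no known value

def zero_fill(x):
    """Set remaining NaN exposures to 0 and report how many were filled."""
    n_zero = int(np.isnan(x).sum())
    return np.where(np.isnan(x), 0.0, x), n_zero

# Hypothetical cross-section: one gap with peers, one without.
bucket = np.array(["Tech", "Tech", "Tech", "Tech", "Fin"])
cap = np.array([150.0, 80.0, 40.0, 10.0, 20.0])
bp = np.array([0.15, 0.25, np.nan, 0.60, np.nan])

bp_filled = fill_from_peers(bp, cap, bucket)

# Z-score over the stocks that now have a descriptor; NaN stays NaN.
ok = ~np.isnan(bp_filled)
mu = np.average(bp_filled[ok], weights=cap[ok])
sigma = bp_filled[ok].std()
exposure, n_zero = zero_fill((bp_filled - mu) / sigma)
print(exposure, n_zero)
```

Returning the zero-fill count alongside the exposures is one simple way to support the per-factor tracking mentioned above.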

**Worked standardization: the MiniModel VALUE factor:** The universe (full data table in the appendix, [Chapter 17](17-appendix.md)):

| Stock     | Industry | Cap ($bn) | Cap weight | B/P (raw) |
| --------- | -------- | --------: | ---------: | --------: |
| AXIOM     | Tech     |       150 |     24.39% |      0.15 |
| BINARY    | Tech     |        80 |     13.01% |      0.25 |
| CIPHER    | Tech     |        40 |      6.50% |      0.45 |
| DIGIT     | Tech     |        10 |      1.63% |      0.60 |
| EVERGREEN | Fin      |       120 |     19.51% |      0.85 |
| FIDELIS   | Fin      |        60 |      9.76% |      0.95 |
| GUARDIAN  | Fin      |        20 |      3.25% |      1.10 |
| HARVEST   | Cons     |        90 |     14.63% |      0.40 |
| INDIGO    | Cons     |        30 |      4.88% |      0.55 |
| JUNIPER   | Cons     |        15 |      2.44% |      0.70 |

No winsorization is needed because the toy data is clean. The cap-weighted mean of B/P is

$$\mu = \sum_i w_i^{\text{cap}}\, \text{B/P}_i = 0.2439(0.15) + 0.1301(0.25) + \dots = 0.5049,$$

and the equal-weighted standard deviation is $\sigma = 0.2890$. So, for example,

$$X_{\text{AXIOM, VALUE}} = \frac{0.15 - 0.5049}{0.2890} = -1.228, \qquad X_{\text{GUARDIAN, VALUE}} = \frac{1.10 - 0.5049}{0.2890} = +2.060.$$

Applying the same recipe to all three styles produces the style block of the MiniModel exposure matrix:

| Stock     |  VALUE |    MOM |   SIZE |
| --------- | -----: | -----: | -----: |
| AXIOM     | −1.228 |  1.198 |  0.710 |
| BINARY    | −0.882 |  0.341 | −0.010 |
| CIPHER    | −0.190 | −1.066 | −0.805 |
| DIGIT     |  0.329 |  1.688 | −2.394 |
| EVERGREEN |  1.194 | −0.393 |  0.455 |
| FIDELIS   |  1.540 | −0.883 | −0.340 |
| GUARDIAN  |  2.060 | −1.495 | −1.600 |
| HARVEST   | −0.363 | −0.148 |  0.125 |
| INDIGO    |  0.156 | −0.638 | −1.135 |
| JUNIPER   |  0.675 | −1.250 | −1.930 |

The construction's promises hold: the cap-weighted average of every column is 0 (the market is style-neutral by construction), and the equal-weighted standard deviation of every column is 1. The rows read as character sketches. GUARDIAN is a deep-value (+2.06), negative-momentum (−1.50), small (−1.60) financial. That readability is the point of standardizing.
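
The VALUE and SIZE columns can be checked with a few lines of NumPy. This is a sketch rather than the MiniModel source: `standardize` is a name chosen here, and SIZE is assumed to use the natural log of cap in $bn, which appears to reproduce the table. MOM is left out because its raw returns are not listed in this chapter.

```python
import numpy as np

def standardize(d, cap):
    """Z-score with a cap-weighted mean and an equal-weighted population std."""
    mu = np.average(d, weights=cap)  # cap-weighted mean
    sigma = d.std()                  # NumPy default ddof=0, i.e. divide by n
    return (d - mu) / sigma

# Universe table above, in row order AXIOM ... JUNIPER.
cap = np.array([150, 80, 40, 10, 120, 60, 20, 90, 30, 15], dtype=float)
bp = np.array([0.15, 0.25, 0.45, 0.60, 0.85, 0.95, 1.10, 0.40, 0.55, 0.70])

value = standardize(bp, cap)
size = standardize(np.log(cap), cap)  # natural log is an assumption

print(np.round(value, 3))
print(np.round(size, 3))
# Both promises: cap-weighted mean 0, equal-weighted std 1.
print(np.average(value, weights=cap), value.std())
```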

## 3.4 Binary vs. continuous exposures

The full MiniModel exposure matrix has three kinds of columns, and the distinction recurs throughout the series:

$$X = \big[\underbrace{\mathbf{1}}_{\text{MKT}} \;\big|\; \underbrace{\text{0/1 dummies}}_{\text{TECH, FIN, CONS}} \;\big|\; \underbrace{\text{z-scores}}_{\text{VALUE, MOM, SIZE}}\big]$$

- The **market** column is all ones: every stock participates fully.
- **Industry** columns are membership dummies: each stock has a 1 in exactly one industry column (under single membership), 0 elsewhere. Note the built-in collinearity: the three industry columns sum to the market column. A constraint on the factor returns resolves it, and [Chapter 6](06-estimating-factor-returns.md) shows how.
- **Style** columns are continuous z-scores: signed, roughly in $[-3, 3]$, with a cap-weighted mean of zero.

A unit of exposure means something different per block, "is a financial" vs. "one std cheaper than the market", but the model equation treats them identically, and factor returns inherit the corresponding interpretation (the FIN factor return is the financials-vs-market return; the VALUE factor return is the payoff per std of cheapness).
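
As an illustration, the three blocks can be stacked into the full $10 \times 7$ matrix, with the style values copied from the table in Section 3.3. The variable names are chosen here, and the actual MiniModel source may assemble $X$ differently.

```python
import numpy as np

# Industry membership in row order AXIOM ... JUNIPER.
industry = ["Tech"] * 4 + ["Fin"] * 3 + ["Cons"] * 3

# Style block (VALUE, MOM, SIZE) from the standardized-exposure table.
styles = np.array([
    [-1.228,  1.198,  0.710],
    [-0.882,  0.341, -0.010],
    [-0.190, -1.066, -0.805],
    [ 0.329,  1.688, -2.394],
    [ 1.194, -0.393,  0.455],
    [ 1.540, -0.883, -0.340],
    [ 2.060, -1.495, -1.600],
    [-0.363, -0.148,  0.125],
    [ 0.156, -0.638, -1.135],
    [ 0.675, -1.250, -1.930],
])

mkt = np.ones((len(industry), 1))
dummies = np.array([[float(ind == name) for name in ("Tech", "Fin", "Cons")]
                    for ind in industry])

X = np.hstack([mkt, dummies, styles])
print(X.shape)                   # 10 stocks by K = 7 factors
print(np.linalg.matrix_rank(X))  # one less than K because of the collinearity
```

The rank deficiency is the collinearity noted above: the three industry columns add up to the market column, so $X$ has only six independent columns.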

## 3.5 Exposure dynamics

The factor exposure matrix $X$ is not static. It is rebuilt at every model date (daily or monthly):

- **Why exposures move:** prices move daily (changing B/P, size, momentum continuously). Financial statements update quarterly. Industry membership changes rarely (reclassifications, spin-offs).
- **Speed by design:** momentum exposure turns over fast by construction (a 12-month window rolls). Size is glacial. The risk-model implication: fast-moving exposures make a stock's _factor identity_ itself dynamic, which is precisely what lets a fundamental model react faster than one based on estimated betas ([Chapter 4](04-model-types.md)).
- **Point-in-time discipline:** The exposure used for date $t$ must be computable from information public _before_ $t$: filings dated by when they were reported (not by fiscal period end), and prices through $t-1$. Fundamental databases restate history, so a model built on restated data inherits a look-ahead bias that inflates backtested performance. [Chapter 16](16-practical-considerations.md) treats this at length; it is the most expensive class of bug in quantitative equity.
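
The point-in-time rule for filings can be sketched with a small lookup. The `Filing` record, its field names, and the dates and book values are hypothetical. The only idea carried over from the text is that selection keys on the report date, not the fiscal date.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class Filing:
    fiscal_period_end: date
    report_date: date    # when the figures became public
    book_value: float

def latest_public_filing(filings, model_date):
    """Return the most recently reported filing public strictly before model_date."""
    public = [f for f in filings if f.report_date < model_date]
    return max(public, key=lambda f: f.report_date, default=None)

# Hypothetical filing history for one stock.
filings = [
    Filing(date(2023, 12, 31), date(2024, 2, 15), book_value=410.0),
    Filing(date(2024, 3, 31), date(2024, 5, 10), book_value=425.0),
]

# On 30 April the Q1 fiscal period has ended, but its filing is not yet public.
chosen = latest_public_filing(filings, date(2024, 4, 30))
print(chosen.fiscal_period_end)  # the December filing is the one to use
```

Filtering on `fiscal_period_end` instead would select the March figures ten days before anyone could have seen them, which is exactly the look-ahead bias described above.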

## 3.6 What makes a good factor: a preview

Standardization makes any descriptor _usable_, but it does not necessarily make it _worthwhile_. The admission criteria below are applied rigorously in [Chapter 15](15-modifying-the-model.md) when modifying a model, and are testable with the machinery of Chapters [6](06-estimating-factor-returns.md) and [14](14-model-evaluation.md):

1. **Economic rationale:** A story for _why_ the characteristic should drive common returns (risk premium, behavioral effect, or structural flow). Requiring one protects against data mining.
2. **Statistically significant factor returns:** The estimated factor return series should be distinguishable from noise: a meaningful fraction of periods with $|t| > 2$ ([Chapter 6](06-estimating-factor-returns.md) defines the t-statistic).
3. **Persistence and breadth:** Works across decades and across markets, not in one regime in one country.
4. **Non-redundancy:** Low correlation with existing factors' exposures and returns. A candidate that is 0.9-correlated with value adds estimation burden without adding information ([Chapter 15](15-modifying-the-model.md) quantifies this with variance inflation factors).
5. **Coverage and quality of the underlying data:** A brilliant descriptor available for 40% of the universe creates more data-filling problems than insight.
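
The non-redundancy check can be previewed with the textbook variance inflation factor, $1/(1-R^2)$ from regressing the candidate's exposures on the existing ones. The function below is a sketch under that definition, run on simulated exposures. It is not necessarily the exact procedure of Chapter 15.

```python
import numpy as np

def variance_inflation_factor(candidate, existing):
    """VIF of a candidate exposure column given the existing exposure columns."""
    # Regress the candidate on an intercept plus the existing columns.
    A = np.column_stack([np.ones(len(candidate)), existing])
    coef, *_ = np.linalg.lstsq(A, candidate, rcond=None)
    resid = candidate - A @ coef
    r2 = 1.0 - resid.var() / candidate.var()
    return 1.0 / (1.0 - r2)

# Simulated exposures for 500 stocks (illustrative only).
rng = np.random.default_rng(0)
value = rng.standard_normal(500)
noise = rng.standard_normal(500)

redundant = 0.9 * value + np.sqrt(1 - 0.9**2) * noise  # about 0.9-correlated with value
independent = rng.standard_normal(500)

vif_redundant = variance_inflation_factor(redundant, value)
vif_independent = variance_inflation_factor(independent, value)
print(vif_redundant, vif_independent)
```

A candidate that is 0.9-correlated with an existing factor has a VIF near $1/(1 - 0.81) \approx 5.3$, while an unrelated candidate sits close to 1.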

## 3.7 Summary

- A fundamental model's $X$ has three blocks: market (ones), industries (dummies), styles (standardized composites of descriptors).
- The pipeline is raw data -> descriptor -> winsorize and z-score each descriptor -> blend into a raw factor -> winsorize and z-score again -> exposure, filling gaps at the descriptor stage where possible and zeroing the exposure only as a last resort. The cap-weighted mean in the z-score is what makes the market portfolio style-neutral, and that choice propagates into the interpretation of every factor return downstream.

---

_Next: [Chapter 4: Types of Factor Model](04-model-types.md)_