# Chapter 5: Estimation Universe and Coverage Universe

_Previous: [Chapter 4: Types of Factor Model](04-model-types.md)_

---

Before any regression is run, a model builder must answer two questions that are easy to conflate:

1. Which stocks should the model _learn from_? The _estimation universe_.
2. Which stocks must the model be able to _describe_? The _coverage universe_.

The _estimation universe_ is the sample on which factor returns are estimated each period (the cross-sectional regression of [Chapter 6](06-estimating-factor-returns.md)) and from which the factor covariance matrix is built. The _coverage universe_ is everything the model assigns exposures and risk forecasts to, often an order of magnitude larger: an estimation universe of a few thousand names against a coverage universe in the tens of thousands.

## 5.1 Why the two universes differ

- **The estimation universe wants quality.** Factor return estimates are only as good as the returns and characteristics they are fitted on. Illiquid stocks have stale, noisy prices. Micro-caps have patchy fundamental data. Recent IPOs don't have a long return history. All of this would add noise to the factors.
- **The coverage universe wants completeness.** A risk system that cannot describe the odd holding in a client's portfolio is not useful, and you don't want the risk model to be what limits which positions a portfolio can take.

The guiding principle: fit the factors on a clean core, then extend the model to everything else. The estimation universe is a curated, liquid, representative subset. The coverage universe is everything, with explicit machinery (see Section 5.4) for stretching the model over assets it was not fitted on.

Neither universe is a benchmark. The estimation universe is a statistical sample chosen for clean estimation, not an investable index and not the yardstick a portfolio is measured against. The two blur together because both tend to be cap-screened lists of large, liquid names, but they answer different questions.

## 5.2 Designing an estimation universe

Sizing the estimation universe is a trade-off in itself. More names give each factor return more cross-sectional observations and identify the factors better. Fewer, cleaner names raise the average quality of the data the factors are fitted on.

Several criteria set estimation-universe membership:

**Liquidity and price quality:** Minimum trading frequency, minimum median daily volume, and minimum free float, so that each stock's return on each date reflects a real market-clearing price, not a stale quote.

**Minimum size:** Minimum market cap, either absolute, or "top X% of cumulative cap per market". Data quality correlates strongly with size, and small stocks' returns are dominated by microstructure noise irrelevant to factor estimation.
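A minimal sketch of the "top X% of cumulative cap" screen for a single market. The function name, the 90% threshold, and the sample caps are illustrative assumptions. The sketch also assumes that the name which crosses the threshold is kept.

```python
def top_cumulative_cap(caps, coverage=0.90):
    """Return names, largest first, until cumulative cap reaches `coverage` of the total.

    caps: dict mapping name -> market cap in any consistent unit.
    The name that crosses the threshold is included (an assumption of this sketch).
    """
    total = sum(caps.values())
    selected, running = [], 0.0
    for name, cap in sorted(caps.items(), key=lambda kv: kv[1], reverse=True):
        if running >= coverage * total:
            break
        selected.append(name)
        running += cap
    return selected

# Hypothetical single-market example, caps in USD bn.
market = {"A": 500.0, "B": 300.0, "C": 150.0, "D": 40.0, "E": 10.0}
core = top_cumulative_cap(market, coverage=0.90)
```

In a multi-market model the same function would be applied to each market separately, so a small market's largest names are not crowded out by a large market's mid-caps.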

**History and listing requirements:** A minimum trading history (e.g., 1–3 months) before admission; primary listings only; and exclusion of non-common-equity types such as warrants, preferred shares, and closed-end funds.

**Representativeness:** The estimation universe must span the dimensions the model has factors for: every industry needs enough members for its factor return to be estimable, every country in a global model needs a critical mass of stocks, and the size spectrum must be wide enough that the size factor is identified. This sometimes argues for including smaller names that pure liquidity screens would drop, a deliberate trade-off.

**Stability buffers:** Membership is re-evaluated on a fixed cadence, monthly or quarterly, using different entry and exit thresholds for a more stable composition. For example, enter the universe above the 85th percentile of liquidity, exit only below the 80th. Stocks hovering at a single threshold would otherwise flip in and out every rebalance, injecting churn into factor estimates for no informational reason.
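The buffer rule can be sketched as a single rebalance step. The function and argument names are hypothetical, the 85/80 thresholds are the ones from the example above, and dropping names with no liquidity rank is an assumption of this sketch.

```python
def update_membership(current_members, liquidity_pct, enter_above=85.0, exit_below=80.0):
    """One rebalance of a buffered universe.

    current_members: set of names in the universe before the rebalance.
    liquidity_pct: dict name -> liquidity percentile rank (0-100) at the rebalance.
    A non-member enters only above `enter_above`; a member leaves only below
    `exit_below`. Names missing from `liquidity_pct` are dropped.
    """
    new_members = set()
    for name, pct in liquidity_pct.items():
        if name in current_members:
            if pct >= exit_below:      # incumbents get the benefit of the buffer
                new_members.add(name)
        elif pct > enter_above:        # newcomers must clear the higher bar
            new_members.add(name)
    return new_members

members = {"INCUMBENT"}
ranks = {"INCUMBENT": 82.0, "NEWCOMER": 82.0, "STRONG": 90.0, "WEAK": 60.0}
members = update_membership(members, ranks)
```

INCUMBENT and NEWCOMER have the same rank of 82, yet only the incumbent is a member afterwards. That asymmetry is what stops names near the threshold from flipping in and out.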

**Event handling:** Suspended stocks are frozen or removed until trading resumes. Confirmed-deal M&A targets are typically removed as their returns track deal probability, not factors. Delistings exit with a final return that includes the delisting payoff. Omitting delisting returns biases the return history upward.

**Point-in-time membership:** The universe at date $t$ must be constructed from what was known at $t$, including the stocks that later went to zero. Building history with today's membership list silently drops the failures, introducing _survivorship bias_, which biases factor return history and understates specific risk. [Chapter 16](16-practical-considerations.md) returns to this.
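One way to keep membership point-in-time is to store entry and exit dates rather than a single current list. The sketch below assumes a hypothetical record layout of `(name, first_date, last_date_or_None)`; the names and dates are invented for illustration.

```python
from datetime import date

# Hypothetical membership records: (name, first date in universe, last date or None).
MEMBERSHIP = [
    ("ALPHA", date(2015, 1, 30), None),
    ("BETA", date(2016, 6, 30), date(2019, 3, 29)),  # later delisted
    ("GAMMA", date(2018, 9, 28), None),
]

def universe_at(records, as_of):
    """Names that were members on `as_of`, whatever happened to them afterwards."""
    return sorted(
        name
        for name, first, last in records
        if first <= as_of and (last is None or as_of <= last)
    )

history_2017 = universe_at(MEMBERSHIP, date(2017, 12, 29))
latest = universe_at(MEMBERSHIP, date(2020, 1, 31))
```

BETA appears in the 2017 universe even though it is absent from the latest one. Backfilling 2017 with the latest list would drop it, along with whatever losses preceded its delisting.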

## 5.3 Weighting within the estimation universe

Membership is binary, but each stock can have a different weight. Within the regression, stocks are weighted, conventionally by the square root of market cap, so that large, well-measured stocks influence factor estimates more, but not in proportion to their extremely skewed caps. This is a statistical-efficiency choice, not a universe-definition choice. See [Chapter 6](06-estimating-factor-returns.md), Section 6.3, for its derivation.
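A small sketch of how square-root weighting compresses the cap skew. The function name and the three caps are hypothetical, and normalizing the weights to sum to one is a convenience of this sketch rather than a requirement of the regression.

```python
import math

def sqrt_cap_weights(caps):
    """Regression weights proportional to sqrt(market cap), normalized to sum to 1."""
    roots = {name: math.sqrt(cap) for name, cap in caps.items()}
    total = sum(roots.values())
    return {name: root / total for name, root in roots.items()}

caps = {"BIG": 400.0, "MID": 100.0, "SMALL": 25.0}  # hypothetical caps, USD bn
weights = sqrt_cap_weights(caps)

# The cap ratio BIG/SMALL is 16; the weight ratio is sqrt(16) = 4.
weight_ratio = weights["BIG"] / weights["SMALL"]
```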

## 5.4 Extending the model to the coverage universe

A stock outside the estimation universe still needs a complete model description: exposures, a specific risk forecast, and thereby a full row/column of the implied covariance matrix. The factor covariance $F$ is not re-estimated to include it; $F$ is fit once on the estimation universe, and a coverage asset enters $\Sigma = XFX^\top + \Delta$ only through its own exposure row in $X$ and its specific-risk entry in $\Delta$. This is how it's done:

**Exposures:** Same recipe, same standardization parameters. Compute the stock's descriptors and standardize them using the estimation universe's $\mu_k$ (cap-weighted mean) and $\sigma_k$ (equal-weighted std), not statistics recomputed over the coverage universe. The estimation universe defines the model's coordinate system and coverage assets are expressed in it. A micro-cap may then have SIZE exposure −4. This is outside the typical estimation-universe range, and legitimately so.
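A minimal sketch of scoring a coverage asset in the estimation universe's coordinates. The function names and sample caps are hypothetical, log market cap stands in for the raw SIZE descriptor, and the equal-weighted standard deviation is taken in population form around the equal-weighted mean; all of these are assumptions of the sketch, not the chapter's definitions.

```python
import math

def reference_stats(values, caps):
    """Estimation-universe statistics for one descriptor.

    Returns (cap-weighted mean, equal-weighted standard deviation).
    The std is the population form around the equal-weighted mean (an assumption).
    """
    mu = sum(v * c for v, c in zip(values, caps)) / sum(caps)
    ew_mean = sum(values) / len(values)
    sigma = math.sqrt(sum((v - ew_mean) ** 2 for v in values) / len(values))
    return mu, sigma

def standardize(raw, mu, sigma):
    """Express a raw descriptor value in the estimation universe's coordinates."""
    return (raw - mu) / sigma

# Hypothetical estimation universe, caps in USD bn.
est_caps = [200.0, 100.0, 50.0, 25.0]
est_size = [math.log(c) for c in est_caps]
mu, sigma = reference_stats(est_size, est_caps)
est_z = [standardize(v, mu, sigma) for v in est_size]

# Coverage-universe small name (USD 1bn), scored with the SAME mu and sigma.
coverage_z = standardize(math.log(1.0), mu, sigma)
```

The coverage asset's score lands well below every estimation-universe score, because `mu` and `sigma` are never recomputed to accommodate it.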

**Missing descriptors:** Coverage universe stocks are precisely the ones with patchy data, so the imputation ladder of [Chapter 3](03-factors-and-exposures.md) (peer-bucket mean -> regression imputation -> zero) is exercised heavily out here, and imputation-rate monitoring matters most for coverage assets.

**Specific risk:** The stock has no residual history, or one too short to use, so time-series estimation fails. The solutions, in increasing order of structure, and in exactly the order [Chapter 8](08-risk-model-assembly.md) develops for the estimation universe itself:

1. _Proxy/peer assignment_: give the stock the average specific risk of its industry–size–country bucket.
2. _Structural model_: predict specific variance from characteristics (size, leverage, volatility descriptors) via a cross-sectional regression fitted on the estimation universe, then apply the fitted function to the coverage asset.
3. _Scaling adjustments_: multiply by empirically calibrated factors for known biases. Recent IPOs run hotter than their characteristics suggest. Very illiquid names' measured volatility understates true risk due to stale prices.
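Steps 1 and 3 of the ladder can be sketched together. Every name here is hypothetical, and so are the bucket keys, the uplift values, and the fallback to a universe-wide average when a bucket is empty. A structural model (step 2) would replace the peer average as the source of the base forecast.

```python
from statistics import mean

def peer_specific_risk(bucket, est_universe):
    """Step 1: average specific risk of estimation-universe stocks in the same bucket."""
    peers = [s["spec_risk"] for s in est_universe if s["bucket"] == bucket]
    return mean(peers) if peers else None

def coverage_specific_risk(asset, est_universe, uplifts):
    """Peer-bucket base forecast, then multiplicative scaling adjustments (step 3)."""
    base = peer_specific_risk(asset["bucket"], est_universe)
    if base is None:
        # Assumed fallback for an empty bucket: the universe-wide average.
        base = mean(s["spec_risk"] for s in est_universe)
    scale = 1.0
    for flag, factor in uplifts.items():
        if asset.get(flag, False):
            scale *= factor
    return base * scale

# Hypothetical data: buckets are (industry, size band, country), risks are annualized.
est = [
    {"name": "E1", "bucket": ("BANKS", "SMALL", "US"), "spec_risk": 0.30},
    {"name": "E2", "bucket": ("BANKS", "SMALL", "US"), "spec_risk": 0.40},
    {"name": "E3", "bucket": ("TECH", "LARGE", "US"), "spec_risk": 0.20},
]
UPLIFTS = {"recent_ipo": 1.20, "illiquid": 1.10}  # illustrative, not calibrated

new_listing = {"name": "C1", "bucket": ("BANKS", "SMALL", "US"), "recent_ipo": True}
forecast = coverage_specific_risk(new_listing, est, UPLIFTS)
```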

**Recent IPOs:** They combine every problem: no return history, no momentum descriptor (needs 12 months), thin fundamentals. Typical treatment: industry/peer exposures where descriptors are missing, structural specific risk with an IPO uplift, and a flag so users know the forecast is model-of-model.
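Once exposures and a specific-risk forecast are assigned, the coverage asset's entries in the covariance matrix follow directly from $\Sigma = XFX^\top + \Delta$. Writing $x_c$ for the asset's exposure row and $\delta_c^2$ for its specific variance (notation introduced here for illustration), and taking $\Delta$ as diagonal:

$$
\operatorname{Var}(r_c) = x_c^\top F\, x_c + \delta_c^2,
\qquad
\operatorname{Cov}(r_c, r_i) = x_c^\top F\, x_i \quad (i \neq c).
$$

Nothing in $F$ changes when the asset is added, which is why coverage can be extended without re-running the estimation.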

## 5.5 Universe effects on model behavior

The estimation universe changes what the factors are:

- **Factor returns are relative to the universe.** The value factor return estimated on a large-cap universe is the payoff to cheapness among large caps. On an all-cap universe it blends in the small-cap value effect. Two vendors' "value factor returns" can differ materially in the same month for this reason alone, before any difference in descriptor recipes.
- **Standardization shifts with the universe.** $\mu_k$ and $\sigma_k$ are universe statistics. The same stock has different z-scores against different reference populations. A USD 2bn company is small in a large-cap model, middling in an all-cap one.
- **Practical consequence.** A small-cap manager risk-analyzed with a large-cap-estimated model gets exposures measured against the wrong yardstick and factor returns driven by the wrong segment. Matching the model's estimation universe to the investment universe is a first-order model-selection criterion ([Chapter 14](14-model-evaluation.md)).

## 5.6 Estimation-universe span and extrapolation

Think of the factors as a coordinate system and each stock as a location in it. The estimation universe occupies a region of that space, and you want it to span the region your portfolio lives in.

The charts below show a simplified example for a model with two factors, SIZE and VALUE. The estimation universe (green) is identical in both panels. Panel (a): the portfolio (red) sits inside the estimation universe's span, so its exposures are interpolations the model can stand behind. Panel (b): the same portfolio is tilted toward smaller SIZE and higher VALUE and spread far wider. Its core still overlaps the estimation universe, but much of the book sits where the estimation universe has no members, so those exposures are extrapolated and less reliable.

![Two scatter plots in SIZE–VALUE coordinates: the portfolio sits inside the estimation universe's span in panel (a) and extends well beyond it in panel (b).](figures/universe-span.svg)

This is the same boundary Section 5.4 flagged. A name whose exposures are extrapolated past the estimation universe's span is usually also one whose specific risk is a model-of-model forecast, so a single out-of-universe flag marks both problems at once: exposures at their least reliable, and a risk forecast at its least reliable.
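A crude way to operationalize the span check is a per-factor bounding box. This is a simplification assumed for this sketch: it ignores the joint distribution, so a stock can sit inside the box and still be in an empty corner of the estimation universe. Function names and exposure values are hypothetical.

```python
def span_box(est_exposures):
    """Per-factor (min, max) of estimation-universe exposures."""
    factors = est_exposures[0].keys()
    return {
        f: (min(x[f] for x in est_exposures), max(x[f] for x in est_exposures))
        for f in factors
    }

def extrapolated_factors(stock, box):
    """Factors on which the stock's exposure falls outside the estimation-universe range."""
    return sorted(f for f, (lo, hi) in box.items() if not lo <= stock[f] <= hi)

# Hypothetical two-factor estimation universe.
est_exposures = [
    {"SIZE": -1.0, "VALUE": 0.5},
    {"SIZE": 2.0, "VALUE": -1.5},
    {"SIZE": 0.3, "VALUE": 1.8},
]
box = span_box(est_exposures)

inside = extrapolated_factors({"SIZE": 0.5, "VALUE": 0.2}, box)
outside = extrapolated_factors({"SIZE": -3.2, "VALUE": 2.5}, box)
```

A portfolio-level version could report the share of weight held in names with a non-empty result, which roughly distinguishes panel (a) from panel (b).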

## 5.7 Linked assets

Different securities can represent the same company: ADRs/GDRs vs. local shares, dual listings (BHP Sydney/London), multiple share classes (GOOG/GOOGL). Treating the lines as independent stocks violates the assumption that residual returns are uncorrelated, i.e. that $\Delta$ is diagonal. The residual returns of linked assets can be almost perfectly correlated.

Standard treatment: designate a primary line (in the estimation universe, if eligible). Linked lines inherit the primary's factor exposures, with currency factors adjusted for listing currency in a global model: an ADR carries the underlying's local-market exposure but a currency exposure that differs from the underlying's. Specific returns of linked lines are modeled as the primary's specific return plus a small _line-specific_ spread (ADR premium/discount noise, share-class spread). Concretely, the lines share a specific-risk _block_ rather than independent diagonal entries.
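The shape of that block can be worked out under one added assumption: the line-specific spread is uncorrelated with the primary's specific return. With $u_P$ the primary's specific return, $u_L = u_P + \eta$ the linked line's, $\operatorname{Var}(u_P) = \sigma_P^2$, and $\operatorname{Var}(\eta) = \sigma_\eta^2$ (symbols introduced here for illustration):

$$
\operatorname{Cov}\!\begin{pmatrix} u_P \\ u_L \end{pmatrix}
=
\begin{pmatrix}
\sigma_P^2 & \sigma_P^2 \\
\sigma_P^2 & \sigma_P^2 + \sigma_\eta^2
\end{pmatrix},
\qquad
\operatorname{Corr}(u_P, u_L) = \frac{\sigma_P}{\sqrt{\sigma_P^2 + \sigma_\eta^2}}.
$$

When $\sigma_\eta$ is small relative to $\sigma_P$ the correlation approaches one. That is the near-perfect residual correlation that a diagonal $\Delta$ would miss.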

## 5.8 Regional vs. global models

A multi-country model adds a layer of universe design: a global estimation universe partitioned by country, with country membership factors and currency factors, alongside global industry and style factors. In the standard structure, both the country set and the industry set are partitions of the market, each collinear with MKT(world) unless constrained. A French bank loads on MKT(world), FRANCE, BANKS, EUR, and styles at once. Constraints ([Chapter 6](06-estimating-factor-returns.md)) resolve the double-counting, and the country factors absorb country-specific market moves.

Why do the global and regional models disagree about the same portfolio? The reason is structural: the regional model estimates, say, European value on European stocks only, standardizes against a European mean, and resolves industries at European granularity. The global model does all three globally. Both are right in their own coordinate systems. Rule of thumb: use the model whose estimation universe best matches the mandate. Expect the global model to be the better lens on cross-region portfolios precisely because cross-region correlations are baked into its estimation problem, not stitched on.

## 5.9 The mini example's universes

The MiniModel's 10 stocks are both estimation and coverage universe. The toy is too small for the distinction. But the design shows through anyway: caps range USD 10–150bn, every industry has at least 3 members so that each industry factor is estimable, styles span roughly ±2σ, and the regression weights of [Chapter 6](06-estimating-factor-returns.md) will be proportional to $\sqrt{\text{cap}}$, giving AXIOM 3.9x the influence of DIGIT rather than the 15x that cap-weighting would.
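The 3.9x figure is just the square root of the cap ratio. Assuming AXIOM and DIGIT sit at the two ends of the stated cap range, which the 15x cap-weighting ratio implies:

$$
\frac{w_{\text{AXIOM}}}{w_{\text{DIGIT}}} = \sqrt{\frac{150}{10}} = \sqrt{15} \approx 3.87.
$$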

## 5.10 Summary

- **Estimate on the clean core, cover the messy whole.** The estimation universe is curated for return and data quality with stability buffers and point-in-time discipline. The model is then extended to the coverage universe by reusing the estimation universe's standardization, applying structural specific-risk models, and letting linked assets inherit from their primary line.
- Universe choice is a modeling decision with visible consequences: it defines what each factor return means and which segment drives it.
- Survivorship and look-ahead discipline in universe construction is as important as in the data itself.

---

_Next: [Chapter 6: Estimating Factor Returns](06-estimating-factor-returns.md)_
