What are recommender systems optimizing for?
Recommender systems rank a small set of items to show a user next: a top 5 on a homepage, a top 10 in a widget, or a top 20 in search. Evaluation asks a simple question with many nuances: did we put the right things high enough, soon enough?
The trouble is that “right things” and “high enough” vary by product. That’s why multiple metrics exist. Each one was invented to fix a limitation of a prior one. We’ll start with the simplest notion of success (Hit@k) and layer on the missing pieces—wasted slots, coverage, time‑to‑first‑good, and graded relevance—until we arrive at NDCG.
Think in terms of a user request (query/session), a ranked list of items, and a set of truly relevant items for that request. Metrics compare the ranked list against that relevant set.
Setup and notation (kept simple)
- We evaluate per user request, then average over many requests.
- At each request, the model returns a ranked list R = [i₁, i₂, i₃, …].
- The ground‑truth relevant set is G (binary relevance unless stated otherwise).
- We often care only about the top k results (@k).
We will use the same tiny example throughout:
- Relevant items: G = {A, C, F} (3 items)
- Model ranking: R = [B, A, D, C, E, F, ...]
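For concreteness, here is that toy example as plain Python objects. This is only a sketch of the data shapes assumed by the code snippets that follow; the names G and R simply mirror the notation above.

```python
# Toy example used throughout (names mirror the notation above).
G = {"A", "C", "F"}                 # ground-truth relevant set for this request
R = ["B", "A", "D", "C", "E", "F"]  # model ranking, best first (tail omitted)
```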
Problem 1: “Show at least one good item.” → Hit@k
Sometimes one good result is enough (e.g., first click in a session). Early systems used a binary success: did the user see any relevant item in the first k?
- Definition: Hit@k = 1 if R[1..k] ∩ G ≠ ∅, else 0. Average over requests.
- In our example: Hit@1 = 0 (top‑1 is B), Hit@2 = 1 (since A is at rank 2).
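A minimal sketch in Python, assuming binary relevance with the relevant set as a set and the ranking as a list (the function name hit_at_k is illustrative, not from any library):

```python
def hit_at_k(ranked, relevant, k):
    """Return 1.0 if any of the top-k ranked items is relevant, else 0.0."""
    return 1.0 if set(ranked[:k]) & set(relevant) else 0.0

G = {"A", "C", "F"}
R = ["B", "A", "D", "C", "E", "F"]
print(hit_at_k(R, G, 1))  # 0.0: top-1 is B
print(hit_at_k(R, G, 2))  # 1.0: A appears at rank 2
```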
Why it exists: simplest possible correctness signal for “did we get them something useful quickly?”
Where it falls short:
- Ignores how many relevant items you returned.
- Ignores ordering among relevant items beyond the first found.
What fixes it: Precision@k adds a penalty for wasting slots.
Problem 2: “Don’t waste slots.” → Precision@k
UI real estate is constrained. If you show 10 items, how many are relevant?
- Definition: Precision@k = |R[1..k] ∩ G| / k.
- Example: in R, the top 5 is [B, A, D, C, E], with relevant items {A, C} → Precision@5 = 2/5 = 0.40.
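A sketch under the same assumptions (binary relevance, top‑k prefix of a Python list): count the relevant hits in the top k and divide by k.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k slots occupied by relevant items."""
    return len(set(ranked[:k]) & set(relevant)) / k

G = {"A", "C", "F"}
R = ["B", "A", "D", "C", "E", "F"]
print(precision_at_k(R, G, 5))  # 0.4: {A, C} fill 2 of the 5 slots
```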
Why it exists: penalizes spraying the list with junk. Good for modules with fixed‑size carousels.
Where it falls short:
- Ignores missed relevant items beyond k.
- Two lists with the same set in the top k but different ordering get the same Precision.
What fixes it: Recall@k checks coverage of the relevant set.
Problem 3: “Did we retrieve most of what mattered?” → Recall@k
When coverage matters (e.g., returning most of the user’s likely interests), we care about how many relevant items we retrieved, not how many non‑relevant we showed.
- Definition: Recall@k = |R[1..k] ∩ G| / |G|.
- Example: there are |G| = 3 relevant items overall. In the top 5 we have {A, C} → Recall@5 = 2/3 ≈ 0.67.
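The only change from the Precision@k sketch is the denominator: divide by |G| instead of k (same illustrative setup as before).

```python
def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant items that appear in the top k."""
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)

G = {"A", "C", "F"}
R = ["B", "A", "D", "C", "E", "F"]
print(recall_at_k(R, G, 5))  # 0.666...: 2 of the 3 relevant items retrieved
```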
Why it exists: complements Precision. Useful in stages like candidate generation or when users consume multiple items.
Where it falls short:
- Doesn’t reward putting the most relevant items higher.
- Two lists with the same relevant set in top‑k but different positions get the same Recall.
What fixes it: MRR emphasizes how early the first good item appears.
Precision and Recall trade off. You usually report both (at the same k) or track a curve by varying thresholds.
Problem 4: “The first good item should come early.” → MRR
In search‑like interactions, one click often ends the task. We need a metric focused on the first relevant result.
- Per‑request score: RR = 1 / rank(first relevant), or 0 if no relevant item is found.
- Mean Reciprocal Rank: MRR = mean(RR) over requests.
- Example: the first relevant item in R is A at rank 2 → RR = 1/2 = 0.5.
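A sketch of the per‑request reciprocal rank; MRR is then the average of this value over many requests (the loop and names are illustrative, not a reference implementation):

```python
def reciprocal_rank(ranked, relevant):
    """Return 1 / rank of the first relevant item (1-indexed); 0.0 if none is found."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0

G = {"A", "C", "F"}
R = ["B", "A", "D", "C", "E", "F"]
print(reciprocal_rank(R, G))  # 0.5: first relevant item (A) sits at rank 2
# MRR = mean of reciprocal_rank(...) over all evaluated requests.
```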
Why it exists: captures “time‑to‑first‑relevant.” Perfect when a single click is the main goal.
Where it falls short:
- Ignores additional relevant items after the first.
- Doesn’t support graded relevance.
What fixes it: NDCG is position‑aware and supports multiple/graded relevance.
Problem 5: “Reward multiple and more relevant items at the top.” → NDCG
We need a position‑aware metric that:
- Rewards putting more relevant items higher.
- Supports graded relevance (e.g., strong vs weak relevance).
Discounted Cumulative Gain (DCG) does this by applying a log discount by position. With binary relevance relᵢ ∈ {0,1} at rank i:
- DCG@k = Σᵢ₌₁..k relᵢ / log₂(i + 1)
To make scores comparable across queries (different numbers of relevant items), we normalize by the best possible ordering (IDCG):
NDCG@k = DCG@k / IDCG@k, where IDCG@k is the DCG of the ideal ordering for that query.
Example with our list (binary relevance):
- Relevance by rank for the top 5: [0, 1, 0, 1, 0] → DCG@5 = 1/log2(3) + 1/log2(5) ≈ 0.6309 + 0.4307 = 1.0616.
- The ideal ordering puts all three relevant items first, so the ideal top‑5 relevance is [1, 1, 1, 0, 0] → IDCG@5 = 1/log2(2) + 1/log2(3) + 1/log2(4) = 1 + 0.6309 + 0.5 = 2.1309.
- NDCG@5 ≈ 1.0616 / 2.1309 ≈ 0.50.
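A minimal sketch matching the formulas above. Relevance is passed as an item‑to‑gain mapping so graded labels drop in without code changes; the helper names are illustrative, not a library API.

```python
import math

def dcg_at_k(gains, k):
    """Sum of rel_i / log2(i + 1) over the top-k positions (i is 1-indexed)."""
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(gains[:k], start=1))

def ndcg_at_k(ranked, relevance, k):
    """DCG of the given ranking divided by DCG of the ideal (sorted) ranking."""
    gains = [relevance.get(item, 0.0) for item in ranked]
    ideal = sorted(relevance.values(), reverse=True)
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / idcg if idcg > 0 else 0.0

relevance = {"A": 1.0, "C": 1.0, "F": 1.0}  # binary gains for the toy example
R = ["B", "A", "D", "C", "E", "F"]
print(ndcg_at_k(R, relevance, 5))  # ~0.498, i.e. 1.0616 / 2.1309
```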
Why it exists: a principled, position‑aware metric that handles multiple and graded relevances and is comparable across queries.
Where it falls short:
- More complex to compute and explain.
- Choice of k and gain function matters in practice.
Side‑by‑side on the same toy example
Using G = {A, C, F} and R = [B, A, D, C, E, F]:
- Hit@1 = 0, Hit@2 = 1
- Precision@5 = 2/5 = 0.40
- Recall@5 = 2/3 ≈ 0.67
- MRR = 1/2 = 0.50
- NDCG@5 ≈ 0.50
These numbers are not interchangeable—they each answer a different question.
Which metric for which problem?
- Hit@k: first‑click experiences; session kickoff; “any good item quickly?”
- Precision@k: constrained UI slots; avoid wasting space on non‑relevant items.
- Recall@k: coverage; users consume many items; candidate generation stages.
- MRR: single‑click tasks; navigational search; Q&A where one result solves it.
- NDCG@k: general ranking; multiple relevant items; graded relevance; order matters.
Pick k to match the surface (e.g., show 10 → report @10). If the UI is scrollable, report multiple k’s (e.g., @5, @10, @20).
Practical guidance
- Always clarify the user goal: one great item vs many good ones.
- Report a small, complementary set: e.g., Precision@k, Recall@k, and NDCG@k together.
- Keep offline and online consistent: the offline objective you optimize should predict the online KPI you care about.
- Mind class imbalance: popular items can inflate perceived performance; stratify by segment where relevant.
- Use graded labels when possible: NDCG shines when you have "strong" vs "weak" positives.
- Validate k: if most clicks happen in the top 3, your primary report should be @3.
Closing thought
Metrics are tools. Start from the product question (“what does success look like for the user here?”), then choose the metric that best answers it. If your team can explain not just what moved, but why this metric is the right one for the job, you’re evaluating well.