Suppose you want to rank something with LLMs. How do you do it?
- Asking the LLM to come up with a score? ❌
- Asking the LLM to rank all documents in a single-shot prompt? ❌
Based on several ranking experiments I've run, it's clear that neither of these approaches works well enough to produce robust rankings that aren't plagued by stochasticity.
Here's a scalable approach I came up with that currently runs across half a million candidates with very high precision.
![[MatchPoint_Shresth_Rana (1).pdf]]
You ever ask GPT to rate something and get the same 7/10 energy every time? Same here.
So instead of begging it to assign scores, we made it _fight_. Specifically, we ran **a Swiss-style tournament where transcripts of job interviews battle each other**; GPT-4 is the judge, MatchPoint is the rating system, and every decision is binary: A or B.
After 50+ rounds, the dust settles and out comes a ranking. The wild part? It lines up with expert human judgment _shockingly_ well.
---
## The Problem: LLMs are Trash at Absolute Scoring
Scoring systems like 1 to 10 or “rate this transcript” don’t work with LLMs. Why?
- Scores are uncalibrated across items
- The same prompt can yield wildly different outputs
- There’s no real notion of relative quality
We tried prompt engineering. We tried fine-tuning. None of it made the scores more useful.
---
## The Solution: Let GPT Choose Winners, Not Scores
We flipped the task: no more "rate this."
Instead:
1. Pair two transcripts.
2. Ask GPT: "Which candidate is stronger?"
3. It picks one. No explanations. Just “A” or “B”.
4. Repeat for N rounds using Swiss pairing logic.
5. Use an Elo-style rating (we call it MatchPoint) to sort the results.
Do it multiple times with different seeds. Average out the rankings. Done.
It’s like a chess tournament for interviews, with the LLM as the only judge.
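To make the matchup step concrete, here's a minimal sketch of a single A-vs-B judgment, assuming the OpenAI Python SDK with GPT-4o-mini as the judge. The prompt wording and the `judge` helper are illustrative placeholders, not the exact production setup.

```python
# Minimal sketch of one pairwise matchup, assuming the OpenAI Python SDK.
# The prompt text, model choice, and helper name are illustrative placeholders.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = (
    "You are comparing two job-interview transcripts.\n\n"
    "Transcript A:\n{a}\n\n"
    "Transcript B:\n{b}\n\n"
    "Which candidate is stronger? Answer with a single letter: A or B."
)

def judge(transcript_a: str, transcript_b: str) -> str:
    """Ask the model for a binary verdict; no scores, no explanations."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(a=transcript_a, b=transcript_b),
        }],
    )
    answer = resp.choices[0].message.content.strip().upper()
    return "A" if answer.startswith("A") else "B"
```

Forcing the output down to a single letter is the whole trick: there's no score to drift, just a preference.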
---
## Why This Works
- GPT is way better at **comparisons** than **absolute judgments**.
- Binary decisions are less noisy.
- Swiss pairing means fewer rounds than round-robin, but better coverage than random.
- MatchPoint (custom Elo variant) balances exploration and exploitation.
Ratings stabilize fast, and even though the tournament keeps running for many more rounds, what comes out the other end is a legit leaderboard of candidates.
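Here's a rough sketch of the loop around that judge call: Swiss-style pairing on current ratings plus a vanilla Elo update. MatchPoint itself is a custom variant, so treat the K-factor, starting rating, and rematch handling below as simplified assumptions rather than the real rating system.

```python
# Sketch of the tournament loop: Swiss-style pairing plus a standard Elo update.
# MatchPoint is a custom Elo variant; the constants and pairing rules here are
# simplified stand-ins.
import random
from collections import defaultdict

def expected(r_a: float, r_b: float) -> float:
    """Standard Elo expectation that A beats B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def swiss_pairs(ratings: dict[str, float], played: dict[str, set]) -> list[tuple[str, str]]:
    """Pair candidates with similar ratings, skipping rematches when possible."""
    order = sorted(ratings, key=ratings.get, reverse=True)
    pairs, used = [], set()
    for i, a in enumerate(order):
        if a in used:
            continue
        for b in order[i + 1:]:
            if b not in used and b not in played[a]:
                pairs.append((a, b))
                used.update({a, b})
                break
        # candidates with no fresh opponent simply sit this round out
    return pairs

def run_tournament(transcripts: dict[str, str], judge, rounds: int = 50,
                   k: float = 32.0, seed: int = 0) -> dict[str, float]:
    rng = random.Random(seed)
    ids = list(transcripts)
    rng.shuffle(ids)                       # seed only affects the arbitrary first-round order
    ratings = {cid: 1500.0 for cid in ids}
    played = defaultdict(set)
    for _ in range(rounds):
        for a, b in swiss_pairs(ratings, played):
            winner = judge(transcripts[a], transcripts[b])   # "A" or "B"
            score_a = 1.0 if winner == "A" else 0.0
            exp_a = expected(ratings[a], ratings[b])
            ratings[a] += k * (score_a - exp_a)
            ratings[b] -= k * (score_a - exp_a)              # zero-sum update
            played[a].add(b)
            played[b].add(a)
    return ratings
```

Running `run_tournament` a few times with different `seed` values and averaging each candidate's rank across runs gives the final leaderboard.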
---
## The Results
- Candidates that were clearly strong? Consistently floated to the top.
- Flaky/mid-tier candidates? Bounced around or collapsed.
- GPT’s picks correlated tightly with human evaluators.
- The system is robust across runs and is cheap.
We used GPT-4o-mini: fast and good enough.
---
## Tradeoffs & Weirdness
- GPT still carries its biases; you’re just bottling them into preferences.
- No rationale — it picks A or B, but doesn’t tell you why.
- Longer transcripts may have unfair advantages.
- Prompting still matters (slightly).
But for ranking 100+ interviews? It _slaps_.
---
## Where Else This Works
Anywhere you’ve got a ton of messy qualitative input and need a ranked list. Basically, anything where the human time cost of reviewing exceeds an hour.
---
## Takeaways
- Absolute LLM scoring is dead.
- Relative matchups are stable, scalable, and better aligned with human intuition.
- GPT isn't great at rating, but it's surprisingly good at judging.
So yeah, we made GPT run a tournament.
And it kind of rules.