Suppose you want to rank something with LLMs. How do you do it?

- Ask the LLM to come up with a score? ❌
- Ask the LLM to rank all documents in a single-shot prompt? ❌

Based on several ranking experiments I've run, it's clear that neither of these approaches works well enough for robust rankings; the results are too stochastic. Here's a scalable approach I came up with that currently works across half a million candidates with extreme precision.

![[MatchPoint_Shresth_Rana (1).pdf]]

You ever ask GPT to rate something and get the same 7/10 energy every time? Same here. So instead of begging it to assign scores, we made it _fight_.

Specifically, we ran **a Swiss-style tournament where transcripts of job interviews battle each other**; GPT-4 is the judge, MatchPoint is the rating system, and every decision is binary: A or B. After 50+ rounds, the dust settles and out comes a ranking.

The wild part? It lines up with expert human judgment _shockingly_ well.

---

## The Problem: LLMs are Trash at Absolute Scoring

Scoring systems like 1 to 10 or "rate this transcript" don't work with LLMs. Why?

- Scores are uncalibrated across items
- The same prompt can yield wildly different outputs
- There's no real notion of relative quality

We tried prompt engineering. We tried fine-tuning. None of it made the scores more useful.

---

## The Solution: Let GPT Choose Winners, Not Scores

We flipped the task: no more "rate this." Instead:

1. Pair two transcripts.
2. Ask GPT: "Which candidate is stronger?"
3. It picks one. No explanations. Just "A" or "B".
4. Repeat for N rounds using Swiss pairing logic.
5. Use an Elo-style rating (we call it MatchPoint) to sort the results.

Do it multiple times with different seeds. Average out the rankings. Done.

It's like a chess tournament for interviews, and the LLM is the only judge. (A minimal code sketch of the whole loop is at the bottom of this post.)

---

## Why This Works

- GPT is way better at **comparisons** than **absolute judgments**.
- Binary decisions are less noisy.
- Swiss pairing means fewer rounds than round-robin, but better coverage than random pairing.
- MatchPoint (a custom Elo variant) balances exploration and exploitation.

Ratings stabilize fast. The tournament keeps running well past that point, but what you end up with is a legit leaderboard of candidates.

---

## The Results

- Candidates that were clearly strong? Consistently floated to the top.
- Flaky/mid-tier candidates? Bounced around or collapsed.
- GPT's picks correlated tightly with human evaluators.
- The system is robust across runs and is cheap. Used GPT-4o-mini; fast and good enough.

---

## Tradeoffs & Weirdness

- GPT still carries its biases; you're just bottling them into preferences.
- No rationale: it picks A or B, but doesn't tell you why.
- Longer transcripts may have an unfair advantage.
- Prompting still matters (slightly).

But for ranking 100+ interviews? It _slaps_.

---

## Where Else This Works

Anywhere you've got a ton of messy qualitative input and need a ranked list. Basically, anything where the human time cost of reviewing exceeds an hour.

---

## Takeaways

- Absolute LLM scoring is dead.
- Relative matchups are stable, scalable, and better aligned with human intuition.
- GPT isn't great at rating, but it's surprisingly good at judging.

So yeah: we made GPT run a tournament. And it kind of rules.
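
---

## Appendix: A Minimal Sketch in Code

None of the code below is the production system. It's a rough Python sketch of the ideas above; the prompt wording, model name, helper names, and constants are all illustrative.

First, the binary judge. The only thing the model is allowed to say is "A" or "B":

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Illustrative prompt; the real judging prompt is not published in this post.
JUDGE_PROMPT = (
    "You are comparing two job-interview transcripts.\n\n"
    "Transcript A:\n{a}\n\n"
    "Transcript B:\n{b}\n\n"
    "Which candidate is stronger? Reply with a single letter: A or B."
)


def judge(transcript_a: str, transcript_b: str, model: str = "gpt-4o-mini") -> str:
    """Ask the LLM for a binary verdict; returns 'A' or 'B'."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(a=transcript_a, b=transcript_b)}],
        temperature=0,
        max_tokens=1,
    )
    answer = (resp.choices[0].message.content or "").strip().upper()
    return "A" if answer.startswith("A") else "B"
```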
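
Next, the rating update. MatchPoint is only described here as a custom Elo variant, so a plain Elo update stands in for it; `k=32` is the usual chess default, not a tuned value:

```python
def elo_update(r_winner: float, r_loser: float, k: float = 32.0) -> tuple[float, float]:
    """Standard Elo update: the winner takes points from the loser in proportion
    to how surprising the result was given their current ratings."""
    expected_win = 1.0 / (1.0 + 10 ** ((r_loser - r_winner) / 400.0))
    delta = k * (1.0 - expected_win)
    return r_winner + delta, r_loser - delta
```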
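
Finally, the tournament itself, continuing the sketch above (it reuses `judge` and `elo_update`). "Swiss pairing" here just means sorting by current rating and pairing neighbours, which is one common way to do it; the post doesn't specify the exact pairing rule. Re-running with different seeds and averaging rank positions is the last step described earlier:

```python
import random
from statistics import mean


def swiss_pairs(ratings: dict[str, float], rng: random.Random) -> list[tuple[str, str]]:
    """Swiss-style pairing: sort by current rating (jittering ties) and pair neighbours.
    With an odd number of candidates, the last one simply sits out the round."""
    ordered = sorted(ratings, key=lambda cid: (ratings[cid], rng.random()), reverse=True)
    return [(ordered[i], ordered[i + 1]) for i in range(0, len(ordered) - 1, 2)]


def run_tournament(transcripts: dict[str, str], rounds: int = 50, seed: int = 0) -> dict[str, float]:
    """One full Swiss tournament; returns the final rating for each candidate id."""
    rng = random.Random(seed)
    ratings = {cid: 1500.0 for cid in transcripts}
    for _ in range(rounds):
        for a, b in swiss_pairs(ratings, rng):
            # Randomise presentation order so position bias doesn't always favour the same slot.
            first, second = (a, b) if rng.random() < 0.5 else (b, a)
            winner = first if judge(transcripts[first], transcripts[second]) == "A" else second
            loser = b if winner == a else a
            ratings[winner], ratings[loser] = elo_update(ratings[winner], ratings[loser])
    return ratings


def leaderboard(transcripts: dict[str, str], seeds=(0, 1, 2), rounds: int = 50) -> list[str]:
    """Run several tournaments with different seeds and sort by average rank position."""
    positions: dict[str, list[int]] = {cid: [] for cid in transcripts}
    for seed in seeds:
        ratings = run_tournament(transcripts, rounds=rounds, seed=seed)
        for rank, cid in enumerate(sorted(ratings, key=ratings.get, reverse=True)):
            positions[cid].append(rank)
    return sorted(positions, key=lambda cid: mean(positions[cid]))
```

Calling `leaderboard({"cand_1": "...", "cand_2": "...", ...})` returns candidate ids ordered from strongest to weakest, in the judge's opinion.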