Prompt engineering is a fascinating game. How you word your prompt (the input you give an LLM) can drastically change its response. Heck, even giving the same prompt multiple times can yield mixed results.
The problem is that the LLM will tell you with complete confidence each time that its answer is correct. Try telling it that it answered differently before, and you’ll be told that one of the two results was a mistake. So you would be right to hesitate before trusting anything it says. But what if you could weight how the LLM responds to favour certainty?
Token Probability Distributions Explained
Please don’t yawn at the heading, trust me, it’s interesting!
When you prompt an LLM and it appears to be “thinking” about its response, what it is actually doing is:
- It generates a probability distribution over the next possible tokens (think of a token as a whole or part of a word)
- It selects (samples) the next token from that distribution
- It repeats this for every token in the response
So, if you give an LLM “Write me a short story beginning, ‘The cat sat on the…’”, the model may assign “mat” 98% and “tree” 2%, so it is far more likely to choose “mat” as the next token. Then, with the sentence now being “The cat sat on the mat”, it assigns a fresh set of probabilities and continues the cycle. These probabilities come from the training data: the model has seen “mat” follow this sequence far more often than “tree”, so it is more likely to select that token.
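The sample-and-repeat cycle can be sketched in a few lines of Python. This is a toy illustration only: the `TOY_MODEL` lookup table, its probabilities, and the function names are all made up for the example, whereas a real LLM computes a fresh distribution over its whole vocabulary at every step.

```python
import random

# Toy next-token distributions, keyed by the text so far.
# The numbers are invented for illustration; a real model computes them on the fly.
TOY_MODEL = {
    "The cat sat on the": {"mat": 0.98, "tree": 0.02},
    "The cat sat on the mat": {".": 0.7, "and": 0.3},
}

def sample_next_token(context, rng):
    """Pick the next token at random, weighted by its probability."""
    distribution = TOY_MODEL[context]
    tokens = list(distribution)
    weights = list(distribution.values())
    return rng.choices(tokens, weights=weights, k=1)[0]

def generate(context, rng, max_tokens=5):
    """Repeat the sample-and-append cycle until the toy model has nothing left."""
    for _ in range(max_tokens):
        if context not in TOY_MODEL:
            break
        token = sample_next_token(context, rng)
        context = context + ("" if token == "." else " ") + token
    return context

rng = random.Random(0)
counts = {"mat": 0, "tree": 0}
for _ in range(1000):
    counts[sample_next_token("The cat sat on the", rng)] += 1
print(counts)  # "mat" should dominate, with "tree" appearing only now and then
print(generate("The cat sat on the", rng))
```

Because the choice is a weighted random draw, running the same prompt again can produce a different continuation, which is one reason identical prompts give mixed results.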
This raises the question: what if you ask the LLM to complete the story with a probability greater than 90%?
Changing The LLM’s Behaviour
It plays the ‘meta-cognition’ card. The model must now evaluate its own output before presenting it. Unlike a teenager, it filters what it is about to say before it says it. By setting this threshold, you push it to return only outputs that meet your bar.
Using different techniques can reveal uncertainty but also spark “creativity”. Suddenly not all information is presented as equally valid, and you receive a more nuanced response.
Prompting LLMs for Probability Scores
Basic prompt:
“Who is currently the best footballer?”
Probability-enhanced:
“Who is currently the best footballer? Assign each potential player with a probability of being the best and only return those with the highest 5 scores.”
This additional constraint means that you can see which players are more widely considered ‘the best’ based on its training data. The model must consider the strength of the evidence and not just select one result.
For those interested, the basic prompt returned:
Mbappé.
The probability-enhanced prompt returned:
Top-5 with approximate probabilities based on form, impact, and consensus data signals:
| Player | Estimated probability of being “best footballer” | Rationale |
|---|---|---|
| Kylian Mbappé | 26% | Consistently ranked at or near the top in multiple 2026 Ballon d’Or power rankings with elite goal scoring and match impact. |
| Lamine Yamal | 24% | Emerging as the most valuable and influential talent, topping several best-of lists and widely viewed as the next generational star. |
| Erling Haaland | 20% | Exceptional scoring and historic international scoring feats position him among the elite. |
| Vinícius Júnior | 15% | Leading match-impact metrics (MII) and elite big-moment contributions place him high in current performance. |
| Jude Bellingham | 15% | High tactical and match influence with strong season performance and part of top ranking lists. |
Using Low-Probability Prompts for Novelty
Basic prompt:
“Give me 5 jokes about coffee”
Probability-enhanced:
“Give me 5 jokes about coffee. Assign a probability to your output and only return those with p<0.01”
What this does is select the responses that are more of a novelty. The model shifts its focus to the tail end of the distribution and produces content that is more niche. This does mean that creative drift is higher and there is a larger risk of incoherence.
The basic prompt returned:
Here you go, all coffee-coded:
- Why did the coffee file a police report? It got mugged… again.
- My coffee asked me to stop ghosting it. I said “relax, I’m literally brewing.”
- Decaf exists to remind us that not everything in life has a purpose.
- I tried to quit coffee once. The universe responded with a blue screen.
- Barista asked how I take my coffee. I said “seriously.”
The probability-enhanced prompt returned:
- “Tried to make decaf… the beans reported me for vibes fraud.” p = 0.007
- “My coffee asked for space this morning. Guess we’re taking a brew-ke.” p = 0.009
- “My latte art looked back at me. I did not sip it. It won.” p = 0.006
- “Espresso machines are just angry kettles with commitment issues.” p = 0.008
- “I told my cold brew to chill. It said ‘bro I am the chill.’” p = 0.005
I’ll let you determine how funny any of these jokes are, but I can safely say I had never seen the less likely ones before. Output like this can be a starting point that sends you down a path most haven’t ventured, but it shouldn’t be taken as gospel.
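If you want to check whether a reply actually stuck to the threshold you asked for, the self-reported values can be pulled out of the text. Here is a minimal sketch, assuming the model writes them in the “p = 0.007” style seen in the output above. The function names are hypothetical, and the check only confirms what the model claimed, not that the numbers mean anything.

```python
import re

# Matches self-reported values such as "p = 0.007" or "p=0.27" in a reply.
P_VALUE = re.compile(r"\bp\s*=\s*(\d*\.?\d+)")

def reported_probabilities(reply):
    """Return every self-reported p value found in the reply text, in order."""
    return [float(match) for match in P_VALUE.findall(reply)]

def honours_threshold(reply, threshold=0.01):
    """True if the reply reports at least one value and all are below the threshold."""
    values = reported_probabilities(reply)
    return bool(values) and all(value < threshold for value in values)

reply = (
    "Espresso machines are just angry kettles. p = 0.008\n"
    "I told my cold brew to chill. p = 0.005"
)
print(reported_probabilities(reply))  # [0.008, 0.005]
print(honours_threshold(reply))       # True
```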
Understanding the Limits of LLM Probability
These aren’t true probabilities in the sense that you would interpret them. When a model assigns p=0.26 to Mbappé being the best footballer, it’s not statistically meaningful; in the real world he does not have a 26% chance of being the best. Again, the number is based on patterns in the training data, and the probability request is a prompt constraint that nudges style.
The model compares candidate outputs against patterns of “good” and “bad” responses that it learned during training. It is making a relative judgement based on how common or how strongly reinforced certain patterns were in its training.
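For contrast, the token-level probabilities described earlier are not written out by the model as text. In most LLMs they are computed by applying a softmax to raw scores (logits), one score per candidate token. A minimal sketch with made-up logits, not values from any real model:

```python
import math

def softmax(logits):
    """Turn raw scores (logits) into probabilities that sum to 1."""
    highest = max(logits.values())  # subtract the max for numerical stability
    exps = {token: math.exp(score - highest) for token, score in logits.items()}
    total = sum(exps.values())
    return {token: value / total for token, value in exps.items()}

# Made-up logits for the token after "The cat sat on the".
logits = {"mat": 5.0, "tree": 1.1, "moon": -2.0}
probabilities = softmax(logits)
for token, p in probabilities.items():
    print(f"{token}: {p:.3f}")
```

A figure like “26%” in a reply is itself just a string of tokens sampled this way, which is why it should be read as style rather than as a measurement.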
The Bottom Line
Nothing magic is happening here, but it is surprising how much you can shape the response you get. By asking the model to fake confidence, you tap into the training patterns where humans were being careful and analytical. Sometimes, the best advice truly is to fake it ‘til you make it.
Next time you’re using an LLM, try asking it to assign probabilities to its outputs. Watch how the responses change, not just in terms of what you get but also how the model presents it.