LLM myths 2: perplexity and surprise

27.10.2024 20:58 Language Models LLM myths

Given a language model, “perplexity” is defined as the exponential of the mean negative log-likelihood of a token sequence. The term was perhaps coined to match the human sense of confusion (the higher the number, the more confused), and in many cases that reading is reasonable. We also sometimes refer to the negative log-likelihood of a single token as its “surprise”. Both terms are widely used in academia and industry, and they get the concept across to a wide audience. But when people lean on these terms too heavily to build their intuition, subtle misunderstandings creep in.

Surprise in Bayesian theory

In theoretical subjects such as information theory and Bayesian statistics, there is already a notion of “surprise” defined in a similar way:

Given a Bayesian model with prior $P(\theta)$ of parameters $\theta$ over a distribution $\mathcal D$, when a sample $d\in \mathcal D$ is observed, the “surprise” of the sample is defined as the negative log-likelihood

$$ - \text{log}(P(d | \theta)). $$

The use of the word “surprise” is loosely justified in a few ways such as the following:

If the prior is accurate, more probable samples are less surprising, and vice versa.
In Bayesian inference, the posterior $P(\theta | d)$ is proportional to $P(\theta)$ divided by the exponential of the surprise $ - \text{log}(P(d | \theta))$. For a given parameter $\theta$, the larger the “surprise”, the more the prior is adjusted by shrinking the probability density at $\theta$.
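The second justification can be sketched numerically. This is a minimal illustration, not part of the original argument: the two candidate parameters (`"fair"` and `"biased"` coins) and their likelihoods are made-up values, and `update` is a hypothetical helper name.

```python
import math


def surprise(p: float) -> float:
    """Negative log-likelihood (natural log) of an outcome with probability p."""
    return -math.log(p)


def update(prior: dict, likelihood: dict) -> dict:
    """Posterior over parameters: prior divided by exp(surprise), then normalized."""
    unnormalized = {
        theta: prior[theta] / math.exp(surprise(likelihood[theta]))
        for theta in prior
    }
    total = sum(unnormalized.values())
    return {theta: weight / total for theta, weight in unnormalized.items()}


# Assumed toy setup: two candidate parameters with equal prior weight.
prior = {"fair": 0.5, "biased": 0.5}
# Assumed probability that each candidate assigns to the observed sample (tails).
likelihood_of_tails = {"fair": 0.5, "biased": 0.1}

posterior = update(prior, likelihood_of_tails)
for theta in prior:
    print(theta, round(surprise(likelihood_of_tails[theta]), 2), round(posterior[theta], 3))
```

The candidate that was more “surprised” by the sample (`"biased"`) loses more of its prior mass after normalization.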

A more down-to-earth explanation, tailored to ML people, is the following:

Imagine you have an image classification model $M$ that classifies an image as cat or dog. Say you have a picture that just looks like a cat, and your model gives it a $99\%$ chance of being one. If it turns out to be a cat, the negative log-likelihood is $ - \text{log}(0.99) = 0.01$, which is pretty low. But if somehow it comes out as a dog (hmm… missing chihuahua in training data?), the negative log-likelihood in this case is $ - \text{log}(1 - 0.99) = - \text{log}(0.01) = 4.6 $, which is fairly high. So people say that it models the human notion of “surprise” when comparing one's own prediction with the real outcome.
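The arithmetic of this example is easy to check. A minimal sketch, assuming natural logarithms and a model that puts $0.99$ on “cat” (so $0.01$ on “dog”); `surprise` is just an illustrative helper name:

```python
import math


def surprise(p: float) -> float:
    """Negative log-likelihood (natural log) of an outcome with probability p."""
    return -math.log(p)


p_cat = 0.99          # probability the model assigns to "cat"
p_dog = 1.0 - p_cat   # remaining probability for "dog"

print("outcome is cat:", round(surprise(p_cat), 2))
print("outcome is dog:", round(surprise(p_dog), 1))
```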

All these explanations are reasonable. It is common practice to give a mathematical concept an intuitive name based on human experience. In this case, it is perhaps even a very good name, so good that many people forget:

There is a difference between “a mathematical definition that is modelled on and named after A” and “the human notion A that comes from the Oxford dictionary and that your grandparents and kids would understand”.

Such a disagreement may look minor in a binary image classification model, but an arbitrary generalization to other models deserves a second thought.

Decision paralysis vs Surprise

Scenario 1

A few weeks ago, my wife and I went to a French Crêperie during a vacation. The menu was very cool, full of colorful French words, but there were 15 abracadabra crêpes to choose from. There was a little time pressure as the waitress was waiting. In the end, my wife picked a random crêpe, whose name I will never remember, and it turned out to be full of beef slices.

I have a prior on how she would choose. Seeing the outcome, was I surprised?

There were $15$ possibilities, and since she has no freaking idea about French, my prior is the uniform distribution over these $15$ classes. Each crêpe had a probability of $\frac{1}{15}$, and the Bayesian “surprise” I got was $ - \text{log}(\frac{1}{15}) = 2.7$, which is arguably a pretty high number.

As a normal human being, my brain would not have registered the “surprise” emotion at anything she might have chosen.

Scenario 2

Now imagine you and I go to a bar. There are also $15$ items on the drink menu, $11$ of which are alcoholic. Say we have known each other for years, and I know you have an alcohol intolerance. The non-alcoholic options are Water, Fanta, Orange juice, and Diet coke. Fanta is a bad choice because of how artificial it tastes, and I have never seen you order it. But you went nuts and ordered a Fanta. Was I surprised?

My prior can be modelled as follows:

Alcohol: $0.01$ weight each because you are intolerant.
Water, Orange juice, Diet coke: $0.25$ weight each.
Fanta: let’s say $0.14$, bad in my observation but not impossible.

The “surprise” I get is $- \text{log}(0.14) = 1.97 < 2.7$. So was I less surprised than in Scenario 1?

As a human being I would probably keep laughing at your choice and talking about how bad Fanta is for the rest of our time in the bar.
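The two scenarios can be put side by side in a few lines. This is a small sketch using the weights given above; the item names (`crepe_1`, `alcoholic_1`, and so on) are placeholders, since the actual menu entries are not known.

```python
import math


def surprise(prior: dict, outcome: str) -> float:
    """Negative log-likelihood (natural log) of the observed outcome under the prior."""
    return -math.log(prior[outcome])


# Scenario 1: uniform prior over 15 crepes (placeholder names).
crepe_prior = {f"crepe_{i}": 1 / 15 for i in range(1, 16)}

# Scenario 2: 11 alcoholic drinks at 0.01 each, three plausible soft drinks, and Fanta.
drink_prior = {f"alcoholic_{i}": 0.01 for i in range(1, 12)}
drink_prior.update({"water": 0.25, "orange_juice": 0.25, "diet_coke": 0.25, "fanta": 0.14})

# Both priors should be proper distributions (up to floating-point error).
for prior in (crepe_prior, drink_prior):
    assert abs(sum(prior.values()) - 1.0) < 1e-9

print("Scenario 1:", round(surprise(crepe_prior, "crepe_7"), 2))
print("Scenario 2:", round(surprise(drink_prior, "fanta"), 2))
```

The number for the random crêpe comes out higher than the number for the Fanta, which is the opposite of how the two evenings actually felt.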

Counter arguments

A statistician would yell: your priors are not good enough, and we are not comparing with a consistent prior! Also there are so few samples!

A neuro-psychologist might also yell at me: you are confusing the thought with the reaction!

An ML scientist might say: ok if the goal is to model human surprise, we need to introduce more learnable parameters and adjust to your attention!

A Zoomer might yell at me: you should use entropy and varentropy (or a futuristic skewness-entropy or kurtosis-entropy)!

Ok, ok… Let us pause for a minute and reflect:

Yes, using an arbitrary prior over a single data point, and comparing likelihoods under different priors, can lead to absurdity if not done carefully. In fact, the whole point of this blog post is to point out that such practices are not uncommon in ML.
But does it even make sense to ask an ML model for a single number that is supposed to reflect the human emotion of “surprise”?

Language modelling

Back to language models: recall that a language model models the conditional probability

$$ P(x_n | x_1, x_2, \cdots, x_{n-1}) $$

given a sequence $x_1, \cdots, x_{n-1}$. Since sampling is usually auto-regressive, $x_1, \cdots, x_{n-1}$ can be regarded as the outcome of a process that samples each $x_i$ in turn, from a distribution that changes with the previous choices. Let us call $x_1, \cdots, x_{n-1}$ a sampling trajectory.

Now, the perplexity is commonly defined as

$$ \text{ppl}(x_1,\cdots,x_n) = \prod\limits_{i=1}^n P(x_i | x_1, \cdots, x_{i-1})^{-\frac{1}{n}}. $$
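A short sketch shows that this product form is the same thing as the exponential of the mean negative log-likelihood. The per-token probabilities below are made-up numbers standing in for what a model would return, and both function names are illustrative.

```python
import math


def perplexity_from_product(token_probs: list) -> float:
    """Perplexity as the product of P(x_i | x_<i) ** (-1/n)."""
    n = len(token_probs)
    return math.prod(p ** (-1.0 / n) for p in token_probs)


def perplexity_from_mean_nll(token_probs: list) -> float:
    """Perplexity as exp(mean negative log-likelihood)."""
    n = len(token_probs)
    mean_nll = -sum(math.log(p) for p in token_probs) / n
    return math.exp(mean_nll)


# Assumed conditional probabilities of each token given its prefix.
token_probs = [0.5, 0.25, 0.125, 0.5]

print(perplexity_from_product(token_probs))
print(perplexity_from_mean_nll(token_probs))

# A model that is uniform over 15 options at every step has perplexity 15.
print(perplexity_from_mean_nll([1 / 15] * 4))
```

In practice the second form is the safer one to compute, since summing logs avoids multiplying many small probabilities together.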

Does a higher $\text{ppl}(x)$ indicate a higher “surprise”? There are a couple of scenarios:

It is discussed under the same prior over the vocabulary space. Given the same previous token sequence, I think this is reasonable, and it is just a question of terminology.
Say I am looking at the training loss curve: hey, my perplexity is lower at step $5000$ than at step $1000$, so the model has become less perplexed by the validation dataset! Now this is a very slippery slope because we are comparing negative log-likelihoods under two different priors. Feel free to go back to the previous examples: are you more surprised under a uniform prior, or under a more concentrated but sharply contrasted prior?

What about the “surprise” of each token? I would argue that it is even more dangerous when it is discussed across different sampling trajectories (by which I mean the previous tokens $x_1, \cdots, x_{n-1}$). If the sampling trajectory changes, it leads to a different prior over the vocabulary space.
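To make the last point concrete, here is a toy sketch in which the same token gets a different “surprise” purely because the trajectory before it differs. The lookup table `toy_model` and its probabilities are invented for illustration and stand in for a real language model.

```python
import math

# Hypothetical next-token distributions, keyed by the trajectory so far.
toy_model = {
    ("the", "cat"): {"sat": 0.6, "ran": 0.3, "barked": 0.1},
    ("the", "dog"): {"sat": 0.2, "ran": 0.3, "barked": 0.5},
}


def token_surprise(model: dict, trajectory: tuple, token: str) -> float:
    """Negative log-likelihood of `token` under the distribution for `trajectory`."""
    return -math.log(model[trajectory][token])


after_cat = token_surprise(toy_model, ("the", "cat"), "barked")
after_dog = token_surprise(toy_model, ("the", "dog"), "barked")

print("barked | the cat:", round(after_cat, 2))
print("barked | the dog:", round(after_dog, 2))
```

The two numbers come from two different distributions over the vocabulary, so comparing them directly repeats the crêpe-versus-Fanta comparison.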

Remarks

For me, it is not uncommon to seek human “intuition” for something I see in a given setting. Being aware of the limitations of an intuition, however, takes equal or greater effort in this process.

On the other hand, it is quite common to read an individual negative log-likelihood as an indicator of the degree of choice: the higher the nll, the more choices there might have been when the given token was sampled. Some people base their intuitions on this and argue that the negative log-likelihood also indicates a crucial branching point. This line of thought leads to some theories of entropy-based metrics in inference-time techniques. It is an entirely different topic, but my general attitude is that there is something good in this direction, though how to materialize the idea can be subtle. I will defer this topic to a future post.
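As a rough illustration of why the nll alone cannot tell “many available choices” apart from “one unlikely choice”, the sketch below compares the nll of the observed outcome with the entropy of the distribution it came from, reusing the two priors from the crêpe and bar scenarios. This is only one possible reading, under my own assumptions, and not a statement of how entropy-based inference-time methods actually work; the item names are placeholders.

```python
import math


def entropy(dist: dict) -> float:
    """Shannon entropy (natural log) of a discrete distribution."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)


def nll(dist: dict, outcome: str) -> float:
    """Negative log-likelihood of the observed outcome."""
    return -math.log(dist[outcome])


# Scenario 1: uniform prior over 15 crepes.
crepes = {f"crepe_{i}": 1 / 15 for i in range(1, 16)}

# Scenario 2: the concentrated prior over 15 drinks.
drinks = {f"alcoholic_{i}": 0.01 for i in range(1, 12)}
drinks.update({"water": 0.25, "orange_juice": 0.25, "diet_coke": 0.25, "fanta": 0.14})

for name, dist, outcome in [("crepes", crepes, "crepe_7"), ("drinks", drinks, "fanta")]:
    gap = nll(dist, outcome) - entropy(dist)
    print(name, round(nll(dist, outcome), 2), round(entropy(dist), 2), round(gap, 2))
```

In the uniform case the nll equals the entropy, so a high nll only reflects how many options there were. In the bar case the nll of Fanta sits above the entropy, even though the raw number is lower than for the crêpe.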