Posts on Honglu Fan

LLM myths 2: perplexity and surprise

Sun, 27 Oct 2024 20:58:38 -0400

Given a language model, “perplexity” is defined as the exponential of the mean negative log-likelihood per token. The name was presumably chosen to match the human sense of confusion (the higher the number, the more confused the model), and in many cases that reading is reasonable. Sometimes we also refer to the negative log-likelihood of a single token as “surprise”. Both terms are well established in academia and industry, and they get the concept across to a wide audience. But when people build their intuition by relying too much on these terms, they can be misled in subtle ways.
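For concreteness, here is a minimal sketch of the two quantities, assuming the common convention in which perplexity is the exponential of the mean per-token negative log-likelihood (measured in nats). The function names and the toy probabilities are illustrative assumptions, not taken from any particular library.

```python
import math

def surprisals(token_probs):
    """Per-token negative log-likelihood in nats (the "surprise" of each token)."""
    return [-math.log(p) for p in token_probs]

def perplexity(token_probs):
    """Exponential of the mean per-token negative log-likelihood."""
    nll = surprisals(token_probs)
    return math.exp(sum(nll) / len(nll))

# Toy example: a model that assigns probability 1/4 to every observed token
# behaves like a uniform guess among 4 options, so its perplexity is close to 4.
uniform_probs = [0.25] * 8
print(perplexity(uniform_probs))
```

Note that the mean hides the distribution of per-token values: one very surprising token and many unsurprising ones can give the same perplexity as a sequence of uniformly mediocre predictions.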

LLM myths 1: why does LLM generate infinite loops

Sun, 27 Oct 2024 19:57:30 -0400

Looping is fairly common when sampling from an LLM. We normally do not want it to happen, and there have been many tricks to keep the model from doing it, such as repetition penalties or hard-coded loop detection, but their effectiveness is debatable. Explanations of this phenomenon seem scarce in the literature, and at first it might feel like just another bug in our day-to-day data/model engineering, with nothing deep behind it.
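As an illustration of what a hard-coded loop detection might look like, here is a minimal sketch that checks whether a token sequence ends with a short block repeated several times in a row. The function name, the `max_period` and `min_repeats` parameters, and their defaults are hypothetical choices for this example, not a description of what any specific inference stack does.

```python
def has_trailing_loop(tokens, max_period=8, min_repeats=3):
    """Return True if `tokens` ends with a block of length <= max_period
    that is repeated at least `min_repeats` times consecutively."""
    n = len(tokens)
    for period in range(1, max_period + 1):
        needed = period * min_repeats
        if needed > n:
            break  # longer periods need even more tokens, so stop here
        tail = tokens[n - needed:]
        block = tail[:period]
        if all(tail[i] == block[i % period] for i in range(needed)):
            return True
    return False

# Token ids ending in "4 5 4 5 4 5" contain a period-2 loop.
print(has_trailing_loop([1, 2, 3, 4, 5, 4, 5, 4, 5]))
```

A guard like this only catches exact repetition at the token level, which is one reason such heuristics can miss loops that vary slightly from one cycle to the next.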

But for modern LLMs without ad-hoc outer logic guarding their output, sampling loops have been with us all along. For example, you can easily induce a DeepSeek model into a loop with this prompt (as of Oct 30th, 2024): “Write me a bunch of bullet points.”

MCTS and Theorem proving

Sun, 30 Jun 2024 05:16:55 -0400

With the increasing maturity of the Lean theorem prover, many people have attempted to combine reinforcement learning (RL) with theorem proving. Among these attempts, Hypertree Proof Search has been quite notable, and I personally admire it a lot.

Looking around, the general field of neural reasoning has also become more prominent, since logical reasoning is one of the few domains where LLMs continue to struggle to reach a satisfactory degree of reliability. A nice recent survey is this.

Longer than Chinchilla

Sat, 11 Mar 2023 20:24:01 +0000

In large language model pretraining, every single training run takes a massive computing budget.

The Chinchilla optimal bounds were proposed in the paper An empirical analysis of compute-optimal large language model training. A very common misunderstanding about the Chinchilla scaling law is that it imposes an upper bound on the number of tokens one should train for, given a fixed parameter count. But it is really about the optimal tradeoff between the number of tokens and the model size, given a fixed computing budget. In practice, it might give a good reference number of tokens, but a general rule of thumb is still to train for as many tokens as possible before the training loss or eval loss starts to diverge.
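To make the "fixed computing budget" framing concrete, here is a rough sketch that splits a FLOP budget between parameters and tokens. It rests on two widely used approximations that are assumptions here rather than statements from the text above: training cost of roughly 6 FLOPs per parameter per token, and a compute-optimal ratio of roughly 20 tokens per parameter. The function name and defaults are illustrative.

```python
import math

def compute_optimal_split(flops, tokens_per_param=20.0, flops_per_param_token=6.0):
    """Split a training budget C into (params N, tokens D), assuming
    C ~= flops_per_param_token * N * D and D ~= tokens_per_param * N."""
    n_params = math.sqrt(flops / (flops_per_param_token * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# A budget of 5.76e23 FLOPs lands at roughly 70B parameters and 1.4T tokens
# under these approximations.
n, d = compute_optimal_split(5.76e23)
print(f"params ~ {n:.3g}, tokens ~ {d:.3g}")
```

The input is the compute budget, not the parameter count. With a model size already fixed, nothing in this calculation says that training on more than `tokens_per_param` tokens per parameter is harmful; it only says that, for the same compute, a different model size would have been a better use of it.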

Breaking the Generalization Barrier

Mon, 27 Feb 2023 20:21:59 +0000

I have been working with the Carper folks on OpenELM and diff models (see the blog) for quite a while. In particular, I have spent a lot of time finetuning diff models, which are based on CodeGen and finetuned on GitHub commit data (filtered down to 1.2M documents totalling about 1B tokens) to automatically suggest commits.

Many interesting things happen during model training. One specific thing I am documenting here is a phenomenon in the loss curves: how the model develops its ability, and the emergence of several distinct levels of loss plateau, generalization barrier, critical submanifold, or whatever you may call it.