<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Posts on Honglu Fan</title>
    <link>https://honglu2875.github.io/posts/</link>
    <description>Recent content in Posts on Honglu Fan</description>
    <generator>Hugo</generator>
    <language>en-us</language>
    <lastBuildDate>Sun, 27 Oct 2024 20:58:38 -0400</lastBuildDate>
    <atom:link href="https://honglu2875.github.io/posts/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>LLM myths 2: perplexity and surprise</title>
      <link>https://honglu2875.github.io/posts/2024-10-27-llm_myths_2_perplexity_and_surprise/</link>
      <pubDate>Sun, 27 Oct 2024 20:58:38 -0400</pubDate>
      <guid>https://honglu2875.github.io/posts/2024-10-27-llm_myths_2_perplexity_and_surprise/</guid>
      <description>&lt;p&gt;Given a language model, &amp;ldquo;perplexity&amp;rdquo; is defined as the exponential of the mean negative log-likelihood. The term was perhaps coined to track the human sense of confusion (the higher the number, the greater the confusion), and in many cases that reading is reasonable. Sometimes, we also refer to the negative log-likelihood of a single token as &amp;ldquo;surprise&amp;rdquo;. Both are good terms, used in academia and industry alike, and they get the concept across to a wide audience. But when people lean too heavily on these terms to build their intuition, misleading subtleties creep in.&lt;/p&gt;</description>
    </item>
    <item>
      <title>LLM myths 1: why does LLM generate infinite loops</title>
      <link>https://honglu2875.github.io/posts/2024-10-27-llm_myths_1_why_does_llm_generate_loops/</link>
      <pubDate>Sun, 27 Oct 2024 19:57:30 -0400</pubDate>
      <guid>https://honglu2875.github.io/posts/2024-10-27-llm_myths_1_why_does_llm_generate_loops/</guid>
      <description>&lt;p&gt;Looping is fairly common when sampling from an LLM. We normally do not want it to happen, and there have been many tricks to suppress it, such as repetition penalties or hard-coded loop detection, but their effectiveness is debatable. Explanations of this phenomenon seem scarce in the literature, and it might at first feel like just another bug in our day-to-day data/model engineering, with nothing deep behind it.&lt;/p&gt;&#xA;&lt;p&gt;But for modern LLMs without ad-hoc outer logic guarding their output, sampling loops have been with us all along. For example, you can easily send a Deepseek model into a loop with this prompt (as of Oct 30th, 2024): &amp;ldquo;Write me a bunch of bullet points.&amp;rdquo;&lt;/p&gt;</description>
    </item>
    <item>
      <title>MCTS and Theorem proving</title>
      <link>https://honglu2875.github.io/posts/2024-06-30-mcts-and-theorem-proving/</link>
      <pubDate>Sun, 30 Jun 2024 05:16:55 -0400</pubDate>
      <guid>https://honglu2875.github.io/posts/2024-06-30-mcts-and-theorem-proving/</guid>
      <description>&lt;p&gt;With the increasing maturity of the &lt;a href=&#34;https://github.com/leanprover/lean4.git&#34;&gt;Lean theorem prover&lt;/a&gt;, many people have attempted to combine reinforcement learning (RL) with theorem proving. Among these attempts, the &lt;a href=&#34;https://arxiv.org/abs/2205.11491&#34;&gt;Hypertree proof search&lt;/a&gt; is quite notable, and I personally admire it a lot.&lt;/p&gt;&#xA;&lt;p&gt;Looking around, the general field of neural reasoning has also become more prominent, since logical reasoning is one of the few domains where LLMs continue to struggle to reach a satisfactory degree of reliability. A nice recent survey is &lt;a href=&#34;https://arxiv.org/html/2404.09939v1&#34;&gt;this&lt;/a&gt;.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Longer than Chinchilla</title>
      <link>https://honglu2875.github.io/posts/2023-03-11-longer_than_chinchilla/</link>
      <pubDate>Sat, 11 Mar 2023 20:24:01 +0000</pubDate>
      <guid>https://honglu2875.github.io/posts/2023-03-11-longer_than_chinchilla/</guid>
      <description>&lt;p&gt;In large language model pretraining, every single training run takes a massive&#xA;computing budget.&lt;/p&gt;&#xA;&lt;p&gt;The Chinchilla optimal bounds were&#xA;proposed in the paper &lt;a href=&#34;https://www.deepmind.com/publications/an-empirical-analysis-of-compute-optimal-large-language-model-training&#34;&gt;An empirical analysis of compute-optimal large language model training&lt;/a&gt;.&#xA;A very common misunderstanding about the Chinchilla scaling law is that it imposes&#xA;an upper bound on the number of tokens one should train for given a fixed parameter count.&#xA;But it is really about the optimal tradeoff between the number of tokens and the model size,&#xA;given a fixed computing budget. In practice, it might give a good reference number of&#xA;tokens, but a general rule of thumb is still to train for as many tokens as possible&#xA;before the training loss or eval loss starts to diverge.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Breaking the Generalization Barrier</title>
      <link>https://honglu2875.github.io/posts/2023-02-27-breaking-the-loss-barrier/</link>
      <pubDate>Mon, 27 Feb 2023 20:21:59 +0000</pubDate>
      <guid>https://honglu2875.github.io/posts/2023-02-27-breaking-the-loss-barrier/</guid>
      <description>&lt;p&gt;I have been working with the Carper folks on OpenELM and diff models (see &lt;a href=&#34;https://carper.ai/diff-models-a-new-way-to-edit-code/&#34;&gt;the blog&lt;/a&gt;) for quite a while. In particular, I have spent a lot of time finetuning diff models, which are based on &lt;a href=&#34;https://github.com/salesforce/CodeGen&#34;&gt;CodeGen&lt;/a&gt; and finetuned on GitHub commit data (filtered down to 1.2M documents totalling about 1B tokens) to automatically suggest commits.&lt;/p&gt;&#xA;&lt;p&gt;Many interesting things happen during model training. One specific thing I am documenting here is a phenomenon in the loss curves: how the model developed its ability, and the emergence of several levels of loss plateau/generalization barrier/critical submanifold, or whatever you may call it.&lt;/p&gt;</description>
    </item>
  </channel>
</rss>
