Deep Gains

On the Biology of a Large Language Model - Anthropic Research

A rare insight into the inner workings of an LLM

There are two sections in today’s article. Above ⬆️⬆️ is a podcast generated by NotebookLM, and below ⬇️⬇️ are my thoughts on the research done by Anthropic. While both are based on the same sources listed below, the write-up is not a transcript of the podcast; it is a standalone essay.

Sources:

  1. On the Biology of a Large Language Model

  2. Circuit Tracing: Revealing Computational Graphs in Language Models

LLMs have generally been considered a black box. We have a vague idea of how they work, but not in any real detail. Thankfully this is changing, and a deeper insight into the inner workings of these models will have a huge impact on AI safety and on adoption in enterprises. Anthropic has established itself as a leader in safe and responsible AI, and this research pushes the boundaries of our understanding of LLMs further than ever before.


Figuring out LLMs is hard

The authors of the paper make a great analogy to biology. We know the basics of evolution: simple rules playing out over millions of years produce complex creatures like us. But understanding exactly how a single cell or a whole brain works is incredibly complicated. LLMs are similar in many ways. The transformer architecture behind LLMs can be implemented in a few hundred lines of code (see Andrej Karpathy's YouTube videos on this topic). They learn from massive amounts of text using relatively simple training methods, but the 'brain' that results is unbelievably complex.
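To make that concrete, here is a minimal sketch of the core building block, causal self-attention, written in PyTorch. The class name, dimensions, and layer choices are my own illustrative assumptions, not code from the paper or from Karpathy's video; a real model stacks dozens of such blocks together with MLPs, embeddings, and normalization.

```python
# Minimal causal self-attention block (illustrative only; sizes are arbitrary).
import torch
import torch.nn as nn

class CausalSelfAttention(nn.Module):
    def __init__(self, d_model: int = 64, n_heads: int = 4):
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)   # joint projection to queries, keys, values
        self.proj = nn.Linear(d_model, d_model)      # mix the heads back together

    def forward(self, x):                            # x: (batch, seq_len, d_model)
        B, T, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # split into heads: (batch, n_heads, seq_len, d_head)
        q, k, v = (t.reshape(B, T, self.n_heads, self.d_head).transpose(1, 2) for t in (q, k, v))
        # scaled dot-product attention with a causal mask, so each token attends only to the past
        att = (q @ k.transpose(-2, -1)) / self.d_head ** 0.5
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
        att = att.masked_fill(mask, float("-inf")).softmax(dim=-1)
        out = (att @ v).transpose(1, 2).reshape(B, T, C)
        return self.proj(out)
```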

It's like having a super powerful machine that is a 'black box'. We see the amazing things it does (which, for autoregressive LLMs, boils down to predicting the next token), but we don't really have a clear map of how it gets there on the inside. Trying to understand the step-by-step process is a huge challenge because there are billions of connections, and figuring out which ones matter for which task is tricky.
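For readers who haven't seen it spelled out, "predicting the next token" is roughly the loop below. The sketch assumes a HuggingFace-style causal language model and tokenizer; `model` and `tokenizer` are placeholders, and greedy decoding is used only to keep the example short.

```python
# Toy autoregressive decoding loop: the model only ever predicts the next token,
# which is appended to the input before the process repeats.
import torch

@torch.no_grad()
def generate(model, tokenizer, prompt: str, max_new_tokens: int = 50) -> str:
    ids = tokenizer.encode(prompt, return_tensors="pt")       # token ids, shape (1, seq_len)
    for _ in range(max_new_tokens):
        logits = model(ids).logits[:, -1, :]                   # scores for the *next* token only
        next_id = torch.argmax(logits, dim=-1, keepdim=True)   # greedy pick; sampling also works
        ids = torch.cat([ids, next_id], dim=-1)                # append it and repeat
    return tokenizer.decode(ids[0])
```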

Anthropic’s latest research

Researchers at Anthropic, the company behind Claude, are trying to peek inside this black box. The articles linked above mention that they are working on ways to "reverse engineer" how these models work. Going with the previous analogy, imagine them as doctors examining the brain under an MRI scanner.

They're using a mix of special techniques, including "circuit tracing," to follow the paths information takes inside the model when it's doing something, like answering a question or writing text. It's like doctors watching your brain through an MRI while you are given a certain stimulus, like an image or a sound. Their goal is to understand the mechanisms – the actual step-by-step processes – that lead to the model's behavior.

The article gives several detailed examples, but let us take one of them and explore it. The researchers wanted to figure out how their model, Claude 3.5 Haiku, writes rhyming poems. They were trying to figure out whether it just makes things up as it goes along or whether it actually plans ahead.


Surprise! LLMs plan ahead (among other things)

It turns out, the model does plan ahead! At least for rhyming couplets, they found strong evidence it wasn't just winging it, i.e., writing the beginning of each line without regard for the need to rhyme at the end and then tacking on a rhyming word that made sense in the context of the line.

Instead, before the model even starts writing the second line of a rhyming pair, it often already has a few ideas for the word that will go at the very end of that line. Think of it like the model 'thinking': "Okay, the first line ended with 'grab it'... hmm, what rhymes with that? Maybe 'habit' or 'rabbit'?"

It activates these potential rhyming words internally before writing the line. Then, as it writes the line, it uses that 'planned word' idea to guide the writing, making sure the sentence flows naturally towards ending with, say, "rabbit." Rabbit makes a lot of sense here because the word "carrot" appeared in the first line of the poem. This planning happened in about half the poems they looked at.

One of the techniques they used is called an attribution graph, a circuit-tracing method for watching the LLM at work. When looking at the last word of the second line in a couplet (like "habit" rhyming with "rabbit"), they saw that right at the start of that line (on the newline character), features related to "rhyming with rabbit" would light up. These, in turn, would light up features for candidate words like "habit." These "habit" features then encouraged the model to actually say "habit" later on.
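As a very rough intuition for "features lighting up," the toy snippet below records a model's intermediate activations at one token position (say, the newline before the second line) using PyTorch forward hooks. This is not Anthropic's attribution-graph tooling, which operates on learned interpretable features rather than raw activations; the layer-name filter and tensor shapes here are assumptions about a GPT-style HuggingFace model.

```python
# Toy illustration: record each hooked layer's activation vector at one token position.
import torch

def capture_activations(model, ids, position: int):
    captured, handles = {}, []

    def make_hook(name):
        def hook(module, inputs, output):
            hidden = output[0] if isinstance(output, tuple) else output
            captured[name] = hidden[0, position].detach()   # vector of size d_model at that token
        return hook

    # Hook the MLP sub-blocks of a GPT-style model (the naming scheme is an assumption).
    for name, module in model.named_modules():
        if name.endswith(".mlp"):
            handles.append(module.register_forward_hook(make_hook(name)))
    try:
        with torch.no_grad():
            model(ids)
    finally:
        for h in handles:
            h.remove()
    return captured
```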

To prove this wasn't just a coincidence, they did some additional experiments where they actively messed with the model's internal 'thoughts' right at the planning stage.

  • They tried blocking the features related to the planned rhyming word. When they did this, the chance of the model actually using that word dropped significantly.

  • They also tried injecting features for a different rhyming word or scheme. This, too, changed what the model wrote, making it more likely to use the word they suggested, which is strong evidence that these features were indeed part of a planning process. (A toy sketch of this kind of intervention follows this list.)
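Here is the toy intervention sketch promised above: a generic activation-steering hook that nudges one layer's output along a "feature direction," with a positive scale standing in for injection and a negative scale for suppression. This is a simplified stand-in, not Anthropic's actual feature-level method; the layer path and the `rabbit_direction` vector in the usage comment are hypothetical.

```python
# Generic activation-steering sketch: shift one layer's output along a feature direction.
import torch

def steer_layer(layer, feature_direction: torch.Tensor, scale: float):
    direction = feature_direction / feature_direction.norm()

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + scale * direction            # add (or, with scale < 0, subtract) the feature
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

    return layer.register_forward_hook(hook)

# Usage sketch (layer path and `rabbit_direction` are placeholders, reusing the toy
# `generate` loop from earlier):
#   handle = steer_layer(model.transformer.h[10], rabbit_direction, scale=-8.0)  # suppress
#   steered_text = generate(model, tokenizer, poem_prompt)
#   handle.remove()
```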

Some more cool findings from this experiment

  • Planning Influences the Whole Line: The plan wasn't just about the last word. They found evidence that the planned end word also affected the words in the middle of the line. The model seemed to work backward from the planned rhyme to make the line build towards it logically.

  • Where the Planning Happens: This planning activity was mostly concentrated right on that newline token before the second line starts.

  • Planning Uses Normal 'Word Thoughts': Interestingly, the internal features the model used to 'think' about the planned word weren't some special 'planning code.' They seemed to be the same kind of features the model uses when it just reads or thinks about that word normally.

Instead of just finding a rhyme at the last second, it often anticipates the rhyming word, holds it in its 'mind,' and then crafts the line to lead smoothly to that planned conclusion. This finding sheds new light on the traditional understanding of (autoregressive) LLMs, in which the next token is assumed to be generated based only on the tokens before it.


What We Can Learn from Their Research

This detective work is starting to show some cool, and sometimes weird, things about how LLMs operate:

  1. Following the Thoughts: Researchers are getting closer to tracing why a model gives a specific answer. They can see which parts of the model light up (activate) when it's processing information (Think MRI scan). This helps figure out which internal 'features' or 'concepts' the model is using.

  2. Is the Reasoning Real? You know how sometimes you ask an LLM to explain its reasoning step-by-step (its "chain-of-thought")? Well, sometimes that explanation isn't actually how the model got the answer! The research suggests models might sometimes figure out the answer first and then write down a plausible-sounding reason, which might not reflect the real internal process. The researchers are trying to spot this "unfaithful reasoning" by looking directly at the model's internal activity, not just the text it writes. For example, a model might give the right answer to a tough question but its written explanation might skip over the hard parts or not match the internal 'work' it did.

  3. Understanding Bad Behavior: They can also see what happens inside when someone tries to "jailbreak" a model (trick it into ignoring its safety rules). They found these tricks often work by basically boosting the parts of the model that say "yes, do it" and suppressing the parts that say "no, I shouldn't." Other tricks, like 'adversarial examples', might work by distracting the model's internal 'attention' away from the problematic words in a prompt.

So, while we're still far from having a complete instruction manual for LLMs, researchers are building the tools and doing the hard work to understand these complex digital minds better. From popular culture, this reminds me of the movie "Arrival," where scientists painstakingly work out how to communicate with an alien intelligence.

