Why are my AI coding sessions falling apart mid-way?
Understanding context windows, attention mechanisms, and why bigger doesn't mean better
You’ve probably experienced this: you start a session with Claude Code, things work great for the first hour, then responses get weird and the code quality drops noticeably. After a couple of hours, the AI starts repeating itself, forgetting decisions made an hour ago, and suggesting fixes for bugs it has already fixed. It starts missing obvious details, and you end up restarting your session (the solution for everything).
This problem is called “context window exhaustion”, and we’ll dig into it today. Let’s go.
What fills the context window?
AI models can only hold a limited amount of information at once, like RAM in a computer. This limit is called the context window. As we work, here’s what fills it up:
Items automatically loaded when we start the session, such as:
System instructions (how Claude should behave as a coding agent)
Tool definitions (what commands Claude can run)
CLAUDE.md files from parent directories (the project instructions)
The first 200 lines of the agent’s auto-memory file, MEMORY.md (Claude automatically saves useful context, like project patterns, key commands, and your preferences, into this file; it persists across sessions, and its first 200 lines are loaded into the context window when we start a session)
Items loaded as we work:
Every message we send (e.g. Fix the login bug)
Every response Claude gives (including all code it writes)
Every file Claude opens to read / edit
Every bash command output
Web searches, API calls, tool results
All of this sits in the AI’s “Context Window” simultaneously. When it maxes out, quality deteriorates.
One important thing to note is that the context window does not hold the entire project, regardless of whether we started Claude Code in the project directory. Claude Code can read any file in the directory, but it only loads a file into context when we explicitly use it.
Context engineering is about deciding what goes into that limited space, when it goes in, and what gets removed to make room.
Why don’t big context windows save us?
LLMs are shipping with ever-larger context windows (for example, Claude Sonnet has ~200k tokens, and Gemini allows ~1M tokens). That’s huge. So why do models still hallucinate or miss the point after a while?
Attention is a fixed budget
This is where “Attention” comes into play. When an LLM processes the input we entered, every token needs to “look at” every other token to figure out what’s relevant. With 100,000 tokens, that’s 10 billion comparisons.
The model needs a way to decide how to split its attention across all the tokens. It uses a formula called softmax. For each token, the model distributes a fixed attention budget, called weights, across all other tokens, and that budget always sums to 1, no matter how many tokens there are.
Imagine you have $1 to distribute among people in a room. If there are 10 people, each person could get 10 cents. If there are 100,000 people, each person gets a tiny fraction of a cent. The dollar doesn’t grow just because more people showed up.
That’s what happens with a long input. The model’s attention is a fixed pie, and more tokens means thinner slices for each one.
Many newer models optimise this with sliding-window attention, where each token only looks at a fixed neighbourhood around itself, but the core principle holds: more tokens means the attention budget gets spread thinner.
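To make the fixed-budget idea concrete, here’s a minimal sketch of softmax in plain Python. The formula is the real one; the equal scores and token counts are made up for illustration. Notice that the weights always sum to 1, so each token’s share shrinks as the input grows.

```python
import math

def softmax(scores):
    """Turn raw relevance scores into attention weights that sum to 1."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# With equal relevance everywhere, the budget splits evenly.
few = softmax([1.0] * 10)        # 10 tokens  -> each gets ~0.1
many = softmax([1.0] * 100_000)  # 100k tokens -> each gets ~0.00001

# The budget is fixed: both distributions sum to 1.
assert math.isclose(sum(few), 1.0) and math.isclose(sum(many), 1.0, rel_tol=1e-6)
```

The dollar analogy maps directly: `total` is the one dollar, and dividing by it is what keeps the pie fixed no matter how many tokens show up.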
What is a weight?
A weight is just a number that represents how important something is relative to everything else.
Imagine you’re at a busy party and someone calls your name from across the room. Your brain automatically gives more attention (more weight) to that voice and less to the background chatter. You’re still hearing everything, but your focus isn’t evenly spread. The familiar voice gets a bigger share of your attention.
The model does something similar. When it’s processing a token, it assigns a weight to every other token, basically deciding how much each one matters for understanding the current one. A high weight means “this token is very relevant right now,” and a low weight means “not so important here.”
When the model processes an input, each token gets converted into a list of numbers (called a vector). This is like a description of that token, capturing things like its meaning, its role in the sentence, and its relationship to other words.
To decide how much attention one token should pay to another, the model compares their vectors. If two tokens have vectors that “point in a similar direction”, the model gives that pair a higher weight. If they’re unrelated, the weight is low.
Imagine you’re in a library sorting books. You’d naturally group a cooking book closer to a nutrition book than to a physics book, because their topics are related. The model does something similar but with numbers. It measures how closely related two tokens are, and that measurement becomes the weight.
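As a toy illustration of that comparison (real models compare learned query and key vectors via scaled dot products, not cosine similarity of topic labels, and the three vectors below are invented for the example):

```python
import math

def cosine_similarity(a, b):
    """How closely two vectors point in the same direction (-1 to 1)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-number "descriptions" of three topics.
cooking   = [0.9, 0.8, 0.1]
nutrition = [0.8, 0.9, 0.2]
physics   = [0.1, 0.2, 0.9]

# Related topics point in similar directions, so they'd earn a higher weight.
assert cosine_similarity(cooking, nutrition) > cosine_similarity(cooking, physics)
```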
So what are these tokens we keep mentioning?
What is a token?
Tokens are the small pieces that LLMs use to read and write text. When we type a sentence, the model doesn’t see words the way we do. It breaks the text into tokens first, then processes those tokens.
A token is roughly a chunk of text. Sometimes it’s a full word, sometimes it’s part of a word, and sometimes it’s just a single character. For example, the word “running” might be split into two tokens: “run” and “ning”. A short common word like “the” would be a single token. An unusual or long word might get split into several tokens. One token is roughly ¾ of a word in English. So 100 words is roughly 130–140 tokens.
Why not just use whole words?
If the model used whole words, it would need to know every possible word in every language, including misspellings, slang, and technical terms. That list would be enormous. Instead, the model learns a vocabulary of common chunks. This way, even if it hasn’t seen a specific word before, it can still handle it by breaking it into smaller known pieces.
Think of it like LEGO blocks. Instead of having a unique brick for every possible object (a car brick, a house brick, a tree brick), you have a set of basic bricks that you can combine to build anything. Similarly, the model has a set of common text chunks that it combines to represent any word, even ones it hasn’t seen before.
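A toy greedy tokenizer makes the LEGO idea concrete. Real tokenizers learn their vocabulary with algorithms like BPE, and the tiny vocabulary below is made up, but the mechanism of falling back to smaller known pieces is the same:

```python
def tokenize(text, vocab):
    """Greedy longest-match subword tokenizer (a toy stand-in for BPE)."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest chunk first, fall back to single characters.
        for length in range(len(text) - i, 0, -1):
            piece = text[i:i + length]
            if piece in vocab or length == 1:
                tokens.append(piece)
                i += length
                break
    return tokens

# A tiny hypothetical vocabulary of common chunks.
vocab = {"run", "ning", "the", "un", "ing"}

print(tokenize("running", vocab))  # -> ['run', 'ning']
print(tokenize("the", vocab))      # -> ['the']
```

Even a string the vocabulary has never seen as a whole still tokenizes, because any unknown stretch degrades gracefully into single characters.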
The middle gets squeezed
The attention budget isn’t distributed evenly, either. LLMs have a built-in bias towards the beginning and end of the input.
The first tokens are seen by every token that comes after them. So they accumulate importance because they’ve been referenced the most times. It’s like the first person to speak in a meeting. Everyone who speaks after them has heard what they said, so their words naturally carry influence throughout the whole conversation.
The last tokens are the most recent ones the model processed. They’re still fresh in the calculation. Similar to how we remember the last thing someone said more easily than something from the middle of a long conversation.
The middle tokens get squeezed.
This pattern has a name in human psychology too: the Serial Position Effect. People tend to remember the first and last items in a list better than the ones in the middle. LLMs ended up with similar behaviour because the math behind attention produces this pattern.
Conceptually speaking, with 1,000 tokens, the middle 800 tokens might share 40% of the attention budget, each getting about 0.05%. With 100,000 tokens, the middle 99,800 tokens share that same 40%, each getting about 0.0004%. That’s a 100x drop per token.
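A quick sanity check of that arithmetic (the 40% middle share and the edge sizes are illustrative assumptions, not measured values):

```python
def middle_share(total_tokens, edge_tokens=100, middle_budget=0.40):
    """Per-token attention share for a 'middle' token, assuming the middle
    collectively receives a fixed 40% of the budget (illustrative numbers)."""
    middle_count = total_tokens - 2 * edge_tokens
    return middle_budget / middle_count

small = middle_share(1_000)    # 0.40 / 800   -> 0.05% per token
large = middle_share(100_000)  # 0.40 / 99800 -> ~0.0004% per token
print(f"{small:.4%} vs {large:.6%}, a ~{small / large:.0f}x drop")
```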
This is why context management matters more than context size. It’s not about fitting more in, but about fitting the right things in at the right time.
Context Usage Monitoring in Claude Code
Now that we understand why context management matters, here’s a small trick for keeping an eye on it in Claude Code (if you’re using a different AI assistance tool, you can build a script to calculate the context usage yourself).
Claude Code can show the context usage in the status line (/status-line). It shows the session context, which includes the entire conversation history (every message we’ve sent and every response Claude has given), all tool results accumulated during the session (file contents from reads, command outputs, search results), plus the system prompt, tool definitions, CLAUDE.md files, and any agent prompts.
It does not include files sitting on disk that haven’t been opened, the broader git history, or anything outside the current conversation.
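If you’re not using Claude Code, a crude estimator in this spirit is easy to sketch. It leans on the rough ~4-characters-per-token rule of thumb (real tokenizers vary, so treat the numbers as a gauge, not a measurement), and the 200k window size is just an assumed default:

```python
def estimate_tokens(text):
    """Very rough token count using the ~4 characters-per-token heuristic."""
    return len(text) // 4

def context_usage(parts, window_size=200_000):
    """Estimate how full the context window is.

    parts: the strings currently in context (messages, file reads, tool output).
    window_size: the model's context window in tokens (e.g. ~200k).
    """
    used = sum(estimate_tokens(p) for p in parts)
    return used, used / window_size

parts = ["Fix the login bug", "x" * 40_000]  # a short prompt plus a big file read
used, fraction = context_usage(parts)
print(f"~{used} tokens, ~{fraction:.1%} of the window")
```

Note how a single large file read dwarfs the prompt itself, which matches the point above about file contents being the biggest consumers.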
When the percentage used gets high, Claude Code will start compressing older parts of the conversation to make room.
Go ahead and configure it by typing /status-line. Then tell Claude:
Add a progress bar showing percentage of context currently used
Claude will generate and configure a script automatically. You’ll see something like:
[Opus 4.5] 38% ▓▓▓▓░░░░░░ 62% free

Note: when we start a Claude Code session, the status will show 0%, but in reality it’s more than that. The reason it shows 0% is that no API call has happened yet. The percentage is calculated from the most recent API call’s input tokens, and before the first message there simply isn’t one. After the first exchange it will jump to reflect the true baseline, including the system prompt, tool definitions, CLAUDE.md, the message we sent, and Claude’s response. So don’t be surprised if you notice it jump from 0% to 12% after a small first prompt. There’s more than just the prompt in the context at this stage, which is also a sign that you should manage the tools/MCPs you include in your configuration: too many of them exhaust the context quicker.
From there it grows as the conversation accumulates more messages and tool results, with file reads and command outputs typically being the biggest consumers.
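For reference, the bar-drawing part of such a script boils down to something like this. The function below is a hypothetical reconstruction (the script Claude generates for you will differ), showing only how a bar might be rendered from a usage fraction:

```python
def render_bar(fraction, width=10):
    """Render a usage bar like '38% ▓▓▓▓░░░░░░ 62% free'."""
    filled = round(fraction * width)
    bar = "▓" * filled + "░" * (width - filled)
    return f"{fraction:.0%} {bar} {1 - fraction:.0%} free"

print(render_bar(0.38))  # -> 38% ▓▓▓▓░░░░░░ 62% free
```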
Context Usage Thresholds
The status-line will give an initial indication of the quality we’d be getting in the responses. This is how I’ve been interpreting it. Take this with a grain of salt:
0-40%: Green Zone. Work normally at this stage. There’s plenty of room in the context window, and quality should be high (the model has enough attention budget to focus on what matters).

40-70%: Orange Zone. Quality is still acceptable, but it’s good to start planning how to manage the usage. For example, finish the current task; if it’s a big feature, ask Claude to write a summary to a markdown file, run /compact, and consider starting a new session (the attention budget is stretched, and the middle parts of the conversation will start getting ignored).

70-90%: Red Zone. Quality starts becoming poor, as the middle now holds lots of information the model won’t consider. Stop adding more context, run /compact, and start a new session.

After 95%, Claude Code automatically compacts the conversation, but it’s important to stop here and start fresh. Otherwise, quality will degrade.
Patterns that help with Context management
Save stuff outside the context window, and reference it when needed. Let the agent write key decisions, changes, and state into an external file as it works. This can be done through a small instruction in your CLAUDE.md file, such as: “When working on multi-step tasks, maintain a scratchpad.md file. After each significant change, write a one-line summary of what you did. Before starting a new step, read the scratchpad to remind yourself of the current state.” This way, the context survives resets and long sessions.
Only load what’s needed - Agents do this out of the box, but we can help them be smarter about it. This can be done through clear organisation of the codebase, clear readme files, markdown files with the project layout, so the agent doesn’t have to scan the whole codebase to find something. Using specific prompts also helps (e.g. fix the bug in src/auth.js instead of fix the login bug)
Summarise to save space - when the conversation gets long, summarise the older parts. Claude Code does this automatically at 95%, but it’s worth doing earlier to make better use of the attention budget.
Keep separate conversations - For sure don’t mix topics, but also don’t mix plans with implementations, or research with coding. Build each in a separate session.
From what I’ve experienced so far, detailed, narrowly scoped prompts work better than broad, open-ended ones (which lead to longer chats). For example, an approach that doesn’t work is saying Migrate all our services to the new API. The agent will load lots of files, and the context will grow quickly with follow-up conversations and corrections. It’s better to specify one endpoint at a time: describe how to migrate it, give an example, and ask it to run the tests.
Worth noting that my experience with markdown files isn’t the best, even when the context is still small. I find both the model and myself losing track of what it’s doing. The model also ends up doing weird things, such as marking a task as done without implementing it, or creating very defensive implementations. Using a task-management tool such as beads worked better than markdown files for me. So after research and design, I ask the agent to create the tasks directly in beads (you can also use a tool such as Linear, but it’s a bit more expensive, as you’d be using MCP). For now, I’m just using markdown files as a way to review plans, not to track state.
Wrapping up
Like memory management in C or state management in React, this is a skill we need to develop as LLMs evolve. It’s important to:
Monitor context proactively
Structure work in chunks (e.g. research → plan → implement)
Use explicit file loading (don’t let the AI explore randomly)
Document between sessions (files are cheaper than context)
I’m still experimenting with this myself, and there’s a lot more to learn. How do you manage context when working with AI coding agents? I’d love to hear what’s working and not working for you.


