Why AI Token Costs Are So High and How Coding Agents Amplify Them
The increasing costs associated with AI, particularly agentic coding models, have become a significant topic of discussion. Companies like Anthropic are implementing stricter usage caps for premium users, and GitHub Copilot recently shifted from a request-based monthly cost to an AI token credit system, substantially reducing the amount of work users can accomplish per month. This change highlights the inherent expense of running these advanced AI models.
What is a Token?
At its core, a token is either a word or a piece of a word. Unlike human intuition that separates words by spaces, large language models (LLMs) treat spaces, punctuation (like full stops and curly brackets), and even complex characters (like Chinese characters, which might be one or two tokens) as individual tokens. The exact tokenization process depends on the specific model and its tokenizer, which is essentially a string parser. Tokenizers are designed based on frequency, meaning common words like "the" are often standalone tokens.
A modern LLM typically has access to a vocabulary of around 100,000 tokens, encompassing standard language elements, other languages, code symbols, and special Unicode characters. While a small subset of a few thousand tokens is used most frequently, the model can produce many more if needed.
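As a rough illustration of how text becomes tokens, the sketch below splits a string using a tiny, made-up vocabulary and a greedy longest-match rule. Both the vocabulary and the matching rule are assumptions for illustration; production tokenizers usually rely on merge rules learned from data (for example byte-pair encoding) and vocabularies many thousands of times larger.

```python
# Toy vocabulary: hypothetical, for illustration only. Real vocabularies
# hold on the order of 100,000 entries learned from data.
VOCAB = {"the", " the", "cat", " cat", " sat", ".", " ", "{", "}",
         "token", " token", "izer"}


def tokenize(text: str) -> list[str]:
    """Greedy longest-match split; unknown characters become 1-char tokens."""
    max_len = max(len(entry) for entry in VOCAB)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink it one character at a time.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in VOCAB or size == 1:
                tokens.append(piece)
                i += size
                break
    return tokens


print(tokenize("the cat sat."))   # ['the', ' cat', ' sat', '.']
print(tokenize("the tokenizer"))  # ['the', ' token', 'izer']
```

Note how the leading space travels with the word and how "tokenizer" is split into two pieces because the toy vocabulary has no single entry for it.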
Tokenization and Embeddings
Tokens are not arbitrarily decided; they are part of a defined system. When tokens pass through an LLM, they are converted into a numerical representation called an "embedding." Each token, like "the" or "cat," has its own high-dimensional vector of numbers representing its meaning. These vectors are learned during the model's training and are bespoke to each LLM. The tokenizer, by contrast, is fixed before training, and a model keeps using the tokenizer it was trained with.
While similar words tend to occupy nearby regions of the embedding space, tokenization itself is more of a system design decision. LLMs are often trained on a vast and diverse dataset, including everything from English poems to JavaScript code. This broad training means they possess a wide range of tokens, even if a specific task, like writing JavaScript, only utilizes a subset. This approach is often more cost-effective than training specialized models from scratch.
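The token-to-vector step can be pictured as a table lookup. The sketch below is a minimal stand-in: the token IDs, the four-dimensional vectors, and the random values are all assumptions, whereas a real model learns its table during training and uses far more dimensions.

```python
import random

EMBEDDING_DIM = 4  # hypothetical; real models use hundreds or thousands of dimensions

# Hypothetical token-to-ID mapping standing in for a tokenizer's vocabulary.
TOKEN_IDS = {"the": 0, " cat": 1, " sat": 2, ".": 3}

# Random values stand in for weights that a real model learns during training.
rng = random.Random(0)
EMBEDDING_TABLE = [
    [rng.uniform(-1.0, 1.0) for _ in range(EMBEDDING_DIM)]
    for _ in TOKEN_IDS
]


def embed(tokens: list[str]) -> list[list[float]]:
    """Map each token to its row in the embedding table."""
    return [EMBEDDING_TABLE[TOKEN_IDS[tok]] for tok in tokens]


vectors = embed(["the", " cat", "the"])
print(len(vectors), len(vectors[0]))  # 3 4
print(vectors[0] == vectors[2])       # True: same token, same vector
```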
The Cost of Interaction: Auto-Regressive Models
The primary reason for the high cost of LLMs lies in their auto-regressive nature. When you provide an input, the model processes the entire input as a list, makes numerous decisions, and then outputs a single "next token." This process repeats for every subsequent token in the output.
Consider a simple interaction: you ask a question, and the model generates a response. Unlike humans who can retain context, LLMs re-read the entire conversation history every time they need to generate a new token. As the conversation progresses, the input context grows, leading to significantly increased computational costs.
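The loop below sketches this one-token-at-a-time behaviour, with no caching. The "model" is a hypothetical stand-in that replays a canned reply; the point is only that every pass takes the whole context as input and the context grows by one token per pass.

```python
def generate(prompt, next_token, max_new_tokens=50):
    """Auto-regressive loop: feed the full context in, get one token out."""
    context = list(prompt)
    positions_read = 0
    while len(context) - len(prompt) < max_new_tokens:
        positions_read += len(context)  # the entire context is the input
        token = next_token(context)
        if token == "<end>":
            break
        context.append(token)           # the output becomes part of the next input
    return context[len(prompt):], positions_read


def make_canned_model(reply):
    """Hypothetical stand-in for a real model: replays a fixed reply."""
    remaining = iter(reply + ["<end>"])
    return lambda context: next(remaining)


prompt = ["What", " is", " the", " capital", " of", " France", "?"]
output, positions_read = generate(prompt, make_canned_model(["Paris", "."]))
print(output)          # ['Paris', '.']
print(positions_read)  # 24, i.e. 7 + 8 + 9 across three passes
```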
A Concrete Example of Token Cost
Let's break down the token cost in a typical interaction:
- System Prompt: This is a set of instructions given to the AI by its manufacturer (e.g., "You are an AI agent," "Don't answer rude questions"). This can be substantial, perhaps 1,000 tokens or more, and is always part of the initial input.
- User Query: Your question, let's say 100 tokens.
- Initial Input: The model's first input is the system prompt (1,000 tokens) + user query (100 tokens) = 1,100 tokens.
- Thought Process (Output): Before providing a direct answer, the model might generate internal "thoughts" – a series of tokens representing its reasoning. This could be 2,000 tokens.
- Iterative Generation: For each token in its thought process or final response, the model re-processes the entire preceding context. So, to generate the first word of its thought, the input is 1,100 tokens. To generate the second word, the input becomes 1,100 tokens + the first word of the thought, and so on. This means the model performs a forward pass through a GPU for each output token, with the context growing each time.
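Putting the numbers from the list above together gives a sense of scale. The short calculation below assumes no caching at all, so every forward pass re-reads the full context; the function name is made up for illustration.

```python
def uncached_positions(input_tokens: int, output_tokens: int) -> int:
    """Token positions read if every forward pass re-reads the full context."""
    # Pass i (0-based) sees the original input plus the i tokens produced so far.
    return sum(input_tokens + i for i in range(output_tokens))


total = uncached_positions(1_100, 2_000)
print(total)  # 4199000

# The same figure in closed form: n * input + n * (n - 1) / 2.
n, base = 2_000, 1_100
print(n * base + n * (n - 1) // 2)  # 4199000
```

So 2,000 thought tokens on top of a 1,100-token input mean roughly 4.2 million token positions read in the uncached case, and the second term grows with the square of the output length.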
KV Caching: An Optimization
To mitigate this inefficiency, LLMs employ a technique called KV caching. When the model calculates the relationships between tokens in the input context, it caches these intermediate representations. This means that when a new token is added to the context, the model doesn't have to recalculate all previous relationships from scratch. While KV caching significantly improves efficiency for large context windows, it has less impact on very short sentences.
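A simple count shows why the saving depends on context length. The sketch below tallies how many per-token key/value computations a generation run needs with and without a cache. It is a counting model under stated assumptions, not an implementation of attention: with a cache, each new token still attends over all cached entries, so the work per step is reduced rather than made constant.

```python
def count_kv_computations(prompt_len: int, new_tokens: int, use_cache: bool) -> int:
    """Count per-token key/value computations across a generation run."""
    cached = 0    # tokens whose key/value entries are already stored
    computed = 0
    for step in range(new_tokens):
        context_len = prompt_len + step
        if use_cache:
            computed += context_len - cached  # only tokens not yet cached
            cached = context_len
        else:
            computed += context_len           # everything, every time
    return computed


# Long context: 1,100-token input, 2,000 generated tokens.
print(count_kv_computations(1_100, 2_000, use_cache=False))  # 4199000
print(count_kv_computations(1_100, 2_000, use_cache=True))   # 3099

# Very short exchange: 5-token input, 3 generated tokens.
print(count_kv_computations(5, 3, use_cache=False))  # 18
print(count_kv_computations(5, 3, use_cache=True))   # 7
```

The long run drops by a factor of more than a thousand, while the short one barely changes.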
However, caching also presents challenges. Caches are typically stored on GPUs and have a limited "time to live." If a user takes a long break, the cache might be dropped to free up GPU resources for other users. When the user returns, the model has to "pre-fill" the context by re-processing the entire conversation history, incurring additional costs.
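The effect of an expiring cache can be sketched with a small class that tracks how many tokens must be pre-filled on each request. The class name, the 300-second time to live, and the append-only context are all assumptions chosen for illustration; actual providers set their own lifetimes and eviction rules.

```python
class PrefixCache:
    """Toy model of a per-conversation cache with a time to live."""

    def __init__(self, ttl_seconds: float):
        self.ttl_seconds = ttl_seconds
        self.cached_tokens = 0
        self.last_used = None

    def tokens_to_prefill(self, context_len: int, now: float) -> int:
        """Return how many tokens must be processed for this request."""
        expired = (self.last_used is None
                   or now - self.last_used > self.ttl_seconds)
        if expired:
            self.cached_tokens = 0  # cache was dropped: start from scratch
        to_process = context_len - self.cached_tokens
        self.cached_tokens = context_len
        self.last_used = now
        return to_process


cache = PrefixCache(ttl_seconds=300)  # hypothetical 5-minute lifetime
first = cache.tokens_to_prefill(10_000, now=0)           # cold start
quick_reply = cache.tokens_to_prefill(10_500, now=60)    # cache still warm
after_break = cache.tokens_to_prefill(11_000, now=4_000) # cache expired
print(first, quick_reply, after_break)  # 10000 500 11000
```

The request after the long break pays for the entire history again, even though only 500 tokens are new.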
The Escalating Cost of Coding Agents
While simple chatbot interactions might not seem overly expensive, the costs skyrocket with agentic coding models. These agents have more autonomy and can interact with files and tools.
Consider a scenario where a coding agent is asked to fix a bug:
- Initial Query: System prompt (4,000 tokens) + user query (200 tokens) = 4,200 input tokens.
- Initial Thought: The agent generates internal thoughts (2,000 output tokens).
- Tool Call (Read File): The agent decides it needs to read a file to find the bug. It makes a tool call (100 output tokens) to read index.html.
- File Content as Input: The system returns the content of index.html (5,000 tokens), which is then added to the model's input context. The total input for the next step is now 4,200 (original) + 2,000 (thought) + 100 (tool call) + 5,000 (file content) = 11,300 tokens.
- Further Thought: The agent processes the file and generates more thoughts (another 2,000 output tokens).
- Second Tool Call (Read Another File): The agent makes another tool call (100 output tokens) to read database.py.
- Second File Content as Input: The system returns the content of database.py (4,000 tokens), adding it to the context. The total input for the next step is now 11,300 + 2,000 + 100 + 4,000 = 17,400 tokens.
- Final Thought and Code Patch: The agent generates a final thought (2,000 output tokens) and then makes a tool call to apply a code patch (1,500 output tokens).
- Success Message: The system confirms the patch (50 tokens, returned to the model and added to its input context).
- Final Response: The agent provides a summary of its actions (500 output tokens).
In this relatively simple bug-fixing scenario involving only two files, the total input tokens processed by the model could easily reach 55,000 to 60,000, plus several thousand output tokens. If the fix doesn't work and the user asks a follow-up question, the entire conversation history, including the file contents, is re-processed. Because every turn re-sends everything before it, the cumulative input grows much faster than the conversation itself, roughly quadratically with the number of turns rather than linearly.
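The scenario above can be tallied step by step. The sketch below assumes the context is read once per model call (that is, caching works within a call) and treats the success message as text returned to the model. The variable names are made up for illustration.

```python
SYSTEM_AND_QUERY = 4_200  # system prompt (4,000) + user query (200)

# (tokens the model generates, tokens the tool result adds afterwards)
STEPS = [
    (2_000 + 100, 5_000),  # thought + read index.html, then file content
    (2_000 + 100, 4_000),  # thought + read database.py, then file content
    (2_000 + 1_500, 50),   # thought + patch tool call, then success message
    (500, 0),              # final summary to the user
]

context = SYSTEM_AND_QUERY
total_input = 0
total_output = 0
inputs_per_call = []
for output_tokens, tool_result_tokens in STEPS:
    inputs_per_call.append(context)  # the whole context is sent on each call
    total_input += context
    total_output += output_tokens
    context += output_tokens + tool_result_tokens

print(inputs_per_call)  # [4200, 11300, 17400, 20950]
print(total_input)      # 53850
print(total_output)     # 8200
print(context)          # 21450 tokens waiting for any follow-up question
```

Under these assumptions the four calls add up to 53,850 input tokens, in the same range as the estimate above, and a single follow-up question would start from a context of more than 21,000 tokens.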
The Unsustainability of Current Cost Models
The example of a Starfield screensaver coded by GitHub Copilot illustrates this point dramatically. Over just six prompts and interactions with three files, the process consumed 2 million input tokens and 47,000 output tokens.
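To turn those counts into money, multiply each by a per-token price. The prices in the sketch below are placeholders invented for illustration, not any provider's actual rates; only the token counts come from the example above.

```python
def cost_usd(input_tokens: int, output_tokens: int,
             input_price_per_million: float,
             output_price_per_million: float) -> float:
    """Bill input and output tokens at separate per-million-token prices."""
    return (input_tokens / 1_000_000 * input_price_per_million
            + output_tokens / 1_000_000 * output_price_per_million)


INPUT_TOKENS = 2_000_000
OUTPUT_TOKENS = 47_000

ratio = INPUT_TOKENS / OUTPUT_TOKENS
print(f"{ratio:.1f} input tokens per output token")  # 42.6 input tokens per output token

# Placeholder prices (assumptions): 3.00 and 15.00 USD per million tokens.
estimate = cost_usd(INPUT_TOKENS, OUTPUT_TOKENS, 3.0, 15.0)
print(f"estimated cost: {estimate:.2f} USD")
```

Whatever the actual prices, the ratio shows where the bill comes from: the model read more than forty tokens for every token it wrote.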
Measuring AI usage by tokens, similar to measuring a driver's quality by tire wear, is unsustainable. When users are incentivized by flat fees, they naturally push the boundaries, asking complex questions, causing the AI to loop, and processing large files, all of which drive up token consumption. This leads to exorbitant costs that are difficult for most companies to justify without immediate and significant product returns.
While agentic AI has its place, its current cost structure, often subsidized, is a major concern. The shift to per-token billing, as seen with GitHub Copilot, reveals the true expense of these operations. The challenge for the industry in the coming year will be to find ways to optimize these costs or develop more efficient models that don't require re-processing vast amounts of context repeatedly. Small, succinct questions and code completion tasks remain the most cost-effective uses of these powerful models.
Takeaways
- Tokenization turns words or sub‑words into discrete units called tokens, and each token is represented by a high‑dimensional embedding learned during model training.
- Large language models process every output token by re‑reading the entire accumulated context, which makes each additional token increasingly expensive, especially in long conversations.
- KV caching stores intermediate token relationships to avoid recomputing the whole context. It improves efficiency for long prompts, offers limited benefit for short inputs, and its savings are lost when caches expire.
- Agentic coding models dramatically inflate token usage because they embed system prompts, tool‑call logs, and full file contents into the context, often reaching tens of thousands of input tokens for a single bug‑fix task.
- The shift to per‑token billing, as seen with GitHub Copilot, exposes the unsustainable cost structure of current AI services, prompting the need for more efficient models or usage patterns focused on concise queries.
Frequently Asked Questions
What is a token in the context of large language models?
A token is the smallest unit a language model processes, representing either a whole word or a sub‑word fragment, and each token is mapped to a numerical embedding that captures its meaning. The model’s tokenizer splits input text into these tokens based on frequency and language rules, allowing the model to handle diverse vocabularies across languages and code.
How does KV caching reduce the computational cost of LLMs?
KV caching stores the key‑value pairs that represent the relationships between tokens in the already‑processed context, so when a new token is generated the model can reuse these cached representations instead of recomputing them from scratch. The model still runs one forward pass per output token, but each pass does far less work, especially for long prompts; the benefit diminishes for very short inputs.
Who is Computerphile on YouTube?
Computerphile is a YouTube channel that publishes videos on a range of topics.