This video details how an AI app developer reduced AI model usage costs -- which had ballooned to 10 times the estimate -- by 80%. It walks through analyzing the root cause of the cost spike and using Claude Code to design and implement a dynamic system prompt and tool calling system. The result is a method for efficiently using cheaper AI models to dramatically cut costs without sacrificing performance.
1. The Unexpected AI Cost Explosion and Problem Analysis
The video begins with Chris, an AI agent app developer, scrambling to solve a problem: costs were far exceeding expectations. Over the past two weeks, just 20 light users had generated $40 in costs -- more than 10 times the originally estimated $3. He felt a sense of urgency that at this rate, growing the user base could lead to bankruptcy.
Chris initially estimated costs of 2-4 cents per user, but overlooked the fact that tool calls are counted as separate requests. For example, a simple request like "move my meeting to tomorrow" was actually processed through multiple steps:
- Understanding the initial request
- A tool call to fetch task data
- A tool call to update the event
- A tool call for user confirmation (in some cases)
- The final response
As a result, a single user request was being processed as 4-5 requests, each resending the full context, multiplying costs roughly tenfold.
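The fan-out above can be sketched as a back-of-the-envelope cost calculation. This is purely illustrative: the per-token prices, token counts, and step names below are assumptions for a GPT-4o-class model, not Chris's actual numbers, but they show why each tool-call round trip (which resends the whole context) inflates the bill.

```typescript
// Illustrative cost sketch: one "move my meeting" turn fans out into
// several model calls, each billed on the full input context.
// Prices and token counts are assumed, not actual provider rates.

interface Step {
  name: string;
  inputTokens: number;
  outputTokens: number;
}

const COST_PER_1K_INPUT = 0.0025; // assumed $/1K input tokens
const COST_PER_1K_OUTPUT = 0.01;  // assumed $/1K output tokens

function turnCost(steps: Step[]): number {
  return steps.reduce(
    (sum, s) =>
      sum +
      (s.inputTokens / 1000) * COST_PER_1K_INPUT +
      (s.outputTokens / 1000) * COST_PER_1K_OUTPUT,
    0,
  );
}

// Every step resends the big system prompt plus all 17 tool schemas,
// so input tokens stay high on every call, not just the first.
const steps: Step[] = [
  { name: "understand request", inputTokens: 25000, outputTokens: 50 },
  { name: "tool: fetch tasks",  inputTokens: 25500, outputTokens: 40 },
  { name: "tool: update event", inputTokens: 26000, outputTokens: 40 },
  { name: "final response",     inputTokens: 26500, outputTokens: 120 },
];

console.log(`~$${turnCost(steps).toFixed(2)} per turn`); // ~$0.26 per turn
```

Under these assumed rates a single turn lands in the tens of cents, which is the same order of magnitude as the roughly 20 cents per request Chris reports later.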
The biggest problem was that nearly all calls were using very expensive models like GPT-4o. Chris explains he had no choice because GPT-4o was far more accurate than cheaper alternatives (GPT-4o mini, Gemini Flash, etc.). The cheaper models showed about a 20% failure rate in testing, while GPT-4o had only about 2%.
But he soon realized something:
"Maybe these small, cheap models aren't bad. Maybe I'm just asking too much of them."
He compared it to giving a house sitter a list of 100+ tasks versus a short list of 3, explaining that AI models similarly become less reliable the more tools and instructions they receive. This was especially true for smaller models.
2. The Solution: Dynamic System Prompts and Tool Lists
The solution Chris found was conceptually very simple at a high level: dynamically generate system prompts and tool lists. Instead of sending a massive system prompt and all 17 tools at once, send only exactly what the AI model needs to perform the task. This makes smaller models far more reliable at executing requests.
To implement this solution as a concrete technical architecture, Chris used Claude Code. He emphasizes that he used Claude Code not just as a coding tool, but as a research partner and technical solution architecture aid.
3. Designing the Solution Architecture with Claude Code
Chris ran Claude Code within Cursor and asked:
"Can you analyze this codebase and propose a technical solution to reduce system prompts and tool calls so I can use much cheaper models? Please use Ultra Think."
"Ultra Think" is a special keyword that prompts Claude Code to think more deeply. Claude Code spent a long time thinking about this request, showing its thought process in real time. Chris was impressed that Claude Code accurately identified the problems: the system prompt was too large, expensive models like GPT-4o and Claude Sonnet were being used, and there were 17 tools.
After completing its analysis, Claude Code generated two markdown files with proposed solutions.
3.1. Model Price Comparison and Smart Model Selection
The first file compared prices across various models and recommended using GPT-4o, Gemini 2.0 Flash, DeepSeek, and others. It also proposed classifying models by complexity for smart selection.
3.2. Dynamic System Prompt and Tool Calling Architecture
The second file detailed the proposed architecture:
- Intent Classification Layer:
  - Uses the very cheap Gemini Flash to classify the intent of user requests.
  - Example intent types: search tasks, analysis, etc.
- Dynamic System Prompt Construction:
  - Instead of one massive system prompt, split it into multiple modules.
  - Dynamically compose the system prompt by assembling only the modules needed for the request's intent.
  - This can dramatically reduce the prompt from 25,000 tokens to 2,000-5,000.
- Dynamic Tool Calling:
  - Group tools into several categories.
  - Select only the necessary tool groups based on the request's intent.
  - Instead of sending all 17 tools, send just a few, reducing tool count by 50-70%.
- Model Selection:
  - Select a specific model based on request complexity.
  - Use ultra-cheap models like Gemini Flash for very simple requests, and premium models like GPT-4o only for very complex requests that need them.
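The configuration side of this architecture can be sketched as a few lookup tables plus two small helpers. All of the names below (the intent labels, module texts, tool names, and helper functions) are hypothetical stand-ins, not identifiers from Chris's codebase; the point is the shape of the design: modules instead of one monolithic prompt, tool groups instead of a flat list of 17, and a complexity-to-model map.

```typescript
// Hypothetical sketch of the proposed architecture's configuration.
// Intent labels, module texts, and tool names are invented examples.

type Intent = "search" | "schedule" | "analysis";
type Complexity = "simple" | "moderate" | "complex";

// The old monolithic system prompt, split into composable modules.
const PROMPT_MODULES: Record<string, string> = {
  core:     "You are a task assistant. Today is {date}.",
  search:   "When searching tasks, match on title and due date.",
  schedule: "When moving events, confirm the new time with the user.",
  analysis: "When analyzing tasks, summarize counts and deadlines.",
};

// Tools grouped by category; only the needed groups are sent.
const TOOL_GROUPS: Record<Intent, string[]> = {
  search:   ["searchTasks", "getTask"],
  schedule: ["updateEvent", "createEvent", "deleteEvent"],
  analysis: ["summarizeTasks"],
};

// Cheapest model assumed capable of each complexity tier.
const MODEL_BY_COMPLEXITY: Record<Complexity, string> = {
  simple:   "gemini-2.0-flash",
  moderate: "gpt-4o-mini",
  complex:  "gpt-4o",
};

// Assemble only the core module plus the modules this request needs.
function buildPrompt(intents: Intent[]): string {
  const keys = ["core", ...intents.filter((i) => i in PROMPT_MODULES)];
  return keys.map((k) => PROMPT_MODULES[k]).join("\n");
}

// Send only the tool groups matching the classified intents.
function selectTools(intents: Intent[]): string[] {
  return intents.flatMap((i) => TOOL_GROUPS[i]);
}
```

A request classified as `["schedule"]` would then receive only the core and scheduling modules and three tools instead of 17, which is where the 50-70% tool reduction and the 25,000-to-2,000 token drop come from.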
Claude Code proposed all implementation details through an orchestrateRequest function handling intent classification, dynamic system prompt and tool construction, and model selection. Chris evaluated this plan as an excellent starting point for implementing a complex architecture and shared tips on using Claude Code as a research partner.
4. Implementation and Code Walkthrough
Once the technical architecture was complete, Chris asked Claude Code to implement it immediately. Remarkably, Claude Code implemented this complex system in one shot. Chris explains this was because the plan was so well-structured.
Chris walked through the implemented code:
- orchestrateRequest function: All user requests flow through this new function.
- Intent Classification:
  - First classifies the type of user message -- whether it's a complex request, time-related, etc.
  - Uses small, cheap models like Gemini 2.0 Flash for this step.
  - Analyzes the user request and returns an object containing metadata: complexity, required tools, required model, etc.
  - Chris provided detailed instructions to help the model with this classification.
- Dynamic System Prompt Construction:
  - Builds the system prompt from the metadata returned by intent classification.
  - First includes essential, non-negotiable elements like date information and user confirmation requirements.
  - Splits the previously massive system prompt into modules (e.g., deletion, scheduling, timezone modules).
  - Dramatically reduces system prompt size by selecting and combining only the modules the request needs.
- Tool Selection:
  - Selects the necessary tool list based on intent and request type.
  - For example, basic search tasks get only search tools, and scheduling tasks get only scheduling tools.
  - If both search and scheduling are needed, both tool groups are provided.
  - This means only the necessary tools are sent, instead of all tools at once.
- Model Selection:
  - The simplest step: mapping request complexity to the appropriate model.
  - Gemini 2.0 Flash for very simple requests, GPT-4o for very complex ones that genuinely need an expensive model.
  - The models are stored as an array so a fallback system can later be added that moves to the next model if GPT-4o goes down.
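Putting the four steps together, the runtime flow might look like the following minimal sketch. Only the `orchestrateRequest` name comes from the video; everything else (the keyword-based `classifyIntent` stub, the module and tool-group contents) is an assumption. In the real system the classification step is a call to a cheap model like Gemini 2.0 Flash, not a regex.

```typescript
// Minimal sketch of the orchestrateRequest flow described above.
// classifyIntent is stubbed with keyword rules standing in for the
// cheap-model classification call; all helper data is hypothetical.

interface IntentResult {
  intents: string[];
  complexity: "simple" | "complex";
}

function classifyIntent(message: string): IntentResult {
  // Stub for the Gemini Flash classification call.
  const intents: string[] = [];
  if (/find|search|show/i.test(message)) intents.push("search");
  if (/move|reschedule|schedule/i.test(message)) intents.push("schedule");
  const complexity: IntentResult["complexity"] =
    intents.length > 1 ? "complex" : "simple";
  return { intents, complexity };
}

function orchestrateRequest(message: string) {
  // 1. Intent classification (cheap model).
  const intent = classifyIntent(message);

  // 2. Dynamic system prompt: non-negotiable essentials first,
  //    then only the modules this intent needs.
  const modules: Record<string, string> = {
    search: "Search instructions...",
    schedule: "Scheduling instructions...",
  };
  const systemPrompt = [
    `Today is ${new Date().toISOString().slice(0, 10)}. ` +
      `Always confirm destructive actions with the user.`,
    ...intent.intents.map((i) => modules[i]).filter(Boolean),
  ].join("\n");

  // 3. Tool selection: only the matching tool groups.
  const toolGroups: Record<string, string[]> = {
    search: ["searchTasks"],
    schedule: ["updateEvent", "createEvent"],
  };
  const tools = intent.intents.flatMap((i) => toolGroups[i] ?? []);

  // 4. Model selection, stored as an array to allow a future
  //    fallback to the next model if the first one is down.
  const models =
    intent.complexity === "complex"
      ? ["gpt-4o", "gpt-4o-mini"]
      : ["gemini-2.0-flash", "gpt-4o-mini"];

  return { systemPrompt, tools, models };
}
```

For a request like "move my meeting to tomorrow", this sketch classifies a single `schedule` intent, builds a prompt from just the essentials plus the scheduling module, sends two tools instead of 17, and routes to the cheapest model in the list.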
After all these steps, the context sent to the LLM was reduced by over 80%. Chris reemphasizes that while the system looks complex on the surface, it's actually very simple, and Claude Code implemented everything in one shot.
5. Results and Key Lessons
After implementing these changes, Chris saw remarkable results:
- Cost reduction: Costs dropped from 2-4 cents per request (about 20 cents including tool calls) to under 0.5 cents per request, and even lower in some cases. On average, an 80%+ cost reduction was achieved.
- Maintained accuracy: The most important question was whether accuracy suffered when switching to cheaper models. Chris built an evaluation system (automated tests) to run various scenarios and verify the agent was working correctly. All tests passed, confirming zero reliability degradation.
"Because each model has exactly what it needs and nothing more, these small models are much more likely to follow instructions correctly."
Chris mentioned he could make a separate video about this evaluation system.
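The video doesn't show the evaluation system's code, but a harness of the kind Chris describes (scripted scenarios asserting the agent behaves correctly) can be sketched as below. `runAgent` is a stand-in stub for the real agent loop, and the scenarios are invented examples; the structure is the point: run each scenario, compare against the expected behavior, and report failures.

```typescript
// Hedged sketch of a scenario-based evaluation harness. runAgent is a
// stub for the real agent; scenarios and tool names are hypothetical.

interface Scenario {
  name: string;
  message: string;
  expectTool: string;
}

// Stub agent: returns the tool it would call for a given message.
function runAgent(message: string): string {
  if (/move|reschedule/i.test(message)) return "updateEvent";
  if (/delete|remove/i.test(message)) return "deleteEvent";
  return "searchTasks";
}

// Run every scenario and collect the names of any that fail.
function runEvals(scenarios: Scenario[]): { passed: number; failed: string[] } {
  const failed = scenarios
    .filter((s) => runAgent(s.message) !== s.expectTool)
    .map((s) => s.name);
  return { passed: scenarios.length - failed.length, failed };
}

const scenarios: Scenario[] = [
  { name: "reschedule", message: "move my meeting to tomorrow", expectTool: "updateEvent" },
  { name: "delete",     message: "delete the standup",          expectTool: "deleteEvent" },
  { name: "search",     message: "show my tasks for Friday",    expectTool: "searchTasks" },
];
```

Running the same scenario suite before and after a model switch is what lets a claim like "zero reliability degradation" be checked rather than assumed.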
Finally, Chris shared three key lessons:
- The value of dynamic system prompts and tool calls: When you have a massive system prompt that makes it hard for models to follow instructions, this approach is an excellent alternative. It enables the use of much cheaper models.
- The efficiency of cheap models: Using 2-3 small, cheap models can be more efficient in terms of speed and cost than using a single expensive model. That was indeed the case here.
- Account for tool call costs: When working with agents and tool calls, always include those costs in your overall cost calculations.
6. Conclusion
Chris says the cost explosion was actually a welcome catalyst for learning new techniques, adding that these optimization approaches feel very obvious in retrospect. He plans to continue learning and sharing more optimization techniques, and wraps up the video by encouraging viewers to follow his other social media channels and subscribe.
