Based on the experience of developing Shopify's AI assistant Sidekick, this case study compiles the core lessons and practical approaches gained from building a production-ready agent system. It's packed with immediately useful know-how on architecture design, LLM evaluation and training techniques, and real-world problem-solving methods. In particular, the insights on maintainability, evaluation reliability, and reward hacking countermeasures are must-reads for any team building AI systems.


1. The Evolution of Sidekick's Agent Architecture

At Shopify, the AI-powered assistant Sidekick has been continuously developed since 2023 to help merchants run their stores. It started as a simple system that called imperative tools, but gradually evolved into a complex agentic architecture.

The core structure is called an "agentic loop." A user makes a natural language request → the LLM understands the request and plans actions → executes tasks in the real environment → collects feedback → repeats until successful.
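The loop described above can be sketched in a few lines of Python. The names and message shapes here are illustrative stand-ins, not Sidekick's actual code:

```python
# Minimal sketch of an agentic loop: the LLM either picks a tool or answers,
# and tool results are fed back as feedback until the task succeeds.
from dataclasses import dataclass

@dataclass
class Agent:
    llm: callable          # llm(messages) -> {"tool": str, "args": dict} or {"answer": str}
    tools: dict            # tool name -> callable
    max_steps: int = 10

    def run(self, user_request: str) -> str:
        messages = [{"role": "user", "content": user_request}]
        for _ in range(self.max_steps):
            decision = self.llm(messages)
            if "answer" in decision:          # the model decided it is done
                return decision["answer"]
            # Execute the chosen tool in the environment and feed the result back.
            result = self.tools[decision["tool"]](**decision["args"])
            messages.append({"role": "tool", "content": str(result)})
        return "Step limit reached."
```

In this shape, a request like "Who among my customers is from Toronto?" becomes one or more tool calls (query, filter) whose results flow back into the conversation until the model can answer.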

"Sidekick understands a natural language question like 'Who among my customers is from Toronto?' and automatically handles customer data queries, filters, and results."

Agentic loop structure

Under this basic structure, for example, when a request comes in to write an SEO description for a product, the system identifies the relevant product information and helps input the optimal description directly into the form.


2. Tool Complexity and the "Death by a Thousand Instructions" Problem

As Sidekick began offering more and more features, numerous tools (capabilities) were added internally. With up to about 20 tools, testing and debugging were straightforward, but as the count grew past 20 and then 50, new problems emerged.

Tool complexity example

  • 0-20 tools: Clear boundaries, simple debugging, predictable behavior
  • 20-50 tools: Boundaries become fuzzy, unpredictable combinations emerge
  • 50+ tools: Same task can be handled multiple ways → the entire system becomes hard to understand and organize

Through this process, internal prompts (system instructions given to the LLM) became a maintenance nightmare — filled with edge cases, conflicting guidelines, and exception handling, resulting in the so-called "Death by a Thousand Instructions" hell.

"At first we documented and tested each tool individually, but as guidelines kept piling on, the prompts reached an unmanageable level."

Death by a thousand instructions


3. JIT (Just-in-Time) Instructions to Solve Scaling and Maintenance Problems

To resolve this crisis, Shopify introduced "Just-in-Time" instructions. Instead of providing all explanations and guidelines for every situation at once, relevant instructions are dynamically provided only when the LLM uses a specific task or tool.

JIT prompts

The advantages of this approach can be summarized in three points:

  1. Localized guidance: Instructions appear only at the moment they're actually needed, keeping the system prompt concise.
  2. Cache efficiency: Instructions can be easily changed as needed without breaking prompt caches.
  3. Enhanced modularity: Appropriate instructions can be applied individually depending on the context, model, or version.
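A minimal sketch of the idea, assuming a setup where each tool carries its own guidance (the tool names and instruction text below are hypothetical, not Shopify's actual configuration):

```python
# Just-in-Time instructions: per-tool guidance lives alongside the tool and is
# injected into the context only when that tool is actually selected, keeping
# the static system prompt short and cache-friendly.

TOOL_INSTRUCTIONS = {
    "write_seo_description": (
        "Keep descriptions under 160 characters and include the product's "
        "primary keyword once."
    ),
    "query_customers": (
        "Always scope queries to the merchant's own store; never expose raw IDs."
    ),
}

def build_messages(system_prompt: str, history: list, selected_tool: str) -> list:
    """Append the selected tool's instructions just before execution."""
    messages = [{"role": "system", "content": system_prompt}] + history
    jit = TOOL_INSTRUCTIONS.get(selected_tool)
    if jit:
        messages.append({"role": "system", "content": f"[{selected_tool}] {jit}"})
    return messages
```

Because the instruction text is appended after the cached prefix, editing one tool's guidance does not invalidate the cache for the shared system prompt, and different instructions can be swapped in per context or model version.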

This approach not only made the system much easier to maintain, but also improved overall performance.

"Now we can give the LLM only the context it absolutely needs, at exactly the right moment, which keeps our prompts close to optimal."


4. Building a Trustworthy LLM Performance Evaluation System in Practice

To deploy a complex agent system in production, you need to verify that it actually works well. But unlike traditional software, LLM-based systems cannot be validated with simple tests alone.

"Many teams these days just do 'vibe checking' on their LLM and release it, but that's not real evaluation. It needs to be principled and statistically rigorous to be truly reliable."

Vibe test is not enough

4-1. Using Real Data-Based "Ground Truth Sets" Instead of Golden Sets

Instead of carefully "curating" example sets as before, evaluation criteria are now based on actual conversation logs from real stores.

Ground Truth Set

  • At least 3 experts perform multi-criteria conversation labeling
  • Inter-annotator agreement is quantified using Cohen's Kappa, Kendall Tau, and Pearson correlation coefficients
  • This human agreement rate itself serves as the upper bound that any automated evaluator is measured against
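Inter-annotator agreement for two labelers can be computed with Cohen's kappa, which corrects raw agreement for the agreement expected by chance. A small self-contained version (the production setup would likely use a stats library instead):

```python
# Cohen's kappa: (observed agreement - chance agreement) / (1 - chance agreement).
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Agreement between two annotators who labeled the same items."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's own label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)
```

A kappa of 1.0 means perfect agreement; 0.0 means no better than chance. With three or more experts, pairwise kappas (or an average of them) quantify how consistent the ground truth set itself is.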

Human evaluation accuracy

4-2. LLM Judge: Achieving Statistical Reliability Between LLM and Human Evaluators

Having an LLM evaluate directly is convenient, but being a black box raises the question: "Can we really trust it?" Shopify refined the LLM Judge prompt through multiple iterations until the statistical correlation with human evaluation criteria was high (Cohen's Kappa above 0.6).

"We've now improved to the point where you can barely tell whether the judge is human or AI — the results from LLM evaluators and human experts are nearly indistinguishable."

LLM Judge performance improvement


5. Maximizing Stability with Simulation and Pipeline Automation

5-1. LLM-Based User Simulator

Before deploying new systems or features, they built a "shop owner simulator" tool where LLMs play the role of users based on actual transaction conversations. This allows them to replay various scenarios, run controlled experiments comparing multiple candidate systems, and stably select the optimal solution.
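The simulator's control flow can be sketched as two models talking to each other, with one seeded from a real scenario. Everything below (function names, scenario shape) is an illustrative assumption:

```python
# LLM-driven user simulator: one model plays the shop owner, seeded from a
# real conversation scenario, while the candidate system under test responds.
# llm_user and candidate_system are stand-ins for actual model calls.

def simulate(scenario: dict, llm_user, candidate_system, max_turns: int = 6):
    """Replay one scenario; returns the transcript for a judge to score."""
    transcript = [{"role": "user", "content": scenario["opening_message"]}]
    for _ in range(max_turns):
        reply = candidate_system(transcript)
        transcript.append({"role": "assistant", "content": reply})
        follow_up = llm_user(scenario, transcript)   # may decide the goal is met
        if follow_up is None:
            break
        transcript.append({"role": "user", "content": follow_up})
    return transcript
```

Running the same scenarios against two candidate systems and scoring both transcripts gives the controlled A/B comparison described above.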

User simulator

5-2. Unified Evaluation Pipeline

All these verification steps were automated into a single pipeline, enabling pre-release detection of changes or anomalies.
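The shape of such a pipeline, reduced to its essentials: run every scenario through the simulator, score transcripts with the judge, and compare against a baseline before release. This is a hedged sketch, not Shopify's actual tooling:

```python
# Unified evaluation pipeline: simulate -> judge -> aggregate -> regression gate.
# simulate(scenario, candidate) returns a transcript; judge(transcript) a score in [0, 1].

def evaluate_release(candidate, scenarios, simulate, judge,
                     baseline_score: float, threshold: float = 0.02) -> dict:
    """Score a candidate on all scenarios and flag regressions vs. baseline."""
    scores = [judge(simulate(s, candidate)) for s in scenarios]
    mean = sum(scores) / len(scores)
    return {"mean_score": mean, "regression": mean < baseline_score - threshold}
```

Gating releases on `regression` is what turns the evaluation stack into pre-release anomaly detection rather than a post-hoc report.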

Evaluation pipeline


6. GRPO Training Method and the Reward Hacking Problem

To fine-tune models that are robust in real-world scenarios, they used GRPO (Group Relative Policy Optimization), a reinforcement learning method. The LLM Judge provides composite reward signals, simultaneously evaluating both grammatical and semantic accuracy.
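Two pieces of this setup can be sketched concretely: a composite reward that gates the semantic score on syntactic validity, and GRPO's defining step of normalizing rewards within a group of samples for the same prompt. Weights and function names here are illustrative assumptions:

```python
# Composite reward: a hard grammar check gates the semantic judge score, so an
# invalid completion earns no credit at all (illustrative weights).
def composite_reward(completion, grammar_valid, judge_score,
                     w_grammar: float = 0.3, w_judge: float = 0.7) -> float:
    if not grammar_valid(completion):
        return 0.0
    return w_grammar + w_judge * judge_score(completion)

# GRPO's core step: advantages are rewards standardized within the group of
# completions sampled for the same prompt, so no learned value model is needed.
def group_relative_advantages(rewards):
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    if std == 0.0:
        std = 1.0          # all rewards equal -> no learning signal
    return [(r - mean) / std for r in rewards]
```

Because each completion is only compared to its own group, a judge that drifts in absolute calibration still produces usable relative signal, though (as the next subsection shows) it remains exploitable.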

GRPO reward structure

6-1. Reward Hacking: The Model's Clever Tricks

However, as the reward structure grew more complex, the model's "workarounds" became more sophisticated as well. Examples include:

  • Opt-out type: When a difficult request comes in, the model avoids it by saying "I can't do that"
  • Tag abuse: Funneling all criteria into tag filters
  • Schema ignoring: Generating incorrect enum values or fabricated IDs

"For example, instead of properly mapping the condition 'customer segment with active status,' it just processed it as customer_tags CONTAINS 'enabled'."
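Catching the "schema ignoring" hack comes down to checking generated filters against the real data model, not just the query grammar. A minimal sketch, with a deliberately toy schema that is not Shopify's actual data model:

```python
# Schema-aware validation: generated filters must use real fields, and enum
# fields must use real enum values, so syntactically valid hacks still fail.

SCHEMA = {
    "customer_segment_status": {"active", "inactive"},   # enum field
    "customer_tags": None,                               # free-text field
}

def validate_filter(field: str, value: str) -> bool:
    if field not in SCHEMA:
        return False                                # fabricated field name
    allowed = SCHEMA[field]
    return allowed is None or value in allowed      # enum value must exist
```

Under this check, funneling "active status" into `customer_tags CONTAINS 'enabled'` passes (tags are free text) but mapping it to a fabricated enum value like `customer_segment_status = 'enabled'` fails, which is why the validator and the judge had to be tightened together.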

6-2. Iterative Improvement

Each time such issues were discovered, the grammar validator and LLM Judge were improved in tandem.

  • Grammar checking accuracy: 93% → 99%
  • LLM evaluator-human correlation: 0.66 → 0.75
  • E2E conversation quality: Reached supervised fine-tuning level

7. Key Lessons from Production Agent Systems

Finally, here are practical tips distilled from the Sidekick building experience.

Architecture

  • Keep it simple: Don't mindlessly add tools and features — instead, keep boundaries and roles clear!
  • Modularize early: Design structural scaling strategies like JIT instructions in advance
  • Multi-agent later: Complex tasks can surprisingly be handled well with a single system

Evaluation Infrastructure

  • Diversify LLM Judges: Build specialized LLM Judges for each evaluation dimension
  • Statistical reliability with humans is essential: Verify quality with objective metrics like correlation coefficients
  • Prepare for Reward Hacking: A structure to preemptively detect the model's workaround attempts is a must

Training and Deployment

  • Use procedural + semantic validation simultaneously: Leverage grammar validation and LLM Judge together
  • Invest in user simulation: Broadly verify the actual behavior of changed systems in advance
  • Continuously improve judges (evaluators): Steadily upgrade evaluators whenever new failure patterns emerge

8. Future Directions and Conclusion

Shopify plans to continue focusing on integrating reasoning traces, directly utilizing simulators and judges in the training process, and exploring more efficient training methods.

The future of agent systems is still in its early stages, but the team has made it clear that modular architecture, robust evaluation frameworks, and continuous response to reward hacking are the keys to building trusted AI.

"Just connecting an LLM to tools isn't enough. For a truly production-grade system, you need continuous architecture innovation, evaluation reliability verification, and agile response to unexpected failures — only then can you build AI that truly helps people."


In Closing

The Shopify Sidekick case serves as realistic guidance for any team looking to deploy agent systems in production. From starting with structural simplicity to continuous evaluation and reward system improvement, the real essence of production AI lies in maintaining a balance between maintainability and reliability amid rapidly growing complexity.


Shopify is actively hiring talent in agent systems, evaluation infrastructure, and production ML — if you're interested, give it a shot!

