
Recently, as I meet more startups building on generative AI, especially LLMs (large language models), I often find myself thinking hard about how AI products should be developed and improved. Many teams are still accustomed to the traditional "deterministic" approach to software development and run into difficulties when they take on AI projects.
Traditionally, software was built on the premise that "given a specific input (A), a specific output (B) must be produced." With a trained model such as an LLM, however, input A does not always produce the same B. For models that are inherently non-deterministic, the question is no longer whether the output matches "the one right answer." What matters is observing and improving how reliably the model produces output close to the desired answer, that is, its probabilistic performance.
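A minimal sketch of this shift in a test, offered as an illustration rather than a prescribed method. The names `pass_rate`, `make_stub_model`, and `is_acceptable` are hypothetical, and a seeded random stub stands in for the real LLM call so the example runs on its own. Instead of asserting one exact output, it samples the same input repeatedly and reports the fraction of acceptable outputs.

```python
import random
from typing import Callable


def pass_rate(generate: Callable[[str], str],
              is_acceptable: Callable[[str], bool],
              prompt: str,
              n_samples: int = 100) -> float:
    """Call the model n_samples times on the same prompt and return
    the fraction of outputs that pass the acceptance check."""
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    passed = sum(is_acceptable(generate(prompt)) for _ in range(n_samples))
    return passed / n_samples


def make_stub_model(p_good: float, seed: int = 0) -> Callable[[str], str]:
    """Hypothetical stand-in for an LLM call: returns a good answer with
    probability p_good, otherwise a wrong one. Not a real model API."""
    rng = random.Random(seed)

    def generate(prompt: str) -> str:
        return "Paris" if rng.random() < p_good else "Lyon"

    return generate


if __name__ == "__main__":
    model = make_stub_model(p_good=0.9)
    rate = pass_rate(model, lambda out: out == "Paris",
                     "What is the capital of France?", n_samples=200)
    # A deterministic test would assert output == "Paris" once;
    # here the quantity of interest is the pass rate over many samples.
    print(f"pass rate: {rate:.2f}")
```

In practice the acceptance check is the hard part: it may be an exact match, a rubric, or a human review, depending on the task.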
Until a few years ago, probabilistic development, as seen in Uber's dispatch algorithm, Airbnb's dynamic pricing, Facebook's news feed, and Amazon's product ranking, was usually confined to particular features and teams, or to specific roles such as ML engineers. In AI applications, however, the model is not just one part of the product; it is the product itself.
In the past, the development process was generally linear: "Requirements → Development → Testing → Deployment." AI application development, by contrast, is iterative and cyclical. The team updates and deploys the model or logic → measures performance while monitoring real user input and behavior data → adjusts the model or prompt based on feedback and additional data → redeploys, and then repeats the cycle continuously.
In an LLM-based service, output quality and user experience can change significantly even when, for example, the prompt is only slightly reworded. Unlike traditional development, these characteristics call for the following mechanisms.

Establishment of an evaluation system: With AI models, testing is not just a matter of checking whether something works. You need to be able to measure what results come out of different inputs and scenarios, and how valid those results are. To this end, standard machine learning metrics such as accuracy, precision, and recall are used alongside qualitative evaluation through user surveys or direct review.
User-centered feedback loop: AI models must keep being updated and improved after release. By analyzing the behavioral data of real users of the service, you find the points where the model produced output that differed from expectations, and use them as the basis for retraining the model or revising the prompt. For example, if users left "negative feedback" on chatbot responses, you can collect those cases and make the model give better answers in similar situations.
Management of variation in results: Because a stochastic model's output can vary for the same input, "deviation" matters as much as "average performance." For example, if a model "gives a good answer with 70% probability but gives seriously wrong information with 30% probability," you need to think about how to reduce that 30% and how to prepare for the cases where it occurs. A "minimum correct rate" and an "allowable error rate range" must be clearly defined and monitored.
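The evaluation and variation items above can be sketched in a few lines. This is an illustration under stated assumptions, not a prescribed implementation: the `VariationReport` type, the grade names `good` and `serious_error`, and the default thresholds are all hypothetical. The first function computes precision and recall for a binary check; the second summarizes graded outputs from repeated runs and flags whether a minimum correct rate and an allowable error rate are both met.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Sequence


def precision_recall(predicted: Sequence[bool],
                     actual: Sequence[bool]) -> tuple[float, float]:
    """Precision and recall for binary labels (True = positive)."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(a and not p for p, a in zip(predicted, actual))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


@dataclass
class VariationReport:
    good_rate: float           # share of runs graded "good"
    serious_error_rate: float  # share of runs graded "serious_error"
    meets_targets: bool        # True only if both thresholds are satisfied


def check_variation(grades: Sequence[str],
                    min_good_rate: float = 0.9,
                    max_serious_error_rate: float = 0.01) -> VariationReport:
    """Summarize graded outputs from repeated runs of the same input.

    The grade names and default thresholds are illustrative assumptions.
    """
    if not grades:
        raise ValueError("grades must not be empty")
    counts = Counter(grades)
    total = len(grades)
    good_rate = counts["good"] / total
    serious_rate = counts["serious_error"] / total
    ok = good_rate >= min_good_rate and serious_rate <= max_serious_error_rate
    return VariationReport(good_rate, serious_rate, ok)


if __name__ == "__main__":
    # The 70% / 30% model from the text: it fails both illustrative thresholds.
    grades = ["good"] * 70 + ["serious_error"] * 30
    print(check_variation(grades))

    p, r = precision_recall([True, True, False, False],
                            [True, False, True, False])
    print(f"precision={p:.2f} recall={r:.2f}")
```

Tracking the serious-error rate separately from the good-answer rate is a deliberate choice here: an average can look acceptable while the tail of bad outputs is not.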

The dispatch team at Lyft and the feed team at Carrot Market, where I worked, come to mind. Both are teams of probabilistic development experts built around ML engineers, and they have led their organizations to success. This way of working is no longer exclusive to certain teams. Now that AI models have become the center of products, it is essential for every member to understand this paradigm and put it into practice.

https://briandwjang.substack.com/p/ai

January 27, 2025 · 7 min