Microsoft product manager Derah Onuorah explains new evaluation strategies for tracking LLM behavior and ensuring trustworthiness in generative AI systems.
Microsoft Senior Product Manager Derah Onuorah proposes a new evaluation paradigm for monitoring LLM behavior to increase the reliability of generative AI systems. Unlike traditional software, large language models (LLMs) are stochastic: the same prompt can produce different outputs from one run to the next, which renders traditional unit tests unreliable. Onuorah emphasizes that to minimize errors at the enterprise level and manage the risk of ‘hallucinations’, engineers must adopt a new infrastructure layer, the ‘AI Evaluation Stack’. This approach requires strict checks at every stage of the development process, not just after deployment.
Deterministic Controls Form the First Layer
Most errors in AI applications are syntactic, not semantic. Deterministic, fail-fast checks can catch structural errors, such as invalid JSON or malformed tool calls, at the entry point of the system. This layer reduces unnecessary costs and the number of cases requiring human review.
A malformed API call should be blocked before the rest of the system runs.
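A minimal sketch of such a fail-fast check, assuming the model must emit a JSON tool call and using the widely available jsonschema package (the schema and tool names here are hypothetical):

```python
import json

from jsonschema import validate

# Hypothetical schema for the tool call the model is expected to emit.
TOOL_CALL_SCHEMA = {
    "type": "object",
    "properties": {
        "tool": {"type": "string", "enum": ["search", "calculator"]},
        "arguments": {"type": "object"},
    },
    "required": ["tool", "arguments"],
}

def validate_tool_call(raw_output: str) -> dict:
    """Fail fast: reject malformed model output before it reaches downstream systems."""
    try:
        payload = json.loads(raw_output)  # syntactic check: is this valid JSON at all?
    except json.JSONDecodeError as exc:
        raise ValueError(f"Model output is not valid JSON: {exc}") from exc
    # Structural check: does the payload match the expected tool-call shape?
    validate(instance=payload, schema=TOOL_CALL_SCHEMA)
    return payload
```

Because this check is deterministic, it costs nothing compared to a model call and can gate every request before more expensive evaluation layers run.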
Model-Based Assessments Capture the Nuances
The ‘LLM-as-a-Judge’ method, used to measure semantic quality, has one model evaluate the output of another.

For this process to succeed, three ingredients are required: a strong reasoning model, a clear evaluation rubric, and human-verified ‘golden’ reference answers.
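A minimal sketch of how such a judge might be wired up; the rubric text, scoring criteria, and the call_llm callable (a stand-in for whatever chat-completion client is in use) are all hypothetical:

```python
import json
from typing import Callable

# Hypothetical rubric: the judge model is told exactly what to score and how to respond.
JUDGE_RUBRIC = """You are grading a candidate answer against a human-verified golden answer.
Score each criterion from 1 (poor) to 5 (excellent):
- accuracy: factual agreement with the golden answer
- completeness: coverage of the points the golden answer makes
- grounding: absence of claims not supported by the golden answer
Respond with JSON only: {"accuracy": int, "completeness": int, "grounding": int}"""

def judge_output(
    question: str,
    candidate: str,
    golden: str,
    call_llm: Callable[[str], str],  # stand-in for your LLM client; returns the raw reply
) -> dict:
    """Have a strong reasoning model grade a candidate answer against a golden answer."""
    prompt = (
        f"{JUDGE_RUBRIC}\n\n"
        f"Question: {question}\n"
        f"Golden answer: {golden}\n"
        f"Candidate answer: {candidate}"
    )
    return json.loads(call_llm(prompt))  # the judge replies with the JSON scores
```

In practice, the judge's own JSON reply would itself pass through the deterministic layer above before its scores are trusted.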
Establishing a Feedback Loop for Continuous Improvement
AI models are not static; as user behavior changes, they may experience ‘concept drift’. It is therefore vital to continuously analyze production data and add failure cases to the golden datasets.
Success in artificial intelligence projects is achieved not when the model is trained, but when a continuous evaluation cycle is established.
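A minimal sketch of folding reviewed production failures back into the golden set, assuming a JSONL file as the dataset format (the file path and field names are hypothetical):

```python
import json
from pathlib import Path

GOLDEN_SET = Path("golden_dataset.jsonl")  # hypothetical location of the golden dataset

def record_failure(question: str, bad_output: str, corrected_answer: str) -> None:
    """Append a human-reviewed production failure to the golden dataset,
    so the next evaluation run guards against the same regression."""
    case = {
        "question": question,
        "golden": corrected_answer,     # the human-verified correct answer
        "regression_from": bad_output,  # what the model actually produced in production
    }
    with GOLDEN_SET.open("a", encoding="utf-8") as f:
        f.write(json.dumps(case, ensure_ascii=False) + "\n")
```

Each recorded case becomes a permanent test: if a future model version reintroduces the error, the evaluation cycle catches it before users do.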
Do you think the biggest challenge in AI projects in your organization is measuring quality or keeping the model updated with real-world data? Share your experiences and methods with us in the comments section.