Introducing Interwhen: Steering reasoning agents with real-time verification
- Yash Lara, Microsoft; Amit Sharma, Microsoft
- Microsoft Research Forum | Season 2, Episode 4
What if AI agents could check their work as they go? This verification method extracts verifiable properties from natural language and evaluates them using symbolic or model-based verifiers. Interwhen, a new open-source library, enables real-time verification of each step, helping agents act more safely and reliably in complex, real-world tasks.
Explore more
- interwhen: A Generalizable Framework for Verifiable Reasoning with Test-time Monitors
- interwhen on GitHub
Transcript
Introducing Interwhen: Steering reasoning agents with real-time verification
[MUSIC]
[MUSIC FADES INTO SWEEPING SOUND]
YASH LARA: As agents take on more responsibility, reliability becomes critical—not just after the fact, but while the agent is working.
Joining us as a principal researcher from our MSR India Research Lab, Amit Sharma is here to announce Interwhen. This is a new open-source approach to test-time verification that lets agents check each step of their own behavior using verifiable properties extracted from natural language.
This work sits right at the intersection of theory and practice, and it tackles one of the hardest problems in generative AI today: knowing when a system is behaving correctly.
Over to you, Amit.
[MUSIC]
[MUSIC FADES INTO SWEEPING SOUND]
AMIT SHARMA: Hi, I’m Amit Sharma, a Principal Researcher at Microsoft Research in India.
Today I’m excited to introduce Interwhen, a framework we’ve built for steering reasoning agents with real-time verification. This is joint work with my colleagues at MSR India, as well as Professor Subbarao Kambhampati from Arizona State University.
AI systems today are moving from passive generators of text to agentic systems that can execute actions in the real world, either through tool calls or other means.
For example, consider an agent that is operating your inbox, executing complex workflows in an organization, or even interacting in the physical world. These agents are equipped with tool calls that can make irreversible changes—for example, sending an email on your behalf, writing to an organizational database, or even moving an actual robot next to you.
So it is critical not just to verify their final output, but also their intermediate actions.
A central question for us is: how do we go from agentic execution as it exists today to a future with verified agentic execution?
Let’s look at current verification systems and their limitations. Typically, a verifier has access to the model response at a given time, the task instruction, and possibly some domain policies.
One approach uses an LLM as a judge, but this is a monolithic verifier and may miss nuanced errors, making it unreliable. Another approach uses formal verifiers, which are reliable but restricted to math and code.
To address these challenges, we built Interwhen, a real-time framework for agent verification.
Ideally, we want to use formal verifiers in general domains—but how do we apply them in settings where there may be thousands of tokens and high ambiguity?
The key innovation in Interwhen is an LLM-based projection step that automatically breaks outputs into a list of verifiable properties—desirable conditions that any solution should satisfy.
The framework can then automatically generate a formal specification and create a provable verifier in Python or Lean.
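To make the projection step concrete, here is a minimal sketch of what an extracted property and its generated Python checker could look like. The data structure, field names, and the email policy itself are illustrative assumptions, not Interwhen's actual schema or API.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class VerifiableProperty:
    """One atomic, checkable condition extracted from natural language.
    (Hypothetical structure for illustration; not Interwhen's real schema.)"""
    description: str                    # the property in plain language
    formal_spec: str                    # e.g., a logical form or Lean statement
    check: Callable[[dict], bool]       # generated Python verifier over agent state

# A property the projection step might extract for an email-operating agent.
prop = VerifiableProperty(
    description="An email may only be sent to recipients the user explicitly named.",
    formal_spec="forall e in sent_emails: e.recipients subset_of user.named_recipients",
    check=lambda state: set(state["email_recipients"]) <= set(state["user_named_recipients"]),
)

state = {
    "email_recipients": ["boss@contoso.com", "ceo@contoso.com"],
    "user_named_recipients": ["boss@contoso.com"],
}
print(prop.check(state))  # False -> this step would be flagged before the email is sent
```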
Importantly, the system does not operate only after the agent is done. It can verify partial responses at any point in time, provide actionable feedback, and steer the reasoning model to avoid violations.
We designed the system with two important principles. First, verification runs asynchronously so the system remains efficient. Second, it is plug-and-play, allowing practitioners to bring in their existing verifiers.
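As a rough picture of the plug-and-play principle, a practitioner-supplied verifier might implement a small interface like the one below. This is an assumed sketch, not Interwhen's published API.

```python
from abc import ABC, abstractmethod

class Verifier(ABC):
    """Minimal interface a user-supplied verifier could implement.
    (Hypothetical; the real Interwhen interface may differ.)"""

    @abstractmethod
    def verify(self, partial_response: str, context: dict) -> tuple[bool, str]:
        """Return (ok, feedback); feedback steers the agent when ok is False."""

class ForbiddenTermVerifier(Verifier):
    """Example drop-in verifier: flag responses that mention disallowed actions."""
    def __init__(self, forbidden: list[str]):
        self.forbidden = forbidden

    def verify(self, partial_response: str, context: dict) -> tuple[bool, str]:
        for term in self.forbidden:
            if term in partial_response:
                return False, f"Policy violation: '{term}' is not allowed."
        return True, ""
```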
Now let’s look at how this works in a real-world setting.
Interwhen achieves state-of-the-art results on agentic benchmarks such as Tau2 Bench, where even small models can rival the accuracy of frontier models.
The first step is extracting verifiable properties—structured, atomic conditions that any solution should satisfy.
For example, in a retail scenario from Tau2 Bench, a policy expressed in natural language can be converted into properties such as “a refund must go to the original payment method.” From this, the system can automatically generate Python code that acts as a verifier.
This is a one-time operation that can be reviewed and validated. The key value comes at runtime, when variables in the verifier are dynamically filled using the user’s context and the model’s current response.
For instance, if an agent attempts to process a return using the wrong payment ID, the verifier detects the violation and produces feedback such as “Refund to credit card not allowed.” In the next step, the model uses this feedback to correct its mistake.
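Here is a hedged sketch of how such a generated verifier could be evaluated at runtime, with variables filled from the user's context and the agent's proposed tool call. The function and field names are assumptions for illustration, not Interwhen's generated code.

```python
def verify_refund_action(action: dict, order: dict) -> tuple[bool, str]:
    """Check the extracted property: a refund must go to the original payment method.
    (Illustrative only; Interwhen generates verifiers like this from policy text.)"""
    if action.get("tool") != "process_refund":
        return True, ""  # the property only constrains refund actions
    if action["payment_id"] != order["original_payment_id"]:
        return False, "Refund to credit card not allowed: refund must use the original payment method."
    return True, ""

# Runtime: variables are filled from the user's context and the model's current step.
order = {"order_id": "W123", "original_payment_id": "gift_card_777"}
action = {"tool": "process_refund", "order_id": "W123", "payment_id": "credit_card_001"}

ok, feedback = verify_refund_action(action, order)
if not ok:
    print(feedback)  # fed back to the model so it can correct its next step
```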
Efficiency is a core design goal. Verification must not only ensure correctness but also be practical.
To achieve this, we built a separate monitoring system that tracks the agent’s output in real time and calls verifiers asynchronously for reversible tool calls. Execution is only stopped if an error is detected; otherwise, the process continues as normal.
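A rough asyncio sketch of this idea, assuming an agent that yields steps as an async iterable and verifiers that are callables returning (ok, feedback) pairs; the names and structure are assumptions rather than Interwhen's implementation.

```python
import asyncio

async def verify_step(step: dict, verifiers) -> list[str]:
    """Run each (synchronous) verifier off the event loop and collect violation feedback."""
    results = await asyncio.gather(*(asyncio.to_thread(v, step) for v in verifiers))
    return [feedback for ok, feedback in results if not ok]

async def monitored_run(agent_steps, verifiers) -> list[str]:
    """Verify reversible tool calls in the background; interrupt only on a violation."""
    tasks: list[asyncio.Task] = []
    async for step in agent_steps:                       # the agent keeps executing
        tasks.append(asyncio.create_task(verify_step(step, verifiers)))
        for t in [t for t in tasks if t.done()]:         # harvest finished checks
            tasks.remove(t)
            if (violations := t.result()):
                return violations                        # stop and feed back to the agent
    for t in tasks:                                      # drain remaining checks at the end
        if (violations := await t):
            return violations
    return []
```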
The framework is general and improves output quality across many tasks, including customer service, travel booking, logical reasoning, and ensuring safety in agent responses.
It is fully plug-and-play, works with both proprietary and open-weight models, and enforces strict verification—meaning the agent provides no answer if verification fails.
I’m excited to announce that we have open-sourced Interwhen on GitHub.
At Microsoft Research, we believe in open collaboration. Our goal is to foster a community focused on AI verification and safety, where we can collectively build and test verifiers for real-world use cases and enable a future where all agentic execution is verified.
If you’d like to learn more about Interwhen, you can visit the link below.
Thank you.
- Yash Lara, Senior PM
- Amit Sharma, Principal Researcher