{"id":1169137,"date":"2026-04-21T09:53:12","date_gmt":"2026-04-21T16:53:12","guid":{"rendered":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/?post_type=msr-blog-post&#038;p=1169137"},"modified":"2026-05-04T09:57:18","modified_gmt":"2026-05-04T16:57:18","slug":"the-art-of-building-verifiers-for-computer-use-agents","status":"publish","type":"msr-blog-post","link":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/articles\/the-art-of-building-verifiers-for-computer-use-agents\/","title":{"rendered":"The Art of Building Verifiers for Computer Use Agents"},"content":{"rendered":"\n<div class=\"wp-block-group is-layout-flow wp-block-group-is-layout-flow\" style=\"padding-right:2.5rem;padding-left:2.5rem\">\n<p><em>By Corby Rosset, Pratyusha Sharma, Andrew Zhao, Miguel Gonzalez-Fernandez, Ahmed Awadallah<\/em><\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"576\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2026\/04\/aif-hero-verifiers-computer-use-v4-1-1024x576.png\" alt=\"graphical user interface, application\" class=\"wp-image-1170587\" srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2026\/04\/aif-hero-verifiers-computer-use-v4-1-1024x576.png 1024w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2026\/04\/aif-hero-verifiers-computer-use-v4-1-300x169.png 300w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2026\/04\/aif-hero-verifiers-computer-use-v4-1-768x432.png 768w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2026\/04\/aif-hero-verifiers-computer-use-v4-1-1536x865.png 1536w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2026\/04\/aif-hero-verifiers-computer-use-v4-1-2048x1153.png 2048w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2026\/04\/aif-hero-verifiers-computer-use-v4-1-1066x600.png 1066w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2026\/04\/aif-hero-verifiers-computer-use-v4-1-655x368.png 655w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2026\/04\/aif-hero-verifiers-computer-use-v4-1-240x135.png 240w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2026\/04\/aif-hero-verifiers-computer-use-v4-1-640x360.png 640w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2026\/04\/aif-hero-verifiers-computer-use-v4-1-960x540.png 960w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2026\/04\/aif-hero-verifiers-computer-use-v4-1-1280x720.png 1280w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2026\/04\/aif-hero-verifiers-computer-use-v4-1-1920x1080.png 1920w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>We share lessons learned from building a best-in-class verifier for computer use agent trajectories on the web, called the Universal Verifier. False positive rates drop to near zero (vs. \u226545% for WebVoyager, \u226522% for WebJudge), and agreement with humans matches human-human agreement. 
We open-source our Universal Verifier system along with CUAVerifierBench, a new set of CUA trajectories with both process and outcome human labels.<\/p>\n\n\n\n<p>Here&#8217;s what we found:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li><strong>Good verifiers rely on rubric design<\/strong>\u2014and good rubrics must have specific, non-overlapping criteria, since flawed rubrics produce errors that cascade through the pipeline and can&#8217;t be corrected downstream. Good rubric design alone accounts for roughly half the gains.<\/li>\n\n\n\n<li><strong>Separating process from outcome and controllable from uncontrollable failures is a core design principle<\/strong>\u2014conflating process and outcome leads to reward signals that are either too lenient or too harsh. We further distinguish controllable failures (e.g., reasoning errors, hallucinations) from uncontrollable ones (e.g., CAPTCHAs, out-of-stock items).<\/li>\n\n\n\n<li><strong>The Universal Verifier matches human-human agreement levels<\/strong> (Cohen&#8217;s \u03ba 0.64) while cutting false positive rates to near zero\u2014outperforming WebVoyager and WebJudge by a wide margin. The advantage stems from verifier design, not just a stronger backbone model.<\/li>\n\n\n\n<li><strong>Verifiers deserve the same rigorous evaluation and iterative improvement we apply to models<\/strong>\u2014CUAVerifierBench makes this concrete, providing human-labeled trajectories to benchmark verifier quality and drive systematic progress.<\/li>\n\n\n\n<li><strong>Auto-research agents can&#8217;t fully replace human experts in verifier design yet<\/strong>\u2014but they reach ~70% of expert quality in just 5% of the time, and can even find incremental improvements on top of a human expert&#8217;s best work.<\/li>\n<\/ol>\n\n\n\n<p>Full paper is available <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/2604.06240\">here<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, and code and data are available at <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/github.com\/microsoft\/fara\">https:\/\/github.com\/microsoft\/fara<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>.<\/p>\n\n\n\n<div style=\"height:48px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"why-is-it-so-hard-to-tell-whether-the-agent-succeeded\">Why is it so hard to tell whether the agent succeeded?<\/h2>\n\n\n\n<p>Computer use agents \u2014 models that browse the web, click buttons, fill forms \u2014 have gotten impressively capable. But progress on training and evaluating them is bottlenecked by a deceptively simple question: <em>did the agent actually succeed?<\/em><\/p>\n\n\n\n<p>This turns out to be much harder than it sounds. Unlike text generation, where you can compare an output to a reference, computer use trajectories are long, visually rich, and interact with environments the agent does not control, inviting new categories of errors like environment blockers, out-of-stock items, and logins. A task might be partially completed. Success might arrive through an unexpected path. Failures can be subtle \u2014 e.g. mis-copying numbers from a table that appear only in a screenshot buried deep in a multi-step interaction. 
And the consequences of getting verification wrong compound: bad labels corrupt both your benchmarks and your training data.<\/p>\n\n\n\n<p>We ran 96 experiments over several weeks building what we call the Universal Verifier \u2014 a system designed to verify agent success and score its effort against a generated rubric. What we ended up with is less a single trick and more a set of learned design principles, each addressing a failure mode we discovered. This post walks through those principles, what we tried that didn&#8217;t work, and what surprised us.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"690\" height=\"270\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2026\/04\/n1-3.png\" alt=\"Line chart comparing Cohen's kappa agreement with human labels across 64 verifier design iterations, for a human expert, auto-research starting from blank prompts, and auto-research continuing from the expert's best prompts.\" class=\"wp-image-1169147\" srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2026\/04\/n1-3.png 690w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2026\/04\/n1-3-300x117.png 300w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2026\/04\/n1-3-240x94.png 240w\" sizes=\"auto, (max-width: 690px) 100vw, 690px\" \/><\/figure>\n\n\n\n<p class=\"has-text-align-center\"><em>Figure 1: Human expert vs. auto-research agent across successive verifier design iterations. The expert iterated over 32 experiments across three weeks; the auto-research agent completed comparable iterations in roughly one day.<\/em><\/p>\n\n\n\n<div style=\"height:48px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"how-do-you-build-a-good-rubric\">How do you build a good rubric?<\/h2>\n\n\n\n<p>The root of the pipeline is rubric generation, and flawed rubrics produce errors that cascade through everything downstream. We found four systematic failure modes \u2014 and rubric design alone accounted for roughly half of our total Cohen&#8217;s \u03ba gains. You can see how our rubrics evolved on WebTailBench here: <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/microsoft.github.io\/fara\/docs\/webtailbench_rubric_comparison.html\">https:\/\/microsoft.github.io\/fara\/docs\/webtailbench_rubric_comparison.html<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/p>\n\n\n\n<div style=\"height:24px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"phantom-criteria\">Phantom criteria<\/h3>\n\n\n\n<p>This was the most insidious problem. LLM-generated rubrics frequently introduce requirements that were never stated in the task. For example, in Figure 2, given a multi-step task whose primary intent was finding a coffee shop near a hotel, our early rubric added criteria for the hotel&#8217;s price and address \u2014 neither of which the user requested. The agent completed the actual task but scored 2\/8 because it &#8220;failed&#8221; those phantom criteria. 
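<\/p>\n\n\n\n<p>The fix is to keep only criteria that the task itself justifies. As a rough illustration of that idea (the field names and the substring check below are ours, not the actual Universal Verifier schema), each rubric criterion can carry an explicit pointer back to the task text, and anything that cannot be grounded gets dropped:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>from dataclasses import dataclass

@dataclass
class Criterion:
    description: str           # what the agent must have done
    points: int                # weight toward the process score
    grounding: str             # the task text that justifies this criterion
    conditional: bool = False  # only applies if a stated contingency occurs

def drop_phantom_criteria(task, criteria):
    # Keep only criteria that can point to supporting task text.
    # A real implementation would use an LLM judge rather than a
    # substring check; this is just the shape of the safeguard.
    return [c for c in criteria if c.grounding and c.grounding.lower() in task.lower()]
<\/code><\/pre>\n\n\n\n<p>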
After fixing the rubric to match only what was asked, the same trajectory scored 16\/18 \u2014 a success.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"490\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2026\/04\/n2-1024x490.png\" alt=\"Side-by-side comparison of a good rubric scoring 16\/18 and a bad rubric scoring 2\/8 for a Booking.com task, showing how phantom criteria penalize correct behavior.\" class=\"wp-image-1169148\" srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2026\/04\/n2-1024x490.png 1024w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2026\/04\/n2-300x143.png 300w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2026\/04\/n2-768x367.png 768w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2026\/04\/n2-240x115.png 240w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2026\/04\/n2.png 1430w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p class=\"has-text-align-center\"><em>Figure 2: One way we improved rubrics is by removing &#8220;phantom&#8221; criteria and focusing only on what the task required.<\/em><\/p>\n\n\n\n<p>This matters because phantom criteria inflate the denominator. An agent that did exactly what was asked gets penalized for not doing things nobody wanted.<\/p>\n\n\n\n<div style=\"height:24px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"cascading-criteria\">Cascading criteria<\/h3>\n\n\n\n<p>When rubric items aren&#8217;t logically independent, a single upstream error propagates into every downstream criterion, multiplying the penalty. We learned to ensure each criterion could be evaluated on its own as demonstrated in Figure 3.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"651\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2026\/04\/n3-1024x651.png\" alt=\"Rubric showing error isolation for a task about finding the NSYNC or Backstreet Boys member with the longest last name: the agent is penalized for misidentifying Timberlake over Kirkpatrick, but receives full credit on the downstream net-worth criterion.\" class=\"wp-image-1169150\" srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2026\/04\/n3-1024x651.png 1024w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2026\/04\/n3-300x191.png 300w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2026\/04\/n3-768x488.png 768w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2026\/04\/n3-240x153.png 240w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2026\/04\/n3.png 1118w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p class=\"has-text-align-center\"><em>Figure 3: An example of error isolation in practice. For the task: &#8220;List all the members of the bands Nsync and BackStreet Boys. 
Find the net worth of the one with the longest last name.&#8221; The agent incorrectly identified &#8220;Timberlake&#8221; as the longest last name when &#8220;Kirkpatrick&#8221; is correct \u2014 but the error does not cascade to downstream criteria about reporting net worth.<\/em><\/p>\n\n\n\n<div style=\"height:24px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"hallucination-detection\">Hallucination detection<\/h3>\n\n\n\n<p>Agents sometimes claim success that the evidence contradicts \u2014 they&#8217;ll confidently assert they found the right product when they did not. Or worse, they will fabricate results, like stating the shopping cart has the product when it is empty. Initially, we generated and scored rubrics in one pass, but this rarely caught subtle hallucinations. So we separated rubric generation from scoring, and decomposed rubric scoring into two stages: one with screenshot evidence and one without. Discrepancies between the two stages surface hallucinations that a single-pass scorer would miss. As we explain below, we handle screenshot evidence very carefully so as not to miss any details.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"886\" height=\"586\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2026\/04\/n4.png\" alt=\"Rubric showing a hallucination caught by the Universal Verifier: the agent claimed a model had a +6.2% CIDEr score, but the BLIP paper actually reported +2.8% in CIDEr. The +2.7% figure the agent cited is real but was misattributed to caption recall.\" class=\"wp-image-1169151\" srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2026\/04\/n4.png 886w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2026\/04\/n4-300x198.png 300w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2026\/04\/n4-768x508.png 768w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2026\/04\/n4-240x159.png 240w\" sizes=\"auto, (max-width: 886px) 100vw, 886px\" \/><\/figure>\n\n\n\n<p class=\"has-text-align-center\"><em>Figure 4: A subtle hallucination caught by the Universal Verifier. 
The agent claimed a model exhibited &#8220;+6.2% CIDEr score&#8221; when <\/em><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/2201.12086\"><em>the actual paper<\/em><span class=\"sr-only\"> (opens in new tab)<\/span><\/a><em> showed &#8220;+2.8% in CIDEr&#8221; \u2014 a discrepancy even human reviewers missed.<\/em><\/p>\n\n\n\n<p>We also added <strong>conditional criteria<\/strong> for tasks with contingencies \u2014 &#8220;buy organic blueberries, or if unavailable, buy non-organic.&#8221; At rubric-generation time, we mark some criteria as conditional and update them once the task is attempted, so mutually exclusive criteria don&#8217;t interfere with each other.<\/p>\n\n\n\n<p>To penalize the agent for doing things the rubrics did not anticipate, we add a final post-hoc scoring step that identifies such deviations, as shown in Figure 5.<\/p>\n\n\n\n<div style=\"height:48px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"process-and-outcome-rewards\">Process and outcome rewards<\/h2>\n\n\n\n<p>We not only generate rubrics to assign partial credit for &#8220;how well did the agent execute the task?&#8221;, but also assign a final outcome pass\/fail score answering whether the user&#8217;s goal was achieved. The reason we separate the rubrics from the outcome scores is that, in computer use settings, the environment plays an outsized role in influencing success. An agent can execute flawlessly and still fail because a CAPTCHA appeared, or a product was out of stock, or a login wall blocked the final step. From a model training perspective, it doesn&#8217;t make sense to penalize an agent for things outside its control, but from a metrics perspective, we still need to know if a task was completed.<\/p>\n\n\n\n<p>The <strong>process label<\/strong> is a scored rubric \u2014 a normalized score from 0.0 to 1.0 reflecting execution quality across sub-goals, with specific justifications for why points were earned or lost. While the rubrics do penalize mistakes within the model&#8217;s control like hallucinations or incomplete executions, they do NOT penalize for uncontrollable factors. Uncontrollable factors include platform issues (CAPTCHAs, login walls without credentials), entity non-existence (discontinued products, closed businesses), availability constraints (out-of-stock items, no reservations on the requested date), and search result limitations.<\/p>\n\n\n\n<p>The <strong>outcome label<\/strong> is binary: would a reasonable user consider the task done, regardless of problems stemming from the environment? 
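<\/p>\n\n\n\n<p>To make the separation concrete, here is a minimal sketch of what the two labels might look like as data. The schema is illustrative (the field names and failure-type strings are ours, not the released format): the process score is averaged only over criteria the agent could control, while uncontrollable failures are recorded but never subtract credit.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>from dataclasses import dataclass

# Failure types treated as outside the agent's control; they are logged
# but never subtract process credit (an assumed taxonomy, for illustration).
UNCONTROLLABLE = {'captcha', 'login_wall', 'out_of_stock', 'entity_not_found'}

@dataclass
class ScoredCriterion:
    description: str
    points_earned: int
    points_possible: int
    failure_type: str = ''   # empty if passed, else e.g. 'hallucination' or 'captcha'

@dataclass
class VerifierResult:
    process_score: float     # 0.0 to 1.0 execution quality over controllable criteria
    outcome: bool            # would a reasonable user consider the task done?

def score_process(criteria):
    scorable = [c for c in criteria if c.failure_type not in UNCONTROLLABLE]
    total = sum(c.points_possible for c in scorable)
    # If every failure was uncontrollable, do not penalize the agent.
    return sum(c.points_earned for c in scorable) / total if total else 1.0
<\/code><\/pre>\n\n\n\n<p>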
The outcome label is evaluated from the perspective of someone examining the end state.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"867\" height=\"580\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2026\/04\/n5.png\" alt=\"Rubric with seven criteria scoring 12 out of 20 for an Amazon vs AutoZone shipping-comparison task, including a zero-out-of-two penalty for an unsolicited add-to-cart side effect.\" class=\"wp-image-1169152\" srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2026\/04\/n5.png 867w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2026\/04\/n5-300x201.png 300w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2026\/04\/n5-768x514.png 768w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2026\/04\/n5-240x161.png 240w\" sizes=\"auto, (max-width: 867px) 100vw, 867px\" \/><\/figure>\n\n\n\n<p class=\"has-text-align-center\"><em>Figure 5: An unsolicited side effect. The task was to compare shipping options between Amazon and AutoZone, but the agent added the product to the cart instead of just answering the question.<\/em><\/p>\n\n\n\n<p>Why not just use one? Because conflating them leads to reward signals that are either too lenient (crediting agents for apparent effort when the user is left empty-handed) or too harsh (penalizing agents for a CAPTCHA that no model could solve). In reinforcement learning, this distinction is the difference between a training signal that teaches the model to act well and one that teaches it to be lucky.<\/p>\n\n\n\n<div style=\"height:48px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"what-do-you-do-when-the-trajectory-is-long\">What do you do when the trajectory is long?<\/h2>\n\n\n\n<p>The natural starting point is to hand the model a bunch of screenshots and ask: &#8220;Did the agent do the task?&#8221; Both WebVoyager and WebJudge take roughly this approach \u2014 WebVoyager includes all screenshots in one context window, WebJudge ranks and selects the top 30\u201350. But asking an LLM verifier to check many instructions across many screenshots turns verification into a needle-in-a-haystack search, which scales poorly with trajectory length. On the other hand, truncating screenshots (like keeping only the last one) risks missing the ones where hallucinations or failures actually happened.<\/p>\n\n\n\n<p>The result: WebVoyager&#8217;s false positive rate with respect to gold human labels is at least 45%; WebJudge&#8217;s is at least 22%. That means nearly half the time WebVoyager says &#8220;the agent succeeded,&#8221; a human annotator would disagree. If you&#8217;re using these labels for training, you&#8217;re rewarding failure almost as often as success.<\/p>\n\n\n\n<p>We went with a divide-and-conquer scheme. We score each screenshot against every rubric criterion to produce a relevance matrix (shown in Figure 6), then group the top-k most relevant screenshots per criterion for detailed analysis. 
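<\/p>\n\n\n\n<p>A minimal sketch of that selection step, assuming a vision-capable LLM call that returns a relevance score between 0 and 1 for each (screenshot, criterion) pair; the helper and function names are illustrative rather than taken from the released code:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>import numpy as np

def select_evidence(screenshots, criteria, rate_relevance, k=3):
    # rate_relevance(screenshot, criterion) is an assumed helper that asks
    # an LLM for a 0-to-1 relevance score; it is not a real API here.
    relevance = np.array([[rate_relevance(s, c) for c in criteria]
                          for s in screenshots])   # the matrix in Figure 6
    evidence = {}
    for j, criterion in enumerate(criteria):
        top = np.argsort(relevance[:, j])[::-1][:k]  # top-k screenshots for this criterion
        evidence[criterion] = [screenshots[i] for i in top]
    return evidence  # each criterion is then judged against its own evidence
<\/code><\/pre>\n\n\n\n<p>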
This is both more scalable to longer trajectories and more focused \u2014 the model evaluates each criterion against only the evidence that matters most for it.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"690\" height=\"288\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2026\/04\/n6.png\" alt=\"Heatmap of relevance scores between fourteen screenshots and three rubric criteria for a face-wash shopping task, highlighting which screenshots are most informative for each criterion.\" class=\"wp-image-1169157\" srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2026\/04\/n6.png 690w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2026\/04\/n6-300x125.png 300w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2026\/04\/n6-240x100.png 240w\" sizes=\"auto, (max-width: 690px) 100vw, 690px\" \/><\/figure>\n\n\n\n<p class=\"has-text-align-center\"><em>Figure 6: A screenshot relevance matrix. Each cell scores how relevant a screenshot is to a specific rubric criterion, enabling targeted evidence retrieval rather than flooding the context window.<\/em><\/p>\n\n\n\n<div style=\"height:48px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"does-all-this-show-up-in-the-numbers\">Does all this show up in the numbers?<\/h2>\n\n\n\n<p>The short answer: yes.<\/p>\n\n\n\n<p>We validated the Universal Verifier on CUAVerifierBench \u2014 a new benchmark of 246 human-labeled CUA trajectories (140 internal, 106 from Browserbase) with both process and outcome annotations. It&#8217;s the first benchmark designed specifically to measure verifier quality on both dimensions. We wanted to validate our results with external annotators, and partnered with <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/www.browserbase.com\/\">Browserbase<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> to perform a human annotation study.<\/p>\n\n\n\n<p>On outcome labels, the UV achieves a Cohen&#8217;s \u03ba of 0.64 on the internal set and 0.58 on Browserbase, compared to 0.44\/0.26 for the best WebJudge configuration and 0.31\/0.13 for WebVoyager. More importantly, the UV&#8217;s false positive rate is 0.01 on the internal set and 0.08 on Browserbase \u2014 essentially zero. It almost never credits a trajectory with success when a human would call it failure.<\/p>\n\n\n\n<p>You might wonder whether this is just a stronger backbone model doing the work. We tested that. Upgrading WebVoyager from GPT-4o to GPT-5.2 does drop its outcome false positive rate from 0.45 to 0.10 \u2014 but it also dramatically increases its false negative rate (0.24 to 0.44), and overall \u03ba improves only modestly. The UV&#8217;s advantage is architectural, not model-driven.<\/p>\n\n\n\n<p>The UV&#8217;s agreement with humans falls within the range of human inter-annotator agreement itself: outcome \u03ba of 0.58 against a human range of 0.53\u20130.57, process \u03ba of 0.43 against a human range of 0.36\u20130.45. The verifier agrees with humans about as often as humans agree with each other.<\/p>\n\n\n\n<p>Secondly, we wanted to ascertain if using the Universal Verifier as a SFT training data filter improves model performance over previous filters. 
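<\/p>\n\n\n\n<p>As a rough sketch of what such a filter looks like (the 0.8 process threshold matches the setting discussed below, and the record fields are illustrative rather than the released data format):<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>def filter_for_sft(trajectories, mode='process', threshold=0.8):
    # Keep trajectories the verifier scores as good enough to imitate.
    # 'process' keeps high execution quality even when the environment
    # blocked the final goal; 'outcome' keeps only completed tasks.
    kept = []
    for t in trajectories:          # each t is assumed to carry verifier labels
        if mode == 'process' and t['process_score'] &gt;= threshold:
            kept.append(t)
        elif mode == 'outcome' and t['outcome']:
            kept.append(t)
    return kept
<\/code><\/pre>\n\n\n\n<p>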
In Table 1 below we show that trajectories filtered by the Universal Verifier lead to the best downstream model, especially in the data-limited scenario of training on only 3k trajectories. We were somewhat surprised to observe that the process filter outperforms the outcome filter; we believe this is because the process success threshold of 80% lets through some demonstrations with minor imperfections in the trajectory, which is ultimately beneficial to the model.<\/p>\n\n\n\n<figure class=\"wp-block-table aligncenter\"><table><thead><tr><th>Experiment<\/th><th>Filtered by<\/th><th>Online Mind2Web<\/th><th>WebVoyager<\/th><\/tr><\/thead><tbody><tr><td rowspan=\"3\"><strong>3k traj.<\/strong><\/td><td>Baseline (old verifier)<\/td><td>0.20<\/td><td>0.41<\/td><\/tr><tr><td>UV Process<\/td><td><strong>0.28<\/strong><\/td><td><strong>0.45<\/strong><\/td><\/tr><tr><td>UV Outcome<\/td><td>0.24<\/td><td>0.44<\/td><\/tr><tr><td rowspan=\"3\"><strong>9k traj.<\/strong><\/td><td>Baseline (old verifier)<\/td><td>0.25<\/td><td>0.46<\/td><\/tr><tr><td>UV Process<\/td><td><strong>0.29<\/strong><\/td><td><strong>0.52<\/strong><\/td><\/tr><tr><td>UV Outcome<\/td><td>0.29<\/td><td>0.49<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p class=\"has-text-align-center\"><em>Table 1: Training Qwen-3-VL-8B on insta-150k-v3 trajectories filtered by different verifiers.<\/em><\/p>\n\n\n\n<p>In this training experiment, we enforced compute equivalence by fixing the number of trajectories, which were sampled from the insta-150k-v3 dataset [1], after being re-solved by the FaraGen pipeline. We trained under largely the same settings as Fara-7B, the only difference being we initialized from Qwen-3-VL-Instruct. These results show that better verifiers lead to higher quality training data and hence better models.<\/p>\n\n\n\n<div style=\"height:48px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"can-an-ai-build-a-cua-verifier-on-its-own\">Can an AI build a CUA verifier on its own?<\/h2>\n\n\n\n<p>The Universal Verifier is approximately 3,000 lines of code and 2,000 lines of prompts \u2014 rubric generation templates, scoring instructions, outcome verification logic, error classification rules \u2014 all designed iteratively by a human expert. Could an AI agent replicate that work?<\/p>\n\n\n\n<p>We set up an auto-research experiment using Claude Code with Claude Opus 4.6, running on a 1M-token context window. We tested two settings: starting from blank prompts (all ~2,000 lines replaced with TODO placeholders, with only the code scaffold and the same design principles described above) and continuing from the human expert&#8217;s best prompts. A separate compliance agent audited each iteration to prevent the optimizer from memorizing test examples into prompts.<\/p>\n\n\n\n<p>The optimization rule was simple: maximize Cohen&#8217;s \u03ba without increasing the false positive rate. Any FPR-increasing change was automatically rolled back. The human expert iterated over 32 experiments across three weeks. The auto-research agent completed a comparable number in roughly one day. The agent reached about 70% of expert quality in 5% of the time. But it plateaued at a \u03ba around 0.55 and couldn&#8217;t close the remaining gap.<\/p>\n\n\n\n<p>The most revealing part of this experiment was how the two approaches differed. The human expert&#8217;s biggest gains came from opinionated, high-level insights. 
After observing the verifier failing trajectories over minor issues \u2014 things like &#8220;inferring most Coursera courses can be audited for free is unsubstantiated&#8221; or &#8220;not disambiguating apartment from rental-unit&#8221; \u2014 the expert deduced general scoring rules like &#8220;separate nitpicks from critical failures.&#8221; These structural insights drove large jumps in agreement.<\/p>\n\n\n\n<p>The auto-research agent tended to be conservative and incremental \u2014 adjusting thresholds, tightening rubric language for individual failure cases \u2014 rather than making the larger structural or conceptual changes that drove the human expert&#8217;s biggest gains. It was good at fine-tuning. It was not good at stepping back and asking &#8220;what category of problem am I looking at?&#8221;<\/p>\n\n\n\n<p>A few things stood out watching the auto-research agent iterate. First, <strong>code changes consistently beat prompt additions<\/strong> when prompts were already long \u2014 the single most impactful change was injecting rubric scores directly into context, since it provided quantitative calibration without adding more text for the model to parse. Second, <strong>forcing explicit rule-checking helped<\/strong>: by naming rules in a mandatory output field, the LLM was far more likely to actually apply them rather than silently ignore instructions buried in a long prompt. Third, <strong>concrete tests beat abstract principles<\/strong> \u2014 &#8220;would the user say this is useful?&#8221; proved more actionable than vague guidance like &#8220;be reasonable about minor issues.&#8221;<\/p>\n\n\n\n<div style=\"height:48px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"what-did-this-project-teach-us\">What did this project teach us?<\/h2>\n\n\n\n<p>After 96 experiments and a few months of staring at CUA trajectories, the thing that stays with us is how much of verification is judgment \u2014 and how poorly that judgment decomposes into simple rules.<\/p>\n\n\n\n<p>Each of the four principles we described \u2014 rubric design, process\/outcome separation, controllable\/uncontrollable distinction, and context management \u2014 addresses a failure mode that looks obvious in retrospect but wasn&#8217;t obvious at all in practice. Phantom criteria sound like an easy problem to fix until you realize how systematically LLMs hallucinate requirements. Separating process from outcome sounds like a clean abstraction until you&#8217;re staring at a trajectory where the agent did everything right and the website just\u2026 didn&#8217;t work.<\/p>\n\n\n\n<p>The auto-research experiment sharpened this further. An AI agent can reach 70% of the quality in 5% of the time \u2014 that&#8217;s genuinely useful. But the last 30% requires the kind of opinionated, structural thinking that comes from looking at failure patterns and asking &#8220;what category of problem is this?&#8221; rather than &#8220;how do I fix this specific case?&#8221; This suggests that building reliable verifiers remains as much an art of encoding evaluative reasoning as it is an engineering problem.<\/p>\n\n\n\n<p>The verifier doesn&#8217;t just tell you whether the agent succeeded. 
It tells you <em>how<\/em> it failed \u2014 and whether the failure was even the agent&#8217;s fault.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n\n\n\n<p><em>Code and data: <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/github.com\/microsoft\/fara\">github.com\/microsoft\/fara<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/em><\/p>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>By Corby Rosset, Pratyusha Sharma, Andrew Zhao, Miguel Gonzalez-Fernandez, Ahmed Awadallah We share lessons learned from building a best-in-class verifier for computer use agent trajectories on the web, called the Universal Verifier. False positive rates drop to near zero (vs. \u226545% for WebVoyager, \u226522% for WebJudge), and agreement with humans matches human-human agreement. We open-source [&hellip;]<\/p>\n","protected":false},"author":43341,"featured_media":1170587,"template":"","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","msr-content-parent":992148,"msr_hide_image_in_river":0,"footnotes":""},"research-area":[13556],"msr-locale":[268875],"msr-post-option":[],"class_list":["post-1169137","msr-blog-post","type-msr-blog-post","status-publish","has-post-thumbnail","hentry","msr-research-area-artificial-intelligence","msr-locale-en_us"],"msr_assoc_parent":{"id":992148,"type":"lab"},"_links":{"self":[{"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-blog-post\/1169137","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-blog-post"}],"about":[{"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/types\/msr-blog-post"}],"author":[{"embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/users\/43341"}],"version-history":[{"count":16,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-blog-post\/1169137\/revisions"}],"predecessor-version":[{"id":1170591,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-blog-post\/1169137\/revisions\/1170591"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/media\/1170587"}],"wp:attachment":[{"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/media?parent=1169137"}],"wp:term":[{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=1169137"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=1169137"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=1169137"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}