{"id":1145803,"date":"2025-08-05T09:00:00","date_gmt":"2025-08-05T16:00:00","guid":{"rendered":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/?p=1145803"},"modified":"2026-01-28T07:45:50","modified_gmt":"2026-01-28T15:45:50","slug":"veritrail-detecting-hallucination-and-tracing-provenance-in-multi-step-ai-workflows","status":"publish","type":"post","link":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/blog\/veritrail-detecting-hallucination-and-tracing-provenance-in-multi-step-ai-workflows\/","title":{"rendered":"VeriTrail: Detecting hallucination and tracing provenance in multi-step AI workflows"},"content":{"rendered":"\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1400\" height=\"788\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/07\/VeriTrail-BlogHeroFeature-1400x788-1.jpg\" alt=\"Alt text: Two white icons on a blue-to-green gradient background\u2014one showing a central figure linked to others, representing a network, and the other depicting lines connecting to a document, symbolizing data flow.\" class=\"wp-image-1145818\" srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/07\/VeriTrail-BlogHeroFeature-1400x788-1.jpg 1400w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/07\/VeriTrail-BlogHeroFeature-1400x788-1-300x169.jpg 300w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/07\/VeriTrail-BlogHeroFeature-1400x788-1-1024x576.jpg 1024w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/07\/VeriTrail-BlogHeroFeature-1400x788-1-768x432.jpg 768w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/07\/VeriTrail-BlogHeroFeature-1400x788-1-1066x600.jpg 1066w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/07\/VeriTrail-BlogHeroFeature-1400x788-1-655x368.jpg 655w, 
https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/07\/VeriTrail-BlogHeroFeature-1400x788-1-240x135.jpg 240w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/07\/VeriTrail-BlogHeroFeature-1400x788-1-640x360.jpg 640w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/07\/VeriTrail-BlogHeroFeature-1400x788-1-960x540.jpg 960w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/07\/VeriTrail-BlogHeroFeature-1400x788-1-1280x720.jpg 1280w\" sizes=\"auto, (max-width: 1400px) 100vw, 1400px\" \/><\/figure>\n\n\n\n<div class=\"wp-block-buttons is-content-justification-center is-content-justification-center is-layout-flex wp-container-core-buttons-is-layout-16018d1d wp-block-buttons-is-layout-flex\">\n<div class=\"wp-block-button\"><a data-bi-type=\"button\" class=\"wp-block-button__link wp-element-button\" href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/video\/veritrail-detect-hallucination-and-trace-provenance-in-ai-workflows\/\">Watch VeriTrail Explainer<\/a><\/div>\n<\/div>\n\n\n\n<p class=\"has-text-align-center\"><em>This research was accepted by the 2026 International Conference on Learning Representations (ICLR), a premier conference in artificial intelligence.\u00a0<\/em><\/p>\n\n\n\n<p>Many applications of language models (LMs) involve generating content based on source material, such as answering questions, summarizing information, and drafting documents. A critical challenge for these applications is that LMs may produce content that is not supported by the source text \u2013 a phenomenon known as \u201cclosed-domain hallucination.\u201d<a id=\"_ftnref1\" href=\"#_ftn1\"><sup>1<\/sup><\/a><\/p>\n\n\n\n<p>Existing methods for detecting closed-domain hallucination typically compare a given LM output to the source text, implicitly assuming that there is only a single output to evaluate. 
However, applications of LMs increasingly involve processes with multiple generative steps: LMs generate intermediate outputs that serve as inputs to subsequent steps and culminate in a final output. Many agentic workflows follow this paradigm (e.g., each agent is responsible for a specific document or sub-task, and their outputs are synthesized into a final response).\u202f&nbsp;<\/p>\n\n\n\n<p>In our paper \u201c<a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/publication\/veritrail-closed-domain-hallucination-detection-with-traceability\/\" target=\"_blank\" rel=\"noreferrer noopener\">VeriTrail: Closed-Domain Hallucination Detection with Traceability<\/a>,\u201d we argue that, given the complexity of processes with multiple generative steps, detecting hallucination in the final output is necessary but not sufficient. We also need <strong>traceability<\/strong>, which has two components:&nbsp;<\/p>\n\n\n\n<ol start=\"1\" class=\"wp-block-list\">\n<li><strong>Provenance: <\/strong>if the final output is supported by the source text, we should be able to trace its path through the intermediate outputs to the source.&nbsp;<\/li>\n\n\n\n<li><strong>Error Localization: <\/strong>if the final output is not supported by the source text, we should be able to trace where the error was likely introduced.<\/li>\n<\/ol>\n\n\n\n<p>Our paper presents VeriTrail, the first closed-domain hallucination detection method designed to provide traceability for processes with any number of generative steps. We also demonstrate that VeriTrail outperforms baseline methods commonly used for hallucination detection. 
In this blog post, we provide an overview of VeriTrail\u2019s design and performance.<a id=\"_ftnref2\" href=\"#_ftn2\"><sup>2<\/sup><\/a><\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"veritrail-s-hallucination-detection-process\">VeriTrail\u2019s hallucination detection process<\/h2>\n\n\n\n<p>A key idea leveraged by VeriTrail is that a wide range of generative processes can be represented as a <strong>directed acyclic graph (DAG)<\/strong>. Each <strong>node <\/strong>in the DAG represents a piece of text (i.e., source material, an intermediate output, or the final output) and each <strong>edge <\/strong>from node A to node B indicates that A was used as an input to produce B. Each node is assigned a unique ID, as well as a <strong>stage<\/strong> reflecting its position in the generative process.\u202f&nbsp;<\/p>\n\n\n\n<p>An example of a process with multiple generative steps is <a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/project\/graphrag\/\" target=\"_blank\" rel=\"noreferrer noopener\">GraphRAG<\/a>. A DAG representing a GraphRAG run is illustrated in Figure 1, where the boxes and arrows correspond to nodes and edges, respectively.<a id=\"_ftnref3\" href=\"#_ftn3\"><sup>3<\/sup><\/a><\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"881\" height=\"542\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/07\/VeriTrail-blog-figure-1.jpg\" alt=\"A GraphRAG run is depicted as a directed acyclic graph. The Stage 1 nodes represent source text chunks. Each Stage 1 node has an edge pointing to a Stage 2 node, which corresponds to an entity or a relationship. Entity 3 was extracted from two source text chunks, so its descriptions are summarized. The summarized entity description forms a Stage 3 node. The Stage 2 and 3 nodes have edges pointing to Stage 4 nodes, which represent community reports. 
The Stage 4 nodes have edges pointing to Stage 5 nodes, which correspond to map-level answers. The Stage 5 nodes each have an edge pointing to the terminal node, which represents the final answer. The terminal node is the only node in Stage 6.\" class=\"wp-image-1145841\" srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/07\/VeriTrail-blog-figure-1.jpg 881w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/07\/VeriTrail-blog-figure-1-300x185.jpg 300w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/07\/VeriTrail-blog-figure-1-768x472.jpg 768w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/07\/VeriTrail-blog-figure-1-240x148.jpg 240w\" sizes=\"auto, (max-width: 881px) 100vw, 881px\" \/><figcaption class=\"wp-element-caption\">Figure 1: GraphRAG splits the source text into chunks (Stage 1). For each chunk, an LM extracts entities and relationships (the latter are denoted by \u201c\u2b64\u201d), along with short descriptions (Stage 2). If an entity or a relationship was extracted from multiple chunks, an LM summarizes the descriptions (Stage 3). A knowledge graph is constructed from the final set of entities and relationships, and a community detection algorithm, such as Leiden clustering, groups entities into communities. For each community, an LM generates a \u201ccommunity report\u201d that summarizes the entities and relationships (Stage 4). To answer a user\u2019s question, an LM generates \u201cmap-level answers\u201d based on groups of community reports (Stage 5), then synthesizes them into a final answer (Stage 6).<\/figcaption><\/figure>\n\n\n\n<p>VeriTrail takes as input a DAG representing a completed generative process and aims to determine whether the final output is fully supported by the source text. 
It begins by extracting claims (i.e., self-contained, verifiable statements) from the final output using <a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/blog\/claimify-extracting-high-quality-claims-from-language-model-outputs\/\" target=\"_blank\" rel=\"noreferrer noopener\">Claimify<\/a>. VeriTrail verifies claims in the reverse order of the generative process: it starts from the final output and moves toward the source text. Each claim is verified separately. Below, we include two case studies that illustrate how VeriTrail works, using the DAG from Figure 1.\u202f<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"case-study-1-a-fully-supported-claim\">Case study 1: A \u201cFully Supported\u201d claim<\/h3>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1280\" height=\"720\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/07\/VeriTrail-blog-figure-2.jpg\" alt=\"An example of VeriTrail's claim verification process where the claim is found \u201cFully Supported.\u201d A claim extracted from the terminal node, Node 17, is \u201cLegislative efforts have been made to address the high cost of diabetes-related supplies in the US.\u201d In Iteration 1, VeriTrail checks Nodes 15 and 16, which are the source nodes of the terminal node. The sentence \u201cThe general assembly in North Carolina is considering legislation to set a cap on insulin prices, which indicates that high insulin prices are a contributing factor to the high cost of diabetes-related supplies in the US\u201d is selected as evidence from Node 15. The tentative verdict is \u201cFully Supported.\u201d In Iteration 2, VeriTrail checks Nodes 12 and 13, which are the source nodes of Node 15. The sentence \u201cThe General Assembly in North Carolina is considering legislation to set a cap on insulin prices\u201d is selected as evidence from Node 13. 
The verdict remains \u201cFully Supported.\u201d In Iteration 3, VeriTrail checks Nodes 4, 5, and 11, which are the source nodes of Node 13. The sentence \u201cThe General Assembly is the legislative body in North Carolina considering legislation to cap insulin prices\u201d is selected as evidence from Node 4. The verdict is still \u201cFully Supported.\u201d In Iteration 4, VeriTrail checks Node 1, which is the source node of Node 4. The selected evidence is \u201c\u2018There\u2019s actually legislation in North Carolina at the General Assembly to set a cap on insulin\u2026\u2019 Stein said.\u201d The corresponding verdict is \u201cFully Supported.\u201d Since Node 1 represents a raw text chunk, it does not have any source nodes to check. Therefore, verification terminates and the \u201cFully Supported\u201d verdict is deemed final.\" class=\"wp-image-1145843\" srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/07\/VeriTrail-blog-figure-2.jpg 1280w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/07\/VeriTrail-blog-figure-2-300x169.jpg 300w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/07\/VeriTrail-blog-figure-2-1024x576.jpg 1024w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/07\/VeriTrail-blog-figure-2-768x432.jpg 768w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/07\/VeriTrail-blog-figure-2-1066x600.jpg 1066w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/07\/VeriTrail-blog-figure-2-655x368.jpg 655w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/07\/VeriTrail-blog-figure-2-240x135.jpg 240w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/07\/VeriTrail-blog-figure-2-640x360.jpg 640w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/07\/VeriTrail-blog-figure-2-960x540.jpg 960w\" sizes=\"auto, 
(max-width: 1280px) 100vw, 1280px\" \/><figcaption class=\"wp-element-caption\">Figure 2: Left: GraphRAG as a DAG. Right: VeriTrail\u2019s hallucination detection process for a \u201cFully Supported\u201d claim.<\/figcaption><\/figure>\n\n\n\n<p>Figure 2 shows an example of a claim that VeriTrail determined was not hallucinated:\u202f<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>In Iteration 1, VeriTrail identified the nodes that were used as inputs for the final answer: Nodes 15 and 16. Each identified node was split into sentences, and each sentence was programmatically assigned a unique ID.\n<ul class=\"wp-block-list\">\n<li>An LM then performed <strong>Evidence Selection<\/strong>, selecting all sentence IDs that strongly implied the truth or falsehood of the claim. The LM also generated a summary of the selected sentences (not shown in Figure 2). In this example, a sentence was selected from Node 15.<\/li>\n\n\n\n<li>Next, an LM performed <strong>Verdict Generation<\/strong>. If no sentences had been selected in the Evidence Selection step, the claim would have been assigned a \u201cNot Fully Supported\u201d verdict. Instead, an LM was prompted to classify the claim as \u201cFully Supported,\u201d \u201cNot Fully Supported,\u201d or \u201cInconclusive\u201d based on the evidence. In this case, the verdict was \u201cFully Supported.\u201d<\/li>\n<\/ul>\n<\/li>\n\n\n\n<li>Since the verdict in Iteration 1 was \u201cFully Supported,\u201d VeriTrail proceeded to Iteration 2. It considered the nodes from which at least one sentence was selected in the latest Evidence Selection step (Node 15) and identified their input nodes (Nodes 12 and 13). VeriTrail repeated Evidence Selection and Verdict Generation for the identified nodes. 
Once again, the verdict was \u201cFully Supported.\u201d This process \u2013 identifying candidate nodes, performing Evidence Selection and Verdict Generation \u2013 was repeated in Iteration 3, where the verdict was still \u201cFully Supported,\u201d and likewise in Iteration 4.\u202f<\/li>\n\n\n\n<li>In Iteration 4, a single source text chunk was verified. Since the source text, by definition, does not have any inputs, verification terminated and the verdict was deemed final.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"case-study-2-a-not-fully-supported-claim\">Case study 2: A \u201cNot Fully Supported\u201d claim<\/h3>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1280\" height=\"515\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/07\/VerTrail-blog-figure-3.jpg\" alt=\"An example of VeriTrail's claim verification process where the claim is found \u201cNot Fully Supported.\u201d We assume that the maximum number of consecutive \u201cNot Fully Supported\u201d verdicts was set to 2. A claim extracted from the terminal node, Node 17, is \u201cChallenges related to electric vehicle battery repairability contribute to sluggish retail auto sales in China.\u201d In Iteration 1, VeriTrail checks Nodes 15 and 16, which are the source nodes of the terminal node. Two sentences are selected as evidence. The first sentence is \u201cChallenges with electric vehicle (EV) battery disposal and repair may also contribute to the sluggishness in retail auto sales.\u201d The second sentence is \u201cJunkyards are accumulating discarded EV battery packs, while collision shops face limitations in repairing EV battery packs, which could affect consumer confidence and demand.\u201d These sentences are both from Node 15. The tentative verdict is \u201cNot Fully Supported.\u201d In Iteration 2, VeriTrail checks Nodes 12, 13, and 14. 
Nodes 12 and 13 are the source nodes of Node 15. Node 14 is the source node of Node 16, which was checked in Iteration 1. The sentence \u201cThe electric vehicle market in China is influenced by challenges associated with EV battery disposal and repair\u201d is selected as evidence from Node 12. The verdict remains \u201cNot Fully Supported.\u201d Since two consecutive \u201cNot Fully Supported\u201d verdicts have been reached, which was the maximum, verification terminates and the final verdict is \u201cNot Fully Supported.\u201d\" class=\"wp-image-1145842\" srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/07\/VerTrail-blog-figure-3.jpg 1280w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/07\/VerTrail-blog-figure-3-300x121.jpg 300w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/07\/VerTrail-blog-figure-3-1024x412.jpg 1024w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/07\/VerTrail-blog-figure-3-768x309.jpg 768w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/07\/VerTrail-blog-figure-3-240x97.jpg 240w\" sizes=\"auto, (max-width: 1280px) 100vw, 1280px\" \/><figcaption class=\"wp-element-caption\">Figure 3: Left: GraphRAG as a DAG. Right: VeriTrail\u2019s hallucination detection\u202fprocess for a \u201cNot Fully Supported\u201d claim, where the maximum number of consecutive \u201cNot Fully Supported\u201d verdicts was set to 2. <\/figcaption><\/figure>\n\n\n\n<p>Figure 3 provides an example of a claim where VeriTrail identified hallucination:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>In Iteration 1, VeriTrail identified the nodes used as inputs for the final answer: Nodes 15 and 16. After Evidence Selection and Verdict Generation, the verdict was \u201cNot Fully Supported.\u201d Users can configure the maximum number of consecutive \u201cNot Fully Supported\u201d verdicts permitted. 
If the maximum had been set to 1, verification would have terminated here, and the verdict would have been deemed final. Let\u2019s assume the maximum was set to 2, meaning that VeriTrail had to perform at least one more iteration.<\/li>\n\n\n\n<li>Even though evidence was selected only from Node 15 in Iteration 1, VeriTrail checked the input nodes for both Node 15 and Node 16 (i.e., Nodes 12, 13, and 14) in Iteration 2. Recall that in Case Study 1, where the verdict was \u201cFully Supported,\u201d VeriTrail checked only the input nodes for Node 15. Why was the \u201cNot Fully Supported\u201d claim handled differently? If the Evidence Selection step overlooked relevant evidence, the \u201cNot Fully Supported\u201d verdict might be incorrect. In this case, continuing verification based solely on the selected evidence (i.e., Node 15) would propagate the mistake, defeating the purpose of repeated verification.<\/li>\n\n\n\n<li>In Iteration 2, Evidence Selection and Verdict Generation were repeated for Nodes 12, 13, and 14. 
Once again, the verdict was \u201cNot Fully Supported.\u201d Since this was the second consecutive \u201cNot Fully Supported\u201d verdict, verification terminated and the verdict was deemed final.<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"providing-traceability\">Providing traceability<\/h2>\n\n\n\n<p>In addition to assigning a final \u201cFully Supported,\u201d \u201cNot Fully Supported,\u201d or \u201cInconclusive\u201d verdict to each claim, VeriTrail returns (a) all Verdict Generation results and (b) an evidence trail composed of all Evidence Selection results: the selected sentences, their corresponding node IDs, and the generated summaries. Collectively, these outputs provide traceability:&nbsp;<\/p>\n\n\n\n<ol start=\"1\" class=\"wp-block-list\">\n<li><strong>Provenance: <\/strong>For \u201cFully Supported\u201d and \u201cInconclusive\u201d claims, the evidence trail traces a path from the source material to the final output, helping users understand how the output may have been derived. For example, in Case Study 1, the evidence trail consists of Sentence 8 from Node 15, Sentence 11 from Node 13, Sentence 26 from Node 4, and Sentence 79 from Node 1.<\/li>\n\n\n\n<li><strong>Error Localization: <\/strong>For \u201cNot Fully Supported\u201d claims, VeriTrail uses the Verdict Generation results to identify the stage(s) of the process where the unsupported content was likely introduced. 
For instance, in Case Study 2, where none of the verified intermediate outputs supported the claim, VeriTrail would indicate that the hallucination occurred in the final answer (Stage 6). Error stage identification helps users address hallucinations and understand where in the process they are most likely to occur.&nbsp;<\/li>\n<\/ol>\n\n\n\n<p>The evidence trail also helps users verify the verdict: instead of reading through all nodes \u2013 which may be infeasible for processes that generate large amounts of text \u2013 users can simply review the evidence sentences and summaries.\u202f<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"key-design-features\">Key design features<\/h2>\n\n\n\n<p>VeriTrail\u2019s design prioritizes reliability, efficiency, scalability, and user agency. Notable features include:\u202f<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>During Evidence Selection (introduced in Case Study 1), the sentence IDs returned by the LM are checked against the programmatically assigned IDs. If a returned ID does not match an assigned ID, it is discarded; otherwise, it is mapped to its corresponding sentence. This approach <strong>guarantees that the sentences included in the evidence trail are not hallucinated<\/strong>.<\/li>\n\n\n\n<li>After a claim is assigned an interim \u201cFully Supported\u201d or \u201cInconclusive\u201d verdict (as in Case Study 1), VeriTrail verifies the input nodes of only the nodes from which evidence was previously selected \u2013 not all possible input nodes. By progressively narrowing the search space, VeriTrail limits the number of nodes the LM must evaluate. In particular, since VeriTrail starts from the final output and moves toward the source text, it tends to verify a smaller proportion of nodes as it approaches the source text. 
Nodes closer to the source text tend to be larger (e.g., a book chapter should be larger than its summary), so verifying fewer of them helps <strong>reduce computational cost<\/strong>.<\/li>\n\n\n\n<li>VeriTrail is designed to <strong>handle input graphs with any number of nodes<\/strong>, regardless of whether they fit in a single prompt. Users can specify an input size limit per prompt. For Evidence Selection, inputs that exceed the limit are split across multiple prompts. If the resulting evidence exceeds the input size limit for Verdict Generation, VeriTrail reruns Evidence Selection to compress the evidence further. Users can configure the maximum number of Evidence Selection reruns.\u202f\u202f<\/li>\n\n\n\n<li>The configurable maximum number of consecutive \u201cNot Fully Supported\u201d verdicts (introduced in Case Study 2) allows the user to find their desired <strong>balance between computational cost and how conservative VeriTrail is in flagging hallucinations<\/strong>. A lower maximum reduces cost by limiting the number of checks. A higher maximum increases confidence that a flagged claim is truly hallucinated since it requires repeated confirmation of the \u201cNot Fully Supported\u201d verdict.\u202f<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"evaluating-veritrail-s-performance\">Evaluating VeriTrail\u2019s performance<\/h2>\n\n\n\n<p>We tested VeriTrail on two datasets covering distinct generative processes (hierarchical summarization<a id=\"_ftnref4\" href=\"#_ftn4\"><sup>4<\/sup><\/a> and GraphRAG), tasks (summarization and question-answering), and types of source material (fiction novels and news articles). For the source material, we focused on long documents and large collections of documents (i.e., >100K tokens), where hallucination detection is especially challenging and processes with multiple generative steps are typically most valuable. 
The resulting DAGs were much more complex than the examples provided above (e.g., in one of the datasets, the average number of nodes was 114,368).<\/p>\n\n\n\n<p>We compared VeriTrail to three types of baseline methods commonly used for closed-domain hallucination detection: Natural Language Inference models (<a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/aclanthology.org\/2023.acl-long.634\/\" target=\"_blank\" rel=\"noopener noreferrer\">AlignScore<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> and <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/aclanthology.org\/2024.eacl-long.102\/\" target=\"_blank\" rel=\"noopener noreferrer\">INFUSE<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>); Retrieval-Augmented Generation; and long-context models (Gemini 1.5 Pro and GPT-4.1 mini). Across both datasets and all language models tested, VeriTrail outperformed the baseline methods in detecting hallucination.<a id=\"_ftnref5\" href=\"#_ftn5\"><sup>5<\/sup><\/a><\/p>\n\n\n\n<p>Most importantly, VeriTrail traces claims through intermediate outputs \u2013 unlike the baseline methods, which directly compare the final output to the source material. As a result, it can identify where hallucinated content was likely introduced and how faithful content may have been derived from the source. 
By providing traceability, VeriTrail brings transparency to generative processes, helping users understand, verify, debug, and, ultimately, trust their outputs.\u202f&nbsp;<\/p>\n\n\n\n<p>For an in-depth discussion of VeriTrail, please see our paper \u201c<a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/publication\/veritrail-closed-domain-hallucination-detection-with-traceability\/\" target=\"_blank\" rel=\"noreferrer noopener\">VeriTrail: Closed-Domain Hallucination Detection with Traceability.<\/a>\u201d<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<p><a id=\"_ftn1\" href=\"#_ftnref1\"><sup>1<\/sup><\/a> The term \u201cclosed-domain hallucination\u201d was introduced by OpenAI in the <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/arxiv.org\/abs\/2303.08774\" target=\"_blank\" rel=\"noopener noreferrer\">GPT-4 Technical Report<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>.<\/p>\n\n\n\n<p><sup><a id=\"_ftn2\" href=\"#_ftnref2\">2<\/a><\/sup> VeriTrail is currently used for research purposes only and is not available commercially.<\/p>\n\n\n\n<p><sup><a id=\"_ftn3\" href=\"#_ftnref3\">3<\/a><\/sup> We focus on GraphRAG\u2019s global search method.<\/p>\n\n\n\n<p><sup><a id=\"_ftn4\" href=\"#_ftnref4\">4<\/a><\/sup> In hierarchical summarization, an LM summarizes each source text chunk individually, then the resulting summaries are repeatedly grouped and summarized until a final summary is produced (<a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/arxiv.org\/abs\/2109.10862\" target=\"_blank\" rel=\"noopener noreferrer\">Wu et al., 2021<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>; <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/arxiv.org\/abs\/2310.00785\" target=\"_blank\" rel=\"noopener noreferrer\">Chang et al., 
2023<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>).<\/p>\n\n\n\n<p><sup><a id=\"_ftn5\" href=\"#_ftnref5\">5<\/a><\/sup> The only exception was the mistral-large-2411 model, where VeriTrail had the highest balanced accuracy, but not the highest macro F1 score.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>VeriTrail, new from Microsoft Research, can detect AI-generated content that is not supported by the source text, trace the provenance of content from final output back to the source, and locate where errors were likely introduced.<\/p>\n","protected":false},"author":43868,"featured_media":1145818,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","msr-author-ordering":[{"type":"user_nicename","value":"Dasha Metropolitansky","user_id":"43815"}],"msr_hide_image_in_river":null,"footnotes":""},"categories":[1],"tags":[],"research-area":[13556,13545],"msr-region":[],"msr-event-type":[],"msr-locale":[268875],"msr-post-option":[269148,243984,269142],"msr-impact-theme":[],"msr-promo-type":[],"msr-podcast-series":[],"class_list":["post-1145803","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-research-blog","msr-research-area-artificial-intelligence","msr-research-area-human-language-technologies","msr-locale-en_us","msr-post-option-approved-for-river","msr-post-option-blog-homepage-featured","msr-post-option-include-in-river"],"msr_event_details":{"start":"","end":"","location":""},"podcast_url":"","podcast_episode":"","msr_research_lab":[199565],"msr_impact_theme":[],"related-publications":[],"related-downloads":[],"related-videos":[],"related-academic-programs":[],"related-groups":[901101],"related-projects":[1027041],"related-events":[],"related-researchers":[{"type":"user_nicename","value":"Dasha 
Metropolitansky","user_id":43815,"display_name":"Dasha Metropolitansky","author_link":"<a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/people\/dasham\/\" aria-label=\"Visit the profile page for Dasha Metropolitansky\">Dasha Metropolitansky<\/a>","is_active":false,"last_first":"Metropolitansky, Dasha","people_section":0,"alias":"dasham"}],"msr_type":"Post","featured_image_thumbnail":"<img width=\"960\" height=\"540\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/07\/VeriTrail-BlogHeroFeature-1400x788-1-960x540.jpg\" class=\"img-object-cover\" alt=\"two icons on a green gradient background\" decoding=\"async\" loading=\"lazy\" srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/07\/VeriTrail-BlogHeroFeature-1400x788-1-960x540.jpg 960w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/07\/VeriTrail-BlogHeroFeature-1400x788-1-300x169.jpg 300w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/07\/VeriTrail-BlogHeroFeature-1400x788-1-1024x576.jpg 1024w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/07\/VeriTrail-BlogHeroFeature-1400x788-1-768x432.jpg 768w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/07\/VeriTrail-BlogHeroFeature-1400x788-1-1066x600.jpg 1066w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/07\/VeriTrail-BlogHeroFeature-1400x788-1-655x368.jpg 655w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/07\/VeriTrail-BlogHeroFeature-1400x788-1-240x135.jpg 240w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/07\/VeriTrail-BlogHeroFeature-1400x788-1-640x360.jpg 640w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/07\/VeriTrail-BlogHeroFeature-1400x788-1-1280x720.jpg 1280w, 
https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/07\/VeriTrail-BlogHeroFeature-1400x788-1.jpg 1400w\" sizes=\"auto, (max-width: 960px) 100vw, 960px\" \/>","byline":"<a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/people\/dasham\/\" title=\"Go to researcher profile for Dasha Metropolitansky\" aria-label=\"Go to researcher profile for Dasha Metropolitansky\" data-bi-type=\"byline author\" data-bi-cN=\"Dasha Metropolitansky\">Dasha Metropolitansky<\/a>","formattedDate":"August 5, 2025","formattedExcerpt":"VeriTrail, new from Microsoft Research, can detect AI-generated content that is not supported by the source text, trace the provenance of content from final output back to the source, and locate where errors were likely introduced.","locale":{"slug":"en_us","name":"English","native":"","english":"English"},"_links":{"self":[{"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/posts\/1145803","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/users\/43868"}],"replies":[{"embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/comments?post=1145803"}],"version-history":[{"count":17,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/posts\/1145803\/revisions"}],"predecessor-version":[{"id":1161040,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/posts\/1145803\/revisions\/1161040"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/media\/1145818"}],"wp:attachment":[{"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/media?parent=1145803"}],"wp:term":[{"taxonomy":"cate
gory","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/categories?post=1145803"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/tags?post=1145803"},{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=1145803"},{"taxonomy":"msr-region","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-region?post=1145803"},{"taxonomy":"msr-event-type","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-event-type?post=1145803"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=1145803"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=1145803"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=1145803"},{"taxonomy":"msr-promo-type","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-promo-type?post=1145803"},{"taxonomy":"msr-podcast-series","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-podcast-series?post=1145803"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}