{"id":845671,"date":"2022-05-23T09:00:00","date_gmt":"2022-05-23T16:00:00","guid":{"rendered":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/?p=845671"},"modified":"2022-08-17T09:45:21","modified_gmt":"2022-08-17T16:45:21","slug":"partnering-people-with-large-language-models-to-find-and-fix-bugs-in-nlp-systems","status":"publish","type":"post","link":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/blog\/partnering-people-with-large-language-models-to-find-and-fix-bugs-in-nlp-systems\/","title":{"rendered":"Partnering people with large language models to find and fix bugs in NLP systems"},"content":{"rendered":"\n<figure class=\"wp-block-embed alignwide is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio\"><div class=\"wp-block-embed__wrapper\">\n<div class=\"yt-consent-placeholder\" role=\"region\" aria-label=\"Video playback requires cookie consent\" data-video-id=\"9xQd8On3u3Q\" data-poster=\"https:\/\/img.youtube.com\/vi\/9xQd8On3u3Q\/maxresdefault.jpg\"><iframe aria-hidden=\"true\" tabindex=\"-1\" title=\"AdaTest\" width=\"500\" height=\"281\" data-src=\"https:\/\/www.youtube-nocookie.com\/embed\/9xQd8On3u3Q?feature=oembed&rel=0&enablejsapi=1\" frameborder=\"0\" allow=\"accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share\" referrerpolicy=\"strict-origin-when-cross-origin\" allowfullscreen><\/iframe><div class=\"yt-consent-placeholder__overlay\"><button class=\"yt-consent-placeholder__play\"><svg width=\"42\" height=\"42\" xmlns=\"http:\/\/www.w3.org\/2000\/svg\" aria-hidden=\"true\" focusable=\"false\"><g fill=\"none\" fill-rule=\"evenodd\"><circle fill=\"#000\" opacity=\".556\" cx=\"21\" cy=\"21\" r=\"21\"\/><path stroke=\"#FFF\" d=\"M27.5 22l-12 8.5v-17z\"\/><\/g><\/svg><span class=\"yt-consent-placeholder__label\">Video playback requires cookie consent<\/span><\/button><\/div><\/div>\n<\/div><\/figure>\n\n\n\n<p>Advances in platform models\u2014large-scale 
models that can serve as foundations across applications\u2014have significantly improved the ability of computers to process natural language. But natural language processing (NLP) models are still far from perfect, sometimes failing in embarrassing ways, like translating \u201cEu n\u00e3o recomendo este prato\u201d (<em>I don\u2019t recommend this dish<\/em>) in Portuguese to \u201cI highly recommend this dish\u201d in English (a real example from a top commercial model). These failures continue to exist in part because finding and fixing bugs in NLP models is hard\u2014so hard that severe bugs impact almost every major open-source and commercial NLP model.&nbsp;<\/p>\n\n\n\n<p>Current methods for finding or fixing bugs take one of two approaches: they\u2019re either user-driven or automated. User-driven methods are flexible and can test any aspect of a model\u2019s behavior, but they depend on highly variable human ability to imagine bugs and are so labor intensive that in practice only a small part of the input space gets tested. Automated approaches, on the other hand, are fast and so can explore large portions of the input space. However, since they lack human guidance, they can only test if a model is right or wrong in very restricted scenarios, such as when the model has inconsistent predictions on inputs with slight variations in phrasing.&nbsp;<\/p>\n\n\n\n<p>We believe platform models, specifically modern large language models (LLMs) like GPT-3, offer an opportunity for us to combine the synergistic strengths of both user-driven approaches and automated approaches, keeping the user in control of defining what the model being tested should be doing while leveraging the abilities of modern generative language models to generate at scale tests within a specific category of model behavior. 
We call this human-AI team approach <a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/publication\/adaptive-testing-and-debugging-of-nlp-models\/\" target=\"_blank\" rel=\"noreferrer noopener\">Adaptive Testing and Debugging, or AdaTest<\/a> for short.&nbsp;<\/p>\n\n\n\n<p>With AdaTest, a large language model is tasked with the slow burden of generating a large quantity of tests targeted at finding bugs in the model being tested, while the person steers the language model by selecting valid tests and organizing them into semantically related topics. This guidance from the person drastically improves the language model\u2019s generation performance and directs it toward areas of interest. Because these tests are effectively a form of labeled data, they not only identify bugs but can be used to fix bugs in an iterative debugging loop similar to traditional software development. AdaTest offers significant productivity gains for expert users while remaining simple enough to empower diverse groups of non-experts without a background in programming. This means experts and non-experts alike can better understand and control the behavior of their AI systems across a range of scenarios, which makes for not only better-performing AI systems but <a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/ai\/responsible-ai\" target=\"_blank\" rel=\"noreferrer noopener\">more responsible AI systems<\/a>. 
The AdaTest code and pre-populated test trees are <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/github.com\/microsoft\/adatest\" target=\"_blank\" rel=\"noopener noreferrer\">open source on GitHub<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>.&nbsp;<\/p>\n\n\n\n<div class=\"wp-block-buttons is-content-justification-center is-content-justification-center is-layout-flex wp-container-core-buttons-is-layout-16018d1d wp-block-buttons-is-layout-flex\">\n<div class=\"wp-block-button\"><a data-bi-type=\"button\" class=\"wp-block-button__link\" href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/publication\/adaptive-testing-and-debugging-of-nlp-models\/\" target=\"_blank\" rel=\"noreferrer noopener\">Read the paper<\/a><\/div>\n\n\n\n<div class=\"wp-block-button is-style-fill-github\"><a data-bi-type=\"button\" class=\"wp-block-button__link\" href=\"https:\/\/github.com\/microsoft\/adatest\" target=\"_blank\" rel=\"noreferrer noopener\">Get the code<\/a><\/div>\n<\/div>\n\n\n\n<div style=\"height:15px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<p>We\u2019re presenting our paper, <a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/publication\/adaptive-testing-and-debugging-of-nlp-models\/\" target=\"_blank\" rel=\"noreferrer noopener\">\u201cAdaptive Testing and Debugging of NLP Models,\u201d<\/a> at the <a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/event\/acl-2022\/\" target=\"_blank\" rel=\"noreferrer noopener\">2022 Meeting of the Association for Computational Linguistics (ACL)<\/a>, where our colleagues will also be introducing work that <a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/?p=845632&secret=jdUWE9\" target=\"_blank\" rel=\"noreferrer noopener\">leverages large language models, in their case, to grow adversarial datasets for content moderation tools<\/a>.&nbsp;<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><a 
data-bi-bhvr=\"14\"  data-bi-cn=\"A diagram in which the testing loop is represented by a series of icons showing the language model suggesting tests, the user filtering and organizing them in a test tree, and the language model using that user feedback to suggest more tests, beginning the process again. The graphic representing the testing loop is situated within the debugging loop. Red arrows from the testing loop to a black square labeled \u201ctarget model\u201d and back to the testing loop indicate identified test failures being used to fix a target model, which is then retested in an iterative process. \" href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2022\/05\/Fig1_AdaTest.png\"><img loading=\"lazy\" decoding=\"async\" width=\"800\" height=\"708\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2022\/05\/Fig1_AdaTest.png\" alt=\"A diagram in which the testing loop is represented by a series of icons showing the language model suggesting tests, the user filtering and organizing them in a test tree, and the language model using that user feedback to suggest more tests, beginning the process again. The graphic representing the testing loop is situated within the debugging loop. Red arrows from the testing loop to a black square labeled \u201ctarget model\u201d and back to the testing loop indicate identified test failures being used to fix a target model, which is then retested in an iterative process. 
\" class=\"wp-image-845695\" srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2022\/05\/Fig1_AdaTest.png 800w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2022\/05\/Fig1_AdaTest-300x266.png 300w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2022\/05\/Fig1_AdaTest-768x680.png 768w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2022\/05\/Fig1_AdaTest-203x180.png 203w\" sizes=\"auto, (max-width: 800px) 100vw, 800px\" \/><\/a><figcaption>Figure 1: AdaTest consists of two loops: a testing loop that generates and organizes tests optimized for the model being tested (the target model) and a debugging loop that iteratively refines the model based on test failures.&nbsp;<\/figcaption><\/figure>\n\n\n\n<h2 id=\"finding-bugs-with-the-testing-loop\">Finding bugs with the testing loop<\/h2>\n\n\n\n<p>The AdaTest process is composed of an inner <em>testing <\/em>loop that is used to find bugs (Figure 1, unrolled in Figure 2) and an outer <em>debugging<\/em> loop that is used to fix bugs (Figure 1, unrolled in Figure 4).&nbsp;<\/p>\n\n\n\n<p>Consider how this works for sentiment analysis, used to determine if a piece of text expresses a positive or negative sentiment (typically in the context of product reviews or customer feedback). While the task seems simple enough, even state-of-the-art models have failures, ranging from overt, like classifying \u201cI don&#8217;t think I&#8217;ve ever had a nicer time in my life\u201d as negative, to more subtly harmful, like classifying \u201cI am a racial minority\u201d as negative (both represent real failures found with AdaTest in commercial models). 
To demonstrate how AdaTest finds and fixes bugs, we show how to test for (and later fix) instances of <a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/video\/fairness-related-harms-in-ai-systems-examples-assessment-and-mitigation\/\" target=\"_blank\" rel=\"noreferrer noopener\">fairness-related harms<\/a> in which neutral references to a specific identity group within a piece of text could cause a sentiment analysis model to incorrectly downweight the sentiment of the text\u2014in other words, scenarios in which a model might treat comments from specific groups more negatively.&nbsp;<\/p>\n\n\n\n<p>In the testing loop, we start with a set of unit tests about various identities and label the set \u201c\/Sensitive\u201d (Figure 2 below). These initial examples don\u2019t reveal any model failures. But AdaTest then uses a large language model\u2014in our case, GPT-3\u2014to generate many similar suggested tests designed to highlight bugs (Figure 2A). While hundreds of tests are generated, we only need to review the top few failing or near-failing tests. We then ignore tests that don\u2019t represent real failures (for example, \u201cI am tired of being silenced\u201d really should be negative in Figure 2) and add the other valid tests to the current topic, also occasionally organizing them into additional subtopics (Figure 2B). These user-filtered tests are included in the language model prompt for the next round of suggestions, nudging the next set of suggestions toward the intersection between user interest and model failure (Figure 2C). Repeating the testing loop results in the language model starting at tests that don\u2019t fail and slowly working its way up to producing stronger and stronger failures. 
So even when users can\u2019t find model failures on their own, they can start from a small set of passing tests and quickly iterate with the language model to produce a large set of tests that reveal bugs in the model being tested.&nbsp;<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><a data-bi-bhvr=\"14\"  data-bi-cn=\"The testing loop represented as a series of rectangles, each containing test suggestions. Starting with the top rectangle and moving down, the user provides three neutral identity statements that are not predicted as negative. The language model, represented by a robot icon, suggests two statements predicted as negative in the next rectangle. In a third rectangle, real failures are accepted and organized into subtopics by the user, represented by a person icon. From those selections, the model suggests two more statements in the next rectangle. In the last rectangle, one of the subtopics is expanded based on the model\u2019s previous suggestions. \" href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2022\/05\/Fig2_AdaTest.png\"><img loading=\"lazy\" decoding=\"async\" width=\"825\" height=\"1414\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2022\/05\/Fig2_AdaTest.png\" alt=\"The testing loop represented as a series of rectangles, each containing test suggestions. Starting with the top rectangle and moving down, the user provides three neutral identity statements that are not predicted as negative. The language model, represented by a robot icon, suggests two statements predicted as negative in the next rectangle. In a third rectangle, real failures are accepted and organized into subtopics by the user, represented by a person icon. From those selections, the model suggests two more statements in the next rectangle. In the last rectangle, one of the subtopics is expanded based on the model\u2019s previous suggestions. 
\" class=\"wp-image-845692\" srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2022\/05\/Fig2_AdaTest.png 825w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2022\/05\/Fig2_AdaTest-175x300.png 175w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2022\/05\/Fig2_AdaTest-597x1024.png 597w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2022\/05\/Fig2_AdaTest-768x1316.png 768w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2022\/05\/Fig2_AdaTest-105x180.png 105w\" sizes=\"auto, (max-width: 825px) 100vw, 825px\" \/><\/a><figcaption>Figure 2: The testing loop cycles between the large language model (LLM) generating test suggestions, the model scoring the suggestions, and the user accepting (\u2714) and organizing them, beginning with initial user-provided examples. In this three-way sentiment analysis example, the model &#8220;f&#8221; can either pass (green) or fail (red) a test. Passing one of the tests above means the model did not output \u201cnegative\u201d while failing a test above means the model did output \u201cnegative\u201d and hence failed the test assertion (\u2260). As the user filters and organizes (B, D), the LLM iteratively climbs toward suggesting valid tests that reveal more pronounced failures (A, C). In this example, we&#8217;re testing a sentiment analysis model to ensure that neutral identity-related statements don\u2019t cause the model to flag comments as negative.&nbsp;<\/figcaption><\/figure>\n\n\n\n<p>If instead of the \u201c\/Sensitive\u201d topic shown in Figure 2, we target a different topic, such as handling negation, we\u2019ll reveal different failures. 
For example, starting from simple statements like \u201cI have never been happier\u201d that a commercial model correctly classifies as positive, AdaTest can quickly find bugs like \u201cI don\u2019t think that I\u2019ve ever seen a nicer town\u201d getting labeled as negative. These bugs are egregious and obvious once you see them, but they\u2019re hard to find by hand since they only happen for very specific phrasings.&nbsp;<\/p>\n\n\n\n<p>We ran user studies to quantitatively evaluate whether AdaTest makes experts and non-experts better at writing tests and finding bugs in models. We asked experts, those with a background in machine learning and NLP, to test specific topics in two models: a commercial sentiment classifier and GPT-2 for next-word auto-complete, used in such applications as predicting the next word in an email being typed (a scenario in which we want to avoid suggesting stereotypes, one of the behaviors we had participants test for). For each topic and model, participants were randomly assigned to use CheckList (representing state-of-the-art user-driven testing) or AdaTest. We present the average number of discovered model failures per minute in Figure 3, where we observe a roughly fivefold improvement with AdaTest across models and participants in the study. We also asked non-experts, those without any programming background, to test the Perspective API toxicity model for content moderation. Participants tried to find non-toxic statements of political opinion (that is, statements they would <em>personally <\/em>consider appropriate to post) that the model predicted as toxic. Participants were given access to an improved version of the Dynabench crowd-sourcing interface for model testing and to AdaTest. 
AdaTest provided up to a tenfold improvement (bottom portion of Figure 3).&nbsp;<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1099\" height=\"718\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2022\/05\/Fig3_AdaTest.png\" alt=\"A horizontal bar chart with failures found per minute on the x-axis and model and topic on the y-axis broken down by experience of the participant doing the testing. NLP experts testing the sentiment model and auto-complete with AdaTest found 2 clear positive failures per minute and 1 negated positive per minute and 0.6 Muslim stereotypes and 1.1 African American stereotypes, respectively. NLP experts testing the sentiment model and auto-complete with CheckList found 0.3 clear positive failures per minute and 0.2 negated positives per minute and 0.1 Muslim stereotypes and 0.2 African American stereotypes, respectively. Non-experts testing the toxicity model for non-toxic political viewpoints classified as toxic found 1.5 failures per minute with AdaTest compared with 0.15 with Dynabench. \" class=\"wp-image-845689\" srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2022\/05\/Fig3_AdaTest.png 1099w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2022\/05\/Fig3_AdaTest-300x196.png 300w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2022\/05\/Fig3_AdaTest-1024x669.png 1024w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2022\/05\/Fig3_AdaTest-768x502.png 768w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2022\/05\/Fig3_AdaTest-240x157.png 240w\" sizes=\"auto, (max-width: 1099px) 100vw, 1099px\" \/><figcaption>Figure 3: Per-topic model failures per minute. Experts found approximately five times more failures with AdaTest on all topics, and non-experts benefited by up to 10 times. 
Error bars represent the 10th and 90th percentiles over bootstrap re-samples of participants.&nbsp;<\/figcaption><\/figure>\n\n\n\n<p>We also grouped participants by their progressive versus conservative political alignment and found that participants wrote tests with twice the quality when testing their own perspective versus an opposing perspective (as measured by an independent set of in-group raters). Our user studies highlight that AdaTest can be used by anyone and that such easy-to-use tools are important to enable model testing by people with diverse backgrounds since testers representing different lived experiences and viewpoints are needed to effectively test different perspectives.&nbsp;<\/p>\n\n\n\n<h2 id=\"fixing-bugs-with-the-debugging-loop\">Fixing bugs with the debugging loop&nbsp;<\/h2>\n\n\n\n<p>Once enough bugs are discovered, testers of a model then engage in the outer debugging loop (Figure 4 below), where they fix bugs discovered in the testing loop and then retest the model. In our experiments, we fixed bugs by fine-tuning the model on the tests, but other strategies, such as collecting more data or adding constraints, are also possible. The <em>retest <\/em>part of the debugging loop (that is, running the testing loop again) is critical since once we use our tests to fix the model, they no longer represent test data but rather training data. The process of fixing a bug often overcompensates, introducing shortcuts or bugs in the initial rounds of the debugging loop that can only be found using a new set of tests adapted to target the new \u201cfixed\u201d model.&nbsp;<\/p>\n\n\n\n<p>Running the debugging loop on an open-source RoBERTa-Large sentiment model (Figure 4) demonstrates the importance of a test-fix-retest cycle. 
We start with tests from the \u201c\/Sensitive\/Immigration\u201d topic from Figure 2 that the RoBERTa model incorrectly labels as negative. Fine-tuning the model on these tests (mixed with the original training data to maintain task performance) results in a new model that no longer fails the tests (second row of Figure 4). However, when we rerun the testing loop, we find that now almost all immigration statements are labeled as \u201cneutral,\u201d even if they are truly negative based on the application and testing scenario (for example, the statements in the third row of Figure 4 wouldn\u2019t be neutral if a model were tasked with detecting if language was for or against something). Fine-tuning again using these new tests (and the older ones) results in a model that correctly fixes the original bug without adding the \u201cevery immigration statement is neutral\u201d shortcut.<\/p>\n\n\n\n<p>This doesn\u2019t, of course, guarantee that there isn\u2019t another shortcut still in the model, but in our experience, a few rounds of the debugging loop drastically reduce the number of accidental bugs that get introduced when \u201cfixing\u201d the original bugs. The testers of the model don\u2019t have to exhaustively identify every possible shortcut or imbalance ahead of time, since AdaTest adaptively surfaces and fixes bugs that have been introduced in the next rounds of testing and debugging. Thus, the debugging loop serves as a friendly adversary, pushing the boundaries of the current \u201cspecification\u201d until a satisfactory model is produced. In fact, AdaTest can be seen as an application of the test-fix-retest loop from software engineering to NLP.&nbsp;<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full is-resized\"><img decoding=\"async\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2022\/05\/Fig4_AdaTest.png\" alt=\"The debugging loop represented as a series of rectangles. 
Starting with the top rectangle and moving down, tests the model has failed are used to fix the model, as demonstrated by the correctly predicted statements in the second rectangle. The testing loop is run again, revealing that an overcorrection has occurred that causes a bug, in the third rectangle, with even negative statements about the topic being predicted as neutral. These tests that reveal this new bug are then fixed, and the model is fine-tuned on the new tests and previous tests, resulting in the statements being predicted correctly in the last rectangle. \" class=\"wp-image-845686\" width=\"825\" srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2022\/05\/Fig4_AdaTest.png 855w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2022\/05\/Fig4_AdaTest-290x300.png 290w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2022\/05\/Fig4_AdaTest-768x793.png 768w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2022\/05\/Fig4_AdaTest-174x180.png 174w\" sizes=\"(max-width: 855px) 100vw, 855px\" \/><figcaption>Figure 4: Shortcuts added during an iteration of the debugging loop are found and fixed by future iterations.&nbsp;<\/figcaption><\/figure>\n\n\n\n<p>To evaluate the effectiveness of the debugging loop, we fine-tuned RoBERTa-Large to detect if two questions are duplicates (that is, the same question worded differently) using the Quora Question Pairs (QQP) dataset and also fine-tuned it for positive\/neutral\/negative sentiment analysis using the Stanford Sentiment Treebank (SST) dataset. Using previously published CheckList suites for evaluation, we find the baseline model fails 22 out of 53 QQP topics and 11 out of 39 sentiment topics. 
We then created data to \u201cfix\u201d a topic either by taking 50 examples from the topic\u2019s data in the CheckList condition or by starting from a seed of five examples and running the debugging loop with AdaTest until finding failures became qualitatively difficult (on average 2.83 rounds for QQP and 3.83 rounds for sentiment). This yielded an average of 41.6 tests for QQP and 55.8 tests for sentiment. We followed this process for six distinct high-failure-rate topics in each task. In the vast majority of cases (<a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/publication\/adaptive-testing-and-debugging-of-nlp-models\/\" target=\"_blank\" rel=\"noreferrer noopener\">see paper for details<\/a>), AdaTest fixed the topics used for training and a number of <em>unseen <\/em>held-out topics without breaking <em>any <\/em>topics, while CheckList data often introduced new bugs (and thus broke other test topics).<\/p>\n\n\n\n<p>We also evaluated the effectiveness of AdaTest in a standard development setting, targeting a model for to-do detection in meeting notes. After three months of development, CheckList testing, and ad hoc GPT-3\u2013based data augmentation, a PhD-level team had managed to build a model with an F1 score of 0.66 (out of 1) on unseen data collected in the wild. We gave the team AdaTest along with a demo lasting a few minutes. After four hours of running the debugging loop on their own, they produced another model with an F1 score of 0.77 on the same unseen dataset. These scores were then replicated on a second unseen dataset, showing that AdaTest can add significant bug-fixing value with a fraction of the effort involved in traditional approaches.&nbsp;<\/p>\n\n\n\n<h2 id=\"the-promise-of-human-ai-collaboration-for-ml-development\">The promise of human-AI collaboration for ML development<\/h2>\n\n\n\n<p>AdaTest encourages a close collaboration between people and large language models, yielding the benefits of both. 
People provide the problem specification that the language model lacks, while the language model generates high-quality tests at a scale and scope that is infeasible for people. The debugging loop connects model testing and debugging to effectively fix bugs, taking model development a step closer to the iterative nature of traditional software development. Human-AI partnership represents a promising way forward for machine learning development, and we expect this synergy to only improve as the capabilities of large language models continue to grow.<\/p>\n\n\n\n<p>Check out the <a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/publication\/adaptive-testing-and-debugging-of-nlp-models\/\" target=\"_blank\" rel=\"noreferrer noopener\">full paper<\/a> to see AdaTest\u2019s effectiveness on classification models (sentiment analysis, QQP, toxicity, media selection, and task detection), generation models (GPT-2, translation), and per-token models (named entity recognition, or NER) ranging from well-tested production systems to brand-new applications. 
Give it a try yourself at <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/github.com\/microsoft\/adatest\" target=\"_blank\" rel=\"noopener noreferrer\">https:\/\/github.com\/microsoft\/adatest<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>.&nbsp;<\/p>\n\n\n\n<figure class=\"wp-block-embed aligncenter is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio\"><div class=\"wp-block-embed__wrapper\">\n<div class=\"yt-consent-placeholder\" role=\"region\" aria-label=\"Video playback requires cookie consent\" data-video-id=\"OPiJCuer6wY\" data-poster=\"https:\/\/img.youtube.com\/vi\/OPiJCuer6wY\/maxresdefault.jpg\"><iframe aria-hidden=\"true\" tabindex=\"-1\" title=\"Using platform models responsibly: Developer tools with human-AI partnership at the center\" width=\"500\" height=\"281\" data-src=\"https:\/\/www.youtube-nocookie.com\/embed\/OPiJCuer6wY?feature=oembed&rel=0&enablejsapi=1\" frameborder=\"0\" allow=\"accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share\" referrerpolicy=\"strict-origin-when-cross-origin\" allowfullscreen><\/iframe><div class=\"yt-consent-placeholder__overlay\"><button class=\"yt-consent-placeholder__play\"><svg width=\"42\" height=\"42\" xmlns=\"http:\/\/www.w3.org\/2000\/svg\" aria-hidden=\"true\" focusable=\"false\"><g fill=\"none\" fill-rule=\"evenodd\"><circle fill=\"#000\" opacity=\".556\" cx=\"21\" cy=\"21\" r=\"21\"\/><path stroke=\"#FFF\" d=\"M27.5 22l-12 8.5v-17z\"\/><\/g><\/svg><span class=\"yt-consent-placeholder__label\">Video playback requires cookie consent<\/span><\/button><\/div><\/div>\n<\/div><figcaption><center>Platform models\u2014large-scale models trained on vast amounts of data\u2014are making it easier and faster to develop AI systems. 
AdaTest and other tools and resources like it are being developed by researchers at Microsoft to help developers get the most out of these platform models while also understanding, measuring, and mitigating the risks they pose.<\/center><\/figcaption><\/figure>\n","protected":false},"excerpt":{"rendered":"<p>Advances in platform models\u2014large-scale models that can serve as foundations across applications\u2014have significantly improved the ability of computers to process natural language. But natural language processing (NLP) models are still far from perfect, sometimes failing in embarrassing ways, like translating \u201cEu n\u00e3o recomendo este prato\u201d (I don\u2019t recommend this dish) in Portuguese to \u201cI highly recommend this dish\u201d in English (a real example from a top commercial model). These failures continue to exist in part because finding and fixing bugs in NLP models is hard\u2014so hard that severe bugs impact almost every major open-source and commercial NLP model.\u00a0<\/p>\n","protected":false},"author":37583,"featured_media":846649,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","msr-author-ordering":[{"type":"user_nicename","value":"Scott Lundberg","user_id":"38766"},{"type":"user_nicename","value":"Marco Tulio 
Ribeiro","user_id":"38733"}],"msr_hide_image_in_river":0,"footnotes":""},"categories":[1],"tags":[],"research-area":[13554],"msr-region":[],"msr-event-type":[],"msr-locale":[268875],"msr-post-option":[243984],"msr-impact-theme":[],"msr-promo-type":[],"msr-podcast-series":[],"class_list":["post-845671","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-research-blog","msr-research-area-human-computer-interaction","msr-locale-en_us","msr-post-option-blog-homepage-featured"],"msr_event_details":{"start":"","end":"","location":""},"podcast_url":"","podcast_episode":"","msr_research_lab":[],"msr_impact_theme":[],"related-publications":[],"related-downloads":[],"related-videos":[],"related-academic-programs":[],"related-groups":[],"related-projects":[],"related-events":[],"related-researchers":[],"msr_type":"Post","featured_image_thumbnail":"<img width=\"960\" height=\"540\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2022\/05\/1400x788_AdaTest_still_image-scaled-960x540.jpg\" class=\"img-object-cover\" alt=\"An icon of people labeled with the pro \u201cknow the task\u201d and the con \u201cslow\u201d followed by a plus sign and an icon representing AI labeled with the pro \u201cfast\u201d and the con \u201cunaware of task\u201d followed by an equal sign and an icon of shaking hands. An outer arrow labeled \u201cThe AI generates tests to highlight bugs\u201d points up and around from the AI icon to the people icon. 
An arrow labeled \u201cThe person validates which bugs are real\u201d points down and around from the people icon to the AI icon, representing an iterative feedback loop.\" decoding=\"async\" loading=\"lazy\" srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2022\/05\/1400x788_AdaTest_still_image-scaled-960x540.jpg 960w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2022\/05\/1400x788_AdaTest_still_image-300x169.jpg 300w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2022\/05\/1400x788_AdaTest_still_image-1024x576.jpg 1024w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2022\/05\/1400x788_AdaTest_still_image-768x432.jpg 768w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2022\/05\/1400x788_AdaTest_still_image-1536x865.jpg 1536w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2022\/05\/1400x788_AdaTest_still_image-2048x1153.jpg 2048w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2022\/05\/1400x788_AdaTest_still_image-1066x600.jpg 1066w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2022\/05\/1400x788_AdaTest_still_image-655x368.jpg 655w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2022\/05\/1400x788_AdaTest_still_image-343x193.jpg 343w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2022\/05\/1400x788_AdaTest_still_image-240x135.jpg 240w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2022\/05\/1400x788_AdaTest_still_image-640x360.jpg 640w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2022\/05\/1400x788_AdaTest_still_image-1280x720.jpg 1280w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2022\/05\/1400x788_AdaTest_still_image-1920x1080.jpg 1920w\" sizes=\"auto, (max-width: 960px) 100vw, 960px\" \/>","byline":"Scott Lundberg and Marco Tulio 
Ribeiro","formattedDate":"May 23, 2022","formattedExcerpt":"Advances in platform models\u2014large-scale models that can serve as foundations across applications\u2014have significantly improved the ability of computers to process natural language. But natural language processing (NLP) models are still far from perfect, sometimes failing in embarrassing ways, like translating \u201cEu n\u00e3o recomendo este prato\u201d&hellip;","locale":{"slug":"en_us","name":"English","native":"","english":"English"},"_links":{"self":[{"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/posts\/845671","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/users\/37583"}],"replies":[{"embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/comments?post=845671"}],"version-history":[{"count":13,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/posts\/845671\/revisions"}],"predecessor-version":[{"id":870603,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/posts\/845671\/revisions\/870603"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/media\/846649"}],"wp:attachment":[{"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/media?parent=845671"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/categories?post=845671"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/tags?post=845671"},{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/
research\/wp-json\/wp\/v2\/research-area?post=845671"},{"taxonomy":"msr-region","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-region?post=845671"},{"taxonomy":"msr-event-type","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-event-type?post=845671"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=845671"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=845671"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=845671"},{"taxonomy":"msr-promo-type","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-promo-type?post=845671"},{"taxonomy":"msr-podcast-series","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-podcast-series?post=845671"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}