{"id":1158576,"date":"2025-12-15T13:19:46","date_gmt":"2025-12-15T21:19:46","guid":{"rendered":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/?post_type=msr-research-item&#038;p=1158576"},"modified":"2025-12-15T13:19:48","modified_gmt":"2025-12-15T21:19:48","slug":"benchmarking-robustness-of-automated-ct-pancreas-segmentation-achieving-human-level-reliability-through-human-in-the-loop-optimization","status":"publish","type":"msr-research-item","link":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/publication\/benchmarking-robustness-of-automated-ct-pancreas-segmentation-achieving-human-level-reliability-through-human-in-the-loop-optimization\/","title":{"rendered":"Benchmarking robustness of automated CT pancreas segmentation: achieving human-level reliability through human-in-the-loop optimization"},"content":{"rendered":"<div class=\" sec\">\n<h6 class=\"title\">Background<\/h6>\n<p class=\"chapter-para\">Deep learning\u2013based pancreas segmentation in CT has advanced rapidly yet remains evaluated primarily with mean overlap metrics that fail to capture robustness\u2014defined as the proportion of cases reaching human-level performance. Models performing well on mean Dice or surface metrics can still fail unpredictably across scanners or anatomies. Because early detection and quantitative biomarkers rely on consistent segmentation, robustness is critical for clinical deployment.<\/p>\n<\/div>\n<div class=\" sec\">\n<h6 class=\"title\">Purpose<\/h6>\n<p class=\"chapter-para\">To systematically evaluate the robustness of deep learning models for pancreas segmentation relative to human readers and to investigate an active learning strategy to improve reliability.<\/p>\n<\/div>\n<div class=\" sec\">\n<h6 class=\"title\">Materials and Methods<\/h6>\n<p class=\"chapter-para\">We retrospectively assembled 903 venous-phase CT scans from patients with presumed normal-appearing pancreases and without known pancreatic disease (2005-2023), split into 803 for training\/validation and 100 healthy test cases. Each test case had 4 independent human segmentations. Inter-reader variability on this healthy-only test set defined the empirical human distribution, providing an upper-bound estimate of robustness. We introduced a Fractional Threshold (FT) metric, measuring the proportion of model predictions exceeding the minimum human performance. Robustness was assessed across models trained from scratch, fine-tuned, or pretrained, including both normal and abnormal cases. An active learning approach identified high-uncertainty predictions for human revision. Statistical comparisons were performed using the Wilcoxon signed-rank and proportions\u00a0<em>Z<\/em>-tests.<\/p>\n<\/div>\n<div class=\" sec\">\n<h6 class=\"title\">Results<\/h6>\n<p class=\"chapter-para\">The best model, a 3-dimensional U-Net trained from scratch, achieved a Dice Similarity Coefficient (DSC) of 0.88\u2009\u00b1\u20090.04 and Normalized Surface Dice (NSD) of 0.77\u2009\u00b1\u20090.09, approaching human-level segmentation (DSC\u2009=\u20090.89\u2009\u00b1\u20090.03; NSD\u2009=\u20090.75\u2009\u00b1\u20090.07). However, FT for DSC and NSD remained lower than human performance in most cases, indicating persistent model variability. Human-in-the-loop revision of acquisition-flagged outliers increased FT to 0.99, with an average time of 1.54\u2009minutes per case, corresponding to a 23-fold workload reduction.<\/p>\n<\/div>\n<div class=\" sec\">\n<h6 class=\"title\">Conclusion<\/h6>\n<p class=\"chapter-para\">Automated pancreas segmentation reduces workload but remains constrained by tail-case failures. Active learning enhances model reliability, bridging the gap between artificial intelligence and human-level performance.<\/p>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>Background Deep learning\u2013based pancreas segmentation in CT has advanced rapidly yet remains evaluated primarily with mean overlap metrics that fail to capture robustness\u2014defined as the proportion of cases reaching human-level performance. Models performing well on mean Dice or surface metrics can still fail unpredictably across scanners or anatomies. Because early detection and quantitative biomarkers rely [&hellip;]<\/p>\n","protected":false},"featured_media":0,"template":"","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","msr-author-ordering":null,"msr_publishername":"","msr_publisher_other":"","msr_booktitle":"","msr_chapter":"","msr_edition":"","msr_editors":"","msr_how_published":"","msr_isbn":"","msr_issue":"","msr_journal":"Radiology","msr_number":"","msr_organization":"","msr_pages_string":"","msr_page_range_start":"","msr_page_range_end":"","msr_series":"","msr_volume":"","msr_copyright":"","msr_conference_name":"","msr_doi":"","msr_arxiv_id":"","msr_s2_paper_id":"","msr_mag_id":"","msr_pubmed_id":"","msr_other_authors":"","msr_other_contributors":"","msr_speaker":"","msr_award":"","msr_affiliation":"","msr_institution":"","msr_host":"","msr_version":"","msr_duration":"","msr_original_fields_of_study":"","msr_release_tracker_id":"","msr_s2_match_type":"","msr_citation_count_updated":"","msr_published_date":"2025-12","msr_highlight_text":"","msr_notes":"","msr_longbiography":"","msr_publicationurl":"","msr_external_url":"","msr_secondary_video_url":"","msr_conference_url":"","msr_journal_url":"","msr_s2_pdf_url":"","msr_year":0,"msr_citation_count":0,"msr_influential_citations":0,"msr_reference_count":0,"msr_s2_match_confidence":0,"msr_microsoftintellectualproperty":true,"msr_s2_open_access":false,"msr_s2_author_ids":[],"msr_pub_ids":[],"msr_hide_image_in_river":null,"footnotes":""},"msr-research-highlight":[],"research-area":[13556,13553],"msr-publication-type":[193715],"msr-publisher":[],"msr-focus-area":[],"msr-locale":[268875],"msr-post-option":[269148,269142],"msr-field-of-study":[],"msr-conference":[],"msr-journal":[],"msr-impact-theme":[],"msr-pillar":[],"class_list":["post-1158576","msr-research-item","type-msr-research-item","status-publish","hentry","msr-research-area-artificial-intelligence","msr-research-area-medical-health-genomics","msr-locale-en_us","msr-post-option-approved-for-river","msr-post-option-include-in-river"],"msr_publishername":"","msr_edition":"","msr_affiliation":"","msr_published_date":"2025-12","msr_host":"","msr_duration":"","msr_version":"","msr_speaker":"","msr_other_contributors":"","msr_booktitle":"","msr_pages_string":"","msr_chapter":"","msr_isbn":"","msr_journal":"Radiology","msr_volume":"","msr_number":"","msr_editors":"","msr_series":"","msr_issue":"","msr_organization":"","msr_how_published":"","msr_notes":"","msr_highlight_text":"","msr_release_tracker_id":"","msr_original_fields_of_study":"","msr_download_urls":"","msr_external_url":"","msr_secondary_video_url":"","msr_longbiography":"","msr_microsoftintellectualproperty":1,"msr_main_download":"","msr_publicationurl":"","msr_doi":"","msr_publication_uploader":[{"type":"url","viewUrl":"false","id":"false","title":"https:\/\/academic.oup.com\/radadv\/article\/2\/6\/umaf040\/8325238","label_id":"243109","label":0},{"type":"url","viewUrl":"false","id":"false","title":"https:\/\/doi.org\/10.1093\/radadv\/umaf040","label_id":"243106","label":0},{"type":"url","viewUrl":"false","id":"false","title":"https:\/\/academic.oup.com\/radadv\/article-pdf\/doi\/10.1093\/radadv\/umaf040\/65356908\/umaf040.pdf","label_id":"243132","label":0}],"msr_related_uploader":"","msr_citation_count":0,"msr_citation_count_updated":"","msr_s2_paper_id":"","msr_influential_citations":0,"msr_reference_count":0,"msr_arxiv_id":"","msr_s2_author_ids":[],"msr_s2_open_access":false,"msr_s2_pdf_url":null,"msr_attachments":[],"msr-author-ordering":[{"type":"user_nicename","value":"Felipe Oviedo","user_id":39925,"rest_url":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=Felipe Oviedo"},{"type":"text","value":"Felipe Lopez&ndash;Ramirez","user_id":0,"rest_url":false},{"type":"text","value":"Florent Tixier","user_id":0,"rest_url":false},{"type":"text","value":"Satomi Kawamoto","user_id":0,"rest_url":false},{"type":"text","value":"Alejandra Blanco","user_id":0,"rest_url":false},{"type":"user_nicename","value":"Rahul Dodhia","user_id":41401,"rest_url":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=Rahul Dodhia"},{"type":"text","value":"Ralph H Hruban","user_id":0,"rest_url":false},{"type":"user_nicename","value":"Bill Weeks","user_id":39582,"rest_url":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=Bill Weeks"},{"type":"user_nicename","value":"Juan M. Lavista Ferres","user_id":39552,"rest_url":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=Juan M. Lavista Ferres"},{"type":"text","value":"Elliot K Fishman","user_id":0,"rest_url":false},{"type":"text","value":"Linda C Chu","user_id":0,"rest_url":false}],"msr_impact_theme":[],"msr_research_lab":[],"msr_event":[],"msr_group":[696544],"msr_project":[778522],"publication":[],"video":[],"msr-tool":[],"msr_publication_type":"article","related_content":{"projects":[{"ID":778522,"post_title":"AI for Health","post_name":"ai-for-health","post_type":"msr-project","post_date":"2023-05-16 14:26:13","post_modified":"2024-10-14 15:42:21","post_status":"publish","permalink":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/project\/ai-for-health\/","post_excerpt":"AI for Health is a philanthropic program launched by Microsoft, which aims to support nonprofits, researchers, and organizations working on global health challenges. The program provides access to artificial intelligence (AI) technology and expertise in three main areas: population health, imaging analytics, genomics &amp; proteomics.","_links":{"self":[{"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/778522"}]}}]},"_links":{"self":[{"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item\/1158576","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item"}],"about":[{"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/types\/msr-research-item"}],"version-history":[{"count":1,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item\/1158576\/revisions"}],"predecessor-version":[{"id":1158577,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item\/1158576\/revisions\/1158577"}],"wp:attachment":[{"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/media?parent=1158576"}],"wp:term":[{"taxonomy":"msr-research-highlight","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-research-highlight?post=1158576"},{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=1158576"},{"taxonomy":"msr-publication-type","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-publication-type?post=1158576"},{"taxonomy":"msr-publisher","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-publisher?post=1158576"},{"taxonomy":"msr-focus-area","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-focus-area?post=1158576"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=1158576"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=1158576"},{"taxonomy":"msr-field-of-study","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-field-of-study?post=1158576"},{"taxonomy":"msr-conference","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-conference?post=1158576"},{"taxonomy":"msr-journal","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-journal?post=1158576"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=1158576"},{"taxonomy":"msr-pillar","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-pillar?post=1158576"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}