{"id":1106850,"date":"2024-11-27T06:00:00","date_gmt":"2024-11-27T14:00:00","guid":{"rendered":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/?p=1106850"},"modified":"2024-12-18T15:21:47","modified_gmt":"2024-12-18T23:21:47","slug":"advances-in-run-time-strategies-for-next-generation-foundation-models","status":"publish","type":"post","link":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/blog\/advances-in-run-time-strategies-for-next-generation-foundation-models\/","title":{"rendered":"Advances in run-time strategies for next-generation foundation models"},"content":{"rendered":"\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1400\" height=\"788\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/11\/MedPrompt-BlogHeroFeature-1400x788-1.png\" alt=\"A visual illustration of Medprompt performance on the MedQA benchmark. Moving from left to right on a horizontal line, the illustration shows how different Medprompt components and additive contributions improve accuracy starting with zero-shot at 81.7 accuracy, to random few-shot at 83.9 accuracy, to random few-shot, chain-of-thought at 87.3 accuracy, to kNN, few-shot, chain-of-thought at 88.4 accuracy, to ensemble with choice shuffle at 90.2 accuracy. 
\" class=\"wp-image-1106865\" srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/11\/MedPrompt-BlogHeroFeature-1400x788-1.png 1400w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/11\/MedPrompt-BlogHeroFeature-1400x788-1-300x169.png 300w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/11\/MedPrompt-BlogHeroFeature-1400x788-1-1024x576.png 1024w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/11\/MedPrompt-BlogHeroFeature-1400x788-1-768x432.png 768w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/11\/MedPrompt-BlogHeroFeature-1400x788-1-1066x600.png 1066w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/11\/MedPrompt-BlogHeroFeature-1400x788-1-655x368.png 655w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/11\/MedPrompt-BlogHeroFeature-1400x788-1-240x135.png 240w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/11\/MedPrompt-BlogHeroFeature-1400x788-1-640x360.png 640w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/11\/MedPrompt-BlogHeroFeature-1400x788-1-960x540.png 960w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/11\/MedPrompt-BlogHeroFeature-1400x788-1-1280x720.png 1280w\" sizes=\"auto, (max-width: 1400px) 100vw, 1400px\" \/><\/figure>\n\n\n\n<p>Groundbreaking advancements in frontier language models are progressing rapidly, paving the way for boosts in accuracy and reliability of generalist models, making them highly effective in specialized domains. 
As part of our ongoing exploration of foundation model capabilities, we developed <a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/blog\/the-power-of-prompting\/?msockid=06148b71b49b652837109fc1b5b66432\" target=\"_blank\" rel=\"noreferrer noopener\">Medprompt<\/a> last year\u2014a novel approach to maximize model performance on specialized domains and tasks without fine-tuning. By leveraging multiphase prompting, Medprompt optimizes inference by identifying the most effective chain-of-thought (CoT) examples at run time and drawing on multiple calls to refine output. When deployed with GPT-4, Medprompt achieved an impressive 90.2% accuracy on the MedQA benchmark (USMLE-style), outperforming all other methods at the time.&nbsp;<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1400\" height=\"788\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/11\/llm_timeline1-Fig1-1400x788-1.jpg\" alt=\"A line chart that plots the MedQA test accuracy (y-axis) over time (x-axis).  \n\nOpenAI o1-preview model achieves the highest result at 96.0% accuracy followed by Med-Gemini at 91.1%; GPT-4 (Medprompt) at 90.2%; Med PaLM 2 at 86.5; GPT-4 base at 86.1; Med PaLM at 67.2; GPT-3.5 base at 60.2, BioMedLM at 50.3; DRAGON at 47.5; BioLinkBERT at 45.1; PubMedBERT at 38.1.  
\" class=\"wp-image-1107318\" srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/11\/llm_timeline1-Fig1-1400x788-1.jpg 1400w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/11\/llm_timeline1-Fig1-1400x788-1-300x169.jpg 300w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/11\/llm_timeline1-Fig1-1400x788-1-1024x576.jpg 1024w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/11\/llm_timeline1-Fig1-1400x788-1-768x432.jpg 768w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/11\/llm_timeline1-Fig1-1400x788-1-1066x600.jpg 1066w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/11\/llm_timeline1-Fig1-1400x788-1-655x368.jpg 655w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/11\/llm_timeline1-Fig1-1400x788-1-240x135.jpg 240w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/11\/llm_timeline1-Fig1-1400x788-1-640x360.jpg 640w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/11\/llm_timeline1-Fig1-1400x788-1-960x540.jpg 960w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/11\/llm_timeline1-Fig1-1400x788-1-1280x720.jpg 1280w\" sizes=\"auto, (max-width: 1400px) 100vw, 1400px\" \/><figcaption class=\"wp-element-caption\">Figure 1. Comparative analyses of performance of multiple models on MedQA.<\/figcaption><\/figure>\n\n\n\n<p>Less than a year later, our tests show that the OpenAI o1-preview model outperformed Medprompt, reaching 96% on the same benchmark (Figure 1)\u2014without using sophisticated prompt guidance and control. 
This advancement is driven by the model\u2019s integration of run-time strategies at its core, enabling state-of-the-art results on medical licensing exams in the United States and Japan, medical subsets of the Massive Multitask Language Understanding (MMLU) benchmark, and nursing exams (NCLEX) as shown in Figure 2.&nbsp;<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1400\" height=\"788\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/11\/main_radar_70_1011-Fig2-1400x788-1.jpg\" alt=\"A spider web chart plotting the performance of OpenAI o1-preview (0 shot ensemble) compared to GPT-4 (Medprompt) and GPT-4 (5 shot) model performance on medical challenge problems. o1-preview achieves state-of-the-art results on MedQA US (4-option), JMLE-2024, MedMCQA Dev, MMLU Anatomy, MMLU Medical Genetics, MMLU Professional Medicine, MMLU College Biology, and MMLU College Medicine, and NCLEX. GPT-4 (Medprompt) performed better than OpenAI o1-preview (0 shot ensemble) on MMLU Clinical Knowledge\" class=\"wp-image-1107330\" srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/11\/main_radar_70_1011-Fig2-1400x788-1.jpg 1400w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/11\/main_radar_70_1011-Fig2-1400x788-1-300x169.jpg 300w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/11\/main_radar_70_1011-Fig2-1400x788-1-1024x576.jpg 1024w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/11\/main_radar_70_1011-Fig2-1400x788-1-768x432.jpg 768w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/11\/main_radar_70_1011-Fig2-1400x788-1-1066x600.jpg 1066w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/11\/main_radar_70_1011-Fig2-1400x788-1-655x368.jpg 655w, 
https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/11\/main_radar_70_1011-Fig2-1400x788-1-240x135.jpg 240w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/11\/main_radar_70_1011-Fig2-1400x788-1-640x360.jpg 640w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/11\/main_radar_70_1011-Fig2-1400x788-1-960x540.jpg 960w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/11\/main_radar_70_1011-Fig2-1400x788-1-1280x720.jpg 1280w\" sizes=\"auto, (max-width: 1400px) 100vw, 1400px\" \/><figcaption class=\"wp-element-caption\">Figure 2. Comparisons on a wide range of medical challenge benchmarks.<\/figcaption><\/figure>\n\n\n\n<p>These results are notable, prompting us to publish our recent study, findings, and analyses, <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/arxiv.org\/abs\/2411.03590\" target=\"_blank\" rel=\"noopener noreferrer\"><em>From Medprompt to o1: Exploration of Run-Time Strategies for Medical Challenge Problems and Beyond<\/em><span class=\"sr-only\"> (opens in new tab)<\/span><\/a>. But the numbers are only part of the story. In this blog, we discuss prompting strategies to make the most of o1-preview models and other factors to consider as well as directions forward for run-time strategies.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"is-o1-preview-just-fancy-prompting\">Is o1-preview \u201cjust\u201d fancy prompting?&nbsp;<\/h2>\n\n\n\n<p>The introduction of the OpenAI o1 model series marks a significant shift from prior GPT models. Unlike GPT, o1 models are trained using reinforcement learning (RL) techniques that enable them to \u201cthink\u201d before generating outputs. While Medprompt relies on a cascade of operations with GPT-4 at run time guided by a multistage prompt, the o1 series incorporates this run-time reasoning directly into its RL-based design. 
This built-in functionality enables the o1 models to significantly outperform even the best results achieved with GPT-4 and Medprompt. These performance gains come with a notable tradeoff: o1-preview\u2019s per-token cost was approximately six times that of GPT-4o at the time of our evaluation. While the results for GPT-4o with Medprompt fall short of o1-preview model performance, the combination offers a more cost-effective alternative. The cost-benefit tradeoffs are highlighted in Figure 3, with the x-axis presented on a logarithmic scale.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1400\" height=\"788\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/11\/cost_frontier1-Fig-3-1400x788-1.jpg\" alt=\"A line chart plotting accuracy on the MedQA Test (y-axis) versus total cost on a logarithmic scale (x-axis). OpenAI o1-preview using 5x, 10x, and 15x Ensemble hover around 1000 total cost. OpenAI o1-preview using Tailored Prompt, Minimal Prompt, Few-shot, kNN Few-shot are around 100 total cost. GPT-4o with Medprompt is below 100; kNN Few-shot CoT, Few-shot CoT, and Few-Shot are at 10; Zero-shot is at 1. GPT-4-Turbo with Medprompt is at 200; kNN Few-shot CoT, Few-shot CoT, and Few-Shot hover near 50, Zero-shot is near 5. 
\" class=\"wp-image-1107345\" srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/11\/cost_frontier1-Fig-3-1400x788-1.jpg 1400w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/11\/cost_frontier1-Fig-3-1400x788-1-300x169.jpg 300w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/11\/cost_frontier1-Fig-3-1400x788-1-1024x576.jpg 1024w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/11\/cost_frontier1-Fig-3-1400x788-1-768x432.jpg 768w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/11\/cost_frontier1-Fig-3-1400x788-1-1066x600.jpg 1066w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/11\/cost_frontier1-Fig-3-1400x788-1-655x368.jpg 655w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/11\/cost_frontier1-Fig-3-1400x788-1-240x135.jpg 240w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/11\/cost_frontier1-Fig-3-1400x788-1-640x360.jpg 640w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/11\/cost_frontier1-Fig-3-1400x788-1-960x540.jpg 960w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/11\/cost_frontier1-Fig-3-1400x788-1-1280x720.jpg 1280w\" sizes=\"auto, (max-width: 1400px) 100vw, 1400px\" \/><figcaption class=\"wp-element-caption\">Figure 3. Pareto frontier showing accuracy versus total API cost (log scale) on the MedQA benchmark (1273 questions total). o1-preview (Sep 2024) is compared with GPT-4o (Aug 2024) and GPT-4 Turbo (Nov 2023).<\/figcaption><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"can-we-prompt-engineer-o1-preview\">Can we prompt engineer o1-preview?<\/h2>\n\n\n\n<p>The o1-preview model exhibits distinct run-time behaviors compared to the GPT series. 
While some of our more dynamic prompting strategies performed better than expected with o1-preview models, our most tried-and-true strategy was anything but consistent throughout our evaluation. Figure 4 captures specific performance results for Tailored Prompt, Ensembling, and Few-Shot Prompting on o1-preview. Here\u2019s a summary of our findings:&nbsp;<\/p>\n\n\n\n<ol start=\"1\" class=\"wp-block-list\">\n<li><strong>Tailored Prompt<\/strong>: While minimal prompting\u2014like a brief, one-sentence description followed by a question\u2014offered strong baseline performance, detailed task descriptions were best for eliciting accurate responses.<\/li>\n\n\n\n<li><strong>Ensembling<\/strong>: Generating multiple answers per question and using majority voting across different reasoning paths boosted reliability, while shuffling the answer choices across runs produced richer reasoning chains and improved outcomes. Ensembling continues to yield consistent performance improvements.<\/li>\n\n\n\n<li><strong>Few-Shot Prompting<\/strong>: Guiding the model with a few examples produced inconsistent results and, on average, decreased performance, unlike with GPT models, where few-shot examples helped.<\/li>\n<\/ol>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1400\" height=\"699\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/11\/experiment_slope_chart_nomedmcqa2-Fig4-1400x788-1.jpg\" alt=\"Three charts show the accuracy of o1-preview when combined with Tailored Prompt, Ensemble, and 5-shot KNN based on an average baseline of medical benchmarks. Tailored Prompt improves accuracy from 94.2 to 94.7; Ensemble (15x) improves accuracy from 94.2 to 95.5; 5-shot KNN decreases accuracy from 94.2 to 93.7.  
\" class=\"wp-image-1107585\" style=\"object-fit:cover\" srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/11\/experiment_slope_chart_nomedmcqa2-Fig4-1400x788-1.jpg 1400w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/11\/experiment_slope_chart_nomedmcqa2-Fig4-1400x788-1-300x150.jpg 300w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/11\/experiment_slope_chart_nomedmcqa2-Fig4-1400x788-1-1024x511.jpg 1024w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/11\/experiment_slope_chart_nomedmcqa2-Fig4-1400x788-1-768x383.jpg 768w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/11\/experiment_slope_chart_nomedmcqa2-Fig4-1400x788-1-240x120.jpg 240w\" sizes=\"auto, (max-width: 1400px) 100vw, 1400px\" \/><figcaption class=\"wp-element-caption\">Figure 4. Tests of different prompting strategies across benchmark datasets.<\/figcaption><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"do-results-stand-in-another-language\">Do results stand in another language?&nbsp;<\/h2>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1400\" height=\"788\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/11\/jmle-2024-bar1-BlogHeroFeature-1400x788-1.jpg\" alt=\"A chart with two bar charts measuring the accuracy (y-axis) by short and long questions (x-axis) on the Japanese Medical Licensing Examination. The short question bar is slightly higher than the long question ratio for o1-preview (0-shot ensemble). The short question bar is about two points less accurate than the long question bar for o1-preview (0-shot). 
The short answer bar is a point more accurate than the long question bar for GPT-4o (Medprompt). The short question bar is one point less accurate than the long question bar for GPT-4o (0 shot). \" class=\"wp-image-1107453\" srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/11\/jmle-2024-bar1-BlogHeroFeature-1400x788-1.jpg 1400w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/11\/jmle-2024-bar1-BlogHeroFeature-1400x788-1-300x169.jpg 300w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/11\/jmle-2024-bar1-BlogHeroFeature-1400x788-1-1024x576.jpg 1024w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/11\/jmle-2024-bar1-BlogHeroFeature-1400x788-1-768x432.jpg 768w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/11\/jmle-2024-bar1-BlogHeroFeature-1400x788-1-1066x600.jpg 1066w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/11\/jmle-2024-bar1-BlogHeroFeature-1400x788-1-655x368.jpg 655w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/11\/jmle-2024-bar1-BlogHeroFeature-1400x788-1-240x135.jpg 240w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/11\/jmle-2024-bar1-BlogHeroFeature-1400x788-1-640x360.jpg 640w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/11\/jmle-2024-bar1-BlogHeroFeature-1400x788-1-960x540.jpg 960w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/11\/jmle-2024-bar1-BlogHeroFeature-1400x788-1-1280x720.jpg 1280w\" sizes=\"auto, (max-width: 1400px) 100vw, 1400px\" \/><figcaption class=\"wp-element-caption\">Figure 5. 
JMLE-2024: National medical licensing exam held in Japan (Feb 2024).<\/figcaption><\/figure>\n\n\n\n<p>We expanded our research to include a new multilingual benchmark based on the Japanese national medical licensing exam.&nbsp;The JMLE (Japanese Medical Licensing Examination) is written in Japanese and was administered in February 2024, after the o1-preview model\u2019s knowledge cutoff. <em>Even without translation to English, the o1-preview model achieved a remarkable score of 98.2% accuracy (Figure 5), well above the exam\u2019s minimum passing score of approximately 80%.&nbsp;<\/em>&nbsp;<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"do-reasoning-tokens-improve-performance\">Do reasoning tokens improve performance?&nbsp;<\/h2>\n\n\n\n<p>For fun, we conducted tests to determine whether increasing the number of reasoning tokens could improve performance. We found that by adjusting the prompt, we could consistently increase the number of reasoning tokens used by o1-preview, and on most benchmarks the increase was associated with improved accuracy, as shown in Figure 6.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1400\" height=\"788\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/11\/reasoning_token_slope_chart2-Fig6-1400x788-1.jpg\" alt=\"A chart plotting the impact of reasoning tokens on accuracy. JMLE achieved 95.3% accuracy for Quick Response Prompt and 96.7% accuracy for Extended Reasoning Prompt. MMLU achieved 94.9% accuracy for Quick Response Prompt and 94.7% accuracy for Extended Reasoning Prompt. MedQA achieved 94.3% accuracy for Quick Response Prompt and 95.1% accuracy for Extended Reasoning Prompt. USMLE Sample Exam achieved 92.6% accuracy for Quick Response Prompt and 93.1% accuracy for Extended Reasoning Prompt. 
USMLE Self Assessment achieved 91.8% accuracy for Quick Response Prompt and 92.2% accuracy for Extended Reasoning Prompt. \" class=\"wp-image-1107408\" srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/11\/reasoning_token_slope_chart2-Fig6-1400x788-1.jpg 1400w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/11\/reasoning_token_slope_chart2-Fig6-1400x788-1-300x169.jpg 300w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/11\/reasoning_token_slope_chart2-Fig6-1400x788-1-1024x576.jpg 1024w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/11\/reasoning_token_slope_chart2-Fig6-1400x788-1-768x432.jpg 768w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/11\/reasoning_token_slope_chart2-Fig6-1400x788-1-1066x600.jpg 1066w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/11\/reasoning_token_slope_chart2-Fig6-1400x788-1-655x368.jpg 655w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/11\/reasoning_token_slope_chart2-Fig6-1400x788-1-240x135.jpg 240w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/11\/reasoning_token_slope_chart2-Fig6-1400x788-1-640x360.jpg 640w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/11\/reasoning_token_slope_chart2-Fig6-1400x788-1-960x540.jpg 960w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/11\/reasoning_token_slope_chart2-Fig6-1400x788-1-1280x720.jpg 1280w\" sizes=\"auto, (max-width: 1400px) 100vw, 1400px\" \/><figcaption class=\"wp-element-caption\">Figure 6. 
The effect of two prompting strategies that elicit variable-length reasoning chains across benchmark datasets.<\/figcaption><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"what-s-the-takeaway\">What\u2019s the takeaway?&nbsp;<\/h2>\n\n\n\n<p>Bottom line: There\u2019s a little something for everyone when it comes to run-time strategies. We\u2019re excited by the performance gains from GPT models to o1-preview models. While these improvements are significant, so is the cost. For those needing proven accuracy on a budget, Medprompt leveraging calls to GPT-4 is a viable option for medicine and beyond. To help determine the best option, we summarize the relative performance of prompting strategies in Figure 7; you can also <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/arxiv.org\/abs\/2411.03590\" target=\"_blank\" rel=\"noopener noreferrer\">check out the paper for a detailed breakdown of every dataset, experimental configuration, and prompt template<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>.&nbsp;<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1400\" height=\"788\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/11\/performance_heatmap2-Fig7-1400x788-1.jpg\" alt=\"A matrix that shows the relative performance of prompting strategies over baseline medical benchmarks. The top row from left to right are the results for baseline numbers: JMLE = 95.6%; MMLU = 94.6%; MedMCQA = 81.4%; MedQA = 94.9%; USMLE Sample Exam = 94.0%; USMLE Self Assessment = 91.8%. The second row from left to right, 5-shot Random baseline difference: JMLE = +1.2%; MMLU = -1.1%; MedMCQA = 0.0%; MedQA = -1.4%; USMLE Sample Exam = -0.4%; USMLE Self Assessment = -1.0%. 
The third row from left to right, 5-shot KNN baseline difference: JMLE = +0.6%; MMLU = -0.1%; MedMCQA = +1.2%; MedQA = -2.2%; USMLE Sample Exam = -0.3%; USMLE Self Assessment = -0.6%. The fourth row from left to right, Bootstrap Ensemble (5x) baseline difference: JMLE = +1.5%; MMLU = +0.1%; MedMCQA = +1.3%; MedQA = +0.7%; USMLE Sample Exam = +1.3%; USMLE Self Assessment = +1.0%. The fifth row from left to right, Bootstrap Ensemble (10x) baseline difference: JMLE = +1.4%; MMLU = +0.6%; MedMCQA = +1.5%; MedQA = +0.7%; USMLE Sample Exam = +1.3%; USMLE Self Assessment = +1.1%. The sixth row from left to right, Ensemble (15x) baseline difference: JMLE = +1.5%; MMLU = +0.6%; MedMCQA = +2.0%; MedQA = +1.1%; USMLE Sample Exam = +2.0%; USMLE Self Assessment = +1.3%. The seventh row from left to right, Tailored Prompt baseline difference: JMLE = +1.6%; MMLU = +0.4%; MedMCQA = +0.9%; MedQA = +0.2%; USMLE Sample Exam = +0.0%; USMLE Self Assessment = +0.4%. The eighth row from left to right, Tailored Bootstrap Ensemble (5x) baseline difference: JMLE = +2.2%; MMLU = +0.7%; MedMCQA = +1.8%; MedQA = +0.8%; USMLE Sample Exam = +0.9%; USMLE Self Assessment = +1.1%. The ninth row from left to right, Tailored Bootstrap Ensemble (10x) baseline difference: JMLE = +2.3%; MMLU = +0.7%; MedMCQA = +2.1%; MedQA = +0.9%; USMLE Sample Exam = +0.9%; USMLE Self Assessment = +1.2%. The tenth row from left to right, Tailored Ensemble (15x) baseline difference: JMLE = +2.5%; MMLU = +0.4%; MedMCQA = +2.6%; MedQA = +1.1%; USMLE Sample Exam = +0.9%; USMLE Self Assessment = +1.4%.   
\" class=\"wp-image-1107396\" srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/11\/performance_heatmap2-Fig7-1400x788-1.jpg 1400w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/11\/performance_heatmap2-Fig7-1400x788-1-300x169.jpg 300w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/11\/performance_heatmap2-Fig7-1400x788-1-1024x576.jpg 1024w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/11\/performance_heatmap2-Fig7-1400x788-1-768x432.jpg 768w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/11\/performance_heatmap2-Fig7-1400x788-1-1066x600.jpg 1066w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/11\/performance_heatmap2-Fig7-1400x788-1-655x368.jpg 655w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/11\/performance_heatmap2-Fig7-1400x788-1-240x135.jpg 240w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/11\/performance_heatmap2-Fig7-1400x788-1-640x360.jpg 640w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/11\/performance_heatmap2-Fig7-1400x788-1-960x540.jpg 960w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/11\/performance_heatmap2-Fig7-1400x788-1-1280x720.jpg 1280w\" sizes=\"auto, (max-width: 1400px) 100vw, 1400px\" \/><figcaption class=\"wp-element-caption\">Figure 7. Heatmap showing absolute accuracy and relative performance over baseline zero-shot prompt (in parentheses) across all benchmark datasets.<\/figcaption><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"anything-more-to-consider\">Anything more to consider?<\/h2>\n\n\n\n<p>We highlighted several considerations in the paper that are worth checking out. Here are three opportunities that are top of mind:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><em>Research on run-time strategies<\/em>. 
The research community has largely relied on boosting model capabilities with data, compute, and model size, predictably achieving gains by way of scaling laws. A promising new direction is inference-time scaling: investing in additional computation and machinery for guiding inference at run time. In the paper, we highlight opportunities to guide run-time allocations to boost efficiency, accuracy, and intellectual capabilities, including meta reasoning and reflection in real time and learning <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/erichorvitz.com\/cc_aij_horvitz.pdf\">during the \u201cidle\u201d time<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> between problem-solving sessions. We see a great deal of opportunity for new research and development on real-time and \u201coffline\u201d reasoning, learning, and reflection.<\/li>\n\n\n\n<li><em>Benchmark saturation<\/em>. With the rapid advancement of state-of-the-art models, many existing medical benchmarks are reaching \u201csaturation,\u201d where models perform extremely well on long-standing medical competency challenges that were considered extremely difficult just a few years ago. Current benchmarks, such as USMLE and JMLE, were designed to assess the performance of medical students and clinicians and are increasingly inadequate for evaluating cutting-edge AI models. To deepen our understanding of models and guide research, we need to design more challenging medical benchmarks.<\/li>\n\n\n\n<li><em>From benchmarks to clinical applications.<\/em> We note that, while benchmarks offer valuable insights into performance and accuracy, they often fail to capture the complexities and nuances of real-world clinical decision making and healthcare delivery more broadly. 
Conducting clinical trials to rigorously evaluate the impact of AI applications on patient care poses far greater difficulties than benchmarking models against challenge problems drawn from medical competency exams. Yet, studies of AI deployments in realistic clinical settings are essential for understanding model capabilities and for guiding the effective integration of AI into healthcare.<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"resources\">Resources&nbsp;<\/h2>\n\n\n\n<ul class=\"wp-block-list\">\n<li><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/2411.03590\">From Medprompt to o1: Exploration of Run-Time Strategies for Medical Challenge Problems and Beyond<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/publication\/can-generalist-foundation-models-outcompete-special-purpose-tuning-case-study-in-medicine\/\">Can Generalist Foundation Models Outcompete Special-Purpose Tuning? 
Case Study in Medicine<\/a>&nbsp;<\/li>\n\n\n\n<li><a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/blog\/the-power-of-prompting\/\">The Power of Prompting<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/blog\/steering-at-the-frontier-extending-the-power-of-prompting\/\">Steering at the Frontier: Extending the Power of Prompting<\/a><\/li>\n\n\n\n<li><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/arxiv.org\/abs\/2303.13375\" target=\"_blank\" rel=\"noopener noreferrer\">Capabilities of GPT-4 on Medical Challenge Problems<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>Discover the most effective run-time strategies on the OpenAI o1-preview model, improving accuracy in medical language tasks.<\/p>\n","protected":false},"author":43518,"featured_media":1106865,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","msr-author-ordering":[{"type":"user_nicename","value":"Eric Horvitz","user_id":"32033"},{"type":"user_nicename","value":"Harsha Nori","user_id":"41461"},{"type":"user_nicename","value":"Naoto 
Usuyama","user_id":"38670"}],"msr_hide_image_in_river":null,"footnotes":""},"categories":[1],"tags":[],"research-area":[13556,13553],"msr-region":[],"msr-event-type":[],"msr-locale":[268875],"msr-post-option":[269148,243984,269142],"msr-impact-theme":[],"msr-promo-type":[],"msr-podcast-series":[],"class_list":["post-1106850","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-research-blog","msr-research-area-artificial-intelligence","msr-research-area-medical-health-genomics","msr-locale-en_us","msr-post-option-approved-for-river","msr-post-option-blog-homepage-featured","msr-post-option-include-in-river"],"msr_event_details":{"start":"","end":"","location":""},"podcast_url":"","podcast_episode":"","msr_research_lab":[],"msr_impact_theme":[],"related-publications":[],"related-downloads":[],"related-videos":[],"related-academic-programs":[],"related-groups":[],"related-projects":[],"related-events":[],"related-researchers":[{"type":"user_nicename","value":"Eric Horvitz","user_id":32033,"display_name":"Eric Horvitz","author_link":"<a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/people\/horvitz\/\" aria-label=\"Visit the profile page for Eric Horvitz\">Eric Horvitz<\/a>","is_active":false,"last_first":"Horvitz, Eric","people_section":0,"alias":"horvitz"},{"type":"user_nicename","value":"Harsha Nori","user_id":41461,"display_name":"Harsha Nori","author_link":"<a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/people\/hanori\/\" aria-label=\"Visit the profile page for Harsha Nori\">Harsha Nori<\/a>","is_active":false,"last_first":"Nori, Harsha","people_section":0,"alias":"hanori"},{"type":"user_nicename","value":"Naoto Usuyama","user_id":38670,"display_name":"Naoto Usuyama","author_link":"<a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/people\/naotous\/\" aria-label=\"Visit the profile page for Naoto Usuyama\">Naoto Usuyama<\/a>","is_active":false,"last_first":"Usuyama, 
Naoto","people_section":0,"alias":"naotous"}],"msr_type":"Post","featured_image_thumbnail":"<img width=\"960\" height=\"540\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/11\/MedPrompt-BlogHeroFeature-1400x788-1-960x540.png\" class=\"img-object-cover\" alt=\"A visual illustration of Medprompt performance on the MedQA benchmark. Moving from left to right on a horizontal line, the illustration shows how different Medprompt components and additive contributions improve accuracy starting with zero-shot at 81.7 accuracy, to random few-shot at 83.9 accuracy, to random few-shot, chain-of-thought at 87.3 accuracy, to kNN, few-shot, chain-of-thought at 88.4 accuracy, to ensemble with choice shuffle at 90.2 accuracy.\" decoding=\"async\" loading=\"lazy\" srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/11\/MedPrompt-BlogHeroFeature-1400x788-1-960x540.png 960w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/11\/MedPrompt-BlogHeroFeature-1400x788-1-300x169.png 300w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/11\/MedPrompt-BlogHeroFeature-1400x788-1-1024x576.png 1024w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/11\/MedPrompt-BlogHeroFeature-1400x788-1-768x432.png 768w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/11\/MedPrompt-BlogHeroFeature-1400x788-1-1066x600.png 1066w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/11\/MedPrompt-BlogHeroFeature-1400x788-1-655x368.png 655w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/11\/MedPrompt-BlogHeroFeature-1400x788-1-240x135.png 240w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/11\/MedPrompt-BlogHeroFeature-1400x788-1-640x360.png 640w, 
https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/11\/MedPrompt-BlogHeroFeature-1400x788-1-1280x720.png 1280w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/11\/MedPrompt-BlogHeroFeature-1400x788-1.png 1400w\" sizes=\"auto, (max-width: 960px) 100vw, 960px\" \/>","byline":"<a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/people\/horvitz\/\" title=\"Go to researcher profile for Eric Horvitz\" aria-label=\"Go to researcher profile for Eric Horvitz\" data-bi-type=\"byline author\" data-bi-cN=\"Eric Horvitz\">Eric Horvitz<\/a>, <a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/people\/hanori\/\" title=\"Go to researcher profile for Harsha Nori\" aria-label=\"Go to researcher profile for Harsha Nori\" data-bi-type=\"byline author\" data-bi-cN=\"Harsha Nori\">Harsha Nori<\/a>, and <a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/people\/naotous\/\" title=\"Go to researcher profile for Naoto Usuyama\" aria-label=\"Go to researcher profile for Naoto Usuyama\" data-bi-type=\"byline author\" data-bi-cN=\"Naoto Usuyama\">Naoto Usuyama<\/a>","formattedDate":"November 27, 2024","formattedExcerpt":"Discover the most effective run-time strategies on the OpenAI o1-preview model, improving accuracy in medical language 
tasks.","locale":{"slug":"en_us","name":"English","native":"","english":"English"},"_links":{"self":[{"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/posts\/1106850","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/users\/43518"}],"replies":[{"embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/comments?post=1106850"}],"version-history":[{"count":42,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/posts\/1106850\/revisions"}],"predecessor-version":[{"id":1107831,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/posts\/1106850\/revisions\/1107831"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/media\/1106865"}],"wp:attachment":[{"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/media?parent=1106850"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/categories?post=1106850"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/tags?post=1106850"},{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=1106850"},{"taxonomy":"msr-region","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-region?post=1106850"},{"taxonomy":"msr-event-type","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-event-type?post=1106850"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/cm-e
dgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=1106850"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=1106850"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=1106850"},{"taxonomy":"msr-promo-type","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-promo-type?post=1106850"},{"taxonomy":"msr-podcast-series","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-podcast-series?post=1106850"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}