{"id":621516,"date":"2019-11-26T09:00:09","date_gmt":"2019-11-26T17:00:09","guid":{"rendered":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/?p=621516"},"modified":"2019-11-26T07:59:44","modified_gmt":"2019-11-26T15:59:44","slug":"optimistic-actor-critic-avoids-the-pitfalls-of-greedy-exploration-in-reinforcement-learning","status":"publish","type":"post","link":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/blog\/optimistic-actor-critic-avoids-the-pitfalls-of-greedy-exploration-in-reinforcement-learning\/","title":{"rendered":"Optimistic Actor Critic avoids the pitfalls of greedy exploration in reinforcement learning"},"content":{"rendered":"<p><a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2019\/11\/MSResearch_20191113_NeurIPS_BetterExpolorationWithOptimisticActorCritic_1400x788-5dd5c8a627822.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-622344\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2019\/11\/MSResearch_20191113_NeurIPS_BetterExpolorationWithOptimisticActorCritic_1400x788-5dd5c8a627822.png\" alt=\"Graphic showing optimistic actor critic\" width=\"1400\" height=\"789\" srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2019\/11\/MSResearch_20191113_NeurIPS_BetterExpolorationWithOptimisticActorCritic_1400x788-5dd5c8a627822.png 1400w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2019\/11\/MSResearch_20191113_NeurIPS_BetterExpolorationWithOptimisticActorCritic_1400x788-5dd5c8a627822-300x169.png 300w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2019\/11\/MSResearch_20191113_NeurIPS_BetterExpolorationWithOptimisticActorCritic_1400x788-5dd5c8a627822-1024x577.png 1024w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2019\/11\/MSResearch_20191113_NeurIPS_BetterExpolorationWithOptimisticActorCritic_1400x788-5dd5c8a627822-768x433.png 768w, 
https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2019\/11\/MSResearch_20191113_NeurIPS_BetterExpolorationWithOptimisticActorCritic_1400x788-5dd5c8a627822-1066x600.png 1066w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2019\/11\/MSResearch_20191113_NeurIPS_BetterExpolorationWithOptimisticActorCritic_1400x788-5dd5c8a627822-655x368.png 655w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2019\/11\/MSResearch_20191113_NeurIPS_BetterExpolorationWithOptimisticActorCritic_1400x788-5dd5c8a627822-343x193.png 343w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2019\/11\/MSResearch_20191113_NeurIPS_BetterExpolorationWithOptimisticActorCritic_1400x788-5dd5c8a627822-640x360.png 640w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2019\/11\/MSResearch_20191113_NeurIPS_BetterExpolorationWithOptimisticActorCritic_1400x788-5dd5c8a627822-960x540.png 960w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2019\/11\/MSResearch_20191113_NeurIPS_BetterExpolorationWithOptimisticActorCritic_1400x788-5dd5c8a627822-1280x720.png 1280w\" sizes=\"auto, (max-width: 1400px) 100vw, 1400px\" \/><\/a><\/p>\n<p>One of the core directions of <a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/project\/project-malmo\/\">Project Malmo<\/a> is to develop AI capable of rich interactions. Whether that means learning new skills to apply to challenging problems, understanding complex environments, or knowing when to enlist the help of humans, reinforcement learning (RL) is a core enabling technology for building these types of AI. 
To perform RL well, agents need to explore efficiently, which means understanding when to try new things and how to assess future outcomes.<\/p>\n<p>Similar to the human experience of exploration, exploration in RL means taking calculated risks, striking a balance between overestimating and underestimating potential outcomes. Current RL methods are often sample-inefficient: they can need millions of environment interactions to learn a policy. In particular, modern actor-critic methods present some challenges that need to be addressed.<\/p>\n<p>In our paper accepted at <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/neurips.cc\/Conferences\/2019\" target=\"_blank\" rel=\"noopener noreferrer\">the thirty-third Conference on Neural Information Processing Systems<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> (NeurIPS 2019), \u201c<a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/publication\/better-exploration-with-optimistic-actor-critic\/\">Better Exploration with Optimistic Actor Critic<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>,\u201d <a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/people\/kaciosek\/\">Kamil Ciosek<\/a> (Microsoft Research Cambridge), together with summer Microsoft Research intern <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/quanvuong.github.io\/\" target=\"_blank\" rel=\"noopener noreferrer\">Quan Vuong<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> (University of California, San Diego), as well as Robert Loftin and <a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/people\/kahofman\/\">Katja Hofmann<\/a> (Microsoft Research Cambridge), present a new 
actor-critic algorithm that uses an upper bound on the critic to mitigate the problems caused by greedy exploration in RL, leading to more efficient learning.<\/p>\n<h3>Modern actor-critic methods use an approximate lower bound to estimate the critic<\/h3>\n<p>Actor-critic methods in RL use two components: an actor and a critic. The policy is represented by a neural network called the actor. To obtain updates to the actor, we need to compute a critic, which estimates the expected return of taking a given action in a given state. One problem is that, in practice, the critic is a neural network trained on a small amount of off-policy data and is often simply wrong. Since the consequences of underestimating the true critic value are easier to bear than those of overestimating it, the usual fix is to use an approximate lower bound of the critic. In practice, modern actor-critic methods just estimate the critic twice and take the minimum of the two estimates whenever a value is needed.<\/p>\n<p>Actor-critic methods use the actor for two purposes: it represents the current best guess of the optimal policy, and it is used for exploration. Here lies another problem: greedy exploration turns out to be harmful. An exploration policy obtained by greedily maximizing the lower bound explores inefficiently. 
In our paper, we explore two reasons why this happens: <em>pessimistic underexploration<\/em> and <em>directional uninformedness<\/em>.<\/p>\n<h3>Understanding the problems with using greedy exploration<\/h3>\n<div id=\"attachment_621882\" style=\"width: 778px\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2019\/11\/pessimistic-underexploration.png\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-621882\" class=\"wp-image-621882\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2019\/11\/pessimistic-underexploration.png\" alt=\"Figure 1: Pessimistic underexploration and directional uninformedness are two key reasons why greedy exploration falls short. \" width=\"768\" height=\"276\" srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2019\/11\/pessimistic-underexploration.png 1024w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2019\/11\/pessimistic-underexploration-300x108.png 300w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2019\/11\/pessimistic-underexploration-768x276.png 768w\" sizes=\"auto, (max-width: 768px) 100vw, 768px\" \/><\/a><p id=\"caption-attachment-621882\" class=\"wp-caption-text\">Figure 1: Pessimistic underexploration and directional uninformedness are two key reasons why greedy exploration falls short.<\/p><\/div>\n<p>Pessimistic underexploration is shown in Figure 1a. By greedily maximizing the lower bound, the policy becomes very concentrated near a maximum. When the critic is inaccurate, the maximum is often spurious. In other words, the true critic, represented with a black line in Figure 1a, does not have a maximum at the same point.<\/p>\n<p>This can be very harmful. At first, the agent explores with a broad policy, denoted as \u03c0<sub>past<\/sub>. 
Since the lower bound increases to the left, the policy gradually moves in that direction, becoming \u03c0<sub>current<\/sub>. Because the lower bound (shown in red) has a maximum at the mean of \u03c0<sub>current<\/sub>, the policy \u03c0<sub>current<\/sub> has a small standard deviation. This is suboptimal: the agent needs to sample actions from the far left-hand region to discover the mismatch between the critic lower bound and the true critic.<\/p>\n<p>Directional uninformedness is shown in Figure 1b. For efficiency reasons, most actor-critic models for continuous action spaces use Gaussian policies. This means that actions in opposite directions from the mean are sampled with equal probability. However, in a policy gradient algorithm, the current policy will have been obtained by incremental updates, which means that it won&#8217;t be very different from recent past policies.<\/p>\n<p>Therefore, exploration in both directions is wasteful since the parts of the action space where past policies had high density are likely to have already been explored. Figure 1b illustrates this. Since the policy \u03c0<sub>current<\/sub> is Gaussian and symmetric around the mean, it is equally likely to sample actions to the left and to the right. However, while sampling to the left would be useful for learning an improved critic, sampling to the right is wasteful since the critic estimate in that part of the action space is already good enough.<\/p>\n<h3>Optimistic Actor Critic explores better with the upper bound<\/h3>\n<p>Optimistic Actor Critic (OAC) applies the principle of <em>optimism in the face of uncertainty<\/em>, optimizing an upper bound rather than a lower bound to obtain the exploration policy. Formally, the exploration policy is defined by the formula below. 
<a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2019\/11\/exploration-policy-formula.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-621888\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2019\/11\/exploration-policy-formula.png\" alt=\"exploration policy formula\" width=\"598\" height=\"82\" srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2019\/11\/exploration-policy-formula.png 598w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2019\/11\/exploration-policy-formula-300x41.png 300w\" sizes=\"auto, (max-width: 598px) 100vw, 598px\" \/><\/a> The upper bound used in the paper is obtained from two bootstraps over the Q-function. Stability is ensured by enforcing the closeness between the exploration policy and the target policy. In the formula above, this is done using the KL constraint. Despite using off-policy exploration, OAC achieves the same level of stability as other modern actor-critic methods.<\/p>\n<p>OAC addresses the two problems of pessimistic underexploration and directional uninformedness we have identified above. Since the policy \u03c0<sub>E<\/sub> is far from the spurious maximum of the lower bound (red line in Figure 2), executing actions sampled from \u03c0<sub>E<\/sub> leads to a quick correction to the critic estimate. This way, OAC avoids pessimistic underexploration. 
Since \u03c0<sub>E<\/sub> is not symmetric with respect to the mean of \u03c0<sub>T<\/sub> (dashed line), OAC also avoids directional uninformedness.<\/p>\n<div id=\"attachment_621519\" style=\"width: 778px\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2019\/11\/MSResearch_20191113_NeurIPS_BetterExpolorationWithOptimisticActorCritic_1400x788.png\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-621519\" class=\"wp-image-621519\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2019\/11\/MSResearch_20191113_NeurIPS_BetterExpolorationWithOptimisticActorCritic_1400x788-1024x577.png\" alt=\"Figure 2: The OAC exploration policy \u03c0E avoids pessimistic underexploration by sampling far from the spurious maximum of the lower bound. Since \u03c0E is not symmetric with regard to the mean of the target policy (dashed line), it also addresses directional uninformedness.\" width=\"768\" height=\"433\" srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2019\/11\/MSResearch_20191113_NeurIPS_BetterExpolorationWithOptimisticActorCritic_1400x788-1024x577.png 1024w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2019\/11\/MSResearch_20191113_NeurIPS_BetterExpolorationWithOptimisticActorCritic_1400x788-300x169.png 300w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2019\/11\/MSResearch_20191113_NeurIPS_BetterExpolorationWithOptimisticActorCritic_1400x788-768x433.png 768w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2019\/11\/MSResearch_20191113_NeurIPS_BetterExpolorationWithOptimisticActorCritic_1400x788-1066x600.png 1066w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2019\/11\/MSResearch_20191113_NeurIPS_BetterExpolorationWithOptimisticActorCritic_1400x788-655x368.png 655w, 
https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2019\/11\/MSResearch_20191113_NeurIPS_BetterExpolorationWithOptimisticActorCritic_1400x788-343x193.png 343w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2019\/11\/MSResearch_20191113_NeurIPS_BetterExpolorationWithOptimisticActorCritic_1400x788-640x360.png 640w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2019\/11\/MSResearch_20191113_NeurIPS_BetterExpolorationWithOptimisticActorCritic_1400x788-960x540.png 960w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2019\/11\/MSResearch_20191113_NeurIPS_BetterExpolorationWithOptimisticActorCritic_1400x788-1280x720.png 1280w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2019\/11\/MSResearch_20191113_NeurIPS_BetterExpolorationWithOptimisticActorCritic_1400x788.png 1400w\" sizes=\"auto, (max-width: 768px) 100vw, 768px\" \/><\/a><p id=\"caption-attachment-621519\" class=\"wp-caption-text\">Figure 2: The OAC exploration policy \u03c0<sub>E<\/sub> avoids pessimistic underexploration by sampling far from the spurious maximum of the lower bound. Since \u03c0<sub>E<\/sub> is not symmetric with regard to the mean of the target policy (dashed line), it also addresses directional uninformedness.<\/p><\/div>\n<p>Integrating optimism into deep reinforcement learning often fails due to catastrophic overestimation. While OAC is an optimistic algorithm, it does not suffer from this problem. This is because OAC uses the optimistic value estimate for exploration only. The policy \u03c0<sub>E<\/sub> is computed from scratch every time the algorithm takes an action and is used only for exploration. The critic and actor updates are still performed with a lower bound. 
This means that the upper bound cannot influence the critic except indirectly, through the distribution of state-action pairs in the memory buffer.<\/p>\n<p>We evaluated OAC on MuJoCo continuous control tasks. In the plot below, we show that OAC outperformed Soft Actor-Critic (SAC) on the Humanoid task. We stress that optimistic exploration as performed by OAC is computationally cheap: we only need to compute one extra critic gradient per sample.<\/p>\n<div id=\"attachment_621906\" style=\"width: 322px\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2019\/11\/humanoid-v2.png\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-621906\" class=\"wp-image-621906\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2019\/11\/humanoid-v2.png\" alt=\"Figure 3: Performance comparison of OAC versus SAC.\" width=\"312\" height=\"221\" srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2019\/11\/humanoid-v2.png 416w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2019\/11\/humanoid-v2-300x212.png 300w\" sizes=\"auto, (max-width: 312px) 100vw, 312px\" \/><\/a><p id=\"caption-attachment-621906\" class=\"wp-caption-text\">Figure 3: Performance comparison of OAC versus SAC.<\/p><\/div>\n<p>Ultimately, optimistic exploration is simple to implement. By using the upper bound instead of the lower bound to obtain the exploration policy, OAC explores more efficiently than methods that explore greedily with respect to the lower bound. 
You can find more detail <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/publication\/better-exploration-with-optimistic-actor-critic\/\">in our paper<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, including ablations and other experiments. <span>To learn more about our research in Game Intelligence at Microsoft Research Cambridge, check out <a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/theme\/game-intelligence\/\">our page<\/a>.<\/span><\/p>\n<p>If you will be attending NeurIPS 2019, our paper will be featured as a spotlight during Track 3 Session 2, which runs from 4:00 PM PST to 5:30 PM PST on Tuesday, December 10th. We will also have a poster in the East Exhibition Hall B + C, from 5:30 PM PST to 7:30 PM PST on Tuesday, December 10th. Come find out more about our work there!<\/p>\n","protected":false},"excerpt":{"rendered":"<p>One of the core directions of Project Malmo is to develop AI capable of rich interactions. Whether that means learning new skills to apply to challenging problems, understanding complex environments, or knowing when to enlist the help of humans, reinforcement learning (RL) is a core enabling technology for building these types of AI. 
In order [&hellip;]<\/p>\n","protected":false},"author":39507,"featured_media":621519,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","msr-author-ordering":[{"type":"user_nicename","value":"Kamil Ciosek","user_id":"37739"}],"msr_hide_image_in_river":0,"footnotes":""},"categories":[194467],"tags":[],"research-area":[13556],"msr-region":[],"msr-event-type":[],"msr-locale":[268875],"msr-post-option":[],"msr-impact-theme":[],"msr-promo-type":[],"msr-podcast-series":[],"class_list":["post-621516","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-artifical-intelligence","msr-research-area-artificial-intelligence","msr-locale-en_us"],"msr_event_details":{"start":"","end":"","location":""},"podcast_url":"","podcast_episode":"","msr_research_lab":[],"msr_impact_theme":[],"related-publications":[],"related-downloads":[],"related-videos":[],"related-academic-programs":[],"related-groups":[583324],"related-projects":[235753],"related-events":[609480],"related-researchers":[],"msr_type":"Post","featured_image_thumbnail":"<img width=\"960\" height=\"540\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2019\/11\/MSResearch_20191113_NeurIPS_BetterExpolorationWithOptimisticActorCritic_1400x788-960x540.png\" class=\"img-object-cover\" alt=\"\" decoding=\"async\" loading=\"lazy\" srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2019\/11\/MSResearch_20191113_NeurIPS_BetterExpolorationWithOptimisticActorCritic_1400x788-960x540.png 960w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2019\/11\/MSResearch_20191113_NeurIPS_BetterExpolorationWithOptimisticActorCritic_1400x788-300x169.png 300w, 
https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2019\/11\/MSResearch_20191113_NeurIPS_BetterExpolorationWithOptimisticActorCritic_1400x788-768x433.png 768w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2019\/11\/MSResearch_20191113_NeurIPS_BetterExpolorationWithOptimisticActorCritic_1400x788-1024x577.png 1024w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2019\/11\/MSResearch_20191113_NeurIPS_BetterExpolorationWithOptimisticActorCritic_1400x788-1066x600.png 1066w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2019\/11\/MSResearch_20191113_NeurIPS_BetterExpolorationWithOptimisticActorCritic_1400x788-655x368.png 655w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2019\/11\/MSResearch_20191113_NeurIPS_BetterExpolorationWithOptimisticActorCritic_1400x788-343x193.png 343w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2019\/11\/MSResearch_20191113_NeurIPS_BetterExpolorationWithOptimisticActorCritic_1400x788-640x360.png 640w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2019\/11\/MSResearch_20191113_NeurIPS_BetterExpolorationWithOptimisticActorCritic_1400x788-1280x720.png 1280w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2019\/11\/MSResearch_20191113_NeurIPS_BetterExpolorationWithOptimisticActorCritic_1400x788.png 1400w\" sizes=\"auto, (max-width: 960px) 100vw, 960px\" \/>","byline":"Kamil Ciosek","formattedDate":"November 26, 2019","formattedExcerpt":"One of the core directions of Project Malmo is to develop AI capable of rich interactions. 
Whether that means learning new skills to apply to challenging problems, understanding complex environments, or knowing when to enlist the help of humans, reinforcement learning (RL) is a core&hellip;","locale":{"slug":"en_us","name":"English","native":"","english":"English"},"_links":{"self":[{"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/posts\/621516","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/users\/39507"}],"replies":[{"embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/comments?post=621516"}],"version-history":[{"count":19,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/posts\/621516\/revisions"}],"predecessor-version":[{"id":624288,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/posts\/621516\/revisions\/624288"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/media\/621519"}],"wp:attachment":[{"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/media?parent=621516"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/categories?post=621516"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/tags?post=621516"},{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=621516"},{"taxonomy":"msr-region","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-region?post=621516"},{"taxonomy":"msr-event-type","embe
ddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-event-type?post=621516"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=621516"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=621516"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=621516"},{"taxonomy":"msr-promo-type","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-promo-type?post=621516"},{"taxonomy":"msr-podcast-series","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-podcast-series?post=621516"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}