{"id":677052,"date":"2020-08-03T07:01:15","date_gmt":"2020-08-03T14:01:15","guid":{"rendered":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/?p=677052"},"modified":"2020-10-06T15:58:15","modified_gmt":"2020-10-06T22:58:15","slug":"three-new-reinforcement-learning-methods-aim-to-improve-ai-in-gaming-and-beyond","status":"publish","type":"post","link":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/blog\/three-new-reinforcement-learning-methods-aim-to-improve-ai-in-gaming-and-beyond\/","title":{"rendered":"Three new reinforcement learning methods aim to improve AI in gaming and beyond"},"content":{"rendered":"\n<p>Reinforcement learning (RL) provides exciting opportunities for game development, as highlighted in our recently announced <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/aka.ms\/ProjectPaidia\">Project Paidia<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>\u2014a research collaboration between our <a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/theme\/game-intelligence\/\">Game Intelligence<\/a> group at Microsoft Research Cambridge and game developer Ninja Theory. In Project Paidia, we push the state of the art in reinforcement learning to enable new game experiences. In particular, we focus on developing game agents that learn to genuinely collaborate in teams with human players. In this blog post we showcase three of our recent research results that are motivated by these research goals. 
We give an overview of key insights and explain how they could lead to <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/aka.ms\/InnProjectPaidia\">AI innovations<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> in modern video game development and other real-world applications.<\/p>\n\n\n\n<p>Reinforcement learning can give game developers the ability to craft much more nuanced game characters than traditional approaches, by providing a reward signal that specifies high-level goals while letting the game character work out optimal strategies for achieving high rewards through data-driven behavior that organically emerges from interactions with the game. To learn how you can use RL to develop your own agents for gaming and begin writing training scripts, <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/developer.microsoft.com\/en-us\/games\/blog\/supercharge-games-with-azure-ai-and-reinforcement-learning\/\">check out this Game Stack Live blog post<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>. 
Getting started with reinforcement learning is easier than you think\u2014Microsoft Azure also offers tools and resources, including <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/azure.microsoft.com\/\">Azure Machine Learning<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, which provides RL training environments, libraries, virtual machines, and more.<\/p>\n\n\n\n<p>The key challenges our research addresses are how to make reinforcement learning efficient and reliable for game developers (for example, by combining it with uncertainty estimation and imitation), how to construct deep learning architectures that give agents the right abilities (such as long-term memory), and how to enable agents that can rapidly adapt to new game situations. Below, we highlight our latest research progress in these three areas.<\/p>\n\n\n\n<h2 id=\"highlight-1-more-accurate-uncertainty-estimates-in-deep-learning-decision-making-systems\">Highlight 1: More accurate uncertainty estimates in deep learning decision-making systems<\/h2>\n\n\n\n<p>From computer vision to reinforcement learning and machine translation, deep learning is everywhere and achieves state-of-the-art results on many problems. We give it a dataset, and it gives us a prediction based on a deep learning model\u2019s best guess. The success of deep learning means that it is increasingly being applied in settings where the predictions have far-reaching consequences and mistakes can be costly. 
&nbsp;<\/p>\n\n\n\n<div class=\"annotations \" data-bi-aN=\"margin-callout\">\n\t<article class=\"annotations__list card depth-16 bg-body p-4 annotations__list--right\">\n\t\t<div class=\"annotations__list-item\">\n\t\t\t\t\t\t<span class=\"annotations__type d-block text-uppercase font-weight-semibold text-neutral-300 small\">Publication<\/span>\n\t\t\t<a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/publication\/conservative-uncertainty-estimation-by-fitting-prior-networks\/\" data-bi-cN=\"Conservative Uncertainty Estimation By Fitting Prior Networks\" data-external-link=\"false\" data-bi-aN=\"margin-callout\" data-bi-type=\"annotated-link\" class=\"annotations__link font-weight-semibold text-decoration-none\"><span>Conservative Uncertainty Estimation By Fitting Prior Networks<\/span>&nbsp;<span class=\"glyph-in-link glyph-append glyph-append-chevron-right\" aria-hidden=\"true\"><\/span><\/a>\t\t\t\t\t<\/div>\n\t<\/article>\n<\/div>\n\n\n\n<p>The problem is that the best-guess approach taken by most deep learning models isn\u2019t enough in these cases. Instead, we want a technique that provides us not just with a prediction but also the associated degree of certainty. Our <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/iclr.cc\/Conferences\/2020\">ICLR 2020<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> paper, \u201c<a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/publication\/conservative-uncertainty-estimation-by-fitting-prior-networks\/\">Conservative Uncertainty Estimation By Fitting Prior Networks<\/a>,\u201d explores exactly that\u2014we describe a way of knowing what we don\u2019t know about predictions of a given deep learning model. 
This work was conducted by <a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/people\/kaciosek\/\">Kamil Ciosek<\/a>, <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/bmi.inf.ethz.ch\/people\/person\/vincent-fortuin\/\">Vincent Fortuin<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, <a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/people\/ryoto\/\">Ryota Tomioka<\/a>, <a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/people\/kahofman\/\">Katja Hofmann<\/a>, and <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"http:\/\/www.eng.cam.ac.uk\/profiles\/ret26\">Richard Turner<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>. <\/p>\n\n\n\n<p>In more technical terms, we provide an analysis of Random Network Distillation (RND), a successful technique for estimating the confidence of a deep learning model.<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"693\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2020\/07\/RLGamingBlog_Fig1-1024x693.png\" alt=\"\" class=\"wp-image-678237\" srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2020\/07\/RLGamingBlog_Fig1-1024x693.png 1024w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2020\/07\/RLGamingBlog_Fig1-300x203.png 300w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2020\/07\/RLGamingBlog_Fig1-768x520.png 768w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2020\/07\/RLGamingBlog_Fig1.png 1037w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption>Figure 1: Predictor (green) and prior (red) agree on seen data (left), disagree on unseen data (right). 
Our uncertainty estimate at a point is defined as the gap between prior and predictor.<\/figcaption><\/figure><\/div>\n\n\n\n<p>The version of RND we analyze maintains an uncertainty model separate from the model making predictions. To provide a bit more intuition about how the uncertainty model works, let\u2019s have a look at Figure 1 above. We have two types of neural networks: the predictor (green) and the prior (red). The prior network is fixed and does not change during training. When we see a new data point, we train the predictor to match the prior on that point. In the figure, the data points we have observed are represented with red dots. We can see that close to the points, the predictor and the prior overlap. On the other hand, we see a huge gap between the predictor and prior if we look at the values to the right, far from the observed points.<\/p>\n\n\n\n<p>Roughly speaking, theoretical results in the paper show that the gap between prior and predictor is a good indication of how certain the model should be about its outputs. Indeed, we compare the obtained uncertainty estimates to the gold standard in uncertainty quantification\u2014the posterior obtained by Bayesian inference\u2014and show they have two attractive theoretical properties. First, the variance returned by RND always overestimates the Bayesian posterior variance. This means that while RND can return uncertainties larger than necessary, it won\u2019t become overconfident. Second, we show that the uncertainties concentrate; that is, they eventually become small after the model has been trained on multiple observations. In other words, the model becomes more certain about its predictions as we see more and more data.<\/p>\n\n\n\n<h2 id=\"highlight-2-utilizing-order-invariant-aggregators-to-enhance-agent-recall\">Highlight 2: Utilizing order-invariant aggregators to enhance agent recall<\/h2>\n\n\n\n<p>In many games, players have partial observability of the world around them. 
Acting in these games requires players to recall items, locations, and other players that are currently out of sight but have been seen earlier in the game. Typically, deep reinforcement learning agents have handled this by <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/1507.06527\">incorporating recurrent layers<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> (such as LSTMs or GRUs) or the ability to read and write to external memory as in the case of <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/www.nature.com\/articles\/nature20101\">differentiable neural computers<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> (DNCs).<\/p>\n\n\n\n<div class=\"annotations \" data-bi-aN=\"margin-callout\">\n\t<article class=\"annotations__list card depth-16 bg-body p-4 annotations__list--left\">\n\t\t<div class=\"annotations__list-item\">\n\t\t\t\t\t\t<span class=\"annotations__type d-block text-uppercase font-weight-semibold text-neutral-300 small\">Publication <\/span>\n\t\t\t<a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/publication\/amrl-aggregated-memory-for-reinforcement-learning\/\" data-bi-cN=\"AMRL: Aggregated Memory For Reinforcement Learning\" data-external-link=\"false\" data-bi-aN=\"margin-callout\" data-bi-type=\"annotated-link\" class=\"annotations__link font-weight-semibold text-decoration-none\"><span>AMRL: Aggregated Memory For Reinforcement Learning<\/span>&nbsp;<span class=\"glyph-in-link glyph-append glyph-append-chevron-right\" aria-hidden=\"true\"><\/span><\/a>\t\t\t\t\t<\/div>\n\t<\/article>\n<\/div>\n\n\n\n<p>Using recurrent layers to recall earlier observations was common in natural language processing, where the sequence of words is often important to their interpretation. 
However, when agents interact with a gaming environment, they can influence the order in which they observe their surroundings, which may be irrelevant to how they <em>should<\/em> act. To give a human-equivalent example, if I see a fire exit when moving through a new building, I may need to later recall where it was regardless of what I have seen or done since. In our ICLR 2020 paper \u201c<a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/publication\/amrl-aggregated-memory-for-reinforcement-learning\/\">AMRL: Aggregated Memory For Reinforcement Learning<\/a>,\u201d we propose the use of order-invariant aggregators (the sum or max of values seen so far) in the agent\u2019s policy network to overcome this issue.<\/p>\n\n\n\n<figure class=\"wp-block-image alignwide size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"363\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2020\/07\/RLGamingBlog_Fig2-5f1a016db0687-1024x363.png\" alt=\"\" class=\"wp-image-678252\" srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2020\/07\/RLGamingBlog_Fig2-5f1a016db0687-1024x363.png 1024w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2020\/07\/RLGamingBlog_Fig2-5f1a016db0687-300x106.png 300w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2020\/07\/RLGamingBlog_Fig2-5f1a016db0687-768x272.png 768w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2020\/07\/RLGamingBlog_Fig2-5f1a016db0687.png 1100w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption>Figure 2: Model architectures. From left to right, LSTM, DNC, SET, and AMRL. 
AMRL extends LSTMs with SET-based aggregators (for example average or max value observed).<\/figcaption><\/figure>\n\n\n\n<p>While approaches that enable the ability to read and write to external memory (such as DNCs) can also learn to directly recall earlier observations, the complexity of their architecture is shown to require significantly more samples of interactions with the environment, which can prevent them from learning a high-performing policy within a fixed compute budget.<\/p>\n\n\n\n<p>In our experiments, our Minecraft-playing agents were shown either a red or green cube at the start of an episode that told them how they must act at the end of the episode. In the time between seeing the green or red cube, the agents could move freely through the environment, which could create variable-length sequences of irrelevant observations that could distract the agent and make them forget the color of the cube at the beginning.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"599\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2020\/07\/MSR_Athens_blog_minecraft_image_v2-02-1024x599.png\" alt=\"\" class=\"wp-image-680994\" srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2020\/07\/MSR_Athens_blog_minecraft_image_v2-02-1024x599.png 1024w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2020\/07\/MSR_Athens_blog_minecraft_image_v2-02-300x175.png 300w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2020\/07\/MSR_Athens_blog_minecraft_image_v2-02-768x449.png 768w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2020\/07\/MSR_Athens_blog_minecraft_image_v2-02-1536x898.png 1536w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2020\/07\/MSR_Athens_blog_minecraft_image_v2-02-2048x1197.png 2048w, 
https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2020\/07\/MSR_Athens_blog_minecraft_image_v2-02-480x280.png 480w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption>Figure 3: A top-down view of the Minecraft maze that tests an agent\u2019s memory (bottom) and a sample of observations an agent may see whilst moving through this environment (top).<\/figcaption><\/figure>\n\n\n\n<p>By combining recurrent layers with order-invariant aggregators, AMRL can both infer hidden features of the state from the sequence of recent observations and recall past observations regardless of when they were seen, enabling our agents to efficiently recall the color of the cube and make the right decision at the end of the episode. Now empowered with this new ability, our agents can play more complex games or even be deployed in non-gaming applications where agents must recall distant memories in partially observable environments.<\/p>\n\n\n\n<p>Researchers who contributed to this work include <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/www.jakebeck.com\/\">Jacob Beck<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, <a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/people\/kaciosek\/\">Kamil Ciosek<\/a>, <a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/people\/sadevlin\/\">Sam Devlin<\/a>, <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/www.tschiatschek.net\/\">Sebastian Tschiatschek<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, <a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/people\/chezha\/\">Cheng Zhang<\/a>, and <a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/people\/kahofman\/\">Katja Hofmann<\/a>.<\/p>\n\n\n\n<h2 
id=\"highlight-3-varibad-exploring-unknown-environments-with-bayes-adaptive-deep-rl-and-meta-learning\">Highlight 3: VariBAD\u2014exploring unknown environments with Bayes-Adaptive Deep RL and meta-learning<\/h2>\n\n\n\n<p>Most current reinforcement learning work, and the majority of RL agents trained for video game applications, are optimized for a single game scenario. However, a key aspect of human-like gameplay is the ability to continuously learn and adapt to new challenges. In our joint work with Luisa Zintgraf, Kyriacos Shiarlis, Maximilian Igl, Sebastian Schulze, Yarin Gal, and Shimon Whiteson from the University of Oxford, we developed a flexible new approach that enables agents to learn to explore and rapidly adapt to a given task or scenario.<\/p>\n\n\n\n<div class=\"annotations \" data-bi-aN=\"margin-callout\">\n\t<article class=\"annotations__list card depth-16 bg-body p-4 annotations__list--right\">\n\t\t<div class=\"annotations__list-item\">\n\t\t\t\t\t\t<span class=\"annotations__type d-block text-uppercase font-weight-semibold text-neutral-300 small\">Publication<\/span>\n\t\t\t<a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/publication\/varibad-a-very-good-method-for-bayes-adaptive-deep-rl-via-meta-learning\/\" data-bi-cN=\"VariBAD: A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning\" data-external-link=\"false\" data-bi-aN=\"margin-callout\" data-bi-type=\"annotated-link\" class=\"annotations__link font-weight-semibold text-decoration-none\"><span>VariBAD: A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning<\/span>&nbsp;<span class=\"glyph-in-link glyph-append glyph-append-chevron-right\" aria-hidden=\"true\"><\/span><\/a>\t\t\t\t\t<\/div>\n\t<\/article>\n<\/div>\n\n\n\n<p>In \u201c<a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/publication\/varibad-a-very-good-method-for-bayes-adaptive-deep-rl-via-meta-learning\/\">VariBAD: A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning<\/a>,\u201d 
we focus on problems that can be formalized as so-called Bayes-Adaptive Markov Decision Processes. Briefly, in this setting an agent interacts with a wide range of tasks and learns to infer the current task at hand as quickly as possible. Our goal is to train Bayes-optimal agents\u2014agents that behave optimally given their current belief over tasks. For example, imagine an agent trained to reach a variety of goal positions. At the beginning of each new episode, the agent is uncertain about the goal position it should aim to reach. A Bayes-optimal agent takes the optimal number of steps to reduce its uncertainty and reach the correct goal position, given its initial belief over possible goals.<\/p>\n\n\n\n<p>Our new approach introduces a flexible encoder-decoder architecture to model the agent\u2019s belief distribution and learns to act optimally by conditioning its policy on the current belief. We demonstrate that this leads to a powerful and flexible solution that achieves Bayes-optimal behavior on several research tasks. 
In our ongoing research we investigate how approaches like these can enable game agents that rapidly adapt to new game situations.<\/p>\n\n\n\n<figure class=\"wp-block-image alignwide size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"430\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2020\/07\/RLGamingBlog_fig4-1024x430.png\" alt=\"\" class=\"wp-image-678261\" srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2020\/07\/RLGamingBlog_fig4-1024x430.png 1024w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2020\/07\/RLGamingBlog_fig4-300x126.png 300w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2020\/07\/RLGamingBlog_fig4-768x323.png 768w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2020\/07\/RLGamingBlog_fig4-665x280.png 665w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2020\/07\/RLGamingBlog_fig4.png 1381w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption>Figure 4: Illustration of different exploration strategies. (a) Environment: The agent starts at the bottom left. There is a goal somewhere in the grey area, unknown to the agent. (b) A Bayes-optimal exploration strategy that systematically searches possible grid cells to find the goal, shown in solid (interactions so far) and dashed (future interactions) blue lines. A simplified posterior is shown in the background in grey (p = 1\/(number of possible goal positions left) of containing the goal) and white (p = 0). (c) Posterior sampling repeatedly samples a possible goal position (red squares) and takes the shortest route there, which is suboptimal. Once the goal is found, every sample matches the true goal position and the agent acts optimally. (d) Exploration strategy learned by variBAD. 
The grey background represents the approximate posterior the agent has learned.<br><\/figcaption><\/figure>\n\n\n\n<h2 id=\"continuing-work-in-game-intelligence\">Continuing work in game intelligence<\/h2>\n\n\n\n<p>In this post we have shown just a few of the exciting research directions that we explore within the Game Intelligence theme at Microsoft Research Cambridge and in collaboration with our colleagues at Ninja Theory. A key direction of our research is to create artificial agents that learn to genuinely collaborate with human players, be it in team-based games like Bleeding Edge, or, eventually, in real world applications that go beyond gaming, such as virtual assistants. We view the research results discussed above as key steps towards that goal: by giving agents better ability to detect unfamiliar situations and leverage demonstrations for faster learning, by creating agents that learn to remember longer-term dependencies and consequences from less data, and by allowing agents to very rapidly adapt to new situations or human collaborators. <\/p>\n\n\n\n<p>To learn more about our work with gaming partners, visit the AI Innovation page. To learn more about our research, and about opportunities for working with us, visit <a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/theme\/game-intelligence\/\">aka.ms\/gameintelligence<\/a>.<br><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Reinforcement learning (RL) provides exciting opportunities for game development, as highlighted in our recently announced Project Paidia (opens in new tab)\u2014a research collaboration between our Game Intelligence group at Microsoft Research Cambridge and game developer Ninja Theory. In Project Paidia, we push the state of the art in reinforcement learning to enable new game experiences. 
[&hellip;]<\/p>\n","protected":false},"author":38838,"featured_media":681465,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","msr-author-ordering":[{"type":"user_nicename","value":"Kamil Ciosek","user_id":"37739"},{"type":"user_nicename","value":"Sam Devlin","user_id":"37550"},{"type":"user_nicename","value":"Katja Hofmann","user_id":"32468"}],"msr_hide_image_in_river":0,"footnotes":""},"categories":[1],"tags":[],"research-area":[13556],"msr-region":[],"msr-event-type":[],"msr-locale":[268875],"msr-post-option":[],"msr-impact-theme":[],"msr-promo-type":[],"msr-podcast-series":[],"class_list":["post-677052","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-research-blog","msr-research-area-artificial-intelligence","msr-locale-en_us"],"msr_event_details":{"start":"","end":"","location":""},"podcast_url":"","podcast_episode":"","msr_research_lab":[199561],"msr_impact_theme":[],"related-publications":[],"related-downloads":[],"related-videos":[],"related-academic-programs":[],"related-groups":[583324],"related-projects":[669597],"related-events":[650565],"related-researchers":[{"type":"user_nicename","value":"Katja Hofmann","user_id":32468,"display_name":"Katja Hofmann","author_link":"<a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/people\/kahofman\/\" aria-label=\"Visit the profile page for Katja Hofmann\">Katja Hofmann<\/a>","is_active":false,"last_first":"Hofmann, Katja","people_section":0,"alias":"kahofman"}],"msr_type":"Post","featured_image_thumbnail":"<img width=\"960\" height=\"540\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2020\/08\/Project-Paidia-Hero-960x540.png\" class=\"img-object-cover\" alt=\"\" decoding=\"async\" loading=\"lazy\" 
srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2020\/08\/Project-Paidia-Hero-960x540.png 960w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2020\/08\/Project-Paidia-Hero-300x169.png 300w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2020\/08\/Project-Paidia-Hero-1024x576.png 1024w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2020\/08\/Project-Paidia-Hero-768x432.png 768w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2020\/08\/Project-Paidia-Hero-1536x865.png 1536w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2020\/08\/Project-Paidia-Hero-1066x600.png 1066w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2020\/08\/Project-Paidia-Hero-655x368.png 655w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2020\/08\/Project-Paidia-Hero-343x193.png 343w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2020\/08\/Project-Paidia-Hero-640x360.png 640w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2020\/08\/Project-Paidia-Hero-1280x720.png 1280w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2020\/08\/Project-Paidia-Hero.png 1567w\" sizes=\"auto, (max-width: 960px) 100vw, 960px\" \/>","byline":"Kamil Ciosek, Sam Devlin, and <a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/people\/kahofman\/\" title=\"Go to researcher profile for Katja Hofmann\" aria-label=\"Go to researcher profile for Katja Hofmann\" data-bi-type=\"byline author\" data-bi-cN=\"Katja Hofmann\">Katja Hofmann<\/a>","formattedDate":"August 3, 2020","formattedExcerpt":"Reinforcement learning (RL) provides exciting opportunities for game development, as highlighted in our recently announced Project Paidia (opens in new tab)\u2014a research collaboration between our Game Intelligence group at Microsoft Research Cambridge and game developer 
Ninja Theory. In Project Paidia, we push the state of&hellip;","locale":{"slug":"en_us","name":"English","native":"","english":"English"},"_links":{"self":[{"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/posts\/677052","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/users\/38838"}],"replies":[{"embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/comments?post=677052"}],"version-history":[{"count":20,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/posts\/677052\/revisions"}],"predecessor-version":[{"id":696450,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/posts\/677052\/revisions\/696450"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/media\/681465"}],"wp:attachment":[{"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/media?parent=677052"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/categories?post=677052"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/tags?post=677052"},{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=677052"},{"taxonomy":"msr-region","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-region?post=677052"},{"taxonomy":"msr-event-type","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-event-type?post=677052"},{"taxonomy":"msr-locale
","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=677052"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=677052"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=677052"},{"taxonomy":"msr-promo-type","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-promo-type?post=677052"},{"taxonomy":"msr-podcast-series","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-podcast-series?post=677052"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}