{"id":852825,"date":"2022-06-21T09:00:00","date_gmt":"2022-06-21T16:00:00","guid":{"rendered":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/?p=852825"},"modified":"2022-08-17T08:54:04","modified_gmt":"2022-08-17T15:54:04","slug":"swin-transformer-supports-3-billion-parameter-vision-models-that-can-train-with-higher-resolution-images-for-greater-task-applicability","status":"publish","type":"post","link":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/blog\/swin-transformer-supports-3-billion-parameter-vision-models-that-can-train-with-higher-resolution-images-for-greater-task-applicability\/","title":{"rendered":"Swin Transformer supports 3-billion-parameter vision models that can train with higher-resolution images for greater task applicability"},"content":{"rendered":"\n<figure class=\"wp-block-image alignwide size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"2560\" height=\"1441\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2022\/06\/1400x788_Swin_transformer_hero_graphic_no_logo-scaled.jpg\" alt=\"On the left, a diagram with three layers, each of which contains a half-transparent image processed from the same image. The processed images are partitioned into several grids, and each grid contains 4 x 4 image patches. From bottom to top, the number of grids in each layer are 4 x 4, 2 x 2, and 1 x 1, respectively. The layers are labeled \u201c4x,\u201d \u201c8x,\u201d and \u201c16x,\u201d respectively, from bottom to top. An arrow joining the three layers points upward to the words \u201cSegmentation\u201d and \u201cDetection\u201d and an ellipsis. Another arrow points from the top layer to the word \u201cclassification.\u201d On the right, a bar chart with a blue bar labeled \u201cSwin V1\u201d and an orange bar labeled \u201cSwin V2.\u201d The orange bar is much taller and labeled \u201c3 billion (1,536 x 1,536 resolution)\u201d; the blue bar is labeled \u201c197 million.\u201d An arrow labeled \u201c15x\u201d points upward from the blue bar, indicating the orange bar is 15 times higher than the blue one. 
\" class=\"wp-image-852828\" srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2022\/06\/1400x788_Swin_transformer_hero_graphic_no_logo-scaled.jpg 2560w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2022\/06\/1400x788_Swin_transformer_hero_graphic_no_logo-300x169.jpg 300w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2022\/06\/1400x788_Swin_transformer_hero_graphic_no_logo-1024x576.jpg 1024w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2022\/06\/1400x788_Swin_transformer_hero_graphic_no_logo-768x432.jpg 768w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2022\/06\/1400x788_Swin_transformer_hero_graphic_no_logo-1536x865.jpg 1536w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2022\/06\/1400x788_Swin_transformer_hero_graphic_no_logo-2048x1153.jpg 2048w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2022\/06\/1400x788_Swin_transformer_hero_graphic_no_logo-1066x600.jpg 1066w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2022\/06\/1400x788_Swin_transformer_hero_graphic_no_logo-655x368.jpg 655w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2022\/06\/1400x788_Swin_transformer_hero_graphic_no_logo-343x193.jpg 343w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2022\/06\/1400x788_Swin_transformer_hero_graphic_no_logo-240x135.jpg 240w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2022\/06\/1400x788_Swin_transformer_hero_graphic_no_logo-scaled-640x360.jpg 640w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2022\/06\/1400x788_Swin_transformer_hero_graphic_no_logo-scaled-960x540.jpg 960w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2022\/06\/1400x788_Swin_transformer_hero_graphic_no_logo-1280x720.jpg 1280w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2022\/06\/1400x788_Swin_transformer_hero_graphic_no_logo-1920x1080.jpg 1920w\" sizes=\"auto, (max-width: 2560px) 100vw, 2560px\" \/><figcaption><center>Swin Transformer, a Transformer-based general-purpose vision architecture, was further evolved to address challenges specific to large vision models. As a result, Swin Transformer is capable of training with images at higher resolutions, which allows for greater task applicability (left), and scaling models up to 3 billion parameters (right).<\/center><\/figcaption><\/figure>\n\n\n\n<p>Early last year, our research team from the <a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/group\/visual-computing\/\" target=\"_blank\" rel=\"noreferrer noopener\">Visual Computing Group<\/a> introduced <a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/publication\/swin-transformer-hierarchical-vision-transformer-using-shifted-windows\/\" target=\"_blank\" rel=\"noreferrer noopener\">Swin Transformer<\/a>, a Transformer-based general-purpose computer vision architecture that for the first time <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/paperswithcode.com\/sota\/object-detection-on-coco\" target=\"_blank\" rel=\"noopener noreferrer\">beat convolutional neural networks on the important vision benchmark of COCO object detection<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> and did so by a large margin. 
Convolutional neural networks (CNNs) have long been the architecture of choice for classifying images and detecting objects within them, among other key computer vision tasks. Swin Transformer offers an alternative. Leveraging the Transformer architecture's adaptive computing capability, Swin can achieve higher accuracy. More importantly, Swin Transformer provides an opportunity to unify the architectures of computer vision and natural language processing (NLP), where the Transformer has been the dominant architecture for years and has benefited the field because of its ability to be scaled up.

So far, Swin Transformer has shown early signs of its potential as a strong backbone architecture for a variety of computer vision problems, powering the top entries of many important vision benchmarks such as COCO object detection, ADE20K semantic segmentation, and CelebA-HQ image generation. It has also been well received by the computer vision research community, garnering the Marr Prize for best paper at the 2021 International Conference on Computer Vision (ICCV). Together with works such as CSWin, Focal Transformer, and CvT, also from teams within Microsoft, Swin is helping to demonstrate the Transformer architecture as a viable option for many vision challenges. However, we believe there is much work ahead, and we are on an adventurous journey to explore the full potential of Swin Transformer.

In the past few years, one of the most important discoveries in NLP has been that scaling up model capacity can continually push the state of the art for various tasks, and that the larger the model, the better its ability to adapt to new tasks with very little or no training data. Can the same be achieved in computer vision, and if so, how?

In pursuit of answers, we scaled up Swin Transformer to 3 billion parameters, the largest and most effective dense vision model to date. Vision models with up to 1.8 billion parameters had previously been trained successfully, but those models require billions of labeled images to train well and are applicable only to image classification. With our model, SwinV2-G, we address training instability, a common obstacle when increasing model size in computer vision, to support more parameters, and thanks to a technique we developed to address the resolution gap between pretraining and fine-tuning tasks, SwinV2-G marks the first time that a billion-scale vision model has been applied to a broader set of vision tasks.
Additionally, leveraging a self-supervised pretraining approach we call SimMIM, SwinV2-G uses 40 times less labeled data and 40 times less training time than previous works to drive the learning of billion-scale vision models.

SwinV2-G achieved state-of-the-art accuracy on four representative benchmarks when it was released in November: ImageNetV2 image classification, COCO object detection, ADE20K semantic segmentation, and Kinetics-400 video action classification.

Our experience and lessons learned in exploring the training and application of large vision models are described in two papers, "Swin Transformer V2: Scaling Up Capacity and Resolution" and "SimMIM: A Simple Framework for Masked Image Modeling," both of which are being presented at the 2022 Computer Vision and Pattern Recognition Conference (CVPR). The code for Swin Transformer (https://github.com/microsoft/Swin-Transformer) and the code for SimMIM (https://github.com/microsoft/SimMIM) are both available on GitHub. (For the purposes of this blog and our paper, the upgraded Swin Transformer architecture resulting from this work is referred to as V2.)

Improving training stability
The first issue we faced when training large models was training instability. We observed that as models get larger, it becomes very easy for them to crash. After checking the feature values of each layer of the models we trained while scaling Swin Transformer up to 3 billion parameters, we found the cause of the instability: a large discrepancy in feature variance between different layers.

As shown in Figure 1, the average feature variance in the deep layers of the original Swin Transformer increases significantly as the model grows larger. With a 200-million-parameter Swin-L model, the discrepancy between the layers with the highest and lowest average feature variance reaches an extreme value of 10^4. Crashes occur during training when the model capacity is further scaled to 658 million parameters (Swin-H).

[Figure 1: The average feature variance per layer (x-axis) for Swin V1 models (solid lines) and Swin V2 models (dashed lines). The discrepancy between the layers with the highest and lowest average feature variance is very large for Swin V1 models and much milder for Swin V2 models.]
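As a rough illustration of how such a check can be done, the sketch below registers a forward hook on each block of a model and records the average feature variance of that block's output, the quantity plotted in Figure 1. This is our own illustrative diagnostic, not the released Swin Transformer code; the assumption that the model exposes an iterable of blocks (for example, model.blocks) and that each block returns a single tensor is ours.

```python
# Minimal diagnostic sketch: record the average feature variance produced by
# each Transformer block so that layer-to-layer discrepancies (as in Figure 1)
# become visible. "model.blocks" is an assumed attribute name; adapt it to the
# actual module hierarchy of the network being inspected.
import torch

def track_feature_variance(blocks):
    stats = {}

    def make_hook(name):
        def hook(module, inputs, output):
            # Variance over the feature dimension, averaged over batch and tokens.
            # Assumes each block returns a single tensor.
            stats[name] = output.detach().float().var(dim=-1).mean().item()
        return hook

    handles = [blk.register_forward_hook(make_hook(f"block_{i}"))
               for i, blk in enumerate(blocks)]
    return stats, handles

# Usage sketch:
# stats, handles = track_feature_variance(model.blocks)
# with torch.no_grad():
#     model(images)                # one forward pass on a sample batch
# for name, var in stats.items():
#     print(name, var)
# for h in handles:
#     h.remove()
```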
Looking closely at the architecture of the original Swin Transformer, we found that this growing discrepancy was caused by the output of each residual branch being added directly back to the main branch without normalization. In other words, the unconstrained output feature values can become very large compared with the input. As illustrated in Figure 2 (left), after one Transformer block the output feature values can grow to 61 times larger than the input. To alleviate this problem, we propose a new normalization method called residual-post-normalization. As shown in Figure 2 (right), this method moves the normalization layer from the beginning to the end of each residual branch, so the output of each residual branch is normalized before being merged back into the main branch. In this way, the average feature variance of the main branch doesn't increase significantly as the layers deepen. Experiments show that this new normalization method moderates the average feature variance of each layer in the model (see the dashed lines in Figure 1; the SwinV2 models have the same respective parameter counts as the SwinV1 models: 200 million [L] and 658 million [H]).
[Figure 2: To accommodate larger vision models, the normalization configuration of the original Swin Transformer was adjusted. The original architecture (left) uses the pre-norm configuration, in which normalization occurs at the beginning of each residual branch; this results in a large increase in feature values (from 1 to 61), leading to crashes during training. In Swin V2 (right), two changes were made: first, normalization is moved to the end of each residual branch in a new method called residual-post-normalization, which yields a much milder increase in value (from 1 to 3); second, the linear dot-product function in the attention unit is replaced with a cosine function.]

In addition, we found that as the model becomes larger, the attention weights of certain layers tend to be dominated by a few specific points in the original self-attention computation, especially when residual-post-normalization is used. To tackle this problem, we further propose a scaled cosine attention mechanism (see Figure 2, right) to replace the common dot-product linear attention unit. In the new scaled cosine attention unit, the computation of self-attention is independent of the input magnitude, resulting in less saturated attention weights.

Experiments have shown that residual-post-normalization and the scaled cosine attention mechanism not only stabilize the training dynamics of large models but also improve accuracy.
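To make the two changes concrete, here is a simplified sketch of a Transformer block in the spirit of Swin V2: layer normalization is applied at the end of each residual branch (residual-post-normalization), and attention logits are computed as cosine similarity between queries and keys, scaled by a learnable per-head temperature, instead of raw dot products. This is an illustrative reimplementation under simplifying assumptions (a plain token sequence, no shifted windows, no relative position bias), not the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScaledCosineAttention(nn.Module):
    """Attention whose logits are cosine similarities between queries and keys,
    scaled by a learnable per-head temperature, so they do not grow with the
    magnitude of the inputs."""
    def __init__(self, dim, num_heads):
        super().__init__()
        self.num_heads = num_heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        # Learnable log-temperature per head; a stand-in for the paper's
        # learnable (clamped) scalar.
        self.logit_scale = nn.Parameter(torch.zeros(num_heads, 1, 1))

    def forward(self, x):                       # x: (batch, tokens, dim)
        b, n, c = x.shape
        qkv = self.qkv(x).reshape(b, n, 3, self.num_heads, c // self.num_heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)    # each: (b, heads, n, head_dim)
        # Cosine similarity: normalize q and k so logits are bounded in [-1, 1].
        attn = F.normalize(q, dim=-1) @ F.normalize(k, dim=-1).transpose(-2, -1)
        attn = (attn * self.logit_scale.exp()).softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, n, c)
        return self.proj(out)

class ResPostNormBlock(nn.Module):
    """Residual-post-normalization: LayerNorm sits at the END of each residual
    branch, so branch outputs are normalized before being added to the main path."""
    def __init__(self, dim, num_heads, mlp_ratio=4):
        super().__init__()
        self.attn = ScaledCosineAttention(dim, num_heads)
        self.norm1 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
                                 nn.Linear(dim * mlp_ratio, dim))
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):
        x = x + self.norm1(self.attn(x))   # post-norm on the attention branch
        x = x + self.norm2(self.mlp(x))    # post-norm on the MLP branch
        return x
```

For comparison, the pre-norm version of this block would compute x = x + self.attn(self.norm1(x)); moving LayerNorm to the end of each branch is what keeps the magnitude of the main path from compounding as depth grows.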
Addressing large resolution gaps between pretraining and fine-tuning tasks

Another difficulty with large vision models is that the image resolution discrepancy between pretraining and fine-tuning can be large: pretraining is typically carried out at low resolutions, while many downstream vision tasks require high-resolution input images or attention windows, as shown in Figure 3.

[Figure 3: In computer vision, many downstream tasks, such as object detection (right, 1,200 × 2,000 input), require high-resolution input, but pretraining tasks, such as image classification (left, 224 × 224 input), are generally done at low resolutions, creating another challenge in training and applying large-scale vision models.]
In Swin Transformer, the attention unit contains a relative position bias term that represents the impact of one image patch on another based on the relative position between them. This term is learned during pretraining. However, because the range of relative positions at fine-tuning can change significantly from that at pretraining, we need a technique to initialize the biases at relative positions not seen in pretraining. The original Swin Transformer architecture uses a handcrafted bicubic interpolation method to transfer the old relative position biases to the new resolution, but we found it is not very effective when the resolution discrepancy between pretraining and fine-tuning is large.

To resolve this problem, we propose a log-spaced continuous position bias approach (Log-spaced CPB). By applying a small meta-network to the relative position coordinates in log space, Log-spaced CPB can generate position biases for any coordinate range. Because the meta-network can take arbitrary coordinates as input, a pretrained model can transfer freely between different window sizes by sharing the weights of the meta-network. Moreover, by converting the coordinates to log space, the extrapolation ratio required to transfer between different window resolutions is much smaller than with the original linear-space coordinates.
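A minimal sketch of the idea, assuming a square attention window, is shown below. The meta-network here is a two-layer MLP mapping log-spaced relative (dy, dx) offsets to one bias value per attention head; the hidden size and the sign-preserving log transform are our illustrative choices following the description above, not the exact released configuration.

```python
import torch
import torch.nn as nn

class LogSpacedCPB(nn.Module):
    """Continuous position bias sketch: a small MLP maps log-spaced relative
    (dy, dx) coordinates to one bias per attention head, so a pretrained model
    can be evaluated at window sizes unseen during pretraining."""
    def __init__(self, num_heads, hidden=512):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2, hidden), nn.ReLU(),
                                 nn.Linear(hidden, num_heads))

    def forward(self, window_size):
        # All relative offsets inside a window_size x window_size window.
        coords = torch.arange(-(window_size - 1), window_size, dtype=torch.float32)
        dy, dx = torch.meshgrid(coords, coords, indexing="ij")
        rel = torch.stack([dy, dx], dim=-1)                 # (2W-1, 2W-1, 2)
        # Sign-preserving log transform: small offsets stay distinct, large
        # offsets are compressed, shrinking the extrapolation range when
        # transferring to a bigger window.
        rel = torch.sign(rel) * torch.log1p(rel.abs())
        return self.mlp(rel)                                # (2W-1, 2W-1, num_heads)

# Usage sketch: the same weights produce biases for a new, larger window.
# cpb = LogSpacedCPB(num_heads=8)
# bias_8  = cpb(window_size=8)     # pretraining window size
# bias_16 = cpb(window_size=16)    # fine-tuning window size, no re-learning needed
```

Because the bias is generated by the meta-network rather than stored in a table indexed by offset, the same weights can serve a larger fine-tuning window without any handcrafted interpolation.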
Using Log-spaced CPB, Swin Transformer V2 achieves smooth transfer between different resolutions, enabling us to pretrain at a smaller image resolution (192 × 192) with no accuracy loss on downstream tasks compared with the standard 224 × 224 pretraining resolution. This speeds up training by 50 percent.

Scaling up model capacity and resolution results in excessive GPU memory consumption for existing vision models. To address the memory issue, we combined several crucial techniques, including the Zero-Redundancy Optimizer (ZeRO), activation checkpointing, and a new sequential self-attention implementation. With these techniques, GPU memory consumption is significantly reduced for large-scale models and large resolutions, with little impact on training speed. The savings also allow us to train the 3-billion-parameter SwinV2-G model on images with resolutions of up to 1,536 × 1,536 using a 40-gigabyte A100 GPU, making it applicable to a range of vision tasks that require high resolution, such as object detection.
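Of these techniques, activation checkpointing is the easiest to illustrate in isolation: intermediate activations inside a block are discarded during the forward pass and recomputed during the backward pass, trading extra compute for a large reduction in memory. The snippet below is a generic PyTorch sketch of wrapping a stack of blocks this way (it reuses the illustrative ResPostNormBlock from earlier); it is not the project's actual training setup.

```python
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class CheckpointedStage(nn.Module):
    """Run a stack of blocks with activation checkpointing: activations inside
    each block are not stored during the forward pass and are recomputed only
    when gradients are needed."""
    def __init__(self, blocks, use_checkpoint=True):
        super().__init__()
        self.blocks = nn.ModuleList(blocks)
        self.use_checkpoint = use_checkpoint

    def forward(self, x):
        for blk in self.blocks:
            if self.use_checkpoint and self.training:
                # use_reentrant=False is the recommended variant in recent PyTorch.
                x = checkpoint(blk, x, use_reentrant=False)
            else:
                x = blk(x)
        return x

# Usage sketch with the illustrative block defined earlier:
# stage = CheckpointedStage([ResPostNormBlock(dim=1024, num_heads=16) for _ in range(18)])
```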
Tackling the data starvation problem for large vision models

Training larger models often requires more labeled data; however, the computer vision field lacks labeled data at that scale because of the high cost of human annotation. This has compelled the field to explore training large models with smaller amounts of labeled data. To this end, we introduce the self-supervised pretraining approach SimMIM, short for Simple Framework for Masked Image Modeling.

As shown in Figure 4, SimMIM learns image representations through masked image modeling, a pretext task in which a portion of an input image is masked and the model predicts the RGB values of the masked area given the remaining visible parts. This approach better exploits the rich information contained in each image, which allows us to use less data to drive the training of large models. With SimMIM, we were able to train the SwinV2-G model using only 70 million labeled images, 40 times fewer than previous billion-scale vision models.

[Figure 4: In the proposed self-supervised pretraining method SimMIM, models are tasked with predicting the RGB values of hidden portions of an input image based on the visible portions. The method significantly reduces the number of labeled images required for large model training. With SimMIM, a 3-billion-parameter Swin V2 model was trained using 40 times less labeled data than previous billion-scale vision models.]
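At a high level, the pretext task can be written down as a short loss function: randomly mask a fraction of the patches, substitute a learned mask token for them, run the encoder, predict raw pixel values with a lightweight head, and apply an L1 loss only at the masked positions. The sketch below is schematic; the patch_embed, encoder, and head callables, the masking ratio, and the patch-level masking granularity are illustrative assumptions rather than the exact SimMIM configuration.

```python
import torch
import torch.nn.functional as F

def masked_image_modeling_loss(patch_embed, encoder, head, mask_token,
                               patches, mask_ratio=0.6):
    """Schematic masked-image-modeling loss in the spirit of SimMIM.
    patches:     (batch, num_patches, patch_dim) raw pixel values per patch
    patch_embed: maps patches -> (batch, num_patches, embed_dim)
    encoder:     Transformer encoder over the patch sequence (e.g., ViT or Swin)
    head:        lightweight prediction head, embed_dim -> patch_dim
    mask_token:  learned (embed_dim,) vector substituted for masked patches"""
    b, n, _ = patches.shape
    tokens = patch_embed(patches)
    # Randomly choose which patches to hide (True = masked).
    mask = torch.rand(b, n, device=patches.device) < mask_ratio
    tokens = torch.where(mask.unsqueeze(-1), mask_token.expand_as(tokens), tokens)
    # Predict raw pixels for every patch, then score only the masked ones.
    pred = head(encoder(tokens))
    per_patch_l1 = F.l1_loss(pred, patches, reduction="none").mean(dim=-1)  # (b, n)
    return (per_patch_l1 * mask).sum() / mask.sum().clamp(min=1)
```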
\" class=\"wp-image-852879\" srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2022\/06\/Swin_Fig4.png 750w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2022\/06\/Swin_Fig4-300x157.png 300w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2022\/06\/Swin_Fig4-240x125.png 240w\" sizes=\"auto, (max-width: 750px) 100vw, 750px\" \/><\/a><figcaption>Figure 4: In the proposed self-supervised pretraining method SimMIM, models are tasked with predicting the RGB of hidden portions of an input image based on the visible portions. The method significantly reduces the number of labeled images required in large model training. With SimMIM, a 3-billion-parameter Swin V2 model was trained by using 40 times less labeled data than that used in previous billion-scale vision models.<\/figcaption><\/figure>\n\n\n\n<h2 id=\"setting-new-records-on-four-representative-vision-benchmarks\">Setting new records on four representative vision benchmarks<\/h2>\n\n\n\n<p>By scaling up model capacity and resolution, Swin Transformer V2 set new records on four representative vision benchmarks when it was introduced in November: 84.0 percent top-1 accuracy on ImageNetV2 image classification; 63.1 \/ 54.4 box \/ mask mean average precision (mAP) on COCO object detection; 59.9 mean Intersection-over-Union (mIoU) on ADE20K semantic segmentation; and 86.8 percent top-1 accuracy on Kinetics-400 video action classification.<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table><thead><tr><th>Benchmark<\/th><th>ImageNetV2<\/th><th>COCO test-dev<\/th><th>ADE20K val<\/th><th>Kinetics-400<\/th><\/tr><\/thead><tbody><tr><td>Swin V1<\/td><td>77.5<\/td><td>58.7<\/td><td>53.5<\/td><td>84.9<\/td><\/tr><tr><td>Previous state of the art<\/td><td>83.3 (Google, July 2021)<\/td><td>61.3 (Microsoft, July 2021)<\/td><td>58.4 (Microsoft, October 2021)<\/td><td>85.4 (Google, October 2021)<\/td><\/tr><tr><td>Swin V2 (November 2021)<\/td><td>84.0 (+0.7)<\/td><td>63.1 (+1.8)<\/td><td>59.9 (+1.5)<\/td><td>86.8 (+1.4)<\/td><\/tr><\/tbody><\/table><figcaption>Table 1: Swin Transformer (V1) was modified to address the challenges of pretraining and applying large vision models. 
We hope this strong performance on a variety of vision tasks can encourage the field to invest more in scaling up vision models, and that the provided training "recipe" can facilitate future research in this direction.

To learn more about the Swin Transformer journey, check out our Tech Minutes video: https://aka.ms/SwinTransformerTM

Swin Transformer research team

(In alphabetical order) Yue Cao, Li Dong, Baining Guo, Han Hu, Stephen Lin, Yutong Lin, Ze Liu, Jia Ning, Furu Wei, Yixuan Wei, Zhenda Xie, Zhuliang Yao, and Zheng Zhang