{"id":823648,"date":"2022-03-08T09:00:26","date_gmt":"2022-03-08T17:00:26","guid":{"rendered":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/?p=823648"},"modified":"2022-08-17T09:58:51","modified_gmt":"2022-08-17T16:58:51","slug":"%c2%b5transfer-a-technique-for-hyperparameter-tuning-of-enormous-neural-networks","status":"publish","type":"post","link":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/blog\/%c2%b5transfer-a-technique-for-hyperparameter-tuning-of-enormous-neural-networks\/","title":{"rendered":"\u00b5Transfer: A technique for hyperparameter tuning of enormous neural networks"},"content":{"rendered":"\n<figure class=\"wp-block-image alignwide size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1400\" height=\"788\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2022\/03\/1400x788_Hyperparameters_no_logo_hero.gif\" alt=\"An animated line-plot showing the stability of optimal learning rate as we change the neural network\u2019s parametrization. The parametrization is varied by interpolating between mup-Parametrization and PyTorch default in terms of the scaling for the learning rate and the initialization scale. The animation shows that mup-Parametrization is the only parametrization that preserves the optimality of learning rate across model widths; it also achieves the best absolute performance across all parametrizations.\" class=\"wp-image-823717\"\/><\/figure>\n\n\n\n<p>Great scientific achievements cannot be made by trial and error alone. Every launch in the space program is underpinned by centuries of fundamental research in aerodynamics, propulsion, and celestial bodies. 
In the same way, when it comes to building large-scale AI systems, fundamental research forms the theoretical insights that drastically reduce the amount of trial and error necessary and can prove very cost-effective.&nbsp;<\/p>\n\n\n\n<p>In this post, we relay how our fundamental research enabled us, for the first time, to tune enormous neural networks that are too expensive to train more than once. We achieved this by showing that a particular parameterization preserves optimal hyperparameters across different model sizes. This is the \u00b5-Parametrization (or <em>\u00b5P<\/em>, pronounced \u201cmyu-P&#8221;) that we introduced in a <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/arxiv.org\/abs\/2011.14522\" target=\"_blank\" rel=\"noopener noreferrer\">previous paper<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, where we showed that it uniquely enables maximal feature learning in the infinite-width limit. In collaboration with researchers at <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/openai.com\/\" target=\"_blank\" rel=\"noopener noreferrer\">OpenAI<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, we verified its practical advantage on a range of realistic scenarios, which we describe in our new paper, &#8220;<a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/publication\/tuning-large-neural-networks-via-zero-shot-hyperparameter-transfer\/\">Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer<\/a>.&#8221;<\/p>\n\n\n\n<p>By greatly reducing the need to guess which training hyperparameters to use, this technique can accelerate research on enormous neural networks, such as <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/arxiv.org\/abs\/2005.14165\" target=\"_blank\" rel=\"noopener noreferrer\">GPT-3<span 
class=\"sr-only\"> (opens in new tab)<\/span><\/a> and potentially larger successors in the future. We also released a PyTorch package that facilitates the integration of our technique in existing models, available on the <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"http:\/\/github.com\/microsoft\/mup\" target=\"_blank\" rel=\"noopener noreferrer\">project GitHub page<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> or by simply running \\(\\texttt{pip install mup}\\).<\/p>\n\n\n\n<p><\/p>\n\n\n\n<div class=\"wp-block-buttons is-content-justification-center is-content-justification-center is-layout-flex wp-container-core-buttons-is-layout-16018d1d wp-block-buttons-is-layout-flex\">\n<div class=\"wp-block-button is-style-fill-chevron\"><a data-bi-type=\"button\" class=\"wp-block-button__link\" href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/publication\/tuning-large-neural-networks-via-zero-shot-hyperparameter-transfer\/\" target=\"_blank\" rel=\"noreferrer noopener\">Read the paper<\/a><\/div>\n\n\n\n<div class=\"wp-block-button is-style-fill-github\"><a data-bi-type=\"button\" class=\"wp-block-button__link\" href=\"http:\/\/github.com\/microsoft\/mup\" target=\"_blank\" rel=\"noreferrer noopener\">Download the code<\/a><\/div>\n<\/div>\n\n\n\n<div style=\"height:10px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<figure class=\"wp-block-pullquote\"><blockquote><p>&#8220;\u00b5P provides an impressive step toward removing some of the black magic from scaling up neural networks. It also provides a theoretically backed explanation of some tricks used by past work, like the <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/arxiv.org\/abs\/1910.10683\" target=\"_blank\" rel=\"noopener noreferrer\">T5 model<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>. 
I believe both practitioners and researchers alike will find this work valuable.&#8221;<\/p><cite>\u2014 Colin Raffel, Assistant Professor of Computer Science, University of North Carolina at Chapel Hill and co-creator of T5<\/cite><\/blockquote><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"scaling-the-initialization-is-easy-but-scaling-training-is-hard\">Scaling the initialization is easy, but scaling training is hard<\/h2>\n\n\n\n<p>Large neural networks are hard to train partly because we don\u2019t understand how their behavior changes as their size increases. Early work on deep learning, such as by <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"http:\/\/proceedings.mlr.press\/v9\/glorot10a.html\" target=\"_blank\" rel=\"noopener noreferrer\">Glorot & Bengio<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> and <a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/publication\/delving-deep-into-rectifiers-surpassing-human-level-performance-on-imagenet-classification\/\" target=\"_blank\" rel=\"noreferrer noopener\">He et al.<\/a>, generated useful heuristics that deep learning practitioners widely use today. In general, these heuristics try to keep the activation scales consistent at initialization. However, as training starts, this consistency breaks at different model widths, as illustrated on the left in Figure 1.<\/p>\n\n\n\n<p>Unlike at random initialization, behavior during training is much harder to analyze mathematically. Our goal is to obtain a similar consistency, so that as model width increases, the change in activation scales during training stays consistent and similar to that at initialization, avoiding numerical overflow and underflow. 
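To make the initialization half of this story concrete, here is a small numerical sketch (our own toy illustration, not code from the paper): under a He/Glorot-style heuristic, where each weight is drawn with standard deviation 1/sqrt(fan_in), the typical size of a layer's output stays near 1 at every width.

```python
import math
import random

random.seed(0)

def layer_output_rms(width, init_std, n_samples=200):
    """Typical size (RMS) of one coordinate of y = W @ x, where x has
    unit-scale entries and W's entries are i.i.d. with std init_std."""
    total = 0.0
    for _ in range(n_samples):
        x = [random.gauss(0, 1) for _ in range(width)]
        # One output coordinate is a dot product of a row of W with x.
        y = sum(random.gauss(0, init_std) * xi for xi in x)
        total += y * y
    return math.sqrt(total / n_samples)

# He/Glorot-style scaling: std = 1/sqrt(fan_in) keeps activations O(1)
# at initialization, no matter the width.
for width in (256, 1024, 4096):
    print(width, layer_output_rms(width, 1 / math.sqrt(width)))
```

All three printed values hover near 1. The hard part, as this section explains, is that this consistency only holds at initialization: the first gradient update breaks it unless the updates themselves are scaled correctly too.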
Our solution, \u00b5P, achieves this goal, as seen on the right in Figure 1, which shows the stability of network activation scales for the first few steps of training across increasing model width.&nbsp;<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><a data-bi-bhvr=\"14\"  data-bi-cn=\"Two line-plots showing the change in activation scale between PyTorch default and the \u00b5-Parametrization. Under PyTorch default, the activation scale grows as the network width increases for a particular time step. Under \u00b5-Parametrization, the activation scale is stable across widths for a particular time step. \" href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2022\/03\/logits_attnlogits_embedding_edited.jpg\"><img loading=\"lazy\" decoding=\"async\" width=\"1500\" height=\"750\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2022\/03\/logits_attnlogits_embedding_edited.jpg\" alt=\"Two line-plots showing the change in activation scale between PyTorch default and the \u00b5-Parametrization. Under PyTorch default, the activation scale grows as the network width increases for a particular time step. Under \u00b5-Parametrization, the activation scale is stable across widths for a particular time step. 
\" class=\"wp-image-823705\" srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2022\/03\/logits_attnlogits_embedding_edited.jpg 1500w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2022\/03\/logits_attnlogits_embedding_edited-300x150.jpg 300w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2022\/03\/logits_attnlogits_embedding_edited-1024x512.jpg 1024w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2022\/03\/logits_attnlogits_embedding_edited-768x384.jpg 768w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2022\/03\/logits_attnlogits_embedding_edited-240x120.jpg 240w\" sizes=\"auto, (max-width: 1500px) 100vw, 1500px\" \/><\/a><figcaption>Figure 1: In the default parameterization in PyTorch, the graph on the left, the activation scales diverge in width after one step of training. But in \u00b5P, the graph on the right, the activation scales change by a consistent amount regardless of width for any training step. The y-axis shows the change of network activation scales on a fixed input after t=0, 1, 2, 3, and 4 steps of training as the width of the model varies, which is shown along the x-axis.&nbsp;<\/figcaption><\/figure>\n\n\n\n<p>Our parameterization, which maintains this consistency during training, follows two pieces of crucial insight. First, gradient updates behave differently from random weights when the width is large. This is because gradient updates are derived from data and contain correlations, whereas random initializations do not. Therefore, they need to be scaled differently. Second, parameters of different shapes also behave differently when the width is large. While we typically divide parameters into weights and biases, with the former being matrices and the latter vectors, some weights behave like vectors in the large-width setting. For example, the embedding matrix in a language model is of size <em>vocabsize x width<\/em>. 
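The first of these insights can be seen in a toy computation (a deliberate caricature of ours, not from the paper): summing n independent zero-mean terms yields a result of typical size sqrt(n), while summing n correlated terms, modeled crudely below as terms sharing a common mean, yields a result of size n. Random initial weights behave like the former; data-derived gradient updates behave more like the latter, which is why the two need different scaling rules as width grows.

```python
import random

random.seed(0)

def typical_abs_sum(draw, n, trials=400):
    """Average |sum of n draws|, estimated over many trials."""
    return sum(abs(sum(draw() for _ in range(n))) for _ in range(trials)) / trials

n_small, n_large = 256, 4096  # two "widths"; their ratio is 16

# Independent, zero-mean entries (like random init): sum grows like sqrt(n).
iid = lambda: random.gauss(0, 1)
# Correlated entries (crude stand-in for data-derived updates): sum grows like n.
corr = lambda: 1.0 + 0.1 * random.gauss(0, 1)

growth_iid = typical_abs_sum(iid, n_large) / typical_abs_sum(iid, n_small)
growth_corr = typical_abs_sum(corr, n_large) / typical_abs_sum(corr, n_small)
print(growth_iid)   # near sqrt(16) = 4
print(growth_corr)  # near 16
```

Grow the width 16x and the random-init sum grows only 4x, while the correlated sum grows the full 16x; a parameterization that scales both the same way cannot keep activations stable at every width.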
As the width tends to infinity, <em>vocabsize<\/em> stays constant and finite. During matrix multiplication, summing along a finite dimension behaves completely differently from summing along an infinite one.<\/p>\n\n\n\n<p>These insights, which we discuss in detail in a <a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/blog\/on-infinitely-wide-neural-networks-that-exhibit-feature-learning\/\">previous blog post<\/a>, motivated us to develop \u00b5P. In fact, beyond just keeping the activation scale consistent throughout training, \u00b5P ensures that neural networks of different and sufficiently large widths behave similarly during training such that they <em>converge to<\/em> a desirable limit, which we call <em>the feature learning limit<\/em>.&nbsp;<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"a-theory-guided-approach-to-scaling-width\">A theory-guided approach to scaling width<\/h2>\n\n\n\n<p>Our theory of scaling enables a procedure to transfer training hyperparameters across model sizes. If, as discussed above, \u00b5P networks of different widths share similar training dynamics, they likely also share similar optimal hyperparameters. Consequently, we can simply apply the optimal hyperparameters of a small model directly onto a scaled-up version. We call this practical procedure <em>\u00b5Transfer<\/em>. If our hypothesis is correct, the training loss-hyperparameter curves for \u00b5P models of different widths would share a similar minimum. <\/p>\n\n\n\n<p>Conversely, our reasoning suggests that no scaling rule of initialization and learning rate other than \u00b5P can achieve the same result. This is supported by the animation below. Here, we vary the parameterization by interpolating the initialization scaling and the learning rate scaling between PyTorch default and \u00b5P. 
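In practice, the transfer step is just bookkeeping: tune at a small base width, then rescale a few quantities as the width grows. The function below is our own simplified sketch of the flavor of \u00b5P's rules for an Adam-like optimizer, with made-up hyperparameter names; the paper and the \(\texttt{mup}\) package give the exact, authoritative scaling table.

```python
def mu_transfer_adam(base, base_width, width):
    """Rescale hyperparameters tuned at base_width to a wider model.
    Simplified sketch of µP-style rules for an Adam-like optimizer:
      - hidden-weight learning rate shrinks like 1/width,
      - hidden-weight init std shrinks like 1/sqrt(width),
      - output-logit multiplier shrinks like 1/width,
      - vector-like parameters (biases, embeddings) keep their tuned values.
    """
    r = base_width / width
    return {
        "lr_hidden": base["lr_hidden"] * r,
        "init_std_hidden": base["init_std_hidden"] * r ** 0.5,
        "output_mult": base["output_mult"] * r,
        "lr_vector": base["lr_vector"],  # width-independent under µP
    }

# Tune once at width 256 (cheap), then transfer to width 8192 (expensive).
base = {"lr_hidden": 1e-2, "init_std_hidden": 0.02, "output_mult": 1.0, "lr_vector": 1e-2}
big = mu_transfer_adam(base, base_width=256, width=8192)
print(big["lr_hidden"])  # the tuned 1e-2, shrunk by 8192/256 = 32x
```

Under the standard parameterization there is no such fixed recipe, which is why the optimal learning rate drifts with width for every parameterization other than \u00b5P.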
As shown, \u00b5P is the only parameterization that preserves the optimal learning rate across width, achieves the best performance for the model with width 2<sup>13<\/sup> = 8192, and is the one for which wider models always do better for a given learning rate\u2014that is, graphically, the curves don\u2019t intersect.&nbsp;<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><a data-bi-bhvr=\"14\"  data-bi-cn=\"An animated line-plot showing the stability of optimal learning rate as we change the neural network\u2019s parametrization. The parametrization is varied by interpolating between \u00b5-Parametrization and PyTorch default in terms of the scaling for the learning rate and the initialization scale. The animation shows that \u00b5-Parametrization is the only parametrization that preserves the optimality of learning rate across model widths; it also achieves the best absolute performance across all parametrizations. \" href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2022\/03\/1400x788_Hyperparameters_no_logo_hero.gif\"><img loading=\"lazy\" decoding=\"async\" width=\"1400\" height=\"788\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2022\/03\/1400x788_Hyperparameters_no_logo_hero.gif\" alt=\"An animated line-plot showing the stability of optimal learning rate as we change the neural network\u2019s parametrization. The parametrization is varied by interpolating between \u00b5-Parametrization and PyTorch default in terms of the scaling for the learning rate and the initialization scale. The animation shows that \u00b5-Parametrization is the only parametrization that preserves the optimality of learning rate across model widths; it also achieves the best absolute performance across all parametrizations. 
\" class=\"wp-image-823717\"\/><\/a><figcaption>Figure 2: On the left, we train multilayer perceptrons (MLPs) of different widths (which correspond to the curves of different colors and patterns) with different learning rates (shown along the x-axis) on CIFAR10 and plot the training loss along the y-axis. On the right, the 2D plane of parameterizations is formed by interpolation of 1) the initialization scaling between PyTorch default and \u00b5P (x-axis), and 2) the learning rate scaling between PyTorch default and \u00b5P (y-axis). On this plane, PyTorch default is represented by (0, 0) and \u00b5P by (1, 1). The width-256 (log2(width) = 8) model is the same across all frames (except for random seed), but we widen models according to the parameterization represented by the dot on the right.&nbsp;<\/figcaption><\/figure>\n\n\n\n<p>Building on the theoretical foundation of <a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/people\/gregyang\/\">Tensor Programs<\/a>, \u00b5Transfer works automatically for advanced architectures, such as <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/arxiv.org\/abs\/1706.03762\" target=\"_blank\" rel=\"noopener noreferrer\">Transformer<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> and <a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/publication\/deep-residual-learning-for-image-recognition\/\">ResNet<\/a>. It can also simultaneously transfer a wide range of hyperparameters. Using Transformer as an example, we demonstrate in Figure 3 how the optima of key hyperparameters are stable across widths.&nbsp;<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><a data-bi-bhvr=\"14\"  data-bi-cn=\"Four line-plots showing the stability of optima of various hyperparameters across widths. 
From left-to-right and top-to-bottom, we see that the optima for learning rate, cross-entropy temperature, initialization standard deviation, and learning rate schedule are all roughly stable across widths, from 128 to 4,096. \" href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2022\/03\/TP5_blog_width_edited-scaled.jpg\"><img loading=\"lazy\" decoding=\"async\" width=\"2560\" height=\"1442\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2022\/03\/TP5_blog_width_edited-scaled.jpg\" alt=\"Four line-plots showing the stability of optima of various hyperparameters across widths. From left-to-right and top-to-bottom, we see that the optima for learning rate, cross-entropy temperature, initialization standard deviation, and learning rate schedule are all roughly stable across widths, from 128 to 4,096. \" class=\"wp-image-823708\" srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2022\/03\/TP5_blog_width_edited-scaled.jpg 2560w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2022\/03\/TP5_blog_width_edited-300x169.jpg 300w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2022\/03\/TP5_blog_width_edited-1024x577.jpg 1024w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2022\/03\/TP5_blog_width_edited-768x433.jpg 768w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2022\/03\/TP5_blog_width_edited-1536x865.jpg 1536w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2022\/03\/TP5_blog_width_edited-2048x1154.jpg 2048w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2022\/03\/TP5_blog_width_edited-1066x600.jpg 1066w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2022\/03\/TP5_blog_width_edited-655x368.jpg 655w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2022\/03\/TP5_blog_width_edited-343x193.jpg 343w, 
https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2022\/03\/TP5_blog_width_edited-240x135.jpg 240w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2022\/03\/TP5_blog_width_edited-640x360.jpg 640w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2022\/03\/TP5_blog_width_edited-960x540.jpg 960w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2022\/03\/TP5_blog_width_edited-1280x720.jpg 1280w\" sizes=\"auto, (max-width: 2560px) 100vw, 2560px\" \/><\/a><figcaption>Figure 3: Transformers of different widths parameterized in \u00b5P and trained on <a data-bi-bhvr=\"14\"  data-bi-cn=\"Four line-plots showing the stability of optima of various hyperparameters across widths. From left-to-right and top-to-bottom, we see that the optima for learning rate, cross-entropy temperature, initialization standard deviation, and learning rate schedule are all roughly stable across widths, from 128 to 4,096. \" href=\"https:\/\/paperswithcode.com\/dataset\/wikitext-2\" target=\"_blank\" rel=\"noreferrer noopener\">WikiText-2<\/a>. As we increase model width, the optimal learning rate, cross-entropy temperature, initialization scale, and learning rate schedule remain stable. We can meaningfully predict the optimal hyperparameters of a wider network by looking at those of a narrow one.&nbsp;In the plot on the lower right, we tried the following learning rate schedules: (a) linear decay, (b) StepLR @ [5k, 8k] with a decay factor of 0.1, (c) StepLR @ [4k, 7k] with a decay factor of 0.3, (d) cosine annealing, (e) constant, and (f) inverse square-root decay.&nbsp;<\/figcaption><\/figure>\n\n\n\n<figure class=\"wp-block-pullquote\"><blockquote><p>\u201cI am excited about \u00b5P advancing our understanding of large models. \u00b5P&#8217;s principled way of parameterizing the model and selecting the learning rate makes it easier for anybody to scale the training of deep neural networks. 
Such an elegant combination of beautiful theory and practical impact.\u201d<\/p><cite>\u2014 Johannes Gehrke, Technical Fellow, Lab Director of Research at Redmond, and CTO and Head of Machine Learning for the Intelligent Communications and Conversations Cloud (IC3)<\/cite><\/blockquote><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"beyond-width-empirical-scaling-of-model-depth-and-more\">Beyond width: Empirical scaling of model depth and more<\/h2>\n\n\n\n<p>Modern neural network scaling involves many more dimensions than just width. In our work, we also explore how \u00b5P can be applied to realistic training scenarios by combining it with simple heuristics for nonwidth dimensions. In Figure 4, we use the same transformer setup to show how the optimal learning rate remains stable within reasonable ranges of nonwidth dimensions. For hyperparameters other than learning rate, see Figure 19 in our paper.&nbsp;<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><a data-bi-bhvr=\"14\"  data-bi-cn=\"Four line-plots showing the stability of the optimal learning rate across width, depth, batch size, and sequence length. The width is varied from 128 to 4,096, the depth from 2 to 32, the batch size from 20 to 512, and the sequence length from 32 to 512. \" href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2022\/03\/TP5_blog_scale_dim_edited-scaled.jpg\"><img loading=\"lazy\" decoding=\"async\" width=\"2560\" height=\"1441\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2022\/03\/TP5_blog_scale_dim_edited-scaled.jpg\" alt=\"Four line-plots showing the stability of the optimal learning rate across width, depth, batch size, and sequence length. The width is varied from 128 to 4,096, the depth from 2 to 32, the batch size from 20 to 512, and the sequence length from 32 to 512. 
\" class=\"wp-image-823711\" srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2022\/03\/TP5_blog_scale_dim_edited-scaled.jpg 2560w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2022\/03\/TP5_blog_scale_dim_edited-300x169.jpg 300w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2022\/03\/TP5_blog_scale_dim_edited-1024x577.jpg 1024w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2022\/03\/TP5_blog_scale_dim_edited-768x432.jpg 768w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2022\/03\/TP5_blog_scale_dim_edited-1536x865.jpg 1536w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2022\/03\/TP5_blog_scale_dim_edited-2048x1153.jpg 2048w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2022\/03\/TP5_blog_scale_dim_edited-1066x600.jpg 1066w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2022\/03\/TP5_blog_scale_dim_edited-655x368.jpg 655w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2022\/03\/TP5_blog_scale_dim_edited-343x193.jpg 343w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2022\/03\/TP5_blog_scale_dim_edited-240x135.jpg 240w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2022\/03\/TP5_blog_scale_dim_edited-640x360.jpg 640w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2022\/03\/TP5_blog_scale_dim_edited-960x540.jpg 960w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2022\/03\/TP5_blog_scale_dim_edited-1280x720.jpg 1280w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2022\/03\/TP5_blog_scale_dim_edited-1920x1080.jpg 1920w\" sizes=\"auto, (max-width: 2560px) 100vw, 2560px\" \/><\/a><figcaption>Figure 4: Transformers of different sizes parameterized in \u00b5P and trained on <a data-bi-bhvr=\"14\"  data-bi-cn=\"Four line-plots 
showing the stability of the optimal learning rate across width, depth, batch size, and sequence length. The width is varied from 128 to 4,096, the depth from 2 to 32, the batch size from 20 to 512, and the sequence length from 32 to 512. \" href=\"https:\/\/blog.salesforceairesearch.com\/the-wikitext-long-term-dependency-language-modeling-dataset\/\" target=\"_blank\" rel=\"noreferrer noopener\">Wikitext-2<\/a>. Not only does the optimal learning rate transfer across width, as shown in Figure 3, it also empirically transfers across other scale dimensions\u2014such as depth, batch size, and sequence length\u2014across the ranges we tested here. This means we can combine our theoretically motivated transfer across width with the empirically verified one across other scale dimensions to obtain the practical procedure, \u00b5Transfer, to tune hyperparameters indirectly on a small model and transfer to a large one.&nbsp;<\/figcaption><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"testing-\u00b5transfer\">Testing \u00b5Transfer<\/h2>\n\n\n\n<p>Now that we have verified the transfer of individual hyperparameters, it is time to combine them in a more realistic scenario. In Figure 5, we compare \u00b5Transfer, which transfers tuned hyperparameters from a small proxy model, with directly tuning the large target model. In both cases, the tuning is done via random search. 
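Schematically, the procedure looks like the following toy sketch (ours, with a stand-in objective rather than the paper's experiments): random search spends the entire tuning budget on the cheap proxy, and the large model is trained exactly once with the winning hyperparameters. The bowl-shaped \(\texttt{toy\_loss}\) below is hypothetical; its only relevant property is that its optimal learning rate does not move with width, which is what \u00b5P guarantees for real networks.

```python
import math
import random

random.seed(0)

def random_search(objective, n_trials):
    """Log-uniform random search over the learning rate."""
    best_lr, best_loss = None, float("inf")
    for _ in range(n_trials):
        lr = 10 ** random.uniform(-4, 0)  # sample lr in [1e-4, 1]
        loss = objective(lr)
        if loss < best_loss:
            best_lr, best_loss = lr, loss
    return best_lr

# Stand-in for "train a width-w model with this lr and report its loss":
# a bowl in log(lr) whose minimum sits at lr = 1e-2 at every width.
def toy_loss(lr, width):
    return (math.log10(lr) + 2) ** 2 + 1.0 / width

proxy = lambda lr: toy_loss(lr, width=256)   # cheap proxy model
best_lr = random_search(proxy, n_trials=50)  # all tuning happens here
final = toy_loss(best_lr, width=8192)        # single run of the big model
print(best_lr, final)
```

Because each proxy run is orders of magnitude cheaper than a run of the target model, nearly all of a fixed tuning budget buys proxy trials, which is the source of the efficiency gap measured below.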
Figure 5 illustrates a <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/en.wikipedia.org\/wiki\/Pareto_front\" target=\"_blank\" rel=\"noopener noreferrer\">Pareto frontier<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> of the relative tuning compute budget compared with the tuned model quality (BLEU score) on <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/sites.google.com\/site\/iwsltevaluation2014\/data-provided?authuser=0\" target=\"_blank\" rel=\"noopener noreferrer\">IWSLT14 De-En<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, a machine translation dataset. Across all compute budget levels, \u00b5Transfer is about an order of magnitude (in base 10) more compute-efficient for tuning. We expect this efficiency gap to dramatically grow as we move to larger target model sizes.&nbsp;<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><a data-bi-bhvr=\"14\"  data-bi-cn=\"A line-plot showing the Pareto-front corresponding to model performance measured in BLEU score and the compute budget for hyperparameter tuning. The curve representing our method, \u00b5Transfer, dominates that of conventional tuning with a margin of roughly 10 times in compute budget. Our method also yields the best absolute performance, at almost 35.4 in BLEU score, where as the conventional method tops out at 35.2. \" href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2022\/03\/TP5_blog_pareto_edited-scaled.jpg\"><img loading=\"lazy\" decoding=\"async\" width=\"2560\" height=\"1600\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2022\/03\/TP5_blog_pareto_edited-scaled.jpg\" alt=\"A line-plot showing the Pareto-front corresponding to model performance measured in BLEU score and the compute budget for hyperparameter tuning. 
The curve representing our method, \u00b5Transfer, dominates that of conventional tuning with a margin of roughly 10 times in compute budget. Our method also yields the best absolute performance, at almost 35.4 in BLEU score, where as the conventional method tops out at 35.2. \" class=\"wp-image-823714\" srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2022\/03\/TP5_blog_pareto_edited-scaled.jpg 2560w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2022\/03\/TP5_blog_pareto_edited-300x187.jpg 300w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2022\/03\/TP5_blog_pareto_edited-1024x640.jpg 1024w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2022\/03\/TP5_blog_pareto_edited-768x480.jpg 768w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2022\/03\/TP5_blog_pareto_edited-1536x960.jpg 1536w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2022\/03\/TP5_blog_pareto_edited-2048x1280.jpg 2048w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2022\/03\/TP5_blog_pareto_edited-240x150.jpg 240w\" sizes=\"auto, (max-width: 2560px) 100vw, 2560px\" \/><\/a><figcaption>Figure 5: Across different tuning budgets, \u00b5Transfer dominates the baseline method of directly tuning the target model. As we train larger target models with billions of parameters, we expect the performance gap to widen, since the proxy model can remain small while still meaningfully predicting the optimal hyperparameters, as shown in Figures 3 and 4.&nbsp;<\/figcaption><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"a-glimpse-of-the-future-\u00b5p-gpt-3\">A glimpse of the future: \u00b5P + GPT-3<\/h2>\n\n\n\n<p>Before this work, the larger a model was, the less well-tuned we expected it to be due to the high cost of tuning. 
Therefore, we expected that the largest models could benefit the most from \u00b5Transfer, which is why we partnered with OpenAI to evaluate it on GPT-3.&nbsp;<\/p>\n\n\n\n<p>After parameterizing a version of GPT-3 with relative attention in \u00b5P, we tuned a small proxy model with 40 million parameters before copying the best hyperparameter combination to the 6.7-billion-parameter variant of GPT-3, as prescribed by \u00b5Transfer. The total compute used during this tuning stage was only 7 percent of the compute used in the pretraining of the final 6.7-billion-parameter model. This \u00b5Transferred model outperformed the model of the same size (with absolute attention) in the original <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/arxiv.org\/abs\/2005.14165\" target=\"_blank\" rel=\"noopener noreferrer\">GPT-3 paper<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>. In fact, it performs similarly to the model (with absolute attention) with double the parameter count from the same paper, as shown in Figure 6.&nbsp;<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-large\"><a data-bi-bhvr=\"14\"  data-bi-cn=\"Two bar-plots showing the relative performance of GPT-3 6.7B compared to GPT-3 6.7B tuned with \u00b5Transfer. On language modeling tasks, including PTB, Wikitext 103, and LM1B, the run with \u00b5Transfer achieves lower perplexities. On NLU tasks, including HellaSwag, LAMBADA, and SQuADv2, the run with \u00b5Transfer achieves higher accuracies, comparable to those achieved by GPT-3 6.7B or GPT-3 13B tuned without \u00b5Transfer. 
\" href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2022\/03\/GPT3-barchart.png\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"391\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2022\/03\/GPT3-barchart-1024x391.png\" alt=\"Two bar-plots showing the relative performance of GPT-3 6.7B compared to GPT-3 6.7B tuned with \u00b5Transfer. On language modeling tasks, including PTB, Wikitext 103, and LM1B, the run with \u00b5Transfer achieves lower perplexities. On NLU tasks, including HellaSwag, LAMBADA, and SQuADv2, the run with \u00b5Transfer achieves higher accuracies, comparable to those achieved by GPT-3 6.7B or GPT-3 13B tuned without \u00b5Transfer. \" class=\"wp-image-824500\" srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2022\/03\/GPT3-barchart-1024x391.png 1024w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2022\/03\/GPT3-barchart-300x114.png 300w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2022\/03\/GPT3-barchart-768x293.png 768w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2022\/03\/GPT3-barchart-1536x586.png 1536w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2022\/03\/GPT3-barchart-2048x781.png 2048w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2022\/03\/GPT3-barchart-240x92.png 240w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/a><figcaption>Figure 6: We applied \u00b5Transfer to GPT-3 6.7-billion parameter model with relative attention and obtained better results than the baseline with absolute attention used in the original <a data-bi-bhvr=\"14\"  data-bi-cn=\"Two bar-plots showing the relative performance of GPT-3 6.7B compared to GPT-3 6.7B tuned with \u00b5Transfer. On language modeling tasks, including PTB, Wikitext 103, and LM1B, the run with \u00b5Transfer achieves lower perplexities. 
On NLU tasks, including HellaSwag, LAMBADA, and SQuADv2, the run with \u00b5Transfer achieves higher accuracies, comparable to those achieved by GPT-3 6.7B or GPT-3 13B tuned without \u00b5Transfer. \" href=\"https:\/\/arxiv.org\/abs\/2005.14165\" target=\"_blank\" rel=\"noreferrer noopener\">GPT-3 paper<\/a>, all while spending only 7 percent of the pretraining compute budget on tuning. The performance of this \u00b5Transferred 6.7-billion-parameter model is comparable to that of the 13-billion-parameter model (with absolute attention) in the original GPT-3 paper.<\/figcaption><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"implications-for-deep-learning-theory\">Implications for deep learning theory<\/h2>\n\n\n\n<p>As shown previously, \u00b5P gives a scaling rule that uniquely preserves the optimal hyperparameter combination across models of different widths in terms of training loss. Conversely, other scaling rules, like the default in PyTorch or the <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"http:\/\/arxiv.org\/abs\/1806.07572\" target=\"_blank\" rel=\"noopener noreferrer\">NTK parameterization<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> studied in the theoretical literature, end up searching regions of the hyperparameter space farther and farther from the optimum as the network gets wider. 
In that regard, we believe that the feature learning limit of \u00b5P, rather than the <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"http:\/\/arxiv.org\/abs\/1806.07572\" target=\"_blank\" rel=\"noopener noreferrer\">NTK limit<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, is the most natural limit to study if our goal is to derive insights that are applicable to feature learning neural networks used in practice. As a result, more advanced theories on overparameterized neural networks should reproduce the feature learning limit of \u00b5P in the large width setting.&nbsp;<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"theory-of-tensor-programs\">Theory of Tensor Programs<\/h2>\n\n\n\n<p>The advances described above are made possible by the theory of <a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/people\/gregyang\/\">Tensor Programs<\/a> (TPs) developed over the last several years. Just as <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/pytorch.org\/docs\/stable\/autograd.html\" target=\"_blank\" rel=\"noopener noreferrer\">autograd<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> helps practitioners compute the gradient of any general computation graph, TP theory enables researchers to compute the limit of any general computation graph when its matrix dimensions become large. 
Applied to the underlying graphs for neural network initialization, training, and inference, the TP technique yields fundamental theoretical results, such as the architectural universality of the <a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/publication\/tensor-programs-i-wide-feedforward-or-recurrent-neural-networks-of-any-architecture-are-gaussian-processes\/\">Neural Network-Gaussian Process correspondence<\/a> and the <a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/publication\/feature-learning-in-infinite-width-neural-networks\/\">Dynamical Dichotomy theorem<\/a>, in addition to deriving <a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/publication\/feature-learning-in-infinite-width-neural-networks\/\">\u00b5P and the feature learning limit<\/a> that led to \u00b5Transfer. Looking ahead, we believe extensions of TP theory to depth, batch size, and other scale dimensions hold the key to the reliable scaling of large models beyond width.&nbsp;<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"applying-\u00b5transfer-to-your-own-models\">Applying \u00b5Transfer to your own models<\/h2>\n\n\n\n<p>Even though the math can be intuitive, we found that implementing \u00b5P (which enables \u00b5Transfer) from scratch can be error-prone. This is similar to how autograd is tricky to implement from scratch even though the chain rule for taking derivatives is very straightforward. For this reason, we created the&nbsp;<a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/github.com\/microsoft\/mup\" target=\"_blank\" rel=\"noopener noreferrer\">mup package<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> to enable practitioners to easily implement \u00b5P in their own PyTorch models, just as frameworks like PyTorch, TensorFlow, and JAX have enabled us to take autograd for granted. 
Please note that \u00b5Transfer works for models of any size, not just those with billions of parameters.&nbsp;<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"the-journey-has-just-begun\">The journey has just begun<\/h2>\n\n\n\n<p>While our theory explains why models of different widths behave differently, more investigation is needed to build a theoretical understanding of the scaling of network depth and other scale dimensions. Many works have addressed these other scale dimensions, such as the research on batch size by <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/arxiv.org\/abs\/1811.03600\" target=\"_blank\" rel=\"noopener noreferrer\">Shallue et al.<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/arxiv.org\/abs\/1711.00489\" target=\"_blank\" rel=\"noopener noreferrer\">Smith et al.<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, and <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/arxiv.org\/pdf\/1812.06162.pdf\" target=\"_blank\" rel=\"noopener noreferrer\">McCandlish et al.<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, as well as research on neural language models in general by <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/1909.12673\">Rosenfeld et al.<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> and <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/arxiv.org\/abs\/2001.08361\" target=\"_blank\" rel=\"noopener noreferrer\">Kaplan et al.<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> We believe \u00b5P can remove a confounding variable for such investigations. Furthermore, recent large-scale architectures 
often involve scale dimensions beyond those we have discussed in our work, such as the number of experts in a mixture-of-experts system. Another high-impact domain to which \u00b5P and \u00b5Transfer have not been applied is fine-tuning a pretrained model. While feature learning is crucial in that domain, the need for regularization and the finite-width effect prove to be interesting challenges.&nbsp;<\/p>\n\n\n\n<p>We firmly believe in fundamental research as a cost-effective complement to trial and error and plan to continue our work to derive more principled approaches to large-scale machine learning. To learn about our other deep learning projects or opportunities to work with us and even help us expand \u00b5P, please go to our <a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/group\/deep-learning-group\/\">Deep Learning Group<\/a> page.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Great scientific achievements cannot be made by trial and error alone. Every launch in the space program is underpinned by centuries of fundamental research in aerodynamics, propulsion, and celestial bodies. 
In the same way, when it comes to building large-scale AI systems, fundamental research forms the theoretical insights that drastically reduce the amount of trial [&hellip;]<\/p>\n","protected":false},"author":37583,"featured_media":823717,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","msr-author-ordering":null,"msr_hide_image_in_river":0,"footnotes":""},"categories":[1],"tags":[],"research-area":[13561,13556],"msr-region":[],"msr-event-type":[],"msr-locale":[268875],"msr-post-option":[243984],"msr-impact-theme":[],"msr-promo-type":[],"msr-podcast-series":[],"class_list":["post-823648","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-research-blog","msr-research-area-algorithms","msr-research-area-artificial-intelligence","msr-locale-en_us","msr-post-option-blog-homepage-featured"],"msr_event_details":{"start":"","end":"","location":""},"podcast_url":"","podcast_episode":"","msr_research_lab":[],"msr_impact_theme":[],"related-publications":[],"related-downloads":[],"related-videos":[],"related-academic-programs":[],"related-groups":[144931],"related-projects":[804847],"related-events":[],"related-researchers":[{"type":"guest","value":"edward-hu","user_id":"822325","display_name":"Edward Hu","author_link":"<a href=\"https:\/\/edwardjhu.com\/\" aria-label=\"Visit the profile page for Edward Hu\">Edward Hu<\/a>","is_active":true,"last_first":"Hu, Edward","people_section":0,"alias":"edward-hu"},{"type":"user_nicename","value":"Jianfeng Gao","user_id":32246,"display_name":"Jianfeng Gao","author_link":"<a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/people\/jfgao\/\" aria-label=\"Visit the profile page for Jianfeng Gao\">Jianfeng Gao<\/a>","is_active":false,"last_first":"Gao, 
Jianfeng","people_section":0,"alias":"jfgao"}],"msr_type":"Post","featured_image_thumbnail":"<img width=\"960\" height=\"540\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2022\/03\/1400x788_Hyperparameters_no_logo_hero-960x540.gif\" class=\"img-object-cover\" alt=\"animation showing hyperparameters\" decoding=\"async\" loading=\"lazy\" srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2022\/03\/1400x788_Hyperparameters_no_logo_hero-960x540.gif 960w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2022\/03\/1400x788_Hyperparameters_no_logo_hero-300x169.gif 300w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2022\/03\/1400x788_Hyperparameters_no_logo_hero-1024x576.gif 1024w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2022\/03\/1400x788_Hyperparameters_no_logo_hero-768x432.gif 768w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2022\/03\/1400x788_Hyperparameters_no_logo_hero-1066x600.gif 1066w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2022\/03\/1400x788_Hyperparameters_no_logo_hero-655x368.gif 655w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2022\/03\/1400x788_Hyperparameters_no_logo_hero-343x193.gif 343w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2022\/03\/1400x788_Hyperparameters_no_logo_hero-240x135.gif 240w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2022\/03\/1400x788_Hyperparameters_no_logo_hero-640x360.gif 640w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2022\/03\/1400x788_Hyperparameters_no_logo_hero-1280x720.gif 1280w\" sizes=\"auto, (max-width: 960px) 100vw, 960px\" \/>","byline":"<a href=\"https:\/\/edwardjhu.com\/\" title=\"Go to researcher profile for Edward Hu\" aria-label=\"Go to researcher profile for Edward Hu\" data-bi-type=\"byline author\" 
data-bi-cN=\"Edward Hu\">Edward Hu<\/a>, Greg Yang, and <a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/people\/jfgao\/\" title=\"Go to researcher profile for Jianfeng Gao\" aria-label=\"Go to researcher profile for Jianfeng Gao\" data-bi-type=\"byline author\" data-bi-cN=\"Jianfeng Gao\">Jianfeng Gao<\/a>","formattedDate":"March 8, 2022","formattedExcerpt":"Great scientific achievements cannot be made by trial and error alone. Every launch in the space program is underpinned by centuries of fundamental research in aerodynamics, propulsion, and celestial bodies. In the same way, when it comes to building large-scale AI systems, fundamental research forms&hellip;","locale":{"slug":"en_us","name":"English","native":"","english":"English"},"_links":{"self":[{"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/posts\/823648","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/users\/37583"}],"replies":[{"embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/comments?post=823648"}],"version-history":[{"count":39,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/posts\/823648\/revisions"}],"predecessor-version":[{"id":870630,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/posts\/823648\/revisions\/870630"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/media\/823717"}],"wp:attachment":[{"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/media?parent=823648"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/catego
ries?post=823648"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/tags?post=823648"},{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=823648"},{"taxonomy":"msr-region","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-region?post=823648"},{"taxonomy":"msr-event-type","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-event-type?post=823648"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=823648"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=823648"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=823648"},{"taxonomy":"msr-promo-type","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-promo-type?post=823648"},{"taxonomy":"msr-podcast-series","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-podcast-series?post=823648"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}