{"id":734179,"date":"2021-03-24T14:22:38","date_gmt":"2021-03-24T21:22:38","guid":{"rendered":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/?p=734179"},"modified":"2021-03-28T23:50:27","modified_gmt":"2021-03-29T06:50:27","slug":"factorized-layers-revisited-compressing-deep-networks-without-playing-the-lottery","status":"publish","type":"post","link":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/blog\/factorized-layers-revisited-compressing-deep-networks-without-playing-the-lottery\/","title":{"rendered":"Factorized layers revisited: Compressing deep networks without playing the lottery"},"content":{"rendered":"\n<figure class=\"wp-block-image alignwide size-large\"><img decoding=\"async\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/03\/1400x788_autoML_animation_no_logov2.gif\" alt=\"\"\/><\/figure>\n\n\n\n<p>From BiT (<a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/1912.11370\">928 million parameters<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>) to GPT-3 (<a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/2005.14165\">175 billion parameters<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>), state-of-the-art machine learning models are rapidly growing in size. With the greater expressivity and easier trainability of these models come skyrocketing training costs, deployment difficulties, and even <a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/video\/frontiers-in-machine-learning-climate-impact-of-machine-learning\/\">climate impact<\/a>. As a result, we\u2019re witnessing exciting and emerging research into compressing these models to make them less expensive, small enough to store on any device, and more energy efficient. 
Perhaps the most popular approach to model compression is pruning, in which redundant model parameters are removed, leaving only a small subset of parameters, or a <em>subnetwork<\/em>. A major drawback of pruning, though, is that it requires training a large model first, which is expensive and resource intensive.<\/p>\n\n\n\n<p>Recent research around pruning has focused on the <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/openreview.net\/forum?id=rJl-b3RcF7\">lottery ticket hypothesis<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, which suggests that most parameters of large models are redundant <em>even during training <\/em>and that there exists a subnetwork responsible for most of the model\u2019s performance. In other words, you can train this subnetwork alone and still obtain the same accuracy as if you had trained all the parameters\u2014provided you are able to identify the subnetwork first. While such subnetworks\u2014called <em>lottery tickets<\/em>\u2014have been shown to exist by pruning large networks <em>after <\/em>training, intensive efforts to design methods that guess them <em>before <\/em>training, such as <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/2002.07376\">GraSP <span class=\"sr-only\"> (opens in new tab)<\/span><\/a>and <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/2009.11094\">hybrid tickets<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, haven\u2019t achieved the same performance.<\/p>\n\n\n\n<p>Are sparsity-based methods such as pruning and guessing lottery tickets the right way to obtain compressed models? 
As part of our paper \u201c<a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/publication\/initialization-and-regularization-of-factorized-neural-layers\/\">Initialization and Regularization of Factorized Neural Layers<\/a>,\u201d<span class=\"has-inline-color has-orange-color\"> <\/span>which we\u2019re presenting at the <a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/event\/iclr-2021\/\">International Conference on Learning Representations<\/a> (ICLR 2021), we revisit the <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/1511.06067\">alternative compression approach of factorized neural layers<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>. While sparsity-based approaches remove parameters from the weight matrices of a network incrementally, factorization replaces the matrices with products of smaller matrices that are more efficient to store and compute. 
Although factorized layers are more amenable to deployment on deep learning accelerators such as GPUs and software frameworks such as PyTorch, sparsity-based methods remain more popular because they\u2019re <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/openaccess.thecvf.com\/content_CVPR_2020\/html\/Idelbayev_Low-Rank_Compression_of_Neural_Nets_Learning_the_Rank_of_Each_CVPR_2020_paper.html\">seen as better at maintaining high accuracy at high compression rates<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> (for example, 10 percent or fewer of the parameters remaining).<\/p>\n\n\n\n<p>Our results contradict this position: we show that if we use the right initialization scheme (<em>spectral initialization,<\/em> or <em>SI<\/em>) and the right regularization penalty (<em>Frobenius decay<\/em>, or <em>FD<\/em>), we can achieve higher accuracy on three benchmark datasets by training a factorized ResNet from scratch than by pruning or guessing lottery tickets. The key principle underlying these two natural methods, neither of which requires extra hyperparameters, is that the training behavior of a factorized model should mimic that of the original (unfactorized) network.<\/p>\n\n\n\n<p>We further demonstrate the usefulness of these schemes in two settings beyond model compression where factorized neural layers are applied. The first is an exciting new area of knowledge distillation in which an overcomplete factorization is used to replace the complicated and expensive student-teacher training phase with a single matrix multiplication at each layer. 
The second is for training Transformer-based architectures such as <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/1810.04805\">BERT<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, which are popular models for learning over sequences like text and genomic data and whose multi-head self-attention mechanisms are also factorized neural layers.<\/p>\n\n\n\n<p>Our work is part of <a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/lab\/microsoft-research-new-england\/\">Microsoft Research New England\u2019s<\/a> <a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/project\/automl\/\">AutoML research efforts<\/a>, which seek to make the exploration and deployment of state-of-the-art machine learning easier through the development of models that help automate the complex processes involved. The code for our work is available at our <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/github.com\/microsoft\/fnl_paper\">GitHub repository<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>.<\/p>\n\n\n\n<h2 id=\"factorized-neural-layers-and-where-to-find-them\">Factorized neural layers and where to find them<\/h2>\n\n\n\n<p>Deep networks are function approximators in which inputs are passed through a sequence of nonlinear transformations \\(g(W,x)\\), each modifying the previous layer\u2019s output \\(x\\) using some mapping specified by a weight matrix \\(W \\in \\mathbb{R}^{m \\times n}\\). 
The space and time complexity of each layer is typically tied directly to the dimensionality of this weight matrix; for example, standard implementations of fully connected and convolutional layers require \\(O(mn)\\) operations to compute \\(g(W, x)\\).<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-large\"><a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/03\/Factorized-Neural_Figure-2_Automl.jpg\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"508\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/03\/Factorized-Neural_Figure-2_Automl-1024x508.jpg\" alt=\"Two side-by-side depictions of a convolution applied to a 5-by-5 input. On the left, a 3-by-3 filter labeled \u201ck-by-k convolution (c-by-c channels)\u201d is applied to the 5-by-5 input to produce a 3-by-3 output. Next to it is a square labeled \u201cW\u201d described as a \u201cweight tensor (reshaped) (c squared k squared parameters)\u201d; it has side lengths ck and ck. An arrow points to two rectangles: a vertical rectangle labeled \u201cU\u201d and a horizontal rectangle labeled \u201cV superscript T\u201d described as a \u201cfactorized weight tensor (2ckr parameters)\u201d; each rectangle has side lengths ck and r. 
To the right, a 1-by-3 filter labeled \u201c1-by-k convolution (r-by-c channels)\u201d is applied to the 5-by-5 input to produce a 5-by-3 output; applied to this input is a 3-by-1 filter labeled \u201ck-by-1 convolution (c-by-r channels)\u201d to produce a 3-by-3 output.\" class=\"wp-image-734335\" srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/03\/Factorized-Neural_Figure-2_Automl-1024x508.jpg 1024w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/03\/Factorized-Neural_Figure-2_Automl-300x149.jpg 300w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/03\/Factorized-Neural_Figure-2_Automl-768x381.jpg 768w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/03\/Factorized-Neural_Figure-2_Automl-16x8.jpg 16w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/03\/Factorized-Neural_Figure-2_Automl.jpg 1528w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/a><figcaption>Figure 1: An example of how to factorize a two-dimensional convolution layer. Here the weight tensor of a regular <em>k <\/em>\u00d7<em> k<\/em> convolution (left) is reshaped into a matrix and factorized into two smaller weight matrices, each corresponding to a one-dimensional convolution performed along one dimension of an image (right).<\/figcaption><\/figure><\/div>\n\n\n\n<p>Factorized neural layers replace the matrix with a product of matrices; in the simplest case, we have the standard low-rank decomposition \\(W=UV^T\\) for matrices \\(U \\in \\mathbb{R}^{m \\times r}\\), \\(V \\in \\mathbb{R}^{n \\times r}\\). 
In the model compression case, we can set \\(r\\)\u226a\\(m,n\\) to improve the complexity of fully connected and convolutional layers from \\(O(mn)\\) to \\(O(r(m+n))\\); as shown in Figure 1, for two-dimensional convolutions, this speedup is achieved using two one-dimensional convolutions, one along each spatial dimension. In the case of Transformers, which take as input a sequence \\(x\\) with hidden dimension \\(d\\), we can express the action of a multi-head attention (MHA) operation parameterized by \\(H\\) query \\((Q_h)\\), key \\((K_h)\\), value \\((V_h)\\), and output \\((O_h)\\) projection matrices as the following summation over its heads \\(h=1,\\ldots,H\\):<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"185\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/03\/Latex_AutoML_figure1-1024x185.jpg\" alt=\"\" class=\"wp-image-734284\" srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/03\/Latex_AutoML_figure1-1024x185.jpg 1024w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/03\/Latex_AutoML_figure1-300x54.jpg 300w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/03\/Latex_AutoML_figure1-768x139.jpg 768w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/03\/Latex_AutoML_figure1-16x3.jpg 16w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/03\/Latex_AutoML_figure1.jpg 1061w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>This formulation makes clear that MHA is an aggregation over \\(2H\\) factorized neural layers with rank \\(r=d\/H\\), half of them parameterized by \\(Q_h K_h^T\\) and the other half by \\(V_h O_h^T\\).<\/p>\n\n\n\n<h2 id=\"how-should-training-routines-handle-factorized-neural-layers\">How should training routines handle factorized neural 
layers?<\/h2>\n\n\n\n<p>It\u2019s straightforward to apply standard deep network training algorithms such as stochastic gradient descent (SGD) to networks with factorized layers. However, modern optimization procedures have many critical aspects, such as initialization and regularization, that significantly influence the performance of the final model. When we factorize a network, the effect of these components can change substantially, so we often can\u2019t use the same settings as for the unfactorized model. For example, in the unfactorized case, the standard weight decay regularization penalizes the squared <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/encyclopediaofmath.org\/wiki\/Frobenius_matrix_norm\">Frobenius norm<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> \\(\\|W\\|_{Fro}^{2}\\) of each weight matrix in the model; however, if we directly apply this to the factors \\(U\\) and \\(V\\) in a factorized layer, we end up penalizing an upper bound on twice the <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/encyclopediaofmath.org\/wiki\/Nuclear_norm\">nuclear norm<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> of their product:<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"126\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/03\/AutoML-Latex-2-1024x126.jpg\" alt=\"\" class=\"wp-image-734290\" srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/03\/AutoML-Latex-2-1024x126.jpg 1024w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/03\/AutoML-Latex-2-300x37.jpg 300w, 
https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/03\/AutoML-Latex-2-768x94.jpg 768w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/03\/AutoML-Latex-2-16x2.jpg 16w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/03\/AutoML-Latex-2.jpg 1114w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>As indicated in the figure below, this upper bound is tight when training a factorized ResNet using regular weight decay, showing that it effectively penalizes the nuclear norm rather than the Frobenius norm.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large is-resized\"><a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/03\/AutoML_figure-updated_-CIFAR.jpg\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/03\/AutoML_figure-updated_-CIFAR.jpg\" alt=\"Three line plots side by side, each with \u201ctraining epoch\u201d on the x-axis and \u201caverage across layers\u201d on the y-axis. The first, titled \u201cKaiming init, different regularizers,\u201d has three solid lines representing nuclear norm: one in black and labeled \u201cno-decay,\u201d one in green and labeled \u201cweight decay,\u201d and one in blue and labeled \u201cFrobenius decay,\u201d with the first above the second and the second above the third. It also has three unlabeled dotted lines representing the nuclear norm upper bound corresponding to each solid line. Those corresponding to \u201cno decay\u201d and \u201cweight decay\u201d are very close to their respective solid lines, while the dotted line corresponding to \u201cFrobenius\u201d decay is not. The second plot, titled \u201cSpectral init, different regularizers,\u201d shows the same result. 
The last plot, titled \u201cWeight decay, different rank scales,\u201d has six pairs of solid and dotted lines labeled and color coordinated by powers of three from \u20133 to 2; in all cases, the dotted lines are very close to their respective solid lines.\" width=\"900\" height=\"331\"\/><\/a><figcaption>Figure 2: Comparison of nuclear norm (solid line) and nuclear norm upper bound penalized by weight decay on individual factors (dotted line) during the training of ResNet20 on CIFAR-10, showing that for most of training, weight decay is effectively penalizing the nuclear norm.<\/figcaption><\/figure>\n\n\n\n<p>This example demonstrates how factorizing the model without modifying the training routine can lead to substantial changes in training behavior. Thus, following the principle that the training behavior of a factorized model should mimic that of the original model to recover the latter\u2019s performance, we argue in favor of using spectral initialization (SI) and Frobenius decay (FD) when training factorized neural layers. SI initializes \\(U\\) and \\(V\\) by applying a rank-deficient singular value decomposition (SVD) to the standard random initialization of the corresponding full-rank weight matrix \\(W\\) in the original model, while FD replaces weight decay by penalizing the squared Frobenius norm \\(\\|UV^T\\|_{Fro}^{2}\\) rather than the sum of squared norms of the individual factors.<\/p>\n\n\n\n<p>In our paper, we extend a recent <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/openreview.net\/forum?id=B1lz-3Rct7\">approximate analysis of weight decay<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> to show that these two schemes allow SGD to maintain a high effective step size when training factorized neural layers, mimicking the role of regular weight decay applied to the original models. 
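<\/p>\n\n\n\n<p>In NumPy-style pseudocode (a sketch of the two schemes as just defined, with invented dimensions, not the code from our repository), SI and FD look like this:<\/p>

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 64, 32, 8

# Standard random initialization of the full-rank W in the original model.
W = rng.standard_normal((m, n)) * np.sqrt(2.0 / n)

# Spectral initialization (SI): rank-r truncated SVD of W, splitting the
# singular values evenly between the two factors.
P, s, Qt = np.linalg.svd(W, full_matrices=False)
U = P[:, :r] * np.sqrt(s[:r])          # shape (m, r)
V = Qt[:r].T * np.sqrt(s[:r])          # shape (n, r)

# Frobenius decay (FD) penalizes the norm of the product ...
fd_penalty = np.linalg.norm(U @ V.T) ** 2

# ... whereas weight decay on the factors penalizes their individual norms.
wd_penalty = np.linalg.norm(U) ** 2 + np.linalg.norm(V) ** 2
```

<p>With this even split of singular values, the weight decay penalty above works out to exactly twice the nuclear norm of \\(UV^T\\), the upper bound plotted in Figure 2.<\/p>\n\n\n\n<p>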
More informally, SI follows the mimicking principle since the SVD provably returns the best factorized approximation of the initialization used by the original model. FD follows the mimicking principle by regularizing the squared Frobenius norm of the product, as is done by weight decay applied to the original model; in contrast, applying weight decay to the individual factors implicitly regularizes the nuclear norm, as shown in the figure above. Together, these two simple, efficient modifications lead to substantial improvements in model compression, knowledge distillation, and the training of Transformer-based architectures.<\/p>\n\n\n\n<h2 id=\"factorization-vs-sparsity-based-methods-for-model-compression\">Factorization vs. sparsity-based methods for model compression<\/h2>\n\n\n\n<p>As mentioned above, sparsity-based approaches such as pruning and guessing lottery tickets represent the prevailing trend in model compression. However, in this section, we show that low-rank factorization is competitive with these methods and can even attain better accuracy at the same parameter count. In particular, the figure below demonstrates that, in the case of low-memory ResNets, factorized neural layers outperform not only sparse (low-memory) training methods like guessing lottery tickets and <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/1902.05967\">dynamic sparsity<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> but often even full model training followed by pruning. 
The model\u2014a modified ResNet32\u2014and datasets\u2014CIFAR-10, CIFAR-100, and Tiny-ImageNet\u2014we use reflect the same standard setup considered by the lottery ticket papers GraSP and hybrid tickets.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/03\/AutoML-Updated-Figure-_resnet.jpg\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"411\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/03\/AutoML-Updated-Figure-_resnet-1024x411.jpg\" alt=\"Three bar graphs side by side, each with \u201caccuracy\u201d on the y-axis and the following data categories corresponding to a specific color bar on the x-axis: original model (black); pruning (blue with black dots); lottery ticket (blue with black dots); dynamic sparsity (blue); fixed sparsity (guessing tickets) (blue); low-rank (orange); and low-rank plus spectral initialization and Frobenius decay (orange).\n\nIn the first, titled \u201cCIFAR-10,\u201d the original model is 94.49 percent accurate and then, in order of highest accuracy: low-rank plus spectral initialization and Frobenius decay, 94.34; pruning, 94.21; lottery ticket, 94.14; low-rank, 93.59; dynamic sparsity, 92.97; and fixed sparsity (guessing tickets), 92.97.\n\nIn the second, titled \u201cCIFAR-100,\u201d the original model is 75.41 percent accurate and then, in order of highest accuracy: low-rank plus spectral initialization and Frobenius decay, 74.41; low-rank, 72.71; lottery ticket, 72.41; pruning, 72.34; fixed sparsity (guessing tickets), 69.70; and dynamic sparsity, 69.66.\n\nIn the third, titled \u201cTiny-ImageNet,\u201d the original model is 63.02 percent accurate and then, in order of highest accuracy: low-rank plus spectral initialization and Frobenius decay, 60.25; low-rank, 58.72; lottery ticket, 57.77; pruning, 57.62; dynamic sparsity, 57.19; and fixed sparsity (guessing tickets), 55.53.\n\" 
class=\"wp-image-735724\" srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/03\/AutoML-Updated-Figure-_resnet-1024x411.jpg 1024w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/03\/AutoML-Updated-Figure-_resnet-300x120.jpg 300w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/03\/AutoML-Updated-Figure-_resnet-768x308.jpg 768w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/03\/AutoML-Updated-Figure-_resnet-1536x617.jpg 1536w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/03\/AutoML-Updated-Figure-_resnet-16x6.jpg 16w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/03\/AutoML-Updated-Figure-_resnet.jpg 1860w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/a><figcaption>Figure 3: Comparison of the accuracy of several training methods: full model training (black); the sparsity-based approaches full model pruning, in which the full model is trained and then pruned (blue with black dots), and sparse training, in which a sparse model is trained (blue); and low-rank factorized layers (orange). When properly initialized and regularized, low-rank training, particularly with spectral initialization and Frobenius decay, achieves better performance at 10 percent compression than sparsity-based methods.<\/figcaption><\/figure>\n\n\n\n<p>This superior performance depends critically on the use of spectral initialization and Frobenius decay in tandem; interestingly, in the paper, we find that both schemes <em>decrease <\/em>accuracy when used independently. While factorization isn\u2019t always best at very high compression rates, the figure above shows that it\u2019s clearly superior in the standard lottery ticket regime, when the compressed model accuracy is close to that of the original model (usually this means the compressed model has 10 percent or more of the original parameters). 
Notably, our approach doesn\u2019t require any additional tuning, as the decay coefficient used for the uncompressed model is the same one used by FD. Thus, factorized neural layers serve as a strong, simple baseline regardless of whether we\u2019re targeting memory savings or fast computation.<\/p>\n\n\n\n<h2 id=\"teacher-free-teaching-with-overcomplete-knowledge-distillation\">Teacher-free teaching with overcomplete knowledge distillation<\/h2>\n\n\n\n<p>While our motivating application is model compression, factorized neural layers can also be found in other applications, such as knowledge distillation, a field that studies how to obtain a better small model by \u201cteaching\u201d it using a more powerful large model. <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/1811.10495\">Recent work on using overparameterization to train compact networks<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> suggests that factorized neural layers can help us avoid the (expensive) two-stage student-teacher training process of knowledge distillation by using an <em>overcomplete <\/em>factorization of each of the student\u2019s weight matrices \\(W=UV^T\\), where matrix factors \\(U \\in \\mathbb{R}^{m \\times r}\\), \\(V \\in \\mathbb{R}^{n \\times r}\\) have inner dimension \\(r \\geq m,n\\), making the factors wider than the original weight matrix, or even by using a deep factorization \\(W=UMV^T\\). 
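<\/p>\n\n\n\n<p>A toy NumPy sketch of the overcomplete case (dimensions invented for illustration):<\/p>

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 16, 16, 64                  # inner dimension r >= m, n

# During training we carry the wider factors: r(m + n) = 2,048 parameters.
U = rng.standard_normal((m, r))
V = rng.standard_normal((n, r))

# After training, multiply the factors back together to recover a layer
# of the original size: mn = 256 parameters at inference time.
W = U @ V.T

# The collapsed layer computes exactly the same function.
x = rng.standard_normal(n)
y_factored = U @ (V.T @ x)
y_collapsed = W @ x
```

\n\n\n\n<p>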
The goal is to take advantage of the training <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/1802.06509\">benefits of deeper or wider networks suggested by recent theoretical work<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> while not actually increasing the model capacity. While this increases the parameter count during training, afterward we can directly obtain a network of the original size by multiplying the factors together. We refer to this technique as <em>overcomplete knowledge distillation<\/em>.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/03\/Overcompleted_updatedfig4_automl.jpg\"><img decoding=\"async\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/03\/Overcompleted_updatedfig4_automl-1024x704.jpg\" alt=\"A flow chart of low-rank model compression and overcomplete knowledge distillation applied to a convolutional neural network. On the left, an image of a frog is passed through three weight matrices, each followed by a nonlinearity, and then on to a classifier that labels the image a \u201cfrog.\u201d The second weight matrix has two arrows pointing from it, both labeled \u201cfactorize during training.\u201d The first points to the low-rank model compression pipeline. Two rectangles represent weight matrices: one labeled \u201cU\u201d and the other \u201cV superscript T,\u201d with U being tall and narrow and V superscript T being short and wide. Below the image is the label \u201c($\\ll mn$ parameters).\u201d Another arrow labeled \u201ckeep factorized during inference\u201d points to the same image. The second arrow labeled \u201cfactorize during training\u201d points to the overcomplete knowledge distillation pipeline. 
Two rectangles represent weight matrices: one labeled \u201cU\u201d and the other \u201cV superscript T,\u201d with U being short and wide and V superscript T being tall and narrow. Below the image is the label ($>mn$ parameters). An arrow labeled \u201cmultiply back before inference\u201d points to a square labeled \u201cUV superscript T\u201d with the caption \u201c(mn parameters).\u201d\" class=\"wp-image-735718\"\/><\/a><figcaption>Figure 4: Comparison between the model compression (top) and knowledge distillation (bottom) pipelines. While both train a factorized model, in overcomplete knowledge distillation, the factors are remultiplied after training to bring the parameter count back to that of a normal model. This is a teacher-free way of taking advantage of the training benefits of larger models without suffering their larger inference cost.<\/figcaption><\/figure>\n\n\n\n<p>Since spectral initialization isn\u2019t applicable when the decomposition is overcomplete, we investigate the effect of Frobenius decay alone. We find that this regularization is critical for overcomplete knowledge distillation using ResNets trained on CIFAR data; in fact, overparameterizing and applying regular weight decay <em>decreases <\/em>model accuracy. On the other hand, using FD, we can train an overparameterized ResNet56 (1.7 million parameters during training\/850,000 parameters at inference time) that matches the performance of ResNet110 (1.7 million parameters at both training and inference). Furthermore, training the overcomplete ResNet56 is 1.5 times faster than training the regular ResNet110. These results are the first successful application of overcomplete knowledge distillation for large-depth neural networks on CIFAR.<\/p>\n\n\n\n<h2 id=\"factorization-aware-training-of-transformers\">Factorization-aware training of Transformers<\/h2>\n\n\n\n<p>Finally, we apply spectral initialization and Frobenius decay in the MHA module of Transformer-based architectures. 
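<\/p>\n\n\n\n<p>To see the factorized layers hiding inside MHA, note that a head\u2019s attention logits depend on its query and key projections only through the low-rank product \\(Q_h K_h^T\\), which a quick NumPy check confirms (dimensions invented for illustration):<\/p>

```python
import numpy as np

rng = np.random.default_rng(0)
d, H = 32, 4
r = d // H                            # per-head rank d/H = 8

x = rng.standard_normal((5, d))       # a sequence of 5 token embeddings
Qh = rng.standard_normal((d, r))      # one head's query projection
Kh = rng.standard_normal((d, r))      # one head's key projection

# The head's logits use only Qh Kh^T, a rank-r factorized layer,
# so SI and FD apply to it directly.
logits_factored = (x @ Qh) @ (x @ Kh).T
logits_full = x @ (Qh @ Kh.T) @ x.T   # same map via the full d x d matrix
```

\n\n\n\n<p>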
While we can show that this indeed helps when training such models using regular SGD, large-scale unsupervised models, such as BERT, are usually trained using variants of adaptive algorithms such as <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/1412.6980\">Adam <span class=\"sr-only\"> (opens in new tab)<\/span><\/a>or <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/openreview.net\/forum?id=Syx4wnEtvH\">LAMB <span class=\"sr-only\"> (opens in new tab)<\/span><\/a>that \u201c<a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/1711.05101\">decouple<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>\u201d weight decay by defining it as an operation that subtracts a small constant times the parameter from itself before taking a gradient step. While the equivalence breaks for adaptive methods, in the case of SGD, this is equivalent to adding a squared Frobenius penalty term to the objective. 
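<\/p>\n\n\n\n<p>In the same decoupled style, the FD penalty\u2019s gradients can be subtracted directly from the factors; a NumPy sketch (decay coefficient invented for illustration):<\/p>

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 12, 10, 4
U = rng.standard_normal((m, r))
V = rng.standard_normal((n, r))
decay = 1e-2                          # illustrative decay coefficient

# Gradients of (1/2) * ||UV^T||_F^2 with respect to each factor.
grad_U = U @ (V.T @ V)                # equals U V^T V
grad_V = V @ (U.T @ U)                # equals V U^T U

# Decoupled Frobenius decay: shrink the factors directly, separately
# from the gradient step taken by the adaptive optimizer (Adam, LAMB).
U_new = U - decay * grad_U
V_new = V - decay * grad_V
```

\n\n\n\n<p>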
Thus for adaptive algorithms, we can devise a similar \u201cdecoupled\u201d FD scheme that subtracts a small constant times \\(UV^TV=\\nabla_U \\frac{1}{2}\\|UV^T\\|_{Fro}^{2}\\) and \\(VU^TU=\\nabla_V \\frac{1}{2}\\|UV^T\\|_{Fro}^{2}\\) from the factors \\(U\\) and \\(V\\), respectively.<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-large\"><a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/03\/automl-figure_mulihead.jpg\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"675\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/03\/automl-figure_mulihead-1024x675.jpg\" alt=\"diagram\" class=\"wp-image-735625\" srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/03\/automl-figure_mulihead-1024x675.jpg 1024w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/03\/automl-figure_mulihead-300x198.jpg 300w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/03\/automl-figure_mulihead-768x506.jpg 768w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/03\/automl-figure_mulihead-16x12.jpg 16w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/03\/automl-figure_mulihead.jpg 1054w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/a><figcaption>Figure 5: A diagram of how multi-head self-attention implicitly consists of 2<em>H<\/em> factorized neural layers. 
Specifically,&nbsp;multi-head attention&nbsp;is a sum&nbsp;over&nbsp;<em>H<\/em>&nbsp;attention&nbsp;heads (orange), each a matrix product&nbsp;of two&nbsp;terms:&nbsp;the first&nbsp;produced by&nbsp;the&nbsp;low-rank bilinear&nbsp;form&nbsp;<em>Q<sub>h<\/sub>&nbsp;K<sub>h<\/sub><sup>T<\/sup><\/em>&nbsp;and the second&nbsp;by the&nbsp;low-rank&nbsp;linear transform&nbsp;<em>V<sub>h<\/sub>&nbsp;O<sub>h<\/sub><sup>T<\/sup><\/em>.&nbsp;<\/figcaption><\/figure><\/div>\n\n\n\n<p>Applying FD in this manner to the MHA module when pretraining BERT-Base (110 million parameters) on unsupervised data yields a better downstream evaluation on the SQuAD task. When the MHA embedding dimension is halved (14.3 million fewer parameters), the advantage of FD continues to hold. On BERT-Large (340 million parameters), we\u2019re able to halve the MHA embedding dimension (44.2 million fewer parameters) while losing less than a point in terms of F1-score.<\/p>\n\n\n\n<h2 id=\"next-steps-in-efficient-model-training\">Next steps in efficient model training<\/h2>\n\n\n\n<p>By studying how a standard initialization scheme and standard regularization scheme need to be modified to handle factorized layers, we were able to obtain better algorithms for learning efficient models, knowledge distillation, and training Transformers. We hope these results spur more investigation into how different components of standard training pipelines behave when training efficient models from scratch. 
For example:<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>How should techniques like Dropout and BatchNorm be modified for factorized layers?<\/li><li>Can we improve performance by<a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/2103.03936\"> factorizing using SVD after a few epochs of unfactorized model training rather than upon initialization<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>? <\/li><li>What aspects of model training should be changed when leveraging sparse or tensorial methods?<\/li><li>Are different models better suited to different compression schemes?<\/li><\/ul>\n\n\n<table style=\"float: right; width: 50%; margin: 15px; text-align: center; border: 1px solid #000000; border-collapse: collapse; border-spacing: inherit;\">\n<tbody>\n<tr style=\"height: 24px;\">\n<td style=\"background-color: #000000; padding: 5px 30px; border: inherit; height: 24px;\"><span style=\"color: #ffffff;\"><strong>Virtual speaker series<\/strong><\/span><\/td>\n<\/tr>\n<tr style=\"height: 23px;\">\n<td style=\"padding: 5px 30px; border: inherit; height: 23px;\"> Would you like to know more about AutoML research and its community at Microsoft? Join us for our free virtual speaker series \u201cDirections in ML: AutoML and Automating Algorithms.\u201d  <a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/event\/directions-in-ml\/#!upcoming-speaker\"> Learn more and register.<\/a><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n\n\n<p>We\u2019re excited by how this work can advance cost-effective, high-performing models capable of running on any device. We\u2019re also exploring the link between training compute time and energy consumption. 
<a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/www.economist.com\/technology-quarterly\/2020\/06\/11\/the-cost-of-training-machines-is-becoming-a-problem\">Compute power for model training has been doubling at a rapid pace<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>. By reducing the required training time, our method for directly training smaller models raises the possibility of reducing energy consumption, as well. To learn more about the connections between compute, model architectures, and power demands, check out the panel \u201c<a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/video\/frontiers-in-machine-learning-climate-impact-of-machine-learning\/\">Frontiers in Machine Learning: Climate Impact of Machine Learning<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>.\u201d<\/p>\n\n\n\n<p><em><strong>Acknowledgments<\/strong><\/em><br><em>This work was led by Carnegie Mellon University PhD student Misha Khodak, in collaboration with Neil Tenenholtz, Lester Mackey, and 
Nicol\u00f2 Fusi, during a Microsoft Research internship.<\/em><\/p>\n\n\n\n<p><\/p>\n\n\n\n<p><\/p>\n","protected":false},"excerpt":{"rendered":"<p>From BiT (928 million parameters (opens in new tab)) to GPT-3 (175 billion parameters (opens in new tab)), state-of-the-art machine learning models are rapidly growing in size. With the greater expressivity and easier trainability of these models come skyrocketing training costs, deployment difficulties, and even climate impact. As a result, we\u2019re witnessing exciting and emerging [&hellip;]<\/p>\n","protected":false},"author":38838,"featured_media":735919,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","msr-author-ordering":null,"msr_hide_image_in_river":0,"footnotes":""},"categories":[1],"tags":[],"research-area":[13556],"msr-region":[],"msr-event-type":[],"msr-locale":[268875],"msr-post-option":[],"msr-impact-theme":[],"msr-promo-type":[],"msr-podcast-series":[],"class_list":["post-734179","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-research-blog","msr-research-area-artificial-intelligence","msr-locale-en_us"],"msr_event_details":{"start":"","end":"","location":""},"podcast_url":"","podcast_episode":"","msr_research_lab":[199563],"msr_impact_theme":[],"related-publications":[],"related-downloads":[],"related-videos":[],"related-academic-programs":[],"related-groups":[],"related-projects":[804847,545241],"related-events":[725710,664110],"related-researchers":[{"type":"guest","value":"misha-khodak","user_id":"734341","display_name":"Misha Khodak","author_link":"<a href=\"http:\/\/www.cs.cmu.edu\/~mkhodak\/\" aria-label=\"Visit the profile page for Misha Khodak\">Misha Khodak<\/a>","is_active":true,"last_first":"Khodak, 
Misha","people_section":0,"alias":"misha-khodak"},{"type":"user_nicename","value":"Neil Tenenholtz","user_id":38464,"display_name":"Neil Tenenholtz","author_link":"<a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/people\/netenenh\/\" aria-label=\"Visit the profile page for Neil Tenenholtz\">Neil Tenenholtz<\/a>","is_active":false,"last_first":"Tenenholtz, Neil","people_section":0,"alias":"netenenh"},{"type":"user_nicename","value":"Lester Mackey","user_id":36161,"display_name":"Lester Mackey","author_link":"<a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/people\/lmackey\/\" aria-label=\"Visit the profile page for Lester Mackey\">Lester Mackey<\/a>","is_active":false,"last_first":"Mackey, Lester","people_section":0,"alias":"lmackey"},{"type":"user_nicename","value":"Nicolo Fusi","user_id":31829,"display_name":"Nicolo Fusi","author_link":"<a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/people\/fusi\/\" aria-label=\"Visit the profile page for Nicolo Fusi\">Nicolo Fusi<\/a>","is_active":false,"last_first":"Fusi, Nicolo","people_section":0,"alias":"fusi"}],"msr_type":"Post","featured_image_thumbnail":"<img width=\"960\" height=\"540\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/03\/1400x788_AutoML_No_logo_withcaptions-960x540.jpg\" class=\"img-object-cover\" alt=\"diagram\" decoding=\"async\" loading=\"lazy\" srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/03\/1400x788_AutoML_No_logo_withcaptions-960x540.jpg 960w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/03\/1400x788_AutoML_No_logo_withcaptions-300x169.jpg 300w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/03\/1400x788_AutoML_No_logo_withcaptions-1024x577.jpg 1024w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/03\/1400x788_AutoML_No_logo_withcaptions-768x432.jpg 768w, 
https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/03\/1400x788_AutoML_No_logo_withcaptions-1536x865.jpg 1536w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/03\/1400x788_AutoML_No_logo_withcaptions-2048x1153.jpg 2048w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/03\/1400x788_AutoML_No_logo_withcaptions-16x9.jpg 16w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/03\/1400x788_AutoML_No_logo_withcaptions-1066x600.jpg 1066w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/03\/1400x788_AutoML_No_logo_withcaptions-655x368.jpg 655w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/03\/1400x788_AutoML_No_logo_withcaptions-343x193.jpg 343w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/03\/1400x788_AutoML_No_logo_withcaptions-640x360.jpg 640w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/03\/1400x788_AutoML_No_logo_withcaptions-1280x720.jpg 1280w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/03\/1400x788_AutoML_No_logo_withcaptions-1920x1080.jpg 1920w\" sizes=\"auto, (max-width: 960px) 100vw, 960px\" \/>","byline":"<a href=\"http:\/\/www.cs.cmu.edu\/~mkhodak\/\" title=\"Go to researcher profile for Misha Khodak\" aria-label=\"Go to researcher profile for Misha Khodak\" data-bi-type=\"byline author\" data-bi-cN=\"Misha Khodak\">Misha Khodak<\/a>, <a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/people\/netenenh\/\" title=\"Go to researcher profile for Neil Tenenholtz\" aria-label=\"Go to researcher profile for Neil Tenenholtz\" data-bi-type=\"byline author\" data-bi-cN=\"Neil Tenenholtz\">Neil Tenenholtz<\/a>, <a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/people\/lmackey\/\" title=\"Go to researcher profile for Lester Mackey\" aria-label=\"Go to researcher profile for Lester Mackey\" 
data-bi-type=\"byline author\" data-bi-cN=\"Lester Mackey\">Lester Mackey<\/a>, and <a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/people\/fusi\/\" title=\"Go to researcher profile for Nicolo Fusi\" aria-label=\"Go to researcher profile for Nicolo Fusi\" data-bi-type=\"byline author\" data-bi-cN=\"Nicolo Fusi\">Nicolo Fusi<\/a>","formattedDate":"March 24, 2021","formattedExcerpt":"From BiT (928 million parameters (opens in new tab)) to GPT-3 (175 billion parameters (opens in new tab)), state-of-the-art machine learning models are rapidly growing in size. With the greater expressivity and easier trainability of these models come skyrocketing training costs, deployment difficulties, and even&hellip;","locale":{"slug":"en_us","name":"English","native":"","english":"English"},"_links":{"self":[{"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/posts\/734179","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/users\/38838"}],"replies":[{"embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/comments?post=734179"}],"version-history":[{"count":62,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/posts\/734179\/revisions"}],"predecessor-version":[{"id":738139,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/posts\/734179\/revisions\/738139"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/media\/735919"}],"wp:attachment":[{"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/media?parent=734179"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-
us\/research\/wp-json\/wp\/v2\/categories?post=734179"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/tags?post=734179"},{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=734179"},{"taxonomy":"msr-region","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-region?post=734179"},{"taxonomy":"msr-event-type","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-event-type?post=734179"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=734179"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=734179"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=734179"},{"taxonomy":"msr-promo-type","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-promo-type?post=734179"},{"taxonomy":"msr-podcast-series","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-podcast-series?post=734179"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}