{"id":741250,"date":"2021-06-10T10:12:16","date_gmt":"2021-06-10T17:12:16","guid":{"rendered":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/?p=741250"},"modified":"2021-06-10T10:12:20","modified_gmt":"2021-06-10T17:12:20","slug":"how-can-generative-adversarial-networks-learn-real-life-distributions-easily","status":"publish","type":"post","link":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/blog\/how-can-generative-adversarial-networks-learn-real-life-distributions-easily\/","title":{"rendered":"How can generative adversarial networks learn real-life distributions easily"},"content":{"rendered":"\n<figure class=\"wp-block-video\"><video height=\"1080\" style=\"aspect-ratio: 1920 \/ 1080;\" width=\"1920\" controls src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/06\/gan-blog-hero_updated-1.mp4\"><\/video><\/figure>\n\n\n\n<p>A Generative adversarial network, or GAN, is one of the most powerful machine learning models proposed by <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/1406.2661\">Goodfellow <em>et al.<\/em><span class=\"sr-only\"> (opens in new tab)<\/span><\/a> for learning to generate samples from complicated real-world distributions. GANs have sparked millions of applications, ranging from generating realistic images or cartoon characters to text-to-image translations. Turing award laureate Yann LeCun called GANs \u201cthe most interesting idea in the last 10 years in ML.\u201d<\/p>\n\n\n\n<p>In the context of generating images, GANs consist of two parts. 1) A parameterized (deconvolutional) generator network \\(G\\) that takes input \\(z\\) which is a random Gaussian vector and outputs a fake image \\(G(z)\\). 2) A parameterized (convolutional) discriminator network \\(D\\) that takes as input an image \\(X\\)and outputs a real value \\(D(X)\\). 
To learn a target distribution \\(\\mathcal{X}\\) of images, the training process of GANs involves finding a generator \\(G\\) (typically by gradient descent ascent) where the distribution \\(G(z)\\) of fake images is indistinguishable from the distribution \\(\\mathcal{X}\\) of real images, using any discriminator \\(D\\) in its parameter family. We illustrate GANs in Figure 1.<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"888\" height=\"242\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/04\/Figure-1-Updated-GAN-blog-High-Res.jpg\" alt=\"A chart showing a GAN comparing fake images with real images, filtering them through a discriminator to produce a value indicating how fake the image is.\" class=\"wp-image-741259\" srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/04\/Figure-1-Updated-GAN-blog-High-Res.jpg 888w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/04\/Figure-1-Updated-GAN-blog-High-Res-300x82.jpg 300w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/04\/Figure-1-Updated-GAN-blog-High-Res-768x209.jpg 768w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/04\/Figure-1-Updated-GAN-blog-High-Res-16x4.jpg 16w\" sizes=\"auto, (max-width: 888px) 100vw, 888px\" \/><figcaption>Figure 1: an illustration of the GAN framework.<\/figcaption><\/figure><\/div>\n\n\n\n<h2 id=\"gans-are-great-practically-but-what-do-we-know-theoretically\">GANs are great practically, but what do we know theoretically?<\/h2>\n\n\n\n<p>In sharp contrast to the great empirical success, GAN remains one of the least understood machine learning models in theory. How can the generator transfer random vectors&nbsp;from a non-structured spherical Gaussian distribution to highly-structured images? 
How can the generator be found simply by local search algorithms such as gradient descent ascent (GDA)? What is the role of the discriminator during the training process?<\/p>\n\n\n\n<p>All these theoretical questions remain essentially unanswered. This is perhaps not surprising, since even learning a linear transformation of some known distributions can be computationally hard (even NP-hard in the worst case), not to mention learning a transformation given by neural networks with ReLU activations.<\/p>\n\n\n\n<p>Does it mean GAN theory reaches a dead end? No. To understand the great empirical success of GANs, our new paper \u201c<a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/publication\/forward-super-resolution-how-can-gans-learn-hierarchical-generative-models-for-real-world-distributions\/\"><em>Forward Super-Resolution: How Can GANs Learn Hierarchical Generative Models for Real-World Distributions<\/em><\/a><em>\u201d<\/em>, investigates the special structures of real-world distributions, in particular the structure of images, to understand how GANs can work effectively beyond the worst-case, theoretical bounds.<\/p>\n\n\n\n<h2 id=\"one-structural-property-of-real-life-images-forward-super-resolution\">One structural property of real-life images: forward super-resolution<\/h2>\n\n\n\n<p>Most real-life images can be viewed in different resolutions without losing the semantics. We often can reduce the resolution of a 1080p car image to as small as, say 16 pixel by 16 pixel, while still maintaining the outline of a car that can be identified by humans.<\/p>\n\n\n\n<p>Consequently, one can expect a progressive generative model for real-life images&#8211;with the lower layers of the generator producing lower-resolution versions of the images, and higher layers producing higher resolutions. 
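<\/p>\n\n\n\n<p>This multi-resolution view is easy to make concrete. The following Python sketch (our own toy illustration, not code from the paper; the 2&#215;2 average pooling and the sample \u201cimage\u201d are illustrative assumptions) builds such a resolution pyramid:<\/p>\n\n\n\n

```python
# Toy illustration: repeatedly halve an image's resolution by 2x2 average
# pooling, producing the pyramid X_1, ..., X_L described in the text.

def downscale(img):
    """Halve each dimension by averaging disjoint 2x2 blocks."""
    n = len(img)
    return [[(img[2 * i][2 * j] + img[2 * i][2 * j + 1] +
              img[2 * i + 1][2 * j] + img[2 * i + 1][2 * j + 1]) / 4.0
             for j in range(n // 2)]
            for i in range(n // 2)]

def pyramid(img, levels):
    """Return [X_1, ..., X_L] with X_L = img and X_1 the coarsest version."""
    out = [img]
    for _ in range(levels - 1):
        out.append(downscale(out[-1]))
    return list(reversed(out))

# An 8x8 "image": left half dark (0.0), right half bright (1.0).
x8 = [[0.0] * 4 + [1.0] * 4 for _ in range(8)]
x1, x2, x3 = pyramid(x8, 3)   # resolutions 2x2, 4x4, 8x8
```

\n\n\n\n<p>Note how the coarse 2&#215;2 version still keeps the semantic layout of the toy image (dark left, bright right), mirroring how a heavily down-scaled car image keeps the outline of a car.<\/p>\n\n\n\n<p>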
An experimental verification of this progressive generative model was done by NVIDIA researchers <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/pdf\/1710.10196.pdf\">Karras <em>et al.<\/em><span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, and is also illustrated in Figure 2 below.<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1007\" height=\"447\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/04\/Figure2_GanBlog_HighRes.jpg\" alt=\"A chart showing the progression of an image through a series of four deconvolutions. The resolution progresses from 8x8 to 16x16 to 32x32 to 64x64.\" class=\"wp-image-741256\" srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/04\/Figure2_GanBlog_HighRes.jpg 1007w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/04\/Figure2_GanBlog_HighRes-300x133.jpg 300w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/04\/Figure2_GanBlog_HighRes-768x341.jpg 768w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/04\/Figure2_GanBlog_HighRes-16x7.jpg 16w\" sizes=\"auto, (max-width: 1007px) 100vw, 1007px\" \/><figcaption>Figure 2: illustration of <em>forward super-resolution<\/em> in practice using images from the LSUN church data set.<\/figcaption><\/figure><\/div>\n\n\n\n<p>In this work, we look deeply into this structural property of images. We formalize it as the forward super-resolution property. Mathematically, we consider learning a distribution of images generated progressively as follows. 
Let \\(X_1\\),\u2026\\(X_{L-1}\\) be images of \\(X_L\\) at lower resolutions (that can be computed from \\(X_L\\) via image down-scaling), with the resolution of \\(X_1\\) being the lowest (for example 8&#215;8), and the resolution of \\(X_L\\) being the highest (for example 128&#215;128). We assume there is a target generator network \\(G^*\\) (to be learned) with hidden layers \\(S{_1^*}\\),\\(S{_2^*}\\),\u2026,\\(S{_L^*}\\) computed as: <\/p>\n\n\n\n<p class=\"has-text-align-center\">\\(S_1^*\\)=\\(ReLU(Deconv(z))\\),   &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;  \\(S_l^*\\)=\\(ReLU(Deconv(S_{l-1}^*))\\)<\/p>\n\n\n\n<p>where \\(z\\)&nbsp;is a random Gaussian input and Deconv(\u22c5) is any deconvolution operator (such as nn.ConvTranspose2d in PyTorch). We also assume:<\/p>\n\n\n\n<p class=\"has-text-align-center\">the image \\(X_l\\) at resolution \\(l\\) is generated by \\(X_l\\)=\\(Deconv_{output}\\)\\((S{_l^*})\\)<\/p>\n\n\n\n<p>Above, we call \\(Deconv_{output}\\)(\u22c5) the <em>output deconvolution layer<\/em>. These layers are responsible for representing the \u201cedge-color features\u201d at different resolutions (see Figure 3).<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"317\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/04\/Figure3_GAN_blog_low-res-1024x317.jpg\" alt=\"Four images showing a progressive shift in resolution.  
\" class=\"wp-image-741277\" srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/04\/Figure3_GAN_blog_low-res-1024x317.jpg 1024w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/04\/Figure3_GAN_blog_low-res-300x93.jpg 300w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/04\/Figure3_GAN_blog_low-res-768x238.jpg 768w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/04\/Figure3_GAN_blog_low-res-16x5.jpg 16w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/04\/Figure3_GAN_blog_low-res.jpg 1097w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption>Figure 3: illustration of the output deconvolutional layer which represent edge-color features.<\/figcaption><\/figure><\/div>\n\n\n\n<p>In other words, each hidden layer \\(S_l^*\\)&nbsp;of the target network \\(G\\)* is responsible for generating weights to determine \u201cwhich edge-color features\u201d are to be used to \u201cpaint\u201d the output image. 
The lower-level hidden layers \\(S_l^*\\)&nbsp;are responsible for generating weights that are used to paint lower-resolution images, while the higher-level hidden layers are responsible for painting higher-resolution images.<\/p>\n\n\n\n<p>As a result, when using GANs to learn real-life images, one should expect the higher hidden layers of the<em> learner generator<\/em> \\(G\\) to learn \u2013 via compositions of hidden deconvolution layers \u2013 how to combine lower-resolution features to generate higher-resolution images via<em> super-resolution.<\/em><\/p>\n\n\n\n<h2 id=\"another-structural-property-of-real-life-images-hierarchical-sparse-coding\">Another structural property of real-life images: hierarchical sparse coding<\/h2>\n\n\n\n<p>Real-life images are also very sharp, meaning that one can clearly see the \u201cedge of contrast\u201d at object boundaries and the \u201cconsistency of color\u201d within objects in the image. In a generative model, we attribute <em>image sharpness<\/em> to a <strong><em>hierarchical sparse coding<\/em><\/strong> property of hidden layers.<\/p>\n\n\n\n<p>To see this, let us recall that real-life images are usually generated via <em>sparse<\/em> combinations of \u201cedge-color features\u201d in the output deconvolution layer. 
For instance (see Figure 4), when a patch in an image is associated with the boundary of an object, an \u201cedge feature\u201d can be selected to generate the pixels in that patch, while all other features become unlikely to show up; in contrast, when a patch is in the middle of an object, a \u201ccolor feature\u201d can be selected to paint these pixels, while all other features are unlikely to show up.<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"211\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/04\/Figure4_GAN_Blog_High-Res-1024x211.jpg\" alt=\"A photo of a white building against an overcast sky and in front of a green lawn. Small yellow boxes show patches dominated by color features; small red boxes show patches dominated by edge features. \" class=\"wp-image-741280\" srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/04\/Figure4_GAN_Blog_High-Res-1024x211.jpg 1024w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/04\/Figure4_GAN_Blog_High-Res-300x62.jpg 300w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/04\/Figure4_GAN_Blog_High-Res-768x158.jpg 768w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/04\/Figure4_GAN_Blog_High-Res-16x3.jpg 16w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/04\/Figure4_GAN_Blog_High-Res.jpg 1035w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption>Figure 4: illustration of \u201csharpness\u201d in real-life images.<\/figcaption><\/figure><\/div>\n\n\n\n<p>Mathematically, we model this by assuming for every layer l and every patch p, the restriction of the hidden layer \\(S_l^*\\) to this patch \u2013 denoted by \\([S_l^* ]_p\\)\u2013 is a sparse vector. 
This means only a few channels are non-zero in every patch (although these non-zero channels can be different at different patches). We have empirically verified this assumption by measuring the sparsity of hidden-layer activations in Figure 5.<\/p>\n\n\n\n<hr class=\"wp-block-separator\"\/>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"265\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/04\/Figure5_GAN_Blog_High-Re-1024x265.jpg\" alt=\"A series of four bar graph charts labelled &quot;layer 1&quot; through &quot;layer 4&quot;. The &quot;trained&quot; bars steadily grow as the &quot;init&quot; bars decline.\" class=\"wp-image-741283\" srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/04\/Figure5_GAN_Blog_High-Re-1024x265.jpg 1024w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/04\/Figure5_GAN_Blog_High-Re-300x78.jpg 300w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/04\/Figure5_GAN_Blog_High-Re-768x199.jpg 768w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/04\/Figure5_GAN_Blog_High-Re-16x4.jpg 16w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/04\/Figure5_GAN_Blog_High-Re.jpg 1197w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption>Figure 5: our experiments verify that after training a 4-layer GAN, the hidden layers become very sparse compared to their random initializations.<\/figcaption><\/figure><\/div>\n\n\n\n<h2 id=\"how-can-gans-efficiently-learn-distributions-with-forward-super-resolution-and-hierarchical-sparse-coding\">How can GANs efficiently learn distributions with forward super-resolution and hierarchical sparse coding?<\/h2>\n\n\n\n<p>In this work, we show that the two structural properties of the distributions for images, namely <em>forward super-resolution<\/em> which ensures 
semantic consistency of images at different resolutions, and <em>hierarchical sparse coding<\/em> which ensures image sharpness, are <strong><em>sufficient<\/em><\/strong> for GANs to learn such distributions <em>efficiently<\/em>. Mathematically, we prove the following theorem in our new <a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/publication\/forward-super-resolution-how-can-gans-learn-hierarchical-generative-models-for-real-world-distributions\/\">paper<\/a>.<\/p>\n\n\n\n<p><strong>Theorem (informal): <\/strong>For any D-dimensional distribution with the forward super-resolution and hierarchical sparse coding properties, for every \u03b5>0, we can learn such a distribution up to error \u03b5 with sample and time complexity polynomial in D and 1\/\u03b5, simply by training GANs using gradient descent ascent (GDA).<\/p>\n\n\n\n<p>In this blog, let us pin down how GANs can learn this distribution without diving into math details. We consider layer-wise training: first train the first layer of the learner generator \\(G\\) (together with an output deconvolution layer) to generate the lowest-resolution images \\(X_1\\), then train the second layer of \\(G\\) (together with an output deconvolution layer) to generate \\(X_2\\), the third layer to generate \\(X_3\\), and so on. We separate the learning into different phases.<\/p>\n\n\n\n<h2 id=\"phase-1-learn-the-output-deconvolution-layers-via-moment-matching\">Phase (1). Learn the output deconvolution layers via moment matching<\/h2>\n\n\n\n<p>To learn the output deconvolution layers (i.e. the deconvolution operator in \\(X_l\\)=\\(Deconv_{output}\\) \\((S_l^* )\\) for any layer \\(l\\)), we show it suffices for the generator to learn to match the moments between its output images and the target distribution\u2019s real images. 
This is a special property of distributions generated from the sparse coding model, known from earlier theoretical works such as <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"http:\/\/www.jmlr.org\/proceedings\/papers\/v40\/Anandkumar15.pdf\">Anandkumar et al.<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/p>\n\n\n\n<p>Intuitively, the output deconvolution layer can be written as a linear operator \\(X=Ay\\), where each column of the matrix \\(A\\) represents an edge-color feature, and \\(y\\) is a sparse vector that determines which edge-color features to use to paint the output image \\(X\\). It is known that, under standard regularity conditions (such as almost column orthogonality of \\(A\\) and sufficient sparsity of \\(y\\)), for any integer \\(C>0\\), the \\(C\\)-th order moment \\(\\mathbb{E}\\)\\([X^{\u2297C}]\u2248\\) \\(\u2211_i {A_i^{\u2297C}} \\)\\(\\mathbb{E}[y{_i^C}]\\), where \\(A_i\\) is the \\(i\\)-th column of \\(A\\). When this happens, matching the moments \\(\\mathbb{E}[X^{\u2297C}]\\) effectively determines the matrix \\(A\\) up to a column permutation.<\/p>\n\n\n\n<p>In other words, to learn the output deconvolution layer, it suffices to ensure that the GAN generator&nbsp;matches moments between its output images and real images. On the theoretical side, we show that a ReLU-type discriminator can discriminate the mismatch between the moments on images, and thus through gradient descent ascent the output deconvolution layers can be efficiently learned. 
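<\/p>\n\n\n\n<p>The following Python sketch (our own toy illustration with hypothetical names; a scalar Gaussian stands in for the image distribution, and a raw moment gap stands in for the ReLU-type discriminator) shows why low-order moments are a useful discriminating statistic:<\/p>\n\n\n\n

```python
# Toy illustration: treat the gap between low-order empirical moments of
# real vs. generated samples as a discriminator-style statistic. A generator
# whose output distribution matches the target drives this gap to near zero.
import random

random.seed(0)

def moment(samples, c):
    """Empirical c-th moment of a list of scalar samples."""
    return sum(s ** c for s in samples) / len(samples)

N = 50000
real = [random.gauss(2.0, 1.0) for _ in range(N)]        # target: N(2, 1)

def moment_gap(fake, orders=(1, 2, 3)):
    """Largest low-order moment mismatch against the real samples."""
    return max(abs(moment(real, c) - moment(fake, c)) for c in orders)

bad_fake = [random.gauss(0.0, 1.0) for _ in range(N)]    # wrong mean
good_fake = [random.gauss(2.0, 1.0) for _ in range(N)]   # matches target

bad_gap, good_gap = moment_gap(bad_fake), moment_gap(good_fake)
```

\n\n\n\n<p>In this toy setting, the mismatched generator leaves a large moment gap that a discriminator can exploit, while the matched one leaves only sampling noise.<\/p>\n\n\n\n<p>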
On the empirical side, Figure 6 (see also <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/2003.04033\">previous work<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>) shows that GANs are indeed doing moment matching during the earlier stage of training.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"542\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/04\/Figure6_GanBlog_high-res-1024x542.jpg\" alt=\"A series of three line graphs, labelled &quot;1st order moment matching&quot; to &quot;5th order moment matching,&quot; above a corresponding series of four images with progressively higher resolution, labeled &quot;epoch 1&quot; through &quot;epoch 20.&quot; \" class=\"wp-image-741289\" srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/04\/Figure6_GanBlog_high-res-1024x542.jpg 1024w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/04\/Figure6_GanBlog_high-res-300x159.jpg 300w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/04\/Figure6_GanBlog_high-res-768x407.jpg 768w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/04\/Figure6_GanBlog_high-res-16x8.jpg 16w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/04\/Figure6_GanBlog_high-res.jpg 1146w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption>Figure 6: we empirically verify that the moments begin to match within 10 epochs, so GAN\u2019s discriminator-generator framework is indeed doing moment matching at the earlier stage of the training.<\/figcaption><\/figure>\n\n\n\n<h2 id=\"phase-2-learn-the-first-hidden-layer-via-moment-matching-and-sparse-decoding\">Phase (2). 
Learn the first hidden layer via moment matching and sparse decoding<\/h2>\n\n\n\n<p>To learn the first hidden layer (i.e. the deconvolution operator in \\(S_1^*\\)=\\(ReLU(Deconv(z))\\)), we show it also suffices to match moments.<\/p>\n\n\n\n<p>Indeed, each coordinate of the first hidden layer \\(S{_1^*}\\) can be written as \\([S{_1^*}]_i\\)=\\(ReLU(\u03b1_i\u22c5g_i-\u03b2_i)\\), where \\(g_i\\) is a standard Gaussian. Given the moments of \\(S{_1^*}\\), even just up to a constant order, this uniquely determines \\(\u03b1_i\\) and \\(\u03b2_i\\) for every \\(i\\), as well as the pairwise correlations \u27e8\\(g_i\\), \\(g_j\\)\u27e9. In other words, if the discriminator can discriminate the moments of the hidden layer \\(S_1^*\\) from the target generator \\(G^*\\) and the moments of the hidden layer \\(S_1\\) from the learner generator \\(G\\), then this effectively determines the first deconvolution layer Deconv(\u22c5) up to a unitary transformation. The main difference from Phase (1) is that, unlike output images, the discriminator does not have access to the hidden layer \\(S_1^*\\) and cannot directly implement this moment matching process.<\/p>\n\n\n\n<p>Fortunately, using the sparse coding property, Phase (1) tells us the discriminator can learn the output deconvolution layers (recall those are \u201cedge-color features\u201d) to decent accuracy, and thus it can use them to perform decoding of \\(S_1^*\\) from the real images \\(X_1\\). 
This requires the first layer of the discriminator to use (approximately) the same set of edge-color features as the output layer of the generator, which is consistent with what happens in practice (see Figure 7).<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1009\" height=\"275\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/04\/Figure7_GAN_Blog_high-res.jpg\" alt=\"Two boxed image galleries connected by a two-sided arrow. One box represents features in the output (deconvolution) layer of the generator, trained on celebA data. The second represents features in the first hidden (convolution) layer of the discriminator, trained on celebA data. \" class=\"wp-image-741292\" srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/04\/Figure7_GAN_Blog_high-res.jpg 1009w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/04\/Figure7_GAN_Blog_high-res-300x82.jpg 300w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/04\/Figure7_GAN_Blog_high-res-768x209.jpg 768w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/04\/Figure7_GAN_Blog_high-res-16x4.jpg 16w\" sizes=\"auto, (max-width: 1009px) 100vw, 1009px\" \/><figcaption>Figure 7: the first hidden layer in the discriminator learns edge-color detectors, while the output layer of the generator also learns edge-color features.<\/figcaption><\/figure><\/div>\n\n\n\n<p>Putting Phases (1) and (2) together, the generator can learn not only the output deconvolution layer but also the first hidden deconvolution layer, and thus learn the distribution of \\(X_1\\), namely, the \u201ccoarsest-grained\u201d global structure of the image.<\/p>\n\n\n\n<h2 id=\"phase-3-learn-other-hidden-layers-via-supervised-learning-and-sparse-decoding\">Phase (3). 
Learn other hidden layers via supervised learning and sparse decoding<\/h2>\n\n\n\n<p>For any other hidden layer (i.e. the deconvolution operator in \\(S{_l^*}\\)=\\(ReLU(Deconv(S_{l-1}^*))\\) for \\(l\\)\u22652), since it captures increasingly sharp details of the image, the method of moments is no longer known to be sufficient.<\/p>\n\n\n\n<p>In this case, we show the discriminator learns to discriminate the statistical difference of the pair \\((X_{l-1},X_l)\\) between real images and \\(G\\)\u2019s output images. For example, the discriminator can discriminate whether \u201ca black dot in \\(X_2\\) always becomes an eye in \\(X_3\\)\u201d or not. When this is so, the generator can learn how images \\(X_{l-1}\\) can perform forward super-resolution into \\(X_l\\), layer by layer for each \\(l\\)\u22652.<\/p>\n\n\n\n<p>The learning process of forward super-resolution is reminiscent of supervised learning: the goal is to learn a one-hidden-layer neural network that takes as inputs lower-resolution images and outputs higher-resolution ones. Note that one main difference from supervised learning is that, in \\(S{_l^*}\\)=\\(ReLU(Deconv(S_{l-1}^*))\\), both the inputs \\(S_{l-1}^*\\) and the outputs \\(S{_l^*}\\) are hidden features, as opposed to the low- and high-resolution images \\(X_{l-1}\\),\\(X_l\\). Again, thanks to the sparse coding property and Phase (1), we show that the discriminator can decode these hidden features from their corresponding images \\(X_{l-1}\\),\\(X_l\\). 
This allows GANs to simulate supervised learning and learn how the lower-level features are being combined to generate higher-level features efficiently, purely using gradient descent ascent.<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"384\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/04\/Figure-8_GAN-blog_high-res-1024x384.jpg\" alt=\"A chart depicting increasing resolution of a photo of a face, as the GAN framework simulates supervised learning. \" class=\"wp-image-741295\" srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/04\/Figure-8_GAN-blog_high-res-1024x384.jpg 1024w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/04\/Figure-8_GAN-blog_high-res-300x112.jpg 300w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/04\/Figure-8_GAN-blog_high-res-768x288.jpg 768w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/04\/Figure-8_GAN-blog_high-res-16x6.jpg 16w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/04\/Figure-8_GAN-blog_high-res.jpg 1166w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption>Figure 8: the GAN framework can simulate supervised learning to learn 1-hidden-layer network from inputs \\(S_{l-1}^*\\) to outputs \\(S_l^*\\).<\/figcaption><\/figure><\/div>\n\n\n\n<p>Another key reason that GANs can learn these super-resolution operations efficiently is that such operations are very local: to perform super-resolution on a patch of an image, the generator only needs to look at nearby patches instead of the entire image. Indeed, the global structure has already been taken care of in the lower resolution layers. 
In other words, learning each hidden layer can be done essentially patch-wise instead of over the entire image, as we illustrate in Figure 9 below.<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"455\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/04\/Figure9_GAN_blog_high-res-1024x455.jpg\" alt=\"A chart showing progressively higher resolution photos of the eyes of one person, taken from a gallery of many faces.\" class=\"wp-image-741298\" srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/04\/Figure9_GAN_blog_high-res-1024x455.jpg 1024w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/04\/Figure9_GAN_blog_high-res-300x133.jpg 300w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/04\/Figure9_GAN_blog_high-res-768x341.jpg 768w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/04\/Figure9_GAN_blog_high-res-16x7.jpg 16w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/04\/Figure9_GAN_blog_high-res.jpg 1171w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption>Figure 9: forward super-resolution is a local operation, which makes the learning much simpler<\/figcaption><\/figure><\/div>\n\n\n\n<h2 id=\"empirical-evidence-of-forward-super-resolution\">Empirical evidence of forward super-resolution<\/h2>\n\n\n\n<p>In our work, we also conduct an experiment showing that the features learned from lower-resolution images are indeed extremely helpful for learning higher-resolution images. In Figure 10, we consider layer-wise training of GAN, where we first train only the first hidden layer of the generator, then freeze it and train only the second layer of the generator, and so on. 
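<\/p>\n\n\n\n<p>A drastically simplified Python sketch of such layer-wise training follows (our own toy illustration: scalar linear \u201clayers\u201d and a least-squares loss stand in for deconvolution layers and the GAN objective, and all names are hypothetical):<\/p>\n\n\n\n

```python
# Toy illustration of layer-wise training: layer l is trained against its
# own resolution's target while all earlier layers stay frozen.
import random

random.seed(1)
W_TRUE = [2.0, 3.0]                            # ground-truth layer weights
zs = [random.gauss(0.0, 1.0) for _ in range(200)]

def forward(ws, z, depth):
    """Compose the first `depth` scalar layers on input z."""
    s = z
    for w in ws[:depth]:
        s = w * s
    return s

ws = [0.5, 0.5]                                # learner's initialization
for l in range(2):                             # layer-wise: l = 0, then l = 1
    for _ in range(300):
        grad = 0.0
        for z in zs:
            target = forward(W_TRUE, z, l + 1)   # "image" at resolution l+1
            pred = forward(ws, z, l + 1)
            # gradient of (pred - target)^2 in w_l, earlier layers frozen
            grad += 2.0 * (pred - target) * (pred / ws[l])
        ws[l] -= 0.01 * grad / len(zs)
# After both stages, ws is close to W_TRUE: each layer was learned using
# only its own resolution's target, with the lower layers frozen.
```

\n\n\n\n<p>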
One can clearly see that the features learned from lower-resolution images can indeed be used to generate very non-trivial realistic images at higher resolutions. We believe this is strong evidence that the <em>forward super-resolution<\/em> property makes GAN training easy on distributions of real-life images, despite the worst-case hardness bounds.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"446\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/04\/Figure10_GAN_Blog_High-res-1024x446.jpg\" alt=\"A series of four image galleries depicting faces with resolution increasing from 8x8 to 64x64. Below the galleries are four cubes labeled Layerwise Training of GANs, with values increasing from 4x4x64 to 32x32x64.  \" class=\"wp-image-741301\" srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/04\/Figure10_GAN_Blog_High-res-1024x446.jpg 1024w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/04\/Figure10_GAN_Blog_High-res-300x131.jpg 300w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/04\/Figure10_GAN_Blog_High-res-768x335.jpg 768w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/04\/Figure10_GAN_Blog_High-res-16x7.jpg 16w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/04\/Figure10_GAN_Blog_High-res.jpg 1120w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption>Figure 10: layer-wise training of GANs supports our hypothesis that the features learned from lower-resolution images can indeed be used to generate realistic images at higher resolutions.<\/figcaption><\/figure>\n\n\n\n<h2 id=\"next-step-backward-feature-correction\">Next Step: Backward Feature Correction<\/h2>\n\n\n\n<p>We point out that our work is still preliminary and far from capturing the <em>full<\/em> picture of GANs. 
Most notably, we have focused on the \u201crealizable\u201d setting for proving our main theorem. Thus, from a theoretical standpoint, it suffices to perform layer-wise learning: namely, first learn the distribution \\(X_1\\)&nbsp;and hidden layer \\(S_1^*\\), then learn the distribution \\(X_2\\)&nbsp;and hidden layer \\(S_2^*\\), and so on.<\/p>\n\n\n\n<p>As we show in Figure 10, layer-wise forward super-resolution is already performing much better than learning from random lower-level features. However, in practice we might consider the more challenging \u201cagnostic\u201d setting, where the target distributions of images&nbsp;are generated with error. If we cannot learn a hidden layer sufficiently well during layer-wise training, this error propagates to deeper layers and may blow up. This is okay for generating simple images (see Figure 10). For more complicated images, we expect the generator network to reduce over-fitting to such errors in the lower-level layers by training the higher-level layers together with them. In other words, when training all layers together, we expect the lower-level layers to be able to also capture higher-resolution details (as opposed to solely learning lower-resolution images). This phenomenon is known as <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/2001.04413\"><em>Backward Feature Correction<\/em><span class=\"sr-only\"> (opens in new tab)<\/span><\/a> (BFC), which is equipped with provable guarantees in supervised deep learning. Extending the scope of BFC to GANs is an important next step.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>A generative adversarial network, or GAN, is one of the most powerful machine learning models proposed by Goodfellow et al. (opens in new tab) for learning to generate samples from complicated real-world distributions. 
GANs have sparked millions of applications, ranging from generating realistic images or cartoon characters to text-to-image translations. Turing award laureate Yann LeCun [&hellip;]<\/p>\n","protected":false},"author":38838,"featured_media":750997,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","msr-author-ordering":null,"msr_hide_image_in_river":0,"footnotes":""},"categories":[1],"tags":[],"research-area":[13561,13556,13546],"msr-region":[],"msr-event-type":[],"msr-locale":[268875],"msr-post-option":[],"msr-impact-theme":[],"msr-promo-type":[],"msr-podcast-series":[],"class_list":["post-741250","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-research-blog","msr-research-area-algorithms","msr-research-area-artificial-intelligence","msr-research-area-computational-sciences-mathematics","msr-locale-en_us"],"msr_event_details":{"start":"","end":"","location":""},"podcast_url":"","podcast_episode":"","msr_research_lab":[199565],"msr_impact_theme":[],"related-publications":[],"related-downloads":[],"related-videos":[],"related-academic-programs":[],"related-groups":[],"related-projects":[],"related-events":[],"related-researchers":[{"type":"guest","value":"yuanzhi-li","user_id":"670110","display_name":"Yuanzhi  Li","author_link":"<a href=\"https:\/\/www.andrew.cmu.edu\/user\/yuanzhil\/\" aria-label=\"Visit the profile page for Yuanzhi  Li\">Yuanzhi  Li<\/a>","is_active":true,"last_first":"Li, Yuanzhi ","people_section":0,"alias":"yuanzhi-li"}],"msr_type":"Post","featured_image_thumbnail":"<img width=\"960\" height=\"540\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/06\/1400x788_gan_blog_still_nologo_edited-960x540.jpg\" class=\"img-object-cover\" alt=\"A chart showing a GAN comparing 
fake images with real images, filtering them through a discriminator to produce a value indicating how fake the image is.\" decoding=\"async\" loading=\"lazy\" srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/06\/1400x788_gan_blog_still_nologo_edited-960x540.jpg 960w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/06\/1400x788_gan_blog_still_nologo_edited-300x169.jpg 300w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/06\/1400x788_gan_blog_still_nologo_edited-1024x576.jpg 1024w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/06\/1400x788_gan_blog_still_nologo_edited-768x432.jpg 768w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/06\/1400x788_gan_blog_still_nologo_edited-1536x864.jpg 1536w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/06\/1400x788_gan_blog_still_nologo_edited-2048x1152.jpg 2048w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/06\/1400x788_gan_blog_still_nologo_edited-16x9.jpg 16w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/06\/1400x788_gan_blog_still_nologo_edited-1066x600.jpg 1066w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/06\/1400x788_gan_blog_still_nologo_edited-655x368.jpg 655w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/06\/1400x788_gan_blog_still_nologo_edited-343x193.jpg 343w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/06\/1400x788_gan_blog_still_nologo_edited-640x360.jpg 640w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/06\/1400x788_gan_blog_still_nologo_edited-1280x720.jpg 1280w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/06\/1400x788_gan_blog_still_nologo_edited-1920x1080.jpg 1920w\" sizes=\"auto, (max-width: 960px) 100vw, 960px\" 
\/>","byline":"Zeyuan Allen-Zhu and <a href=\"https:\/\/www.andrew.cmu.edu\/user\/yuanzhil\/\" title=\"Go to researcher profile for Yuanzhi  Li\" aria-label=\"Go to researcher profile for Yuanzhi  Li\" data-bi-type=\"byline author\" data-bi-cN=\"Yuanzhi  Li\">Yuanzhi  Li<\/a>","formattedDate":"June 10, 2021","formattedExcerpt":"A Generative adversarial network, or GAN, is one of the most powerful machine learning models proposed by Goodfellow et al. (opens in new tab) for learning to generate samples from complicated real-world distributions. GANs have sparked millions of applications, ranging from generating realistic images or&hellip;","locale":{"slug":"en_us","name":"English","native":"","english":"English"},"_links":{"self":[{"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/posts\/741250","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/users\/38838"}],"replies":[{"embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/comments?post=741250"}],"version-history":[{"count":100,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/posts\/741250\/revisions"}],"predecessor-version":[{"id":753031,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/posts\/741250\/revisions\/753031"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/media\/750997"}],"wp:attachment":[{"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/media?parent=741250"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/categories?post=741250"},{"taxonomy":"post
_tag","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/tags?post=741250"},{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=741250"},{"taxonomy":"msr-region","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-region?post=741250"},{"taxonomy":"msr-event-type","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-event-type?post=741250"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=741250"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=741250"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=741250"},{"taxonomy":"msr-promo-type","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-promo-type?post=741250"},{"taxonomy":"msr-podcast-series","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-podcast-series?post=741250"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}