{"id":648279,"date":"2020-04-09T11:00:23","date_gmt":"2020-04-09T18:00:23","guid":{"rendered":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/?p=648279"},"modified":"2020-04-09T11:00:23","modified_gmt":"2020-04-09T18:00:23","slug":"a-deep-generative-model-trifecta-three-advances-that-work-towards-harnessing-large-scale-power","status":"publish","type":"post","link":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/blog\/a-deep-generative-model-trifecta-three-advances-that-work-towards-harnessing-large-scale-power\/","title":{"rendered":"A deep generative model trifecta: Three advances that work towards harnessing large-scale power"},"content":{"rendered":"<div class=\"mceTemp\"><\/div>\n<div id=\"attachment_649257\" style=\"width: 1290px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-649257\" class=\"wp-image-649257 size-full\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2020\/04\/HQ_1400x788_GIF.gif\" alt=\"An exponential growth curve shows a timeline of deep generative models from just before 2014 to 2020 on the X-axis. The X-axis also shows small scale to large scale from left to right. On the Y-axis, Model size is labeled ten million and below, up to over one billion parameters. In order, from left to right, are NLM (blue), V A E (green), and GAN (orange) all at small scale. Big GAN, Style GAN, and Style GAN 2 ( all orange) lie at 10 million to just over 100 million to the right and below the middle of the curve. GPT, GPT-2, Megatron, and T-NLG (all blue) are at the top left of the curve and above one billion. 
Optimus (green), FQ-GAN (orange), and Prevalent (blue) all lie offset to the right of the curve, representing large scale developments.\" width=\"1280\" height=\"720\" \/><p id=\"caption-attachment-649257\" class=\"wp-caption-text\">Figure 1: A brief evolution of deep generative models over time, measured by model size (number of parameters) and scientific impact (number of citations to date). Three popular deep generative model types are considered: Auto-regressive models (neural language models or NLMs) in blue, Variational Autoencoders (VAEs) in green, and Generative Adversarial Networks (GANs) in orange. Transformer and BERT are shown as references. The three new generative models we introduce in this post expand large-scale capabilities in each of these categories (right side of chart).<\/p><\/div>\n<p>One of the core aspirations in artificial intelligence is to develop algorithms and techniques that endow computers with an ability to synthesize the observed data in our world. Every time researchers build a model to imitate this ability, this model is called a generative model. If deep neural networks are involved in this model, the model is a deep generative model (DGM). As a branch of self-supervised learning techniques in deep learning, DGMs specifically focus on characterizing data generation processes.<\/p>\n<p>This post describes three projects that share a common theme: improving or applying DGMs in the era of large-scale datasets and training. 
In this post, we\u2019ll first review the evolution history of DGMs, then introduce new advances in DGMs made by researchers from Microsoft Research, in collaboration with members from the <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/cse.buffalo.edu\/~changyou\/\">University at Buffalo<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> and <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"http:\/\/people.ee.duke.edu\/~lcarin\/\">Duke University<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>. The generative models presented here, and detailed in their corresponding papers, each fall into a different category of popular deep generative model. Optimus is the first large-scale <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/1312.6114\">Variational Autoencoder<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> (VAE) language model, showing the opportunity of DGMs following a trend of pre-trained language models. FQ-GAN resolves scalability issues with image generation in <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/1406.2661\">Generative Adversarial Networks<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> (GANs). Finally, we introduce Prevalent, the first pre-trained generic agent for <a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/blog\/see-what-we-mean-visually-grounded-natural-language-navigation-is-going-places\/\">vision-and-language navigation<\/a>. 
Let\u2019s look at a quick overview of DGMs before diving into our new achievements.<\/p>\n<h3>Three types of generative models and a shared trick<\/h3>\n<p><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/en.wikipedia.org\/wiki\/Generative_model\">Generative models<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> have a long history in traditional machine learning, and they are often distinguished from the other main approach, <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/en.wikipedia.org\/wiki\/Discriminative_model\">discriminative models<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>. One may learn how they differ from a <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/nam06.safelinks.protection.outlook.com\/?url=https%3A%2F%2Fmedium.com%2F%40mlengineer%2Fgenerative-and-discriminative-models-af5637a66a3&data=02%7C01%7Cv-alhage%40microsoft.com%7Ccd50b6e0ed92479d9db108d7dbe28413%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637219638536975864&sdata=cX4w6eWX73NQ1OfpCOF7KN0dHPt%2BKS3Wx2V4DI3sXdQ%3D&reserved=0\">story of two siblings<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>. In the story, the siblings have different special abilities: one has the ability to learn everything in great depth, while the other can only learn the differences between what he sees. 
These siblings represent a generative model, which characterizes actual distributions with an internal mechanism, and a discriminative model, which builds decision boundaries between classes.<\/p>\n<p>With the rise of deep learning, a new family of methods, called deep generative models (DGMs), is formed through the combination of generative models and deep neural networks. Because neural networks used as generative models have a number of parameters smaller than the amount of data they are trained on, there is a trick that DGMs can perform. In an excellent <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/openai.com\/blog\/generative-models\/\">blog post from OpenAI<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, this trick is revealed: <em>\u201c\u2026models are <strong>forced to discover and efficiently internalize the essence of the data in order to generate it.<\/strong>\u201d<\/em> To learn more about DGMs, check out recent lectures at <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/deepgenerativemodels.github.io\/\">Stanford University<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/sites.google.com\/view\/berkeley-cs294-158-sp20\/home\">University of California Berkeley<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"http:\/\/stat.columbia.edu\/~cunningham\/teaching\/GR8201\/\">Columbia University<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> and <a class=\"msr-external-link glyph-append 
glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/cs.nyu.edu\/courses\/spring18\/CSCI-GA.3033-022\/\">New York University<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>.<\/p>\n<p>Mathematically, for a dataset of examples {\\(x_{i}\\)|\\(x_{i}\\) \\( \\in \\) \\(\\mathbb{R}\\)&lt;sup>&lt;em>D&lt;\/em>&lt;\/sup>, \\(i\\) = 1, &#8230; , \\(N\\)} as samples from a true data distribution \\( q(x)\\), the goal of a DGM is to build deep neural networks with parameters \\(\\theta\\) \\( \\in \\) \\(\\mathbb{R}\\)&lt;sup>&lt;em>P&lt;\/em>&lt;\/sup>, to describe a distribution \\(p_\\theta\\left(x\\right) \\) so that the parameters \\(\\theta\\) can be trained to ensure \\(p_\\theta\\left(x\\right) \\) match \\(q(x)\\) the best. All DGMs share this same basic setup and the above DGM trick, but they differ in the ways they approach the problem. There are three popular model types according to <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/openai.com\/blog\/generative-models\/\">OpenAI taxonomy<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>: <span style=\"color: #339966;\">VAEs<\/span>, <span style=\"color: #ff0000;\">GANs<\/span>, and <span style=\"color: #0000ff;\">auto-regressive models<\/span>. 
Each of these is detailed in the following table:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-649059 size-full\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2020\/04\/table_dgm17474.png\" alt=\"\" width=\"986\" height=\"498\" srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2020\/04\/table_dgm17474.png 986w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2020\/04\/table_dgm17474-300x152.png 300w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2020\/04\/table_dgm17474-768x388.png 768w\" sizes=\"auto, (max-width: 986px) 100vw, 986px\" \/><\/p>\n<h3>Moving from small-scale to large-scale deep generative models in all three categories<\/h3>\n<p>Thanks to many efforts on developing their theoretical principles over the years, DGMs are now relatively well understood at a small scale. The DGM trick mentioned above promises that the models work well under a mild condition: \\(P &lt; N \\times D\\), that is, the number of parameters is smaller than the total amount of training data. This has been verified in many early works at a small scale. However, recent years have witnessed tremendous advances and strong empirical results through pre-training large models on massive data (in the context of the condition above, \\(N\\) is increased dramatically).<\/p>\n<p>Researchers from OpenAI believe that generative models are one of the most promising approaches to potentially reach the goal of endowing computers with an understanding of our world. 
Along these lines, they developed <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/openai.com\/blog\/language-unsupervised\/\">Generative Pre-training<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> (GPT) in 2018, an autoregressive <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"http:\/\/www.jmlr.org\/papers\/volume3\/bengio03a\/bengio03a.pdf\">neural language model<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> (NLM) trained on a diverse corpus of unlabeled text, followed by discriminative fine-tuning on each specific task, showing significantly improved performance on multiple language understanding tasks. In 2019, they further scaled this idea up to 1.5 billion parameters and developed <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/openai.com\/blog\/better-language-models\/\">GPT-2<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, which shows near-human performance in language generation. With more compute, <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/1909.08053\">Megatron<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> and <a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/blog\/turing-nlg-a-17-billion-parameter-language-model-by-microsoft\/\">Turing-NLG<\/a> inherit the same idea and scale it up to 8.3 billion and 17 billion parameters, respectively.<\/p>\n<p>The above line of research shows that NLMs have made tremendous progress (\\(P\\) is increased dramatically in the condition above). Nevertheless, as an autoregressive model, the NLM is just one of three types of DGMs. 
There are still two other types of DGMs (VAE and GAN) that can be significantly improved for large-scale uses. In this exciting era, large models are trained on large datasets with massive compute, which has given rise to a new learning paradigm: self-supervised pre-training with task-specific fine-tuning. DGMs have been studied less in this setting, and it is unclear whether the DGM trick still works well at this scale in industrial practice. This raises a series of research questions, which we\u2019ll explore in relation to each project below:<\/p>\n<ul>\n<li><em><strong>Opportunity<\/strong><\/em>: How good are DGMs really with pre-training?<\/li>\n<li><em><strong>Challenge<\/strong><\/em>: Are modifications required to make existing methods work in this setting?<\/li>\n<li><em><strong>Application<\/strong><\/em>: Conversely, can DGMs benefit pre-training?<\/li>\n<\/ul>\n<h3>Optimus: Opportunities in language modeling<\/h3>\n<p>The central question addressed in our paper, called <a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/publication\/optimus-organizing-sentences-via-pre-trained-modeling-of-a-latent-space\/\">\u201cOptimus: Organizing sentences via Pre-trained Modeling of a Latent Space,<\/a>\u201d is: <strong>What will happen if we scale up a VAE and use it as a new pre-trained language model (PLM)?<\/strong> To address this question, we created Optimus (<strong>O<\/strong>rganizing sentences with <strong>p<\/strong>re-<strong>t<\/strong>rained <strong>m<\/strong>odeling of a <strong>u<\/strong>niversal latent <strong>s<\/strong>pace), the first large-scale deep latent variable model for natural language, which is pre-trained using the sentence-level (variational) autoencoder objectives on a large text corpus.<\/p>\n<p>Pre-trained language models have made substantial advancements across a variety of natural language processing tasks. 
PLMs are often trained to predict words based on their context in massive text data, and the learned models can be fine-tuned to adapt to various downstream tasks. PLMs can generally play two different roles: a generic encoder such as <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/github.com\/google-research\/bert\">BERT<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> and <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/ai.facebook.com\/blog\/roberta-an-optimized-method-for-pretraining-self-supervised-nlp-systems\/\">RoBERTa<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, and a powerful decoder such as <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/1909.08053\">GPT-2<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> and <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/1909.08053\">Megatron<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>. 
Sometimes, both tasks can be performed in one unified framework, such as in <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/github.com\/microsoft\/unilm\">UniLM<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/1910.13461\">BART<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, and <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/ai.googleblog.com\/2020\/02\/exploring-transfer-learning-with-t5.html\">T5<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>. These models lack explicit modeling of structures in a compact latent space, rendering it difficult to control natural language generation and representation from sentence-level semantics.<\/p>\n<p>When trained effectively, the <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/1511.06349\">Variational Autoencoder (VAE)<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> can be both a powerful generative model and an effective representation learning framework for natural language. By representing sentences in a low-dimensional latent space, VAEs allow easy manipulation of sentences using the corresponding compact vector representations (like the smooth feature regularization specified by prior distributions) and guided sentence generation with interpretable latent vector operators. 
Despite these attractive theoretical strengths, current language VAEs are often built with shallow network architectures, such as two-layer <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/en.wikipedia.org\/wiki\/Long_short-term_memory\">LSTMs<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>. This limits the model\u2019s capacity and leads to suboptimal performance. When a large amount of data is given, the DGM trick may break down if a shallow VAE is employed.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-648546 size-large aligncenter\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2020\/04\/latex-gen-models-1024x222.png\" alt=\"\" width=\"1024\" height=\"222\" srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2020\/04\/latex-gen-models-1024x222.png 1024w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2020\/04\/latex-gen-models-300x65.png 300w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2020\/04\/latex-gen-models-768x167.png 768w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2020\/04\/latex-gen-models.png 1110w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/p>\n<p>For a sentence of length \\(T\\), \\(x = [x_{1}, \\ldots, x_{T}]\\), an autoregressive NLM generates the current token \\(x_{t}\\) conditioned on the previous word tokens \\(x_{&lt;t}\\), as shown in Equation 1 above. As a result, there is limited capacity for the generation to be guided by higher-level semantics. GPT-2 is perhaps the most well-known NLM instance, pre-trained on large amounts of text. In contrast, a VAE generates \\(x_{t}\\) conditioned on both the previous word tokens \\(x_{&lt;t}\\) and a latent variable \\(z\\), as shown in Equation 2 above. 
The latent \\(z\\) determines the high-level semantics (that is, the \u201coutline\u201d) of a sentence, such as tense, topic, or sentiment, guiding a sequential decoding process to fill in details of the outline. The decoder \\(\\theta\\) is combined with an encoder \\(\\phi\\). VAEs learn the parameters by maximizing <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/www.cs.princeton.edu\/courses\/archive\/fall11\/cos597C\/lectures\/variational-inference-i.pdf\">a lower bound<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> on the log likelihood of the data.<\/p>\n<p><div id=\"attachment_648420\" style=\"width: 440px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-648420\" class=\"wp-image-648420 \" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2020\/04\/Generative-Models-Fig-2-300x152.png\" alt=\"Figure 2a: Optimus Architecture: variable x moves through an encoder (made up of BERT and H[CLS]) then moves through variable Z, then into a decoder (made up of variable H and GPT-2), and finally into variable x. Figure 2b: Memory: variable Z moves into a 3 by 4 square memory block. the first column of 3 squares is white. The rest are blue. Under the first blue column, X0, the second XT minus 1, the third X with subscript T. 
Embedding: variable Z moves through Latent plus Word plus Positional.\" width=\"430\" height=\"218\" srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2020\/04\/Generative-Models-Fig-2-300x152.png 300w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2020\/04\/Generative-Models-Fig-2.png 742w\" sizes=\"auto, (max-width: 430px) 100vw, 430px\" \/><p id=\"caption-attachment-648420\" class=\"wp-caption-text\">Figure 2: (a) Optimus architecture, made up of an encoder and a decoder, and (b) latent vector injection.<\/p><\/div>The Optimus architecture is shown in Figure 2a. To help training, we initialize the encoder with BERT and the decoder with GPT-2. The output feature of the [CLS] token is used to obtain the latent variable \\(z\\). To inject \\(z\\) into GPT-2 decoding without re-training the weights from scratch, we study two schemes, illustrated in Figure 2b. In the first scheme, \\(z\\) plays the role of an additional memory vector for the decoder to attend to. In the second scheme, \\(z\\) is added to the bottom embedding layer of the decoder and directly used in every decoding step. Empirically, we found that the first, memory-based scheme works better. To prevent the notorious <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/1511.06349\">KL vanishing issue<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, we employ the <a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/blog\/less-pain-more-gain-a-simple-method-for-vae-training-with-less-of-that-kl-vanishing-agony\">cyclical annealing schedule<\/a> and <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/1909.00868\">dimension-wise thresholding<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> techniques. 
As a new type of PLM, the proposed Optimus shows interesting results, demonstrating its unique advantages compared with existing PLMs:<\/p>\n<p>&nbsp;<\/p>\n<ul>\n<li><em><strong>Language Modeling<\/strong><\/em>\u2014We consider four datasets, including <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/www.aclweb.org\/anthology\/J93-2004\/\">Penn Treebank<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/nlp.stanford.edu\/projects\/snli\/\">SNLI<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/1702.08139\">Yahoo<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, and <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/1702.08139\">Yelp<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> corpora, and fine-tune the PLM for one <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/towardsdatascience.com\/epoch-vs-iterations-vs-batch-size-4dfb9c7ce9c9\">epoch<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> on each. Optimus achieves lower perplexity than GPT-2 on three of the four datasets, due to the knowledge encoded in the latent prior distribution. 
Compared with all existing small VAEs, Optimus shows much better representation learning performance, measured by <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/github.com\/jxhe\/vae-lagging-encoder\">mutual information<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> and <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/1509.00519\">active units<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>. This implies that pre-training by itself is an effective approach to alleviate the KL vanishing issue.<\/li>\n<li><em><strong>Guided Language Generation<\/strong><\/em>\u2014Thanks to the latent variable, Optimus has the unique ability to control sentence generation at the semantic level, which is not feasible with GPT-2. This opens up new ways to steer language generation. In Figure 3 (below), we illustrate the idea using some simple latent vector manipulation in two scenarios: 1) sentence-level transfer via the arithmetic operation of latent vectors: \\(z_{D} = z_{B} - z_{A} + z_{C}\\), and 2) sentence interpolation: \\(z_{\\tau} = z_{1} \\cdot (1 - \\tau) + z_{2} \\cdot \\tau\\), where \\(\\tau \\in [0,1]\\). For more sophisticated latent space manipulation, we consider dialog response generation, stylized response generation, and label-conditional sentence generation. Optimus outperforms existing methods on all these tasks.<\/li>\n<li><em><strong>Low-resource Language Understanding<\/strong><\/em>\u2014Optimus learns a smoother space and more separated feature patterns than BERT (Figure 4a and 4b below). 
This allows Optimus to yield better classification performance and faster adaptation than BERT when used as a feature-based approach (the backbone network is frozen and only the classifier is updated), as it allows Optimus to maintain and exploit the latent structure learned in pre-training. Figure 4c shows the results with a varying number of labelled samples per class on the Yelp review dataset; Optimus shows much better results in the low-compute scenarios (feature-based settings). A similar trend is observed on the GLUE benchmark.<\/li>\n<\/ul>\n<div id=\"attachment_649152\" style=\"width: 597px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-649152\" class=\" wp-image-649152\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2020\/04\/fig317620-300x159.png\" alt=\"Two columns. Column one title: Source X A. A girl makes a silly face. Column one heading: Input X C. Bulleted list of examples. A girl poses for a picture. A girl in a blue shirt is taking pictures of a microscope. A woman with a red scarf looks at the stars. A boy is taking a bath. A little boy is eating a bowl of soup. Column two title: Target X B. two soccer players are playing soccer. Column two heading: Output X D. Bulleted list of examples in blue. Two soccer players at a soccer game. Two Football players in blue uniforms are at a field hockey game. Two men in white uniforms are field hockey players. Two baseball players are at the baseball diamond. Two men are in baseball practice. Figure 3b alt text List of sentence examples. 0.0 and 1.0 are in black. 0.1 through 0.9 are in blue. 0.0: children are looking for the water to be clear. 0.1: children are looking for the water. 0.2: children are looking at the water. 0.3: the children are looking at a large group of people. 0.4: the children are watching a group of people. 0.5: the people are watching a group of ducks. 
0.6: the people are playing soccer in the field. 0.7: there are people playing a sport. 0.8 there are people playing a soccer game. 0.9: there are two people playing soccer. 1.0: there are two people playing soccer.\" width=\"587\" height=\"311\" srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2020\/04\/fig317620-300x159.png 300w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2020\/04\/fig317620-768x408.png 768w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2020\/04\/fig317620.png 830w\" sizes=\"auto, (max-width: 587px) 100vw, 587px\" \/><p id=\"caption-attachment-649152\" class=\"wp-caption-text\">Figure 3: (a) sentence transfer; (b) sentence interpolation. Blue indicates generated sentences.<\/p><\/div>\n<div id=\"attachment_648486\" style=\"width: 1034px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-648486\" class=\"wp-image-648486 size-large\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2020\/04\/Gen-Models-Fig-4-1024x343.png\" alt=\"On the left, a scatter plot comparison of Optimus and BERT feature space visualization, which contain orange and blue points. Optimus is a circular pattern, with blue points heavy at the top half of the circle and orange points heavy at the bottom. BERT is more oblong, moving in a downward trend beginning heavily blue and ending in orange. 
Figure 4c: Classification Accuracy graph shows that Optimus outperforms BERT with feature data and is similar to BERT's performance on finetuning.\" width=\"1024\" height=\"343\" srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2020\/04\/Gen-Models-Fig-4-1024x343.png 1024w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2020\/04\/Gen-Models-Fig-4-300x100.png 300w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2020\/04\/Gen-Models-Fig-4-768x257.png 768w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2020\/04\/Gen-Models-Fig-4.png 1147w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><p id=\"caption-attachment-648486\" class=\"wp-caption-text\">Figure 4: (a) and (b) Feature space visualization using t-SNE for Optimus and BERT, respectively. Sentences with different labels are rendered in different colors. (c) Results with a varying number of labelled samples.<\/p><\/div>\n<p>Please check out our <a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/publication\/optimus-organizing-sentences-via-pre-trained-modeling-of-a-latent-space\/\">paper<\/a> for more results, and play with Optimus on <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/github.com\/ChunyuanLI\/Optimus\">GitHub<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>.<\/p>\n<h3>FQ-GAN: Challenges in image generation<\/h3>\n<p>GAN is a popular model for image generation. It consists of two networks\u2014a generator to directly synthesize fake samples that mimic real samples and a discriminator to distinguish between real samples \\((x)\\) and fake samples \\((\\hat{x})\\). 
The two networks are trained in an adversarial manner so that the fake data distribution can match the real data distribution.<\/p>\n<p><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/1702.08398\">Feature matching<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> is a principled technique that casts the data distribution matching problem of GANs into a distribution matching problem in the feature space of the discriminator. This requires feature statistics (first- or second-order moments), estimated from the entirety of both fake and real samples, to be similar. In practice, these feature statistics are estimated using mini-batches in a continuous feature space. As the dataset becomes much larger and more complex (for example, higher resolutions), the quality of the mini-batch-based estimate degrades because its variance is large for a fixed batch size. This issue is particularly severe for GANs, as the induced <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/1811.11083\">fake sample distribution of the generator is always changing<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> in training, which poses a new challenge in scaling up GANs for large-scale settings.<\/p>\n<p>To solve this problem, in our paper <a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/publication\/feature-quantization-improves-gan-training\/\">\u201cFeature Quantization Improves GAN Training,\u201d<\/a> we propose feature quantization (FQ) for the discriminator, which represents images in a quantized space rather than in a continuous space. The neural network architecture of FQ-GAN is illustrated in Figure 5a. An FQ step is injected into the discriminator of a standard GAN. 
It restricts continuous features to a prescribed set of values, specifically feature centroids from a dictionary.<\/p>\n<p>Since both true and fake samples can only choose their representations from the limited dictionary items, FQ-GAN indirectly performs feature matching. This can be illustrated using the visualization example in Figure 5b, where true features \\(h\\) and fake features \\(\\tilde{h}\\) are quantized into the same centroids (features that share a nearest centroid are shown in the same color in this example). We use moving-average updates to implement an evolving dictionary \\(E\\), which ensures the dictionary contains a set of centroids that are consistent with recent features in training.<\/p>\n<div id=\"attachment_648489\" style=\"width: 1034px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-648489\" class=\"wp-image-648489 size-large\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2020\/04\/Gen-Models-Fig-5-1024x270.png\" alt=\"A. The FQ-GAN architecture from right to left: Z moves through a generator that distinguishes between fake samples and real samples. Next, it travels through a discriminator. Bottom, FQ that restricts continuous features into feature centroids from a dictionary E, Top, and then True or Fake. B. A color gradient visualization showing denser large circles in the center of various connected polygon shapes. The shapes are adjacent to one another in a honeycomb-like pattern that is not uniform. From left to right: (upper) purple, red. (middle) aquamarine, green, blue. (lower) gold. 
The center green polygon contains variables for true features, E K, and fake features.\" width=\"1024\" height=\"270\" srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2020\/04\/Gen-Models-Fig-5-1024x270.png 1024w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2020\/04\/Gen-Models-Fig-5-300x79.png 300w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2020\/04\/Gen-Models-Fig-5-768x202.png 768w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2020\/04\/Gen-Models-Fig-5.png 1037w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><p id=\"caption-attachment-648489\" class=\"wp-caption-text\">Figure 5: (a) FQ-GAN architecture: FQ is added as a new layer in the discriminator of a standard GAN; (b) dictionary look-up as implicit feature matching. Points in the same color represent continuous features that are quantized into the same centroid (represented by big circles). True features (square) and fake features (triangle) are forced to share the same centroid after FQ.<\/p><\/div>\n<p>The proposed FQ technique can be easily plugged into existing GAN models, with little computational overhead in training. Extensive experiments show that FQ-GAN improves the image generation quality of baseline methods by a large margin on a variety of tasks, covering three representative GAN models on nine benchmarks:<\/p>\n<ul>\n<li><em><strong>BigGAN<\/strong> for Image Generation<\/em>. 
<a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/github.com\/ajbrock\/BigGAN-PyTorch\">BigGAN<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, introduced by Google DeepMind in 2018, is perhaps the largest GAN model. We compare FQ-GAN to BigGAN on three datasets (with an increasing number of classes or images): <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/www.cs.toronto.edu\/~kriz\/cifar.html\">CIFAR 10<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/www.cs.toronto.edu\/~kriz\/cifar.html\">CIFAR 100<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, and <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"http:\/\/www.image-net.org\/\">ImageNet<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>. FQ-GAN consistently outperforms BigGAN by more than 10% with regard to <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/github.com\/bioinf-jku\/TTUR\">FID<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> values (a measure of dissimilarity between the feature statistics of true and fake data). 
Our method also improves <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/github.com\/batmanlab\/twin-auxiliary-classifiers-gan\">Twin Auxiliary Classifiers GAN<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, a recent variant that appeared at <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/nips.cc\/Conferences\/2019\">NeurIPS 2019<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> and is particularly suited to fine-grained image datasets.<\/li>\n<li><em><strong>StyleGAN<\/strong> for Face Synthesis.<\/em> <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/github.com\/NVlabs\/stylegan\">StyleGAN<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, introduced by NVIDIA in December 2018, can generate high-fidelity images that look like portraits of human faces (think of <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/en.wikipedia.org\/wiki\/Deepfake\">deepfakes<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>). It is built upon progressive GANs but gives researchers more control over specific visual features. We use the <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/github.com\/NVlabs\/ffhq-dataset\">FFHQ<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> dataset, with image resolutions ranging from 32 to 1024. 
Results show that FQ-GAN converges faster and yields better final performance.<\/li>\n<li><em><strong>U-GAT-IT<\/strong> for Unsupervised Image-to-Image Translation.<\/em> <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/github.com\/taki0112\/UGATIT\">U-GAT-IT<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> is a state-of-the-art method for <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/junyanz.github.io\/CycleGAN\/\">image style transfer<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> that appeared at <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/openreview.net\/forum?id=BJlZ5ySKPH\">ICLR 2020<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>. On five benchmark datasets, we see that FQ substantially improves performance and is rated better in human perceptual evaluation.<\/li>\n<\/ul>\n<p>If you\u2019d like to improve your GANs using FQ, check out our <a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/publication\/feature-quantization-improves-gan-training\/\">paper<\/a> and <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/github.com\/YangNaruto\/FQ-GAN\">code<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> on GitHub.<\/p>\n<h3>Prevalent: Applications for vision-and-language navigation<\/h3>\n<p>With further semantic-level understanding of images and language, one natural next step is to endow an agent with the ability to take actions to complete a task with multimodal inputs. Learning to navigate in a visual environment by following natural-language instructions is one basic challenge towards this goal. 
Ideally, we\u2019d like to train a generic agent once and allow it to adapt quickly to different downstream tasks.<\/p>\n<p>To this end, we propose <a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/publication\/towards-learning-a-generic-agent-for-vision-and-language-navigation-via-pre-training\/\">Prevalent<\/a>, the first agent that follows the pre-training and fine-tuning paradigm. We represent each pre-training data sample as a triplet (image-text-action) and pre-train the model with two objectives: masked language modeling and action prediction (as illustrated in Figure 6a below). Since no final downstream learning objectives are involved, such a self-supervised learning approach often requires a large number of training samples to capture the underlying structure of image-text data and generalize well to new tasks.<\/p>\n<p>In our study, detailed in our paper <a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/publication\/towards-learning-a-generic-agent-for-vision-and-language-navigation-via-pre-training\/\">\u201cTowards Learning a Generic Agent for Vision-and-Language Navigation via Pre-training,\u201d<\/a> we found that the largest training dataset, <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/1711.07280\">R2R<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, contains only 104,000 samples, an order of magnitude smaller than the pre-training datasets typically used in <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/1810.04805\">language<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> or <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" 
href=\"https:\/\/arxiv.org\/abs\/1909.11059\">vision-and-language<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> pre-training. This means pre-training can suffer from insufficient training data, while collecting more samples with human annotations is expensive.<\/p>\n<p>Fortunately, we can resort to a DGM to synthesize the samples. We first train, on the R2R dataset, an auto-regressive model (a <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"http:\/\/ronghanghu.com\/speaker_follower\/\">speaker<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> model) that produces language instructions conditioned on the agent trajectory (a sequence of actions and visual images). Then, we collect a large number of the shortest trajectories using the <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/github.com\/peteanderson80\/Matterport3DSimulator\">Matterport 3D Simulator<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, and we synthesize their corresponding instructions using the speaker model. This yields 6,482,000 new training samples. The two datasets are compared in Figure 6b. The agent is pre-trained on the combined dataset.<\/p>\n<div id=\"attachment_648495\" style=\"width: 993px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-648495\" class=\"wp-image-648495 size-full\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2020\/04\/Gen-Models-_-FIg-6.png\" alt=\"(A) A white box showing pre-training, heading says R2R: 1) Attend Masked LM. 2) Action Prediction. An image-text-action triplet is shown in the box. An arrow from white box points to three stacked green boxes, labeled together as fine-tuning. 
R2R box and CVDN shows a circle with a curved arrow to a star. The last box, HANNA, shows a curved arrow moving from a circle to a square to a square to a star. (B) A pie chart shows 1.6 percent for the R2R training set at 104,000 triplets, 98.4 percent for the synthesized dataset and 6,482,000 triplets. \" width=\"983\" height=\"296\" srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2020\/04\/Gen-Models-_-FIg-6.png 983w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2020\/04\/Gen-Models-_-FIg-6-300x90.png 300w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2020\/04\/Gen-Models-_-FIg-6-768x231.png 768w\" sizes=\"auto, (max-width: 983px) 100vw, 983px\" \/><p id=\"caption-attachment-648495\" class=\"wp-caption-text\">Figure 6: (a) Prevalent learning pipeline: the agent is pre-trained on a heavily augmented R2R dataset and fine-tuned on three downstream tasks; (b) composition of the pre-training data: 98.4% synthesized data and 1.6% real data.<\/p><\/div>\n<p>We fine-tune the agent on three downstream navigation tasks: <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/1711.07280\">R2R<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> and two out-of-domain tasks, <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/1907.04957\">CVDN<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> and <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/1909.01871\">HANNA<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>. Our agent achieves state-of-the-art results on all three tasks. 
Ultimately, Prevalent shows that the synthesized samples produced by the DGM can be used for pre-training, and they improve generalization. Please read our <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"http:\/\/cvpr2020.thecvf.com\/\">CVPR 2020<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> <a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/publication\/towards-learning-a-generic-agent-for-vision-and-language-navigation-via-pre-training\/\">paper<\/a> for more details. We have released our pre-trained model, datasets, and code for Prevalent on <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/github.com\/weituo12321\/PREVALENT\">GitHub<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>. We hope it can set a strong baseline for future research on self-supervised pre-training for vision-and-language navigation.<\/p>\n<h3>Going forward: new applications, combining techniques, and self-supervised learning<\/h3>\n<p>From the examples above, we have seen the <strong>opportunities<\/strong>, <strong>challenges<\/strong>, and <strong>applications<\/strong> in the process of scaling up both datasets and training for DGMs. As we continue to advance these models and increase their scale, we can expect to synthesize high-fidelity images or language samples. 
This may itself find new applications in various domains, such as <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/deepart.io\/\">artistic image synthesis<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> or <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/2002.12328\">task-oriented dialog<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>. Meanwhile, the boundaries of these three model types can easily become blurred: researchers may be able to combine their strengths to pursue further improvement. The techniques behind DGMs also point to the promise of <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/lilianweng.github.io\/lil-log\/2019\/11\/10\/self-supervised-learning.html\">self-supervised learning<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>: as networks learn to model the process that generates (partially) observed data, they may capture the underlying structure of the data, producing features that are useful for many tasks.<\/p>\n<p><em><strong>Acknowledgments<\/strong><\/em><\/p>\n<p>The authors gratefully acknowledge the entire Project Philly Team inside Microsoft for providing our computing platform. 
Some implementations in our experiments depend on open-source projects on GitHub: <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/github.com\/huggingface\/transformers\">HuggingFace Transformers<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/github.com\/ajbrock\/BigGAN-PyTorch\">BigGAN<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/github.com\/NVlabs\/stylegan\">StyleGAN<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/github.com\/taki0112\/UGATIT\">U-GAT-IT<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/github.com\/peteanderson80\/Matterport3DSimulator\">Matterport3D Simulator<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, and <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/github.com\/ronghanghu\/speaker_follower\">Speaker-Follower<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>. 
We acknowledge all the authors who have made their code public, which tremendously accelerates our progress.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>One of the core aspirations in artificial intelligence is to develop algorithms and techniques that endow computers with an ability to synthesize the observed data in our world. Every time researchers build a model to imitate this ability, this model is called a generative model. If deep neural networks are involved in this model, the [&hellip;]<\/p>\n","protected":false},"author":38838,"featured_media":648513,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","msr-author-ordering":[{"type":"user_nicename","value":"Chunyuan Li","user_id":"37971"},{"type":"user_nicename","value":"Jianfeng Gao","user_id":"32246"}],"msr_hide_image_in_river":0,"footnotes":""},"categories":[1],"tags":[],"research-area":[13556],"msr-region":[],"msr-event-type":[],"msr-locale":[268875],"msr-post-option":[],"msr-impact-theme":[],"msr-promo-type":[],"msr-podcast-series":[],"class_list":["post-648279","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-research-blog","msr-research-area-artificial-intelligence","msr-locale-en_us"],"msr_event_details":{"start":"","end":"","location":""},"podcast_url":"","podcast_episode":"","msr_research_lab":[],"msr_impact_theme":[],"related-publications":[],"related-downloads":[],"related-videos":[],"related-academic-programs":[],"related-groups":[144931],"related-projects":[],"related-events":[],"related-researchers":[{"type":"user_nicename","value":"Jianfeng Gao","user_id":32246,"display_name":"Jianfeng Gao","author_link":"<a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/people\/jfgao\/?lang=fr-ca\" aria-label=\"Visitez la page de profil pour 
Jianfeng Gao\">Jianfeng Gao<\/a>","is_active":false,"last_first":"Gao, Jianfeng","people_section":0,"alias":"jfgao"}],"msr_type":"Post","featured_image_thumbnail":"<img width=\"960\" height=\"461\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2020\/04\/Slide1-960x461.png\" class=\"img-object-cover\" alt=\"\" decoding=\"async\" loading=\"lazy\" \/>","byline":"Chunyuan Li and <a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/people\/jfgao\/\" title=\"Go to researcher profile for Jianfeng Gao\" aria-label=\"Go to researcher profile for Jianfeng Gao\" data-bi-type=\"byline author\" data-bi-cN=\"Jianfeng Gao\">Jianfeng Gao<\/a>","formattedDate":"April 9, 2020","formattedExcerpt":"One of the core aspirations in artificial intelligence is to develop algorithms and techniques that endow computers with an ability to synthesize the observed data in our world. Every time researchers build a model to imitate this ability, this model is called a generative model.&hellip;","locale":{"slug":"en_us","name":"English","native":"","english":"English"},"_links":{"self":[{"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/posts\/648279","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/users\/38838"}],"replies":[{"embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/comments?post=648279"}],"version-history":[{"count":91,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/posts\/648279\/revisions"}],"predecessor-version":[{"id":649260,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/posts\/648279\/revisions\/649260"}],"wp:featuredmedia":[{"embeddable":true,"href":
"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/media\/648513"}],"wp:attachment":[{"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/media?parent=648279"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/categories?post=648279"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/tags?post=648279"},{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=648279"},{"taxonomy":"msr-region","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-region?post=648279"},{"taxonomy":"msr-event-type","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-event-type?post=648279"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=648279"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=648279"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=648279"},{"taxonomy":"msr-promo-type","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-promo-type?post=648279"},{"taxonomy":"msr-podcast-series","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-podcast-series?post=648279"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}