{"id":578602,"date":"2019-04-15T08:51:05","date_gmt":"2019-04-15T15:51:05","guid":{"rendered":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/?p=578602"},"modified":"2019-06-26T14:19:52","modified_gmt":"2019-06-26T21:19:52","slug":"less-pain-more-gain-a-simple-method-for-vae-training-with-less-of-that-kl-vanishing-agony","status":"publish","type":"post","link":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/blog\/less-pain-more-gain-a-simple-method-for-vae-training-with-less-of-that-kl-vanishing-agony\/","title":{"rendered":"Less pain, more gain: A simple method for VAE training with less of that KL-vanishing agony"},"content":{"rendered":"<p><a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2019\/04\/Cyclical-Annealing-Schedule_Site_04_2019_1400x788-2.png\"><img loading=\"lazy\" decoding=\"async\" class=\"size-large wp-image-578701 aligncenter\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2019\/04\/Cyclical-Annealing-Schedule_Site_04_2019_1400x788-2-1024x432.png\" alt=\"\" width=\"1024\" height=\"432\" srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2019\/04\/Cyclical-Annealing-Schedule_Site_04_2019_1400x788-2-1024x432.png 1024w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2019\/04\/Cyclical-Annealing-Schedule_Site_04_2019_1400x788-2-300x127.png 300w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2019\/04\/Cyclical-Annealing-Schedule_Site_04_2019_1400x788-2-768x324.png 768w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2019\/04\/Cyclical-Annealing-Schedule_Site_04_2019_1400x788-2-665x280.png 665w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2019\/04\/Cyclical-Annealing-Schedule_Site_04_2019_1400x788-2.png 1399w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/a><\/p>\n<p>There is a growing interest in exploring the use of <a class=\"msr-external-link 
glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/1312.6114\">variational auto-encoders (VAE), a deep latent variable model<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, for text generation. Compared to the standard <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/ieeexplore.ieee.org\/document\/5947611\">RNN-based language model<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> that generates sentences one word at a time without the explicit guidance of a global sentence representation, VAE is designed to learn a probabilistic representation of global language features such as topic, sentiment or language style, and makes the text generation more controllable. For example, VAE can generate sentences with a specific tense, sentiment or topic.<\/p>\n<p>However, training VAE on languages is notoriously difficult due to something called <em>KL vanishing<\/em>. While VAE is designed to learn to generate text using both local context and global features, it tends to depend solely on local context and ignore global features when generating text. 
When this happens, the VAE essentially behaves like a standard RNN language model.<\/p>\n<p>In \u201c<a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/1903.10145\">Cyclical Annealing Schedule: A Simple Approach to Mitigating KL Vanishing<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>,\u201d to be presented at the <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/naacl2019.org\/\">2019 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL)<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, researchers at <a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/lab\/microsoft-research-ai\/\">Microsoft Research AI<\/a> and <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/www.duke.edu\/\">Duke University<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> propose an extremely simple remedy to KL vanishing and make the <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/github.com\/haofuml\/cyclical_annealing\">code publicly available on GitHub<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>. The remedy is based on a new scheduling scheme called the <em>Cyclical Annealing Schedule<\/em>. Intuitively, during the course of VAE training, we periodically adjust the weight of the KL term in the objective function, giving the model opportunities to learn to leverage the global latent variables in text generation, and thus to encode as much global information in the latent variables as possible. 
The paper briefly describes KL vanishing and why it happens, introduces the proposed remedy, and illustrates the VAE learning process using a synthetic dataset.<\/p>\n<h3>What is KL vanishing and why does it happen?<\/h3>\n<p>VAEs aim to learn probabilistic representations <strong>z<\/strong> of natural languages <strong>x<\/strong>, with an objective consisting of two terms: (1) <em>reconstruction<\/em> to guarantee the inferred latent feature <strong>z<\/strong> can represent its corresponding observed sentence; and (2) KL <em>regularization<\/em> to leverage the prior knowledge to modulate language understanding. The two terms are balanced by a weighting hyper-parameter \u03b2:<\/p>\n<p><a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2019\/04\/equation.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-578617\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2019\/04\/equation-300x42.png\" alt=\"\" width=\"502\" height=\"71\" srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2019\/04\/equation-300x42.png 300w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2019\/04\/equation.png 742w\" sizes=\"auto, (max-width: 502px) 100vw, 502px\" \/><\/a><\/p>\n<p>When applied on text corpora, VAEs typically employ an auto-regressive decoder, which sequentially generates the word tokens based on ground-truth words in the previous steps, in conjunction with latent <strong>z<\/strong>. <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/1511.06349\">Recent work<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> has found that na\u00efve training of VAEs (keeping constant \u03b2=1) leads to model degeneration\u2014the KL term becomes vanishingly small. 
This issue causes two undesirable outcomes: (1) the learned features are almost identical to the uninformative Gaussian prior, for all observed languages; and (2) the decoder completely ignores the latent feature, and the learned model reduces to a simpler <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/ieeexplore.ieee.org\/document\/5947611\">neural language model<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>. Hence, the KL <em>vanishing<\/em> issue.<\/p>\n<p>This negative result is so far poorly understood. We developed a <em>two-path<\/em> competition interpretation to shed light on the issue. Let\u2019s first look at the standard VAE in Figure 1 (a), below. The reconstruction of sequence <strong>x<\/strong>=[x<sub>1<\/sub> ,&#8230;,x<sub>T<\/sub>] depends only on one path passing through the encoder \u03d5, latent representation <strong>z<\/strong> and decoder \u0398. However, when an auto-regressive decoder is used in a VAE, there are two paths from observed <strong>x<\/strong> to its reconstruction, as shown in Figure 1(b). Path A is the same as that in the standard VAE, where <strong>z<\/strong> serves as the global representation that controls the generation of <strong>x<\/strong>; Path B leaks the partial ground-truth information of <strong>x<\/strong> at every time step of the sequential decoding. It generates x<sub>t<\/sub> conditioned on x<sub>&lt;<i>t<\/i><\/sub>=[x<sub>1<\/sub>,&#8230;,x<sub>t-1<\/sub>]. Therefore, Path B can potentially bypass Path A to generate x<sub>t<\/sub>, leading to KL vanishing. From this perspective, we hypothesize that the KL vanishing problem is related to the low quality of <strong>z<\/strong> at the beginning phase of decoder training. 
This is highly likely when the naive constant schedule of \u03b2=1 is used, as the KL term pushes <strong>z<\/strong> close to an uninformative prior, making it less representative of the corresponding observations. This lower-quality <strong>z<\/strong> makes reconstructing <strong>x<\/strong> more difficult, and eventually blocks the information flow via Path A. As a result, the model is forced to learn an easier solution to decoding: generating <strong>x<\/strong> via Path B only.<\/p>\n<div id=\"attachment_578647\" style=\"width: 629px\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2019\/04\/standard-VAE.png\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-578647\" class=\"wp-image-578647 size-full\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2019\/04\/standard-VAE.png\" alt=\"\" width=\"619\" height=\"186\" srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2019\/04\/standard-VAE.png 619w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2019\/04\/standard-VAE-300x90.png 300w\" sizes=\"auto, (max-width: 619px) 100vw, 619px\" \/><\/a><p id=\"caption-attachment-578647\" class=\"wp-caption-text\">Figure 1: Illustration of information flows on (a) one path in a standard VAE, and (b) two paths in a VAE with an auto-regressive decoder.<\/p><\/div>\n<h3>Cyclical Annealing Schedule<\/h3>\n<p>A simple remedy that schedules \u03b2 during VAE training was proposed by Bowman et al., as shown in Figure 2(a). It starts with \u03b2=0 at the beginning of training, and gradually increases \u03b2 until \u03b2=1 is reached. 
This <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/1511.06349\">monotonic schedule<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> of \u03b2 has become the <em>de facto<\/em> standard in training text VAEs, and has been widely adopted in many NLP tasks. Why does it improve performance empirically? When \u03b2&lt;1, <strong>z<\/strong> is trained to focus more on capturing useful information for reconstructing <strong>x<\/strong>. When the full VAE objective is considered (\u03b2=1), the <strong>z<\/strong> learned earlier can be viewed as an initialization for the VAE; such latent features are much more informative than the random start of the constant schedule and are thus ready for the decoder to use.<\/p>\n<div id=\"attachment_578695\" style=\"width: 509px\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2019\/04\/Annealing-with-the-monotonic-schedule.png\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-578695\" class=\"wp-image-578695\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2019\/04\/Annealing-with-the-monotonic-schedule-300x195.png\" alt=\"Figure 2: Annealing \u03b2 with (a) the monotonic schedule and (b) the cyclical schedule.\" width=\"499\" height=\"324\" 
srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2019\/04\/Annealing-with-the-monotonic-schedule-300x195.png 300w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2019\/04\/Annealing-with-the-monotonic-schedule-768x498.png 768w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2019\/04\/Annealing-with-the-monotonic-schedule.png 954w\" sizes=\"auto, (max-width: 499px) 100vw, 499px\" \/><\/a><p id=\"caption-attachment-578695\" class=\"wp-caption-text\">Figure 2: Annealing \u03b2 with (a) the monotonic schedule and (b) the cyclical schedule.<\/p><\/div>\n<p>Is there a better schedule? It is key to have a meaningful latent <strong>z<\/strong> at the beginning of decoder training, so that Path A is utilized. The monotonic schedule under-weights the prior regularization when \u03b2&lt;1, so the learned <strong>z<\/strong> tends to collapse into a point estimate. This underestimation can result in sub-optimal decoder learning. A natural question is how to obtain a better distribution estimate for <strong>z<\/strong> as initialization, while the decoder still has the opportunity to leverage such <strong>z<\/strong> in learning.<\/p>\n<p>Our proposal is to use the latent <strong>z<\/strong> trained under the full VAE objective as initialization. To progressively improve <strong>z<\/strong>, we propose a <em>cyclical schedule<\/em> for \u03b2 that simply repeats the monotonic schedule multiple times, as shown in Figure 2(b). We start with \u03b2=0, increase \u03b2 at a fast rate, and then stay at \u03b2=1 for subsequent learning iterations. This completes one period of the monotonic schedule. It encourages the model to converge towards the VAE objective and to infer its first raw full latent distribution. Unfortunately, \u03b2=1 gradually blocks Path A, preventing more information from passing through <strong>z<\/strong>. 
Crucially, we then start the second period of \u03b2 annealing and resume training from \u03b2=0. This perturbs the VAE objective, dislodges the model from its converged state, and reopens Path A. Importantly, the decoder now (1) has the opportunity to directly leverage <strong>z<\/strong>, without obstruction from KL; and (2) is trained with a better latent <strong>z<\/strong> than point estimates, as the full distribution learned in the previous period is fed in. We repeat this \u03b2 annealing process several times to achieve better convergence.<\/p>\n<h3>Visualization of learning dynamics in the latent space<\/h3>\n<p>To visualize the learning process on an illustrative problem, let\u2019s consider <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/openreview.net\/pdf?id=S1eZGHkDM\">a synthetic dataset<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> consisting of 10 different sequences, as well as a VAE model with a 2-dimensional latent space, and an LSTM encoder and decoder.<\/p>\n<p>We visualize the resulting division of the latent space at different training steps in Figure 3, where each color corresponds to the latent probabilistic representation of a sequence. We observe that:<\/p>\n<ul>\n<li>The constant schedule produces heavily mixed latent codes <strong>z<\/strong> for different sequences throughout the entire training process.<\/li>\n<li>The monotonic schedule starts with a mixed <strong>z<\/strong>, but soon divides the space into a mixture of 10 cluttered Gaussians during annealing (the division remains cluttered for the rest of training).<\/li>\n<li>The cyclical schedule behaves similarly to the monotonic schedule in the 1st cycle. Starting from the 2nd cycle, however, the clusters are much more clearly separated, since learning builds on the results of the 1st cycle. However, \u03b2&lt;1 leads to some holes between different clusters. 
This is alleviated at the end of the 2nd cycle, as the model is trained with \u03b2=1. As the process repeats, we see clearer patterns in the 4th cycle than in the 2nd cycle for both \u03b2&lt;1 and \u03b2=1. This shows that more structured information is captured in <strong>z<\/strong> with the cyclical schedule.<\/li>\n<\/ul>\n<div id=\"attachment_578680\" style=\"width: 1034px\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2019\/04\/figure-4.png\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-578680\" class=\"wp-image-578680 size-large\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2019\/04\/figure-4-1024x421.png\" alt=\"\" width=\"1024\" height=\"421\" srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2019\/04\/figure-4-1024x421.png 1024w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2019\/04\/figure-4-300x123.png 300w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2019\/04\/figure-4-768x316.png 768w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2019\/04\/figure-4.png 1528w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/a><p id=\"caption-attachment-578680\" class=\"wp-caption-text\">Figure 3: The process of learning probabilistic representations in the latent space for three schedules.<\/p><\/div>\n<p>The learning curves for the VAE objective (<a class=\"msr-external-link 
glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/1606.05908\">ELBO<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>), reconstruction error, and KL term are shown in Figure 4. The three schedules share very similar ELBO values. However, the cyclical schedule provides substantially lower reconstruction error and higher KL divergence. Interestingly, the cyclical schedule improves the performance progressively; it becomes better than the previous cycle, and there are clear periodic patterns across different cycles. This suggests that the cyclical schedule allows the model to use the previously learned results as a warm-restart to achieve further improvement.<\/p>\n<div id=\"attachment_578686\" style=\"width: 1034px\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2019\/04\/figure-4-1.png\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-578686\" class=\"wp-image-578686 size-large\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2019\/04\/figure-4-1-1024x205.png\" alt=\"\" width=\"1024\" height=\"205\" srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2019\/04\/figure-4-1-1024x205.png 1024w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2019\/04\/figure-4-1-300x60.png 300w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2019\/04\/figure-4-1-768x154.png 768w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/a><p id=\"caption-attachment-578686\" class=\"wp-caption-text\">Figure 4: Comparison of terms in VAE for three schedules.<\/p><\/div>\n<h3>Improving performance on NLP tasks<\/h3>\n<p>The new cyclical schedule has been demonstrated to be effective in improving probabilistic representations of synthetic sequences on the illustrative example, but is it beneficial 
in downstream real-world natural language processing (NLP) applications? We tested it on three tasks:<\/p>\n<ul>\n<li><em>Language modeling<\/em>. On the <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/corochann.com\/penn-tree-bank-ptb-dataset-introduction-1456.html\">Penn Tree-Bank<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> dataset, the cyclical schedule provides more informative language representations (as measured by the improved KL term) while retaining similar perplexity. It is significantly faster than <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/github.com\/harvardnlp\/sa-vae\">existing methods<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, and can be combined with them for further improvement.<\/li>\n<li><em>Dialog response generation<\/em>. It is key to have probabilistic representations of the conversational context, so that the model can reason stochastically about different but relevant responses. On the <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/catalog.ldc.upenn.edu\/LDC97S62\">SwitchBoard<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> dataset, the cyclical schedule generates highly diverse answers that cover multiple plausible dialog acts.<\/li>\n<li><em>Unsupervised language pre-training<\/em>. On the <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/www.yelp.com\/dataset\">Yelp<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> dataset, a language VAE model is first pre-trained to extract features, then a classifier is fine-tuned with different proportions of labelled data. 
The cyclical schedule provides robust distribution-based representations of sentences, yielding strong generalization on testing datasets.<\/li>\n<\/ul>\n<p>We hope to see you at NAACL-HLT this June to discuss these approaches in more detail and we\u2019ll look forward to hearing what you think!<\/p>\n<h3>Acknowledgements<\/h3>\n<p><em>This research was conducted by <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"http:\/\/chunyuan.li\/\">Chunyuan Li<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/github.com\/haofuml\">Hao Fu<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/scholar.google.com\/citations?user=NIewcxMAAAAJ&hl=en\">Xiaodong Liu<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, <a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/people\/jfgao\/\">Jianfeng Gao<\/a>, <a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/people\/aslicel\/\">Asli Celikyilmaz<\/a>, and <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"http:\/\/people.ee.duke.edu\/~lcarin\/\">Lawrence Carin<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>. 
Additional thanks go to <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/dreasysnail.github.io\/\">Yizhe Zhang<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, <a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/people\/sule\/\">Sungjin Lee<\/a>, <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/sites.google.com\/view\/dinghanshen\">Dinghan Shen<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, and <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"http:\/\/people.duke.edu\/~ww107\/\">Wenlin Wang<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> for their insightful discussions. The implementation in our experiments depends heavily on three NLP applications published in GitHub repositories; we thank all the authors who made their code public, which tremendously accelerated our progress.<\/em><\/p>\n","protected":false},"excerpt":{"rendered":"<p>There is a growing interest in exploring the use of variational auto-encoders (VAE), a deep latent variable model, for text generation. 
Compared to the standard RNN-based language model that generates sentences one word at a time without the explicit guidance of a global sentence representation, VAE is designed to learn a probabilistic representation of global [&hellip;]<\/p>\n","protected":false},"author":38022,"featured_media":578701,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","msr-author-ordering":[{"type":"user_nicename","value":"Chunyuan Li","user_id":"37971"}],"msr_hide_image_in_river":0,"footnotes":""},"categories":[194467],"tags":[],"research-area":[13556],"msr-region":[],"msr-event-type":[],"msr-locale":[268875],"msr-post-option":[],"msr-impact-theme":[],"msr-promo-type":[],"msr-podcast-series":[],"class_list":["post-578602","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-artifical-intelligence","msr-research-area-artificial-intelligence","msr-locale-en_us"],"msr_event_details":{"start":"","end":"","location":""},"podcast_url":"","podcast_episode":"","msr_research_lab":[],"msr_impact_theme":[],"related-publications":[],"related-downloads":[],"related-videos":[],"related-academic-programs":[],"related-groups":[144931],"related-projects":[],"related-events":[589690],"related-researchers":[],"msr_type":"Post","featured_image_thumbnail":"<img width=\"960\" height=\"405\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2019\/04\/Cyclical-Annealing-Schedule_Site_04_2019_1400x788-2.png\" class=\"img-object-cover\" alt=\"\" decoding=\"async\" loading=\"lazy\" srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2019\/04\/Cyclical-Annealing-Schedule_Site_04_2019_1400x788-2.png 1399w, 
https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2019\/04\/Cyclical-Annealing-Schedule_Site_04_2019_1400x788-2-300x127.png 300w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2019\/04\/Cyclical-Annealing-Schedule_Site_04_2019_1400x788-2-768x324.png 768w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2019\/04\/Cyclical-Annealing-Schedule_Site_04_2019_1400x788-2-1024x432.png 1024w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2019\/04\/Cyclical-Annealing-Schedule_Site_04_2019_1400x788-2-665x280.png 665w\" sizes=\"auto, (max-width: 960px) 100vw, 960px\" \/>","byline":"Chunyuan Li","formattedDate":"April 15, 2019","formattedExcerpt":"There is a growing interest in exploring the use of variational auto-encoders (VAE), a deep latent variable model, for text generation. Compared to the standard RNN-based language model that generates sentences one word at a time without the explicit guidance of a global sentence 
representation,&hellip;","locale":{"slug":"en_us","name":"English","native":"","english":"English"},"_links":{"self":[{"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/posts\/578602","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/users\/38022"}],"replies":[{"embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/comments?post=578602"}],"version-history":[{"count":33,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/posts\/578602\/revisions"}],"predecessor-version":[{"id":595513,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/posts\/578602\/revisions\/595513"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/media\/578701"}],"wp:attachment":[{"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/media?parent=578602"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/categories?post=578602"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/tags?post=578602"},{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=578602"},{"taxonomy":"msr-region","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-region?post=578602"},{"taxonomy":"msr-event-type","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-event-type?post=578602"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/
cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=578602"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=578602"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=578602"},{"taxonomy":"msr-promo-type","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-promo-type?post=578602"},{"taxonomy":"msr-podcast-series","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-podcast-series?post=578602"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}