{"id":552894,"date":"2018-11-30T07:56:26","date_gmt":"2018-11-30T15:56:26","guid":{"rendered":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/?p=552894"},"modified":"2018-11-30T07:56:26","modified_gmt":"2018-11-30T15:56:26","slug":"discovering-the-best-neural-architectures-in-the-continuous-space","status":"publish","type":"post","link":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/blog\/discovering-the-best-neural-architectures-in-the-continuous-space\/","title":{"rendered":"Discovering the best neural architectures in the continuous space"},"content":{"rendered":"<p><img loading=\"lazy\" decoding=\"async\" class=\"size-large wp-image-552897 aligncenter\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2018\/11\/NIPS_Neural-Architecture-Optimization_NAO_Site_1400x788-1024x576.png\" alt=\"Discovering the best neural architectures in the continuous space\" width=\"1024\" height=\"576\" srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2018\/11\/NIPS_Neural-Architecture-Optimization_NAO_Site_1400x788-1024x576.png 1024w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2018\/11\/NIPS_Neural-Architecture-Optimization_NAO_Site_1400x788-300x169.png 300w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2018\/11\/NIPS_Neural-Architecture-Optimization_NAO_Site_1400x788-768x432.png 768w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2018\/11\/NIPS_Neural-Architecture-Optimization_NAO_Site_1400x788-1066x600.png 1066w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2018\/11\/NIPS_Neural-Architecture-Optimization_NAO_Site_1400x788-655x368.png 655w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2018\/11\/NIPS_Neural-Architecture-Optimization_NAO_Site_1400x788-343x193.png 343w, 
https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2018\/11\/NIPS_Neural-Architecture-Optimization_NAO_Site_1400x788.png 1400w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/p>\n<p>If you\u2019re a deep learning practitioner, you may find yourself faced with the same critical question on a regular basis: Which neural network architecture should I choose for my current task? The decision depends on a variety of factors and the answers to a number of other questions. What operations should I choose for this layer\u2014convolution, depthwise separable convolution, or max pooling? What kernel size should the convolution use, 3&#215;3 or 1&#215;1? And which previous node should serve as the input to the current recurrent neural network (RNN) cell? Such decisions are crucial to the architecture\u2019s success. If you\u2019re a domain expert in both neural network modeling and the specific task at hand, it might be easy for you to make the correct decisions. But what if your experience with either of them is limited?<\/p>\n<p>In that case, you might turn to neural architecture search (NAS), an automated process in which an additional machine learning algorithm guides the creation of better neural architectures, given previously evaluated architectures and their measured performance. 
Thanks to NAS, we can pinpoint the neural network architectures that achieve the best results on widely used benchmark datasets, such as ImageNet, without any human intervention.<\/p>\n<p>But while existing methods in automatic neural architecture design\u2014typically based on reinforcement learning or evolutionary algorithms\u2014have generally conducted their searches in an exponentially large discrete space, my collaborators and I in the <a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/group\/machine-learning-research-group\/\">Machine Learning group at Microsoft Research Asia<\/a> have designed a simpler, more efficient method based on optimization in a continuous space. With our new approach, called <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/pdf\/1808.07233.pdf\">neural architecture optimization (NAO)<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, we leverage the power of a gradient-based method to conduct optimization in this more compact space. 
The work is part of the program at this year\u2019s <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/nips.cc\/\">Conference on Neural Information Processing Systems (NeurIPS)<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>.<\/p>\n<div id=\"attachment_552903\" style=\"width: 810px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-552903\" class=\"size-full wp-image-552903\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2018\/11\/Figure-1_The-workflow-of-NAO.png\" alt=\"Figure 1: The workflow of NAO\" width=\"800\" height=\"330\" srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2018\/11\/Figure-1_The-workflow-of-NAO.png 800w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2018\/11\/Figure-1_The-workflow-of-NAO-300x124.png 300w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2018\/11\/Figure-1_The-workflow-of-NAO-768x317.png 768w\" sizes=\"auto, (max-width: 800px) 100vw, 800px\" \/><p id=\"caption-attachment-552903\" class=\"wp-caption-text\">Figure 1: The workflow of NAO<\/p><\/div>\n<h3>The key components of NAO<\/h3>\n<p>Driving NAO\u2019s ability to perform gradient-based optimization in the continuous space are three components (see Figure 1):<\/p>\n<ul>\n<li>an encoder that maps a discrete neural network architecture into a continuous vector, or embedding<\/li>\n<li>a performance prediction function that takes the vector as input and generates a real value as a prediction of the performance of the architecture (for example, accuracy)<\/li>\n<li>a decoder that recovers the architecture from its continuous vector<\/li>\n<\/ul>\n<p><span style=\"color: #000000;\">These three components are trained jointly. 
After training is finished, we start from an architecture x and map it with the encoder E to its representation e<sub>x<\/sub>, then move e<sub>x<\/sub> to a new embedding vector, denoted as e<sub>x&#8217;<\/sub>, along the gradient direction of the performance prediction function f (the green line). Since we are essentially doing gradient ascent, we are guaranteed that f(e<sub>x&#8217;<\/sub>) \u2265 f(e<sub>x<\/sub>) as long as the step size is small enough. Finally, we map e<sub>x&#8217;<\/sub> into a discrete architecture x&#8217; using the decoder D. In this way, we obtain a potentially better architecture x&#8217;. By iteratively updating architectures in this way, we obtain the final architecture, which is expected to provide the best performance.<\/span><\/p>\n<h3>High performance with limited computational resources<\/h3>\n<p>We conducted extensive experiments to verify the effectiveness of using NAO to automatically discover the best neural architecture. Table 1 (below) shows the results on the CIFAR-10 image classification dataset for convolutional neural network (CNN) architectures discovered by different NAS algorithms. From the table, we can observe that the neural network discovered by NAO achieves the lowest error rate among the studied NAS algorithms. Furthermore, we can achieve a significant search speedup when combining NAO with the <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/pdf\/1802.03268.pdf\">weight sharing mechanism<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> (referred to as \u201cNAO-WS\u201d), a method that greatly reduces the computational cost of evaluating candidate neural network architectures by letting them share the same copy of weight parameters. 
In our experiments, we found that a single graphics processing unit (GPU) can, in seven hours, find a CNN architecture that achieves a 3.53 percent error rate. With weight sharing, there is no need to train different neural networks from scratch.<\/p>\n<p>Table 2 (below) summarizes the results on Penn Treebank (PTB) language modeling. Lower perplexity indicates better performance. Again, the RNN architectures found by NAO perform well: NAO reaches the lowest perplexity in the table (56.0), and NAO-WS reaches 56.6 using only 10 GPU hours.<\/p>\n<p>By using continuous optimization, NAO achieves better performance than existing NAS methods that search directly in the discrete architecture space. In future work, we plan to use NAO to search for architectures for other important AI tasks, such as neural machine translation. And just as important, the availability of a simpler, more efficient method for automatic neural architecture design continues to make machine learning technologies accessible to practitioners of all experience levels.<\/p>\n<table class=\"aligncenter\" style=\"width: 80%; height: 152px; border-collapse: separate; border-spacing: inherit;\" border=\"1\" cellspacing=\"inherit\" cellpadding=\"inherit\">\n<caption>\u00a0<\/caption>\n<tbody>\n<tr>\n<td style=\"padding: inherit; border: 1px solid; width: 620.05px;\"><strong>Method<\/strong><\/td>\n<td style=\"padding: inherit; border: 1px solid; width: 473.18px;\"><strong>Error Rate (%)<\/strong><\/td>\n<td style=\"padding: inherit; border: 1px solid; width: 546.12px;\"><strong>Resource (#GPU x <\/strong><strong>#Hours)<\/strong><\/td>\n<\/tr>\n<tr>\n<td style=\"padding: inherit; border: 1px solid; width: 620.05px;\"><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/pdf\/1802.03268.pdf\">ENAS<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/td>\n<td style=\"padding: inherit; border: 1px solid; width: 
473.18px;\">3.54<\/td>\n<td style=\"padding: inherit; border: 1px solid; width: 546.12px;\">11<\/td>\n<\/tr>\n<tr>\n<td style=\"padding: inherit; border: 1px solid; width: 620.05px;\"><strong>NAO-WS<\/strong><\/td>\n<td style=\"padding: inherit; border: 1px solid; width: 473.18px;\"><strong>3.53<\/strong><\/td>\n<td style=\"padding: inherit; border: 1px solid; width: 546.12px;\"><strong>7<\/strong><\/td>\n<\/tr>\n<tr>\n<td style=\"padding: inherit; border: 1px solid; width: 620.05px;\"><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/pdf\/1802.01548.pdf\">AmoebaNet<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/td>\n<td style=\"padding: inherit; border: 1px solid; width: 473.18px;\">2.13<\/td>\n<td style=\"padding: inherit; border: 1px solid; width: 546.12px;\">3150 * 24<\/td>\n<\/tr>\n<tr>\n<td style=\"padding: inherit; border: 1px solid; width: 620.05px;\"><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/pdf\/1711.00436.pdf\">Hier-EA<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/td>\n<td style=\"padding: inherit; border: 1px solid; width: 473.18px;\">3.75<\/td>\n<td style=\"padding: inherit; border: 1px solid; width: 546.12px;\">300 * 24<\/td>\n<\/tr>\n<tr>\n<td style=\"padding: inherit; border: 1px solid; width: 620.05px;\"><strong>NAO<\/strong><\/td>\n<td style=\"padding: inherit; border: 1px solid; width: 473.18px;\"><strong>2.11<\/strong><\/td>\n<td style=\"padding: inherit; border: 1px solid; width: 546.12px;\"><strong>200 * 24<\/strong><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p style=\"text-align: center;\">Table 1: Results on CIFAR-10 classification<\/p>\n<table class=\"aligncenter\" style=\"width: 80%; height: 138px; border-collapse: separate; border-spacing: inherit;\" border=\"1\" 
cellspacing=\"inherit\" cellpadding=\"inherit\">\n<caption>\u00a0<\/caption>\n<tbody>\n<tr style=\"height: 23px;\">\n<td style=\"padding: inherit; border: 1px solid; width: 571.99px; height: 23px;\"><strong>Method<\/strong><\/td>\n<td style=\"padding: inherit; border: 1px solid; width: 510.28px; height: 23px;\"><strong>Perplexity<\/strong><\/td>\n<td style=\"padding: inherit; border: 1px solid; width: 557.08px; height: 23px;\"><strong>Resource (#GPU x<\/strong><strong> #Hours)<\/strong><\/td>\n<\/tr>\n<tr style=\"height: 23px;\">\n<td style=\"padding: inherit; border: 1px solid; width: 571.99px; height: 23px;\">NASNet<\/td>\n<td style=\"padding: inherit; border: 1px solid; width: 510.28px; height: 23px;\">62.4<\/td>\n<td style=\"padding: inherit; border: 1px solid; width: 557.08px; height: 23px;\">1e4 CPU days<\/td>\n<\/tr>\n<tr style=\"height: 23px;\">\n<td style=\"padding: inherit; border: 1px solid; width: 571.99px; height: 23px;\"><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/pdf\/1802.03268.pdf\">ENAS<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/td>\n<td style=\"padding: inherit; border: 1px solid; width: 510.28px; height: 23px;\">58.6<\/td>\n<td style=\"padding: inherit; border: 1px solid; width: 557.08px; height: 23px;\">12<\/td>\n<\/tr>\n<tr style=\"height: 23px;\">\n<td style=\"padding: inherit; border: 1px solid; width: 571.99px; height: 23px;\"><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" 
target=\"_blank\" href=\"https:\/\/arxiv.org\/pdf\/1806.09055.pdf\">DARTS<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/td>\n<td style=\"padding: inherit; border: 1px solid; width: 510.28px; height: 23px;\">56.1<\/td>\n<td style=\"padding: inherit; border: 1px solid; width: 557.08px; height: 23px;\">24<\/td>\n<\/tr>\n<tr style=\"height: 23px;\">\n<td style=\"padding: inherit; border: 1px solid; width: 571.99px; height: 23px;\"><strong>NAO<\/strong><\/td>\n<td style=\"padding: inherit; border: 1px solid; width: 510.28px; height: 23px;\"><strong>56.0<\/strong><\/td>\n<td style=\"padding: inherit; border: 1px solid; width: 557.08px; height: 23px;\"><strong>300 * 24<\/strong><\/td>\n<\/tr>\n<tr style=\"height: 23px;\">\n<td style=\"padding: inherit; border: 1px solid; width: 571.99px; height: 23px;\"><strong>NAO-WS<\/strong><\/td>\n<td style=\"padding: inherit; border: 1px solid; width: 510.28px; height: 23px;\"><strong>56.6<\/strong><\/td>\n<td style=\"padding: inherit; border: 1px solid; width: 557.08px; height: 23px;\"><strong>10<\/strong><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p style=\"text-align: center;\">Table 2: Results on PTB language modeling<\/p>\n","protected":false},"excerpt":{"rendered":"<p>If you\u2019re a deep learning practitioner, you may find yourself faced with the same critical question on a regular basis: Which neural network architecture should I choose for my current task? The decision depends on a variety of factors and the answers to a number of other questions. 
What operations should I choose for this [&hellip;]<\/p>\n","protected":false},"author":37074,"featured_media":552897,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","msr-author-ordering":[{"type":"user_nicename","value":"Fei Tian","user_id":"36039"}],"msr_hide_image_in_river":0,"footnotes":""},"categories":[241770],"tags":[],"research-area":[13556],"msr-region":[],"msr-event-type":[],"msr-locale":[268875],"msr-post-option":[],"msr-impact-theme":[],"msr-promo-type":[],"msr-podcast-series":[],"class_list":["post-552894","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-artificial-intelligence","msr-research-area-artificial-intelligence","msr-locale-en_us"],"msr_event_details":{"start":"","end":"","location":""},"podcast_url":"","podcast_episode":"","msr_research_lab":[199560],"msr_impact_theme":[],"related-publications":[],"related-downloads":[],"related-videos":[],"related-academic-programs":[],"related-groups":[269241],"related-projects":[272661],"related-events":[],"related-researchers":[],"msr_type":"Post","featured_image_thumbnail":"<img width=\"960\" height=\"540\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2018\/11\/NIPS_Neural-Architecture-Optimization_NAO_Site_1400x788.png\" class=\"img-object-cover\" alt=\"Discovering the best neural architectures in the continuous space\" decoding=\"async\" loading=\"lazy\" srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2018\/11\/NIPS_Neural-Architecture-Optimization_NAO_Site_1400x788.png 1400w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2018\/11\/NIPS_Neural-Architecture-Optimization_NAO_Site_1400x788-300x169.png 300w, 
https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2018\/11\/NIPS_Neural-Architecture-Optimization_NAO_Site_1400x788-768x432.png 768w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2018\/11\/NIPS_Neural-Architecture-Optimization_NAO_Site_1400x788-1024x576.png 1024w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2018\/11\/NIPS_Neural-Architecture-Optimization_NAO_Site_1400x788-1066x600.png 1066w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2018\/11\/NIPS_Neural-Architecture-Optimization_NAO_Site_1400x788-655x368.png 655w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2018\/11\/NIPS_Neural-Architecture-Optimization_NAO_Site_1400x788-343x193.png 343w\" sizes=\"auto, (max-width: 960px) 100vw, 960px\" \/>","byline":"Fei Tian","formattedDate":"November 30, 2018","formattedExcerpt":"If you\u2019re a deep learning practitioner, you may find yourself faced with the same critical question on a regular basis: Which neural network architecture should I choose for my current task? 
The decision depends on a variety of factors and the answers to a number&hellip;","locale":{"slug":"en_us","name":"English","native":"","english":"English"},"_links":{"self":[{"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/posts\/552894","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/users\/37074"}],"replies":[{"embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/comments?post=552894"}],"version-history":[{"count":30,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/posts\/552894\/revisions"}],"predecessor-version":[{"id":554352,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/posts\/552894\/revisions\/554352"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/media\/552897"}],"wp:attachment":[{"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/media?parent=552894"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/categories?post=552894"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/tags?post=552894"},{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=552894"},{"taxonomy":"msr-region","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-region?post=552894"},{"taxonomy":"msr-event-type","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-event-type?post=552894"},{"ta
xonomy":"msr-locale","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=552894"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=552894"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=552894"},{"taxonomy":"msr-promo-type","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-promo-type?post=552894"},{"taxonomy":"msr-podcast-series","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-podcast-series?post=552894"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}