{"id":554676,"date":"2018-12-04T08:57:02","date_gmt":"2018-12-04T16:57:02","guid":{"rendered":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/?p=554676"},"modified":"2018-12-04T08:57:02","modified_gmt":"2018-12-04T16:57:02","slug":"unlikely-research-area-reveals-surprising-twist-in-non-smooth-optimization","status":"publish","type":"post","link":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/blog\/unlikely-research-area-reveals-surprising-twist-in-non-smooth-optimization\/","title":{"rendered":"Unlikely research area reveals surprising twist in non-smooth optimization"},"content":{"rendered":"<p><img loading=\"lazy\" decoding=\"async\" class=\"size-large wp-image-554679 aligncenter\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2018\/11\/NIPS_Optimal-Algorithms-For-Non-Smooth-Distributed-Optimization-In-Networks_ML_Site_1400x788-1024x576.png\" alt=\"\" width=\"1024\" height=\"576\" srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2018\/11\/NIPS_Optimal-Algorithms-For-Non-Smooth-Distributed-Optimization-In-Networks_ML_Site_1400x788-1024x576.png 1024w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2018\/11\/NIPS_Optimal-Algorithms-For-Non-Smooth-Distributed-Optimization-In-Networks_ML_Site_1400x788-300x169.png 300w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2018\/11\/NIPS_Optimal-Algorithms-For-Non-Smooth-Distributed-Optimization-In-Networks_ML_Site_1400x788-768x432.png 768w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2018\/11\/NIPS_Optimal-Algorithms-For-Non-Smooth-Distributed-Optimization-In-Networks_ML_Site_1400x788-1066x600.png 1066w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2018\/11\/NIPS_Optimal-Algorithms-For-Non-Smooth-Distributed-Optimization-In-Networks_ML_Site_1400x788-655x368.png 655w, 
https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2018\/11\/NIPS_Optimal-Algorithms-For-Non-Smooth-Distributed-Optimization-In-Networks_ML_Site_1400x788-343x193.png 343w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2018\/11\/NIPS_Optimal-Algorithms-For-Non-Smooth-Distributed-Optimization-In-Networks_ML_Site_1400x788.png 1400w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/p>\n<p>Modern machine learning is characterized by two key features: high-dimensional models and very large datasets. Each of these features presents its own unique challenges, from basic issues such as storing and accessing all of the data to more intricate mathematical quests such as finding good algorithms to search through the high-dimensional space of models. In <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/pdf\/1806.00291.pdf\">our recent work<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, which we\u2019re happy to announce received a best paper award at this year\u2019s <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/nips.cc\/\">Conference on Neural Information Processing Systems (NeurIPS)<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, we find a surprising new connection between these two key aspects. We discovered that the well-known phenomenon of acceleration for smooth optimization can always be applied to distributed optimization\u2014even in non-smooth settings\u2014because of an inherent smoothness in the communication process!<\/p>\n<h3>Progress and the search for the unknown<\/h3>\n<p>The challenges of distributed computation and optimization have been studied separately for more than half a century. 
Take, for instance, the simpler, yet still prevalent, setting of linear models such as logistic regression. There, the mathematical theory of local search techniques\u2014for example, gradient descent\u2014in high-dimensional spaces is by now almost complete. This is largely due to work from the 1980s school of Russian researchers and engineers, including central figures Arkadi Nemirovski and Yurii Nesterov. They developed a theory of optimal local search methods, obtaining actual algorithms, such as momentum-based accelerated gradient descent, as well as proofs that, in some sense, no better algorithms exist. This theory was tremendously influential and put modern machine learning on solid footing. Another example is optimization with noisy gradient evaluations, which produced the famous stochastic gradient descent technique. But today\u2019s research in these areas remains very active, and modern machine learning comes with a host of other fundamental issues where optimal methods are not yet known. Of particular interest is the question of how to design optimal local search methods for distributed convex optimization while accounting for both high-dimensional spaces and huge datasets.<\/p>\n<p>Microsoft Research has long been invested in finding answers to this question, including foundational work on <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"http:\/\/www.jmlr.org\/papers\/volume13\/dekel12a\/dekel12a.pdf\">the relationship between distributed learning and mini-batch gradient descent<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>. 
In a recent collaboration between <a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/lab\/microsoft-research-ai\/\">Microsoft Research AI<\/a> and the <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/www.msr-inria.fr\/\">Microsoft Research-Inria Joint Centre<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> in Paris, new advances were made on the question of optimal methods for distributed optimization. A paper on <a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2017\/06\/1702.08704.pdf\">smooth optimization<\/a> was published at the <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/icml.cc\/Conferences\/2017\">International Conference on Machine Learning (ICML)<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> last year. Our NeurIPS 2018 paper, \u201c<a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/pdf\/1806.00291.pdf\">Optimal Algorithms for Non-Smooth Distributed Optimization in Networks<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>,\u201d explores the much more delicate question of non-smooth optimization.<\/p>\n<h3>An entirely different realm<\/h3>\n<p>It is well-known that for serial smooth convex optimization, momentum-based techniques yield optimal convergence rates. A slightly underwhelming finding of the ICML 2017 paper is that in the distributed setting, one can simply implement such optimal serial methods together with the master\/slave protocol, and this in turn will yield an optimal distributed method. At this point, it is crucial to remember what mathematically optimal actually means. 
A method is \u201coptimal\u201d only within the set of assumptions under which optimality is proven. In particular, there is a tremendously important weakness in the master\/slave protocol: It is not robust against machine failures, which are ubiquitous in distributed computing. Once one adds in the requirement of robustness\u2014for example, by requiring that communication follows the so-called gossip protocol\u2014one enters an entirely different realm. In this regime, a much deeper method turns out to be optimal, one based on a primal-dual representation of the distributed objective. Such representations have been known for a long time and are the basis of many practical algorithms, such as the alternating direction method of multipliers (ADMM). The ICML paper proves that under a local smoothness condition, such primal-dual techniques are in fact optimal.<\/p>\n<h3>The discovery of hidden smoothness<\/h3>\n<p>For optimization experts, distributed optimization in the non-smooth scenario might seem like an unlikely research area, as it is well-known that for serial non-smooth optimization, vanilla (sub)gradient descent is already optimal. The surprising twist here, though\u2014and indeed the key observation in our NeurIPS paper\u2014is that the distributed nature always induces some smoothness. Slightly more precisely, while each machine might face a difficult non-smooth problem, the aggregation process resulting from communication is actually a smooth process.<\/p>\n<p>This hidden smoothness reveals itself in the primal-dual methods, where now the primal problem is non-smooth while the dual problem is smooth. The idea then becomes, roughly, to use an optimal non-smooth method in the primal together with an optimal smooth method in the dual. In other words, non-smooth distributed optimization always allows for accelerated communication despite the non-smoothness of the original problem. 
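<\/p>\n<p>As a rough illustration (a hypothetical sketch of the general gossip-averaging idea, not the algorithm from the paper; the matrix W below is an assumed example gossip matrix), the communication step can be written as repeated averaging with a doubly stochastic matrix:<\/p>\n<pre><code>import numpy as np\n\n# hypothetical sketch: one gossip round replaces each node value by a\n# weighted average over its neighbors, via a doubly stochastic matrix W\nW = np.array([[0.5, 0.5, 0.0], [0.5, 0.25, 0.25], [0.0, 0.25, 0.75]])\nx = np.array([1.0, 2.0, 3.0])  # local values held by the three nodes\nfor _ in range(50):\n    x = W @ x  # one communication round\nprint(x)  # every node approaches the global average 2.0\n<\/code><\/pre>\n<p>Repeated multiplication by W is a linear, and hence smooth, operation, and it is precisely this kind of regularity in the communication step that acceleration can exploit.<\/p>\n<p>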
The astute reader might have noticed that invoking primal-dual methods inevitably means that one is considering the local regularity (that is, on each machine) of the distributed function, leaving open the question of optimal methods under a global regularity assumption. Our NeurIPS paper also proposes an algorithm for that scenario based on deep ideas in randomized algorithms. However, proving optimality remains a difficult technical challenge. In particular, we conjecture that optimal methods under global regularity might inherently suffer from the high-dimensionality of the space. Proving such a statement is an exciting open problem!<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Modern machine learning is characterized by two key features: high-dimensional models and very large datasets. Each of these features presents its own unique challenges, from basic issues such as storing and accessing all of the data to more intricate mathematical quests such as finding good algorithms to search through the high-dimensional space of models. 
In [&hellip;]<\/p>\n","protected":false},"author":37074,"featured_media":554679,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","msr-author-ordering":[{"type":"user_nicename","value":"S\u00e9bastien Bubeck","user_id":"33570"}],"msr_hide_image_in_river":0,"footnotes":""},"categories":[194466],"tags":[],"research-area":[13561,13556],"msr-region":[],"msr-event-type":[],"msr-locale":[268875],"msr-post-option":[],"msr-impact-theme":[],"msr-promo-type":[],"msr-podcast-series":[],"class_list":["post-554676","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-algorithms","msr-research-area-algorithms","msr-research-area-artificial-intelligence","msr-locale-en_us"],"msr_event_details":{"start":"","end":"","location":""},"podcast_url":"","podcast_episode":"","msr_research_lab":[],"msr_impact_theme":[],"related-publications":[],"related-downloads":[],"related-videos":[],"related-academic-programs":[],"related-groups":[],"related-projects":[],"related-events":[508112],"related-researchers":[],"msr_type":"Post","featured_image_thumbnail":"<img width=\"960\" height=\"540\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2018\/11\/NIPS_Optimal-Algorithms-For-Non-Smooth-Distributed-Optimization-In-Networks_ML_Site_1400x788.png\" class=\"img-object-cover\" alt=\"Unlikely research area reveals surprising twist in non-smooth optimization\" decoding=\"async\" loading=\"lazy\" srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2018\/11\/NIPS_Optimal-Algorithms-For-Non-Smooth-Distributed-Optimization-In-Networks_ML_Site_1400x788.png 1400w, 
https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2018\/11\/NIPS_Optimal-Algorithms-For-Non-Smooth-Distributed-Optimization-In-Networks_ML_Site_1400x788-300x169.png 300w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2018\/11\/NIPS_Optimal-Algorithms-For-Non-Smooth-Distributed-Optimization-In-Networks_ML_Site_1400x788-768x432.png 768w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2018\/11\/NIPS_Optimal-Algorithms-For-Non-Smooth-Distributed-Optimization-In-Networks_ML_Site_1400x788-1024x576.png 1024w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2018\/11\/NIPS_Optimal-Algorithms-For-Non-Smooth-Distributed-Optimization-In-Networks_ML_Site_1400x788-1066x600.png 1066w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2018\/11\/NIPS_Optimal-Algorithms-For-Non-Smooth-Distributed-Optimization-In-Networks_ML_Site_1400x788-655x368.png 655w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2018\/11\/NIPS_Optimal-Algorithms-For-Non-Smooth-Distributed-Optimization-In-Networks_ML_Site_1400x788-343x193.png 343w\" sizes=\"auto, (max-width: 960px) 100vw, 960px\" \/>","byline":"S\u00e9bastien Bubeck","formattedDate":"December 4, 2018","formattedExcerpt":"Modern machine learning is characterized by two key features: high-dimensional models and very large datasets. 
Each of these features presents its own unique challenges, from basic issues such as storing and accessing all of the data to more intricate mathematical quests such as finding good&hellip;","locale":{"slug":"en_us","name":"English","native":"","english":"English"},"_links":{"self":[{"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/posts\/554676","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/users\/37074"}],"replies":[{"embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/comments?post=554676"}],"version-history":[{"count":4,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/posts\/554676\/revisions"}],"predecessor-version":[{"id":555192,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/posts\/554676\/revisions\/555192"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/media\/554679"}],"wp:attachment":[{"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/media?parent=554676"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/categories?post=554676"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/tags?post=554676"},{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=554676"},{"taxonomy":"msr-region","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-region?post=554676"},{"taxonomy":"msr-event-type","embeddable":true
,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-event-type?post=554676"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=554676"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=554676"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=554676"},{"taxonomy":"msr-promo-type","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-promo-type?post=554676"},{"taxonomy":"msr-podcast-series","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-podcast-series?post=554676"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}