{"id":963594,"date":"2023-08-30T09:00:00","date_gmt":"2023-08-30T16:00:00","guid":{"rendered":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/?p=963594"},"modified":"2023-08-23T10:21:43","modified_gmt":"2023-08-23T17:21:43","slug":"building-a-heavy-metal-quartet-of-ai-compilers","status":"publish","type":"post","link":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/blog\/building-a-heavy-metal-quartet-of-ai-compilers\/","title":{"rendered":"Building a \u201cheavy metal quartet\u201d of AI compilers"},"content":{"rendered":"\n<p>By MSR Editor&nbsp;<\/p>\n\n\n\n<p>Compilation is an important process in program development, in which a program called a compiler translates source code written in a programming language into machine code executable on computer hardware. As AI technology and large-scale AI models become increasingly prevalent across the digital world, their unique characteristics are posing new challenges for compilers.<\/p>\n\n\n\n<p>As AI models have evolved from early versions like recurrent neural networks (RNN) and convolutional neural networks (CNN) to more recent iterations like Transformer, their fundamental architecture is also constantly evolving. Meanwhile, the underlying hardware accelerators, such as graphics processing units (GPUs) and neural processing units (NPUs), are iterating rapidly as well, with some designs disrupting previous architectures. 
Therefore, an AI compiler plays a critical role in helping new AI models run efficiently on new hardware.<\/p>\n\n\n\n<p>In response, researchers from Microsoft Research, in collaboration with academic colleagues, conducted a series of research projects and released the \u201cheavy-metal quartet\u201d of AI compilers: <a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/publication\/rammer-enabling-holistic-deep-learning-compiler-optimizations-with-rtasks\/\"><em>Rammer<\/em><\/a><em>, <\/em><a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/publication\/roller-fast-and-efficient-tensor-compilation-for-deep-learning\/\"><em>Roller<\/em><\/a><em>, <\/em><a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/publication\/welder-scheduling-deep-learning-memory-access-via-tile-graph\/\"><em>Welder<\/em><\/a><em>, <\/em>and<em> <\/em><a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/publication\/cocktailer-analyzing-and-optimizing-dynamic-control-flow-in-deep-learning\/\"><em>Grinder<\/em><\/a><sup><a id=\"_ftnref1\" href=\"#_ftn1\">[1]<\/a><\/sup>. This quartet provides systematic and innovative solutions for compiling current mainstream AI models for current mainstream hardware.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"978\" height=\"367\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2023\/08\/ai-compilor-1.png\" alt=\"The left diagram shows the unified compiler abstraction with a tile-based intermediate representation (IR) as the core. The right diagram shows the four core AI compilation technologies. 
\" class=\"wp-image-963609\" srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2023\/08\/ai-compilor-1.png 978w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2023\/08\/ai-compilor-1-300x113.png 300w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2023\/08\/ai-compilor-1-768x288.png 768w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2023\/08\/ai-compilor-1-240x90.png 240w\" sizes=\"auto, (max-width: 978px) 100vw, 978px\" \/><figcaption class=\"wp-element-caption\">Figure 1: The four core AI compilation technologies based on unified tile abstraction<\/figcaption><\/figure>\n\n\n\n<div style=\"height:20px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n\t<div class=\"border-bottom border-top border-gray-300 mt-5 mb-5 msr-promo text-center text-md-left alignwide\" data-bi-aN=\"promo\" data-bi-id=\"670821\">\n\t\t\n\n\t\t<p class=\"msr-promo__label text-gray-800 text-center text-uppercase\">\n\t\t<span class=\"px-4 bg-white display-inline-block font-weight-semibold small\">Spotlight: Microsoft research newsletter<\/span>\n\t<\/p>\n\t\n\t<div class=\"row pt-3 pb-4 align-items-center\">\n\t\t\t\t\t\t<div class=\"msr-promo__media col-12 col-md-5\">\n\t\t\t\t<a class=\"bg-gray-300 display-block\" href=\"https:\/\/info.microsoft.com\/ww-landing-microsoft-research-newsletter.html\" aria-label=\"Microsoft Research Newsletter\" data-bi-cN=\"Microsoft Research Newsletter\" target=\"_blank\">\n\t\t\t\t\t<img decoding=\"async\" class=\"w-100 display-block\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2019\/09\/Newsletter_Banner_08_2019_v1_1920x1080.png\" alt=\"\" \/>\n\t\t\t\t<\/a>\n\t\t\t<\/div>\n\t\t\t\n\t\t\t<div class=\"msr-promo__content p-3 px-5 col-12 col-md\">\n\n\t\t\t\t\t\t\t\t\t<h2 class=\"h4\">Microsoft Research Newsletter<\/h2>\n\t\t\t\t\n\t\t\t\t\t\t\t\t<p id=\"microsoft-research-newsletter\" class=\"large\">Stay 
connected to the research community at Microsoft.<\/p>\n\t\t\t\t\n\t\t\t\t\t\t\t\t<div class=\"wp-block-buttons justify-content-center justify-content-md-start\">\n\t\t\t\t\t<div class=\"wp-block-button is-style-fill-chevron\">\n\t\t\t\t\t\t<a href=\"https:\/\/info.microsoft.com\/ww-landing-microsoft-research-newsletter.html\" aria-describedby=\"microsoft-research-newsletter\" class=\"btn btn-brand glyph-append glyph-append-chevron-right\" data-bi-cN=\"Microsoft Research Newsletter\" target=\"_blank\">\n\t\t\t\t\t\t\tSubscribe today\t\t\t\t\t\t<\/a>\n\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t\t\t<\/div><!--\/.msr-promo__content-->\n\t<\/div><!--\/.msr-promo__inner-wrap-->\n\t<\/div><!--\/.msr-promo-->\n\t\n\n\n<h2 class=\"wp-block-heading\" id=\"ai-compilation-rammer-improves-hardware-parallel-utilization\">AI compilation \u201cRammer\u201d improves hardware parallel utilization <\/h2>\n\n\n\n<p>Deep neural networks (DNNs) are widely adopted in image classification, natural language processing, and many other intelligence tasks. Because of their importance, many computing devices such as CPUs, GPUs, and specially designed DNN accelerators are being used to perform DNN computations. One key variable for DNN computation efficiency is scheduling, which determines the order in which computational tasks are performed on hardware. Conventional AI compilers typically treat DNN computation as a data flow graph where each node represents a DNN operator. These operators are implemented as opaque library functions and are scheduled to run on the accelerator separately. At the same time, this process also relies on another layer of schedulers, usually implemented in hardware, to take advantage of the parallelism available in operators. 
This two-level approach incurs significant scheduling overhead and often does not fully utilize hardware resources.<\/p>\n\n\n\n<p>To address this issue, researchers proposed <em>a new DNN compiler, Rammer, which can optimize the execution of DNN workloads on the massively parallel units of accelerators.<\/em> Rammer imagines the scheduling space for AI compilation as a two-dimensional plane, where computational tasks are \u201cbricks\u201d that can be divided into different shapes and sizes. The purpose of scheduling in Rammer is to arrange these bricks tightly\u2014as if building a wall\u2014on the computational units of the two-dimensional plane. The arrangement should not leave any gaps, which would hurt hardware utilization and thus reduce execution speed. Rammer works like a compactor in this two-dimensional space: when a DNN program is translated into bricks, Rammer can place them on different computing units of the accelerator to compact them.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-large is-resized\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2023\/08\/ai-compilor-2-763x1024.png\" alt=\"A schematic diagram illustrating Rammer\u2019s technical framework. The input to Rammer is a data-flow graph where a node is an rOperator. Then, Rammer introduces an rTask-aware DFG compiler to manage inter- and intra-operator scheduling in one place. The rTask-aware DFG compiler will generate a static execution plan for runtime execution. Rammer abstracts a hardware accelerator as a virtualized parallel device (vDevice), which includes multiple virtualized execution units (vEUs). The vDevice provides the scheduling and synchronization capabilities at the rTask level so that the rProgram can be mapped to the corresponding vEUs at compile time. The vEUs, together with the vDevice, will be mapped to the hardware at runtime. 
\" class=\"wp-image-963612\" width=\"427\" height=\"573\" srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2023\/08\/ai-compilor-2-763x1024.png 763w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2023\/08\/ai-compilor-2-223x300.png 223w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2023\/08\/ai-compilor-2-768x1031.png 768w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2023\/08\/ai-compilor-2-134x180.png 134w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2023\/08\/ai-compilor-2.png 1053w\" sizes=\"auto, (max-width: 427px) 100vw, 427px\" \/><figcaption class=\"wp-element-caption\">Figure 2: Rammer\u2019s technical framework<\/figcaption><\/figure>\n\n\n\n<p>In other words, Rammer generates an efficient static spatiotemporal schedule for DNNs ahead of time (during compilation), minimizing runtime scheduling overhead. Meanwhile, through new hardware-independent abstractions for computing tasks and hardware accelerators, Rammer exposes a larger scheduling space and provides a novel way to implement cooperative intra- and inter-operator scheduling. This allows Rammer to find more efficient schedules, thereby greatly improving hardware utilization.<\/p>\n\n\n\n<p>Researchers evaluated Rammer on multiple devices, including NVIDIA GPUs, AMD GPUs, and Graphcore intelligence processing units (IPUs). Experiments have shown that Rammer significantly outperforms state-of-the-art compilers, such as XLA and TVM, on NVIDIA and AMD GPUs, achieving a speedup of up to 20.1 times. 
Compared to TensorRT, NVIDIA\u2019s proprietary DNN inference library, Rammer achieves a speedup of up to 3.1 times.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"ai-compilation-roller-improves-compilation-efficiency\">AI compilation \u201cRoller\u201d improves compilation efficiency<\/h2>\n\n\n\n<p>An accelerator is equipped with parallel computing units and a multi-layer memory hierarchy. The data needs to be passed upwards layer by layer from the bottom memory layer before computation. At each layer, the data is divided into smaller bricks. Eventually, these smaller bricks are handed over to the top-level processor for computation. The challenge lies in how to partition the data and fill the memory space with large bricks, so as to better utilize available memory and improve efficiency. The current approach involves using machine learning to identify better strategies for partitioning these bricks. However, this typically requires thousands of search steps, each of which is evaluated on the accelerator, in order to find a satisfactory solution. As a result, compiling a full AI model can take days or even weeks.<\/p>\n\n\n\n<p>Given the computational logic and the specification of each memory layer, which together present a holistic view of the software and hardware information, it is possible to formulate the best strategy for partitioning the bricks, as well as the best brick sizes. This enables faster compilation with good computation efficiency. This is the key idea behind <em>Roller<\/em>. <em>Like a road roller, the system lays high-dimensional tensor data onto two-dimensional memory, much as one tiles a floor, finding the optimal tile sizes given the memory characteristics. 
At the same time, it encapsulates the tensor shape that aligns with the hardware characteristics of the underlying accelerator, achieving efficient compilation by limiting the choices for shapes.<\/em><\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-large is-resized\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2023\/08\/ai-compilor-3-1024x740.png\" alt=\"A schematic diagram illustrating Roller\u2019s technical framework. Roller takes an operator described as a tensor expression. Roller extracts the tensor shapes from the tensor expression and leverages hardware specifications to construct rTiles. Based on rTiles, Roller proposes a scale-up-then-scale-out recursive construction algorithm to generate efficient tensor programs (named rProgram) that describe the data processing pipeline. When generating rProgram, the construction algorithm identifies good rTile configurations by evaluating the performance of a constructed rProgram through a micro-performance model. It is built on top of a device described through a hardware abstraction layer exposing only rTile-related interfaces: Load, Compute, and Store. 
The constructed rProgram is finally realized through a code generator to emit the final kernel code corresponding to the specific device.\" class=\"wp-image-963615\" width=\"530\" height=\"383\" srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2023\/08\/ai-compilor-3-1024x740.png 1024w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2023\/08\/ai-compilor-3-300x217.png 300w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2023\/08\/ai-compilor-3-768x555.png 768w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2023\/08\/ai-compilor-3-1536x1110.png 1536w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2023\/08\/ai-compilor-3-240x173.png 240w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2023\/08\/ai-compilor-3.png 1888w\" sizes=\"auto, (max-width: 530px) 100vw, 530px\" \/><figcaption class=\"wp-element-caption\">Figure 3: Roller\u2019s technical framework<\/figcaption><\/figure>\n\n\n\n<p>Evaluations on six mainstream DNN models and 119 popular DNN operators demonstrated that Roller can generate highly optimized kernels in seconds, especially for large and expensive custom operators. Roller achieves a three-orders-of-magnitude improvement in compilation time compared to existing compilers. The performance of the kernels generated by Roller is comparable to that of state-of-the-art tensor compilers, including DNN libraries, with some operators performing even better. 
Roller has also been used in customizing DNN kernels internally, which has demonstrated its real improvement in development agility.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"ai-compilation-welder-optimizes-memory-access-and-improves-computing-efficiency\">AI compilation \u201cWelder\u201d optimizes memory access and improves computing efficiency<\/h2>\n\n\n\n<p>With the growing demand for processing higher fidelity data and the use of faster computing cores in newer hardware accelerators, modern DNN models are becoming increasingly memory intensive. A disparity between underutilized computing cores and saturated memory bandwidth has been observed in various popular DNN models.<\/p>\n\n\n\n<p>For example, profiling on a state-of-the-art DNN benchmark shows that the memory bandwidth utilization can be as high as 96.7% while the average utilization of computing cores is only 51.6%. Even more seriously, the continuous evolution of hardware and DNN models continues to increase this gap. Modern AI models tend to process high-fidelity data, such as larger images, longer sentences, and higher-resolution graphics. Such data demands higher memory bandwidth during computation. Additionally, the introduction of more efficient specialized computing cores (such as NVIDIA Tensor Cores or AMD Matrix Cores) further increases memory pressure.<\/p>\n\n\n\n<p>To address this issue, the researchers proposed <em>the Welder deep learning compiler, which holistically optimizes the memory access efficiency of the end-to-end DNN model.<\/em> Represented as a data flow graph, the end-to-end DNN computation involves multiple stages, where the input data is divided into blocks that flow through different operators. These blocks are transferred to processor cores for computation and then transferred back to memory. This results in significant overhead due to data movement across memory layers. 
Since it includes multiple stages, the entire process can be envisioned as a scenario where \u201cworkers\u201d are moving bricks upwards layer by layer. The first worker takes the bricks up, processes them, and then puts them back in their original location. The second worker takes them up again, sculpts them, and then once again puts them back. The process continues with the third worker, the fourth worker, and so on, repeatedly moving the bricks. However, this leads to significant overhead. Would it be possible for the first worker to finish a part of the subtask and then directly hand it over to the next worker at the top level? These tasks can then be \u201cwelded\u201d together to achieve a pipelined operation with higher efficiency. Welder plays the role of such a welding tool. By connecting (welding) different operators, data blocks are processed in the manner of an assembly line, greatly reducing memory access traffic at lower-level memory layers. With AI models imposing increasingly high requirements for memory efficiency in recent years, Welder helps to significantly improve computational efficiency.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-large is-resized\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2023\/08\/ai-compilor-4-1024x776.png\" alt=\"A schematic diagram illustrating Welder\u2019s technical framework. Welder takes a full DNN model as input and converts it into a data-flow graph of tile-based computing tasks, which is called tile-graph. Then, a two-step scheduling algorithm, i.e., graph connecting and sub-graph scheduling, is proposed to recursively decide an efficient tile-graph execution plan for multiple memory layers, known as a hierarchical tile-graph. 
Finally, this plan is mapped to executable code for a specific hardware accelerator using four abstracted computing interfaces defined in the hardware layer.\" class=\"wp-image-963621\" width=\"768\" height=\"582\" srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2023\/08\/ai-compilor-4-1024x776.png 1024w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2023\/08\/ai-compilor-4-300x227.png 300w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2023\/08\/ai-compilor-4-768x582.png 768w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2023\/08\/ai-compilor-4-1536x1164.png 1536w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2023\/08\/ai-compilor-4-80x60.png 80w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2023\/08\/ai-compilor-4-238x180.png 238w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2023\/08\/ai-compilor-4.png 2022w\" sizes=\"auto, (max-width: 768px) 100vw, 768px\" \/><figcaption class=\"wp-element-caption\">Figure 4: Welder\u2019s technical framework<\/figcaption><\/figure>\n\n\n\n<p>Evaluations on 10 mainstream DNN models (including classic and the latest AI model structures for tasks such as vision, natural language processing, and 3D graphics) demonstrated that Welder significantly exceeds the performance of existing mainstream frameworks and compilers on both NVIDIA and AMD GPUs. For example, it outperforms PyTorch, ONNXRuntime, and Ansor by up to 21.4 times, 8.7 times, and 2.8 times, respectively. Welder\u2019s automatic optimization surpasses even TensorRT and Faster Transformer (a hand-crafted library), achieving speedups of up to 3.0 times and 1.7 times, respectively. 
Furthermore, when running these models on hardware with faster computing cores such as Tensor Cores, the performance improvement is even greater, underscoring the significance of memory optimization for future AI accelerators.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"ai-compilation-grinder-allows-efficient-control-flow-execution-on-accelerators\">AI compilation \u201cGrinder\u201d allows efficient control flow execution on accelerators<\/h2>\n\n\n\n<p>In AI computation, the movement of data blocks sometimes requires more complex control logic, i.e., control flow code. For example, a program could iteratively traverse each word in a sentence or dynamically determine which part of a program to execute based on input. Currently, most AI compilers focus on addressing data flow execution efficiency and do not provide efficient support for control flow. As a result, models with more complex control flow cannot effectively utilize accelerator performance. The researchers realized that control flow and data flow can be segmented and reorganized in order to execute more efficiently. Their solution is <em>Grinder<\/em>, which acts like a portable grinding and cutting machine. After cutting the data flow into parallel computing blocks of different sizes, it then integrates (grinds) control flow into data flow, so that control flow can also be executed efficiently on the accelerator.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-large is-resized\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2023\/08\/ai-compilor-5-1024x774.png\" alt=\"A schematic diagram illustrating Grinder\u2019s technical framework. The example loop structure is scheduled as a uProgram mapped on the 3-level accelerator. The uProgram consists of 4 loop-uTasks for 4 L1-Units respectively, and each loop-uTask is mapped to an L1-Unit for execution. 
Both the data flow operators and the loop are scheduled into the loop-uTasks.\" class=\"wp-image-963624\" width=\"598\" height=\"452\" srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2023\/08\/ai-compilor-5-1024x774.png 1024w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2023\/08\/ai-compilor-5-300x227.png 300w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2023\/08\/ai-compilor-5-768x580.png 768w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2023\/08\/ai-compilor-5-1536x1161.png 1536w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2023\/08\/ai-compilor-5-80x60.png 80w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2023\/08\/ai-compilor-5-238x180.png 238w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2023\/08\/ai-compilor-5.png 1712w\" sizes=\"auto, (max-width: 598px) 100vw, 598px\" \/><figcaption class=\"wp-element-caption\">Figure 5: Grinder\u2019s technical framework<\/figcaption><\/figure>\n\n\n\n<p>Grinder can jointly optimize the execution of control flow and data flow on hardware accelerators and unify the representation of AI models, including both control flow and data flow, through uTask, a new abstraction. This allows Grinder to expose the overall scheduling space for rescheduling control flow to lower levels of hardware parallelism. Grinder uses a heuristic strategy to find an effective scheduling scheme and can automatically move control flow into device kernels, thereby achieving optimizations across control flow boundaries. 
Experiments have shown that Grinder can achieve up to an 8.2x speedup on control flow-intensive DNN models, making it the fastest among DNN frameworks and compilers for control flow.<\/p>\n\n\n\n<p>These four AI compilers, based on a common compiler abstraction and unified intermediate representation (IR), <em>solve multiple fundamental problems in current AI compilers, including parallelism, compilation efficiency, memory, and control flow. Together they constitute a comprehensive set of solutions for compilation<\/em> and have played an important role in the customization and optimization of new AI models within Microsoft Research.<\/p>\n\n\n\n<p>Jilong Xue, Principal Researcher at MSR Asia, summed up the project this way:<\/p>\n\n\n\n<blockquote class=\"wp-block-quote is-style-spectrum is-layout-flow wp-block-quote-is-layout-flow\">\n<p>&#8220;On one hand, AI compilers must perform extreme optimizations like operator fusion and kernel specialization tailored for hardware resources. On the other hand, they must also provide systematic compilation support for new, large-scale hardware architectures, such as AI chips featuring on-chip network interconnection (NoC) or hybrid memory architectures, and even guiding hardware design using white-box compilation technologies. The AI compilers we developed have demonstrated a substantial improvement in AI compilation efficiency, thereby facilitating the training and deployment of AI models. At the same time, the evolution of large-scale models also presents opportunities for the next generation AI compiler. 
In the future, these large-scale models themselves may inherently assist in achieving optimization and compilation.&#8221;<\/p>\n<\/blockquote>\n\n\n\n<p>The following researchers have contributed to this project:<\/p>\n\n\n\n<p><em>(In alphabetical order) <\/em><a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/people\/weicu\/\">Wei Cui<\/a>, <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/yuxiaoguo.github.io\/\" target=\"_blank\" rel=\"noopener noreferrer\">Yuxiao Guo<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, <a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/people\/wenxh\/\">Wenxiang Hu<\/a>, <a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/people\/lingm\/\">Lingxiao Ma<\/a>, <a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/people\/yomia\/\">Youshan Miao<\/a>, <a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/people\/zimiao\/\">Ziming Miao<\/a>, <a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/people\/yuqxia\/\">Yuqing Xia<\/a>, <a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/people\/jxue\/\">Jilong Xue<\/a>, <a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/people\/fanyang\/\">Fan Yang<\/a>, <a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/people\/maoyang\/\">Mao Yang<\/a>, <a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/people\/lidongz\/\">Lidong Zhou<\/a><\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<p><a id=\"_ftn1\" href=\"#_ftnref1\">[1]<\/a> Grinder is the research project name. 
However, this system is referred to as Cocktailer in the paper.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>A new quartet of AI compilers: Rammer, Roller, Welder, and Grinder, tackle a range of compiler optimization challenges based on the same tile abstraction, providing a comprehensive solution to connect AI models with hardware accelerators.<\/p>\n","protected":false},"author":42183,"featured_media":963633,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","msr-author-ordering":[],"msr_hide_image_in_river":0,"footnotes":""},"categories":[1],"tags":[],"research-area":[13547],"msr-region":[],"msr-event-type":[],"msr-locale":[268875],"msr-post-option":[],"msr-impact-theme":[],"msr-promo-type":[],"msr-podcast-series":[],"class_list":["post-963594","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-research-blog","msr-research-area-systems-and-networking","msr-locale-en_us"],"msr_event_details":{"start":"","end":"","location":""},"podcast_url":"","podcast_episode":"","msr_research_lab":[199560],"msr_impact_theme":[],"related-publications":[],"related-downloads":[],"related-videos":[],"related-academic-programs":[],"related-groups":[510017,920469,922377],"related-projects":[555282],"related-events":[],"related-researchers":[],"msr_type":"Post","featured_image_thumbnail":"<img width=\"960\" height=\"540\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2023\/08\/AI_Compiler-blog-hero-1400x788-1-960x540.png\" class=\"img-object-cover\" alt=\"&#039;Model Representation&#039; with an arrow down to &#039;Tile-based IR&#039; with an arrow down to &#039;Hardware Abstraction&#039; to the left of four icons representing a Rammer, Roller, Welder, and Grinder\" decoding=\"async\" loading=\"lazy\" 
srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2023\/08\/AI_Compiler-blog-hero-1400x788-1-960x540.png 960w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2023\/08\/AI_Compiler-blog-hero-1400x788-1-300x169.png 300w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2023\/08\/AI_Compiler-blog-hero-1400x788-1-1024x576.png 1024w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2023\/08\/AI_Compiler-blog-hero-1400x788-1-768x432.png 768w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2023\/08\/AI_Compiler-blog-hero-1400x788-1-1066x600.png 1066w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2023\/08\/AI_Compiler-blog-hero-1400x788-1-655x368.png 655w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2023\/08\/AI_Compiler-blog-hero-1400x788-1-343x193.png 343w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2023\/08\/AI_Compiler-blog-hero-1400x788-1-240x135.png 240w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2023\/08\/AI_Compiler-blog-hero-1400x788-1-640x360.png 640w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2023\/08\/AI_Compiler-blog-hero-1400x788-1-1280x720.png 1280w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2023\/08\/AI_Compiler-blog-hero-1400x788-1.png 1400w\" sizes=\"auto, (max-width: 960px) 100vw, 960px\" \/>","byline":"","formattedDate":"August 30, 2023","formattedExcerpt":"A new quartet of AI compilers: Rammer, Roller, Welder, and Grinder, tackle a range of compiler optimization challenges based on the same tile abstraction, providing a comprehensive solution to connect AI models with hardware 
accelerators.","locale":{"slug":"en_us","name":"English","native":"","english":"English"},"_links":{"self":[{"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/posts\/963594","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/users\/42183"}],"replies":[{"embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/comments?post=963594"}],"version-history":[{"count":9,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/posts\/963594\/revisions"}],"predecessor-version":[{"id":1012269,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/posts\/963594\/revisions\/1012269"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/media\/963633"}],"wp:attachment":[{"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/media?parent=963594"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/categories?post=963594"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/tags?post=963594"},{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=963594"},{"taxonomy":"msr-region","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-region?post=963594"},{"taxonomy":"msr-event-type","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-event-type?post=963594"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/cm-edgetu
n.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=963594"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=963594"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=963594"},{"taxonomy":"msr-promo-type","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-promo-type?post=963594"},{"taxonomy":"msr-podcast-series","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-podcast-series?post=963594"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}