{"id":801178,"date":"2021-12-02T12:04:35","date_gmt":"2021-12-02T20:04:35","guid":{"rendered":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/?p=801178"},"modified":"2022-02-01T11:46:04","modified_gmt":"2022-02-01T19:46:04","slug":"efficiently-and-effectively-scaling-up-language-model-pretraining-for-best-language-representation-model-on-glue-and-superglue","status":"publish","type":"post","link":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/blog\/efficiently-and-effectively-scaling-up-language-model-pretraining-for-best-language-representation-model-on-glue-and-superglue\/","title":{"rendered":"Efficiently and effectively scaling up language model pretraining for best language representation model on GLUE and SuperGLUE"},"content":{"rendered":"\n<p>As part of Microsoft <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/innovation.microsoft.com\/en-us\/ai-at-scale\">AI at Scale<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, the Turing family of NLP models is being used at scale across Microsoft to enable the next generation of AI experiences. Today, <strong>we are happy to announce that the latest Microsoft Turing model (T-NLRv5) is the state of the art at the top of SuperGLUE and GLUE leaderboards,<\/strong> further surpassing human performance and other models. Notably, T-NLRv5 is the first to achieve human parity on MNLI and RTE on the GLUE benchmark, the last two GLUE tasks on which human parity had not yet been reached. 
In addition, T-NLRv5 is more efficient than recent pretraining models, achieving comparable effectiveness with 50% fewer parameters and pretraining computing costs.<\/p>\n\n\n\n<div class=\"annotations \" data-bi-aN=\"citation\">\n\t<article class=\"annotations__list card depth-16 bg-body p-4 \">\n\t\t<div class=\"annotations__list-item\">\n\t\t\t\t\t\t<span class=\"annotations__type d-block text-uppercase font-weight-semibold text-neutral-300 small\">Research Project <\/span>\n\t\t\t<a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/project\/ai-at-scale\/\" data-bi-cN=\"AI at Scale \" data-external-link=\"false\" data-bi-aN=\"citation\" data-bi-type=\"annotated-link\" class=\"annotations__link font-weight-semibold text-decoration-none\"><span>AI at Scale <\/span>&nbsp;<span class=\"glyph-in-link glyph-append glyph-append-chevron-right\" aria-hidden=\"true\"><\/span><\/a>\t\t\t\t\t<\/div>\n\t<\/article>\n<\/div>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-large is-resized\"><a data-bi-bhvr=\"14\"  data-bi-cn=\"Figure 2: SuperGLUE leaderboard showing T-NLRv5 at the top\" href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/12\/FINAL-SuperGlue-2Fig2.png\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/12\/FINAL-SuperGlue-2Fig2.png\" alt=\"Figure 2: SuperGLUE leaderboard showing T-NLRv5 at the top\" width=\"900\" height=\"234\"\/><\/a><figcaption>Figure 1: SuperGLUE leaderboard showing T-NLRv5 at the top<\/figcaption><\/figure><\/div>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-large is-resized\"><a data-bi-bhvr=\"14\"  data-bi-cn=\"Figure 1: GLUE leaderboard showing T-NLRv5 at the top\" href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/12\/FINAL-Glue-4_Fig1.png\"><img loading=\"lazy\" decoding=\"async\" 
src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/12\/FINAL-Glue-4_Fig1.png\" alt=\"Figure 1: GLUE leaderboard showing T-NLRv5 at the top\" width=\"815\" height=\"434\"\/><\/a><figcaption>Figure 2: GLUE leaderboard showing T-NLRv5 at the top<\/figcaption><\/figure><\/div>\n\n\n\n<p>The Turing Natural Language Representation (T-NLRv5) integrates some of the best modeling techniques developed by Microsoft Research, Azure AI, and Microsoft Turing. The models are pretrained at large scale using an efficient training framework based on FastPT and DeepSpeed. We\u2019re excited to bring new AI improvements to Microsoft products using these state-of-the-art techniques.<\/p>\n\n\n\n<h2 id=\"model-architecture-and-pretraining-task\">Model architecture and pretraining task<\/h2>\n\n\n\n<p>T-NLRv5&nbsp;is largely based on our recent work,&nbsp;<a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/arxiv.org\/pdf\/2102.08473.pdf\" target=\"_blank\" rel=\"noopener noreferrer\">COCO-LM<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, a&nbsp;natural evolution of the pretraining paradigm that combines the benefits of ELECTRA-style models and&nbsp;corrective&nbsp;language model pretraining.&nbsp;As illustrated in Figure 3,&nbsp;T-NLRv5&nbsp;employs an auxiliary&nbsp;transformer language model to corrupt an input text sequence, and the main transformer model is pretrained using the&nbsp;<em>corrective language model<\/em>&nbsp;task, which is to detect and correct tokens replaced by the auxiliary model. 
This augments the ELECTRA model family with language modeling capacity, bringing together the benefits from pretraining with adversarial signals generated from the auxiliary&nbsp;model and the language modeling capacity,&nbsp;which is handy for prompt-based learning.&nbsp;&nbsp;<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"605\" height=\"219\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/12\/Figure3TNLRv5BlogPost.png\" alt=\"Figure 3: Model architecture of T-NLRv5\" class=\"wp-image-801181\" srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/12\/Figure3TNLRv5BlogPost.png 605w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/12\/Figure3TNLRv5BlogPost-300x109.png 300w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/12\/Figure3TNLRv5BlogPost-240x87.png 240w\" sizes=\"auto, (max-width: 605px) 100vw, 605px\" \/><figcaption>Figure 3: Model architecture of T-NLRv5<\/figcaption><\/figure><\/div>\n\n\n\n<p>We also leverage the training dataset and the data processing pipeline optimized for developing previous T-NLR releases, including&nbsp;<a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/publication\/deberta-decoding-enhanced-bert-with-disentangled-attention-2\/\" target=\"_blank\" rel=\"noreferrer noopener\">DeBERTa<\/a>&nbsp;and&nbsp;<a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/publication\/unilmv2-pseudo-masked-language-models-for-unified-language-model-pre-training\/\" target=\"_blank\" rel=\"noreferrer noopener\">UniLM<\/a>, as well as the implementation optimizations from other Microsoft pretraining research efforts, such as&nbsp;<a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/arxiv.org\/pdf\/2006.15595.pdf\" target=\"_blank\" rel=\"noopener noreferrer\">TUPE<span 
class=\"sr-only\"> (opens in new tab)<\/span><\/a>.&nbsp;&nbsp;<\/p>\n\n\n\n<p>Another key property of T-NLRv5 is that it maintains its effectiveness from smaller sizes, e.g., base and large with a few hundred million parameters, to bigger sizes with billions of parameters. This is achieved by carefully selecting techniques that maintain model simplicity and optimization stability. We disabled dropout in the auxiliary model so that the pretraining of the auxiliary model and the generation of the main model\u2019s training data are done in one pass. We also disabled the sequence contrastive learning task in COCO-LM to reduce computing cost. This enables us to stick to the post-layer norm transformer architecture that allows us to train deeper transformer networks more thoroughly.<\/p>\n\n\n\n<h2 id=\"efficiently-scaling-up-language-model-pretraining\">Efficiently scaling up language model pretraining<\/h2>\n\n\n\n<p>Training billion-parameter neural models can be prohibitively expensive in both time and computing costs. This yields a long experimental cycle that slows down scientific developments and raises cost-benefit concerns. In making T-NLRv5, we leveraged two approaches to improve its scaling efficiency and ensure optimal use of model parameters and pretraining compute.<\/p>\n\n\n\n<p><strong>Customized CUDA kernels for mixed precision.<\/strong> We leverage the CUDA kernels developed for Fast PreTraining (FastPT), which are customized for the transformer architecture and optimized for speed in mixed-precision (FP16) pretraining. This not only improves the efficiency of transformer training and inference by 20%, but also provides better numerical stability in mixed-precision training. 
The latter is one of the most important needs when pretraining language representation models with billions of parameters.<\/p>\n\n\n\n<p><strong>ZeRO&nbsp;optimizer.<\/strong>&nbsp;When scaling up T-NLRv5 to billions of parameters, we use the&nbsp;<a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/arxiv.org\/abs\/1910.02054\" target=\"_blank\" rel=\"noopener noreferrer\">ZeRO<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>&nbsp;optimizer technique from DeepSpeed, described in a&nbsp;<a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/blog\/zero-deepspeed-new-system-optimizations-enable-training-models-with-over-100-billion-parameters\/\" target=\"_blank\" rel=\"noreferrer noopener\">previous blog post<\/a>, to reduce the GPU memory footprint of pretraining models in multi-machine parallel pretraining&nbsp;processes.&nbsp;Specifically, the T-NLRv5 XXL (5.4&nbsp;billion parameters) version uses&nbsp;ZeRO optimizer stage 1 (optimizer state partitioning),&nbsp;which&nbsp;reduces the GPU memory&nbsp;footprint&nbsp;by&nbsp;a factor of&nbsp;five.&nbsp;<\/p>\n\n\n\n<h2 id=\"achieving-best-effectiveness-and-efficiency-simultaneously\">Achieving best effectiveness and efficiency simultaneously<\/h2>\n\n\n\n<p>By combining the above modeling techniques and infrastructure improvements, T-NLRv5 provides the best effectiveness and efficiency simultaneously at various trade-off points. To the best of our knowledge, T-NLRv5 achieves state-of-the-art effectiveness at various model sizes and pretraining computing costs.<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"808\" height=\"486\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/12\/Figure-4_TNLRv5Blog-POst.png\" alt=\"Figure 4. MNLI performance with different model sizes with single model and vanilla fine-tuning. 
Base and Large are standard 12\/24-layer transformer with 768\/1024 hidden dimension.\" class=\"wp-image-801184\" srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/12\/Figure-4_TNLRv5Blog-POst.png 808w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/12\/Figure-4_TNLRv5Blog-POst-300x180.png 300w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/12\/Figure-4_TNLRv5Blog-POst-768x462.png 768w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/12\/Figure-4_TNLRv5Blog-POst-240x144.png 240w\" sizes=\"auto, (max-width: 808px) 100vw, 808px\" \/><figcaption>Figure 4. MNLI performance with different model sizes with single model and vanilla fine-tuning. Base and Large are standard 12\/24-layer transformer with 768\/1024 hidden dimension.<\/figcaption><\/figure><\/div>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"791\" height=\"468\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/12\/Fig5_TNLRv5-Blog-Post-v2.png\" alt=\"Figure 5. MNLI performance of T-NLRv5 at different pretraining steps with single model and vanilla fine-tuning.\" class=\"wp-image-801274\" srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/12\/Fig5_TNLRv5-Blog-Post-v2.png 791w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/12\/Fig5_TNLRv5-Blog-Post-v2-300x177.png 300w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/12\/Fig5_TNLRv5-Blog-Post-v2-768x454.png 768w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/12\/Fig5_TNLRv5-Blog-Post-v2-240x142.png 240w\" sizes=\"auto, (max-width: 791px) 100vw, 791px\" \/><figcaption>Figure 5. 
MNLI performance of T-NLRv5 at different pretraining steps with single model and vanilla fine-tuning.<\/figcaption><\/figure><\/div>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"252\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/12\/Table1_TNLRv5_MSRBlog-Post-1024x252.png\" alt=\"Table 1. Model configurations. The parameter increase of Base and Large comes from using a 128K vocabulary.\" class=\"wp-image-801190\" srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/12\/Table1_TNLRv5_MSRBlog-Post-1024x252.png 1024w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/12\/Table1_TNLRv5_MSRBlog-Post-300x74.png 300w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/12\/Table1_TNLRv5_MSRBlog-Post-768x189.png 768w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/12\/Table1_TNLRv5_MSRBlog-Post-240x59.png 240w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/12\/Table1_TNLRv5_MSRBlog-Post.png 1273w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption>Table 1. Model configurations. The parameter increase of Base and Large comes from using a 128K vocabulary.<\/figcaption><\/figure><\/div>\n\n\n\n<p>The model configurations for T-NLRv5 variants are displayed in Table 1. As shown in Figure 4 and Figure 5, when measured on MNLI, one of the most stable tasks on GLUE, T-NLRv5 variants with substantially fewer parameters or computing steps often significantly outperform previous pretraining models with larger pretraining costs. T-NLRv5&#8217;s base version outperforms RoBERTa Large using 50% of the parameters. Using 434 million parameters, T-NLRv5 Large performs on par with DeBERTa XL (1.5 billion parameters) and outperforms the Megatron encoder with 3.9 billion parameters. 
T-NLRv5 also significantly improves pretraining efficiency: it reaches the accuracy of our latest XL model, T-NLRv4-1.5B, with only 40% of the pretraining steps using the same training corpora and computing environments.<\/p>\n\n\n\n<h3 id=\"robust-model-adaptation\">Robust model adaptation<\/h3>\n\n\n\n<p>Robustness is important&nbsp;for a&nbsp;model&nbsp;to&nbsp;perform&nbsp;well&nbsp;on&nbsp;test&nbsp;samples&nbsp;that&nbsp;are dramatically different from the training data.&nbsp;In this work, we use two methods to improve the robustness of&nbsp;adapting T-NLRv5 to downstream tasks. The first method enhances model robustness through <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/2010.12638\">PDR<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> (posterior differential regularization), which regularizes the model posterior difference between clean and noisy inputs during model training.&nbsp;The second method is multi-task learning,&nbsp;as in&nbsp;<a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/publication\/multi-task-deep-neural-networks-for-natural-language-understanding-2\/\">multi-task&nbsp;deep&nbsp;neural&nbsp;network<\/a>&nbsp;(MT-DNN), which improves model robustness by learning&nbsp;representations across multiple NLU tasks. 
MT-DNN not only leverages large amounts of cross-task data, but also benefits from a regularization effect that leads to more general representations in order to adapt to new tasks and domains.&nbsp;<\/p>\n\n\n\n<p>With these robust model adaptation techniques, our T-NLRv5 XXL model is the first to reach human parity on MNLI in test accuracy (92.5 versus 92.4), the most informative task on GLUE, while only using a single model and single task fine-tuning, i.e., without ensemble.<\/p>\n\n\n\n<p>Table 2 presents some examples from MNLI dev-mismatched set where the T-NLRv5 XXL model can predict the correct label, but one of our authors made the wrong prediction. These are quite difficult examples, and we are glad to see T-NLRv5 XXL can accurately complete the task.<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"810\" height=\"591\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/12\/Table2_TNLRv5_MSRblgo.png\" alt=\"Table 2. MNLI Dev mismatched examples. The task is to predict whether the premise sentence entails\/contradicts or is neutral with the hypothesis.\" class=\"wp-image-801193\" srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/12\/Table2_TNLRv5_MSRblgo.png 810w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/12\/Table2_TNLRv5_MSRblgo-300x219.png 300w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/12\/Table2_TNLRv5_MSRblgo-768x560.png 768w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/12\/Table2_TNLRv5_MSRblgo-240x175.png 240w\" sizes=\"auto, (max-width: 810px) 100vw, 810px\" \/><figcaption>Table 2. MNLI Dev mismatched examples. 
The task is to predict whether the premise sentence entails\/contradicts or is neutral with the hypothesis.<\/figcaption><\/figure><\/div>\n\n\n\n<h2 id=\"t-nlrv5-release-information\">T-NLRv5: Release Information<\/h2>\n\n\n\n<p>We will make T-NLRv5 and its capabilities available in the same way as with other Microsoft Turing models.<br>We will leverage its increased capabilities to further improve the execution of popular language tasks in <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/azure.microsoft.com\/en-us\/services\/cognitive-services\/#api\" target=\"_blank\" rel=\"noopener noreferrer\">Azure Cognitive Services<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>. Customers will automatically benefit from these improvements.<\/p>\n\n\n\n<p>Customers interested in using Turing models for their own specific task can submit a request to join the <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/forms.office.com\/Pages\/ResponsePage.aspx?id=v4j5cvGGr0GRqy180BHbR1UCeHaYbjdIhNGA6afHCs1UOEtWRUJSUDBOUlM2TkpCQ01SMUlLVzJNMS4u&fsw=0\">Turing Private Preview<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>. 
Finally, we will make T-NLRv5 available to researchers for collaborative projects via the <a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/collaboration\/microsoft-turing-academic-program\/collaboration-projects\/\">Microsoft Turing Academic Program<\/a>.<\/p>\n\n\n\n<p>Learn more: <\/p>\n\n\n\n<p><a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/ai\/ai-lab-aiatscale\">Explore an interactive demo with AI at Scale models<\/a><\/p>\n\n\n\n<p><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/innovation.microsoft.com\/en-us\/exploring-ai-at-scale\">Learn more about the technology layers that power AI at Scale models<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/p>\n\n\n\n<p><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/innovation.microsoft.com\/en-us\/tech-minutes-superglue\">See how DeBERTa, part of Microsoft\u2019s Turing family of models, performs against SuperGLUE tasks<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/p>\n\n\n\n<h2 id=\"conclusion-building-and-democratizing-more-inclusive-ai\">Conclusion: Building and democratizing more inclusive AI<\/h2>\n\n\n\n<p>The Microsoft Turing model family plays an important role in delivering language-based AI experiences in Microsoft products. 
T-NLRv5 further surpassing human performance on SuperGLUE and GLUE leaderboards reaffirms our commitment to keep pushing the boundaries of NLP and continuously improving these models so that we can ultimately bring smarter, more responsible AI product experiences to our customers.<\/p>\n\n\n\n<p>We welcome your feedback and look forward to sharing more developments in the future.<\/p>\n\n\n\n<p><\/p>\n\n\n\n<p><\/p>\n","protected":false},"excerpt":{"rendered":"<p>As part of Microsoft AI at Scale (opens in new tab), the Turing family of NLP models are being used at scale across Microsoft to enable the next generation of AI experiences. Today, we are happy to announce that the latest Microsoft Turing model (T-NLRv5) is the state of the art at the top of [&hellip;]<\/p>\n","protected":false},"author":40735,"featured_media":801355,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","msr-author-ordering":[{"type":"user_nicename","value":"Jianfeng Gao","user_id":"32246"},{"type":"user_nicename","value":"Saurabh 
Tiwary","user_id":"39603"}],"msr_hide_image_in_river":0,"footnotes":""},"categories":[1],"tags":[],"research-area":[13556],"msr-region":[],"msr-event-type":[],"msr-locale":[268875],"msr-post-option":[],"msr-impact-theme":[],"msr-promo-type":[],"msr-podcast-series":[],"class_list":["post-801178","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-research-blog","msr-research-area-artificial-intelligence","msr-locale-en_us"],"msr_event_details":{"start":"","end":"","location":""},"podcast_url":"","podcast_episode":"","msr_research_lab":[],"msr_impact_theme":[],"related-publications":[],"related-downloads":[],"related-videos":[],"related-academic-programs":[],"related-groups":[722851],"related-projects":[649749,715045],"related-events":[],"related-researchers":[{"type":"user_nicename","value":"Jianfeng Gao","user_id":32246,"display_name":"Jianfeng Gao","author_link":"<a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/people\/jfgao\/\" aria-label=\"Visit the profile page for Jianfeng Gao\">Jianfeng Gao<\/a>","is_active":false,"last_first":"Gao, Jianfeng","people_section":0,"alias":"jfgao"}],"msr_type":"Post","featured_image_thumbnail":"<img width=\"960\" height=\"540\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/12\/1400x788_Super_glue_leaderboard_still_FINAL-scaled-960x540.jpg\" class=\"img-object-cover\" alt=\"SuperGLUE leaderboards showing T-NLRv5 at the top\" decoding=\"async\" loading=\"lazy\" srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/12\/1400x788_Super_glue_leaderboard_still_FINAL-scaled-960x540.jpg 960w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/12\/1400x788_Super_glue_leaderboard_still_FINAL-300x169.jpg 300w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/12\/1400x788_Super_glue_leaderboard_still_FINAL-1024x576.jpg 1024w, 
https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/12\/1400x788_Super_glue_leaderboard_still_FINAL-768x432.jpg 768w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/12\/1400x788_Super_glue_leaderboard_still_FINAL-1536x865.jpg 1536w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/12\/1400x788_Super_glue_leaderboard_still_FINAL-2048x1153.jpg 2048w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/12\/1400x788_Super_glue_leaderboard_still_FINAL-1066x600.jpg 1066w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/12\/1400x788_Super_glue_leaderboard_still_FINAL-655x368.jpg 655w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/12\/1400x788_Super_glue_leaderboard_still_FINAL-343x193.jpg 343w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/12\/1400x788_Super_glue_leaderboard_still_FINAL-240x135.jpg 240w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/12\/1400x788_Super_glue_leaderboard_still_FINAL-scaled-640x360.jpg 640w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/12\/1400x788_Super_glue_leaderboard_still_FINAL-1280x720.jpg 1280w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/12\/1400x788_Super_glue_leaderboard_still_FINAL-1920x1080.jpg 1920w\" sizes=\"auto, (max-width: 960px) 100vw, 960px\" \/>","byline":"<a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/people\/jfgao\/\" title=\"Go to researcher profile for Jianfeng Gao\" aria-label=\"Go to researcher profile for Jianfeng Gao\" data-bi-type=\"byline author\" data-bi-cN=\"Jianfeng Gao\">Jianfeng Gao<\/a> and Saurabh Tiwary","formattedDate":"December 2, 2021","formattedExcerpt":"As part of Microsoft AI at Scale (opens in new tab), the Turing family of NLP models are being used at scale across Microsoft to enable the next generation of 
AI experiences. Today, we are happy to announce that the latest Microsoft Turing model (T-NLRv5)&hellip;","locale":{"slug":"en_us","name":"English","native":"","english":"English"},"_links":{"self":[{"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/posts\/801178","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/users\/40735"}],"replies":[{"embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/comments?post=801178"}],"version-history":[{"count":14,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/posts\/801178\/revisions"}],"predecessor-version":[{"id":817456,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/posts\/801178\/revisions\/817456"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/media\/801355"}],"wp:attachment":[{"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/media?parent=801178"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/categories?post=801178"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/tags?post=801178"},{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=801178"},{"taxonomy":"msr-region","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-region?post=801178"},{"taxonomy":"msr-event-type","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-event
-type?post=801178"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=801178"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=801178"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=801178"},{"taxonomy":"msr-promo-type","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-promo-type?post=801178"},{"taxonomy":"msr-podcast-series","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-podcast-series?post=801178"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}