{"id":1129749,"date":"2025-02-14T06:35:32","date_gmt":"2025-02-14T14:35:32","guid":{"rendered":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/?post_type=msr-research-item&#038;p=1129749"},"modified":"2025-05-18T23:49:19","modified_gmt":"2025-05-19T06:49:19","slug":"protorail-a-risk-cognizant-imitation-agent-for-adaptive-vcpu-oversubscription-in-the-cloud","status":"publish","type":"msr-research-item","link":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/publication\/protorail-a-risk-cognizant-imitation-agent-for-adaptive-vcpu-oversubscription-in-the-cloud\/","title":{"rendered":"ProtoRAIL: A Risk-cognizant Imitation Agent for Adaptive vCPU Oversubscription In The Cloud"},"content":{"rendered":"<p>Safe optimization of operating costs is one of the holy grails of successful revenue-generating cloud systems, and capacity\/resource efficiency is a key factor in making that a reality. Among other strategies for resource efficiency across major cloud providers, oversubscription is an extremely prevalent practice in which more virtual resources are offered than the actual physical capacity, to minimize revenue loss from redundant capacity. While resources can be of any type, including compute, memory, power or network bandwidth, we highlight the scenario of virtual CPU (vCPU) oversubscription, since vCPU cores are the primary billable units for cloud services and have a substantial impact on the business as well as on users. For a seamless cloud experience that is also cost-efficient for the provider, suitable policies for controlling oversubscription margins are crucial. Narrow margins lead to redundant expenditure on under-utilized resource capacity, while wide margins lead to under-provisioning, where customer workloads may suffer from resource contention.<\/p>\n<p>Most oversubscription policies today are engineered either with tribal knowledge or with static heuristics about the system, which lead to catastrophic overloading or stranded\/under-utilized resources. 
Designing smart oversubscription policies that adapt to demand\/utilization patterns across time and granularity, and that jointly optimize cost benefits and risks, is a non-trivial and largely unsolved problem. We address this challenge with our novel Prototypical Risk-cognizant Active Imitation Learning (ProtoRAIL) framework, which exploits approximate symmetries in utilization patterns to learn suitable policies. The active knowledge-in-the-loop (KITL) module de-risks the learned policies. Our empirical investigations and real deployments on X company\u2019s internal (1st-party) cloud service show an orders-of-magnitude reduction (\u2248\u2265 90\u00d7) in risk and a significant increase in benefits (stranded resources saved: in the range of \u2248 7 to 10%).<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Safe optimization of operating costs is one of the holy grails of successful revenue-generating cloud systems and capacity\/resource efficiency is a key factor in making that a reality. Among other strategies for resource efficiency across major cloud providers, Oversubscription is an extremely prevalent practice where more virtual resources are offered than actual physical capacity to 
[&hellip;]<\/p>\n","protected":false},"featured_media":0,"template":"","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","msr-author-ordering":null,"msr_publishername":"","msr_publisher_other":"","msr_booktitle":"","msr_chapter":"","msr_edition":"","msr_editors":"","msr_how_published":"","msr_isbn":"","msr_issue":"","msr_journal":"","msr_number":"","msr_organization":"","msr_pages_string":"","msr_page_range_start":"","msr_page_range_end":"","msr_series":"","msr_volume":"","msr_copyright":"","msr_conference_name":"MLSys'25","msr_doi":"","msr_arxiv_id":"","msr_s2_paper_id":"","msr_mag_id":"","msr_pubmed_id":"","msr_other_authors":"","msr_other_contributors":"","msr_speaker":"","msr_award":"","msr_affiliation":"","msr_institution":"","msr_host":"","msr_version":"","msr_duration":"","msr_original_fields_of_study":"","msr_release_tracker_id":"","msr_s2_match_type":"","msr_citation_count_updated":"","msr_published_date":"2025-5-12","msr_highlight_text":"","msr_notes":"","msr_longbiography":"","msr_publicationurl":"","msr_external_url":"","msr_secondary_video_url":"","msr_conference_url":"https:\/\/mlsys.org\/","msr_journal_url":"","msr_s2_pdf_url":"","msr_year":0,"msr_citation_count":0,"msr_influential_citations":0,"msr_reference_count":0,"msr_s2_match_confidence":0,"msr_microsoftintellectualproperty":true,"msr_s2_open_access":false,"msr_s2_author_ids":[],"msr_pub_ids":[],"msr_hide_image_in_river":null,"footnotes":""},"msr-research-highlight":[],"research-area":[13556,13547],"msr-publication-type":[193716],"msr-publisher":[],"msr-focus-area":[],"msr-locale":[268875],"msr-post-option":[269148,269142],"msr-field-of-study":[246694,254068,246685,250669,251920],"msr-conference":[],"msr-journal":[],"msr-impact-theme":[],"msr-pillar":[],"class_list":["post-1129749","msr-research-item","type-msr-research-item","status-publish","hentry","msr-research-area-artificial-intellige
nce","msr-research-area-systems-and-networking","msr-locale-en_us","msr-post-option-approved-for-river","msr-post-option-include-in-river","msr-field-of-study-artificial-intelligence","msr-field-of-study-cloud-systems","msr-field-of-study-machine-learning","msr-field-of-study-optimization-problem","msr-field-of-study-resource-management"],"msr_publishername":"","msr_edition":"","msr_affiliation":"","msr_published_date":"2025-5-12","msr_host":"","msr_duration":"","msr_version":"","msr_speaker":"","msr_other_contributors":"","msr_booktitle":"","msr_pages_string":"","msr_chapter":"","msr_isbn":"","msr_journal":"","msr_volume":"","msr_number":"","msr_editors":"","msr_series":"","msr_issue":"","msr_organization":"","msr_how_published":"","msr_notes":"","msr_highlight_text":"","msr_release_tracker_id":"","msr_original_fields_of_study":"","msr_download_urls":"","msr_external_url":"","msr_secondary_video_url":"","msr_longbiography":"","msr_microsoftintellectualproperty":1,"msr_main_download":"","msr_publicationurl":"","msr_doi":"","msr_publication_uploader":[{"type":"file","viewUrl":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/02\/Proto_Imitation_Learning_for_Oversubscription_MLsys_2025.pdf","id":"1136598","title":"proto_imitation_learning_for_oversubscription_mlsys_2025","label_id":"243109","label":0}],"msr_related_uploader":[{"type":"file","viewUrl":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/02\/Appendix_Proto_Imitation_Learning_for_Oversubscription_MLsys_2025.pdf","id":"1136597","title":"appendix_proto_imitation_learning_for_oversubscription_mlsys_2025","label_id":"243118","label":0},{"type":"file","viewUrl":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/02\/ProtoRAIL.pdf","id":"1139511","title":"protorail","label_id":"243115","label":0},{"type":"file","viewUrl":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/02\/PosterMLSysSmaller.png","id":"11395
12","title":"postermlsyssmaller","label_id":"265542","label":0}],"msr_citation_count":0,"msr_citation_count_updated":"","msr_s2_paper_id":"","msr_influential_citations":0,"msr_reference_count":0,"msr_arxiv_id":"","msr_s2_author_ids":[],"msr_s2_open_access":false,"msr_s2_pdf_url":null,"msr_attachments":[{"id":1139511,"url":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/02\/ProtoRAIL.pdf"},{"id":1136598,"url":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/02\/Proto_Imitation_Learning_for_Oversubscription_MLsys_2025.pdf"},{"id":1136597,"url":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/02\/Appendix_Proto_Imitation_Learning_for_Oversubscription_MLsys_2025.pdf"},{"id":1129752,"url":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/02\/PREPRINT_w_authors_Proto_Imitation_Learning_for_Oversubscription_MLsys_2025.pdf"}],"msr-author-ordering":[{"type":"text","value":"Lu Wang","user_id":0,"rest_url":false},{"type":"user_nicename","value":"Mayukh Das","user_id":41140,"rest_url":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=Mayukh Das"},{"type":"user_nicename","value":"Fangkai Yang","user_id":41425,"rest_url":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=Fangkai Yang"},{"type":"user_nicename","value":"Bo Qiao","user_id":37848,"rest_url":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=Bo Qiao"},{"type":"user_nicename","value":"Hang Dong","user_id":39687,"rest_url":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=Hang Dong"},{"type":"user_nicename","value":"Si Qin","user_id":42006,"rest_url":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=Si Qin"},{"type":"user_nicename","value":"Victor 
Ruehle","user_id":41027,"rest_url":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=Victor Ruehle"},{"type":"user_nicename","value":"Chetan Bansal","user_id":31394,"rest_url":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=Chetan Bansal"},{"type":"text","value":"Eli Cortez","user_id":0,"rest_url":false},{"type":"user_nicename","value":"&Iacute;&ntilde;igo Goiri","user_id":32102,"rest_url":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=&Iacute;&ntilde;igo Goiri"},{"type":"user_nicename","value":"Saravan Rajmohan","user_id":41039,"rest_url":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=Saravan Rajmohan"},{"type":"user_nicename","value":"Qingwei Lin \u6797\u5e86\u7ef4","user_id":33318,"rest_url":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=Qingwei Lin \u6797\u5e86\u7ef4"},{"type":"user_nicename","value":"Dongmei Zhang","user_id":31665,"rest_url":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=Dongmei 
Zhang"}],"msr_impact_theme":[],"msr_research_lab":[],"msr_event":[],"msr_group":[793670,811276,1145968],"msr_project":[],"publication":[],"video":[],"msr-tool":[],"msr_publication_type":"inproceedings","related_content":[],"_links":{"self":[{"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item\/1129749","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item"}],"about":[{"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/types\/msr-research-item"}],"version-history":[{"count":1,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item\/1129749\/revisions"}],"predecessor-version":[{"id":1129755,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item\/1129749\/revisions\/1129755"}],"wp:attachment":[{"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/media?parent=1129749"}],"wp:term":[{"taxonomy":"msr-research-highlight","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-research-highlight?post=1129749"},{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=1129749"},{"taxonomy":"msr-publication-type","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-publication-type?post=1129749"},{"taxonomy":"msr-publisher","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-publisher?post=1129749"},{"taxonomy":"msr-focus-area","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-focus-area?post=1129749"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=1129749"},{"taxonomy":"msr-post-option","embeddable
":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=1129749"},{"taxonomy":"msr-field-of-study","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-field-of-study?post=1129749"},{"taxonomy":"msr-conference","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-conference?post=1129749"},{"taxonomy":"msr-journal","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-journal?post=1129749"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=1129749"},{"taxonomy":"msr-pillar","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-pillar?post=1129749"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}