{"id":700474,"date":"2020-10-23T04:18:53","date_gmt":"2020-10-23T11:18:53","guid":{"rendered":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/?post_type=msr-research-item&#038;p=700474"},"modified":"2020-10-23T08:21:56","modified_gmt":"2020-10-23T15:21:56","slug":"generalized-sub-query-fusion-for-eliminating-redundant-i-o-from-big-data-queries","status":"publish","type":"msr-research-item","link":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/publication\/generalized-sub-query-fusion-for-eliminating-redundant-i-o-from-big-data-queries\/","title":{"rendered":"Generalized Sub-Query Fusion for Eliminating Redundant I\/O from Big-Data Queries"},"content":{"rendered":"<p>SQL is the de-facto language for big-data analytics. Despite the cost of distributed SQL execution being dominated by disk and network I\/O, we find that state-of-the-art optimizers produce plans that are not I\/O optimal. For a significant fraction of queries (25% of popular benchmarks like TPCDS), a large amount of data is shuffled redundantly between different pairs of stages. The fundamental reason for this limitation is that optimizers do not have the right set of primitives to perform reasoning at the map-reduce level that can potentially identify and eliminate the redundant I\/O.<\/p>\n<p>This paper proposes <strong>resin<\/strong>, an optimizer extension that adds first-class support for map-reduce reasoning. <strong>Resin<\/strong> uses a novel technique called <em>Generalized Sub-Query Fusion<\/em> that identifies sub-queries computing on overlapping data, and <em>fuses<\/em> them into the same map-reduce stages. The analysis is general; it does not require that the sub-queries be syntactically the same, nor are they required to produce the same output. Sub-query fusion allows <strong>resin<\/strong> to sometimes also eliminate expensive binary operations like <em>Joins<\/em> and <em>Unions<\/em> altogether for further gains.<\/p>\n<p>We have integrated <strong>resin<\/strong> into <em>sparkSQL<\/em> and evaluated it on TPCDS, a standard analytics benchmark suite. Our results demonstrate that the proposed optimizations apply to 40% of the queries and speed up a large fraction of them by 1.1-6x, reducing the overall execution time of the benchmark suite by 12%.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>SQL is the de-facto language for big-data analytics. Despite the cost of distributed SQL execution being dominated by disk and network I\/O, we find that state-of-the-art optimizers produce plans that are not I\/O optimal. For a significant fraction of queries (25% of popular benchmarks like TPCDS), a large amount of data is shuffled redundantly between [&hellip;]<\/p>\n","protected":false},"featured_media":0,"template":"","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","msr-author-ordering":null,"msr_publishername":"USENIX","msr_publisher_other":"","msr_booktitle":"","msr_chapter":"","msr_edition":"","msr_editors":"","msr_how_published":"","msr_isbn":"","msr_issue":"","msr_journal":"","msr_number":"","msr_organization":"","msr_pages_string":"","msr_page_range_start":"","msr_page_range_end":"","msr_series":"","msr_volume":"","msr_copyright":"","msr_conference_name":"Symposium on Operating Systems Design and Implementation (OSDI)","msr_doi":"","msr_arxiv_id":"","msr_s2_paper_id":"","msr_mag_id":"","msr_pubmed_id":"","msr_other_authors":"","msr_other_contributors":"","msr_speaker":"","msr_award":"","msr_affiliation":"","msr_institution":"","msr_host":"","msr_version":"","msr_duration":"","msr_original_fields_of_study":"","msr_release_tracker_id":"","msr_s2_match_type":"","msr_citation_count_updated":"","msr_published_date":"2020-11-4","msr_highlight_text":"","msr_notes":"","msr_longbiography":"","msr_publicationurl":"","msr_external_url":"","msr_secondary_video_url":"","msr_conference_url":"","msr_journal_url":"","msr_s2_pdf_url":"","msr_year":0,"msr_citation_count":0,"msr_influential_citations":0,"msr_reference_count":0,"msr_s2_match_confidence":0,"msr_microsoftintellectualproperty":true,"msr_s2_open_access":false,"msr_s2_author_ids":[],"msr_pub_ids":[],"msr_hide_image_in_river":0,"footnotes":""},"msr-research-highlight":[],"research-area":[13563,13547],"msr-publication-type":[193716],"msr-publisher":[],"msr-focus-area":[],"msr-locale":[268875],"msr-post-option":[],"msr-field-of-study":[],"msr-conference":[],"msr-journal":[],"msr-impact-theme":[],"msr-pillar":[],"class_list":["post-700474","msr-research-item","type-msr-research-item","status-publish","hentry","msr-research-area-data-platform-analytics","msr-research-area-systems-and-networking","msr-locale-en_us"],"msr_publishername":"USENIX","msr_edition":"","msr_affiliation":"","msr_published_date":"2020-11-4","msr_host":"","msr_duration":"","msr_version":"","msr_speaker":"","msr_other_contributors":"","msr_booktitle":"","msr_pages_string":"","msr_chapter":"","msr_isbn":"","msr_journal":"","msr_volume":"","msr_number":"","msr_editors":"","msr_series":"","msr_issue":"","msr_organization":"","msr_how_published":"","msr_notes":"","msr_highlight_text":"","msr_release_tracker_id":"","msr_original_fields_of_study":"","msr_download_urls":"","msr_external_url":"","msr_secondary_video_url":"","msr_longbiography":"","msr_microsoftintellectualproperty":1,"msr_main_download":"","msr_publicationurl":"","msr_doi":"","msr_publication_uploader":[{"type":"url","viewUrl":"false","id":"false","title":"https:\/\/www.usenix.org\/conference\/osdi20\/presentation\/sarthi","label_id":"243109","label":0}],"msr_related_uploader":"","msr_citation_count":0,"msr_citation_count_updated":"","msr_s2_paper_id":"","msr_influential_citations":0,"msr_reference_count":0,"msr_arxiv_id":"","msr_s2_author_ids":[],"msr_s2_open_access":false,"msr_s2_pdf_url":null,"msr_attachments":[],"msr-author-ordering":[{"type":"text","value":"Partho Sarthi","user_id":0,"rest_url":false},{"type":"user_nicename","value":"Kaushik Rajan","user_id":32574,"rest_url":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=Kaushik Rajan"},{"type":"user_nicename","value":"Akash Lal","user_id":30905,"rest_url":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=Akash Lal"},{"type":"text","value":"Abhishek Modi","user_id":0,"rest_url":false},{"type":"text","value":"Prakhar Jain","user_id":0,"rest_url":false},{"type":"text","value":"Mo Liu","user_id":0,"rest_url":false},{"type":"text","value":"Ashit Gosalia","user_id":0,"rest_url":false},{"type":"text","value":"Saurabh Kalikar","user_id":0,"rest_url":false}],"msr_impact_theme":[],"msr_research_lab":[199562],"msr_event":[],"msr_group":[144939],"msr_project":[440907],"publication":[],"video":[],"msr-tool":[],"msr_publication_type":"inproceedings","related_content":{"projects":[{"ID":440907,"post_title":"Optimizing Big-Data Queries using Program Reasoning","post_name":"optimizing-big-data-queries-using-program-reasoning","post_type":"msr-project","post_date":"2017-11-15 21:06:29","post_modified":"2021-12-14 21:23:33","post_status":"publish","permalink":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/project\/optimizing-big-data-queries-using-program-reasoning\/","post_excerpt":"\u00a0This project is at the intersection of programming languages and database systems. The goal of the project is to use\u00a0programming languages techniques to analyze and optimize big-data queries. We show how program synthesis can be used to discover optimizations that big-data query optimizers miss today.\u00a0A big-data query optimizer produces an executable plan composed of map-reduce stages. We use program synthesis to produce plans with fewer stages than a query optimizer. A query optimizer has rules&hellip;","_links":{"self":[{"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/440907"}]}}]},"_links":{"self":[{"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item\/700474","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item"}],"about":[{"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/types\/msr-research-item"}],"version-history":[{"count":3,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item\/700474\/revisions"}],"predecessor-version":[{"id":700543,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item\/700474\/revisions\/700543"}],"wp:attachment":[{"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/media?parent=700474"}],"wp:term":[{"taxonomy":"msr-research-highlight","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-research-highlight?post=700474"},{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=700474"},{"taxonomy":"msr-publication-type","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-publication-type?post=700474"},{"taxonomy":"msr-publisher","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-publisher?post=700474"},{"taxonomy":"msr-focus-area","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-focus-area?post=700474"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=700474"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=700474"},{"taxonomy":"msr-field-of-study","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-field-of-study?post=700474"},{"taxonomy":"msr-conference","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-conference?post=700474"},{"taxonomy":"msr-journal","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-journal?post=700474"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=700474"},{"taxonomy":"msr-pillar","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-pillar?post=700474"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}