{"id":303689,"date":"2012-05-21T09:00:02","date_gmt":"2012-05-21T16:00:02","guid":{"rendered":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/?p=303689"},"modified":"2016-11-28T10:19:58","modified_gmt":"2016-11-28T18:19:58","slug":"data-fast-lane","status":"publish","type":"post","link":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/blog\/data-fast-lane\/","title":{"rendered":"Data in the Fast Lane"},"content":{"rendered":"<p><em>By Douglas Gantenbein, Senior Writer, Microsoft News Center<\/em><\/p>\n<p>A new approach to managing data over a network has enabled a Microsoft Research team to set a speed record for sifting through, or \u201csorting,\u201d a huge amount of data in one minute.<\/p>\n<p>The team conquered what is known as the <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" href=\"http:\/\/sortbenchmark.org\/\" target=\"_blank\">MinuteSort benchmark<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>\u2014a measure of data-crunching speed devised by the late Jim Gray, a renowned Microsoft Research scientist, and deemed the \u201cWorld Cup\u201d of data sorting. The MinuteSort benchmark measures how quickly data can be sorted starting and ending on disks. Sorting is a basic function in computing, demonstrating the ability of a network to move and organize data so it can be analyzed and used.<\/p>\n<p>The team, led by Jeremy Elson in the <a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/group\/distributed-systems-redmond\/\" target=\"_blank\">Distributed Systems<\/a> group at <a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/lab\/microsoft-research-redmond\/\" target=\"_blank\">Microsoft Research Redmond<\/a>, set the new sort benchmark by using a radically different approach to sorting called Flat Datacenter Storage (FDS). The team\u2019s system sorted almost three times the amount of data (1,401 gigabytes vs. 
500 gigabytes) with about one-sixth the hardware resources (1,033 disks across 250 machines vs. 5,624 disks across 1,406 machines) used by the previous record holder, a team from Yahoo! that set the mark in 2009.<\/p>\n<h2>Two Hundred Bytes for Everybody<\/h2>\n<p>To put things in perspective, in one minute, the Microsoft Research team sorted the equivalent of two 100-byte data records for every human being on the planet.<\/p>\n<p>The record is significant because it points toward a new method for crunching huge amounts of data using inexpensive servers. In an age when information is increasing in enormous quantities, the ability to move and deploy it is important for everything from web searches to business analytics to understanding climate change.<\/p>\n<p>In practice, heavy-duty sorting can be used by enterprises looking through huge data sets for a competitive advantage. The Internet also has made data sorting critical. Advertisements on Facebook pages, custom recommendations on Amazon, and up-to-the-second search results on <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" href=\"http:\/\/www.bing.com\/\" target=\"_blank\">Bing<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> all result from sorting.<\/p>\n<p>The award for the team\u2019s achievement will be presented during the 2012 <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" href=\"http:\/\/www.sigmod.org\/2012\/\" target=\"_blank\">SIGMOD\/PODS Conference<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, an international forum for database researchers, practitioners, developers, and users to explore cutting-edge ideas and results. 
This year\u2019s conference occurs in Scottsdale, Ariz., from May 20 to 24.<\/p>\n<div id=\"attachment_303704\" style=\"width: 410px\" class=\"wp-caption alignleft\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-303704\" class=\"size-full wp-image-303704\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2016\/10\/MinuteSort-team.png\" alt=\"record-setting MinuteSort team\" width=\"400\" height=\"250\" srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2016\/10\/MinuteSort-team.png 400w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2016\/10\/MinuteSort-team-300x188.png 300w\" sizes=\"auto, (max-width: 400px) 100vw, 400px\" \/><p id=\"caption-attachment-303704\" class=\"wp-caption-text\">The record-setting MinuteSort team: (from left) Jon Howell, Jeremy Elson, Ed Nightingale, Yutaka Suzue, Jinliang Fan, Johnson Apacible, and Rich Draves.<\/p><\/div>\n<p>The team, formed and led by Elson, included Johnson Apacible, <a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/people\/richdr\/\" target=\"_blank\">Rich Draves<\/a>, Jinliang Fan, Owen Hofmann, Jon Howell, <a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/people\/edn\/\" target=\"_blank\">Ed Nightingale<\/a>, <a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/people\/reubeno\/\" target=\"_blank\">Reuben Olinsky<\/a>, and Yutaka Suzue.<\/p>\n<p>Their approach was to take a fresh look at a relatively old model for sorting data. More than a decade ago, a network of computers would access data on a single file server, and each computer saw all of the data.<\/p>\n<p>But that model didn\u2019t scale up as data centers became larger. Researchers at Google tackled that problem in 2004, creating a data-management scheme called MapReduce. It worked by essentially sending computation to the data, rather than dragging data to a computer. 
It made possible computation across huge data sets using large numbers of cheap computers. In recent years, the Apache Software Foundation developed an open-source version of MapReduce dubbed Hadoop.<\/p>\n<p>MapReduce and Hadoop greatly advanced the state of data sorting. But, Elson says, they still weren\u2019t perfect.<\/p>\n<p>\u201cSome kinds of computations just can\u2019t be expressed that way,\u201d he says of the drag-computation-to-the-data model. \u201cIf you have two big data sets and you want to join them, you have to move the data somehow.\u201d<\/p>\n<p>Three years ago, Elson, Nightingale, and Howell had an insight into how new advances in network bandwidth could lead to a simpler model of data sorting\u2014one in which every computer saw all of the data\u2014while also scaling to handle massive data sets.<\/p>\n<p>The solution was dubbed Flat Datacenter Storage. Elson compares FDS to an organizational chart. In a hierarchical company, employees report to a superior, then to another superior, and so on. In a \u201cflat\u201d organization, they basically report to everyone, and vice versa.<\/p>\n<p>FDS takes advantage of another technology Microsoft Research helped develop, called full bisection bandwidth networks. 
If you were to draw an imaginary line through a collection of computers connected by a full bisection bandwidth network, every computer on one side of the line could send data at full speed to every computer on the other side of the line, and vice versa, no matter where the line is drawn.<\/p>\n<p>Using full bisection bandwidth networks, the FDS team built a system that could transfer data at two gigabytes per second on each computer for input, with another two gigabytes per second for output.<\/p>\n<h2>New Techniques Needed<\/h2>\n<p>\u201cThat\u2019s 20 times as much bandwidth as most computers in data centers have today,\u201d Elson says, \u201cand harnessing it required novel techniques.\u201d<\/p>\n<p>With that, the team was ready to take on the MinuteSort challenge. The contest actually has two parts: an \u201cIndy\u201d category, in which systems can be customized for the task of sorting, and a \u201cDaytona\u201d category, in which systems must meet requirements for general-purpose computing\u2014think super-sleek, open-wheel Indianapolis 500 cars versus Daytona 500 stock cars that look a little like what you see on the street.<\/p>\n<p>In 2011, a team from the University of California, San Diego, set a record in the Indy category, sorting 1,353 gigabytes of data in a minute. In the Daytona category, the record had been held by a team from Yahoo!, which sorted 500 gigabytes of data in a minute.<\/p>\n<p>The Microsoft Research team blew past both marks. Moreover, the team beat the standing Indy-sort record using a Daytona-class system. This isn\u2019t the first time that has happened, Elson says, but it is rare.<\/p>\n<p>The record represents a total efficiency improvement of almost 16 times. Interestingly, Microsoft Research set the record using a remote file system, which is an unusual choice of architecture for sorting because it commonly is perceived to be slow. 
Whereas most sorting systems read data locally from disk, exchange data once over the network, and write data locally to disk, in a remote file system, data is read, exchanged, and written over the network, so each data record crosses the network three times. The team deliberately handicapped the system to demonstrate the phenomenal performance of the new FDS file-system architecture.<\/p>\n<p>Thus far, the Microsoft Research team has worked with the Bing team to help Bing accelerate its search results. The Microsoft Research engineering team is partially funded by Bing and has been actively supported by <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" href=\"https:\/\/news.microsoft.com\/exec\/harry-shum\/#sm.00000ps1d7lkg5e8owbo91rvt5j5y\" target=\"_blank\">Harry Shum<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, the Microsoft corporate vice president who leads Core Search Development.<\/p>\n<h2>Exciting Breakthrough<\/h2>\n<p>\u201cWe are very excited about the MinuteSort breakthroughs made by our Microsoft Research colleagues,\u201d Shum says. \u201cI look forward to taking advantage of the FDS technology to further online infrastructure for Bing and for Microsoft\u2014and delivering even faster results to our users.\u201d<\/p>\n<p>Nightingale, co-leader of the FDS project with Elson, is working with Bing to integrate FDS to improve Bing\u2019s efficiency and speed.<\/p>\n<p>Given the ubiquity of interest in managing \u201cbig data,\u201d the Microsoft Research work is apt to find a home in several computing fields. 
It could be used in the biological sciences, managing gene sequencing or helping to create new classes of drugs, or it might help in stitching together aerial photographs to give people better imagery of the planet.<\/p>\n<p>The ability to sort data rapidly also will aid machine learning\u2014the design and development of algorithms that enable computers to create predictions based on data, such as sensor data or information from databases. Microsoft Research has a big stake in machine learning, in work ranging from language processing to security applications.<\/p>\n<p>\u201cImproving big-data performance has a wide range of implications across a huge number of businesses,\u201d Elson says. \u201cAlmost any big-data problem now becomes more efficient, which, in many cases, will be the difference between the work being economically feasible or not.\u201d<\/p>\n<p>For now, there\u2019s also a lot of celebrating going on.<\/p>\n<p>\u201cOur hands,\u201d Howell laughs, \u201care bruised from high-fiving.\u201d<\/p>\n","protected":false},"excerpt":{"rendered":"<p>By Douglas Gantenbein, Senior Writer, Microsoft News Center A new approach to managing data over a network has enabled a Microsoft Research team to set a speed record for sifting through, or \u201csorting,\u201d a huge amount of data in one minute. 
The team conquered what is known as the MinuteSort benchmark\u2014a measure of data-crunching speed [&hellip;]<\/p>\n","protected":false},"author":39507,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","msr-author-ordering":[],"msr_hide_image_in_river":0,"footnotes":""},"categories":[194466,194475,194477,194460],"tags":[186604,195257,213680,213683,213692,186867,186868,213677],"research-area":[13561,13563,13555,13547],"msr-region":[],"msr-event-type":[],"msr-locale":[268875],"msr-post-option":[],"msr-impact-theme":[],"msr-promo-type":[],"msr-podcast-series":[],"class_list":["post-303689","post","type-post","status-publish","format-standard","hentry","category-algorithms","category-database-data-analytics-platforms","category-distributed-systems","category-search-and-information-retrieval","tag-bing","tag-data-management","tag-data-sorting","tag-flat-datacenter-storage-fds","tag-full-bisection-bandwidth-networks","tag-hadoop","tag-mapreduce","tag-minutesort-benchmark","msr-research-area-algorithms","msr-research-area-data-platform-analytics","msr-research-area-search-information-retrieval","msr-research-area-systems-and-networking","msr-locale-en_us"],"msr_event_details":{"start":"","end":"","location":""},"podcast_url":"","podcast_episode":"","msr_research_lab":[199565],"msr_impact_theme":[],"related-publications":[],"related-downloads":[],"related-videos":[],"related-academic-programs":[],"related-groups":[],"related-projects":[],"related-events":[],"related-researchers":[],"msr_type":"Post","byline":"","formattedDate":"May 21, 2012","formattedExcerpt":"By Douglas Gantenbein, Senior Writer, Microsoft News Center A new approach to managing data over a network has enabled a Microsoft Research team to set a speed record for sifting through, or 
\u201csorting,\u201d a huge amount of data in one minute. The team conquered what&hellip;","locale":{"slug":"en_us","name":"English","native":"","english":"English"},"_links":{"self":[{"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/posts\/303689","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/users\/39507"}],"replies":[{"embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/comments?post=303689"}],"version-history":[{"count":2,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/posts\/303689\/revisions"}],"predecessor-version":[{"id":303734,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/posts\/303689\/revisions\/303734"}],"wp:attachment":[{"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/media?parent=303689"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/categories?post=303689"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/tags?post=303689"},{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=303689"},{"taxonomy":"msr-region","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-region?post=303689"},{"taxonomy":"msr-event-type","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-event-type?post=303689"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-local
e?post=303689"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=303689"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=303689"},{"taxonomy":"msr-promo-type","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-promo-type?post=303689"},{"taxonomy":"msr-podcast-series","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-podcast-series?post=303689"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}