{"id":811993,"date":"2019-02-13T19:25:53","date_gmt":"2019-02-14T03:25:53","guid":{"rendered":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/?post_type=msr-blog-post&#038;p=811993"},"modified":"2022-01-13T19:26:37","modified_gmt":"2022-01-14T03:26:37","slug":"competing-in-the-x-games-of-machine-learning-with-dr-manik-varma","status":"publish","type":"msr-blog-post","link":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/articles\/competing-in-the-x-games-of-machine-learning-with-dr-manik-varma\/","title":{"rendered":"Microsoft Research Podcast: Competing in the X Games of machine learning with Dr. Manik Varma"},"content":{"rendered":"<p><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-567159 size-large\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2019\/02\/Manik-Varma_POD_Manik-Varma_POD_Site_11_2018_1400x788-1024x576.png\" alt=\"\" width=\"1024\" height=\"576\" srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2019\/02\/Manik-Varma_POD_Manik-Varma_POD_Site_11_2018_1400x788-1024x576.png 1024w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2019\/02\/Manik-Varma_POD_Manik-Varma_POD_Site_11_2018_1400x788-300x169.png 300w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2019\/02\/Manik-Varma_POD_Manik-Varma_POD_Site_11_2018_1400x788-768x432.png 768w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2019\/02\/Manik-Varma_POD_Manik-Varma_POD_Site_11_2018_1400x788-1066x600.png 1066w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2019\/02\/Manik-Varma_POD_Manik-Varma_POD_Site_11_2018_1400x788-655x368.png 655w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2019\/02\/Manik-Varma_POD_Manik-Varma_POD_Site_11_2018_1400x788-343x193.png 343w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2019\/02\/Manik-Varma_POD_Manik-Varma_POD_Site_11_2018_1400x788.png 1400w\" 
sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/p>\n<p><iframe loading=\"lazy\" src=\"https:\/\/player.blubrry.com\/id\/41743911\/\" width=\"100%\" height=\"138px\" frameborder=\"0\" scrolling=\"no\"><span data-mce-type=\"bookmark\" style=\"display: inline-block; width: 0px; overflow: hidden; line-height: 0;\" class=\"mce_SELRES_start\">\ufeff<\/span><\/iframe><\/p>\n<h3>Episode 63 | February 13, 2019<\/h3>\n<p>If every question in life could be answered by choosing from just a few options, machine learning would be pretty simple, and life for machine learning researchers would be pretty sweet. Unfortunately, in both life and machine learning, things are a bit more complicated. That\u2019s why <a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/people\/manik\/\">Dr. Manik Varma<\/a>, Principal Researcher at <a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/lab\/microsoft-research-india\/\">MSR India<\/a>, is developing extreme classification systems to answer multiple-choice questions that have millions of possible options and help people find what they are looking for online more quickly, more accurately and less expensively.<\/p>\n<p>On today\u2019s podcast, Dr. 
Varma tells us all about extreme classification (including where in the world you might actually run into 10 or 100 million options), reveals how his <a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/publication\/parabel-partitioned-label-trees-for-extreme-classification-with-application-to-dynamic-search-advertising\/\">Parabel<\/a> and <a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/publication\/slice-scalable-linear-extreme-classifiers-trained-on-100-million-labels-for-related-searches\/\">Slice<\/a> algorithms are making high-quality recommendations in milliseconds, and proves, with both his life and his work, that being blind need not be a barrier to extreme accomplishment.<\/p>\n<h3>Related:<\/h3>\n<ul type=\"disc\">\n<li><a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/podcast\">Microsoft Research Podcast<\/a>: View more podcasts on Microsoft.com<\/li>\n<li><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/itunes.apple.com\/us\/podcast\/microsoft-research-a-podcast\/id1318021537?mt=2\">iTunes<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>: Subscribe and listen to new podcasts each week on iTunes<\/li>\n<li><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/subscribebyemail.com\/www.blubrry.com\/feeds\/microsoftresearch.xml\">Email<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>: Subscribe and listen by email<\/li>\n<li><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/subscribeonandroid.com\/www.blubrry.com\/feeds\/microsoftresearch.xml\">Android<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>: Subscribe and listen on Android<\/li>\n<li><a class=\"msr-external-link glyph-append 
glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/open.spotify.com\/show\/4ndjUXyL0hH1FXHgwIiTWU\">Spotify<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>: Listen on Spotify<\/li>\n<li><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/www.blubrry.com\/feeds\/microsoftresearch.xml\">RSS feed<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/li>\n<li><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/note.microsoft.com\/ww-registration-microsoft-research-newsletter-s.html?wt.mc_id=S-webpage_podcast\">Microsoft Research Newsletter<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>: Sign up to receive the latest news from Microsoft Research<\/li>\n<\/ul>\n<hr \/>\n<h3>Transcript<\/h3>\n<p>Manik Varma: In 2013, I thought there is no way we can learn 10 million or 100 million classifiers. And even if we could learn them, where would we store them? And even if we could store them, how would we make a prediction in a millisecond? And so, I just turned away from one-versus-all approaches and we tried developing trees and embeddings. But today, we\u2019ve actually managed to overcome all of those limitations. And the key trick is to go from linear time training and predictions to log-time training and prediction.<\/p>\n<p><strong>Host: You\u2019re listening to the Microsoft Research Podcast, a show that brings you closer to the cutting-edge of technology research and the scientists behind it. I\u2019m your host, Gretchen Huizinga.<\/strong><\/p>\n<p><strong>Host: If every question in life could be answered by choosing from just a few options, machine learning would be pretty simple, and life for machine learning researchers would be pretty sweet. 
Unfortunately, in both life and machine learning, things are a bit more complicated. That\u2019s why Dr. Manik Varma, Principal Researcher at MSR India, is developing extreme classification systems to answer multiple-choice questions that have millions of possible options and help people find what they are looking for online more quickly, more accurately and less expensively.<\/strong><\/p>\n<p><strong>On today\u2019s podcast, Dr. Varma tells us all about extreme classification (including where in the world you might actually run into 10 or 100 million options), reveals how his Parabel and Slice algorithms are making high-quality recommendations in milliseconds, and proves, with both his life and his work, that being blind need not be a barrier to extreme accomplishment. That and much more on this episode of the Microsoft Research Podcast.<\/strong><\/p>\n<p><strong>Host: Manik Varma, welcome to the podcast.<\/strong><\/p>\n<p>Manik Varma: Thanks, Gretchen. The pleasure\u2019s entirely mine.<\/p>\n<p><strong>Host: So, you\u2019re a principal researcher at MSR India and an Adjunct Professor of Computer Science at IIT Delhi. In addition to your work in computer science, you\u2019re a physicist, a theoretician, an engineer and a mathematician. You were a Rhodes Scholar at Oxford and a University Scholar when you did your doctoral work there. And you were a Post Doc at MSRI Berkeley. And you\u2019re blind. I\u2019m not going to call you Superman, but I would really love to know what has inspired you to get to this level of accomplishment in your life. What gets you up in the morning?<\/strong><\/p>\n<p>Manik Varma: I guess it\u2019s a combination of my hopes and desires on the one hand, and I guess fears as well on the other. So, I guess hopes and desires \u2013 I hope every day I get up I learn something new. That\u2019s one of the best feelings I\u2019ve had, and that\u2019s what\u2019s driven me all this way. 
And actually, that\u2019s why I\u2019m in research, because I get to ask new questions and learn about new things throughout my career. So that\u2019s fantastic. And the other hope and desire that I have that drives me is to build things and build things that will help millions of people around the world. And I guess there\u2019s some fears lurking behind that as well. I\u2019ve been worried about the \u201cImposter Syndrome\u201d all my life, and, uh, yeah\u2026 So, I guess the best way to tackle that is actually to try and do things and get things out there and have people use them and be happy with them. So, I guess that\u2019s the fear that\u2019s driving the hopes and desires.<\/p>\n<p><strong>Host: So Manik, all of this has been without the use of your eyes. How have you gone about all of this?<\/strong><\/p>\n<p>Manik Varma: Right, so I have a condition where the cells in my retina are deteriorating over time. So, in about 2009, I think, I started using a blind stick, and then I lost the ability to read papers, then recognize faces, stuff like that. But it makes life interesting right? I go up to my wife and ask, who are you? That\u2019s the secret to a happily married life!<\/p>\n<p><strong>Host: Well, so, you haven\u2019t been blind since birth?<\/strong><\/p>\n<p>Manik Varma: No.<\/p>\n<p><strong>Host: Oh, okay.<\/strong><\/p>\n<p>Manik Varma: Well, it\u2019s a hereditary condition, but nobody else in my family has it, so it started deteriorating from the time I was in school, and I lost the ability to do some things in school. But, it\u2019s only over the last, let\u2019s say, 10 years where it\u2019s become really significant. But again, it\u2019s been no big deal also, right? It\u2019s only in the last decade where I\u2019ve had to think about it, and that\u2019s, I think, because of my kids, right? 
I want to set them an example that they can\u2026 because this is hereditary, there is a probability they might have it, and if they do, I don\u2019t want them to feel that they can\u2019t do something, right? So, if they want, they can go climb Mount Everest or become the best historian or the best scientist or whatever. As long as they set their minds to it, they can do it.<\/p>\n<p><strong>Host: We\u2019re going to talk about this relatively new area of machine learning research called extreme classification in a minute. But let\u2019s start a bit upstream of your current work first. Give us a quick review of what we might call the machine learning research universe, as it were, focusing on a bit of the history and the types of classification that you work with, and this is just to give the framework for how extreme classification is different from other methodologies and why we need it.<\/strong><\/p>\n<p>Manik Varma: So, if you look at the history of machine learning, then the most well-studied problem is binary classification where we learn to answer yes\/no questions involving uncertainty. And then the community realized that, actually, there are many high-impact applications out there in the real world that are not just simple yes\/no questions, right? They\u2019re actually multiple-choice questions. And this leads to the field of multi-class classification. And then, after that, the community realized that there\u2019s some high-impact applications that are not only multiple-choice, but they also have multiple correct answers. And this led to the establishment of the area of multi-label classification. So just to take examples of all of these, if you ask the question, is Gretchen speaking right now or not, then that\u2019s binary classification. Whereas if you turn this into a multiple-choice question such as who\u2019s speaking right now? Is it Gretchen or Manik or Satya Nadella? 
So that\u2019s a multiple-choice question, that\u2019ll be multi-class classification. But now suppose you threw a cocktail party and you invited all the top machine learning AI scientists over there. And then you took a short clip and asked, who\u2019s speaking in this clip? That becomes a multi-label classification problem, because now multiple people could be speaking at the same time. And so, if you have L choices in front of you, then in multi-class, the output space is L-dimensional, or order L. But if you have a multi-label problem, then the output space is two to the power L. Because every person may or may not be speaking. So, you go from two choices in binary classification to tens to hundreds and thousands of choices in multi-class and multi-label learning. And if you looked at the state of the art in about 2012, the largest multi-label data set out there had about 5,000 labels. And I remember all my colleagues like running their algorithms for weeks on big clusters to try and solve this problem, because two to the power 5,000 is way more than the number of atoms in the universe. So, it\u2019s a hard problem.<\/p>\n<p><strong>Host: So, this has been a problem for quite some time. It\u2019s not brand new, right? But it\u2019s getting bigger? Is this our issue?<\/strong><\/p>\n<p>Manik Varma: Right. And actually, that\u2019s how extreme classification got started. So, as I mentioned, in 2012, the largest publicly available multi-label data set had about 5,000 labels. But then in 2013, we published a paper which exploded the number of labels being considered in a multi-label classifier from 5,000 to 10 million.<\/p>\n<p><strong>Host: Wow.<\/strong><\/p>\n<p>Manik Varma: And that really changed the nature of the game. So, the motivating application was to build a classifier that could be used as a tool by advertisers that would predict which Bing queries would lead to a click on the ad or the document. 
And, you can well imagine from the context of the application that this is a really important problem, from both the research as well as a commercial perspective. And so many sophisticated natural language processing, machine learning, information retrieval techniques have been developed in the literature to solve this problem. But unfortunately, none of these were working for our ads team. They had billions of ads for which all these sophisticated approaches were not making good quality predictions. And so, we decided to go back to the drawing board. We set aside all of these approaches and simply reformulated the problem as a multi-label classification problem where we would take the document as input, and then we would treat each of the top queries on Bing as a label, so you took the top 10 million monetizable queries on Bing, and now you just learn the classifier that will predict, for this particular document or ad, which subset of top 10 million Bing queries will lead to a click on the ad.<\/p>\n<p><strong>Host: Top 10 million?<\/strong><\/p>\n<p>Manik Varma: Yeah, so from 5,000 to 10 million. This was just a very different and completely new way of looking at the problem. And it took us two years to build the system, run the experiments, publish our results and check everything out. But once the results came in, we found that our approach was much better than all these traditional approaches. So, the number of ads for which you are making good quality recommendations went up from about 60% for the Bing system to about 95-98% for us.<\/p>\n<p><strong>Host: Wow.<\/strong><\/p>\n<p>Manik Varma: And the quality of our recommendations also improved a lot. And so that led to the establishment of the area of extreme classification which deals with multi-class and multi-label problems in extremely large label spaces. In millions or billions of labels. And I think that\u2019s exactly why extreme classification grew to be a whole new research area in itself. 
And that\u2019s because, I think, fundamentally new research questions arise when you go from, let\u2019s say, 100 labels to 100 million labels. Let me just give you a couple of examples if you\u2019ll permit me the time.<\/p>\n<p><strong>Host: Yes, please.<\/strong><\/p>\n<p>Manik Varma: The whole premise in supervised machine learning is that there\u2019s an expert sitting out there who we can give our data to, and he or she will label the data with ground truth: what\u2019s the right answer, right? What\u2019s the right prediction to make for this? But unfortunately, in extreme classification, there is no human being who can go through a list of 100 million labels to tell you, what are the right predictions to make for this data point. So even the most fundamental machine learning techniques such as cross-validation might go for a toss at the extreme scale. And you\u2019ll have missing labels in your test set, in your validation set, in your training set. And this is like a fundamental difference that you have with traditional classification where a human being could go through a list of 100 labels and mark out the right subset. Another really interesting question is the whole notion of what constitutes a good prediction changes when you go from 100 labels to, let\u2019s say, 100 million. When you have 100 labels, you need to go through that list of 100 and say, okay, which labels are relevant? Which labels are irrelevant? But when you have 100 million, nobody has the time or patience to go through it. So, you need to give your top five best predictions very quickly and you need to have them ranked with your best prediction at the very top and then the worst one at the bottom. And you need to make sure that you handle this \u201cmissing labels\u201d problem, because some of the answers that you predict might not have been marked by the expert. 
So, all of this changes as you go from one scale to the next scale.<\/p>\n<p>(music plays)<\/p>\n<p><strong>Host: Let\u2019s talk for a second about how extreme classification can be applied in other areas besides advertising. Tell us a little bit about the different applications in this field and where you think extreme classification is going.<\/strong><\/p>\n<p>Manik Varma: I think one of the most interesting questions that came out of our research was, when or where in the world will you ever have 10 million or 100 million labels to choose from? If you just think about it for a minute, 100 million is a really large number. Just to put it in context, to see how big a number this is, when I was doing my PhD in computer vision, the luminaries in the field would get up and tell the community that, according to Biederman\u2019s counts, there are only 60,000 object categories in the world. So, none of the classical visual problems will make the cut. And even if you were to pick up a fat Oxford English Dictionary, it would have somewhere around half a million words in it. So many traditional NLP problems might not also make the cut. Then over the last five years, people have actually found very high impact applications of extreme classification. And so, for example, one of them leads to reformulations of well-known problems in machine learning like ranking and recommendation, which are critical for our industry. So, suppose you wanted to, for instance, design a search engine, right? You can treat each document on the web as a label, and now when a query comes in, you can learn the classifier that will take the query\u2019s feature vector as input and predict which subset of documents on the web are relevant to this particular query. And so, then you can show those documents and you can rank them on the strength of the classifier\u2019s probabilities and you can reformulate ranking as a classification problem. And similarly, think about like recommendation, right? 
So, suppose you were to go onto a retailer\u2019s website. They have product catalogs that run into the millions, if not hundreds of millions. And so no human being can go through the entire catalog to find what they\u2019re looking for. And therefore, recommendation becomes critical for helping users find things they\u2019re looking for. And now you can treat each of the hundred million products that the retailer is selling as a particular category, and you learn a classifier that takes the user\u2019s feature vector as input and simply predicts which subset of categories are going to be of interest to the user and you recommend the items corresponding to those categories to the user. And so you can reformulate ranking and recommendation as extreme classification, and sometimes this can lead to very large performance gains as compared to traditional methods such as collaborative filtering or learning to rank or content-based methods. And so that\u2019s what extreme classification is really good for.<\/p>\n<p><strong>Host: Let\u2019s talk about results for a minute. How do we measure the quality of any machine learning classification system? I imagine there are some standard benchmarks. But if it\u2019s like any other discipline, there are tradeoffs, and you can\u2019t have everything. So, tell us how should we think about measurement? What kinds of measurements are you looking at, and how does extreme classification help there?<\/strong><\/p>\n<p>Manik Varma: So, the axes along which we measure the quality of a solution are training time, prediction time, model size and accuracy. So, let\u2019s look at these one by one. And they\u2019re all linked also. So, if you look at training time, this is absolutely critical for us. We cannot afford to let our extreme classifiers take months or years to train. So, think of the number of markets in which Bing operates. All over the world. 
If it were to take you, let\u2019s say, two days to train an extreme classifier, then every time you wanted to deploy in a new market, that would be two days gone. Every time you wanted to run an experiment where you change a feature, two days gone. Every time you want to tune hyperparameters, two days gone. And again, when I\u2019m saying two days, this is probably on a cluster of a thousand cores, right? So how many people or how many groups have access to those kinds of clusters? Whereas if you could bring your training time down to, let\u2019s say, 17 minutes on a single core of a standard desktop, now you can run millions of experiments and anyone can run these experiments on their personal machines. And so the speed with which you can run experiments and improve the quality of your product and your algorithm, there\u2019s a sea change in that.<\/p>\n<p><strong>Host: Yeah.<\/strong><\/p>\n<p>Manik Varma: So, training time becomes very important. But as you pointed out, right, you want to maintain accuracy. So, you can\u2019t say, oh, I\u2019ll cut down my training time, but then I\u2019ll make really bad predictions. That\u2019s not acceptable. And then the second thing is model size. So, if your classifier is going to take, let\u2019s say, a terabyte to build its model, then your cost of making recommendations will go up. You need to buy more expensive hardware. And, again, the number of machines that have a terabyte is limited, so the number of applications you can deal with in one go is limited. So again, you want to bring your model size down to a few gigabytes so that it can fit on standard hardware and anybody can use this for making predictions. And again, there\u2019s a tradeoff between model size and prediction time, right? You can always trade speed for memory, but now the question is, how can you bring down your model size to a few gigabytes without hitting your prediction time too much? 
And the reason prediction time is important is because, again, at the scale of Bing, we might have 100 billion documents that we need to process regularly. So, if your prediction time is more than a few milliseconds per document, then there is no way you can make a hundred billion predictions in a reasonable amount of time. So, your solution simply will not fly. And all of this, as I said, is tied to accuracy. Because people will not suffer poor recommendations.<\/p>\n<p><strong>Host: No. I won\u2019t.<\/strong><\/p>\n<p>Manik Varma: Yeah.<\/p>\n<p><strong>Host: Well so, circling back, how does extreme classification impact your training time, model size, speed, accuracy, all of that?<\/strong><\/p>\n<p>Manik Varma: If you look at the papers that we are publishing now, like Parabel and Slice, we can train on that same data set in, as I said, 17 minutes on a single core of a standard machine. And so, we\u2019ve really managed to cut down training time let\u2019s say by 10,000 times over the span of five years. We\u2019ve managed to reduce our model size from a terabyte to a few gigabytes. Our prediction time is a few milliseconds per test point, and we managed to increase accuracy from about 19% precision at one, to about 65% today. So, precision at one is if you just look at the top prediction that was made, was that correct or not? So, on a standard benchmark data set, it was 19% in 2013, and we managed to increase that to 65% today. So, there\u2019s been a remarkable improvement in almost all metrics over the last 5-6 years.<\/p>\n<p><strong>Host: You just mentioned Parabel and Slice. So, let\u2019s talk about those right now. First of all, Parabel. It\u2019s a hierarchical algorithm for extreme classification. 
Tell us about it and how it\u2019s advanced both the literature of machine learning research and the field in general.<\/strong><\/p>\n<p>Manik Varma: So, for the last five years, we\u2019ve been trying out lots of different approaches to extreme classification. We tried out trees. We tried out embeddings and deep learning. And we looked at one-versus-all approaches. And in a one-versus-all approach, you have to learn a classifier per label. And then you use all of them at prediction time to see which ones have the highest scores, and then that\u2019s the classes or the labels you recommend to the user. And in 2013, I thought there is no way we can learn 10 million or 100 million classifiers. And even if we could learn them, where would we store them? And even if we could store them, how would we make a prediction in a millisecond? And so, I just turned away from one-versus-all approaches and we tried developing trees and embeddings. But today, we\u2019ve actually managed to overcome all of those limitations. And the key trick is to go from linear time training and predictions, to log-time training and prediction. And originally, I thought this was not possible, right? Because you have to learn one classifier per label. So, if you have 100 million labels, you have to have 100 million classifiers, so your training time has to be linear in the number of labels. But, with Parabel we managed to come up with a log-time training and a log-time prediction algorithm. And the key is to learn a hierarchy over your labels. So, each node in the hierarchy inherits only about half of its parent\u2019s labels. So, there\u2019s an exponential reduction as you go down the hierarchy. And that lets you make predictions in logarithmic time. Now the trick in Parabel was, how do you make sure that the training is also logarithmic time? Because at the leaves of the hierarchy, you will have one, let\u2019s say, label in the leaf node. 
And so, you have to have at least as many leaves as classifiers, so that gives you back the linear costs. But somehow Parabel \u2013 you\u2019ll need to read the paper in order to get the trick. It\u2019s a really cute trick and\u2026 but you can get away with log-time training. And so that\u2019s why Parabel manages to get this 10,000-times speed up in prediction, in training, in reduction in model size, and accuracy is great. It\u2019s a really cool algorithm.<\/p>\n<p><strong>Host: More recently, you\u2019ve published a paper on what you\u2019ve called the Slice algorithm. What\u2019s the key innovation in Slice? How is it different from \u2013 or how does it build on \u2013 Parabel?<\/strong><\/p>\n<p>Manik Varma: Right. So, Slice is also a one-versus-all algorithm where you learn a classifier per label. It also has log-time training and prediction, but it\u2019s based on a completely different principle. So, in Parabel, our approach was to learn a hierarchy. So, you keep splitting the label space in half as you go down each node. Slice has a very different approach to the problem. It says, okay, let me look at my entire feature space. And now if I look at a very small region in the feature space, then only a small number of labels will be active in this region. So now, when a new user comes in, if I can find out what region of feature space he or she belongs in very quickly, then I can just apply the classifiers in that region. And that\u2019s the key approach that Slice takes to cutting down training time and prediction time from linear to logarithmic. And it\u2019s about to appear in WSDM this month. And it scales to 100 million labels. I mean, if you look at ImageNet, right? So that was 1,000 classes with 1,000 training points per class. And now we have 100 million classes. So, we used it for the problem of recommending related searches on Bing. 
So, when you go and type in a query on a search engine, the search engine will often recommend other queries that you could have asked that might have been more helpful to you or that will highlight other facets of the topic. And so, for obvious queries, the recommendations are pretty standard and everyone can do a good job. The real fight is in the tail for these queries that are rare that we don\u2019t know how to answer well. Can you still make good quality recommendations? And there, getting even a 1% lift in the success rate is a big deal. Like it takes dedicated time and effort to do that. And we managed to get a 12% lift in performance with Slice in the success rate. So that\u2019s like really satisfying.<\/p>\n<p><strong>Host: Yeah, and you had some other percentages that were pretty phenomenal too in other areas. Can you talk about those a little?<\/strong><\/p>\n<p>Manik Varma: Right. So, we also saw increases across the board in trigger coverage, suggestion density and so on. So, trigger coverage is the number of queries for which you could make recommendations. And we saw a 52% increase in that.<\/p>\n<p><strong>Host: 52?<\/strong><\/p>\n<p>Manik Varma: And\u2026 yeah, that\u2019s right. That was amazing.<\/p>\n<p><strong>Host: Statistically significant, on steroids.<\/strong><\/p>\n<p>Manik Varma: Right. And then the suggestion density is the number of recommendations you make per query. And there was a 33% increase in that as well. So yeah, we had pretty significant lift, and I\u2019m very glad to say like Slice is making billions of recommendations so far. And people are really happy. It\u2019s really improved the success rate of people asking queries on Bing so far.<\/p>\n<p><strong>Host: That\u2019s cool. Speaking of where people can find stuff\u2026 I imagine a number of listeners would be interested to plunge in here. Where can they get this? Uh, is it available? 
Where are the resources?<\/strong><\/p>\n<p>Manik Varma: So, we maintain an extreme classification repository, which makes it really easy for researchers and practitioners who are new to the area to come in and get started. If you go to Bing \u2013 or your favorite search engine \u2013 and search for the extreme classification repository, you can find it very easily. And there you\u2019ll find not just our code, but you\u2019ll find data sets. You\u2019ll find metrics on how to evaluate the performance of your own algorithm. You\u2019ll find benchmark results showing you what everybody else has achieved in the field so far. You\u2019ll find publications and, if you look at my home page, you\u2019ll also find a lot of talks, so you can go and look at the recordings to explore more about whatever area fascinates you. And all of this is publicly available, freely available to the academic community. So, people can come in and explore whatever area of extreme classification they like.<\/p>\n<p>(music plays)<\/p>\n<p><strong>Host: So, Manik, we\u2019ve talked about the large end of extreme classification, but there\u2019s another extreme that lies at the small end of the spectrum, and it deals with really, really, really tiny devices. Tell us a bit about the work you\u2019ve done with what you call Resource Efficient ML.<\/strong><\/p>\n<p>Manik Varma: Yeah, that\u2019s the only other project I\u2019m working on. And that\u2019s super cool too, right? Because for the first time in the world, we managed to put machine learning algorithms on a microcontroller that\u2019s smaller than a grain of rice. Think of the possibilities that opens up, right? So, you can now implant these things in the brains of people who might be suffering from seizures to predict the onset of a seizure. You could put them all over a farm to try and do precision agriculture, tell you where to water, where to put fertilizer and where to put pesticide and all of that. 
The applications are completely endless, especially once you start thinking about the internet of things. There are a number of applications in the medical domain, in smart cities, in smart housing. So, in 2017, we put two classical machine learning algorithms based on trees and nearest neighbors, called Bonsai and ProtoNN, onto this microcontroller. It has only 2 kilobytes of RAM, 32 kilobytes of read-only flash memory, no hardware support for floating-point operations, and yet we managed to get our classifiers to run on it. And then last year we released two recurrent neural network algorithms called FastGRNN and EMI-RNN. And again, all of these are publicly available from GitHub. So, if you go to GitHub.com\/Microsoft\/EdgeML you can download all these algorithms and play with them and use them for your own applications.<\/p>\n<p><strong>Host: So, while we\u2019re on the topic of your other work, you said that was the only other project you were working on, but it isn\u2019t. And maybe \u2013 maybe they\u2019re tied together, but I\u2019ve also heard you\u2019re involved in some rather extra-terrestrial research. Can you tell us about the work you\u2019re doing with SETI?<\/strong><\/p>\n<p>Manik Varma: Yeah, so they\u2019re related. But apparently, some of these algorithms that we\u2019ve been developing could have many applications in astronomy and astrophysics. So, if you look at our telescopes right now, they\u2019re receiving data at such a high rate that it\u2019s not possible to process all of this data or even store all of this data. So, because the algorithms we\u2019ve developed are so efficient, if we could put them on the telescope itself, it could help us analyze all types of signals that we are receiving, including, perhaps, our search for extraterrestrial intelligence. So, that\u2019s a really cool project we run out of Berkeley. 
But there are also lots of other applications, because, if you\u2019re trying to put something on a satellite, I\u2019m told by my astronomer friends that the amount of energy that an algorithm can consume is very limited, because energy is at a premium out there in space. And so, things that are incredibly energy efficient or have very low response times are very interesting to astronomers.<\/p>\n<p><strong>Host: So Manik, I ask all my guests some form of the question, is there anything that keeps you up at night? You\u2019re looking at some really interesting problems. Big ones and small ones. Is there anything you see that could be of concern, and what are you doing about it?<\/strong><\/p>\n<p>Manik Varma: Extreme classification touches people, right? So, people use it to find things they\u2019re looking for. And so, they reveal a lot of personal information. So, we have to be extremely careful that we behave in a trustworthy fashion, where we make sure that everybody\u2019s data is private to them. And this is not just at the individual level but also the corporate level, right? So, if you\u2019re a company that\u2019s looking to come to Microsoft and leverage extreme classification technology, then again, your transaction history and your logs and so on \u2013 we make sure that those are private, and you can trust us, and we won\u2019t share that with anybody else. And because we\u2019re making recommendations, there are all these issues about fairness, about transparency, about explainability. And these are really important research challenges, and ones that we are thinking about at the extreme scale.<\/p>\n<p><strong>Host: At the beginning of the podcast, we talked briefly about your CV, and it\u2019s phenomenally impressive. But tell us a little bit more about yourself. 
How did your interest in physics and theory and engineering and math shape what you\u2019re doing at Microsoft Research?<\/strong><\/p>\n<p>Manik Varma: Yeah, so because I\u2019ve been exposed to all of these different areas, it helps me really appreciate all the different facets of the problem that we\u2019re working on. So, the theoretical underpinnings are extremely important. And then I\u2019ve come to realize how important the mathematical and statistical modeling of the problem is. And then once you\u2019ve built the model, there\u2019s engineering a really good-quality solution \u2013 how to do that, what kind of approximations to make \u2013 so that you start from theoretically well-founded principles, but then make good engineering approximations that will help you deliver a world-class solution. And so, it helps me look at all of these aspects of the problem and try and tackle them holistically.<\/p>\n<p><strong>Host: So, what about the physics part?<\/strong><\/p>\n<p>Manik Varma: Um, actually to tell you the truth, I\u2019m a miserable physicist. I completely failed. (laughing) Yeah. I\u2019m not very good at physics, unfortunately, which is why I switched. So&#8230;<\/p>\n<p><strong>Host: You know, I\u2019ve got to be honest. I\u2019ll bet your bad at physics is way better than my good at physics. So, let\u2019s put it in context, right? All right, well at the end of every show, I ask my guests to offer some parting advice or inspiration to listeners, many of whom are aspiring researchers in the field of AI and machine learning. What big interesting problems await researchers just getting into the field right now?<\/strong><\/p>\n<p>Manik Varma: Yeah, we\u2019ve been working on it for several years, but we\u2019ve barely scratched the surface. I mean, there are so many exciting and high-impact problems that are still open and need good quality solutions, right? 
So, if you\u2019re interested in theory, even figuring out what are the theoretical underpinnings of extreme classification \u2013 how do we think about generalization error, and how do we think about the complexity of a problem? That would be such a fundamental contribution to make. If you\u2019re interested in engineering or in the machine learning side of things, then how do you come up with algorithms that bring down your training and prediction costs and model size from linear to logarithmic? So, can we have an algorithm that has log-time training, log-time prediction and a log-sized model, and yet is as accurate as a one-versus-all classifier that has linear costs? And if you\u2019re interested in deep learning, then how can you do deep learning at the extreme scale? How do you learn at the scale where you have 100 million classes to deal with? How do you personalize extreme classification? If you wanted to build something that would be personalized for each user, how would you do that? And if you\u2019re interested in applications, then how do you take extreme classification and apply it to all the really cool problems that there are in search, advertising, recommendation, retail, computer vision\u2026? All of these problems are open. And we\u2019re looking for really talented people to come and join us and work with us on this team. And location is no bar. No matter where you are in the world, we\u2019re happy to have you. Level is no bar. As long as you\u2019re passionate about having impact on the real world and reaching millions of users, we\u2019d love to have you with us.<\/p>\n<p><strong>Host: So, we\u2019ve talked about going from 100 to 100 million labels. What\u2019s it going to take to go from 100 million to a billion labels?<\/strong><\/p>\n<p>Manik Varma: Yeah, that\u2019s really exciting, and those are actually some of the things that we\u2019re exploring right now. In fact, not only do I want to go to a billion labels. 
I want to go to an infinite set of labels. That would be the next extreme in extreme classification!<\/p>\n<p><strong>Host: Yeah.<\/strong><\/p>\n<p>Manik Varma: And that\u2019s really important, again, for the real world, because if you think about applications on Bing or on search or recommendation, you have new documents coming on the web all the time, or you have new products coming into a catalog all the time. And currently, our classifiers don\u2019t deal with that. Thankfully, we\u2019ve managed to cut down our training costs so that you can just retrain the entire classifier periodically when you have a batch of new items come in. But if you have new items coming in at a very fast rate, then you need to be able to deal with them from the get-go. And so, we\u2019re now starting to look at problems where there\u2019s no limit on the number of classes that you\u2019re dealing with.<\/p>\n<p><strong>Host: Well, I imagine the theory and the math, and then the engineering to get us to that level is going to be significant as well.<\/strong><\/p>\n<p>Manik Varma: But it\u2019ll be a lot of fun. You should come and join us! (laughter)<\/p>\n<p><strong>Host: \u201cCome on, it\u2019ll be fun!\u201d he said.<\/strong><\/p>\n<p>Manik Varma: Yeah.<\/p>\n<p><strong>Host: Manik Varma, it\u2019s been an absolute delight talking to you today. Thanks for coming on the podcast.<\/strong><\/p>\n<p>Manik Varma: Thank you, Gretchen. As I said, the pleasure was entirely mine.<\/p>\n<p>(music plays)<\/p>\n<p>To learn more about Dr. 
Manik Varma and how researchers are tackling extreme problems with extreme classification, visit <a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/\">Microsoft.com\/research<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>\ufeff Episode 63 | February 13, 2019 If every question in life could be answered by choosing from just a few options, machine learning would be pretty simple, and life for machine learning researchers would be pretty sweet. Unfortunately, in both life and machine learning, things are a bit more complicated. That\u2019s why Dr. Manik [&hellip;]<\/p>\n","protected":false},"author":39507,"featured_media":567159,"template":"","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","msr-content-parent":199562,"msr_hide_image_in_river":0,"footnotes":""},"research-area":[],"msr-locale":[268875],"msr-post-option":[],"class_list":["post-811993","msr-blog-post","type-msr-blog-post","status-publish","has-post-thumbnail","hentry","msr-locale-en_us"],"msr_assoc_parent":{"id":199562,"type":"lab"},"_links":{"self":[{"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-blog-post\/811993","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-blog-post"}],"about":[{"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/types\/msr-blog-post"}],"author":[{"embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/users\/39507"}],"version-history":[{"count":2,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-blog-post\/811993\/revisions"}],"predecessor-version":[{"id":811999,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-blog-post\/811993\/revisions\/811999"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\
/research\/wp-json\/wp\/v2\/media\/567159"}],"wp:attachment":[{"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/media?parent=811993"}],"wp:term":[{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=811993"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=811993"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=811993"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}