{"id":490556,"date":"2018-06-13T08:05:15","date_gmt":"2018-06-13T15:05:15","guid":{"rendered":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/?p=490556"},"modified":"2018-06-26T10:47:46","modified_gmt":"2018-06-26T17:47:46","slug":"teaching-computers-to-see-with-dr-gang-hua","status":"publish","type":"post","link":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/podcast\/teaching-computers-to-see-with-dr-gang-hua\/","title":{"rendered":"Teaching computers to see with Dr. Gang Hua"},"content":{"rendered":"<div id=\"attachment_490559\" style=\"width: 2010px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-490559\" class=\"wp-image-490559 size-full\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2018\/06\/PodcastGangHua_Header_06_2018_1000x400.jpg\" alt=\"\" width=\"2000\" height=\"800\" srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2018\/06\/PodcastGangHua_Header_06_2018_1000x400.jpg 2000w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2018\/06\/PodcastGangHua_Header_06_2018_1000x400-300x120.jpg 300w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2018\/06\/PodcastGangHua_Header_06_2018_1000x400-768x307.jpg 768w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2018\/06\/PodcastGangHua_Header_06_2018_1000x400-1024x410.jpg 1024w\" sizes=\"auto, (max-width: 2000px) 100vw, 2000px\" \/><p id=\"caption-attachment-490559\" class=\"wp-caption-text\">Principal Researcher and Research Manager, Gang Hua. Photography courtesy of Maryatt Photography.<\/p><\/div>\n<h3>Episode 28, June 13, 2018<\/h3>\n<p>In technical terms, computer vision researchers \u201cbuild algorithms and systems to automatically analyze imagery and extract knowledge from the visual world.\u201d In layman\u2019s terms, they build machines that can see. 
And that\u2019s exactly what Principal Researcher and Research Manager, <a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/people\/ganghua\/\">Dr. Gang Hua<\/a>, and <a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/research-area\/computer-vision\/\">Computer Vision Technology<\/a> team, are doing. Because being able to see is really important for things like the personal robots, self-driving cars, and autonomous drones we\u2019re seeing more and more in our daily lives.<\/p>\n<p>Today, Dr. Hua talks about how the latest advances in AI and machine learning are making big improvements on image recognition, video understanding and even the arts. He also explains the distributed ensemble approach to active learning, where humans and machines work together in the lab to get computer vision systems ready to see and interpret the open world.<\/p>\n<h3>Related:<\/h3>\n<ul type=\"disc\">\n<li><a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/podcast\">Microsoft Research Podcast<\/a>: Visit our podcast page on Microsoft.com<\/li>\n<li><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/itunes.apple.com\/us\/podcast\/microsoft-research-a-podcast\/id1318021537?mt=2\">iTunes<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>: Subscribe and listen to new podcasts each week on iTunes<\/li>\n<li><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/subscribebyemail.com\/www.blubrry.com\/feeds\/microsoftresearch.xml\">Email<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>: Subscribe and listen by email<\/li>\n<li><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" 
href=\"https:\/\/subscribeonandroid.com\/www.blubrry.com\/feeds\/microsoftresearch.xml\">Android<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>: Subscribe and listen on Android<\/li>\n<li><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/open.spotify.com\/show\/4ndjUXyL0hH1FXHgwIiTWU\">Spotify<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>: Listen on Spotify<\/li>\n<li><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/www.blubrry.com\/feeds\/microsoftresearch.xml\">RSS feed\u00a0<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/li>\n<\/ul>\n<hr \/>\n<h3>Transcript<\/h3>\n<p>Gang Hua: If we look back ten, fifteen years ago, you see the computer vision community\u2019s more diverse. You see all kinds of machine learning methods, you see all kind of knowledges borrowed from physics, from optical field, all getting into this field to try to tackle the problem from multi-perspective. As we are emphasizing diversity everywhere, I think the scientific community is going to be more healthy if we have diverse perspective.<\/p>\n<p><strong>(music plays)<\/strong><\/p>\n<p><strong>Host: You\u2019re listening to the Microsoft Research Podcast, a show that brings you closer to the cutting-edge of technology research and the scientists behind it. I\u2019m your host, Gretchen Huizinga.<\/strong><\/p>\n<p><strong>In technical terms, computer vision researchers \u201cbuild algorithms and systems to automatically analyze imagery and extract knowledge from the visual world.\u201d In layman\u2019s terms, they build machines that can see. And that\u2019s exactly what Principal Researcher and Research Manager, Dr. Gang Hua, and the Computer Vision Technology team, are doing. 
Because being able to see is really important for things like the personal robots, self-driving cars, and autonomous drones we\u2019re seeing more and more in our daily lives.<\/strong><\/p>\n<p><strong>Today, Dr. Hua talks about how the latest advances in AI and machine learning are making big improvements on image recognition, video understanding and even the arts. He also explains the distributed ensemble approach to active learning, where humans and machines work together in the lab to get computer vision systems ready to see and interpret the open world. That and much more on this episode of the Microsoft Research Podcast.<\/strong><\/p>\n<p><strong>Host: Gang Hua.<\/strong><\/p>\n<p>Gang Hua: Hi.<\/p>\n<p><strong>Host: Hello, welcome to the podcast. Great to have you here.<\/strong><\/p>\n<p>Gang Hua: Thanks for inviting me.<\/p>\n<p><strong>Host: You\u2019re a Principal Researcher and the Research Manager at MSR and your focus is computer vision research.<\/strong><\/p>\n<p>Gang Hua: Mmm hmm.<\/p>\n<p><strong>Host: In broad strokes right now, what gets a computer vision researcher up in the morning? What\u2019s the big goal?<\/strong><\/p>\n<p>Gang Hua: Yeah, computer vision is a relatively young research field. In general, you can think this field is trying to build machines, to endow computers the capability to just see the world and interpret the world just like human. From a more technical side of view, it\u2019s the input to the computer is really just uh, image and videos. You can think of them as a sequence of numbers, but what we want to extract from these images and videos from these numbers is some sort of structures of the world, or some semantic information out of it. For example, I could say, this part of the image really corresponds to a cat. That part of the image corresponds to a car, this type of interpretation. So, that\u2019s the goal of computer vision. 
Like for us humans, it looks to be a simple task to achieve, but in order to teach these computers to do it, we really have made a lot of progress in the past ten years. But as the research field, this thing has been there for fifty years. Still yet, a lot of problems to tackle and address.<\/p>\n<p><strong>Host: Yeah. In fact, you gave a talk about five years ago, where you said, and I paraphrase, \u201cAfter thirty years of research why should we still care about face recognition research?\u201d Tell us how you answered then, and now, where you think we are\u2026<\/strong><\/p>\n<p>Gang Hua: So, I think that the status quo five years ago, I would say like, at that moment, if we capture a snapshot of how the research in facial recognition has progressed since the beginning of computer vision or face recognition research, I would say we achieved a lot. But more in controlled environment, where you could carefully control the lighting, the camera, the setting and all those kinds of things when you are framing the faces. At that moment, five years ago, when we moved towards more like wild settings, like faces taken in uncontrolled environments, I would say there\u2019s a huge gap there in terms of recognition accuracy. But in the past five years, I would say the whole community also made a lot of progress leveraging like the more advanced, deep learning techniques. 
Even for facial recognition in the wild scenario, we\u2019ve made a lot of progress and really have pushed these things to a stage where a lot of commercial applications becomes feasible.<\/p>\n<p><strong>Host: Ok, so deep learning has really enabled, even recently, some great advances in the field of computer vision and computer recognition of images.<\/strong><\/p>\n<p>Gang Hua: Right.<\/p>\n<p><strong>Host: So, that\u2019s interesting, when you talk about the difference between a highly controlled situation, versus recognizing things in the wild, and I\u2019ve had a couple of researchers on here who have said, yeah, where computers fail is when the data sets are not full enough\u2026 for example dog, dog, dog, 3-legged dog, mmm, is it still a dog?<\/strong><\/p>\n<p>Gang Hua: Sure.<\/p>\n<p><strong>Host: Right? So, what kinds of things do deep learning techniques give you that you didn\u2019t have before in these recognition advances?<\/strong><\/p>\n<p>Gang Hua: Yeah that\u2019s a great question. From a research perspective, you know, the power of deep learning is presenting several effects. The first thing is that it can conduct the learning in an end-to-end fashion and learn what\u2019s the right representation for that semantic pattern. For example, when we are talking about a dog, if we\u2019re really looking through all kinds of pictures of a dog, although say, if my input is really a 64&#215;64 images, suppose each pixel has like two hundred and fifty values to take. That\u2019s a huge space, if you think about the combinatorially. But when we talk about dog as a pattern, like actually, every pixel\u2019s correlated a little bit, so, the actual pattern for \u201cdog\u201d is going to reside in a much lower dimensional space. 
So, the power of deep learning is that I can conduct the learning in an end-to-end fashion that really learns the right numerical representation for \u201cdog.\u201d And because of the deep structures, we can come up really complicated models which can really digest a large amount of training data. So that means like, if my training data covered all kinds of variations, like all kinds of views of this pattern, eventually I can recognize it in a broader setting, because I have covered almost all the spaces. OK. So, another capability of deep learning is that this kind of compositional behavior, because it\u2019s a layer, fit for the structure and the layer of the representation there, so when the information or image gets fit into deep networks, and it starts by extracting some very low-level image primitives, then gradually, the model can assemble all those primitives together and form a higher and higher level of semantic structures. So, in this sense, it captures all the small patterns corresponding to the bigger patterns and composes them together to represent the final pattern. So, that\u2019s why it is very powerful, especially for visual recognition tasks, so, yeah.<\/p>\n<p><strong>Host: Right, so, the broad umbrella of <a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/event\/microsoft-cvpr-2018\/\">CVPR<\/a> is computer vision pattern recognition.<\/strong><\/p>\n<p>Gang Hua: Yes. Right.<\/p>\n<p><strong>Host: And a lot of that pattern recognition is what the techniques are really driving to.<\/strong><\/p>\n<p>Gang Hua: Sure, yeah. So, that\u2019s actually, computer vision really, is trying to make sense out of pixels. If we talk about it in a really mechanical way, is that I fit in the image. You either extract some numeric output or some symbolic output from it. The numeric output, for example, could be a 3-D point cloud which described the structure of the scene or the shape of an object. 
It could also be corresponding to some semantic labels like a dog and cat, as I mentioned at the beginning, so yeah.<\/p>\n<p><strong>Host: Right. So, we\u2019ll get to labeling in a bit. It\u2019s an interesting whole part of the machine learning process is that it has to be fed labels as well as pixels, right?<\/strong><\/p>\n<p>Gang Hua: Sure, yeah.<\/p>\n<p><strong>(music plays)<\/strong><\/p>\n<p><strong>Host: You have three main areas of interest in your computer vision research that we talked about. Video, faces, and arts and media. Let\u2019s talk about each of those in turn and start with your current research in what you call \u201cvideo understanding.\u201d<\/strong><\/p>\n<p>Gang Hua: Yes. Video understanding, like the title sort of explains itself. Instead of \u2013 the input now becomes a video stream. Instead of single image we are reasoning about pixels and how they move. If we view computer vision reasoning about the single image as a spatial reasoning problem, now we are talking about a spatial-temporal reasoning problem, because video is the third dimension, or that temporal dimension. And if you look into a lot of real world problems, we\u2019re talking about continuous video streams, whether it is a surveillance camera in a building or a traffic camera overseeing highways. You have this constant flow of frames coming in and the object inside it is moving. So, you want to basically digest information out of it.<\/p>\n<p><strong>Host: When you talk about those kinds of cameras, it gives us massive amounts of video, you know, constant stream of cameras and security in the 7-Eleven and things like that. What is your group trying to do on behalf of humans with those video streams?<\/strong><\/p>\n<p>Gang Hua: Sure. So, one incubation project we are doing, like my team\u2019s building a foundational technology there. One incubation project we are trying to do is to really analyze the traffic on roads. 
If you think about a city, when they set up all those traffic cameras, most of the video streams are actually wasted. But if you carefully think about it, these cameras are smart. Just think about the one scenario where you want to more intelligently control the traffic lights. So, if in one direction I saw a lot more traffic flow, instead of having a fixed schedule of turn-on, turn-off those red lights and the green lights? I could say like, OK, because this side has less cars or even no cars at this moment, I would allow the other directions their green lights to keep longer time so that the traffic can flow better. So that\u2019s just the one type of application there.<\/p>\n<p><strong>Host: Could you please get that out there?<\/strong><\/p>\n<p>Gang Hua: Sure!<\/p>\n<p><strong>Host: I mean\u2014yeah, because how many of us have sat at a traffic light when it\u2019s red and there\u2019s no traffic coming the other way.<\/strong><\/p>\n<p>Gang Hua: Exactly.<\/p>\n<p><strong>Host: At all. It is like why can\u2019t I go?<\/strong><\/p>\n<p>Gang Hua: Sure. Yeah, you could also think about some other applications like if we accumulated videos across years, if there\u2019s citizens requesting like we set up additional bicycle lanes, we could use the videos we have analyzing all the traffic data there and then decide, if it makes sense, to set up a bike lane there. If we set it, we would sort of significantly affect the other traffic flows and help the cities to make decisions like that.<\/p>\n<p><strong>Host: I think this is so brilliant, because a lot of times we make decisions based on, you know, our own ideas rather than data that says, you know, hey, this is where a bike lane would be terrific. This is where it would actually ruin everything for everybody, right?<\/strong><\/p>\n<p>Gang Hua: For sure, yeah. They sometimes leverage some other type of other sensors to do that. You hire a company like to set up some special equipment on the roads to do that. 
But it\u2019s very cost ineffective. Just think about all those cameras are sitting there. The video streams out there. Right? So.<\/p>\n<p><strong>Host: Yeah. That\u2019s a fantastic explanation of what you can do with machine learning and video understanding.<\/strong><\/p>\n<p>Gang Hua: Right.<\/p>\n<p><strong>Host: Yeah. Another area you care about is faces and kind of harkens back to the \u201cwhy should we still care about facial recognition research?\u201d<\/strong><\/p>\n<p>Gang Hua: Sure.<\/p>\n<p><strong>Host: But yeah. And this line of research has some really interesting applications. Talk about what\u2019s happening with facial recognition research. Who\u2019s doing it and what\u2019s new?<\/strong><\/p>\n<p>Gang Hua: Yeah, so indeed if we look back, facial recognition technology has progressed in Microsoft, I think when I was at Live Labs Research, we set up the first facial recognition library which could be leveraged by different product teams. Indeed, the first adopter is Xbox. They tried to use facial recognition technology for automatic user login at that moment. I think that that\u2019s the first adoption. Over the time, like the facial recognition research, the center sort of migrated to Microsoft Research Asia, where we still have a group of researchers I collaborate with. Like we are continuously trying to push the state of the art out. This is now become a synergistic effort where we have engineering teams helping us to gather more data and then we just train better models. Our research recently, actually, focused more on line of research we call identity-preserving face synthesis. So, recently there\u2019s a big advancement in the deep learning community too, which is the establishment of using deep networks to generative models which can model the distribution of images so that you can draw from that distribution, basically synthesize the image. You build deep networks which output is an image, indeed. 
So, what we want to achieve is actually a step further. We want to synthesize faces. While we want to keep the identity of those faces, seeing we don\u2019t want our algorithms to just randomly sample a set of face out without any semantic information. Say you want to generate face of Brad Pitt. I want to really generate a face that looks like Brad Pitt. If I want to generate a face similar to anybody I know, I think we just want to be able to achieve that.<\/p>\n<p><strong>Host: So, the identity preservation is the sort of outcome that you\u2019re aiming for of the person that you\u2019re trying to generate the face of?<\/strong><\/p>\n<p>Gang Hua: Right.<\/p>\n<p><strong>Host: You know, tangentially, I wonder if you get this technology going does it morph with you as you get older, and start to recognize you \u2014 or do you have to keep updating your face?<\/strong><\/p>\n<p>Gang Hua: Yeah that\u2019s indeed a very good question. I would say, in general, we actually have some ongoing research trying to tackle that problem. I think for existing technology, yes, you need to update your face maybe from time to time. Especially if you\u2019ve undergone a lot of changes. For example, somebody could have done some plastic surgery. That will basically break the current system.<\/p>\n<p><strong>Host: Wait a minute. That\u2019s not you.<\/strong><\/p>\n<p>Gang Hua: Sure, no, not me at all. So, there are several ways you can think about it. Like human faces actually don\u2019t change much like between age 17-18, when you grow up, all the way to maybe 50-ish, they don\u2019t change much. So, when people first got born, like kids? Their face actually changes a lot, because there\u2019s growing bones and basically the shape and skin could change a lot. But once people get maturity into adult stage, the change is very slow. So, we actually have some research, we are training models the aging process too that will help establish better facial recognition system across age. 
This is actually a very good kind of technology which can allow you to get into the law enforcement domain for example, some missing kids, they could be \u2014 have been kidnapped by somebody, but after many years if you\u2026<\/p>\n<p><strong>Host: They look different.<\/strong><\/p>\n<p>Gang Hua: Yeah, they look different; if the smart facial recognition algorithms can match the original photos\u2026<\/p>\n<p><strong>Host: And say what they would look like at maybe 14 if they were kidnapped earlier or something?<\/strong><\/p>\n<p>Gang Hua: Yes, yes, exactly.<\/p>\n<p><strong>Host: Wow, that\u2019s a great application of that. Well, let\u2019s talk about the other area that you\u2019re actively pursuing and that\u2019s media and the arts. Tell us how research is overlapping with art and particularly with your work in deep artistic style transfer.<\/strong><\/p>\n<p>Gang Hua: Sure. If we look into people\u2019s desire, right? First we need to eat, and we need to drink, and we need to sleep. OK? Then once all these tasks are fulfilled, actually, we humans have a strong desire of arts\u2026<\/p>\n<p><strong>Host: And creation.<\/strong><\/p>\n<p>Gang Hua: And the creation and things like that. So, this theme of research in computer vision, if we link it to like a more artistic type of what we call media and arts, like basically using computer vision technologies to give people a good artistic enjoyment. So, the particular research project we have done in the past two years is a sequence of algorithms where we can render an image into any sort of artistical styles you want as long as you provide an example of that artistic style. For example, we can render an image to Van Gogh\u2019s style.<\/p>\n<p><strong>Host: Van Gogh?<\/strong><\/p>\n<p>Gang Hua: Yeah, or any other painter\u2019s painting style\u2026<\/p>\n<p><strong>Host: Renoir, or Monet\u2026 or Picasso.<\/strong><\/p>\n<p>Gang Hua: Yeah, all of them. You can think of\u2026<\/p>\n<p><strong>Host: Interesting. 
With pixels.<\/strong><\/p>\n<p>Gang Hua: With pixels, yeah. Those are all again, like, all done by deep networks and some deep learning technologies we designed.<\/p>\n<p><strong>Host: It sounds like you need a lot of disciplines to feed into this research. Where are you getting all of your talent from in terms of\u2026?<\/strong><\/p>\n<p>Gang Hua: In a sense, I would say our goal on this is making \u2014 you know, artworks are not necessarily accessible to everyone. Like some of these artworks are really expensive. By this kind of digital technology, what we are trying to do is make this kind of artworks to be accessible to common users.<\/p>\n<p><strong>Host: To democratize.<\/strong><\/p>\n<p>Gang Hua: Yeah democratize it. Yes, as you mentioned that, so.<\/p>\n<p><strong>Host: That\u2019s good.<\/strong><\/p>\n<p>Gang Hua: Our algorithm allows us to build an explicit representation, like a numerical representation for each kind of style. Then, if you want to create new styles we can blend them. So, that is like we are building an artist space where we can explore in-between to see how these visual effects are evolving between like two painters and things like that. And even have deeper understanding how they composed their artistic style and things like that. Yeah.<\/p>\n<p><strong>Host: What\u2019s really interesting to me is that this is a really quantitative field \u2014 computer science, algorithms, and a lot of math and numbers. And then you\u2019ve got art over here which is much more metaphysical. And yet, you\u2019re bringing them together and it\u2019s revealing the artistic side of the quantitative brain.<\/strong><\/p>\n<p>Gang Hua: Sure. 
I think to bring all these things together, the biggest tool we are leveraging indeed is statistics.<\/p>\n<p><strong>Host: Interesting.<\/strong><\/p>\n<p>Gang Hua: As all kinds of machine learning algorithms are dealing with, it\u2019s really trying to capture just statistics of the pixels.<\/p>\n<p><strong>(music plays)<\/strong><\/p>\n<p><strong>Host: Let\u2019s get a little technic\u2026 We have been a little technical, but let\u2019s get a little more technical. Some of your recently published work \u2013 and our listeners can find that on both the MSR website and your website \u2013 you talked about a new distributed ensemble approach to active learning. Tell us what\u2019s different about that\u2026 what you propose, how it\u2019s different, what does it promise?<\/strong><\/p>\n<p>Gang Hua: Yeah, that\u2019s indeed a great question. I think when we are talking about active learning, we are referring to a process where we have some sort of human oracle involved in the learning process. In traditional active learning, we\u2019re seeing that\u2026 I have a learning machine. This learning machine can intelligently pick up some data samples and ask the human oracle to provide a little bit more input. So, the learning machine actually picks samples and asks the human oracle to actually provide, for example, a label for this image. So, in this work, when we are talking about an ensemble machine, we are actually dealing with a more complicated problem. We are trying to factor active learning into actually in a crowd-sourcing environment. If you think about the Amazon Mechanical Turk, nowadays, it\u2019s really, one of the biggest platforms where people send their data and ask the crowd workers to label all of them, but in this process, if you are not careful, the labels collected from this process for your data could be quite lousy, indeed. They may not be able to be used by you. So, in this process, we actually try to achieve two goals there. 
The first goal, we want to smartly distribute the data so that we can make the label to be most cost effective, OK? The second is that we want to, actually, assess the quality of all my crowd workers, so that maybe, even in the online process, I can purposely send my data to the good workers to label. So, that\u2019s how our model works. So, actually, we have ensemble model distributed one. Like each crowd worker corresponds to one of these learning machines. And we try to do a statistical check across all the models so that in the same process, we actually come out with a quality score for each of the crowd workers on the fly. So that we can use the model to not only select the samples but also send the data to the labelers with the highest quality to label them. That way, with the progress on these label efforts I can quickly come out with a good model.<\/p>\n<p><strong>Host: That leads me to the human-in-the-loop issue and the necessity of the checks and balances between humans and machines. Aside from what you\u2019re just talking about, how are you tackling other issues of quality control by using humans with your machines?<\/strong><\/p>\n<p>Gang Hua: I have been thinking about this problem for a while, mainly in the context of robotics. If you think about any intelligence system, I would say, unless you are in a really closed-world setting, then you may have a system which can run fully autonomously. But whenever we hit the open world, like a current machine learning based intelligent systems are not necessarily good in dealing with all kinds of open-world cases because there\u2019s corner cases which may not have been covered.<\/p>\n<p><strong>Host: And variables that you don\u2019t think about, yeah.<\/strong><\/p>\n<p>Gang Hua: Exactly. 
So, one thing I have been thinking is how could we really engage humans in that loop to not only help the intelligent agent when they need it, and also at the same time forming some mechanism which we can teach these agents to be able to handle similar situations in the future. I will give you a very specific example. When I was at Stevens Institute of Technology, I had a project from NIH which is\u2026 we called co-robots.<\/p>\n<p><strong>Host: What kind of robots?<\/strong><\/p>\n<p>Gang Hua: Co-robots. It\u2019s actually wheelchair robots. The idea is that, as long as the user can move their neck, we actually have a head-mounted camera which the user can move their head, we use the head-mounted camera actually to track the position of the head and let the users to be able to control the wheelchair robots indeed. But we don\u2019t want the user to control it all the time. So, our goal is, actually, say, if, in a home setting, we wanted these wheelchair robots to be able to carry the user, and move largely autonomously inside the room whenever the user gave a guidance, say, hey, I want to go to that room, then the wheelchair robots would mostly do autonomous navigation, but if the robots sort of encounter a situation, does not know how to deal with it? For example, how to move around? Then, at that moment, we\u2019re going to ask the robots to proactively ask the human for control. Then, the users will control the robots and deal with that situation. Maybe next time these robots encounter similar situation, they\u2019re going to be able to deal with it again.<\/p>\n<p><strong>(music plays)<\/strong><\/p>\n<p><strong>Host: What were you doing before you came here and how did you end up at Microsoft Research?<\/strong><\/p>\n<p>Gang Hua: This is my second term in Microsoft. So, as I mentioned, the first term is between like 2006 and 2009, when I was in lab called the Live Labs. That\u2019s my first term. 
During that tenure, I established the first face recognition library. Then I kind of got attracted by external world a little bit. So, I went to Nokia Research, IBM Research and I landed at Stevens Institute of Technology as a faculty member there so\u2026<\/p>\n<p><strong>Host: And that\u2019s over in New Jersey, right?<\/strong><\/p>\n<p>Gang Hua: Yeah that\u2019s in New Jersey, in East Coast. Then in 2015, I came back to Microsoft Research but in the Beijing Lab first. I transferred back actually in 2017, because my family stayed here.<\/p>\n<p><strong>Host: So now you are here in Redmond after Beijing. How did that move happen?<\/strong><\/p>\n<p>Gang Hua: My family always stayed in Seattle. So, Microsoft Research Beijing Lab is a great place. I would say, like, I really enjoyed it. And one of the unique things there is the super, super dynamic research intern program. So, year-round, there\u2019s several hundred interns actually working in the lab. And they collaborate closely with their mentors. I think it\u2019s a really dynamic environment there. But I, because my family is in Seattle, so I sort of explored a little bit, then the Intelligent Group is setting up this computer vision group there. So that\u2019s why I joined.<\/p>\n<p><strong>Host: You\u2019re back in Seattle again.<\/strong><\/p>\n<p>Gang Hua: Yeah.<\/p>\n<p><strong>Host: So, I ask this question of all the researchers that come on the podcast and I\u2019ll ask you too. Is there anything about your work that we should be concerned about? What I say is, anything that keeps you up at night?<\/strong><\/p>\n<p>Gang Hua: I would say, when we talk about, especially in the computer vision domain, I think privacy is, potentially, the largest concern. 
If now you look into all countries, there are hundreds of millions of cameras that are set up everywhere in public domain or in buildings and those, I would say, like with the technology advancement, it is really not sci-fi to expect that cameras can now really track people all the time. I mean everything has two sides. I would say yeah, this, on one hand, could help us, for example better to deal with our criminals. But for ordinary citizens there\u2019s a lot of privacy concerns.<\/p>\n<p><strong>Host: So what kinds of things\u2026 and this is why I ask this question because it prompts people to think, ok, I have this power because of this great technology, what could possibly go wrong? So, what kinds of things can we be thinking about and instituting \u2013 or implementing \u2013 to not have that problem?<\/strong><\/p>\n<p>Gang Hua: Microsoft has big efforts on GDPR. And I think that\u2019s great, because this is a mechanism to ensure everything we produce actually got aligned with certain regulation. On the other hand, everything need to strike for a balance between usability and the security or privacy. If you think about it, like you use like some online services, like your activities basically leave traces there. That\u2019s how it is used to better serve you for the future. But if you want to be more convenient, sometimes you need to give a little bit of information out. But you don\u2019t want to give all your information out, right? I think that the boundary is actually not black and white. We simply need to carefully control that, so that we just get the right amount of information to serve the customer better, but not unlimited information, or information that the users are not comfortable or feeling well to give up\u2026<\/p>\n<p><strong>Host: Right, so it seems like there\u2019s a trend towards permissions and agency of the user to say, \u201cI\u2019m comfortable with this. 
But I\u2019m not comfortable with that.\u201d<\/strong><\/p>\n<p>Gang Hua: Mmm hmmm. Right.<\/p>\n<p><strong>Host: As we finish up here, Gang, talk about what you see on the horizon for the next generation of computer vision researchers. What are the big unsolved problems that might prompt exciting breakthroughs, or just be the grind for the next 10 years?<\/strong><\/p>\n<p>Gang Hua: That\u2019s a great question and also, a very big question. There are big problems we actually should tackle. If you think about, like now, computer vision really leverages statistical machine learning a lot. We can train recognition models, which achieved great results. But that process is largely, still, appearance-based. So, we need to better get in some of the fundamentals in computer vision which is 3-D geometry into the perception process, ok\u2026 And there\u2019s also other things, like especially when we are talking about video understanding. It\u2019s a holistic problem where you need to do spatial-temporal reasoning and we need to be able to factor more cognition concepts into this process like causal inference. If something happened, what really caused this thing to happen? Machine learning techniques mostly deal with like correlation between data, ok? Correlation and causality are two totally different concepts there. So, I feel that also needs to happen. And some of the fundamental problems, so like learning from small data and even learning from language, potentially we need to address. Think about how we humans are learning, we learn in two ways: learning from experience, but there is another factor. We learn from language. For example, while we are talking with each other, indeed, simply through language, I already learned a lot from you, for example\u2026<\/p>\n<p><strong>Host: And I you.<\/strong><\/p>\n<p>Gang Hua: Sure. You know, that\u2019s a very compact information flow. We are now centrally focused on deep learning. 
If we look back like ten, fifteen years ago, you see the computer vision community\u2019s more diverse. You see all kinds of machine learning methods, you see all kinds of knowledges borrowed from physics, from optical field, all getting into this field to try to tackle the problem from multi-perspective. As we are emphasizing diversity everywhere, I think the scientific community is going to be more healthy if we have diverse perspective and tackling the problem from multiple perspective.<\/p>\n<p><strong>(music plays)<\/strong><\/p>\n<p><strong>Host: You know, that\u2019s great advice. Because as the community welcomes new researchers, they want to have big thinkers, broad thinkers, divergent thinkers, to sort of push for the next big breakthrough.<\/strong><\/p>\n<p>Gang Hua: Yeah. Exactly.<\/p>\n<p><strong>Host: Gang Hua, thank you for coming in. It\u2019s been really illuminating, and I\u2019ve really enjoyed our conversation.<\/strong><\/p>\n<p>Gang Hua: Thank you very much.<\/p>\n<p><strong>To learn more about Dr. Gang Hua and the amazing advances in computer vision, visit <a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/\">Microsoft.com\/research<\/a><\/strong><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Episode 28, June 13, 2018 &#8211; Dr. Hua talks about how the latest advances in AI and machine learning are making big improvements on image recognition, video understanding and even the arts. 
He also explains the distributed ensemble approach to active learning, where humans and machines work together in the lab to get computer vision systems ready to see and interpret the open world.<\/p>\n","protected":false},"author":37074,"featured_media":490565,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"msr-url-field":"http:\/\/player.blubrry.com\/id\/34684784","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","msr-author-ordering":[],"msr_hide_image_in_river":0,"footnotes":""},"categories":[240054],"tags":[],"research-area":[13556,13562,13551],"msr-region":[],"msr-event-type":[],"msr-locale":[268875],"msr-post-option":[],"msr-impact-theme":[],"msr-promo-type":[],"msr-podcast-series":[],"class_list":["post-490556","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-msr-podcast","msr-research-area-artificial-intelligence","msr-research-area-computer-vision","msr-research-area-graphics-and-multimedia","msr-locale-en_us"],"msr_event_details":{"start":"","end":"","location":""},"podcast_url":"http:\/\/player.blubrry.com\/id\/34684784","podcast_episode":"","msr_research_lab":[199565],"msr_impact_theme":[],"related-publications":[],"related-downloads":[],"related-videos":[],"related-academic-programs":[],"related-groups":[],"related-projects":[],"related-events":[488849],"related-researchers":[],"msr_type":"Post","featured_image_thumbnail":"<img width=\"926\" height=\"540\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2018\/06\/PodcastGangHua_Carousel_06_2018_480x280.jpg\" class=\"img-object-cover\" alt=\"a man wearing glasses and smiling at the camera\" decoding=\"async\" loading=\"lazy\" srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2018\/06\/PodcastGangHua_Carousel_06_2018_480x280.jpg 960w, 
https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2018\/06\/PodcastGangHua_Carousel_06_2018_480x280-300x175.jpg 300w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2018\/06\/PodcastGangHua_Carousel_06_2018_480x280-768x448.jpg 768w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2018\/06\/PodcastGangHua_Carousel_06_2018_480x280-480x280.jpg 480w\" sizes=\"auto, (max-width: 926px) 100vw, 926px\" \/>","byline":"","formattedDate":"June 13, 2018","formattedExcerpt":"Episode 28, June 13, 2018 - Dr. Hua talks about how the latest advances in AI and machine learning are making big improvements on image recognition, video understanding and even the arts. He also explains the distributed ensemble approach to active learning, where humans and&hellip;","locale":{"slug":"en_us","name":"English","native":"","english":"English"},"_links":{"self":[{"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/posts\/490556","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/users\/37074"}],"replies":[{"embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/comments?post=490556"}],"version-history":[{"count":8,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/posts\/490556\/revisions"}],"predecessor-version":[{"id":492701,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/posts\/490556\/revisions\/492701"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/media\/490565"}],"wp:attachment":[{"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/media?parent=490556"}]
,"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/categories?post=490556"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/tags?post=490556"},{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=490556"},{"taxonomy":"msr-region","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-region?post=490556"},{"taxonomy":"msr-event-type","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-event-type?post=490556"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=490556"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=490556"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=490556"},{"taxonomy":"msr-promo-type","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-promo-type?post=490556"},{"taxonomy":"msr-podcast-series","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-podcast-series?post=490556"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}