Subtle motion captured on camera: a world revealed by MIT's latest technology


Subtle motion happens around us all the time, including tiny vibrations caused by sound. New technology shows that we can pick up on these vibrations and actually re-create sound and conversations just from a video of a seemingly still object. But now Abe Davis takes it one step further: Watch him demo software that lets anyone interact with these hidden properties, just from a simple video.



Title: New video technology that reveals an object’s hidden properties
Speaker: Abe Davis
Uploaded: 2015/05/06


Most of us think of motion as a very visual thing. If I walk across this stage or gesture with my hands while I speak, that motion is something that you can see. But there’s a world of important motion that’s too subtle for the human eye, and over the past few years, we’ve started to find that cameras can often see this motion even when humans can’t.

So let me show you what I mean. On the left here, you see video of a person’s wrist, and on the right, you see video of a sleeping infant. But if I didn’t tell you that these were videos, you might assume that you were looking at two regular images, because in both cases, these videos appear to be almost completely still. But there’s actually a lot of subtle motion going on here. If you were to touch the wrist on the left, you would feel a pulse, and if you were to hold the infant on the right, you would feel the rise and fall of her chest as she took each breath. These motions carry a lot of significance, but they’re usually too subtle for us to see, so instead, we have to observe them through direct contact, through touch.

A few years ago, my colleagues at MIT developed what they call a motion microscope, which is software that finds these subtle motions in video and amplifies them so that they become large enough for us to see. If we use their software on the left video, it lets us see the pulse in this wrist, and if we were to count that pulse, we could even figure out this person’s heart rate. If we used the same software on the right video, it lets us see each breath that this infant takes, and we can use this as a contact-free way to monitor her breathing. This technology is really powerful because it takes these phenomena that we normally have to experience through touch and lets us capture them visually and non-invasively.
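The motion microscope described here is based on MIT's Eulerian video magnification. Below is a minimal sketch of the core idea only, temporally band-pass filtering each pixel and amplifying the result; the band limits, amplification factor `alpha`, and the synthetic "pulse" signal are illustrative choices, not the published parameters, and the real method also decomposes frames into a spatial pyramid first.

```python
import numpy as np

def magnify_motion(frames, fps, low_hz, high_hz, alpha=50.0):
    """Eulerian-style magnification sketch: band-pass each pixel's
    intensity over time, amplify the filtered signal, add it back.
    frames: float array of shape (T, H, W).
    """
    spec = np.fft.rfft(frames, axis=0)
    freqs = np.fft.rfftfreq(frames.shape[0], d=1 / fps)
    keep = (freqs >= low_hz) & (freqs <= high_hz)   # temporal band of interest
    bandpassed = np.fft.irfft(spec * keep[:, None, None],
                              n=frames.shape[0], axis=0)
    return frames + alpha * bandpassed

# Synthetic check: a tiny 1.2 Hz flicker standing in for a pulse.
fps = 30
t = np.arange(90) / fps
flicker = 0.001 * np.sin(2 * np.pi * 1.2 * t)      # far too small to see
frames = 0.5 + flicker[:, None, None] * np.ones((90, 4, 4))
out = magnify_motion(frames, fps, low_hz=0.8, high_hz=2.0)
```

Counting the amplified oscillations over time then gives the heart rate, exactly as described above.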

A couple of years ago, I started working with the folks that created that software, and we decided to pursue a crazy idea. We thought, it’s cool that we can use software to visualize tiny motions like this, and you can almost think of it as a way to extend our sense of touch. But what if we could do the same thing with our ability to hear? What if we could use video to capture the vibrations of sound, which are just another kind of motion, and turn everything that we see into a microphone?

Now, this is a bit of a strange idea, so let me try to put it in perspective for you. Traditional microphones work by converting the motion of an internal diaphragm into an electrical signal, and that diaphragm is designed to move readily with sound so that its motion can be recorded and interpreted as audio. But sound causes all objects to vibrate. Those vibrations are just usually too subtle and too fast for us to see. So what if we record them with a high-speed camera and then use software to extract tiny motions from our high-speed video, and analyze those motions to figure out what sounds created them? This would let us turn visible objects into visual microphones from a distance.
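The recovery step can be sketched in very reduced form. The actual visual-microphone algorithm measures local phase changes in a complex steerable pyramid; the stand-in below instead assumes the sound shifts the whole scene rigidly by a tiny sub-pixel amount each frame, and estimates that shift with a first-order Taylor expansion solved by least squares over all pixels.

```python
import numpy as np

def frames_to_audio(frames):
    """Estimate a 1-D sound signal from video of a vibrating scene.

    Simplifying assumption (not the published algorithm): the sound
    shifts the whole image horizontally by a sub-pixel amount d_t each
    frame, so I_t(x) ~ I_0(x) + d_t * dI_0/dx. Solving for d_t by least
    squares over every pixel yields one audio sample per frame.
    """
    ref = frames[0]
    gx = np.gradient(ref, axis=1)        # horizontal spatial gradient
    denom = np.sum(gx * gx)
    return np.array([np.sum(gx * (f - ref)) / denom for f in frames])

# Synthetic check: shift a smooth pattern by a known tiny waveform.
x = np.linspace(0, 2 * np.pi, 64)
true_signal = 0.01 * np.sin(2 * np.pi * 440 * np.arange(200) / 20000)
frames = np.stack([np.tile(np.sin(x + d), (64, 1)) for d in true_signal])
recovered = frames_to_audio(frames)      # proportional to true_signal
```

Played back at the camera's frame rate, a per-frame signal like `recovered` is the "audio" the talk refers to.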

So we tried this out, and here’s one of our experiments, where we took this potted plant that you see on the right and we filmed it with a high-speed camera while a nearby loudspeaker played this sound.

And so here’s the video that we recorded, and we recorded it at thousands of frames per second. But even if you look very closely, all you’ll see are some leaves that are pretty much just sitting there doing nothing, because our sound only moved those leaves by about a micrometer. That’s one ten-thousandth of a centimeter, which spans somewhere between a hundredth and a thousandth of a pixel in this image. So you can squint all you want, but motion that small is pretty much perceptually invisible. But it turns out that something can be perceptually invisible and still be numerically significant, because with the right algorithms, we can take this silent, seemingly still video and we can recover this sound.

So how is this possible? How can we get so much information out of so little motion? Well, let’s say that those leaves move by just a single micrometer, and let’s say that that shifts our image by just a thousandth of a pixel. That may not seem like much, but a single frame of video may have hundreds of thousands of pixels in it, and so if we combine all of the tiny motions that we see from across that entire image, then suddenly a thousandth of a pixel can start to add up to something pretty significant. On a personal note, we were pretty psyched when we figured this out.
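The averaging argument in this paragraph can be checked with a toy simulation: a 1/1000-pixel motion is invisible at any single noisy pixel, but averaging N independent pixels shrinks the noise by the square root of N. The noise level and pixel count below are illustrative assumptions, not measured values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Motion worth 1/1000 of a pixel, watched by 100,000 pixels, each with
# independent measurement noise 50x larger than the motion itself.
t = np.arange(2000)
motion = 0.001 * np.sin(2 * np.pi * t / 50)

n_pixels = 100_000
noise_std = 0.05
one_pixel = motion + rng.normal(0, noise_std, t.size)

# Averaging N independent noisy measurements divides the noise by
# sqrt(N), so the whole frame behaves like one pixel whose noise is
# 0.05 / sqrt(100,000) ~ 0.00016 -- now smaller than the motion.
whole_frame = motion + rng.normal(0, noise_std / np.sqrt(n_pixels), t.size)

# One pixel is essentially uncorrelated with the true motion; the
# frame-wide average tracks it closely.
print(round(float(np.corrcoef(one_pixel, motion)[0, 1]), 2))
print(round(float(np.corrcoef(whole_frame, motion)[0, 1]), 2))
```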

But even with the right algorithm, we were still missing a pretty important piece of the puzzle. You see, there are a lot of factors that affect when and how well this technique will work. There’s the object and how far away it is; there’s the camera and the lens that you use; how much light is shining on the object and how loud your sound is. And even with the right algorithm, we had to be very careful with our early experiments, because if we got any of these factors wrong, there was no way to tell what the problem was. We would just get noise back. And so a lot of our early experiments looked like this.

And so here I am, and on the bottom left, you can kind of see our high-speed camera, which is pointed at a bag of chips, and the whole thing is lit by these bright lamps. And like I said, we had to be very careful in these early experiments, so this is how it went down.

“Abe Davis: Three, two, one, go. Mary had a little lamb! Little lamb! Little lamb!”

So this experiment looks completely ridiculous. I mean, I’m screaming at a bag of chips, and we’re blasting it with so much light, we literally melted the first bag we tried this on. But ridiculous as this experiment looks, it was actually really important, because we were able to recover this sound.

“Mary had a little lamb! Little lamb! Little lamb!”

And this was really significant, because it was the first time we recovered intelligible human speech from silent video of an object.

And so it gave us this point of reference, and gradually we could start to modify the experiment, using different objects or moving the object further away, using less light or quieter sounds. And we analyzed all of these experiments until we really understood the limits of our technique, because once we understood those limits, we could figure out how to push them.

That led to experiments like this one, where again, I’m going to speak to a bag of chips, but this time we’ve moved our camera about 15 feet away, outside, behind a soundproof window, and the whole thing is lit by only natural sunlight. And so here’s the video that we captured. And this is what things sounded like from inside, next to the bag of chips.

“Mary had a little lamb whose fleece was white as snow, and everywhere that Mary went, that lamb was sure to go.”

And here’s what we were able to recover from our silent video captured outside behind that window.

“Mary had a little lamb whose fleece was white as snow, and everywhere that Mary went, that lamb was sure to go.”

And there are other ways that we can push these limits as well. So here’s a quieter experiment where we filmed some earphones plugged into a laptop computer, and in this case, our goal was to recover the music that was playing on that laptop from just silent video of these two little plastic earphones, and we were able to do this so well that I could even Shazam our results.

We can also push things by changing the hardware that we use. The experiments I’ve shown you so far were done with a high-speed camera that can record video about a hundred times faster than most cell phones, but we’ve also found a way to use this technique with more regular cameras. We do that by taking advantage of what’s called a rolling shutter. Most cameras record images one row at a time, and so if an object moves during the recording of a single image, there’s a slight time delay between each row, causing slight artifacts that get coded into each frame of a video. By analyzing these artifacts, we can actually recover sound using a modified version of our algorithm.
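The timing arithmetic behind the rolling-shutter trick can be illustrated with a toy example. The frame rate, row count, and tone below are made-up numbers, and real sensors have a blanking interval between frames that this ignores; the point is only that reading rows in exposure order turns a slow camera into a much faster sampler.

```python
import numpy as np

# A 60 fps camera whose sensor exposes its 720 rows one after another
# effectively samples a row-wise shift signal at 60 * 720 = 43,200 rows
# per second (ignoring the blanking gap between frames, which in
# reality leaves periodic holes in the signal).
fps, rows = 60, 720
row_rate = fps * rows                    # 43,200 samples per second

# A 440 Hz tone shifts the scene; each row sees the shift at its own
# exposure time, so one frame holds 720 consecutive time samples.
n_frames = 30
times = np.arange(n_frames * rows) / row_rate
shift = 0.01 * np.sin(2 * np.pi * 440 * times)
frames = shift.reshape(n_frames, rows)   # one column of row shifts per frame

# Reading the rows of every frame in order recovers the waveform at the
# row rate; at one sample per frame (60 Hz), a 440 Hz tone would alias.
recovered = frames.reshape(-1)
spectrum = np.abs(np.fft.rfft(recovered))
freqs = np.fft.rfftfreq(recovered.size, d=1 / row_rate)
print(freqs[spectrum.argmax()])          # → 440.0
```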

So here’s an experiment we did where we filmed a bag of candy while a nearby loudspeaker played the same “Mary Had a Little Lamb” music from before, but this time, we used just a regular store-bought camera. In a second, I’ll play for you the sound that we recovered, and it’s going to sound distorted this time, but listen and see if you can still recognize the music.

And so, again, that sounds distorted, but what’s really amazing here is that we were able to do this with something that you could literally run out and pick up at a Best Buy.

At this point, a lot of people see this work, and they immediately think about surveillance. To be fair, it’s not hard to imagine how you might use this technology to spy on someone. But keep in mind that there’s already a lot of very mature technology out there for surveillance. In fact, people have been using lasers to eavesdrop on objects from a distance for decades. But what’s really new here, what’s really different, is that now we have a way to picture the vibrations of an object, which gives us a new lens through which to look at the world. We can use that lens to learn not just about forces like sound that cause an object to vibrate, but also about the object itself.

And so I want to take a step back and think about how that might change the ways that we use video, because we usually use video to look at things, and I’ve just shown you how we can use it to listen to things. But there’s another important way that we learn about the world: that’s by interacting with it. We push and pull and poke and prod things. We shake things and see what happens. And that’s something that video still won’t let us do, at least not traditionally.

So I want to show you some new work, and this is based on an idea I had just a few months ago, so this is actually the first time I’ve shown it to a public audience. The basic idea is that we’re going to use the vibrations in a video to capture objects in a way that will let us interact with them and see how they react to us.

So here’s an object, and in this case, it’s a wire figure in the shape of a human, and we’re going to film that object with just a regular camera. So there’s nothing special about this camera. In fact, I’ve actually done this with my cell phone before. But we do want to see the object vibrate, so to make that happen, we’re just going to bang a little bit on the surface where it’s resting while we record this video.

So that’s it: just five seconds of regular video, while we bang on this surface, and we’re going to use the vibrations in that video to learn about the structural and material properties of our object, and we’re going to use that information to create something new and interactive. And so here’s what we’ve created. And it looks like a regular image, but this isn’t an image, and it’s not a video, because now I can take my mouse and I can start interacting with the object. And so what you see here is a simulation of how this object would respond to new forces that we’ve never seen before, and we created it from just five seconds of regular video.
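A much-reduced sketch of the idea behind this simulation, in the spirit of modal analysis: pick out the object's dominant vibration frequencies from the recorded motion, then model its response to a new poke as a sum of damped sinusoids at those frequencies. The real method recovers 2-D mode shapes across the whole image; the frequencies, damping, and weights below are illustrative, and the "recording" here is already a 1-D displacement trace.

```python
import numpy as np

def dominant_modes(displacement, fps, n_modes=2):
    """Read the strongest vibration frequencies off a recorded signal."""
    spectrum = np.abs(np.fft.rfft(displacement))
    freqs = np.fft.rfftfreq(displacement.size, d=1 / fps)
    idx = spectrum[1:].argsort()[-n_modes:] + 1   # top bins, skipping DC
    return freqs[idx], spectrum[idx] / spectrum[idx].max()

def simulate_poke(mode_freqs, mode_weights, damping=1.0, duration=2.0, fps=60):
    """Response to a new impulsive force: each recovered mode rings as a
    damped sinusoid, and the visible motion is their weighted sum."""
    t = np.arange(int(duration * fps)) / fps
    return sum(w * np.exp(-damping * t) * np.sin(2 * np.pi * f * t)
               for f, w in zip(mode_freqs, mode_weights))

# "Banging on the surface" excites modes at (say) 4 Hz and 9 Hz.
fps = 60
t = np.arange(300) / fps
recorded = np.sin(2 * np.pi * 4 * t) + 0.5 * np.sin(2 * np.pi * 9 * t)
freqs, weights = dominant_modes(recorded, fps)
response = simulate_poke(freqs, weights, fps=fps)
print(np.sort(np.round(freqs, 1)))       # → [4. 9.]
```

Driving those modes with mouse forces instead of an impulse is what makes the captured object interactive.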

And so this is a really powerful way to look at the world, because it lets us predict how objects will respond to new situations. You could imagine, for instance, looking at an old bridge and wondering what would happen, how would that bridge hold up if I were to drive my car across it. And that’s a question that you probably want to answer before you start driving across that bridge.

Of course, there are going to be limitations to this technique, just like there were with the visual microphone, but we found that it works in a lot of situations that you might not expect, especially if you give it longer videos. So for example, here’s a video that I captured of a bush outside of my apartment, and I didn’t do anything to this bush, but by capturing a minute-long video, a gentle breeze caused enough vibrations that we could learn enough about this bush to create this simulation.

And so you could imagine giving this to a film director, and letting them control, say, the strength and direction of wind in a shot after it’s been recorded. Or, in this case, we pointed our camera at a hanging curtain, and you can’t even see any motion in this video, but by recording a two-minute-long video, natural air currents in this room created enough subtle, imperceptible motions and vibrations that we could learn enough to create this simulation.

Ironically, we’re kind of used to having this kind of interactivity when it comes to virtual objects, when it comes to video games and 3D models, but to be able to capture this information from real objects in the real world using just simple, regular video is something new that has a lot of potential.

So here are the amazing people who worked with me on these projects.

What I’ve shown you today is only the beginning. We’ve just started to scratch the surface of what you can do with this kind of imaging, because it gives us a new way to capture our surroundings with common, accessible technology. Looking to the future, it’s going to be really exciting to explore what this can tell us about the world.

Thank you.
