Watch what you say around that discarded potato chip bag (shh, it’s listening). Researchers have developed an algorithm that can reconstruct intelligible audio of speech in another room just by analyzing video of a crinkly bag vibrating in response to the sound.
“When sound hits an object, it causes the object to vibrate,” MIT’s Abe Davis explains in a news release. “The motion of this vibration creates a very subtle visual signal that’s usually invisible to the naked eye. People didn’t realize that this information was there.”
MIT, Microsoft, and Adobe researchers recovered audio signals by analyzing the tiny vibrations produced by sound on a variety of objects, like aluminum foil, the surface of water in a glass, and leaves of a potted plant. In one experiment, someone recited “Mary Had a Little Lamb” through speakers in a room with a potato chip bag on the floor. The team was able to recover the recitation using just video of the bag filmed from about five meters away through soundproof glass.
To extract sound from visual information, the frequency of the video (the number of frames captured per second) needs to be higher than the frequency of the audio signal; strictly, by the Nyquist criterion, it must be at least twice the highest audio frequency you want to recover. The best commercial high-speed cameras capture 100,000 frames per second. For many of their experiments, the team used high-speed cameras capturing up to 6,000 frames a second. Even video from ordinary smartphones, which capture only about 60 frames per second, was good enough to identify the gender of a speaker, the number of speakers, and even their voices and identities.
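To get a feel for why frame rate matters, here is a minimal sketch (not the researchers’ code): by the Nyquist criterion, a camera sampling at a given frame rate can only faithfully represent vibrations up to half that rate; anything higher folds, or "aliases," down to a different frequency.

```python
# Illustrative sketch: how a vibration frequency appears when sampled at a
# given frame rate (fps). Frequencies above fps / 2 alias into the band
# [0, fps / 2] instead of being captured faithfully.

def aliased_frequency(f_signal, fps):
    """Frequency (Hz) at which a tone of f_signal Hz appears at frame rate fps."""
    folded = f_signal % fps
    return min(folded, fps - folded)

# A 440 Hz tone is below the 3,000 Hz Nyquist limit of a 6,000 fps camera,
# so it comes through at its true frequency...
print(aliased_frequency(440, 6000))  # 440
# ...but a 60 fps smartphone camera folds it down to a different frequency.
print(aliased_frequency(440, 60))    # 20
```

This is why a 60 fps phone cannot recover intelligible speech directly, yet still captures enough low-frequency structure to distinguish speakers.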
The team could measure motions of about a tenth of a micrometer. That corresponds to five thousandths of a pixel. When you look at an image, there’s usually a boundary between two different parts (say, blue and red), but at the boundary itself, the camera’s sensor receives inputs from both (so they average out to purple). Over successive frames of video, if the blue encroaches into the red, there will be a noticeable color shift as the purple grows bluer. By passing these successive frames through image filters, and then using an algorithm that combines the output of the filters, researchers can measure these fluctuations. This allows them to infer the motions of an object struck by sound waves.
The team also produced a variation on the algorithm that takes advantage of our everyday low-speed digital cameras with “rolling shutter” sensors, which scan across the frame one row at a time rather than capturing it all at once. Rolling shutter is normally a problem only when you photograph something very fast (like helicopter blades) that moves between the reading of one row and the next. For the team, this bug is a boon: the slight distortions it produces at the edges of objects in a video contain information about high-frequency vibration, which in turn can be used to recover an audio signal.
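A simplified model (an assumption for illustration, not the team’s actual method) shows why rolling shutter helps: since each row of a frame is read at a slightly different instant, the rows act as extra temporal samples, pushing the effective sampling rate far above the nominal frame rate.

```python
# Simplified rolling-shutter model: rows are read sequentially, so each row
# samples the scene at a different time. Here we assume (for illustration)
# that the row readout spans the entire frame period.

def row_timestamps(frame_index, fps=60, rows=1080):
    """Capture time (seconds) of each row in one frame of rolling-shutter video."""
    row_time = 1.0 / (fps * rows)  # time between consecutive row readouts
    t0 = frame_index / fps         # start time of this frame
    return [t0 + r * row_time for r in range(rows)]

times = row_timestamps(0)
# 1080 rows per frame at 60 fps gives samples spaced about 15 microseconds
# apart: an effective rate of 60 * 1080 = 64,800 samples per second.
print(times[1] - times[0])
```

Real sensors pause between frames, so the usable rate is lower and the samples are unevenly spaced, but the principle stands: row-by-row readout encodes vibration well above the frame rate.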
The work will be presented at SIGGRAPH this month.
Here’s a video of their remarkable process of extracting audio from the vibrations of a plant, a potato-chip bag, earbuds plugged into a computer, and other objects. You can actually hear what the objects heard, and then hear the researchers’ reconstruction using just their vibrations:
[Via MIT News Office]