From the guy who does the voice-over for movie trailers to the announcers on the subway, our lives are full of faceless voices. And while most of us are content to build a mental image of these disembodied orators, a group of researchers from MIT has gone a step further by creating an artificial intelligence system that can reconstruct people’s faces just by listening to their voices.
The application, called Speech2Face, is a deep neural network that was trained to recognize the correlation between voices and facial features by observing millions of YouTube videos of people talking. In doing so, it learned to associate different aspects of the audio waveform with a speaker’s age, gender, and ethnicity, as well as certain craniofacial features such as the shape of the head and the width of the nose.
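The broad idea of that pipeline, turning a waveform into a spectrogram, encoding it into a compact "voice embedding," and decoding that embedding into facial attributes, can be sketched in a few lines. The sketch below is purely illustrative: the function names, dimensions, and random weights are hypothetical stand-ins for the deep convolutional network the researchers actually trained.

```python
# Illustrative sketch of a voice-to-embedding pipeline in the spirit of
# Speech2Face (waveform -> spectrogram -> voice embedding). All names,
# sizes, and weights here are hypothetical; the real system is a deep CNN
# trained on millions of videos, not a random linear layer.
import numpy as np

def toy_spectrogram(waveform, frame=64):
    """Split a 1-D waveform into fixed-size frames and take magnitude FFTs."""
    n_frames = len(waveform) // frame
    frames = waveform[:n_frames * frame].reshape(n_frames, frame)
    return np.abs(np.fft.rfft(frames, axis=1))  # shape: (n_frames, frame//2 + 1)

def toy_voice_encoder(spec, rng, dim=8):
    """Map a spectrogram to a fixed-size 'voice embedding' with a random
    linear layer plus average pooling (a stand-in for the trained encoder)."""
    w = rng.standard_normal((spec.shape[1], dim)) * 0.1  # hypothetical weights
    return np.tanh(spec @ w).mean(axis=0)                # shape: (dim,)

rng = np.random.default_rng(0)
waveform = rng.standard_normal(1024)   # fake audio clip in place of real speech
embedding = toy_voice_encoder(toy_spectrogram(waveform), rng)
print(embedding.shape)  # (8,)
```

In the real system, an embedding like this would then be fed to a separately trained face decoder that outputs a frontal face image capturing the correlated attributes.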
When the researchers then fed the system audio recordings of people’s voices, it was able to generate an image of each speaker’s face with reasonable accuracy.
Obviously, characteristics like hairstyle, facial hair, and certain other elements of physical appearance are impossible to predict from a person’s voice, so the developers insist that their goal was “not to predict a recognizable image of the exact face, but rather to capture dominant facial traits of the person that are correlated with the input speech.”
In a paper published on IEEE Xplore, the researchers say this technology could one day find a range of useful applications, such as generating faces for video calls without the need for cameras.
However, some improvements are clearly still needed: while the images created by Speech2Face generally match the speaker’s face type, they often bear only a general resemblance to the speaker. The system is also prone to the occasional error, with roughly 6 percent of the faces it created being of the wrong gender, and some of the wrong ethnicity.
Nevertheless, faceless voices are one step closer to becoming a thing of the past, which should have major implications for prank callers at least.