Google's DeepMind Now Able to Lip-Read TV Shows Better Than Professionals
Can you "read" what she's saying? Google DeepMind's newest AI endeavor can. Researchers from DeepMind and the University of Oxford have created a lip-reading system that can give the best professionals a run for their money.
The AI system was trained on roughly 5,000 hours of footage from six different UK TV programs, including Newsnight, BBC Breakfast, and Question Time, totaling roughly 118,000 sentences.
According to New Scientist, the researchers trained the AI on shows that aired between January 2010 and December 2015, then tested the system on programs broadcast between March and September of this year. The results were striking.
The system accurately deciphered entire phrases, including "We know there will be hundreds of journalists here as well" and "According to the latest figures from the Office of National Statistics."
According to the researchers, the AI also outperformed a professional lip-reader who attempted to decipher around 200 randomly selected clips. The professional annotated about 12.4 percent of the clips without error, while the AI managed a stunning 46.8 percent. Many of its mistakes were minor, such as a missing "s" at the end of a word. The system also outperformed other automatic lip-reading systems.
Ziheng Zhou of the University of Oulu said this is a big step, as it is difficult to train an AI to lip-read without such a large data set.
The New Scientist article also explained that the researchers had to prepare the clips for machine learning. Many of the audio and video streams were out of sync, which would have made it nearly impossible for the AI to learn associations between spoken words and lip movements.
Interestingly, the AI itself was able to realign the out-of-sync audio and video feeds. It then automatically processed the hours' worth of video and audio to make them ready for the "challenge."
Now the same researchers are wondering how this new lip-reading achievement might be put to use. Zhou thinks the system could be adapted for consumer devices, helping AI better understand what we're trying to say.