Disney researchers have designed a speech recognition system for young children that can cope with overlapping speech, side talk and creative pronunciations, allowing kids to control a video game character with their voices.
The keyword-spotting system works better for kids than commercial speech recognition systems, which are derived largely from adult speech.
"The system's responsiveness and accuracy helped children enjoy the rapid-paced, multi-player game," said Jill Fain Lehman, senior research scientist at Disney Research.
"Speech recognition applications have become increasingly commonplace as the technology has matured, but understanding what kids say when they play remains difficult," said Jessica Hodgins, vice president at Disney Research in the US.
"This latest work by our researchers could make it possible to design any number of speech-based game or entertainment applications for children, including interactions with robots," said Hodgins.
Kids needed to say just two words - "jump" and "go" - to control a video game called Mole Madness, but the researchers had to design a novel system to make it work.
"Kids don't necessarily pronounce words quite like adults and when they are playing together, as they like to do, they often engage in side banter, or exclamations of excitement, or simply talk over each other," Lehman said.
"That makes it tough for a speech-based system, even one that just has to detect the words 'go' and 'jump' as in Mole Madness," she said.
In the two-player game, the players have to move an animated mole through its environment, gathering rewards as they avoid obstacles.
To move the mole horizontally, one player says "go," while the other player moves the mole vertically by saying "jump."
During game play, the players often say their commands simultaneously. In other cases, they make statements to each other, such as "Don't say 'go' yet," that can be misinterpreted by the system.
Sometimes, they are just making observations, such as "He's funny." They may also speak very quickly or very slowly, or change their pronunciation in an effort to exert greater control over the game.
To train their keyword-spotting system, the researchers had 62 children aged 5 to 10 play the game, both in pairs and paired with a robot, while a human listened in another room and tried to map each "go" and "jump" to a button press on a game controller.
The system uses separate models of go, jump, mixed, social speech and background noise, built from 150-millisecond segments of the training data.
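As a rough illustration of the idea described above, the following Python sketch cuts an audio stream into 150-millisecond segments and assigns each one to whichever of the five categories scores highest. It is not the researchers' implementation: the function names, the 16 kHz sample rate and the per-category scoring functions are all hypothetical placeholders, and a real system would use trained acoustic models in their place.

```python
from typing import Callable, Dict, List, Sequence

# The five categories named in the article.
LABELS = ("go", "jump", "mixed", "social", "background")
SEGMENT_MS = 150  # segment length reported in the article


def split_segments(samples: Sequence[float], sample_rate: int,
                   segment_ms: int = SEGMENT_MS) -> List[Sequence[float]]:
    """Cut samples into consecutive, non-overlapping fixed-length segments.

    A trailing partial segment is dropped (an assumption of this sketch).
    """
    seg_len = sample_rate * segment_ms // 1000
    if seg_len <= 0:
        raise ValueError("sample_rate too low for the requested segment length")
    return [samples[i:i + seg_len]
            for i in range(0, len(samples) - seg_len + 1, seg_len)]


def label_segments(segments: Sequence[Sequence[float]],
                   scorers: Dict[str, Callable[[Sequence[float]], float]]
                   ) -> List[str]:
    """Give each segment the label whose (hypothetical) model scores highest."""
    return [max(scorers, key=lambda name: scorers[name](seg))
            for seg in segments]


if __name__ == "__main__":
    # Toy stand-ins for trained models: constant scores that favour "jump".
    toy_scorers = {name: (lambda seg, n=name: 1.0 if n == "jump" else 0.0)
                   for name in LABELS}
    audio = [0.0] * 5000              # placeholder audio, assumed 16 kHz
    segs = split_segments(audio, 16000)
    print(label_segments(segs, toy_scorers))
```

At the assumed 16 kHz sample rate, each 150-millisecond segment holds 2,400 samples. Keeping one scorer per category mirrors the article's description of separate models for go, jump, mixed, social speech and background noise.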
Overall, the system was 85 per cent accurate in recognising the keywords.
A commercial continuous speech recognition system was about 35 per cent less accurate overall than the keyword spotter, having particular trouble recognising "go," overlapping speech and fast speech.
Disclaimer: No Business Standard Journalist was involved in creation of this content