Mind the (Accent) Gap: DefinedCrowd Contributing to More Inclusive Speech Technology

In a drive to address bias in speech technology, DefinedCrowd is offering AI developers free speech datasets to enable them to test how well their speech recognition models understand nonnative English speakers with a variety of accents.

Free Speech Dataset

DefinedCrowd, the one-stop shop for high-quality artificial intelligence training data, today released the first of a series of free Spanish-accented English speech datasets to allow AI developers to test how well their speech recognition models understand nonnative English speakers, a group of more than 35 million people in the United States.

"There is an accent gap in speech technology. Research shows that speech recognition technologies are not nearly as accurate in understanding nonnative accents as they are in understanding white, non-immigrant, upper-middle-class Americans," said Dr. Daniela Braga, founder and CEO of DefinedCrowd.  

It is not a surprising phenomenon: this is the demographic that had access to the technology, and trained it, from the beginning. To address the bias present in speech recognition technology, DefinedCrowd has released the first of four sets of Spanish-accented English speech datasets, which developers can use to test or benchmark their models and to identify bias and areas that need more training data.

"Unfortunately, it has resulted in models that are more useful to some people than to others. And that must change," said Dr. Braga.  

However, many companies do not have the resources to train or test their systems with different accents, meaning that speech recognition systems are likely to provide an unresponsive, inaccurate, and even isolating experience to nonnative English speakers.

This is clearly bad for business: according to the U.S. Census, over 35 million people in the United States are native speakers of a language other than English. Sixty percent of these people speak Spanish at home.

"For companies with AI solutions to compete in the large nonnative English-speaking market in the U.S., speech models need to be able to understand a wide range of different Spanish accents, originating from all the Americas," said Christopher Shulby, Director of Machine Learning Engineering at DefinedCrowd.

The first dataset, released in two phases, contains Spanish-accented English data from across the Americas, including Argentina, Brazil, Canada, Chile, Colombia, the Dominican Republic, Guatemala, Honduras, Mexico, Nicaragua, Panama, Peru, the United States, Uruguay, and Venezuela.

Subsequent releases will include datasets from native Spanish speakers from around the world, including Australia, China, Finland, France, Germany, India, Israel, Italy, Norway, Portugal, Russia, Spain, Sweden, and the United Kingdom.  

The datasets represent speakers aged 18 to 40, with an equal distribution of male and female speakers.

To access the data, developers will need to register on DefinedCrowd's Marketplace, after which they will receive a link to download the dataset.
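As an illustration of how a developer might benchmark a model with accented speech data, the sketch below computes word error rate (WER) separately for each accent group, so that groups with noticeably higher error stand out. It is a minimal example, not DefinedCrowd's method: the function names, the `(accent, reference, hypothesis)` tuple layout, and the sample sentences are all assumptions made for illustration, and the actual dataset format may differ.

```python
from collections import defaultdict


def word_edit_distance(reference, hypothesis):
    """Levenshtein distance between two lists of words."""
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # word missing from the hypothesis
                curr[j - 1] + 1,     # extra word in the hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def wer_by_accent(samples):
    """Return {accent: WER} for (accent, reference, hypothesis) tuples.

    The tuple layout is hypothetical; adapt it to the real data format.
    """
    errors = defaultdict(int)
    words = defaultdict(int)
    for accent, reference, hypothesis in samples:
        ref_words = reference.lower().split()
        hyp_words = hypothesis.lower().split()
        errors[accent] += word_edit_distance(ref_words, hyp_words)
        words[accent] += len(ref_words)
    return {a: errors[a] / words[a] for a in words if words[a] > 0}


# Made-up transcripts, for illustration only.
samples = [
    ("mexico", "turn on the light", "turn on the light"),
    ("chile", "turn on the light", "turn of the night"),
]
print(wer_by_accent(samples))
```

A large difference in WER between groups would suggest where a model needs more training data.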

Contact:
pr@definedcrowd.com

Source: DefinedCrowd Corp.