How speech models fail where it matters the most and what to do about it
Summary
We demonstrate that voice recognition systems struggle to transcribe street names pronounced by speakers from diverse linguistic backgrounds: across 15 state-of-the-art models, the average transcription error rate is 39%, and speakers whose primary language is not English see 18-percentage-point lower accuracy than primary English speakers. We show that a synthetic data generation technique called "cross-lingual style transfer" can reduce these errors by up to 60% relative to the base model, using fewer than 1,000 training samples.

Automatic speech recognition systems such as Whisper, Deepgram, and Phi-4 are integral to today's digital infrastructure. These models are canonically tested on speech benchmarks like LibriSpeech and Switchboard, where they achieve near-human parity on metrics like Word Error Rate (WER). However, these aggregate metrics, focused solely on word accuracy, can mask critical errors. One of the most significant gaps is the inability to reliably transcribe short, high-stakes utterances in the real world. When a user dictates a command to a navigation system or an emergency dispatcher, a single error in a named entity can cost vital time for both the caller and the dispatcher.
Our latest research investigates the gap between benchmark performance and real-world reliability. We introduce SF Streets and US Streets, two new benchmarks designed to stress-test named entity recognition in deployed systems. Our evaluation reveals that even the most capable models from OpenAI, Deepgram, Google, and Microsoft struggle with this task. To address this, we developed a synthetic data generation recipe that leverages cross-lingual style transfer to improve performance by up to 60% (relative to the base model) with fewer than 1,000 training samples.
Address recognition in STT benchmarking
Standard speech benchmarks are often dominated by long-form, read speech, where semantic context helps resolve ambiguities. Street names, when pronounced by residents of multi-lingual cities, represent a different challenge. They are context-poor, acoustically diverse, and intolerant of phonetic errors: a minor difference in pronunciation can make a big difference on the map.
To quantify this difficulty, we collected the SF Streets dataset. This collection comprises 2,262 utterances from 78 linguistically diverse participants from the U.S., pronouncing street names from San Francisco. We focused on the city's boulevards, such as "Cesar Chavez" or "Alemany," as they serve as major arteries and are frequently referenced in navigation queries.

We evaluated 15 state-of-the-art models on this dataset. Despite achieving low WER on general speech, the models exhibited an average transcription error rate of 39% on street names. This disconnect challenges the assumption that model scale automatically solves robustness. For example, Whisper-Large achieves a respectable general WER of 14%, but its error rate on street names rises to 27%.
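Our full scoring pipeline is more involved, but the core metric can be sketched as an exact-match entity error rate after light text normalization. The helper names and example strings below are illustrative, not our actual evaluation code:

```python
import re

def normalize(name: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace so that
    'Cesar Chavez St.' and 'cesar chavez st' compare equal."""
    name = re.sub(r"[^\w\s]", "", name.lower())
    return re.sub(r"\s+", " ", name).strip()

def entity_error_rate(references: list[str], hypotheses: list[str]) -> float:
    """Fraction of utterances whose transcribed street name fails to
    match the reference exactly after normalization."""
    errors = sum(
        normalize(ref) != normalize(hyp)
        for ref, hyp in zip(references, hypotheses)
    )
    return errors / len(references)

# One of three names mis-transcribed -> error rate of 1/3
refs = ["Cesar Chavez", "Alemany", "Divisadero"]
hyps = ["cesar chavez", "Almaden", "Divisadero"]
print(entity_error_rate(refs, hyps))
```

Exact match is deliberately strict: unlike WER, a single wrong character in a street name counts as a full failure, which matches how a routing system would consume the transcript.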
In a city like San Francisco, taxi services provide essential, subsidized transportation for elderly and disabled residents, and these routing deviations translate into tangible economic loss. Using standard taxi fare schedules and traffic data, we estimate that the additional driving time required to correct these errors costs approximately \$4.00 per incident. Aggregated over the city's annual taxi volume, transcription errors alone could generate roughly 43,500 hours of avoidable delay per year, amounting to an estimated \$2.1 million annually in wasted time and fares.
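The aggregation itself is simple arithmetic. As a back-of-the-envelope sketch, the \$4.00 per-incident figure comes from the text, while the incident count and per-incident delay passed in below are placeholder assumptions, not the values from our analysis:

```python
def citywide_cost(incidents_per_year: int,
                  cost_per_incident: float = 4.00,
                  delay_hours_per_incident: float = 0.083) -> tuple[float, float]:
    """Scale per-incident cost (USD) and delay (hours) to annual totals.

    cost_per_incident reflects the ~$4.00 estimate from the text;
    delay_hours_per_incident (~5 minutes) is an illustrative assumption.
    """
    total_cost = incidents_per_year * cost_per_incident
    total_delay_hours = incidents_per_year * delay_hours_per_incident
    return total_cost, total_delay_hours

cost, delay = citywide_cost(500_000)  # placeholder incident volume
print(f"${cost:,.0f} in fares, {delay:,.0f} hours of delay per year")
```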
Demographics and disparate impact
The reliability of these systems varies significantly across different groups. As modern speech models are deployed in diverse urban environments, they encounter speakers with varying accents and linguistic backgrounds. Our analysis of the SF Streets data revealed a significant performance disparity: across our 15 models and model variants, speakers whose primary language is not English averaged 46% accuracy versus 64% for primary English speakers, an 18-percentage-point gap.

This technical failure translates directly into operational friction. To measure the real-world consequences, we mapped the transcribed street names to geographic coordinates using the Google Maps API. Mis-transcriptions for non-English primary speakers resulted in routed destinations that were, on average, 2.40 miles from the intended location; errors for English-only speakers produced a smaller deviation of 1.26 miles.
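Once the intended and transcribed destinations have been geocoded, the deviation is just a great-circle distance between the two coordinate pairs. A minimal haversine sketch, with illustrative San Francisco coordinates rather than points from our dataset:

```python
from math import radians, sin, cos, asin, sqrt

def haversine_miles(lat1: float, lon1: float,
                    lat2: float, lon2: float) -> float:
    """Great-circle distance in miles between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 3958.8 * asin(sqrt(a))  # Earth radius ~= 3958.8 miles

# Illustrative pair: intended destination vs. where a mis-transcribed
# street name was geocoded, a couple of miles apart across the city.
deviation = haversine_miles(37.7484, -122.4156, 37.7793, -122.4193)
print(f"{deviation:.2f} miles off target")
```

Averaging this quantity over all mis-transcribed utterances per speaker group yields the 2.40-mile versus 1.26-mile comparison above.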
Improving Representativeness with Data Cloning
Collecting representative human speech data for every possible named entity is prohibitively expensive and unscalable. Consequently, we investigated whether synthetic data could help us bridge this gap.

We exploited the inherent biases of multilingual text-to-speech models through a technique we call cross-lingual style transfer. We prompted the open-source XTTS model to generate speech in a non-English language such as Spanish, but injected specific English street names into the prompt. This forces the model to apply non-English phonetic rules to English words: prompting it with "Estoy en Washington," for example, produces the word "Washington" with a distinctly Spanish phonetic realization.
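The prompt-construction side of this recipe can be sketched in a few lines. The carrier phrases and language set below are illustrative, and the commented-out synthesis step assumes Coqui TTS's public XTTS API with a reference speaker clip you supply:

```python
# Non-English carrier sentences with a slot for an English street name.
# These specific phrases are illustrative examples, not our full set.
CARRIERS = {
    "es": "Estoy en {}",    # Spanish: "I am at {}"
    "fr": "Je suis a {}",   # French
    "pt": "Estou em {}",    # Portuguese
}

def style_transfer_prompts(street: str) -> list[tuple[str, str]]:
    """Embed an English street name in non-English carrier sentences so a
    multilingual TTS model applies that language's phonetics to the name."""
    return [(lang, carrier.format(street)) for lang, carrier in CARRIERS.items()]

for lang, text in style_transfer_prompts("Washington"):
    print(lang, text)
    # Hypothetical synthesis step (large model download, not run here):
    # from TTS.api import TTS
    # tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")
    # tts.tts_to_file(text=text, language=lang, speaker_wav="ref.wav",
    #                 file_path=f"{lang}_washington.wav")
```

Because the language tag controls the phonetic rules while the street name stays in English, each carrier language yields a different accented realization of the same entity.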
This method allowed us to generate a highly diverse set of pronunciations without needing human speakers. We created a fine-tuning dataset with fewer than 1,000 of these synthetic utterances. Fine-tuning Whisper-Base on this small synthetic set yielded a 60% relative improvement in accuracy over the base model, with the largest gains among non-English primary speakers. This illustrates that small-scale synthetic data and open-source voice cloning can meaningfully improve the transcription of named entities.
Moving forward
Our work highlights a persistent weakness in modern speech recognition. While general-purpose models continue to improve, they still fail disproportionately on the short, information-dense utterances that drive critical systems like emergency dispatch and navigation. However, we also demonstrate that this is a solvable problem. Creative use of synthetic data and style transfer allows practitioners to improve model robustness without the need for massive and expensive data collection efforts.
To facilitate further research, we will be releasing both the SF Streets dataset and the larger US Streets benchmark. The US Streets set contains 3,600 recordings from 12 major U.S. cities. We hope these resources help the community move beyond aggregate metrics and focus on the reliability that matters most in deployment.