OpenAI launches advanced speech-to-text and text-to-speech models for developers

OpenAI has introduced new speech-to-text and text-to-speech models in its API, giving developers the tools to create more advanced voice agents.

The new speech-to-text models – gpt-4o-transcribe and gpt-4o-mini-transcribe – promise significantly lower word error rates and better language recognition than the existing Whisper models.

Those gains come from reinforcement learning and extensive training on a diverse range of audio datasets. OpenAI says: “Our latest speech-to-text models achieve lower word error rates across established benchmarks”, which means transcriptions are more likely to stay reliable in challenging conditions.

Those challenges include noisy backgrounds, strong accents, and fast speech. That’s good news for developers looking to build more reliable transcription services.
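For a sense of how that looks in code, here’s a minimal sketch using OpenAI’s official Python SDK; the file name is a placeholder and error handling is omitted:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Transcribe a local audio file with the new model.
# "meeting.mp3" is a placeholder file name.
with open("meeting.mp3", "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        model="gpt-4o-transcribe",
        file=audio_file,
    )

print(transcription.text)
```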

The gpt-4o-mini-tts model, meanwhile, lets developers choose the speaking style of the voice agent. So, for instance, it could sound like a friendly customer service agent.
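As a rough sketch with the same Python SDK – the voice name, input line, and style instructions here are illustrative, not prescribed:

```python
from openai import OpenAI

client = OpenAI()

# Generate speech with a steerable delivery style. "coral" is one of the
# preset voices; the instructions string shapes tone and pacing.
response = client.audio.speech.create(
    model="gpt-4o-mini-tts",
    voice="coral",
    input="Thanks for calling! How can I help you today?",
    instructions="Speak like a friendly, upbeat customer service agent.",
)

response.write_to_file("greeting.mp3")  # save the generated audio
```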

Right now, though, it only supports synthetic preset voices. OpenAI is planning to improve the intelligence and accuracy of these models and explore custom voice options. The company is also engaging with policymakers and researchers about the implications of synthetic voices.

The new models are available to all developers through OpenAI’s API. They’re also integrated with the Agents SDK, making it easier to build voice agents on top of them.

For low-latency speech-to-speech applications, OpenAI recommends using the Realtime API instead. Pricing for the new models is as follows:

gpt-4o-transcribe: $6 per million audio input tokens (roughly 0.6 cents per minute)

gpt-4o-mini-transcribe: $3 per million audio input tokens (roughly 0.3 cents per minute)

gpt-4o-mini-tts: $0.60 per million text input tokens (roughly 1.5 cents per minute of audio output)
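As a back-of-the-envelope check, the per-minute figures for the transcription models follow from the token rates if a minute of audio comes to about 1,000 audio tokens – an assumption implied by the quoted numbers, not an official figure:

```python
# Convert a per-million-token rate into an approximate per-minute cost,
# assuming ~1,000 audio tokens per minute of audio (illustrative only).
def cost_per_minute(price_per_million: float, tokens_per_minute: int = 1_000) -> float:
    return price_per_million / 1_000_000 * tokens_per_minute

print(cost_per_minute(6.00))  # gpt-4o-transcribe: ~$0.006/min, i.e. 0.6 cents
print(cost_per_minute(3.00))  # gpt-4o-mini-transcribe: ~$0.003/min, i.e. 0.3 cents
```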
