What’s the maximum number of speakers that can be identified?
Overview
This article explains the speaker limits for diarization and identification in Conversation to Text.
Applies to
All users
Speaker diarization limits
The Conversation to Text module can diarize recordings, that is, detect and separate multiple speakers within a single recording. The practical upper limit for reliable diarization depends on the underlying Azure Cognitive Speech Services configuration. For most use cases, the system handles between two and ten speakers effectively. Beyond ten speakers, the accuracy of speaker separation may decrease, particularly if audio quality is variable or speakers have similar vocal characteristics.
Speaker identification limits
For automatic speaker identification (matching detected voices to named profiles), the system works most reliably with a small, defined group of known participants. The practical limit for reliable named identification is not a fixed number; it depends on the size of your enrolled profile database and the audio quality of the recording.
Providing the number of speakers
When submitting a recording, providing the expected number of speakers in the configuration settings helps the diarization engine separate speakers more accurately. If the number of speakers is unknown, the system attempts to determine it automatically; however, specifying the number whenever possible is recommended.
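As an illustration only, the sketch below shows how a client might assemble submission settings that either include an expected speaker count or leave it unset for automatic detection, and flag counts above the two-to-ten range described earlier. The `DiarizationSettings` class and the `diarization_enabled` and `expected_speakers` keys are hypothetical; they are not the actual Conversation to Text configuration schema.

```python
import warnings
from dataclasses import dataclass
from typing import Optional

# Range this article describes as handled effectively for diarization.
RELIABLE_MIN_SPEAKERS = 2
RELIABLE_MAX_SPEAKERS = 10


@dataclass
class DiarizationSettings:
    """Hypothetical submission settings, not the real Conversation to Text schema."""

    expected_speakers: Optional[int] = None  # None: let the system estimate the count

    def to_payload(self) -> dict:
        payload = {"diarization_enabled": True}
        if self.expected_speakers is None:
            # Speaker count is left to automatic detection.
            return payload
        if self.expected_speakers < 1:
            raise ValueError("expected_speakers must be a positive integer")
        if self.expected_speakers > RELIABLE_MAX_SPEAKERS:
            # Still submitted, but attribution accuracy may drop above this range.
            warnings.warn(
                f"{self.expected_speakers} speakers is outside the "
                f"{RELIABLE_MIN_SPEAKERS}-{RELIABLE_MAX_SPEAKERS} range; "
                "speaker attribution may be less accurate",
                stacklevel=2,
            )
        payload["expected_speakers"] = self.expected_speakers
        return payload


print(DiarizationSettings(expected_speakers=4).to_payload())
# {'diarization_enabled': True, 'expected_speakers': 4}
```

Passing the count explicitly when it is known, and omitting it otherwise, mirrors the recommendation above: the explicit value is a hint, and leaving it out falls back to automatic estimation.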
For very large groups
For recordings involving a very large number of participants, such as large conference calls or town hall meetings, the accuracy of individual speaker attribution may be reduced. In such cases, contact Dictalogic support to discuss the best approach for your specific scenario.