I built an AI chatbot application that communicates with users by voice. The use case involved converting speech to text using Microsoft Azure Speech Service.
The user's voice is converted to text by the Azure Speech Service, that text is sent to OpenAI's API, and the chatbot's response is sent back to the Speech Service for text-to-speech conversion. In other words, the application handled both speech-to-text and text-to-speech: what I spoke was transcribed, the transcript was given to the chatbot as input, and the chatbot's reply was converted back into speech.
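A minimal sketch of that loop in Python, assuming the azure-cognitiveservices-speech and openai packages, a single-turn exchange, and placeholder credentials (SPEECH_KEY, SPEECH_REGION, and OPENAI_API_KEY read from the environment); the model name is also an assumption, not something specified in the project:

```python
import os
import azure.cognitiveservices.speech as speechsdk
from openai import OpenAI

# Azure Speech config: key and region are placeholders read from the environment.
speech_config = speechsdk.SpeechConfig(
    subscription=os.environ["SPEECH_KEY"], region=os.environ["SPEECH_REGION"]
)

# 1. Speech-to-text: capture one utterance from the default microphone.
recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config)
stt_result = recognizer.recognize_once()
user_text = stt_result.text

# 2. Send the transcribed text to the chatbot via OpenAI's chat completions API.
client = OpenAI()  # reads OPENAI_API_KEY from the environment
chat = client.chat.completions.create(
    model="gpt-4o-mini",  # assumed model name; any chat model would do
    messages=[{"role": "user", "content": user_text}],
)
reply_text = chat.choices[0].message.content

# 3. Text-to-speech: speak the chatbot's reply through the default speaker.
synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)
synthesizer.speak_text_async(reply_text).get()
```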
The chatbot answered in English, so its wording was not the concern; whatever it responded with could be reliably converted to speech. The real risk was that the speech service might fail to accurately recognize what the user spoke.
For example, depending on my tone or the way I spoke a sentence, the speech service might not comprehend my words correctly and would send incorrect text to the chatbot. If the input text is wrong, the chatbot's output is likely to be wrong as well, and the end result would not be what was expected.
So the main concern was that this service must accurately convert the user's speech, or voice, into text.
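One way to guard against bad transcriptions, sketched under the same assumptions (azure-cognitiveservices-speech, a recognizer already configured), is to check the recognition result before forwarding anything to the chatbot and prompt the user to repeat themselves when nothing usable was recognized:

```python
import azure.cognitiveservices.speech as speechsdk

def transcribe_or_retry(recognizer: speechsdk.SpeechRecognizer) -> str | None:
    """Return recognized text, or None if the user should be asked to repeat."""
    result = recognizer.recognize_once()

    if result.reason == speechsdk.ResultReason.RecognizedSpeech:
        # Speech was recognized; only this text goes on to the chatbot.
        return result.text
    if result.reason == speechsdk.ResultReason.NoMatch:
        # The service heard audio but could not match it to any words.
        print("Sorry, I didn't catch that. Could you repeat?")
        return None
    if result.reason == speechsdk.ResultReason.Canceled:
        # Recognition was canceled (e.g. bad key or network error); log why.
        print(f"Recognition canceled: {result.cancellation_details.reason}")
        return None
    return None
```

The SDK can also be configured to return detailed results with per-utterance confidence scores, which could be used to decide whether a transcript is trustworthy enough to send to the chatbot at all.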