Speech recognition technology has become increasingly prevalent in various applications, including virtual assistants, subtitling, transcription, customer service, accessibility, and more. Azure Cognitive Services provides a powerful and accurate Speech-to-Text (STT) API that can recognize speech from audio files of various formats, languages, and quality. With this API and its SDKs, users can easily transcribe speech from WAV audio files.
However, the STT SDK natively supports only WAV audio (16 kHz or 8 kHz, 16-bit, mono PCM). If your audio is not in WAV or PCM format, you must convert it first using an additional tool such as GStreamer or ffmpeg.
In this blog post, I will share with you how to easily transcribe audio of any format and with different sampling rates using Python.
To follow the steps in this blog post, you will need the following prerequisites:
An Azure subscription with a Speech resource (its key and region are used below).
Python 3 with pip.
ffmpeg installed and available on your PATH.
To support all audio formats, we can use the popular pydub Python library. Note that pydub requires ffmpeg to be installed in order to open and save non-WAV files such as MP3.
You can install the pydub library using pip:
pip install pydub
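Because pydub shells out to ffmpeg for non-WAV formats, it can help to check for the binary up front rather than failing later mid-request. Here is a minimal sketch; the `ffmpeg_available` helper is my own addition, not part of pydub's API:

```python
import shutil

def ffmpeg_available() -> bool:
    """Return True if an ffmpeg executable is found on PATH."""
    return shutil.which("ffmpeg") is not None

# Warn early instead of letting pydub fail later on a non-WAV upload
if not ffmpeg_available():
    print("Warning: ffmpeg not found on PATH; pydub needs it for non-WAV formats")
```

You could run this check once at application startup so misconfigured deployments are caught immediately.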
What should we do next?
Once you receive the audio file from the user request, you should:
Get the extension of the audio file.
Rename the file using a unique name.
Save the audio file. We can use the aiofiles Python library to support asynchronous file operations. aiofiles can be installed using: pip install aiofiles
If the extension is not wav, convert it to wav and set the sampling rate to 16kHz.
from pathlib import Path
import subprocess

import aiofiles
import aiofiles.os
from pydub import AudioSegment

# audio_file is the input (a FastAPI UploadFile object, for example)
extension = audio_file.filename.split(".")[-1]
filename_renamed = "random_unique_name"
# Where you want to store files temporarily
tmp_file_store_path = "uploads"
# Path of the file as received
user_file_path = f"{tmp_file_store_path}/{filename_renamed}.{extension}"
# Path of the file after conversion to WAV format
file_path = f"{tmp_file_store_path}/{filename_renamed}.wav"
# File to pass to the Speech SDK to get a response from Azure Cognitive Services
speechsdk_input_file = file_path

# Save the file, so pydub can convert it later
async with aiofiles.open(user_file_path, "wb") as out_file:
    content = await audio_file.read()
    await out_file.write(content)

# If the extension is not wav, convert the file to 16 kHz mono WAV
if extension != "wav":
    audio = AudioSegment.from_file(user_file_path, format=extension)
    # The Speech SDK expects 16 kHz, mono PCM
    audio = audio.set_frame_rate(16000).set_channels(1)
    audio.export(file_path, format="wav")
# If it is already a WAV file, follow the next part.
# Here you can use the Speech SDK and pass file_path.
# Do not forget to delete all temporary files created as part of the user request.
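The `random_unique_name` placeholder above can be generated however you like. One common approach, sketched below with a UUID (the helper name is my own, not required by any of the libraries used here), keeps the user's original extension:

```python
import uuid
from pathlib import Path

def make_unique_name(original_filename: str) -> str:
    """Build a collision-resistant file name from an uploaded filename."""
    extension = Path(original_filename).suffix.lstrip(".")
    unique = uuid.uuid4().hex  # 32 hex characters
    return f"{unique}.{extension}" if extension else unique

# Example: a fresh name per request, so concurrent uploads never collide
name = make_unique_name("interview.mp3")
```

A UUID per request also avoids leaking user-supplied filenames into your temporary storage paths.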
If the user file is already a WAV file, we can use ffmpeg as a subprocess to resample it into a new 16 kHz WAV file. Note that the output path needs a valid Python name, since identifiers cannot start with a digit:
file_path_16khz = f"{tmp_file_store_path}/{filename_renamed}_16kHz.wav"
if extension == "wav":
    speechsdk_input_file = file_path_16khz
    subprocess.call(["ffmpeg", "-y", "-i", file_path,
                     "-ar", "16000", "-ac", "1", file_path_16khz])
That's great, but what about containerized applications? Does the base image already include ffmpeg?
You will need to install it after specifying the base image inside your Dockerfile.
The following example shows how to do it:
FROM <base image>
# ...
RUN export DEBIAN_FRONTEND=noninteractive \
    && apt-get -qq update \
    && apt-get -qq install --no-install-recommends \
        ffmpeg \
    && rm -rf /var/lib/apt/lists/*
# ...
WORKDIR /work-dir
# ...
Now, we can use the Speech SDK to recognize the converted audio file.
To install the Speech SDK, run the following command:
pip install azure-cognitiveservices-speech
Here is an example of how you can use the SDK to pass the speechsdk_input_file and get a response from Azure Cognitive Services:
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(subscription=settings.AZURE_SPEECH_KEY,
                                       region=settings.AZURE_SPEECH_REGION)
# Create the audio config
audio_config = speechsdk.audio.AudioConfig(filename=speechsdk_input_file)
# Create the speech recognizer
speech_recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config,
                                               audio_config=audio_config)
# Perform recognition. `recognize_once_async` does not block until recognition is
# complete, so other tasks can be performed while recognition is running.
# However, recognition stops when the first utterance has been recognized.
# For long-running recognition, use continuous recognition instead.
result_future = speech_recognizer.recognize_once_async()
# Wait for the recognition to complete
result = result_future.get()

# Delete the temporary files regardless of the outcome
await delete_file_from_temp_async(Path(file_path))
await delete_file_from_temp_async(Path(file_path_16khz))

# Check the result
if result.reason == speechsdk.ResultReason.RecognizedSpeech:
    logger.info("Recognized: {}".format(result.text))
    # return result.text
elif result.reason == speechsdk.ResultReason.NoMatch:
    logger.warning("No speech could be recognized: {}".format(result.no_match_details))
elif result.reason == speechsdk.ResultReason.Canceled:
    cancellation_details = result.cancellation_details
    logger.error("Speech Recognition canceled: {}".format(cancellation_details.reason))
    if cancellation_details.reason == speechsdk.CancellationReason.Error:
        logger.info("Error details: {}".format(cancellation_details.error_details))
    # return None
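As the comments note, `recognize_once_async` stops after the first utterance. For longer recordings the Speech SDK offers continuous recognition; below is a rough sketch of how that could look. The function name and the polling loop are my own, and the `speechsdk` import is done inside the function so the package is only required at call time:

```python
import time

def transcribe_continuous(file_path: str, key: str, region: str) -> str:
    """Transcribe a whole audio file using the Speech SDK's continuous mode."""
    import azure.cognitiveservices.speech as speechsdk  # lazy import

    speech_config = speechsdk.SpeechConfig(subscription=key, region=region)
    audio_config = speechsdk.audio.AudioConfig(filename=file_path)
    recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config,
                                            audio_config=audio_config)
    segments = []
    done = False

    def on_recognized(evt):
        # Fired once per finalized utterance
        segments.append(evt.result.text)

    def on_stopped(evt):
        nonlocal done
        done = True

    recognizer.recognized.connect(on_recognized)
    recognizer.session_stopped.connect(on_stopped)  # end of audio
    recognizer.canceled.connect(on_stopped)         # errors also end the session

    recognizer.start_continuous_recognition()
    while not done:
        time.sleep(0.5)
    recognizer.stop_continuous_recognition()
    return " ".join(segments)
```

In a real service you would likely replace the polling loop with a threading.Event and add error reporting from the canceled event, but the overall event-driven shape stays the same.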
The delete_file_from_temp_async function can be written using aiofiles as follows:
async def delete_file_from_temp_async(file_path: Path):
    try:
        if file_path.exists():
            await aiofiles.os.remove(file_path)
    except Exception as e:
        logger.error(f"Error while deleting file from temp: {e}")
If you found this blog post helpful, please share it on your favorite social media platform. Also, don't forget to follow me on GitHub and Twitter.
To send me a message, please use the contact form or direct message me on Twitter.