How to Extract Speech from Video Using Python

In this article, we extract the speech from a video and save it as a text file. We use the MoviePy library to pull the audio track out of the video, and the SpeechRecognition library, which can send audio to the Google Speech Recognition API, to transcribe it. Speech recognition has many applications, including generating subtitles for platforms such as Amazon Prime, Netflix, and YouTube. The steps below walk through the task.

Step 1: Understanding the Google Speech Recognition API

Before we proceed with the implementation, it helps to know what the Google Speech Recognition API does. It accepts recorded audio and returns the text that Google's speech recognition models detect in it. In this article we call it through the SpeechRecognition library's `recognize_google` method, so the recognition itself runs on Google's servers and requires an internet connection.

Step 2: Video to Audio Conversion

The first step is to convert the video file into an audio file. For this we use MoviePy, a Python library for video editing and processing. Let's begin by importing it.

import moviepy.editor as mp

Next, we need to specify the video file we want to convert and the specific portion of the video we are interested in. For instance, let's assume we want to clip the video from the 10th second to the 100th second.

# It will clip the video

# subclip(starttime, endtime) to clip a portion of the video

# you can remove the subclip to convert the complete video

clip = mp.VideoFileClip(r"sample1.mp4").subclip(10, 100)

If your video file is large and you want to process only a specific portion, using the `subclip()` function allows you to specify the start and end times to clip the desired segment. However, if you wish to convert the entire video, you can remove the `subclip()` function.
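Start and end times are often read off a video player as `mm:ss` rather than as a number of seconds. As a minimal sketch, the hypothetical helper `to_seconds` below (not part of MoviePy) converts such a timestamp into the plain seconds used in the `subclip(10, 100)` call above.

```python
def to_seconds(timestamp):
    """Convert 'ss', 'mm:ss', or 'hh:mm:ss' into a number of seconds."""
    seconds = 0.0
    # Each colon-separated field is worth 60 times the field after it.
    for part in timestamp.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


# Hypothetical usage with the clip from this article:
# clip = mp.VideoFileClip(r"sample1.mp4").subclip(to_seconds("0:10"), to_seconds("1:40"))
```

With this helper, `to_seconds("0:10")` and `to_seconds("1:40")` correspond to the 10th and 100th seconds of the video.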

Finally, we will save the extracted audio as a WAV file.

clip.audio.write_audiofile(r"Converted_audio.wav")

print("Conversion to audio finished.")
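Before moving on, it can be useful to confirm that the WAV file looks the way we expect. The sketch below uses Python's standard `wave` module and assumes the file is uncompressed PCM audio; the function name `describe_wav` is a hypothetical helper introduced here for illustration.

```python
import wave


def describe_wav(path):
    """Return (channels, sample_rate_hz, duration_seconds) for a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        rate = wav.getframerate()
        # Duration is the number of audio frames divided by frames per second.
        duration = wav.getnframes() / float(rate)
    return channels, rate, duration


# Usage, assuming Converted_audio.wav was written by the step above:
# channels, rate, duration = describe_wav("Converted_audio.wav")
# print(f"{channels} channel(s), {rate} Hz, {duration:.1f} s")
```

For the clip from the 10th to the 100th second, the reported duration should be close to 90 seconds.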

Step 3: Audio to Text Conversion

Now that we have the audio file, we can proceed to convert it into text using the SpeechRecognition library. This library provides a convenient interface to perform speech recognition tasks. Let's import the necessary library.

import speech_recognition as sr

Next, we need to read the audio file.

audio = sr.AudioFile("Converted_audio.wav")

print("Audio file read.")

We will now use the `recognize_google` method of the SpeechRecognition library's `Recognizer` class to perform the speech recognition.

r = sr.Recognizer()

with audio as source:

    audio_file = r.record(source)

result = r.recognize_google(audio_file)
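The call above sends the whole recording in one request and raises an exception if nothing can be recognized. The sketch below shows one possible way to make this more robust, under the assumption that long recordings are better sent in shorter pieces. The helper names `chunk_offsets` and `transcribe_in_chunks` and the 30-second default are illustrative choices, not part of the SpeechRecognition library; `record(source, offset=..., duration=...)`, `sr.UnknownValueError`, and `sr.RequestError` are the library's own.

```python
def chunk_offsets(total_seconds, chunk_seconds=30):
    """Return (offset, duration) pairs that cover total_seconds in order."""
    if chunk_seconds <= 0:
        raise ValueError("chunk_seconds must be positive")
    pairs = []
    offset = 0
    while offset < total_seconds:
        # The last chunk may be shorter than chunk_seconds.
        pairs.append((offset, min(chunk_seconds, total_seconds - offset)))
        offset += chunk_seconds
    return pairs


def transcribe_in_chunks(path, total_seconds, chunk_seconds=30):
    """Hypothetical helper: transcribe a WAV file piece by piece."""
    # Imported lazily so chunk_offsets can be used without the library.
    import speech_recognition as sr

    recognizer = sr.Recognizer()
    pieces = []
    for offset, duration in chunk_offsets(total_seconds, chunk_seconds):
        # Re-open the file for each chunk so offset counts from the start.
        with sr.AudioFile(path) as source:
            audio = recognizer.record(source, offset=offset, duration=duration)
        try:
            pieces.append(recognizer.recognize_google(audio))
        except sr.UnknownValueError:
            # No intelligible speech was found in this chunk; skip it.
            continue
        except sr.RequestError as error:
            # The service was unreachable or rejected the request.
            raise RuntimeError(f"Speech service request failed: {error}") from error
    return " ".join(pieces)


# Usage, assuming a 90-second Converted_audio.wav:
# result = transcribe_in_chunks("Converted_audio.wav", total_seconds=90)
```

Splitting at fixed intervals can cut a word in half at a chunk boundary, so the joined text may contain small errors around those points.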

Finally, we will store the recognized text in a file named "recognized.txt".

with open('recognized.txt', mode='w') as file:

    file.write(result)

print("Speech recognition completed.")

Final Step: Enjoy Your Day

Congratulations! You have successfully extracted the speech from the video and converted it into text. Feel free to further explore and enhance the functionality of this project to suit your requirements.

By engaging in hands-on programming projects like this, you can significantly improve your coding skills and gain valuable experience. You can find the complete code for this project on GitHub here

If you have any questions or would like to share your thoughts, please don't hesitate to reach out in the comments section below. Your feedback and contributions are greatly appreciated!