Remove the Silent Parts of a Video Using FFmpeg and Python

I finally did it: I managed to figure out a little process to automatically remove the silent parts from a video.

Let me show y'all the process and the two main scripts I use to accomplish this.

Process

  1. Use FFmpeg's silencedetect filter to generate a log of the sections of the video's audio that are silent
  2. Pipe that output through a few programs to get it into the format I want
  3. Save the output into a text file
  4. Use that text file in a Python script that cuts out the parts of the video with audio and saves a new version with the silence removed

Now, with the process laid out, let's look at the scripts doing the heavy lifting.

Scripts

Here is the script for generating the silence timestamp data:

#!/usr/bin/env sh

IN=$1
THRESH=$2
DURATION=$3

ffmpeg -hide_banner -vn -i "$IN" -af "silencedetect=n=${THRESH}dB:d=${DURATION}" -f null - 2>&1 | grep "silence_end" | awk '{print $5 " " $8}' > silence.txt

I'm passing in three arguments to this script:

IN - the file path to the video I want to analyze

THRESH - the volume threshold the filter uses to determine what counts as silence

DURATION - the length of time in seconds the audio needs to stay below the threshold to count as a section of silence
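
For example, assuming the script above is saved as detect_silence.sh (the filename here is just for illustration), the following run flags any stretch quieter than -30 dB that lasts at least 1 second:

sh detect_silence.sh my-recording.mp4 -30 1

Note that the script appends dB to the threshold itself, so you pass a bare number.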

That leaves us with the actual ffmpeg command:

ffmpeg -hide_banner -vn -i "$IN" -af "silencedetect=n=${THRESH}dB:d=${DURATION}" -f null - 2>&1 | grep "silence_end" | awk '{print $5 " " $8}' > silence.txt

-hide_banner - hides the initial dump of info ffmpeg shows when you run it

-vn - ignore the input file's video stream. We only need the audio, and skipping the video stream speeds things up a lot because ffmpeg doesn't have to demux and decode it.

-af "silencedetect=n=${THRESH}dB:d=${DURATION}" - runs the silencedetect audio filter, which finds the silent sections in the audio and reports them in ffmpeg's log output; that log is what I pipe to the other programs

The output of silencedetect is a series of log lines that report where each stretch of silence starts and ends, along with its duration.

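The lines of interest look roughly like this (illustrative, not copied from a real run; the hex address is just ffmpeg's internal ID for the filter instance and will differ on your machine):

[silencedetect @ 0x5569d1a2b400] silence_start: 81.4199
[silencedetect @ 0x5569d1a2b400] silence_end: 86.7141 | silence_duration: 5.29422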

-f null - 2>&1 - do not write any output streams (we only care about the log), and redirect stderr, which is where ffmpeg prints the silencedetect messages, to stdout so they can be piped to the next program

grep "silence_end" - we first pipe the output to grep to keep only the lines that contain "silence_end"

awk '{print $5 " " $8}' > silence.txt - lastly, we pipe that output to awk and print the fifth and eighth fields of each line (the silence end timestamp and the silence duration) into a text file

The final output looks like this:

86.7141 5.29422
108.398 5.57798
135.61 1.0805
165.077 1.06485
251.877 1.11594
283.377 5.21286
350.709 1.12472
362.749 1.24295
419.726 4.42077
467.997 5.4622
476.31 1.02338
546.918 1.35986

You might ask: why did I not grab the silence start timestamp? Because the two numbers I grabbed are the ending timestamp and the duration. If I subtract the duration from the ending timestamp, I get the starting timestamp!
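
For example, taking the first line above: 86.7141 - 5.29422 ≈ 81.42, so that stretch of silence starts roughly 81.42 seconds into the video.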

So finally we get to the Python script that processes the timestamps. The script makes use of a Python library called moviepy, which you should check out!

#!/usr/bin/env python

import sys
from moviepy.editor import VideoFileClip, concatenate_videoclips

# Input file path
file_in = sys.argv[1]
# Output file path
file_out = sys.argv[2]
# Silence timestamps
silence_file = sys.argv[3]

# Ease in duration between cuts
try:
    ease = float(sys.argv[4])
except IndexError:
    ease = 0.0

minimum_duration = 1.0

def main():
    # number of clips generated
    count = 0
    # start of next clip
    last = 0

    in_handle = open(silence_file, "r", errors='replace')
    video = VideoFileClip(file_in)
    full_duration = video.duration
    clips = []
    while True:
        line = in_handle.readline()

        if not line:
            break

        end, duration = line.strip().split()

        to = float(end) - float(duration)

        start = float(last)
        clip_duration = float(to) - start
        # Clips shorter than one second don't seem to work (moviepy can crash on them)
        print("Clip Duration: {} seconds".format(clip_duration))

        if clip_duration < minimum_duration:
            continue

        if full_duration - to < minimum_duration:
            continue

        if start > ease:
            start -= ease

        print("Clip {} (Start: {}, End: {})".format(count, start, to))
        clip = video.subclip(start, to)
        clips.append(clip)
        last = end
        count += 1

    if full_duration - float(last) > minimum_duration:
        print("Clip {} (Start: {}, End: {})".format(count, last, 'EOF'))
        clips.append(video.subclip(float(last)-ease))

    processed_video = concatenate_videoclips(clips)
    processed_video.write_videofile(
        file_out,
        fps=60,
        preset='ultrafast',
        codec='libx264'
    )

    in_handle.close()
    video.close()

if __name__ == "__main__":
    main()

Here I pass in three required arguments and one optional one:

file_in - the input file to work on, should be the same as the one passed into the silence detection script

file_out - the file path to save the final version to

silence_file - the file path to the file generated by the silence detection

ease_in - a work-in-progress concept. I noticed the jumps between clips are kind of sudden and too abrupt, so I want to add about half a second of padding before where the next clip is supposed to start, to make the cuts less jarring.
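
For example, assuming the script above is saved as remove_silence.py (again, the filename is just for illustration), a run that uses the silence.txt generated by the detection script and eases each cut in by a quarter of a second would look like:

python remove_silence.py my-recording.mp4 my-recording-trimmed.mp4 silence.txt 0.25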

You will see there is a minimum_duration; that is because I found in testing that moviepy will crash when trying to write out a clip shorter than a second. There are a few sanity checks that use it to decide whether a clip should be extracted yet or not. That part is still very rough, though.

I use the last variable to track when the next clip to be written out should start, which is wherever the last section of silence ended.

The logic for writing out clips works like so:

  1. Get the starting timestamp of the next section of silence
  2. Write out a clip from the end of the last section of silence until the start of this one, and store it in a list
  3. Store the end of this section of silence in the last variable
  4. Repeat until all sections of silence are exhausted

Last, we write the remainder of the video out as the final clip. Then we use the concatenate_videoclips function from moviepy to combine the list of clips into one video clip, and call the write_videofile method of the VideoClip class to save the final output to the out path I passed into the script.

Tada! You've got a new version of the video with the silent parts removed!

I will try to show a before and after video of the process soon.


Did you find this information useful? If so, consider heading over to my donation page and dropping me some support.

Want to ask a question or just chat? Contact me here