The following document presents the process of scraping a list of videos from YouTube, downloading each of them to an appropriately named file, and then transcribing them.
In my scenario, I want to do this for the NT YouTube channel, as I find it to be a wealth of first-hand knowledge. The only setback is that none of it is in text form, and the fact that the videos seem to have been shot in the late 2000s/early 2010s doesn't do it any justice.
In this procedure, we will be using JavaScript (to handle the video scraping, directly in the Dev Console of Google Chrome) and Python (for everything that follows) to achieve the goals stated above.
First, let's gather a list of pairs of each video's title and its URL. We start by going to the YouTube channel and scrolling until all of the videos are loaded. Thankfully, YouTube doesn't update the page schema depending on where you're looking in the page (as opposed to web applications such as Discord, which, annoyingly, do just that), so once all of the videos we care about are loaded, we are free to retrieve any information we want from the layout.
Below is the code that I've written to generate the list after analyzing the layout.
// Each video card on the channel page matches the class twice,
// hence halving the count and stepping by two below.
let videos = document.getElementsByClassName('ytd-rich-item-renderer');
let noVideos = videos.length / 2;
let getVideo = x => { return videos[x * 2]; }

// Build a Python list literal of (title, URL) pairs.
let pyListStr = '[';
for (let i = 0; i < noVideos; i++) {
    let vid = getVideo(i);
    let endpoints = vid.getElementsByClassName('yt-simple-endpoint');
    let link = endpoints[0].href;
    let title = endpoints[3].innerText;
    pyListStr += `(r"""${title}""", '${link}'),`;
}
pyListStr += ']';
console.log(pyListStr);
Running this code in the Dev Console will print out our Python list of (title, URL) pairs.
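For illustration, the printed string will look something along these lines (the titles and URLs here are made-up placeholders, not real entries from the channel), ready to be pasted into the downloader script later on:
[(r"""Some Speaker - Some Talk, part I""", 'https://www.youtube.com/watch?v=XXXXXXXXXXX'), (r"""Some Speaker - Some Talk, part II""", 'https://www.youtube.com/watch?v=YYYYYYYYYYY'),]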
Now that we have this, let's download the videos. For this purpose, we will be using yt-dlp, the actively maintained fork of youtube-dl. But how?
Upon running yt-dlp after installing it, we will quickly be told that we want to run yt-dlp --help to get the program's short-form documentation. The following options come to our attention:
Filesystem Options:
    [...]
    -P, --paths [TYPES:]PATH        The paths where the files should be downloaded. Specify the type
                                    of file and the path separated by a colon ":". All the same TYPES
                                    as --output are supported. Additionally, you can also provide
                                    "home" (default) and "temp" paths. All intermediary files are
                                    first downloaded to the temp path and then the final files are
                                    moved over to the home path after download is finished. This
                                    option is ignored if --output is an absolute path
    -o, --output [TYPES:]TEMPLATE   Output filename template; see "OUTPUT TEMPLATE" for details
    [...]

Post-Processing Options:
    -x, --extract-audio             Convert video files to audio-only files (requires ffmpeg and
                                    ffprobe)
    --audio-format FORMAT           Format to convert the audio to when -x is used. (currently
                                    supported: best (default), aac, alac, flac, m4a, mp3, opus,
                                    vorbis, wav). You can specify multiple rules using similar syntax
                                    as --remux-video
    --audio-quality QUALITY         Specify ffmpeg audio quality to use when converting the audio with
                                    -x. Insert a value between 0 (best) and 10 (worst) for VBR or a
                                    specific bitrate like 128K (default 5)
    [...]
Put shortly, -P is the path where our files will go, -o is the filename (more on that later), and the rest are pretty clear.
Regarding -o, however: our titles look something like Neal Christiansen - Inside File System Filter, part II, so we can't just use them as filenames; we'll have to "slugify" each title first.
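To make that concrete, here is roughly what the slugify helper (defined in the downloader script below) would produce for the title above; a quick sketch, with the exact result depending on the slugify implementation used:
title = "Neal Christiansen - Inside File System Filter, part II"
print(slugify(title))
# neal-christiansen-inside-file-system-filter-part-ii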
Let's see the code for the downloader then:
import unicodedata
import re
import subprocess

def slugify(value, allow_unicode=False):
    """
    Taken from https://github.com/django/django/blob/master/django/utils/text.py
    Convert to ASCII if 'allow_unicode' is False. Convert spaces or repeated
    dashes to single dashes. Remove characters that aren't alphanumerics,
    underscores, or hyphens. Convert to lowercase. Also strip leading and
    trailing whitespace, dashes, and underscores.
    """
    value = str(value)
    if allow_unicode:
        value = unicodedata.normalize('NFKC', value)
    else:
        value = unicodedata.normalize('NFKD', value).encode('ascii', 'ignore').decode('ascii')
    value = re.sub(r'[^\w\s-]', '', value.lower())
    return re.sub(r'[-\s]+', '-', value).strip('-_')

# This is in timeline order.
# Redacted so it doesn't clutter your screen.
videos = []

def video_title(x):
    return x[0]

def video_url(x):
    return x[1]

def start_process(entry):
    title, url = video_title(entry), video_url(entry)
    def download_command(title, url, directory):
        return f'yt-dlp -x --audio-format mp3 --audio-quality 0 -o {slugify(title)} -P {directory} {url}'
    # Kick off one yt-dlp process per video; the arguments contain no spaces
    # of their own, so a plain split is good enough here.
    return subprocess.Popen(download_command(title, url, 'output_dir').split(' '))

processes = [start_process(entry) for entry in videos]
for process in processes:
    process.wait()
We have a job for every video in the videos list; each one downloads its video as an MP3 to our directory using the slugified name, with a .mp3 postfix (because that's the format we're downloading as).
- Note: you would probably prefer to download directly as WAV to eliminate a future step. I didn't do so because I'm airheaded, but more on that later; see the variant just below.
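If you want to go the WAV route instead, I believe the only change needed is the format flag in download_command (wav is listed among the supported formats in the help output above); an untested sketch:
def download_command(title, url, directory):
    return f'yt-dlp -x --audio-format wav --audio-quality 0 -o {slugify(title)} -P {directory} {url}'
Just keep in mind the caveat further down about SpeechRecognition being picky about which WAV variants it accepts.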
Great! We have our videos. What about the transcribing?
We're going to do multiple things here: use pydub to transform our MP3s into AIFFs, then pass those to SpeechRecognition to transcribe locally using CMU Sphinx, because our files are way too big for free transcription services to accept.
- I've tried to create a recognizer instance from a WAV generated by pydub, but SpeechRecognition would not accept it, expecting either a PCM WAV, AIFF, or FLAC. I looked at ffmpeg -formats | grep pcm and tried some of those, but to no avail (I haven't spent much time trying to understand why either). That is why we use AIFF.
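For completeness, the Python packages involved here should all be installable from PyPI; as far as I know, SpeechRecognition's Sphinx backend needs pocketsphinx installed alongside it, and pydub needs ffmpeg available on the system:
pip install SpeechRecognition pocketsphinx pydub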
Below is the code for our goal listed above:
import speech_recognition
import os
import pydub

# The following were originally taken
# from https://pythonbasics.org/transcribe-audio/
def transform_to_aiff(directory, filename):
    file = os.path.join(directory, filename)
    if os.path.isfile(file):
        # Convert the downloaded MP3 to AIFF, then remove the original.
        sound = pydub.AudioSegment.from_mp3(file)
        aiff = os.path.splitext(filename)[0] + '.aiff'
        aiff = os.path.join(directory, aiff)
        sound.export(aiff, format='aiff')
        os.remove(file)
        return aiff

def transcribe_aiff(aiff):
    recognizer = speech_recognition.Recognizer()
    with speech_recognition.AudioFile(aiff) as source:
        output_path = os.path.splitext(aiff)[0] + '.txt'
        audio = recognizer.record(source)
        # This is the slow part: local recognition with CMU Sphinx.
        transcript = recognizer.recognize_sphinx(audio)
        with open(output_path, 'w') as file:
            file.write(transcript)
    os.remove(aiff)

# The directory is expected to contain only the MP3s downloaded earlier.
directory = 'output_dir'
files = os.listdir(directory)
for filename in files:
    aiff = transform_to_aiff(directory, filename)
    if aiff is not None:  # skip anything that wasn't converted
        transcribe_aiff(aiff)
This will go through every file in output_dir, transform it to an AIFF, save the AIFF, delete the file it originates from, create a speech recognizer, wait for CMU Sphinx to transcribe the audio, write the transcript to a file named after the original but with a .txt extension, and finally delete the AIFF.
In my test, it took 1h53m to transcribe a video of 1h05m on my MacBook Air M1 with 16GB of RAM. The transcription process seemingly ran on a single thread, on the CPU. I'm not sure whether this can be configured; I haven't yet "unleashed" this on more than a test sample, so I haven't had a reason to care, but I'll update this post if I look into it.
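If you do end up running this over a whole channel, one obvious workaround (a sketch I haven't benchmarked, reusing the transform_to_aiff and transcribe_aiff functions and the directory variable from above) is to transcribe several files in parallel, one worker process per file:
import multiprocessing
import os

def process_file(filename):
    aiff = transform_to_aiff(directory, filename)
    if aiff is not None:
        transcribe_aiff(aiff)

if __name__ == '__main__':
    # Worker count is arbitrary; each Sphinx run seems to be CPU-bound on one core.
    with multiprocessing.Pool(processes=4) as pool:
        pool.map(process_file, os.listdir(directory))
This would replace the sequential for loop at the bottom of the transcriber script.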
I hope this is of help to anybody who finds themselves in the same predicament!