Creating a dataset from YouTube videos

This tutorial is written with macOS in mind. The same result should be achievable on Windows, but the exact steps will differ.

1. Downloading the audio tracks

If you don’t already have Homebrew (a package manager for installing open-source software on macOS) installed, follow these instructions to do so.
Install yt-dlp and ffmpeg by opening a Terminal window and running: brew install yt-dlp ffmpeg
Create a plain text file (e.g. in VS Code) that lists the URLs of your YouTube videos, one per line. E.g.:
https://www.youtube.com/watch?v=asdf1
https://www.youtube.com/watch?v=asdf2
- Alternatively, you can also run the following command in the Terminal window to create a list of all videos of a given channel: yt-dlp --flat-playlist --print "%(url)s %(title)s" "CHANNEL_URL_GOES_HERE" >~/Downloads/videos.txt
Save the plain text file in your Downloads directory using the filename videos.txt.
In the Terminal window, run the following command: cd ~/Downloads && yt-dlp -a videos.txt -S +size,+br --extract-audio --audio-format wav --postprocessor-args "ffmpeg:-ar 16000"
This downloads the videos one after the other and converts each of them to a 16 kHz WAV audio file in your Downloads directory (uncompressed WAV files can take up quite a lot of space!).
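Before starting a long download, it can help to sanity-check the URL list. The following is a minimal sketch, not part of yt-dlp: the function names check_url_list and count_urls are made up for this example, and the only assumption is that every useful line in videos.txt starts with http:// or https://.

```shell
# Print every non-empty line of a URL list that does NOT start with
# http:// or https://. Prints nothing if all lines look like URLs.
check_url_list() {
  grep -v -E '^[[:space:]]*$' "$1" | grep -v -E '^https?://' || true
}

# Print the number of lines that do look like URLs.
count_urls() {
  grep -c -E '^https?://' "$1" || true
}

# Usage (assumes the file created in the steps above):
#   check_url_list ~/Downloads/videos.txt
#   count_urls ~/Downloads/videos.txt
```

If check_url_list prints anything, those lines are worth fixing or removing before running yt-dlp.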

2. Transcribing the audio with Whisper

Whisper is an automatic speech recognition (ASR) model created by OpenAI. We can install whisper.cpp, an optimized open-source implementation, on our computers:

In a Terminal window, create a local copy of the source code repository by running: git clone https://github.com/ggerganov/whisper.cpp
Next, run the following command: cd whisper.cpp
Download the base model with: bash ./models/download-ggml-model.sh base.en
Build the software with: make
Convert all audio files to text with: ./main -otxt ~/Downloads/*.wav
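Finally, concatenate all generated text files with: cat ~/Downloads/*.wav.txt >~/Downloads/transcript.txt
(The *.wav.txt pattern matches only the transcripts, which are named after their WAV files, and keeps videos.txt out of the result.)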
Feel free to delete the downloaded audio files now.
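If you want to be able to trace each passage of the dataset back to its video, the concatenation step can be done with a small function instead of a plain cat. This is only a sketch: merge_transcripts is a made-up name, the "=== ... ===" header format is arbitrary, and it assumes that each transcript is named after its audio file with .txt appended (e.g. talk.wav.txt).

```shell
# Concatenate all transcripts (*.wav.txt) in a directory, preceding each
# one with a header line that names the audio file it came from.
merge_transcripts() {
  dir="$1"
  for f in "$dir"/*.wav.txt; do
    [ -e "$f" ] || continue              # skip if the pattern matched nothing
    printf '=== %s ===\n' "$(basename "$f" .wav.txt)"
    cat "$f"
    printf '\n'                          # blank line between transcripts
  done
}

# Usage (assumes the transcripts are in the Downloads directory):
#   merge_transcripts ~/Downloads >~/Downloads/transcript.txt
```

Because only files ending in .wav.txt are read, other text files in the same directory are left out of the merged transcript.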