2025-05-16
(Mod: 2025-07-23)
| 3 minutes
I’ve been experimenting for months with different ways to convert podcast content into a form I can use in Frankie. I’ve found several solutions, but along the way, I’ve left the resulting file tree in absolute chaos. I’ve got files in a dozen directories and an equal number of formats. There are fragments of scripts and clips and tools scattered throughout the hierarchy like bodies left to rot on a medieval battlefield, with no consistency, little documentation, and worse, still no definitive process.
Well, it’s time to loot those bodies, gather the useful bits into a proper plan, and bury whatever is left.
◇◆◇◆◇◆◇◆◇◆◇◆◇◆◇◆◇◆◇
Today, my process for ingesting a Norwegian podcast looks something like this:
1. Download the audio
2. Download the transcript
3. Split the transcript into sentences
4. Determine timecodes in the audio to match each sentence
5. Split the audio into sentence clips
6. Generate slow machine audio for the Norwegian sentences
7. Translate the transcript sentences into English
8. Generate machine audio of the English sentences
9. Pack everything into one of the file archive formats I’ve taught Frankie to understand
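To make the sentence-splitting and clip-cutting steps concrete, here is a minimal sketch, not the actual scripts. The function names, file names, and the regex are all assumptions for illustration. The splitter is deliberately naive and will stumble on abbreviations, and the second function only builds an ffmpeg command line rather than running it.

```python
import re


def split_sentences(text):
    """Naively split text after '.', '!' or '?' followed by whitespace.

    Assumption: good enough for a clean transcript; abbreviations such as
    "f.eks." would need special handling.
    """
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def clip_command(source, start, end, out_path):
    """Build (but do not run) an ffmpeg command that cuts one sentence clip.

    start and end are timecodes in seconds. Placing -ss/-to after -i asks
    ffmpeg to decode and cut accurately instead of jumping to a keyframe.
    """
    return [
        "ffmpeg", "-y",
        "-i", source,
        "-ss", f"{start:.3f}",
        "-to", f"{end:.3f}",
        out_path,
    ]


sentences = split_sentences("Hei! Hvordan går det? Bra.")
command = clip_command("episode.mp3", 1.25, 3.9, "clip_001.mp3")
```

The command list could be handed to `subprocess.run` once the timecodes for each sentence are known, which is exactly the part that is still manual.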
My recent road trip experiments have left me wondering if I should rethink Frankie’s default document format, but regardless of how I end up packaging the results, I need to get a handle on Steps 1 through 8 and then clean up the file battleground.
I have scripts to help streamline every step except Step 4, which is still painfully slow and has to be done manually, using a subtitle editor to listen to the audio carefully and mark the start and end of every sentence. This takes roughly three times as long as the audio itself runs, making it the biggest hurdle to expanding my library of practice documents.
Finding a way to do this step faster has been the biggest reason for the carnage of files scattered around here, so solving it is the key to making any real progress in cleaning them up.
Sentence Timing
Since I have the transcripts already split into sentences, I should be able to do what’s called “forced alignment”, an AI process that matches a transcript with the corresponding sections of audio.
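One way to picture what I want out of alignment: if an aligner reports a start and end time for every word (an assumption, since not every tool does), then sentence timecodes fall out by walking the word list in order. The sketch below is not a forced aligner. It only shows that last bookkeeping step, with a hypothetical `sentence_times` function and made-up timings, and it assumes the transcript words match the aligned words one-to-one.

```python
def sentence_times(sentences, word_times):
    """Derive (sentence, start, end) from per-word timings.

    word_times is a list of (word, start, end) tuples in spoken order.
    Assumption: each sentence's whitespace-separated words map one-to-one,
    in order, onto the entries of word_times.
    """
    results = []
    index = 0
    for sentence in sentences:
        count = len(sentence.split())
        if count == 0:
            continue  # skip empty sentences
        chunk = word_times[index:index + count]
        if len(chunk) < count:
            raise ValueError("ran out of word timings before the transcript ended")
        # The sentence starts with its first word and ends with its last.
        results.append((sentence, chunk[0][1], chunk[-1][2]))
        index += count
    return results


# Made-up timings, purely for illustration.
words = [("Hei", 0.0, 0.4), ("du", 0.5, 0.9), ("Takk", 1.5, 2.0)]
timed = sentence_times(["Hei du.", "Takk."], words)
```

In practice the one-to-one assumption is the fragile part, since numbers, hesitations, and transcript edits can all throw the word count off.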
I’ve done it before, with English audio, but I’m not sure if the AI models for Norwegian are available in a form I can work with. I also remember being disappointed by a lack of accuracy with the English version. The timecodes were in roughly the right spot, but often clipped off the beginning or end of the passages. So this is going to be more of an exploration than a straightforward coding exercise.
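One cheap remedy I might try for the clipping problem is padding each timecode slightly, without letting neighbouring sentences overlap. This is a sketch under assumptions, not something I have tested on real output: the `pad_segments` name, the 0.15 second default, and the midpoint rule are all my own placeholders.

```python
def pad_segments(segments, pad=0.15, total=None):
    """Widen each (start, end) pair by `pad` seconds on both sides.

    segments must be sorted and non-overlapping. Padding never crosses the
    midpoint of the gap to a neighbouring segment, never goes below zero,
    and never exceeds `total` (the audio length) when it is given.
    """
    padded = []
    for i, (start, end) in enumerate(segments):
        # Earliest allowed start: 0 for the first segment, else the midpoint
        # of the gap between the previous segment and this one.
        low = 0.0 if i == 0 else (segments[i - 1][1] + start) / 2
        # Latest allowed end: the midpoint of the gap to the next segment,
        # or the end of the audio for the last segment.
        if i + 1 < len(segments):
            high = (end + segments[i + 1][0]) / 2
        else:
            high = total if total is not None else end + pad
        padded.append((max(start - pad, low), min(end + pad, high)))
    return padded


# Two sentences only 0.1 s apart: the padding meets in the middle of the gap.
padded = pad_segments([(1.0, 2.0), (2.1, 3.0)], pad=0.15, total=3.05)
```

Padding only helps if the aligner's errors are small and consistent. If it is cutting off whole syllables, the real fix has to come from a better alignment.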
I don’t have a clear plan yet, but I’ll set off in roughly that direction (points to the horizon) and see what I find. Stay tuned.