There are times when implementing a new feature seems daunting - when the perceived difficulty of the task at hand hangs so ominously over your future-view that you seriously question whether you’ve got the stamina to get through it. But today I heard the sweet sound of success, literally, and can finally breathe a little easier.
◇◆◇◆◇◆◇◆◇◆◇◆◇◆◇◆◇◆◇
The feature in question is the audio study feature. I want to be able to play podcasts or TV shows, one sentence at a time, and use this for both tongue practice (pronunciation) and ear practice (speech recognition). But for weeks there have been two dreads visible on my horizon: data acquisition and spaghetti tracing.
The first problem was simply content. There are lots of norsk podcasts and TV shows out there, but they don’t tend to come in convenient sentence-by-sentence packaging. Many shows come with closed captioning and/or subtitles, but these are horrible sources for language study. The problem is that subtitles have so many constraints that they are rarely able to offer an accurate representation of what is being said:

- There is very little room on a TV screen to display text
- It has to be readable by the slowest readers in the audience
Consequently, the captions often differ from the actual dialogue, either using simpler vocabulary or, in some cases, even changing the sentence completely. In my quick tally, only about 40% of the text was an accurate reflection of the spoken dialogue. Not a great foundation for language study.
I solved this “sourcing” problem during my preliminary exploration phase a few months ago, when I found this excellent podcast series. Host Marius Stangeland delivers interesting essays on a variety of topics, in clear, articulate norsk. His transcripts are pretty accurate, and the audio is excellent, so it makes for a great starting point. The real problem was how to transform them into usable study chunks.
The text part was easy enough. I just wrote a script to break a transcript down into distinct sentences. The next part was the tricky bit.
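For the curious, the splitter doesn’t need to be fancy. Here’s a minimal sketch of that kind of script (my own reconstruction, not the actual code): it splits after sentence-ending punctuation whenever the next word starts with an uppercase letter, which is good enough for tidy transcripts like these.

```python
import re

def split_sentences(transcript: str) -> list[str]:
    # Normalize whitespace, then split after ., ! or ? when the next
    # word starts with an uppercase letter (including Æ, Ø, Å).
    # Abbreviations like "bl.a." would need extra handling.
    text = " ".join(transcript.split())
    parts = re.split(r"(?<=[.!?])\s+(?=[A-ZÆØÅ])", text)
    return [p.strip() for p in parts if p.strip()]

split_sentences("Hei! Jeg heter Marius. Velkommen.")
# → ["Hei!", "Jeg heter Marius.", "Velkommen."]
```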
To split up the audio into matching chunks, I loaded the list of sentences into a subtitle editor and marked the time codes for each one. Then I wrote a script to take the subtitle timings and the audio file and split the audio into numbered clips - one for each sentence.
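The post doesn’t pin down which tools did the actual cutting, but the shape of that script is straightforward. Here’s a hedged sketch, assuming the subtitle editor exports an .srt file and that ffmpeg does the slicing; all the function names here are mine, for illustration.

```python
import re

def srt_time_to_seconds(ts: str) -> float:
    # "00:01:23,450" -> 83.45
    h, m, rest = ts.split(":")
    s, ms = rest.split(",")
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000

# Matches each cue's number line and its "start --> end" timing line.
CUE = re.compile(r"(\d+)\s*\n(\d\d:\d\d:\d\d,\d\d\d) --> (\d\d:\d\d:\d\d,\d\d\d)")

def parse_srt(text: str):
    # Yield (cue number, start seconds, end seconds) for each subtitle cue.
    for m in CUE.finditer(text):
        yield int(m.group(1)), srt_time_to_seconds(m.group(2)), srt_time_to_seconds(m.group(3))

def clip_commands(srt_text: str, audio_path: str = "episode.mp3") -> list[str]:
    # One ffmpeg invocation per sentence; re-encoding (no "-c copy")
    # keeps the cut points accurate at the cost of some speed.
    return [
        f"ffmpeg -i {audio_path} -ss {start:.3f} -to {end:.3f} clip_{idx:03d}.mp3"
        for idx, start, end in parse_srt(srt_text)
    ]
```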
Anyway, with the source audio broken into clips, the next challenge was how to get reasonable-sounding English translations. One of the reasons I first chose Marius as my audio source was that, in addition to his intermediate podcast, he also does another one for beginners. Norwegian For Beginners offers simpler essays, delivered three times: once in slow norsk, once in English, and then a final repeat in normal-speed norsk. I was intending to use all three versions for my study tool, but marking up three sets of time codes for different versions of the same content just felt too heavy. Plus, there are times when his English version is more a summary or paraphrase than a literal sentence-by-sentence translation, so there were chunks of the shows where I didn’t have sufficiently accurate English audio. So I went a different route.
My first instinct was to use online AI text translation and text-to-speech services to generate more consistent translations of both the text and audio, but those services were inconsistent and seemed to keep changing how much they would let me do before wanting me to pay for it. (I’m not opposed to paying for a useful service, but I didn’t want to go down that road until I was sure I had a reliable process.) Anyway, that’s when I decided to look into how much of this I could do locally, using code on my own computer. The last time I’d tried it, things were pretty rough, but that was back before the AI revolution.
But to my utter delight, things have gotten much better since then. I was able to put together a completely serviceable solution using the NLLB-200-3.3B model for the text translation, and the Coqui XTTS model for generating the audio. (The voice I’m using is the Damien Black voice, so my tools are called damientranslates and damienspeaks.)
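Neither model needs much glue code. The sketch below shows how these two are typically driven; the wrapper names echo my damientranslates/damienspeaks tools, but the bodies are illustrative rather than my actual code, and really running them means downloading several gigabytes of model weights (hence the lazy imports).

```python
MODEL = "facebook/nllb-200-3.3B"
SRC_LANG, TGT_LANG = "nob_Latn", "eng_Latn"  # FLORES-200 codes: Bokmål -> English

def damien_translates(sentence: str) -> str:
    # NLLB via Hugging Face transformers; forcing the target-language
    # token as the first decoded token tells the model what to produce.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
    tokenizer = AutoTokenizer.from_pretrained(MODEL, src_lang=SRC_LANG)
    model = AutoModelForSeq2SeqLM.from_pretrained(MODEL)
    tokens = model.generate(
        **tokenizer(sentence, return_tensors="pt"),
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(TGT_LANG),
        max_length=200,
    )
    return tokenizer.batch_decode(tokens, skip_special_tokens=True)[0]

def damien_speaks(text: str, out_path: str = "clip.wav") -> None:
    # Coqui XTTS v2, using its built-in "Damien Black" voice.
    from TTS.api import TTS
    tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")
    tts.tts_to_file(text=text, speaker="Damien Black", language="en", file_path=out_path)
```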
As an example, here’s an original sentence from Marius, and the corresponding translations from Damien.
I Norge har vi normaltid på vinteren, men når våren kommer skrur vi klokka fram en time.
In Norway, we have standard time in winter, but when spring comes, we advance the clock by one hour.
I’m absolutely shocked at the natural cadence of Damien’s speech. It even has breathing sounds! The audio quality is a touch grainy, but that’s one of the reasons I selected this particular voice - his hoarse tone suits the rough encoding much more comfortably than the other, more lyrical voices I tried.
So with those hurdles all neatly cleared, sourcing the content is now a solved problem.
This next major headache is one entirely of my own making. I’ve mentioned elsewhere what a joy it was to find FrankenTongues so well organized and documented when I returned to it, but the same was not true for my content acquisition code. See, Frankie may have been created after I decided to embrace my ADHD-style project management practices, but my text and audio translation experiments date back to before that watershed moment, so I was not yet in the habit of being nice to future Jeff.
The folder was full of podcast files in 4 or 5 different formats, each with half a dozen attempted clip files; CSV, JSON, and YAML data files; and over 20 different Python scripts with no documentation and a completely chaotic naming scheme. The audio files had no consistent naming or format, and some of them were completely empty. In short, the experiments were in a shambles. It was like coming upon a multi-victim hand-grenade murder scene and trying to figure out what happened, guided only by your sense of taste.
Fortunately, I knew for certain that I had succeeded in finding at least one path through the chaos, so in the end, I was able to follow that chain backward. I started with the most successful-looking translation and clip collection and then examined the scripts to see which ones produced output in those formats. Then I found the inputs to those scripts and followed those backward, etc. Repeat ad infinitum.
Anyway, after most of a day spent wading through my own tangled muck, I managed to recreate the process of going from podcast to sentence clips and was able to create (and document!) my damien scripts from the rubble, so I’ll be able to do it again now without all the drama.
There was some additional head-scratching as I adapted these older scripts to Frankie’s current data structures, which have evolved in the month or two since then, but finally, at about 10:00 this morning, I ran the study command on Frankie’s CLI interface and, for the first time, was able to hear a lesson in FrankenTongues, complete with content of my own choosing and with full text and audio translation support.
This Pokémon isn’t in final form yet, but now that I’ve got a working, well-documented platform to build on, further evolutions should come with much less friction.
I’ve come up with a fun way to practice written conversations in norsk—by taunting my AI practice partner.
If that sounds like fun, just step behind this curtain and I’ll show you the game.
Every language course I’ve ever taken began with how to have a simple conversation, but I don’t think I’ve ever been taught what to do when those conversations break down. And they do break down. All the time. Especially for beginners.
This post recaps a conversation I had with ChatGPT about what I think is a crucial - yet often missing - first lesson in language learning: How to keep conversations moving when the bottom falls out.
I call it The Rip-Cord Protocol.
As I focus more specifically on ear-training, I’m noticing stages of progress in my ability to unpack the noise into recognizable chunks, but how many stages should I expect on this journey? And what do they look like?