2025-03-30
(Mod: 2025-06-12)
| 4 minutes
To split the cartoons into beginner and intermediate volumes, I need a way to rank the relative difficulty of the keywords. How am I going to solve that?
◇◆◇◆◇◆◇◆◇◆◇◆◇◆◇◆◇◆◇
Finding a DB of CEFR (Common European Framework of Reference) vocabulary ratings would be ideal, but those are not widely available, and when they are, they’re scattered in different places, curated by different people, and governed by different interpretations and assumptions. I’d rather solve this once, in a way that I can apply consistently to each new language and project, as needed.
Fortunately, the perceived difficulty of a word is strongly correlated with its frequency of use. Most people use the word “me” far more often than “intersectionality,” because it’s so much more relevant to so many more daily situations. And those high-use, high-relevance words are exactly the ones that beginners need to learn first.
So instead of finding a DB of word difficulty ratings, I can approximate it with a list of word frequencies. Unfortunately, that turns out to be a scattered minefield as well, especially for languages with relatively small user bases.
My next instinct was to create my own. If I can find a large corpus of text in my target language, why not just count the words myself? And for that, Wikipedia feels ideal. Downloadable snapshots are freely available in almost every human language, so that would be a consistent source that I could use for each new translation, rather than having to find a new corpus for every language.
So that’s what I did, and after a bit of code grinding, I’ve now got word frequency lists for both English and Norwegian.
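The counting itself is the easy part. Here’s a minimal sketch of the idea, assuming the dump has already been flattened to plain text with an extraction tool; the file names and the naive letters-only tokenizer are placeholders, not my exact code:

```python
import re
from collections import Counter

# Match runs of letters, including the Norwegian æ, ø, å.
TOKEN = re.compile(r"[a-zA-ZæøåÆØÅ]+")

counts = Counter()
with open("nowiki_plaintext.txt", encoding="utf-8") as corpus:
    for line in corpus:
        counts.update(token.lower() for token in TOKEN.findall(line))

# Write one "word count" pair per line, most frequent word first.
with open("wordfreq_no.txt", "w", encoding="utf-8") as out:
    for word, n in counts.most_common():
        out.write(f"{word} {n}\n")
```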
And they suck.
The first problem is that the words used in an encyclopedia skew too drastically towards technical and academic language instead of everyday speech. For example, my Norwegian list scores the word for “exploration” as an essential, daily-use, high-frequency word, and “boyfriend” as a rarity.
On a more pragmatic level, the way it scores the keywords for my cartoons makes it hard to divide them into beginner and advanced volumes of approximately the same size. In this histogram, common words score to the left and rare words to the right, and the vertical axis counts how many cartoons fall into each difficulty range.
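For the record, the difficulty score here is just a normalised log-rank, one simple way to squash frequency ranks onto a 0-to-1 scale. A sketch of the scoring and plotting, where the file names and keyword list are placeholders for my actual data:

```python
import math
import matplotlib.pyplot as plt

# Load ranks from the frequency list: rank 1 = most common word.
rank = {}
with open("wordfreq_no.txt", encoding="utf-8") as f:
    for i, line in enumerate(f, start=1):
        rank.setdefault(line.split()[0], i)

N = len(rank)

def difficulty(word: str) -> float:
    # Normalised log-rank: ~0.0 for the most common word, 1.0 for the
    # rarest. Words missing from the corpus count as maximally rare.
    return math.log(rank.get(word.lower(), N)) / math.log(N)

with open("keywords_no.txt", encoding="utf-8") as f:
    keywords = [line.strip() for line in f]

scores = sorted(difficulty(w) for w in keywords)
split = scores[len(scores) // 2]  # median: where a 50/50 split would land

plt.hist(scores, bins=20)
plt.axvline(split, color="red")  # mark the halfway split point
plt.xlabel("Estimated difficulty")
plt.ylabel("Frequency")
plt.show()
```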
As you can see, if I want to divide the cartoons in half using the Wikipedia frequencies, the beginner’s list would include words all the way up to an estimated difficulty of 0.8, which seems way too high. But maybe that’s just an inaccuracy in the estimate?
To explore that, I took a look at the specific words captured in each group. With Wikipedia guidance, the beginner book would include nouns like authors, grenade, and sword while leaving essential concepts like fridge, pizza, and hamburger for the advanced edition!
Clearly, despite its excellent quality as a resource for most other needs, Wikipedia is not a good measure of language usage patterns.
But fortunately, I found a different source that is.
For comparison, have a look at this distribution. Notice that dividing the words into two groups of 250 here would place the split right at 0.5. Doesn’t that feel a whole lot more reasonable? And it correctly pushes all those out-of-whack example words back where they belong, too.
What is this miracle corpus, you ask? Well, if you want a database spanning most languages that captures how those languages are commonly used, could you do any better than an online repository of film and TV subtitles? OpenSubtitles is exactly what I need. And to make it even better, somebody has already done the work of counting the word frequencies for over 60 different languages.
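The lists I found are plain word-and-count text files, so swapping corpora amounts to pointing the same loader and scorer at a different file. A sketch, with the file name standing in for whichever downloaded list is used:

```python
import math

# Same loader and scorer as before, now fed the subtitle-based list.
rank = {}
with open("no_subtitles_freq.txt", encoding="utf-8") as f:
    for i, line in enumerate(f, start=1):
        rank.setdefault(line.split()[0], i)

N = len(rank)

def difficulty(word: str) -> float:
    return math.log(rank.get(word.lower(), N)) / math.log(N)

# Spot-check the words Wikipedia got wrong ("sverd" is Norwegian for
# "sword"); with subtitle counts, pizza should come out as the easier word.
for word in ("pizza", "sverd"):
    print(word, round(difficulty(word), 2))
```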
I’m sure there will be debates in the future about which edition I placed a particular word into, but I think I’m close enough now that those discussions will be a matter of style rather than of actually being right or wrong.
PS: Imagine, thinking that swords are more important to everyday life than pizza. I wouldn’t want to live in such a world. Would you?