Join me for another installment of my “Conversations with Robots” series.
Language is an extremely flexible tool. You can assemble words into messages that express an unlimited number of thoughts. Anything from “Where food?” to “Energy equals mass times the speed of light squared.”
But in everyday speech, we don’t typically communicate word by individual word. We actually assemble most of our day-to-day utterances from larger chunks of language. What are those chunks? And why do I think they can supercharge your language studies?
Let’s take a look…
◇◆◇◆◇◆◇◆◇◆◇◆◇◆◇◆◇◆◇
Somewhere Between Words and Phrases
First, let’s clarify what I mean. When a stranger holds the door open for you, do you use your creative language powers to assemble atomic words into an original utterance that expresses your gratitude? Maybe: “I appreciate that gesture of social cooperation, Stranger.”
Of course you don’t. You’re much more likely to grab a pre-assembled word sequence from your memorized phrase-o-dex and say, “Thank you very much.” You don’t spend time thinking about the meaning of each individual word; you just grab that complete language chunk and deploy it whole.
Consult any decent foreign phrase book and you’ll find dozens of these handy, prepackaged utterances to lubricate your entry into that language. But those phrases are complete thoughts, fully formed, and applicable only in specific situations. I believe we have another level of prepackaged chunks in our linguistic memories between the two extremes - more than atomic words and less than self-sufficient phrases. And we use these constantly to form our sentences.
For example, have you ever noticed in instructional settings that students each have their own habitual way of forming questions? One student might use “How do I…” sentences, while another might opt for “How would you…” constructions, and a third might favor “How are we supposed to…” formations. Each is the start of a request for clarification, but different people tend to fall back on their own preferred word patterns to construct sentences of a given type. It’s a core part of their personal idiolect.
But these chunks aren’t only about constructing questions. Think about sequences like “for the purpose of” (meaning to) or “on the other hand” (meaning alternatively). These prepackaged sequences fit into just about every grammatical category I can think of. But I’ve never heard them given a name, so what should we call them?
Well, just as chemists could use individual atoms to describe a substance but more often talk about the prepackaged collections involved (the molecules), I think of these commonly recurring phrase fragments as linguistic molecules.
But whatever we call them, I think studying them might be a powerful way to accelerate your language learning. In addition to making your speech sound more idiomatic, they also come prepackaged with the correct prepositions, conjugations and declensions baked right into them.
Consider “on the other hand.” To assemble that from scratch, a learner must ask: the hand or a hand? on, with, or in? Singular or plural? But as a molecule, it’s preassembled and immediately usable. And when you learn molecules, you’re also learning the rhythms and patterns of the target language in ways you can internalize and begin to reuse in other constructions.
The only barrier is that I haven’t been able to find any kind of molecular study guides out there. While there are plenty of phrasebooks and frequency dictionaries, I haven’t found anything that systematically captures these mid-sized, reusable chunks as tools for learning fluency.
So I’m going to build one.
Raw Materials
We’ll start by building a molecule dictionary for English before tackling Norwegian. Why? Because I already speak English, so I’ll be able to assess the quality of the molecules. If my code generates a bunch of word salad and meaningless phrases, I’ll be able to spot those and tweak the algorithm, but if I went straight into Norwegian, I’d have no idea whether it was producing anything useful or not.
So the objective is to build a database of these “extremely common 3-6 word chunks” of a language. To do this effectively, I’ll need to examine a huge corpus that draws from a lot of contributors.
My first instinct was to use Wikipedia, but previous experience has taught me that wiki articles lean heavily toward academic and formal language patterns. We’re trying to learn how to speak like regular people, not professors, so we need a source that skews toward common speech. And it would be a bonus if it covers many different languages, so I can use it to produce lists for other languages too.
For these reasons, I settled on OpenSubtitles.org. The 2016 archival snapshot contains subtitle files for 100 different languages, including all the languages I’ll ever want to cover. In English it has subtitles for over 900,000 shows and movies, yielding 250,000,000 distinct candidate sequences between 3 and 6 words long. For Norwegian it has 30,000 subtitle files with 40,000,000 candidates.
I should point out that the vast majority of these candidates do not qualify as molecules, but my early results suggest that there are easily several thousand that do, which is far more than enough to build a useful learning tool.
In addition to being a big enough resource, OpenSubtitles is a perfect fit linguistically too, because movie and TV show subtitles are exclusively dialogue. Not only are the sentences meant to be spoken aloud - they’re also meant to be understood by most speakers of the language - and they’re simple enough that they can be read by most people too. That sounds absolutely ideal for language learners.
Building the List
I won’t bore you with the code details, but basically, the process works like this:
- Read every subtitle file and break it into distinct sentences.
- Throw out any sentence that contains foreign words, lots of digits, emojis, too many symbol characters, etc.
- For each sentence, create a list of every 3-word sequence it contains, as well as every 4-, 5-, and 6-word sequence.
- Drop any sequence that begins and ends with a stop word (such as: and, but, the, some, etc.).
- For any sequences left, add them to the database. If the sequence is already there, just increase its counter.
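The steps above can be sketched in a few lines of Python. This is a minimal illustration under my own assumptions, not the actual pipeline code: the tiny `STOP_WORDS` set and the `looks_clean` threshold are stand-ins for the real (much larger) filters, and I’m using an in-memory `Counter` where the real version writes to a database.

```python
import re
from collections import Counter

# Toy stop-word list for illustration; the real list would be much longer.
STOP_WORDS = {"and", "but", "the", "a", "an", "some", "of", "to", "in"}

def looks_clean(sentence: str) -> bool:
    """Crude filter: reject sentences that are mostly digits or symbols."""
    ok = sum(c.isalpha() or c.isspace() for c in sentence)
    return ok / max(len(sentence), 1) > 0.9

def candidate_ngrams(sentence: str):
    """Yield every 3- to 6-word sequence the sentence contains."""
    words = re.findall(r"[a-z']+", sentence.lower())
    for n in range(3, 7):
        for i in range(len(words) - n + 1):
            seq = words[i:i + n]
            # Drop sequences that begin and end with a stop word.
            if seq[0] in STOP_WORDS and seq[-1] in STOP_WORDS:
                continue
            yield " ".join(seq)

counts = Counter()
for sentence in ["What are you doing here?", "I don't know what you mean."]:
    if looks_clean(sentence):
        counts.update(candidate_ngrams(sentence))

# counts now maps each candidate sequence to its frequency,
# e.g. counts["what are you"] == 1
```

Run over millions of subtitle lines instead of two toy sentences, the same loop produces the frequency tables below.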
At the end of this process, we have millions of candidates and a count of how many times each one occurred in the data. The vast majority (85%) showed up only 1-5 times across 380,000 shows, but the most common? Here are the most frequent English molecules:
| Freq | Molecule |
| ---: | --- |
| 1,251,337 | what are you |
| 1,104,904 | what do you |
| 733,399 | are you doing |
| 642,284 | you know what |
| 604,461 | what are you doing |
| 600,072 | do you think |
| 515,264 | do you want |
| 506,626 | do you know |
| 491,508 | i don’t want |
| 490,582 | oh my god |
| 410,894 | i don’t think |
| 382,649 | why don’t you |
| 341,713 | what is it |
| 332,391 | how do you |
| 324,601 | what the hell |
| 315,350 | i told you |
| 305,577 | i love you |
| 301,867 | why are you |
| 297,890 | where are you |
Those certainly look like high-value chunks of English to me. On first inspection, you might think, “Many of those look the same!” But consider the perspective of a language learner for a moment. How hard must it be to understand spoken English when so many people so often say things that all mean something different but sound almost identical? They have to slow down and process every individual word to figure out what’s being said. How empowering would it be for them to instead be able to recognize the entire chunk as a single unit?
That’s going to be the power of studying molecular vocabulary, I think: Gaining the ability to parse speech in bigger chunks, so you can keep up with the conversation sooner and with greater subtlety.
How to study them
I’m going to use audio flashcards, but my instinct is to not include L1 translations. I think there’s a danger of forming too strong a link between how the molecule is used in L2 vs how the corresponding one is used in L1 - assuming there even is a corresponding one. Often, these molecules convey specific tones and subtexts, and can even change tone depending on context, but their analogs in other languages don’t follow the same nuance patterns. So we don’t want to pollute the nuances of one language with those of the other.
For example, when asking a store clerk for help in English, you might begin with, “Excuse me, sir…”, but if you said, “Unnskyld, herr” in Norwegian, he would think you were mocking him. Norwegians never use honorifics. And if he sneezes, you do not say “velsigne deg” (bless you) - it sounds like you’re trying to exorcise a demon. Norwegians tend to ignore a sneeze completely - especially from a stranger.
Instead, I think a better way to learn the function of a molecule is to see a bunch of different examples of it being used in full phrases and sentences, still in L2, and perhaps with some notes about tone and nuance for each.
For these reasons, I think the right place to start studying molecular vocabulary is during the B1/B2 stage, when you’ve got the basics and are beginning to work on depth. Although starting earlier could work too, if you want to challenge yourself.
For comparison, here are the top 20 molecules of Norwegian, produced by the same process after ingesting 29,000 shows.
| Freq | Molecule |
| ---: | --- |
| 25,947 | vet du hva |
| 23,885 | vil du ha |
| 22,093 | hva mener du |
| 20,895 | jeg vil ha |
| 19,876 | vet ikke hva |
| 19,515 | jeg er lei |
| 16,890 | jeg har aldri |
| 16,881 | jeg elsker deg |
| 16,588 | hva vil du |
| 16,000 | jeg er her |
| 15,624 | vær så snill |
| 14,638 | jeg vet ikke hva |
| 13,767 | hva gjør du |
| 13,620 | jeg er glad |
| 13,528 | alt i orden |
| 13,136 | hva er dette |
| 12,931 | jeg tror jeg |
| 12,868 | du har rett |
| 12,521 | hva har du |
| 12,276 | jeg vil bare |
According to ChatGPT, those are also solid results. So I’m going to lock down the molecule extraction algorithm for now and move on with testing my theory.
Will learning these really accelerate the development of my norskørene?
Stay tuned to find out.
Follow-up 2025-08-14
Today brings a new road trip, which means another chance for focused study.
The challenge with molecules hasn’t been producing them - it’s been figuring out how to organize the study cards. Do I prompt with the molecule “question” and use its meaning as the “answer”? Should an example sentence be the prompt and the molecule itself be the answer? But what if the example contains more than one molecule? So maybe the prompt is three examples and my job is to identify the common molecule they share?
How the hell should I know? So to find out, I’ve prepared two decks, and on this trip, I’m going to try them both.
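For concreteness, here’s a minimal sketch of how cards for the second format (three examples in, shared molecule out) could be generated. Everything here is hypothetical: the `EXAMPLES` data is invented, and the real decks will pull example sentences from the subtitle corpus.

```python
import random

# Hypothetical examples; in practice these would be pulled from the
# subtitle corpus, filtered to sentences containing exactly one molecule.
EXAMPLES = {
    "what are you doing": [
        "What are you doing tonight?",
        "What are you doing with my keys?",
        "Hey, what are you doing up there?",
    ],
    "on the other hand": [
        "On the other hand, we could just walk.",
        "It is, on the other hand, very expensive.",
        "On the other hand, nobody asked him.",
    ],
}

def make_card(molecule: str, rng: random.Random) -> dict:
    """Deck two: prompt is three example sentences; answer is the molecule."""
    sentences = rng.sample(EXAMPLES[molecule], 3)
    return {"prompt": "\n".join(sentences), "answer": molecule}

card = make_card("on the other hand", random.Random(0))
```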