Molecular Vocabulary

Language is an extremely flexible tool. You can assemble words into messages that express an unlimited number of thoughts. Anything from “Where food?” to “Energy equals mass times the speed of light squared.”

But in everyday speech, we don’t typically communicate by assembling individual words. We actually build most of our day-to-day utterances from larger chunks of language. What are those chunks? And why do I think they can be used to supercharge your language studies?

Let’s take a look…

◇◆◇◆◇◆◇◆◇◆◇◆◇◆◇◆◇◆◇

Somewhere Between Words and Phrases

First, let’s clarify what I mean. When a stranger holds the door open for you, do you use your creative language powers to assemble atomic words into an original utterance that expresses your gratitude? Maybe: “I appreciate that gesture of social cooperation, Stranger.”

Of course you don’t. You’re much more likely to grab a pre-assembled word sequence from your memorized phrase-o-dex and say, “Thank you very much.” You don’t spend time thinking about the meaning of each individual word; you just grab that complete language chunk and deploy it whole.

Consult any decent foreign phrase book and you’ll find dozens of these handy, prepackaged utterances to lubricate your entry into that language. But those phrases are complete thoughts, fully formed, and applicable only in specific situations. I believe we have another level of prepackaged chunks in our linguistic memories between the two extremes - more than atomic words and less than self-sufficient phrases. And we use these constantly to form our sentences.

For example, have you ever noticed that, in instructional settings, students each have their own habitual way of forming questions? One student might use “How do I…” sentences, while another might opt for “How would you…” constructions, and a third might favor “How are we supposed to…” formations. Each is the start of a request for clarification, but different people tend to fall back on their own preferred word patterns to construct sentences of a given type. It’s a core part of their personal idiolect.

But these chunks aren’t only about constructing questions. Think about sequences like “for the purpose of” (meaning to) or “on the other hand” (meaning alternatively). These prepackaged sequences fit into just about every grammatical category I can think of. But I’ve never heard them given a name, so what should we call them?

Well, just as chemists could use individual atoms to describe a substance but more often talk about the prepackaged collections involved (the molecules), I think of these commonly recurring phrase fragments as linguistic molecules.

But whatever we call them, I think studying them might be a powerful way to accelerate your language learning. In addition to making your speech sound more idiomatic, they also come prepackaged with the correct prepositions, conjugations and declensions baked right into them.

Consider “on the other hand.” To assemble that from scratch, a learner must ask: the hand or a hand? on, with, or in? Singular or plural? But as a molecule, it’s preassembled and immediately usable. And when you learn molecules, you’re also learning the rhythms and patterns of the target language in ways that you can internalize and begin to reuse in other constructions.

The only barrier is that I haven’t been able to find any kind of molecular study guides out there. While there are plenty of phrasebooks and frequency dictionaries, I haven’t found anything that systematically captures these mid-sized, reusable chunks as tools for learning fluency.

So I’m going to build one.

Raw Materials

We’ll start by building a molecule dictionary for English before tackling Norwegian. Why? Because I already speak English, so I’ll be able to assess the quality of the molecules. If my code generates a bunch of word salad and meaningless phrases, I’ll be able to spot those and tweak the algorithm, but if I went straight into Norwegian, I’d have no idea whether it was producing anything useful or not.

Definition: Molecules are extremely common 3-6 word chunks of natural speech that native speakers use habitually, as ready-made units, to construct more complete utterances. They encapsulate the correct prepositions, word forms, and rhythms of the target language, making them ideal units for listening practice and language acquisition, and they enable greater fluency than learning one word at a time.

So the objective is to build a database of these “extremely common 3-6 word chunks” of a language. To do this effectively, I’ll need to examine a huge corpus that draws from a lot of contributors.

My first instinct was to use Wikipedia, but previous experience has taught me that wiki articles lean heavily toward academic and formal language patterns. We’re trying to learn how to speak like regular people, not professors, so we need a source that skews toward common speech. And it would be a bonus if it covers many different languages, so I can use it to produce lists for other languages too.

For these reasons, I settled on OpenSubtitles.org. The 2016 archival snapshot contains subtitle files for 100 different languages, including all the languages I’ll ever want to cover. In English it has subtitles for over 900,000 shows and movies, yielding 250,000,000 distinct candidate sequences between 3 and 6 words long. For Norwegian it has 30,000 subtitle files with 40,000,000 candidates.

I should point out that the vast majority of these candidates do not qualify as molecules, but my early results suggest that there are easily several thousand that do, which is far more than enough to build a useful learning tool.

In addition to being a big enough resource, OpenSubtitles is a perfect fit linguistically too, because movie and TV show subtitles are exclusively dialogue. Not only are the sentences meant to be spoken aloud - they’re also meant to be understood by most speakers of the language - and they’re simple enough that they can be read by most people too. That sounds absolutely ideal for language learners.

Building the List

I won’t bore you with the code details, but basically, the process works like this:

  1. Read every subtitle file and break it into distinct sentences

  2. Throw out any sentence that contains foreign words, lots of digits, emojis, too many symbol characters, etc.

  3. For each sentence, create a list of every 3-word sequence it contains, as well as 4-, 5-, and 6-word sequences.

  4. Drop any sequence that begins and ends with a stop word (such as and, but, the, or some)

  5. For any sequences left, add them to the database. If the sequence is already there, just increase its counter.
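
For the curious, the five steps above might look roughly like this in Python. This is a minimal sketch, not the actual code behind the lists: the stop list (only the four words named above), the ASCII-only sentence filter, the naive sentence splitter, and every function name are assumptions made for illustration.

```python
import re
from collections import Counter

# Assumed stop list: only the four examples named in step 4.
STOP_WORDS = {"and", "but", "the", "some"}

# Assumed filter: a sentence is kept only if it is all ASCII letters,
# apostrophes, and spaces once normalized.
CLEAN_SENTENCE = re.compile(r"[a-z' ]+")


def split_sentences(text):
    """Step 1: break subtitle text into sentences at ., ! or ? (naive)."""
    return [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]


def normalize(sentence):
    """Lowercase, unify curly apostrophes, and drop commas."""
    return sentence.lower().replace("’", "'").replace(",", "")


def candidates(words, min_n=3, max_n=6):
    """Step 3: yield every run of min_n..max_n consecutive words."""
    for n in range(min_n, max_n + 1):
        for i in range(len(words) - n + 1):
            yield tuple(words[i:i + n])


def count_molecules(subtitles):
    """Steps 1-5: count candidate sequences across subtitle texts."""
    counts = Counter()
    for text in subtitles:
        for sentence in split_sentences(text):
            cleaned = normalize(sentence)
            if not CLEAN_SENTENCE.fullmatch(cleaned):
                continue  # Step 2: digits, symbols, non-ASCII letters, etc.
            for seq in candidates(cleaned.split()):
                # Step 4, read literally: drop only if BOTH ends are stop words.
                if seq[0] in STOP_WORDS and seq[-1] in STOP_WORDS:
                    continue
                counts[" ".join(seq)] += 1  # Step 5: add or increment
    return counts
```

With a counter like this, `count_molecules(texts).most_common(20)` would give a frequency ranking of the kind shown in this post. A real run over hundreds of millions of candidates would need the counts kept in a database rather than in memory, as step 5 describes.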

At the end of this process, we have millions of candidates and a count of how many times each one occurred in the data. The vast majority (85%) showed up only 1-5 times over 380,000 shows, but the most common? Here are the top English molecules:

Freq Molecule
1,251,337 what are you
1,104,904 what do you
733,399 are you doing
642,284 you know what
604,461 what are you doing
600,072 do you think
515,264 do you want
506,626 do you know
491,508 i don’t want
490,582 oh my god
410,894 i don’t think
382,649 why don’t you
341,713 what is it
332,391 how do you
324,601 what the hell
315,350 i told you
305,577 i love you
301,867 why are you
297,890 where are you

Those certainly look like high-value chunks of English to me. On first inspection, you might think, “Many of those look the same!” But consider the perspective of a language learner for a moment. How hard must it be to understand spoken English when so many people so often say things that all mean something different but sound almost identical? They have to slow down and process every individual word to figure out what’s being said. How empowering would it be for them to instead be able to recognize the entire chunk as a single unit?

That’s going to be the power of studying molecular vocabulary, I think: Gaining the ability to parse speech in bigger chunks, so you can keep up with the conversation sooner and with greater subtlety.

How to study them

I’m going to use audio flashcards, but my instinct is to not include L1 translations. I think there’s a danger of forming too strong a link between how the molecule is used in L2 vs how the corresponding one is used in L1 - assuming there even is a corresponding one. Often, these molecules convey specific tones and subtexts, and can even change tone depending on context, but their analogs in other languages don’t follow the same nuance patterns. So we don’t want to pollute the nuances of one language with those of the other.

For example, when asking a store clerk for help in English, you might begin with, “Excuse me, sir…”, but if you said, “Unnskyld, herr” in norsk, he would think you were mocking him. Norwegians never use honorifics. And if he sneezes, you do not say “velsigne deg” - it sounds like you’re trying to exorcise a demon. Norwegians tend to ignore a sneeze completely - especially from a stranger.

Instead, I think a better way to learn the function of a molecule is to see a bunch of different examples of it being used in full phrases and sentences, still in L2, and perhaps with some notes about tone and nuance for each.
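
Pulling those examples out of the same subtitle sentences could be as simple as the sketch below. The function name `find_examples` and the `limit` parameter are hypothetical, and the matching rule (whole words, case-insensitive) is an assumption about what counts as "containing" a molecule.

```python
import re


def find_examples(molecule, sentences, limit=3):
    """Return up to `limit` sentences containing `molecule` as whole words."""
    # The \b anchors keep "are you" from matching inside "care youth".
    pattern = re.compile(r"\b" + re.escape(molecule) + r"\b", re.IGNORECASE)
    examples = []
    for sentence in sentences:
        if pattern.search(sentence):
            examples.append(sentence)
            if len(examples) == limit:
                break
    return examples
```

Because the examples stay in L2, a card built this way never has to commit to a single L1 translation. Notes on tone and nuance would still need to be written by hand.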

For these reasons, I think the right place to start studying molecular vocabulary is during the B1/B2 stage, when you’ve got the basics and are beginning to work on depth. Although starting earlier could work too, if you want to challenge yourself.

For comparison, here are the top 20 molecules of Norwegian, produced by the same process, after ingesting 29,000 shows.

Freq Molecule
25,947 vet du hva
23,885 vil du ha
22,093 hva mener du
20,895 jeg vil ha
19,876 vet ikke hva
19,515 jeg er lei
16,890 jeg har aldri
16,881 jeg elsker deg
16,588 hva vil du
16,000 jeg er her
15,624 vær så snill
14,638 jeg vet ikke hva
13,767 hva gjør du
13,620 jeg er glad
13,528 alt i orden
13,136 hva er dette
12,931 jeg tror jeg
12,868 du har rett
12,521 hva har du
12,276 jeg vil bare

According to ChatGPT, those are also solid results. So I’m going to lock down the molecule extraction algorithm for now and move on with testing my theory.

Will learning these really accelerate the development of my norskørene?

Stay tuned to find out.

Follow-up 2025-08-14

Today brings a new road trip, which means another chance for focused study.

The challenge with molecules hasn’t been producing them - it’s been figuring out how to organize the study cards. Do I prompt with the molecule as the “question” and use its meaning as the “answer”? Should an example sentence be the prompt and the molecule itself be the answer? But what if the example contains more than one molecule? So maybe the prompt is three examples and my job is to identify the common molecule they share?

How the hell should I know? So to find out, I’ve prepared two decks, and on this trip, I’m going to try them both.
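
To make the options concrete, here is one way the three card layouts floated above could be represented. It is only a sketch: the `Card` class and the three builder functions are hypothetical names, and it says nothing about which two layouts ended up in the decks.

```python
from dataclasses import dataclass


@dataclass
class Card:
    prompt: str
    answer: str


def molecule_to_meaning(molecule, meaning):
    """Layout 1: the molecule is the question, its meaning is the answer."""
    return Card(prompt=molecule, answer=meaning)


def example_to_molecule(example, molecule):
    """Layout 2: an example sentence is the prompt, the molecule the answer."""
    return Card(prompt=example, answer=molecule)


def shared_molecule(examples, molecule):
    """Layout 3: several examples are the prompt; the learner names the
    molecule they all share."""
    missing = [e for e in examples if molecule.lower() not in e.lower()]
    if missing:
        raise ValueError(f"molecule not found in: {missing}")
    return Card(prompt="\n".join(examples), answer=molecule)
```

The third layout sidesteps the "more than one molecule" problem, because only the chunk common to all the examples can be the answer.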

