Being and nothingness
This essay is a reworking, in June 2026, of an essay originally written in January 2024. Technology, particularly speech-to-text, has advanced drastically in the interim, and we have incorporated our observations on its latest capabilities into the text. The philosophy and exercise remain unchanged: it’s the human that we put to work here; the machine has enough apostles and reviewers.
Speech-to-text works quite well. The machine has now mastered the skill. It is also an efficient exercise for humans learning languages like a machine, the auditory counterpart to reading like a machine and massive exposure to written textual data. At first glance, another candidate for fulfilling this purpose would be massive exposure to audio data annotated with live subtitles or a static transcript that one can follow along as the speech unfolds. That’d be a good start, but we claim that speech-to-text, which when performed by a human agent is better known, redolent of school benches, as dictation, is superior to it. For, that’s the crux of our philosophy, we’re best trained by way of massive exposure to the very task that we wish to become proficient in. Reading like a machine trains reading comprehension above all. (In the wake of such an exercise, the other language abilities, speaking, listening, writing, will of course develop, but the primary goal is to build that of going through text-to-meaning like a machine.) Assuming that making sense of a text coincides with making sense of it in English (or your familiar language of predilection), our statistical parser is best trained in text-to-meaning by way of massive exposure to whole stretches of literature in the original paralleled with their English translation. Turning now to the listening competency, what matters in the first instance is parsing the mess. Our first reaction to hearing Norwegian was that it sounds melodic and funny. That’s it. That we could not make any sense (in English or the language of reason) out of a live stream was an effect of our first not being able to parse speech into Norwegian itself. Reciprocally, once we were able to transcribe, sense would appear magically from our reading skills, grown separately through crossing through newspapers and training our text-to-meaning engine full throttle.
In short, the missing bridge that paves the way to speech-to-sense is an ability to parse into Norwegian, which, symmetrically to what reading like a machine prescribes, is best exercised by massively exposing ourselves to the very effort we want to become skilled in, Norwegian dictation. It’s one of the most painstaking exercises, and, in our experience, one of the most powerful. Perhaps a few sessions without cheating and with plenty of sweat are enough for light to brighten up the house.
The excellent radio program Verdibørsen, on the national radio station NRK, is an ideal companion for learning Norwegian like a machine, and it’s likely that around our peak training and fine-tuning period, we listened to most of the episodes. In what follows, philosopher Kaja Melsom is asked about her favorite philosopher. (NRK, 2024)
While understanding answers vague questions, which can be tackled at several levels of depth, such as what it is about, who the interviewee's favorite philosopher is, and why, transcribing is unequivocal: what is the exact and complete sequence of words that, when uttered, results in this audio production? The novice we were was allowed to proceed piece by piece. Here is an illustration of our struggles at the time we first went through the exercise and drafted the present account.
Play … Jean-Paul Sartre … fordi han har en helt spesiell rolle i livet mitt den dag i dag …
Replay … Da har jeg valgt Jean-Paul Sartre, fordi han har en helt spesiell rolle i livet mitt den dag i dag …
(I have chosen Jean-Paul Sartre, because he has a completely special role in my life to this day…)
Play … Jeg oppdaget ham … uten å kunne … ord fransk … så på meg som en oppe-… menneske fordi jeg ikke mestret språket …
(It becomes clear on a longer thread that the beginning and the end are most salient, or at least transcribed first. In fact, two behaviors can account for the same phenomenon, as the student’s individual tactics sneak in: either listening to the whole and then transcribing, in which case, there’s a good chance that the beginning and the end will be remembered better than the weaker middle, or transcribing while listening, that is, writing while listening until the reciter gets ahead of you, and picking up again later, when you’ve finished jotting things down.)
Replay … Jeg oppdaget ham som 16-17 åring da jeg befant meg … i Frankrike … uten å kunne … ord fransk … folk ikke lenger så på meg som en oppe-… menneske fordi jeg ikke mestret språket …
Once more … Jeg oppdaget ham som 16-17 åring da jeg … befant meg … i Frankrike … uten å kunne … ord fransk … jeg plutselig … jeg skjønte at folk ikke lenger så på meg som en oppe-… menneske fordi jeg ikke mestret språket …
And anew … Jeg oppdaget ham som 16-17 åring da jeg plutselig befant meg på videregående skole i Frankrike uten å kunne et … ord fransk … Da var jeg i en slags identitetskrise fordi jeg plutselig … jeg skjønte at folk ikke lenger så på meg som en oppe-… menneske fordi jeg ikke mestret språket …
As rehearsals progress, the exercise transforms: the rush of jotting down notes on a blank page gives way to the precision play of filling in the remaining gaps. And perhaps some holes will resist indefinitely, no matter how finely we cut, no matter how many times we push replay.
… et … ord fransk …
… en oppe-… menneske …
There’s a Swiss Army knife sleight, which we promise to disclose in a minute, for opening even the stiffest boxes, but now that we clearly fathom the modalities of the procedure, let’s try to get a little perspective and briefly reflect on what’s at stake here.
What’s at stake is dictation. Let’s try to sneak into the pupil’s mind. They know the instructions, to reproduce word for word, with the exact spelling. It isn’t necessary, a priori, for them to understand, and yet. With the whole heart attentive:
… Jeg oppdaget ham … uten å kunne … ord fransk …
… fransk! French, something with French word (ord) … uten å kunne … without being able to … a French word … without being able to understand (or speak) a single French word? …
Replay … Jeg oppdaget ham … på videregående skole i A uten å kunne … ord fransk …
The missing bit represented by the dummy letter A has got a sound shape.
… /ˈfraŋkriːkə/ …
But its harmonious fit in the clause is also semantic. That is where meaning comes to interweave with sound and draw the ear into one field or another of the lexical space.
… not being able to understand a word of French … while at high school … in France! … France is Frankrike … /ˈfraŋkriːkə/ … Frankrike! …
It’s really hard to further increase the granularity and know in which order the strands are interwoven to bring out parsing and spelling. It just clicks. Let’s give it another try. Say we have grasped the following, where B is still missing.
… jeg plutselig befant meg på B skole i Frankrike uten å kunne …
… /ˈviːdərəˌɡoːənə/ … /ˈviːdərə/ … videre! … på videre-… skole … ɡoːənə … gå-… skole … high school … videregående skole!
(Of course, this is not to say that speech recognition works through erudite phonetic transcription in the international standard. IPA transcription is just a symbol here, in rendering the stream of consciousness, for the flash of sound recognition. We ourselves are unfortunately poorly trained in phonetics, and will try to remedy this shortcoming in the future for the sole sake of theoretical completeness. Indeed, in practice, extensive exposure and a keen ear are sufficient, in accordance with the principles of learning like a machine, to acquire the ability to speak and understand what is being said, just as, asymptotically, learning to read does not stringently require grammar.)
In short, each position in the script is doubly constrained, by both sound and meaning in context (it’s in fact, beyond meaning, further subject to syntax, too, to contextual linguistics at large, which meaning, in the simplistic dual model outlined above, is at best a working synecdoche for). Dictation fits into various stages of the learning process, but because of this multifold constraint, its effectiveness is perhaps greatest when the fundamentals of reading are already in place, after what we refer to elsewhere as the training phase, which a more voluminous phase of fine-tuning follows. Can the complete novice who has only sound to go by find their way? Arguably, yes, by virtue of the law of large numbers, the effectiveness of which we constantly confirm. Said novice must then painstakingly immerse themselves in the sound sphere of the language, listen to volumes of Verdibørsen episodes, and, even before beginning to parallel read, before understanding anything, learn to parse the spoken language. Learning to distinguish words. That is to say, their boundaries. Word boundaries and mincing.
Do you mean turning on the radio-cassette player and learning while you sleep? Pretty much. It seems that’s how, through boundary statistics, the sound waves of language enter the mind. Let’s strip language, for a moment, of its semantic properties. The language thread goes on, but it becomes strictly musical, precisely because we are newcomers, or, to place the reasoning on the speculative plane to which it belongs, because the lexicon, syntax, and grammar, albeit with their forms and the distribution of their patterns unchanged, have been depleted of their meaning, of their logic. What, then, is a word? As a first approximation, a string of letters in writing, of sounds in speech, separated from the next by a blank space, respectively on the page or in time. What if, now, exacerbating the synthetic character of the experiment, these blank spaces (and the potential punctuation marks that intersperse both text and prosody) were to vanish, and the flow to become the uninterrupted stitching together of these meaningless words?
… jegplutseligbefantmegpåvideregåendeskoleiFrankrikeutenåkunne …
What distinguishes one word from the next, then, is its internal cohesive force. Just as within a molecule, whose atoms are more strongly bound to one another than to those of the rest of the universe, the syllables of the word are so tightly bound together that they give it an interior, boundaries. It is the relativity of forces that makes the molecule stick out: it’s only because covalent bonds are infinitely stronger than hydrogen bonds or Van der Waals forces that this distinct entity called a molecule exists. Similarly, the word springs forth because the immaterial, statistical counterpart of these mechanical forces is infinitely more powerful within it than with the rest of the syllable space. That analogue is the probability of transition between syllables. The transition jeg-plut is much less likely than plut-se-(lig). Plutselig forms a whole, unlike *jegplut, because in a never-ending stream of speech, the frequency of the former is infinitely higher than that of the latter. So, that’s how, in a nutshell, our novice could get speech-to-chopping-into-words while they sleep. That’s what the fascinating experiments by Saffran et al. show. (Saffran, 1997) The subjects are to draw for twenty minutes while background sound plays. What they are exposed to incidentally, without their attention being called to dwell on it, is the unbroken, rhythmless flow of a synthetic language consisting of six nonsense words.
… bupadapatubitutibudutabapidab …
At the end, they are tested on the words they heard, and even though they were completely absorbed in their drawings, they have learned. Later, Saffran and colleagues repeat the experiment with tone sequences to confirm that the same statistical learning ability applies to pure music. (Saffran, 1999) Our tabula rasa therefore has a good chance of learning to chop before any literacy.
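The chopping principle can be caricatured in a few lines of Python. What follows is a minimal sketch under loud assumptions, not a reconstruction of the experimental protocol: the six-word lexicon is made up for the occasion (it is not the stimulus list of the study), no syllable is shared between two words (a simplification that makes every within-word transition certain), the stream is built by chaining every ordering of the lexicon, and the cut-off of 0.5 is arbitrary.

```python
from collections import Counter
from itertools import permutations

# Hypothetical lexicon (an assumption): six made-up trisyllabic words,
# with no syllable shared between two words.
WORDS = ["bupada", "tibelo", "kifumo", "genasu", "rovize", "lyjaho"]


def syllables(word):
    """Split a word into two-letter consonant-vowel syllables."""
    return [word[i:i + 2] for i in range(0, len(word), 2)]


def build_stream(words):
    """Chain every ordering of the lexicon into one unbroken syllable stream."""
    stream = []
    for ordering in permutations(words):
        for word in ordering:
            stream.extend(syllables(word))
    return stream


def transition_probabilities(stream):
    """Estimate P(next syllable | current syllable) from bigram counts."""
    pair_counts = Counter(zip(stream, stream[1:]))
    first_counts = Counter(stream[:-1])
    return {(a, b): n / first_counts[a] for (a, b), n in pair_counts.items()}


def segment(stream, probs, threshold=0.5):
    """Cut the stream wherever the transition probability dips below threshold."""
    words, current = [], [stream[0]]
    for a, b in zip(stream, stream[1:]):
        if probs[(a, b)] < threshold:
            # Weak bond between a and b: close the current word here.
            words.append("".join(current))
            current = []
        current.append(b)
    words.append("".join(current))
    return words


stream = build_stream(WORDS)
probs = transition_probabilities(stream)
recovered = set(segment(stream, probs))
print(sorted(recovered))
```

In this toy stream, a syllable inside a word is always followed by the same syllable, whereas a word-final syllable is followed by the first syllable of several different words, so its transition probabilities stay low. Cutting at the weak bonds gives back the lexicon without any notion of meaning or spelling.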
But they stumble upon yet another problem: with their head full of probability, having never seen a text, they’re unable to spell a single word they’ve heard. They’re stuck with the mental equivalent of a chopped phonetic transcription, where words have been individualized, but are still solely sounding.
… [da jæɪ ˈplʉtsəlɪ bəˈfɑnt meɪ poː ˈviːdərəˌɡoːənə ˈskuːlə iː ˈfraŋkriːkə ˈʉːtən oː ˈkʉnə] …
(… da jeg plutselig befant meg på videregående skole i Frankrike uten å kunne …)
Along such a trajectory, where listening precedes reading, the missing functionality in achieving speech-to-text transcription is phonetic-words-to-spelling. Arguably, that skill can be trained with subtitled or transcribed live audio. In passing, live exposure to the written script comes to intermingle with boundary statistics to train the parsing engine in isolating words within the sound mess. With all that, they’d be able to master a naïve, nescient version of the dictation, in a way, where speech-to-text happens perforce by the sheer mechanics of statistical habituation, without meaning or syntax intervening at all. What is being said would remain as cryptic as a conversation among adults to a household pet. If you watch something live, such as movies, that combines sound, subtitles, and a story in technicolor, context, in the form of a scene and happenings, might be regarded as playing the role of a thread of meaning, which the parallel English thread comes to fulfill in reading like a machine. If you are able to volume-watch-TV like a machine without falling asleep or losing the tautness of your attention needed to properly carry out the exercise, that’s a track to follow. It could perhaps bring you closer to managing a learned version of speech-to-text where meaning meddles, and, with that, closer to listening skills, but still fall short, we argue, of the trajectory that originates in reading like a machine and branches at some point into speech-to-text. That’s not just us being bookish. By first brooding over a mega-corpus that comprises everything (transcripts of speech included), the latter fosters depth and breadth, multiplicity of registers and genres, immersion in the fluency of virtuosi in the Letters. 
Very importantly, it promotes slow and laborious attention, the painstaking efforts that are conducive to growth, patient frequenting of linguistic tools, repetition until perfection of the main exercise in piecewise matching and the free spawning, keeping you sharp and alert, of a myriad of its variants. We highlight patience and slowness here. In reading like a machine, time is on our side; we’ve our hand on it. We play, we replay, we pause. We manage to get through volumes because we are no longer bound by the passing of time, the unit and the dimension of our presence have changed, we count in lines tamed. The same asynchronicity between the thread of speech and that of meaning appears in dictation practiced as we are doing here. The TV movie, with its frivolity, its strictly oral tone, and above all, its fountaining, with no respite, subtitles that vanish after a subliminal moment, seems quite pale in comparison.
When coming to dictation our way, we’re already trained in syllable transition statistics via massive exposure to the written language. Chopping the live stream is all the easier given that we already know what the molecules look like, if not their rigorous phonetics, at least their envelope, their allure. All in all, we already have a full-fledged statistics engine ready to expect what comes next, and, when we replay, what fits in the holes. We have the semantic, syntactic and grammatical constraints that weigh on wild guessing. We have word spelling. All that remains is training our ear to discern words within that fancy stream, speech-to-spelling, indeed, yet in a highly supervised manner.
Our Swiss Army knife from earlier, which solves most of the puzzles in hole-based listening exercises, is ready to use.
… et … ord fransk …
… en oppe-… menneske …
When we type our text in full into a good online translator, we get in return:
… I discovered him when I was 16 or 17, when I suddenly found myself in high school in France without knowing a … single word of French. At the time, I was going through a sort of identity crisis because I suddenly … realized that people no longer saw me as a capable … human being because I hadn’t mastered the language …
In short, the machine intelligence filled the holes (still hinted at with …) with what is most likely.
… a single word of French …
… a full-fledged human being …
Let’s flip the direction, the English text shifts to the left, and on the right appears:
… Jeg oppdaget ham da jeg var 16 eller 17 år, da jeg plutselig befant meg på videregående skole i Frankrike uten å kunne et … eneste ord fransk … På den tiden gikk jeg gjennom en slags identitetskrise fordi jeg plutselig … innså at folk ikke lenger så på meg som et dyktig … menneske fordi jeg ikke hadde lært meg språket …
Unfortunately, eneste does not fit the sound stream: we clearly heard a consonant first, something along the lines of *krøve, *kløva, *tløyva. How it sounds precisely is far from clear, and that it doesn’t sound like eneste is obvious. The same applies to dyktig, in the hole that opens with oppe-. No matter, let’s reverse again and make use of good translators’ ability to come up with synonyms.
… without being able to understand a … word of French …
Actually, an infinitive fits well in the first hole as well.
… realized that people no longer saw me as an alert … human being because I hadn’t mastered the language …
Back and forth, we play and replay synonyms until they converge on the thing we are very likely to have heard. Well, our case is really tricky and we’re doomed to fail. It seems the script is
… uten å kunne et kløyva ord fransk …
where the adjective comes from Norwegian Nynorsk (the other official Norwegian language, which is well represented in the media) and seeps into Norwegian Bokmål only in the fixed idiom in question, not a single word. Nynorsk kløyva is a verb meaning to split, to slice. We can feel a potential semantic drift: not being able to split a single segment of French into words, not being capable of *a split-word of French.
As for oppegående (alert, one who gets up in the morning), it sits in a corner of the dictionary and our translator didn’t reach that far. No luck fishing. Let’s keep going.
So our speaker is in France and discovers Jean-Paul Sartre while flipping through the program.
… Han skrev om … hvordan friheten gir oss angst fordi vi ikke er verken vår fortid eller våre fremtidige muligheter. Vi er dette intet … midt i mellom, og det var akkurat der jeg følte meg. Jeg var hverken den Kaja jeg var i Norge før jeg kom til Frankrike, og jeg kunne heller ikke … på at jeg sikkert en dag ville mestre fransk å bli et oppegående menneske igjen.
(… He wrote about … how freedom fills us with anxiety because we are neither our past nor our future possibilities. We are this nothingness … right in the middle, and that was exactly how I felt. I was neither the Kaja I had been in Norway before I came to France, nor could I rely on the fact that I would surely one day master French and become a fully alert person again.)
Here it is anew, oppegående, a little more audible, and, indeed, less surprising. Laws of discourse and semantic cohesion come to corroborate the hypotheses. Maybe we have a puzzling hole in
… og jeg kunne heller ikke … på at jeg sikkert en dag ville mestre fransk …
Yet, we do hear something that resembles *laneme, *lanema (in our own internal phonetic alphabet). In the syntactic backbone, a verb in the infinitive fits best. Now the Swiss Army knife opens the box.
… nor could I rely on the fact that I would surely one day master French …
Reversing, the Norwegian reads
… og jeg kunne heller ikke lene meg på tanken om at jeg en dag sikkert ville mestre fransk …
That’s our best guess, lene meg. Swiss Army knife, linguistic ties, conjectures, and probabilities. Here, we fine-tune our ear the hard way to match the musical flow to its spelling, powered by a statistical engine already attuned to the language’s dance.
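The double constraint at work in the two stubborn holes, sound on one side and context on the other, can be sketched as a toy scoring rule in Python. Everything numerical here is an assumption made for the illustration: the context weights are invented (they merely encode that the translator preferred eneste and dyktig), the floor value is arbitrary, and the overlap between two spellings, computed with the standard library's difflib, is only a crude stand-in for the ear. The heard forms are the rough ones we noted above.

```python
from difflib import SequenceMatcher


def sound_fit(heard, candidate):
    """Crude proxy for acoustic similarity: overlap between two spellings."""
    return SequenceMatcher(None, heard, candidate).ratio()


def best_fit(heard, context_weights, floor=0.05):
    """Rank candidates by context weight times (floored) sound fit."""
    scores = {
        word: weight * (floor + sound_fit(heard, word))
        for word, weight in context_weights.items()
    }
    return max(scores, key=scores.get), scores


# Made-up context weights (an assumption): the translator's first choice
# is given more weight than the rarer word that was actually spoken.
first_hole = {"eneste": 0.6, "kløyva": 0.1}
second_hole = {"dyktig": 0.6, "oppegående": 0.1}

guess_1, scores_1 = best_fit("kløva", first_hole)   # heard roughly *kløva
guess_2, scores_2 = best_fit("oppe", second_hole)   # heard only oppe-
print(guess_1, guess_2)
```

With these toy numbers, context alone would favor the translator's choices, and it is the sound term that tips the balance toward kløyva and oppegående. A real ear, of course, weighs far more than letter overlap.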
References
NRK. (2024, January 7). Filosofen som ble utstoppet: Hør om Jeremy Bentham og andre favorittfilosofer! Verdibørsen. https://radio.nrk.no/podkast/verdiboersen/sesong/siste/l_50868da8-abc2-485b-868d-a8abc2985b2d
Saffran, J. R., Newport, E. L., Aslin, R. N., Tunick, R. A., & Barrueco, S. (1997). Incidental language learning: Listening (and learning) out of the corner of your ear. Psychological Science, 8(2), 101–105. https://doi.org/10.1111/j.1467-9280.1997.tb00690.x
Saffran, J. R., Johnson, E. K., Aslin, R. N., & Newport, E. L. (1999). Statistical learning of tone sequences by human infants and adults. Cognition, 70(1), 27–52. https://doi.org/10.1016/S0010-0277(98)00075-4



