How a Dogma About Language Stalled AI for Fifty Years

In 1955 a small Cambridge lab had the right ideas about computational language. Then Noam Chomsky's revolution swept them aside — and it took half a century, and vastly more powerful hardware, to prove the original intuition correct.

Print from a 17th-century thesaurus

Early computational linguistics in Cambridge

In 1947 a group of Cambridge academics began meeting under the name “the Epiphany Philosophers”. Their interests were broad: religion, mysticism, parapsychology, the philosophy of science, and — most consequentially — the nature of language. At one point they investigated whether computers could decode the communication patterns of pigeons. The membership was distinguished, but the leading figure was Margaret Masterman, a philosopher and linguist who had studied under Ludwig Wittgenstein during the most fertile period of his career.

Masterman had witnessed Wittgenstein abandon his own earlier theory. In the Tractatus Logico-Philosophicus (1921), Wittgenstein had treated language as a logical picture of reality: each meaningful sentence mapped onto a corresponding state of affairs, and the task of philosophy was to strip away ambiguity until only precise, crystalline propositions remained. By the 1930s he had grown dissatisfied. A famous anecdote captures the moment. His Cambridge colleague, the Italian economist Piero Sraffa, brushed his chin with his fingertips — a Neapolitan gesture of dismissal — and asked: “What is the logical form of that?” Wittgenstein had no answer. The gesture carried unmistakable meaning, yet it was not a proposition, not a picture of a fact, not anything the Tractatus could analyse. Language, he came to argue, was not a mirror but a toolkit. Words did not have fixed definitions etched into some Platonic register; they acquired meaning through use, in what he called “language-games” — fluid, contextual, inseparable from the lives of the people who spoke them.

This insight — that meaning is a pattern of use rather than a catalogue of definitions — would prove prophetic. It is recognisably the approach taken by large language models today. But its journey from a Cambridge seminar room to a working technology took more than seventy years, and most of that delay was avoidable.

The Cambridge Language Research Unit

In 1955 Masterman formalised the work of the Epiphany Philosophers by founding the Cambridge Language Research Unit (CLRU). It was a small, independent outfit — never more than a handful of researchers — but it was among the first groups anywhere to attempt computational modelling of natural language.

The CLRU’s approach was distinctive. Where others were already beginning to think about language in terms of grammatical rules, Masterman started from meaning. Her central tool was the thesaurus — specifically, a structure inspired by Roget’s Thesaurus, whose thousand-odd semantic “heads” (category labels such as Space, Motion, Affection) she treated as a kind of coordinate system for meaning. A word could be located not by its spelling or its part of speech, but by the cluster of semantic heads it fell under. Words that shared heads were, in some computable sense, related.

From this she developed the idea of a semantic lattice: a mathematical structure in which concepts were ordered by generality, and the meaning of a phrase could be narrowed by finding the intersection of the lattice positions of its parts. She also drew on the notion of breath groups — the natural chunks into which spoken language falls — as units of processing, rather than treating sentences as the fundamental unit of analysis. The idea was that meaning arose from patterns of association, not from the application of syntactic rules.
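The thesaurus-as-coordinate-system and the lattice "meet" can be gestured at in a few lines of code. Everything below is invented for illustration: the head names, the word assignments, and the tiny vocabulary are a sketch of the idea, not Masterman's actual system, and Roget's real thesaurus has around a thousand heads rather than a handful.

```python
# Toy thesaurus: each word is located by the set of semantic heads it
# falls under. Head names and assignments are invented for illustration.
HEADS = {
    "bank":  {"Money", "Edge", "Land"},
    "river": {"Water", "Land", "Motion"},
    "loan":  {"Money", "Giving"},
    "shore": {"Edge", "Land", "Water"},
}

def relatedness(a: str, b: str) -> float:
    """Jaccard overlap of the semantic heads two words fall under."""
    ha, hb = HEADS[a], HEADS[b]
    return len(ha & hb) / len(ha | hb)

def phrase_heads(*words: str) -> set:
    """Narrow a phrase's meaning by intersecting its words' head sets,
    in the spirit of taking the meet of lattice elements."""
    return set.intersection(*(HEADS[w] for w in words))

print(relatedness("bank", "loan"))    # related through the shared Money head
print(phrase_heads("bank", "river"))  # {'Land'}: the riverbank reading survives
```

The point of the sketch is that "bank" next to "river" loses its Money reading purely through set intersection, with no grammatical rule in sight.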

This was, in retrospect, strikingly close to the distributional semantics that powers modern natural language processing. The famous slogan, usually attributed to the linguist J.R. Firth — “You shall know a word by the company it keeps” — could serve as the CLRU’s motto. It could equally serve as the design principle behind word embeddings, contextual representations, and the attention mechanisms of transformer models.

But in the late 1950s there was no hardware remotely capable of testing these ideas at scale, and the intellectual climate was about to turn hostile.

Chomsky’s revolution

In 1957 a young linguist at the Massachusetts Institute of Technology published Syntactic Structures. Noam Chomsky argued that the surface variety of human languages concealed a deep, universal grammar: a finite set of rules from which all grammatical sentences in any language could be generated. Two years later he published a devastating review of the psychologist B.F. Skinner’s Verbal Behavior, which had attempted to explain language acquisition through conditioning and statistical association. The review was so influential that it effectively ended behaviourist approaches to language for a generation.

The impact was enormous. Chomsky offered a vision of language that was elegant, formal, and amenable to the kind of rigorous analysis that both linguists and computer scientists craved. If language was governed by deep structural rules, then understanding language meant discovering those rules — and that was a problem with a clear shape. University departments reorganised themselves around generative grammar. Funding flowed. Careers were built.

What was lost in the excitement was any serious interest in the statistical, distributional, meaning-first approach that Masterman and the CLRU had been pursuing. Chomsky’s framework treated meaning as secondary to syntax. His most famous example sentence — “Colorless green ideas sleep furiously” — was designed precisely to show that a sentence could be grammatically perfect and yet semantically empty. Grammar was the engine; meaning was merely the cargo.

Masterman was unsparing in her response. In “Semantic Algorithms,” she wrote:

My quarrel with [the Chomsky school] is not at all that they abstract from the facts. How could it be? For I myself am proposing in this paper a far more drastic abstraction from the facts. It is that they are abstracting from the wrong facts because they are abstracting from the syntactic facts, that is, from that very superficial and highly redundant part of language that children, aphasics, people in a hurry, and colloquial speakers always, quite rightly, drop.

— Margaret Masterman, Language, Cohesion and Form, p. 266

It is a remarkable passage. She concedes abstraction — all science abstracts — but insists that Chomsky has chosen the wrong level. Syntax, she argues, is the part of language that people routinely discard without loss of meaning. The real structure lies deeper, in the semantic patterns that survive even when grammar is mangled.

For researchers who believed, as Masterman did, that you could not understand language without starting from meaning, the next three decades were lean. Statistical methods were dismissed as shallow. Corpus-based approaches — studying what people actually said, rather than what the rules permitted them to say — were regarded as intellectually unserious. The prevailing view held that the rules of language were innate, hard-wired into the human brain, and that no amount of data could substitute for the right theory.

The computational constraint

It would be wrong to blame Chomsky entirely. Masterman’s ideas required something that simply did not exist in 1955, or 1965, or even 1985: enough computing power to process language statistically at meaningful scale.

A semantic lattice built from a thesaurus of a thousand heads is a toy. To capture the distributional behaviour of words as they are actually used, you need vast corpora — millions, eventually billions, of words of running text — and the processing capacity to extract patterns from them. In the 1960s, a university mainframe might have a few tens of kilobytes of working memory. The entire idea of learning linguistic structure from data was not merely unfashionable; it was, for the technology of the time, impractical.

This is worth acknowledging because it complicates any simple story of intellectual villainy. Chomsky’s rule-based approach had the genuine advantage of being testable with the tools available. You could write a grammar on paper, generate predictions, and check them against native-speaker judgements. You did not need a warehouse of silicon. The hardware to vindicate Masterman would not arrive for decades, and when it did, it arrived gradually — first disk storage, then memory, then processing speed, and finally the massively parallel architectures of modern GPUs.

The dominance of rule-based systems

Through the 1970s and 1980s, natural language processing was dominated by rule-based systems. Researchers hand-crafted grammars, built parsers, and wrote elaborate sets of rules to handle ever more linguistic phenomena. The systems grew in complexity but remained brittle. Every new edge case required a new rule, and the rules interacted in ways that were hard to predict or debug. Machine translation, the field’s original grand challenge, produced mostly unreadable output.

Meanwhile, the few researchers who persisted with statistical methods worked in relative obscurity. Even within the CLRU’s own lineage, Chomsky’s influence was hard to escape. Yorick Wilks, one of Masterman’s most prominent students, developed “preference semantics” — a system that tried to resolve ambiguity by preferring the most plausible interpretation — but the broader field treated such work as a curiosity rather than a paradigm.

The turning point came, as it often does, from outside the discipline. In the late 1980s, researchers at IBM’s speech-recognition lab, led by Frederick Jelinek, began applying statistical models — hidden Markov models and n-gram language models — to the problem of transcribing spoken English. The statistical systems soon outperformed their rule-based competitors. Jelinek reportedly quipped: “Every time I fire a linguist, the performance of the speech recognition system goes up.” It was a joke, but it contained a truth that the field had been reluctant to face: brute statistical regularity, applied to enough data, could outperform carefully reasoned rules.
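An n-gram language model is almost embarrassingly simple, which was part of the scandal. The bigram sketch below captures the core of it: estimate the probability of the next word purely from counts in a corpus. The corpus here is invented and tiny; IBM's models were trained on millions of words.

```python
from collections import Counter, defaultdict

# Toy bigram model: P(next word | current word) estimated from raw counts.
# The corpus is invented for illustration.
corpus = "the cat sat on the mat the cat ate the fish".split()

counts = defaultdict(Counter)
for w1, w2 in zip(corpus, corpus[1:]):
    counts[w1][w2] += 1

def p_next(w1: str, w2: str) -> float:
    """Maximum-likelihood estimate of P(w2 | w1)."""
    total = sum(counts[w1].values())
    return counts[w1][w2] / total if total else 0.0

print(p_next("the", "cat"))  # 2 of the 4 words following "the" are "cat"
```

No rule anywhere says that "cat" is a noun or that "the" is a determiner; the regularity falls out of the counts.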

The statistical turn

Statistical methods gradually colonised NLP through the 1990s and 2000s. Machine learning replaced hand-crafted rules. In 2013, Tomáš Mikolov and colleagues at Google published Word2Vec, which learnt dense vector representations of words from their patterns of co-occurrence in large corpora — a direct, computational realisation of the distributional hypothesis that Masterman and Firth had championed in the 1950s. The vectors captured meaning through association: words used in similar contexts ended up near each other in vector space.
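The mechanism behind such vectors can be shown in miniature. The sketch below, with an invented four-sentence corpus, builds each word's vector from the counts of words appearing near it, then compares words by cosine similarity. It is a crude stand-in for Word2Vec, which learns dense vectors with a neural network, but the distributional principle is the same.

```python
from collections import Counter
from math import sqrt

# Minimal distributional semantics: represent each word by the counts of
# words appearing within two positions of it, then compare by cosine
# similarity. The corpus is invented for illustration.
corpus = ("dogs chase cats . cats chase mice . "
          "dogs eat food . cats eat food .").split()

def vector(word: str, window: int = 2) -> Counter:
    """Context-count vector for a word over the whole corpus."""
    v = Counter()
    for i, w in enumerate(corpus):
        if w == word:
            lo, hi = max(0, i - window), min(len(corpus), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    v[corpus[j]] += 1
    return v

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a)
    na = sqrt(sum(x * x for x in a.values()))
    nb = sqrt(sum(x * x for x in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# "dogs" and "cats" keep similar company (chase, eat, food), so their
# vectors end up close; "dogs" and "food" less so.
print(cosine(vector("dogs"), vector("cats")))
```

Firth's slogan, made literal: the vector for a word just is a record of the company it keeps.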

Then came the transformers. The 2017 paper “Attention Is All You Need” introduced an architecture that could model long-range dependencies in text without recurrence, making it possible to train on unprecedented quantities of data. The large language models that followed — GPT, BERT, and their successors — learnt grammar, meaning, world knowledge, and even a passable imitation of reasoning, all from patterns in text. No grammar rules were programmed. No universal grammar was consulted. The system observed how words were used, at enormous scale, and generalised.
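The attention mechanism at the heart of that 2017 paper is, at its core, a small amount of arithmetic. The sketch below shows one scaled dot-product attention step with invented two-dimensional vectors; real models use learned query, key, and value projections over hundreds of dimensions and many attention heads in parallel.

```python
from math import exp, sqrt

def attention(query, keys, values):
    """One scaled dot-product attention step over toy vectors."""
    d = len(query)
    # Similarity of the query to each key, scaled by sqrt(dimension).
    scores = [sum(q * k for q, k in zip(query, key)) / sqrt(d) for key in keys]
    # Softmax turns scores into positive weights that sum to 1.
    exps = [exp(s) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # The output blends the value vectors according to those weights.
    return [sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))]

q = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0]]       # the first key matches the query
values = [[10.0, 0.0], [0.0, 10.0]]
print(attention(q, keys, values))     # output leans towards the first value
```

Every weight is a function of how words relate to each other in context, which is to say: the mechanism is distributional to its bones.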

Margaret Masterman died in 1986, long before any of this was possible. But the core of her intuition — that meaning arises from patterns of association, that a thesaurus-like map of semantic space is the right starting point, and that statistical regularity in usage is the foundation of understanding — is precisely what modern language models exploit.

The challenge to formalism

Noam Chomsky is one of the towering intellects of the post-war world. He created a new field of study, spawned university departments, and produced a body of political criticism — on Vietnam, on media, on Central American policy — that remains essential reading. He is an accomplished writer and a compelling speaker.

In November 2022, OpenAI released ChatGPT. It could generate fluent, grammatically impeccable text in dozens of human languages, translate between them, summarise, argue, and explain — all without any knowledge of generative grammar, binding theory, or the principles and parameters that Chomsky had spent a career elaborating.

Chomsky’s response was telling. In a 2023 New York Times essay, he dismissed large language models as fundamentally uninteresting — mere statistical engines that could never achieve genuine understanding. He may be right about the limits of current architectures. But the statistical approach has already delivered capabilities that decades of rule-based systems never approached.

The deeper lesson is not that the generative approach was without merit — observations about the structure and learnability of language remain valuable — but that the field’s fixation on it closed off a productive line of enquiry for decades. A statistical, distributional, meaning-first approach to language was available from the 1950s. It went largely unexplored because the dominant paradigm dismissed it. When the hardware finally arrived to test the hypothesis, the hypothesis won.

What this means for building systems

There is a practical lesson here for anyone designing knowledge-intensive software. If the history of NLP teaches anything, it is that the right abstraction can be ahead of the available implementation. Masterman’s semantic lattice was the right idea expressed in the wrong decade. The thesaurus-as-coordinate-system was a prototype of the embedding space, sketched on paper sixty years before GPUs made it workable.

For organisations building systems that reason about language — whether to classify risk, interpret questionnaire responses, or route documents through workflows — the takeaway is straightforward: design around meaning, not around rules. Rules are brittle and expensive to maintain. Patterns of use, captured statistically and refined through feedback, are robust and adaptable. The technology to do this well has only recently become practical, but the underlying insight is older than most of us.

The fifty-year detour through formal grammar was not without value — it deepened our understanding of linguistic structure enormously. But it was, in part, a detour driven by intellectual fashion and institutional inertia. The researchers who kept faith with the distributional hypothesis were validated not by argument but by engineering results.
