A couple of months ago, I wrote an installment of this column about the overlooked labor powering generative AI. One consideration, bracketed out for brevity, has gripped me since, causing headaches, writer’s block, and spirals of paralyzing existential dread every time I sit down to write about artificial intelligence: The nomenclature of “natural language processing” (NLP), or the field of making AI understand and work with words. I understand that industries name things for pragmatic reasons. But I’ve been hung up on that turn of phrase like an Adderall-fueled undergrad on post-structuralism. For six months – long enough that I literally moved to a new country and watched the trees flower, bloom, and then lose their leaves again – I have been trying to write a follow-up, haunted by the question: When is language ever natural?
In the case of NLP, “natural” language is really just language originally intended for purposes other than training AI. Of course, words don’t spring forth from the earth like ripe produce; “natural” language comes from somewhere, and that somewhere has implications. Most generative language models are trained on data from Common Crawl, an open corpus of text scraped from the web since 2008. For that reason, Christina Lu, a technologist and former software engineer at DeepMind, suggests that, for terminological precision, we should be calling AI tools like LLMs “internet language models.”
“Natural,” in this case, means something more like “unencumbered,” or even “unwitting” – an honest expression of how humans talk and write when we don’t believe anyone’s watching. A poetic illustration of just how unencumbered this kind of speech can be: In 2016, artists Sam Lavigne and Tega Brain made The Good Life, a project about how, up to that point, the primary corpus used to train stuff like spam filters and predictive text algorithms was the Enron Corporation email archive, ripped from the company’s servers when they were seized in one of the largest corporate fraud cases in history. I don’t think those emails are still used in training data, but for years, these correspondences between white-collar criminals were apparently the basis for a widespread understanding of what sort of language is “natural.”
Image generated by DALL-E using the prompt “Enron Corporation employees enjoying a pizza party”
At the New Museum in 2017, I saw Sam and Tega present highlights from some of the emails, which were, by and large, strikingly banal. I remember feeling touched by the Enron employees’ relaxed repartee, which ranged from divorces to pizza dinners. I guess there’s something honest – vulnerable, even – about the textual residues that accrete online. This holds equally true for the kind of contextless internet text that I’d imagine now constitutes the lion’s share of Common Crawl data.
Drawing on a sample size of one – myself – allow me to make an unscientific conclusion about the nature of text online: My devices are a trash compactor for my never-ending torrents of stupid, sincere, touching, and oftentimes purposeless language. I ask things of my search engine that I wouldn’t dare ask a friend. I tell it my anxieties, my embarrassing zones of incompetency, my silly tender queries. I go ham in Amazon product reviews, describing, in unnecessary detail, my experience of assembling an ersatz Danish dining chair. On Reddit at 2am, I’m writing a novella-length post wondering whether it’s appropriate to get my acupuncturist a Christmas present.
Prolific, pseudonymous, and unpolished, it’s automatic writing of a sort. Does that mean it reveals an unconscious? I think, here, of the psychoanalytic precept that unacknowledged truth can come to the fore via free association. But the analysand, unlike the internet user, is speaking to a finite someone, a “subject supposed to know,” whose restrained, surgical prompting brings new insight into being. Online, most of the time, we’re typing to an apparent nowhere, throwing verbiage into the void.
In her “Dark Forest Theory of the Internet,” media theorist Bogna Konior calls the central question posed by Web 2.0 platforms – what’s on your mind? – “a riddle we must answer over and over again,” laboriously constructing our dividual digital selves with each post, like, and trackable click. Since at least the rise of social media, the imperative of the internet has been expressive production, self-definition carried out in perpetuity through constant communication. “Repressive forces don’t stop people from expressing themselves,” wrote Gilles Deleuze in Negotiations, “but rather force them to express themselves.” That was in 1995, and has proven so self-evident that one wonders if the architects of present-day social platforms read Deleuze like an instruction manual.
Nowadays, the breadcrumb trails of text (among other kinds of data) that we leave behind are captured, calculated, and used to perfect systems of hyper-targeted manipulation, which anticipate our quirks and desires before we ourselves can “authentically” articulate them. My browser history knows me better than my psychoanalyst does, so rigorously has it absorbed the data trails I semiconsciously leave in my wake. We can, and should, assume that this expressive flotsam is now fodder for AI training, too.
This detail doesn’t sit well with me, even as I more or less willingly divulge all this data through the devices that I spend most of my time physically connected to. My best writing happens when I’m curled up in bed in the fetal position, laptop propped horizontally against my knees. I estimate that I have spent more time this past year lying in bed with my computer than with any single romantic partner – actually, with all of my past romantic partners combined. The language produced as I lie there, curled in a little ball: Even if I hesitate to call it “natural,” there’s an intimacy to it. Most of its content is basically meaningless to me – I can barely keep track of what I wrote where – and yet I experience it as a meaningful emission, some cast-off part of myself.
“justgirlythings” post c. early 2010s, from the author’s internet ephemera archive
Inevitably, in matters of intimacy, questions of consent cascade. And usually, in interactions with technology, “consent” is perverted to the point of profanity. Look, for instance, at the rhetorical gestures toward “consent” in stuff like online privacy regulations. Thanks to the EU’s GDPR legislation, websites harangue us with popups demanding “consent” to accept browser cookies. We reluctantly grant permission as a matter of obligation – something we would recognize, in the context we mainly associate with consent (sex), as precisely its opposite: coercion.
“I honestly really think we need to move on from consent,” the artist Sophia Giovannitti told me last spring. “I think autonomy is a better framework.” Meaningful consent, we agreed, feels untenable – farcical, even – under the current, corporately centralized conditions of the internet. These intertwined concepts – subjectivity and expression, language and consent – are themselves hardly stable or self-contained to begin with. The more you pull on any of these subjects, the more it unravels.
Speech, out loud or online, is temporal and relational; it always implies an Other, a subject speaking and someone – or something – being spoken to. Consent, too, implies duration, temporality, and ongoing negotiation over its renewal; one must comprehend its context in order to offer it fully. And if you can’t revoke it, can you give it in good faith in the first place? It’s trite to remark that the internet never forgets. But recent developments in NLP make this axiom more obvious – and more consequential – with each passing day, each accreting bit in the data hoard.
Writing a column, not unlike posting on social media, is ultimately about repeatedly incanting in response to the riddle of the internet: What’s on your mind? For half a year, this draft sat on the back burner. Developments in AI chugged along, but my attempt to express myself stagnated, in no small part because I feared fixity and the foreclosure of reconsideration that publishing seemed to entail. This fear only amplified when I thought about NLP: language, and the intentions that underlie it, being fossilized and systematized as mere information. All writing is, in some sense, frozen in time; but couldn’t that same tragic realization also imply, if counterintuitively, some sort of opening? That words are liable to lose their meaning is the only “natural” thing about language.