De Groene Amsterdammer Blog Post by Sanne Bloemink
21 August 2024
Half of all languages are currently endangered. The Sateré-Mawé in Brazil aim to prevent the loss of their language by digitizing it. But is that even possible without Big Tech? And who actually owns a language?
On an island in New Guinea, there are two villages with about 130 inhabitants. They speak Kalamang, a language that had never been written down—until linguist Eline Visser decided to dedicate her PhD research at Lund University (Sweden) to it. She started with the basics. “You point to your hand and phonetically write down the word for ‘hand.’ Then you point to your foot. From there, you build everything out.”
We are in a classroom on the Binnen Gasthuis campus in the center of Amsterdam. Visser shows a video of everyday conversations in Kalamang. “I caught a lobster,” someone says. “Oh, are there more lobsters over there?” another person asks. “Yes, a lot more.” Based on her research, Visser created a dictionary and a grammar book. She also developed a kind of dictionary app with accompanying pictures. This has contributed to a greater sense of pride among Kalamang speakers. “They feel: our language exists, it is real—even to the outside world.”
Visser also created a geographic map. During a boat trip around the island, she and a small group recorded all indigenous names along the coast using photos and GPS coordinates. On this particular island, there are no land rights conflicts, so the map had no political implications. But in other regions, such a map could very well have political significance. “It can demonstrate that a people with their own language have a long-standing history in a specific place.”
Documenting and digitizing an Indigenous language can thus give its speakers more than pride or a sense of legitimacy; it can also support legal claims to land. The group gathered at the Binnen Gasthuis agrees that researchers, especially those from the West, should always ask why a language needs to be documented and digitized, and that the answer must come from the community itself. Many Indigenous peoples have had bad experiences with Western researchers who arrive, decide for themselves what is ‘interesting,’ and conduct studies without considering the interests of the community involved.
Currently, there are about 7,000 languages in the world, but every two weeks one disappears. This can happen suddenly—for instance, through genocide, as occurred with many Indigenous languages in North America after the arrival of Europeans. But it can also be a gradual process: people become bilingual, speaking both their Indigenous language and a dominant language such as English or Spanish. Their children speak only a little of the original language, and by the next generation, it is lost entirely.
In rare cases, a dead or dormant language can be revived. Perhaps the most notable example is Hebrew, which for centuries was barely spoken and served mainly as a liturgical language. But in the late 19th and early 20th centuries, it was brought back as a spoken language.
Language extinction is, in itself, a natural process. Just as death is part of life, the disappearance of languages is part of human evolution. But the current pace, especially of the loss of Indigenous languages, is alarmingly fast. About half of all languages are now endangered. And just as biodiversity contributes to the resilience of ecosystems, rich linguistic diversity strengthens human resilience.
Two leaders—father and son—from the Sateré-Mawé, an Indigenous people of about 10,000 in the Amazon region of Brazil, listen closely to Visser. While others at the Binnen Gasthuis sip coffee, João Sateré, the father, grinds powder from a dried guaraná fruit using a rasp. Guaraná is a climbing plant that grows only in the Amazon. “This is much stronger than your coffee,” he says, smiling as he places a headpiece of shiny blue feathers on his head. He tells of a Spanish researcher who once came to their area and decided the Sateré-Mawé language should be translated into Spanish, “because after all, it’s South America.” But Portuguese, not Spanish, is spoken in Brazil, alongside many Indigenous languages. When the researcher died in an accident, the project ended as abruptly as it began. “We never heard anything about it again,” the father laughs.
In the Sateré-Mawé region, there are indeed land rights conflicts. This is one of the key reasons they want to document their language: it can help them assert legal claims. Another reason is the fear that the language will disappear if it isn’t documented soon. Once it is digitized, teaching materials can be developed so that children can learn the language. That has not always been possible.
Josias Sateré, João’s son and also a leader, is working on a PhD focused on constructing an Indigenous curriculum to complement the Western materials currently used in Amazonian schools. “Forty years ago, people from the cities came and taught our children Portuguese in schools. All legal documents were in Portuguese. It was an attempt to erase our culture and language, a form of control. A grim dictatorship banned Indigenous languages, so ours was only spoken informally.” Later, the constitution guaranteed that Indigenous languages could be spoken and taught, but a divide emerged between the language of city dwellers and that of those who remained in the territory.
Writing down the Sateré-Mawé language is also complicated for another reason: there are regular disagreements about word meanings. “Our people are spread across three regions along three rivers. Each group interprets the language differently. It’s hard to reach consensus on meanings. You have to negotiate and compromise.” These negotiations are difficult, but Josias sees their value. “They strengthen the language, and the teachers we train function as ‘language diplomats.’ In doing so, we’re getting closer to the meaning of our oral traditions.”
Recording an oral language inevitably leads to loss of nuance. The way you pronounce a word—with a specific click of the tongue or a gesture—is hard to digitize. Josias is aware of this but sees the current “language negotiations” as deeply valuable. João adds, “There’s a core to our oral tradition and culture that we all still feel. Older teachers sometimes clash with younger ones who speak a different variety of Sateré, but they negotiate. We are in the middle of that process, and it’s good that we’re in the middle—not the beginning.”
Every language offers a unique perspective on the world. In fact, various studies show that language shapes perception. Some languages, such as Breton, spoken in Brittany in France, do not distinguish between the shades of green, blue, and gray in the color of the sea. Others, like Russian, treat light blue and dark blue as separate basic colors with different words.
Take the Kuuk Thaayorre language, spoken by an Aboriginal people in Queensland, Australia. They have no concept of left or right but possess an extraordinary sense of orientation—so extraordinary that scientists once thought such skill was impossible without being a migratory bird or a turtle. If you speak Kuuk Thaayorre and greet someone, you must say in which direction you're heading. For example, “I'm going south-southwest, and you?” You literally can't greet someone without orienting yourself. So from a young age, you always know where you are in relation to the landscape. Apparently, this is a cognitive skill humans can develop.
Lera Boroditsky, professor of cognitive science at the University of California, San Diego, discusses Kuuk Thaayorre in a TED Talk, using it to show how language shapes thought. While a Kuuk Thaayorre speaker can be awakened in the middle of the night and still point to true north, Westerners are increasingly losing their sense of direction: look at how many people blindly follow Google Maps and drive into a river.
To raise awareness about the alarming rate of language extinction, UNESCO launched the “Decade of Indigenous Languages” in 2022. A key focus is digital archiving to help preserve languages and counteract the marginalization of Indigenous groups. The UN Permanent Forum on Indigenous Issues has called on Big Tech to support the development and accessibility of digital tools that promote the use of Indigenous languages.
While the Forum advocates for active involvement from Indigenous communities, some scholars question the role Big Tech should play. These corporations wield immense power and have amassed vast amounts of data. Since 2000, computing power and the volume of data have exploded. By 2025, some estimate we’ll have as many bits as grains of sand on Earth. This data explosion has changed how we approach technology, especially language technology. The idea now is that we don’t need to understand the rules of a language; we can infer patterns from the data. Knowledge no longer flows top-down, but bottom-up from large, complex datasets.
This gave rise to large language models (LLMs), the foundation of tools like Google Translate and ChatGPT. For linguistics, this was a seismic shift: suddenly, the impossible became possible. But scientists still don’t fully understand how this technology works. What is clear is that language tech improves with more data. And since Big Tech companies have not only the computing power but also the infrastructure, they dominate the field by harvesting the entire internet to train their models. In practice, nearly all language technology falls into the hands of Big Tech.
That’s a problem, says Gábor Bella, senior researcher and lecturer at IMT Atlantique in Brittany. In Mongolia, where he worked with colleagues on language tech, Google proposed adding Mongolian to Google Translate. They asked for parallel corpora—texts in both Mongolian and English. “They asked my colleagues to provide them. Their argument was: this benefits everyone. So Google got the texts for free and now owns Mongolian Google Translate. This put local researchers at a disadvantage—they can no longer compete.”
This case isn’t unique. Bella knows of others where Big Tech obtained data this way and claimed intellectual property. But how bad is it if Google owns Mongolian Google Translate? The data is publicly accessible, right?
At first glance, it seems wonderful if Big Tech helps preserve languages. But Paula Helm, senior researcher in critical AI studies and empirical ethics at the University of Amsterdam, has deep concerns. Tech companies have commercial interests and focus mainly on economically dominant languages like English. If all translations go through English, much meaning is lost. For instance, English has only one word for “rice,” but Swahili and Japanese distinguish between cooked and uncooked rice. A translation routed through English can drop that distinction and easily become inaccurate.
With rice, this may not be a major issue, but in asylum procedures, incorrect translations can have serious consequences. In The Guardian, Damian Harris-Hernandez of the Refugee Translation Project describes how AI translations—usually provided by Big Tech—are unreliable for languages that differ greatly from English or are poorly documented. Volunteers from Respond Crisis Translation recount how asylum claims were denied due to mistranslations—for example, translating “I” as “we,” making it seem like multiple people applied. Or the case of a woman who called her father “mi jefe” (a common informal Spanish term for “my father,” literally “my boss”), which led to her claim being denied because abuse by an employer isn’t classified as domestic violence.
Bella understands the temptation to work only with data already online. “It’s hard to work with local communities. It takes money, years, and doesn’t fit well into research grants. It’s easier to sit at your desk. It scales better. I’ve done it myself—and sometimes it works. But often I wonder if it’s right. There’s an ethical issue: handling community data this way is often exploitative. And the research output suffers in quality if you don’t work with communities directly. So much nuance is lost.”
Bella has spent years developing language tech for under-resourced and endangered languages: Scottish Gaelic, South African languages, Arabic dialects, and Thai (not endangered). “We always start exploratively, developing methodology. The principle is that the community must set the agenda. They decide what to build and why. We collaborate on mutual knowledge exchange. And the local community retains intellectual property over everything we build.”
The Sateré-Mawé now use WhatsApp groups for their “language negotiations,” where they document and decide on word meanings. WhatsApp is practical given the distance between their three settlements. But ideally, they would own the data themselves. For now, it belongs to Meta (formerly Facebook), the owner of WhatsApp. Meta may have little interest in Sateré-Mawé data, but the situation raises a critical question: who owns a language?
Researcher Paula Helm thinks the problem stems from a “strong individualist notion of privacy.” “In Western societies, personal data is tightly protected. But language is seen as public, free for anyone, including Big Tech, to use. Yet language can be deeply personal, especially Indigenous languages tied closely to land and identity.”
The Kuuk Thaayorre, for instance, have a completely different concept of time. Unlike ours, their sense of time isn’t centered on the individual body but on the surrounding landscape. In other words, Kuuk Thaayorre speakers constantly orient themselves to nature. You might call it an ecocentric concept of time.
For Indigenous communities like the Sateré-Mawé or Kuuk Thaayorre, with long histories of oppression—including linguistic oppression—it’s vital that they retain control over their own language. It’s not just fair, says Helm, it’s essential: “We urgently need the value of Indigenous worldviews. The techno-capitalist, mono-humanist way of thinking doesn’t offer a sustainable path for life on this planet.”
Author: Sanne Bloemink
Published in De Groene Amsterdammer on 21 August 2024. Translated by the Institute for Advanced Study.