The internet has become a global phenomenon, connecting over 5 billion people, or approximately 63% of the global population. With users speaking thousands of languages, one would expect a similar variety of languages online. However, an analysis by web-scanning firm W3Techs reveals large discrepancies in how languages are represented on the web. English is the primary language of over half of all websites, while languages like Chinese and Hindi, spoken by well over a billion people between them, have minimal online presence.
W3Techs surveys the technologies used on websites, including the content language of publicly accessible domains. Comparing its data with language usage statistics from Ethnologue, a renowned authority on global languages, shows that certain languages are grossly overrepresented while others are virtually absent online. English, German, and Japanese hold a much larger share of the web than their share of the world's native speakers would suggest. Conversely, many languages from outside Europe struggle to establish a meaningful presence online.
This linguistic imbalance is a cause for concern, particularly for international communities. UNESCO, as early as 2003, urged the public and private sectors to ensure online content is available in all human languages. However, as the internet expands, the gap between spoken languages and their representation online continues to widen.
Bhanu Neupane, a program manager at UNESCO focused on language inequity, worries about a future in which only a handful of languages dominate online discourse. The world is converging, Neupane says, and in 15 years there may be just a few languages prominent in business and online communication.
The data has limitations. The analysis is based on scans of publicly available websites, which exclude content behind login walls, such as apps and social networks. This may understate the presence of some languages, particularly on the Chinese internet. The scans may also miss non-English communities on English-language sites, which is another potential source of undercounting.
Nonetheless, the overarching issue remains evident. Millions of people whose first language is not English are compelled to navigate the web in a language other than their own. Furthermore, as publicly available text on the internet is increasingly used to train large language models like Bard and GPT-4, the same language imbalance is being built into artificial intelligence, exacerbating the issue.
Addressing the language disparity online is crucial for preserving linguistic diversity and ensuring equal access to information and opportunities in the digital age. Efforts to bridge this gap and promote multilingual representation on the internet are essential for a more inclusive and equitable online landscape.
By Impact Lab