The internet, with its hundreds of billions of indexed webpages, serves as a vast repository of modern life. However, despite its immense utility, online content frequently disappears from view. A new analysis by the Pew Research Center highlights the transient nature of the web:
- Disappearing Content: As of October 2023, a quarter of all webpages that existed between 2013 and 2023 are no longer accessible. Most often, this happens when an individual page is deleted from an otherwise functional website. Older content is even more affected; 38% of webpages from 2013 are no longer available, compared to 8% of pages from 2023.
This phenomenon, known as “digital decay,” affects various online spaces. The Pew analysis examined links on government and news websites, as well as the “References” section of Wikipedia pages as of spring 2023, and found:
- News Websites: 23% of news webpages contain at least one broken link. The problem is about equally common on high-traffic and low-traffic news sites.
- Government Websites: 21% of government webpages contain at least one broken link. Local government webpages are particularly prone to broken links.
- Wikipedia: 54% of Wikipedia pages have at least one reference link pointing to a non-existent page.
To understand how digital decay manifests on social media, Pew collected a real-time sample of tweets from X (formerly Twitter) during spring 2023 and monitored them for three months. The analysis found:
- Tweets: Nearly 20% of tweets are no longer publicly visible just months after being posted. In 60% of these cases, the account was made private, suspended, or deleted. In the other 40%, the individual tweet was deleted, but the account remained active.
- Tweet Language and Settings: Tweets in Turkish or Arabic disappear more frequently, with over 40% no longer visible within three months. Tweets from accounts with default profile settings also tend to vanish from public view more often.
Webpages from the Last Decade
To explore digital decay further, Pew analyzed a random sample of just under 1 million webpages from the archives of Common Crawl, an internet archive service. This sample included pages from each year between 2013 and 2023 (approximately 90,000 pages per year). The findings revealed that:
- Page Accessibility: 25% of webpages from 2013 to 2023 are no longer accessible as of October 2023. This includes 16% of pages that are individually inaccessible but come from otherwise functional domains, and 9% where the entire root domain is no longer functional.
- Older Pages: Pages from 2013 had the highest share of inaccessible pages (38%). Even among pages from 2021, about 20% were no longer accessible just two years later.
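The split described above, between a single page missing from a working site and a whole domain going dark, can be sketched as a small classifier. This is only an illustration under stated assumptions, not Pew's actual pipeline: the function names `classify_page` and `inaccessible_share`, the category labels, and the set of HTTP status codes treated as errors are all hypothetical choices made for the example.

```python
from typing import List, Optional

# Assumption: statuses treated as "page is gone" in this sketch only.
ERROR_STATUSES = {404, 410, 500, 502, 503}


def classify_page(domain_resolves: bool, status: Optional[int]) -> str:
    """Label a page from the outcome of one fetch attempt.

    domain_resolves: whether the root domain still answers at all.
    status: HTTP status code of the page request, or None if no response.
    """
    if not domain_resolves:
        return "domain_gone"   # the entire root domain is no longer functional
    if status is None or status in ERROR_STATUSES:
        return "page_gone"     # page missing on an otherwise working site
    return "accessible"


def inaccessible_share(labels: List[str]) -> float:
    """Fraction of pages that are inaccessible for either reason."""
    broken = sum(1 for label in labels if label != "accessible")
    return broken / len(labels)


# A toy sample of 100 pages mirroring the reported proportions.
sample = ["page_gone"] * 16 + ["domain_gone"] * 9 + ["accessible"] * 75
share = inaccessible_share(sample)
```

With this toy sample, the 16 page-level failures and 9 domain-level failures add up to the overall 25% figure.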
Links on Government Websites
Pew sampled around 500,000 pages from government websites using the Common Crawl March/April 2023 snapshot, including various government levels (federal, state, local, etc.). Key findings include:
- Link Prevalence: Government webpages contained 42 million links, with 86% being internal links. Around three-quarters of these webpages had at least one on-page link, with a typical page containing 50 links.
- Link Types: Most links lead to secure HTTP (HTTPS) pages; 6% point to static files (e.g., PDFs), and 16% redirect to a URL other than the one they originally pointed to.
- Broken Links: 6% of followed links were no longer accessible, and 21% of government webpages had at least one broken link. City government pages had the highest rates of broken links.
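The link categories used in this section (internal versus external, secure versus not, static file versus ordinary page) can be approximated from the URL alone. The sketch below uses Python's standard `urllib.parse`; the function name `describe_link`, the list of static-file extensions, and the rule that "internal" means "same hostname" are assumptions for illustration and may differ from the definitions Pew used. Detecting redirects and broken links would require actually requesting each URL, which is left out here.

```python
import posixpath
from urllib.parse import urlparse

# Assumption: file extensions counted as "static files" in this sketch.
STATIC_EXTENSIONS = {".pdf", ".doc", ".docx", ".xls", ".xlsx", ".csv", ".zip"}


def describe_link(page_url: str, link_url: str) -> dict:
    """Describe a link found on page_url using only the two URLs."""
    page = urlparse(page_url)
    link = urlparse(link_url)
    # Extension of the last path segment, lowercased for comparison.
    extension = posixpath.splitext(link.path)[1].lower()
    return {
        # Simplification: "www.example.gov" and "example.gov" count as different hosts.
        "internal": link.hostname == page.hostname,
        "secure": link.scheme == "https",
        "static_file": extension in STATIC_EXTENSIONS,
    }


report = describe_link(
    "https://www.example.gov/about",
    "https://www.example.gov/files/report.PDF",
)
outside = describe_link(
    "https://www.example.gov/about",
    "http://other.example.org/page",
)
```

Here `report` is flagged as an internal, secure link to a static file, while `outside` is an external, non-secure link to an ordinary page.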
Links on News Websites
The analysis of 500,000 pages from 2,063 news websites found:
- External Links: News sites contained over 14 million external links, with 94% of pages having at least one. The median page had 20 links, and pages in the top 10% by link count had 56 or more.
- Link Types: Most links go to secure HTTP (HTTPS) pages; 12% point to static files, and 32% redirect to a URL other than the one originally linked.
- Broken Links: 5% of all links on news site pages are no longer accessible, and 23% of news pages have at least one broken link. This issue is equally prevalent on both high-traffic and low-traffic sites.
Reference Links on Wikipedia
Pew’s analysis of 50,000 English-language Wikipedia pages revealed:
- Reference Links: 82% of these pages contained at least one reference link, totaling over 1 million reference links.
- Broken Links: 11% of all reference links are no longer accessible. On 2% of pages with reference links, every reference link was broken, while another 53% had at least one broken link.
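The link-level and page-level figures above measure different things: a modest share of broken links can still touch a majority of pages, because one dead reference is enough to count a page. The sketch below shows how both kinds of rate could be computed from per-page results. The function name `summarize`, the input layout, and the choice to use only pages with at least one reference link as the denominator are assumptions for illustration, and the sample data is invented.

```python
from typing import Dict, List


def summarize(pages: Dict[str, List[bool]]) -> Dict[str, float]:
    """Compute link-level and page-level breakage rates.

    pages maps a page title to one flag per reference link (True = broken).
    Pages without any reference links are excluded from every rate.
    """
    with_links = [flags for flags in pages.values() if flags]
    total_links = sum(len(flags) for flags in with_links)
    broken_links = sum(sum(flags) for flags in with_links)
    all_broken = sum(1 for flags in with_links if all(flags))
    some_broken = sum(1 for flags in with_links if any(flags) and not all(flags))
    n = len(with_links)
    return {
        "link_rate": broken_links / total_links,
        "pages_all_broken": all_broken / n,
        "pages_some_broken": some_broken / n,
        "pages_any_broken": (all_broken + some_broken) / n,
    }


# Invented sample: five pages, one of which has no reference links.
stats = summarize({
    "A": [True, True],
    "B": [True, False, False, False],
    "C": [False, False, False, False],
    "D": [False] * 10,
    "E": [],
})
```

In this invented sample only 3 of 20 links are broken, yet half of the pages with references contain at least one broken link.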
This analysis underscores the impermanence of online content, highlighting the challenges of preserving digital information over time.
By Impact Lab