AI Training Overloads Put Online Infrastructure at Risk

The information displayed in the AIM should not be reported as representing the official views of the OECD or of its member countries.

The Wikimedia Foundation has raised alarms over AI training methods that use automated web crawlers to extract vast amounts of data, overwhelming Wikipedia and related servers. This excessive data scraping is causing rising operational costs and poses risks of service disruptions, highlighting a growing digital infrastructure hazard.[AI generated]
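The overload described here comes from crawlers that ignore conventional crawl etiquette. For contrast, a minimal sketch of a "polite" crawler gate in Python, honoring robots.txt rules and throttling requests — the rules, bot name, and URLs below are illustrative, not Wikimedia's actual policy:

```python
import time
from urllib import robotparser

# A minimal "polite crawler" gate: honor robots.txt rules and
# throttle requests with a fixed delay. The robots.txt content is
# inlined here for illustration; a real crawler would fetch it.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 5
Disallow: /w/
Allow: /wiki/
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

def allowed(url: str, agent: str = "ExampleBot") -> bool:
    """Return True if the robots.txt rules permit this agent to fetch url."""
    return rp.can_fetch(agent, url)

def polite_fetch(urls, delay: float = 5.0):
    """Yield only the permitted URLs, pausing between requests."""
    for url in urls:
        if allowed(url):
            yield url          # a real crawler would download here
            time.sleep(delay)  # throttle: one request per `delay` seconds

print(allowed("https://en.wikipedia.org/wiki/Example"))  # article pages allowed
print(allowed("https://en.wikipedia.org/w/index.php"))   # /w/ paths disallowed
```

The crawlers in this story do the opposite on both counts: they bypass robots.txt directives and issue requests at volumes that, per Wikimedia, drove bandwidth up by 50%.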

Why's our monitor labelling this an incident or hazard?

The event involves AI systems (large language models) that require massive data scraping from Wikimedia sites via automated bots (crawlers). This automated AI-driven data extraction has directly led to operational disruptions and increased costs for Wikimedia, which is a form of harm to infrastructure and community resources. The harm is realized, not just potential, as Wikimedia reports slowdowns and resource strain. Hence, it meets the criteria for an AI Incident because the AI system's use (data scraping for training) has caused harm (disruption and resource strain) to a critical information infrastructure.[AI generated]
AI principles
Accountability; Robustness & digital security; Safety; Sustainability; Transparency & explainability

Industries
IT infrastructure and hosting; Media, social platforms, and marketing

Affected stakeholders
Business; General public

Harm types
Economic/Property; Public interest

Severity
AI incident

Business function:
Research and development

AI system task:
Other


Articles about this incident or hazard

The race to train AI is changing the internet

2025-04-20
Wired Italia
Why's our monitor labelling this an incident or hazard?
The event involves AI systems (large language models) that require massive data scraping from Wikimedia sites via automated bots (crawlers). This automated AI-driven data extraction has directly led to operational disruptions and increased costs for Wikimedia, which is a form of harm to infrastructure and community resources. The harm is realized, not just potential, as Wikimedia reports slowdowns and resource strain. Hence, it meets the criteria for an AI Incident because the AI system's use (data scraping for training) has caused harm (disruption and resource strain) to a critical information infrastructure.
Wikipedia fights AI bots by giving developers a dedicated dataset for training LLM models

2025-04-17
Hardware Upgrade - Il sito italiano sulla tecnologia
Why's our monitor labelling this an incident or hazard?
The event involves an AI system context (training of large language models) and addresses an issue caused by AI-related activities (scraping by bots for AI training). However, it does not describe any harm or incident caused by AI systems, nor does it indicate a plausible future harm. Instead, it is a governance and ecosystem response to an AI-related challenge, providing complementary information about efforts to mitigate infrastructure strain and improve AI training data availability. Therefore, it fits the category of Complementary Information rather than an Incident or Hazard.
Wikipedia Faces Flood of AI Bots That Are Eating Bandwidth, Raising Costs

2025-04-02
PCMag UK
Why's our monitor labelling this an incident or hazard?
The event involves AI systems (AI scraper bots) whose use is causing substantial resource strain and increased costs for Wikipedia. However, there is no indication that this has led to direct or indirect harm as defined by injury, rights violations, or disruption of critical infrastructure. The issue is primarily about increased operational costs and infrastructure strain, which is a significant concern but does not meet the threshold for an AI Incident or AI Hazard. The article mainly reports on the ongoing situation and Wikipedia's response plans, making it Complementary Information that provides context and updates on AI ecosystem impacts and governance responses.
Wikipedia is struggling with voracious AI bot crawlers

2025-04-02
engadget
Why's our monitor labelling this an incident or hazard?
The event involves AI systems explicitly (AI crawler bots) whose use is directly causing harm by disrupting Wikimedia's service availability and increasing operational costs, which affects the community relying on free access to information. This fits the definition of an AI Incident as the AI system's use has directly led to harm to communities (disruption of access to information) and harm to infrastructure (strain on Wikimedia's servers). The harm is realized, not just potential, and the AI system's role is pivotal in causing the disruption. Therefore, this event is best classified as an AI Incident.
AI crawlers cause Wikimedia Commons bandwidth demands to surge 50% | TechCrunch

2025-04-02
TechCrunch
Why's our monitor labelling this an incident or hazard?
The event involves AI systems in the form of automated AI crawlers scraping Wikimedia Commons content to train AI models. This activity is causing a surge in bandwidth demand and operational challenges for Wikimedia, which could plausibly lead to harm such as disruption of open access to knowledge resources (harm to communities) if the trend continues unchecked. However, the article does not report any realized harm yet, only increased costs and risks. Therefore, this qualifies as an AI Hazard rather than an AI Incident. It is not merely complementary information because the focus is on the potential harm from AI crawler activity, not on responses or ecosystem context alone.
Wikipedia Sees A 50 Percent Bandwidth Increase Due To AI Bot Crawlers

2025-04-03
Lowyat.NET
Why's our monitor labelling this an incident or hazard?
An AI system (AI bot crawlers) is involved as they are automated agents using AI to crawl and download content. However, the event does not describe any direct or indirect harm resulting from this activity, only increased bandwidth usage and operational challenges. There is no indication of injury, rights violations, or disruption of critical infrastructure. The event highlights a current operational challenge and a call for awareness and donations, which fits the definition of Complementary Information rather than an Incident or Hazard.
Wikipedia servers are struggling under pressure from AI scraping bots

2025-04-03
TechSpot
Why's our monitor labelling this an incident or hazard?
The article explicitly mentions AI scraping bots (AI systems) causing a 50% increase in bandwidth usage and network congestion on Wikimedia servers, leading to temporary connection route congestion. This disruption affects the management and operation of Wikimedia's critical infrastructure, which serves billions of users worldwide. The harm is realized as service disruption and resource strain due to AI system use. Therefore, this qualifies as an AI Incident under the definition of disruption of critical infrastructure (b) and harm to communities (d).
AI bots strain Wikimedia as bandwidth surges 50%

2025-04-02
Ars Technica
Why's our monitor labelling this an incident or hazard?
The event involves AI systems explicitly: automated bots scraping data to train AI models (LLMs). The use of these AI systems has directly led to harm in the form of operational disruption and financial strain on Wikimedia's infrastructure, which is critical for public knowledge access. This harm affects the community relying on Wikimedia's services and threatens the sustainability of open knowledge platforms, which can be considered harm to communities and property (infrastructure). Therefore, this qualifies as an AI Incident because the AI system's use has directly caused significant, clearly articulated harm. The article does not merely warn of potential harm but documents ongoing, realized impacts.
Nintendo unveils Switch 2 details: a 7.9'', 120fps, 1080p display, larger Joy-Con controllers, 256GB of storage, 4K games at 60fps when docked, launching June 5

2025-04-02
Techmeme
Why's our monitor labelling this an incident or hazard?
The event involves AI systems (automated bots scraping data to train AI models) whose use has directly led to increased operational costs for Wikimedia. While this is a financial and resource burden rather than physical harm, it constitutes harm to property and community resources. Therefore, this qualifies as an AI Incident due to realized harm caused by AI system use.
Wikimedia Complains About AI Bots Scraping As It Strains Servers, Causing Bandwidth to Surge by 50%

2025-04-03
Tech Times
Why's our monitor labelling this an incident or hazard?
The event explicitly involves AI systems (AI scraping bots) that are used to collect data without authorization, leading to copyright infringement and operational strain on Wikimedia's infrastructure. The unauthorized scraping directly harms Wikimedia by overloading their servers and violating their intellectual property rights. These harms fall under the AI Incident definition, as the AI system's use has directly led to violations of intellectual property rights and harm to property (server resources).
Wikimedia Foundation bemoans AI bot bandwidth burden

2025-04-03
TheRegister.com
Why's our monitor labelling this an incident or hazard?
The article clearly involves AI systems, specifically AI training models that rely on data scraped by automated bots. The Wikimedia Foundation's infrastructure is being heavily burdened by these AI-related bots, which is a consequence of AI development and use. However, the article does not report any injury, rights violation, or other harm caused by these AI systems. Instead, it highlights a growing operational challenge and potential future risks, along with responses and mitigation efforts. Therefore, this qualifies as Complementary Information, providing context and updates on AI ecosystem impacts and responses, rather than an AI Incident or AI Hazard.
AI scraper bots putting costly strain on Wikimedia infrastructure

2025-04-04
Institution of Engineering and Technology
Why's our monitor labelling this an incident or hazard?
The presence of AI systems (automated scraper bots) is clear, and their use is causing increased load and financial costs for Wikimedia. However, these impacts do not meet the threshold of harm defined for AI Incidents, such as injury, rights violations, or critical infrastructure disruption. Nor does the article suggest plausible future harm beyond ongoing operational strain. The article primarily provides contextual information about the impact of AI bots on Wikimedia's infrastructure and the organization's response efforts, fitting the definition of Complementary Information rather than an Incident or Hazard.
AI Bots Strain Wikimedia Commons as Bandwidth Surges 50%

2025-04-03
Gadget Review
Why's our monitor labelling this an incident or hazard?
An AI system (automated AI bots) is explicitly involved in scraping content to train AI models, which is causing a substantial operational burden on Wikimedia Commons. Although no immediate harm such as injury or rights violations has occurred, the event plausibly leads to harm by threatening the sustainability of a critical open-access platform and potentially restricting access to information, which can be considered harm to communities and to the knowledge-sharing ecosystem. Therefore, this event fits the definition of an AI Hazard, as the AI system's use could plausibly lead to significant harm in the future if the strain forces access limitations or paywalls.
AI crawlers cause Wikimedia Commons bandwidth demands to surge 50% - RocketNews

2025-04-02
RocketNews | Top News Stories From Around the Globe
Why's our monitor labelling this an incident or hazard?
The event involves AI systems in the form of automated scraper bots used to collect data for AI training, which is causing increased resource consumption and operational strain on Wikimedia Commons. While this represents a challenge and potential risk to the Wikimedia infrastructure and its sustainability, the article does not report any actual harm such as service outages, legal violations, or other direct negative impacts. Therefore, it does not meet the threshold for an AI Incident. It also does not describe a plausible future harm scenario beyond the current operational strain, so it is not an AI Hazard. Instead, it provides complementary information about the impact of AI-related activities on Wikimedia's infrastructure and their mitigation efforts.
AI's insatiable demand for data is crushing Wikimedia's infrastructure

2025-04-02
Constellation Research Inc.
Why's our monitor labelling this an incident or hazard?
The event involves AI systems (automated data scraping bots) whose use is causing Wikimedia's infrastructure to be heavily taxed, leading to increased costs and challenges to Wikimedia's sustainability. Although no direct physical harm or rights violation is reported, the significant load on critical infrastructure and the threat to the community's ability to maintain free content represent a plausible risk of harm. The event does not describe an actual incident of harm realized but rather a growing problem that could lead to harm if not addressed. Therefore, it is best classified as an AI Hazard rather than an AI Incident or Complementary Information.
AI slows Wikipedia down, web encyclopedia under strain - Future Tech - Ansa.it

2025-04-03
ANSA.it
Why's our monitor labelling this an incident or hazard?
The article explicitly mentions AI generative models being trained using Wikipedia content via automated crawlers, indicating AI system involvement. The increased traffic causes resource strain and cost increases for Wikimedia, but no harm to people, rights, or critical infrastructure is reported. The event does not describe any realized harm or a plausible future harm leading to an AI Incident or Hazard. Instead, it focuses on the impact on infrastructure and potential governance responses, fitting the definition of Complementary Information as it enhances understanding of AI's ecosystem effects and organizational responses.
Wikipedia under strain: traffic up 50% in one year. What is happening?

2025-04-03
Tiscali Notizie
Why's our monitor labelling this an incident or hazard?
The article explicitly mentions AI generative models using automated crawlers to extract content from Wikipedia, which is an AI system involvement in the use phase (training data collection). However, the harm described is limited to increased traffic and infrastructure costs, which do not constitute injury, rights violations, or other significant harms as defined. The Wikimedia Foundation's discussion of possible solutions and guidelines is a governance response to this AI-related impact. Hence, the event is Complementary Information rather than an Incident or Hazard.
AI slows Wikipedia down, the encyclopedia struggling under heavy traffic

2025-04-03
Sky
Why's our monitor labelling this an incident or hazard?
The article explains that AI systems (generative AI models) use automated crawlers to extract Wikipedia content for training, causing increased traffic and resource strain. While this is an AI-related development with operational challenges and potential future risks, it does not constitute an AI Incident because no harm has occurred. It also does not meet the threshold for an AI Hazard since the risk is operational and economic strain rather than plausible harm to health, rights, or infrastructure. The article mainly provides contextual information about AI's impact on Wikipedia and Wikimedia's response considerations, fitting the definition of Complementary Information.
Artificial intelligence: unrestrained bots bring Wikipedia to its knees

2025-04-03
Hardware Upgrade - Il sito italiano sulla tecnologia
Why's our monitor labelling this an incident or hazard?
The event involves AI systems (automated web-scraping bots used to train AI models) whose use is causing significant operational strain and financial costs to Wikipedia. However, the article does not describe any realized harm such as injury, rights violations, or service disruption causing harm to users or communities. The harm is potential and operational, with Wikimedia aiming to reduce bot traffic to prevent future issues. This fits the definition of an AI Hazard, as the development and use of AI systems could plausibly lead to harm (service instability, reduced capacity for critical community needs) if not managed properly.
Runaway traffic puts Wikipedia under strain: AI bots are to blame

2025-04-03
Today
Why's our monitor labelling this an incident or hazard?
An AI system (generative AI models) is involved indirectly through the use of automated bots that extract content for training. However, the event does not describe any realized harm or incident caused by AI use, only increased resource usage and potential future challenges. The Foundation's response and consideration of guidelines is a governance and ecosystem development matter, enhancing understanding of AI's impact but not reporting an incident or hazard. Therefore, this is Complementary Information.
AI puts Wikipedia under pressure: traffic rising sharply

2025-04-04
HTML.it
Why's our monitor labelling this an incident or hazard?
The event involves AI systems (generative AI models using crawlers to collect data) whose use is causing increased traffic and operational strain on Wikipedia's infrastructure. While no direct harm such as injury or rights violations is reported, the increased load could plausibly lead to disruption or degradation of Wikipedia's service, which is critical infrastructure for information access. The Wikimedia Foundation is considering measures to manage this risk, indicating recognition of a potential hazard. Hence, this is best classified as an AI Hazard rather than an Incident or Complementary Information.
The AI crawler attack on Wikimedia

2025-04-02
Punto Informatico
Why's our monitor labelling this an incident or hazard?
The event involves AI systems (crawler AI) whose use has directly led to resource strain and potential disruption of Wikimedia's critical infrastructure, which hosts widely used open knowledge resources. The Wikimedia Foundation's intervention to block these crawlers to avoid service degradation shows that harm (disruption) is occurring or imminent. Therefore, this qualifies as an AI Incident due to disruption of critical infrastructure caused by the use of AI systems (scraping bots).
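Blocking of this kind is typically implemented at the network edge by matching the request's User-Agent header against a list of known AI crawlers. A minimal Python sketch of such a gate — the listed names are real crawler identifiers, but the rule set is illustrative, not Wikimedia's actual configuration:

```python
# A simplistic server-side gate of the kind operators use to shed
# AI-crawler load: match the User-Agent header against a blocklist.
# The blocklist entries are illustrative, not Wikimedia's real rules.
BLOCKED_AGENTS = ("GPTBot", "CCBot", "Bytespider", "ClaudeBot")

def should_block(user_agent: str) -> bool:
    """Return True if the request's User-Agent matches a blocked crawler."""
    ua = (user_agent or "").lower()
    return any(marker.lower() in ua for marker in BLOCKED_AGENTS)

print(should_block("Mozilla/5.0 (compatible; GPTBot/1.0)"))  # blocked
print(should_block("Mozilla/5.0 (Windows NT 10.0)"))         # passed through
```

The obvious limitation, and part of why the strain persists, is that User-Agent strings are self-declared: crawlers that spoof a browser identity pass straight through a filter like this, pushing operators toward heavier measures such as rate limiting and behavioral detection.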
Too many AI accesses: Wikipedia in crisis

2025-04-03
Prima Comunicazione
Why's our monitor labelling this an incident or hazard?
The article explicitly mentions AI generative models using crawler bots to extract content from Wikipedia, which is an AI system's use. The increased traffic causes infrastructure strain and potential risks and costs, indicating plausible future harm to the operation of critical infrastructure (Wikipedia's data centers). However, no actual harm or incident has occurred yet, only a risk of such harm. The Foundation is considering mitigation measures, but these are prospective. Hence, the event fits the definition of an AI Hazard, as the AI system's use could plausibly lead to an AI Incident (infrastructure disruption) if not addressed.
AI bots are killing Wikipedia: its traffic has risen by 50%

2025-04-04
Computer Hoy
Why's our monitor labelling this an incident or hazard?
The article explicitly mentions AI bots scraping Wikimedia content massively, causing server overload and threatening the nonprofit's sustainability. The AI systems' use directly leads to harm to the Wikimedia community and infrastructure, fitting the definition of an AI Incident. The harm is realized (not just potential), as Wikimedia reports significant traffic from AI bots causing operational strain and risking the foundation's viability. This is a clear case of harm to community resources caused by AI system use, meeting the criteria for an AI Incident rather than a hazard or complementary information.
The end of Wikipedia? How AI bots are taking control

2025-04-04
LaPatilla.com
Why's our monitor labelling this an incident or hazard?
The event involves AI systems (bots) extracting data for AI training, which is an AI-related activity. However, there is no indication of realized harm or plausible future harm described in the article. The focus is on describing the phenomenon and its scale, not on any incident or hazard. Therefore, this is best classified as Complementary Information, as it provides context and understanding about AI data usage trends without reporting an AI Incident or AI Hazard.
Wikipedia at risk of digital collapse: AI bots already generate most of its traffic

2025-04-04
Semana.com Últimas Noticias de Colombia y el Mundo
Why's our monitor labelling this an incident or hazard?
The article explicitly identifies AI-driven bots as the primary cause of increased traffic that is degrading Wikimedia's service performance and increasing operational costs. This constitutes harm to the community relying on Wikimedia for free knowledge access and harm to the infrastructure supporting it. The AI systems' use directly leads to these harms, fulfilling the criteria for an AI Incident. The harm is realized (service slowdowns and resource strain), not just potential, and the AI system involvement is clear and central to the event.
"AI crawlers are killing the Internet": bots endanger even Wikipedia

2025-04-03
Genbeta
Why's our monitor labelling this an incident or hazard?
The article explicitly mentions AI-powered bots scraping content at a scale that overloads Wikimedia and open-source projects, causing slowdowns, increased costs, and risk of service unavailability. The AI systems (bots) are directly responsible for this harm by their use in data extraction for AI training. This harm affects communities relying on open knowledge and the infrastructure (property) supporting it. The harm is materialized and ongoing, not just a potential risk. Hence, it meets the criteria for an AI Incident rather than a hazard or complementary information.
Wikipedia faces a problem caused by artificial intelligence bots | Tendencias

2025-04-02
La Cuarta
Why's our monitor labelling this an incident or hazard?
The article explicitly mentions AI bots extracting content to train generative AI models, which is an AI system use. The harm is indirect but material: increased bandwidth usage, slower access for users, higher costs, and potential impact on Wikimedia's sustainability and user base. These effects constitute harm to the community and property (infrastructure and resources). Therefore, this qualifies as an AI Incident due to realized harm caused by AI system use.
its traffic has risen by 50%

2025-04-04
esdelatino.com
Why's our monitor labelling this an incident or hazard?
The article explicitly mentions AI bots scraping Wikimedia Commons content massively, causing server overload and operational strain on Wikimedia's infrastructure, which is a non-profit relying on donations. This overuse harms the Wikimedia community by threatening the sustainability of their free content platform. The AI systems (bots) are directly involved in causing this harm through their use. Although the harm is not physical injury or legal violation, it is a significant harm to property and community resources, fitting the definition of an AI Incident. The harm is realized, not just potential, as Wikimedia reports a 50% increase in bot traffic causing overload.
Wikimedia warns about the problems caused by bots that strip the content from its catalogue

2025-04-04
NoticiasDe.es
Why's our monitor labelling this an incident or hazard?
The event involves AI systems (bots) that are used to collect data for AI training, which is causing a disruption in the operation of Wikimedia's critical infrastructure by overloading bandwidth and slowing service. This fits the definition of an AI Incident under category (b) - disruption of the management and operation of critical infrastructure. The harm is realized as Wikimedia experiences service slowdowns and increased costs due to the bots' activity. Therefore, this is classified as an AI Incident.