From scraping to ethical sharing: Initial considerations for Virtuous Innovative Approaches and Data Use Collaboration in AI Training (VIADUCT)
The rapid advancement of artificial intelligence depends on access to vast quantities of data throughout the AI lifecycle, yet current sourcing practices have proven legally and ethically contentious. Large-scale web scraping of publicly available, often copyrighted or personal content has underpinned widely used datasets such as Common Crawl and LAION-5B, frequently without explicit consent or compensation. This model has prompted growing legal and regulatory scrutiny, including numerous lawsuits and technical countermeasures against automated scraping. At the same time, industry leaders warn of “peak data”, as the supply of high-quality human-generated content diminishes while demand for specialised, expert-level data continues to grow. AI training data spans multiple governance regimes (copyright, personal data, trade secrets, government data and open data), each subject to distinct legal, technical and contractual constraints. Use of copyrighted works requires authorisation and enforceable opt-out mechanisms, personal data is regulated under frameworks such as the GDPR, trade secrets are protected by confidentiality obligations, and government and open data remain unevenly accessible due to sensitivity, infrastructure gaps and underinvestment. Existing responses, including opt-out protocols, privacy-preserving computation, licensing agreements and data attribution models, provide only partial remedies and face scalability or standardisation challenges. The resulting landscape is fragmented and transaction-intensive, underscoring the need for context-specific governance approaches that reconcile innovation with the rights and interests of data holders.