Academia

How might standard contract terms help unlock responsible AI data sharing?

Two people negotiating a contract

As artificial intelligence (AI) technology advances, reshaping industries and society, the urgency to develop robust frameworks for responsibly sharing data for AI use cases becomes increasingly clear. High-quality datasets fuel the AI systems that have made breakthroughs possible in healthcare, environmental protection, social welfare, and in other realms over the past years.

Despite these potential benefits, significant barriers remain. For instance, data must be made available and shared ethically while navigating a complex array of legal requirements. Additional challenges—such as the infrastructure needed to store and process vast amounts of data and the substantial energy demands associated with it—lie beyond the scope of this discussion. Nonetheless, a confluence of ethical considerations and legal obligations can frequently compound with these other issues to slow progress.

Our recent research, carried out by the Open Data Institute (ODI) and the Pratt School of Engineering at Duke University, points to a solution. Developing standardised data licence templates can reduce the friction in data sharing and help pave the way for more inclusive, ethical, legally compliant, and effective AI-driven innovation across sectors.

Data licences are agreements that set forth conditions for accessing, using, and sharing datasets. When appropriately drafted, these licences could provide a foundation for more legally compliant and ethical data flows within the AI ecosystem and across commercial, non-commercial, and other sectors—from data holders to model developers and onward to downstream applications. However, current data licence practices remain highly fragmented. Many data-sharing agreements are customised, resulting in uncertainty and challenges that can disproportionately affect newer market entrants or those with limited resources.

Existing frameworks, such as Creative Commons licences, illustrate how standard contract terms can streamline the sharing of copyrighted materials. By establishing a global and multistakeholder process to adapt this approach to data for AI, the community can craft licence templates specific to the unique requirements of AI data. Doing so should increase efficiency, reduce legal complexity, and bolster inclusive access.

Building on previous work by GPAI (2022), GPAI (2023), and the research set forth in our new report, this blog will explore how standardised data licences can help overcome present-day AI data sharing barriers. Emphasising inclusive, transparent, and responsive licence-design features, alongside standardised licensing frameworks developed with broad and diverse stakeholder input, can help ensure that the benefits of AI research and innovation are distributed more widely, ethically, and equitably.

What we found

Our recent report is titled, “Unlocking Data Collaboration: A study on data sharing practices and developing standard data licence terms to promote access and social good.” This blog outlines our research, which involved 15 interviews and an electronic survey, engaging multiple stakeholders across eight jurisdictions. Overall, our findings underscore the urgent need for standardised data licences in AI to facilitate data sharing. We identified several insights that can help inform the development of these standard licences.

Advancing AI for social good

AI has the potential to help solve some of humanity’s most pressing challenges, ranging from improved healthcare outcomes to better disaster response. For example, various research organisations and non-profits in developing regions could use AI to address very specific problems, whether it be drought prediction or disease monitoring. However, these groups often lack the high-quality data necessary to create effective AI systems. At the same time, organisations working in languages other than English may face additional complexities—not just in the availability of non-English language datasets but also in ensuring that these resources are created and shared ethically, equitably, and without exploiting local communities. Voluntary and standard data licences, developed through an inclusive process, can help bridge these gaps by helping form contract tools that facilitate access to these data for specific stakeholders for well-defined social good uses, thereby fostering a more equitable AI ecosystem.

Developing standardised data licences

Our research highlights the need for standard data licences to address inconsistencies, ambiguities, and challenges in existing licensing practices. These licences should strike a balance between simplicity and flexibility, allowing various stakeholders to engage effectively while promoting inclusivity and global applicability.

Simplicity and standardisation

There is a great deal of inconsistency in the licence terms currently used for data sharing, which may lead to organisations being unable to fully understand their legal rights and obligations. Overly bespoke or inconsistent licensing terms can create substantial legal uncertainty and disproportionately burden smaller organisations or underrepresented communities that may lack sufficient resources to review or negotiate them. At the same time, exclusive or bespoke data-sharing arrangements can restrict broader access to high-quality data. Widely accepted standard data licences, with clear and consistent language and standard definitions, can help address these challenges by reducing transaction costs, increasing transparency, and promoting inclusive participation. Standard data licences would not necessarily displace bespoke licence terms in situations where bespoke terms might be more appropriate.

Balancing customisation

While standardisation is key to streamlining licence terms, the AI ecosystem’s diverse needs require some level of flexibility. Different data types, use cases, models, tasks, and legal requirements may warrant tailored clauses, especially in sensitive data and additional cultural and linguistic considerations. A series of standard licence terms and/or a modular approach—where core licence terms remain consistent, but optional provisions can be applied based on legal or regional requirements—could help reconcile the need for simplicity with the practical realities of varied stakeholders. Creative Commons currently offers a series of standard licence agreements. 

Global and multi-stakeholder engagement

The means through which these licences are developed is also important. Given the complex array of laws, cultural norms, and perspectives worldwide, they should result from invaluable contributions provided by a wide range of contributors, such as governments, academia, private companies, civil society organisations, data holders, and data users. This would make the licensing frameworks representative and encourage wider adoption.

Addressing legal and ethical ambiguities

Our research also provides a roadmap for drafting standard data licences with broad and diverse stakeholder input. It reveals some of the challenges with current approaches and how stakeholders would like to address these challenges in the future. 

Non-commercial use

The term ‘non-commercial’ is a common limitation in data licences, yet it is often undefined, leading to user uncertainty. Research indicates that organisations, including academic and non-profit entities, commonly report difficulties interpreting whether specific activities fall within this scope. This ambiguity can hinder data-sharing efforts, as stakeholders may hesitate to use datasets for fear of inadvertently breaching these terms. Clarifying the definition of ‘non-commercial’ within standardised licences could help address these challenges and promote greater trust and collaboration among data providers and users.

Attribution challenges

Existing licence agreements often include attribution requirements, but these are not always designed with AI-specific contexts in mind. For instance, traditional attribution models may not account for scenarios where training data are not directly reproduced in AI model outputs. Standardised licences could provide an opportunity to address these challenges by establishing clear and practical guidelines for attribution in the AI context. Incorporating technical solutions, such as automated attribution systems, could help further streamline compliance and encourage broader adoption of these licences.

Rights and liabilities

Standardised licence frameworks can provide essential clarity and flexibility in allocating contractual rights and liabilities in AI contexts. Voluntary, standardised data licences can offer stakeholders clearer pathways to define their roles and responsibilities by addressing, among other aspects, uncertainties surrounding the ownership and use of data, models, and outputs. This flexibility can also assist parties in aligning terms with their specific legal and operational requirements, thereby reducing potential conflicts and fostering more robust collaborations.

Ethical guidelines

There is interest in exploring how ethical considerations might inform the use of data and AI models, particularly in high-risk areas such as healthcare and biomedical research. However, we found that some practitioners expressed concerns about referencing ethical codes of conduct within licence terms. These codes can evolve, potentially leading to legal uncertainties or compliance challenges. The multi-stakeholder process should address these considerations when developing standard data licence terms. 

The role of technology in data licensing

Emerging technologies complement standardised licences to enhance transparency and efficiency in data sharing. For instance, data provenance and lineage-tracking tools improve accountability. Metadata standards such as Croissant can link datasets and licence terms for automated checks. In this way, licences can also be designed to align with developing best practices in AI governance, ensuring that data sharing is both responsible and ethical.

Our principal recommendations

New data licensing frameworks can facilitate responsible data sharing at scale by leveraging lessons learned and potential solutions to unique AI challenges. These insights may inform the development of these standard data licences, and our recommendations become clear.

Recommendations for developing standard data licences designed to incentivise widespread adoption

a table of why standard data licences should be developed

To summarise further, our recommendations seek to:

  1. Promote simplicity: a menu of standardised terms and/or a modular approach should be as accessible and straightforward as possible to minimise barriers to adoption.
  2. Promote adaptability: A menu of standard licence terms and/or a modular licence framework would allow stakeholders to meet their different needs while maintaining certain levels of consistency.
  3. Promote inclusivity: Multi-stakeholder processes should give special attention to under-represented groups to ensure that licences are fair and relevant worldwide, and encourage data access for AI use cases related to social good.
  4. Improve technical, legal and ethical interoperability: Serious attention must be given to developing terminology that clarifies existing legal ambiguities (e.g., the interpretation of non-commercial usage limitations and the application of attribution requirements), addresses the complexities of ethical norms and laws that differ across jurisdictions, and investigates how licences interact with and utilise new technologies to further these objectives.

These recommendations help to build an effective data-sharing framework to facilitate AI data access and promote social good.

Access to high-quality AI data while considering societal needs

Standardised AI data licences are a potential game-changer for enabling responsible AI data sharing. Legal uncertainties, promoting inclusivity, and enabling ethical practices are the areas in which these frameworks can facilitate access to high-quality AI data while considering societal needs. Addressing these challenges requires collaboration among stakeholders worldwide, yet the potential benefits are likely to outweigh the effort. We have the opportunity to shape an AI-driven future that is equitable, sustainable, and truly serves everyone.



Disclaimer: The opinions expressed and arguments employed herein are solely those of the authors and do not necessarily reflect the official views of the OECD or its member countries. The Organisation cannot be held responsible for possible violations of copyright resulting from the posting of any written material on this website/blog.