From co-generated data to generative AI

Intergovernmental  |  December 4, 2024

Fostering a digital ecosystem for AI

Within two months of its launch in late 2022, ChatGPT had 100 million monthly active users, making it the fastest-growing consumer application in history.1 While the core technology behind such generative artificial intelligence (AI) models has been around for some time, tools such as ChatGPT have made it accessible to the masses. As a result, there is much discussion about AI and new questions are being asked about the role data plays in making these tools so capable. 'Co-generated data' refers to data generated by multiple people or entities, not solely the data holder. Co-generation, however, is not limited to data. It occurs across the data ecosystem: when we scroll through social media, when we speak to our voice assistants and when we prompt conversational AI tools. In each of these examples there can be multiple data co-generators, with different levels of involvement in the co-generation process and awareness of its potential financial, social, legal or ethical implications. As generative AI enters the mainstream, we must update our thinking about co-generated data, technology and AI-generated works.

This report is part of the research project 'From co-generated data to generative AI,' documenting work conducted between June 2023 and May 2024. Commissioned by the Global Partnership on Artificial Intelligence (GPAI) and executed by the Open Data Institute (ODI) and Aapti Institute, with support from Pinsent Masons, the research involved analysing six co-generation scenarios. Because co-generation is a relatively new concept, we first observed and analysed cases involving co-generation in order to understand what kinds of rights may be at stake. This was done using a framework of legal rights developed through a literature review, supplemented by expert interviews and workshops. The six scenarios cover a wide range of types of co-generation, co-generators and legal contexts: remunerated work (Karya), collaborative crowd-sourced data collection (OpenStreetMap), social media platforms (Instagram), the Internet of Things (BMW connected cars owned by Europcar), and generative AI (Whisper and Midjourney).

Our findings indicate that co-generators in these scenarios navigate a complex web of rights, spanning intellectual property rights, data rights, and other rights such as labour rights. Nonetheless, significant gaps persist, particularly regarding the co-generation of technology, AI-generated works, and rights for communities and collectives. Where rights exist, they are often mediated through statutory exceptions (such as for text and data mining), the terms and conditions (T&Cs) of contracts or other mutual agreements, and licences, which do not always function as intended, as evidenced by the documented failures of T&Cs. Determining the existence and extent of rights over various types of co-generated data remains challenging, especially for generative AI models. Further research is essential to map the rights landscape fully, identify gaps, and explore non-legal mechanisms such as new technologies, governance models, and licences.


Disclaimer: The opinions expressed and arguments employed herein are solely those of the authors and do not necessarily reflect the official views of the OECD, the GPAI or their member countries. The Organisation cannot be held responsible for possible violations of copyright resulting from the posting of any written material on this website/blog.