Beyond Training Sets: EUIPO Study Insights on GenAI and Copyright

By Gareth Kristensen, Jan-Frederik Keustermans, Alix Anciaux & Nicola Duemler on June 17, 2025

Introduction

This is the first part of a four-part series on the EUIPO study on GenAI and copyright. See here for parts 2, 3, and 4. In this first part, we set out the scope and purpose of the EUIPO study and offer a general refresher on the state of play in the current GenAI legal and technological ecosystem.

On May 12, 2025, the European Union Intellectual Property Office (EUIPO) released a lengthy 436-page report setting out the results of a comprehensive study exploring the development of generative AI (GenAI) from an EU copyright law perspective.[1]

GenAI systems create synthetic content through computational processes, raising complex questions at the intersection of innovation and established copyright frameworks. These systems operate by processing user prompts through algorithms trained on vast datasets, producing novel outputs that can include text, images, audio and video content that mimics or extends human-created works. There may be copyright implications arising from this process, both regarding the collection and use of training data, where text and data mining (TDM) exceptions may apply, and the status of AI-generated outputs, which may reproduce elements of copyrighted works or create entirely new content that challenges traditional authorship concepts.

In that context, the EUIPO study examines the complex intersection of GenAI and copyright law at both the input level (how AI systems are trained) and the output level (what they produce).

GenAI Input

“GenAI input” refers to the training process involving the ingestion of data and content. While various sources exist for gathering input data — including large public datasets and synthetic data[2] — the EUIPO report focuses on web scraping as a form of TDM, which is central to the current AI ecosystem.

The legal backdrop for the study’s analysis is Article 4 of the 2019 EU Copyright in the Digital Single Market Directive (CDSM Directive)[3], which provides an exception to copyright owners’ exclusive rights for TDM of lawfully accessible works, unless rightsholders have “expressly reserved” their rights “in an appropriate manner” (such rights reservations being commonly referred to as “opt-outs”).[4] A key issue for both rightsholders and AI developers has been how to communicate and recognise TDM opt-outs effectively.
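One widely used machine-readable signal for web crawlers is the robots.txt file. The sketch below uses Python's standard urllib.robotparser module to show how a crawler might check such a file before collecting a page. The crawler name ExampleAIBot, the file contents and the URL are hypothetical, and whether a robots.txt rule amounts to a valid rights reservation under Article 4 is a legal question that this sketch does not answer.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt a rightsholder might publish.
# "ExampleAIBot" is an invented crawler name used only for illustration.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""


def may_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines directly, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/article"
    print("ExampleAIBot allowed:", may_crawl(ROBOTS_TXT, "ExampleAIBot", page))
    print("OtherBot allowed:", may_crawl(ROBOTS_TXT, "OtherBot", page))
```

In this sketch the named crawler is refused while other crawlers are allowed, which illustrates why both sides need a shared understanding of which signals count and which crawler names they address.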

GenAI Output

“GenAI output” refers to the synthetic content created by GenAI. First, the system receives prompts or input parameters from users that guide the creation process. Next, it processes these inputs through algorithms that have encoded patterns from extensive training data. Finally, the system “generates” content that follows these learned patterns while meeting the specified parameters provided by the user.

Various model architectures employ distinct approaches to generate content, each of which gives rise to different legal and technological challenges and opportunities.

Generative Adversarial Networks (GANs) transform random noise vectors into structured output through a generator trained against a discriminator that evaluates output quality. Variational Autoencoders (VAEs) sample a point from a learned latent space and decode it into output using learned data patterns. Diffusion Models refine random noise through iterative denoising, which can take up to 1,000 separate steps to create coherent output. Lastly, Large Language Models (LLMs) produce text sequentially, one token at a time, based on learned patterns and probabilistic predictions of what should come next.
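To make the "one token at a time" idea concrete, the following is a deliberately simplified sketch and not how any production LLM is built. It "trains" by recording which word follows which in a tiny corpus, then extends a user prompt by repeatedly sampling a plausible next word. The corpus, the function names and the use of whole words as tokens are all assumptions made for illustration.

```python
import random
from collections import defaultdict

# Tiny illustrative corpus standing in for "extensive training data".
CORPUS = "the cat sat on the mat and the dog sat on the rug"


def train_bigram_model(corpus: str) -> dict[str, list[str]]:
    """Record, for each word, every word that followed it in the corpus."""
    words = corpus.split()
    followers = defaultdict(list)
    for current, nxt in zip(words, words[1:]):
        followers[current].append(nxt)
    return dict(followers)


def generate(model: dict[str, list[str]], prompt: str,
             max_new_words: int = 5, seed: int = 0) -> str:
    """Extend the prompt one word at a time by sampling a seen follower."""
    rng = random.Random(seed)
    output = prompt.split()
    if not output:
        return ""
    for _ in range(max_new_words):
        candidates = model.get(output[-1])
        if not candidates:
            break  # the last word was never followed by anything in training
        # Words that followed more often appear more often in the list,
        # so they are proportionally more likely to be chosen.
        output.append(rng.choice(candidates))
    return " ".join(output)


if __name__ == "__main__":
    model = train_bigram_model(CORPUS)
    print(generate(model, "the dog", max_new_words=6, seed=1))
```

Every word pair in the generated text was seen during training, which hints at why the content and provenance of training data matter for the output-side questions discussed in this series.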

The AI system workflow includes model training, validation, deployment and, increasingly, Retrieval Augmented Generation (RAG), which combines LLMs with document retrieval to enrich the information available to the model without retraining. This approach allows AI systems to access more current or specific information than what might be contained in their original training data.
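The retrieval step in RAG can be sketched in a few lines. The example below is a minimal illustration under stated assumptions: it ranks a small set of documents by simple word overlap with the user's question and places the best match in the prompt that would be passed to a model. Real systems typically use more sophisticated search, and the function names and sample documents here are hypothetical. No model is called.

```python
import re

# Hypothetical document store; in practice this could be a large,
# regularly updated collection that the model was never trained on.
DOCUMENTS = [
    "The EUIPO study on generative AI and copyright was released on May 12, 2025.",
    "Article 4 of the CDSM Directive provides a text and data mining exception.",
]


def tokenize(text: str) -> set[str]:
    """Lower-case the text and return its set of distinct words."""
    return set(re.findall(r"\w+", text.lower()))


def retrieve(query: str, documents: list[str], top_k: int = 1) -> list[str]:
    """Return the top_k documents sharing the most words with the query."""
    query_words = tokenize(query)
    ranked = sorted(
        documents,
        key=lambda doc: len(query_words & tokenize(doc)),
        reverse=True,
    )
    return ranked[:top_k]


def build_augmented_prompt(query: str, documents: list[str]) -> str:
    """Combine retrieved context with the user's question in one prompt."""
    context = "\n".join(retrieve(query, documents))
    return f"Context:\n{context}\n\nQuestion: {query}"


if __name__ == "__main__":
    print(build_augmented_prompt("When was the EUIPO study released?", DOCUMENTS))
```

The point of the design is that the retrieved text is supplied at the time of the query, so the underlying model parameters are left unchanged.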

The EUIPO study examines what the EUIPO regards as several copyright challenges raised by GenAI output generation, including:

Models potentially reproducing copyrighted content through memorisation and “regurgitation”.
Synthetic content falsely appearing authentic without proper labelling, potentially misleading users about its origins.
Users intentionally prompting systems to generate potentially infringing material, raising questions about allocation of liability for the resulting infringement.
Various technical measures currently being developed and implemented to address these concerns, including improved plagiarism detection, automated prompt filtering and rewriting, and contractual indemnification provisions.
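As a rough illustration of how verbatim "regurgitation" might be flagged, the sketch below compares a generated text against a reference work and reports any runs of five consecutive words that appear in both. This is a simplified assumption of how overlap checks can work, not a description of any tool discussed in the EUIPO study, and the function names, threshold and sample texts are our own.

```python
import re


def word_ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Return every run of n consecutive words in the text, lower-cased."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def shared_ngrams(output: str, reference: str, n: int = 5) -> set[tuple[str, ...]]:
    """Return the n-word runs that appear in both the output and the reference."""
    return word_ngrams(output, n) & word_ngrams(reference, n)


if __name__ == "__main__":
    reference = "It was the best of times, it was the worst of times"
    generated = "The model wrote: it was the best of times again"
    overlaps = shared_ngrams(generated, reference)
    print(f"{len(overlaps)} shared five-word runs")
    for run in sorted(overlaps):
        print(" ".join(run))
```

A shared run of words is only a signal for further review. Whether a given overlap amounts to reproduction of a protected work remains a legal assessment.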

Coming up in this series

Part 2 of our series will explore GenAI input and will cover (i) the requirements for a valid opt-out under Article 4 of the CDSM Directive, and (ii) the main legal and technical opt-out mechanisms that have already been developed and adopted by stakeholders. Part 3 will focus on GenAI output, particularly the key copyright challenges raised by synthetic content generation. In Part 4, we will conclude with some final thoughts on the EUIPO study and its implications.

[1] EUIPO, “EUIPO releases study on generative artificial intelligence and copyright”, available here. While the study calls for a careful legal analysis of existing copyright frameworks to balance rightsholders’ interests and innovation enablement, it does not constitute a binding legal interpretation of AI legislation, nor does it expressly support or endorse any specific interpretations of EU copyright law, or any particular measures or conduct by any players within the GenAI ecosystem.

[2] See our four-part series on the use of synthetic data for training AI models here, here, here and here.

[3] Directive (EU) 2019/790 of the European Parliament and of the Council of 17 April 2019 on copyright and related rights in the Digital Single Market and amending Directives 96/9/EC and 2001/29/EC (https://eur-lex.europa.eu/eli/dir/2019/790/oj/eng).

[4] The EU position regarding TDM and copyright differs from that in the UK and the US: UK copyright law has a narrower exception for TDM for non-commercial research purposes only, although the UK Government is currently considering broadening the scope of the existing TDM exception (see here). In the US, there is no closed list of copyright exceptions, and any use has to be analysed pursuant to the ‘fair use’ doctrine (which considers, for example, the purpose and character of a specific use case and its effect on the market for the original work) on a case-by-case basis.

Cleary AI and Technology Insights
