This is the third part of our four-part series on the EUIPO study on GenAI and copyright. Read parts 1 and 2.

This instalment offers four key takeaways on GenAI output, covering retrieval augmented generation (RAG), transparency solutions, the retention of protected works by models, and emerging technical remedies.

Key Takeaway 1: Retrieval augmented generation (RAG) is not clearly identified as a form of TDM

RAG is deployed as a technical solution that balances two ways of integrating data: model training and database retrieval. The EUIPO study notes that whether RAG qualifies as text and data mining (TDM) may depend on how the retrieval process is understood.

Unlike AI training, which can be said to convert works into abstract patterns, RAG directly extracts information from content to enhance AI outputs. The study distinguishes between two technical implementations with potentially different legal implications: static RAG stores copies of content locally, whereas dynamic RAG searches the internet on demand.
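For readers who want to see the distinction in concrete terms, the following is a minimal Python sketch, not a description of any real RAG product. The class names `StaticRAG` and `DynamicRAG`, the `search` callable and the word-overlap `score` function are all hypothetical and chosen for illustration; real systems typically rank passages with vector embeddings rather than word counts. The point is only where the content sits: the static variant keeps its own copies, while the dynamic variant holds nothing and fetches at query time.

```python
from typing import Callable, Dict, List


def score(query: str, text: str) -> int:
    """Toy relevance score: number of distinct query words found in the text."""
    words = set(query.lower().split())
    return len(words & set(text.lower().split()))


class StaticRAG:
    """Hypothetical static RAG: copies of the documents are stored locally."""

    def __init__(self, documents: Dict[str, str]) -> None:
        self.store = dict(documents)  # a local copy is made and retained

    def retrieve(self, query: str, k: int = 1) -> List[str]:
        ranked = sorted(
            self.store,
            key=lambda name: score(query, self.store[name]),
            reverse=True,
        )
        return [self.store[name] for name in ranked[:k]]


class DynamicRAG:
    """Hypothetical dynamic RAG: nothing is stored; content is fetched on demand."""

    def __init__(self, search: Callable[[str], List[str]]) -> None:
        self.search = search  # stand-in for a live web search, assumed here

    def retrieve(self, query: str, k: int = 1) -> List[str]:
        results = self.search(query)
        return sorted(results, key=lambda text: score(query, text), reverse=True)[:k]


def build_prompt(query: str, passages: List[str]) -> str:
    """Place the retrieved passages in front of the user's question."""
    context = "\n".join(passages)
    return f"Context:\n{context}\n\nQuestion: {query}"
```

In both variants the retrieved passage is inserted into the prompt in recognisable form, which is the feature that sets RAG apart from training in the study's analysis.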

To address these uncertainties and in an attempt to create new revenue streams, content owners are increasingly licensing content (which typically has been opted out of TDM) for AI-specific uses through dedicated APIs.

Key Takeaway 2: Technical solutions for transparency are evolving but incomplete

Transparency in how AI-generated content is identified has emerged as a priority in the regulatory landscape. The EU AI Act notably imposes specific obligations requiring GenAI outputs to be detectable and marked in a machine-readable format, particularly for “deepfakes” that could mislead users.[1] The EUIPO study identifies three main technical approaches to addressing transparency:

  1. Provenance tracking protocols (C2PA, JPEG Trust, TRACE4EU), which provide unique credentials to content authors to bind provenance statements to content and include “trust manifests” in media metadata listing actions performed on the media, though both approaches face challenges with metadata tampering or removal;
  2. Automated content detection technologies, including NVIDIA’s StyleGAN3-detector, which demonstrates high accuracy in detecting GAN-generated images, and Deezer’s tool for AI-generated music detection, which achieved 99.8% accuracy in lab conditions, though these technologies might not be suitable in all cases across different AI generation methods; and
  3. Content-processing solutions such as watermarking that modifies digital assets to embed provenance information and fingerprinting that generates unique identifiers from content characteristics and stores them in external databases, with both methods having different strengths and vulnerabilities to modifications.
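The difference between watermarking and fingerprinting described in the third item can be illustrated with a toy Python sketch for text. It is a simplification under stated assumptions, not one of the schemes assessed in the study: the invisible-character watermark, the exact-hash fingerprint and the function names are all hypothetical, and production systems use far more robust techniques (for example perceptual hashing for images and audio).

```python
import hashlib

ZERO, ONE = "\u200b", "\u200c"  # zero-width space / zero-width non-joiner


def embed_watermark(text: str, tag: str) -> str:
    """Watermarking: modify the asset by appending the tag as invisible characters."""
    bits = "".join(f"{byte:08b}" for byte in tag.encode("utf-8"))
    return text + "".join(ONE if bit == "1" else ZERO for bit in bits)


def extract_watermark(text: str) -> str:
    """Read the tag back from any invisible characters found in the text."""
    bits = "".join("1" if ch == ONE else "0" for ch in text if ch in (ZERO, ONE))
    usable = len(bits) - len(bits) % 8  # ignore an incomplete trailing byte
    data = bytes(int(bits[i:i + 8], 2) for i in range(0, usable, 8))
    return data.decode("utf-8", errors="replace")


def fingerprint(text: str) -> str:
    """Fingerprinting: derive an identifier from the visible content only.

    The asset itself is left untouched; the identifier is meant to be stored
    in an external registry (a plain dict in this sketch).
    """
    visible = "".join(ch for ch in text if ch not in (ZERO, ONE))
    normalised = " ".join(visible.lower().split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


# Usage: register a fingerprint externally, and mark the asset itself.
registry = {}
original = "An AI-generated caption"
registry[fingerprint(original)] = "generator-x"  # hypothetical provenance record
marked = embed_watermark(original, "gen-x")
```

In this sketch, stripping the invisible characters removes the watermark while the fingerprint lookup still succeeds, whereas changing a single visible word leaves the watermark readable but defeats the exact-hash fingerprint. That asymmetry is a small-scale version of the differing vulnerabilities noted above.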

Yet, no technical solution has emerged as unequivocally meeting all requirements of the AI Act’s transparency obligations. The EUIPO study recommends that these technical measures should be part of a broader compliance strategy rather than standalone solutions. Today, the lack of a universal standard for labelling or detecting AI-generated content creates legal uncertainty for both content creators and AI developers, and may complicate copyright enforcement and content verification.

Key Takeaway 3: Model retention and regurgitation of protected works is a growing concern

The study highlights that some GenAI models can, under certain specific circumstances, reproduce copyrighted works or portions of them from their training data, raising copyright infringement risks.

Several factors increase the probability of such memorisation, including model size, length of input prompts, higher frequency of the sequence within the training dataset and content originality. These technical characteristics may have direct implications for copyright infringement liability, as they influence the likelihood that protected expression might be reproduced without authorisation. The EUIPO refers to this as “a technical issue which creates a legal issue”.

To address these concerns, the study finds that model providers have started implementing various technical mitigation approaches. These include data deduplication during training, content filtering systems, post-training techniques such as model editing and “unlearning”, and prompt rewriting or filtering to further reduce infringement risk.
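Two of these measures, deduplication and output filtering, are simple enough to sketch in a few lines of Python. The example below is illustrative only and rests on assumptions of our own: the function names, the exact-match hashing and the five-word overlap threshold are hypothetical and are not drawn from the study or from any provider's actual pipeline.

```python
import hashlib
from typing import Iterable, List, Set, Tuple


def normalise(text: str) -> str:
    """Lower-case the text and collapse runs of whitespace."""
    return " ".join(text.lower().split())


def deduplicate(samples: Iterable[str]) -> List[str]:
    """Keep the first occurrence of each training sample, compared by hash.

    Repeated sequences are one of the factors linked to memorisation,
    so removing duplicates lowers how often any one sequence is seen.
    """
    seen: Set[str] = set()
    unique: List[str] = []
    for sample in samples:
        digest = hashlib.sha256(normalise(sample).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(sample)
    return unique


def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Return every run of n consecutive words in the text."""
    words = normalise(text).split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def should_block(output: str, protected_texts: Iterable[str], n: int = 5) -> bool:
    """Flag an output sharing any run of n consecutive words with a protected text."""
    output_grams = ngrams(output, n)
    return any(output_grams & ngrams(text, n) for text in protected_texts)
```

A filter of this kind only catches verbatim overlap. It would not detect paraphrase or close adaptation, which is one reason such measures are better treated as one layer of a wider compliance strategy than as a complete answer.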

Key Takeaway 4: Emerging technical solutions to address copyright concerns

The EUIPO study identifies several emerging technical solutions to address copyright concerns with GenAI output. “Model unlearning” techniques enable removal of protected content without complete retraining, providing a more efficient means of modifying a model’s output capabilities. Targeted editing methods allow precise modifications to learned information, offering more surgical approaches. Data deduplication during training may reduce memorisation risks.

The study also outlines comprehensive risk management strategies available to stakeholders, including corporate governance frameworks addressing the entire AI lifecycle, thorough documentation for legal protection and operational clarity, output filters to block potentially infringing content before delivery, and regular evaluation as the technological landscape evolves.

Finally, the study suggests that IP offices could offer valuable guidance on technical standards to help navigate these complex copyright issues, potentially serving as neutral arbiters in establishing best practices across industries.

In the fourth and final part of our series, we will conclude our analysis of the EUIPO study and consider the path forward.


[1] See our previous blog posts on the EU AI Act here and here.