Training AI models on Synthetic Data: No silver bullet for IP infringement risk in the context of training AI systems (Part 3 of 4)

By Gareth Kristensen, Angela Dunning, Gaia Shen, Jan-Frederik Keustermans, Prudence Buckland & Alix Anciaux on January 16, 2024

This third part of our four-part series on using synthetic data to train AI models explores the interplay between synthetic data training sets, the EU Copyright Directive and the forthcoming EU AI Act.

EU Copyright Directive: Can synthetic data adequately mitigate the IP infringement risks associated with the use of “real-world data” in training AI models under the EU Copyright Directive?

One hurdle for AI developers that synthetic data may help overcome arises under the EU Copyright Directive (the “Copyright Directive”).[1] More specifically, training an AI model on synthetic data may allow developers to side-step certain uncertainties and complexities in applying the exceptions to specific copyrights and related rights under Articles 3 and 4 of the Copyright Directive, which cover reproductions and extractions of lawfully accessible works and other subject matter for the purposes of text and data mining (the so-called “text and data mining” or “TDM” exceptions).[2]

Pursuant to Article 4 of the Copyright Directive, any text and data mining activity (even for commercial purposes) carried out on works or other subject matter protected by copyright and/or related rights is exempted from such rights under relevant EU law, provided that (i) the person conducting such activities had lawful access to the content[3] and (ii) the holder of the copyright or related rights has not expressly reserved the extraction of text and data in an appropriate manner (the so-called “opt-out”).[4]

The TDM exception under Article 4 of the Copyright Directive has given rise to several questions in the context of the development and use of AI models, including on how the opt-out can be applied in practice by rightsholders and how AI developers can reliably screen their training datasets for any works subject to an opt-out, especially in view of the manner in which training data might in some cases have been indiscriminately scraped from the internet. Stated another way, how can a person wishing to use such content reasonably discover and avoid materials subject to an opt-out?

The Copyright Directive itself does not provide a clear answer, and the fact that the exception is implemented in a slightly different manner across EU Member States adds to this legal uncertainty.[5] But by training an AI model on purely synthetic data, developers could side-step the issues associated with the TDM exception under EU law, including the risk that a copyright holder may have reserved the extraction of text and data from the underlying work by opting out. This is because there is no need to rely on the TDM exception under the Copyright Directive if the training dataset contains no copyrighted works and consists instead of artificially created data to which copyright protection does not extend (see Part 1 of this series).

Proposed EU AI Act: Can synthetic data adequately mitigate the prospective obligations associated with use of “real-world data” in training AI models under the proposed EU AI Act?

Synthetic data may also facilitate compliance by developers of GenAI systems with certain of the anticipated obligations which may become mandatory under the proposed EU AI Act.[6]

Although a consolidated legislative text is yet to be finalised and formally approved, the provisional political agreement reached between the relevant institutions on the AI Act introduces specific guardrails for General Purpose AI (“GPAI”) systems (and the GPAI models they are based on).[7] In particular, according to a recent proposal, providers[8] of such GPAI models will be required to draw up and make publicly available a “sufficiently detailed summary about the content used for training” such systems or models.[9] This disclosure obligation was originally introduced in the European Parliament’s compromise text[10] but, unlike the Parliament’s proposal, it no longer refers to a summary of “the use of training data protected under copyright law” – which would have required providers to distinguish between copyright-protected and public domain training materials.

It seems that one of the intended purposes of the obligation to disclose training data is to enable better enforcement, through an ex-post tool, of the above-mentioned opt-out mechanism under Article 4 of the Copyright Directive. Without it, there may be no practical way for rightsholders to become aware that works that they expressly reserved have been used to train an AI model. This link was explicitly acknowledged in a recent proposed draft of the AI Act, which introduced an obligation on providers of GPAI models to “put in place a policy to respect Union copyright law in particular to identify and respect, including through state of the art technologies where applicable, the reservations of rights expressed pursuant to Article 4(3) of [the Copyright Directive]”.[11]

In practice, the obligation to disclose training data may be difficult to comply with, including due to the significant amount of data (which may include copyright-protected data) required to train GenAI models and the difficulty in keeping track of and summarizing such data. It is not entirely clear what would constitute a “sufficiently detailed” summary, but we understand that a template will be developed by the AI Office for such a summary, which would allow the AI system provider to provide the required summary in a narrative form.[12] Synthetic data has the potential to mitigate some of the disclosure burden, as providers would simply disclose the synthetic dataset(s) they used to train their AI models.

In any event, it remains to be seen what requirements will be included in the final EU AI Act. In light of criticism following the announcement of the political deal – and renewed warnings that the AI Act risks hampering innovation in the European market – there may still be room for debate over the final terms of the AI Act before ratification.[13] The consolidated text will then need to be formally approved by both the European Parliament and Council (which could happen as early as Q1 of 2024 if there is political consensus). Once it has entered into force, most of the general provisions of the AI Act will apply after a two-year grace period, except that the prohibitions of certain AI systems will apply after a six-month grace period and the rules on GPAI will apply after a 12-month grace period.[14]

Coming up in this Series

In the fourth and final part of this Series, we will explore certain other key legal topics to be considered when using synthetic data to train an AI model and summarize our conclusions.

[1] Directive (EU) 2019/790 of 17 April 2019 on copyright and related rights in the Digital Single Market, OJ L 130, 17.5.2019, pp. 92-125.

[2] Art. 3 provides that Member States shall provide for an exception or limitation for text and data mining for the purposes of scientific research, and Art. 4 provides exceptions or limitations for the purposes of text and data mining for other purposes. Text and data mining is defined as “any automated analytical technique aimed at analysing text and data in digital form in order to generate information which includes but is not limited to patterns, trends and correlations”. See Copyright Directive, Art. 2. Note by contrast that under UK law, a similar exception for text and data mining is limited to research “for a non-commercial purpose”.

[3] Having “lawful access to content” under the Copyright Directive covers, as described in Recital 14, access to content pursuant to contractual arrangements (e.g., subscriptions or open access licenses) as well as “content that is freely available online”.

[4] Copyright Directive, Art. 4(1) and 4(3). Recital 18 of the Copyright Directive explains that “it should only be considered appropriate to reserve those rights by the use of machine-readable means, including metadata and terms and conditions of a website or a service. […] In other cases, it can be appropriate to reserve the rights by other means, such as contractual agreements or a unilateral declaration.”

[5] As of 1 November 2023, the Copyright Directive had not yet been implemented in Bulgaria and was pending adoption by Parliament in Poland. Additionally, in some jurisdictions, the Directive has not been transposed in its entirety or faithfully into national law – e.g., on the TDM exception, Denmark has not implemented Article 4 of the Directive into national law and, across other EU Member States, the wording of the TDM exception differs. For example, the Italian implementation allows rightsholders to opt out but doesn’t seem to mirror the language in the Directive requiring the opt-out to be “in an appropriate manner”.

[6] Proposed by the European Commission in April 2021, a provisional political agreement was reached on the AI Act between the European Parliament and the Council of the European Union on 9 December 2023. For more information, see Cleary IP and Technology Insights blog-post, Agreement reached on the EU AI Act: the key points to know about the political deal (14 December 2023).

[7] See Europa, Artificial intelligence act: Council and Parliament strike a deal on the first rules for AI in the world (9 December 2023). Although there is not yet an official definition for what models fall within the GPAI bucket, previous proposals suggest that these are systems trained on a large amount of data and capable of performing a wide range of distinct tasks – see the most recent available compromise proposal on general purpose AI models/general purpose AI systems published by POLITICO on 6 December 2023 (which we understand to be the basis for the political agreement on GPAIs), under which General Purpose AI Model is defined as “an AI model, including when trained with a large amount of data using self-supervision at scale, that displays significant generality and is capable to competently perform a wide range of distinct tasks regardless of the way the model is released on the market and that can be integrated into a variety of downstream systems or applications.” The definition and extent to which such systems were to be regulated underwent various iterations during the trilogue process.

[8] The AI Act defines a “provider” as a “natural or legal person…that develops an AI system or that has an AI system developed with a view to placing it on the market or putting it into service under its own name or trademark, whether for payment or free of charge” – see Amendments adopted by the European Parliament on 14 June 2023 on the proposal for a regulation of the European Parliament and of the Council on laying down harmonized rules on artificial intelligence (Artificial Intelligence Act) and amending certain Union legislative acts (COM(2021)0206 – C9-0146/2021 – 2021/0106(COD)), available at: https://www.europarl.europa.eu/doceo/document/TA-9-2023-0236_EN.pdf (the “European Parliament’s draft”), Art. 3(2). In this context, “placing” on the market and “putting into service” are respectively defined as the first making available, or the supply for first use directly, of the AI system on the Union market – see European Parliament’s draft, Art. 3(11).

[9] See Article C(1)(d) of the compromise proposal on general purpose AI models/general purpose AI systems published by POLITICO on 6 December 2023 – available at Open Future – GPAI Compromise proposal.

[10] See European Parliament’s draft, Article 28(b)(4)(c).

[11] See Article C(1)(c) of the compromise proposal on general purpose AI models/general purpose AI systems published by POLITICO on 6 December 2023 – available at Open Future – GPAI Compromise proposal. This is further underlined by the language in recitals (x) and (xx) of such compromise proposal, which states that “[a]ny use of copyright protected content requires the authorization of the rightholder concerned unless relevant copyright exceptions apply”, and then goes on to state, with express reference to Article 4(3) of the Copyright Directive, that “where the rights to opt out has been expressly reserved in an appropriate manner, providers of general-purpose AI models need to obtain an authorisation from rightholders if they want to carry out text and data mining over such works.”

[12] See “Clarifying recital – the scope of the “detailed summary” (proposed by EP)” in the compromise proposal on general purpose AI models/general purpose AI systems published by POLITICO on 6 December 2023 – available at Open Future – GPAI Compromise proposal: “This summary should be comprehensive in its scope instead of technically detailed, for example by listing the main data collections or sets that went into training the model, such as large private or public databases or data archives, and by providing a narrative explanation about other data sources used. It is appropriate for the AI Office to provide a template for the summary, which should be simple, effective, and allow the provider to provide the required summary in narrative form.”

[13] See, e.g., trade associations’ responses to the political agreement: Digital Europe, A milestone agreement, but at what cost? Response to the political deal on the EU AI Act and CCIA, AI Act Negotiations Result in Half-Baked EU Deal; More Work Needed, Tech Industry Emphasises. In particular, it has been reported that France, Germany and Italy – which have previously been critical of strict rules on foundation models – may seek alterations to related rules in the final text (which could further delay, or possibly prevent, passage of the law) – see Financial Times, EU’s new AI Act risks hampering innovation, warns Emmanuel Macron (11 December 2023).

[14] For additional information on the proposed EU AI Act, see our recent blog post EU’s Approach to Data and AI Continues to Evolve.

Cleary AI and Technology Insights
