This is the first part of a series on using synthetic data to train AI models. See here for Parts 2, 3, and 4.

The recent rapid advancements of Artificial Intelligence (“AI”) have revolutionized creation and learning patterns. Generative AI (“GenAI”) systems have unveiled unprecedented capabilities, pushing the boundaries of what we thought possible. Yet, beneath the surface of the transformative potential of AI lies a complex legal web of intellectual property (“IP”) risks, particularly concerning the use of “real-world” training data, which may lead to alleged infringement of third-party IP rights if AI training data is not appropriately sourced.

This is because training of GenAI models requires processing of large amounts of data that potentially contain copyrighted works, as well as materials displaying trademarks and data compilations which may be protected by sui generis database rights in the EU, or other information the use of which may be restricted by contract or terms of use. Only through that training can the AI model be leveraged and applied to generate plausible and human-like new content (such as text, code, images, sound or video). If not adequately deduplicated, filtered, and calibrated, there is also a risk that GenAI systems may generate infringing outputs that are substantially similar to or otherwise replicate (in whole or meaningful part) third-party works protected by copyright.

This has given rise to the international debate surrounding how to balance the respective rights and interests of IP rightsholders and AI developers. Several lawsuits have even been launched by rightsholders and representative organizations against developers of GenAI tools, typically claiming that the process of training the AI models utilized by such tools and, in some cases, the output generated by such tools, infringe their IP rights.[1]

In this context, “synthetic data” has emerged as a potential solution. Synthetic data comprises data that is artificially generated by an AI model rather than mined or collected from real-world sources and, therefore, should not (in theory) give rise to the same IP infringement risks as using real-world data. Synthetic data mimics real-world data and, if properly developed, should be technically and statistically indistinguishable from such data for the purpose of training AI models.

Several major AI companies are currently using synthetic data to train their AI models.[2] A new type of business has even emerged: companies are now specializing in providing synthetic datasets, either from a pre-existing proprietary database or by creating “bespoke” synthetic data generated on demand for specific customers.[3] Synthetic data has many practical use cases already, including in the insurance sector, medical research,[4] or drug discovery and testing.[5]

Synthetic data creates technological, economic and ethical opportunities, including the potential to: (i) improve accuracy by mitigating the unreliability of human-made data, which is typically gathered by scraping the erratic web that is the internet;[6] (ii) mitigate or even remove biases and imbalances in existing, human-made data;[7] and (iii) reduce costs and obstacles at all stages of the data value chain, which may lower the cost of developing data and remove data-related barriers to entry in relevant markets characterized by network effects.[8] In addition, companies are starting to run out of easily accessible, reliable and high-quality real-world data sources to continue training more advanced AI models, thereby increasing demand for synthetic data.[9]

Against this backdrop, we will consider in three future segments of this blog whether synthetic data could adequately mitigate IP infringement risks that arise in the context of training AI models under existing and proposed European legal frameworks, with a focus on copyright protection (which has, thus far, emerged as the predominant basis upon which to challenge AI developers).

Part 2 of this series will cover the question of how training AI models on synthetic data may mitigate copyright infringement risks.  Part 3 will cover the interplay between synthetic data training sets, the EU Copyright Directive and the forthcoming EU AI Act.  And Part 4 will explore other key legal topics to be considered when using synthetic data to train an AI model.

[1]           Some of the most prominent cases currently pending in U.S. courts include: Doe I v. GitHub, Inc., No. 4:22-cv-06823 (N.D. Cal. Nov. 3, 2022); Andersen et al. v. Stability AI et al., No. 3:23-cv-00201 (N.D. Cal. Jan. 13, 2023); Getty Images (US), Inc. v. Stability AI, No. 1:23-cv-00135 (D. Del. Feb. 3, 2023) (Getty Images launched a similar lawsuit against Stability AI in the UK; the High Court declined Stability AI’s request to dismiss the case in December 2023, and the case will be proceeding to trial in the upcoming months); J.L. et al. v. Alphabet, No. 3:23-cv-03440 (N.D. Cal. Jul. 11, 2023); Tremblay et al. v. OpenAI, No. 3:23-cv-03223 (N.D. Cal. June 28, 2023); Silverman et al. v. OpenAI, No. 4:23-cv-03416 (N.D. Cal. July 7, 2023); Kadrey et al. v. Meta Platforms, No. 3:23-cv-03417 (N.D. Cal. July 7, 2023); Chabon et al. v. OpenAI, No. 3:23-cv-04625 (N.D. Cal. Sept. 8, 2023); Chabon et al. v. Meta Platforms, No. 3:23-cv-04663 (N.D. Cal. Sept. 12, 2023); Authors Guild v. OpenAI, No. 1:23-cv-08292 (S.D.N.Y. Sept. 19, 2023); Huckabee et al. v. Meta Platforms et al., No. 1:23-cv-09152 (S.D.N.Y. Oct. 17, 2023); Concord Music Group et al. v. Anthropic PBC, No. 3:23-cv-01092 (M.D. Tenn. Oct. 18, 2023); Sancton v. Open AI, Inc. et al., No. 1:23-10211 (S.D.N.Y. Nov. 21, 2023). For a recent update on potentially significant roadblocks for plaintiffs in these types of suits, see, however, our Alert Memorandum of November 7, 2023 on Andersen v. Stability AI.

[2]           These include Microsoft, OpenAI, Cohere, Amazon and Google (Waymo). See also “Types of synthetic data and 4 real-life examples” (2022).

[3]           For example, see;; and

[4]           See Mostly AI, “European Commission’s JRC: synthetic data will be the key enabler for AI in Europe” (September 15, 2022), reporting on the use of synthetic data to generate cancer data.

[5]           On the latter, see EMA reflection paper of July 13, 2023 “on the use of AI in the medicinal product lifecycle”. See also “Synthetic data use: exploring use cases to optimize data utility”.

[6]           MIT, “In machine learning, synthetic data can offer real performance improvements” (November 3, 2022).

[7]           Yet, because synthetic data is typically created by an upstream AI model which was fed with real-world data, and synthetic data will in turn be used to train downstream AI models, depending on the exact parameters used to create the synthetic training data set (which may, themselves, contain errors or data bias), synthetic data may also carry risks to transpose or even introduce bias and imbalances.

[8]           European Commission, Competition policy for the digital era (2019), p. 73 et seq. See also OECD, Data-driven innovation: Big data for growth and well-being (2015), pp. 391-392.

[9]           Financial Times, “Why computer-made data is being used to train AI models” (July 19, 2023).