GitHub, acquired by Microsoft in 2018, is an online repository used by software developers for storing and sharing software projects. In collaboration with OpenAI, GitHub released an artificial intelligence-based offering in 2021 called Copilot, which is powered by OpenAI’s generative AI model, Codex. Together, these tools assist software developers by taking natural language prompts describing a desired functionality and suggesting blocks of code to achieve that functionality. OpenAI states on its website that Codex was trained on “billions of lines of source code from publicly available sources, including code in public GitHub repositories.”
In November 2022, two anonymous software programmers who own source code repositories publicly available through GitHub brought a class action lawsuit in the Northern District of California against GitHub and OpenAI (DOE 1 et al v. GitHub, Inc. et al, No. 4:22-cv-06823) alleging a large number of claims.
GitHub in the Context of Other Generative AI Lawsuits and the Implications for the Future of Generative AI
This suit against GitHub’s Copilot is just one of many copyright-related lawsuits brought against makers of different generative AI systems. In January 2023, the same lawyers who brought the class action suit against Copilot on behalf of the Doe plaintiffs filed a complaint in the Northern District of California (Andersen v. Stability AI Ltd. et al, No. 3:23-cv-00201) on behalf of three named artists against Midjourney, Stability AI and DeviantArt, each of which has developed an AI-based art generator whose model was allegedly trained on billions of images scraped from the Internet without the permission of the copyright holders. Getty Images filed a similar lawsuit in January 2023 in the UK High Court and in February 2023 in the District of Delaware (Getty Images (US), Inc. v. Stability AI, Inc., No. 1:23-cv-00135), alleging that Stability AI used more than 12 million Getty Images photos to train its model.
Each of these cases takes a different approach to challenging the use of underlying material – in GitHub, the claims that survived a motion to dismiss (as discussed further below) are based on the Digital Millennium Copyright Act and breach of open source license terms; in Getty and Andersen, the plaintiffs included copyright infringement claims. If any of these cases challenging the use of copyright-protected works in generative AI outputs or in developing generative AI models is successful, it could have significant implications for the future of generative AI, which relies on large and diverse datasets in order to provide accurate and unbiased results.
Where there is a single provider of the database containing content to be used for training (such as Getty Images), the developer of a generative AI system could seek to negotiate a license; in instances where the content is owned by hundreds (or more) of different copyright holders, this quickly becomes impracticable (and likely cost-prohibitive). If these cases are successful in blocking the use of data for generative AI purposes, it may therefore be necessary for Congress to step in to legislate a fair means of balancing the rights of copyright owners against the power and utility of generative AI systems.
GitHub’s Motion to Dismiss:
DMCA Section 1202(b) Claims:
Section 1202(b) of the DMCA prohibits anyone from (1) intentionally removing or altering any copyright management information (“CMI”), (2) distributing CMI knowing the CMI has been removed or altered or (3) distributing copies of works knowing that CMI has been removed or altered while “knowing, or . . . having reasonable grounds to know, that it will induce, enable, facilitate, or conceal” infringement.
The plaintiffs alleged that the code they made available on GitHub contained information constituting CMI – such as copyright notices, titles, authors’ names, copyright owners’ names and the terms and conditions for use of the code – and that such information was intentionally removed by GitHub and OpenAI in connection with training Copilot. As an example, the plaintiffs cited a GitHub blog post in which GitHub stated “in one instance, GitHub Copilot suggested starting an empty file with something it had even seen more than a whopping 700,000 different times during training—that was the GNU General Public License.” The plaintiffs provided further evidence that GitHub “subsequently trained [Copilot] to ignore or remove CMI and therefore stop reproducing it.” The court found, however, that the plaintiffs did not plead any specific facts regarding the alleged inaccuracy of CMI that would be required for a claim under Section 1202(b)(2).
With respect to the knowledge requirement of Section 1202(b), plaintiffs provided evidence that GitHub “regularly processed DMCA takedowns, such that it was aware that its platform was used to distribute code with removed or altered CMI in a manner which induced infringement.”
As a result, the court found that the plaintiffs’ claims under Section 1202(b)(1) and (b)(3) were sufficiently pled to survive defendants’ motion to dismiss, but dismissed plaintiffs’ claim under Section 1202(b)(2) with leave to amend.
Breach of License Claim:
Developers who upload code to the GitHub repository can choose to keep the project they upload private or to make the repository available under one of thirteen different licenses. Two of these licenses offered by GitHub waive any copyright rights in the code in the repository, while the remaining eleven licenses comprise open source licenses of varying strength, ranging from the permissive MIT and Apache Licenses to the copyleft GNU General Public Licenses v.2 and v.3. While most of the license terms vary across these open source licenses, each of them requires, as a condition of use of the licensed open source software, that any derivative work or copy of such open source software include: (i) attribution to the owner, (ii) inclusion of a copyright notice and (iii) inclusion of the license terms for such open source software.
The plaintiffs allege that GitHub and OpenAI trained Copilot on the plaintiffs’ code in GitHub without complying with the applicable open source licensing terms governing such code (namely, attribution, inclusion of a copyright notice and inclusion of the license terms), and that Copilot often outputs code that is nearly identical to the training data, again without complying with the terms of the applicable open source licenses. The court found that the plaintiffs sufficiently identified the “contractual obligations allegedly breached,” as required to plead a breach of contract claim, and thus rejected defendants’ motion to dismiss these claims.
Standing for Relief:
Finally, although the plaintiffs identified instances in which the output from Copilot matched licensed code in GitHub repositories, none of these cited instances involved plaintiffs’ code. The court found that “[b]ecause Plaintiffs do not allege that they themselves have suffered the injury they describe, they do not have standing to seek retrospective relief for that injury.” Although the court denied the plaintiffs’ claim for past monetary damages, the court did find that plaintiffs alleged sufficiently particularized facts showing a substantial risk that Copilot would reproduce the plaintiffs’ code as output in the future. The plaintiffs cited GitHub’s own internal research that showed that Copilot “reproduces code from training data about 1% of the time.” Because of the substantial and potentially imminent risk of future harm to the plaintiffs, the court found that they have standing to pursue injunctive relief on these claims.