On 15 January 2024, the UK Information Commissioner’s Office (“ICO”)[1] launched a series of public consultations on the applicability of data protection laws to the development and use of generative artificial intelligence (“GenAI”). The ICO is seeking comments from “all stakeholders with an interest in GenAI”, including developers, users, legal advisors and consultants.[2]

This first public consultation (which closes on 1 March 2024) focuses on the lawful basis for training GenAI models on web-scraped data.[3]

Lawfulness of web scraping to train GenAI models

The ICO recognises that large-scale web scraping (i.e., extracting information from public sources on the internet) is required to train the majority of GenAI models and systems. As large volumes of data are collected, there is an inherent risk that such datasets will also contain significant volumes of personal data (thereby increasing the risk of AI developers processing such data in violation of data protection laws).[4]

The ICO has raised this concern on several occasions[5] and has noted that, since GenAI is “developed and deployed in [a] way that [is] distinct from simpler AI models”, new questions arise regarding the application of existing data protection principles under the UK General Data Protection Regulation (the “UK GDPR”).

The UK GDPR requires personal data to be processed “lawfully, fairly and in a transparent manner”.[6] To comply with the lawfulness principle, the ICO notes that GenAI developers need to ensure that their processing: (a) is not in breach of other laws outside of data protection, such as intellectual property (“IP”) or contract law; and (b) has a valid lawful basis under the UK GDPR. This raises two key questions:

a) If scraping breaches other laws, what are the consequences of non-compliance with the lawfulness principle under the UK GDPR?

The ICO’s consultation clearly states that the lawfulness principle will not be complied with if personal data scraping infringes other legislation outside of data protection. This stance raises the prospect of web scraping of personal data for AI training in breach of (say) a website’s terms of use being a breach of the UK GDPR in addition to (and because of) the underlying breach of contract.

Where the scraped personal data is not a material component of the training dataset, it is unclear whether the breach of such other laws through scraping would alone be likely to result in significant separate UK GDPR fines (which can theoretically be up to £17.5 million or 4% of total worldwide annual turnover, whichever is higher).[7] In the past, the ICO has stated that, while processing personal data in breach of other legislation (including copyright) would involve unlawful processing under the GDPR, this does not mean that the ICO can “pursue allegations which are primarily about breaches of copyright, financial regulations or other laws outside [the ICO’s] remit and expertise as data protection regulator”.[8] However, this guidance leaves two questions open. First, would it be within the remit and expertise of the ICO (or would the ICO be willing) to assess infringements of other laws as part of its pursuit of allegations which are primarily about breaches of the UK GDPR? Second, could the ICO then find an infringement of the UK GDPR solely on the basis that another law was infringed in connection with the processing of personal data? In any case, data subjects might seek to exercise their UK GDPR rights in relation to the processing of their personal data resulting from such web scraping.

b) What constitutes a lawful basis for collection of GenAI training data?

The ICO has concluded that five of the six lawful bases set out under Article 6(1) UK GDPR (including consent of individuals) are unlikely to be available for training GenAI on web-scraped data. The question then turns to whether the remaining lawful basis, i.e., legitimate interests, is a valid basis for training GenAI models on web-scraped data.

In short – the ICO has concluded that legitimate interests can be a lawful basis for this activity, but only where the AI developer has taken their legal obligations seriously and, in particular, where the three-part legitimate interests test is satisfied. In relation to this test, the ICO notes:[9]

  1. Legitimate interest.  Although there are many potential downstream use cases for a model, GenAI developers need to frame the legitimate interest in a “specific, rather than an open-ended way.” Such legitimate interest could be, for example, a business interest in developing a model and deploying it for commercial gain.
  2. Necessity.  The ICO acknowledges that, at present, most GenAI training is only possible using the volume of data obtained through large-scale web scraping – and so it is arguable that web scraping is necessary to achieve the intended purpose set out above.
  3. Balancing test.  To be able to rely on legitimate interests, the UK GDPR requires organisations to demonstrate that they have considered the impact of their processing on individuals and to ensure that the organisation’s interests outweigh the interests of those individuals. The ICO categorises the collection of data through web scraping as an “invisible processing” activity – which, in turn, is classified as high risk under ICO guidance and would require GenAI developers to conduct a data protection impact assessment prior to conducting this activity.[10] The ICO’s concern here specifically relates to the upstream and downstream harms that could be caused to individuals whose personal data are processed. Upstream risks and harms might arise because individuals may lose control over their personal data (as they are not informed of the processing of their data for GenAI purposes at the time of collection of their data). Downstream risks and harms might arise, for example, where a model generates inaccurate information about a person that results in distress or reputational harm.

Further, the ICO considers the different ways in which GenAI models may be brought to the market, and suggests mitigations for related risks and harms that may be caused to individuals. For example, where a GenAI model is provided to third parties via an API, the ICO acknowledges that the third party does not have as much control over the underlying data, and so the developer should seek to ensure the third party’s deployment is in line with the legitimate interest identified by the developer at the training phase (e.g., through use of output filters).

In general, the ICO recommends, for all forms of deployment, that GenAI developers implement necessary technical and organisational controls over that specific deployment (e.g., limiting queries that are likely to result in risks or harms to individuals).

Finally, the ICO notes that contractual controls could be implemented by GenAI developers to help mitigate certain risks, such as a third party’s unrestricted use of the model.

What’s Next?

The ICO has stated that it expects to publish several chapters over the coming months, outlining its “emerging thinking on how [it] interprets specific requirements of UK GDPR and … the DPA 2018” in relation to the development and use of GenAI.

In the meantime, GenAI developers and downstream deployers of GenAI models should consider whether to respond to this first consultation and set out their views for the ICO to take into consideration.

[1] The UK’s data protection authority.

[2] See ICO consultation series on generative AI and data protection (15 January 2024).

[3] See ICO Generative AI first call for evidence (15 January 2024).

[4] Personal data is any data that relates to an identified or identifiable individual and may include names, contact details, personality traits, social media posts, visual and audio data.

[5] See Our work on Artificial Intelligence, in particular, the ICO’s thought pieces.

[6] Article 5(1)(a) UK GDPR.

[7] Article 83(5) UK GDPR.

[8] See ICO guidance on the lawfulness principle.

[9] See ICO Generative AI first call for evidence (15 January 2024).

[10] See Examples of processing “likely to result in high risk”. This DPIA requirement is consistent with the approach that EU data protection authorities have taken to date in relation to GenAI models.