Feeding the Machine: How US courts are drawing the line on AI training

Introduction

The accelerating development of generative artificial intelligence (AI) has forced courts to grapple with novel and unsettled questions of copyright law. Central among these is whether the use of entire copyrighted works to train large language models (LLMs) without the author’s consent constitutes infringement of those works.

In Australia, the Productivity Commission is the most recent to grapple with this question, and summarised the issue neatly in its recent interim report into Harnessing data and digital technology (Interim Report):

The datasets used to train AI models often contain digital copies of media such as web pages, books, videos, images and music. These media are often the subject of copyright protection, which means that their use to train AI models requires permission from the copyright holder. Permission is required because AI models must ‘copy’ the protected material at least temporarily to undertake the training process.

The Interim Report then went on to discuss Australia’s ‘fair dealing’ regime, which allows certain uses of copyright works without the need for licence from the copyright owner, but only for certain specified purposes, such as research or study, criticism or review, or parody or satire. The Interim Report seeks feedback on expanding this regime to include fair dealing for the purpose of text and data mining, which could more squarely legitimise AI training activities in Australia.

Where the Australian ‘fair dealing’ regime only applies to certain permitted purposes, some other countries, such as the United States (US), have a broader ‘fair use’ doctrine, under which any use of copyright material may be permissible provided that it is considered fair, without reference to legislatively-permitted purposes.

The question then arises as to whether the use of copyright works as AI training materials can be excused under the ‘fair use’ doctrine in the US. Many of the major commercial AI models originate from the US, and their developers are relying on the fair use doctrine to justify their activities. In June 2025, the US District Court for the Northern District of California issued two decisions, in Bartz v Anthropic PBC (Bartz)1 and Kadrey v Meta Platforms Inc (Kadrey)2, that directly addressed this question. While these rulings suggest that US courts may accept fair use as a defence to AI training, their scope is narrow. Both were decided at the summary judgment stage, and as the judge in Kadrey noted, ‘the consequence of this ruling is limited […] to the rights of these thirteen authors’. Accordingly, the significance of these rulings remains provisional, with the scope of fair use in the context of AI training to be more clearly defined as further cases are determined.

Background

In Bartz, Anthropic, the developer of the ‘Claude’ LLM, assembled a centralised digital library of books containing both lawfully acquired works, scanned from purchased physical copies, and unlawfully obtained works downloaded from unauthorised torrent websites. Authors alleged that their books were included without permission and were used both to train Anthropic’s models and for other purposes. Alsup J considered the two uses separately, ultimately holding that the use of works for training constituted fair use. However, Alsup J found that the use of pirated copies to build Anthropic’s centralised library was not justified under fair use, rejecting Anthropic’s argument that such copies should be treated as training data, and denying summary judgment on that issue.

In Kadrey, Meta trained its ‘LLaMA’ AI models on datasets including those obtained from ‘shadow libraries’ such as Library Genesis and Z-Library. These datasets were downloaded via torrenting, incorporated into the training process, and discarded thereafter; unlike Anthropic, Meta did not maintain a persistent digital library. Thirteen best-selling authors alleged that this constituted copyright infringement on a class-wide basis. Chhabria J also found the training to be fair use, but adopted a more cautious approach, observing that ‘future evidence of substitution or harm to licensing markets, particularly where works are unlawfully obtained’ could warrant a different conclusion.

While both courts concluded that the training uses fell within the scope of fair use, their reasoning diverged in certain respects. Alsup J emphasised the transformative purpose of training and discounted speculative claims of market harm, whereas Chhabria J stressed the potential for market harm arguments and evidence to alter the analysis. These decisions have no binding effect in Australia, where there is no general fair use defence. Nonetheless, they highlight the emerging tension between protecting incentives for human creativity and facilitating technological innovation, a tension likely to intensify as generative AI becomes further integrated into creative and commercial practice.

US fair use

Section 107 of the US Copyright Act provides for a set of non-exclusive factors to be considered in determining whether or not a particular use is a ‘fair use’, namely:

  1. the purpose and character of the use;
  2. the nature of the copyrighted work;
  3. the amount and substantiality of the portion taken; and
  4. the effect of the use upon the potential market for or value of the copyrighted work.

These factors are illustrative rather than exhaustive, and the analysis US courts undertake is holistic. Modern doctrine places particular emphasis on whether the use is ‘transformative’, in the sense of adding new expression, meaning, or purpose to the original work.

Applying the fair use factors

First factor – Purpose and character of use

In Bartz, Alsup J held that Anthropic’s use of the authors’ works to train its LLMs was ‘spectacularly’ transformative. The purpose of the copying was to ‘iteratively map statistical relationships between every text-fragment and every sequence of text-fragments so that a completed LLM could receive new text inputs and return new text outputs as if it were a human reading prompts and writing responses’.

Alsup J emphasised that the LLMs ‘trained upon works not to race ahead and replicate or supplant them — but to turn a hard corner and create something different‘. Crucially, the authors had alleged only that the training itself constituted infringement, not that any outputs generated for users were infringing. On that basis, Alsup J rejected arguments that the training was intended to memorise or reproduce the creative elements of the works.

As Alsup J explained, authors ‘cannot rightly exclude anyone from using their works for training or learning as such. Everyone reads texts, too, then writes new texts. They may need to pay for getting their hands on a text in the first instance. But to make anyone pay specifically for the use of a book each time they read it, each time they recall it from memory, each time they later draw upon it when writing new things in new ways would be unthinkable’. Alsup J analogised the training of LLMs to centuries of human practice in reading, internalising, and later drawing upon literary works to create ‘new things in new ways’.

Alsup J also addressed Anthropic’s conversion of purchased print copies into a centralised digital library. This use was characterised as a mere format change that was itself transformative, noting that ‘[s]torage and searchability are not creative properties of the copyrighted work itself but physical properties of the frame around the work or informational properties about the work’. Each print copy was digitised to reduce storage requirements and enable searchability, with the print original destroyed. There was no evidence that the digital copies were disclosed externally. While Anthropic’s commercial status was relevant, it was found not to be dispositive.

In Kadrey, Chhabria J similarly found that the first factor favoured fair use. Meta’s purpose in copying was to train LLaMA’s ‘highly transformative’ tools capable of generating varied text and performing a broad range of functions. This purpose differed fundamentally from that of the authors’ books, which were intended to be read for entertainment or education. Chhabria J rejected the contention that LLaMA’s outputs could ‘mimic’ the authors’ works or styles in a way that undermined the transformative use, observing that ‘style is not copyrightable’, and that even with adversarial prompts LLaMA would not reproduce more than 50 words of any author’s book.

Chhabria J further considered the act of downloading the books from shadow libraries. While recognising that downloading was a distinct use from training, his Honour held that ‘downloading must still be considered in light of its ultimate, highly transformative purpose: training LLaMA’. On this reasoning, ‘[b]ecause Meta’s ultimate use of the [authors’] books was transformative, so too was Meta’s downloading of those books’.

Second factor – Nature of the work

In Bartz, Alsup J accepted that the authors’ books contained expressive elements and that Anthropic had selected them for those qualities. On this basis, the second factor weighed against fair use for all copies and for all purposes of use.

In Kadrey, Chhabria J likewise found that the second factor favoured the authors. His Honour rejected the argument that Meta had used the books solely to access ‘functional elements’ rather than their creative expression. While acknowledging that LLMs extract statistical relationships from training data, Chhabria J observed that ‘those relationships are the product of creative expression’. Nevertheless, Chhabria J stressed that the second factor ‘has rarely played a significant role in the determination of a fair use dispute’, and that its weighing in favour of the authors ‘doesn’t mean much for the analysis as a whole’.

Third factor – Amount and substantiality

In Bartz, Alsup J held that the extent of copying was reasonably necessary to the transformative purpose of training LLMs, and accordingly this factor weighed in favour of fair use. The relevant inquiry does not focus on the proportion of the work copied in isolation, but on the amount made accessible to the public and whether it serves as a competing substitute for the original. As the authors had not alleged that any output from Claude was infringing, Alsup J concluded the copying was reasonable. It was common ground that training an LLM requires exposure to billions of words, and the authors did not dispute that such volume is essential to model performance. Alsup J therefore found that the large-scale use of works was proportionate to the purpose.

Alsup J also addressed the digitisation of purchased copies to create a searchable central library. Because the intended purpose, enhanced storage efficiency and searchability, required reproduction of the entire work, this copying too was found to favour fair use.

In Kadrey, it was undisputed that Meta had copied the entirety of each work. While complete reproduction generally weighs against fair use, both courts accepted that, in the context of AI training, copying the whole work may be technically necessary. Chhabria J observed that ‘feeding a whole book to a LLM does more to train it than would feeding it only half of that book’, and that partial copying would impair the model’s capacity. Given this technical necessity, Chhabria J concluded that the third factor favoured a finding of fair use.

Fourth factor – Effect on market or value

In Bartz, Alsup J found that the fourth factor weighed in favour of fair use. The authors had conceded ‘that training LLMs did not result in any exact copies nor even infringing knockoffs of their works being provided to the public’, and there was no evidence that the use of their books to train specific LLMs displaced demand for the originals. Alsup J rejected the broader claim that AI training would generate an ‘explosion of works competing with their works’, reasoning that ‘[t]his is not the kind of competitive or creative displacement that concerns the Copyright Act’, which ‘seeks to advance original works of authorship, not to protect authors against competition’. The Court likewise declined to recognise harm to a nascent licensing market for AI training, holding that ‘such a market for that use is not one the Copyright Act entitles Authors to exploit’.

With respect to Anthropic’s digitisation of purchased books to build a central library, Alsup J considered the factor neutral. Any losses resulting from the format change ‘did not relate to something this Copyright Act reserves for the Authors to exploit’, and there was no evidence of intent to redistribute or inability to secure the library against external access.

On the other hand, in Kadrey, Chhabria J characterised the fourth factor as ‘the most important element of fair use’ and outlined three approaches by which authors might establish market harm:

  1. demonstrating that the AI reproduces substantial portions of the work;
  2. showing the existence of a viable licensing market for AI training that is harmed by unlicensed use; or
  3. proving that AI-generated works compete with and substitute for the originals.

While the authors’ evidence was insufficient in this case, Chhabria J emphasised that stronger proof in future litigation could yield a different result, a stance that stands in some tension with Alsup J’s dismissal of competitive displacement by AI-generated works as cognisable market harm.

Applying this market harm analysis to the evidence before the Court, Chhabria J rejected:

  • the first approach on the basis that LLaMA cannot produce any meaningful portion of the authors’ works; even under adversarial prompting, it would not ‘generate more than 50 words […] from the [authors’] books’;
  • the second approach on the ground that defining the market as ‘the theoretical market for licensing the use at issue’ would render the factor circular, as ‘harm from the loss of fees paid to license a work for a transformative purpose is not cognizable’; and
  • the third approach, with Chhabria J acknowledging that ‘the concept of market dilution becomes highly relevant’ for technologies capable of producing ‘literally millions of secondary works, with a miniscule fraction of the time and creativity used to create the original works’. Nonetheless, it was found that the authors had neither pleaded market dilution nor adduced evidence to support it.

Purchased and pirated works

In Bartz, Alsup J drew a sharp distinction between lawfully acquired and unlawfully obtained works. The scanning of purchased hard copies into digital form for internal AI training was found to be fair use. By contrast, the acquisition of works from pirate websites for inclusion in Anthropic’s centralised library was held not to be fair use. Alsup J emphasised that such ‘copies of works (pirated ones, too) would be retained [‘forever’] for [‘general purpose’] even after Anthropic determined they would never be used for training LLMs’. As Alsup J observed, ‘[s]uch piracy of otherwise available copies is inherently, irredeemably infringing even if the pirated copies are immediately used for the transformative use and immediately discarded’. The claim relating to the alleged pirated works has progressed to a class action settlement between the parties, subject to approval by Alsup J. The terms of the settlement have not yet been disclosed.

In Kadrey, the issue of pirated works was not separately analysed within the fair use framework. While Meta’s training datasets were sourced from ‘shadow libraries’, the claim concerning their acquisition remains pending and was not resolved in the summary judgment decision.

Broader implications

For AI developers in the US, both Bartz and Kadrey appear to provide a degree of reassurance that the use of lawfully obtained copyrighted works in AI model training can constitute fair use where the use is transformative and market harm is not established. This may assist in reducing the immediate risk for model training, though the scope of protection is not uniform. Alsup J’s decision may be read as signalling a willingness to accommodate technological advancement within copyright law, particularly by permitting the use of purchased works without a specific AI training licence. Given that the utility of an AI model is tied to the breadth and quality of its training data, and that training requires vast quantities of text, this approach could reduce the cost barrier associated with licensing, particularly for smaller developers for whom such costs may be prohibitive.

At the same time, the decision is a key indicator of the legal and reputational risks of using pirated materials. Alsup J’s finding that the retention of pirated copies is ‘inherently, irredeemably infringing’ makes clear that AI companies must ensure their training datasets are lawfully sourced.

By contrast, Chhabria J’s reasoning suggests a greater willingness to consider market harm claims and to require compensation where justified. His Honour cautioned that ‘[i]f using copyrighted works to train the models is as necessary as the companies say, they will figure out a way to compensate copyright holders for it’. This signals that further litigation, particularly with stronger evidence of substitution or licensing market impact, could raise the cost and complexity of AI development.

For authors and other rights-holders, the immediate loss on fair use in both cases does not foreclose future claims. Chhabria J’s observation that ‘the consequence of this ruling is limited […] to the rights of these thirteen authors’ leaves open the possibility of different outcomes where authors can show that AI outputs compete with or devalue their works in the market. This places a premium on evidentiary strategies that demonstrate concrete substitution effects or dilution of value.

Conclusion

Bartz and Kadrey are among the first substantive judicial engagements with the application of fair use to large-scale generative AI training. Both decisions accept that the use of entire works for training may be fair use, but their divergence on market harm ensures the doctrine’s boundaries remain unsettled. While instructive for AI developers and copyright holders, these rulings are not determinative. A number of other proceedings now before the US courts will provide further judicial consideration of these issues. The most prominent is the action brought by The New York Times against OpenAI, the developer of ChatGPT, and its commercial partner Microsoft, alleging copyright infringement.

These US cases foreshadow the likely questions Australian courts and legislators may face if asked to reconcile copyright protection with the functional requirements of generative AI. Ultimately, the central challenge lies in calibrating the balance between protecting copyright and enabling technological progress, a balance that will continue to shape copyright law in the AI era.

While these decisions are most relevant to the developers of AI systems, especially those in the US, any findings that limit the use of copyright materials for training may have a significant impact on what those systems look like going forward, with potential ramifications for end users internationally. Organisations looking to adopt AI tools in their operations should consider ways of doing so that minimise the associated risks and, where possible, choose tools that come with warranties or indemnities from vendors about their IP status.

HWLE Lawyers’ IP and Technology teams have extensive experience in advising businesses regarding copyright infringement, including in the context of AI. If you are concerned about your works being used to train AI, or looking to adopt new AI tools in your business while limiting IP risk, please contact us for further information on how we can assist you.

This article was written by Daniel Kiley, Partner, Maximilian Soulsby, Associate, and Christopher Power, Solicitor.


1 Bartz v. Anthropic PBC, Case No. 24-cv-5417 (N.D. Cal. June 23, 2025).
2 Kadrey v. Meta Platforms, Inc., Case No. 3:23-cv-03417 (N.D. Cal. June 25, 2025).

Important Disclaimer: The material contained in this publication is of general nature only and is based on the law as of the date of publication. It is not, nor is intended to be legal advice. If you wish to take any action based on the content of this publication we recommend that you seek professional advice.
