Can you own the goose that lays golden eggs? Copyright in machine learning models

04 July 2025

The explosion of generative artificial intelligence tools over the last few years has sparked all manner of conversation about associated intellectual property.

One frequently discussed point relates to the copyright in the output of those systems. If large language models (LLMs) such as ChatGPT and Claude can generate literary works, and image generation tools like DALL-E and Stable Diffusion can ‘create’ artistic works, this calls into question the traditional concepts of authorship which underlie copyright protection. While the output of these systems can be of variable quality, there are clearly instances where the works generated are valuable, hence the instinct of many to try to find ways to own and protect those works via the intellectual property system. The position seems largely clear that, absent legislative change, the output of LLMs and other generative AI tools is unlikely to be protectable via traditional means like copyright.

However, a point that seems to be less commonly discussed is ownership of the underlying machine learning models (ML models) themselves. Surely, if these are capable of producing ‘golden eggs’ of content, then there is considerably more value in the ML model itself, as the ‘goose’ that lays them. Great time and expense go into creating those ML models – versions of Meta’s Llama LLM, for example, are reported to have required 6.4 million hours of time on GPU hardware to train.

Perhaps it is a widely shared assumption that these ML models will naturally attract copyright protection, hence the commentary being directed elsewhere. However, this is not a safe assumption, and there are a number of reasons as to why a machine learning model may not attract protection under the Australian Copyright Act 1968 (Cth) (Act).

ML models

The precise workings of AI systems and the ML models that power them are beyond the scope of this article, but a basic outline is relevant to the understanding of the legal analysis that follows.

Building and operating a machine learning system typically involves:

  • the creation of a small number of pieces of ‘traditional’ software, written by human authors in an orthodox programming language like Python or C, which automate the training and subsequent use of the ML model;
  • collating a corpus of existing materials (for example, in the case of an LLM, a significant volume of written works) to form the basis of the training process;
  • having one of the pieces of software iterate over those training materials one-by-one, seeing how the output of the ML model correlates with the provided example, and tweaking the ML model so that its output slightly better matches the training data; and
  • once that training process is complete, using another piece of ‘traditional’ software to operate the ML model, having it undertake its task in new scenarios separate to the training materials.

An ML model will be trained for use in a specific task such as classification (for example, identifying the contents of images) or generation (creating new images, text, audio or video).
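To make the training process described above concrete, the following is a deliberately simplified, hypothetical sketch. It fits a toy one-parameter model by gradient descent; real ML models have billions of parameters and far more elaborate training software, but the shape of the process is the same: ‘traditional’ human-written software iterates over the training examples, compares the model’s output with each example, and tweaks the model so its output slightly better matches the data.

```python
# A minimal, illustrative sketch of an ML training loop.
# The 'model' here is a single parameter w in the function y = w * x;
# it is not any particular system's implementation.

def train(examples, learning_rate=0.01, epochs=100):
    w = 0.0  # the single parameter of our toy model
    for _ in range(epochs):
        for x, y in examples:
            prediction = w * x
            error = prediction - y
            # Tweak the parameter so the model's output slightly better
            # matches this training example (gradient of squared error).
            w -= learning_rate * 2 * error * x
    return w

# The 'corpus of training materials': pairs drawn from the rule y = 2x.
corpus = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
w = train(corpus)

# Once training is complete, separate 'traditional' software operates
# the model, applying it to new inputs it never saw during training.
print(round(w * 10.0, 2))  # a new scenario: x = 10; prints 20.0
```

Note that no human author writes the trained parameter values themselves: they are the automated residue of the training loop, which is the point on which the legal analysis below turns.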

Human authorship

The requirement for human authorship is likely the biggest hurdle for copyright to exist in an ML model.

Copyright arises specifically when a work is first reduced to a material form by a human author. Courts have used that timing to inform their analysis, sharpening the focus on the role of a human author in the process of crystallising the work into its final form.

The analysis of Justice Perram in Telstra Corporation Ltd v Phone Directories Company Pty Ltd¹ is highly instructive in this scenario. That case considered the use of software to compile telephone directories. While the relevant software was authored by humans, and put into motion by humans, the software went about its task in a largely autonomous way:

a computer program is a tool and it is natural to think that the author of a work generated by a computer program will ordinarily be the person in control of that program. However, care must be taken to ensure that the efforts of that person can be seen as being directed to the reduction of a work into a material form. Software comes in a variety of forms and the tasks performed by it range from the trivial to the substantial. So long as the person controlling the program can be seen as directing or fashioning the material form of the work there is no particular danger in viewing that person as the work’s author. But there will be cases where the person operating a program is not controlling the nature of the material form produced by it and in those cases that person will not contribute sufficient independent intellectual effort or sufficient effort of a literary nature to the creation of that form to constitute that person as its author: a plane with its autopilot engaged is flying itself. In such cases, the performance by a computer of functions ordinarily performed by human authors will mean that copyright does not subsist in the work thus created. Those observations are important to this case because they deny the possibility that Mr Vormwald or Mr Cooper were the authors of the directories. They did not guide the creation of the material form of the directories using the programs and their efforts were not, therefore, sufficient for the purposes of originality.²

One point of distinction between the analysis of Perram J and the training of an ML model is that the ML training process is not merely replacing a menial task that a human might otherwise do, and is instead undertaking a largely inscrutable technical task. This distinction further emphasises that the tasks undertaken as part of machine learning do not involve authors who ‘guide the creation of the material form’.

There is some human intellectual effort that goes into creating a successful ML model. The overall structure and scaffolding of the ML model will effectively be dictated by the author of the software which runs the relevant training process. There are then a range of high-level parameters to tweak prior to commencing training, but, given that a machine learning system will be reflective of the materials it has been trained on, one of the most significant ways to influence an ML model is to exercise editorial control over the corpus of training materials. However, even if there is quite an art to this selection exercise, this is not necessarily work directed to bringing about the work in its final material form. As Perram J went on to state:

Although humans were certainly involved in the Collection Phase that process antedated the reduction of the collected information into material form and was not relevant to the question of authorship… Whilst humans were ultimately in control of the software which did reduce the information to a material form, their control was over a process of automation and they did not shape or direct the material form themselves (that process being performed by the software). The directories did not, therefore, have an author and copyright cannot subsist in them.³

Given that clear and recent authority, it seems very difficult to contend that an ML model, brought about by an extensive automated process, would have the requisite degree of human authorship to attract protection under the Act.

Other concerns

Absent human authorship, copyright is highly unlikely to subsist in ML models, so any other deficiencies in the requirements for copyright protection remain purely academic.

Notwithstanding, ML models might also lack some of the other requirements for copyright protection.

Copyright only applies to certain categories of works, such as ‘literary works’, which include ‘computer programs’. Under section 10 of the Act, a ‘computer program means a set of statements or instructions to be used directly or indirectly in a computer in order to bring about a certain result’. We are unaware of any judicial consideration as to whether an ML model might meet that definition, given that it does not resemble a traditional piece of computer software written as a series of logical steps controlled via concepts such as conditional ‘IF’ statements and looping ‘WHILE’ statements. The question is whether an ML model is indeed a ‘computer program’, or merely data used in conjunction with a computer program. As Emmett J observed: ‘all programs are data but not all data is programs’.⁴

In order for copyright to arise, a work must also be original. Because that originality arises from human authorship, it is somewhat artificial to assess the two separately. If one were to, however, there may be cause to consider whether an ML model trained on existing materials could itself be said to be ‘original’. Noting that ML models do not typically contain verbatim copies of training materials, and instead draw inferences from existing works to create something new, this requirement may be less problematic than the fatal absence of human authorship.

Where to next?

The issues identified with respect to the status of ML models under the Act are in many ways reflective of the broader discussion around the degree to which machine generated works may warrant legal protection.

Philosophically one could argue that the work and effort expended in the training of these ML models is significant (and clearly, far more significant than a user who merely enters a prompt into a generative AI tool and awaits an output), and therefore worthy of protection of some kind, even if not the result of human endeavour. Others might argue that the creators of ML models are already profiting from training their systems on vast swathes of human creative works, often without compensation, and that this is already an unfair head start without granting some extra layer of protection to the ML models that result.

Even human-written software has always proven an unusual subject matter for traditional intellectual property regimes to tackle. The use of copyright – a legal right seeking to reward creative endeavour – for a functional work in the nature of software has never sat entirely comfortably, and the many decades of protection granted to authors feel questionable given the speed at which computerised technology moves. Similarly, the patent system has notoriously had to grapple with difficult questions about the degree to which software can form the basis of a patentable invention.

If systems driven by these kinds of ML models continue to grow in importance over the coming decades, it seems unlikely that their legal status will remain unclear for long. While we have attempted to ascertain how a Court might approach them under current laws, it remains to be seen whether legislatures (and even treaty makers) decide to tweak our current legal systems, or create new ones, to more clearly recognise rights in ML models.

This article was written by Daniel Kiley, Partner.


1. [2010] FCAFC 149.

2. Ibid at [118].

3. Ibid at [119].

4. Australian Video Retailers Association Ltd v Warner Home Video Pty Ltd (2001) 53 IPR 242 at 251-252.
