From code to creation: intellectual property issues in training generative AI

16 May 2023

Have you heard the news? Generative AI tools are taking over the world! Okay, maybe not quite yet, but they are certainly causing a stir in the tech world. These little computer wizards have been creating everything from bizarre cat images to Shakespearean sonnets, and we can’t get enough of their wacky creations. A generative AI tool even wrote those first four sentences of this article…

Generative AI is a type of AI that creates personalised content such as images, audio, videos, text or code based on text descriptions, prompts or commands. Although work on this technology has been underway for a handful of years, it is only recently that a boom of generative AI tools like ChatGPT and DALL-E 2 has caught the world’s imagination. While the technology is likely still in its infancy, generative AI is already revolutionising the way we approach content creation, with generative AI startups attracting record investment growth of 618% in 2022 compared with 2020. Even the creators of these tools are surprised by some of the results they achieve, with emergent behaviours sometimes lending themselves to new and unexpected functionality.

As generative AI tools continue to improve, demand for AI in business management and service delivery will likely continue to grow. Generative AI is already being used to generate software source code, produce marketing material, and even draft contract clauses. While such use of AI may benefit your business and reduce costs, there is a whole range of legal issues to consider before deploying or integrating such technologies in your business.

In this article, we consider specifically the intellectual property ramifications of the machine learning processes of generative AI tools. These tools ingest vast quantities of existing content as part of a training process, and then use that training to shape their own output. Is AI just like a human author, consuming media over their lifetime, and then producing their own work in light of the cultural and academic context they have been exposed to? Or without that human spark, is it more like a jumble of copied parts, devoid of originality? Might human authors have something to say about their works being fed into that process?

In an article to follow, we will also consider questions about the output of those systems. Can computers create works which themselves attract IP rights? If so, who owns those rights?

Machine learning and copyright

Underpinning the development and operation of most generative AI tools is a concept known as machine learning. Machine learning is a field of AI in which a computer program learns on its own and evolves in the process.

Much like how humans learn through experience, a piece of software can be created to learn through experience and examples. This process is governed by neural networks – pieces of software that attempt to crudely mimic the human brain through a set of algorithms. When training data is fed in, the machine generates an output based on its existing model, and then tweaks that model so that its output conforms more closely to the expected answer. This automated process continues until the model develops high accuracy in generating a predicted output. In more advanced versions of these systems that use techniques known as ‘deep learning’, a machine can learn to map any given input to a predicted output through multiple layers of neural networks and millions of training examples.
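For the technically curious, that loop can be sketched in a few lines of code. The example below is a minimal illustration only, assuming Python and the numpy library, and it learns a trivial toy function rather than anything generative – but the cycle is the same one described above: generate an output, compare it with the correct answer, and adjust the model’s numbers.

```python
import numpy as np

rng = np.random.default_rng(0)

# Training data: inputs and the 'correct answers' the model should learn
# (here, the XOR function - a classic toy example).
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

# A tiny two-layer neural network: its 'knowledge' starts as random numbers.
W1 = rng.normal(size=(2, 8))
W2 = rng.normal(size=(8, 1))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

learning_rate = 0.5
for step in range(10000):
    # Forward pass: generate an output from the current model.
    hidden = sigmoid(X @ W1)
    prediction = sigmoid(hidden @ W2)

    # Compare the output with the correct answers.
    error = prediction - y

    # Backward pass: tweak the weights so the error shrinks (gradient descent).
    grad_out = error * prediction * (1 - prediction)
    grad_W2 = hidden.T @ grad_out
    grad_hidden = (grad_out @ W2.T) * hidden * (1 - hidden)
    grad_W1 = X.T @ grad_hidden

    W2 -= learning_rate * grad_W2
    W1 -= learning_rate * grad_W1

print(np.round(prediction, 2))  # after training, close to [0, 1, 1, 0]
```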

By way of example, a neural network can be trained to identify an apple by inputting hundreds of images of fruit, along with the correct answers. The system then learns to determine what an apple looks like by processing the right and wrong answers. However, unlike traditional approaches to classification that teach a system to recognise essential features like shape, size, or colour, deep learning does not require such techniques. The machine instead learns to recognise an apple much like a human brain would – by looking at it, without going through a pre-programmed checklist.
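To make the contrast concrete, compare a hand-written checklist with a learned classifier. The sketch below is purely illustrative – the function names, feature values and the trained model object are all hypothetical:

```python
# Traditional approach: a programmer decides in advance which features
# matter (shape, size, colour) and writes an explicit checklist.
def is_apple_rule_based(shape: str, diameter_cm: float, colour: str) -> bool:
    return shape == "round" and 5 <= diameter_cm <= 10 and colour in ("red", "green")

# Deep learning approach: the model receives only raw pixels and labelled
# examples; which features matter is learned during training, not programmed.
# 'model' is assumed to be a classifier trained as described above.
def is_apple_learned(model, image_pixels) -> bool:
    return model.predict(image_pixels) > 0.5
```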

These techniques have been used to create classification systems with applications from the relatively trivial, like searching images in your phone’s photo library for images that contain beaches or dogs or sandwiches, to the much more significant, like assessing medical data.

Generative AI systems take this machine learning approach and move beyond mere classification or assessment of data, into the creation of new data. Having been exposed to vast libraries of images and associated keywords, AI tools like DALL-E use what they have learned from that training process to generate images matching a user’s prompt.
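In practice, using such a tool can be as simple as a few lines of code. The sketch below is illustrative only, assuming the open source ‘diffusers’ Python library and one publicly released Stable Diffusion model:

```python
# Illustrative only: generate an image from a text prompt using an open
# source diffusion model. Assumes the diffusers library is installed and
# the model weights have been downloaded.
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
image = pipe("a still life of apples in the style of a Dutch master").images[0]
image.save("apples.png")
```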

In this context, the application of these machine learning techniques raises the potential risk of copyright infringement at two points – the use of existing works as training data, and the recollection of elements of existing works as part of ‘newly’ generated works.

Training of AI

Deep learning requires large datasets to train AI – typically, the bigger the better. The internet is obviously a vast source of text and images, and so the temptation for developers is to scrape information from these publicly accessible sources to use as training material for their models. But the fact that something has been published online does not give the world the right to deal with it however it pleases.
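Scraping of this kind is technically trivial, which is part of the problem. The sketch below (placeholder URL, assuming the ‘requests’ and ‘beautifulsoup4’ Python libraries) gathers the sort of image/caption pairs used to train image models – and, tellingly, nothing in it checks who owns the material or on what terms it was published:

```python
# Simplified sketch of scraping a public webpage for training data.
# Note what is missing: no check of ownership or licence terms.
import requests
from bs4 import BeautifulSoup

page = requests.get("https://example.com/gallery")  # placeholder URL
soup = BeautifulSoup(page.text, "html.parser")

images = soup.find_all("img", src=True)
image_urls = [img["src"] for img in images]
captions = [img.get("alt", "") for img in images]
# image/caption pairs like these are exactly what image models train on
```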

Copyright gives the owner of a work the exclusive right to do certain acts. In respect of literary and artistic works, these rights include the right to reproduce the work, publish the work, or communicate the work to the public. Owners of literary works also have the exclusive right to make adaptations of that work, such as a translation of the work into another language, or a screenplay based on the work. Accordingly, a person who does or authorises the doing of any of these acts without the licence of the owner of the copyright infringes the copyright.

A preliminary question then is whether the machine learning training process involves doing one of these exclusive acts. Could exposing a machine learning model to a literary work be akin to a human merely reading a novel, and internalising some of its writing style?

The model generated by a machine learning process is not a vast database of all of the training materials it consumed. A machine learning model is incomprehensible to the human eye, effectively consisting of a convoluted collection of numbers. It is not necessarily clear that feeding a copy of a copyright work to a machine learning model creates a ‘reproduction’ of the work (or a substantial part of the work) in that model. However, a US lawsuit on behalf of a group of artists has alleged that the ‘diffusion’ process used by a number of image generation tools effectively involves ‘a way of storing a compressed copy’ of artists’ works, and therefore the resulting model does incorporate reproductions of the training materials.
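The point can be illustrated in code. Using the PyTorch library, the sketch below builds a small (untrained) neural network and shows that what a model actually contains is arrays of floating point numbers – its ‘parameters’ – rather than copies of any work:

```python
# A model is a collection of numbers, not a database of training materials.
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(784, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 10),
)

total = sum(p.numel() for p in model.parameters())
print(f"{total} parameters")   # 203530 floating point numbers
print(model[0].weight[0, :5])  # a slice of them: opaque to the human eye
```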

As a more practical matter, the process of collecting and providing a work to the model is likely to involve making reproductions of the work, which could constitute acts of infringement.

Some researchers have defended the use of copyright materials in academic work on machine learning, arguing that the non-commercial nature of their projects would bring the use within the ‘fair dealing’ exception for the purpose of research or study under the Australian Copyright Act, and ‘fair use’ under corresponding US laws. Taking this approach, however, severely limits the potential to make use of the results of any such research for commercial purposes later.

Others have attempted to argue that using copyright materials as part of training still constitutes ‘fair use’ under US copyright laws, because the use is transformative. Australian copyright law does not have a corresponding broad fair use concept, with any claim of ‘fair dealing’ needing to fit within predetermined categories.

Developers attempting to steer clear of these issues have instead sought to use ‘open source’ materials to train their models. Open source works are available to be used without a fee, but are not necessarily free from limitations, with use governed by the terms of the relevant open source licence, such as one of the Creative Commons licences. It is not uncommon for these licences to permit the use of materials only in non-commercial contexts, which would severely limit the scope for use of those materials in training commercial AI systems.

Some open source licences also incorporate attribution requirements, so that any use of licensed works must be appropriately credited. Providing appropriate attribution for huge volumes of works is likely to be practically challenging. This has proved to be a point of contention for GitHub’s Copilot software. GitHub is a software source code repository service operated by Microsoft, and hosts an array of open source and closed source software code. OpenAI (in which Microsoft is a major investor) used open source code from GitHub to train the new Copilot tool, which proactively suggests snippets of code to developers as they write new software. A class action filed against GitHub, Microsoft and OpenAI in the United States has taken issue with the use of open source materials in this way where those materials were released under licences which require attribution.

Output

Given that generative AI systems create ‘new’ works inspired by their training data, there is also a question as to whether those outputs could infringe the copyright of those original sources.

Legally, the production of a generative AI output would be infringing if it reproduces a substantial part of any of the original works. A substantial part does not necessarily need to be a large volume of the original work, but could instead be a smaller part which is significant or memorable.

Practically, generative AI systems are not intended to merely reproduce their training data verbatim. Ideally, these systems operate like a human author who is familiar with a shared cultural context and produces their own work inspired by that context, without appropriating any specific details from works that have come before. In practice, this might not always be the case.

Academics1 have also been able to identify instances where generative AI models appear to ‘memorise’ inputs, and regenerate them near verbatim. The following image of US evangelist Anne Graham Lotz (left), for example, was released under a Creative Commons licence which requires any use of the work to carry an attribution. Those academics were able to have the image generation tool Stable Diffusion create a nearly identical image (right) based on the prompt ‘Ann Graham Lotz’, without any attribution of the original work being provided.

[Image: the original photograph of Anne Graham Lotz (left) alongside the near-identical Stable Diffusion output (right)]

Original image “Anne Graham Lotz (October 2008).jpg” by AnGeL Ministries is licensed under CC BY-SA 3.0
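As a rough illustration of how such memorisation can be detected (an assumption on our part – the cited paper’s actual methodology is more sophisticated), one can generate an image from the original caption and measure how closely its pixels match the original:

```python
# Crude memorisation check (illustrative only, not the cited paper's
# method): generate from the original caption, then measure pixel
# distance to the original image. A distance near zero suggests the
# model has regenerated the training image near verbatim.
import numpy as np
from PIL import Image

generated = pipe("Ann Graham Lotz").images[0]       # pipeline from the earlier sketch
original = Image.open("anne_graham_lotz_2008.jpg")  # placeholder filename

a = np.asarray(generated.resize(original.size).convert("RGB"), dtype=float)
b = np.asarray(original.convert("RGB"), dtype=float)
print(np.mean((a - b) ** 2))  # mean squared pixel difference
```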

Getty Images (Getty) has accused the developers of Stable Diffusion of infringing copyright in more than 12 million of its images. As part of its complaint, Getty has been able to identify instances where Stable Diffusion will output images which are extremely similar to Getty’s original works, gallingly even down to the ‘Getty Images’ watermark.

More common, though, are a range of subtler issues.

One neat trick of generative AI systems is their ability to generate new works in the style of an existing author or artist. On sophisticated systems, prompts such as ‘write the lyrics for a song in the style of Taylor Swift’ or ‘a painting in the style of Picasso’ can generate surprisingly authentic-looking results. Mimicking the signature style of an artist is much more of a grey area under copyright law. In the ‘diffusion’ lawsuit referred to above, the plaintiffs have sought to expand beyond strict copyright, also alleging that the ability to generate works ‘in the style’ of an artist infringes that artist’s publicity rights (a legal concept not recognised in Australia).

Conclusion

Given how effective machine learning techniques have proved in creating generative AI tools, they seem unlikely to be going away anytime soon. On the contrary, because the efficacy of these systems tends to improve as they are fed greater volumes of information, the appetite for training materials is only likely to increase over time. It will be interesting to see whether, depending on the outcome of test cases like those discussed above, tech companies gathering those materials (and possibly end users seeking to use their output) need to become more cautious in training and using AI tools.

Stay tuned for our next article where we discuss how existing IP laws deal with the output of generative AI systems.

How can HWL Ebsworth help?

HWL Ebsworth’s Intellectual Property and Technology team has extensive experience in advising businesses on intellectual property and software issues. If you are concerned about infringement of your intellectual property, or how you deploy new technology in your business, please do not hesitate to contact us for further information on how we can assist you.

This article was written by Daniel Kiley, Partner and Paul Sigar, Solicitor. 


1 Nicholas Carlini et al, ‘Extracting Training Data from Diffusion Models’ (2023), https://arxiv.org/abs/2301.13188
