Fair Use or Foul Play? Is ChatGPT Swiping 10 Million Copyrights?

Large generative AI models from OpenAI, Google, Meta, Midjourney and Stable Diffusion were built from millions of copyrighted works scraped from the web. Is this fair use or copyright infringement?

Apr 27, 2023

Humans are starting to complain that robots are copying their work

The headline-grabbing artificial intelligence models from OpenAI, Google, Meta, and Microsoft were all trained on massive amounts of data scraped from millions of websites and millions of published books.

Much of this data – although publicly available – comes from copyrighted works. So did these large language models infringe on all these copyrights?

The copyright holders and the content owners are starting to fight back.

The Copyright Lawsuits Are Already Here

Multiple lawsuits are making their way through the courts as we speak.

In the Stable Diffusion litigation, a class action brought on behalf of visual artists, the plaintiffs claim that several AI image generation tools, including Stable Diffusion and Midjourney, infringed their copyrights. Specifically, they allege that these models were trained on their copyrighted works without their consent. (Full disclosure – the header image for this post was created using Midjourney.)

In a similar lawsuit, Getty Images has sued Stability AI, the company behind Stable Diffusion, claiming that it used more than 10 million of Getty’s copyrighted photos to train its model.

Large Platforms Are Demanding to Get Paid

Many large platforms, including Twitter, Reddit, and Stack Overflow, are also pushing the AI companies for compensation. As far as we can tell, these demands are still in the negotiation stage and no lawsuits have been filed, although Elon Musk’s recent tweet suggests that one may be coming soon.

Training Data Is in a Legal Grey Area

So are the AI companies liable for copyright infringement because they trained their models on copyrighted works? At the moment, it’s a little unclear. The argument from the AI companies is that using data to train a model amounts to “fair use.” Under U.S. law, making copies of a copyrighted work may be OK, i.e., not infringing, if the copies are made for purposes such as “scholarship” or “research.”

But fair use is tricky, and there is no bright-line rule. One of the most important factors that courts look to is whether the supposed “fair use” winds up competing with the original copyrighted work. Say a book review quotes from a novel. A court is almost certainly going to find that the quotation, even though it technically copies from the novel, is fair use and therefore not copyright infringement. One crucial factor here is that, in general, reviews – even bad reviews – are good for the market for books.

When it comes to large AI models, or at least some large AI models, it’s not so clear. Consider two different models trained on a library of mystery novels. The first model is a recommendation engine. You tell it the mystery novels you like, and it will recommend some similar books you might enjoy. Verdict: almost certainly fair use. In this case, the recommendation engine is encouraging you to buy the underlying product.

But now let’s take a different case. A “generative” engine. You tell it the mystery novels you like, and it generates a brand new novel “inspired by” the books you listed. All of a sudden, the “fair use” argument is less strong, and maybe a loser. In this case, the output of the model is competing with the original works that it was trained on.

For Now, We Wait for the Courts

Given the strong interest in these issues from all sides, we can expect some answers over the next months and years. Whether these answers will be definitive is a different question. Things are moving fast. Watch this space. We’ve set up a page on our website where we will be tracking developments: https://lawsnap.mywikis.wiki/wiki/Training_AI_Models_and_Copyright

We’d love your feedback.

Thank you for reading LawSnap by Adam David Long. This post is public, so feel free to share it.
