The use of copyrighted material in training large language models (LLMs) has sparked legal battles and takedown notices. In the Netherlands, anti-piracy group BREIN takes credit for forcing the popular 'GEITje' LLM offline, which in part was trained on copyrighted texts. The developer didn't necessarily agree with BREIN, but lacked the resources to fight back.
Development of AI continues to progress at a rapid pace. This includes work on large language models (LLMs), which are typically trained on broad datasets of texts.
These technologies promise unparalleled progress which could benefit society as a whole. Yet despite widely recognized potential, areas of significant concern remain.
That many LLMs were trained on datasets containing copyrighted content is now widely known. This has led to numerous complaints and high-profile lawsuits, with companies like OpenAI, Google, Meta, Microsoft, and NVIDIA facing allegations of copyright infringement.
The courts will ultimately decide whether rightsholders have legitimate copyright claims, or whether technology companies can indeed rely on a ‘fair use’ defense. It will likely take many years before a final decision is reached. Until then, rightsholders are doing all they can to prevent future infringements.
Books3
The Books3 dataset, used to train many popular LLMs, initially attracted significant attention. The dataset was compiled by AI researcher Shawn Presser in 2020, using the library of ‘pirate’ site Bibliotik.
Books3 was widely shared online and incorporated into other databases, including ‘The Pile,’ an AI training dataset compiled by EleutherAI. This practice remained largely unchallenged for years, but when AI entered the mainstream, copyright complaints surged.
Due to pressure from rightsholders and anti-piracy groups, Books3 was removed from numerous online platforms over copyright concerns. Danish anti-piracy group Rights Alliance spearheaded several of these takedown actions, while describing AI-themed infringement as a major problem.
“We have a big task ahead of us in detecting and taking down illegal training datasets like Books3, but also in dealing with AI that has already been trained on illegal content and is now spreading on the internet,” Rights Alliance Director Maria Fredenslund said previously.
Books3 Offline
In the ensuing months, takedown efforts persisted. Notably, these efforts expanded beyond datasets containing complete books, targeting the models trained on this data as well.
Dutch anti-piracy group BREIN has been active on this front and announced that, as a result of its efforts, one of the largest Dutch LLMs, ‘GEITje-7B’, was taken offline.
This LLM was trained on ‘Gigacorpus’, a dataset previously targeted by BREIN. It includes a vast collection of Dutch texts and books, some of which contained copyrighted material sourced from the shadow library LibGen.
“We see a worldwide trend that creators of AI models have little or no respect for copyright,” BREIN writes.
“Apparently, the thinking is that all the attention, time and money put into copyrighted works by creators and media companies are less important than the AI models,” the group adds.
GEITje Offline
In their defense, the LLM creator cited copyright exceptions for text and data mining for scientific purposes. However, BREIN argued that the European AI Act mandates the use of lawfully acquired content as inputs for AI models.
This disagreement wasn’t tested in court. The LLM developer lacks the funds to litigate the matter, so he decided to take GEITje offline voluntarily.
Voluntary Shutdown
Machine learning engineer Edwin Rijgersberg developed the GEITje LLM as a hobby. While the 7-billion-parameter model became quite popular, he is not in a position to mount a legal challenge.
Rijgersberg previously consulted copyright experts who informed him that the issue isn’t as black and white as portrayed by some rightsholders. That said, a legal battle would be expensive.
“I cannot afford to engage in a lengthy and costly legal battle to resolve these issues. After all, GEITje was a non-commercial, scientific hobby project. For this reason, I am complying with BREIN’s request,” Rijgersberg notes.
The end of GEITje 1
While BREIN stresses the importance of protecting copyrights, GEITje’s developer still has hope for an open-source Dutch-language AI landscape.
“In my view, the future of European AI still lies in open-source AI. Only when AI is free to use, can be studied by everyone, and is freely available to modify and share for any purpose, can we truly speak of sovereign AI.”
While GEITje won’t make a comeback, Rijgersberg highlights that there are now many other Dutch LLMs available to the public. These models are trained on various datasets, which may or may not include copyrighted material.