How Artificial Intelligence is disrupting the publishing industry
By Bartek Bezemer
21 March 2024

Artificial intelligence is drastically reshaping the publishing industry, from AI-generated books to copyright protection systems.

In February 2023, books generated with artificial intelligence tools like ChatGPT were flooding Amazon. Brett Shickler, a salesman in Rochester, New York, told Reuters that writing a book finally felt possible with ChatGPT. By playing around with prompts, Shickler was able to put together a 30-page illustrated children's e-book in just a few hours and put it on sale on Amazon.

Amazon eventually stepped in and restricted self-publishing authors to no more than three books a day, to prevent being flooded with a near-infinite stream of AI-generated books. That is still an astoundingly high limit, but it showed that Shickler's case wasn't unique and that many more were trying to make a quick buck by selling AI-generated content.

These incidents serve as a precursor and a warning to a publishing industry facing uncertain times. How will content generated by large language models like ChatGPT impact the industry? Can publishers work with the technology, or is their fate already sealed?

Artificial Intelligence opportunities 

The rise of AI feels like it happened yesterday, but the publishing industry has been observing the rise and adoption of this technology for several years. In October 2020, Frontier Economics released an overview of the implications of artificial intelligence within the publishing industry. The deep dive into AI was made on behalf of the United Kingdom’s Publishing Association. Frontier Economics highlighted that the UK government has been pushing for further investment into the technology, believing AI could be a valuable driver for productivity growth. 

These productivity gains, with the help of AI, could already be witnessed in the publishing industry itself. Frontier Economics pointed out that AI had been rolled out for IP protection, content discoverability, market prediction and strategic insight across academic, education and consumer publishing. Publishing companies use the technology to search and summarize content, automating many routine tasks.

Frontier Economics asked publishers how important the rise of AI would be for the industry over the coming five years. Publishers noted that artificial intelligence would be of great importance, bringing forth a wave of transformation, while only a minority believed that AI would be a disruptive force for the industry.

However, half of large publishers already expected the technology to start transforming the publishing industry within the next three years. In hindsight, their predictions were right on the mark. Small to medium-sized publishers were still in the denial phase, expecting little to no impact. As we would see, AI-generated books flooded digital storefronts over the course of 2023, putting the industry on edge about the future implications of AI-generated content.

Frontier Economics highlighted that two-thirds of publishers expected AI to become a strong competitor to their businesses in one way or another. Researchers at Frontier Economics noted that the competition would most likely come from AI start-ups in the publishing space. For existing publishers to compete with this new wave of market players, they would have to invest in their own AI solutions.

These solutions wouldn't necessarily compete with the existing workforce, but would unlock new forms of creativity. This was also noted by industry leaders. Annie Callanan, president of the Publishers Association and CEO of Taylor & Francis, commented on the findings that the technology would free up humans to become more creative and explore new research areas. Callanan believes that AI and humans can co-exist.

Publishers would be able to produce more groundbreaking research, win more Booker Prizes and advance educational material for students around the world. Callanan's take seems very optimistic, as it omits critical aspects of human creativity. Granted, her observations came before ChatGPT took off and the technology made enormous leaps in generating written content.

Redefining copyright policy

There are still major hurdles for the publishing industry to overcome, especially in the area of intellectual property protection. Frontier Economics identified several key areas that had to be developed to safeguard the rights of publishers and their authors amid the rapid adoption of artificial intelligence: establishing a framework that would allow for investments in artificial intelligence, promoting collaboration between publishers and AI enterprises, and helping small and medium-sized publishers invest in artificial intelligence.

In the UK, at the time of the report, there was no regulatory oversight of AI copyright protection. Frontier Economics found that 60 percent of large publishers saw this as a barrier to further AI investment. Without legal guardrails, publishers would remain reluctant to push for AI solutions. This primarily applies to data mining by AI enterprises, which use copyrighted materials to train their models.

New York Times Office
The New York Times sued OpenAI over copyright infringement

Hence, Frontier Economics recommended that the UK government implement the necessary extensions to copyright policy to allow publishers to invest in novel technologies such as generative AI. We have already seen copyright infringement cases in the US: in 2023, The New York Times sued OpenAI and Microsoft for using millions of its articles to train their AI programs.

In the December 2023 court filing, The New York Times became the first major American media company to take a stance against OpenAI for using its copyrighted material to train ChatGPT, among other programs. The NY Times specified no monetary demand in the lawsuit, but argued that OpenAI would be responsible for statutory and actual damages ranging into the billions of dollars.

The filing came after the NY Times had approached Microsoft and OpenAI to explore a collaborative solution that would train the AI models while abiding by intellectual property regulations. Microsoft declined to comment on the offer, with OpenAI spokeswoman Lindsey Held telling the news outlet that it was open to exploring avenues for commercial collaboration.

The strong stance of the NY Times against OpenAI comes as AI start-ups attract billions of dollars in investment funding, while publishing companies scramble to turn a profit. This leaves a vacuum in the copyright space, where AI developers are able to build enormous revenue pipelines on works created by others without appropriate compensation. While publishers were starting to sue AI companies, authors felt their livelihoods threatened just the same.

Authors on high alert 

The tide of AI-generated content drastically changed the attitudes of those operating in the publishing space over the course of 2023, with authors speaking out against AI models. In August 2023, Amazon was forced to remove five books following complaints from author Jane Friedman, under whose name the titles had been falsely released. These books were generated by AI and listed on Goodreads, a popular website where authors and readers can engage with their favorite books.

Friedman explained to The Guardian that the works were of low quality and felt like a violation. She was made aware of the titles by a reader who spotted them on Amazon. In the same month, writing for The Atlantic, Stephen King detailed how AI models were trained on his works. King highlighted that AI programmers have developed advanced programs that scan through vast quantities of books, which are processed and used to create intelligent machines that can deliver content in a matter of seconds.

King notes that his content has also been used to train these AI systems. However, he sees no monetary return for the countless hours he has poured into the works that have enabled AI programs to become so intelligent. His claims were backed by a lawsuit filed by Christopher Golden, Sarah Silverman and Richard Kadrey, who accused Meta, the parent company of Facebook, of using copyrighted works to train its large language model LLaMA.

LLaMA, Meta’s equivalent of OpenAI’s ChatGPT, is believed to have been trained on up to 170,000 books, the majority of them published in the past 20 years. These included works by George Saunders, Michael Pollan and Stephen King, among many others. The training data was found in a dataset dubbed Books3. Meta wasn’t the only technology company tapping into it: the dataset has also been used by Bloomberg and the non-profit artificial intelligence research group EleutherAI.

Books3 goes offline

The Books3 dataset was developed by open-source advocate Shawn Presser. The project, launched in 2020, was hosted by the pirate repository The Eye. Before it was shut down, the dataset had grown to 196,640 books, converted into a 37-gigabyte dataset. It caught the attention of anti-piracy groups, with the Danish anti-piracy group Rights Alliance forcing a shutdown of Books3.

The Rights Alliance explained to Gizmodo that the content was taken offline after the organization sent a takedown request. Presser, meanwhile, commented that his Books3 dataset was the only such training avenue available for programs like ChatGPT, adding that non-profit organizations use copyrighted datasets to train their models but won’t admit to doing so, for the simple reason that they don’t have the financial resources to license copyrighted materials such as books.

The episode with Books3 is nothing new, Gizmodo pointed out. Pirated material has been circulating the web for decades, and AI is merely the next frontier for publishers to fight on. Frontier Economics noted as much back in 2020, observing that intellectual property protection would need to be strengthened to counter the rise of AI models.

Maria Fredenslund, CEO of the Rights Alliance, told Gizmodo that her organization is actively combating copyright violations, trying to force copies of the Books3 dataset offline. Fredenslund explained that the organization is worried about how copyrighted material is used to train AI models. However, this is nothing new, she added, referring to the days when unregulated file sharing was rampant and governments were reluctant to take action.

Artificial Intelligence training data

Computer programmer and writer Alex Reisner dove into the training data used to build these models. For The Atlantic, he detailed his process for uncovering which texts were feeding large language models like ChatGPT. During his research across various online platforms, Reisner found a dataset ominously called “the Pile”. This enormous cache contained training data used by EleutherAI and included the books in the Books3 set.

Uncovering what materials were stored in this cache, however, wasn’t a walk in the park. The Pile contained texts from the European Parliament, the English version of Wikipedia and emails from Enron Corporation. With a near-infinite amount of text and a file so vast, Reisner had to write a program to skim through the data, starting by isolating the contents of the Books3 extract.

The initial runs were disappointing, Reisner detailed: the extract lacked titles, author names and metadata, with every entry carrying the same nondescript label “text”, which only served as a marker for a program to recognize the data it was processing. Reisner got closer to identifying the books when he found ISBNs in the text, which could be fed into a database that matched the numbers to their respective authors and the other missing data.
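Reisner’s actual tooling isn’t public. As a minimal sketch of the idea, assuming the extract is a stream of records with a single “text” field (as described above) and using a simplified ISBN-13 pattern, the matching step might look like this:

```python
import re

# ISBN-13s start with 978 or 979 followed by ten more digits, often
# split by hyphens or spaces (a simplification of the real format).
ISBN13 = re.compile(r"\b97[89][-\s]?\d{1,5}[-\s]?\d{1,7}[-\s]?\d{1,7}[-\s]?\d\b")

def extract_isbns(records):
    """Yield (record_index, isbn) for every ISBN-13 found in a stream
    of records that each carry only a nondescript "text" field."""
    for i, record in enumerate(records):
        for match in ISBN13.finditer(record.get("text", "")):
            # Strip separators so the number can be looked up in a
            # metadata database to recover author and title.
            yield i, re.sub(r"[-\s]", "", match.group())

# A toy record standing in for one unlabeled "text" entry.
sample = [{"text": "Copyright page ... ISBN 978-0-14-303943-3 ..."}]
print(list(extract_isbns(sample)))  # [(0, '9780143039433')]
```

The record format and the helper name are illustrative; the point is that a copyright page usually embeds an ISBN, which is enough of a key to recover the metadata the dataset itself strips away.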

Reisner was able to distill the data to 170,000 books, while 20,000 entries were left without an ISBN. These 170,000 books came from a wide range of publishers, from behemoths like Penguin Random House, with 30,000 titles in the dataset, to smaller publishers like Verso, with 600 titles. The set contained works by Elena Ferrante, Jonathan Franzen and Margaret Atwood: data from some of the most popular authors writing today.

The infamous Books3 dataset had been floating around the AI community for a while, Reisner notes, only reaching notoriety after lawsuits and takedown orders from rights holders. The pirated Books3 file, however, is just one of many datasets allegedly used by companies like OpenAI to train their models; Books3, as the name suggests, is the third in line.

Artificial Intelligence paradox

The rise of artificial intelligence, and the panic it has caused in the publishing community, raises an interesting question: to whom does it pose the biggest threat? Authors, rightfully so, feel their livelihoods are at stake. Their hard labor is being misused by multibillion-dollar corporations like OpenAI and Meta to train programs that will be able to mimic and create works of art.

We might even say that AI will be as destructive a force here as it has been for the labor force in the customer service industry. For publishers, however, this might be a novel way to become vastly more profitable, as they would no longer be reliant on authors to deliver works. If AI keeps evolving as rapidly as it has in recent years, publishers might be able to cut out authors altogether and deliver groundbreaking works to their audiences, creating fictional authors that, in the eyes of customers, are just as capable at their craft.

While the alarms are going off for those who protect authors, publishers like Penguin Random House are growing at an accelerated rate, selling more books than a decade ago. In an October 2023 interview, Penguin Random House CEO Nihar Malaviya said he was optimistic about the future of the publishing industry, seeing many paths to success when publishing a book. However, Malaviya noted that protecting the intellectual property of its authors and their works is one of the most important objectives, referring to ongoing lawsuits to ensure copyright protection against generative AI programs.

This raises an important paradox: how does AI-generated content fit into the copyright legal system? Malaviya points out that under current U.S. copyright law, AI-generated content cannot be copyrighted, making it difficult to classify content created by humans with the help of artificial intelligence. Malaviya remained optimistic, saying that these issues will be straightened out in the coming five years.

Penguin Random House itself has already been working with artificial intelligence solutions for over a decade across different regions of the world, especially using machine learning to determine e-book sales prices and initial print runs. These practices correlate with the positive outlook on AI captured in the Frontier Economics review.

Safeguarding intellectual property

The discussion around intellectual property might be one of the toughest policymakers and the industry will have to endure. In February 2023, Judy Ruttenberg, senior director of Scholarship and Policy, spoke with Jonathan Band, counsel to the Library Copyright Alliance (LCA), about the application of copyright law to AI-generated content.

Band explained that current legislation in the United States only covers human authors, a principle well established in US copyright law through early court cases that clearly define the role of humans in creative works. This might be extended to humans who use AI to create creative products, although at the time of the interview this had yet to happen.

On the other end of the spectrum, developers could argue that their models display such a level of autonomy that the developer becomes the rightsholder of the generated content. This argument wouldn’t hold, Band predicted, as computer scientists question the sentient nature of AI. The discussion would gain a new dimension if AI became so intelligent that it was indistinguishable from humans. Months later, Princeton University Press sided with its human stakeholders.

In December 2023, Princeton University Press published a statement outlining its stance on intellectual property protection in the face of artificial intelligence. In it, the press notes that it takes responsibility for the protection of copyrighted materials as the representative of its authors’ publications. It wants to create a safe environment where creatives can express themselves, guiding them from writing and acquisition through to publication.

At the same time, Princeton University Press, just like Penguin Random House, acknowledges that AI is already in use in scholarly publishing, redefining how research is conducted. Hence, the press believes there is a durable pathway for AI to co-exist alongside scholars. Artificial intelligence, it says, needs a framework that enables this partnership with humans to drive positive change.

To realize this objective, Princeton University Press explains, it will create more clarity about its intentions and principles of engagement. It will prioritize the protection of intellectual property for the authors who entrust it with their research, including ensuring that only LLMs that abide by copyright law will have access to its content.

In this worldview, Princeton highlights, human-created content will remain an integral part of society. Policymakers will have to maintain current copyright protections and not allow exceptions that let AI developers train their models without appropriate compensation to rights holders. Exploitation of creators should never be allowed, it states. This translates into more transparency from AI developers.

The stance of Princeton University Press resonates with that of publishers such as The New York Times and Penguin Random House, which, while acknowledging AI’s contributions to society and to their own businesses, insist that guardrails be put in place to protect their businesses and the works of their employees and authors. These organizations are the precursors of a heated debate around intellectual property and machine learning models.

Publishing industry in turmoil

The publishing industry is fighting on multiple fronts, with authors feeling betrayed. Developments in artificial intelligence are moving rapidly, with models becoming smarter with each update and capturing the imaginations of millions of people around the world. Without the appropriate guardrails, they might soon render creatives obsolete. The debate, and fight, revolving around intellectual property protection will determine the fate of many operating in the publishing space.

As of now, the debate over who should compensate whom is still ongoing. In January 2024, OpenAI responded to the accusations made in the NY Times lawsuit, calling them without merit and arguing that it was OpenAI itself that had enabled news organizations to redefine their businesses. OpenAI commented that it operates under fair use, claiming that the NY Times court filing omitted crucial parts of how OpenAI’s technologies work.

In its statement against the lawsuit, OpenAI detailed that while it legally operates under fair use, drawing on publicly available digital content, copyright holders have an opt-out opportunity, which The New York Times itself exercised in August 2023. Interestingly enough, this leaves out content published before August 2023, but that is beyond the scope of this explorative piece.

OpenAI has little wiggle room, as admitting to copyright violations in any shape or form would translate into hefty compensation to all the copyright holders whose material was used to train the ChatGPT model. Meanwhile, publishing companies find themselves in uncharted territory. Their gut response is to push back against AI development, as no safeguards exist to protect their businesses, and if such safeguards are not implemented in time, they might become relics of the past.

Bartek Bezemer graduated in Communications (BA) at the Rotterdam University of Applied Sciences, Netherlands. He has worked in the digital marketing field for over a decade at companies serving some of the largest corporations in the world.
