The Godfather, Alf, and The Simpsons are among the thousands of films and TV shows whose dialogue has gone into AI training data.
Update: Since this piece was written, the Writers’ Guild of Great Britain has issued a response. You can find it at the foot of the post.
Our original story follows…
If our movies and TV shows are to be written by AI chatbots in the near future, then that output will have the work of countless creative, flesh-and-blood people to thank. According to a new report, such cultural staples as The Godfather, The Simpsons, Twin Peaks and Breaking Bad have all been fed into the AI data set used by some of technology’s biggest names, including Apple, Meta and Anthropic.
The report comes from The Atlantic and forms part of a wider investigation into exactly what tech firms are using to ‘train’ their AI systems. Journalist Alex Reisner writes that dialogue from some 53,000 movies and 85,000 TV episodes has gone into the data set used by such generative AI platforms as OpenAI’s ChatGPT and Anthropic’s rival, Claude. That data was taken from OpenSubtitles.org, a community-led online resource whose huge database of subtitles was extracted from physical discs, streaming videos and recordings of live TV broadcasts.
Among the films and shows found in the data analysed by The Atlantic were 616 episodes of The Simpsons, every episode of The Wire, The Sopranos and Breaking Bad, and dozens of Oscar-nominated films released between 1950 and 2016. All of this amounts to thousands of hours of dialogue from the past 60 or so years of pop culture, all of which could be used to help chatbots affect a more human-sounding tone of voice. Or, as Reisner points out, be used to generate screenplays without having to hire proper writers with mortgages, holiday requirements or other such annoyances.
AI companies tend to be rather secretive about what data they use to train their large language models, which is why research like The Atlantic’s is so vital. At present, the planet’s legal systems are lagging far behind this emergent technology, while sections of Hollywood, mostly of the executive sort, are intrigued by its money-saving potential. As a result, it’s currently being left to companies and individuals to fight for the copyright status of their own creative work.
George RR Martin and John Grisham are among a group of 17 authors who are collectively suing OpenAI for “systematic theft on a mass scale.” The producer of Blade Runner 2049 is currently suing Elon Musk and Warner Bros for allegedly using AI to generate images that it argues resemble stills from that film.
Such cases are likely to keep emerging if the generative AI machine continues on its seemingly unstoppable march.
The Writers’ Guild of Great Britain has since issued a response to the report, and offers advice to any screenwriters whose work has appeared in the database search tool published by The Atlantic:
“We believe that writers must be fairly compensated for use of their work in this manner and that there should be licensing agreements to cover this practice, at the discretion of writers,” the WGGB writes. “We are also calling for a regulatory body on AI and strengthened copyright protections, among other recommendations to protect writers. If you are a WGGB member and you have found your work in this database, please contact casework@writersguild.org.uk and we will advise you on next steps to take.”