Google swallows 11,000 novels to improve AI’s conversation


As writers learn that tech giant has processed their work without permission, the Authors Guild condemns “blatantly commercial use of expressive authorship”

When the writer Rebecca Forster first heard how Google was using her work, it felt like she was trapped in a science fiction novel.

“Is this any different than someone using one of my books to start a fire? I have no idea,” she says. “I have no idea what their objective is. Certainly it is not to bring me readers.”

After a 25-year writing career, during which she has published 29 novels ranging from contemporary romance to police procedurals, the first instalment of her Josie Bates series, Hostile Witness, has found a new reader: Google’s artificial intelligence.

“My imagination just didn’t go as far as it being used for something like this,” Forster says. “Perhaps that’s my failure.”

Forster’s thriller is just one of 11,000 novels that researchers including Oriol Vinyals and Andrew M Dai at Google Brain have been using to improve the technology giant’s conversational style. After feeding these books into a neural network, the system was able to generate fluent, natural-sounding sentences. According to a Google spokesman who didn’t want to be named, products such as the Google app will be much more useful if they can capture the nuance of language better.

For the moment, the research is just a proof of concept, the spokesman continues via email, but these methods could help Google understand and produce a broader, more nuanced range of text for any given task.

“We could have used many different sets of data for this kind of training, and we have used many different ones for different research projects,” he adds. “But in this case, it was particularly useful to have language that frequently repeated the same ideas, so the model could learn many ways to say the same thing. The language, phrasing and grammar in fiction books tends to be much more varied and rich than in most nonfiction books.”

The only problem is that they didn’t ask. The Google paper says that the novels used in this research were taken from the Books Corpus, citing a 2015 paper by Ryan Kiros and others which describes how the authors collected a corpus of 11,038 books from the web, describing them as “free books written by [as] yet unpublished authors”. It’s a collection that has been used by other researchers working in artificial intelligence and which is currently available for download in its entirety from the University of Toronto.

Forster says that she always appreciates an interesting use of words, but while Hostile Witness is available to download for free, no one asked her permission to use her novel as raw material to train a computer.

“Perhaps I’m still thinking in the old way, that a reader will read my book. It didn’t even occur to me that a machine could read my book. What I found curious was that these were referred to as ‘free books written by as yet unpublished authors’, because my state is very different,” she says.

Like many of the novels in the Books Corpus collection, the edition of Hostile Witness used in the research was published on Smashwords and includes a copyright declaration that reserves all rights, specifies that the ebook is “licensed for your personal enjoyment only”, and offers the reader thanks for “respecting the hard work of this author”. While Forster says she’s no lawyer, the spirit of this declaration is clear: “you hope that your work would be respected by readers”.

“I take great pride in my craft, and perhaps it was chosen because of that. Which would be great. Or perhaps it was chosen because it was there, because it was free?”

Another writer whose work has been used in the Google Brain research is Erin McCarthy, the author of more than 28 novels. The first volume of her Fast Track series, published by Penguin Random House’s Berkley Books imprint, is also available for free online, but McCarthy says that Google didn’t get in touch with her or ask for permission to use Jacked Up in their research into AI. She’s fascinated to hear that romance novels are being used to improve the search conglomerate’s ability to speak.

“There is a reason they are the bestselling genre in the US and I believe it’s because they feel conversational themselves,” McCarthy says. “It’s real life turned up a notch. Realism overlying a fantasy.”

“The flow of the dialogue is very important,” she continues. “I am very cognizant of using modern diction and age-appropriate word choices. If my female character is 24, she’s not going to speak in a formal manner. Conversations between the hero and heroine have realistic word choices, but there is additionally an element of fantasy there. What they want a hero to say, but what might not actually occur in real life. That’s what readers want and expect from a romance novel.”

McCarthy isn’t sure how to respond to the idea that her work has been used for an entirely different purpose to the one she intended, a purpose that may result in services to make the tech giant a lot of money.

“It’s hard to gauge the use of my work and the exact purpose for its use without having seen it in action,” she says. “My assumption would be they purchased a copy of the book originally. If they haven’t, then I would imagine the source of the content, as intellectual property, should be properly attributed and compensated for the general health of the creative community.”

Far from offering proper attribution or any compensation, the Google paper avoids any suggestion that the novels used in the research were written by real people, describing the books only as “a collection of text from 12k ebooks, mostly fiction”.

Forster is equally adamant that writers whose work has been used to gain a commercial advantage should reap a portion of the rewards, but isn’t holding her breath for any payment.

“If there’s one thing that’s niggling at me, it’s that I would have liked to have known,” she says. “With all the technology at their fingertips, then it wouldn’t have been too hard to let everyone know.”

According to Mary Rasenberger, executive director of the Authors Guild, this “blatantly commercial use of expressive authorship” comes as no surprise. “We’ve seen this movie before.”

The Guild has been in dispute with Google since 2005, arguing that the company’s project to digitise library books was a “plain and brazen violation of copyright law”. Google Books won in 2013, with the district court ruling that “all society benefits” from the project, a decision that the supreme court declined to review earlier this year.

“Why shouldn’t authors be asked permission, or even informed, not to mention compensated, before their work is used in this manner?” Rasenberger asks. “There’s no doubt the company has the means to do so.”

Google wouldn’t say whether getting hold of 11,000 authors was beyond their capacities, or if they have any plans to reward the writers, or if the people whose expertise was harvested to train their network were ever considered as individuals. “While attribution isn’t required,” the spokesman says via email, “the researchers clearly identify where they got the data.”

“The machine learning community has long published open research with these kinds of datasets, including many academic researchers with this set of free ebooks. It doesn’t harm the authors and is done for a very different purpose from the authors, so it’s fair use under US law.”

But Rasenberger isn’t convinced.

“The research in question uses these novels for the exact purpose intended by their authors: to be read,” she argues. “It shouldn’t matter whether it’s a machine or a human doing the copying and reading, especially when behind the machine stands a multi-billion dollar corporation which has time and again bent over backwards devising ways to monetise creative content without compensating the creators of that content.”

Rasenberger adds that nobody knows how books will be read or used in the future, which is why the Authors Guild is proposing that digital uses should be allowed under a licensing system. But for the moment, Google is extracting immense value from the creative efforts of thousands of authors and looking the other way.

For Forster, the lack of any proper attribution speaks volumes. “If they’re not mentioning the authors,” she says, “then maybe they’re not thinking of it in terms of it being someone’s work.”

She never imagined her work would wind up as being part of someone else’s dataset, as raw ingredients to satisfy a machine’s hunger for information, but she’s been around long enough to know that what you hope for isn’t always what you get.

“I would have loved to have been part of the discussion of this project, and to have known how it was going to be used,” she says. “But I’d also like to be thought of as intelligent enough to be able to make a decision about the end product.”

Read more: https://www.theguardian.com/books/2016/sep/28/google-swallows-11000-novels-to-improve-ais-conversation
