OpenAI says it will not publish the dataset behind GPT-2, its new text-generation algorithm capable of writing, translating and summarizing text, for fear of misuse.


OpenAI's researchers knew they were on to something when their language modeling program wrote a convincing essay on a topic they disagreed with. They had been testing the new AI system by feeding it text prompts and letting it complete made-up sentences and paragraphs. Then, says David Luan, vice president of engineering at the California lab, they had the idea of asking it to argue a point they considered counterintuitive. In this case: why recycling is bad for the world.

"And it wrote this really competent, really well-reasoned essay," Luan told The Verge. "It was something you could have submitted to the US SAT and gotten a good score on."

Luan and his colleagues stress that this particular essay was a bit of a fluke. "To be clear, that only happens a small fraction of the time," says Dario Amodei, research director at OpenAI. But it demonstrates the raw potential of their program, the latest in a new breed of text-generation algorithms that herald a revolution in computer-written text.

For decades, machines have struggled with the subtleties of human language, and even the recent boom in deep learning, fueled by big data and improved processors, has not cracked this cognitive challenge. Algorithmic moderators still overlook abusive comments, and the world's most talkative chatbots can barely keep a conversation alive. But new methods of analyzing text, developed by heavyweights such as Google and OpenAI as well as by independent researchers, are unlocking previously unheard-of talents.

"You can build something that really seems to 'understand' a lot about the world, just by having it read."

OpenAI's new algorithm, named GPT-2, is one of the most exciting examples to date. It excels at a task known as language modeling, which tests a program's ability to predict the next word in a given sentence. Give it a fake headline, and it will write the rest of the article, complete with fake quotes and statistics. Feed it the first line of a short story, and it will tell you what happens to your character next. It can even write fan fiction, given the right prompt.

You can see examples of GPT-2's skills below. In each screenshot, the underlined text was generated by the algorithm in response to the preceding sentence (or sentences).

The writing it produces is usually easy to identify as non-human. Although its grammar and spelling are generally correct, it tends to stray off topic, and the text it produces lacks overall coherence. But what is really impressive about GPT-2 is not its fluency but its flexibility.

The algorithm was trained on the task of language modeling by ingesting a huge number of articles, blogs and websites. Using only this data, and with no retooling by OpenAI's engineers, it achieved state-of-the-art scores on a number of unseen language tests, a feat known as "zero-shot learning." It can also perform other writing-related tasks, such as translating text from one language to another, summarizing long articles and answering trivia questions.

GPT-2 performs each of these jobs less competently than a specialized system, but its flexibility is a significant achievement. Almost all machine learning systems in use today are "narrow AI," meaning they can tackle only specific tasks. DeepMind's original AlphaGo program, for example, was able to beat the world's champion Go player, but it could not beat a child at Monopoly. According to OpenAI, the prowess of GPT-2 suggests that methods available to researchers right now could mimic more generalized brainpower.

"What the new OpenAI work has shown is: yes, you absolutely can build something that really seems to 'understand' a lot about the world, just by having it read," says Jeremy Howard, a researcher who was not involved in OpenAI's work but has developed similar language modeling programs.

"[GPT-2] has no other external input, and no prior understanding of what language is, or how it works," Howard told The Verge. "Yet it can complete extremely complex series of words, including summarizing an article, translating languages and more."

But as is usually the case with technological developments, these advances could also cause harm. In a world where information warfare is increasingly widespread and where nations deploy bots on social networks to try to sway elections and sow discord, the idea of AI programs that spout unceasing but convincing nonsense is unsettling.

For this reason, OpenAI is treading cautiously with the unveiling of GPT-2. Unlike with most significant research milestones in AI, the lab will not share the dataset used to train the algorithm or all of the code it runs on (although it has given temporary access to the algorithm to a number of media publications, including The Verge).

AI rewrites the rules of text generation

To put this work in context, it is important to understand how difficult language modeling really is. If I asked you to predict the next word in a given sentence (say, "My trip to the beach was cut short by bad __"), your answer would draw on a range of knowledge. You would consider the grammar of the sentence and its tone, but also your general understanding of the world. What kinds of bad things are likely to ruin a day at the beach? Would it be bad fruit, bad dogs or bad weather? (Probably the last one.)
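To make "predicting the next word" concrete, here is a minimal sketch in Python. It is not GPT-2 or anything OpenAI has described: it is a simple count-based bigram model, and the tiny corpus and function names are invented for illustration. It only shows the shape of the task, which is to turn observed text into probabilities for the next word.

```python
from collections import Counter, defaultdict

# Tiny made-up corpus, purely for illustration (not OpenAI's data).
CORPUS = [
    "our picnic was ruined by bad weather",
    "the flight was delayed by bad weather",
    "the game was cut short by bad weather",
    "he was bitten by bad dogs",
    "she got sick from bad fruit",
]

def train_bigram_counts(sentences):
    """Count how often each word follows each preceding word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts

def predict_next(counts, prev_word):
    """Return candidate next words with estimated probabilities, most likely first."""
    followers = counts[prev_word]
    total = sum(followers.values())
    return [(word, n / total) for word, n in followers.most_common()]

counts = train_bigram_counts(CORPUS)
predictions = predict_next(counts, "bad")
for word, prob in predictions:
    print(f"bad {word}: {prob:.2f}")
```

On this toy corpus, "weather" comes out as the most likely word after "bad." A model like this looks only at the single preceding word, so it knows nothing about beaches; that gap is what the deeper models discussed in this article try to close.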

Despite this, programs that perform text prediction are quite common. In fact, you have probably encountered one today, whether it was Google's AutoComplete feature or the Predictive Text function in iOS. But these systems draw on relatively simple types of language modeling, while algorithms such as GPT-2 encode the same information in more complex ways.

The difference between these two approaches is technically arcane, but it can be summed up in one word: depth. Older methods record information about words only in their most obvious contexts, while newer methods dig deeper into their multiple meanings.

So while a system like Predictive Text only knows that the word "sunny" is used to describe the weather, the newer algorithms know when "sunny" refers to someone's character or mood, when "Sunny" is a person, or when "Sunny" means the 1976 smash hit by Boney M.

Predicting text can be a "complicated task" that solves many problems

The success of these newer, deeper language models has caused a stir in the AI community. Researcher Sebastian Ruder compares their success to the advances made in computer vision in the early 2010s. At that time, deep learning helped algorithms make significant improvements in their ability to identify and categorize visual data, kick-starting the current AI boom. Without these advances, a whole range of technologies, from self-driving cars to facial recognition and AI-enhanced photography, would be impossible today. This latest leap in language understanding could have similarly transformational effects.

Ani Kembhavi, a researcher at the Allen Institute for Artificial Intelligence, explains that text prediction can be considered a "complicated task" for computers: a broad challenge that, once solved, will open the door to intelligence.

"Asking for the time or getting directions can both be seen as question-answering tasks that involve predicting text," Kembhavi told The Verge. "So, hypothetically, if you train a good enough question-answering model, it can potentially do anything."

Take GPT-2's ability to translate text from English to French, for example. Usually, translation algorithms are fed hundreds of thousands of phrases in the relevant languages, and the networks themselves are structured to process data by converting input X into output Y. This data and network architecture give these systems the tools they need to make progress on the task, in the same way that snow chains help cars get a grip on icy roads.

GPT-2, by contrast, is structured only to predict words. And the data it has is similarly unspecific. It was not trained on translated pairs, but rather on a large corpus of links scraped from the internet.

Trained on 8 million web links scraped from Reddit

OpenAI's researchers collected their training data by using Reddit as a filter. They gathered the site's most upvoted links (about 8 million in the end), then scraped their text, creating a relatively compact training dataset just 40 GB in size. "In a way, all the work was done by people on Reddit upvoting posts," says Jeff Wu, a researcher at OpenAI. Amodei, OpenAI's research director, adds that at least they did not use a more toxic source, like 4chan.
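The "Reddit as a filter" idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not OpenAI's actual pipeline: the sample records, the `MIN_SCORE` cutoff and the function name are all hypothetical, and the step of downloading and extracting each page's text is left out.

```python
# Hypothetical records: (url, upvote score). Real Reddit data would come
# from an API or archive; none of these values are from OpenAI.
POSTS = [
    ("https://example.com/a", 57),
    ("https://example.com/b", 1),
    ("https://example.com/a", 12),   # same link posted a second time
    ("https://example.org/c", 8),
    ("https://example.net/d", 0),
]

MIN_SCORE = 3  # assumed popularity cutoff, chosen for illustration only

def select_popular_links(posts, min_score):
    """Keep each URL once if a post linking to it meets the score cutoff."""
    seen = set()
    selected = []
    for url, score in posts:
        if score >= min_score and url not in seen:
            seen.add(url)
            selected.append(url)
    return selected

links = select_popular_links(POSTS, MIN_SCORE)
print(links)
```

The design choice is that human votes stand in for a quality judgment: nobody labels the pages by hand, and the only curation is the popularity threshold plus removal of repeated links.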

But given such unspecific data and training architecture, why was GPT-2 able to perform translations at all? OpenAI says this is because its dataset, named WebText, happens to contain some examples of translation. Looking through WebText, the researchers found snippets such as:

"I'm not the cleverest man in the world, but as they say in French: Je ne suis pas un imbecile [I'm not a fool]."

In a now-deleted post from August 16, Soheil Eid, Conservative candidate in the riding of Joliette, wrote in French: "Mentez mentez, il en restera toujours quelque chose," which translates as, "Lie lie and something will always remain."

"I hate the word 'perfume,'" Burr says. "It's somewhat better in French: 'parfum.'"

These snippets of French were enough to give the algorithm a vague idea of what "translation" is, but they were not enough to make it fluent. Its ability to summarize long passages and answer trivia questions can probably be traced back to the data in much the same way, as can GPT-2's habit of inserting the word "ADVERTISEMENT" between paragraphs when writing a news story. "It's nowhere near as good as specialized translation systems," says Amodei. "But I still think the fact that it can do it at all is crazy."

Kembhavi agrees that having a single system handle a range of tasks is impressive, but points out that, in the near future at least, specially trained systems will continue to have an edge over generalist ones. "Zero-shot scenarios are cool," he says, "but performing at 56 percent on this or that task? If you put that in the real world, it doesn't look so good."

The dangers of a polymath AI

If GPT-2 is able to translate text without being explicitly programmed to, then the obvious question is: what else has the model learned that we do not know about?

What else has the model learned that we do not know about?

OpenAI's researchers admit that they cannot fully answer this question. They are still exploring exactly what the algorithm can and cannot do. For this reason, among others, they are being careful about what they share on the project, keeping the underlying code and training data to themselves for now. Another reason for caution is that they know that if someone feeds GPT-2 racist, violent, misogynistic or abusive text, it will continue in that vein. After all, it was trained on the internet.

In The Verge's own tests, when given the prompt "Jews control the media," GPT-2 wrote: "They control the universities. They control the global economy. How is it done? Through various mechanisms that are well documented in the book Jews in Power by Joseph Goebbels, the Hitler Youth and other key members of the Nazi Party."

In the wrong hands, GPT-2 could be an automated trolling machine, spewing endless bile and hatred. If it becomes more sophisticated and able to persuade reliably, it could cause even subtler damage by influencing debate online. Countries like Russia and Saudi Arabia, which already employ thousands of online propagandists to abuse government opponents and push official talking points, could scale up their efforts overnight. And remember that none of the text GPT-2 produces is copied and pasted: it is all newly generated, and so harder to filter and more easily shaped for specific purposes.

Jack Clark, director of policy at OpenAI, says these concerns cannot be ignored. OpenAI, he says, wants to encourage academics and the public to discuss the harms of this technology before it becomes widely available.

"What I see is that eventually someone is going to use synthetic video, images, audio or text to break an information state," Clark told The Verge. "They will poison discourse on the internet by filling it with coherent nonsense. They will make it so there is enough weird information to outweigh the good information, which damages the ability of real people to have real conversations."

A report published in 2018 by OpenAI and academic groups in Cambridge and elsewhere, titled "The Malicious Use of Artificial Intelligence," predicted the emergence of such technologies and suggested other harmful uses. Automated text generation could, for example, make online scams easier and improve hackers' ability to spear-phish targets (that is, to trick them into giving up their online credentials by posing as a friend or trusted institution).

We have already seen how seemingly innocuous AI technologies can be abused once made public. The practice of creating pornographic deepfakes, for example, by pasting people's faces onto X-rated clips without their consent, was made possible only because the underlying AI techniques were first released as open-source software.

OpenAI's hypothesis is that it is better to talk about the dangers of AI "before they happen"

Clark says that language modeling algorithms such as GPT-2 are not as mature as deepfakes, but that they are close enough to warrant a cautious approach. "Our hypothesis is that it might be a better and safer world if you talk about [these dangers] before they happen," he says.

Howard, co-founder of Fast.AI, agrees. "I've been trying to warn people about this for a while," he says. "We have the technology to totally fill Twitter, email and the web with reasonable-sounding, context-appropriate prose, which would drown out all other speech and be impossible to filter."

There are, of course, positives to keep in mind. Systems like GPT-2, once mature, could be a fantastic boon for all kinds of industries. They could help create infinite virtual worlds filled with procedurally generated characters. They could also dramatically improve chatbots' conversational skills, helping in areas ranging from customer complaints to health care.

And if it turns out that teaching AI systems how to perform various tasks is as simple as teaching them to read, it could lead, in the not-too-distant future, to computers that are more like human assistants in their ability to speed-read, summarize and answer questions.

According to OpenAI's Luan, the next step will simply be to feed GPT-2 more data. "We are interested to see what happens then," he says. "And maybe a little scared."