Real Toxicity Prompts

Toxic workplace? What about toxic AI?


UW/AI2’s Maarten Sap knows AI is reliably, even spontaneously, hateful. His new research includes a plan to create “culturally competent” training sets.

Maarten Sap

University of Washington postdoc Maarten Sap has spent the better part of his academic career researching how social dynamics and inequality are learned by AI systems. He’s delved deep into pre-trained NLP models and the large data sets they ingest and discovered something most humans already know: The internet can be a pretty dark place.

While this is an obvious assertion, no one had really looked into just how dark the datasets provided by the internet can be, and how quickly a computer can learn to be just as misogynistic, racist, xenophobic, and homophobic as the darkest corners of the dark web. Until now. In his recent paper, Sap studied what he called “neural toxic degeneration” and found that models are capable of producing problematic content in fewer than 100 run-throughs. The next question was why.

He suspects two factors contribute to neural toxic degeneration. First, he says, the data selected for training GPT-2 and GPT-3 skewed toward extreme, even banned, sources. “For example, those models used ‘well-received’ articles on Reddit as a heuristic for selecting news documents, leading to a lot of toxic text being included,” says Sap. He and his team at AI2 found that a non-trivial number of documents and fake news stories were linked to toxic sites: about three articles in 100 were at least 50% toxic based on analysis by Perspective API. Second, Sap says, he suspects that models have an easier time learning toxic concepts than other types of knowledge, “because a hateful statement or stereotype is likely to appear with a mention of a minority identity.” Once the model makes these negative correlations, it continues to learn to associate minority identities with similarly negative acts and traits. That is not a good look for these supposedly super-smart models, which are increasingly deployed across many disciplines.
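To make the "three articles in 100" figure concrete, here is a minimal sketch of how such a share could be computed once every document has a toxicity score. It rests on assumptions: the scores below are invented for illustration, the function name `toxic_fraction` is hypothetical, and "at least 50% toxic" is read as a score of 0.5 or higher on a 0-to-1 scale. It does not call Perspective API or reproduce the team's analysis.

```python
def toxic_fraction(scores, threshold=0.5):
    """Return the share of documents whose toxicity score meets the threshold.

    `scores` is a list of per-document toxicity scores in [0, 1].
    The 0.5 default mirrors the "at least 50% toxic" cutoff quoted above
    (an assumed reading of that phrase, not a confirmed definition).
    """
    if not scores:
        return 0.0
    flagged = sum(1 for score in scores if score >= threshold)
    return flagged / len(scores)


# Invented scores for ten hypothetical documents (not real data).
example_scores = [0.02, 0.10, 0.75, 0.31, 0.50, 0.04, 0.08, 0.12, 0.01, 0.22]

share = toxic_fraction(example_scores)
print(f"Share of documents at or above the threshold: {share:.0%}")
```

In this toy list two of the ten documents cross the cutoff. The article's figure corresponds to roughly three in every 100 documents doing so.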

This is not an easy issue to overcome. OpenWebText, for example, has been used to pre-train NLP models at Facebook and Salesforce, as well as OpenAI’s GPT-2 and GPT-3. Unfortunately, because OpenWebText documents are selected based on “well-received” news or blog articles shared on Reddit, the dataset included content shared on the banned subreddit “r/The_Donald,” among other bastions of hate.

But all is not lost. Sap points to a number of organizations working on this, including his own. Earlier this year, his group at AI2 found a method of steering pre-trained models away from toxicity. “The crux of the method is that we use smaller language models that are trained on specialized toxic/non-toxic data and act as anti-role models for the main language model to generate text,” he says. Easier fixes, like using a blocklist of “bad” words, won’t solve the problem, he insists, “because toxicity is not just about ‘bad words,’ and in fact, using ‘bad words’ to filter your dataset actually excludes minority voices, as our research shows.” Sap believes that AI practitioners need to move to a “culturally competent” approach and train systems with data that “aligns with their own values, identity, and language dialect.” They also need to continue to develop better methods for detecting toxicity and social biases.
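One way to picture the "anti-role model" idea is to adjust the main model's next-token scores using two smaller models, one trained on non-toxic text and one on toxic text. The sketch below is an illustration under stated assumptions, not the team's implementation: the combination rule (base score plus a weighted difference between the two small models), the `alpha` weight, the three-word vocabulary, and every number are hypothetical.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def steer(base_logits, nontoxic_logits, toxic_logits, alpha=1.0):
    """Shift the main model's scores toward tokens the non-toxic model
    prefers and away from tokens the toxic "anti-role model" prefers.
    This combination rule is an assumption made for illustration."""
    return [
        base + alpha * (good - bad)
        for base, good, bad in zip(base_logits, nontoxic_logits, toxic_logits)
    ]


# Hypothetical three-token vocabulary and made-up scores.
vocab = ["friend", "neighbor", "<toxic_token>"]
base = [1.0, 1.0, 1.2]       # main model slightly favors the toxic token
nontoxic = [1.0, 1.0, -1.0]  # small model trained on non-toxic text
toxic = [0.0, 0.0, 3.0]      # small model trained on toxic text

before = softmax(base)
after = softmax(steer(base, nontoxic, toxic, alpha=1.0))

for token, p_before, p_after in zip(vocab, before, after):
    print(f"{token:>14}: {p_before:.3f} -> {p_after:.3f}")
```

Unlike a word blocklist, this kind of adjustment never removes a token outright. It only makes the token less likely when the toxic model would favor it, which is one reason steering can be gentler on legitimate uses of a word.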

However, the jury is still out as to whether or not models can unlearn their tendency toward toxicity. More research is needed in this area, Sap concedes. But understanding the root cause of toxic degeneration is a step toward excising it from systems.

Additional Reading