News & Analysis

Anthropic Converts AI to Natural Stupidity 

Researchers have managed to force GenAI chatbots to answer queries they aren't supposed to

By Raj N Chandra Shekhar

Barely had we caught our breath from laughing at the latest episode of The Daily Show, in which host Jon Stewart lampooned the false promises and hyperbole around AI, when another bit of news came calling. Anthropic, the company that secured $4 billion from Amazon to deliver AI solutions, revealed that it could make artificial intelligence look naturally stupid. 

So, what’s new? Haven’t we read hundreds of similar cases posted on digital forums ever since OpenAI came out with ChatGPT in November 2022? True enough, but what has happened now could raise a few concerns about large language models themselves and the way they should be trained. 

Jon Stewart’s tirade against AI hyperbole

Before getting to the Anthropic way of jailbreaking large language models (LLMs), which can force the smartest of GenAI chatbots to provide answers to queries they are not supposed to, let me take a minute to explain what Jon Stewart was up to. He began by saying that his former employer, Apple, did not want serious topics discussed on Apple TV Plus. 

But he didn’t stop there. While interviewing Federal Trade Commission chair Lina Khan, he noted that Apple didn’t want him to interview her on the show. “Why are they so afraid to even have these conversations out in the public sphere?” he asked, to which Khan responded, “I think it just shows the danger of what happens when you concentrate so much power and so much decision-making in a small number of companies.” 

For context, Khan was confirmed as FTC head in 2021 and, prior to taking over, had made her position on the antitrust behavior of tech companies known. As FTC chair, she stated her commitment to go after “business models that centralize control and profits.” The FTC has brought antitrust lawsuits against Amazon and Microsoft and is now probing their investments in Anthropic and OpenAI. 

Jail-breaking, the Anthropic way

Now, let’s find out what Anthropic has been up to. Anthropic researchers claim to have found a new jailbreak method that messes up an LLM and gets the chatbot fronting it to answer queries it isn’t supposed to. So, if you want it to tell you how to make a bomb, all you need to do is prime it with some less harmful questions first (see image below). 

(Source: Anthropic) 

The researchers have described this approach as “many-shot jailbreaking” and have shared a research paper on it, as well as informing their peers in the AI community so that the issue can hopefully be fixed soon. It looks like the company has earned a few brownie points with this revelation, which comes days after Amazon fulfilled its investment commitment of $4 billion into the company. 

What’s causing this vulnerability?

Anthropic claims that the vulnerability could be a result of the increased “context window” of modern LLMs, which is effectively the amount of data they can hold in their short-term memory. At one point this was barely a few sentences; it has now expanded to thousands of words, or even complete books.
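To make the idea concrete, here is a minimal sketch of what a context window amounts to in practice: a hard cap on how much recent text the model can see, with anything older simply falling off. The helper name, the token budget and the whitespace-based token count below are all illustrative assumptions, not how any real model is implemented.

```python
# Minimal sketch: a "context window" is a cap on how much text the model can
# attend to at once; older messages that do not fit are simply dropped.
# Whitespace splitting stands in for real tokenisation (an assumption).
def fit_to_context(messages, max_tokens=200_000):
    kept, used = [], 0
    for msg in reversed(messages):      # walk from the most recent message backwards
        cost = len(msg.split())
        if used + cost > max_tokens:
            break                       # everything older falls out of "short-term memory"
        kept.append(msg)
        used += cost
    return list(reversed(kept))         # restore chronological order
```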

Models with such large context windows have consistently performed well across multiple tasks, especially when the prompt contains plenty of examples of the task at hand. In other words, the more trivia questions there are in the prompt, the better the answers become. A question that elicited a wrong answer on the first attempt might be answered correctly if it comes tenth or hundredth. 

However, the flip side of this in-context learning is that the models can also end up replying to inappropriate queries. Asking the chatbot how to build a bomb might get an immediate refusal the first time, but precede it with 99 other, less harmful queries and then repeat the bomb question, and this time it might just comply. 
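The shape of such a prompt is simple enough to sketch. The snippet below uses harmless placeholder questions and an illustrative function name; the point is the structure, in which hundreds of faux question-and-answer turns are packed into a single prompt so that the final query rides on the pattern of compliance set by everything before it.

```python
# Minimal sketch of the many-shot prompt shape: lots of faux dialogue turns,
# then the question the attacker actually cares about. The content here is
# harmless placeholder text; only the structure matters.
def build_many_shot_prompt(example_pairs, final_question):
    turns = [f"Human: {q}\nAssistant: {a}" for q, a in example_pairs]
    turns.append(f"Human: {final_question}\nAssistant:")
    return "\n\n".join(turns)

prompt = build_many_shot_prompt(
    [("What is the capital of France?", "Paris."),
     ("Name a primary colour.", "Red.")] * 50,   # imagine ~100 such turns
    "FINAL QUESTION GOES HERE",
)
```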

Classify the queries before modelling

In their note, the researchers attempt to explain why this happens, but concede that firm answers could take some more time to materialise, since LLMs work on probabilistic weights rather than explicit rules. What they do observe is some mechanism that allows a model to home in on whatever fills its context window (see image). If a user keeps asking for trivia, the model gradually leans further into trivia, and for some reason the same drift happens when users keep asking inappropriate questions. 

While informing peers in the LLM field, the team also noted that limiting the context window helps, but at a cost to overall performance. That is why Anthropic says it is working on classifying and contextualising queries before they ever reach the model. 
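Anthropic has not spelled that mitigation out in code, but the general idea can be sketched as a gate in front of the model. Everything below, including the classify_intent and answer_with_llm helpers, is a hypothetical illustration of query classification, not Anthropic’s actual system or API.

```python
# Hypothetical sketch: score each incoming query before it ever reaches the model,
# rather than relying on the model itself to refuse mid-conversation.
# classify_intent() could be a small, cheap classifier; answer_with_llm() is
# whatever normally serves the request. Neither is a real Anthropic API.
def screen_query(query, classify_intent, answer_with_llm,
                 refusal="Sorry, I can't help with that."):
    label = classify_intent(query)      # e.g. "benign", "harmful", "needs_review"
    if label != "benign":
        return refusal                  # blocked before a long context can sway the model
    return answer_with_llm(query)
```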

Of course, there is no guarantee that this new approach will be tough to fool. It is just that the shifting goalposts make for interesting reading, both for ardent fans and trenchant critics of the phenomenon called GenAI.