ChatGPT creators try to use artificial intelligence to explain itself – and come across major problems

Chatbot might be using concepts that we don’t have names for or understanding of, researchers say

Andrew Griffin
Friday 12 May 2023 02:23 EDT

ChatGPT’s creators have attempted to get the system to explain itself.

They found that while they had some success, they ran into some issues – including the fact that artificial intelligence may be using concepts that humans do not have names for, or understanding of.

Researchers at OpenAI, which developed ChatGPT, used the most recent version of its model, known as GPT-4, to try to explain the behaviour of GPT-2, an earlier version.

It is an attempt to overcome the so-called black box problem with large language models such as GPT. While we have a relatively good understanding of what goes into and comes out of such systems, the actual work that goes on inside remains largely mysterious.

That is not only a problem because it makes things difficult for researchers. It also means that there is little way of knowing what biases might be involved in the system, or if it is providing false information to people using it, since there is no way of knowing how it came to the conclusions it did.

Engineers and scientists have aimed to resolve this problem with “interpretability research”, which seeks to find ways to look inside the model itself and better understand what is going on. Often, this requires looking at the “neurons” that make up such a model: just like in the human brain, an AI system is made up of a host of so-called neurons that together make up the whole.

Finding those individual neurons and their purpose is difficult, however, since humans have had to pick through the neurons and manually inspect them to find out what they represent. But some systems have hundreds of billions of parameters, so having people inspect them all by hand is not feasible.

Now, researchers at OpenAI have looked to use GPT-4 to automate that process, in an attempt to pick through the behaviour more quickly. They built an automated process in which GPT-4 writes natural language explanations of a neuron’s behaviour, and applied it to the neurons of another, earlier language model.

That worked in three steps: showing GPT-4 a neuron in GPT-2 and having it try to explain what the neuron does, then having it simulate how that neuron would activate if the explanation were right, and finally scoring the explanation by comparing the simulated activations with the real ones.
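The scoring step can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the article does not say how the simulated and real activations are compared, so a plain Pearson correlation is used here, the activation values are made up, and the hypothetical `simulate_from_explanation` function is a toy keyword matcher standing in for GPT-4.

```python
from math import sqrt


def correlation_score(real, simulated):
    """Pearson correlation between real and simulated activations.

    Assumed scoring rule for illustration; 1.0 means the simulation
    tracks the real neuron perfectly, 0.0 means no linear relationship.
    """
    if len(real) != len(simulated) or len(real) < 2:
        raise ValueError("need two equal-length sequences of at least 2 values")
    n = len(real)
    mean_r = sum(real) / n
    mean_s = sum(simulated) / n
    cov = sum((r - mean_r) * (s - mean_s) for r, s in zip(real, simulated))
    var_r = sum((r - mean_r) ** 2 for r in real)
    var_s = sum((s - mean_s) ** 2 for s in simulated)
    if var_r == 0 or var_s == 0:
        # A constant sequence carries no signal to correlate with.
        return 0.0
    return cov / sqrt(var_r * var_s)


def simulate_from_explanation(tokens, keywords):
    """Toy stand-in for the simulator (hypothetical, not GPT-4).

    Predicts a high activation for every token the explanation names.
    """
    return [1.0 if t.lower() in keywords else 0.0 for t in tokens]


# Step 1 (explain): suppose the explanation is "fires on the words cat and mat".
tokens = ["The", "cat", "sat", "on", "the", "mat"]
real_activations = [0.0, 0.9, 0.1, 0.0, 0.0, 0.8]  # made-up values

# Step 2 (simulate): predict activations from the explanation alone.
simulated = simulate_from_explanation(tokens, {"cat", "mat"})

# Step 3 (score): compare the prediction with what the neuron really did.
score = correlation_score(real_activations, simulated)
print(round(score, 3))
```

A score near 1 would suggest the explanation captures when the neuron fires; a score near 0 would suggest it does not.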

Most of those explanations scored poorly, even by GPT-4’s own assessment. But researchers said that they hoped the experiment showed that it would be possible, with further work, to use the AI technology to explain itself.

The creators came up against a range of “limitations”, however, that mean the system as it exists now is not as good as humans at explaining the behaviour. Part of the problem may be that explaining how the system is working in normal language is impossible – because the system may be using individual concepts that humans cannot name.

“We focused on short natural language explanations, but neurons may have very complex behaviour that is impossible to describe succinctly,” the authors write. “For example, neurons could be highly polysemantic (representing many distinct concepts) or could represent single concepts that humans don’t understand or have words for.”

The approach also runs into problems because it looks only at what each neuron does individually, and not at how that might affect things later on in the text. Similarly, it can describe specific behaviour but not the mechanism producing that behaviour, and so might spot patterns that are not actually the cause of a given behaviour.

The system also uses a lot of computing power, the researchers note.
