Explainability

In the previous section, we explained how you can steer the attention of our models to either suppress or amplify parts of the input sequence. Very roughly speaking, our explainability method uses AtMan to suppress individual parts of a prompt and measures how much each suppression changes the probability of the generated completion, which lets us compare the parts relative to each other. In the example further below, we investigate which part of the prompt influenced the completion the most.
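
To make the scoring idea concrete first, here is a minimal, purely schematic sketch. It assumes a hypothetical helper completion_log_prob(parts, target) that returns the log-probability the model assigns to the completion given a prompt built from the listed parts; the real AtMan method suppresses parts through attention manipulation inside the model rather than by removing them from the prompt.

```python
from typing import Callable, List

def explanation_scores(
    parts: List[str],
    target: str,
    completion_log_prob: Callable[[List[str], str], float],
) -> List[float]:
    """Schematic suppression-based explanation scores.

    For every prompt part, suppress it and record how much the completion's
    log-probability drops. Positive scores mean the part supported the
    completion; negative scores mean it worked against it.
    `completion_log_prob` is a hypothetical helper standing in for a model call.
    """
    baseline = completion_log_prob(parts, target)
    scores = []
    for i in range(len(parts)):
        # AtMan suppresses a part via attention manipulation; here we simply
        # drop it from the prompt to keep the sketch self-contained.
        suppressed = parts[:i] + parts[i + 1:]
        scores.append(baseline - completion_log_prob(suppressed, target))
    return scores
```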

Score     Prompt token
 0.829    I
-0.246    am
 1.214    French
 0.445    .
 0.193    My
 0.766    favourite
 0.790    food
-0.010    is

Completion: cheese

The table above lists the change in the log-probability of the completion when the respective word is suppressed. We see that the word "French" contributed to the completion "cheese" the most. Let's see how this changes if we adjust the prompt:

Score     Prompt token
 0.055    I
-0.048    am
-1.025    French
-0.275    and
-0.0811   a
 0.349    programmer
-0.141    .
 0.125    My
 0.779    favourite
 0.278    food
 0.079    is

Completion: pizza

This new completion and the respective scores are interesting for a couple of reasons. We get a different completion even though we changed the prompt only marginally. The words "programmer", "favourite", and "food" contributed positively, and their combination led to the final completion "pizza". Still, the explanations show that "French" has the greatest influence on the completion, only in the negative direction this time: the word "French" made the completion "pizza" less likely.
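
If you work with the Python client, an explanation like the ones above can be requested roughly as follows. Treat this as a hedged sketch: the names used here (Client, ExplanationRequest, Prompt.from_text, client.explain) and the model name follow the aleph-alpha-client package as we understand it, so please verify the exact signatures and the structure of the response against the current client reference.

```python
import os

# Names below follow the aleph-alpha-client Python package; verify them
# against the current client documentation before relying on this sketch.
from aleph_alpha_client import Client, ExplanationRequest, Prompt

client = Client(token=os.environ["AA_TOKEN"])  # your API token

request = ExplanationRequest(
    prompt=Prompt.from_text("I am French and a programmer. My favourite food is"),
    target=" pizza",  # the completion whose scores we want to explain
)

# The model name is illustrative; use any model that supports explanations.
response = client.explain(request, model="luminous-base")

# The response contains per-item scores comparable to the tables above;
# inspect it (or the API reference) for the exact structure.
print(response)
```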

Explainability on Images

We also offer our explainability method for multimodal input. If you provide text and an image as input, you can see how the text and individual tiles of the image contributed to the completion.

[Image: a room containing a guitar]

Score     Prompt token
 0.077    The
 1.0      instrument
 0.112    on
 0.051    the
 0.117    picture
 0.001    is
 0.001    a

Completion: guitar.

The word "instrument" as well as the tile with the head of the guitar contributed most to the completion.

Hallucination Detection

Large Language Models tend to hallucinate at times: they can produce completions that seem plausible but contain made-up, irrelevant, or incorrect information. Consider the following example, where we ask a question that the model answers from its world knowledge rather than from the context.

Score     Prompt segment
 0.216    Answer the question based on the context.
-0.181    Context: According to tradition, on April 21, 753 BC, Romulus and his twin brother Remus founded Rome in the place where they had been suckled as orphans by a she-wolf.
 3.710    Q: In which country is Rome located?
 0.110    A:

Completion: Italy

We see that the context did not contribute positively to the completion at all, indicating that the answer is not grounded in the context and may be hallucinated from world knowledge. If we ask a question whose answer is contained in the text, the contribution of the context is considerably higher, giving us a good indication of the quality of the reply.

Score     Prompt segment
 0.276    Answer the question based on the context.
 0.85     Context: According to tradition, on April 21, 753 BC, Romulus and his twin brother Remus founded Rome in the place where they had been suckled as orphans by a she-wolf.
 2.205    Q: In which month was Rome founded?
 0.776    A:

Completion: April

Now, the context had a significant positive influence on the completion. Therefore, we can assume that the answer is not hallucinated and stems from the context that we provided in the prompt.
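
One way to turn this observation into an automatic check is a simple threshold on the context segment's score. The sketch below assumes you have already extracted per-segment scores into a dictionary; the segment names and the threshold value are arbitrary placeholders that would need tuning for a real application.

```python
def looks_grounded(segment_scores: dict, context_key: str = "context",
                   threshold: float = 0.5) -> bool:
    """Crude heuristic: accept the answer as grounded in the context only if
    the context segment contributed clearly positively to the completion.

    The threshold is an arbitrary placeholder and should be tuned."""
    return segment_scores.get(context_key, 0.0) >= threshold

# Segment scores from the two examples above.
country_example = {"instruction": 0.216, "context": -0.181, "question": 3.710, "answer_prefix": 0.110}
month_example = {"instruction": 0.276, "context": 0.85, "question": 2.205, "answer_prefix": 0.776}

print(looks_grounded(country_example))  # False -> answer likely not grounded in the context
print(looks_grounded(month_example))    # True  -> answer likely grounded in the context
```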