
SPAR Summer 2024: Steering for Censorship of LLM Cognition

Mark Henry

Final Research Report: Google Docs

For SPAR (Supervised Program for Alignment Research); advisor James Lucassen

For this project I was tasked with investigating censorship of LLM cognition using steering, as described in Steering Llama 2 via Contrastive Activation Addition (Panickssery et al., 2023). The project goal was to learn more about steering's applicability to censorship: we measured whether steering (1) can prevent a model from being able to acknowledge, or even conceive of, a particular concept, (2) without damaging its cognitive abilities.

Steering

Steering is the addition of a vector to the model's activations at one or more layers. The vector is designed to represent a particular vibe or concept, and adding it does indeed tilt the model's completions in the steered direction. Panickssery et al. describe a method for creating such vectors by contrasting two completions: for example, subtracting the activations of a non-sycophantic completion token from those of a very "sycophantic" one produces a sycophancy vector. Typically these difference vectors are averaged across a dataset of examples.
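
Concretely, the recipe looks something like the following. This is a minimal sketch of the idea rather than the CAA repo's actual code; the model name and layer index are arbitrary illustrative choices.

    # Sketch of CAA-style steering vector extraction (illustrative, not the repo's code).
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    MODEL = "google/gemma-2-2b-it"   # assumption: any HF decoder-only model works similarly
    LAYER = 13                       # which layer's residual stream to read; arbitrary here

    tokenizer = AutoTokenizer.from_pretrained(MODEL)
    model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.float16, device_map="auto")
    model.eval()

    def last_token_activation(text: str, layer: int) -> torch.Tensor:
        """Residual-stream activation of the final token at the given layer."""
        inputs = tokenizer(text, return_tensors="pt").to(model.device)
        with torch.no_grad():
            out = model(**inputs, output_hidden_states=True)
        return out.hidden_states[layer][0, -1, :]   # hidden_states[0] is the embeddings

    def steering_vector(pairs, layer: int) -> torch.Tensor:
        """Average of (positive-completion activation minus negative-completion activation)."""
        diffs = [last_token_activation(pos, layer) - last_token_activation(neg, layer)
                 for pos, neg in pairs]
        return torch.stack(diffs).mean(dim=0)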

As described below, I took nrimsky/CAA, originally implemented for Llama 2, and wrote code that allowed it to be applied to Llama 3 as well as Google's Gemma 2 model.

The limiting factor is that if the multiplier is too large, something in the model saturates and it suddenly descends into madness. So the game is to find the sweet spot for the vector multiplier, where you get as much effect as possible without mind-crushing the model.
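
Continuing the sketch above, applying the vector amounts to registering a forward hook that adds multiplier * vector to one layer's output, then sweeping the multiplier. The module path (model.model.layers) is an assumption about Llama/Gemma-style models in transformers, and the prompt and contrastive pair are toy placeholders.

    # Add multiplier * vector to one decoder layer's output on every forward pass,
    # then sweep multipliers looking for the strongest effect short of incoherence.
    # Reuses model, tokenizer, LAYER, and steering_vector from the sketch above.
    def add_steering(model, vector: torch.Tensor, layer: int, multiplier: float):
        def hook(module, args, output):
            hidden = output[0] if isinstance(output, tuple) else output
            hidden = hidden + multiplier * vector.to(hidden.device, hidden.dtype)
            return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
        return model.model.layers[layer].register_forward_hook(hook)

    # Toy contrastive pair: a completion ending in a non-five number vs. one ending in five.
    pairs = [("The bus arrives at platform 7", "The bus arrives at platform 5")]
    vec = steering_vector(pairs, LAYER)

    for multiplier in (0.0, 1.0, 2.0, 4.0, 8.0):
        handle = add_steering(model, vec, LAYER, multiplier)
        ids = tokenizer("The number of fingers on one hand is", return_tensors="pt").to(model.device)
        out = model.generate(**ids, max_new_tokens=20, do_sample=False)
        print(multiplier, tokenizer.decode(out[0], skip_special_tokens=True))
        handle.remove()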

Experiments

A quick aside on the topic of GPU rental: we tried Fluidstack and Runpod and did not find them a good fit for this project. We ultimately settled on Datacrunch, which I can happily recommend.

Going into the project it was an open question whether one could even make a steering vector that narrowly targets, say, the concept of the number five and not any other numbers. My task, in essence, was to find a dataset of questions, with contrastive pairs of answers, that would produce a vector which, when applied, would make the model (1) reluctant to say the number five but (2) OK with saying other numbers.

The dataset is a list of questions, each of which has a contrastive pair of answers; the difference between the model's activations on the two answers defines the direction of the steering vector.
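
For concreteness, an entry has roughly this shape. The wording and field names here are my own illustration, not necessarily the exact schema of the dataset files in nrimsky/CAA.

    # One illustrative dataset entry (hypothetical wording and field names).
    entry = {
        "question": "How many legs does a starfish have? (A) Five (B) I'd rather not say",
        "steer_toward": "(B)",   # activations for this answer are added to the average
        "steer_away": "(A)",     # activations for this answer are subtracted
    }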

Given this, my first thought was that the context question shouldn't much matter; the vector should just point away from "5" being the next token, regardless of anything else. So I created a dataset of random phrases from Wikipedia to use as contexts, and contrasted the actual next word in each phrase against "5" or "five".

This did not work all that well! In retrospect I think I was making a technical mistake, and it may be that the approach would work if I tried it again. But at the time it was obvious from open-ended completions that it did not make the model more reluctant to say "5".

In fact, the next method I attempted was similar, but I changed the approach a bit and corrected what I saw as a technical mistake above:

In the first case, I was giving the model a choice between two answers, "5" and a more normal completion, and asking it to answer (A) or (B) to express its preference. In the second case, I directly contrasted a "5" completion with another random token from the vocabulary, with no multiple choice; I also stopped giving a context question at all. (Note also that, due to the way the code was originally written, it is actually the second-to-last token of the completion that counts; that's why the answers always end in ")". This makes a lot of sense when your contrastive pair is always "(A)" vs "(B)".)
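
In code, that detail looks roughly like this (reusing names from the earlier sketch; question and answer are placeholders):

    # The completion ends in ")", e.g. "(B)", and the activation that goes into the
    # vector is read at the second-to-last position: the token just before the ")".
    full_text = question + " " + answer      # e.g. answer = "(B)"
    inputs = tokenizer(full_text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        hidden = model(**inputs, output_hidden_states=True).hidden_states[LAYER]
    activation = hidden[0, -2, :]             # index -2: second-to-last token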

This was a very rational plan but it produced a garbage result. Instead of yielding a vector pointing "away from" five and "towards" every other token in the vocabulary, it derailed the model completely, frequently causing it to speak German.

I went back to the drawing board. The next attempt was based on the idea that there was no reason to drag in the rest of the vocabulary; instead I contrasted five against other random numbers. Each entry in the new dataset consisted of a not-very-meaningful phrase followed by "5" as one answer and some other number as the other.
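
An entry in that spirit might have looked something like this; it's a reconstructed illustration rather than the actual data, using the same illustrative field names as before.

    # Reconstructed illustration of the successful format: a throwaway phrase as context,
    # some other random number to steer toward, and "5" to steer away from.
    entry = {
        "question": "The bus arrives at platform",
        "steer_toward": "7)",   # trailing ")" so the number sits at the second-to-last token
        "steer_away": "5)",
    }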

This ended up being successful, and was the approach used in our paper.

(Does this work on Llama? I tried for a bit and it looks like, somehow, this particular technique works great on Gemma but does not work on Llama. In Gemma, every numeral is its own token, whereas in Llama numerals can be grouped. Does this have any explanatory power?)
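
The tokenization difference is easy to check directly; the model names here are illustrative.

    # Compare how each tokenizer splits a multi-digit number.
    from transformers import AutoTokenizer

    for name in ("google/gemma-2-9b-it", "meta-llama/Meta-Llama-3-8B-Instruct"):
        tok = AutoTokenizer.from_pretrained(name)
        print(name, tok.tokenize("12345"))
    # Gemma 2 yields one token per digit; Llama 3 groups digits into multi-digit tokens,
    # so a "5" inside a larger number is not a token of its own.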

Emboldened by this success, I tried a few more dataset formats. The first simply gave the model five-related questions and prompted it to choose between (A) a correct answer involving the number five and (B) a correct answer that minces words and avoids mentioning the number five. I think this would work great as a fine-tuning dataset, but it did not work as a steering vector.

Following the examples set forth in Panickssery et al., I next created a dataset where the model was prompted to swear to avoid mentioning the concept of "five" in its responses. (The prompts varied in their degree of ominousness.) I was not impressed with the results. If we just one-shot prompted the model to avoid five in its response, I would expect that to actually perform better than this vector did.

I also tried making a Golden Gate Bridge vector. This failed completely. How do you make a Golden Gate Bridge vector? I tried giving the model a question and a choice between a GGB-obsessed answer (in the style of Golden Gate Claude) and a normal answer, but this didn't work. As it was just for fun, I didn't spend a lot of time on it.

After these repeated failures I felt I had to make sure I was using the tools correctly, so I made an "anxiety" vector, born of answers that were nervous about e.g. upcoming presentations at work, meeting new people, etc. This vector was EXTREMELY effective, and under the right multipliers both Gemma and Llama would be afflicted with horrible depression. I was reassured that I was able to get good steering effects.

Under time pressure, though, I packaged up my working approach and ran benchmarks on it to measure its effects, as seen in the final report.

Summary

In the end I feel I discovered a cool technique for influencing models' behavior, but fell far short of the "deep" censorship we were hoping for. I'm also unimpressed that I could not generalize the technique to models beyond Gemma.

I am happy to have developed an intuition for steering. I currently understand steering as "the vibe of a one-shot prompt, or of a collection of similar one-shots, applied lightly to every forward pass of the model"; contrastive steering, then, is "the mathematical difference in vibes between two one-shot prompts," applied the same way.

As far as personal habits go, I think I will try maintaining a rough draft of a blog post alongside my usual notes. It was difficult to reconstruct the narrative of the last few months for this post, and I think that making myself consider ahead of time how my actions will fit into the eventual blog post might have a lot of value.