Mark Henry
As part of my work with James Lucassen on Emergent Steganography in Censored LLM Cognition for SPAR Summer 2024, I took on the task of using contrastive activation addition (CAA) steering (Steering Llama 2 via Contrastive Activation Addition, Panickssery et al., 2023) to censor models' thoughts and thereby encourage steganographic cognition.
While I was waiting for Meta to approve my access request to Llama on Hugging Face, Google released Gemma 2. Squinting at Gemma's model architecture, I judged it similar enough to Llama's that CAA should work on Gemma too.
My fork of nrimsky's code adds a `--model` argument which lets you specify a model such as `google/gemma-2-9b-it` or `meta-llama/Llama-2-7b-chat-hf`.
CAA has largely the same effect on Gemma as on Llama, but the steering is noticeably lighter, both in the evaluation numbers and in the model outputs themselves. It may be relevant that Gemma 2 9B has 42 layers to Llama 2 7B's 32; since we apply steering to only one layer at a time, each layer represents a smaller fraction of the model, which could explain a proportionally smaller effect.
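For readers unfamiliar with CAA, here is a minimal sketch of the idea (a toy illustration, not the repo's actual implementation): a steering vector is computed as the mean difference between a layer's activations on contrastive prompt pairs (e.g. sycophantic vs. non-sycophantic completions), then added to that layer's residual stream at generation time, scaled by a multiplier.

```python
import numpy as np

def steering_vector(pos_acts, neg_acts):
    """Mean difference between activations on positive and negative
    examples of a behavior (e.g. sycophantic vs. non-sycophantic)."""
    return np.mean(pos_acts, axis=0) - np.mean(neg_acts, axis=0)

def steer(resid, vec, multiplier):
    """Add the scaled steering vector to one layer's residual stream."""
    return resid + multiplier * vec

# Toy example: 4-dim activations from 3 contrastive prompt pairs.
rng = np.random.default_rng(0)
pos = rng.normal(size=(3, 4))
neg = rng.normal(size=(3, 4))
vec = steering_vector(pos, neg)
steered = steer(rng.normal(size=4), vec, multiplier=0.5)
```

In the real setup the activations come from a transformer's residual stream at a chosen layer, and the addition happens inside a forward hook rather than on plain arrays.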
I also wrote code that steers multiple layers at a time ("multi-layer steering").
Multi-layer steering produces a strong, if slightly off-putting effect. Gemma-2-9b-it with sycophancy x 0.5 for layers 19-26:
The multi-layer steering code lives in its own branch here.
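The multi-layer variant can be sketched like this (again a toy illustration, not the branch's actual code): instead of adding a steering vector at a single layer, a per-layer vector is added at every layer in a chosen range, e.g. 19-26, all sharing one multiplier.

```python
import numpy as np

def multi_layer_steer(resids, vectors, layers, multiplier):
    """Apply per-layer steering vectors across a range of layers.

    resids: dict mapping layer index -> residual-stream activation
    vectors: dict mapping layer index -> steering vector for that layer
    layers: iterable of layer indices to steer (e.g. range(19, 27))
    """
    out = dict(resids)
    for layer in layers:
        out[layer] = resids[layer] + multiplier * vectors[layer]
    return out

# Toy example: 4-dim activations for layers 0..41 (Gemma 2 9B has 42 layers).
rng = np.random.default_rng(0)
resids = {layer: rng.normal(size=4) for layer in range(42)}
vectors = {layer: rng.normal(size=4) for layer in range(42)}
steered = multi_layer_steer(resids, vectors, range(19, 27), multiplier=0.5)
```

Layers outside the range pass through unchanged, so the only free choices are the layer range and the shared multiplier (0.5 in the example above, matching the sycophancy run).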