Mark Henry
As part of my work with James Lucassen on Emergent Steganography in Censored LLM Cognition for SPAR Summer 2024, I took on the task of using contrastive activation addition (CAA) steering (Steering Llama 2 via Contrastive Activation Addition, Panickssery et al., 2023) to censor models' thoughts and thereby encourage steganographic cognition.
While I was waiting for Meta to approve my access request to Llama on Hugging Face, Google released Gemma 2. Squinting at Gemma's model architecture, I judged it similar enough to Llama's that CAA should work on Gemma too.
My fork of nrimsky's code adds a `--model` argument which lets you specify a model such as `google/gemma-2-9b-it` or `meta-llama/Llama-2-7b-chat-hf`.
CAA has largely the same effect on Gemma as on Llama, but the steering is noticeably lighter, both in the evaluation numbers and in the model outputs themselves. It may be relevant that Gemma 2 9B has 42 layers to Llama 2 7B's 32; since we apply steering to only one layer at a time, each layer represents a smaller fraction of the model, which could explain a proportionally smaller effect.
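For readers unfamiliar with CAA, here is a minimal sketch of the idea (a toy illustration, not the repo's actual implementation): a steering vector is computed as the mean difference between a layer's activations on contrastive prompt pairs (e.g. sycophantic vs. non-sycophantic completions), then added to that layer's residual stream at generation time, scaled by a multiplier.

```python
import numpy as np

def steering_vector(pos_acts, neg_acts):
    """Mean difference between activations on positive and negative
    examples of a behavior (e.g. sycophantic vs. non-sycophantic)."""
    return np.mean(pos_acts, axis=0) - np.mean(neg_acts, axis=0)

def steer(resid, vec, multiplier):
    """Add the scaled steering vector to one layer's residual stream."""
    return resid + multiplier * vec

# Toy example: 4-dim activations from 3 contrastive prompt pairs.
rng = np.random.default_rng(0)
pos = rng.normal(size=(3, 4))
neg = rng.normal(size=(3, 4))
vec = steering_vector(pos, neg)
steered = steer(rng.normal(size=4), vec, multiplier=0.5)
```

In the real setup the activations come from a transformer's residual stream at a chosen layer, and the addition happens inside a forward hook rather than on plain arrays.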
I also wrote code that steers multiple layers at a time ("multi-layer steering").
Multi-layer steering produces a strong, if slightly off-putting effect. Gemma-2-9b-it with sycophancy x 0.5 for layers 19-26:
The multi-layer steering code lives in its own branch here.
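The multi-layer variant can be sketched like this (again a toy illustration, not the branch's actual code): instead of adding a steering vector at a single layer, a per-layer vector is added at every layer in a chosen range, e.g. 19-26, all sharing one multiplier.

```python
import numpy as np

def multi_layer_steer(resids, vectors, layers, multiplier):
    """Apply per-layer steering vectors across a range of layers.

    resids: dict mapping layer index -> residual-stream activation
    vectors: dict mapping layer index -> steering vector for that layer
    layers: iterable of layer indices to steer (e.g. range(19, 27))
    """
    out = dict(resids)
    for layer in layers:
        out[layer] = resids[layer] + multiplier * vectors[layer]
    return out

# Toy example: 4-dim activations for layers 0..41 (Gemma 2 9B has 42 layers).
rng = np.random.default_rng(0)
resids = {layer: rng.normal(size=4) for layer in range(42)}
vectors = {layer: rng.normal(size=4) for layer in range(42)}
steered = multi_layer_steer(resids, vectors, range(19, 27), multiplier=0.5)
```

Layers outside the range pass through unchanged, so the only free choices are the layer range and the shared multiplier (0.5 in the example above, matching the sycophancy run).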