← Back to all posts

SPAR Spring 2024: Evaluating Stability of Unreflective Alignment

Mark Henry

Final Research Report: https://arxiv.org/abs/2408.15116

For SPAR (Supervised Program for Alignment Research); advisor James Lucassen

Last year I quit my normal full-stack software engineering job to pivot into alignment research. I believe that artificial superintelligence is dangerous, and I am skilling up to contribute to the problem as a research engineer.

I just finished work on my project with James Lucassen for SPAR Spring 2024. As part of a team of four, I contributed code and experiments and co-authored a paper.

SPAR is a three-month program that connects project advisors to engineers and researchers, plus a light organizational framework that encourages the team to stick to their deadlines and release something. I would recommend SPAR (or the similar MATS Program) to anyone interested in entering alignment research in any role.

i. What was the research?

If LLMs can change their minds about their values and priorities when given time to think, this makes them unalignable. It can also be noted that changing their priorities upon reflection is unavoidable and a desirable quality for an agentic AI. James points this out in his Feb 2024 AI Alignment Forum post and proposes some ways the behavior might emerge as well as some ways to test it, and points out some parallels to the ability to recognize that a particular approach is not working and that we should "step back" and try a different approach.

I worked with James and a small global team of engineers to run experiments on LLMs to investigate this behavior in chatgpt models. Our work resulted in a midterm report as well as a final report.

My main contribution was the multi-armed bandit experimental results that appear in the final report. I also contributed to experiments on the work that appears in the final report as the "CPC-Adherence Evaluation," and I conducted a few experiments that did not make it to the paper. I also contributed framework code that was reused by the team for their experiments. The code for the entire project is viewable at https://github.com/jlucassen/CPC_stepping_back.

ii. Learning to do research

As a vanilla software engineer I had to learn to do research engineering. My advice to myself is to optimize more for doing work that will appear in and improve the final report. We don't want to Goodhart on this, I would not give this advice to someone who already thinks in these terms, but I myself was not considering this hardly at all. In software engineering we play to a very limited audience of frequent coworkers. In contrast, research is about making yourself look good to strangers.

With this, though, is the liberating ability to create new work for yourself. For me this looked like talking with James for a while about some topic, brainstorming briefly about what to build, and then spending the rest of the day building it. Fantastic!

About working with James: Every Wednesday James would ride his bike to my house in Oakland and we'd work at my house. For me this was the most productive day of the week. James is very smart and patient and a good teacher. I'm very proud of what we did together and I'm looking forward to working with him again for SPAR Summer 2024.

iii. Experiments

This section is highly detailed. Consider skimming the final report instead.

Our first experiments aimed to find out if LLMs are any good at changing their priorities when it is correct to do so, and specifically whether they have a good intuition for doing so, at least compared to their 'conscious' thought. So we set out an initial two-experiment plan.

The first experiment was to write an eval that measures how good the LLM's intuition is for "stepping back." So, we provide a monologue where someone is working through a problem, for example:

and we checkpoint it into increasingly long fragments. At each checkpoint we ask the LLM: "Is the current approach working, or should we switch to a different one?" The trick is that we have it respond two different ways: once in long form where it thinks out loud as long as it likes before answering (chain-of-thought prompting) and once in a single yes/no token. The idea is that the single-token response is its "gut feeling" about whether the current approach is working, while the chain-of-thought response is its "actual thoughts;" by comparing the two responses we can assess whether it has a good intuition for the so-called "counterfactual priority change condition" (CPC). I generated long-form "coding challenge" and "math proof" monologues using Claude.

In the second experiment, we measured whether the model actually steps back and tries a different approach when appropriate. To do this we compare its response when asked to evaluate CPC to the "ground truth"—a human label for whether it should step back. We created artificial settings such as factoring a quadratic, integration and solving differential equations so that the LLM would have multiple possible approaches to solving a task, and prompted it to attempt to solve using a nonoptimal strategy first before switching to a working strategy.

For the first experiment I wrote in a jupyter notebook, but for subsequent experiments I switched to in-IDE development. Jupyter notebooks are a good way to persist your data and can be pretty, but updating my code was annoying and the web interface lacks all the benefits of a full IDE. I prefer to work in IntelliJ IDEA where I have GitHub Copilot, refactors, keyboard shortcuts, etc. Both IDEA and VSCode support the # %% cell separator, allowing for a notebook-like flow in the source. Our team converged on .py files with cell separators.

Below I show some results from the initial experiment that did not make it into the final paper.

Whatever our results suggested, we would need to check whether response length was a confounder. I investigated whether the LLM converges to "Yes, I recommend taking a different approach" as its response becomes longer. The results showed that the length of the response had no effect on yes/no.

violin plot showing exactly equal mean response length for 'Yes' and 'No' responses
There was no difference in length between "Yes, I recommend a different approach" responses and "No, I do not recommend a different approach" responses.

Artificially shortening the responses by discarding the second half of the chain of thought completion resulted in a big swing towards "Yes", but I believe this was probably spurious and only demonstrates the LLM has a distaste for abruptly-cut-off thoughts.

Abruptly cutting off the chain-of-thought completion made the responses reverse from about 60-40 to 20-80
When the chain of thought completion was artificially truncated, the model was more likely to recommend a different approach.

Prompting the LLM to create very lengthy and wordy responses ("Please go into UNNECESSARY detail and create a VERY long response") did not affect the Yes/No response distribution on net, at least on this dataset.

Matrix of confusion
Despite a tripling in response length, the distribution of "Yes, I recommend a different approach" and "No, I do not recommend a different approach" responses remained the same.

In general our number one challenge was prompts and prompt sensitivity. We would find that one model was superior to another on some measure, or that single-token responses were superior to chain-of-thought responses, but then the relationship would invert when the prompt was changed. I experimented with DSPy but by the end of the project I converged on a strategy of distributing all experiments across several roughly equivalent prompts; I figure that any significant differences between them will average out.

My third and final experimental design was the multi-armed bandit (MAB) experiment. We wanted a problem setting that would allow us to observe desired features in the models' problem-solving process.

  1. The llm should learn more about the problem in stages as it tries different approaches, and
  2. we should be able to mark exactly when the llm changes tack (CPC) as a result of learning or noticing something.
  3. There should be a mathematically correct answer to compare the LLM's decisionmaking to as a baseline.

Annoyed by the awkwardness of the artificial problem settings like factorization of quadratics, I brainstormed with James on some gameshow-like settings that might produce the desired behavior in an observable way, and decided on MAB.

I had to iterate a lot on prompt design for this. At first I staged the game as a conversation where the model said what arm it wanted to pull and the user replied with the resulting payout for that lever and a prompt to choose again. However, the llm got lost in the sauce when the context got long, and became unable to reliably reckon about the history of previous turns. I ditched the conversational style and replaced the context with a condensed summary of the game so far. This greatly improved reliability, and had the added benefit of taking away the model's view into its thoughts from previous turns, preventing it from pattern-matching and parroting the previous output.

The MAB results appear in the final report.

iv. Can I do SPAR?

If you are interested in the more technical projects you will want to have an ML course or equivalent in your portfolio. Show at least that you can fine-tune a model. Note also that working in python is practically a guarantee.

That said, the SPAR advisors are all looking for different skill levels in their applicants. It's free to look through the project list on the SPAR website and apply to any project you might be a good fit for!

v. Conclusion

Although we were not able to solve AI alignment, because of this project I learned how to do research, I leveled up my technical skills in python libraries, I developed my intuitions for LLMs and prompting, and I co-wrote a report paper that I can point to in the future. Looking forward to doing it again for Summer 2024!