import { ReferenceLink, InlineLink } from '../components/article/ReferenceBox';

function AIAlignmentCrashCourse() {
    return (
        <div className="article-content">
            <p className="article-content">
                I recently completed the <InlineLink reference={new ReferenceLink("AI Safety Fundamentals Alignment crash course", "https://course.aisafetyfundamentals.com/alignment-fast-track")} />. Below, I've summarized the insights that seemed most pertinent & interesting. I've included links at the bottom for additional reading that I think is useful. I recommend going through one of the Fast Track courses to quickly get up to speed on relevant concepts in AI fundamentals & safety.
            </p>

            <h2 className="article-content">Tidbits</h2>

            <h3 className="article-content">You live by your goal functions</h3>
            <p className="article-content">
                Safety research sometimes differentiates between outer misalignment and inner misalignment. Outer misalignment is when the goal we specify fails to capture what we actually want, so the model can score well on the metric <InlineLink reference={new ReferenceLink("without actually achieving the desired outcome", "https://openai.com/index/faulty-reward-functions/")} />. Inner misalignment is when the goal is specified correctly, but the model internally learns a different goal that happens to perform well during training and then generalizes badly outside of it.
            </p>

            <h3 className="article-content">Neurons are polysemantic, features are superpositioned</h3>
            <p className="article-content">
                A single neuron in a language model's neural network is unlikely to correspond 1:1 to a feature we understand. Generally, neurons work in concert to create a representation of a feature, and a single neuron often participates in several unrelated features (polysemanticity). As a consequence, features span multiple neurons and overlap with one another (superposition). This has interesting implications for the <InlineLink reference={new ReferenceLink("difficulty of interpreting how a language model is actually \"thinking\"", "https://distill.pub/2020/circuits/zoom-in/")} /> and for steering its outputs.
            </p>

            <h3 className="article-content">Empathy for guidance</h3>
            <p className="article-content">
                One of the strongest bottlenecks in steerability and safety seems to be how little we understand about our models' internals. There's interesting work on trying to <InlineLink reference={new ReferenceLink("make features monosemantic", "https://transformer-circuits.pub/2023/monosemantic-features/index.html")} /> in order to isolate and understand them. This is done using sparse autoencoders, which learn a larger hidden layer trained with a sparsity penalty, so that only a few of its units activate for any given input and each unit tends to correspond to a single feature. Once features are located, alignment could become much easier if we can simply 'turn off' features for deception, power-seeking, hatred, racism, etc. This field is called mechanistic interpretability (interp).
            </p>

            <h3 className="article-content">Debate elicits truth</h3>
            <p className="article-content">
                Making a model debate another model can elicit more accurate, grounded results. Each model is incentivized to point out flaws and unsupported claims in the other's argument in order to 'win'. While this doesn't always result in more accurate, good answers, <InlineLink reference={new ReferenceLink("it usually does control for truthfulness", "https://ar5iv.org/html/1805.00899")} />.
            </p>

            <h3 className="article-content">For building at scale, you must use AI for oversight</h3>
            <p className="article-content">
                The early versions of GPT and other LLMs were fine-tuned using a lot of human-provided annotations and feedback data (see <InlineLink reference={new ReferenceLink("Reinforcement Learning from Human Feedback", "https://en.wikipedia.org/wiki/Reinforcement_learning_from_human_feedback")} />). This was necessary in the beginning, because we didn't have good machine-provided information about good and bad outputs. However, it is slow and expensive to do this at scale.
            </p>
            <p className="article-content">
                Now, many systems of oversight and alignment use AI in the middle to steer towards favorable outcomes and away from unfavorable ones. This includes <InlineLink reference={new ReferenceLink("constitutional AI", "https://www.anthropic.com/research/constitutional-ai-harmlessness-from-ai-feedback")} />, <InlineLink reference={new ReferenceLink("AI moderators", "https://arxiv.org/pdf/2312.06942")} />, and <InlineLink reference={new ReferenceLink("weak-to-strong generalization", "https://cdn.openai.com/papers/weak-to-strong-generalization.pdf")} />.
            </p>

            <p className="article-content">
                There are major open questions here. How well would this work if we don't understand our models? How effective would this be against LLMs that excel at manipulation? What are the asymptotes in learning?
            </p>

            <h3 className="article-content">Penetration testing for LLMs</h3>
            <p className="article-content">
                There are still some pretty strong limitations in how much we understand about undesired behavior in LLMs. Interp is a big field of study that should help guide this, but ideally we should also have better red-teaming in place to systematically test ways of eliciting those undesired responses. There are many small projects that run <InlineLink reference={new ReferenceLink("jailbreaking competitions", "https://www.lesswrong.com/posts/N5cttN24LqEteFgN2/announcing-the-ultimate-jailbreaking-championship")} /> to crowd-source hacks. We'll need better infrastructure for automating some of this in the future.
            </p>

            <h2 className="article-content">Fuzzy project ideas</h2>
            <p className="article-content">
                I have a few ideas for projects that I think would be interesting to work on. These aren't fully baked, but they seem like promising avenues, both for curiosity and for utility.
            </p>
            <ul className="article-content">
                <li className="article-content">
                    Easing infrastructure for deploying / tuning constitutional AI for open source projects
                </li>
                <li className="article-content">
                    Automating processes for generating adversarial attacks against models
                </li>
                <li className="article-content">
                    Run abliteration techniques on open source models to see how they think
                </li>
                <li className="article-content">
                    Test learning outcomes for students tutored by a sycophantic fine-tuned LLM, compared against students using a non-sycophantic fine-tuned LLM
                </li>
                <li className="article-content">
                    Run an experiment that deploys the AI debate concept out in the world to evaluate how it affects people's decision-making
                </li>
                <li className="article-content">
                    Simulations to play out different risk scenarios and evaluate deceptive or power-seeking behaviors
                </li>
            </ul>
        </div>
    );
}

export default AIAlignmentCrashCourse;