Evaluating Steering Techniques using Human Similarity Judgments
What kinds of semantic representations support tasks requiring cognitive control in humans? We compare human and LLM representations obtained from behavioral judgments and evaluate the extent to which popular steering methods (SAEs, task vectors, etc.) support alignment with human representations.
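As a rough illustration of what comparing two sets of similarity judgments can look like, the sketch below scores agreement between a human and a model similarity matrix using a Spearman correlation over item pairs. This is an assumption made for illustration only: the text above does not specify the alignment measure used, and the function names (`upper_triangle`, `ranks`, `alignment`) and the toy matrices are hypothetical.

```python
from itertools import combinations


def upper_triangle(sim):
    """Flatten the off-diagonal upper triangle of a square similarity matrix."""
    n = len(sim)
    return [sim[i][j] for i, j in combinations(range(n), 2)]


def ranks(values):
    """Return 1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    result = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            result[order[k]] = mean_rank
        start = end + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def alignment(human_sim, model_sim):
    """Spearman correlation between the pairwise similarities of two matrices."""
    return pearson(ranks(upper_triangle(human_sim)),
                   ranks(upper_triangle(model_sim)))


# Hypothetical toy data: three items, similarities in [0, 1].
human = [[1.0, 0.9, 0.1],
         [0.9, 1.0, 0.2],
         [0.1, 0.2, 1.0]]
model = [[1.0, 0.7, 0.3],
         [0.7, 1.0, 0.4],
         [0.3, 0.4, 1.0]]

score = alignment(human, model)
```

Under this assumed measure, a steering method could be evaluated by recomputing `model` after steering and checking whether `score` moves toward 1. A rank-based correlation is chosen here because it ignores differences in scale between human ratings and model-derived similarities.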