Projects

1st Semester 2025/26: Mechanistic interpretability for uncovering world models

Instructors
Fausto Carcassi
ECTS
6
Description

How much do we actually know about what's going on inside an LLM's weights? The answer is: more than you'd expect, and you can learn how to figure out more. This course will serve as a practical introduction to the field of mechanistic interpretability. The first week will include a few lectures from me and paper presentations from the students (selected, e.g., from this list of foundational papers). From the second week onwards, students will do group projects. Groups will work independently and apply different pre-existing libraries for interpretability techniques (e.g., sparse autoencoders & transcoders, activation patching, circuit identification) to the same small model (see the note on the model below). At the end of the course we will have a little fair where we compare what the groups (independently) discovered about the model.
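To give a concrete sense of what "applying a pre-existing library" looks like, here is a minimal activation-patching sketch in Python. It uses TransformerLens, which is only one example of such a library (the course does not commit to a specific one), and the model (GPT-2), prompts, layer, and token position are placeholder choices for illustration, not the model we will actually study.

from functools import partial
from transformer_lens import HookedTransformer
import transformer_lens.utils as utils

# Placeholder model for illustration; the course model will be much smaller.
model = HookedTransformer.from_pretrained("gpt2")

# Two prompts that differ in a single token, so their activations line up.
clean_tokens = model.to_tokens("The city of Paris is in the country of")
corrupt_tokens = model.to_tokens("The city of Rome is in the country of")

# Cache every activation from the clean run.
_, clean_cache = model.run_with_cache(clean_tokens)

def patch_resid(resid, hook, pos):
    # Copy the clean residual stream into the corrupted run at one position.
    resid[:, pos, :] = clean_cache[hook.name][:, pos, :]
    return resid

layer = 6  # arbitrary layer, chosen only for illustration
pos = 4    # position of the " Paris"/" Rome" token, assuming GPT-2 tokenization with a prepended BOS
hook_name = utils.get_act_name("resid_pre", layer)

# Run the corrupted prompt, but with the chosen clean activation patched in.
patched_logits = model.run_with_hooks(
    corrupt_tokens,
    fwd_hooks=[(hook_name, partial(patch_resid, pos=pos))],
)

# How much of the clean behaviour does the patch restore?
france = model.to_single_token(" France")
italy = model.to_single_token(" Italy")
logit_diff = patched_logits[0, -1, france] - patched_logits[0, -1, italy]
print(f"France minus Italy logit difference after patching: {logit_diff.item():.2f}")

The core idea of activation patching is just this: cache activations on a clean run, overwrite one activation in a corrupted run, and check how much of the clean behaviour comes back. The libraries mainly automate sweeping this over layers, positions, and components.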

Organisation

Week 1: Lectures, paper presentations.

Weeks 2-4: Group work, with a weekly meeting with the lecturer.

Week 4: Insights Fair, where the groups present their results.

Prerequisites

Some knowledge of Python is necessary. The goal is not to understand the interpretability techniques in detail, but rather to apply pre-existing libraries to a new model. Therefore, previous technical knowledge of LLMs is not required.

Assessment

The assessment will consist of (1) the paper presentation, (2) the final presentation, and (3) a short writeup (~3 pages).

A note on the model

I don't know yet precisely which model we will look at. I might train one before the course or pick one from previous literature. I have two main requirements in mind:

1. Small enough that everyone can work with it.

2. Trained on data from a small, well-defined world model, so we have an idea in advance of what kind of stuff might be hiding in the weights.

Either way, it will be a model that encodes a small, well-defined environment, such as a maze world.