MA-INF 4335 Lab AI Alignment

Content

Through tutorials and a final project, you will gain hands-on experience with AI alignment techniques and the opportunity to apply this knowledge to a variety of interesting projects. Students will collaborate in small teams of 3-4 and implement small research projects over the course of the term, advised by a researcher from the CAISA lab. Students will learn to reproduce important results from the field, study the scientific literature, generate and implement their own research ideas, and present their results in a presentation and a paper.

As AI systems such as Large Language Models become increasingly capable and start to be used in high-stakes scenarios, ensuring that they act safely is gaining importance. The research field of AI alignment studies methods to align the behavior and values of AI systems with those of users and broader society in a robust, scalable, and interpretable way. The aim of this course is to explore cutting-edge research, insights, and trends in the field of AI alignment.

Schedule

• Week 0: Organization meeting
• Weeks 1-5: Lectures and programming exercises
• Week 6: Presentation of project ideas
• Week 12: Midterm presentation of results
• Final presentation
• Student paper

Concrete research topics include 1) value alignment, 2) emergent misalignment, 3) scalable oversight, and 4) mechanistic interpretability.

Prerequisites

One of the following courses is recommended:
• MA-INF 4115 - Introduction to Natural Language Processing,
• MA-INF 4235 - Reinforcement Learning,
• MA-INF 4204 - Technical Neural Nets.

Seminar work

Oral presentation and written report.

Literature

• Hendrycks, Dan: Introduction to AI Safety, Ethics, and Society.
• Ouyang, Long et al.: Training language models to follow instructions with human feedback.
• Bowman, Sam et al.: Measuring Progress on Scalable Oversight for Large Language Models.
• Li, Nathaniel et al.: The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning.
• Bricken, Trenton et al.: Towards monosemanticity: Decomposing language models with dictionary learning.