Mathematics of Transformers

Name: Mathematics of Transformers
Start: 2025-09-26T08:30:00+02:00
End: 2025-09-26T18:00:00+02:00
Location: DESY

Friday 26 Sept 2025, 08:30 → 18:00 Europe/Berlin

Building 1b, Seminar Room 4ab (DESY)

Building 1b, Seminar Room 4ab

DESY

Notkestraße 85 22607 Hamburg Germany

Description

This workshop will focus on the transformer architecture and its underlying (self-)attention mechanisms that gained substantial interest in recent years. Despite their empirical success and groundbreaking advances in natural language processing, computer vision, and scientific computing, the mathematical understanding of transformers is still in its infancy, with many fundamental questions only starting to be posed and addressed.

We aim to bring together researchers with backgrounds in multi-agent dynamics, optimal transport, and PDEs, to initiate discussions on a variety of aspects connected to the theoretical principles governing transformers. By fostering discussions, we seek to advance this young and rapidly evolving research field, uncovering new mathematical perspectives on transformer models.

Confirmed speakers

Giuseppe Bruno (University of Bern)
Valérie Castin (ENS Paris)
Subhabrata Dutta (TU Darmstadt)
Borjan Geshkovski (Inria Paris)
Michaël E. Sander (Google DeepMind)

→ The timetable can be found here.

This is a satellite event to the Conference on Mathematics of Machine Learning 2025 that takes place at TUHH from September 22nd-25th 2025.

We gratefully acknowledge support by the DFG funded priority programme Theoretical Foundations of Deep Learning and Helmholtz Imaging.

- 08:30
  
  Reception & Coffee Building 1b, Seminar Room 4ab
  
  Building 1b, Seminar Room 4ab
  
  DESY
  
  Notkestraße 85 22607 Hamburg Germany
- 09:00
  
  Welcome & Introduction Building 1b, Seminar Room 4ab
  
  Building 1b, Seminar Room 4ab
  
  DESY
  
  Notkestraße 85 22607 Hamburg Germany
- 1
  
  Dynamic metastability in self-attention dynamics Building 1b, Seminar Room 4ab
  
  Building 1b, Seminar Room 4ab
  
  DESY
  
  Notkestraße 85 22607 Hamburg Germany
  
  Speaker: Borjan Geshkovski
- 2
  
  A multiscale analysis of mean-field transformers in the moderate interaction regime Building 1b, Seminar Room 4ab
  
  Building 1b, Seminar Room 4ab
  
  DESY
  
  Notkestraße 85 22607 Hamburg Germany
  
  In this talk, we study the evolution of tokens across the depth of encoder-only transformer models at inference time, modeling them as a system of interacting particles in the infinite-depth limit. Motivated by techniques for extending the context length of large language models, we focus on the moderate interaction regime, where the number of tokens is large and the inverse temperature parameter scales accordingly. In this setting, the dynamics exhibit a multiscale structure. Using PDE analysis, we identify different phases depending on the choice of parameters.
  
  Speaker: Giuseppe Bruno
- 10:45
  
  Coffee Break Building 1b, Seminar Room 4ab
  
  Building 1b, Seminar Room 4ab
  
  DESY
  
  Notkestraße 85 22607 Hamburg Germany
- 3
  
  Mean-Field Transformer Dynamics with Gaussian Inputs Building 1b, Seminar Room 4ab
  
  Building 1b, Seminar Room 4ab
  
  DESY
  
  Notkestraße 85 22607 Hamburg Germany
  
  Transformers, that underlie the recent successes of large language models, represent the data as sequences of vectors called tokens. This representation is leveraged by the attention function, which learns dependencies between tokens and is key to the success of Transformers. However, the dynamics induced by the iterative application of attention across layers remain to be fully understood. To analyze these dynamics, we identify each input sequence with a probability measure, thus handling input sequences of arbitrary length, and model its evolution as a Vlasov equation called Transformer PDE, whose velocity field is non-linear in the probability measure. For compactly supported initial data and several self-attention variants, we show the Transformer PDE is well-posed and is the mean-field limit of an interacting particle system. We also study the case of Gaussian initial data, which has the nice property of staying Gaussian across the dynamics. This allows us to identify typical behaviors theoretically and numerically, and to highlight a clustering phenomenon that parallels previous results in the discrete case.
  
  Speaker: Valérie Castin
- 12:00
  
  Discussion Building 1b, Seminar Room 4ab
  
  Building 1b, Seminar Room 4ab
  
  DESY
  
  Notkestraße 85 22607 Hamburg Germany
- 12:30
  
  Lunch Break Canteen
  
  Canteen
  
  DESY
- 4
  
  Transformers: From Dynamical Systems to Autoregressive In-Context Learners Building 1b, Seminar Room 4ab
  
  Building 1b, Seminar Room 4ab
  
  DESY
  
  Notkestraße 85 22607 Hamburg Germany
  
  Transformers have enabled machine learning to reach capabilities that were unimaginable just a few years ago. Despite these advances, a deeper understanding of the key mechanisms behind their success is needed to build the next generation of AI systems. In this talk, we will begin by presenting a dynamical system perspective on Transformers, demonstrating that they can be interpreted as interacting particle flow maps on the space of probability measures, solving an optimization problem over a context-dependent inner objective. We will also discuss the impact of attention map normalization on Transformer behavior in this framework. We will then focus on the causal setting and propose a model to understand the mechanism behind next-token prediction in a simple autoregressive in-context learning task. We will explicitly construct a Transformer that learns to solve this task in-context through a causal kernel descent method, with connections to the Kaczmarz algorithm in Hilbert spaces, and discuss connections with inference-time scaling.
  
  References
  
  Sander, M. E., & Peyré, G. (2025). Towards understanding the universality of transformers for next-token prediction. International Conference on Learning Representations (ICLR).
  Sander, M. E., Giryes, R., Suzuki, T., Blondel, M., & Peyré, G. (2024). How do transformers perform in-context autoregressive learning? International Conference on Machine Learning (ICML).
  Sander, M. E., Ablin, P., Blondel, M., & Peyré, G. (2022). Sinkformers: Transformers with doubly stochastic attention. International Conference on Artificial Intelligence and Statistics (AISTATS).
  
  Speaker: Michaël Sander
- 5
  
  Transformers as token-to-token function learners Building 1b, Seminar Room 4ab
  
  Building 1b, Seminar Room 4ab
  
  DESY
  
  Notkestraße 85 22607 Hamburg Germany
  
  The ambitious question of understanding a Transformer can be decomposed into understanding the functions it implements: the class of functions one can theoretically approximate using a Transformer, which subclass of them is learnable via gradient descent, which training data distribution is implicitly biased towards which set of functions, how they are implemented across the neural components of the model, and so on. In this talk, I will focus on Transformers implementing language functions. A primer on mechanistic interpretability will be given, followed by certain open problems in this area. Then, I will present an alternate view of the Transformer functions that can potentially solve many of the existing limitations: existence of multiple parallel computation paths, lack of robustness of autoencoder-based replacement models, and how to formalize causal models embedded in training.
  
  Speaker: Subhabrata Dutta
- 15:30
  
  World Café (discussion format) Building 1b, Seminar Room 4ab
  
  Building 1b, Seminar Room 4ab
  
  DESY
  
  Notkestraße 85 22607 Hamburg Germany

Choose timezone

Mathematics of Transformers

Building 1b, Seminar Room 4ab

DESY

Building 1b, Seminar Room 4ab

DESY

Building 1b, Seminar Room 4ab

DESY

Building 1b, Seminar Room 4ab

DESY

Building 1b, Seminar Room 4ab

DESY

Building 1b, Seminar Room 4ab

DESY

Building 1b, Seminar Room 4ab

DESY

Building 1b, Seminar Room 4ab

DESY

Canteen

DESY

Building 1b, Seminar Room 4ab

DESY

Building 1b, Seminar Room 4ab

DESY

Building 1b, Seminar Room 4ab

DESY