Matching features, not tokens: Energy-based fine-tuning of language models
- Mujin Kwun, Carles Domingo-Enrich | Harvard University, Microsoft Research
Cross-entropy (CE) training provides dense and scalable supervision for language models, but it optimizes next-token prediction under teacher forcing rather than sequence-level behavior under model rollouts. We introduce a feature-matching objective for language-model fine-tuning that targets sequence-level statistics of the completion distribution, providing dense semantic feedback without requiring a task-specific verifier or preference model. To optimize this objective efficiently, we propose energy-based fine-tuning (EBFT), which uses strided block-parallel sampling to generate multiple rollouts from nested prefixes concurrently, batches feature extraction over these rollouts, and uses the resulting embeddings to perform an on-policy policy-gradient update. We present a theoretical perspective connecting EBFT to KL-regularized feature-matching and energy-based modeling. Empirically, across Q&A coding, unstructured coding, and translation, EBFT matches RLVR and outperforms SFT on downstream accuracy while achieving a lower validation cross-entropy than both methods.
Speaker bios
Mujin Kwun is an Engineering Fellow at the Kempner Institute at Harvard University, advised by Professor Sham Kakade. His recent work focuses on low-precision training, optimization, and post-training for large language models. He received his A.B. and S.M. in Statistics and Computer Science from Harvard University.
Carles Domingo-Enrich is a Senior Researcher at Microsoft Research New England, based in Cambridge, MA. He works on generative AI models (diffusion and flow models, language models) and related topics at the intersection of machine learning, statistics, and AI for science. He received his PhD in Computer Science from NYU and B.S. degrees in Mathematics and Engineering Physics from the Polytechnic University of Catalonia (UPC).
Series: MSR New England Generative Modeling & Sampling Seminar
Physics and information theory of generative diffusion
- Luca Ambrogioni
Matching features, not tokens: Energy-based fine-tuning of language models
- Mujin Kwun, Carles Domingo-Enrich
Generative Models for Molecular Dynamics Across Timescales
- Michael Plainer, Winfried Ripken, Gregor Lied
Q-learning with Flow-Matching Policies
- Qiyang (Colin) Li
A non-Markovian approach to diffusion-based sampling
- Lorenz Richter
Blind denoising diffusion models and the blessings of dimensionality
- Aram-Alexandre Pooladian
Meta Flow Maps
- Peter Potaptchik