Toronto LLM Meetup

Energy Based Models

From Boltzmann machines to modern transformers

March 17, 2026  ·  Shashank Shekhar

Transformers are already
energy-based models

We just don't see them that way (yet)

Three recent results

πŸ† 55M-parameter energy-based reward model

127Γ— smaller than typical reward models β€” boosts Llama 3 to 90.7% on GSM8k, 63.7% on MATH

Jiang et al., Learning to Rank Chain-of-Thought, 2025

📈 Energy-Based Transformers scale 35–57% faster

Better data efficiency than standard autoregressive transformers, tested up to 120B tokens

Gladstone et al, Energy-Based Transformers are Scalable Learners and Thinkers, 2025

🎯 RLHF optimal policy = Boltzmann distribution

The optimal policy in RLHF is literally the Boltzmann distribution over reward-weighted outputs

Talk outline

1. The Energy-Based Model paradigm

2. How do we train EBMs?

3. EBMs meet LLMs

Part 1

The Energy-Based
Model Paradigm

The Energy Landscape

  • Blue = low energy ≈ good
  • Red = high energy ≈ bad
  • Training = shaping the landscape
  • Inference = rolling a ball downhill

The Energy Function

Given an input x and a candidate answer y, the energy function returns "how incompatible are they?"

⬆️

High energy = incompatible

⬇️

Low energy = compatible

"Probability comes at a high price, and should be avoided when the application does not require it" — Yann LeCun

Inference

Find the y that minimizes E(x, y)

Energy-Based vs. Autoregressive

|               | Supervised AR             | Energy-Based          |
|---------------|---------------------------|-----------------------|
| Scoring level | Token level               | Completion level      |
| Generation    | Sequential, left-to-right | Optimization / search |
| Flexibility   | Fixed factorization       | Any structure         |

Energy–Probability Duality

The Boltzmann Distribution

From the maximum entropy principle: among all distributions consistent with a given average energy, the Boltzmann distribution has the maximum entropy.

Key insight: probability ratios are energy differences → independent energy functions combine additively

Temperature as Normalizer

Temperature controls how peaked vs. flat the probability distribution is.

  • Low T → sharp peaks (deterministic)
  • High T → flat (exploratory)

Boltzmann probability p(x) = exp(−E(x)/T) / Z, where the partition function Z = Σₓ exp(−E(x)/T) normalizes over the energy landscape E(x) (low energy = preferred states).
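The duality is easy to check on a tiny discrete state space; a minimal sketch (the three-state energies are made up for illustration):

```python
import math

def boltzmann(energies, T=1.0):
    """Convert an energy per state into Boltzmann probabilities at temperature T."""
    weights = [math.exp(-e / T) for e in energies]
    Z = sum(weights)                      # partition function: sum over all states
    return [w / Z for w in weights]

energies = [0.0, 1.0, 2.0]               # toy 3-state energy landscape
p_low  = boltzmann(energies, T=0.1)      # low T: mass collapses onto the minimum
p_high = boltzmann(energies, T=100.0)    # high T: nearly uniform
```

Note how probability ratios depend only on energy differences: boltzmann([0, 1], T=1) gives p[0]/p[1] = exp(1) regardless of any constant added to both energies.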

Connection to Transformer Decoding

When a transformer produces logits z over the vocabulary and you compute softmax(z/T)_i = exp(z_i/T) / Σ_j exp(z_j/T), you are treating the negative logits as an energy function over vocabulary items and computing the Boltzmann distribution at temperature T.

"I set temperature to 0.3" → more deterministic output
"I set temperature to 1.5" → more creative output

They are adjusting the Boltzmann temperature of an energy-based model.
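This reinterpretation is a few lines of code: treating negative logits as energies and applying the Boltzmann formula recovers ordinary temperature-scaled softmax (the logits below are hypothetical):

```python
import math

def softmax_T(logits, T):
    """Boltzmann distribution with E = -logit: p_i ∝ exp(logit_i / T)."""
    m = max(logits)                                  # subtract max for stability
    exps = [math.exp((l - m) / T) for l in logits]
    Z = sum(exps)                                    # partition function
    return [e / Z for e in exps]

logits = [2.0, 1.0, 0.5]                             # hypothetical 3-token vocab
cold = softmax_T(logits, 0.3)                        # near-deterministic
hot  = softmax_T(logits, 1.5)                        # flatter, more "creative"
```

The argmax token never changes; only how much probability mass concentrates on it does.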

The Partition Function Problem

Sum (or integral) over all possible configurations X

🖼️ Images (256×256×3, 8-bit)

256^(256×256×3) ≈ 10^473,000

πŸ“ Text (length 1000, vocab 50k)

50,000^1000

The intractable partition function was one reason EBMs fell out of fashion.
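The configuration counts above can be checked with log arithmetic, since the numbers overflow any float:

```python
import math

# Images: 256 intensity values per dimension, 256*256*3 dimensions
image_digits = 256 * 256 * 3 * math.log10(256)   # log10 of 256^(256*256*3)

# Text: vocab 50,000, sequence length 1000
text_digits = 1000 * math.log10(50_000)          # log10 of 50000^1000

print(f"images: ~10^{image_digits:,.0f}   text: ~10^{text_digits:,.0f}")
```

Both sums are far beyond anything enumerable, which is the whole problem with Z.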

Two-Phase Training

Training the EBM to match a target distribution by minimizing the negative log-likelihood:

L(θ) = E_{x~p_data}[E_θ(x)] + log Z(θ)

The gradient decomposes into two terms:

∇_θ L = E_{x~p_data}[∇_θ E_θ(x)] − E_{x~p_θ}[∇_θ E_θ(x)]

Positive Phase

Push energy down on real data

Negative Phase

Push energy up on model samples

Independent of the partition function Z(θ)!
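A minimal sketch of the two-phase update, using a toy 1-D energy E_θ(x) = (x − θ)² whose parameter gradient is known in closed form; the negative samples here come from a stand-in Gaussian rather than real MCMC:

```python
import random

def grad_E_wrt_theta(x, theta):
    """For the toy energy E_theta(x) = (x - theta)**2, dE/dtheta = -2*(x - theta)."""
    return -2.0 * (x - theta)

def nll_gradient(theta, data_batch, model_samples):
    """Two-phase gradient: E_data[dE/dtheta] - E_model[dE/dtheta]. No Z needed."""
    pos = sum(grad_E_wrt_theta(x, theta) for x in data_batch) / len(data_batch)
    neg = sum(grad_E_wrt_theta(x, theta) for x in model_samples) / len(model_samples)
    return pos - neg

random.seed(0)
data = [random.gauss(3.0, 1.0) for _ in range(256)]   # "real" data near x = 3
theta = 0.0
for _ in range(200):
    # stand-in negatives; in practice these come from MCMC on the current model
    negatives = [random.gauss(theta, 1.0) for _ in range(256)]
    theta -= 0.1 * nll_gradient(theta, data, negatives)
# theta drifts toward the data, so the energy minimum ends up near x = 3
```

Descending this gradient lowers energy at data points (positive phase) and raises it at model samples (negative phase), exactly the two terms above.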

Positive Phase

Push energy DOWN on real data.

The expectation is over the data distribution — picking samples from our dataset. The update changes parameters to minimize the energy at data points.

Negative Phase

Push energy UP on model samples.

The expectation is over the model distribution — sampling from the model with current parameters. The update maximizes energy at points the model currently thinks are likely.

Two-Phase Training Demo

Why Both Phases Matter

Positive phase alone

Simple gradient descent on data — tractable.

But leads to degenerate solutions: the model could assign constant energy everywhere.

Model learned "real data is good" but never learned what is bad.

Negative phase is essential

Need some notion of incorrect configurations to push energy up on.

But the partition function in the denominator is intractable in high dimensions — this is where the difficulty comes from.

Recap

An EBM is a scoring function E_θ(x, y). Low energy = good.

Probabilities require the partition function Z, which is intractable. Avoid computing it.

Training requires both pushing energy down on data AND pushing energy up on wrong answers.

Up Next

How to train your EBM? — Contrastive divergence, score matching, noise-contrastive estimation, and more.

Part 2

How to Train
Your EBM

Based on Song & Kingma, "How to Train Your Energy-Based Models" (2021)

Song, Y. & Kingma, D.P. "How to Train Your Energy-Based Models." arXiv:2101.03288, 2021.

The core challenge

We derived the gradient of the negative log-likelihood:

∇_θ L = E_{x~p_data}[∇_θ E_θ(x)] − E_{x~p_θ}[∇_θ E_θ(x)]

But the negative phase requires sampling from:

p_θ(x) = exp(−E_θ(x)) / Z(θ)

where Z(θ) is intractable for large dimensions. So how do we handle this?

Three approaches

1. Markov Chain Monte Carlo

Generate negative samples by running a Markov chain

2. Score Matching

Avoid sampling altogether β€” match the gradient of the log-density

3. Noise Contrastive Estimation

Cast density estimation as binary classification (data vs. noise)

Approach 1

Markov Chain
Monte Carlo

MCMC: The Idea

Can't sample from p_θ(x) exactly — the partition function blocks us.

But we can approximately sample by running a Markov chain that converges to p_θ in the limit.

💡 Key insight

Run an MCMC sampler to generate approximate negative samples x⁻, then plug them into the gradient equation as if they were exact samples from p_θ.

The score ∇_x log p_θ(x) = −∇_x E_θ(x) doesn't need Z — so gradient-based MCMC is feasible!

Langevin Dynamics

Starting from a random point, iteratively follow the energy gradient with noise:

x_{k+1} = x_k − (ε/2) ∇_x E_θ(x_k) + √ε · z_k,   z_k ~ N(0, I)

When ε → 0 and K → ∞, x_K is guaranteed to be distributed as p_θ(x) under regularity conditions.
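The update rule is a few lines of code; here it samples a toy 1-D double-well energy E(x) = (x² − 1)², with step size and chain length chosen arbitrarily:

```python
import math, random

def grad_E(x):
    """Gradient of the double-well energy E(x) = (x**2 - 1)**2."""
    return 4.0 * x * (x * x - 1.0)

def langevin_sample(steps=2000, eps=0.01, seed=0):
    """x_{k+1} = x_k - (eps/2)*grad_E(x_k) + sqrt(eps)*z_k,  z_k ~ N(0, 1)."""
    rng = random.Random(seed)
    x = rng.gauss(0.0, 2.0)                       # random initialization
    for _ in range(steps):
        x += -0.5 * eps * grad_E(x) + math.sqrt(eps) * rng.gauss(0.0, 1.0)
    return x

samples = [langevin_sample(seed=s) for s in range(100)]
# samples concentrate near the two low-energy modes at x = -1 and x = +1
```

The double well also previews the mixing problem discussed next: a chain started in one well rarely crosses the energy barrier to the other.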

MCMC + Gradient Update

Once we have Langevin samples, plug them into the two-phase gradient:

∇_θ L ≈ (1/N) Σᵢ ∇_θ E_θ(xᵢ) − (1/M) Σⱼ ∇_θ E_θ(x⁻ⱼ)

xᵢ: data samples (minibatch)   ·   x⁻ⱼ: Langevin samples (MCMC)

The Problem with MCMC

Langevin dynamics can take a very long time to converge, especially in high-dimensional spaces with multiple modes.

If each sample needs 10,000 Langevin steps, and we need fresh samples for every gradient update during training, the whole procedure becomes prohibitively slow.

The mixing problem

If the energy landscape has widely separated modes with high-energy barriers, the chain gets trapped in one mode. The negative samples only represent that region, leaving other modes untouched. This gets worse in high dimensions.

Hack: Contrastive Divergence

Hinton, 2002

Don't start the chain from a random point β€” start from a data point.

CD-1: One MCMC step from data

Very biased, doesn't represent true MLE — but works surprisingly well in practice.

Persistent CD (Tieleman 2008)

Don't reset the chain between updates β€” carry over the state. Works because model parameters change slowly between updates.

Replay Buffer (Du & Mordatch 2019)

Keep historical MCMC states in a buffer, randomly sample to initialize new chains.

Hinton, G.E. "Training Products of Experts by Minimizing Contrastive Divergence." Neural Computation, 2002. · Tieleman, T. "Training RBMs Using Approximations to the Likelihood Gradient." ICML, 2008.
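A sketch of the replay-buffer variant; the buffer capacity, reinit probability, and chain length are assumptions for illustration, not values from the paper:

```python
import random

class ReplayBuffer:
    """Keep past MCMC states; init most new chains from the buffer (Du & Mordatch style)."""
    def __init__(self, capacity=10_000, reinit_prob=0.05):
        self.storage, self.capacity, self.reinit_prob = [], capacity, reinit_prob

    def init_chain(self, rng):
        # Occasionally restart from noise so the buffer doesn't ossify
        if not self.storage or rng.random() < self.reinit_prob:
            return rng.gauss(0.0, 2.0)
        return rng.choice(self.storage)

    def store(self, x):
        self.storage.append(x)
        if len(self.storage) > self.capacity:
            self.storage.pop(0)

def short_chain(x, grad_E, rng, steps=20, eps=0.01):
    """A few Langevin steps suffice when chains persist across updates."""
    for _ in range(steps):
        x += -0.5 * eps * grad_E(x) + (eps ** 0.5) * rng.gauss(0.0, 1.0)
    return x

rng = random.Random(0)
buf = ReplayBuffer()
grad_E = lambda x: 4.0 * x * (x * x - 1.0)   # same double-well toy energy
for _ in range(100):                          # one short chain per "gradient update"
    x = short_chain(buf.init_chain(rng), grad_E, rng)
    buf.store(x)
```

Because parameters change slowly, a chain resumed from the buffer is already near the model distribution, so 20 steps do the work of thousands.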

Approach 2

Score Matching

Reformulate learning to avoid sampling altogether

The Score Function

The score of a distribution is the gradient of the log-density with respect to the input:

s(x) = ∇_x log p(x)

For an EBM with p_θ(x) = exp(−E_θ(x)) / Z_θ:

s_θ(x) = ∇_x log p_θ(x) = −∇_x E_θ(x)

log Z_θ vanishes! — because Z_θ is a constant with respect to x. The score only depends on the energy function, not the intractable partition function.

Hyvärinen, A. "Estimation of Non-Normalized Statistical Models by Score Matching." JMLR, 2005.

Why Scores Are Enough

If two continuously differentiable log-PDFs have equal first derivatives everywhere, and both integrate to 1, they must be the same distribution.

So we can learn the right distribution by matching scores rather than matching probabilities:

s_θ(x) ≈ ∇_x log p_data(x)

No partition function, no sampling, no MCMC — just make the model's score look like the data's score.

Fisher Divergence

Formally, minimize the Fisher divergence between model and data scores:

D_F = ½ E_{x~p_data}[ ‖ ∇_x log p_data(x) − s_θ(x) ‖² ]

Expanding the squared norm:

½ E[‖s_θ(x)‖²] − E[ s_θ(x) · ∇_x log p_data(x) ] + const

Problem: the cross-term contains ∇_x log p_data(x) — we don't know p_data, only samples from it!

Integration by Parts Trick

Hyvärinen (2005) showed the Fisher divergence can be rewritten using only the model's score and its Jacobian:

D_F = E_{x~p_data}[ tr(∇_x s_θ(x)) + ½ ‖s_θ(x)‖² ] + const

We only need the model's score and its Jacobian (the Hessian of E_θ).

But: the trace requires second-order derivatives — O(d) backward passes for dimensionality d. Computationally infeasible for high dimensions.

Denoising Score Matching

Vincent, 2011

Instead of score-matching clean data, score-match data perturbed with Gaussian noise x̃ = x + σz:

The score of the noise kernel has a simple closed form, ∇_x̃ log q(x̃ | x) = −(x̃ − x)/σ², so the DSM objective becomes:

E_{x, x̃}[ ‖ s_θ(x̃) + (x̃ − x)/σ² ‖² ]

No second-order derivatives, no unknown p_data — just a regression problem: predict the noise direction from the noisy input.

Vincent, P. "A Connection Between Score Matching and Denoising Autoencoders." Neural Computation, 2011.
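The regression view can be demonstrated end to end on 1-D Gaussian data with a linear score model (all choices here are toy assumptions); for data ~ N(0, 1) the optimal slope is −1/(1 + σ²):

```python
import random

random.seed(0)
sigma = 0.5
data = [random.gauss(0.0, 1.0) for _ in range(5000)]
pairs = [(x, x + sigma * random.gauss(0.0, 1.0)) for x in data]   # (clean, noisy)

w, b, lr = 0.0, 0.0, 0.05
for _ in range(300):                       # gradient descent on the DSM objective
    gw = gb = 0.0
    for x, xn in pairs:
        target = -(xn - x) / sigma ** 2    # known score of the Gaussian noise kernel
        err = (w * xn + b) - target        # linear score model s(x) = w*x + b
        gw += err * xn
        gb += err
    w -= lr * gw / len(pairs)
    b -= lr * gb / len(pairs)
# w approaches -1/(1 + sigma**2) = -0.8, the true score slope of the noisy data
```

Note the model learns the score of the *noisy* distribution N(0, 1 + σ²), which is exactly the "learns noisy dist." limitation in the comparison table later.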

Score Matching β†’ Diffusion Models

If we perform denoising score matching at many noise levels — from pure noise down to near-zero noise — we get a multi-scale score model.

Sample by starting from noise and gradually denoising via Langevin dynamics at decreasing noise scales (annealed Langevin dynamics).

The punchline

Denoising score matching ≡ denoising diffusion probabilistic models, just viewed through different lenses (score vs. probability).

Diffusion models are EBMs trained via denoising score matching.

Song & Ermon 2019; Ho et al. 2020; Song et al. 2021

Song, Y. & Ermon, S. "Generative Modeling by Estimating Gradients of the Data Distribution." NeurIPS, 2019. · Ho, J. et al. "Denoising Diffusion Probabilistic Models." NeurIPS, 2020.

Approach 3

Noise Contrastive
Estimation

Cast density estimation as binary classification

NCE: The Idea

No more trying to sample from p_θ or match its score.

Instead, train a binary classifier to tell whether a sample came from the data distribution or a known noise distribution.

Any classifier that can answer this optimally can implicitly recover the data density — this is theoretically proven (Gutmann & Hyvärinen, 2010).

Gutmann, M.U. & Hyvärinen, A. "Noise-Contrastive Estimation: A New Estimation Principle for Unnormalized Statistical Models." AISTATS, 2010.

Unique advantage: NCE learns Z as a by-product — the only method among the three that does.

NCE: Two Sources

1. Data distribution

Can sample from training set.

2. Noise distribution

Could be Gaussian, uniform, or… the output of a pretrained autoregressive model 🧐

(spoiler alert)

Must be able to both sample from q and evaluate its density at any point.

Mix them together: draw x from either source with equal probability, then ask the classifier: "which source did this come from?"

NCE: The Objective

At optimality, the classifier's posterior matches the true posterior:

p(data | x) = p_data(x) / (p_data(x) + q(x))

This gives the standard NCE loss — a binary cross-entropy:

L(θ) = −E_{x~p_data}[log D_θ(x)] − E_{x~q}[log(1 − D_θ(x))],   where D_θ(x) = p_θ(x) / (p_θ(x) + q(x))

When the classifier is optimal, p_θ = p_data — the model has recovered the data distribution, and Z_θ is learned as a free parameter.
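A toy NCE fit for a 1-D Gaussian energy E_θ(x) = (x − μ)²/2, with c standing in for −log Z_θ as a free parameter; the noise distribution, sample sizes, and learning rate are all arbitrary choices:

```python
import math, random

def log_q(x):
    """Known noise density q = N(0, 2^2) -- we can both sample it and evaluate it."""
    return -x * x / 8.0 - math.log(2.0 * math.sqrt(2.0 * math.pi))

random.seed(0)
data  = [random.gauss(1.5, 1.0) for _ in range(2000)]   # true distribution N(1.5, 1)
noise = [random.gauss(0.0, 2.0) for _ in range(2000)]
labeled = [(x, 1) for x in data] + [(x, 0) for x in noise]

mu, c, lr = 0.0, 0.0, 0.1    # c plays the role of -log Z, learned freely
for _ in range(400):
    g_mu = g_c = 0.0
    for x, y in labeled:
        logit = (-0.5 * (x - mu) ** 2 + c) - log_q(x)   # log p_theta(x) - log q(x)
        p = 1.0 / (1.0 + math.exp(-logit))              # classifier posterior
        g_mu += (p - y) * (x - mu)                      # chain rule through the logit
        g_c  += (p - y)
    mu -= lr * g_mu / len(labeled)
    c  -= lr * g_c / len(labeled)
# mu approaches 1.5, and exp(c) approximates 1/Z = 1/sqrt(2*pi)
```

The classifier never sees Z directly; the normalizer is absorbed into the learned offset c.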

Why Hard Negatives Matter

The closer the noise distribution is to the data distribution, the richer the representations the model must learn to tell them apart.

🎯

Easy noise (e.g. uniform)

Trivial to classify → model learns superficial features

🔥

Hard noise (e.g. pretrained LM)

Hard to classify → model must learn deep structure

This principle connects to contrastive learning broadly — SimCLR, CLIP, and ELECTRA all benefit from harder negatives.

Comparing Training Paradigms

| Method         | Avoids Z via          | MCMC? | Learns Z? | Key limitation         |
|----------------|-----------------------|-------|-----------|------------------------|
| MLE + CD       | Short MCMC chain      | Yes   | No        | Mixing time / bias     |
| Score Matching | Matches gradients     | No    | No        | Needs Hessian trace    |
| Denoising SM   | Noise perturbation    | No    | No        | Learns noisy dist.     |
| NCE            | Binary classification | No    | Yes       | Needs good noise dist. |

Part 2 Recap

MCMC: Sample negatives via Langevin dynamics. Practical with contrastive divergence, but biased and struggles with high-dimensional multimodal landscapes.

Score Matching: Match ∇ₓ log p, bypassing Z entirely. Denoising variant avoids expensive Hessians and connects directly to diffusion models.

NCE: Learn density by classifying data vs. noise. The only method that learns Z. Quality depends heavily on the noise distribution.

Key theme

All three methods find creative ways to avoid the intractable partition function — MCMC sidesteps it with sampling, SM removes it via calculus, NCE absorbs it into a classifier.

Part 3

EBMs 🤝 LLMs

Your Classifier Is Secretly an EBM

Grathwohl et al., JEM, 2020

A standard classifier with softmax:

We can reinterpret the logits as a joint energy function:

By marginalizing out y, we get an energy over inputs:

There is a hidden energy-based generative model inside every discriminative classifier.

Grathwohl, W. et al. "Your Classifier is Secretly an Energy Based Model and You Should Treat it Like One." ICLR, 2020. arXiv:1912.03263
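The reinterpretation is essentially one logsumexp; the logits below are made-up examples:

```python
import math

def logsumexp(vals):
    m = max(vals)
    return m + math.log(sum(math.exp(v - m) for v in vals))

def class_probs(logits):
    """Standard softmax classifier p(y|x)."""
    z = logsumexp(logits)
    return [math.exp(l - z) for l in logits]

def input_energy(logits):
    """JEM view: marginalizing E(x, y) = -f(x)[y] gives E(x) = -logsumexp(f(x))."""
    return -logsumexp(logits)

confident = [8.0, 0.0, 0.0]     # hypothetical logits on an in-distribution input
unsure    = [0.1, 0.0, 0.0]     # hypothetical logits on an ambiguous input
# the confident input gets lower energy, i.e. higher unnormalized density p(x)
```

This is why the same logits can do classification and OOD detection at once: softmax uses their differences, the energy uses their overall magnitude.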

Training the Joint EBM

Decompose the joint log-probability:

log p_θ(x, y) = log p_θ(x) + log p_θ(y|x)

log p_θ(x) — generative term

Trained via SGLD (Langevin dynamics)

log p_θ(y|x) — discriminative term

Standard cross-entropy loss

Results: better classification and better generation than single-purpose models (same architecture), plus better calibration, OOD detection, and adversarial robustness.

ELECTRA

Pre-Training Text Encoders as Discriminators

A small generator (MLM) fills in masked tokens. A larger discriminator classifies each token as original or replaced.

Why it's better than BERT MLM

  • Dense signal over all tokens (not just the ~15% masked)
  • No [MASK] token mismatch between pre-training and fine-tuning
  • 3–7× sample efficiency over BERT

The EBM connection

ELECTRA's discriminator is doing negative sampling — distinguishing real tokens (data) from fake tokens (generator). This is NCE-style training with the generator as the noise distribution!

Clark, K. et al. "ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators." ICLR, 2020. arXiv:2003.10555

Electric: Energy-Based Cloze Models

Clark et al., 2020

Electric takes ELECTRA's insight further — from negative sampling to proper noise contrastive estimation:

ELECTRA

Discriminator classifies real vs. fake tokens. Ignores the noise distribution.

Electric

Given fake tokens (noise), learns to model data tokens via NCE. Produces calibrated per-token pseudo-probabilities.

Electric outputs p_θ(x_t | x_{−t}) — an energy-based cloze probability for each token given its surrounding context. No autoregressive factorization needed.

Clark, K. et al. "Pre-Training Transformers as Energy-Based Cloze Models." EMNLP, 2020. arXiv:2012.08561

Residual EBMs for Text

Deng et al., 2020

Take a pretrained autoregressive LM and multiply its distribution by an energy correction from a bidirectional model:

p_θ(x) ∝ p_AR(x) · exp(−E_θ(x))

🔄

AR model handles fluency and local coherence (token-by-token)

🌐

Bidirectional energy provides global, sequence-level quality correction

💡 Energy-based models as "verifiers" — the EBM doesn't generate, it scores and corrects. Trained end-to-end via NCE.

Deng, Y. et al. "Residual Energy-Based Models for Text Generation." ICLR, 2020. arXiv:2004.11714
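At decode time the residual formulation amounts to reranking with the score log p_AR(x) − E_θ(x); the candidates and numbers below are hypothetical:

```python
def rerank(candidates):
    """Residual EBM scoring: log p_AR(x) - E(x). The EBM corrects, never generates."""
    return max(candidates, key=lambda c: c["log_p_ar"] - c["energy"])

candidates = [
    # locally fluent but globally incoherent: the AR model likes it, the EBM penalizes it
    {"text": "the cat the cat sat",  "log_p_ar": -8.0, "energy": 4.0},
    # slightly less likely under the AR model, but globally coherent
    {"text": "the cat sat quietly", "log_p_ar": -9.0, "energy": 0.5},
]
best = rerank(candidates)   # -> "the cat sat quietly" (-9.5 beats -12.0)
```

The division of labor is visible in the scores: the AR term handles likelihood, the energy term vetoes sequence-level pathologies.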

EDLM: The Problem with Parallel Text Generation

Zhao et al., 2025

Discrete diffusion models for text unmask tokens in parallel, but predict each token independently:

p_θ(x_0 | x_t) = Π_i p_θ(x_0^i | x_t)

It's like having 10 people each independently write one word of a sentence — each word is reasonable given the context, but together the sentence may be incoherent.

The more parallel the model tries to be (fewer denoising steps), the worse this factorization error gets. This is the fundamental gap between diffusion and AR quality for language.

Ye, J. et al. "Energy-Based Diffusion Language Models for Text Generation." arXiv:2410.21357, 2024.

EDLM: Residual Energy Correction

Apply the residual EBM idea at every denoising step:

p_θ(x_0 | x_t) ∝ p_diffusion(x_0 | x_t) · exp(−E_θ(x_0, x_t))

Two ways to get the energy function:

EDLM-AR

Plug in a pretrained AR model as the energy. Turns into parallel sampling from an AR model via importance weighting. One AR forward pass scores a complete sequence.

EDLM-NCE

Train a small energy head via NCE on top of the diffusion model. Positives = clean data, negatives = diffusion model's own predictions.

EDLM: Results

Inference: draw k candidates from diffusion model, score with energy, resample via importance weights.
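The draw-score-resample step can be sketched with self-normalized importance weights (the candidate strings and energies are toy values):

```python
import math, random

def resample(candidates, energies, rng):
    """Self-normalized importance sampling: weight w_i ∝ exp(-E_i),
    then resample one candidate in proportion to the weights."""
    m = min(energies)
    weights = [math.exp(-(e - m)) for e in energies]   # stabilized exponentials
    return rng.choices(candidates, weights=weights, k=1)[0]

rng = random.Random(0)
cands = ["draft A", "draft B", "draft C"]      # k = 3 draws from the diffusion model
energies = [5.0, 0.2, 4.0]                     # energy head scores (hypothetical)
picked = [resample(cands, energies, rng) for _ in range(1000)]
# "draft B", the lowest-energy candidate, dominates the resampled set
```

Resampling (rather than taking the argmin) keeps the procedure a proper sampler from the corrected distribution, not a greedy reranker.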

Key finding

Correction matters most in early denoising steps (high masking ratio → worst factorization error). Applying importance sampling only for t ∈ [0.8, 1.0] captures most of the benefit at a fraction of the cost.

49%

generative perplexity improvement

1.3Γ—

sampling speedup at same quality

400k

fine-tuning steps (NCE variant)

First diffusion model to seriously challenge autoregressive quality on language.

Energy-Based Transformers

Gladstone et al., 2025 — A radically different paradigm

Unlike EDLM, which adds an energy correction, EBT makes the entire model an energy function:

E_θ(x, y) = scalar compatibility of context x and candidate prediction y

Inference = gradient descent from random noise to a converged prediction:

y_{k+1} = y_k − α ∇_y E_θ(x, y_k)

Each gradient step is one unit of thinking. The model is simultaneously a generator (via energy minimization) and a verifier (via the energy scalar) — unified in a single model.

Gladstone, A. et al. "Energy-Based Transformers are Scalable Learners and Thinkers." arXiv:2507.02092, 2025.
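A toy version of the thinking loop, with a simple quadratic standing in for the learned transformer energy:

```python
import random

def energy(x, y):
    """Toy stand-in for a learned EBT energy: minimized when y matches 2*x."""
    return (y - 2.0 * x) ** 2

def grad_y(x, y):
    return 2.0 * (y - 2.0 * x)

def think(x, steps=50, alpha=0.1, seed=0):
    """Each gradient step on y is one unit of 'thinking'; the final energy
    doubles as a confidence score (lower = more certain)."""
    rng = random.Random(seed)
    y = rng.gauss(0.0, 1.0)                      # start the prediction from noise
    for _ in range(steps):
        y -= alpha * grad_y(x, y)
    return y, energy(x, y)

y_hat, e = think(3.0)     # y_hat converges toward 6.0, energy toward 0
```

Varying `steps` is exactly the dynamic-compute knob: think(3.0, steps=5) stops at a higher energy than the 50-step run, so the energy scalar reports how finished the "thought" is.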

Three Facets of System 2 Thinking

| Architecture           | Dynamic Compute | Uncertainty | Verification |
|------------------------|-----------------|-------------|--------------|
| FF Transformers        | ✗               | ✗           | ✗            |
| Diffusion Transformers | ✓               | ✗           | ✗            |
| EBTs                   | ✓               | ✓           | ✓            |

🎯 Dynamic Compute

Iterate more on harder predictions. Same compute for "the" as "serendipitous"? Not anymore.

📊 Uncertainty

Energy at convergence directly quantifies confidence. Easy tokens → low energy. Hard tokens → high energy.

✅ Verification

Best-of-N sampling without a separate reward model. The energy IS the verifier.

All three emerge from unsupervised pretraining — no RL, no verifiable rewards.

EBT: Scaling Results

35%

faster scaling on data

28%

faster on batch size

57%

faster on depth

29%

inference improvement via thinking

First architecture to out-scale Transformer++ across multiple axes simultaneously (data, batch size, parameters, FLOPs, depth).

The P vs NP intuition

Despite slightly worse pretraining perplexity, EBTs beat Transformer++ on downstream tasks. Verification generalizes better than generation — learning to score is easier than learning to produce.

Thinking at Inference Time

EBTs can improve by using more forward passes — Transformer++ cannot.

OOD thinking boost

As data becomes more out-of-distribution, thinking helps more — a roughly linear relationship. Just as humans engage deliberate reasoning for unfamiliar problems.

Thinking scales with training

As the model sees more data, the benefit from self-verification increases from 4–8% → 10–14%. Extrapolation to Llama-3 scale suggests potentially massive gains.

Current limitations

Experiments up to 800M parameters. Training overhead 3.3–6.6×. Struggles with multi-modal distributions (merges nearby modes). Open question: do the advantages persist at GPT-4 scale?

The Spectrum of Energy-Based LLMs

Conservative: EBM as correction layer

JEM, Electric, Residual EBMs, EDLM — keep existing models, add energy-based scoring on top. Practical, already works at scale. The EBM captures what the base model misses.

Radical: EBM as the whole model

EBTs — the transformer IS the energy function. Theoretically cleaner, remarkable scaling properties, built-in verification. Not yet demonstrated at frontier scale.

The common thread: both approaches show that scoring complete configurations holistically — the energy-based way — improves generation quality, provides built-in verification, and enables test-time compute scaling.

Thank you!

Energy-based models: from Boltzmann machines to modern transformers

Shashank Shekhar  ·  Toronto LLM Meetup  ·  March 17, 2026

Key references

Song & Kingma 2021 · Grathwohl et al. 2020 · Deng et al. 2020

Zhao et al. 2025 (EDLM) · Gladstone et al. 2025 (EBT)