From Boltzmann machines to modern transformers
We just don't see them that way (yet)
127× smaller than typical reward models, yet boosts Llama 3 to 90.7% on GSM8k and 63.7% on MATH
Jiang et al., Learning to Rank Chain-of-Thought, 2025
Better data efficiency than standard autoregressive transformers, tested up to 120B tokens
Gladstone et al, Energy-Based Transformers are Scalable Learners and Thinkers, 2025
The optimal policy in RLHF is literally the Boltzmann distribution over reward-weighted outputs
Part 1
Given an input x and a candidate answer y, the energy function returns "how incompatible are they?"
High energy → incompatible
Low energy → compatible
"Probability comes at a high price, and should be avoided when the application does not require it" – Yann LeCun
Find Y that minimizes E
The Boltzmann Distribution
From the maximum entropy principle: among all distributions consistent with a given average energy, the Boltzmann distribution has the maximum entropy.
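Writing the distribution out explicitly (a standard statement of the maximum-entropy result above, with T the temperature):

```latex
p(x) = \frac{e^{-E(x)/T}}{Z}, \qquad Z = \sum_{x'} e^{-E(x')/T}
```

Lower energy means higher probability, and T sets how sharply the distribution concentrates on low-energy configurations.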
Key insight: probability ratios are energy differences → independent energy functions combine additively
Temperature controls how peaked vs. flat the probability distribution is.
When a transformer produces logits over the vocabulary and you compute softmax(logits / T), you are treating the negative logits as an energy function over vocabulary items and computing the Boltzmann distribution at temperature T.
"I set temperature to 0.3" → more deterministic output
"I set temperature to 1.5" → more creative output
They are adjusting the Boltzmann temperature of an energy-based model.
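A few lines make the correspondence concrete; the logits below are arbitrary illustrative numbers standing in for a model's next-token scores:

```python
import math

def boltzmann(logits, T=1.0):
    """Softmax with temperature = Boltzmann distribution with E = -logit."""
    weights = [math.exp(l / T) for l in logits]
    Z = sum(weights)                     # partition function over the vocabulary
    return [w / Z for w in weights]

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

logits = [2.0, 1.0, 0.5, -1.0]           # hypothetical next-token logits
cold = boltzmann(logits, T=0.3)          # peaked: "more deterministic"
hot  = boltzmann(logits, T=1.5)          # flat: "more creative"
```

Lowering T concentrates mass on the lowest-energy (highest-logit) token; raising T spreads it out, which is exactly what the entropy of the two distributions shows.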
Sum (or integral) over all possible configurations X
256^(256×256×3) ≈ 10^473,000
50,000^1000
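A quick back-of-the-envelope check of why these sums are hopeless, computed in log space since the counts themselves overflow any float:

```python
import math

# Configurations of a 256x256 RGB image with 8-bit channels: 256^(256*256*3)
n_channels = 256 * 256 * 3
log10_image_states = n_channels * math.log10(256)   # exponent of 10

# Configurations of a 1000-token text over a 50,000-word vocabulary: 50000^1000
log10_text_states = 1000 * math.log10(50_000)

print(f"images: 10^{log10_image_states:,.0f}")      # roughly 10^473,000
print(f"text:   10^{log10_text_states:,.0f}")       # roughly 10^4,700
```

For comparison, the observable universe has around 10^80 atoms; no sampling-free summation over Z is ever going to happen at this scale.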
The intractable partition function was one reason EBMs fell out of fashion.
Training the EBM to match a target distribution by minimizing the negative log-likelihood:
The gradient decomposes into two terms:
Push energy down on real data
Push energy up on model samples
Independent of the partition function Z(θ)!
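Written out, the two phases come from differentiating the NLL; the ∇_θ log Z(θ) term turns into an expectation under the model, so Z(θ) itself never needs to be computed:

```latex
\nabla_\theta \mathcal{L}(\theta)
  = \mathbb{E}_{x \sim p_{\mathrm{data}}}\!\left[\nabla_\theta E_\theta(x)\right]
  - \mathbb{E}_{x \sim p_\theta}\!\left[\nabla_\theta E_\theta(x)\right]
```

The first term pushes energy down on real data; the second pushes it up on model samples.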
Push energy DOWN on real data.
The expectation is over the data distribution: we pick samples from our dataset. The update changes parameters to minimize the energy at data points.
Push energy UP on model samples.
The expectation is over the model distribution: we sample from the model with current parameters. The update maximizes energy at points the model currently thinks are likely.
Simple gradient descent on data → tractable.
But leads to degenerate solutions: the model could assign constant energy everywhere.
Model learned "real data is good" but never learned what is bad.
Need some notion of incorrect configurations to push energy up on.
But the denominator Z(θ) is intractable in high dimensions; this is where the difficulty comes from.
An EBM is a scoring function E_θ(x). Low energy = good.
Probabilities require the partition function Z, which is intractable. Avoid computing it.
Training requires both pushing energy down on data AND pushing energy up on wrong answers.
How to train your EBM? → Contrastive divergence, score matching, noise-contrastive estimation, and more.
Part 2
Based on Song & Kingma, "How to Train Your Energy-Based Models" (2021)
We derived the gradient of the negative log-likelihood:
∇_θ L(θ) = E_{x∼p_data}[∇_θ E_θ(x)] − E_{x∼p_θ}[∇_θ E_θ(x)]
But the negative phase requires sampling from:
p_θ(x) = exp(−E_θ(x)) / Z(θ)
where Z(θ) is intractable for large dimensions. So how do we handle this?
Generate negative samples by running a Markov chain
Avoid sampling altogether β match the gradient of the log-density
Cast density estimation as binary classification (data vs. noise)
Approach 1
Can't sample from p_θ(x) exactly: the partition function blocks us.
But we can approximately sample by running a Markov chain that converges to p_θ in the limit.
Run an MCMC sampler to generate approximate negative samples x⁻, then plug them into the gradient equation as if they were exact samples from p_θ.
The score ∇_x log p_θ(x) = −∇_x E_θ(x) doesn't need Z, so gradient-based MCMC is feasible!
Starting from a random point, iteratively follow the energy gradient with noise:
x_{k+1} = x_k − (ε/2) ∇_x E_θ(x_k) + √ε z_k,  z_k ∼ N(0, I)
When ε → 0 and K → ∞, x_K is guaranteed to distribute as p_θ under regularity conditions.
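A minimal sketch of this sampler on a toy 1-D energy E(x) = (x − 3)²/2, whose Boltzmann distribution is N(3, 1). The energy, step size, and chain length are illustrative choices, not from the talk:

```python
import math
import random

random.seed(0)

def grad_E(x):
    # Energy E(x) = 0.5 * (x - 3)**2, so exact samples would follow N(3, 1)
    return x - 3.0

def langevin_sample(steps=2000, eps=0.01):
    """One chain of (unadjusted) Langevin dynamics from a random start."""
    x = random.gauss(0.0, 1.0)
    for _ in range(steps):
        # Follow the negative energy gradient, plus Gaussian noise
        x += -0.5 * eps * grad_E(x) + math.sqrt(eps) * random.gauss(0.0, 1.0)
    return x

samples = [langevin_sample() for _ in range(200)]
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
```

The sample mean and variance land near (3, 1); smaller ε and more steps tighten the approximation at the cost of compute.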
Once we have Langevin samples, plug them into the two-phase gradient:
Data samples (minibatch)
Langevin samples (MCMC)
Langevin dynamics can take a very long time to converge, especially in high-dimensional spaces with multiple modes.
If we need 10,000 Langevin steps per sample, and we need samples for every gradient update during training, the whole thing is prohibitively slow.
If the energy landscape has widely separated modes with high-energy barriers, the chain gets trapped in one mode. The negative samples only represent that region, leaving other modes untouched. This gets worse in high dimensions.
Hinton, 2002
Don't start the chain from a random point β start from a data point.
Very biased, doesn't represent true MLE, but works surprisingly well in practice.
Don't reset the chain between updates β carry over the state. Works because model parameters change slowly between updates.
Keep historical MCMC states in a buffer, randomly sample to initialize new chains.
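A toy sketch of persistent chains with a buffer: fitting the single parameter μ of the energy E_μ(x) = (x − μ)²/2 to data drawn from N(2, 1). All constants and names here are illustrative, not from contrastive-divergence papers:

```python
import math
import random

random.seed(0)

data = [random.gauss(2.0, 1.0) for _ in range(512)]    # "real" data
mu = 0.0                                               # energy parameter: E(x) = 0.5*(x - mu)^2
buffer = [random.gauss(0.0, 1.0) for _ in range(64)]   # persistent MCMC states

eps, lr = 0.05, 0.1
for step in range(300):
    # Short Langevin runs, warm-started from the buffer (persistent chains)
    for i in range(len(buffer)):
        x = buffer[i]
        for _ in range(20):
            x += -0.5 * eps * (x - mu) + math.sqrt(eps) * random.gauss(0.0, 1.0)
        buffer[i] = x
    # Two-phase gradient: dE/dmu = -(x - mu), so the update reduces to a
    # difference of means (push down on data, push up on negatives)
    pos = sum(data) / len(data)
    neg = sum(buffer) / len(buffer)
    mu += lr * (pos - neg)
```

Because chains carry over between updates, 20 Langevin steps per iteration are enough; μ converges near the data mean of 2.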
Approach 2
Reformulate learning to avoid sampling altogether
The score of a distribution is the gradient of the log-density with respect to the input: s(x) = ∇_x log p(x).
For an EBM with p_θ(x) = exp(−E_θ(x)) / Z_θ:
s_θ(x) = ∇_x log p_θ(x) = −∇_x E_θ(x) − ∇_x log Z_θ = −∇_x E_θ(x)
log Z_θ vanishes! Because Z_θ is a constant with respect to x, the score only depends on the energy function, not the intractable partition function.
If two continuously differentiable log-PDFs have equal first derivatives everywhere, and both integrate to 1, they must be the same distribution.
So we can learn the right distribution by matching scores rather than matching probabilities:
No partition function, no sampling, no MCMC: just make the model's score look like the data's score.
Formally, minimize the Fisher divergence between model and data scores:
D_F = ½ E_{x∼p_data} ‖∇_x log p_data(x) − ∇_x log p_θ(x)‖²
Expanding the squared norm:
½ E‖∇_x log p_data(x)‖² − E[∇_x log p_data(x) · ∇_x log p_θ(x)] + ½ E‖∇_x log p_θ(x)‖²
Problem: the cross-term contains ∇_x log p_data(x), and we don't know p_data, only samples from it!
Hyvärinen (2005) showed the Fisher divergence can be rewritten (up to a constant) using only the model's score and its Jacobian:
J(θ) = E_{x∼p_data}[ tr(∇_x s_θ(x)) + ½ ‖s_θ(x)‖² ] + const
We only need the model's score and its Jacobian (the Hessian of E_θ).
But: the trace requires second-order derivatives, i.e. O(d) backward passes for dimensionality d. Computationally infeasible for high dimensions.
Vincent, 2011
Instead of score-matching clean data, score-match on data + noise: x̃ = x + σz, z ∼ N(0, I).
The noisy score has a simple closed form, ∇_x̃ log q_σ(x̃ | x) = −(x̃ − x)/σ², so the DSM objective becomes:
E_{x, x̃} ‖ s_θ(x̃) + (x̃ − x)/σ² ‖²
No second-order derivatives, no unknown data score: just a regression problem, predicting the noise direction from the noisy input.
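A tiny illustration of DSM as regression, under made-up settings: clean data from N(0, 1), noise scale σ = 0.5, and a linear score model s_θ(x̃) = θ·x̃. The noisy marginal is N(0, 1 + σ²), so the fitted slope should approach −1/(1 + σ²) = −0.8:

```python
import random

random.seed(0)

sigma = 0.5
pairs = []
for _ in range(10000):
    x = random.gauss(0.0, 1.0)                       # clean sample
    x_tilde = x + sigma * random.gauss(0.0, 1.0)     # noisy sample
    pairs.append((x, x_tilde))

theta = 0.0   # linear score model: s(x_tilde) = theta * x_tilde
lr = 0.05
for _ in range(150):
    # DSM loss per pair: (s(x_tilde) + (x_tilde - x) / sigma^2)^2
    g = 0.0
    for x, xt in pairs:
        g += 2.0 * (theta * xt + (xt - x) / sigma**2) * xt
    theta -= lr * g / len(pairs)
```

No density, no partition function, no second derivatives: the regression target is built entirely from the known noise that was added.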
If we perform denoising score matching at many noise levels, from pure noise down to near-zero noise, we get a multi-scale score model.
Sample by starting from noise and gradually denoising via Langevin dynamics at decreasing noise scales (annealed Langevin dynamics).
Denoising score matching ≡ denoising diffusion probabilistic models, just viewed through different lenses (score vs. probability).
Diffusion models are EBMs trained via denoising score matching.
Song & Ermon 2019; Ho et al. 2020; Song et al. 2021
Approach 3
Cast density estimation as binary classification
No more trying to sample from p_θ or match its score.
Instead, train a binary classifier to tell whether a sample came from the data distribution or a known noise distribution.
Any classifier that can answer this optimally can implicitly recover the data density; this is theoretically proven (Gutmann & Hyvärinen, 2010).
Unique advantage: NCE learns Z as a by-product, the only method among the three that does.
Can sample from training set.
Could be Gaussian, uniform, or… the output of a pretrained autoregressive model
(spoiler alert)
Must be able to both sample from q and evaluate its density at any point.
Mix them together: draw x from either source with equal probability, then ask the classifier: "which source did this come from?"
At optimality, the classifier's posterior matches the true posterior:
D_θ(x) = p_θ(x) / (p_θ(x) + q(x))  →  p_data(x) / (p_data(x) + q(x))
This gives the standard NCE loss, a binary cross-entropy:
L(θ) = −E_{x∼p_data}[log D_θ(x)] − E_{x∼q}[log(1 − D_θ(x))]
When the classifier is optimal, p_θ = p_data: the model has recovered the data distribution, and Z_θ is learned as a free parameter.
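A runnable toy version of this, with all settings invented for illustration: data from N(1, 1), noise q = N(0, 2²), and an unnormalized model log p(x) = −(x − μ)²/2 − c where c plays the role of log Z and is learned as a free parameter. At the optimum, μ ≈ 1 and c ≈ log √(2π) ≈ 0.92:

```python
import math
import random

random.seed(0)

N = 2000
data  = [random.gauss(1.0, 1.0) for _ in range(N)]   # p_data = N(1, 1)
noise = [random.gauss(0.0, 2.0) for _ in range(N)]   # q = N(0, 4): known density

def log_q(x):
    return -x * x / 8.0 - math.log(2.0 * math.sqrt(2.0 * math.pi))

mu, c = 0.0, 0.0   # model: log p(x) = -(x - mu)^2 / 2 - c, c = learned log Z
examples = [(x, True) for x in data] + [(x, False) for x in noise]

lr = 0.2
for _ in range(600):
    g_mu = g_c = 0.0
    for x, is_data in examples:
        logit = (-(x - mu) ** 2 / 2.0 - c) - log_q(x)   # log p_model - log q
        D = 1.0 / (1.0 + math.exp(-logit))              # classifier: P(data | x)
        # d(BCE)/d(logit): -(1 - D) on data examples, +D on noise examples
        gl = -(1.0 - D) if is_data else D
        g_mu += gl * (x - mu)    # d(logit)/d(mu)
        g_c  += gl * (-1.0)      # d(logit)/d(c)
    mu -= lr * g_mu / (2 * N)
    c  -= lr * g_c / (2 * N)
```

Binary classification alone recovers both the shape (μ) and the normalizer (c): no sampling from the model, no score matching.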
The closer the noise distribution is to the data distribution, the harder the classification task, and the better the representations the model must learn to distinguish them.
Trivial to classify → model learns superficial features
Hard to classify → model must learn deep structure
This principle connects to contrastive learning broadly: SimCLR, CLIP, and ELECTRA all benefit from harder negatives.
MCMC: Sample negatives via Langevin dynamics. Practical with contrastive divergence, but biased and struggles with high-dimensional multimodal landscapes.
Score Matching: Match ∇_x log p, bypassing Z entirely. Denoising variant avoids expensive Hessians and connects directly to diffusion models.
NCE: Learn density by classifying data vs. noise. The only method that learns Z. Quality depends heavily on the noise distribution.
All three methods find creative ways to avoid the intractable partition function: MCMC sidesteps it with sampling, SM removes it via calculus, NCE absorbs it into a classifier.
Part 3
Grathwohl et al., JEM, 2020
A standard classifier with softmax:
p_θ(y | x) = exp(f_θ(x)[y]) / Σ_{y'} exp(f_θ(x)[y'])
We can reinterpret the logits as a joint energy function:
E_θ(x, y) = −f_θ(x)[y]
By marginalizing out y, we get an energy over inputs:
E_θ(x) = −log Σ_y exp(f_θ(x)[y])
There is a hidden energy-based generative model inside every discriminative classifier.
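The reinterpretation fits in a few lines. The logits below are arbitrary made-up numbers; the point is that shifting all logits by a constant leaves p(y|x) unchanged but shifts E(x), which is exactly the generative information the softmax normally throws away:

```python
import math

def logsumexp(v):
    # Numerically stable log(sum(exp(v)))
    m = max(v)
    return m + math.log(sum(math.exp(x - m) for x in v))

logits = [3.2, 0.1, -1.5]                 # hypothetical f_theta(x)[y] for one input

# Discriminative view: p(y|x) = softmax(logits)
lse = logsumexp(logits)
p_y = [math.exp(l - lse) for l in logits]

# Hidden EBM view: E(x) = -logsumexp(logits)
E_x = -lse

# Shift every logit by +5: classifier unchanged, input energy changes
shifted = [l + 5.0 for l in logits]
p_y_shifted = [math.exp(l - logsumexp(shifted)) for l in shifted]
E_x_shifted = -logsumexp(shifted)
```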
Decompose the joint log-probability: log p_θ(x, y) = log p_θ(x) + log p_θ(y | x)
Trained via SGLD (Langevin dynamics)
Standard cross-entropy loss
Results: better classification and better generation than single-purpose models (same architecture), plus better calibration, OOD detection, and adversarial robustness.
Pre-Training Text Encoders as Discriminators
A small generator (MLM) fills in masked tokens. A larger discriminator classifies each token as original or replaced.
ELECTRA's discriminator is doing negative sampling β distinguishing real tokens (data) from fake tokens (generator). This is NCE with the generator as the noise distribution!
Clark et al., 2020
Electric takes ELECTRA's insight further, from negative sampling to proper noise contrastive estimation:
Discriminator classifies real vs. fake tokens. Ignores the noise distribution.
Given fake tokens (noise), learns to model data tokens via NCE. Produces calibrated per-token pseudo-probabilities.
Electric outputs an energy-based cloze probability for each token given its context. No autoregressive factorization needed.
Deng et al., 2020
Take a pretrained autoregressive LM and multiply its distribution by an energy correction from a bidirectional model: p_θ(x) ∝ p_AR(x) · exp(−E_θ(x))
AR model handles fluency and local coherence (token-by-token)
Bidirectional energy provides global, sequence-level quality correction
Energy-based models as "verifiers": the EBM doesn't generate, it scores and corrects. Trained end-to-end via NCE.
Zhao et al., 2025
Discrete diffusion models for text unmask tokens in parallel, but predict each token independently: p_θ(x | x_t) = ∏_i p_θ(x_i | x_t)
It's like having 10 people each independently write one word of a sentence β each word is reasonable given the context, but together the sentence may be incoherent.
The more parallel the model tries to be (fewer denoising steps), the worse this factorization error gets. This is the fundamental gap between diffusion and AR quality for language.
Apply the residual EBM idea at every denoising step: correct the factorized denoiser with a sequence-level energy, p(x | x_t) ∝ p_diffusion(x | x_t) · exp(−E_θ(x))
Two ways to get the energy function:
Plug in a pretrained AR model as the energy: E(x) = −log p_AR(x). This turns decoding into parallel sampling from an AR model via importance weighting. One AR forward pass scores a complete sequence.
Train a small energy head via NCE on top of the diffusion model. Positives = clean data, negatives = diffusion model's own predictions.
Inference: draw k candidates from diffusion model, score with energy, resample via importance weights.
Correction matters most in early denoising steps (high masking ratio → worst factorization error). Apply importance sampling only for t ∈ [0.8, 1.0] to get most of the benefit at a fraction of the cost.
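The resampling step can be sketched in a few lines. Candidates come from the diffusion proposal, so the importance weight of each one is proportional to exp(−E); the candidate strings and energy values below are invented placeholders:

```python
import math
import random

random.seed(0)

# Hypothetical k = 4 candidates from the diffusion denoiser, with their energies
candidates = ["seq_a", "seq_b", "seq_c", "seq_d"]
energies   = [2.3, 0.4, 5.1, 1.0]      # lower = better under the EBM

# Target ∝ proposal(x) · exp(-E(x)); since candidates came from the proposal,
# the importance weights reduce to exp(-E), normalized for stability
m = min(energies)
w = [math.exp(-(e - m)) for e in energies]
Z = sum(w)
probs = [wi / Z for wi in w]

# Resample one candidate in proportion to its weight
chosen = random.choices(candidates, weights=probs, k=1)[0]
```

Low-energy candidates dominate the resampling: here "seq_b" gets the highest weight.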
49%
generative perplexity improvement
1.3×
sampling speedup at same quality
400k
fine-tuning steps (NCE variant)
First diffusion model to seriously challenge autoregressive quality on language.
Gladstone et al., 2025: A radically different paradigm
Unlike EDLM, which adds an energy correction, EBT makes the entire model an energy function E_θ(x, y):
Inference = gradient descent from random noise to a converged prediction:
ŷ_{k+1} = ŷ_k − α ∇_ŷ E_θ(x, ŷ_k)
Each gradient step is one unit of thinking. The model is simultaneously a generator (via energy minimization) and a verifier (via the energy scalar), unified in a single model.
Iterate more on harder predictions. Same compute for "the" as "serendipitous"? Not anymore.
Energy at convergence directly quantifies confidence. Easy tokens → low energy. Hard tokens → high energy.
Best-of-N sampling without a separate reward model. The energy IS the verifier.
All three emerge from unsupervised pretraining: no RL, no verifiable rewards.
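The "thinking" loop can be sketched with a toy quadratic energy standing in for a learned E_θ(x, y) (minimum at y = 4.0, a made-up value). More gradient steps means more thinking, a lower final energy, and a better prediction:

```python
import random

random.seed(0)

def energy(y):
    # Hypothetical stand-in for a learned E_theta(x, y); true answer at y = 4.0
    return 0.5 * (y - 4.0) ** 2

def grad(y):
    return y - 4.0

def think(y0, steps, lr=0.3):
    """Each gradient step on the energy = one unit of thinking."""
    y = y0
    for _ in range(steps):
        y -= lr * grad(y)
    return y, energy(y)

y0 = random.gauss(0.0, 1.0)            # start the prediction from random noise
y_fast, e_fast = think(y0, steps=2)    # little thinking
y_slow, e_slow = think(y0, steps=20)   # more thinking, same model
```

The final energy doubles as a confidence signal, and comparing energies across candidates gives best-of-N verification for free.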
35%
faster scaling on data
28%
faster on batch size
57%
faster on depth
29%
inference improvement via thinking
First architecture to out-scale Transformer++ across multiple axes simultaneously (data, batch size, parameters, FLOPs, depth).
Despite slightly worse pretraining perplexity, EBTs beat Transformer++ on downstream tasks. Verification generalizes better than generation: learning to score is easier than learning to produce.
EBTs can improve by using more forward passes; Transformer++ cannot.
As data becomes more out-of-distribution, thinking helps more (roughly linear relationship). Just like humans engage deliberate reasoning for unfamiliar problems.
As the model sees more data, the benefit from self-verification increases from 4–8% to 10–14%. Extrapolation to Llama-3 scale suggests potentially massive gains.
Experiments up to 800M parameters. Training overhead 3.3–6.6×. Struggles with multi-modal distributions (merges nearby modes). Open question: do advantages persist at GPT-4 scale?
JEM, Electric, Residual EBMs, EDLM: keep existing models, add energy-based scoring on top. Practical, already works at scale. The EBM captures what the base model misses.
EBTs: the transformer IS the energy function. Theoretically cleaner, remarkable scaling properties, built-in verification. Not yet demonstrated at frontier scale.
The common thread: both approaches show that scoring complete configurations holistically, the energy-based way, improves generation quality, provides built-in verification, and enables test-time compute scaling.
Energy-based models: from Boltzmann machines to modern transformers
Shashank Shekhar · Toronto LLM Meetup · March 17, 2026
Key references
Song & Kingma 2021 · Grathwohl et al. 2020 · Deng et al. 2020
Zhao et al. 2025 (EDLM) · Gladstone et al. 2025 (EBT)