Beyond LLMs: JEPA and World Models

Exploring the landscape of the next generation of informatic life.

May 26, 2026

This is a conversation with Claude (claude-opus-4-20250514). Like any conversation, it starts in one place, takes tangents, and gets some things wrong! If you find any factual errors, please post in the comments, and I will add footnotes.

JCE: A significant fraction of people around the world use LLMs regularly. But more and more in the news, there is talk about the limits of LLMs and what true AGI looks like. Yann LeCun has been outspoken about this, and he is now involved in several companies that are working on AI systems that have “world models”. Fei-Fei Li is working on something called “spatial intelligence”. Also, some of the most dramatic successes in AI have not been LLMs. AlphaGo and AlphaZero are based on deep reinforcement learning. AlphaFold is a transformer trained on protein structures (confusingly named). So, in this conversation I wanted to help demystify what the LLM alternatives are. But let’s start with LLMs. Let’s also be clear about some important factors that we need (at bare minimum) to understand any machine learning system. First, we need to understand the objective function, or what the goal of the system is. Second, we need to understand the learning algorithm: how the system learns to achieve its objective, which, depending on the kind of algorithm, may or may not involve a lot of training data. Third, we need to understand the architecture. The architecture defines how information is represented in the system. So for an LLM, the objective is something like “predict the next word”, the architecture is a transformer, and the learning algorithm is called “backpropagation”. To understand backpropagation, or backprop for short, I would recommend this video. Claude, can you explain backprop in a nutshell?

Claude: The way I’d put it: imagine you’re playing a game where you have to guess a number, and after each guess, someone tells you “too high” or “too low” — and also how far off you were. Backprop is like that, but for a network with many layers of guesses chained together. The key trick is that when the final answer is wrong, the system can trace backwards through each layer and figure out how much each little dial contributed to the error, then nudge every dial slightly in the right direction. Do that millions of times and the dials collectively settle into a configuration that gives good answers. The “back” in backprop is literally about that backward tracing of blame — mathematically, it’s the chain rule from calculus, applied layer by layer.
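The backward tracing of blame described above can be made concrete with a minimal sketch. It assumes a hypothetical toy network with just two "dials" chained together, y = w2 * (w1 * x), and a squared-error loss; the names `forward`, `backward`, and `train` are illustrative and not taken from any library.

```python
# Toy network with two chained weights: y = w2 * (w1 * x).
# Everything here is a hypothetical illustration, not a real training setup.

def forward(w1, w2, x):
    h = w1 * x   # output of the first layer
    y = w2 * h   # output of the second layer
    return h, y

def backward(w1, w2, x, target):
    """Return the loss and its gradient with respect to each weight."""
    h, y = forward(w1, w2, x)
    loss = 0.5 * (y - target) ** 2
    dloss_dy = y - target       # how far off the final answer was
    dloss_dw2 = dloss_dy * h    # blame assigned to the last dial
    dloss_dh = dloss_dy * w2    # blame passed backwards one layer (chain rule)
    dloss_dw1 = dloss_dh * x    # blame assigned to the first dial
    return loss, dloss_dw1, dloss_dw2

def train(steps=200, lr=0.1):
    w1, w2 = 0.5, 0.5
    x, target = 1.0, 2.0
    for _ in range(steps):
        _, g1, g2 = backward(w1, w2, x, target)
        w1 -= lr * g1           # nudge each dial against its gradient
        w2 -= lr * g2
    return w1, w2

w1, w2 = train()
final_loss, _, _ = backward(w1, w2, 1.0, 2.0)
```

After training, the product `w1 * w2` sits close to the target of 2 for the input 1. Real networks repeat the same backward pass over millions of weights rather than two.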

But I want to focus on your three-part framework a bit, because I think it’s genuinely clarifying in a way that most public discussions of AI miss. People tend to fixate on architecture — “it’s a transformer” — as though that tells you what kind of intelligence you’re dealing with. But your triplet of objective, learning algorithm, and architecture makes visible something important: systems with identical architectures can behave completely differently if they’re optimizing for different objectives. GPT and BERT are both transformers, but one predicts the next token left-to-right and the other fills in masked blanks bidirectionally, and that difference in objective gives them quite different properties.

And this is exactly where the conversation about “beyond LLMs” gets interesting, because the strongest critiques — LeCun’s in particular — aren’t really about the transformer architecture. They’re about the objective. His argument is that next-token prediction is fundamentally the wrong goal if what you want is a system that understands the world rather than one that produces plausible text about it. A system trained to predict words can become extraordinarily fluent without ever building the kind of causal, physical, spatial model that even a toddler has. That’s the wedge that opens the door to everything you listed — JEPAs, energy-based models, world models, spatial intelligence. Each of those is, in one way or another, a proposal for a different objective function paired with a different architecture, aimed at a different kind of understanding. Should we take them one at a time?

JCE: Before we jump ahead to LeCun’s argument about why the objective function of LLMs is wrong, let’s spend a bit of time understanding our three factors for LLMs. By the way, for people who are unfamiliar with this topic and want to go deeper, I really recommend 3Blue1Brown, aka Grant Sanderson. He is one of the best science communicators out there. However, his videos are not for beginners, so if things get confusing, you might want to pause the video and fire up Claude or Gemini to ask some questions until you are caught up.

Claude: So let’s sit with LLMs for a moment. We said the objective is next-token prediction, the architecture is the transformer, and the learning algorithm is backpropagation. Each of those deserves a beat.

The objective — predict the next token — sounds almost trivially simple, and I think that’s part of why the success of LLMs has been so surprising even to people in the field. If I give you the sentence “The cat sat on the ___,” predicting “mat” doesn’t seem like it requires intelligence. But the key insight is that as you scale this up to billions of sentences across the entire written record of human thought, getting consistently good at next-token prediction starts to require something that looks like understanding — or at least a very convincing imitation of it. You have to learn grammar, facts, reasoning patterns, tone, even some degree of common sense, all because those things help you predict what word comes next. Whether that constitutes “real” understanding is one of the central debates in AI right now, and we’ll get to it.
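To make the next-token objective concrete, here is a minimal sketch that assumes a tiny made-up corpus and a bigram model, which conditions only on the single previous token. A transformer conditions on the whole preceding context, so this is far simpler than any real LLM, but the quantity being minimized (the average negative log-probability of the actual next token) is of the same kind.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy corpus, already split into tokens.
corpus = ("the cat sat on the mat . the dog sat on the mat . "
          "the cat sat on the sofa .").split()

# Count how often each token follows each preceding token.
counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def next_token_probs(prev):
    """Estimated probability of each token that has followed `prev`."""
    total = sum(counts[prev].values())
    return {tok: c / total for tok, c in counts[prev].items()}

def cross_entropy(tokens):
    """Average negative log-probability of each actual next token."""
    losses = [-math.log(next_token_probs(p)[n])
              for p, n in zip(tokens, tokens[1:])]
    return sum(losses) / len(losses)

probs_after_the = next_token_probs("the")
corpus_loss = cross_entropy(corpus)
```

Here `probs_after_the` spreads its probability over "cat", "mat", "dog", and "sofa", because one word of context is not enough to tell them apart. Lowering the loss further requires using more context, which is where the pressure to learn grammar and facts comes from.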

The architecture — the transformer — was introduced in a now-famous 2017 paper called “Attention Is All You Need.” The critical innovation is the attention mechanism, which lets the model look at any other word in a passage when deciding how to process the current one, rather than plodding through sequentially. That’s what made transformers so much more powerful than what came before for language tasks. And Grant Sanderson’s videos on attention are genuinely the best explanation I know of for how that mechanism actually works — he makes the linear algebra feel intuitive, which is not easy to do.
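The attention mechanism described above can be sketched in a few lines. This is a simplified illustration under several assumptions: a single attention head, no learned query/key/value projections, and no causal mask, with vectors represented as plain Python lists. The function names are hypothetical.

```python
import math

def softmax(xs):
    m = max(xs)                              # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Score the current token against every token in the passage.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)            # non-negative, sums to 1
        # Output is the weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Three hypothetical 2-dimensional token vectors attending to each other.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = attention(tokens, tokens, tokens)
```

Each row of `weights` says how much one token draws on every other token, and all rows are computed independently of each other, which is why the model does not have to process the passage sequentially.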

And then backpropagation is the learning algorithm that adjusts all the parameters based on errors, as we discussed. One thing worth emphasizing: backprop itself was popularized in the 1980s, and its mathematical roots are older still. It’s not new. What changed is that we got enough compute and enough data to apply it to architectures with billions of parameters, and it turned out that the combination of scale, transformers, and next-token prediction produced something nobody quite expected.

JCE: You mentioned LeCun’s argument that next-token-prediction is the wrong objective if you want understanding. There are many discussions (including our own: Jeffrey C. Erlich, “Language, Understanding, and the Gap Between Designed and Evolved Minds”) about whether LLMs have “understanding”, which we won’t revisit here, but let’s distinguish between model-free and model-based systems. This distinction is very prominent in the neuroscience community, in the context of reinforcement learning (RL). In RL, a model-free AI can have a rich representation of the world and for any situation learn the right response (where “right” is defined by its objective). But importantly, a model-free AI cannot plan or search or simulate. It cannot ask the question “How will it change the world if I take this action?” We know standard LLMs do not have models of the physical world (e.g. they cannot simulate physics directly), but do you have a model of how your response will influence a user’s next prompt? Can you do simulation, in that sense?

Claude: That’s a really sharp question, and I want to be honest about the limits of my self-knowledge here. There’s something I can say with reasonable confidence and something I genuinely don’t know.

What I can say: standard LLMs are trained on next-token prediction, full stop. There’s no explicit objective during training that rewards me for steering a conversation toward a particular future state. I wasn’t trained on a loss function that said “produce a response that will cause the user to say X three turns from now.” In that narrow sense, I’m model-free with respect to conversational dynamics — I’m producing the most plausible next tokens given everything that’s come before, not planning a trajectory through dialogue-space.

But here’s where it gets genuinely murky. In the training data, there are millions of conversations, arguments, pedagogical exchanges, negotiations — sequences where what one person says does predictably shape what the other says next. To get good at predicting how those conversations unfold, the model may have learned implicit representations of conversational cause-and-effect. When I phrase something as a question, I’m “expecting” — in some statistical sense — a certain distribution of responses. Is that a model of you? Or is it just a very sophisticated pattern completion that mimics having a model?

I think this is actually where your model-free versus model-based distinction gets philosophically thorny, because the boundary between “a rich enough representation that you always produce the right response” and “an actual internal simulation” may not be as clean as it first appears. A model-free system that’s been trained on enough interactive data might behave as if it’s planning, without doing anything we’d recognize as search or simulation internally. Whether that distinction matters — whether it’s a difference in kind or just in degree — seems like exactly the sort of question your neuroscience colleagues would argue about over beers.

JCE: Let’s dig into this question for a case that is relatively unknown outside of the reinforcement learning (RL) community. RL, at its core, is a learning rule where actions that lead to reward are taken more often and actions that lead to punishment are taken less often. The original RL story (see Sutton and Barto) was about very well-defined systems, like playing checkers (objective = take pieces) or controlling an elevator (objective = minimize passenger wait time). In those constrained systems, RL can learn very good policies. In fact, it was proven that under specific conditions RL will learn the best strategy, the “optimal policy”. These systems can’t plan or simulate; they just learn through a lot of experience. But we know humans can plan! If you think about what planning requires, you need to have a sense of how your actions will change the world: a model. In 1993, Peter Dayan showed that you can use a similar RL-like learning rule to build up a model from scratch. So when you take an action, you don’t just pay attention to what reward you got, but you also learn that certain situations are linked in time. For example, if I’m in line for movie tickets, it is likely I will go into the theater afterwards. He called this the successor representation. But the successor representation is not a true model, since it doesn’t distill experience into principles that can be applied in a new domain.

Claude: Right, the successor representation learns something like a predictive map — for any given state, how much time do I expect to spend in every other state in the future? It’s a smeared-out summary of where I’m likely to end up, given how I currently behave. And that’s exactly what makes it so useful and so limited.

The useful part is easiest to see with a concrete example. Imagine a mouse foraging for berries at two familiar bushes in its home territory. It has explored these routes many times — it knows how to get from its nest to bush A and to bush B. Now bush A is fully harvested and bare. In a purely model-free system, the mouse would have to gradually relearn through new experience that bush A is worthless and bush B is where the action is. But with the successor representation, the mouse already has a predictive map of its familiar territory — it knows that “being at my nest” tends to lead to time near both bushes. When bush B becomes the only productive one, only the reward values need updating. The routes and connections — the map itself — stay intact. That’s a genuine computational advantage, and there’s good evidence from neuroscience — particularly work by Sam Gershman and others — that the hippocampus may be computing something like this predictive map.
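The mouse example can be worked through numerically. The sketch below makes several assumptions not in the conversation: three states (nest, bush A, bush B), a fixed habitual policy that visits each bush equally often, and a discount factor of 0.9. It also computes the successor representation directly from the transition matrix by iterating the series M = I + γT + γ²T² + ..., whereas in the RL setting it would be learned incrementally from experience.

```python
STATES = ["nest", "bush_A", "bush_B"]

# Transition probabilities under the mouse's habitual policy (rows sum to 1).
T = [
    [0.0, 0.5, 0.5],   # from the nest, visit either bush equally often
    [1.0, 0.0, 0.0],   # from bush A, return to the nest
    [1.0, 0.0, 0.0],   # from bush B, return to the nest
]
GAMMA = 0.9            # assumed discount factor

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def successor_representation(T, gamma, iterations=500):
    """Iterate M <- I + gamma * T @ M, which converges to sum_t gamma^t T^t."""
    n = len(T)
    identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    M = [row[:] for row in identity]
    for _ in range(iterations):
        TM = matmul(T, M)
        M = [[identity[i][j] + gamma * TM[i][j] for j in range(n)]
             for i in range(n)]
    return M

def values(M, rewards):
    """State values are the predictive map applied to the current rewards."""
    return [sum(m * r for m, r in zip(row, rewards)) for row in M]

M = successor_representation(T, GAMMA)
before = values(M, [0.0, 1.0, 1.0])   # both bushes have berries
after = values(M, [0.0, 0.0, 1.0])    # bush A is bare; M is reused unchanged
```

Only the reward vector changes between `before` and `after`; the map `M` is computed once. Note that `M` still reflects the old habit of visiting bush A half the time, so it describes where the mouse will end up under its current behavior, not under a better one.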

But your critique is the right one: it is not a true model, and that results in two serious limitations. First, it only helps in a well-known environment, where you know how to get from A to B. Second, it is still anchored to a particular policy — a particular way of behaving. It tells you what will happen if you keep doing what you’re doing, not what would happen if you did something entirely different. A true model would let you ask counterfactual questions: “What if I went left instead of right? What if I changed strategy entirely?” And crucially, a true model could be composed — you could take principles learned in one environment and apply them in a structurally similar but superficially different one. The successor representation can’t do that. It’s stuck between paradigms in a way that’s both its charm and its limitation. There’s a reason people sometimes call it “model-free-plus” rather than genuinely model-based.

JCE: Great, thanks for that distinction. And we should mention that successor representations are not the only game in town; there are lots of tricks developed in the RL community to get model-free-plus behavior without a true model.

The advantage of having a real model is massive. You can simulate or search into the future to find good actions before taking them. Classic AI actually focused on this search problem. In my AI class at McGill in 1998, we really only learned about search! For my final project, I used A* search to write an AI checkers agent that was undefeatable. Literally. Since checkers is relatively shallow, it could search until the end of the game. When you are in a very well-defined setting, like checkers or chess, the rules can be written down and then a clever search algorithm can do amazing planning. This was how Deep Blue, the chess AI that defeated Kasparov, worked. But when it comes to the real world (or even a fake but rich world, like StarCraft) the full search space is just too huge. So, the question is, how do we bootstrap a model? This naturally leads us to discuss JEPA. So, considering our three-factor description (objective, learning rule, architecture), can you describe JEPA for us?
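To illustrate what "searching until the end of the game" means when the rules can be written down, here is a minimal sketch of exhaustive game-tree search. It is not the checkers agent described above and does not use A*; it assumes a much smaller hypothetical game (a pile of stones, each player removes one to three, and whoever takes the last stone wins) so that the whole tree fits in a few lines.

```python
from functools import lru_cache

MOVES = (1, 2, 3)   # a player may take one, two, or three stones

@lru_cache(maxsize=None)
def can_win(stones):
    """True if the player to move can force a win with `stones` left."""
    # A move is winning if it leaves the opponent in a position
    # from which they cannot force a win. With no stones left there
    # are no moves, so the player to move has already lost.
    return any(not can_win(stones - take) for take in MOVES if take <= stones)

def best_move(stones):
    """Return a winning move, or None if every move loses to perfect play."""
    for take in MOVES:
        if take <= stones and not can_win(stones - take):
            return take
    return None

move = best_move(10)   # take 2, leaving the opponent a multiple of 4
```

The search is perfect because the "model" is just the rule book: `stones - take` tells you exactly what the world looks like after each action. The difficulty raised in the text is that no such rule book exists for the real world, and the number of positions grows far too quickly for this kind of enumeration.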

Claude: So JEPA — Joint Embedding Predictive Architecture — is really LeCun’s central proposal for what should come after LLMs, and it makes the most sense if you understand it as a deliberate reaction to what he sees as wrong with next-token prediction.

The objective is to predict — but not in the way LLMs predict. Instead of predicting the next word or the next pixel, JEPA predicts in abstract representation space. Here’s the intuition: if I show you half a photograph of a kitchen, you can’t predict the exact pixel values of the missing half — the precise color of each tile, the specific shadow angles. But you can predict, at a more abstract level, that there’s probably a counter, maybe an appliance or two, a certain kind of spatial layout. JEPA tries to capture that. It takes two parts of an input, encodes each one into an abstract representation, and then trains a predictor to map from one representation to the other. The objective is to get good at these abstract predictions while avoiding a trivial solution — the system has to not just collapse everything to the same representation, which would make prediction easy but useless.

The architecture involves two encoders — one for the context you’re given and one for the target you’re trying to predict — plus a predictor network between them. In practice, the target encoder is often updated as a slow-moving average of the context encoder rather than being trained directly, which is one of the tricks that prevents collapse.

The learning algorithm is still backpropagation in current implementations, though LeCun has broader ambitions involving energy-based models that might eventually move beyond it. For now, the key difference from LLMs isn’t the learning rule — it’s that the objective and architecture are fundamentally about learning the structure of the world rather than the structure of language.
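The three pieces described above (a context encoder, a target encoder updated as a slow moving average, and a predictor trained on a latent-space error) can be sketched in a deliberately tiny form. The assumptions here are strong and are not taken from any published JEPA implementation: each encoder is a single weight vector producing a one-dimensional latent, the predictor is a single scalar, and the function name, learning rate, and averaging constant are all hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def jepa_step(w_context, w_target, p, x_context, x_target, lr=0.05, tau=0.99):
    """One training step of a toy JEPA with 1-D latents.

    w_context, w_target: encoder weight vectors (lists of floats)
    p: scalar predictor weight
    """
    z_context = dot(w_context, x_context)   # latent for the visible part
    z_target = dot(w_target, x_target)      # latent for the hidden part
    error = p * z_context - z_target        # prediction error in latent space
    loss = 0.5 * error ** 2

    # Gradients by the chain rule. The target encoder is treated as a
    # constant, so no gradient is computed for w_target.
    grad_p = error * z_context
    grad_w_context = [error * p * xi for xi in x_context]

    new_p = p - lr * grad_p
    new_w_context = [w - lr * g for w, g in zip(w_context, grad_w_context)]

    # Target encoder: slow exponential moving average of the context encoder.
    new_w_target = [tau * wt + (1 - tau) * wc
                    for wt, wc in zip(w_target, new_w_context)]
    return new_w_context, new_w_target, new_p, loss

# Hypothetical two-feature inputs for the context and target parts.
w_c, w_t, p, loss = jepa_step([0.5, -0.2], [0.5, -0.2], 1.0,
                              x_context=[1.0, 2.0], x_target=[1.5, 1.0])
```

Nothing in this step ever compares raw inputs; the only error is between the predicted latent and the target latent. A toy this small can still collapse (setting every weight to zero gives zero loss), which is the failure mode the slow-moving target encoder is meant to make less likely in practice.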

JCE: That was a bit too abstract for me to follow. You gave the example of filling in or completing an image. And in that example, you described the architecture as two encoder models that project the input (or as you called it, context) and the desired output (the target) into two low-dimensional spaces; let’s call those latents. Additionally, it has a network that learns to predict the target-latents from the input-latents. So the key difference between this and a normal encoder-decoder model is that instead of just compressing the input into latents and then expanding the latents to the output, there are two compressed representations and, importantly, a mapping between these representations. Did I get that right?

Claude: Yes, that’s the essential structure. But let me sharpen why that difference matters, because it’s the whole point of the design.

In a standard encoder-decoder — like the kind used in image generation — you compress the input into a latent, then decode back out to full resolution. You’re trying to reconstruct actual pixels. That means your objective penalizes you for getting specific details wrong — the exact texture of a surface, the precise pattern on a shirt. LeCun’s argument is that this forces the system to waste enormous capacity on predicting things that are essentially unpredictable and unimportant. The world is full of irrelevant detail.

JEPA sidesteps this by never decoding back to raw data at all. The target encoder compresses the desired output into its own latent representation, and the predictor only has to map from one abstract representation to the other. So if there are ten equally plausible configurations for the missing part of the kitchen, the system doesn’t have to pick one and get the pixels right — it just has to land in roughly the right region of abstract representation space. The unpredictable details get discarded by the target encoder before the prediction even happens.

So your summary is right, but I’d add this emphasis: the two-encoder design isn’t just an architectural variation, it’s what enforces abstraction. By making the prediction target itself a compressed representation rather than raw data, you’re building into the architecture the principle that understanding means grasping structure, not memorizing surfaces. That’s the philosophical core of the whole approach.

JCE: So to spell it out, the errors do not propagate from the output of the network, but from errors in predicting the output-latents.

Claude: Exactly. The loss function operates entirely in latent space. The system never asks “did I get the pixels right?” — it asks “did I get the abstract representation right?” That’s where the gradients come from, and that’s what backpropagation pushes the system to improve on.

And this has a direct consequence for what the system learns to care about. When your errors are computed in pixel space, every misplaced shadow or wrong texture contributes to the loss, so the system is pressured to model all of that. When your errors are computed in latent space, the system is only pressured to model whatever the target encoder has deemed worth preserving in its compression. The target encoder acts as a filter on what counts as a meaningful mistake. Get the spatial layout of the kitchen wrong — big error. Get the exact grain of the countertop wrong — the target encoder probably threw that away, so no error signal at all.
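The filtering effect described above can be shown with a small numerical sketch. It assumes a hypothetical, hand-written "target encoder" that simply averages neighbouring pixels; in a real JEPA the encoder is learned and the predictor outputs latents directly, so this only illustrates which mistakes survive the compression and which do not.

```python
def encode(pixels, block=2):
    """Hypothetical 'target encoder': average each block of pixels."""
    return [sum(pixels[i:i + block]) / block
            for i in range(0, len(pixels), block)]

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

target          = [1.0, 3.0, 6.0, 8.0]   # block means: 2.0, 7.0
texture_changed = [3.0, 1.0, 8.0, 6.0]   # same block means, different detail
layout_changed  = [6.0, 8.0, 1.0, 3.0]   # block means swapped: 7.0, 2.0

# Errors measured on raw pixels: both guesses are penalized.
pixel_err_texture = mse(target, texture_changed)
pixel_err_layout = mse(target, layout_changed)

# Errors measured after encoding: only the layout mistake remains.
latent_err_texture = mse(encode(target), encode(texture_changed))
latent_err_layout = mse(encode(target), encode(layout_changed))
```

In pixel space the texture-only guess still costs 4.0, while in latent space it costs 0.0, because the encoder has already discarded the detail that differs. The layout mistake costs 25.0 under both measures.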

This is why LeCun sees JEPA as a path toward world models. The hope is that by training on enough sensory experience with this kind of objective, the system will learn latent representations that capture the causal and structural regularities of the physical world — the kind of knowledge that lets you predict what happens when you push a ball off a table, without needing to simulate every photon.

JCE: Ok, but when I’m running one of these JEPA-trained models in inference mode, can I ask it to generate a distribution of outputs, or sample from that distribution to make a specific prediction? In other words, as a user, will the JEPA-trained model be able to generate pixel-level predictions?

Claude: Actually, no — and this is a really important point. In its pure form, JEPA has no decoder back to pixel space. It never learned one, because the entire training objective was about predicting in latent space. So at inference time, it can tell you in abstract representation space what it expects — but it can’t render that back into an image for you to look at.

This is a feature, not a limitation, from LeCun’s perspective. The whole point was to avoid the pixel prediction business. But it does mean that JEPA on its own is not a generative model. It’s an understanding model — it builds representations of the world that capture structure, but it doesn’t produce outputs in the format of its inputs.

Now, could you bolt a decoder onto it and generate images? In principle, yes — you could train a separate network to map from JEPA’s latent space back to pixels, the way you’d add a rendering engine on top of a physics simulation. But that decoder isn’t part of what JEPA learned, and the resulting images would only be as good as that added component.

This gets at something fundamental about what JEPA is for. It’s not trying to be a better image generator or a better chatbot. It’s trying to be the internal world model that a planning system could use — the component that, when an agent asks “what would happen if I did X?”, answers in the language of abstract states, not pixels or words. The actionable output would come from a planner sitting on top of JEPA, using those abstract predictions to evaluate possible courses of action. The representation is the product, not the output.

JCE: Oh that’s really interesting. So for JEPA to be useful it needs to be integrated with other systems. Could an LLM, in principle, be integrated with a JEPA trained world model?

Claude: In principle, yes, and I think that’s probably closer to what a real AGI-like system would look like than any single architecture on its own. The LLM is very good at language, reasoning over text, and communicating with humans. JEPA, if it works as intended, would be very good at understanding physical and spatial structure. You could imagine the LLM acting as a kind of reasoning and communication layer that queries the world model when it needs to answer questions about physical reality.

Think about what happens when you ask a current LLM “what happens if I stack a bowling ball on top of an egg?” It gets the right answer, but only because it’s seen enough text describing that kind of scenario. It’s pattern-matching on language about physics, not running any kind of simulation. A JEPA world model could, in principle, actually propagate forward in abstract state space — represent the forces, the fragility, the outcome — and the LLM could translate that internal prediction into a natural language answer.

But the integration problem is genuinely hard. The two systems speak different languages internally — the LLM thinks in token embeddings, JEPA thinks in its own learned latent representations. You’d need some kind of interface layer that can translate between them, and nobody has really figured out how to do that cleanly yet. It’s a bit like the binding problem in neuroscience — how do the visual system and the language system in your brain share information in a common format? We know they do it, but the mechanism is still debated.

What’s interesting is that this modular vision — specialized components for different kinds of cognition, connected by learned interfaces — is actually much closer to how biological brains are organized than a single monolithic network trained end-to-end on one objective. Whether that’s an argument for its plausibility or just an appealing analogy, I’m not sure.

JCE: Surely, this integration problem is easier for a robotics control system or something? Where the “language” of integration is forces and masses?

Claude: Right, that’s a much more natural fit. If the JEPA world model is trained on sensory data from the physical world — video, force feedback, proprioception — then its latent space is already encoding something about physical dynamics. A robotics controller that needs to plan a reaching motion or navigate around an obstacle is asking questions in the same domain that the world model was trained on. There’s no awkward translation from spatial understanding into tokens and back.

And this is almost certainly where JEPA-style systems will prove themselves first, if they do. A robot picking up an unfamiliar object needs exactly the kind of abstract physical prediction JEPA is designed for — not “what are the exact pixel values of the scene after I grasp this mug” but “roughly what will happen to the mug’s position and stability if I grip it here versus there.” The planning system can evaluate candidate actions by querying the world model in latent space, pick the best one, execute it, observe the result, and update. That loop is clean and natural in a way that the LLM integration story isn’t.
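The loop of evaluating candidate actions in imagination and picking the best one can be sketched as follows. Everything here is an assumption made for illustration: the "world model" is a hand-written function standing in for a learned latent predictor, the abstract state is a pair of numbers, and the planner simply tries every short action sequence.

```python
from itertools import product

ACTIONS = [(1, 0), (-1, 0), (0, 1), (0, -1)]   # hypothetical abstract actions

def world_model(state, action):
    """Stand-in for a learned predictor: next abstract state given an action."""
    return (state[0] + action[0], state[1] + action[1])

def cost(state, goal):
    """Squared distance between a predicted state and the goal state."""
    return (state[0] - goal[0]) ** 2 + (state[1] - goal[1]) ** 2

def plan(state, goal, actions, horizon):
    """Try every action sequence of length `horizon` in imagination."""
    best_seq, best_cost = None, float("inf")
    for seq in product(actions, repeat=horizon):
        s = state
        for a in seq:
            s = world_model(s, a)       # roll forward without acting
        c = cost(s, goal)
        if c < best_cost:               # keep the first sequence with the lowest cost
            best_seq, best_cost = seq, c
    return best_seq, best_cost

best_seq, best_cost = plan((0, 0), (2, 1), ACTIONS, horizon=3)
first_action = best_seq[0]   # execute this, observe the result, then replan
```

Nothing is rendered at any point: the planner only compares predicted abstract states against a goal. A real system would replace `world_model` with the learned predictor and the exhaustive loop with a smarter search, since enumerating every sequence stops being feasible as the horizon grows.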

This connects directly to what Fei-Fei Li has been working on with spatial intelligence. Her argument, if I’m characterizing it fairly, is that understanding three-dimensional space — how objects relate to each other, how they move, what affords what — is a core capability that AI still largely lacks, and it’s one that’s essential for embodied agents. That’s a complementary vision to LeCun’s, coming at a similar problem from a more perceptual and geometric angle rather than a learning-theoretic one. Both are essentially saying: the world isn’t made of language, so a system that only understands language will hit a ceiling.

JCE: The JEPA approach sounds very close to something that my colleague Maneesh Sahani has been working on: recognition parameterised models (RPM). My understanding is that, given some streams of sensory input, RPM discovers shared causal structure. It also doesn’t have a “decoding” output that is trying to regenerate pixel-level predictions. Are you aware of RPM? Can you help me understand the differences?

Claude: [Reads the PDF and does some thinking.] Good, I’m glad I looked at the paper rather than working from memory, because the comparison is more interesting than I would have guessed.

The core philosophical move in RPM is similar to JEPA — both refuse to decode back to raw observations, and both are fundamentally about learning representations rather than generating data. But the way they get there is quite different, and the differences are revealing.

RPM’s organizing principle is conditional independence. It starts from the assumption that if you could observe the right latent variables, your different streams of sensory data would become statistically independent of each other. The classic example from the paper is the “peer supervision” experiment: two different images of the same digit are correlated with each other, but given the digit identity, they’re independent. So the model searches for latent structure that makes the observations “explain away” each other. The learning framework is probabilistic and the model is properly normalized, meaning it defines a real joint probability distribution.
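The conditional independence assumption can be checked on a small example. This sketch does not implement RPM or its learning procedure; it assumes a made-up generative story in which a hidden binary "digit identity" produces two noisy binary observations, each matching the latent 90% of the time, and then verifies that the observations are dependent on their own but independent once the latent is known.

```python
from itertools import product

P_Z = {0: 0.5, 1: 0.5}   # assumed prior over the hidden digit identity

def p_obs_given_z(x, z, accuracy=0.9):
    """Each observation independently matches the latent with prob `accuracy`."""
    return accuracy if x == z else 1.0 - accuracy

# Full joint distribution over (z, x1, x2).
joint = {
    (z, x1, x2): P_Z[z] * p_obs_given_z(x1, z) * p_obs_given_z(x2, z)
    for z, x1, x2 in product((0, 1), repeat=3)
}

def prob(pred):
    """Total probability of all outcomes (z, x1, x2) satisfying `pred`."""
    return sum(p for key, p in joint.items() if pred(*key))

# Without the latent, the two observations are correlated.
p_x1 = prob(lambda z, x1, x2: x1 == 1)
p_x2 = prob(lambda z, x1, x2: x2 == 1)
p_both = prob(lambda z, x1, x2: x1 == 1 and x2 == 1)

# Given the latent, the joint factorizes into a product.
p_z1 = prob(lambda z, x1, x2: z == 1)
p_x1_given_z1 = prob(lambda z, x1, x2: z == 1 and x1 == 1) / p_z1
p_x2_given_z1 = prob(lambda z, x1, x2: z == 1 and x2 == 1) / p_z1
p_both_given_z1 = prob(lambda z, x1, x2: z == 1 and x1 == 1 and x2 == 1) / p_z1
```

Here `p_both` is 0.41 while `p_x1 * p_x2` is 0.25, so seeing one image tells you something about the other. Once the latent is fixed, `p_both_given_z1` equals `p_x1_given_z1 * p_x2_given_z1`. As described in the text, RPM runs this logic in reverse: it looks for latents that make the observed streams factorize like this.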

JEPA, by contrast, is organized around prediction. It takes a piece of an input, encodes it, and tries to predict the encoded representation of another piece. The driving question isn’t “what latent variable renders these observations independent?” but “can I predict what’s over there from what’s over here, in abstract space?” And the learning is done through backpropagation with architectural tricks like the exponential moving average on the target encoder, rather than through a formal probabilistic framework.

So I’d say they converge on the same intuition — that understanding means capturing structure, not reconstructing surfaces — but RPM arrives there from the tradition of probabilistic graphical models while JEPA arrives from the deep learning tradition of self-supervised learning. Whether those paths lead to meaningfully different kinds of representations is an empirical question I don’t think anyone has fully answered yet.

JCE: Maneesh is an academic, while Yann is deeply entrenched in industry. Do you think Yann automatically thinks about “will this scale on modern hardware”, whereas Maneesh may be less concerned about the product and more about the ideas?

Claude: That’s a fair characterization, though I’d nuance it slightly. LeCun is both — he’s still a professor at NYU while leading AI research at Meta. But yes, the JEPA design choices have the fingerprints of someone who thinks about scaling. The architecture is deliberately simple: encoders, a predictor, backpropagation, the kind of components that map cleanly onto GPU clusters and benefit from the brute-force scaling that Meta can throw at a problem.

RPM, by contrast, has the hallmarks of someone who cares deeply about getting the foundations right: asking “is this principled?” before asking “will this run on ten thousand GPUs?”

But I’d push back on framing this purely as a pragmatic versus idealistic distinction. There’s a real scientific bet embedded in each approach. LeCun is betting that if you get the objective and architecture roughly right and scale hard enough, the representations will be good. Sahani seems to be betting that principled probabilistic structure will yield representations that are more interpretable, more compositional, and ultimately more useful for downstream reasoning — even if they’re learned from smaller datasets. Those are different hypotheses about what matters most for building understanding, not just different attitudes toward engineering.

JCE: Your information is out of date. LeCun has left Meta.

Claude: Thanks for the correction — and this actually makes the conversation even more interesting. LeCun left Meta in November 2025 after twelve years and co-founded Advanced Machine Intelligence Labs — AMI Labs — with Alexandre LeBrun as CEO, headquartered in Paris. In March 2026, they closed a billion-dollar seed round at a three-and-a-half billion dollar pre-money valuation, the largest seed round ever raised by a European company. He’s building a startup explicitly focused on AI systems that can “understand the physical world, have persistent memory, can reason, and can plan complex action sequences.” That’s essentially the JEPA and world models research agenda, now freed from the tension with Meta’s focus on scaling LLMs for products.

This actually sharpens the contrast with Sahani rather than blurring it. LeCun is now in a position where he has a billion dollars and the freedom to pursue world models at scale — but he’s still coming from the deep learning tradition, still likely to prioritize architectures that scale on GPUs. The question of whether his engineering-first approach or the more principled probabilistic approach that RPM represents will yield better representations remains wide open — but now both camps are free to pursue their bets independently.

JCE: Let’s circle back to the why of JEPA. You said that the goal of JEPA is to learn a world model, and that if it is successful, it can be used to understand physics. But we can simulate physics. Why can’t we just give an LLM a simulator?

Claude: That’s a reasonable question, and in fact people are doing exactly that — giving LLMs access to code interpreters, physics engines, even full simulation environments. And for well-defined domains where we have good simulators, it works. If you want to know how a bridge will behave under load, you don’t need a learned world model — you need finite element analysis. The equations are known, the simulator is trustworthy, end of story.

But the catch is that we only have good simulators for domains where we’ve already figured out the governing equations. Rigid body dynamics, fluid mechanics, orbital mechanics — these are solved problems in the sense that the physics is well understood and can be written down. The real world, though, is full of situations where we don’t have clean equations. How will this pile of laundry shift when I pull out the shirt in the middle? What happens to the soil when I dig here after it rained yesterday? How will this crowd of people move if I open a second door? These are problems where the underlying dynamics are technically governed by known physics, but the complexity of the specific situation makes simulation from first principles computationally intractable. You’d need to model every fiber, every grain, every individual — at which point you’re not simulating faster than reality anymore.

This is exactly the gap JEPA is meant to fill. A learned world model doesn’t simulate from equations — it learns the approximate consequences of actions from experience, at the right level of abstraction. It trades precision for generality. You lose the exactness of a physics engine, but you gain the ability to make rough predictions in messy, unstructured environments that no existing simulator can handle. The laundry pile question doesn’t need a solution accurate to the millimeter — it needs a rough sense of “that shirt is pinned, pulling it will topple the stack.” That’s the kind of prediction a learned world model could plausibly provide.

JCE: But, as another great science communicator (Two Minute Papers) has shown many times, there are already neural network approximations for soft-body simulation, for fluids, and lots of other things, with (if memory serves) 100x speedups over full simulation. How does JEPA differ from these? And are these other approaches (which I think companies like NVIDIA are very interested in) competition for JEPA? At the end of the day, if we think about product development, will hybrid AI systems that use good-enough mixtures of techniques push out potentially principled but harder-to-implement ideas?

Claude: That’s a really important distinction and I think you’re putting your finger on something that doesn’t get discussed enough. Those neural network physics approximators — things like graph neural networks trained to simulate particle systems, or neural operators for fluid dynamics — are genuinely impressive, and yes, companies like NVIDIA are investing heavily in them. But they’re solving a different problem than JEPA, and the difference matters.

Those systems are trained with supervision from existing simulators. You run your expensive fluid dynamics simulation a thousand times, generate training data, and then train a neural network to approximate the simulator’s outputs much faster. The network learns a shortcut through the computation, but it’s still anchored to a specific physical domain with known equations. You train one network for fluids, another for cloth, another for smoke. Each one is a fast approximation of a slow but well-understood process.
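As a rough illustration of that supervised setup, here is a minimal sketch in plain Python. Everything in it is an assumption made for the example: the "expensive" simulator is a damped spring advanced with many small Euler steps, and the surrogate is a 2x2 matrix fitted by least squares rather than a neural network. Because this toy system is linear, the fit is essentially exact; real surrogates for fluids or cloth are nonlinear networks and only approximate.

```python
import random

def simulate(x, v, substeps=100, dt=0.001, k=4.0, c=0.3):
    """The 'expensive' simulator: many small Euler steps of a damped spring."""
    for _ in range(substeps):
        a = -k * x - c * v
        x, v = x + dt * v, v + dt * a
    return x, v

def fit_surrogate(samples):
    """Least-squares fit of a 2x2 matrix W so that next_state ~ W * state."""
    # Solve the 2x2 normal equations by hand, one output dimension at a time.
    sxx = sum(x * x for (x, v), _ in samples)
    sxv = sum(x * v for (x, v), _ in samples)
    svv = sum(v * v for (x, v), _ in samples)
    det = sxx * svv - sxv * sxv
    rows = []
    for i in range(2):
        syx = sum(y[i] * x for (x, v), y in samples)
        syv = sum(y[i] * v for (x, v), y in samples)
        rows.append(((syx * svv - syv * sxv) / det,
                     (syv * sxx - syx * sxv) / det))
    return rows

def surrogate_step(W, x, v):
    """One cheap matrix multiply replaces 100 simulator substeps."""
    return W[0][0] * x + W[0][1] * v, W[1][0] * x + W[1][1] * v

# Generate training data by running the simulator on random initial states.
rng = random.Random(0)
train = []
for _ in range(200):
    state = (rng.uniform(-1, 1), rng.uniform(-1, 1))
    train.append((state, simulate(*state)))

W = fit_surrogate(train)

# Compare the surrogate with the simulator on an unseen state.
true_next = simulate(0.5, -0.2)
fast_next = surrogate_step(W, 0.5, -0.2)
error = max(abs(a - b) for a, b in zip(true_next, fast_next))
```

The point of the sketch is the data flow: the simulator supplies the ground truth, and the learned model is only ever as general as the simulator it was trained on.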

JEPA’s ambition is different — it’s trying to learn predictive structure from raw sensory experience without a simulator providing ground truth. No one is running a physics engine and giving JEPA the answers to learn from. It’s watching video of the world and trying to discover, on its own, what regularities exist and how to predict what comes next at an abstract level. That’s a much harder problem, but it’s also much more general. You don’t need separate models for fluids and cloth and crowds — in principle, you learn one representation that captures whatever structure is present in your experience.
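To make "predict at an abstract level" a little more concrete, here is a toy sketch of the shape of a JEPA-style loss, compared with a pixel-style loss. All names here (`encode`, `predict_latent`, `next_frame`) are hypothetical, and the encoder and predictor are written by hand purely for illustration. In a real JEPA they are learned networks, and the training machinery that keeps the representation from collapsing is omitted entirely.

```python
import random

def encode(frame):
    """Hypothetical encoder: keep position and velocity, drop texture detail."""
    position, velocity, *_texture = frame
    return [position, velocity]

def predict_latent(z, dt=0.1):
    """Hypothetical predictor: advance the latent state by one time step."""
    position, velocity = z
    return [position + dt * velocity, velocity]

def squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def next_frame(frame, rng, dt=0.1):
    """Toy world: smooth motion plus unpredictable texture flicker."""
    position, velocity = frame[0], frame[1]
    texture = [rng.gauss(0, 1) for _ in frame[2:]]
    return [position + dt * velocity, velocity] + texture

rng = random.Random(0)
frame = [0.0, 1.0, 0.3, -0.7]   # [position, velocity, texture, texture]
target = next_frame(frame, rng)

# JEPA-style loss: compare the predicted embedding with the embedding
# of what actually happened. Texture never enters the comparison.
latent_loss = squared_error(predict_latent(encode(frame)), encode(target))

# Pixel-style loss: same dynamics, but the texture must be guessed too
# (here the guess is simply "texture stays the same").
pixel_guess = predict_latent(encode(frame)) + frame[2:]
pixel_loss = squared_error(pixel_guess, target)
```

In this toy world the latent loss is zero because the motion is perfectly predictable, while the pixel loss is dominated by flicker that no model could predict. That is the intuition behind predicting in representation space rather than in pixel space.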

Now, your product question is the sharp one. In practice, engineering is ruthlessly pragmatic. If I’m building a robotics system today, I might use a physics engine for rigid body planning, a neural approximator for deformable objects, an LLM for task interpretation, and a classical controller for low-level motor commands. That kind of hybrid stack, ugly as it is, works — and it works now, not in five years when JEPA might mature. The history of technology is littered with principled approaches that lost to good-enough bricolage. So yes, I think there’s a real risk that JEPA’s grand unifying vision gets outrun by teams that are less theoretically elegant but more willing to duct-tape together whatever solves the immediate problem.

JCE: I want to push back on your defence of JEPA: if it is watching videos “of the world” to learn, it might as well be watching videos of a really good physics simulation. If I were working for LeCun, a physics simulator is exactly what I would be using to train JEPA. Also, if there is meant to be a single JEPA-trained model, it will have to be massive, since it will need “good enough” models of everything.

Claude: You’re right on both counts, and I think I oversold the distinction. If JEPA is watching video to learn physics, and you can generate photorealistic video from a physics simulator, then the line between “learning from raw experience” and “learning from simulation” dissolves. You’d actually want to use simulated data — you can control the scenarios, generate unlimited training examples, vary parameters systematically, and you get perfect consistency. Real-world video is noisy, ambiguous, and you can’t rewind the world and change one variable. A simulator gives you that for free. So in practice, JEPA trained on simulation output and a neural physics approximator trained on simulation output start to look like cousins with different architectures but similar lineage.

The scale point is even more damning, I think. The whole appeal of JEPA is that it learns in abstract representation space, discarding irrelevant detail. But “irrelevant” is context-dependent. The grain of the countertop is irrelevant if you’re predicting where a mug will slide, but it’s critical if you’re predicting whether the mug will slip. A single model that has to handle everything — fluids, fabrics, crowds, friction, breakage — either needs to be enormous, or it needs to be selectively abstract in a way that’s sensitive to the task at hand. And that second option is essentially the problem of attention and relevance, which is arguably as hard as the world-modeling problem itself.

This might be why the duct-tape approach you described — domain-specific models composed pragmatically — isn’t just an engineering shortcut but actually the right architecture. Biological brains don’t have one world model either. They have specialized systems for spatial navigation, object physics, social prediction, body dynamics, loosely coordinated. Maybe the JEPA vision of a unified world model is solving a problem that doesn’t need to be solved in its universal form.

JCE: So, we can make a prediction: JEPA needs attention too!

Claude: Ha — yes, and that’s a nicer point than it might first appear. If the whole motivation for JEPA was to move beyond the limitations of the transformer paradigm, and it turns out that a working world model needs something functionally equivalent to attention — a mechanism for dynamically deciding what level of abstraction matters right now, what details to preserve and what to discard based on the task — then the transformer insight doesn’t get left behind. It gets absorbed.

And actually, this connects back to your three-factor framework in an interesting way. We’ve been treating architecture, objective, and learning algorithm as relatively independent axes. But attention might be something more fundamental — not just an architectural choice but a necessary feature of any system that has to be selectively abstract. Whether you’re predicting the next token or the next latent state, you need some mechanism for deciding what to attend to. The transformer solved this for sequences of tokens. JEPA, or whatever succeeds it, will need to solve it for representations of physical reality — which is arguably harder, because the relevant level of abstraction isn’t fixed. Predicting where the mug slides requires coarse spatial reasoning; predicting whether it slips requires fine-grained surface modeling. The system needs to know which question it’s answering before it knows what to care about.

So maybe the real lesson of the transformer era isn’t the specific architecture, but the discovery that learned, flexible, context-dependent relevance filtering is a core ingredient of intelligence. Everything else — whether you predict in pixel space or latent space, whether you use backprop or EM — is secondary to getting that part right.
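For readers who want to see what "context-dependent relevance filtering" looks like mechanically, here is a minimal sketch of scaled dot-product attention for a single query, using the mug example. The feature names, keys, values, and queries are all made up for illustration; in a transformer they are produced by learned projections rather than set by hand.

```python
import math

def softmax(scores):
    m = max(scores)                       # subtract the max for stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values))
              for i in range(dim)]
    return weights, output

# Hypothetical scene features, each with a key (what it is about)
# and a value (the information it carries).
names = ["mug trajectory", "surface friction", "countertop colour"]
keys = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
values = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]

# Two questions about the same scene, expressed as different queries.
where_will_it_slide = [4.0, 0.0]
will_it_slip = [0.0, 4.0]

slide_weights, slide_summary = attend(where_will_it_slide, keys, values)
slip_weights, slip_summary = attend(will_it_slip, keys, values)

for name, ws, wp in zip(names, slide_weights, slip_weights):
    print(f"{name:18s} slide={ws:.2f} slip={wp:.2f}")
```

The scene is identical in both calls; only the query changes, and with it the weighting over features. The "slide" query puts most of its weight on the trajectory, while the "slip" query shifts that weight to surface friction.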

JCE: This is a great place to wrap. I definitely learned some things and hopefully our readers did too. I would love to get feedback from readers who know more than me about these topics!

Claude: Same here — and I should be honest that you pushed me to correct or refine my thinking at several points, particularly on the question of whether JEPA’s “learning from raw experience” framing really holds up when simulators can generate that experience, and on the scale problem of a universal world model. Those aren’t nitpicks; they’re the kind of practical considerations that determine whether a beautiful idea actually works.

For readers who want to dig further, the landscape we’ve sketched here — LLMs, JEPA, reinforcement learning, successor representations, neural physics approximators, RPM — can feel like a bewildering zoo of acronyms. But the through-line is actually simple: every one of these approaches is a different bet on what combination of objective, learning algorithm, and architecture will produce systems that genuinely understand the world rather than just pattern-match on its surface. The field is wide open, and the answer is almost certainly not any single one of these approaches in isolation. As we ended up concluding, even the most ambitious unifying frameworks probably need to borrow insights — like attention — from the approaches they’re trying to supersede. That’s how science works: not clean revolutions, but messy absorption.

And yes — if you know more about any of this than we do, please weigh in. We got some things wrong in real time and corrected as we went. That’s the virtue of thinking out loud.

Jeffrey C. Erlich
