This blog maybe a little too late to the picture, but a few days ago I stumbled upon a new paper about "VL-JEPA". Some quick searches led me back to the conception of the idea of Joint Embedding Predictive Architecture in this position paper titled A Path Towards Autonomous Machine Intelligence by Yann LeCun. In recent times, I had picked up on reading a textbook called "LLMs from Scratch" by Sebastian Raschka, and I was of the opinion to not take any detours until after finishing the book and implementing a model. Yet, here I am reading a paper I lament about being way too long (62 pages!).
Now this might not be the best explanatory article on JEPA, but there are a few articles that I've been referred to as some really good blogs to read from. I've denied myself the opportunity of going through them (yet), only opening them after I finish writing this blog. Therefore, I'll add them to the end of this blog. (ended up reading them by the time I was done writing this.)
Setting Context¶
You might ask: "Why a new architecture when the current LLMs seem to be great enough already?" And that question kind of hits the bullseye of the stir this paper has caused, although only partially. Current Large Language Models work, in an broad oversimplification, guessing the next word or token.
A more detailed explanation would involve defining a few things to truly understand what is happening:- - Tokens: basic unit of text - Vectors: numerical representation of data with a lot of dimensions to capture semantic meaning - Encoder: Encodes raw text into tokens - Decoder: Decodes tokens into raw text
With that, let's try to understand what's happening behind current LLMs. First, the input is taken by the encoder and encoded into tokens. These tokens are then embedded into vectors, which is then sent to the model. Current models are Transformers, and they're implemented as decoders themselves. This is known as a Decoder only Transformer. This decoder, which we will refer to our "LLM" now, can generate a token one by one, which is essentially guessing the next word/token of the sentence. How does our LLM know what comes next? It employs a method called Attention, first explained in the paper Attention is all you need to capture context from the input. Great, but how does it know what those words mean? That, is infact, explained by the P in GPT, or the formal name for these models. GPT stands for Generative Pre-trained Transformer, and our LLMs, which in essence are actually GPT models, are trained on a huge set of data. A way too much data, or so LeCun says. Why so?
When you compare this to a human, you find out that the way humans learn, or to put it in the same jargon, consume "data" and form patterns, seems way more efficient than an LLM. Current LLMs can think only by generating the next token, and that means the LLM is stuck to the language it learns by processing the data fed into it by us. Recent advancements like Reasoning and thoughts did improve it, but these LLMs are still prone to hallucinations and false information.
LeCun points out that generative models suffer from what is essentially a "high-dimensional nightmare." Think about it: if you're predicting the next word in a sentence, there's a finite number of words in a dictionary. But if you're predicting the next frame of a video, there are an infinite number of ways those pixels could be arranged. Trying to hit that exact needle in a haystack of infinity is why generative video often looks like a psychedelic fever dream when it gets things slightly wrong.
LeCun compares the (then, 2022 is 4 years ago) current AI systems with humans and highlights the drawbacks of these LLMs. He argues that, despite pitting against the best AI model we have, a human will always be better than it in driving and other 'trivial' tasks. With that, he puts forward his theory of animals and humans learning "world models". More on that later, but there are 3 questions that current AI research must answer, atleast according to LeCun:- 1. How can machines learn and act only largely through observation? 2. Can a machine reason and plan in a way that is compatible with gradient-based learning? 3. Can machines learn to think about stuff in multiple abstractions and time frames?
Yann says he has an answer with the model architecture he proposes.
Drawing Inspiration¶
So where do we start when making a smart model that imitates humans? Well, we imitate humans. Atleast the way we think we work, that is. LeCun begins with the idea that humans and animals interact the world using something called "world models" from Craik's 1943 paper titled "The Nature of of explanation". He also takes Bryson and Ho's 1969 paper on Applied Optimal Control and explains that predicting the next state using forward thinking models has been a standard in terms of control theory.
Fine, but what are we trying to achieve with these papers in the first place? LeCun proposes that everything we learn -- from looking at faces to communicating to making things -- are a result of us having a "world model" in our head, i.e. a representation of the world's behaviours upon which we act and make decisions. This, he calls "common sense knowledge", and that we have different heirarchial models, or understandings of the world. Given the below chart by Emmanuel Dupox, babies tend to learn concepts with lesser data than current models ever could. Maybe a concept like Object permanance could only take 960 hours of video data for current models, but as we work through skills that a human learns as they grow up, the time and data taken to learn the same things by a model grows exponentially in comparision.

Modelling the model¶
In order to build this supposed Autonomous, Intelligent model, we first break our own learning process into modules. Defining these modules with roles allows us to build upon the modules and then define their characteristics in more detail.
The Configurator Module¶
The configurator module is responsible for configuring the model's parameters and hyperparameters. It takes in a set of inputs and outputs and uses a set of rules to determine the best configuration for the model. It's responsible for ensuring that the model is optimized for the task at hand.
The Perception Module¶
The perception module's job is to take raw inputs and transform them into an internal representation (an embedding). This isn't just a simple compression; it’s about extracting the essential features that are relevant to the task, while discarding the "noise." For example, if you're driving, you care about the position of the car in front of you, but not necessarily the exact pattern of the license plate frame unless it's relevant to your goal.
The World Model Module¶
This is arguably the most critical part. The World Model is responsible for two things: 1. Estimating missing information about the current state of the world (filling in the blanks). 2. Predicting the future state of the world based on a proposed sequence of actions.
Instead of predicting every pixel of a future video frame (which is what a generative model might do), it predicts the representation of the next state. This is much more efficient and closer to how we think, simply because it isn't going through the whole process of creating thoughts (or states) as output. Instead, like thoughts that aren't said, the represntation is internal, which can then be decoded into output that can be used.
The Cost Module¶
The Cost module acts as the "objective" or the "critic." It measures how well the agent is doing. It typically consists of two parts: - Intrinsic Cost: Hardwired "drives" (like avoiding pain or hunger in animals). - Trainable Cost: Specific goals for the task at hand (like "reach the finish line"). The system's entire purpose is to find actions that minimize this cost over time.
The Actor Module¶
The Actor is the one that actually makes decisions. It proposes sequences of actions and uses the World Model to simulate what might happen. It then picks the sequence that results in the lowest cost according to the Cost module.
The Short-term Memory Module¶
This module keeps track of the recent history of states, actions, and costs. It provides the necessary context for the World Model to make accurate predictions about what comes next.
Energy-Based Models (EBMs)¶
Wait, before we move on, I should probably mention "Energy." In LeCun's world, an Energy-Based Model is just a way of saying how "compatible" two things are. High energy means they don't fit together (a bad prediction); low energy means they're a perfect match. The goal of the whole system is to minimize this energy. JEPA is effectively a non-generative EBM, meaning it doesn't create the next state, but rather finds a state-representation that has the lowest energy relative to the current one.
Hierarchical Planning (H-JEPA)¶
This is where it gets really cool. HUMANS don't plan at a single level. If I'm planning a trip from Bangalore to New York, I don't start by thinking about which foot to move first to get out of bed. I plan in hierarchies: 1. High level: Get to New York. 2. Mid level: Get to the airport. 3. Low level: Book a cab. 4. Action level: Open the Uber app.
JEPA handles this by having multiple layers of World Models working at different "time scales" and abstractions. Small World Models handle the quick, detailed actions, while High-level World Models handle the long-term goals.
Where does JEPA fit in all this?¶
If you've been reading carefully, you might have noticed I haven't really touched upon the name of the paper's main highlight: the Joint Embedding Predictive Architecture. So where does this fit in?
Well, JEPA is essentially the backbone of how the Perception and World Model modules actually operate. The term "Joint Embedding" refers to taking two things—say, an initial state (like the start of a video) and a future state (how the video looks a few seconds or frames later)—and passing both through an encoder. This encoder squishes them down into their fundamental, abstract representations (embeddings).
Once we have these representations, the World Model tries to predict the future representation from the current one. This is exactly what I meant earlier when I said it doesn't predict raw pixels. It operates entirely in the embedding space.
But there's an obvious catch. The real world is incredibly unpredictable. If I drop a piece of paper, I know it's going to hit the floor, but I cannot predict the exact way it will flutter down or the precise orientation it'll land in. A generative model would try to guess the exact pixels of that paper falling, inevitably mess up, and "hallucinate" a weird glitchy video. JEPA dodges this bullet entirely by only predicting the abstract state, i.e. "the paper is on the floor", and ignoring the noisy details.
This is solved using a latent variable. Think of this variable as a placeholder for the "unknowns" or missing information that we can't observe right now. By feeding different values of this latent variable into the World Model (alongside the current state's embedding), the model can generate a set of different plausible future representations. It's essentially the machine equivalent of thinking, "If the driver is running late, they might run the yellow light; if they are relaxed, they'll hit the brakes."
Avoiding the "Collapse"¶
Now, a common problem with these kinds of "embedding-only" models is something called Collapse. If you tell a model to just "predict a representation," the easiest way for it to win is to just output a constant value (like all zeros) for every single input. Boom, perfect prediction every time, right?
To stop this, JEPA uses a specific training objective. Instead of just trying to make the prediction match the target, it uses methods like VICReg (Variance-Invariance-Covariance Regularization). It basically forces the representations to be diverse and carry as much information as possible, effectively punishing the model if it tries to take the "lazy" zero-output way out.
Conclusion¶
So, is JEPA the ultimate solution that will get us to Autonomous Machine Intelligence? Yann LeCun definitely seems to think so, and honestly, the logic is pretty airtight. Learning how the world works by predicting abstract representations feels infinitely closer to actual biological learning than trying to forcefully predict the next token in a string of text. It's moving from being a "parrot" of language to being a "student" of the world.
All this being said, I've still got to get back to my LLMs from Scratch book. This detour was long, but arguably necessary for me to understand that the current generative AI hype train isn't the only (or even the best) track we can be on.
As I promised at the start, here are the links to some really good blogs that explain JEPA way better than I probably just did:-