Understanding Joint Embedding Predictive Architecture

This blog may be a little late to the picture, but a few days ago I stumbled upon a new paper about "VL-JEPA". A few quick searches led me back to the conception of the Joint Embedding Predictive Architecture idea, in a position paper titled A Path Towards Autonomous Machine Intelligence by Yann LeCun. I had recently picked up a textbook called "LLMs from Scratch" by Sebastian Raschka, and I had resolved not to take any detours until I finished the book and implemented a model. Yet here I am, reading a paper I lament is way too long (62 pages!).

Now, this might not be the best explanatory article on JEPA, and there are a few articles I've been referred to as really good reads on the topic. I've denied myself the opportunity of going through them (yet), and will only open them after I finish writing this blog. So I'll add them to the end of this post.

Setting Context

You might ask: "Why a new architecture when the current LLMs seem to be great enough already?" That question hits the bullseye of the stir this paper has caused, although only partially. Current Large Language Models work, in a broad oversimplification, by guessing the next word or token.

A more detailed explanation would involve defining a few things to truly understand what is happening:

- Tokens: the basic units of text
- Vectors: numerical representations of data, with many dimensions to capture semantic meaning
- Encoder: encodes raw text into tokens
- Decoder: decodes tokens back into raw text
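To make these definitions concrete, here's a toy sketch (entirely hypothetical: a whitespace "tokenizer" over a four-word vocabulary and a made-up embedding table, nothing like a real LLM's tokenizer or learned embeddings) showing how raw text becomes tokens and then vectors:

```python
# Toy vocabulary mapping words to token IDs.
vocab = {"the": 0, "cat": 1, "sat": 2, "<unk>": 3}

def encode(text):
    """Encode raw text into token IDs (the 'encoder' step)."""
    return [vocab.get(word, vocab["<unk>"]) for word in text.lower().split()]

def decode(token_ids):
    """Decode token IDs back into raw text (the 'decoder' step)."""
    id_to_word = {i: w for w, i in vocab.items()}
    return " ".join(id_to_word[i] for i in token_ids)

# Embedding: each token ID maps to a vector (here 4 dimensions; real
# models use hundreds or thousands, learned during training).
embedding_table = [[0.1 * (i + j) for j in range(4)] for i in range(len(vocab))]

tokens = encode("the cat sat")            # [0, 1, 2]
vectors = [embedding_table[t] for t in tokens]
```

In a real model the tokenizer operates on subwords rather than whole words, and the embedding table is learned, but the pipeline shape (text → token IDs → vectors) is the same.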

With that, let's try to understand what's happening behind current LLMs. First, the encoder takes the input and encodes it into tokens. These tokens are then embedded into vectors, which are sent to the model. Current models are Transformers, and they're implemented as decoders themselves; this is known as a decoder-only Transformer. This decoder, which we'll refer to as our "LLM" from now on, generates tokens one by one, which is essentially guessing the next word/token of the sentence. How does our LLM know what comes next? It employs a method called Attention, first explained in the paper Attention Is All You Need, to capture context from the input. Great, but how does it know what those words mean? That is, in fact, explained by the P in GPT, the formal name for these models. GPT stands for Generative Pre-trained Transformer, and our LLMs, which in essence are GPT models, are trained on a huge amount of data. Way too much data, or so LeCun says. Why so?
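The "generate tokens one by one" loop above can be sketched in a few lines. This is a minimal illustration of autoregressive generation with greedy decoding; `toy_model` is a stand-in I invented for a real decoder-only Transformer (a real one would run attention over the context and return a score per vocabulary token):

```python
def toy_model(context):
    # Fake "logits": a deterministic score for each of 5 vocabulary
    # tokens, so the loop structure is visible without a real network.
    return [(t + sum(context)) % 5 for t in range(5)]

def generate(prompt_ids, n_new_tokens):
    """Autoregressive generation: predict one token, append it, repeat."""
    context = list(prompt_ids)
    for _ in range(n_new_tokens):
        logits = toy_model(context)
        # Greedy decoding: pick the highest-scoring token.
        next_token = max(range(len(logits)), key=lambda t: logits[t])
        context.append(next_token)  # feed the new token back in
    return context

out = generate([1, 2], 3)
```

The key structural point is the feedback: each newly generated token becomes part of the input for the next prediction, which is exactly why an LLM's "thinking" is bound to the token-by-token channel LeCun criticizes.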

When you compare this to a human, you find that the way humans learn, or, to put it in the same jargon, consume "data" and form patterns, seems far more efficient than an LLM. Current LLMs can think only by generating the next token, which means the LLM is stuck with the language it learns by processing the data we feed into it. Recent advancements like reasoning and chains of thought did improve things, but these LLMs are still prone to hallucinations and false information.

LeCun compares the then-current (2022) AI systems with humans and highlights the drawbacks of these LLMs. He argues that, pitted against the best AI model we have, a human will always be better at driving and other 'trivial' tasks. With that, he puts forward his theory that animals and humans learn "world models". More on that later, but there are three questions that current AI research must answer, at least according to LeCun:

1. How can machines learn and act largely through observation?
2. Can a machine reason and plan in a way that is compatible with gradient-based learning?
3. Can machines learn to think about stuff at multiple levels of abstraction and time frames?

Yann says he has an answer with the model architecture he proposes.

Drawing Inspiration

So where do we start when making a smart model that imitates humans? Well, we imitate humans. At least the way we think we work, that is. LeCun begins with the idea that humans and animals interact with the world using something called "world models", drawn from Craik's 1943 work "The Nature of Explanation". He also takes Bryson and Ho's 1969 Applied Optimal Control and explains that predicting the next state using forward models has long been a standard technique in optimal control.

Fine, but what are we trying to achieve with these works in the first place? LeCun proposes that everything we learn, from recognizing faces to communicating to making things, is a result of us having a "world model" in our head, i.e. a representation of the world's behaviours upon which we act and make decisions. This he calls "common sense knowledge", and he argues we have different hierarchical models, or understandings, of the world. Per the chart below by Emmanuel Dupoux, babies tend to learn concepts with far less data than current models ever could. A concept like object permanence might take only 960 hours of video data for current models, but as we move through the skills a human learns growing up, the time and data a model needs to learn the same things grows exponentially in comparison.

Emmanuel Dupoux's chart

Modelling the model

In order to build this proposed autonomous, intelligent model, we first break our own learning process into modules. Defining these modules with roles allows us to build upon them and then define their characteristics in more detail.

The Configurator Module

The configurator module performs executive control. Given a task to carry out, it configures the other modules, adjusting their parameters for that task, and is responsible for ensuring the system as a whole is set up for the task at hand.