Kahn Process Networks, and a teeny bit of Systolic Arrays

I began today by reading about Systolic Arrays, and the wikipedia page mentioned KPNs, which caught my eye. Hoping it wasn't another one of Neural networks, I clicked on the link. And boy was I surprised.

Kahn Process Network is a model of parallel computation proposed by Gilles Kahn in 1974. It is special because it is deterministic, and no matter in what order the processes are executed, the output will be the same! Let's delve into how a KPN works.

We can visualise KPN as a directed graph, where the nodes are the processes, and the edges are infinite capacity FIFO queues (pipes). These pipes hold 'tokens', which are basically chunks of data on which the nodes perform operations on. Now, this architecture follows a few rules:-

Non blocking writes - A process can always write to its output pipe regardless of the situation.
Blocking reads - A process can only read from an input pipe when it has a token to read. Otherwise it is suspended from executing.
No polling - A process cannot check if the pipe has data; it must commit to reading a token, and if there is no token, it must wait.

These rules effectively mean that a process can only be blocked from running by data (or lack thereof) only. This is what makes KPN deterministic; as long as the sequence of reads and writes are the same, the output will be deterministic. The order of execution doesn't matter, because reads and writes only depend on the data itself, not the order in which it arrives.

This property, in mathematics, is called Monotonicity. Gilles Kahn, using Domain Theory, essentially proved that the mapping from input histories to output histories is monotonic and continuous, and hence solving for a fixed point must yield a solution to the computation, and the KPN is effectively the least fixed point! I'm quite sure this is quite a bit to wrap your head around, so I'm linking his paper here.

All's cool so far, but we've kinda hit a problem: In order to actually implement this, we'd need to get around the infinite capacity FIFO queues, which is not possible to make in real life. To get around this, implementations have a fixed bound for these queues, and upon them filling up, Parks' Algorithm is used to increase the capacity of that particular queue until a write is possible. This is a deadlock.

But wait! There are 2 kinds of deadlocks in a KPN:

True Deadlock: Occurs when there is a design flaw in the KPN, where there is a cyclic dependency in the dataflow.
Artificial Deadlock: Occurs when the memory is run out, and writes aren't possible.

KPNs are usually implemented in FPGA boards, and my understanding is that since each of these process nodes are independent of each other, they are implemented as their own core with their own buffers/memory, etc. And since you can't modify a silicon, extensive analysis is done, with static analysis where mathematical models are made, and then cross checked with HLS (High Level Synthesis) simulations to identify the maximum memory requirement.

Also, instead of relying on writing their own handshake protocol for the pipes, developers apparently rely on point to point protocols like AXI4-Stream, which focuses on transferring continuous streams of data between a master (producer) and a slave (consumer). AXI4-Stream is part of ARM's AMBA 4 specification, but I haven't really gone through too much about this.

And then, feeling really out of depth the moment FPGA, HLS, HDLs, AMBA came up, I promptly went back to systolic arrays. Systolic Arrays are "homogeneous networks of tightly coupled data processing units (DPUs) called cells or nodes." Systolic Arrays came up in my course work as an example for MISD in Flynn's Taxonomy, although the wikipedia article says it's debatable to call it MISD. Setting that aside, I had a question: If operations done in a systolic array are usually hardcoded, how do we have GPUs running multiple tensor operations? Is the hardcoding not really necessary?

And then I found out: They are hardcoded, but these systolic arrays in GPUs are present inside a Streaming Multiprocessor (SM). Each SM contains additional components that complement the systolic arrays in performing operations, and these SMs must be programmed to utilise them. The SMs on Nvidia's GPUs, called Tensor Cores (introduced in the Volta Architecture), are accessible through CUDA using WMMA (Warp Matrix Multiply Accumulate) instructions. Google's Tensor Processing Units, on the other hand, take a different approach. Instead of relying on a lot of these cores containing small systolic arrays, they have one or two massive systolic arrays (the MXU, of size 256*256) that perform matrix multiplications.

Google's TPUs are CISC (Complex Instruction Set Computer) processors. It is called as a coprocessor, and it executes large sets of instructions sent by the CPU. The code for a TPU is compiled into it's native instruction set, using a compiler called XLA (Accelerated Linear Algebra). (Nvidia GPUs use Nvidia Cuda Compiler (NVCC) to compile code).

Today's deep dive opened up too many topics to look into, but there is one that I'd really like to look into: Communicating Sequential Processes, which are pretty similar to KPNs. I also did want to read the paper (lecture?) on Systolic Architectures by H.T. Kung, and hopefully I do by the end of this week!