In figure 3, what is the advantage of pipelining over simply having each core run on separate data streams?
DKC: What makes RTL synchronous? HDL blocks are typically edge-triggered. They use a clock edge so that only one signal decides when state changes. Otherwise you get glitches, which occur when you try to detect the "edge" of combinational logic. Asynchronous data transfers require deskewing the bits. That means waiting long enough for the slowest bit to arrive, and delays are not well controlled in silicon. How do you expect the FSM state changes to be triggered? HDL is compiled to RTL before anything useful happens, so if the latest silicon does not handle RTL, it is useless.
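To make the glitch point concrete, here is a small discrete-time simulation. It is a sketch built on hypothetical parameters: unit time steps, an inverter with a 2-step delay, and a 20-step clock. The output Y = A AND (NOT A) should always be 0, but the inverter delay opens a short window where both inputs are high. A naive edge detector on Y fires on that pulse. A flip-flop that samples Y only at clock edges never sees it.

```python
# Discrete-time sketch: reacting to edges on combinational logic vs.
# sampling it on a clock edge. Hypothetical parameters for illustration:
# unit time steps, a 2-step inverter delay, and a 20-step clock period.

INV_DELAY = 2
CLOCK_PERIOD = 20
T_END = 60


def a_at(t):
    """Input A: rises from 0 to 1 at t = 10 and stays high."""
    return 1 if t >= 10 else 0


def not_a_at(t):
    """Inverter output: NOT A, delayed by INV_DELAY steps."""
    return 1 - a_at(t - INV_DELAY)


def y_at(t):
    """Combinational AND of A and the delayed NOT A.
    Logically A & ~A is always 0, but the delay creates a short pulse."""
    return a_at(t) & not_a_at(t)


def rising_edges(signal, t_end):
    """Times where the signal goes 0 -> 1 (naive edge detection)."""
    return [t for t in range(1, t_end) if signal(t - 1) == 0 and signal(t) == 1]


def clocked_samples(signal, period, t_end):
    """(time, value) pairs a flip-flop would capture at each clock edge."""
    return [(t, signal(t)) for t in range(0, t_end, period)]


glitch_edges = rising_edges(y_at, T_END)
samples = clocked_samples(y_at, CLOCK_PERIOD, T_END)

print("rising edges seen on Y:", glitch_edges)  # [10]
print("flip-flop samples of Y:", samples)       # [(0, 0), (20, 0), (40, 0)]
```

The clocked version only works because the pulse dies out well before the next clock edge. A clock edge landing at t = 10 or 11 would capture the bad value, which is why synchronous design comes with setup/hold timing constraints.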
You can also just add the language features of HDLs (Verilog/VHDL) to (say) C++, and get a not-so-new language that handles data-flow & event-driven programming -
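The core HDL features in question are signals whose writes take effect later, processes with sensitivity lists, and a scheduler that runs delta cycles until nothing changes. The following is a minimal sketch of such a kernel, written in Python rather than C++ for brevity. The class and method names (`Signal`, `Simulator`, `process`, `run`) are hypothetical, not taken from any existing library.

```python
class Signal:
    """A signal whose writes are deferred to the end of the delta cycle,
    like a non-blocking assignment in an HDL."""

    def __init__(self, sim, name, value=0):
        self.sim, self.name = sim, name
        self.value = self.next_value = value
        self.watchers = []  # processes sensitive to this signal

    def write(self, value):
        self.next_value = value
        self.sim.pending[self] = None  # dict used as an ordered set


class Simulator:
    """Tiny delta-cycle kernel: evaluate runnable processes, then commit
    pending writes and wake processes sensitive to signals that changed."""

    def __init__(self):
        self.pending = {}   # signals written during the current delta
        self.runnable = []  # processes to evaluate in the next delta
        self.trace = []     # (delta, signal name, new value) per committed change

    def signal(self, name, value=0):
        return Signal(self, name, value)

    def process(self, fn, *sensitivity):
        for s in sensitivity:
            s.watchers.append(fn)
        self.runnable.append(fn)  # evaluate once at start-up

    def run(self, max_deltas=100):
        delta = 0
        while self.runnable or self.pending:
            if delta >= max_deltas:
                raise RuntimeError("no convergence: possible combinational loop")
            for fn in self.runnable:  # evaluate phase
                fn()
            self.runnable = []
            woken = []
            for s in self.pending:  # update phase
                if s.next_value != s.value:
                    s.value = s.next_value
                    self.trace.append((delta, s.name, s.value))
                    woken.extend(s.watchers)
            self.pending = {}
            self.runnable = list(dict.fromkeys(woken))  # wake each process once
            delta += 1
        return delta


sim = Simulator()
a, b = sim.signal("a"), sim.signal("b")
s, c = sim.signal("sum"), sim.signal("carry")

# A half adder described as two concurrent processes with sensitivity lists.
sim.process(lambda: s.write(a.value ^ b.value), a, b)
sim.process(lambda: c.write(a.value & b.value), a, b)

sim.run()               # settle the initial state (all zeros)
a.write(1)
b.write(1)              # stimulus: both inputs change "at the same time"
deltas = sim.run()
print(s.value, c.value)  # 0 1
print(sim.trace)         # [(0, 'a', 1), (0, 'b', 1), (1, 'carry', 1)]
```

Because `a` and `b` commit in the same delta, the `sum` process sees both new values at once and never produces a transient 1. The deferred-update rule gives the same effect as sampling on an edge.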
The main issue is that neither shared-memory nor synchronous (RTL) design styles work efficiently on the latest silicon. Going forward, hardware design and software design are going to look very similar: asynchronous communication, FSMs, and lots of threads.
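As an illustration of that style in software, here is a minimal sketch. It uses two threads, each written as an explicit FSM, that talk only through message queues using a request/acknowledge handshake, with no shared clock or shared state. The function names, state names, and handshake protocol are illustrative assumptions. Python's `queue.Queue` is itself built on locks, so this shows the programming model, not a lock-free implementation.

```python
import queue
import threading


def sender(items, to_rx, from_rx, log):
    """Sender FSM: SEND -> WAIT_ACK -> SEND ... -> DONE.
    Each item is held until the receiver acknowledges it."""
    state, i = "SEND", 0
    while state != "DONE":
        if state == "SEND":
            if i == len(items):
                to_rx.put(None)  # end-of-stream marker
                state = "DONE"
            else:
                to_rx.put(items[i])
                state = "WAIT_ACK"
        elif state == "WAIT_ACK":
            ack = from_rx.get()  # blocks until the receiver answers
            log.append(("ack", ack))
            i += 1
            state = "SEND"


def receiver(to_rx, from_rx, out):
    """Receiver FSM: IDLE -> ... -> DONE. It reacts to messages rather than a clock."""
    state = "IDLE"
    while state != "DONE":
        msg = to_rx.get()
        if msg is None:
            state = "DONE"
        else:
            out.append(msg * 2)  # stand-in for real work
            from_rx.put(msg)     # acknowledge this item


to_rx, from_rx = queue.Queue(), queue.Queue()
log, out = [], []
threads = [
    threading.Thread(target=sender, args=([1, 2, 3], to_rx, from_rx, log)),
    threading.Thread(target=receiver, args=(to_rx, from_rx, out)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(out)  # [2, 4, 6]
print(log)  # [('ack', 1), ('ack', 2), ('ack', 3)]
```

The handshake makes the result deterministic regardless of thread scheduling. This is the software analogue of the bundled-data/acknowledge protocols used in asynchronous hardware.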