this is a very good question.
When working with fully asynchronous design (delay insensitive approach), latches/keepers are used instead of conventional clocked Flip-Flops.
In this kind of circuit, dimensioning loop gains by controlling CMOS transistor parameters is critical to minimize the probability of reaching a metastable state. When working over COTS devices, the transitor customization option simply dissapears so, as you note, there are limitations to the set of asynchronous methodologies that can be implemented in an optimal (and secure!!) way.
As stated in the article, the design methodology used in the AsyncArt project is mostly inspired in the Sutherland's micropipeline. This kind of circuits relies in the bundled-data approach, in which the datapath is implemented with conventional digital logic and only the data flow control is constructed with delay insensitive asynchronous logic.
There are plenty of Flip-Flop resources in any FPGA, so these pieces of logic are used intensively in our designs not only for storing datapath values, but even for keeping the asynchronous dataflow control states too.
By this way, LUT based asynchronous logic is in charge of generating perfectly coordinated clock shots (or bursts) that feed the clock input of different Flip-Flop domains when the associated datapath segment need to perform any task.
In order to verify the correct behaviour of these FF + LUT based design approach, intensive stress tests have been conducted in several FPGA devices. In these tests, the devices were left running at maximum speed for more than a week and no failure was detected.
It's interesting to note that not only RAM LUT based devices have been tested (Xilinx's Spartan/Virtex & Altera's Cyclone): FLASH LUT based devices performed correctly too (Microsemi ‘s -formerly Actel- ProAsic/Fusion).
I have designed a few self-timed logic circuits, and as long as the loop gains are sufficient they work fine. As far as I know, it is the only way to completely eliminate the probability for metastate.
However, I never made self-timed logic in an FPGA, because the LUT is implemented using a small RAM. This could generate unpredictable spikes on the output when more than one address bit changes concurrently. How do you avoid this?