Editor’s Note: This article is reproduced from Xcell Journal with the kind permission of Xilinx.
The year was 1976. Disco was still popular, the Cold War was in full swing and I wouldn’t even be born for another nine years when the Cray-1 burst onto the computing scene.
At the time, personal computing was barely in its infancy (the MITS Altair had been introduced a year earlier), and companies like Control Data Corp. and IBM dominated the high end. The Cray-1 was one of those legendary machines that helped define the term “supercomputer” in the public imagination. Its iconic C-shaped structure housed a fire-breathing machine running at 80 MHz – something desktops wouldn’t reach until almost two decades later. The Cray had speed. It had style.
Now let’s fast-forward 33 years, to the morning in early 2009 when I woke up and just decided I wanted to own one.
I first got into FPGA-based retro-computing, something I lovingly refer to as “computational necromancy,” shortly after graduating from the University of Southern California with a BSEE in December 2007. As a newly minted electrical engineer and all-around fan of arcane computer architectures, I saw this pursuit as the perfect excuse to improve my Verilog skills. Starting with a Digilent Spartan-3E 1200 board that I bought myself as a graduation present, my first machine was another abandoned relic of the 1980s, the NonVon-1. This was one of the first “massively parallel” machines, similar to the more successful Connection Machine series of the same vintage, although geared more toward databases. It was a wonderfully odd machine, composed of a binary tree of 8-bit processors (with 1-bit ALUs).
After a few months of tinkering, I eventually found myself the proud owner of a 31-node supercomputer dwarfed in computing power by any modern wristwatch. As useless as it was, however, the machine made me realize just how far Moore’s Law has brought us. And it whetted my appetite for more.
After my success with the NonVon-1, I was casting around for a new project (and my Verilog skills were still a bit lacking). I realized that low-end FPGAs had grown to the point that they could handle some pretty serious hardware; even 32-bit “soft” processors are fairly common these days. Searching about for a new target to try to revive, I considered a few. The UNIVAC is an interesting machine, but it’s a bit too old for me. Digital Equipment Corp.’s PDP series has been emulated before. Simulators for Z80 machines are commonplace. That’s where the Cray comes in.
What is the Cray-1?
The Cray-1 was Seymour Cray’s first machine after splitting off from Control Data and founding his own company, Cray Research, in the early 1970s. It was a ruthless number cruncher that required a room full of computers and disks to keep it fed with data. It also had a full-time staff of engineers just to keep it running, and nearly required its own power plant just to boot up. This is a machine that redefined the term “supercomputer” (I mean, it’s a Cray) – and, fortunately, it’s also beautifully simple in its design. Thankfully, it’s incredibly well-documented too (Figure 1). The Cray-1 Hardware Reference Manuals (readily available on the Internet) go into a level of detail that’s almost shocking to modern-day readers used to being handed black boxes. Nearly every op-code, register and timing diagram is documented in exquisite detail.
Figure 1. Fortunately for hobbyists, the Cray architecture is beautifully simple in its design and very well-documented.
The computer itself is a 64-bit, pipelined processor with in-order instruction issue and a mere 128 unique instructions. It has a very RISC-like instruction set, with all instructions being either between memory and registers (load or store instructions) or between two operand registers and a destination register (all arithmetic/logic instructions). Instructions are either 16 or 32 bits long. The machine uses three different types of registers: address, scalar and vector registers. The address registers are 24 bits wide and let the machine address up to 4 Megawords (32 Mbytes) of main memory. The scalar registers, which are 64 bits wide, are used for computation. Each vector register contains sixty-four 64-bit registers, giving the machine great performance when doing scientific calculations on large matrices.
Inside the CPU, instructions can be issued to 13 independent, fully pipelined “functional units.” Heavy pipelining was crucial to achieving the Cray’s insanely high (for the time) 80-MHz clock frequency. Separate functional units handle logical operations, shifting, multiplication and so on. A floating-point multiply instruction, for instance, takes seven cycles to complete, but the computer can issue a new multiply instruction on every cycle (assuming no register conflicts exist). An interesting consequence of this design is that there is no “divide” instruction. Instead, the machine uses “division by reciprocal approximation.” Rather than computing X / Y, you compute (1 / Y) * X. A separate floating-point “reciprocal approximation” functional unit can calculate a reciprocal in 14 clock periods.
The Marathon
When I first began working on this project, I still hadn’t convinced myself it would be possible to re-create such a sophisticated computing machine by myself. The original Cray-1 took a whole team of people years to design and build. Was I motivated enough to stick with it? (As it turned out, yes, I was.) Was my FPGA big enough to actually fit it? (As it turned out, no, it was not.) Even if the design is fairly straightforward, it’s still a large design (currently ~5,600 lines of Verilog and counting). I just had to get myself into the right mind-set. Building your own supercomputer is a marathon, not a sprint. I could only hope to accomplish it one step at a time.
I started, one by one, with creating the functional units. Like building your own hot rod, building a complete computer gets you acquainted with every aspect of a design in a way you would rarely experience otherwise. I explored multiplier and adder design. I reopened textbooks on floating-point arithmetic. I learned how to use three iterations of the Newton-Raphson method to compute a reciprocal approximation to 30 bits of accuracy (did I mention how detailed the hardware reference manual is?).
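To make the reciprocal trick concrete, here is a minimal Python sketch of Newton-Raphson reciprocal approximation and of division as (1 / Y) * X. It is an illustration under stated assumptions, not the Cray's circuit or my Verilog: it uses ordinary Python floats rather than the Cray-1 floating-point format, the linear starting guess is a common textbook choice rather than the one in the hardware reference manual, and the function names `reciprocal_approx` and `divide` are made up for this example.

```python
import math


def reciprocal_approx(y: float, iterations: int = 3) -> float:
    """Approximate 1/y using only multiplies and subtracts."""
    if y == 0.0:
        raise ZeroDivisionError("reciprocal of zero")
    # Split |y| into a mantissa m in [0.5, 1) and an exponent e: |y| = m * 2**e.
    m, e = math.frexp(abs(y))
    # Textbook linear seed for 1/m on [0.5, 1). This is an assumption for the
    # sketch, not the seed used by the real hardware.
    x = 48.0 / 17.0 - (32.0 / 17.0) * m
    for _ in range(iterations):
        # Newton-Raphson step for f(x) = 1/x - m. Each step roughly
        # doubles the number of correct bits.
        x = x * (2.0 - m * x)
    # Undo the scaling, 1/|y| = (1/m) * 2**-e, then restore the sign of y.
    return math.copysign(math.ldexp(x, -e), y)


def divide(x: float, y: float) -> float:
    """Compute x / y as (1 / y) * x, as a machine without a divider would."""
    return reciprocal_approx(y) * x
```

With this particular seed the relative error starts at no more than 1/17 and is squared by every step, so three steps bring it below 2^-30, which is consistent with the 30 bits quoted above.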
One by one, the functional units took shape. This was a strictly “free-time” project, so progress came in fits and starts. I started with the easiest blocks first, and finished the two address functional units (a simple adder and a multiplier) without much difficulty. My momentum started to falter as I tackled the scalar functional units (an adder, a logical unit, a shifter and a population/leading zero count). I hit a low point in my motivation as I fiddled with the three floating-point functional units (an adder, a multiplier and the infamous reciprocal approximation unit). As I said, this was a marathon, not a sprint. I started working on the Cray-1 in early 2009 and probably spent 19 to 20 months total on it.
I started to get my second wind toward the end of the floating-point units’ design, and regained steam as I moved on to the vector units. As I mentioned earlier, the Cray-1 was designed as a number-crunching behemoth. It has eight vector registers, each of which holds sixty-four 64-bit registers. When a vector instruction executes, say an addition operation, one entry from each operand will be added and stored in a third (result) vector on every cycle.
An awesome feature that the Cray-1 supports is called “vector chaining.” The vector add unit, for instance, only takes three cycles to generate the first result. If we’re adding two 64-entry vectors together, however, we don’t want to wait for all 64 entries to finish adding before we do something with the result. Vector chaining allows us to “chain” the result coming out of the adder unit straight into the input of another unit, without waiting for the operation to finish. We can start multiplying the result with a third vector two cycles after the first result is available. For some large matrix calculations, you could almost sustain two floating-point operations per clock cycle – at 80 MHz, that’s a peak rate of 160 MFLOPS! Common desktop computers didn’t catch up to the Cray-1 until the mid-1990s.
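The benefit of chaining is easy to see with a back-of-the-envelope timing model. The Python sketch below is a deliberately simplified assumption: it ignores instruction issue, memory traffic and register conflicts, treats each unit as a pipeline that delivers one result per cycle after a fixed latency, and reads the "two cycles after the first result" remark as a fixed chain delay. The function names are hypothetical and the latencies are the ones quoted in this article.

```python
def pipeline_cycles(n: int, latency: int) -> int:
    """Cycles for a pipelined unit to deliver n results, one per cycle.

    The first result appears after `latency` cycles and the last one
    arrives n - 1 cycles after that.
    """
    return latency + n - 1


def unchained_cycles(n: int, add_latency: int, mul_latency: int) -> int:
    """The multiply starts only after the entire add vector is finished."""
    return pipeline_cycles(n, add_latency) + pipeline_cycles(n, mul_latency)


def chained_cycles(n: int, add_latency: int, mul_latency: int,
                   chain_delay: int) -> int:
    """The multiply starts chain_delay cycles after the first add result."""
    return add_latency + chain_delay + pipeline_cycles(n, mul_latency)


def mflops(n: int, cycles: int, clock_mhz: float = 80.0) -> float:
    """Throughput when every element gets one add and one multiply."""
    return 2 * n * clock_mhz / cycles


# Figures taken from the article: 64-element vectors, a 3-cycle vector add,
# a 7-cycle multiply, and a 2-cycle delay before the chained unit can start.
N, ADD_LATENCY, MUL_LATENCY, CHAIN_DELAY = 64, 3, 7, 2
slow = unchained_cycles(N, ADD_LATENCY, MUL_LATENCY)
fast = chained_cycles(N, ADD_LATENCY, MUL_LATENCY, CHAIN_DELAY)
```

In this toy model a single add-then-multiply pair takes 136 cycles unchained and 75 cycles chained. The chained figure still falls short of two operations per cycle because of the start-up latency; the rate only approaches the 160-MFLOPS peak when long vector operations run back to back.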
With the functional units in place, I could almost see the light at the end of the tunnel. Surely it was just a matter of adding in a bit of glue logic and being done with it, right? Well, close. It turns out there’s a lot of glue logic. Even though the Cray-1 is well-documented, it’s not that well-documented. I knew exactly what every instruction was supposed to do, but I got stuck reverse-engineering minor (and not-so-minor) details like instruction issuing, hazard detection and vector chaining. Some things, like big 64-bit data buses, are probably easier to build with discrete logic chips than with FPGAs designed for narrower datapaths. The vector registers gave me a routing headache.
I also had to fudge a few features. The Cray-1 had a 16-bank all-SRAM memory system the size of my refrigerator that could sustain 640 Mbytes/second of bandwidth (one 64-bit word per cycle at 80 MHz) to its 4 Megawords of memory, something the measly DDR memory chip on my development kit could never approach. I wound up using nearly all of my FPGA’s on-die Block RAM to scrounge together a mere 4 kilowords of memory space, by far my Cray’s biggest limitation at the moment. And I had to leave a few features out altogether: DMA-style I/O channels designed to communicate with disk drives and “host” minicomputers, and rapid context-switching support. These might make it back in once I get a useful amount of memory and a bit of software for the machine.
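For readers who want to check the memory figures, the arithmetic can be written out in a few lines of Python. This is only a sanity check; it assumes that "Megawords" and "kilowords" are binary multiples (2^20 and 2^10 words), and the constant and function names are invented for the example.

```python
WORD_BYTES = 8          # one 64-bit word
CLOCK_HZ = 80_000_000   # 80-MHz clock


def bandwidth_bytes_per_second(words_per_cycle: int = 1) -> int:
    """Peak memory bandwidth when a fixed number of words moves each cycle."""
    return words_per_cycle * WORD_BYTES * CLOCK_HZ


def memory_bytes(words: int) -> int:
    """Capacity in bytes of a memory holding the given number of 64-bit words."""
    return words * WORD_BYTES


# Assumed binary multiples: 4 Megawords = 4 * 2**20, 4 kilowords = 4 * 2**10.
CRAY_WORDS = 4 * 2**20
FPGA_WORDS = 4 * 2**10

cray_bandwidth = bandwidth_bytes_per_second()   # 640,000,000 bytes/second
cray_capacity = memory_bytes(CRAY_WORDS)        # 32 Mbytes
fpga_capacity = memory_bytes(FPGA_WORDS)        # 32 Kbytes
shortfall = CRAY_WORDS // FPGA_WORDS            # how many times smaller
```

Under those assumptions, the Block RAM version holds 1,024 times less memory than a fully loaded Cray-1.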