Design Article
How to turn every FPGA LVDS pair into a complete SERDES solution
Clive Maxfield
9/26/2007 2:37 PM EDT
Just a couple of days ago as I pen these words, I was chatting with Bryan Hoyer from Align Engineering. After slaving away for years in their secret underground bunker, these little rapscallions have just come out of "Stealth Mode". As part of their public launch, the folks at Align have announced a patented breakthrough technology called Align Lock Loop (ALL), which allows every LVDS input/output (I/O) pair in an FPGA to be used as a complete SERDES (Serializer/Deserializer) solution. This forms the basis for implementing fast, simple, and very affordable chip-to-chip and board-to-board communication without using large numbers of I/O pins and without involving intensive engineering that makes your eyes water.
In fact, I was so excited about the ALL concept that I decided to pen this brief technology introduction/backgrounder. Bryan promises that Align will follow this with a full-up "How To" article in the not-so-distant future (after the "proof-in-silicon" technology demonstration that is currently planned for sometime in Q4 2007).
Phase Lock Loops (PLLs) and Clock Data Recovery (CDR)
Before we leap into the fray with gusto and abandon, it's well worth spending a few moments reminding ourselves of the role played by PLL and CDR functions, because these concepts will be important to our future discussions.
A PLL is a closed-loop electronic control system/function that can be used for frequency control by generating an output signal with a fixed relation to the phase of an input ("reference") signal. For the purposes of these discussions, both the input and output signals will be considered to be clock signals. The simplest form of PLL generates an output clock with the same frequency and phase as the input clock (Fig 1).

1. A generic PLL.
By means of a feedback path coupled with a phase detector, the PLL responds to both the frequency and the phase of the input signal, automatically raising or lowering the frequency of a controlled oscillator until it is synchronized to the input/reference signal in both frequency and phase.
As described in an associated Wikipedia Article, a good analogy is the tuning of a string on a guitar. Using a tuning fork as a reference signal, the tension of the string is adjusted up or down until the beat frequency is inaudible, thereby indicating that both the tuning fork and the guitar string are vibrating at the same frequency. If the guitar string is perfectly tuned and in phase with the tuning fork and maintained there, it may be described as being in phase-lock with the fork.
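For readers who like to see such things in motion, the feedback loop described above can be sketched as a toy simulation. This is a minimal phase-domain model and not a description of any real PLL circuit; the function name, the loop gains (kp, ki), and the frequencies (expressed in cycles per simulation step) are all assumptions chosen purely for illustration.

```python
def simulate_pll(f_ref=0.010, f_start=0.009, kp=0.1, ki=0.005, steps=2000):
    """Toy phase-domain PLL. Frequencies are in cycles per time step and
    phases are in cycles; every numeric value here is an assumption."""
    ref_phase = 0.25      # reference starts a quarter cycle ahead
    osc_phase = 0.0
    integrator = 0.0      # accumulated frequency correction
    f_osc = f_start
    err = 0.0
    for _ in range(steps):
        # Phase detector: difference wrapped into [-0.5, 0.5) cycles.
        err = (ref_phase - osc_phase + 0.5) % 1.0 - 0.5
        # Loop filter (proportional + integral) steers the oscillator.
        integrator += ki * err
        f_osc = f_start + kp * err + integrator
        # Both clocks advance by one time step.
        ref_phase = (ref_phase + f_ref) % 1.0
        osc_phase = (osc_phase + f_osc) % 1.0
    return f_osc, err


f_locked, phase_error = simulate_pll()
print(f"oscillator frequency: {f_locked:.6f} cycles/step, "
      f"phase error: {phase_error:.2e} cycles")
```

The oscillator starts 10% slow and a quarter cycle out of phase; the phase detector's error signal raises its frequency until both the frequency and the phase match the reference.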
Now, to the uninitiated, this may not at first appear to be too mind-bending. A common knee-jerk reaction is "woopee-doo-dah" – we've used a lot of complex circuitry to take an existing clock signal and generate a new one that looks exactly the same. But, of course, there's a lot more to it than this. Consider the case of "jitter" for example, in which the rising and falling edges of the input clock may wander back and forth slightly. The PLL removes this jitter and generates a nice "clean and shiny" clock signal.
Sidebar: How important is jitter? Very! Consider an Analog-to-Digital Converter (ADC), whose role in life is to sample data at specific times, for example. Now assume that this sampled data is fed into a Digital Signal Processing (DSP) algorithm/function, which works on the assumption that all of the samples are taken at regular intervals as determined by some clock signal. Not surprisingly, if the clock signal is subject to jitter, the overall quality of the resulting data is degraded. In order to address this issue, some "clock cleaner" PLL chips/functions can reduce jitter down into the femtosecond range.
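To put a rough number on this, a commonly used textbook approximation (not something taken from Align's material) says that for a full-scale sine wave input, the best signal-to-noise ratio an ADC can achieve in the presence of RMS clock jitter is SNR = -20·log10(2π · f_in · t_jitter). The sketch below simply evaluates that formula; the function name and the example frequency and jitter values are assumptions.

```python
import math


def jitter_limited_snr_db(f_in_hz, jitter_rms_s):
    """Upper bound on ADC SNR set by sampling-clock jitter alone
    (standard approximation for a full-scale sine wave input)."""
    return -20.0 * math.log10(2.0 * math.pi * f_in_hz * jitter_rms_s)


# Assumed example: a 100 MHz input sampled with 1 ps and then 100 fs of RMS jitter.
for jitter in (1e-12, 100e-15):
    snr = jitter_limited_snr_db(100e6, jitter)
    print(f"jitter = {jitter:.0e} s -> SNR limit = {snr:.1f} dB")
```

Every tenfold reduction in jitter buys another 20 dB of headroom, which is why those femtosecond-class clock cleaners earn their keep.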
And, of course, there's more, because the output frequency from the PLL may be a higher (multiplied) or lower (divided) version of the input frequency.
Sidebar: As a simple example of why we might wish to use a multiplying PLL function, consider the case where an ADC chip is required to sample data at 100 MHz. Also consider that several chips may be driven by a common clock. One approach would be to use a 100 MHz clock generator to feed the various chips, but running a 100 MHz clock signal around a board isn't a lot of fun. An alternative would be to employ a 25 MHz clock generator at the board level, and for each of the chips to use a 4× PLL to generate 100 MHz signals for internal consumption.
In fact, there are a variety of PLL functions as follows (these are presented in terms of increasing complexity, which equates to more gates/transistors, silicon area, power consumption, and so forth):
- Base-Level PLL: In this, the simplest case, the output from the PLL has the same frequency and phase as the input signal (the phase can be adjusted as required by means of the feedback path).
- Integer Mult (m) PLL: The output frequency from the PLL is some integer multiple 'm' of the input frequency.
- Integer Div (n) PLL: The output frequency is the input frequency divided by some integer 'n'.
- Integer Mult/Div (m/n) PLL: The output frequency is the input frequency multiplied by an integer 'm' and divided by an integer 'n' (this is often achieved by using two or more PLLs in tandem).
- Fractional Mult/Div (m/n) PLL: As for the previous case, except that the multiplication 'm' and division 'n' values may be real/fractional values.
- Clock Data Recovery (CDR) PLL: In this case, a clock signal is embedded in – and recovered from – a stream of data.
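The frequency relationships in the list above boil down to f_out = f_in × m / n. As a quick illustration, here is a small helper (the function name is hypothetical) that uses exact fractions so that fractional m and n values don't pick up rounding errors.

```python
from fractions import Fraction


def pll_output_hz(f_in_hz, m=1, n=1):
    """Output frequency of an idealized PLL: f_out = f_in * m / n.
    m and n may be integers or Fractions (for the fractional case)."""
    return Fraction(f_in_hz) * Fraction(m) / Fraction(n)


print(pll_output_hz(25_000_000, m=4))         # the 4x multiplier from the sidebar
print(pll_output_hz(100_000_000, n=4))        # integer divide
print(pll_output_hz(25_000_000, m=5, n=2))    # integer mult/div
print(pll_output_hz(25_000_000, m=Fraction(7, 2)))  # fractional multiplier
```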
As we shall see, the CDR case is of particular interest to us in the context of these discussions. The idea here is that, as opposed to having separate clock and data signals, the clock is embedded in (and can be derived from) the data stream itself.
As a starting point, let's consider a data stream that consists of alternating 0s and 1s (e.g. 010101010101. . .) as illustrated in Fig 2(a), where each 0-to-1 and 1-to-0 transition occurs on a clock edge from a reference clock that is embedded in the transmitter.
Obviously, recovering the clock is not too complex a task in this case (to all intents and purposes, this data stream is the clock).

2. Embedding the clock in the data stream.
By comparison, consider the more complex data stream illustrated in Fig 2(b). Once again, any data transitions between 0 and 1 values (and vice versa) occur on "clock edges" corresponding to a reference clock that is embedded in the transmitter, but the data stream itself may comprise a "random" sequence of 0s and 1s. In this case, the CDR function embedded in the receiver will have to be much more sophisticated.
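To get a feel for what "recovering a clock from the data" means, consider the following sketch. Note that this is a deliberately simplified, purely digital oversampling scheme rather than the PLL-based CDR found in real silicon; the oversampling ratio and all of the function names are assumptions. The receiver doesn't know where the bit boundaries fall, so it looks at where the transitions occur and then samples halfway between them.

```python
from collections import Counter

OSR = 8  # receiver samples per bit period (an assumed oversampling ratio)


def line_samples(bits, skipped):
    """Hold each bit for OSR samples, then drop `skipped` leading samples so
    the receiver starts at an unknown point within a bit."""
    samples = [b for b in bits for _ in range(OSR)]
    return samples[skipped:]


def recover_bits(samples):
    """Estimate the bit boundaries from the transitions, then sample mid-bit."""
    edges = Counter(i % OSR for i in range(1, len(samples))
                    if samples[i] != samples[i - 1])
    if not edges:
        raise ValueError("no transitions: nothing to recover a clock from")
    boundary = edges.most_common(1)[0][0]   # where edges fall, modulo one bit
    first = (boundary + OSR // 2) % OSR     # first mid-bit sampling point
    return samples[first::OSR]


sent = [0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1]
received = recover_bits(line_samples(sent, skipped=3))
print(received == sent)
```

Observe that a stream containing no transitions at all gives the receiver nothing to lock onto, which is why the sketch raises an error in that case.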
8b/10b encoding
Now, we're getting a little ahead of ourselves here, because the concept of 8b/10b (and related encoding schemes) doesn't really come into play until we start talking about the SERDES and ALL techniques. The reason we are going to introduce it at this time is that it is relevant in the context of the CDR functions presented in the previous topic.
As a starting point, let's assume that we are using some high-speed serial transceiver technique, and also that we are transmitting an ideal signal consisting of a series of alternating 0s and 1s as illustrated in Fig 3.

3. Transmitting and receiving an "ideal" high-speed serial signal.
For the purposes of this simple example, the signal generated by the transmitter is shown as being a pure square wave; in the real world, however, this signal would have significant analog characteristics. Also, the signal as "seen" by the receiver would be phase-shifted from that shown in Fig 3; we've aligned the signals here to better illustrate which bits at the transmitter and receiver are associated with each other.
Now, when we are talking about high-speed signals with data rates as high as gigabits-per-second, the tracks linking the transmitting and receiving chips (and the pins on the chips) absorb a lot of the signal's high-frequency content, which means that the receiver "sees" only a drastically attenuated version of the original signal.
The end result at these extreme frequencies is that the signal coming out of the transmitting chip is horrible, and it's even worse by the time it reaches the receiver, but we digress. The point here is that the signal as "seen" by the receiver in Fig 3 still oscillates above and below some median level, which means the receiver can detect it and pull useful information (such as the data and the recovered clock) from it.
Now let's consider what might happen if we were to modify the previous data stream such that it commences by transmitting a series of three consecutive 1 values as illustrated in Fig 4.

4. The effect of transmitting a series of identical bits.
In this case (and remembering that this is an overly-pessimistic scenario intended only to provide us with something to talk about), the signal as "seen" by the receiver continues to rise throughout the course of the first three bits. This takes the signal above its "median" value, which means that even when the signal returns to its 010101. . . sequence, the receiver will actually continue to "see" a never-ending series of 1s.
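The effect can be mimicked with a toy model in which the line behaves as a simple one-pole low-pass filter and the receiver compares the result against a fixed threshold once per bit. This is a sketch under heavy assumptions (the filter coefficient, starting level, and threshold are invented for illustration, and real channels are far more complicated), but it shows the same flavor of failure.

```python
def receive(bits, alpha=0.2, start=0.5, threshold=0.5):
    """Pass 0/1 bits through a one-pole low-pass filter and slice the result.
    alpha, start and threshold are illustrative values only."""
    level = start
    seen = []
    for bit in bits:
        level += alpha * (bit - level)   # the line only creeps toward the new bit
        seen.append(1 if level > threshold else 0)
    return seen


alternating = [0, 1] * 5
with_run = [1, 1, 1, 0, 1, 0, 1, 0, 1, 0]
print(receive(alternating))
print(receive(with_run))
```

With the alternating stream, the level wobbles either side of the threshold and every bit is read correctly. After the three consecutive 1s, however, the level has drifted upward, and several of the following 0s are misread as 1s before it drifts back down again.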
The point of all of this is that 8b/10b refers to an encoding scheme in which original 8-bit (256-value) characters/symbols are mapped into 10-bit (1,024-value) characters/symbols. This means that each of the original 8-bit characters/symbols can have a number of 10-bit counterparts. The result is that even if the transmitter wishes to transmit a group of 0s or 1s, it can select between different 10-bit symbols so as to ensure that the overall sequence ends up "hovering" around the median value.
In addition to ensuring a constant DC value as discussed above, 8b/10b encoding is also used to ensure enough state changes to facilitate clock recovery by the receiver. Last but not least, some of the 10-bit codes can be used as control characters; for example, to announce the start and end of a "frame".
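The "select between different 10-bit symbols" idea can be sketched by tracking the running disparity (the running count of 1s minus 0s) and always choosing the alternative that pulls this count back toward zero. The snippet below does not implement the full 8b/10b code table; it uses only the two forms usually tabulated for the K28.5 control character, and the helper names are hypothetical.

```python
def disparity(symbol):
    """Number of 1s minus number of 0s in a string of '0'/'1' characters."""
    return symbol.count("1") - symbol.count("0")


def choose(candidates, running):
    """Pick the candidate that leaves the running disparity closest to zero."""
    return min(candidates, key=lambda s: abs(running + disparity(s)))


# The two forms usually tabulated for the K28.5 control character.
K28_5 = ("0011111010", "1100000101")

running = -1            # conventional starting value
sent = []
for _ in range(4):
    symbol = choose(K28_5, running)
    running += disparity(symbol)
    sent.append(symbol)

print(sent, running)
```

Sending the same character four times in a row results in the two forms alternating, so the line never accumulates a surplus of 1s or 0s.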
Example applications in which 8b/10b encoding is used are PCI Express, Serial RapidIO, Gigabit Ethernet (except for the twisted pair based 1000Base-T), InfiniBand, and XAUI.
The evolution of I/O
When I was a bright-eyed, bushy-tailed young engineer, transmitting signals from one chip to another was so much simpler than it is today. In those now-far-off times, we were typically working with Transistor-Transistor Logic (TTL), whose signals swung between 0V and 5V. Furthermore, generally speaking, we were working with clock frequencies of only a few hundred kHz – how well I can remember the excitement when our clock speeds started to approach 1 MHz (at that time we would have laughed our socks off if anyone had talked in terms of gigahertz clock frequencies and data rates of gigabits-per-second).
But, once again, we are "wandering off into the weeds", so let's bypass those days of yore, leap forward to when the use of CMOS became prevalent, and briefly summarize the evolution of different types of I/O as follows:
- Parallel asynchronous CMOS [No PLLs or CDRs]
- Parallel synchronous CMOS [No PLLs or CDRs]
- Parallel source-synchronous CMOS [Requires PLLs]
- Parallel source-synchronous LVDS [Requires PLLs]
- Serial source-synchronous LVDS [Requires PLLs]
- XCVR-based* multi-gigabit SERDES [Requires CDRs]
*Just in case you aren't familiar with this terminology, in ham radio jargon, X can stand for trans (from the Latin, meaning "across" or "through"), so XCVR is an abbreviation for "transceiver".
We'll next take a very quick peek at each of these cases to quickly remind ourselves as to the most salient points...
Parallel asynchronous CMOS: Let's start by considering an FPGA acting as a master device communicating (reading and/or writing data) with some form of slave device. In this case the master device will be fed by an external clock and the two components will be connected by a parallel data bus augmented by some control signals; for example, Chip Select and Read/Write as illustrated in Fig 5.

5. Parallel asynchronous CMOS (single slave device).
The advantage of this scheme is that it's relatively simple. One disadvantage is that it utilizes a lot of pins; another is that it requires a number of clock cycles to set up the R/W and CS lines and then perform the read/write operation.
If multiple slave devices are required, the data bus and R/W signals are copied to all of the slaves; meanwhile a unique CS signal is required by each slave as illustrated in Fig 6.

6. Parallel asynchronous CMOS (multiple slave devices).
Of course, as opposed to the master device generating unique CS signals, it could output an address value that was externally decoded to generate the CS signals, but we're trying to keep things as simple as possible.
Parallel synchronous CMOS: This is very similar to the previous case, except that the clock is fed to all of the devices as illustrated in Fig 7.

7. Parallel synchronous CMOS (multiple slave devices).
The advantage of this scheme is that each read/write operation requires only a single clock cycle; the disadvantage is that we now have to balance the clock lines so as to ensure that all of the devices "see" the clock at the same time. This introduces a new level of system complexity, where the "balancing requirements" become tighter and tighter as the clock frequency increases.
Parallel source-synchronous CMOS: OK, let's change things around a little. For the purpose of the following examples, let's assume that we are trying to establish communications between an FPGA and an Analog-to-Digital Converter (ADC) chip. The idea here is that our master device (the FPGA) wishes to upload a continuous stream of sample data from the slave device (the ADC).
Now assume that the data rate we require is so high that we can no longer guarantee our ability to synchronize operations between the two devices using a common clock. One solution is to use a source synchronous technique, in which the slave device produces its own clock that travels in parallel with the data as illustrated in Fig 8.

8. Parallel source-synchronous CMOS (single slave device).
Note that only the data and clock signals are shown in Fig 8; any additional control signals have been omitted for simplicity. Also note that the use of a single system clock to drive both devices as illustrated in Fig 8 is only one scenario; it is also common for both devices to be fed by separate clocks.
Using this approach, the clock generated by the slave device suffers the same delay and drift as its data, thereby facilitating the receiving device's ability to reliably retrieve that data.
Observe that the Source Synchronous Control (SSC) logic is relatively small compared to a PLL block. In the case of the slave (ADC), both the PLL and SSC would be implemented as hard-wired logic; by comparison, in the case of the FPGA, the PLL would be implemented as a hard macro while the SSC would be implemented using the device's programmable fabric.
One disadvantage of this approach is the proliferation of PLLs – first, we need a PLL in the slave to lock onto (and remove jitter from) the external clock; second, we need a PLL in the master to lock onto (and remove jitter from) the clock signal generated by the slave.
Furthermore, each new slave device will require an additional PLL macro in the FPGA. Even worse, each slave device behaves as though it were the only device (and clock generator) in the world. The result as "seen" from the FPGA's perspective is multiple clock domains – one for each slave.
Parallel source-synchronous LVDS: This is almost identical in concept to the previous topic. The only significant difference is the fact that the clock signal and each data signal are presented as LVDS (Low-Voltage Differential Signal) pairs.
There are three key advantages associated with using LVDS: (a) low power consumption, (b) low "outbound" (radiated) Electromagnetic Interference (EMI) emissions, and (c) a greater tolerance to "inbound" EMI (noise). The primary disadvantage is that each signal consumes two pins on each device.
Serial source-synchronous LVDS: In the case of a serial source synchronous LVDS scheme, we require a minimum of three signals: Clock, Data, and Frame as illustrated in Fig 9.

9. Serial source-synchronous LVDS (single slave device).
As for each of the other source synchronous techniques, the FPGA will require a PLL to process the clock from each of its slave devices.
XCVR-based multi-gigabit SERDES: At the top of the "food chain" we find high-speed serial communications schemes, such as PCI Express, as illustrated in Fig 10, in which the data signal includes an embedded clock. Observe that PCI Express was originally conceived as a board-to-board technique (chip-to-chip incarnations followed later). Thus, standard implementations employ a separate clock for each device.

10. XCVR-based multi-gigabit SERDES (single slave device).
In this case, a minimum implementation consists of a single (×1) "lane" comprising a transmit path and a receive path, each of which uses a special-purpose differential signal pair. Higher bandwidths may be achieved by using multiple lanes, which is why it is common to see ×1, ×4, ×8, etc. references in this context.
At the current time, the bandwidth for a single lane is typically quoted as 2.5 or 5.0 gigabits-per-second, but this can be a little misleading (in the case of 10 gigabits-per-second solutions, these are formed using four × 2.5 gigabits-per-second lanes).
The problem is that the data stream will use some form of encoding, such as the 8b/10b scheme introduced earlier in this paper (in the case of networks, a related 64b/66b scheme is typically employed, but we will assume the use of 8b/10b for the purposes of these discussions). This means that – for every 8 bits of data we wish to transmit or receive – we actually end up using 10 bits. Thus, in the case of a 2.5 gigabits-per-second Link Rate, the corresponding Data Rate is actually 2.5 / 10 * 8 = 2.0 gigabits-per-second.
While we're here, it's probably worth noting that, in the case of 10 Gig Ethernet, the Link Rate = 4 × 3.125 Gbps, while the Data Rate = 4 × 2.5 Gbps. The point is that one has to be careful when making comparisons, because network folks commonly quote the Data Rate, whereas other folks often quote the Link Rate (which is 25% higher for 8b/10b encoding).
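The Link Rate versus Data Rate arithmetic is easy to get backwards, so here it is as a tiny helper (the function name and its parameters are hypothetical; the numbers are the ones quoted above).

```python
def data_rate_gbps(link_rate_gbps, payload_bits=8, coded_bits=10, lanes=1):
    """Usable data rate once the line-coding overhead has been removed."""
    return link_rate_gbps * lanes * payload_bits / coded_bits


print(data_rate_gbps(2.5))             # single 8b/10b lane at 2.5 Gbps
print(data_rate_gbps(3.125, lanes=4))  # four 8b/10b lanes at 3.125 Gbps each

# Link rate relative to data rate for 8b/10b: 10/8 = 1.25, i.e. 25% higher.
print(10 / 8 - 1)
```

The same function handles other coding schemes by changing the payload_bits and coded_bits arguments; for example, 64 and 66 for a 64b/66b scheme.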
Observe that the control ("Ctrl") logic is relatively small compared to a CDR block. Once again, in the case of the slave (ADC), both the CDR and the "Ctrl" functions would be implemented as hard-wired logic; by comparison, in the case of the FPGA, the CDR would be implemented as a hard macro while the "Ctrl" functions would be implemented using the device's programmable fabric.
The advantages of multi-gigabit SERDES solutions are that they are extremely fast and require a low pin count. The disadvantages are that they are expensive in terms of dollars, silicon area, power consumption, and complexity to use; also that their high-speed pins are typically dedicated to their hardwired CDR macros.
Align Lock Loops to the rescue
Before we pull the veils asunder and unveil the mystery of the Align Lock Loop (ALL), let's quickly review a few particularly pertinent points. Let's start with the fact that the ideal I/O solution will exhibit the following characteristics:
- Minimum pin count
- No clock to distribute
- A single clock domain
- Flexibility
- Ease-of-use
- Inexpensive
With regard to the last point, we mean inexpensive in terms of dollars, silicon area, power consumption, and complexity. Now let's consider the fact that – from one point of view – FPGAs come in three main flavors (categories):
- Cheap (for example, the Cyclone/Spartan families)
- Expensive (for example, the Stratix/Virtex families)
- Very Expensive (for example, the StratixGX/VirtexPro families)
Of course, there are variations in a variety of resources associated with the various members of the different FPGA families, but overall it's fair to generalize the little rascals as follows:
- Memory (hundreds of thousands to millions of bits)
- Logic Elements (tens to hundreds of thousands)
- I/Os (hundreds to thousands)
- DSP blocks (hundreds)
- PLLs (a dozen at most)
- CDRs (a handful at most)
The bottom line is that the inclusion of PLLs and CDRs is a key differentiator when it comes to separating FPGAs into our Cheap, Expensive, and Very Expensive categories. And so we come to the concept of the Align Lock Loop (ALL), which doesn't require a PLL or CDR in the FPGA. In order to understand how this works, consider the scenario illustrated in Fig 11.

11. Align Lock Loop (single slave device).
Now, it's very important to note that the fact that the CDR block is shown as being small as compared to the ALL function is a fiction that is used only to illustrate the way in which things hang together. In reality, the ALL logic is a small fraction of the CDR function. As usual, the ALL in the master FPGA would be implemented using programmable fabric, while the ALL in the slave would be realized as hard-wired logic.
The idea is that when the system is first powered up, the master initiates a training sequence in which it transmits a series of symbols (observe that the master requires neither a PLL nor a CDR). The slave uses its CDR to recover the clock and data; the data is passed through a phase adjuster block; and the data is then transmitted back to the master chip.
The master chip now introduces control codes into the data stream, where these codes are used by the phase adjuster block in the slave device to modify the phase relationship of the data it returns to the master. Eventually everything is fully synchronized, at which point the training session (which requires less than a microsecond) is terminated and the slave device can start to transmit real data back to the master.
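Align has not yet published the details of its training algorithm, so the following is emphatically NOT the patented ALL method. It is only a generic sketch of the general idea of a master steering a slave's phase adjuster by means of control codes. Everything in it is an assumption: the 16-step phase adjuster, the single "advance one step" control code, the validity margin, and all of the names. The master sweeps the slave through every phase setting, notes which settings return clean data, and then parks the slave in the middle of the good window.

```python
STEPS = 16  # assumed resolution of the slave's phase adjuster per bit period


class Slave:
    """Hypothetical slave: `phase` is where its returned data edges fall
    relative to the master's fixed sampling point, in adjuster steps."""

    def __init__(self, initial_phase):
        self.phase = initial_phase % STEPS

    def handle_control_code(self):
        # Assumed control code: "advance the returned data by one step".
        self.phase = (self.phase + 1) % STEPS


def master_sees_valid_data(slave, margin=3):
    """The master reads the training pattern correctly only if the data edge
    is at least `margin` steps away from its sampling point."""
    return min(slave.phase, STEPS - slave.phase) >= margin


def train(slave):
    """Sweep all phases, find the window of good settings, park in its center."""
    valid = []
    for _ in range(STEPS):
        valid.append(master_sees_valid_data(slave))
        slave.handle_control_code()
    # After STEPS codes the slave is back where it started.
    start = next(k for k in range(STEPS) if valid[k] and not valid[k - 1])
    length = 0
    while valid[(start + length) % STEPS]:
        length += 1
    target = (start + length // 2) % STEPS   # codes needed to reach the center
    for _ in range(target):
        slave.handle_control_code()
    return target


slave = Slave(initial_phase=5)
codes_sent = train(slave)
print(codes_sent, slave.phase)
```

Note that the master never sees the slave's phase directly; all it knows is how many control codes it has sent and whether the data coming back is clean.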
Another interesting point is that each slave device can use its own "natural encoding" scheme. In the case of a 12-bit ADC, for example, the hard-wired ALL/CDR combo could be created in such a way as to use a 12b/14b encoding approach. The fact that the ALL in the FPGA master is fully implemented in programmable fabric means that it is "protocol agnostic" and can be configured to use any required encoding scenario.
Furthermore, the fact that the master FPGA requires neither a PLL nor a CDR means that lower-cost FPGA devices can now be used. Adding additional slave chips simply requires new ALL soft functions to be programmed into the FPGA. Furthermore, all of the slave devices can be adjusted so as to result in the FPGA seeing a single clock domain.
Of course, with bandwidths ranging from 300 megabits-per-second to 1.5 gigabits-per-second, a single ALL channel has less bandwidth than a corresponding SERDES lane. However, designers can use any LVDS pair to implement an ALL channel, and multiple channels can be combined to increase the overall bandwidth.
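As a back-of-the-envelope illustration of combining channels, the sketch below works out how many ALL channels a given stream would need. The channel bandwidths are the figures quoted above and the 12b/14b encoding is the example mentioned earlier; the ADC sample rates and the function names are assumptions.

```python
import math


def line_rate_bps(samples_per_second, coded_bits):
    """Bits per second on the wire when each sample is sent as one coded symbol."""
    return samples_per_second * coded_bits


def channels_needed(required_bps, channel_bps):
    """Smallest number of channels whose combined bandwidth covers the need."""
    return math.ceil(required_bps / channel_bps)


FAST_CHANNEL = 1_500_000_000   # 1.5 gigabits-per-second
SLOW_CHANNEL = 300_000_000     # 300 megabits-per-second

# Assumed 12-bit ADC at 100 Msamples/s using 12b/14b encoding.
adc_100 = line_rate_bps(100_000_000, coded_bits=14)
print(adc_100, channels_needed(adc_100, FAST_CHANNEL))

# Assumed 12-bit ADC at 250 Msamples/s using 12b/14b encoding.
adc_250 = line_rate_bps(250_000_000, coded_bits=14)
print(adc_250, channels_needed(adc_250, FAST_CHANNEL))

# Matching a single 2.5 gigabits-per-second SERDES lane.
print(channels_needed(2_500_000_000, FAST_CHANNEL),
      channels_needed(2_500_000_000, SLOW_CHANNEL))
```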
Summary
Personally, I think that the ALL concept is an incredibly exciting idea. Of course, one consideration is that the manufacturers of devices such as ADCs (or IP providers supplying the CDR functions that are used in these devices) have to be convinced that it is worth their while.
Now, if I were in the business of supplying ADC chips, I couldn't care less whether the users of my chips are buying cheap FPGAs or expensive ones. What I do care about is what my competitors are doing. Let's suppose that one of my competitors decides to add the relatively small amount of logic required to implement an ALL into their devices. The fact that the ALL functionality can be enabled/disabled means that my competitor's devices can now be used in both conventional designs and in ALL-based systems.
In this case, I think my competitor will soon be selling a lot of components to system engineers who wish to use low-cost FPGAs ... at which time I will become VERY interested in incorporating ALL technology into my devices.
Keeping this in mind, I will be watching the ongoing development of ALL technology with an "eagle eye".
Clive "Max" Maxfield is president of TechBites Interactive, a marketing consultancy firm specializing in high technology. Max is the author and co-author of a number of books, including Bebop to the Boolean Boogie (An Unconventional Guide to Electronics), The Design Warrior's Guide to FPGAs (Devices, Tools, and Flows), and How Computers Do Math featuring the pedagogical and phantasmagorical virtual DIY Calculator.
Widely regarded as being an expert in all aspects of computing and electronics (at least by his mother), Max was once referred to as "an industry notable" and a "semiconductor design expert" by someone famous who wasn't prompted, coerced, or remunerated in any way. Max can be reached at max@techbites.com.



