Design Article
How to turn every FPGA LVDS pair into a complete SERDES solution
Clive Maxfield
9/26/2007 2:37 PM EDT
Just a couple of days ago as I pen these words, I was chatting with Bryan Hoyer from Align Engineering. After slaving away for years in their secret underground bunker, these little rapscallions have just come out of "Stealth Mode". As part of their public launch, the folks at Align have announced a patented breakthrough technology called Align Lock Loop (ALL), which allows every LVDS input/output (I/O) pair in an FPGA to be used as a complete SERDES (Serializer/Deserializer) solution. This forms the basis for implementing fast, simple, and very affordable chip-to-chip and board-to-board communication without using large numbers of I/O pins and without involving intensive engineering that makes your eyes water.
In fact, I was so excited about the ALL concept that I decided to pen this brief technology introduction/backgrounder. Bryan promises that Align will follow this with a full-up "How To" article in the not-so-distant future (after the "proof-in-silicon" technology demonstration that is currently planned for sometime in Q4 2007).
Phase Lock Loops (PLLs) and Clock Data Recovery (CDR)
Before we leap into the fray with gusto and abandon, it's well worth spending a few moments reminding ourselves of the role played by PLL and CDR functions, because these concepts will be important to our future discussions.
A PLL is a closed-loop electronic control system/function that can be used for frequency control by generating an output signal with a fixed relation to the phase of an input ("reference") signal. For the purposes of these discussions, both the input and output signals will be considered to be clock signals. The simplest form of PLL generates an output clock with the same frequency and phase as the input clock (Fig 1).

1. A generic PLL.
By means of a feedback path coupled with a phase detector, the PLL responds to both the frequency and the phase of the input signal, automatically raising or lowering the frequency of a controlled oscillator until it is synchronized to the input/reference signal in both frequency and phase.
As described in an associated Wikipedia Article, a good analogy is the tuning of a string on a guitar. Using a tuning fork as a reference signal, the tension of the string is adjusted up or down until the beat frequency is inaudible, thereby indicating that both the tuning fork and the guitar string are vibrating at the same frequency. If the guitar string is perfectly tuned and in phase with the tuning fork and maintained there, it may be described as being in phase-lock with the fork.
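The feedback behaviour described above can be sketched as a small phase-domain simulation. This is a minimal illustration rather than a model of any real device: the function name `simulate_pll`, the proportional-integral loop filter, and all of the gain, frequency, and time-step values are assumptions chosen for the example.

```python
import math

def wrap(phase):
    """Wrap a phase difference into the range [-pi, pi)."""
    return (phase + math.pi) % (2 * math.pi) - math.pi

def simulate_pll(f_ref=1.00e6, f_free=0.98e6, kp=7.0e4, ki=1.6e10,
                 dt=1e-8, steps=20000):
    """Phase-domain PLL sketch (all default values are illustrative).

    f_ref  : reference clock frequency in Hz
    f_free : free-running frequency of the controlled oscillator in Hz
    kp, ki : proportional and integral gains of the assumed loop filter
    Returns the final oscillator frequency (Hz) and phase error (rad).
    """
    ref_phase = 1.0   # start the reference one radian ahead
    osc_phase = 0.0
    integ = 0.0       # integrator state of the loop filter, in Hz
    f_osc = f_free
    err = 0.0
    for _ in range(steps):
        # Phase detector: compare reference and oscillator phase.
        err = wrap(ref_phase - osc_phase)
        # Loop filter: raise or lower the oscillator frequency.
        integ += ki * err * dt
        f_osc = f_free + kp * err + integ
        # Advance both phases by one time step.
        ref_phase += 2 * math.pi * f_ref * dt
        osc_phase += 2 * math.pi * f_osc * dt
    return f_osc, err

if __name__ == "__main__":
    f_osc, err = simulate_pll()
    print(f"final frequency: {f_osc:.3f} Hz, phase error: {err:.2e} rad")
```

With these assumed values the oscillator starts 20 kHz below the reference and one radian out of phase; by the end of the run the phase error has decayed toward zero and the oscillator frequency matches the reference.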
Now, to the uninitiated, this may not at first appear to be too mind-bending. A common knee-jerk reaction is "woopee-doo-dah" – we've used a lot of complex circuitry to take an existing clock signal and generate a new one that looks exactly the same. But, of course, there's a lot more to it than this. Consider the case of "jitter" for example, in which the rising and falling edges of the input clock may wander back and forth slightly. A suitably designed PLL attenuates this jitter (specifically, the components that vary faster than its loop bandwidth) and generates a nice "clean and shiny" clock signal.
Sidebar: How important is jitter? Very! Consider an Analog-to-Digital Converter (ADC), whose role in life is to sample data at specific times, for example. Now assume that this sampled data is fed into a Digital Signal Processing (DSP) algorithm/function, which works on the assumption that all of the samples are taken at regular intervals as determined by some clock signal. Not surprisingly, if the clock signal is subject to jitter, the overall quality of the resulting data is degraded. In order to address this issue, some "clock cleaner" PLL chips/functions can reduce jitter down into the femtosecond range.
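To put rough numbers on the sidebar above: for an ideal converter sampling a full-scale sine wave, a commonly quoted upper bound on the signal-to-noise ratio set by sampling-clock jitter alone is SNR = -20·log10(2π·f_in·σ_j), where f_in is the input frequency and σ_j is the RMS jitter. The sketch below simply evaluates that bound; the function name and the example frequency and jitter values are illustrative assumptions, not figures for any particular device.

```python
import math

def jitter_limited_snr_db(f_in_hz, jitter_rms_s):
    """Upper bound on SNR (dB) due to clock jitter alone.

    Assumes an otherwise ideal converter and a full-scale sine input.
    """
    return -20.0 * math.log10(2.0 * math.pi * f_in_hz * jitter_rms_s)

if __name__ == "__main__":
    f_in = 100e6  # assumed 100 MHz input tone
    for jitter in (10e-12, 1e-12, 100e-15):
        snr = jitter_limited_snr_db(f_in, jitter)
        print(f"{jitter * 1e15:8.0f} fs rms -> {snr:5.1f} dB")
```

Under this bound, every tenfold reduction in jitter buys 20 dB of achievable signal-to-noise ratio, which is why a femtosecond-class clock matters for fast, high-resolution converters.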
And, of course, there's more, because the output frequency from the PLL may be a higher (multiplied) or lower (divided) version of the input frequency.
Sidebar: As a simple example of why we might wish to use a multiplying PLL function, consider the case where an ADC chip is required to sample data at 100 MHz. Also consider that several chips may be driven by a common clock. One approach would be to use a 100 MHz clock generator to feed the various chips, but running a 100 MHz clock signal around a board isn't a lot of fun. An alternative would be to employ a 25 MHz clock generator at the board level, and for each of the chips to use a 4× PLL to generate 100 MHz signals for internal consumption.
In fact, there are a variety of PLL functions as follows (these are presented in terms of increasing complexity, which equates to more gates/transistors, silicon area, power consumption, and so forth):
- Base-Level PLL: In this, the simplest case, the output from the PLL has the same frequency and phase as the input signal (the phase can be adjusted as required by means of the feedback path).
- Integer Mult (m) PLL: The output frequency from the PLL is some integer multiple 'm' of the input frequency.
- Integer Div (n) PLL: The output frequency is the input frequency divided by some integer 'n'.
- Integer Mult/Div (m/n) PLL: The output frequency is the input frequency multiplied by an integer 'm' and divided by an integer 'n' (this is often achieved by using two or more PLLs in tandem).
- Fractional Mult/Div (m/n) PLL: As for the previous case, except that the multiplication 'm' and division 'n' values may be real/fractional values.
- Clock Data Recovery (CDR) PLL: In this case, a clock signal is embedded in – and recovered from – a stream of data.
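The multiply/divide variants in the list above all reduce to the same arithmetic: the output frequency is the input frequency times m/n. Here is a minimal sketch of that relationship; the helper name `pll_output_hz` is hypothetical, and exact rational arithmetic is used only to keep the example free of rounding noise.

```python
from fractions import Fraction

def pll_output_hz(f_in_hz, m=1, n=1):
    """Ideal PLL output frequency: f_in * m / n.

    m and n may be integers, Fractions, or decimal strings such as "2.5"
    (the fractional mult/div case). Returns an exact Fraction in Hz.
    """
    m, n = Fraction(m), Fraction(n)
    if m <= 0 or n <= 0:
        raise ValueError("m and n must be positive")
    return Fraction(f_in_hz) * m / n

if __name__ == "__main__":
    # The sidebar example: a 25 MHz board clock and a 4x PLL on each chip.
    print(float(pll_output_hz(25_000_000, m=4)))
    # Integer divide, integer mult/div, and fractional multiply cases.
    print(float(pll_output_hz(100_000_000, n=4)))
    print(float(pll_output_hz(10_000_000, m=5, n=2)))
    print(float(pll_output_hz(40_000_000, m="2.5")))
```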
As we shall see, the CDR case is of particular interest to us in the context of these discussions. The idea here is that, as opposed to having separate clock and data signals, the clock is embedded in (and can be derived from) the data stream itself.
As a starting point, let's consider a data stream that consists of alternating 0s and 1s (e.g. 010101010101...) as illustrated in Fig 2(a), where each 0-to-1 and 1-to-0 transition occurs on a clock edge from a reference clock that is embedded in the transmitter.
Obviously, recovering the clock is not too complex a task in this case (to all intents and purposes, this data stream is the clock).

2. Embedding the clock in the data stream.
By comparison, consider the more complex data stream illustrated in Fig 2(b). Once again, any data transitions between 0 and 1 values (and vice versa) occur on "clock edges" corresponding to a reference clock that is embedded in the transmitter, but the data stream itself may comprise a "random" sequence of 0s and 1s. In this case, the CDR function embedded in the receiver will have to be much more sophisticated.
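To make the idea concrete, here is a deliberately idealized sketch of recovering bit timing from the data transitions alone. It is not a PLL-based CDR and it is not Align's ALL technique: it uses a simple oversampling approach that builds a histogram of where transitions fall within the bit period and then samples half a bit away from them. The oversampling ratio `OS`, the function names, and the noise-free waveform are all assumptions made for illustration.

```python
import random
from collections import Counter

OS = 8  # assumed receiver oversampling ratio (samples per bit)

def transmit(bits, offset):
    """Oversampled NRZ waveform preceded by `offset` idle (0) samples.

    The offset models the unknown phase between transmitter and receiver.
    """
    wave = [0] * offset
    for bit in bits:
        wave.extend([bit] * OS)
    return wave

def recover(wave):
    """Recover the bits using only the transitions in the waveform."""
    # Phase (sample index modulo OS) at which each transition occurs.
    edges = Counter(i % OS for i in range(1, len(wave))
                    if wave[i] != wave[i - 1])
    if not edges:
        raise ValueError("no transitions, so no timing can be recovered")
    edge_phase = edges.most_common(1)[0][0]
    # Sample in the middle of each bit cell, half a bit after the edges.
    first = edge_phase + OS // 2
    return [wave[i] for i in range(first, len(wave), OS)]

if __name__ == "__main__":
    random.seed(1)
    bits = [random.getrandbits(1) for _ in range(32)]
    for offset in range(OS):
        assert recover(transmit(bits, offset)) == bits
    print("recovered the data at every phase offset")
```

Note that the receiver gets its timing only from the transitions: a stream with no transitions at all gives it nothing to lock to, which is the failure case the `ValueError` stands in for.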




Comments
LonM
9/27/2007 11:24 AM EDT
This is similar to an idea that I had to transmit parallel data over widely separated paths. Multiple slaves all used the same basic reference clock (which could be sent from the master, or shared from a third source). Each slave must have a PLL with a phase adjuster that must be register programmable (like Cyclone3). They transmit the clock back to the Master, and the master measures the phase skew between them. The master sends the phase adjust commands to each slave. The loop integrates down to bring all slave clocks into alignment at the master. The master can now latch the slave data together, or use a FIFO to cross the data into the Master's own clock domain. As long as the slave FPGA has an adjustable phase PLL, it can be done without additional hard logic.
EDW
9/27/2007 12:13 PM EDT
Nice article Max, thanks!
Typo alert:
"This means that for every 8 bits of data we wish to transmit or receive we actually end up using 10 bits. Thus, in the case of a 2.5 gigabits-per-second data rate, the corresponding link rate is actually 3.8 gigabits-per-second."
For a 2.5 gbps data rate, the line rate is 2.5 * 10 / 8 = 3.125 gbps
You probably meant 3 1/8 gbps.
EDW
9/27/2007 12:17 PM EDT
Also, jitter is only attenuated if it is above the pass band of the PLL, otherwise it is passed on (if the pass band is flat) or even amplified (if the pass band is peaky). Thus the concern over wander accumulation in any synchronous digital hierarchy.
EDW
9/27/2007 12:38 PM EDT
Also also, if you examine 8b/10b encoding (perhaps via spreadsheet) you will find that there are not enough unbiased (5 zeros and 5 ones) 10b characters (252 by my calculations) to cover the 8 bit input (256). So some biased 10b characters are pressed into service, and an output circuit is employed that enforces "neutral disparity" by inverting entire 10b characters when necessary. Of course, someone (legacy devices) decided negative disparity when idle was necessary, further mucking things up.
It never ceases to amaze me how, on the occasion that engineers build something beautiful, others invariably come along and turn it into a Frankenstein.
Max the Magnificent
9/27/2007 5:04 PM EDT
Good catch --- you are right -- I was thinking 3 1/8 and I just wrote 3.8 ... but I should have caught that when re-reading the little scamp because (a) I know it's 3.125 and (b) I'm an anal-retentive and I usually re-perform the calculations "just to make sure" ...
Thanks for the "heads up" -- cheers -- Max
mwinnc
10/8/2007 10:35 AM EDT
Max - excellent refresher for some of us.
a couple of notes: you only list two major FPGA suppliers, but there are three: Lattice has ECP2 on the low end; and SC on the high end. The key however, is the ECP2M which incorporates significant memory resources and up to (16) 3.125G SerDes at a Spartan/Cyclone cost point. This allows engineers to implement full function SerDes (full PCS layer) at minimal extra cost. And with (8) PLLs, clocking resources are not an issue.