Determining the optimum approach to driving multiple NeoPixel rings -- as individual elements or as a single entity -- is non-trivial.
With regard to our ongoing contest (see Latest & Greatest in the Cunning Chronograph Competition), I've been thinking about something recently.
In the case of our chronograph, we're using three NeoPixel rings: a 60-element ring, a 24-element ring, and a 12-element ring. The point is that there are different ways in which to connect and drive these rings. We could treat them as a single chain, for example, in which case the output from one ring would drive the input to the next, thereby requiring only one Arduino input/output (I/O) pin. Another approach would be to allocate a separate I/O pin to each ring and to treat them as being completely separate entities.
Until now, I've been treating my three rings as a single chain. However, while chatting with Max Maxfield a couple of days ago, I discovered that he's been treating the rings in his clock as separate entities.
Coding for three separate rings could be easier, offer more flexibility, and be more readable, but what interests me is whether there is a performance advantage when writing to all 96 pixels with the three rings wired sequentially, as opposed to driving them individually.
Although the Adafruit NeoPixel library transfers the data to the pixels using machine code (the most efficient and fastest way possible), I couldn't help wondering if there was an overhead when writing to three separate I/O pins. I haven’t studied the library, nor have I studied the data sheet for the WS2812B chip that drives the RGB LEDs, but I do know there will be some protocol required to initiate and terminate communications. Multiply that "start/stop" overhead by three, and it must add a bit of time when compared to communicating with just a single string of pixels.
In order to investigate this further, I decided to perform some simple tests as follows:
- Set up a single instance of a NeoPixel strip with 96 elements (60 + 24 + 12) to represent all three rings.
- Apply some arbitrary colour values to each pixel, say hexadecimal AA, which applies alternate ones and zeros.
- Time how long the strip.show() function call takes to execute.
- Set up three NeoPixel strip instances using separate I/O pins, with strip lengths of 60, 24, and 12.
- Apply the same arbitrary colour values to each pixel as for the single strip.
- Time how long the three strip.show() function calls take to execute.
The next consideration is how to accurately measure the execution time(s) of the strip.show() function(s). I can think of a couple of methods as follows:
- Use the Arduino millis() function to capture the time immediately before the strip.show() is executed and capture it again immediately after it completes its execution, then subtract the "before" from the "after" and use a Serial.print() function to display the time difference in milliseconds.
- Use an I/O pin -- set this pin low immediately before we call the strip.show() function and return it to high immediately afterwards, then use either an oscilloscope or a logic analyser to precisely measure the time the I/O pin remains low.
The Arduino Nano used for this experiment runs at 16MHz, which equates to 62.5ns per clock cycle (1 / 16,000,000 = 0.0000000625 seconds). I suspect writing to the NeoPixels happens very fast and that millis() may not have enough resolution to measure the time accurately. I also believe the millis() function relies on a timer interrupt and a software counter, and may be a little inaccurate for precise timing. Luckily, the Arduino also has the micros() function, which reads a hardware counter and resolves down to 4µs. The main downside of this function is that its count overflows and resets back to zero roughly every 71 minutes, but this really doesn't matter for these experiments.
The millis(), micros(), and digitalWrite() functions each take some time to execute, so any attempt to obtain an absolutely accurate measurement of the strip.show() function is doomed to be slightly flawed. There is an alternative method for toggling I/O pins that uses direct port manipulation. This involves some quirky syntax, but it takes only one clock cycle to execute, which makes it about 10 times faster than using a digitalWrite().
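The figures quoted here can be checked with a few lines of compile-time arithmetic. This is only a sketch in plain C++ (it runs on a PC, not on the Nano); the constant names are invented for illustration, and the divide-by-64 timer prescaler is an assumption about how the standard Arduino core sets up the timer behind micros() on a 16MHz board.

```cpp
#include <cstdint>

// All names below are illustrative; they are not part of the Arduino API.
constexpr std::uint32_t kCpuHz = 16000000UL;           // 16MHz Arduino Nano

// One clock period in picoseconds: 1e12 / 16e6 = 62,500ps = 62.5ns.
constexpr std::uint64_t kCyclePs = 1000000000000ULL / kCpuHz;

// Assumption: the timer behind micros() is clocked at CPU/64,
// so one timer tick lasts 64 CPU clock cycles.
constexpr std::uint32_t kTimerPrescaler = 64;
constexpr std::uint64_t kMicrosTickPs = kCyclePs * kTimerPrescaler;

// micros() returns a 32-bit count of microseconds, so it wraps after 2^32us.
constexpr std::uint64_t kMicrosWrapUs = 1ULL << 32;
constexpr std::uint64_t kMicrosWrapMinutes = kMicrosWrapUs / 60000000ULL;

static_assert(kCyclePs == 62500, "62.5ns per clock cycle");
static_assert(kMicrosTickPs == 4000000, "4us micros() resolution");
static_assert(kMicrosWrapMinutes == 71, "wraps after about 71 minutes");
```

Under that assumption, the 4µs resolution is simply 64 clock cycles of 62.5ns each, and the 71-minute figure is 2^32 microseconds rounded down to whole minutes.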
Assuming I decide to use Pin 7 for the Low/High toggle, the direct port manipulation commands would look like the following:
PORTD &= ~(1 << 7); // Set Port D bit 7 LOW
// Code to be measured goes here
PORTD |= (1 << 7); // Set Port D bit 7 HIGH
These commands use bit shifting together with bitwise AND and OR operations to change bit 7 without affecting any of the other bits on Port D. The “~” operator inverts the bit pattern produced by (1 << 7).
Knowing that each of these command lines takes only a single clock cycle, we could subtract (2 * 62.5ns = 125ns) from the time measured on the logic analyser or the oscilloscope, thereby ending up with a more accurate measurement.
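To see why the masking leaves the other seven bits of the port alone, the same two operations can be tried on an ordinary byte. The helper names below (clearBit, setBit) are hypothetical and stand in for the PORTD register writes; this is a PC-side illustration rather than code for the Nano.

```cpp
#include <cstdint>

// Illustrative helpers (hypothetical names) operating on a plain byte
// that stands in for the PORTD register.
constexpr std::uint8_t clearBit(std::uint8_t port, unsigned bit) {
    return static_cast<std::uint8_t>(port & ~(1u << bit));   // like PORTD &= ~(1 << 7)
}
constexpr std::uint8_t setBit(std::uint8_t port, unsigned bit) {
    return static_cast<std::uint8_t>(port | (1u << bit));    // like PORTD |= (1 << 7)
}

// (1 << 7) is 0b10000000; inverting it gives the mask 0b01111111.
static_assert(static_cast<std::uint8_t>(~(1u << 7)) == 0x7F, "inverted mask");

// Starting from 0b10100101: bit 7 goes low, bits 0 to 6 are unchanged.
static_assert(clearBit(0xA5, 7) == 0x25, "bit 7 cleared");
// Setting bit 7 again restores the original pattern.
static_assert(setBit(0x25, 7) == 0xA5, "bit 7 set");
```

The AND with the inverted mask can only turn bit 7 off, and the OR with the plain mask can only turn bit 7 on, which is why neither line disturbs whatever else is connected to Port D.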
Click here to see the first piece of test code for a single strip of 96 pixels using the micros() function. The result displayed in the terminal monitor window was 860µs.
I then repeated this experiment, again using the micros() function, but this time with three separate strips. Click here to see this code.
This time, the result displayed in the terminal monitor window was 1984µs. This is more than double the time for the single strip. Is this really that different? Are these timings accurate?
Well, let's repeat the experiment, but this time we'll use direct port manipulation of the toggle pin. Click here to see the code associated with a single strip.
The screenshot below is from my logic analyser. Trace 0 is the data stream from the strip.show() function, while Trace 1 reflects the toggle pin. Timing markers were placed on the toggle trace going low and high, and the difference between them is 2.9036250ms.
The value taken from the logic analyser is significantly different to the one obtained using the micros() function, and I don’t have a definitive explanation for the big difference between the two methods. (To verify the logic analyser output I checked the timing using an oscilloscope, and it matched the logic analyser.)
I will make one assumption, though: I think this difference has something to do with the Adafruit NeoPixel library and the fact that it uses machine code to output the data. Most likely the library holds off interrupts while it sends the data, which would disrupt the normal behaviour of the millis() and micros() time functions while it executes. Because of this anomaly, I will have to discount the micros() method for time measurement when used in conjunction with strip.show() functions.
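One way to probe that suspicion is to compare the numbers against what the hardware would be expected to do. The following is a rough cross-check in plain C++, not a verified explanation. It assumes the usual 800kHz NeoPixel data rate (1.25µs per bit, 24 bits per pixel) and assumes that the timer behind millis() and micros() overflows every 1024µs on a 16MHz board; all names are invented for illustration.

```cpp
// All names are illustrative. Times are in whole microseconds.

// Assumption: 800kHz NeoPixel data rate, i.e. 1250ns per bit, 24 bits per pixel.
constexpr long txTimeUs(long pixels) {
    return pixels * 24 * 1250 / 1000;
}

// Assumption: the timer behind millis()/micros() overflows every
// 256 ticks * 64 cycles / 16 cycles-per-us = 1024us on a 16MHz board.
constexpr long kOverflowUs = 256L * 64 / 16;

// Number of whole overflow periods closest to a given shortfall.
constexpr long nearestOverflows(long missingUs) {
    return (missingUs + kOverflowUs / 2) / kOverflowUs;
}

// Expected data time for 96 pixels, close to the ~2904us seen on the analyser.
static_assert(txTimeUs(96) == 2880, "96 pixels");

// Single strip: analyser ~2904us, micros() 860us.
static_assert(2904 - 860 == 2044, "single-strip shortfall");
static_assert(nearestOverflows(2044) == 2, "about two overflow periods");

// Three strips: analyser ~3005us, micros() 1984us.
static_assert(3005 - 1984 == 1021, "three-strip shortfall");
static_assert(nearestOverflows(1021) == 1, "about one overflow period");
```

If those assumptions hold, the logic analyser figure is within about 1% of the pure data time, while the micros() readings fall short by almost exactly two timer periods (single strip) and one timer period (three strips). That pattern would fit timer overflows being dropped while interrupts are held off during each strip.show() call, with the three shorter calls losing less than the one long call.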
Finally, I performed the experiment using three strips in conjunction with direct port manipulation of the toggle pin. Click here to see the code associated with this test.
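The linked test sketches are not reproduced in this text, so what follows is a rough reconstruction of the kind of test described, not the original code. It covers both measurement methods in one Arduino sketch: the toggle pin pulses low around the strip.show() call(s) and the micros() difference is printed as well. The data pins (4, 5, and 6), the NEO_GRB colour order, and the USE_THREE_STRIPS switch are assumptions made for illustration; it needs the Arduino toolchain and the Adafruit NeoPixel library to build.

```cpp
#include <Adafruit_NeoPixel.h>

// Hypothetical configuration: the data pins are placeholders.
#define USE_THREE_STRIPS 0          // 0 = one 96-pixel chain, 1 = three rings

const uint8_t TOGGLE_PIN = 7;       // Port D bit 7 on the Nano, as in the text

#if USE_THREE_STRIPS
Adafruit_NeoPixel ring60(60, 4, NEO_GRB + NEO_KHZ800);
Adafruit_NeoPixel ring24(24, 5, NEO_GRB + NEO_KHZ800);
Adafruit_NeoPixel ring12(12, 6, NEO_GRB + NEO_KHZ800);
Adafruit_NeoPixel* strips[] = { &ring60, &ring24, &ring12 };
#else
Adafruit_NeoPixel chain(96, 6, NEO_GRB + NEO_KHZ800);
Adafruit_NeoPixel* strips[] = { &chain };
#endif
const uint8_t NUM_STRIPS = sizeof(strips) / sizeof(strips[0]);

void setup() {
  Serial.begin(9600);
  pinMode(TOGGLE_PIN, OUTPUT);
  digitalWrite(TOGGLE_PIN, HIGH);   // idle high; pulse low while timing

  for (uint8_t s = 0; s < NUM_STRIPS; s++) {
    strips[s]->begin();
    for (uint16_t i = 0; i < strips[s]->numPixels(); i++) {
      // 0xAA on each colour channel gives alternating ones and zeros
      strips[s]->setPixelColor(i, 0xAA, 0xAA, 0xAA);
    }
  }
}

void loop() {
  unsigned long before = micros();
  PORTD &= ~(1 << 7);               // toggle pin LOW: start of measurement
  for (uint8_t s = 0; s < NUM_STRIPS; s++) {
    strips[s]->show();
  }
  PORTD |= (1 << 7);                // toggle pin HIGH: end of measurement
  unsigned long after = micros();

  Serial.print("micros() difference: ");
  Serial.println(after - before);
  delay(1000);                      // one measurement per second
}
```

Looping over an array of pointers keeps the two configurations in one file, at the cost of a few extra clock cycles per call compared with writing out three strip.show() calls by hand, so absolute figures from this sketch would not be expected to match the measurements above exactly.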
In the screenshot below we can see the three data streams from the three strip.show() functions presented on Traces 0, 1, and 2; in this case, the signal from the toggle pin is presented in Trace 4. Again, timing markers were placed on the toggle pin low and high events, and the measured time was 3.0050625ms.
We could subtract the 125ns for the toggle measurements from these last two results, but it doesn’t make much of a dent in the time measured in this instance, so I decided to ignore it. What we really want to know is the difference between the two measurements, which is 3.0050625ms - 2.9036250ms = 0.1014375ms.
Thus, treating the rings as three separate entities, as opposed to treating them as a single entity, adds approximately 0.1ms. This means that if I decide to update/refresh the entire clock display 60 times a second, treating the three rings as a single entity will save me 6ms every second. This may not sound like much, but if the code to generate the display ends up being time critical, then 6ms equates to 96,000 clock cycles, which could make the difference between an animation working or not.
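For anyone who wants the unrounded figures, the same sum can be worked through in integers. This is a small compile-time sketch in plain C++ with invented constant names; the two readings are the logic analyser values quoted above, expressed in tenths of a nanosecond so that the 62.5ns clock period is a whole number.

```cpp
// Illustrative names; values are the logic-analyser readings from the text,
// expressed in tenths of a nanosecond so everything stays in integers.
constexpr long kSingleStrip = 29036250L;   // 2.9036250ms
constexpr long kThreeStrips = 30050625L;   // 3.0050625ms
constexpr long kCycle       = 625L;        // 62.5ns per clock cycle at 16MHz

constexpr long kExtraPerRefresh    = kThreeStrips - kSingleStrip;
constexpr long kRefreshesPerSecond = 60;
constexpr long kExtraPerSecond     = kExtraPerRefresh * kRefreshesPerSecond;

static_assert(kExtraPerRefresh == 1014375, "0.1014375ms per refresh");
static_assert(kExtraPerRefresh / kCycle == 1623, "1,623 clock cycles per refresh");
static_assert(kExtraPerSecond == 60862500, "about 6.09ms per second");
static_assert(kExtraPerSecond / kCycle == 97380, "97,380 clock cycles per second");
```

Without the rounding to 0.1ms, the saving at 60 refreshes per second works out at about 6.09ms, or 97,380 clock cycles, which is in line with the rounded 6ms and 96,000 cycles.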
When measuring the execution time for a section of code -- not just the strip.show() function, but any piece of code -- the approach you take rather depends on how accurate the measurement needs to be and on whether the measurement of time is to be an integral part of the code or just a way to get an idea of the time taken.
When it came to timing things like the strip.show() function, it turned out that the millis() and micros() functions were far too inaccurate, while the direct pin toggle method was far more precise. On the other hand, the fact that the toggle method requires external hardware for the measurement makes this technique less practical if the code needs to take time measurements dynamically from within.
In the case of our chronograph, the choice of one strip versus three strips rather depends on how and what the clock display is required to do. Ease of writing the code, code flexibility, and readability need to be balanced against a slight performance gain. You could start with three strips and then -- if the code needs more CPU time -- switch to one strip downstream, but this may be a tough change to make depending on the complexity of the code.
And one final thought is that all of this depends on how we wish to drive the rings. Suppose we decide to update the 12-pixel ring 60 times a second, but we only wish to update the 24-pixel and 60-pixel rings once a second; then it would be much faster and more efficient to be able to write to the 12-pixel ring independently -- why waste time updating pixels that aren’t changing? On the other hand, if we decide to apply the Adafruit rainbow effect across all 96 pixels simultaneously, then treating things as a single string would be the better approach.
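To put rough numbers on that last scenario, here is a back-of-envelope estimate in plain C++. It is a sketch under stated assumptions rather than a measurement: it assumes the usual 800kHz data rate, so each 24-bit pixel takes about 30µs to send, it assumes that each refresh of the single chain sends all 96 pixels, and it ignores the small per-call overhead measured earlier. The names are invented for illustration.

```cpp
// Illustrative names. Assumption: 800kHz data rate, so one 24-bit pixel
// takes 24 * 1.25us = 30us to send; per-call overhead is ignored.
constexpr long kUsPerPixel = 30;

constexpr long showTimeUs(long pixels) { return pixels * kUsPerPixel; }

// Scenario from the text: 12-pixel ring at 60Hz, the other two rings at 1Hz.
constexpr long kSeparateRingsUs =
    60 * showTimeUs(12) + 1 * showTimeUs(24) + 1 * showTimeUs(60);

// Assumption: a single 96-pixel chain is rewritten in full on every refresh.
constexpr long kSingleChainUs = 60 * showTimeUs(96);

static_assert(kSeparateRingsUs == 24120, "about 24ms of transfers per second");
static_assert(kSingleChainUs == 172800, "about 173ms of transfers per second");
```

On those assumptions, the separate rings would spend roughly 24ms of every second sending data, against roughly 173ms for the single chain, so in this particular scenario the extra 0.1ms or so per refresh for separate pins is dwarfed by the saving from not rewriting pixels that haven't changed.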