Engineering Investigations
Speaking out of line
Adrian Michaud
4/29/2011 10:08 AM EDT
It was a nice summer evening on the east coast. I could clearly see the cloudless blue skies through my office window and feel that summer warmth radiating from the glass. I should be outside, I thought.
All the more reason for finding out why my new custom embedded development platform crashes occasionally when booting. The hardware continues to check out just fine; however, it appears memory/DDR is getting corrupted for no apparent reason during the boot process. This isn't making any sense. I've been using my own boot loader and my own hard real-time operating system for a while now and I've ported them successfully to everything under the sun. What's worse, the memory/DDR corruption during boot is not repeatable. Yes, the DDR controller is configured correctly. Yes, the DDR layout doesn't violate any guidelines. Yes, the memory/DDR passes with flying colors when I run extensive memory tests. Something during the boot sequence is causing occasional memory/DDR corruption.
I bring the board home and continue my investigation later that night, only to find the problem has resolved itself. This is great; or is it? No, there is clearly an issue that still needs to be root-caused. I recheck the hardware, looking for any possible physical issues, thermal issues, etc. Everything is fine. I set up the board to keep rebooting itself overnight while I sleep on it. The next morning, the board is still not exhibiting any problems. I bring the board back into work with me and expect another frustrating day trying to reproduce the problem. To my surprise, the board fails to boot on the first attempt. OK, something about the environment must be causing this. I jokingly stuck my hand over the board to see if I could detect any alpha particle bombardments, and then it suddenly hit me like a ton of bricks. As I placed my hand over the board, I must have visually blocked my focal point (the memory/DDR subsystem), and I noticed that the Ethernet PHY lights on the board were blinking. Of course! That's one significant environmental difference between my office and my house: the network! Now I was on to something.
I quickly ruled out any electrical/EMI/RFI issues with the Ethernet PHY/magnetics and moved on to the Ethernet MAC, which was integrated into the CPU. My boot loader has the ability to load my RTOS from the network (via TFTP), but I was loading it from a Flash file system in NAND memory, so my boot loader wasn't really using the network. It did, however, initialize and set up the Ethernet MAC and PHY. Fast forward 5-10 minutes and we find that the Ethernet driver in my boot loader "forgot" to tear down a ring descriptor for RX packets and disable DMA in the Ethernet MAC before transferring execution to another program (my RTOS in this case).
The root cause of the "corrupted memory/DDR" during boot was that my boot loader set up the Ethernet MAC with DMA support but never disabled it before transferring control to my RTOS, which reclaims the same memory that the boot loader was using for a DMA scatter-gather list. As a result, my RTOS's data segment was being written to during initialization, depending on whether broadcast packets were present on the network. Nowadays, when I no longer wish to communicate with another hardware engineer, I mask/disable them for fear of them speaking out of line when I'm not present.
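The kind of teardown described above can be sketched in C. This is only an illustration under stated assumptions, not the author's driver: the register block `eth_mac_regs_t`, the bit names `DMA_RX_ENABLE` and `DMA_RX_BUSY`, and the function `eth_mac_quiesce` are all hypothetical, and a real MAC defines its own register layout and shutdown sequence in its datasheet.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical register block. A real Ethernet MAC has its own layout. */
typedef struct {
    volatile uint32_t dma_ctrl;     /* bit 0: RX DMA enable (assumed) */
    volatile uint32_t dma_status;   /* bit 0: RX DMA busy (assumed) */
    volatile uint32_t rx_ring_base; /* address of the RX descriptor ring */
    volatile uint32_t irq_mask;     /* 1 = interrupt source enabled (assumed) */
} eth_mac_regs_t;

#define DMA_RX_ENABLE 0x1u
#define DMA_RX_BUSY   0x1u
#define QUIESCE_SPINS 100000

/*
 * Stop RX DMA and detach the descriptor ring so the MAC can no longer
 * write received packets into memory that the next program will reuse.
 * Call this in the boot loader just before jumping to the loaded image.
 * Returns 0 on success, -1 if the DMA engine never reported idle.
 */
int eth_mac_quiesce(eth_mac_regs_t *mac)
{
    assert(mac != NULL);

    mac->irq_mask = 0u;               /* mask every MAC interrupt source */
    mac->dma_ctrl &= ~DMA_RX_ENABLE;  /* ask the RX DMA engine to stop */

    /* Wait for any in-flight transfer to finish before freeing the ring. */
    for (int spin = 0; spin < QUIESCE_SPINS; ++spin) {
        if ((mac->dma_status & DMA_RX_BUSY) == 0u) {
            mac->rx_ring_base = 0u;   /* detach the descriptor ring */
            return 0;
        }
    }
    return -1;                        /* still busy: do not hand off */
}
```

On real hardware the pointer would be the MAC's memory-mapped base address; in a host-side test it can point at an ordinary struct instance. If the function returns -1, the safer choice is not to transfer control, since the MAC may still write into the ring.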
Adrian Michaud is a Principal Software/Firmware/Hardware engineer with over 25 years of experience. Adrian has extensive embedded systems experience and routinely designs/develops custom embedded real-time operating systems, device drivers, development tools, and debuggers for many different CPU/uP architectures. When Adrian is not deeply embedded in his work, he enjoys the relaxing sport of skydiving with his wife.
jimfordbroadcom
4/29/2011 8:10 PM EDT
Ha! That last sentence just reminded me of my days at a now-defunct startup (NDS, to coin an acronym) when we used to often say "Boss, he's masking my interrupt!"
http://www.lulu.com/spotlight/poconoarmchairreview
4/30/2011 3:38 AM EDT
You're too late. They used that line already on the show, "Everybody Loves Klatuu."
http://www.lulu.com/spotlight/poconoarmchairreview
4/30/2011 3:41 AM EDT
Or maybe it was Klaatu. This is something my spell checker is just so stupid about.
Silicon_Smith
4/30/2011 5:16 AM EDT
Is it Klaatu barad nikto or Klaatu Verata Nictu?? I could never be sure...
Sheetal.Pandey
4/30/2011 1:43 PM EDT
Good. Normally the biggest problems are solved when you change the environment. I guess if we keep working in the same place and keep seeing the problem, we tend to ignore minor problems that are actually the cause.
cdhmanning
5/5/2011 1:07 AM EDT
Precisely.
I've often had an "aaah #$%^" moment in the shower, during a "bio break" or while driving. If I get stuck for more than an hour then I go home and will often figure the problem out.
One strategy that seems silly, but often really works, is to have a whiteboard in a room with a teddy-bear in a chair. If you get stuck then go explain to teddy how it works. In doing so you'll often find the problem.
t.alex
5/1/2011 10:18 PM EDT
I should practise bringing the boards home from now on instead of debugging in the office :-)