SEE (Single Event Effects) are often misunderstood, and this leads to confusion in the design and mitigation strategy.
Terrestrial SEE come from two sources: the first is atmospheric neutrons, and the second is alpha particles emitted from the package itself. The manufacturers have some control over the second cause and go to great lengths to prevent it. The rate of atmospheric neutrons changes with both altitude and latitude.
When considering SEE mitigation you must look at the type of SEE and at what you are trying to protect (e.g. configuration memory or user logic), as the implementation will be different. SEE can be grouped into a few different types:
Single Event Transient - when an SEE hits a combinatorial gate or signal line, creating a temporary glitch.
Single Event Upset - when an SEE hits a memory cell or register and flips its state. This is what people traditionally think of when they think of SEE.
Multiple Bit Upset - when an SEE corrupts more than one memory or register bit.
Single Event Functional Interrupt - when an SEE hits logic in a way that prevents the device from operating correctly until it is power cycled or reconfigured. This is only really a consideration for space flight.
I know Xilinx have spent a lot of time and effort, through the Rosetta programme, establishing the FIT rate per megabit of configuration and user memory. Coupled with the essential bits technology, this lets you determine the actual mean time between SEE on your FPGA, and hence the probability of your FPGA seeing an SEE during the time it is powered on. In practice this means that for some applications (those that are not life critical, mission critical and so on) the best mitigation against SEE is often simply to reconfigure the device regularly.
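As a rough illustration of that calculation, here is a minimal Python sketch. It assumes the usual definition of FIT (one failure per 10^9 device-hours) and a constant upset rate, so the chance of seeing no upset falls off exponentially with powered-on time. The function names and all the input numbers are hypothetical, chosen only to show the arithmetic; real figures come from the vendor's reliability data and the essential bits report for your own design.

```python
import math

HOURS_PER_FIT_UNIT = 1e9  # one FIT = one failure per 10^9 device-hours


def device_fit(fit_per_mbit, memory_mbits, essential_fraction):
    """Upset rate (in FIT) for the bits a design actually depends on."""
    return fit_per_mbit * memory_mbits * essential_fraction


def mtbf_hours(fit):
    """Mean time between upsets, in hours, for a given FIT rate."""
    return HOURS_PER_FIT_UNIT / fit


def prob_no_upset(fit, powered_hours):
    """Probability of no upset while powered, assuming a constant rate."""
    return math.exp(-powered_hours * fit / HOURS_PER_FIT_UNIT)


# Hypothetical inputs, not vendor data.
fit = device_fit(fit_per_mbit=100.0, memory_mbits=50.0, essential_fraction=0.1)
print(f"device FIT: {fit:.0f}")
print(f"MTBF: {mtbf_hours(fit):.0f} hours")
print(f"P(no upset in 24 h): {prob_no_upset(fit, 24.0):.6f}")
```

Shortening the powered-on interval between reconfigurations raises the probability of getting through each interval without an upset, which is why regular reconfiguration can be enough for non-critical applications.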
There is a really good handbook which can be downloaded from the ESA website, called "Space Engineering Product Assurance, Techniques for Radiation Effects Mitigation in ASICS and FPGAs". It is very comprehensive and is a good read for anyone working in high reliability applications where SEE have to be considered.
Re: SEU/SEE/SER on post-40nm chips in general, a hot topic: ST just presented a paper (#31.1) at IEDM entitled "Technology Downscaling Worsening Radiation Effects in Bulk: SOI to the Rescue". IBM makes a similar argument for SOI-FinFETs (see http://www.advancedsubstratenews.com/2013/04/ibm-finfet-isolation-considerations-and-ramifications-bulk-vs-soi/).
You are right that SOI performs well with regard to SEUs. To date, none of the FPGA vendors has produced a product using SOI or SOI-FinFET technology. The other big advantage claimed for it is low leakage current, and leakage is an important part of the total supply current. It would take a lot of effort to design the chips on SOI, and the ST process is larger than the 20nm planar or 14nm FinFET that is the current focus of Xilinx and Altera. But you never know in this industry...
Packaging and environmental or use-case factors can result in a high soft error rate independent of altitude or geographic location. Microsemi FPGAs, for example, employ low alpha particle packaging. High alpha packaging, or airborne particles, can take a FIT rate in the low 10s and turn it into a FIT rate of over 10,000.
These "single events" have the potential to flip the charge on a gate, got it. And this is bad for memory exposed to radiation. OK then, how about the solution used to compensate for poor memory in all major business setups?
In short - what about RAID?
On a small scale, I suppose you'd call it RAIM (redundant array of independent memory). Options would range from RAIM-1 mirrored data through to RAIM-5 or higher. You can lock out segments with persistent faults, or choose to dismiss parity faults as a "one off" due to a single radiation event.
Is that a viable solution? Does spreading copies of data over a wider area reduce the risk of corruption?
The concept of using RAID is a bit different to the situation on an FPGA. Let me explain.
RAID (as I understand it) spreads the data over a number of different disks. That way, if a disk fails, the error detection and correction codes can reconstitute the data and recover from the errors. The same basic technique can be applied to application data stored inside the memory blocks on the FPGA. Thinking about it though, there might be a virtue in spreading the data across the different blocks of memory inside the FPGA. I say that because there is a probability that a particle can "flip" more than one adjacent cell as it speeds through the silicon. This could cause multiple errors in the same word stored in memory.
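To make the point about adjacent cells concrete, here is a small Python sketch. It assumes a made-up memory row of 4 words of 8 bits each and a strike that flips 2 neighbouring cells. The layout, the constants and the function names are hypothetical and do not describe the geometry of any real FPGA memory block. It only counts how many bits of each word are hit when the bits of a word are stored side by side, versus when the words are interleaved so that neighbouring cells belong to different words.

```python
WORD_BITS = 8   # hypothetical word width
NUM_WORDS = 4   # hypothetical number of words sharing one physical row


def owner_word(cell, interleaved):
    """Return the logical word that owns a physical cell in the row."""
    if interleaved:
        # Neighbouring cells belong to different words.
        return cell % NUM_WORDS
    # Plain layout: all bits of one word sit next to each other.
    return cell // WORD_BITS


def errors_per_word(first_cell, cells_hit, interleaved):
    """Count how many bits of each word are flipped by one multi-cell strike."""
    counts = [0] * NUM_WORDS
    for cell in range(first_cell, first_cell + cells_hit):
        counts[owner_word(cell, interleaved)] += 1
    return counts


plain = errors_per_word(first_cell=10, cells_hit=2, interleaved=False)
spread = errors_per_word(first_cell=10, cells_hit=2, interleaved=True)
print("plain layout:      ", plain)   # two flipped bits land in one word
print("interleaved layout:", spread)  # one flipped bit in each of two words
```

In the plain layout both flipped bits land in the same word, which a code that corrects only single-bit errors cannot repair. In the interleaved layout each affected word has one flipped bit, so a single-error-correcting code on each word would be able to recover both.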
The other problem with SEUs in an FPGA is that the particle might flip a storage bit in the configuration memory. If the bit is active in that particular design (and 9 out of 10 are not), then it could change the logic and cause a malfunction until the error is purged. There is no equivalent of RAID to correct the configuration data, but the chips can be set to test for flipped bits.
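The idea of testing for flipped configuration bits can be sketched in Python as follows. This is only a software model under stated assumptions: the "frames" are plain byte strings, the check is a CRC-32 from Python's zlib module, and the repair step rewrites a bad frame from a known-good copy. None of this is a vendor's actual readback mechanism, and the function names are hypothetical.

```python
import zlib


def frame_checksums(frames):
    """Record a CRC-32 for each configuration frame (frames are bytes)."""
    return [zlib.crc32(frame) for frame in frames]


def find_corrupt_frames(frames, golden_crcs):
    """Return indices of frames whose CRC no longer matches the record."""
    return [i for i, frame in enumerate(frames)
            if zlib.crc32(frame) != golden_crcs[i]]


def flip_bit(frame, bit_index):
    """Return a copy of the frame with one bit inverted (a simulated SEU)."""
    data = bytearray(frame)
    data[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(data)


# Hypothetical configuration image: 4 frames of 16 bytes each.
golden = [bytes([i] * 16) for i in range(4)]
crcs = frame_checksums(golden)

live = list(golden)
live[2] = flip_bit(live[2], 37)          # simulate an upset in frame 2

corrupt = find_corrupt_frames(live, crcs)
print("corrupt frames:", corrupt)

for i in corrupt:                        # purge the error from the good copy
    live[i] = golden[i]
print("after repair:", find_corrupt_frames(live, crcs))
```

A check like this only detects and locates the flipped bit. Whether the upset actually matters still depends on whether that bit is one the design uses.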