ABSTRACT High-performance computer systems, including those used for instrumentation, measurement, and advanced processing, require highly reliable, high-quality complex integrated circuits (ICs) to ensure the accuracy of the analytical data they process. Microprocessors and other complex ICs (e.g., GPGPUs) are widely considered the most important components within these systems. Like other components on a printed circuit board, they are susceptible to electrical, mechanical, and thermal failure modes, but because of their complexity and their roles within a circuit, performance-based failure can be an even larger concern. Stability of device parameters is key to guaranteeing that a system will function according to its design. Modifying the operational parameters of these devices through over- or under-clocking can reduce or improve overall reliability, respectively, and thereby directly affects the lifetime of the system in which the device is installed.
The ability to analyze and understand the impact that specific operating parameters have on device reliability is necessary to mitigate the risk of system degradation, which can affect bus speeds, memory access, and data retrieval, and can even cause early failure of the system or of critical components within it. An accurate mathematical approach has been devised that combines semiconductor formulae, industry-accepted failure mechanism models, Physics-of-Failure (PoF) knowledge, and, more importantly, device functionality to assess the reliability of the integrated circuits vital to system reliability.
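As a minimal illustration of the kind of industry-accepted failure mechanism model referred to above, the sketch below computes the Arrhenius thermal acceleration factor, a standard PoF relation for temperature-activated mechanisms. The activation energy and temperatures used here are hypothetical example values, not parameters taken from the approach described in this work.

```python
import math

BOLTZMANN_EV = 8.617e-5  # Boltzmann constant in eV/K


def arrhenius_acceleration_factor(ea_ev: float,
                                  t_use_k: float,
                                  t_stress_k: float) -> float:
    """Arrhenius acceleration factor between a use and a stress temperature.

    AF = exp[(Ea / k) * (1/T_use - 1/T_stress)], temperatures in kelvin.
    AF > 1 means the stress condition accelerates the failure mechanism.
    """
    return math.exp((ea_ev / BOLTZMANN_EV) * (1.0 / t_use_k - 1.0 / t_stress_k))


# Hypothetical example: Ea = 0.7 eV, 55 C use vs. 125 C stress.
af = arrhenius_acceleration_factor(0.7, 328.15, 398.15)
print(f"Acceleration factor: {af:.1f}")
```

An over-clocked device typically runs hotter; a model of this form is one way such a temperature shift can be mapped onto a change in expected lifetime.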