Parallaft: Runtime-based CPU Fault Tolerance via Heterogeneous Parallelism
The increasing vulnerability of microprocessors to frequent silicon faults greatly exacerbates the risk of silent data corruption. Existing software-based schemes for detecting such corruption suffer from high power and performance overheads. State-of-the-art hardware fault-tolerance techniques exploit processor heterogeneity to minimize power, performance and area overheads, but have not been deployed in production because of their complexity.
This paper shows for the first time that the insights of heterogeneous parallelism can be repurposed without any hardware support. We present Parallaft, a parallel software-based error detection technique that applies the insights of state-of-the-art hardware techniques using mechanisms better suited to today's hardware, such as copy-on-write checkpointing, dirty-page tracking and performance-counter synchronization. This allows error checking to be offloaded to the little cores of an Apple M2 heterogeneous processor, achieving less than half the energy overhead of the homogeneous duplication mechanism RAFT while maintaining comparable performance.
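
To make the copy-on-write checkpointing idea concrete, the following is a minimal sketch and not Parallaft's implementation. It assumes a POSIX system where `os.fork()` gives the child a copy-on-write snapshot of the parent's memory. The function names (`run_segment`, `digest`, `checked_segment`), the toy workload and the `inject_fault` flag are all hypothetical. The sketch omits dirty-page tracking, performance-counter synchronization and the placement of the checker on little cores, and it compares a hash of a small Python dictionary rather than the modified pages of an unmodified binary.

```python
import hashlib
import os
import pickle


def run_segment(state, steps):
    """Deterministic stand-in for one slice of application work."""
    for i in range(steps):
        state["acc"] = (state["acc"] * 31 + i) % 1_000_003
    return state


def digest(state):
    """Hash the state so only a short summary needs to be compared."""
    return hashlib.sha256(pickle.dumps(sorted(state.items()))).digest()


def checked_segment(state, steps, inject_fault=False):
    """Run one segment in the main process and re-run it in a forked checker.

    Returns True when both executions end in the same state.
    """
    read_fd, write_fd = os.pipe()
    pid = os.fork()  # the child receives a copy-on-write snapshot of `state`
    if pid == 0:
        # Checker: replay the segment from the snapshot and report its digest.
        try:
            os.close(read_fd)
            os.write(write_fd, digest(run_segment(state, steps)))
        finally:
            os._exit(0)  # never fall through into the parent's code path

    # Main: run the same segment, optionally simulating a silent corruption.
    os.close(write_fd)
    run_segment(state, steps)
    if inject_fault:
        state["acc"] ^= 1  # flip one bit, as a transient fault might
    with os.fdopen(read_fd, "rb") as pipe:
        checker_digest = pipe.read()
    os.waitpid(pid, 0)
    return checker_digest == digest(state)
```

Calling `checked_segment({"acc": 1}, 1000)` should report a match, while passing `inject_fault=True` should report a mismatch. In this sketch the main process waits for the checker at the end of every segment; this is a simplification and does not reflect how the paper overlaps checking with main execution.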