Nice to see SDC concerns being taken more seriously by hardware folks. Once software gets to sufficient quality (which we have achieved in many cases), these kinds of rando hw issues are the only remaining causes of "impossible" bugs that waste endless engineering time to debug.
I wonder how much of this relies on, or is made easier by, the clustered core architecture of E-Core Xeons. In comparison, each physical core of a P-Core Xeon is basically its own island.
Is this limited to lockstep between softcores on a die? So good for low-level failures like soft errors, but no good if the package dies?
(Still very neatly done)
I wonder what the ratio of software bugs to these kinds of hardware bugs is in the wild. Maybe the work in this paper will help produce that metric.