Application
-
The increase in machine size and the decrease in operating voltage cause more hard errors (component failures) and soft errors (bit flips) in high-performance computing.
-
Hard-error resiliency solutions such as checkpoint-restart are costly and severely degrade performance. These solutions are based on distributed “2D” algorithms, and hence guarantee optimal performance only when the available memory is minimal.
-
When more memory is available, a significant increase in the number of processors is required, and the inter-processor communication costs are asymptotically larger than the lower bounds dictate.
Our Innovation
A novel computation model for fault-tolerant matrix multiplication algorithms that reduces resource overhead by minimizing both the number of additional processors required and the communication costs.
-
Enables the use of redundant memory
-
Obtains resiliency for Strassen and Strassen-like algorithms with small cost overheads.
-
Lower bounds on additional resources
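
For readers unfamiliar with the Strassen algorithm named above, the following is a minimal sketch of one level of the textbook recursion on 2x2 matrices. It is not the fault-tolerant scheme itself, and the function names `strassen_2x2` and `classical_2x2` are illustrative assumptions. It shows only that the product can be formed from seven multiplications instead of eight, and that the seven sub-products do not depend on one another.

```python
def strassen_2x2(A, B):
    """Multiply two 2x2 matrices with Strassen's seven products.

    Textbook formulation, shown for illustration only; this is not the
    fault-tolerant algorithm described in this brief.
    """
    (a11, a12), (a21, a22) = A
    (b11, b12), (b21, b22) = B

    # Seven multiplications (the classical method uses eight).
    # None of them depends on the result of another.
    m1 = (a11 + a22) * (b11 + b22)
    m2 = (a21 + a22) * b11
    m3 = a11 * (b12 - b22)
    m4 = a22 * (b21 - b11)
    m5 = (a11 + a12) * b22
    m6 = (a21 - a11) * (b11 + b12)
    m7 = (a12 - a22) * (b21 + b22)

    # Recombine the products into the four entries of C = A * B.
    return [
        [m1 + m4 - m5 + m7, m3 + m5],
        [m2 + m4, m1 - m2 + m3 + m6],
    ]


def classical_2x2(A, B):
    """Reference result using the classical eight-multiplication method."""
    return [
        [sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
        for i in range(2)
    ]
```

Applied recursively to matrix blocks rather than scalars, the same seven-product pattern gives the familiar sub-cubic operation count.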
Opportunity
-
Lower communication costs
-
Better computation and high performance
Patent Status
Granted US 11,080,131