Fault Tolerance with High Performance for Matrix Multiplication

Schwartz Oded, HUJI, School of Computer Science and Engineering, Computer Science


  • Increasing machine size and decreasing operating voltage cause more hard errors (component failures) and soft errors (bit flips) in high-performance computing.
  • Hard-error resiliency solutions such as checkpoint-restart are costly and severely degrade performance. These solutions are based on distributed "2D" algorithms and hence guarantee optimal performance only for the minimal memory size.
  • When more memory is available, a significant increase in the number of processors is required, and the inter-processor communication costs are asymptotically larger than the lower bounds dictate.
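
The gap described in the last two points can be illustrated with the standard communication bounds for classical O(n^3) matrix multiplication. This is a sketch under assumed notation, none of which appears in the original text: n is the matrix dimension, P the number of processors, M the local memory per processor, c a memory replication factor, and W the number of words communicated per processor. Whether these are exactly the bounds the authors have in mind is an assumption.

```latex
\begin{align*}
W_{\text{lower}} &= \Omega\!\left(\frac{n^{3}}{P\sqrt{M}}\right),
&
W_{\text{2D}} &= \Theta\!\left(\frac{n^{2}}{\sqrt{P}}\right) \\[4pt]
M = \Theta\!\left(\frac{n^{2}}{P}\right)
&\;\Longrightarrow\;
W_{\text{lower}} = \Omega\!\left(\frac{n^{2}}{\sqrt{P}}\right)
&&\text{(2D algorithms are optimal)} \\[4pt]
M = \Theta\!\left(\frac{c\,n^{2}}{P}\right),\ 1 \le c \le P^{1/3}
&\;\Longrightarrow\;
W_{\text{lower}} = \Omega\!\left(\frac{n^{2}}{\sqrt{cP}}\right)
&&\text{so}\quad \frac{W_{\text{2D}}}{W_{\text{lower}}} = \Theta\!\left(\sqrt{c}\right)
\end{align*}
```

With only the minimal memory needed to hold the matrices, a 2D algorithm meets the lower bound. With c times more memory, the bound drops by a factor of sqrt(c), but a 2D algorithm's communication cost does not, which is the asymptotic gap noted above.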

Our Innovation

A novel computation model for fault-tolerant matrix multiplication algorithms that reduce resource overhead, minimizing both the number of additional processors required and the communication costs.

  • Enable redundant memory
  • Obtain resiliency for Strassen and Strassen-like algorithms, with small cost overheads.
  • Lower bounds on additional resources
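
For readers unfamiliar with the Strassen algorithm named above, the following is a minimal sequential sketch of the textbook recursion, which replaces the eight block products of classical 2x2 block multiplication with seven. It is not the fault-tolerant parallel method covered by the patent. The function names (`strassen`, `naive_mul`, and the helpers) are hypothetical, and the sketch assumes square matrices whose dimension is a power of two, stored as lists of lists.

```python
def mat_add(X, Y):
    """Element-wise sum of two equally sized matrices."""
    return [[a + b for a, b in zip(r, s)] for r, s in zip(X, Y)]


def mat_sub(X, Y):
    """Element-wise difference of two equally sized matrices."""
    return [[a - b for a, b in zip(r, s)] for r, s in zip(X, Y)]


def naive_mul(X, Y):
    """Classical O(n^3) product, used as the recursion base case."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def split(X):
    """Split a 2h x 2h matrix into four h x h quadrants."""
    h = len(X) // 2
    return ([row[:h] for row in X[:h]], [row[h:] for row in X[:h]],
            [row[:h] for row in X[h:]], [row[h:] for row in X[h:]])


def strassen(A, B, cutoff=1):
    """Multiply square matrices A and B (dimension a power of two)."""
    n = len(A)
    if n <= cutoff:
        return naive_mul(A, B)
    A11, A12, A21, A22 = split(A)
    B11, B12, B21, B22 = split(B)
    # Seven recursive products instead of eight.
    M1 = strassen(mat_add(A11, A22), mat_add(B11, B22), cutoff)
    M2 = strassen(mat_add(A21, A22), B11, cutoff)
    M3 = strassen(A11, mat_sub(B12, B22), cutoff)
    M4 = strassen(A22, mat_sub(B21, B11), cutoff)
    M5 = strassen(mat_add(A11, A12), B22, cutoff)
    M6 = strassen(mat_sub(A21, A11), mat_add(B11, B12), cutoff)
    M7 = strassen(mat_sub(A12, A22), mat_add(B21, B22), cutoff)
    # Recombine the products into the four quadrants of C.
    C11 = mat_add(mat_sub(mat_add(M1, M4), M5), M7)
    C12 = mat_add(M3, M5)
    C21 = mat_add(M2, M4)
    C22 = mat_add(mat_add(mat_sub(M1, M2), M3), M6)
    top = [l + r for l, r in zip(C11, C12)]
    bottom = [l + r for l, r in zip(C21, C22)]
    return top + bottom


if __name__ == "__main__":
    A = [[1, 2], [3, 4]]
    B = [[5, 6], [7, 8]]
    print(strassen(A, B))
```

Because each level of recursion performs seven half-size products, the arithmetic cost is O(n^log2(7)), roughly O(n^2.81), rather than O(n^3). In practice a larger `cutoff` is typically used so that small blocks fall back to the classical product.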


  • Lower communication costs
  • Better computation and high performance

Patent Status

Granted US 11,080,131

Contact for more information:

Anna Pellivert