4458

High Performance Fault Tolerant Matrix Multiplication Algorithms

Schwartz Oded, HUJI, School of Computer Science and Engineering, Computer Science

Reduces incidence of serious errors and costs

Categories

Computer Science & Engineering

Development Stage

Ongoing

Patent Status

Patent pending

Highlights

  • Errors are a serious concern in high performance computing. Increase in machine size and decrease in operating voltage lead to increase in the incidence of hard errors (component failure) and soft errors (bit flip).
  • There are various general-purpose hard error resiliency solutions, but these tend to be costly and severely degrade performance.
  • More efficient solutions incurring much lower overhead are available for numerical linear algebra computations based on distributed ”2D” algorithms, but these can guarantee high performance only when matrices feel all local memories otherwise there is a degradation in performance.

Our Innovation

Fault tolerance matrix multiplication algorithms that reduce overheads on resources. The algorithms reduce both the number of additional processors required and communication costs, while resulting in only a negligible penalty on the flops count.

See attached tables.

Key Features

  • In the case of 2D algorithms, the number of additional processors required is reduced and a significant factor of the latency cost shaved off.
  • In cases of local memory larger than the minimum needed to store inputs and outputs, fault tolerant adaptations of blocked 2.5D algorithms and BFS-DFS (breadth first search and depth first search) algorithms are obtained that attain low order communication costs, require very few or no additional processors and have negligible overhead on the flops count.
  • Strassen and Strassen-like algorithms attain resiliency with small overhead costs.
  • Lower bounds on resources and performance parameters such as, additional processors, flops, and communication costs are formulated as functions of machine parameters, input size, maximum number of simultaneous faults and total number of faults showing that the algorithms are optimal or close to optimal.

Development Milestones

  • Submission of paper(s)
  • Testing, implementation, benchmarking

The Opportunity

  • The algorithms will be used in large scale computing, such as clouds, supercomputers, medium-large clusters, where occasional hard-errors are inevitable and are likely to harm the computation.
  • Interested parties may include Intel, AMD, NVIDIA, and any other hardware provider that supports numerical computations as well as software providers for large scale or error-sensitive computing.

 

Contact for more information:

Aviv Shoher
SVP BUSINESS DEVELOPMENT
+972-2-6586635
Contact ME:
Image CAPTCHA