Fault tolerance matrix multiplication algorithms that reduce overheads on resources. The algorithms reduce both the number of additional processors required and communication costs, while resulting in only a negligible penalty on the flops count.
Reduces Incidence of Serious Errors and Costs
Errors are a serious concern in high performance computing. Increase in machine size and decrease in operating voltage lead to increase in the incidence of hard errors (component failure) and soft errors (bit flip).
There are various general-purpose hard error resiliency solutions, but these tend to be costly and severely degrade performance.
More efficient solutions incurring much lower overhead are available for numerical linear algebra computations based on distributed ”2D” algorithms, but these can guarantee high performance only when matrices feel all local memories otherwise there is a degradation in performance.
In the case of 2D algorithms, the number of additional processors required is reduced and a significant factor of the latency cost shaved off.
In cases of local memory larger than the minimum needed to store inputs and outputs, fault tolerant adaptations of blocked 2.5D algorithms and BFS-DFS (breadth first search and depth first search) algorithms are obtained that attain low order communication costs, require very few or no additional processors and have negligible overhead on the flops count.
Strassen and Strassen-like algorithms attain resiliency with small overhead costs.
Lower bounds on resources and performance parameters such as, additional processors, flops, and communication costs are formulated as functions of machine parameters, input size, maximum number of simultaneous faults and total number of faults showing that the algorithms are optimal or close to optimal.
The algorithms will be used in large scale computing, such as clouds, supercomputers, medium-large clusters, where occasional hard-errors are inevitable and are likely to harm the computation. Interested parties may include Intel, AMD, NVIDIA, and any other hardware provider that supports numerical computations as well as software providers for large scale or error-sensitive computing.
Granted US 10,387,534