
The defense will take place on Wednesday, the 17th of September.
Title
Hardware Arithmetic Acceleration for Machine Learning and Scientific Computing.
Abstract
In a data-driven world, machine learning and scientific computing have become increasingly important, justifying dedicated hardware accelerators. This thesis explores the design and implementation of arithmetic units for such accelerators in Kalray’s Massively Parallel Processor Array.
Machine learning requires matrix multiplications that operate on very small number formats. In this context, this thesis studies the implementation of mixed-precision dot-product-and-add operators for various 8-bit and 16-bit formats (FP8, INT8, Posit8, FP16, BF16), using variants of a classic technique, the long accumulator. It also introduces techniques for combining different input formats within a single operator. Radically different methods are studied to scale to the larger dynamic range of the 32-bit and 64-bit formats common in scientific computing.
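To give a flavour of the long-accumulator idea, here is a minimal Python sketch (not the hardware design studied in the thesis): each product is expressed as an integer significand scaled by a power of two and added into a single very wide fixed-point accumulator, so the dot product is exact and independent of summation order. The names bf16_decode and long_acc_dot are illustrative, inputs are raw 16-bit BF16 patterns, and NaN/infinity handling is omitted.

    from fractions import Fraction

    def bf16_decode(bits):
        # Split a BF16 bit pattern into (sign, unbiased exponent, significand).
        # The integer significand carries an implicit scale of 2**-7.
        sign = (bits >> 15) & 1
        exp = (bits >> 7) & 0xFF
        frac = bits & 0x7F
        if exp == 0:                             # subnormal: no implicit leading 1
            return sign, -126, frac
        return sign, exp - 127, frac | 0x80      # prepend the implicit leading 1

    def long_acc_dot(a_bits, b_bits):
        # Exact BF16 dot product using a Kulisch-style long accumulator.
        # A hardware unit would use a wide fixed-point register; an
        # arbitrary-precision Python integer plays that role here.
        BIAS = 2 * (126 + 7)                     # smallest product exponent -> bit 0
        acc = 0
        for x, y in zip(a_bits, b_bits):
            sx, ex, mx = bf16_decode(x)
            sy, ey, my = bf16_decode(y)
            # Align the 16-bit significand product into the accumulator;
            # the -14 folds in the two 2**-7 significand scales.
            term = (mx * my) << (ex + ey - 14 + BIAS)
            acc += -term if sx ^ sy else term
        return Fraction(acc, 1 << BIAS)          # exact value: acc * 2**-BIAS

For example, long_acc_dot([0x3F80, 0x4000], [0x3F80, 0x3F00]) computes 1.0*1.0 + 2.0*0.5 and returns Fraction(2, 1), with no rounding at any intermediate step.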
This thesis also studies the evaluation of some elementary functions. An operator for the exponential function (crucial for softmax computations) extends a state-of-the-art architecture to accept multiple input formats. The inverse square root function (used for layer normalisation) is accelerated by combining state-of-the-art range reduction techniques, correctly rounded multipartite tables, and iterative refinement in software.
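The iterative refinement mentioned above is, in spirit, the classical Newton-Raphson step for 1/sqrt(a): given an approximation y, the update y' = y*(3 - a*y*y)/2 roughly doubles the number of correct bits. Below is a minimal Python sketch, with a plain uniform lookup table standing in for the correctly rounded multipartite table of the thesis; the table size and iteration count are illustrative choices, not the thesis parameters.

    import math

    # Uniform table of 1/sqrt(m) samples at bin midpoints over m in [1, 4)
    # (a crude stand-in for a correctly rounded multipartite table).
    N = 64
    TABLE = [1.0 / math.sqrt(1.0 + (i + 0.5) * 3.0 / N) for i in range(N)]

    def rsqrt(a, refine_steps=3):
        # Approximate 1/sqrt(a) for a > 0.
        # Range reduction: write a = m * 4**k with m in [1, 4), so that
        # 1/sqrt(a) = 2**-k / sqrt(m); using an even power of two keeps
        # the square root of the scale factor exact.
        k = math.floor(math.log2(a) / 2)
        m = a / 4.0 ** k

        # Initial estimate of 1/sqrt(m) from the lookup table.
        y = TABLE[int((m - 1.0) * N / 3.0)]

        # Newton-Raphson refinement: each step roughly doubles the
        # number of correct bits.
        for _ in range(refine_steps):
            y = y * (1.5 - 0.5 * m * y * y)

        return y * 2.0 ** -k

With three refinement steps this sketch reaches near double-precision accuracy for moderate inputs; a correctly rounded multipartite table provides a far more accurate starting point than the uniform table used here.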