# 3rd Workshop On Approximate Computing (WAPCO 2017)

### In conjunction with HiPEAC 2017, Stockholm, January 25, 2017

### http://wapco.inf.uth.gr


### WAPCO has been supported by the EC within the H2020-ICT-2015 programme under the grant Uniserver

# Workshop Program

08:15 - 10:00 HiPEAC Conference Keynote

10:00 - 11:00 **Welcome**

**Invited Talk**

Andreas Moshovos, University of Toronto

*A Bit-Pragmatic Approach to Accelerating Convolutional Neural Networks for Deep Learning*

[Abstract][Slides]

11:00 - 11:30 Coffee Break

11:30 - 13:00 **Session 1: Approximation at the Hardware and Software Levels**

Session Chair: George Karakonstantis (Queen's University Belfast)

11:30 - 11:50

Vassilis Vassiliadis, Konstantinos Parasyris, Christos D. Antonopoulos, Spyros Lalis and Nikolaos Bellas.
*Using artificial neural networks for error detection in unreliable computations.*

[Abstract][Slides][Paper]

11:50 - 12:10

Mario Barbareschi, Antonino Mazzeo, Domenico Amelino, Alberto Bosio and Antonio Tammaro.
*Implementing Approximate Computing Techniques by Automatic Code Mutation.*

[Abstract][Slides][Paper]

12:10 - 12:30

Aurangzeb and Rudolf Eigenmann.
*PROCsimate: A Scheme for Approximating Procedures with Dynamic Quality Monitoring and Result Guarantees.*

[Abstract][Slides][Paper]

12:30 - 12:45

Igor Neri, Miquel López-Suárez and Luca Gammaitoni.
*Thermodynamic limits for approximate MEMS memory devices (Short paper).*

[Abstract][Slides][Paper]

12:50 - 14:00 Lunch Break

14:00 - 15:30 **Session 2: Applications of Approximate Computing and Modelling**

Session Chair: Christos D. Antonopoulos (University of Thessaly)

14:00 - 14:20

Imran Wali, Marcello Traiola, Arnaud Virazel, Patrick Girard, Mario Barbareschi and Alberto Bosio.
*Can we Approximate the Test of Integrated Circuits?*

[Abstract][Slides][Paper]

14:20 - 14:40

Oscar Palomar, Andy Nisbet, John Mawer, Graham Riley and Mikel Luján.
*Reduced precision applicability and trade-offs for SLAM algorithms.*

[Abstract][Slides][Paper]

14:40 - 15:00

Jens Deussen and Uwe Naumann.
*Compression of Higher Derivative Tensors in Stochastic Significance Analysis.*

[Abstract][Slides][Paper]

Andreas Moshovos has been teaching Computer Architecture at the University of Toronto since 2000. He has also taught at Northwestern University, the University of Athens, the Hellenic Open University, and as an invited professor at the École Polytechnique Fédérale de Lausanne. He has served as the Program Chair for the International Symposium on Microarchitecture and the International Symposium on Performance Analysis of Systems and Software. Over the years, he has trained 6 Ph.D. and 12 Master’s students and is currently working with 5 graduate students.

This talk reviews our progress toward making convolutional neural networks (CNNs) execute faster and with higher energy efficiency by exploiting their value properties and their tolerance for approximate computations. Our emphasis has been on a value-based approach, which goes beyond exploiting the computation structure of CNNs to also exploit their value properties. Our past work exploited ineffectual activations and the per-layer variability in the precision requirements of CNNs during inference to reduce the number of computations, achieving a 1.9x performance improvement over the state of the art. By further exploiting the tolerance of CNNs to approximate computations, our accelerator can trade off a small loss in accuracy to further boost performance and energy efficiency. Our latest work exploits the lopsided bit-value distribution of the runtime-calculated values in CNNs, where most bits tend to be zero. This talk will explain that if it were possible to eliminate all the computations that involve a zero activation bit, performance could improve by more than 5x. A practical accelerator design improves performance by nearly 3.5x even when the baseline state-of-the-art accelerator uses a TensorFlow-like 8-bit quantized representation.
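The zero-bit observation in the abstract above can be illustrated in software. The following is a minimal sketch, not the accelerator design presented in the talk: it assumes non-negative 8-bit activations (for example after a ReLU), and the function names `zero_bit_fraction` and `shift_add_multiply` are hypothetical. It counts how many activation bits are zero and shows a shift-and-add multiplication that only spends work on the nonzero bits.

```python
def zero_bit_fraction(activations, bits=8):
    """Fraction of bit positions that are zero across non-negative activations."""
    if any(a < 0 or a >= (1 << bits) for a in activations):
        raise ValueError("activations must fit in an unsigned value of the given width")
    total_bits = bits * len(activations)
    one_bits = sum(bin(a).count("1") for a in activations)
    return 1.0 - one_bits / total_bits


def shift_add_multiply(weight, activation):
    """Multiply by adding one shifted copy of `weight` per nonzero activation bit.

    Returns (product, terms), where `terms` is the number of additions performed.
    Zero bits contribute nothing and are skipped.
    """
    if activation < 0:
        raise ValueError("activation must be non-negative")
    product = 0
    terms = 0
    position = 0
    while activation:
        if activation & 1:
            product += weight << position
            terms += 1
        activation >>= 1
        position += 1
    return product, terms


if __name__ == "__main__":
    acts = [0, 1, 3, 128]
    print("zero-bit fraction:", zero_bit_fraction(acts))
    print("7 * 10 via shift-add:", shift_add_multiply(7, 10))
```

In this sketch the number of additions depends on the count of nonzero bits rather than on the bit width, which is the property a hardware design would need to exploit to benefit from mostly-zero values.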

We introduce a methodology to use Artificial Neural Networks (ANNs) for automatic error detection on the outputs of selected parts of a program that execute on unreliable hardware operating at frequencies beyond the nominal levels. We use an OpenMP-inspired programming model and an accompanying runtime system which enable developers to specify the significance of tasks with respect to their effect on the output quality of the program. The runtime system executes the least significant tasks on cores which operate unreliably and uses specially trained ANNs to detect errors, which are then corrected by a user-supplied error correction function. We test our approach using a benchmark application that uses the Discrete Cosine Transform (DCT) kernel to compress images. A fault injection campaign indicates that it is possible to achieve a 1.77x speedup over a fully reliable execution of the code, with a minimal penalty to output quality (43.46 dB Peak Signal to Noise Ratio (PSNR) between the decompressed and the original image, compared to the 43.67 dB PSNR of a fully reliable execution).
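The quality figures in the abstract above are PSNR values. As a reminder of how such a figure is computed, here is a minimal sketch using the standard definition PSNR = 10 log10(MAX² / MSE). It assumes 8-bit pixel values stored in flat lists; the function name `psnr` and the sample data are illustrative and are not taken from the paper.

```python
import math


def psnr(original, reconstructed, max_value=255.0):
    """Peak Signal to Noise Ratio in dB between two equally sized pixel sequences."""
    if len(original) != len(reconstructed) or not original:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixels.
    mse = sum((o - r) ** 2 for o, r in zip(original, reconstructed)) / len(original)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


if __name__ == "__main__":
    reference = [100, 100, 100, 100]
    decoded = [101, 99, 101, 99]  # every pixel off by one
    print(f"PSNR: {psnr(reference, decoded):.2f} dB")
```

Because the scale is logarithmic, a small difference in dB between two runs corresponds to a small difference in mean squared error.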

Approximate computing has emerged as one of the most important breakthroughs in many scientific research areas. It exploits the inherent tolerance of algorithms to computational errors in order to outperform the original versions at the cost of reduced result quality. The research community has demonstrated the effectiveness of the trade-off between accuracy and performance metrics such as energy consumption, time and occupied area for integrated circuits, and many approximate computing methodologies have been proposed. Unfortunately, the approaches introduced so far fit specific application domains, and a general and systematic methodology to automatically define approximate algorithms is still an open challenge. Bearing in mind this lack of generality, in previous works we introduced a methodology which makes use of software mutation to obtain approximate versions of a software-defined algorithm. Based on this concept, we developed IDEA, a design exploration tool that relies on a source-to-source manipulation technique, implemented by an open-source tool called clang-Chimera, in order to apply code transformations that approximate the computation of a C/C++ algorithm. In this paper, we detail the methodological flow introduced by IDEA and describe, by means of a walkthrough, every step needed to make any approximate computing technique available. Furthermore, we provide experimental results which highlight the effectiveness of the approach, in particular by searching for approximate variants that apply the loop perforation technique.
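Loop perforation, mentioned at the end of the abstract above, skips a subset of loop iterations and accepts the resulting loss in accuracy. The sketch below is a hand-written illustration of the idea in Python, not output of IDEA or clang-Chimera (which operate on C/C++ sources); the function names and the `stride` parameter are assumptions made for this example.

```python
def mean_exact(values):
    """Reference version: visits every element."""
    total = 0.0
    for v in values:
        total += v
    return total / len(values)


def mean_perforated(values, stride):
    """Perforated version: executes only every `stride`-th iteration.

    Returns (approximate_mean, iterations_executed).
    """
    if stride < 1:
        raise ValueError("stride must be at least 1")
    total = 0.0
    executed = 0
    for i in range(0, len(values), stride):  # the skipped iterations are never run
        total += values[i]
        executed += 1
    return total / executed, executed


if __name__ == "__main__":
    data = [float(i % 10) for i in range(1000)]
    exact = mean_exact(data)
    approx, executed = mean_perforated(data, stride=3)
    print(f"exact={exact}, approx={approx}, iterations={executed} of {len(data)}")
```

A design exploration tool would typically generate many such variants with different perforation rates and keep those whose error stays within a user-defined bound.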

Approximating entire procedures in applications amenable to approximation can offer significant performance gains. This paper proposes PROCsimate, a function approximation scheme. PROCsimate provides efficient and fast function approximation in software, monitors the quality of approximation, and offers guarantees about results. The scheme dynamically improves approximation over the course of application execution by capturing the changing input behavior of the application and by speculatively training itself. The scheme is easy to use, as it does not require multiple trial-and-error attempts from the user; rather, it automatically selects its parameters and tunes them over time to produce acceptable results while honoring guarantees. PROCsimate leverages idle cores in a system to offload some of the tasks from its critical path. It also introduces a multilevel hash table to dynamically enrich the history which the scheme uses to build an approximate model of the candidate function.

In this paper we present the results obtained by evaluating the energy cost of operating a micro electro-mechanical system as a memory bit. The experiment was conducted at an effective temperature above room temperature in order to make the energy dissipated due to friction negligible with respect to the energy required by thermodynamics. Finally, we have evaluated the impact of approximate memorization on the energy cost of the operation.

In recent years Approximate Computing (AC) has emerged as a new paradigm for energy-efficient IC design. It addresses the problem of maintaining reliability, and thus of coping with run-time errors, at an acceptable overhead in terms of area, performance and energy consumption. This work starts from the consideration that AC-based systems can intrinsically accept the presence of faulty hardware (i.e., hardware that can produce errors). This paradigm is also called “computing on unreliable hardware”. The hardware-induced errors have to be analyzed to determine their propagation through the system layers and ultimately their impact on the final application. In other words, an AC-based system does not need to be built using defect-free ICs. Under this assumption, we can relax the test and reliability constraints of the manufactured ICs. One of the ways to achieve this goal is to test only for a subset of faults instead of targeting all possible faults. In this way, we can reduce the manufacturing cost, since we reduce the number of test patterns and thus the test time. We call this approach Approximate Test. The main advantage is that we do not need prior knowledge of the workload (i.e., we are application independent). Therefore, the proposed approach can be applied to any kind of IC, reducing the test time and increasing the yield. We present preliminary results on some simple case studies. The main goal is to show that by leaving some faults undetected we can save test time without having a large impact on the application quality.

This paper presents a study of the use of half-precision (16-bit) floating point for SLAM algorithms, an emerging computer vision problem. Our experiments show that a mix of 16-bit and 32-bit floating point data is needed for the algorithm to function properly, and a library that facilitates mixing 16- and 32-bit floating point variables is introduced. We present our initial results for the whole benchmark selected, showing that 72% of the floating point operations can be performed on 16-bit data, with limited impact on the accuracy of the results. We then study in depth one of the kernels of the algorithm and the impact of using 16-bit/32-bit floating point values in different variables used in the kernel.
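To give a feel for why a mix of 16-bit and 32-bit data can matter, the sketch below emulates IEEE 754 half- and single-precision rounding with Python's standard `struct` module (format characters `e` and `f`). It is not the library introduced in the paper, and the example workload (a long accumulation) is an assumption chosen only to show one way in which a pure 16-bit computation can go wrong; the helper names are hypothetical.

```python
import struct


def to_half(x):
    """Round x to the nearest IEEE 754 binary16 value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_single(x):
    """Round x to the nearest IEEE 754 binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def dot_half(a, b):
    """Dot product with 16-bit inputs, products and accumulator."""
    acc = 0.0
    for x, y in zip(a, b):
        product = to_half(to_half(x) * to_half(y))
        acc = to_half(acc + product)
    return acc


def dot_mixed(a, b):
    """Dot product with 16-bit inputs and products but a 32-bit accumulator."""
    acc = 0.0
    for x, y in zip(a, b):
        product = to_half(to_half(x) * to_half(y))
        acc = to_single(acc + product)
    return acc


if __name__ == "__main__":
    a = [0.1] * 4096
    b = [1.0] * 4096
    print("all 16-bit:       ", dot_half(a, b))
    print("16-bit + 32-bit:  ", dot_mixed(a, b))
```

With a 16-bit accumulator the running sum eventually becomes so large that each new term is smaller than half the spacing between representable values and is rounded away, whereas the 32-bit accumulator keeps absorbing the 16-bit products.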

In the SCoRPiO project, significance-based approximate computing is used to reduce the energy consumption of a program execution by tolerating less accurate results. Part of the project is to define significance as an algorithmic property that quantifies the impact of a computation on the output. In recent years we presented the interval-adjoint significance analysis, which combines interval information with derivative information. The analysis can thus identify computations that can be evaluated less accurately, e.g. on low-power but less reliable hardware. Unfortunately, the use of interval arithmetic sometimes results in a large overestimation of intervals and hence in large significance values. For that reason, we propose an alternative approach that obtains significance information by propagating moment approximations of probability distributions. This method uses Taylor series, which require higher derivatives, to approximate statistical moments, e.g. the expectation and the variance. The computation of these derivatives is expensive even for adjoint algorithmic differentiation methods. Therefore, we exploit the symmetry and the sparsity structure of the higher derivatives to considerably improve the efficiency of the computation.
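For orientation, the textbook low-order form of such a moment approximation is given below. It is a generic sketch and not necessarily the exact formulation used in the paper: for a function $f$ of an input $X \in \mathbb{R}^n$ with mean $\mu$ and covariance matrix $\Sigma$ (symbols introduced here for illustration), a second-order Taylor expansion gives the expectation and a first-order expansion gives the variance.

$$
\mathbb{E}[f(X)] \approx f(\mu) + \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \frac{\partial^2 f}{\partial x_i \, \partial x_j}(\mu) \, \Sigma_{ij},
\qquad
\operatorname{Var}[f(X)] \approx \nabla f(\mu)^{\top} \, \Sigma \, \nabla f(\mu)
$$

Symmetry is what makes compression possible: for a sufficiently smooth $f$, the second-derivative tensor has only $n(n+1)/2$ distinct entries out of $n^2$, and the third-derivative tensor only $n(n+1)(n+2)/6$ out of $n^3$, so higher-order terms need far fewer derivative evaluations than the full tensor size suggests.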