# 2nd Workshop On Approximate Computing (WAPCO 2016)

### In conjunction with HiPEAC 2016, Prague, 18-20 January 2016

### http://wapco.inf.uth.gr

## Call For Papers

# Workshop Program (Room: Zenit)

08:15 - 10:00 HiPEAC Conference Keynote

10:00 - 11:00 **Welcome**

**Invited Talk**

Cristiano Malossi, IBM Research Zurich

*How Algorithm Re-engineering Can Open the Way to ExaScale *

[Abstract][Slides]

11:00 - 11:30 Coffee Break

11:30 - 13:00 **Session 1: Approximation at the Hardware Level**

Session Chair: Pedro Trancoso (University of Cyprus)

11:30 - 11:50

Cong Hao and Takeshi Yoshimura.
*EACH: An Energy-Efficient High-Level Synthesis Framework for Approximate Computing.*

[Abstract][Slides][Paper]

11:50 - 12:10

Magnus Själander, Gustaf Borgström and Stefanos Kaxiras.
*Improving Error-Resilience of Emerging Multi-Value Technologies. *

[Abstract][Slides][Paper]

12:10 - 12:30

Alessandro Vallero, Alessandro Savino, Gianfranco Michele Maria Politano, Stefano Di Carlo, Athanasios
Chatzidimitriou, Sotiris Tselonis, Manolis Kaliorakis, Dimitris Gizopoulos, Marc Riera, Ramon Canal,
Antonio Gonzalez, Maha Kooli, Alberto Bosio and Giorgio Di Natale.
*Early Component-Based System Reliability Analysis for Approximate Computing Systems. *

[Abstract][Slides][Paper]

12:30 - 12:50

Alberto Bosio, Philippe Debaud, Patrick Girard, Stephane Guilhot, Miroslav Valka and Arnaud Virazel.
*Under-limits Voltage Scaling: The benefit of Approximate Computing.*

[Abstract][Slides][Paper]

12:50 - 13:00

Germán León, Rafael Mayo and Enrique S. Quintana-Orti.
*Stationary Iterative Solvers with Adaptive Precision on FPGAs.*

[Abstract][Slides][Paper]

13:00 - 14:00 Lunch Break

14:00 - 15:30 **Session 2: Approximation Modelling and Applications**

Session Chair: Christos Antonopoulos (University of Thessaly)

14:00 - 14:20

Lukas Holik, Ondrej Lengal, Adam Rogalewicz, Lukas Sekanina, Zdenek Vasicek and Tomas Vojnar.
*. Towards Formal Relaxed Equivalence Checking in Approximate Computing Methodology. *

[Abstract][Slides][Paper]

14:20 - 14:40

Patrick Judd, Jorge Albericio, Tayler Hetherington, Tor Aamodt, Natalie Enright Jerger and Andreas Moshovos.
*Proteus: Exploiting Numerical Precision Variability in Deep Neural Networks. *

[Abstract][Slides][Paper]

14:40 - 15:00

Valery Kritchallo, Erik Vermij, Koen Bertels and Zaid Al-Ars.
*Fidelity Slider: a User-Defined Method to Trade off Accuracy for Performance in Canny Edge Detector. *

[Abstract][Slides][Paper]

15:00 - 15:20

Jens Deussen, Jan Riehme and Uwe Naumann.
*A Case Study for Interval Adjoint Significance Analysis. *

[Abstract][Slides][Paper]

15:20 - 15:30

Lukas Sekanina and Zdenek Vasicek.
*Genetic Improvement for Approximate Computing. *

[Abstract][Slides][Paper]

15:30 - 16:00 Coffee Break

16:00 - 17:15 **Session 3: Approximation at the Software Level**

Session Chair: Lucas Sekanina (Brno University of Technology)

16:00 - 16:20

Konstantinos Parasyris, Vassilis Vassiliadis, Christos Antonopoulos, Nikolaos Bellas and Spyros Lalis.
*Compiler Techniques for Protection of Critical Instructions on Faulty Architectures.*

[Abstract][Slides][Paper]

16:20 - 16:40

Aurangzeb and Rudolf Eigenmann.
*History-based Piecewise Approximation Scheme for Procedures. *

[Abstract][Slides][Paper]

16:40 - 17:00

Damien Couroussé, Caroline Quéva and Henri-Pierre Charles.
*Approximate Computing on Resource-Constrained Embedded Devices with Runtime Code Generation.*

[Abstract][Slides][Paper]

17:00 - 17:10

Chhaya Trehan, Hans Vandierendonck, Georgios Karakonstantis and Dimitrios S. Nikolopoulos.
*Energy Optimization of Parallel Workloads on Unreliable Hardware. *

[Abstract][Slides][Paper]

Cristiano Malossi received his B.Sc. in Aerospace Engineering and his M.Sc. in Aeronautical Engineering from the Politecnico di Milano (Italy) in 2004 and 2007, respectively. After working one year on computational geology problems in collaboration with ENI, he moved to Switzerland where in 2012 he got his Ph.D. in Applied Mathematics from the Swiss Federal Institute of Technology in Lausanne (EPFL). His Ph.D. thesis on algorithms and mathematical methods for the numerical simulation of cardiovascular problems has been awarded with the IBM Research Prize for Computational Sciences. Later, in July 2013 Cristiano joined IBM Research - Zurich in the Foundations of Cognitive Solutions group. Cristiano is recipient of the ACM Gordon Bell Prize 2015. His main research interests include: High Performance Computing, Energy-Aware Algorithms and Architectures, Graph Analytics, Numerical Analysis, Computational Fluid Dynamics, Aircraft Design, Computational Geology, and Cardiovascular Simulations.

Recent years have seen a dramatic change in core technology. To continue the exponential overall improvements technologists have turned into multi-core chips and parallelism at all scales. With these new trends, a series of new problems arise: how to program such complex machines and how to keep pace with the very fast increase in power requirements. In this work, we present several examples where a careful algorithm re-thinking have opened the way to faster, scalable, and energy-efficient solutions. In several of these examples, approximate computing plays a central role. For instance, on the one hand some applications are error-resilient, i.e., they allow for strong approximations in intermediate computations, without losing quality in the final output. Thus, a smart re-engineered algorithm will profit from a relaxed accuracy in the intermediate computations to reduce power consumption, and obtain the same result with less energy. On the other hand, many applications do not really require the full-precision of classical floating-point arithmetics, and more generally can tolerate some error in the final answer. These applications will directly benefit from new algorithms that make a more efficient use of existing hardware at the cost of some controlled approximation in the final solution. These and other techniques are discussed in this talk.

In this work we propose an integrated energy-efficient Approximate Computing circuit targeted High level synthesis framework named EACH, which transforms the design from behavioral level description to register transfer level description implemented by approximate computing resources, aiming at minimizing the total energy consumption and resource usage. Given the system accuracy requirement, we propose an error constraint table for error control, and then propose an FU allocation method to determine the physical implementation for each operation to minimize the energy and resource usage. We also propose an erroraware mobility allocation based scheduling algorithm for FU allocation optimization, by distributing the resource and error ratio densities uniformly. The experiments show that our whole EACH framework achieves 11% energy reduction from its initial solution, and achieves 18% energy reduction from its optimized solution, compared with the precise circuits; the runtime also reduces 11% compared with the previous work KILS.

Approximate computing systems aim at slightly reducing the output quality of service, or precision, of a program in order to save computing operations, reduce the execution time and the energy consumption of the system. However, to the best of our knowledge, in all the approximate computing systems presented in the research literature, the implementation of the components that support the approximation is left to the developer. In this paper, we describe the implementation of a precisionaware computing library that saves the developer from the implementation of approximated functions. Efficient implementations of the approximated functions are achieved with runtime code generation. Runtime code generation is fast and can be amortised in a few executions only of the generated code. We illustrate the performance and the lightness of our implementation on the WisMote, a MSP430-based platform with only 16 kB of RAM and 256 kB of flash memory. When the generated code is specialised on one of the input arguments of the approximated function, we achieve a speedup above 7x.

Approximate computing has emerged as an active area of research in recent years. A number of techniques have been proposed to increase performance of applications that can tolerate a certain degree of inaccuracy. Approximating functions can offer significant performance improvement but this opportunity has not yet been fully explored by the approximate computing community. This paper introduces techniques to perform approximation at the level of functions. We present our schemes with two flavors and discuss four realizations. We present results on 90 scientific functions that underscore the opportunity. We also present results on real applications that demonstrate the practicality and effectiveness of the idea.

There exist extensive ongoing research efforts on emerging technologies that have the potential to become an alternative to today’s CMOS technologies. A common feature among the investigated technologies is that of multivalue devices and the possibility of implementing quaternary logic and memory. However, multi-value devices tend to be more sensitive to interferences and, thus, have reduced error resilience. We present an architecture based on multi-value devices where we can trade energy efficiency against error resilience. Important data are encoded in a more robust binary format while error tolerant data is encoded in a quaternary format. We show for eight benchmarks an energy reduction of 32% and 36% for the register file and level-one data cache, respectively, and for the two integer benchmarks, an energy reduction for arithmetic operations of 13% and 23%. We also show that for a quaternary technology to be viable it need to have a raw bit error rate of one error in 100 million or better.

This paper connects the Genetic Improvement (GI) method, recently established in the search-based software engineering community, with approximate computing, in order to obtain improvements in the cases when errors in computations can be tolerated. It is argued that Genetic Improvement which shares many objectives with the approximate computing can easily be adopted to solve typical problems in the area of approximate computing. An open problem is whether GI-based methodology can really be accepted by the approximate computing community.

Conventional computer systems are designed to deliver error-free operation. However, this strict correctness prequisite is threatened due to the continuous efforts towards denser structures which are vulnerable to voltage and temperature fluctuations. In such conditions, errors occur due to timing violations. In this paper, an LLVM-based, compile time analysis is described which categorizes instructions according to their criticality to program correctness. We quantify the positive effects on application resiliency when protecting critical instructions from producing erroneous results. Moreover, we study the effects of compiler optimizations on the number of critical instructions for Intel’s x86 architecture. As a general rule, compiler optimization increase the number of instructions that are non-critical to program execution, and which can be executed on less-reliable hardware.

We report further development of the interval adjoint significance analysis (IASA) as a part of the SCoRPiO project. In SCoRPiO significance based approximate computing is used to reduce the energy consumption of a program execution by tolerating less accurate results. Part of the project is to define significance as an algorithmic property to quantify the impact of a computation to the output. Information needed by this definition is obtained by an analysis combining algorithmic differentiation and interval arithmetic. Thus, the analysis can identify computations that can be evaluated less accurate, e.g. on low power but less reliable hardware. An interval splitting approach is presented to address issues introduced by a naive usage of interval arithmetic. This approach additionally offers a more detailed and refined analysis of the underlying computer code. Furthermore, we introduce the quantification mode that is used to verify intuitive and well known characteristics of a code and to obtain the corresponding insignificant computations.

This work exploits the tolerance of Deep Neural Networks (DNNs) to reduced precision numerical representations and specifically, their ability to use different representations per layer while maintaining accuracy. This flexibility provides an additional opportunity to improve performance and energy compared to conventional DNN implementations that use a single, uniform representation for all layers throughout the network. This work exploits this property by proposing Proteus , a layered extension over existing DNN implementations that converts between the numerical representation used by the DNN execution engines and a shorter, layer specific fixed-point representation when reading and writing data values to memory be it on-chip buffers or off-chip memory. When used with a modified layout of data in memory, Proteus can use a simple, low-cost and low energy conversion unit. On five popular DNNs, Proteus can reduce data traffic among layers by 41% on average and up to 44% compared to a baseline that uses 16-bit fixed-point representation, while maintaining accuracy within 1% even when compared to a single precision floating-point implementation. When incorporated into a state-of-the-art accelerator Proteus improves energy by 14% While maintaining the same performance. When incorporated on a graphics processor Proteus improves performance by 1%, energy by 4% and reduces off-chip DRAM accesses by 46%.

Most design automation methods developed for approximate computing evaluate candidate solutions by applying a set of input vectors and measuring the error of the output vectors with respect to an exact solution. This approach is not, however, applicable when approximating complex combinational or sequential circuits since the error is not computed precisely enough. This paper surveys various methods of formal verification that could be extended for purposes of determining the error of approximation more precisely and formulates this task through a notion of formal relaxed equivalence checking.

This paper presents the concept of a fidelity slider, which is a user-defined method that enables trading off accuracy for performance in a parallelized application. The slider is defined in the context of the Canny edge detector, but can be generalized to other image processing algorithms. The slider moderates discontinuity issues introduced by an image-slicing technique used to increase the level of the parallelism in the Canny edge algorithm, and allows for strong scalability across multiple cores. The domain decomposition-based technique used by our method is a toplevel image-slicing loop incorporated into the algorithm to process segments of an image concurrently. The slider controls three factors to moderate the aggregate output data divergence induced by the parallelized Canny edge algorithm: 1. image slice overlap size, 2. the degree of histogram synchronization, and 3. the edge tracing continuity factor. Results show that the fidelity slider is able to control the tradeoff from a speedup of 7x at 100% accuracy up to a speedup of 19x at 99% accuracy, for an image of 8000x8000 pixels processed on an Intel Xeon platform with 14 cores and 28 hardware threads.

In this paper, we survey the methods that have been proposed to functional approximation of digital circuits. The main focus is on evolutionary computing, particularly Cartesian genetic programming (CGP), which can provide, in an automated manner, very good compromises between key circuit parameters. This is demonstrated in a case study -- evolutionary approximation of an 8-bit multiplier.

Approximate computing has been adopted as a promising approach to energy-efficient design of digital systems. Actually the straightforward idea is to lower the supplied voltage level until errors appear. These errors can be “implicitly” tolerated since the application accepts some loss of quality or optimality in the computed result(s). This paper aims at targeting applications that cannot accept loss of quality. We propose an architecture able to scaling down the supplied voltage level even under the nominal value. The key concept is a preliminary characterization phase the exploit the benefit of approximate computing.

A key enabler of real applications on approximate computing systems is the availability of instruments to analyze system reliability, early in the design cycle. Accurately measuring the impact on system reliability of any change in the technology, circuits, microarchitecture and software is most of the time a multi-team multi-objective problem and reliability must be traded off against other crucial design attributes (or objectives) such as power, performance and cost. Unfortunately, tools and models for cross-layer reliability analysis are still at their early stages compared to other very mature design tools and this represents a major issue for mainstream applications. This paper presents preliminary information on a cross-layer framework built on top of a Bayesian model designed to perform component-based reliability analysis of complex systems.

We present a work in progress report on the problem of minimizing the energy consumption of a parallel application on an unreliable hardware platform, specifically unreliable memory. In such a system, not all of the application data in memory is accurate at all the times. Allowing some inaccuracies in the data saves memory refresh power at the cost of increased inaccuracy which in many application domains can be dealt with algorithmic error resilience. We are building an analytical model to capture the CPU energy consumption of a parallel application with precedence constraints running on a system with unreliable memory. Using this model, we plan to provide a framework for analytically selecting CPU frequencies that minimize the overall CPU energy consumption of the application.