08:15 - 10:00 HiPEAC Conference Keynote
10:00 - 11:00 Welcome
11:00 - 11:30 Coffee Break
11:30 - 13:00 Session 1: Approximation at the Hardware Level
11:30 - 11:45
Adi Teman, Georgios Karakonstantis, Shrikanth Ganapathy and Andreas Burg. Exploiting Application Error Resilience for Standby Energy Savings in Dynamic Memories.
12:00 - 12:15
Rochus Nowosielski, Julian Hartig, Guillermo Paya-Vaya, Holger Blume and Alberto Garcia-Ortiz. Exploring Different Approximate Adder Architecture Implementations in a 250°C SOI Technology
13:00 - 14:00 Lunch Break
14:00 - 15:00 HiPEAC Conference Keynote
15:00 - 16:00 Invited Talk
16:00 - 16:30 Session 2: Approximation in Applications
16:15 - 16:30
Tyler M. Smith, Enrique S. Quintana-Orti, Mikhail Smelyanskiy and Robert A. van de Geijn. Embedding Fault-Tolerance, Exploiting Approximate Computing and Retaining High Performance in the Matrix Multiplication
16:30 - 17:00 Coffee Break
17:00 - 18:15 Session 3: System Software and Architectures for Approximation
17:15 - 17:30
Vassilis Vassiliadis, Konstantinos Parasyris, Charalambos Chalios, Christos D. Antonopoulos, Spyros Lalis, Nikolaos Bellas, Hans Vandierendonck and Dimitrios S. Nikolopoulos. A Programming Model and Runtime System for Significance-Aware Energy-Efficient Computing.
Tim Palmer has been a Royal Society Research Professor of Climate Physics at Oxford University since 2010. After his PhD at Oxford in general relativity theory, Tim worked at the Meteorological Office, the University of Washington and the European Centre for Medium-Range Weather Forecasts, where he led a team which developed the medium-range and seasonal ensemble prediction systems. Tim's research spans a range of work, both theoretical and practical, in the field of dynamics and predictability of weather and climate.
Weather and climate models have become increasingly important tools for making society more resilient to extremes of weather and for helping society plan for possible changes in future climate. These models are based on known laws of physics but, because of computing constraints, are solved over a considerably narrower range of scales than is described in the mathematical expression of these laws. This generically leads to systematic errors when models are compared with observations. A new paradigm is proposed for solving these equations which sacrifices precision and determinism for small-scale motions. It is suggested that this sacrifice may allow the truncation scale of weather and climate models to extend down to cloud scales in the coming years, leading to more accurate predictions of future weather and climate.
Kaushik Roy received the B.Tech. degree in electronics and electrical communications engineering from the Indian Institute of Technology, Kharagpur, India, and the Ph.D. degree from the Electrical and Computer Engineering Department of the University of Illinois at Urbana-Champaign in 1990. He was with the Semiconductor Process and Design Center of Texas Instruments, Dallas, where he worked on FPGA architecture development and low-power circuit design. He joined the electrical and computer engineering faculty at Purdue University, West Lafayette, IN, in 1993, where he is currently the Edward G. Tiedemann Jr. Distinguished Professor. His research interests include spintronics, device-circuit co-design for nano-scale silicon and non-silicon technologies, low-power electronics for portable computing and wireless communications, and new computing models enabled by emerging technologies. Dr. Roy has published more than 600 papers in refereed journals and conferences, holds 15 patents, has graduated 65 PhD students, and is co-author of two books on low-power CMOS VLSI design (John Wiley & McGraw-Hill).
Dr. Roy received the National Science Foundation Career Development Award in 1995, the IBM Faculty Partnership Award, the AT&T/Lucent Foundation Award, the 2005 SRC Technical Excellence Award, the SRC Inventors Award, the Purdue College of Engineering Research Excellence Award, the Humboldt Research Award in 2010, the 2010 IEEE Circuits and Systems Society Technical Achievement Award, the Distinguished Alumnus Award from the Indian Institute of Technology (IIT), Kharagpur, the Fulbright-Nehru Distinguished Chair, a DoD National Security Science and Engineering Faculty Fellowship (2014-2019), best paper awards at the 1997 International Test Conference, the 2000 IEEE International Symposium on Quality of IC Design, the 2003 IEEE Latin American Test Workshop, 2003 IEEE Nano, the 2004 IEEE International Conference on Computer Design, and the 2006 IEEE/ACM International Symposium on Low Power Electronics & Design, as well as the 2005 IEEE Circuits and Systems Society Outstanding Young Author Award (with Chris Kim), the 2006 IEEE Transactions on VLSI Systems Best Paper Award, the 2012 ACM/IEEE International Symposium on Low Power Electronics and Design Best Paper Award, and the 2013 IEEE Transactions on VLSI Systems Best Paper Award. Dr. Roy was a Purdue University Faculty Scholar (1998-2003). He was a Research Visionary Board Member of Motorola Labs (2002) and held the M.K. Gandhi Distinguished Visiting Professorship at the Indian Institute of Technology Bombay. He has served on the editorial boards of IEEE Design and Test, IEEE Transactions on Circuits and Systems, IEEE Transactions on VLSI Systems, and IEEE Transactions on Electron Devices. He was Guest Editor for special issues on low-power VLSI in IEEE Design and Test (1994), IEEE Transactions on VLSI Systems (June 2000), IEE Proceedings -- Computers and Digital Techniques (July 2002), and the IEEE Journal on Emerging and Selected Topics in Circuits and Systems (2011). Dr. Roy is a Fellow of the IEEE.
In today’s world there is explosive growth in digital information content, along with a rapid increase in the number of users of multimedia applications related to image and video processing, recognition, mining and synthesis. These trends pose an interesting design challenge: processing digital data in an energy-efficient manner while meeting the desired user quality requirements. Most of these multimedia applications possess an inherent degree of "error" resilience. This means that there is considerable room for allowing approximations in intermediate computations, as long as the final output meets the user quality requirements. This relaxation in "accuracy" can be used to simplify the complexity of computations at different levels of design abstraction, which directly helps in reducing power consumption.
At the algorithm and architecture levels, computations can be divided into significant and non-significant ones. Significant computations have a greater impact on overall output quality than non-significant ones, so the underlying architecture can be modified to favor faster computation of the significant components, thereby enabling voltage scaling at the same operating frequency. At the logic and circuit levels, one can relax Boolean equivalence to reduce the number of transistors and decrease the overall switched capacitance. This can be done in a controlled manner to introduce limited approximations in common mathematical operations such as addition and multiplication. All of these techniques fall under the general topic of “Approximate Computing”, which is the main focus of this talk.
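To make the logic-level idea concrete, here is a minimal software sketch of relaxing Boolean equivalence in addition (a generic lower-part OR adder, chosen for illustration rather than taken from the talk):

```python
def approx_add(a, b, k):
    """Approximate addition: the k low-order bits are OR-ed instead of
    added, so no carry is generated or propagated in the lower part.
    In hardware this shortens the carry chain (enabling voltage scaling)
    at the cost of a bounded error in the low-order bits."""
    mask = (1 << k) - 1
    low = (a | b) & mask                # cheap, carry-free lower part
    high = ((a >> k) + (b >> k)) << k   # exact addition of the upper bits
    return high | low
```

Since the neglected lower-part sum and any dropped carry are both smaller than 2**(k+1), the absolute error is bounded by 2**(k+1), which is what keeps the approximation controlled.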
Current variation-aware design methodologies, tuned for worst-case scenarios, are becoming increasingly pessimistic from the perspective of power and performance. A good example of such pessimism is setting the refresh rate of DRAMs according to the worst-case access statistics, resulting in very frequent refresh cycles that are responsible for the majority of the standby power consumption of these memories. However, such a high refresh rate may not be required, either because of the extremely low probability that the worst case actually occurs, or because of the inherent error resilience of many applications that can tolerate a certain number of potential failures. In this paper, we exploit and quantify the possibilities that exist in dynamic memory design by shifting to the so-called approximate computing paradigm in order to save power and enhance yield at no cost. The statistical characteristics of the retention time in dynamic memories were revealed by studying a fabricated 2 kb CMOS-compatible embedded DRAM (eDRAM) memory array based on gain cells. Measurements show that up to 73% of the retention power can be saved by altering the refresh time and setting it such that a small number of failures is allowed. We show that these savings can be further increased by utilizing known circuit techniques, such as body biasing, which can help not only in extending but also in favorably shaping the retention time distribution. Our approach is one of the first attempts to assess the data integrity and energy trade-offs achieved in eDRAMs for utilizing them in error-resilient applications and can prove helpful in the anticipated shift to approximate computing.
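As a back-of-the-envelope illustration of the refresh-time/failure trade-off described above (assuming, purely for this sketch, a lognormal retention-time distribution; the paper works from measured distributions):

```python
import math

def expected_failures(t_refresh, cells, median_ret, sigma):
    """Expected number of failing cells when refreshing every t_refresh
    seconds: a cell fails if its retention time is shorter than the
    refresh period. Uses the lognormal CDF via the error function."""
    z = (math.log(t_refresh) - math.log(median_ret)) / sigma
    p_fail = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return cells * p_fail

def longest_refresh(cells, median_ret, sigma, max_fail):
    """Coarse search for the longest refresh period that keeps the
    expected failure count at or below max_fail; a longer period means
    fewer refresh cycles and hence lower standby power."""
    t = median_ret
    while expected_failures(t, cells, median_ret, sigma) > max_fail:
        t *= 0.9
    return t
```

Allowing even a handful of failures pushes the refresh period well past the worst-case setting, which is where the standby-power savings come from.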
State-of-the-art implementations of the exponential function rely on interpolation tables, Taylor expansions or IEEE bit manipulations containing a small fraction of integer operations. Unfortunately, none of these methods is able to fully exploit vectorization while at the same time providing sufficient accuracy. Yet many applications, such as solving PDEs, simulating neuronal networks and computing Fourier transforms, involve a large number of exponentials that have to be computed efficiently. In this paper we devise and demonstrate the usefulness of a novel formulation for computing the exponential that employs only floating-point operations, with a flexible accuracy ranging from a few digits up to full machine precision. Using the presented algorithm we can compute exponentials of large vectors, in any application setting, maximizing the performance gains of the vector units available in modern processors. This immediately results in a speedup for all such applications.
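The general recipe behind such formulations can be sketched as range reduction plus a polynomial of selectable degree (a standard construction, not the paper's specific algorithm):

```python
import math

def exp_poly(xs, degree):
    """Branch-free exponential sketch: range reduction
    exp(x) = 2**k * exp(r) with |r| <= ln(2)/2, followed by a truncated
    Taylor polynomial evaluated with Horner's rule. A higher degree gives
    more accuracy; every element runs the same instruction sequence,
    which is what makes the scheme SIMD-friendly."""
    ln2 = math.log(2.0)
    out = []
    for x in xs:
        k = round(x / ln2)        # exponent of the power-of-two factor
        r = x - k * ln2           # reduced argument, |r| <= ln(2)/2
        acc = 1.0                 # Horner: sum_{i=0..degree} r**i / i!
        for i in range(degree, 0, -1):
            acc = 1.0 + acc * r / i
        out.append(math.ldexp(acc, k))  # acc * 2**k, exact scaling
    return out
```

Varying `degree` trades arithmetic work for accuracy, mirroring the paper's idea of a flexible precision from a few digits up to machine precision.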
Reliability has emerged as an important design criterion due to shrinking device dimensions. To address this challenge, researchers have proposed techniques that compromise Quality-of-Service (QoS) across all design abstractions. Performing the reliability-QoS trade-off from a high level of design abstraction is a major challenge. In this paper, we propose an analytical reliability evaluation framework based on probabilistic error masking matrices (PeMMs). The reliability evaluation is performed by propagating erroneous tokens through an abstract circuit model. We report detailed experiments using a RISC processor and several embedded applications. The proposed approach demonstrates significantly faster reliability evaluation compared to a purely simulation-driven approach.
Relaxing the stringent accuracy requirement would potentially offer greater freedom to create designs with better performance or energy efficiency. In this paper, we evaluate the design trade-offs for adders, which are key building blocks for many applications. We determine the optimal design choices for adders under various design constraints, such as accuracy, operating frequency and silicon area, across different adder structures and alternative forms of computer arithmetic. Two design scenarios are compared: one is the conventional design scenario, where timing closure is met by adjusting operand word lengths; the other is an overclocking scenario, where timing violations are allowed to occur. We show that applying the overclocking approach to ripple-carry adders can be more beneficial than using fast adders to achieve similar latency, because the worst cases occur only with very small probability. We also show that using the conventional approach on adders with a new form of computer arithmetic is optimal for a wide range of design constraints.
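The statistical argument for overclocking ripple-carry adders can be illustrated with a small experiment (assuming uniform random operands; the paper's analysis is more careful):

```python
import random

def longest_carry_chain(a, b, n):
    """Length of the longest carry-propagation run when adding two n-bit
    numbers: a run starts at a generate position (both bits 1) and is
    extended by propagate positions (exactly one bit 1)."""
    longest = run = 0
    alive = False
    for i in range(n):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        if ai and bi:               # generate: a new carry is born
            alive, run = True, 1
        elif alive and (ai ^ bi):   # propagate: the carry travels on
            run += 1
        else:                       # kill: the carry dies here
            alive, run = False, 0
        longest = max(longest, run)
    return longest

# For random operands the longest chain is typically close to log2(n),
# far below the n-bit worst case that sets the safe clock period.
random.seed(0)
n, trials = 32, 10_000
chains = [longest_carry_chain(random.getrandbits(n), random.getrandbits(n), n)
          for _ in range(trials)]
```

Because the full-length carry chain is so rare, a ripple-carry adder clocked faster than its worst-case delay produces the correct sum for the vast majority of inputs.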
Redundancy lies at the heart of graphical applications. However, as we demonstrate in this work, harvesting this high degree of redundancy is not an easy task. Our experimental findings reveal that simply memoizing the outcomes of single instructions or sets of instructions does not pay off, due to the high-precision calculations required by modern graphics APIs (e.g., OpenGL|ES 3.0). To this end, we propose the value cache (VC), a hardware memoization mechanism that is explicitly managed by special machine-level instructions (part of the target GPU ISA). A unique characteristic of the VC is that it is able to perform partial matches, i.e., to reduce the arithmetic precision (accuracy) of the input parameters, thus significantly increasing the volume of successful value reuses. To limit the error introduced by partial matches, and consequently the impact on the quality of the rendered images, i) we systematically examine the precision tolerance of a large set of instructions in modern OpenGL fragment shaders and ii) we devise and optimize various run-time, feedback-directed policies to control the interplay between accuracy and image quality while maximizing the value-reuse benefits. The proposed mechanism is evaluated in a cycle-accurate OpenGL simulator, and our results indicate that our approach reduces the number of executed instructions by 13.5% with negligible (non-visible) impact on the quality of the rendered images.
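A toy software analogue of partial matching (assumed for illustration; the paper's VC is a hardware structure managed through ISA extensions) makes the accuracy/reuse interplay visible:

```python
def quantize(x, bits):
    """Keep only `bits` fractional bits of x."""
    scale = 1 << bits
    return round(x * scale) / scale

class ValueCacheSketch:
    """Memoization with reduced-precision keys: inputs are quantized
    before the lookup, so nearby inputs hit the same entry. Fewer bits
    mean more reuse but a larger (bounded) input error."""
    def __init__(self, fn, bits):
        self.fn, self.bits, self.table = fn, bits, {}
        self.hits = self.misses = 0

    def __call__(self, *args):
        key = tuple(quantize(a, self.bits) for a in args)
        if key in self.table:
            self.hits += 1
        else:
            self.misses += 1
            self.table[key] = self.fn(*key)  # evaluate on quantized inputs
        return self.table[key]
```

With 4 fractional bits, calls at 0.501 and 0.502 collapse onto the same entry (one miss, then a hit); with 16 bits they remain distinct.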
One factor that contributes to high energy consumption is that all parts of a program are considered equally significant for the accuracy of the end result. However, in many cases, parts of computations can be performed in an approximate way, or even dropped, without affecting the quality of the final output to a significant degree. In this paper, we introduce a task-based programming model and runtime system that exploit this observation to trade off the quality of program outputs for increased energy efficiency. This is done in a structured and flexible way, allowing for easy exploration of different execution points in the quality/energy space, without code modifications and without adversely affecting application performance. The programmer specifies the significance of tasks and optionally provides approximations for them. Moreover, she provides hints to the runtime on the percentage of tasks that should be executed accurately in order to reach the target quality of results. Two different significance-aware runtime policies are proposed for deciding whether a task will be executed in its accurate or approximate version. These policies differ in their runtime overhead, but also in the degree to which they manage to execute tasks according to the programmer's specification. Results from experiments performed using six different benchmarks on an Intel-based multicore/multiprocessor platform show that, depending on the runtime policy used, our system can achieve an energy reduction of up to 83% vs. a fully accurate execution, and up to 35% vs. approximate versions that employ loop perforation. At the same time, our approach always results in graceful degradation of the final result.
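A minimal sketch of such a policy (assumed; the paper's runtime policies are more elaborate) might rank tasks by their declared significance and run only a requested fraction accurately:

```python
def run_tasks(tasks, ratio):
    """Significance-aware scheduling sketch. Each task is a triple
    (significance, accurate_fn, approx_fn_or_None). The top `ratio`
    fraction by significance runs accurately; the rest fall back to
    their approximate version, or are dropped if none was provided."""
    ordered = sorted(tasks, key=lambda t: -t[0])
    cutoff = int(round(ratio * len(tasks)))
    results = []
    for i, (_, accurate, approx) in enumerate(ordered):
        if i < cutoff:
            results.append(accurate())
        elif approx is not None:
            results.append(approx())
        # else: the task is dropped entirely
    return results
```

Sweeping `ratio` from 1.0 down to 0.0 walks through the quality/energy space without touching the task code itself.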
We report first results of the significance analysis developed within the FET-Open project SCoRPiO. SCoRPiO aims to introduce result significance into the hardware development process in order to reduce the power spent on safeguarding, by permitting a controlled level of imprecision in computations and data. An important aspect of the project is to formally introduce computational significance as an algorithmic property and to expose reliability and energy at the level of the programming model and algorithm. Based on that, SCoRPiO seeks to devise techniques that facilitate automatic characterization of code and data significance using compile-time or runtime analysis. As part of SCoRPiO, techniques for significance analysis are being developed: based on the source code of algorithms and user-provided significance information, the analysis should identify parts of the algorithms that can be evaluated with less accuracy and, hence, higher energy efficiency. In this paper we describe the initial version of our significance model; we therefore present ideas and work in progress rather than a finished theory.
In this paper we introduce a fault tolerance mechanism into a high-performance implementation of the matrix multiplication (GEMM) that can be easily tuned for approximate computing. The approach exploits the internal organization of GEMM, explicitly exposed by the BLIS implementation of this operation, in order to balance the workspace requirements and the overheads introduced by error detection and error correction. Our preliminary experimental results on an Intel Xeon E5-2680 reveal that the new implementation introduces moderate overhead while integrating fault tolerance and approximate computing into a common framework.
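The checksum idea underlying such fault-tolerant GEMMs can be shown in a few lines (a generic ABFT-style sketch on nested lists; the BLIS-level integration in the paper is considerably more refined):

```python
def matmul(A, B):
    """Plain triple-loop C = A * B on nested lists."""
    n, k, m = len(A), len(B), len(B[0])
    return [[sum(A[i][p] * B[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]

def checksums_ok(A, B, C, tol=1e-9):
    """ABFT-style check based on the identity ones^T * C == (ones^T * A) * B:
    the column sums of C must match the product of A's column sums with B.
    A single corrupted entry of C breaks its column's checksum."""
    n, k, m = len(A), len(B), len(B[0])
    direct = [sum(C[i][j] for i in range(n)) for j in range(m)]
    colsum_A = [sum(A[i][p] for i in range(n)) for p in range(k)]
    expect = [sum(colsum_A[p] * B[p][j] for p in range(k)) for j in range(m)]
    return all(abs(d - e) <= tol for d, e in zip(direct, expect))
```

The check costs only matrix-vector-sized work on top of the O(n^3) multiplication, which is why error detection can be folded into GEMM with moderate overhead.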
Approximate Computing (AC) comprises approaches that relax the preciseness of computations in order to achieve a trade-off between power/energy efficiency and an acceptable quality of results. As data is being produced at an unprecedented rate, reducing the preciseness of operations and algorithms allows faster and more energy-efficient processing of that data. Applications such as video or image processing, for example, will benefit from approximate computing, as neither the input data nor the human senses that perceive the output are exact. In this paper we present an AC approach targeting image processing applications. The algorithm selected for this case study, which calculates a depth map, is based on dynamic programming. The algorithm has been implemented in hardware, exploiting reconfigurable logic to trade off the preciseness of the results against resource consumption. Removing the backtracking step of the algorithm enables a streaming architecture and additionally reduces the on-chip memory consumption. The result is an FPGA-based approximation unit that calculates the results 367x faster than the original version, while the root mean square error increases only by a factor of 2. The approximation unit is integrated in a hybrid computing system. Incoming data can be processed with high performance by the approximation unit; additionally, the incoming data are buffered to allow recalculation with higher precision if required. Furthermore, available resources of the hybrid system can be exploited to calculate precise results using alternative algorithms.
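The effect of dropping the backtracking step can be sketched on a 1-D simplification (assumed; the paper's FPGA pipeline and cost model differ): after the forward aggregation pass, each pixel emits its locally cheapest disparity instead of waiting for a backward path trace:

```python
def disparity_no_backtrack(cost, smooth):
    """1-D dynamic-programming stereo sketch without backtracking.
    cost[x][d] is the matching cost of disparity d at pixel x; `smooth`
    penalizes disparity jumps between neighboring pixels. Emitting the
    per-pixel argmin right after the forward pass enables streaming
    output and removes the path storage, at the price of a less
    consistent (approximate) depth map."""
    n, D = len(cost), len(cost[0])
    agg = [cost[0][:]]
    for x in range(1, n):
        prev = agg[-1]
        agg.append([cost[x][d] + min(prev[d2] + smooth * abs(d - d2)
                                     for d2 in range(D))
                    for d in range(D)])
    return [min(range(D), key=lambda d: row[d]) for row in agg]
```

Because no backward pass is needed, each output can leave the pipeline as soon as its column's aggregation finishes, which is exactly what makes the streaming hardware architecture possible.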
Experimental devices aiming at real-time detection and suppression of epileptic-seizure events in live subjects already exist. However, to guarantee high detection accuracy, existing approaches employ high-accuracy detection filters that overlook the incurred energy costs, thus leading to unrealistic solutions for low-power implementations. In this short paper, we capitalize on the approximate nature of the seizure-detection phenomenon and propose an energy-efficient scheme for embedded, seizure-detection devices with trivial impact on detection accuracy. For a 1% reduction in filter detection accuracy we achieve a 3.7x increase in device-battery lifetime.
In this paper, we survey the methods that have been proposed for the functional approximation of digital circuits. The main focus is on evolutionary computing, particularly Cartesian genetic programming (CGP), which can provide, in an automated manner, very good compromises between key circuit parameters. This is demonstrated in a case study -- the evolutionary approximation of an 8-bit multiplier.
Aggressive power supply voltage (Vdd) scaling is widely utilized to exploit the design margin introduced by process, voltage and environment variations. However, scaling beyond the critical Vdd results in numerous setup timing errors and hence unacceptable output quality. In this paper, we propose a partial computation-skip (CS) scheme to mitigate setup timing errors in recursive digital signal processors with a fixed number of cycles per instruction (CPI). A coordinate rotation digital computer (CORDIC) processor with the proposed partial CS scheme still functions when scaled beyond the error-free voltage. It achieves 1.7x energy savings w.r.t. the nominal Vdd condition, and another 1.3x energy savings when sacrificing 10.5 dB of output quality.
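Since the paper's testbed is a CORDIC processor, a textbook rotation-mode CORDIC (shown here only to illustrate the iteration/accuracy trade-off, not the proposed computation-skip scheme) is a useful reference point:

```python
import math

def cordic_sincos(theta, iters):
    """Rotation-mode CORDIC for (sin, cos), valid for |theta| < ~1.74 rad.
    Each iteration rotates by +/- atan(2**-i); running fewer iterations
    saves work but roughly halves the remaining angular accuracy
    per skipped step."""
    K = 1.0                        # convergence gain for `iters` steps
    for i in range(iters):
        K /= math.sqrt(1.0 + 2.0 ** (-2 * i))
    x, y, z = K, 0.0, theta
    for i in range(iters):
        d = 1.0 if z >= 0 else -1.0
        x, y, z = (x - d * y * 2.0 ** -i,
                   y + d * x * 2.0 ** -i,
                   z - d * math.atan(2.0 ** -i))
    return y, x                    # approximately (sin(theta), cos(theta))
```

Dropping from 24 to 8 iterations degrades the result from near machine precision to roughly two decimal digits, the kind of quality/energy knob the CS scheme exploits under voltage overscaling.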
Designing VLSI circuits for high-temperature applications requires the use of specialized ASIC technologies suited for operation above 250°C. The technologies available today offer computational performance as well as integration density far below state-of-the-art VLSI technologies. Especially at high temperatures, the switching speed of integrated circuits becomes very slow. Thus, the design space for implementing digital signal-processing architectures at high temperature is strongly limited. This paper presents an evaluation, novel in the context of high-temperature ASIC design, of different adder architectures used in approximate computing as well as in stochastic arithmetic, with the goal of analyzing the error characteristics of these adders under out-of-specification operation such as high temperature and frequency overscaling. With these results, the best adder architecture for out-of-specification operation for a given application can be identified. The presented work describes the first step of a two-fold evaluation process, presenting results from gate-level simulations and introducing the chip design for real-world experiments that has been committed for fabrication.