



## Exploiting approximate computing methods in FPGAs to accelerate stereo correspondence algorithms

### Michael Bromberger, Wolfgang Karl and Vincent Heuveline

Chair for Computer Architecture and Parallel Processing (CAPP) / Institute of Computer Science & Engineering (ITEC) WAPCO 2015



## **Motivation**



[http://rack.3.mshcdn.com]



[http://recognition-systems.com/]



[http://www.electronicproducts.com]

- The world's amount of data is doubling every two years
- 23 % of this data is useful to tag and analyze (according to IDC)
- Computer vision and machine learning algorithms are often used
- Such applications can benefit from Approximate Computing (AC)
- AC is applicable on different levels
  - Applications, software libraries, ISA, hardware
- Our case study: stereo correspondence algorithms

# **Stereo correspondence algorithm**

- Calculates depth map between two rectified images
  - Map contains information about the distance of objects to the camera
- Considered algorithm is based on dynamic programming (DP)
- Do for each image/depth map row:
  - 1. Calculating cost matrix using Sum-of-Absolute-Differences (SAD)
  - 2. Use DP to calculate three matrices (Match, Left occluded, Right occluded)
  - 3. Backtracking step determines the best disparity values for the current row

# Approach

Port the DP matrix calculation (step 2 and 3) to a FPGA

- exploiting the parallel capabilities of a FPGA
- Achieved by
  - A pipeline structure for calculating several DP columns
  - Calculating different DP matrices in parallel
- Remove time-consuming backtracking step
  - Avoids storage of the entire DP matrices
  - No time-consuming off-chip memory accesses required
- Local decision for finding the best depth map pixel value
  - Enables a streaming architecture
    - Pixel value (disparity) is determined by the minimum value in each DP column
  - Drawback: reducing accuracy of the algorithm



- Calc-DP unit calculates 3 columns for each DP matrix concurrently
- FPGA determines 4 depth map rows in parallel by 4 Calc-DP units
- e.g. the Convey HC-1 includes 4 FPGAs  $\rightarrow$  16 rows
- Depth map is calculated by a single call to avoid setup costs

# **Evaluation I: different depth maps**



Tsukuba image



Ground truth



Sum of absolute differences (SM)



Simulated Annealing (SM)



Scan-line optimization (SM)



Graph cut (SM)



DTAggr-P



CSM

SM: algorithm is part of Stereo matcher framework

DTAggr-P, CSM are published in the Middlebury Evaluation



Dynamic Programming (SM)



Proposed FPGA-based AC unit

**Baseline** 

Karl – Exploiting approximate computing methods in FPGAs ...

# **Evaluation II**

Average execution time and RMSE for different algorithms calculating a depth map for Tsukuba

|         | Costs  | Aggregation/<br>Data transfer | Optimization | Total    | RMSE<br>(all) |
|---------|--------|-------------------------------|--------------|----------|---------------|
| SM-GC   | 0.07 s | -                             | 159.31 s     | 159.32 s | 1.4           |
| SM-DP   | 0.07 s | -                             | 0.14 s       | 0.22 s   | 1.78          |
| SM-SAD  | 0.06 s | 0.03                          | 0.01 s       | 0.11 s   | 1.78          |
| SM-SO   | 0.07 s | -                             | 0.14 s       | 0.21 s   | 1.88          |
| SM-SA   | 0.07 s | -                             | 258.51 s     | 258.59 s | 2.03          |
| FPGA-AC | 0.05 s | 3.41 ms                       | 381 μs       | 0.06 s   | 3.34          |

ACU is 367x faster than the original DP-based optimization
The RMSE is only increased by a factor of 2
ACU optimization step achieves 26 fps for 4096x4096 images

# Integration of the ACU into a hybrid system



- Fast ACU allows real time processing
- Decider "evaluates" if the result is exact enough
- Other units
  - can recalculate the result if it is required
  - Calculate further step(s) of the overall algorithm

# **Conclusion and future work**

Integration of AC into a stereo correspondence algorithm

- **FPGAs** are well suited to implement AC hardware
- Removing the time-consuming backtracking step
  - Enables a streaming architecture
  - Avoids accesses to off-chip memory
- A hybrid system (CPU + FPGA) combines
  - Traditional "exact" computation (CPU)
  - Approximate Computing Hardware (FPGA)
- Future Work
  - Consideration of further AC strategies for the FPGA architecture
  - Developing a domain-specific AC framework for hybrid systems





# Exploiting approximate computing methods in FPGAs to accelerate stereo correspondence algorithms

#### Michael Bromberger, <u>Wolfgang Karl</u> and Vincent Heuveline

Chair for Computer Architecture and Parallel Processing (CAPP) / Institute of Computer Science & Engineering (ITEC) WAPCO 2015

