HLIBpro
3.0
|
Accumulator based H-arithmetic uses a special accumulator for updates to matrix blocks during arithmetic. Furthermore, updates are applied following the H-hierarchy. By this, the number of low-rank truncations is significantly reduced and with this also the runtime. However, the savings are matrix dependent and also result in slightly reduced arithmetic accuracy.
To activate accumulator arithmetic, you have to set the parameter CFG::Arith::use_accu
to true
, e.g.,
or
or
in the configuration file .hlibpro.conf
(see also Parameters).
With accumulator arithmetic block updates may be computed as soon as possible (eager) and applied to the accumulator. Or the update computation and application is postponed to the time where the matrix block is needed for further computations (lazy).
By default, updates are applied in an eager mode. To switch to lazy mode, the parameter CFG::Arith::lazy_eval
has to be set to true
.
Furthermore, with lazy evaluation, the individual updates may be sorted to increase accuracy, e.g., if the updates differ much with respect to their norm, possibly letting smaller updates vanish when combined with larger updates.
Sorting updates in lazy mode is activated with the parameter CFG::Arith::sort_updates
.
We will run H-LU benchmarks (without using symmetry) with three different H-matrices to show the effect of accumulator H-arithmetic compared to standard arithmetic.
All benchmarks are performed on a Intel i5-4670 CPU using a single CPU core.
This benchmark uses the Laplace SLP operator with a spherical with 32768 triangles and piecewise linear ansatz (see Boundary Element Methods).
Type | t(LU) | size(LU) | error |
---|---|---|---|
standard | 14.59s | 199 MB | 1.636₁₀-3 |
accu/eager | 9.32s | 216 MB | 5.937₁₀-3 |
accu/lazy | 9.64s | 215 MB | 4.117₁₀-3 |
acc/lazy/sort | 10.22s | 215 MB | 4.294₁₀-3 |
Standard arithmetic gives smallest memory and best approximation but also highest runtime. Best runtime is obtained with accumulator arithmetic in eager mode. However, approximation is worst with this setting. Slightly slower is lazy mode and also improves approximation.
This benchmark uses the Helmholtz SLP operator with a spherical with 32768 triangles and piecewise constant ansatz (see Boundary Element Methods).
Type | t(LU) | size(LU) | error |
---|---|---|---|
standard | 69.43s | 861 MB | 6.726₁₀-4 |
accu/eager | 42.25s | 916 MB | 1.167₁₀-3 |
accu/lazy | 42.66s | 915 MB | 1.031₁₀-3 |
acc/lazy/sort | 45.17s | 915 MB | 1.015₁₀-3 |
Again, standard arithmetic results in smallest memory and best approximation as well as highest runtime. Best runtime is again obtained with accumulator arithmetic in eager mode, which also has worst approximation. Lazy mode improves the latter slightly as does update sorting.
The third benchmark uses a sparse matrix from a 3D convection diffusion equation, partitioned using nested dissection (see Solving a Sparse Equation System). The number of unknowns is 132651.
Type | t(LU) | size(LU) | error |
---|---|---|---|
standard | 40.67s | 1350 MB | 1.271₁₀-3 |
accu/eager | 35.62s | 1350 MB | 1.358₁₀-3 |
accu/lazy | 35.08s | 1350 MB | 1.391₁₀-3 |
acc/lazy/sort | 38.00s | 1350 MB | 1.367₁₀-3 |
The situation changes a little with this benchmark. Best runtime is now with lazy evaluation. Also the difference compared to standard arithmetic is smaller for runtime and for approximation error. The memory consumption is equal for all forms of arithmetic.
In all benchmark problems, accumulator based arithmetic results in significant runtime savings compared to the standard arithmetic. This however, goes along with a reduced accuracy, which can be compensated by slightly increasing the predefined accuracy of the arithmetic.
For instance, if the block-wise accuracy in the H-LU for the Laplace example, where the relative difference in the approximation is highest, is reduced from \(10^{-4}\) to \(3 \cdot 10^{-5}\), the same error is obtained as with the standard arithmetic, while the runtime is then 11.24s.
In the other examples, the (relative) difference between the different arithmetic versions is even smaller.
There seem to be no significant improvments in runtime or accuracy when using lazy mode or sorting the updates.
Nevertheless, it is recommended to benchmark all different versions for a specific application to optimize runtime, memory and error.
See also Approximation and Truncation Algorithms for other parameters influencing arithmetic.