Preconditioning

While the technical details of solving and preconditioning are discussed in Solvers, this section comments on practical issues related to the various preconditioners.

Attention
The discussion of the numerical tests below only applies to the equation system used in this example. The results, especially for classical preconditioners, may vary strongly for other matrices. This example should therefore only be considered a hint that different preconditioners are available and may be tested for a specific problem to obtain optimal results.

Problem Setup

As a benchmark, we will use the same problem as in the Boundary Element Methods example, i.e., the Helmholtz SLP operator on the unit sphere. The wavenumber \(\kappa\) was chosen to be 16.

The source code for setting up the matrix and RHS is identical to the Boundary Element Methods example.

For this specific test, we use a grid with 62,466 vertices and 124,928 triangles. With a constant ansatz, this results in an H-matrix of dimension \(124{,}928 \times 124{,}928\).

Furthermore, the matrix was built unsymmetric, i.e., both the lower and the upper triangular part are constructed, to prevent algorithmic properties of symmetric matrices, e.g., during matrix-vector solves, from influencing the numerical results.

For comparison with the timings below, we need the time to build the H-matrix, which is 75.9 seconds for a block-wise accuracy of \(10^{-4}\) with ACA+.

All tests below are performed using a single Intel Xeon E7-8857 CPU with 12 cores at 3 GHz.

Remarks
Although the example uses a dense matrix, similar tests may be performed for sparse matrices. However, note that point-wise versions of the various preconditioners, e.g., Jacobi, Gauss-Seidel, SOR or also ILU, do not need H-arithmetic or 𝖧𝖫𝖨𝖡𝗉𝗋𝗈. Furthermore, several direct solvers are available, e.g., UMFPACK, PARDISO or MUMPS, which are very fast if they are applicable to an equation system.

No Preconditioner

First, the various solvers in 𝖧𝖫𝖨𝖡𝗉𝗋𝗈 shall try to solve the system without any preconditioner. The stop criterion is chosen with a maximal number of 1000 steps and a reduction of the residual norm by \(10^{-6}\) compared to the initial residual norm. Furthermore, the lower bound for the absolute residual norm was set to \(10^{-13}\):

TSolver::TStopCriterion sstop( 1000, 1e-13, 1e-6 );
TBiCGStab               solver( sstop );

solver.solve( A.get(), x.get(), b.get() );

The results are

Lin. Iter. not converged after 1000 steps with rate 1.0000e+00, |r| = 1.6751e-02, in 98.89s
CG not converged after 1000 steps with rate 1.0040e+00, |r| = 9.0055e-01, in 98.76s
CGS not converged after 1000 steps with rate 1.0075e+00, |r| = 2.9027e+01, in 99.17s
BiCGStab converged after 722 steps with rate 9.8083e-01, |r| = 1.4219e-08, in 72.25s
MINRES not converged after 1000 steps with rate 9.9531e-01, |r| = 2.6489e-03, in 102.44s
GMRES(20) not converged after 1000 steps with rate 9.8640e-01, |r| = 1.8802e-08, in 109.60s
TFQMR converged after 575 steps with rate 9.7618e-01, |r| = 2.2689e-09, in 58.12s

The fastest solver is TFQMR, which managed to reach the desired residual reduction within 575 steps. However, it needed 58.12 seconds to do so, which is almost as long as building the matrix took.

Remarks
In the following, only results are shown for those solvers which managed to converge within the given iteration limits.

Jacobi Preconditioner

The next test will use the Jacobi preconditioner, first the point-wise version:

auto Jacobi = make_unique< TJacobi >( A.get(), 1.0, 1 );
auto solver = ...;

solver->solve( A.get(), x.get(), b.get(), Jacobi.get() );

The results now look as follows:

BiCGStab converged after 293 steps with rate 9.5385e-01, |r| = 4.2289e-07, in 32.03s
GMRES(20) converged after 193 steps with rate 9.3092e-01, |r| = 4.4926e-07, in 22.97s
TFQMR converged after 304 steps with rate 9.5552e-01, |r| = 6.3025e-08, in 33.59s

The GMRES solver is now able to reduce the residual within 193 steps, taking 22.97 seconds. The convergence rates of BiCGStab and TFQMR are also improved. CGS is still not able to converge and is therefore absent from the data.

However, point-wise Jacobi only uses the diagonal entries of \(A\) as an approximation. In the next test, the diagonal blocks of \(A\) are used instead:

auto Jacobi = make_unique< TJacobi >( A.get(), 1.0, 2 );

The third argument of the TJacobi constructor defines the block size. Since 𝖧𝖫𝖨𝖡𝗉𝗋𝗈 supports no blocks smaller than the diagonal blocks of the H-matrix, a value of 2 results in using all diagonal blocks for a block-wise Jacobi preconditioner.

Instead of a blocksize value of 2, one could also use \(n_{\min}\) if known, since this is equivalent: it defines the size of the smallest diagonal blocks.
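
A minimal sketch of this variant, assuming the cluster trees were built with the library default leaf size, which is stored in the configuration variable CFG::Cluster::nmin:

// equivalent to a blocksize of 2 if the cluster trees use the default leaf size
auto Jacobi = make_unique< TJacobi >( A.get(), 1.0, CFG::Cluster::nmin );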

The results are:

BiCGStab converged after 317 steps with rate 9.5593e-01, |r| = 7.7363e-08, in 32.17s
GMRES(20) converged after 403 steps with rate 9.6628e-01, |r| = 1.1475e-07, in 44.76s
TFQMR converged after 293 steps with rate 9.5275e-01, |r| = 9.7448e-09, in 30.21s

The convergence rates actually become worse for GMRES and BiCGStab. Only TFQMR benefits a little.

The reason for this behaviour is that, although going from the point-wise diagonal to a block-diagonal with blocksize \(n_{\min}\) increases the number of coefficients in the preconditioner, the change in the overall approximation of \(A^{-1}\) is still small. Hence, the quality of the preconditioner does not change enough to guarantee better convergence for this example.

To improve this, we increase the diagonal blocksize to 100:

auto Jacobi = make_unique< TJacobi >( A.get(), 1.0, 100 );

which gives the following results:

BiCGStab converged after 215 steps with rate 9.3745e-01, |r| = 5.7064e-08, in 22.47s
GMRES(20) converged after 258 steps with rate 9.4773e-01, |r| = 5.4591e-08, in 29.42s
TFQMR converged after 207 steps with rate 9.3496e-01, |r| = 4.9908e-09, in 22.00s

For comparison, a blocksize of 500 is also used:

BiCGStab converged after 173 steps with rate 9.2311e-01, |r| = 3.1961e-08, in 20.74s
GMRES(20) converged after 148 steps with rate 9.1078e-01, |r| = 4.3545e-08, in 19.29s
TFQMR converged after 174 steps with rate 9.2273e-01, |r| = 3.4310e-09, in 21.19s

Since the approximation to \(A\), and hence also to \(A^{-1}\), is getting better with an increased blocksize, the convergence rates are also decreasing.

However, an increased block size also has drawbacks:

  1. Matrix-vector multiplication with the preconditioner becomes more expensive.
  2. The setup time for the preconditioner increases, i.e., the time for inverting the diagonal blocks.

The increased time for matrix-vector multiplications can be seen for the BiCGStab and the TFQMR solver, which have similar solve times for both block sizes, although the number of iteration steps is reduced.

For point-wise Jacobi and for a blocksize equal to \(n_{\min}\), the setup time was actually negligible. So let us correct the above data by the measured time for setting up the preconditioner. First for a blocksize of \(100\):

BiCGStab converged after 215 steps with rate 9.3745e-01, |r| = 5.7064e-08, in 22.47s + 0.47s = 22.94s
GMRES(20) converged after 258 steps with rate 9.4773e-01, |r| = 5.4591e-08, in 29.42s + 0.47s = 29.89s
TFQMR converged after 207 steps with rate 9.3496e-01, |r| = 4.9908e-09, in 22.00s + 0.47s = 22.47s

and also for a blocksize of 500:

BiCGStab converged after 173 steps with rate 9.2311e-01, |r| = 3.1961e-08, in 20.74s + 10.72s = 31.46s
GMRES(20) converged after 148 steps with rate 9.1078e-01, |r| = 4.3545e-08, in 19.29s + 10.72s = 30.01s
TFQMR converged after 174 steps with rate 9.2273e-01, |r| = 3.4310e-09, in 21.19s + 10.72s = 31.91s

The relatively large increase in the setup time is due to the H-inversion of the diagonal blocks, as performed by the TJacobi class. For small blocks, this is comparable to dense inversion, which is very fast for small matrices. However, depending on the H-block structure, a more than linear increase in runtime may be visible for larger blocks (before H-complexity becomes dominant).

TJacobi also accepts a fourth argument defining the accuracy of the H-inversion, which defaults to exact inversion (up to machine precision):

auto Jacobi = make_unique< TJacobi >( A.get(), 1.0, 100, fixed_prec( 1e-4 ) );

In this example, however, a relaxed accuracy leads to only very little change in the setup and solve times.

This is a typical example of a more advanced preconditioner not necessarily resulting in an overall performance increase. Only if the reduction in iteration steps is large enough to compensate for the more costly setup phase will the overall time be reduced.

Gauss-Seidel Preconditioner

In addition to the diagonal of \(A\), the Gauss-Seidel preconditioner also uses the lower (forward GS), the upper (backward GS) or both (symmetric GS) triangular parts of the matrix. With \(A = D - E - F\), where \(E\) is the strict (block) lower triangular part of \(A\) and \(F\) the strict upper triangular part, this gives us the following preconditioners:

  • Forward-GS: \((D-E)^{-1}\),
  • Backward-GS: \((D-F)^{-1}\),
  • Symmetric-GS: \((D-F)^{-1} D (D-E)^{-1}\).
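
Applying, e.g., the Symmetric-GS preconditioner to a defect vector \(d\) hence amounts to two triangular matrix solves with a diagonal scaling in between: solve \((D-E)\, y = d\), compute \(z = D y\) and solve \((D-F)\, w = z\), which yields \(w = (D-F)^{-1} D (D-E)^{-1} d\).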

Let us test all three versions of Gauss-Seidel for the Helmholtz model problem.

Forward-GS:

auto FGS    = make_unique< TGaussSeidel >( A.get(), GS_FORWARD );
auto solver = ...;

solver->solve( A.get(), x.get(), b.get(), FGS.get() );

BiCGStab converged after 300 steps with rate 9.5495e-01, |r| = 2.8519e-08, in 76.56s
GMRES(20) converged after 281 steps with rate 9.5196e-01, |r| = 4.4621e-08, in 77.02s
TFQMR converged after 250 steps with rate 9.4581e-01, |r| = 6.5929e-09, in 64.65s

Backward-GS:

auto BGS    = make_unique< TGaussSeidel >( A.get(), GS_BACKWARD );
auto solver = ...;

solver->solve( A.get(), x.get(), b.get(), BGS.get() );

BiCGStab converged after 279 steps with rate 9.5164e-01, |r| = 2.8068e-08, in 72.67s
GMRES(20) converged after 280 steps with rate 9.5179e-01, |r| = 4.7332e-08, in 77.79s
TFQMR converged after 260 steps with rate 9.4799e-01, |r| = 2.5888e-09, in 68.42s

Symmetric-GS:

auto SGS    = make_unique< TGaussSeidel >( A.get(), GS_SYMMETRIC );
auto solver = ...;

solver->solve( A.get(), x.get(), b.get(), SGS.get() );

BiCGStab converged after 88 steps with rate 8.5257e-01, |r| = 1.9405e-08, in 37.84s
GMRES(20) converged after 76 steps with rate 8.3227e-01, |r| = 7.9860e-08, in 34.66s
TFQMR converged after 93 steps with rate 8.6043e-01, |r| = 8.8380e-09, in 40.82s

Only Symmetric-GS shows a significant improvement over block-wise Jacobi in terms of convergence rates, with fewer than 100 iteration steps needed to solve the Helmholtz problem.

However, the solve time is significantly higher compared to Jacobi, which is a result of the matrix solve algorithm in 𝖧𝖫𝖨𝖡𝗉𝗋𝗈. For one, solving with a lower/upper triangular H-matrix takes much longer than matrix-vector multiplication with a block-diagonal matrix, since much more data is involved. Furthermore, this matrix solve algorithm does not scale as efficiently with the number of CPUs as matrix-vector multiplication does, especially for a block-diagonal matrix.

Remarks
𝖧𝖫𝖨𝖡𝗉𝗋𝗈 does not support block-GS with a block size larger than \(n_{\min}\), which would possibly improve convergence rates further, albeit with higher setup costs comparable to block-Jacobi.

H-LU Preconditioner

Block-Jacobi and Gauss-Seidel already made use of the block structure of H-matrices and the corresponding H-arithmetic. Especially block-Jacobi used H-matrix inversion for the diagonal blocks.

Using H-LU factorisation, a better approximation to \(A\) can be computed by using the whole matrix instead of just the (block-) diagonal, which then in theory should serve as an excellent preconditioner. In fact, an exact (up to machine precision) computation of \(A^{-1}\) would result in a direct solver, i.e., only a single matrix-vector multiplication (or matrix-vector solve) would be needed for solving the equation system.
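
A minimal sketch of this limit case (not part of the benchmark; it assumes the apply() method of the returned factorisation object, which evaluates the corresponding linear operator):

// H-LU with (near) machine precision acts as a direct solver
auto A_copy = A->copy();
auto A_inv  = factorise_inv( A_copy.get(), acc_exact );

A_inv->apply( b.get(), x.get() );  // x = A^{-1} b up to factorisation accuracy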

H-arithmetic allows us to control the accuracy of the approximation, which thereby permits us to optimize the overall solution process, e.g., invest more time into setup (high accuracy H-LU) or more time in the iteration process (low accuracy H-LU) depending on our problem to solve.

For the Helmholtz problem, we will increase the block-wise H-arithmetic accuracy, with \(\varepsilon\) ranging from \(1\) down to \(10^{-4}\).

Except for the accuracy setting, the source for all examples is:

auto acc    = fixed_prec( ... );
auto A_copy = A->copy( acc );
auto A_inv  = factorise_inv( A_copy.get(), acc );
auto solver = ...;

solver->solve( A.get(), x.get(), b.get(), A_inv.get() );

Remarks
Note that already the copy of \(A\) is constructed with a reduced accuracy, thereby starting with less data for the H-LU factorisation.

The numerical results are:

\(\varepsilon = 1\)

BiCGStab converged after 43 steps with rate 7.0292e-01, |r| = 2.7785e-09, in 15.28s + 28.98s = 44.26
MINRES converged after 104 steps with rate 8.7492e-01, |r| = 1.7772e-03, in 37.02s + 28.98s = 66.00
GMRES(20) converged after 36 steps with rate 6.7405e-01, |r| = 1.3937e-08, in 13.63s + 28.98s = 42.61
TFQMR converged after 43 steps with rate 7.2430e-01, |r| = 2.8872e-09, in 15.93s + 28.98s = 44.91

\(\varepsilon = 10^{-2}\)

Lin. Iter. converged after 6 steps with rate 8.5859e-02, |r| = 6.3671e-11, in 2.79s + 96.78s = 99.57
CG converged after 42 steps with rate 7.1875e-01, |r| = 1.0744e-13, in 17.49s + 96.78s = 114.27
CGS converged after 8 steps with rate 1.0733e-01, |r| = 1.3481e-12, in 3.90s + 96.78s = 100.68
BiCGStab converged after 6 steps with rate 8.1799e-02, |r| = 4.7609e-11, in 2.81s + 96.78s = 99.59
MINRES converged after 6 steps with rate 8.8599e-02, |r| = 1.5918e-09, in 3.12s + 96.78s = 99.90
GMRES(20) converged after 6 steps with rate 8.6919e-02, |r| = 3.8247e-11, in 3.12s + 96.78s = 99.90
TFQMR converged after 7 steps with rate 1.0685e-01, |r| = 4.6321e-12, in 3.92s + 96.78s = 100.70

\(\varepsilon = 10^{-4}\)

Lin. Iter. converged after 2 steps with rate 6.8749e-04, |r| = 3.2750e-13, in 1.40s + 248.34s = 249.74
CG converged after 5 steps with rate 5.7639e-02, |r| = 4.0160e-16, in 3.18s + 248.34s = 251.52
CGS converged after 2 steps with rate 7.2098e-04, |r| = 3.2766e-13, in 1.78s + 248.34s = 250.12
BiCGStab converged after 2 steps with rate 6.8746e-04, |r| = 3.2746e-13, in 1.41s + 248.34s = 249.75
MINRES converged after 2 steps with rate 7.2055e-04, |r| = 3.2755e-13, in 1.78s + 248.34s = 250.12
GMRES(20) converged after 2 steps with rate 7.2055e-04, |r| = 3.2758e-13, in 1.78s + 248.34s = 250.12
TFQMR converged after 2 steps with rate 8.5698e-04, |r| = 3.2750e-13, in 2.25s + 248.34s = 250.59

As expected, with a better H-accuracy, the convergence rate also improves, until only two iteration steps are needed for \(\varepsilon = 10^{-4}\). However, the setup time for computing the H-LU preconditioner also increases significantly with a smaller \(\varepsilon\), already exceeding the matrix construction time for \(\varepsilon = 10^{-2}\).

Regarding the overall solve time: already H-LU with the coarsest accuracy needs longer than a simple block-Jacobi preconditioner. Furthermore, the matrix-vector solves needed for applying H-LU as a preconditioner are expensive, which was already a problem for the Gauss-Seidel preconditioners.

Altogether, the advantage of H-LU for this specific example will only show for many right-hand sides. As an example, we compare TFQMR+Block-Jacobi (blocksize 100) with Linear Iteration+H-LU (\(\varepsilon = 10^{-2}\)). Both solvers need (almost) the same overall time for five right-hand side vectors (110.47s vs. 110.73s), as the computation below shows. Only for more vectors does H-LU show better performance than block-Jacobi.
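
In detail, using the setup times and per-solve times from above: \(0.47s + 5 \cdot 22.00s = 110.47s\) for TFQMR+Block-Jacobi versus \(96.78s + 5 \cdot 2.79s = 110.73s\) for Linear Iteration+H-LU.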

Attention
Again, this is only valid for this example matrix. Already for a larger problem size of the same Helmholtz example, the Jacobi preconditioner will need more iteration steps, which will shift the break-even point in favour of H-LU.

Block-Diagonal H-LU Preconditioner

The main problem with block-wise Jacobi for large block sizes was the high setup cost due to the inversion of the diagonal matrix blocks. Instead of inversion, these blocks may also be factorised using H-LU factorisation, which permits larger block sizes for block-Jacobi.

𝖧𝖫𝖨𝖡𝗉𝗋𝗈 provides two functions to either restrict a given matrix to its block-diagonal part

void
restrict_blockdiag ( TMatrix *     A,
                     const uint    lvl       = lvl_leaf,
                     const size_t  blocksize = 0 );

or to copy the block-diagonal of a matrix:

std::unique_ptr< TMatrix >
blockdiag ( const TMatrix *  A,
            const uint       lvl       = lvl_leaf,
            const size_t     blocksize = 0 );

In both cases, the block-diagonal is either defined by the level of the diagonal blocks or by their sizes, i.e., the recursion stops when a certain level or a certain blocksize is reached. The default value lvl_leaf signals that the recursion should only stop at the leaf level (or at blocksize = \(n_{\min}\)).
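
For illustration, both variants side by side (the level 3 and the blocksize 500 are arbitrary example values):

auto A_bd1 = blockdiag( A.get(), 3 );              // stop at level 3 of the block structure
auto A_bd2 = blockdiag( A.get(), lvl_leaf, 500 );  // stop at diagonal blocks of size 500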

For our problem, a corresponding preconditioner is obtained via

auto A_bd   = blockdiag( A.get(), lvl_leaf, ... );
auto acc    = fixed_prec( ... );
auto A_inv  = factorise_inv( A_bd.get(), acc );
auto solver = ...;

solver->solve( A.get(), x.get(), b.get(), A_inv.get() );

The results for different block sizes are:

blocksize \(100\)

BiCGStab converged after 221 steps with rate 9.3928e-01, |r| = 5.9560e-08, in 33.35s + 0.45s = 33.80s
GMRES(20) converged after 258 steps with rate 9.4773e-01, |r| = 5.4604e-08, in 41.84s + 0.45s = 42.29s
TFQMR converged after 207 steps with rate 9.3494e-01, |r| = 4.9847e-09, in 31.61s + 0.45s = 32.06s

blocksize \(500\)

BiCGStab converged after 175 steps with rate 9.2347e-01, |r| = 2.9163e-08, in 43.48s + 1.44s = 44.92s
GMRES(20) converged after 148 steps with rate 9.1077e-01, |r| = 4.3472e-08, in 39.37s + 1.44s = 40.81s
TFQMR converged after 174 steps with rate 9.2358e-01, |r| = 4.1605e-09, in 43.81s + 1.44s = 45.25s

blocksize \(2000\)

BiCGStab converged after 143 steps with rate 9.0767e-01, |r| = 2.5690e-08, in 40.33s + 4.65s = 44.98s
GMRES(20) converged after 124 steps with rate 8.9395e-01, |r| = 4.4211e-08, in 37.50s + 4.65s = 42.15s
TFQMR converged after 144 steps with rate 9.0833e-01, |r| = 6.2214e-09, in 41.22s + 4.65s = 45.87s

blocksize \(10000\)

BiCGStab converged after 94 steps with rate 8.6236e-01, |r| = 1.9162e-08, in 29.53s + 11.29s = 40.82s
GMRES(20) converged after 76 steps with rate 8.3217e-01, |r| = 2.7094e-08, in 25.47s + 11.29s = 36.76s
TFQMR converged after 106 steps with rate 8.7687e-01, |r| = 4.5741e-09, in 33.89s + 11.29s = 45.18s

blocksize \(20000\)

BiCGStab converged after 72 steps with rate 8.2350e-01, |r| = 1.6905e-08, in 24.45s + 15.79s = 40.24s
GMRES(20) converged after 59 steps with rate 7.8976e-01, |r| = 3.1821e-08, in 21.31s + 15.79s = 37.10s
TFQMR converged after 93 steps with rate 8.5858e-01, |r| = 1.8237e-09, in 32.14s + 15.79s = 47.93s

The results for a blocksize of \(100\) and \(500\) are (almost) identical to the block-Jacobi preconditioner above; only the setup time is reduced, e.g., from 10.72 seconds to 1.44 seconds for blocksize \(500\). The solve times are slightly worse than in the Jacobi version, since matrix-vector solves are involved.

However, already for a blocksize of \(500\), the overall time improves, becoming even better for a blocksize of \(2000\), with a minimum for \(10000\). Although a larger block size results in better convergence, the setup time also increases, making block-diagonal H-LU with an intermediate block size the fastest solver for our example in terms of overall time.

We have not yet discussed the accuracy used for H-LU in the case of block-diagonal preconditioning. For the above tests, the accuracy was \(1\), i.e., a very coarse approximation. Tests with a higher accuracy did not result in an improvement, for one because the block-diagonal structure already sets a lower limit on the approximability of \(A^{-1}\) (see block-Jacobi), and second, because the setup time would then be higher. Only for very large block sizes could this improve convergence, but not necessarily solve times.
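
For reference, a sketch of the coarse factorisation as used in the tests above:

// block-wise relative accuracy of 1, i.e., a very coarse H-LU
auto A_inv = factorise_inv( A_bd.get(), fixed_prec( 1.0 ) );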

Nearfield H-LU Preconditioner

Another alternative for computing an H-based preconditioner is to restrict the matrix to the nearfield part only, as defined by the dense matrix blocks. All other (low-rank) blocks are considered to be zero and will not be updated during the LU factorisation. This algorithm can be considered an incomplete LU with limited fill-in.

Remarks
If started with a sparse matrix, this method corresponds to an incomplete LU where the fill-in is restricted to those matrix coefficients defined by the dense matrix blocks.

Since only dense arithmetic is involved, all computations are performed exactly with respect to machine precision. Furthermore, the computation time is much shorter compared to standard H-LU. However, the approximation of \(A^{-1}\) is also limited.

As in the block-diagonal case, we have the following two functions to either restrict a given matrix to the nearfield part or to copy it:

void
restrict_nearfield ( TMatrix *   A,
                     const bool  delete_farfield );

std::unique_ptr< TMatrix >
nearfield ( const TMatrix *  A,
            const bool       without_farfield );

The second parameter defines whether low-rank blocks should be deleted/not copied or just zeroed.

In our case, this looks as follows:

auto A_nf   = nearfield( A.get(), true );
auto A_inv  = factorise_inv( A_nf.get(), acc_exact );
auto solver = ...;

solver->solve( A.get(), x.get(), b.get(), A_inv.get() );

And the results are:

BiCGStab converged after 122 steps with rate 8.9181e-01, |r| = 1.6491e-08, in 30.07s + 1.47s = 31.54s
GMRES(20) converged after 117 steps with rate 8.8802e-01, |r| = 2.4539e-08, in 30.78s + 1.47s = 32.25s
TFQMR converged after 132 steps with rate 8.9907e-01, |r| = 1.0807e-09, in 33.00s + 1.47s = 34.47s

The solve times are not as fast as in the block-diagonal case (Jacobi or H-LU), but not much worse. Especially the fast setup makes this version interesting. Also, the matrix-vector solves are faster than for standard H-LU.

Conclusion

For the given Helmholtz problem with a single right-hand side, the fastest solver used the block-diagonal H-LU factorisation with a block size of \(10000\). The reason is the fast factorisation of the diagonal blocks, i.e., a low setup time, combined with a reasonable number of iteration steps.

Using the standard H-LU preconditioner with \(\varepsilon = 10^{-2}\) for comparison (its H-LU time is similar to the matrix construction time), 6 right-hand side vectors are needed for it to be faster than block-diagonal H-LU.

Attention
To emphasize again: the results are for a specific problem. What is not visible in the data is, for instance, the dependency of the number of iteration steps on the problem dimension, which is typical for classical preconditioners like Jacobi or Gauss-Seidel. Furthermore, convergence rates depend strongly on the properties of the matrix (we leave out convergence theory here).