. 5
( 26)


with its associated graph.
As previously noticed, during the factorization procedure, nonzero entries
can be generated in memory positions that correspond to zero entries in
98 3. Direct Methods for the Solution of Linear Systems

the starting matrix. This action is referred to as ¬ll-in. Figure 3.4 shows the
e¬ect of ¬ll-in on the sparse matrix whose pattern is shown in Figure 3.3.
Since use of pivoting in the factorization process makes things even more
complicated, we shall only consider the case of symmetric positive de¬nite
matrices for which pivoting is not necessary.
A ¬rst remarkable result concerns the amount of ¬ll-in. Let mi (A) =
i ’ min {j < i : aij = 0} and denote by E(A) the convex hull of A, given

E(A) = {(i, j) : 0 < i ’ j ¤ mi (A)} . (3.59)

For a symmetric positive de¬nite matrix,

E(A) = E(H + HT ) (3.60)

where H is the Cholesky factor, so that ¬ll-in is con¬ned within the convex
hull of A (see Figure 3.4). Moreover, if we denote by lk (A) the number of
active rows at the k-th step of the factorization (i.e., the number of rows
of A with i > k and aik = 0), the computational cost of the factorization
process is
lk (A) (lk (A) + 3) ¬‚ops, (3.61)

having accounted for all the nonzero entries of the convex hull. Con¬nement
of ¬ll-in within E(A) ensures that the LU factorization of A can be stored
without extra memory areas simply by storing all the entries of E(A) (in-
cluding the null elements). However, such a procedure might still be highly
ine¬cient due to the large number of zero entries in the hull (see Exercise
On the other hand, from (3.60) one gets that the reduction in the convex
hull re¬‚ects a reduction of ¬ll-in, and in turn, due to (3.61), of the number
of operations needed to perform the factorization. For this reason several
strategies for reordering the graph of the matrix have been devised. Among
them, we recall the Cuthill-McKee method, which will be addressed in the
next section.
An alternative consists of decomposing the matrix into sparse subma-
trices, with the aim of reducing the original problem to the solution of
subproblems of reduced size, where matrices can be stored in full format.
This approach leads to submatrix decomposition methods which will be
addressed in Section 3.9.2.

3.9.1 The Cuthill-McKee Algorithm
The Cuthill-McKee algorithm is a simple and e¬ective method for reorder-
ing the system variables.
3.9 Sparse Matrices 99

x xxx x x
xxx x
x x
x 000 x
0000 x xx
000x x
x 0000
1111 xxx
x 000
0 1111x x
x 0000x
1 0000 xx xx x
1111 xxx xx
xxx x
xx x
x 000x
0000000 x 000x x
1111111 111 xxx xx
x 0000000
1111111 x xx xx
FIGURE 3.4. The shaded regions in the left ¬gure show the areas of the matrix
that can be a¬ected by ¬ll-in, for the matrix considered in Figure 3.3. Solid
lines denote the boundary of E(A). The right ¬gure displays the factors that
have been actually computed. Black dots denote the elements of A that were
originarily equal to zero

The ¬rst step of the algorithm consists of associating with each vertex of
the graph the number of its connections with neighboring vertices, called
the degree of the vertex. Next, the following steps are taken:

1. a vertex with a low number of connections is chosen as the ¬rst vertex
of the graph;

2. the vertices connected to it are progressively re-labeled starting from
those having lower degrees;

3. the procedure is repeated starting from the vertices connected to the
second vertex in the updated list. The nodes already re-labeled are
ignored. Then, a third new vertex is considered, and so on, until all
the vertices have been explored.

The usual way to improve the e¬ciency of the algorithm is based on the
so-called reverse form of the Cuthill-McKee method. This consists of ex-
ecuting the Cuthill-McKee algorithm described above where, at the end,
the i-th vertex is moved into the n ’ i + 1-th position of the list, n being
the number of nodes in the graph. Figure 3.5 reports, for comparison, the
graphs obtained using the direct and reverse Cuthill-McKee reordering in
the case of the matrix pattern represented in Figure 3.3, while in Figure
3.6 the factors L and U are compared. Notice the absence of ¬ll-in when
the reverse Cuthill-McKee method is used.

Remark 3.5 For an e¬cient solution of linear systems with sparse ma-
trices, we mention the public domain libraries SPARSKIT [Saa90], UMF-
PACK [DD95] and the MATLAB sparfun package.
100 3. Direct Methods for the Solution of Linear Systems

4 (8)
5 (12) 3 (11)
4 (5)
3 (2)
5 (6)
6 (7) 2 (9)
6 (1) 2 (4)

7 (6)
7 (12) 1 (3)
1 (10)

8 (5)
12 (1)
12 (7)
8 (8)

11 (11) 9 (3) 11 (2)
9 (10)
10 (4)
10 (9)

FIGURE 3.5. Reordered graphs using the direct (left) and reverse (right)
Cuthill-McKee algorithm. The label of each vertex, before reordering is per-
formed, is reported in braces

xxx x x xxx
xxxxxx xxxx
xxx x xxxx x
x xx
x xx xx
xx xx x x
x xx x x
xx x
x xx xx xxxx xx
xx x xx xx x
xx x xx x
xxx x xx xx
xxxx x xx xx
x x xx x xx

FIGURE 3.6. Factors L and U after the direct (left) and reverse (right)
Cuthill-McKee reordering. In the second case, ¬ll-in is absent

3.9.2 Decomposition into Substructures
These methods have been developed in the framework of numerical ap-
proximation of partial di¬erential equations. Their basic strategy consists
of splitting the solution of the original linear system into subsystems of
smaller size which are almost independent from each other and can be
easily interpreted as a reordering technique.
We describe the methods on a special example, referring for a more
comprehensive presentation to [BSG96]. Consider the linear system Ax=b,
where A is a symmetric positive de¬nite matrix whose pattern is shown in
Figure 3.3. To help develop an intuitive understanding of the method, we
draw the graph of A in the form as in Figure 3.7.
3.9 Sparse Matrices 101

We then partition the graph of A into the two subgraphs (or substruc-
tures) identi¬ed in the ¬gure and denote by xk , k = 1, 2, the vectors of the
unknowns relative to the nodes that belong to the interior of the k-th sub-
structure. We also denote by x3 the vector of the unknowns that lie along
the interface between the two substructures. Referring to the decomposi-
tion in Figure 3.7, we have x1 = (2, 3, 4, 6) , x2 = (8, 9, 10, 11, 12)
and x3 = (1, 5, 7) .
As a result of the decomposition of the unknowns, matrix A will be
partitioned in blocks, so that the linear system can be written in the form

11 10

5 1
7 8 9
substructure II

substructure I
FIGURE 3.7. Decomposition into two substructures

® ® ® 
A11 0 A13 x1 b1
°0 A23 » ° x2 » = ° b2 » ,
AT AT A33 x3 b3
13 23

having reordered the unknowns and partitioned accordingly the right hand
side of the system. Suppose that A33 is decomposed into two parts, A33
and A33 , which represent the contributions to A33 of each substructure.
Similarly, let the right hand side b3 be decomposed as b3 +b3 . The original
linear system is now equivalent to the following pair

A11 A13 x1 b1
= ,
b3 + γ 3
AT A33 x3

A22 A23 x2 b2
b3 ’ γ 3
AT A33 x3

having denoted by γ 3 a vector that takes into account the coupling between
the substructures. A typical way of proceeding in decomposition techniques
consists of eliminating γ 3 to end up with independent systems, one for each
102 3. Direct Methods for the Solution of Linear Systems

substructure. Let us apply this strategy to the example at hand. The linear
system for the ¬rst substructure is

A11 A13 x1 b1
= . (3.62)
b3 + γ 3
AT A33 x3

Let us now factorize A11 as HT H11 and proceed with the reduction method
already described in Section 3.8.3 for block tridiagonal matrices. We obtain
the system

H11 H21 x1 c1
A33 ’ H21 HT b3 + γ 3 ’ H21 c1
0 x3

where H21 = H’T A13 and c1 = H’T b1 . The second equation of this system
11 11
yields γ 3 explicitly as

γ 3 = A33 ’ HT H21 x3 ’ b3 + HT c1 .
21 21

Substituting this equation into the system for the second substructure, one
ends up with a system only in the unknowns x2 and x3

A22 A23 x2 b2
= , (3.63)
AT A33 x3 b3

where A33 = A33 ’ HT H21 and b3 = b3 ’ HT c1 . Once (3.63) has been
21 21
solved, it will be possible, by backsubstitution into (3.62), to compute also
x1 .
The technique described above can be easily extended to the case of
several substructures and its e¬ciency will increase the more the substruc-
tures are mutually independent. It reproduces in nuce the so-called frontal
method (introduced by Irons [Iro70]), which is quite popular in the solution
of ¬nite element systems (for an implementation, we refer to the UMF-
PACK library [DD95]).

Remark 3.6 (The Schur complement) An approach that is dual to
the above method consists of reducing the starting system to a system
acting only on the interface unknowns x3 , passing through the assembling
of the Schur complement of matrix A, de¬ned in the 3—3 case at hand as

S = A33 ’ AT A’1 A13 ’ AT A’1 A23 .
13 11 23 22

The original problem is thus equivalent to the system

Sx3 = b3 ’ AT A’1 b1 ’ AT A’1 b2 .
13 11 23 22

This system is full (even if the matrices Aij were sparse) and can be solved
using either a direct or an iterative method, provided that a suitable pre-
conditioner is available. Once x3 has been computed, one can get x1 and
3.10 Accuracy of the Solution Achieved Using GEM 103

x2 by solving two systems of reduced size, whose matrices are A11 and A22 ,
We also notice that if the block matrix A is symmetric and positive
de¬nite, then the linear system on the Schur complement S is no more
ill-conditioned than the original system on A, since

K2 (S) ¤ K2 (A)

(for a proof, see Lemma 3.12, [Axe94]. See also [CM94] and [QV99]).

3.9.3 Nested Dissection
This is a renumbering technique quite similar to substructuring. In practice,
it consists of repeating the decomposition process several times at each
substructure level, until the size of each single block is made su¬ciently
small. In Figure 3.8 a possible nested dissection is shown in the case of the
matrix considered in the previous section. Once the subdivision procedure
has been completed, the vertices are renumbered starting with the nodes
belonging to the latest substructuring level and moving progressively up to
the ¬rst level. In the example at hand, the new node ordering is 11, 9, 7,
6, 12, 8, 4, 2, 1, 5, 3.
This procedure is particularly e¬ective if the problem has a large size and
the substructures have few connections between them or exhibit a repetitive
pattern [Geo73].

3.10 Accuracy of the Solution Achieved Using
Let us analyze the e¬ects of rounding errors on the accuracy of the solution
yielded by GEM. Suppose that A and b are a matrix and a vector of
¬‚oating-point numbers. Denoting by L and U, respectively, the matrices
of the LU factorization induced by GEM and computed in ¬‚oating-point
arithmetic, the solution x yielded by GEM can be regarded as being the
solution (in exact arithmetic) of the perturbed system (A + δA)x = b,
where δA is a perturbation matrix such that

|δA| ¤ nu 3|A| + 5|L||U| + O(u2 ), (3.64)

where u is the roundo¬ unit and the matrix absolute value notation has
been used (see [GL89], Section 3.4.6). As a consequence, the entries of δA
will be small in size if the entries of L and U are small. Using partial
pivoting allows for bounding below 1 the module of the entries of L in such
a way that, passing to the in¬nity norm and noting that L ∞ ¤ n, the
104 3. Direct Methods for the Solution of Linear Systems

2 A
1 0000000
2 0000000000000000
2 00000
1 1111111111111111
A 0000000000000000

3 4 C 5 6B A
3 000
111 0000000
11 111111111111111
00 000000000000000
000 111111111111111
111 111111111111111
4 000000000000000
3 1111111
111 0000000
C 0000000
A 11111
5 000000000000000
6 111111111111111
B 00000
5 111111111111111
A 000000000000000
00 000000000000000
11 000000000000000

FIGURE 3.8. Two steps of nested dissection. Graph partitioning (left) and matrix
reordering (right)

estimate (3.64) becomes

¤ nu 3 A + O(u2 ).
δA + 5n U (3.65)
∞ ∞ ∞

The bound for δA ∞ in (3.65) is of practical use only if it is possible to
provide an estimate for U ∞ . With this aim, backward analysis can be
carried out introducing the so-called growth factor
max|aij |
ρn = . (3.66)
max|aij |

Taking advantage of the fact that |uij | ¤ ρn max|aij |, the following result
due to Wilkinson can be drawn from (3.65),
¤ 8un3 ρn A + O(u2 ).
δA (3.67)
∞ ∞

The growth factor can be bounded by 2n’1 and, although in most of the
cases it is of the order of 10, there exist matrices for which the inequality in
(3.67) becomes an equality (see, for instance, Exercise 5). For some special
classes of matrices, a sharp bound for ρn can be found:
3.10 Accuracy of the Solution Achieved Using GEM 105

1. for banded matrices with upper and lower bands equal to p, ρn ¤
22p’1 ’ (p ’ 1)2p’2 . As a consequence, in the tridiagonal case one
gets ρn ¤ 2;
2. for Hessenberg matrices, ρn ¤ n;
3. for symmetric positive de¬nite matrices, ρn = 1;
4. for matrices strictly diagonally dominant by columns, ρn ¤ 2.
To achieve better stability when using GEM for arbitrary matrices, re-
sorting to complete pivoting would seem to be mandatory, since it ensures
that ρn ¤ n1/2 2 · 31/2 · . . . · n1/(n’1) . Indeed, this growth is slower
than 2 as n increases.
However, apart from very special instances, GEM with only partial piv-
oting exhibits acceptable growth factors. This make it the most commonly
employed method in the computational practice.

Example 3.7 Consider the linear system (3.2) with
µ 1 1+µ
A= , b= , (3.68)
1 0 1
which admits the exact solution x=1 for any value of µ. The matrix is well-
conditioned, having K∞ (A) = (1 + µ)2 . Attempting to solve the system for µ =
10’15 by the LU factorization with 16 signi¬cant digits, and using the Programs
5, 2 and 3, yields the solution x = [0.8881784197001253, 1.000000000000000]T ,
with an error greater than 11% on the ¬rst component. Some insight into the
causes of the inaccuracy of the computed solution can be drawn from (3.64).
Indeed this latter does not provide a uniformly small bound for all the entries of
matrix δA, rather
3.55 · 10’30 1.33 · 10’15
|δA| ¤ .
1.33 · 10’15 2.22

Notice that the entries of the corresponding matrices L and U are quite large in
module. Conversely, resorting to GEM with partial or complete pivoting yields

the exact solution of the system (see Exercise 6).

Let us now address the role of the condition number in the error analysis
for GEM. GEM yields a solution x that is typically characterized by having
a small residual r = b ’ Ax (see [GL89]). This feature, however, does not
ensure that the error x ’ x is small when K(A) 1 (see Example 3.8). In
fact, if δb in (3.11) is regarded as being the residual, then
x’x 1 r
¤ K(A) r ¤ K(A) .
x A x b
This result will be applied to devise methods, based on the a posteriori
analysis, for improving the accuracy of the solution of GEM (see Section
106 3. Direct Methods for the Solution of Linear Systems

Example 3.8 Consider the linear system Ax = b with
1 1.0001 1
A= , b= ,
1.0001 1 1

which admits the solution x = (0.499975 . . . , 0.499975 . . . )T . Assuming as an ap-
proximate solution the vector x = (’4.499775, 5.5002249)T , one ¬nds the residual
r (’0.001, 0)T , which is small although x is quite di¬erent from the exact so-
lution. The reason for this is due to the ill-conditioning of matrix A. Indeed in

this case K∞ (A) = 20001.

An estimate of the number of exact signi¬cant digits of a numerical
solution of a linear system can be given as follows. From (3.13), letting
γ = u and assuming that uK∞ (A) ¤ 1/2 we get
δx ∞ 2uK∞ (A)
¤ ¤ 4uK∞ (A).
1 ’ uK∞ (A)
As a consequence
x’x ∞
uK∞ (A). (3.69)
β ’t and K∞ (A) β m , one gets that the solution x
Assuming that u
computed by GEM will have at least t ’ m exact digits, t being the number
of digits available for the mantissa. In other words, the ill-conditioning of a
system depends both on the capability of the ¬‚oating-point arithmetic that
is being used and on the accuracy that is required in the solution.

3.11 An Approximate Computation of K(A)
Suppose that the linear system (3.2) has been solved by a factorization
method. To determine the accuracy of the computed solution, the analy-
sis carried out in Section 3.10 can be used if an estimate of the condition
number K(A) of A, which we denote by K(A), is available. Indeed, al-
though evaluating A can be an easy task if a suitable norm is chosen
(for instance, · 1 or · ∞ ), it is by no means reasonable (or compu-
tationally convenient) to compute A’1 if the only purpose is to evaluate
A’1 . For this reason, we describe in this section a procedure (proposed
in [CMSW79]) that approximates A’1 with a computational cost of the
order of n2 ¬‚ops.
The basic idea of the algorithm is as follows: ∀d ∈ Rn with d = 0, thanks
to the de¬nition of matrix norm, A’1 ≥ y / d = γ(d) with Ay = d.
Thus, we look for d in such a way that γ(d) is as large as possible and
assume the obtained value as an estimate of A’1 .
For the method to be e¬ective, the selection of d is crucial. To explain
how to do this, we start by assuming that the QR factorization of A has
3.11 An Approximate Computation of K(A) 107

been computed and that K2 (A) is to be approximated. In such an event,
since K2 (A) = K2 (R) due to Property 1.8, it su¬ces to estimate R’1 2
instead of A’1 2 . Considerations related to the SVD of R induce approx-
imating R’1 2 by the following algorithm:
compute the vectors x and y, solutions to the systems

RT x = d, Ry = x, (3.70)

then estimate R’1 2 by the ratio γ2 = y 2 / x 2 . The vector d appearing
in (3.70) should be determined in such a way that γ2 is as close as possible
to the value actually attained by R’1 2 . It can be shown that, except in
very special cases, γ2 provides for any choice of d a reasonable (although
not very accurate) estimate of R’1 2 (see Exercise 15). As a consequence,
a proper selection of d can encourage this natural trend.
Before going on, it is worth noting that computing K2 (R) is not an easy
matter even if an estimate of R’1 2 is available. Indeed, it would remain
to compute R 2 = ρ(RT R). To overcome this di¬culty, we consider
henceforth K1 (R) instead of K2 (R) since R 1 is easily computable. Then,
heuristics allows us to assume that the ratio γ1 = y 1 / x 1 is an estimate
of R’1 1 , exactly as γ2 is an estimate of R’1 2 .

Let us now deal with the choice of d. Since RT x = d, the generic compo-
nent xk of x can be formally related to x1 , . . . , xk’1 through the formulae
of forward substitution as

r11 x0 = d1 ,
rkk xk = dk ’ (r1k x1 + . . . + rk’1,k xk’1 ), k ≥ 1.

Assume that the components of d are of the form dk = ±θk , where θk
are random numbers and set arbitrarily d1 = θ1 . Then, x1 = θ1 /r11 is
completely determined, while x2 = (d2 ’ r12 x1 )/r22 depends on the sign of
d2 . We set the sign of d2 as the opposite of r12 x1 in such a way to make
x(1 : 2) 1 = |x1 | + |x2 |, for a ¬xed x1 , the largest possible. Once x2 is
known, we compute x3 following the same criterion, and so on, until xn .
This approach sets the sign of each component of d and yields a vector
x with a presumably large · 1 . However, it can fail since it is based on
the idea (which is in general not true) that maximizing x 1 can be done
by selecting at each step k in (3.71) the component xk which guarantees
the maximum increase of x(1 : k ’ 1) 1 (without accounting for the fact
that all the components are related).
Therefore, we need to modify the method by including a sort of “look-
ahead” strategy, which accounts for the way of choosing dk a¬ects all later
values xi , with i > k, still to be computed. Concerning this point, we notice
that for a generic row i of the system it is always possible to compute at
108 3. Direct Methods for the Solution of Linear Systems

step k the vector p(k’1) with components
i = 1, . . . , k ’ 1,
pi =0
pi = r1i x1 + . . . + rk’1,i xk’1 i = k, . . . , n.
Thus xk = (±θk ’ pk )/rkk . We denote the two possible values of xk by

xk and xk . The choice between them is now taken not only accounting for
which of the two most increases x(1 : k) 1 , but also evaluating the increase
of p(k) 1 . This second contribution accounts for the e¬ect of the choice of
dk on the components that are still to be computed. We can include both
criteria in a unique test. Denoting by
(k)+ (k)’
pi = 0, pi = 0, i = 1, . . . , k,
(k)+ (k)’
+ rki x’ , i = k + 1, . . . , n,
(k’1) (k’1)
+ rki x+ , pi
pi = pi = pi
k k
+ ’
the components of the vectors p(k) and p(k) respectively, we set each
k-th step dk = +θk or dk = ’θk according to whether |rkk x+ | + p(k) 1

is greater or less than |rkk x’ | + p(k) 1 .
Under this choice d is completely determined and the same holds for x.
Now, solving the system Ry = x, we are warranted that y 1 / x 1 is a reli-
able approximation to R’1 1 , so that we can set K1 (A) = R 1 y 1 / x 1 .

In practice the PA=LU factorization introduced in Section 3.5 is usually
available. Based on the previous considerations and on some heuristics, an
analogous procedure to that shown above can be conveniently employed
to approximate A’1 1 . Precisely, instead of systems (3.70), we must now
(LU)T x = d, LUy = x.
We set y 1 / x 1 as the approximation of A’1 1 and, consequently, we
de¬ne K1 (A). The strategy for selecting d can be the same as before;
indeed, solving (LU)T x = d amounts to solving
UT z = d, LT x = z, (3.72)
and thus, since UT is lower triangular, we can proceed as in the previous
case. A remarkable di¬erence concerns the computation of x. Indeed, while
the matrix RT in the second system of (3.70) has the same condition number
as R, the second system in (3.72) has a matrix LT which could be even more
ill-conditioned than UT . If this were the case, solving for x could lead to
an inaccurate outcome, thus making the whole process useless.
Fortunately, resorting to partial pivoting prevents this circumstance from
occurring, ensuring that any ill-condition in A is re¬‚ected in a correspond-
ing ill-condition in U. Moreover, picking θk randomly between 1/2 and 1
3.12 Improving the Accuracy of GEM 109

guarantees accurate results even in the special cases where L turns out to
be ill-conditioned.

The algorithm presented below is implemented in the LINPACK library
[BDMS79] and in the MATLAB function rcond. This function, in order
to avoid rounding errors, returns as output parameter the reciprocal of
K1 (A). A more accurate estimator, described in [Hig88], is implemented in
the MATLAB function condest.

Program 14 implements the approximate evaluation of K1 for a matrix
A of generic form. The input parameters are the size n of the matrix A, the
matrix A, the factors L, U of its PA=LU factorization and the vector theta
containing the random numbers θk , for k = 1, . . . , n.

Program 14 - cond est : Algorithm for the approximation of K1 (A)

function [k1] = cond est(n,A,L,U,theta)
for i=1:n, p(i)=0; end
for k=1:n
zplus=(theta(k)-p(k))/U(k,k); zminu=(-theta(k)-p(k))/U(k,k);
splus=abs(theta(k)-p(k)); sminu=abs(-theta(k)-p(k));
for i=(k+1):n
if splus >= sminu, z(k)=zplus; else, z(k)=zminu; end
for i=(k+1):n, p(i)=p(i)+U(k,i)*z(k); end
z = z™; x = backward col(L™,z);
w = forward col(L,x); y = backward col(U,w);

Example 3.9 Let us consider the Hilbert matrix H4 . Its condition number
K1 (H4 ), computed using the MATLAB function invhilb which returns the exact
inverse of H4 , is 2.8375 · 104 . Running Program 14 with theta=(1, 1, 1, 1)T gives
the reasonable estimate K1 (H4 ) = 2.1523 · 104 (which is the same as the output

of rcond), while the function condest returns the exact result.

3.12 Improving the Accuracy of GEM
As previously noted if the matrix of the system is ill-conditioned, the so-
lution generated by GEM could be inaccurate even though its residual is
small. In this section, we mention two techniques for improving the accu-
racy of the solution computed by GEM.
110 3. Direct Methods for the Solution of Linear Systems

3.12.1 Scaling
If the entries of A vary greatly in size, it is likely that during the elimination
process large entries are summed to small entries, with a consequent onset
of rounding errors. A remedy consists of performing a scaling of the matrix
A before the elimination is carried out.

Example 3.10 Consider again the matrix A of Remark 3.3. Multiplying it on
the right and on the left with matrix D=diag(0.0005, 1, 1), we obtain the scaled
® 
’0.0001 1 1
A = DAD = ° 0.78125 0 » .
˜ 1
1 0 0

Applying GEM to the scaled system A˜ = Db = (0.2, 1.3816, 1.9273)T , we get

the correct solution x = D˜.

Row scaling of A amounts to ¬nding a diagonal nonsingular matrix D1
such that the diagonal entries of D1 A are of the same size. The linear
system Ax = b transforms into

D1 Ax = D1 b.

When both rows and columns of A are to be scaled, the scaled version of
(3.2) becomes

with y = D’1 x,
(D1 AD2 )y = D1 b 2

having also assumed that D2 is invertible. Matrix D1 scales the equations
while D2 scales the unknowns. Notice that, to prevent rounding errors, the
scaling matrices are chosen in the form

D1 = diag(β r1 , . . . , β rn ), D2 = diag(β c1 , . . . , β cn ),

where β is the base of the used ¬‚oating-point arithmetic and the exponents
r1 , . . . , rn , c1 , . . . , cn must be determined. It can be shown that

D’1 (x ’ x) ∞
uK∞ (D1 AD2 ).
D’1 x ∞

Therefore, scaling will be e¬ective if K∞ (D1 AD2 ) is much less than K∞ (A).
Finding convenient matrices D1 and D2 is not in general an easy matter.
A strategy consists, for instance, of picking up D1 and D2 in such a way
that D1 AD2 ∞ and D1 AD2 1 belong to the interval [1/β, 1], where β is
the base of the used ¬‚oating-point arithmetic (see [McK62] for a detailed
analysis in the case of the Crout factorization).
3.12 Improving the Accuracy of GEM 111

Remark 3.7 (The Skeel condition number) The Skeel condition num-
ber, de¬ned as cond(A) = |A’1 | |A| ∞ , is the supremum over the set
x∈ Rn , with x = 0, of the numbers

|A’1 | |A| |x| ∞
cond(A, x) = .

Unlike what happens for K(A), cond(A,x) is invariant with respect to a
scaling by rows of A, that is, to transformations of A of the form DA, where
D is a nonsingular diagonal matrix. As a consequence, cond(A) provides a
sound indication of the ill-conditioning of a matrix, irrespectively of any
possible row diagonal scaling.

3.12.2 Iterative Re¬nement
Iterative re¬nement is a technique for improving the accuracy of a solution
yielded by a direct method. Suppose that the linear system (3.2) has been
solved by means of LU factorization (with partial or complete pivoting),
and denote by x(0) the computed solution. Having ¬xed an error tolerance,
toll, the iterative re¬nement performs as follows: for i = 0, 1, . . . , until

1. compute the residual r(i) = b ’ Ax(i) ;

2. solve the linear system Az = r(i) using the LU factorization of A;

3. update the solution setting x(i+1) = x(i) + z;

4. if z / x(i+1) < toll, then terminate the process returning the solu-
tion x(i+1) . Otherwise, the algorithm restarts at step 1.

In absence of rounding errors, the process would stop at the ¬rst step,
yielding the exact solution. The convergence properties of the method
can be improved by computing the residual r(i) in double precision, while
computing the other quantities in single precision. We call this procedure
mixed-precision iterative re¬nement (shortly, MPR), as compared to ¬xed-
precision iterative re¬nement (FPR).
It can be shown that, if |A’1 | |L| |U| ∞ is su¬ciently small, then at
each step i of the algorithm, the relative error x’x(i) ∞ / x ∞ is reduced
by a factor ρ, which is given by

ρ 2 n cond(A, x)u (FPR),
ρ (MPR),

where ρ is independent of the condition number of A in the case of MPR.
Slow convergence of FPR is a clear indication of the ill-conditioning of the
112 3. Direct Methods for the Solution of Linear Systems

matrix, as it can be shown that, if p is the number of iterations for the
method to converge, then K∞ (A) β t(1’1/p) .
Even if performed in ¬xed precision, iterative re¬nement is worth using
since it improves the overall stability of any direct method for solving the
system. We refer to [Ric81], [Ske80], [JW77] [Ste73], [Wil63] and [CMSW79]
for an overview of this subject.

3.13 Undetermined Systems
We have seen that the solution of the linear system Ax=b exists and is
unique if n = m and A is nonsingular. In this section we give a meaning
to the solution of a linear system both in the overdetermined case, where
m > n, and in the underdetermined case, corresponding to m < n. We
notice that an underdetermined system generally has no solution unless
the right side b is an element of range(A).
For a detailed presentation, we refer to [LH74], [GL89] and [Bj¨88].

Given A∈ R with m ≥ n, b∈ R , we say that x ∈ R is a solution
m—n m n

of the linear system Ax=b in the least-squares sense if

¦(x— ) = Ax— ’ b ¤ min Ax ’ b
2 2
= min ¦(x). (3.73)
2 2
n n
x∈R x∈R

The problem thus consists of minimizing the Euclidean norm of the resid-
ual. The solution of (3.73) can be found by imposing the condition that the
gradient of the function ¦ in (3.73) must be equal to zero at x— . From

¦(x) = (Ax ’ b)T (Ax ’ b) = xT AT Ax ’ 2xT AT b + bT b,

we ¬nd that

∇¦(x— ) = 2AT Ax— ’ 2AT b = 0,

from which it follows that x— must be the solution of the square system

AT Ax— = AT b (3.74)

known as the system of normal equations. The system is nonsingular if
A has full rank and in such a case the least-squares solution exists and
is unique. We notice that B = AT A is a symmetric and positive de¬nite
matrix. Thus, in order to solve the normal equations, one could ¬rst com-
pute the Cholesky factorization B = HT H and then solve the two systems
HT y = AT b and Hx— = y. However, due to roundo¬ errors, the com-
putation of AT A may be a¬ected by a loss of signi¬cant digits, with a
consequent loss of positive de¬niteness or nonsingularity of the matrix, as
happens in the following example (implemented in MATLAB) where for a
3.13 Undetermined Systems 113

the corresponding matrix f l(AT A) turns out to
matrix A with full rank,
be singular
® 
1 1
1 1
A = ° 2’27 » , f l(AT A) =
0 .
1 1
Therefore, in the case of ill-conditioned matrices it is more convenient to
utilize the QR factorization introduced in Section 3.4.3. Indeed, the follow-
ing result holds.

Theorem 3.8 Let A ∈ Rm—n , with m ≥ n, be a full rank matrix. Then
the unique solution of (3.73) is given by

x— = R’1 QT b
˜˜ (3.75)
˜ ˜
where R ∈ Rn—n and Q ∈ Rm—n are the matrices de¬ned in (3.48) starting
from the QR factorization of A. Moreover, the minimum of ¦ is given by

[(QT b)i ]2 .
¦(x ) =

Proof. The QR factorization of A exists and is unique since A has full rank.
Thus, there exist two matrices, Q∈ Rm—m and R∈ Rm—n such that A=QR, where
Q is orthogonal. Since orthogonal matrices preserve the Euclidean scalar product
(see Property 1.8), it follows that

Ax ’ b = Rx ’ QT b 2 .
2 2

Recalling that R is upper trapezoidal, we have
˜ ˜
Rx ’ Q b = Rx ’ QT b
T 2 2
[(QT b)i ]2 ,
2 2

so that the minimum is achieved when x = x— . 3

For more details about the analysis of the computational cost the algo-
rithm (which depends on the actual implementation of the QR factoriza-
tion), as well as for results about its stability, we refer the reader to the
texts quoted at the beginning of the section.
If A does not have full rank, the solution techniques above fail, since in
this case if x— is a solution to (3.73), the vector x— + z, with z ∈ ker(A), is
a solution too. We must therefore introduce a further constraint to enforce
the uniqueness of the solution. Typically, one requires that x— has minimal
Euclidean norm, so that the least-squares problem can be formulated as
¬nd x— ∈ Rn with minimal Euclidean norm such that
Ax— ’ b ¤ min Ax ’ b 2 .
2 n
114 3. Direct Methods for the Solution of Linear Systems

This problem is consistent with (3.73) if A has full rank, since in this case
(3.73) has a unique solution which necessarily must have minimal Euclidean
The tool for solving (3.76) is the singular value decomposition (or SVD,
see Section 1.9), for which the following theorem holds.

Theorem 3.9 Let A ∈ Rm—n with SVD given by A = UΣVT . Then the
unique solution to (3.76) is

x— = A† b (3.77)

where A† is the pseudo-inverse of A introduced in De¬nition 1.15.
Proof. Using the SVD of A, problem (3.76) is equivalent to ¬nding w = VT x
such that w has minimal Euclidean norm and

Σw ’ UT b ¤ Σy ’ UT b 2 , ∀y ∈ Rn .
2 2

If r is the number of nonzero singular values σi of A, then
r m
2 2
Σw ’ U b σi wi ’ (U b)i
T 2 T
(UT b)i
= + ,
i=1 i=r+1

which is minimum if wi = (UT b)i /σi for i = 1, . . . , r. Moreover, it is clear that
among the vectors w of Rn having the ¬rst r components ¬xed, the one with
minimal Euclidean norm has the remaining n ’ r components equal to zero.
Thus the solution vector is w— = Σ† UT b, that is, x— = VΣ† UT b = A† b, where
Σ† is the diagonal matrix de¬ned in (1.11). 3

As for the stability of problem (3.76), we point out that if the matrix
A does not have full rank, the solution x— is not necessarily a continuous
function of the data, so that small changes on these latter might produce
large variations in x— . An example of this is shown below.

Example 3.11 Consider the system Ax = b with
®  ® 
10 1
A = ° 0 0 » , b = ° 2 » , rank(A) = 1.
00 3

Using the MATLAB function svd we can compute the SVD of A. Then computing
the pseudo-inverse, one ¬nds the solution vector x— = (1, 0)T . If we perturb the
null entry a22 , with the value 10’12 , the perturbed matrix has (full) rank 2
and the solution (which is unique in the sense of (3.73)) is now given by x— =
1, 2 · 1012 . •

We refer the reader to Section 5.8.3 for the approximate computation of
the SVD of a matrix.
3.14 Applications 115

In the case of underdetermined systems, for which m < n, if A has full
rank the QR factorization can still be used. In particular, when applied
to the transpose matrix AT , the method yields the solution of minimal
euclidean norm. If, instead, the matrix has not full rank, one must resort
to SVD.

Remark 3.8 If m = n (square system), both SVD and QR factorization
can be used to solve the linear system Ax=b, as alternatives to GEM.
Even though these algorithms require a number of ¬‚ops far superior to
GEM (SVD, for instance, requires 12n3 ¬‚ops), they turn out to be more
accurate when the system is ill-conditioned and nearly singular.

Example 3.12 Compute the solution to the linear system H15 x=b, where H15
is the Hilbert matrix of order 15 (see (3.32)) and the right side is chosen in
such a way that the exact solution is the unit vector x = 1. Using GEM with
partial pivoting yields a solution a¬ected by a relative error larger than 100%. A
solution of much better quality is obtained by passing through the computation
of the pseudo-inverse, where the entries in Σ that are less than 10’13 are set

equal to zero.

3.14 Applications
In this section we present two problems, suggested by structural mechan-
ics and grid generation in ¬nite element analysis, whose solutions require
solving large linear systems.

3.14.1 Nodal Analysis of a Structured Frame
Let us consider a structured frame which is made by rectilinear beams con-
nected among them through hinges (referred to as the nodes) and suitably
constrained to the ground. External loads are assumed to be applied at
the nodes of the frame and for any beam in the frame the internal actions
amount to a unique force of constant strength and directed as the beam
itself. If the normal stress acting on the beam is a traction we assume
that it has positive sign, otherwise the action has negative sign. Structured
frames are frequently employed as covering structures for large size public
buildings like exhibition stands, railway stations or airport halls.
To determine the internal actions in the frame, that are the unknowns
of the mathematical problem, a nodal analysis is used (see [Zie77]): the
equilibrium with respect to translation is imposed at every node of the
frame yielding a sparse and large-size linear system. The resulting matrix
has a sparsity pattern which depends on the numbering of the unknowns
and that can strongly a¬ect the computational e¬ort of the LU factorization
116 3. Direct Methods for the Solution of Linear Systems

due to ¬ll-in. We will show that the ¬ll-in can be dramatically reduced by
a suitable reordering of the unknowns.

The structure shown in Figure 3.9 is arc-shaped and is symmetric with
respect to the origin. The radii r and R of the inner and outer circles are
equal to 1 and 2, respectively. An external vertical load of unit size directed
downwards is applied at (0, 1) while the frame is constrained to ground
through a hinge at (’(r + R), 0) and a bogie at (r + R, 0). To generate
the structure we have partitioned the half unit circle in nθ uniform slices,
resulting in a total number of n = 2(nθ + 1) nodes and a matrix size of
m = 2n. The structure in Figure 3.9 has nθ = 7 and the unknowns are
numbered following a counterclockwise labeling of the beams starting from
the node at (1, 0).
We have represented the structure along with the internal actions com-
puted by solving the nodal equilibrium equations where the width of the
beams is proportional to the strength of the computed action. Black is
used to identify tractions whereas gray is associated with compressions. As
expected the maximum traction stress is attained at the node where the
external load is applied.







’2 ’1.5 ’1 ’0.5 0 0.5 1 1.5 2

FIGURE 3.9. A structured frame loaded at the point (0, 1)

We show in Figure 3.10 the sparsity pattern of matrix A (left) and that
of the L-factor of its LU factorization with partial pivoting (right) in the
case nθ = 40 which corresponds to a size of 164 — 164. Notice the large
¬ll-in e¬ect arising in the lower part of L which results in an increase of
the nonzero entries from 645 (before the factorization) to 1946 (after the
3.14 Applications 117

0 0

20 20

40 40

60 60

80 80

100 100

120 120

140 140

160 160
0 20 40 60 80 100 120 140 160 0 20 40 60 80 100 120 140 160
nz = 645 nz = 1946

FIGURE 3.10. Sparsity pattern of matrix A (left) and of the L-factor of the LU
factorization with partial pivoting (right) in the case nθ = 40

In view of the solution of the linear system by a direct method, the
increase of the nonzero entries demands for a suitable reordering of the
unknowns. For this purpose we use the MATLAB function symrcm which
implements the symmetric reverse Cuthill-McKee algorithm described in
Section 3.9.1. The sparsity pattern, after reordering, is shown in Figure 3.11
(left) while the L-factor of the LU factorization of the reordered matrix
is shown in Figure 3.11 (right). The results indicate that the reordering
procedure has “scattered” the sparsity pattern throughout the matrix with
a relatively modest increase of the nonzero entries from 645 to 1040.

0 0

20 20

40 40

60 60

80 80

100 100

120 120

140 140

160 160
0 20 40 60 80 100 120 140 160 0 20 40 60 80 100 120 140 160
nz = 645 nz = 1040

FIGURE 3.11. Sparsity pattern of matrix A (left) after a reordering with the sym-
metric reverse Cuthill-McKee algorithm and the L-factor of the LU factorization
of the reordered matrix with partial pivoting (right) in the case nθ = 40

The e¬ectiveness of the symmetric reverse Cuthill-McKee reordering pro-
cedure is demonstrated in Figure 3.12 which shows the number of nonzero
entries nz in the L-factor of A as a function of the size m of the matrix
(represented on the x-axis). In the reordered case (solid line) a linear in-
118 3. Direct Methods for the Solution of Linear Systems

crease of nz with m can be clearly appreciated at the expense of a dramatic
¬ll-in growing with m if no reordering is performed (dashed line).

x 10






0 100 200 300 400 500 600 700 800 900 1000

FIGURE 3.12. Number of nonzero entries in the L-factor of A as a function of
the size m of the matrix, with (solid line) and without (dashed line) reordering

3.14.2 Regularization of a Triangular Grid
The numerical solution of a problem in a two-dimensional domain D of
polygonal form, for instance by ¬nite element or ¬nite di¬erence methods,
very often requires that D be decomposed in smaller subdomains, usually
of triangular form (see for instance Section 9.9.2).
T , where Th is the considered triangulation (also
Suppose that D =
T ∈Th
called computational grid) and h is a positive parameter which characterizes
the triangulation. Typically, h denotes the maximum length of the triangle
edges. We shall also assume that two triangles of the grid, T1 and T2 , have
either null intersection or share a vertex or a side.
The geometrical properties of the computational grid can heavily a¬ect the
quality of the approximate numerical solution. It is therefore convenient to
devise a su¬ciently regular triangulation, such that, for any T ∈ Th , the
ratio between the maximum length of the sides of T (the diameter of T )
and the diameter of the circle inscribed within T (the sphericity of T ) is
bounded by a constant independent of T . This latter requirement can be
satis¬ed employing a regularization procedure, applied to an existing grid.
We refer to [Ver96] for further details on this subject.
Let us assume that Th contains NT triangles and N vertices, of which Nb ,
lying on the boundary ‚D of D, are kept ¬xed and having coordinates
(‚D) (‚D) (‚D)
). We denote by Nh the set of grid nodes, excluding
xi = (xi , yi
the boundary nodes, and for each node xi = (xi , yi )T ∈ Nh , let Pi and Zi
respectively be the set of triangles T ∈ Th sharing xi (called the patch of
3.14 Applications 119




FIGURE 3.13. An example of a decomposition into triangles of a polygonal do-
main D (left), and the e¬ect of the barycentric regularization on a patch of
triangles (right). The newly generated grid is plotted in dashed line

xi ) and the set of nodes of Pi except node xi itself (see Figure 3.13, right).
We let ni = dim(Zi ).
The regularization procedure consists of moving the generic node xi to
a new position which is determined by the center of gravity of the polygon
generated by joining the nodes of Zi , and for that reason it is called a
barycentric regularization. The e¬ect of such a procedure is to force all the
triangles that belong to the interior of the domain to assume a shape that
is as regular as possible (in the limit, each triangle should be equilateral).
In practice, we let
« 

xi =  xj  /ni ,
∀xi ∈ Nh , if xi ∈ ‚D.
xi = xi
xj ∈Zi

Two systems must then be solved, one for the x-components {xi } and the
other for the y-components {yi }. Denoting by zi the generic unknown, the
i-th row of the system, in the case of internal nodes, reads

ni zi ’ ∀i ∈ Nh ,
zj = 0, (3.78)
zj ∈Zi

while for the boundary nodes the identities zi = zi hold. Equations
(3.78) yield a system of the form Az = b, where A is a symmetric and pos-
itive de¬nite matrix of order N ’Nb which can be shown to be an M-matrix
(see Section 1.12). This property ensures that the new grid coordinates sat-
isfy minimum and maximum discrete principles, that is, they take a value
which is between the minimum and the maximum values attained on the


. 5
( 26)