98 3. Direct Methods for the Solution of Linear Systems

As previously noted, during the factorization procedure nonzero entries can be generated in memory positions that correspond to zero entries in the starting matrix. This phenomenon is referred to as fill-in. Figure 3.4 shows the effect of fill-in on the sparse matrix whose pattern is shown in Figure 3.3.

Since the use of pivoting in the factorization process makes things even more complicated, we shall only consider the case of symmetric positive definite matrices, for which pivoting is not necessary.

A first remarkable result concerns the amount of fill-in. Let mi(A) = i − min{j < i : aij ≠ 0} and denote by E(A) the convex hull of A, given by

\[
E(A) = \{(i, j) : 0 < i - j \le m_i(A)\}. \qquad (3.59)
\]
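To make the definitions concrete, the following is a small illustrative sketch (plain Python; the function name `envelope` and the example pattern are invented here, not taken from the book) computing mi(A) and the subdiagonal part of E(A) for a symmetric sparsity pattern, with indices reported 1-based as in the text:

```python
# Illustrative sketch: m_i(A) = i - min{ j < i : a_ij != 0 } and
# E(A) = {(i, j) : 0 < i - j <= m_i(A)}, built from a Boolean pattern.

def envelope(pattern):
    """pattern[i][j] is nonzero where a_ij != 0 (0-based storage)."""
    n = len(pattern)
    m = []
    for i in range(n):
        cols = [j for j in range(i) if pattern[i][j]]
        # m_i = 0 when row i has no nonzero to the left of the diagonal
        m.append(i - min(cols) if cols else 0)
    # subdiagonal positions of the convex hull, reported 1-based
    E = {(i + 1, j + 1) for i in range(n)
         for j in range(i) if 0 < i - j <= m[i]}
    return m, E

# a 4x4 symmetric pattern with an isolated nonzero in position (4,1)
P = [[1, 0, 0, 1],
     [0, 1, 1, 0],
     [0, 1, 1, 0],
     [1, 0, 0, 1]]
m, E = envelope(P)
print(m)          # [0, 0, 1, 3]
print(sorted(E))  # (4,2) and (4,3) are zero entries captured by the hull
```

Note that positions (4, 2) and (4, 3) lie inside E(A) although they are zero in A: these are exactly the positions where fill-in may occur.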

For a symmetric positive definite matrix,

\[
E(A) = E(H + H^T) \qquad (3.60)
\]

where H is the Cholesky factor, so that fill-in is confined within the convex hull of A (see Figure 3.4). Moreover, if we denote by lk(A) the number of active rows at the k-th step of the factorization (i.e., the number of rows of A with i > k and aik ≠ 0), the computational cost of the factorization process is

\[
\frac{1}{2} \sum_{k=1}^{n} l_k(A)\,\bigl(l_k(A) + 3\bigr) \ \text{flops}, \qquad (3.61)
\]

having accounted for all the nonzero entries of the convex hull. Confinement of fill-in within E(A) ensures that the LU factorization of A can be stored without extra memory areas, simply by storing all the entries of E(A) (including the null elements). However, such a procedure might still be highly inefficient due to the large number of zero entries in the hull (see Exercise 11).
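The cost estimate (3.61) can be evaluated directly from a sparsity pattern. The sketch below (plain Python; the function name and sample patterns are ours) uses the literal definition of lk(A) given above:

```python
# Illustrative sketch of (3.61): l_k(A) counts the rows i > k with
# a_ik != 0, and the cost is (1/2) * sum_k l_k(A) * (l_k(A) + 3) flops.
# Note l_k (l_k + 3) is always even, so integer division is exact.

def factorization_cost(pattern):
    n = len(pattern)
    cost = 0
    for k in range(n):
        lk = sum(1 for i in range(k + 1, n) if pattern[i][k])
        cost += lk * (lk + 3)
    return cost // 2

# dense symmetric 4x4 pattern: l_k = 3, 2, 1, 0
dense = [[1] * 4 for _ in range(4)]
print(factorization_cost(dense))   # (3*6 + 2*5 + 1*4) / 2 = 16

# tridiagonal 4x4 pattern: l_k = 1, 1, 1, 0
tri = [[1 if abs(i - j) <= 1 else 0 for j in range(4)] for i in range(4)]
print(factorization_cost(tri))     # (4 + 4 + 4) / 2 = 6
```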

On the other hand, from (3.60) one gets that a reduction of the convex hull reflects a reduction of fill-in and, in turn, due to (3.61), of the number of operations needed to perform the factorization. For this reason, several strategies for reordering the graph of the matrix have been devised. Among them, we recall the Cuthill-McKee method, which will be addressed in the next section.

An alternative consists of decomposing the matrix into sparse submatrices, with the aim of reducing the original problem to the solution of subproblems of reduced size, where the matrices can be stored in full format. This approach leads to the submatrix decomposition methods which will be addressed in Section 3.9.2.

3.9.1 The Cuthill-McKee Algorithm

The Cuthill-McKee algorithm is a simple and effective method for reordering the system variables.


FIGURE 3.4. The shaded regions in the left figure show the areas of the matrix that can be affected by fill-in, for the matrix considered in Figure 3.3. Solid lines denote the boundary of E(A). The right figure displays the factors that have actually been computed. Black dots denote the elements of A that were originally equal to zero.

The first step of the algorithm consists of associating with each vertex of the graph the number of its connections with neighboring vertices, called the degree of the vertex. Next, the following steps are taken:

1. a vertex with a low number of connections is chosen as the first vertex of the graph;

2. the vertices connected to it are progressively re-labeled, starting from those having lower degrees;

3. the procedure is repeated starting from the vertices connected to the second vertex in the updated list. The nodes already re-labeled are ignored. Then a third new vertex is considered, and so on, until all the vertices have been explored.

The usual way to improve the efficiency of the algorithm is based on the so-called reverse form of the Cuthill-McKee method. This consists of executing the Cuthill-McKee algorithm described above and then, at the end, moving the i-th vertex into the (n − i + 1)-th position of the list, n being the number of nodes in the graph. Figure 3.5 reports, for comparison, the graphs obtained using the direct and reverse Cuthill-McKee reordering in the case of the matrix pattern represented in Figure 3.3, while in Figure 3.6 the factors L and U are compared. Notice the absence of fill-in when the reverse Cuthill-McKee method is used.
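The steps above, together with the reverse variant, can be sketched as follows. This is an illustrative Python implementation, not the book's: the tie-breaking rule (by node index) and the small example graph are assumptions of this sketch.

```python
# Breadth-first relabeling as in steps 1-3, neighbours visited by
# increasing degree; the reverse variant simply flips the final list.
from collections import deque

def cuthill_mckee(adj):
    """adj: dict node -> set of neighbours. Returns the new ordering."""
    degree = {v: len(adj[v]) for v in adj}
    visited, order = set(), []
    # start from a vertex of minimal degree (step 1)
    for start in sorted(adj, key=lambda v: (degree[v], v)):
        if start in visited:
            continue
        visited.add(start)
        queue = deque([start])
        while queue:
            v = queue.popleft()
            order.append(v)
            # relabel unvisited neighbours by increasing degree (steps 2-3)
            for w in sorted(adj[v] - visited, key=lambda u: (degree[u], u)):
                visited.add(w)
                queue.append(w)
    return order

def reverse_cuthill_mckee(adj):
    return list(reversed(cuthill_mckee(adj)))

# a small path-plus-branch graph
adj = {1: {2}, 2: {1, 3, 4}, 3: {2}, 4: {2, 5}, 5: {4}}
print(cuthill_mckee(adj))           # [1, 2, 3, 4, 5]
print(reverse_cuthill_mckee(adj))   # [5, 4, 3, 2, 1]
```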

Remark 3.5 For an efficient solution of linear systems with sparse matrices, we mention the public domain libraries SPARSKIT [Saa90], UMFPACK [DD95] and the MATLAB sparfun package.


FIGURE 3.5. Reordered graphs using the direct (left) and reverse (right) Cuthill-McKee algorithm. The label of each vertex, before reordering is performed, is reported in braces.


FIGURE 3.6. Factors L and U after the direct (left) and reverse (right) Cuthill-McKee reordering. In the second case, fill-in is absent.

3.9.2 Decomposition into Substructures

These methods have been developed in the framework of the numerical approximation of partial differential equations. Their basic strategy consists of splitting the solution of the original linear system into subsystems of smaller size which are almost independent from each other, and it can easily be interpreted as a reordering technique.

We describe these methods through a special example, referring to [BSG96] for a more comprehensive presentation. Consider the linear system Ax=b, where A is a symmetric positive definite matrix whose pattern is shown in Figure 3.3. To help develop an intuitive understanding of the method, we draw the graph of A in the form shown in Figure 3.7.


We then partition the graph of A into the two subgraphs (or substructures) identified in the figure and denote by xk, k = 1, 2, the vectors of the unknowns relative to the nodes that belong to the interior of the k-th substructure. We also denote by x3 the vector of the unknowns that lie along the interface between the two substructures. Referring to the decomposition in Figure 3.7, we have x1 = (2, 3, 4, 6)^T, x2 = (8, 9, 10, 11, 12)^T and x3 = (1, 5, 7)^T.

As a result of the decomposition of the unknowns, the matrix A is partitioned into blocks, so that the linear system can be written in the form


FIGURE 3.7. Decomposition into two substructures

\[
\begin{bmatrix} A_{11} & 0 & A_{13} \\ 0 & A_{22} & A_{23} \\ A_{13}^T & A_{23}^T & A_{33} \end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} =
\begin{bmatrix} b_1 \\ b_2 \\ b_3 \end{bmatrix},
\]

having reordered the unknowns and partitioned the right hand side of the system accordingly. Suppose that A33 is decomposed into two parts, A'33 and A''33, which represent the contributions to A33 of each substructure. Similarly, let the right hand side b3 be decomposed as b'3 + b''3. The original linear system is now equivalent to the following pair

\[
\begin{bmatrix} A_{11} & A_{13} \\ A_{13}^T & A_{33}' \end{bmatrix}
\begin{bmatrix} x_1 \\ x_3 \end{bmatrix} =
\begin{bmatrix} b_1 \\ b_3' + \gamma_3 \end{bmatrix},
\qquad
\begin{bmatrix} A_{22} & A_{23} \\ A_{23}^T & A_{33}'' \end{bmatrix}
\begin{bmatrix} x_2 \\ x_3 \end{bmatrix} =
\begin{bmatrix} b_2 \\ b_3'' - \gamma_3 \end{bmatrix},
\]

having denoted by γ3 a vector that takes into account the coupling between the substructures. A typical way of proceeding in decomposition techniques consists of eliminating γ3 so as to end up with independent systems, one for each


substructure. Let us apply this strategy to the example at hand. The linear system for the first substructure is

\[
\begin{bmatrix} A_{11} & A_{13} \\ A_{13}^T & A_{33}' \end{bmatrix}
\begin{bmatrix} x_1 \\ x_3 \end{bmatrix} =
\begin{bmatrix} b_1 \\ b_3' + \gamma_3 \end{bmatrix}. \qquad (3.62)
\]

Let us now factorize A11 as H11^T H11 and proceed with the reduction method already described in Section 3.8.3 for block tridiagonal matrices. We obtain the system

\[
\begin{bmatrix} H_{11} & H_{21} \\ 0 & A_{33}' - H_{21}^T H_{21} \end{bmatrix}
\begin{bmatrix} x_1 \\ x_3 \end{bmatrix} =
\begin{bmatrix} c_1 \\ b_3' + \gamma_3 - H_{21}^T c_1 \end{bmatrix},
\]

where H21 = H11^{-T} A13 and c1 = H11^{-T} b1. The second equation of this system yields γ3 explicitly as

\[
\gamma_3 = \bigl(A_{33}' - H_{21}^T H_{21}\bigr)\, x_3 - b_3' + H_{21}^T c_1.
\]

Substituting this expression into the system for the second substructure, one ends up with a system in the unknowns x2 and x3 only,

\[
\begin{bmatrix} A_{22} & A_{23} \\ A_{23}^T & \tilde{A}_{33} \end{bmatrix}
\begin{bmatrix} x_2 \\ x_3 \end{bmatrix} =
\begin{bmatrix} b_2 \\ \tilde{b}_3 \end{bmatrix}, \qquad (3.63)
\]

where Ã33 = A33 − H21^T H21 and b̃3 = b3 − H21^T c1. Once (3.63) has been solved, it will be possible, by backsubstitution into (3.62), to compute x1 as well.
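The elimination just described can be sketched numerically. The following plain-Python example uses tiny invented blocks (two interior unknowns in substructure I, one in substructure II, one interface unknown); the helper names `chol_upper` and `forward` are ours, not the book's:

```python
# Sketch of the reduction: factor A11 = H11^T H11, set H21 = H11^{-T} A13
# and c1 = H11^{-T} b1, assemble (3.63) with A33~ = A33 - H21^T H21 and
# b3~ = b3 - H21^T c1, then recover x1 by backsubstitution.

def chol_upper(A):
    """Upper Cholesky factor H with A = H^T H."""
    n = len(A)
    H = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            s = A[i][j] - sum(H[k][i] * H[k][j] for k in range(i))
            H[i][j] = s ** 0.5 if i == j else s / H[i][i]
    return H

def forward(L, b):
    """Solve L y = b with L lower triangular."""
    y = []
    for i, bi in enumerate(b):
        y.append((bi - sum(L[i][j] * y[j] for j in range(i))) / L[i][i])
    return y

# invented data: interior I (2 unknowns), interior II (1), interface (1)
A11 = [[4.0, 2.0], [2.0, 3.0]]; a13 = [2.0, 0.0]
a22, a23 = 2.0, 1.0
a33 = 6.0                                   # = a33' + a33''
b1, b2, b3 = [8.0, 5.0], 3.0, 9.0           # b3 = b3' + b3''

H11 = chol_upper(A11)
H11T = [[H11[j][i] for j in range(2)] for i in range(2)]    # lower triangular
h21 = forward(H11T, a13)                    # H11^{-T} a13
c1 = forward(H11T, b1)                      # H11^{-T} b1
a33t = a33 - sum(h * h for h in h21)        # A33~ (gamma3 eliminated)
b3t = b3 - sum(h * c for h, c in zip(h21, c1))

# reduced 2x2 system (3.63) in (x2, x3), solved by Cramer's rule
det = a22 * a33t - a23 * a23
x2 = (b2 * a33t - a23 * b3t) / det
x3 = (a22 * b3t - a23 * b2) / det

# backsubstitution in (3.62): H11 x1 = c1 - H21 x3
x1b = (c1[1] - h21[1] * x3) / H11[1][1]
x1a = (c1[0] - h21[0] * x3 - H11[0][1] * x1b) / H11[0][0]
print([x1a, x1b], x2, x3)   # all components equal 1 for this data
```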

The technique described above can easily be extended to the case of several substructures, and its efficiency increases the more the substructures are mutually independent. It reproduces in nuce the so-called frontal method (introduced by Irons [Iro70]), which is quite popular in the solution of finite element systems (for an implementation, we refer to the UMFPACK library [DD95]).

Remark 3.6 (The Schur complement) An approach that is dual to the above method consists of reducing the starting system to a system acting only on the interface unknowns x3, passing through the assembly of the Schur complement of the matrix A, defined in the 3×3 block case at hand as

\[
S = A_{33} - A_{13}^T A_{11}^{-1} A_{13} - A_{23}^T A_{22}^{-1} A_{23}.
\]

The original problem is thus equivalent to the system

\[
S x_3 = b_3 - A_{13}^T A_{11}^{-1} b_1 - A_{23}^T A_{22}^{-1} b_2.
\]

This system is full (even if the matrices Aij are sparse) and can be solved using either a direct or an iterative method, provided that a suitable preconditioner is available. Once x3 has been computed, one can get x1 and


x2 by solving two systems of reduced size, whose matrices are A11 and A22, respectively.

We also notice that if the block matrix A is symmetric and positive definite, then the linear system on the Schur complement S is no more ill-conditioned than the original system on A, since

\[
K_2(S) \le K_2(A)
\]

(for a proof, see Lemma 3.12 of [Axe94]; see also [CM94] and [QV99]).
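A minimal numerical sketch of the Schur-complement approach follows (plain Python; the block values and the 2x2 helper `solve2` are invented for illustration, and the interface is kept scalar so that the interface system is a single equation):

```python
# Assemble S, solve the interface equation for x3, then recover x1 and
# x2 from the two decoupled systems with matrices A11 and A22.

def solve2(A, b):
    """Cramer's rule for a 2x2 system."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [(b[0] * A[1][1] - A[0][1] * b[1]) / det,
            (A[0][0] * b[1] - b[0] * A[1][0]) / det]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# invented blocks: two 2x2 interior blocks, one interface unknown x3
A11 = [[4.0, 1.0], [1.0, 3.0]]; a13 = [1.0, 0.0]
A22 = [[3.0, 1.0], [1.0, 2.0]]; a23 = [0.0, 1.0]
a33 = 5.0
b1, b2, b3 = [6.0, 4.0], [5.0, 4.0], 7.0

w1 = solve2(A11, a13)          # A11^{-1} a13
w2 = solve2(A22, a23)          # A22^{-1} a23
S = a33 - dot(a13, w1) - dot(a23, w2)               # Schur complement
rhs = b3 - dot(a13, solve2(A11, b1)) - dot(a23, solve2(A22, b2))
x3 = rhs / S                                        # interface solve
x1 = solve2(A11, [b1[0] - a13[0] * x3, b1[1] - a13[1] * x3])
x2 = solve2(A22, [b2[0] - a23[0] * x3, b2[1] - a23[1] * x3])
print(x1, x2, x3)
```

The recovered (x1, x2, x3) satisfies the original 5×5 system exactly (up to roundoff), which is easy to check by computing the residual.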

3.9.3 Nested Dissection

This is a renumbering technique quite similar to substructuring. In practice, it consists of repeating the decomposition process several times at each substructure level, until the size of each single block is made sufficiently small. In Figure 3.8 a possible nested dissection is shown in the case of the matrix considered in the previous section. Once the subdivision procedure has been completed, the vertices are renumbered starting with the nodes belonging to the latest substructuring level and moving progressively up to the first level. In the example at hand, the new node ordering is 11, 9, 7, 6, 12, 8, 4, 2, 1, 5, 3.

This procedure is particularly effective if the problem has a large size and the substructures have few connections between them or exhibit a repetitive pattern [Geo73].

3.10 Accuracy of the Solution Achieved Using GEM

Let us analyze the effects of rounding errors on the accuracy of the solution yielded by GEM. Suppose that A and b are a matrix and a vector of floating-point numbers. Denoting by L and U, respectively, the matrices of the LU factorization induced by GEM and computed in floating-point arithmetic, the solution x̂ yielded by GEM can be regarded as being the solution (in exact arithmetic) of the perturbed system (A + δA)x̂ = b, where δA is a perturbation matrix such that

\[
|\delta A| \le n u \,\bigl(3|A| + 5\,|L|\,|U|\bigr) + O(u^2), \qquad (3.64)
\]

where u is the roundoff unit and the matrix absolute value notation has been used (see [GL89], Section 3.4.6). As a consequence, the entries of δA will be small in size if the entries of L and U are small. Using partial pivoting allows for bounding the modulus of the entries of L by 1, so that, passing to the infinity norm and noting that ‖L‖∞ ≤ n, the


FIGURE 3.8. Two steps of nested dissection. Graph partitioning (left) and matrix

reordering (right)

estimate (3.64) becomes

\[
\|\delta A\|_\infty \le n u \,\bigl(3\,\|A\|_\infty + 5 n\,\|U\|_\infty\bigr) + O(u^2). \qquad (3.65)
\]

The bound for ‖δA‖∞ in (3.65) is of practical use only if it is possible to provide an estimate for ‖U‖∞. With this aim, a backward analysis can be carried out by introducing the so-called growth factor

\[
\rho_n = \frac{\displaystyle\max_{i,j,k} |a_{ij}^{(k)}|}{\displaystyle\max_{i,j} |a_{ij}|}. \qquad (3.66)
\]

Taking advantage of the fact that |uij| ≤ ρn max_{i,j} |aij|, the following result due to Wilkinson can be drawn from (3.65):

\[
\|\delta A\|_\infty \le 8\, u\, n^3 \rho_n \,\|A\|_\infty + O(u^2). \qquad (3.67)
\]

The growth factor can be bounded by 2^{n−1} and, although in most cases it is of the order of 10, there exist matrices for which the inequality in (3.67) becomes an equality (see, for instance, Exercise 5). For some special classes of matrices, a sharp bound for ρn can be found:


1. for banded matrices with upper and lower bands equal to p, ρn ≤ 2^{2p−1} − (p − 1)2^{p−2}. As a consequence, in the tridiagonal case one gets ρn ≤ 2;

2. for Hessenberg matrices, ρn ≤ n;

3. for symmetric positive definite matrices, ρn = 1;

4. for matrices strictly diagonally dominant by columns, ρn ≤ 2.
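The growth factor (3.66) is easy to monitor during elimination. The sketch below (plain Python, our own helper) computes it under partial pivoting; the test matrix is the classical worst-case example for which ρn attains the bound 2^{n−1}:

```python
# Gaussian elimination with partial pivoting, tracking the largest
# intermediate entry |a_ij^{(k)}| to evaluate rho_n as in (3.66).

def growth_factor(A):
    n = len(A)
    M = [list(map(float, row)) for row in A]
    max0 = max(abs(x) for row in M for x in row)
    peak = max0
    for k in range(n - 1):
        p = max(range(k, n), key=lambda i: abs(M[i][k]))  # partial pivoting
        M[k], M[p] = M[p], M[k]
        for i in range(k + 1, n):
            f = M[i][k] / M[k][k]
            for j in range(k, n):
                M[i][j] -= f * M[k][j]
        peak = max(peak, max(abs(x) for row in M for x in row))
    return peak / max0

# worst-case matrix: rho_4 = 2^3 = 8
W = [[1, 0, 0, 1], [-1, 1, 0, 1], [-1, -1, 1, 1], [-1, -1, -1, 1]]
print(growth_factor(W))                      # 8.0
print(growth_factor([[2.0, 1.0], [1.0, 2.0]]))  # 1.0 (SPD case)
```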

To achieve better stability when using GEM for arbitrary matrices, resorting to complete pivoting would seem to be mandatory, since it ensures that ρn ≤ n^{1/2} (2 · 3^{1/2} · … · n^{1/(n−1)})^{1/2}. Indeed, this growth is slower than 2^{n−1} as n increases.

However, apart from very special instances, GEM with only partial pivoting exhibits acceptable growth factors. This makes it the most commonly employed method in computational practice.

Example 3.7 Consider the linear system (3.2) with

\[
A = \begin{bmatrix} \mu & 1 \\ 1 & 0 \end{bmatrix}, \qquad
b = \begin{bmatrix} 1 + \mu \\ 1 \end{bmatrix}, \qquad (3.68)
\]

which admits the exact solution x = (1, 1)^T for any value of µ. The matrix is well-conditioned, having K∞(A) = (1 + µ)². Attempting to solve the system for µ = 10^{−15} by the LU factorization with 16 significant digits, and using Programs 5, 2 and 3, yields the solution x̂ = [0.8881784197001253, 1.000000000000000]^T, with an error greater than 11% on the first component. Some insight into the causes of the inaccuracy of the computed solution can be drawn from (3.64). Indeed, the latter does not provide a uniformly small bound for all the entries of the matrix δA; rather,

\[
|\delta A| \le \begin{bmatrix} 3.55 \cdot 10^{-30} & 1.33 \cdot 10^{-15} \\ 1.33 \cdot 10^{-15} & 2.22 \end{bmatrix}.
\]

Notice that the entries of the corresponding matrices L and U are quite large in modulus. Conversely, resorting to GEM with partial or complete pivoting yields the exact solution of the system (see Exercise 6). •
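The effect seen in Example 3.7 can be reproduced with a few lines of code. The following is an illustrative reconstruction (plain Python, not the book's Programs 5, 2 and 3): 2×2 elimination on the system (3.68), with and without partial pivoting.

```python
# Without pivoting the multiplier is 1/mu ~ 1e15 and the first solution
# component is badly polluted; swapping the rows first removes the problem.

def gauss2(A, b, pivot):
    A = [row[:] for row in A]; b = b[:]
    if pivot and abs(A[1][0]) > abs(A[0][0]):
        A[0], A[1] = A[1], A[0]
        b[0], b[1] = b[1], b[0]
    f = A[1][0] / A[0][0]          # multiplier
    A[1][1] -= f * A[0][1]
    b[1] -= f * b[0]
    x2 = b[1] / A[1][1]
    x1 = (b[0] - A[0][1] * x2) / A[0][0]
    return [x1, x2]

mu = 1.0e-15
A = [[mu, 1.0], [1.0, 0.0]]
b = [1.0 + mu, 1.0]
print(gauss2(A, b, pivot=False))   # first component visibly wrong
print(gauss2(A, b, pivot=True))    # recovers x = (1, 1)
```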

Let us now address the role of the condition number in the error analysis for GEM. GEM yields a solution x̂ that is typically characterized by a small residual r = b − Ax̂ (see [GL89]). This feature, however, does not ensure that the error x − x̂ is small when K(A) ≫ 1 (see Example 3.8). In fact, if δb in (3.11) is regarded as being the residual, then

\[
\frac{\|x - \hat{x}\|}{\|x\|} \le K(A)\, \frac{1}{\|A\|}\, \frac{\|r\|}{\|x\|} \le K(A)\, \frac{\|r\|}{\|b\|}.
\]

This result will be applied to devise methods, based on a posteriori analysis, for improving the accuracy of the solution yielded by GEM (see Section 3.12).


Example 3.8 Consider the linear system Ax = b with

\[
A = \begin{bmatrix} 1 & 1.0001 \\ 1.0001 & 1 \end{bmatrix}, \qquad
b = \begin{bmatrix} 1 \\ 1 \end{bmatrix},
\]

which admits the solution x = (0.499975…, 0.499975…)^T. Assuming as an approximate solution the vector x̂ = (−4.499775, 5.5002249)^T, one finds the residual r ≈ (−0.001, 0)^T, which is small although x̂ is quite different from the exact solution. The reason for this lies in the ill-conditioning of the matrix A. Indeed, in this case K∞(A) = 20001. •
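The numbers in Example 3.8 are easy to verify directly (plain Python; the exact solution is written in closed form as 1/2.0001 for both components):

```python
# Small residual, large error: a direct check of Example 3.8.

A = [[1.0, 1.0001], [1.0001, 1.0]]
b = [1.0, 1.0]
x_exact = [1.0 / 2.0001, 1.0 / 2.0001]      # = 0.499975...
xhat = [-4.499775, 5.5002249]

r = [b[i] - sum(A[i][j] * xhat[j] for j in range(2)) for i in range(2)]
err = max(abs(xhat[i] - x_exact[i]) for i in range(2))
print(r)     # roughly (-0.001, 0): tiny residual
print(err)   # about 5: xhat is nowhere near x
```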

An estimate of the number of exact significant digits of a numerical solution of a linear system can be given as follows. From (3.13), letting γ = u and assuming that uK∞(A) ≤ 1/2, we get

\[
\frac{\|\delta x\|_\infty}{\|x\|_\infty} \le \frac{2\, u\, K_\infty(A)}{1 - u\, K_\infty(A)} \le 4\, u\, K_\infty(A).
\]

As a consequence,

\[
\frac{\|x - \hat{x}\|_\infty}{\|x\|_\infty} \simeq u\, K_\infty(A). \qquad (3.69)
\]

Assuming that u ≃ β^{−t} and K∞(A) ≃ β^m, one gets that the solution x̂ computed by GEM will have at least t − m exact digits, t being the number of digits available for the mantissa. In other words, the ill-conditioning of a system depends both on the capability of the floating-point arithmetic that is being used and on the accuracy that is required in the solution.

3.11 An Approximate Computation of K(A)

Suppose that the linear system (3.2) has been solved by a factorization method. To determine the accuracy of the computed solution, the analysis carried out in Section 3.10 can be used if an estimate of the condition number K(A) of A, which we denote by K̂(A), is available. Indeed, although evaluating ‖A‖ can be an easy task if a suitable norm is chosen (for instance, ‖·‖1 or ‖·‖∞), it is by no means reasonable (or computationally convenient) to compute A^{−1} if the only purpose is to evaluate ‖A^{−1}‖. For this reason, we describe in this section a procedure (proposed in [CMSW79]) that approximates ‖A^{−1}‖ with a computational cost of the order of n² flops.

The basic idea of the algorithm is as follows: for any d ∈ R^n with d ≠ 0, thanks to the definition of matrix norm, ‖A^{−1}‖ ≥ ‖y‖/‖d‖ = γ(d), where Ay = d. Thus, we look for a d that makes γ(d) as large as possible and take the obtained value as an estimate of ‖A^{−1}‖.

For the method to be effective, the selection of d is crucial. To explain how to do this, we start by assuming that the QR factorization of A has been computed and that K2(A) is to be approximated. In such an event, since K2(A) = K2(R) due to Property 1.8, it suffices to estimate ‖R^{−1}‖2 instead of ‖A^{−1}‖2. Considerations related to the SVD of R suggest approximating ‖R^{−1}‖2 by the following algorithm:

compute the vectors x and y, solutions to the systems

\[
R^T x = d, \qquad R y = x, \qquad (3.70)
\]

then estimate ‖R^{−1}‖2 by the ratio γ2 = ‖y‖2/‖x‖2. The vector d appearing in (3.70) should be determined in such a way that γ2 is as close as possible to the value actually attained by ‖R^{−1}‖2. It can be shown that, except in very special cases, γ2 provides for any choice of d a reasonable (although not very accurate) estimate of ‖R^{−1}‖2 (see Exercise 15). As a consequence, a proper selection of d can encourage this natural trend.

Before going on, it is worth noting that computing K2(R) is not an easy matter even if an estimate of ‖R^{−1}‖2 is available. Indeed, it would remain to compute ‖R‖2 = √ρ(R^T R). To overcome this difficulty, we consider henceforth K1(R) instead of K2(R), since ‖R‖1 is easily computable. Then, a heuristic argument allows us to assume that the ratio γ1 = ‖y‖1/‖x‖1 is an estimate of ‖R^{−1}‖1, exactly as γ2 is an estimate of ‖R^{−1}‖2.

Let us now deal with the choice of d. Since R^T x = d, the generic component xk of x can be formally related to x1, …, x_{k−1} through the formulae of forward substitution as

\[
\begin{aligned}
r_{11} x_1 &= d_1, \\
r_{kk} x_k &= d_k - (r_{1k} x_1 + \ldots + r_{k-1,k} x_{k-1}), \quad k \ge 2.
\end{aligned} \qquad (3.71)
\]

Assume that the components of d are of the form dk = ±θk, where the θk are random numbers, and set arbitrarily d1 = θ1. Then x1 = θ1/r11 is completely determined, while x2 = (d2 − r12 x1)/r22 depends on the sign of d2. We set the sign of d2 opposite to that of r12 x1, in such a way as to make ‖x(1 : 2)‖1 = |x1| + |x2|, for a fixed x1, the largest possible. Once x2 is known, we compute x3 following the same criterion, and so on, until xn.

This approach sets the sign of each component of d and yields a vector x with a presumably large ‖·‖1. However, it can fail, since it is based on the idea (which is in general not true) that maximizing ‖x‖1 can be done by selecting at each step k in (3.71) the component xk which guarantees the maximum increase of ‖x(1 : k − 1)‖1 (without accounting for the fact that all the components are related).

Therefore, we need to modify the method by including a sort of "look-ahead" strategy, which accounts for the way the choice of dk affects all later values xi, with i > k, still to be computed. Concerning this point, we notice that for a generic row i of the system it is always possible to compute at


step k the vector p^{(k−1)} with components

\[
\begin{aligned}
p_i^{(k-1)} &= 0, & i &= 1, \ldots, k-1, \\
p_i^{(k-1)} &= r_{1i} x_1 + \ldots + r_{k-1,i} x_{k-1}, & i &= k, \ldots, n.
\end{aligned}
\]

Thus x_k = (±θ_k − p_k^{(k−1)})/r_{kk}. We denote the two possible values of xk by x_k^+ and x_k^−. The choice between them is now made not only accounting for which of the two most increases ‖x(1 : k)‖1, but also by evaluating the increase of ‖p^{(k)}‖1. This second contribution accounts for the effect of the choice of dk on the components that are still to be computed. We can include both criteria in a unique test. Denoting by

\[
\begin{aligned}
p_i^{(k)+} &= 0, \qquad p_i^{(k)-} = 0, & i &= 1, \ldots, k, \\
p_i^{(k)+} &= p_i^{(k-1)} + r_{ki} x_k^+, \qquad p_i^{(k)-} = p_i^{(k-1)} + r_{ki} x_k^-, & i &= k+1, \ldots, n,
\end{aligned}
\]

the components of the vectors p^{(k)+} and p^{(k)−}, respectively, we set at each k-th step dk = +θk or dk = −θk according to whether |r_{kk} x_k^+| + ‖p^{(k)+}‖1 is greater or less than |r_{kk} x_k^−| + ‖p^{(k)−}‖1.

Under this choice, d is completely determined, and the same holds for x. Now, solving the system Ry = x, we are guaranteed that ‖y‖1/‖x‖1 is a reliable approximation of ‖R^{−1}‖1, so that we can set K̂1(A) = ‖R‖1 ‖y‖1/‖x‖1.

In practice, the PA=LU factorization introduced in Section 3.5 is usually available. Based on the previous considerations and on some heuristics, an analogous procedure to the one shown above can be conveniently employed to approximate ‖A^{−1}‖1. Precisely, instead of the systems (3.70), we must now solve

\[
(LU)^T x = d, \qquad LU\, y = x.
\]

We set ‖y‖1/‖x‖1 as the approximation of ‖A^{−1}‖1 and, consequently, we define K̂1(A). The strategy for selecting d can be the same as before; indeed, solving (LU)^T x = d amounts to solving

\[
U^T z = d, \qquad L^T x = z, \qquad (3.72)
\]

and thus, since U^T is lower triangular, we can proceed as in the previous

case. A remarkable difference concerns the computation of x. Indeed, while the matrix R^T in (3.70) has the same condition number as R, the second system in (3.72) has a matrix L^T which could be even more ill-conditioned than U^T. If this were the case, solving for x could lead to an inaccurate outcome, thus making the whole process useless. Fortunately, resorting to partial pivoting prevents this circumstance from occurring, ensuring that any ill-conditioning in A is reflected in a corresponding ill-conditioning in U. Moreover, picking θk randomly between 1/2 and 1


guarantees accurate results even in the special cases where L turns out to be ill-conditioned.

The algorithm presented below is implemented in the LINPACK library [BDMS79] and in the MATLAB function rcond. This function, in order to avoid rounding errors, returns as output parameter the reciprocal of K̂1(A). A more accurate estimator, described in [Hig88], is implemented in the MATLAB function condest.

Program 14 implements the approximate evaluation of K̂1 for a matrix A of generic form. The input parameters are the size n of the matrix A, the matrix A, the factors L and U of its PA=LU factorization and the vector theta containing the random numbers θk, for k = 1, …, n.

Program 14 - cond_est : Algorithm for the approximation of K1(A)

function [k1] = cond_est(n,A,L,U,theta)
for i=1:n, p(i)=0; end
for k=1:n
  zplus=(theta(k)-p(k))/U(k,k); zminu=(-theta(k)-p(k))/U(k,k);
  splus=abs(theta(k)-p(k)); sminu=abs(-theta(k)-p(k));
  for i=(k+1):n
    splus=splus+abs(p(i)+U(k,i)*zplus);
    sminu=sminu+abs(p(i)+U(k,i)*zminu);
  end
  if splus >= sminu, z(k)=zplus; else, z(k)=zminu; end
  for i=(k+1):n, p(i)=p(i)+U(k,i)*z(k); end
end
z = z'; x = backward_col(L',z);
w = forward_col(L,x); y = backward_col(U,w);
k1 = norm(A,1)*norm(y,1)/norm(x,1);

Example 3.9 Let us consider the Hilbert matrix H4. Its condition number K1(H4), computed using the MATLAB function invhilb, which returns the exact inverse of H4, is 2.8375 · 10^4. Running Program 14 with theta = (1, 1, 1, 1)^T gives the reasonable estimate K̂1(H4) = 2.1523 · 10^4 (which is the same as the output of rcond), while the function condest returns the exact result. •

3.12 Improving the Accuracy of GEM

As previously noted, if the matrix of the system is ill-conditioned, the solution generated by GEM could be inaccurate even though its residual is small. In this section, we mention two techniques for improving the accuracy of the solution computed by GEM.


3.12.1 Scaling

If the entries of A vary greatly in size, it is likely that during the elimination process large entries are summed to small entries, with a consequent onset of rounding errors. A remedy consists of performing a scaling of the matrix A before the elimination is carried out.

Example 3.10 Consider again the matrix A of Remark 3.3. Multiplying it on the right and on the left by the matrix D = diag(0.0005, 1, 1), we obtain the scaled matrix

\[
\tilde{A} = DAD = \begin{bmatrix} -0.0001 & 1 & 1 \\ 1 & 0.78125 & 0 \\ 1 & 0 & 0 \end{bmatrix}.
\]

Applying GEM to the scaled system Ãx̃ = Db = (0.2, 1.3816, 1.9273)^T, we get the correct solution x = Dx̃. •

Row scaling of A amounts to finding a nonsingular diagonal matrix D1 such that the diagonal entries of D1A are of the same size. The linear system Ax = b transforms into

\[
D_1 A x = D_1 b.
\]

When both rows and columns of A are to be scaled, the scaled version of (3.2) becomes

\[
(D_1 A D_2)\, y = D_1 b, \qquad \text{with } y = D_2^{-1} x,
\]

having also assumed that D2 is invertible. The matrix D1 scales the equations, while D2 scales the unknowns. Notice that, to prevent rounding errors, the scaling matrices are chosen in the form

\[
D_1 = \mathrm{diag}(\beta^{r_1}, \ldots, \beta^{r_n}), \qquad D_2 = \mathrm{diag}(\beta^{c_1}, \ldots, \beta^{c_n}),
\]

where β is the base of the floating-point arithmetic being used and the exponents r1, …, rn, c1, …, cn must be determined. It can be shown that

\[
\frac{\|D_2^{-1}(x - \hat{x})\|_\infty}{\|D_2^{-1} x\|_\infty} \simeq u\, K_\infty(D_1 A D_2).
\]

Therefore, scaling will be effective if K∞(D1AD2) is much smaller than K∞(A). Finding convenient matrices D1 and D2 is not in general an easy matter. A strategy consists, for instance, of picking D1 and D2 in such a way that ‖D1AD2‖∞ and ‖D1AD2‖1 belong to the interval [1/β, 1], where β is the base of the floating-point arithmetic being used (see [McK62] for a detailed analysis in the case of the Crout factorization).
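A simple heuristic along these lines can be sketched as follows (plain Python; the function name and the choice of bringing each row maximum into [1/β, 1] are our illustrative assumptions, with β = 2 so that the scaling itself introduces no rounding error):

```python
# Row scaling with powers of the base beta = 2: each row is multiplied
# by 2^{r_i} so that its largest entry falls in [1/2, 1].
import math

def row_scaling(A, beta=2):
    D1 = []
    for row in A:
        m = max(abs(x) for x in row)
        # beta^{r_i} * m in [1/beta, 1]  =>  r_i = -ceil(log_beta m)
        r = -math.ceil(math.log(m, beta))
        D1.append(float(beta) ** r)
    return D1

A = [[1.0e4, 1.0], [2.0, 0.5]]
D1 = row_scaling(A)
scaled = [[D1[i] * A[i][j] for j in range(2)] for i in range(2)]
print(D1)
print(scaled)   # each row maximum now lies in [1/2, 1]
```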


Remark 3.7 (The Skeel condition number) The Skeel condition number, defined as cond(A) = ‖ |A^{−1}| |A| ‖∞, is the supremum over the set of x ∈ R^n, with x ≠ 0, of the numbers

\[
\mathrm{cond}(A, x) = \frac{\bigl\|\, |A^{-1}|\, |A|\, |x| \,\bigr\|_\infty}{\|x\|_\infty}.
\]

Unlike what happens for K(A), cond(A, x) is invariant with respect to a scaling by rows of A, that is, to transformations of A of the form DA, where D is a nonsingular diagonal matrix. As a consequence, cond(A) provides a sound indication of the ill-conditioning of a matrix, irrespective of any possible diagonal row scaling.
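The invariance claimed above is easy to check numerically on a 2×2 example (plain Python; the closed-form inverse and the sample matrices are ours):

```python
# cond(A) = || |A^{-1}| |A| ||_inf is unchanged when a row of A is
# rescaled, which we check on A and on DA with D = diag(1000, 1).

def skeel(A):
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    inv = [[A[1][1] / det, -A[0][1] / det],
           [-A[1][0] / det, A[0][0] / det]]
    # B = |A^{-1}| |A|
    B = [[sum(abs(inv[i][k]) * abs(A[k][j]) for k in range(2))
          for j in range(2)] for i in range(2)]
    return max(sum(row) for row in B)       # infinity norm

A = [[2.0, 1.0], [1.0, 1.0]]
DA = [[2000.0, 1000.0], [1.0, 1.0]]          # first row scaled by 1000
print(skeel(A), skeel(DA))   # matching values (up to roundoff)
```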

3.12.2 Iterative Refinement

Iterative refinement is a technique for improving the accuracy of a solution yielded by a direct method. Suppose that the linear system (3.2) has been solved by means of the LU factorization (with partial or complete pivoting), and denote by x^{(0)} the computed solution. Having fixed an error tolerance toll, the iterative refinement performs as follows: for i = 0, 1, …, until convergence:

1. compute the residual r^{(i)} = b − Ax^{(i)};

2. solve the linear system Az = r^{(i)}, using the LU factorization of A;

3. update the solution by setting x^{(i+1)} = x^{(i)} + z;

4. if ‖z‖/‖x^{(i+1)}‖ < toll, then terminate the process, returning the solution x^{(i+1)}. Otherwise, the algorithm restarts at step 1.
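The four steps above can be sketched in a toy setting (plain Python, not the book's algorithm verbatim): here the "stored LU solve" is simulated by an exact 2×2 solve whose result is rounded to 4 significant digits, so that each refinement sweep visibly shrinks the error.

```python
# Iterative refinement: r = b - A x, solve A z = r with the (simulated
# low-precision) direct solver, update x <- x + z, stop when z is small.

def matvec(A, x):
    return [sum(a * v for a, v in zip(row, x)) for row in A]

def crude_solve(A, b):
    # stand-in for a low-precision LU solve: exact 2x2 solve by Cramer's
    # rule, then rounding to 4 significant digits
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    x = [(b[0] * A[1][1] - A[0][1] * b[1]) / det,
         (A[0][0] * b[1] - b[0] * A[1][0]) / det]
    return [float('%.4g' % v) for v in x]

def refine(A, b, tol=1e-12, maxit=10):
    x = crude_solve(A, b)
    for _ in range(maxit):
        r = [bi - yi for bi, yi in zip(b, matvec(A, x))]   # step 1
        z = crude_solve(A, r)                              # step 2
        x = [xi + zi for xi, zi in zip(x, z)]              # step 3
        if max(abs(zi) for zi in z) <= tol * max(abs(xi) for xi in x):
            break                                          # step 4
    return x

A = [[4.0, 1.0], [1.0, 3.0]]
b = [9.0, 7.0]            # exact solution x = (20/11, 19/11)
print(refine(A, b))
```

Note that the residual is computed in full double precision while the solver is artificially degraded, mimicking the mixed-precision variant discussed next.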

In the absence of rounding errors, the process would stop at the first step, yielding the exact solution. The convergence properties of the method can be improved by computing the residual r^{(i)} in double precision, while computing the other quantities in single precision. We call this procedure mixed-precision iterative refinement (shortly, MPR), as compared with fixed-precision iterative refinement (FPR).

It can be shown that, if ‖ |A^{−1}| |L| |U| ‖∞ is sufficiently small, then at each step i of the algorithm the relative error ‖x − x^{(i)}‖∞/‖x‖∞ is reduced by a factor ρ, which is given by

\[
\rho \simeq 2\, n\, \mathrm{cond}(A, x)\, u \quad \text{(FPR)}, \qquad
\rho \simeq u \quad \text{(MPR)},
\]

where ρ is independent of the condition number of A in the case of MPR.

Slow convergence of FPR is a clear indication of the ill-conditioning of the matrix: it can be shown that, if p is the number of iterations needed for the method to converge, then K∞(A) ≃ β^{t(1−1/p)}.

Even if performed in fixed precision, iterative refinement is worth using, since it improves the overall stability of any direct method for solving the system. We refer to [Ric81], [Ske80], [JW77], [Ste73], [Wil63] and [CMSW79] for an overview of this subject.

3.13 Undetermined Systems

We have seen that the solution of the linear system Ax=b exists and is unique if n = m and A is nonsingular. In this section we give a meaning to the solution of a linear system both in the overdetermined case, where m > n, and in the underdetermined case, corresponding to m < n. We notice that an overdetermined system generally has no solution unless the right side b is an element of range(A).

For a detailed presentation, we refer to [LH74], [GL89] and [Bjö88].

Given A ∈ R^(m×n), with m ≥ n, and b ∈ R^m, we say that x* ∈ R^n is a solution of the linear system Ax=b in the least-squares sense if

    Φ(x*) = ‖Ax* − b‖₂² ≤ min_{x∈Rⁿ} ‖Ax − b‖₂² = min_{x∈Rⁿ} Φ(x).   (3.73)

The problem thus consists of minimizing the Euclidean norm of the residual. The solution of (3.73) can be found by imposing the condition that the gradient of the function Φ in (3.73) must be equal to zero at x*. From

    Φ(x) = (Ax − b)ᵀ(Ax − b) = xᵀAᵀAx − 2xᵀAᵀb + bᵀb,

we find that

    ∇Φ(x*) = 2AᵀAx* − 2Aᵀb = 0,

from which it follows that x* must be the solution of the square system

    AᵀAx* = Aᵀb   (3.74)

known as the system of normal equations. The system is nonsingular if A has full rank, and in such a case the least-squares solution exists and is unique. We notice that B = AᵀA is a symmetric and positive definite matrix. Thus, in order to solve the normal equations, one could first compute the Cholesky factorization B = HᵀH and then solve the two systems Hᵀy = Aᵀb and Hx* = y. However, due to roundoff errors, the computation of AᵀA may be affected by a loss of significant digits, with a consequent loss of positive definiteness or nonsingularity of the matrix, as happens in the following example (implemented in MATLAB) where, for a matrix A with full rank, the corresponding matrix fl(AᵀA) turns out to be singular

        ⎡ 1      1     ⎤
    A = ⎢ 2^−27  0     ⎥ ,     fl(AᵀA) = ⎡ 1  1 ⎤
        ⎣ 0      2^−27 ⎦                  ⎣ 1  1 ⎦ .
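The example can be reproduced in NumPy (a sketch; the book does this in MATLAB). In double precision, 1 + 2⁻⁵⁴ rounds to 1, so the computed normal-equations matrix is exactly singular although A has full rank:

```python
import numpy as np

eps = 2.0 ** -27
A = np.array([[1.0, 1.0],
              [eps, 0.0],
              [0.0, eps]])

B = A.T @ A                          # computed matrix fl(A^T A)
rank_A = np.linalg.matrix_rank(A)    # 2: A has full rank
det_B = np.linalg.det(B)             # 0: fl(A^T A) is singular
```

Here rank_A is 2 while det_B is exactly zero: the rounding fl(1 + 2⁻⁵⁴) = 1 has destroyed the positive definiteness of AᵀA.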

Therefore, in the case of ill-conditioned matrices it is more convenient to utilize the QR factorization introduced in Section 3.4.3. Indeed, the following result holds.

Theorem 3.8 Let A ∈ R^(m×n), with m ≥ n, be a full rank matrix. Then the unique solution of (3.73) is given by

    x* = R̃⁻¹Q̃ᵀb   (3.75)

where R̃ ∈ R^(n×n) and Q̃ ∈ R^(m×n) are the matrices defined in (3.48) starting from the QR factorization of A. Moreover, the minimum of Φ is given by

    Φ(x*) = Σ_{i=n+1}^{m} [(Qᵀb)_i]².

Proof. The QR factorization of A exists and is unique since A has full rank. Thus, there exist two matrices, Q ∈ R^(m×m) and R ∈ R^(m×n), such that A=QR, where Q is orthogonal. Since orthogonal matrices preserve the Euclidean scalar product (see Property 1.8), it follows that

    ‖Ax − b‖₂² = ‖Rx − Qᵀb‖₂².

Recalling that R is upper trapezoidal, we have

    ‖Rx − Qᵀb‖₂² = ‖R̃x − Q̃ᵀb‖₂² + Σ_{i=n+1}^{m} [(Qᵀb)_i]²,

so that the minimum is achieved when x = x*. ◻
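A sketch of (3.75) in NumPy, on assumed random full-rank data (the book's examples use MATLAB); the reduced QR factorization plays the role of Q̃ and R̃:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((6, 3))          # full rank, m > n
b = rng.standard_normal(6)

Q, R = np.linalg.qr(A, mode='reduced')   # Q~ is m x n, R~ is n x n
x = np.linalg.solve(R, Q.T @ b)          # solve R~ x = Q~^T b

# the residual of a least-squares solution is orthogonal to range(A)
grad = A.T @ (A @ x - b)
```

Since ∇Φ(x*) = 2AᵀAx* − 2Aᵀb, the vector grad vanishes (up to roundoff) precisely when x is the least-squares solution.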

For more details about the analysis of the computational cost of the algorithm (which depends on the actual implementation of the QR factorization), as well as for results about its stability, we refer the reader to the texts quoted at the beginning of the section.

If A does not have full rank, the solution techniques above fail, since in this case, if x* is a solution to (3.73), the vector x* + z, with z ∈ ker(A), is a solution too. We must therefore introduce a further constraint to enforce the uniqueness of the solution. Typically, one requires that x* has minimal Euclidean norm, so that the least-squares problem can be formulated as:

    find x* ∈ Rⁿ with minimal Euclidean norm such that
    ‖Ax* − b‖₂² ≤ min_{x∈Rⁿ} ‖Ax − b‖₂².   (3.76)

This problem is consistent with (3.73) if A has full rank, since in this case (3.73) has a unique solution which necessarily must have minimal Euclidean norm.

The tool for solving (3.76) is the singular value decomposition (or SVD, see Section 1.9), for which the following theorem holds.

Theorem 3.9 Let A ∈ R^(m×n) with SVD given by A = UΣVᵀ. Then the unique solution to (3.76) is

    x* = A†b   (3.77)

where A† is the pseudo-inverse of A introduced in Definition 1.15.

Proof. Using the SVD of A, problem (3.76) is equivalent to finding w = Vᵀx such that w has minimal Euclidean norm and

    ‖Σw − Uᵀb‖₂² ≤ ‖Σy − Uᵀb‖₂²,   ∀y ∈ Rⁿ.

If r is the number of nonzero singular values σ_i of A, then

    ‖Σw − Uᵀb‖₂² = Σ_{i=1}^{r} [σ_i w_i − (Uᵀb)_i]² + Σ_{i=r+1}^{m} [(Uᵀb)_i]²,

which is minimum if w_i = (Uᵀb)_i/σ_i for i = 1, . . . , r. Moreover, it is clear that among the vectors w of Rⁿ having the first r components fixed, the one with minimal Euclidean norm has the remaining n − r components equal to zero. Thus the solution vector is w* = Σ†Uᵀb, that is, x* = VΣ†Uᵀb = A†b, where Σ† is the diagonal matrix defined in (1.11). ◻
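The construction in the proof can be mirrored in NumPy for a hypothetical rank-deficient matrix: build Σ† by inverting only the r nonzero singular values, form x* = VΣ†Uᵀb, and compare it with NumPy's pseudo-inverse:

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((6, 2)) @ rng.standard_normal((2, 4))  # 6x4, rank 2
b = rng.standard_normal(6)

U, s, Vt = np.linalg.svd(A)
r = int(np.sum(s > 1e-12 * s[0]))      # numerical rank
Sdag = np.zeros((4, 6))                # Sigma† has the transposed shape
Sdag[:r, :r] = np.diag(1.0 / s[:r])    # invert the r nonzero singular values
x = Vt.T @ Sdag @ U.T @ b              # x* = V Sigma† U^T b = A† b
```

The vector x coincides with np.linalg.pinv(A) @ b, the minimal-norm least-squares solution.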

As for the stability of problem (3.76), we point out that if the matrix A does not have full rank, the solution x* is not necessarily a continuous function of the data, so that small changes in the latter might produce large variations in x*. An example of this is shown below.

Example 3.11 Consider the system Ax = b with

        ⎡ 1  0 ⎤           ⎡ 1 ⎤
    A = ⎢ 0  0 ⎥ ,     b = ⎢ 2 ⎥ ,     rank(A) = 1.
        ⎣ 0  0 ⎦           ⎣ 3 ⎦

Using the MATLAB function svd we can compute the SVD of A. Then, computing the pseudo-inverse, one finds the solution vector x* = (1, 0)ᵀ. If we perturb the null entry a₂₂ with the value 10⁻¹², the perturbed matrix has (full) rank 2 and the solution (which is unique in the sense of (3.73)) is now given by x* = (1, 2 · 10¹²)ᵀ. •
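The example can be reproduced with NumPy's pinv in place of the MATLAB computation (a sketch; pinv's default threshold keeps the perturbed 10⁻¹² singular value, so the perturbed matrix is treated as full rank):

```python
import numpy as np

A = np.array([[1.0, 0.0],
              [0.0, 0.0],
              [0.0, 0.0]])
b = np.array([1.0, 2.0, 3.0])

x = np.linalg.pinv(A) @ b       # minimal-norm solution: (1, 0)

Ap = A.copy()
Ap[1, 1] = 1e-12                # perturb the null entry a22
xp = np.linalg.pinv(Ap) @ b     # second component jumps to about 2e12
```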

We refer the reader to Section 5.8.3 for the approximate computation of the SVD of a matrix.

In the case of underdetermined systems, for which m < n, if A has full rank the QR factorization can still be used. In particular, when applied to the transpose matrix Aᵀ, the method yields the solution of minimal Euclidean norm. If, instead, the matrix does not have full rank, one must resort to the SVD.

Remark 3.8 If m = n (square system), both the SVD and the QR factorization can be used to solve the linear system Ax=b, as alternatives to GEM. Even though these algorithms require a number of flops far superior to GEM (the SVD, for instance, requires 12n³ flops), they turn out to be more accurate when the system is ill-conditioned and nearly singular.

Example 3.12 Compute the solution to the linear system H₁₅x=b, where H₁₅ is the Hilbert matrix of order 15 (see (3.32)) and the right side is chosen in such a way that the exact solution is the unit vector x = 1. Using GEM with partial pivoting yields a solution affected by a relative error larger than 100%. A solution of much better quality is obtained by passing through the computation of the pseudo-inverse, where the entries in Σ that are less than 10⁻¹³ are set equal to zero. •
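A sketch of this experiment in NumPy/SciPy (the book uses MATLAB; the exact error values depend on the arithmetic and may differ from the book's figures):

```python
import numpy as np
from scipy.linalg import hilbert, solve

n = 15
H = hilbert(n)                 # the Hilbert matrix H15
x_exact = np.ones(n)
b = H @ x_exact

x_gem = solve(H, b)            # LU with partial pivoting (GEM-type solve)

U, s, Vt = np.linalg.svd(H)
s_inv = np.where(s > 1e-13, 1.0 / s, 0.0)   # zero out sigma_i < 1e-13
x_svd = Vt.T @ (s_inv * (U.T @ b))          # truncated pseudo-inverse

err_gem = np.linalg.norm(x_gem - x_exact) / np.linalg.norm(x_exact)
err_svd = np.linalg.norm(x_svd - x_exact) / np.linalg.norm(x_exact)
```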

3.14 Applications

In this section we present two problems, suggested by structural mechanics and grid generation in finite element analysis, whose solutions require solving large linear systems.

3.14.1 Nodal Analysis of a Structured Frame

Let us consider a structured frame which is made of rectilinear beams connected among them through hinges (referred to as the nodes) and suitably constrained to the ground. External loads are assumed to be applied at the nodes of the frame, and for any beam in the frame the internal actions amount to a unique force of constant strength directed as the beam itself. If the normal stress acting on the beam is a traction, we assume that it has positive sign; otherwise, the action has negative sign. Structured frames are frequently employed as covering structures for large public buildings such as exhibition stands, railway stations or airport halls.

To determine the internal actions in the frame, which are the unknowns of the mathematical problem, a nodal analysis is used (see [Zie77]): the equilibrium with respect to translation is imposed at every node of the frame, yielding a sparse, large-size linear system. The resulting matrix has a sparsity pattern which depends on the numbering of the unknowns and which can strongly affect the computational effort of the LU factorization due to fill-in. We will show that the fill-in can be dramatically reduced by a suitable reordering of the unknowns.

The structure shown in Figure 3.9 is arc-shaped and is symmetric with respect to the origin. The radii r and R of the inner and outer circles are equal to 1 and 2, respectively. An external vertical load of unit size directed downwards is applied at (0, 1), while the frame is constrained to the ground through a hinge at (−(r + R), 0) and a bogie at (r + R, 0). To generate the structure we have partitioned the half unit circle into nθ uniform slices, resulting in a total number of n = 2(nθ + 1) nodes and a matrix size of m = 2n. The structure in Figure 3.9 has nθ = 7 and the unknowns are numbered following a counterclockwise labeling of the beams starting from the node at (1, 0).

We have represented the structure along with the internal actions computed by solving the nodal equilibrium equations, where the width of the beams is proportional to the strength of the computed action. Black is used to identify tractions whereas gray is associated with compressions. As expected, the maximum traction stress is attained at the node where the external load is applied.

FIGURE 3.9. A structured frame loaded at the point (0, 1)

We show in Figure 3.10 the sparsity pattern of the matrix A (left) and that of the L-factor of its LU factorization with partial pivoting (right) in the case nθ = 40, which corresponds to a size of 164 × 164. Notice the large fill-in effect arising in the lower part of L, which results in an increase of the nonzero entries from 645 (before the factorization) to 1946 (after the factorization).

FIGURE 3.10. Sparsity pattern of matrix A, with nz = 645 nonzero entries (left), and of the L-factor of the LU factorization with partial pivoting, with nz = 1946 (right), in the case nθ = 40

In view of the solution of the linear system by a direct method, the increase of the nonzero entries demands a suitable reordering of the unknowns. For this purpose we use the MATLAB function symrcm, which implements the symmetric reverse Cuthill-McKee algorithm described in Section 3.9.1. The sparsity pattern after reordering is shown in Figure 3.11 (left), while the L-factor of the LU factorization of the reordered matrix is shown in Figure 3.11 (right). The results indicate that the reordering procedure has "scattered" the sparsity pattern throughout the matrix, with a relatively modest increase of the nonzero entries, from 645 to 1040.
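The reordering step can be sketched with SciPy's reverse_cuthill_mckee, an analogue of MATLAB's symrcm. As a stand-in for the frame matrix (an assumption of this sketch), we scramble a tridiagonal matrix with a random permutation and let RCM recover a small bandwidth:

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.csgraph import reverse_cuthill_mckee

n = 164                                    # same size as the case n_theta = 40
L = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format='csr')
rng = np.random.default_rng(0)
p = rng.permutation(n)
A = L[p][:, p]                             # scrambled matrix: large bandwidth

perm = reverse_cuthill_mckee(sp.csr_matrix(A), symmetric_mode=True)
A_rcm = A[perm][:, perm]                   # symmetric RCM reordering

def bandwidth(M):
    i, j = M.nonzero()
    return int(np.max(np.abs(i - j)))
```

A smaller bandwidth confines the factorization's fill-in near the diagonal, which is why bandwidth(A_rcm) is far below bandwidth(A).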

FIGURE 3.11. Sparsity pattern of matrix A after reordering with the symmetric reverse Cuthill-McKee algorithm, with nz = 645 nonzero entries (left), and the L-factor of the LU factorization of the reordered matrix with partial pivoting, with nz = 1040 (right), in the case nθ = 40

The effectiveness of the symmetric reverse Cuthill-McKee reordering procedure is demonstrated in Figure 3.12, which shows the number of nonzero entries nz in the L-factor of A as a function of the size m of the matrix (represented on the x-axis). In the reordered case (solid line) a linear increase of nz with m can be clearly appreciated, at the expense of a dramatic fill-in growth with m if no reordering is performed (dashed line).

FIGURE 3.12. Number of nonzero entries in the L-factor of A as a function of the size m of the matrix, with (solid line) and without (dashed line) reordering

3.14.2 Regularization of a Triangular Grid

The numerical solution of a problem in a two-dimensional domain D of polygonal form, for instance by finite element or finite difference methods, very often requires that D be decomposed into smaller subdomains, usually of triangular form (see for instance Section 9.9.2).

Suppose that D = ∪_{T∈Th} T, where Th is the considered triangulation (also called the computational grid) and h is a positive parameter which characterizes the triangulation. Typically, h denotes the maximum length of the triangle edges. We shall also assume that two triangles of the grid, T₁ and T₂, have either null intersection or share a vertex or a side.

The geometrical properties of the computational grid can heavily affect the quality of the approximate numerical solution. It is therefore convenient to devise a sufficiently regular triangulation, such that, for any T ∈ Th, the ratio between the maximum length of the sides of T (the diameter of T) and the diameter of the circle inscribed within T (the sphericity of T) is bounded by a constant independent of T. This latter requirement can be satisfied by employing a regularization procedure applied to an existing grid. We refer to [Ver96] for further details on this subject.

Let us assume that Th contains N_T triangles and N vertices, of which N_b, lying on the boundary ∂D of D, are kept fixed and have coordinates x_i^(∂D) = (x_i^(∂D), y_i^(∂D))ᵀ. We denote by Nh the set of grid nodes, excluding the boundary nodes, and for each node x_i = (x_i, y_i)ᵀ ∈ Nh, let P_i and Z_i respectively be the set of triangles T ∈ Th sharing x_i (called the patch of x_i) and the set of nodes of P_i except node x_i itself (see Figure 3.13, right). We let n_i = dim(Z_i).

FIGURE 3.13. An example of a decomposition into triangles of a polygonal domain D (left), and the effect of the barycentric regularization on a patch of triangles (right). The newly generated grid is plotted in dashed line

The regularization procedure consists of moving the generic node x_i to a new position determined by the center of gravity of the polygon generated by joining the nodes of Z_i; for that reason it is called a barycentric regularization. The effect of such a procedure is to force all the triangles that belong to the interior of the domain to assume a shape that is as regular as possible (in the limit, each triangle should be equilateral). In practice, we let

    x_i = (1/n_i) Σ_{x_j ∈ Z_i} x_j   ∀x_i ∈ Nh,        x_i = x_i^(∂D)   if x_i ∈ ∂D.

Two systems must then be solved, one for the x-components {x_i} and the other for the y-components {y_i}. Denoting by z_i the generic unknown, the i-th row of the system, in the case of internal nodes, reads

    n_i z_i − Σ_{z_j ∈ Z_i} z_j = 0,   ∀i ∈ Nh,   (3.78)

while for the boundary nodes the identities z_i = z_i^(∂D) hold. Equations (3.78) yield a system of the form Az = b, where A is a symmetric and positive definite matrix of order N − N_b which can be shown to be an M-matrix (see Section 1.12). This property ensures that the new grid coordinates satisfy discrete minimum and maximum principles, that is, they take a value which is between the minimum and the maximum values attained on the boundary.
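System (3.78) can be assembled and solved explicitly for a small hand-built configuration (the node coordinates and adjacency below are assumptions of this sketch, not taken from a real grid): boundary contributions move to the right-hand side, and the same matrix serves both coordinate directions:

```python
import numpy as np

coords = np.array([[0.0, 0.0], [3.0, 0.0], [3.0, 3.0], [0.0, 3.0],  # boundary
                   [0.5, 0.5], [2.8, 2.8]])                         # internal
neighbors = {4: [0, 1, 3, 5],      # Z_4: nodes joined to node 4
             5: [1, 2, 3, 4]}      # Z_5: nodes joined to node 5
internal = [4, 5]

m = len(internal)
A = np.zeros((m, m))
b = np.zeros((m, 2))               # one right-hand side per coordinate
for row, i in enumerate(internal):
    A[row, row] = len(neighbors[i])           # n_i on the diagonal
    for j in neighbors[i]:
        if j in internal:
            A[row, internal.index(j)] -= 1.0  # internal neighbor stays in A
        else:
            b[row] += coords[j]               # boundary neighbor goes to b

z = np.linalg.solve(A, b)          # new positions of the internal nodes
```

Here A = [[4, −1], [−1, 4]] is a symmetric positive definite M-matrix, and each component of z lies between the boundary extrema 0 and 3, as the discrete maximum principle predicts.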