a finite number of digits x_k with 0 ≤ x_k < β for k = −m, …, n. The notation (conventionally adopted)

\[
x_\beta = (-1)^s\,[x_n x_{n-1} \ldots x_1 x_0 \,.\, x_{-1} x_{-2} \ldots x_{-m}], \qquad x_n \neq 0 \tag{2.26}
\]

is called the positional representation of x with respect to the base β. The point between x_0 and x_{−1} is called the decimal point if the base is 10, the binary point if the base is 2, while s depends on the sign of x (s = 0 if x is positive, 1 if negative). Relation (2.26) actually means

\[
x_\beta = (-1)^s \sum_{k=-m}^{n} x_k \beta^k .
\]

Example 2.10 The conventional writing x_10 = 425.33 denotes the number x = 4·10² + 2·10 + 5 + 3·10⁻¹ + 3·10⁻², while x_6 = 425.33 would denote the real number x = 4·6² + 2·6 + 5 + 3·6⁻¹ + 3·6⁻². A rational number can of course have a finite number of digits in one base and an infinite number of digits in another. For example, the fraction 1/3 has infinite digits in base 10, being x_10 = 0.3̄, while it has only one digit in base 3, being x_3 = 0.1. •
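The sum defining the positional representation is straightforward to evaluate programmatically. The following MATLAB sketch (the function positional_value and its argument layout are our own, purely illustrative) computes x from a vector of digits:

function x = positional_value(digits, m, s, beta)
% Value of the digit string [x_n ... x_0 . x_{-1} ... x_{-m}] in base beta.
% digits: row vector of the digits from x_n down to x_{-m};
% m: number of fractional digits; s: sign (0 positive, 1 negative).
n = length(digits) - m - 1;       % exponent of the leading digit
k = n:-1:-m;                      % exponents, left to right
x = (-1)^s * sum(digits .* beta.^k);

For instance, positional_value([4 2 5 3 3], 2, 0, 10) returns 425.33, reproducing Example 2.10.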


Any real number can be approximated by numbers having a finite representation. Indeed, having fixed the base β, the following property holds:

\[
\forall \varepsilon > 0,\ \forall x_\beta \in \mathbb{R},\ \exists y_\beta \in \mathbb{R} \ \text{such that} \ |y_\beta - x_\beta| < \varepsilon,
\]

where y_β has a finite positional representation.

In fact, given the positive number x_β = x_n x_{n−1} … x_0 . x_{−1} … x_{−m} …, with a number of digits that may be finite or infinite, for any r ≥ 1 one can build the two numbers

\[
x_\beta^{(l)} = \sum_{k=0}^{r-1} x_{n-k}\,\beta^{n-k}, \qquad x_\beta^{(u)} = x_\beta^{(l)} + \beta^{n-r+1},
\]

having r digits, such that x_β^(l) < x_β < x_β^(u) and x_β^(u) − x_β^(l) = β^{n−r+1}. If r is chosen in such a way that β^{n−r+1} < ε, then taking y_β equal to x_β^(l) or x_β^(u) yields the desired inequality. This result legitimates the computer representation of real numbers (and thus their representation by a finite number of digits).
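Under the stated assumptions, the two bounding numbers are obtained by simple truncation; a minimal MATLAB sketch (variable names are ours):

beta = 10; r = 4; x = 314.159265;     % sample data
n = floor(log(x)/log(beta));          % exponent of the leading digit of x
ulp = beta^(n - r + 1);               % weight of the last retained digit
x_l = floor(x/ulp)*ulp;               % keep r digits, drop the rest: 314.1
x_u = x_l + ulp;                      % 314.2; x_u - x_l = beta^(n-r+1)

Choosing r large enough makes β^{n−r+1} smaller than any prescribed ε, which is exactly the argument above.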

Although all bases are theoretically equivalent, in computational practice three bases are generally employed: base 2 or binary, base 10 or decimal (the most natural) and base 16 or hexadecimal. Almost all modern computers use base 2, apart from a few which traditionally employ base 16. In what follows, we will assume that β is an even integer.

In the binary representation, digits reduce to the two symbols 0 and 1, called bits (binary digits), while in the hexadecimal case the symbols used for the representation of the digits are 0, 1, …, 9, A, B, C, D, E, F. Clearly, the smaller the adopted base, the longer the string of characters needed to represent the same number.

To simplify notation, we shall write x instead of x_β, leaving the base β understood.

2.5.2 The Floating-point Number System

Assume a given computer has N memory positions in which to store any number. The most natural way to use these positions in the representation of a real number x different from zero is to fix one of them for its sign, N − k − 1 for the integer digits and k for the digits beyond the point, in such a way that

\[
x = (-1)^s \cdot [a_{N-2}\, a_{N-3} \ldots a_k \,.\, a_{k-1} \ldots a_0], \tag{2.27}
\]

s being equal to 1 or 0. Notice that one memory position is equivalent to one bit of storage only when β = 2. The set of numbers of this kind is called the fixed-point system. Equation (2.27) stands for

\[
x = (-1)^s \cdot \beta^{-k} \sum_{j=0}^{N-2} a_j \beta^j \tag{2.28}
\]


and therefore this representation amounts to fixing a scaling factor for all the representable numbers.

The use of fixed point strongly limits the values of the minimum and maximum numbers that can be represented on the computer, unless a very large number N of memory positions is employed. This drawback can easily be overcome if the scaling in (2.28) is allowed to vary. In such a case, given a nonvanishing real number x, its floating-point representation is given by

\[
x = (-1)^s \cdot (0.a_1 a_2 \ldots a_t) \cdot \beta^e = (-1)^s \cdot m \cdot \beta^{e-t}, \tag{2.29}
\]

where t ∈ ℕ is the number of allowed significant digits a_i (with 0 ≤ a_i ≤ β − 1), m = a_1 a_2 … a_t an integer called the mantissa such that 0 ≤ m ≤ β^t − 1, and e an integer called the exponent. Clearly, the exponent can vary within a finite interval of admissible values: we let L ≤ e ≤ U (typically L < 0 and U > 0). The N memory positions are now distributed among the sign (one position), the significant digits (t positions) and the digits for the exponent (the remaining N − t − 1 positions). The number zero has a separate representation.

Typically, two formats are available on the computer for the floating-point number representation: single and double precision. In the case of binary representation, these formats correspond in the standard version to the representation with N = 32 bits (single precision)

    s: 1 bit | e: 8 bits | m: 23 bits

and with N = 64 bits (double precision)

    s: 1 bit | e: 11 bits | m: 52 bits

Let us denote by

\[
\mathbb{F}(\beta, t, L, U) = \{0\} \cup \left\{ x \in \mathbb{R} : x = (-1)^s \beta^e \sum_{i=1}^{t} a_i \beta^{-i} \right\}
\]

the set of floating-point numbers with t significant digits, base β ≥ 2, 0 ≤ a_i ≤ β − 1, and range (L, U) with L ≤ e ≤ U.

In order to enforce uniqueness in the representation of a number, it is typically assumed that a_1 ≠ 0 and m ≥ β^{t−1}. In such an event a_1 is called the principal significant digit, a_t is the last significant digit, and the representation of x is called normalized. The mantissa m now varies between β^{t−1} and β^t − 1.

For instance, in the case β = 10, t = 4, L = −1 and U = 4, without the assumption that a_1 ≠ 0, the number 1 would admit the following representations

0.1000 · 10¹, 0.0100 · 10², 0.0010 · 10³, 0.0001 · 10⁴.


To always have uniqueness in the representation, the number zero is also given its own sign (typically s = 0 is taken).

It can be immediately noticed that if x ∈ F(β, t, L, U) then also −x ∈ F(β, t, L, U). Moreover, the following lower and upper bounds hold for the absolute value of x:

\[
x_{\min} = \beta^{L-1} \le |x| \le \beta^{U}(1 - \beta^{-t}) = x_{\max}. \tag{2.30}
\]

The cardinality of F(β, t, L, U) (henceforth shortly denoted by F) is

\[
\mathrm{card}\,\mathbb{F} = 2(\beta - 1)\beta^{t-1}(U - L + 1) + 1.
\]

From (2.30) it turns out that it is not possible to represent any number (apart from zero) whose absolute value is less than x_min. This limitation can be overcome by completing F with the set F_D of the floating-point de-normalized numbers, obtained by removing the assumption that a_1 is nonnull only for the numbers referring to the minimum exponent L. In this way the uniqueness of the representation is not lost, and it becomes possible to generate numbers that have mantissa between 1 and β^{t−1} − 1 and belong to the interval (−β^{L−1}, β^{L−1}). The smallest number in this set has absolute value equal to β^{L−t}.

Example 2.11 The positive numbers in the set F(2, 3, −1, 2) are

(0.111)·2² = 7/2,   (0.110)·2² = 3,    (0.101)·2² = 5/2,   (0.100)·2² = 2,
(0.111)·2  = 7/4,   (0.110)·2  = 3/2,  (0.101)·2  = 5/4,   (0.100)·2  = 1,
(0.111)    = 7/8,   (0.110)    = 3/4,  (0.101)    = 5/8,   (0.100)    = 1/2,
(0.111)·2⁻¹ = 7/16, (0.110)·2⁻¹ = 3/8, (0.101)·2⁻¹ = 5/16, (0.100)·2⁻¹ = 1/4.

They are included between x_min = β^{L−1} = 2⁻² = 1/4 and x_max = β^U(1 − β^{−t}) = 2²(1 − 2⁻³) = 7/2. As a whole, we have (β − 1)β^{t−1}(U − L + 1) = (2 − 1)2^{3−1}(2 + 1 + 1) = 16 strictly positive numbers. Their opposites must be added to them, as well as the number zero. We notice that when β = 2, the first significant digit in the normalized representation is necessarily equal to 1 and thus it may not be stored in the computer (in such an event, we call it the hidden bit).

When also considering the positive de-normalized numbers, we should complete the above set by adding the numbers

(0.011)·2⁻¹ = 3/16,  (0.010)·2⁻¹ = 1/8,  (0.001)·2⁻¹ = 1/16.

According to what was previously stated, the smallest de-normalized number is β^{L−t} = 2^{−1−3} = 1/16. •
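The set of Example 2.11 can also be generated by brute force, looping over the admissible exponents and normalized mantissas; a short MATLAB sketch (our own coding):

beta = 2; t = 3; L = -1; U = 2;
F = [];
for e = L:U
  for m = beta^(t-1) : beta^t - 1     % normalized mantissas
    F(end+1) = m*beta^(e-t);          % x = m * beta^(e-t), cf. (2.29)
  end
end
sort(F)    % the 16 positive numbers, from 1/4 = beta^(L-1) up to 7/2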


2.5.3 Distribution of Floating-point Numbers

The floating-point numbers are not equally spaced along the real line; rather, they become dense close to the smallest representable number. It can be checked that the spacing between a number x ∈ F and its nearest neighbor y ∈ F, where both x and y are assumed nonnull, is at least β⁻¹ε_M|x| and at most ε_M|x|, where ε_M = β^{1−t} is the machine epsilon. The latter represents the distance between the number 1 and the nearest floating-point number, and therefore it is the smallest number of F such that 1 + ε_M > 1.

Having instead fixed an interval of the form [β^e, β^{e+1}], the numbers of F that belong to such an interval are equally spaced, with distance equal to β^{e−t}. Decreasing (or increasing) the exponent by one decreases (or increases) the distance between consecutive numbers by a factor β.
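In IEEE double precision (β = 2, t = 53) this behavior can be observed directly, assuming a MATLAB version whose eps accepts an argument and returns the spacing at that point:

eps(1)      % 2^-52: the machine epsilon, distance from 1 to the next float
eps(2)      % 2^-51: increasing the exponent by one doubles the spacing
eps(0.5)    % 2^-53: decreasing it divides the spacing by beta = 2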

Unlike the absolute distance, the relative distance between two consecutive numbers has a periodic behavior which depends only on the mantissa m. Indeed, denoting by (−1)^s m(x)β^{e−t} one of the two numbers, the distance Δx from the next one is equal to (−1)^s β^{e−t}, which implies that the relative distance is

\[
\frac{\Delta x}{x} = \frac{(-1)^s \beta^{e-t}}{(-1)^s m(x)\beta^{e-t}} = \frac{1}{m(x)}. \tag{2.31}
\]

Within the interval [β^e, β^{e+1}], the ratio in (2.31) decreases as x increases, since in the normalized representation the mantissa varies from β^{t−1} to β^t − 1 (not included). However, as soon as x = β^{e+1}, the relative distance jumps back to the value β^{−t+1} and starts decreasing again on the successive intervals, as shown in Figure 2.2. This oscillatory phenomenon is called wobbling precision, and the greater the base β, the more pronounced the effect. This is another reason why small bases are preferably employed in computers.

2.5.4 IEC/IEEE Arithmetic

The possibility of building sets of floating-point numbers that differ in base, number of significant digits and range of the exponent has prompted in the past the development, for almost any computer, of a particular system F. In order to avoid this proliferation of numerical systems, a standard has been fixed that is nowadays almost universally accepted. This standard was developed in 1985 by the Institute of Electrical and Electronics Engineers (shortly, IEEE) and was approved in 1989 by the International Electrotechnical Commission (IEC) as the international standard IEC559, and it is now known by this name (IEC is an organization analogous to the International Organization for Standardization (ISO) in the field of electronics). The standard IEC559 endorses two formats for floating-point numbers: a basic format, made by the system F(2, 24, −125, 128) for single precision and by F(2, 53, −1021, 1024) for double precision, both including the


de-normalized numbers, and an extended format, for which only the main limitations are fixed (see Table 2.1).

FIGURE 2.2. Variation of relative distance for the set of numbers F(2, 24, −125, 128) IEC/IEEE in single precision

        single        double
  N     ≥ 43 bits     ≥ 79 bits
  t     ≥ 32          ≥ 64
  L     ≤ −1021       ≤ −16381
  U     ≥ 1024        ≥ 16384

TABLE 2.1. Lower or upper limits in the standard IEC559 for the extended format of floating-point numbers

Almost all computers nowadays satisfy these requirements. We summarize in Table 2.2 the special codings used in IEC559 to deal with the values ±0, ±∞ and with the so-called non-numbers (shortly, NaN, that is, not a number), which correspond for instance to 0/0 or to other exceptional operations.

  value    exponent    mantissa
  ±0       L − 1       0
  ±∞       U + 1       0
  NaN      U + 1       ≠ 0

TABLE 2.2. IEC559 codings of some exceptional values
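These special values are directly observable in any IEC559-compliant environment; in MATLAB, for instance:

a = 1/0        % Inf: the coding (U+1, 0) of Table 2.2
b = 0/0        % NaN: exceptional operation
c = b + 1      % NaN propagates through subsequent operations
isnan(c)       % returns 1 (true)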

2.5.5 Rounding of a Real Number in its Machine Representation

The fact that on any computer only a subset F(β, t, L, U) of ℝ is actually available poses several practical problems, first of all the representation in F


of any given real number. In this regard, notice that, even if x and y were two numbers in F, the result of an operation on them does not necessarily belong to F. Therefore, we must also define an arithmetic on F.

The simplest approach to the first problem consists of rounding x ∈ ℝ in such a way that the rounded number belongs to F. Among all the possible rounding operations, let us consider the following one. Given x ∈ ℝ in the normalized positional notation (2.29), let us substitute x by its representative fl(x) in F, defined as

\[
fl(x) = (-1)^s (0.a_1 a_2 \ldots \tilde a_t)\cdot \beta^e, \qquad
\tilde a_t = \begin{cases} a_t & \text{if } a_{t+1} < \beta/2, \\ a_t + 1 & \text{if } a_{t+1} \ge \beta/2. \end{cases} \tag{2.32}
\]

The mapping fl : ℝ → F is the most commonly used and is called rounding (in chopping one would take, more trivially, ã_t = a_t). Clearly, fl(x) = x if x ∈ F and moreover fl(x) ≤ fl(y) if x ≤ y ∀x, y ∈ ℝ (monotonicity property).
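A minimal MATLAB sketch of the rounding (2.32) for positive x (the function name is ours; it relies on MATLAB's round, which rounds halves away from zero, matching the case a_{t+1} ≥ β/2):

function y = fl_round(x, beta, t)
% Round positive x to t significant digits in base beta, as in (2.32).
e = floor(log(x)/log(beta)) + 1;   % exponent with x = (0.a1 a2 ...)*beta^e
m = round(x/beta^(e-t));           % integer mantissa, rounded
y = m*beta^(e-t);

For instance, fl_round(425.33, 10, 4) returns 425.3; chopping is obtained by replacing round with floor.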

Remark 2.2 (Overflow and underflow) Everything written so far holds only for the numbers whose exponent e in (2.29) lies within the range of F. If, indeed, x ∈ (−∞, −x_max) ∪ (x_max, ∞), the value fl(x) is not defined, while if x ∈ (−x_min, x_min) the operation of rounding is defined anyway (even in the absence of de-normalized numbers). In the first case, if x is the result of an operation on numbers of F, we speak of overflow; in the second case, of underflow (or graceful underflow if de-normalized numbers are accounted for). Overflow is handled by the system through an interrupt of the executing program.

Apart from exceptional situations, we can easily quantify the error, absolute and relative, that is made by substituting fl(x) for x. The following result can be shown (see for instance [Hig96], Theorem 2.2).

Property 2.1 If x ∈ ℝ is such that x_min ≤ |x| ≤ x_max, then

\[
fl(x) = x(1 + \delta) \quad \text{with} \quad |\delta| \le u, \tag{2.33}
\]

where

\[
u = \frac{1}{2}\beta^{1-t} = \frac{1}{2}\varepsilon_M \tag{2.34}
\]

is the so-called roundoff unit (or machine precision).

As a consequence of (2.33), the following bound holds for the relative error

\[
E_{rel}(x) = \frac{|x - fl(x)|}{|x|} \le u, \tag{2.35}
\]

while, for the absolute error, one gets

\[
E(x) = |x - fl(x)| \le \beta^{e-t}\,|(a_1 \ldots a_t . a_{t+1} \ldots) - (a_1 \ldots \tilde a_t)|.
\]

From (2.32), it follows that

\[
|(a_1 \ldots a_t . a_{t+1} \ldots) - (a_1 \ldots \tilde a_t)| \le \frac{\beta}{2}\,\beta^{-1},
\]

from which

\[
E(x) \le \frac{1}{2}\,\beta^{-t+e}.
\]

Remark 2.3 In the MATLAB environment the value of ε_M is immediately available through the system variable eps.
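A quick check of these quantities in MATLAB (IEEE double precision, β = 2, t = 53):

u = eps/2       % the roundoff unit (2.34): about 1.1102e-16
1 + eps > 1     % true: eps is the distance between 1 and the next float
1 + u > 1       % false: 1 + u rounds back to 1 (round to nearest even)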

2.5.6 Machine Floating-point Operations

As previously stated, it is necessary to define on the set of machine numbers an arithmetic which is analogous, as far as possible, to the arithmetic in ℝ. Thus, given any arithmetic operation ∘ : ℝ × ℝ → ℝ on two operands in ℝ (the symbol ∘ may denote sum, subtraction, multiplication or division), we shall denote by ⊚ the corresponding machine operation

\[
\circledcirc : \mathbb{R} \times \mathbb{R} \to \mathbb{F}, \qquad x \circledcirc y = fl(fl(x) \circ fl(y)).
\]

From the properties of floating-point numbers one could expect that for the operations on two operands, whenever well defined, the following property holds: ∀x, y ∈ F, ∃δ ∈ ℝ such that

\[
x \circledcirc y = (x \circ y)(1 + \delta) \quad \text{with} \quad |\delta| \le u. \tag{2.36}
\]

In order for (2.36) to be satisfied when ∘ is the subtraction operator, an additional assumption on the structure of the numbers in F will be required, namely the presence of the so-called round digit (which is addressed at the end of this section). In particular, when ∘ is the sum operator (we then write ⊕ for the machine sum), it follows that for all x, y ∈ F (see Exercise 11)

\[
\frac{|x \oplus y - (x + y)|}{|x + y|} \le u(1 + u)\,\frac{|x| + |y|}{|x + y|} + u, \tag{2.37}
\]

so that the relative error associated with every machine operation will be small, unless x + y is itself small. The sum of two numbers close in absolute value but opposite in sign deserves an aside comment: in that case x + y can be quite small, generating the so-called cancellation errors (as evidenced in Example 2.6).
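A tiny MATLAB experiment makes the loss of digits visible: both operands below carry a relative error of order u, yet about half of the significant digits of their difference are wrong.

x = 1 + 1e-8; y = 1;
d = x - y;              % the subtraction itself is exact here
abs(d - 1e-8)/1e-8      % relative error of order 1e-8: ~8 digits lost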

It is important to notice that, while some properties of standard arithmetic are preserved when passing to floating-point arithmetic (like, for instance, the commutativity of the sum of two addends, or of the product of two factors), other properties are lost. An example is given by the associativity of the sum: it can indeed be shown (see Exercise 12) that in general

\[
x \oplus (y \oplus z) \neq (x \oplus y) \oplus z.
\]

We shall denote by flop the single elementary floating-point operation (sum, subtraction, multiplication or division); the reader is warned that in some texts flop identifies an operation of the form a + b·c. According to this convention, a scalar product between two vectors of length n requires 2n − 1 flops, a matrix-vector product 2(m − 1)n flops if the matrix is n × m and, finally, a matrix-matrix product 2(r − 1)mn flops if the two matrices are m × r and r × n respectively.

Remark 2.4 (IEC559 arithmetic) The IEC559 standard also defines a closed arithmetic on F, meaning that any operation on it produces a result that can be represented within the system itself, although not necessarily the one expected from a pure mathematical standpoint. As an example, we report in Table 2.3 the results obtained in exceptional situations.

  exception            examples      result
  invalid operation    0/0, 0·∞      NaN
  overflow                           ±∞
  division by zero     1/0           ±∞
  underflow                          subnormal numbers

TABLE 2.3. Results for some exceptional operations

The presence of a NaN (Not a Number) in a sequence of operations automatically implies that the result is a NaN. General acceptance of this standard is still ongoing.

We mention that not all floating-point systems satisfy (2.36). One of the main reasons is the absence of the round digit in subtraction, that is, an extra bit that comes into play at the mantissa level when the subtraction of two floating-point numbers is performed. To demonstrate the importance of the round digit, consider a system F with β = 10 and t = 2, and let us subtract 0.99 from 1. Aligning the exponents (without a round digit, the shifted digit 9 is lost), we have

  10¹ · 0.1      →   10¹ · 0.10
  10⁰ · 0.99     →   10¹ · 0.09
                     ─────────
                     10¹ · 0.01  →  10⁰ · 0.10

that is, the result differs from the exact one by a factor of 10. If we now execute the same subtraction using the round digit, we obtain the exact result. Indeed

  10¹ · 0.1      →   10¹ · 0.10
  10⁰ · 0.99     →   10¹ · 0.09|9
                     ───────────
                     10¹ · 0.00|1  →  10⁰ · 0.01

In fact, it can be shown that addition and subtraction, if executed without the round digit, do not satisfy the property

\[
fl(x \pm y) = (x \pm y)(1 + \delta) \quad \text{with } |\delta| \le u,
\]

but only the weaker one

\[
fl(x \pm y) = x(1 + \alpha) \pm y(1 + \beta) \quad \text{with } |\alpha| + |\beta| \le u.
\]

An arithmetic for which this latter property holds is called aberrant. In some computers the round digit is absent, most of the care being devoted to computational speed. Nowadays, however, the trend is to use even two round digits (see [HP94] for technical details on the subject).

2.6 Exercises

1. Use (2.7) to compute the condition number K(d) of the following expressions

   (1) x − a^d = 0, a > 0;   (2) d − x + 1 = 0,

   d being the datum, a a parameter and x the "unknown".
   [Solution: (1) K(d) ≃ |d| |log a|, (2) K(d) = |d|/|d + 1|.]

2. Study the well-posedness and the conditioning in the infinity norm of the following problem as a function of the datum d: find x and y such that

   x + dy = 1,
   dx + y = 0.

   [Solution: the given problem is a linear system whose matrix is A = [1 d; d 1]. It is well-posed if A is nonsingular, i.e., if d ≠ ±1. In such a case, K∞(A) = |(|d| + 1)/(|d| − 1)|.]

3. Study the conditioning of the solving formula x± = −p ± √(p² + q) for the second degree equation x² + 2px − q = 0 with respect to changes in the parameters p and q separately.
   [Solution: K(p) = |p|/√(p² + q), K(q) = |q|/(2|x±|√(p² + q)).]

4. Consider the following Cauchy problem

   \[
   \begin{cases} x'(t) = x_0 e^{at}(a\cos t - \sin t), & t > 0, \\ x(0) = x_0, \end{cases} \tag{2.38}
   \]

   whose solution is x(t) = x_0 e^{at} cos t (a is a given real number). Study the conditioning of (2.38) with respect to the choice of the initial datum and check that on unbounded intervals it is well conditioned if a < 0, while it is ill conditioned if a > 0.
   [Hint: consider the definition of K_abs(a).]

5. Let x̃ ≠ 0 be an approximation of a nonnull quantity x. Find the relation between the relative errors ε = |x − x̃|/|x| and E = |x − x̃|/|x̃|.

6. Find a stable formula for evaluating the square root of a complex number.

7. Determine all the elements of the set F(10, 6, −9, 9), in both the normalized and de-normalized cases.

8. Consider the set of de-normalized numbers F_D and study the behavior of the absolute and relative distances between two consecutive such numbers. Does the wobbling precision effect arise again?
   [Hint: for these numbers, uniformity in the relative density is lost. As a consequence, the absolute distance remains constant (equal to β^{L−t}), while the relative one grows rapidly as x tends to zero.]

9. What is the value of 0⁰ in IEEE arithmetic?
   [Solution: ideally, the outcome should be NaN. In practice, IEEE systems return the value 1. A motivation for this result can be found in [Gol91].]

10. Show that, due to cancellation errors, the sequence

    \[
    I_0 = \log\frac{6}{5}, \qquad I_k + 5I_{k-1} = \frac{1}{k}, \quad k = 1, 2, \ldots, n, \tag{2.39}
    \]

    is not well suited to finite arithmetic computations of the integral I_n = ∫₀¹ xⁿ/(x + 5) dx when n is sufficiently large, although it works in infinite arithmetic.
    [Hint: consider the perturbed initial datum Ĩ_0 = I_0 + ε_0 and study the propagation of the error ε_0 within (2.39).]
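A short MATLAB sketch of the experiment suggested by the hint: the recursion I_k = 1/k − 5 I_{k−1} multiplies any perturbation of I_0 by (−5)^k, so the computed values quickly become meaningless.

I = log(6/5);          % I_0, stored with a rounding error of order u
for k = 1:30
  I = 1/k - 5*I;       % the error is amplified by a factor -5 at each step
end
I    % wrong by roughly u*5^30 ~ 1e4, while the exact I_30 is below 0.0065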

11. Prove (2.37).
    [Solution: notice that

    \[
    \frac{|x \oplus y - (x+y)|}{|x+y|} \le \frac{|x \oplus y - (fl(x) + fl(y))|}{|x+y|} + \frac{|fl(x) - x + fl(y) - y|}{|x+y|}.
    \]

    Then, use (2.36) and (2.35).]

12. Given x, y, z ∈ F such that x ⊕ y, y ⊕ z and x ⊕ y ⊕ z fall into the range of F, show that

    \[
    |(x \oplus y) \oplus z - (x + y + z)| \le C_1(2|x + y| + |z|)u,
    \]
    \[
    |x \oplus (y \oplus z) - (x + y + z)| \le C_2(|x| + 2|y + z|)u.
    \]

13. Which of the following approximations of π,

    \[
    \pi = 4\left(1 - \frac13 + \frac15 - \frac17 + \frac19 - \ldots\right),
    \]
    \[
    \pi = 6\left(0.5 + \frac{(0.5)^3}{2\cdot 3} + \frac{3(0.5)^5}{2\cdot 4\cdot 5} + \frac{3\cdot 5\,(0.5)^7}{2\cdot 4\cdot 6\cdot 7} + \ldots\right), \tag{2.40}
    \]

    better limits the propagation of rounding errors? Using MATLAB, compare the results obtained as a function of the number of terms in each sum in (2.40).
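A possible MATLAB comparison of the two partial sums (our own coding; the second series is the arcsin expansion evaluated at 0.5):

n = 100;
k = 0:n-1;
p1 = 4*sum((-1).^k ./ (2*k+1));       % first series: alternating, slow
p2 = 0.5; term = 0.5;                 % second series of (2.40)
for m = 1:n
  term = term*(2*m-1)/(2*m)*0.25;     % recurrence for the m-th numerator
  p2 = p2 + term/(2*m+1);
end
p2 = 6*p2;
[abs(p1-pi) abs(p2-pi)]               % ~1e-2 versus machine accuracy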

14. Analyze the stability, with respect to the propagation of rounding errors, of the following two MATLAB codes for evaluating f(x) = (eˣ − 1)/x for |x| ≪ 1

    % Algorithm 1
    if x == 0
      f = 1;
    else
      f = (exp(x) - 1) / x;
    end

    % Algorithm 2
    y = exp(x);
    if y == 1
      f = 1;
    else
      f = (y - 1) / log(y);
    end

    [Solution: the first algorithm is inaccurate due to cancellation errors, while the second one (in the presence of the round digit) is stable and accurate.]

15. In binary arithmetic one can show [Dek71] that the rounding error in the sum of two numbers a and b, with a ≥ b, can be computed as

    ((a + b) − a) − b.

    Based on this property, a method called the Kahan compensated sum has been proposed to compute the sum of n addends a_i in such a way that the rounding errors are compensated. In practice, letting the initial rounding error be e₁ = 0 and s₁ = a₁, at the i-th step, with i ≥ 2, the algorithm evaluates y_i = a_i − e_{i−1}, updates the sum as s_i = s_{i−1} + y_i, and computes the new rounding error as e_i = (s_i − s_{i−1}) − y_i. Implement this algorithm in MATLAB and check its accuracy by evaluating again the second expression in (2.40).
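A minimal sketch of the compensated sum in MATLAB (our own coding of the algorithm described in the exercise):

function s = kahan_sum(a)
% Compensated summation of the addends a(1), ..., a(n).
s = a(1); e = 0;
for i = 2:length(a)
  y = a(i) - e;            % correct the addend by the previous error
  snew = s + y;
  e = (snew - s) - y;      % rounding error of this addition
  s = snew;
end

Applied, for instance, to the terms of the second series in (2.40), it typically matches the accuracy of a far more careful summation order.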

16. The area A(T) of a triangle T with sides a, b and c can be computed using the formula

    \[
    A(T) = \sqrt{p(p-a)(p-b)(p-c)},
    \]

    where p is half the perimeter of T. Show that in the case of strongly deformed triangles (a ≃ b + c) this formula lacks accuracy, and check this experimentally.

3 Direct Methods for the Solution of Linear Systems

A system of m linear equations in n unknowns consists of a set of algebraic relations of the form

\[
\sum_{j=1}^{n} a_{ij} x_j = b_i, \qquad i = 1, \ldots, m, \tag{3.1}
\]

where the x_j are the unknowns, the a_{ij} are the coefficients of the system and the b_i are the components of the right hand side. System (3.1) can be more conveniently written in matrix form as

\[
Ax = b, \tag{3.2}
\]

where we have denoted by A = (a_{ij}) ∈ ℂ^{m×n} the coefficient matrix, by b = (b_i) ∈ ℂ^m the right side vector and by x = (x_i) ∈ ℂⁿ the unknown vector, respectively. We call a solution of (3.2) any n-tuple of values x_i which satisfies (3.1).

In this chapter we shall mainly deal with real-valued square systems of order n, that is, systems of the form (3.2) with A ∈ ℝ^{n×n} and b ∈ ℝⁿ. In such cases existence and uniqueness of the solution of (3.2) are ensured if one of the following (equivalent) hypotheses holds:

1. A is invertible;
2. rank(A) = n;
3. the homogeneous system Ax = 0 admits only the null solution.


The solution of system (3.2) is formally provided by Cramer's rule

\[
x_j = \frac{\Delta_j}{\det(A)}, \qquad j = 1, \ldots, n, \tag{3.3}
\]

where Δ_j is the determinant of the matrix obtained by substituting the j-th column of A with the right hand side b. This formula is, however, of little practical use. Indeed, if the determinants are evaluated by the recursive relation (1.4), the computational effort of Cramer's rule is of the order of (n + 1)! flops and therefore turns out to be unacceptable even for small dimensions of A (for instance, a computer able to perform 10⁹ flops per second would take 9.6 · 10⁴⁷ years to solve a linear system of only 50 equations).

For this reason, numerical methods that are alternatives to Cramer's rule have been developed. They are called direct methods if they yield the solution of the system in a finite number of steps, iterative if they (theoretically) require an infinite number of steps. Iterative methods will be addressed in the next chapter. We notice from now on that the choice between a direct and an iterative method does not depend only on the theoretical efficiency of the scheme, but also on the particular type of matrix, on memory storage requirements and, finally, on the architecture of the computer.

3.1 Stability Analysis of Linear Systems

Solving a linear system by a numerical method invariably leads to the introduction of rounding errors. Only the use of stable numerical methods can prevent the propagation of such errors from polluting the accuracy of the solution. In this section two aspects of stability analysis will be addressed. Firstly, we will analyze the sensitivity of the solution of (3.2) to changes in the data A and b (forward a priori analysis). Secondly, assuming that an approximate solution x̂ of (3.2) is available, we shall quantify the perturbations on the data A and b that make x̂ the exact solution of a perturbed system (backward a priori analysis). The size of these perturbations will in turn allow us to measure the accuracy of the computed solution x̂ by the use of a posteriori analysis.

3.1.1 The Condition Number of a Matrix

The condition number of a matrix A ∈ ℂ^{n×n} is defined as

\[
K(A) = \|A\|\,\|A^{-1}\|, \tag{3.4}
\]

where ‖·‖ is an induced matrix norm. In general K(A) depends on the choice of the norm; this will be made clear by introducing a subscript


into the notation; for instance, K∞(A) = ‖A‖∞ ‖A⁻¹‖∞. More generally, K_p(A) will denote the condition number of A in the p-norm. Remarkable instances are p = 1, p = 2 and p = ∞ (we refer to Exercise 1 for the relations among K₁(A), K₂(A) and K∞(A)).

As already noticed in Example 2.3, an increase in the condition number produces a higher sensitivity of the solution of the linear system to changes in the data. Let us start by noticing that K(A) ≥ 1, since

\[
1 = \|AA^{-1}\| \le \|A\|\,\|A^{-1}\| = K(A).
\]

Moreover, K(A⁻¹) = K(A) and, ∀α ∈ ℂ with α ≠ 0, K(αA) = K(A). Finally, if A is orthogonal, K₂(A) = 1, since ‖A‖₂ = √(ρ(AᵀA)) = √(ρ(I)) = 1 and A⁻¹ = Aᵀ. The condition number of a singular matrix is set equal to infinity.

For p = 2, K₂(A) can be characterized as follows. Starting from (1.21), it can be proved that

\[
K_2(A) = \|A\|_2\,\|A^{-1}\|_2 = \frac{\sigma_1(A)}{\sigma_n(A)},
\]

where σ₁(A) and σ_n(A) are the maximum and minimum singular values of A (see Property 1.7). As a consequence, in the case of symmetric positive definite matrices we have

\[
K_2(A) = \frac{\lambda_{\max}}{\lambda_{\min}} = \rho(A)\rho(A^{-1}), \tag{3.5}
\]

where λ_max and λ_min are the maximum and minimum eigenvalues of A. To check (3.5), notice that

\[
\|A\|_2 = \sqrt{\rho(A^T A)} = \sqrt{\rho(A^2)} = \sqrt{\lambda_{\max}^2} = \lambda_{\max}.
\]

Moreover, since λ(A⁻¹) = 1/λ(A), one gets ‖A⁻¹‖₂ = 1/λ_min, from which (3.5) follows. For this reason, K₂(A) is called the spectral condition number.
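These characterizations are easy to verify numerically; a MATLAB sketch for the Hilbert matrix of order 4 (which is symmetric positive definite):

A = hilb(4);
s = svd(A);
s(1)/s(end)        % sigma_1/sigma_n, about 1.55e4
cond(A, 2)         % the same quantity, K_2(A)
l = eig(A);
max(l)/min(l)      % (3.5): lambda_max/lambda_min for s.p.d. matrices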

Remark 3.1 Define the relative distance of A ∈ ℂ^{n×n} from the set of singular matrices with respect to the p-norm by

\[
\mathrm{dist}_p(A) = \min\left\{ \frac{\|\delta A\|_p}{\|A\|_p} : A + \delta A \ \text{is singular} \right\}.
\]

It can then be shown that ([Kah66], [Gas83])

\[
\mathrm{dist}_p(A) = \frac{1}{K_p(A)}. \tag{3.6}
\]

Equation (3.6) suggests that a matrix A with a high condition number can behave like a singular matrix of the form A + δA. In other words, null


perturbations in the right hand side do not necessarily yield nonvanishing changes in the solution since, if A + δA is singular, the homogeneous system (A + δA)z = 0 no longer admits only the null solution. From (3.6) it also follows that if

\[
\|\delta A\|_p\,\|A^{-1}\|_p < 1, \tag{3.7}
\]

then A + δA is nonsingular.

Relation (3.6) might suggest that a natural candidate for measuring the ill-conditioning of a matrix is its determinant, since from (3.3) one is prompted to conclude that small determinants mean nearly-singular matrices. This conclusion is however wrong, as there exist examples of matrices with small (respectively, high) determinants and small (respectively, high) condition numbers (see Exercise 2).

3.1.2 Forward a priori Analysis

In this section we introduce a measure of the sensitivity of the system to changes in the data. These changes will be interpreted in Section 3.10 as the effects of rounding errors induced by the numerical method used to solve the system. For a more comprehensive analysis of the subject we refer to [Dat95], [GL89], [Ste73] and [Var62].

Due to rounding errors, a numerical method for solving (3.2) does not provide the exact solution but only an approximate one, which satisfies a perturbed system. In other words, a numerical method yields an (exact) solution x + δx of the perturbed system

\[
(A + \delta A)(x + \delta x) = b + \delta b. \tag{3.8}
\]

The next result provides an estimate of δx in terms of δA and δb.

Theorem 3.1 Let A ∈ ℝ^{n×n} be a nonsingular matrix and δA ∈ ℝ^{n×n} be such that (3.7) is satisfied for a matrix norm ‖·‖. Then, if x ∈ ℝⁿ is the solution of Ax = b with b ∈ ℝⁿ (b ≠ 0) and δx ∈ ℝⁿ satisfies (3.8) for δb ∈ ℝⁿ,

\[
\frac{\|\delta x\|}{\|x\|} \le \frac{K(A)}{1 - K(A)\|\delta A\|/\|A\|} \left( \frac{\|\delta b\|}{\|b\|} + \frac{\|\delta A\|}{\|A\|} \right). \tag{3.9}
\]

Proof. From (3.7) it follows that the matrix A⁻¹δA has norm less than 1. Then, due to Theorem 1.5, I + A⁻¹δA is invertible and from (1.26) it follows that

\[
\|(I + A^{-1}\delta A)^{-1}\| \le \frac{1}{1 - \|A^{-1}\delta A\|} \le \frac{1}{1 - \|A^{-1}\|\,\|\delta A\|}. \tag{3.10}
\]

On the other hand, solving for δx in (3.8) and recalling that Ax = b, one gets

\[
\delta x = (I + A^{-1}\delta A)^{-1} A^{-1} (\delta b - \delta A\, x),
\]

from which, passing to the norms and using (3.10), it follows that

\[
\|\delta x\| \le \frac{\|A^{-1}\|}{1 - \|A^{-1}\|\,\|\delta A\|} \left( \|\delta b\| + \|\delta A\|\,\|x\| \right).
\]

Finally, dividing both sides by ‖x‖ (which is nonzero since b ≠ 0 and A is nonsingular) and noticing that ‖x‖ ≥ ‖b‖/‖A‖, the result follows. □

Well-conditioning alone is not enough to yield an accurate solution of the linear system. It is indeed crucial, as pointed out in Chapter 2, to resort to stable algorithms. Conversely, ill-conditioning does not necessarily exclude that, for particular choices of the right side b, the overall conditioning of the system is good (see Exercise 4).

A particular case of Theorem 3.1 is the following.

Theorem 3.2 Assume that the conditions of Theorem 3.1 hold and let δA = 0. Then

\[
\frac{1}{K(A)}\frac{\|\delta b\|}{\|b\|} \le \frac{\|\delta x\|}{\|x\|} \le K(A)\frac{\|\delta b\|}{\|b\|}. \tag{3.11}
\]

Proof. We shall prove only the first inequality, since the second one follows directly from (3.9). The relation δx = A⁻¹δb yields ‖δb‖ ≤ ‖A‖ ‖δx‖. Multiplying both sides by ‖x‖ and recalling that ‖x‖ ≤ ‖A⁻¹‖ ‖b‖, it follows that ‖x‖ ‖δb‖ ≤ K(A) ‖b‖ ‖δx‖, which is the desired inequality. □

In order to employ the inequalities (3.10) and (3.11) in the analysis of the propagation of rounding errors in the case of direct methods, ‖δA‖ and ‖δb‖ should be bounded in terms of the dimension of the system and of the characteristics of the floating-point arithmetic that is being used. It is indeed reasonable to expect that the perturbations induced by a method for solving a linear system are such that ‖δA‖ ≤ γ‖A‖ and ‖δb‖ ≤ γ‖b‖, γ being a positive number that depends on the roundoff unit u (for example, we shall assume henceforth that γ = β^{1−t}, where β is the base and t is the number of digits of the mantissa of the floating-point system F). In such a case (3.9) can be completed by the following theorem.

Theorem 3.3 Assume that ‖δA‖ ≤ γ‖A‖, ‖δb‖ ≤ γ‖b‖ with γ ∈ ℝ⁺ and δA ∈ ℝ^{n×n}, δb ∈ ℝⁿ. Then, if γK(A) < 1, the following inequalities hold:

\[
\frac{\|x + \delta x\|}{\|x\|} \le \frac{1 + \gamma K(A)}{1 - \gamma K(A)}, \tag{3.12}
\]

\[
\frac{\|\delta x\|}{\|x\|} \le \frac{2\gamma}{1 - \gamma K(A)}\, K(A). \tag{3.13}
\]


Proof. From (3.8) it follows that (I + A⁻¹δA)(x + δx) = x + A⁻¹δb. Moreover, since γK(A) < 1 and ‖δA‖ ≤ γ‖A‖, it turns out that I + A⁻¹δA is nonsingular. Taking the inverse of such a matrix and passing to the norms, we get ‖x + δx‖ ≤ ‖(I + A⁻¹δA)⁻¹‖(‖x‖ + γ‖A⁻¹‖ ‖b‖). From Theorem 1.5 it then follows that

\[
\|x + \delta x\| \le \frac{1}{1 - \|A^{-1}\delta A\|} \left( \|x\| + \gamma\|A^{-1}\|\,\|b\| \right),
\]

which implies (3.12), since ‖A⁻¹δA‖ ≤ γK(A) and ‖b‖ ≤ ‖A‖ ‖x‖.

Let us prove (3.13). Subtracting (3.2) from (3.8) it follows that

\[
A\,\delta x = -\delta A(x + \delta x) + \delta b.
\]

Inverting A and passing to the norms, the following inequality is obtained

\[
\|\delta x\| \le \|A^{-1}\|\,\|\delta A\|\,\|x + \delta x\| + \|A^{-1}\|\,\|\delta b\|
\le \gamma K(A)\,\|x + \delta x\| + \gamma\|A^{-1}\|\,\|b\|. \tag{3.14}
\]

Dividing both sides by ‖x‖ and using the triangular inequality ‖x + δx‖ ≤ ‖δx‖ + ‖x‖, we finally get (3.13). □

Remarkable instances of perturbations δA and δb are those for which |δA| ≤ γ|A| and |δb| ≤ γ|b| with γ ≥ 0. Hereafter, the absolute value notation B = |A| denotes the n × n matrix having entries b_{ij} = |a_{ij}|, i, j = 1, …, n, and the inequality C ≤ D, with C, D ∈ ℝ^{m×n}, means

c_{ij} ≤ d_{ij} for i = 1, …, m, j = 1, …, n.

If is considered, from (3.14) it follows that

∞

|A’1 | |A| |x| + |A’1 | |b| ∞

δx ∞

¤γ

(1 ’ γ |A’1 | |A| ∞ ) x ∞

x∞

(3.15)

2γ

|A’1 | |A|

¤ ∞.

1 ’ γ |A’1 | |A| ∞

Estimate (3.15) is generally too pessimistic; however, the following componentwise error estimates for δx can be derived from (3.15):

\[
|\delta x_i| \le \gamma\, |r_{(i)}^T|\,|A|\,|x + \delta x|, \quad i = 1, \ldots, n, \quad \text{if } \delta b = 0,
\]
\[
\frac{|\delta x_i|}{|x_i|} \le \gamma\, \frac{|r_{(i)}^T|\,|b|}{|r_{(i)}^T b|}, \quad i = 1, \ldots, n, \quad \text{if } \delta A = 0, \tag{3.16}
\]

r_{(i)}^T being the row vector e_i^T A⁻¹. Estimates (3.16) are more stringent than (3.15), as can be seen in Example 3.1. The first inequality in (3.16) can be used when the perturbed solution x + δx is known, x + δx being henceforth the solution computed by a numerical method.


In the case where |A⁻¹||b| = |x|, the parameter γ in (3.15) is equal to 1. For such systems the components of the solution are insensitive to perturbations of the right side. A slightly worse situation occurs when A is a triangular M-matrix and b has positive entries: in such a case γ is bounded by 2n − 1, since

\[
|r_{(i)}^T|\,|A|\,|x| \le (2n - 1)|x_i|.
\]

For further details on the subject we refer to [Ske79], [CI95] and [Hig89]. Results linking componentwise estimates to normwise estimates through the so-called hypernorms can be found in [ADR92].

Example 3.1 Consider the linear system Ax = b with

\[
A = \begin{bmatrix} \alpha & 1/\alpha \\ 0 & 1/\alpha \end{bmatrix}, \qquad
b = \begin{bmatrix} \alpha^2 + 1/\alpha \\ 1/\alpha \end{bmatrix},
\]

which has solution xᵀ = (α, 1), where 0 < α < 1. Let us compare the results obtained using (3.15) and (3.16). From

\[
|A^{-1}|\,|A|\,|x| = |A^{-1}|\,|b| = \left( \alpha + \frac{2}{\alpha^2},\ 1 \right)^T \tag{3.17}
\]

it follows that the supremum of (3.17) is unbounded as α → 0, exactly as happens in the case of ‖A‖∞. On the other hand, the amplification factor of the error in (3.16) is bounded. Indeed, the component of maximum absolute value of the solution, x₂, satisfies |r_{(2)}^T||A||x|/|x₂| = 1. •
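The contrast in Example 3.1 can be reproduced numerically; a sketch under the example's assumptions, with a sample value of α:

alpha = 1e-6;
A = [alpha 1/alpha; 0 1/alpha];
x = [alpha; 1]; b = A*x;
Ainv = inv(A);
norm(abs(Ainv)*abs(A)*abs(x), inf)    % ~ 2/alpha^2: unbounded as alpha -> 0
r2 = abs(Ainv(2,:));                  % |r_(2)^T| = |e_2^T A^{-1}|
(r2*abs(A)*abs(x))/abs(x(2))          % equals 1 for every alpha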

3.1.3 Backward a priori Analysis

The numerical methods considered thus far do not require the explicit computation of the inverse of A to solve Ax = b. However, we can always assume that they yield an approximate solution of the form x̂ = Cb, where the matrix C, due to rounding errors, is an approximation of A⁻¹. In practice, C is very seldom constructed; should this happen, the following result yields an estimate of the error made by substituting C for A⁻¹ (see [IK66], Chapter 2, Theorem 7).

Property 3.1 Let R = AC − I; if ‖R‖ < 1, then A and C are nonsingular and

\[
\|A^{-1}\| \le \frac{\|C\|}{1 - \|R\|}, \qquad
\frac{\|R\|}{\|A\|} \le \|C - A^{-1}\| \le \frac{\|C\|\,\|R\|}{1 - \|R\|}. \tag{3.18}
\]

In the frame of backward a priori analysis we can interpret C as being the inverse of A + δA (for a suitable unknown δA). We are thus assuming that C(A + δA) = I. This yields

\[
\delta A = C^{-1} - A = -(AC - I)C^{-1} = -RC^{-1}
\]

and, as a consequence, if ‖R‖ < 1 it turns out that

\[
\|\delta A\| \le \frac{\|R\|\,\|A\|}{1 - \|R\|}, \tag{3.19}
\]

having used the first inequality in (3.18), where A is assumed to be an approximation of the inverse of C (notice that the roles of C and A can be interchanged).

3.1.4 A posteriori Analysis

Having approximated the inverse of A by a matrix C turns into having an approximation of the solution of the linear system (3.2). Let us denote by y a known approximate solution. The aim of the a posteriori analysis is to relate the (unknown) error e = y − x to quantities that can be computed using y and C.

The starting point of the analysis relies on the fact that the residual vector r = b − Ay is in general nonzero, since y is just an approximation of the unknown exact solution. The residual can be related to the error through Property 3.1 as follows. We have e = A⁻¹(Ay − b) = −A⁻¹r and thus, if ‖R‖ < 1, then

\[
\|e\| \le \frac{\|r\|\,\|C\|}{1 - \|R\|}. \tag{3.20}
\]

Notice that the estimate does not necessarily require y to coincide with the solution x̂ = Cb of the backward a priori analysis. One could therefore think of computing C only for the purpose of using the estimate (3.20) (for instance, in the case where (3.2) is solved through the Gauss elimination method, one can compute C a posteriori using the LU factorization of A; see Sections 3.3 and 3.3.1).

We conclude by noticing that if δb is interpreted in (3.11) as being the residual of the computed solution y = x + δx, it also follows that

\[
\frac{\|e\|}{\|x\|} \le K(A)\frac{\|r\|}{\|b\|}. \tag{3.21}
\]

The estimate (3.21) is not used in practice since the computed residual is itself affected by rounding errors. A more significant estimate (in the ‖·‖∞ norm) is obtained by letting r̂ = fl(b − Ay) and assuming that r̂ = r + δr with |δr| ≤ γ_{n+1}(|A||y| + |b|), where γ_{n+1} = (n + 1)u/(1 − (n + 1)u) > 0, from which we have

\[
\frac{\|e\|_\infty}{\|y\|_\infty} \le \frac{\|\,|A^{-1}|\left(|\hat r| + \gamma_{n+1}(|A|\,|y| + |b|)\right)\|_\infty}{\|y\|_\infty}.
\]

Formulae like this last one are implemented in the linear algebra library LAPACK (see [ABB+92]).


3.2 Solution of Triangular Systems

Consider the nonsingular 3 × 3 lower triangular system

\[
\begin{bmatrix} l_{11} & 0 & 0 \\ l_{21} & l_{22} & 0 \\ l_{31} & l_{32} & l_{33} \end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix}
=
\begin{bmatrix} b_1 \\ b_2 \\ b_3 \end{bmatrix}.
\]

Since the matrix is nonsingular, its diagonal entries l_{ii}, i = 1, 2, 3, are nonvanishing, hence we can solve sequentially for the unknown values x_i, i = 1, 2, 3, as follows:

x₁ = b₁/l₁₁,
x₂ = (b₂ − l₂₁x₁)/l₂₂,
x₃ = (b₃ − l₃₁x₁ − l₃₂x₂)/l₃₃.

This algorithm can be extended to n × n systems and is called forward substitution. In the case of a system Lx = b, with L a nonsingular lower triangular matrix of order n (n ≥ 2), the method takes the form

\[
x_1 = \frac{b_1}{l_{11}}, \qquad
x_i = \frac{1}{l_{ii}}\left( b_i - \sum_{j=1}^{i-1} l_{ij} x_j \right), \quad i = 2, \ldots, n. \tag{3.22}
\]

The number of multiplications and divisions needed to execute the algorithm is n(n + 1)/2, while the number of sums and subtractions is n(n − 1)/2. The global operation count for (3.22) is thus n² flops.

Similar conclusions can be drawn for a linear system Ux = b, where U is a nonsingular upper triangular matrix of order n (n ≥ 2). In this case the algorithm is called backward substitution and in the general case can be written as

\[
x_n = \frac{b_n}{u_{nn}}, \qquad
x_i = \frac{1}{u_{ii}}\left( b_i - \sum_{j=i+1}^{n} u_{ij} x_j \right), \quad i = n-1, \ldots, 1. \tag{3.23}
\]

Its computational cost is still n² flops.

3.2.1 Implementation of Substitution Methods

Each i-th step of algorithm (3.22) requires performing the scalar product between the row vector L(i, 1:i−1) (this notation denoting the vector extracted from matrix L by taking the elements of the i-th row from the first to the (i−1)-th column) and the column vector x(1:i−1). The access to matrix L is thus by row; for that reason, the forward substitution algorithm, when implemented in the form above, is called row-oriented.

Its coding is reported in Program 1 (the function mat_square called by forward_row merely checks that L is a square matrix).

Program 1 - forward_row : Forward substitution: row-oriented version

function [x]=forward_row(L,b)
[n]=mat_square(L); x(1) = b(1)/L(1,1);
for i = 2:n, x(i) = (b(i)-L(i,1:i-1)*(x(1:i-1))')/L(i,i); end
x=x';

To obtain a column-oriented version of the same algorithm, we take advantage of the fact that the i-th component of the vector x, once computed, can be conveniently eliminated from the system.

An implementation of such a procedure, where the solution x is overwritten on the right-hand-side vector b, is reported in Program 2.

Program 2 - forward_col : Forward substitution: column-oriented version

function [b]=forward_col(L,b)
[n]=mat_square(L);
for j=1:n-1,
  b(j)= b(j)/L(j,j); b(j+1:n)=b(j+1:n)-b(j)*L(j+1:n,j);
end; b(n) = b(n)/L(n,n);

Implementing the same algorithm in a row-oriented rather than a column-oriented fashion might dramatically change its performance (but, of course, not the solution). The choice of the form of implementation must therefore be subordinated to the specific hardware that is used.

Similar considerations hold for the backward substitution method, presented in (3.23) in its row-oriented version. In Program 3 only the column-oriented version of the algorithm is coded. As usual, the vector x is overwritten on b.

Program 3 - backward_col : Backward substitution: column-oriented version

function [b]=backward_col(U,b)
[n]=mat_square(U);
for j = n:-1:2,
  b(j)=b(j)/U(j,j); b(1:j-1)=b(1:j-1)-b(j)*U(1:j-1,j);
end; b(1) = b(1)/U(1,1);

When large triangular systems must be solved, only the triangular portion of the matrix should be stored, leading to considerable savings of memory resources.


3.2.2 Rounding Error Analysis

The analysis developed so far has not accounted for the presence of rounding errors. When these are included, the forward and backward substitution algorithms no longer yield the exact solutions of the systems Lx = b and Ux = b, but rather provide approximate solutions x̂ that can be regarded as exact solutions of the perturbed systems

\[
(L + \delta L)\hat x = b, \qquad (U + \delta U)\hat x = b,
\]

where δL = (δl_{ij}) and δU = (δu_{ij}) are perturbation matrices. In order to apply the estimates (3.9) derived in Section 3.1.2, we must bound the perturbation matrices δL and δU as functions of the entries of L and U, of their size and of the characteristics of the floating-point arithmetic. For this purpose, it can be shown that

\[
|\delta T| \le \frac{nu}{1 - nu}\,|T|, \tag{3.24}
\]

where T is equal to L or U and u = ½β^{1−t} is the roundoff unit defined in (2.34). Clearly, if nu < 1, from (3.24) it turns out, using a Taylor expansion, that |δT| ≤ nu|T| + O(u²). Moreover, from (3.24) and (3.9) it follows that, if nuK(T) < 1, then

\[
\frac{\|\hat x - x\|}{\|x\|} \le \frac{nuK(T)}{1 - nuK(T)} = nuK(T) + O(u^2) \tag{3.25}
\]

for the norms ‖·‖₁, ‖·‖∞ and the Frobenius norm. If u is sufficiently small (as typically happens), the perturbations introduced by the rounding errors in the solution of a triangular system can thus be neglected. As a consequence, the accuracy of the solution computed by the forward or backward substitution algorithm is generally very high.

These results can be improved by introducing some additional assumptions on the entries of L or U. In particular, if the entries of U are such that |u_{ii}| ≥ |u_{ij}| for any j > i, then

\[
|\hat x_i - x_i| \le 2^{n-i+1}\, \frac{nu}{1 - nu}\, \max_{j \ge i} |\hat x_j|, \qquad 1 \le i \le n.
\]

The same result holds if T = L, provided that |l_{ii}| ≥ |l_{ij}| for any j < i, or if L and U are diagonally dominant. The previous estimates will be employed in Sections 3.3.1 and 3.4.2.

For the proofs of the results reported so far, see [FM67], [Hig89] and [Hig88].

3.2.3 Inverse of a Triangular Matrix

The algorithm (3.23) can be employed to explicitly compute the inverse of an upper triangular matrix. Indeed, given an upper triangular matrix U, the column vectors v_i of the inverse V = (v₁, …, v_n) of U satisfy the linear systems

\[
U v_i = e_i, \qquad i = 1, \ldots, n, \tag{3.26}
\]

where {e_i} is the canonical basis of ℝⁿ (defined in Example 1.3). Solving for v_i thus requires applying algorithm (3.23) n times to (3.26).

This procedure is quite inefficient, since at least half the entries of the inverse of U are null. Let us take advantage of this as follows. Denote by v^k = (v_{1k}, …, v_{kk})ᵀ the vector of size k such that

\[
U^{(k)} v^k = l^k, \qquad k = 1, \ldots, n, \tag{3.27}
\]

where U^{(k)} is the principal submatrix of U of order k and l^k is the vector of ℝᵏ having null entries, except the last one, which is equal to 1. Systems (3.27) are upper triangular of order k and can again be solved by method (3.23). We end up with the following inversion algorithm for upper triangular matrices: for k = n, n − 1, …, 1 compute

\[
v_{kk} = u_{kk}^{-1}, \qquad
v_{ik} = -u_{ii}^{-1} \sum_{j=i+1}^{k} u_{ij} v_{jk}, \quad i = k-1, k-2, \ldots, 1. \tag{3.28}
\]

At the end of this procedure the vectors v^k furnish the nonvanishing entries of the columns of U⁻¹. The algorithm requires about n³/3 + (3/4)n² flops. Once again, due to rounding errors, algorithm (3.28) no longer yields the exact solution, but an approximation of it; the error that is introduced can be estimated using the backward a priori analysis carried out in Section 3.1.3.

A similar procedure can be constructed from (3.22) to compute the inverse of a lower triangular matrix.
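A MATLAB sketch of algorithm (3.28) (our own coding; it only touches the upper triangle of U):

function V = invupper(U)
% Inverse of a nonsingular upper triangular U via (3.28).
n = size(U,1); V = zeros(n);
for k = n:-1:1
  V(k,k) = 1/U(k,k);
  for i = k-1:-1:1
    V(i,k) = -(U(i,i+1:k)*V(i+1:k,k))/U(i,i);
  end
end

A quick check: for U = triu(rand(5)) + 5*eye(5), norm(invupper(U)*U - eye(5)) is of the order of the roundoff unit.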

3.3 The Gaussian Elimination Method (GEM) and LU Factorization

The Gaussian elimination method aims at reducing the system Ax = b to an equivalent system (that is, one having the same solution) of the form Ux = b̂, where U is an upper triangular matrix and b̂ is an updated right side vector. This latter system can then be solved by the backward substitution method. Let us denote the original system by A⁽¹⁾x = b⁽¹⁾. During the reduction procedure we basically employ the property which states that replacing one of the equations by the difference between this equation and another one multiplied by a nonnull constant yields an equivalent system (i.e., one with the same solution).


Thus, consider a nonsingular matrix A ∈ ℝ^{n×n}, and suppose that the diagonal entry a₁₁ is nonvanishing. Introducing the multipliers

\[
m_{i1} = \frac{a_{i1}^{(1)}}{a_{11}^{(1)}}, \qquad i = 2, 3, \ldots, n,
\]

where the a_{ij}^{(1)} denote the elements of A⁽¹⁾, it is possible to eliminate the unknown x₁ from the rows other than the first one by simply subtracting from row i, with i = 2, …, n, the first row multiplied by m_{i1}, and doing the same on the right side. If we now define

\[
a_{ij}^{(2)} = a_{ij}^{(1)} - m_{i1} a_{1j}^{(1)}, \qquad i, j = 2, \ldots, n,
\]
\[
b_i^{(2)} = b_i^{(1)} - m_{i1} b_1^{(1)}, \qquad i = 2, \ldots, n,
\]

where the b_i^{(1)} denote the components of b⁽¹⁾, we get a new system of the form

\[
\begin{bmatrix}
a_{11}^{(1)} & a_{12}^{(1)} & \ldots & a_{1n}^{(1)} \\
0 & a_{22}^{(2)} & \ldots & a_{2n}^{(2)} \\
\vdots & \vdots & & \vdots \\
0 & a_{n2}^{(2)} & \ldots & a_{nn}^{(2)}
\end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}
=
\begin{bmatrix} b_1^{(1)} \\ b_2^{(2)} \\ \vdots \\ b_n^{(2)} \end{bmatrix},
\]

which we denote by A⁽²⁾x = b⁽²⁾ and which is equivalent to the starting one. Similarly, we can transform the system in such a way that the unknown x₂ is eliminated from rows 3, …, n. In general, we end up with the finite sequence of systems

\[
A^{(k)} x = b^{(k)}, \qquad 1 \le k \le n, \tag{3.29}
\]

where, for k ≥ 2, the matrix A⁽ᵏ⁾ takes the form

\[
A^{(k)} = \begin{bmatrix}
a_{11}^{(1)} & a_{12}^{(1)} & \ldots & \ldots & \ldots & a_{1n}^{(1)} \\
0 & a_{22}^{(2)} & & & & a_{2n}^{(2)} \\
\vdots & & \ddots & & & \vdots \\
0 & \ldots & 0 & a_{kk}^{(k)} & \ldots & a_{kn}^{(k)} \\
\vdots & & \vdots & \vdots & & \vdots \\
0 & \ldots & 0 & a_{nk}^{(k)} & \ldots & a_{nn}^{(k)}
\end{bmatrix},
\]


having assumed that a_{ii}^{(i)} ≠ 0 for i = 1, …, k − 1. It is clear that for k = n we obtain the upper triangular system A⁽ⁿ⁾x = b⁽ⁿ⁾

\[
\begin{bmatrix}
a_{11}^{(1)} & a_{12}^{(1)} & \ldots & \ldots & a_{1n}^{(1)} \\
0 & a_{22}^{(2)} & & & a_{2n}^{(2)} \\
\vdots & & \ddots & & \vdots \\
0 & & & & a_{nn}^{(n)}
\end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}
=
\begin{bmatrix} b_1^{(1)} \\ b_2^{(2)} \\ \vdots \\ b_n^{(n)} \end{bmatrix}.
\]

Consistently with the notation previously introduced, we denote by U the upper triangular matrix A⁽ⁿ⁾. The entries a_{kk}^{(k)} are called pivots and must obviously be nonnull for k = 1, …, n − 1.

In order to highlight the formulae which transform the k-th system into the (k+1)-th one, for k = 1, …, n − 1 we assume that a_{kk}^{(k)} ≠ 0 and define the multipliers

\[
m_{ik} = \frac{a_{ik}^{(k)}}{a_{kk}^{(k)}}, \qquad i = k+1, \ldots, n. \tag{3.30}
\]

Then we let

\[
a_{ij}^{(k+1)} = a_{ij}^{(k)} - m_{ik} a_{kj}^{(k)}, \qquad i, j = k+1, \ldots, n,
\]
\[
b_i^{(k+1)} = b_i^{(k)} - m_{ik} b_k^{(k)}, \qquad i = k+1, \ldots, n. \tag{3.31}
\]

Example 3.2 Let us use GEM to solve the system

\[
(A^{(1)}x = b^{(1)}) \qquad
\begin{cases}
x_1 + \tfrac12 x_2 + \tfrac13 x_3 = \tfrac{11}{6}, \\[2pt]
\tfrac12 x_1 + \tfrac13 x_2 + \tfrac14 x_3 = \tfrac{13}{12}, \\[2pt]
\tfrac13 x_1 + \tfrac14 x_2 + \tfrac15 x_3 = \tfrac{47}{60},
\end{cases}
\]

which admits the solution x = (1, 1, 1)ᵀ. At the first step we compute the multipliers m₂₁ = 1/2 and m₃₁ = 1/3, and subtract from the second and third equation of the system the first row multiplied by m₂₁ and m₃₁, respectively. We obtain the equivalent system

\[
(A^{(2)}x = b^{(2)}) \qquad
\begin{cases}
x_1 + \tfrac12 x_2 + \tfrac13 x_3 = \tfrac{11}{6}, \\[2pt]
\tfrac{1}{12} x_2 + \tfrac{1}{12} x_3 = \tfrac16, \\[2pt]
\tfrac{1}{12} x_2 + \tfrac{4}{45} x_3 = \tfrac{31}{180}.
\end{cases}
\]

If we now subtract the second row multiplied by m₃₂ = 1 from the third one, we end up with the upper triangular system

\[
(A^{(3)}x = b^{(3)}) \qquad
\begin{cases}
x_1 + \tfrac12 x_2 + \tfrac13 x_3 = \tfrac{11}{6}, \\[2pt]
\tfrac{1}{12} x_2 + \tfrac{1}{12} x_3 = \tfrac16, \\[2pt]
\tfrac{1}{180} x_3 = \tfrac{1}{180},
\end{cases}
\]

from which we immediately compute x₃ = 1 and then, by back substitution, the remaining unknowns x₁ = x₂ = 1. •

Remark 3.2 The matrix in Example 3.2 is called the Hilbert matrix of order 3. In the general n × n case, its entries are

\[
h_{ij} = \frac{1}{i + j - 1}, \qquad i, j = 1, \ldots, n. \tag{3.32}
\]

As we shall see later on, this matrix provides the paradigm of an ill-conditioned matrix.

Completing the Gaussian elimination requires 2(n − 1)n(n + 1)/3 + n(n − 1) flops, plus n² flops to backsolve the triangular system Ux = b⁽ⁿ⁾. Therefore, about (2n³/3 + 2n²) flops are needed to solve the linear system using GEM. Neglecting the lower order terms, we can state that the Gaussian elimination process has a cost of 2n³/3 flops.
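A compact MATLAB sketch of GEM in its basic form (3.30)-(3.31) (our own coding, with no pivoting, so it fails if a zero pivot is met, cf. (3.33)); the final triangular system is solved by Program 3:

function x = gem_nopivot(A, b)
n = length(b);
for k = 1:n-1
  for i = k+1:n
    m = A(i,k)/A(k,k);                 % multiplier m_ik of (3.30)
    A(i,k:n) = A(i,k:n) - m*A(k,k:n);  % update (3.31) of row i
    b(i) = b(i) - m*b(k);
  end
end
x = backward_col(A, b);                % A now holds U = A^(n)

For instance, gem_nopivot(hilb(3), hilb(3)*[1;1;1]) returns the solution (1, 1, 1)ᵀ of Example 3.2 up to roundoff.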

As previously noticed, GEM terminates safely iff the pivotal elements a_{kk}^{(k)}, for k = 1, …, n − 1, are nonvanishing. Unfortunately, having nonnull diagonal entries in A is not enough to prevent zero pivots from arising during the elimination process. For example, the matrix A in (3.33) is nonsingular and has nonzero diagonal entries

\[
A = \begin{bmatrix} 1 & 2 & 3 \\ 2 & 4 & 5 \\ 7 & 8 & 9 \end{bmatrix}, \qquad
A^{(2)} = \begin{bmatrix} 1 & 2 & 3 \\ 0 & 0 & -1 \\ 0 & -6 & -12 \end{bmatrix}. \tag{3.33}
\]

Nevertheless, when GEM is applied, it is interrupted at the second step, since a₂₂⁽²⁾ = 0.

More restrictive conditions on A are thus needed to ensure the applicability of the method. We shall see in Section 3.3.1 that if the leading dominating minors d_i of A are nonzero for i = 1, …, n − 1, then the corresponding pivotal entries a_{ii}^{(i)} must necessarily be nonvanishing. We recall that d_i is the determinant of A_i, the i-th principal submatrix made of the first i rows and columns of A. The matrix in the previous example does not satisfy this condition, having d₁ = 1 and d₂ = 0.

Classes of matrices exist for which GEM can always be safely employed in its basic form (3.31). Among them, we recall the following ones:

1. matrices diagonally dominant by rows;
2. matrices diagonally dominant by columns; in such a case one can even show that the multipliers are in absolute value less than or equal to 1 (see Property 3.2);
3. matrices symmetric and positive definite (see Theorem 3.6).

For a rigorous derivation of these results, we refer to the forthcoming sections.


3.3.1 GEM as a Factorization Method

In this section we show how GEM is equivalent to performing a factorization of the matrix A into the product of two matrices, A = LU, with U = A⁽ⁿ⁾. Since L and U depend only on A and not on the right hand side, the same factorization can be reused when solving several linear systems having the same matrix A but different right hand sides b, with a considerable reduction of the operation count (indeed, the main computational effort, about 2n³/3 flops, is spent in the elimination procedure).

practice, to pass from A(1) =H3 to the matrix A(2) at the second step, we

have multiplied the system by the matrix

® ®

100 10 0

M1 = ’ 1 1 0 = ’m21 1 0 .

° »° »

2

’m31 0

’3 0 1

1

1

Indeed,

®

1 1

1 2 3

(1)

= A(2) .

= 0 1 1

M1 A = M1 A

° »

12 12

1 4

0 12 45

Similarly, to perform the second (and last) step of GEM, we must multiply A⁽²⁾ by the matrix

\[
M_2 = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & -m_{32} & 1 \end{bmatrix}
    = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & -1 & 1 \end{bmatrix},
\]