
Let a base β ∈ N be fixed with β ≥ 2, and let x be a real number with
a finite number of digits x_k with 0 ≤ x_k < β for k = −m, ..., n. The
notation (conventionally adopted)

   x_β = (−1)^s [x_n x_{n−1} ... x_1 x_0 . x_{−1} x_{−2} ... x_{−m}],   x_n ≠ 0   (2.26)

is called the positional representation of x with respect to the base β. The
point between x_0 and x_{−1} is called decimal point if the base is 10, binary
point if the base is 2, while s depends on the sign of x (s = 0 if x is positive,
1 if negative). Relation (2.26) actually means

   x_β = (−1)^s Σ_{k=−m}^{n} x_k β^k.


Example 2.10 The conventional writing x_10 = 425.33 denotes the number x =
4·10^2 + 2·10 + 5 + 3·10^{−1} + 3·10^{−2}, while x_6 = 425.33 would denote the
real number x = 4·6^2 + 2·6 + 5 + 3·6^{−1} + 3·6^{−2}. A rational number can of
course have a finite number of digits in a base and an infinite number of digits in
another base. For example, the fraction 1/3 has infinite digits in base 10, being
x_10 = 0.3333..., while it has only one digit in base 3, being x_3 = 0.1.
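The sum in (2.26) is immediate to evaluate numerically. The following
MATLAB sketch (the function posval and its digit-vector convention are
ours, for illustration only) computes the value of a digit string in a given
base and reproduces both numbers of Example 2.10:

function x = posval(d, nfrac, beta)
% POSVAL Value in base beta of the digit string d, given from the most
% significant digit on, with nfrac digits after the point; e.g.
%   posval([4 2 5 3 3], 2, 10)  returns 425.33, while
%   posval([4 2 5 3 3], 2, 6)   returns 4*6^2 + 2*6 + 5 + 3/6 + 3/36.
k = (length(d)-1-nfrac):-1:-nfrac;   % exponent attached to each digit
x = sum(d .* beta.^k);               % x = sum_k x_k beta^k, cf. (2.26)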

Any real number can be approximated by numbers having a finite repre-
sentation. Indeed, having fixed the base β, the following property holds:

   ∀ε > 0, ∀x_β ∈ R, ∃y_β ∈ R such that |y_β − x_β| < ε,

where y_β has a finite positional representation.
In fact, given the positive number x_β = x_n x_{n−1} ... x_0 . x_{−1} ... x_{−m} ... with
a number of digits, finite or infinite, for any r ≥ 1 one can build the two
numbers

   x_β^(l) = Σ_{k=0}^{r−1} x_{n−k} β^{n−k},   x_β^(u) = x_β^(l) + β^{n−r+1},

having r digits, such that x_β^(l) < x_β < x_β^(u) and x_β^(u) − x_β^(l) = β^{n−r+1}. If
r is chosen in such a way that β^{n−r+1} < ε, then taking y_β equal to x_β^(l)
or x_β^(u) yields the desired inequality. This result legitimates the computer
representation of real numbers (and thus with a finite number of digits).

Although all bases are theoretically equivalent, in computational practice
three bases are generally employed: base 2 or binary, base 10 or decimal
(the most natural) and base 16 or hexadecimal. Almost all modern
computers use base 2, apart from a few which traditionally
employ base 16. In what follows, we will assume that β is an even integer.
In the binary representation, digits reduce to the two symbols 0 and 1,
called bits (binary digits), while in the hexadecimal case the symbols used
for the representation of the digits are 0, 1, ..., 9, A, B, C, D, E, F. Clearly,
the smaller the adopted base, the longer the string of characters needed to
represent the same number.
To simplify notations, we shall write x instead of x_β, leaving the base β
understood.

2.5.2 The Floating-point Number System
Assume a given computer has N memory positions in which to store any
number. The most natural way to make use of these positions in the rep-
resentation of a real number x different from zero is to fix one of them for
its sign, N − k − 1 for the integer digits and k for the digits beyond the
point, in such a way that

   x = (−1)^s · [a_{N−2} a_{N−3} ... a_k . a_{k−1} ... a_0],   (2.27)

s being equal to 1 or 0. Notice that one memory position is equivalent to
one bit of storage only when β = 2. The set of numbers of this kind is called
a fixed-point system. Equation (2.27) stands for

   x = (−1)^s · β^{−k} Σ_{j=0}^{N−2} a_j β^j   (2.28)

and therefore this representation amounts to fixing a scaling factor for all
the representable numbers.
The use of fixed point strongly limits the value of the minimum and maxi-
mum numbers that can be represented on the computer, unless a very large
number N of memory positions is employed. This drawback can be easily
overcome if the scaling in (2.28) is allowed to vary. In such a case,
given a non vanishing real number x, its floating-point representation is
given by

   x = (−1)^s · (0.a_1 a_2 ... a_t) · β^e = (−1)^s · m · β^{e−t},   (2.29)

where t ∈ N is the number of allowed significant digits a_i (with 0 ≤ a_i ≤
β − 1), m = a_1 a_2 ... a_t is an integer number called mantissa such that 0 ≤
m ≤ β^t − 1, and e is an integer number called exponent. Clearly, the exponent
can vary within a finite interval of admissible values: we let L ≤ e ≤ U
(typically L < 0 and U > 0). The N memory positions are now distributed
among the sign (one position), the significant digits (t positions) and the
digits for the exponent (the remaining N − t − 1 positions). The number
zero has a separate representation.
Typically, on the computer there are two formats available for the floating-
point number representation: single and double precision. In the case of bi-
nary representation, these formats correspond in the standard version to
the representation with N = 32 bits (single precision)

   s: 1 bit | e: 8 bits | m: 23 bits

and with N = 64 bits (double precision)

   s: 1 bit | e: 11 bits | m: 52 bits
Let us denote by

   F(β, t, L, U) = {0} ∪ { x ∈ R : x = (−1)^s β^e Σ_{i=1}^{t} a_i β^{−i} }

the set of floating-point numbers with t significant digits, base β ≥ 2, 0 ≤
a_i ≤ β − 1, and range (L, U) with L ≤ e ≤ U.
In order to enforce uniqueness in a number representation, it is typi-
cally assumed that a_1 ≠ 0 and m ≥ β^{t−1}. In such an event a_1 is called
the principal significant digit, while a_t is the last significant digit and the
representation of x is called normalized. The mantissa m now varies
between β^{t−1} and β^t − 1.
For instance, in the case β = 10, t = 4, L = −1 and U = 4, without
the assumption that a_1 ≠ 0, the number 1 would admit the following
representations

   0.1000 · 10^1,   0.0100 · 10^2,   0.0010 · 10^3,   0.0001 · 10^4.

In order to preserve uniqueness in the representation, it is assumed that
the number zero also has its own sign (typically, s = 0 is assumed).
It can be immediately noticed that if x ∈ F(β, t, L, U) then also −x ∈
F(β, t, L, U). Moreover, the following lower and upper bounds hold for the
absolute value of x:

   x_min = β^{L−1} ≤ |x| ≤ β^U (1 − β^{−t}) = x_max.   (2.30)

The cardinality of F(β, t, L, U) (henceforth shortly denoted by F) is

   card F = 2(β − 1) β^{t−1} (U − L + 1) + 1.

From (2.30) it turns out that it is not possible to represent any number
(apart from zero) whose absolute value is less than x_min. This latter limi-
tation can be overcome by completing F with the set F_D of the floating-point
de-normalized numbers, obtained by removing the assumption that a_1 is
non null, only for the numbers that are referred to the minimum exponent
L. In such a way the uniqueness of the representation is not lost and it is
possible to generate numbers that have mantissa between 1 and β^{t−1} − 1
and belong to the interval (−β^{L−1}, β^{L−1}). The smallest number in this set
has absolute value equal to β^{L−t}.

Example 2.11 The positive numbers in the set F(2, 3, −1, 2) are

   (0.111)·2^2 = 7/2,    (0.110)·2^2 = 3,     (0.101)·2^2 = 5/2,    (0.100)·2^2 = 2,
   (0.111)·2   = 7/4,    (0.110)·2   = 3/2,   (0.101)·2   = 5/4,    (0.100)·2   = 1,
   (0.111)     = 7/8,    (0.110)     = 3/4,   (0.101)     = 5/8,    (0.100)     = 1/2,
   (0.111)·2^{−1} = 7/16, (0.110)·2^{−1} = 3/8, (0.101)·2^{−1} = 5/16, (0.100)·2^{−1} = 1/4.

They are included between x_min = β^{L−1} = 2^{−2} = 1/4 and x_max = β^U (1 − β^{−t}) =
2^2 (1 − 2^{−3}) = 7/2. As a whole, we have (β − 1)β^{t−1}(U − L + 1) = (2 − 1)2^{3−1}(2 +
1 + 1) = 16 strictly positive numbers. Their opposites must be added to them,
as well as the number zero. We notice that when β = 2, the first significant digit
in the normalized representation is necessarily equal to 1 and thus it need not be
stored in the computer (in such an event, we call it the hidden bit).
When considering also the positive de-normalized numbers, we should complete
the above set by adding the following numbers:

   (.011)_2 · 2^{−1} = 3/16,   (.010)_2 · 2^{−1} = 1/8,   (.001)_2 · 2^{−1} = 1/16.

According to what was previously stated, the smallest de-normalized number is
β^{L−t} = 2^{−1−3} = 1/16. •
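The set of Example 2.11 can also be generated exhaustively. The following
MATLAB fragment (ours, for illustration) sweeps the normalized mantissas
and the admissible exponents and lists the 16 strictly positive numbers:

beta = 2; t = 3; L = -1; U = 2;
F = [];
for e = L:U
   for m = beta^(t-1):beta^t-1      % normalized mantissas only
      F = [F, m*beta^(e-t)];        % x = m*beta^(e-t), cf. (2.29)
   end
end
sort(F)                             % 1/4, 5/16, 3/8, ..., 3, 7/2
% length(F) equals (beta-1)*beta^(t-1)*(U-L+1) = 16, as computed above.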

2.5.3 Distribution of Floating-point Numbers

The floating-point numbers are not equally spaced along the real line, but
they get denser close to the smallest representable number. It can be checked
that the spacing between a number x ∈ F and its nearest neighbor y ∈ F, where
both x and y are assumed to be non null, is at least β^{−1} ε_M |x| and at most
ε_M |x|, being ε_M = β^{1−t} the machine epsilon. This latter represents the
distance between the number 1 and the nearest floating-point number, and
therefore it is the smallest number of F such that 1 + ε_M > 1.
Having instead fixed an interval of the form [β^e, β^{e+1}], the numbers of F
that belong to such an interval are equally spaced and have distance equal
to β^{e−t}. Decreasing (or increasing) the exponent by one gives rise to a
decrement (or increment) by a factor β of the distance between consecutive
numbers.
Unlike the absolute distance, the relative distance between two consecu-
tive numbers has a periodic behavior which depends only on the mantissa
m. Indeed, denoting by (−1)^s m(x) β^{e−t} one of the two numbers, the dis-
tance Δx from the successive one is equal to (−1)^s β^{e−t}, which implies that
the relative distance is

   Δx/x = (−1)^s β^{e−t} / ((−1)^s m(x) β^{e−t}) = 1/m(x).   (2.31)

Within the interval [β^e, β^{e+1}], the ratio in (2.31) decreases as x increases,
since in the normalized representation the mantissa varies from β^{t−1} to β^t −
1 (not included). However, as soon as x = β^{e+1}, the relative distance gets
back to the value β^{−t+1} and starts decreasing on the successive intervals,
as shown in Figure 2.2. This oscillatory phenomenon is called wobbling
precision, and the greater the base β, the more pronounced the effect. This
is another reason why small bases are preferably employed in computers.
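In double precision the sawtooth behavior of the relative distance can be
observed directly: the MATLAB function eps(x) returns the distance from
x to the next larger machine number (a short illustrative experiment; the
one-argument form of eps assumes a reasonably recent MATLAB release):

x = logspace(0, 2, 1000);   % abscissae between 1 and 100
relgap = eps(x) ./ x;       % relative distance, cf. (2.31)
semilogx(x, relgap)         % sawtooth between 2^(-53) and 2^(-52)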


2.5.4 IEC/IEEE Arithmetic

The possibility of building sets of floating-point numbers that differ in base,
number of significant digits and range of the exponent has prompted in the
past the development, for almost any computer, of a particular system F. In
order to avoid this proliferation of numerical systems, a standard has been
fixed that is nowadays almost universally accepted. This standard was de-
veloped in 1985 by the Institute of Electrical and Electronics Engineers
(shortly, IEEE) and was approved in 1989 by the International Electro-
technical Commission (IEC) as the international standard IEC559, and it is
now known by this name (IEC is an organization analogous to the International
Organization for Standardization (ISO) in the field of electronics). The stan-
dard IEC559 endorses two formats for the floating-point numbers: a basic
format, made by the system F(2, 24, −125, 128) for the single precision,
and by F(2, 53, −1021, 1024) for the double precision, both including the



[Figure: the relative distance oscillates between 2^{−24} and 2^{−23} over the
intervals [2^{−126}, 2^{−125}], [2^{−125}, 2^{−124}], ...]

FIGURE 2.2. Variation of the relative distance for the set of numbers
F(2, 24, −125, 128) of the IEC/IEEE single precision standard

de-normalized numbers, and an extended format, for which only the main
limitations are fixed (see Table 2.1).

        single       double                  single     double
   N    ≥ 43 bits    ≥ 79 bits          t    ≥ 32       ≥ 64
   L    ≤ −1021      ≤ −16381           U    ≥ 1024     ≥ 16384

TABLE 2.1. Lower or upper limits in the standard IEC559 for the extended
format of floating-point numbers

Almost all computers nowadays satisfy the requirements above. We
summarize in Table 2.2 the special codings that are used in IEC559 to
deal with the values ±0, ±∞ and with the so-called non numbers (shortly,
NaN, that is, not a number), which correspond for instance to 0/0 or to
other exceptional operations.

   value   exponent   mantissa
   ±0      L − 1      0
   ±∞      U + 1      0
   NaN     U + 1      ≠ 0

TABLE 2.2. IEC559 codings of some exceptional values




2.5.5 Rounding of a Real Number in its Machine Representation

The fact that on any computer only a subset F(β, t, L, U) of R is actually
available poses several practical problems, first of all the representation in F
of any given real number. In this concern, notice that, even if x and y were
two numbers in F, the result of an operation on them does not necessarily
belong to F. Therefore, we must also define an arithmetic on F.
The simplest approach to solve the first problem consists of rounding
x ∈ R in such a way that the rounded number belongs to F. Among all
the possible rounding operations, let us consider the following one. Given
x ∈ R in the normalized positional notation (2.29), let us substitute x by
its representant fl(x) in F, defined as

   fl(x) = (−1)^s (0.a_1 a_2 ... ã_t) · β^e,   ã_t = a_t      if a_{t+1} < β/2,
                                              ã_t = a_t + 1  if a_{t+1} ≥ β/2.   (2.32)

The mapping fl : R → F is the most commonly used and is called rounding
(in chopping one would more trivially take ã_t = a_t). Clearly, fl(x) = x
if x ∈ F and moreover fl(x) ≤ fl(y) if x ≤ y ∀x, y ∈ R (monotonicity
property).
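A didactic MATLAB sketch of the mapping (2.32) may help (the function
name is ours; since it emulates base-β rounding on top of double precision
arithmetic, it is trustworthy only for t well below the 16 or so decimal
digits of the underlying format):

function y = flround(x, beta, t)
% FLROUND Round x to t significant digits in base beta, mimicking the
% mapping fl of (2.32) (MATLAB's round, which rounds halves away from
% zero, matches the rule a_{t+1} >= beta/2 for positive numbers).
if x == 0, y = 0; return, end
s = sign(x); x = abs(x);
e = floor(log(x)/log(beta)) + 1;   % exponent with x = (0.a_1 a_2 ...)*beta^e
m = round(x*beta^(t-e));           % integer mantissa rounded to t digits
y = s*m*beta^(e-t);                % y = (-1)^s * m * beta^(e-t), cf. (2.29)

For instance, flround(425.33, 10, 4) returns 425.3.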

Remark 2.2 (Overflow and underflow) Everything written so far holds
only for the numbers that in (2.29) have exponent e within the range of F.
If, indeed, x ∈ (−∞, −x_max) ∪ (x_max, ∞) the value fl(x) is not defined,
while if x ∈ (−x_min, x_min) the operation of rounding is defined anyway
(even in the absence of de-normalized numbers). In the first case, if x is the
result of an operation on numbers of F, we speak about overflow, in the
second case about underflow (or graceful underflow if de-normalized num-
bers are accounted for). The overflow is handled by the system through an
interrupt of the executing program.

Apart from exceptional situations, we can easily quantify the error, ab-
solute and relative, that is made by substituting fl(x) for x. The following
result can be shown (see for instance [Hig96], Theorem 2.2).

Property 2.1 If x ∈ R is such that x_min ≤ |x| ≤ x_max, then

   fl(x) = x(1 + δ)   with |δ| ≤ u,   (2.33)

where

   u = (1/2) β^{1−t} = (1/2) ε_M   (2.34)

is the so-called roundoff unit (or machine precision).

As a consequence of (2.33), the following bound holds for the relative error:

   E_rel(x) = |x − fl(x)| / |x| ≤ u,   (2.35)

while, for the absolute error, one gets

   E(x) = |x − fl(x)| ≤ β^{e−t} |(a_1 ... a_t . a_{t+1} ...) − (a_1 ... ã_t)|.

From (2.32), it follows that

   |(a_1 ... a_t . a_{t+1} ...) − (a_1 ... ã_t)| ≤ (β/2) β^{−1},

from which

   E(x) ≤ (1/2) β^{−t+e}.

Remark 2.3 In the MATLAB environment the value of ε_M is immediately
available through the system variable eps.


2.5.6 Machine Floating-point Operations

As previously stated, it is necessary to define on the set of machine numbers
an arithmetic which is analogous, as far as possible, to the arithmetic in R.
Thus, given any arithmetic operation ∘ : R × R → R on two operands in
R (the symbol ∘ may denote sum, subtraction, multiplication or division),
we shall denote by ⊚ the corresponding machine operation

   ⊚ : R × R → F,   x ⊚ y = fl(fl(x) ∘ fl(y)).

From the properties of floating-point numbers one could expect that for the
operations on two operands, whenever well defined, the following property
holds: ∀x, y ∈ F, ∃δ ∈ R such that

   x ⊚ y = (x ∘ y)(1 + δ)   with |δ| ≤ u.   (2.36)

In order for (2.36) to be satisfied when ∘ is the operator of subtraction, an
additional assumption on the structure of the numbers in F is required,
namely the presence of the so-called round digit (which is addressed at the
end of this section). In particular, when ∘ is the sum operator (whose
machine counterpart we denote by ⊕), it follows that for all x, y ∈ F (see
Exercise 11)

   |x ⊕ y − (x + y)| / |x + y| ≤ u(1 + u) (|x| + |y|) / |x + y| + u,   (2.37)

so that the relative error associated with every machine operation will be
small, unless x + y is itself small. The sum of two numbers close in module
but opposite in sign deserves a special comment: in such a case x + y can be
quite small, which generates the so-called cancellation errors (as evidenced
in Example 2.6).
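The loss of significant digits can be reproduced with a one-line MATLAB
experiment (the numbers below are our illustrative choice):

a = 1 + 1e-13; b = -1;                 % two nearly opposite numbers
exact = 1e-13;
rel_err = abs((a + b) - exact)/exact   % about 1e-3: ten digits are lost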

It is important to notice that, together with properties of standard arith-
metic that are preserved when passing to floating-point arithmetic (like, for
instance, the commutativity of the sum of two addends, or of the product of
two factors), other properties are lost. An example is given by the associa-
tivity of the sum: it can indeed be shown (see Exercise 12) that in general

   x ⊕ (y ⊕ z) ≠ (x ⊕ y) ⊕ z.

We shall denote by flop the single elementary floating-point operation (sum,
subtraction, multiplication or division); the reader is warned that in some
texts flop identifies an operation of the form a + b · c. According to the
previous convention, a scalar product between two vectors of length n will
require 2n − 1 flops, a matrix-vector product 2(m − 1)n flops if the matrix
is n × m and, finally, a matrix-matrix product 2(r − 1)mn flops if the two
matrices are m × r and r × n respectively.

Remark 2.4 (IEC559 arithmetic) The IEC559 standard also defines a
closed arithmetic on F, this meaning that any operation on it produces
a result that can be represented within the system itself, although not
necessarily the one expected from a pure mathematical standpoint. As an
example, in Table 2.3 we report the results that are obtained in exceptional
situations.

   exception             examples      result
   non valid operation   0/0, 0 · ∞    NaN
   overflow                            ±∞
   division by zero      1/0           ±∞
   underflow                           subnormal numbers

TABLE 2.3. Results for some exceptional operations

The presence of a NaN (Not a Number) in a sequence of operations au-
tomatically implies that the result is a NaN. General acceptance of this
standard is still ongoing.

We mention that not all floating-point systems satisfy (2.36). One of
the main reasons is the absence of the round digit in subtraction, that is,
an extra digit that comes into play at the mantissa level when the subtrac-
tion between two floating-point numbers is performed. To demonstrate the
importance of the round digit, let us consider the following example with
a system F having β = 10 and t = 2. Let us subtract 1 and 0.99. Aligning
the exponents and truncating the second mantissa to t = 2 digits, we have

   10^1 · 0.1  −  10^0 · 0.99   →   10^1 · 0.10  −  10^1 · 0.09
                                =   10^1 · 0.01  →  10^0 · 0.10,

that is, the result differs from the exact one by a factor 10. If we now
execute the same subtraction using the round digit, we obtain the exact
result. Indeed

   10^1 · 0.1  −  10^0 · 0.99   →   10^1 · 0.10 0  −  10^1 · 0.09 9
                                =   10^1 · 0.00 1  →  10^0 · 0.01.

In fact, it can be shown that addition and subtraction, if executed without
the round digit, do not satisfy the property

   fl(x ± y) = (x ± y)(1 + δ)   with |δ| ≤ u,

but only the following one

   fl(x ± y) = x(1 + α) ± y(1 + β)   with |α| + |β| ≤ u.

An arithmetic for which this latter event happens is called aberrant. In some
computers the round digit does not exist, most of the care being devoted to
computational speed. Nowadays, however, the trend is to use even two
round digits (see [HP94] for technical details about the subject).


2.6 Exercises

1. Use (2.7) to compute the condition number K(d) of the following expres-
   sions

      (1) x − a^d = 0, a > 0;      (2) d − x + 1 = 0,

   d being the datum, a a parameter and x the "unknown".
   [Solution : (1) K(d) ≃ |d| |log a|, (2) K(d) = |d|/|d + 1|.]
2. Study the well posedness and the conditioning in the infinity norm of the
   following problem as a function of the datum d: find x and y such that

      x + dy = 1,
      dx + y = 0.

   [Solution : the given problem is a linear system whose matrix is
   A = [1 d; d 1]. It is well-posed if A is nonsingular, i.e., if d ≠ ±1. In such a
   case, K_∞(A) = |(|d| + 1)/(|d| − 1)|.]
3. Study the conditioning of the solving formula x_± = −p ± √(p^2 + q) for
   the second degree equation x^2 + 2px − q = 0 with respect to changes in the
   parameters p and q separately.
   [Solution : K(p) = |p|/√(p^2 + q), K(q) = |q|/(2|x_±| √(p^2 + q)).]
4. Consider the following Cauchy problem

      x′(t) = x_0 e^{at} (a cos(t) − sin(t)),  t > 0;   x(0) = x_0,   (2.38)

   whose solution is x(t) = x_0 e^{at} cos(t) (a is a given real number). Study the
   conditioning of (2.38) with respect to the choice of the initial datum and
   check that on unbounded intervals it is well conditioned if a < 0, while it
   is ill conditioned if a > 0.
   [Hint : consider the definition of K_abs(a).]
5. Let x̃ ≠ 0 be an approximation of a non null quantity x. Find the relation
   between the relative error ε = |x − x̃|/|x| and E = |x − x̃|/|x̃|.
6. Find a stable formula for evaluating the square root of a complex number.
7. Determine all the elements of the set F(10, 6, −9, 9), in both the normalized
   and the de-normalized cases.
8. Consider the set of the de-normalized numbers F_D and study the behavior
   of the absolute distance and of the relative distance between two of these
   numbers. Does the wobbling precision effect arise again?
   [Hint : for these numbers, uniformity in the relative density is lost. As a
   consequence, the absolute distance remains constant (equal to β^{L−t}), while
   the relative one rapidly grows as x tends to zero.]
9. What is the value of 0^0 in IEEE arithmetic?
   [Solution : ideally, the outcome should be NaN. In practice, IEEE systems
   recover the value 1. A motivation of this result can be found in [Gol91].]
10. Show that, due to cancellation errors, the following sequence

       I_0 = log(6/5),   I_k + 5 I_{k−1} = 1/k,   k = 1, 2, ..., n,   (2.39)

    is not well suited to finite arithmetic computations of the integral
    I_n = ∫_0^1 x^n/(x + 5) dx when n is sufficiently large, although it works in
    infinite arithmetic.
    [Hint : consider the initial perturbed datum Ĩ_0 = I_0 + ε_0 and study the
    propagation of the error ε_0 within (2.39).]
11. Prove (2.37).
    [Solution : notice that

       |x ⊕ y − (x + y)| / |x + y|
          ≤ |x ⊕ y − (fl(x) + fl(y))| / |x + y| + |fl(x) − x + fl(y) − y| / |x + y|.

    Then, use (2.36) and (2.35).]
12. Given x, y, z ∈ F with x + y, y + z, x + y + z falling within the range of F,
    show that

       |(x ⊕ y) ⊕ z − (x + y + z)| ≤ C_1 (2|x + y| + |z|) u,
       |x ⊕ (y ⊕ z) − (x + y + z)| ≤ C_2 (|x| + 2|y + z|) u.

13. Which among the following approximations of π,

       π = 4 (1 − 1/3 + 1/5 − 1/7 + 1/9 − ...),

       π = 6 (0.5 + (0.5)^3/(2·3) + 3(0.5)^5/(2·4·5) + 3·5(0.5)^7/(2·4·6·7) + ...),   (2.40)

    better limits the propagation of rounding errors? Compare using MATLAB
    the obtained results as a function of the number of terms in each sum
    in (2.40).
14. Analyze the stability, with respect to propagation of rounding errors, of the
    following two MATLAB codes to evaluate f(x) = (e^x − 1)/x for |x| ≪ 1

       % Algorithm 1                % Algorithm 2
       if x == 0                    y = exp(x);
          f = 1;                    if y == 1
       else                            f = 1;
          f = (exp(x) - 1) / x;     else
       end                             f = (y - 1) / log(y);
                                    end

    [Solution : the first algorithm is inaccurate due to cancellation errors, while
    the second one (in the presence of the round digit) is stable and accurate.]
15. In binary arithmetic one can show [Dek71] that the rounding error in the
    sum of two numbers a and b, with a ≥ b, can be computed as

       ((a + b) − a) − b.

    Based on this property, a method has been proposed, called Kahan com-
    pensated sum, to compute the sum of n addends a_i in such a way that the
    rounding errors are compensated. In practice, letting the initial rounding
    error e_1 = 0 and s_1 = a_1, at the i-th step, with i ≥ 2, the algorithm
    evaluates y_i = a_i − e_{i−1}, the sum is updated setting s_i = s_{i−1} + y_i and
    the new rounding error is computed as e_i = (s_i − s_{i−1}) − y_i. Implement
    this algorithm in MATLAB and check its accuracy by evaluating again the
    second expression in (2.40).
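    A minimal MATLAB transcription of the compensated sum described
    above could read as follows (identifiers ours); its accuracy can then be
    compared with that of a plain accumulation on the partial sums of (2.40):

    function s = kahan_sum(a)
    % KAHAN_SUM Compensated summation of the entries of the vector a,
    % following Exercise 15: y_i = a_i - e_{i-1}, s_i = s_{i-1} + y_i,
    % e_i = (s_i - s_{i-1}) - y_i.
    s = a(1); e = 0;            % s_1 = a_1, e_1 = 0
    for i = 2:length(a)
       y = a(i) - e;            % correct the addend by the previous error
       snew = s + y;            % updated sum
       e = (snew - s) - y;      % rounding error committed at this step
       s = snew;
    end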
16. The area A(T) of a triangle T with sides a, b and c can be computed using
    the following formula

       A(T) = √(p(p − a)(p − b)(p − c)),

    where p is half the perimeter of T. Show that in the case of strongly de-
    formed triangles (a ≃ b + c) this formula lacks accuracy, and check this
    experimentally.
3
Direct Methods for the Solution of
Linear Systems




A system of m linear equations in n unknowns consists of a set of algebraic
relations of the form

   Σ_{j=1}^{n} a_ij x_j = b_i,   i = 1, ..., m,   (3.1)

where x_j are the unknowns, a_ij are the coefficients of the system and b_i
are the components of the right hand side. System (3.1) can be more con-
veniently written in matrix form as

   Ax = b,   (3.2)

where we have denoted by A = (a_ij) ∈ C^{m×n} the coefficient matrix, by
b = (b_i) ∈ C^m the right side vector and by x = (x_i) ∈ C^n the unknown
vector, respectively. We call a solution of (3.2) any n-tuple of values x_i
which satisfies (3.1).
In this chapter we shall be mainly dealing with real-valued square systems
of order n, that is, systems of the form (3.2) with A ∈ R^{n×n} and b ∈ R^n.
In such cases existence and uniqueness of the solution of (3.2) are ensured
if one of the following (equivalent) hypotheses holds:

1. A is invertible;

2. rank(A) = n;

3. the homogeneous system Ax = 0 admits only the null solution.

The solution of system (3.2) is formally provided by Cramer's rule

   x_j = Δ_j / det(A),   j = 1, ..., n,   (3.3)

where Δ_j is the determinant of the matrix obtained by substituting the
j-th column of A with the right hand side b. This formula is, however,
of little practical use. Indeed, if the determinants are evaluated by the
recursive relation (1.4), the computational effort of Cramer's rule is of the
order of (n + 1)! flops and therefore turns out to be unacceptable even for
small dimensions of A (for instance, a computer able to perform 10^9 flops
per second would take 9.6 · 10^{47} years to solve a linear system of only 50
equations).
For this reason, numerical methods that are alternatives to Cramer's rule
have been developed. They are called direct methods if they yield the so-
lution of the system in a finite number of steps, iterative if they require
(theoretically) an infinite number of steps. Iterative methods will be ad-
dressed in the next chapter. We notice from now on that the choice between
a direct and an iterative method does not depend only on the theoretical ef-
ficiency of the scheme, but also on the particular type of matrix, on memory
storage requirements and, finally, on the architecture of the computer.



3.1 Stability Analysis of Linear Systems

Solving a linear system by a numerical method invariably leads to the
introduction of rounding errors. Only the use of stable numerical methods
can prevent the propagation of such errors from polluting the accuracy of
the solution. In this section two aspects of stability analysis will be addressed.
Firstly, we will analyze the sensitivity of the solution of (3.2) to changes
in the data A and b (forward a priori analysis). Secondly, assuming that
an approximate solution x̂ of (3.2) is available, we shall quantify the per-
turbations on the data A and b in order for x̂ to be the exact solution
of a perturbed system (backward a priori analysis). The size of these per-
turbations will in turn allow us to measure the accuracy of the computed
solution x̂ by the use of a posteriori analysis.


3.1.1 The Condition Number of a Matrix

The condition number of a matrix A ∈ C^{n×n} is defined as

   K(A) = ‖A‖ ‖A^{−1}‖,   (3.4)

where ‖·‖ is an induced matrix norm. In general K(A) depends on the
choice of the norm; this will be made clear by introducing a subscript
into the notation, for instance, K_∞(A) = ‖A‖_∞ ‖A^{−1}‖_∞. More generally,
K_p(A) will denote the condition number of A in the p-norm. Remarkable
instances are p = 1, p = 2 and p = ∞ (we refer to Exercise 1 for the
relations among K_1(A), K_2(A) and K_∞(A)).
As already noticed in Example 2.3, an increase in the condition number
produces a higher sensitivity of the solution of the linear system to changes
in the data. Let us start by noticing that K(A) ≥ 1 since

   1 = ‖AA^{−1}‖ ≤ ‖A‖ ‖A^{−1}‖ = K(A).

Moreover, K(A^{−1}) = K(A) and ∀α ∈ C with α ≠ 0, K(αA) = K(A).
Finally, if A is orthogonal, K_2(A) = 1 since ‖A‖_2 = √(ρ(A^T A)) = √(ρ(I)) = 1
and A^{−1} = A^T. The condition number of a singular matrix is set equal to
infinity.
For p = 2, K_2(A) can be characterized as follows. Starting from (1.21),
it can be proved that

   K_2(A) = ‖A‖_2 ‖A^{−1}‖_2 = σ_1(A)/σ_n(A),

where σ_1(A) and σ_n(A) are the maximum and minimum singular values of
A (see Property 1.7). As a consequence, in the case of symmetric positive
definite matrices we have

   K_2(A) = λ_max/λ_min = ρ(A) ρ(A^{−1}),   (3.5)

where λ_max and λ_min are the maximum and minimum eigenvalues of A.
To check (3.5), notice that

   ‖A‖_2 = √(ρ(A^T A)) = √(ρ(A^2)) = √(λ_max^2) = λ_max.

Moreover, since λ(A^{−1}) = 1/λ(A), one gets ‖A^{−1}‖_2 = 1/λ_min, from which
(3.5) follows. For that reason, K_2(A) is called the spectral condition number.
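In MATLAB the p-norm condition numbers are returned by the built-in
function cond. A quick check of (3.5) on a symmetric positive definite
matrix (the 2×2 matrix below is our arbitrary example):

A = [4 1; 1 3];              % symmetric positive definite
K2 = cond(A)                 % = norm(A,2)*norm(inv(A),2)
lam = eig(A);
max(lam)/min(lam)            % coincides with K2, cf. (3.5)
Kinf = cond(A, inf)          % condition number in the infinity norm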

Remark 3.1 Define the relative distance of A ∈ C^{n×n} from the set of
singular matrices with respect to the p-norm by

   dist_p(A) = min { ‖δA‖_p / ‖A‖_p : A + δA is singular }.

It can then be shown that ([Kah66], [Gas83])

   dist_p(A) = 1/K_p(A).   (3.6)

Equation (3.6) suggests that a matrix A with a high condition number
can behave like a singular matrix of the form A + δA. In other words, null
perturbations of the right hand side do not necessarily correspond to null
changes in the solution since, if A + δA is singular, the homogeneous system
(A + δA)z = 0 no longer admits only the null solution. From (3.6) it
also follows that A + δA is certainly nonsingular if

   ‖δA‖_p ‖A^{−1}‖_p < 1.   (3.7)

Relation (3.6) seems to suggest that a natural candidate for measuring
the ill-conditioning of a matrix is its determinant, since from (3.3) one is
prompted to conclude that small determinants mean nearly-singular matri-
ces. However this conclusion is wrong, as there exist examples of matrices
with small (respectively, high) determinants and small (respectively, high)
condition numbers (see Exercise 2).
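A two-line MATLAB experiment makes the point (the scaled identity is
our example, in the spirit of Exercise 2):

A = 0.1*eye(100);
det(A)                % 10^(-100): a "nearly singular" determinant
cond(A)               % 1: yet the matrix is perfectly conditioned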


3.1.2 Forward a priori Analysis

In this section we introduce a measure of the sensitivity of the system to
changes in the data. These changes will be interpreted in Section 3.10 as
being the effects of rounding errors induced by the numerical method used
to solve the system. For a more comprehensive analysis of the subject we
refer to [Dat95], [GL89], [Ste73] and [Var62].
Due to rounding errors, a numerical method for solving (3.2) does not
provide the exact solution but only an approximate one, which satisfies a
perturbed system. In other words, a numerical method yields an (exact)
solution x + δx of the perturbed system

   (A + δA)(x + δx) = b + δb.   (3.8)
The next result provides an estimate of δx in terms of δA and δb.

Theorem 3.1 Let A ∈ R^{n×n} be a nonsingular matrix and δA ∈ R^{n×n} be
such that (3.7) is satisfied for a matrix norm ‖·‖. Then, if x ∈ R^n is the
solution of Ax = b with b ∈ R^n (b ≠ 0) and δx ∈ R^n satisfies (3.8) for
δb ∈ R^n,

   ‖δx‖/‖x‖ ≤ K(A)/(1 − K(A) ‖δA‖/‖A‖) (‖δb‖/‖b‖ + ‖δA‖/‖A‖).   (3.9)

Proof. From (3.7) it follows that the matrix A^{−1}δA has norm less than 1. Then,
due to Theorem 1.5, I + A^{−1}δA is invertible and from (1.26) it follows that

   ‖(I + A^{−1}δA)^{−1}‖ ≤ 1/(1 − ‖A^{−1}δA‖) ≤ 1/(1 − ‖A^{−1}‖ ‖δA‖).   (3.10)

On the other hand, solving for δx in (3.8) and recalling that Ax = b, one gets

   δx = (I + A^{−1}δA)^{−1} A^{−1} (δb − δA x),

from which, passing to the norms and using (3.10), it follows that

   ‖δx‖ ≤ ‖A^{−1}‖/(1 − ‖A^{−1}‖ ‖δA‖) (‖δb‖ + ‖δA‖ ‖x‖).

Finally, dividing both sides by ‖x‖ (which is nonzero since b ≠ 0 and A is
nonsingular) and noticing that ‖x‖ ≥ ‖b‖/‖A‖, the result follows. □

Well-conditioning alone is not enough to yield an accurate solution of the
linear system. It is indeed crucial, as pointed out in Chapter 2, to resort to
stable algorithms. Conversely, ill-conditioning does not necessarily exclude
that for particular choices of the right side b the overall conditioning of the
system is good (see Exercise 4).
A particular case of Theorem 3.1 is the following.

Theorem 3.2 Assume that the conditions of Theorem 3.1 hold and let
δA = 0. Then

   (1/K(A)) ‖δb‖/‖b‖ ≤ ‖δx‖/‖x‖ ≤ K(A) ‖δb‖/‖b‖.   (3.11)

Proof. We will prove only the first inequality since the second one directly
follows from (3.9). Relation δx = A^{−1}δb yields ‖δb‖ ≤ ‖A‖ ‖δx‖. Multiplying
both sides by ‖x‖ and recalling that ‖x‖ ≤ ‖A^{−1}‖ ‖b‖, it follows that ‖x‖ ‖δb‖ ≤
K(A) ‖b‖ ‖δx‖, which is the desired inequality. □

In order to employ the inequalities (3.10) and (3.11) in the analysis of
propagation of rounding errors in the case of direct methods, ‖δA‖ and
‖δb‖ should be bounded in terms of the dimension of the system and of
the characteristics of the floating-point arithmetic that is being used.
It is indeed reasonable to expect that the perturbations induced by a
method for solving a linear system are such that ‖δA‖ ≤ γ‖A‖ and ‖δb‖ ≤
γ‖b‖, γ being a positive number that depends on the roundoff unit u (for
example, we shall assume henceforth that γ = β^{1−t}, where β is the base
and t is the number of digits of the mantissa of the floating-point system
F). In such a case (3.9) can be completed by the following theorem.

Theorem 3.3 Assume that ‖δA‖ ≤ γ‖A‖, ‖δb‖ ≤ γ‖b‖ with γ ∈ R^+ and
δA ∈ R^{n×n}, δb ∈ R^n. Then, if γK(A) < 1, the following inequalities hold:

   ‖x + δx‖/‖x‖ ≤ (1 + γK(A))/(1 − γK(A)),   (3.12)

   ‖δx‖/‖x‖ ≤ 2γ K(A)/(1 − γK(A)).   (3.13)

Proof. From (3.8) it follows that (I + A^{−1}δA)(x + δx) = x + A^{−1}δb. Moreover,
since γK(A) < 1 and ‖δA‖ ≤ γ‖A‖, it turns out that I + A^{−1}δA is nonsingular.
Taking the inverse of such a matrix and passing to the norms, we get ‖x + δx‖ ≤
‖(I + A^{−1}δA)^{−1}‖ (‖x‖ + γ‖A^{−1}‖ ‖b‖). From Theorem 1.5 it then follows that

   ‖x + δx‖ ≤ (1/(1 − ‖A^{−1}δA‖)) (‖x‖ + γ‖A^{−1}‖ ‖b‖),

which implies (3.12), since ‖A^{−1}δA‖ ≤ γK(A) and ‖b‖ ≤ ‖A‖ ‖x‖.
Let us prove (3.13). Subtracting (3.2) from (3.8) it follows that

   A δx = −δA(x + δx) + δb.

Inverting A and passing to the norms, the following inequality is obtained:

   ‖δx‖ ≤ ‖A^{−1}‖ ‖δA‖ ‖x + δx‖ + ‖A^{−1}‖ ‖δb‖
        ≤ γK(A) ‖x + δx‖ + γ‖A^{−1}‖ ‖b‖.   (3.14)

Dividing both sides by ‖x‖ and using the triangular inequality ‖x + δx‖ ≤ ‖δx‖ +
‖x‖, we finally get (3.13). □

Remarkable instances of perturbations δA and δb are those for which
|δA| ≤ γ|A| and |δb| ≤ γ|b| with γ ≥ 0. Hereafter, the absolute value
notation B = |A| denotes the n × n matrix having entries b_ij = |a_ij| with
i, j = 1, ..., n, and the inequality C ≤ D, with C, D ∈ R^{m×n}, has the
following meaning:

   c_ij ≤ d_ij for i = 1, ..., m, j = 1, ..., n.

If the norm ‖·‖_∞ is considered, from (3.14) it follows that

   ‖δx‖_∞/‖x‖_∞ ≤ γ ‖ |A^{−1}| |A| |x| + |A^{−1}| |b| ‖_∞ / ((1 − γ ‖ |A^{−1}| |A| ‖_∞) ‖x‖_∞)

              ≤ 2γ ‖ |A^{−1}| |A| ‖_∞ / (1 − γ ‖ |A^{−1}| |A| ‖_∞).   (3.15)

Estimate (3.15) is generally too pessimistic; however, the following compo-
nentwise error estimates of δx can be derived from (3.15):

   |δx_i| ≤ γ |r^T_(i)| |A| |x + δx|,   i = 1, ..., n,   if δb = 0,

   |δx_i|/|x_i| ≤ γ |r^T_(i)| |b| / |r^T_(i) b|,   i = 1, ..., n,   if δA = 0,   (3.16)

where r^T_(i) is the row vector e^T_i A^{−1}. Estimates (3.16) are more stringent than
(3.15), as can be seen in Example 3.1. The first inequality in (3.16) can be
used when the perturbed solution x + δx is known, x + δx being henceforth
the solution computed by a numerical method.

In the case where |A^{−1}| |b| = |x|, the parameter γ in (3.15) is equal
to 1. For such systems the components of the solution are insensitive to
perturbations of the right side. A slightly worse situation occurs when A
is a triangular M-matrix and b has positive entries. In such a case γ is
bounded by 2n − 1, since

   |r^T_(i)| |A| |x| ≤ (2n − 1) |x_i|.

For further details on the subject we refer to [Ske79], [CI95] and [Hig89].
Results linking componentwise estimates to normwise estimates through
the so-called hypernorms can be found in [ADR92].

Example 3.1 Consider the linear system Ax = b with

   A = [ α   1/α ],    b = [ α^2 + 1/α ],
       [ 0   1/α ]         [ 1/α       ]

which has solution x^T = (α, 1), where 0 < α < 1. Let us compare the results
obtained using (3.15) and (3.16). From

   |A^{−1}| |A| |x| = |A^{−1}| |b| = (α + 2/α^2, 1)^T   (3.17)

it follows that the supremum of (3.17) is unbounded as α → 0, exactly as it
happens in the case of ‖A‖_∞. On the other hand, the amplification factor of
the error in (3.16) is bounded. Indeed, the component of maximum absolute
value, x_2, of the solution satisfies |r^T_(2)| |A| |x| / |x_2| = 1. •



3.1.3 Backward a priori Analysis

The numerical methods that we have considered thus far do not require the
explicit computation of the inverse of A to solve Ax = b. However, we can
always assume that they yield an approximate solution of the form x = Cb,
where the matrix C, due to rounding errors, is an approximation of A^{−1}.
In practice, C is very seldom constructed; in case this should happen, the
following result yields an estimate of the error that is made by substituting C
for A^{−1} (see [IK66], Chapter 2, Theorem 7).

Property 3.1 Let R = AC − I; if ‖R‖ < 1, then A and C are nonsingular
and

   ‖A^{−1}‖ ≤ ‖C‖/(1 − ‖R‖),   ‖R‖/‖A‖ ≤ ‖C − A^{−1}‖ ≤ ‖C‖ ‖R‖/(1 − ‖R‖).   (3.18)

In the frame of backward a priori analysis we can interpret C as being the
inverse of A + δA (for a suitable unknown δA). We are thus assuming that
C(A + δA) = I. This yields

   δA = C^{−1} − A = −(AC − I)C^{−1} = −RC^{−1}

and, as a consequence, if ‖R‖ < 1 it turns out that

   ‖δA‖ ≤ ‖R‖ ‖A‖/(1 − ‖R‖),   (3.19)

having used the first inequality in (3.18), where A is regarded as an
approximation of the inverse of C (notice that the roles of C and A can be
interchanged).


3.1.4 A posteriori Analysis

Having approximated the inverse of A by a matrix C turns into having an
approximation of the solution of the linear system (3.2). Let us denote by
y a known approximate solution. The aim of the a posteriori analysis is to
relate the (unknown) error e = y − x to quantities that can be computed
using y and C.
The starting point of the analysis relies on the fact that the residual
vector r = b − Ay is in general nonzero, since y is just an approximation
to the unknown exact solution. The residual can be related to the error
through Property 3.1 as follows. We have e = A^{−1}(Ay − b) = −A^{−1}r and
thus, if ‖R‖ < 1, then

   ‖e‖ ≤ ‖r‖ ‖C‖/(1 − ‖R‖).   (3.20)

Notice that the estimate does not necessarily require y to coincide with
the solution x = Cb of the backward a priori analysis. One could therefore
think of computing C only for the purpose of using the estimate (3.20) (for
instance, in the case where (3.2) is solved through the Gauss elimination
method, one can compute C a posteriori using the LU factorization of A,
see Sections 3.3 and 3.3.1).
We conclude by noticing that if δb is interpreted in (3.11) as being the
residual of the computed solution y = x + δx, it also follows that

   ‖e‖/‖x‖ ≤ K(A) ‖r‖/‖b‖.   (3.21)

The estimate (3.21) is not used in practice since the computed residual
is affected by rounding errors. A more significant estimate (in the ‖·‖_∞
norm) is obtained by letting r̂ = fl(b − Ay) and assuming that r̂ = r + δr
with |δr| ≤ γ_{n+1}(|A| |y| + |b|), where γ_{n+1} = (n + 1)u/(1 − (n + 1)u) > 0,
from which we have

   ‖e‖_∞/‖y‖_∞ ≤ ‖ |A^{−1}| (|r̂| + γ_{n+1}(|A| |y| + |b|)) ‖_∞ / ‖y‖_∞.

Formulae like this last one are implemented in the library for linear algebra
LAPACK (see [ABB+92]).

3.2 Solution of Triangular Systems

Consider the nonsingular 3×3 lower triangular system

   [ l_11  0     0    ] [ x_1 ]   [ b_1 ]
   [ l_21  l_22  0    ] [ x_2 ] = [ b_2 ].
   [ l_31  l_32  l_33 ] [ x_3 ]   [ b_3 ]

Since the matrix is nonsingular, its diagonal entries l_ii, i = 1, 2, 3, are
non vanishing, hence we can solve sequentially for the unknown values
x_i, i = 1, 2, 3, as follows:

   x_1 = b_1/l_11,
   x_2 = (b_2 − l_21 x_1)/l_22,
   x_3 = (b_3 − l_31 x_1 − l_32 x_2)/l_33.

This algorithm can be extended to n × n systems and is called forward
substitution. In the case of a system Lx = b, with L being a nonsingular
lower triangular matrix of order n (n ≥ 2), the method takes the form

   x_1 = b_1/l_11,

   x_i = (1/l_ii) (b_i − Σ_{j=1}^{i−1} l_ij x_j),   i = 2, ..., n.   (3.22)

The number of multiplications and divisions to execute the algorithm is
equal to n(n + 1)/2, while the number of sums and subtractions is n(n − 1)/2.
The global operation count for (3.22) is thus n^2 flops.
Similar conclusions can be drawn for a linear system Ux = b, where U
is a nonsingular upper triangular matrix of order n (n ≥ 2). In this case
the algorithm is called backward substitution and in the general case can
be written as

   x_n = b_n/u_nn,

   x_i = (1/u_ii) (b_i − Σ_{j=i+1}^{n} u_ij x_j),   i = n − 1, ..., 1.   (3.23)

Its computational cost is still n^2 flops.


3.2.1 Implementation of Substitution Methods
Each i-th step of algorithm (3.22) requires performing the scalar product
between the row vector L(i, 1 : i ’ 1) (this notation denoting the vector
extracted from matrix L taking the elements of the i-th row from the ¬rst
66 3. Direct Methods for the Solution of Linear Systems

to the (i-1)-th column) and the column vector x(1 : i ’ 1). The access to
matrix L is thus by row; for that reason, the forward substitution algorithm,
when implemented in the form above, is called row-oriented.
Its coding is reported in Program 1 (the Program mat square that is
called by forward row merely checks that L is a square matrix).

Program 1 - forward row : Forward substitution: row-oriented version

function [x]=forward row(L,b)
[n]=mat square(L); x(1) = b(1)/L(1,1);
for i = 2:n, x (i) = (b(i)-L(i,1:i-1)*(x(1:i-1))™)/L(i,i); end
x=x™;

To obtain a column-oriented version of the same algorithm, we take ad-
vantage of the fact that the i-th component of the vector x, once computed,
can be conveniently eliminated from the system.
An implementation of such a procedure, where the solution x is over-
written on the right side vector b, is reported in Program 2.

Program 2 - forward_col : Forward substitution: column-oriented version

function [b]=forward_col(L,b)
[n]=mat_square(L);
for j=1:n-1,
   b(j) = b(j)/L(j,j); b(j+1:n) = b(j+1:n)-b(j)*L(j+1:n,j);
end; b(n) = b(n)/L(n,n);

Implementing the same algorithm in a row-oriented rather than a column-
oriented manner might dramatically change its performance (but, of course,
not the solution). The choice of the form of implementation must therefore
be subordinated to the specific hardware that is used.
Similar considerations hold for the backward substitution method, pre-
sented in (3.23) in its row-oriented version.
In Program 3 only the column-oriented version of the algorithm is coded.
As usual, the vector x is overwritten on b.

Program 3 - backward_col : Backward substitution: column-oriented version

function [b]=backward_col(U,b)
[n]=mat_square(U);
for j = n:-1:2,
   b(j)=b(j)/U(j,j); b(1:j-1)=b(1:j-1)-b(j)*U(1:j-1,j);
end; b(1) = b(1)/U(1,1);

When large triangular systems must be solved, only the triangular portion
of the matrix should be stored, leading to considerable savings of memory
resources.

3.2.2 Rounding Error Analysis

The analysis developed so far has not accounted for the presence of round-
ing errors. When including these, the forward and backward substitution
algorithms no longer yield the exact solutions to the systems Lx = b and
Uy = b, but rather provide approximate solutions x̂ that can be regarded
as exact solutions to the perturbed systems

   (L + δL)x̂ = b,   (U + δU)x̂ = b,

where δL = (δl_ij) and δU = (δu_ij) are perturbation matrices. In order
to apply the estimates (3.9) carried out in Section 3.1.2, we must provide
estimates of the perturbation matrices δL and δU as a function of the
entries of L and U, of their size and of the characteristics of the floating-
point arithmetic. For this purpose, it can be shown that

   |δT| ≤ (nu/(1 − nu)) |T|,   (3.24)

where T is equal to L or U and u = (1/2)β^{1−t} is the roundoff unit defined
in (2.34). Clearly, if nu < 1, from (3.24) it turns out that, using a Taylor
expansion, |δT| ≤ nu|T| + O(u^2). Moreover, from (3.24) and (3.9) it follows
that, if nuK(T) < 1, then

   ‖x̂ − x‖/‖x‖ ≤ nuK(T)/(1 − nuK(T)) = nuK(T) + O(u^2)   (3.25)

for the norms ‖·‖_1, ‖·‖_∞ and the Frobenius norm. If u is sufficiently
small (as typically happens), the perturbations introduced by the rounding
errors in the solution of a triangular system can thus be neglected. As
a consequence, the accuracy of the solution computed by the forward or
backward substitution algorithm is generally very high.
These results can be improved by introducing some additional assump-
tions on the entries of L or U. In particular, if the entries of U are such
that |u_ii| ≥ |u_ij| for any j > i, then

   |x̂_i − x_i| ≤ 2^{n−i+1} (nu/(1 − nu)) max_{j≥i} |x̂_j|,   1 ≤ i ≤ n.

The same result holds if T = L, provided that |l_ii| ≥ |l_ij| for any j < i, or if
L and U are diagonally dominant. The previous estimates will be employed
in Sections 3.3.1 and 3.4.2.
For the proofs of the results reported so far, see [FM67], [Hig89] and
[Hig88].


3.2.3 Inverse of a Triangular Matrix

The algorithm (3.23) can be employed to explicitly compute the inverse
of an upper triangular matrix. Indeed, given an upper triangular matrix
U, the column vectors v_i of the inverse V = (v_1, ..., v_n) of U satisfy the
following linear systems:

   Uv_i = e_i,   i = 1, ..., n,   (3.26)

where {e_i} is the canonical basis of R^n (defined in Example 1.3). Solving
for v_i thus requires the application of algorithm (3.23) n times to (3.26).
This procedure is quite inefficient since at least half the entries of the
inverse of U are null. Let us take advantage of this as follows. Denote by
v_k = (v_{1k}, ..., v_{kk})^T the vector of size k such that

   U^(k) v_k = l_k,   k = 1, ..., n,   (3.27)

where U^(k) is the principal submatrix of U of order k and l_k the vector of
R^k having null entries, except the k-th one which is equal to 1. Systems
(3.27) are upper triangular, but have order k and can again be solved using
the method (3.23). We end up with the following inversion algorithm for
upper triangular matrices: for k = n, n − 1, ..., 1 compute

   v_kk = u_kk^{−1},

   v_ik = −u_ii^{−1} Σ_{j=i+1}^{k} u_ij v_jk,   for i = k − 1, k − 2, ..., 1.   (3.28)

At the end of this procedure the vectors v_k furnish the non vanishing entries
of the columns of U^{−1}. The algorithm requires about n^3/3 + (3/4)n^2 flops.
Once again, due to rounding errors, the algorithm (3.28) no longer yields
the exact solution, but an approximation of it. The error that is introduced
can be estimated using the backward a priori analysis carried out in Section
3.1.3.
A similar procedure can be constructed from (3.22) to compute the in-
verse of a lower triangular matrix.
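A MATLAB transcription of (3.28) could read as follows (a sketch of ours,
in the style of Programs 1-3 but not taken from them); only the upper
triangle of the output is filled in:

function V = invupper(U)
% INVUPPER Inverse of the nonsingular upper triangular matrix U by
% the column-by-column algorithm (3.28).
n = size(U,1); V = zeros(n);
for k = n:-1:1
   V(k,k) = 1/U(k,k);                            % v_kk = 1/u_kk
   for i = k-1:-1:1                              % cf. (3.28)
      V(i,k) = -(U(i,i+1:k)*V(i+1:k,k))/U(i,i);
   end
end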


3.3 The Gaussian Elimination Method (GEM) and
    LU Factorization

The Gaussian elimination method aims at reducing the system Ax = b to an
equivalent system (that is, having the same solution) of the form Ux = b̃,
where U is an upper triangular matrix and b̃ is an updated right side
vector. This latter system can then be solved by the backward substitution
method. Let us denote the original system by A^(1)x = b^(1). During the
reduction procedure we basically employ the property which states that
replacing one of the equations by the difference between this equation and
another one multiplied by a non null constant yields an equivalent system
(i.e., one with the same solution).

Thus, consider a nonsingular matrix A ∈ R^{n×n}, and suppose that the
diagonal entry a_11 is non vanishing. Introducing the multipliers

   m_i1 = a_i1^(1) / a_11^(1),   i = 2, 3, ..., n,

where a_ij^(1) denote the elements of A^(1), it is possible to eliminate the un-
known x_1 from the rows other than the first one by simply subtracting
from row i, with i = 2, ..., n, the first row multiplied by m_i1, and doing
the same on the right side. If we now define

   a_ij^(2) = a_ij^(1) − m_i1 a_1j^(1),   i, j = 2, ..., n,
   b_i^(2) = b_i^(1) − m_i1 b_1^(1),   i = 2, ..., n,

where b_i^(1) denote the components of b^(1), we get a new system of the form

   [ a_11^(1)  a_12^(1)  ...  a_1n^(1) ] [ x_1 ]   [ b_1^(1) ]
   [    0      a_22^(2)  ...  a_2n^(2) ] [ x_2 ]   [ b_2^(2) ]
   [    :         :              :     ] [  :  ] = [    :    ]
   [    0      a_n2^(2)  ...  a_nn^(2) ] [ x_n ]   [ b_n^(2) ]

which we denote by A^(2)x = b^(2) and which is equivalent to the starting one.
Similarly, we can transform the system in such a way that the unknown
x_2 is eliminated from rows 3, ..., n. In general, we end up with the finite
sequence of systems

   A^(k)x = b^(k),   1 ≤ k ≤ n,   (3.29)

where, for k ≥ 2, matrix A^(k) takes the following form

           [ a_11^(1)  a_12^(1)   ...     ...    ...   a_1n^(1) ]
           [    0      a_22^(2)                        a_2n^(2) ]
           [    :                 ...                     :     ]
   A^(k) = [    0        ...       0    a_kk^(k)  ...  a_kn^(k) ],
           [    :                          :              :     ]
           [    0        ...       0    a_nk^(k)  ...  a_nn^(k) ]

having assumed that a_ii^(i) ≠ 0 for i = 1, ..., k − 1. It is clear that for k = n
we obtain the upper triangular system A^(n)x = b^(n)

   [ a_11^(1)  a_12^(1)   ...   ...  a_1n^(1) ] [ x_1 ]   [ b_1^(1) ]
   [    0      a_22^(2)              a_2n^(2) ] [ x_2 ]   [ b_2^(2) ]
   [    :                 ...           :     ] [  :  ] = [    :    ]
   [    0        ...       0    a_nn^(n)      ] [ x_n ]   [ b_n^(n) ]

Consistently with the notations that have been previously introduced, we
denote by U the upper triangular matrix A^(n). The entries a_kk^(k) are called
pivots and must obviously be non null for k = 1, ..., n − 1.
In order to highlight the formulae which transform the k-th system into
the (k + 1)-th one, for k = 1, ..., n − 1 we assume that a_kk^(k) ≠ 0 and define
the multiplier

   m_ik = a_ik^(k) / a_kk^(k),   i = k + 1, ..., n.   (3.30)

Then we let

   a_ij^(k+1) = a_ij^(k) − m_ik a_kj^(k),   i, j = k + 1, ..., n,
   b_i^(k+1) = b_i^(k) − m_ik b_k^(k),   i = k + 1, ..., n.   (3.31)

Example 3.2 Let us use GEM to solve the following system

   (A^(1)x = b^(1))    x_1 + (1/2)x_2 + (1/3)x_3 = 11/6,
                       (1/2)x_1 + (1/3)x_2 + (1/4)x_3 = 13/12,
                       (1/3)x_1 + (1/4)x_2 + (1/5)x_3 = 47/60,

which admits the solution x = (1, 1, 1)^T. At the first step we compute the mul-
tipliers m_21 = 1/2 and m_31 = 1/3, and subtract from the second and third
equation of the system the first row multiplied by m_21 and m_31, respectively. We
obtain the equivalent system

   (A^(2)x = b^(2))    x_1 + (1/2)x_2 + (1/3)x_3 = 11/6,
                       0 + (1/12)x_2 + (1/12)x_3 = 1/6,
                       0 + (1/12)x_2 + (4/45)x_3 = 31/180.

If we now subtract the second row multiplied by m_32 = 1 from the third one, we
end up with the upper triangular system

   (A^(3)x = b^(3))    x_1 + (1/2)x_2 + (1/3)x_3 = 11/6,
                       0 + (1/12)x_2 + (1/12)x_3 = 1/6,
                       0 + 0 + (1/180)x_3 = 1/180,

from which we immediately compute x_3 = 1 and then, by back substitution, the
remaining unknowns x_1 = x_2 = 1.

Remark 3.2 The matrix in Example 3.2 is called the Hilbert matrix of
order 3. In the general n × n case, its entries are

   h_ij = 1/(i + j − 1),   i, j = 1, ..., n.   (3.32)

As we shall see later on, this matrix provides the paradigm of an ill-
conditioned matrix.
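MATLAB provides the built-in function hilb(n) for (3.32); a quick loop
(ours, for illustration) shows how fast the spectral condition number grows:

for n = [3 6 9 12]
   fprintf('n = %2d   K_2 = %10.3e\n', n, cond(hilb(n)))
end
% K_2 is about 5e2 for n = 3 and already exceeds 1e16 for n = 12.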
To complete Gaussian elimination, 2(n − 1)n(n + 1)/3 + n(n − 1) flops are
required, plus n^2 flops to backsolve the triangular system Ux = b^(n).
Therefore, about (2n^3/3 + 2n^2) flops are needed to solve the linear sys-
tem using GEM. Neglecting the lower order terms, we can state that the
Gaussian elimination process has a cost of 2n^3/3 flops.
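The elimination (3.30)-(3.31) followed by the backward substitution (3.23)
fits in a few lines of MATLAB. The sketch below is ours (not one of the
Programs of this chapter) and assumes that no pivot vanishes:

function x = gem(A, b)
% GEM Solution of Ax=b by Gaussian elimination without pivoting,
% following (3.30)-(3.31), plus backward substitution (3.23).
n = length(b);
for k = 1:n-1
   for i = k+1:n
      m = A(i,k)/A(k,k);                        % multiplier m_ik (3.30)
      A(i,k+1:n) = A(i,k+1:n) - m*A(k,k+1:n);   % update (3.31)
      b(i) = b(i) - m*b(k);
      A(i,k) = 0;                               % eliminated entry
   end
end
x = zeros(n,1);
x(n) = b(n)/A(n,n);                             % back substitution (3.23)
for i = n-1:-1:1
   x(i) = (b(i) - A(i,i+1:n)*x(i+1:n))/A(i,i);
end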
As previously noticed, GEM terminates safely iff the pivotal elements a_kk^(k),
for k = 1, ..., n − 1, are non vanishing. Unfortunately, having non null
diagonal entries in A is not enough to prevent zero pivots from arising during
the elimination process. For example, the matrix A in (3.33) is nonsingular and
has nonzero diagonal entries

   A = [ 1  2  3 ]          A^(2) = [ 1   2    3  ]
       [ 2  4  5 ],                 [ 0   0   −1  ].   (3.33)
       [ 7  8  9 ]                  [ 0  −6  −12  ]

Nevertheless, when GEM is applied, it is interrupted at the second step
since a_22^(2) = 0.

More restrictive conditions on A are thus needed to ensure the appli-
cability of the method. We shall see in Section 3.3.1 that if the leading
principal minors d_i of A are nonzero for i = 1, ..., n − 1, then the corre-
sponding pivotal entries a_ii^(i) must necessarily be non vanishing. We recall
that d_i is the determinant of A_i, the i-th principal submatrix made by the
first i rows and columns of A. The matrix in the previous example does
not satisfy this condition, having d_1 = 1 and d_2 = 0.
Classes of matrices exist such that GEM can always be safely employed in
its basic form (3.31). Among them, we recall the following ones:

1. matrices diagonally dominant by rows;

2. matrices diagonally dominant by columns. In such a case one can even
   show that the multipliers are in module less than or equal to 1 (see
   Property 3.2);

3. matrices symmetric and positive definite (see Theorem 3.6).

For a rigorous derivation of these results, we refer to the forthcoming sec-
tions.

3.3.1 GEM as a Factorization Method

In this section we show how GEM is equivalent to performing a factorization
of the matrix A into the product of two matrices, A = LU, with U = A^(n).
Since L and U depend only on A and not on the right hand side, the same
factorization can be reused when solving several linear systems having the
same matrix A but different right hand sides b, with a considerable reduction
of the operation count (indeed, the main computational effort, about 2n^3/3
flops, is spent in the elimination procedure).

Let us go back to Example 3.2 concerning the Hilbert matrix H_3. In
practice, to pass from A^(1) = H_3 to the matrix A^(2) at the second step, we
have multiplied the system by the matrix

   M_1 = [   1    0  0 ]   [   1     0  0 ]
         [ −1/2   1  0 ] = [ −m_21   1  0 ].
         [ −1/3   0  1 ]   [ −m_31   0  1 ]

Indeed,

   M_1 A = M_1 A^(1) = [ 1  1/2   1/3  ]
                       [ 0  1/12  1/12 ] = A^(2).
                       [ 0  1/12  4/45 ]

Similarly, to perform the second (and last) step of GEM, we must multiply
A^(2) by the matrix

   M_2 = [ 1   0   0 ]   [ 1    0     0 ]
         [ 0   1   0 ] = [ 0    1     0 ],
         [ 0  −1   1 ]   [ 0  −m_32   1 ]