

© 2009 by Taylor and Francis Group, LLC
50 Algorithmic Cryptanalysis


Figure 2.1: Ordinary LFSR

Representation by LFSRs
A LFSR on n bits is formed of n individual registers, each able to store a
single bit, say r_0, r_1, ..., r_{n-1}. The value contained in each register changes
with time. For LFSRs, time is discretized and follows a clock signal. We
assume that the clock is initially set to 0 and goes up by increments of 1.
We denote the value contained in r_i at time t by r_i^{(t)}. There exist two dif-
ferent kinds of LFSRs, Galois LFSRs and ordinary LFSRs, which differ by the
evolution of the registers from time t to time t + 1. They are represented in
Figures 2.1 and 2.2.
From time t to time t + 1, a Galois LFSR evolves according to the following
rules:

• For all 0 ≤ i < n − 1: r_{i+1}^{(t+1)} = r_i^{(t)} ⊕ c_{i+1} r_{n-1}^{(t)}, where each c_i is a
constant 0 or 1.

• Moreover, r_0^{(t+1)} = r_{n-1}^{(t)}.

The time advance of a Galois LFSR can be interpreted as multiplication of
a polynomial by X modulo a polynomial of F_2[X]. Indeed, let I(X) =
X^n ⊕ Σ_{i=1}^{n-1} c_i X^i ⊕ 1 and, for every value of t, let f^{(t)} denote the polynomial
f^{(t)}(X) = Σ_{i=0}^{n-1} r_i^{(t)} X^i. We easily see that when r_{n-1}^{(t)} = 0, then f^{(t+1)}(X) =
X f^{(t)}(X). Similarly, when r_{n-1}^{(t)} = 1, then f^{(t+1)}(X) = X f^{(t)}(X) ⊕ I(X). As a
consequence, assuming that I is irreducible, the simple rule for advancing the
corresponding Galois LFSR corresponds to multiplication by α, a root of I in
F_{2^n}. This correspondence holds under the convention that the registers (r_i)
of a Galois LFSR encode the finite field element Σ_{i=0}^{n-1} r_i α^i. Clearly, the sum
of two finite field elements can be obtained by a simple XOR of the registers.
Finally, multiplication by an arbitrary element can be obtained by a sequence
of multiplications by α and XOR operations, similar to the exponentiation
Algorithm 2.9 or 2.10. A simple implementation of F_{2^32} using a Galois LFSR
in C code is given as Program 2.1.
To understand the other kind of LFSR, called ordinary LFSR when opposed
to Galois LFSR, or simply LFSR in many contexts, it is useful to focus on the
sequence of bits s_t = r_{n-1}^{(t)} obtained by considering only the high order bit

Elementary number theory and algebra background 51


Figure 2.2: Galois LFSR

Program 2.1 Representation of F_{2^32} with a Galois LFSR
#include <stdio.h>
#include <stdlib.h>

/* GF(2^32) using x^32+x^7+x^3+x^2+1 -> 2^7+2^3+2^2+1=141 */
#define IRRED 141

typedef unsigned int state;

state Mul_by_alpha(state val)
{
  state res;

  res = (val << 1);
  if (val & 0x80000000) res ^= IRRED;
  return res;
}

state Add(state val1, state val2)
{
  return val1 ^ val2;   /* addition in GF(2^32) is a XOR */
}

state Mul(state val1, state val2)
{
  state res = 0;

  while (val2 != 0) {
    if (val2 & 1) res = Add(res, val1);
    val1 = Mul_by_alpha(val1);   /* multiply by alpha at each step */
    val2 >>= 1;
  }
  return res;
}
of a Galois LFSR. Let us follow the evolution of one bit r_{n-1} as time passes.
After one round, the bit is in r_0. After i rounds, we see by induction that:

r_{i-1}^{(t+i)} = r_{n-1}^{(t)} ⊕ c_1 r_{n-1}^{(t+1)} ⊕ · · · ⊕ c_{i-1} r_{n-1}^{(t+i-1)}.    (2.32)

Finally, with the additional convention that c_0 = 1, we find that:

s_{t+n} = c_0 s_t ⊕ c_1 s_{t+1} ⊕ · · · ⊕ c_{n-1} s_{t+n-1}.    (2.33)

Ordinary LFSRs directly compute the sequence s from this recurrence rela-
tion. Once again, we need a register able to store n bits. For simplicity, we
reuse the same notation r_i. With ordinary LFSRs, with well-chosen initial
values of the registers, we can achieve the relation r_i^{(t)} = s_{t-i}, for all accept-
able values of i and t, i.e., 0 ≤ i < n and t − i ≥ 0. The evolution rules for
ordinary LFSRs from time t to time t + 1 are the following:

• For all 0 ≤ i < n − 1: r_{i+1}^{(t+1)} = r_i^{(t)}.

• Moreover, r_0^{(t+1)} is obtained as the XOR of a subset of the values r_i^{(t)}.
More precisely, we have:

r_0^{(t+1)} = c_0 r_0^{(t)} ⊕ c_1 r_1^{(t)} ⊕ · · · ⊕ c_{n-1} r_{n-1}^{(t)},    (2.34)

where each c_i is the same constant as before.
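The two evolution rules above can be turned into a one-clock step function in C. This is an illustrative sketch, not code from the text: the state word packs r_i into bit i, the hypothetical TAPS mask packs the constants c_i, and the function returns s_t, the high order bit.

```c
#include <stdint.h>

/* One clock of an ordinary LFSR on n = 32 bits (illustrative sketch).
   Bit i of the state word holds r_i; bit i of TAPS holds the constant c_i.
   The tap mask below is a hypothetical example, not taken from the text. */
#define TAPS 0x8000008DU

uint32_t lfsr_step(uint32_t *state)
{
    uint32_t s_t = (*state >> 31) & 1;                   /* output s_t = r_{n-1} */
    uint32_t feedback = __builtin_parity(*state & TAPS); /* XOR of the c_i r_i */
    *state = (*state << 1) | feedback;                   /* r_{i+1} <- r_i, r_0 <- feedback */
    return s_t;
}
```

Note that `__builtin_parity` is a GCC/Clang extension; a portable version would fold the XOR by hand.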

The above discussion shows that extracting a single bit from a Galois LFSR or
from the corresponding ordinary LFSR yields the same sequence. Both LFSRs
are thus said to be equivalent. One very important property of LFSRs is
that they produce periodic sequences whose lengths are equal to the multi-
plicative order of the root α of I in the finite field F_{2^n}. Note that the all-zeros
sequence can also be produced by any LFSR and is an exception to this rule,
since its period is 1. If α is a generator of the multiplicative group F_{2^n}^*,
its order is 2^n − 1 and the length of the sequence is maximal. When this
condition is satisfied, the irreducible polynomial I is said to be primitive.
Letting N = 2^n − 1, to check whether α has order N, we need to know
the factorization of N. Assuming that N = Π_{i=1}^{t} p_i^{e_i}, we can check that no
proper divisor of N is the order of α simply by checking that for all 1 ≤ i ≤ t:

α^{N/p_i} ≠ 1.    (2.35)
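Concretely, with the F_{2^32} arithmetic of Program 2.1, the test of Equation (2.35) can be sketched as follows. Here N = 2^32 − 1 = 3 · 5 · 17 · 257 · 65537, and the helper Pow is a plain square-and-multiply added for this sketch; whether this particular polynomial is primitive is left to the test itself, not asserted here.

```c
#include <stdint.h>

/* GF(2^32) arithmetic in the style of Program 2.1 */
#define IRRED 141U
typedef uint32_t state;

state Mul_by_alpha(state v)
{
    state r = v << 1;
    if (v & 0x80000000U) r ^= IRRED;  /* reduce modulo x^32+x^7+x^3+x^2+1 */
    return r;
}

state Mul(state a, state b)
{
    state res = 0;
    while (b != 0) {
        if (b & 1) res ^= a;          /* addition in GF(2^32) is XOR */
        a = Mul_by_alpha(a);
        b >>= 1;
    }
    return res;
}

/* Square-and-multiply exponentiation, in the spirit of Algorithms 2.9 and 2.10 */
state Pow(state x, uint32_t e)
{
    state r = 1;
    while (e != 0) {
        if (e & 1) r = Mul(r, x);
        x = Mul(x, x);
        e >>= 1;
    }
    return r;
}

/* alpha, encoded as the word 2 (the polynomial x), has order N = 2^32 - 1
   if and only if Pow(2, N/p) != 1 for every p in {3, 5, 17, 257, 65537}. */
```

Since α lies in the multiplicative group of F_{2^32}, the sanity check Pow(2, N) == 1 must always hold.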

Berlekamp-Massey algorithm   Knowing that LFSRs can gen-
erate long periodic sequences by using simple linear recurrences, it is also
interesting to consider the dual question. Given a sequence of bits, can we
determine the short linear recurrence expression that can generate this se-
quence? The Berlekamp-Massey algorithm is a very efficient method to solve this
question. It can be applied to find an optimal linear recurrence that generates
a given sequence over any field K. In order to describe this algorithm, it is
useful to introduce a few notations. Given a sequence T of elements of the
field K, a nice way to describe the shortest linear recurrence is to introduce
the notion of minimal polynomial of a sequence.
First, let us define the evaluation of a polynomial f of K[X] of degree d_f on
a sequence T of elements of K. Assuming that f(X) = Σ_{i=0}^{d_f} α_i X^i, we define
the sequence S = f(T) by the following formula:

S_i = Σ_{j=0}^{d_f} α_j T_{i+j}.    (2.36)

If T is an infinite sequence, S is also infinite. Otherwise, S contains d_f fewer
elements than T. We say that f annihilates T if and only if f(T) is the all-
zero sequence. We define I_T to be the set of all polynomials that annihilate
T. This set is an ideal of K[X]. When I_T is different from the zero ideal, we
can find a generator f_T of I_T (unique up to multiplication by a constant of
K^*), which is called the minimal polynomial of T. Note that f_T is the polynomial of
smallest degree that annihilates T.
In this section, we assume for simplicity that T is an infinite sequence of
numbers over a field K, with minimal polynomial f_T of degree d. By definition,
f_T annihilates T, thus f_T(T) = 0. It is useful to express the sequence T
by forming the series T(x) = Σ_{i=0}^{∞} T_i x^i in K[[x]]. Matching the definition of
multiplication for series and polynomials and the definition of evaluation of
f_T at T, we see that the i-th term of f_T(T) is the coefficient of x^{d+i} in f̄_T(x)T(x),
where f̄_T is the reciprocal polynomial of f_T, i.e., when f_T = Σ_{i=0}^{d} f_T^{(i)} x^i we
write f̄_T = Σ_{i=0}^{d} f_T^{(i)} x^{d-i}. Since f_T(T) is the zero sequence, we see that
f̄_T(x)T(x) is a polynomial g_T(x) of degree at most d − 1. In other words:

T(x) = g_T(x) / f̄_T(x),    (2.37)

using division according to increasing powers of x, i.e., considering each poly-
nomial as a Taylor series. Note that we need to take some care with Equa-
tion (2.37). To see why, let us construct a sequence T, starting with d ar-
bitrary values from T_0 to T_{d-1} and constant after that point. The minimal
polynomial of such a sequence is x^d − x^{d-1}. Now the reciprocal polynomial
is simply 1 − x. The problem is that this polynomial no longer contains in-
formation about the degree d. Thankfully, in this case, the degree of g_T is


exactly d − 1 and this allows us to recover d. In order to compute f_T and g_T,
it suffices to consider the first D terms of T and to decompose T(x) as a sum
T(x) = T_D(x) + x^D T'_D(x), where T_D contains the terms of degree less than D. Thus:

T_D(x) = g_T(x)/f̄_T(x) − x^D T'_D(x).    (2.38)

Multiplying by f̄_T(x) and matching terms of the same degrees, we see that:

f̄_T(x)T_D(x) + x^D h_T(x) = g_T(x),    (2.39)

where h_T is a polynomial of degree at most d − 1. When D ≥ 2d, this expres-
sion can be obtained through a partial run of the extended Euclidean algo-
rithm on the polynomials T_D and x^D, and this is at the core of the Berlekamp-
Massey algorithm.
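As a concrete illustration of Equation (2.37), consider an example added here, not taken from the text: the Fibonacci sequence T = 0, 1, 1, 2, 3, 5, 8, ... over K = Q, which satisfies T_{i+2} = T_{i+1} + T_i.

```latex
f_T(x) = x^2 - x - 1, \qquad \bar{f}_T(x) = 1 - x - x^2,
\\
T(x) = \sum_{i=0}^{\infty} T_i x^i
     = \frac{x}{1 - x - x^2}
     = \frac{g_T(x)}{\bar{f}_T(x)},
\qquad g_T(x) = x.
```

Here d = 2 and deg g_T = 1 = d − 1, so both f_T and its degree can be read off the rational expression of T(x), in accordance with the remark above.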
A slightly different formulation allows us to remove the cumbersome issue of
having to alternate between the f_T and f̄_T representations. The idea is to
take the reciprocal of all polynomials in Equation (2.39), with a small caveat
on our use of the word reciprocal. As defined above, the reciprocal is defined
as the sum up to the degree of the polynomial. Here, instead of using the
effective value of the degree, we use the expected value and treat T_D as a
polynomial of degree D even when its coefficient of x^D is zero; similarly, h_T
is viewed as a polynomial of degree d − 1. The case of g_T is slightly more
subtle; it should be considered as a polynomial of degree D + d − 1, with at
least its D upper coefficients equal to zero. To emphasize this fact, we write
its reciprocal as x^D ḡ_T(x), where ḡ_T is the reciprocal of g_T viewed as a degree
d − 1 polynomial. After this, Equation (2.39) becomes:

f_T(x)T̄_D(x) + h̄_T(x) = x^D ḡ_T(x), or    (2.40)
f_T(x)T̄_D(x) − x^D ḡ_T(x) = −h̄_T(x).

Once again, we can use an extended Euclidean algorithm, this time on the
polynomials T̄_D and x^D.
However, this is not the complete story. Indeed, both of the above ap-
proaches have a notable disadvantage; they require as input an upper bound
d on the degree of f_T in order to choose D, and this value controls the whole
computation. It is much preferable to rewrite the operation in a way that
lets us increase D as the computation proceeds. The idea is to create a se-
ries of polynomials that each annihilate increasingly long subsequences of the
original sequence. A typical example is a polynomial f such that:

f(x)T̄_N(x) = g(x) + x^N h(x),    (2.41)

with deg g < deg f ≤ N/2 and h(0) ≠ 0. Assume that we know an equation
as above and have another equation:

F(x)T̄_M(x) = G(x) + x^M H(x),    (2.42)


with M > N . If N ’ deg f ≥ M ’ deg F, we can improve the precision of
Equation (2.42) by letting:

H(0) M ’N
F (x) = F (x) ’ x f (x) and (2.43)
H(0) M ’N
G (x) = G(x) ’ x g(x).

Then, there exists H such that

F (x)TM +1 (x) = G (x) + xM +1 H (x).

Moreover, we can check that the degree of F is equal to the degree of F and
that deg G < deg F .
By repeating this improvement to Equation (2.42), we obtain an incremen-
tal way to compute f_T; the resulting process is given as Algorithm 2.14. In
this algorithm, polynomials need to be encoded as arrays. However, for ease of
reading, we write them as polynomials and denote by [x^l]F the l-th coefficient
of the polynomial F, with numbering starting at 0.

2.5.3 Solving univariate polynomial equations
Over finite fields, there are efficient algorithms, Berlekamp's and Cantor-
Zassenhaus, for finding roots of polynomial equations f(x) = 0 and also for
factoring f(x) into irreducible factors. Of course, we may, without loss of gener-
ality, assume that f is unitary, after multiplying it by the inverse of its highest
degree coefficient. Moreover, it is clear that factoring f into irreducible fac-
tors directly yields the roots of f(x) = 0, simply by considering the irreducible
factors of degree 1. Indeed, r is a root of f if and only if x − r divides f(x).
In order to factor f, there are two essential intermediate steps, squarefree
factorization and distinct degree factorization.

Squarefree factorization   A polynomial is said to be squarefree when no
irreducible factor appears with exponent greater than 1 in its factorization.
The goal of squarefree factorization is to decompose f into a product:

f(x) = Π_i f^{(i)}(x),    (2.44)

where each factor is squarefree. Clearly, unless we give more precision, this
does not define a unique decomposition for f. To make the decomposition
unique, we require the squarefree decomposition to be minimal, i.e., to contain
as few factors as possible.
In order to compute such a minimal squarefree factorization, we rely on the
basic property that whenever the square of an irreducible polynomial divides


Algorithm 2.14 Berlekamp-Massey algorithm
Require: Input sequence S of N elements over F_q
Let f(x) ←− 1
Let L_0 ←− 0
while S[L_0] = 0 do
  Let L_0 ←− L_0 + 1
end while
Let α ←− S[L_0]
Let F(x) ←− 1
Let L ←− 0
Let δ ←− 1
for l from 0 to N − 1 − L_0 do
  Let β ←− Σ_{i=0}^{deg F} S[l + L_0 − i]·([x^i]F) in F_q
  if β ≠ 0 then
    if 2L > l then
      Let F(x) ←− F(x) − (β/α)x^δ f(x)
      Let δ ←− δ + 1
    else
      Let L ←− l + 1 − L
      Let g(x) ←− F(x) − (β/α)x^δ f(x)
      Let f(x) ←− F(x)
      Let F(x) ←− g(x)
      Let δ ←− 1
      Let α ←− β
    end if
  else
    Let δ ←− δ + 1
  end if
end for
Output F(x)·x^{L_0} (lowest degree annihilator of S up to N elements)
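For the binary case, the algorithm admits a compact C formulation. The sketch below is the classical equivalent variant over F_2, not a line-by-line transcription of Algorithm 2.14: C[] holds the connection polynomial and the returned L is the length of the shortest LFSR generating the first n bits.

```c
#include <string.h>

#define MAXDEG 64  /* hypothetical bound on the recurrence length */

/* Berlekamp-Massey over F_2: returns the minimal LFSR length L for
   s[0..n-1] and fills C[] with the connection polynomial, so that
   s[i] = C[1]s[i-1] ^ ... ^ C[L]s[i-L] for all L <= i < n. */
int berlekamp_massey(const int *s, int n, int C[MAXDEG])
{
    int B[MAXDEG], T[MAXDEG];
    int L = 0, m = 1;               /* m counts steps since B was last updated */
    memset(C, 0, MAXDEG * sizeof(int)); C[0] = 1;
    memset(B, 0, sizeof(B));        B[0] = 1;
    for (int i = 0; i < n; i++) {
        /* discrepancy between s[i] and the current recurrence */
        int d = s[i];
        for (int j = 1; j <= L; j++) d ^= C[j] & s[i - j];
        if (d == 0) {
            m++;
        } else if (2 * L <= i) {    /* recurrence too short: lengthen it */
            memcpy(T, C, sizeof(T));
            for (int j = 0; j + m < MAXDEG; j++) C[j + m] ^= B[j];
            L = i + 1 - L;
            memcpy(B, T, sizeof(B));
            m = 1;
        } else {                    /* fix C without changing its length */
            for (int j = 0; j + m < MAXDEG; j++) C[j + m] ^= B[j];
            m++;
        }
    }
    return L;
}
```

Feeding it 2L or more bits of a sequence produced by a length-L LFSR recovers that LFSR, which is exactly the duality discussed above.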


f, then the irreducible polynomial also divides Df. Indeed, if f = g²h, we have:

D(f) = D(g(gh)) = gD(gh) + D(g)gh = g(D(gh) + D(g)h).    (2.45)

Thus, g divides D(f).
Conversely, if g is an irreducible polynomial that divides f such that g²
does not divide f, then g does not divide Df. Indeed, write f = gh with
h not a multiple of g. If g divides Df, the equality Df = gD(h) + D(g)h
implies that g divides D(g)h. Since g is irreducible and h is not a multiple of g,
we find that D(g) is a multiple of g. Since the degree of D(g) is lower than
the degree of g, this is possible only if D(g) = 0. We saw previously that D(g)
can be zero when g is either a constant polynomial or a p-th power. Since g
is irreducible, neither of these options is possible. We conclude that g cannot
divide Df.
With this remark in mind, let us consider the GCD of f and Df. If Df = 0,
then f is a p-th power. Otherwise, if g is an irreducible polynomial that
divides f and g^t is the greatest power of g that divides f, we find that g^{t-1}
divides Df and thus the above GCD. Letting f^{(1)} = f/GCD(f, Df) yields
the first step of the squarefree factorization: f^{(1)} is the product of all irre-
ducible polynomials that divide f, without their exponents. The other factor
GCD(f, Df) contains the irreducible polynomials that appear at least twice
in f, with an exponent equal to the corresponding exponent in f minus 1.
Repeating the same process with GCD(f, Df), we obtain f^{(2)}, which contains
all the irreducibles appearing at least twice, each with exponent 1, and so on.
Finally, we obtain a decomposition as in Equation (2.44), where f^{(i)} is the
product of all irreducible polynomials that appear at least i times in f. This
squarefree decomposition is described as Algorithm 2.15.

Distinct degree factorization   The next step toward complete factoriza-
tion of polynomials is to take as input a squarefree polynomial f, usually
a polynomial coming out of Algorithm 2.15, and to write it as a product of
several factors. The i-th factor in the decomposition is the product of all
irreducible polynomials of degree i.
The key ingredient of distinct degree factorization is the fact that every
element α in a finite field F_{q^n} of characteristic p, i.e., where q is either p or a
proper power of p, satisfies:

α^{q^n} − α = 0.    (2.46)

If I(x) is an irreducible polynomial of degree n over F_q, we know that I can be
used to represent F_{q^n}. Applying Equation (2.46) to a root α of I implies that
I(x) divides the polynomial P_{q,n}(x) = x^{q^n} − x. Since the argument applies
to all irreducibles of degree n, all these polynomials divide P_{q,n}. Moreover, no
irreducible polynomial of degree higher than n may divide P_{q,n}.
As a consequence, taking the GCD of f with P_{q,1} yields the product f^{(1)}
of all irreducibles of degree 1 dividing f. Replacing f by f/f^{(1)} and taking


Algorithm 2.15 Squarefree factorization of polynomials
Require: Input polynomial f, finite field F_{p^n}
Let m ←− 0
Let a be the highest degree coefficient of f
Let f ←− f/a
repeat
  Let g ←− D(f)
  if g = 0 then
    Here f is a p-th power
    Recurse on f^{1/p}, multiply all multiplicities by p and return
  end if
  Let h ←− GCD(f, g)
  Let m ←− m + 1 and f^{(m)} ←− f/h
  Let f ←− h
until f = 1
Output a and f^{(1)}, ..., f^{(m)}
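Over F_2, where a polynomial fits in a machine word (bit i = coefficient of x^i), the key ingredients of Algorithm 2.15 can be sketched with a few bit tricks. This illustrative code is not from the text; note that over F_2 the derivative simply keeps the odd-degree coefficients.

```c
#include <stdint.h>

/* Polynomials over F_2 packed into a 64-bit word: bit i = coeff of x^i. */

static int poly_deg(uint64_t p)        /* degree, -1 for the zero polynomial */
{
    int d = -1;
    while (p) { d++; p >>= 1; }
    return d;
}

static uint64_t poly_deriv(uint64_t p) /* D(p): i*p_i = p_i for odd i only */
{
    return (p >> 1) & 0x5555555555555555ULL;
}

static uint64_t poly_gcd(uint64_t a, uint64_t b)
{
    while (b != 0) {
        while (a != 0 && poly_deg(a) >= poly_deg(b))
            a ^= b << (poly_deg(a) - poly_deg(b));   /* a <- a mod b */
        uint64_t t = a; a = b; b = t;
    }
    return a;
}
```

For instance, f = (x+1)²(x²+x+1) = x⁴+x³+x+1 is the word 27; its derivative is x²+1 (the word 5), GCD(f, Df) = x²+1 = (x+1)², and the first squarefree factor f^{(1)} = f/GCD(f, Df) = x²+x+1, as expected.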

the GCD with P_{q,2} yields the product f^{(2)} of all irreducibles of degree 2, and
so on. For practicality of the resulting algorithm, it is essential to split the
GCD computations encountered here into two parts. Indeed, the polynomials
P_{q,n} have a high degree q^n and it would not be practical to directly use
Algorithm 2.1. Instead, we first compute the polynomial x^{q^n} modulo the
other polynomial f(x) using Algorithm 2.9 or 2.10. After subtracting x from
the result, we may use Algorithm 2.1 to conclude the GCD computation.
Distinct degree factorization is given as Algorithm 2.16.

Algorithm 2.16 Distinct degree factorization of a squarefree polynomial
Require: Input unitary squarefree polynomial f
Require: Input prime power q
Let m ←− 0
repeat
  Let m ←− m + 1
  Let g ←− x^{q^m} mod f(x)
  Let g ←− g − x
  Let f^{(m)} ←− GCD(f, g)
  Let f ←− f/f^{(m)}
until f = 1
Output f^{(1)}, ..., f^{(m)}
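Continuing with bit-packed polynomials over F_2 (an illustration added here, for q = 2), Algorithm 2.16 needs x^{2^m} mod f, which is obtained by repeated squaring: squaring over F_2 just spreads the bits to even positions.

```c
#include <stdint.h>

/* Polynomials over F_2 in a 64-bit word: bit i = coefficient of x^i. */

static int pdeg(uint64_t p) { int d = -1; while (p) { d++; p >>= 1; } return d; }

static uint64_t pmod(uint64_t a, uint64_t f)      /* a mod f */
{
    while (a != 0 && pdeg(a) >= pdeg(f))
        a ^= f << (pdeg(a) - pdeg(f));
    return a;
}

static uint64_t pgcd(uint64_t a, uint64_t b)
{
    while (b != 0) { uint64_t t = pmod(a, b); a = b; b = t; }
    return a;
}

static uint64_t pdiv(uint64_t a, uint64_t b)      /* exact quotient a/b */
{
    uint64_t q = 0;
    while (a != 0 && pdeg(a) >= pdeg(b)) {
        int s = pdeg(a) - pdeg(b);
        q |= 1ULL << s;
        a ^= b << s;
    }
    return q;
}

static uint64_t psquare(uint64_t a)   /* x^i -> x^{2i}, for a of degree < 32 */
{
    uint64_t r = 0;
    for (int i = 0; i < 32; i++)
        if ((a >> i) & 1) r |= 1ULL << (2 * i);
    return r;
}

/* One run of Algorithm 2.16 for q = 2 and deg f < 32: fills fac[m] with the
   product of the irreducible factors of degree m+1; returns the step count. */
static int distinct_degree(uint64_t f, uint64_t fac[], int maxm)
{
    uint64_t g = pmod(2, f);   /* g = x^{2^m} mod f, starting from x (word 2) */
    int m = 0;
    while (f != 1 && m < maxm) {
        g = pmod(psquare(g), f);
        fac[m++] = pgcd(f, g ^ 2);   /* GCD(f, x^{2^m} - x) */
        f = pdiv(f, fac[m - 1]);
        g = pmod(g, f);              /* still x^{2^m} mod f: new f divides old f */
    }
    return m;
}
```

For f = x⁴ + x = x(x+1)(x²+x+1) (the word 18), this yields the degree-1 part x²+x (word 6) and the degree-2 part x²+x+1 (word 7).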


Concluding the factorization   At this final stage, we receive as input
a polynomial f, knowing that it has no repeated factors and that all its
factors have the same degree k. The strategy is to split f into a product
of polynomials of smaller degree until all the factors have degree k. At this
point, we know that the factorization is complete and we stop. Thanks to
distinct degree factorization, we know that the GCD of f and x^{q^k} − x is equal
to f. When q is odd, we see that q^k is also odd and we may write:

x^{q^k} − x = x(x^{(q^k-1)/2} − 1)(x^{(q^k-1)/2} + 1).    (2.47)

This remark can be used to split f into polynomials of smaller degree. First,
if k = 1, we can easily test whether x divides f or not, simply by looking at
the constant coefficient of f. Second, we can compute the GCD of f with
x^{(q^k-1)/2} − 1. We expect that this yields a factor of f, roughly of half degree.
This is a nice start, but it does not suffice to conclude the factorization of f.
In order to go further, we need to repeat the same idea with other polynomials.
To illustrate this, let us consider the case k = 1. In this case, we know that
f divides x^q − x but also any polynomial of the form (x − a)^q − (x − a). To
prove this, it suffices to consider the polynomial g defined by g(x) = f(x + a).
Since f splits into factors of degree 1, g also splits into factors of degree 1,
thus x^q − x divides g. Replacing x by x − a in the division yields the desired
result. Since

(x − a)^q − (x − a) = (x − a)((x − a)^{(q-1)/2} − 1)((x − a)^{(q-1)/2} + 1),    (2.48)

we get another chance of decomposing f into smaller factors. We obtain an
efficient probabilistic algorithm simply by trying a small number of random
values for a.
When k is larger than 1, we can use the same argument and prove that f
divides (x − a)^{q^k} − (x − a). However, in general, this is not sufficient to fully
factor f. Instead, we need to generalize the idea and remark that for any
polynomial h, f divides h(x)^{q^k} − h(x). Using random polynomials of degree
k suffices to conclude the factorization using the identity:

h(x)^{q^k} − h(x) = h(x)(h(x)^{(q^k-1)/2} − 1)(h(x)^{(q^k-1)/2} + 1).    (2.49)

The resulting probabilistic method is described as Algorithm 2.17.
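The reason the GCD with h(x)^{(q^k-1)/2} − 1 splits f can be seen on the roots themselves: for k = 1, a root r of f lands in the first factor of Equation (2.48) exactly when (r − a)^{(q-1)/2} = 1, i.e., when r − a is a quadratic residue. A small numerical sketch, with example values chosen here and not taken from the text:

```c
#include <stdint.h>

/* Modular exponentiation, enough for a small prime p */
static uint64_t powmod(uint64_t b, uint64_t e, uint64_t p)
{
    uint64_t r = 1;
    b %= p;
    while (e != 0) {
        if (e & 1) r = (r * b) % p;
        b = (b * b) % p;
        e >>= 1;
    }
    return r;
}

/* Euler criterion: tells into which factor of
   (x-a)^q - (x-a) = (x-a)((x-a)^{(q-1)/2} - 1)((x-a)^{(q-1)/2} + 1)
   a root r of f lands: 0 for x-a, +1 or -1 for the two halves. */
static int side(uint64_t r, uint64_t a, uint64_t p)
{
    uint64_t t = (r + p - a % p) % p;
    if (t == 0) return 0;
    return powmod(t, (p - 1) / 2, p) == 1 ? 1 : -1;
}
```

With p = 7 and a polynomial whose roots are {1, 3, 5}, the shift a = 0 separates {1} from {3, 5}; trying a different random a regroups the roots differently, which is exactly why a few random shifts suffice.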
In characteristic 2, with q = 2^e, the same idea is used with a different way
of splitting x^{2^{ek}} − x. When ke is 1 or 2, it suffices to use exhaustive search.
Indeed, the only possible roots when e = 1 and k = 1 are 0 and 1. When
e = 1 and k = 2, we are looking for the only irreducible factor of degree 2
over F_2, i.e., x² + x + 1. When e = 2 and k = 1, we search for roots over F_4,
i.e., 0, 1, ρ or ρ + 1. Here ρ is a non-trivial cube root of unity in F_4, satisfying
ρ² + ρ + 1 = 0.
When ke is even and larger than 2, we use the same approach as in the odd
characteristic case, but we use different relations. Two cases are distinguished.


Algorithm 2.17 Final splitting of polynomials
Require: Input unitary squarefree polynomial f with factors of degree k only
Require: Input prime power q
Let g(x) be a random unitary polynomial of degree k
Create an empty list of factors L
Let f_0 ←− GCD(f(x), g(x))
if deg f_0 = k then
  Add f_0 to the list of factors L
end if
Let G(x) ←− g(x)^{(q^k-1)/2} in F_q[X]/(f(x))
Let f_1 ←− GCD(f(x), G(x) − 1)
if deg f_1 > k then
  Recursively call the final splitting on f_1 and obtain a list L_1
  Concatenate L_1 after L
else if deg f_1 = k then
  Append f_1 to L
end if
Let f_{-1} ←− GCD(f(x), G(x) + 1)
if deg f_{-1} > k then
  Recursively call the final splitting on f_{-1} and obtain a list L_{-1}
  Concatenate L_{-1} after L
else if deg f_{-1} = k then
  Append f_{-1} to L
end if
Output the list of factors L


When e is odd, or equivalently when F_{2^e} does not contain a cube root of unity,
we use the identity:

x^{2^{ek}} ⊕ x = x(x^{(2^{ek}-1)/3} ⊕ 1)(x^{2(2^{ek}-1)/3} ⊕ x^{(2^{ek}-1)/3} ⊕ 1).    (2.50)

When e is even, we let ρ denote a cube root of unity in F_{2^e} and we use the
identity:

x^{2^{ek}} ⊕ x = x(x^{(2^{ek}-1)/3} ⊕ 1)(x^{(2^{ek}-1)/3} ⊕ ρ)(x^{(2^{ek}-1)/3} ⊕ ρ ⊕ 1).    (2.51)

Finally, when e and k are both odd, we remark that x^{2^{ek}} + x divides x^{2^{2ek}} + x
modulo two and use the analog of Equation (2.50) for this higher degree
polynomial.

2.6 Vector spaces and linear maps
A vector space over a field K is a set V, whose elements are called vectors,
together with two operations, addition of two vectors and multiplication of a
vector of V by an element of K, satisfying the following properties:

• Addition has a neutral element called the zero vector and denoted by 0.

• Every vector v has an opposite −v, which can be obtained by multiplying
v by the constant −1 in K. In particular, if K is a field of characteristic
2, every vector v is its own opposite.

• Multiplication by a constant is distributive, i.e., for any pair of vectors
(v, w) and any scalar α in K, we have:

α(v + w) = αv + αw.    (2.52)

• Similarly, for any vector v and any pair of scalars (α, β), we have:

(α + β)v = αv + βv.    (2.53)

A finite family of n vectors (v_1, · · · , v_n) is a generating family for V if
and only if, for every vector v in V, there exists (at least) one n-uple of scalars
(α_1, · · · , α_n) such that:

v = Σ_{i=1}^{n} α_i v_i.    (2.54)

A family (v_1, · · · , v_n) is a free family if and only if the only n-uple of scalars
(α_1, · · · , α_n) such that Σ_{i=1}^{n} α_i v_i = 0 is the all-zero n-uple.


A family of vectors (v_1, · · · , v_n) which is both free and a generating family
for V is called a basis of V. For any vector v of V, there exists a unique n-uple
(α_1, · · · , α_n) such that Equation (2.54) holds. All bases of V have the same
cardinality, called the dimension of V and denoted dim V. Note that, in this
book, we only consider finite dimensional vector spaces.
For example, K is a vector space of dimension 1 over itself. More generally,
F_{p^n} can be seen as a vector space of dimension n over F_p.
A subset of V stable under addition and under multiplication by K is called a
vector subspace of V.
Given two vector spaces V and W over K, a map L from V to W is said to
be linear if and only if, for all pairs of scalars (α, β) and all pairs of vectors
(v, w), we have:

L(αv + βw) = αL(v) + βL(w).    (2.55)

The subset of vectors v of V such that L(v) = 0 is a subspace of V called the
kernel of L and denoted by Ker(L). Similarly, the set of all vectors L(v) is a
subspace of W called the image of L and denoted by Im(L). The dimensions
of the kernel and image of L satisfy the relation:

dim V = dim Im(L) + dim Ker(L).    (2.56)

For any vector y in Im(L), there exists (at least) one vector x such that
L(x) = y. Moreover, the set of vectors that are mapped to y by L is x + Ker(L).
If the dimension of Ker(L) is zero, then Ker(L) = {0} and any vector in Im(L)
has a unique inverse image. In that case, we say that L is invertible or that it
has full rank. We can then define the map that sends y to this inverse image;
it is a linear map, called the inverse of L and denoted by L^{-1}. Note that
L^{-1} is only defined on Im(L). The composition L^{-1} ∘ L is the identity on V.
Moreover, if W = Im(L), then L ∘ L^{-1} is the identity on W.
If (v_1, · · · , v_n) is a basis for V and (w_1, · · · , w_m) is a basis for W, then a
linear map from V to W can be described by a matrix M with m rows and
n columns. Letting M_{i,j} denote the entry on row i and column j of M, we
define the entries of M by considering the unique decomposition of each L(v_j)
as a sum:

L(v_j) = Σ_{i=1}^{m} M_{i,j} w_i.    (2.57)

With this representation, composition of linear maps can be computed using
matrix multiplication.
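The correspondence between composition and matrix multiplication can be checked on a toy example. The sketch below (illustrative code added here, working over F_5 with 2×2 matrices, column j holding the coordinates of L(v_j)) verifies that applying the product matrix equals applying the two maps in sequence.

```c
/* Apply a 2x2 matrix over F_5 to a coordinate vector: out = M v. */
static void apply(const int M[2][2], const int v[2], int out[2])
{
    for (int i = 0; i < 2; i++)
        out[i] = (M[i][0] * v[0] + M[i][1] * v[1]) % 5;
}

/* C = A * B over F_5, so that applying C equals applying B, then A. */
static void matmul(const int A[2][2], const int B[2][2], int C[2][2])
{
    for (int i = 0; i < 2; i++)
        for (int j = 0; j < 2; j++)
            C[i][j] = (A[i][0] * B[0][j] + A[i][1] * B[1][j]) % 5;
}
```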
Linear maps from V to K are called linear forms. The set of all linear forms
on V is a vector space over K, called the dual space of V . The dual of V has
the same dimension as V .
Both for matrices and for linear maps from a vector space to itself, it is
possible to define a deep mathematical invariant called the determinant, e.g.,
see [Lan05, Chapter XIII]. The determinant is non-zero if and only if the
matrix or linear map is invertible. Moreover, the determinant of the product


of two matrices or of the composition of two maps is equal to the product of
the determinants. The computation of determinants is described in Chapter 3.

2.7 The RSA and Diffie-Hellman cryptosystems
To conclude this chapter, we use the above number theory to briefly recall the
description of the RSA and Diffie-Hellman cryptosystems.

2.7.1 RSA
The system of Rivest, Shamir and Adleman (RSA) is based on the structure
of the multiplicative group of Z/NZ, when N is the product of two large primes
p and q. After choosing p and q, we also choose an encryption exponent e
coprime to φ(N) and let d denote the inverse of e modulo φ(N). Thus, we
have ed = 1 + λφ(N) for some integer λ.
Since φ(N) = (p − 1)(q − 1), we know that for all integers 0 < x < N
coprime with N, we have x^{φ(N)} = 1 (mod N). As a consequence:

x^{ed} = x^{1+λφ(N)} = x · (x^{φ(N)})^λ = x (mod N).    (2.58)

This shows that the two maps:

x → x^e (mod N) and
x → x^d (mod N)

are inverses of each other in the group (Z/NZ)^*. We leave it as an exercise to
the reader to show that the two maps are inverses of each other everywhere in
Z/NZ. Note that this is not really needed for the RSA cryptosystem. Indeed,
we may assume that no one ever encrypts an integer 0 < x < N not coprime
to N, since that would essentially reveal the factorization of N. We may also
assume that no one ever encrypts 0, since it is a weak message for RSA.
RSA as a whole is an efficient system for the following reasons:

• RSA moduli can be efficiently constructed. Indeed, to construct an RSA
modulus, we need the ability to pick large primes at random. This can be
done easily, thanks to the efficient primality testing algorithms described
in Section 2.3.2. Note that many practical implementations of RSA keys
do not choose large primes uniformly at random in a specified interval;
instead, they start from a random number in the interval and find the next
prime after this number. Since primes are non-uniformly distributed,
this procedure favors primes which are far apart from their immediate
predecessor. However, RSA numbers of this type are not easier to factor
than ordinary RSA numbers and no attack is known against this non-
uniform generation.

• Computing decryption exponents is easy and results from a simple ap-
plication of Euclid's algorithm.

• The RSA encryption and decryption permutations are, given N and e (or
d), efficiently computed. It suffices to use a fast modular exponentiation
algorithm such as Algorithm 2.9 or 2.10.
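The whole round-trip of Equation (2.58) can be checked with the classical textbook-sized parameters p = 61 and q = 53 (an illustrative toy key added here; real moduli are of course far larger).

```c
#include <stdint.h>

/* Square-and-multiply modular exponentiation (cf. Algorithms 2.9 and 2.10) */
static uint64_t powmod(uint64_t x, uint64_t e, uint64_t n)
{
    uint64_t r = 1;
    x %= n;
    while (e != 0) {
        if (e & 1) r = (r * x) % n;
        x = (x * x) % n;
        e >>= 1;
    }
    return r;
}

/* Toy RSA key: p = 61, q = 53, N = 3233, phi(N) = 3120,
   e = 17 and d = 2753, since 17 * 2753 = 46801 = 15 * 3120 + 1.
   Encryption is x -> powmod(x, 17, 3233), decryption x -> powmod(x, 2753, 3233). */
```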

From the security point of view, it is clear that RSA can be completely broken
using a modular e-th root algorithm. However, at the present time, the only
known approach to compute arbitrary e-th roots without external help, i.e.,
given only N and e, is to factor N and then to proceed as the private key
owner. Note that the knowledge of both e and d can be used to compute a
multiple of φ(N) and, as shown in Section, this multiple can be used
to recover the factorization of N. For these reasons, it is usually said that the
security of RSA relies on the hardness of factoring N.
In the above paragraph, the words “arbitrary” and “external help” are es-
sential to the link between factoring and the security of RSA. Concerning
“arbitrary,” we see that since RSA encryption is deterministic, it is easy to
decrypt a value x chosen from a small subset, simply by enumerating pos-
sible encryptions. A more advanced attack, described in Chapter 8, allows
one to decrypt x efficiently, with a non-negligible probability of success, when it
belongs to a small interval. Concerning “external help,” let us remark that
plain RSA does not resist chosen ciphertext attacks, i.e., given access to an
RSA decryption box, it is also possible to decrypt messages not submitted to
the box. This is based on an essential property of RSA, which is not shared
by trapdoor one-way permutations in general: its multiplicativity. By multi-
plicativity of RSA, we simply mean that given x and y in Z/NZ, the following
relation is satisfied:

(xy)^e = x^e · y^e (mod N).    (2.59)

To realize a chosen ciphertext attack, the attacker receives a value z and tries
to compute x = z^d (mod N) without submitting this value to the decryption
box. He chooses a random element r in (Z/NZ)^* and computes z_1 = z·r^e
(mod N). Sending z_1 to the decryption box, he receives a value x_1 such that
x_1 = x·r (mod N). Thus, a modular division by r suffices to recover x.
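Using a toy RSA key (p = 61, q = 53, N = 3233, e = 17, d = 2753; illustrative sizes only, with r = 7 chosen arbitrarily), the blinding attack can be sketched as follows.

```c
#include <stdint.h>

static uint64_t powmod(uint64_t x, uint64_t e, uint64_t n)
{
    uint64_t r = 1;
    x %= n;
    while (e != 0) {
        if (e & 1) r = (r * x) % n;
        x = (x * x) % n;
        e >>= 1;
    }
    return r;
}

/* Blinding attack on plain RSA with the toy key N = 3233, e = 17, d = 2753.
   The attacker wants z^d but only submits z1 = z * r^e to the oracle;
   d is passed in solely to simulate the decryption box. */
static uint64_t blind_decrypt(uint64_t z, uint64_t r,
                              uint64_t N, uint64_t e, uint64_t d)
{
    uint64_t z1 = (z * powmod(r, e, N)) % N;   /* blinded ciphertext */
    uint64_t x1 = powmod(z1, d, N);            /* what the oracle returns */
    /* divide by r: here r^{-1} = r^{phi(N)-1} mod N, with phi(N) = 3120 */
    uint64_t rinv = powmod(r, 3120 - 1, N);
    return (x1 * rinv) % N;
}
```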
Despite these attacks, the multiplicativity of RSA is not necessarily a bad
thing. Indeed, it can also be used constructively. In particular, homomorphic
encryption based on RSA or on similar computations, such as Paillier's
cryptosystem [Pai99], explicitly requires multiplicativity. Moreover, general
methods have been devised to use RSA securely, both for signature, such as
the full domain hash and probabilistic signature schemes of [BR96], and for
encryption, such as the optimal asymmetric encryption scheme of [BR94].


2.7.2 Diffie-Hellman key exchange
Let K be any finite field and let g be a generator of a subgroup of K^* of
order q, preferably prime. In order to create a common secret, two users A
and B proceed as follows:

• Each user chooses a random integer in the range [0, q − 1], a for user A
and b for user B, and raises g to this power.

• Then A sends g^a to B and B sends g^b to A.

• Upon reception, A computes (g^b)^a and B computes (g^a)^b. Clearly, these
two elements of K are equal and their common value is g^{ab}, where ab is
computed modulo q.

• Finally, A and B extract their common secret from g^{ab}. Two frequently
encountered approaches are to extract a few bits from the binary rep-
resentation of g^{ab} or, alternatively, to use a hash value of g^{ab}. This last
option is often used in proofs that rely on the random oracle model.
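The exchange can be traced on a toy subgroup (illustrative parameters chosen here: p = 23 and g = 2, which generates a subgroup of prime order q = 11 since 2^11 = 2048 ≡ 1 (mod 23)).

```c
#include <stdint.h>

static uint64_t powmod(uint64_t b, uint64_t e, uint64_t p)
{
    uint64_t r = 1;
    b %= p;
    while (e != 0) {
        if (e & 1) r = (r * b) % p;
        b = (b * b) % p;
        e >>= 1;
    }
    return r;
}

/* Toy Diffie-Hellman in F_23 with g = 2 of order q = 11:
   A picks a and publishes g^a; B picks b and publishes g^b;
   both sides then compute the common value g^(ab mod q). */
```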

From a security point of view, an algorithm that could solve computational
Diffie-Hellman, i.e., compute g^{ab} given g^a and g^b, but not a or b, would com-
pletely break Diffie-Hellman key exchange. At the present time, the only
known general algorithm involves computing discrete logarithms, i.e., recover-
ing a or b. Computations of discrete logarithms are discussed in Chapters 6, 7
and 15.
Note that the Diffie-Hellman key exchange algorithm is not secure against
active attackers. The simplest attack is probably to substitute g^a and g^b by 1
during transmission. After this tampering, A computes 1^a = 1 as its common
key and B computes 1^b = 1 as its common key. Since the two users end up with
the same key, the communication may proceed without problem. After that
initial tampering, the attacker knows the common secret 1. As a consequence,
he just needs to eavesdrop on the remainder of the communication. Of course,
this can be avoided by excluding 0 from the allowed range of a and b and by
aborting if the incoming value is 1.
Another well-known active attack against Diffie-Hellman key exchange is
the man-in-the-middle attack. Here, the attacker breaks the communication
link between A and B. Afterwards, he communicates with A pretending to
be B and with B pretending to be A. On each side, he participates in a
copy of the Diffie-Hellman key exchange. After that, the attacker plays the
role of a proxy between A and B: he decrypts with one common key any
incoming message, reencrypts it with the other key and, finally, forwards the
message. As a consequence, he listens to the whole communication. Neither
A nor B can detect the man-in-the-middle attack, unless some external means
of authentication is used.


Non-interactive Diffie-Hellman
The Diffie-Hellman key exchange algorithm can be used as a hybrid cryp-
tosystem in a very simple way. Each user simply needs to choose a pair (x, g^x)
as in the Diffie-Hellman key exchange algorithm. However, instead of choosing
a new pair for each run of the protocol, he keeps it fixed. As a consequence,
x becomes the user's private key and g^x his public key. With this change,
it becomes possible to exchange a secret in a non-interactive fashion. When
a user wants to initiate a communication, he chooses a random value r and
computes g^r. Then, considering the public key of the intended recipient
as the other party's contribution to this run of the protocol, he can derive a
common secret g^rx and, finally, extracting a key from this common secret, he
can encrypt the message he wants to transmit using a secret key cryptosys-
tem. Attaching g^r to the encrypted message allows the recipient to recover
g^rx and decrypt.
ElGamal encryption [ElG85] is a variant that encrypts by multiplying g^rx
with a message m consisting of a number in F_p. Like RSA, it suffers from
several weaknesses related to multiplicativity; for an example, see Chapter 8.

Zero-knowledge proofs of discrete logarithms
Once a user has a public key g^x, he can use it to prove his identity. Of course,
this should be done without revealing the secret key x. An ideal tool for this
purpose is the notion of zero-knowledge proof [GMR89], which formally defines
the meaning of proving that x is known, without giving away information on
x. For discrete logarithms, there is a simple and efficient protocol that achieves
this. Assuming that the system parameters, i.e., the prime p and an element
g of prime order q, are known and that the prover wants to prove knowledge
of x such that Y = g^x (mod p), the protocol proceeds as follows:
• The prover chooses a random number r in the range [0, q − 1] and an-
nounces R = g^r mod p.
• The verifier chooses a random number c in the range [0, q − 1], called
the challenge, and sends it to the prover.
• The prover computes u = (r + cx) mod q and sends it to the verifier.
• The verifier checks that g^u = R · Y^c (mod p) and, if so, accepts the proof.
It is easy to check that an honest prover, who knows x and follows the
protocol correctly, always successfully convinces the verifier. Moreover, if, for
a fixed value of R, a prover can give convincing answers u and u′ for two
different challenges c and c′, he can recover x by remarking that (u′ − u) =
(c′ − c)x (mod q). As a consequence, a cheating prover, i.e., someone who
does not know x and tries to run the protocol as prover, has at most 1 chance
out of q to succeed. This probability is so low that it can be ignored for
all practical purposes. This proof is zero-knowledge because a verifier can


produce a convincing copy of a legitimate proof, by first choosing the challenge
c and the answer u, and then by computing R as g^u · Y^(−c) (mod p). Using this
method, the verifier is able to simulate the prover on the single challenge c
(for the value R) and this fake proof cannot be distinguished from a real one
by a third party.
In cryptography, such knowledge proofs are often used as authentication
protocols.

Signature based on discrete logarithm
The above zero-knowledge proof is easy to transform into a signature by
replacing the interaction with the verifier by the output of a hash function H.
Assuming that H is a random oracle with output in [0, q − 1], the simplest way
is to use Schnorr's signature [Sch90] and let c = H(R, M), where R = g^r is
produced by the signer and M is the message to be signed. After computing
u, the pair (R, u) becomes a signature for M. There are several possible
variations of this idea, including ElGamal's signature [ElG85] and the NIST
digital signature algorithm.

Part II


Chapter 3
Linear algebra

Linear algebra is a widely used tool in computer science and cryptography is
no exception to this rule. One notable difference is that in cryptography and
cryptanalysis, we mostly consider linear algebra over finite fields (or sometimes
rings). Compared to linear algebra over real or complex numbers, there are
two essential changes. One for the best: no stability problems can occur; one
for the worst: the notion of convergence is no longer available. This makes
linear algebra in cryptanalysis quite different from linear algebra for scientific
computing. The reader interested in the scientific computing aspect of linear
algebra may refer to [GL96].

3.1 Introductory example: Multiplication of small matrices over F2
In order to illustrate the specificities of the linear algebra problems encoun-
tered in cryptography, we first consider the multiplication of small matrices
over the finite field F2. More precisely, we show how to optimize the imple-
mentation of the basic matrix multiplication Algorithm 3.1, directly derived
from the mathematical definition of matrix multiplication, when multiplying
Boolean matrices. We especially consider matrices of size 32 × 32, 64 × 64 or
128 × 128. The natural idea that first comes to mind is to directly follow the
algorithm description and write the simple Program 3.1. As written, it works
on 32 × 32 matrices, but this can be easily changed by defining DIM differently.

One important drawback of this elementary implementation is that it wastes
memory. Indeed, each matrix entry is represented by a full integer. This is
clearly suboptimal, since a single bit would suffice. This waste can be re-
duced by encoding the entries using shorter integers. However, even using
8-bit integers (the C type char) is already costly in terms of memory. As
a consequence, before trying to improve the code running time, we replace
the matrix representation by an optimal representation where several bits are
packed into a single integer. With 32-bit integers, this is extremely well-suited
to represent square matrices whose dimension is a multiple of 32.

Algorithm 3.1 Elementary square matrix multiplication
Require: Input n × n matrices M and N
Create n × n matrix R initialized to 0
for l from 1 to n do
for c from 1 to n do
for k from 1 to n do
R[l, c] ←− (R[l, c] + M[l, k] · N[k, c]) mod 2
end for
end for
end for
Output R

On computers that offer larger 64- or 128-bit integers, matrix dimensions
divisible by 64 or 128 are a natural choice. In particular, many recent
processors have special instructions to work with such extended integers;
interested readers may refer to Figure 3.1. Assuming 32-bit words, we are
going to focus on 32 × 32
matrices. Changing the memory representation and using logical operations
to compute arithmetic modulo 2 (logical AND for multiplication and XOR
for addition) yields the code of Program 3.2. In order to access the individ-
ual elements of the matrices in each line, the program uses two preprocessor
macros. The first macro bit(M,l,c) extracts the bit in line l and column c
of the matrix M, under the convention that each line of the matrix is stored in
an unsigned integer of the correct size. The second macro flipbit(M,l,c)
flips the value of the bit from 0 to 1 or from 1 to 0. We do not give a macro
for writing a value into a specific bit, since flipping bits is sufficient for our
purpose, easier to write and faster to execute.
In order to improve the matrix multiplication algorithm and compare the
performance of various implementations, we need a reliable way of measuring
the running times. Since the time for performing a single multiplication is
negligible compared to the time of starting the program and reading the input
matrices, it is useful to add a loop in order to repeat the multiplication a large
number of times, say 100,000 times.
Our first optimization comes from an improvement of the binary scalar
product of two 32-bit words. Instead of using a loop to extract one bit posi-
tion from each word, multiply these bits together and add up the products,
we remark that all the multiplications can be performed in parallel using a
single wordwise logical AND of integers. Moreover, adding up the products
requires a smaller number of XORs than initially expected. The key idea is
to XOR the upper and lower halves of a word. This performs 16 additions
in parallel using a single XOR and can be viewed as folding the word in two.
Folding again, we perform 8 additions and obtain an intermediate result on
a single byte. After three more folds, we end up with the expected result
on a single bit. Clearly, since a matrix multiplication consists of computing


Program 3.1 Basic C code for matrix multiplication over F2
#include <stdio.h>
#include <stdlib.h>
#define DIM 32

void input_mat(int mat[DIM][DIM])
{ int l,c;
  for (l=0;l<DIM;l++) {
    for (c=0;c<DIM;c++) {
      scanf("%d",&mat[l][c]); } } }

void print_mat(int mat[DIM][DIM])
{ int l,c;
  for (l=0;l<DIM;l++) {
    for (c=0;c<DIM;c++) {
      printf("%d ",mat[l][c]); }
    printf("\n"); } }

void Mul(int res[DIM][DIM],
         int mat1[DIM][DIM], int mat2[DIM][DIM])
{ int l,c,k;
  for (l=0;l<DIM;l++) {
    for (c=0;c<DIM;c++) { res[l][c]=0;
      for (k=0;k<DIM;k++) {
        res[l][c]+=mat1[l][k]*mat2[k][c]; }
      res[l][c]%=2; } } }

int main()
{ int mat1[DIM][DIM]; int mat2[DIM][DIM];
  int mat3[DIM][DIM]; int count;
  printf("Input Mat1\n"); input_mat(mat1);
  printf("Input Mat2\n"); input_mat(mat2);
  for (count=0;count<100000;count++) Mul(mat3,mat1,mat2);
  printf("Product :\n"); print_mat(mat3);
  return 0; }


Program 3.2 Matrix multiplication over F2 with compact encoding
#include <stdio.h>
#include <stdlib.h>
#define DIM 32
#define WORD unsigned int
#define bit(M,l,c) ((M[l]>>c)&1)
#define flipbit(M,l,c) if (1) {M[l]^=(1UL<<c);} else

void input_mat(WORD mat[DIM])
{ int l,c,val;
  for (l=0;l<DIM;l++) { mat[l]=0;
    for (c=0;c<DIM;c++) {
      scanf("%d",&val);
      if (val) flipbit(mat,l,c); } } }

void print_mat(WORD mat[DIM])
{ int l,c;
  for (l=0;l<DIM;l++) {
    for (c=0;c<DIM;c++) {
      printf("%d ",bit(mat,l,c)); }
    printf("\n"); } }

void Mul(WORD res[DIM],WORD mat1[DIM],WORD mat2[DIM])
{ int l,c,k,val;
  for (l=0;l<DIM;l++) { res[l]=0;
    for (c=0;c<DIM;c++) { val=0;
      for (k=0;k<DIM;k++) {
        val^=bit(mat1,l,k)&bit(mat2,k,c); }
      if (val) flipbit(res,l,c); } } }

int main()
{ WORD mat1[DIM]; WORD mat2[DIM]; WORD mat3[DIM]; int count;
  printf("Input Mat1\n"); input_mat(mat1);
  printf("Input Mat2\n"); input_mat(mat2);
  for(count=0; count<100000; count++) Mul(mat3,mat1,mat2);
  printf("Product :\n"); print_mat(mat3);
  return 0; }


Recent microprocessors are becoming more and more powerful and manufac-
turers use this additional power to enrich the functionalities of processors.
MMX and SSE instructions have been introduced as part of this process.
They allow computations on specific 64-bit and 128-bit registers. The ad-
vantage of these instructions, when optimizing programs, is twofold. First,
these instructions compute faster than ordinary microprocessor instructions.
Second, since these instructions operate on specific registers rather than on
the main registers, they are extremely useful for codes which require many
registers. For example, the bitslice technique, described in Chapter 5, re-
quires a large number of registers and MMX or SSE instructions are very
useful in this case.
One important caveat is that MMX registers share their storage area
with floating point registers. As a consequence, both types of instructions
cannot be interleaved within a program, and when starting a sequence of
MMX instructions or a sequence of floating point instructions, the
storage area needs to be cleared by using a specific instruction called EMMS.
Otherwise, we get either unpredictable results or unwanted errors.
MMX and SSE operations can view the specific registers in several ways.
For example, a 64-bit MMX register can be considered either as eight bytes,
four 16-bit words or two 32-bit words, optionally using saturated arithmetic
of signed or unsigned kind. There are instructions to add in parallel eight
bytes, four 16-bit words or two 32-bit words. SSE registers can also contain
these different types of contents; however, they contain twice as many of
each. One notable exception, which is very useful for cryptographic purposes,
concerns logical operations. Indeed, performing a logical AND, OR or XOR
on eight bytes or on two 32-bit words yields the same result. Thus, for logical
operations, it is not necessary to have three different types.
At the present time, with the GCC compiler, MMX and SSE operations can
only be used in C programs by using a specific means of inlining assembly
code. The easiest way probably is to use the definitions from the include files
mmintrin.h, emmintrin.h, pmmintrin.h, tmmintrin.h and xmmintrin.h.
These files define two new data types, __m64 for 64-bit MMX registers and
__m128 for 128-bit SSE registers. These special data types can be operated
on using specific operations. A few of these operations are listed below.
    Instruction name   Functionality        Data type
    _mm_empty          Reset storage area   None
    _mm_add_pi16       Addition             Four 16-bit words
    _mm_sub_pi8        Subtraction          Eight bytes
    _mm_xor_pd         XOR                  128-bit word

Figure 3.1: MMX and SSE instructions


the scalar products of lines with columns, this can be very useful. Since in
our representation words are used to encode lines, we first need to extract
columns from the representation of the second matrix and encode them back
into another integer. The multiplication subroutine thus obtained is given as
Program 3.3.

Program 3.3 Matrix multiplication using fast scalar product
void Mul(WORD res[DIM], WORD mat1[DIM], WORD mat2[DIM])
{ int l,c,k;
  WORD val,mask;
  for (l=0;l<DIM;l++) res[l]=0;
  for (c=0;c<DIM;c++) { mask=0;
    for (k=0;k<DIM;k++) {
      mask^=(bit(mat2,k,c)<<k); }
    for (l=0;l<DIM;l++) {
      val=mat1[l]&mask;
      val^=(val>>16); val^=(val>>8);
      val^=(val>>4); val^=(val>>2);
      val^=(val>>1);
      if (val&1) flipbit(res,l,c); } } }

In order to further improve this code, we are going to modify two parts of
the program. The first modification replaces the extraction of columns from
the second matrix by a faster approach. Since we need all the columns of this
matrix, we are, in truth, computing the transpose of this matrix. Thus, our
modification consists of writing a fast transpose procedure and applying it
before calling the matrix multiplication. The second modification is to speed
up the slow part of the scalar product. Clearly, when performing the scalar
products, the multiplication part is optimal, since we compute 32 parallel bit
multiplications using the logical AND on 32-bit words. However, the addition
part is suboptimal: we need five folds to compute a single bit. This can be
improved by remarking that the second fold only uses 16 bits out of 32, the
third only 8 bits and so on. Thus, it is possible to share the logical operations
between several folds, thus computing several bits of the resulting
matrix at once.
Let us start by improving matrix transposition. In order to do this, we view
the original matrix as a 2 × 2 block matrix. Transposing this matrix, we find
that:

    ( A  B )^T   ( A^T  C^T )
    ( C  D )   = ( B^T  D^T ).
Using the above remark, transposing a 32 × 32 matrix can be done in five
steps. The first step exchanges the upper right and lower left 16 × 16 matrices.
The second step performs a similar exchange on four pairs of 8 × 8 matrices.
Then 16 pairs of 4 × 4 matrices, followed by 64 pairs of 2 × 2 matrices and finally
256 pairs of 1 × 1 matrices are exchanged. By using logical operations and
shifts, each of these steps costs a small number of operations per line. From
an asymptotic point-of-view, 2^t × 2^t matrices can be transposed in O(t 2^t)
operations on 2^t-bit words, instead of 2^2t operations for the basic algorithm.
The transposition C code for 32 × 32 matrices is given as Program 3.4. Note the
constants defined at the beginning of the program, which are chosen in order
to extract the bits contained in the submatrices that need to be exchanged at
each step.
Improving the additions during the scalar product is done in
a similar way. The first fold is performed as before, since it uses up full words.
The second fold regroups the computation of 2 second-stage folds in a single
operation, the third fold regroups 4 third-stage folds, and so on. This allows
us to compute 32 bits using a total of 63 elementary fold operations, thus
the amortized cost is less than two operations per bit, instead of five in the
previous approach. The corresponding code for matrix multiplication makes
use of the fast transposition Program 3.4 and is included as Program 3.5. To
illustrate the technique, the inner loops have been partially unrolled.
In order to compare the practical performance of these algorithms, we give
in Table 3.1 a sample of running times on a laptop computer, with each
program compiled using gcc's -O3 option. These timings are given using 3
different versions of gcc. They clearly illustrate that low-level optimizations
are not only machine dependent, but also deeply rely on the specific compiler
being used. For example, the loop unrolling in Program 3.5 improves the
running time with gcc 4.0.1 but makes it worse for the other two versions.
With gcc 4.3.2, Programs 3.1 and 3.5 illustrate the fact that this compiler
version can vectorize some programs using MMX and SSE instructions (see
Figure 3.1).

3.2 Dense matrix multiplication
The complexity of matrix multiplication is a very important and difficult
problem in computer science. Indeed, the complexity of many important
linear algebra problems can be reduced to the complexity of matrix multipli-
cation. As a consequence, improving matrix multiplication is a key problem.


Program 3.4 Fast transposition of 32 — 32 matrices over F2
#define DIM 32
#define Cst16 0xffff
#define Cst8 0xff00ff
#define Cst4 0xf0f0f0f
#define Cst2 0x33333333
#define Cst1 0x55555555

void Transpose(WORD transp[DIM], WORD mat[DIM])
{ int j,k;
  WORD m,t;
  for (k=0;k<DIM;k++) transp[k]=mat[k];
  /* The mask m successively takes the values Cst16, Cst8,
     Cst4, Cst2 and Cst1, one per exchange step */
  m=Cst16;
  for (j=DIM/2;j!=0;j>>=1,m^=(m<<j)) {
    /* Exchange the high columns of line k with the low
       columns of line k+j, for every k with bit j clear */
    for (k=0;k<DIM;k=(k+j+1)&~j) {
      t=((transp[k]>>j)^transp[k+j])&m;
      transp[k]^=(t<<j); transp[k+j]^=t; } } }


Program 3.5 Faster scalar product for multiplying 32 × 32 matrices
void Mul(WORD res[DIM], WORD mat1[DIM], WORD mat2[DIM])
{ int l,c; WORD transp[DIM]; WORD tmp[DIM]; WORD val;
  Transpose(transp,mat2);
  for (l=0;l<DIM;l++) {
    /* First fold: 32 scalar products reduced to 16 bits each */
    for (c=0;c<DIM;c+=4) {
      val=mat1[l]&transp[c];   val^=(val>>16); tmp[c]=val&Cst16;
      val=mat1[l]&transp[c+1]; val^=(val>>16); tmp[c+1]=val&Cst16;
      val=mat1[l]&transp[c+2]; val^=(val>>16); tmp[c+2]=val&Cst16;
      val=mat1[l]&transp[c+3]; val^=(val>>16); tmp[c+3]=val&Cst16; }
    /* Each later fold packs two partial results per word */
    for (c=0;c<DIM/2;c++) {
      val=tmp[c]^(tmp[c+DIM/2]<<(DIM/2));
      tmp[c]=(val&Cst8)^((val>>(DIM/4))&Cst8); }
    for (c=0;c<DIM/4;c++) {
      val=tmp[c]^(tmp[c+DIM/4]<<(DIM/4));
      tmp[c]=(val&Cst4)^((val>>(DIM/8))&Cst4); }
    for (c=0;c<DIM/8;c++) {
      val=tmp[c]^(tmp[c+DIM/8]<<(DIM/8));
      tmp[c]=(val&Cst2)^((val>>(DIM/16))&Cst2); }
    for (c=0;c<DIM/16;c++) {
      val=tmp[c]^(tmp[c+DIM/16]<<(DIM/16));
      tmp[c]=(val&Cst1)^((val>>1)&Cst1); }
    res[l]=tmp[0]^(tmp[1]<<1); } }


                                     Runtime (100,000 mult.)
    Program                      gcc 4.0.1   gcc 4.2.1   gcc 4.3.2
    3.1                            9.01 s      8.16 s      3.68 s
    3.2                            8.09 s     12.51 s     12.39 s
    3.3                            1.38 s      1.38 s      1.30 s
    3.5                            0.32 s      0.38 s      0.27 s
    3.5 without loop unrolling     0.38 s      0.24 s      0.11 s

Table 3.1: 32 × 32 Boolean matrix multiplication on Intel Core 2 Duo at 2.4 GHz

Asymptotically, the basic Algorithm 3.1 multiplies n × n matrices in time
O(n^3). In the other direction, we have a trivial lower bound on the running
time of matrix multiplication: O(n^2), the time needed to read the input ma-
trices and/or write the result. The first matrix multiplication algorithm that
beats the n^3 complexity was proposed in 1969 by Volker Strassen. After that
initial step, several further asymptotic improvements were proposed; they are
discussed in Section 3.2.2.
Throughout the current section, we focus on matrices represented by a
dense array of coefficients. Matrices with sparse encoding are discussed in
Section 3.4. We consider the complexity of matrix multiplication, detail many
practical aspects of matrix multiplication algorithms, overview some asymp-
totic complexity results and show the relation between matrix multiplication
and other linear algebra problems.

3.2.1 Strassen's algorithm

The first matrix multiplication with asymptotic complexity better than n^3,
due to Strassen [Str69], is a divide-and-conquer algorithm built on a "magic" recipe
for multiplying 2 × 2 matrices. The ordinary formula for 2 × 2 matrices requires
8 multiplications and 4 additions. Strassen's formula requires 7 multiplica-
tions and 18 additions. For 2 × 2 matrices, this costs more than the usual
algorithm; however, for large 2n × 2n matrices, we need 7 multiplications of
n × n matrices and 18 additions. Since matrices can be added together in time
quadratic in n, the contribution of the additions to the asymptotic complexity
is negligible and the analysis focuses on the number of multiplications. We see
that the running time T(n) of Strassen's algorithm as a function of n satisfies
the recurrence formula:

    T(2n) = 7 · T(n) + O(n^2),

and conclude that T(n) = O(n^(log 7/log 2)) ≈ O(n^2.807).


Strassen's formulas for multiplying M = ( a b ; c d ) by M′ = ( a′ b′ ; c′ d′ ) are:

    P1 = (a + c) · (a′ + b′),                                    (3.1)
    P2 = (b + d) · (c′ + d′),
    P3 = (b + c) · (c′ − b′),
    P4 = c · (a′ + c′),
    P5 = b · (b′ + d′),
    P6 = (c − d) · c′,
    P7 = (a − b) · b′  and

             ( P1 + P3 − P4 − P7     P5 + P7           )
    M · M′ = (                                         ).
             ( P4 − P6               P2 − P3 − P5 + P6 )

Note that another set of formulas with as many multiplications but fewer
additions was later proposed as an alternative by Winograd and is described
further on in this section. To transform the formulas of Strassen into a matrix
multiplication algorithm, the basic idea is to use a recursive algorithm that
multiplies matrices using seven recursive calls to itself on matrices of half
size. The problem is that while this approach works very well for matrices
whose dimensions are powers of two, it needs to be modified to deal with
matrices which do not obey this restriction. From a theoretical point-of-view,
the easiest way to multiply n × n matrices is to embed them within matrices
of dimension 2^t × 2^t for the smallest value of t such that 2^t ≥ n. In
the worst case, this embedding doubles the dimension of the matrix, and thus
this only affects the constant factor in the runtime complexity O(n^(log2 7)), not
the exponent. However, from a practical point-of-view, it is better to deal
with the difficulty progressively throughout the algorithm rather than once
and for all at the beginning. To multiply matrices of even dimensions, we use
a direct recursion. To multiply matrices of odd dimensions, we need to deal
with the imperfect split. Two methods are possible: we can round the size
of the half-size matrices either up or down. To round up, it suffices to add
bordering zeros to increase the size of the original matrix. To round down,
we use the ordinary matrix multiplication formulas to deal with the extra line
and column. The two methods are respectively presented as Algorithms 3.2
and 3.3.

Practical aspects of Strassen's multiplication
When implementing Strassen's algorithm, it is clearly better to have a cutoff
point and to turn back to ordinary matrix multiplication for small matrices,
thus aborting the recursion earlier. Indeed, for small matrices, the overhead
of dividing and reassembling the matrices dominates the running time. In this
section, we give some explicit data comparing ordinary matrix multiplication
with Strassen's multiplication for some typical matrices. This data is far from

Algorithm 3.2 Strassen matrix multiplication (rounding up)
Require: Input n × n matrices M and N
if n = 1 then
Return a 1 × 1 matrix with entry M[1, 1] · N[1, 1]
end if
Let h1 ←− ⌈n/2⌉
Let n1 ←− 2h1
If needed, add zeros to expand M and N into n1 × n1 matrices
Create R a n1 × n1 matrix with zero entries
Create a h1 × h1 matrix M1 , with M1[i, j] = M[i, j] + M[h1 + i, j]

