
© 2009 by Taylor and Francis Group, LLC

50 Algorithmic Cryptanalysis


Figure 2.1: Ordinary LFSR

2.5.2.1 Representation by LFSRs

A LFSR on $n$ bits is formed of $n$ individual registers, each able to store a single bit, say $r_0, r_1, \ldots, r_{n-1}$. The value contained in each register changes with time. For LFSRs, time is discretized and follows a clock signal. We assume that the clock is initially set to 0 and goes up by increments of 1.

We denote the value contained in $r_i$ at time $t$ by $r_i^{(t)}$. There exist two different kinds of LFSRs, Galois LFSRs and ordinary LFSRs, which differ by the evolution of the registers from time $t$ to time $t+1$. They are represented in Figures 2.1 and 2.2.

From time $t$ to time $t+1$, a Galois LFSR evolves according to the following rules:

• For all $0 \le i < n-1$: $r_{i+1}^{(t+1)} = r_i^{(t)} \oplus c_{i+1}\, r_{n-1}^{(t)}$, where each $c_i$ is a constant 0 or 1.

• Moreover, $r_0^{(t+1)} = r_{n-1}^{(t)}$.

The time advance of a Galois LFSR can be interpreted as multiplication of a polynomial by $X$ modulo a polynomial of $\mathbb{F}_2[X]$. Indeed, let $I(X) = X^n \oplus \sum_{i=1}^{n-1} c_i X^i \oplus 1$ and, for every value of $t$, let $f^{(t)}$ denote the polynomial $f^{(t)}(X) = \sum_{i=0}^{n-1} r_i^{(t)} X^i$. We easily see that when $r_{n-1}^{(t)} = 0$, then $f^{(t+1)}(X) = X f^{(t)}(X)$. Similarly, when $r_{n-1}^{(t)} = 1$, then $f^{(t+1)}(X) = X f^{(t)}(X) \oplus I(X)$. As a

consequence, assuming that I is irreducible, the simple rule for advancing the

corresponding Galois LFSR corresponds to multiplication by Ī±, a root of I in

F2n . This correspondence holds under the convention that the registers (ri )

nā’1

of a Galois LFSR encode the ļ¬nite ļ¬eld element i=0 ri Ī±i . Clearly, the sum

of two ļ¬nite ļ¬eld elements can be obtained by a simple XOR of the registers.

Finally, multiplication by an arbitrary element can be obtained by a sequence

of multiplications by Ī± and XOR operations, similar to the exponentiation

Algorithm 2.9 or 2.10. A simple implementation of F232 using a Galois LFSR

in C code is given as Program 2.1.

To understand the other kind of LFSR, called ordinary LFSR when opposed to Galois LFSR, or simply LFSR in many contexts, it is useful to focus on the sequence of bits $s_t = r_{n-1}^{(t)}$ obtained by considering only the high order bit


Elementary number theory and algebra background 51


Figure 2.2: Galois LFSR

Program 2.1 Representation of $\mathbb{F}_{2^{32}}$ with a Galois LFSR

#include <stdio.h>
#include <stdlib.h>

/* GF(2^32) using x^32+x^7+x^3+x^2+1 -> 2^7+2^3+2^2+1=141 */
#define IRRED 141

typedef unsigned int state;

state Mul_by_alpha(state val)
{
  state res;

  res=(val<<1); if (val&0x80000000) res^=IRRED;
  return(res);
}

state Add(state val1, state val2)
{
  return(val1^val2);
}

state Mul(state val1, state val2)
{
  state res=0;

  while(val2!=0) {
    if (val2&1) res=Add(res,val1);
    val1=Mul_by_alpha(val1);
    val2>>=1;
  }
  return(res);
}


of a Galois LFSR. Let us follow the evolution of one bit $r_{n-1}^{(t)}$ as time passes. After one round, the bit is in $r_0^{(t+1)}$. After $i$ rounds, we see by induction that:

$$r_{i-1}^{(t+i)} = r_{n-1}^{(t)} \oplus \sum_{j=1}^{i-1} c_j\, r_{n-1}^{(t+j)}. \qquad (2.32)$$

Finally, with the additional convention that $c_0 = 1$, we find that:

$$s_{t+n} = \sum_{j=0}^{n-1} c_j\, s_{t+j}. \qquad (2.33)$$

Ordinary LFSRs directly compute the sequence $s$ from this recurrence relation. Once again we need a register able to store $n$ bits. For simplicity, we reuse the same notation $r_i^{(t)}$. With ordinary LFSRs, with well-chosen initial values of the registers, we can achieve the relation $r_i^{(t)} = s_{t-i}$, for all acceptable values of $i$ and $t$, i.e., $0 \le i < n$ and $t - i \ge 0$. The evolution rules for ordinary LFSRs from time $t$ to time $t+1$ are the following:

• For all $0 \le i < n-1$: $r_{i+1}^{(t+1)} = r_i^{(t)}$.

• Moreover, $r_0^{(t+1)}$ is obtained as the XOR of a subset of the values $r_i^{(t)}$. More precisely, we have:

$$r_0^{(t+1)} = \sum_{i=0}^{n-1} c_i\, r_i^{(t)}, \qquad (2.34)$$

where each $c_i$ is the same constant as before.

The above discussion shows that extracting a single bit from a Galois LFSR or from the corresponding ordinary LFSR yields the same sequence. Both LFSRs are thus said to be equivalent. One very important property of LFSRs is that they produce periodic sequences whose lengths are equal to the multiplicative order of the root $\alpha$ of $I$ in the finite field $\mathbb{F}_{2^n}$. Note that the all-zeros sequence can also be produced by any LFSR and is an exception to this rule, since its period is 1. If $\alpha$ is a generator of the multiplicative group $\mathbb{F}_{2^n}^*$, its order is $2^n - 1$ and the length of the sequence is maximal. When this condition is satisfied, the irreducible polynomial $I$ is said to be primitive.
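The link between the Galois register update and the recurrence of Equation (2.33) can be checked experimentally. The sketch below is a hypothetical small instance (not Program 2.1 itself): a Galois LFSR on $n = 8$ bits with $I(x) = x^8 + x^4 + x^3 + x^2 + 1$, so that the nonzero constants are $c_4 = c_3 = c_2 = c_0 = 1$ and the reduction mask is 0x1D. It verifies that the output bits satisfy $s_{t+8} = s_t \oplus s_{t+2} \oplus s_{t+3} \oplus s_{t+4}$:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical small instance: n = 8, I(x) = x^8 + x^4 + x^3 + x^2 + 1,
   so the mask carrying c_7..c_1 and the constant term is 0x1D. */
#define LFSR_N    8
#define LFSR_TAPS 0x1Du

/* One Galois step: output s_t = r_{n-1}, then multiply the register
   polynomial by x and reduce modulo I(x). */
static int galois_step(uint8_t *reg)
{
    int out = (*reg >> (LFSR_N - 1)) & 1;
    *reg = (uint8_t)(*reg << 1);
    if (out)
        *reg ^= LFSR_TAPS;
    return out;
}

/* Check Equation (2.33) on len output bits:
   s_{t+8} = s_t ^ s_{t+2} ^ s_{t+3} ^ s_{t+4} (the nonzero c_j). */
static int output_satisfies_recurrence(uint8_t seed, int len)
{
    int s[256];
    uint8_t reg = seed;

    if (len > 256)
        len = 256;
    for (int t = 0; t < len; t++)
        s[t] = galois_step(&reg);
    for (int t = 0; t + LFSR_N < len; t++)
        if (s[t + 8] != (s[t] ^ s[t + 2] ^ s[t + 3] ^ s[t + 4]))
            return 0;
    return 1;
}
```

Any nonzero seed works, since the recurrence holds for every initial state of the register.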

Letting $N = 2^n - 1$, to check whether $\alpha$ has order $N$, we need to know the factorization of $N$. Assuming that $N = \prod_{i=1}^{t} p_i^{e_i}$, we can check that no proper divisor of $N$ is the order of $\alpha$ simply by checking that for all $1 \le i \le t$:

$$\alpha^{N/p_i} \ne 1. \qquad (2.35)$$
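This order check of Equation (2.35) can be illustrated on a small field. The sketch below uses illustrative assumptions: $\mathbb{F}_{2^8}$ represented with the polynomial $x^8 + x^4 + x^3 + x^2 + 1$ (reduction mask 0x1D), so $N = 255 = 3 \cdot 5 \cdot 17$. It reuses the multiplication style of Program 2.1 and square-and-multiply exponentiation in the spirit of Algorithms 2.9/2.10:

```c
#include <assert.h>
#include <stdint.h>

/* Small analogue of Program 2.1: F_{2^8} via x^8 + x^4 + x^3 + x^2 + 1. */
static uint8_t mul_by_alpha(uint8_t v)
{
    uint8_t res = (uint8_t)(v << 1);
    if (v & 0x80)
        res ^= 0x1D;
    return res;
}

static uint8_t mul(uint8_t a, uint8_t b)
{
    uint8_t res = 0;
    while (b != 0) {
        if (b & 1)
            res ^= a;          /* addition is XOR in characteristic 2 */
        a = mul_by_alpha(a);
        b >>= 1;
    }
    return res;
}

/* Square-and-multiply exponentiation in F_{2^8}. */
static uint8_t pow_elem(uint8_t x, unsigned e)
{
    uint8_t res = 1;
    while (e != 0) {
        if (e & 1)
            res = mul(res, x);
        x = mul(x, x);
        e >>= 1;
    }
    return res;
}

/* alpha = x is the byte 0x02; N = 255 = 3 * 5 * 17. alpha has order N
   iff alpha^N = 1 and alpha^(N/p) != 1 for every prime p dividing N. */
static int alpha_is_primitive(void)
{
    const unsigned primes[3] = {3, 5, 17};
    if (pow_elem(0x02, 255) != 1)
        return 0;
    for (int i = 0; i < 3; i++)
        if (pow_elem(0x02, 255 / primes[i]) == 1)
            return 0;
    return 1;
}
```

For this particular polynomial the check succeeds, so $I$ is primitive and the associated LFSR sequences have maximal length 255.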


2.5.2.1.1 Berlekamp-Massey algorithm Knowing that LFSRs can generate long periodic sequences by using simple linear recurrences, it is also interesting to consider the dual question. Given a sequence of bits, can we determine the short linear recurrence expression that can generate this sequence?

The Berlekamp-Massey algorithm is a very efficient method to solve this question. It can be applied to find an optimal linear recurrence that generates a given sequence over any field $K$. In order to describe this algorithm, it is useful to introduce a few notations. Given a sequence $T$ of elements of the field $K$, a nice way to describe the shortest linear recurrence is to introduce the notion of minimal polynomial of a sequence.

First, let us define the evaluation of a polynomial $f$ of $K[X]$ of degree $d_f$ on a sequence $T$ of elements of $K$. Assuming that $f(X) = \sum_{i=0}^{d_f} \alpha_i X^i$, we define the sequence $S = f(T)$ by the following formula:

$$S_i = \sum_{j=0}^{d_f} \alpha_j\, T_{i+j}. \qquad (2.36)$$

If $T$ is an infinite sequence, $S$ is also infinite. Otherwise, $S$ contains $d_f$ fewer elements than $T$. We say that $f$ annihilates $T$ if and only if $f(T)$ is the all-zero sequence. We define $I_T$ to be the set of all polynomials that annihilate $T$. This set is an ideal of $K[X]$. When $I_T$ is different from the zero ideal, we can find a generator $f_T$ of $I_T$ (unique up to multiplication by a constant of $K^*$), which is called the minimal polynomial of $T$. Note that $f_T$ is the polynomial of smallest degree that annihilates $T$.
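Equation (2.36) is straightforward to implement. As a small sketch over $K = \mathbb{F}_2$ (the field used by the LFSRs above), the following function evaluates $f$ on a finite sequence $T$; a polynomial annihilates $T$ exactly when every output term is zero:

```c
#include <assert.h>

/* Evaluation of f on a sequence T per Equation (2.36), over F_2:
   S_i = XOR over j of fcoef[j] & T[i+j], for 0 <= j <= df.
   A finite T of n terms yields n - df output terms. */
static void eval_poly_on_sequence(const int *fcoef, int df,
                                  const int *T, int n, int *S)
{
    for (int i = 0; i + df < n; i++) {
        int v = 0;
        for (int j = 0; j <= df; j++)
            v ^= fcoef[j] & T[i + j];
        S[i] = v;
    }
}
```

For instance, a sequence obeying $T_{t+3} = T_{t+1} \oplus T_t$ is annihilated by $f(X) = X^3 + X + 1$, i.e., coefficient array {1,1,0,1}.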

In this section, we assume for simplicity that $T$ is an infinite sequence of numbers over a field $K$, with minimal polynomial $f_T$ of degree $d$. By definition, $f_T$ annihilates $T$, thus $f_T(T) = 0$. It is useful to express the sequence $T$ by forming the series $T(x) = \sum_{i=0}^{\infty} T_i x^i$ in $K[[x]]$. Matching the definition of multiplication for series and polynomials and the definition of evaluation of $f_T$ at $T$, we see that the $i$-th term of $f_T(T)$ is the coefficient of $x^{d+i}$ in $\bar{f}_T(x)T(x)$, where $\bar{f}_T$ is the reciprocal polynomial of $f_T$, i.e., when $f_T = \sum_{i=0}^{d} f_T^{(i)} x^i$ we write $\bar{f}_T = \sum_{i=0}^{d} f_T^{(i)} x^{d-i}$. Since $f_T(T)$ is the zero sequence, we see that $\bar{f}_T(x)T(x)$ is a polynomial $g_T(x)$ of degree at most $d-1$. In other words:

$$T(x) = \frac{g_T(x)}{\bar{f}_T(x)}, \qquad (2.37)$$

using division according to increasing powers of $x$, i.e., considering each polynomial as a Taylor series. Note that we need to take some care with Equation (2.37). To see why, let us construct a sequence $T$, starting with $d$ arbitrary values from $T_0$ to $T_{d-1}$ and constant after that point. The minimal polynomial of such a sequence is $x^d - x^{d-1}$. Now the reciprocal polynomial is simply $x - 1$. The problem is that this polynomial no longer contains information about the degree $d$. Thankfully, in this case, the degree of $g_T$ is


exactly $d-1$ and this allows us to recover $d$. In order to compute $f_T$ and $g_T$, it suffices to consider the first $D$ terms of $T$ and to decompose $T(x)$ as a sum $T(x) = T_D(x) + x^D \tilde{T}_D(x)$, where $T_D$ collects the terms of degree less than $D$. Thus:

$$T_D(x) = \frac{g_T(x)}{\bar{f}_T(x)} - x^D \tilde{T}_D(x). \qquad (2.38)$$

Multiplying by $\bar{f}_T(x)$ and matching terms of the same degrees, we see that:

$$\bar{f}_T(x)T_D(x) + x^D h_T(x) = g_T(x), \qquad (2.39)$$

where $h_T$ is a polynomial of degree at most $d-1$. When $D \ge 2d$, this expression can be obtained through a partial run of the extended Euclidean algorithm on the polynomials $T_D$ and $x^D$ and this is at the core of the Berlekamp-Massey algorithm.

A slightly different formulation allows us to remove the cumbersome issue of having to alternate between the $f_T$ and $\bar{f}_T$ representations. The idea is to take the reciprocal of all polynomials in Equation (2.39), with a small caveat on our use of the word reciprocal. As defined above, the reciprocal is defined as the sum up to the degree of the polynomial. Here, instead of using the effective value of the degree, we use the expected value and treat $T_D$ as a polynomial of degree $D$ even when its coefficient of $x^D$ is zero; similarly, $h_T$ is viewed as a polynomial of degree $d-1$. The case of $g_T$ is slightly more subtle: it should be considered as a polynomial of degree $D + d - 1$, with at least its $D$ upper coefficients equal to zero. To emphasize this fact, we write the reciprocal as $x^D \bar{g}_T(x)$, where $\bar{g}_T$ is a reciprocal of $g_T$ viewed as a degree $d-1$ polynomial. After this, Equation (2.39) becomes:

$$f_T(x)\bar{T}_D(x) + \bar{h}_T(x) = x^D \bar{g}_T(x), \quad \text{or} \qquad (2.40)$$
$$f_T(x)\bar{T}_D(x) - x^D \bar{g}_T(x) = \bar{h}_T(x).$$

Once again, we can use an extended Euclidean algorithm, this time on the polynomials $\bar{T}_D$ and $x^D$.

However, this is not the complete story. Indeed, both of the above approaches have a notable disadvantage: they require as input an upper bound $d$ on the degree of $f_T$ in order to choose $D$, and this value controls the whole computation. It is much preferable to rewrite the operation in a way that lets us increase $D$ as the computation proceeds. The idea is to create a series of polynomials that each annihilate increasingly long subsequences of the original sequence. A typical example is a polynomial $f$ such that:

$$f(x)T_N(x) = g(x) + x^N h(x), \qquad (2.41)$$

with $\deg g < \deg f \le N/2$ and $h(0) \ne 0$. Assume that we know an equation as above and have another equation:

$$F(x)T_M(x) = G(x) + x^M H(x), \qquad (2.42)$$


with $M > N$. If $N - \deg f \ge M - \deg F$, we can improve the precision of Equation (2.42) by letting:

$$F'(x) = F(x) - \frac{H(0)}{h(0)}\, x^{M-N} f(x) \quad \text{and} \qquad (2.43)$$
$$G'(x) = G(x) - \frac{H(0)}{h(0)}\, x^{M-N} g(x).$$

Then, there exists $H'$ such that

$$F'(x)T_{M+1}(x) = G'(x) + x^{M+1} H'(x).$$

Moreover, we can check that the degree of $F'$ is equal to the degree of $F$ and that $\deg G' < \deg F'$.

By repeating this improvement to Equation (2.42), we obtain an incremental way to compute $f_T$; the resulting process is given as Algorithm 2.14. In this algorithm, polynomials need to be encoded as arrays. However, for ease of reading, we write them as polynomials and denote by $[x^l]F$ the $l$-th coefficient of the polynomial $F$, with numbering starting at 0.
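For the special case $K = \mathbb{F}_2$, the incremental process condenses into a few lines. The sketch below is a compact textbook-style variant of Berlekamp-Massey, not a line-by-line transcription of Algorithm 2.14: it returns the length $L$ of the shortest recurrence generating the input bits and fills in the connection polynomial $C$ such that $s_t = \sum_{i=1}^{L} C_i s_{t-i}$:

```c
#include <assert.h>
#include <string.h>

#define BM_MAX 64   /* maximum sequence length handled by this sketch */

/* Berlekamp-Massey over F_2. s[0..n-1] is the input bit sequence
   (n <= BM_MAX); c must have room for BM_MAX coefficients. */
static int berlekamp_massey(const int *s, int n, int *c)
{
    int b[BM_MAX], t[BM_MAX];
    int L = 0, m = -1;          /* m: position of the last length change */

    memset(c, 0, BM_MAX * sizeof(int));
    memset(b, 0, BM_MAX * sizeof(int));
    c[0] = b[0] = 1;

    for (int i = 0; i < n; i++) {
        int d = s[i];           /* discrepancy at position i */
        for (int j = 1; j <= L; j++)
            d ^= c[j] & s[i - j];
        if (d == 0)
            continue;
        memcpy(t, c, BM_MAX * sizeof(int));
        for (int j = 0; j + i - m < BM_MAX; j++)
            c[j + i - m] ^= b[j];       /* C(x) ^= x^(i-m) B(x) */
        if (2 * L <= i) {
            L = i + 1 - L;
            m = i;
            memcpy(b, t, BM_MAX * sizeof(int));
        }
    }
    return L;
}
```

On the 14 bits 1,0,0,1,0,1,1,1,0,0,1,0,1,1 (generated by $s_{t+3} = s_{t+1} \oplus s_t$), it recovers length 3 and the connection polynomial $1 + x^2 + x^3$.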

2.5.3 Solving univariate polynomial equations

Over finite fields, there are efficient algorithms, Berlekamp's and Cantor-Zassenhaus, for finding roots of polynomial equations $f(x) = 0$ and also to factor $f(x)$ into irreducible factors. Of course, we may, without loss of generality, assume that $f$ is unitary, after multiplying it by the inverse of its highest degree coefficient. Moreover, it is clear that factoring $f$ into irreducible factors directly yields the roots of $f(x) = 0$, simply by considering the irreducible factors of degree 1. Indeed, $r$ is a root of $f$, if and only if, $x - r$ divides $f(x)$. In order to factor $f$, there are two essential intermediate steps: squarefree factorization and distinct degree factorization.

Squarefree factorization A polynomial is said to be squarefree when no irreducible factor appears with exponent greater than 1 in its factorization. The goal of squarefree factorization is to decompose $f$ into a product:

$$f(x) = \prod_{i=1}^{m} f^{(i)}(x), \qquad (2.44)$$

where each factor is squarefree. Clearly, unless we give more precision, this does not define a unique decomposition for $f$. To make the decomposition unique, we require the squarefree decomposition to be minimal, i.e., to contain as few factors as possible.

In order to compute such a minimal squarefree factorization, we rely on the

basic property that whenever the square of an irreducible polynomial divides


Algorithm 2.14 Berlekamp-Massey algorithm
Require: Input sequence $S$ of $N$ elements over $\mathbb{F}_q$
  Let $f(x) \longleftarrow 1$
  Let $L_0 \longleftarrow 0$
  while $S[L_0] = 0$ do
    Let $L_0 \longleftarrow L_0 + 1$
  end while
  Let $\alpha \longleftarrow S[L_0]$
  Let $F(x) \longleftarrow 1$
  Let $L \longleftarrow 0$
  Let $\delta \longleftarrow 1$
  for $l$ from 0 to $N - 1 - L_0$ do
    Let $\beta \longleftarrow \sum_{i=0}^{l} S[l + L_0 - i]\,([x^i]F)$ in $\mathbb{F}_q$
    if $\beta \ne 0$ then
      if $2L > l$ then
        Let $F(x) \longleftarrow F(x) - (\beta/\alpha)\, x^{\delta} f(x)$
        Let $\delta \longleftarrow \delta + 1$
      else
        Let $L \longleftarrow l + 1 - L$
        Let $g(x) \longleftarrow F(x) - (\beta/\alpha)\, x^{\delta} f(x)$
        Let $f(x) \longleftarrow F(x)$
        Let $F(x) \longleftarrow g(x)$
        Let $\delta \longleftarrow 1$
        Let $\alpha \longleftarrow \beta$
      end if
    else
      Let $\delta \longleftarrow \delta + 1$
    end if
  end for
  Output $F(x)\,x^{L_0}$ (lowest degree annihilator of $S$ up to $N$ elements)


$f$, then the irreducible polynomial also divides $Df$. Indeed, if $f = g^2 h$, we have:

$$D(f) = D(g(gh)) = g\,D(gh) + D(g)gh = g(D(gh) + D(g)h). \qquad (2.45)$$

Thus, $g$ divides $D(f)$.

Conversely, if $g$ is an irreducible polynomial that divides $f$, such that $g^2$ does not divide $f$, then $g$ does not divide $Df$. Indeed, write $f = gh$ with $h$ not a multiple of $g$. If $g$ divides $Df$, the equality $Df = g\,D(h) + D(g)h$ implies that $g$ divides $D(g)h$. Since $g$ is irreducible and $h$ not a multiple of $g$, we find that $D(g)$ is a multiple of $g$. Since the degree of $D(g)$ is lower than the degree of $g$, this is possible only if $D(g) = 0$. We saw previously that $D(g)$ can be zero, when $g$ is either a constant polynomial or a $p$-th power. Since $g$ is irreducible, none of these options is possible. We conclude that $g$ cannot divide $Df$.

With this remark in mind, let us consider the GCD of $f$ and $Df$. If $Df = 0$ then $f$ is a $p$-th power. Otherwise, if $g$ is an irreducible polynomial that divides $f$ and $g^t$ the greatest power of $g$ that divides $f$, we find that $g^{t-1}$ divides $Df$ and thus the above GCD. Letting $f^{(1)} = f/\mathrm{GCD}(f, Df)$ yields the first step of the squarefree factorization. $f^{(1)}$ is the product of all irreducible polynomials that divide $f$, without their exponents. The other factor $\mathrm{GCD}(f, Df)$ contains the irreducible polynomials that appear at least twice in $f$, with an exponent equal to the corresponding exponent in $f$ minus 1. Repeating the same process with $\mathrm{GCD}(f, Df)$, we obtain $f^{(2)}$ that contains all the irreducibles appearing at least twice, each with exponent 1, and so on. Finally, we obtain a decomposition as in Equation (2.44), where $f^{(i)}$ is the product of all irreducible polynomials that appear at least $i$ times in $f$. This squarefree decomposition is described as Algorithm 2.15.
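The derivative-and-GCD property behind this decomposition is easy to check on bit-packed polynomials over $\mathbb{F}_2$. In the sketch below (an illustration only, with a polynomial stored in a machine word, bit $i$ holding the coefficient of $x^i$), the derivative keeps exactly the odd-degree terms, as expected in characteristic 2; the test exercises the property of Equation (2.45) on $f = (x^3+x+1)^2(x^2+x+1)$:

```c
#include <assert.h>
#include <stdint.h>

/* Polynomials over F_2 packed into bits: bit i = coefficient of x^i. */

static int pdeg(uint64_t p)          /* degree; -1 for the zero polynomial */
{
    int d = -1;
    while (p) { d++; p >>= 1; }
    return d;
}

static uint64_t pmul(uint64_t a, uint64_t b)     /* carry-less product */
{
    uint64_t r = 0;
    while (b) {
        if (b & 1) r ^= a;
        a <<= 1;
        b >>= 1;
    }
    return r;
}

static uint64_t pmod(uint64_t a, uint64_t m)     /* remainder of a by m */
{
    int dm = pdeg(m);
    for (int d = pdeg(a); d >= dm; d = pdeg(a))
        a ^= m << (d - dm);
    return a;
}

static uint64_t pgcd(uint64_t a, uint64_t b)     /* Euclid's algorithm */
{
    while (b) { uint64_t t = pmod(a, b); a = b; b = t; }
    return a;
}

static uint64_t pderiv(uint64_t p)
{
    /* In characteristic 2 only odd-degree terms survive: D(x^i) = i x^(i-1). */
    return (p >> 1) & 0x5555555555555555ULL;
}
```

Here 0xB encodes $x^3+x+1$, 0x7 encodes $x^2+x+1$, and the product $f$ comes out as 0x1DB; its derivative is $(x^3+x+1)^2 = $ 0x45, which is both divisible by $g = x^3+x+1$ and equal to $\mathrm{GCD}(f, Df)$.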

Distinct degree factorization The next step toward complete factorization of polynomials is to take as input a squarefree polynomial $f$, usually a polynomial coming out of Algorithm 2.15, and to write it as a product of several factors. The $i$-th factor in the decomposition is the product of all irreducible polynomials of degree $i$.

The key ingredient of distinct degree factorization is the fact that every element $\alpha$ in a finite field $\mathbb{F}_{q^n}$ of characteristic $p$, i.e., where $q$ is either $p$ or a proper power of $p$, satisfies:

$$\alpha^{q^n} - \alpha = 0. \qquad (2.46)$$

If $I(x)$ is an irreducible polynomial of degree $n$ over $\mathbb{F}_q$, we know that $I$ can be used to represent $\mathbb{F}_{q^n}$. Applying Equation (2.46) to a root $\alpha$ of $I$ implies that $I(x)$ divides the polynomial $P_{q,n}(x) = x^{q^n} - x$. Since the argument applies to all irreducibles of degree $n$, all these polynomials divide $P_{q,n}$. Moreover, no irreducible polynomial of degree higher than $n$ may divide $P_{q,n}$.

As a consequence, taking the GCD of $f$ with $P_{q,1}$ yields the product $f^{(1)}$ of all irreducibles of degree 1 dividing $f$. Replacing $f$ by $f/f^{(1)}$ and taking


Algorithm 2.15 Squarefree factorization of polynomials
Require: Input polynomial $f$, finite field $\mathbb{F}_{p^n}$
  Let $m \longleftarrow 0$
  Let $a$ be the highest degree coefficient of $f$
  Let $f \longleftarrow f/a$
  repeat
    Let $g \longleftarrow D(f)$
    if $g = 0$ then
      Here $f$ is a $p$-th power
      Recurse on $f^{1/p}$, multiply all multiplicities by $p$ and return
    end if
    Let $h \longleftarrow \mathrm{GCD}(f, g)$
    Let $m \longleftarrow m + 1$ and $f^{(m)} \longleftarrow f/h$
    Let $f \longleftarrow h$
  until $f = 1$
  Output $a$ and $f^{(1)}, \ldots, f^{(m)}$

the GCD with $P_{q,2}$ yields the product $f^{(2)}$ of all irreducibles of degree 2, and so on. For practicality of the resulting algorithm, it is essential to split the GCD computations encountered here into two parts. Indeed, the polynomials $P_{q,n}$ have a high degree $q^n$ and it would not be practical to directly use Algorithm 2.1. Instead, we first compute the polynomial $x^{q^n}$ modulo the other polynomial $f(x)$ using Algorithm 2.9 or 2.10. After subtracting $x$ from the result, we may use Algorithm 2.1 to conclude the GCD computation. Distinct degree factorization is given as Algorithm 2.16.

Algorithm 2.16 Distinct degree factorization of a squarefree polynomial
Require: Input unitary squarefree polynomial $f$
Require: Input prime power $q$
  Let $m \longleftarrow 0$
  repeat
    Let $m \longleftarrow m + 1$
    Let $g \longleftarrow x^{q^m} \bmod f(x)$
    Let $g \longleftarrow g - x$
    Let $f^{(m)} \longleftarrow \mathrm{GCD}(f, g)$
    Let $f \longleftarrow f/f^{(m)}$
  until $f = 1$
  Output $f^{(1)}, \ldots, f^{(m)}$
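Over $\mathbb{F}_2$, Algorithm 2.16 fits in a few lines once polynomials are packed into machine words (bit $i$ is the coefficient of $x^i$): since $q = 2$, the value $x^{q^m} \bmod f$ is obtained by repeated modular squaring. This is an illustrative sketch for small degrees only, not a general-purpose implementation:

```c
#include <assert.h>
#include <stdint.h>

static int pdeg(uint64_t p)              /* degree; -1 for zero */
{
    int d = -1;
    while (p) { d++; p >>= 1; }
    return d;
}

static uint64_t pmod(uint64_t a, uint64_t m)
{
    int dm = pdeg(m);
    for (int d = pdeg(a); d >= dm; d = pdeg(a))
        a ^= m << (d - dm);
    return a;
}

static uint64_t pdivq(uint64_t a, uint64_t m)    /* quotient of a by m */
{
    int dm = pdeg(m);
    uint64_t q = 0;
    for (int d = pdeg(a); d >= dm; d = pdeg(a)) {
        q |= 1ULL << (d - dm);
        a ^= m << (d - dm);
    }
    return q;
}

static uint64_t pgcd(uint64_t a, uint64_t b)
{
    while (b) { uint64_t t = pmod(a, b); a = b; b = t; }
    return a;
}

static uint64_t pmulmod(uint64_t a, uint64_t b, uint64_t m)
{
    uint64_t r = 0;
    a = pmod(a, m);
    while (b) {
        if (b & 1) r ^= a;
        a <<= 1;
        a = pmod(a, m);      /* keep the partial product reduced */
        b >>= 1;
    }
    return pmod(r, m);
}

/* Distinct degree factorization for q = 2: out[m] receives the product
   of all irreducible factors of degree m of the squarefree input f. */
static int ddf(uint64_t f, uint64_t *out, int maxm)
{
    const uint64_t POLY_X = 2;           /* the polynomial x */
    uint64_t g = POLY_X;
    int m = 0;

    while (pdeg(f) > 0 && m < maxm) {
        m++;
        g = pmulmod(g, g, f);            /* g = x^(2^m) mod f */
        uint64_t fm = pgcd(f, g ^ POLY_X);   /* gcd(f, x^(2^m) - x) */
        out[m] = fm;
        if (pdeg(fm) > 0)
            f = pdivq(f, fm);
    }
    return m;
}
```

On $f = (x+1)(x^2+x+1)(x^3+x+1)$, encoded as 0x53, the routine peels off one factor of each degree.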


Concluding the factorization At this final stage, we receive as input a polynomial $f$, knowing that it has no repeated factors and that all its factors have the same degree $k$. The strategy is to split $f$ into a product of polynomials of smaller degree until all the factors have degree $k$. At this point, we know that the factorization is complete and we stop. Thanks to distinct degree factorization, we know that the GCD of $f$ and $x^{q^k} - x$ is equal to $f$. When $q$ is odd, we see that $q^k$ is also odd and we may write:

$$x^{q^k} - x = x\,(x^{(q^k-1)/2} - 1)(x^{(q^k-1)/2} + 1). \qquad (2.47)$$

This remark can be used to split $f$ into polynomials of smaller degree. First, if $k = 1$ we can easily test whether $x$ divides $f$ or not, simply by looking at the constant coefficient of $f$. Second, we can compute the GCD of $f$ with $x^{(q^k-1)/2} - 1$. We expect that this yields a factor of $f$, roughly of half degree. This is a nice start, but does not suffice to conclude the factorization of $f$.

In order to go further, we need to repeat the same idea with other polynomials. To illustrate this, let us consider the case $k = 1$. In this case, we know that $f$ divides $x^q - x$ but also any polynomial of the form $(x-a)^q - (x-a)$. To prove this, it suffices to consider the polynomial $g$ defined by $g(x) = f(x+a)$. Since $f$ splits into factors of degree 1, $g$ also splits into factors of degree 1, thus $g$ divides $x^q - x$. Replacing $x$ by $x - a$ in the division yields the desired result. Since

$$(x-a)^q - (x-a) = (x-a)\,((x-a)^{(q-1)/2} - 1)((x-a)^{(q-1)/2} + 1), \qquad (2.48)$$

we get another chance of decomposing $f$ into smaller factors. We obtain an efficient probabilistic algorithm simply by trying a small number of random values for $a$.

When $k$ is larger than 1, we can use the same argument and prove that $f$ divides $(x-a)^{q^k} - (x-a)$. However, in general, this is not sufficient to fully factor $f$. Instead, we need to generalize the idea and remark that for any polynomial $h$, $f$ divides $h(x)^{q^k} - h(x)$. Using random polynomials of degree $k$ suffices to conclude the factorization using the identity:

$$h(x)^{q^k} - h(x) = h(x)\,(h(x)^{(q^k-1)/2} - 1)(h(x)^{(q^k-1)/2} + 1). \qquad (2.49)$$

The resulting probabilistic method is described as Algorithm 2.17.

In characteristic 2, with $q = 2^e$, the same idea is used with a different way of splitting $x^{2^{ke}} - x$. When $ke$ is 1 or 2, it suffices to use exhaustive search. Indeed, the only possible roots when $e = 1$ and $k = 1$ are 0 and 1. When $e = 1$ and $k = 2$, we are looking for the only irreducible factor of degree 2 over $\mathbb{F}_2$, i.e., $x^2 + x + 1$. When $e = 2$ and $k = 1$, we search for roots over $\mathbb{F}_4$, i.e., 0, 1, $\omega$ or $\omega + 1$. Here $\omega$ is a non-trivial cube root of unity in $\mathbb{F}_4$, satisfying $\omega^2 + \omega + 1 = 0$.

When $ke$ is even and larger than 2, we use the same approach as in the odd characteristic case, but we use different relations. Two cases are distinguished.


Algorithm 2.17 Final splitting of polynomials
Require: Input unitary squarefree polynomial $f$ with factors of degree $k$ only
Require: Input prime power $q$
  Let $g(x)$ be a random unitary polynomial of degree $k$
  Create an empty list of factors $L$
  Let $f_0 \longleftarrow \mathrm{GCD}(f(x), g(x))$
  if $\deg f_0 = k$ then
    Add $f_0$ to the list of factors $L$
  end if
  Let $G(x) \longleftarrow g(x)^{(q^k-1)/2}$ in $\mathbb{F}_q[X]/(f(x))$
  Let $f_1 \longleftarrow \mathrm{GCD}(f(x), G(x) - 1)$
  if $\deg f_1 > k$ then
    Recursively call the final splitting on $f_1$ and obtain a list $L_1$
    Concatenate $L_1$ after $L$
  else
    if $\deg f_1 = k$ then
      Append $f_1$ to $L$
    end if
  end if
  Let $f_{-1} \longleftarrow \mathrm{GCD}(f(x), G(x) + 1)$
  if $\deg f_{-1} > k$ then
    Recursively call the final splitting on $f_{-1}$ and obtain a list $L_{-1}$
    Concatenate $L_{-1}$ after $L$
  else
    if $\deg f_{-1} = k$ then
      Append $f_{-1}$ to $L$
    end if
  end if
  Output the list of factors $L$


When $e$ is odd, or equivalently when $\mathbb{F}_{2^e}$ does not contain a cube root of unity, we use the identity:

$$x^{2^{ke}} \oplus x = x\,(x^{(2^{ke}-1)/3} \oplus 1)(x^{2(2^{ke}-1)/3} \oplus x^{(2^{ke}-1)/3} \oplus 1). \qquad (2.50)$$

When $e$ is even, we let $\omega$ denote a cube root of unity in $\mathbb{F}_{2^e}$ and we use the identity:

$$x^{2^{ke}} \oplus x = x\,(x^{(2^{ke}-1)/3} \oplus 1)(x^{(2^{ke}-1)/3} \oplus \omega)(x^{(2^{ke}-1)/3} \oplus \omega \oplus 1). \qquad (2.51)$$

Finally, when $e$ and $k$ are both odd, we remark that $x^{2^{ek}} + x$ divides $x^{2^{2ek}} + x$ modulo two and use the analog of Equation (2.50) for this higher degree polynomial.

2.6 Vector spaces and linear maps

A vector space over a field $K$ is a set $V$, whose elements are called vectors, together with two operations, addition of two vectors and multiplication of a vector of $V$ by an element of $K$, satisfying the following properties:

• Addition has a neutral element called the zero vector and denoted by 0.

• Every vector $v$ has an opposite $-v$, which can be obtained by multiplying $v$ by the constant $-1$ in $K$. In particular, if $K$ is a field of characteristic 2, every vector $v$ is its own opposite.

• Multiplication by a constant is distributive, i.e., for any pair of vectors $(v, w)$ and any scalar $\alpha$ in $K$, we have:

$$\alpha(v + w) = \alpha v + \alpha w. \qquad (2.52)$$

• Similarly, for any vector $v$ and any pair of scalars $(\alpha, \beta)$, we have:

$$(\alpha + \beta)v = \alpha v + \beta v. \qquad (2.53)$$

A finite family of $n$ vectors $(v_1, \cdots, v_n)$ is a generating family for $V$, if and only if, for every vector $v$ in $V$ there exists (at least) one $n$-uple of scalars $(\alpha_1, \cdots, \alpha_n)$ such that:

$$v = \sum_{i=1}^{n} \alpha_i v_i. \qquad (2.54)$$

A family $(v_1, \cdots, v_n)$ is a free family, if and only if, the only $n$-uple of scalars $(\alpha_1, \cdots, \alpha_n)$ such that $\sum_{i=1}^{n} \alpha_i v_i = 0$ is the all-zero $n$-uple.


A family of vectors $(v_1, \cdots, v_n)$ which is both free and a generating family for $V$ is called a basis of $V$. For any vector $v$ of $V$, there exists a unique $n$-uple $(\alpha_1, \cdots, \alpha_n)$ such that Equation (2.54) holds. All bases of $V$ have the same cardinality, called the dimension of $V$ and denoted $\dim V$. Note that, in this book, we only consider finite dimensional vector spaces.

For example, $K$ is a vector space of dimension 1 over itself. More generally, $\mathbb{F}_{p^n}$ can be seen as a vector space of dimension $n$ over $\mathbb{F}_p$.

A subset of $V$ stable by addition and by multiplication by $K$ is called a vector subspace of $V$.

Given two vector spaces $V$ and $W$ over $K$, a map $L$ from $V$ to $W$ is said to be linear, if and only if, for all pairs of scalars $(\alpha, \beta)$ and all pairs of vectors $(v, w)$, we have:

$$L(\alpha v + \beta w) = \alpha L(v) + \beta L(w). \qquad (2.55)$$

The subset of vectors $v$ of $V$ such that $L(v) = 0$ is a subspace of $V$ called the kernel of $L$ and denoted by $\mathrm{Ker}(L)$. Similarly, the set of all vectors $L(v)$ is a subspace of $W$ called the image of $L$ and denoted by $\mathrm{Im}(L)$. The dimensions of the kernel and image of $L$ satisfy the relation:

$$\dim V = \dim \mathrm{Im}(L) + \dim \mathrm{Ker}(L). \qquad (2.56)$$

For any vector $y$ in $\mathrm{Im}(L)$, there exists (at least) one vector $x$ such that $L(x) = y$. Moreover, the set of vectors that are mapped to $y$ by $L$ is $x + \mathrm{Ker}(L)$. If the dimension of $\mathrm{Ker}(L)$ is zero, then $\mathrm{Ker}(L) = \{0\}$ and any vector in $\mathrm{Im}(L)$ has a unique inverse image. In that case, we say that $L$ is invertible or that it has full rank. We can then define the map that sends $y$ to this inverse image; it is a linear map, called the inverse of $L$ and denoted by $L^{-1}$. Note that $L^{-1}$ is only defined on $\mathrm{Im}(L)$. The composition $L^{-1} \circ L$ is the identity on $V$. Moreover, if $W = \mathrm{Im}(L)$, then $L \circ L^{-1}$ is the identity on $W$.

If $(v_1, \cdots, v_n)$ is a basis for $V$ and $(w_1, \cdots, w_m)$ is a basis for $W$, then a linear map from $V$ to $W$ can be described by a matrix $M$ with $m$ rows and $n$ columns. Letting $M_{i,j}$ denote the entry on row $i$ and column $j$ of $M$, we define the entries of $M$ by considering the unique decomposition of each $L(v_i)$ as a sum:

$$L(v_i) = \sum_{j=1}^{m} M_{i,j}\, w_j. \qquad (2.57)$$

With this representation, composition of linear maps can be computed using matrix multiplication.
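This compatibility between composition and matrix products can be checked mechanically. The sketch below uses small matrices over $\mathbb{F}_2$ (an arbitrary illustrative choice) and verifies that applying a matrix product to a vector agrees with applying the two maps one after the other:

```c
#include <assert.h>

#define DIM 3   /* illustrative dimension */

/* y = M x over F_2: additions are XORs, products are ANDs. */
static void matvec(const int m[DIM][DIM], const int *x, int *y)
{
    for (int i = 0; i < DIM; i++) {
        y[i] = 0;
        for (int j = 0; j < DIM; j++)
            y[i] ^= m[i][j] & x[j];
    }
}

/* C = A B over F_2, the matrix of the composed linear map. */
static void matmul(const int a[DIM][DIM], const int b[DIM][DIM],
                   int c[DIM][DIM])
{
    for (int i = 0; i < DIM; i++)
        for (int j = 0; j < DIM; j++) {
            c[i][j] = 0;
            for (int k = 0; k < DIM; k++)
                c[i][j] ^= a[i][k] & b[k][j];
        }
}
```

By linearity, $(AB)x = A(Bx)$ holds for every choice of matrices and vector, which is exactly what the check below exercises.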

Linear maps from V to K are called linear forms. The set of all linear forms

on V is a vector space over K, called the dual space of V . The dual of V has

the same dimension as V .

Both for matrices and for linear maps from a vector space to itself, it is possible to define a deep mathematical invariant called the determinant, e.g., see [Lan05, Chapter XIII]. The determinant is non-zero, if and only if, the matrix or linear map is invertible. Moreover, the determinant of the product of two matrices or of the composition of two maps is equal to the product of the determinants. The computation of determinants is described in Chapter 3.

2.7 The RSA and Diffie-Hellman cryptosystems

To conclude this chapter, we use the number theory presented above to briefly recall the description of the RSA and Diffie-Hellman cryptosystems.

2.7.1 RSA

The system of Rivest, Shamir and Adleman (RSA) is based on the structure of the multiplicative group of $\mathbb{Z}/N\mathbb{Z}$, when $N$ is the product of two large primes $p$ and $q$. After choosing $p$ and $q$, we also choose an encryption exponent $e$ coprime to $\varphi(N)$ and let $d$ denote the inverse of $e$ modulo $\varphi(N)$. Thus, we have $ed = 1 + \lambda\varphi(N)$ for some integer $\lambda$.

Since $\varphi(N) = (p-1)(q-1)$, we know that for all integers $0 < x < N$, coprime with $N$, we have $x^{\varphi(N)} = 1 \pmod{N}$. As a consequence:

$$x^{ed} = x^{1+\lambda\varphi(N)} = x \cdot (x^{\varphi(N)})^{\lambda} = x \pmod{N}. \qquad (2.58)$$

This shows that the two maps:

$$x \longrightarrow x^e \pmod{N} \quad \text{and} \quad x \longrightarrow x^d \pmod{N}$$

are inverses of each other in the group $(\mathbb{Z}/N\mathbb{Z})^*$. We leave it as an exercise to the reader to show that the two maps are inverses of each other everywhere in $\mathbb{Z}/N\mathbb{Z}$. Note that this is not really needed for the RSA cryptosystem. Indeed, we may assume that no one ever encrypts an integer $0 < x < N$ not coprime to $N$, since that would essentially reveal the factorization of $N$. We may also assume that no one ever encrypts 0, since it is a weak message for RSA.

RSA as a whole is an efficient system for the following reasons:

• RSA moduli can be efficiently constructed. Indeed, to construct an RSA modulus, we need the ability to pick large primes at random. This can be done easily, thanks to the efficient primality testing algorithms described in Section 2.3.2. Note that many practical implementations of RSA keys do not choose large primes uniformly at random in a specified interval; instead, they start from a random number in the interval and find the next prime after this number. Since primes are non-uniformly distributed, this procedure favors primes which are far apart from their immediate predecessor. However, RSA numbers of this type are not easier to factor than ordinary RSA numbers and no attack is known against this non-uniform generation.

• Computing decryption exponents is easy and results from a simple application of Euclid's algorithm.

• The RSA encryption and decryption permutations are, given $N$ and $e$ (or $d$), efficiently computed. It suffices to use a fast modular exponentiation algorithm such as Algorithms 2.9 or 2.10.

From the security point-of-view, it is clear that RSA can be completely broken using a modular $e$-th root algorithm. However, at the present time, the only known approach to compute arbitrary $e$-th roots without external help, i.e., given only $N$ and $e$, is to factor $N$ and then to proceed as the private key owner. Note that the knowledge of both $e$ and $d$ can be used to compute a multiple of $\varphi(N)$ and, as shown in Section 2.3.3.1, this multiple can be used to recover the factorization of $N$. For these reasons, it is usually said that the security of RSA relies on the hardness of factoring $N$.

In the above paragraph, the words "arbitrary" and "external help" are essential to the link between factoring and the security of RSA. Concerning "arbitrary," we see that since RSA encryption is deterministic, it is easy to decrypt a value $x$ chosen from a small subset, simply by enumerating possible encryptions. A more advanced attack, described in Chapter 8, allows one to decrypt $x$ efficiently, with a non-negligible probability of success, when it belongs to a small interval. Concerning "external help," let us remark that plain RSA does not resist chosen ciphertext attacks, i.e., given access to an RSA decryption box, it is also possible to decrypt messages not submitted to the box. This is based on an essential property of RSA, which is not shared by trapdoor one-way permutations in general: its multiplicativity. By multiplicativity of RSA, we simply mean that given $x$ and $y$ in $\mathbb{Z}/N\mathbb{Z}$ the following relation is satisfied:

$$(xy)^e = x^e \cdot y^e \pmod{N}. \qquad (2.59)$$

To realize a chosen ciphertext attack, the attacker receives a value $z$ and tries to compute $x = z^d \pmod{N}$ without submitting this value to the decryption box. He chooses a random element $r$ in $(\mathbb{Z}/N\mathbb{Z})^*$ and computes $z_1 = z r^e \pmod{N}$. Sending $z_1$ to the decryption box, he receives a value $x_1$ such that $x_1 = xr \pmod{N}$. Thus, a modular division by $r$ suffices to recover $x$.
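The blinding attack can be demonstrated end to end with toy parameters (hypothetical textbook values $p = 61$, $q = 53$, hence $N = 3233$, $e = 17$, $d = 2753$; real moduli are of course far larger):

```c
#include <assert.h>
#include <stdint.h>

/* Toy textbook parameters (hypothetical): p = 61, q = 53. */
#define RSA_N 3233u   /* p*q */
#define RSA_E 17u     /* coprime to phi(N) = 3120 */
#define RSA_D 2753u   /* inverse of e modulo phi(N) */

static uint64_t modexp(uint64_t x, uint64_t e, uint64_t n)
{
    uint64_t r = 1 % n;
    x %= n;
    while (e) {
        if (e & 1) r = (r * x) % n;
        x = (x * x) % n;
        e >>= 1;
    }
    return r;
}

static uint64_t modinv(uint64_t a, uint64_t n)   /* extended Euclid */
{
    int64_t t = 0, newt = 1, r = (int64_t)n, newr = (int64_t)(a % n);
    while (newr != 0) {
        int64_t q = r / newr, tmp;
        tmp = t - q * newt; t = newt; newt = tmp;
        tmp = r - q * newr; r = newr; newr = tmp;
    }
    return (uint64_t)(t < 0 ? t + (int64_t)n : t);
}

/* Chosen ciphertext attack by blinding: recover x = z^d mod N while
   only submitting the blinded value z1 = z * r^e to the decryption box. */
static uint64_t blind_decrypt(uint64_t z, uint64_t r)
{
    uint64_t z1 = (z * modexp(r, RSA_E, RSA_N)) % RSA_N;
    uint64_t x1 = modexp(z1, RSA_D, RSA_N);      /* the decryption box */
    return (x1 * modinv(r, RSA_N)) % RSA_N;      /* unblind: x1 = x*r  */
}
```

The unblinding step is exactly the modular division by $r$ mentioned above, implemented as multiplication by $r^{-1} \bmod N$.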

Despite these attacks, the multiplicativity of RSA is not necessarily a bad thing. Indeed, it can also be used constructively. In particular, homomorphic encryption based on RSA or on similar computations, such as Paillier's cryptosystem [Pai99], explicitly requires multiplicativity. Moreover, general methods have been devised to use RSA securely, both for signature, such as the full domain hash and probabilistic signature schemes of [BR96], and for encryption, such as the optimal asymmetric encryption scheme of [BR94].


2.7.2 Diffie-Hellman key exchange

Let $K$ be any finite field and let $g$ be a generator of a subgroup of $K^*$ of order $q$, preferably prime. In order to create a common secret, two users $A$ and $B$ proceed as follows:

• Each user chooses a random integer in the range $[0, q-1]$, $a$ for user $A$ and $b$ for user $B$, and raises $g$ to this power.

• Then $A$ sends $g^a$ to $B$ and $B$ sends $g^b$ to $A$.

• Upon reception, $A$ computes $(g^b)^a$ and $B$ computes $(g^a)^b$. Clearly, these two elements of $K$ are equal and their common value is $g^{ab}$, where $ab$ is computed modulo $q$.

• Finally, $A$ and $B$ extract their common secret from $g^{ab}$. Two frequently encountered approaches are to extract a few bits from the binary representation of $g^{ab}$ or, alternatively, to use a hash value of $g^{ab}$. This last option is often used in proofs that rely on the random oracle model.

From a security point-of-view, an algorithm that could solve computational Diffie-Hellman, i.e., compute g^ab given g^a and g^b, but not a or b, would completely break Diffie-Hellman key exchange. At the present time, the only known general algorithm involves computing discrete logarithms, i.e., recovering a or b. Computations of discrete logarithms are discussed in Chapters 6, 7 and 15.

Note that the Diffie-Hellman key exchange algorithm is not secure against active attackers. The simplest attack is probably to substitute g^a and g^b by 1 during transmission. After this tampering, A computes 1^a = 1 as its common key and B computes 1^b = 1 as its common key. Since the two users end up with the same key, the communication may proceed without problem. After that initial tampering, the attacker knows the common secret 1. As a consequence, he just needs to eavesdrop on the remainder of the communication. Of course, this can be avoided by excluding 0 from the allowed range of a and b and by aborting if the incoming value is 1.

Another well-known active attack against Diffie-Hellman key exchange is the man-in-the-middle attack. Here, the attacker breaks the communication link between A and B. Afterwards, he communicates with A pretending to be B and with B pretending to be A. On each side, he participates in a copy of the Diffie-Hellman key exchange. After that, the attacker plays the role of a proxy between A and B: he decrypts with one common key any incoming message, reencrypts it with the other key and, finally, forwards the message. As a consequence, he listens to the whole communication. Neither A nor B can detect the man-in-the-middle attack, unless some external means of authentication is used.


Non-interactive Diffie-Hellman

The Diffie-Hellman key exchange algorithm can be used as a hybrid cryptosystem in a very simple way. Each user simply needs to choose a pair (x, g^x) as in the Diffie-Hellman key exchange algorithm. However, instead of choosing a new pair for each run of the protocol, he keeps it fixed. As a consequence, x becomes the user's private key and g^x his public key. With this change, it becomes possible to exchange a secret in a non-interactive fashion. When a user wants to initiate a communication, he chooses a random value r and computes g^r. Then, considering the public key of the intended recipient as the other party's contribution to this run of the protocol, he can derive a common secret g^rx and, finally, extracting a key from this common secret, he can encrypt the message he wants to transmit using a secret key cryptosystem. Attaching g^r to the encrypted message allows the recipient to recover g^rx and decrypt.

ElGamal encryption [ElG85] is a variant that encrypts by multiplying g^rx with a message m consisting of a number in F_p. Like RSA, it suffers from several weaknesses related to multiplicativity; for an example, see Chapter 8.

Zero-knowledge proofs of discrete logarithms

Once a user has a public key g^x, he can use it to prove his identity. Of course, this should be done without revealing the secret key x. An ideal tool for this purpose is the notion of zero-knowledge proof [GMR89], which formally defines the meaning of proving that x is known, without giving away information on x. For discrete logarithms, there is a simple and efficient protocol that achieves this. Assuming that the system parameters, i.e., the prime p and an element g of prime order q, are known and that the prover wants to prove knowledge of x such that Y = g^x (mod p), the protocol proceeds as follows:

• The prover chooses a random number r in the range [0, q − 1] and announces R = g^r mod p.

• The verifier chooses a random number c in the range [0, q − 1], called the challenge, and sends it to the prover.

• The prover computes u = (r + cx) mod q and sends it to the verifier.

• The verifier checks that g^u = R · Y^c (mod p) and, if so, accepts the proof.

It is easy to check that an honest prover, who knows x and follows the protocol correctly, always successfully convinces the verifier. Moreover, if, for a fixed value of R, a prover can give convincing answers u and u' for two different challenges c and c', he can recover x by remarking that (u' − u) = (c' − c)x (mod q). As a consequence, a cheating prover, i.e., someone who does not know x and tries to run the protocol as prover, has at most 1 chance out of q to succeed. This probability is so low that it can be ignored for all practical purposes. This proof is zero-knowledge because a verifier can produce a convincing copy of a legitimate proof, by first choosing the challenge c and the answer u, and then by computing R as g^u · Y^(−c) (mod p). Using this method, the verifier is able to simulate the prover on the single challenge c (for the value R) and this fake proof cannot be distinguished from a real one by a third party.

In cryptography, such knowledge proofs are often used as authentication

protocols.

Signature based on discrete logarithm

The above zero-knowledge proof is easy to transform into a signature by replacing the interaction with the verifier by the output of a hash function H. Assuming that H is a random oracle with output in [0, q − 1], the simplest way is to use Schnorr's signature [Sch90] and let c = H(R, M), where R = g^r is produced by the signer and M is the message to be signed. After computing u, the pair (R, u) becomes a signature for M. There are several possible variations of this idea, including ElGamal's signature [ElG85] and NIST's digital signature algorithm.


Part II

Algorithms


Chapter 3

Linear algebra

Linear algebra is a widely used tool in computer science and cryptography is no exception to this rule. One notable difference is that in cryptography and cryptanalysis, we mostly consider linear algebra over finite fields (or sometimes rings). Compared to linear algebra over real or complex numbers, there are two essential changes. One for the best: no stability problems can occur; one for the worst: the notion of convergence is no longer available. This makes linear algebra in cryptanalysis quite different from linear algebra for scientific computing. The reader interested in the scientific computing aspect of linear algebra may refer to [GL96].

3.1 Introductory example: Multiplication of small matrices over F2

In order to illustrate the specificities of the linear algebra problems encountered in cryptography, we first consider the multiplication of small matrices over the finite field F2. More precisely, we show how to optimize the implementation of the basic matrix multiplication Algorithm 3.1, directly derived from the mathematical definition of matrix multiplication, when multiplying Boolean matrices. We especially consider matrices of size 32 × 32, 64 × 64 or 128 × 128. The natural idea that first comes to mind is to directly follow the algorithm description and write the simple Program 3.1. As written, it works on 32 × 32 matrices, but this can be easily changed by defining DIM differently. One important drawback of this elementary implementation is that it wastes memory. Indeed, each matrix entry is represented by a full integer. This is clearly suboptimal, since a single bit would suffice. This waste can be reduced by encoding the entries using shorter integers. However, even using 8-bit integers (the C type char) is already costly in terms of memory. As a consequence, before trying to improve the code running time, we replace the matrix representation by an optimal representation where several bits are packed into a single integer. With 32-bit integers, this is extremely well-suited

to represent square matrices whose dimension is a multiple of 32.

Algorithm 3.1 Elementary square matrix multiplication
Require: Input n × n matrices M and N
  Create n × n matrix R initialized to 0
  for l from 1 to n do
    for c from 1 to n do
      for k from 1 to n do
        R[l, c] ←− (R[l, c] + M[l, k] · N[k, c]) mod 2
      end for
    end for
  end for
  Output R

On computers that offer larger 64- or 128-bit integers, matrix dimensions divisible by 64

or 128 are a natural choice. In particular, many recent processors have special instructions to work with such extended integers; interested readers may refer to Figure 3.1. Assuming 32-bit words, we are going to focus on 32 × 32 matrices. Changing the memory representation and using logical operations to compute arithmetic modulo 2 (logical AND for multiplication and XOR for addition) yields the code of Program 3.2. In order to access the individual elements of the matrices in each line, the program uses two preprocessor macros. The first macro bit(M,l,c) extracts the bit in line l and column c of the matrix M under the convention that each line of the matrix is stored in an unsigned integer of the correct size. The second macro flipbit(M,l,c) flips the value of the bit from 0 to 1 or from 1 to 0. We do not give a macro for writing a value into a specific bit, since flipping bits is sufficient for our purpose, easier to write and faster to execute.

In order to improve the matrix multiplication algorithm and compare the

performance of various implementations, we need a reliable way of measuring

the running times. Since the time for performing a single multiplication is

negligible compared to the time of starting the program and reading the input

matrices, it is useful to add a loop in order to repeat the multiplication a large

number of times, say 100,000 times.

Our first optimization comes from an improvement of the binary scalar product of two 32-bit words. Instead of using a loop to extract one bit position from each word, multiply these bits together and add up the products, we remark that all the multiplications can be performed in parallel using a single wordwise logical AND of integers. Moreover, adding up the products requires a smaller number of XORs than initially expected. The key idea is to XOR the upper and lower halves of a word. This performs 16 additions in parallel using a single XOR and can be viewed as folding the word in two. Folding again, we perform 8 additions and obtain an intermediate result on a single byte. After three more folds, we end up with the expected result on a single bit. Clearly, since a matrix multiplication consists of computing


Program 3.1 Basic C code for matrix multiplication over F2

#include <stdio.h>

#include <stdlib.h>

#define DIM 32

void input_mat(int mat[DIM][DIM])

{

int l,c;

for (l=0;l<DIM;l++) {

for (c=0;c<DIM;c++) {

scanf("%d",&mat[l][c]); } } }

void print_mat(int mat[DIM][DIM])

{

int l,c;

for (l=0;l<DIM;l++) {

for (c=0;c<DIM;c++) {

printf("%d ",mat[l][c]); }

printf("\n"); } }

void Mul(int res[DIM][DIM],

int mat1[DIM][DIM], int mat2[DIM][DIM])

{

int l,c,k;

for (l=0;l<DIM;l++) {

for (c=0;c<DIM;c++) { res[l][c]=0;

for (k=0;k<DIM;k++) {

res[l][c]+=mat1[l][k]*mat2[k][c];

}

res[l][c]%=2; } } }

int main()

{

int mat1[DIM][DIM]; int mat2[DIM][DIM];

int mat3[DIM][DIM]; int count;

printf("Input Mat1\n"); input_mat(mat1);

printf("Input Mat2\n"); input_mat(mat2);

for (count=0;count<100000;count++) Mul(mat3,mat1,mat2);

printf("Product :\n"); print_mat(mat3);

}


Program 3.2 Matrix multiplication over F2 with compact encoding

#include <stdio.h>

#include <stdlib.h>

#define DIM 32

#define WORD unsigned int

#define bit(M,l,c) ((M[l]>>c)&1)

#define flipbit(M,l,c) if (1) {M[l]^=(1UL<<c);} else

void input_mat(WORD mat[DIM])

{

int l,c,val;

for (l=0;l<DIM;l++) {

mat[l]=0;

for (c=0;c<DIM;c++) {

scanf("%d",&val);

if (val) flipbit(mat,l,c); } } }

void print_mat(WORD mat[DIM])

{

int l,c;

for (l=0;l<DIM;l++) {

for (c=0;c<DIM;c++) {

printf("%d ",bit(mat,l,c)); }

printf("\n"); } }

void Mul(WORD res[DIM],WORD mat1[DIM],WORD mat2[DIM])

{

int l,c,k,val;

for (l=0;l<DIM;l++) {

res[l]=0;

for (c=0;c<DIM;c++) {

val=0;

for (k=0;k<DIM;k++) {

val^=bit(mat1,l,k)&bit(mat2,k,c); }

if (val) flipbit(res,l,c); } } }

int main()

{

WORD mat1[DIM]; WORD mat2[DIM]; WORD mat3[DIM]; int count;

printf("Input Mat1\n"); input_mat(mat1);

printf("Input Mat2\n"); input_mat(mat2);

for(count=0; count<100000; count++) Mul(mat3,mat1,mat2);

printf("Product :\n"); print_mat(mat3);

}


Recent microprocessors are becoming more and more powerful and manufacturers use this additional power to enrich the functionalities of processors. MMX and SSE instructions have been introduced as part of this process. They allow computations on specific 64-bit and 128-bit registers. The advantage of these instructions, when optimizing programs, is twofold. First, these instructions compute faster than ordinary microprocessor instructions. Second, since these instructions operate on specific registers rather than on the main registers, they are extremely useful for codes which require many registers. For example, the bitslice technique, described in Chapter 5, requires a large number of registers and MMX or SSE instructions are very useful in this case.

One important caveat is that MMX registers share their storage area with floating point registers. As a consequence, the two types of instructions cannot be freely interleaved within a program: when switching between a sequence of MMX instructions and a sequence of floating point instructions, the storage area needs to be cleared by using a specific instruction called EMMS. Otherwise, we get either unpredictable results or unwanted errors.

MMX and SSE operations can view the specific registers in several ways. For example, a 64-bit MMX register can be considered either as eight bytes, four 16-bit words or two 32-bit words, optionally using saturated arithmetic of signed or unsigned kind. There are instructions to add in parallel eight bytes, four 16-bit words or two 32-bit words. SSE registers can also contain these different types of contents; however, they contain twice as many of each. One notable exception, which is very useful for cryptographic purposes, concerns logical operations. Indeed, performing a logical AND, OR or XOR on eight bytes or on two 32-bit words yields the same result. Thus, for logical operations, it is not necessary to have three different types.

At the present time, with the GCC compiler, MMX and SSE operations can only be used in C programs by using a specific means of inlining assembly code. The easiest way probably is to use the definitions from the include files mmintrin.h, emmintrin.h, pmmintrin.h, tmmintrin.h and xmmintrin.h. These files define two new data types: __m64 for 64-bit MMX registers and __m128 for 128-bit SSE registers. These special data types can be operated on using specific operations. A few of these operations are listed below.

Instruction name    Functionality          Data type
_mm_empty           Reset storage area     None
_mm_add_pi16        Addition               Four 16-bit words
_mm_sub_pi8         Subtraction            Eight bytes
_mm_xor_pd          XOR                    128-bit word

Figure 3.1: MMX and SSE instructions


the scalar products of lines with columns, this can be very useful. Since in our representation words are used to encode lines, we first need to extract columns from the representation of the second matrix and encode them back into another integer. The multiplication subroutine thus obtained is given as Program 3.3.

Program 3.3 Matrix multiplication using fast scalar product

void Mul(WORD res[DIM], WORD mat1[DIM], WORD mat2[DIM])

{

int l,c,k;

WORD val,mask;

for (l=0;l<DIM;l++) res[l]=0;

for (c=0;c<DIM;c++) { mask=0;

for (k=0;k<DIM;k++) {

mask^=(bit(mat2,k,c)<<k); }

for (l=0;l<DIM;l++) {

val=mat1[l]&mask;

val^=(val>>16); val^=(val>>8);

val^=(val>>4); val^=(val>>2);

val^=(val>>1);

if (val&1) flipbit(res,l,c);

}

}

}

In order to further improve this code, we are going to modify two parts of the program. The first modification replaces the extraction of columns from the second matrix by a faster approach. Since we need all the columns of this matrix, we are, in truth, computing the transpose of this matrix. Thus, our modification consists of writing a fast transpose procedure and applying it before calling the matrix multiplication. The second modification is to speed up the slow part of the scalar product. Clearly, when performing the scalar products, the multiplication part is optimal since we compute 32 parallel bit multiplications using the logical AND on 32-bit words. However, the addition part is suboptimal: we need five folds to compute a single bit. This can be improved by remarking that the second fold only uses 16 bits out of 32, the third only 8 bits, and so on. Thus, it is possible to share the word-level logical operations between several folds, thereby computing several bits of the resulting matrix at once.

Let us start by improving matrix transposition. In order to do this, we view the original matrix as a 2 × 2 block matrix. Transposing this matrix, we find that:

( A B )^T   ( A^T C^T )
( C D )   = ( B^T D^T ).

Using the above remark, transposing a 32 × 32 matrix can be done in five steps. The first step exchanges the upper right and lower left 16 × 16 matrices. The second step performs a similar exchange on four pairs of 8 × 8 matrices. Then 16 pairs of 4 × 4 matrices, followed by 64 pairs of 2 × 2 matrices and finally 256 pairs of 1 × 1 matrices are exchanged. By using logical operations and shifts, each of these steps costs a small number of operations per line. From an asymptotic point-of-view, 2^t × 2^t matrices can be transposed in O(t·2^t) operations on 2^t-bit words, instead of 2^(2t) operations for the basic algorithm. The transposition C code for 32 × 32 matrices is given as Program 3.4. Note the constants defined at the beginning of the program, which are chosen in order to extract the bits contained in the submatrices that need to be exchanged at each step.

Improving the additions during the scalar product computation is done in a similar way. The first fold is performed as before, since it uses up full words. The second fold regroups the computation of 2 second-stage folds in a single operation, the third fold regroups 4 third-stage folds, and so on. This allows us to compute 32 bits using a total of 63 elementary fold operations; thus the amortized cost is less than two operations per bit, instead of five in the previous approach. The corresponding code for matrix multiplication makes use of the fast transposition Program 3.4 and is included as Program 3.5. To illustrate the technique, the inner loops have been partially unrolled.

In order to compare the practical performance of these algorithms, we give in Table 3.1 a sample of running times on a laptop computer, with each program compiled using gcc's -O3 option. These timings are given using 3 different versions of gcc. They clearly illustrate that low-level optimizations are not only machine dependent, but also deeply rely on the specific compiler being used. For example, the loop unrolling in Program 3.5 improves the running time with gcc 4.0.1 but makes it worse for the other two versions. With gcc 4.3.2, Programs 3.1 and 3.5 illustrate the fact that this compiler version can vectorize some programs using MMX and SSE instructions (see Figure 3.1).

3.2 Dense matrix multiplication

The complexity of matrix multiplication is a very important and difficult problem in computer science. Indeed, the complexity of many important linear algebra problems can be reduced to the complexity of matrix multiplication. As a consequence, improving matrix multiplication is a key problem.


Program 3.4 Fast transposition of 32 × 32 matrices over F2

#define DIM 32

#define Cst16 0xffff

#define Cst8 0xff00ff

#define Cst4 0xf0f0f0f

#define Cst2 0x33333333

#define Cst1 0x55555555

void Transpose(WORD transp[DIM], WORD mat[DIM])

{

int l,c,l0,l1;

WORD val1,val2;

for (l=0;l<DIM/2;l++) {

transp[l]=(mat[l]&Cst16)|((mat[l+(DIM/2)]&Cst16)<<16);

transp[l+(DIM/2)]=((mat[l]>>16)&Cst16)|

(mat[l+(DIM/2)]&(Cst16<<16));

}

for(l0=0;l0<2;l0++)

for (l1=0;l1<DIM/4;l1++) {

l=l0*(DIM/2)+l1;

val1=(transp[l]&Cst8)|((transp[l+(DIM/4)]&Cst8)<<8);

val2=((transp[l]>>8)&Cst8)|(transp[l+(DIM/4)]&(Cst8<<8));

transp[l]=val1; transp[l+(DIM/4)]=val2;

}

for(l0=0;l0<4;l0++)

for (l1=0;l1<DIM/8;l1++) {

l=l0*(DIM/4)+l1;

val1=(transp[l]&Cst4)|((transp[l+(DIM/8)]&Cst4)<<4);

val2=((transp[l]>>4)&Cst4)|(transp[l+(DIM/8)]&(Cst4<<4));

transp[l]=val1; transp[l+(DIM/8)]=val2;

}

for(l0=0;l0<8;l0++)

for (l1=0;l1<DIM/16;l1++) {

l=l0*(DIM/8)+l1;

val1=(transp[l]&Cst2)|((transp[l+(DIM/16)]&Cst2)<<2);

val2=((transp[l]>>2)&Cst2)|(transp[l+(DIM/16)]&(Cst2<<2));

transp[l]=val1; transp[l+(DIM/16)]=val2;

}

for (l=0;l<DIM;l+=2) {

val1=(transp[l]&Cst1)|((transp[l+1]&Cst1)<<1);

val2=((transp[l]>>1)&Cst1)|(transp[l+1]&(Cst1<<1));

transp[l]=val1; transp[l+1]=val2;

}

}


Program 3.5 Faster scalar product for multiplying 32 × 32 matrices

void Mul(WORD res[DIM], WORD mat1[DIM], WORD mat2[DIM]) {

int l,c,k; WORD transp[DIM]; WORD tmp[DIM]; WORD val;

Transpose(transp,mat2);

for (l=0;l<DIM;l++) {

for (c=0;c<DIM;c+=4) {

val=mat1[l]&transp[c];

val^=(val>>16); val&=Cst16;

tmp[c]=val;

val=mat1[l]&transp[c+1];

val^=(val>>16); val&=Cst16;

tmp[c+1]=val;

val=mat1[l]&transp[c+2];

val^=(val>>16); val&=Cst16;

tmp[c+2]=val;

val=mat1[l]&transp[c+3];

val^=(val>>16); val&=Cst16;

tmp[c+3]=val; }

for (c=0;c<DIM/2;c+=4) {

val=tmp[c]|(tmp[c+(DIM/2)]<<(DIM/2));

tmp[c]=(val&Cst8)^((val>>(DIM/4))&Cst8);

val=tmp[c+1]|(tmp[c+1+(DIM/2)]<<(DIM/2));

tmp[c+1]=(val&Cst8)^((val>>(DIM/4))&Cst8);

val=tmp[c+2]|(tmp[c+2+(DIM/2)]<<(DIM/2));

tmp[c+2]=(val&Cst8)^((val>>(DIM/4))&Cst8);

val=tmp[c+3]|(tmp[c+3+(DIM/2)]<<(DIM/2));

tmp[c+3]=(val&Cst8)^((val>>(DIM/4))&Cst8); }

for (c=0;c<DIM/4;c+=2) {

val=tmp[c]|(tmp[c+(DIM/4)]<<(DIM/4));

tmp[c]=(val&Cst4)^((val>>(DIM/8))&Cst4);

val=tmp[c+1]|(tmp[c+1+(DIM/4)]<<(DIM/4));

tmp[c+1]=(val&Cst4)^((val>>(DIM/8))&Cst4); }

for (c=0;c<DIM/8;c+=2) {

val=tmp[c]|(tmp[c+(DIM/8)]<<(DIM/8));

tmp[c]=(val&Cst2)^((val>>(DIM/16))&Cst2);

val=tmp[c+1]|(tmp[c+1+(DIM/8)]<<(DIM/8));

tmp[c+1]=(val&Cst2)^((val>>(DIM/16))&Cst2); }

val=tmp[0]|(tmp[2]<<2);

tmp[0]=(val&Cst1)^((val>>1)&Cst1);

val=tmp[1]|(tmp[3]<<2);

tmp[1]=(val&Cst1)^((val>>1)&Cst1);

val=tmp[0]|(tmp[1]<<1);

res[l]=val; } }


Program                        Runtime (100,000 mult.)
                               gcc 4.0.1   gcc 4.2.1   gcc 4.3.2
3.1                            9.01 s      8.16 s      3.68 s
3.2                            8.09 s      12.51 s     12.39 s
3.3                            1.38 s      1.38 s      1.30 s
3.5                            0.32 s      0.38 s      0.27 s
3.5 without loop unrolling     0.38 s      0.24 s      0.11 s

Table 3.1: 32 × 32 Boolean matrix multiplication on an Intel Core 2 Duo at 2.4 GHz

Asymptotically, the basic Algorithm 3.1 multiplies n × n matrices in time O(n^3). In the other direction, we have a trivial lower bound on the running time of matrix multiplication: O(n^2), which is the time to read the input matrices and/or write the result. The first matrix multiplication algorithm that beats the n^3 complexity was proposed in 1969 by Volker Strassen. After that initial step, several further asymptotic improvements were proposed; they are discussed in Section 3.2.2.

Throughout the current section, we focus on matrices represented by a dense array of coefficients. Matrices with sparse encoding are discussed in Section 3.4. We consider the complexity of matrix multiplication, detail many practical aspects of matrix multiplication algorithms, overview some asymptotic complexity results and show the relation between matrix multiplication and other linear algebra problems.

3.2.1 Strassen's algorithm

The first matrix multiplication algorithm with asymptotic complexity better than n^3, due to Strassen [Str69], is a divide-and-conquer algorithm built on a "magic" recipe for multiplying 2 × 2 matrices. The ordinary formula for 2 × 2 matrices requires 8 multiplications and 4 additions. Strassen's formula requires 7 multiplications and 18 additions. For 2 × 2 matrices, this costs more than the usual algorithm; however, for large 2n × 2n matrices, we need 7 multiplications of n × n matrices and 18 additions. Since matrices can be added together in time quadratic in n, the contribution of the additions to the asymptotic complexity is negligible and the analysis focuses on the number of multiplications. We see that the running time T(n) of Strassen's algorithm as a function of n satisfies the recurrence formula:

T(2n) = 7 · T(n) + O(n^2)

and conclude that T(n) = O(n^(log 7 / log 2)) ≈ O(n^2.807).


Strassen's formulas for multiplying M = ( a b ; c d ) by M' = ( a' b' ; c' d' ) are:

P1 = (a + c) · (a' + b'),    (3.1)
P2 = (b + d) · (c' + d'),
P3 = (b + c) · (c' − b'),
P4 = c · (a' + c'),
P5 = b · (b' + d'),
P6 = (c − d) · c',
P7 = (a − b) · b' and

M · M' = ( P1 + P3 − P4 − P7    P5 + P7
           P4 − P6              P2 − P3 − P5 + P6 ).

Note that another set of formulas with as many multiplications but fewer additions was later proposed as an alternative by Winograd; it is described further on in this section. To transform Strassen's formulas into a matrix multiplication algorithm, the basic idea is to use a recursive algorithm that multiplies matrices using seven recursive calls to itself on matrices of half size. The problem is that while this approach works very well for matrices whose dimensions are powers of two, it needs to be modified to deal with matrices which do not obey this restriction. From a theoretical point-of-view, the easiest way to multiply n × n matrices is to embed them within matrices of dimension 2^t × 2^t with the smallest possible value of t such that 2^t ≥ n. In the worst case, this embedding doubles the dimension of the matrix and thus only affects the constant factor in the runtime complexity O(n^(log2 7)), not the exponent. However, from a practical point-of-view, it is better to deal with the difficulty progressively throughout the algorithm rather than once and for all at the beginning. To multiply matrices of even dimensions, we use a direct recursion. To multiply matrices of odd dimensions, we need to deal with the imperfect split. Two methods are possible: we can round the size of the half-size matrices either up or down. To round up, it suffices to add bordering zeros to increase the size of the original matrix. To round down, we use the ordinary matrix multiplication formulas to deal with the extra line and column. The two methods are respectively presented as Algorithms 3.2 and 3.3.

Practical aspects of Strassen's multiplication

When implementing Strassen's algorithm, it is clearly better to have a cutoff point and to turn back to ordinary matrix multiplication for small matrices, thus aborting the recursion early. Indeed, for small matrices, the overhead of dividing and reassembling the matrices dominates the running time. In this section, we give some explicit data comparing ordinary matrix multiplication with Strassen's multiplication for some typical matrices. This data is far from


Algorithm 3.2 Strassen matrix multiplication (rounding up)
Require: Input n × n matrices M and N
  if n = 1 then
    Return a 1 × 1 matrix with entry M[1, 1] · N[1, 1]
  end if
  Let h1 ←− ⌈n/2⌉
  Let n1 ←− 2h1
  If needed, add zeros to expand M and N into n1 × n1 matrices
  Create R a n1 × n1 matrix with zero entries
  Create a h1 × h1 matrix M1, with M1[i, j] = M[i, j] + M[h1 + i, j]
