

computing f_A^{b,u} for at most a few vectors u suffices to recover f_A^b with a good enough probability. For example, when f_A^b is an irreducible polynomial over the finite field we are considering, we necessarily have f_A^{b,u} = f_A^b when u ≠ 0. Even when f_A^b is composite, a fraction of the vectors u satisfies f_A^{b,u} = f_A^b. Moreover, we can recover f_A^b by taking the lowest common multiple of a few polynomials of the form f_A^{b,u}.

Thus, to complete the description of Wiedemann's algorithm, we need to efficiently compute f_A^{b,u} from the sequence T_{b,u}. This can be done using the Berlekamp-Massey algorithm, as described in Chapter 2. Since this algorithm recovers the minimal polynomial with degree at most d from a sequence of 2d elements over a finite field, it can be used in the context of Wiedemann's algorithm by letting d be the dimension of the matrix A. Note that when A is non-square, the smallest dimension suffices.
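To make this step concrete, here is a minimal sketch of the Berlekamp-Massey algorithm over a prime field, written in C. It is our own illustration, not the book's Chapter 2 code: the modulus P, the bound MAXD and all helper names are choices made for this example, and the connection polynomial convention is s[j] + C[1]s[j-1] + ... + C[L]s[j-L] = 0.

#include <stdio.h>

#define P 65537UL   /* hypothetical word-sized prime modulus */
#define MAXD 128    /* maximum degree handled by this sketch */

typedef unsigned long u32;

static u32 mulm(u32 a, u32 b) { return (a * b) % P; }
static u32 subm(u32 a, u32 b) { return (a + P - b) % P; }
static u32 powm(u32 a, u32 e) {
  u32 r = 1;
  while (e) { if (e & 1) r = mulm(r, a); a = mulm(a, a); e >>= 1; }
  return r;
}
static u32 invm(u32 a) { return powm(a, P - 2); } /* Fermat inversion */

/* Returns the length L of the shortest recurrence satisfied by s[0..n-1]
   and fills C[0..L] with its connection coefficients, C[0] = 1. */
int berlekamp_massey(const u32 *s, int n, u32 *C) {
  u32 B[MAXD + 1], T[MAXD + 1], b = 1;
  int L = 0, m = 1, i, j;
  for (i = 0; i <= MAXD; i++) { C[i] = 0; B[i] = 0; }
  C[0] = 1; B[0] = 1;
  for (j = 0; j < n; j++) {
    /* discrepancy between the sequence and the current recurrence */
    u32 d = s[j];
    for (i = 1; i <= L; i++) d = (d + mulm(C[i], s[j - i])) % P;
    if (d == 0) { m++; continue; }
    if (2 * L <= j) { /* the recurrence must grow: save C before updating */
      u32 coef = mulm(d, invm(b));
      for (i = 0; i <= MAXD; i++) T[i] = C[i];
      for (i = 0; i + m <= MAXD; i++) C[i + m] = subm(C[i + m], mulm(coef, B[i]));
      L = j + 1 - L; b = d; m = 1;
      for (i = 0; i <= MAXD; i++) B[i] = T[i];
    } else {          /* same length, just correct the discrepancy */
      u32 coef = mulm(d, invm(b));
      for (i = 0; i + m <= MAXD; i++) C[i + m] = subm(C[i + m], mulm(coef, B[i]));
      m++;
    }
  }
  return L;
}

int main(void) {
  /* Fibonacci mod P satisfies s[j] = s[j-1] + s[j-2]:
     we expect L = 2 and C = 1 - x - x^2, i.e., C[1] = C[2] = P - 1. */
  u32 s[8] = {0, 1, 1, 2, 3, 5, 8, 13}, C[MAXD + 1];
  int L = berlekamp_massey(s, 8, C), i;
  printf("L = %d, C =", L);
  for (i = 0; i <= L; i++) printf(" %lu", C[i]);
  printf("\n");
  return 0;
}

In Wiedemann's setting, s would be the 2d entries of T_{b,u} and the returned connection polynomial plays the role of f_A^{b,u}.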

3.4.1.3 Block iterative algorithms

In Section 3.4.1.1, we have seen that over small finite fields, Lanczos's algorithm may fail by encountering self-orthogonal vectors during its computation. We recall that a vector x whose coordinates lie in a finite field F_q is self-orthogonal when (x|x) = 0 (in F_q); for example, over F_2 the nonzero vector (1, 1, 0) is self-orthogonal since 1 + 1 + 0 = 0. Heuristically, this bad event occurs with probability near 1/q for each scalar product occurring in the algorithm. This implies that for small finite fields, Lanczos's algorithm is likely to fail. When this happens, there is a simple way to avoid the problem: instead of using Equation (3.15) with an orthogonal basis (v_i) defined over F_q, we consider an orthogonal basis defined over an extension field F_Q, with Q a large enough power of q. With this simple change, the individual probability of failure of each scalar product becomes 1/Q, assuming that the initial vector v_1 is chosen at random in this large field. The fact that we search for a solution of the linear system in F_q does not affect these probabilities, since it only impacts the scalar products with y, which are not used as denominators. Thus, this simple change removes the bad behavior of Lanczos's algorithm. However, this change is very costly, because each computation in F_q is replaced by a computation in F_Q, without lowering the number of computations that need to be performed.

Another approach is the block Lanczos algorithm proposed in [Mon95], which starts from a block of several initial random vectors. At each round, this algorithm computes a new block of vectors by multiplying the previous block by the matrix and by performing orthogonalization with respect to the previous blocks. The details are similar to those of plain Lanczos's; however, there are some additional steps. First, the orthogonalization process requires linear algebra on some matrices, whose dimensions are the number of blocks. Second, some self-orthogonal vectors may be encountered during the computation and, in order to deal with them, they are removed from the current


block and added to the next one. This implies that the block Lanczos algorithm needs to deal with blocks of (slightly) varying size.

In addition to the possibility of working with small finite fields, block Lanczos has another advantage compared to ordinary Lanczos: it is much better suited for parallel computer architectures. As a consequence, it is frequently used for computations in large dimension. Note that Wiedemann's algorithm also has a block version [Cop94], which is also well adapted to parallel or even distributed computation (see [AFK+07] for an example).

3.4.2 Structured Gaussian elimination

Structured Gaussian elimination was first introduced in the context of index calculus algorithms by Odlyzko in [Odl85] (see also Lamacchia and Odlyzko [LO91]). Its goal is, starting from a large sparse system of linear equations, to reduce it to a smaller system, while maintaining a relative sparsity. It has been especially devised for the kind of linear system typically encountered in index calculus algorithms. In these systems, some variables occur very frequently in equations while others are quite rare. Moreover, with enough sieving effort, these systems can be largely overdefined. These two specific properties are essential to the behavior of Lamacchia and Odlyzko's algorithm. Note that this algorithm is heuristic and that no definite analysis of its complexity is available. However, this is not essential in practice. Indeed, structured Gaussian elimination is not used alone, but always in conjunction with an iterative algorithm, as a practical method to reduce the cost of this iterative algorithm. Moreover, using such structured Gaussian elimination does not improve the asymptotic complexity of index calculus methods as described in Chapter 15.

In the sequel, we are given a sparse linear system of equations and we try to simplify the system, reducing the number of equations and unknowns, without losing too much sparsity. Note that it is useful to start with an overdetermined system, i.e., with more equations than unknowns. This allows us to discard some equations along the way, when they become too dense. It is also convenient to assume that most coefficients that appear in the linear system are either 1 or −1 and that an overwhelming majority of coefficients are small enough to fit in a single computer word or even in a single byte.

To introduce the ideas used in structured Gaussian elimination, we start by considering some special cases of interest. The simplest transform that can be applied to the initial system of equations is the removal of singletons. Scan the system of equations and, if any unknown only appears in a single equation, remove both the unknown and the equation from the system. This transform is very efficient since it lowers the number of equations and unknowns and also reduces the total memory size of the linear system description. The next transform that we can apply looks either for equations containing two unknowns only or for unknowns that appear in two equations only. In both cases, it is convenient to add the additional constraint that the two coefficients


involved are both either 1 or −1. In this case, an equation containing two unknowns can be rewritten as y = ±x + a, where x and y are the involved unknowns and a is a constant; thus y can be replaced everywhere by either x or −x, with a corresponding change to the right-hand side constants. Similarly, if x only appears in two equations E and E′, we can replace these two equations by either E − E′ or E + E′ and remove x from the list of unknowns. Note that these two transforms are in some sense dual. Applying one to the matrix of a linear system corresponds to applying the other one to the transposed matrix. These two transforms remove one equation and one unknown, while leaving the memory size of the system essentially constant. In fact, the amount of memory usually decreases by 2 coefficients.
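To fix ideas, here is a tiny self-contained sketch of iterated singleton removal in C. The dense 0/1 incidence matrix and all names are ours, chosen only to keep the example short; a real implementation would of course work with sparse row and column lists.

#include <stdio.h>

#define NEQ 5   /* toy dimensions, for illustration only */
#define NVAR 5

/* inc[e][v] = 1 iff unknown v occurs in equation e */
int inc[NEQ][NVAR] = {
  {1, 1, 0, 0, 0},
  {0, 1, 1, 0, 0},
  {0, 0, 1, 1, 0},
  {0, 0, 0, 1, 0},
  {0, 0, 1, 0, 1},
};
int eq_alive[NEQ] = {1, 1, 1, 1, 1};
int var_alive[NVAR] = {1, 1, 1, 1, 1};

int main(void) {
  int changed = 1, e, v;
  /* Repeat until no singleton remains: removing an equation may
     turn another unknown into a singleton. */
  while (changed) {
    changed = 0;
    for (v = 0; v < NVAR; v++) {
      int count = 0, last = -1;
      if (!var_alive[v]) continue;
      for (e = 0; e < NEQ; e++)
        if (eq_alive[e] && inc[e][v]) { count++; last = e; }
      if (count == 1) { /* v occurs in a single equation: drop both */
        var_alive[v] = 0;
        eq_alive[last] = 0;
        changed = 1;
        printf("removed unknown %d together with equation %d\n", v, last);
      }
    }
  }
  return 0;
}

On this toy system the removals cascade: eliminating one singleton creates new ones, until the whole system collapses.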

These two simple transforms can be applied iteratively; indeed, removing or merging equations may create new singletons or doubletons. Once this is done, we need to consider heavier equations and/or unknowns that appear more often. Typically, if an equation with t unknowns exists and if at least one of its coefficients is 1 or −1, then the corresponding unknown can be replaced by an affine combination of the others everywhere in the linear system. Similarly, if an unknown appears ℓ times, at least once with coefficient 1 or −1, it can be eliminated by subtracting the corresponding equation from each of the ℓ − 1 others. In fact, if unknown x appears in equation E with coefficient 1 or −1, with x appearing a total of ℓ times and if E has weight t, this transform removes one unknown and one equation; moreover, assuming that no other cancellation occurs, the memory size is increased by (t − 1)(ℓ − 1) − t − ℓ + 1 = (t − 2)(ℓ − 2) − 2.
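To see where this count comes from (our own spelling out of the formula above): subtracting the appropriate multiple of E from each of the ℓ − 1 other equations containing x creates at most t − 1 entries in each of them, while the ℓ occurrences of x disappear and the t − 1 remaining entries of E are removed together with E:

\[
(\ell - 1)(t - 1) - \ell - (t - 1) = t\ell - 2t - 2\ell + 2 = (t - 2)(\ell - 2) - 2.
\]

In particular, the memory size never grows when (t − 2)(ℓ − 2) ≤ 2; taking ℓ = 1 or t = 2 recovers the singleton and doubleton cases discussed above.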

Since we took care to start from an overdefined system of equations, another basic transform is available; it consists of simply removing some equations when the system becomes too large to fit into memory. Of course, during this step, it is preferable to remove the equation(s) with the largest weight.

All existing variations of structured Gaussian elimination are essentially heuristic methods that aim at using these transforms, or close variations, in an efficient way.

3.4.2.1 Odlyzko's method

In the original method of Odlyzko, the geometry of the specific systems of equations encountered during index calculus algorithms¹ is also taken into account. With these systems, some unknowns, corresponding to small primes, occur frequently, while others are much rarer. This dissymmetry is reflected in the structured Gaussian elimination by distinguishing between heavy and light unknowns. The transforms that are applied are adapted to take this distinction into account. More precisely, transforms in the following list are applied repeatedly:

¹ See Chapter 15.


• Find all light unknowns which appear only once and remove the corresponding rows and columns.

• Find all light unknowns which appear alone, with coefficient 1 or −1, in some row, replace the unknown everywhere by its expression in terms of heavy unknowns and remove the corresponding row.

• Enlarge the set of heavy unknowns, declaring as heavy some of the previously light unknowns. These unknowns are selected among the most frequently occurring.

• Remove some heavy rows.

The above transforms eliminate singletons, with a modified definition of weight that ignores the contribution of heavy unknowns, i.e., of unknowns which appear too frequently. In [LO91], the elimination of doubletons was also considered; however, experiments showed that in this case, the overall weight grows too quickly.

This approach of Odlyzko works well for linear systems of equations that appear during index calculus algorithms for factoring or computing discrete logarithms. Moreover, when implementing the method it is possible to remove heavy unknowns from memory and to recompute them afterwards by journaling the history of the computation. However, the approach also presents some drawbacks:

• The approach is not well adapted to the large primes variation of index calculus. To deal with such systems, it is preferable to work with the graph method presented below.

• Linear systems arising during the computation of discrete logarithms in finite fields of small characteristic behave badly with this algorithm. The problem is that instead of having unknowns associated with prime numbers, whose frequencies slowly decrease as the primes become larger, we have unknowns that correspond to irreducible polynomials, whose frequency is a function of the degree of the polynomial. As a consequence, when modifying the set of heavy unknowns, any reasonable strategy makes jumps and adds a whole family of unknowns, corresponding to the same degree. In this context, the overall behavior is quite bad. There, the greedy method below is preferable.

• Finally, despite the relative simplicity of the method, programming this approach is quite tricky and there is no easy way to check that the structured Gaussian elimination did not subtly alter the system due to an undetected bug. When faced with this concern, using the lightweight approach of Section 3.4.2.3 is a good idea.


3.4.2.2 Markowitz pivoting

Instead of partitioning the unknowns between the heavy and light categories, we can use Markowitz pivoting, a greedy method that simply tries to apply the basic transforms we described in the introduction of this section. We already know that performing Gaussian elimination to remove an unknown that appears in ℓ different equations, and in particular in an equation of weight t with a coefficient 1 or −1, usually increases the total number of entries in the linear system by (t − 2)(ℓ − 2) − 2. Markowitz pivoting selects, at each step of the computation, the elimination corresponding to the smallest possible increase of the linear system size. When applying this strategy, the size of the system slowly increases; when it reaches a predetermined threshold, a fraction of the equations with highest weight is removed. The program stops when the number of remaining equations becomes smaller than the number of remaining unknowns plus a small fixed security margin, e.g., 100 extra equations. Another option for halting structured Gaussian elimination is to compute at regular intervals an estimate of the complexity of running an iterative solver on the current system of equations. At the beginning of structured elimination, this estimate quickly decreases; at some point of the elimination it reaches a minimum, and when this happens, stopping the structured elimination is a good idea. All in all, this often gives a better result than pushing the elimination too far, which results in a denser matrix.

The principle of the greedy approach is very simple and was first described in [Mar57]. However, the bookkeeping required by this approach is quite massive. The key questions are to efficiently locate at each step the pair unknown/equation that corresponds to the lowest value of (t − 2)(ℓ − 2) − 2 and to efficiently perform the Gaussian elimination steps. Clearly, recomputing (t − 2)(ℓ − 2) − 2 for all pairs at each step of the computation to locate the minimum value is going to be too costly. Similarly, in order to perform the Gaussian elimination or pivoting step, we cannot afford to scan the complete linear system in order to locate the unknown that needs to be eliminated. Thankfully, these two tasks can be managed efficiently by using adequate data structures.

3.4.2.2.1 Structure for efficient pivoting Given a pair (E, x) where E is an equation of weight t and x an unknown that appears in E with coefficient 1 or −1, and where ℓ is the total number of equations that contain x, we need to efficiently locate all equations that contain x and add/subtract some multiple of E from these equations. A neat solution to perform this task is to store not only the sparse matrix corresponding to the system of equations but also its transpose. Note that when storing this transpose, it is possible to take advantage of the heavy column idea of Odlyzko's original approach. Indeed, unknowns with a very large value for ℓ are unlikely to minimize W(ℓ, t) = (ℓ − 2)(t − 2) − 2. As a consequence, it suffices to store a fraction of the transpose. During pivoting, both the linear system and the transposed matrix


need to be updated. However, since the linear system is sparse, the update task only affects a fraction of the unknowns and equations.

3.4.2.2.2 Structure for pivot selection To select a pivot, we need to find the pair (E, x) that minimizes the quantity W(ℓ, t). A first remark is that for a given unknown x, W(ℓ, t) can be computed by scanning the column of x using the transposed copy of the linear system and by remembering the equation where x appears with coefficient 1 or −1 that has the lowest weight. Thus, for every unknown x, we can define a best possible pair (E, x) and a corresponding value W(ℓ, t). During a pivoting step, thanks to sparsity, only a small number of unknowns are subject to change and need an update of their W(ℓ, t) value. The only remaining question is to find a data structure that allows us to efficiently locate the unknown with the smallest value at each step and that also allows efficient updates. This can for example be achieved using self-balancing tree techniques (see Chapter 6).
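The text leaves the choice of structure open; as one concrete possibility, the sketch below uses a lazy binary min-heap instead of a self-balancing tree, a common substitute. Entries carry a stamp so that outdated scores popped from the heap can be recognized and skipped. All names, sizes and the small driver are ours.

#include <stdio.h>

#define NVAR 1000
#define HEAPCAP 100000

typedef struct { int score, var, stamp; } Entry;

static Entry heap[HEAPCAP];
static int heapsize = 0;
static int cur_score[NVAR]; /* current best W(l,t) per unknown */
static int cur_stamp[NVAR]; /* incremented whenever the score changes */

static void push(Entry e) {
  int i = heapsize++;
  while (i > 0 && heap[(i - 1) / 2].score > e.score) {
    heap[i] = heap[(i - 1) / 2];
    i = (i - 1) / 2;
  }
  heap[i] = e;
}

static Entry pop(void) {
  Entry top = heap[0], last = heap[--heapsize];
  int i = 0;
  for (;;) {
    int c = 2 * i + 1;
    if (c >= heapsize) break;
    if (c + 1 < heapsize && heap[c + 1].score < heap[c].score) c++;
    if (heap[c].score >= last.score) break;
    heap[i] = heap[c];
    i = c;
  }
  heap[i] = last;
  return top;
}

/* Called whenever pivoting changes the best pair of some unknown:
   the old heap entry is not deleted, it simply becomes stale. */
void update_score(int var, int score) {
  Entry e;
  cur_score[var] = score;
  e.score = score; e.var = var; e.stamp = ++cur_stamp[var];
  push(e);
}

/* Returns the unknown with minimal current score, skipping stale entries. */
int pop_best(void) {
  while (heapsize > 0) {
    Entry e = pop();
    if (e.stamp == cur_stamp[e.var]) return e.var; /* fresh entry */
  }
  return -1; /* heap exhausted */
}

int main(void) {
  update_score(3, 10);
  update_score(5, 7);
  update_score(3, 4); /* unknown 3 improves: its old entry goes stale */
  printf("best unknown: %d\n", pop_best()); /* prints 3 */
  return 0;
}

Each stale entry is examined at most once, so updates and selections both stay logarithmic on average.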

3.4.2.3 A lightweight approach

One notable disadvantage of using Markowitz pivoting is that since efficient implementations of this algorithm require a complex data structure, they may contain subtle, hard to detect bugs. In the context of index calculus, such a bug may cause failure of the sparse linear algebra algorithm used after structured Gaussian elimination. However, such a failure could also arise due to a bug in sieving or even due to a hardware failure at some point during the process that would modify a single equation somewhere. When this happens, it may become necessary to use simpler algorithms. For this reason, we now propose a lightweight approach to structured Gaussian elimination. After an initialization step, this approach works in several rounds, each round being broken into two phases. During initialization, we count the total number of occurrences of each unknown and memorize the list of light unknowns which occur less than some fixed threshold. The first phase of each round performs statistics on the sparse system of equations, stored in memory. More precisely, it sorts the equations by increasing values of the weight, or rather of the estimated weight. When an equation contains a single light (active) unknown, this unknown is remembered as a pivot for the equation. In case of conflict between two equations, the lighter one is given precedence to keep the pivot. The second phase performs the pivoting step in a lazy way. It simply adds, to each equation containing one or more unknowns declared as pivots, a list of entries that remember that a multiple of the equation corresponding to the pivot should be added to it. The addition is not performed; however, the estimated weight of the equation is modified, by adding to it the weight of the equation being added minus 1. After that, all pivot unknowns, together with their corresponding equations, are declared inactive and we proceed to the next round. In a final phase, the algorithm evaluates for each equation the sum of the initial entry and of the contribution of all equations that have been lazily


added. Note that this needs to be done in a recursive manner, since the added equations may also contain further lazy additions. Any equation that becomes too large during this process is simply discarded. Note that for debugging purposes, only this part of the algorithm needs to be checked in detail. Indeed, by construction, this part can only output linear combinations of the original equations, which by definition are valid equations. Any bug in the previous parts of the program might affect the level of sparsity of the output system but not its correctness.

3.4.2.4 Graph method

In the so-called large prime variation of index calculus algorithms, we encounter systems of equations with some additional structure. More precisely, the set of unknowns is divided into two parts: unknowns corresponding to regular primes and unknowns corresponding to large primes. The main difference is that, on average, each large prime appears less than once in the linear system, while regular primes appear at least a few times on average. Moreover, large prime unknowns always have coefficients 1 or −1. Of course, an unknown that appears only once can be safely removed, together with the corresponding equation, from the linear system. Thus, the average large prime essentially yields a useless equation. However, due to birthday paradox effects (see Chapter 6), a small fraction of large primes appears at least twice. If there is at most one large prime in each equation, it suffices to merge the two equations where a given large prime occurs, as in basic structured Gaussian elimination. The resulting equation has double weight, thus remaining reasonably sparse. However, with current index calculus algorithms, we may have up to four large primes per equation.

Another specificity of the large prime variation is that since a large fraction of large primes does not appear, a large fraction of the collected linear equations is lost. As a consequence, we need to consider a huge system, much bigger than systems arising from index calculus without large primes. For this reason, it is important to use algorithms that do not require loading into memory the complete system of equations. Instead, the algorithms should work using only the large prime parts of equations. This approach, called filtering, is discussed in particular in [Cav00]. Of course, in a final stage, it is necessary to compute the resulting linear system after large prime elimination. However, this final stage only involves the small fraction of equations that remains and is not a real problem.

With this idea in mind, specific algorithms were proposed. These algorithms treat Gaussian elimination as a graph problem. To explain how, let us first consider the restricted case with at most two large primes per equation. In that case, we can view each unknown of the large type as a node in the graph and each equation as an edge between the two large primes that appear in the equation. When a given large prime appears alone in an equation, we construct an edge between the large prime and a special node no-unknown.


This representation is helpful, because any linear combination (with ±1 coefficients) of equations where all large prime unknowns cancel out induces a cycle in the above graph. Moreover, the converse is "almost" true. If the linear system is considered modulo 2, as when factoring, then since any node in a cycle appears an even number of times, usually twice, summing all the equations that correspond to the edges of the cycle makes the contribution of the large primes vanish. When the linear system is modulo a larger prime, it is possible to adjust the sign of the contribution of each equation in the cycle to make sure that all unknowns but one vanish. If the special node no-unknown occurs in the cycle, we can make the contribution of all other nodes vanish and thus obtain a new equation without large primes. If it does not occur, we need to rely on luck for eliminating the final large prime unknown and we succeed with probability 1/2. In truth, the whole approach is more complicated than that, because it does not suffice to find a single cycle. To make the most out of the available equations, we need to construct a set called a cycle base which generates all possible cycles with redundancies.
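To make the cycle detection concrete, here is a small union-find sketch (our own illustration; node 0 stands for the special no-unknown node). Each equation with at most two large primes is an edge; an edge joining two already-connected nodes closes a cycle and thus marks an equation whose large primes can be combined away.

#include <stdio.h>

#define NNODES 1000 /* node 0 is the special "no-unknown" node */

static int parent[NNODES];

static int find(int x) { /* root with path compression */
  while (parent[x] != x) {
    parent[x] = parent[parent[x]];
    x = parent[x];
  }
  return x;
}

/* Returns 1 when the edge (u,v) closes a cycle, i.e., when the
   corresponding equation can be combined with previously seen ones
   to eliminate its large primes. */
int add_edge(int u, int v) {
  int ru = find(u), rv = find(v);
  if (ru == rv) return 1; /* cycle found */
  parent[ru] = rv;
  return 0;
}

int main(void) {
  int i;
  /* toy example: equations seen as edges between large primes */
  int edges[5][2] = { {0, 7}, {7, 9}, {0, 9}, {3, 4}, {3, 4} };
  for (i = 0; i < NNODES; i++) parent[i] = i;
  for (i = 0; i < 5; i++)
    printf("equation %d: %s\n", i,
           add_edge(edges[i][0], edges[i][1]) ? "closes a cycle" : "tree edge");
  return 0;
}

A cycle passing through node 0 directly yields an equation free of large primes; for the other cycles, as explained above, one still relies on the signs cooperating for the last unknown.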

3.4.2.5 Final backtracking

After applying any of the above structured Gaussian elimination techniques, one obtains a reduced sparse system of equations which can be solved using iterative techniques. However, after that, it is often necessary to turn back to the initial linear system. Two cases arise depending on the target application. For factoring, the goal of the linear algebra step is to find an element of the kernel of the sparse matrix. For discrete logarithms, the goal is to find values for the unknowns.

Let us start by considering the case of discrete logarithms. After applying the iterative solver, we have values only for the unknowns that remain in the linear system after structured Gaussian elimination. We would like to complete this result and recover a value for all unknowns present in the original system. This can be done using a very simple backtracking algorithm. In fact, it suffices to scan the initial system of equations and, whenever an equation with a single undetermined unknown is encountered, to compute a value for this unknown. Repeating this step until no new value can be added yields a complete (or almost complete) table of values for the unknowns of the initial system. Indeed, since structured Gaussian elimination eliminates one unknown at a time, this backtracking method can recover the missing values one by one.
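A minimal sketch of this backtracking pass in C follows. The toy system, the hypothetical prime field F_101 and the pretended solver output are all our own choices for illustration.

#include <stdio.h>

#define P 101  /* hypothetical small prime modulus */
#define NEQ 3
#define NVAR 4
#define MAXW 4

typedef struct { int w; int var[MAXW]; long coef[MAXW]; long rhs; } Eq;

long value[NVAR]; /* -1 marks a still-undetermined unknown */

static long powm(long a, long e) {
  long r = 1; a %= P;
  while (e) { if (e & 1) r = r * a % P; a = a * a % P; e >>= 1; }
  return r;
}

int main(void) {
  /* toy system over F_P:  x0 + x1 = 5,  x1 + 2*x2 = 10,  x2 + x3 = 3 */
  Eq eq[NEQ] = {
    { 2, {0, 1}, {1, 1}, 5 },
    { 2, {1, 2}, {1, 2}, 10 },
    { 2, {2, 3}, {1, 1}, 3 },
  };
  int i, j, progress = 1;
  for (i = 0; i < NVAR; i++) value[i] = -1;
  value[1] = 7; /* pretend the iterative solver returned x1 = 7 */
  while (progress) {
    progress = 0;
    for (i = 0; i < NEQ; i++) {
      int missing = -1, nmiss = 0;
      long acc = eq[i].rhs;
      for (j = 0; j < eq[i].w; j++) {
        if (value[eq[i].var[j]] < 0) { missing = j; nmiss++; }
        else acc = (acc - eq[i].coef[j] * value[eq[i].var[j]] % P + P * P) % P;
      }
      if (nmiss == 1) { /* exactly one undetermined unknown: solve for it */
        long c = eq[i].coef[missing];
        value[eq[i].var[missing]] = acc * powm(c, P - 2) % P;
        progress = 1;
      }
    }
  }
  for (i = 0; i < NVAR; i++) printf("x%d = %ld\n", i, value[i]);
  return 0;
}

Starting from x1 alone, successive scans determine x0, then x2, then x3, exactly the one-by-one recovery described above.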

The case of factoring is slightly more complicated. Here, we are given a kernel element as a linear combination of equations in the transformed system and we need to express it as a linear combination of equations in the original system. The simplest way to do this is to use journaling in the structured Gaussian elimination algorithm. In the journal, we give labels to all initial equations and to all intermediate equations that arise during Gaussian elimination. Note that each intermediate equation is a linear combination


of two preexisting equations. We now write a kernel element as a linear combination of equations, allowing all initial, intermediate or final equations in the expression. Such an expression can be iteratively transformed using the journal by replacing each equation by its expression in terms of its immediate predecessors. At the end, we obtain an expression that only involves the initial equations. Since the linear combinations that represent kernel elements are not sparse, we can only perform backtracking on a small number of kernel elements. However, this is enough for factoring purposes.


Exercises

1h. Construct a linear system of 4 equations in 4 unknowns, without a solution. Another one, over the field of reals, with infinitely many solutions. Write a linear system of 4 equations in 4 unknowns, with a unique solution, such that Algorithm 3.4 fails. Choose your example in a way that ensures that if the order of the equations in the system is reversed, Algorithm 3.4 succeeds.

2. Put together the code fragments in Programs 3.4 and 3.5 to realize a complete multiplication of 32 × 32 Boolean matrices. Extend this to 64 × 64 and 128 × 128. Profile the execution of the resulting code on your computer, using the available compiler(s), and identify the critical parts of the program. Without changing the algorithm, rewrite the critical parts of the code to improve the overall performance.

3h. Assume that we would like to multiply one fixed square Boolean matrix of small dimension by many matrices of the same size. To improve the performance, we consider the algorithm of four Russians, which precomputes and stores the results of all possible multiplications for rectangular slices of the original matrix. Assuming that the number of multiplications is very large, we can ignore the time of this precomputation. Implement and optimize this on your computer.

4. Program 3.6 uses the 32 × 32 matrix multiplication as a black box, in order to multiply 32n × 32n Boolean matrices. This can be improved in many ways. A first option is to simply share the fold operations rather than doing them independently for each small multiplication. A second option is to use the algorithm of four Russians from the previous exercise.

5h. In order to make the running time of Strassen's multiplication algorithm behave more smoothly as a function of the dimension, we should change the rounding rule and the decision to cut off the algorithm. One simple but time-consuming option is to run the program for each size using three possibilities (or two for even dimensions): Strassen with rounding up, Strassen with rounding down or basic multiplication. After this, save the most efficient option for this dimension in a table. Then, when the multiplication is called recursively from a larger size, reuse the saved option. Note that this table of decisions needs to be constructed by starting from small sizes and going up.

6. Reprogram a recursive multiplication algorithm using Winograd's formulas instead of Strassen's.


7h. Consider Schur's complement method for matrix multiplication. How would you bypass the problem caused by the fact that D might be non-invertible? The goal is to achieve asymptotic complexity with exponent log₂ 7.

8. Gaussian elimination can also be used for matrices defined over a ring. For example, consider the issue of solving linear systems of equations modulo 256 when the determinant of the system is a multiple of 256 (see Section 3.3.3.1). What new difficulties appear? How can you solve them?

9h. Implement the computation of Hermite normal forms over the integers.

Some of the methods presented in this chapter can also be a basis for programming projects. Implementations of structured Gaussian elimination or iterative algorithms are very interesting in this respect, especially in conjunction with the applications presented in Chapters 11 and 15.


Chapter 4

Sieve algorithms

Many modern algorithms for factoring and computing discrete logarithms in finite fields heavily rely on sieving in practical implementations¹. These index calculus algorithms themselves are described in the applications part of this book, more precisely in Chapter 15. We chose to present the sieve algorithms in the present chapter in order to focus on the variety of possible implementations and available trade-offs.

4.1 Introductory example: Eratosthenes's sieve

Eratosthenes (276–194 BC) was an ancient Greek mathematician. Among his many contributions, he invented one of the first algorithms known. This algorithm, called Eratosthenes's sieve, is an efficient method for computing all prime numbers up to a given bound. In 2004, Atkin and Bernstein [AB04] found a faster algorithm for achieving this task. However, before that, Eratosthenes's sieve had essentially been unsurpassed for more than two millennia. In this section, in order to introduce sieve algorithms, we show how to efficiently implement Eratosthenes's sieve on modern computers.

4.1.1 Overview of Eratosthenes's sieve

The basic idea of Eratosthenes's sieve is to start from a list of integers from 1 up to some initial limit and to remove all non-primes from the list, thus ending with a list of primes. The first step is to cross 1, which is a unit, thus not a prime, in the ring of integers. After that, the smallest remaining number, 2, is a prime. As a consequence, all multiples of 2 are composite numbers and can be crossed from the list. The next smallest remaining number is 3, which is prime and its multiples need to be crossed. After that, we find in turn the prime numbers 5, 7, 11 and so on, cross their multiples and continue until

we reach the end of our list. To make the description more precise, we write down the pseudo-code in Algorithm 4.1.

¹ As we will see in Chapter 15, sieving is only an implementation aspect of these algorithms and can be ignored where asymptotic analysis is concerned. Nonetheless, it is a very important practical aspect.

Algorithm 4.1 Eratosthenes's sieve

Require: Initial limit Lim
  Create array IsPrime indexed from 1 to Lim, initialized to true.
  IsPrime[1] ← false
  for i from 2 to Lim do
    if IsPrime[i] is true then
      j ← 2i
      repeat
        IsPrime[j] ← false; j ← j + i
      until j > Lim
    end if
  end for
  for i from 2 to Lim do
    if IsPrime[i] is true then
      Display(i, 'is prime')
    end if
  end for

Before going through the process of programming and optimizing this algorithm, let us analyze its main properties. First of all, the algorithm needs to work with Boolean values. From a theoretical point-of-view, this is extremely nice. However, the practical use of Booleans in the C language is not completely straightforward. Indeed, no Boolean type was defined in early C89, and while an extension was added in C99, it is still not supported by all compilers. Thus, as in Chapter 3, we use the trick of packing several Booleans into the machine representation of a single (unsigned) integer. This allows us to represent the required array of Lim Booleans in Lim/8 bytes of memory.

From the point-of-view of correctness, let us show that the algorithm works correctly. It is necessary to prove that any prime number has true stored in the corresponding entry of table IsPrime and that any composite number C has false stored. Clearly, since the table is initialized with true and since only 1 and proper multiples of accepted numbers are crossed, no prime number may be crossed from the list. Thus, we need to check that all composites are effectively removed. More precisely, this event happens during the first i loop, before i reaches value C. Indeed, if C is composite, it has at least one prime divisor, say P < C. Since P is prime, it remains in the final list and thus all its multiples, including C, are crossed. Thus the algorithm is correct.

As the runtime analysis goes, the algorithm spends Lim/p steps in the inner loop to remove the multiples of the prime p (when i is equal to p).


Thus, the total running time is bounded by:

\[
\sum_{p=2}^{Lim} \frac{Lim}{p}, \qquad (4.1)
\]

where the sum runs over primes p. By Mertens' theorem, this sum is O(Lim · log log(Lim)).

Looking more precisely at the algorithm, we see that every composite is crossed more than once. This opens the way for a slight improvement: it suffices to cross multiples of p starting from p². Indeed, each composite C has at least one prime divisor p such that p² ≤ C. This reduces the number of iterations of the outside loop down to √Lim and improves the running time. However, asymptotically, the gain is negligible. The main reason is that small primes have many multiples and sieving over these primes is the costly part of the algorithm. In the next section, we present possible improvements in order to speed up practical implementations of Eratosthenes's sieve. Our starting reference implementation is the C code of Program 4.1.

4.1.2 Improvements to Eratosthenes's sieve

Looking at our reference Program 4.1, we quickly discover two main limitations on its range of applicability. The first limitation, which is easily overcome, is the fact that with many compilers, the integer type int represents signed 32-bit integers which cannot count above 2³¹ − 1. To remove this limitation, it suffices to define the loop variables i and j as 64-bit integers. The next limitation is the need to store the Boolean array containing one bit for each number up to the considered limit. Even with a 1 Gbyte memory, this prevents us from building primes further than 2³³.

4.1.2.1 Wheel factorization

Our first improvement pushes this limit further by focusing on the regularity of the distribution of multiples of small primes in order to use less memory. The most basic example of this idea is to remove all even numbers from our table in memory. This is safe since the only even prime is 2. Moreover, this can be done without any essential change to the sieving process. Indeed, if x is an odd multiple of p, the next odd multiple of p is x + 2p, whose representation is located p steps further than the representation of x in memory. As a consequence, this improvement also allows us to run Eratosthenes's sieve twice as far for the same cost in terms of both time and memory. To push this idea further, let us consider the three primes 2, 3 and 5. Between 1 and 30, only eight numbers, namely 1, 7, 11, 13, 17, 19, 23 and 29, are not multiples of either 2, 3 or 5. Moreover, adding 30 or a multiple of 30 to any number preserves the property of being a multiple of 2, 3 or 5. As a consequence, in any interval of 30 consecutive integers, we know that Eratosthenes's sieve already removes 22 numbers, simply by sieving over the small primes 2, 3 and 5. We can use this fact and restrict our table of Booleans in memory


Program 4.1 Basic C code for Eratosthenes's sieve

#include <stdio.h>
#include <stdlib.h>

typedef unsigned int packedbool;
packedbool *_IsPrime;

#define NpckBool (8*sizeof(packedbool))
#define GetIsPrime(x) \
  (_IsPrime[(x-1)/NpckBool]>>((x-1)%NpckBool))&1
#define SetCompo(x) \
  _IsPrime[(x-1)/NpckBool]&=~(1UL<<((x-1)%NpckBool))

void SievePrimes(int Limit)
{
  int i,j, tabLimit;
  tabLimit=(Limit+NpckBool-1)/NpckBool;
  for (i=0;i<tabLimit;i++) _IsPrime[i]=~0;
  for (i=2;i*i<=Limit;i++){
    if (GetIsPrime(i)) {
      for (j=i*i;j<=Limit;j+=i){
        SetCompo(j); } } }
  printf("List of primes up to %d:\n",Limit);
  for (i=2;i<=Limit;i++){
    if (GetIsPrime(i)) {
      printf("%d\n",i); } } }

int main()
{
  int Limit;
  printf("Enter limit ?");
  scanf("%d",&Limit);
  _IsPrime=(packedbool *)malloc((Limit+NpckBool-1)/8);
  SievePrimes(Limit);
  free(_IsPrime);
  return 0;
}


to consider only numbers of the form 30k + δ, for all integers k and for δ chosen among 1, 7, 11, 13, 17, 19, 23 and 29. This allows us to represent more than three times more numbers than the initial sieve in the same amount of memory. In order to visualize the algorithmic principle, it is useful to represent the numbers on a modular wheel of perimeter 30 as in Figure 4.1. For this reason, this particular approach to Eratosthenes's sieve is often called wheel factorization.

[Figure 4.1: Schematic picture of wheel factorization. The figure shows a modular wheel of perimeter 30 holding the numbers 30k + δ, with the positions coprime to 30 marked.]

Of course, we can even add more small primes into the picture to further improve the ratio: going up to the small prime 11, we restrict ourselves to 480 numbers out of 2310, thus gaining a factor of almost five. Using wheel factorization also speeds up the computation. Indeed, it removes the need to sieve over the smallest, most costly, primes. However, the approach has the drawback of complicating the matter of sieving over larger primes, since the locations of multiples of these primes are no longer evenly distributed in memory. To show the difficulty, let us consider the distribution of the multiples of 7 in a wheel of perimeter 30. We already know that our memory only holds numbers which are multiples of neither 2, 3 nor 5. We now need to locate multiples of 7 among these numbers. They are obtained by multiplying


together 7 with a number in the wheel, since multiplication by 7 does not change the divisibility by 2, 3 or 5. Thus, we need to mark all numbers of the form 7 · (30k + δ) for all k and for δ in 1, 7, 11, 13, 17, 19, 23 and 29. These numbers form eight series, one for each value of δ, and each series consists of evenly distributed numbers in memory, separated by a distance of 7 × 8, as shown in Figure 4.2, with one series in each column. However, the figure also shows that the starting points of the different series are not evenly spaced. In our example, the starting points correspond to the numbers 7, 49, 77, 91, 119, 133, 161 and 203 and their respective locations in memory are 1, 13, 20, 24, 31, 35, 42, 54, assuming that 1 is at location 0. This uneven distribution makes the sieving code more complicated. We need either to sieve with irregular steps or to repeat the sieve eight times, once for each of the different series. With a wheel of fixed size, this can be achieved by embedding all the different cases into the program.
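The following small program (our own illustration, not code from the book) recomputes these locations: it maps a number coprime to 30 to its memory index, using 8 slots per block of 30, and prints the starting point of each of the eight series for p = 7.

#include <stdio.h>

/* residues coprime to 30, i.e., the positions kept in memory */
static const int wheel[8] = {1, 7, 11, 13, 17, 19, 23, 29};

/* memory index of a number n coprime to 30:
   8 slots per block of 30, plus the rank of n mod 30 in the wheel */
static int location(long n) {
  int r = (int)(n % 30), i;
  for (i = 0; i < 8; i++)
    if (wheel[i] == r) return (int)(8 * (n / 30) + i);
  return -1; /* n is not coprime to 30 */
}

int main(void) {
  long p = 7; /* the prime whose multiples we want to mark */
  int i;
  /* starting points of the eight series: p * delta for each wheel residue */
  for (i = 0; i < 8; i++)
    printf("multiple %3ld sits at location %2d (then every %ld slots)\n",
           p * wheel[i], location(p * wheel[i]), 8 * p);
  return 0;
}

Running it for p = 7 reproduces the locations 1, 13, 20, 24, 31, 35, 42 and 54 quoted above, each series then advancing by 7 × 8 = 56 slots.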

Alternatively, we can let the size of the wheel grow with the upper bound Lim; this is the dynamic wheel approach initially proposed by Pritchard in [Pri81]. Programming dynamic wheels is trickier and adds some significant computational overhead, which may be costly in implementations. However, it truly improves asymptotic complexities, as shown in Section 4.1.2.4. A way to avoid part of the difficulty is to use dynamic compilation, i.e., to write a program which, given the input bounds, generates a new program with an adapted wheel size.

4.1.2.2 Segmented sieve

Our second improvement aims at removing the memory limitation altogether. It is based on a refinement of the remark that only primes up to the square root of any composite need to be considered in order to cross this composite. As a consequence, the sieving process only needs to know the prime numbers up to the initial limit in order to proceed further up. Thus, the main part of the computation only depends on a short initial segment of Eratosthenes's sieve. This implies that the computation can be split into several smaller computations, each within an interval of length equal to the square root of the initial limit. Each of these smaller computations is essentially independent of the previous ones, the notable exception being the computation of the primes in the first of these intervals. Doing this, we can reduce the amount of required memory drastically, representing only short intervals of square root length. However, sieving each interval independently adds some computational overhead. Indeed, we need to compute the location of the first multiple of each prime in each interval. In fact, it is better to reuse some of the memory we just saved in order to remember this location from one interval to the next. This is illustrated by the C code of Program 4.2.

This approach, which works with each segment in turn, is called segmented sieve. From a complexity point-of-view, it greatly reduces the amount of required memory. However, the number of operations to perform remains


Program 4.2 Small memory code for Eratosthenes's sieve

#include <stdio.h>
#include <stdlib.h>

typedef unsigned int packedbool; typedef long long int int64;
packedbool *_IsBPrime; packedbool *_IsPrime; int64 *Offset;

#define Npck (8*sizeof(packedbool))
#define GetBasePrime(x) (_IsBPrime[(x)/Npck]>>((x)%Npck))&1
#define SetBaseCompo(x) _IsBPrime[(x)/Npck]&=~(1UL<<((x)%Npck))
#define GetIsPrime(x) (_IsPrime[(x)/Npck]>>((x)%Npck))&1
#define SetCompo(x) _IsPrime[(x)/Npck]&=~(1UL<<((x)%Npck))
#define Action(p) printf("%lld\n",(int64)(p))

void InitialSievePrimes(int Limit) {
  int i, j, count, tabLimit;
  count=0; tabLimit=(Limit+Npck-1)/Npck;
  for (i=0;i<tabLimit;i++) _IsBPrime[i]=~0;
  for (i=2;i*i<Limit;i++){
    if (GetBasePrime(i)) { count++;
      for (j=i*i;j<Limit;j+=i) SetBaseCompo(j); } }
  for (;i<Limit;i++){if (GetBasePrime(i)) count++;}
  Offset=(int64 *)malloc(count*sizeof(int64)); count=0;
  for (i=2;i<Limit;i++){
    if (GetBasePrime(i)) { Action(i);
      /* offset of the first multiple of i at or beyond Limit */
      j=Limit%i; if (j!=0) j=i-j; Offset[count]=j;
      count++; } } }

void SievePrimesInterval(int64 offset, int Length) {
  int i, j, count, tabLimit;
  count=0; tabLimit=(Length+Npck-1)/Npck;
  for (i=0;i<tabLimit;i++) _IsPrime[i]=~0;
  for (i=2;i<Length;i++) {
    if (GetBasePrime(i)) {
      for (j=Offset[count];j<Length;j+=i) SetCompo(j);
      /* where the multiples of this prime resume in the next interval */
      Offset[count]=j-Length; count++; } }
  for (i=0;i<Length;i++) {if (GetIsPrime(i)) Action(offset+i); }
}

int main() {
  int64 Limit, tmp; int sqrt;
  printf("Enter limit ?"); scanf("%lld",&Limit);
  /* smallest integer whose square exceeds Limit: used as segment length */
  for(tmp=0;tmp*tmp<=Limit;tmp++); sqrt=tmp;
  _IsBPrime=(packedbool *)malloc((sqrt+Npck-1)/8);
  _IsPrime=(packedbool *)malloc((sqrt+Npck-1)/8);
  InitialSievePrimes(sqrt);
  for(tmp=sqrt;tmp<Limit-sqrt;tmp+=sqrt)
    SievePrimesInterval(tmp,sqrt);
  SievePrimesInterval(tmp,Limit-tmp);
  free(_IsPrime);free(_IsBPrime);
  return 0;
}


[Figure 4.2: Multiples of 7 in a wheel of perimeter 30. The original figure tabulates memory locations 0 to 55 against the numbers they represent, with the eight series of multiples of 7 appearing one per column.]


essentially the same and the approach does not seem to offer any improvements to the running time. Yet, in practice, doing the sieve in a segmented way allows us to work in a very small amount of memory. For example, with an upper-bound Lim = 2³⁴, above the limit we previously had with a 1 Gbyte memory, the basic segment contains 2¹⁷ integers and is represented by 2¹⁷ bits or 16 Kbytes. On most modern computers, this fits without any difficulty in the first level of cache. Since sieving is a simple process that requires few operations, the time of the main memory accesses dominates the computation of our reference code for sieving. With the segmented sieve, the memory accesses in cache are about twice as fast. It is interesting to note that the gain is quite small compared to the most extreme memory behaviors. One of the reasons is that the basic sieve is already quite good in terms of its memory access structure. Remember that a lot of time is spent sieving the small primes and that for small primes we access a sequence of memory positions not too far from each other. Another reason is that the computational load for packing the Booleans is not negligible and hides part of the gain.

4.1.2.3 Fast practical Eratosthenes's sieve

Of course, nothing prevents us from combining both improvements to obtain an even faster code. However, before devising a combined approach, it is essential to profile our previous codes in order to understand their limitations. The very first remark that arises during profiling is that a huge portion of the running time is used to simply print the prime numbers. From an algorithmic point-of-view, this is irrelevant; indeed, there are many possible applications of computing prime numbers and many of them do not involve any printing. For example, we may want to compute general statistics on the distribution of primes or we may require the creation of prime numbers as a pre-requisite for a subsequent algorithm. As a consequence, before profiling it is important to replace the action of printing the primes by a different, less costly action. One simple possibility is to increment a counter and obtain the total number of prime numbers below the considered bound.

Profiling this code shows that when building the primes up to 10⁹, about three quarters of the time are spent in the sieving process itself; the final quarter of the running time is spent in the last line of SievePrimesInterval, where the Boolean table with the prime status is reread from memory. This is quite surprising, because from an asymptotic point-of-view, this part of the code has complexity O(Lim) compared to O(Lim · log log(Lim)) for the sieve itself and should be negligible. However, for practical values of the bound Lim, the resulting value log log(Lim) is very small, which explains the apparent discrepancy.

4.1.2.4 Fast asymptotic Eratosthenes's sieve

From an asymptotic point-of-view, there are several different optimums when looking at the sieve of Eratosthenes, depending on the relative


contribution of time and space to the complexity. In particular, in [Sor98], Sorenson presented four different combinations with different balances of time and space. We can either try to optimize the asymptotic running time or take into account both the running time and the memory use. When considering the asymptotic running time only, the best known variation on Eratosthenes's sieve is an algorithm by Pritchard, described in [Pri81]. This algorithm uses wheel factorization with the largest possible wheel applicable to the upper sieve bound Lim given in input. Clearly, the wheel size is a function of Lim and this approach needs to deal with the additional technicalities of using wheels of varying size. The resulting algorithm is sublinear in time but requires a large amount of space.

Another approach proposed by Pritchard [Pri83] uses both wheels and a segmented sieve. It only achieves linear time but reduces the memory requirements down to the square root of the bound Lim.

4.1.3 Finding primes faster: Atkin and Bernstein's sieve

As we saw above, asymptotically, the sieve of Eratosthenes can be sublinear; it can also be performed using a small amount of memory. However, it is not known how to achieve both properties simultaneously for this sieve. This has been an open problem for a long time. In 1999, Atkin and Bernstein proposed in [AB04] a new algorithm, not based on Eratosthenes's sieve, that solves this problem. This algorithm is constructed from 3 subroutines, each addressing a subset of the prime numbers, using a characterization of primes which does not make use of divisibility. The first algorithm can be used to find primes congruent to 1 modulo 4, the second finds primes congruent to 1 modulo 6 and the third finds primes congruent to 11 modulo 12. Since all primes, but 2 and 3, are of these forms, these three subroutines suffice. For primes congruent to 1 modulo 12, we need to choose either the first or the second subroutine.

Let us start with the case of primes congruent to 1 modulo 4. The algorithm is based on the following theorem:

THEOREM 4.1 (Th. 6.1 in [AB04])
Let n be a square-free positive integer with n ≡ 1 (mod 4). Then n is prime if and only if the cardinality of the set {(x, y) | x > 0, y > 0, 4x² + y² = n} (or equivalently of the set {(x, y) | x > y > 0, x² + y² = n}) is odd.

The first subroutine follows quite naturally from this theorem. It contains two phases: the first phase computes the cardinality of the set given in Theorem 4.1, the second phase removes non-square-free integers. This results in Algorithm 4.2. For more efficiency, this algorithm can be combined with a wheel. For example, in [AB04] it is presented (as Algorithm 3.1) using a wheel of 30. Note that such a wheel is easily combined with the congruence condition modulo 4, thus resulting in an extended wheel of length 60 containing 8


different numbers.

Algorithm 4.2 Sieve of Atkin and Bernstein for primes ≡ 1 (mod 4)

Require: Initial range Lim . . . Lim + 4Count with Lim ≡ 1 (mod 4)
  Create array IsOdd indexed from 0 to Count, initialized to false.
  for all (x, y, i) with x > 0, y > 0, i ≤ Count, 4x² + y² = Lim + 4i do
    Negate the Boolean value of IsOdd[i]
  end for
  for all primes q with q² ≤ Lim + 4Count do
    for all Lim + 4i multiples of q² do
      Set IsOdd[i] to false
    end for
  end for
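One can check Theorem 4.1 by brute force for small values; the program below (ours, purely for verification) compares the parity of the cardinality with trial division for all square-free n ≡ 1 (mod 4) up to 1000.

#include <stdio.h>

static int is_prime(long n) { /* trial division, for checking only */
  long d;
  if (n < 2) return 0;
  for (d = 2; d * d <= n; d++) if (n % d == 0) return 0;
  return 1;
}

static int is_squarefree(long n) {
  long d;
  for (d = 2; d * d <= n; d++) if (n % (d * d) == 0) return 0;
  return 1;
}

int main(void) {
  long n, x, y;
  for (n = 5; n <= 1000; n += 4) { /* n = 1 (mod 4) */
    long card = 0;
    if (!is_squarefree(n)) continue;
    /* count the pairs (x,y), x > 0, y > 0, with 4x^2 + y^2 = n */
    for (x = 1; 4 * x * x < n; x++)
      for (y = 1; 4 * x * x + y * y <= n; y++)
        if (4 * x * x + y * y == n) card++;
    if ((card & 1) != is_prime(n))
      printf("mismatch at n = %ld\n", n);
  }
  printf("check finished\n");
  return 0;
}

The non-square-free values are skipped here exactly as the second phase of Algorithm 4.2 removes them.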

The other subroutines derive from two other theorems, Theorems 4.2 and 4.3. They can also be implemented using extended wheels of 60, respectively containing eight and four numbers.

THEOREM 4.2 (Th. 6.2 in [AB04])
Let n be a square-free positive integer with n ≡ 1 (mod 6). Then n is prime if and only if the cardinality of the set {(x, y) | x > 0, y > 0, 3x² + y² = n} is odd.

THEOREM 4.3 (Th. 6.3 in [AB04])
Let n be a square-free positive integer with n ≡ 11 (mod 12). Then n is prime if and only if the cardinality of the set {(x, y) | x > y > 0, 3x² − y² = n} is odd.

Putting together the three subroutines implemented with an extended wheel of 60 gives a practically efficient code. A package called primegen, written by Bernstein and distributed on his web pages, gives an efficient implementation of this algorithm. Moreover, from an asymptotic point-of-view, this new algorithm improves upon the best algorithms based on Eratosthenes's sieve. In this case, we need to use a wheel containing all the primes up to √(log Lim) and achieve running time O(Lim/log log(Lim)) using Lim^{1/2+o(1)} bits of memory.

4.1.3.1 Further improvements of Atkin and Bernstein's sieve

While more efficient than algorithms based on the sieve of Eratosthenes, Atkin-Bernstein's sieve shares a drawback with it. Assume that we do not want all the primes up to Lim, but only the primes in some interval [Lim1, Lim2]. Then, we see that there is a fixed start-up cost O(√Lim2) which is independent


of the length of the interval. With the sieve of Eratosthenes, this cost comes from creating the primes up to √Lim2 and finding their smallest multiples in the interval. For Atkin and Bernstein's sieve, the cost comes from the number of pairs (x, y) we need to enumerate. Let us consider the first case, which covers primes congruent to one modulo 4. There, we need to consider all pairs (x, y) with Lim1 ≤ 4x² + y² ≤ Lim2. Equivalently, we need to enumerate all pairs (x′, y) with Lim1 ≤ x′² + y² ≤ Lim2 and x′ even. This can be done by finding all points with integer coordinates located between two concentric circles of diameter √Lim1 and √Lim2. When Lim1 is near Lim2, it seems natural to expect that this task requires a walk inside the perimeter of the outer circle and costs O(√Lim2). This is indeed the behavior of the algorithm proposed by Atkin and Bernstein in [AB04].

Surprisingly, it is possible to enumerate all integer points between two nearby concentric circles in a more efficient manner. The key idea comes from a theorem of van der Corput and it has been applied to the algorithm of Atkin and Bernstein by Galway in his PhD thesis. Using this technique, he devised a "dissected" sieve based on Atkin-Bernstein's whose overhead cost is only O(Lim2^{1/3}). Using this algorithm, he could enumerate primes in small intervals for much larger values of Lim2.

At the time of writing, this dissected sieve idea is not used in any cryptographic application. Thus, we do not describe it in detail and refer the interested reader to Galway's thesis [Gal04].

4.2 Sieving for smooth composites

In the previous section, we used sieve algorithms to find prime numbers. However, another very frequent application of sieve algorithms in cryptography searches for composite numbers (resp. polynomials) with many small enough factors, which are usually called smooth numbers (resp. polynomials). Applications are discussed in Chapter 15. In the present chapter, we simply focus on the sieve algorithms themselves and show the many possible variants, together with their performance issues. When sieving for composites instead of primes, there is a first, essential difference. In order to locate primes, a Boolean table is enough: indeed, whenever a number has a divisor, it is composite and can be removed. When locating smooth numbers, this is no longer true: indeed, we need to check that each number has enough divisors and thus we should be able to count higher than 1.

Another important difference between the sieve algorithms in Sections 4.1 and 4.1.3 and the algorithms in this section is that instead of simply searching for numbers in an interval, we may have an additional degree of freedom and search in higher dimensional sets. The most frequent case is to consider


two-dimensional spaces. In this context, there are several possible strategies to cover the sieving zone. We present two classical options: line and lattice sieving.

Finally, when using the algorithms presented in this section in applications, we need to find many smooth objects but it is usually not essential to retrieve all of them. We can overlook some without any problems as long as we find many quickly enough.

Due to these differences, to optimize the sieve algorithms in this section, we cannot follow the same approaches as before and we need to develop specific strategies.

4.2.1 General setting

The most frequent case for sieve algorithms, especially in index calculus methods (see Chapter 15), is to consider algebraic objects of the form a + bα where α is fixed and a and b vary. Without going into the mathematical details, which are given in Chapter 15, we need to explain what "smoothness" means in this context. In fact, there are two different mathematical settings to consider. The varying elements a and b may be either numbers (integers) or univariate polynomials over a small finite field. The two settings are very similar from a high level point-of-view, but the implementation details vary a lot between the two. Moreover, the mathematical language used to describe both cases is slightly different. For now, we give an overall description using the language corresponding to the case of numbers. Our first remark is that we only consider pairs (a, b) where a and b are coprime. Otherwise, we would be considering multiples of previously seen objects. In the context of index calculus, this would result in equivalent equations and would not be of any help. Given such a pair (a, b), we need to consider a + bα modulo a prime p and test whether we find 0 or not. To do this, we need to define the image of α modulo p. In general, this image need not be unique: for some primes α has a single image α_p^{(1)}, for some it has two, α_p^{(1)} and α_p^{(2)}, and so on.

There are also primes with no image at all for Ī±, these primes can never

occur among the factors of a + bĪ±. For both theoretical and practical reasons,

it is best to consider each image of Ī± for the same p as a diļ¬erent factor. With

(i) (i)

this convention, we say that a pair (p, Ī±p ) divides a + bĪ± when a + bĪ±p ā” 0

From a geometric point-of-view, the pairs (a, b) associated to objects
of the form a + bα divisible by (p, α_p^(i)) are nicely distributed in the Euclidean
plane. More precisely, they form a 2-dimensional lattice. This is easily shown:
let a + bα be divisible by (p, α_p^(i)); then a + bα_p^(i) ≡ 0 (mod p). Thus there
exists an integer λ such that a = −bα_p^(i) + λp. This implies that:

\begin{pmatrix} a \\ b \end{pmatrix} = b \begin{pmatrix} -\alpha_p^{(i)} \\ 1 \end{pmatrix} + \lambda \begin{pmatrix} p \\ 0 \end{pmatrix}. \tag{4.2}

Thus, a point associated to an element a + bα divisible by (p, α_p^(i)) belongs to
the 2-dimensional lattice generated by the two vectors appearing in the right-hand
side of Equation (4.2). Conversely, we can check that both generating
vectors satisfy the divisibility equation, since −α_p^(i) + 1 · α_p^(i) ≡ 0 (mod p) and
p + 0 · α_p^(i) ≡ 0 (mod p). By linearity, any integer linear combination of the
two vectors also satisfies the divisibility equation. As a consequence, the set
of points associated to elements a + bα divisible by (p, α_p^(i)) is precisely this
2-dimensional lattice.
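
As a quick sanity check of Equation (4.2), the following toy C program uses the pair (11, 5) of Figure 4.3 below and recovers, for every divisible pair (a, b) in a small box, the coefficients b and λ of the decomposition:

    #include <stdio.h>

    int main(void)
    {
        const long p = 11, ap = 5;   /* the pair (11, 5) used in Figure 4.3 */

        for (long b = -3; b <= 3; b++)
            for (long a = -15; a <= 15; a++)
                if (((a + b * ap) % p + p) % p == 0) {
                    /* a = -b*ap + lambda*p, so lambda is exact here. */
                    long lambda = (a + b * ap) / p;
                    printf("(%3ld,%3ld) = %ld*(-5,1) + %ld*(11,0)\n",
                           a, b, b, lambda);
                }
        return 0;
    }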

[Plot of lattice points; horizontal axis a, vertical axis b]
Figure 4.3: Set of points a + bα divisible by (11, 5)

To illustrate this geometric point-of-view, we choose the following simple
example: let us take the prime p = 11 and an image for α: α_11^(1) = 5. In this
case, the set of points divisible by (11, 5) is represented by Figure 4.3. In this
example, we see that the generating vectors we gave for this lattice are not
the best possible choice. The two dotted vectors in Figure 4.3 give a better
description of the lattice. Since the lattice is two-dimensional, it is not necessary
to use the general purpose algorithms described in Chapter 10 to obtain
this good representation. We can either use Gauss reduction as described in
Chapter 10 or adapt a continued fraction algorithm to our purpose. For now,
we assume when necessary that each pair (p, α_p^(i)) is given together with a
good basis describing the corresponding lattice.
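
To illustrate, here is a minimal C sketch of Gauss reduction for such a lattice; the names are ours and the floating-point rounding is only adequate for small toy inputs. On the initial basis of the (11, 5) example it outputs the short basis (1, 2), (−4, 3):

    #include <stdio.h>
    #include <math.h>

    typedef struct { long x, y; } vec2;

    static long dot(vec2 a, vec2 b) { return a.x * b.x + a.y * b.y; }

    /* Gauss (Lagrange) reduction of a basis (u, v) of a planar lattice:
       repeatedly subtract the nearest-integer multiple of the shorter
       vector from the longer one, swapping as needed. */
    static void gauss_reduce(vec2 *u, vec2 *v)
    {
        if (dot(*u, *u) > dot(*v, *v)) { vec2 t = *u; *u = *v; *v = t; }
        for (;;) {
            long m = (long)llround((double)dot(*u, *v) / (double)dot(*u, *u));
            vec2 w = { v->x - m * u->x, v->y - m * u->y };
            if (dot(w, w) >= dot(*u, *u)) { *v = w; return; }
            *v = *u;
            *u = w;
        }
    }

    int main(void)
    {
        /* Initial basis (-alpha_p, 1), (p, 0) for the (11, 5) example. */
        vec2 u = { -5, 1 }, v = { 11, 0 };
        gauss_reduce(&u, &v);
        printf("reduced basis: (%ld, %ld) and (%ld, %ld)\n", u.x, u.y, v.x, v.y);
        return 0;
    }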

Adaptation to polynomials We mentioned at the beginning of the present
section that we adopt here the language of numbers. In the case of polynomials,
the necessary adaptations are quite natural. When considering a + bα,
a and b become univariate polynomials over a finite field, say a(t) and b(t).
The primes p are replaced by irreducible polynomials p(t) over the considered
finite field. The images α_p also become polynomials and we say that a pair
(p(t), α_p^(i)(t)) divides a(t) + b(t)α when a(t) + b(t)α_p^(i)(t) is a multiple of p(t)
over the finite field.

4.2.1.1 Smoothness

Now that we know how to find divisors of an element a + bα, we are ready
to define the notion of smoothness. We start from a smoothness basis B
containing all pairs (p, α_p^(i)) with p ≤ B, where B is called the smoothness
bound. We say that a + bα is B-smooth if and only if all its divisors belong to
B. To detect a smooth element, we check that the product of the primes occurring
in its divisors is near a number called the norm of a + bα, which roughly gives
the size of a + bα. This notion of norm is defined precisely in Chapter 15. For
the rest of this chapter, we do not need an exact definition and simply need
to know that a number is B-smooth if and only if its norm is a product of
prime numbers below B.
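
Expressed in code, this criterion is a simple trial-division loop. The following C sketch (with illustrative values) tests whether a norm is B-smooth:

    #include <stdio.h>

    /* Return 1 if norm is a product of primes below B, 0 otherwise. */
    static int is_smooth(unsigned long norm, unsigned long B)
    {
        for (unsigned long p = 2; p < B && norm > 1; p++)
            while (norm % p == 0)   /* a composite p can no longer divide
                                       norm here, so no primality test */
                norm /= p;
        return norm == 1;
    }

    int main(void)
    {
        /* 720720 = 2^4 * 3^2 * 5 * 7 * 11 * 13 is 20-smooth. */
        printf("%d\n", is_smooth(720720UL, 20));
        /* 720721 has no prime factor below 20. */
        printf("%d\n", is_smooth(720721UL, 20));
        return 0;
    }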

Without going into details, let us mention an additional complication: the
notion of divisibility we presented earlier is also associated with a notion of
multiplicity, i.e., a pair (a, b) may be divisible by (p, α_p^(i)) more than once.
We choose to ignore this point here, even though it matters for the algorithms
in Chapter 15, since the notion of multiplicity is usually overlooked by sieving
algorithms.

This means that when computing the product of the primes occurring in the
divisors of a + bα, we do not correctly account for the contribution of multiple
factors. Thus, there is no guarantee that the product is equal to the norm,
but on average we expect it to be close to it. Indeed, we expect multiple
factors to be of small size and the overall multiple contribution to remain
small. Of course, we could also try to account for divisors that occur multiple
times; however, the additional computational cost makes this approach much
less practical. Note that in practice, instead of computing the above product,
it is more convenient to compute an additive equivalent by adding the
logarithms of the primes occurring in divisors. An even simpler alternative is
to count the number of different divisors: if it is high enough, we expect
a + bα to be smooth quite often. Of course, using these simplifications, i.e.,
ignoring multiplicities and counting only the number of divisors, we may well
miss some smooth elements. However, since we only need to find many, not
all, this is not a real problem. A very important parameter when detecting
smooth candidates is the detection threshold. It should be carefully chosen by
balancing the rate of false candidates against the rate of good candidates being
overlooked.

4.2.1.2 Basic lattice algorithm

With all these elements in mind, we are now ready to give our first algorithm
to sieve for smooth elements a + bα with |a| and |b| smaller than sieve limits
Sa and Sb. As said before, we are only interested in pairs (a, b) where a and
b are coprime. Moreover, we may remark that it is useless to consider both
a + bα and its opposite. As a consequence, we may assume that a ≥ 0. We
need to explore as sieving zone a rectangle of Sa by 2Sb − 1 points. Using a
direct approach to explore the whole sieving zone results in Algorithm 4.3.

Algorithm 4.3 Two-dimensional sieving for smooth numbers
Require: Smoothness basis B
Require: Sieve limits Sa and Sb
Require: Expected value LogVal for smooth candidates
  Create a two-dimensional array LogSum indexed by [0 … Sa − 1] × [−(Sb − 1) … Sb − 1], initialized to zero
  for all P = (p, α_p^(i)) in B do
    for all (a, b) ∈ [0 … Sa − 1] × [−(Sb − 1) … Sb − 1] with a + bα divisible by P do
      Add log p to LogSum[a, b]
    end for
  end for
  for all (a, b) ∈ [0 … Sa − 1] × [−(Sb − 1) … Sb − 1] do
    if LogSum[a, b] ≥ LogVal then
      Check whether a + bα is really smooth
      If so, Display((a, b), 'corresponds to a smooth value')
    end if
  end for
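
To make this concrete, here is a compact C sketch of Algorithm 4.3; the smoothness basis, sieve limits and threshold are made-up toy values, and the divisible pairs are walked line by line through the congruence a ≡ −bα_p (mod p):

    #include <stdio.h>
    #include <string.h>
    #include <math.h>

    #define SA 32                 /* 0 <= a <= SA - 1 */
    #define SB 16                 /* -(SB - 1) <= b <= SB - 1 */

    /* Toy smoothness basis (p, alpha_p): illustrative values only. */
    static const long P[]  = { 3, 5, 7, 11, 13 };
    static const long AP[] = { 1, 3, 4,  5,  6 };
    #define NB 5

    static double logsum[SA][2 * SB - 1];

    int main(void)
    {
        memset(logsum, 0, sizeof(logsum));

        /* Sieving phase: for each basis element, add log p on every (a, b)
           with a + b*alpha_p divisible by p.  Real implementations skip the
           smallest primes here and recover them later by trial division. */
        for (int i = 0; i < NB; i++) {
            long p = P[i], ap = AP[i];
            for (long b = -(SB - 1); b <= SB - 1; b++) {
                long a = ((-b * ap) % p + p) % p;  /* first a >= 0 on this line */
                for (; a < SA; a += p)
                    logsum[a][b + SB - 1] += log((double)p);
            }
        }

        /* Detection phase: keep candidates whose log-sum reaches the
           threshold; they must still be checked for real smoothness. */
        const double LOGVAL = 5.0;                 /* made-up threshold */
        for (long a = 0; a < SA; a++)
            for (long b = -(SB - 1); b <= SB - 1; b++)
                if (logsum[a][b + SB - 1] >= LOGVAL)
                    printf("candidate (%ld, %ld)\n", a, b);
        return 0;
    }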

As stated, this algorithm is not complete; we need to know how to walk
over multiples of (p, α_p^(i)) and also how to check whether a candidate a + bα
is really smooth or not. Moreover, even in this basic form, some potential
improvements are already worth considering: Can we expect to go faster than
this basic algorithm? Can we use less memory and work with a smaller array?
In this section, we now give general answers to these questions. More detailed
and specific answers are discussed in Sections 4.2.1.3, 4.2.1.4 and 4.2.1.5.

4.2.1.2.1 Walking the multiples. Given an element (p, α_p^(i)) of the
smoothness basis B, we need a good representation of the corresponding lattice,
given by two basis vectors u and v. After that, we search for the lattice elements
within the total sieve area. This is a reasonably simple task, performed by
starting from a known lattice element, usually the origin, and by
adding multiples of the two basis vectors in order to find new points. This is
usually done by nesting two loops: the inner loop goes through the sieve area
by adding multiples of one basis vector, say u, thus drawing a line; the outer
loop adds multiples of the other basis vector v. This basic idea needs to be
adapted to each specific setting and this can be done quite efficiently for all
of our special cases.
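
A minimal C sketch of this nested loop follows; the basis and bounds are ours, and the crude coefficient range with a per-point membership test stands in for the exact per-line bounds a real sieve would compute:

    #include <stdio.h>

    typedef struct { long x, y; } vec2;

    /* Enumerate lattice points i*u + j*v inside the sieve rectangle
       0 <= a < sa, |b| <= sb - 1.  RANGE is a deliberately crude bound
       on the coefficients, good enough for a toy demonstration. */
    static void walk_lattice(vec2 u, vec2 v, long sa, long sb)
    {
        const long RANGE = 64;
        for (long j = -RANGE; j <= RANGE; j++)        /* outer loop: steps of v */
            for (long i = -RANGE; i <= RANGE; i++) {  /* inner loop: a line of u */
                long a = i * u.x + j * v.x;
                long b = i * u.y + j * v.y;
                if (a >= 0 && a < sa && b > -sb && b < sb)
                    printf("(%ld, %ld)\n", a, b);
            }
    }

    int main(void)
    {
        /* Reduced basis of the (11, 5) lattice from Figure 4.3. */
        vec2 u = { 1, 2 }, v = { -4, 3 };
        walk_lattice(u, v, 16, 8);
        return 0;
    }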

4.2.1.2.2 Speeding up the approach. One basic technique to speed
up the sieve algorithm is to remark that an element of B associated with a
prime p corresponds to a walk over a 1/p fraction of the sieving zone. As
a consequence, elements associated to small primes contribute more to the
runtime. Moreover, these elements offer a low added value: knowing that
the norm of an element is divisible by a small prime is a smaller step toward
smoothness. To overcome these two drawbacks, one classical technique is to
remove from the sieve algorithm all elements of the smoothness basis below a
given bound.

4.2.1.2.3 Checking for smoothness. The easiest way to check for smoothness
is simply to compute the norm of a + bα and then to factor this norm
using a fast algorithm. At this point, the cases of numbers and polynomials
are quite different. Indeed, as recalled in Chapter 2, factoring polynomials
can be done in polynomial time by efficient and practical algorithms; thus
this approach seems to be a good start in the case of polynomials. In the case
of numbers, among known algorithms, the one best suited for factoring numbers
with many small factors is the elliptic curve factoring method (ECM, see
Chapter 14). As a consequence, in this case, checking smoothness seems to
be much more difficult and other methods need to be devised to remove this
practical obstruction. Note that from a complexity-theoretic point-of-view,
using ECM is not a real problem and does not change the overall complexity.
However, in practice, getting rid of ECM, wherever it is possible, greatly
improves the efficiency. Interestingly, these methods can even be used in the
case of polynomials to speed things up a little.

The best approach is to view the problem of checking for smoothness as a
specific factoring task. We need to factor a list of candidates in order to keep
the smooth ones. As with ordinary factoring tasks, it is a good idea to start
by using trial division, thus checking, for each element of the smoothness basis
associated to a small prime, whether it divides each candidate. It is essential
to follow this approach for the very small primes that were omitted from
the initial sieving phase. When a candidate is divisible by such an element,
we divide its (precomputed) norm by the corresponding prime, taking the
multiplicity into account. As a consequence, after going through all the very
small primes, we are left, for each candidate, with a reduced norm without very
small factors. Once this is done, some candidates have a large reduced norm
and others a smaller one. Clearly, candidates with a large reduced norm have
a smaller probability of being smooth. As a consequence, to speed things up,
it is possible to filter the candidates, removing those with a large norm.
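
A minimal C sketch of this trial division and early-abort filtering follows; the candidate norms, prime list and cutoff are all made-up values:

    #include <stdio.h>

    /* Divide all occurrences of the given small primes out of n,
       multiplicities included, and return the reduced norm. */
    static unsigned long reduce_norm(unsigned long n,
                                     const unsigned long *primes, int np)
    {
        for (int i = 0; i < np; i++)
            while (n % primes[i] == 0)
                n /= primes[i];
        return n;
    }

    int main(void)
    {
        const unsigned long small[] = { 2, 3, 5, 7 };  /* omitted from the sieve */
        const unsigned long CUTOFF = 100000;           /* made-up filter bound */
        const unsigned long norms[] = { 720720UL, 1299709UL, 9699690UL };

        for (int i = 0; i < 3; i++) {
            unsigned long r = reduce_norm(norms[i], small, 4);
            if (r > CUTOFF)
                printf("%lu: aborted early, reduced norm %lu too large\n",
                       norms[i], r);
            else
                printf("%lu: kept, reduced norm %lu\n", norms[i], r);
        }
        return 0;
    }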

After that, some additional trial division tests can be performed for slightly larger
primes. This idea of filtering, i.e., of aborting early for bad candidates, was
