ńņš. 5 |

32

Fig. 1. Overall structure of the AES module

ā“ The Key Unit: The key unit serves two main purposes: the storage of

cipher keys and the calculation of the round keys. To save die size, the

S-Boxes of the data unit are reused to perform the key expansion. In the

presented architecture, this reuse is possible for any key size without loss of

performance.

ā“ The CBC Unit: The CBC unit of the AES module implements the CBC

mode without any negative inļ¬‚uence on the overall performance of the AES

module.

In the presented architecture, a 128-bit block of data is encrypted as follows.

First, a cipher key needs to be loaded via the AMBA APB interface into the key

unit. Once a key is loaded, it can be used for an arbitrary number of encryptions

and decryptions. After the loading of the cipher key, the ļ¬rst 128-bit block of

data is transferred via the interface and the CBC unit into the data unit. The

data unit then iteratively performs the number of AES rounds that are required

for the used key size.

In each round, the key unit provides the corresponding round key to the data

unit. To calculate these round keys, the key unit uses the S-Boxes of the data

unit during a clock cycle in which they are not required by the data unit. After

the calculation of the AES rounds, the encryption result is passed in 32-bit words

to the interface via the CBC unit. Decryptions are computed in a very similar

way. In this case, the data unit performs the inverse AES transformations in

reversed order and also the key unit provides the round keys in reversed order.

The remainder of this section presents the details of the data and the key unit.

2.1 The Data Unit

The data unit is the biggest and the most important component of the AES

architecture. It stores the current 128-bit data block of an encryption or de-

Eļ¬cient AES Implementations on ASICs and FPGAs 101

Key

Out In

Cell 00 Cell 01 Cell 02 Cell 03

Key

Out

Cell 13

Cell 10 Cell 11 Cell 12 In

Key

Out In

Cell 23

Cell 20 Cell 21 Cell 22

Out

Cell 30 Cell 31 Cell 32 In

Cell 33

Key

S-Box

S-Box S-Box S-Box

S-Box

Fig. 2. The architecture of the standard data unit

cryption (referred to as the āStateā) and is capable of performing any number

and type of en-/decryption rounds on this State. Consequently, all four AES

transformations (SubBytes, ShiftRows, MixColumns and AddRoundKey) and

the corresponding inverse transformations are implemented within the data unit.

For the AddRoundKey transformation, a corresponding round key needs to be

provided by the key unit.

Figure 2 shows the standard version of the data unit. It consists of sixteen

so-called data cells and four S-Boxes. An S-Box of the architecture is a circuit

capable of performing the SubBytes and the inverse SubBytes transformation for

an 8-bit input. The data cells store eight bits per cell and perform all other AES

transformations and the corresponding inverses, when connected appropriately.

In full-custom designs, inputs and outputs of the data cells can be deļ¬ned such

that connection by abutment is possible when they are placed next to another.

Another distinguishing feature of the presented architecture is the fact that

the combinational paths are relatively short and, more important, very balanced.

The commonly used approach to implement AES in hardware is to store the 128-

bit State in a register and to perform the AES transformations (except for the

ShiftRows transformation) column by column. Therefore, to perform a normal

102 N. Pramstaller et al.

AES encryption round, ļ¬rst the ShiftRows transformation is done in one clock

cycle. Then, the remaining transformations of an AES round are applied column

by column, whereby all transformations for one column are usually done within

one clock cycle.

The problem of this approach is that the combinational path to perform

a SubBytes, a MixColumn and an AddRoundKey transformation in one clock

cycle is very long. Additionally, the implementation of the ShiftRows transfor-

mation causes a signiļ¬cant wiring overhead. The data unit, presented in this

section, solves both problems. It performs AES encryptions and decryptions in

the following way:

To load a data block, the input data is shifted column by column from the

right side (see Fig. 2) into the data cells. The inputs labelled āInā are connected

via the CBC unit to the interface. The initial AddRoundKey transformation is

done in the fourth clock cycle at the same time as the last column is loaded.

To compute a normal AES round, the registers are rotated vertically to per-

form the (Inv)SubBytes and the (Inv)ShiftRows transformation row by row. In

our design, we use an S-Box with one pipeline stage. Therefore, the (Inv)SubBytes

and the (Inv)ShiftRows transformations can be applied to all sixteen bytes of

the State in ļ¬ve clock cycles.

In the sixth clock cycle of a normal AES round, the (Inv)MixColumns and

the AddRoundKey transformations are performed by all data cells in parallel.

Since the S-Boxes are not used by the data unit during the sixth clock cycle,

they can be utilized by the key unit to perform the key expansion for the next

round key.

This way, the required number of encryption or decryption rounds can be

executed by the data unit and the key unit until the 128-bit result is ļ¬nally

stored in the registers of the data unit. This result is then shifted column by

column to the left (to the interface of the AES module). At the same time, a

new input State can be loaded.

Using the standard data unit, the minimal number of clock cycles that are

required to perform an AES-128 encryption or decryption is 65. Four clock cycles

are required for the I/O of the data unit, 54 clock cycles are required to perform

the nine normal AES rounds and seven are required for the ļ¬nal round.

The following two subsections present the architecture of the S-Boxes and

the data cells.

S-Boxes. In hardware implementations, the SubBytes transformation and its

inverse are the most expensive AES transformations. For the presented AES

module, a pipelined (one stage) implementation of the S-Box as described in [18]

is used. The main idea of this implementation is to build an eļ¬cient combina-

tional circuit for the S-Box, which is based on the fact that GF (28 ) can be seen

as quadratic extension of the ļ¬eld GF (24 ). A pipelined version of the S-Box is

used to accomplish that the combinational paths in the architecture are balanced

(i.e. the paths of the S-Boxes and those of a MixColumns-and-AddRoundKey

step are roughly the same).

Eļ¬cient AES Implementations on ASICs and FPGAs 103

Data Cells. The design of the data cells is crucial for the overall architecture

of the data unit. The data cells serve as storage elements of the AES State and

perform the (Inv)MixColumns and the AddRoundKey transformation. Besides

some input selection circuit, each data cell consists of eight ļ¬‚ip-ļ¬‚ops, a multi-

plier [17] for the MixColums transformation (which can also be omitted in order

to scale the design) and XOR gates for the AddRoundKey transformation.

2.2 The Key Unit

The key unit is used to store keys and to calculate the key expansion function.

Due to the fact that the AES is standardized for 128, 192 and 256-bit keys, the

interface between the key unit and the data unit is designed such that the key

expansion for several diļ¬erent key sizes can be implemented on the same chip.

The key unit used in our design stores the key loaded via the interface and is

capable of calculating all round keys for encryption and decryption iteratively.

Since the data unit does not perform any S-Box lookups while the MixColumns

and AddRoundKey transformations are executed, the S-Boxes of the data unit

are reused by the key unit during this clock cycle. Details about the key unit

can be found in [8].

2.3 Performance of the ASIC AES Implementation

As mentioned before, the presented AES module is built up very regular and is

highly scalable. Three diļ¬erent scaled versions of the module have been imple-

mented and tested. They are named āstandard versionā, which was described

in the previous sections, āminimum versionā, which is the smallest, but slowest

one, and the āhigh-performance versionā, which is the fastest, but most area

intensive version.

As shown in Fig. 2, the data unit of the standard version consists of 16

data cells (including 16 MixColumns multipliers) and four S-Boxes. The en-

/decryption of an 128-bit data block requires 65 clock cycles. By using 16 S-Boxes

instead of four, the performance can be increased. The S-Box lookup can be done

for all 128 bits of the State in parallel. With this conļ¬guration (high-performance

version), the AES module requires only 35 clock cycles to en-/decrypt a 128-bit

data block. In the minimum version of the AES module, only four S-Boxes and

four MixColumns multipliers are used. Only the four āleftmostā data cells con-

tain MixColumns multipliers. In the other data cells the multipliers are omitted.

Here, 92 clock cycles are needed to process a 128-bit data block.

In this section, we give a comparison of the three diļ¬erent scaled versions of

the AES module in terms of performance and area. Additionally, a comparison

to related work is given.

2.4 Performance of the Presented ASIC Design

The three versions of AES-128 have been implemented in VHDL and have been

synthesized for a 0.6 Āµm CMOS process. Table 1 shows the complexity in gate

104 N. Pramstaller et al.

Table 1. Complexity of AES components in GE

Component Complexity [GE]

S-Box 392

Multiplier 212

Data cell (without Multiplier) 87

Key generator 1,633

Key store 691

AMBA Bus Interface 267

CBC Register 1,599

Table 2. Complexity of the AES-128 modules

Component # Minimum # Standard # High Perf.

S-Boxes 4 1,568 4 1,568 16 6,272

Multipliers 4 848 16 3,392 16 3,392

D. cells without mult. 16 1,392 16 1,392 16 1,392

Multiplexors 96 224 192 384 224 374

Data unit 4,032 6,736 11,430

Key generator 1 1,633 1 1,633 1 1,633

Key store 1 691 1 691 1 691

Key unit 2,324 2,324 2,324

AMBA + CBC 1 1,866 1 1,866 1 1,866

Control logic 319 279 230

Additional 2,185 2,145 2,096

Total 8,541 11,205 15,850

equivalents (GE) of each component used for the AES module. In Table 2, the

complexity of the three diļ¬erent modules is calculated by adding the size of the

components used for the diļ¬erent versions. In the ļ¬rst column for each version

the used number of components is given. The standard version of the module

has a complexity of 11,205 GE, whereas the minimum version needs 8,541 GE.

The high performance version requires 15,850 GE.

The high-performance module essentially consists of 12 S-Boxes more than

the standard module. This causes an increase of the complexity by 41%. On the

other hand, the minimum version consists of 12 MixColumns multipliers less than

the standard version and is therefore 24% smaller. The critical path of all three

designs is more or less the same and is determined by the delay of one pipeline

stage of the S-Boxā”the maximum frequency for the complete AES-128 modules

(including AMBA interface, CBC, and control logic) on the 0.6 Āµm technology

is about 50 MHz. In Table 3, a summary of the performance is shown.

The standard version needs 65, the high-performance version 35, and the

minimum version 92 clock cycles to perform an AES-128 encryption or decryp-

tion. In the high-performance version, the improvement of the throughput by

87% is paid by an increase of the complexity by 41%. In the minimum version,

the reduced complexity (-24%) accounts for a 29% loss in throughput.

Eļ¬cient AES Implementations on ASICs and FPGAs 105

Table 3. Summary of the performance of the diļ¬erent AES-128 modules

Clock Throughput@50 MHz Area

[M bps] [GE]

Version Cycles

Minimum 92 70 8,541

Standard 65 98 11,205

High perf. 35 183 15,850

2.5 Related Work

This subsection compares the presented architecture with the one proposed

in [13]. The design of Satoh et al. consists of 5,400 GE and was implemented

on a 0.11 Āµm technology. The design consists of a core data path and a key

generator. It does not include mechanisms for I/O, CBC registers or a key store.

Its maximum clock frequency is about 130 MHz. The design requires 54 cy-

cles to perform an encryption, which leads to a theoretical throughput (for the

four-S-Box version) of 311 Mbps.

For a comparison of complexity, the gate count for our design has to be

reduced by the additional components our design oļ¬ers (key store, CBC registers,

AMBA interface)ā”this leads to a gate count of about 8,600 GE for the standard

AES-128 module. This comparison is still not completely fair, since diļ¬erent

technologies are used for synthesis. The 0.11 Āµm technology used in [13] allows

smaller structures and oļ¬ers diļ¬erent leaf cells. It seems to be more extensive

than our technology, because similar components of the designs are claimed to

be smaller in the architecture in [13]. For example, the S-Box proposed by Satoh

et al. was reconstructed and synthesized with our technology. The result was a

15% bigger S-Box than used in our design, whereas the results with the 0.11 Āµm

technology in [13] show an S-Box that needs 25% less GE than our S-Box design

in the 0.6 Āµm technology.

The big diļ¬erence in the used technology also does not allow a reasonable

comparison of the maximum frequencies or the throughput. When comparing

the two proposed S-Boxes synthesized in our 0.6 Āµm technology in terms of

delay, our proposed S-Box is about 30% faster. An attempt to compare the

throughput of both designs starts with comparing the critical paths. In [13] the

critical path is very long: The SubBytes, the MixColumns and the AddRoundKey

transformation are done for one column within one clock cycle. Additionally, in

the same clock cycle the data passes the so-called selector function, which seems

to be another major cause of delay.

In our presented architecture, the critical path consists only of one pipeline

stage of an S-Box. This is approximately a third of the critical path of the

architecture presented in [13]. Using the same technology for synthesis of the

compared designs, we expect the maximum frequency of our module to be at

least three times higher than the maximum frequency stated in [13]. This leads

to a better overall performance.

106 N. Pramstaller et al.

3 FPGA Implementation of AES

Reconļ¬gurable devices like Field Programmable Logic Arrays (FPGA) gain more

and more importance in hardware, software and hardware/software co-designs.

In the beginning often seen only as devices for rapid prototyping, FPGAs are

increasingly used for ļ¬nal applications. One of the mostly used arguments for

using FPGAs rather than ASICs, is the reduced time-to-market and the cost

advantages of standard devices.

Due to the importance of reconļ¬gurable devices, numerous FPGA AES im-

plementations have been published within the last years. These implementations

mainly focus on high throughput rates [3, 9], and use techniques like loop un-

rolling and pipelining. They are able to report throughput rates up to 12,160

Mbps [3]. Applying such techniques leads to AES hardware implementations that

require a huge amount of FPGA resources that are only available for expensive

devices and cannot be implemented in low-end FPGAs.

In this section we present a new universal architecture that is supported by

several FPGA product families and can also be implemented using inexpensive

low-end FPGAs. It is the ļ¬rst known AES FPGA implementation that does

not require on-chip block RAMs. Furthermore, it implements the complete AES

encryption standard and features the Cipher Block Chaining (CBC) mode.

3.1 Related Work

Gaj et al. [3] published the fastest known FPGA implementation. For encryption

and decryption with 128-bit keys, a throughput of 12,160 Mbps on a Xilinx Vir-

tex XCV1000BG560-6 device is reported. McLoone et al. [9] achieve a through-

put of 6,956 Mbps for 128-bit keys only. They also presented encryption engines

for 192-bit or 256-bit keys with accordingly lower throughput. Their combined

encryption and decryption implementation can handle 128-bit keys and achieves

a throughput of 3,239 Mbps on a Xilinx Virtex-E XCV3200E-8CG1156 device.

The third implementation published by Dandalis et al. [5] also provides encryp-

tion and decryption for 128-bit keys. They achieve a throughput of 353 Mbps

on a Xilinx Virtex XCV1000BG560-6 device. Fischer et al. [6] published a non-

pipelined design supporting encryption and decryption for 128-bit keys. They

report a throughput of 451 Mbps of their fast conļ¬guration and 115 Mbps for

an economic conļ¬guration. A drawback of their design is the missing on-chip

round-key generation. Chodowiec et al. [2] presented an implementation for low-

end devices. Using only few resources they achieve a throughput ranging from

139 Mbps to 166 Mbps depending on the used FPGA device.

All implementations (except [6, 2]) use a considerable amount of hardware

resources. For instance, [9] requires 138 block RAMs for 256-bit keys. This de-

mands the use of expensive million-gate FPGA devices.

As shown above, most published hardware implementations focus on high

throughput rates and do not provide a non-parameterizable design to support the

complete AES standard. Furthermore, the high throughput implementations [3,

9] do not support the Cipher Block Chaining mode (CBC).

Eļ¬cient AES Implementations on ASICs and FPGAs 107

3.2 Architecture of the AES FPGA Implementation

This section describes the architecture of the AES co-processor. Starting with a

swift overview, we will present details to highlight some innovative improvements

that make it possible to present a resource-eļ¬cient AES co-processor suitable

for low-end FPGA devices.

Basic components of the AES co-processor as shown in Fig. 3 are the AMBA

APB interface, the data unit, the key unit, and the control unit. The key unit

calculates the KeyExpansion function. All round keys are pre-calculated and

stored in the key unit. Pre-calculated round keys allow fast en-/decryption of

diļ¬erent data blocks for the same cipher key because no additional KeyExpan-

sion is required. The data unit holds the State and performs all AES transfor-

mations: AddRoundKey, (Inv)SubBytes, (Inv)ShiftRows and (Inv)MixColumns.

When encryption or decryption has completed the ciphertext (plaintext in case

of decryption, respect.) is stored in the data unit. The control unit receives

commands from the AMBA interface and generates control signals for all other

modules. In addition to control round-key calculation, encryption and decryp-

tion, it also sequences data loading and unloading.

Data

Key

data in Unit

AMBA Unit

Interface

data out

(APB)

Control Unit

Fig. 3. Architecture of the AES co-processor

The architecture is similar to the architecture presented in Section 2. Diļ¬er-

ences are a modiļ¬ed State representation and a modiļ¬ed round-key calculation

scheme. Due to a non-pipelined approach, the same performance for all modes

of operations (ECB and CBC) is reached. Next we describe the AES data unit

and the AES State representation in detail.

Data Unit. The data unit stores the State, all intermediate results of the round

function applied to the State and the output data when encryption or decryption

has completed. The major diļ¬erence to all other published AES implementations

is the innovative State representation that consists of two States. One State

contains the actual State values and the other State stores newly calculated

values. Figure 4 depicts the two States, referred to as StateA and StateB. In

each cycle, 32 bits (one row or one column) of either StateA or StateB are

altered. Using a second State provides a lot of beneļ¬ts without the need of

additional recourses: (Inv)ShiftRows comes for free and no State transposition

between column and row operations is required.

Storage elements in FPGAs can be eļ¬ciently implemented by using syn-

chronous RAMs because the basic logic elements of FPGAs, called slices, can

108 N. Pramstaller et al.

0 (0,1)

(0,0) (0,2) (0,3)

StateA

1 (1,1)

(1,0) (1,2) (1,3)

2 (2,1)

(2,0) (2,2) (2,3)

address [2..0]

RAM 0

RAM 1

RAM 2

RAM 3

(3,1)

3 (3,0) (3,2) (3,3)

(4,1)

4 (4,0) (4,2) (4,3)

StateB

(5,1)

(5,0) (5,2) (5,3)

5

(6,1)

(6,0) (6,2) (6,3)

6

(7,1)

(7,0) (7,2) (7,3)

7

Fig. 4. The State-RAM

be conļ¬gured as 16 Ć— 1 bit synchronous RAM. Two slices (= one Conļ¬gurable

Logic Block - CLB) provide 16 Ć— 1 bit synchronous dual-port RAM function-

ality (see [20]). Dual-port RAMs allow concurrent reading and writing to the

RAM. Due to these technology features, the State-RAM as depicted in Fig. 4

is implemented as four slices of 8 Ć— 8 bit synchronous dual-port RAMs to allow

addressing the slices independently.

The data unit performs all transformations of the round function: (Inv)Shift-

Rows, (Inv)SubBytes, (Inv)MixColumns and AddRoundKey. AddRoundKey and

(Inv)MixColumns are applied to the State column by column, whereas (Inv)-

ShiftRows and (Inv)SubBytes are applied to the State row by row. Due to

the slice architecture of the RAM which holds the State, it is not possible to

read/write from/to the RAM column by column. Hence, a transposition of the

State is necessary if a row-oriented operation follows a column-oriented oper-

ation, or vice versa. Transposition would require a reorganization of the State

before further operations can be performed. By using two States, transposition

can be implemented by accordingly addressing the State-RAM. Furthermore,

(Inv)ShiftRows can be combined with transposing the State. As a consequence

of this, (Inv)ShiftRows and transposition come for free. In the sequel we describe

the memory organization and State transposition for encryption. The same ap-

proach can easily be modiļ¬ed for decryption.

When a row-oriented operation follows a column-oriented operation (or vice

versa), the State must be transposed. Combining row and column transfor-

mations minimizes the number of required transpositions: ShiftRows is com-

bined with SubBytes and AddRoundKey is combined with MixColumns. This

approach requires only one transposition per round. Encryption requires Sub-

Bytes followed by ShiftRows. Since ShiftRows does not aļ¬ect the byte values

and SubBytes is applied to each byte of the State individually, the order of

both operations does not matter. This fact eases the address generation for the

State-RAM.

For explaining the State transposition we consider the State as 4 Ć— 4 matrix:

i=0..3

S = (si,j )j=0..3 . The ShiftRows transformation described in [10] can then be

expressed as follows:

S = Shif tRows(S) = (si,jā’i mod 4 )i = 0..3 (1)

j = 0..3

Eļ¬cient AES Implementations on ASICs and FPGAs 109

+

0 0 0 0 0 0 0 0 0 1 2 3 0 1 2 3

1 1 1 1 1 1 1 1 4 5 6 7 4 5 6 7

StateA

1 1

2 2 2 2 2 2 2 2 2 2 2 2 8 9

0 1

3 3 3 3 3 3 3 5 3 3 3 3 3 3 3 3

SubBytes SubBytes

SubBytes SubBytes

4 4 4 4 4 4 4 4 4

4

5 5 5 5 5 5 5 5

5 5

StateB

6 6 6 6 6 6 6

6 6 6

7 7 7 7 7 7

7 7 7 7

Fig. 5. ShiftRows and SubBytes for encryption

If we replace the State by the transposed State, we obtain:

S T = Shif tRows(ST ) = (si+j mod 4,j )ji = 0..3

=0..3

(2)

With the result of (2) the addressing of the StateB-RAM can be determined: the

indices (i, j) must be substituted with (i + j mod 4, j). Due to the even number

of AES rounds for all key lengths, ShiftRows is always applied to StateB only.

Thus, the resulting index tuples can be directly mapped to the RAMs. The ļ¬rst

part of the tuple index speciļ¬es the RAM slice and the second part speciļ¬es the

RAM address. Since we operate on StateB, we must add an oļ¬set of 4 to the

index value to get the correct address. Figure 5 shows the transposition of the

State, including ShiftRows and SubBytes for encryption.

Implementation of (Inv)SubBytes and (Inv)MixColumns. The (Inv)-

SubBytes transformation is based on [18]. One diļ¬erence is that the byte inver-

sion in GF(28 ) is implemented by using a synchronous ROM. (Inv)MixColumns

is similar to the architecture presented in [17]. For further details refer to [18, 17].

Key Unit. An innovative aspect of our implementation is that the key unit

can handle 128-bit, 192-bit and 256-bit keys with minimal additional hardware

requirements. Supporting all key lengths increases the needed hardware resources

for the key unit by only 7.8%. The size of the key memory for 256-bit keys is the

same as for 128-bit keys. For 128-bit keys, the KeyExpansion function derives

44 32-bit round-key parts from the cipher-key. This requires a 64 Ć— 32 bit RAM.

256-bit keys produce 63 32-bit round-key parts ļ¬tting the 64 Ć— 32 bit RAM.

3.3 Performance of the FPGA AES Implementation

This section compares the proposed AES co-processor with the works referred

to in Section 3.1. In order to provide comparable results, we implemented our

co-processor on a Xilinx Virtex-E XCV1000EBG560-8 device.

110 N. Pramstaller et al.

Table 4. Hardware resources and throughput comparison

ECB mode

Work Device #CLB-slices #BRAM Throughput

[Mbps]

Gaj et al. [3] Xilinx XCV1000 12,600 80 12,160

McLoone et al. [9] (I) Xilinx XCV812E 2,222 100 6,956

McLoone et al. [9] (II) Xilinx XCV3200E 2,577 112 5,800

McLoone et al. [9] (III) Xilinx XCV3200E 2,995 138 5,000

McLoone et al. [9] (IV) Xilinx XCV3200E 7,576 102 3,239

Dandalis et al. [5] Xilinx XCV1000 5,673 ? 353

Fischer et al. [6] (I) FLEX 10KE200-1 2,530 24 451

Fischer et al. [6] (II) ACEX 1K50-1 1,213 10 115

Chodowiec et al. [2] Xilinx XC2S30-6 222 3 166

Our proposal 1,125 0 215

Xilinx XCV1000E

[9]: enc.: (I)AES-128, (II)AES-192, (III)AES-256, enc./dec.:(IV)AES-128

[6]: AES-128 enc./dec.: (I) fast conļ¬guration, (II) economic conļ¬guration

The performance results given in Table 4 are for the ECB mode. Most of these

implementations claiming high throughput rates will have similar performance

ļ¬gures when operating in the CBC mode. The CBC mode is strictly recom-

mended and commonly used for encrypting high-speed data streams (e.g. as it

is used for encrypting data transfers over networks) and hence, the above-listed

high throughput rates lose their signiļ¬cance.

As shown in Table 4 our implementation is the only one that does not require

any block RAMs. The presented AES co-processor supports the complete AES

standard and the CBC mode. Additionally, it is equipped with a 32-bit AMBA

APB interface that eases the integration with processors used in System-on-Chip

designs [1]. If we do not consider the CBC mode and the AMBA bus interface,

our approach is still comparable with the above-listed works but we would require

less hardware resources (-26 %).

Our implementation utilizes 9.16% of the available logic cells on a Xilinx

Virtex-E XCV1000EBG560-8 device. 90.8% of the logic resources and 100% of

the on-chip BRAMs can be used by other circuits like a LEON2 or an ARM

processor. For a stand-alone application a low-end FPGA (e.g. Xilinx SpartanII

XC2S100-6) is suļ¬cient for implementing the complete AES co-processorā” the

other approaches (except [2]) do not ļ¬t on a SpartanII device. The high through-

put designs do not support this ļ¬‚exibility and require expensive multi-million

gate FPGAs. Another important fact is that the other works do not provide an

en-/decryption engine that supports all deļ¬ned key lengths.

The maximum clock frequency on a XCV1000 FPGA is 161 MHz. At this

frequency, a throughput of 215 Mbps for AES-128, 180 Mbps for AES-192, and

156 Mbps for AES-256 is achieved for both ECB mode and CBC mode.

Eļ¬cient AES Implementations on ASICs and FPGAs 111

4 Conclusions

In this article we presented two designs of a compact AES co-processor, one

suitable for ASIC implementations, the other one suitable for FPGAs. Both

designs are able to implement the whole functionality of the AES standard:

encryption and decryption with all key lengths (128-bit, 192-bit, and 256-bit).

In addition to covering the complete AES standard they support the Cipher

Block Chaining mode CBC. The AES co-processors also have a standard 32-bit

interface (AMBA) that facilitates the integration in System-on-Chip designs.

Our ASIC implementation is very regular, which makes it well suited for full

custom designs, and highly scalable. By scaling, the ASIC AES module it can be

adapted for many diļ¬erent applications with diļ¬erent requirements. With this

architecture high performance (up to 198 Mbps) as well as low area requirements

(down to 8,500 GE) can be reached on a 0.6 Āµm technology.

For the FPGA implementation, we have shown that due to an innovative

State representation the complete AES co-processor can be implemented on

inexpensive low-end FPGA devices. An implementation on a Xilinx Virtex-E

device uses only 1,125 CLB-slices and no block RAMs. Our FPGA implemen-

tation reaches a throughput of 215 Mbps at a clock frequency of 161 MHz for

encryption and decryption.

References

1. ARM Limited. AMBA 2.0 Speciļ¬cation. "http://www.arm.com/armtech/", 2001.

2. P. Chodowiec and K. Gaj. Very Compact FPGA Implementation of the AES

Algorithm. In Workshop on Cryptographic Hardware and Embedded Systems ā“

CHES 2003, volume 2779 of Lecture Notes in Computer Science (LNCS), pages

319ā“333. Springer, 2003.

3. P. Chodowiec, P. Khuon, and K. Gaj. Fast Implementations of Secret-Key Block

Ciphers Using Mixed Inner- and Outer-Round Pipelining. In Symposium on Field

Programmable Gate Arrays ā“ FPGA 2001, pages 94ā“102. ACM Press, 2001.

4. J. Daemen and V. Rijmen. The Design of Rijndael. Springer-Verlag, 2002.

5. A. Dandalis, V. Prasanna, and J. Rolim. A Comparative Study of Per-

formance of AES Final Candidates Using FGPAs. "http://csrc.nist.gov/

CryptoToolkit/aes/round2/conf3/aes3agenda.html", 2000.

6. V. Fischer and M. DrutarovskĀ“. Two Methods of Rijndael Implementation in

y

Reconļ¬gurable Hardware. In Workshop on Cryptographic Hardware and Embedded

Systems ā“ CHES 2001, volume 2162 of Lecture Notes in Computer Science (LNCS),

pages 77ā“92. Springer-Verlag, 2001.

7. P.C. Kocher, J. Jaļ¬e, and B. Jun. Diļ¬erential Power Analysis. In Advances in

Cryptology ā“ CRYPTO 1999, volume 1666 of Lecture Notes in Computer Science,

pages 388ā“397. Springer-Verlag, 1999.

8. S. Mangard, M. Aigner, and S. Dominikus. A Highly Regular and Scalable AES

Hardware Architecture. In IEEE Transactions on Computers, volume 52, pages

483ā“491, April 2003.

112 N. Pramstaller et al.

9. M. McLoone and J.V. McCanny. High Performance Single-Chip FPGA Rijndael

Algorithm Implementations. In Workshop on Cryptographic Hardware and Em-

bedded Systems ā“ CHES 2001, volume 2162 of Lecture Notes in Computer Science

(LNCS), pages 65ā“76. Springer-Verlag, 2001.

10. National Institute of Standards and Technology. Federal Information

Processing Standard 197, The Advanced Encryption Standard (AES).

"http://csrc.nist.gov/publications/fips/fips197/fips-197.pdf", 2001.

11. N. Pramstaller, F.K. GĀØ rkaynak, S. Haene, H. Kaeslin, N. Felber, and W. Ficht-

u

ner. Towards an AES Crypto-chip Resistant to Diļ¬erential Power Analysis. In

Proccedings of ESSCIRC 2004, to appear, 2004.

12. A. Rudra, P.K. Dubey, C.S. Jutla, V. Kumar amd J.R. Rao, and Pankaj Ro-

hatgi. Eļ¬cient Rijndael Encryption Implementation with Composite Field Arith-

metic. In Workshop on Cryptographic Hardware and Embedded Systems ā“ CHES

2001, volume 2162 of Lecture Notes in Computer Science (LNCS), pages 171ā“184.

Springer-Verlag, 2001.

13. A. Satoh, S. Morioka, K. Takano, and S. Munetoh. A Compact Rijndael Hardware

Architecture with S-Box Optimization. In Advances in Cryptology ā“ ASIACRYPT

2001, volume 2248 of Lecture Notes in Computer Science (LNCS), pages 239ā“254.

Springer-Verlag, 2001.

14. K. Tiri, M. Akmal, and I. Verbauwhede. A Dynamic and Diļ¬erential CMOS Logic

with Signal Independent Power Consumption to Withstand Diļ¬erential Power

Analysis on Smart Cards. In Proceedings of ESSCIRC 2002, 2002.

15. K. Tiri and I. Verbauwhede. Securing Encryption Algorithms against DPA at the

Logic Level: Next Generation Smart Card Technology. In Workshop on Crypto-

graphic Hardware and Embedded Systems ā“ CHES 2003, volume 2779 of Lecture

Notes in Computer Science (LNCS), pages 125ā“136. Springer, 2003.

16. B. Weeks, M. Bean, T. Rozylowicz, and C. Ficke. Hardware Performance Sim-

ulations of Round 2 Advanced Encryption Standard Algorithms. "http://

csrc.nist.gov/encryption/aes/round2/NSA-AESfinalreport.pdf", 2000.

17. J. Wolkerstorfer. An ASIC implementation of the AES-MixColumn operation. In

Proceedings of Austrochip 2001, October 2001.

18. J. Wolkerstorfer, E. Oswald, and M. Lamberger. An ASIC implementation of the

AES S-Boxes. In Topics in Cryptology ā“ CT-RSA 2002, Proceedings of the RSA

Conference 2002, volume 1965 of Lecture Notes in Computer Science. Springer-

Verlag, February 2002.

19. S-Y Wu, S-C Lu, and C-S Laih. Design of AES Based on Dual Cipher and Com-

posite Field. In Topics in Cryptology ā“ CT-RSA 2004, Proceedings of the RSA

Conference 2004, volume 2964 of Lecture Notes in Computer Science. Springer-

Verlag, February 2004.

20. Xilinx Incorporated. Silicon Solutions ā” Virtex Series FPGAs. "http://

www.xilinx.com/products/".

Small Size, Low Power, Side Channel-Immune

AES Coprocessor: Design and Synthesis Results

Elena Trichina1 , Tymur Korkishko2 , and Kyung Hee Lee2

1

Department of Computer Science, University of Kuopio,

P.O.B. 1627, FIN-70211, Kuopio, Finland

2

Information security TG, i-Networking Lab, Information Security Group,

Samsung Advanced Institute of Technology, Korea

Abstract. When cryptosystems are being used in real life, hardware and

software implementations themselves present a fruitful ļ¬eld for attacks.

Side channel attacks exploit information such as time measurements,

power consumption, and electromagnetic emission that leaks from a de-

vice when it executes cryptographic applications. When leaked informa-

tion is correlated to a secret key, an adversary may be able to recover the

key by monitoring this information. This paper describes an AES copro-

cessor that provides complete protection against ļ¬rst-order diļ¬erential

power analysis by embedding a widely used software countermeasure that

decorrelates data being processed from the leaked information, so-called

data masking, at a hardware level.

1 Introduction

In applications such as smart cards hardware complexity and tamper resistance

are very important issues that directly aļ¬ect the cost and consumer acceptance

of such devices. A class of side channel attacks enables breaking cryptographic

algorithms by measuring timing characteristics [12], power consumption [11, 18],

and electromagnetic radiation [8, 23] of a smart card microprocessor when it runs

cryptographic applications.

Until recently, most of these attacks exploited some speciļ¬c features of soft-

ware implementations of cryptographic algorithms, and many countermeasures

were designed at a software level. For many applications, however, it is neces-

sary that cryptographic algorithms should be realized in hardware. Although not

many results have been published yet, it is prudent to suggest that cryptographic

hardware also leaks side channel information, and that alongside with general

tamper-resistant features cryptographic coprocessors should include countermea-

sures speciļ¬cally designed to protect them against side channel attacks.

One of the most powerful software techniques to counteract such attacks is

to mask all input and intermediate data values in order to de-correlate any in-

formation leaked through side channels from actual secret data being processed.

This work had been done when the author was with the Smart Card System Engi-

neering Business Unit, System LSI Division, Samsung Electronics Co. LTD., Korea.

H. Dobbertin, V. Rijmen, A. Sowa (Eds.): AES 2004, LNCS 3373, pp. 113ā“127, 2005.

c Springer-Verlag Berlin Heidelberg 2005

114 E. Trichina, T. Korkishko, and K.H. Lee

In [28] it had been shown how to apply data masking technique at the level

of micro operations such as logical AND, XOR, etc., and how to use these op-

erations as building blocks for implementation of inversion in composite ļ¬elds

directly on masked data. In this paper we use ideas from [28] in order to build a

practical coprocessor for an Advanced Encryption Standard algorithm [7] that

is immune to side channel attacks. First, we generalize multiplication on masked

data from a bitwise multiplication operation (i.e., logical AND) to multiplica-

tion in any GF (2n ) ļ¬eld. This generalization allows us to use standard libraries

while designing side channel attacks-resistant cryptographic hardware for appli-

cations that are based on binary ļ¬eld arithmetic. What is more important, our

method provides considerable savings in the gate count and power consumption

in comparison with [28].

The synthesis results for a āminimalistā architecture that implements a 16

clock/round version of the AES coprocessor with a ļ¬‚exible key size, show that

with 0.18 Āµm technology the performance of 4Mbps can be achieved with the

circuit comprising 20,506 gates clocked at 5MHz and requiring 1.07 ĀµA. This is

better than countermeasures such as dynamic and diļ¬erential CMOS logic [26]

or asynchronous circuits with dual rail logic [21] can oļ¬er. Also, our solution

has an advantage of using standard technologies, standard libraries, and well-

established design tools.

The rest of the paper is organized as follows. After a brief reminder of the AES

algorithm in Chapter 2, we describe principles of power analysis attacks. The

countermeasure consisting in masking all input and intermediate data with some

random values and diļ¬culties of its implementation for the AES algorithm are

discussed in Chapter 4, after which we suggest our solution to this problem. The

details of the DPA-resistant AES coprocessor architecture are given in Chapter

6. The paper is concluded with synthesis results and a brief comparison of our

design with other hardware solutions.

2 AES Reminder

The Advanced Encryption Standard [7] is a round-based symmetric block cipher.

The standard key size is 128 bits, but for some some applications 192 and 256-

bit keys must be supported as well. The round consists of four diļ¬erent opera-

tions, namely, SubBytes, ShiftRows, MixColumn, and AddRoundKey, that are

performed repeatedly in a certain sequence; each operation in a standard algorithm

maps a 128-bit input state into a 128-bit output state. The state is represented as

a 4 Ć— 4 matrix of bytes. The number of rounds depends on the key size. In the de-

cryption process, the inverse operations are executed in a slightly diļ¬erent order.

ShiftRows is a cyclic shift operation on each of four rows in a 4 Ć— 4-byte state

using 0 ā¼ 3 oļ¬sets. MixColumn treats 4-byte data blocks in each column of a

state as coeļ¬cients of a 4-term polynomial, and multiplies them modulo x4 + 1

with the ļ¬xed polynomial c(x) = {03}x3 + {01}x2 + {01}x + {02}. AddRound-

Key is a bit-wise XOR operation on 128-bit round keys and data. These three

operations are linear.

Small Size, Low Power, Side Channel-Immune AES Coprocessor 115

SubBytes is the main building block of the AES. It replaces each byte in a

state by its substitute in an Sbox that comprises a composition of two transfor-

mations:

ā“ First, each byte in a state is replaced with its reciprocal in the ļ¬nite ļ¬eld

GF (28 ), except that 0, which has no reciprocal, is replaced by itself. This is

the only non-linear function in the AES algorithm.

ā“ Then an aļ¬ne transformation f is applied. It consists of a bitwise matrix

multiply with some ļ¬xed 8 Ć— 8 binary matrix followed by XOR with the

hexadecimal number {63}.

The round key is computed in parallel to the round operation. It is derived from

the cipher key by means of key expansion and round key selection operations,

which are similar to those of the round operations, and also use Sboxes.

3 S-Box Architecture

There are many design trade-oļ¬s to be considered when implementing an S-

box in ASIC since the size, the speed, and the power consumption of an AES

coprocessor depends largely on the number and the style of implementation of

Sboxes [20]. It also turned out that this operation is the most diļ¬cult to protect

against side channel attacks.

To optimize the silicon area, a number of ļ¬‚exible ASIC solutions that use

similarities between encryption and decryption to share silicon were proposed

[14, 20], where SubBytes was implemented in two steps, as a combination of

inversion in the ļ¬eld and an aļ¬ne transformation f . While the aļ¬ne trans-

formations used for encryption and decryption are slightly diļ¬erent, the silicon

implementing inversion in GF (28 ) can be used for both, as shown in Fig. 1.

Therefore, an area- and power-eļ¬cient and secure implementation of inversion

may have a big impact on an overall design.

The most obvious solution is to use a look-up table for this operation [14].

It is fast and inexpensive in terms of power consumption [20]. There is a major

drawback, however. Namely, the size of silicon is about 1,700 gate equivalents

per table in 0.18Āµ technology. Considering that up to 20 such tables (including

4 tables for key scheduling) is required per round, this solution is hardly feasible

for co-processors intended for smart cards and other embedded systems.

Among various alternative approaches composite ļ¬eld inversion produces the

most compact AES implementations [24, 25, 30] which can be further optimized

)c+x(1-A

eniffA esr evnI

mr ofsn arT c+1-xA

1-

X

x y

elbaT eniffA

mr ofsn arT

ceD/cnE

Fig. 1. S-box implementation suitable for encryption and decryption

116 E. Trichina, T. Korkishko, and K.H. Lee

to minimize power consumption [20]. As a basis for our design we use fully

combinational logic implementation of inversion in composite ļ¬elds described

in [30].

Usually the ļ¬eld GF (28 ) is seen as an extension of GF (2) and therefore its

elements can be represented as bytes. However, GF (28 ) can also be seen as a

quadratic extension of GF (24 ); in this case an element a ā GF (28 ) is represented

as a linear polynomial aH x + aL , denoted [aH , aL ], with coeļ¬cients in GF (24 ).

This isomorphic representation is far better suited for hardware implementation

[22, 24, 20, 30].

The bijection from a ā GF (28 ) to a two-term polynomial [aH , aL ] is given

by the linear function map computed by means of XOR operations on bits of

a. The inverse transformation mapā’1 converts a two-term polynomial back to

element a ā GF (28 ), and is deļ¬ned in a similar way. For more details see [30].

All arithmetic operations applied to elements of GF (28 ) can also be computed

in a new representation. Two-term polynomials are added by addition of cor-

responding coeļ¬cients. Multiplication and inversion of a two term-polynomial

requires modular reduction to ensure that the result is a two-term polynomial

as well; the irreducible polynomial n(x) = x2 + {1}x + {e} whose coeļ¬cients are

chosen to optimize ļ¬nite ļ¬eld arithmetic can be used for this purpose.

Inversion of a two-term polynomial is deļ¬ned as (aH x+aL )ā—(aH x+aL )ā’1 =

{0}c + {1}. From this deļ¬nition the formulae for inversion can be derived:

(aH x + aL )ā’1 = (aH ā— d)x + (aH ā• aL ) ā— d

(1)

d = ((a2 ā— {e}) ā• (aH ā— aL ) ā• a2 )ā’1 .

H L

Fig. 2 depicts a block diagram of inversion in composite ļ¬eld GF ((24 )2 ). As

one can see, one addition, one squaring, one multiplication by a constant, three

general multiplications, and one inversion in GF (24 ) are necessary for such im-

plementation. All these operations can be realized in combinational logic. Only

general multiplication and inversion in GF (24 ) require to use both, AND and

XOR gates for their implementation; all other operations need only compositions

of XOR gates [30].

.lum tnatsnoC erauqS

)42(FG ni )42(FG ni

noisrevnI

2

X x X

Ha Hb

)42(FG ni

noitacilpitluM 1-

X

)42(FG ni

X X

La Lb

Fig. 2. Inversion in GF ((24 )2 )

Small Size, Low Power, Side Channel-Immune AES Coprocessor 117

4 Side Channel Attacks and Computations on Masked

Data

Basically, side-channel attacks work because there is a correlation between the

physical measurements taken during computations, such as power consump-

tion [11, 18], EMF radiation [8, 23], time of computations [12], and the internal

state of the processing device, which itself is related to a secret key.

4.1 Power Analysis Basics

Among side-channel attacks, a diļ¬erential power analysis (DPA) is the main

concern when implementing cryptographic algorithms in embedded devices be-

cause due to physical constraints an adequate shielding and power consumption

ļ¬ltering cannot be employed. Power analysis attacks use an all-pervasive fact

that, ultimately, all calculations performed by a digital device operate on logi-

cal ones and zeros; and in contemporary technology power consumption while

manipulating a logical one diļ¬ers from power consumption while manipulating

a logical zero.

To illustrate why the power analysis works, let us consider an example in

Fig. 3. The circuit, ļ¬rst appeared in [18], represents a component model useful

for understanding power consumption characteristics of Complementary Metal

Oxide Semiconductor (CMOS) technology.

V pin

cc

L bond

Other

Gates

Q

1

V

gate

C

Q load

2

V pin

ss

+ V scope

L

bond

C R

C 2 m

1

ā’

Fig. 3. Measuring power consumption of a smart card

The two most essential components of power consumption during the change

of a state of a CMOS gate are dynamic charge resp. discharge (appr. 85%) and

dynamic short circuit current (appr. 15%). This is sketched in Fig. 3 where the

output of each gate has a capacitive load consisting of the parasitic capacity of

the connected wires and gates of the following stages. An input transition re-

sults in an output transition, which discharges or charges this parasitic capacity,

causing a current to ļ¬‚ow to Vcc or to Vss .

118 E. Trichina, T. Korkishko, and K.H. Lee

For example, as Vgate changes from 0 to 5 volts, the transistors Q1 and Q2

are both conducting for a brief period causing current to ļ¬‚ow from Vcc to the

ground. Also, during this time the capacitor Cload will be discharged (or charged)

causing more (or less) current to ļ¬‚ow through the Vss pin. The current charges

and discharges capacitors Cload , C1 , and C2 , and ļ¬‚ows out of the smart card

through a bond wire that acts as an inductor Lbond .

Power dissipated by the circuit can be monitored by using a small resistor

Rm in series between the Vss pin and the ground (or, alternatively, between the

Vcc pin and the true source). Current moving through Rm creates a time varying

voltage that can be sampled by a digital oscilloscope.

The more circuit changes its state, the more power is dissipated. In a syn-

chronous design gates are clocked, which means that all gates change their state

at the same time. Information useful to an attacker is leaked because the amount

of current being drawn when the circuit is clocked is directly related to a change

of the state of Cload or the current drawn by other gates attached to Cload . Thus,

whenever the secret key data or data correlated to the secret key is manipulated,

the microprocessor can leak damaging information that can be observed at the

Vscope .

In a microprocessor each clock pulse causes many bit transitions to occur si-

multaneously. There are two types of information leakage that can be observed at

the Vscope : Hamming weight leakage and transition count leakage [18]. The Ham-

ming weight information leaks when the dominant source of current is caused by

discharging of the Cload . The transition current information can leak when the

dominant source of current is due to the switching of the gates that are driven

by the data bus. When the data bus changes state, many of the gates driven by

the bus will brieļ¬‚y conduct current. Thus, the more bits change states, the more

power is dissipated.

From this explanation we deduce that the power consumption of a circuit at

time t is the sum of power dissipated by all gates at this time. Of course, various

noise components must be considered as well [18]. It can be stated as the simple

power model

P (t) = f (g, t) + N (t),

where t denotes time, N (t) is a normally distributed random variable which

represents a noise component, and f (g, t) denotes power consumption of gate g

at time t.

The next step is to relate this model to statistics. If we consider function

f (g, t) as a random variable from an unknown probability distribution, then

according to the Central Limit Theorem, P (t) is normally distributed. In a DPA

attack, an attacker divides the power measurements in two or more diļ¬erent

sets and tries to compute the diļ¬erence between these sets in order to verify

the selection function, which relates the corresponding power measurements to

the hypothesis concerning the values of the target bits of the key. Only if the

hypothesis was correct, there will be some noticeable peaks in statistics.

For example, a selected bit b at the output of one Sbox of the ļ¬rst round of

the AES algorithm will depend on the known input message and 8 unknown bits

Small Size, Low Power, Side Channel-Immune AES Coprocessor 119

of the key. The correlation between power consumption and b can be computed

for all 256 values of 8 unknown bits of the key. The correlation is likely to be

maximal for the correct guess of the 8 bits of the key. Then an attack can be

repeated for the remaining Sboxes.

Data Masking and Inversion in GF (24 )

4.2

There are many strategies to combat side-channel attacks. On a hardware level,

the countermeasures usually include clock randomization, power consumption

randomization, current compensation, and various detectors of abnormal be-

havior. However, the eļ¬ect of these countermeasures can be reduced by various

signal processing and statistical techniques [6]. Software-based countermeasures

include introducing dummy instructions, randomization of the instruction exe-

cution sequence, balancing Hamming weights of the internal data, etc.

Data masking is one of the most powerful software countermeasures against

side channel attacks [5, 11]. The idea is simple: the message and the key are

masked with some random values at the beginning of computations, and there-

after everything is almost as usual. Of course, the value of the mask at the end

of some ļ¬xed step (e.g., at the end of a round or at the end of a linear part of

computations) must be known in order to re-establish the expected data value

at the end of the execution; we call this mask correction.

A traditional XOR operation is used for data masking; however, a mask is

arithmetic in GF (28 ). The operation is compatible with the AES structure ex-

cept for inversion in SubBytes, which is the only non-linear transformation. In

other words, to compute mask corrections for each of the linear transformations

in a round, we simply have to apply this transformation separately to masked

bytes and to corresponding masks.

Unfortunately, it turned out to be rather diļ¬cult to ļ¬nd an eļ¬cient and

secure solution for non-linear operations. The ļ¬rst attempt to transform masked

data between Boolean and arithmetic operations in a secure way [17] was shown

aā• x bā• y z

y

x

aā…b ā• z z

Fig. 4. Masked AND (MAND)

120 E. Trichina, T. Korkishko, and K.H. Lee

to be insuļ¬cient against DPA attacks. A more sound method for mask switching

[10] involves too much computational overhead.

To overcome this diļ¬culty, Akkar and Giraud [1] proposed transformed mask-

ing, where ļ¬rst an additive mask is replaced by a multiplicative mask in a series

of multiply and add operations, after which a normal inversion takes place, and

ļ¬nally, the transformation of a multiplicative mask into an additive mask is car-

ried out again. However, it was pointed out in [9] that a multiplicative mask

does not blind zero, and thus does not prevent a DPA attack.

In [28] it was noticed that for a fully combinational AES Sbox design, the

problem of āmasked inversionā can be eļ¬ectively reduced to the problem of

computing binary XOR and AND operations on masked bits a = a ā• x, Ė = b ā• y

b

Ė

and on bits of the mask x, y without ever revealing actual data bits a, b in

the process. For XOR computing the mask correction is trivial because a ā• Ė =

Ėb

(a ā• b) ā• (x ā• y).

Masked AND and the corresponding mask correction can be computed by

manipulating only masked data bits and the bits of the masks as follows 1 . Let

c = a Ā· b. Then

c ā• z = a Ā· b ā• z = (Ė Ā· Ė ā• (y Ā· a ā• (x Ā· Ė ā• (x Ā· y ā• z)))).

ab b

Ė (2)

This can be implemented as a cascade of logic gates as shown in Fig. 4.

To realize inversion in GF (28 ) on masked data, one would have to replace

each AND gate used by a multiplier and an inverter in GF (24 ) with the circuit

depicted in Fig.4, increasing more than fourfold a total amount of gates in the

Sbox combinational logic design.

Masked GF (2n ) Multiplier

4.3

Let us make an observation that equation (2) can be generalized to the equation

for āmasked multiplicationā™ in any ļ¬eld GF (2n ). If A, B, X, Y, Z ā GF (2n ), then

(Aā—B)ā•Z = [(Aā•X)ā—(Bā•Y )]ā•[X ā—(Bā•Y )]ā•[(Aā•X)ā—Y ]ā•[X ā—Y ]ā•Z. (3)

Hence, masked multiplication in GF (2n ) can be performed using conventional

multipliers with any architecture, and four additional XOR operations for mask

correction, as depicted in Fig. 5. An additional mask Z is also used to mask the

output product. Hence, this approach requires 4 ānormalā multipliers in GF (2n )

and 4 bitwise XOR operations.

In contrast, a straightforward application of the technique suggested in [28]

involves building masked multipliers by replacing every AND gate in a ānormalā

multiplier with a masked AND( MAND) circuit. For example, Fig.6 illustrates a

transformation of a conventional multiplier in GF (22 ) into a masked multiplier

using this approach.

1

To achieve balanced and independent intermediate results, the scheme is used a

freshly generated random bit z as a new mask.

Small Size, Low Power, Side Channel-Immune AES Coprocessor 121

A XB Y X Y Z

GF(2n) GF(2n) GF(2n) GF(2n)

XOR

XOR

XOR

XOR

MMUL

(A B) Z Z

Fig. 5. Building a masked multiplier in GF (2n ) from standard multipliers

a0 b0 a1 b1 a1 b0 a1 a0 b1

AND AND AND XOR

XOR AND

XOR

c0 c1

(a) Multiplier in GF(2^2)

a 1 a 0 b 1 x1 x 0 y 1 z3

aā™0 bā™ x0 y 0 z 0 aā™1 bā™ x1 y 1 z 1 aā™1 bā™0 x 1 y 0 z2

0 1

MAND XOR XOR

MAND MAND

MAND

XOR XOR XOR XOR

cā™1

cā™0 zā™ zā™

0 1

(b) Masked multiplier in GF(2^2)

Fig. 6. Transformation of a usual (a) multiplier into a masked (b) multiplier in GF (22 )

122 E. Trichina, T. Korkishko, and K.H. Lee

The implementation complexity of a masked multiplier can be calculated

knowing the complexity of a basic standard multiplier. For example, for the

popular Mastrovito multipliers in GF (2n ) [16], the space complexity expressed

in the number of AND and XOR gates is n2 and ā„ (n2 ā’ 1), respectively. Then

the complexity of the generalized masked multiplier when compared with the

original straightforward implementation constitutes 9 to 29 % improvement in

the number of XOR gates, as can be seen from the Table 1.

Table 1. Comparison of the space complexity of generalized and straightforward

masked multipliers in GF (2n )

GF Irreducible Mastrovito mult. Straight. masked Proposed mult. Advantage

n Polynomial AND XOR AND XOR AND XOR %

2 [2,1,0] 4 3 16 22 16 20 9.1

4 [4,1,0] 16 15 64 94 64 76 19.1

8 [8,5,3,2,0] 64 84 256 382 256 284 25.7

16 [16,11,6,5,0] 256 281 1024 1534 1024 1084 29.3

5 Secure AES Coprocessor

When manipulating masked data, all operations in a round, apart from Sub-

Bytes, require simple mask corrections in a form of analogous computations on

masks that are carried out in parallel with the main computation ļ¬‚ow. This can

be achieved simply by duplicating hardware for all transformations but Sub-

Bytes.

G xor M M WZ F

InvAffine and InvAffine and

Field Isom Field Isom

Field Isom Field Isom

P xor M

M

W

-1

Secure X Z

F

P-1 xor F

InvField Isom InvField Isom

InvField Isom InvField Isom

and Affine and Affine

Sbox(G) xor F F

Fig. 7. Masked S-box

Small Size, Low Power, Side Channel-Immune AES Coprocessor 123

Another solution is to pipeline computations on masked data and on masks,

which halves the throughput, and for which additional 128-bit registers are re-

quired for mask values. Since each 1-bit register needs an equivalent of 6-7 gates,

there is no visible advantage for pipelining.

The structure of the Masked Sbox is depicted in Fig. 7. The inverse ļ¬eld

isomorphism map ā’1 and aļ¬ne transformation f are both linear operations, and

they are merged to optimize the gate count; the same holds for map and f ā’1

in decryption [14, 20]. A duplicate data path on the right hand side computes a

mask correction.

In this ļ¬gure, the box āSecure X ā’1 ā represents inversion in GF ((24 )2 ) im-

plemented as was deļ¬ned in Eq. 1. M is a random mask, (G xor M) represents

a masked input, Z, W, F are new masks used to ārefreshā the masked data at

the end of each operation in GF (24 ) and at the end of inversion in GF ((24 )2 ).

The details are given in Fig. 8 where every SMul2n box is, in fact, a generalized

masked multiplier in GF (24 ). The box X ā’1 represents inversion in GF (24 ), and

is implemented in combinational logic in the same way as in [30], with every

AND being replaced with masked AND.

Altogether, 1.2K gate equivalent is required to implement one S-box which

is 25% better than the table lookup implementation.

FW M Z P xor M

8 4 8 8

4

ML MH

XOR

4 4

mA mB

Y

SMul

Z

x2g

XOR

2n

X

x2g

XOR

XOR

X-1

X X

mB mA mA mB

Y Y

MH

SMul SMul

2n 2n

Z Z

4 4

8 8

P-1 xor F

F

Fig. 8. Masked inverter in GF ((24 )2 )

124 E. Trichina, T. Korkishko, and K.H. Lee

6 Conclusion

The secure hardware AES module based on the described scheme has been fully

implemented in 0.18Āµm technology. As a balance between the throughput and

the gate count, 1 Sbox has been used for the main datapath, and one for the key

scheduling. Our implementation of MixColumn resembles the one reported in [29]

and exploits common subexpressions for MixColumn/InvMixColumn operations.

The general design ļ¬‚ow for a secure AES module is depicted in Fig. 9. First, a

Verilog model had been created and tested, after which Cadence Design System

Verilog-XL simulator was used to generate timing diagrams and to verify the

correctness of the design. When RTL code had been veriļ¬ed, the digital circuit

was synthesized with Synopsys Design Analyzer tool using Samsung 0.18 Āµm li-

braries. Power-compiler simulation data at 5 MHz were obtained with simulation

tool CubicWare.

The summary of the synthesis results are given in the table in Fig. 10. The

total gate count for the secure AES module (excluding I/O) with a ļ¬‚exible key

size is 20,06K, while for a standard 128-bit key the gate count can be reduced

to 16K. The power consumption is 2.0 mA in 0.18Āµm technology for a ļ¬‚exible

key size architecture, and 1.6 mA for a 128-bit key standard. With 0.13Āµm

technology, the power consumption can be reduced to 1.1 mA, which allows us

to use this module in applications such as GSM and ad-hoc networks. The secure

AES module has throughput 4 Mbps when operated at 5 MHz.

The described approach can be applied to other cryptographic coprocessors

that use arithmetic operations in Galois ļ¬elds. It provides comparable protection

as dynamic and diļ¬erential logic [26] and asynchronous dual rail circuits [21] at

the similar price in terms of the gate count and power consumption. Taking into

Phase 1: Functional simulation Phase 2: Logical synthesis

Initial

Test data

FIPS AES

specs

Logic

Test bench Digital circuit

synthesis

Verilog

Timing diagrams

(Synopsys

Verilog Design Analyzer)

model

simulator

(Cadence Design

Synthesis

System Verilog-XL)

options

Correct Synthesis report

Verilog model

No Samsung 0.18um

or specs

Correct results

standard cell library

produced?

Initiate next phase

Yes

Fig. 9. Hardware design ļ¬‚ow

Small Size, Low Power, Side Channel-Immune AES Coprocessor 125

Components Subcomponents Scalable key size 128 bit key size

Inverter in GF((2 4) 2)

Masked Sbox 1206 1206

Input -output processing 350 (180) 350 (180)

Subtotal for masked Sbox (for key scheduler) 1556 (1386) 1556 (1386)

Masked datapath Masked data 4900 4900

Mask 2732 2732

Subtotal for masked datapath 7632 7632

Masked key scheduler Masked key 6523 4835

Mask 4913 3225

Subtotal for masked key scheduler 11436 8060

Control Data control 300 300

Key control 1140 400

TOTAL for masked AES 20506 16390

Fig. 10. Gate count for a secure AES module

account that the latter techniques require new logic libraries, careful ābalancingā

of place and routing and new development tools which implies higher design and

production costs and longer time-to-market, our solution oļ¬ers a competitive

alternative to a hardware protection.

References

1. Akkar, M., Giraud, C.: An implementation of DES and AES, secure against some

attacks. Proc. Cryptographic Hardware and Embedded Systems: CHES 2001. Lec-

ture Notes in Computer Science 2162 (2001) 309-318

2. Anderson, R., Kuhn, M.: Low cost attacks on tamper resistant devices.

Proc.Security Protocols: IWSP 1997. Lecture Notes in Computer Science 1361

(1997) 125-136

3. BlĀØmmer, J., Merchan J. G., Krummel, V.: Provably secure masking of AES. IACR

o

Cryptology ePrint Archive Report 2004/101 (2004)

4. M. Bucci, L.Germani, M. Guglielmo, R. Luzzi, A. Triļ¬letti: A simulation method-

ology for DPA resistance testing of cryptographic processors (manuscript) 2003

5. Chari, S., Jutla, C., Rao, J., Rohatgi, P.: Towards sound approaches to counteract

power-analysis attacks. Proc. Advances in Cryptology ā“ Cryptoā™99. Lecture Notes

in Computer Science 1666 (1999) 398-412,

6. Clavier, C., Coron, J-S., Dabbous, N.: Diļ¬erential power analysis in the presence

of hardware countermeasures. Proc. Cryptographic Hardware and Embedded Sys-

tems: CHES 2000. Lecture Notes in Computer Science 1965 (2000) 252-263

7. Daemen, J., Rijmen, V.: The design of Rijndael: AES - The Advanced Encryption

Standard. Springer-Verlag Berlin Heidelberg (2002)

8. Gandolļ¬, K., Mourtel, C., Oliver, F.: Electromagnetic analysis: concrete results.

Proc. Cryptographic Hardware and Embedded Systems: CHES 2001. Lecture Notes

in Computer Science 2162 (2001) 251-261

9. GoliĀø, J., Tymen,Ch.: Multiplicative masking and power analysis of AES. Proc.

c

Cryptographic Hardware and Embedded Systems: CHES 2002. Lecture Notes in

Computer Science 2523 198-212

126 E. Trichina, T. Korkishko, and K.H. Lee

10. Goubin, L.: A sound method for switching between boolean and arithmetic mask-

ing. Proc. Cryptographic Hardware and Embedded Systems: CHESā™01. Lecture

Notes in Computer Science 2162 (2001) 3-15

11. Kocher, P., Jaļ¬e, J., Jun, B.: Diļ¬erential power analysis. Proc. Advances in Cryp-

tology ā“ CRYPTOā™99. K Lecture Notes in Computer Science 1666 (1999) 388-397

12. Kocher, P.: Timing attacks on implementations of Diļ¬e-Hellmann, RSA, DSS,

and other systems. Proc. Advances in Cryptology ā“ Cryptoā™96. Lecture Notes in

Computer Science 1109 (1996) 104-113

13. Kommerling, O., Kuhn, M.: Design principles for tamper-resistant smartcard pro-

cessors. Proc. USENIX Workshop on Smartcard Technology (Smartcard 99) (1998)

9-20

14. Lu, C. C., Tseng, S-Y.: Integrated design of AES (Advanced Encryption Sran-

dard) encryptor and decryptor. Proc. IEEE conf. on Application-Speciļ¬c Systems,

Architectures, and Processors (ASAPā™02) (2002) 277-285

15. Mangard, S., Aigner, M., Dominikus, S.: A highly regular and scalable AES hard-

ware architecture. IEEE Transactions on Computers 52 no. 4 (2003) 483-491

16. E.D. Mastrovito, VLSI architectures for computations in Galois ļ¬elds, PhD Thesis,

Linkoping University, Linkoping, Sweden (1991)

17. Messerges, T.: Securing the AES ļ¬nalists against power analysis attacks. Proc.

Fast Software Encryption Workshop 2000. Lecture Notes in Computer Science

1978 (2000) 150-165

18. Messerges, T. S., Dabbish, E. A., Sloan, R. H.: Examining smart-card security

under the thread of power analysis. IEEE Trans. Computers. 51 no. 5 (2002) 541-

522

19. Messerges, T. S.: Using second-order power analysis to attack DPA resistant soft-

ware. Proc. Cryptographic Hardware and Embedded Systems ā“ CHES 2000. Lec-

ture Notes in Computer Science 1965 (2000) 238-251

20. Morioka, S., Satoh, A.: An optimized S-Box circuit architecture for low power

AES design. Proc. Cryptographic Hardware and Embedded Systems: CHES 2002.

Lecture Notes in Computer Science 2523 (2003) 272-186

21. Moore, S., Anderson, R., Cunningham, P., Mullins, R., Taylor, G.,: Improving

smart card security using self-timed circuits. Proc. Proceeding 8th IEEE Inter-

national Symposium on Asynchronous Circuits and Systems ā“ ASYNCā™02. IEEE

(2002) 23-58

22. Paar, C.: Eļ¬cient VLSI architectures for bit parallel computations in Galois ļ¬elds.

PhD Thesis, University of Essen, Germany (1994)

23. Quisquater, J. J., Samide, D.: Electromagnetic analysis (ema): measures and

counter-measures for smart cards. Proc. Smartcard Programming and Security.

Lecture Notes in Computer Science 2140 (2001) 200-210

24. Rudra, A., Dubey, P., Julta, C., Kumar, V., Rao, J., Rohatgi, P.: Eļ¬cient Rijndael

implementation with composite ļ¬eld arithmetic. Proc. Cryptographic Hardware

and Embedded Systems ā“ CHESā™01. Lecture Notes in Computer Science 2162

(2001) 175-188

25. Satoh, A., Morioka, S., Takano, K., Munetoh, S.: A compact Rijndael hardware ar-

chitecture with S-Box optimization. Proc. Advances in Cryptology ā“ ASIACRYPT

2001. Lecture Notes in Computer Science 2248 (2001) 239-254

26. Tiri, K., Akmal, M., Verbauwhede, I.: A dynamic and diļ¬erential CMOS logic with

signal independent power consumption to withstand diļ¬erential power analysis on

smart cards. Proc. IEEE 28th Europen Solid-State Circuit Conf. ā“ ESSCIRCā™02

(2002)

Small Size, Low Power, Side Channel-Immune AES Coprocessor 127

27. E. Trichina, E., De Seta, D., Germani, L.: Simpliļ¬ed Adaptive Multiplicative Mask-

ing for AES and its secure implementation. Proc. Cryptographic Hardware and

Embedded Systems: CHES 2002. 2523 of Lecture Notes in Computer Science 2523

(2002) 277-285

28. Trichina, E.: Combinational logic design for AES SubByte transformation on

ńņš. 5 |