<<

. 5
( 7)



>>

S-Box S-Box S-Box S-Box


32


Fig. 1. Overall structure of the AES module

“ The Key Unit: The key unit serves two main purposes: the storage of
cipher keys and the calculation of the round keys. To save die size, the
S-Boxes of the data unit are reused to perform the key expansion. In the
presented architecture, this reuse is possible for any key size without loss of
performance.
“ The CBC Unit: The CBC unit of the AES module implements the CBC
mode without any negative in¬‚uence on the overall performance of the AES
module.

In the presented architecture, a 128-bit block of data is encrypted as follows.
First, a cipher key needs to be loaded via the AMBA APB interface into the key
unit. Once a key is loaded, it can be used for an arbitrary number of encryptions
and decryptions. After the loading of the cipher key, the ¬rst 128-bit block of
data is transferred via the interface and the CBC unit into the data unit. The
data unit then iteratively performs the number of AES rounds that are required
for the used key size.
In each round, the key unit provides the corresponding round key to the data
unit. To calculate these round keys, the key unit uses the S-Boxes of the data
unit during a clock cycle in which they are not required by the data unit. After
the calculation of the AES rounds, the encryption result is passed in 32-bit words
to the interface via the CBC unit. Decryptions are computed in a very similar
way. In this case, the data unit performs the inverse AES transformations in
reversed order and also the key unit provides the round keys in reversed order.
The remainder of this section presents the details of the data and the key unit.

2.1 The Data Unit
The data unit is the biggest and the most important component of the AES
architecture. It stores the current 128-bit data block of an encryption or de-
E¬cient AES Implementations on ASICs and FPGAs 101



Key


Out In
Cell 00 Cell 01 Cell 02 Cell 03




Key


Out
Cell 13
Cell 10 Cell 11 Cell 12 In


Key


Out In
Cell 23
Cell 20 Cell 21 Cell 22




Out
Cell 30 Cell 31 Cell 32 In
Cell 33



Key




S-Box
S-Box S-Box S-Box

S-Box


Fig. 2. The architecture of the standard data unit


cryption (referred to as the “State”) and is capable of performing any number
and type of en-/decryption rounds on this State. Consequently, all four AES
transformations (SubBytes, ShiftRows, MixColumns and AddRoundKey) and
the corresponding inverse transformations are implemented within the data unit.
For the AddRoundKey transformation, a corresponding round key needs to be
provided by the key unit.
Figure 2 shows the standard version of the data unit. It consists of sixteen
so-called data cells and four S-Boxes. An S-Box of the architecture is a circuit
capable of performing the SubBytes and the inverse SubBytes transformation for
an 8-bit input. The data cells store eight bits per cell and perform all other AES
transformations and the corresponding inverses, when connected appropriately.
In full-custom designs, inputs and outputs of the data cells can be de¬ned such
that connection by abutment is possible when they are placed next to another.
Another distinguishing feature of the presented architecture is the fact that
the combinational paths are relatively short and, more important, very balanced.
The commonly used approach to implement AES in hardware is to store the 128-
bit State in a register and to perform the AES transformations (except for the
ShiftRows transformation) column by column. Therefore, to perform a normal
102 N. Pramstaller et al.

AES encryption round, ¬rst the ShiftRows transformation is done in one clock
cycle. Then, the remaining transformations of an AES round are applied column
by column, whereby all transformations for one column are usually done within
one clock cycle.
The problem of this approach is that the combinational path to perform
a SubBytes, a MixColumn and an AddRoundKey transformation in one clock
cycle is very long. Additionally, the implementation of the ShiftRows transfor-
mation causes a signi¬cant wiring overhead. The data unit, presented in this
section, solves both problems. It performs AES encryptions and decryptions in
the following way:
To load a data block, the input data is shifted column by column from the
right side (see Fig. 2) into the data cells. The inputs labelled “In” are connected
via the CBC unit to the interface. The initial AddRoundKey transformation is
done in the fourth clock cycle at the same time as the last column is loaded.
To compute a normal AES round, the registers are rotated vertically to per-
form the (Inv)SubBytes and the (Inv)ShiftRows transformation row by row. In
our design, we use an S-Box with one pipeline stage. Therefore, the (Inv)SubBytes
and the (Inv)ShiftRows transformations can be applied to all sixteen bytes of
the State in ¬ve clock cycles.
In the sixth clock cycle of a normal AES round, the (Inv)MixColumns and
the AddRoundKey transformations are performed by all data cells in parallel.
Since the S-Boxes are not used by the data unit during the sixth clock cycle,
they can be utilized by the key unit to perform the key expansion for the next
round key.
This way, the required number of encryption or decryption rounds can be
executed by the data unit and the key unit until the 128-bit result is ¬nally
stored in the registers of the data unit. This result is then shifted column by
column to the left (to the interface of the AES module). At the same time, a
new input State can be loaded.
Using the standard data unit, the minimal number of clock cycles that are
required to perform an AES-128 encryption or decryption is 65. Four clock cycles
are required for the I/O of the data unit, 54 clock cycles are required to perform
the nine normal AES rounds and seven are required for the ¬nal round.
The following two subsections present the architecture of the S-Boxes and
the data cells.


S-Boxes. In hardware implementations, the SubBytes transformation and its
inverse are the most expensive AES transformations. For the presented AES
module, a pipelined (one stage) implementation of the S-Box as described in [18]
is used. The main idea of this implementation is to build an e¬cient combina-
tional circuit for the S-Box, which is based on the fact that GF (28 ) can be seen
as quadratic extension of the ¬eld GF (24 ). A pipelined version of the S-Box is
used to accomplish that the combinational paths in the architecture are balanced
(i.e. the paths of the S-Boxes and those of a MixColumns-and-AddRoundKey
step are roughly the same).
E¬cient AES Implementations on ASICs and FPGAs 103

Data Cells. The design of the data cells is crucial for the overall architecture
of the data unit. The data cells serve as storage elements of the AES State and
perform the (Inv)MixColumns and the AddRoundKey transformation. Besides
some input selection circuit, each data cell consists of eight ¬‚ip-¬‚ops, a multi-
plier [17] for the MixColums transformation (which can also be omitted in order
to scale the design) and XOR gates for the AddRoundKey transformation.


2.2 The Key Unit
The key unit is used to store keys and to calculate the key expansion function.
Due to the fact that the AES is standardized for 128, 192 and 256-bit keys, the
interface between the key unit and the data unit is designed such that the key
expansion for several di¬erent key sizes can be implemented on the same chip.
The key unit used in our design stores the key loaded via the interface and is
capable of calculating all round keys for encryption and decryption iteratively.
Since the data unit does not perform any S-Box lookups while the MixColumns
and AddRoundKey transformations are executed, the S-Boxes of the data unit
are reused by the key unit during this clock cycle. Details about the key unit
can be found in [8].


2.3 Performance of the ASIC AES Implementation
As mentioned before, the presented AES module is built up very regular and is
highly scalable. Three di¬erent scaled versions of the module have been imple-
mented and tested. They are named “standard version”, which was described
in the previous sections, “minimum version”, which is the smallest, but slowest
one, and the “high-performance version”, which is the fastest, but most area
intensive version.
As shown in Fig. 2, the data unit of the standard version consists of 16
data cells (including 16 MixColumns multipliers) and four S-Boxes. The en-
/decryption of an 128-bit data block requires 65 clock cycles. By using 16 S-Boxes
instead of four, the performance can be increased. The S-Box lookup can be done
for all 128 bits of the State in parallel. With this con¬guration (high-performance
version), the AES module requires only 35 clock cycles to en-/decrypt a 128-bit
data block. In the minimum version of the AES module, only four S-Boxes and
four MixColumns multipliers are used. Only the four “leftmost” data cells con-
tain MixColumns multipliers. In the other data cells the multipliers are omitted.
Here, 92 clock cycles are needed to process a 128-bit data block.
In this section, we give a comparison of the three di¬erent scaled versions of
the AES module in terms of performance and area. Additionally, a comparison
to related work is given.


2.4 Performance of the Presented ASIC Design
The three versions of AES-128 have been implemented in VHDL and have been
synthesized for a 0.6 µm CMOS process. Table 1 shows the complexity in gate
104 N. Pramstaller et al.

Table 1. Complexity of AES components in GE

Component Complexity [GE]
S-Box 392
Multiplier 212
Data cell (without Multiplier) 87
Key generator 1,633
Key store 691
AMBA Bus Interface 267
CBC Register 1,599


Table 2. Complexity of the AES-128 modules

Component # Minimum # Standard # High Perf.
S-Boxes 4 1,568 4 1,568 16 6,272
Multipliers 4 848 16 3,392 16 3,392
D. cells without mult. 16 1,392 16 1,392 16 1,392
Multiplexors 96 224 192 384 224 374
Data unit 4,032 6,736 11,430
Key generator 1 1,633 1 1,633 1 1,633
Key store 1 691 1 691 1 691
Key unit 2,324 2,324 2,324
AMBA + CBC 1 1,866 1 1,866 1 1,866
Control logic 319 279 230
Additional 2,185 2,145 2,096
Total 8,541 11,205 15,850



equivalents (GE) of each component used for the AES module. In Table 2, the
complexity of the three di¬erent modules is calculated by adding the size of the
components used for the di¬erent versions. In the ¬rst column for each version
the used number of components is given. The standard version of the module
has a complexity of 11,205 GE, whereas the minimum version needs 8,541 GE.
The high performance version requires 15,850 GE.
The high-performance module essentially consists of 12 S-Boxes more than
the standard module. This causes an increase of the complexity by 41%. On the
other hand, the minimum version consists of 12 MixColumns multipliers less than
the standard version and is therefore 24% smaller. The critical path of all three
designs is more or less the same and is determined by the delay of one pipeline
stage of the S-Box”the maximum frequency for the complete AES-128 modules
(including AMBA interface, CBC, and control logic) on the 0.6 µm technology
is about 50 MHz. In Table 3, a summary of the performance is shown.
The standard version needs 65, the high-performance version 35, and the
minimum version 92 clock cycles to perform an AES-128 encryption or decryp-
tion. In the high-performance version, the improvement of the throughput by
87% is paid by an increase of the complexity by 41%. In the minimum version,
the reduced complexity (-24%) accounts for a 29% loss in throughput.
E¬cient AES Implementations on ASICs and FPGAs 105

Table 3. Summary of the performance of the di¬erent AES-128 modules

Clock Throughput@50 MHz Area
[M bps] [GE]
Version Cycles
Minimum 92 70 8,541
Standard 65 98 11,205
High perf. 35 183 15,850


2.5 Related Work
This subsection compares the presented architecture with the one proposed
in [13]. The design of Satoh et al. consists of 5,400 GE and was implemented
on a 0.11 µm technology. The design consists of a core data path and a key
generator. It does not include mechanisms for I/O, CBC registers or a key store.
Its maximum clock frequency is about 130 MHz. The design requires 54 cy-
cles to perform an encryption, which leads to a theoretical throughput (for the
four-S-Box version) of 311 Mbps.
For a comparison of complexity, the gate count for our design has to be
reduced by the additional components our design o¬ers (key store, CBC registers,
AMBA interface)”this leads to a gate count of about 8,600 GE for the standard
AES-128 module. This comparison is still not completely fair, since di¬erent
technologies are used for synthesis. The 0.11 µm technology used in [13] allows
smaller structures and o¬ers di¬erent leaf cells. It seems to be more extensive
than our technology, because similar components of the designs are claimed to
be smaller in the architecture in [13]. For example, the S-Box proposed by Satoh
et al. was reconstructed and synthesized with our technology. The result was a
15% bigger S-Box than used in our design, whereas the results with the 0.11 µm
technology in [13] show an S-Box that needs 25% less GE than our S-Box design
in the 0.6 µm technology.
The big di¬erence in the used technology also does not allow a reasonable
comparison of the maximum frequencies or the throughput. When comparing
the two proposed S-Boxes synthesized in our 0.6 µm technology in terms of
delay, our proposed S-Box is about 30% faster. An attempt to compare the
throughput of both designs starts with comparing the critical paths. In [13] the
critical path is very long: The SubBytes, the MixColumns and the AddRoundKey
transformation are done for one column within one clock cycle. Additionally, in
the same clock cycle the data passes the so-called selector function, which seems
to be another major cause of delay.
In our presented architecture, the critical path consists only of one pipeline
stage of an S-Box. This is approximately a third of the critical path of the
architecture presented in [13]. Using the same technology for synthesis of the
compared designs, we expect the maximum frequency of our module to be at
least three times higher than the maximum frequency stated in [13]. This leads
to a better overall performance.
106 N. Pramstaller et al.

3 FPGA Implementation of AES
Recon¬gurable devices like Field Programmable Logic Arrays (FPGA) gain more
and more importance in hardware, software and hardware/software co-designs.
In the beginning often seen only as devices for rapid prototyping, FPGAs are
increasingly used for ¬nal applications. One of the mostly used arguments for
using FPGAs rather than ASICs, is the reduced time-to-market and the cost
advantages of standard devices.
Due to the importance of recon¬gurable devices, numerous FPGA AES im-
plementations have been published within the last years. These implementations
mainly focus on high throughput rates [3, 9], and use techniques like loop un-
rolling and pipelining. They are able to report throughput rates up to 12,160
Mbps [3]. Applying such techniques leads to AES hardware implementations that
require a huge amount of FPGA resources that are only available for expensive
devices and cannot be implemented in low-end FPGAs.
In this section we present a new universal architecture that is supported by
several FPGA product families and can also be implemented using inexpensive
low-end FPGAs. It is the ¬rst known AES FPGA implementation that does
not require on-chip block RAMs. Furthermore, it implements the complete AES
encryption standard and features the Cipher Block Chaining (CBC) mode.


3.1 Related Work
Gaj et al. [3] published the fastest known FPGA implementation. For encryption
and decryption with 128-bit keys, a throughput of 12,160 Mbps on a Xilinx Vir-
tex XCV1000BG560-6 device is reported. McLoone et al. [9] achieve a through-
put of 6,956 Mbps for 128-bit keys only. They also presented encryption engines
for 192-bit or 256-bit keys with accordingly lower throughput. Their combined
encryption and decryption implementation can handle 128-bit keys and achieves
a throughput of 3,239 Mbps on a Xilinx Virtex-E XCV3200E-8CG1156 device.
The third implementation published by Dandalis et al. [5] also provides encryp-
tion and decryption for 128-bit keys. They achieve a throughput of 353 Mbps
on a Xilinx Virtex XCV1000BG560-6 device. Fischer et al. [6] published a non-
pipelined design supporting encryption and decryption for 128-bit keys. They
report a throughput of 451 Mbps of their fast con¬guration and 115 Mbps for
an economic con¬guration. A drawback of their design is the missing on-chip
round-key generation. Chodowiec et al. [2] presented an implementation for low-
end devices. Using only few resources they achieve a throughput ranging from
139 Mbps to 166 Mbps depending on the used FPGA device.
All implementations (except [6, 2]) use a considerable amount of hardware
resources. For instance, [9] requires 138 block RAMs for 256-bit keys. This de-
mands the use of expensive million-gate FPGA devices.
As shown above, most published hardware implementations focus on high
throughput rates and do not provide a non-parameterizable design to support the
complete AES standard. Furthermore, the high throughput implementations [3,
9] do not support the Cipher Block Chaining mode (CBC).
E¬cient AES Implementations on ASICs and FPGAs 107

3.2 Architecture of the AES FPGA Implementation
This section describes the architecture of the AES co-processor. Starting with a
swift overview, we will present details to highlight some innovative improvements
that make it possible to present a resource-e¬cient AES co-processor suitable
for low-end FPGA devices.
Basic components of the AES co-processor as shown in Fig. 3 are the AMBA
APB interface, the data unit, the key unit, and the control unit. The key unit
calculates the KeyExpansion function. All round keys are pre-calculated and
stored in the key unit. Pre-calculated round keys allow fast en-/decryption of
di¬erent data blocks for the same cipher key because no additional KeyExpan-
sion is required. The data unit holds the State and performs all AES transfor-
mations: AddRoundKey, (Inv)SubBytes, (Inv)ShiftRows and (Inv)MixColumns.
When encryption or decryption has completed the ciphertext (plaintext in case
of decryption, respect.) is stored in the data unit. The control unit receives
commands from the AMBA interface and generates control signals for all other
modules. In addition to control round-key calculation, encryption and decryp-
tion, it also sequences data loading and unloading.


Data
Key
data in Unit
AMBA Unit
Interface
data out
(APB)

Control Unit



Fig. 3. Architecture of the AES co-processor


The architecture is similar to the architecture presented in Section 2. Di¬er-
ences are a modi¬ed State representation and a modi¬ed round-key calculation
scheme. Due to a non-pipelined approach, the same performance for all modes
of operations (ECB and CBC) is reached. Next we describe the AES data unit
and the AES State representation in detail.

Data Unit. The data unit stores the State, all intermediate results of the round
function applied to the State and the output data when encryption or decryption
has completed. The major di¬erence to all other published AES implementations
is the innovative State representation that consists of two States. One State
contains the actual State values and the other State stores newly calculated
values. Figure 4 depicts the two States, referred to as StateA and StateB. In
each cycle, 32 bits (one row or one column) of either StateA or StateB are
altered. Using a second State provides a lot of bene¬ts without the need of
additional recourses: (Inv)ShiftRows comes for free and no State transposition
between column and row operations is required.
Storage elements in FPGAs can be e¬ciently implemented by using syn-
chronous RAMs because the basic logic elements of FPGAs, called slices, can
108 N. Pramstaller et al.


0 (0,1)
(0,0) (0,2) (0,3)




StateA
1 (1,1)
(1,0) (1,2) (1,3)
2 (2,1)
(2,0) (2,2) (2,3)




address [2..0]



RAM 0


RAM 1


RAM 2


RAM 3
(3,1)
3 (3,0) (3,2) (3,3)
(4,1)
4 (4,0) (4,2) (4,3)




StateB
(5,1)
(5,0) (5,2) (5,3)
5
(6,1)
(6,0) (6,2) (6,3)
6
(7,1)
(7,0) (7,2) (7,3)
7


Fig. 4. The State-RAM


be con¬gured as 16 — 1 bit synchronous RAM. Two slices (= one Con¬gurable
Logic Block - CLB) provide 16 — 1 bit synchronous dual-port RAM function-
ality (see [20]). Dual-port RAMs allow concurrent reading and writing to the
RAM. Due to these technology features, the State-RAM as depicted in Fig. 4
is implemented as four slices of 8 — 8 bit synchronous dual-port RAMs to allow
addressing the slices independently.
The data unit performs all transformations of the round function: (Inv)Shift-
Rows, (Inv)SubBytes, (Inv)MixColumns and AddRoundKey. AddRoundKey and
(Inv)MixColumns are applied to the State column by column, whereas (Inv)-
ShiftRows and (Inv)SubBytes are applied to the State row by row. Due to
the slice architecture of the RAM which holds the State, it is not possible to
read/write from/to the RAM column by column. Hence, a transposition of the
State is necessary if a row-oriented operation follows a column-oriented oper-
ation, or vice versa. Transposition would require a reorganization of the State
before further operations can be performed. By using two States, transposition
can be implemented by accordingly addressing the State-RAM. Furthermore,
(Inv)ShiftRows can be combined with transposing the State. As a consequence
of this, (Inv)ShiftRows and transposition come for free. In the sequel we describe
the memory organization and State transposition for encryption. The same ap-
proach can easily be modi¬ed for decryption.
When a row-oriented operation follows a column-oriented operation (or vice
versa), the State must be transposed. Combining row and column transfor-
mations minimizes the number of required transpositions: ShiftRows is com-
bined with SubBytes and AddRoundKey is combined with MixColumns. This
approach requires only one transposition per round. Encryption requires Sub-
Bytes followed by ShiftRows. Since ShiftRows does not a¬ect the byte values
and SubBytes is applied to each byte of the State individually, the order of
both operations does not matter. This fact eases the address generation for the
State-RAM.
For explaining the State transposition we consider the State as 4 — 4 matrix:
i=0..3
S = (si,j )j=0..3 . The ShiftRows transformation described in [10] can then be
expressed as follows:

S = Shif tRows(S) = (si,j’i mod 4 )i = 0..3 (1)
j = 0..3
E¬cient AES Implementations on ASICs and FPGAs 109


+

0 0 0 0 0 0 0 0 0 1 2 3 0 1 2 3

1 1 1 1 1 1 1 1 4 5 6 7 4 5 6 7
StateA




1 1
2 2 2 2 2 2 2 2 2 2 2 2 8 9
0 1
3 3 3 3 3 3 3 5 3 3 3 3 3 3 3 3
SubBytes SubBytes
SubBytes SubBytes
4 4 4 4 4 4 4 4 4
4

5 5 5 5 5 5 5 5
5 5
StateB




6 6 6 6 6 6 6
6 6 6

7 7 7 7 7 7
7 7 7 7



Fig. 5. ShiftRows and SubBytes for encryption


If we replace the State by the transposed State, we obtain:

S T = Shif tRows(ST ) = (si+j mod 4,j )ji = 0..3
=0..3
(2)

With the result of (2) the addressing of the StateB-RAM can be determined: the
indices (i, j) must be substituted with (i + j mod 4, j). Due to the even number
of AES rounds for all key lengths, ShiftRows is always applied to StateB only.
Thus, the resulting index tuples can be directly mapped to the RAMs. The ¬rst
part of the tuple index speci¬es the RAM slice and the second part speci¬es the
RAM address. Since we operate on StateB, we must add an o¬set of 4 to the
index value to get the correct address. Figure 5 shows the transposition of the
State, including ShiftRows and SubBytes for encryption.


Implementation of (Inv)SubBytes and (Inv)MixColumns. The (Inv)-
SubBytes transformation is based on [18]. One di¬erence is that the byte inver-
sion in GF(28 ) is implemented by using a synchronous ROM. (Inv)MixColumns
is similar to the architecture presented in [17]. For further details refer to [18, 17].


Key Unit. An innovative aspect of our implementation is that the key unit
can handle 128-bit, 192-bit and 256-bit keys with minimal additional hardware
requirements. Supporting all key lengths increases the needed hardware resources
for the key unit by only 7.8%. The size of the key memory for 256-bit keys is the
same as for 128-bit keys. For 128-bit keys, the KeyExpansion function derives
44 32-bit round-key parts from the cipher-key. This requires a 64 — 32 bit RAM.
256-bit keys produce 63 32-bit round-key parts ¬tting the 64 — 32 bit RAM.


3.3 Performance of the FPGA AES Implementation
This section compares the proposed AES co-processor with the works referred
to in Section 3.1. In order to provide comparable results, we implemented our
co-processor on a Xilinx Virtex-E XCV1000EBG560-8 device.
110 N. Pramstaller et al.

Table 4. Hardware resources and throughput comparison

ECB mode
Work Device #CLB-slices #BRAM Throughput
[Mbps]
Gaj et al. [3] Xilinx XCV1000 12,600 80 12,160
McLoone et al. [9] (I) Xilinx XCV812E 2,222 100 6,956
McLoone et al. [9] (II) Xilinx XCV3200E 2,577 112 5,800
McLoone et al. [9] (III) Xilinx XCV3200E 2,995 138 5,000
McLoone et al. [9] (IV) Xilinx XCV3200E 7,576 102 3,239
Dandalis et al. [5] Xilinx XCV1000 5,673 ? 353
Fischer et al. [6] (I) FLEX 10KE200-1 2,530 24 451
Fischer et al. [6] (II) ACEX 1K50-1 1,213 10 115
Chodowiec et al. [2] Xilinx XC2S30-6 222 3 166
Our proposal 1,125 0 215
Xilinx XCV1000E

[9]: enc.: (I)AES-128, (II)AES-192, (III)AES-256, enc./dec.:(IV)AES-128
[6]: AES-128 enc./dec.: (I) fast con¬guration, (II) economic con¬guration



The performance results given in Table 4 are for the ECB mode. Most of these
implementations claiming high throughput rates will have similar performance
¬gures when operating in the CBC mode. The CBC mode is strictly recom-
mended and commonly used for encrypting high-speed data streams (e.g. as it
is used for encrypting data transfers over networks) and hence, the above-listed
high throughput rates lose their signi¬cance.
As shown in Table 4 our implementation is the only one that does not require
any block RAMs. The presented AES co-processor supports the complete AES
standard and the CBC mode. Additionally, it is equipped with a 32-bit AMBA
APB interface that eases the integration with processors used in System-on-Chip
designs [1]. If we do not consider the CBC mode and the AMBA bus interface,
our approach is still comparable with the above-listed works but we would require
less hardware resources (-26 %).
Our implementation utilizes 9.16% of the available logic cells on a Xilinx
Virtex-E XCV1000EBG560-8 device. 90.8% of the logic resources and 100% of
the on-chip BRAMs can be used by other circuits like a LEON2 or an ARM
processor. For a stand-alone application a low-end FPGA (e.g. Xilinx SpartanII
XC2S100-6) is su¬cient for implementing the complete AES co-processor” the
other approaches (except [2]) do not ¬t on a SpartanII device. The high through-
put designs do not support this ¬‚exibility and require expensive multi-million
gate FPGAs. Another important fact is that the other works do not provide an
en-/decryption engine that supports all de¬ned key lengths.
The maximum clock frequency on a XCV1000 FPGA is 161 MHz. At this
frequency, a throughput of 215 Mbps for AES-128, 180 Mbps for AES-192, and
156 Mbps for AES-256 is achieved for both ECB mode and CBC mode.
E¬cient AES Implementations on ASICs and FPGAs 111

4 Conclusions
In this article we presented two designs of a compact AES co-processor, one
suitable for ASIC implementations, the other one suitable for FPGAs. Both
designs are able to implement the whole functionality of the AES standard:
encryption and decryption with all key lengths (128-bit, 192-bit, and 256-bit).
In addition to covering the complete AES standard they support the Cipher
Block Chaining mode CBC. The AES co-processors also have a standard 32-bit
interface (AMBA) that facilitates the integration in System-on-Chip designs.
Our ASIC implementation is very regular, which makes it well suited for full
custom designs, and highly scalable. By scaling, the ASIC AES module it can be
adapted for many di¬erent applications with di¬erent requirements. With this
architecture high performance (up to 198 Mbps) as well as low area requirements
(down to 8,500 GE) can be reached on a 0.6 µm technology.
For the FPGA implementation, we have shown that due to an innovative
State representation the complete AES co-processor can be implemented on
inexpensive low-end FPGA devices. An implementation on a Xilinx Virtex-E
device uses only 1,125 CLB-slices and no block RAMs. Our FPGA implemen-
tation reaches a throughput of 215 Mbps at a clock frequency of 161 MHz for
encryption and decryption.



References
1. ARM Limited. AMBA 2.0 Speci¬cation. "http://www.arm.com/armtech/", 2001.
2. P. Chodowiec and K. Gaj. Very Compact FPGA Implementation of the AES
Algorithm. In Workshop on Cryptographic Hardware and Embedded Systems “
CHES 2003, volume 2779 of Lecture Notes in Computer Science (LNCS), pages
319“333. Springer, 2003.
3. P. Chodowiec, P. Khuon, and K. Gaj. Fast Implementations of Secret-Key Block
Ciphers Using Mixed Inner- and Outer-Round Pipelining. In Symposium on Field
Programmable Gate Arrays “ FPGA 2001, pages 94“102. ACM Press, 2001.
4. J. Daemen and V. Rijmen. The Design of Rijndael. Springer-Verlag, 2002.
5. A. Dandalis, V. Prasanna, and J. Rolim. A Comparative Study of Per-
formance of AES Final Candidates Using FGPAs. "http://csrc.nist.gov/
CryptoToolkit/aes/round2/conf3/aes3agenda.html", 2000.
6. V. Fischer and M. Drutarovsk´. Two Methods of Rijndael Implementation in
y
Recon¬gurable Hardware. In Workshop on Cryptographic Hardware and Embedded
Systems “ CHES 2001, volume 2162 of Lecture Notes in Computer Science (LNCS),
pages 77“92. Springer-Verlag, 2001.
7. P.C. Kocher, J. Ja¬e, and B. Jun. Di¬erential Power Analysis. In Advances in
Cryptology “ CRYPTO 1999, volume 1666 of Lecture Notes in Computer Science,
pages 388“397. Springer-Verlag, 1999.
8. S. Mangard, M. Aigner, and S. Dominikus. A Highly Regular and Scalable AES
Hardware Architecture. In IEEE Transactions on Computers, volume 52, pages
483“491, April 2003.
112 N. Pramstaller et al.

9. M. McLoone and J.V. McCanny. High Performance Single-Chip FPGA Rijndael
Algorithm Implementations. In Workshop on Cryptographic Hardware and Em-
bedded Systems “ CHES 2001, volume 2162 of Lecture Notes in Computer Science
(LNCS), pages 65“76. Springer-Verlag, 2001.
10. National Institute of Standards and Technology. Federal Information
Processing Standard 197, The Advanced Encryption Standard (AES).
"http://csrc.nist.gov/publications/fips/fips197/fips-197.pdf", 2001.
11. N. Pramstaller, F.K. G¨ rkaynak, S. Haene, H. Kaeslin, N. Felber, and W. Ficht-
u
ner. Towards an AES Crypto-chip Resistant to Di¬erential Power Analysis. In
Proccedings of ESSCIRC 2004, to appear, 2004.
12. A. Rudra, P.K. Dubey, C.S. Jutla, V. Kumar amd J.R. Rao, and Pankaj Ro-
hatgi. E¬cient Rijndael Encryption Implementation with Composite Field Arith-
metic. In Workshop on Cryptographic Hardware and Embedded Systems “ CHES
2001, volume 2162 of Lecture Notes in Computer Science (LNCS), pages 171“184.
Springer-Verlag, 2001.
13. A. Satoh, S. Morioka, K. Takano, and S. Munetoh. A Compact Rijndael Hardware
Architecture with S-Box Optimization. In Advances in Cryptology “ ASIACRYPT
2001, volume 2248 of Lecture Notes in Computer Science (LNCS), pages 239“254.
Springer-Verlag, 2001.
14. K. Tiri, M. Akmal, and I. Verbauwhede. A Dynamic and Di¬erential CMOS Logic
with Signal Independent Power Consumption to Withstand Di¬erential Power
Analysis on Smart Cards. In Proceedings of ESSCIRC 2002, 2002.
15. K. Tiri and I. Verbauwhede. Securing Encryption Algorithms against DPA at the
Logic Level: Next Generation Smart Card Technology. In Workshop on Crypto-
graphic Hardware and Embedded Systems “ CHES 2003, volume 2779 of Lecture
Notes in Computer Science (LNCS), pages 125“136. Springer, 2003.
16. B. Weeks, M. Bean, T. Rozylowicz, and C. Ficke. Hardware Performance Sim-
ulations of Round 2 Advanced Encryption Standard Algorithms. "http://
csrc.nist.gov/encryption/aes/round2/NSA-AESfinalreport.pdf", 2000.
17. J. Wolkerstorfer. An ASIC implementation of the AES-MixColumn operation. In
Proceedings of Austrochip 2001, October 2001.
18. J. Wolkerstorfer, E. Oswald, and M. Lamberger. An ASIC implementation of the
AES S-Boxes. In Topics in Cryptology “ CT-RSA 2002, Proceedings of the RSA
Conference 2002, volume 1965 of Lecture Notes in Computer Science. Springer-
Verlag, February 2002.
19. S-Y Wu, S-C Lu, and C-S Laih. Design of AES Based on Dual Cipher and Com-
posite Field. In Topics in Cryptology “ CT-RSA 2004, Proceedings of the RSA
Conference 2004, volume 2964 of Lecture Notes in Computer Science. Springer-
Verlag, February 2004.
20. Xilinx Incorporated. Silicon Solutions ” Virtex Series FPGAs. "http://
www.xilinx.com/products/".
Small Size, Low Power, Side Channel-Immune
AES Coprocessor: Design and Synthesis Results

Elena Trichina1 , Tymur Korkishko2 , and Kyung Hee Lee2
1
Department of Computer Science, University of Kuopio,
P.O.B. 1627, FIN-70211, Kuopio, Finland
2
Information security TG, i-Networking Lab, Information Security Group,
Samsung Advanced Institute of Technology, Korea


Abstract. When cryptosystems are being used in real life, hardware and
software implementations themselves present a fruitful ¬eld for attacks.
Side channel attacks exploit information such as time measurements,
power consumption, and electromagnetic emission that leaks from a de-
vice when it executes cryptographic applications. When leaked informa-
tion is correlated to a secret key, an adversary may be able to recover the
key by monitoring this information. This paper describes an AES copro-
cessor that provides complete protection against ¬rst-order di¬erential
power analysis by embedding a widely used software countermeasure that
decorrelates data being processed from the leaked information, so-called
data masking, at a hardware level.



1 Introduction
In applications such as smart cards hardware complexity and tamper resistance
are very important issues that directly a¬ect the cost and consumer acceptance
of such devices. A class of side channel attacks enables breaking cryptographic
algorithms by measuring timing characteristics [12], power consumption [11, 18],
and electromagnetic radiation [8, 23] of a smart card microprocessor when it runs
cryptographic applications.
Until recently, most of these attacks exploited some speci¬c features of soft-
ware implementations of cryptographic algorithms, and many countermeasures
were designed at a software level. For many applications, however, it is neces-
sary that cryptographic algorithms should be realized in hardware. Although not
many results have been published yet, it is prudent to suggest that cryptographic
hardware also leaks side channel information, and that alongside with general
tamper-resistant features cryptographic coprocessors should include countermea-
sures speci¬cally designed to protect them against side channel attacks.
One of the most powerful software techniques to counteract such attacks is
to mask all input and intermediate data values in order to de-correlate any in-
formation leaked through side channels from actual secret data being processed.

This work had been done when the author was with the Smart Card System Engi-
neering Business Unit, System LSI Division, Samsung Electronics Co. LTD., Korea.

H. Dobbertin, V. Rijmen, A. Sowa (Eds.): AES 2004, LNCS 3373, pp. 113“127, 2005.
c Springer-Verlag Berlin Heidelberg 2005
114 E. Trichina, T. Korkishko, and K.H. Lee

In [28] it had been shown how to apply data masking technique at the level
of micro operations such as logical AND, XOR, etc., and how to use these op-
erations as building blocks for implementation of inversion in composite ¬elds
directly on masked data. In this paper we use ideas from [28] in order to build a
practical coprocessor for an Advanced Encryption Standard algorithm [7] that
is immune to side channel attacks. First, we generalize multiplication on masked
data from a bitwise multiplication operation (i.e., logical AND) to multiplica-
tion in any GF (2n ) ¬eld. This generalization allows us to use standard libraries
while designing side channel attacks-resistant cryptographic hardware for appli-
cations that are based on binary ¬eld arithmetic. What is more important, our
method provides considerable savings in the gate count and power consumption
in comparison with [28].
The synthesis results for a ”minimalist” architecture that implements a 16
clock/round version of the AES coprocessor with a ¬‚exible key size, show that
with 0.18 µm technology the performance of 4Mbps can be achieved with the
circuit comprising 20,506 gates clocked at 5MHz and requiring 1.07 µA. This is
better than countermeasures such as dynamic and di¬erential CMOS logic [26]
or asynchronous circuits with dual rail logic [21] can o¬er. Also, our solution
has an advantage of using standard technologies, standard libraries, and well-
established design tools.
The rest of the paper is organized as follows. After a brief reminder of the AES
algorithm in Chapter 2, we describe principles of power analysis attacks. The
countermeasure consisting in masking all input and intermediate data with some
random values and di¬culties of its implementation for the AES algorithm are
discussed in Chapter 4, after which we suggest our solution to this problem. The
details of the DPA-resistant AES coprocessor architecture are given in Chapter
6. The paper is concluded with synthesis results and a brief comparison of our
design with other hardware solutions.


2 AES Reminder
The Advanced Encryption Standard [7] is a round-based symmetric block cipher.
The standard key size is 128 bits, but for some some applications 192 and 256-
bit keys must be supported as well. The round consists of four di¬erent opera-
tions, namely, SubBytes, ShiftRows, MixColumn, and AddRoundKey, that are
performed repeatedly in a certain sequence; each operation in a standard algorithm
maps a 128-bit input state into a 128-bit output state. The state is represented as
a 4 — 4 matrix of bytes. The number of rounds depends on the key size. In the de-
cryption process, the inverse operations are executed in a slightly di¬erent order.
ShiftRows is a cyclic shift operation on each of four rows in a 4 — 4-byte state
using 0 ∼ 3 o¬sets. MixColumn treats 4-byte data blocks in each column of a
state as coe¬cients of a 4-term polynomial, and multiplies them modulo x4 + 1
with the ¬xed polynomial c(x) = {03}x3 + {01}x2 + {01}x + {02}. AddRound-
Key is a bit-wise XOR operation on 128-bit round keys and data. These three
operations are linear.
Small Size, Low Power, Side Channel-Immune AES Coprocessor 115

SubBytes is the main building block of the AES. It replaces each byte in a
state by its substitute in an Sbox that comprises a composition of two transfor-
mations:
“ First, each byte in a state is replaced with its reciprocal in the ¬nite ¬eld
GF (28 ), except that 0, which has no reciprocal, is replaced by itself. This is
the only non-linear function in the AES algorithm.
“ Then an a¬ne transformation f is applied. It consists of a bitwise matrix
multiply with some ¬xed 8 — 8 binary matrix followed by XOR with the
hexadecimal number {63}.
The round key is computed in parallel to the round operation. It is derived from
the cipher key by means of key expansion and round key selection operations,
which are similar to those of the round operations, and also use Sboxes.


3 S-Box Architecture
There are many design trade-o¬s to be considered when implementing an S-
box in ASIC since the size, the speed, and the power consumption of an AES
coprocessor depends largely on the number and the style of implementation of
Sboxes [20]. It also turned out that this operation is the most di¬cult to protect
against side channel attacks.
To optimize the silicon area, a number of ¬‚exible ASIC solutions that use
similarities between encryption and decryption to share silicon were proposed
[14, 20], where SubBytes was implemented in two steps, as a combination of
inversion in the ¬eld and an a¬ne transformation f . While the a¬ne trans-
formations used for encryption and decryption are slightly di¬erent, the silicon
implementing inversion in GF (28 ) can be used for both, as shown in Fig. 1.
Therefore, an area- and power-e¬cient and secure implementation of inversion
may have a big impact on an overall design.
The most obvious solution is to use a look-up table for this operation [14].
It is fast and inexpensive in terms of power consumption [20]. There is a major
drawback, however. Namely, the size of silicon is about 1,700 gate equivalents
per table in 0.18µ technology. Considering that up to 20 such tables (including
4 tables for key scheduling) is required per round, this solution is hardly feasible
for co-processors intended for smart cards and other embedded systems.
Among various alternative approaches composite ¬eld inversion produces the
most compact AES implementations [24, 25, 30] which can be further optimized

)c+x(1-A
eniffA esr evnI
mr ofsn arT c+1-xA
1-
X
x y
elbaT eniffA
mr ofsn arT

ceD/cnE




Fig. 1. S-box implementation suitable for encryption and decryption
116 E. Trichina, T. Korkishko, and K.H. Lee

to minimize power consumption [20]. As a basis for our design we use fully
combinational logic implementation of inversion in composite ¬elds described
in [30].
Usually the ¬eld GF (28 ) is seen as an extension of GF (2) and therefore its
elements can be represented as bytes. However, GF (28 ) can also be seen as a
quadratic extension of GF (24 ); in this case an element a ∈ GF (28 ) is represented
as a linear polynomial aH x + aL , denoted [aH , aL ], with coe¬cients in GF (24 ).
This isomorphic representation is far better suited for hardware implementation
[22, 24, 20, 30].
The bijection from a ∈ GF (28 ) to a two-term polynomial [aH , aL ] is given
by the linear function map computed by means of XOR operations on bits of
a. The inverse transformation map’1 converts a two-term polynomial back to
element a ∈ GF (28 ), and is de¬ned in a similar way. For more details see [30].
All arithmetic operations applied to elements of GF (28 ) can also be computed
in a new representation. Two-term polynomials are added by addition of cor-
responding coe¬cients. Multiplication and inversion of a two term-polynomial
requires modular reduction to ensure that the result is a two-term polynomial
as well; the irreducible polynomial n(x) = x2 + {1}x + {e} whose coe¬cients are
chosen to optimize ¬nite ¬eld arithmetic can be used for this purpose.
Inversion of a two-term polynomial is de¬ned as (aH x+aL )—(aH x+aL )’1 =
{0}c + {1}. From this de¬nition the formulae for inversion can be derived:

(aH x + aL )’1 = (aH — d)x + (aH • aL ) — d
(1)
d = ((a2 — {e}) • (aH — aL ) • a2 )’1 .
H L


Fig. 2 depicts a block diagram of inversion in composite ¬eld GF ((24 )2 ). As
one can see, one addition, one squaring, one multiplication by a constant, three
general multiplications, and one inversion in GF (24 ) are necessary for such im-
plementation. All these operations can be realized in combinational logic. Only
general multiplication and inversion in GF (24 ) require to use both, AND and
XOR gates for their implementation; all other operations need only compositions
of XOR gates [30].


.lum tnatsnoC erauqS
)42(FG ni )42(FG ni
noisrevnI
2
X x X
Ha Hb
)42(FG ni
noitacilpitluM 1-
X
)42(FG ni
X X
La Lb




Fig. 2. Inversion in GF ((24 )2 )
Small Size, Low Power, Side Channel-Immune AES Coprocessor 117

4 Side Channel Attacks and Computations on Masked
Data
Basically, side-channel attacks work because there is a correlation between the
physical measurements taken during computations, such as power consump-
tion [11, 18], EMF radiation [8, 23], time of computations [12], and the internal
state of the processing device, which itself is related to a secret key.

4.1 Power Analysis Basics
Among side-channel attacks, a di¬erential power analysis (DPA) is the main
concern when implementing cryptographic algorithms in embedded devices be-
cause due to physical constraints an adequate shielding and power consumption
¬ltering cannot be employed. Power analysis attacks use an all-pervasive fact
that, ultimately, all calculations performed by a digital device operate on logi-
cal ones and zeros; and in contemporary technology power consumption while
manipulating a logical one di¬ers from power consumption while manipulating
a logical zero.
To illustrate why the power analysis works, let us consider an example in
Fig. 3. The circuit, ¬rst appeared in [18], represents a component model useful
for understanding power consumption characteristics of Complementary Metal
Oxide Semiconductor (CMOS) technology.

V pin
cc



L bond

Other
Gates
Q
1
V
gate
C
Q load
2
V pin
ss
+ V scope
L
bond
C R
C 2 m
1



Fig. 3. Measuring power consumption of a smart card


The two most essential components of power consumption during the change
of a state of a CMOS gate are dynamic charge resp. discharge (appr. 85%) and
dynamic short circuit current (appr. 15%). This is sketched in Fig. 3 where the
output of each gate has a capacitive load consisting of the parasitic capacity of
the connected wires and gates of the following stages. An input transition re-
sults in an output transition, which discharges or charges this parasitic capacity,
causing a current to ¬‚ow to Vcc or to Vss .
118 E. Trichina, T. Korkishko, and K.H. Lee

For example, as Vgate changes from 0 to 5 volts, the transistors Q1 and Q2
are both conducting for a brief period causing current to ¬‚ow from Vcc to the
ground. Also, during this time the capacitor Cload will be discharged (or charged)
causing more (or less) current to ¬‚ow through the Vss pin. The current charges
and discharges capacitors Cload , C1 , and C2 , and ¬‚ows out of the smart card
through a bond wire that acts as an inductor Lbond .
Power dissipated by the circuit can be monitored by using a small resistor
Rm in series between the Vss pin and the ground (or, alternatively, between the
Vcc pin and the true source). Current moving through Rm creates a time varying
voltage that can be sampled by a digital oscilloscope.
The more circuit changes its state, the more power is dissipated. In a syn-
chronous design gates are clocked, which means that all gates change their state
at the same time. Information useful to an attacker is leaked because the amount
of current being drawn when the circuit is clocked is directly related to a change
of the state of Cload or the current drawn by other gates attached to Cload . Thus,
whenever the secret key data or data correlated to the secret key is manipulated,
the microprocessor can leak damaging information that can be observed at the
Vscope .
In a microprocessor each clock pulse causes many bit transitions to occur si-
multaneously. There are two types of information leakage that can be observed at
the Vscope : Hamming weight leakage and transition count leakage [18]. The Ham-
ming weight information leaks when the dominant source of current is caused by
discharging of the Cload . The transition current information can leak when the
dominant source of current is due to the switching of the gates that are driven
by the data bus. When the data bus changes state, many of the gates driven by
the bus will brie¬‚y conduct current. Thus, the more bits change states, the more
power is dissipated.
From this explanation we deduce that the power consumption of a circuit at
time t is the sum of power dissipated by all gates at this time. Of course, various
noise components must be considered as well [18]. It can be stated as the simple
power model
P (t) = f (g, t) + N (t),

where t denotes time, N (t) is a normally distributed random variable which
represents a noise component, and f (g, t) denotes power consumption of gate g
at time t.
The next step is to relate this model to statistics. If we consider function
f (g, t) as a random variable from an unknown probability distribution, then
according to the Central Limit Theorem, P (t) is normally distributed. In a DPA
attack, an attacker divides the power measurements in two or more di¬erent
sets and tries to compute the di¬erence between these sets in order to verify
the selection function, which relates the corresponding power measurements to
the hypothesis concerning the values of the target bits of the key. Only if the
hypothesis was correct, there will be some noticeable peaks in statistics.
For example, a selected bit b at the output of one Sbox of the ¬rst round of
the AES algorithm will depend on the known input message and 8 unknown bits
Small Size, Low Power, Side Channel-Immune AES Coprocessor 119

of the key. The correlation between power consumption and b can be computed
for all 256 values of 8 unknown bits of the key. The correlation is likely to be
maximal for the correct guess of the 8 bits of the key. Then an attack can be
repeated for the remaining Sboxes.

Data Masking and Inversion in GF (24 )
4.2
There are many strategies to combat side-channel attacks. On a hardware level,
the countermeasures usually include clock randomization, power consumption
randomization, current compensation, and various detectors of abnormal be-
havior. However, the e¬ect of these countermeasures can be reduced by various
signal processing and statistical techniques [6]. Software-based countermeasures
include introducing dummy instructions, randomization of the instruction exe-
cution sequence, balancing Hamming weights of the internal data, etc.
Data masking is one of the most powerful software countermeasures against
side channel attacks [5, 11]. The idea is simple: the message and the key are
masked with some random values at the beginning of computations, and there-
after everything is almost as usual. Of course, the value of the mask at the end
of some ¬xed step (e.g., at the end of a round or at the end of a linear part of
computations) must be known in order to re-establish the expected data value
at the end of the execution; we call this mask correction.
A traditional XOR operation is used for data masking; however, a mask is
arithmetic in GF (28 ). The operation is compatible with the AES structure ex-
cept for inversion in SubBytes, which is the only non-linear transformation. In
other words, to compute mask corrections for each of the linear transformations
in a round, we simply have to apply this transformation separately to masked
bytes and to corresponding masks.
Unfortunately, it turned out to be rather di¬cult to ¬nd an e¬cient and
secure solution for non-linear operations. The ¬rst attempt to transform masked
data between Boolean and arithmetic operations in a secure way [17] was shown

a• x b• y z
y
x




a…b • z z



Fig. 4. Masked AND (MAND)
120 E. Trichina, T. Korkishko, and K.H. Lee

to be insu¬cient against DPA attacks. A more sound method for mask switching
[10] involves too much computational overhead.
To overcome this di¬culty, Akkar and Giraud [1] proposed transformed mask-
ing, where ¬rst an additive mask is replaced by a multiplicative mask in a series
of multiply and add operations, after which a normal inversion takes place, and
¬nally, the transformation of a multiplicative mask into an additive mask is car-
ried out again. However, it was pointed out in [9] that a multiplicative mask
does not blind zero, and thus does not prevent a DPA attack.
In [28] it was noticed that for a fully combinational AES Sbox design, the
problem of ”masked inversion” can be e¬ectively reduced to the problem of
computing binary XOR and AND operations on masked bits a = a • x, ˜ = b • y
b
˜
and on bits of the mask x, y without ever revealing actual data bits a, b in
the process. For XOR computing the mask correction is trivial because a • ˜ =
˜b
(a • b) • (x • y).
Masked AND and the corresponding mask correction can be computed by
manipulating only masked data bits and the bits of the masks as follows 1 . Let
c = a · b. Then

c • z = a · b • z = (˜ · ˜ • (y · a • (x · ˜ • (x · y • z)))).
ab b
˜ (2)
This can be implemented as a cascade of logic gates as shown in Fig. 4.
To realize inversion in GF (28 ) on masked data, one would have to replace
each AND gate used by a multiplier and an inverter in GF (24 ) with the circuit
depicted in Fig.4, increasing more than fourfold a total amount of gates in the
Sbox combinational logic design.


Masked GF (2n ) Multiplier
4.3
Let us make an observation that equation (2) can be generalized to the equation
for ”masked multiplication™ in any ¬eld GF (2n ). If A, B, X, Y, Z ∈ GF (2n ), then

(A—B)•Z = [(A•X)—(B•Y )]•[X —(B•Y )]•[(A•X)—Y ]•[X —Y ]•Z. (3)

Hence, masked multiplication in GF (2n ) can be performed using conventional
multipliers with any architecture, and four additional XOR operations for mask
correction, as depicted in Fig. 5. An additional mask Z is also used to mask the
output product. Hence, this approach requires 4 ”normal” multipliers in GF (2n )
and 4 bitwise XOR operations.
In contrast, a straightforward application of the technique suggested in [28]
involves building masked multipliers by replacing every AND gate in a ”normal”
multiplier with a masked AND( MAND) circuit. For example, Fig.6 illustrates a
transformation of a conventional multiplier in GF (22 ) into a masked multiplier
using this approach.

1
To achieve balanced and independent intermediate results, the scheme is used a
freshly generated random bit z as a new mask.
Small Size, Low Power, Side Channel-Immune AES Coprocessor 121


A XB Y X Y Z




GF(2n) GF(2n) GF(2n) GF(2n)


XOR


XOR


XOR


XOR
MMUL


(A B) Z Z

Fig. 5. Building a masked multiplier in GF (2n ) from standard multipliers



a0 b0 a1 b1 a1 b0 a1 a0 b1


AND AND AND XOR


XOR AND


XOR


c0 c1

(a) Multiplier in GF(2^2)
a 1 a 0 b 1 x1 x 0 y 1 z3
a™0 b™ x0 y 0 z 0 a™1 b™ x1 y 1 z 1 a™1 b™0 x 1 y 0 z2
0 1



MAND XOR XOR
MAND MAND


MAND



XOR XOR XOR XOR


c™1
c™0 z™ z™
0 1



(b) Masked multiplier in GF(2^2)

Fig. 6. Transformation of a usual (a) multiplier into a masked (b) multiplier in GF (22 )
122 E. Trichina, T. Korkishko, and K.H. Lee

The implementation complexity of a masked multiplier can be calculated
knowing the complexity of a basic standard multiplier. For example, for the
popular Mastrovito multipliers in GF (2n ) [16], the space complexity expressed
in the number of AND and XOR gates is n2 and ≥ (n2 ’ 1), respectively. Then
the complexity of the generalized masked multiplier when compared with the
original straightforward implementation constitutes 9 to 29 % improvement in
the number of XOR gates, as can be seen from the Table 1.


Table 1. Comparison of the space complexity of generalized and straightforward
masked multipliers in GF (2n )

GF Irreducible Mastrovito mult. Straight. masked Proposed mult. Advantage
n Polynomial AND XOR AND XOR AND XOR %
2 [2,1,0] 4 3 16 22 16 20 9.1
4 [4,1,0] 16 15 64 94 64 76 19.1
8 [8,5,3,2,0] 64 84 256 382 256 284 25.7
16 [16,11,6,5,0] 256 281 1024 1534 1024 1084 29.3




5 Secure AES Coprocessor
When manipulating masked data, all operations in a round, apart from Sub-
Bytes, require simple mask corrections in a form of analogous computations on
masks that are carried out in parallel with the main computation ¬‚ow. This can
be achieved simply by duplicating hardware for all transformations but Sub-
Bytes.


G xor M M WZ F



InvAffine and InvAffine and
Field Isom Field Isom
Field Isom Field Isom




P xor M
M
W
-1
Secure X Z
F
P-1 xor F


InvField Isom InvField Isom
InvField Isom InvField Isom
and Affine and Affine




Sbox(G) xor F F



Fig. 7. Masked S-box
Small Size, Low Power, Side Channel-Immune AES Coprocessor 123

Another solution is to pipeline computations on masked data and on masks,
which halves the throughput, and for which additional 128-bit registers are re-
quired for mask values. Since each 1-bit register needs an equivalent of 6-7 gates,
there is no visible advantage for pipelining.
The structure of the Masked Sbox is depicted in Fig. 7. The inverse ¬eld
isomorphism map ’1 and a¬ne transformation f are both linear operations, and
they are merged to optimize the gate count; the same holds for map and f ’1
in decryption [14, 20]. A duplicate data path on the right hand side computes a
mask correction.
In this ¬gure, the box ”Secure X ’1 ” represents inversion in GF ((24 )2 ) im-
plemented as was de¬ned in Eq. 1. M is a random mask, (G xor M) represents
a masked input, Z, W, F are new masks used to ”refresh” the masked data at
the end of each operation in GF (24 ) and at the end of inversion in GF ((24 )2 ).
The details are given in Fig. 8 where every SMul2n box is, in fact, a generalized
masked multiplier in GF (24 ). The box X ’1 represents inversion in GF (24 ), and
is implemented in combinational logic in the same way as in [30], with every
AND being replaced with masked AND.
Altogether, 1.2K gate equivalent is required to implement one S-box which
is 25% better than the table lookup implementation.


FW M Z P xor M
8 4 8 8
4


ML MH
XOR
4 4

mA mB
Y
SMul
Z
x2g
XOR
2n
X



x2g
XOR

XOR

X-1


X X
mB mA mA mB
Y Y
MH
SMul SMul
2n 2n
Z Z
4 4

8 8

P-1 xor F
F


Fig. 8. Masked inverter in GF ((24 )2 )
124 E. Trichina, T. Korkishko, and K.H. Lee

6 Conclusion
The secure hardware AES module based on the described scheme has been fully
implemented in 0.18µm technology. As a balance between the throughput and
the gate count, 1 Sbox has been used for the main datapath, and one for the key
scheduling. Our implementation of MixColumn resembles the one reported in [29]
and exploits common subexpressions for MixColumn/InvMixColumn operations.
The general design ¬‚ow for a secure AES module is depicted in Fig. 9. First, a
Verilog model had been created and tested, after which Cadence Design System
Verilog-XL simulator was used to generate timing diagrams and to verify the
correctness of the design. When RTL code had been veri¬ed, the digital circuit
was synthesized with Synopsys Design Analyzer tool using Samsung 0.18 µm li-
braries. Power-compiler simulation data at 5 MHz were obtained with simulation
tool CubicWare.
The summary of the synthesis results are given in the table in Fig. 10. The
total gate count for the secure AES module (excluding I/O) with a ¬‚exible key
size is 20,06K, while for a standard 128-bit key the gate count can be reduced
to 16K. The power consumption is 2.0 mA in 0.18µm technology for a ¬‚exible
key size architecture, and 1.6 mA for a 128-bit key standard. With 0.13µm
technology, the power consumption can be reduced to 1.1 mA, which allows us
to use this module in applications such as GSM and ad-hoc networks. The secure
AES module has throughput 4 Mbps when operated at 5 MHz.
The described approach can be applied to other cryptographic coprocessors
that use arithmetic operations in Galois ¬elds. It provides comparable protection
as dynamic and di¬erential logic [26] and asynchronous dual rail circuits [21] at
the similar price in terms of the gate count and power consumption. Taking into


Phase 1: Functional simulation Phase 2: Logical synthesis

Initial
Test data
FIPS AES
specs

Logic
Test bench Digital circuit
synthesis
Verilog
Timing diagrams
(Synopsys
Verilog Design Analyzer)
model
simulator
(Cadence Design
Synthesis
System Verilog-XL)
options
Correct Synthesis report
Verilog model
No Samsung 0.18um
or specs
Correct results
standard cell library
produced?
Initiate next phase
Yes



Fig. 9. Hardware design ¬‚ow
Small Size, Low Power, Side Channel-Immune AES Coprocessor 125


Components Subcomponents Scalable key size 128 bit key size
Inverter in GF((2 4) 2)
Masked Sbox 1206 1206
Input -output processing 350 (180) 350 (180)
Subtotal for masked Sbox (for key scheduler) 1556 (1386) 1556 (1386)
Masked datapath Masked data 4900 4900
Mask 2732 2732
Subtotal for masked datapath 7632 7632
Masked key scheduler Masked key 6523 4835
Mask 4913 3225
Subtotal for masked key scheduler 11436 8060
Control Data control 300 300
Key control 1140 400
TOTAL for masked AES 20506 16390


Fig. 10. Gate count for a secure AES module


account that the latter techniques require new logic libraries, careful ”balancing”
of place and routing and new development tools which implies higher design and
production costs and longer time-to-market, our solution o¬ers a competitive
alternative to a hardware protection.



References
1. Akkar, M., Giraud, C.: An implementation of DES and AES, secure against some
attacks. Proc. Cryptographic Hardware and Embedded Systems: CHES 2001. Lec-
ture Notes in Computer Science 2162 (2001) 309-318
2. Anderson, R., Kuhn, M.: Low cost attacks on tamper resistant devices.
Proc.Security Protocols: IWSP 1997. Lecture Notes in Computer Science 1361
(1997) 125-136
3. Bl¨mmer, J., Merchan J. G., Krummel, V.: Provably secure masking of AES. IACR
o
Cryptology ePrint Archive Report 2004/101 (2004)
4. M. Bucci, L.Germani, M. Guglielmo, R. Luzzi, A. Tri¬letti: A simulation method-
ology for DPA resistance testing of cryptographic processors (manuscript) 2003
5. Chari, S., Jutla, C., Rao, J., Rohatgi, P.: Towards sound approaches to counteract
power-analysis attacks. Proc. Advances in Cryptology “ Crypto™99. Lecture Notes
in Computer Science 1666 (1999) 398-412,
6. Clavier, C., Coron, J-S., Dabbous, N.: Di¬erential power analysis in the presence
of hardware countermeasures. Proc. Cryptographic Hardware and Embedded Sys-
tems: CHES 2000. Lecture Notes in Computer Science 1965 (2000) 252-263
7. Daemen, J., Rijmen, V.: The design of Rijndael: AES - The Advanced Encryption
Standard. Springer-Verlag Berlin Heidelberg (2002)
8. Gandol¬, K., Mourtel, C., Oliver, F.: Electromagnetic analysis: concrete results.
Proc. Cryptographic Hardware and Embedded Systems: CHES 2001. Lecture Notes
in Computer Science 2162 (2001) 251-261
9. Goli¸, J., Tymen,Ch.: Multiplicative masking and power analysis of AES. Proc.
c
Cryptographic Hardware and Embedded Systems: CHES 2002. Lecture Notes in
Computer Science 2523 198-212
126 E. Trichina, T. Korkishko, and K.H. Lee

10. Goubin, L.: A sound method for switching between boolean and arithmetic mask-
ing. Proc. Cryptographic Hardware and Embedded Systems: CHES™01. Lecture
Notes in Computer Science 2162 (2001) 3-15
11. Kocher, P., Ja¬e, J., Jun, B.: Di¬erential power analysis. Proc. Advances in Cryp-
tology “ CRYPTO™99. K Lecture Notes in Computer Science 1666 (1999) 388-397
12. Kocher, P.: Timing attacks on implementations of Di¬e-Hellmann, RSA, DSS,
and other systems. Proc. Advances in Cryptology “ Crypto™96. Lecture Notes in
Computer Science 1109 (1996) 104-113
13. Kommerling, O., Kuhn, M.: Design principles for tamper-resistant smartcard pro-
cessors. Proc. USENIX Workshop on Smartcard Technology (Smartcard 99) (1998)
9-20
14. Lu, C. C., Tseng, S-Y.: Integrated design of AES (Advanced Encryption Sran-
dard) encryptor and decryptor. Proc. IEEE conf. on Application-Speci¬c Systems,
Architectures, and Processors (ASAP™02) (2002) 277-285
15. Mangard, S., Aigner, M., Dominikus, S.: A highly regular and scalable AES hard-
ware architecture. IEEE Transactions on Computers 52 no. 4 (2003) 483-491
16. E.D. Mastrovito, VLSI architectures for computations in Galois ¬elds, PhD Thesis,
Linkoping University, Linkoping, Sweden (1991)
17. Messerges, T.: Securing the AES ¬nalists against power analysis attacks. Proc.
Fast Software Encryption Workshop 2000. Lecture Notes in Computer Science
1978 (2000) 150-165
18. Messerges, T. S., Dabbish, E. A., Sloan, R. H.: Examining smart-card security
under the thread of power analysis. IEEE Trans. Computers. 51 no. 5 (2002) 541-
522
19. Messerges, T. S.: Using second-order power analysis to attack DPA resistant soft-
ware. Proc. Cryptographic Hardware and Embedded Systems “ CHES 2000. Lec-
ture Notes in Computer Science 1965 (2000) 238-251
20. Morioka, S., Satoh, A.: An optimized S-Box circuit architecture for low power
AES design. Proc. Cryptographic Hardware and Embedded Systems: CHES 2002.
Lecture Notes in Computer Science 2523 (2003) 272-186
21. Moore, S., Anderson, R., Cunningham, P., Mullins, R., Taylor, G.,: Improving
smart card security using self-timed circuits. Proc. Proceeding 8th IEEE Inter-
national Symposium on Asynchronous Circuits and Systems “ ASYNC™02. IEEE
(2002) 23-58
22. Paar, C.: E¬cient VLSI architectures for bit parallel computations in Galois ¬elds.
PhD Thesis, University of Essen, Germany (1994)
23. Quisquater, J. J., Samide, D.: Electromagnetic analysis (ema): measures and
counter-measures for smart cards. Proc. Smartcard Programming and Security.
Lecture Notes in Computer Science 2140 (2001) 200-210
24. Rudra, A., Dubey, P., Julta, C., Kumar, V., Rao, J., Rohatgi, P.: E¬cient Rijndael
implementation with composite ¬eld arithmetic. Proc. Cryptographic Hardware
and Embedded Systems “ CHES™01. Lecture Notes in Computer Science 2162
(2001) 175-188
25. Satoh, A., Morioka, S., Takano, K., Munetoh, S.: A compact Rijndael hardware ar-
chitecture with S-Box optimization. Proc. Advances in Cryptology “ ASIACRYPT
2001. Lecture Notes in Computer Science 2248 (2001) 239-254
26. Tiri, K., Akmal, M., Verbauwhede, I.: A dynamic and di¬erential CMOS logic with
signal independent power consumption to withstand di¬erential power analysis on
smart cards. Proc. IEEE 28th Europen Solid-State Circuit Conf. “ ESSCIRC™02
(2002)
Small Size, Low Power, Side Channel-Immune AES Coprocessor 127

27. E. Trichina, E., De Seta, D., Germani, L.: Simpli¬ed Adaptive Multiplicative Mask-
ing for AES and its secure implementation. Proc. Cryptographic Hardware and
Embedded Systems: CHES 2002. 2523 of Lecture Notes in Computer Science 2523
(2002) 277-285
28. Trichina, E.: Combinational logic design for AES SubByte transformation on

<<

. 5
( 7)



>>