but it should only be readable by the (unrevoked) devices in the specified subtree when the message is sent in a subtree broadcast. The motivation for organizing devices into trees and allowing for subtree broadcasts derives from the way many organizations are naturally structured. For example, the ICS Company may have several departments divided into groups, and groups may in turn have divisions located in different cities.


Efficient Tree-Based Revocation in Groups of Low-State Devices 513

After a secure broadcast system is set up, we need the ability to revoke devices, to avoid revealing messages beyond the current members. (We also consider the complexities of adding new devices, but the need for revocation is better motivated, since additions will typically be done in large blocks.) Thus, we are interested in the following complexity measures for a set of devices.

Broadcast cost: the number of messages the group controller (GC) must send in order to reach a subtree containing revoked devices.

Revocation cost: the number of messages the GC must send in order to revoke a device. Note that this cost is zero in the zero-state case.

Insertion cost: the number of messages the GC must send in order to add a device. Note that this cost parameter does not apply to the zero-state case.

Related Work. Broadcast/multicast encryption was first formally studied by Fiat and Naor [8], for the model where all the device keys are dynamic. Their algorithms satisfy the log-key restriction, however, only if no more than a constant number of revoked devices collude, which is probably not a realistic assumption. Several subsequent approaches have therefore strengthened the collusion resistance for broadcast encryption, and have done so using approaches where the group is represented by a fixed-degree tree with the group controller (GC) being the root and devices (users) being associated with leaves [3–7, 11, 13–15, 23, 24, 26, 28].

Of particular note is the logical key hierarchy (LKH) scheme proposed by Wallner et al. [26] and by Wong and Lam [28], which achieves O(1) broadcast cost and O(log n) revocation cost under the log-key restriction (for the dynamic case). The main idea of the LKH scheme is to associate devices with the leaves of a complete binary tree, assign a unique secret key to each node in this tree, and store at each device the keys of the nodes on the path from its leaf to the root. Some improvements of this scheme within the same asymptotic bounds are given by Canetti et al. [4, 5]. Using Boolean function minimization techniques, Chang et al. [6] deal with cumulative multi-user revocations and reduce the space complexity of the GC, i.e., the number of keys stored at the GC, from O(n) to O(log n). Wong et al. [27] generalize the results from binary trees to key graphs. In addition, Sherman and McGrew [21] improve the constant factors of the LKH scheme using a technique they call one-way function trees (OFT), to reduce the size of revocation messages. Naor and Pinkas [17] and Kumar et al. [12] also study multi-user revocations withstanding coalitions of colluding users, and Pinkas [18] studies how to restore an off-line user who has missed a sequence of group modifications with small message size. Also of note is the work of Rodeh et al. [19], who describe how to use AVL trees to keep the LKH tree balanced. Thus, the broadcast encryption problem is well studied for the case of fully-dynamic keys and devices organized in a complete or balanced k-ary tree (noting that a k-ary tree can be transformed into a binary tree with only a constant-factor increase in height). We are not familiar, however, with any previous work that deals with unbalanced trees whose structure must be maintained for the sake of subtree broadcasts.


514 Michael T. Goodrich, Jonathan Z. Sun, and Roberto Tamassia

There has also been some interesting recent work on broadcast encryption for zero-state devices (the static case). To begin, we note that several researchers have observed (e.g., see [10]) that the LKH approach can be used in the zero-state model under the log-key restriction to achieve O(r log(n/r)) broadcast cost. (We will review the LKH approach in more detail in the next section.) Naor, Naor, and Lotspiech [16] introduce an alternative approach to LKH, which they call the subset-difference revocation (SDR) approach. They show that if devices are allowed to store O(log^2 n) static keys, then the group controller can send out secure broadcasts using O(r) messages, i.e., the broadcast cost of their approach is O(r). Halevy and Shamir [10] improve the performance of the SDR scheme, using an approach they call layered subset difference (LSD). They show how to reduce the number of keys per device to be O(log^{3/2} n) while keeping the broadcast cost O(r). They also show how to further extend their approach to reduce the number of keys per device to be O(log^{1+ε} n), for any constant ε > 0, while increasing the broadcast cost to be O(r/ε). These latter results are obtained using a super-logarithmic number of device keys; hence, they violate the log-key restriction.

Our Results. We provide several new techniques for broadcast encryption under the log-key restriction. We study both the static (zero-state) and dynamic (low-state) versions of this model, and present efficient broadcast encryption schemes for devices organized in tree structures. We study new solutions for balanced trees, organizational charts, and multicast trees. We show in Table 1 the best bounds on the broadcast, insertion and revocation cost for each of the possible combinations of state and tree structure we consider, under the log-key restriction.

So, for example, we are able to match the O(log n) log-key bound of the static LKH scheme while also achieving the O(r) broadcast encryption complexity of the SDR scheme. Indeed, our scheme for this case, which we call the stratified subset difference (SSD) scheme, is the first scheme we are aware of for zero-state devices that simultaneously achieves both of these bounds. Moreover, we are able to match the best bounds for balanced trees, even for unbalanced high-degree organizational charts, which would not be possible using the natural conversion to a binary tree. Instead, we use biased trees [1] to do this conversion. But this approach is nevertheless limited, under the log-key restriction, to cases where the organizational chart has logarithmic height. Thus, for multicast trees, which can be very unbalanced (we even allow for linear height), we must take a different approach. In particular, in these cases, we extend the linking and cutting dynamic trees of Sleator and Tarjan [22] to the context of broadcast encryption, showing how to do subtree broadcasts in this novel context. This implies some surprisingly efficient performance bounds for broadcast encryption in multicast trees, for in severely unbalanced multicast trees the number of ancestors of the leaf associated with some device can be exponentially greater than the number of keys that device is allowed to store.

2 Preliminaries

The LKH Scheme for a Single Group. Let us briefly review the LKH scheme [26, 28], which is well known for key management in single groups. The LKH scheme organizes a group of n devices as a complete binary tree with the GC represented by the root and each user (that is, device) by a leaf, with a key stored at each node. Each device, as a leaf, knows the path from the root to itself and all the keys on this path. The GC, as the root, knows the whole tree and all the keys. (See Figure 1.)

To revoke a device u, the GC updates every key on the path from itself to u so that: (a) u cannot receive any updated key; and (b) any device other than u can receive an updated key if and only if it knows the old value of that key. The key updating is bottom-up, from the parent of u to the root. To distribute the new key at a node v: if v is the parent of u, then the GC encrypts the new key with the current key of the sibling of u; otherwise, the GC encrypts the new key with the current keys of the two children of v, respectively. This procedure guarantees (a) and (b). The total number of messages is O(log n). Broadcasting to a subtree simply involves encrypting a message using the key for the root of that subtree; hence, the broadcast cost is O(1).

In the static case, no updating is allowed. So, the GC must encrypt a broadcast using the root key of every maximal subtree containing no revoked devices. Thus, in the static case, LKH has broadcast cost O(r log(n/r)). (Recall that r is the number of revoked devices.) In both the static and dynamic case, however, the number of keys per device remains O(log n).
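The message accounting in the dynamic case can be made concrete with a small sketch (hypothetical code of ours, not the paper's): a complete binary tree of keys is stored heap-style, and revoking a leaf refreshes every key on its path bottom-up, sending one message per unrevoked child of each updated node, for 2 log2(n) - 1 messages in total.

```python
import os

# Hypothetical sketch (not the paper's code): LKH over 8 devices on a
# complete binary tree stored heap-style, where node i has children
# 2*i + 1 and 2*i + 2 and the leaves occupy indices 7..14.
NUM_DEVICES = 8
keys = [os.urandom(16) for _ in range(2 * NUM_DEVICES - 1)]

def revoke(leaf):
    """Refresh every key on the path from the revoked leaf to the root,
    bottom-up, and return the number of messages the GC sends."""
    path = [leaf]                       # the revoked leaf and its ancestors
    while path[-1] > 0:
        path.append((path[-1] - 1) // 2)

    messages = 0
    for node in path[1:]:               # every ancestor gets a fresh key
        keys[node] = os.urandom(16)
        for child in (2 * node + 1, 2 * node + 2):
            if child == leaf:
                continue                # the revoked device receives nothing
            messages += 1               # new key encrypted under keys[child]
    return messages

# Revoking any one of the 8 devices takes 2*log2(8) - 1 = 5 messages.
print(revoke(7))                        # -> 5
```

Note that the parent of the revoked leaf contributes a single message (encryption under the sibling's key), while every higher node contributes two, matching the bottom-up procedure described above.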

Fig. 1. The LKH scheme for key management in single groups.

Subset Difference Revocation (SDR). The subset difference revocation (SDR) approach of Naor, Naor, and Lotspiech [16] is also based on associating all the devices with the leaves of a complete binary tree T. Define a subtree B as the union of all the paths from the root to leaves associated with revoked devices. Some internal nodes in B have one child and some have two. Mark each internal node in B with two children as a “cut vertex” and imagine that we cut out from T the edges from each cut vertex to its two children. This would leave us with O(r) rooted subtrees, each containing some number of valid devices and one revoked leaf (which may have previously been an internal node). Each such subtree is therefore uniquely identified by its root v and its descendent node w that is revoked. The GC associates a secret key with each node v and defines a label L_w for each node w in the subtree T_v of T rooted at v: L_v is v's secret key, and for any internal node w in T_v with left child x and right child y, we define L_x = h_L(L_w) and L_y = h_R(L_w), where h_L and h_R are collision-resistant one-way hash functions that maintain the size of input strings. (Here we use the abstract model of h_L and h_R; Naor, Naor, and Lotspiech use in [16] a pseudo-random generator G that triples the size of the input, and take the left 1/3 and right 1/3 of the output to be the values of h_L and h_R.) Each leaf u in T_v stores the values of all the labels of the nodes that are siblings of the path from v to u (that is, nodes not on the path itself, but that are siblings of a node on the path). The key used to encode a subtree rooted at v with a revoked node w inside is L_w. Note that no descendent of w knows this value and no node outside of T_v can compute this value, which is what makes this a secure scheme. However, this scheme requires each device to hold O(log^2 n) keys, which violates the log-key restriction.
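The label derivation just described can be sketched as follows (a simplified stand-in of ours, not the authors' code: SHA-256 with distinct prefixes models the two one-way functions, and nodes of T_v are named by bit strings). A device stores the labels of the siblings of its path and can derive the subset key for a revoked node w exactly when it stores the label of w or of one of w's ancestors, since hashing only goes downward:

```python
import hashlib

# Hypothetical sketch of the SDR label derivation. Nodes of the subtree
# T_v are named by bit strings: "" is v itself, "0"/"1" are its children.
def h_left(label):                    # stand-ins for h_L and h_R
    return hashlib.sha256(b"L" + label).digest()

def h_right(label):
    return hashlib.sha256(b"R" + label).digest()

def label_of(node, root_secret):
    """Derive a node's label by hashing down from v's secret."""
    label = root_secret
    for bit in node:
        label = h_left(label) if bit == "0" else h_right(label)
    return label

def stored_labels(leaf, root_secret):
    """A leaf stores the labels of the siblings of the nodes on its path."""
    stored = {}
    for i in range(1, len(leaf) + 1):
        sibling = leaf[: i - 1] + ("1" if leaf[i - 1] == "0" else "0")
        stored[sibling] = label_of(sibling, root_secret)
    return stored

def can_derive(stored, target):
    """Hashing only goes downward, so a device can compute target's label
    iff it stores the label of target or of one of target's ancestors."""
    return any(target.startswith(node) for node in stored)

root_secret = b"\x00" * 32            # known only to the GC
revoked = "01"                        # the revoked node w; the key is L_w

assert can_derive(stored_labels("000", root_secret), revoked)      # valid
assert not can_derive(stored_labels("011", root_secret), revoked)  # under w
```

The two assertions mirror the security argument: the unrevoked leaf "000" holds the label of w's sibling chain, while the leaf "011" below w holds only labels that lie strictly below w in the hash chains.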

3 Improved Zero-State Broadcast Encryption

To improve the storage requirements for stateless broadcast encryption, so as to satisfy the log-key restriction, we take a data structuring approach. We begin with the basic approach of the subset difference (SDR) method. Without loss of generality, we assume that we are given a complete binary tree T with n leaves such that each leaf of T is associated with a different user. For any node v in T, let T_v denote the subtree rooted at v. In addition, for any node v and a descendent w of v, we let T_{v,w} denote the tree T_v - T_w, that is, all the nodes that are descendents of v but not of w. Given a set of r revoked users, we can use the same approach as SDR to partition T into at most O(r) subtrees T_{v,w} such that the union of all these trees represents the complete set of unrevoked users.
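The size of this partition can be sketched by marking the paths from the revoked leaves to the root and counting the branching (“cut”) vertices; each cut removes two edges, so r revoked leaves yield at most 2(r - 1) + 1 = O(r) subtrees. This is a hypothetical illustration of ours, using level numbering (root 1, node i has children 2i and 2i + 1):

```python
# Hypothetical sketch (ours, not the paper's): count the subtrees produced
# by cutting below every marked-path vertex with two marked children.
def num_subtrees(revoked_leaves):
    marked = set()                       # union of the revoked root-paths
    for leaf in revoked_leaves:
        i = leaf
        while i >= 1:
            marked.add(i)
            i //= 2
    cuts = sum(1 for i in marked if {2 * i, 2 * i + 1} <= marked)
    return 2 * cuts + 1                  # each cut removes two edges

# Complete tree over leaves 8..15; revoking 9 and 12: the two paths meet
# only at the root, the single cut vertex, leaving 2*1 + 1 = 3 subtrees.
assert num_subtrees([9, 12]) == 3
```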

A Linear-Work Solution. As a warm-up for our efficient broadcast encryption scheme, we first describe a scheme that uses O(log n) keys per device and O(r) messages per broadcast, but requires O(n) work per device to decrypt messages (we will then show how to improve the device work bound while keeping the other two asymptotic bounds unchanged).


The main idea is that the GC needs a way of encoding a message so that every leaf node in T_{v,w} can decrypt this message, but no other user (or group of users) can decrypt it. We note, as an additional space-saving technique, that we can name each node in T according to a level-numbering scheme (e.g., see [9]), so that the full structure of any tree T_{v,w} can be completely inferred using just the names of v and w. Moreover, any leaf u in T_{v,w} can determine its relative position in T_{v,w} immediately from its own name and the names of v and w.
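A minimal sketch of such a level-numbering scheme (assuming the common convention that the root is 1 and node i has children 2i and 2i + 1; the helper names are ours, for illustration only):

```python
# Hypothetical sketch: with root 1 and node i having children 2*i and
# 2*i + 1, the names alone encode the tree structure.
def depth(i):
    return i.bit_length() - 1

def is_ancestor(a, b):
    """True iff node a is an ancestor of node b (or a == b)."""
    while b >= a:
        if b == a:
            return True
        b //= 2                       # step from b to its parent
    return False

# A leaf u can locate itself in T_{v,w} from the three names alone:
v, w, u = 1, 5, 9                     # u's path to the root is 9 -> 4 -> 2 -> 1
assert is_ancestor(v, u) and not is_ancestor(w, u)
assert depth(u) - depth(v) == 3       # u lies 3 levels below v
```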

Let us focus on a specific subtree T_v for a node v in T. We define a set of leftist labels, L_w, and rightist labels, R_w, for each node w of T_v. In particular, let us number the nodes in two ways: first according to a left preorder numbering (which visits left children before right children), and second according to a right preorder numbering (which visits right children before left children) [9]. For a non-root node w in T_v, let x denote the predecessor of w in the left preorder numbering of the nodes in T_v. We define L_w to be h_L(L_x), where h_L is a collision-resistant one-way hash function. Likewise, we let y denote the predecessor of w in the right preorder numbering of the nodes in T_v. We define R_w to be h_R(R_y), where h_R is a (different) collision-resistant one-way hash function. We initialize these two hash chains by setting L_v and R_v to random seeds known only to the GC.

For each leaf node u in T_v, let x and y respectively denote the successors of u (if they exist) in the left and right preorder numberings of the nodes in T_v. The keys we store at u for T_v are L_x and R_y. (Note that we specifically do not store L_u nor R_u at u.) For the complete key distribution, we store these two keys at u for each subtree T_v containing u (there are O(log n) such subtrees). Given this key distribution, to encrypt a message for the nodes in T_{v,w}, the GC encrypts the message twice: once using L_w and once using R_w.

Decryption. Let us next consider how a leaf node u in T_{v,w} can decrypt a message sent to this subtree from the GC. Since w is not an ancestor of u, there are two possibilities: either w comes after u in the left preorder numbering of T_v, or w comes after u in the right preorder numbering. Since u can determine the complete structure of T_v and its relative position within this subtree from the names of v and w, it can implicitly represent T_v and know which of these two cases applies. So suppose the first case applies (as the second case is symmetric with the first). In this case, u starts with the label L_x it stores, where x is u's successor in the left preorder numbering of T_v. It then continues a left preorder traversal of T_v (which it can perform implicitly if memory is tight) until it reaches w. With each new node u encounters in this traversal, u makes another application of the one-way function h_L, computing the labels of each visited node. Thus, when u visits w in this traversal, it will have computed L_w and can then decrypt the message. This computation takes at most O(n) hash function computations.
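The decryption walk can be sketched as follows (hypothetical code of ours; SHA-256 with a prefix stands in for the one-way function, and nodes are bit strings). A leaf stores only the label of its left-preorder successor, yet recovers L_w for any w that comes later in the left preorder by repeated hashing:

```python
import hashlib

# Hypothetical sketch: leftist labels along the left preorder of a
# complete binary tree of depth 3, nodes named by bit strings ("" = root).
def h_left(x):
    return hashlib.sha256(b"L" + x).digest()

def left_preorder(node="", depth=3):
    yield node
    if len(node) < depth:
        yield from left_preorder(node + "0", depth)
        yield from left_preorder(node + "1", depth)

order = list(left_preorder())
cur = b"\x01" * 32                   # random seed, known only to the GC
L = {order[0]: cur}
for node in order[1:]:               # hash chain along the left preorder
    cur = h_left(cur)
    L[node] = cur

def recover_L(leaf, w):
    """A leaf stores only L of its left-preorder successor, yet recovers
    L_w for any w that comes later in the left preorder."""
    i, j = order.index(leaf), order.index(w)
    assert j > i                     # otherwise this direction cannot work
    label = L[order[i + 1]]          # the single stored label
    for _ in range(j - i - 1):
        label = h_left(label)        # walk the preorder, one hash per node
    return label

# Unrevoked leaf "000" decrypts a message keyed on L_w for w = "01":
assert recover_L("000", "01") == L["01"]
# ...while w precedes the stored successors of the leaves below it:
assert order.index("01") < order.index("010")
```

The symmetric rightist chain (not shown) handles the case where w precedes the leaf in the left preorder numbering.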

Security. Let us next consider the security of this scheme. First, observe that any node outside of T_v has no information that can be used to help decode a message for the nodes in some tree T_{v,w}, since L_v and R_v are chosen as random seeds and nodes outside of T_v receive no function of L_v or R_v.

So the security risk that remains is that leaf descendents of w might be able to decrypt a message sent to the nodes in T_{v,w}. Let U_w denote the set of leaf descendents of w. For each node u in U_w with successors x and y in the two preorder numberings, we store L_x and R_y at u. But none of these values for the nodes in U_w are useful for computing L_w or R_w without inverting a one-way function, since, in any preorder traversal, all the ancestors of a node are visited before the node is visited.

Thus, we have a key distribution strategy for the zero-state case that uses O(log n) keys per device and O(r) messages per broadcast, albeit with work at each device that could be linear in n. In the remainder of this section, we describe how we can reduce this work bound while keeping the other asymptotic bounds unchanged.

The Stratified Subset Difference (SSD) Method. Given a constant ε > 0, we can decrease the work per device to be O(n^ε) while increasing the space and message bounds by at most a factor of 1/ε, which should be a good trade-off in most applications. For example, when n is less than one trillion, n^{1/4} is less than 1,000. The method involves a stratified version of the scheme described above, giving rise to a scheme we call the stratified subset difference (SSD) method.

giving rise to a scheme we call the stratified subset difference (SSD) method.

We begin by marking each node at a depth that is a multiple of ⌈ε log n⌉ as “red;” the other nodes are colored “blue.” (See Figure 2.) Imagine further that we partition the tree T along the red nodes, subdividing T into maximal trees whose root and leaves are red and whose internal nodes are blue. Call each such tree a blue tree (even though its root and leaves are red). We then apply the method described above in each blue tree, as follows. For each leaf u in T, let v_1, v_2, ..., v_k be the red ancestors of u in top-down order. For i = 1, ..., k, let B_i be the blue tree rooted at v_i, and note that v_{i+1} (taking v_{k+1} = u) is a leaf of B_i. We store at node u the labels L_x and R_y (for each blue tree B_i in T), where x and y are the left and right preorder successors of v_{i+1} in B_i, respectively. Storing these labels increases the space per device by a factor of 1/ε.

Fig. 2. Illustration of the stratified subset difference (SSD) scheme.

To encrypt a message, the GC first performs the subdivision of T into the subtrees T_{v,w} as before. Then, the GC further partitions each tree T_{v,w} at the red levels, and encodes the broadcast message, using the previously described scheme, for each blue subtree rooted at a node on the path from v to w. This increases the broadcast size by at most a factor of 1/ε, but now the work needed by each device is reduced to computing the L or R labels in a blue tree, which has size at most O(n^ε). Thus, the work per device is reduced to O(n^ε) in this SSD scheme.
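The parameter bookkeeping of the stratification can be sketched numerically (a hypothetical helper of ours; `stride` is the red-level spacing ⌈ε log2 n⌉):

```python
import math

# Hypothetical sketch: with n devices and a fixed epsilon, red levels are
# spaced stride = ceil(eps * log2 n) apart, so a blue tree has height at
# most stride, size about 2**stride = O(n**eps), and there are about
# 1/eps strata.
def ssd_parameters(n, eps):
    stride = math.ceil(eps * math.log2(n))
    strata = math.ceil(math.log2(n) / stride)
    blue_tree_size = 2 ** stride
    return stride, strata, blue_tree_size

stride, strata, blue = ssd_parameters(2 ** 20, 0.25)
assert (stride, strata, blue) == (5, 4, 32)   # 32 = (2**20) ** 0.25
```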

Theorem 1. Given a balanced tree T with n devices, for zero-state broadcast encryption, the stratified subset difference (SSD) scheme for T uses O(log n) keys per device and has O(r) broadcast cost, where r is the number of revoked devices in the subtree receiving the broadcast. The work per device can be made to be O(n^ε) for any fixed constant ε > 0.

Moreover, as we have noted, the security of this scheme is as strong as that for SDR and LKH, i.e., it is resilient to collusions of any set of revoked devices.

4 A Biased Tree Scheme for an Organizational Chart

We recall that in the organizational chart structure for devices, we have a hierarchical partition of the devices induced by a tree T of height O(log n), but with unbounded branching at each internal node. Namely, the leaves of T are associated with the devices, and an internal node v of T represents the group (set) of devices associated with the leaves of the subtree rooted at v. Thus, sibling nodes of T are associated with disjoint groups, and each device belongs to a unique sequence of groups, whose nodes are on the path from the device's leaf to the root of T. Without loss of generality, we assume that an internal node of T has either all internal children (subgroups) or all external children (devices), and its group is called an interior group or exterior group accordingly. We consider four types of update operations: insertion and deletion (revocation) of a device or of an empty group. After each modification, we want to maintain both forward and backward security.

Biased Trees. Biased trees, introduced by Bent et al. [1], are trees balanced by the weights of leaves (typically set as access frequencies). There are two versions of biased trees: locally biased and globally biased. We denote by p(v), l(v) and r(v) the parent, left child and right child of a node v of a tree, and we use these notations cumulatively; e.g., l(p(p(v))) is the left child of the grandparent of v. The following definitions are taken from [1].

A biased search tree is a full binary search tree such that each node v has a weight w(v) and a rank rank(v). The weight of a leaf is initially assigned, and the weight of an internal node is the sum of the weights of its children. The rank of a node is a positive integer such that:

1. rank(v) = ⌊log w(v)⌋ if v is a leaf.
2. rank(v) ≤ rank(p(v)) - 1 if v is a leaf.
3. rank(v) ≤ rank(p(v)) and rank(v) ≤ rank(p(p(v))) - 1.


A locally biased search tree has the following additional property:

Local bias. For any v with rank(v) < rank(p(v)) - 1:

1. if v = l(p(v)), then either rank(r(p(v))) = rank(p(v)) or r(p(v)) is a leaf with rank rank(p(v)) - 1; if v = r(p(v)), then either rank(l(p(v))) = rank(p(v)) or l(p(v)) is a leaf with rank rank(p(v)) - 1; and

2. if v = l(p(v)) and p(v) = r(p(p(v))), then either rank(l(p(p(v)))) = rank(p(p(v))) or l(p(p(v))) is a leaf with rank rank(p(p(v))) - 1; if v = r(p(v)) and p(v) = l(p(p(v))), then either rank(r(p(p(v)))) = rank(p(p(v))) or r(p(p(v))) is a leaf with rank rank(p(p(v))) - 1.

A globally biased search tree has the following additional property:

Global bias. For any v with rank(v) < rank(p(v)) - 1, both of the two neighboring leaves of v, i.e., the right-most leaf to the left of v and the left-most leaf to the right of v, have rank at least rank(v).

Group Hierarchies and Biased Trees. Given an organizational chart T that represents a group hierarchy, we have to convert T to a binary tree before applying any encryption scheme for key management. Without loss of generality, we convert T to a binary tree T' that preserves the original group hierarchy. Each internal node v of T, representing a group G_v, becomes a special internal node in T' that still represents G_v and accommodates a GC. Additional internal nodes are added between v and its children in T (i.e., subgroups or devices) for the purpose of binarization. As a result, node v plus all its children in T, and the paths between them in T', form a binary subtree B_v in T', with v being the root and each of its children in T being a leaf. Note that, without special care, T' is likely to have super-logarithmic height, and balancing such a tree using standard techniques would destroy the group hierarchy.

Given a group hierarchy tree T, we assign a unit weight to each leaf and calculate the weights of the other nodes in T accordingly, i.e., the weight of each internal node v is the number of devices in the subtree of T rooted at v. We replace each node v with a biased binary tree B_v having the children of v as its leaves (using the weights of these nodes for the biasing). Thus, each subtree representing a group rooted at a node v in T can be initialized into a biased tree B_v without affecting the structure of the group hierarchy. Since the weight of each node v is an invariant, i.e., the weights of the root and leaves in every B_v are invariant, the initialization is well defined and can be done in each B_v independently. That is, combining all the biased B_v into T' will not change the structure of the original hierarchy represented by T. (See Figure 3.)
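The binarization step can be sketched as follows. As a simple stand-in for the biased trees of [1] (which this sketch does not implement), we pair the two lightest child subtrees first, Huffman-style, so that heavy children end up near the root; all names here are hypothetical:

```python
import heapq

# Hypothetical sketch: replace one multiway node by a binary tree over its
# children, pairing the two lightest subtrees first (a weight-balancing
# stand-in for the biased trees of Bent et al.).
def binarize(children_weights):
    """Return (tree, weight): a nested-pair binary tree whose leaves are
    the original child indices, plus the total weight."""
    heap = [(w, i, i) for i, w in enumerate(children_weights)]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, t1 = heapq.heappop(heap)
        w2, _, t2 = heapq.heappop(heap)
        # id() is only a tie-breaker so trees are never compared directly
        heapq.heappush(heap, (w1 + w2, id((t1, t2)), (t1, t2)))
    w, _, tree = heap[0]
    return tree, w

def depth_of(tree, leaf, d=0):
    """Depth of a leaf index inside the nested-pair tree, or None."""
    if tree == leaf:
        return d
    if not isinstance(tree, tuple):
        return None
    for sub in tree:
        found = depth_of(sub, leaf, d + 1)
        if found is not None:
            return found
    return None

# A group with one dominant subgroup (weight 90) and four small ones:
tree, total = binarize([90, 3, 3, 2, 2])
assert total == 100
assert depth_of(tree, 0) == 1     # the heavy child sits just below the root
```

The design point mirrors the text: keeping a heavy (large) subgroup near the root of its binary subtree is what lets access costs depend on weight ratios rather than on the number of children.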

Key Assignment. After initializing the biased trees B_v, we still assign a key to each node of T' as in the LKH scheme, and distribute the keys to the devices and GCs according to the following security properties:

1. each device u knows all, but only, the keys on the path from the root of T' to itself.
2. the GC of each group G_v knows all, but only, the keys of v's descendants in T' and those on the path from the root of T' to v.


Fig. 3. Binary tree T' consisting of biased trees B_v; the ranks of the nodes are shown.

Broadcast and Multicast. Using the above security properties and an appropriate signature or authentication mechanism [2, 4, 20, 25], the GC of each group G_v can send a message securely, with one key encryption, to G_v or to any subgroup or super-group of G_v, without any ambiguity.

Key Update and Tree Rebalance. As in the LKH scheme, keys should be updated after each insertion or deletion (revocation) of a device or group, so that the security properties 1 and 2 are maintained. Moreover, we should also rebalance T' to preserve the bias properties in each B_v. Assume that we can insert a leaf, delete a leaf, or update the weight of a leaf in a biased tree (by operations insert, delete, and reweight, respectively) while preserving both the security and bias properties. Then inserting or deleting a device can be done in three steps:

1. insert or delete a leaf in the exterior tree B_v containing the device;
2. update the weights in the interior trees B_v on the path to the root accordingly; and
3. update the keys on the path from the device to the root, bottom-up, as in the LKH scheme.

To insert or delete a group is a similar process, except that it starts with an insertion or deletion in an interior B_v. Therefore, insert, delete and reweight in each B_v suffice for all our hierarchy modifications in T'. Such operations preserving the bias properties were already given and analyzed in [1]; we now describe how to modify them to preserve the security properties, too.

Recall that the biased tree operations, including insert, delete and reweight, recursively call an operation tilt as the only subroutine to rebalance the biased tree structure [1]. Operation tilt performs a single rotation associated with a rank modification. A node loses descendants during a rotation only if it is rotated down, and losing descendants is the only chance of key leaks in the LKH scheme. Hence, to maintain the security properties 1 and 2 after any rotation in T', it is necessary and sufficient to update the key at the node rotated down. Observing that updating a single key and distributing the result of a rotation are both easy in our scheme, we can replace the tilt in [1] with our secure-tilt, which preserves the security properties 1 and 2. We give a detailed description of secure-tilt-left in Figure 4. Operation secure-tilt-right is analogous. Using secure-tilt as the subroutine in the biased tree operations, the scheme is as secure as LKH.
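Since Figure 4 is not reproduced here, the idea of secure-tilt-left can be sketched as a plain left rotation followed by refreshing the key of the node rotated down (the only node that loses descendants); this is our illustration, not the authors' pseudocode:

```python
import os

# Hypothetical sketch of secure-tilt-left: a standard left rotation, after
# which the node rotated down is the only one to have lost descendants,
# so only its key must be refreshed.
class Node:
    def __init__(self, left=None, right=None):
        self.left, self.right = left, right
        self.key = os.urandom(16)

def secure_tilt_left(x):
    """Rotate x's right child above x and refresh x's key."""
    y = x.right
    x.right, y.left = y.left, x     # single left rotation
    x.key = os.urandom(16)          # x lost the descendants under y
    # The GC then distributes the new key encrypted under the current keys
    # of x's two children, exactly as in an LKH key update (not shown).
    return y                        # y is the new subtree root

a, b, c = Node(), Node(), Node()
x = Node(left=a, right=Node(left=b, right=c))
old_key = x.key
root = secure_tilt_left(x)
assert root.left is x and root.right is c
assert x.left is a and x.right is b
assert x.key != old_key             # the rotated-down node's key is new
```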

Fig. 4. The algorithm for operation secure-tilt-left.

Efficiency of the Scheme. The insert, delete and reweight operations in biased trees are implemented as follows. join and split are the two basic biased tree operations. Operation join has global and local versions, which merge two global or local biased trees and return the root of the resulting tree; both versions work by recursively calling secure-tilt. Operation split splits a tree T into two biased trees T_1 and T_2, each containing all the leaves of T with their binary search keys less than and greater than the split key, respectively; split calls local-join as a subroutine and is applicable to both local and global biased trees. The other operations are based on join and split: operation insert splits T at the insertion point and then joins the two parts and the new leaf together; operation delete splits T at the deleted leaf and then joins T_1 and T_2 back, ignoring that leaf; and operation reweight splits T at the affected leaf, updates the weight of that leaf, and then joins the parts back into T.

The correctness and efficiency of our hierarchy modifications in T' follow those of the biased tree operations. Since our secure-tilt takes constant message size, just as the tilt in [1] takes constant time, all time bounds in [1] also hold as bounds on message size in our scheme.

This gives us the following.

Theorem 2. Given an organizational chart tree T with height O(log n) and n devices, under the log-key restriction, the dynamic biased binary tree scheme for T has O(1) broadcast cost and O(log n) revocation and insertion cost.

Proof. We show how to access a device u from the root r of T'; the analysis of the other operations is similar. Let v_1 = r, v_2, ..., v_k be the groups on the path from the root down to u's exterior group. Since the root of each B_{v_{i+1}} is a leaf of B_{v_i}, and each biased tree has the ideal access time O(log(w(root)/w(leaf))) [1], the access times along this chain of biased trees telescope, so the time to access u from r is O(log n).
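The access-time bound rests on a telescoping sum of biased-tree access times. Writing v_1 for the root, v_2, ..., v_k for the chain of groups down to the device, and w for node weight (so w(v_1) = n and the device's leaf has weight 1), and assuming the O(log(w(root)/w(leaf))) access time of [1], the costs collapse as follows:

```latex
\sum_{i=1}^{k-1} O\!\left(\log\frac{w(v_i)}{w(v_{i+1})}\right)
  = O\!\left(\log\prod_{i=1}^{k-1}\frac{w(v_i)}{w(v_{i+1})}\right)
  = O\!\left(\log\frac{w(v_1)}{w(v_k)}\right)
  = O(\log n).
```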


Thus, we satisfy the log-key restriction for any organizational chart with O(log n) height. We also note that applying our SSD approach to a static application of the techniques developed in this section results in a scheme using O(log n) keys per device and O(r) messages per broadcast for an organizational chart with O(log n) height.

5 A Dynamic Tree Scheme for a Multicast Tree

Let us next consider the multicast tree structure, which, for the sake of broadcast encryption, is similar to the organizational chart, except that the height of a multicast tree can be much larger than logarithmic (we even allow for linear height). For a multicast tree T with n devices and m groups, we give a scheme with O(log n) broadcast cost and O(log n) update cost, irrespective of the depth of T.

Dynamic Trees. Dynamic trees were first studied by Sleator and Tarjan [22] and used for various tree queries and network flow problems. The key idea is to partition a highly unbalanced tree into paths and to associate a biased tree structure, which is in some sense balanced, with each path. Thus, any node in the tree can be accessed, and any update to the tree can be done, in O(log n) time through the associated structure, regardless of the depth of the node or the height of the tree. The dynamic tree used in our scheme is specified by taking the partition-by-weight (size) approach and not having costs on the edges. The following definition refers to this specification.

A dynamic tree T is a weighted binary search tree where the weight w(v) of a node v is initially assigned if v is a leaf, or is the sum of the weights of v's children if v is an internal node. The edges of T are partitioned into solid and dashed edges, so that each node links with its heavier child by a solid edge and with the lighter child by a dashed edge. Thus T is partitioned into solid paths linked by dashed edges. We denote by head(P) the deepest node in a solid path P (head(P) must be a leaf of T by the “partition by weight” approach) and by tail(P) the upper-most one. Then the edge between any tail(P) and its parent must be dashed, and vice versa. For O(log n)-time operations, each solid path P is further organized as a global biased tree, denoted B_P, so that the nodes from head(P) to tail(P) become the leaves of B_P from left to right, and the weight of a leaf v in B_P is assigned as w(c), where c is the dashed child of v in T (the leaf head(P) keeps its own weight). Then T consists of these B_P, obtained by linking the root of each B_P with the parent of tail(P), unless tail(P) is the root of T. (See Figure 5.) To show that such a structure is well defined, let the root of B_P be x and the parent of tail(P) be z; the weight of x, i.e., the total weight of the leaves of B_P, equals w(tail(P)). Thus, x can replace tail(P) as a child of z.
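The partition-by-weight rule can be sketched directly (hypothetical code of ours; leaves are integers with unit weight, internal nodes are pairs). Because a dashed edge always leads to the lighter child, following one at least halves the remaining weight, so any root-to-leaf accessing path crosses at most log2 n dashed edges:

```python
# Hypothetical sketch: partition by weight over a nested-pair tree.
def weight(t):
    return 1 if not isinstance(t, tuple) else sum(map(weight, t))

def contains(t, leaf):
    return t == leaf or (isinstance(t, tuple)
                         and any(contains(s, leaf) for s in t))

def dashed_edges_to(t, leaf):
    """Number of dashed edges on the path from the root of t to leaf."""
    if t == leaf:
        return 0
    left, right = t
    heavy, light = ((left, right) if weight(left) >= weight(right)
                    else (right, left))
    if contains(heavy, leaf):
        return dashed_edges_to(heavy, leaf)        # solid edge, no charge
    return 1 + dashed_edges_to(light, leaf)        # dashed edge

# A degenerate depth-7 "caterpillar" over 8 leaves: despite its height,
# no accessing path crosses more than log2(8) = 3 dashed edges.
t = (((((((0, 1), 2), 3), 4), 5), 6), 7)
assert dashed_edges_to(t, 0) == 0 and dashed_edges_to(t, 7) == 1
assert max(dashed_edges_to(t, leaf) for leaf in range(8)) <= 3
```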


Fig. 5. Partition of a tree into solid paths, and the accessing path to a device.

Group Hierarchies and Dynamic Trees. We convert a multicast tree T to a binary tree T' that preserves the group hierarchy in T, in the same way as in the biased tree scheme. Instead of using a biased tree, we simply use a complete binary tree for each group's subtree B_v; we then assign a unit weight to each device and partition T' into a dynamic tree as above. A key is assigned to each node of each B_P. Since the root of each B_P becomes a child of a leaf of another B_P, each device becomes a descendant of a unique string of biased trees of paths B_{P_1}, ..., B_{P_k}. The way a device is accessed is not through the real path in T', but through the path in this string of biased trees. (See Figure 5.)

Broadcast and Multicast. Broadcast in a group G_v becomes a little more complicated because, although a device u in G_v is a descendant of v in T, v may not be on the accessing path from the root to u. However, if v belongs to a solid path P, then the accessing path to any descendant of v must pass through a node in the prefix of P from head(P) to v. So, to broadcast in G_v it is sufficient to encrypt the message with the keys in B_P that cover this prefix of P. In the full version, we show that, with the dynamic tree scheme, it takes O(log n) encryptions to broadcast a message in any group, both in the worst case and on average.

Key Updates. We follow the dynamic tree operations in [22] to modify the hierarchy, and update the keys on the accessing path of the updated item as in the LKH scheme. The dynamic tree operations dynamically change the solid path partition to guarantee the O(log n) running time, and such changes are carried out by the biased tree operations among the B_P; therefore, operation secure-tilt preserves the security properties along any accessing path. The dynamic tree operations we use are as follows:

splice(P): extend P by converting the edge from tail(P) to its parent into solid, and the previously solid edge between the parent and its other child into dashed.


Let be the upper most edge in such that is not the

heavier child of if there exist such edges in Then cut by converting

into dashed and into solid.

make the path from to (the real path in into a single

solid path by a series of splices.

convert every edge in who does not link to a heavier child of

parent into dashed by a series of slices.

combine two dynamic trees by making the parent of where

is the root of the first tree and is a node in the second.

divide a dynamic tree into two by deleting the edge between and

Inserting or deleting a device or a group corresponds to a link or cut operation, respectively. Each such dynamic tree operation takes O(log n) time and can be reduced to a series of join and split operations on biased trees. The algorithmic template for a dynamic tree operation is the expose-and-conceal strategy, described as follows:

1. perform expose on a node;

2. if the above expose operation violates the “partition by weight” property, restore the property by executing conceal on the appropriate path.
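The “partition by weight” rule behind these operations can be made concrete with a small sketch. We assume here (our reading of the standard Sleator–Tarjan convention, not code from the paper) that an edge to a child is solid exactly when the child's subtree weight exceeds half of its parent's; all names are illustrative.

```python
def subtree_weight(children, weights, v, memo):
    """Total weight of the subtree rooted at v (own weight plus descendants)."""
    if v not in memo:
        memo[v] = weights[v] + sum(subtree_weight(children, weights, c, memo)
                                   for c in children.get(v, []))
    return memo[v]

def solid_path_partition(children, weights, root):
    """Return the set of solid edges (parent, child): an edge is solid iff
    the child's subtree weight is more than half of the parent's.
    Since at most one child can exceed half the parent's weight, each node
    has at most one solid child, so solid edges form vertex-disjoint paths."""
    memo = {}
    subtree_weight(children, weights, root, memo)
    solid = set()
    for parent, kids in children.items():
        for c in kids:
            if 2 * memo[c] > memo[parent]:
                solid.add((parent, c))
    return solid
```

Running this on a small tree confirms the disjoint-path property that expose and conceal maintain dynamically.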

Since all the dynamic tree operations reduce to a series of biased tree operations, operation secure-tilt is still the only subroutine that adjusts the partition of the tree. Notice that the structure of T itself is never adjusted; only the accessing path to each device is adjusted through these operations. From [22], we know that, with partition by weight and with the solid paths represented as global biased trees, any dynamic tree operation takes O(log n) time. Since a hierarchy modification consists of a dynamic tree operation plus updating the keys in an accessing path, which is also of length O(log n), the efficiency of key updating for hierarchy modifications follows.

Theorem 3. Given a multicast tree T with n devices, structured in groups, under the log-key restriction, the dynamic tree scheme for T has O(log n) broadcast cost and O(log n) revocation and insertion cost.

A zero-state version can also be developed, which uses the biased trees and broadcast scheme to send messages to the unrevoked leaves in a multicast tree T, with each device storing O(log n) keys, using a number of broadcasts that depends on r, the number of revoked devices.

References

1. S. W. Bent, D. D. Sleator, and R. E. Tarjan. Biased search trees. SIAM J. Comput., 14(3):545–568, Aug. 1985.
2. D. Boneh, G. Durfee, and M. Franklin. Lower bounds for multicast message authentication. In Proc. EUROCRYPT 2001, LNCS 2045, pages 437–452, May 2001.


526 Michael T. Goodrich, Jonathan Z. Sun, and Roberto Tamassia

3. B. Briscoe. MARKS: Zero side effect multicast key management using arbitrarily revealed key sequences. In Proc. First International Workshop on Networked Group Communication (NGC'99), 1999.
4. R. Canetti, J. Garay, G. Itkis, D. Micciancio, M. Naor, and B. Pinkas. Multicast security: A taxonomy and some efficient constructions. In Proc. INFOCOM '99, volume 2, pages 708–716, New York, Mar. 1999.
5. R. Canetti, T. Malkin, and K. Nissim. Efficient communication–storage tradeoffs for multicast encryption. In Advances in Cryptology (EUROCRYPT'99), LNCS 1592, pages 459–474, 1999.
6. I. Chang, R. Engel, D. Kandlur, D. Pendarakis, and D. Saha. Key management for secure Internet multicast using boolean function minimization techniques. In Proc. IEEE INFOCOM, volume 2, pages 689–698, 1999.
7. G. D. Crescenzo and O. Kornievskaia. Efficient kerberized multicast in a practical distributed setting. In 4th International Conference on Information Security (ISC'01), LNCS 2200, pages 27–45, Oct. 2001.
8. A. Fiat and M. Naor. Broadcast encryption. In Advances in Cryptology (CRYPTO'93), LNCS 773, pages 480–491, 1994.
9. M. T. Goodrich and R. Tamassia. Algorithm Design: Foundations, Analysis and Internet Examples. John Wiley & Sons, New York, NY, 2002.
10. D. Halevy and A. Shamir. The LSD broadcast encryption scheme. In Advances in Cryptology (CRYPTO 2002), volume 2442 of LNCS, pages 47–60. Springer-Verlag, 2002.
11. E. Jung, A. X. Liu, and M. G. Gouda. Key bundles and parcels: Secure communication in many groups. In Proceedings of the 5th International Workshop on Networked Group Communications (NGC-03), LNCS 2816, pages 119–130, Munich, Germany, September 2003.
12. R. Kumar, S. Rajagopalan, and A. Sahai. Coding constructions for blacklisting problems without computational assumptions. In Advances in Cryptology (CRYPTO'99), LNCS 1666, pages 609–623, 1999.
13. D. A. McGrew and A. T. Sherman. Key establishment in large dynamic groups using one-way function trees. Technical Report 0755, TIS Labs at Network Associates Inc., Glenwood, MD, May 1998.
14. D. Micciancio and S. Panjwani. Optimal communication complexity of generic multicast key distribution. In EUROCRYPT 2004, pages 153–170, 2004.
15. M. J. Mihaljević. Key management schemes for stateless receivers based on time varying heterogeneous logical key hierarchy. In ASIACRYPT 2003, LNCS 2894, pages 137–154, 2003.
16. D. Naor, M. Naor, and J. Lotspiech. Revocation and tracing schemes for stateless receivers. In CRYPTO'01, volume 2139 of LNCS, pages 41–62. Springer-Verlag, 2001.
17. M. Naor and B. Pinkas. Efficient trace and revoke schemes. In Proc. Financial Crypto 2000, Feb. 2000.
18. B. Pinkas. Efficient state updates for key management. In Proc. ACM Workshop on Security and Privacy in Digital Rights Management, 2001.
19. O. Rodeh, K. P. Birman, and D. Dolev. Using AVL trees for fault tolerant group key management. International Journal of Information Security, pages 84–99, 2001.
20. B. Schneier. Applied Cryptography, 2nd Ed. John Wiley & Sons, 1996.
21. A. T. Sherman and D. A. McGrew. Key establishment in large dynamic groups using one-way function trees. IEEE Trans. Software Engineering, 29(5):444–458, 2003.


22. D. D. Sleator and R. E. Tarjan. A data structure for dynamic trees. J. Computer and System Sciences, 26:362–391, 1983.
23. J. Snoeyink, S. Suri, and G. Varghese. A lower bound for multicast key distribution. In IEEE INFOCOM 2001, volume 1, pages 422–431, 2001.
24. R. Tamassia and N. Triandopoulos. Computational bounds on hierarchical data processing with applications to information security. Technical report, Center for Geometric Computing, Brown University, 2004.
25. H. F. Tipton and M. Krause, editors. Information Security Management Handbook, 4th Ed. Auerbach, 1999.
26. D. M. Wallner, E. G. Harder, and R. C. Agee. Key management for multicast: issues and architecture. Internet draft draft-wallner-key-arch-01.txt, Sep. 1998.
27. C. K. Wong, M. Gouda, and S. S. Lam. Secure group communications using key graphs. In Proc. ACM SIGCOMM'98, volume 28, pages 68–79, 1998.
28. C. K. Wong and S. S. Lam. Digital signatures for flows and multicasts. IEEE/ACM Transactions on Networking, 7:502–513, 1999.


Privacy-Preserving Datamining

on Vertically Partitioned Databases

Cynthia Dwork and Kobbi Nissim

Microsoft Research, SVC, 1065 La Avenida, Mountain View CA 94043

{dwork,kobbi}@microsoft.com

Abstract. In a recent paper Dinur and Nissim considered a statistical

database in which a trusted database administrator monitors queries

and introduces noise to the responses with the goal of maintaining data

privacy [5]. Under a rigorous definition of breach of privacy, Dinur and

Nissim proved that unless the total number of queries is sub-linear in the

size of the database, a substantial amount of noise is required to avoid a

breach, rendering the database almost useless.

As databases grow increasingly large, the possibility of being able to

query only a sub-linear number of times becomes realistic. We further

investigate this situation, generalizing the previous work in two impor-

tant directions: multi-attribute databases (previous work dealt only with

single-attribute databases) and vertically partitioned databases, in which

different subsets of attributes are stored in different databases. In addi-

tion, we show how to use our techniques for datamining on published

noisy statistics.

Keywords: Data Privacy, Statistical Databases, Data Mining, Vertically

Partitioned Databases.

1 Introduction

In a recent paper Dinur and Nissim considered a statistical database in which

a trusted database administrator monitors queries and introduces noise to the

responses with the goal of maintaining data privacy [5]. Under a rigorous defini-

tion of breach of privacy, Dinur and Nissim proved that unless the total number

of queries is sub-linear in the size of the database, a substantial amount of noise

is required to avoid a breach, rendering the database almost useless1. However,

when the number of queries is limited, it is possible to simultaneously preserve

privacy and obtain some functionality by adding an amount of noise that is a

function of the number of queries. Intuitively, the amount of noise is sufficiently

large that nothing specific about an individual can be learned from a relatively

small number of queries, but not so large that information about sufficiently

strong statistical trends is obliterated.

1 For unbounded adversaries, the amount of noise (per query) must be linear in the size of the database; for polynomially bounded adversaries, Ω(√n) noise is required.

M. Franklin (Ed.): CRYPTO 2004, LNCS 3152, pp. 528–544, 2004.

© International Association for Cryptologic Research 2004


As databases grow increasingly massive, the notion that the database will be

queried only a sub-linear number of times becomes realistic. We further inves-

tigate this situation, significantly broadening the results in [5], as we describe

below.

Methodology. We follow a cryptography-flavored methodology, where we con-

sider a database access mechanism private only if it provably withstands any

adversarial attack. For such a database access mechanism any computation over

query answers clearly preserves privacy (otherwise it would serve as a privacy

breaching adversary). We present a database access mechanism and prove its

security under a strong privacy definition. Then we show that this mechanism

provides utility by demonstrating a datamining algorithm.

Statistical Databases. A statistical database is a collection of samples that are

somehow representative of an underlying population distribution. We model

a database as a matrix, in which rows correspond to individual records and

columns correspond to attributes. A query to the database is a set of indices

(specifying rows), and a Boolean property. The response is a noisy version of the

number of records in the specified set for which the property holds. (Dinur and

Nissim consider one-column databases containing a single binary attribute.) The

model captures the situation of a traditional, multiple-attribute, database, in

which an adversary knows enough partial information about records to “name”

some records or select among them. Such an adversary can target a selected

record in order to try to learn the value of one of its unknown sensitive at-

tributes. Thus, the mapping of individuals to their indices (record numbers) is

not assumed to be secret. For example, we do not assume the records have been

randomly permuted.

We assume each row is independently sampled from some underlying distri-

bution. An analyst would usually assume the existence of a single underlying

row distribution and try to learn its properties.

Privacy. Our notion of privacy is a relative one. We assume the adversary knows the underlying distribution on the data and, furthermore, may have some a priori information about specific records, e.g., “the a priori probability that at least one of the attributes in record 400 has value 1 is .38”. We analyze privacy with respect to any possible underlying (row) distributions D = (D_1, ..., D_n), where the ith row is chosen according to D_i. This partially models a priori knowledge an attacker has about individual rows (i.e., D_i is conditioned on the attacker's knowledge of the ith record). Continuing with our informal example, privacy is breached if the a posteriori probability (after the sequence of queries has been issued and responded to) that “at least one of the attributes in record 400 has value 1” differs from the a priori probability “too much”.

Multi-attribute Sub-linear Queries (SuLQ) Databases. The setting studied in [5],

in which an adversary issues only a sublinear number of queries (SuLQ) to a

single attribute database, can be generalized to multiple attributes in several


natural ways. The simplest scenario is of a single SuLQ database,

queried by specifying a set of indices and a Boolean function. The re-

sponse is a noisy version of the number of records in the specified set for which

the function, applied to the attributes in the record, evaluates to 1. A more

involved scenario is of multiple single-attribute SuLQ databases, one for each

attribute, administered independently. In other words, our database

is vertically partitioned into single-attribute databases. In this case, the chal-

lenge will be datamining: learning the statistics of Boolean functions of the at-

tributes, using the single-attribute query and response mechanisms as primitives.

A third possibility is a combination of the first two: a database that is vertically partitioned into two (or more) databases with (possibly overlapping) sets of attributes. Each database can handle functional queries, and the goal is to learn relationships between the functional outputs, e.g., “If f_1 holds, does this increase the likelihood that f_2 holds?”, where f_i is a function on the attribute values for records in the ith database.

1.1 Our Results

We obtain positive datamining results in the extensions to the model of [5]

described above, while maintaining the strengthened privacy requirement:

1. Multi-attribute SuLQ databases: The statistics of every Boolean function of the attributes can be learned.2 Since the queries here are powerful (any function), it is not surprising that statistics for any function can be learned. The strength of the result is that statistics are learned while maintaining privacy.

2. Multiple single-attribute SuLQ databases: We show how to learn the statistics of any 2-ary Boolean function. For example, we can learn the fraction of records having neither attribute 1 nor attribute 2, or the conditional probability of having attribute 2 given that one has attribute 1. The key innovation is a procedure for testing the extent to which one attribute, say α, implies another attribute, say β, in probability, meaning that Pr[β | α] ≥ Pr[β] + ε, where ε can be estimated by the procedure.

3. Vertically partitioned SuLQ databases: The constructions here are a combination of the results for the first two cases: the attributes are partitioned into two (possibly overlapping) sets, each managed by a multi-attribute SuLQ database. We can learn all 2-ary Boolean functions of the outputs of the results from the two databases.

We note that a single-attribute database can be simulated in all of the above

settings; hence, in order to preserve privacy, the sub-linear upper bound on

queries must be enforced. How this bound is enforced is beyond the scope of this

work.
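Mechanically, enforcing the sub-linear query bound amounts to a counter in front of the answering mechanism; how the bound is enforced in a real deployment is, as stated, out of scope. A minimal sketch (all names are ours):

```python
class QueryBudget:
    """Wrapper enforcing a hard cap on the number of answered queries;
    once the budget T is exhausted, further queries are refused."""

    def __init__(self, answer_fn, T):
        self.answer_fn = answer_fn  # underlying (noisy) answering mechanism
        self.remaining = T          # sub-linear query budget

    def ask(self, q, g):
        if self.remaining <= 0:
            raise RuntimeError("query budget exhausted")
        self.remaining -= 1
        return self.answer_fn(q, g)
```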

2 Note that because of the noise, statistics cannot be learned exactly; an additive error, whose order depends on the number of records in the database, is incurred. The same is true for single-attribute databases.


Datamining on Published Statistics. Our technique for testing implication in

probability yields surprising results in the real-life model in which confidential

information is gathered by a trusted party, such as the census bureau, who pub-

lishes aggregate statistics. Describing our results by example, suppose the bureau

publishes the results of a large (but sublinear) number of queries. Specifically, for

every, say, triple of attributes and for each of the eight conjunctions

of literals over three attributes the bureau

publishes the result of several queries on these conjunctions. We show how to

construct approximate statistics for any binary function of six attributes. (In general, using data published for conjunctions of k attributes, it is possible to approximately learn statistics for any function of 2k attributes.) Since the published data are the results of

SuLQ database queries, the total number of published statistics must be sub-

linear in the size of the database. Also, in order to keep the error down,

several queries must be made for each conjunction of literals. These two facts constrain the value of k and the total number of attributes for which the result is meaningful.

is meaningful.

1.2 Related Work

There is a rich literature on confidentiality in statistical databases. An excellent survey of work prior to the late 1980s was made by Adam and Wortmann [2]. Using their taxonomy, our work falls under the category of output perturbation. However, to our knowledge, the only work that has exploited the opportunities for privacy inherent in the fact that, with massive databases, the actual number of queries will be sublinear is Sect. 4 of [5] (joint work with Dwork). That work only considered single-attribute SuLQ databases.

Franconi and Merola give a more recent survey, with a focus on aggregated data released via web access [10]. Evfimievski, Gehrke, and Srikant, in the Introduction to [7], give a very nice discussion of work on the randomization of data, in which data contributors (e.g., respondents to a survey) independently add noise to their own responses. A special issue (Vol. 14, No. 4, 1998) of the Journal of Official Statistics is dedicated to disclosure control in statistical data. A discussion of some of the trends in the statistical research, accessible to the non-statistician, can be found in [8].

Many papers in the statistics literature deal with generating simulated data

while maintaining certain quantities, such as marginals [9]. Other widely-studied

techniques include cell suppression, adding simulated data, releasing only a sub-

set of observations, releasing only a subset of attributes, releasing synthetic or

partially synthetic data [13,12], data-swapping, and post-randomization. See

Duncan (2001) [6].

R. Agrawal and Srikant began to address privacy in datamining in 2000 [3].

That work attempted to formalize privacy in terms of confidence intervals (in-

tuitively, a small interval of confidence corresponds to a privacy breach), and

also showed how to reconstruct an original distribution from noisy samples (i.e.,

each sample is the sum of an underlying data distribution sample and a noise

sample), where the noise is drawn from a certain simple known distribution.


This work was revisited by D. Agrawal and C. Aggarwal [1], who noted that it

is possible to use the outcome of the distribution reconstruction procedure to

significantly diminish the interval of confidence, and hence breach privacy. They

formulated privacy (loss) in terms of mutual information, taking into account

(unlike [3]) that the adversary may know the underlying distribution on the data

and “facts of life” (for example, that ages cannot be negative). Intuitively, if the

mutual information between the sensitive data and its noisy version is high, then

a privacy breach occurs. They also considered reconstruction from noisy sam-

ples, using the EM (expectation maximization) technique. Evfimievski, Gehrke, and Srikant [7] criticized the usage of mutual information for measuring privacy,

noting that low mutual information allows complete privacy breaches that hap-

pen with low but significant frequency. Concurrently with and independently of

Dinur and Nissim [5] they presented a privacy definition that related the a priori

and a posteriori knowledge of sensitive data. We note below how our definition

of privacy breach relates to that of [7,5].

A different and appealing definition has been proposed by Chawla, Dwork,

McSherry, Smith, and Wee [4], formalizing the intuition that one™s privacy is

guaranteed to the extent that one is not brought to the attention of others. We

do not yet understand the relationship between the definition in [4] and the one

presented here.

There is also a very large literature in secure multi-party computation. In

secure multi-party computation, functionality is paramount, and privacy is only

preserved to the extent that the function outcome itself does not reveal infor-

mation about the individual inputs. In privacy-preserving statistical databases,

privacy is paramount. Functions of the data that cannot be learned while pro-

tecting privacy will simply not be learned.

2 Preliminaries

Notation. We denote by neg(n) (read: negligible) a function that is asymptotically smaller than any inverse polynomial; that is, for all c > 0 and all sufficiently large n, we have neg(n) < n^(−c). We write neg for neg(n) when n is clear from context.

2.1 The Database Model

In the following discussion, we do not distinguish between the case of a verti-

cally partitioned database (in which the columns are distributed among several

servers) and a “whole” database (in which all the information is in one place).

We model a database as an n × k binary matrix d. Intuitively, the k columns of d correspond to Boolean attributes, and the n rows of d correspond to individuals, where d_{i,j} = 1 iff attribute j holds for individual i. We sometimes refer to a row as a record.

Let D be a distribution on {0, 1}^k. We say that a database d is chosen according to distribution D if every row of d is chosen according to D, independently of the other rows (in other words, d is chosen according to D^n).
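This row-sampling model can be sketched directly; for simplicity the sketch below restricts to product distributions over the attributes (the model itself allows an arbitrary joint distribution per row), and all names are ours.

```python
import random

def sample_database(n, attr_probs, rng=None):
    """Sample an n-by-k binary matrix: each row is drawn independently,
    with attribute j set to 1 with probability attr_probs[j]."""
    rng = rng or random.Random(0)
    return [[1 if rng.random() < p else 0 for p in attr_probs]
            for _ in range(n)]
```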


In our privacy analysis we relax this requirement and allow each row i to be chosen from a (possibly) different distribution D_i. In that case we say that the database is chosen according to D = (D_1, ..., D_n).

Statistical Queries. A statistical query is a pair (q, g), where q ⊆ {1, ..., n} indicates a set of rows in d and g denotes a Boolean function on attribute values. The exact answer a_{q,g} to the query is the number of rows of d in the set q for which g holds (evaluates to 1): a_{q,g} = Σ_{i∈q} g(d_i). We write (q, j) when the function g is the projection onto the jth attribute; in that case (q, j) is a query on a subset of the entries in the jth column: a_{q,j} = Σ_{i∈q} d_{i,j}. When we look at vertically partitioned single-attribute databases, the queries will all be of this form.

Perturbation. We allow the database algorithm to give perturbed (or “noisy”) answers to queries. We say that an answer â_{q,g} is within E perturbation if |â_{q,g} − a_{q,g}| ≤ E. Similarly, a database algorithm A is within E perturbation if, for every query (q, g), the answer A(q, g) is within E perturbation with all but negligible probability. The probability is taken over the randomness of the database algorithm A.
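As a concrete, simplified illustration of this query-and-perturbation model, the sketch below answers a statistical query with the exact count plus zero-mean binomial noise (anticipating the binomial perturbation used in Section 4); the parameter R and all function names are our own assumptions, not the paper's notation.

```python
import random

def exact_answer(db, q, g):
    """Exact answer to the query (q, g): the number of rows indexed by q
    on which the Boolean function g evaluates to 1."""
    return sum(1 for i in q if g(db[i]))

def sulq_answer(db, q, g, R, rng=None):
    """Perturbed answer: exact count plus zero-mean binomial noise
    (a sum of R fair bits, recentered to expectation 0)."""
    rng = rng or random.Random(0)
    noise = sum(rng.randint(0, 1) for _ in range(R)) - R // 2
    return exact_answer(db, q, g) + noise
```

With R = 0 the mechanism degenerates to exact answers; larger R trades accuracy for privacy.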

2.2 Probability Tool

Proposition 1. Let X_1, ..., X_T be random variables such that |X_i| ≤ α and |E[X_i | X_1, ..., X_{i−1}]| ≤ β. Then Pr[|Σ_{i=1..T} X_i| > Tβ + λ(α + β)√T] ≤ 2e^(−λ²/2).

Proof. Let Y_ℓ = Σ_{i=1..ℓ} (X_i − E[X_i | X_1, ..., X_{i−1}]); hence Y_0, Y_1, ..., Y_T is a martingale with increments bounded by α + β. Using Azuma's inequality3 we get that Pr[|Y_T| > λ(α + β)√T] ≤ 2e^(−λ²/2). As |Σ_{i=1..T} X_i| ≤ |Y_T| + Tβ, the proposition follows.

3 Privacy Definition

We give a privacy definition that extends the definitions in [5,7]. Our definition

is inspired by the notion of semantic security of Goldwasser and Micali [11]. We

first state the formal definition and then show some of its consequences.

Let p⁰_{i,j} be the a priori probability that d_{i,j} = 1 (taking into account that we assume the adversary knows the underlying distribution D_i of row i). In

3 Let X_0, X_1, ..., X_T be a martingale with |X_{i+1} − X_i| ≤ 1 for all 0 ≤ i < T, and let λ > 0 be arbitrary. Azuma's inequality says that then Pr[|X_T − X_0| > λ√T] ≤ 2e^(−λ²/2).


general, for a Boolean function f, we let p⁰_{i,f} be the a priori probability that f(d_i) = 1. We analyze the a posteriori probability that f(d_i) = 1, given the answers to T queries as well as all the values in all the rows of d other than row i. We denote this a posteriori probability p^T_{i,f}.

Confidence. To simplify our calculations we follow [5] and define a monotonically increasing 1-1 mapping conf : (0, 1) → R as follows: conf(p) = log(p / (1 − p)). Note that a small additive change in conf implies a small additive change in p.4 Let conf⁰_{i,f} = conf(p⁰_{i,f}) and conf^T_{i,f} = conf(p^T_{i,f}). We write our privacy requirements in terms of the random variables Δconf_{i,f}, defined as5 Δconf_{i,f} = conf^T_{i,f} − conf⁰_{i,f}.
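Taking conf to be the log-odds map (our assumption for this sketch, following the convention of [5]), the mapping and its inverse look as follows; the names are ours.

```python
import math

def conf(p):
    """Monotonically increasing 1-1 map from (0, 1) to the reals:
    conf(p) = log(p / (1 - p)) (log-odds)."""
    return math.log(p / (1.0 - p))

def inv_conf(c):
    """Inverse map: recover the probability from its confidence value."""
    return 1.0 / (1.0 + math.exp(-c))
```

Since the derivative of inv_conf is p(1 − p) ≤ 1/4, an additive change of Δ in conf changes p by at most Δ/4, which matches the remark that a small additive change in conf implies a small additive change in p.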

Definition 1. A database access mechanism is (δ, T)-private if for every distribution D = (D_1, ..., D_n) on the rows, for every row index i, for every function f : {0, 1}^k → {0, 1}, and for every adversary making at most T queries, it holds that Pr[Δconf_{i,f} > δ] is negligible. The probability is taken over the choice of each row in d according to D, and the randomness of the adversary as well as the database access mechanism.

A target set F is a set of Boolean functions of the attributes (one can think of the functions in F as being selected by an adversary; these represent information it will try to learn about someone). A target set F is δ-safe if Δconf_{i,f} ≤ δ for all rows i and all f ∈ F. Let F be a target set. Definition 1 implies that, under a (δ, T)-private database mechanism, F is δ-safe except with negligible probability (by a union bound over the functions in F).

Proposition 2. Consider a database with k attributes. Let F be the target set containing all the Boolean functions over these attributes. Then,

Proof. Let F_∧ be the target set containing all conjuncts of the attributes. We have that |F_∧| ≤ 2^k, and hence F_∧ is δ-safe except with negligible probability. To prove the proposition we show that F is safe whenever F_∧ is. Let f be a Boolean function. Express f as a disjunction of conjuncts of attributes:

4 The converse does not hold: conf grows logarithmically in 1/(1 − p) as p approaches 1, and logarithmically in 1/p as p approaches 0.

5 Our choice of defining privacy in terms of Δconf_{i,f} is somewhat arbitrary; one could rewrite our definitions (and analysis) in terms of the a priori and a posteriori probabilities. Note however that limiting Δconf_{i,f} in Definition 1 is a stronger requirement than just limiting the change in probability.


Similarly, express as the disjunction of the remaining

conjuncts: (So

We have:

Let and maximize

maximize Us-

ing we get that

where the last inequality holds as

Finding Very Heavy Sets. Let f be a target function. Our privacy requirement implies that it is infeasible to find a “very” heavy set for f, that is, a set on which the incidence of f significantly exceeds its a priori expectation. Such a set would violate our privacy requirement, as it would allow guessing the value of f for a random record in the set.

Relationship to the Privacy Definition of [7]. Our privacy definition extends the definition of ρ₁-to-ρ₂ privacy breaches of [7]. Their definition is introduced with respect to a scenario in which several users send their sensitive data to a center. Each user randomizes his data prior to sending it. A ρ₁-to-ρ₂ privacy breach occurs if, with respect to some property P, the a priori probability that P holds for a user is at most ρ₁, whereas the a posteriori probability may grow beyond ρ₂ (i.e., in a worst case scenario with respect to the coins of the randomization operator).

4 Privacy of Multi-attribute SuLQ Databases

We first describe our SuLQ Database algorithm, and then prove that it preserves

privacy.

Let and define for some

(taking will work). To simplify notation, we write for

for (and later for

Note that the noise is a binomial random variable with expectation 0. In our analysis we will neglect the case where the noise deviates largely from


zero, as the probability of such an event is extremely small:

In particular, this implies that our SuLQ database algorithm is within

perturbation.

We will use the following proposition.

Proposition 3. Let B be a binomially distributed random variable with expec-

tation 0 and standard deviation Let L be the random variable that takes the

value Then

1. For this value is

bounded by

2. E[L] = O(1/R), where the expectation is taken over the random choice of B.

Proof. 1. The equality follows from the symmetry of the binomial distribution (i.e., Pr[B = b] = Pr[B = −b]).

To prove the bound consider

Using the limits on B and the definition of R we

get that this value is bounded by

2. Using the symmetry of the Binomial distribution we get:

Our proof of privacy is modeled on the proof in Section 4 of [5] (for single-attribute databases). We extend their proof (i) to queries (q, g), where g is any Boolean function, and (ii) to privacy of arbitrary Boolean functions f of the attributes.

Theorem 1. Let and for and

Then the SuLQ algorithm is within

perturbation.

Note that whenever bounding the adversary™s number of

queries to allows privacy with perturbation magnitude less than

Proof. Let be as in the theorem and recall for some

Let the queries issued by the adversary be denoted

Let be the perturbed answers to

these queries. Let and

We analyze the a posteriori probability that given the answers to

the first queries and (where denotes the entire database

except for the ith row). Let Note that

(of Section 3), and (due to the independence of rows in


By the definition of conditional probability6 we get

Note that the probabilities are taken over the coin flips of the SuLQ algorithm

and the choice of In the following we analyze the numerator (the denominator

is analyzed similarly).

The last equality follows as the rows in are chosen independently of each

other. Note that given both and the random variable is independent

of Hence, we get:

Next, we observe that although depends on the dependence is weak.

More formally, let be such that and Note

that whenever we have that

When, instead, we can relate and

via Proposition 3:

Lemma 1. Let be such that Then

where and

and is noise that yields when

Proof. Consider the case Writing

and the proof follows from

Proposition 3. Similarly for

Note that the value of does not depend on

Taking into account both cases and we get

6 I.e., Pr[A | B] = Pr[A ∧ B] / Pr[B].


Let be the probability, over that Letting be such that

we have

and similarly

Putting the pieces together we get that

Define a random walk on the real line with To

conclude the proof we show that (with high probability) T steps of the random

walk do not suffice to reach distance From Proposition 3 and Lemma 1 we get

that

and

Using Proposition 1 with we get that for all

5 Datamining on Vertically Partitioned Databases

In this section we assume that the database is chosen according to D^n for some underlying distribution D on rows, where D is independent of the size of the database. We also assume that the database is sufficiently large that the true database statistics are representative of D. Hence, in the sequel, when we write things like Pr[α], we mean the probability, over the entries in the database, that α holds.

Let α and β be attributes. We say that α implies β in probability if the conditional probability of β given α exceeds the unconditional probability of β. The ability to measure implication in probability is crucial to datamining. Note that since Pr[β] is simple to estimate well, the problem reduces to obtaining a


good estimate of Pr[β | α]. Moreover, once we can estimate these conditional probabilities, we can use Bayes' Rule and de Morgan's Laws to determine the statistics for any Boolean function of attribute values.
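For instance, once estimates of Pr[α], Pr[β], and Pr[β | α] are in hand, the statistics of the remaining two-attribute events follow by elementary identities. A sketch (exact arithmetic, ignoring estimation error; the names are ours):

```python
def two_attribute_stats(p_alpha, p_beta, p_beta_given_alpha):
    """Derive the statistics of Boolean combinations of two attributes
    from Pr[alpha], Pr[beta], and Pr[beta | alpha]."""
    both = p_alpha * p_beta_given_alpha          # Pr[alpha AND beta] (Bayes)
    either = p_alpha + p_beta - both             # inclusion-exclusion
    neither = 1.0 - either                       # de Morgan
    only_beta = p_beta - both                    # Pr[(NOT alpha) AND beta]
    p_alpha_given_beta = both / p_beta           # Bayes' rule
    return {"and": both, "or": either, "neither": neither,
            "not_a_and_b": only_beta, "a_given_b": p_alpha_given_beta}
```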

Our key result for vertically partitioned databases is a method, given two single-attribute SuLQ databases with attributes α and β respectively, to measure Pr[β | α]. For more general cases of vertically partitioned data, assume a database is partitioned into several databases with (possibly overlapping) sets of attributes. We can use functional queries to learn the statistics on Boolean functions of the attributes in the ith database, and then use the results for two single-attribute SuLQ databases to learn binary Boolean functions of any two functions f_i (on attributes in database i) and f_j (on attributes in database j), where i ≠ j.

5.1 Probabilistic Implication

In this section we construct our basic building block for mining vertically parti-

tioned databases.

We assume two SuLQ databases of size n, with attributes α and β respectively. When α implies β in probability with a gap of ε, we write α ⇒_ε β, meaning that Pr[β | α] ≥ Pr[β] + ε. We note that Pr[α] and Pr[β] are easily computed within small error, simply by querying the two databases on large subsets. Our goal is to determine ε or, equivalently, Pr[β | α]; the method will be to determine whether α ⇒_{ε′} β for a given ε′, and then to estimate ε by binary search on ε′.

Notation. We let and

Let X be a random variable counting the number of times holds when we

take N samples from Then and

Let

Note that Substituting for we get

and hence (by another application of Eq. (1))

We define the following testing procedure to determine, given ε, whether α ⇒_ε β. Step 1 finds a heavy (but not very heavy) set for attribute α; that is, a set for which the number of records satisfying α exceeds the expected number by more than a standard deviation. Note that since the noise


is small, the heavy set really does have an excess of records for which α holds. Step 2 queries β on this heavy set. If the incidence of β on this set sufficiently exceeds (as a function of ε) the expected incidence of β, then the test returns “1” (i.e., success). Otherwise it returns 0.

Theorem 2. For the test procedure

1. If then

2. If then

where for the advantage is constant, and for

the advantage with constant

In the following analysis we neglect the difference between the exact and perturbed quantities since, as noted above, the perturbation contributes only low-order terms (we also neglect some other low-order terms). Note that it is possible to compute all the required constants for Theorem 2 explicitly, in polynomial time, without neglecting these low-order terms. Our analysis does not attempt to optimize constants.

Proof. Consider the random variable corresponding to given

that is biased according to Step 1 of By linearity of expectation, together

with the fact that the two cases below are disjoint, we get that

The last step uses Eq. (3). Since the distribution is symmetric around its mean, we obtain the first part of the claim.

To get the second part of the claim we use the de Moivre-Laplace theorem

and approximate the binomial distribution with the normal distribution so that

we can approximate the variance of the sum of two distributions (when holds

and when does not hold) in order to obtain the variance of conditioned

on We get:


Privacy-Preserving Datamining on Vertically Partitioned Databases 541

Assuming N is large enough, we can neglect the terms involving Hence,

The transition from the second to third lines follows from

We have that the probability distribution on is a Gaussian with mean

and variance at most and respectively.

To conclude the proof, we note that the conditional probability mass of

exceeding its own mean by is at most

where Φ is the cumulative distribution function for the normal distribution.

For constant Δ this yields a constant advantage. By running the Test procedure enough times, we can determine with sufficiently high confidence which "side" of the interval Δ is on (if it is not inside the interval). We proceed by binary search to narrow in on Δ. We get:

Theorem 3. There exists an algorithm that invokes the test

times and outputs such that
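The strategy behind Theorem 3 — amplify a weakly biased test by repetition, then drive a binary search with the amplified test — can be sketched abstractly. The oracle below is a hypothetical stand-in that answers "is the true gap at least delta?" correctly with probability 1/2 plus a constant advantage; all names and parameters are illustrative assumptions, not the paper's.

```python
import random

random.seed(3)

TRUE_GAP = 0.23     # hidden quantity the search estimates
ADVANTAGE = 0.1     # hypothetical per-invocation advantage, as in Theorem 2

def noisy_test(delta):
    """Toy stand-in for the Test procedure: correct with prob 1/2 + ADVANTAGE."""
    truth = TRUE_GAP >= delta
    return truth if random.random() < 0.5 + ADVANTAGE else not truth

def amplified_test(delta, reps=1501):
    # Majority vote over O(log(1/confidence) / ADVANTAGE**2) repetitions makes
    # the error probability negligible (Chernoff bound).
    return sum(noisy_test(delta) for _ in range(reps)) > reps / 2

def estimate_gap(tolerance=0.01):
    lo, hi = 0.0, 1.0
    while hi - lo > tolerance:
        mid = (lo + hi) / 2
        if amplified_test(mid):
            lo = mid          # gap is at least mid: search the upper half
        else:
            hi = mid          # gap is below mid: search the lower half
    return (lo + hi) / 2

gap = estimate_gap()
```

The search makes O(log(1/tolerance)) comparisons, each costing a polylogarithmic number of test invocations, matching the multiplicative structure of the bound in Theorem 3.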

6 Datamining on Published Statistics

In this section we apply our basic technique for measuring implication in probability to the real-life model in which confidential information is gathered by a trusted party, such as the census bureau, which publishes aggregate statistics. The published statistics are the results of queries to a SuLQ database. That is, the census bureau generates queries and their noisy responses, and publishes the results.


In more detail:


Let d denote the number of attributes (columns), and let k ≤ d be fixed (typically, k will be small; see below). For every k-tuple of attributes, and for each of the 2^k conjunctions of literals over these attributes (α1 ∧ α2 ∧ α3, α1 ∧ α2 ∧ ¬α3, and so on), the bureau publishes the result of some number of queries on these conjunctions. More precisely, a query set is selected, and noisy statistics for all 2^k conjunctions of literals are published for that query set. This is repeated some fixed number of times.

To see how this might be used, suppose k = 3 and we wish to learn if α1 ∧ α2 implies α3 in probability. We know from the results in Section 4 that we need to find a heavy set for α1 ∧ α2 and then to query the database on that set with the function α1 ∧ α2 ∧ α3. Moreover, we need to do this several times (for the binary search). If the number of repetitions is sufficiently large, then with high probability such query sets are among the published queries. Since we query all triples (generally, k-tuples) of literals for each query set, all the necessary information is published.

The analyst need only follow the instructions for learning the strength of the implication in probability, looking up the results of the required queries (rather than randomly selecting the sets and submitting the queries to the database).

As in Section 4, once we can determine implication in probability, it is easy to determine (via Bayes' rule) the statistics for the corresponding conjunction. In other words, we can determine the approximate statistics for any conjunction of literals of attribute values. Now the procedure for arbitrary functions is conceptually simple. Consider a function f of k attribute values. The analyst first represents the function as a truth table: for each possible conjunction of literals over the k attributes, the function has value either zero or one. Since these conjunctions of literals are mutually exclusive, the probability (overall) that the function has value 1 is simply the sum of the probabilities that each of the positive (one-valued) conjunctions occurs. Since we can approximate each of these statistics, we obtain an approximation for their sum. Thus, we can approximate the statistics for each of the 2^(2^k) Boolean functions of k attributes. It remains to analyze the quality of the approximations.
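The truth-table computation is easy to make concrete. The table below is hypothetical data: it assigns each of the 2^3 conjunctions of literals over three attributes an approximated probability, and the frequency of any Boolean function of the attributes is then the sum over the function's one-valued rows.

```python
# Hypothetical approximated statistics: each sign pattern over 3 attributes
# maps to the (noisy) probability of the corresponding conjunction of literals.
# The patterns are mutually exclusive and sum to 1.
conjunct_prob = {
    (1, 1, 1): 0.10, (1, 1, 0): 0.15, (1, 0, 1): 0.05, (1, 0, 0): 0.20,
    (0, 1, 1): 0.10, (0, 1, 0): 0.10, (0, 0, 1): 0.25, (0, 0, 0): 0.05,
}

def function_frequency(f):
    """Pr[f = 1]: sum the probabilities of the one-valued rows of f's truth table."""
    return sum(p for bits, p in conjunct_prob.items() if f(*bits))

# Any Boolean function of the three attributes works, e.g. the XOR or the
# disjunction of the first two attributes:
p_xor = function_frequency(lambda a, b, c: a ^ b)
p_or = function_frequency(lambda a, b, c: a | b)
```

Since each summand carries its own estimation error, the error of the sum grows with the number of one-valued rows, which is exactly what the analysis below accounts for.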

Let T be an upper bound on the number of queries permitted by the SuLQ database algorithm. Let d and k be as above: d is the total number of attributes, and statistics for k-tuples of attributes will be published. Let ε be the (combined) additive error achieved for all conjuncts with high probability.

Privacy is preserved as long as the total number of published queries does not exceed T (Theorem 1). To determine utility, we need to understand the error introduced by the summation of estimates.


If our test results in an additive error for each possible conjunct of literals, the truth-table method described above allows us to compute the frequency of every function of the literals within the accumulated additive error (a lot better in many cases). We require that each estimate be within its error bound with high probability; hence, by a union bound, the probability that a 'bad' conjunct exists (one for which the estimation error exceeds the bound) is kept small.

Plugging these parameters into Theorem 3, we obtain, for each conjunction of literals, the number of subsets on which we need to make queries. For each subset we query each of the 2^k conjuncts of the k attributes, so the total number of queries is the product of these quantities.

For constant k, the total number of queries is polynomial in d. To see our gain, compare this with the naive publishing of statistics for all conjuncts of 2k attributes, resulting in on the order of C(d, 2k) · 2^(2k) queries.
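The gain can be sized up with a quick count under hypothetical parameters. One reading of the comparison (an assumption on our part, since the exact expressions are not reproduced here) is: the scheme publishes noisy statistics for all 2^k conjunctions over each of the C(d, k) k-tuples per repetition, while naive publication over 2k-tuples needs C(d, 2k) · 2^(2k) statistics.

```python
from math import comb

d, k, repetitions = 50, 3, 100   # hypothetical parameter choices

# Published statistics under the scheme: C(d, k) k-tuples, 2**k conjunctions
# each, repeated `repetitions` times.
scheme_queries = comb(d, k) * 2 ** k * repetitions

# Naive alternative: statistics for every conjunction over every 2k-tuple.
naive_queries = comb(d, 2 * k) * 2 ** (2 * k)
```

With these (illustrative) numbers the scheme publishes roughly 1.6 × 10^7 statistics against roughly 10^9 for the naive approach, and the gap widens rapidly with d.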

7 Open Problems

Datamining of 3-ary Boolean Functions. Section 5.1 shows how to use two SuLQ

databases to learn that As noted, this allows estimating

for any Boolean function Consider the case where there exist

three SuLQ databases for attributes In order to use our test procedure

to compute one has to either to find heavy sets for (having

bias of order or, given a heavy set for to decide whether it is also

heavy w.r.t. It is not clear how to extend the test procedure of Section 5.1

in this direction.

Maintaining Privacy for All Possible Functions. Our privacy definition (Definition 1) requires, for every function f, that with high probability the confidence gain is limited by some value. If the number of functions of interest is small, then, via the union bound, we get that with high probability the confidence gain is kept small for all of them simultaneously.

For a large number of functions, the union bound does not guarantee simultaneous privacy for all of them. However, the privacy of a randomly selected function is (with high probability) preserved. It is conceivable that (e.g., using cryptographic measures) it is possible to render infeasible the task of finding a function whose privacy was breached.

Dependency Between Database Records. We explicitly assume that the database records are chosen independently of each other, according to some underlying distribution. We are not aware of any work that does not make this assumption (implicitly or explicitly). An important research direction is to come up with definitions and analyses that work in a more realistic model of weak dependency between database entries.

References

1. D. Agrawal and C. Aggarwal, On the Design and Quantification of Privacy Preserving Data Mining Algorithms, Proceedings of the 20th Symposium on Principles of Database Systems, 2001.
2. N. R. Adam and J. C. Wortmann, Security-Control Methods for Statistical Databases: A Comparative Study, ACM Computing Surveys 21(4), pp. 515-556, 1989.
3. R. Agrawal and R. Srikant, Privacy-Preserving Data Mining, Proc. of the ACM SIGMOD Conference on Management of Data, pp. 439-450, 2000.
4. S. Chawla, C. Dwork, F. McSherry, A. Smith, and H. Wee, Toward Privacy in Public Databases, submitted for publication, 2004.
5. I. Dinur and K. Nissim, Revealing Information While Preserving Privacy, Proceedings of the Twenty-Second ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, pp. 202-210, 2003.
6. G. Duncan, Confidentiality and Statistical Disclosure Limitation, in N. Smelser and P. Baltes (eds.), International Encyclopedia of the Social and Behavioral Sciences, New York: Elsevier, 2001.
7. A. V. Evfimievski, J. Gehrke, and R. Srikant, Limiting Privacy Breaches in Privacy Preserving Data Mining, Proceedings of the Twenty-Second ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, pp. 211-222, 2003.
8. S. Fienberg, Confidentiality and Data Protection Through Disclosure Limitation: Evolving Principles and Technical Advances, IAOS Conference on Statistics, Development and Human Rights, September 2000, available at http://www.statistik.admin.ch/about/international/fienberg_final_paper.doc
9. S. Fienberg, U. Makov, and R. Steele, Disclosure Limitation and Related Methods for Categorical Data, Journal of Official Statistics 14, pp. 485-502, 1998.
10. L. Franconi and G. Merola, Implementing Statistical Disclosure Control for Aggregated Data Released Via Remote Access, Working Paper No. 30, United Nations Statistical Commission and European Commission, joint ECE/EUROSTAT work session on statistical data confidentiality, April 2003, available at http://www.unece.org/stats/documents/2003/04/confidentiality/wp.30.e.pdf
11. S. Goldwasser and S. Micali, Probabilistic Encryption and How to Play Mental Poker Keeping Secret All Partial Information, Proceedings of the 14th ACM Symposium on Theory of Computing (STOC), pp. 365-377, 1982.
12. T. E. Raghunathan, J. P. Reiter, and D. B. Rubin, Multiple Imputation for Statistical Disclosure Limitation, Journal of Official Statistics 19(1), pp. 1-16, 2003.
13. D. B. Rubin, Discussion: Statistical Disclosure Limitation, Journal of Official Statistics 9(2), pp. 461-469, 1993.
14. A. Shoshani, Statistical Databases: Characteristics, Problems and Some Solutions, Proceedings of the 8th International Conference on Very Large Data Bases (VLDB'82), pp. 208-222, 1982.


Optimal Perfectly Secure Message Transmission

K. Srinathan*, Arvind Narayanan, and C. Pandu Rangan

Department of Computer Science and Engineering,

Indian Institute of Technology, Madras, Chennai-600036, India

ksrinath@cs.iitm.ernet.in, arvindn@meenakshi.cs.iitm.ernet.in,

rangan@iitm.ernet.in

Abstract. In the perfectly secure message transmission (PSMT) problem, two synchronized non-faulty players (or processors), the Sender S and the Receiver R, are connected by n wires (each of which facilitates 2-way communication); S has a message that he wishes to send to R; after exchanging messages in phases, R should correctly obtain S's message, while an adversary listening on and actively controlling any set of t (or fewer) wires should have no information about S's message.

We measure the quality of a protocol for securely transmitting a message using the following parameters: the number of wires, the number of phases, and the total number of bits transmitted. The optima for the number of wires and the number of phases are respectively 2t + 1 and 2. We prove that any 2-phase reliable message transmission protocol, and hence any secure protocol, over n wires out of which at most t are faulty must transmit at least n/(n − 2t) bits per message bit. While no known protocol is simultaneously optimal in both communication and phase complexity, we present one such optimal protocol for the case when the size of the message is large enough. Note that privacy comes for free if the message is large enough.

We also demonstrate how randomness can effectively improve the phase complexity. Specifically, while the (worst-case) lower bound on the number of phases is 2, we design an efficient optimally tolerant protocol for PSMT that terminates in a single phase with arbitrarily high probability.
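To make the privacy requirement concrete, here is a minimal sketch — not the authors' protocol — of the standard building block behind many PSMT constructions: Shamir secret sharing over the wires. The Sender sends one share of a random degree-t polynomial per wire; any t wires together reveal nothing about the message, while any t + 1 intact shares reconstruct it by Lagrange interpolation. Reliability against actively corrupted wires, the hard part of PSMT, is deliberately not handled here, and all parameter values are illustrative.

```python
import random

random.seed(4)

P = 2_147_483_647          # prime modulus for the share field (2**31 - 1)
n, t = 7, 3                # n wires, at most t observed/corrupted wires

def share(secret):
    """Split `secret` into n shares; any t shares are information-theoretically useless."""
    coeffs = [secret] + [random.randrange(P) for _ in range(t)]   # degree-t polynomial
    return [(x, sum(c * pow(x, e, P) for e, c in enumerate(coeffs)) % P)
            for x in range(1, n + 1)]                             # one share per wire

def reconstruct(points):
    """Lagrange interpolation at 0 over GF(P); needs any t + 1 correct shares."""
    secret = 0
    for i, (xi, yi) in enumerate(points):
        num = den = 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (-xj) % P
                den = den * (xi - xj) % P
        secret = (secret + yi * num * pow(den, P - 2, P)) % P     # den**-1 via Fermat
    return secret

msg = 123456789
shares = share(msg)                       # one share sent down each wire
recovered = reconstruct(shares[: t + 1])  # any t + 1 shares suffice
```

A full PSMT protocol layers error correction and verification on top of this privacy primitive so that actively corrupted shares can be detected and discarded.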

Finally, we consider the case when the adversary is mobile, that is, he could corrupt a different set of wires in different phases. Again, the optima for the number of wires and phases are respectively 2t + 1 and 2; however, we show that the communication complexity is higher, irrespective of the number of phases. We present the first protocol that is