
the network transmits a message to the entire group, including the revoked devices,
but when the message is sent as a subtree broadcast it should be readable only
by the (unrevoked) devices in the specified subtree. The motivation for organizing
devices into trees and allowing for subtree broadcasts is derived from the way
many organizations are naturally structured. For example, the ICS Company
may have several departments divided into groups, and groups may in turn have
divisions located in different cities.

After a secure broadcast system is set up, we need to have the ability to
revoke devices to avoid revealing messages beyond the current members. (We
also consider the complexities of adding new devices, but the need for revocation
is better motivated, since additions will typically be done in large blocks.) Thus,
we are interested in the following complexity measures for a set of devices.
Broadcast cost: the number of messages the group controller (GC) must send
in order to reach a subtree containing revoked devices.
Revocation cost: the number of messages the GC must send in order to revoke
a device. Note that this cost is zero in the zero-state case.
Insertion cost: the number of messages the GC must send in order to add a
device. Note that this cost parameter does not apply to the zero-state case.

Related Work. Broadcast/multicast encryption was first formally studied by
Fiat and Naor [8], for the model where all the device keys are dynamic. Their al-
gorithms satisfy the log-key restriction, however, only if no more than a constant
number of revoked devices collude, which is probably not a realistic assumption.
Several subsequent approaches have therefore strengthened the collusion resis-
tance for broadcast encryption, and have done so using approaches where the
group is represented by a fixed-degree tree with the group controller (GC) being
the root and devices (users) being associated with leaves [3–7, 11, 13–15, 23, 24,
26, 28].
Of particular note is the logical key hierarchy (LKH) scheme proposed by
Wallner et al. [26] and by Wong and Lam [28], which achieves O(1) broadcast
cost and O(log n) revocation cost under the log-key restriction (for the dynamic
case). The main idea of the LKH scheme is to associate devices with the leaves
of a complete binary tree, assign a unique secret key to each node in this tree,
and store at each device the keys on the path from its leaf to the
root. Some improvements of this scheme within the same asymptotic bounds are
given by Canetti et al. [4, 5]. Using Boolean function minimization techniques,
Chang et al. [6] deal with cumulative multi-user revocations and reduce the
space complexity of the GC, i.e., the number of keys stored at the GC, from
O(n) to O(log n). Wong et al. [27] generalize the results from binary trees to
key graphs. In addition, Sherman and McGrew [21] improve the constant factors
of the LKH scheme using a technique they call one-way function trees (OFT),
to reduce the size of revocation messages. Naor and Pinkas [17] and Kumar et
al. [12] also study multi-user revocations withstanding coalitions of colluding
users, and Pinkas [18] studies how to restore an off-line user who has missed a
sequence of group modifications with reduced message size. Also of note is the work
of Rodeh et al. [19], who describe how to use AVL trees to keep the LKH tree
balanced. Thus, the broadcast encryption problem is well studied for the case
of fully-dynamic keys and devices organized in a complete or balanced k-ary
tree (noting that a k-ary tree can be transformed into a binary tree with only
a constant-factor increase in height). We are not aware, however, of any previous
work that deals with unbalanced trees whose structure must be maintained for
the sake of subtree broadcasts.

There has also been some interesting recent work on broadcast encryption
for zero-state devices (the static case). To begin, we note that several researchers
have observed (e.g., see [10]) that the LKH approach can be used in the zero-
state model under the log-key restriction to achieve O(r log(n/r)) broadcast
cost. (We will review the LKH approach in more detail in the next section.)
Naor, Naor, and Lotspiech [16] introduce an alternative approach to LKH, which
they call the subset-difference revocation (SDR) approach. They show that if
devices are allowed to store O(log^2 n) static keys, then the group controller
can send out secure broadcasts using O(r) messages, i.e., the broadcast cost
of their approach is O(r). Halevy and Shamir [10] improve the performance of
the SDR scheme, using an approach they call layered subset difference (LSD).
They show how to reduce the number of keys per device to be O(log^{3/2} n)
while keeping the broadcast cost O(r). They also show how to further extend
their approach to reduce the number of keys per device to be O(log^{1+ε} n), for
any constant ε > 0, while increasing the broadcast cost to be O(r/ε). These latter
results are obtained using a super-logarithmic number of device keys; hence, they
violate the log-key restriction.

Our Results. We provide several new techniques for broadcast encryption
under the log-key restriction. We study both the static (zero-state) and dynamic
(low-state) versions of this model, and present efficient broadcast encryption
schemes for devices organized in tree structures. We study new solutions for
balanced trees, organizational charts, and multicast trees. We show in Table 1
the best bounds on the broadcast, insertion and revocation cost for each of the
possible combinations of state and tree structure we consider, under the log-key
restriction.




So, for example, we are able to match the O(log n) key bound of the static LKH
scheme while also achieving the O(r) broadcast encryption complexity of the
SDR scheme. Indeed, our scheme for this case, which we call the stratified subset
difference (SSD) scheme, is the first scheme we are aware of for zero-state de-
vices that simultaneously achieves both of these bounds. Moreover, we are able
to match the best bounds for balanced trees, even for unbalanced high-degree
organizational charts, which would not be possible using the natural conversion
to a binary tree. Instead, we use biased trees [1] to do this conversion. But this
approach is nevertheless limited, under the log-key restriction, to cases where the
organizational chart has logarithmic height. Thus, for multicast trees, which can
be very unbalanced (we even allow for linear height), we must take a dif-
ferent approach. In particular, in these cases, we extend the linking and cutting
dynamic trees of Sleator and Tarjan [22] to the context of broadcast encryption,
showing how to do subtree broadcasts in this novel context. This implies some
surprisingly efficient performance bounds for broadcast encryption in multicast
trees, for in severely unbalanced multicast trees the number of ancestors of the
leaf associated with some device can be exponentially greater than the number
of keys that device is allowed to store.

2 Preliminaries
The LKH Scheme for a Single Group. Let us briefly review the LKH
scheme [26, 28], which is well known for key management in single groups. The
LKH scheme organizes a group of devices as a complete binary tree with the
GC represented by the root and each user (that is, device) by a leaf, with a key
stored at each node. Each device, as a leaf, knows the path from the root to
itself and all the keys on this path. The GC, as the root, knows the whole tree
and all the keys. (See Figure 1.)
To revoke a device u, the GC updates every key on the path from the root to
u's leaf so that: (a) u cannot receive any updated key; and (b) any device other than
u can receive an updated key if and only if it knows the old value of that key.
The key updating is bottom-up, from the parent of u's leaf to the root. To distribute
the new key at a node v: if v is the parent of u's leaf, then the GC encrypts the new
key with the current key of the sibling of u's leaf; otherwise, the GC encrypts the new
key with the current keys of the two children of v, respectively. This procedure
guarantees (a) and (b). The total number of messages is O(log n). Broadcasting
to a subtree simply involves encrypting a message using the key for the root of
that subtree; hence, the broadcast cost is O(1).
In the static case, no updating is allowed. So, the GC must encrypt a broad-
cast using the root key of every maximal subtree containing no revoked devices.
Thus, in the static case, LKH has broadcast cost O(r log(n/r)). (Recall that r is
the number of revoked devices.) In both the static and dynamic cases, however,
the number of keys per device remains O(log n).
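To make the rekeying step concrete, the following is a minimal sketch of LKH revocation over a complete binary tree stored in level order. The helper names (`encrypt`, `lkh_revoke`) and the array representation are illustrative choices, not part of the original scheme description, and the cipher is a toy stand-in.

```python
import hashlib, os

def encrypt(key: bytes, plaintext: bytes) -> bytes:
    """Toy stand-in for a symmetric cipher: XOR with a hash-derived pad."""
    pad = hashlib.sha256(key).digest()
    return bytes(a ^ b for a, b in zip(plaintext, pad))

def lkh_revoke(keys, leaf):
    """Rekey the path from a revoked leaf to the root.

    keys: list of node keys in level order (root at index 0, children of
    node i at 2*i+1 and 2*i+2); leaf: index of the revoked device's leaf.
    Returns the rekey messages the GC broadcasts (O(log n) of them)."""
    messages = []
    child, node = leaf, (leaf - 1) // 2
    while True:
        new_key = os.urandom(16)                    # fresh key for this node
        if child == leaf:
            sibling = child + 1 if child % 2 == 1 else child - 1
            messages.append(encrypt(keys[sibling], new_key))
        else:
            # both children hold valid keys (one of them was just refreshed)
            messages.append(encrypt(keys[2 * node + 1], new_key))
            messages.append(encrypt(keys[2 * node + 2], new_key))
        keys[node] = new_key
        if node == 0:
            break
        child, node = node, (node - 1) // 2
    return messages

# Example: a complete tree with 4 leaves has 7 nodes; the leaves are 3..6.
keys = [os.urandom(16) for _ in range(7)]
msgs = lkh_revoke(keys, leaf=5)      # revoke the device at leaf index 5
```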

Subset Difference Revocation (SDR). The subset difference revocation
(SDR) approach of Naor, Naor, and Lotspiech [16] is also based on associat-
ing all the devices with the leaves of a complete binary tree T. Define a subtree
B as the union of all the paths from the root to leaves associated with revoked
devices. Some internal nodes in B have one child and some have two. Mark each in-
ternal node in B with two children as a “cut vertex” and imagine that we cut
out from T the two edges from each cut vertex to its children. This leaves us with
rooted subtrees, each containing some number of valid devices and one revoked
leaf (which may have previously been an internal node). Each such subtree is
therefore uniquely identified by its root v and by its revoked descendent node w.

Fig. 1. The LKH scheme for key management in single groups.

The GC associates a secret key with each node of T, and defines a label
for each node in the subtree T_v of T rooted at v as follows: the label of v is v's secret
key, and for any internal node x in T_v with left child x_L and right child x_R, we
define label(x_L) = h_L(label(x)) and label(x_R) = h_R(label(x)), where h_L and h_R are collision-
resistant one-way hash functions that maintain the size of their input strings. (Here
we use this abstract model of h_L and h_R; Naor, Naor, and Lotspiech use in [16] a
pseudo-random generator G that triples the size of its input, and take the left 1/3
and right 1/3 of the output to be the values of h_L and h_R.) Each leaf u in T_v stores
the values of all the labels of the nodes that are siblings of nodes on the path from
v to u (that is, nodes not on the path itself, but siblings of a node on the path). The
key used to encode a subtree T_v \ T_w, rooted at v with a revoked node w inside, is the
label of w in T_v. Note that no descendent of w knows this value, and no node outside of
T_v can compute this value, which is what makes this a secure scheme. However, this
scheme requires each device to hold O(log^2 n) keys, which violates the log-key
restriction.
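The label derivation can be summarized in a few lines. The sketch below uses two hash functions derived from SHA-256 in place of the abstract h_L and h_R (the original construction uses a length-tripling pseudo-random generator); the function names are illustrative.

```python
import hashlib

def h_left(label: bytes) -> bytes:
    return hashlib.sha256(b"L" + label).digest()

def h_right(label: bytes) -> bytes:
    return hashlib.sha256(b"R" + label).digest()

def label_of(seed: bytes, path: str) -> bytes:
    """Label of the node reached from the subtree root v by following `path`
    ('0' = left child, '1' = right child), starting from v's secret seed."""
    label = seed
    for step in path:
        label = h_left(label) if step == "0" else h_right(label)
    return label

# The key for the subset T_v \ T_w is label_of(seed_v, path from v to w).
# A device stores the labels of all siblings of nodes on its path from v,
# so it can derive that key exactly when w is not one of its ancestors.
```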

3 Improved Zero-State Broadcast Encryption
To improve the storage requirements for stateless broadcast encryption, so as to
satisfy the log-key restriction, we take a data structuring approach. We begin
with the basic approach of the subset difference (SDR) method. Without loss of
generality, we assume that we are given a complete binary tree T with n leaves
such that each leaf of T is associated with a different user. For any node v in
T, let T_v denote the subtree rooted at v. In addition, for any node v and a
descendent w of v, we let T_v \ T_w denote the tree T_v minus T_w, that is, all the nodes that
are descendents of v but not of w. Given a set of r revoked users, we can use the
same approach as SDR to partition T into at most O(r) subtrees T_v \ T_w such
that the union of all these trees represents the complete set of unrevoked users.

A Linear-Work Solution. As a warm-up for our efficient broadcast encryption
scheme, we first describe a scheme that uses O(log n) keys per device and O(r)
messages per broadcast, but requires O(n) work per device to decrypt messages
(we will then show how to improve the device work bound while keeping the other two
asymptotic bounds unchanged).

The main idea is that the GC needs a way of encoding a message so that
every leaf node in T_v \ T_w can decrypt this message, but no other user (or group
of users) can decrypt it. We note, as an additional space-saving technique, that we
can name each node in T according to a level-numbering scheme (e.g., see [9]),
so that the full structure of any tree T_v \ T_w can be completely inferred using just
the names of v and w. Moreover, any leaf in T_v \ T_w can determine its relative
position in T_v \ T_w immediately from its own name and the names of v and w.
Let us focus on a specific subtree T_v for a node v in T. We define a set of
leftist labels L(x) and rightist labels R(x) for each node x of T_v. In particular,
let us number the nodes in two ways: first according to a left preorder num-
bering (which visits left children before right children) and second according to
a right preorder numbering (which visits right children before left children) [9].
For a non-root node x in T_v, let y denote the predecessor of x in the left pre-
order numbering of the nodes in T_v. We define L(x) to be h_L(L(y)), where h_L is
a collision-resistant one-way hash function. Likewise, we let z denote the prede-
cessor of x in the right preorder numbering of the nodes in T_v. We define R(x) to
be h_R(R(z)), where h_R is a (different) collision-resistant one-way hash function.
We initialize these two hash chains by setting L(v) and R(v) to random seeds
known only to the GC.
For each leaf node u in T_v, let s and t respectively denote the successors
of u (if they exist) in the left and right preorder numberings of the nodes in T_v.
The keys we store at u for T_v are L(s) and R(t). (Note that we specifically
do not store L(u) nor R(u) at u.) For the complete key distribution, we store
these two keys for each subtree T_v containing u (there are O(log n) such subtrees).
Given this key distribution, to encrypt a message for the nodes in T_v \ T_w, the GC
encrypts the message twice: once using L(w) and once using R(w).
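The following is a minimal GC-side sketch of the label generation for a subtree T_v, assuming the level-numbering convention in which node i has children 2i+1 and 2i+2; the function names and the use of SHA-256 for h_L and h_R are illustrative assumptions.

```python
import hashlib, os

def hash_chain_step(tag: bytes, label: bytes) -> bytes:
    return hashlib.sha256(tag + label).digest()

def preorder(node, size, left_first=True):
    """Yield the nodes of the complete binary tree {0..size-1} rooted at
    `node` in left (or right) preorder, using children 2i+1 and 2i+2."""
    if node >= size:
        return
    yield node
    kids = (2 * node + 1, 2 * node + 2)
    for child in (kids if left_first else reversed(kids)):
        yield from preorder(child, size, left_first)

def subtree_labels(root, size, seed_l, seed_r):
    """Labels L(x) and R(x) for all nodes x of T_root, chained along the
    left and right preorder numberings and seeded at the root."""
    L, R = {}, {}
    prev = None
    for x in preorder(root, size, left_first=True):
        L[x] = seed_l if prev is None else hash_chain_step(b"L", L[prev])
        prev = x
    prev = None
    for x in preorder(root, size, left_first=False):
        R[x] = seed_r if prev is None else hash_chain_step(b"R", R[prev])
        prev = x
    return L, R

# To broadcast to T_v \ T_w, the GC encrypts the payload once under L[w]
# and once under R[w].
```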

Decryption. Let us next consider how a leaf node u in T_v \ T_w can decrypt a message
sent to this subtree from the GC. Since w is not an ancestor of u, there are two
possibilities: either w comes after u in the left preorder numbering of T_v, or w
comes after u in the right preorder numbering. Since u can determine the complete
structure of T_v and its relative position within this subtree from the names of
v and w, it can implicitly represent T_v and know which of these two cases
applies. So suppose the first case applies (the second case is symmetric with the
first). In this case, u starts with the label L(s) it stores, where s is u's successor
in the left preorder numbering of T_v. It then continues a left preorder traversal of
T_v (which it can perform implicitly if memory is tight) until it reaches w. With
each new node it encounters in this traversal, u makes another application of the
one-way function h_L, computing the labels of each visited node. Thus, when u
visits w in this traversal, it will have computed L(w) and can then decrypt the
message. This computation takes at most O(n) hash function computations.
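On the device side, recovering L(w) is just a matter of continuing the hash chain from the stored successor label, as sketched below (reusing `preorder` and `hash_chain_step` from the previous sketch; the traversal runs over the implicitly represented subtree, so no tree needs to be stored).

```python
def derive_label(stored_label, successor, w, root, size):
    """Derive L(w) from the label stored for this device's left-preorder
    successor, applying one hash per node visited between them."""
    label = None
    for x in preorder(root, size, left_first=True):
        if x == successor:
            label = stored_label
        elif label is not None:
            label = hash_chain_step(b"L", label)
        if x == w:
            if label is None:
                raise ValueError("w precedes the stored successor; "
                                 "use the right-preorder chain instead")
            return label
    raise ValueError("w is not in this subtree")
```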

Security. Let us next consider the security of this scheme. First, observe that
any node outside of T_v has no information that can be used to help decode a
message for the nodes in some tree T_v \ T_w, since L(v) and R(v) are chosen as
random seeds and nodes outside of T_v receive no function of L(v) or R(v).

So the security risk that remains is that leaf descendents of w might be able
to decrypt a message sent to the nodes in T_v \ T_w. Let D_w denote the set of leaf
descendents of w. For each node u in D_w with successors s and t in the two
preorder numberings, we store L(s) and R(t) at u. But none of these values
for the nodes in D_w are useful for computing L(w) or R(w) without inverting
a one-way function, since, in any preorder traversal, all the ancestors of a node
are visited before the node itself is visited.
Thus, we have a key distribution strategy for the zero-state case that uses
O(log n) keys per device and O(r) messages per broadcast, albeit with work at
each device that could be as large as O(n). In the remainder of this section, we describe
how we can reduce this work bound while keeping the other asymptotic bounds
unchanged.
The Stratified Subset Difference (SSD) Method. Given a constant k ≥ 1, we
can decrease the work per device to O(n^{1/k}) while increasing the space and
message bounds by at most a factor of k, which should be a good trade-off in
most applications. For example, when n is less than one trillion, taking k = 4 makes
n^{1/k} less than 1000. The method involves a stratified version of the scheme described above,
giving rise to a scheme we call the stratified subset difference (SSD) method.
We begin by marking each node at a depth that is a multiple of (log n)/k
as “red;” the other nodes are colored “blue.” (See Figure 2.) Imagine further
that we partition the tree T along the red nodes, subdividing T into maximal
trees whose root and leaves are red and whose internal nodes are blue. Call each
such tree a blue tree (even though its root and leaves are red). We then apply
the method described above in each blue tree, as follows. For each leaf u in T,
let r_1, ..., r_k be the red ancestors of u in top-down order. For i = 1, ..., k, let B_i
be the blue tree rooted at r_i, and note that r_{i+1} (or u itself, when i = k) is a leaf of B_i.
We store at node u labels L(s_i) and R(t_i) (defined within B_i rather than in T), where s_i and t_i
are the left and right preorder successors of r_{i+1} (or of u itself, when i = k) in B_i, respectively.
Storing these labels increases the space per device by a factor of k.




Fig. 2. Illustration of the stratified subset difference (SSD) scheme.

To encrypt a message, the GC first performs the subdivision of T into the
subtrees T_v \ T_w as before. Then, the GC further partitions each such tree at the

red levels, and encodes the broadcast message, using the previously described
scheme, for each blue subtree rooted at a node on the path from to This
increases the broadcast size by at most a factor of but now the work needed
by each device is reduced to computing the L or R labels in a blue tree, which
has size at most Thus, the work per device is reduced to in this
SSD scheme.
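A small sketch of the stratification bookkeeping: which depths are red, which blue tree a node belongs to, and the resulting per-device work bound. The parameterization (red levels every (log2 n)/k depths, rounded up) follows the description above and is meant only as an illustration.

```python
import math

def red_stride(n: int, k: int) -> int:
    """Red levels occur every (log2 n)/k depths (rounded up)."""
    return max(1, math.ceil(math.log2(n) / k))

def is_red(depth: int, n: int, k: int) -> bool:
    return depth % red_stride(n, k) == 0

def blue_tree_root_depth(depth: int, n: int, k: int) -> int:
    """Depth of the red root of the blue tree containing a node at `depth`."""
    s = red_stride(n, k)
    return (depth // s) * s

def max_device_work(n: int, k: int) -> int:
    """A blue tree has height at most red_stride(n, k), so its size is on the
    order of 2 ** red_stride(n, k) = O(n^(1/k)) nodes, which bounds the
    number of hash evaluations a device performs when deriving a label."""
    return 2 ** red_stride(n, k)

# Example: n around one trillion (2**40) with k = 4 gives a per-device
# work bound on the order of n**0.25.
print(max_device_work(2 ** 40, 4))   # 1024
```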

Theorem 1. Given a balanced tree T with n devices, for zero-state broadcast
encryption, the stratified subset difference (SSD) scheme for T uses O(log n)
keys per device and has O(r) broadcast cost, where r is the number of revoked
devices in the subtree receiving the broadcast. The work per device can be made
to be O(n^{1/k}) for any fixed constant k ≥ 1.
Moreover, as we have noted, the security of this scheme is as strong as that
for SDR and LKH, i.e., it is resilient to collusions of any set of revoked devices.


4 A Biased Tree Scheme for an Organizational Chart
We recall that in the organizational chart structure for devices, we have a
hierarchical partition of the devices induced by a tree T of height
but with unbounded branches at each internal node. Namely, the leaves of T are
associated with the devices and an internal node of T represents the group
(set) of devices associated with the leaves of the subtree rooted at Thus,
sibling nodes of T are associated with disjoint groups and each device belongs
to a unique sequence of groups whose nodes are on the path from the
device™s leaf to the root of T. Without loss of generality, we assume that an
internal node of T has either all internal children (subgroups) or all external
children (devices), and its group is called an interior group or exterior group
accordingly. We consider four types of update operations: insertion and deletion
(revocation) of a device or of an empty group. After each modification, we want
to maintain both forward and backward security.

Biased Trees. Biased trees, introduced by Bent et al. [1], are trees balanced by
the weights of their leaves (typically set as access frequencies). There are two versions
of biased trees: locally biased and globally biased. We denote by p(x), lc(x) and rc(x)
the parent, left child and right child of a node x of a tree, and we use these
notations cumulatively; e.g., lc(p(p(x))) is the left child of the grandparent of x.
The following definitions are taken from [1].
A biased search tree is a full binary search tree such that each node x has a
weight w(x) and a rank r(x). The weight of a leaf is initially assigned, and the
weight of an internal node is the sum of the weights of its children. The rank
of a node is a positive integer such that
1. r(x) = ⌊log w(x)⌋ if x is a leaf.
2. r(x) ≤ r(p(x)) − 1 if x is a leaf.
3. r(x) ≥ ⌊log w(x)⌋ and r(x) ≤ r(p(x)) for every non-root node x.
TEAM LinG
520 Michael T. Goodrich, Jonathan Z. Sun, and Roberto Tamassia

A locally biased search tree has the following additional property:
Local bias. For any with
1. if then either or is a leaf with rank if
then either or is a leaf with rank and
2. if and then either
or is a leaf with rank if and
then either or is a leaf with rank

A globally biased search tree has the following additional property:
Global bias. For any with both of the two neighboring
leaves of i.e., the right-most leaf on the left and the left-most leaf on the
right, have rank at least

Group Hierarchies and Biased Trees. Given an organizational chart T
that represents a group hierarchy, we have to convert T to a binary tree before
applying any encryption scheme for key management. Without loss of generality,
we convert T to a binary tree T' that preserves the original group hierarchy.
Each internal node v of T, representing a group, becomes a special internal
node in T' that still represents that group and accommodates a GC. Additional internal
nodes are added between v and its children in T (i.e., subgroups or devices)
for the purpose of binarization. As a result, node v, plus all its children in T and
the paths between them in T', form a binary subtree T'_v in T', with v being
the root and each of its children in T being a leaf. Note that, without special
care, T' is likely to have super-logarithmic height, and balancing such a tree using
standard techniques would destroy the group hierarchy.
Given a group hierarchy tree T, we assign a unit weight to each leaf and
calculate the weights of the other nodes in T accordingly, i.e., the weight of each
internal node v is the number of devices in the subtree of T rooted at v. We
replace each subtree T'_v with a biased binary tree having the children of v as its
leaves (using the weights of these nodes for the biasing). Thus, each subtree
representing a group rooted at a node v in T can be initialized into a biased
tree T'_v without affecting the structure of the group hierarchy. Since the weight of v
is an invariant, i.e., the weights of the root and leaves of every T'_v are invariant,
the initialization is well defined and can be done in each T'_v independently. That
is, combining all the biased trees T'_v into T' will not change the structure of the
original hierarchy represented by T. (See Figure 3.)

Key Assignment. After initializing the biased trees T'_v, we still assign a key to
each node of T' as in the LKH scheme, and we distribute the keys to the devices and
GC's according to the following security properties:
1. each device knows all, and only, the keys on the path from the root of T' to itself.
2. the GC of each group v knows all, and only, the keys of v's descendants in T'
and those on the path from the root of T' to v.

TEAM LinG
Efficient Tree-Based Revocation in Groups of Low-State Devices 521




Fig. 3. Binary tree T' consisting of biased trees T'_v. The ranks of the nodes are shown.


Broadcast and Multicast. Using the above security properties and an appro-
priate signature or authentication mechanism [2, 4, 20, 25], the GC of each group v
can send a message securely, with a single key encryption, to v's group or to any
subgroup or super-group of it, without any ambiguity.

Key Update and Tree Rebalance. As in the LKH scheme, keys should be
updated after each insertion or deletion (revocation) of a device or group, so
that the security properties 1 and 2 are maintained. Moreover, we should also
rebalance T' to preserve the bias properties in each T'_v. Assume that we can
insert a leaf, delete a leaf, or update the weight of a leaf in a tree T'_v (by operations
insert, delete and reweight, respectively) while preserving both the security and
bias properties. Then inserting or deleting a device
can be done in three steps:
1. insert or delete a leaf in the corresponding exterior tree T'_v;
2. update the weights in the interior trees on the path from v to the root of T'
accordingly; and
3. update the keys on the path from the device's leaf to the root of T', bottom-up,
as in the LKH scheme.
To insert or delete a group is a similar process, except that it starts with an
insertion or deletion in an interior tree T'_v. Therefore insert, delete
and reweight in each T'_v suffice for all our hierarchy modifications in T'. Such
operations preserving the bias properties were already given and analyzed in
[1]; we now describe how to modify them to preserve the security properties,
too.
Recall that the biased tree operations, including insert, delete and reweight,
recursively call an operation tilt as the only subroutine used to rebalance the biased
tree structure [1]. Operation tilt performs a single rotation together with a rank
modification. A node loses descendants during a rotation exactly when it is rotated
down, and losing descendants is the only way a key can leak in the LKH scheme.
To maintain the security properties 1 and 2 after any rotation in T', it is neces-
sary and sufficient to update the key at the node rotated down. Observing that

updating a single key and distributing the result of a rotation are both easy in
our scheme, we replace the tilt of [1] with a secure-tilt operation that preserves the
security properties 1 and 2. We give a detailed description of secure-tilt-left in
Figure 4; operation secure-tilt-right is analogous. Using secure-tilt as the sub-
routine in the biased tree operations, the scheme is as secure as LKH.




Fig. 4. The algorithm for operation secure-tilt-left.
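Since Figure 4 is not reproduced here, the following is a minimal sketch of what secure-tilt-left does under the description above: a single left rotation in which the node rotated down receives a fresh key, distributed encrypted under the current keys of its new children, exactly as a one-node LKH update. The class and field names are illustrative, and the rank adjustment of the underlying tilt is only indicated by a comment.

```python
import os

class Node:
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right
        self.rank = 0          # maintained by the biased-tree machinery

def secure_tilt_left(x, encrypt):
    """Left-rotate at x.  x is rotated down, so it loses descendants and
    gets a fresh key, announced encrypted under its new children's keys.
    Returns (new_subtree_root, rekey_messages)."""
    y = x.right                      # y is rotated up; its key is unchanged
    x.right, y.left = y.left, x      # standard left rotation
    x.key = os.urandom(16)           # refresh the key of the demoted node
    messages = [encrypt(child.key, x.key)
                for child in (x.left, x.right) if child is not None]
    # rank adjustments of the original tilt operation would be applied here
    return y, messages
```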



Efficiency of the Scheme. The insert, delete and reweight operations in biased
trees are implemented as follows. Operations join and split are the two basic biased tree
operations. Operation join has global and local versions, which merge two global
or local biased trees and return the root of the resulting
tree; both versions work by recursively calling secure-tilt. Operation split(x, T)
splits T into two biased trees T_1 and T_2, containing all the leaves of T with
their binary search keys less than x and greater than x, respectively; split calls
local-join as a subroutine and is applicable to both local and global biased trees.
The other operations are based on join and split: operation insert(x, T) splits T by x
and then joins T_1, the new leaf x and T_2 together; operation delete(x, T) splits T by x and
then joins T_1 and T_2 back, ignoring x; and operation reweight(x, T) splits T by x,
updates the weight of x, and then joins T_1, x and T_2 back into T.
The correctness and efficiency of our hierarchy modifications in T' follow from
those of the biased tree operations. Noticing that our secure-tilt takes constant message
size, just as the tilt in [1] takes constant time, all time bounds in [1] also hold as
bounds on message size in our scheme.
This gives us the following.
Theorem 2. Given an organizational chart tree T with height O(log n) and n devices,
under the log-key restriction, the dynamic biased binary tree scheme for T
has O(1) broadcast cost and O(log n) revocation and insertion cost.

Proof. We show how to access a device d from the root of T'.
The analysis of the other operations is similar. Since the root of each T'_v is a leaf
of its parent's biased tree, and each biased tree has the ideal (weight-proportional)
access time, the access times along the string of biased trees on the path to d
telescope, so the time to access d from the root is O(log(n/w(d))) = O(log n).





Thus, we satisfy the log-key restriction for any organizational chart with
logarithmic height. We also note that applying our SSD approach to a static
application of the techniques developed in this section results in a scheme using
O(log n) keys per device and O(r) messages per broadcast for an organizational
chart with logarithmic height.


5 A Dynamic Tree Scheme for a Multicast Tree

Let us next consider the multicast tree structure, which, for the sake of broadcast
encryption, is similar to the organizational chart, except that the height of a
multicast tree can be much larger than logarithmic (we even allow for linear
height). For a multicast tree T with n devices and m groups, we give a scheme
with O(log n) broadcast cost and O(log n) update cost, irrespective of the
depth of T.

Dynamic Trees. Dynamic trees were first studied by Sleator and Tarjan [22]
and used for various tree queries and network flow problems. The key idea is
to partition a highly unbalanced tree into paths and to associate a biased tree
structure, which is in some sense balanced, with each path. Thus any node in the
tree can be accessed, and any update to the tree can be done, in O(log n) time
through the associated structure, regardless of the depth of the node or the height of
the tree. The dynamic tree used in our scheme is specified by taking the partition-by-
weight (size) approach and not having costs on the edges. The following definition
refers to this specification.
A dynamic tree T is a weighted binary search tree where the weight w(x) is
initially assigned if x is a leaf, and is the sum of the weights of the children if x is an
internal node. The edges of T are partitioned into solid and dashed edges so that each
node links to its heavier child by a solid edge and to the lighter child by a
dashed edge. Thus T is partitioned into solid paths linked by dashed edges.
We denote by tail(P) the deepest node in a solid path P and by head(P)
the upper-most one¹. Then the edge between any head(P) and its parent must be dashed,
and vice versa. To support the operations, each solid path P is further organized as a
global biased tree, denoted by B_P, so that the nodes from tail(P) to head(P) become leaves
of B_P from left to right, and the weight of a leaf x in B_P is assigned as the weight of
the dashed child of x in T (or as 1 if x has no dashed child). Then T is represented by
these biased trees B_P, by linking the root of each B_P with the parent of head(P), unless
head(P) is the root of T. (See Figure 5.) To show that such a structure is well
defined, note that the total weight of the root of B_P equals the sum of its leaf weights,
which equals w(head(P)). Thus, B_P can replace the solid path P as a
child of the parent of head(P).
¹ tail(P) must be a leaf of T by the “partition by weight” approach.
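A sketch of the partition-by-weight step: each node keeps the edge to its heavier child solid, which determines the solid paths (and hence the biased trees B_P built over them). The class and function names are illustrative, and the weight convention (unit weights at leaves, sums at internal nodes) follows the definition above.

```python
class TNode:
    def __init__(self, left=None, right=None):
        self.left, self.right = left, right
        self.weight = 0

def compute_weights(v):
    """Weight of a leaf is 1; weight of an internal node is the sum of its
    children's weights."""
    if v is None:
        return 0
    if v.left is None and v.right is None:
        v.weight = 1
    else:
        v.weight = compute_weights(v.left) + compute_weights(v.right)
    return v.weight

def solid_child(v):
    """The heavier child is reached by the solid edge (ties broken left)."""
    if v.left is None and v.right is None:
        return None
    lw = v.left.weight if v.left else 0
    rw = v.right.weight if v.right else 0
    return v.left if lw >= rw else v.right

def solid_paths(root):
    """Partition the tree into solid paths, each listed from head to tail."""
    paths, stack = [], [root]
    while stack:
        head = stack.pop()
        path, v = [], head
        while v is not None:
            path.append(v)
            s = solid_child(v)
            for child in (v.left, v.right):
                if child is not None and child is not s:
                    stack.append(child)   # a dashed child heads a new path
            v = s
        paths.append(path)
    return paths
```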





Fig. 5. Partition of the tree into solid paths and the accessing path to a device.


Group Hierarchies and Dynamic Trees. We convert a multicast tree T to a
binary tree T' that preserves the group hierarchy in T, in the same way as in the biased
tree scheme. Instead of using a biased tree, we simply use a complete binary tree
for each subtree T'_v; we then assign a unit weight to each device and partition
T' into a dynamic tree as above. A key is assigned to each node of each biased tree B_P.
Since the root of each B_P becomes the child of a leaf of another such tree, each device
becomes a descendant of a unique string of biased trees of solid paths.
The way a device is accessed is not through the real path in T' but through the
path in this string of biased trees. (See Figure 5.)
Broadcast and Multicast. Broadcast in a group v becomes a little more
complicated because, although a device d of the group is a descendant of v in T', v may
not be on the accessing path from the root to d. However, if P is the solid path containing
v, then the accessing path to any descendant of v must pass through a node in the prefix
of P from tail(P) to v. So, to broadcast in v's group, it is sufficient to encrypt the message
with the keys in B_P that cover this prefix of P. In the full version, we show that, with the
dynamic tree scheme, it takes O(log n) encryptions to broadcast a message to
any group, both in the worst case and on average.
Key Updates. We follow the dynamic tree operations in [22] to modify the
hierarchy, and we update the keys on the accessing path of the updated item as in
the LKH scheme. Dynamic tree operations dynamically change the solid path
partition to guarantee the O(log n) running time, and such changes are carried out
by the biased tree operations among the trees B_P. Therefore, operation secure-tilt
preserves the security properties along any accessing path. The dynamic tree
operations we use are as follows:
splice: extend a solid path by converting the edge from its head to its parent into a solid
edge, and the parent's other (formerly solid) edge into a dashed edge.

slice: let (x, y) be the upper-most edge in a solid path such that y is not the
heavier child of x, if such edges exist in the path. Then cut the path by converting
(x, y) into a dashed edge and the edge from x to its heavier child into a solid edge.
expose: make the path from the root to a given node (the real path in T') into a single
solid path by a series of splices.
conceal: convert every edge on an exposed path that does not link a node to the heavier
child of its parent into a dashed edge, by a series of slices.
link: combine two dynamic trees by making a node y the parent of x, where
x is the root of the first tree and y is a node in the second.
cut: divide a dynamic tree into two by deleting the edge between a node and its parent.

Inserting or deleting a device or a group corresponds to a link or cut operation,
respectively. Such a dynamic tree operation takes O(log n) time and can be reduced
to a series of join and split operations on biased trees. The algorithmic template
for a dynamic tree operation is the expose-and-conceal strategy, described as
follows:
1. perform expose on the node being operated on;
2. if the above expose operation violates the “partition by weight” property,
restore the property by executing conceal on the appropriate path.
Since all the dynamic tree operations reduce to a series of biased tree opera-
tions, operation secure-tilt is still the only subroutine that adjusts the partition
of T'. Notice that the structure of T' itself is never adjusted; only the accessing
paths to the devices are adjusted, through these operations. From [22], we know that,
with partition by weight and representing the solid paths as global biased trees,
any dynamic tree operation takes O(log n) time. Since a hierarchy modification
consists of a dynamic tree operation plus updating the keys on an access path,
which is also of length O(log n), the efficiency of key updating for hierarchy
modifications follows.
Theorem 3. Given a multicast tree T with n devices, structured in m groups, under the
log-key restriction, the dynamic tree scheme for T has O(log n) broadcast cost and
O(log n) revocation and insertion cost.
A zero-state version can also be developed, which uses the biased trees and the SSD
broadcast scheme to send messages to the unrevoked leaves in a multicast tree
T, using a number of broadcasts that depends on the number r of revoked devices,
with each device storing O(log n) keys.


References
1. S. W. Bent, D. D. Sleator, and R. E. Tarjan. Biased search trees. SIAM J. Comput.,
14(3):545“568, Aug. 1985.
2. D. Boneh, G. Durfee, and M. Franklin. Lower bounds for multicast message au-
thentication. In Proc. EUROCRYPT 2001, LNCS 2045, pages 437“452, May 2001.


3. B. Briscoe. Marks: Zero side effect multicast key management using arbitrarily
revealed key sequences. In Proc. of the First International Workshop on Networked
Group Communication (NGC'99), 1999.
4. R. Canetti, J. Garay, G. Itkis, D. Micciancio, M. Naor, and B. Pinkas. Multicast
security: A taxonomy and some efficient constructions. In Proc. INFOCOM ™99,
volume 2, pages 708“716, New York, Mar. 1999.
5. R. Canetti, T. Malkin, and K. Nissim. Efficient communication–storage tradeoffs
for multicast encryption. In Advances in Cryptology (EUROCRYPT'99), LNCS
1592, pages 459–474, 1999.
6. I. Chang, R. Engel, D. Kandlur, D. Pendarakis, and D. Saha. Key management
for secure Internet multicast using boolean function minimization techniques. In
Proc. IEEE INFOCOM, volume 2, pages 689“698, 1999.
7. G. D. Crescenzo and O. Kornievskaia. Efficient kerberized multicast in a practical
distributed setting. In 4th International Conference on Information Security (ISC'01),
LNCS 2200, pages 27–45, Oct. 2001.
8. A. Fiat and M. Naor. Broadcast encryption. In Advances in Cryptology
(CRYPTO'93), LNCS 773, pages 480–491, 1994.
9. M. T. Goodrich and R. Tamassia. Algorithm Design: Foundations, Analysis and
Internet Examples. John Wiley & Sons, New York, NY, 2002.
10. D. Halevy and A. Shamir. The LSD broadcast encryption scheme. In Advances in
Cryptology (CRYPTO 2002), volume 2442 of LNCS, pages 47“60. Springer-Verlag,
2002.
11. E. Jung, A. X. Liu, and M. G. Gouda. Key bundles and parcels: Secure com-
munication in many groups. In Proceedings of the 5th International Workshop on
Networked Group Communications (NGC-03), LNCS 2816, pages 119“130, Mu-
nich, Germany, September 2003.
12. R. Kumar, S. Rajagopalan, and A. Sahai. Coding constructions for blacklist-
ing problems without computational assumptions. In Advances in Cryptology
(CRYPTO'99), LNCS 1666, pages 609–623, 1999.
13. D. A. McGrew and A. T. Sherman. Key establishment in large dynamic groups using
one-way function trees. Technical Report 0755, TIS Labs at Network Associates
Inc., Glenwood, MD, May 1998.
14. D. Micciancio and S. Panjwani. Optimal communication complexity of generic
multicast key distribution. In EUROCRYPT 2004, pages 153“170, 2004.
15. M. J. Mihajevic. Key management schemes for stateless receivers based on time
varying heterogeneous logical key hierarchy. In ASIACRYPT 2003, LNCS 2894,
pages 137“154, 2003.
16. D. Naor, M. Naor, and J. Lotspiech. Revocation and tracing schemes for stateless
receivers. In Advances in Cryptology (CRYPTO'01), volume 2139 of LNCS, pages
41–62. Springer-Verlag, 2001.
17. M. Naor and B. Pinkas. Efficient trace and revoke schemes. In Proc. Financial
Crypto 2000, Feb. 2000.
18. B. Pinkas. Efficient state updates for key management. In Proc. ACM Workshop
on Security and Privacy in Digital Rights Management, 2001.
19. O. Rodeh, K. P. Birman, and D. Dolev. Using AVL trees for fault tolerant group key
management. International Journal on Information Security, pages 84“99, 2001.
20. B. Schneier. Applied Cryptography, 2nd Ed. John Wiley & Sons, 1996.
21. A. T. Sherman and D. A. McGrew. Key establishment in large dynamic groups
using one-way function trees. IEEE Trans. Software Engineering, 29(5):444“458,
2003.


22. D. D. Sleator and R. E. Tarjan. A data structure for dynamic trees. J. Computer
and System Sciences, 26:362“391, 1983.
23. J. Snoeyink, S. Suri, and G. Varghese. A lower bound for multicast key distribution.
In IEEE INFOCOM 2001, volume 1, pages 422“431, 2001.
24. R. Tamassia and N. Triandopoulos. Computational bounds on hierarchical data
processing with applications to information security. Technical report, Center for
Geometric Computing, Brown University, 2004.
25. H. F. Tipton and M. Krause, editors. Information Security Management Handbook,
4th Ed. Auerbach, 1999.
26. D. M. Wallner, E. G. Harder, and R. C. Agee. Key management for multicast:
issues and architecture. Internet draft draft-wallner-key-arch-01.txt, Sep. 1998.
27. C. K. Wong, M. Gouda, and S. S. Lam. Secure group communications using key
graphs. In Proc. ACM SIGCOMM'98, volume 28, pages 68–79, 1998.
28. C. K. Wong and S. S. Lam. Digital signatures for flows and multicasts. IEEE/ACM
Transactions on Networking, 7:502“513, 1999.




Privacy-Preserving Datamining
on Vertically Partitioned Databases

Cynthia Dwork and Kobbi Nissim
Microsoft Research, SVC, 1065 La Avenida, Mountain View CA 94043
{dwork,kobbi}@microsoft.com




Abstract. In a recent paper Dinur and Nissim considered a statistical
database in which a trusted database administrator monitors queries
and introduces noise to the responses with the goal of maintaining data
privacy [5]. Under a rigorous definition of breach of privacy, Dinur and
Nissim proved that unless the total number of queries is sub-linear in the
size of the database, a substantial amount of noise is required to avoid a
breach, rendering the database almost useless.
As databases grow increasingly large, the possibility of being able to
query only a sub-linear number of times becomes realistic. We further
investigate this situation, generalizing the previous work in two impor-
tant directions: multi-attribute databases (previous work dealt only with
single-attribute databases) and vertically partitioned databases, in which
different subsets of attributes are stored in different databases. In addi-
tion, we show how to use our techniques for datamining on published
noisy statistics.

Keywords: Data Privacy, Statistical Databases, Data Mining, Vertically
Partitioned Databases.


1 Introduction

In a recent paper Dinur and Nissim considered a statistical database in which
a trusted database administrator monitors queries and introduces noise to the
responses with the goal of maintaining data privacy [5]. Under a rigorous defini-
tion of breach of privacy, Dinur and Nissim proved that unless the total number
of queries is sub-linear in the size of the database, a substantial amount of noise
is required to avoid a breach, rendering the database almost useless1. However,
when the number of queries is limited, it is possible to simultaneously preserve
privacy and obtain some functionality by adding an amount of noise that is a
function of the number of queries. Intuitively, the amount of noise is sufficiently
large that nothing specific about an individual can be learned from a relatively
small number of queries, but not so large that information about sufficiently
strong statistical trends is obliterated.
1 For unbounded adversaries, the amount of noise (per query) must be linear in the
size of the database; for polynomially bounded adversaries, noise of magnitude Ω(√n)
is required.


As databases grow increasingly massive, the notion that the database will be
queried only a sub-linear number of times becomes realistic. We further inves-
tigate this situation, significantly broadening the results in [5], as we describe
below.

Methodology. We follow a cryptography-flavored methodology, where we con-
sider a database access mechanism private only if it provably withstands any
adversarial attack. For such a database access mechanism any computation over
query answers clearly preserves privacy (otherwise it would serve as a privacy
breaching adversary). We present a database access mechanism and prove its
security under a strong privacy definition. Then we show that this mechanism
provides utility by demonstrating a datamining algorithm.

Statistical Databases. A statistical database is a collection of samples that are
somehow representative of an underlying population distribution. We model
a database as a matrix, in which rows correspond to individual records and
columns correspond to attributes. A query to the database is a set of indices
(specifying rows), and a Boolean property. The response is a noisy version of the
number of records in the specified set for which the property holds. (Dinur and
Nissim consider one-column databases containing a single binary attribute.) The
model captures the situation of a traditional, multiple-attribute, database, in
which an adversary knows enough partial information about records to “name”
some records or select among them. Such an adversary can target a selected
record in order to try to learn the value of one of its unknown sensitive at-
tributes. Thus, the mapping of individuals to their indices (record numbers) is
not assumed to be secret. For example, we do not assume the records have been
randomly permuted.
We assume each row is independently sampled from some underlying distri-
bution. An analyst would usually assume the existence of a single underlying
row distribution and try to learn its properties.

Privacy. Our notion of privacy is a relative one. We assume the adversary knows
the underlying distribution on the data, and, furthermore, may have some a
priori information about specific records, e.g., “the a priori probability that
at least one of the attributes in record 400 has value 1 is .38”. We analyze
privacy with respect to any possible underlying (row) distributions, where
the ith row is chosen according to a distribution D_i. This partially models a priori
knowledge an attacker has about individual rows (i.e., D_i is conditioned on the attacker's
knowledge of the ith record). Continuing with our informal example, privacy is
breached if the a posteriori probability (after the sequence of queries has been
issued and responded to) that “at least one of the attributes in record 400 has
value 1” differs from the a priori probability “too much”.

Multi-attribute Sub-linear Queries (SuLQ) Databases. The setting studied in [5],
in which an adversary issues only a sublinear number of queries (SuLQ) to a
single attribute database, can be generalized to multiple attributes in several

natural ways. The simplest scenario is that of a single multi-attribute SuLQ database,
queried by specifying a set of indices and a Boolean function. The re-
sponse is a noisy version of the number of records in the specified set for which
the function, applied to the attributes of the record, evaluates to 1. A more
involved scenario is of multiple single-attribute SuLQ databases, one for each
attribute, administered independently. In other words, our database
is vertically partitioned into single-attribute databases. In this case, the chal-
lenge will be datamining: learning the statistics of Boolean functions of the at-
tributes, using the single-attribute query and response mechanisms as primitives.
A third possibility is a combination of the first two: a database that
is vertically partitioned into two (or more) databases with k_1 and k_2 (possibly
overlapping) attributes, respectively, where k_1 + k_2 ≥ k. Each database can
handle functional queries, and the goal is to learn relationships between
the functional outputs, e.g., “If f_1 holds, does this increase the
likelihood that f_2 holds?”, where f_i is a function on the attribute
values for records in the ith database.

1.1 Our Results
We obtain positive datamining results in the extensions to the model of [5]
described above, while maintaining the strengthened privacy requirement:
1. Multi-attribute SuLQ databases: The statistics for every Boolean func-
tion can be learned². Since the queries here are powerful (any function), it is
not surprising that statistics for any function can be learned. The strength
of the result is that statistics are learned while maintaining privacy. (A toy
estimation sketch for this setting is given right after this list.)
2. Multiple single-attribute SuLQ databases: We show how to learn the statis-
tics of any 2-ary Boolean function. For example, we can learn the fraction of
records having neither attribute 1 nor attribute 2, or the conditional proba-
bility of having attribute 2 given that one has attribute 1. The key innovation
is a procedure for testing the extent to which one attribute, say α, implies
another attribute β in probability, meaning that Pr[β = 1 | α = 1] exceeds
Pr[β = 1] by some gap that can be estimated by the procedure.
3. Vertically partitioned SuLQ databases: The constructions here
are a combination of the results for the first two cases: the attributes are
partitioned into (possibly overlapping) sets of size k_1 and k_2, respectively,
where k_1 + k_2 ≥ k, and each of the two sets of attributes is managed by a multi-
attribute SuLQ database. We can learn all 2-ary Boolean functions of the
outputs of queries to the two databases.
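For the single multi-attribute database of item 1, learning a statistic amounts to averaging noisy counts; the sketch below estimates Pr[g = 1] and, using conjunction queries, the conditional probability Pr[β = 1 | α = 1] = Pr[α ∧ β] / Pr[α]. The query interface `sulq_query(subset, g)` is an assumed stand-in for the mechanism of Section 4; this is not the more involved procedure needed for vertically partitioned databases.

```python
def estimate_fraction(sulq_query, n, g, trials=10):
    """Estimate Pr[g(row) = 1] by averaging noisy full-table counts.
    `sulq_query(subset, g)` is assumed to return a perturbed count."""
    full = range(n)
    return sum(sulq_query(full, g) for _ in range(trials)) / (trials * n)

def estimate_conditional(sulq_query, n, alpha, beta):
    """Pr[beta = 1 | alpha = 1] = Pr[alpha and beta] / Pr[alpha]."""
    p_alpha = estimate_fraction(sulq_query, n, lambda row: row[alpha])
    p_both = estimate_fraction(
        sulq_query, n, lambda row: row[alpha] and row[beta])
    return p_both / p_alpha if p_alpha > 0 else 0.0
```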
We note that a single-attribute database can be simulated in all of the above
settings; hence, in order to preserve privacy, the sub-linear upper bound on
queries must be enforced. How this bound is enforced is beyond the scope of this
work.
2 Note that, because of the noise, statistics cannot be learned exactly: an additive error
is incurred, which vanishes as the number of records in the database grows. The same
is true for single-attribute databases.


Datamining on Published Statistics. Our technique for testing implication in
probability yields surprising results in the real-life model in which confidential
information is gathered by a trusted party, such as the census bureau, which pub-
lishes aggregate statistics. Describing our results by example, suppose the bureau
publishes the results of a large (but sublinear) number of queries. Specifically, for
every, say, triple of attributes, and for each of the eight conjunctions
of literals over those three attributes, the bureau
publishes the result of several queries on these conjunctions. We show how to
construct approximate statistics for any binary function of six attributes. (In
general, using data published for ℓ-tuples of attributes, it is possible to approximately
learn statistics for any function of 2ℓ attributes.) Since the published data are the results
of SuLQ database queries, the total number of published statistics must be sub-
linear in the size of the database. Also, in order to keep the error down,
several queries must be made for each conjunction of literals. These two facts
constrain the value of ℓ and the total number of attributes for which the result
is meaningful.
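One ingredient of this reconstruction is that any Boolean function over a set of attributes whose full conjunction frequencies are published is a disjoint union of such conjunctions, so its frequency is the sum of the published frequencies of the conjunctions on which it evaluates to 1. A toy sketch, with an assumed table `published[(attrs, assignment)]` of published (noisy) frequencies; combining two published tuples to cover 2ℓ attributes additionally uses the implication-in-probability machinery and is not sketched here.

```python
from itertools import product

def estimate_function(published, attrs, f):
    """Estimate Pr[f = 1] for a Boolean function f over the attributes in
    `attrs`, given published frequencies for every full conjunction of
    literals over `attrs` (published[(attrs, (1, 0, 1))] is the estimated
    fraction of records with exactly that assignment)."""
    total = 0.0
    for assignment in product((0, 1), repeat=len(attrs)):
        if f(*assignment):
            total += published[(attrs, assignment)]
    return total
```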

1.2 Related Work
There is a rich literature on confidentiality in statistical databases. An excellent
survey of work prior to the late 1980's was made by Adam and Wortmann [2].
Using their taxonomy, our work falls under the category of output perturbation.
However, to our knowledge, the only work that has exploited the opportunities
for privacy inherent in the fact that, with massive databases, the actual number
of queries will be sublinear is Sect. 4 of [5] (joint work with Dwork). That work
only considered single-attribute SuLQ databases.
Fanconi and Merola give a more recent survey, with a focus on aggregated
data released via web access [10]. Evfimievski, Gehrke, and Srikant, in the Intro-
duction to [7], give a very nice discussion of work in randomization of data, in
which data contributors (e.g., respondents to a survey) independently add noise
to their own responses. A special issue (Vol.14, No. 4, 1998) of the Journal of Of-
ficial Statistics is dedicated to disclosure control in statistical data. A discussion
of some of the trends in the statistical research, accessible to the non-statistician,
can be found in [8].
Many papers in the statistics literature deal with generating simulated data
while maintaining certain quantities, such as marginals [9]. Other widely-studied
techniques include cell suppression, adding simulated data, releasing only a sub-
set of observations, releasing only a subset of attributes, releasing synthetic or
partially synthetic data [13,12], data-swapping, and post-randomization. See
Duncan (2001) [6].
R. Agrawal and Srikant began to address privacy in datamining in 2000 [3].
That work attempted to formalize privacy in terms of confidence intervals (in-
tuitively, a small interval of confidence corresponds to a privacy breach), and
also showed how to reconstruct an original distribution from noisy samples (i.e.,
each sample is the sum of an underlying data distribution sample and a noise
sample), where the noise is drawn from a certain simple known distribution.

This work was revisited by D. Agrawal and C. Aggarwal [1], who noted that it
is possible to use the outcome of the distribution reconstruction procedure to
significantly diminish the interval of confidence, and hence breach privacy. They
formulated privacy (loss) in terms of mutual information, taking into account
(unlike [3]) that the adversary may know the underlying distribution on the data
and “facts of life” (for example, that ages cannot be negative). Intuitively, if the
mutual information between the sensitive data and its noisy version is high, then
a privacy breach occurs. They also considered reconstruction from noisy sam-
ples, using the EM (expectation maximization) technique. Evfimievski, Gehrke,
and Srikant [7] criticized the use of mutual information for measuring privacy,
noting that low mutual information allows complete privacy breaches that hap-
pen with low but significant frequency. Concurrently with and independently of
Dinur and Nissim [5] they presented a privacy definition that related the a priori
and a posteriori knowledge of sensitive data. We note below how our definition
of privacy breach relates to that of [7,5].
A different and appealing definition has been proposed by Chawla, Dwork,
McSherry, Smith, and Wee [4], formalizing the intuition that one™s privacy is
guaranteed to the extent that one is not brought to the attention of others. We
do not yet understand the relationship between the definition in [4] and the one
presented here.
There is also a very large literature in secure multi-party computation. In
secure multi-party computation, functionality is paramount, and privacy is only
preserved to the extent that the function outcome itself does not reveal infor-
mation about the individual inputs. In privacy-preserving statistical databases,
privacy is paramount. Functions of the data that cannot be learned while pro-
tecting privacy will simply not be learned.


2 Preliminaries
Notation. We denote by neg(n) (read: negligible) a function that is asymptoti-
cally smaller than any inverse polynomial. That is, for all c > 0 and all sufficiently
large n, we have neg(n) < n^{-c}. We write x ≈ y when |x − y| ≤ neg(n).

2.1 The Database Model
In the following discussion, we do not distinguish between the case of a verti-
cally partitioned database (in which the columns are distributed among several
servers) and a “whole” database (in which all the information is in one place).
We model a database as an binary matrix Intuitively, the
columns in correspond to Boolean attributes and the rows in
correspond to individuals where iff attribute holds for individual
We sometimes refer to a row as a record.
Let be a distribution on We say that a database is
chosen according to distribution if every row in is chosen according to
independently of the other rows (in other words, is chosen according to

In our privacy analysis we relax this requirement and allow each row i to be
chosen from a (possibly) different distribution D_i. In that case we say that the
database is chosen according to D_1 × · · · × D_n.

Statistical Queries. A statistical query is a pair (q, g), where q ⊆ [n] indicates a
set of rows in d and g : {0,1}^k → {0,1} denotes a function on attribute values.
The exact answer a_{q,g} to (q, g) is the number of rows of d in the set q for which
g holds (evaluates to 1):

a_{q,g} = Σ_{i ∈ q} g(d_i).

We write (q, j) when the function g is a projection onto the jth element:
g(x_1, ..., x_k) = x_j. In that case (q, j) is a query on a subset of the entries in
the jth column: a_{q,j} = Σ_{i ∈ q} d_{i,j}. When we look at vertically partitioned single-
attribute databases, the queries will all be of this form.
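In code, the exact answer to a statistical query is a one-line sum; a minimal sketch with the database represented as a list of 0/1 tuples (names are illustrative):

```python
def exact_answer(d, q, g):
    """a_{q,g}: number of rows of d with index in q on which g evaluates to 1."""
    return sum(g(d[i]) for i in q)

def projection(j):
    """The query function used for single-attribute queries (q, j)."""
    return lambda row: row[j]
```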

Perturbation. We allow the database algorithm to give perturbed (or “noisy”)
answers to queries. We say that an answer A(q, g) is within E perturbation if
|A(q, g) − a_{q,g}| ≤ E. Similarly, a database algorithm A is within E perturbation if for
every query (q, g),

Pr[ |A(q, g) − a_{q,g}| ≤ E ] ≥ 1 − neg(n).

The probability is taken over the randomness of the database algorithm A.
2.2 Probability Tool
Proposition 1. Let be random variables so that and
then




Using Azuma™s inequality3 we
Proof. Let hence
get that As
the proposition follows.


3 Privacy Definition
We give a privacy definition that extends the definitions in [5,7]. Our definition
is inspired by the notion of semantic security of Goldwasser and Micali [11]. We
first state the formal definition and then show some of its consequences.
Let p_{i,j}^0 be the a priori probability that d_{i,j} = 1 (taking into account that
we assume the adversary knows the underlying distribution D_i on row i). In
3 Let Y_0, ..., Y_T be a martingale with |Y_{i+1} − Y_i| ≤ 1 for all i. Let λ > 0
be arbitrary. Azuma's inequality says that then Pr[Y_T > λ√T] < exp(−λ²/2).

general, for a Boolean function f : {0,1}^k → {0,1}, we let p_{i,f}^0 be the a priori
probability that f(d_i) = 1. We analyze the a posteriori probability
that f(d_i) = 1 given the answers to T queries, as well as all the values
in all the rows of d other than row i. We denote this a posteriori
probability p_{i,f}^T.

Confidence. To simplify our calculations we follow [5] and define a monotoni-
cally-increasing 1-1 mapping conf : (0,1) → R as follows:

conf(p) = log( p / (1 − p) ).

Note that a small additive change in conf implies a small additive change in p⁴.
Let conf_{i,f}^0 = conf(p_{i,f}^0) and conf_{i,f}^T = conf(p_{i,f}^T). We write our privacy
requirements in terms of the random variables Δconf_{i,f}, defined as⁵:

Δconf_{i,f} = |conf_{i,f}^T − conf_{i,f}^0|.
Definition 1. A database access mechanism is (δ, T)-private
if for every distribution D on {0,1}^k, for every row index i, for every function
f : {0,1}^k → {0,1}, and for every adversary A making at most T queries, it
holds that

Pr[ Δconf_{i,f} > δ ] ≤ neg(n).

The probability is taken over the choice of each row in d according to D, and the
randomness of the adversary as well as the database access mechanism.
A target set F is a set of Boolean functions (one can think of the
functions in F as being selected by an adversary; these represent information it
will try to learn about someone). A target set F is δ-safe if Δconf_{i,f} ≤ δ for
all i ∈ [n] and f ∈ F. Let F be a target set. Definition 1 implies that under a
(δ, T)-private database mechanism, F is δ-safe with probability at least
1 − |F| · n · neg(n).
Proposition 2. Consider database with attributes.
Let F be the target set containing all the Boolean functions over the at-
tributes. Then,
Proof. Let be a target set containing all conjuncts of attributes. We
have that and hence is with probability
To prove the proposition we show that F is safe whenever is. Let
be a Boolean function. Express as a disjunction of conjuncts of attributes:
4 The converse does not hold – conf grows logarithmically in 1/p for p near 0 and logarith-
mically in 1/(1 − p) for p near 1.
5 Our choice of defining privacy in terms of Δconf_{i,f} is somewhat arbitrary; one could
rewrite our definitions (and analysis) in terms of the a priori and a posteriori proba-
bilities. Note however that limiting Δconf_{i,f} in Definition 1 is a stronger requirement
than just limiting the change in probability.


Similarly, express as the disjunction of the remaining
conjuncts: (So
We have:




Let and maximize
maximize Us-
ing we get that
where the last inequality holds as

Privacy vs. Finding Very Heavy Sets. Let f be a target function.
Our privacy requirement implies that it is infeasible to find a “very” heavy set,
that is, a set q in which the fraction of records satisfying f is significantly higher
than its a priori expectation. Such a set would violate our privacy
requirement, as it would allow guessing f(d_i) for a random record i in q.
Relationship to the Privacy Definition of [7]. Our privacy definition extends the
definition of ρ1-to-ρ2 privacy breaches of [7]. Their definition is introduced with
respect to a scenario in which several users send their sensitive data to a center.
Each user randomizes his data prior to sending it. A ρ1-to-ρ2 privacy breach
occurs if, with respect to some property P, the a priori probability that P holds
for a user is at most ρ1, whereas the a posteriori probability may grow beyond
ρ2 (i.e., in a worst case scenario with respect to the coins of the randomization
operator).


4 Privacy of Multi-attribute SuLQ Databases
We first describe our SuLQ Database algorithm, and then prove that it preserves
privacy.
Let $T$ be the bound on the number of queries, and define the noise parameter $R$
in terms of $T$ and the privacy parameter (a suitable polynomial choice will work). To
simplify notation, we drop subscripts and superscripts that are clear from context. On each
query the SuLQ algorithm returns the true sum plus a noise value $N$ obtained by
summing $R$ independent unbiased coin flips and re-centering the result.

Note that $N$ is a binomial random variable with expectation $0$ and standard devi-
ation of order $\sqrt{R}$. In our analysis we will neglect the case where $N$ deviates largely from
zero, as the probability of such an event is extremely small.
In particular, this implies that our SuLQ database algorithm is within
$\tilde{O}(\sqrt{R})$ perturbation.
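A minimal sketch of the mechanism as described above, in Python; the row representation, the query interface (S, g), and the parameter R are our illustrative assumptions rather than the paper's exact interface:

import random

def sulq_answer(db, S, g, R):
    """Perturbed answer to query (S, g): the true count of rows i in S with
    g(db[i]) = 1, plus centered binomial noise (sum of R fair coin flips,
    re-centered to mean 0, standard deviation sqrt(R)/2)."""
    true_count = sum(g(db[i]) for i in S)
    noise = sum(random.randint(0, 1) for _ in range(R)) - R / 2.0
    return true_count + noise

# Toy usage: rows with 3 Boolean attributes, query a random subset
# on the conjunction "attribute 0 AND attribute 2".
db = [tuple(random.randint(0, 1) for _ in range(3)) for _ in range(10000)]
S = random.sample(range(len(db)), 5000)
g = lambda row: row[0] & row[2]
print(sulq_answer(db, S, g, R=10000))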
We will use the following proposition.
Proposition 3. Let B be a binomially distributed random variable with expec-
tation 0 and standard deviation Let L be the random variable that takes the
value Then

1. For this value is
bounded by
2. E[L] = O(1/R), where the expectation is taken over the random choice of B.

Proof. 1. The equality follows from the symmetry of the binomial distribution
(i.e., $\Pr[B = b] = \Pr[B = -b]$).
To prove the bound, consider the quantity in question;
using the limits on B and the definition of R we
get that this value is bounded as claimed.
2. Using the symmetry of the Binomial distribution we get:




Our proof of privacy is modeled on the proof in Section 4 of [5] (for single-
attribute databases). We extend their proof (i) to queries of the form $(q, g)$, where
$g$ is any Boolean function of the attributes, and (ii) to privacy of Boolean functions of the attributes.
Theorem 1. Let and for and
Then the SuLQ algorithm is within
perturbation.
Note that whenever the number of queries is sub-linear in $n$, bounding the adversary's number of
queries to $T$ allows privacy with perturbation magnitude less than $\sqrt{n}$.
Proof. Let $\delta$ and $T$ be as in the theorem, and recall that $R$ was chosen accordingly.

Let the queries issued by the adversary be denoted $(q_1, g_1), \dots, (q_T, g_T)$,
and let $\hat{a}_1, \dots, \hat{a}_T$ be the perturbed answers to
these queries.
We analyze the a posteriori probability that $f(d_i) = 1$ given the answers to
the first $\ell$ queries and $d_{-i}$ (where $d_{-i}$ denotes the entire database
except for the $i$th row); denote this probability $p^{\ell}$. Note that $p^{T}$ coincides with $p^{T}_{i,f}$
(of Section 3), and $p^{0}$ coincides with $p^{0}_{i,f}$ (due to the independence of rows in the database).

By the definition of conditional probability⁶ we get




Note that the probabilities are taken over the coin flips of the SuLQ algorithm
and the choice of row $i$. In the following we analyze the numerator (the denominator
is analyzed similarly).




The last equality follows as the rows in are chosen independently of each
other. Note that given both and the random variable is independent
of Hence, we get:




Next, we observe that although depends on the dependence is weak.
More formally, let be such that and Note
that whenever we have that
When, instead, we can relate and
via Proposition 3:

Lemma 1. Let be such that Then
where and




and is noise that yields when

Proof. Consider the case Writing
and the proof follows from
Proposition 3. Similarly for

Note that the value of does not depend on
Taking into account both cases and we get




⁶ I.e., $\Pr[A \mid B] = \Pr[A \wedge B]/\Pr[B]$.


Let be the probability, over that Letting be such that
we have




and similarly



Putting the pieces together we get that




Define a random walk on the real line with To
conclude the proof we show that (with high probability) T steps of the random
walk do not suffice to reach distance From Proposition 3 and Lemma 1 we get
that


and


Using Proposition 1 with we get that for all




5 Datamining on Vertically Partitioned Databases
In this section we assume that the database is chosen according to $\mathcal{D}^n$ for some
underlying distribution $\mathcal{D}$ on rows, where $\mathcal{D}$ is independent of the size of the
database. We also assume that the database is sufficiently large that the true database
statistics are representative of $\mathcal{D}$. Hence, in the sequel, when we write a probability
such as $\Pr[\alpha]$ we mean the probability, over the entries in the database, that $\alpha$ holds.
Let $\alpha$ and $\beta$ be attributes. We say that $\alpha$ implies $\beta$ in probability if the
conditional probability of $\beta$ given $\alpha$ exceeds the unconditional probability of $\beta$.
The ability to measure implication in probability is crucial to datamining. Note
that since $\Pr[\alpha]$ is simple to estimate well, the problem reduces to obtaining a
good estimate of the conditional probability of $\beta$ given $\alpha$. Moreover, once we can
estimate these conditional probabilities, we can use
Bayes' Rule and de Morgan's Laws to determine the statistics for any Boolean
function of attribute values.
Our key result for vertically partitioned databases is a method, given two
single-attribute SuLQ databases with attributes $\alpha$ and $\beta$ respectively, to measure
the extent to which $\alpha$ implies $\beta$ in probability.

For more general cases of vertically partitioned data, assume a
$k$-attribute database is partitioned into several databases, with (possibly
overlapping) sets of attributes, respectively. We can use functional
queries to learn the statistics on Boolean functions of the attributes in the
$i$th database, and then use the results for two single-attribute SuLQ databases
to learn binary Boolean functions of any two functions $f_1$ (on attributes in
database $i_1$) and $f_2$ (on attributes in database $i_2$), where $i_1 \ne i_2$.

5.1 Probabilistic Implication
In this section we construct our basic building block for mining vertically parti-
tioned databases.
We assume two SuLQ databases of size $n$, with attributes $\alpha$ and $\beta$ respec-
tively. When $\alpha$ implies $\beta$ in probability with a gap of $\varepsilon$, we write $\alpha \Rightarrow_{\varepsilon} \beta$, meaning
that $\Pr[\beta \mid \alpha] \ge \Pr[\beta] + \varepsilon$. We note that $\Pr[\alpha]$ and $\Pr[\beta]$ are easily computed
within small error simply by querying the two databases on large subsets.
Our goal is to determine $\Pr[\beta \mid \alpha]$, or equivalently the gap; the method will be
to determine whether $\alpha \Rightarrow_{\varepsilon} \beta$ for a given $\varepsilon$, and then to estimate the gap
by binary search on $\varepsilon$.

Notation. We let and
Let X be a random variable counting the number of times holds when we
take N samples from Then and
Let



Note that Substituting for we get



and hence (by another application of Eq. (1))




We define the following testing procedure to determine, given $\varepsilon$, whether $\alpha \Rightarrow_{\varepsilon} \beta$.
Step 1 finds a heavy (but not very heavy) set for attribute $\alpha$, that is, a set for
which the number of records satisfying $\alpha$ exceeds the expected number by more
than a standard deviation. Note that the noise
is of lower order, so the heavy set really has an excess of records for which $\alpha$ holds.
Step 2 queries $\beta$ on this heavy set. If the incidence of $\beta$ on this set sufficiently
(as a function of $\varepsilon$) exceeds the expected incidence of $\beta$, then the test returns
“1” (i.e., success). Otherwise it returns 0.
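The two-step test can be sketched as follows; query_alpha and query_beta stand for the two SuLQ databases' noisy subset-count interfaces, p_alpha and p_beta are the easily estimated marginals, and the concrete thresholds are our illustrative choices rather than the paper's exact constants:

import math
import random

def find_heavy_set(query_alpha, n, N, p_alpha):
    """Step 1: sample random subsets of size N until one is 'heavy' for alpha,
    i.e. its noisy alpha-count exceeds the expectation N*p_alpha by at least
    one standard deviation of the underlying binomial count."""
    std = math.sqrt(N * p_alpha * (1.0 - p_alpha))
    while True:
        q = random.sample(range(n), N)
        if query_alpha(q) >= N * p_alpha + std:
            return q

def test(query_alpha, query_beta, n, N, p_alpha, p_beta, threshold):
    """Step 2: query beta on the alpha-heavy set; return 1 iff the incidence of
    beta on that set exceeds its unconditional expectation by threshold."""
    q = find_heavy_set(query_alpha, n, N, p_alpha)
    return 1 if query_beta(q) >= N * p_beta + threshold else 0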




Theorem 2. For the test procedure
1. If then
2. If then
where for the advantage is constant, and for
the advantage with constant
In the following analysis we neglect the difference between the true and the perturbed counts since,
as noted above, the perturbation contributes only low-order terms (we also neglect
some other low-order terms). Note that it is possible to compute all the required
constants for Theorem 2 explicitly, in polynomial time, without neglecting these
low-order terms. Our analysis does not attempt to optimize constants.
Proof. Consider the random variable corresponding to given
that is biased according to Step 1 of By linearity of expectation, together
with the fact that the two cases below are disjoint, we get that




The last step uses Eq. (3). Since the distribution of the count is symmetric around
its mean, we get the first part of the claim.


To get the second part of the claim we use the de Moivre–Laplace theorem
and approximate the binomial distribution by the normal distribution, so that
we can approximate the variance of the sum of two distributions (when $\alpha$ holds
and when $\alpha$ does not hold) in order to obtain the variance of the count of $\beta$ conditioned
on the outcome of Step 1. We get:



Assuming N is large enough, we can neglect the terms involving Hence,




The transition from the second to third lines follows from

We have that the probability distribution on is a Gaussian with mean
and variance at most and respectively.
To conclude the proof, we note that the conditional probability mass of
exceeding its own mean by is at most




where $\Phi$ denotes the cumulative distribution function of the standard normal distribution.
For constant this yields a constant advantage For we get that


By taking sufficiently many independent repetitions we can run the Test procedure enough times to
determine with sufficiently high confidence which “side” of the interval
the true gap is on (if it is not inside the interval). We proceed by binary search to
narrow in on the gap. We get:

Theorem 3. There exists an algorithm that invokes the test




times and outputs such that
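The binary-search wrapper suggested by the theorem can be sketched as follows; here run_test(eps) is assumed to perform the test of Section 5.1 tuned to gap eps and return 0 or 1, repetitions controls the confidence amplification, and the interval endpoints are placeholders for the true range of the gap:

def estimate_gap(run_test, lo, hi, accuracy, repetitions):
    """Binary search for the implication-in-probability gap.
    run_test(eps) should tend to return 1 when the true gap exceeds eps and 0
    when it falls below; a majority vote over `repetitions` trials decides
    which half of the current interval to keep."""
    while hi - lo > accuracy:
        mid = (lo + hi) / 2.0
        votes = sum(run_test(mid) for _ in range(repetitions))
        if votes > repetitions / 2.0:
            lo = mid   # the gap appears to exceed mid
        else:
            hi = mid   # the gap appears to fall below mid
    return (lo + hi) / 2.0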


6 Datamining on Published Statistics
In this section we apply our basic technique for measuring implication in prob-
ability to the real-life model in which confidential information is gathered by
a trusted party, such as the census bureau, who publishes aggregate statistics.
The published statistics are the results of queries to a SuLQ database. That is,
the census bureau generates queries and their noisy responses, and publishes the
results.
7
In more detail:





Let $k$ denote the number of attributes (columns). Let $m$ be fixed (typi-
cally, $m$ will be small; see below). For every $m$-tuple of attributes,
and for each of the $2^m$ conjunctions of literals over these attributes
(the all-positive conjunction, the conjunctions with one attribute negated,
and so on), the bureau publishes the result of some number of
queries on these conjunctions. More precisely, a query set is selected,
and noisy statistics for all $2^m$ conjunctions of literals are published for the
query set. This is repeated $t$ times.
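A sketch of this publishing loop in Python; sulq_count(S, conj) stands for the bureau's noisy count of the rows in S satisfying the conjunction, and the parameters k, m, t, and the query-set size correspond to the quantities introduced above (the exact interface is our assumption):

import itertools
import random

def publish_statistics(n, k, sulq_count, subset_size, t, m=3):
    """For t independently chosen query sets, publish a noisy count for every
    conjunction of literals over every m-tuple of attributes."""
    published = []
    for _ in range(t):
        S = random.sample(range(n), subset_size)        # one query set, reused below
        for attrs in itertools.combinations(range(k), m):
            for signs in itertools.product((0, 1), repeat=m):
                conj = (attrs, signs)                   # literal j is attrs[j] == signs[j]
                published.append((S, conj, sulq_count(S, conj)))
    return published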
To see how this might be used, suppose we wish to learn whether $\alpha$
implies $\beta$ in probability. We know from the results in Section 4 that we
need to find a heavy set for $\alpha$ and then to query the database on that
set with the function $\beta$. Moreover, we need to do this several times
(for the binary search). If $t$ is sufficiently large, then with high probability such
query sets are among the published queries. Since we query all triples (generally,
all $m$-tuples) of literals for each query set, all the necessary information is published.
The analyst need only follow the instructions for learning the strength of
the implication in probability, looking up the results of the published
queries (rather than randomly selecting the sets and submitting the queries to
the database).
As in Section 4, once we can determine implication in probability, it is easy
to determine (via Bayes' rule) the statistics for the conjunction $\alpha \wedge \beta$.
In other words, we can determine the approximate statistics for any conjunction
of literals of attribute values. Now the procedure for arbitrary func-
tions is conceptually simple. Consider a function $f$ of $m$ attribute values.
The analyst first represents the function as a truth table: for each possible
conjunction of literals over the $m$ attributes, the function has value either zero or one. Since
these conjunctions of literals are mutually exclusive, the probability (overall)
that the function has value 1 is simply the sum of the probabilities that each of
the positive (one-valued) conjunctions occurs. Since we can approximate each of
these statistics, we obtain an approximation for their sum. Thus, we can approx-
imate the statistics for each of the $2^{2^m}$ Boolean functions of $m$ attributes. It
remains to analyze the quality of the approximations.
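The truth-table argument translates directly into code; in this sketch (ours, not the paper's), estimate_conjunct(signs) stands for the published or estimated probability of a specific conjunction of literals over the chosen attributes:

import itertools

def estimate_function_probability(f, m, estimate_conjunct):
    """Estimate Pr[f = 1] for a Boolean function f over m attributes by summing
    the estimated probabilities of the mutually exclusive conjunctions
    (rows of the truth table) on which f evaluates to 1."""
    total = 0.0
    for signs in itertools.product((0, 1), repeat=m):
        if f(*signs):
            total += estimate_conjunct(signs)
    return total

# Example: Pr[a0 XOR a1] is the sum of the estimates for (a0 AND NOT a1)
# and (NOT a0 AND a1).
# prob = estimate_function_probability(lambda a0, a1: a0 ^ a1, 2, estimate_conjunct)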
Let $T$ be an upper bound on the number of queries permitted by the
SuLQ database algorithm. Let $k$ and $m$ be as above:
$k$ is the total number of attributes, and statistics for $m$-tuples of attributes will be published.
Let $\Delta$ denote the (combined) additive error achieved for all conjuncts with
high probability.

Privacy is preserved as long as the total number of queries does not exceed $T$ (Theorem 1). To determine util-
ity, we need to understand the error introduced by the summation of estimates.

If our test results in a given additive error for each possible conjunct
of literals, the truth-table method described above allows us to compute the
frequency of every function of the literals within a combined additive error (a lot better in
many cases). We require that each estimate be within this error with high
probability; hence, the probability that a 'bad' conjunct exists
(for which the estimation error is not within the bound) is small.
Plugging these parameters into Theorem 3, we get that for each conjunction of
literals, the number of subsets on which we need to make queries is



For each subset we query each of the conjuncts of attributes. Hence,
the total number of queries we make is




For constant $m$ we get the total number of queries computed above. To
see our gain, compare this with the naive publishing of statistics for all conjuncts
of attributes, which would require far more queries.


7 Open Problems
Datamining of 3-ary Boolean Functions. Section 5.1 shows how to use two SuLQ
databases to learn that $\alpha$ implies $\beta$ in probability. As noted, this allows estimating
$\Pr[f(\alpha,\beta)]$ for any binary Boolean function $f$. Consider the case where there exist
three SuLQ databases for attributes $\alpha$, $\beta$, $\gamma$. In order to use our test procedure
to compute three-wise statistics, one has either to find sets that are heavy for both $\alpha$ and $\beta$ (having
bias of the required order), or, given a heavy set for $\alpha$, to decide whether it is also
heavy w.r.t. $\beta$. It is not clear how to extend the test procedure of Section 5.1
in this direction.

Maintaining Privacy for All Possible Functions. Our privacy definition (Defini-
tion 1) requires, for every function, that with high probability the
confidence gain is limited by some value $\delta$. If the number of attributes $k$ is small
(so that the number of possible functions is manageable), then,
via the union bound, we get that with high probability the confidence gain is
kept small for all the possible functions.
For large $k$ the union bound does not guarantee simultaneous privacy for all
the possible functions. However, the privacy of a randomly selected function
is (with high probability) preserved. It is conceivable that (e.g., using crypto-
graphic measures) it is possible to render infeasible the task of finding a function
whose privacy was breached.

Dependency Between Database Records. We explicitly assume that the database
records are chosen independently from each other, according to some underlying
distribution $\mathcal{D}$. We are not aware of any work that does not make this assumption
(implicitly or explicitly). An important research direction is to come up with
a definition and analysis that work in a more realistic model of weak dependency
between database entries.


References
1. D. Agrawal and C. Aggarwal, On the Design and Quantification of Privacy Pre-
serving Data Mining Algorithms, Proceedings of the 20th Symposium on Principles
of Database Systems, 2001.
2. N. R. Adam and J. C. Wortmann, Security-Control Methods for Statistical
Databases: A Comparative Study, ACM Computing Surveys 21(4): 515-556 (1989).
3. R. Agrawal and R. Srikant, Privacy-preserving data mining, Proc. of the ACM
SIGMOD Conference on Management of Data, pp. 439–450, 2000.
4. S. Chawla, C. Dwork, F. McSherry, A. Smith, and H. Wee, Toward Privacy in
Public Databases, submitted for publication, 2004.
5. I. Dinur and K. Nissim, Revealing information while preserving privacy, Proceed-
ings of the Twenty-Second ACM SIGACT-SIGMOD-SIGART Symposium on Prin-
ciples of Database Systems, pp. 202-210, 2003.
6. G. Duncan, Confidentiality and statistical disclosure limitation. In N. Smelser &
P. Baltes (Eds.), International Encyclopedia of the Social and Behavioral Sciences.
New York: Elsevier. 2001
7. A. V. Evfimievski, J. Gehrke and R. Srikant, Limiting privacy breaches in pri-
vacy preserving data mining, Proceedings of the Twenty-Second ACM SIGACT-
SIGMOD-SIGART Symposium on Principles of Database Systems, pp. 211-222,
2003.
8. S. Fienberg, Confidentiality and Data Protection Through Disclosure Limitation:
Evolving Principles and Technical Advances, IAOS Conference on Statistics, De-
velopment and Human Rights September, 2000, available at
http://www.statistik.admin.ch/about/international/fienberg_final_paper.doc
9. S. Fienberg, U. Makov, and R. Steele, Disclosure Limitation and Related Methods
for Categorical Data, Journal of Official Statistics, 14, pp. 485–502, 1998.
10. L. Franconi and G. Merola, Implementing Statistical Disclosure Control for Aggre-
gated Data Released Via Remote Access, Working Paper No. 30, United Nations
Statistical Commission and European Commission, joint ECE/EUROSTAT work
session on statistical data confidentiality, April, 2003, available at
http://www.unece.org/stats/documents/2003/04/confidentiality/wp.30.e.pdf
11. S. Goldwasser and S. Micali, Probabilistic Encryption and How to Play Mental
Poker Keeping Secret All Partial Information, STOC 1982: 365-377
12. T.E. Raghunathan, J.P. Reiter, and D.B. Rubin, Multiple Imputation for Statistical
Disclosure Limitation, Journal of Official Statistics 19(1), pp. 1–16, 2003.
13. D.B. Rubin, Discussion: Statistical Disclosure Limitation, Journal of Official
Statistics 9(2), pp. 461–469, 1993.
14. A. Shoshani, Statistical databases: Characteristics, problems and some solutions,
Proceedings of the 8th International Conference on Very Large Data Bases
(VLDB'82), pp. 208–222, 1982.




Optimal Perfectly Secure Message Transmission

K. Srinathan*, Arvind Narayanan, and C. Pandu Rangan
Department of Computer Science and Engineering,
Indian Institute of Technology, Madras, Chennai-600036, India
ksrinath@cs.iitm.ernet.in, arvindn@meenakshi.cs.iitm.ernet.in,
rangan@iitm.ernet.in



Abstract. In the perfectly secure message transmission (PSMT) prob-
lem, two synchronized non-faulty players (or processors), the Sender S
and the Receiver R, are connected by wires (each of which facilitates
2-way communication); S has an message that he wishes to send to
R; after exchanging messages in phases¹, R should correctly obtain S's
message, while an adversary listening on and actively controlling any set
of (or less) wires should have no information about S's message.
We measure the quality of a protocol for securely transmitting an
message using the following parameters: the number of wires the num-
ber of phases, and the total number of bits transmitted. The optima
for and are respectively and 2. We prove that any 2-phase
reliable message transmission protocol, and hence any secure protocol,
over wires out of which at most are faulty is required to transmit
at least While no known protocol is simultaneously
optimal in both communication and phase complexity, we present one
such optimum protocol for the case when the size of the message
is large enough; that is, our protocol simultaneously attains the optimal
phase complexity and, for sufficiently long messages, the optimal
communication complexity. Note that privacy is for free if the
message is large enough.
We also demonstrate how randomness can effectively improve the phase
complexity. Specifically, while the (worst-case) lower bound on is 2, we
design an efficient optimally tolerant protocol for PSMT that terminates
in a single phase with arbitrarily high probability.
Finally, we consider the case when the adversary is mobile, that is, he
could corrupt a different set of wires in different phases. Again, the
optima for and are respectively and 2; however, we show that
irrespective of We present the first protocol that is
