Several months ago, ZFS added deduplication support. This is a brilliant feature which allows N duplicate copies of data to be stored only once on disk, rather than N times. Unfortunately, deduplication raises many potential security concerns, some obvious and some not so obvious. For instance, hash collisions are an obvious problem and are fairly well considered in deduplication implementations; this is why ZFS uses the still-secure SHA-256 algorithm.

This article describes the simple method by which any user may determine if data exists elsewhere within a ZFS pool with deduplication enabled. The attacking user must possess write access and the ability to determine the volume’s size/alloc/free statistics. Such statistics may be available via various means such as NFS, Samba, or ‘df’.

The root of this bug lies in how the ZFS engineers chose to handle ‘df’. That is, should the deduplication be completely transparent? Should used/allocated space always increase even if it surpasses total space (weird)? Or, rather, should the disk size increase by the deduplicated amount? The ZFS engineers chose, perhaps unfortunately, the latter solution.

The attack is simple. Upon allocating blocks, if the volume’s size increases, then the blocks must have already existed. In contrast, if the volume’s size stays the same, but the allocated space increases, then it is the first copy of the blocks to have been written within the pool.
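The decision rule above can be sketched in a few lines of Python. This is only an illustration of the comparison, not a working exploit: the `PoolStats` type and the `classify_write` function are hypothetical names, and how the two snapshots are actually obtained from the pool is deliberately left out. It also assumes no other activity on the pool between the two snapshots.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PoolStats:
    """One snapshot of the statistics the attacker can observe (hypothetical type)."""
    size: int       # reported total size, in bytes
    allocated: int  # reported allocated space, in bytes


def classify_write(before: PoolStats, after: PoolStats) -> str:
    """Compare snapshots taken before and after the attacker's own write.

    Assumes nothing else wrote to the pool in between.
    """
    size_delta = after.size - before.size
    alloc_delta = after.allocated - before.allocated

    if size_delta > 0:
        # Reported size grew: the blocks were deduplicated against existing data.
        return "duplicate"
    if size_delta == 0 and alloc_delta > 0:
        # Only allocation grew: this was the first copy in the pool.
        return "unique"
    # Anything else (shrinking values, no change) tells us nothing.
    return "inconclusive"
```

For example, a write after which the reported size grows is classified as "duplicate", while a write after which only the allocated figure grows is classified as "unique".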

The best prevention against this attack is not to use deduplication. The success of this attack depends on disk activity: frequent writes of random duplicate blocks will invalidate the delta calculation. This attack is expected to perform best against large contiguous blocks, as such data will trigger a larger delta. Unfortunately, it is precisely for large datasets that deduplication is preferred.

The important question, of course, is what the current practical applications are. There are certainly privacy concerns. It should be noted that private cryptographic keys should be relatively safe, as the attack is more effective against large datasets.

I’d love to receive feedback; please email me: eric@windisch.us, or find me on Twitter: ewindisch

-

EDIT: I realize that the volume statistics are not visible via ‘df’, nor via NFS or CIFS. You must receive this information from the ‘zfs’ command, or from tools which interface to it. A bug report has been filed in OpenSolaris.
