When is Pseudonymous Data Not Personal Information?

The 2016 Brussels Privacy Symposium, which took place last week (November 8th), was the first annual academic program jointly presented by the Brussels Privacy Hub of the Vrije Universiteit Brussel (VUB) and the Future of Privacy Forum (FPF). The all-day workshop was titled Identifiability: Policy and Practical Solutions for Anonymization and Pseudonymization.

Khaled El Emam, Jules Polonetsky, Luk Arbuckle and I have co-authored and submitted an article entitled The Seven States of Data: When is Pseudonymous Data Not Personal Information? (short version). Khaled’s slides pertaining to this paper are available here.

There has been considerable discussion about the meaning of personal information and of identifiability. These concepts matter in privacy because they determine the applicability of legislative requirements: data protection laws (“DPLs”) around the world protect and govern personal information, and a common definition of personal information as “information pertaining to an identifiable individual” appears in virtually all of them. Consequently, the notion of “identifiability” becomes key to the interpretation and application of these laws.

There is a general view that identifiability falls on a spectrum, from no risk of re-identification to fully identifiable, with many gradations in between. Recently, a number of legal scholars have proposed different approaches to determine at what point information should be considered “personal”, in many cases using a risk-based approach. For instance, Schwartz and Solove define three specific states of data: identified, identifiable, and non-identifiable. Identified information is that which “singles out a specific individual from others”. Identifiable information, under Schwartz and Solove’s definition, is information that does not currently “single out” an individual but could be used to identify an individual at some point in the future. Finally, non-identifiable information is that which cannot reasonably be linked to an individual. Rubenstein acknowledges that a range of identifiability, or conversely de-identification, exists and that these concepts need to be more clearly defined in terms of the risk posed in order to ensure that the risk can be effectively mitigated. Polonetsky, Tene and Finch define various points on the spectrum of identifiability, from “explicitly personal” information, which contains direct and indirect identifiers without any safeguards or controls, to “aggregated anonymous” information, from which direct and indirect identifiers have been removed or transformed and for which no controls are required due to the highly aggregated nature of the data. Along this spectrum, they place pseudonymous data somewhere near the middle and argue that it can pose more or less risk depending on the methods used to transform direct identifiers and the safeguards and controls applied.

In our article, we extend the previous work in this area by:

  1. mapping the spectrum of identifiability to a risk-based approach for evaluating identifiability that is consistent with practices in the disclosure control community (a small illustrative risk calculation follows this list);
  2. defining precise criteria for evaluating the different levels of identifiability; and
  3. proposing a new point on this spectrum, using the same risk-based framework, that would allow broader uses of pseudonymous data under certain conditions.

In doing so, we aim to strengthen the existing literature on the spectrum of identifiability by proposing a precise framework that is consistent with contemporary regulations and best practices. The identifiability “states of data” proposed here are colored by our experiences with health data, although they may nevertheless be useful much more broadly in other domains.
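As a concrete point of reference for the first item, the sketch below shows one common way the disclosure control community quantifies re-identification risk: group records into equivalence classes on their quasi-identifiers and treat 1/(class size) as the probability of singling out a record. The dataset, field names, and threshold shown are purely hypothetical and are not taken from our paper; they only illustrate the kind of risk measurement a risk-based framework builds on.

```python
from collections import Counter

def reidentification_risk(records, quasi_identifiers):
    """Return (max_risk, avg_risk) from equivalence-class sizes on the quasi-identifiers."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    n = len(records)
    max_risk = 1 / min(groups.values())   # worst-case record (smallest class)
    avg_risk = len(groups) / n            # average of 1/class_size over all records
    return max_risk, avg_risk

# Hypothetical toy dataset: age band and region act as quasi-identifiers.
people = [
    {"age_band": "30-39", "region": "Ottawa",   "diagnosis": "A"},
    {"age_band": "30-39", "region": "Ottawa",   "diagnosis": "B"},
    {"age_band": "40-49", "region": "Gatineau", "diagnosis": "A"},
]

max_risk, avg_risk = reidentification_risk(people, ["age_band", "region"])
print(max_risk, avg_risk)  # 1.0 and 0.67: the lone 40-49/Gatineau record is unique

# A release policy might then compare these measures to a context-specific
# threshold (a policy choice, not something this sketch decides).
THRESHOLD = 0.09  # illustrative only; roughly "no record in a class smaller than 11"
print("acceptable" if max_risk <= THRESHOLD else "too risky")
```

Where a dataset ultimately sits on the identifiability spectrum also depends on the safeguards and controls surrounding it, not on this arithmetic alone.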

All of the final papers of the Brussels Privacy Symposium are available on the FPF website.
