Introduction
The gathering, analysis, and dissemination of data are important aspects of the operations of both commercial and governmental organizations. The Internet has made data gathering and dissemination easier, improvements in storage technology enable vast amounts of data to be stored and organized, and advances in hardware and software allow organizations to perform sophisticated analysis and data mining. From a utility viewpoint, commercial organizations view data as an important resource, essential to maintaining competitive advantage. Governmental agencies such as the Census Bureau also gather and disseminate vast amounts of data in the hope that its analysis will reveal social trends and serve society as a whole.
While the role of data as an important resource is evident, there have been increasing concerns, expressed by privacy advocates and others, that organizational data gathering and usage could violate individual privacy and confidentiality. Such violations could occur during the gathering, storing, analysis, sharing, or dissemination of data. Hence, techniques that preserve the value of data while simultaneously protecting the privacy and confidentiality of individuals and other entities are becoming increasingly important.
A number of approaches have been proposed to mask sensitive numerical data, such as swapping, rounding, imputation, microaggregation, data distortion, and perturbation (Willenborg & Waal, 2001), as well as hybrid approaches (Melville & McQuaid, 2012). For numerical, confidential data, however, perturbation is considered the most effective at reducing data disclosure while providing high data utility. Data perturbation can be further classified into additive perturbation and multiplicative perturbation (Muralidhar, Parsa, & Sarathy, 1999). Additive perturbation is better suited to multivariate data and has been improved over the years.
Additive data perturbation involves adding random noise with specific characteristics to confidential numerical data. The noise-added data prevents the worst-case scenario in which a record is identified with certainty and its true values are revealed. However, the release of perturbed microdata can still incur identity and value disclosure risks (Muralidhar & Sarathy, 2012). Identity disclosure risk (or simply re-identification risk) is the degree to which a perturbation technique allows the identity of a particular de-identified record to be inferred. Value disclosure risk is an assessment of how closely a malicious user can estimate a true confidential value. A malicious user (also called a “snooper” or “intruder”) typically compromises the privacy of the data in two stages. Consider a data set containing information on households. Assume that a snooper has partial information on a specific household and is trying to identify the record belonging to that household in the released data, with the intention of revealing the household's income. Assume further that all identifying information has been removed from the data (i.e., the data is “de-identified”) and that the data has been protected using one of the masking techniques mentioned earlier. In the first stage, the snooper uses the information already in hand to establish links between records in the two files using record linkage techniques; if the snooper succeeds, the identities of the individuals and their perturbed data are exposed. In the second stage, the snooper focuses on these linked records and attempts to extract sensitive information. If the masking method does not protect the sensitive data adequately, the snooper may obtain a good estimate of the household's income from the masked income value.
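To make the basic idea concrete, additive perturbation can be sketched in a few lines of Python. This is a minimal illustration only: the function name, the choice of zero-mean Gaussian noise, the noise standard deviation, and the income figures are all assumptions for the example, not the specifics of any particular method discussed in the literature.

```python
import random
import statistics

def perturb(values, noise_sd):
    """Mask confidential values by adding independent zero-mean
    Gaussian noise with standard deviation noise_sd to each one."""
    return [v + random.gauss(0.0, noise_sd) for v in values]

# Hypothetical confidential attribute: household incomes.
incomes = [42_000, 55_500, 61_200, 38_900, 73_400]

random.seed(1)  # fixed seed so the example is reproducible
masked = perturb(incomes, noise_sd=5_000)

# Individual true values are hidden, but because the noise has zero
# mean, the sample mean of the masked data stays close to the original.
print(statistics.mean(incomes), statistics.mean(masked))
```

Because the noise is random, a snooper who links a masked record to a household learns only a noisy version of its income; how well that noisy value estimates the true one is exactly the value disclosure risk described above.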
In addition to disclosure risk, data masking techniques are also evaluated on how well they preserve data utility. The data utility of the protected data is the extent to which results of analyses performed on the protected data are similar to results of the same analyses performed on the original, confidential data. The utility of additive data perturbation methods has been assessed in prior literature (Muralidhar et al., 1999), as has value disclosure (Sarathy & Muralidhar, 2002). However, no prior study has undertaken a comprehensive assessment of the re-identification risk of all existing additive data perturbation methods. Such an assessment is critical to the selection of an appropriate data masking technique.
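A simple way to see what a utility assessment means in practice is to run the same summary analysis on both the original and the masked data and compare the results. The sketch below is only illustrative: real utility assessments in the literature compare full analyses (e.g., regressions or covariance structures), and the function name, statistics chosen, and data values here are assumptions for the example.

```python
import random
import statistics

def utility_report(original, masked):
    """Compare simple summary statistics of the original and masked
    data; small differences suggest the analysis results are preserved."""
    return {
        "mean_diff": abs(statistics.mean(original) - statistics.mean(masked)),
        "sd_diff": abs(statistics.stdev(original) - statistics.stdev(masked)),
    }

random.seed(7)  # fixed seed so the example is reproducible
incomes = [42_000, 55_500, 61_200, 38_900, 73_400]  # hypothetical data
masked = [v + random.gauss(0.0, 5_000) for v in incomes]

report = utility_report(incomes, masked)
print(report)
```

Note the tension this exposes: larger noise lowers disclosure risk but widens these differences, which is why masking methods are judged on both criteria together.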