Introduction
The gathering, analysis, and dissemination of data are important aspects of the operations of both commercial and governmental organizations. The Internet has made data gathering and dissemination easier, improvements in storage technology enable vast amounts of data to be stored and organized, and advances in hardware and software allow organizations to perform sophisticated analysis and data mining. From a utility viewpoint, commercial organizations view data as an important resource, essential to maintaining competitive advantage. Governmental agencies such as the Census Bureau also gather and disseminate vast amounts of data in the hope that its analysis will reveal social trends and serve society as a whole.
While the role of data as an important resource is evident, there have been increasing concerns, expressed by privacy advocates and others, that organizational data gathering and usage could violate individual privacy and confidentiality. Such violations could occur during the gathering, storing, analysis, sharing, or dissemination of data. Hence, techniques that preserve the value of data while simultaneously protecting the privacy and confidentiality of individuals and other entities are becoming increasingly important.
A number of approaches have been proposed to mask sensitive numerical data, such as swapping, rounding, imputation, microaggregation, data distortion, and perturbation (Willenborg & Waal, 2001), as well as hybrid approaches (Melville & McQuaid, 2012). For numerical, confidential data, however, perturbation is considered the most effective at reducing data disclosure while providing high data utility. Data perturbation can be further classified into additive perturbation and multiplicative perturbation (Muralidhar, Parsa, & Sarathy, 1999). Additive perturbation is better suited to multivariate data and has been improved over the years.
Additive data perturbation involves adding random noise with specific characteristics to confidential numerical data. The noise-added data prevents the worst-case scenario in which a record is identified with certainty and its true values are revealed. However, the release of perturbed microdata can still incur identity and value disclosure risks (Muralidhar & Sarathy, 2012). Identity disclosure risk (or simply re-identification risk) is the degree to which a perturbation technique allows the identity of a particular de-identified record to be inferred. Value disclosure risk is an assessment of how closely a malicious user can estimate a true confidential value. A malicious user (also called a “snooper” or “intruder”) typically compromises the privacy of the data in two stages. Consider a data set containing information on households. Assume that a snooper has partial information on a specific household and is trying to identify the record belonging to that household in the released data, with the intention of revealing the household's income. Assume further that all identifying information has been removed from the data (i.e., the data is “de-identified”) and that the data has been protected using one of the masking techniques mentioned earlier. In the first stage, the snooper uses the information already in hand to establish links between records in the two files using record linkage techniques; if the snooper succeeds, the identities of the individuals and their perturbed data are exposed. In the second stage, the snooper focuses on these linked records and attempts to extract sensitive information. If the masking method does not protect the sensitive data adequately, the snooper may obtain a good estimate of the household's income from the masked income value.
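To make the basic idea concrete, additive perturbation can be sketched in a few lines of Python. This is a minimal illustration only: the function name, the choice of zero-mean Gaussian noise, the noise standard deviation, and the income figures are all assumptions for the example, not the specifics of any particular method discussed in the literature.

```python
import random
import statistics

def perturb(values, noise_sd):
    """Mask confidential values by adding independent zero-mean
    Gaussian noise with standard deviation noise_sd to each one."""
    return [v + random.gauss(0.0, noise_sd) for v in values]

# Hypothetical confidential attribute: household incomes.
incomes = [42_000, 55_500, 61_200, 38_900, 73_400]

random.seed(1)  # fixed seed so the example is reproducible
masked = perturb(incomes, noise_sd=5_000)

# Individual true values are hidden, but because the noise has zero
# mean, the sample mean of the masked data stays close to the original.
print(statistics.mean(incomes), statistics.mean(masked))
```

Because the noise is random, a snooper who links a masked record to a household learns only a noisy version of its income; how well that noisy value estimates the true one is exactly the value disclosure risk described above.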
In addition to disclosure risk, data masking techniques are also evaluated on how well they preserve data utility. The data utility of the protected data is the extent to which results of analyses performed on the protected data are similar to results of the same analyses performed on the original, confidential data. The utility of additive data perturbation methods has been assessed in prior literature (Muralidhar et al., 1999), as has value disclosure (Sarathy & Muralidhar, 2002). However, no prior study has undertaken a comprehensive assessment of the re-identification risk of all existing additive data perturbation methods. Such an assessment is critical to the selection of an appropriate data masking technique.
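A simple way to see what a utility assessment means in practice is to run the same summary analysis on both the original and the masked data and compare the results. The sketch below is only illustrative: real utility assessments in the literature compare full analyses (e.g., regressions or covariance structures), and the function name, statistics chosen, and data values here are assumptions for the example.

```python
import random
import statistics

def utility_report(original, masked):
    """Compare simple summary statistics of the original and masked
    data; small differences suggest the analysis results are preserved."""
    return {
        "mean_diff": abs(statistics.mean(original) - statistics.mean(masked)),
        "sd_diff": abs(statistics.stdev(original) - statistics.stdev(masked)),
    }

random.seed(7)  # fixed seed so the example is reproducible
incomes = [42_000, 55_500, 61_200, 38_900, 73_400]  # hypothetical data
masked = [v + random.gauss(0.0, 5_000) for v in incomes]

report = utility_report(incomes, masked)
print(report)
```

Note the tension this exposes: larger noise lowers disclosure risk but widens these differences, which is why masking methods are judged on both criteria together.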