UBITECH presents a scientific paper on clinical and genetic data anonymization at CERC 2019 in Darmstadt, Germany

A scientific paper entitled “Anonymizing Clinical and Genetic Data of Patients with Minimum Information Loss”, authored by UBITECH, is presented at the 5th Collaborative European Research Conference (CERC 2019), held on March 29-30, 2019 in Darmstadt, Germany. In this paper, Dimitris Ntalaperas and Thanassis Bouras propose an approach to data anonymization that fosters data exchange by disclosing a row-based anonymized version of the original data set. The methodology is more versatile than traditional data-cube-oriented approaches, and it also preserves the statistical characteristics of the original data set. We demonstrate this with an SVM predictor that estimates the value of Breslow’s depth from the value of another clinical variable, namely Clark’s level, and the expression count of a skin-cancer-related gene (CDKN2A). The predictions are shown to have the same characteristics for both the original and the anonymized data sets.

In particular, we introduce a new methodology that performs anonymization directly on the patient data. The transformation may be applied to the whole data set, which is then distributed in row format. Since the transformation is applied at the row level, its running time is linear in the number of rows, and attributes of interest can be added or removed in a time-efficient manner. The methodology is applied to a use case that involves data of patients treated for melanoma. We demonstrate that data sets containing phenotypical and genetic data can be effectively anonymized, with any information loss not leading to a substantial change in the data’s statistical characteristics. The methodology can be applied to patient data, which can then be shared between parties in a manner that does not compromise patients’ personal and sensitive information and that is compliant with the requirements of GDPR. Moreover, the transformed data exhibit the same statistical characteristics as the original data; their scientific value is therefore not compromised by the anonymization transformations.
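
To make the idea of a row-level transformation more concrete, the following is a minimal illustrative sketch in Python. It is not the transformation proposed in the paper, whose details are not given here: the column names (`name`, `patient_id`, `age`, `breslow_depth`, `clark_level`, `cdkn2a_count`), the age bucketing, and the small multiplicative noise are all assumptions chosen purely for illustration. The sketch only shows why processing one record at a time makes a single pass over the data and why attributes can be dropped or kept independently per row.

```python
import random
import statistics

# Hypothetical direct identifiers; the real schema is not described in the text.
DIRECT_IDENTIFIERS = {"name", "patient_id"}


def anonymize_row(row, rng, noise_scale=0.05):
    """Return an anonymized copy of one patient record (illustrative only)."""
    out = {}
    for key, value in row.items():
        if key in DIRECT_IDENTIFIERS:
            continue  # drop direct identifiers entirely
        if key == "age":
            low = (value // 10) * 10
            out[key] = f"{low}-{low + 9}"  # generalize age to a 10-year band
        elif isinstance(value, float):
            # Perturb continuous measurements by at most +/- noise_scale (relative).
            out[key] = value * (1 + rng.uniform(-noise_scale, noise_scale))
        else:
            out[key] = value  # keep categorical values and counts as they are
    return out


def anonymize(rows, seed=0):
    """Apply the row-level transformation once per record (one pass over the data)."""
    rng = random.Random(seed)
    return [anonymize_row(row, rng) for row in rows]


if __name__ == "__main__":
    # Synthetic records, generated only to exercise the sketch.
    gen = random.Random(42)
    patients = [
        {
            "patient_id": i,
            "name": f"patient-{i}",
            "age": gen.randint(20, 89),
            "breslow_depth": round(gen.uniform(0.2, 6.0), 2),  # millimetres
            "clark_level": gen.randint(1, 5),
            "cdkn2a_count": gen.randint(0, 500),
        }
        for i in range(1000)
    ]
    released = anonymize(patients)
    original_mean = statistics.mean(p["breslow_depth"] for p in patients)
    released_mean = statistics.mean(p["breslow_depth"] for p in released)
    print(f"mean Breslow depth: original={original_mean:.3f}, released={released_mean:.3f}")
```

Because each perturbed value stays within 5% of its original, the mean of a positive-valued column in this sketch cannot drift by more than 5%. This is only a toy check of one statistic; a comparison of predictor behaviour on original and anonymized data, as described above, would be a stronger test of preserved statistical characteristics.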