Grid news: Custodix Anonymisation Tool (CAT)
The best proof that ACGT achieves its goal of advancing medical science, by offering an IT platform that facilitates seamless and secure access to and analysis of multi-level clinico-genomic data, is a demonstration of its capabilities in the field using "real" data. The success of the ACGT project therefore depends in part on the volume of high-quality data that can be analyzed in the different cancer-related pilot trials.
Sharing and exploiting sensitive medical data in a trans-European network raises a large number of ethical and legal privacy-related questions. The ACGT Data Protection Framework, which combines legal and technical components, tries to provide a convenient way for ACGT users to comply with governing laws and existing best practices. One of its technical components, highlighted here, is the "Custodix Anonymisation Tool" (CAT), which aims to simplify the de-identification of personal data when data is imported from participating centers into the ACGT platform.
De-identification is not a straightforward task. Removing the obvious identifiers from a dataset is certainly not sufficient. Adequate privacy protection involves a thorough risk assessment to define how the data must be transformed (e.g. through perturbation, suppression, aggregation, etc.) to guarantee that it cannot be re-identified. Privacy protection means balancing re-identification risk against data usability, since both depend on the information content of the data.
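The three transformations named above can be illustrated with a minimal Python sketch. It is not CAT code: the function names, the sample record and the chosen parameters (ten-year age bands, a noise bound of 0.5 kg) are assumptions made purely for illustration, and suitable values would have to come out of the risk assessment.

```python
import random

def suppress(record, fields):
    """Suppression: drop the named fields entirely."""
    return {k: v for k, v in record.items() if k not in fields}

def aggregate_age(age, width=10):
    """Aggregation: replace an exact age by a band such as '40-49'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def perturb(value, max_noise, rng):
    """Perturbation: add bounded random noise to a numeric value."""
    return value + rng.uniform(-max_noise, max_noise)

# Hypothetical input record (not taken from any real dataset).
record = {"name": "Jane Doe", "age": 47, "weight_kg": 68.0}

rng = random.Random(0)  # seeded only to make the sketch repeatable
out = suppress(record, {"name"})
out["age"] = aggregate_age(out["age"])
out["weight_kg"] = perturb(out["weight_kg"], 0.5, rng)
print(out)
```

Each step lowers re-identification risk by removing information, which is exactly why it also lowers usability: the age band is less precise than the age, and the perturbed weight is no longer the measured one.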
CAT does not aim to offer a complete solution to the data protection issues mentioned above. It was designed as a generic solution (as opposed to the many ad hoc solutions that pop up with every new data collection initiative) that removes a large part of the "practical" burden when people want to exchange information in compliance with governing legislation and ethical guidelines.
Using CAT
CAT basically consists of a "workbench" and a "wizard". The "CAT workbench" is used to define the mechanics (the data protection profile) through which data is exported for sharing; the "wizard" lets users apply those mechanics over and over again to new datasets. Executing these profiles, and thus exporting data, does not have to be a manual exercise: besides the wizard, CAT can be used as a command line tool (perfect for scripting) or even as a (Java) library for full integration.
Designing a data protection profile in the workbench consists of two important tasks:
- Creation of a mapping from a specific data format to a more generic internal format
- Definition of actions that should be performed on the generic data format in order to de-identify data (data protection profile)
Privacy processing actions in CAT are defined against an internal generic data model. The big advantage of this approach is that a single privacy protection profile can be applied to different data sources (in different formats). The mapping of the data sources to the generic data model can easily be done in the workbench itself.
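The idea of one profile serving several formats can be sketched in a few lines of Python. This is an illustration of the principle only, not CAT's internal model or API: the generic field names, the mapping tables, the action names and the sample records are all hypothetical. The two DICOM tags used are the standard ones for Patient's Name (0010,0010) and Patient ID (0010,0020).

```python
# Hypothetical mappings: source-specific key -> generic field name.
CSV_MAPPING = {
    "PatientID": "patient.id",
    "Comments": "notes.free_text",
    "Hb": "lab.haemoglobin",
}
DICOM_MAPPING = {
    "(0010,0020)": "patient.id",
    "(0010,0010)": "patient.name",
    "(0008,0060)": "image.modality",
}

# One profile, written once against the generic field names.
PROFILE = {
    "patient.id": "keep",
    "lab.haemoglobin": "keep",
    "image.modality": "keep",
    "patient.name": "remove",
    "notes.free_text": "remove",
}

def to_generic(source_record, mapping):
    """Rename mapped source keys to generic field names; drop unmapped keys."""
    return {mapping[k]: v for k, v in source_record.items() if k in mapping}

def apply_profile(generic_record, profile):
    """Keep only fields the profile explicitly marks as 'keep'."""
    return {f: v for f, v in generic_record.items() if profile.get(f) == "keep"}

csv_row = {"PatientID": "P-001", "Comments": "seen by Dr. X", "Hb": "13.2"}
dicom_header = {"(0010,0020)": "P-001", "(0010,0010)": "DOE^JANE", "(0008,0060)": "MR"}

csv_out = apply_profile(to_generic(csv_row, CSV_MAPPING), PROFILE)
dicom_out = apply_profile(to_generic(dicom_header, DICOM_MAPPING), PROFILE)
print(csv_out)
print(dicom_out)
```

In this sketch any field the profile does not mention is dropped rather than passed through, so a newly mapped field stays out of the export until someone decides how it should be treated. Whether CAT itself defaults this way is not stated in this article.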
CAT supports CSV ("Comma Separated Values"), XML, DICOM and CEL files (microarray data), as well as direct operations on relational databases. Support is provided through a modular plug-in mechanism, which also allows developers to add support for their own proprietary data formats.
CAT in ACGT
CAT will be used, for example, by Jules Bordet to share TOP trial data on the ACGT platform. Two types of data are available for the pool of patients included in the study: medical images (DICOM) and associated lab results, which are stored in a CSV file (e.g. exported from Microsoft Excel).
Assume, for example, that the privacy risk analysis includes requirements such as: "identifiers and free-text (which could contain identifiers) must be removed". These requirements can be formalized in the CAT workbench in terms of operations on a generic data model. In this example:
- Patient identifiers have to be removed and replaced by a pseudonym, such that patients in CSV and DICOM files remain linked.
- The patient demographics must be stored in encrypted form. This way, an exported record can be re-identified at a later point in time by the people who originally made the data available. This can be useful, for example, when reporting adverse events.
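The first requirement, pseudonyms that keep CSV and DICOM records linked, can be sketched with a keyed hash. This is one common technique, shown here as an assumption; the article does not say which pseudonym generators CAT uses. The key, the function name and the sample identifiers are hypothetical. Encrypting the demographics is left out, since it would need a vetted cryptography library rather than a few lines of standard-library code.

```python
import hashlib
import hmac

def pseudonym(patient_id: str, secret_key: bytes, length: int = 16) -> str:
    """Derive a stable pseudonym from a patient identifier with HMAC-SHA256.

    The same identifier and key always give the same pseudonym, so records
    from different sources stay linked. Without the key, the pseudonym cannot
    be recomputed from a guessed identifier.
    """
    digest = hmac.new(secret_key, patient_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:length]

# Hypothetical key; in practice it would be generated randomly and kept
# by the party that provides the data.
KEY = b"example-key-held-by-the-data-provider"

csv_pseudonym = pseudonym("P-001", KEY)    # identifier read from the CSV file
dicom_pseudonym = pseudonym("P-001", KEY)  # same patient, read from a DICOM header
other_pseudonym = pseudonym("P-002", KEY)

print(csv_pseudonym == dicom_pseudonym)  # True: the two records remain linked
print(csv_pseudonym == other_pseudonym)  # False: different patients stay distinct
```

A keyed hash is used rather than a plain hash because patient identifiers are often short and guessable; with a plain hash, anyone could hash candidate identifiers and compare the results.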
CAT aims to include a full library of privacy processing functions, including a wide range of pseudonym generators, placeholders for encrypted storage, free-text de-identification and date transformations, among others, and it allows users to easily add custom transformation functions.
Once data mappings and a data protection profile exist, users can use the wizard to process several input sources at once with a single mouse click, or use the profiles to script the command-line version of CAT.
Brecht Claerhout