The elaboration of guidelines and recommendations for integrating clinical data sources into the ACGT platform
An important challenge in carrying out post-genomic biomedical research is to efficiently manage and retrieve all relevant data from many heterogeneous sources. A post-genomic clinical trial involves the collection, storage and management of a wide variety of data, including clinical data collected on Case Report Forms (e.g. symptoms, histology, administered treatment, treatment response), imaging data, genomic data, pathology data and other laboratory data. In addition, access to many external sources of data and knowledge is required; these store information about gene and protein sequences, pathways, genomic variation, microarray experiments, medical literature, and more. Seamless access to all these data repositories would greatly facilitate research.
In order to provide seamless data access, syntactic and semantic integration need to take place. Syntactic data integration handles differences in formats and mechanisms of data access; semantic integration handles the fact that the same information can be represented in different ways, using different terms and identifiers.
To achieve syntactic integration, the data access services first need to provide a uniform data access interface. This includes uniformity of transport protocol, message syntax, query language, and data format. Through the ACGT syntactic access services data can be queried using SPARQL, thus hiding the different query mechanisms provided by the underlying databases.
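The idea of a uniform query interface can be pictured as a single entry point that accepts SPARQL regardless of the backend. A minimal sketch; the `DataAccessService` interface and class names below are illustrative, not the actual ACGT API:

```python
from abc import ABC, abstractmethod

class DataAccessService(ABC):
    """Illustrative uniform interface: every source is queried with SPARQL."""

    @abstractmethod
    def query(self, sparql: str) -> list[dict]:
        """Run a SPARQL query and return result rows as dictionaries."""

class RelationalService(DataAccessService):
    """Hypothetical wrapper that would translate SPARQL to SQL internally."""

    def __init__(self, rows: list[dict]):
        self._rows = rows  # stand-in for a real relational backend

    def query(self, sparql: str) -> list[dict]:
        # A real service would parse the SPARQL and rewrite it against the
        # underlying database; here we return canned rows to keep the sketch runnable.
        return self._rows

service: DataAccessService = RelationalService([{"patient": "P01", "histology": "ductal"}])
print(service.query("SELECT ?patient WHERE { ?patient a :Patient }"))
```

The point of the abstraction is that clients depend only on `query(sparql)`, so a DICOM or microarray backend can be substituted without client changes.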
In ACGT we have implemented syntactic data access services to access relational databases, DICOM image repositories and BASE microarray databases. A relational database can also be used to make available data that is not yet stored in one but that can be mapped to the relational data model. This holds for data collected in files of various formats, such as Excel spreadsheets, plain text files and XML files.
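For example, data kept in a CSV file can be loaded into a relational database and then exposed through a relational data access service. A minimal sketch using only the Python standard library; the file contents and column names are made up for illustration:

```python
import csv
import io
import sqlite3

# Stand-in for a CSV export of trial data; in practice this would be a file on disk.
csv_data = io.StringIO(
    "patient_id,histology,response\n"
    "P01,ductal,partial\n"
    "P02,lobular,complete\n"
)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clinical (patient_id TEXT, histology TEXT, response TEXT)")

# DictReader yields one dict per row, which matches the named placeholders below.
reader = csv.DictReader(csv_data)
conn.executemany(
    "INSERT INTO clinical VALUES (:patient_id, :histology, :response)",
    reader,
)

# Once in the relational model, the data can be served like any other database.
rows = conn.execute("SELECT patient_id, response FROM clinical").fetchall()
print(rows)  # → [('P01', 'partial'), ('P02', 'complete')]
```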
Next, the data access services export the structure of the database using a common data model, together with possible query limitations of the data source. An RDF Schema of the data resources is exported on demand. Clients use this information for constructing queries, e.g. the semantic mapping editor uses this schema to provide the mapping to the ACGT Master Ontology. Finally, the data access services enforce the data source access policy, and audit access to data sources.
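The schema export can be illustrated by generating RDFS triples from a table definition. A simplified sketch with made-up names and Turtle-like output; the real ACGT export is considerably richer:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clinical (patient_id TEXT, histology TEXT)")

def export_rdfs(conn: sqlite3.Connection, table: str,
                base: str = "http://example.org/schema#") -> str:
    """Emit a table as an rdfs:Class and its columns as rdf:Property (illustrative)."""
    lines = [f"<{base}{table}> a rdfs:Class ."]
    # PRAGMA table_info yields (cid, name, type, notnull, default, pk) per column.
    for _, column, *_ in conn.execute(f"PRAGMA table_info({table})"):
        lines.append(
            f"<{base}{table}/{column}> a rdf:Property ; rdfs:domain <{base}{table}> ."
        )
    return "\n".join(lines)

print(export_rdfs(conn, "clinical"))
```

A client such as the semantic mapping editor would consume this schema to build the mapping to the ACGT Master Ontology.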
The main steps to integrate new sources with trial-specific patient data into the ACGT platform are the same for all data sources, irrespective of the type of data they store.
1. Export the data from the data source.
2. Anonymise and pseudonymise the data.
3. Determine who should be allowed access, and if need be, create the appropriate contracts and have these signed by all parties.
4. Set up a database and import the anonymised data.
5. Set up secure access to the database.
6. Create a data access service for the database.
7. Configure the GAS so that authorized users can access the data by way of the data access services.
8. Create the required semantic mapping so that the database can be queried using the ACGT Master Ontology.
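Step 2 above, pseudonymisation, can be sketched as replacing direct identifiers with a keyed hash, so that records remain linkable across exports without revealing identity. A minimal illustration using the standard library; the key management and the exact ACGT pseudonymisation scheme are out of scope here:

```python
import hashlib
import hmac

SECRET_KEY = b"trial-specific secret"  # in practice held by a trusted third party

def pseudonymise(record: dict, identifying_fields: tuple = ("patient_id",)) -> dict:
    """Replace identifying fields with a keyed hash; keep clinical fields intact."""
    out = dict(record)
    for field in identifying_fields:
        digest = hmac.new(SECRET_KEY, out[field].encode(), hashlib.sha256).hexdigest()
        out[field] = digest[:16]  # shortened pseudonym for readability
    return out

record = {"patient_id": "P01", "histology": "ductal"}
print(pseudonymise(record))
```

Because the hash is keyed and deterministic, the same patient always receives the same pseudonym, while re-identification requires the secret key.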
The integration of new data sources is not only a technical task. Successfully making new trial data available also involves clinical and legal partners in ACGT.
We have also provided users with the ability to dynamically deploy new relational data sources into the platform. Users can integrate new data sources from the ACGT portal and subsequently query these from the workflow enactor.
Dynamic creation of relational data access services is particularly useful because many different kinds of data can be made available this way, including data from Excel spreadsheets and text files in CSV format. Integrating a new relational source requires creating a mapping from SPARQL to SQL, which in turn requires good familiarity with the schema of the database and the data it contains. In contrast, no content-specific knowledge is required to integrate new DICOM and microarray databases: the data they store is much more specific, and new databases appear less frequently (generating new data in both cases requires expensive equipment, is time-consuming and always involves patients or tissue samples), so static deployment of these data access services is sufficient in practice.
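The SPARQL-to-SQL mapping can be pictured as rewriting a triple pattern into a SELECT over the mapped table. A deliberately tiny sketch for a single pattern, with a hypothetical mapping table; real mappings, such as those produced with the semantic mapping editor, handle joins, filters and the full query language:

```python
import re

# Hypothetical mapping from ontology properties to (table, column) pairs.
MAPPING = {
    "acgt:hasHistology": ("clinical", "histology"),
    "acgt:hasResponse": ("clinical", "response"),
}

def triple_to_sql(pattern: str) -> str:
    """Rewrite a single '?s <property> ?o' triple pattern into SQL (sketch only)."""
    match = re.fullmatch(r"\?(\w+)\s+(\S+)\s+\?(\w+)", pattern.strip())
    if not match:
        raise ValueError("only simple triple patterns are supported in this sketch")
    subject, prop, obj = match.groups()
    table, column = MAPPING[prop]
    # The row identifier stands in for the subject resource.
    return f"SELECT rowid AS {subject}, {column} AS {obj} FROM {table}"

print(triple_to_sql("?p acgt:hasHistology ?h"))
# → SELECT rowid AS p, histology AS h FROM clinical
```

Writing such a mapping presupposes knowing which table and column realise each ontology property, which is exactly the content-specific knowledge the paragraph above refers to.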
Anca Bucur
Philips