This presentation was given at UCT's Open Data Day on 6 March 2020. It covers the challenges and opportunities in sharing genomics data.
There are limits to what genomic data can be shared. The actual sequence data is a string written in the four letters of DNA (A, C, G and T). The accompanying data, or meta-data, is what adds value and power to any analysis; this includes phenotypes, which under the NIH definition are any clinical measurements taken. However, certain combinations of phenotypes make it possible to identify populations and individuals, especially in cases of rare genetic diseases. At the same time, the more data that is pooled and analyzed together, the more power there is to detect the minor polygenic differences that are usually linked to non-communicable diseases such as type 2 diabetes, cardiovascular disease, cancer and obesity.
A Data Access Committee (DAC) is made up of experts in their fields who review applications to access data. If there are 1,000 requests for data, the DAC has to review each one, and because its members are experts in their fields they have very little spare time. Usually, a DAC has to go through the data request forms to see whether they are in line with the project's data sharing policy. This can be quite tricky, as there is no clear-cut way of checking, so each request has to be read individually. Hence, there needs to be some way of automating the evaluation of data requests. To address these problems, the GA4GH has created an ecosystem of tools and a couple of driver projects, such as CINECA, to implement these technologies at scale.
Creating an ecosystem that is secure, yet accessible with the right credentials, brings several challenges: finding data held on multiple infrastructures around the world; having the right credentials to log in and request or access data; harmonizing datasets across various consortia; and working within ethical and legal frameworks that are applicable across national boundaries.
One of the projects within the GA4GH stable is the tagging of data with the Data Use Ontology (DUO), which enables data depositors to indicate what sort of research is permissible with the data they submitted, somewhat like a Creative Commons (CC) license for data. Requesters likewise select terms describing the purpose for which they intend to use the data. Both sets of terms are fed to a matching algorithm, which then assists the DAC in deciding whether or not to grant access to a dataset. DUO represents one half of a single sign-on framework for automatically granting researchers access to multiple datasets based on their credentials: DUO provides the matching between data use restrictions and intended research use, while the DURI Researcher Identities provide researcher authentication.
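To make the idea of a matching algorithm concrete, here is a minimal sketch in Python. It is not the GA4GH implementation: the three category labels, the `DataUse` class, and the `is_compatible` and `triage` functions are all hypothetical simplifications, loosely modeled on the kind of permissions a data use ontology expresses.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# Hypothetical, simplified labels; real DUO terms are richer and carry modifiers.
GENERAL_RESEARCH = "general_research"
HEALTH_RESEARCH = "health_research"
DISEASE_SPECIFIC = "disease_specific"


@dataclass(frozen=True)
class DataUse:
    """A data use term, attached to a dataset (permission) or a request (intent)."""
    category: str
    disease: Optional[str] = None  # only meaningful for DISEASE_SPECIFIC


def is_compatible(dataset: DataUse, request: DataUse) -> bool:
    """Return True if the requested use falls within what the dataset permits."""
    if dataset.category == GENERAL_RESEARCH:
        # Assumed rule: general research permission covers any request.
        return True
    if dataset.category == HEALTH_RESEARCH:
        # Assumed rule: health research covers disease-specific work too.
        return request.category in (HEALTH_RESEARCH, DISEASE_SPECIFIC)
    if dataset.category == DISEASE_SPECIFIC:
        return (
            request.category == DISEASE_SPECIFIC
            and request.disease == dataset.disease
        )
    # Unknown permission label: never auto-approve.
    return False


def triage(dataset: DataUse, requests: Dict[str, DataUse]) -> Dict[str, str]:
    """Label each request so the DAC can focus on the unclear ones."""
    return {
        name: "likely compatible" if is_compatible(dataset, req) else "needs DAC review"
        for name, req in requests.items()
    }
```

For example, calling `triage` on a dataset tagged `DataUse(DISEASE_SPECIFIC, "type 2 diabetes")` would flag a request for general health research as needing review. In this sketch the output only assists the committee; the decision itself stays with the DAC.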
Another focus of CINECA is harmonizing meta-data across the different national cohort initiatives that are part of it. To be shared, data needs to be interoperable. Ontologies are structured, community-agreed, machine-interpretable representations of knowledge in which each term has formally defined relationships to other terms. The richness of the H3Africa data lies in its meta-data; good standardization would facilitate its reuse alongside other datasets in meta-analyses, where sample size is vital to ensure that population-based studies are not underpowered.
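As a rough illustration of what meta-data harmonization involves, the Python sketch below maps records from two cohorts onto one shared vocabulary so that they can be pooled. The cohort names, field names and value mappings are invented for the example; a real effort would map to identifiers from community ontologies rather than the placeholder strings used here.

```python
from typing import Any, Dict

# Hypothetical mapping from each cohort's own column names to shared terms.
FIELD_MAP: Dict[str, Dict[str, str]] = {
    "cohort_a": {"bmi_kg_m2": "body_mass_index", "sex": "sex"},
    "cohort_b": {"BMI": "body_mass_index", "gender": "sex"},
}

# Hypothetical mapping of free-text values onto one controlled vocabulary.
VALUE_MAP: Dict[str, Dict[str, str]] = {
    "sex": {"m": "male", "male": "male", "f": "female", "female": "female"},
}


def harmonize(cohort: str, record: Dict[str, Any]) -> Dict[str, Any]:
    """Rename fields and normalize coded values for one record."""
    field_map = FIELD_MAP[cohort]
    harmonized: Dict[str, Any] = {}
    for field, value in record.items():
        shared_name = field_map.get(field)
        if shared_name is None:
            # Fields with no agreed shared term are dropped in this sketch.
            continue
        if shared_name in VALUE_MAP:
            key = str(value).strip().lower()
            # Values without a known mapping are kept unchanged.
            value = VALUE_MAP[shared_name].get(key, value)
        harmonized[shared_name] = value
    return harmonized
```

Once records from both cohorts pass through `harmonize`, they share field names and value codes, which is the precondition for combining them in a meta-analysis.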
Many of the tools and technologies being developed for population-based human genomics and cohort analysis, which underpin the realization of precision medicine, are meant to conform to the FAIR (Findable, Accessible, Interoperable, Reusable) principles. FAIR is not the same as Open Data or Open Science; rather, it means as open as possible and as closed as necessary.