Accelerating Genomic Data Sharing and Collaborative Research with Privacy Protection

Project Details

Description

Abstract. Rapid progress in genome sequencing has produced large volumes of genomic data. Analyzing this data can be transformative in answering key questions about disease associations and human evolution. However, due to growing privacy concerns about participants' sensitive information, access to the genomic datasets used in studies such as genome-wide association studies (GWAS) is restricted to a limited number of large groups. Collaborative research over genomic datasets, which would also help democratize genomic data sharing, requires sharing data across collaborators. One way to share such datasets is through the IRB process and institutional data use agreements. Currently, due to the sensitivity of the data, the GWAS computation can only be carried out after IRB review for all collaborators. In this research, we propose a sandbox environment in which potential collaborators come together and obtain an accurate "preview" of their collaborative research in an efficient, reproducible (verifiable), and privacy-preserving way. Our proposed framework allows each collaborator to share information about their dataset in a privacy-preserving way within the sandbox environment. This will help the researchers (1) clean their federated datasets of low-quality, biased, or statistically dependent records, (2) generate an accurate preview of their collaborative GWAS results to provide evidence for the benefit-versus-risk tradeoff in IRB approval, and (3) identify which parts of the datasets should be shared among the collaborators (once they obtain full IRB approval).
To achieve these goals, we will develop (1) novel algorithms that enable quality control over federated data while preserving ownership and privacy and (2) algorithms that promote reproducibility of GWAS results by developing novel techniques for verifying the correctness of GWAS computation and for sharing whole research datasets while preserving privacy. Our preliminary results show that the proposed framework accurately provides evidence of reproducibility of GWAS results, identifies low-quality (e.g., statistically dependent) data in federated datasets, and preserves the privacy of individuals in collaborators' datasets. Notably, we show that the privacy risk due to the proposed framework is lower than the risk accepted by the NIH Genomic Data Sharing Policy. Finally, working together with the IRBs of three institutions, we will design a pilot study to explore the efficacy of the proposed framework and its integration into the current IRB process. The outcomes of this research will provide a new strategy for genomic data sharing.
Status: Active
Effective start/end date: 9/1/23 → 5/31/25

Funding

  • U.S. National Library of Medicine: $637,188.00
  • U.S. National Library of Medicine: $673,010.00
