Project Details
Description
Abstract. Rapid progress in genome sequencing has led to the collection of large amounts of data. Analyzing
these data can be transformative in answering key questions about disease associations and human
evolution. However, due to growing privacy concerns about the sensitive information of participants,
access to genomic datasets used in studies, such as genome-wide association studies (GWAS), is
restricted to only a limited number of large groups. On the other hand, collaborative research over
genomic datasets, which will also lead to democratizing genomic data sharing, requires sharing data
across collaborators. One way to share such datasets is through the institutional review board (IRB) process
and institutional data use agreements. Currently, because the data are sensitive, the GWAS
computation can only be carried out after IRB review for all collaborators. In this research, we propose
a sandbox environment in which potential collaborators come together and obtain an accurate
"preview" of their collaborative research in an efficient, reproducible (verifiable), and privacy-
preserving way. Our proposed framework allows each collaborator to share information about their
dataset in a privacy-preserving way within the proposed sandbox environment. This will help the
researchers (1) remove low-quality, biased, or statistically dependent records from their federated
datasets, (2) generate an accurate preview of their collaborative GWAS results to provide evidence for
the benefit-versus-risk tradeoff in IRB approval, and (3) identify which parts of the datasets should be shared
among the collaborators (once they obtain the full IRB approval). To achieve these goals, we will
develop (1) novel algorithms that enable quality control over federated data while preserving
ownership and privacy and (2) algorithms that promote the reproducibility of GWAS results through
novel techniques for verifying the correctness of GWAS computations and for sharing entire
research datasets while preserving privacy. Our preliminary results show that the proposed framework
accurately provides evidence of reproducibility of GWAS results, identifies low-quality (e.g.,
statistically dependent) data in federated datasets, and preserves the privacy of individuals in
collaborators' datasets. Notably, we show that the privacy risk introduced by the proposed framework is lower
than the risk accepted under the NIH Genomic Data Sharing Policy. Finally, working together with the
IRBs of three institutions, we will design a pilot study to explore the efficacy of the proposed
framework and its integration into the current IRB process. The outcomes of this research will provide
a new strategy for genomic data sharing.
Status | Active |
---|---|
Effective start/end date | 9/1/23 → 5/31/25 |
Funding
- U.S. National Library of Medicine: $637,188.00
- U.S. National Library of Medicine: $673,010.00