TY - GEN
T1 - Privacy-Preserving Knowledge Transfer with Bootstrap Aggregation of Teacher Ensembles
AU - Yoon, Hong Jun
AU - Klasky, Hilda B.
AU - Durbin, Eric B.
AU - Wu, Xiao Cheng
AU - Stroup, Antoinette
AU - Doherty, Jennifer
AU - Coyle, Linda
AU - Penberthy, Lynne
AU - Stanley, Christopher
AU - Christian, J. Blair
AU - Tourassi, Georgia D.
N1 - Funding Information:
This research used resources of the Oak Ridge Leadership Computing Facility at ORNL, which is supported by the DOE Office of Science under Contract No. DE-AC05-00OR22725.
The study was supported by the Laboratory Directed Research and Development (LDRD) program of Oak Ridge National Laboratory, under LDRD project No. 9831.
LTR data were collected using funding from NCI and the Surveillance, Epidemiology and End Results (SEER) Program (HHSN261201800007I), the CDC’s National Program of Cancer Registries (NPCR) (NU58DP006332-02-00) as well as the State of Louisiana.
This research was supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of the US Department of Energy (DOE) Office of Science and the National Nuclear Security Administration. This work has been supported in part by the Joint Design of Advanced Computing Solutions for Cancer (JDACS4C) program established by DOE and the National Cancer Institute of the National Institutes of Health. This work was performed under the auspices of DOE by Argonne National Laboratory under Contract DE-AC02-06-CH11357, Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344, Los Alamos National Laboratory under Contract DE-AC52-06NA25396, and ORNL under Contract DE-AC05-00OR22725.
NJSCR data were collected using funding from NCI and the Surveillance, Epidemiology and End Results (SEER) Program (HHSN261201300021I), the CDC’s National Program of Cancer Registries (NPCR) (NU58DP006279-02-00) as well as the State of New Jersey and the Rutgers Cancer Institute of New Jersey.
KCR data were collected with funding from NCI Surveillance, Epidemiology and End Results (SEER) Program (HHSN261201800013I), the CDC National Program of Cancer Registries (NPCR) (U58DP00003907) and the Commonwealth of Kentucky.
The Utah Cancer Registry is funded by the National Cancer Institute’s SEER Program, Contract No. HHSN261201800016I, and the US Centers for Disease Control and Prevention’s National Program of Cancer Registries, Cooperative Agreement No. NU58DP0063200, with additional support from the University of Utah and Huntsman Cancer Foundation.
Publisher Copyright:
© 2021, Springer Nature Switzerland AG.
PY - 2021
Y1 - 2021
AB - There is a need to transfer knowledge among institutions and organizations to save annotation and labeling effort or to enhance task performance. However, knowledge transfer is difficult because of restrictions in place to ensure data security and privacy: institutions are not allowed to exchange data or perform any activity that may expose personal information. Leveraging a differential privacy algorithm in a high-performance computing environment, we propose a new training protocol, Bootstrap Aggregation of Teacher Ensembles (BATE), which is applicable to various types of machine learning models. The BATE algorithm builds on and enhances the PATE algorithm, maintaining competitive task performance scores on complex datasets with underrepresented class labels. We conducted a proof-of-concept study of information extraction from cancer pathology report data from four cancer registries and compared four scenarios: no collaboration, collaboration without privacy preservation, the PATE algorithm, and the proposed BATE algorithm. The results showed that the BATE algorithm maintained competitive macro-averaged F1 scores, demonstrating that the proposed algorithm is an effective yet privacy-preserving method for machine learning and deep learning solutions.
KW - Bootstrap aggregation
KW - Data privacy
KW - Differential privacy
KW - Information extraction
KW - Natural language processing
KW - Privacy-preserving machine learning
UR - http://www.scopus.com/inward/record.url?scp=85103597093&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85103597093&partnerID=8YFLogxK
U2 - 10.1007/978-3-030-71055-2_9
DO - 10.1007/978-3-030-71055-2_9
M3 - Conference contribution
AN - SCOPUS:85103597093
SN - 9783030710545
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 87
EP - 99
BT - Heterogeneous Data Management, Polystores, and Analytics for Healthcare - VLDB Workshops, Poly 2020 and DMAH 2020, Revised Selected Papers
A2 - Gadepally, Vijay
A2 - Mattson, Timothy
A2 - Stonebraker, Michael
A2 - Kraska, Tim
A2 - Wang, Fusheng
A2 - Luo, Gang
A2 - Kong, Jun
A2 - Dubovitskaya, Alevtina
PB - Springer Science and Business Media Deutschland GmbH
T2 - VLDB workshops: International Workshop on Polystore Systems for Heterogeneous Data in Multiple Databases with Privacy and Security Assurances, Poly 2020, and 6th International Workshop on Data Management and Analytics for Medicine and Healthcare, DMAH 2020
Y2 - 31 August 2020 through 4 September 2020
ER -