TY - JOUR
T1 - Decoding the effects of synonymous variants
AU - Zeng, Zishuo
AU - Aptekmann, Ariel A.
AU - Bromberg, Yana
N1 - Funding Information:
Z.Z. and Y.B. were supported by the NIH/NIGMS grant R01 [GM115486]; A.A. is supported by the Astrobiology Institute grant [80NSSC18M0093]; Y.B. was also supported by NIH grant R01 [MH115958]. Funding for open access charge: NIH/NIGMS grant R01 [GM115486].
Publisher Copyright:
© The Author(s) 2021. Published by Oxford University Press on behalf of Nucleic Acids Research.
PY - 2021/12/16
Y1 - 2021/12/16
AB - Synonymous single nucleotide variants (sSNVs) are common in the human genome but are often overlooked. However, sSNVs can have significant biological impact and may lead to disease. Existing computational methods for evaluating the effect of sSNVs suffer from the lack of gold-standard training/evaluation data and exhibit over-reliance on sequence conservation signals. We developed synVep (synonymous Variant effect predictor), a machine learning-based method that overcomes both of these limitations. Our training data was a combination of variants reported by gnomAD (observed) and those unreported, but possible in the human genome (generated). We used positive-unlabeled learning to purify the generated variant set of any likely unobservable variants. We then trained two sequential extreme gradient boosting models to identify subsets of the remaining variants putatively enriched and depleted in effect. Our method attained 90% precision/recall on a previously unseen set of variants. Furthermore, although synVep does not explicitly use conservation, its scores correlated with evolutionary distances between orthologs in cross-species variation analysis. synVep was also able to differentiate pathogenic vs. benign variants, as well as splice-site disrupting variants (SDV) vs. non-SDVs. Thus, synVep provides an important improvement in annotation of sSNVs, allowing users to focus on variants that most likely harbor effects.
UR - http://www.scopus.com/inward/record.url?scp=85122842957&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85122842957&partnerID=8YFLogxK
U2 - 10.1093/nar/gkab1159
DO - 10.1093/nar/gkab1159
M3 - Article
C2 - 34850938
AN - SCOPUS:85122842957
SN - 0305-1048
VL - 49
SP - 12673
EP - 12691
JO - Nucleic Acids Research
JF - Nucleic Acids Research
IS - 22
ER -