TY - JOUR
T1 - Discover, Explain, Improve
T2 - An Automatic Slice Detection Benchmark for Natural Language Processing
AU - Hua, Wenyue
AU - Jin, Lifeng
AU - Song, Linfeng
AU - Mi, Haitao
AU - Zhang, Yongfeng
AU - Yu, Dong
N1 - Publisher Copyright:
© 2023 Association for Computational Linguistics.
PY - 2023
Y1 - 2023
N2 - Pretrained natural language processing (NLP) models have achieved high overall performance, but they still make systematic errors. Instead of manual error analysis, research on slice detection models (SDMs), which automatically identify underperforming groups of datapoints, has attracted increasing attention in computer vision, both for understanding model behaviors and for providing insights for future model training and design. However, little research on SDMs and quantitative evaluation of their effectiveness has been conducted on NLP tasks. Our paper fills the gap by proposing a benchmark named "Discover, Explain, Improve (DEIM)" for classification NLP tasks, along with a new SDM, Edisa. Edisa discovers coherent and underperforming groups of datapoints; DEIM then unites them under human-understandable concepts and provides comprehensive evaluation tasks and corresponding quantitative metrics. The evaluation in DEIM shows that Edisa can accurately select error-prone datapoints with informative semantic features that summarize error patterns. Detecting difficult datapoints directly boosts model performance without tuning any original model parameters, showing that discovered slices are actionable for users.
AB - Pretrained natural language processing (NLP) models have achieved high overall performance, but they still make systematic errors. Instead of manual error analysis, research on slice detection models (SDMs), which automatically identify underperforming groups of datapoints, has attracted increasing attention in computer vision, both for understanding model behaviors and for providing insights for future model training and design. However, little research on SDMs and quantitative evaluation of their effectiveness has been conducted on NLP tasks. Our paper fills the gap by proposing a benchmark named "Discover, Explain, Improve (DEIM)" for classification NLP tasks, along with a new SDM, Edisa. Edisa discovers coherent and underperforming groups of datapoints; DEIM then unites them under human-understandable concepts and provides comprehensive evaluation tasks and corresponding quantitative metrics. The evaluation in DEIM shows that Edisa can accurately select error-prone datapoints with informative semantic features that summarize error patterns. Detecting difficult datapoints directly boosts model performance without tuning any original model parameters, showing that discovered slices are actionable for users.
UR - http://www.scopus.com/inward/record.url?scp=85180479038&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85180479038&partnerID=8YFLogxK
U2 - 10.1162/tacl_a_00617
DO - 10.1162/tacl_a_00617
M3 - Article
AN - SCOPUS:85180479038
SN - 2307-387X
VL - 11
SP - 1537
EP - 1552
JO - Transactions of the Association for Computational Linguistics
JF - Transactions of the Association for Computational Linguistics
ER -