TY - JOUR
T1 - ASELMAR
T2 - Active and semi-supervised learning-based framework to reduce multi-labeling efforts for activity recognition
AU - Saribudak, Aydin
AU - Yuan, Sifan
AU - Gao, Chenyang
AU - Gestrich-Thompson, Waverly V.
AU - Milestone, Zachary P.
AU - Burd, Randall S.
AU - Marsic, Ivan
N1 - Publisher Copyright:
© 2024 The Author(s)
PY - 2025/2
Y1 - 2025/2
N2 - Manual annotation of unlabeled data for model training is expensive and time-consuming, especially for visual datasets requiring domain-specific experience for multi-labeling, such as video records generated in hospital settings. There is a need to build frameworks that reduce human labeling effort while improving training performance. Semi-supervised learning is widely used to generate predictions for unlabeled samples in a partially labeled dataset. Active learning can be combined with semi-supervised learning to annotate unlabeled samples and reduce the sampling bias introduced by label predictions. We developed the ASELMAR framework based on active and semi-supervised learning techniques to reduce the time and effort associated with multi-labeling of unlabeled samples for activity recognition. ASELMAR (i) categorizes the predictions for unlabeled data based on the confidence level of the predictions using fixed and adaptive threshold settings, (ii) applies a label verification procedure to samples with ambiguous predictions, and (iii) retrains the model iteratively using samples with high-confidence predictions or manual annotations. We also designed a software tool to guide domain experts in verifying ambiguous predictions. We applied ASELMAR to recognize eight selected activities from our trauma resuscitation video dataset and evaluated its performance based on label verification time and the mean AP score metric. The label verification required by ASELMAR was 12.1% of the manual annotation effort for the unlabeled video records. Compared to the baseline model, the mean AP score improved by 5.7% for the first iteration and 8.3% for the second iteration with the fixed threshold-based method. The p-values were below 0.05 for the target activities. Using an adaptive-threshold method, ASELMAR achieved a decrease in AP score deviation, implying an improvement in model robustness. For a speech-based case study, the word error rate decreased by 6.2%, and the average transcription factor increased 2.6 times, supporting the broad applicability of ASELMAR in reducing labeling effort from domain experts.
AB - Manual annotation of unlabeled data for model training is expensive and time-consuming, especially for visual datasets requiring domain-specific experience for multi-labeling, such as video records generated in hospital settings. There is a need to build frameworks that reduce human labeling effort while improving training performance. Semi-supervised learning is widely used to generate predictions for unlabeled samples in a partially labeled dataset. Active learning can be combined with semi-supervised learning to annotate unlabeled samples and reduce the sampling bias introduced by label predictions. We developed the ASELMAR framework based on active and semi-supervised learning techniques to reduce the time and effort associated with multi-labeling of unlabeled samples for activity recognition. ASELMAR (i) categorizes the predictions for unlabeled data based on the confidence level of the predictions using fixed and adaptive threshold settings, (ii) applies a label verification procedure to samples with ambiguous predictions, and (iii) retrains the model iteratively using samples with high-confidence predictions or manual annotations. We also designed a software tool to guide domain experts in verifying ambiguous predictions. We applied ASELMAR to recognize eight selected activities from our trauma resuscitation video dataset and evaluated its performance based on label verification time and the mean AP score metric. The label verification required by ASELMAR was 12.1% of the manual annotation effort for the unlabeled video records. Compared to the baseline model, the mean AP score improved by 5.7% for the first iteration and 8.3% for the second iteration with the fixed threshold-based method. The p-values were below 0.05 for the target activities. Using an adaptive-threshold method, ASELMAR achieved a decrease in AP score deviation, implying an improvement in model robustness. For a speech-based case study, the word error rate decreased by 6.2%, and the average transcription factor increased 2.6 times, supporting the broad applicability of ASELMAR in reducing labeling effort from domain experts.
KW - Active learning
KW - Activity recognition
KW - Prediction confidence
KW - Semi-supervised learning
KW - Trauma resuscitation
KW - Visual data labeling
UR - http://www.scopus.com/inward/record.url?scp=85213050542&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85213050542&partnerID=8YFLogxK
U2 - 10.1016/j.cviu.2024.104269
DO - 10.1016/j.cviu.2024.104269
M3 - Article
AN - SCOPUS:85213050542
SN - 1077-3142
VL - 251
JO - Computer Vision and Image Understanding
JF - Computer Vision and Image Understanding
M1 - 104269
ER -