TY - GEN
T1 - EchoVib: Exploring Voice Authentication via Unique Non-linear Vibrations of Short Replayed Speech
T2 - 16th ACM Asia Conference on Computer and Communications Security, ASIA CCS 2021
AU - Abhishek Anand, S.
AU - Liu, Jian
AU - Wang, Chen
AU - Shirvanian, Maliheh
AU - Saxena, Nitesh
AU - Chen, Yingying
N1 - Funding Information:
We would like to thank our shepherd, Dr. Ding Wang, and the anonymous reviewers for their insightful comments and constructive feedback on the paper. This work was partially supported by the following NSF grants: CNS-1714807, CNS-1526524, CNS-1547350, CNS-2030501, CCF-2028876, CNS-1820624, CNS-1814590 and the ARO grant: W911NF-19-1-0405.
Publisher Copyright:
© 2021 ACM.
PY - 2021/5/24
Y1 - 2021/5/24
AB - Recent advances in speaker verification and speech processing technology have seen voice authentication adopted on a wide scale in commercial applications such as online banking and customer care support, and on devices such as smartphones and IoT voice assistant systems. However, current voice authentication systems have been shown to be ineffective against voice synthesis attacks that mimic a user's voice to high precision. In this work, we suggest a paradigm shift away from traditional voice authentication systems, which operate in the audio domain and are therefore susceptible to speech synthesis attacks in that same domain. We leverage a motion sensor's capability to pick up phonatory vibrations, which can help uniquely identify a user via voice signatures in the vibration domain. The user's speech is played/echoed back by a device's speaker for a short duration (hence our method is termed EchoVib), and the resulting non-linear phonatory vibrations are picked up by the motion sensor for speaker recognition. The uniqueness of the device's speaker and its accelerometer results in a device-specific fingerprint in response to the echoed speech. The use of the vibration domain, and its non-linear relationship with audio, allows EchoVib to resist state-of-the-art voice synthesis attacks that have been shown to be successful in the audio domain. We develop an instance of EchoVib, based on machine learning techniques, that uses the onboard loudspeaker and the embedded accelerometer of smartphones as the authenticator. Our evaluation shows that even with a low-quality loudspeaker and the low sampling rate of accelerometer recordings, EchoVib can identify users with an accuracy of over 90%. We also analyze our system against state-of-the-art voice synthesis attacks and show that it can distinguish between morphed and original speaker voice samples, correctly rejecting the morphed samples with a success rate of 85% for voice conversion and voice modeling attacks. We believe that using the vibration domain to detect synthesized speech attacks is effective because the unique phonatory vibration signatures are hard to preserve and mimic, owing to the non-linear mapping between the voice in the audio domain and the unique speaker and accelerometer response in the vibration domain.
KW - vibration domain
KW - voice echo fingerprint
KW - voice imitation resistance
UR - http://www.scopus.com/inward/record.url?scp=85108076836&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85108076836&partnerID=8YFLogxK
U2 - 10.1145/3433210.3437518
DO - 10.1145/3433210.3437518
M3 - Conference contribution
AN - SCOPUS:85108076836
T3 - ASIA CCS 2021 - Proceedings of the 2021 ACM Asia Conference on Computer and Communications Security
SP - 67
EP - 81
BT - ASIA CCS 2021 - Proceedings of the 2021 ACM Asia Conference on Computer and Communications Security
PB - Association for Computing Machinery, Inc
Y2 - 7 June 2021 through 11 June 2021
ER -