EchoVib: Exploring Voice Authentication via Unique Non-Linear Vibrations of Short Replayed Speech

S. Abhishek Anand, Jian Liu, Chen Wang, Maliheh Shirvanian, Nitesh Saxena, Yingying Chen

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution


Abstract

Recent advances in speaker verification and speech processing have led to wide adoption of voice authentication in commercial applications such as online banking and customer support, and on devices such as smartphones and IoT voice assistants. However, current voice authentication systems have been shown to be ineffective against voice synthesis attacks that mimic a user's voice with high precision. In this work, we propose a paradigm shift away from traditional voice authentication systems, which operate in the audio domain and are therefore susceptible to speech synthesis attacks mounted in that same domain. We instead leverage a motion sensor's ability to pick up phonatory vibrations, which can uniquely identify a user via voice signatures in the vibration domain. The user's speech is played/echoed back by the device's speaker for a short duration (hence the name EchoVib), and the resulting non-linear phonatory vibrations are picked up by the motion sensor for speaker recognition. The uniqueness of the device's speaker and its accelerometer yields a device-specific fingerprint in response to the echoed speech. Operating in the vibration domain, with its non-linear relationship to audio, allows EchoVib to resist state-of-the-art voice synthesis attacks that have proven successful in the audio domain. We develop an instance of EchoVib as an authenticator using the onboard loudspeaker and the embedded accelerometer of smartphones, based on machine learning techniques. Our evaluation shows that, even with a low-quality loudspeaker and the low sampling rate of accelerometer recordings, EchoVib identifies users with over 90% accuracy. We also analyze the system against state-of-the-art voice synthesis attacks and show that it can distinguish morphed voice samples from the original speaker's, correctly rejecting the morphed samples with a success rate of 85% for both voice conversion and voice modeling attacks. We believe detecting synthesized speech in the vibration domain is effective because the unique phonatory vibration signatures are hard to preserve and mimic, owing to the non-linear mapping between a voice in the audio domain and the combined speaker-accelerometer response in the vibration domain.
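The pipeline described in the abstract (replay a short utterance through the onboard loudspeaker, record the resulting phonatory vibrations with the accelerometer, and recognize the speaker from the vibration response) can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the per-axis log-spectrogram features, the SVM classifier, the 200 Hz sampling rate, and all function and variable names are assumptions, since the abstract only states that machine learning techniques are applied to accelerometer recordings of the echoed speech.

```python
# Hypothetical offline sketch of EchoVib-style speaker recognition in the
# vibration domain. Assumes 3-axis accelerometer traces (sampled at ~200 Hz,
# an assumed smartphone rate) were captured while the device's loudspeaker
# replayed each enrollment utterance. Feature and model choices are
# illustrative only.
import numpy as np
from scipy.signal import spectrogram
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

FS = 200  # assumed accelerometer sampling rate (Hz)

def vib_features(trace: np.ndarray) -> np.ndarray:
    """Spectro-temporal features from one accelerometer trace.

    trace: shape (n_samples, 3). Because the vibration response is a
    non-linear function of the replayed audio, per-axis magnitude spectra
    retain device- and speaker-specific structure even at a low rate.
    """
    feats = []
    for axis in range(trace.shape[1]):
        f, t, sxx = spectrogram(trace[:, axis], fs=FS, nperseg=64, noverlap=32)
        # Time-averaged log power keeps the feature vector a fixed length.
        feats.append(np.log1p(sxx).mean(axis=1))
    return np.concatenate(feats)

def train_verifier(traces: list[np.ndarray], labels: list[int]) -> SVC:
    """Fit a speaker classifier on vibration-domain features."""
    X = np.stack([vib_features(tr) for tr in traces])
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, labels, test_size=0.25, random_state=0, stratify=labels)
    clf = SVC(kernel="rbf", probability=True).fit(X_tr, y_tr)
    print("held-out accuracy:", accuracy_score(y_te, clf.predict(X_te)))
    return clf
```

At authentication time, the same feature extraction would run on a fresh vibration trace and the classifier's decision (or probability) would accept or reject the claimed speaker; a synthesized-voice replay should fail because it does not reproduce the enrolled speaker-plus-device vibration signature.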

Original language: English (US)
Title of host publication: ASIA CCS 2021 - Proceedings of the 2021 ACM Asia Conference on Computer and Communications Security
Publisher: Association for Computing Machinery, Inc
Pages: 67-81
Number of pages: 15
ISBN (Electronic): 9781450382878
DOIs
State: Published - May 24, 2021
Event: 16th ACM Asia Conference on Computer and Communications Security, ASIA CCS 2021 - Virtual, Online, Hong Kong
Duration: Jun 7, 2021 – Jun 11, 2021

Publication series

Name: ASIA CCS 2021 - Proceedings of the 2021 ACM Asia Conference on Computer and Communications Security

Conference

Conference: 16th ACM Asia Conference on Computer and Communications Security, ASIA CCS 2021
Country/Territory: Hong Kong
City: Virtual, Online
Period: 6/7/21 – 6/11/21

All Science Journal Classification (ASJC) codes

  • Computer Networks and Communications
  • Computer Science Applications
  • Information Systems
  • Software

Keywords

  • vibration domain
  • voice echo fingerprint
  • voice imitation resistance
