FastAno: Fast Anomaly Detection via Spatio-temporal Patch Transformation
Generative AI in the speech domain now enables voice cloning and real-time
voice conversion from one individual to another. This technology poses a
significant ethical threat and could lead to breaches of privacy and
misrepresentation, so there is an urgent need for real-time detection of
AI-generated speech produced by DeepFake voice conversion. To address these
emerging issues, this study generates the DEEP-VOICE dataset, comprising real
human speech from eight well-known figures and their speech converted to one
another using Retrieval-based Voice Conversion.
The task is framed as a binary classification problem of whether the speech is
real or AI-generated. Statistical analysis of temporal audio features through
t-testing reveals that their distributions differ significantly between the two
classes. Hyperparameter
optimisation is implemented for machine learning models to identify the source
of speech. Following the training of 208 individual machine learning models
over 10-fold cross validation, it is found that the Extreme Gradient Boosting
model can achieve an average classification accuracy of 99.3% and can classify
speech in real time, taking around 0.004 milliseconds given one second of speech.
All data generated for this study is released publicly for future research on
AI speech detection.
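
As a rough illustration of the two evaluation steps described above (a t-test on a
feature's distribution across the two classes, then accuracy averaged over 10-fold
cross validation), the following minimal sketch uses only the Python standard
library. It is not the study's pipeline: the data are synthetic, the single feature
is hypothetical, and a simple nearest-class-mean classifier stands in for Extreme
Gradient Boosting. The helper names `welch_t` and `kfold_accuracy` are introduced
here for illustration only.

```python
import random
import statistics
from math import sqrt


def welch_t(a, b):
    """Welch's t statistic for two independent samples with unequal variances."""
    va, vb = statistics.variance(a), statistics.variance(b)
    return (statistics.mean(a) - statistics.mean(b)) / sqrt(va / len(a) + vb / len(b))


def kfold_accuracy(xs, ys, k=10, seed=0):
    """Mean accuracy of a one-feature nearest-class-mean classifier over k folds."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k disjoint folds covering every sample
    scores = []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        # Fit: one mean per class, estimated from the training folds only.
        means = {c: statistics.mean(xs[i] for i in train if ys[i] == c) for c in (0, 1)}
        # Predict: assign each held-out sample to the class with the closest mean.
        correct = sum(
            1 for i in fold
            if min(means, key=lambda c: abs(xs[i] - means[c])) == ys[i]
        )
        scores.append(correct / len(fold))
    return statistics.mean(scores)


# Synthetic stand-in data: one hypothetical feature value per one-second clip.
rng = random.Random(42)
real = [rng.gauss(0.0, 1.0) for _ in range(200)]  # label 0: real speech
fake = [rng.gauss(3.0, 1.0) for _ in range(200)]  # label 1: AI-generated speech

t_stat = welch_t(real, fake)
accuracy = kfold_accuracy(real + fake, [0] * 200 + [1] * 200, k=10)
print(f"Welch t = {t_stat:.2f}, mean 10-fold accuracy = {accuracy:.3f}")
```

In practice one would likely replace the stand-in classifier with a gradient
boosting model and compute the test per extracted audio feature; the fold logic
stays the same, with the model fitted on nine folds and scored on the tenth.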