
International Journal of Engineering and Techniques (IJET)

Open Access • Peer Reviewed • High Citation & Impact Factor • ISSN: 2395-1303

Volume 11, Issue 5  |  Published: October 2025

AudioIntegrityNet: A Convolutional Neural Network Framework for Audio Deepfake Detection Using RMS-ZCR Threshold Intersections

Authors: Prasanth K V, Dr. K C James

Abstract

The rise of spoofed audio poses significant risks, including the spread of misinformation, fake news, and substantial financial fraud. Audio generation techniques such as GANs, diffusion models, and autoencoders are evolving rapidly; as little as three seconds of a person's speech is now enough to clone their voice and generate new audio that sounds as if they had spoken it. While recent studies have predominantly focused on spectral features and their derivatives for detecting spoofed audio, this research explores the role of Root Mean Square (RMS) and Zero-Crossing Rate (ZCR) values that cross a specific threshold, interpreted as "breath," as potent discriminators. We use statistical methods to examine how these features differ between real and fake audio, and we develop AudioIntegrityNet, a Convolutional Neural Network (CNN), to classify audio as real or fake.
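To make the RMS-ZCR threshold idea concrete, the following is a minimal sketch of extracting frame-wise RMS and ZCR with librosa and flagging frames where both cross a threshold, as a rough proxy for breath-like segments. The threshold values, the low-RMS/high-ZCR rule, and the function name are illustrative assumptions, not the paper's exact settings.

```python
# Hedged sketch: frame-wise RMS/ZCR extraction and "threshold intersection"
# detection as a proxy for breath-like frames. Threshold values and the
# AND-rule below are illustrative assumptions, not the authors' settings.
import numpy as np
import librosa

def breath_candidate_frames(path, rms_thresh=0.02, zcr_thresh=0.15,
                            frame_length=2048, hop_length=512):
    y, sr = librosa.load(path, sr=16000)
    rms = librosa.feature.rms(y=y, frame_length=frame_length,
                              hop_length=hop_length)[0]
    zcr = librosa.feature.zero_crossing_rate(y, frame_length=frame_length,
                                             hop_length=hop_length)[0]
    # Breath-like frames: low energy (RMS below threshold) coinciding with
    # noise-like content (ZCR above threshold).
    mask = (rms < rms_thresh) & (zcr > zcr_thresh)
    return rms, zcr, mask

rms, zcr, mask = breath_candidate_frames("sample.wav")
print(f"{mask.sum()} of {mask.size} frames flagged as breath-like")
```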

Keywords

Deepfakes, RMS, ZCR, Mann-Whitney U Test, MFCC, CNN, EER

Conclusion

In conclusion, our study provides evidence of a significant difference between real and synthetic audio in the distribution of Root Mean Square (RMS) and Zero-Crossing Rate (ZCR) values that cross a set threshold, as inferred from the variation in their respective data distributions. We also find that these features give the best results when trained in conjunction with other short-term spectral features. Our model, AudioIntegrityNet, trained on Mel-Frequency Cepstral Coefficients (MFCC), log spectrograms, and breath features, effectively classifies audio as real or fake, achieving an Equal Error Rate (EER) of 32.50% during independent validation on an "in-the-wild" dataset. The study underscores the model's robustness and potential applicability in real-world scenarios. These findings contribute to ongoing efforts in audio forensics and integrity verification, paving the way for future research aimed at enhancing the detection of audio deepfakes.
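For readers unfamiliar with the two evaluation tools named above, the sketch below shows (a) a Mann-Whitney U test comparing a per-file feature distribution between real and fake sets and (b) how an EER is derived from classifier scores. The synthetic data and variable names are placeholders for illustration only, not the paper's actual features or scores.

```python
# Hedged sketch: (a) Mann-Whitney U test on a per-file feature (e.g., a count
# of breath-like frames) for real vs. fake audio, and (b) Equal Error Rate
# (EER) from classifier scores. All data below is synthetic and illustrative.
import numpy as np
from scipy.stats import mannwhitneyu
from sklearn.metrics import roc_curve

rng = np.random.default_rng(0)
real_feature = rng.poisson(12, size=200)   # stand-in per-file feature (real)
fake_feature = rng.poisson(7, size=200)    # stand-in per-file feature (fake)

# (a) Test whether the feature distributions differ between real and fake
stat, p_value = mannwhitneyu(real_feature, fake_feature, alternative="two-sided")
print(f"Mann-Whitney U = {stat:.1f}, p = {p_value:.3g}")

# (b) EER: operating point where false-acceptance and false-rejection rates meet
labels = np.concatenate([np.ones(200), np.zeros(200)])      # 1 = real, 0 = fake
scores = np.concatenate([rng.normal(0.7, 0.2, 200),          # stand-in model scores
                         rng.normal(0.4, 0.2, 200)])
fpr, tpr, _ = roc_curve(labels, scores)
eer_idx = np.nanargmin(np.abs(fpr - (1 - tpr)))
eer = (fpr[eer_idx] + (1 - tpr[eer_idx])) / 2
print(f"EER ≈ {eer:.2%}")
```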


© 2025 International Journal of Engineering and Techniques (IJET).