The purpose of this article is to build a convolutional neural network model for identifying and predicting audio deepfakes by classifying voice content with deep machine learning algorithms and Python libraries. The audio datasets that underlie the network's training are represented as mel spectrograms. Processing these graphic images of the audio signal, rendered in heatmap format, forms the knowledge base of the convolutional neural network. Visualizing mel spectrograms on the sound-frequency versus mel-scale axes reveals the key characteristics of the audio signal and makes it possible to compare a real voice with artificially generated speech. Modern speech synthesizers rely on complex sample selection and generate synthetic speech from a recording of a person's voice combined with a language model. We also note the importance of mel spectrograms for speech-synthesis models, where this type of spectrogram is used to capture the timbre of a voice and encode the speaker's original speech. Convolutional neural networks make it possible to automate the processing of mel spectrograms and to classify voice content as genuine or fake. Experiments on test voice sets confirmed that convolutional neural networks can be successfully trained and applied on images of MFCC (mel-frequency cepstral coefficient) spectral features to classify and analyze audio content, and that this type of neural network can be used in information security to detect audio deepfakes; an illustrative code sketch of such a pipeline is given after the keywords.
Keywords: neural networks, detection of voice deepfakes, information security, speech synthesis models, deep machine learning, categorical cross-entropy, loss function, algorithms for detecting voice deepfakes, convolutional neural networks, mel spectrograms
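The pipeline described above can be outlined in a short Python sketch. This is a minimal illustration rather than the authors' implementation: it assumes the widely used librosa and TensorFlow/Keras libraries, a hypothetical collection of labelled WAV clips, and illustrative hyperparameters (sampling rate, number of mel bands, fixed input size). It shows how an audio file could be turned into a log-mel spectrogram "image" and how a small CNN trained with the categorical cross-entropy loss named in the keywords could classify such images as genuine or fake.

```python
# A minimal sketch, not the authors' implementation: librosa and
# TensorFlow/Keras are assumed, and all hyperparameters are illustrative.
import numpy as np
import librosa
import tensorflow as tf

SR = 16000          # assumed sampling rate
N_MELS = 64         # number of mel bands
FIXED_FRAMES = 128  # pad/crop every clip to a fixed number of time frames

def mel_image(path: str) -> np.ndarray:
    """Load an audio file and return a log-mel spectrogram 'image'."""
    y, _ = librosa.load(path, sr=SR, mono=True)
    mel = librosa.feature.melspectrogram(y=y, sr=SR, n_mels=N_MELS)
    logmel = librosa.power_to_db(mel, ref=np.max)
    # Pad or crop the time axis so every example has the same shape.
    if logmel.shape[1] < FIXED_FRAMES:
        pad = FIXED_FRAMES - logmel.shape[1]
        logmel = np.pad(logmel, ((0, 0), (0, pad)), mode="constant")
    else:
        logmel = logmel[:, :FIXED_FRAMES]
    return logmel[..., np.newaxis]  # add a channel dimension for the CNN

# A small CNN that classifies a mel-spectrogram image as real (0) or fake (1).
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(N_MELS, FIXED_FRAMES, 1)),
    tf.keras.layers.Conv2D(16, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(2, activation="softmax"),  # [real, fake]
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",  # loss function from the keywords
              metrics=["accuracy"])
```

Images of MFCC spectral coefficients, as mentioned in the abstract, could be used instead of the log-mel input by replacing librosa.feature.melspectrogram with librosa.feature.mfcc and adjusting the input shape accordingly; the rest of the sketch would remain unchanged.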
The article considers mathematical models for collecting and processing voice content, on the basis of which a schematic logical workflow for predicting synthetic voice deepfakes has been developed. Experiments were conducted with the selected mathematical formulas and a set of Python libraries that allow real-time analysis of audio content within an organization. The software capabilities of neural networks for detecting voice fakes and generated synthetic (artificial) speech are examined, and the main criteria for analyzing voice messages are determined. Based on the results of the experiments, the mathematical apparatus required for reliably detecting voice deepfakes has been formulated. A list of technical standards recommended for collecting voice information and improving the quality of information security in an organization has also been compiled; a brief inference sketch illustrating such real-time screening is given after the keywords.
Keywords: neural networks, detection of voice deepfakes, information security, synthetic (artificial) speech, voice deepfakes, technical standards for collecting voice information, algorithms for detecting audio deepfakes, voice cloning
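The real-time screening of voice messages mentioned above can likewise be sketched in Python. The snippet below is a hedged illustration, not the authors' system: it reuses the hypothetical mel_image() helper and trained Keras model from the previous sketch, and the decision threshold and file name are illustrative assumptions.

```python
# A hedged sketch of screening an incoming voice message, continuing the
# previous example: mel_image() and `model` are assumed to be defined above.
import numpy as np

FAKE_THRESHOLD = 0.5  # hypothetical operating point chosen on a validation set

def screen_voice_message(path: str) -> dict:
    """Score one audio file and return a verdict for the security log."""
    x = mel_image(path)[np.newaxis, ...]    # batch of one spectrogram image
    probs = model.predict(x, verbose=0)[0]  # softmax output: [p_real, p_fake]
    verdict = ("synthetic speech (possible deepfake)"
               if probs[1] >= FAKE_THRESHOLD else "genuine voice")
    return {"file": path, "p_fake": float(probs[1]), "verdict": verdict}

# Example call with a hypothetical file name:
# print(screen_voice_message("incoming_message.wav"))
```

In an organizational setting the threshold would be tuned on validation data to balance missed deepfakes against false alarms, and the returned record could feed the information security event log described in the article.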