Abstract
Speech recognition is one of the fastest-developing engineering technologies today, with applications across many different areas. Speech Emotion Recognition (SER), a closely related and rapidly growing field, aims to predict human emotions from speech. Predicting emotion from audio alone is difficult; SER addresses this by relying on speech features such as tone, pitch, and volume to detect the emotion contained in an utterance.
This project contributes to an advanced Emotional Voice Conversion (EVC) system that builds on the emotion users express vocally, serving as both an emotion recognition tool and a speech processing tool. The system provides emotion recognition by combining machine learning with speech processing. The machine learning technique used is prediction: the system determines the emotion of an utterance from features such as volume and pitch. Previous SER studies took a more traditional approach to detecting emotion in speech, which yields a high error rate. To address this inaccuracy, the project uses a modern convolutional neural network (CNN), which reaches an accuracy of about 93%, compared with about 70% in previous work. Prediction is combined with elimination: based on the pitch and volume of the input audio file, the system predicts candidate emotions, eliminates the less likely matches, and outputs the remaining emotion as the final result. The RAVDESS dataset is used as the input.
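The prediction-and-elimination step described above can be illustrated with a minimal sketch. It is not the project's actual implementation: the function name predict_emotion and the scores in example_scores are hypothetical, and the scores merely stand in for the output of a trained CNN. The label list follows the eight emotion categories of RAVDESS.

```python
import math

# The eight emotion categories labelled in RAVDESS.
EMOTIONS = ["neutral", "calm", "happy", "sad",
            "angry", "fearful", "disgust", "surprised"]


def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_emotion(scores, labels=EMOTIONS):
    """Hypothetical helper: rank the candidate emotions by probability.

    Returns the most likely (label, probability) pair together with the
    list of eliminated, less likely candidates.
    """
    if len(scores) != len(labels):
        raise ValueError("expected one score per emotion label")
    probs = softmax(scores)
    ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[0], ranked[1:]


# Made-up scores standing in for the output of a trained classifier.
example_scores = [0.1, 0.3, 2.5, 0.2, 1.1, 0.0, -0.4, 0.9]
(best_label, best_prob), eliminated = predict_emotion(example_scores)
print(best_label, round(best_prob, 3))
```

In this sketch, elimination is simply ranking the candidates and keeping the top one; a real system might instead apply a confidence threshold before committing to a single label.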