Improving speech emotion recognition using audio transformer and features fusion

Publication year: 1402 SH (2023)
Document type: Conference paper
Language: English

This paper is available for download as an 8-page PDF.


National scientific document ID: ICAISV01_004

Indexing date: 6 Shahrivar 1402 (August 2023)

Abstract:

The purpose of speech emotion recognition is to recognize different speaker emotions by extracting and classifying salient features from a pre-processed speech signal. In this paper, a baseline method for speech emotion recognition, based on the fusion of features extracted from pre-trained AlexNet, BiLSTM, and Wav2vec2.0 models, is improved. To this end, as in the baseline model, spectrogram, MFCC, and raw-signal features are used, respectively. To improve on the baseline, on the one hand, the first and second derivatives of the MFCC are extracted in addition to the MFCC itself. On the other hand, for feature extraction from the concatenated vector, the Audio Transformer with Patchout (PaSST) replaces the BiLSTM of the baseline model. An attention unit is then used to exploit the effective information extracted from the MFCC and the spectrogram, and also to weight the Wav2vec2.0 output. Finally, the features extracted by AlexNet and PaSST, together with the weighted Wav2vec2.0 output, are fused and fed to a Softmax classifier. Experiments show that the proposed algorithm reaches a weighted accuracy of 61.56% on the RAVDESS dataset.
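The fusion pipeline described in the abstract can be sketched in NumPy as follows. This is a minimal illustrative sketch only: the embedding sizes, the finite-difference delta computation, the sigmoid gate standing in for the attention unit, and the random stand-in embeddings for the AlexNet, PaSST, and Wav2vec2.0 branches are all assumptions, not the paper's actual implementation.

```python
import numpy as np

def deltas(feat, width=1):
    """Finite-difference derivative along the time axis (frames x coeffs);
    real systems typically use a regression-based delta filter instead."""
    padded = np.pad(feat, ((width, width), (0, 0)), mode="edge")
    return (padded[2 * width:] - padded[:-2 * width]) / (2 * width)

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

rng = np.random.default_rng(0)

# Toy stand-ins for the three streams; in the paper these would come from
# a spectrogram + AlexNet, MFCCs (+ deltas) + PaSST, and raw audio + Wav2vec2.0.
mfcc = rng.standard_normal((100, 13))                       # 100 frames x 13 MFCCs
mfcc_full = np.concatenate([mfcc, deltas(mfcc),
                            deltas(deltas(mfcc))], axis=1)  # 39-dim: static + delta + delta-delta

f_alexnet = rng.standard_normal(256)   # spectrogram-branch embedding (illustrative size)
f_passt = mfcc_full.mean(axis=0)       # crude stand-in for the PaSST embedding
f_wav2vec = rng.standard_normal(256)   # raw-waveform-branch embedding (illustrative size)

# Attention-style gating (illustrative): a scalar weight derived from the
# MFCC branch scales the Wav2vec2.0 output before fusion.
gate = 1.0 / (1.0 + np.exp(-(f_passt @ rng.standard_normal(39))))
fused = np.concatenate([f_alexnet, f_passt, gate * f_wav2vec])

# Linear layer + softmax over the 8 RAVDESS emotion classes (random weights here).
W = rng.standard_normal((8, fused.size))
probs = softmax(W @ fused)
```

The key structural points mirrored from the abstract are the 13-to-39-dimensional MFCC expansion via first and second derivatives, the weighting of the Wav2vec2.0 stream before concatenation, and the final softmax over the emotion classes.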

Authors

Fateme Mehrpouyan

Faculty of Electrical and Computer Engineering, Babol Noshirvani University of Technology, Mazandaran, Iran

Mehdi Ezoji

Faculty of Electrical and Computer Engineering, Babol Noshirvani University of Technology, Mazandaran, Iran