Speech Emotion Recognition Using Machine Learning
Preethi Jeevan1, Kolluri Sahaja2, Dasari Vani3, Alisha Begum4
1Professor, Dept. of Computer Science and Engineering, SNIST, Hyderabad-501301, India
2,3,4B.Tech Scholars, Dept. of Computer Science and Engineering, Hyderabad-501301, India
***
Abstract: Speech is the ability to express thoughts and emotions through vocal sounds and gestures. Speech emotion recognition (SER) is the task of recognizing human emotions and affective states from speech, capitalizing on the fact that the voice often reflects underlying emotion through tone and pitch. In this paper we analyze the emotions in recorded speech by examining the acoustic features of the audio recordings. Feature selection and feature extraction are performed on the audio data; the Python library Librosa was used for extraction. Feature selection (FS) was applied to find the most relevant feature subset. The data were analyzed and labeled with respect to emotion and gender for six emotions, viz. neutral, happy, sad, angry, fear and disgust; the number of labels was slightly smaller for surprise.
Keywords: MFCC, Chroma, Mel Spectrogram, Feature selection
1. INTRODUCTION
Emotion plays a significant role in interpersonal human interactions. For interaction between human and machine, the speech signal is the fastest and most efficient medium. Emotional displays convey considerable information about the mental state of an individual. Emotion detection is a very difficult task for a machine, whereas it is natural for humans. An emotion recognition system therefore uses knowledge related to emotion in such a way that communication between machine and human is improved.
1.1 LITERATURE SURVEY
A majority of natural language processing solutions, such as voice-activated systems, require speech as input. Standard procedure is to first convert this speech input to text using Automatic Speech Recognition (ASR) systems and then run classification or other learning operations on the ASR text output. ASR resolves variations in speech transcriptions by being speaker independent. ASR systems produce outputs with high accuracy, but through their probabilistic acoustic and language models they lose a significant amount of the information in speech that suggests emotion. This gap has made Speech-based Emotion Recognition (SER) systems an area of research interest in the last few years. Three key issues for a successful SER system are:
Choice of a good emotional speech database.
Extraction of effective features.
Design of reliable classifiers using machine learning algorithms.
Emotional feature extraction is a main issue in an SER system. A typical SER system works by extracting features such as spectral features, pitch frequency features, formant features and energy-related features from speech, followed by a classification task to predict the various classes of emotion. Emotions recognized from speech information may be speaker independent or speaker dependent. The Bayesian Network model, Hidden Markov Model (HMM), Support Vector Machines (SVM), Gaussian Mixture Model (GMM) and Multi-Classifier Fusion are a few of the techniques used in traditional classification tasks.
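For illustration, a minimal sketch of such a traditional pipeline is given below, assuming MFCC-style feature vectors and emotion labels have already been extracted; the function and variable names are placeholders rather than part of any cited work.

```python
# Minimal sketch of a traditional SER classifier: an SVM trained on
# pre-extracted feature vectors. "features" and "labels" are placeholders.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

def train_svm_baseline(features: np.ndarray, labels: np.ndarray) -> float:
    """Train an SVM emotion classifier and return its test accuracy."""
    X_train, X_test, y_train, y_test = train_test_split(
        features, labels, test_size=0.2, random_state=42, stratify=labels)
    scaler = StandardScaler().fit(X_train)          # zero mean, unit variance
    clf = SVC(kernel="rbf", C=10.0, gamma="scale")  # RBF-kernel SVM, a common SER baseline
    clf.fit(scaler.transform(X_train), y_train)
    return accuracy_score(y_test, clf.predict(scaler.transform(X_test)))
```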
In this work, we propose a robust technique for emotion classification using speech features and transcriptions. The objective is to capture emotional characteristics using speech features, along with semantic information from text, and to use a deep learning based emotion classifier to improve emotion detection accuracy. We present different deep network architectures to classify emotion using speech features and text; the main contributions of the current work are described in the following sections.
1.2 PROPOSED METHOD
The speech emotion recognition system is implemented as a Machine Learning (ML) model. The steps of operation are similar to any other ML project, with supplementary fine-tuning to make the model function adequately.
The fundamental step is data collection, which is of prime importance. The model being generated will learn from the data contributed to it, and all the conclusions and decisions that the trained model produces are driven by that supervised data. The second step, called feature engineering, is a combination of various machine learning tasks performed over the gathered data; these address the various data description and data quality problems. The third step is often considered the essence of an ML project, where an algorithm-based model is generated. This model uses an ML algorithm to learn about the data and trains itself to react to any new data it is exposed to. The final step is to evaluate the performance of the built model. Very frequently, developers repeat the steps of generating a model and evaluating it to compare the performance of various algorithms. Measuring the outcomes helps to choose the ML algorithm most appropriate to the problem.
Dataset:
An English-language dataset, the Toronto Emotional Speech Set (TESS), was taken into consideration. This dataset consists of 200 target words spoken by two women, one younger and the other older, and the phrases were recorded so as to portray the following seven emotions: happy, sad, angry, disgust, fear, surprise and neutral. There are 2,800 files in total in this dataset.
CREMA-D is a dataset of 7,442 original clips from 91 actors. These clips were from 48 male and 43 female actors between the ages of 20 and 74, coming from a variety of races and ethnicities (African American, Asian, Caucasian, Hispanic, and Unspecified). Actors spoke from a selection of 12 sentences. The sentences were presented using one of six different emotions (Anger, Disgust, Fear, Happy, Neutral, and Sad) and four different emotion levels (Low, Medium, High, and Unspecified).
The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) dataset contains 24 professional actors (12 female, 12 male) vocalizing two lexically matched statements in a neutral North American accent. Speech emotions include calm, happy, sad, angry, fearful, surprise, and disgust expressions. Each expression is produced at two levels of emotional intensity (normal, strong), with an additional neutral expression.
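As an illustration of how such a corpus can be indexed for training, the following sketch collects (file path, emotion label) pairs from a TESS-style directory, assuming the emotion appears as the last underscore-separated token of each file name (e.g. YAF_dog_happy.wav); this directory layout is an assumption about the public release, not something specified above.

```python
# Sketch: index the TESS corpus into (file path, emotion label) pairs,
# assuming the emotion label is the last "_"-separated token of each
# .wav file name, as in the publicly distributed release.
import os
import pandas as pd

def index_tess(root_dir: str) -> pd.DataFrame:
    """Walk the TESS directory tree and collect paths with emotion labels."""
    records = []
    for dirpath, _, filenames in os.walk(root_dir):
        for name in filenames:
            if name.lower().endswith(".wav"):
                emotion = name.rsplit(".", 1)[0].split("_")[-1].lower()
                records.append({"path": os.path.join(dirpath, name),
                                "emotion": emotion})
    return pd.DataFrame(records)

# Example (placeholder path): df = index_tess("TESS Toronto emotional speech set data")
```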
2. SPEECH EMOTION RECOGNITION SYSTEM
2.1 Data Set & Data Visualization:
In this speech emotion recognition project, an audio file is taken from the TESS dataset and uploaded in .wav format. Before the upload is processed, the file is validated for file format and empty input, and it is then passed directly to the Python code, where the output is generated in the form of emotion labels. Data visualization presents the given audio data in graphic and pictorial form. Here, the initial dataset is divided by its emotion labels, and then the whole data is pictured as a spectrogram graph and a wave-plot diagram. A spectrogram is a visual representation of the spectrum of frequencies of a signal as it varies with time. A wave-plot is employed to plot the waveform of amplitude vs. time, where the primary axis is amplitude and the second axis is time.
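A minimal sketch of this visualization step is given below, using librosa and matplotlib; the waveshow call assumes librosa 0.9 or later (older releases used waveplot), and the file path is a placeholder.

```python
# Sketch: plot the wave-plot (amplitude vs. time) and spectrogram of one file.
import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

def show_wave_and_spectrogram(wav_path: str) -> None:
    y, sr = librosa.load(wav_path, sr=None)          # keep the original sampling rate
    fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(10, 6))
    librosa.display.waveshow(y, sr=sr, ax=ax1)       # amplitude vs. time
    ax1.set_title("Wave plot")
    S_db = librosa.amplitude_to_db(np.abs(librosa.stft(y)), ref=np.max)
    img = librosa.display.specshow(S_db, sr=sr, x_axis="time", y_axis="hz", ax=ax2)
    ax2.set_title("Spectrogram")
    fig.colorbar(img, ax=ax2, format="%+2.0f dB")
    plt.tight_layout()
    plt.show()
```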
2.2 Features Selection:
This project mainly uses features like Zero Crossing Rate, Chroma Shift, Root Mean Square value, Mel Spectrogram and MFCC (Mel Frequency Cepstral Coefficients). These are a few of the predominantly used audio features for emotion-related audio content, acoustic recognition, and information retrieval in the music category; they extract the crux of the given audio. The zero-crossing rate is the rate at which the signal changes from positive to 0 to negative, or from negative to 0 to positive. Mel Spectrograms are spectrograms that visualize sounds on the Mel scale instead of the frequency domain. The Mel scale is a logarithmic transformation of a signal's frequency.
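The following sketch shows one way these features can be computed with librosa, averaging each feature over time so that every file yields a fixed-length vector; the 40 MFCC coefficients and the pooling choice are illustrative assumptions, not prescriptions from this section.

```python
# Sketch: extract ZCR, RMS, chroma, Mel spectrogram and MFCC features
# with librosa, mean-pooled over time into one 1-D feature vector.
import numpy as np
import librosa

def extract_features(wav_path: str, n_mfcc: int = 40) -> np.ndarray:
    y, sr = librosa.load(wav_path, sr=None)
    zcr    = np.mean(librosa.feature.zero_crossing_rate(y))
    rms    = np.mean(librosa.feature.rms(y=y))
    chroma = np.mean(librosa.feature.chroma_stft(y=y, sr=sr), axis=1)
    mel    = np.mean(librosa.feature.melspectrogram(y=y, sr=sr), axis=1)
    mfcc   = np.mean(librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc), axis=1)
    return np.hstack([zcr, rms, chroma, mel, mfcc])   # single fixed-length vector
```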
2.3 Feature extraction:
The speech signal contains a large number of parameters that reflect emotional characteristics. One of the sticking points in emotion recognition is which features should be used. In recent research, many common features have been extracted, such as energy, pitch, formants, and spectrum features such as linear prediction coefficients (LPC), mel-frequency cepstrum coefficients (MFCC), and modulation spectral features. In this work, we have selected modulation spectral features and MFCC to extract the emotional features.
The mel-frequency cepstrum coefficient (MFCC) is the most used representation of the spectral properties of voice signals. MFCCs are well suited to speech recognition because they take human perceptual sensitivity with respect to frequency into consideration. For each frame, the Fourier transform and the energy spectrum are estimated and mapped onto the Mel-frequency scale. In our research, we extract the first 12 orders of the MFCC coefficients, with the speech signals sampled at 16 kHz. For each order of coefficients, we calculate the mean, variance, standard deviation, kurtosis, and skewness over all the frames of an utterance. Each MFCC feature vector is therefore 60-dimensional.
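A sketch of this 60-dimensional statistical MFCC representation (12 coefficients x 5 statistics over all frames) is shown below, using librosa and scipy; it follows the description above but is not the exact implementation used in the experiments.

```python
# Sketch: 12 MFCC coefficients x 5 statistics (mean, variance, standard
# deviation, kurtosis, skewness) over all frames = a 60-dimensional vector.
import numpy as np
import librosa
from scipy.stats import kurtosis, skew

def mfcc_statistics(wav_path: str) -> np.ndarray:
    y, sr = librosa.load(wav_path, sr=16000)              # resample to 16 kHz
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=12)    # shape: (12, n_frames)
    stats = [mfcc.mean(axis=1), mfcc.var(axis=1), mfcc.std(axis=1),
             kurtosis(mfcc, axis=1), skew(mfcc, axis=1)]
    return np.hstack(stats)                               # 12 * 5 = 60 values
```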
3. CONVOLUTIONAL NEURAL NETWORK
A convolutional neural network (CNN) is used to train on the speech data in the training set. Fig. 2 shows the accuracy of the CNN, RNN and MLP models on the training set and verification set. It can be seen from Figure 2 that as the number of iterations increases, the accuracy on the training set and verification set tends to increase, especially on the training set. Finally, after 200 iterations, the accuracy on the training set is over 97%, and that on the verification set is over 80%. The CNN model developed in this paper has higher accuracy than the RNN and MLP on both the training set and the verification set.
The convolution operation performs hierarchical extraction of speech features, while the max-pooling operation removes redundant information from the previous layer's features and simplifies the computation. An activation layer is set after each convolution layer, and the best activation function is determined by experiments. Two fully connected neurons are set in the output layer to divide the speech signals into two categories. In addition, considering the complexity of the network structure, random Dropout is set after each hidden layer to prevent the network from over-fitting during training. After adding Dropout, the neurons in each hidden layer have a certain probability of not updating their weights during training, and the probability of each neuron not updating its weights is equal.
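A minimal Keras sketch of such a CNN is given below: stacked Conv1D blocks with ReLU activations, max pooling, Dropout after each hidden layer, and a fully connected softmax output. The filter counts, kernel sizes and dropout rate are illustrative assumptions, and the number of output classes is left as a parameter.

```python
# Sketch of the CNN described above (layer sizes are illustrative assumptions).
from tensorflow.keras import layers, models

def build_cnn(input_length: int, num_classes: int) -> models.Model:
    model = models.Sequential([
        layers.Input(shape=(input_length, 1)),
        layers.Conv1D(64, kernel_size=5, padding="same", activation="relu"),
        layers.MaxPooling1D(pool_size=2),     # remove redundant detail from the previous layer
        layers.Dropout(0.3),                  # random Dropout after each hidden layer
        layers.Conv1D(128, kernel_size=5, padding="same", activation="relu"),
        layers.MaxPooling1D(pool_size=2),
        layers.Dropout(0.3),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dropout(0.3),
        layers.Dense(num_classes, activation="softmax"),   # fully connected output layer
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```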
Algorithm
Step 1: The sample audio is provided as input.
Step 2: The spectrogram and waveform are plotted from the audio file.
Step 3: Using LIBROSA, a Python library, we extract the MFCCs (Mel Frequency Cepstral Coefficients), usually about 10-20.
Step 4: Remixing the data, dividing it into train and test sets, and then constructing a CNN model and its layers to train on the dataset.
Step 5: Predicting the human voice emotion from the trained model (sample no. - predicted value - actual value); see the sketch after these steps.
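Tying the steps together, the following sketch reuses the illustrative helpers defined earlier (extract_features and build_cnn, both hypothetical names) to extract features, split the data, train the CNN and predict emotions; paths, labels and hyperparameters are placeholders.

```python
# Sketch of Steps 3-5, reusing the illustrative helpers from earlier sketches.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

def train_and_predict(wav_paths, emotions, epochs: int = 50):
    X = np.array([extract_features(p) for p in wav_paths])       # Step 3: feature extraction
    y = LabelEncoder().fit_transform(emotions)
    X = X[..., np.newaxis]                                        # add channel axis for Conv1D
    X_train, X_test, y_train, y_test = train_test_split(         # Step 4: split and train
        X, y, test_size=0.2, random_state=42, stratify=y)
    model = build_cnn(X.shape[1], num_classes=len(set(y)))
    model.fit(X_train, y_train, validation_data=(X_test, y_test),
              epochs=epochs, batch_size=32)
    preds = model.predict(X_test).argmax(axis=1)                 # Step 5: predict emotions
    return model, preds, y_test
```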
UML DIAGRAM:
5. EXPERIMENTAL RESULTS AND ANALYSIS
In this paper, the obtained models are tested on the test set following the experimental procedure. Table 1 shows the normalized confusion matrix for SER. The recognition accuracy achieved by the proposed method on the RAVDESS dataset is better than that of existing studies on SER. The average classification accuracy on RAVDESS is listed in the table.
Some discriminations of 'sad' versus 'neutral' fail. Perhaps surprisingly, the frequency of confusion between 'fear' and 'happiness' is as high as the frequency of confusion between 'sadness' and 'disgust'; perhaps this is because fear is a truly multi-faceted emotion. Based on this, we will compare the characteristics of confounding emotions more carefully to see whether there are any differences and how to capture them, for example by translating spoken words into text, training the network for multimodal prediction, and combining semantics for sentiment recognition.
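For completeness, a short sketch of computing and plotting a row-normalized confusion matrix from held-out predictions, using scikit-learn, is given below; the labels and predictions are placeholders.

```python
# Sketch: normalized confusion matrix for the held-out test set.
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

def plot_normalized_confusion(y_true, y_pred, class_names) -> None:
    cm = confusion_matrix(y_true, y_pred, normalize="true")   # each row sums to 1
    ConfusionMatrixDisplay(cm, display_labels=class_names).plot(
        cmap="Blues", values_format=".2f")
    plt.title("Normalized confusion matrix")
    plt.show()
```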
6. CONCLUSION
In recent years, SER technology, as one of the key technologies in human-computer interaction systems, has received a lot of attention from researchers at home and abroad for its ability to accurately recognize emotions and thus improve the quality of human-computer interaction. In this paper, we propose a deep learning algorithm with fused features for SER. In terms of data processing, we quadruple-processed the RAVDESS dataset into 5,760 audio files. For the network structure, we constructed two parallel convolutional neural networks (CNNs) to extract spatial features and a transformer encoder network to extract temporal features, classifying emotions into one of the eight categories. The TESS dataset that we considered is fine-tuned; since it contains noiseless data, it was easy for us to classify and extract features. Taking advantage of CNNs for spatial feature representation and sequence coding transformation, we obtained an accuracy of 80.46% on the hold-out test set of the RAVDESS dataset. Based on the analysis of the results, recognizing emotions by converting speech into text and combining it with semantics is considered for future work.
The combined Spectrogram-MFCC model results in an overall emotion detection accuracy of 73.1%, an almost 4% improvement over existing state-of-the-art methods. Better results are observed when speech features are used along with speech transcriptions. The combined Spectrogram-Text model gives a class accuracy of 69.5% and an overall accuracy of 75.1%, whereas the combined MFCC-Text model also gives a class accuracy of 69.5% but an overall accuracy of 76.1%, a 5.6% and an almost 7% improvement over current benchmarks respectively. The proposed models can be used for emotion-related applications such as conversational agents, social robots, etc., where identifying the emotion and sentiment hidden in speech may contribute to better conversation.