Speech Emotion Recognition Using Machine Learning
Preethi Jeevan1, Kolluri Sahaja2, Dasari Vani3, Alisha Begum4
1Professor, Dept. of Computer Science and Engineering, SNIST, Hyderabad-501301, India
2,3,4B.Tech Scholars, Dept. of Computer Science and Engineering, Hyderabad-501301, India
***
Abstract: Speech is the ability to express thoughts and emotions through vocal sounds and gestures. Speech emotion recognition (SER) is the task of recognizing human emotions and affective states from speech, capitalizing on the fact that the voice often reflects underlying emotion through tone and pitch. In this paper we analyze the emotions in recorded speech by examining the acoustic features of the audio recordings. Feature selection and feature extraction are performed on the audio data; the Python library Librosa was used for extraction. Feature selection (FS) was applied to find the most relevant feature subset. The data were analyzed and labeled with respect to emotion and gender for six emotions, viz. neutral, happy, sad, angry, fear and disgust; the number of labels was slightly smaller for surprise.
Keywords: MFCC, Chroma, Mel Spectrogram, Feature selection
1. INTRODUCTION
Emotion plays a significant role in interpersonal human interactions. For interaction between human and machine, the speech signal is the fastest and most efficient medium. Emotional displays convey considerable information about the mental state of an individual. Emotion detection is a very difficult task for a machine, whereas it is natural for humans. An emotion recognition system therefore uses knowledge related to emotion in such a way that communication between machine and human is improved.
1.1 LITERATURE SURVEY
A majority of natural language processing solutions, such as voice-activated systems, require speech as input. Standard procedure is to first convert this speech input to text using Automatic Speech Recognition (ASR) systems and then run classification or other learning operations on the ASR text output. ASR resolves variations in speech transcriptions by being speaker independent. ASR systems produce outputs with high accuracy, but through their probabilistic acoustic and language models they lose a significant amount of the information in speech that suggests emotion. This gap has made Speech-based Emotion Recognition (SER) systems an area of research interest in the last few years. Three key issues for a successful SER system are:
Choice of a good emotional speech database.
Extraction of effective features.
Design of reliable classifiers using machine learning algorithms.
Emotional feature extraction is a main issue in an SER system. A typical SER system works by extracting features such as spectral features, pitch frequency features, formant features and energy-related features from speech, followed by a classification task to predict the various classes of emotion. Emotions recognized from speech information may be speaker independent or speaker dependent. The Bayesian Network model, Hidden Markov Model (HMM), Support Vector Machines (SVM), Gaussian Mixture Model (GMM) and Multi-Classifier Fusion are a few of the techniques used in traditional classification tasks.
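For illustration, a minimal sketch of such a traditional pipeline is given below, assuming MFCC-style feature vectors and emotion labels have already been extracted; the function and variable names are placeholders rather than part of any cited work.

```python
# Minimal sketch of a traditional SER classifier: an SVM trained on
# pre-extracted feature vectors. "features" and "labels" are placeholders.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

def train_svm_baseline(features: np.ndarray, labels: np.ndarray) -> float:
    """Train an SVM emotion classifier and return its test accuracy."""
    X_train, X_test, y_train, y_test = train_test_split(
        features, labels, test_size=0.2, random_state=42, stratify=labels)
    scaler = StandardScaler().fit(X_train)          # zero mean, unit variance
    clf = SVC(kernel="rbf", C=10.0, gamma="scale")  # RBF-kernel SVM, a common SER baseline
    clf.fit(scaler.transform(X_train), y_train)
    return accuracy_score(y_test, clf.predict(scaler.transform(X_test)))
```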
In this work, we propose a robust technique for emotion classification using speech features and transcriptions. The objective is to capture emotional characteristics using speech features, along with semantic information from text, and to use a deep learning based emotion classifier to improve emotion detection accuracy. We present different deep network architectures to classify emotion using speech features and text; the main contributions of the current work are described in the following sections.
1.2 PROPOSED METHOD
The speech emotion recognition system is implemented as a Machine Learning (ML) model. The steps of operation are similar to any other ML project, with supplementary fine-tuning to make the model function adequately.
The fundamental step is data collection, which is of prime importance. The model being generated will learn from the data contributed to it, and all the conclusions and decisions that the trained model produces are driven by that supervised data. The second step, called feature engineering, is a combination of various machine learning tasks performed over the gathered data; these address the various data description and data quality problems. The third step is often considered the essence of an ML project, where an algorithm-based model is generated. This model uses an ML algorithm to learn about the data and trains itself to react to any new data it is exposed to. The final step is to evaluate the performance of the built model. Very frequently, developers repeat the steps of generating a model and evaluating it to compare the performance of various algorithms. Measuring the outcomes helps to choose the ML algorithm most appropriate to the problem.
Dataset:
An English-language dataset, the Toronto Emotional Speech Set (TESS), was taken into consideration. This dataset consists of 200 target words spoken by two women, one younger and the other older, and the phrases were recorded so as to portray the following seven emotions: happy, sad, angry, disgust, fear, surprise and neutral. There are 2,800 files in total in this dataset.
CREMA-D is a dataset of 7,442 original clips from 91 actors. These clips were from 48 male and 43 female actors between the ages of 20 and 74, coming from a variety of races and ethnicities (African American, Asian, Caucasian, Hispanic, and Unspecified). Actors spoke from a selection of 12 sentences. The sentences were presented using one of six different emotions (Anger, Disgust, Fear, Happy, Neutral, and Sad) and four different emotion levels (Low, Medium, High, and Unspecified).
The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) dataset contains 24 professional actors (12 female, 12 male) vocalizing two lexically matched statements in a neutral North American accent. Speech emotions include calm, happy, sad, angry, fearful, surprise, and disgust expressions. Each expression is produced at two levels of emotional intensity (normal, strong), with an additional neutral expression.
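As an illustration of how such a corpus can be indexed for training, the following sketch collects (file path, emotion label) pairs from a TESS-style directory, assuming the emotion appears as the last underscore-separated token of each file name (e.g. YAF_dog_happy.wav); this directory layout is an assumption about the public release, not something specified above.

```python
# Sketch: index the TESS corpus into (file path, emotion label) pairs,
# assuming the emotion label is the last "_"-separated token of each
# .wav file name, as in the publicly distributed release.
import os
import pandas as pd

def index_tess(root_dir: str) -> pd.DataFrame:
    """Walk the TESS directory tree and collect paths with emotion labels."""
    records = []
    for dirpath, _, filenames in os.walk(root_dir):
        for name in filenames:
            if name.lower().endswith(".wav"):
                emotion = name.rsplit(".", 1)[0].split("_")[-1].lower()
                records.append({"path": os.path.join(dirpath, name),
                                "emotion": emotion})
    return pd.DataFrame(records)

# Example (placeholder path): df = index_tess("TESS Toronto emotional speech set data")
```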
2. SPEECH EMOTION RECOGNITION SYSTEM
2.1 Data Set & Data Visualization:
In this speech emotion recognition project, an audio file is taken from the TESS dataset and uploaded in .wav format. Before the upload is processed, the file is validated for file format and empty input, and it is then passed directly to the Python code, where the output is generated in the form of emotion labels. Data visualization presents the given audio data in graphic and pictorial form. Here, the initial dataset is divided by its emotion labels, and then the whole data is pictured as a spectrogram graph and a wave-plot diagram. A spectrogram is a visual representation of the spectrum of frequencies of a signal as it varies with time. A wave-plot is employed to plot the waveform of amplitude vs. time, where the primary axis is amplitude and the second axis is time.
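A minimal sketch of this visualization step is given below, using librosa and matplotlib; the waveshow call assumes librosa 0.9 or later (older releases used waveplot), and the file path is a placeholder.

```python
# Sketch: plot the wave-plot (amplitude vs. time) and spectrogram of one file.
import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

def show_wave_and_spectrogram(wav_path: str) -> None:
    y, sr = librosa.load(wav_path, sr=None)          # keep the original sampling rate
    fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(10, 6))
    librosa.display.waveshow(y, sr=sr, ax=ax1)       # amplitude vs. time
    ax1.set_title("Wave plot")
    S_db = librosa.amplitude_to_db(np.abs(librosa.stft(y)), ref=np.max)
    img = librosa.display.specshow(S_db, sr=sr, x_axis="time", y_axis="hz", ax=ax2)
    ax2.set_title("Spectrogram")
    fig.colorbar(img, ax=ax2, format="%+2.0f dB")
    plt.tight_layout()
    plt.show()
```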
2.2 Features Selection:
This project mainly uses features like Zero Crossing Rate, Chroma Shift, Root Mean Square value, Mel Spectrogram and MFCC (Mel Frequency Cepstral Coefficients). These are a few of the predominantly used audio features for emotion-related audio content, acoustic recognition, and information retrieval in the music category; they extract the crux of the given audio. The zero-crossing rate is the rate at which the signal changes from positive to 0 to negative, or from negative to 0 to positive. Mel Spectrograms are spectrograms that visualize sounds on the Mel scale instead of the frequency domain. The Mel scale is a logarithmic transformation of a signal's frequency.
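The following sketch shows one way these features can be computed with librosa, averaging each feature over time so that every file yields a fixed-length vector; the 40 MFCC coefficients and the pooling choice are illustrative assumptions, not prescriptions from this section.

```python
# Sketch: extract ZCR, RMS, chroma, Mel spectrogram and MFCC features
# with librosa, mean-pooled over time into one 1-D feature vector.
import numpy as np
import librosa

def extract_features(wav_path: str, n_mfcc: int = 40) -> np.ndarray:
    y, sr = librosa.load(wav_path, sr=None)
    zcr    = np.mean(librosa.feature.zero_crossing_rate(y))
    rms    = np.mean(librosa.feature.rms(y=y))
    chroma = np.mean(librosa.feature.chroma_stft(y=y, sr=sr), axis=1)
    mel    = np.mean(librosa.feature.melspectrogram(y=y, sr=sr), axis=1)
    mfcc   = np.mean(librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc), axis=1)
    return np.hstack([zcr, rms, chroma, mel, mfcc])   # single fixed-length vector
```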
2.3 Feature extraction:
The speech signal contains a large number of parameters that reflect emotional characteristics. One of the sticking points in emotion recognition is which features should be used. In recent research, many common features have been extracted, such as energy, pitch, formants, and spectrum features such as linear prediction coefficients (LPC), mel-frequency cepstrum coefficients (MFCC), and modulation spectral features. In this work, we have selected modulation spectral features and MFCC to extract the emotional features.
The mel-frequency cepstrum coefficient (MFCC) is the most used representation of the spectral properties of voice signals. MFCCs are well suited to speech recognition because they take human perceptual sensitivity with respect to frequency into consideration. For each frame, the Fourier transform and the energy spectrum are estimated and mapped onto the Mel-frequency scale. In our research, we extract the first 12 orders of the MFCC coefficients, with the speech signals sampled at 16 kHz. For each order of coefficients, we calculate the mean, variance, standard deviation, kurtosis, and skewness over all the frames of an utterance. Each MFCC feature vector is therefore 60-dimensional.
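A sketch of this 60-dimensional statistical MFCC representation (12 coefficients x 5 statistics over all frames) is shown below, using librosa and scipy; it follows the description above but is not the exact implementation used in the experiments.

```python
# Sketch: 12 MFCC coefficients x 5 statistics (mean, variance, standard
# deviation, kurtosis, skewness) over all frames = a 60-dimensional vector.
import numpy as np
import librosa
from scipy.stats import kurtosis, skew

def mfcc_statistics(wav_path: str) -> np.ndarray:
    y, sr = librosa.load(wav_path, sr=16000)              # resample to 16 kHz
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=12)    # shape: (12, n_frames)
    stats = [mfcc.mean(axis=1), mfcc.var(axis=1), mfcc.std(axis=1),
             kurtosis(mfcc, axis=1), skew(mfcc, axis=1)]
    return np.hstack(stats)                               # 12 * 5 = 60 values
```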
3. CONVOLUTIONAL NEURAL NETWORK
A convolutional neural network (CNN) is used to train on the speech data in the training set. Fig. 2 shows the accuracy of the CNN, RNN and MLP models on the training set and verification set. It can be seen from Figure 2 that as the number of iterations increases, the accuracy on the training set and verification set tends to increase, especially on the training set. Finally, after 200 iterations, the accuracy on the training set is over 97%, and that on the verification set is over 80%. The CNN model developed in this paper has higher accuracy than the RNN and MLP on both the training set and the verification set.
The convolution operation performs hierarchical extraction of speech features, while the max-pooling operation removes redundant information from the previous layer's features and simplifies the computation. An activation layer is set after each convolution layer, and the best activation function is determined by experiments. Two fully connected neurons are set in the output layer to divide the speech signals into two categories. In addition, considering the complexity of the network structure, random Dropout is set after each hidden layer to prevent the network from over-fitting during training. After adding Dropout, the neurons in each hidden layer have a certain probability of not updating their weights during training, and the probability of each neuron not updating its weights is equal.
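A minimal Keras sketch of such a CNN is given below: stacked Conv1D blocks with ReLU activations, max pooling, Dropout after each hidden layer, and a fully connected softmax output. The filter counts, kernel sizes and dropout rate are illustrative assumptions, and the number of output classes is left as a parameter.

```python
# Sketch of the CNN described above (layer sizes are illustrative assumptions).
from tensorflow.keras import layers, models

def build_cnn(input_length: int, num_classes: int) -> models.Model:
    model = models.Sequential([
        layers.Input(shape=(input_length, 1)),
        layers.Conv1D(64, kernel_size=5, padding="same", activation="relu"),
        layers.MaxPooling1D(pool_size=2),     # remove redundant detail from the previous layer
        layers.Dropout(0.3),                  # random Dropout after each hidden layer
        layers.Conv1D(128, kernel_size=5, padding="same", activation="relu"),
        layers.MaxPooling1D(pool_size=2),
        layers.Dropout(0.3),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dropout(0.3),
        layers.Dense(num_classes, activation="softmax"),   # fully connected output layer
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```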
Algorithm
Step 1: The sample audio is provided as input.
Step 2: The spectrogram and waveform are plotted from the audio file.
Step 3: Using LIBROSA, a Python library, we extract the MFCCs (Mel Frequency Cepstral Coefficients), usually about 10-20.
Step 4: Remixing the data, dividing it into train and test sets, and then constructing a CNN model and its layers to train on the dataset.
Step 5: Predicting the human voice emotion from the trained model (sample no. - predicted value - actual value); see the sketch after these steps.
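Tying the steps together, the following sketch reuses the illustrative helpers defined earlier (extract_features and build_cnn, both hypothetical names) to extract features, split the data, train the CNN and predict emotions; paths, labels and hyperparameters are placeholders.

```python
# Sketch of Steps 3-5, reusing the illustrative helpers from earlier sketches.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

def train_and_predict(wav_paths, emotions, epochs: int = 50):
    X = np.array([extract_features(p) for p in wav_paths])       # Step 3: feature extraction
    y = LabelEncoder().fit_transform(emotions)
    X = X[..., np.newaxis]                                        # add channel axis for Conv1D
    X_train, X_test, y_train, y_test = train_test_split(         # Step 4: split and train
        X, y, test_size=0.2, random_state=42, stratify=y)
    model = build_cnn(X.shape[1], num_classes=len(set(y)))
    model.fit(X_train, y_train, validation_data=(X_test, y_test),
              epochs=epochs, batch_size=32)
    preds = model.predict(X_test).argmax(axis=1)                 # Step 5: predict emotions
    return model, preds, y_test
```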
UML DIAGRAM:
5. EXPERIMENTAL RESULTS AND ANALYSIS
In this paper, the obtained models are tested on the test set following the experimental procedure. Table 1 shows the normalized confusion matrix for SER. The recognition accuracy achieved by the proposed method on the RAVDESS dataset is better than that of existing studies on SER. The average classification accuracy on RAVDESS is listed in the table.
Some discriminations of 'sad' versus 'neutral' fail. Perhaps surprisingly, the frequency of confusion between 'fear' and 'happiness' is as high as the frequency of confusion between 'sadness' and 'disgust'; perhaps this is because fear is a truly multi-faceted emotion. Based on this, we will compare the characteristics of confounding emotions more carefully to see whether there are any differences and how to capture them, for example by translating spoken words into text, training the network for multimodal prediction, and combining semantics for sentiment recognition.
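For completeness, a short sketch of computing and plotting a row-normalized confusion matrix from held-out predictions, using scikit-learn, is given below; the labels and predictions are placeholders.

```python
# Sketch: normalized confusion matrix for the held-out test set.
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

def plot_normalized_confusion(y_true, y_pred, class_names) -> None:
    cm = confusion_matrix(y_true, y_pred, normalize="true")   # each row sums to 1
    ConfusionMatrixDisplay(cm, display_labels=class_names).plot(
        cmap="Blues", values_format=".2f")
    plt.title("Normalized confusion matrix")
    plt.show()
```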
6. CONCLUSION
In recent years, SER technology, as one of the key technologies in human-computer interaction systems, has received a lot of attention from researchers at home and abroad for its ability to accurately recognize emotions and thus improve the quality of human-computer interaction. In this paper, we propose a deep learning algorithm with fused features for SER. In terms of data processing, we quadruple-processed the RAVDESS dataset into 5,760 audio files. For the network structure, we constructed two parallel convolutional neural networks (CNNs) to extract spatial features and a transformer encoder network to extract temporal features, classifying emotions into one of the eight categories. The TESS dataset that we considered is fine-tuned; since it contains noiseless data, it was easy for us to classify and extract features. Taking advantage of CNNs for spatial feature representation and sequence coding transformation, we obtained an accuracy of 80.46% on the hold-out test set of the RAVDESS dataset. Based on the analysis of the results, recognizing emotions by converting speech into text and combining it with semantics is considered for future work.
The combined Spectrogram-MFCC model results in an overall emotion detection accuracy of 73.1%, an almost 4% improvement over existing state-of-the-art methods. Better results are observed when speech features are used along with speech transcriptions. The combined Spectrogram-Text model gives a class accuracy of 69.5% and an overall accuracy of 75.1%, whereas the combined MFCC-Text model also gives a class accuracy of 69.5% but an overall accuracy of 76.1%, a 5.6% and an almost 7% improvement over current benchmarks respectively. The proposed models can be used for emotion-related applications such as conversational agents, social robots, etc., where identifying the emotion and sentiment hidden in speech may contribute to better conversation.