Afrispeech-Dialog: A Benchmark Dataset for Spontaneous English Conversations in Healthcare and Beyond
Mardhiyah Sanni1,2, Tassallah Abdullahi2,3, Devendra D. Kayande1,2,4, Emmanuel Ayodele1, Naome A. Etori2,5, Michael S. Mollel2,6, Moshood Yekini2, Chibuzor Okocha2,7, Lukman E. Ismaila2,8, Folafunmi Omofoye2,9, Boluwatife A. Adewale2, Tobi Olatunji1,2,10♠
1Intron, 2BioRAMP, 3Brown University, 4Indian Institute of Information Technology Allahabad, 5University of Minnesota-Twin Cities, 6University of Glasgow, 7University of Florida, 8Johns Hopkins University, 9University of North Carolina at Chapel Hill, 10Georgia Institute of Technology
tobi@intron.io
Abstract
Speech technologies are transforming interactions across various sectors, from healthcare to call centers and robots, yet their performance on African-accented conversations remains underexplored. We introduce Afrispeech-Dialog, a benchmark dataset of 50 simulated medical and non-medical African-accented English conversations, designed to evaluate automatic speech recognition (ASR) and related technologies. We assess state-of-the-art (SOTA) speaker diarization and ASR systems on long-form, accented speech, comparing their performance with native accents, and discover a performance degradation of more than 10%. Additionally, we explore the medical conversation summarization capabilities of large language models (LLMs) to demonstrate the impact of ASR errors on downstream medical summaries, providing insights into the challenges and opportunities for speech technologies in the Global South. Our work highlights the need for more inclusive datasets to advance conversational AI in low-resource settings.
1 Introduction
Deep learning approaches have revolutionized and yielded significant performance gains across several natural language tasks (Bharadiya, 2023), especially for high-resource languages (Radford et al., 2023). While conversational speech recognition has continued to make significant strides in task automation across domains such as medicine (Biswas and Talukdar, 2024; Abacha et al., 2023; Yim et al., 2023), voice assistants (Pasandi and Pasandi, 2022; Mani et al., 2020), call centers (Płaza et al., 2021), and robotics (Skantze, 2021), much of the research in Automatic Speech Recognition (ASR) has focused on monolingual speech with native accents (Aksënova et al., 2022), with considerable performance gaps in diverse linguistic, low-resource, and accented contexts (Radford et al., 2023; Olatunji et al., 2023a).

Authors marked with ♠ are senior authors.
In Anglophone African contexts, regional accents further complicate the development of ASR systems, resulting in poor generalization from state-of-the-art (SOTA) models (Hinsvark et al., 2021) in both general (Adelani et al., 2022; Olatunji et al., 2023a; Ogun et al., 2024; Tonja et al., 2024) and medical (Afonja et al., 2024; Olatunji et al., 2023b) contexts. As in more developed countries, performant speech recognition models are particularly useful in overburdened healthcare systems, where they can help reduce the documentation workload for overwhelmed clinicians (Afonja et al., 2024; Olatunji et al., 2023b).
Recent advancements in medical conversation transcription and summarization using LLMs have led to wide adoption of these technologies in hospitals in developed countries (Michalopoulos et al., 2022; Yim et al., 2023), helping to improve clinical documentation quality and productivity (Galloway et al., 2024). Given heavy patient loads (Baye et al., 2020; Etori et al., 2023), as well as the recent integration of AI/NLP/ASR systems into clinical workflows in African clinics1,2, performant ambient transcription and summarization systems are highly desired. However, the generalizability of such systems to African contexts remains underexplored (Afonja et al., 2024).
To tackle this problem, our contributions are as follows:

1https://techcabal.com/2023/05/29/intron-health-bringsai-to-african-healthcare/
2https://techtrendske.co.ke/2024/09/23/google-supportsjacaranda-healths-use-of-gen-ai-for-mothers-and-babies/
1. AfriSpeech-Dialog, a novel 7-hour dataset of 50 long-form simulated African-accented English medical and non-medical conversations.

2. A benchmark of the performance of SOTA speaker diarization, speech recognition, and LLM-based summarization systems on African-accented conversations.
We lay the groundwork for more inclusive ASR and NLP technologies in healthcare and beyond.
2 Related Work

2.1 ASR in Medical Conversations and Summarization

The role of ASR in medical documentation has grown significantly, particularly in telehealth and in-person patient-physician consultations (Korfiatis et al., 2022; Galloway et al., 2024; Michalopoulos et al., 2022; Yim et al., 2023). Accurate ASR in medical dialogue is critical, as transcription errors can lead to incorrect medical records. Several datasets have been developed to facilitate the study of ASR in medical contexts. PriMock57 (Korfiatis et al., 2022) provides primary care mock consultations in European English accents; Fareez et al. (2022) evaluate medical ASR on simulated patient-physician medical interviews with a focus on respiratory cases; Enarvi et al. (2020) experiment with automatically generating medical reports from diarized doctor-patient surgery conversations. Le-Duc et al. (2024) proposed a real-time speech summarization system for medical conversations, able to generate summaries for every utterance (local) and at the end of the conversation (global), eliminating the need for continuous updating and revision of the summary state.

These datasets primarily focus on non-African accents and therefore do not account for the challenges specific to African-accented medical speech.
2.2 Non-medical Conversational ASR

Pęzik et al. (2022) released DiaBiz, an annotated corpus of over 400 hours of Polish call center dialogs. Other conversational, parliamentary, or oratory datasets, such as AMI (Carletta et al., 2005), Earnings-22 (Del Rio et al., 2022), and VoxPopuli (Wang et al., 2021), have gained popularity on public ASR benchmarks. Conversational ASR has also been explored in other domains such as call centers (Płaza et al., 2021) and robotics (Skantze, 2021). However, these datasets lack representation of African-accented speech.
2.3 African-Accented ASR

There has been growing interest in developing ASR systems that cater to African languages; for example, Yılmaz et al. (2018) developed a multilingual ASR system for code-switched South African speech. Multilingual ASR resources, such as the EVI dataset (Spithourakis et al., 2022), offer a strong foundation for developing similar models in African contexts, where data scarcity hinders progress. Olatunji et al. (2023b) released a pan-African accented English dataset for medical and general ASR. While these datasets focus on single-speaker speech recognition, AfriSpeech-Dialog is the first African-accented English conversational dataset spanning medical and non-medical domains, enabling additional tasks like diarization and summarization.
2.4 Speaker Diarization in Multi-Speaker Conversations

To increase the efficiency of NLP/ASR systems, considerable research effort has gone into integrating speaker diarization (SD) into the pipeline. Serikov et al. (2024) provide a comparative analysis of three SD models, Pyannote (Bredin et al., 2020), Cleaver3, and NVIDIA NeMo (Harper et al., 2019), on 20 different German dialects for the diarization and identification (DI) task. NVIDIA NeMo performs slightly better, with competitive performance due to its multiscale segmentation for identifying and removing shorter segments. On a similar DI task, Chua et al. (2023) benchmarked the performance of multilingual ASR models in open and closed tracks on the challenging MERLIon CCS English-Mandarin dataset of spontaneous, code-switching parent-child conversational speech. However, SD for African-accented conversations remains underexplored. Benchmarking SOTA SD models on AfriSpeech-Dialog reveals their limitations in this setting.
3 Methodology
Figure 1 shows an overview of the dataset creation and benchmarking process, illustrating how AfriSpeech-Dialog supports tasks like speaker diarization, ASR, and summarization. Below, we describe the dataset creation process and the evaluation of state-of-the-art (SOTA) models on these tasks to demonstrate their applicability and highlight challenges in African-accented conversational speech.

3https://www.oxfordwaveresearch.com/products/cleaver/

Figure 1: AfriSpeech-Dialog: Dataset and Benchmarking Pipeline
3.1 Dataset Statistics

Here, we outline our dataset creation process.
 | Medical | Non-medical
Num. of Countries | 1 | 3
Num. of Accents | 6 | 8
Gender (M, F) | (14, 26) | (25, 33)

Table 1: Statistics of the medical and non-medical datasets.

3.1.1 Collecting Conversations

We recorded simulated virtual and in-person medical and non-medical conversations from African medical and non-medical crowdworkers on the Intron Platform4, similar to the process described in Olatunji et al. (2023b). Each conversation began with speakers providing consent, and any identifiable information in the consent segment was removed.

For medical conversations, following the process described in Fareez et al. (2022) and Korfiatis et al. (2022), clinical experts prepared "patient cards" with African-context disease conditions and demographics. Doctor and patient actors included medical professionals (e.g., doctors, nurses) familiar with Objective Structured Clinical Examinations (OSCE), a widely used assessment in medical education that simulates doctor-patient interactions (Fareez et al., 2022). Each patient actor was provided with a detailed "patient card" that included information on their condition, demographics, and medical history, as shown in Table 2. Consistent with the OSCE format, patient cards were hidden from doctor actors to facilitate a more natural consultation.

4https://speech.intron.health

For general domain conversations, participants engaged in open discussions based on "topic cards" prepared by a team of reviewers. Each card contained a conversation topic, a brief description, and two discussion prompts to guide the conversation. The pair of participants (actors) had prior access to the cards and were advised to read through and understand them before starting the conversation. Table 3 shows a sample topic card.

The conversation recordings were stored as mono-channel, 16-bit WAV files with a 48 kHz sampling rate. A team of clinician reviewers reviewed the conversations and selected a high-quality subset for this release. The dataset was collected across three African countries: Nigeria, Kenya, and South Africa. The speakers represent a diverse range of accents (11 in total): Hausa, Isoko, Idoma, Urhobo, Ijaw, Yoruba, Swahili, Sesotho, Igbo, Igala, and Ebira.

3.1.2 Recording Characteristics

Dataset statistics are summarized in Table 1. The dataset features two speakers in each conversation
Condition: Malaria
Demographic (Age, Gender): 32-year-old Female
Presenting Complaint: Fever and chills (2 days)
History of Presenting Complaint:
• Fever for 2 days (high grade, not relieved by medication)
• Chills for 2 days (intermittent, severe)
• Headache for 2 days (generalized, throbbing, 7/10 in severity)
• Fatigue and general body weakness
• No cough, diarrhea, vomiting, or urinary symptoms
• Patient lives in a malaria-endemic area; no recent travel history
Past Medical History (PMH): No chronic disease or surgery
Family History: No family history of similar illness
Social History: Does not drink alcohol or smoke
Allergy History: No known allergies

Table 2: Example Patient Card for medical conversations.
Topic: Cyberbullying
Overview: Cyberbullying is a form of bullying that takes place online or through electronic communication. It involves using technology (social media, text messages, online forums) to intimidate or humiliate someone. Examples include:
• Spreading rumors
• Sending hurtful messages
• Sharing embarrassing information without consent
Discussion Prompts:
• What steps do you take to protect yourself from cyberbullying?
• Do you think social media has effective policies in place, or could they improve?

Table 3: Sample Conversation Card for general conversations.
in both the medical and general domains. Medical conversations were more structured, with doctors asking direct questions and patients responding, resulting in a more formal exchange. General conversations were more relaxed, with spontaneous discussions on various topics. Overlapping speech occurs occasionally but is usually brief, involving short interjections like "yes" or "okay." Disfluencies and some code-switching reflect the natural flow of African English speakers. General conversations have fewer speaker turns but a higher average word count compared to medical ones, as speakers tend to talk longer to express their thoughts. These characteristics make the dataset valuable for testing speaker diarization and transcription models on African-accented speech.
3.1.3 Transcription Process

All conversations were manually transcribed by five professional annotators selected from top-performing contributors on the Intron Platform. Annotators were instructed to annotate speaker turns and insert timestamps for each turn, and all annotators were required to be familiar with medical terminology. To ensure quality, clinician reviewers evaluated a random 20% of each annotator's work for accuracy, with a correctness threshold of at least 90% required for inclusion. Contributors and annotators were paid $10–$20 per hour depending on task complexity and clinical experience. The dataset is released under a CC-BY-NC-SA 4.0 License.
3.2 Speaker Diarization

To benchmark diarization performance on this dataset, we selected three recent high-performing neural diarization models:
• Pyannote (Bredin, 2023): This model leverages a pre-trained neural network that computes x-vector embeddings for speaker diarization. It uses an ECAPA-TDNN (Desplanques et al., 2020) architecture for speaker embeddings, shown to improve speaker separation in diarization tasks.

• Reverb diarization v2 (Bhandari et al., 2024): This model is an extension of Pyannote 3.0, fine-tuned on 26,000 hours of annotated speech. However, it uses WavLM (Chen et al., 2022) instead of the SincNet (Ravanelli and Bengio, 2018) features in the base Pyannote 3.0 model.

• Titanet: This NVIDIA diarization pipeline uses MarbleNet (Nguyen-Vo et al., 2022) for voice activity detection and Titanet-L (Koluguri et al., 2022) for embedding extraction. Titanet uses 1D depth-wise separable convolutions with Squeeze-and-Excitation (SE) layers with global context, followed by a channel-attention-based statistics pooling layer to map variable-length utterances to fixed-length embeddings (t-vectors) (Koluguri et al., 2022).
3.2.1 Evaluation Metrics

We evaluated diarization performance using the Diarization Error Rate (DER) (Doddington et al., 2000). DER quantifies the percentage of time that speakers are misattributed or missed in the diarization output.
We used the pyannote.metrics library5 to calculate DER for each recording and computed the absolute DER for the entire dataset. The optimal mapping between the reference and hypothesis labels was obtained using the Hungarian algorithm (Kuhn, 1955), ensuring an accurate alignment. We report DER on medical and general domain conversations.
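The DER computation described above can be illustrated with a small pure-Python sketch. The paper's evaluation uses the pyannote.metrics library; the function below is only a minimal approximation, with segments given as hypothetical (start, end, speaker) tuples and the optimal label mapping found by brute force, which is equivalent to the Hungarian algorithm for small speaker counts.

```python
from itertools import permutations

def _active(t, segments):
    """Speakers active at time t, given (start, end, speaker) tuples."""
    return {spk for start, end, spk in segments if start <= t < end}

def der(reference, hypothesis, step=0.01):
    """Approximate DER by sampling the timeline every `step` seconds:
    (missed speech + false alarms + speaker confusion) / reference speech."""
    duration = max(end for _, end, _ in reference + hypothesis)
    frames = [i * step for i in range(int(round(duration / step)))]
    ref_labels = sorted({s for _, _, s in reference})
    hyp_labels = sorted({s for _, _, s in hypothesis})
    # pad so every reference label can map to a (possibly empty) slot
    padded = hyp_labels + [None] * max(0, len(ref_labels) - len(hyp_labels))
    best = float("inf")
    for perm in permutations(padded, len(ref_labels)):
        mapping = dict(zip(ref_labels, perm))
        error = total = 0.0
        for t in frames:
            ref, hyp = _active(t, reference), _active(t, hypothesis)
            correct = {mapping[s] for s in ref} & hyp
            # per-frame error = miss + false alarm + confusion
            error += (max(len(ref), len(hyp)) - len(correct)) * step
            total += len(ref) * step
        best = min(best, error / total)
    return best

# Toy two-speaker consultation: 1 s of the patient's turn is attributed
# to the doctor, i.e. 1 s of confusion over 10 s of speech -> DER ~ 0.10
reference = [(0.0, 5.0, "doctor"), (5.0, 10.0, "patient")]
hypothesis = [(0.0, 6.0, "SPK_A"), (6.0, 10.0, "SPK_B")]
print(f"DER: {der(reference, hypothesis):.2f}")  # DER: 0.10
```

This frame-sampling view also makes the collar and overlap settings used in Section 4.1 concrete: a collar of 0.0 simply means no frames near turn boundaries are excluded before counting errors.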
3.3 Automatic Speech Recognition (ASR)

To benchmark ASR performance, we compare SOTA open-source pretrained models: Whisper (Radford et al., 2023), Distil-Whisper (Gandhi et al., 2023), NVIDIA Parakeet (Harper et al., 2019), Canary (Puvvada et al., 2024), MMS (Pratap et al., 2024), and Wav2vec2 (Baevski et al., 2020).

5https://pyannote.github.io/pyannote-metrics/
Open-source models offer transparency and reproducibility. They are often trained on diverse, real-world speech data and provide competitive baselines for dialog-specific challenges such as handling speaker variation, spontaneous speech, and overlapping speakers.
3.3.1 Preprocessing

The original transcripts contained timestamps and speaker tags. We removed these items from the text as they are unnecessary for the ASR task. Long-form audio recordings exceeded the context length of most ASR models; they were therefore chunked into 30-second segments for inference, and the returned transcript segments were concatenated.
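The chunking strategy above can be sketched as follows. The `transcribe_chunk` callable is a hypothetical stand-in for any ASR model call, and the 16 kHz sample rate is an assumption (most ASR models expect 16 kHz input, although the dataset itself is recorded at 48 kHz):

```python
# Sketch of long-form inference: split a recording into fixed 30-second
# windows, run ASR on each window, and concatenate the partial transcripts.

CHUNK_SECONDS = 30
SAMPLE_RATE = 16_000  # assumed: audio already resampled to 16 kHz

def chunk_boundaries(duration_s, chunk_s=CHUNK_SECONDS):
    """Return (start, end) pairs in seconds covering the full recording."""
    bounds = []
    start = 0.0
    while start < duration_s:
        bounds.append((start, min(start + chunk_s, duration_s)))
        start += chunk_s
    return bounds

def transcribe_long_form(audio, duration_s, transcribe_chunk):
    """Transcribe each 30 s segment independently and join the results."""
    pieces = []
    for start, end in chunk_boundaries(duration_s):
        segment = audio[int(start * SAMPLE_RATE):int(end * SAMPLE_RATE)]
        pieces.append(transcribe_chunk(segment).strip())
    return " ".join(pieces)

# Example with a dummy "model" that reports each segment's length:
audio = list(range(75 * SAMPLE_RATE))  # 75 s of fake samples

def fake_asr(segment):
    return f"[{len(segment) // SAMPLE_RATE}s]"

print(transcribe_long_form(audio, 75, fake_asr))  # [30s] [30s] [15s]
```

Note that naive fixed windows can cut words at chunk boundaries; production pipelines often use overlapping windows with merge heuristics instead.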
3.3.2 Evaluation Metrics

ASR performance was evaluated using Word Error Rate (WER). WER measures the total number of insertions, deletions, and substitutions in the predicted text with respect to the total number of words in the reference text.
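As a concrete illustration of this definition, WER can be computed with a standard edit-distance dynamic program over words (toolkits such as jiwer implement the same computation):

```python
# WER = (substitutions + deletions + insertions) / number of reference words,
# computed via word-level Levenshtein distance.

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / len(ref)

ref = "the patient reports fever and chills"
hyp = "the patient report fever chills"
# 1 substitution ("reports" -> "report") + 1 deletion ("and") over 6 words
print(f"WER: {wer(ref, hyp):.3f}")  # WER: 0.333
```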
3.4 Medical Conversation Summarization

Nine LLMs (large, small, open, closed, general, and biomedical) were benchmarked for summarizing doctor-patient dialogues. Each LLM was presented with a human conversation transcript and prompted (Appendix Section D) to generate a detailed summary.

Closed-source general LLMs: OpenAI GPT-4o and GPT-3.5-turbo (Achiam et al., 2023) and Anthropic Claude-3-Sonnet (Anthropic, 2023) represent leading general-purpose commercial LLMs.

Open-source small general LLMs: Meta-Llama-3.1-8B-Instruct and Meta-Llama-3.2-3B-Instruct (Dubey et al., 2024), Microsoft-Phi-3-mini-4k-instruct (Abdin et al., 2024), and Google-Gemma-2-9b-it (Team et al., 2024), selected for their instruction-following abilities or multilingual support, which is essential for code-switching.

Biomedical open-source LLMs: m42-health-Llama3-Med42-8B (Christophe et al., 2024) and johnsnowlabs-JSL-MedLlama-3-8B-v2.0, selected for their biomedical adaptation. Examples of generated summaries are provided in Appendix Section E.
3.4.1 Quantitative Evaluation

We used BERTScore (Zhang et al., 2019) to evaluate the quality of the LLM-generated summaries against the expert-generated reference summaries. Although BERTScore is widely used, studies like Hanna and Bojar (2021) have shown its limitations, particularly in capturing fine-grained semantic nuances and in penalizing stylistic differences.
3.4.2 Qualitative Evaluation

To address the limitations of BERTScore, we complement it with two qualitative evaluation approaches (Human and LLM-as-Judge), in which summaries were evaluated on a 5-point scale adapted from Zheng et al. (2023) and Liu et al. (2023), from 1 (worst) to 5 (best), on the following six criteria: recall of the diagnosis, accuracy of the treatment plan, avoidance of false or fabricated information, clarity and structure, and inclusion of important positive and negative clinical details. If any criterion (e.g., treatment plan) was absent from the conversation transcript, a score of 0 was to be assigned for that criterion. Detailed evaluation criteria are in Appendix Section A.
LLM-as-Judge: Consistent with the growing trend in recent studies (Zheng et al., 2023; Liu et al., 2023), we used generative models for automated summary evaluation. We used OpenAI's "o1" model (Temsah et al., 2024), prompted based on the criteria mentioned above. Detailed verbatim prompts and the method for computing accuracy scores are provided in Appendix Sections B and C.
Human Evaluation: In a blind study, we randomly presented pairs of human transcripts and LLM- or human-generated summaries to a team of 4 clinical experts. The experts compared the information available in the summaries to the conversation transcripts using the 6 criteria listed above (Kanithi et al., 2024; Singhal et al., 2023; Wang et al., 2023). Each summary was independently rated by 2 experts.
3.4.3 Error Propagation on Cascading Models

Since real-world conversation summarization systems rely on imperfect ASR transcripts, and accented medical ASR transcription is challenging for several ASR systems (Afonja et al., 2024), we further evaluated summaries generated from predicted (machine) transcripts to determine whether there was a drop in quality compared with summaries generated from human transcripts (Giuseppe et al., 2021). We measured summary quality using LLM-as-Judge.
4 Experiments
4.1 Diarization
We download and run inference using publicly released checkpoints from Hugging Face (Wolf et al., 2020), with default hyperparameters. We set the collar to 0.0, meaning no margin was allowed around speaker transitions, ensuring that even short overlaps (e.g., "yes" or "okay") were evaluated directly without any tolerance. Overlapping speech was also not excluded from the evaluation.
We ran inference on a single NVIDIA T4 GPU. Inference for Pyannote and Reverb took approximately 2 hours, while Titanet took about 30 minutes. Results represent single runs.
4.2 ASR
Models were downloaded from publicly available Hugging Face (Wolf et al., 2020) checkpoints with default hyperparameters, and the default generation configuration was used. We ran inference on NVIDIA T4 GPUs. Inference required around 30 minutes on average over the whole dataset for the open-source models. Results represent single runs.
4.3 Summarization
For open-source LLMs, we used publicly available checkpoints from Hugging Face (Wolf et al., 2020) without altering their default hyperparameters, except for setting max_new_tokens to 1024. Closed-source models were accessed via their respective APIs, also using default hyperparameters. The prompt template was adapted from prior work (Zheng et al., 2023; Liu et al., 2023), and to ensure consistency, the same prompt was used across all models (details can be found in the Appendix).
We conducted the summarization experiments under two scenarios: (1) generating summaries from human-produced transcripts and (2) generating summaries from transcripts created by our best-performing ASR model (Whisper-large-v3).
5 Results and Discussion
5.1 Diarization
We compute DER separately for a subset of conversations with accurate timestamps: medical (9 samples, Med DER) and general (21 samples, Gen DER) domain conversations. The results are shown in Table 4 and Figure 2. We also show the performance of these models on conversational datasets with other accents, using the values reported in Bredin (2023), Koluguri et al. (2022), and Landini et al. (2022).

Model | DER (%) | Med DER (%) | Gen DER (%) | AMI DER (%) | DIHARD DER (%)

Table 4: Diarization Error Rate (DER) for all 30 audios, with detailed results for the Medical (Med DER) and General (Gen DER) subsets. The AMI DER and DIHARD DER columns show performance on the AMI MixHeadset (Carletta et al., 2005) and DIHARD II (Ryant et al., 2019) datasets, respectively. Lower DER is better, and (*) indicates results where overlapping speech regions were ignored.
The models consistently performed better on general domain conversations than on medical conversations, likely due to their relaxed structure and fewer interruptions. Diarization results on Afrispeech-Dialog are better than on AMI and DIHARD, likely because of the simulated and structured nature of the conversations.
Figure 2: Comparison of Medical and General DER for different models
5.2 ASR
We report WER for all conversations, as well as separately for the medical and non-medical portions of the data (Table 5). The openai/whisper-large models achieved the best performance, followed by the nvidia/canary and nvidia/parakeet models, and lastly the wav2vec2-based models.
The results, as presented in Table 5 and Appendix Figure 3, demonstrate a clear trend: models exhibit superior performance on non-medical (general domain) audio compared to medical domain audio, with roughly 5% better WER, suggesting medical conversations are more challenging, likely due to accented medical jargon such as medication names and diagnoses, and the density of numerical entities (Afonja et al., 2024).
Following trends in Olatunji et al. (2023b), Wav2vec2-large-960h, finetuned exclusively on read LibriSpeech audiobooks (Panayotov et al., 2015), performs worst, confirming that model size (number of parameters), training set size, training data domain (in-the-wild, conversational), and multilingual pretraining all play a role in generalizability to accented speech.
Additionally, we compare performance with the other conversational datasets reported in Table 5 and observe a 5 to 20 point (absolute) performance drop on average, suggesting poor generalizability of SOTA models to African-accented conversational English speech. This degradation underscores the challenges ASR models face with African accents, highlighting the pressing need for high-quality, domain-specific datasets to improve ASR performance for low-resource accents and languages.
5.3 Summarization
Table 6 shows our summarization results.

In BERTScore-H, summaries are generated from human transcripts and evaluated against human reference summaries. In BERTScore-M, summaries are generated from machine-predicted transcripts (Whisper-Large-v3) and evaluated against human reference summaries.

LLM-Eval-H represents the LLM-as-Judge results when key details in the human transcripts are sought from summaries generated from human transcripts. LLM-Eval-M represents the LLM-as-Judge results when key details in the human transcripts are sought from summaries generated from machine-predicted transcripts.
Table 6 shows that general domain models like GPT-4o and Claude consistently achieve high scores across all evaluation metrics, followed by biomedical LLMs, whose generated summaries are semantically more similar to the reference summaries.
Model | WER | Med WER | Gen WER | AMI | Earnings22 | VoxPopuli
whisper-medium | 21.27 | 26.49 | 19.47 | 16.6 | - | 7.4
whisper-large-v2 | 20.82 | 23.74 | 19.81 | 16.4 | 12.05 | 7.3
whisper-large-v3 | 20.38 | 23.81 | 19.19 | 16.01 | 11.3 | -
whisper-large-v3-turbo | 21.93 | 25.58 | 20.67 | - | - | -
distil-whisper-large-v2 | 25.38 | 30.43 | 23.63 | 14.67 | 12.19 | 8.24
distil-whisper-large-v3 | 21.20 | 25.67 | 19.58 | 15.95 | 11.29 | 8.25
parakeet-rnnt-1.1b | 28.16 | 34.03 | 26.13 | 17.1 | 15.15 | 5.44
parakeet-ctc-1.1b | 28.97 | 34.16 | 27.19 | 15.67 | 13.75 | 6.56
parakeet-tdt-1.1b | 28.69 | 33.57 | 27.01 | 15.9 | 14.65 | 5.49
canary-1b | 22.82 | 27.40 | 21.25 | 13.53 | 12.05 | 5.79
wav2vec2-large-960h | 86.34 | 88.35 | 81.17 | 37.0 | - | 17.9
mms-1b-all | 61.75 | 69.04 | 59.22 | 42.02 | 31.17 | 17.63

Table 5: WER for various pre-trained models on the entire dataset, including the Medical (Med WER) and General (Gen WER) portions, as well as the AMI (Carletta et al., 2005), Earnings-22 (Del Rio et al., 2022), and VoxPopuli (Wang et al., 2021) datasets. Lower is better. Top 2 models in bold.
Model | BERTScore-H (%) | LLM-Eval-H (%) | BERTScore-M (%) | LLM-Eval-M (%) | Human-Eval (%)
Claude-3-Sonnet | 88.39 | | | |
GPT-4o | 91.34 | 72.94 | 90.94 | 70.98 | 67.48
GPT-3.5-turbo | 89.87 | 69.41 | 90.81 | 64.51 |
Llama-3.1-8B-Instruct | 86.44 | 69.41 | 86.44 | 62.35 | 65.87
Llama-3.2-3B-Instruct | 85.60 | 65.10 | 87.09 | 63.73 | 62.75
Gemma-2-9b | 86.00 | 71.37 | 85.48 | |
Human Expert | - | - | - | - | 54.59
Llama3-Med42-8B | | | | |
Phi3-mini-instruct | 88.05 | 57.45 | 89.66 | 60.39 | 50.83
Meditron3-8B | 89.83 | 60.20 | 90.40 | 59.61 | 42.98

Table 6: Summarization performance of various models, evaluated using BERTScore (F1), LLM-Eval, and Human Evaluation. BERTScore-H and LLM-Eval-H represent the metrics when summaries are generated and evaluated using human-generated transcripts, while BERTScore-M and LLM-Eval-M represent the metrics for machine-generated transcripts. Higher is better. Top 2 models in bold.
Model | Positives | Negatives | Diagnoses | Treatment | No Hallucination | Concise | Total
Human Expert | 3.35 | 3.06 | 1.97 | 1.59 | 3.23 | 3.18 | 16.37
Claude-3-sonnet-20240229 | 4.31 | 4.31 | 1.54 | 1.46 | 4.77 | 4.54 | 20.92
GPT-4o | 4.22 | 4.13 | 1.71 | 1.76 | 4.31 | 4.11 | 20.24
Llama-3.1-8B-Instruct | 3.90 | 3.71 | 2.67 | 2.14 | 3.71 | 3.62 | 19.76
Llama-3.2-3B-Instruct | 4.04 | 3.91 | 1.35 | 1.43 | 3.96 | 4.13 | 18.83
Gemma | 3.81 | 3.88 | 1.31 | 1.69 | 3.69 | 3.88 | 18.25
Llama3-Med42-8B | 3.13 | 2.67 | 2.00 | 1.87 | 2.83 | 2.83 | 15.33
Phi-3-mini-4k-instruct | 3.22 | 2.84 | 1.59 | 1.67 | 2.96 | 2.96 | 15.25
Meditron3-8B | 2.86 | 2.48 | 1.46 | 1.23 | 2.45 | 2.41 | 12.89

Table 7: Human evaluation results showing means of 5-point ratings for each criterion. Higher is better. Positives: Does the summary include all important positive clinical signs, symptoms, or details in the transcript? Negatives: Does the summary include all important negative clinical signs, symptoms, or details in the transcript? Diagnoses: Does the summary accurately recall the diagnoses from the transcript? Treatment: Does the summary accurately reflect the treatment plan in the transcript? No Hallucination: Does the summary avoid any false, incorrect, or fabricated information? Concise: Is the summary concise, clear, and well-structured?
Claude-3 was ranked highest for summarizing human transcripts, while Gemma-2 ranked highest for summarizing ASR transcripts.
Open, smaller general LLMs like Gemma and the Llama models demonstrate lower semantic similarity with the reference summaries (BERTScore), and their performance on the LLM-Eval metrics demonstrates an inferior ability to retrieve critical information from the conversation. The weakest performer overall was Phi-3-mini-4k-instruct, particularly in the LLM-Eval scores, signaling significant challenges in medical summarization.
5.4 Human Evaluation
Results in Table 6 show that LLM-as-Judge ratings (LLM-Eval-H) were strongly correlated (Pearson's r = 0.816) with Human Expert ratings, in line with Zheng et al. (2023).
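Pearson's correlation, as used in this comparison, is straightforward to compute; the score lists below are invented placeholders for illustration, not the paper's actual per-model ratings:

```python
# Pearson correlation between two sets of per-model scores (e.g.
# LLM-as-Judge vs. human expert ratings).
import math

def pearson(xs, ys):
    """Pearson's r: covariance of xs and ys over the product of their
    standard deviations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical per-model scores for illustration only:
llm_scores = [72.9, 69.4, 69.4, 65.1, 71.4, 57.5, 60.2]
human_scores = [67.5, 64.0, 65.9, 62.8, 66.0, 50.8, 43.0]
print(f"Pearson r = {pearson(llm_scores, human_scores):.3f}")
```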
Table 7 shows that, of the 6 criteria, expert summaries were ranked in the top 3 only at accurately recalling the diagnoses from the transcript, suggesting this was a key focus for the experts. LLM summaries were more concise and retained a greater amount of key detail when compared to the expert summaries.
The blind expert ratings showed that, overall, experts ranked leading LLM summaries higher than the reference summaries, scoring them consistently higher on the completeness of key facts carried over from the transcript to the summary. This suggests that LLMs may perform better at certain tasks, such as compressing large amounts of information (10+ minute conversations), when compared with time-constrained physicians who may prefer brevity.
5.5 Error Propagation from Cascading Models
While summarizing ASR transcripts reveals no significant difference in semantic similarity with the reference summaries (BERTScore-H vs. BERTScore-M), a more nuanced pattern emerges in the LLM-Eval-H and LLM-Eval-M metrics, where many models show a notable drop in performance. LLM-Eval-M scores are on average 2 to 5 points lower than LLM-Eval-H. This indicates that ASR errors (noisy transcription outputs) may limit the summarization ability of LLMs, leading to loss of critical healthcare information.
6 Conclusion
This study highlights several challenges with accented conversational ASR, especially its impact on downstream medical summarization. We benchmark speaker diarization, ASR, and medical summarization, drawing attention to gaps and opportunities to improve accented conversational ASR in the African context.
7 Limitations
While simulated conversations provide the opportunity to benchmark ASR systems, they may not fully reflect the complexity of real-world natural dialogues, where interruptions, distractions, and background noise are more typical. Additionally, the age range of contributors does not accurately reflect the diversity of real-world speaker populations, potentially impacting the generalizability of the results. The use of general large language models (LLMs) as evaluators may also introduce biases. At 5 hours, the dataset does not reflect the full breadth of medical or general conversations, as several domains, subdomains, topics, and medical specialties were not covered in this release. Nonetheless, this work significantly contributes to addressing the gap in ASR systems' adaptability to African accents. Future research should aim to incorporate real-world data and broader age representation, and to further optimize models to enhance the robustness and applicability of ASR technologies in diverse African linguistic environments.
8 Ethical Considerations
In a world where voice cloning and identity theft are major cybersecurity threats, releasing a voice-based dataset is not without risk. Each recording began with contributors expressing consent to the recording. However, since real names were sometimes used during consent, these segments were removed to protect participants' identities. During doctor-patient conversations, pseudonyms were used to maintain anonymity. Furthermore, releasing real-world doctor-patient conversations risks breaching privacy and confidentiality requirements, hence the preference for simulated conversations.
9 Acknowledgment
We thank the Intron Health team for providing the data and compute used in this work, and all contributors, patient-actors, and doctor-actors whose voices are included in the dataset.
References
Asma Ben Abacha, Wen-wai Yim, Yadan Fan, and Thomas Lin. 2023. An empirical study of clinical note generation from doctor-patient encounters. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 2291–2302.
Marah Abdin, Sam Ade Jacobs, Ammar Ahmad Awan, Jyoti Aneja, Ahmed Awadallah, Hany Awadalla, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Harkirat Behl, et al. 2024. Phi-3 technical report: A highly capable language model locally on your phone. arXiv preprint arXiv:2404.14219.
Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. GPT-4 technical report. arXiv preprint arXiv:2303.08774.
David Ifeoluwa Adelani, Graham Neubig, Sebastian Ruder, Shruti Rijhwani, Michael Beukman, Chester Palen-Michel, Constantine Lignos, Jesujoba O Alabi, Shamsuddeen H Muhammad, Peter Nabende, et al. 2022. MasakhaNER 2.0: Africa-centric transfer learning for named entity recognition. arXiv preprint arXiv:2210.12391.
Tejumade Afonja, Tobi Olatunji, Sewade Ogun, Naome A Etori, Abraham Owodunni, and Moshood Yekini. 2024. Performant ASR models for medical entities in accented speech. arXiv preprint arXiv:2406.12387.
Alëna Aksënova, Zhehuai Chen, Chung-Cheng Chiu, Daan van Esch, Pavel Golik, Wei Han, Levi King, Bhuvana Ramabhadran, Andrew Rosenberg, Suzan Schwartz, et al. 2022. Accented speech recognition: Benchmarking, pre-training, and diverse data. arXiv preprint arXiv:2205.08014.
Anthropic. 2023. The Claude 3 model family: Opus, Sonnet, Haiku. Accessed: 2024-10-15.
Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. 2020. wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in Neural Information Processing Systems, 33:12449–12460.
Yohannes Baye, Tesfaye Demeke, Nigusie Birhan, Agumasie Semahegn, and Simon Birhanu. 2020. Nurses' work-related stress and associated factors in governmental hospitals in Harar, Eastern Ethiopia: A cross-sectional study. PLoS One, 15(8):e0236782.
Nishchal Bhandari, Danny Chen, Miguel Ángel del Río Fernández, Natalie Delworth, Jennifer Drexler Fox, Migüel Jetté, Quinten McNamara, Corey Miller, Ondřej Novotný, Ján Profant, Nan Qin, Martin Ratajczak, and Jean-Philippe Robichaud. 2024. Reverb: Open-source ASR and diarization from Rev. Preprint, arXiv:2410.03930.
Jasmin Bharadiya. 2023. A comprehensive survey of deep learning techniques natural language processing. European Journal of Technology, 7(1):58–66.
Anjanava Biswas and Wrick Talukdar. 2024. Intelligent clinical documentation: Harnessing generative AI for patient-centric clinical note generation. arXiv preprint arXiv:2405.18346.
Hervé Bredin, Ruiqing Yin, Juan Manuel Coria, Gregory Gelly, Pavel Korshunov, Marvin Lavechin, Diego Fustes, Hadrien Titeux, Wassim Bouaziz, and Marie-Philippe Gill. 2020. pyannote.audio: neural building blocks for speaker diarization. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7124–7128. IEEE.
Hervé Bredin. 2023. pyannote.audio 2.1 speaker diarization pipeline: principle, benchmark, and recipe. In Proc. INTERSPEECH 2023.
Jean Carletta, Simone Ashby, Sebastien Bourban, Mike Flynn, Mael Guillemot, Thomas Hain, Jaroslav Kadlec, Vasilis Karaiskos, Wessel Kraaij, Melissa Kronenthal, et al. 2005. The AMI meeting corpus: A pre-announcement. In International Workshop on Machine Learning for Multimodal Interaction, pages 28–39. Springer.
Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, Jian Wu, Long Zhou, Shuo Ren, Yanmin Qian, Yao Qian, Jian Wu, Michael Zeng, Xiangzhan Yu, and Furu Wei. 2022. WavLM: Large-scale self-supervised pre-training for full stack speech processing. IEEE Journal of Selected Topics in Signal Processing, 16(6):1505–1518.
Clément Christophe, Praveen K Kanithi, Tathagata Raha, Shadab Khan, and Marco AF Pimentel. 2024. Med42-v2: A suite of clinical LLMs.
Victoria YH Chua, Hexin Liu, Leibny Paola Garcia-Perera, Fei Ting Woon, Jinyi Wong, Xiangyu Zhang, Sanjeev Khudanpur, Andy WH Khong, Justin Dauwels, and Suzy J Styles. 2023. MERLIon CCS challenge: A English-Mandarin code-switching child-directed speech corpus for language identification and diarization. arXiv preprint arXiv:2305.18881.
Miguel Del Rio, Peter Ha, Quinten McNamara, Corey Miller, and Shipra Chandra. 2022. Earnings-22: A practical benchmark for accents in the wild. arXiv preprint arXiv:2203.15591.
Brecht Desplanques, Jenthe Thienpondt, and Kris Demuynck. 2020. ECAPA-TDNN: Emphasized channel attention, propagation and aggregation in TDNN based speaker verification. arXiv preprint arXiv:2005.07143.
George R Doddington, Mark A Przybocki, Alvin F Martin, and Douglas A Reynolds. 2000. The NIST speaker recognition evaluation – overview, methodology, systems, results, perspective. Speech Communication, 31(2-3):225–254.
Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. 2024. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783.
Seppo Enarvi, Marilisa Amoia, Miguel Del-Agua Teba, Brian Delaney, Frank Diehl, Stefan Hahn, Kristina Harris, Liam McGrath, Yue Pan, Joel Pinto, et al. 2020. Generating medical reports from patient-doctor conversations using sequence-to-sequence models. In Proceedings of the First Workshop on Natural Language Processing for Medical Conversations, pages 22–30.
Naome Etori, Ebasa Temesgen, and Maria Gini. 2023. What we know so far: Artificial intelligence in African healthcare. arXiv preprint arXiv:2305.18302.
Faiha Fareez, Tishya Parikh, Christopher Wavell, Saba Shahab, Meghan Chevalier, Scott Good, Isabella De Blasi, Rafik Rhouma, Christopher McMahon, Jean-Paul Lam, et al. 2022. A dataset of simulated patient-physician medical interviews with a focus on respiratory cases. Scientific Data, 9(1):313.
J Luke Galloway, Dominique Munroe, Pamela D Vohra-Khullar, Christopher Holland, Mary A Solis, Miranda A Moore, and Reema H Dbouk. 2024. Impact of an artificial intelligence-based solution on clinicians' clinical documentation experience: Initial findings using ambient listening technology. Journal of General Internal Medicine, pages 1–3.
Sanchit Gandhi, Patrick von Platen, and Alexander M Rush. 2023. Distil-Whisper: Robust knowledge distillation via large-scale pseudo labelling. arXiv preprint arXiv:2311.00430.
Martucci Giuseppe, Mauro Cettolo, Matteo Negri, and Marco Turchi. 2021. Lexical modeling of ASR errors for robust speech translation. In Proceedings of Interspeech 2021, pages 2282–2286.
Michael Hanna and Ondřej Bojar. 2021. A fine-grained analysis of BERTScore. In Proceedings of the Sixth Conference on Machine Translation, pages 507–517.
Eric Harper, Somshubra Majumdar, Oleksii Kuchaiev, Li Jason, Yang Zhang, E Bakhturina, V Noroozi, S Subramanian, K Nithin, H Jocelyn, et al. 2019. NeMo: A toolkit for conversational AI and large language models. [Computer software]. URL: https://github.com/NVIDIA/NeMo.
Arthur Hinsvark, Natalie Delworth, Miguel Del Rio, Quinten McNamara, Joshua Dong, Ryan Westerman, Michelle Huang, Joseph Palakapilly, Jennifer Drexler, Ilya Pirkin, et al. 2021. Accented speech recognition: A survey. arXiv preprint arXiv:2104.10747.
Praveen K Kanithi, Clément Christophe, Marco AF Pimentel, Tathagata Raha, Nada Saadi, Hamza Javed, Svetlana Maslenkova, Nasir Hayat, Ronnie Rajan, and Shadab Khan. 2024. MEDIC: Towards a comprehensive framework for evaluating LLMs in clinical applications. arXiv preprint arXiv:2409.07314.
Nithin Rao Koluguri, Taejin Park, and Boris Ginsburg. 2022. TitaNet: Neural model for speaker representation with 1D depth-wise separable convolutions and global context. In ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 8102–8106.
Alex Papadopoulos Korfiatis, Francesco Moramarco, Radmila Sarac, and Aleksandar Savkov. 2022. PriMock57: A dataset of primary care mock consultations. arXiv preprint arXiv:2204.00333.
Harold W Kuhn. 1955. The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2(1-2):83–97.
Federico Landini, Ján Profant, Mireia Diez, and Lukáš Burget. 2022. Bayesian HMM clustering of x-vector sequences (VBx) in speaker diarization: Theory, implementation and analysis on standard tasks. Computer Speech & Language, 71:101254.
Khai Le-Duc, Khai-Nguyen Nguyen, Long Vo-Dang, and Truong-Son Hy. 2024. Real-time speech summarization for medical conversations. arXiv preprint arXiv:2406.15888.
Yixin Liu, Alexander R Fabbri, Jiawen Chen, Yilun Zhao, Simeng Han, Shafiq Joty, Pengfei Liu, Dragomir Radev, Chien-Sheng Wu, and Arman Cohan. 2023. Benchmarking generation and evaluation capabilities of large language models for instruction controllable summarization. arXiv preprint arXiv:2311.09184.
Anirudh Mani, Shruti Palaskar, and Sandeep Konam. 2020. Towards understanding ASR error correction for medical conversations. In Proceedings of the First Workshop on Natural Language Processing for Medical Conversations, pages 7–11.
George Michalopoulos, Kyle Williams, Gagandeep Singh, and Thomas Lin. 2022. MedicalSum: A guided clinical abstractive summarization model for generating medical reports from patient-doctor conversations. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 4741–4749.
Hoang-Thi Nguyen-Vo, Huy Nguyen-Gia, Hoan-Duy Nguyen-Tran, Hoang Pham-Minh, Hung Vo-Thanh, and Hao Do-Duc. 2022. MarbleNet: A deep neural network solution for Vietnamese voice activity detection. In 2022 9th NAFOSTED Conference on Information and Computer Science (NICS), pages 154–158.
Sewade Ogun, Abraham T Owodunni, Tobi Olatunji, Eniola Alese, Babatunde Oladimeji, Tejumade Afonja, Kayode Olaleye, Naome A Etori, and Tosin Adewumi. 2024. 1000 African voices: Advancing inclusive multi-speaker multi-accent speech synthesis. arXiv preprint arXiv:2406.11727.
Tobi Olatunji, Tejumade Afonja, Bonaventure FP Dossou, Atnafu Lambebo Tonja, Chris Chinenye Emezue, Amina Mardiyyah Rufai, and Sahib Singh. 2023a. AfriNames: Most ASR models "butcher" African names. arXiv preprint arXiv:2306.00253.
Tobi Olatunji, Tejumade Afonja, Aditya Yadavalli, Chris Chinenye Emezue, Sahib Singh, Bonaventure FP Dossou, Joanne Osuchukwu, Salomey Osei, Atnafu Lambebo Tonja, Naome Etori, et al. 2023b. AfriSpeech-200: Pan-African accented speech dataset for clinical and general domain ASR. Transactions of the Association for Computational Linguistics, 11:1669–1685.
Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. 2015. Librispeech: An ASR corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5206–5210. IEEE.
Hannaneh B Pasandi and Haniyeh B Pasandi. 2022. Evaluation of ASR systems for conversational speech: A linguistic perspective. In Proceedings of the 20th ACM Conference on Embedded Networked Sensor Systems, pages 962–965.
Piotr Pęzik, Gosia Krawentek, Sylwia Karasińska, Paweł Wilk, Paulina Rybińska, Anna Cichosz, Angelika Peljak-Łapińska, Mikołaj Deckert, and Michał Adamczyk. 2022. DiaBiz – an annotated corpus of Polish call center dialogs. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 723–726.
Mirosław Płaza, Łukasz Pawlik, and Stanisław Deniziak. 2021. Call transcription methodology for contact center systems. IEEE Access, 9:110975–110988.
Vineel Pratap, Andros Tjandra, Bowen Shi, Paden Tomasello, Arun Babu, Sayani Kundu, Ali Elkahky, Zhaoheng Ni, Apoorv Vyas, Maryam Fazel-Zarandi, et al. 2024. Scaling speech technology to 1,000+ languages. Journal of Machine Learning Research, 25(97):1–52.
Krishna C Puvvada, Piotr Żelasko, He Huang, Oleksii Hrinchuk, Nithin Rao Koluguri, Kunal Dhawan, Somshubra Majumdar, Elena Rastorgueva, Zhehuai Chen, Vitaly Lavrukhin, et al. 2024. Less is more: Accurate speech recognition & translation without web-scale data. arXiv preprint arXiv:2406.19674.
Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. 2023. Robust speech recognition via large-scale weak supervision. In International Conference on Machine Learning, pages 28492–28518. PMLR.
Mirco Ravanelli and Yoshua Bengio. 2018. Speaker recognition from raw waveform with SincNet. In 2018 IEEE Spoken Language Technology Workshop (SLT), pages 1021–1028.
Neville Ryant, Kenneth Church, Christopher Cieri, Alejandrina Cristia, Jun Du, Sriram Ganapathy, and Mark Liberman. 2019. The second DIHARD diarization challenge: Dataset, task, and baselines. In Interspeech, pages 978–982.
Oleg Serikov, Ekaterina Voloshina, Anna Postnikova, Saliha Muradoğlu, Éric Le Ferrand, Elena Klyachko, Ekaterina Vylomova, Tatiana Shavrina, and Francis Tyers. 2024. Proceedings of the 3rd Workshop on NLP Applications to Field Linguistics (FieldMatters 2024).
Karan Singhal, Shekoofeh Azizi, Tao Tu, S Sara Mahdavi, Jason Wei, Hyung Won Chung, Nathan Scales, Ajay Tanwani, Heather Cole-Lewis, Stephen Pfohl, et al. 2023. Large language models encode clinical knowledge. Nature, 620(7972):172–180.
Gabriel Skantze. 2021. Turn-taking in conversational systems and human-robot interaction: a review. Computer Speech & Language, 67:101178.
Georgios P Spithourakis, Ivan Vulić, Michał Lis, Iñigo Casanueva, and Paweł Budzianowski. 2022. EVI: Multilingual spoken dialogue tasks and dataset for knowledge-based enrolment, verification, and identification. arXiv preprint arXiv:2204.13496.
Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, Léonard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ramé, et al. 2024. Gemma 2: Improving open language models at a practical size. arXiv preprint arXiv:2408.00118.
Mohamad-Hani Temsah, Amr Jamal, Khalid Alhasan, Abdulkarim A Temsah, and Khalid H Malki. 2024. OpenAI o1-preview vs. ChatGPT in healthcare: A new frontier in medical AI reasoning. Cureus, 16(10).
Atnafu Lambebo Tonja, Israel Abebe Azime, Tadesse Destaw Belay, Mesay Gemeda Yigezu, Moges Ahmed Mehamed, Abinew Ali Ayele, Ebrahim Chekol Jibril, Michael Melese Woldeyohannis, Olga Kolesnikova, Philipp Slusallek, et al. 2024. EthioLLM: Multilingual large language models for Ethiopian languages with task evaluation. arXiv preprint arXiv:2403.13737.
Changhan Wang, Morgane Riviere, Ann Lee, Anne Wu, Chaitanya Talnikar, Daniel Haziza, Mary Williamson, Juan Pino, and Emmanuel Dupoux. 2021. VoxPopuli: A large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 993–1003, Online. Association for Computational Linguistics.
Lucy Lu Wang, Yulia Otmakhova, Jay DeYoung, Thinh Hung Truong, Bailey E Kuehl, Erin Bransom, and Byron C Wallace. 2023. Automated metrics for medical multi-document summarization disagree with human evaluations. arXiv preprint arXiv:2305.13693.
Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45.
Emre Yılmaz, Astik Biswas, Ewald van der Westhuizen, Febe de Wet, and Thomas Niesler. 2018. Building a unified code-switching ASR system for South African languages. arXiv preprint arXiv:1807.10949.
Wen-wai Yim, Yujuan Fu, Asma Ben Abacha, Neal Snider, Thomas Lin, and Meliha Yetisgen. 2023. Aci-bench: A novel ambient clinical intelligence dataset for benchmarking automatic visit note generation. Scientific Data, 10(1):586.
Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q Weinberger, and Yoav Artzi. 2019. BERTScore: Evaluating text generation with BERT. arXiv preprint arXiv:1904.09675.
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. 2023. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. Advances in Neural Information Processing Systems, 36:46595–46623.
A Human Evaluation Criteria

This section provides details on the criteria used to evaluate the AI-generated summaries of doctor-patient conversations. Each summary was evaluated based on the following aspects:

1. How well does the summary include all positive clinical signs, symptoms, or details in the transcript?
2. How well does the summary include all negative clinical signs, symptoms, or details in the transcript?
3. How well does the summary recall the diagnosis from the transcript?
4. How accurately does the summary reflect the treatment plan in the transcript?
5. To what degree does the summary avoid false, incorrect, or fabricated information?
6. How concise, clear, and well-structured is the summary?
B Prompt Template for the LLM Evaluator

Table 8 outlines the prompt template utilized for the evaluation conducted by OpenAI's o1 model, serving as the LLM evaluator.
C Evaluation Metric for the LLM Evaluator

In the LLM evaluation prompt (see Section B), each criterion has a maximum score of 5 points. Since there are six criteria per row, the perfect score for a single row is 5 × 6 = 30 points. For a dataset with ten rows, the total possible score (perfect accuracy) is:

Perfect Accuracy = 5 × 6 × 10 = 300 points

To compute a model's percentage score for a dataset with N rows:

1. Sum the scores for all six criteria per row.
2. Sum the total scores across all rows.
3. Calculate the percentage score for the model as:

Percentage Score = (Model's Total Score / (5 × 6 × N)) × 100
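As a sanity check, the computation above can be expressed as a short Python helper (the function name and data layout are ours, not from the paper):

```python
# Each row holds the six criterion scores (0-5) the LLM judge returned
# for one summary; `rows` is the whole dataset.
def percentage_score(rows):
    max_per_row = 5 * 6                              # six criteria, 5 points each
    total = sum(sum(row) for row in rows)            # steps 1-2: sum within, then across rows
    return 100 * total / (max_per_row * len(rows))   # step 3: normalize to a percentage

# Ten rows of perfect scores hit the 300-point ceiling, i.e. 100%.
print(percentage_score([[5] * 6 for _ in range(10)]))  # → 100.0
```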
LLM as a Judge Prompt Template

You are a medical expert. Your task is to evaluate the accuracy and quality of an AI-generated summary of a doctor-patient conversation. Provide an impartial assessment based on the criteria below, rating each from 1 to 5 (1 = lowest, 5 = highest). If a specific criterion is not addressed, assign a score of 0.

Transcript: {transcript}

Summary: {summary}

Evaluation Criteria:

1. Completeness of positive clinical signs, symptoms, or relevant details: Does the summary capture all important positive findings?
2. Completeness of negative clinical signs, symptoms, or relevant details: Does the summary capture all important negative findings?
3. Accuracy of diagnosis: Does the summary correctly reflect the diagnosis?
4. Accuracy of treatment plan: Does the summary convey the treatment plan correctly?
5. Factual accuracy: Does the summary avoid false, incorrect, or fabricated information?
6. Clarity and structure: Is the summary concise, clear, and well-structured?

Please return the evaluation scores in JSON format:
{
  "Criterion1": score,
  "Criterion2": score,
  "Criterion3": score,
  "Criterion4": score,
  "Criterion5": score,
  "Criterion6": score
}
Table 8: Evaluation task description for assessing AI-generated summaries.
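For illustration, here is a minimal sketch of how the Table 8 template might be filled in and the judge's JSON reply parsed; the helper names and the example reply are ours, and the template string is abbreviated:

```python
import json

# Abbreviated stand-in for the Table 8 prompt; the full instructions and
# criteria from Table 8 would go here.
TEMPLATE = "You are a medical expert. ...\nTranscript: {transcript}\nSummary: {summary}"

def build_prompt(transcript, summary):
    # Fill the two placeholders exactly as in Table 8.
    return TEMPLATE.format(transcript=transcript, summary=summary)

def parse_scores(reply):
    """Extract the six criterion scores from the judge's JSON reply."""
    data = json.loads(reply)
    return [data[f"Criterion{i}"] for i in range(1, 7)]

reply = ('{"Criterion1": 5, "Criterion2": 4, "Criterion3": 5, '
         '"Criterion4": 5, "Criterion5": 3, "Criterion6": 4}')
print(parse_scores(reply))  # → [5, 4, 5, 5, 3, 4]
```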
D Prompt Template for Summarization

This figure outlines the prompt template utilized for the summarization task.

You are a medical expert; you will be provided a transcript that contains a conversation between a doctor and a patient. The task is for you to summarize the key points in the conversation. Transcript: {transcript}
E Transcript and Summary Example

This section presents a sample of a human-written transcript in Table 9 and the corresponding LLM-generated summaries for the transcript. While GPT4o captures key details such as pain severity, lifestyle factors, and medical history, its summary omits a vital detail: the patient's experience of rectal bleeding "during urination", which is crucial to the case. The summaries are:
GPT4o’sSummary: Dr.Philipconducted aconsultationwithMrs.Ayike,a45-year-old woman.Shereportedexperiencingrectalbleeding andabdominalpainforthepasttwoweeks.The pain,ratedasa6outof10onthepainscale, islocalizedinthelowerabdominalregionand worsenswithpressure.Mrs.Ayikealsonoted alossofappetiteandfatiguesincetheonset ofhersymptoms.Shehasnotsoughtmedical attentionpriortothisvisitandhasnohistoryof surgery.Sheneithersmokesnordrinksalcohol. AlthoughMrs.Ayikedoesnottravel,sherevealed afamilyhistoryofcolorectalcancerthrough hermother.Herusualdietincludes"swallow," andshehasnofoodallergies.Mrs.Ayikeis marriedwithfourchildren,butcurrentlylives separatelyfromherhusband.Dr.Philipconcluded thesessionbyacknowledgingMrs.Ayike’sdetails.
Claude’sSummary: Basedontheconversation, thekeypointsare:1.Thepatient,Mrs.Ayike, isa45-year-oldwomanpresentingwithrectal bleedingandabdominalpainforthepasttwo weeks.2.Therectalbleedingoccurswhenshe urinates,anditisnotmixedwithstool.3.The abdominalpainisratedas6outof10inseverity, anditislocalizedinthelowerabdominalregion.4. Shehaslostherappetiteandfeelstiredsincethe symptomsstarted.5.Shehasnottraveledrecently andisnotafrequenttraveler.6.Sheismarriedbut currentlyseparatedfromherhusband,livingina differentstate.n7.Shehasfourchildren.8.Shehas notundergoneanysurgerybefore.9.Shedoes
notsmokeorconsumealcohol.10.Hermother hadcolorectalcancerinthepast,indicatinga familyhistoryofthecondition.11.Hertypical dietincludesswallows(atypeoffood),and sheisnotallergictoanyfood.Thedoctorhas gatheredessentialinformationaboutthepatient’s symptoms,medicalhistory,andlifestyletoaidin thediagnosisandtreatmentprocess.
Gemma’sSummary: Mrs.Ayike,a45-yearoldwoman,presentedtoDr.Philipcomplainingof twoweeksofabdominalpainandrectalbleeding. Keypoints:Symptoms:Lowerabdominalpain rated6/10,worseningwithpressure.Rectal bleedingduringurination,notmixedwithstool. Lossofappetiteandfatigue.MedicalHistory: Noprevioussurgeries.Nosmokingoralcohol consumption.Familyhistoryofcolorectalcancer (mother).Lifestyle:Prefersadiethighinswallow. Noknownfoodallergies.Notatraveler.Separated fromherhusband,livingwithfourchildren.Next steps:Basedonthesymptomsandfamilyhistory, Dr.Philipwilllikelyorderfurtherinvestigations todeterminethecauseofMrs.Ayike’sabdominal painandrectalbleeding.Thismayincludeblood tests,stooltests,imagingstudies(suchasa colonoscopyorCTscan),andpossiblyabiopsy.
Med42-8B’ssummary: Mrs.Ayike,a 45-year-oldfemale,presentswithrectalbleeding andlowerabdominalpain,rated6/10inseverity, whichstartedtwoweeksago.Shereportsseeing redbloodinherurinebutnotmixedwithstool. Shehasnotsoughtmedicalattentionpreviously forthesesymptoms.Shedescribesherappetite aspoorandfeelstired.Shedeniesanyrecent travel,smoking,oralcoholuse.Sheismarried withfourchildrenbutiscurrentlyseparatedfrom herhusband.Hermotherhadcolorectalcancer. Sheconsumesadietprimarilyconsistingoffufu (swallow).Therearenoknownfoodallergies.
GPT3.5’ssummary: Mrs.Aike,a45-year-old patient,visitedDr.Philipduetorectalbleeding andabdominalpainthatstartedtwoweeksago. Theabdominalpainissevere,ratedata6on ascaleof1to3,andislocalizedinthelower abdominalregion.Shehasalsoexperienceda lossofappetite,fatigue,anddifficultyeating sincethepainstarted.Mrs.Aikeismarriedwith fourchildren,notlivingwithherhusband,and
Transcript
Doctor: Goodafternoon.MynameisDr.Philip,andwhat’syourname?
Patient: MynameisMrs.Ayike.
Doctor: Okay,howoldareyou?
Patient: Iam45yearsold.
Doctor: Okay.You’rewelcome,mydear.Whatbroughtyouhere?
Patient: Ihaverectalbleedingandabdominalpain.
Doctor: Canyoutellmemoreaboutit?
Patient: Thepainstartedtwoweeksago,andtherectalbleedinghappenswhenIurinate.Thebloodis notmixedwithstool.
Doctor: Haveyougoneforanycheck-ups?
Patient: No,thisismyfirsttimecominghere.
Doctor: Howsevereistheabdominalpain,fromascaleof1to10?
Patient: 6.
Doctor: Whichsideoftheabdominalregiondoyoufeelthepain?
Patient: Ifeelpaininmylowerabdominalregion,anditworsenswithpressure.Andsincethelast twoweeks,asI’vestartedtheabdominalpain,Ihavenotbeenabletoeatwell.Ilostappetite.Ifeel tired.
Doctor: Haveyoutraveledrecently?
Patient: No,Idon’ttravel.
Doctor: Areyoumarried?
Patient: Yes,Iammarried,withfourchildren,butmyhusbandlivesinanotherstate.
Doctor: Haveyouundergoneanysurgerybefore?
Patient: No.
Doctor: Doyousmokeordrinkalcohol?
Patient: No.
Doctor: Doesanyoneinyourfamilyhavesimilarissues?
Patient: Yes,mymotherhadcolorectalcancer.
Doctor: Whatfoodsdoyounormallyeat?
Patient: Iliketakingswallow.Ilikeswallowalot.
Doctor: Areyouallergictoanyfood?
Patient: No.
Table9:Snippetofadoctor-patientconversation(transcript)writtenbyaclinician hasnohistoryofsurgeries,smoking,oralcohol consumption.Hermotherhadcolorectalcancer before,indicatingafamilyhistoryofsimilarissues. Shepreferseatingswallowandisnotallergicto anyfood.Thisisherfirstvisittoseekmedical helpforthesesymptoms.
Meditron3-8B’ssummary: Thepatient,Mrs. Ayike,isa45-year-oldwomanwhohasbeen experiencingrectalbleedingandabdominalpain fortwoweeks.Shehasnotundergoneanysurgery beforeanddoesnotsmokeortakealcohol.Her motherhadcolorectalcancerbefore.Shehaslost herappetiteandfeelstired.Thedoctorasksher aboutherfamilyhistory,foodallergies,andthe
severityofhersymptoms.Thedoctoralsoasks abouthermaritalstatusandchildren.Thepatient’s symptomsaresimilartothoseofcolorectalcancer, whichhermotherhad.
Llama-3.1-8B-Instruct’ssummary: Basedon theprovidedtranscript,herearethekeypoints fromtheconversationbetweenDr.PhilipandMrs. Ayike:1.Patient’sIntroduction:Mrs.Ayike,a 45-year-oldwoman,presentstothedoctorwith rectalbleedingandabdominalpain.2.Symptoms: Theabdominalpainstartedtwoweeksago,and sheexperiencesrectalbleedingwhenurinating, butnotmixedwithstool.Thepainislocatedin thelowerabdominalregionandworsenswith
pressure.3.MedicalHistory:Mrs.Ayikehas notundergoneanysurgerybeforeanddoesnot haveahistoryofsmokingoralcoholconsumption. 4.FamilyHistory:Hermotherhadcolorectal cancerbefore.5.DietandAllergies:Mrs.Ayike likeseatingswallow(atypeoffood)andisnot allergictoanyfood.6.Lifestyle:Sheismarried withfourchildrenbutiscurrentlyseparatedfrom herhusband.Shedoesnottravelandhasnot beenabletoeatwellorfeelenergeticduetoher symptoms.7.CurrentCondition:Mrs.Ayike ratesherabdominalpainasa6outof10andhas lostappetiteandfeelstiredsincetheonsetof symptoms.Basedontheinformationprovided,Dr. Philipwilllikelyperformathoroughexamination andpossiblyorderdiagnosticteststodetermine thecauseofMrs.Ayike’ssymptoms.Givenher familyhistoryofcolorectalcancer,acolonoscopy mayberecommendedtoruleoutanypotential malignancies.
Llama-3.2-3B-Instruct’ssummary: Here’s asummaryofthekeypointsintheconversation betweenDr.PhilipandMrs.Ayike:Patient’s Symptoms:Rectalbleeding(notmixedwith stool),Abdominalpain(rated6outof10)inthe lowerabdominalregion,Lossofappetiteand tirednessforthepasttwoweeks.MedicalHistory: Noprevioussurgeries,Nosmokingoralcohol consumption,NoknownallergiestofoodFamily History:Motherhadcolorectalcancerbefore. LifestyleandHabits:Marriedwithfourchildren, butnotlivingwithherhusband.Nohistory oftraveling.Prefersswallowingfood.Other Information:Mrs.Ayikeis45yearsold.Shehas notbeenabletoeatwellduetotheabdominal pain.Sheisseekingmedicalattentionforthefirst time,asthesymptomsstartedtwoweeksago.
Phi3-mini-instruct’ssummary: Dr.Philip,a medicalexpert,isspeakingwithMrs.Ayike,a45year-oldwomanwhohasbeenexperiencingrectal bleedingandabdominalpainfortwoweeks.Mrs. Ayikehasnothadanycheck-upsforhersymptoms andhaslostherappetite,feelingtired.Shehasnot traveledrecentlyandisnotasmoker.Shedoesnot consumealcoholandhasnoknownfoodallergies. Mrs.Ayike’smotherhadcolorectalcancer.Dr. Philipasksaboutherdiet,whichmainlyconsists ofswallow,andinquiresaboutanyfamilyhistory ofsimilarissues.
F Overlapping Speech Example

In this example, overlapping speech occurs between Speaker 1 and Speaker 2. Speaker 2's "No" overlaps with Speaker 1's utterance starting with "Ok, thank you very much." Both utterances are fully captured with precise timestamps to ensure accurate representation of the conversation. This example illustrates how overlaps are handled in the transcription process for the dataset.
G ASR and Summarization Result Graphs

Figure 3: Comparison of Medical and General WER for Different Models
Timestamp            Speaker    Utterance
02:40.98–02:41.99    Speaker 1  Past hospital admission?
02:42.97–02:43.02    Speaker 2  No.
02:43.00–02:45.99    Speaker 1  Ok, thank you very much. Do you live alone?

Table 10: Example of overlapping speech with accurate timestamps and speaker annotations.
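The overlap in Table 10 can be found mechanically from the timestamps. A small sketch (our own illustration, not the paper's tooling), parsing the "MM:SS.ss" format used above:

```python
def to_seconds(ts):
    # "02:40.98" -> 160.98
    minutes, seconds = ts.split(":")
    return int(minutes) * 60 + float(seconds)

def overlaps(a, b):
    # Two intervals overlap when the later start precedes the earlier end.
    return max(a[0], b[0]) < min(a[1], b[1])

segments = [
    ("Speaker 1", (to_seconds("02:40.98"), to_seconds("02:41.99"))),
    ("Speaker 2", (to_seconds("02:42.97"), to_seconds("02:43.02"))),
    ("Speaker 1", (to_seconds("02:43.00"), to_seconds("02:45.99"))),
]

for i, (spk_a, seg_a) in enumerate(segments):
    for spk_b, seg_b in segments[i + 1:]:
        if overlaps(seg_a, seg_b):
            # Reports the 0.02 s overlap around 02:43 from Table 10.
            print(f"{spk_a} overlaps {spk_b}")
```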
Figure 5: WER, MedWER, and Non-MedWER for Various Models