Afrispeech-Dialog: A Benchmark Dataset for Spontaneous English Conversations in Healthcare and Beyond

Mardhiyah Sanni1,2, Tassallah Abdullahi2,3, Devendra D. Kayande1,2,4, Emmanuel Ayodele1, Naome A. Etori2,5, Michael S. Mollel2,6, Moshood Yekini2, Chibuzor Okocha2,7, Lukman E. Ismaila2,8, Folafunmi Omofoye2,9, Boluwatife A. Adewale2, Tobi Olatunji1,2,10♠
1Intron, 2BioRAMP, 3Brown University, 4Indian Institute of Information Technology Allahabad, 5University of Minnesota-Twin Cities, 6University of Glasgow, 7University of Florida, 8Johns Hopkins University, 9University of North Carolina at Chapel Hill, 10Georgia Institute of Technology
tobi@intron.io

Abstract

Speech technologies are transforming interactions across various sectors, from healthcare to call centers and robots, yet their performance on African-accented conversations remains underexplored. We introduce Afrispeech-Dialog, a benchmark dataset of 50 simulated medical and non-medical African-accented English conversations, designed to evaluate automatic speech recognition (ASR) and related technologies. We assess state-of-the-art (SOTA) speaker diarization and ASR systems on long-form, accented speech, comparing their performance with native accents, and discover a 10%+ performance degradation. Additionally, we explore the medical conversation summarization capabilities of large language models (LLMs) to demonstrate the impact of ASR errors on downstream medical summaries, providing insights into the challenges and opportunities for speech technologies in the Global South. Our work highlights the need for more inclusive datasets to advance conversational AI in low-resource settings.

1 Introduction

Deep learning approaches have revolutionized and yielded significant performance gains across several natural language tasks (Bharadiya, 2023), especially for high-resource languages (Radford et al., 2023). While conversational speech recognition has continued to make significant strides in task automation in different domains such as medicine (Biswas and Talukdar, 2024; Abacha et al., 2023; Yim et al., 2023), voice assistants (Pasandi and Pasandi, 2022; Mani et al., 2020), call centers (Płaza et al., 2021), and robotics (Skantze, 2021), much of the research in Automatic Speech Recognition

1Authors marked with ♠ are senior authors.

(ASR) has focused on monolingual speech with native accents (Aksënova et al., 2022), with considerable performance gaps in diverse linguistic, low-resource, and accented contexts (Radford et al., 2023; Olatunji et al., 2023a).

In Anglophone African contexts, regional accents further complicate the development of ASR systems, resulting in poor generalization from state-of-the-art (SOTA) models (Hinsvark et al., 2021) in both general (Adelani et al., 2022; Olatunji et al., 2023a; Ogun et al., 2024; Tonja et al., 2024) and medical (Afonja et al., 2024; Olatunji et al., 2023b) contexts. In more developed countries, performant speech recognition models are particularly useful in overburdened healthcare systems, where they can help to reduce the documentation workload for overwhelmed clinicians (Afonja et al., 2024; Olatunji et al., 2023b).

Recent advancements in medical conversation transcription and summarization using LLMs have led to wide adoption of these technologies in hospitals in developed countries (Michalopoulos et al., 2022; Yim et al., 2023), helping to improve clinical documentation quality and productivity (Galloway et al., 2024). Given the heavy patient loads (Baye et al., 2020; Etori et al., 2023) as well as the recent integration of AI/NLP/ASR systems into clinical workflows in African clinics1,2, performant ambient transcription and summarization systems are highly desired. However, the generalizability of such systems to African contexts remains underexplored (Afonja et al., 2024).

To tackle this problem, our contributions are as follows:

1https://techcabal.com/2023/05/29/intron-health-bringsai-to-african-healthcare/

2https://techtrendske.co.ke/2024/09/23/google-supportsjacaranda-healths-use-of-gen-ai-for-mothers-and-babies/

1. AfriSpeech-Dialog, a novel 7-hr dataset of 50 long-form simulated African-accented English medical and non-medical conversations

2. A benchmark of the performance of SOTA speaker diarization, speech recognition, and LLM-based summarization on African-accented conversations

We lay the groundwork for more inclusive ASR and NLP technologies in healthcare and beyond.

2 Related Work

2.1 ASR in Medical Conversations and Summarization

The role of ASR in medical documentation has grown significantly, particularly in telehealth and in-person patient-physician consultations (Korfiatis et al., 2022; Galloway et al., 2024; Michalopoulos et al., 2022; Yim et al., 2023). Accurate ASR in medical dialogue is critical, as transcription errors can lead to incorrect medical records. Several datasets have been developed to facilitate the study of ASR in medical contexts. PriMock57 (Korfiatis et al., 2022) provides primary care mock consultations in European English accents; Fareez et al. (2022) evaluate medical ASR on simulated patient-physician medical interviews with a focus on respiratory cases; Enarvi et al. (2020) experiment with automatically generating medical reports from diarized doctor-patient surgery conversations. Le-Duc et al. (2024) proposed a real-time speech summarization system for medical conversations with the ability to generate summaries for every (local) and end-of (global) utterance, eliminating the need for continuous update and revision of the summary state.

These datasets primarily focus on non-African accents and therefore do not account for the challenges specific to African-accented medical speech.

2.2 Non-medical Conversational ASR

Pęzik et al. (2022) released DiaBiz, an annotated corpus of over 400 hrs of Polish call center dialogs. Other conversational, parliamentary, or oratory datasets like AMI (Carletta et al., 2005), Earnings-22 (Del Rio et al., 2022), and VoxPopuli (Wang et al., 2021) have gained popularity on public ASR benchmarks. Conversational ASR has also been explored in other domains such as call centers (Płaza et al., 2021) and robotics (Skantze, 2021). However, these datasets lack representation of African-accented speech.

2.3 African-Accented ASR

There has been growing interest in developing ASR systems that cater to African languages; for example, Yılmaz et al. (2018) developed a multilingual ASR system for code-switched South African speech. Multilingual ASR resources, such as the EVI dataset (Spithourakis et al., 2022), offer a strong foundation for developing similar models in African contexts where data scarcity hinders progress. Olatunji et al. (2023b) released a pan-African accented English dataset for medical and general ASR. While these datasets focus on single-speaker speech recognition, AfriSpeech-Dialog is the first African-accented English conversational dataset spanning medical and non-medical domains, enabling additional tasks like diarization and summarization.

2.4 Speaker Diarization in Multi-Speaker Conversations

To increase the efficiency of NLP/ASR systems, substantial research has gone into integrating speaker diarization (SD) into the pipeline. Serikov et al. (2024) provide a comparative analysis of SD models (Pyannote (Bredin et al., 2020), CLEVER3, and NVIDIA NeMo (Harper et al., 2019)) on 20 different German dialects for the diarization and identification (DI) task. NVIDIA NeMo performs slightly better, with competitive performance due to its multiscale segmentation for identifying and removing shorter segments. On a similar DI task, Chua et al. (2023) also benchmarked the performance of multilingual ASR models in open and closed tracks on the challenging MERLIon CCS English-Mandarin dataset, extracted from spontaneous, code-switching parent-child conversational speech. However, SD for African-accented conversations remains underexplored. Benchmarking SOTA SD models on AfriSpeech-Dialog reveals their limitations in this setting.

3 Methodology

Figure 1 shows an overview of the dataset creation and benchmarking process, illustrating how AfriSpeech-Dialog supports tasks like speaker diarization, ASR, and summarization. Below, we describe the dataset creation process and the evaluation of state-of-the-art (SOTA) models on these tasks to demonstrate their applicability and highlight challenges in African-accented conversational speech.

3https://www.oxfordwaveresearch.com/products/cleaver/

Figure 1: AfriSpeech-Dialog: Dataset and Benchmarking Pipeline

3.1 Dataset Statistics

Here, we outline our dataset creation process.

Num. of Countries    13
Num. of Accents      68
Gender (M, F)        (14, 26) / (25, 33)

Table 1: Statistics of the medical and non-medical datasets.

3.1.1 Collecting Conversations

We recorded simulated virtual and in-person medical and non-medical conversations from African medical and non-medical crowdworkers on the Intron Platform4, similar to the process described in Olatunji et al. (2023b). Each conversation began with speakers providing consent, and any identifiable information in the consent segment was removed.

For medical conversations, following the process described in Fareez et al. (2022) and Korfiatis et al. (2022), clinical experts prepared "patient cards" with African-context disease conditions and demographics. Doctor and patient actors included medical professionals (e.g., doctors, nurses) familiar with Objective Structured Clinical Examinations (OSCE), a widely used assessment in medical education that simulates doctor-patient interactions (Fareez et al., 2022). Each patient actor was provided with a detailed "patient card" that included information on their condition, demographics, and medical history, as shown in Table 2. Consistent with the OSCE format, patient cards were hidden from doctor actors to facilitate a more natural consultation.

4https://speech.intron.health

For general domain conversations, participants engaged in open discussions based on "topic cards" prepared by a team of reviewers. Each card contained a conversation topic, a brief description, and two discussion prompts to guide the conversation. The pair of participants (actors) had prior access to the cards and were advised to read through and understand them before starting the conversation. Table 3 shows a sample topic card.

The conversation recordings were stored as mono-channel, 16-bit wav files, with a 48 kHz sampling rate. A team of clinician reviewers reviewed the conversations and selected a high-quality subset for this release. The dataset was collected across three African countries: Nigeria, Kenya, and South Africa. The speakers represent a diverse range of accents (11 in total): Hausa, Isoko, Idoma, Urhobo, Ijaw, Yoruba, Swahili, Sesotho, Igbo, Igala, and Ebira.
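The release format above can be checked programmatically; below is a minimal sketch using Python's standard wave module (the file name and one-second silent clip are illustrative, not part of the dataset):

```python
import wave

def write_clip(path, seconds=1, rate=48_000):
    # Write a silent clip in the release format described above:
    # mono-channel, 16-bit PCM, 48 kHz sampling rate.
    with wave.open(path, "wb") as f:
        f.setnchannels(1)      # mono
        f.setsampwidth(2)      # 16-bit = 2 bytes per sample
        f.setframerate(rate)   # 48 kHz
        f.writeframes(b"\x00\x00" * rate * seconds)

def read_format(path):
    # Return (channels, bits per sample, sampling rate) from the wav header.
    with wave.open(path, "rb") as f:
        return f.getnchannels(), 8 * f.getsampwidth(), f.getframerate()
```

A quick round-trip (write then read back) confirms the header matches the stated format.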

3.1.2 Recording Characteristics

Dataset statistics are summarized in Table 1. The dataset features two speakers in each conversation

Condition: Malaria

Demographic (Age, Gender): 32-year-old Female

Presenting Complaint: Fever and chills (2 days)

History of Presenting Complaint:

• Fever for 2 days (high grade, not relieved by medication)

• Chills for 2 days (intermittent, severe)

• Headache for 2 days (generalized, throbbing, 7/10 in severity)

• Fatigue and general body weakness

• No cough, diarrhea, vomiting, or urinary symptoms

• Patient lives in malaria-endemic area; no recent travel history

Past Medical History (PMH): No chronic disease or surgery

Family History: No family history of similar illness

Social History: Does not drink alcohol or smoke

Allergy History: No known allergies

Table 2: Example Patient Card for Medical Conversations.

Topic: Cyberbullying

Overview: Cyberbullying is a form of bullying that takes place online or through electronic communication. It involves using technology (social media, text messages, online forums) to intimidate or humiliate someone. Examples include:

• Spreading rumors

• Sending hurtful messages

• Sharing embarrassing information without consent

Discussion Prompts:

• What steps do you take to protect yourself from cyberbullying?

• Do you think social media has effective policies in place, or could they improve?

Table 3: Sample Conversation Card for General Conversations.

in both the medical and general domains. Medical conversations were more structured, with doctors asking direct questions and patients responding, resulting in a more formal exchange. General conversations were more relaxed, with spontaneous discussions on various topics. Overlapping speech occurs occasionally but is usually brief, involving short interjections like "yes" or "okay." Disfluencies and some code-switching reflect the natural flow of African English speakers. General conversations have fewer speaker turns but a higher average word count compared to medical ones, as speakers tend to talk longer to express their thoughts. These characteristics make the dataset valuable for testing speaker diarization and transcription models on African-accented speech.

3.1.3 Transcription Process

All conversations were manually transcribed by five professional annotators selected from top-performing contributors on the Intron Platform. Annotators were instructed to annotate speaker turns and insert timestamps for each turn, with all annotators required to be familiar with medical terminology. To ensure quality, clinician reviewers evaluated a random 20% of each annotator's work for accuracy, with at least a 90% correctness threshold required for inclusion. Contributors and annotators were paid $10–$20 per hour depending on task complexity and clinical experience. The dataset is released under a CC-BY-NC-SA 4.0 License.

3.2 Speaker Diarization

To benchmark diarization performance on this dataset, we selected three recent high-performing neural diarization models:

• Pyannote (Bredin, 2023): This model leverages a pre-trained neural network that computes x-vector embeddings for speaker diarization. It uses an ECAPA-TDNN (Desplanques et al., 2020) architecture for speaker embeddings, shown to improve speaker separation in diarization tasks.

• Reverb diarization v2 (Bhandari et al., 2024): This model is an extension of Pyannote 3.0, fine-tuned on 26,000 hours of annotated speech. However, this model uses WavLM (Chen et al., 2022) instead of the SincNet (Ravanelli and Bengio, 2018) features in the Pyannote 3.0 base model.

• Titanet: This Nvidia diarization pipeline uses MarbleNet (Nguyen-Vo et al., 2022) for voice activity detection and Titanet-L (Koluguri et al., 2022) for embedding extraction. Titanet uses 1D depth-wise separable convolutions with Squeeze-and-Excitation (SE) layers with global context, followed by a channel-attention-based statistics pooling layer to map variable-length utterances to fixed-length embeddings (t-vectors) (Koluguri et al., 2022).

3.2.1 Evaluation Metrics

We evaluated diarization performance using the Diarization Error Rate (DER) (Doddington et al., 2000). DER quantifies the percentage of time that speakers are misattributed or missed in the diarization output.

We used the pyannote.metrics library5 to calculate DER for each recording and computed the absolute DER for the entire dataset. The optimal mapping between the reference and hypothesis labels was obtained using the Hungarian algorithm (Kuhn, 1955), ensuring an accurate alignment. We report DER on medical and general domain conversations.
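For intuition only, the following simplified frame-level version of DER brute-forces the optimal speaker-label mapping; the evaluation above uses pyannote.metrics on continuous segments, with the Hungarian algorithm replacing the brute force:

```python
from itertools import permutations

def frame_der(reference, hypothesis):
    # Toy frame-level DER: labels are per-frame speaker ids, None = non-speech.
    # DER = (missed + false alarm + confusion) / total reference speech frames.
    total = sum(1 for r in reference if r is not None)
    ref_labels = sorted({r for r in reference if r is not None})
    hyp_labels = sorted({h for h in hypothesis if h is not None})
    best = None
    # Try every assignment of hypothesis labels to reference labels and
    # keep the one with the fewest errors (what the Hungarian algorithm
    # computes efficiently for many speakers).
    for perm in permutations(hyp_labels):
        mapping = dict(zip(perm, ref_labels))
        errors = 0
        for r, h in zip(reference, hypothesis):
            if r is None and h is not None:
                errors += 1                     # false alarm
            elif r is not None and h is None:
                errors += 1                     # missed speech
            elif r is not None and mapping.get(h) != r:
                errors += 1                     # speaker confusion
        best = errors if best is None else min(best, errors)
    return best / total
```

With a collar of 0.0 (as in Section 4.1), every frame counts; a nonzero collar would exclude frames near speaker transitions.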

3.3 Automatic Speech Recognition (ASR)

To benchmark ASR performance, we compare SOTA open-source pretrained models: Whisper (Radford et al., 2023), Distil-Whisper (Gandhi et al., 2023), Nvidia Parakeet (Harper et al., 2019), Canary (Puvvada et al., 2024), MMS (Pratap et al., 2024), and Wav2vec2 (Baevski et al., 2020).

5https://pyannote.github.io/pyannote-metrics/

Open-source models offer transparency and reproducibility. They are often trained on diverse, real-world speech data and provide competitive baselines for dialog-specific challenges such as handling speaker variation, spontaneous speech, and overlapping speakers.

3.3.1 Preprocessing

The original transcripts contained timestamps and speaker tags. We removed these items from the text as they are unnecessary for the ASR task. Long-form audio recordings exceeded the context length of most ASR models; they were therefore chunked into 30-second segments for inference, and the returned transcript segments were concatenated.
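The chunk-and-concatenate step can be sketched as follows, assuming audio arrives as a flat sample array and a hypothetical transcribe callback handles one chunk at a time:

```python
def chunk_audio(samples, rate, chunk_seconds=30):
    # Split a flat array of samples into fixed-duration chunks
    # (the last chunk may be shorter).
    step = rate * chunk_seconds
    return [samples[i:i + step] for i in range(0, len(samples), step)]

def transcribe_long_form(samples, rate, transcribe):
    # Run ASR on each chunk independently and concatenate the
    # partial transcripts in order.
    pieces = [transcribe(chunk) for chunk in chunk_audio(samples, rate)]
    return " ".join(p.strip() for p in pieces if p.strip())
```

One known limitation of this design is that words straddling a chunk boundary can be cut; overlap-and-merge strategies mitigate this but are not used here.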

3.3.2 Evaluation Metrics

ASR performance was evaluated using Word Error Rate (WER). WER measures the total number of insertions, deletions, and substitutions in the predicted text with respect to the total number of words in the reference text.
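A minimal WER implementation via word-level Levenshtein distance, consistent with the definition above:

```python
def wer(reference, hypothesis):
    # Word-level edit distance (substitutions + insertions + deletions)
    # divided by the number of reference words.
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # all deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # all insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)
```

Note that WER can exceed 1.0 when the hypothesis contains many insertions, which is why some of the weaker models in Table 5 approach very high values.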

3.4 Medical Conversation Summarization

Nine LLMs (large, small, open, closed, general, and biomedical) were benchmarked for summarizing doctor-patient dialogues. Each LLM was presented with a human conversation transcript and was prompted (Appendix Section D) to generate a detailed summary.

Closed-source general LLMs: OpenAI GPT-4o and GPT-3.5-turbo (Achiam et al., 2023), and Anthropic Claude-3-Sonnet (Anthropic, 2023) represent leading general-purpose commercial LLMs.

Open-source small general LLMs: Meta-Llama-3.1-8B-Instruct, Meta-Llama-3.2-3B-Instruct (Dubey et al., 2024), Microsoft-Phi-3-mini-4k-instruct (Abdin et al., 2024), and Google-Gemma-2-9b-it (Team et al., 2024), selected for their instruction-following abilities or multilingual support, which is essential for code-switching.

Biomedical open-source LLMs: m42-health-Llama3-Med42-8B (Christophe et al., 2024) and johnsnowlabs-JSL-MedLlama-3-8B-v2.0, selected for their biomedical adaptation. Examples of generated summaries are provided in Appendix Section E.

3.4.1 Quantitative Evaluation

We used BERTScore (Zhang et al., 2019) to evaluate the quality of the LLM-generated summaries against the expert-generated reference summaries. Although BERTScore is widely used, studies like Hanna and Bojar (2021) have shown its limitations, particularly in capturing fine-grained semantic nuances and penalizing stylistic differences.
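For intuition, BERTScore's greedy-matching precision/recall/F1 can be sketched on toy vectors; real BERTScore uses contextual BERT embeddings, so the hand-made vectors here are purely illustrative:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def bertscore_f1(cand_vecs, ref_vecs):
    # Precision: each candidate token greedily matches its most similar
    # reference token; recall mirrors this from the reference side.
    p = sum(max(cosine(c, r) for r in ref_vecs) for c in cand_vecs) / len(cand_vecs)
    r = sum(max(cosine(rv, c) for c in cand_vecs) for rv in ref_vecs) / len(ref_vecs)
    return 2 * p * r / (p + r)
```

Because matching is by embedding similarity rather than exact words, a paraphrased summary can still score highly, which is also why BERTScore alone cannot detect factual omissions, motivating the qualitative evaluation below.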

3.4.2 Qualitative Evaluation

To address the limitations of BERTScore, we complement it with two qualitative evaluation approaches (Human and LLM-as-Judge), where summaries were evaluated on a 5-point scale adapted from Zheng et al. (2023) and Liu et al. (2023), from 1 (worst) to 5 (best), on the following six criteria: recall of the diagnosis, accuracy of the treatment plan, avoidance of false or fabricated information, clarity and structure, and inclusion of important positive and negative clinical details. If any criterion (e.g., treatment plan) was absent from the conversation transcript, a score of 0 was to be assigned for that criterion. Detailed evaluation criteria are in Appendix Section A.

LLM-as-Judge: Consistent with the growing trend in recent studies (Zheng et al., 2023; Liu et al., 2023), we used generative models for automated summary evaluation. We used OpenAI's "o1" model (Temsah et al., 2024), prompting based on the criteria mentioned above. Detailed verbatim prompts and the method for computing accuracy scores are provided in Appendix Sections B and C.

Human Evaluation: In a blind study, we randomly presented pairs of human transcripts and LLM- or human-generated summaries to a team of 4 clinical experts. The experts compared the information available from the summaries to the conversation transcripts using the 6 criteria listed above (Kanithi et al., 2024; Singhal et al., 2023; Wang et al., 2023). Each summary was independently rated by 2 experts.

3.4.3 Error Propagation in Cascading Models

Since real-world conversation summarization systems rely on imperfect ASR transcripts, and accented medical ASR transcription is challenging for several ASR systems (Afonja et al., 2024), we further evaluated summaries generated from predicted (machine) transcripts to determine whether there was a drop in quality when compared with summaries generated from human transcripts (Giuseppe et al., 2021). We measured summary quality using LLM-as-Judge.

4 Experiments

4.1 Diarization

We download and run inference using publicly released checkpoints from HuggingFace (Wolf et al., 2020), with default hyperparameters. We set the collar to 0.0, meaning no margin was allowed around speaker transitions, ensuring that even short overlaps (e.g., "yes" or "okay") were evaluated directly without any tolerance. Overlapping speech was also not excluded from the evaluation.

We ran inference on a single Nvidia T4 GPU. Inference for Pyannote and Reverb took approximately 2 hours each, while Titanet took about 30 minutes. Results represent single runs.

4.2 ASR

Models were downloaded from publicly available HuggingFace (Wolf et al., 2020) checkpoints with default hyperparameters, and the default generation configuration was used. We ran inference on Nvidia T4 GPUs. Inference required an average of around 30 minutes for the whole dataset for the open-source models. Results represent single runs.

4.3 Summarization

For open-source LLMs, we used publicly available checkpoints from HuggingFace (Wolf et al., 2020) without altering their default hyperparameters, except for setting max_new_tokens to 1024. Closed-source models were accessed via their respective APIs, also using default hyperparameters. The prompt template was adapted from prior work (Zheng et al., 2023; Liu et al., 2023), and to ensure consistency, the same prompt was used across all models (details can be found in the Appendix).

We conducted the summarization experiments under two scenarios: (1) generating summaries from human-produced transcripts and (2) generating summaries from transcripts created by our best-performing ASR model (Whisper-large-v3).

5 Results and Discussion

5.1 Diarization

We compute DER separately for a subset of conversations with accurate timestamps: medical (9 samples, Med DER) and general (21 samples, Gen DER) domain conversations. The results are shown in Table 4 and Figure 2. We also show the performance of these models on conversational datasets in other accents, using the values reported in (Bredin, 2023; Koluguri et al., 2022; Landini et al., 2022).

Model    DER (%)    Med DER (%)    Gen DER (%)    AMI DER (%)    DIHARD DER (%)

Table 4: Diarization Error Rate (DER) for all 30 audios, with detailed results for the Medical (Med. DER) and General (Gen. DER) subsets. The AMI DER and DIHARD DER columns show performance on the AMI MixHeadset (Carletta et al., 2005) and DIHARD II (Ryant et al., 2019) datasets, respectively. Lower DER is better, and (*) indicates results where overlapping speech regions were ignored.

The models consistently performed better on general domain conversations compared to medical conversations, likely due to their relaxed structure and fewer interruptions. Diarization results on Afrispeech-Dialog are better than on AMI and DIHARD, likely because of the simulated and structured nature of the conversations.

Figure 2: Comparison of Medical and General DER for Different Models

5.2 ASR

We report WER for all conversations, as well as separately for the medical and non-medical portions of the data (Table 5). The openai/whisper-large models achieved the best performance, followed by the nvidia/canary and nvidia/parakeet models, and lastly the wav2vec2-based models.

The results, as presented in Table 5 and Appendix Figure 3, demonstrate a clear trend: models exhibit superior performance on non-medical (general domain) audio compared to medical domain audio, with roughly 5% better WER, suggesting medical conversations are more challenging, likely due to accented medical jargon such as medication names and diagnoses, and the density of numerical entities (Afonja et al., 2024).

Following trends in Olatunji et al. (2023b), Wav2vec2-large-960h, finetuned exclusively on read LibriSpeech audiobooks (Panayotov et al., 2015), performs worst, confirming that model size (number of parameters), training set size, training data domain (in-the-wild, conversational), and multilingual pretraining play a role in generalizability to accented speech.

Additionally, we compare performance on other conversational datasets reported in Table 5 and show a 5 to 20 point (absolute) performance drop on average, suggesting poor generalizability of SOTA models to African-accented conversational English speech. This degradation underscores the challenges ASR models face with African accents, highlighting the pressing need for high-quality, domain-specific datasets to improve ASR performance for low-resource accents and languages.

5.3 Summarization

Table 6 shows our summarization results.

In BERTScore-H, summaries are generated from human transcripts and evaluated against human reference summaries. In BERTScore-M, summaries are generated from machine-predicted transcripts (Whisper-Large-v3) and evaluated against human reference summaries.

LLM-Eval-H represents the LLM-as-Judge results when key details in the human transcripts are sought from summaries generated from human transcripts. LLM-Eval-M represents the LLM-as-Judge results when key details in the human transcripts are sought from summaries generated from machine-predicted transcripts.

Table 6 shows that general domain models like GPT-4o and Claude consistently achieve high scores across all evaluation metrics, followed by biomedical LLMs, whose generated summaries are semantically more similar to reference summaries.


Model                     WER    Med WER  Gen WER  AMI    Earnings22  VoxPop
whisper-medium            21.27  26.49    19.47    16.6   -           7.4
whisper-large-v2          20.82  23.74    19.81    16.4   12.05       7.3
whisper-large-v3          20.38  23.81    19.19    16.01  11.3        -
whisper-large-v3-turbo    21.93  25.58    20.67    -      -           -
distil-whisper-large-v2   25.38  30.43    23.63    14.67  12.19       8.24
distil-whisper-large-v3   21.20  25.67    19.58    15.95  11.29       8.25
parakeet-rnnt-1.1b        28.16  34.03    26.13    17.1   15.15       5.44
parakeet-ctc-1.1b         28.97  34.16    27.19    15.67  13.75       6.56
parakeet-tdt-1.1b         28.69  33.57    27.01    15.9   14.65       5.49
canary-1b                 22.82  27.40    21.25    13.53  12.05       5.79
wav2vec2-large-960h       86.34  88.35    81.17    37.0   -           17.9
mms-1b-all                61.75  69.04    59.22    42.02  31.17       17.63

Table 5: WER for various pre-trained models on the entire dataset, including the Medical (Med WER) and General (Gen WER) portions, as well as the AMI (Carletta et al., 2005), Earnings22 (Del Rio et al., 2022), and VoxPopuli (Wang et al., 2021) datasets. Lower is better. Top 2 models in bold.

Model                    BertScore-H (%)  LLM-Eval-H (%)  BertScore-M (%)  LLM-Eval-M (%)  Human-Eval (%)
Claude-3-Sonnet          88.39            -               -                -               -
GPT-4o                   91.34            72.94           90.94            70.98           67.48
GPT-3.5-turbo            89.87            69.41           90.81            64.51           -
Llama-3.1-8B-Instruct    86.44            69.41           86.44            62.35           65.87
Llama-3.2-3B-Instruct    85.60            65.10           87.09            63.73           62.75
Gemma-2-9b               86.00            71.37           85.48            -               -
Human Expert             -                -               -                -               54.59
Llama3-Med42-8B          -                -               -                -               -
Phi3-mini-instruct       88.05            57.45           89.66            60.39           50.83
Meditron3-8B             89.83            60.20           90.40            59.61           42.98

Table 6: Summarization performance of various models, evaluated using BERTScore (F1), LLM-Eval, and Human Evaluation. BERTScore-H and LLM-Eval-H represent the metrics when summaries are generated and evaluated using human-generated transcripts, while BERTScore-M and LLM-Eval-M represent the metrics for machine-generated transcripts. Higher is better. Top 2 models in bold.

Model                       Positives  Negatives  Diagnoses  Treatment  No hallucination  Concise  Total
Human Expert                3.35       3.06       1.97       1.59       3.23              3.18     16.37
Claude-3-sonnet-20240229    4.31       4.31       1.54       1.46       4.77              4.54     20.92
GPT-4o                      4.22       4.13       1.71       1.76       4.31              4.11     20.24
Llama-3.1-8B-Instruct       3.90       3.71       2.67       2.14       3.71              3.62     19.76
Llama-3.2-3B-Instruct       4.04       3.91       1.35       1.43       3.96              4.13     18.83
Gemma                       3.81       3.88       1.31       1.69       3.69              3.88     18.25
Llama3-Med42-8B             3.13       2.67       2.00       1.87       2.83              2.83     15.33
Phi-3-mini-4k-instruct      3.22       2.84       1.59       1.67       2.96              2.96     15.25
Meditron3-8B                2.86       2.48       1.46       1.23       2.45              2.41     12.89

Table 7: Human Evaluation Results showing means of 5-point ratings for each criterion. Higher is better. Positives: Does the summary include all important positive clinical signs, symptoms, or details in the transcript? Negatives: Does the summary include all important negative clinical signs, symptoms, or details in the transcript? Diagnoses: Does the summary accurately recall the diagnoses from the transcript? Treatment: Does the summary accurately reflect the treatment plan in the transcript? No Hallucination: Does the summary avoid any false, incorrect, or fabricated information? Concise: Is the summary concise, clear, and well-structured?

Claude-3 was ranked highest for summarizing human transcripts, while Gemma-2 ranked highest for summarizing ASR transcripts.

Open general smaller LLMs like Gemma and the Llama models demonstrate lower semantic similarity with reference summaries (BERTScore), and their performance on the LLM-Eval metrics demonstrates an inferior ability to retrieve critical information from the conversation. The weakest performer overall was Phi3-mini-4k-instruct, particularly in the LLM-Eval scores, signaling significant challenges in medical summarization.

5.4 Human Evaluation

Results in Table 6 show that LLM-as-Judge ratings (LLM-Eval-H) were strongly correlated (Pearson's r = 0.816) with Human Expert ratings, in line with Zheng et al. (2023).
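The correlation reported above is a standard Pearson coefficient over paired ratings; a minimal sketch (the rating values in the test are illustrative, not the paper's data):

```python
import math

def pearson(xs, ys):
    # Pearson correlation: covariance of the two rating lists
    # divided by the product of their standard deviations.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```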

Table 7 shows that, of the 6 criteria, expert summaries were ranked in the top 3 only at accurately recalling the diagnoses from the transcript, suggesting this was a key focus for experts. LLM summaries were more concise and retained a greater amount of key details when compared to the expert summaries.

The blind expert ratings showed that, overall, experts ranked leading LLM summaries higher than reference summaries, scoring them consistently higher on the completeness of key facts carried over from the transcript to the summary. This suggests that LLMs may perform better at certain tasks, such as compressing large amounts of information (10+ min conversations), when compared with time-constrained physicians who may prefer brevity.

5.5 Error Propagation from Cascading Models

While summarizing ASR transcripts reveals no significant difference in semantic similarity with reference summaries (BERTScore-H vs. BERTScore-M), a more nuanced pattern emerges with the LLM-Eval-H and LLM-Eval-M metrics, where many models show a notable drop in performance. LLM-Eval-M scores are on average 2 to 5 points lower than LLM-Eval-H. This indicates that ASR errors (noisy transcription outputs) may limit the summarization ability of LLMs, leading to loss of critical healthcare information.

6 Conclusion

This study highlights several challenges with accented conversational ASR, especially its impact on downstream medical summarization. We benchmark speaker diarization, ASR, and medical summarization, drawing attention to gaps and opportunities to improve accented conversational ASR in the African context.

7 Limitations

While simulated conversations provide the opportunity to benchmark ASR systems, they may not fully reflect the complexity of real-world natural dialogues, where interruptions, distractions, and background noise are more typical. Additionally, the age range of contributors does not accurately reflect the diversity of real-world speaker populations, potentially impacting the generalizability of the results. The use of general large language models (LLMs) as evaluators may also introduce biases. At 5 hrs, the dataset does not reflect the full breadth of medical or general conversations, as several domains, subdomains, topics, and medical specialties were not covered in this release. Nonetheless, this work significantly contributes to addressing the gap in ASR systems' adaptability to African accents. Future research should aim to incorporate real-world data and broader age representation, and further optimize models to enhance the robustness and applicability of ASR technologies in diverse African linguistic environments.

8 Ethical Considerations

In a world where voice cloning and identity theft are major cybersecurity threats, releasing a voice-based dataset is not without risk. Each recording began with contributors expressing consent to the recording. However, since real names were sometimes used during consent, these segments were removed to protect participants' identities. During doctor-patient conversations, pseudonyms were used to maintain anonymity. Furthermore, releasing real-world doctor-patient conversations risks breaching privacy and confidentiality requirements, hence the preference for simulated conversations.

9 Acknowledgment

We thank the Intron Health team for providing the data and compute used in this work, and all contributors, patient- and doctor-actors whose voices are included in the dataset.

References

Asma Ben Abacha, Wen-wai Yim, Yadan Fan, and Thomas Lin. 2023. An empirical study of clinical note generation from doctor-patient encounters. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 2291–2302.

Marah Abdin, Sam Ade Jacobs, Ammar Ahmad Awan, Jyoti Aneja, Ahmed Awadallah, Hany Awadalla, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Harkirat Behl, et al. 2024. Phi-3 technical report: A highly capable language model locally on your phone. arXiv preprint arXiv:2404.14219.

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. GPT-4 technical report. arXiv preprint arXiv:2303.08774.

David Ifeoluwa Adelani, Graham Neubig, Sebastian Ruder, Shruti Rijhwani, Michael Beukman, Chester Palen-Michel, Constantine Lignos, Jesujoba O Alabi, Shamsuddeen H Muhammad, Peter Nabende, et al. 2022. MasakhaNER 2.0: Africa-centric transfer learning for named entity recognition. arXiv preprint arXiv:2210.12391.

Tejumade Afonja, Tobi Olatunji, Sewade Ogun, Naome A Etori, Abraham Owodunni, and Moshood Yekini. 2024. Performant ASR models for medical entities in accented speech. arXiv preprint arXiv:2406.12387.

Alëna Aksënova, Zhehuai Chen, Chung-Cheng Chiu, Daan van Esch, Pavel Golik, Wei Han, Levi King, Bhuvana Ramabhadran, Andrew Rosenberg, Suzan Schwartz, et al. 2022. Accented speech recognition: Benchmarking, pre-training, and diverse data. arXiv preprint arXiv:2205.08014.


Anthropic.2023. Theclaude3modelfamily:Opus, sonnet,haiku.Accessed:2024-10-15.

AlexeiBaevski,YuhaoZhou,AbdelrahmanMohamed, andMichaelAuli.2020. wav2vec2.0:Aframeworkforself-supervisedlearningofspeechrepresentations Advancesinneuralinformationprocessing systems,33:12449–12460.

YohannesBaye,TesfayeDemeke,NigusieBirhan,AgumasieSemahegn,andSimonBirhanu.2020. Nurses’ work-relatedstressandassociatedfactorsingovernmentalhospitalsinharar,easternethiopia:Acrosssectionalstudy PloSone,15(8):e0236782.

NishchalBhandari,DannyChen,MiguelÁngeldel RíoFernández,NatalieDelworth,JenniferDrexler Fox,MigüelJetté,QuintenMcNamara,CoreyMiller, OndˇrejNovotný,JánProfant,NanQin,MartinRatajczak,andJean-PhilippeRobichaud.2024. Reverb: Open-sourceasranddiarizationfromrev Preprint, arXiv:2410.03930.

JasminBharadiya.2023. Acomprehensivesurveyof deeplearningtechniquesnaturallanguageprocessing EuropeanJournalofTechnology,7(1):58–66.

Anjanava Biswas and Wrick Talukdar. 2024. Intelligent clinical documentation: Harnessing generative AI for patient-centric clinical note generation. arXiv preprint arXiv:2405.18346.

Hervé Bredin, Ruiqing Yin, Juan Manuel Coria, Gregory Gelly, Pavel Korshunov, Marvin Lavechin, Diego Fustes, Hadrien Titeux, Wassim Bouaziz, and Marie-Philippe Gill. 2020. pyannote.audio: Neural building blocks for speaker diarization. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7124–7128. IEEE.

Hervé Bredin. 2023. pyannote.audio 2.1 speaker diarization pipeline: Principle, benchmark, and recipe. In Proc. INTERSPEECH 2023.

Jean Carletta, Simone Ashby, Sebastien Bourban, Mike Flynn, Mael Guillemot, Thomas Hain, Jaroslav Kadlec, Vasilis Karaiskos, Wessel Kraaij, Melissa Kronenthal, et al. 2005. The AMI meeting corpus: A pre-announcement. In International Workshop on Machine Learning for Multimodal Interaction, pages 28–39. Springer.

Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, Jian Wu, Long Zhou, Shuo Ren, Yanmin Qian, Yao Qian, Jian Wu, Michael Zeng, Xiangzhan Yu, and Furu Wei. 2022. WavLM: Large-scale self-supervised pre-training for full stack speech processing. IEEE Journal of Selected Topics in Signal Processing, 16(6):1505–1518.

Clément Christophe, Praveen K Kanithi, Tathagata Raha, Shadab Khan, and Marco AF Pimentel. 2024. Med42-v2: A suite of clinical LLMs.

Victoria YH Chua, Hexin Liu, Leibny Paola Garcia-Perera, Fei Ting Woon, Jinyi Wong, Xiangyu Zhang, Sanjeev Khudanpur, Andy WH Khong, Justin Dauwels, and Suzy J Styles. 2023. MERLIon CCS challenge: A English-Mandarin code-switching child-directed speech corpus for language identification and diarization. arXiv preprint arXiv:2305.18881.

Miguel Del Rio, Peter Ha, Quinten McNamara, Corey Miller, and Shipra Chandra. 2022. Earnings-22: A practical benchmark for accents in the wild. arXiv preprint arXiv:2203.15591.

Brecht Desplanques, Jenthe Thienpondt, and Kris Demuynck. 2020. ECAPA-TDNN: Emphasized channel attention, propagation and aggregation in TDNN based speaker verification. arXiv preprint arXiv:2005.07143.

George R Doddington, Mark A Przybocki, Alvin F Martin, and Douglas A Reynolds. 2000. The NIST speaker recognition evaluation – overview, methodology, systems, results, perspective. Speech Communication, 31(2-3):225–254.

Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. 2024. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783.

Seppo Enarvi, Marilisa Amoia, Miguel Del-Agua Teba, Brian Delaney, Frank Diehl, Stefan Hahn, Kristina Harris, Liam McGrath, Yue Pan, Joel Pinto, et al. 2020. Generating medical reports from patient-doctor conversations using sequence-to-sequence models. In Proceedings of the First Workshop on Natural Language Processing for Medical Conversations, pages 22–30.

Naome Etori, Ebasa Temesgen, and Maria Gini. 2023. What we know so far: Artificial intelligence in African healthcare. arXiv preprint arXiv:2305.18302.

Faiha Fareez, Tishya Parikh, Christopher Wavell, Saba Shahab, Meghan Chevalier, Scott Good, Isabella De Blasi, Rafik Rhouma, Christopher McMahon, Jean-Paul Lam, et al. 2022. A dataset of simulated patient-physician medical interviews with a focus on respiratory cases. Scientific Data, 9(1):313.

J Luke Galloway, Dominique Munroe, Pamela D Vohra-Khullar, Christopher Holland, Mary A Solis, Miranda A Moore, and Reema H Dbouk. 2024. Impact of an artificial intelligence-based solution on clinicians' clinical documentation experience: Initial findings using ambient listening technology. Journal of General Internal Medicine, pages 1–3.

Sanchit Gandhi, Patrick von Platen, and Alexander M Rush. 2023. Distil-Whisper: Robust knowledge distillation via large-scale pseudo labelling. arXiv preprint arXiv:2311.00430.

Martucci Giuseppe, Mauro Cettolo, Matteo Negri, and Marco Turchi. 2021. Lexical modeling of ASR errors for robust speech translation. In Proceedings of Interspeech 2021, pages 2282–2286.

Michael Hanna and Ondřej Bojar. 2021. A fine-grained analysis of BERTScore. In Proceedings of the Sixth Conference on Machine Translation, pages 507–517.

Eric Harper, Somshubra Majumdar, Oleksii Kuchaiev, Li Jason, Yang Zhang, E Bakhturina, V Noroozi, S Subramanian, K Nithin, H Jocelyn, et al. 2019. NeMo: A toolkit for conversational AI and large language models. [Computer software], URL: https://github.com/NVIDIA/NeMo.

Arthur Hinsvark, Natalie Delworth, Miguel Del Rio, Quinten McNamara, Joshua Dong, Ryan Westerman, Michelle Huang, Joseph Palakapilly, Jennifer Drexler, Ilya Pirkin, et al. 2021. Accented speech recognition: A survey. arXiv preprint arXiv:2104.10747.

Praveen K Kanithi, Clément Christophe, Marco AF Pimentel, Tathagata Raha, Nada Saadi, Hamza Javed, Svetlana Maslenkova, Nasir Hayat, Ronnie Rajan, and Shadab Khan. 2024. MEDIC: Towards a comprehensive framework for evaluating LLMs in clinical applications. arXiv preprint arXiv:2409.07314.

Nithin Rao Koluguri, Taejin Park, and Boris Ginsburg. 2022. TitaNet: Neural model for speaker representation with 1D depth-wise separable convolutions and global context. In ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 8102–8106.

Alex Papadopoulos Korfiatis, Francesco Moramarco, Radmila Sarac, and Aleksandar Savkov. 2022. PriMock57: A dataset of primary care mock consultations. arXiv preprint arXiv:2204.00333.

Harold W Kuhn. 1955. The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2(1-2):83–97.

Federico Landini, Ján Profant, Mireia Diez, and Lukáš Burget. 2022. Bayesian HMM clustering of x-vector sequences (VBx) in speaker diarization: Theory, implementation and analysis on standard tasks. Computer Speech & Language, 71:101254.

Khai Le-Duc, Khai-Nguyen Nguyen, Long Vo-Dang, and Truong-Son Hy. 2024. Real-time speech summarization for medical conversations. arXiv preprint arXiv:2406.15888.

Yixin Liu, Alexander R Fabbri, Jiawen Chen, Yilun Zhao, Simeng Han, Shafiq Joty, Pengfei Liu, Dragomir Radev, Chien-Sheng Wu, and Arman Cohan. 2023. Benchmarking generation and evaluation capabilities of large language models for instruction controllable summarization. arXiv preprint arXiv:2311.09184.

Anirudh Mani, Shruti Palaskar, and Sandeep Konam. 2020. Towards understanding ASR error correction for medical conversations. In Proceedings of the First Workshop on Natural Language Processing for Medical Conversations, pages 7–11.

George Michalopoulos, Kyle Williams, Gagandeep Singh, and Thomas Lin. 2022. MedicalSum: A guided clinical abstractive summarization model for generating medical reports from patient-doctor conversations. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 4741–4749.

Hoang-Thi Nguyen-Vo, Huy Nguyen-Gia, Hoan-Duy Nguyen-Tran, Hoang Pham-Minh, Hung Vo-Thanh, and Hao Do-Duc. 2022. MarbleNet: A deep neural network solution for Vietnamese voice activity detection. In 2022 9th NAFOSTED Conference on Information and Computer Science (NICS), pages 154–158.

Sewade Ogun, Abraham T Owodunni, Tobi Olatunji, Eniola Alese, Babatunde Oladimeji, Tejumade Afonja, Kayode Olaleye, Naome A Etori, and Tosin Adewumi. 2024. 1000 African voices: Advancing inclusive multi-speaker multi-accent speech synthesis. arXiv preprint arXiv:2406.11727.

Tobi Olatunji, Tejumade Afonja, Bonaventure FP Dossou, Atnafu Lambebo Tonja, Chris Chinenye Emezue, Amina Mardiyyah Rufai, and Sahib Singh. 2023a. AfriNames: Most ASR models "butcher" African names. arXiv preprint arXiv:2306.00253.

Tobi Olatunji, Tejumade Afonja, Aditya Yadavalli, Chris Chinenye Emezue, Sahib Singh, Bonaventure FP Dossou, Joanne Osuchukwu, Salomey Osei, Atnafu Lambebo Tonja, Naome Etori, et al. 2023b. AfriSpeech-200: Pan-African accented speech dataset for clinical and general domain ASR. Transactions of the Association for Computational Linguistics, 11:1669–1685.

Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. 2015. LibriSpeech: An ASR corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5206–5210. IEEE.

Hannaneh B Pasandi and Haniyeh B Pasandi. 2022. Evaluation of ASR systems for conversational speech: A linguistic perspective. In Proceedings of the 20th ACM Conference on Embedded Networked Sensor Systems, pages 962–965.

Piotr Pęzik, Gosia Krawentek, Sylwia Karasińska, Paweł Wilk, Paulina Rybińska, Anna Cichosz, Angelika Peljak-Łapińska, Mikołaj Deckert, and Michał Adamczyk. 2022. DiaBiz – an annotated corpus of Polish call center dialogs. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 723–726.

Mirosław Płaza, Łukasz Pawlik, and Stanisław Deniziak. 2021. Call transcription methodology for contact center systems. IEEE Access, 9:110975–110988.

Vineel Pratap, Andros Tjandra, Bowen Shi, Paden Tomasello, Arun Babu, Sayani Kundu, Ali Elkahky, Zhaoheng Ni, Apoorv Vyas, Maryam Fazel-Zarandi, et al. 2024. Scaling speech technology to 1,000+ languages. Journal of Machine Learning Research, 25(97):1–52.

Krishna C Puvvada, Piotr Żelasko, He Huang, Oleksii Hrinchuk, Nithin Rao Koluguri, Kunal Dhawan, Somshubra Majumdar, Elena Rastorgueva, Zhehuai Chen, Vitaly Lavrukhin, et al. 2024. Less is more: Accurate speech recognition & translation without web-scale data. arXiv preprint arXiv:2406.19674.

Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. 2023. Robust speech recognition via large-scale weak supervision. In International Conference on Machine Learning, pages 28492–28518. PMLR.

Mohammad Sadegh Rasooli and Joel R. Tetreault. 2015. Yara parser: A fast and accurate dependency parser. Computing Research Repository, arXiv:1503.06733. Version 2.

Mirco Ravanelli and Yoshua Bengio. 2018. Speaker recognition from raw waveform with SincNet. In 2018 IEEE Spoken Language Technology Workshop (SLT), pages 1021–1028.

Neville Ryant, Kenneth Church, Christopher Cieri, Alejandrina Cristia, Jun Du, Sriram Ganapathy, and Mark Liberman. 2019. The second DIHARD diarization challenge: Dataset, task, and baselines. In Interspeech, pages 978–982.

Oleg Serikov, Ekaterina Voloshina, Anna Postnikova, Saliha Muradoğlu, Éric Le Ferrand, Elena Klyachko, Ekaterina Vylomova, Tatiana Shavrina, and Francis Tyers. 2024. Proceedings of the 3rd Workshop on NLP Applications to Field Linguistics (FieldMatters 2024). In Proceedings of the 3rd Workshop on NLP Applications to Field Linguistics (FieldMatters 2024).

Karan Singhal, Shekoofeh Azizi, Tao Tu, S Sara Mahdavi, Jason Wei, Hyung Won Chung, Nathan Scales, Ajay Tanwani, Heather Cole-Lewis, Stephen Pfohl, et al. 2023. Large language models encode clinical knowledge. Nature, 620(7972):172–180.

Gabriel Skantze. 2021. Turn-taking in conversational systems and human-robot interaction: A review. Computer Speech & Language, 67:101178.

Georgios P Spithourakis, Ivan Vulić, Michał Lis, Iñigo Casanueva, and Paweł Budzianowski. 2022. EVI: Multilingual spoken dialogue tasks and dataset for knowledge-based enrolment, verification, and identification. arXiv preprint arXiv:2204.13496.

Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, Léonard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ramé, et al. 2024. Gemma 2: Improving open language models at a practical size. arXiv preprint arXiv:2408.00118.

Mohamad-Hani Temsah, Amr Jamal, Khalid Alhasan, Abdulkarim A Temsah, and Khalid H Malki. 2024. OpenAI o1-preview vs. ChatGPT in healthcare: A new frontier in medical AI reasoning. Cureus, 16(10).

Atnafu Lambebo Tonja, Israel Abebe Azime, Tadesse Destaw Belay, Mesay Gemeda Yigezu, Moges Ahmed Mehamed, Abinew Ali Ayele, Ebrahim Chekol Jibril, Michael Melese Woldeyohannis, Olga Kolesnikova, Philipp Slusallek, et al. 2024. EthioLLM: Multilingual large language models for Ethiopian languages with task evaluation. arXiv preprint arXiv:2403.13737.

Changhan Wang, Morgane Riviere, Ann Lee, Anne Wu, Chaitanya Talnikar, Daniel Haziza, Mary Williamson, Juan Pino, and Emmanuel Dupoux. 2021. VoxPopuli: A large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 993–1003, Online. Association for Computational Linguistics.

Lucy Lu Wang, Yulia Otmakhova, Jay DeYoung, Thinh Hung Truong, Bailey E Kuehl, Erin Bransom, and Byron C Wallace. 2023. Automated metrics for medical multi-document summarization disagree with human evaluations. arXiv preprint arXiv:2305.13693.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45.

Emre Yılmaz, Astik Biswas, Ewald van der Westhuizen, Febe de Wet, and Thomas Niesler. 2018. Building a unified code-switching ASR system for South African languages. arXiv preprint arXiv:1807.10949.

Wen-wai Yim, Yujuan Fu, Asma Ben Abacha, Neal Snider, Thomas Lin, and Meliha Yetisgen. 2023. ACI-BENCH: A novel ambient clinical intelligence dataset for benchmarking automatic visit note generation. Scientific Data, 10(1):586.

Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q Weinberger, and Yoav Artzi. 2019. BERTScore: Evaluating text generation with BERT. arXiv preprint arXiv:1904.09675.

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. 2023. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. Advances in Neural Information Processing Systems, 36:46595–46623.

A Human Evaluation Criteria

This section provides details on the criteria used to evaluate the AI-generated summaries of doctor-patient conversations. Each summary was evaluated based on the following aspects:

1. How well does the summary include all positive clinical signs, symptoms, or details in the transcript?

2. How well does the summary include all negative clinical signs, symptoms, or details in the transcript?

3. How well does the summary recall the diagnosis from the transcript?

4. How accurately does the summary reflect the treatment plan in the transcript?

5. To what degree does the summary avoid false, incorrect, or fabricated information?

6. How concise, clear, and well-structured is the summary?

B Prompt Template for the LLM Evaluator

Table 8 outlines the prompt template used for the evaluation conducted by OpenAI's o1 model, which serves as the LLM evaluator.

C Evaluation Metric for the LLM Evaluator

In the LLM evaluation prompt (see Appendix B), each criterion has a maximum score of 5 points. Since there are six criteria per row, the perfect score for a single row is 5 × 6 = 30 points. For a dataset with ten rows, the total possible score (perfect accuracy) is:

Perfect Accuracy = 5 × 6 × 10 = 300 points

To compute a model's percentage score for a dataset with N rows:

1. Sum the scores for all six criteria per row.

2. Sum these totals across all rows.

3. Calculate the model's percentage score as:

Percentage Score = (Model's Total Score / (5 × 6 × N)) × 100
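The aggregation above can be sketched in a few lines of Python. This is an illustrative implementation, not the authors' evaluation code; `rows` is assumed to hold one list of six criterion scores (0–5) per dataset row, as returned by the LLM judge.

```python
def percentage_score(rows, criteria_per_row=6, max_points=5):
    """Aggregate per-row criterion scores into a percentage of the perfect score.

    `rows` is a list of per-row score lists, each holding one 0-5 score
    per criterion (six criteria per row, per Appendix B).
    """
    total = sum(sum(row) for row in rows)                # steps 1 and 2: sum within rows, then across rows
    perfect = max_points * criteria_per_row * len(rows)  # e.g. 5 * 6 * 10 = 300 for ten rows
    return 100.0 * total / perfect                       # step 3: percentage of the perfect score

# A perfect ten-row dataset reaches 300 points, i.e. 100%.
perfect_rows = [[5] * 6 for _ in range(10)]
print(percentage_score(perfect_rows))  # 100.0
```

With two rows scoring 18 and 30 points, the perfect score is 60, giving 80%.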

LLM as a Judge Prompt Template

You are a medical expert. Your task is to evaluate the accuracy and quality of an AI-generated summary of a doctor-patient conversation. Provide an impartial assessment based on the criteria below, rating each from 1 to 5 (1 = lowest, 5 = highest). If a specific criterion is not addressed, assign a score of 0.

Transcript: {transcript}

Summary: {summary}

Evaluation Criteria:

1. Completeness of positive clinical signs, symptoms, or relevant details: Does the summary capture all important positive findings?

2. Completeness of negative clinical signs, symptoms, or relevant details: Does the summary capture all important negative findings?

3. Accuracy of diagnosis: Does the summary correctly reflect the diagnosis?

4. Accuracy of treatment plan: Does the summary convey the treatment plan correctly?

5. Factual accuracy: Does the summary avoid false, incorrect, or fabricated information?

6. Clarity and structure: Is the summary concise, clear, and well-structured?

Please return the evaluation scores in JSON format:

{
  "Criterion1": score,
  "Criterion2": score,
  "Criterion3": score,
  "Criterion4": score,
  "Criterion5": score,
  "Criterion6": score
}

Table 8: Evaluation task description for assessing AI-generated summaries.
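For illustration, the Table 8 template can be filled and the judge's JSON reply parsed as follows. The template string here is abridged from Table 8, the `judge_reply` value is a made-up example, and a real pipeline would call the judge model's API between the two steps.

```python
import json

# Abridged version of the Table 8 template with its two placeholders.
JUDGE_TEMPLATE = (
    "You are a medical expert. Your task is to evaluate the accuracy and quality "
    "of an AI-generated summary of a doctor-patient conversation.\n\n"
    "Transcript: {transcript}\n\n"
    "Summary: {summary}\n"
)

def build_prompt(transcript, summary):
    # Substitute the transcript and summary into the template.
    return JUDGE_TEMPLATE.format(transcript=transcript, summary=summary)

def parse_scores(judge_reply):
    # The judge is asked to return the six criterion scores as JSON.
    scores = json.loads(judge_reply)
    return [scores[f"Criterion{i}"] for i in range(1, 7)]

# Made-up example reply; a real reply would come from the o1 judge model.
judge_reply = ('{"Criterion1": 5, "Criterion2": 4, "Criterion3": 5, '
               '"Criterion4": 5, "Criterion5": 3, "Criterion6": 4}')
print(parse_scores(judge_reply))  # [5, 4, 5, 5, 3, 4]
```

The per-row score list produced here is exactly the input the Appendix C percentage-score computation expects.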

D Prompt Template for Summarization

The figure below shows the prompt template used for the summarization task.

You are a medical expert; you will be provided a transcript that contains a conversation between a doctor and a patient. The task is for you to summarize the key points in the conversation. Transcript: {transcript}

E Transcript and Summary Example

This section presents a sample of a human-written transcript in Table 9 and the corresponding LLM-generated summaries for that transcript. While GPT4o captures key details such as pain severity, lifestyle factors, and medical history, its summary omits a vital detail: the patient's experience of rectal bleeding "during urination", which is crucial to the case. The summaries are:

GPT4o's Summary: Dr. Philip conducted a consultation with Mrs. Ayike, a 45-year-old woman. She reported experiencing rectal bleeding and abdominal pain for the past two weeks. The pain, rated as a 6 out of 10 on the pain scale, is localized in the lower abdominal region and worsens with pressure. Mrs. Ayike also noted a loss of appetite and fatigue since the onset of her symptoms. She has not sought medical attention prior to this visit and has no history of surgery. She neither smokes nor drinks alcohol. Although Mrs. Ayike does not travel, she revealed a family history of colorectal cancer through her mother. Her usual diet includes "swallow," and she has no food allergies. Mrs. Ayike is married with four children, but currently lives separately from her husband. Dr. Philip concluded the session by acknowledging Mrs. Ayike's details.

Claude's Summary: Based on the conversation, the key points are: 1. The patient, Mrs. Ayike, is a 45-year-old woman presenting with rectal bleeding and abdominal pain for the past two weeks. 2. The rectal bleeding occurs when she urinates, and it is not mixed with stool. 3. The abdominal pain is rated as 6 out of 10 in severity, and it is localized in the lower abdominal region. 4. She has lost her appetite and feels tired since the symptoms started. 5. She has not traveled recently and is not a frequent traveler. 6. She is married but currently separated from her husband, living in a different state. 7. She has four children. 8. She has not undergone any surgery before. 9. She does not smoke or consume alcohol. 10. Her mother had colorectal cancer in the past, indicating a family history of the condition. 11. Her typical diet includes swallows (a type of food), and she is not allergic to any food. The doctor has gathered essential information about the patient's symptoms, medical history, and lifestyle to aid in the diagnosis and treatment process.

Gemma's Summary: Mrs. Ayike, a 45-year-old woman, presented to Dr. Philip complaining of two weeks of abdominal pain and rectal bleeding. Key points: Symptoms: Lower abdominal pain rated 6/10, worsening with pressure. Rectal bleeding during urination, not mixed with stool. Loss of appetite and fatigue. Medical History: No previous surgeries. No smoking or alcohol consumption. Family history of colorectal cancer (mother). Lifestyle: Prefers a diet high in swallow. No known food allergies. Not a traveler. Separated from her husband, living with four children. Next steps: Based on the symptoms and family history, Dr. Philip will likely order further investigations to determine the cause of Mrs. Ayike's abdominal pain and rectal bleeding. This may include blood tests, stool tests, imaging studies (such as a colonoscopy or CT scan), and possibly a biopsy.

Med42-8B's summary: Mrs. Ayike, a 45-year-old female, presents with rectal bleeding and lower abdominal pain, rated 6/10 in severity, which started two weeks ago. She reports seeing red blood in her urine but not mixed with stool. She has not sought medical attention previously for these symptoms. She describes her appetite as poor and feels tired. She denies any recent travel, smoking, or alcohol use. She is married with four children but is currently separated from her husband. Her mother had colorectal cancer. She consumes a diet primarily consisting of fufu (swallow). There are no known food allergies.

GPT3.5's summary: Mrs. Aike, a 45-year-old patient, visited Dr. Philip due to rectal bleeding and abdominal pain that started two weeks ago. The abdominal pain is severe, rated at a 6 on a scale of 1 to 3, and is localized in the lower abdominal region. She has also experienced a loss of appetite, fatigue, and difficulty eating since the pain started. Mrs. Aike is married with four children, not living with her husband, and has no history of surgeries, smoking, or alcohol consumption. Her mother had colorectal cancer before, indicating a family history of similar issues. She prefers eating swallow and is not allergic to any food. This is her first visit to seek medical help for these symptoms.

Transcript

Doctor: Good afternoon. My name is Dr. Philip, and what's your name?

Patient: My name is Mrs. Ayike.

Doctor: Okay, how old are you?

Patient: I am 45 years old.

Doctor: Okay. You're welcome, my dear. What brought you here?

Patient: I have rectal bleeding and abdominal pain.

Doctor: Can you tell me more about it?

Patient: The pain started two weeks ago, and the rectal bleeding happens when I urinate. The blood is not mixed with stool.

Doctor: Have you gone for any check-ups?

Patient: No, this is my first time coming here.

Doctor: How severe is the abdominal pain, from a scale of 1 to 10?

Patient: 6.

Doctor: Which side of the abdominal region do you feel the pain?

Patient: I feel pain in my lower abdominal region, and it worsens with pressure. And since the last two weeks, as I've started the abdominal pain, I have not been able to eat well. I lost appetite. I feel tired.

Doctor: Have you traveled recently?

Patient: No, I don't travel.

Doctor: Are you married?

Patient: Yes, I am married, with four children, but my husband lives in another state.

Doctor: Have you undergone any surgery before?

Patient: No.

Doctor: Do you smoke or drink alcohol?

Patient: No.

Doctor: Does anyone in your family have similar issues?

Patient: Yes, my mother had colorectal cancer.

Doctor: What foods do you normally eat?

Patient: I like taking swallow. I like swallow a lot.

Doctor: Are you allergic to any food?

Patient: No.

Table 9: Snippet of a doctor-patient conversation (transcript) written by a clinician.

Meditron3-8B's summary: The patient, Mrs. Ayike, is a 45-year-old woman who has been experiencing rectal bleeding and abdominal pain for two weeks. She has not undergone any surgery before and does not smoke or take alcohol. Her mother had colorectal cancer before. She has lost her appetite and feels tired. The doctor asks her about her family history, food allergies, and the severity of her symptoms. The doctor also asks about her marital status and children. The patient's symptoms are similar to those of colorectal cancer, which her mother had.

Llama-3.1-8B-Instruct's summary: Based on the provided transcript, here are the key points from the conversation between Dr. Philip and Mrs. Ayike: 1. Patient's Introduction: Mrs. Ayike, a 45-year-old woman, presents to the doctor with rectal bleeding and abdominal pain. 2. Symptoms: The abdominal pain started two weeks ago, and she experiences rectal bleeding when urinating, but not mixed with stool. The pain is located in the lower abdominal region and worsens with pressure. 3. Medical History: Mrs. Ayike has not undergone any surgery before and does not have a history of smoking or alcohol consumption. 4. Family History: Her mother had colorectal cancer before. 5. Diet and Allergies: Mrs. Ayike likes eating swallow (a type of food) and is not allergic to any food. 6. Lifestyle: She is married with four children but is currently separated from her husband. She does not travel and has not been able to eat well or feel energetic due to her symptoms. 7. Current Condition: Mrs. Ayike rates her abdominal pain as a 6 out of 10 and has lost appetite and feels tired since the onset of symptoms. Based on the information provided, Dr. Philip will likely perform a thorough examination and possibly order diagnostic tests to determine the cause of Mrs. Ayike's symptoms. Given her family history of colorectal cancer, a colonoscopy may be recommended to rule out any potential malignancies.

Llama-3.2-3B-Instruct's summary: Here's a summary of the key points in the conversation between Dr. Philip and Mrs. Ayike: Patient's Symptoms: Rectal bleeding (not mixed with stool), abdominal pain (rated 6 out of 10) in the lower abdominal region, loss of appetite and tiredness for the past two weeks. Medical History: No previous surgeries, no smoking or alcohol consumption, no known allergies to food. Family History: Mother had colorectal cancer before. Lifestyle and Habits: Married with four children, but not living with her husband. No history of traveling. Prefers swallowing food. Other Information: Mrs. Ayike is 45 years old. She has not been able to eat well due to the abdominal pain. She is seeking medical attention for the first time, as the symptoms started two weeks ago.

Phi3-mini-instruct's summary: Dr. Philip, a medical expert, is speaking with Mrs. Ayike, a 45-year-old woman who has been experiencing rectal bleeding and abdominal pain for two weeks. Mrs. Ayike has not had any check-ups for her symptoms and has lost her appetite, feeling tired. She has not traveled recently and is not a smoker. She does not consume alcohol and has no known food allergies. Mrs. Ayike's mother had colorectal cancer. Dr. Philip asks about her diet, which mainly consists of swallow, and inquires about any family history of similar issues.

F Overlapping Speech Example

In this example, overlapping speech occurs between Speaker 1 and Speaker 2. Speaker 2's "No" overlaps with Speaker 1's utterance starting with "Ok, thank you very much." Both utterances are fully captured with precise timestamps to ensure accurate representation of the conversation. This example illustrates how overlaps are handled in the transcription process for the dataset.

G ASR and Summarization Result Graphs

Figure 3: Comparison of medical and general WER for different models.

Figure 4: Summarization results for several LLMs.
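As a rough illustration of the WER metrics plotted in these figures, word error rate is the word-level edit distance between reference and hypothesis divided by the reference length; MedWER applies the same computation restricted to segments containing medical terms. The sketch below is a generic, self-contained implementation, not the evaluation code used in the paper.

```python
def wer(reference, hypothesis):
    """Word error rate: (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard Levenshtein distance over words, via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting all i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution ("and" -> "an") over a 5-word reference gives WER 0.2.
print(wer("rectal bleeding and abdominal pain",
          "rectal bleeding an abdominal pain"))  # 0.2
```

In practice a library such as jiwer is typically used, along with text normalization before scoring; this sketch omits normalization entirely.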

Timestamp            Speaker    Utterance
02:40.98–02:41.99    Speaker 1  Past hospital admission?
02:42.97–02:43.02    Speaker 2  No.
02:43.00–02:45.99    Speaker 1  Ok, thank you very much. Do you live alone?

Table 10: Example of overlapping speech with accurate timestamps and speaker annotations.
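The overlap in Table 10 can be detected programmatically: two timestamped turns overlap exactly when the later start precedes the earlier end. A minimal sketch, where the `(start, end, speaker)` tuples and the `MM:SS.ss` parsing are illustrative conveniences rather than the dataset's actual annotation format:

```python
def to_seconds(ts):
    # Parse "MM:SS.ss" timestamps like those in Table 10.
    minutes, seconds = ts.split(":")
    return int(minutes) * 60 + float(seconds)

def find_overlaps(segments):
    """Return speaker pairs whose (start, end) intervals intersect."""
    overlaps = []
    for i in range(len(segments)):
        for j in range(i + 1, len(segments)):
            (s1, e1, spk1), (s2, e2, spk2) = segments[i], segments[j]
            if max(s1, s2) < min(e1, e2):  # intervals share a positive-length stretch
                overlaps.append((spk1, spk2))
    return overlaps

# The three turns from Table 10.
segments = [
    (to_seconds("02:40.98"), to_seconds("02:41.99"), "Speaker 1"),
    (to_seconds("02:42.97"), to_seconds("02:43.02"), "Speaker 2"),
    (to_seconds("02:43.00"), to_seconds("02:45.99"), "Speaker 1"),
]
print(find_overlaps(segments))  # [('Speaker 2', 'Speaker 1')]
```

Only Speaker 2's "No" and Speaker 1's final turn intersect (02:43.00–02:43.02), matching the overlap described in Appendix F.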

Figure 5: WER, MedWER, and Non-MedWER for various models (openai/whisper-medium, openai/whisper-large-v2, openai/whisper-large-v3, openai/whisper-large-v3-turbo, distil-whisper/distil-large-v2, distil-whisper/distil-large-v3, nvidia/parakeet-rnnt-1.1b, nvidia/parakeet-ctc-1.1b, nvidia/parakeet-tdt-1.1b, nvidia/canary-1b, facebook/wav2vec2-large-960h, facebook/mms-1b-all).
