
Enhancing LLMs with Indian Multi-Lingual Audio Understanding for AGI Advancement: A Survey

Aniruddha Birage1, Tanay Thatte1, Chiranjeev Patil1, Aarush Balkundi1, Arati Deshpande1

Saswati Rabha2, Chintan Parikh2

1Pune Institute of Computer Technology (PICT), affiliated to SPPU, Pune, India

2Reverie Language Technologies Limited, Bengaluru, India

Abstract - Recent trends in speech and language processing include self-supervised learning, multilingual automatic speech recognition, and large-scale language models. Emerging techniques such as Wav2Vec 2.0 and joint supervised-unsupervised training can achieve scalability with high performance in low-resource languages, but they can incur high computational costs and encounter data imbalances. Among the newer innovations, few-shot learning, multimodal models, and task generalization bring large improvements in adaptation and efficiency, while similarly raising challenges of bias and resource intensity. This paper explores these breakthroughs and their effects on the discipline.

Key Words: Self-supervised learning, Wav2Vec 2.0, Large-scale language models, Multimodal models, Task generalization, Low-resource languages, Speech and language processing

1. INTRODUCTION

Recent advances in self-supervised learning (SSL), multilingual automatic speech recognition (ASR), and large-scale language models (LLMs) have dramatically impacted the speech and language processing community, with many breakthroughs establishing new state-of-the-art benchmarks. This paper discusses three broad themes: self-supervised speech representation learning, cross-lingual representation learning (XLRL), and efficient scaling of LLMs. Another crucial factor is minimizing dependence on labelled data, especially for low-resource languages (LRLs). Self-supervised models like Wav2Vec 2.0 push the boundaries of scalability and accuracy; they help answer many of the challenges that arise from limited labelled datasets, but concurrently expose new complexities because of the computational cost of fine-tuning very large models for specific tasks or languages.

XLRL research has advanced significantly, and models can now learn common patterns across LRLs. Combined with transfer learning techniques, this property makes speech recognition systems markedly more flexible and inclusive: they generalize effectively across a wide range of languages, making them highly adaptable. Such flexibility holds great promise for code-switching and zero-shot learning, with successful applications of a model to languages it was never explicitly trained on.

Adding multilingual audio from India significantly enhances the performance and adaptability of LLMs; models can leverage India's linguistic diversity to provide richer, more nuanced representations, improving the processing and understanding of various languages. The challenges posed by LRLs are addressed, and cross-lingual transfer learning is pushed forward. Using multilingual audio data enables robust training that captures the nuances of various dialects and contexts, culminating in richer, more inclusive, and notably more accurate language models.

Scaling large language models to sizes like the Pathways Language Model (PaLM) and Generative Pre-trained Transformer 3 (GPT-3) has completely revolutionized natural language processing. These models excel at learning from few examples and adapting to various tasks with minimal task-specific fine-tuning. Such scale, however, comes with enormous computational costs, increased memory requirements, and a larger environmental impact. Data imbalance, particularly the underrepresentation of low-resource languages in training datasets, is a major issue.

The rapid growth of large-model training raises significant concerns for researchers and organizations with limited resources. Open-source models like Vicuna have therefore emerged to democratize artificial general intelligence (AGI), so that organizations can deploy such models for research and development purposes. Other challenges, like imbalanced datasets and the search for more efficient and more inclusive models, remain.

2. RESEARCH

2.1 Self-Supervised Learning and Speech Recognition

These papers focus on reducing reliance on labeled data and improving speech recognition using self-supervised techniques.


 Self-Supervised Cross-Lingual Speech Representation Learning at Scale [1]: A scalable self-supervised learning approach for learning speech representations across multiple languages from few or no labeled datasets; it improves performance on low-resource languages through cross-lingual transfer.

 Wav2Vec 2.0: Self-Supervised Speech Learning [3]: Presents Wav2Vec 2.0, which learns speech representations from raw audio using self-supervised learning. Its architecture significantly reduces the amount of labelled data needed for training while still achieving state-of-the-art speech recognition performance.

 Self-supervised Learning with Random-Projection Quantizer [6]: In this approach, significant gains in self-supervised speech learning are achieved at lower computational cost without sacrificing precision, using a random-projection quantizer to make the system efficient.

 w2v-BERT: Combining Contrastive Learning and Masked Language Modelling [8]: Combines contrastive learning and masked language modelling to enhance self-supervised speech pre-training, yielding better generalization across speech tasks while requiring less labelled data.

 Unsupervised Cross-lingual Representation Learning for Speech Recognition [9]: Focuses on transferring knowledge from high-resource languages to low-resource languages in an entirely unsupervised manner, enhancing speech recognition without using labelled data for low-resource languages.

2.2 Multilingual and Multimodal Learning

These papers explore learning across multiple languages or modalities, enhancing system performance on diverse tasks.

 Joint Unsupervised and Supervised Training for Multilingual ASR [2]: Combines unsupervised and supervised learning to improve ASR in multiple languages, especially low-resource languages, by processing labelled and unlabelled data together.

 Multimodal Embodied Language Learning [11]: Integrates different modalities such as vision, language, and action for tasks like robotics, enabling models to reason about and understand the real-world environment more effectively by utilizing several modalities.

 Language-independent Subword Tokenization [16]: A language-independent tokenizer, based on multilingual rules rather than rules specific to any one language, is proposed for efficient and precise text processing, particularly for rare and unknown words.

2.3 Large Language Models and Few-shot Learning

These papers focus on scaling large models and improving task generalization using minimal task-specific data.

 Language Models are Few-Shot Learners [4]: Introduces GPT-3, a large language model that can carry out multiple tasks with few-shot learning and generalize robustly across most NLP tasks with minimal fine-tuning.

 Scaling Vision Transformers to 22 Billion Parameters [10]: A new vision transformer with 22 billion parameters significantly improves image recognition and addresses training and scaling challenges, highlighting its potential in diverse computer vision tasks.

 Vicuna: Open-Source Chatbot Development [5]: A chatbot that reaches roughly 90 percent of ChatGPT's quality while remaining highly adaptable and customizable.

 PaLM: Scaling Language Modelling with Pathways [7]: Scales large language models efficiently with the Pathways system, which activates only parts of the model, reducing computational overhead while maintaining strong performance on a wide range of NLP tasks.

2.4 Efficient Adaptation and Transfer Learning

These papers explore techniques for improving the computational efficiency and scalability of models.

 Parameter-efficient Transfer Learning [13]: Proposes reducing the number of trainable parameters needed to fine-tune large models without sacrificing performance, lowering the computational resources required to adapt to a given task.

 Low-rank Adaptation of Large Models [14]: The method applies low-rank matrices to the task of fine-tuning large pre-trained models efficiently, achieving significant reductions in computational cost while retaining high performance on new tasks.

2.5 Real-time Speech and Contextual Understanding

These papers deal with enhancing real-time speech recognition and contextual understanding in language models.

International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056

Volume: 11 Issue: 10 | Oct 2024 www.irjet.net p-ISSN: 2395-0072

 Fast-Slow Encoder-based Speech Transduction [15]: Proposes a fast-slow encoder framework that improves the accuracy of real-time speech recognition at low latency, making it suitable for streaming speech-to-text applications.

 Contextual Speech Understanding [12]: This framework improves speech comprehension by combining hearing with a reasoning mechanism. Such integrated capabilities improve performance in complex real-time interaction; virtual assistants are expected to understand their users' commands and get them right even in quite noisy environments.

Table – 1: Comparative Study of the Surveyed Works. Each entry summarizes the methodology, datasets, results, advantages, and disadvantages reported in the corresponding paper.

[1] Self-Supervised Cross-Lingual Speech Representation Learning at Scale

Methodology:
• Focuses on language-agnostic speech representation using self-supervised learning; the model learns to predict parts of the audio from context, generalizing speech patterns across languages.
• Involves pre-training on large-scale multilingual audio data without labeled transcriptions; the pre-trained models can then be fine-tuned for tasks like speech recognition or translation.
• Reduces reliance on labeled data for each language.

Datasets:
• The training corpus includes multilingual speech data, enabling the model to learn robust cross-lingual representations; large-scale datasets such as Common Voice, BABEL, or proprietary collections are used.
• Ensures coverage of a wide range of acoustic and linguistic variations.

Results:
• Pre-trained models show strong performance on downstream tasks like ASR and speech translation, especially for low-resource languages.
• Clear improvement in Word Error Rate (WER) compared to models trained from scratch or only with labeled data; self-supervised pre-training proves highly efficient for cross-lingual speech tasks.

Advantages:
• Reduces reliance on labeled data, especially for low-resource languages; cross-lingual knowledge transfer improves performance across languages.
• Scalable training with large multilingual datasets and enhanced performance on downstream tasks like speech recognition.
• Adapts easily to new languages with minimal data; cost-efficient, since unannotated data reduces the need for manual annotation.

Disadvantages:
• High computational cost of training large models on massive datasets.
• Bias towards resource-rich languages due to data imbalance; fine-tuning for specific languages or tasks adds complexity.
• Performance depends heavily on the quality of the unannotated data, and the learned representations are abstract and difficult to interpret.
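To make the fine-tuning step described above concrete, the sketch below loads a public XLS-R checkpoint through the Hugging Face Transformers library (a tooling choice of ours, not the paper's) and runs one CTC training step on dummy audio. The vocabulary size and labels are hypothetical placeholders for a target-language character set, and a recent Transformers version is assumed.

```python
# A minimal sketch (not the paper's exact recipe) of fine-tuning XLS-R for a
# new language via a CTC head, using the public 300M checkpoint.
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2ForCTC

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-xls-r-300m")
model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-xls-r-300m",
    vocab_size=64,                 # hypothetical target-language character vocabulary
    ctc_loss_reduction="mean",
)
model.freeze_feature_encoder()     # common practice: keep the CNN front end fixed

audio = torch.randn(16000)         # stand-in for one second of 16 kHz speech
inputs = extractor(audio.numpy(), sampling_rate=16000, return_tensors="pt")
labels = torch.randint(0, 64, (1, 12))           # dummy character labels
loss = model(inputs.input_values, labels=labels).loss
loss.backward()                    # an optimizer step would follow in real training
```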


[2] Joint Unsupervised and Supervised Training for Multilingual ASR

Methodology:
• The authors propose a joint training framework combining supervised and unsupervised learning: unsupervised training uses untranscribed speech data, while supervised training uses labeled data.
• The model is trained multilingually, integrating data from various languages to enhance ASR performance across them.

Datasets:
• Large multilingual collections of labeled and unlabeled speech across various languages.
• Specific datasets include the MLS dataset and other speech corpora containing both transcribed and untranscribed data.

Results:
• The joint training scheme shows significant improvements, particularly for low-resource languages; the authors report a decrease in Word Error Rate (WER) across various languages.
• The most drastic WER reduction occurs when labeled data is limited but unlabeled data is abundant.
• The method outperforms monolingual and multilingual models trained solely through supervised learning.

Advantages:
• Enhances performance across multiple languages and works with both labeled and unlabeled data.
• Reduces reliance on expensive labeled data; improves model robustness and generalization.
• Scalable to larger training datasets; enhances accuracy in real-world applications.

Disadvantages:
• Requires complex training configurations and higher computational cost due to handling both labeled and unlabeled data.
• Quality depends heavily on the quality of the input data, and gains are limited for high-resource languages.
• Aligning multilingual data can be challenging, and the model risks overfitting to specific language families.
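The core idea above is a single objective that mixes a supervised loss on labeled batches with a self-supervised loss on unlabeled batches. The following toy sketch illustrates that weighting scheme with placeholder modules and a made-up mixing weight; it is a generic illustration, not the paper's exact JUST algorithm.

```python
# Generic sketch of joint supervised + unsupervised training: one step mixes a
# CTC loss on labeled speech with a pretext loss on unlabeled speech.
import torch
import torch.nn as nn

encoder = nn.GRU(input_size=80, hidden_size=256, batch_first=True)  # toy acoustic encoder
ctc_head = nn.Linear(256, 50)                                       # toy 50-symbol vocabulary
proj = nn.Linear(256, 80)                                           # head for the pretext task
ctc = nn.CTCLoss(blank=0)

def supervised_loss(feats, labels, label_lens):
    out, _ = encoder(feats)
    logp = ctc_head(out).log_softmax(-1).transpose(0, 1)   # (T, B, V) as CTCLoss expects
    in_lens = torch.full((feats.size(0),), feats.size(1), dtype=torch.long)
    return ctc(logp, labels, in_lens, label_lens)

def self_supervised_loss(feats):
    # Placeholder pretext task: predict the next frame from the current one.
    out, _ = encoder(feats[:, :-1])
    return nn.functional.mse_loss(proj(out), feats[:, 1:])

labeled = torch.randn(4, 100, 80)                 # dummy log-mel features
labels = torch.randint(1, 50, (4, 20))            # dummy transcripts (no blanks)
label_lens = torch.full((4,), 20, dtype=torch.long)
unlabeled = torch.randn(4, 100, 80)

alpha = 0.5   # hypothetical weight balancing the two objectives
loss = supervised_loss(labeled, labels, label_lens) + alpha * self_supervised_loss(unlabeled)
loss.backward()
```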


[3] Wav2Vec 2.0: A Framework for Self-Supervised Learning of Speech Representations

Methodology:
• Wav2Vec 2.0 introduces a self-supervised learning framework that learns robust speech representations directly from raw audio waveforms.
• The model has two primary components: a feature encoder, which transforms raw audio into a sequence of latent speech representations, and a Transformer-based context network, which processes those representations to capture contextual information.
• Training uses a contrastive loss so that the model learns to distinguish the true quantized representation of the input audio from distractors, allowing it to learn rich speech features without labeled data.

Datasets:
• Pre-trained on a large, diverse collection of unlabeled audio; the bulk of the training data comprises 960 hours of read speech from the LibriSpeech corpus.
• Further robustness and generalization are obtained by fine-tuning the model on smaller labeled datasets, including Common Voice and LibriSpeech subsets; this combination lets the model take advantage of self-supervised learning while benefiting from supervised fine-tuning.

Results:
• Wav2Vec 2.0 achieves superior results over prior state-of-the-art models on the automatic speech recognition (ASR) task, with impressive performance on benchmark datasets like LibriSpeech.
• The reduction in WER is massive, exceeding previous methods; for example, after fine-tuning with labeled data it achieves a WER of 1.9% on the LibriSpeech test-clean set.
• The framework is versatile for speech representation learning and can be used effectively for many languages and dialects because of its ability to learn from unlabeled data.

Advantages:
• Self-supervised learning: pre-training requires only raw audio, not labeled data.
• Outperforms speech recognition models relying on classic approaches while using far fewer labeled examples.
• Transferable features: learns robust speech representations that generalize across tasks; low label dependency makes it suitable for low-resource languages.

Disadvantages:
• Computationally expensive: pre-training large models on massive amounts of raw, unlabeled audio is very compute-intensive.
• Implementation is complex: self-supervised methods such as Wav2Vec 2.0 require careful implementation and fine-tuning.
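The contrastive objective described above can be written compactly: the context vector at a masked position should score higher against its true quantized latent than against distractors from other time steps. The sketch below is our minimal illustration of that InfoNCE-style loss; the shapes and temperature are illustrative, not Wav2Vec 2.0's exact hyperparameters.

```python
# Minimal sketch of a Wav2Vec-2.0-style contrastive loss over masked positions.
import torch
import torch.nn.functional as F

def contrastive_loss(context, targets, temperature=0.1):
    """context, targets: (T, D) context vectors and true quantized latents."""
    # Row t scores position t's context against every candidate latent;
    # the diagonal holds the positives, all other columns act as distractors.
    sims = F.cosine_similarity(context.unsqueeze(1), targets.unsqueeze(0), dim=-1)
    labels = torch.arange(context.size(0))
    return F.cross_entropy(sims / temperature, labels)

T, D = 50, 256                       # illustrative sequence length and dimension
loss = contrastive_loss(torch.randn(T, D), torch.randn(T, D))
```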


[4] Language Models are Few-Shot Learners

Methodology:
• Introduces GPT-3, a large-scale language model built on a transformer architecture with 175 billion parameters.
• Shows that the model can perform tasks with few-shot, one-shot, or even zero-shot learning: it produces useful responses from only a handful of examples supplied at inference time.

Datasets:
• Trained on roughly 570 GB of text drawn from a broad mix of internet sources, including Common Crawl, WebText, Wikipedia, and books.
• Evaluated on diverse NLP benchmarks, for example question answering (e.g., TriviaQA), cloze tasks, language translation, and text generation.

Results:
• GPT-3 often outperformed smaller models on many NLP tasks and, in some cases, even equaled or surpassed supervised baselines.
• It showed weaknesses in particular areas such as arithmetic and common-sense reasoning, despite its few-shot learning abilities.

Advantages:
• Few-shot learning capacity eliminates the need for large amounts of labeled data; no task-specific fine-tuning is necessary for versatile performance on a very wide range of NLP tasks.
• Flexible task adaptation from pre-trained knowledge, with reduced computational requirements for new tasks compared to traditional per-task models.
• Generalizes effectively to unseen tasks from few examples; a single model handles many different linguistic tasks efficiently.

Disadvantages:
• Heavy computational cost of pre-training such a large model.
• Very sensitive to prompt design, which can hurt performance; weak on tasks that are too complex or domain-specific.
• The model can inherit biases from its training data; there is limited interpretability of how few-shot learning works, and results may not generalize consistently across tasks and languages.
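Few-shot learning, as described above, amounts to placing a handful of input-output pairs in the prompt and letting the model infer the task with no gradient update. The snippet below illustrates that prompt construction with made-up Hindi-English exemplars in keeping with this survey's multilingual theme.

```python
# Illustration of few-shot prompting: the "training data" is a handful of
# exemplars embedded in the prompt; the exemplar pairs here are invented.
examples = [
    ("mera naam Ravi hai", "My name is Ravi."),
    ("aaj mausam accha hai", "The weather is nice today."),
]

def few_shot_prompt(examples, query):
    lines = ["Translate Hindi to English."]
    for src, tgt in examples:
        lines.append(f"Hindi: {src}\nEnglish: {tgt}")
    lines.append(f"Hindi: {query}\nEnglish:")     # the model completes this line
    return "\n\n".join(lines)

print(few_shot_prompt(examples, "main ghar ja raha hoon"))
```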


[5] Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90% ChatGPT Quality

Methodology:
• Vicuna is developed by fine-tuning the open-source large language model LLaMA on human-chatbot conversation data.
• High-quality conversation transcripts are used to further the chatbot's ability to produce human-like responses.
• The model architecture relies on transformer networks, with effort put into balancing performance and computational efficiency, much like GPT-4.

Datasets:
• User-chatbot conversation data centered on dialogues with ChatGPT; fine-tuning uses high-quality conversation transcripts to enhance natural language understanding and generation.

Results:
• In a GPT-4-based benchmark evaluation, Vicuna achieves approximately 90% of ChatGPT's quality.
• Vicuna proved impressive across tasks from dialogue generation to question answering, showing bright prospects for an open-source model in comparison with proprietary models like GPT-4.

Advantages:
• Roughly 90% of ChatGPT's quality; open source, so it can be modified, experimented with, and developed collaboratively by users.
• Cost savings compared to proprietary models like ChatGPT; the open development process increases trust and engagement.
• Improvable and fine-tunable with publicly available data; broad access for researchers and developers, a step forward for innovation in conversational AI.

Disadvantages:
• Quality is subjective and relies on assessment by GPT-4; the evaluation metrics are not diverse enough to fully characterize the chatbot's performance.
• Limited functionality in certain domains and niches; may fail to handle complex multi-turn dialogues as well as proprietary models.
• Computationally expensive to fine-tune for better performance; lacks the wide support that commercial models such as ChatGPT provide.

[6] Self-Supervised Learning with Random-Projection Quantizer for Speech Recognition

Methodology:
• The paper presents a self-supervised learning paradigm based on a random-projection quantizer (RPQ) to enhance speech recognition tasks.
• Meaningful representations of speech signals are learned by training a neural network against targets obtained by mapping high-dimensional features to a low-dimensional space through a random projection.
• Feature quantization via the RPQ forces a more compact representation while preserving essential information, reducing computational complexity and improving generalization across different speech recognition tasks.

Datasets:
• Experiments use widely adopted speech recognition datasets, namely LibriSpeech and Common Voice: large collections of read speech covering a wide range of speakers, accents, and recording conditions.
• These datasets allow the proposed method to demonstrate its effectiveness in learning robust speech representations and improving average recognition accuracy.

Results:
• Experimental results clearly show promising improvements in speech recognition accuracy from introducing self-supervised learning with the RPQ.
• The method establishes enhanced robustness to noise and variability in speech, with notable margins over the baseline models in this work.
• The model is also computationally efficient, making real-time applications feasible; overall, incorporating random-projection quantization into self-supervised learning helped achieve state-of-the-art speech recognition performance.

Advantages:
• Efficiency: the random-projection quantizer reduces the computational cost of speech recognition.
• Self-supervised learning: learns from unlabeled data, reducing reliance on large labeled datasets.
• High performance: reaches state-of-the-art results on speech recognition tasks, often with less labeled data; versatile across many languages and low-resource settings.

Disadvantages:
• Complexity: the random-projection approach may add complexity to implementation and tuning.
• High computational power: training on large datasets still requires significant computational resources.
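The quantizer described in entry [6] is striking because nothing in it is trained: a frozen random matrix projects each frame, and the nearest entry in a frozen random codebook supplies a discrete target. The sketch below is our minimal rendering of that idea; the dimensions and codebook size are illustrative.

```python
# Minimal sketch of a random-projection quantizer: both the projection and the
# codebook are random and frozen; only the speech encoder (not shown) trains.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
proj = torch.randn(80, 16)          # fixed random projection, never updated
codebook = torch.randn(8192, 16)    # fixed random codebook, never updated

def rpq_targets(frames):            # frames: (T, 80) log-mel features
    z = F.normalize(frames @ proj, dim=-1)      # (T, 16), unit-normalized
    c = F.normalize(codebook, dim=-1)
    return (z @ c.T).argmax(dim=-1)             # (T,) nearest-codeword indices

targets = rpq_targets(torch.randn(100, 80))
# A speech encoder is then trained to predict `targets` at masked positions.
```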

[7] PaLM: Scaling Language Modeling with Pathways

Methodology:
• Google's PaLM research studies the efficient scaling of language models using the Pathways architecture.
• Pathways permits a model to scale across different tasks and modalities by allowing parts of the network to specialize, activating only the relevant parts when executing a task and avoiding unnecessary computation.

Datasets:
• The PaLM model uses a humongous dataset of around 780 billion tokens, combining high-quality web pages, books, Wikipedia, GitHub, and multilingual corpora.
• The diversity of this dataset helps the model generalize to many domains and languages.

Results:
• PaLM achieves state-of-the-art results on a variety of benchmarks in both language understanding and generation, with few-shot learning, reasoning, and code generation among them.
• It was proven to outperform many earlier models, including GPT-3, on many tasks, demonstrating better reasoning and comprehension, particularly in few-shot scenarios.
• It was also shown to perform very well in multilingual environments, displaying versatility across a large range of tasks.

Advantages:
• Efficient scalability of models to diverse tasks; better utilization of resources by using only the relevant portion of the model.
• Improved generalization across diverse tasks and domains; improved multi-task learning without the need for retraining.
• Scalable infrastructure that consumes fewer resources; elastic, flexible handling of task diversity.

Disadvantages:
• High complexity in the implementation and management of the Pathways system; high computational resources are needed to train such large models.
• Potential bias from the large-scale data used for training; high difficulty in fine-tuning for highly specific tasks.
• Challenging to interpret due to the model's size and structure; energy consumption issues when scaling up massive models.
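The efficiency claim above rests on activating only the relevant parts of a network. Pathways itself is a training system rather than a layer design, so as a loose analogy only (not PaLM's actual architecture) the sketch below shows conditional computation via top-1 expert routing, where most parameters stay idle for any given token.

```python
# Toy illustration of conditional computation: each token is routed through
# exactly one of several expert MLPs, so only a fraction of parameters fire.
import torch
import torch.nn as nn

class Top1Router(nn.Module):
    def __init__(self, dim=64, n_experts=4):
        super().__init__()
        self.gate = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(n_experts)
        )

    def forward(self, x):                 # x: (tokens, dim)
        scores = self.gate(x).softmax(-1)
        best = scores.argmax(-1)          # one expert per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = best == i
            if mask.any():                # run each expert only on its tokens
                out[mask] = expert(x[mask]) * scores[mask, i].unsqueeze(-1)
        return out

y = Top1Router()(torch.randn(10, 64))
```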

[8] w2v-BERT: Combining Contrastive Learning and Masked Language Modeling for Self-Supervised Speech Pre-Training

Methodology:
• Combines contrastive learning (as in wav2vec 2.0) with masked language modeling (MLM, as used in BERT) in a single model.
• Contrastive learning enables the model to learn what constitutes similar and dissimilar speech representations, while the MLM objective predicts masked portions of the input speech features.
• The combination of the two allows effective self-supervised learning on raw audio data, enhancing the model's ability to understand and generate speech.

Datasets:
• Large-scale speech datasets, commonly made up of unlabeled audio data, are used to train the model.
• The availability of public corpora such as LibriSpeech is vital to this pre-training process, which requires no manual transcription.

Results:
• The w2v-BERT model surpasses prior self-supervised speech models on downstream tasks like ASR and speaker identification.
• For instance, it achieves substantial improvements in WER and shows better transferability to different speech-related tasks.

Advantages:
• Represents speech data better by incorporating masked language modeling alongside contrastive learning.
• Employs self-supervised learning, minimizing the extensive requirement for labeled data, and is therefore cost-effective and scalable; it obtains state-of-the-art performance on various speech recognition benchmarks.
• The model can support downstream tasks beyond speech recognition, shows improved generalization since it learns useful features from diverse data, and uses speech data efficiently, leading to faster convergence.

Disadvantages:
• Complex implementation: it requires the combination of two learning paradigms, making both design and training complicated.
• Extremely computationally expensive to train.
• Sensitive to the specificity of training data: performance may degrade if the training data is not diversified enough.

[9] Unsupervised Cross-Lingual Representation Learning for Speech Recognition

Methodology:
• This paper proposes unsupervised cross-lingual representation learning for speech recognition based on self-supervised learning techniques.
• The approach is built around a shared latent space in which features of speech in different languages are mapped and learned.
• A contrastive learning framework helps optimize the model so that it distinguishes between similar and dissimilar audio samples with greater accuracy.
• Using an enormous amount of unlabeled speech data, the model learns to abstract and generalize phonetic and linguistic information, avoiding the need for big labeled datasets.

Datasets:
• The paper works with a variety of speech datasets from different languages, including but not limited to LibriSpeech (English), Common Voice (as many languages as possible), and VoxForge.
• These datasets are characterized by their diversity of speakers, accents, and acoustic environments, offering a rich spectrum of speech characteristics.
• This type of dataset ensures cross-lingual feature learning, because the model is exposed to wide variations in language and phonetics that enhance performance in recognizing speech across languages.

Results:
• The proposed method is tested against several benchmarks, and considerable improvements are realized in cross-lingual speech recognition.
• The results show that the unsupervised approach captures cross-lingual representations effectively, improving recognition for languages with limited labeled data.
• For languages unseen during training, high WER reductions are achieved relative to traditional methods; the model generalizes across languages, promising applications in multilingual speech recognition.

Advantages:
• Cross-lingual learning: facilitates speech recognition in several languages without requiring labeled samples for every one of them.
• Reduced data dependency: learns good-quality representations from unlabeled multilingual speech data.
• Generalization: low-resource languages benefit from better performance through knowledge transfer from high-resource languages; the model maintains a high level of performance even with few resources per language.

Disadvantages:
• Sensitive to hyperparameters: the model depends strongly on the choice of hyperparameters, and its complexity limits the explainability of its decisions.
• Prone to overfitting small samples, leading to poor generalization.
• Complex training: unsupervised cross-lingual training is computationally expensive, and effective representation learning requires a large amount of unlabeled multilingual speech data.
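Entry [8] above combines two training signals on the same masked positions. The schematic below (our sketch, with stand-in tensors and an illustrative weighting) shows how a contrastive objective and a masked-prediction objective over the same quantized targets can be summed into one loss, which is the essence of the w2v-BERT recipe.

```python
# Schematic of a combined contrastive + masked-prediction training signal.
import torch
import torch.nn.functional as F

T, V = 60, 1024                           # frames, codebook size (illustrative)
contrastive_logits = torch.randn(T, V)    # stand-in for the contrastive module's scores
mlm_logits = torch.randn(T, V)            # stand-in for the masked-prediction module
targets = torch.randint(0, V, (T,))       # quantized token ids serving as labels
masked = torch.rand(T) < 0.4              # positions hidden from the model

loss_contrastive = F.cross_entropy(contrastive_logits[masked], targets[masked])
loss_mlm = F.cross_entropy(mlm_logits[masked], targets[masked])
loss = loss_contrastive + 1.0 * loss_mlm  # the two objectives are simply summed
```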

[10] Scaling Vision Transformers to 22 Billion Parameters

Methodology:
• Scaling a vision transformer up to 22 billion parameters involved a number of very important strategies aimed at keeping efficiency and performance intact.
• Sparse attention mechanisms were engaged to reduce computational cost and memory usage, making it possible for the model to handle larger inputs; gradient checkpointing was employed to save memory during backpropagation.
• Model parallelism and pipeline parallelism distribute the computation over multiple GPUs; custom training loops and memory-efficient techniques made training a model of this size feasible.

Datasets:
• The model is pre-trained on a vast, diverse dataset built to push its generalization capabilities on many vision tasks.
• This dataset combines curated, high-resolution images from public sources, such as ImageNet, with proprietary data holding billions of images; the aim was broad coverage of object categories, visual contexts, and edge cases.
• Pretext tasks such as masked image modeling proved beneficial for letting the transformer learn meaningful visual representations.

Results:
• The vision transformer scaled up to 22 billion parameters achieved state-of-the-art performance on various benchmarks, setting new accuracy records on ImageNet classification by large margins over previous records.
• It showed strong adaptation to downstream tasks such as object detection, segmentation, and image generation in transfer learning.
• Thanks to the optimization of attention mechanisms and computational efficiency, inference speed was found to be similar to that of much smaller models despite its size, marking a key milestone in balancing scalability and performance for large-scale vision transformers.

Advantages:
• Impressive performance enhancements over previous versions in image recognition and other vision-related tasks.
• More advanced capacity to capture rich, complex visual patterns and details; improved generalization on wide, large-scale visual datasets.
• Better accuracy in fine-grained image classification and detection tasks; enhanced flexibility to handle a wide range of visual applications; advanced feature extraction that leads to richer visual representations.

Disadvantages:
• Compute- and resource-intensive: both training and inference have high resource requirements; high memory usage requires highly advanced hardware, and training periods are long due to the model's size.
• Overfitting is more likely to occur when data is limited and the model's capacity is this large.
• Increased energy consumption reduces sustainability, and models of this size are challenging to manage and deploy.
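Among the memory-saving strategies listed above, gradient checkpointing is easy to show concretely with PyTorch's built-in utility: activations inside the checkpointed block are not stored during the forward pass but recomputed during backward. The block and sizes below are illustrative.

```python
# Minimal sketch of gradient checkpointing: trade extra forward compute for
# reduced activation memory on a large block.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

block = nn.Sequential(nn.Linear(512, 2048), nn.GELU(), nn.Linear(2048, 512))

x = torch.randn(8, 512, requires_grad=True)
y = checkpoint(block, x, use_reentrant=False)  # forward pass without caching activations
y.sum().backward()                             # the block's forward is re-run here
```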


[11] PaLM-E: An Embodied Multimodal Language Model

Methodology:
• PaLM-E is designed for embodied agents and integrates vision with language by processing textual and visual inputs.
• It begins from the PaLM (Pathways Language Model) base and adds the power of visual understanding through a pre-trained vision transformer (ViT).
• The model thus enables real-world applications in which agents can interact with objects and environments, combining perception with language for tasks such as object manipulation and navigation.

Datasets:
• The model was trained on large multimodal data containing visual inputs (images and videos) as well as textual inputs (instructions, descriptions).
• Datasets include COCO and robotic control datasets, used to teach the model visual perception and action planning in embodied environments.

Results:
• PaLM-E excels at many different multimodal tasks, such as visual question answering and robotic control.
• It greatly improves embodied reasoning and interaction capabilities, performing better than previous models on tasks requiring the understanding of both vision and language.
• It is generally more generalizable to new environments and tasks, increasing success rates in robotics and multimodal challenges.

Advantages:
• Multimodal learning: learns vision, language, and action in parallel, interacting with the environment with enhanced comprehension.
• Embodied understanding: realistic applications such as robotics and interactive systems are its main focus.
• Generalized capabilities: enables transfer of knowledge across modalities, which enhances its flexibility.

Disadvantages:
• Computationally expensive: very high compute resources are needed to train large-scale multimodal models.
• Adding multiple modalities adds complexity to the architecture of the model.
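A key interface in models of this kind is feeding visual features into the language model's input sequence. As a toy sketch (dimensions are illustrative, not PaLM-E's), the snippet below linearly projects ViT patch embeddings into the LLM's token-embedding space and prepends them to the text embeddings, so the decoder consumes one mixed sequence.

```python
# Toy sketch of a vision-to-language interface: project patch embeddings into
# the LLM embedding space and concatenate with embedded text.
import torch
import torch.nn as nn

vit_dim, llm_dim = 768, 1024
project = nn.Linear(vit_dim, llm_dim)             # learned vision-to-language adapter

patch_embeddings = torch.randn(1, 196, vit_dim)   # stand-in for ViT output
text_embeddings = torch.randn(1, 12, llm_dim)     # stand-in for an embedded prompt

visual_tokens = project(patch_embeddings)
sequence = torch.cat([visual_tokens, text_embeddings], dim=1)   # (1, 208, 1024)
# `sequence` would replace text-only embeddings at the decoder's input.
```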


[12] Listen, Think, and Understand

Methodology:
• The novelty of the study is an approach that couples audio processing with natural language understanding to enhance speech-understanding tasks.
• It follows a multimodal architecture that combines contextual information derived from text with the audio signals; a series of attention mechanisms improves the model's ability to focus on the relevant parts of the input.
• The architecture is fully end-to-end trainable, learning representations that jointly optimize audio and text inputs for various language-understanding tasks.

Datasets:
• This paper uses an extremely diverse dataset that combines speech recognition benchmarks with conversational datasets, including recordings from disparate sources such as public speech corpora, podcasts, and dialogue datasets, giving great linguistic and contextual variability.
• The dataset is supported with transcriptions and contextual tags for training and evaluating the model, enabling generalization and strong performance across different audio scenarios and language contexts.

Results:
• The model achieves a number of improvements over baseline approaches on many speech recognition tasks, with better transcription accuracy and more robust understanding of conversational context than state-of-the-art models.
• The WER and F1 evaluation metrics indicate a marked improvement in comprehension and transcription accuracy, validating the proposed multimodal approach.
• The results show that audio combined with contextual understanding performs better in real-world applications.

Advantages:
• Better recognition through a tighter integration of listening and reasoning processes; the new framework strengthens the model's grasp of context and semantics.
• Considerable performance improvements on speech-understanding benchmarks; combining different modalities offers a holistic understanding of spoken language.
• Facilitates transfer learning, making it adaptable to various tasks and domains; the new interaction model is quite intuitive for users in real-time applications.

Disadvantages:
• Inordinately data-hungry, which can be a drawback in resource-poor settings.
• The architectural complexity makes it more challenging to deploy and scale; it is computationally intensive, needing hardware resources on a large scale.
• Prone to overfitting unless regularized appropriately, which is a major problem when data is scarce; noisy or very diverse speech inputs may cause robustness issues.
• The model is hard to interpret, meaning it is not easy to comprehend what the model is deciding.


[13] Parameter-Efficient Transfer Learning for NLP

Methodology:
• Parameter-efficient transfer learning (PETL) for NLP fine-tunes pre-trained language models like BERT, GPT, or T5 by adjusting a small subset of their parameters instead of the whole model.
• Popular techniques include Adapters, Low-Rank Adaptation (LoRA), and Prompt Tuning: Adapters add a few task-specific layers between existing layers, whereas LoRA injects low-rank matrices into the weights to reduce the trainable parameters.
• These methods retain the benefits of large pre-trained models while making fine-tuning computationally more efficient, particularly when serving multiple downstream tasks.

Datasets:
• Different PETL approaches have been put to the test on numerous general-purpose NLP benchmarks: GLUE (General Language Understanding Evaluation), SuperGLUE, SQuAD (Stanford Question Answering Dataset), and task-specific datasets depending on the research focus.
• The corpus collections comprise a variety of language tasks, including sentence classification, textual entailment, and question answering; testing on diverse datasets evaluates whether PETL methods generalize across these different types of NLP tasks.

Results:
• PETL methods have proven competitive with full fine-tuning; on some benchmarks they reach the same performance or even surpass full fine-tuning while training an order of magnitude fewer parameters.
• For instance, LoRA and Adapters achieved state-of-the-art performance on the GLUE benchmark with more than a 90% reduction in trainable parameters.
• PETL is thus highly efficient for real-world applications where computation and memory resources are usually limited.

Advantages:
• Efficient transfer learning: greatly decreases the trainable parameters of NLP models.
• Flexibility: only a small fraction of the whole model adapts to downstream tasks, without full fine-tuning.
• Reduced computational cost, since only a small subset of parameters is edited; scalable deployment across multiple NLP tasks.

Disadvantages:
• Suboptimal in some settings: the method may not fully reach the performance achievable by fine-tuning all parameters.
• Task-specific limitations: quality may be very sensitive to the nature of the downstream task.
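The Adapter technique named above is a small bottleneck module with a residual connection, inserted into each transformer layer while the base model stays frozen. The following is a minimal sketch in the spirit of Houlsby et al.; the dimensions are illustrative.

```python
# Minimal bottleneck adapter: down-project, nonlinearity, up-project, residual.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, dim=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)     # start as a near-identity module
        nn.init.zeros_(self.up.bias)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))   # residual bottleneck

hidden = torch.randn(2, 16, 768)           # stand-in transformer activations
adapted = Adapter()(hidden)
# During fine-tuning, base-model parameters stay frozen; only adapters train.
```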


[14] LoRA: Low-Rank Adaptation of Large Language Models

Methodology:
• The LoRA method injects low-rank trainable matrices into the layers of the pre-trained model to fine-tune it more efficiently.
• Instead of updating every model parameter, only a small low-rank subset is updated, so the computational cost and memory requirements are drastically reduced.
• Adaptation is faster, yet models perform well under this method.

Datasets:
• LoRA is evaluated on standard NLP datasets, such as those used for GLUE, SQuAD, and machine-translation benchmarks.
• These are standard benchmarks for text classification, question answering, and machine translation, offering a range of language situations for testing LoRA's efficiency.

Results:
• LoRA significantly reduces memory and computational requirements in comparison with full fine-tuning while performing similarly or even better.
• In benchmark tasks it has lower overhead than alternative adaptation methods and is therefore attractive for adapting large language models like GPT and BERT to specific tasks, particularly in resource-constrained environments.

Advantages:
• Lower computational cost for fine-tuning large models; memory usage is also reduced compared with full model updates.
• Fewer parameters to adapt to new tasks; lower time demands during training and deployment.
• Strong performance is retained despite the reduced resource requirements; the large model can be adapted to all kinds of different domains.

Disadvantages:
• Constraining the rank too low may depress performance on difficult tasks; low-rank updates are less expressive than full-rank fine-tuning.
• Tuning and optimizing low-rank adaptations adds complication, with possible trade-offs in accuracy and generalization.
• Implementations of low-rank adaptation algorithms can also be difficult.
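The mechanism described above reduces to augmenting a frozen pretrained weight W with a trainable low-rank product B·A, scaled by alpha/r. The sketch below is a minimal rendering of that idea; the layer sizes, rank, and scaling are illustrative rather than any specific released configuration.

```python
# Minimal LoRA linear layer: frozen base weight plus a trainable low-rank update.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                     # freeze the pretrained weights
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # update starts at zero
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

layer = LoRALinear(nn.Linear(1024, 1024))
out = layer(torch.randn(4, 1024))
# Only A and B (2 * r * 1024 values) are trained, versus 1024 * 1024 for W.
```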


[15] Improving Fast-Slow Encoder Based Transducer with Streaming Deliberation

Methodology:
• A fast-slow encoder-based transducer architecture is presented, improved with a streaming deliberation mechanism.
• Two types of encoders are developed: a fast encoder processes input in real time and determines preliminary predictions, while a slow encoder consumes more context and information to refine those predictions.
• The deliberation process combines the strengths of both encoders, ensuring that the final output benefits from both speed and accuracy; this improves transducer performance in speech recognition and language translation.

Datasets:
• The paper conducts experiments on several benchmark datasets to demonstrate the performance of the presented method.
• Among these, LibriSpeech, a standard speech recognition dataset built from audiobook recordings, was used in this work; other datasets that might be used include Common Voice or TIMIT, which serve phoneme-recognition and language-modeling evaluations.
• Together, these datasets provide an all-around performance assessment covering diverse linguistic features and acoustic environments, in view of the model's generalizability and robustness in real-world applications.

Results:
• Experimental results for the fast-slow encoder-based transducer with streaming deliberation show improvements over baselines, with better word error rates and overall accuracy on the tested datasets.
• With streaming deliberation applied, the model becomes more robust, particularly in processing complex inputs with high precision and reduced errors.
• The research also notes high efficiency for real-time applications, maintaining high throughput without giving up precise predictions.

Advantages:
• The architecture enhances transducer-based speech processing and language understanding, improving speech recognition accuracy through refinement of the transducer model.
• Streaming deliberation supports real-time processing, making systems more responsive; better deliberation strategies improve the handling of complicated inputs.
• Speech-to-text and similar applications benefit from minimized latency; the system is more robust to noisy or variable input conditions.

Disadvantages:
• Increased computational complexity due to the additional deliberation process, with possible added processing latency that might affect real-time performance.
• Increased consumption of resources for training and inference; implementation and integration are complex and challenging.
• There is a potential danger of degraded performance if the deliberation process is not optimized well.
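To convey the cadence of the fast-slow idea described above, the schematic below (our toy sketch, not the paper's transducer) runs a small streaming encoder chunk by chunk for low latency, while a larger encoder periodically re-encodes the accumulated context so a deliberation step could refine earlier outputs. All module sizes, chunk lengths, and the revisit schedule are invented.

```python
# Schematic of a fast-slow encoding loop: fast per-chunk passes, periodic
# slow passes over the full history for deliberation.
import torch
import torch.nn as nn

fast = nn.GRU(80, 128, batch_first=True)   # small, streaming encoder
slow = nn.GRU(80, 256, batch_first=True)   # larger encoder, run less often

audio = torch.randn(1, 120, 80)            # 120 frames of toy features
chunk = 20
history = []
for t in range(0, audio.size(1), chunk):
    fast_out, _ = fast(audio[:, t:t + chunk])      # immediate partial result
    history.append(fast_out)
    if (t // chunk) % 3 == 2:                      # every third chunk...
        slow_out, _ = slow(audio[:, :t + chunk])   # ...revisit the full context
        # A deliberation decoder would fuse slow_out with the fast outputs
        # here to revise the running hypothesis.
```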


[16] SentencePiece: A Simple and Language Independent Subword Tokenizer and Detokenizer for Neural Text Processing

Methodology:
• This paper introduces SentencePiece, a simple subword tokenization algorithm that is independent of the language being processed.
• The method is based on unsupervised learning and can use either unigram language modeling or byte-pair encoding, both commonly used tokenization schemes.
• Input text is treated as a raw character sequence without any assumptions about linguistic boundaries such as words or phrases, making it effective for languages where word delimiters cannot be defined clearly.
• It produces a vocabulary of subword units, making it easier to address rare words by splitting them into shorter, and thus probably more frequent, subwords that can still be combined to form common words.

Datasets:
• The paper does not prescribe particular datasets, but notes that SentencePiece has been adopted for various NLP tasks, particularly in neural machine translation (NMT).
• It has been benchmarked on multilingual versions of several datasets from WMT and other diverse text corpora used in machine translation and language modeling.
• These datasets cover different languages and, more importantly, different scripts and structures, demonstrating the flexibility and efficiency of SentencePiece in practical applications.

Results:
• SentencePiece integrated important efficiency and performance improvements into neural text-processing models.
• It paved the way to make NMT and other NLP tasks easier by simplifying the handling of low-resource languages and reducing many issues related to out-of-vocabulary words.
• It also triggered better generalization in downstream tasks and less complexity in preprocessing multilingual data; it performed on a par with or better than traditional tokenizers and became a highly preferred solution for modern NLP pipelines, including Google's transformer models and multilingual language models like mBERT.

Advantages:
• Language-independent: can be applied to any language, since no language-dependent rules are used.
• Subword tokenization: rare and out-of-vocabulary words are processed by breaking them down into smaller subword units.
• Efficiency: the vocabulary size is reduced, which speeds up neural text processing; unsupervised training means no predefined token dictionary is needed.

Disadvantages:
• Potential loss of semantics: subword tokenizers can split words in ways that obscure their meanings.
• Outputs are much harder for a human to read than those of full-word tokenizers.
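Because SentencePiece ships as an open-source package, its workflow is easy to show directly. The usage sketch below trains a small unigram model on a text file and tokenizes Devanagari text into subwords; the corpus file name and vocabulary size are placeholders.

```python
# Usage sketch with the sentencepiece package: train, tokenize, detokenize.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="hindi_corpus.txt",     # hypothetical raw-text training file
    model_prefix="hi_sp",
    vocab_size=8000,
    model_type="unigram",         # byte-pair encoding is the other option
)

sp = spm.SentencePieceProcessor(model_file="hi_sp.model")
pieces = sp.encode("मेरा नाम रवि है", out_type=str)
print(pieces)                     # rare words split into reusable subword units
print(sp.decode(pieces))          # detokenization restores the original text
```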


3. PERFORMANCE

3.1 Improved multilingual speech representations

The cross-lingual representations are learned well enough to be used with minimal reliance on large labelled datasets across different languages. This offers an enormous performance boost for low-resource languages and demonstrates strongly scalable cross-lingual transfer learning. In addition, the model outperforms previously proposed state-of-the-art models by reducing reliance on language-specific labelled data.

3.2 Enhanced ASR with unlabeled data

This approach combines supervised and unsupervised learning to improve the performance of multilingual ASR, especially for low-resource languages. The joint training mechanism results in better generalization, and the model achieves higher recognition accuracy than systems trained solely with supervised learning, providing more robust performance across languages.

3.3 Robust speech recognition without labels

Wav2Vec 2.0 learns robust representations for speech recognition from raw audio using end-to-end training. Experiments demonstrate that the model can achieve state-of-the-art performance on numerous benchmarks while reducing labelled-data requirements by up to 100-fold, outperforming supervised models.

3.4 Excellent few-shot performance

This study showed that large language models such as GPT-3 can complete various natural language processing tasks with minimal task-specific data, even with few-shot learning alone. Prominent results were achieved across diverse tasks, including translation, question answering, and text generation, reflecting remarkable generalization from just a few examples.

3.5 90% of ChatGPT performance

Vicuna achieves 90% of the conversational quality of ChatGPT and is open source, making it free to use. It handles dialogue generation surprisingly well and shows much promise as a flexible, customizable chatbot solution.

3.6 Reduced cost, efficient recognition

The self-supervised learning approach makes speech recognition less computationally expensive while retaining good accuracy, making the method more efficient and scalable for real-world applications.

3.7 Efficient scaling of large models

Scaling large models with the Pathways system is very efficient because only the relevant model components are activated, reducing resource usage and improving performance across multiple tasks.

3.8 Enhanced generalization, less labelled data

By combining contrastive learning and masked language modelling, w2v-BERT improves speech representations learned without labelled data and hence offers better generalization across a wide variety of speech tasks.

3.9 Boosted low-resource language recognition

This approach effectively transfers knowledge from high-resource to low-resource languages, improving speech recognition without requiring labelled data and overcoming the challenge of processing low-resource languages.

3.10 Improved image recognition

Scaling vision transformers to 22 billion parameters offers clear improvements in image recognition tasks, although reaching state-of-the-art performance requires managing considerable computational complexity.

3.11 Enhanced task performance in robotics

The improvement in model decision-making for robotics tasks comes from integrating vision, language, and action, yielding more adaptive and accurate performance in real-world environments compared to unimodal models.

3.12 Better comprehension in real-time

Incorporating contextual reasoning into speech models greatly improves comprehension, especially in noisy or ambiguous conditions, thus improving real-time speech applications such as virtual assistants.

3.13 Fewer parameters, same accuracy

Parameter-efficient transfer learning reduces the number of trainable parameters needed for fine-tuning while maintaining high accuracy, lowering computational costs and making adaptation to new tasks practical.

3.14 Efficient fine-tuning of large models

Low-rank adaptation enables large pre-trained models to be adapted far more efficiently than a standard fine-tuning setup, cutting computational costs without loss of performance; models can therefore be adapted to new tasks more quickly and cheaply.


3.15 Faster, more accurate speech recognition

The fast-slow encoder framework provides better accuracy and speed for real-time speech recognition while maintaining the low latency needed for streaming transcription applications.

3.16 Efficient tokenization across languages

SentencePiece handles subwords in more than one language without depending on the particular rules of any language, improving model performance when processing text in multiple languages and ensuring good generalization across language barriers.

4. CONCLUSION

This survey has focused on key advances in speech, language, and multimodal learning aimed at reducing reliance on labeled data, making models deployable at scale, and furthering cross-lingual and multimodal capabilities. Techniques like Wav2Vec 2.0 and self-supervised cross-lingual learning have repositioned speech recognition so that robust representations can be learnt with very little labeled data. Innovations such as parameter-efficient transfer learning and low-rank adaptation have established that large language models can be fine-tuned efficiently without losing quality while saving computation. Other solutions, like the fast-slow encoder, have improved real-time speech transduction, making them well suited for live applications.

Contributions like Vicuna opened access to conversational AI systems through open source while providing performance comparable to proprietary systems. Multimodal learning approaches spanning vision, language, and action opened the pathway to robotics and real-world interaction. Lastly, language-independent tokenization systems improve cross-lingual processing by handling rare or unknown words efficiently across languages.

Cumulatively, these innovations are pushing speech and language processing towards efficient, scalable, and adaptable AI systems that can handle diverse real-world tasks.

ACKNOWLEDGEMENT

We would like to place on record our gratitude for the opportunity and resources given to us by Reverie Language Technologies Limited, which enabled us to carry out this survey paper. Your support has been invaluable throughout this research.

We would like to thank our guides, Dr. Saswati Rabha and Mr. Chintan Parikh, for their continuous guidance, encouragement, and meaningful feedback. Their expertise and advice at the right moments helped shape the research direction, and we are grateful for their support.

We would also like to extend our thanks and appreciation to Dr. Geetanjali Kale, Head of the Department at SCTR'S Pune Institute of Computer Technology, for her guidance and constant inspiration in creating an academic environment of growth that allowed this project to blossom.

Last but not the least, thanks to our project mentor, Dr. Arati Deshpande, for her patience, knowledge, and guidance at every stage of this project. Your commitment to our success has been truly inspiring, and we thank you.

Thanks to all of you for your contribution toward making this survey paper a rewarding and fulfilling exercise.

REFERENCES

[1] Arun Babu, Changhan Wang, Andros Tjandra, Kushal Lakhotia, Qiantong Xu, Naman Goyal, Kritika Singh, Patrick von Platen, Yatharth Saraf, Juan Pino, et al. "XLS-R: Self-supervised cross-lingual speech representation learning at scale". In: arXiv preprint arXiv:2111.09296 (2021).

[2] Junwen Bai, Bo Li, Yu Zhang, Ankur Bapna, Nikhil Siddhartha, Khe Chai Sim, and Tara N. Sainath. "Joint Unsupervised and Supervised Training for Multilingual ASR". In: 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 2022.

[3] Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, and Michael Auli. "wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations". In: Proceedings of the 34th International Conference on Neural Information Processing Systems. 2020.

[4] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. "Language Models are Few-Shot Learners". In: Advances in Neural Information Processing Systems. 2020.

[5] Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. "Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality". 2023.

[6] Chung-Cheng Chiu, James Qin, Yu Zhang, Jiahui Yu, and Yonghui Wu. "Self-supervised learning with random-projection quantizer for speech recognition". In: Proceedings of the 39th International Conference on Machine Learning. 2022.

[7] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. "PaLM: Scaling language modeling with pathways". In: arXiv preprint arXiv:2204.02311 (2022).

[8] Yu-An Chung, Yu Zhang, Wei Han, Chung-Cheng Chiu, James Qin, Ruoming Pang, and Yonghui Wu. "w2v-BERT: Combining Contrastive Learning and Masked Language Modeling for Self-Supervised Speech Pre-Training". In: 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). 2021.

[9] Alexis Conneau, Alexei Baevski, Ronan Collobert, Abdelrahman Mohamed, and Michael Auli. "Unsupervised Cross-lingual Representation Learning for Speech Recognition". In: Interspeech. 2021.

[10] Mostafa Dehghani, Josip Djolonga, Basil Mustafa, Piotr Padlewski, Jonathan Heek, Justin Gilmer, Andreas Steiner, Mathilde Caron, Robert Geirhos, Ibrahim Alabdulmohsin, et al. "Scaling vision transformers to 22 billion parameters". In: arXiv preprint arXiv:2302.05442 (2023).

[11] Danny Driess, Fei Xia, Mehdi S. M. Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, et al. "PaLM-E: An embodied multimodal language model". In: arXiv preprint arXiv:2303.03378 (2023).

[12] Yuan Gong, Hongyin Luo, Alexander H. Liu, Leonid Karlinsky, and James Glass. "Listen, Think, and Understand". In: arXiv preprint arXiv:2305.10790 (2023).

[13] Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. "Parameter-Efficient Transfer Learning for NLP". In: Proceedings of the 36th International Conference on Machine Learning. Vol. 97. 2019.

[14] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. "LoRA: Low-Rank Adaptation of Large Language Models". In: International Conference on Learning Representations. 2022.

[15] Ke Li, Jay Mahadeokar, Jinxi Guo, Yangyang Shi, Gil Keren, Ozlem Kalinli, Michael L. Seltzer, and Duc Le. "Improving fast-slow Encoder based Transducer with Streaming Deliberation". In: International Conference on Acoustics, Speech and Signal Processing (ICASSP). 2023.

[16] Taku Kudo and John Richardson. "SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing". In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. 2018.
