Exploration of representation of the different ways data analysis in real time and in its materialization in useful business information.

Cristina Meng. Master's dissertation presented to the Faculty of Sciences of the University of Porto and Altice Labs in Data Analytics, 2022

Cristina Meng

MSc. Engineering Physics

Department of Physics and Astronomy

2022

Advisor

João Manuel Viana Parente Lopes, Faculty of Sciences of University of Porto

Supervisor

Célio Gomes Abreu, Altice Labs

All corrections determined by the jury, and only those, have been made.

The President of the Jury, Porto, / /

UNIVERSITY OF PORTO

MASTERS THESIS

Exploration of representation of the different ways data analysis in real time and in its materialization in useful business information.

Author: Cristina MENG

Supervisors: João Manuel VIANA PARENTE LOPES, Célio GOMES ABREU

A thesis submitted in fulfilment of the requirements for the degree of MSc. Engineering Physics at the

Faculty of Sciences of University of Porto

Department of Physics and Astronomy

August 9, 2023

“When life gives you melons, you might be dyslexic.”
Random girl I saw on TikTok

Sworn Statement

I, Cristina Meng, enrolled in the Master Degree of Engineering Physics at the Faculty of Sciences of the University of Porto hereby declare, in accordance with the provisions of paragraph a) of Article 14 of the Code of Ethical Conduct of the University of Porto, that the content of this internship report reflects perspectives, research work and my own interpretations at the time of its submission.

By submitting this internship report, I also declare that it contains the results of my own research work and contributions that have not been previously submitted to this or any other institution.

I further declare that all references to other authors fully comply with the rules of attribution and are referenced in the text by citation and identified in the bibliographic references section. This internship report does not include any content whose reproduction is protected by copyright laws.

I am aware that the practice of plagiarism and self-plagiarism constitutes a form of academic offense.

16th December 2022

Acknowledgements

A huge thank you to my university's supervisor, Professor João Viana Lopes, for all the support given, the availability and the motivational words during the most challenging times, and to the people at Altice Labs: Célio Abreu, my supervisor at the company, and Filipa Ferreira, my team leader. Both of them welcomed me, showed availability and accompanied all of my work during this internship; it was a wonderful experience.

I would also like to thank my friends from FEUP. You guys made my choice of enrolling in informatics in my first year all worth it.

To my best friend of all time, Joana Ferreira, who was smart enough not to put herself through a master's degree. The girls that girl, girl and the girls who girln't, goren't.

To my boyfriend, for the coffee, ice cream and support given during these last few days of insanity, in which I even dared to say I wanted a cat.

And lastly, I would like to thank the tiny human being created by my sister (to whom I am also thankful): my wonderful niece. Let me raise a toast to the girl I currently love most. Auntie will buy you and your future sister all the ice cream in the world.

Abstract

Faculty of Sciences of University of Porto

Department of Physics and Astronomy

MSc. Engineering Physics

Exploration of representation of the different ways data analysis in real time and in its materialization in useful business information.

This internship, carried out at Altice Labs, aims to search for patterns in the clients' activity and to look for anomalous behaviour. This is important because it allows the operator to know its clients and take the necessary measures in case anything anomalous shows up. If the operator does not know what a normal pattern looks like, anything anomalous that comes up will look normal to it.

In order to do so, it is essential, first, to anonymise the data: numerous problems arose from this process, including mixed-up records and fields with the wrong data.

Second, to analyze the data, the most common data science technologies were used: SQL, R and Python. The last two have libraries that enable clustering, a process that aggregates clients with similar attributes, enabling their general characterization.

Our patterns allowed us to discover not only clients with a million records of which almost 100% are unanswered phone calls, and others at the opposite extreme, but also clients with only a couple of calls.


Resumo

Faculty of Sciences of University of Porto

Department of Physics and Astronomy

Integrated Master's in Engineering Physics

Exploration of representation of the different ways data analysis in real time and in its materialization in useful business information.

This internship, carried out at Altice Labs, aims to find patterns in the activity of the different clients and to detect anomalous behaviour. This is important because it allows the operator to know its clients and take the necessary measures if something outside the pattern appears. If the operator does not know its client's normal pattern, something out of the ordinary will, in principle, look normal to it.

To that end, it is essential, first of all, to anonymise the data: several problems arose from this process, including swapped records and fields with wrong data.

Secondly, to analyse the data, the most common data science technologies were used: SQL, R and Python. The latter two have libraries that enable clustering, a process that groups clients with similar attributes, making their general characterisation possible.

The patterns found allowed us to discover that there are not only clients with a million records that have almost 100% unanswered calls, and others at the opposite extreme, but also clients with only a few calls.

Contents

Acknowledgements
Abstract
Resumo
Contents
List of Figures
Glossary
1 Introduction
  1.1 Altice Labs
  1.2 Context and Motivation
  1.3 Data
    1.3.1 General Data Protection Regulation (GDPR)
    1.3.2 Types of Data
  1.4 Internship and Report
2 Methods
  2.1 Technology Used
    2.1.1 Structured Query Language (SQL)
    2.1.2 Google Colaboratory
    2.1.3 Python
    2.1.4 R
  2.2 Workflow
    2.2.1 Features
      2.2.1.1 Data Encoding
    2.2.2 Data Cleaning
      2.2.2.1 Creation of a Calendar
  2.3 Global Characterization
    2.3.1 Call Pattern
  2.4 Client Segmentation
    2.4.1 Clustering
      2.4.1.1 Hierarchical Clustering
    2.4.2 Daily Activity
    2.4.3 Answered and Unanswered Calls
    2.4.4 Internal Service & Service
      2.4.4.1 Approach 1: Waiting Queue
      2.4.4.2 Approach 2
3 Analysis
  3.1 Global characterization
    3.1.1 Comparison between old and new datasets
    3.1.2 Call Type
    3.1.3 Time frame analysis
  3.2 Client Segmentation
    3.2.1 Session Answered
      3.2.1.1 Client 80933 (Almost all answered calls)
      3.2.1.2 Client 1303 (Almost no answered calls)
    3.2.2 Group Services
      3.2.2.1 Approach 1: Waiting Queue
      3.2.2.2 Approach 2: Call Pattern of Group Services
4 Conclusions
Bibliography
List of Figures

1.1 Map of the locations of Altice Labs' offices. Retrieved from [3].
1.2 Strategic themes that Altice Labs' RDI activities revolve around. Retrieved from [3].
1.3 What ABC has to offer. Retrieved from [3].
1.4 Different types of data variables. Retrieved from [16].
2.1 Schematic of the access to the database. Adapted from [20, 21].
2.2 The data processing pipeline. Adapted from [40, 41].
2.3 Hierarchical Clustering
2.4 Different criteria for Hierarchical Clustering. Adapted from [40].
2.5 Schematic of Ward's method. Retrieved from [49].
3.1 Number of events associated with each call.
3.2 Internal services that a call goes through for GID = '992005ff-69d8-4a62bb21-0f53a0234fe5', for both the old and new datasets.
3.3 Different call types distribution.
3.4 Hourly activity of each day with the average of events represented in blue.
3.5 Weekly activity of some days of March.
3.6 Normalised hourly activity, in which the number between brackets represents the day of the week. Zero for Monday and six for Sunday.
3.7 Distribution of answered and not answered calls throughout March for every day and hour.
3.8 Number of clients that only have unanswered and answered voice calls.
3.9 Example of a sequence of test calls.
3.10 Circular dendrogram resulting from the Hierarchical Clustering of the clients that have answered all calls.
3.11 Distribution of the percentage of unanswered calls, divided by order of magnitude of calls that the clients receive.
3.12 Relation between the number of total calls and the percentage of unanswered ones.
3.13 Distribution of the percentage of unanswered calls.
3.14 Activity of clients 1303 (in green) and 80933 (in pink).
3.15 Daily hourly activity of Client '80933'.
3.16 Normalized hourly activity of Client '80933'.
3.17 Internal Services of client 80933 and respective number of occurrences for answered (in green) and unanswered (in pink) sessions.
3.18 Daily hourly activity of Client '1303'.
3.19 Normalized hourly activity of Client '1303'.
3.20 Histogram of the occurrences of LEC = '33001' for client 1303 as a function of time.
3.21 Internal Services of client 1303 and respective number of occurrences for answered (in green) and unanswered (in pink) sessions.
3.22 Clients' count of the times they go through the service Waiting Queue.
3.23 Number of events for each group service and each client.
3.24 Clients chosen based on the number of events per service and service distribution.
3.25 Activity of Client 114114.
3.26 Activity of Client 91763.
3.27 Call Pattern for an encoded GID found through IVR.

Glossary

FCUP Faculty of Sciences of University of Porto

RDI Research, Development and Innovation

IoT Internet of Things

ABC Advanced Business Communications

AS Application Server

SEC Convergent Enterprise Service

CDR Call Detail Records

GDPR General Data Protection Regulation

EU European Union

EEA European Economic Area

ML Machine Learning

DB Database

GID Global Call ID Identifier

LEC Logic End Cause

HG Hunt Group

WQ Waiting Queue

PA Pre-Answer

IVR Interactive Voice Response

WMS Windless Media Server

IP Internet Protocol

Chapter 1 Introduction

In this document, a synthesis of the internship carried out at Altice Labs is presented. The internship took place under the curricular unit "Internship" of the Master's in Engineering Physics of the Faculty of Sciences of University of Porto (FCUP). It was hosted by Altice Labs, SA, headquartered in Aveiro, but was carried out remotely at Altice's office in Porto.

1.1 Altice Labs

With a legacy of several decades, Altice Labs, previously known as PT Inovação, has been a catalyst of innovation and transformation, asserting a position of leadership in the RDI (Research, Development and Innovation) area in Portugal. For example, in 2021, Altice Labs announced an innovation in fibre optics that revolutionised how these networks are deployed on the ground. With this, they doubled their capacity, enabling them to serve twice as many customers as with the current technology. Over 300 million people in more than 60 countries use products and solutions made at Altice Labs [1].

Altice Labs has offices in the Algarve, the Madeira and Azores archipelagos, Porto, Lisbon and Viseu, and also outside Europe, such as in Brazil, through centres dedicated to the research and development of advanced Telecommunications and Information Systems solutions [2].


1.2 Context and Motivation

Altice Labs continually supports collaborative RDI projects, working in partnership and cooperation with universities all over the world, R&D institutions, partners, suppliers and customers [3]. Fig. 1.2 presents the RDI activities in which Altice Labs is involved. These include Artificial Intelligence & Machine Learning, Smart Cities, Smart Living, Internet of Things (IoT), Digital Services & Platforms, and 5G Future Networks, including the optical evolution framework [3].

FIGURE 1.1: Map of the locations of Altice Labs' offices. Retrieved from [3].

ABC (Advanced Business Communications) is an Enterprise Collaboration Solution in which ABC-AS (where AS stands for Application Server), previously designated as SEC (Convergent Enterprise Service), delivers a unified communication experience across all user devices (fig. 1.3). One of its main features is Analytics, which provides usage information on customer resources and services. However, the corporate telecommunications environment, combined with the emergence of new services and technologies, is increasingly complex and changes ever faster. Simply collecting and presenting data from all operational and business activities is insufficient. A need arises to manipulate and analyse the data from different perspectives, fostering the interpretation and conversion of these data into useful information that companies can use to understand and enhance their business.

FIGURE 1.2: Strategic themes that Altice Labs' RDI activities revolve around. Retrieved from [3].

Every telecommunication company stores detailed call records, called CDRs (Call Detail Records), which consist of information captured by the telecommunications providers during the calls, messages and internet activity of a customer. Telecommunication providers routinely collect these records to detect congested cell towers and manage additional bandwidth, to bill customers for cellular usage, and to troubleshoot and improve the network's performance. For each communication between individuals, the mobile operator keeps a CDR that stores metadata such as the type of call, whether the call was incoming or outgoing, the starting and ending time of the call and the call duration. Since CDRs store only communication-related properties and do not reveal the contents of the communication, they can be considered metadata [4]. These records are essential since they allow the identification of similar usage patterns and information that telecom operators and their clients can use to their advantage [5, 6]. Knowing the clients' patterns allows the telecommunications operator to understand what anomalous behaviour might look like and, accordingly, make the appropriate decisions. Another critical aspect of these call patterns is that they can present distinct characteristics even though multiple clients may follow the same pattern. Characterising these clients would be in the operator's best interest, as it may help with future business plans.

FIGURE 1.3: What ABC has to offer. Retrieved from [3].

1.3 Data

With the invention of the Internet and its technological advancements, data is at the centre of everything. The volume of data collected and stored globally is growing exponentially, and every day data is being used in new ways. Worldwide demand for digital services and growth results in more data to protect: data privacy and security have become key social and business issues [7]. Data protection is about securing digital information while keeping data usable for business purposes without trading away customer or end-user privacy [8]. It helps to reduce risk and enables a business or agency to respond quickly to threats. In 1995, the European Data Protection Directive was passed [8]. However, as it was adopted when the Internet was still in its early stages, it established only minimum data privacy and security standards [9]. Stricter guidelines and restrictions needed to be adopted. In 2016, the EU adopted the General Data Protection Regulation (GDPR) [9], which is discussed below.

1.3.1 General Data Protection Regulation (GDPR)

The GDPR, Regulation (EU) 2016/679, is a regulation in European law on the privacy and protection of personal data, applicable throughout the European Union (EU) and the European Economic Area (EEA), implemented in 2018 [10]. It regulates the export of personal data outside the EU and EEA, and one of its main goals is to give citizens and residents ways to control their data [10].

The regulation contains clauses and requirements on how personal information is processed in the EU, and it applies to every company that operates in the EEA, no matter its country of origin [11]. The processing of personal data must respect the principles of data protection: all the data must be stored using pseudonymisation or complete anonymisation:

• Pseudonymisation is a required process for stored data that transforms personal data in such a way that the resulting data cannot be attributed to a specific data subject without the use of additional information [12]. Moreover, it needs to ensure that the data cannot be made available without explicit consent, nor be used to identify someone without the use of additional information (a key) kept separately from the pseudonymised data [12].


• Complete anonymisation: for the data to be truly anonymised, the anonymisation process needs to be irreversible, as opposed to pseudonymisation, which has a key to decrypt the sensitive data [12, 13].

The regulation also does not allow the processing of any data outside the legal context specified in it, except when the data controller has received the explicit, opt-in consent of the data's owner. The owner also has the right to revoke this permission at any time [14].

Those responsible for processing personal data should declare any data collection, the legal framework that allows that collection, the goal of the data processing, the time frame during which the data will be stored and any sharing of information with third-party partners outside the EU. Users have the right to demand a copy of the collected data and the right to demand the deletion of that same data under certain circumstances [14]. Public authorities and companies whose activity is focused on the regular or systematic processing of personal data are required to have a data protection officer (DPO) responsible for ensuring that the processing complies with the GDPR.

1.3.2 Types of Data

The simplest form of data is multidimensional data. This type of data typically contains a set of records. Each record contains a set of fields referred to as attributes, dimensions or features. These fields describe the different properties of that record. A multidimensional dataset is defined as follows:

Definition 1.3.1 (Multidimensional Data) A multidimensional dataset $D$ is a set of $n$ records, $X_1 \ldots X_n$, such that each record $X_i$ contains a set of $d$ features denoted by $(x_i^1 \ldots x_i^d)$ [15].

The records are the table rows, and the fields are the columns. As seen from fig. 1.4, there are two broad types of data, categorical (or qualitative) and numeric (or quantitative), and each of them can be split into different categories.

Categorical data is grouped into categories that take on discrete values. It can be classified as nominal, when the categories do not have an order, or ordinal, when the categories do have an order [15, 17].

Quantitative data, as the name says, quantifies things by considering numerical values that make it countable in nature [18]. Thus, when each value of $x_i^j$ in Definition 1.3.1 is quantitative, the corresponding dataset is referred to as quantitative multidimensional data [15].


• Continuous data is measured on a continuous numerical scale and can take on a large number of possible values. Continuous data can be classified as interval, when it does not have an absolute zero and negative numbers also have a meaning, or as ratio, when it does have an absolute zero and negative numbers do not have a meaning [17].

• Discrete data measures counts or numbers of events. So while it is numerical data, it is not measured on a continuous numerical scale and hence does not fit neatly into either of the classifications above. It can therefore be treated as either categorical or continuous, depending on the number of possible values [17].

To work with categorical data, we first need to encode it numerically before processing [15, 18]. Data mining algorithms, such as Machine Learning (ML) ones, can only work with numerical data, as the models are mathematical by nature. Thus, these categories help us decide which encoding strategy can be applied to which data type.

1.4 Internship and Report

In this internship, the given use case is the search for anomalous behaviour in call patterns associated with voice communications. We explore different clients' time-series patterns to identify possible anomalous behaviour. Such behaviour encompasses traffic alterations (a drop or peak in the time-series pattern) and abnormal percentages of answered/unanswered calls, together with their root causes.

FIGURE 1.4: Different types of data variables. Retrieved from [16].

This report is divided into three main chapters, excluding the Introduction. The first one focuses on the Methods and is divided into four main topics, starting with the technology used to retrieve and analyse the data and moving on to the workflow of converting the raw data into useful data. The approach taken to characterise the data into patterns is then described, and lastly, anomalous behaviour is searched for. Throughout this internship, however, many challenges arose in retrieving the data, the main one being its anonymisation: the telecom operator's consent was necessary, not only regarding which data to provide but also regarding whether they agreed with the anonymisation process, so multiple back-and-forth meetings took place with MEO (a mobile and fixed telecommunications service and an Altice brand that revolutionised the telecommunications market in Portugal [19]). Only the fields that contained sensitive information related to the users of the ABC service were anonymised or 'zeroed', the latter meaning that the information in these fields was not obtained on import. The remaining fields were left unchanged.

MEO defined a key that allowed the anonymisation process to go through. Altice does not know this key, so this process falls into the pseudonymisation category. This strategy guaranteed that, even though the information is always anonymised, a given pseudonym always refers to the same identification. This process took around nine to ten months, so a two-month extension of the initial eleven months was necessary to have time to analyse the data.
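The property described above (the same identification always maps to the same pseudonym, and the mapping cannot be recomputed without the key) can be illustrated with a keyed hash. This is only a sketch under stated assumptions: the thesis does not say which algorithm MEO used, so HMAC-SHA256 is a stand-in, and the key, the field names `CALLER_NUMBER` and `USER_NAME`, and the sample records are all hypothetical.

```python
import hashlib
import hmac

# Hypothetical secret; in the internship the real key was known to MEO only.
SECRET_KEY = b"example-key-held-by-the-operator"

# Hypothetical field names, chosen only for this illustration.
SENSITIVE_FIELDS = {"CALLER_NUMBER"}  # replaced by a stable pseudonym
ZEROED_FIELDS = {"USER_NAME"}         # not imported at all ("zeroed")


def pseudonymise(value: str, key: bytes = SECRET_KEY) -> str:
    """Map a value to a stable token that cannot be recomputed without the key."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]  # shortened only for readability


def protect_record(record: dict) -> dict:
    """Return a copy of a CDR-like record with sensitive fields protected."""
    protected = {}
    for field, value in record.items():
        if field in ZEROED_FIELDS:
            protected[field] = None
        elif field in SENSITIVE_FIELDS:
            protected[field] = pseudonymise(str(value))
        else:
            protected[field] = value  # remaining fields stay unchanged
    return protected


records = [
    {"CALLER_NUMBER": "220000001", "USER_NAME": "Alice", "CALL_TYPE": 0},
    {"CALLER_NUMBER": "220000001", "USER_NAME": "Alice", "CALL_TYPE": 2},
    {"CALLER_NUMBER": "220000002", "USER_NAME": "Bob", "CALL_TYPE": 0},
]
protected = [protect_record(r) for r in records]
```

Because the first two records share a caller, they receive the same pseudonym, so per-client patterns survive the transformation even though the analyst never sees the original number.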

The second chapter details the analysis of the patterns and anomalous behaviour, with a few examples of what this may look like, ranging from the call patterns themselves and the percentage of unanswered calls and their logic end causes to a sample characterisation of one client and their call patterns.

The final chapter reports the conclusions drawn from this internship, the challenges faced, and considerations for the future.


Chapter 2 Methods

In order to analyze the CDRs given, we first need to understand what is available for data analysis and which tools are right for what we are trying to achieve. In terms of strategy, diagrams are presented that depict how the raw data was retrieved and converted into usable data. We studied the general activity of all the clients and their voice calls to accomplish the objectives of this internship. Clustering allowed us to aggregate clients with similar attributes, enabling their general characterization. These attributes can be the number of unanswered/answered calls, the number of calls, the functionalities a call goes through (internal services), or, most commonly, the call activity. We explore each of these in the following sections.

2.1 Technology Used

The most commonly used set of tools for data analysis includes SQL, Python and R, which are described in more detail below. Google Colab, the main tool used to run Python in the web browser, is also discussed.

The connection between these technologies is depicted in the workflow of fig. 2.1.


2.1.1 Structured Query Language (SQL)

Due to the GDPR, exporting and storing sensitive data on our local hardware was prohibited. The only way to access the data was to connect directly to the database (DB). For this reason, SQL was used to write the queries required for retrieving the data. SQL is a non-procedural language, meaning the user only needs to specify 'what to do' and not 'how to do it'.

SQL has four essential operations: creating, reading, updating and deleting data from the table where the content is stored. In this internship, we used SQL solely to read the data from the database [22, 23].

2.1.2 Google Colaboratory

Google Colaboratory, also known as Colab, is a Jupyter notebook environment that runs entirely on Google's cloud servers. Similar to Jupyter Notebook, a Colab notebook is a list of cells that can contain text or executable Python code and its respective outputs. The main advantage of using Colab is that none of the user's computer resources are used to run the notebook. It also makes it easier to share notebooks with others, since Colab notebooks are stored in Google Drive, and if a notebook contains sensitive content, the cells' output can be omitted [24, 25]. However, since we need to access our local database, we need to be able to run Colab locally. Colab allows us to connect to a local runtime using Jupyter, meaning that all code is executed on our local hardware, enabling access to our local files. In order to do so, the Jupyter extension 'jupyter_http_over_ws' needs to be installed and enabled.

FIGURE 2.1: Schematic of the access to the database. Adapted from [20, 21].

2.1.3 Python

Python is a popular programming language with plenty of libraries suited for various tasks, which is often cited as one of its strengths. The libraries used are described below.

• Pandas supports not only data import from various file formats, such as comma-separated values, SQL database tables and Microsoft Excel spreadsheets, but also various data manipulation operations such as merging, reshaping and selecting, as well as data cleaning and data wrangling features [26].

• cx_Oracle is a module that enables access to Oracle Database. Executing SQL queries is the primary way a Python application communicates with Oracle Database. Queries are executed using the method Cursor.execute(). All rows can then be fetched using the method Cursor.fetchall(). The fetch methods return data as tuples, which Pandas then converts into a dataframe [27].

• Matplotlib is Python's data visualisation and graphical plotting library. All graphs in this report were plotted using this library [28].

• Scikit-learn is a free software machine learning library for Python. It features various classification, regression and clustering algorithms, including hierarchical clustering, and it is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy [29, 30].
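The read-only query pattern described for cx_Oracle (execute a query on a cursor, fetch all rows as tuples, then turn them into a table) can be sketched with Python's built-in `sqlite3` module, which follows the same DB-API cursor interface. This is an illustration under stated assumptions rather than the code used in the internship: `sqlite3` stands in for the Oracle connection, and the miniature `cdr` table and its three rows are made up so that the query has something to return.

```python
import sqlite3

# sqlite3 stands in for cx_Oracle; both expose DB-API cursors,
# so the execute()/fetchall() pattern is the same.
connection = sqlite3.connect(":memory:")
cursor = connection.cursor()

# Hypothetical miniature CDR table, created only for this illustration.
cursor.execute(
    "CREATE TABLE cdr (CLIENT_ID INTEGER, CALL_TYPE INTEGER, SESSION_ANSWERED INTEGER)"
)
cursor.executemany(
    "INSERT INTO cdr VALUES (?, ?, ?)",
    [(1303, 0, 0), (80933, 0, 1), (80933, 2, 0)],
)

# Read-only query: voice sessions (CALL_TYPE = 0) and whether they were answered.
cursor.execute(
    "SELECT CLIENT_ID, SESSION_ANSWERED FROM cdr WHERE CALL_TYPE = ? ORDER BY CLIENT_ID",
    (0,),
)
rows = cursor.fetchall()  # list of tuples, one per row

# Column names come from the cursor metadata.
columns = [description[0] for description in cursor.description]

# Plain-Python stand-in for the dataframe step: one dict per row.
table = [dict(zip(columns, row)) for row in rows]

connection.close()
```

With Pandas available, the last step would typically be `pandas.DataFrame(rows, columns=columns)`, which matches the tuple-to-dataframe conversion mentioned in the list above.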

2.1.4 R

R is a programming language focused on statistical analysis and data visualisation [31]. Like Python, it provides a collection of libraries that aid in everyday data science tasks. In this case, R was used for its libraries associated with hierarchical clustering: 'circlize' and 'dendextend'. The first allowed the plotting of a circular dendrogram, while the second allowed its customisation, e.g., colours and labels. When there are too many labels, a regular dendrogram can become hard to read, so plotting it in a circular layout is a better option.


2.2 Workflow

Unfortunately, most of the time, raw data is not immediately suitable for automated processing. In our case, the first challenge encountered was the number of features: there are over a hundred, most of them not relevant to the problem at hand. This highlights the importance of feature selection.

2.2.1 Features

The general features used in the analysis are:

• CALL_TYPE (Number): type of communication, with the possible values listed in Table 2.1.

TABLE 2.1: CALL_TYPE: Possible numbers and description.

Number  CALL_TYPE
-1      Unknown
0       Voice
1       Data
2       SMS
3       USSD (Unstructured supplementary service data): service used to receive important information, such as the available balance, via SMS after dialing a service code.
4       Fax
5       SS NOTIF, where SS stands for Supplementary Service [32].
6       CS VIDEO (Circuit-Switched Video call), where a video call occurs using a channel in the circuit-switched network.
7       DATA64K: digital transmission method where each call is carried by a 64-kbps digital stream [33].

• CLIENT_ID (Number): customer identifier. Customer is an ABC entity referring to enterprises that subscribe to the service provided by the telecommunications operator.

• INTERNAL_SERVICE (VARCHAR2): ABC functionality or list of functionalities applied during the call. The field is 'VARCHAR2' instead of 'Number' because sometimes, in one CDR, a call goes through multiple services at the same time, so the field holds a sequence of numbers separated by commas, which is not itself a number.

• INFO35 (VARCHAR2): general field prepared to accommodate a wide variety of value types. In this case, it contains the Global Call ID Identifier (GID), a generated unique identifier used to mark each call in order to allow tracing the call along its path.

• LOCATION_TYPE (VARCHAR2): type of call from location processing. It can be one of two types: ONNET (internal calls), in which the call is between numbers within the same client, or OFFNET (external calls), in which the call goes to off-net destinations (different clients). This field exists because of the different payment plans, e.g., internal calls are more likely to be free of charge than external calls.

• LOGIC_END_CAUSE (LEC) (Number): numeric code used to identify the cause of the call ending. These codes may refer to success causes, such as the call being ended by the originator after it was established successfully, or to errors, identifying why the call was not allowed to proceed.

• SERVICE (Number): ABC sub-service used in the session, responsible for generating the CDR.

• SERVICE_ACCESS_TIME (DateTime): date and time of service access, in system time.

• SESSION_ANSWERED (Binary): 1 if the session is answered and 0 if not.

Next comes the actual pre-processing of the data, which involves not only tasks such as data cleaning and data encoding but also feature extraction, which generates new features from raw data that, for some reason, are not directly comparable, such as categorical data. As mentioned previously, many ML algorithms only work on numerical data, so we resort to data encoding.

2.2.1.1 Data Encoding

Several methods can be used to convert categorical data into numerical data, the most popular ones being One-Hot Encoding and Label Encoding.

One-Hot Encoding takes a column with categorical data and, for each unique value of the categorical attribute, creates an additional column (feature) that is assigned the value one or zero. Each value is thus represented as a binary vector in which all entries are zero except the one at the index of its category, which is marked with one [34]. This method increases the dimensionality of the dataset, which is not ideal [35].
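As a minimal sketch of One-Hot Encoding, the function below builds one binary column per unique category in plain Python. The sample values reuse call type descriptions from this chapter; the function name and the choice to order the columns alphabetically are assumptions made for this illustration.

```python
def one_hot_encode(values):
    """Return the category list and one binary vector per input value."""
    categories = sorted(set(values))  # one new column per unique category
    encoded = [
        [1 if value == category else 0 for category in categories]
        for value in values
    ]
    return categories, encoded


call_types = ["Voice", "SMS", "Voice", "Data"]
categories, encoded = one_hot_encode(call_types)
# categories -> ['Data', 'SMS', 'Voice']
# encoded    -> [[0, 0, 1], [0, 1, 0], [0, 0, 1], [1, 0, 0]]
```

A single categorical column becomes three numerical ones here, which is the growth in dimensionality noted above; with Pandas, `pandas.get_dummies` performs the same transformation on a dataframe column.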


In Label Encoding, each categorical value is replaced with a numeric value between 0 and the number of classes minus 1. Each label is assigned a unique value, and it is always possible to revert it to the original value. The documentation of this technique does not explain how the numbers are attributed to the labels [36]. The problem with Label Encoding is that it might add a relationship between the labels that did not exist prior to the encoding. For example, if distance-based clustering is performed, Label Encoding may add a proximity relationship between the labels and cluster them together. However, since it does not occupy more memory, we will most likely use this technique.
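The behaviour described above, including the artificial proximity between labels, can be sketched in a few lines. Assigning the codes in alphabetical order is an assumption of this sketch, chosen only to make the result reproducible; the sample labels are illustrative.

```python
def fit_label_encoder(values):
    """Assign each distinct label a code between 0 and n_classes - 1."""
    classes = sorted(set(values))  # assumption: codes follow alphabetical order
    to_code = {label: code for code, label in enumerate(classes)}
    return classes, to_code


call_types = ["Voice", "SMS", "Voice", "Data"]
classes, to_code = fit_label_encoder(call_types)

codes = [to_code[label] for label in call_types]  # [2, 1, 2, 0]
decoded = [classes[code] for code in codes]       # reverts to the original labels

# The encoding introduces distances that have no meaning for nominal data:
# 'Data' ends up closer to 'SMS' than to 'Voice'.
distance_data_sms = abs(to_code["Data"] - to_code["SMS"])      # 1
distance_data_voice = abs(to_code["Data"] - to_code["Voice"])  # 2
```

The column count stays at one, which is the memory advantage mentioned above, but a distance-based algorithm would treat the gap between codes as meaningful.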

2.2.2 Data Cleaning

In order to analyze the events a call goes through, we decided to exclude the rows in which the internal service and the GID were NaN. NaN values are most likely due to parsing errors. Moreover, some CDRs go through multiple internal services simultaneously, so the field is no longer numerical data, as commas separate the service numbers. It was therefore also necessary to split such a CDR into multiple ones, in which everything is the same except the internal service field [37, 38].
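The two cleaning steps above can be sketched as follows. This is an illustration under stated assumptions: records are plain dictionaries, a row is dropped when either the internal service or the GID is missing (the text leaves open whether both must be missing), the GID is read from the `INFO35` field, and the sample values are made up.

```python
import math


def is_missing(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def clean_and_split(records):
    """Drop rows without service or GID; split comma-separated services."""
    cleaned = []
    for record in records:
        if is_missing(record["INTERNAL_SERVICE"]) or is_missing(record["INFO35"]):
            continue  # assumption: either missing field excludes the row
        for service in str(record["INTERNAL_SERVICE"]).split(","):
            row = dict(record)  # everything else stays the same
            row["INTERNAL_SERVICE"] = int(service.strip())
            cleaned.append(row)
    return cleaned


records = [
    {"INFO35": "gid-a", "INTERNAL_SERVICE": "12,40", "CLIENT_ID": 1},
    {"INFO35": None, "INTERNAL_SERVICE": "7", "CLIENT_ID": 1},
    {"INFO35": "gid-b", "INTERNAL_SERVICE": float("nan"), "CLIENT_ID": 2},
    {"INFO35": "gid-c", "INTERNAL_SERVICE": "7", "CLIENT_ID": 2},
]
cleaned = clean_and_split(records)
```

The first record becomes two rows (services 12 and 40), the two incomplete records are dropped, and the last record is kept with its service converted to a number. On a Pandas dataframe, the same split is usually written with `str.split(",")` followed by `DataFrame.explode`.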

Following this, we have feature transformation, a family of algorithms that creates new features from existing ones. For example, the date of an event might not be explanatory enough. As we are looking for time-series patterns, knowing the weekday might be helpful, so a new weekday feature is created from the DateTime one.

2.2.2.1 Creation of a Calendar

The time and date of each CDR come in the format 'yy.MM.DD HH:mm:ss.ffffff'. To identify possible temporal patterns, we grouped the CDRs into certain time frames, such as 'Week of the Month', 'Day of the Week', 'Month' and even 'Hour of the Day'.

The aggregation of these time frames is what we call the calendar. In case of missing hours or even dates, the respective rows were inserted in the data table with the fields not related to the date and time filled with 'NaN'. This allows for an aligned representation of the curves for the same day of the week across different weeks [38, 39].
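A calendar of this kind can be sketched with the standard `datetime` module. Several details here are assumptions: the timestamp is taken to have a space between date and time, 'Week of the Month' is computed as blocks of seven days starting on the first of the month, Monday is numbered 0, and the sample timestamps are invented.

```python
from collections import Counter
from datetime import datetime, timedelta

# Assumed layout of SERVICE_ACCESS_TIME: 'yy.MM.DD HH:mm:ss.ffffff'
TIMESTAMP_FORMAT = "%y.%m.%d %H:%M:%S.%f"


def calendar_fields(timestamp):
    """Derive the calendar time frames for one CDR timestamp."""
    moment = datetime.strptime(timestamp, TIMESTAMP_FORMAT)
    return {
        "month": moment.month,
        "week_of_month": (moment.day - 1) // 7 + 1,  # assumption: 7-day blocks
        "day_of_week": moment.weekday(),             # 0 = Monday, 6 = Sunday
        "hour_of_day": moment.hour,
    }


def hourly_counts(timestamps):
    """Count events per hour, inserting NaN for hours with no events."""
    hours = [
        datetime.strptime(t, TIMESTAMP_FORMAT).replace(minute=0, second=0, microsecond=0)
        for t in timestamps
    ]
    counts = Counter(hours)
    series = []
    current, last = min(hours), max(hours)
    while current <= last:
        series.append((current, counts.get(current, float("nan"))))
        current += timedelta(hours=1)
    return series


timestamps = [
    "22.03.07 09:15:00.000000",
    "22.03.07 09:45:30.500000",
    "22.03.07 12:05:10.250000",
]
fields = calendar_fields(timestamps[0])
series = hourly_counts(timestamps)
```

The resulting series has one entry for every hour between 09:00 and 12:00, with NaN at 10:00 and 11:00, so curves from different days can be aligned hour by hour.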

As we can see, the data mining process is a pipeline containing many phases, as represented in fig. 2.2.


2.3GlobalCharacterization

Beforegoingdeepintocharacterisingtheclients,wewillfirstbestudyingtheircallpatternsandtheservicestheygothrough.Thisisonewaytofindpatterns,asclientsmaygo throughthesameservices,designingthesamepatternornot.

2.3.1 Call Pattern

Each call has a series of events, and each GID corresponds to a different call. For each event, the call goes through an internal service, a functionality, so a first attempt to look at time-series patterns is to plot, for each GID, all the internal services the call goes through as a function of time.

In the first approach, we will analyse the call patterns of different clients and the clients themselves. This will allow us to analyse not only the most and least common services a call goes through but also which clients appear more and less frequently. Afterwards, it allows a more detailed analysis of the calls themselves: whether or not they were answered, and the reason why.

We are only interested in voice calls, so we will filter out every row that does not have 'CALL TYPE' = 0, as this is the code for voice.
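The voice filter and the per-call path can be sketched as below. The dictionary keys `GID`, `DateTime` and `INTERNAL SERVICE`, as well as the sample events, are hypothetical; only 'CALL TYPE' = 0 for voice comes from the text.

```python
from collections import defaultdict


def call_paths(events):
    """Return, per GID, the time-ordered internal services of voice calls."""
    steps_by_gid = defaultdict(list)
    for event in events:
        if event["CALL TYPE"] != 0:  # 0 is the code for voice
            continue
        steps_by_gid[event["GID"]].append(
            (event["DateTime"], event["INTERNAL SERVICE"])
        )
    # Zero-padded timestamps in a fixed format sort chronologically as text.
    return {gid: [service for _, service in sorted(steps)]
            for gid, steps in steps_by_gid.items()}


events = [
    {"GID": "a1", "DateTime": "22.03.14 09:15:31.000000", "INTERNAL SERVICE": 3, "CALL TYPE": 0},
    {"GID": "a1", "DateTime": "22.03.14 09:15:30.000000", "INTERNAL SERVICE": 2, "CALL TYPE": 0},
    {"GID": "b2", "DateTime": "22.03.14 10:00:00.000000", "INTERNAL SERVICE": 1, "CALL TYPE": 0},
    {"GID": "c3", "DateTime": "22.03.14 10:05:00.000000", "INTERNAL SERVICE": 1, "CALL TYPE": 5},
]
paths = call_paths(events)
events_per_call = {gid: len(path) for gid, path in paths.items()}
```

Each entry of `paths` is the sequence that would be plotted against time for one call, and `events_per_call` gives the number of events of each call.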

2.4 Client Segmentation

One of the possible ways to segment the different clients is by clustering them according to their characteristics. Clustering is the most common unsupervised learning technique used for exploratory data analysis, to find hidden patterns or groupings in the data [42].

FIGURE 2.2: The data processing pipeline. Adapted from [40, 41].

Unsupervised learning is a type of machine learning that uses unlabeled and unclassified training data [43]. Its main goal is to discover hidden and interesting patterns in unlabeled data, which the algorithm finds on its own, unlike what happens with supervised learning, in which the values expected for the output are known [44].

2.4.1 Clustering

One definition of clustering is: given a set of data points, partition them into a set of groups which are as similar as possible [45].

There is not only one type of clustering algorithm; the most studied ones are the partitional and hierarchical. They have been used in different applications due to their simplicity and ease of implementation relative to other clustering algorithms [45].

Partitional methods need to be provided with a set of initial seeds (or clusters), which are then improved iteratively, and can be divided into distance-based and density-based [45]. Distance-based methods optimize a global criterion based on the distance between the patterns, while density-based methods optimize local criteria based on density information of the patterns [46].

On the other hand, hierarchical methods can start with the individual data points in single clusters and build the clustering from there. The role of the distance metric is also different in these two families of algorithms. In hierarchical clustering, the distance metric is initially applied to the data points at the base level and then progressively applied to subclusters by choosing absolute representative points. However, in the case of partitional methods, the representative points chosen at different iterations can be virtual points such as the centroid of the cluster (which is nonexistent in the data) [45]. Hierarchical clustering is the method chosen for clustering, as it does not require the number of clusters to be defined in advance.

2.4.1.1 Hierarchical Clustering

Hierarchical clustering algorithms approach the problem of clustering by developing a tree structure called the dendrogram. Once the dendrogram is constructed, a horizontal line can be traced anywhere in the dendrogram, and the number of legs it intersects is the number of clusters we will have.

Hierarchical clustering can be achieved in two different ways, bottom-up (agglomerative) and top-down (divisive) clustering [45]. Divisive algorithms are more robust in the early stages compared to their counterpart, and agglomerative clustering techniques, in turn, are more understandable and by far the most popular [47]. From fig. 2.3a, we can see that agglomerative clustering works as follows: from a dataset, we attribute each data point to its own cluster, and from there, the individual data points are successively agglomerated into higher-level clusters, moving up in the hierarchy. At each stage of hierarchical clustering, the two clusters with the minimum distance between them are merged. At the same time, a dendrogram is constructed: the top of the U-link indicates a cluster merge. The two legs of the U-link indicate which clusters were merged, and their length represents the distance between the child clusters [48].

FIGURE 2.3: Hierarchical Clustering. (A) Schematic of bottom-up clustering. (B) Dendrogram.

In order to decide which clusters should be merged, appropriate criteria should be applied. These criteria can be based not only on the distances (such as Euclidean) between clusters, for example [49, 50]:

• Centroid linkage: computes the centroids of the clusters and calculates the distance between them;

• Single linkage: calculates the distance between the closest points of the clusters;

• Complete linkage: calculates the distance between the furthest points of the clusters;

• Average linkage (not so popular): calculates the distances between all pairs of points of the two clusters and takes their average as the distance, as seen in eq. 2.1:

$$D(A, B) = \frac{T_{AB}}{N_A N_B} \qquad (2.1)$$

where $T_{AB}$ is the sum of all pairwise distances between cluster A and cluster B, and $N_A$ and $N_B$ are the sizes of clusters A and B, respectively. This formula can, of course, be applied to any pair of clusters.

But also on the minimization of the dispersion of the points inside each cluster:

• Ward's method is based on the error sum of squares (ESS). The ESS for a cluster is defined as the sum of the squared Euclidean distances from the cluster's points to the cluster's centroid [51]. It combines the clusters for which the increase in within-cluster variance is the smallest.

$$ESS_j = \sum_{i=1}^{n_j} \lVert X_{ij} - \bar{X}_j \rVert^2 \qquad (2.2)$$

where $X_{ij}$ is the $i$th observation in the $j$th cluster and $\bar{X}_j$ is the cluster's centroid [51].

The distance between two clusters, 1 and 2, is how much the sum of squares will increase when they are merged [52].

FIGURE 2.4: Different criteria for Hierarchical Clustering. Adapted from [40].

FIGURE 2.5: Schematic of Ward's method. Retrieved from [49].
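The merge criteria listed in this section can be written compactly as functions of two clusters. The following is a small sketch, not the implementation used in this work; the function names and the two example clusters `A` and `B` are assumptions chosen so that the values are easy to check by hand.

```python
from itertools import product
from math import dist  # Euclidean distance between two points


def centroid(cluster):
    return tuple(sum(axis) / len(cluster) for axis in zip(*cluster))


def single_linkage(a, b):
    return min(dist(p, q) for p, q in product(a, b))


def complete_linkage(a, b):
    return max(dist(p, q) for p, q in product(a, b))


def average_linkage(a, b):
    # D(A, B) = T_AB / (N_A * N_B), as in eq. 2.1
    return sum(dist(p, q) for p, q in product(a, b)) / (len(a) * len(b))


def centroid_linkage(a, b):
    return dist(centroid(a), centroid(b))


def ess(cluster):
    # Error sum of squares of one cluster, as in eq. 2.2
    c = centroid(cluster)
    return sum(dist(p, c) ** 2 for p in cluster)


def ward_increase(a, b):
    """Increase in the total ESS if clusters a and b were merged."""
    return ess(a + b) - ess(a) - ess(b)


A = [(0.0, 0.0), (0.0, 2.0)]
B = [(3.0, 0.0), (3.0, 2.0)]
distances = {
    "single": single_linkage(A, B),
    "complete": complete_linkage(A, B),
    "average": average_linkage(A, B),
    "centroid": centroid_linkage(A, B),
    "ward": ward_increase(A, B),
}
```

For these two clusters the single and centroid distances are both 3, the complete distance is the diagonal of the rectangle, the average lies between the two, and merging raises the ESS from 2 + 2 to 13, so the Ward cost is 9.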

2.4.2 Daily Activity

It is mainly here that we make use of the calendar. It is a way to see the client's most active and inactive hours and whether the activity curve, which consists of the number of events per hour, follows the same trend for all days of the week. An unusual spike or drop in the curve could represent a possible anomaly. In order to see if all days of the week follow the same trend, independently of the number of events registered that day, a normalisation of the curves was required. This consisted in dividing the number of events by the area below the activity curve.
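As a sketch of this normalisation, assuming hourly samples and the trapezoidal rule to estimate the area (the text does not state which integration rule was used), two days with the same shape but very different volumes collapse onto the same curve:

```python
def area_under_curve(counts):
    """Trapezoidal rule with a spacing of one hour between samples."""
    return sum((left + right) / 2 for left, right in zip(counts, counts[1:]))


def normalise(counts):
    """Divide every hourly count by the area below the activity curve."""
    area = area_under_curve(counts)
    if area == 0:
        return [0.0] * len(counts)  # a day without events stays flat
    return [value / area for value in counts]


quiet_day = [0, 2, 4, 2, 0]     # hypothetical hourly event counts
busy_day = [0, 20, 40, 20, 0]   # same shape, ten times the volume

quiet_curve = normalise(quiet_day)
busy_curve = normalise(busy_day)
```

After normalisation each curve has unit area, so only the shape of the daily activity is compared.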

2.4.3 Answered and Unanswered Calls

One important aspect to analyse is whether the call was answered or not. As with the daily activity for all voice calls, in this case we will divide the activity into two curves, one for the answered calls and another for the unanswered calls. There are many reasons why a call was or was not answered: the general cause is identified by the first two digits of the five-digit 'LOGIC END CAUSE' number, and the remaining digits identify the specific cause.
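The structure of the code can be made explicit with a short sketch. The function names are hypothetical; the success groups '10' and '11' are the ones listed in table 2.2.

```python
def split_lec(code):
    """Split a five-digit LOGIC END CAUSE into (general group, specific cause)."""
    text = str(code)
    if len(text) != 5 or not text.isdigit():
        raise ValueError(f"expected a five-digit code, got {code!r}")
    return text[:2], text[2:]


SUCCESS_GROUPS = {"10", "11"}  # calls with success, per table 2.2


def is_success(code):
    return split_lec(code)[0] in SUCCESS_GROUPS


group, specific = split_lec(33001)
```

For the code 33001, the group '33' is a failure associated with the destination and '001' identifies the specific cause within that group.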

Below, in table 2.3, there are the logic end causes that appear throughout the analysis in section 3.2.1.

| Cause Group | Code |
|---|---|
| Calls with success | |
| Normal successful call | 10 |
| Call successful (but with completion forced by logic) | 11 |
| Calls without success | |
| Generic failure of a call | 30 |
| Identification/validating failure | 31 |
| Loading failure | 32 |
| Failure associated with the destination | 33 |
| Failure associated with the origin | 34 |
| Incorrect utilization | 35 |
| Problems in external system | 36 |

TABLE 2.2: General Causes for answered and unanswered calls.

Our source of concern is the percentage of unanswered calls. We visualise this data using a histogram of the percentage of unanswered calls versus the number of clients that have it. Moreover, we need to consider clients who may only have a few voice calls throughout the analysed dates. A percentage of 100% is more relevant for a client with a million voice calls than for a client with only one voice call. For this reason, we will separate the graphs by order of magnitude, from 1 to 10 million voice calls.
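A possible way to prepare the data for these histograms is sketched below, assuming each client is summarised by its total and unanswered voice calls. The client IDs and counts are invented for illustration, and the power-of-ten bucketing via `log10` is one reading of "order of magnitude".

```python
from collections import defaultdict
from math import floor, log10


def unanswered_by_magnitude(clients):
    """Group the clients' unanswered percentages by order of magnitude of volume.

    `clients` maps a client ID to (total voice calls, unanswered voice calls).
    """
    groups = defaultdict(list)
    for client_id, (total, unanswered) in clients.items():
        if total <= 0:
            continue  # no voice calls, no percentage to report
        magnitude = floor(log10(total))  # 0 for 1-9 calls, 1 for 10-99, ...
        groups[magnitude].append(100.0 * unanswered / total)
    return dict(groups)


clients = {
    "c1": (1, 1),
    "c2": (8, 2),
    "c3": (250, 100),
    "c4": (1_000_000, 400_000),
}
groups = unanswered_by_magnitude(clients)
```

Each list in `groups` would feed one histogram, so a client with a single unanswered call (100%) is never mixed with clients that have millions of calls.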

2.4.4 Internal Service & Service

ABC functionalities are classified as Services and Internal Services. Internal Services refer to specialised features within each ABC Service. Some internal services appear most commonly throughout this analysis: some refer to Group or Advanced Services, which are functionalities applied to the enterprise as a whole, and others refer to User Services, which allow the call to be processed differently depending on the settings of a specific user. Group services involve Hunt Group (HG), Waiting Queue (WQ), Pre-Answer (PA) and Interactive Voice Response (IVR), and an explanation of each is given below. The number between parentheses corresponds to the number of the Service.

| LOGIC END CAUSE | Description |
|---|---|
| 10001 | CALL OK |
| 10009 | Origin Abandoned (caller hangs up before answering) |
| 10022 | Origin Abandoned with Call Completed Elsewhere Reason |
| 11010 | Session terminated by network |
| 30055 | call Forwarded by Unconditional forwarding |
| 30057 | call Forwarded by No answer forwarding |
| 30059 | call Forwarded by Moved temporarily forwarding |
| 33001 | Busy destination |
| 33002 | Destination does not respond |
| 33004 | Nonexistent destination |
| 33007 | Customer unreachable destination |
| 34031 | Exceeded number of simultaneous calls |

TABLE 2.3: Logic End Causes that appear throughout this analysis and their description.

• Global (1): Service that manages calls to or from a user, defines if it processes the originating or receiving party, and verifies if it is necessary to apply other services to the call.

• Hunt Group (2): Service that allows organisations to distribute calls to a group of people in a company. The incoming calls are made to a single phone number and can be re-routed to multiple phone lines.

• Waiting Queue (3): Service that distributes incoming customer calls based on the call order. The caller remains on hold until an agent becomes available, at which point the call queue routes the customer to the representative [53].

• Pre-Answer (4): An audio announcement made to an incoming caller before the call is put through.

• IVR (11): Service that allows incoming callers to access information via a voice response system of pre-recorded messages without having to speak to an agent, as well as to utilise menu options via touch-tone keypad selection or speech recognition to have their call routed to specific departments or specialists [54].

Similar to the group services, an explanation of the User services is also given:

• Forward: allows the redirection of a received call to a new destination according to a set of conditions, such as temporal patterns, or due to the destination's status ('Busy', 'No Answer', and 'Not Available'), among others.

• OIP (Originating Identity Presentation): OIP is a supplementary service in which the network provides the Called Party with the trusted identity of the Calling Party.

• Out Numbers: allows defining which number is used to identify the call made by a user.

• Convergent: allows the call's origin to be identified differently depending on whether the destination is fixed or mobile. For example, the call goes to the network identified with a mobile number because the destination is mobile. However, if the destination is fixed, the call made by the same user with the same telephone is identified with a fixed number.

• NP Announce Surpress (Service Portability Announcement Inhibition): allows the activation or deactivation of the option to play an announcement identifying that the number the user is calling is ported. The ported number announcement is played when, for example, a user calls a 96xxx number which is currently ported to Vodafone. This announcement allows the user to realise that the call may have a different cost since the number now belongs to a different provider.

• On Net Display: allows defining the number type used to identify an internal call to the destination. For example, it defines if the call is identified by the user's short number or by the user's DDI (Direct Dial In number), in case the user has both.

• Number Portability: differentiation of ported numbers.

2.4.4.1 Approach 1: Waiting Queue

One of the internal services that is richest in information is the Waiting Queue. The idea is to identify which clients go through this Service more often, pick a suitable one and then check for abnormalities in the fields of this Service (table 2.4).

TABLE 2.4: Specific features used for the Waiting Queue service

| Feature | Description |
|---|---|
| INFO4 | 0: Call wasn't answered by Waiting Queue destination; |
| | 1: Call was answered by Waiting Queue destination. |
| INFO5 | 0: Call has left the waiting queue due to error; |
| | 1: Call has left waiting queue because call was answered by waiting queue; |
| | 2: Call not answered by waiting queue because the queue is full. destination or because it was canceled by the origin; |
| | 3: Call has left waiting queue because user press Intercept exit option; |
| | 4: Call has left waiting queue because max waiting time reached. |
| INFO6 | 0: The call was not forwarded; |
| | 1: The call was forwarded. |
| INFO7 | Indicates the waiting time in Waiting Queue. |

This is a very targeted approach, as we pick a specific service. However, it is also possible to study the group services in general, as we will see below.

2.4.4.2 Approach 2

In this case, we will analyse the call path for each one of these services.

However, instead of using all the global call IDs for a specific time frame as in section 2.3.1, we will take the 'USER ID' values that are present in each of the services for each client, store all the GIDs that share the same 'USER ID' and, for each one of them, plot the services a call goes through for a certain DateTime. Note that the USER ID field is typically used to store the database ID of the user who made or received the call, but since these are Group Services and therefore not applied to a specific user, the same field is used to save the database rule IDs for each Service.

For a more in-depth analysis, we will take the same clients and analyse them by the 'LOCATION TYPE' of the call, filtering the Service to Global only.


Chapter 3 Analysis

In this chapter, we will compare the datasets anonymised manually and by an automatic process, and get a general idea of the call types that our data has and which dates are worth analysing. Moreover, the activity curves and the unanswered and answered call curves for all clients will be studied, along with the most common LEC cause groups. Due to their unusually high percentage of unanswered or answered calls, some clients will be targeted, analysing their most common internal services and LECs. As the number of clients allows it (for a more readable graph), hierarchical clustering will be performed on the clients with only answered calls, to aggregate them according to their activity. Finally, a characterisation according to the volume of internal and external calls will be performed on two different clients, and an analysis of the call pattern of the group services will be shown.

3.1 Global characterization

This section is about understanding the data we have at hand: knowing what it contains, the number of CDRs per hour, per day, and so on.

3.1.1 Comparison between old and new datasets

At the start of this internship, no use case had been defined and, having only minimal data at this point, we started by analysing the Global Call ID (GID). Each call has a series of events, and each different GID corresponds to a different call. For each event, the call goes through a service. The initial approach consisted in counting the number of events per call and representing it graphically, as the number of events as a function of each GID (fig. 3.1).

We can see a few calls with an unusually high number of events. However, this analysis is no longer feasible for the new incoming data. There are around 16000 clients in the new dataset, with many thousands of records, making it impossible to analyse all calls for all existing clients. The first dataset provided, manually anonymised, was relatively poor, as access to only a day and a half of communications was provided. With the second batch, anonymised with an algorithm, we had access to around two and a half days of communications. The goal was to understand whether or not this anonymisation algorithm changed the data in any way. As previously mentioned, the number of events corresponds to the number of "jumps" performed inside the same call. Therefore, in order to compare the data from the second dataset to the first one, we decided to check if, for the same GID, the call goes through the same internal services. An example of this comparison can be seen in fig. 3.2. It is essential to remember that all rows with NaN in the 'INFO35' field were removed, as they do not allow us to trace the events back to a call.

FIGURE 3.1: Number of events associated with each call.

At first sight, the plots look very distinct because the time of the call is much more precise in the second dataset, going down to milliseconds, while in the first dataset it only goes to minutes. However, in both cases, the same internal services are passed through, meaning that the anonymisation algorithm is reliable.

3.1.2 Call Type

From this moment forward, all the analysis will be performed with the new data, the one anonymised with an automatic process rather than manually. Fig. 3.3 shows the different types of calls that appear in the dataset. Our focus will be directed to voice calls, which occupy the most significant slice with a percentage of almost 78%. The remaining slices of the pie chart include 'Data' and 'SMS', which, although not our focus, had percentages big enough to be shown as individual slices. The 'Miscellaneous' slice includes calls of types such as video messages and faxes, which alone were not significant enough to be shown as individual slices. In fact, these calls were aggregated into a broader category, in which the most significant participant is the 'NaN' values, which we cannot attribute to a type of call.

FIGURE 3.2: Internal services that a call goes through for GID = '992005ff-69d8-4a62-bb21-0f53a0234fe5', for both the old and new datasets.

3.1.3 Time frame analysis

The number of voice calls in the CDRs as a function of the hour of the day, for all of the days available, along with the hourly average, is plotted in fig. 3.4.

FIGURE 3.3: Different call types distribution.

Something that feels off in fig. 3.4 is that there is unusual activity at midnight for almost every day we have available in March. Moreover, there is a gap in the data at midday for every day. This leads us to believe that something went wrong in the anonymisation process and that the data from these two hours got mixed up. In theory, the data anonymisation process should retrieve the data from the source, anonymise it, and then write it to the target platform. The process was meant to anonymise some of the fields, and the columns with the data type 'DateTime' were not supposed to be anonymised. However, the anonymisation process, developed with the Talend tool [55], did not consider 'masking' over the 'DateTime' type of fields when obtaining the data from the source. Because of this, the 'DateTime' data came out in a text format, which was misinterpreted when placing the data on the target platform, causing the disappearance of the midday events and their appearance at midnight. Upon further research, there were also duplicated and triplicated CDRs. Due to the rush to make the data available as soon as possible, an attempt was made to write the data to the database with parallel processes. Unfortunately, these threads were consuming the same data, resulting in duplicate or triplicate records for the same CDR.

After fixing these mistakes, we carried on with the analysis, meaning that only fig. 3.4 is plotted using the bugged data. In addition, there are days for which the plotted line is not complete: on the 2nd of March the line ends at 13h, on the 9th of March there is only data from 14h onwards, and the 28th of February, even though not visible, only has activity from 21h to 23h. On the 1st of March, the activity drops at 13h and stays constant until the very end of the day. As these last two days correspond to a Monday and a Tuesday, respectively, this is very much out of the pattern of a regular weekday. Furthermore, there is also a gap of several days between the first two dates mentioned, in which there is no data. The decision was to check the number of voice calls per day, by day of the week, to see if these dates are still worth analysing when compared to the rest. The result is in fig. 3.5.

FIGURE 3.4: Hourly activity of each day with the average of events represented in blue.

Observing fig. 3.5, we can see that the 28th of February has very low activity, which is expected, as it only had 2 hours of activity. Since all the remaining data corresponds to March, not many conclusions can be drawn from this date alone, and therefore we decided to exclude it. Moreover, the lowest points of this graph reflect the lack of activity on the days mentioned previously. The fact that the 29th of March has almost the same amount of calls as these days means that somewhere in fig. 3.4, the line for this day also either breaks or starts in the middle of the day. The reason behind these days' abnormal activity is that the data collection was done by accessing a field that represents a sequence number, which is managed by caching instances. Caching is used to provide quick access to values that are often read from databases, storing them in a temporary storage instance, the cache. The cache is used since simultaneous requests for a value in the same sequence could return repeated values. For example, 'instance 1' and 'instance 2' both cache 5000 values. 'Instance 1' caches values from 1 to 5000 and 'instance 2' caches values from 5001 to 10000. When an instance's range ends, it asks the sequence for another numbering range, which assigns, for example, 10001 to 15000 to 'instance 1', and so on. When obtaining the data, since there are millions of records to be extracted, we asked for the minimum ID and the maximum ID of the database. So far, the data for those particular days have not yet arrived because they may have an ID higher than the maximum obtained so far.

FIGURE 3.5: Weekly Activity of some days of March.
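The range allocation described above can be reproduced with a toy model. The classes `Sequence` and `Instance` are hypothetical and only mimic the behaviour in the example (a cache size of 5000 values per request); they do not represent the actual database mechanism.

```python
class Sequence:
    """Shared counter that hands out consecutive ID ranges to instances."""

    def __init__(self, cache_size=5000):
        self.cache_size = cache_size
        self.next_start = 1

    def allocate(self):
        start = self.next_start
        self.next_start += self.cache_size
        return iter(range(start, start + self.cache_size))


class Instance:
    """Writer that draws IDs from its cached range and refills when it is empty."""

    def __init__(self, sequence):
        self.sequence = sequence
        self.cached = sequence.allocate()

    def next_id(self):
        try:
            return next(self.cached)
        except StopIteration:
            self.cached = self.sequence.allocate()
            return next(self.cached)


sequence = Sequence()
instance_1, instance_2 = Instance(sequence), Instance(sequence)

first_ids = (instance_1.next_id(), instance_2.next_id())
for _ in range(4999):          # exhaust the rest of instance 1's first range
    instance_1.next_id()
refill_id = instance_1.next_id()
```

Two records written at the same moment receive the IDs 1 and 5001, and after its range is used up 'instance 1' jumps to 10001. IDs are therefore not ordered in time, which is consistent with records of a given day lying above the maximum ID that was requested.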

Therefore, the data we will be analysing corresponds to the time frame between the beginning of the 10th of March and the end of the 28th of March. So far, we know the number of events of each day, that the remaining weeks are in sync with each other (fig. 3.5: the lines corresponding to weeks 2, 3 and 4 overlap), and that weekdays are much busier than weekends. However, we do not know whether the curves for weekends and weekdays follow the same pattern, regardless of the number of calls made or received that day. The idea of normalising the daily curves came up, and the result is in fig. 3.6.

From fig. 3.6, it is clear that we have two standard curves, one for the weekdays and another one for the weekends. Besides weekends being much slower, weekdays have vigorous activity within the eight daily working hours usual in Portugal, including a drop in activity during the lunch hours. Outside these working hours, which span the period from 9h until 17h or even 18h, the activity drops to almost null. For weekends, however, the peaks observed are at around 11h-12h and 19h, with a more subtle drop between these hours, which is not as abrupt as the one seen on weekdays. Not many clients have intense activity on weekends, or any activity considered relevant at all, meaning that the curves with the peaks and lows observed for the weekends are primarily due to a few clients that have this specific activity. For example, food delivery services usually peak a bit before eating hours, and hospitals are usually not closed on any day of the week.

FIGURE 3.6: Normalised hourly activity, in which the number between brackets represents the day of the week. Zero for Monday and Six for Sunday.

What is more curious is the orange line that represents the 28th of March. Compared to other weekdays, it seems to have an offset of one hour to the left. This is due to the daylight saving time change, in which the clocks moved forward one hour on the 27th of March at 01:00. With daylight saving time, 01:00 becomes 02:00, so there is no data available for 01:00, as that hour did not exist that day. The transition of the offset to UTC (Coordinated Universal Time) from UTC+0 to UTC+1 shifts the calls from the following days one hour to the left.
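The direction of the shift can be checked with a short sketch. It assumes that the CDR timestamps are recorded in UTC while the activity follows the local clock; this is an interpretation of the paragraph above, not something confirmed by the data description. Fixed offsets are used instead of a time zone database, and the date in the code is arbitrary.

```python
from datetime import datetime, timedelta, timezone

WINTER = timezone(timedelta(hours=0), "UTC+0")  # local offset before the change
SUMMER = timezone(timedelta(hours=1), "UTC+1")  # local offset after the change


def recorded_hour(local_hour, local_zone):
    """Hour stored in UTC for an event that happens at `local_hour` local time."""
    local = datetime(2000, 1, 3, local_hour, tzinfo=local_zone)  # arbitrary date
    return local.astimezone(timezone.utc).hour


before_change = recorded_hour(9, WINTER)
after_change = recorded_hour(9, SUMMER)
```

A working day that starts at 09:00 local time is recorded at hour 9 before the change and at hour 8 after it, which matches a curve displaced one hour to the left.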

3.2 Client Segmentation

Clients are characterised according to their percentage of answered and unanswered calls. For those with extreme percentages, we will perform a more in-depth analysis of the services they transit through and their LECs. For the remaining clients, we present a general review of their LECs. Finally, we will analyse the Waiting Queue service for one client and the group service patterns of two different clients.

3.2.1 Session Answered

The flux of calls that are answered or not answered is given in fig. 3.7. The two curves accompany each other, not only in shape but also in number: besides both describing the 'M' shape, the number of unanswered calls is surprisingly similar to the number of answered ones. The curve for unanswered calls is almost always below the answered one, with some more unusual drops observed on the 11th and 24th of March and a more pronounced peak on the 14th of March. These 'anomalies' are not that significant, as they are within the expected statistical variation. This leaves us wondering why the number of unanswered calls is almost equal to the number of answered ones, as one would expect many more answered calls than unanswered ones.

If we go by numbers, there is a total of 32017 clients, 6381 (20%) of which have more unanswered calls than answered ones. Of these, 1023 clients, meaning around 16%, have zero answered calls (3.2% of the total number of clients). Furthermore, 0.5% of the total number of clients have only answered calls. This information is only relevant if we have a significant number of calls: if the client in question has only one call in total within the set time frame, and this one is not answered, not much information can be retrieved from it. A statistical threshold must be applied to keep only the relevant clients, as there is not much to learn from clients that had, for example, only one call. If that call is not answered, it immediately pushes the estimated probability of not answering to 1, or to 0 if the call is answered.

FIGURE 3.7: Distribution of answered and not answered calls throughout March for every day and hour.

Out of all the existing unanswered and answered calls in the dataset, the attributed LECs are associated with the following groups, with the respective share of occurrences:

| Group of 'LOGIC END CAUSE' | Percentage of occurrence for 'SESSION ANSWERED' = 0 | Percentage of occurrence for 'SESSION ANSWERED' = 1 |
|---|---|---|
| 10 | 61.66% | 99.90% |
| 11 | 11.31% | 0.02% |
| 30 | 6.15% | 0.07% |
| 31 | 0.24% | - |
| 33 | 19.97% | 0.01% |
| 34 | 0.67% | - |
| 35 | 0.01% | - |

TABLE 3.1: Percentage of occurrences of the LEC groups for all the unanswered and answered calls.

From table 3.1, we can see that a successful call belongs to groups '10' and '11'. However, by analysing the actual LEC values of the unanswered calls, 0.67% correspond to '10001', which is the code used for 'Call OK'. There are some exceptional situations where this might happen, for example, in the Waiting Queue service, where the origin hangs up before the call is established with the Windless Media Server (WMS). WMS is a generic IP (Internet Protocol) multimedia server responsible for delivering audio content to clients who request it. However, this situation should be reanalysed to use a correct LEC, as 'Call OK' should only be used for calls that went through.

Analysing all the LECs in the dataset, none raises any suspicion, as most unsuccessful call errors are relatively common. Moreover, unanswered and answered calls can share the same LEC, depending on the situation. The most obvious case is the LEC '10001', which accounts for 99% of all the LECs for answered calls. Another example is the code '30003', corresponding to unexpected service errors. This error is quite intuitive to understand when associated with unanswered calls, but less so for answered calls. For example, if the answered call is connected to a call recorder and something unexpected happens and everything 'shuts down', it is given this LEC.

We then decided to analyse the clients that only have unanswered or only have answered voice calls. The graphs with the number of clients with a certain number of voice calls are represented in fig. 3.8, with answered calls on the left and unanswered calls on the right.

Of the clients that only have answered calls, 82% have fewer than ten voice calls within the chosen period, visible from the concentration of dots under the 25 answered voice calls in fig. 3.8. Some of these calls are test calls (fig. 3.9) that happen at the same hour for multiple days, go through the same services, and consume the same time on each occurrence.

FIGURE 3.8: Number of clients that only have unanswered and answered voice calls.

Other clients only have a single phone, as their low activity (small number of calls) requires no more. Furthermore, no conclusion can be drawn from clients that only have a single event: since this data covers less than a month, we do not know what happens next. Some guesses are that they may be either test clients making one call per month or clients that were just inactive during this period but are more active in the following months. Either way, it is not common to only have answered calls and, as seen, this happened for clients with very few calls.

We aggregated the 167 clients into a dendrogram, having previously normalised all their activity. The circular shape was chosen as it is more readable with hundreds of labels than the traditional flat dendrogram. This dendrogram, presented in fig. 3.10, was obtained through hierarchical clustering using Euclidean distance and complete linkage. At first, single linkage was used. However, the chaining phenomenon happened, in which clusters were forced together due to single elements being close to each other [56]. This does not happen for complete linkage, as this method tends to find compact clusters of approximately equal diameters [56].
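The difference between the two linkages can be seen on a toy example. The sketch below is a naive agglomerative procedure written only for illustration (it is not the library routine behind fig. 3.10), and the six one-dimensional points with slowly growing gaps are hypothetical, chosen to provoke chaining.

```python
from itertools import combinations, product
from math import dist


def single(a, b):
    return min(dist(p, q) for p, q in product(a, b))


def complete(a, b):
    return max(dist(p, q) for p, q in product(a, b))


def agglomerate(points, n_clusters, linkage):
    """Repeatedly merge the two closest clusters until `n_clusters` remain."""
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]  # i < j, so index i stays valid
        del clusters[j]
    return clusters


points = [(0.0,), (1.0,), (2.1,), (3.3,), (4.6,), (6.0,)]
chained = agglomerate(points, 3, single)
compact = agglomerate(points, 3, complete)
```

With single linkage, the first cluster keeps absorbing its nearest neighbour and ends with four points, leaving two singletons. With complete linkage, the same points are split into three pairs of similar diameter.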

FIGURE 3.9: Example of a sequence of test calls.

All the tiny clusters (the ones at the end of the circle) represent the 46 clients with only one call. The initial dataset was normalised, so each cluster should aggregate clients with similar activity patterns. However, not much can be concluded from this dendrogram, as it was drawn from clients with only a few calls. This means that activity patterns cannot be aggregated, as clients with only one call can only form a cluster by themselves. The same dendrogram was attempted for the clients with only unanswered calls. However, as seen from fig. 3.8, there are around six times more clients that never answer than clients that always answer their calls. This means that, even though it is possible to aggregate them according to their call patterns, the result would not be readable in this report. A much larger window with zoom capacity would be needed in order to be able to read the labels of all the clients.

FIGURE 3.10: Circular dendrogram resulting from the Hierarchical Clustering of the clients that have answered all calls.

For unanswered calls, a distribution of the number of clients as a function of the percentage of unanswered calls can be seen in fig. 3.11. Similar to the clients that only have answered calls, some clients only have unanswered calls, with a larger concentration under 100 voice calls but with unanswered calls up to 1200 (fig. 3.8). This, of course, is reflected in the histograms as a very noticeable bar at the outer border of the 100%, visible in the first two histograms of fig. 3.11, as these are the ones that reach the order of magnitude of the x-axis of the unanswered calls graph in fig. 3.8. We will only focus on graphs with more than one hundred calls.

A peak arises around 37.5%, and later, for a bigger number of calls, another peak appears at around 80%. In a universe of many calls, it is rare to have only answered calls, as only 0.5% of the clients are in this situation. As such, having a percentage of unanswered calls below 40% is normal, as the destination might be busy, general errors may happen, and so on. However, as the number of voice calls increases, the 80% peak starts to raise suspicion. While at first this might have seemed to be related to a high number of calls, fig. 3.12 shows they are uncorrelated, as for percentages of 60% and almost 100%, the number of calls can be very similar.

FIGURE 3.11: Distribution of the percentage of unanswered calls, divided by order of magnitude of calls that the clients receive.

The second option was to examine why the unanswered calls were terminated. We divided the clients with between $10^5$ and $10^7$ calls into those with a percentage of not answering above 60% and those with a percentage below or equal to 60%. Then, for each set, we calculated the occurrences of each LEC within the entire universe of LECs. This way, we can have an idea of the share that the LECs of one client occupy. A deeper analysis shows that the most relevant percentages from both sets represent the same LECs: '33001' (Busy destination), '33007' (Customer unreachable destination), '11010' (Session terminated by network), '10009' (Origin Abandoned) and '10022' (Origin Abandoned with Call Completed Elsewhere Reason). So, for the main part, there is not much of a relevant difference between the LECs for percentages above and below 60%. We need to consider that, when compared to the histograms with fewer calls, we have a much smaller number of clients, meaning that the distribution may be skewed. Indeed, if we take a look at the general histogram with all of the calls (fig. 3.13), there is only a single peak.

FIGURE 3.12: Relation between the number of total calls and the percentage of unanswered ones.

When taking a look at the last of the histograms from fig. 3.11, we can see that there is one client that has a percentage of almost 100% of not answering its calls, while, on the other side of the spectrum, we have one client that pretty much picks up every call it receives. Their IDs are, respectively, '1303' and '80933', and their call patterns are represented in fig. 3.14. These are clients that have completely different patterns, as client '80933' describes the 'M' curve, following the standard pattern (fig. 3.6), while, on the other hand, even though the activity of client '1303' may drop a little around noon, it pretty much stays at a constant high level. Moreover, the weekend activity for client '1303' is almost non-existent compared to the other client.

FIGURE 3.13: Distribution of the percentage of unanswered calls.

FIGURE 3.14: Activity of clients 1303 (in green) and 80933 (in pink).

3.2.1.1 Client 80933 (Almost all answered calls)

A very gentle 'M'-shaped curve, with a very slight drop in activity during the lunch hours, can be seen for client '80933' in fig. 3.15. On most days, there is non-stop activity during business hours, with activity at a more constant rate after lunch and until 18h than before lunch (the lines are less steep). Afterwards, a more abrupt drop in activity happens until 22h/23h, when the number of calls is almost null. From this, we can conclude that this client probably has many employees, mainly working from Monday to Friday, each with different lunch hours, as the dip between 11h and 13h is not as pronounced as for some other clients, as we will see below.

Even though weekends are slower, as for most of the clients, the activity curve pretty much follows the curves for the weekdays (fig. 3.16).
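A normalised hourly profile makes this kind of comparison between slow and busy days possible. The sketch below is a minimal illustration, assuming that "normalised" means dividing each day's hourly counts by that day's total; the timestamps are hypothetical and the exact normalisation used for the figures may differ.

```python
import pandas as pd

# Hypothetical call start times for one client (a weekday and a Saturday).
starts = pd.to_datetime([
    "2022-03-07 09:15", "2022-03-07 09:40",
    "2022-03-07 14:05", "2022-03-07 14:30",
    "2022-03-12 10:00", "2022-03-12 20:10",
])


def normalised_hourly_activity(starts: pd.DatetimeIndex) -> pd.DataFrame:
    """Rows: days; columns: hours 0..23; each row sums to 1."""
    df = pd.DataFrame({"date": starts.date, "hour": starts.hour})
    # Count calls per (day, hour) and make sure all 24 hours are present.
    counts = pd.crosstab(df["date"], df["hour"]).reindex(columns=range(24), fill_value=0)
    # Divide by the daily total so that days with few calls remain comparable.
    return counts.div(counts.sum(axis=1), axis=0)


profile = normalised_hourly_activity(starts)
print(profile.loc[:, [9, 10, 14, 20]])
```

With this normalisation, a weekend with far fewer calls can still be plotted on the same scale as a weekday, so only the shape of the curve is compared.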

FIGURE 3.15: Daily hourly activity of Client '80933'.

As expected for a client with almost 100% of successful calls, '10001' (Call OK) is the predominant LEC, with a percentage of 97%. In fig. 3.17, we can see the internal services that this client's calls go through, along with the number of occurrences of each one. From this figure, we can conclude that 'OIP' (Originating Identity Presentation), 'Convergent', 'NP ANNOUNCE SURPRESS' and 'NUMBER PORTABILITY' are the most relevant services for this client.

FIGURE 3.16: Normalized hourly activity of Client '80933'.

3.2.1.2 Client 1303 (Almost no answered calls)

Client 1303 has a more significant 'M' curve, with a clear drop to almost null values at lunch hours, more in line with the general picture, as seen in fig. 3.6. The most off-pattern activity corresponds to the weekends, for example the spike at 20h on the 12th of March (fig. 3.19). The spike at 13h is due to the UTC time change.

FIGURE 3.17: Internal services of client 80933 and respective number of occurrences for answered (in green) and unanswered (in pink) sessions.

In order to understand why there are so many unanswered calls for this client, their LEC were studied. Out of all the LEC present for this client, 73% correspond to '33001' (Busy destination), 16% to '10022' (Origin Abandoned with Call Completed Elsewhere Reason) and 4% to '10009' (Origin Abandoned, i.e. the caller hangs up before the call is answered). A histogram of LEC = '33001' for all the available dates and hours is represented in fig. 3.20.

FIGURE 3.18: Daily hourly activity of Client '1303'.
FIGURE 3.19: Normalized hourly activity of Client '1303'.

It follows the 'M'-shaped curve, with fewer occurrences during out-of-office and lunch hours. On weekends, there were no occurrences of this LEC. There is not enough information to understand whether such a high percentage of 'Busy destination' is a one-month situation or whether it happens all year. In the first case, it is just a busier season; in the second, there is a need to allocate more resources to this client. Something that could have been done was to split this histogram between answered and unanswered calls, to see if further conclusions could be drawn.

The second most common LEC, '10022', is most likely due to the 'Hunt Group'. For example, a call is forwarded to two different people. If one of them answers, the call to the other person is automatically hung up, originating a LEC equal to '10022'.

FIGURE 3.20: Histogram of the occurrences of LEC = '33001' for client 1303 as a function of time.

When it comes to the internal services the calls transit through, a configuration problem seems to arise, as only an almost insignificant percentage of the calls go through the 'Waiting Queue', and most calls of this client are not being answered. Similar to the previous client, there is a higher frequency of the internal services 'OIP' (Originating Identity Presentation) and 'Convergent', meaning, respectively, that the calling party's identity is shown to the called party and that the calling party allows the call's origin to be identified differently depending on whether the destination is fixed or mobile. But why do these internal services have so many unanswered calls? Due to lack of time, we did not do a more thorough analysis of this subject, but it would be worth exploring in a future project. The remaining internal services are not worth mentioning, as their percentage is insignificant compared to the ones explained above.

FIGURE 3.21: Internal services of client 1303 and respective number of occurrences for answered (in green) and unanswered (in pink) sessions.

3.2.2 Group Services

The first approach consists of analysing the internal service 'Waiting Queue' and its characterising fields, while the second approach consists of analysing the call patterns of the group services.

3.2.2.1 Approach 1: Waiting Queue

Fig. 3.22 shows us some clients that go through the internal service 'Waiting Queue'.

FIGURE 3.22: Clients' count of the times they go through the service Waiting Queue.

Of course, there are many more clients whose activity goes through this internal service. However, the count was cut off to show only values above 5000, as the primary goal was to decide which client to analyse, in this case the one with the largest number of transitions through the waiting queue. The Waiting Queue service has a high number of CDRs with SESSION ANSWERED=1, since the call is usually automatically answered by the queue. However, only calls that were effectively answered by the queue's destination should be considered answered calls. 'INFO6' is the field that provides this information. However, a problem was encountered here. While 'INFO6' should only be filled with either 0 or 1, for this client we obtained ten possible values, represented in table 3.2.

TABLE 3.2: Obtained numbers for the INFO6 field and their respective number of occurrences.

Not only is this five times the number of possible values, but the values obtained also seem random, not even including the expected value of '1'.

There are two possible explanations. Either different keys were used when the anonymisation process was performed (since there are millions of records, not all of them were extracted simultaneously, meaning that these extractions might have had different keys to anonymise the data), or there was a bug in the anonymisation process. If it is the latter, it has not yet been possible to discover the reason behind the error. This field may contain other types of information, outside the scope of the Waiting Queue, hence the need for it to be anonymised. Thus, we would not expect to see the values '0' or '1' themselves (if anonymised, they would have to be different); however, there should still be only two values. In reality, even if we had only two numbers, we would not know which one is equivalent to '0' and which to '1'. Since the main issue of having ten values instead of two was not solved, this correspondence was not investigated. These results make this analysis unfeasible.
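A check of this kind can be automated before any analysis starts. The following is a minimal sketch, assuming only that a binary field must keep exactly two distinct values after a consistent anonymisation; the function name and the report keys are hypothetical, and the counts are the ones reported in table 3.2.

```python
import pandas as pd

# An anonymised binary field may change its labels, but not its number of values.
EXPECTED_CARDINALITY = 2

# Values and occurrences as reported in table 3.2.
info6_counts = pd.Series(
    {20: 8881, 0: 8876, 95: 8662, 43: 6118, 35: 5782,
     10: 4758, 5: 4696, 75: 4650, 89: 3919, 37: 1888},
    name="occurrences",
)


def check_binary_field(counts: pd.Series, expected: int = EXPECTED_CARDINALITY) -> dict:
    """Compare the number of distinct observed values with the expected one."""
    distinct = counts.index.nunique()
    return {
        "distinct_values": distinct,
        "ratio_to_expected": distinct / expected,
        "is_consistent": distinct <= expected,
    }


report = check_binary_field(info6_counts)
print(report)
```

Running such a check on each extraction separately could also help to tell the two hypotheses apart: if every extraction contains only two values but the values differ between extractions, different keys are the more likely cause.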

3.2.2.2 Approach 2: Call Pattern of Group Services

In order to choose which clients to analyse, we looked for the ones with the most even distribution of voice calls across the group services and a large number of voice calls for each of them (fig. 3.23). Two clients were chosen, with their voice call distributions represented in fig. 3.24: client '114114' and client '91763'.
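The selection above appears to have been made by inspecting the figure. As an illustration only, one way to make "most even distribution" quantitative is the normalised Shannon entropy of each client's calls across services. The measure, the minimum-volume threshold and the toy counts below are all assumptions, not the method used in this work.

```python
import numpy as np
import pandas as pd

# Hypothetical voice-call counts per client (rows) and group service (columns).
events = pd.DataFrame(
    {"IVR": [500, 900, 10], "PA": [480, 50, 10], "Waiting Queue": [520, 50, 10]},
    index=["client_x", "client_y", "client_z"],
)


def evenness(events: pd.DataFrame) -> pd.Series:
    """Normalised Shannon entropy per client: 1.0 is perfectly even, 0.0 is one service only."""
    p = events.div(events.sum(axis=1), axis=0).to_numpy(dtype=float)
    # Treat 0 * log(0) as 0 by only taking the log where p > 0.
    logp = np.log(p, where=p > 0, out=np.zeros_like(p))
    entropy = -(p * logp).sum(axis=1)
    return pd.Series(entropy / np.log(events.shape[1]), index=events.index)


def rank_clients(events: pd.DataFrame, min_calls_per_service: int = 40) -> pd.Series:
    """Keep clients with enough calls in every service, then rank them by evenness."""
    busy = events[(events >= min_calls_per_service).all(axis=1)]
    return evenness(busy).sort_values(ascending=False)


ranking = rank_clients(events)
print(ranking)
```

In this toy data, `client_z` is perfectly even but is filtered out for having too few calls, which mirrors the two criteria stated in the text: an even distribution and a large number of calls per service.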

Alleged INFO6    Number of occurrences
20               8881
0                8876
95               8662
43               6118
35               5782
10               4758
5                4696
75               4650
89               3919
37               1888

These two clients have similar weekday activity, with the typical 'M'-shaped curve (figures 3.25 and 3.26). Although the activity on weekends seems to differ, the main difference we can spot is in the activity for external and internal calls. Client '114114' has many more internal calls than external ones, with, in general, more answered than unanswered calls for internal activity. The only exceptions are the weekends, more precisely the 20th of March, where two impressive peaks can be spotted. However, we can conclude that the external activity for this client is insignificant, as, at most, we have ten calls in one hour. On the other side of the spectrum, we have client '91763', a client with more substantial external activity, with more than twice the number of unanswered calls. This is related to the nature of the business, as this is a client whose business area is more commercial (for example, insurance companies and call centres) or a client that has a much more significant client support area. In contrast, a client with stronger internal communication, such as client '114114', is more likely a client with multiple sites that need to communicate with each other. When it comes to internal activity for client '91763', the curves for answered and unanswered calls follow each other, with the latter only surpassing the former during weekends, lunch and out-of-office hours.

FIGURE 3.23: Number of events for each group service and each client.
FIGURE 3.24: Clients chosen based on the number of events per service and service distribution. (A) Client 114114. (B) Client 91763.

FIGURE 3.25: Activity of Client 114114.

The graphs in fig. 3.25 and fig. 3.26 were generated, in Altice's interest, to give an idea of the difference between the external and internal calls in the activity of a client. This is just an example of how clients can have different characteristics reflected in their call patterns. Again, there are many characteristics one could choose from, and it is also possible to do an analysis using hierarchical clustering, similar to what we did in fig. 3.10, by aggregating clients so that those in the same cluster share similar characteristics.
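The clustering idea mentioned above can be sketched with SciPy. This is a minimal example under stated assumptions: the feature vectors are hypothetical normalised activity profiles over a few time slots, Ward linkage is chosen for illustration, and the number of clusters is fixed at two.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Hypothetical normalised activity profiles (rows: clients, columns: time slots).
client_ids = ["c1", "c2", "c3", "c4"]
profiles = np.array([
    [0.05, 0.40, 0.10, 0.40, 0.05],  # 'M'-shaped
    [0.04, 0.41, 0.12, 0.38, 0.05],  # 'M'-shaped
    [0.20, 0.20, 0.20, 0.20, 0.20],  # flat
    [0.19, 0.21, 0.20, 0.21, 0.19],  # flat
])

# Agglomerative clustering; Ward linkage merges the pair of clusters that
# increases the within-cluster variance the least.
Z = linkage(profiles, method="ward")

# Cut the tree so that at most two clusters remain.
labels = fcluster(Z, t=2, criterion="maxclust")
clusters = dict(zip(client_ids, labels))
print(clusters)
```

The same linkage matrix `Z` can be passed to `scipy.cluster.hierarchy.dendrogram` to inspect the merge order before choosing where to cut the tree.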

At first, the idea was to do an analysis similar to the one in section 3.1.1. However, too many events were registered for a single call (GID), making it impossible to understand the temporal pattern, which is where the idea of analysing the GID that appeared for specific users of group services came from. However, after the analysis, we realised that this approach did not produce the expected results, so instead of showing the results for two clients, we will only show one, in this case client '114114'.

FIGURE 3.26: Activity of Client 91763.

Although it is possible to understand that the call has transited through different services (in this case, IVR and PA), we cannot characterise it, because the pattern displayed may result from different configurations at the IVR level, which does not allow us to draw any conclusions. One possible configuration is a multilevel IVR (typically the intended configuration), where menus are cascaded, meaning that one of several available options, when chosen, delivers the call to another IVR menu, which also has options linked to another IVR menu, and so on, until eventually the chosen options deliver the call to the user. Usually, for each call, there are two records, one for the answered call and another for the forwarding, so, in fig. 3.27, there are at least four levels of IVR. Another possible configuration is an IVR receiving the call that has a fallback configuration for when the maximum number of attempts with invalid options is reached, delivering the call to a number belonging to another IVR, and so on.

It was expected that these call patterns would go through more services in a more interactive way, which did not happen. The idea was to understand the interaction between the services, which, in this case, was not possible. For example, we cannot tell from fig. 3.27 whether the person calling is pressing the right key or not. This characterisation needed to be complemented with some other sort of information, and hence it was not as useful as we thought it was going to be.

FIGURE 3.27: Call pattern for an encoded GID found through IVR.


Chapter 4 Conclusions

From the work done throughout this internship, we can draw some conclusions. In the first stage, we knew a data anonymisation process had to occur. However, for future projects, it would be best if this process were discussed and tested beforehand, so that the data is available as soon as the project starts. It should be discussed because the data owners need to approve the sharing and the anonymisation process (for this project, this phase lasted about nine to ten months), and it should be tested because, throughout the analysis, many mistakes were found: duplicated entries, entries mixed up between midday and midnight, and fields with records that did not make sense, e.g. fields that can only be filled with 0, 1 or NaN containing other values. The latter made the analysis that we wanted to do on the Waiting Queue unfeasible, as it centred around one field, and that field was bugged. We are still not sure why this happened. All of this shows how hard it is to obtain data and how important it is for it to be correct.

We saw the general activity curve for all clients, with two distinct curves for weekends and weekdays. Weekdays are busier during 9h-12h and 13h-18h, with a clear drop in activity during lunch hours and almost null activity outside working hours. On weekends, with fewer calls, there is also a slight drop between 11h and 19h. We saw several clients with different activity curves, divided by day, but in general all of them were slightly or significantly 'M'-shaped. With this, it was possible to spot unusual spikes or drops in the clients' daily activity. It was also possible to see multiple ways in which behaviour may be considered anomalous: clients with 100% unanswered calls and clients with the complete opposite (usually with very few calls), as well as clients with around a million calls and almost 100% of either unanswered or answered calls. It was found that, among the clients with few calls that were all answered, some only have one phone, meaning that on many days there is only one registered voice call. Other clients only make test calls, as the calls happen at the same hour, go to the same internal service and have the same duration. However, for some others, it was impossible to draw conclusions due to the short time frame of less than a month.

In general, 40% of calls are unanswered. The 'LOGIC END CAUSE' of these unanswered calls showed us that clients with a considerable number of calls and a more significant percentage of unanswered calls had 'Busy destination' as their main LEC, meaning that more resources should be allocated to these clients. Nevertheless, none of the LECs studied was out of place, besides the unanswered calls given the LEC 'Call OK'. These situations should be reanalysed and given a more accurate LEC.

We can also conclude that clients with similar overall activity patterns can still show different patterns once their activity is split by a characteristic such as the location of the calls, as we saw previously. Other characteristics can be studied, and clients with similar ones could be aggregated so that the operator can understand the different business plans it could implement.

In general, we can say that this internship was an experience of how the enterprise world works, with the difficulties that are usually encountered and how to overcome them. This internship served as the starting point for the analytics field within the front-end team in which I was placed. For the future, a much larger time frame should be explored, as less than one month (March) was analysed in this internship. A characterisation of the interactions between the group services was attempted and failed here, so a future project should place a bigger focus on it. Moreover, other call types beyond voice calls, including text messages, could be explored.


Bibliography

[1] Anivers´ariodaalticelabs:H´a5anosainovardeaveiroparaomundo.[Online].

Available: https://www.telecom.pt/pt-pt/media/noticias/paginas/2021/junho/ anivers%C3%A1rio-da-altice-labs-h%C3%A1-5-anos-a-inovar-de-aveiro-para-omundo.aspx [Citedonpage 1.]

[2] Alticelabs:Inovac¸aodeportugalparaomundo.(accessed:09.06.2022).[Online].

Available: https://www.europeanjobdays.eu/en/company/altice-labs [Citedon page 1.]

[3] Innovation.(accessed:09.06.2022).[Online].Available: https://www.alticelabs. com/innovation/ [Citedonpages xiii, 2, 3,and 4.]

[4] “Metadatadefinitionmeaning,”(accessed:27.09.2022).[Online].Available: https://www.merriam-webster.com/dictionary/metadata [Citedonpage 4.]

[5] J.Sammons,“Chapter10-mobiledeviceforensics,”in TheBasicsofDigitalForensics (SecondEdition),secondeditioned.,J.Sammons,Ed.Boston:Syngress,2015,pp. 145–161.[Online].Available: https://www.sciencedirect.com/science/article/pii/ B9780128016350000103 [Citedonpage 4.]

[6] A.Tauriainen,“Canyouhearmenow?acalldetailrecordbasedend-to-enddiagnosticssystemformobilenetworks,”in 201915thInternationalConferenceonNetwork andServiceManagement(CNSM),2019,pp.1–7.[Citedonpage 4.]

[7] “Whatisdataprotectionandwhydoesitmatter?”Apr2022,(accessed: 29.09.2022).[Online].Available: https://securityintelligence.com/articles/what-isdata-protection/ [Citedonpage 5.]

[8] “Whatisgdpr,theeu’snewdataprotectionlaw?”May2022,(accessed:29.09.2022). [Online].Available: https://gdpr.eu/what-is-gdpr/ [Citedonpage 5.]

57

EXPLORATIONOFREPRESENTATIONOFTHEDIFFERENTWAYSDATAANALYSISINREAL TIMEANDINITSMATERIALIZATIONINUSEFULBUSINESSINFORMATION.

[9] “Thehistoryofthegeneraldataprotectionregulation,”(accessed:29.09.2022).

[Online].Available: https://edps.europa.eu/data-protection/data-protection/ legislation/history-general-data-protection-regulation en [Citedonpage 5.]

[10] P.oftheCouncil,“Compromisetext.severalpartialgeneralapproacheshave beeninstrumentalinconvergingviewsincouncilontheproposalforageneral dataprotectionregulationinitsentirety.thetextontheregulationwhichthe presidencysubmitsforapprovalasageneralapproachappearsinannex,”(accessed: 26.04.2022).[Online].Available: http://data.consilium.europa.eu/doc/document/ ST-9565-2015-INIT/en/pdf [Citedonpage 5.]

[11] E.Union,“Directive95/46/ecoftheeuropeanparliamentandofthecouncilonthe protectionofindividualswithregardtotheprocessingofpersonaldataandonthe freemovementofsuchdata,”Oct1995,(accessed:26.04.2022).[Online].Available: https://www.refworld.org/docid/3ddcc1c74.html [Citedonpage 5.]

[12] J.Lindquist.Datascienceundergdprwithpseudonymization inthedatapipeline.(accessed:26.04.2022).[Online].Available: https://web.archive.org/web/20180418230748/https://www.dativa.com/ data-science-gdpr-pseudonymization-data-pipeline/ [Citedonpages 5 and 6.]

[13] (2019,Nov)Oquesaodadospessoais?(accessed:26.04.2022).[Online].Available: https://ec.europa.eu/info/law/law-topic/data-protection/reform/whatpersonal-data pt [Citedonpage 6.]

[14] intersoftconsulting.Keyissues:Gdprconsent.(accessed:26.04.2022).[Online]. Available: https://gdpr-info.eu/issues/consent/ [Citedonpage 6.]

[15] C.C.Aggarwal, DataMining:TheTextbook.Cham:Springer,2015.[Citedonpages 6 and 7.]

[16] A.Soetewey.Variabletypesandexamples.(accessed:24.04.2022).[Online].Available: https://towardsdatascience.com/variable-types-and-examples-cf436acaf769 [Citedonpages xiii and 7.]

[17] C.University.Dataandvariabletypes.(accessed:26.04.2022).[Online].Available: https://libguides.library.curtin.edu.au/uniskills/numeracy-skills/statistics/ data-variable-types [Citedonpages 6 and 7.]

58

[18] R.Sharma.Typesofdata:Nominal,ordinal,discrete,continuous.(accessed: 26.04.2022).[Online].Available: https://www.upgrad.com/blog/types-of-data/ [Citedonpages 6 and 7.]

[19] P.Portugal,“Alticeportugal,”(accessed:27.09.2022).[Online].Available: https://www.telecom.pt/en-us/a-pt/Pages/historia.aspx [Citedonpage 8.]

[20] M.Parmar.(2021,Mar)Structureyourcodebetteringooglecolabwithtextandcodecells.(accessed:16.09.2022).[Online].Available: https://miteshparmar1.medium.com/structure-your-code-better-in-googlecolab-with-text-and-code-cells-b6fa73feec20 [Citedonpages xiii and 10.]

[21] (2020,May)Oracledatabase.(accessed:16.09.2022).[Online].Available: https: //learnmystuff.com/learn/oracle-database-2/ [Citedonpages xiii and 10.]

[22] B.FutureSchool.(2022,Jun)Whatissql,andwhatarethebenefits,uses, andimportanceofsqlintherealworld?(accessed:27.09.2022).[Online]. Available: https://www.byjusfutureschool.com/blog/what-is-sql-and-what-arethe-benefits-uses-and-importance-of-sql-in-the-real-world/ [Citedonpage 10.]

[23] (2022,Mar)Whatissqlwhyisitimportanttolearnit?(accessed:27.09.2022). [Online].Available: https://codeop.tech/what-is-sql/ [Citedonpage 10.]

[24] (accessed:27.09.2022).[Online].Available: https://research.google.com/ colaboratory/faq.html [Citedonpage 10.]

[25] Colaboratory:Localruntimes.(accessed:02.09.2022).[Online].Available: https: //research.google.com/colaboratory/local-runtimes.html [Citedonpage 10.]

[26] Pandasuserguide.(accessed:24.04.2022).[Online].Available: https://pandas. pydata.org/pandas-docs/stable/user guide/index.html [Citedonpage 11.]

[27] (2022,May)Welcometocx oracle’sdocumentation!(accessed:16.09.2022).[Online]. Available: https://cx-oracle.readthedocs.io/en/latest/ [Citedonpage 11.]

[28] Whatismatplotlibinpython?[Online].Available: https://www.activestate.com/resources/quick-reads/what-is-matplotlib-inpython-how-to-use-it-for-plotting/ [Citedonpage 11.]

BIBLIOGRAPHY 59

EXPLORATIONOFREPRESENTATIONOFTHEDIFFERENTWAYSDATAANALYSISINREAL

TIMEANDINITSMATERIALIZATIONINUSEFULBUSINESSINFORMATION.

[29] F.Pedregosa,G.Varoquaux,A.Gramfort,V.Michel,B.Thirion,O.Grisel,M.Blondel, P.Prettenhofer,R.Weiss,V.Dubourg,J.Vanderplas,A.Passos,D.Cournapeau, M.Brucher,M.Perrot,and ´ EdouardDuchesnay,“Scikit-learn:Machinelearning inpython,” JournalofMachineLearningResearch,vol.12,no.85,pp.2825–2830, 2011.[Online].Available: http://jmlr.org/papers/v12/pedregosa11a.html [Cited onpage 11.]

[30] scikit-learn.(accessed:24.04.2022).[Online].Available: https://numfocus.org/ project/scikit-learn [Citedonpage 11.]

[31] “Israkeyprogramminglanguagefordatascience?”May2022,(accessed: 29.09.2022).[Online].Available: https://codeop.tech/is-r-a-key-programminglanguage-for-data-science/ [Citedonpage 11.]

[32] “3g,”(accessed:29.09.2022).[Online].Available: https://www.sharetechnote.com/ html/Handbook UMTS SS.html [Citedonpage 12.]

[33] “T1:Asurvivalguide,”(accessed:29.09.2022).[Online].Available: https: //www.oreilly.com/library/view/t1-a-survival/0596001274/ [Citedonpage 12.]

[34] A.Sethi.(2020,Jun)Categoricalencoding:Onehotencodingvslabelencoding. (accessed:26.04.2022).[Online].Available: https://www.analyticsvidhya.com/ blog/2020/03/one-hot-encoding-vs-label-encoding-using-scikit-learn/ [Citedon page 13.]

[35] G.L.Team.(2022,Jul)Labelencodinginpythonexplained.(accessed:26.04.2022). [Online].Available: https://www.mygreatlearning.com/blog/label-encoding-inpython/ [Citedonpage 13.]

[36] “Labelencodingvsdummyvariable/onehotencoding-correctness?”Nov2019,(accessed:26.04.2022).[Online].Available: https://stats.stackexchange.com/questions/410939/label-encoding-vsdummy-variable-one-hot-encoding-correctness [Citedonpage 14.]

[37] Aporras,“Whatisthedifferencebetweenfeatureextractionandfeatureselection?” Dec2019,(accessed:16.09.2022).[Online].Available: https://quantdare.com/ what-is-the-difference-between-feature-extraction-and-feature-selection/ [Citedon page 14.]

60

[38] “Featureselectionvsfeatureextraction.whichtousewhen?”Feb2019, (accessed:16.09.2022).[Online].Available: https://datascience.stackexchange.com/ questions/29006/feature-selection-vs-feature-extraction-which-to-use-when [Cited onpage 14.]

[39] C.Goyal,“Featuretransformationsindatascience:Adetailed walkthrough,”Aug2022,(accessed:16.09.2022).[Online].Available: https://www.analyticsvidhya.com/blog/2021/05/feature-transformationsin-data-science-a-detailed-walkthrough/ [Citedonpage 14.]

[40] W.Mohd,A.Beg,T.Herawan,andK.Rabbi, MaxDK-Means:AClusteringAlgorithm forAuto-generationofCentroidsandDistanceofDataPointsinClusters,012012,vol.316, pp.192–199.[Citedonpages xiii, 15,and 18.]

[41] U.M.Fayyad,G.Piatetsky-Shapiro,andP.Smyth,“Fromdataminingtoknowledgediscoveryindatabases,” AIMag.,vol.17,pp.37–54,1996.[Citedonpages xiii and 15.]

[42] K.ElBouchefryandR.S.deSouza,“Chapter12-learninginbigdata:Introduction tomachinelearning,”in KnowledgeDiscoveryinBigDatafromAstronomyandEarth Observation,P. ˇ SkodaandF.Adam,Eds.Elsevier,2020,pp.225–249.[Online].Available: https://www.sciencedirect.com/science/article/pii/B9780128191545000230

[Citedonpage 15.]

[43] T.Hinton,Geoffrey;Sejnowski,Ed., UnsupervisedLearning:FoundationsofNeural Computation.TheMITPress,1999.[Citedonpage 16.]

[44] K.ElBouchefryandR.S.deSouza,“Chapter12-learninginbigdata:Introduction tomachinelearning,”in KnowledgeDiscoveryinBigDatafromAstronomyandEarth Observation,P. ˇ SkodaandF.Adam,Eds.Elsevier,2020,pp.225–249.[Online].Available: https://www.sciencedirect.com/science/article/pii/B9780128191545000230

[Citedonpage 16.]

[45] C.C.AggarwalandC.K.Reddy,Eds., DataClustering:AlgorithmsandApplications CRCPress,2014.[Online].Available: http://www.charuaggarwal.net/clusterbook. pdf [Citedonpage 16.]

BIBLIOGRAPHY 61

E

XPLORATIONOFREPRESENTATIONOFTHEDIFFERENTWAYSDATAANALYSISINREAL TIMEANDINITSMATERIALIZATIONINUSEFULBUSINESSINFORMATION.

[46] B.K.Patra,N.Hubballi,S.Biswas,andS.Nandi,“Distancebasedfasthierarchical clusteringmethodforlargedatasets,”in RoughSetsandCurrentTrendsinComputing, M.Szczuka,M.Kryszkiewicz,S.Ramanna,R.Jensen,andQ.Hu,Eds.Berlin, Heidelberg:SpringerBerlinHeidelberg,2010,pp.50–59.[Citedonpage 16.]

[47] E.K.Tokuda,C.H.Comin,andL.d.F.Costa,“Revisitingagglomerativeclustering,” 2020.[Online].Available: https://arxiv.org/abs/2005.07995 [Citedonpage 17.]

[48] “Scipy.cluster.hierarchy.dendrogram,”(accessed:16.09.2022).[Online].

Available: https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster. hierarchy.dendrogram.html [Citedonpage 17.]

[49] D.Stamenic,“Definitiveguidetohierarchicalclusteringwithpythonandscikitlearn,”Jul2022,(accessed:16.09.2022).[Online].Available: https://stackabuse. com/hierarchical-clustering-with-python-and-scikit-learn/ [Citedonpages xiii, 17, and 18.]

[50] “Clusteringmethods,”Apr2010,(accessed:16.09.2022).[Online].Available: https://support.sas.com/documentation/cdl/en/statug/63033/HTML/ default/viewer.htm#statug cluster sect012.htm [Citedonpage 17.]

[51] “Rudimentsofhierarchicalclustering:Ward’smethodanddivisiveclustering,”Jul2018,(accessed:16.09.2022).[Online].Available: https://m.dexlabanalytics.com/blog/rudiments-of-hierarchical-clusteringwards-method-and-divisive-clustering [Citedonpage 18.]

[52] “Agglomerativehierarchicalclusteringusingwardlinkage,”(accessed:16.09.2022). [Online].Available: https://jbhender.github.io/Stats506/F18/GP/Group10.html [Citedonpage 18.]

[53] “Thebeginner’sguidetocallqueuing(2022),”Dec2021,(accessed:29.09.2022).[Online].Available: https://squaretalk.com/call-queuing-guide/ [Citedonpage 21.]

[54] “Ivr,”Mar2021,(accessed:16.09.2022).[Online].Available: https://t4d.co.ke/ivr/ [Citedonpage 21.]

[55] “Learnmoreabouttalend,”(accessed:07.11.2022).[Online].Available: https: //www.talend.com/ [Citedonpage 29.]

62

[56] HierarchicalClustering.JohnWileySons,Ltd,2011,ch.4,pp.71–110.[Online].Available: https://onlinelibrary.wiley.com/doi/abs/10.1002/9780470977811.ch4 [Cited onpage 36.]

BIBLIOGRAPHY 63
