Exploration of representation of the different ways data analysis in real time and in its materialization in useful business information.
Cristina Meng. Master's dissertation presented to the Faculty of Sciences of the University of Porto and Altice Labs in Data Analytics, 2022
Cristina Meng
MSc. Engineering Physics
Department of Physics and Astronomy
2022
Advisor
João Manuel Viana Parente Lopes, Faculty of Sciences of University of Porto
Supervisor
Célio Gomes Abreu, Altice Labs
All corrections determined by the jury, and only those, have been made.
The President of the Jury, Porto, / /
Exploration of representation of the different ways data analysis in real time and in its materialization in useful business information.
Author: Cristina MENG
Supervisors: João Manuel VIANA PARENTE LOPES
Célio GOMES ABREU
A thesis submitted in fulfilment of the requirements for the degree of MSc. Engineering Physics at the
Faculty of Sciences of University of Porto
Department of Physics and Astronomy
August 9, 2023
I, Cristina Meng, enrolled in the Master Degree of Engineering Physics at the Faculty of Sciences of the University of Porto hereby declare, in accordance with the provisions of paragraph a) of Article 14 of the Code of Ethical Conduct of the University of Porto, that the content of this internship report reflects perspectives, research work and my own interpretations at the time of its submission.
By submitting this internship report, I also declare that it contains the results of my own research work and contributions that have not been previously submitted to this or any other institution.
I further declare that all references to other authors fully comply with the rules of attribution and are referenced in the text by citation and identified in the bibliographic references section. This internship report does not include any content whose reproduction is protected by copyright laws.
I am aware that the practice of plagiarism and self-plagiarism constitute a form of academic offense.
Cristina Meng, 16th December 2022
A huge thank you to my university's supervisor, professor João Viana Lopes, for all the support given, the availability and the motivational words during the most challenging times, and to the people at Altice Labs: Célio Abreu, my supervisor at the company, and Filipa Ferreira, my team leader. Both of them welcomed me, showed availability and accompanied all of my work during this internship; it was a wonderful experience.
I would also like to thank my friends from FEUP. You guys made my choice of enrolling in informatics in my first year all worth it.
To my best friend of all time, Joana Ferreira, who was smart enough not to put herself through a master's degree. The girls that girl, girl and the girls who girln't, goren't.
To my boyfriend, for the coffee, ice cream and support given during these last few days of insanity, in which I even dared to say I wanted a cat.
And lastly, I would like to thank the tiny human being created by my sister, to whom I am also thankful: my wonderful niece. Let me raise a toast to the girl I currently love most. Auntie will buy you and your future sister all the ice cream in the world.
Faculty of Sciences of University of Porto
Department of Physics and Astronomy
MSc. Engineering Physics
Exploration of representation of the different ways data analysis in real time and in its materialization in useful business information.
by Cristina MENG
This internship, carried out at Altice Labs, has the goal of searching for patterns in the clients' activity and looking for anomalous behaviour. This is important as it allows the operator to know its clients and take the necessary measures in case anything anomalous shows up. If the operator does not know what a normal pattern looks like, anything anomalous that comes up will look normal to it.
In order to do so, it is essential to, first, anonymise the data: numerous problems arose from this process, including mixed up records and fields with the wrong data.
Second, to analyze the data, the most common data science technologies were used: SQL, R and Python. These last two have libraries that enable clustering, a process that allows us to aggregate clients with similar attributes, enabling their general characterization.
Our patterns allowed us to discover not only clients with a million records that leave almost 100% of their phone calls unanswered, and others on the extreme opposite, but also clients with only a couple of calls.
Faculty of Sciences of University of Porto
Department of Physics and Astronomy
Integrated Master's in Engineering Physics
Exploration of the representation of the different forms of real-time data analysis and of its materialisation into useful business information.
by Cristina MENG
This internship, carried out at Altice Labs, aims to look for patterns in the activity of the different clients and to detect anomalous behaviour. This is important because it allows the operator to know its clients and take the necessary measures should anything out of the ordinary appear. If the operator does not know its client's normal pattern, anything unusual that arises will, in principle, look normal to it.
To do so, it is essential, first, to anonymise the data: several problems arose from this process, including swapped records and fields with wrong data.
Second, to analyse the data, the most common data science technologies were used: SQL, R and Python. The latter two have libraries that enable clustering, a process that aggregates clients with similar attributes, making their general characterisation possible.
The patterns found allowed us to discover that there are not only clients with a million records that have almost 100% unanswered calls, and others at the opposite extreme, but also clients with only a few calls.
FCUP Faculty of Sciences of University of Porto
RDI Research, Development and Innovation
IoT Internet of Things
ABC Advanced Business Communications
AS Application Server
SEC Convergent Enterprise Service
CDR Call Detail Records
GDPR General Data Protection Regulation
EU European Union
EEA European Economic Area
ML Machine Learning
DB Database
GID Global Call ID Identifier
LEC Logic End Cause
HG Hunt Group
WQ Waiting Queue
PA Pre-Answer
IVR Interactive Voice Response
WMS Windless Media Server
IP Internet Protocol
In this document, a synthesis of the internship carried out at Altice Labs is presented. This internship took place under the curricular unit of "Internship" of the Master's in Engineering Physics of the Faculty of Sciences of University of Porto (FCUP). It was hosted by Altice Labs, SA, headquartered in Aveiro, but was carried out remotely from Altice's office in Porto.
With a legacy of several decades, Altice Labs, previously known as PT Inovação, has been a catalyst of innovation and transformation, asserting a position of leadership in the RDI (Research, Development and Innovation) area in Portugal. For example, in 2021, Altice Labs announced an innovation in fibre optics that revolutionised how these networks were deployed on the ground. With this, they doubled their capacity, enabling them to serve twice as many customers as compared to the current technology. Over 300 million people in more than 60 countries use products and solutions made in Altice Labs [1].
Altice Labs has offices in the Algarve, the Madeira and Azores archipelagos, Porto, Lisbon, Viseu, and also outside of Europe, such as Brazil, through centres dedicated to the research and development of advanced solutions of Telecommunications and Information Systems [2].
Altice Labs continually supports collaborative RDI projects, working in partnership and cooperation with universities all over the world, RD institutions, partners, suppliers and customers [3]. Fig. 1.2 presents the RDI activities in which Altice Labs is involved. These involve Artificial Intelligence & Machine Learning, Smart Cities, Smart Living, Internet of Things (IoT), Digital Services & Platforms, and 5G Future Networks, including the optical evolution framework [3].
ABC (Advanced Business Communications) is an Enterprise Collaboration Solution in which ABC-AS (where AS stands for Application Server), previously designated as SEC (Convergent Enterprise Service), delivers a unified communication experience across all user devices (fig. 1.3). One of its main features is Analytics, which holds the usage information of Customer resources and services. However, the corporate environment framed in telecommunications, combined with the emergence of new services and technologies, is increasingly complex and changes ever faster. The simple collection and presentation of data from all activities of an operational and business nature is insufficient. A need arises to manipulate and analyse the data from different perspectives, fostering the interpretation and conversion of these data into useful information that companies can use to understand and enhance their business.
Every telecommunication company stores detailed call records, called CDRs (Call Detail Records), that consist of information captured by the telecommunications providers during the calls, messages, and internet activity of a customer. Telecommunication providers routinely collect these records to detect congested cell towers and manage additional bandwidth, to bill customers for cellular usage, and to troubleshoot and improve the network's performance. For each communication between individuals, the mobile operator keeps a CDR that stores metadata such as the type of call, whether the call was incoming or outgoing, the starting and ending time of the call and the call duration. Since the contents of the communication are not revealed through the CDR, as it only stores communication-related properties, a CDR can be considered metadata [4]. These records are essential since they allow the identification of similar use patterns and information that telecom operators and their clients can use to their advantage [5, 6]. The knowledge of the clients' patterns allows the telecommunications operator to understand what anomalous behaviour might look like and, according to that, make the appropriate decisions. Another critical aspect of these call patterns is that they can present distinct characteristics even though multiple clients may follow the same pattern. Characterising these clients would be in the operator's best interest, as it may help with future business plans.
With the invention of the Internet and its technological advancements, data is at the centre of everything. The volume of data collected and stored globally is growing exponentially, and every day data is being used in new ways. Worldwide demand for digital services and growth results in more data to protect: data privacy and security have become key social and business issues [7]. Data protection is about securing digital information while keeping data usable for business purposes without trading away customer or end-user privacy [8]. It helps to reduce risk and enables a business or agency to respond quickly to threats. In 1995, the European Data Protection Directive was passed [8]. However, as it was adopted when the Internet was still in its early stages, it established only minimum data privacy and security standards [9]. Stricter guidelines and restrictions needed to be adopted. In 2016, the EU adopted the General Data Protection Regulation (GDPR) [9], a regulation we will discuss below.
GDPR (EU) 2016/679 is a regulation in European law on the privacy and protection of personal data, applicable throughout the European Union (EU) and the European Economic Area (EEA), and implemented in 2018 [10]. It regulates the export of personal data outside of the EU and EEA, and one of its main goals is to give citizens and residents ways to control their data [10].
The regulation contains clauses and demands related to how personal information is treated in the EU, and it applies to every company that operates in the EEA, no matter their country of origin [11]. The treatment of personal data must respect the principles of data protection: all the data must be stored using pseudonymisation or complete anonymisation:
• Pseudonymisation is a required process for stored data that transforms personal data in such a way that the resulting data cannot be attributed to a specific data subject without the use of additional information [12]. Moreover, it needs to ensure that the data cannot be made available without explicit consent, nor be used to identify someone without the use of additional information (a key) kept separately from the pseudonymised data [12].
• Complete anonymisation: for the data to be truly anonymised, the anonymisation process needs to be irreversible, as opposed to pseudonymisation, which has a key to decrypt the sensitive data [12, 13].
The regulation also does not allow the treatment of any data outside the legal context specified in it, except when whoever controls the data has received the explicit consent and opt-in of the data's owner. The owner also has the right to revoke this permission at any time [14].
The party responsible for the treatment of personal data should declare any data collection, the legal framework that allows that collection, the goal of the data processing, the time frame the data is going to be stored and the sharing of information with any third-party partner outside the EU. The users have the right to demand a copy of the collected data and the right to demand the elimination of that same data under certain circumstances [14]. Public authorities and companies whose activity is focused on the regular or systematic treatment of personal data are mandated to have a data protection officer (DPO) responsible for ensuring that the processing complies with the GDPR.
The simplest form of data is multidimensional data. This type of data typically contains a set of records. Each record contains a set of fields referred to as attributes, dimensions or features. These fields describe the different properties of that record. A multidimensional dataset is defined as follows:
Definition 1.3.1 (Multidimensional Data) A multidimensional dataset $D$ is a set of $n$ records, $X_1 \ldots X_n$, such that each record $X_i$ contains a set of $d$ features denoted by $(x_i^1 \ldots x_i^d)$ [15].
The records are the table rows, and the fields are the columns. As seen from fig. 1.4, there are two broad types of data, categorical (or qualitative) and numeric (or quantitative), and each of them can be split into different categories.
Categorical data is grouped into categories that take on discrete values. It can be classified as nominal, when the categories do not have an order, or ordinal, when the categories do have an order [15, 17].
Quantitative data, as the name says, tries to quantify things by considering numerical values that make it countable in nature [18]. Thus, when each value of $x_i^j$ in Definition 1.3.1 is quantitative, the corresponding dataset is referred to as quantitative multidimensional data [15].
• Continuous data is measured on a continuous numerical scale and can take on a large number of possible values. Continuous data can be classified as interval, when it does not have an absolute zero and negative numbers also have a meaning, or as ratio, when it does have an absolute zero and negative numbers do not have a meaning [17].
• Discrete data measures counts or numbers of events. So while it is numerical data, it is not measured on a continuous numerical scale and hence does not fit neatly into either of the classifications above. It can therefore be treated as either categorical or continuous, depending on the number of possible values [17].
To work with qualitative data, we first need to encode it before processing [15, 18].
Data mining algorithms, such as Machine Learning (ML) ones, can only work with numerical data, as the models are mathematical by nature. Thus, these categories help us decide which encoding strategy can be applied to which data type.
In this internship, the given use case is the search for anomalous behaviour in call patterns associated with voice communications. We explore different clients' time series patterns to identify possible anomalous behaviour. Such behaviour encompasses traffic alterations (drop/peak in the time-series pattern) and abnormal percentages of answered/unanswered calls and their root causes.
This report is divided into three main chapters, excluding the Introduction chapter. The first one focuses on the Methods, a chapter divided into four main topics, starting with the technology used to retrieve and analyse the data and moving to the workflow of converting this raw data into useful data. The approach taken to characterise the data into patterns is described and, lastly, the search for anomalous behaviour is presented. However, throughout this internship, many challenges arose with retrieving the data, the main one being its anonymisation: as the telecom operator's consent was necessary, not only regarding which data to provide but also regarding the anonymisation process itself, multiple back-and-forth meetings took place with MEO (a mobile and fixed telecommunications service and a brand from Altice that revolutionised the telecommunications market in Portugal [19]). Only the data that contained sensitive information related to the users of the ABC service were anonymised or 'zeroed', the latter meaning that, when importing these fields, the information was not obtained. The remaining fields were left unchanged.
MEO defined a key that allowed the anonymisation process to go through. Altice does not know this key, so this process belongs in the pseudonymisation category. This strategy guaranteed that, even though the information is always anonymised, it would always refer to the same identification. This process took around nine to ten months, so an extension of two months to the initial eleven months was necessary to have time to analyse the data.
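The property described above, that the same identifier always maps to the same pseudonym while the key stays with the operator, can be illustrated with a keyed hash. This is only a minimal sketch under our own assumptions: the actual scheme used by MEO is not described in this report, and the key, the phone numbers and the function name `pseudonymise` are hypothetical. Unlike a decryptable scheme, a keyed hash cannot be reversed even with the key; it only reproduces the stable mapping.

```python
import hashlib
import hmac


def pseudonymise(identifier: str, key: bytes) -> str:
    """Map an identifier to a stable pseudonym with a keyed hash (HMAC-SHA256)."""
    digest = hmac.new(key, identifier.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]  # truncated for readability only


# Hypothetical key and phone numbers, for illustration only.
secret_key = b"key-held-by-the-operator"
a = pseudonymise("+351220000001", secret_key)
b = pseudonymise("+351220000001", secret_key)
c = pseudonymise("+351220000002", secret_key)
```

Here `a` and `b` are identical, so all CDRs of one user remain linked after pseudonymisation, while `c` differs because it comes from a different identifier.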
The second chapter details the analysis of the patterns and anomalous behaviour, with a few examples of what this may look like, ranging from the call patterns themselves and the percentage of unanswered calls and its logic end causes to a sample characterisation of one client and their call patterns.
The final chapter reports the conclusions drawn from this internship, the challenges faced, and considerations for the future.
In order to analyze the CDRs given, we first need to understand what is available for data analysis and the right tools for what we are trying to achieve. In terms of strategy, diagrams depicting what was done regarding retrieving and converting raw data into usable data are presented. We studied the general activity of all the clients and their voice calls to accomplish the objectives for this internship. Clustering allowed us to aggregate clients with similar attributes, enabling their general characterization. These characteristics can be the number of unanswered/answered calls, the number of calls, the functionalities one goes through (internal services), or, most usually, the call activity. We will be exploring each one of these in the following sections.
The most commonly used set of tools for data analysis includes SQL, Python and R, which we describe in more detail below. Google Colab, the main tool used to run Python in the web browser, will also be discussed.
The connection between these technologies is depicted in the workflow from fig. 2.1.
Due to the GDPR, exporting and storing sensitive data in our local hardware was prohibited. The only way to access the data was to connect directly to the database (DB). Due to this, SQL was used to write the required queries for retrieving the data. SQL is a non-procedural language, meaning the user only needs to specify 'what to do' and not 'how to do it'.
SQL mainly has four essential operations: creating, reading, updating and deleting data from the table where the content is stored. In this internship, we used SQL solely to read the data from the database [22, 23].
Google Colaboratory, also known as Colab, is a Jupyter notebook environment that runs entirely in Google's cloud servers. Similar to Jupyter Notebook, a Colab notebook is a list of cells that can contain text or Python executable code and its respective outputs. The main advantage of using Colab is that none of the user's computer resources are used to run the notebook; it also makes it easier to share notebooks with others, since Colab notebooks are stored in Google Drive, and if a notebook contains sensitive content, the cells' output can be omitted [24, 25]. However, since we need to access our local database, we need to be able to run Colab locally. Colab allows us to connect to a local runtime using Jupyter, meaning that all code will be executed on our local hardware, enabling access to our local files. In order to do so, the Jupyter extension 'jupyter_http_over_ws' needs to be installed and enabled.
Python is a popular programming language with plenty of libraries suited for various tasks, which is often cited as one of Python's strengths. A description of the libraries used can be seen below.
• Pandas allows not only data import from various file formats, such as comma-separated values, SQL database tables and Microsoft Excel spreadsheets, but also various data manipulation operations such as merging, reshaping and selecting, as well as data cleaning and data wrangling features [26].
• cx_Oracle is a module that enables access to Oracle Database. Executing SQL queries is the primary way a Python application communicates with Oracle Database. The queries are executed using the method Cursor.execute(). Rows can then be fetched all at once using the method Cursor.fetchall(). The fetch methods return data as tuples, which Pandas then converts into a dataframe [27].
• Matplotlib is a Python data visualisation and graphical plotting library. All graphs from this report were plotted using this library [28].
• Scikit-learn is a free software machine learning library for Python. It features various classification, regression, and clustering algorithms, including hierarchical clustering, and it is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy [29, 30].
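The read-only query pattern described above (execute a SELECT through a cursor, then fetch all rows as tuples) can be sketched as follows. Since an Oracle database is not available outside the company, this sketch assumes Python's built-in `sqlite3` module as a stand-in; it exposes the same `execute()` and `fetchall()` cursor methods. The table name `cdr` and the inserted values are hypothetical.

```python
import sqlite3

# In-memory stand-in for the CDR table; table name and values are hypothetical.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute(
    "CREATE TABLE cdr (CLIENT_ID INTEGER, CALL_TYPE INTEGER, SESSION_ANSWERED INTEGER)"
)
cur.executemany(
    "INSERT INTO cdr VALUES (?, ?, ?)",
    [(1, 0, 1), (1, 0, 0), (2, 7, 1), (2, 0, 1)],
)

# Read-only query: keep voice calls (CALL_TYPE = 0) and count them per client.
cur.execute(
    "SELECT CLIENT_ID, COUNT(*), SUM(SESSION_ANSWERED) "
    "FROM cdr WHERE CALL_TYPE = 0 GROUP BY CLIENT_ID ORDER BY CLIENT_ID"
)
rows = cur.fetchall()  # list of tuples, one per client
conn.close()
```

The resulting list of tuples can then be handed to `pandas.DataFrame(rows, columns=[...])` to obtain a dataframe for further analysis.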
2.1.4 R
R is a programming language focused on statistical analysis and data visualisation [31].
Like Python, it provides a collection of libraries that aid in everyday data science tasks. In this case, R was used for its libraries associated with hierarchical clustering: 'circlize' and 'dendextend'. The first one allowed the plot of a circular dendrogram, while the second allowed for its customisation, e.g., colours, labels, etc. When we have too many labels, a regular dendrogram can become hard to read, so plotting it in a circular manner is a better solution.
Unfortunately, most of the time, raw data is not immediately suitable for automated processing. In our case, the first challenge encountered is the number of features we have, which is over a hundred, most of them not relevant to the problem at hand. So here we highlight the importance of feature selection.
The general features used in the analysis are:
• CALL_TYPE (Number)
TABLE 2.1: CALL_TYPE: Possible numbers and description.
USSD (Unstructured Supplementary Service Data): service used to receive important information, such as the available balance, via SMS after dialing a service code.
Fax
SS_NOTIF, where SS stands for Supplementary Service [32].
CS_VIDEO (Circuit-Switched Video call): a video call that occurs using a channel in the circuit-switched network.
7 DATA64K: digital transmission method where each call is carried by a 64-kbps digital stream [33].
• CLIENT_ID (Number): customer identifier. Customer is an ABC entity referring to enterprises that subscribe to the service provided by the telecommunications operator.
• INTERNAL_SERVICE (VARCHAR2): ABC functionality or list of functionalities applied during the call. The field 'INTERNAL_SERVICE' is 'VARCHAR2' instead of 'Number' because sometimes, in one CDR, a call goes through multiple services at the same time, meaning that the field is filled with a sequence of numbers separated by commas, so the final value is not in number form.
• INFO35 (VARCHAR2): general field prepared to accommodate a wide variety of value types. In this case, it contains the Global Call ID Identifier (GID), which refers to a generated unique identifier used to mark each call in order to allow tracing the call along its path.
• LOCATION_TYPE (VARCHAR2): type of call from location processing. It can be one of two types: ONNET (internal calls), in which the call is between numbers within the same client, or OFFNET (external calls), in which the call goes to off-net destinations (different clients). This field's existence is associated with the different payment plans, e.g., internal calls are more likely to be free of charge than external calls.
• LOGIC_END_CAUSE (LEC) (Number): number code used to identify the cause of the call ending. These codes may refer to success causes, such as the call being ended by the origin after it was established successfully, or to error codes, identifying why the call was not allowed to proceed.
• SERVICE (Number): ABC sub-service used in the session, responsible for generating the CDR.
• SERVICE_ACCESS_TIME (DateTime): DateTime of service access, with system time.
• SESSION_ANSWERED (Binary): 1 if the session is answered and 0 if not answered.
Afterwards comes the actual pre-processing of the data, which involves not only tasks such as data cleaning and data encoding, but also feature extraction, which generates new features from raw data that, for some reason, are not directly comparable, such as categorical data. As mentioned previously, many ML algorithms only work on numerical data, so we resort to data encoding.
Several methods can be used to convert categorical data into numerical data, the most popular ones being One-Hot Encoding and Label Encoding.
One-Hot Encoding takes a column with categorical data and, for each unique value of the categorical attribute, creates an additional column (feature) that is assigned the value one or zero. Each value is thus represented as a binary vector in which all entries are zero except the one at the index of its category, which is marked with one [34]. This method increases the dimensionality of the dataset, which is not ideal [35].
In Label Encoding, each categorical value is replaced with a numeric value between 0 and the number of classes minus 1. Each label is assigned a unique value, and it is always possible to revert it to the original value. In the documentation of this technique, nothing explains how the numbers are attributed to the labels [36]. The problem with Label Encoding is that it might add a relationship between the labels that did not exist prior to the encoding. For example, if a distance-based clustering is performed, Label Encoding may add a proximity relationship between the labels and cluster them together. However, since it does not occupy more memory, we will most likely use this technique.
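To make the difference between the two encodings concrete, the following minimal sketch applies both to a small column of LOCATION_TYPE values. It is written in plain Python rather than with a library encoder; the function names are ours, and assigning labels in sorted order is an assumption of this sketch, not a statement about any library.

```python
def label_encode(values):
    """Replace each category with an integer in [0, n_classes - 1]."""
    classes = sorted(set(values))  # sorted order is an assumption of this sketch
    index = {c: i for i, c in enumerate(classes)}
    return [index[v] for v in values], classes


def one_hot_encode(values):
    """Create one binary column per distinct category."""
    classes = sorted(set(values))
    return [[1 if v == c else 0 for c in classes] for v in values], classes


location = ["ONNET", "OFFNET", "ONNET"]
labels, classes = label_encode(location)   # one integer per row
vectors, _ = one_hot_encode(location)      # one binary vector per row
```

Label Encoding keeps a single column, whereas One-Hot Encoding adds one column per category, which is the dimensionality increase mentioned above. The original value can be recovered from a label with `classes[label]`.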
In order to analyze the events a call goes through, we decided to exclude those rows in which the internal service and the GID were NaN. NaN values are most likely due to parsing errors. Moreover, some CDRs go through multiple internal services simultaneously, so the field is no longer numerical data, as commas separate the values. It was therefore also necessary to separate such a single CDR into multiple ones, where everything is the same besides the internal service field [37, 38].
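The splitting step can be sketched as below. This is an illustration under our own assumptions: a CDR is represented as a dictionary, missing values are represented by `None` rather than NaN, and the record shown is hypothetical.

```python
def explode_internal_service(cdr: dict) -> list:
    """Return one copy of the CDR per internal service listed in the field."""
    raw = cdr.get("INTERNAL_SERVICE")
    if raw is None or cdr.get("INFO35") is None:
        return []  # rows without internal service or GID are dropped
    services = [s.strip() for s in str(raw).split(",") if s.strip()]
    # Every copy keeps all other fields and carries a single numeric service.
    return [{**cdr, "INTERNAL_SERVICE": int(s)} for s in services]


# Hypothetical record: one call that went through services 1, 2 and 11.
record = {"INFO35": "gid-001", "CLIENT_ID": 42, "INTERNAL_SERVICE": "1,2,11"}
rows = explode_internal_service(record)
```

After this step, the internal service field is numeric again in every row, at the cost of a larger table.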
Following this, we have feature transformation, a family of algorithms that creates new features from existing features. For example, the date of an event might not be explanatory enough. As we are looking for time-series patterns, knowing the weekday might be helpful. So a new feature for the weekday is created from the DateTime one.
The time and date of each CDR come in the format 'yy.MM.DD HH:mm:ss.ffffff'. To identify possible temporal patterns, we separated the CDRs within certain time frames, such as 'Week of the Month', 'Day of the Week', 'Month' and even 'Hour of the Day'.
These time frames, aggregated, are what we call the calendar. In case of missing hours or even dates, the respective rows were inserted in the data table with the fields not related to the datetime filled with 'NaN'. This allows for an aligned representation of the curves within the same day of the week for another week [38, 39].
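As an illustration of how such calendar fields can be derived from a single timestamp, consider the sketch below. It assumes that a space separates the date from the time in the timestamp, and it assumes a simple definition of 'Week of the Month' in which days 1 to 7 form week 1; the report does not state which definition was used. The timestamp is hypothetical.

```python
from datetime import datetime


def calendar_features(timestamp: str) -> dict:
    """Derive calendar fields from a 'yy.MM.DD HH:mm:ss.ffffff' timestamp."""
    dt = datetime.strptime(timestamp, "%y.%m.%d %H:%M:%S.%f")
    return {
        "month": dt.month,
        "week_of_month": (dt.day - 1) // 7 + 1,  # assumed: days 1-7 are week 1
        "day_of_week": dt.weekday(),             # 0 = Monday ... 6 = Sunday
        "hour_of_day": dt.hour,
    }


features = calendar_features("22.03.15 09:30:00.000000")
```

For this timestamp (15 March 2022, a Tuesday) the function returns month 3, week 3, weekday 1 and hour 9.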
As we can see, the data mining process is a pipeline containing many phases, as represented in fig. 2.2.
Before going deep into characterising the clients, we will first study their call patterns and the services they go through. This is one way to find patterns, as clients may go through the same services, drawing the same pattern or not.
Each call has a series of events, and each GID corresponds to a different call. For each event, the call goes through an internal service, a functionality, so a first attempt to look at time-series patterns is to plot, for each GID, all the internal services the call goes through as a function of time.
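The grouping behind such a plot can be sketched as follows: events are collected per GID and ordered in time, which yields the sequence of internal services each call went through. The event tuples and GID values are hypothetical, and the function name is ours.

```python
from collections import defaultdict


def service_timelines(events):
    """Group (gid, time, internal_service) events into one ordered path per call."""
    calls = defaultdict(list)
    for gid, time, service in events:
        calls[gid].append((time, service))
    # Sort each call's events by time and keep only the service numbers.
    return {gid: [s for _, s in sorted(steps)] for gid, steps in calls.items()}


# Hypothetical events; times are ISO strings so they sort chronologically.
events = [
    ("gid-1", "2022-03-15T09:00:02", 2),
    ("gid-1", "2022-03-15T09:00:00", 1),
    ("gid-2", "2022-03-15T09:05:00", 1),
    ("gid-1", "2022-03-15T09:00:07", 3),
]
paths = service_timelines(events)
```

Calls that share the same path, such as 1 then 2 then 3, can then be counted together when looking for the most and least common service sequences.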
In the first approach, we will analyse the call patterns of different clients and the clients themselves. This will allow us to analyse not only the most common and least common services a call goes through but also which clients appear more and less frequently and, afterwards, to carry out a more detailed analysis of the calls themselves: whether or not they were answered and the reason why.
We are only interested in voice calls, so we will filter out every row that does not have 'CALL_TYPE' = 0, as this is the code for voice.
One of the possible ways to segment the different clients is by clustering them according to their characteristics. Clustering is the most common unsupervised learning technique used to explore the data and find hidden patterns or groupings in it [42].
Unsupervised learning is a type of machine learning that uses unlabeled and unclassified training data [43]. Its main goal is to discover hidden and interesting patterns in unlabeled data, which the algorithm will find on its own, unlike what happens with supervised learning, in which we know what values we will obtain for the output [44].
One definition of clustering is: Given a set of data points, partition them into a set of groups which are as similar as possible [45].
There is not only one type of clustering algorithm; the most studied ones are the partitional and hierarchical. They have been used in different applications due to their simplicity and ease of implementation relative to other clustering algorithms [45].
Partitional methods need to be provided with a set of initial seeds (or clusters), which are then improved iteratively, and can be divided into distance-based and density-based [45]. Distance-based methods optimize a global criterion based on the distance between the patterns, while density-based methods optimize local criteria based on density information of the patterns [46].
On the other hand, hierarchical methods can start with the individual data points in single clusters and build the clustering. The role of the distance metric is also different in both of these algorithms. In hierarchical clustering, the distance metric is initially applied to the data points at the base level and then progressively applied to subclusters by choosing absolute representative points. However, in the case of partitional methods, the representative points chosen at different iterations can be virtual points such as the centroid of the cluster (which is nonexistent in the data) [45]. Hierarchical clustering is the method chosen for clustering, as it does not require the number of clusters to be defined in advance.
Hierarchical clustering algorithms approach the problem of clustering by developing a tree structure called the dendrogram. Once the dendrogram is constructed, a horizontal line can be traced anywhere in the dendrogram, and the number of legs it intersects is the number of clusters we will have.
Hierarchical clustering can be achieved in two different ways, bottom-up (agglomerative) and top-down (divisive) clustering [45]. Divisive algorithms are more robust in the early stages compared to their counterpart, and agglomerative clustering techniques, in turn, are more understandable and by far the most popular [47]. From fig. 2.3a, we can see that agglomerative clustering works as follows: from a dataset, we assign each data point to its own cluster, and from there, the individual data points are successively agglomerated into higher-level clusters, moving up in the hierarchy. At each stage of hierarchical clustering, the clusters for which the distance between them is the minimum are merged. At the same time, a dendrogram is constructed: the top of the U-link indicates a cluster merge. The two legs of the U-link indicate which clusters were merged, and their length represents the distance between the child clusters [48].
FIGURE 2.3: Hierarchical Clustering
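The agglomerative procedure described above can be sketched in a few lines of plain Python. This is a naive illustration, not the library implementation used in this work: it assumes Euclidean distance, uses the average pairwise distance between clusters as the merge criterion, and stops at a requested number of clusters instead of building the full dendrogram. The points are hypothetical.

```python
from itertools import combinations
from math import dist


def average_linkage(a, b):
    """Mean of all pairwise Euclidean distances between clusters a and b."""
    total = sum(dist(p, q) for p in a for q in b)
    return total / (len(a) * len(b))


def agglomerate(points, n_clusters):
    """Merge the two closest clusters until only n_clusters remain."""
    clusters = [[p] for p in points]  # every point starts as its own cluster
    while len(clusters) > n_clusters:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: average_linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i stays valid
    return clusters


points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
clusters = agglomerate(points, n_clusters=2)
```

With these four points, the two nearby pairs are merged first, leaving one cluster around the origin and one around (5, 5.5). Recording the distance at each merge would give the heights of the U-links in the dendrogram.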
In order to decide which clusters should be merged, appropriate criteria should be applied. These criteria can be based not only on the distances (such as Euclidean) between clusters, for example [49, 50]:
• Centroid linkage: computes the centroids of the clusters and calculates the distance between them;
• Single linkage: calculates the distance between the closest points of the clusters;
• Complete linkage: calculates the distance between the furthest points of the clusters;
• Average linkage (not so popular): calculates the distance between all pairs of points of the clusters and takes its average as the distance, as seen in eq. 2.2
$$ d_{AB} = \frac{T_{AB}}{N_A N_B} \qquad (2.2) $$
where $T_{AB}$ is the sum of all pairwise distances between cluster A and cluster B. $N_A$ and $N_B$ are the sizes of clusters A and B, respectively. This formula applies to any pair of clusters.
But also on the minimization of the dispersion of the points inside each cluster:
• Ward's method is based on the error sum of squares (ESS). The ESS for a cluster is defined as the sum of the squared Euclidean distances from the cluster's points to the cluster's centroid [51]. It combines clusters where the increase in within-cluster variance is the smallest.
$$ \mathrm{ESS} = \sum_{j} \sum_{i} \lVert X_{ij} - \bar{X}_j \rVert^2 $$
where $X_{ij}$ is the $i$th observation in the $j$th cluster and $\bar{X}_j$ is the cluster's centroid [51].
The distance between two clusters, 1 and 2, is how much the sum of squares will increase when they are merged [52].
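Ward's merge cost can be computed directly from this definition. The sketch below is a plain-Python illustration with hypothetical points and function names of our own; it evaluates the ESS of each cluster and of their union, and takes the difference as the distance between the two clusters.

```python
def centroid(cluster):
    """Component-wise mean of the points in a cluster."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))


def ess(cluster):
    """Error sum of squares: squared Euclidean distances to the centroid."""
    c = centroid(cluster)
    return sum(sum((x - m) ** 2 for x, m in zip(p, c)) for p in cluster)


def ward_distance(a, b):
    """Increase in ESS caused by merging clusters a and b."""
    return ess(a + b) - ess(a) - ess(b)


a = [(0.0, 0.0), (2.0, 0.0)]
b = [(10.0, 0.0)]
cost = ward_distance(a, b)
```

For these clusters the ESS of `a` is 2, that of `b` is 0 and that of the union is 56, so the merge cost is 54. At each step, Ward's method would merge the pair of clusters with the smallest such cost.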
It is mainly here that we make use of the calendar. It is a way to see the client's most active and inactive hours and whether the activity curve, which consists of the number of events per hour, follows the same trend for all days of the week. An unusual spike or drop in the curve could represent a possible anomaly. In order to see if all days of the week follow the same trend, independently of the number of events registered that day, a normalisation of the curves was required. This consisted in dividing the number of events by the area below the activity curve.
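The normalisation can be sketched as follows. We assume here that the area below the curve is approximated by the sum of the hourly counts (unit-width bins); the report does not specify the integration rule, and the counts are hypothetical.

```python
def normalise_curve(events_per_hour):
    """Divide each hourly count by the area under the curve (unit-width bins)."""
    area = sum(events_per_hour)  # rectangle rule with 1-hour bins: an assumption
    if area == 0:
        return [0.0] * len(events_per_hour)
    return [n / area for n in events_per_hour]


monday = [0, 10, 40, 50]   # hypothetical counts
tuesday = [0, 2, 8, 10]    # same shape, five times fewer events
monday_norm = normalise_curve(monday)
tuesday_norm = normalise_curve(tuesday)
```

After normalisation both days give the same curve, so two days with the same shape but very different volumes can be compared directly, and a spike or drop shows up as a deviation in shape rather than in absolute count.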
One important aspect to analyse is whether the call was answered or not. As with the daily activity for all voice calls, we will divide it into two curves, one for the activity of answered calls and another for the unanswered calls. There are many reasons why a call was or was not answered: the general cause is identified by the first two digits of the five-digit 'LOGIC_END_CAUSE' number and the remaining digits identify the specific cause.
Below, in table 2.3, are the logic end causes that appear throughout the analysis in section 3.2.1.
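Splitting a logic end cause into its general and specific parts is a simple integer operation. The sketch below assumes the code is stored as a five-digit integer without a leading zero; the code value used is hypothetical and does not correspond to a real entry of the table.

```python
def split_logic_end_cause(code: int):
    """Split a five-digit LEC into (general cause, specific cause)."""
    if not 10000 <= code <= 99999:
        raise ValueError("expected a five-digit LOGIC_END_CAUSE")
    return divmod(code, 1000)  # first two digits, last three digits


general, specific = split_logic_end_cause(12345)  # hypothetical code
```

Grouping the calls by the general part gives a coarse view of why calls ended, while the specific part can be inspected only for the general causes of interest.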
Our source of concern is the percentage of unanswered calls. We visualise this data using a histogram with the percentages of unanswered calls versus the number of clients that have them. Moreover, we need to consider clients who may only have a few voice calls throughout the analysed dates. A percentage of 100% is more relevant for clients with a million voice calls than for clients with only one voice call. We will therefore separate the graphs by order of magnitude, from 1 to 10 million voice calls.
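The separation by order of magnitude can be sketched as below. The input format (a mapping from client to total and unanswered voice calls), the client identifiers and the counts are all hypothetical; the order of magnitude is taken from the number of digits of the total.

```python
from collections import defaultdict


def unanswered_by_magnitude(calls_per_client):
    """Group clients' unanswered-call percentages by order of magnitude of volume.

    calls_per_client maps CLIENT_ID -> (total voice calls, unanswered voice calls).
    """
    groups = defaultdict(list)
    for client, (total, unanswered) in calls_per_client.items():
        if total <= 0:
            continue  # clients without voice calls have no percentage
        magnitude = 10 ** (len(str(total)) - 1)  # 1, 10, 100, ...
        groups[magnitude].append((client, 100.0 * unanswered / total))
    return dict(groups)


# Hypothetical clients.
stats = {101: (1, 1), 102: (1_250_000, 1_237_500), 103: (3_400, 170)}
groups = unanswered_by_magnitude(stats)
```

Each group can then be drawn as its own histogram, so that a client with a single unanswered call (100%) is not mixed with a client that leaves 99% of more than a million calls unanswered.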
ABCfunctionalitiesareclassifiedasServicesandInternalServices.InternalServicesrefer tospecialisedfeatureswithineachABCService.Thereareinternalservicesthataremost commontoappearthroughoutthisanalysis:somerefertoGrouporAdvancedServices, whicharefunctionalitiesappliedtotheenterpriseasawhole,andothersrefertoUser Services,whichallowtoprocessthecalldifferentlydependingonthesettingsofaspecific user.GroupservicesinvolveHuntGroup(HG),WaitingQueue(WQ),Pre-Answer(PA) andInteractiveVoiceResponse(IVR),andarespectiveexplanationoftheseisgivenbelow. ThenumberbetweenparenthesiscorrespondstothenumberoftheService.
• Global (1): Service that manages calls to or from a user, defines whether it processes the originating or the receiving party, and verifies whether it is necessary to apply other services to the call.
• Hunt Group (2): Service that allows organisations to distribute calls to a group of people in a company. The incoming calls are made to a single phone number and can be re-routed to multiple phone lines.
• Waiting Queue (3): Service that distributes incoming customer calls based on the call order. The caller remains on hold until an agent becomes available, at which point the call queue routes the customer to the representative [53].
• Pre-Answer (4): An audio announcement made to an incoming caller before the call is put through.
• IVR (11): Service that allows incoming callers to access information via a voice response system of pre-recorded messages without having to speak to an agent, as well as to use menu options via touch-tone keypad selection or speech recognition to have their call routed to specific departments or specialists [54].
Similar to group services, an explanation of the User services is also given:
• Forward: allows the redirection of a received call to a new destination according to a set of conditions, such as temporal patterns, or due to the destination's status ('Busy', 'No Answer', and 'Not Available'), among others.
• OIP (Originating Identity Presentation): a supplementary service in which the network provides the Called Party with the trusted identity of the Calling Party.
• Out Numbers: defines which number is used to identify the call made by a user.
• Convergent: allows the call's origin to be identified differently depending on whether the destination is fixed or mobile. For example, the call goes to the network identified with a mobile number because the destination is mobile. However, if the destination is fixed, the call made by the same user with the same telephone is identified with a fixed number.
• NP Announce Surpress (Service Portability Announcement Inhibition): allows the activation or deactivation of the option to play an announcement identifying that the number the user is calling is ported. The ported number announcement is played when, for example, a user calls a 96xxx number which is currently ported to Vodafone. This announcement allows the user to realise that the call may have a different cost, since the number now belongs to a different provider.
• On Net Display: defines the number type used to identify an internal call to the destination. For example, it defines whether the call is identified by the user's short number or by the user's DDI (Direct Dial In number), in case the user has both.
• Number Portability: differentiation of ported numbers.
One of the internal services that is richest in information is the Waiting Queue. The idea is to identify which clients go through this Service most often, pick a suitable one and then check for abnormalities in the fields of this Service (table 2.4).
TABLE 2.4: Specific features used for the Waiting Queue service

Feature   Description
INFO4     0: Call wasn't answered by Waiting Queue destination;
          1: Call was answered by Waiting Queue destination.
INFO5     0: Call has left the waiting queue due to error;
          1: Call has left waiting queue because call was answered by waiting queue;
          2: Call not answered by waiting queue because the queue is full. destination or because it was canceled by the origin;
          3: Call has left waiting queue because user press Intercept exit option;
          4: Call has left waiting queue because max waiting time reached.
INFO6     0: The call was not forwarded;
          1: The call was forwarded.
INFO7     Indicates the waiting time in Waiting Queue.
This is a very targeted approach, as we pick a specific service. However, it is also possible to study the group services in general, as we will see below.
2.4.4.2 Approach 2
In this case, we analyse the call path for each one of these services. However, instead of using all the global call IDs for a specific time frame, as in section 2.3.1, we take the 'USER ID' values present in each of the services for each client, store all the GIDs that share the same 'USER ID' and, for each of them, plot the services a call goes through for a certain DateTime. Note that the USER ID field is typically used to store the database ID of the user who made or received the call, but since these are Group Services, and therefore not applied to a specific user, the same field is used to save the database rule IDs for each Service.
For a more in-depth analysis, we take the same clients and analyse them by the 'LOCATION TYPE' of the call, filtering the Service to Global only.
In this chapter, we compare datasets anonymised manually and by an automatic process, and get a general idea of the call types that our data contains and of which dates are worth analysing. Moreover, the activity curves and the unanswered and answered call curves for all clients are studied, along with the most common LEC cause groups. Due to their unusually high percentage of unanswered or answered calls, some clients are targeted, analysing their most common internal services and LECs. For clients with only answered calls, where the number of clients allows it (for a more readable graph), hierarchical clustering is performed to aggregate them according to their activity. Finally, a characterisation according to the volume of internal and external calls is performed on two different clients, and an analysis of the call pattern of the group services is shown.
This section is about understanding the data we have at hand: knowing what it contains, the number of CDRs per hour and per day, and so on.
At the start of this internship, no use case had been defined and, having only minimal data at this point, we started by analysing the Global Call ID (GID). Each call has a series of events, and each different GID corresponds to a different call. For each event, the call goes through a service. The initial approach consisted of counting the number of events per call and representing it graphically as the number of events as a function of each GID (fig. 3.1).
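Counting events per call amounts to a frequency count over the GID column. The sketch below assumes each CDR event is available as a dictionary and uses a hypothetical key name, "GID", for the global call ID; events without a GID are skipped because they cannot be traced back to a call.

```python
from collections import Counter


def events_per_call(records, gid_field="GID"):
    """Count how many events (records) share each global call ID."""
    return Counter(
        r[gid_field] for r in records if r.get(gid_field) is not None
    )


# Hypothetical events: call "a" has two events, call "b" has one.
sample = [{"GID": "a"}, {"GID": "a"}, {"GID": "b"}, {"GID": None}]
counts = events_per_call(sample)
```

Calling `counts.most_common(10)` would then list the calls with an unusually high number of events.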
We can see a few calls with an unusually high number of events. However, this analysis is no longer feasible for the new incoming data. There are around 16000 clients in the new dataset, with many thousands of records, making it impossible to analyse all calls for all existing clients. The first dataset provided, manually anonymised, was relatively poor, as it covered only a day and a half of communications. With the second batch, anonymised with an algorithm, we had access to around two and a half days of communications. The goal was to understand whether or not this anonymisation algorithm changed the data in any way. As previously mentioned, the number of events corresponds to the number of "jumps" performed inside the same call. Therefore, in order to compare the data from the second dataset with the first, we decided to check whether, for the same GID, the call goes through the same internal services. An example of this comparison can be seen in fig. 3.2. It is essential to remember that all records with NaN in 'INFO35' were erased, as they do not allow us to trace the events back to a call.
At first sight, the plots look very distinct because the time of the call is much more precise in the second dataset, going down to milliseconds, while in the first dataset it only goes to minutes. However, in both cases the same internal services are passed through, meaning that, in this respect, the anonymisation algorithm is reliable.
From this moment forward, all the analysis is performed with the new data, anonymised with an automatic process rather than manually. Fig. 3.3 shows the different types of calls that appear in the dataset. Our focus is on voice calls, which occupy the most significant slice, with almost 78%. The remaining slices of the pie chart include 'Data' and 'SMS', which, although not our focus, had percentages large enough to be shown as individual slices. The 'Miscellaneous' slice includes call types such as video messages and faxes, which alone were not significant enough to be shown as individual slices. These calls were therefore aggregated into a broader category, in which the most significant contributor is the 'NaN' values, which we cannot attribute to a type of call.
A count of the voice calls from the CDRs as a function of the hour of the day, for all of the days available, along with the daily hourly average, was plotted in fig. 3.4.
Something that stands out in fig. 3.4 is the unusual activity at midnight on almost every day available in March. Moreover, there is a gap in the data at midday on every day. This leads us to believe that something went wrong in the anonymisation process and that the data from these two hours got mixed up. In theory, the data anonymisation process should retrieve the data from the source, anonymise it, and then write it to the target platform. The process was meant to anonymise some of the fields, and the columns with the data type 'DateTime' were not supposed to be anonymised. However, the anonymisation process, developed with the Talend tool [55], did not consider 'masking' over the 'DateTime' fields when obtaining the data from the source. Because of this, the 'DateTime' data came out in a text format, which was misinterpreted when the data was placed on the target platform, causing the midday events to disappear and to appear at midnight instead. Upon further research, there were also duplicated and triplicated CDRs. Due to the rush to make the data available as soon as possible, parallel processes were used to write the data to the database. Unfortunately, these threads were consuming the same data, resulting in duplicate or triplicate records for the same CDR.
After fixing these mistakes, we carried on with the analysis, meaning that only fig. 3.4 is plotted using the bugged data. In addition, there are days on which the plotted line is not complete: on the 2nd of March the line ends at 13h; on the 9th of March there is only data from 14h onwards; and the 28th of February, even though not visible, only has activity from 21h to 23h. The 1st of March drops at 13h and stays constant until the very end of the day. As these last two days correspond to a Monday and a Tuesday, respectively, this is very much out of the pattern of a regular weekday. Furthermore, there is also a gap of several days between the first two dates mentioned, in which there is no data. We decided to check the number of voice calls per day, by day of the week, to see whether these dates are still worth analysing when compared to the rest. The result is in fig. 3.5.
Observing fig. 3.5, we can see that the 28th of February has very low activity, which is expected, as it only had 2 hours of activity. Since all the remaining data corresponds to March, few conclusions can be drawn from this date alone, and we therefore decided to exclude it. Moreover, the lowest points of this graph reflect the lack of activity on the days mentioned previously. The fact that the 29th of March has almost the same number of calls as these days means that, somewhere in fig. 3.4, the line for this day also either breaks or starts in the middle of the day. The reason behind the abnormal activity on these days is that the data collection was done by accessing a field that represents a sequence number, which is managed by caching instances. Caching is used to provide quick access to values that are often read from databases, storing them in a temporary storage instance, the cache. The cache is used since simultaneous requests for a value from the same sequence could otherwise return repeated values. For example, 'instance 1' and 'instance 2' both cache 5000 values: 'instance 1' caches values from 1 to 5000 and 'instance 2' caches values from 5001 to 10000. When an instance's range ends, it asks the sequence for another numbering range, which assigns 10001 to 15000 to 'instance 1', and so on. When obtaining the data, since there are millions of records to be extracted, we asked for the minimum ID and the maximum ID of the database. So far, the data for those particular days have not yet arrived, because they may have an ID higher than the maximum obtained so far.
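The caching behaviour described above can be illustrated with a small simulation. This is a sketch of the mechanism only, not the actual database implementation: the class names `Sequence` and `Instance` are hypothetical, and the block size of 5000 is taken from the example in the text.

```python
class Sequence:
    """Hands out consecutive, non-overlapping ID ranges (blocks)."""

    def __init__(self, block_size=5000):
        self.block_size = block_size
        self.next_start = 1

    def allocate(self):
        start = self.next_start
        self.next_start += self.block_size
        return range(start, start + self.block_size)


class Instance:
    """Caches one block of IDs and asks for a new block when it runs out."""

    def __init__(self, sequence):
        self.sequence = sequence
        self.block = iter(sequence.allocate())

    def next_id(self):
        try:
            return next(self.block)
        except StopIteration:
            self.block = iter(self.sequence.allocate())
            return next(self.block)


seq = Sequence()
instance_1 = Instance(seq)   # caches 1..5000
instance_2 = Instance(seq)   # caches 5001..10000

ids_1 = [instance_1.next_id() for _ in range(5001)]  # exhausts its block
late_id = instance_2.next_id()                       # issued afterwards
```

In this simulation `instance_1` is already writing IDs above 10000 while `instance_2` still issues 5001, so ID order does not follow time order. An extraction bounded by the minimum and maximum ID seen at one moment can therefore miss records of the same day that were given IDs from a block outside that range.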
Therefore, the data we will be analysing corresponds to the time frame between the beginning of the 10th of March and the end of the 28th of March. So far, we know the number of events registered each day, that the remaining weeks are in sync with each other (fig. 3.5: the lines corresponding to weeks 2, 3 and 4 overlap), and that weekdays are much busier than weekends. However, we do not know whether the curves for weekends and weekdays follow the same pattern, regardless of the number of calls made or received that day. To answer this, we normalised the daily curves; the result is in fig. 3.6.
From fig. 3.6, it is clear that we have two standard curves, one for the weekdays and another for the weekends. Besides weekends being much slower, weekdays have vigorous activity within the eight daily Portuguese working hours, including a drop in activity during lunch hours. Outside these working hours, which span the period from 9h until 17h or even 18h, the activity drops to almost null. For weekends, however, the peaks observed are at around 11h-12h and 19h, with a more subtle drop between these hours, not as abrupt as the one seen on weekdays. Not many clients have intense activity on weekends, or any activity considered relevant at all, meaning that the peaks and lows observed in the weekend curves are primarily due to a few clients with this specific activity. For example, food delivery services usually peak a bit before meal times, and hospitals are not expected to close on any day of the week.
More curious is the orange line that represents the 28th of March. Compared to other weekdays, it seems to be offset by one hour to the left. This is due to the one-hour time change that happened on the 27th of March at 01:00. With daylight saving time, 01:00 becomes 02:00, so there is no data available for 01:00, as that hour did not exist that day. The transition from UTC+0 to UTC+1 (UTC: Coordinated Universal Time) shifts the calls from the following days one hour to the left.
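The one-hour shift can be reproduced with fixed UTC offsets. The sketch below rests on two assumptions that the text does not state explicitly: that the timestamps are stored in UTC while business activity follows local time, and that the year is 2022 (consistent with the weekdays quoted in this chapter). The function name `utc_hour` is hypothetical.

```python
from datetime import date, datetime, timedelta, timezone

WET = timezone(timedelta(hours=0), "WET")    # UTC+0, before the change
WEST = timezone(timedelta(hours=1), "WEST")  # UTC+1, after the change


def utc_hour(local_hour, offset, day):
    """Hour of the day, in UTC, of an event at `local_hour` local time."""
    local = datetime(day.year, day.month, day.day, local_hour, tzinfo=offset)
    return local.astimezone(timezone.utc).hour


# Office opening at 09:00 local time, before and after the change.
before = utc_hour(9, WET, date(2022, 3, 25))
after = utc_hour(9, WEST, date(2022, 3, 28))
```

Under these assumptions the same local habit (starting work at 9h) is recorded at 9h before the change and at 8h after it, which appears as a curve moved one hour to the left.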
Clients are characterised according to their percentage of answered/unanswered calls. For those with extreme percentages, we perform a more in-depth analysis of the services they transit through and their LECs. For the remaining clients, we present a general review of their LECs. Finally, we analyse the Waiting Queue service for one client and the group service patterns of two different clients.
The flux of answered and unanswered calls is given in fig. 3.7. The two curves accompany each other, not only in shape but also in number: besides both making the 'M' shape, the number of unanswered calls is surprisingly similar to the number of answered ones. The curve for unanswered calls is below the answered one almost all of the time, with some more unusual drops observed on the 11th and 24th of March and a more pronounced peak on the 14th of March. These 'anomalies' are not that significant, as they are within the statistical sample. This leaves us wondering why the number of unanswered calls is almost equal to the number of answered ones, as one would expect many more answered calls than unanswered.
In numbers, there is a total of 32017 clients, 6381 (20%) of which have more unanswered calls than answered. Of these, 1023 clients, around 16%, have zero answered calls (3.2% of the total number of clients). Furthermore, 0.5% of all clients have only answered calls. This information is only relevant if we have a significant number of calls: if the client in question has only one call in total within the set time frame, and this one is not answered, not much information can be retrieved from it. A statistical threshold must be applied to keep only the relevant clients. If a client's single call is not answered, the estimated probability of not answering immediately jumps to 1, or to 0 if the call is answered.
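The percentages quoted above and the idea of a statistical threshold can be sketched as follows. The threshold value of 100 calls and the function name `unanswered_share` are illustrative assumptions, not values fixed by the analysis at this point.

```python
TOTAL_CLIENTS = 32017
MORE_UNANSWERED = 6381
ZERO_ANSWERED = 1023

pct_more_unanswered = 100 * MORE_UNANSWERED / TOTAL_CLIENTS
pct_zero_of_subset = 100 * ZERO_ANSWERED / MORE_UNANSWERED
pct_zero_of_total = 100 * ZERO_ANSWERED / TOTAL_CLIENTS


def unanswered_share(answered, unanswered, min_calls=100):
    """Fraction of unanswered calls, or None below the call threshold."""
    total = answered + unanswered
    if total < min_calls:
        # Too few calls for the fraction to be meaningful.
        return None
    return unanswered / total
```

With such a threshold, a client with a single unanswered call is left out of the statistics instead of being counted as a client that never answers.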
Out of all the unanswered and answered calls in the dataset, the attributed LECs are associated with the following groups, with the respective number of occurrences:
From table 3.1, we can see that a successful call belongs to groups '10' and '11'. However, by analysing the actual LEC values from the unanswered calls, 0.67% correspond to '10001', which is the code used for 'Call OK'. There are some exceptional situations where this might happen, for example in the Waiting Queue service, when the origin hangs up before the call is established with the Windless Media Server (WMS). WMS is a generic IP (Internet Protocol) multimedia server responsible for delivering audio content to clients who request it. However, this situation should be reanalysed so that a correct LEC is used, as 'Call OK' should only be used for calls that went through.
Analysing all the LECs in the dataset, none raises any suspicion, as most unsuccessful call errors are relatively common. Moreover, unanswered and answered calls can share the same LEC, depending on the situation. The most obvious case is the LEC '10001', which accounts for 99% of all the LECs for answered calls. Another example is the code '30003', corresponding to unexpected service errors. This error is intuitive to understand when associated with unanswered calls, but less so for answered calls. For example, if the answered call is connected to a call recorder and something unexpected happens and everything 'shuts down', it is given this LEC.
We then decided to analyse the clients that only have unanswered or only answered voice calls. The graphs with the number of clients with a certain number of voice calls are represented in fig. 3.8, with answered calls on the left and unanswered calls on the right.
Of the clients that only have answered calls, 82% have fewer than ten voice calls within the chosen period, visible from the concentration of dots under the 25 answered voice calls in fig. 3.8. Some of these calls are test calls (fig. 3.9): they happen at the same hour on multiple days, go through the same services, and the consumed time is the same each time.
Other clients only have a single phone, as their low activity (few calls) requires no more. Furthermore, no conclusion can be drawn from clients that only have a single event: since this data covers less than a month, we do not know what happens next. Some guesses are that they are either test clients making one call per month or clients that are inactive during this period but more active in the following months. Either way, it is not common to only have answered calls and, as seen, this happened for clients with very few calls.
We aggregated the 167 clients into a dendrogram, having previously normalised all their activity. The circular shape was chosen as it is more readable with hundreds of labels than the traditional flat dendrogram. This dendrogram, presented in fig. 3.10, was obtained through hierarchical clustering using Euclidean distance and complete linkage. At first, single linkage was used. However, the chaining phenomenon occurred, in which clusters are forced together due to single elements being close to each other [56]. This does not happen with complete linkage, as this method tends to find compact clusters of approximately equal diameters [56].
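The difference between the two linkage criteria can be seen on a toy example. The code below is a naive agglomerative clustering written only for illustration (in practice a library routine would be used, and this is not the code behind fig. 3.10); the six one-dimensional points are hypothetical and chosen so that each gap is slightly larger than the previous one.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def agglomerate(points, n_clusters, linkage="complete"):
    """Naive agglomerative clustering; returns clusters as lists of indices.

    "single" uses the closest pair between two clusters,
    "complete" uses the farthest pair.
    """
    combine = max if linkage == "complete" else min
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = combine(
                    euclidean(points[a], points[b])
                    for a in clusters[i]
                    for b in clusters[j]
                )
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical points forming a chain with slowly growing gaps.
pts = [(0.0,), (1.0,), (2.1,), (3.3,), (4.6,), (6.0,)]
single = agglomerate(pts, 3, "single")
complete = agglomerate(pts, 3, "complete")
```

On these points single linkage keeps absorbing the next neighbour into one growing cluster and leaves two singletons, which is the chaining effect mentioned above, whereas complete linkage returns three compact pairs.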
All the tiny clusters (the ones at the end of the circle) represent the 46 clients with only one call. The initial dataset was normalised, so each cluster should aggregate clients with similar activity patterns. However, not much can be concluded from this dendrogram, as it was drawn from clients with only a few calls. This means that activity patterns cannot be aggregated, as some clients can only form a cluster of their own, since they only have one call. The same dendrogram was attempted for the clients with only unanswered calls. However, as seen in fig. 3.8, there are around six times more clients that never answer than clients that always answer their calls. This means that, even though it is possible to aggregate them according to their call patterns, the result would not be readable in this report. A much larger window with zoom capacity would be needed in order to read the labels of all the clients.
For unanswered calls, the distribution of the number of clients as a function of the percentage of unanswered calls can be seen in fig. 3.11. Similar to the clients that only have answered calls, some clients only have unanswered calls, with a higher concentration under 100 voice calls but with counts of up to 1200 unanswered calls (fig. 3.8). This is reflected in the histograms as a very noticeable bar at the 100% edge, visible in the first two histograms of fig. 3.11, as these are the ones that cover the orders of magnitude of the x-axis of the unanswered calls graph in fig. 3.8. We will only focus on graphs with more than one hundred calls.
A peak around 37.5% arises and later, for a larger number of calls, another peak at around 80% also appears. In a universe of many calls, it is rare to have only answered calls, as only 0.5% of the clients are in this situation. As such, having a percentage of unanswered calls below 40% is normal, as the destination might be busy, general errors may happen, and so on. However, as the number of voice calls increases, the 80% peak starts to raise suspicion. While at first this might have seemed to be related to a high number of calls, fig. 3.12 shows that the two are uncorrelated, as for percentages of 60% and almost 100% the number of calls can be very similar.
The second option was to examine why the unanswered calls were terminated. We divided the clients with between 10^5 and 10^7 calls into those with a percentage of unanswered calls above 60% and those below or equal to 60%. Then, for each set, we calculated the occurrences of each LEC within the entire universe of LECs. This way, we can have an idea of the share that the LECs of one client occupy. A deeper analysis shows that the most relevant percentages from both sets represent the same LECs: '33001' (Busy destination), '33007' (Customer unreachable destination), '11010' (Session terminated by network), '10009' (Origin Abandoned) and '10022' (Origin Abandoned with Call Completed Elsewhere Reason). So, for the most part, there is no relevant difference between the LECs for percentages above and below 60%. We need to consider that, compared to the histograms with fewer calls, we have a much smaller number of clients, meaning that the distribution may be skewed: if we take a look at the general histogram with all of the calls (fig. 3.13), there is only a single peak.
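The share of each LEC within a set of calls can be computed with a simple frequency count. The sketch below is illustrative only: the function name `lec_shares` and the two small lists of codes are hypothetical, standing in for the unanswered calls of the clients above and below the 60% split.

```python
from collections import Counter


def lec_shares(lec_codes):
    """Fraction of occurrences of each LEC within a list of LEC codes."""
    counts = Counter(lec_codes)
    total = sum(counts.values())
    return {code: n / total for code, n in counts.items()}


# Hypothetical samples for the two sets of clients.
above_60 = ["33001", "33001", "33001", "10022"]
below_60 = ["33001", "10009", "10022", "33001"]

shares_above = lec_shares(above_60)
shares_below = lec_shares(below_60)
```

Sorting both dictionaries by share and comparing the leading codes shows whether the two sets are dominated by the same causes.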
When looking at the last of the histograms in fig. 3.11, we can see that there is one client that has almost 100% of its calls unanswered while, on the other side of the spectrum, one client picks up nearly every call it receives. Their IDs are, respectively, '1303' and '80933', and their call patterns are represented in fig. 3.14. These clients have completely different patterns: client '80933' describes the 'M' curve, following the standard pattern (fig. 3.6), while client '1303', even though it may drop a little around noon, keeps its activity at a fairly constant high level. Moreover, the weekend activity for client '1303' is nearly non-existent compared to the other client.
FIGURE 3.14: Activity of clients 1303 (in green) and 80933 (in pink).
A very gentle 'M' shaped curve, with a very slight drop in activity during the lunch hours, can be seen for client '80933' in fig. 3.15. For most days, there is non-stop activity during business hours, with activity at a more constant rate after lunch than before it (the lines are less steep), until 18h. Afterwards, a more abrupt drop in activity happens until 22h/23h, when the number of calls is almost null. From this, we can conclude that this client probably has many employees, mainly working from Monday to Friday, each with different lunch hours, as the dip between 11h and 13h is not as accentuated as in some others, as we will see below.
Even though weekends are slower, as for most of the clients, the activity curve largely follows the curves for weekdays (fig. 3.16).
As expected for a client with almost 100% of successful calls, '10001' (Call OK) is the predominant LEC, with a percentage of 97%. In fig. 3.17, we can see the internal services that this client goes through, along with the number of occurrences in each one of them. From this figure, we can conclude that 'OIP (Originating Identity Presentation)', 'Convergent', 'NP ANNOUNCE SURPRESS' and 'NUMBER PORTABILITY' are the most relevant services for this client.
Client 1303 has a more pronounced 'M' curve, with a clear drop to almost null values at lunch hours, more in line with the general picture seen in fig. 3.6. The most off-pattern activity corresponds to the weekends, for example the spike at 20h on the 12th of March (fig. 3.19). The spike at 13h is due to the UTC time change.
In order to understand why there are so many unanswered calls for this client, its LECs were studied. Out of all the LECs present for this client, 73% correspond to '33001' (Busy destination), 16% to '10022' (Origin Abandoned with Call Completed Elsewhere Reason) and 4% to '10009' (Origin Abandoned: caller hangs up before answering). A histogram of LEC = '33001' for all the dates and hours available is represented in fig. 3.20.
It accompanies the 'M'-shaped curve, with fewer occurrences during out-of-office and lunch hours. On weekends, there were no occurrences of this LEC. There is not enough information to understand whether such a high percentage of 'Busy Destination' is a one-month situation or happens all year. In the first case, it is just a busier season; in the second, there is a need to allocate more resources to this client. Something that could have been done was to split this histogram between answered and unanswered calls, to see whether further conclusions could be drawn.
The second most common LEC, '10022', is most likely due to the 'Hunt Group'. For example, the call is transferred to two different people. If one of them answers, the call to the other person is automatically hung up, originating a LEC equal to '10022'.
When it comes to the internal services it transits through, a configuration problem seems to arise, as only an almost insignificant percentage of the calls go through the 'Waiting Queue', and most calls of this client are not being answered. Similar to the previous client, there is a higher frequency of the internal services 'OIP' (Originating Identity Presentation) and 'Convergent', meaning, respectively, that the calling party's identity is presented to the called party and that the call's origin is identified differently depending on whether the destination is fixed or mobile. But why do these internal services have so many unanswered calls? Due to lack of time, we did not carry out a more thorough analysis of this subject, but it would be worth exploring in a future project. The remaining internal services are not worth mentioning, as their percentage is insignificant compared to those explained above.
The first approach consists of analysing the internal service 'Waiting Queue' and its characterising fields, while the second approach consists of analysing the call patterns of the group services.
Of course, there are many more clients whose activity goes through this internal service. However, the count was capped for values above 5000, as the primary goal was to decide which client to analyse, in this case the one with the largest number of transitions through the waiting queue. The Waiting Queue service has a high number of CDRs with SESSION ANSWERED = 1, since the call is usually automatically answered by the queue. However, only calls that were effectively answered by the queue's destination should be considered answered calls. 'INFO6' is the field which provides that information. However, a problem was encountered here. While 'INFO6' should only be filled with either 0 or 1, for this client we obtained ten possible values, represented in table 3.2.
TABLE 3.2: Obtained numbers for the INFO6 field and their respective number of occurrences.
Not only is this five times the number of possible values, but the values obtained also seem very random, not even including the expected value of '1'.
There are two possible explanations. Either different keys were used when the anonymisation process was performed (since there are millions of records, not all of them were extracted simultaneously, meaning that these extractions might have used different keys to anonymise the data), or there was a bug in the anonymisation process. If it is the latter, it has not yet been possible to discover the reason behind this error. This field may contain other types of information, outside the scope of the Waiting Queue, hence the need for it to be anonymised. Thus, we would not expect to see the values '0' or '1' (if anonymised, they would have to be different); however, there would still be only two values. In reality, even if we had only two numbers, we would not know which one is equivalent to '0' and which to '1'. Since the main issue of having ten values instead of two was not solved, this correspondence was not investigated. These results make this analysis unfeasible.
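A check of this kind, run on a field before building an analysis on it, can be sketched as below. The expected domain {0, 1} comes from the description of the field; the function name `unexpected_values` and the sample values are hypothetical, and missing entries are represented here by `None`.

```python
from collections import Counter

EXPECTED_VALUES = {"0", "1"}


def unexpected_values(values, expected=EXPECTED_VALUES):
    """Count occurrences of values that fall outside the expected domain."""
    return Counter(
        str(v) for v in values
        if v is not None and str(v) not in expected
    )


# Hypothetical column: two valid flags, one missing entry, two odd values.
column = [0, 1, "7f3a", None, "7f3a"]
problems = unexpected_values(column)
```

An empty result means the field respects its domain; anything else, as happened here with ten distinct values, signals that the extraction or anonymisation step needs to be reviewed first.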
In order to choose which clients to analyse, we looked for those with the most even distribution of voice calls across the group services and a large number of voice calls in each of them (fig. 3.23). Two clients were chosen, with their voice call distributions represented in fig. 3.24: client '114114' and client '91763'.
These two clients have a similar weekday activity, with the typical 'M' shaped curve (figures 3.25 and 3.26). Although the activity on weekends seems to differ, the main difference we can spot is in the activity for external and internal calls. Client '114114' has many more internal calls than external ones, with, in general, more answered calls than unanswered for internal activity. The only exceptions are the weekends, more precisely the 20th of March, where two impressive peaks can be spotted. However, we can conclude that the external activity for this client is insignificant, as, at most, we have ten calls in one hour. On the other side of the spectrum, we have client '91763', a client with more substantial external activity, with more than twice the number of unanswered calls. This is related to the nature of the business: this is likely a client whose business area is more commercial (for example, insurance companies and call centres) or a client with a much larger customer support area, whereas a client with stronger internal communication, such as client '114114', is more likely a client with multiple sites that need to communicate with each other. When it comes to internal activity for client '91763', the curves for answered and unanswered calls accompany each other, with the latter only surpassing the former during weekends, lunch and out-of-office hours.
The graphs in fig. 3.25 and fig. 3.26 were generated for Altice's interest in having an idea of the difference between the external and internal call activity of a client. This is just an example of how clients can have different characteristics reflected in their call patterns. Again, there are many characteristics one could choose from, and it is also possible to do an analysis using hierarchical clustering, similar to what we did in fig. 3.10, by aggregating clients so that those in the same cluster share similar characteristics.
At first, the idea was to do an analysis similar to the one in section 3.1.1. However, too many events were registered for a single call (GID), making it impossible to understand the temporal pattern, which led to the idea of analysing the GIDs that appeared for specific users of group services. However, after the analysis, we realised that this approach did not produce the expected results, so instead of showing the results for two clients, we only show one, in this case client '114114'.
Although it is possible to see that the call has transited through different services (in this case, IVR and PA), we cannot characterise it, because the pattern displayed may result from different configurations at the IVR level, which prevents us from drawing any conclusions. One possible configuration is a multilevel IVR (typically the intended configuration), where menus are cascaded, meaning that one of several available options, when chosen, delivers the call to another IVR Menu, which also has options linked to another IVR Menu, and so on, until eventually the chosen options deliver the call to the user. Usually, for each call, there are two records, one for the answered call and another for the forwarding, so in fig. 3.27 there are at least four levels of IVR. Another possible configuration is an IVR receiving the call with a fallback configuration for when the maximum number of attempts with invalid options is reached, delivering the call to a number belonging to another IVR, and so on.
These call patterns were expected to go through more services in a more interactive way, which did not happen. The idea was to understand the interaction between the services, which, in this case, was not possible. For example, we cannot tell from fig. 3.27 whether the person calling is pressing the right key or not. This characterisation needed to be complemented with some other sort of information, and hence it was not as useful as we thought it was going to be.
From the work done throughout this internship, we can draw some conclusions. In the first stage, we knew that a data anonymisation process had to occur. However, for future projects, it would be best if this process were discussed and tested beforehand, so that the data is available as soon as the project starts. It should be discussed because the data owners need to approve the sharing and the anonymisation process (for this project, this phase lasted for about nine to ten months), and it should be tested because, throughout the analysis, many mistakes were found: from duplicated entries, to entries mixed up between midday and midnight, to fields with records that did not make sense (e.g. fields that can only be filled with 0, 1 or NaN containing other values). The latter made the analysis that we wanted to perform on the Waiting Queue unfeasible, as it centred around one field, and that field was bugged. We are still not sure why this happened. All of this shows how hard it is to obtain data and how important it is for it to be correct.
We saw the general activity curve for all clients, with two distinct curves for weekends and weekdays. Weekdays are busier during 9h-12h and 13h-18h, with a clear drop in activity during lunch hours and almost null activity outside working hours. For weekends, with fewer calls, there is also a slight drop between 11h and 19h. We saw several clients with different activity curves, divided by day, but in general all of them were slightly or significantly 'M' shaped. With this, it was possible to spot unusual spikes or drops in the clients' daily activity. It was also possible to see multiple ways in which behaviour may be considered anomalous: clients with 100% unanswered calls and clients with the complete opposite (usually with very few calls), and clients with around a million calls and almost 100% unanswered or answered calls. It was found that, among the few clients with only answered calls, some only have one phone, meaning that on many days there is only one registered voice call. Other clients only make test calls, as the calls happen at the same hour, to the same internal service and have the same duration. However, for some others, it was impossible to draw conclusions due to the short time frame of less than a month.
In general, the percentage of unanswered calls is 40%. The 'LOGIC END CAUSE' of these unanswered calls showed us that clients with a considerable number of calls and a higher percentage of unanswered calls had 'Busy destination' as their main LEC, meaning that more resources should be allocated to these clients. Nevertheless, none of the LECs studied was out of place besides the unanswered calls given the LEC 'Call OK'. These situations should be reanalysed and given a more accurate LEC.
We can also conclude that clients with similar activity patterns but different characteristics, such as the location of the calls, show different patterns in those characteristics, as we saw previously. Other characteristics can be studied, and clients with similar ones could be aggregated so that the operator can understand the different business plans it could implement.
In general, we can say that this internship was an experience of how the enterprise world works, with the difficulties that are usually encountered and how to overcome them. This internship served as the beginning of the analytics field within the front-end team in which I was placed. For the future, a much larger time frame should be explored, as less than the month of March was analysed in this internship. A characterisation of the interactions between the group services was attempted and failed here, so a future project should place a bigger focus on it. Moreover, other call types beyond voice calls, including text messages, should be explored.
[1] Anivers´ariodaalticelabs:H´a5anosainovardeaveiroparaomundo.[Online].
Available: https://www.telecom.pt/pt-pt/media/noticias/paginas/2021/junho/ anivers%C3%A1rio-da-altice-labs-h%C3%A1-5-anos-a-inovar-de-aveiro-para-omundo.aspx [Citedonpage 1.]
[2] Alticelabs:Inovac¸aodeportugalparaomundo.(accessed:09.06.2022).[Online].
Available: https://www.europeanjobdays.eu/en/company/altice-labs [Citedon page 1.]
[3] Innovation.(accessed:09.06.2022).[Online].Available: https://www.alticelabs. com/innovation/ [Citedonpages xiii, 2, 3,and 4.]
[4] “Metadatadefinitionmeaning,”(accessed:27.09.2022).[Online].Available: https://www.merriam-webster.com/dictionary/metadata [Citedonpage 4.]
[5] J.Sammons,“Chapter10-mobiledeviceforensics,”in TheBasicsofDigitalForensics (SecondEdition),secondeditioned.,J.Sammons,Ed.Boston:Syngress,2015,pp. 145–161.[Online].Available: https://www.sciencedirect.com/science/article/pii/ B9780128016350000103 [Citedonpage 4.]
[6] A.Tauriainen,“Canyouhearmenow?acalldetailrecordbasedend-to-enddiagnosticssystemformobilenetworks,”in 201915thInternationalConferenceonNetwork andServiceManagement(CNSM),2019,pp.1–7.[Citedonpage 4.]
[7] “Whatisdataprotectionandwhydoesitmatter?”Apr2022,(accessed: 29.09.2022).[Online].Available: https://securityintelligence.com/articles/what-isdata-protection/ [Citedonpage 5.]
[8] “Whatisgdpr,theeu’snewdataprotectionlaw?”May2022,(accessed:29.09.2022). [Online].Available: https://gdpr.eu/what-is-gdpr/ [Citedonpage 5.]
[9] “Thehistoryofthegeneraldataprotectionregulation,”(accessed:29.09.2022).
[Online].Available: https://edps.europa.eu/data-protection/data-protection/ legislation/history-general-data-protection-regulation en [Citedonpage 5.]
[10] P.oftheCouncil,“Compromisetext.severalpartialgeneralapproacheshave beeninstrumentalinconvergingviewsincouncilontheproposalforageneral dataprotectionregulationinitsentirety.thetextontheregulationwhichthe presidencysubmitsforapprovalasageneralapproachappearsinannex,”(accessed: 26.04.2022).[Online].Available: http://data.consilium.europa.eu/doc/document/ ST-9565-2015-INIT/en/pdf [Citedonpage 5.]
[11] E.Union,“Directive95/46/ecoftheeuropeanparliamentandofthecouncilonthe protectionofindividualswithregardtotheprocessingofpersonaldataandonthe freemovementofsuchdata,”Oct1995,(accessed:26.04.2022).[Online].Available: https://www.refworld.org/docid/3ddcc1c74.html [Citedonpage 5.]
[12] J.Lindquist.Datascienceundergdprwithpseudonymization inthedatapipeline.(accessed:26.04.2022).[Online].Available: https://web.archive.org/web/20180418230748/https://www.dativa.com/ data-science-gdpr-pseudonymization-data-pipeline/ [Citedonpages 5 and 6.]
[13] (2019,Nov)Oquesaodadospessoais?(accessed:26.04.2022).[Online].Available: https://ec.europa.eu/info/law/law-topic/data-protection/reform/whatpersonal-data pt [Citedonpage 6.]
[14] intersoftconsulting.Keyissues:Gdprconsent.(accessed:26.04.2022).[Online]. Available: https://gdpr-info.eu/issues/consent/ [Citedonpage 6.]
[15] C.C.Aggarwal, DataMining:TheTextbook.Cham:Springer,2015.[Citedonpages 6 and 7.]
[16] A.Soetewey.Variabletypesandexamples.(accessed:24.04.2022).[Online].Available: https://towardsdatascience.com/variable-types-and-examples-cf436acaf769 [Citedonpages xiii and 7.]
[17] C.University.Dataandvariabletypes.(accessed:26.04.2022).[Online].Available: https://libguides.library.curtin.edu.au/uniskills/numeracy-skills/statistics/ data-variable-types [Citedonpages 6 and 7.]
[18] R.Sharma.Typesofdata:Nominal,ordinal,discrete,continuous.(accessed: 26.04.2022).[Online].Available: https://www.upgrad.com/blog/types-of-data/ [Citedonpages 6 and 7.]
[19] P.Portugal,“Alticeportugal,”(accessed:27.09.2022).[Online].Available: https://www.telecom.pt/en-us/a-pt/Pages/historia.aspx [Citedonpage 8.]
[20] M.Parmar.(2021,Mar)Structureyourcodebetteringooglecolabwithtextandcodecells.(accessed:16.09.2022).[Online].Available: https://miteshparmar1.medium.com/structure-your-code-better-in-googlecolab-with-text-and-code-cells-b6fa73feec20 [Citedonpages xiii and 10.]
[21] (2020,May)Oracledatabase.(accessed:16.09.2022).[Online].Available: https: //learnmystuff.com/learn/oracle-database-2/ [Citedonpages xiii and 10.]
[22] B.FutureSchool.(2022,Jun)Whatissql,andwhatarethebenefits,uses, andimportanceofsqlintherealworld?(accessed:27.09.2022).[Online]. Available: https://www.byjusfutureschool.com/blog/what-is-sql-and-what-arethe-benefits-uses-and-importance-of-sql-in-the-real-world/ [Citedonpage 10.]
[23] (2022,Mar)Whatissqlwhyisitimportanttolearnit?(accessed:27.09.2022). [Online].Available: https://codeop.tech/what-is-sql/ [Citedonpage 10.]
[24] (accessed:27.09.2022).[Online].Available: https://research.google.com/ colaboratory/faq.html [Citedonpage 10.]
[25] Colaboratory:Localruntimes.(accessed:02.09.2022).[Online].Available: https: //research.google.com/colaboratory/local-runtimes.html [Citedonpage 10.]
[26] Pandasuserguide.(accessed:24.04.2022).[Online].Available: https://pandas. pydata.org/pandas-docs/stable/user guide/index.html [Citedonpage 11.]
[27] (2022,May)Welcometocx oracle’sdocumentation!(accessed:16.09.2022).[Online]. Available: https://cx-oracle.readthedocs.io/en/latest/ [Citedonpage 11.]
[28] Whatismatplotlibinpython?[Online].Available: https://www.activestate.com/resources/quick-reads/what-is-matplotlib-inpython-how-to-use-it-for-plotting/ [Citedonpage 11.]
TIMEANDINITSMATERIALIZATIONINUSEFULBUSINESSINFORMATION.
[29] F.Pedregosa,G.Varoquaux,A.Gramfort,V.Michel,B.Thirion,O.Grisel,M.Blondel, P.Prettenhofer,R.Weiss,V.Dubourg,J.Vanderplas,A.Passos,D.Cournapeau, M.Brucher,M.Perrot,and ´ EdouardDuchesnay,“Scikit-learn:Machinelearning inpython,” JournalofMachineLearningResearch,vol.12,no.85,pp.2825–2830, 2011.[Online].Available: http://jmlr.org/papers/v12/pedregosa11a.html [Cited onpage 11.]
[30] scikit-learn.(accessed:24.04.2022).[Online].Available: https://numfocus.org/ project/scikit-learn [Citedonpage 11.]
[31] “Israkeyprogramminglanguagefordatascience?”May2022,(accessed: 29.09.2022).[Online].Available: https://codeop.tech/is-r-a-key-programminglanguage-for-data-science/ [Citedonpage 11.]
[32] “3g,”(accessed:29.09.2022).[Online].Available: https://www.sharetechnote.com/ html/Handbook UMTS SS.html [Citedonpage 12.]
[33] “T1:Asurvivalguide,”(accessed:29.09.2022).[Online].Available: https: //www.oreilly.com/library/view/t1-a-survival/0596001274/ [Citedonpage 12.]
[34] A.Sethi.(2020,Jun)Categoricalencoding:Onehotencodingvslabelencoding. (accessed:26.04.2022).[Online].Available: https://www.analyticsvidhya.com/ blog/2020/03/one-hot-encoding-vs-label-encoding-using-scikit-learn/ [Citedon page 13.]
[35] G.L.Team.(2022,Jul)Labelencodinginpythonexplained.(accessed:26.04.2022). [Online].Available: https://www.mygreatlearning.com/blog/label-encoding-inpython/ [Citedonpage 13.]
[36] “Labelencodingvsdummyvariable/onehotencoding-correctness?”Nov2019,(accessed:26.04.2022).[Online].Available: https://stats.stackexchange.com/questions/410939/label-encoding-vsdummy-variable-one-hot-encoding-correctness [Citedonpage 14.]
[37] Aporras,“Whatisthedifferencebetweenfeatureextractionandfeatureselection?” Dec2019,(accessed:16.09.2022).[Online].Available: https://quantdare.com/ what-is-the-difference-between-feature-extraction-and-feature-selection/ [Citedon page 14.]
[38] “Featureselectionvsfeatureextraction.whichtousewhen?”Feb2019, (accessed:16.09.2022).[Online].Available: https://datascience.stackexchange.com/ questions/29006/feature-selection-vs-feature-extraction-which-to-use-when [Cited onpage 14.]
[39] C.Goyal,“Featuretransformationsindatascience:Adetailed walkthrough,”Aug2022,(accessed:16.09.2022).[Online].Available: https://www.analyticsvidhya.com/blog/2021/05/feature-transformationsin-data-science-a-detailed-walkthrough/ [Citedonpage 14.]
[40] W.Mohd,A.Beg,T.Herawan,andK.Rabbi, MaxDK-Means:AClusteringAlgorithm forAuto-generationofCentroidsandDistanceofDataPointsinClusters,012012,vol.316, pp.192–199.[Citedonpages xiii, 15,and 18.]
[41] U.M.Fayyad,G.Piatetsky-Shapiro,andP.Smyth,“Fromdataminingtoknowledgediscoveryindatabases,” AIMag.,vol.17,pp.37–54,1996.[Citedonpages xiii and 15.]
[42] K.ElBouchefryandR.S.deSouza,“Chapter12-learninginbigdata:Introduction tomachinelearning,”in KnowledgeDiscoveryinBigDatafromAstronomyandEarth Observation,P. ˇ SkodaandF.Adam,Eds.Elsevier,2020,pp.225–249.[Online].Available: https://www.sciencedirect.com/science/article/pii/B9780128191545000230
[Citedonpage 15.]
[43] T.Hinton,Geoffrey;Sejnowski,Ed., UnsupervisedLearning:FoundationsofNeural Computation.TheMITPress,1999.[Citedonpage 16.]
[44] K.ElBouchefryandR.S.deSouza,“Chapter12-learninginbigdata:Introduction tomachinelearning,”in KnowledgeDiscoveryinBigDatafromAstronomyandEarth Observation,P. ˇ SkodaandF.Adam,Eds.Elsevier,2020,pp.225–249.[Online].Available: https://www.sciencedirect.com/science/article/pii/B9780128191545000230
[Citedonpage 16.]
[45] C.C.AggarwalandC.K.Reddy,Eds., DataClustering:AlgorithmsandApplications CRCPress,2014.[Online].Available: http://www.charuaggarwal.net/clusterbook. pdf [Citedonpage 16.]
XPLORATIONOFREPRESENTATIONOFTHEDIFFERENTWAYSDATAANALYSISINREAL TIMEANDINITSMATERIALIZATIONINUSEFULBUSINESSINFORMATION.
[46] B.K.Patra,N.Hubballi,S.Biswas,andS.Nandi,“Distancebasedfasthierarchical clusteringmethodforlargedatasets,”in RoughSetsandCurrentTrendsinComputing, M.Szczuka,M.Kryszkiewicz,S.Ramanna,R.Jensen,andQ.Hu,Eds.Berlin, Heidelberg:SpringerBerlinHeidelberg,2010,pp.50–59.[Citedonpage 16.]
[47] E.K.Tokuda,C.H.Comin,andL.d.F.Costa,“Revisitingagglomerativeclustering,” 2020.[Online].Available: https://arxiv.org/abs/2005.07995 [Citedonpage 17.]
[48] “Scipy.cluster.hierarchy.dendrogram,”(accessed:16.09.2022).[Online].
Available: https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster. hierarchy.dendrogram.html [Citedonpage 17.]
[49] D.Stamenic,“Definitiveguidetohierarchicalclusteringwithpythonandscikitlearn,”Jul2022,(accessed:16.09.2022).[Online].Available: https://stackabuse. com/hierarchical-clustering-with-python-and-scikit-learn/ [Citedonpages xiii, 17, and 18.]
[50] “Clusteringmethods,”Apr2010,(accessed:16.09.2022).[Online].Available: https://support.sas.com/documentation/cdl/en/statug/63033/HTML/ default/viewer.htm#statug cluster sect012.htm [Citedonpage 17.]
[51] “Rudimentsofhierarchicalclustering:Ward’smethodanddivisiveclustering,”Jul2018,(accessed:16.09.2022).[Online].Available: https://m.dexlabanalytics.com/blog/rudiments-of-hierarchical-clusteringwards-method-and-divisive-clustering [Citedonpage 18.]
[52] “Agglomerativehierarchicalclusteringusingwardlinkage,”(accessed:16.09.2022). [Online].Available: https://jbhender.github.io/Stats506/F18/GP/Group10.html [Citedonpage 18.]
[53] “Thebeginner’sguidetocallqueuing(2022),”Dec2021,(accessed:29.09.2022).[Online].Available: https://squaretalk.com/call-queuing-guide/ [Citedonpage 21.]
[54] “Ivr,”Mar2021,(accessed:16.09.2022).[Online].Available: https://t4d.co.ke/ivr/ [Citedonpage 21.]
[55] “Learnmoreabouttalend,”(accessed:07.11.2022).[Online].Available: https: //www.talend.com/ [Citedonpage 29.]
[56] HierarchicalClustering.JohnWileySons,Ltd,2011,ch.4,pp.71–110.[Online].Available: https://onlinelibrary.wiley.com/doi/abs/10.1002/9780470977811.ch4 [Cited onpage 36.]