EXPLORATORY DATA ANALYSIS USING R
Chapman & Hall/CRC Data Mining and Knowledge Discovery Series
Series Editor: Vipin Kumar
Computational Business Analytics
Subrata Das
Data Classification
Algorithms and Applications
Charu C. Aggarwal
Healthcare Data Analytics
Chandan K. Reddy and Charu C. Aggarwal
Accelerating Discovery
Mining Unstructured Information for Hypothesis Generation
Scott Spangler
Event Mining
Algorithms and Applications
Tao Li
Text Mining and Visualization
Case Studies Using Open-Source Tools
Markus Hofmann and Andrew Chisholm
Graph-Based Social Media Analysis
Ioannis Pitas
Data Mining
A Tutorial-Based Primer, Second Edition
Richard J. Roiger
Data Mining with R
Learning with Case Studies, Second Edition
Luís Torgo
Social Networks with Rich Edge Semantics
Quan Zheng and David Skillicorn
Large-Scale Machine Learning in the Earth Sciences
Ashok N. Srivastava, Ramakrishna Nemani, and Karsten Steinhaeuser
Data Science and Analytics with Python
Jesus Rogel-Salazar
Feature Engineering for Machine Learning and Data Analytics
Guozhu Dong and Huan Liu
Exploratory Data Analysis Using R
Ronald K. Pearson
For more information about this series please visit: https://www.crcpress.com/Chapman--HallCRC-Data-Mining-and-Knowledge-Discovery-Series/book-series/CHDAMINODIS
EXPLORATORY DATA ANALYSIS USING R Ronald K. Pearson
CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742
© 2018 by Taylor & Francis Group, LLC
CRC Press is an imprint of Taylor & Francis Group, an Informa business
No claim to original U.S. Government works
Printed on acid-free paper
Version Date: 20180312
International Standard Book Number-13: 978-1-1384-8060-5 (Hardback)
This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may rectify in any future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.
For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged.
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe.
Visit the Taylor & Francis Web site at http://www.taylorandfrancis.com
and the CRC Press Web site at http://www.crcpress.com
1.2.2 Exploratory analysis
1.2.3 Computers, software, and R
1.3 A representative R session
1.4 Organization of this book
2.2.2 Grid graphics
2.2.3 Lattice graphics
2.2.4 The ggplot2 package
2.3.1 The flexibility of the plot function
2.3.2 S3 classes and generic functions
2.3.3 Optional parameters for base graphics
2.4.1 Adding points and lines to a scatterplot
2.4.2 Adding text to a plot
2.4.3 Adding a legend to a plot
2.4.4 Customizing axes
2.5 A few different plot types
2.5.1 Pie charts and why they should be avoided
2.5.2 Barplot summaries
2.5.3 The symbols function
2.7.1 A few general guidelines
2.7.3 The tableplot function
3 Exploratory Data Analysis: A First Look
3.2.1 “Typical” values: the mean
3.2.2 “Spread”: the standard deviation
3.2.3 Limitations of simple summary statistics
3.2.4 The Gaussian assumption
3.3.1 Outliers and their influence
3.3.2 Detecting univariate outliers
3.3.3 Inliers and their detection
3.3.5 Missing data, possibly disguised
3.3.6 QQ-plots revisited
3.4 Visualizing relations between variables
3.4.1 Scatterplots between numerical variables
3.4.2 Boxplots: numerical vs. categorical variables
3.4.3 Mosaic plots: categorical scatterplots
3.5 Exercises
4.1 File management in R
4.2 Manual data entry
4.2.1 Entering the data by hand
4.2.2 Manual data entry is bad but sometimes expedient
4.3.1 Previews of three Internet data examples
4.4.1 Reading and writing CSV files
4.4.2 Spreadsheets and csv files are not the same thing
4.4.3 Two potential problems with CSV files
4.5 Working with other file types
4.5.1 Working with text files
4.5.2 Saving and retrieving R objects
4.5.3 Graphics files
4.6 Merging data from different sources
4.7 A brief introduction to databases
4.7.1 Relational databases, queries, and SQL
4.7.2 An introduction to the sqldf package
4.7.3 An overview of R’s database support
4.7.4 An introduction to the RSQLite package
4.8 Exercises
5 Linear Regression Models
5.1 Modeling the whiteside data
5.1.1 Describing lines in the plane
5.1.2 Fitting lines to points in the plane
5.1.3 Fitting the whiteside data
5.2 Overfitting and data splitting
5.2.1 An overfitting example
5.2.2 The training/validation/holdout split
5.2.3 Two useful model validation tools
5.3 Regression with multiple predictors
5.3.1 The Cars93 example
5.3.2 The problem of collinearity
5.4 Using categorical predictors
5.5 Interactions in linear regression models
5.6 Variable transformations in linear regression
5.7 Robust regression: a very brief introduction
5.8 Exercises
6 Crafting Data Stories
6.1 Crafting good data stories
6.1.1 The importance of clarity
6.1.2 The basic elements of an effective data story
6.2 Different audiences have different needs
6.2.1 The executive summary or abstract
6.2.2 Extended summaries
6.2.3 Longer documents
6.3 Three example data stories
6.3.1 The Big Mac and Grande Latte economic indices
6.3.2 Small losses in the Australian vehicle insurance data
6.3.3 Unexpected heterogeneity: the Boston housing data
7 Programming in R
7.1 Interactive use versus programming
7.1.1 A simple example: computing Fibonacci numbers
7.1.2 Creating your own functions
7.2 Key elements of the R language
7.2.1 Functions and their arguments
7.2.2 The list data type
7.2.3 Control structures
7.2.4 Replacing loops with apply functions
7.2.5 Generic functions revisited
7.3 Good programming practices
7.3.1 Modularity and the DRY principle
7.3.2 Comments
7.3.3 Style guidelines
7.3.4 Testing and debugging
7.4 Five programming examples
7.4.1 The function ValidationRsquared
7.4.2 The function TVHsplit
7.4.3 The function PredictedVsObservedPlot
7.4.4 The function
8 Working with Text Data
8.1 The fundamentals of text data analysis
8.1.1 The basic steps in analyzing text data
8.1.2 An illustrative example
8.2 Basic character functions in R
8.2.1 The nchar
8.2.2 The grep
8.2.3 Application to missing data and alternative spellings
8.2.4 The sub and gsub functions
8.2.5 The strsplit function
8.2.6 Another application: ConvertAutoMpgRecords
8.2.7 The paste function
8.3 A brief introduction to regular expressions
8.3.1 Regular expression basics
8.3.2 Some useful regular expression examples
8.4 An aside: ASCII vs. UNICODE
8.5 Quantitative text analysis
8.5.1 Document-term and document-feature matrices
8.5.2 String distances and approximate matching
8.6 Three detailed examples
8.6.1 Characterizing a book
8.6.2 The cpus data frame
8.6.3 The unclaimed bank account data
8.7 Exercises
9 Exploratory Data Analysis: A Second Look
9.1 An example: repeated measurements
9.1.1 Summary and practical implications
9.1.2 The gory details
9.2 Confidence intervals and significance
9.2.1 Probability models versus data
9.2.2 Quantiles of a distribution
9.2.3 Confidence intervals
9.2.4 Statistical significance and p-values
9.3 Characterizing a binary variable
9.3.1 The binomial distribution
9.3.2 Binomial confidence intervals
9.3.3 Odds ratios
9.4 Characterizing count data
9.4.1 The Poisson distribution and rare events
9.4.2 Alternative count distributions
9.4.3 Discrete distribution plots
9.5 Continuous distributions
9.5.1 Limitations of the Gaussian distribution
9.5.2 Some alternatives to the Gaussian distribution
9.5.3 The qqPlot function revisited
9.5.4 The problems of ties and implosion
9.6 Associations between numerical variables
9.6.1 Product-moment correlations
9.6.2 Spearman’s rank correlation measure
9.6.3 The correlation trick
9.6.4 Correlation matrices and correlation plots
9.6.5 Robust correlations
9.6.6 Multivariate outliers
9.7 Associations between categorical variables
9.7.1 Contingency tables
9.7.2 The chi-squared measure and Cramér’s V
9.7.3 Goodman and Kruskal’s tau measure
9.8 Principal component analysis (PCA)
9.9 Working with date variables
9.10 Exercises
10 More General Predictive Models
10.1 A predictive modeling overview
10.1.1 The predictive modeling problem
10.1.2 The model-building process
10.2 Binary classification and logistic regression
10.2.1 Basic logistic regression formulation
10.2.2 Fitting logistic regression models
10.3.1 Structure and fitting of decision trees
10.3.2 A classification tree example
10.6.1 Partial dependence plots
10.6.2 Variable importance measures
Preface
Much has been written about the abundance of data now available from the Internet and a great variety of other sources. In his aptly named 2007 book Glut [81], Alex Wright argued that the total quantity of data then being produced was approximately five exabytes per year (5 × 10^18 bytes), more than the estimated total number of words spoken by human beings in our entire history. And that assessment was from a decade ago: increasingly, we find ourselves “drowning in an ocean of data,” raising questions like “What do we do with it all?” and “How do we begin to make any sense of it?”
Fortunately, the open-source software movement has provided us with (at least partial) solutions like the R programming language. While R is not the only relevant software environment for analyzing data (Python is another option with a growing base of support), R probably represents the most flexible data analysis software platform that has ever been available. R is largely based on S, a software system developed by John Chambers, who was awarded the 1998 Software System Award by the Association for Computing Machinery (ACM) for its development; the award noted that S “has forever altered the way people analyze, visualize, and manipulate data.”
The other side of this software coin is educational: given the availability and sophistication of R, the situation is analogous to someone giving you an F-15 fighter aircraft, fully fueled with its engines running. If you know how to fly it, this can be a great way to get from one place to another very quickly. But it is not enough to just have the plane: you also need to know how to take off in it, how to land it, and how to navigate from where you are to where you want to go. Also, you need to have an idea of where you do want to go. With R, the situation is analogous: the software can do a lot, but you need to know both how to use it and what you want to do with it.
The purpose of this book is to address the most important of these questions. Specifically, this book has three objectives:
1. To provide a basic introduction to exploratory data analysis (EDA);
2. To introduce the range of “interesting” (good, bad, and ugly) features we can expect to find in data, and why it is important to find them;
3. To introduce the mechanics of using R to explore and explain data.
This book grew out of materials I developed for the course “Data Mining Using R” that I taught for the University of Connecticut Graduate School of Business. The students in this course typically had little or no prior exposure to data analysis, modeling, statistics, or programming. This was not universally true, but it was typical, so it was necessary to make minimal background assumptions, particularly with respect to programming. Further, it was also important to keep the treatment relatively non-mathematical: data analysis is an inherently mathematical subject, so it is not possible to avoid mathematics altogether, but for this audience it was necessary to assume no more than the minimum essential mathematical background.
The intended audience for this book is students (both advanced undergraduates and entry-level graduate students), along with working professionals who want a detailed but introductory treatment of the three topics listed in the book’s title: data, exploratory analysis, and R. Exercises are included at the ends of most chapters, and an instructor’s solution manual giving complete solutions to all of the exercises is available from the publisher.
Author
Ronald K. Pearson is a Senior Data Scientist with GeoVera Holdings, a property insurance company in Fairfield, California, involved primarily in the exploratory analysis of data, particularly text data. Previously, he held the position of Data Scientist with DataRobot in Boston, a software company whose products support large-scale predictive modeling for a wide range of business applications and are based on Python and R, where he was one of the authors of the datarobot R package. He is also the developer of the GoodmanKruskal R package and has held a variety of other industrial, business, and academic positions. These positions include both the DuPont Company and the Swiss Federal Institute of Technology (ETH Zürich), where he was an active researcher in the area of nonlinear dynamic modeling for industrial process control, the Tampere University of Technology where he was a visiting professor involved in teaching and research in nonlinear digital filters, and the Travelers Companies, where he was involved in predictive modeling for insurance applications. He holds a PhD in Electrical Engineering and Computer Science from the Massachusetts Institute of Technology and has published conference and journal papers on topics ranging from nonlinear dynamic model structure selection to the problems of disguised missing data in predictive modeling. Dr. Pearson has authored or co-authored five previous books, including Exploring Data in Engineering, the Sciences, and Medicine (Oxford University Press, 2011) and Nonlinear Digital Filtering with Python, co-authored with Moncef Gabbouj (CRC Press, 2016). He is also the developer of the DataCamp course on base R graphics.
Chapter 1
Data, Exploratory Analysis, and R
1.1 Why do we analyze data?
The basic subject of this book is data analysis, so it is useful to begin by addressing the question of why we might want to do this. There are at least three motivations for analyzing data:
1. to understand what has happened or what is happening;
2. to predict what is likely to happen, either in the future or in other circumstances we haven’t seen yet;
3. to guide us in making decisions.
The primary focus of this book is on exploratory data analysis, discussed further in the next section and throughout the rest of this book, and this approach is most useful in addressing problems of the first type: understanding our data. That said, the predictions required in the second type of problem listed above are typically based on mathematical models like those discussed in Chapters 5 and 10, which are optimized to give reliable predictions for data we have available, in the hope and expectation that they will also give reliable predictions for cases we haven’t yet considered. In building these models, it is important to use representative, reliable data, and the exploratory analysis techniques described in this book can be extremely useful in making certain this is the case. Similarly, in the third class of problems listed above (making decisions), it is important that we base them on an accurate understanding of the situation and/or accurate predictions of what is likely to happen next. Again, the techniques of exploratory data analysis described here can be extremely useful in verifying and/or improving the accuracy of our data and our predictions.
1.2 The view from 90,000 feet
This book is intended as an introduction to the three title subjects (data, its exploratory analysis, and the R programming language), and the following sections give high-level overviews of each, emphasizing key details and interrelationships.
1.2.1 Data
Loosely speaking, the term “data” refers to a collection of details, recorded to characterize a source like one of the following:
• an entity, e.g.: family history from a patient in a medical study; manufacturing lot information for a material sample in a physical testing application; or competing company characteristics in a marketing analysis;
• an event, e.g.: demographic characteristics of those who voted for different political candidates in a particular election;
• a process, e.g.: operating data from an industrial manufacturing process.
This book will generally use the term “data” to refer to a rectangular array of observed values, where each row refers to a different observation of entity, event, or process characteristics (e.g., distinct patients in a medical study), and each column represents a different characteristic (e.g., diastolic blood pressure) recorded (or at least potentially recorded) for each row. In R’s terminology, this description defines a data frame, one of R’s key data types.
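As a minimal sketch of this rectangular layout, the following builds a small data frame by hand. The object name patients and every value in it are hypothetical, chosen only to mirror the medical-study example above; NA marks a characteristic that was not recorded for one row.

```r
# Hypothetical toy data: one row per patient, one column per characteristic
patients <- data.frame(
  id           = c("P01", "P02", "P03"),
  age          = c(54, 61, 47),
  diastolic_bp = c(82, NA, 76),   # NA: value not recorded for this patient
  stringsAsFactors = FALSE
)

dim(patients)   # number of rows, then number of columns
str(patients)   # compact description of each column and its type
```

Each column holds values of a single type, while different columns may have different types, which is what distinguishes a data frame from a plain numeric matrix.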
The mtcars data frame is one of many built-in data examples in R. This data frame has 32 rows, each one corresponding to a different car. Each of these cars is characterized by 11 variables, which constitute the columns of the data frame. These variables include the car’s mileage (in miles per gallon, mpg), the number of gears in its transmission, the transmission type (manual or automatic), the number of cylinders, the horsepower, and various other characteristics. The original source of this data was a comparison of 32 cars from model years 1973 and 1974 published in Motor Trend Magazine. The first six records of this data frame may be examined using the head command in R:
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
An important feature of data frames in R is that both rows and columns have names associated with them. In favorable cases, these names are informative, as they are here: the row names identify the particular cars being characterized, and the column names identify the characteristics recorded for each car.
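The row and column names described here can be inspected directly with base R functions. The short sketch below uses only the built-in mtcars data frame; the variable names car_names, var_names, and valiant_mpg are introduced purely for illustration.

```r
dim(mtcars)                       # rows and columns of the data frame

car_names <- rownames(mtcars)     # one name per car (row)
var_names <- colnames(mtcars)     # one name per recorded characteristic (column)
head(car_names, 3)
var_names

# Names can be used to look up a single value by row and column
valiant_mpg <- mtcars["Valiant", "mpg"]
valiant_mpg
```

Indexing by name rather than by position tends to make code easier to read and less fragile if the row or column order changes.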
A more complete description of this dataset is available through R’s built-in help facility. Typing “help(mtcars)” at the R command prompt will bring up a help page that gives the original source of the data, cites a paper from the statistical literature that analyzes this dataset [39], and briefly describes the variables included. This information constitutes metadata for the mtcars data frame: metadata is “data about data,” and it can vary widely in terms of its completeness, consistency, and general accuracy. Since metadata often provides much of our preliminary insight into the contents of a dataset, it is extremely important, and any limitations of this metadata (incompleteness, inconsistency, and/or inaccuracy) can cause serious problems in our subsequent analysis. For these reasons, discussions of metadata will recur frequently throughout this book. The key point here is that, potentially valuable as metadata is, we cannot afford to accept it uncritically: we should always cross-check the metadata with the actual data values, with our intuition and prior understanding of the subject matter, and with other sources of information that may be available.
As a specific illustration of this last point, a popular benchmark dataset for evaluating binary classification algorithms (i.e., computational procedures that attempt to predict a binary outcome from other variables) is the Pima Indians diabetes dataset, available from the UCI Machine Learning Repository, an important Internet data source discussed further in Chapter 4. In this particular case, the dataset characterizes female adult members of the Pima Indians tribe, giving a number of different medical status and history characteristics (e.g., diastolic blood pressure, age, and number of times pregnant), along with a binary diagnosis indicator with the value 1 if the patient had been diagnosed with diabetes and 0 if they had not. Several versions of this dataset are available: the one considered here was obtained from the UCI website on May 10, 2014, and it has 768 rows and 9 columns. In contrast, the data frame Pima.tr included in R’s MASS package is a subset of this original, with 200 rows and 8 columns. The metadata available for this dataset from the UCI Machine Learning Repository now indicates that this dataset exhibits missing values, but there is also a note that prior to February 28, 2011 the metadata indicated that there were no missing values. In fact, the missing values in this dataset are not coded explicitly as missing with a special code (e.g., R’s “NA” code), but are instead coded as zero. As a result, a number of studies characterizing binary classifiers have been published using this dataset as a benchmark where the authors were not aware that data values were missing: in some cases, quite a large fraction of the total observations. As a specific example, the serum insulin measurement included in the dataset is 48.7% missing.
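The effect of zero-coded missing values can be sketched with a toy example. The data frame toy below is hypothetical and its values are invented for illustration; it is not the Pima Indians dataset. The sketch assumes that a zero in the insulin column is physically implausible and therefore stands for "not recorded", while a zero in the diagnosis column is a legitimate value.

```r
# Hypothetical values for illustration only (not the actual Pima data)
toy <- data.frame(
  insulin  = c(0, 94, 168, 0, 88, 0),   # 0 assumed to mean "not recorded"
  diabetes = c(1, 0, 1, 0, 0, 1)        # 0 is a legitimate value here
)

# Taken at face value, nothing appears to be missing
frac_na_raw <- mean(is.na(toy$insulin))

# Recode zeros as NA only in the column where zero is implausible
insulin_clean <- toy$insulin
insulin_clean[insulin_clean == 0] <- NA
frac_na_clean <- mean(is.na(insulin_clean))

c(raw = frac_na_raw, recoded = frac_na_clean)

# Disguised missing values also distort simple summaries
mean(toy$insulin)                   # zeros pull the mean down
mean(insulin_clean, na.rm = TRUE)   # mean of the recorded values only
```

The recoding has to be decided column by column from subject-matter knowledge: applying it blindly to every column would wrongly erase the valid zeros in the diagnosis indicator.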
Finally, it is important to recognize the essential role our assumptions about data can play in its subsequent analysis. As a simple and amusing example, consider the following “data analysis” question: how many planets are there orbiting the Sun? Until about 2006, the generally accepted answer was nine, with Pluto the outermost member of this set. Pluto was subsequently re-classified as a “dwarf planet,” in part because a larger, more distant body was found in the Kuiper Belt and enough astronomers did not want to classify this object as the “tenth planet” that Pluto was demoted to dwarf planet status. In his book, Is Pluto a Planet? [72], astronomer David Weintraub argues that Pluto should remain a planet, based on the following defining criteria for planethood:
1. the object must be too small to generate, or to have ever generated, energy through nuclear fusion;
2. the object must be big enough to be spherical;
3. the object must have a primary orbit around a star.
The first of these conditions excludes dwarf stars from being classed as planets, and the third excludes moons from being declared planets (since they orbit planets, not stars). Weintraub notes, however, that under this definition, there are at least 24 planets orbiting the Sun: the eight now generally regarded as planets, Pluto, and 15 of the largest objects from the asteroid belt between Mars and Jupiter and from the Kuiper Belt beyond Pluto. This example illustrates that definitions are both extremely important and not to be taken for granted: everyone knows what a planet is, don’t they? In the broader context of data analysis, the key point is that unrecognized disagreements in the definition of a variable are possible between those who measure and record it, and those who subsequently use it in analysis; these discrepancies can lie at the heart of unexpected findings that turn out to be erroneous. For example, if we wish to combine two medical datasets, characterizing different groups of patients with “the same” disease, it is important that the same diagnostic criteria be used to declare patients “diseased” or “not diseased.” For a more detailed discussion of the role of definitions in data analysis, refer to Sec. 2.4 of Exploring Data in Engineering, the Sciences, and Medicine [58]. (Although the book is generally quite mathematical, this is not true of the discussions of data characteristics presented in Chapter 2, which may be useful to readers of this book.)
1.2.2 Exploratory analysis
Roughly speaking, exploratory data analysis (EDA) may be defined as the art of looking at one or more datasets in an effort to understand the underlying structure of the data contained there. A useful description of how we might go about this is offered by Diaconis [21]:
We look at numbers or graphs and try to find patterns. We pursue leads suggested by background information, imagination, patterns perceived, and experience with other data analyses.
Note that this quote suggests (although it does not strictly imply) that the data we are exploring consists of numbers. Indeed, even if our dataset contains nonnumerical data, our analysis of it is likely to be based largely on numerical characteristics computed from these nonnumerical values. As a specific example, categorical variables appearing in a dataset like “city,” “political party affiliation,” or “manufacturer” are typically tabulated, converted from discrete named values into counts or relative frequencies. These derived representations can be particularly useful in exploring data when the number of levels (i.e., the number of distinct values the original variable can exhibit) is relatively small. In such cases, many useful exploratory tools have been developed that allow us to examine the character of these nonnumeric variables and their relationship with other variables, whether categorical or numeric. Simple graphical examples include boxplots for looking at the distribution of numerical values across the different levels of a categorical variable, or mosaic plots for looking at the relationship between categorical variables; both of these plots and other, closely related ones are discussed further in Chapters 2 and 3.
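A minimal sketch of the tabulation described above, using the built-in mtcars data frame and treating its cyl column (number of cylinders) as a categorical variable with a small number of levels. That treatment is an assumption made for illustration; the variable names cyl_counts and cyl_freqs are likewise illustrative.

```r
# Convert a few-level variable into counts and relative frequencies
cyl_counts <- table(mtcars$cyl)
cyl_freqs  <- prop.table(cyl_counts)
cyl_counts
round(cyl_freqs, 3)

# Boxplot: distribution of a numerical variable across the levels
boxplot(mpg ~ cyl, data = mtcars,
        xlab = "Number of cylinders", ylab = "Miles per gallon")

# Mosaic plot: relationship between two few-level variables
mosaicplot(table(mtcars$cyl, mtcars$gear),
           xlab = "Cylinders", ylab = "Gears", main = "")
```

Both plotting functions come from the graphics package that is loaded by default, so no additional packages are assumed.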
Categorical variables with many levels pose more challenging problems, and these come in at least two varieties. One is represented by variables like U.S. postal zip code, which identifies geographic locations at a much finer-grained level than state does and exhibits about 40,000 distinct levels. A detailed discussion of dealing with this type of categorical variable is beyond the scope of this book, although one possible approach is described briefly at the end of Chapter 10. The second type of many-level categorical variable arises in settings where the inherent structure of the variable can be exploited to develop specialized analysis techniques. Text data is a case in point: the number of distinct words in a document or a collection of documents can be enormous, but special techniques for analyzing text data have been developed. Chapter 8 introduces some of the methods available in R for analyzing text data.
The mention of “graphs” in the Diaconis quote is particularly important since humans are much better at seeing patterns in graphs than in large collections of numbers. This is one of the reasons R supports so many different graphical display methods (e.g., scatterplots, barplots, boxplots, quantile-quantile plots, histograms, mosaic plots, and many, many more), and one of the reasons this book places so much emphasis on them. That said, two points are important here. First, graphical techniques that are useful to the data analyst in finding important structure in a dataset are not necessarily useful in explaining those findings to others. For example, large arrays of two-variable scatterplots may be a useful screening tool for finding related variables or anomalous data subsets, but these are extremely poor ways of presenting results to others because they essentially require the viewer to repeat the analysis for themselves. Instead, results should be presented to others using displays that highlight and emphasize the analyst’s findings to make sure that the intended message is received. This distinction between exploratory and explanatory displays is discussed further in Chapter 2 on graphics in R and in Chapter 6 on crafting data stories (i.e., explaining your findings), but most of the emphasis in this book is on exploratory graphical tools to help us obtain these results.
The second point to note here is that the utility of any graphical display can depend strongly on exactly what is plotted, as illustrated in Fig. 1.1. This issue has two components: the mechanics of how a subset of data is displayed, and the choice of what goes into that data subset. While both of these aspects are important, the second is far more important than the first. Specifically, it is important to note that the form in which data arrives may not be the most useful for analysis. To illustrate, Fig. 1.1 shows two sets of plots, both constructed from the brain element of the mammals dataset from the MASS package that lists body weights and brain weights for 62 different animals. This data frame is discussed further in Chapter 3, along with the characterizations presented here, which are histograms (top two plots) and normal QQ-plots (bottom two plots). In both cases, these plots are attempting to tell us something about the distribution of data values, and the point of this example is that the extent to which these plots are informative depends strongly on how we prepare the data from which they are constructed. Here, the left-hand pair of plots were generated from the raw data values and they are much less informative than the right-hand pair of plots, which were generated from log-transformed data. In particular, these plots suggest that the log-transformed data exhibits a roughly Gaussian distribution, further suggesting that working with the log of brain weight may be more useful than working with the raw data values. This example is revisited and discussed in much more detail in Chapter 3, but the point here is that exactly what we plot (e.g., raw data values vs. log-transformed data values) sometimes matters a lot more than how we plot it.
library(MASS)       # provides the mammals data frame and truehist()
library(car)        # provides qqPlot()
par(mfrow=c(2,2))   # 2-by-2 array of plots, filled row by row
truehist(mammals$brain)
truehist(log(mammals$brain))
qqPlot(mammals$brain)
title("Normal QQ-plot")
qqPlot(log(mammals$brain))
title("Normal QQ-plot")
Figure 1.1: Two pairs of characterizations of the brain weight data from the mammals data frame: histograms and normal QQ-plots constructed from the raw data (left-hand plots), and from log-transformed data (right-hand plots).
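The contrast between raw and log-transformed brain weights can also be seen numerically, without any plots. The sketch below assumes only that the MASS package is installed (it ships with standard R distributions); the names raw, logged, and the helper mean_vs_median are illustrative. A mean far above the median signals a strongly right-skewed distribution, while a mean close to the median is consistent with the rough symmetry of a Gaussian shape.

```r
library(MASS)   # provides the mammals data frame

raw    <- mammals$brain   # brain weights as recorded
logged <- log(raw)        # natural-log-transformed brain weights

# Illustrative helper: compare two measures of a "typical" value
mean_vs_median <- function(x) c(mean = mean(x), median = median(x))

mean_vs_median(raw)      # mean is many times the median: strong right skew
mean_vs_median(logged)   # mean and median are much closer together
```

This comparison is only a rough symmetry check and does not replace the histograms and QQ-plots, which show the whole distribution rather than two summary numbers.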
Since it is one of the main themes of this book, a much more extensive introduction to exploratory data analysis is given in Chapter 3. Three key points to note here are, first, that exploratory data analysis makes extensive use of graphical tools, for the reasons outlined above. Consequently, the wide and growing variety of graphical methods available in R makes it a particularly suitable environment for exploratory analysis. Second, exploratory analysis often involves characterizing many different variables and/or data sources, and comparing these characterizations. This motivates the widespread use of simple and well-known summary statistics like means, medians, and standard deviations, along with other, less well-known characterizations like the MAD scale estimate introduced in Chapter 3. Finally, third, an extremely important aspect of exploratory data analysis is the search for “unusual” or “anomalous” features in a dataset. The notion of an outlier is introduced briefly in Sec. 1.3, but a more detailed discussion of this and other data anomalies is deferred until Chapter 3, where techniques for detecting these anomalies are also discussed.
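As a rough illustration of comparing simple characterizations across variables, the sketch below computes the mean, median, standard deviation, and MAD scale estimate for both columns of the mammals data frame. The helper name characterize is hypothetical, and mad() is used with its default settings.

```r
library(MASS)  # provides the mammals data frame

# Hypothetical helper: four simple characterizations of a numeric vector
characterize <- function(v) {
  c(mean = mean(v), median = median(v), sd = sd(v), mad = mad(v))
}

# Apply the helper to every column; the result is a matrix with one
# column per variable and one row per statistic
stats <- sapply(mammals, characterize)
stats
```

For heavily skewed variables like these, a large gap between the mean and the median, or between the standard deviation and the MAD scale estimate, is itself a hint that a few extreme values dominate the classical summaries.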
1.2.3 Computers, software, and R
Using R, or any other data analysis environment, involves three basic tasks:
1. Make the data you want to analyze available to the analysis software;
2. Perform the analysis;
3. Make the results of the analysis available to those who need them.
In this chapter, all of the data examples come from built-in data frames in R, which are extremely convenient for teaching or learning R, but in real data analysis applications, making the data available for analysis can require significant effort. Chapter 4 focuses on this problem, but to understand its nature and significance, it is necessary to understand something about how computer systems are organized, and this is the subject of the next section. Related issues arise when we attempt to make analysis results available for others, and these issues are also covered in Chapter 4. Most of the book is devoted to various aspects of step (2) above (performing the analysis), and the second section below briefly addresses the question of “why use R and not something else?” Finally, since this is a book about using R to analyze data, some key details about the structure of the R language are presented in the third section below.
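The three basic tasks can be sketched in a few lines of R. This is only an illustration under stated assumptions: the CSV files are temporary files created by the example itself so that it is self-contained, and the regression of log brain weight on log body weight is an arbitrary stand-in for "the analysis."

```r
library(MASS)  # provides the mammals data frame

# Setup only: write a built-in data frame to a temporary CSV file so there
# is something on disk to read (a real application would start from an
# existing file)
in_file <- tempfile(fileext = ".csv")
write.csv(mammals, in_file)

# Task 1: make the data available to R
animals <- read.csv(in_file, row.names = 1)

# Task 2: perform the analysis (illustrative choice of model)
fit <- lm(log(brain) ~ log(body), data = animals)
result <- as.data.frame(coef(summary(fit)))

# Task 3: make the results available to others
out_file <- tempfile(fileext = ".csv")
write.csv(result, out_file)
```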
General structure of a computing environment
In his book, Introduction to Data Technologies [56, pp. 211–214], Paul Murrell describes the general structure of a computing environment in terms of the following six components:
1. the CPU or central processing unit is the basic hardware that does all of the computing;
2. the RAM or random access memory is the internal memory where the CPU stores and retrieves results;
3. the keyboard is the standard interface that allows the user to submit requests to the computer system;
4. the screen is the graphical display terminal that allows the user to see the results generated by the computer system;
5. the mass storage, typically a “hard disk,” is the external memory where data and results can be stored permanently;
6. the network is an external connection to the outside world, including the Internet but also possibly an intranet of other computers, along with peripheral devices like printers.
Three important distinctions between internal storage (i.e., RAM) and external storage (i.e., mass storage) are, first, that RAM is typically several orders of magnitude faster to access than mass storage; second, that RAM is volatile (i.e., the contents are lost when the power is turned off) while mass storage is not; and, third, that mass storage can accommodate much larger volumes of data than RAM can. (As a specific example, the computer being used to prepare this book has 4 GB of installed RAM and just over 100 times as much disk storage.) A practical consequence is that both the data we want to analyze and any results we want to save need to end up in mass storage so they are not lost when the computer power is turned off. Chapter 4 is devoted to a detailed discussion of some of the ways we can move data into and out of mass storage.
These differences between RAM and mass storage are particularly relevant to R since most R functions require all data (both the raw data and the internal storage required to keep any temporary, intermediate results) to fit in RAM. This makes the computations faster, but it limits the size of the datasets you can work with in most cases to something less than the total installed RAM on your computer. In some applications, this restriction represents a serious limitation on R’s applicability. This limitation is recognized within the R community and continuing efforts are being made to improve the situation.
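One way to get a feel for this memory constraint is to ask R how much RAM an object occupies. The sketch below uses object.size() from the utils package; the sizes chosen for n_rows and n_cols are arbitrary illustrative values, and the 8-bytes-per-value figure applies to double-precision numbers only.

```r
library(MASS)  # provides the mammals data frame

# Memory occupied by a small built-in data frame
format(object.size(mammals), units = "Kb")

# Back-of-the-envelope estimate for a larger numeric dataset: each
# double-precision value needs 8 bytes (illustrative dimensions)
n_rows <- 1e6
n_cols <- 10
approx_mb <- n_rows * n_cols * 8 / 1024^2
approx_mb

# Check the 8-bytes-per-value rule against an actual vector
x <- numeric(1e6)
x_bytes <- as.numeric(object.size(x))
x_bytes
```

Estimates like this are a lower bound in practice, since temporary copies made during an analysis also have to fit in RAM.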
Closely associated with the CPU is the operating system, which is the software that runs the computer system, making useful activity possible. That is, the operating system coordinates the different components, establishing and managing file systems that allow datasets to be stored, located, modified, or deleted; providing user access to programs like R; providing the support infrastructure required so these programs can interact with network resources, etc. In addition to the general computing infrastructure provided by the operating system, to analyze data it is necessary to have programs like R and possibly others (e.g., database programs). Further, these programs must be compatible with the operating system: on popular desktops and enterprise servers, this is usually not a problem, although it can become a problem for older operating systems. For example, Section 2.2 of the R FAQ document available from the R “Help” tab notes that “support for Mac OS Classic ended with R 1.7.1.”
With the growth of the Internet as a data source, it is becoming increasingly important to be able to retrieve and process data from it. Unfortunately, this involves a number of issues that are well beyond the scope of this book (e.g., parsing HTML to extract data stored in web pages). A brief introduction to the key ideas with some simple examples is given in Chapter 4, but for those needing a more thorough treatment, Murrell’s book is highly recommended [56].
Data analysis software
A key element of the data analysis chain (acquire → analyze → explain) described earlier is the choice of data analysis software. Since there are a number of possibilities here, why R? One reason is that R is a free, open-source language, available for most popular operating systems. In contrast, commercially supported packages must be purchased, in some cases for a lot of money.
Another reason to use R in preference to other data analysis platforms is the enormous range of analysis methods supported by R’s growing universe of add-on packages. These packages support analysis methods from many branches of statistics (e.g., traditional statistical methods like ANOVA, ordinary least squares regression, and t-tests; Bayesian methods; and robust statistical procedures), machine learning (e.g., random forests, neural networks, and boosted trees), and other applications like text analysis. This availability of methods is important because it greatly expands the range of data exploration and analysis approaches that can be considered. For example, if you wanted to use the multivariate outlier detection method described in Chapter 9 based on the MCD covariance estimator in another framework (e.g., Microsoft Excel), you would have to first build these analysis tools yourself, and then test them thoroughly to make sure they are really doing what you want. All of this takes time and effort just to be able to get to the point of actually analyzing your data.
Finally, a third reason to adopt R is its growing popularity, undoubtedly fueled by the reasons just described, but which is also likely to promote the continued growth of new capabilities. A survey of programming language popularity by the Institute of Electrical and Electronics Engineers (IEEE) has been taken for the last several years, and a summary of the results as of July 18, 2017, was available from the website:
http://spectrum.ieee.org/computing/software/the-2017-top-ten-programming-languages
The top six programming languages on this list were, in descending order: Python, C, Java, C++, C#, and R. Note that the top five of these are general-purpose languages, all suitable for at least two of the four programming environments considered in the survey: web, mobile, desktop/enterprise, and embedded. In contrast, R is a specialized data analysis language that is only suitable for the desktop/enterprise environment. The next data analysis language in this list was the commercial package MATLAB®, ranked 15th.
The structure of R
The R programming language basically consists of three components:
• a set of base R packages, a required collection of programs that support language infrastructure and basic statistics and data analysis functions;
• a set of recommended packages, automatically included in almost all R installations (the MASS package used in this chapter belongs to this set);
• a very large and growing set of optional add-on packages, available through the Comprehensive R Archive Network (CRAN).
Most R installations have all of the base and recommended packages, with at least a few selected add-on packages. The advantage of this language structure is that it allows extensive customization: as of February 3, 2018, there were 12,086 packages available from CRAN, and new ones are added every day. These packages provide support for everything from rough and fuzzy set theory to the analysis of Twitter tweets, so it is an extremely rare organization that actually needs everything CRAN has to offer. Allowing users to install only what they need avoids massive waste of computer resources.
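The three groups of packages can be inspected on any given machine with installed.packages(), whose priority argument selects the base or recommended set. The following is a minimal sketch; the variable names are illustrative, and the counts will differ from one installation to another.

```r
# Names of installed packages, split by the priority recorded for each
base_pkgs <- rownames(installed.packages(priority = "base"))
rec_pkgs  <- rownames(installed.packages(priority = "recommended"))
all_pkgs  <- rownames(installed.packages())

# Everything that is neither base nor recommended is an optional add-on
addon_pkgs <- setdiff(all_pkgs, c(base_pkgs, rec_pkgs))

# How many packages of each kind are installed here
c(base = length(base_pkgs),
  recommended = length(rec_pkgs),
  addon = length(addon_pkgs))

# MASS should appear among the recommended packages
"MASS" %in% rec_pkgs
```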
Installing packages from CRAN is easy: the R graphical user interface (GUI) has a tab labeled “Packages.” Clicking on this tab brings up a menu, and selecting “Install packages” from this menu brings up one or two other menus. If you have not used the “Install packages” option previously in your current R session, a menu appears asking you to select a CRAN mirror; these sites are locations throughout the world with servers that support CRAN downloads, so you should select one near you. Once you have done this, a second menu appears that lists all of the R packages available for download. Simply scroll down this list until you find the package you want, select it, and click the “OK” button at the bottom of the menu. This will cause the package you have selected to be downloaded from the CRAN mirror and installed on your machine, along with all other packages that are required to make your selected package work. For example, the car package used to generate Fig. 1.1 requires a number of other packages, including the quantile regression package quantreg, which is automatically downloaded and installed when you install the car package.
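The same installation can be done from the R prompt with install.packages(), which is not discussed in the text above but is the command-line counterpart of the GUI menus. The sketch below wraps it in a hypothetical helper, install_if_missing, so that nothing is downloaded when the package is already present; the mirror URL is one commonly used choice, not a requirement.

```r
# Hypothetical helper: install a package from CRAN only if it is missing.
# The default repos value is an assumption; any CRAN mirror can be used.
install_if_missing <- function(pkg, repos = "https://cloud.r-project.org") {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    # Required packages are downloaded and installed as well
    install.packages(pkg, repos = repos)
  }
  # Report (invisibly) whether the package is now available
  invisible(requireNamespace(pkg, quietly = TRUE))
}

# MASS is a recommended package, so this call normally downloads nothing
install_if_missing("MASS")

# The same call with "car" would fetch car and the packages it requires:
# install_if_missing("car")
```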
It is important to note that installing an R package makes it available for you to use, but this does not “load” the package into your current R session. To do this, you must use the library() function, which works in two different ways. First, if you enter this function without any parameters (i.e., type “library()” at the R prompt), it brings up a new window that lists all of the packages that have been installed on your machine. To use any of these packages, it is necessary to use the library() command again, this time specifying the name of the package you want to use as a parameter. This is shown in the code appearing at the top of Fig. 1.1, where the MASS and car packages are loaded:
library(MASS)
library(car)
The first of these commands loads the MASS package, which contains the mammals data frame and the truehist function to generate histograms, and the second loads the car package, which contains the qqPlot function used to generate the normal QQ-plots shown in Fig. 1.1.
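The difference between an installed package and a loaded one can be seen with search(), which lists what is currently attached to the session. The sketch below is illustrative only; the :: operator shown at the end is an alternative way to reach a single object and is not part of the discussion above.

```r
# Packages attached before loading MASS
before <- search()

library(MASS)

# Anything newly attached by the library() call (empty if MASS was
# already loaded earlier in the session)
setdiff(search(), before)

# Once MASS is attached, its contents can be used by name
head(mammals, 3)

# Alternative: refer to one object in an installed package without
# attaching the whole package
head(MASS::mammals, 3)
```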
1.3 A representative R session
To give a clear view of the essential material covered in this book, the following paragraphs describe a simple but representative R analysis session, providing a few specific illustrations of what R can do. The general task is a typical preliminary data exploration: we are given an unfamiliar dataset and we begin by attempting to understand what is in it. In this particular case, the dataset is a built-in data example from R (one of many such examples included in the language), but the preliminary questions explored here are analogous to those we would ask in characterizing a dataset obtained from the Internet, from a data warehouse of customer data in a business application, or from a computerized data collection system in a scientific experiment or an industrial process monitoring application. Useful preliminary questions include:
1. How many records does this dataset contain?
2. How many fields (i.e., variables) are included in each record?
3. What kinds of variables are these? (e.g., real numbers, integers, categorical variables like “city” or “type,” or something else?)
4. Are these variables always observed? (i.e., is missing data an issue? If so, how are missing values represented?)
5. Are the variables included in the dataset the ones we were expecting?
6. Are the values of these variables consistent with what we expect?
7. Do the variables in the dataset seem to exhibit the kinds of relationships we expect? (Indeed, what relationships do we expect, and why?)
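Several of these questions map directly onto one-line R commands. The sketch below applies them to the mammals data frame used earlier in this chapter, purely as an illustration; it assumes missing values are coded as NA, and the correlation of the log-transformed variables is just one possible way to look at question 7.

```r
library(MASS)  # provides the mammals data frame

nrow(mammals)            # 1. number of records
ncol(mammals)            # 2. number of fields per record
sapply(mammals, class)   # 3. kind of each variable

# 4. missing values per variable (assumes they are coded as NA)
n_missing <- colSums(is.na(mammals))
n_missing

names(mammals)           # 5. are these the variables we expected?
summary(mammals)         # 6. are the value ranges plausible?

# 7. one simple look at a relationship between the two variables
cor(log(mammals$body), log(mammals$brain))
```

Commands like these only start the conversation: whether the answers match expectations depends on what is known about how the data were collected.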