
EXPLORATORY DATA ANALYSIS USING R

Chapman & Hall/CRC Data Mining and Knowledge Discovery Series

Series Editor: Vipin Kumar

Computational Business Analytics
Subrata Das

Data Classification: Algorithms and Applications
Charu C. Aggarwal

Healthcare Data Analytics
Chandan K. Reddy and Charu C. Aggarwal

Accelerating Discovery: Mining Unstructured Information for Hypothesis Generation
Scott Spangler

Event Mining: Algorithms and Applications
Tao Li

Text Mining and Visualization: Case Studies Using Open-Source Tools
Markus Hofmann and Andrew Chisholm

Graph-Based Social Media Analysis
Ioannis Pitas

Data Mining: A Tutorial-Based Primer, Second Edition
Richard J. Roiger

Data Mining with R: Learning with Case Studies, Second Edition
Luís Torgo

Social Networks with Rich Edge Semantics
Quan Zheng and David Skillicorn

Large-Scale Machine Learning in the Earth Sciences
Ashok N. Srivastava, Ramakrishna Nemani, and Karsten Steinhaeuser

Data Science and Analytics with Python
Jesus Rogel-Salazar

Feature Engineering for Machine Learning and Data Analytics
Guozhu Dong and Huan Liu

Exploratory Data Analysis Using R
Ronald K. Pearson

For more information about this series please visit: https://www.crcpress.com/Chapman--HallCRC-Data-Mining-and-Knowledge-Discovery-Series/book-series/CHDAMINODIS


CRC Press

Taylor & Francis Group

6000 Broken Sound Parkway NW, Suite 300

Boca Raton, FL 33487-2742

© 2018 by Taylor & Francis Group, LLC

CRC Press is an imprint of Taylor & Francis Group, an Informa business

No claim to original U.S. Government works

Printed on acid-free paper

Version Date: 20180312

International Standard Book Number-13: 978-1-1384-8060-5 (Hardback)

This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may rectify in any future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.

For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged.

Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe.

Visit the Taylor & Francis Web site at http://www.taylorandfrancis.com and the CRC Press Web site at http://www.crcpress.com

1.2.2 Exploratory analysis

1.2.3 Computers, software, and R

1.3 A representative R session

1.4 Organization of this book

2.2.2 Grid graphics

2.2.3 Lattice graphics

2.2.4 The ggplot2 package

2.3.1 The flexibility of the plot function

2.3.2 S3 classes and generic functions

2.3.3 Optional parameters for base graphics

2.4.1 Adding points and lines to a scatterplot

2.4.2 Adding text to a plot

2.4.3 Adding a legend to a plot

2.4.4 Customizing axes

2.5 A few different plot types

2.5.1 Pie charts and why they should be avoided

2.5.2 Barplot summaries

2.5.3 The symbols function

2.7.1 A few general guidelines

2.7.3 The tableplot function

3 Exploratory Data Analysis: A First Look

3.2.1 “Typical” values: the mean

3.2.2 “Spread”: the standard deviation

3.2.3 Limitations of simple summary statistics

3.2.4 The Gaussian assumption

3.3.1 Outliers and their influence

3.3.2 Detecting univariate outliers

3.3.3 Inliers and their detection

3.3.5 Missing data, possibly disguised

3.3.6 QQ-plots revisited

3.4 Visualizing relations between variables

3.4.1 Scatterplots between numerical variables

3.4.2 Boxplots: numerical vs. categorical variables

3.4.3 Mosaic plots: categorical scatterplots

3.5 Exercises

4.1 File management in R

4.2 Manual data entry

4.2.1 Entering the data by hand

4.2.2 Manual data entry is bad but sometimes expedient

4.3.1 Previews of three Internet data examples

4.4.1 Reading and writing CSV files

4.4.2 Spreadsheets and csv files are not the same thing

4.4.3 Two potential problems with CSV files

4.5 Working with other file types

4.5.1 Working with text files

4.5.2 Saving and retrieving R objects

4.5.3 Graphics files

4.6 Merging data from different sources

4.7 A brief introduction to databases

4.7.1 Relational databases, queries, and SQL

4.7.2 An introduction to the sqldf package

4.7.3 An overview of R’s database support

4.7.4 An introduction to the RSQLite package

4.8 Exercises

5 Linear Regression Models

5.1 Modeling the whiteside data

5.1.1 Describing lines in the plane

5.1.2 Fitting lines to points in the plane

5.1.3 Fitting the whiteside data

5.2 Overfitting and data splitting

5.2.1 An overfitting example

5.2.2 The training/validation/holdout split

5.2.3 Two useful model validation tools

5.3 Regression with multiple predictors

5.3.1 The Cars93 example

5.3.2 The problem of collinearity

5.4 Using categorical predictors

5.5 Interactions in linear regression models

5.6 Variable transformations in linear regression

5.7 Robust regression: a very brief introduction

5.8 Exercises

6 Crafting Data Stories

6.1 Crafting good data stories

6.1.1 The importance of clarity

6.1.2 The basic elements of an effective data story

6.2 Different audiences have different needs

6.2.1 The executive summary or abstract

6.2.2 Extended summaries

6.2.3 Longer documents

6.3 Three example data stories

6.3.1 The Big Mac and Grande Latte economic indices

6.3.2 Small losses in the Australian vehicle insurance data

6.3.3 Unexpected heterogeneity: the Boston housing data

7 Programming in R

7.1 Interactive use versus programming

7.1.1 A simple example: computing Fibonacci numbers

7.1.2 Creating your own functions

7.2 Key elements of the R language

7.2.1 Functions and their arguments

7.2.2 The list data type

7.2.3 Control structures

7.2.4 Replacing loops with apply functions

7.2.5 Generic functions revisited

7.3 Good programming practices

7.3.1 Modularity and the DRY principle

7.3.2 Comments

7.3.3 Style guidelines

7.3.4 Testing and debugging

7.4 Five programming examples

7.4.1 The function ValidationRsquared

7.4.2 The function TVHsplit

7.4.3 The function PredictedVsObservedPlot

7.4.4 The function

8 Working with Text Data

8.1 The fundamentals of text data analysis

8.1.1 The basic steps in analyzing text data

8.1.2 An illustrative example

8.2 Basic character functions in R

8.2.1 The nchar

8.2.2 The grep

8.2.3 Application to missing data and alternative spellings

8.2.4 The sub and gsub functions

8.2.5 The strsplit function

8.2.6 Another application: ConvertAutoMpgRecords

8.2.7 The paste function

8.3 A brief introduction to regular expressions

8.3.1 Regular expression basics

8.3.2 Some useful regular expression examples

8.4 An aside: ASCII vs. UNICODE

8.5 Quantitative text analysis

8.5.1 Document-term and document-feature matrices

8.5.2 String distances and approximate matching

8.6 Three detailed examples

8.6.1 Characterizing a book

8.6.2 The cpus data frame

8.6.3 The unclaimed bank account data

8.7 Exercises

9 Exploratory Data Analysis: A Second Look

9.1 An example: repeated measurements

9.1.1 Summary and practical implications

9.1.2 The gory details

9.2 Confidence intervals and significance

9.2.1 Probability models versus data

9.2.2 Quantiles of a distribution

9.2.3 Confidence intervals

9.2.4 Statistical significance and p-values

9.3 Characterizing a binary variable

9.3.1 The binomial distribution

9.3.2 Binomial confidence intervals

9.3.3 Odds ratios

9.4 Characterizing count data

9.4.1 The Poisson distribution and rare events

9.4.2 Alternative count distributions

9.4.3 Discrete distribution plots

9.5 Continuous distributions

9.5.1 Limitations of the Gaussian distribution

9.5.2 Some alternatives to the Gaussian distribution

9.5.3 The qqPlot function revisited

9.5.4 The problems of ties and implosion

9.6 Associations between numerical variables

9.6.1 Product-moment correlations

9.6.2 Spearman’s rank correlation measure

9.6.3 The correlation trick

9.6.4 Correlation matrices and correlation plots

9.6.5 Robust correlations

9.6.6 Multivariate outliers

9.7 Associations between categorical variables

9.7.1 Contingency tables

9.7.2 The chi-squared measure and Cramér’s V

9.7.3 Goodman and Kruskal’s tau measure

9.8 Principal component analysis (PCA)

9.9 Working with date variables

9.10 Exercises

10 More General Predictive Models

10.1 A predictive modeling overview

10.1.1 The predictive modeling problem

10.1.2 The model-building process

10.2 Binary classification and logistic regression

10.2.1 Basic logistic regression formulation

10.2.2 Fitting logistic regression models

10.3.1 Structure and fitting of decision trees

10.3.2 A classification tree example

10.6.1 Partial dependence plots

10.6.2 Variable importance measures

Preface

Much has been written about the abundance of data now available from the Internet and a great variety of other sources. In his aptly named 2007 book Glut [81], Alex Wright argued that the total quantity of data then being produced was approximately five exabytes per year (5 × 10^18 bytes), more than the estimated total number of words spoken by human beings in our entire history. And that assessment was from a decade ago: increasingly, we find ourselves “drowning in an ocean of data,” raising questions like “What do we do with it all?” and “How do we begin to make any sense of it?”

Fortunately, the open-source software movement has provided us with (at least partial) solutions like the R programming language. While R is not the only relevant software environment for analyzing data (Python is another option with a growing base of support), R probably represents the most flexible data analysis software platform that has ever been available. R is largely based on S, a software system developed by John Chambers, who was awarded the 1998 Software System Award by the Association for Computing Machinery (ACM) for its development; the award noted that S “has forever altered the way people analyze, visualize, and manipulate data.”

The other side of this software coin is educational: given the availability and sophistication of R, the situation is analogous to someone giving you an F-15 fighter aircraft, fully fueled with its engines running. If you know how to fly it, this can be a great way to get from one place to another very quickly. But it is not enough to just have the plane: you also need to know how to take off in it, how to land it, and how to navigate from where you are to where you want to go. Also, you need to have an idea of where you do want to go. With R, the situation is analogous: the software can do a lot, but you need to know both how to use it and what you want to do with it.

The purpose of this book is to address the most important of these questions. Specifically, this book has three objectives:

1. To provide a basic introduction to exploratory data analysis (EDA);

2. To introduce the range of “interesting” (good, bad, and ugly) features we can expect to find in data, and why it is important to find them;

3. To introduce the mechanics of using R to explore and explain data.

This book grew out of materials I developed for the course “Data Mining Using R” that I taught for the University of Connecticut Graduate School of Business. The students in this course typically had little or no prior exposure to data analysis, modeling, statistics, or programming. This was not universally true, but it was typical, so it was necessary to make minimal background assumptions, particularly with respect to programming. Further, it was also important to keep the treatment relatively non-mathematical: data analysis is an inherently mathematical subject, so it is not possible to avoid mathematics altogether, but for this audience it was necessary to assume no more than the minimum essential mathematical background.

The intended audience for this book is students (both advanced undergraduates and entry-level graduate students) along with working professionals who want a detailed but introductory treatment of the three topics listed in the book’s title: data, exploratory analysis, and R. Exercises are included at the ends of most chapters, and an instructor’s solution manual giving complete solutions to all of the exercises is available from the publisher.

Author

Ronald K. Pearson is a Senior Data Scientist with GeoVera Holdings, a property insurance company in Fairfield, California, involved primarily in the exploratory analysis of data, particularly text data. Previously, he held the position of Data Scientist with DataRobot in Boston, a software company whose products support large-scale predictive modeling for a wide range of business applications and are based on Python and R, where he was one of the authors of the datarobot R package. He is also the developer of the GoodmanKruskal R package and has held a variety of other industrial, business, and academic positions. These positions include both the DuPont Company and the Swiss Federal Institute of Technology (ETH Zürich), where he was an active researcher in the area of nonlinear dynamic modeling for industrial process control, the Tampere University of Technology where he was a visiting professor involved in teaching and research in nonlinear digital filters, and the Travelers Companies, where he was involved in predictive modeling for insurance applications. He holds a PhD in Electrical Engineering and Computer Science from the Massachusetts Institute of Technology and has published conference and journal papers on topics ranging from nonlinear dynamic model structure selection to the problems of disguised missing data in predictive modeling. Dr. Pearson has authored or co-authored five previous books, including Exploring Data in Engineering, the Sciences, and Medicine (Oxford University Press, 2011) and Nonlinear Digital Filtering with Python, co-authored with Moncef Gabbouj (CRC Press, 2016). He is also the developer of the DataCamp course on base R graphics.

Chapter 1 Data, Exploratory Analysis, and R

1.1 Why do we analyze data?

The basic subject of this book is data analysis, so it is useful to begin by addressing the question of why we might want to do this. There are at least three motivations for analyzing data:

1. to understand what has happened or what is happening;

2. to predict what is likely to happen, either in the future or in other circumstances we haven’t seen yet;

3. to guide us in making decisions.

The primary focus of this book is on exploratory data analysis, discussed further in the next section and throughout the rest of this book, and this approach is most useful in addressing problems of the first type: understanding our data. That said, the predictions required in the second type of problem listed above are typically based on mathematical models like those discussed in Chapters 5 and 10, which are optimized to give reliable predictions for data we have available, in the hope and expectation that they will also give reliable predictions for cases we haven’t yet considered. In building these models, it is important to use representative, reliable data, and the exploratory analysis techniques described in this book can be extremely useful in making certain this is the case. Similarly, in the third class of problems listed above (making decisions), it is important that we base them on an accurate understanding of the situation and/or accurate predictions of what is likely to happen next. Again, the techniques of exploratory data analysis described here can be extremely useful in verifying and/or improving the accuracy of our data and our predictions.

1.2 The view from 90,000 feet

This book is intended as an introduction to the three title subjects (data, its exploratory analysis, and the R programming language), and the following sections give high-level overviews of each, emphasizing key details and interrelationships.

1.2.1 Data

Loosely speaking, the term “data” refers to a collection of details, recorded to characterize a source like one of the following:

• an entity, e.g.: family history from a patient in a medical study; manufacturing lot information for a material sample in a physical testing application; or competing company characteristics in a marketing analysis;

• an event, e.g.: demographic characteristics of those who voted for different political candidates in a particular election;

• a process, e.g.: operating data from an industrial manufacturing process.

This book will generally use the term “data” to refer to a rectangular array of observed values, where each row refers to a different observation of entity, event, or process characteristics (e.g., distinct patients in a medical study), and each column represents a different characteristic (e.g., diastolic blood pressure) recorded, or at least potentially recorded, for each row. In R’s terminology, this description defines a data frame, one of R’s key data types.
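As a minimal sketch of this rectangular layout, the following builds a small data frame by hand. The patient labels and values are hypothetical and chosen only for illustration; NA marks a characteristic that was potentially recorded but is not available.

```r
# Hypothetical example: three patients (rows), three characteristics (columns).
patients <- data.frame(
  age       = c(34, 51, 47),
  diastolic = c(72, NA, 88),          # NA: value not recorded for this patient
  smoker    = c("no", "yes", "no"),
  row.names = c("patient01", "patient02", "patient03")
)

dim(patients)                        # number of rows and columns
str(patients)                        # storage type of each column
patients["patient02", "diastolic"]   # one cell, addressed by row and column name
```

Addressing a cell by row and column name, as in the last line, relies on the fact that a data frame carries names for both dimensions.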

The mtcars data frame is one of many built-in data examples in R. This data frame has 32 rows, each one corresponding to a different car. Each of these cars is characterized by 11 variables, which constitute the columns of the data frame. These variables include the car’s mileage (in miles per gallon, mpg), the number of gears in its transmission, the transmission type (manual or automatic), the number of cylinders, the horsepower, and various other characteristics. The original source of this data was a comparison of 32 cars from model years 1973 and 1974 published in Motor Trend Magazine. The first six records of this data frame may be examined using the head command in R:

head(mtcars)

##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

An important feature of data frames in R is that both rows and columns have names associated with them. In favorable cases, these names are informative, as they are here: the row names identify the particular cars being characterized, and the column names identify the characteristics recorded for each car.

A more complete description of this dataset is available through R’s built-in help facility. Typing “help(mtcars)” at the R command prompt will bring up a help page that gives the original source of the data, cites a paper from the statistical literature that analyzes this dataset [39], and briefly describes the variables included. This information constitutes metadata for the mtcars data frame: metadata is “data about data,” and it can vary widely in terms of its completeness, consistency, and general accuracy. Since metadata often provides much of our preliminary insight into the contents of a dataset, it is extremely important, and any limitations of this metadata (incompleteness, inconsistency, and/or inaccuracy) can cause serious problems in our subsequent analysis. For these reasons, discussions of metadata will recur frequently throughout this book. The key point here is that, potentially valuable as metadata is, we cannot afford to accept it uncritically: we should always cross-check the metadata with the actual data values, with our intuition and prior understanding of the subject matter, and with other sources of information that may be available.
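A few simple cross-checks of this kind can be sketched with base R functions. The particular checks chosen here (dimensions, names, explicitly coded missing values, and the set of values taken by the transmission indicator am) are illustrative choices, not a complete validation procedure.

```r
# mtcars ships with base R (datasets package), so no library() call is needed.
dim(mtcars)              # compare with the 32 rows and 11 columns quoted above
head(rownames(mtcars))   # row names: the car models
colnames(mtcars)         # column names: the recorded characteristics

# Cross-check documented facts against the data values themselves.
n_missing <- sum(is.na(mtcars))        # count of explicitly coded missing values
am_values <- sort(unique(mtcars$am))   # transmission type should be a two-level code
n_missing
am_values
```

A count of zero explicit NA values does not by itself rule out missing data, since missing values may be disguised by some other code.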

As a specific illustration of this last point, a popular benchmark dataset for evaluating binary classification algorithms (i.e., computational procedures that attempt to predict a binary outcome from other variables) is the Pima Indians diabetes dataset, available from the UCI Machine Learning Repository, an important Internet data source discussed further in Chapter 4. In this particular case, the dataset characterizes female adult members of the Pima Indians tribe, giving a number of different medical status and history characteristics (e.g., diastolic blood pressure, age, and number of times pregnant), along with a binary diagnosis indicator with the value 1 if the patient had been diagnosed with diabetes and 0 if they had not. Several versions of this dataset are available: the one considered here was obtained from the UCI website on May 10, 2014, and it has 768 rows and 9 columns. In contrast, the data frame Pima.tr included in R’s MASS package is a subset of this original, with 200 rows and 8 columns. The metadata available for this dataset from the UCI Machine Learning Repository now indicates that this dataset exhibits missing values, but there is also a note that prior to February 28, 2011 the metadata indicated that there were no missing values. In fact, the missing values in this dataset are not coded explicitly as missing with a special code (e.g., R’s “NA” code), but are instead coded as zero. As a result, a number of studies characterizing binary classifiers have been published using this dataset as a benchmark where the authors were not aware that data values were missing, in some cases quite a large fraction of the total observations. As a specific example, the serum insulin measurement included in the dataset is 48.7% missing.
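The effect of missing values disguised as zeros can be sketched with a toy vector. The numbers below are hypothetical and are not taken from the Pima data; they only illustrate how one might measure the fraction of suspicious zeros and recode them explicitly as NA.

```r
# Hypothetical toy values, NOT the actual Pima data: zeros stand in for
# insulin measurements that were never taken.
insulin <- c(0, 94, 168, 0, 88, 0, 543, 0)

# Fraction of records carrying the suspicious value.
zero_fraction <- mean(insulin == 0)

# Recode the disguised missing values explicitly as NA.
insulin_clean <- ifelse(insulin == 0, NA, insulin)

zero_fraction
mean(insulin)                       # pulled downward by the zeros
mean(insulin_clean, na.rm = TRUE)   # mean of the recorded values only
```

Whether a zero is a legitimate measurement or a missing-value code is a subject-matter question; the recoding step is only appropriate once that question has been settled.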

Finally, it is important to recognize the essential role our assumptions about data can play in its subsequent analysis. As a simple and amusing example, consider the following “data analysis” question: how many planets are there orbiting the Sun? Until about 2006, the generally accepted answer was nine, with Pluto the outermost member of this set. Pluto was subsequently re-classified as a “dwarf planet,” in part because a larger, more distant body was found in the Kuiper Belt and enough astronomers did not want to classify this object as the “tenth planet” that Pluto was demoted to dwarf planet status. In his book, Is Pluto a Planet? [72], astronomer David Weintraub argues that Pluto should remain a planet, based on the following defining criteria for planethood:

1. the object must be too small to generate, or to have ever generated, energy through nuclear fusion;

2. the object must be big enough to be spherical;

3. the object must have a primary orbit around a star.

The first of these conditions excludes dwarf stars from being classed as planets, and the third excludes moons from being declared planets (since they orbit planets, not stars). Weintraub notes, however, that under this definition, there are at least 24 planets orbiting the Sun: the eight now generally regarded as planets, Pluto, and 15 of the largest objects from the asteroid belt between Mars and Jupiter and from the Kuiper Belt beyond Pluto. This example illustrates that definitions are both extremely important and not to be taken for granted: everyone knows what a planet is, don’t they? In the broader context of data analysis, the key point is that unrecognized disagreements in the definition of a variable are possible between those who measure and record it, and those who subsequently use it in analysis; these discrepancies can lie at the heart of unexpected findings that turn out to be erroneous. For example, if we wish to combine two medical datasets, characterizing different groups of patients with “the same” disease, it is important that the same diagnostic criteria be used to declare patients “diseased” or “not diseased.” For a more detailed discussion of the role of definitions in data analysis, refer to Sec. 2.4 of Exploring Data in Engineering, the Sciences, and Medicine [58]. (Although the book is generally quite mathematical, this is not true of the discussions of data characteristics presented in Chapter 2, which may be useful to readers of this book.)

1.2.2 Exploratory analysis

Roughly speaking, exploratory data analysis (EDA) may be defined as the art of looking at one or more datasets in an effort to understand the underlying structure of the data contained there. A useful description of how we might go about this is offered by Diaconis [21]:

We look at numbers or graphs and try to find patterns. We pursue leads suggested by background information, imagination, patterns perceived, and experience with other data analyses.

Note that this quote suggests, although it does not strictly imply, that the data we are exploring consists of numbers. Indeed, even if our dataset contains nonnumerical data, our analysis of it is likely to be based largely on numerical characteristics computed from these nonnumerical values. As a specific example, categorical variables appearing in a dataset like “city,” “political party affiliation,” or “manufacturer” are typically tabulated, converted from discrete named values into counts or relative frequencies. These derived representations can be particularly useful in exploring data when the number of levels (i.e., the number of distinct values the original variable can exhibit) is relatively small. In such cases, many useful exploratory tools have been developed that allow us to examine the character of these nonnumeric variables and their relationship with other variables, whether categorical or numeric. Simple graphical examples include boxplots for looking at the distribution of numerical values across the different levels of a categorical variable, or mosaic plots for looking at the relationship between categorical variables; both of these plots and other, closely related ones are discussed further in Chapters 2 and 3.
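The tabulation step can be sketched with base R, here treating the number of cylinders in the built-in mtcars data frame as a categorical variable with few levels. The choice of variable and of the median as the per-level summary are assumptions made for illustration.

```r
# Treat the number of cylinders in mtcars as a categorical variable.
cyl_counts <- table(mtcars$cyl)        # counts per level
cyl_freqs  <- prop.table(cyl_counts)   # relative frequencies

cyl_counts
round(cyl_freqs, 3)

# A numerical summary across the levels: median mileage for each cylinder count.
mpg_by_level <- tapply(mtcars$mpg, mtcars$cyl, median)
mpg_by_level
```

The per-level medians computed by tapply are the same quantities a boxplot of mpg against cyl would display as the center line of each box.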

Categorical variables with many levels pose more challenging problems, and these come in at least two varieties. One is represented by variables like U.S. postal zip code, which identifies geographic locations at a much finer-grained level than state does and exhibits about 40,000 distinct levels. A detailed discussion of dealing with this type of categorical variable is beyond the scope of this book, although one possible approach is described briefly at the end of Chapter 10. The second type of many-level categorical variable arises in settings where the inherent structure of the variable can be exploited to develop specialized analysis techniques. Text data is a case in point: the number of distinct words in a document or a collection of documents can be enormous, but special techniques for analyzing text data have been developed. Chapter 8 introduces some of the methods available in R for analyzing text data.

The mention of “graphs” in the Diaconis quote is particularly important since humans are much better at seeing patterns in graphs than in large collections of numbers. This is one of the reasons R supports so many different graphical display methods (e.g., scatterplots, barplots, boxplots, quantile-quantile plots, histograms, mosaic plots, and many, many more), and one of the reasons this book places so much emphasis on them. That said, two points are important here. First, graphical techniques that are useful to the data analyst in finding important structure in a dataset are not necessarily useful in explaining those findings to others. For example, large arrays of two-variable scatterplots may be a useful screening tool for finding related variables or anomalous data subsets, but these are extremely poor ways of presenting results to others because they essentially require the viewer to repeat the analysis for themselves. Instead, results should be presented to others using displays that highlight and emphasize the analyst’s findings to make sure that the intended message is received. This distinction between exploratory and explanatory displays is discussed further in Chapter 2 on graphics in R and in Chapter 6 on crafting data stories (i.e., explaining your findings), but most of the emphasis in this book is on exploratory graphical tools to help us obtain these results.

The second point to note here is that the utility of any graphical display can depend strongly on exactly what is plotted, as illustrated in Fig. 1.1. This issue has two components: the mechanics of how a subset of data is displayed, and the choice of what goes into that data subset. While both of these aspects are important, the second is far more important than the first. Specifically, it is important to note that the form in which data arrives may not be the most useful for analysis. To illustrate, Fig. 1.1 shows two sets of plots, both constructed from the brain element of the mammals dataset from the MASS package that lists body weights and brain weights for 62 different animals. This data frame is discussed further in Chapter 3, along with the characterizations presented here, which are histograms (top two plots) and normal QQ-plots (bottom two plots). In both cases, these plots are attempting to tell us something about the distribution of data values, and the point of this example is that the extent to which these plots are informative depends strongly on how we prepare the data from which they are constructed. Here, the left-hand pair of plots were generated from the raw data values and they are much less informative than the right-hand pair of plots, which were generated from log-transformed data. In particular, these plots suggest that the log-transformed data exhibits a roughly Gaussian distribution, further suggesting that working with the log of brain weight may be more useful than working with the raw data values. This example is revisited and discussed in much more detail in Chapter 3, but the point here is that exactly what we plot (e.g., raw data values vs. log-transformed data values) sometimes matters a lot more than how we plot it.

library(MASS)                  # provides the mammals data frame and truehist
library(car)                   # provides qqPlot
par(mfrow=c(2,2))              # 2 x 2 array of plots, filled row by row
truehist(mammals$brain)        # top left: histogram of the raw data
truehist(log(mammals$brain))   # top right: histogram of the log-transformed data
qqPlot(mammals$brain)          # bottom left: normal QQ-plot of the raw data
title("Normal QQ-plot")
qqPlot(log(mammals$brain))     # bottom right: normal QQ-plot of the log-transformed data
title("Normal QQ-plot")

Figure 1.1: Two pairs of characterizations of the brain weight data from the mammals data frame: histograms and normal QQ-plots constructed from the raw data (left-hand plots), and from log-transformed data (right-hand plots).
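A rough numerical companion to these plots is sketched below. It compares the mean and median of the brain weights before and after the log transformation, on the assumption that a large gap between the two indicates strong asymmetry. This comparison is an illustrative check added here, not part of the figure, and it assumes the MASS package is installed.

```r
library(MASS)   # provides the mammals data frame

brain     <- mammals$brain
log_brain <- log(brain)

# For a roughly symmetric distribution the mean and median are close;
# a large gap between them points to strong skewness.
raw_summary <- c(mean = mean(brain),     median = median(brain))
log_summary <- c(mean = mean(log_brain), median = median(log_brain))

raw_summary
log_summary
```

Agreement between mean and median is only a crude symmetry check; the QQ-plots give a much fuller picture of how close the distribution is to Gaussian.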

Since it is one of the main themes of this book, a much more extensive introduction to exploratory data analysis is given in Chapter 3. Three key points to note here are, first, that exploratory data analysis makes extensive use of graphical tools, for the reasons outlined above. Consequently, the wide and growing variety of graphical methods available in R makes it a particularly suitable environment for exploratory analysis. Second, exploratory analysis often involves characterizing many different variables and/or data sources, and comparing these characterizations. This motivates the widespread use of simple and well-known summary statistics like means, medians, and standard deviations, along with other, less well-known characterizations like the MAD scale estimate introduced in Chapter 3. Finally, third, an extremely important aspect of exploratory data analysis is the search for “unusual” or “anomalous” features in a dataset. The notion of an outlier is introduced briefly in Sec. 1.3, but a more detailed discussion of this and other data anomalies is deferred until Chapter 3, where techniques for detecting these anomalies are also discussed.
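As a small illustration of the summary statistics just mentioned, the sketch below computes them for the brain weights. It assumes the MASS package is installed and relies on base R's mad() function with its default settings; the detailed treatment of the MAD scale estimate is left to Chapter 3, and the variable names here are illustrative.

```r
library(MASS)  # provides the mammals data frame

x <- mammals$brain

# Location estimates: the mean is sensitive to a few very large values,
# while the median is not
location <- c(mean = mean(x), median = median(x))

# Scale estimates: mad() is the median absolute deviation, an
# outlier-resistant alternative to the standard deviation
spread <- c(sd = sd(x), mad = mad(x))

print(location)
print(spread)
```

Large disagreements between the classical and the outlier-resistant estimates are themselves a useful hint that unusual values may be present.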

1.2.3 Computers, software, and R

Using R, or any other data analysis environment, involves three basic tasks:

1. Make the data you want to analyze available to the analysis software;

2. Perform the analysis;

3. Make the results of the analysis available to those who need them.

In this chapter, all of the data examples come from built-in data frames in R, which are extremely convenient for teaching or learning R, but in real data analysis applications, making the data available for analysis can require significant effort. Chapter 4 focuses on this problem, but to understand its nature and significance, it is necessary to understand something about how computer systems are organized, and this is the subject of the next section. Related issues arise when we attempt to make analysis results available for others, and these issues are also covered in Chapter 4. Most of the book is devoted to various aspects of step (2) above, performing the analysis, and the second section below briefly addresses the question of “why use R and not something else?” Finally, since this is a book about using R to analyze data, some key details about the structure of the R language are presented in the third section below.
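The three basic tasks can be sketched in a few lines of R. This is a minimal illustration under stated assumptions: the data come from the built-in mammals data frame (MASS package) rather than an external source, the "analysis" is just a per-variable median, and the output file name is hypothetical.

```r
library(MASS)  # provides the mammals data frame

# 1. Make the data available (here, simply a built-in data frame)
dat <- mammals

# 2. Perform the analysis (here, the median of each variable)
result <- data.frame(variable = names(dat),
                     median = sapply(dat, median))

# 3. Make the results available to others by writing them to a file;
#    the file name and temporary directory are illustrative choices
outFile <- file.path(tempdir(), "mammals_summary.csv")
write.csv(result, outFile, row.names = FALSE)
```

In a real application, steps 1 and 3 usually involve external files or databases and take considerably more effort than shown here.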

General structure of a computing environment

In his book, Introduction to Data Technologies [56, pp. 211–214], Paul Murrell describes the general structure of a computing environment in terms of the following six components:

1. the CPU or central processing unit is the basic hardware that does all of the computing;

2. the RAM or random access memory is the internal memory where the CPU stores and retrieves results;

3. the keyboard is the standard interface that allows the user to submit requests to the computer system;

4. the screen is the graphical display terminal that allows the user to see the results generated by the computer system;

5. the mass storage, typically a “hard disk,” is the external memory where data and results can be stored permanently;

6. the network is an external connection to the outside world, including the Internet but also possibly an intranet of other computers, along with peripheral devices like printers.

Three important distinctions between internal storage (i.e., RAM) and external storage (i.e., mass storage) are, first, that RAM is typically several orders of magnitude faster to access than mass storage; second, that RAM is volatile (i.e., the contents are lost when the power is turned off) while mass storage is not; and, third, that mass storage can accommodate much larger volumes of data than RAM can. (As a specific example, the computer being used to prepare this book has 4GB of installed RAM and just over 100 times as much disk storage.) A practical consequence is that both the data we want to analyze and any results we want to save need to end up in mass storage so they are not lost when the computer power is turned off. Chapter 4 is devoted to a detailed discussion of some of the ways we can move data into and out of mass storage.

These differences between RAM and mass storage are particularly relevant to R since most R functions require all data (both the raw data and the internal storage required to keep any temporary, intermediate results) to fit in RAM. This makes the computations faster, but it limits the size of the datasets you can work with in most cases to something less than the total installed RAM on your computer. In some applications, this restriction represents a serious limitation on R’s applicability. This limitation is recognized within the R community and continuing efforts are being made to improve the situation.
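To get a feel for how much RAM a dataset occupies, base R's object.size() function reports an estimate of an object's memory use. The sketch below is only illustrative: it assumes the MASS package is installed, and the vector length of one million is an arbitrary choice.

```r
library(MASS)  # provides the mammals data frame

# Estimated memory used by a small built-in data frame
smallSize <- object.size(mammals)
print(smallSize, units = "Kb")

# A numeric vector stores 8 bytes per value, so a million values
# need roughly 8 million bytes plus a small fixed overhead
bigVector <- numeric(1e6)
bigSize <- object.size(bigVector)
print(bigSize, units = "Mb")
```

Keep in mind that many analyses create temporary copies of the data, so the memory actually required can be several times the size of the raw dataset.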

Closely associated with the CPU is the operating system, which is the software that runs the computer system, making useful activity possible. That is, the operating system coordinates the different components, establishing and managing file systems that allow datasets to be stored, located, modified, or deleted; providing user access to programs like R; providing the support infrastructure required so these programs can interact with network resources, etc. In addition to the general computing infrastructure provided by the operating system, to analyze data it is necessary to have programs like R and possibly others (e.g., database programs). Further, these programs must be compatible with the operating system: on popular desktops and enterprise servers, this is usually not a problem, although it can become a problem for older operating systems. For example, Section 2.2 of the R FAQ document available from the R “Help” tab notes that “support for Mac OS Classic ended with R 1.7.1.”

With the growth of the Internet as a data source, it is becoming increasingly important to be able to retrieve and process data from it. Unfortunately, this involves a number of issues that are well beyond the scope of this book (e.g., parsing HTML to extract data stored in web pages). A brief introduction to the key ideas with some simple examples is given in Chapter 4, but for those needing a more thorough treatment, Murrell’s book is highly recommended [56].

Data analysis software

A key element of the data analysis chain (acquire → analyze → explain) described earlier is the choice of data analysis software. Since there are a number of possibilities here, why R? One reason is that R is a free, open-source language, available for most popular operating systems. In contrast, commercially supported packages must be purchased, in some cases for a lot of money.

Another reason to use R in preference to other data analysis platforms is the enormous range of analysis methods supported by R’s growing universe of add-on packages. These packages support analysis methods from many branches of statistics (e.g., traditional statistical methods like ANOVA, ordinary least squares regression, and t-tests, Bayesian methods, and robust statistical procedures), machine learning (e.g., random forests, neural networks, and boosted trees), and other applications like text analysis. This availability of methods is important because it greatly expands the range of data exploration and analysis approaches that can be considered. For example, if you wanted to use the multivariate outlier detection method described in Chapter 9 based on the MCD covariance estimator in another framework (e.g., Microsoft Excel) you would have to first build these analysis tools yourself, and then test them thoroughly to make sure they are really doing what you want. All of this takes time and effort just to be able to get to the point of actually analyzing your data.

Finally, a third reason to adopt R is its growing popularity, undoubtedly fueled by the reasons just described, but which is also likely to promote the continued growth of new capabilities. A survey of programming language popularity by the Institute of Electrical and Electronics Engineers (IEEE) has been taken for the last several years, and a summary of the results as of July 18, 2017, was available from the website:

http://spectrum.ieee.org/computing/software/the-2017-top-ten-programming-languages

The top six programming languages on this list were, in descending order: Python, C, Java, C++, C#, and R. Note that the top five of these are general-purpose languages, all suitable for at least two of the four programming environments considered in the survey: web, mobile, desktop/enterprise, and embedded. In contrast, R is a specialized data analysis language that is only suitable for the desktop/enterprise environment. The next data analysis language in this list was the commercial package MATLAB®, ranked 15th.

The structure of R

The R programming language basically consists of three components:

• a set of base R packages, a required collection of programs that support language infrastructure and basic statistics and data analysis functions;

• a set of recommended packages, automatically included in almost all R installations (the MASS package used in this chapter belongs to this set);

• a very large and growing set of optional add-on packages, available through the Comprehensive R Archive Network (CRAN).

Most R installations have all of the base and recommended packages, with at least a few selected add-on packages. The advantage of this language structure is that it allows extensive customization: as of February 3, 2018, there were 12,086 packages available from CRAN, and new ones are added every day. These packages provide support for everything from rough and fuzzy set theory to the analysis of Twitter tweets, so it is an extremely rare organization that actually needs everything CRAN has to offer. Allowing users to install only what they need avoids massive waste of computer resources.
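To see which of these components are present in a particular installation, one option is the installed.packages() function, whose result includes a "Priority" column marking base and recommended packages. The following is a minimal sketch; the variable names are illustrative and the counts will differ from one installation to another.

```r
# Matrix with one row per installed package
pkgs <- installed.packages()

# Counts of base, recommended, and other (NA priority) packages
print(table(pkgs[, "Priority"], useNA = "ifany"))

# Names of the base and recommended packages on this machine
basePkgs <- rownames(pkgs)[pkgs[, "Priority"] %in% "base"]
recommendedPkgs <- rownames(pkgs)[pkgs[, "Priority"] %in% "recommended"]
print(basePkgs)
print(recommendedPkgs)
```

Packages with a missing priority are the optional add-ons installed on top of the base and recommended sets.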

Installing packages from CRAN is easy: the R graphical user interface (GUI) has a tab labeled “Packages.” Clicking on this tab brings up a menu, and selecting “Install packages” from this menu brings up one or two other menus. If you have not used the “Install packages” option previously in your current R session, a menu appears asking you to select a CRAN mirror; these sites are locations throughout the world with servers that support CRAN downloads, so you should select one near you. Once you have done this, a second menu appears that lists all of the R packages available for download. Simply scroll down this list until you find the package you want, select it, and click the “OK” button at the bottom of the menu. This will cause the package you have selected to be downloaded from the CRAN mirror and installed on your machine, along with all other packages that are required to make your selected package work. For example, the car package used to generate Fig. 1.1 requires a number of other packages, including the quantile regression package quantreg, which is automatically downloaded and installed when you install the car package.
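Packages can also be installed from the R prompt with the install.packages() function, which similarly downloads the package and the packages it requires. The sketch below wraps this call in a small helper so that nothing is downloaded if the package is already present; the helper name installIfMissing is hypothetical and not part of R.

```r
# Hypothetical helper (not part of R): install a package from CRAN
# only if it is not already installed on this machine
installIfMissing <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)  # downloads pkg and the packages it requires
  }
  # Report (invisibly) whether the package is now available
  invisible(requireNamespace(pkg, quietly = TRUE))
}

# Example call, commented out because it may need network access:
# installIfMissing("car")
```

As with the menu-based approach, R may ask you to choose a CRAN mirror the first time a package is installed in a session.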

It is important to note that installing an R package makes it available for you to use, but this does not “load” the package into your current R session. To do this, you must use the library() function, which works in two different ways. First, if you enter this function without any parameters (i.e., type “library()” at the R prompt) it brings up a new window that lists all of the packages that have been installed on your machine. To use any of these packages, it is necessary to use the library() command again, this time specifying the name of the package you want to use as a parameter. This is shown in the code appearing at the top of Fig. 1.1, where the MASS and car packages are loaded:

library(MASS)
library(car)

The first of these commands loads the MASS package, which contains the mammals data frame and the truehist function to generate histograms, and the second loads the car package, which contains the qqPlot function used to generate the normal QQ-plots shown in Fig. 1.1.
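The difference between an installed package and a loaded one can be checked directly. The following sketch assumes the MASS package is installed; it uses the search() function, which lists the packages currently attached to the session, and the :: operator, which accesses an object from an installed package without loading the whole package.

```r
# Access one object from an installed package without loading it
firstRows <- head(MASS::mammals, 3)
print(firstRows)

# Load the package, making all of its functions and data available
library(MASS)

# The loaded package now appears on the search path
massLoaded <- "package:MASS" %in% search()
print(massLoaded)

# Functions from the package can now be called by name alone
print(exists("truehist"))
```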

1.3 A representative R session

To give a clear view of the essential material covered in this book, the following paragraphs describe a simple but representative R analysis session, providing a few specific illustrations of what R can do. The general task is a typical preliminary data exploration: we are given an unfamiliar dataset and we begin by attempting to understand what is in it. In this particular case, the dataset is a built-in data example from R (one of many such examples included in the language), but the preliminary questions explored here are analogous to those we would ask in characterizing a dataset obtained from the Internet, from a data warehouse of customer data in a business application, or from a computerized data collection system in a scientific experiment or an industrial process monitoring application. Useful preliminary questions include:

1. How many records does this dataset contain?

2. How many fields (i.e., variables) are included in each record?

3. What kinds of variables are these? (e.g., real numbers, integers, categorical variables like “city” or “type,” or something else?)

4. Are these variables always observed? (i.e., is missing data an issue? If so, how are missing values represented?)

5. Are the variables included in the dataset the ones we were expecting?

6. Are the values of these variables consistent with what we expect?

7. Do the variables in the dataset seem to exhibit the kinds of relationships we expect? (Indeed, what relationships do we expect, and why?)
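Several of these questions can be given a first answer with a handful of base R functions. The sketch below is only an illustration: it uses the mammals data frame from the MASS package discussed earlier in this chapter as a stand-in for an unfamiliar dataset, and the variable names are illustrative.

```r
library(MASS)  # provides the mammals data frame
dat <- mammals

# Questions 1 and 2: number of records and number of fields
nRecords <- nrow(dat)
nFields <- ncol(dat)

# Question 3: what kind of variable is each field?
varTypes <- sapply(dat, class)

# Question 4: how many missing (NA) values does each field have?
nMissing <- colSums(is.na(dat))

# Questions 5 and 6: variable names and the range of their values
print(names(dat))
print(summary(dat))

# Question 7: a first look at the relationship between two variables
logCor <- cor(log(dat$body), log(dat$brain))

print(c(records = nRecords, fields = nFields))
print(varTypes)
print(nMissing)
print(logCor)
```

Note that is.na() only detects values coded as NA; missing values represented in other ways (e.g., special numeric codes) require a closer look at the data.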
