Using R for Data Analysis in Social Sciences

A Research Project-Oriented Approach

QUAN LI

Oxford University Press is a department of the University of Oxford. It furthers the University’s objective of excellence in research, scholarship, and education by publishing worldwide. Oxford is a registered trade mark of Oxford University Press in the UK and in certain other countries.

Published in the United States of America by Oxford University Press, 198 Madison Avenue, New York, NY 10016, United States of America.

© Oxford University Press 2018

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, without the prior permission in writing of Oxford University Press, or as expressly permitted by law, by license or under terms agreed with the appropriate reproduction rights organization. Inquiries concerning reproduction outside the scope of the above should be sent to the Rights Department, Oxford University Press, at the address above.

You must not circulate this work in any other form and you must impose this same condition on any acquirer.

Library of Congress Cataloging-in-Publication Data
Names: Li, Quan, 1966– author.
Title: Using R for data analysis in social sciences : a research project-oriented approach / Quan Li.
Description: New York, NY : Oxford University Press, [2018]
Identifiers: LCCN 2017010031 | ISBN 9780190656225 (pbk.) | ISBN 9780190656218 (hardcover) | ISBN 9780190656232 (updf) | ISBN 9780190656249 (epub)
Subjects: LCSH: Social sciences–Research–Data processing. | Social sciences–Statistical methods. | R (Computer program language)
Classification: LCC H61.3 .L52 2018 | DDC 330.285/5133–dc23
LC record available at https://lccn.loc.gov/2017010031

Paperback printed by WebCom, Inc., Canada
Hardback printed by Bridgeport National Bindery, Inc., United States of America

CONTENTS

List of Figures

List of Tables

Acknowledgments

Introduction

1. Learn about R and Write First Toy Programs

WHEN TO USE R IN A RESEARCH PROJECT

ESSENTIALS ABOUT R

HOW TO START A PROJECT FOLDER AND WRITE OUR FIRST R PROGRAM

CREATE, DESCRIBE, AND GRAPH A VECTOR: A SIMPLE TOY EXAMPLE

SIMPLE REAL-WORLD EXAMPLE: DATA FROM IVERSEN AND SOSKICE (2006)

CHAPTER 1: R PROGRAM CODE

TROUBLESHOOT AND GET HELP

IMPORTANT REFERENCE INFORMATION: SYMBOLS, OPERATORS, AND FUNCTIONS

SUMMARY

MISCELLANEOUS Q&AS FOR AMBITIOUS READERS

EXERCISES

2. Get Data Ready: Import, Inspect, and Prepare Data

PREPARATION

IMPORT PENN WORLD TABLE 7.0 DATASET

INSPECT IMPORTED DATA

PREPARE DATA I: VARIABLE TYPES AND INDEXING

PREPARE DATA II: MANAGE DATASETS

PREPARE DATA III: MANAGE OBSERVATIONS

PREPARE DATA IV: MANAGE VARIABLES

CHAPTER 2 PROGRAM CODE

SUMMARY

MISCELLANEOUS Q&AS FOR AMBITIOUS READERS

EXERCISES

3. One-Sample and Difference-of-Means Tests

CONCEPTUAL PREPARATION

DATA PREPARATION

WHAT IS THE AVERAGE ECONOMIC GROWTH RATE IN THE WORLD ECONOMY?

DID THE WORLD ECONOMY GROW MORE QUICKLY IN 1990 THAN IN 1960?

CHAPTER 3 PROGRAM CODE

SUMMARY

MISCELLANEOUS Q&AS FOR AMBITIOUS READERS

EXERCISES

4. Covariance and Correlation

DATA AND SOFTWARE PREPARATIONS

VISUALIZE THE RELATIONSHIP BETWEEN TRADE AND GROWTH USING SCATTER PLOT

ARE TRADE OPENNESS AND ECONOMIC GROWTH CORRELATED?

DOES THE CORRELATION BETWEEN TRADE AND GROWTH CHANGE OVER TIME?

CHAPTER 4 PROGRAM CODE

SUMMARY

MISCELLANEOUS Q&AS FOR AMBITIOUS READERS

EXERCISES

5. Regression Analysis

CONCEPTUAL PREPARATION: HOW TO UNDERSTAND REGRESSION ANALYSIS

DATA PREPARATION

VISUALIZE AND INSPECT DATA

HOW TO ESTIMATE AND INTERPRET OLS MODEL COEFFICIENTS

HOW TO ESTIMATE STANDARD ERROR OF COEFFICIENT

HOW TO MAKE AN INFERENCE ABOUT THE POPULATION PARAMETER OF INTEREST

HOW TO INTERPRET OVERALL MODEL FIT

HOW TO PRESENT STATISTICAL RESULTS

CHAPTER 5 PROGRAM CODE

SUMMARY

MISCELLANEOUS Q&AS FOR AMBITIOUS READERS

EXERCISES

6. Regression Diagnostics and Sensitivity Analysis

WHY ARE OLS ASSUMPTIONS AND DIAGNOSTICS IMPORTANT?

DATA PREPARATION

LINEARITY AND MODEL SPECIFICATION

PERFECT AND HIGH MULTICOLLINEARITY

CONSTANT ERROR VARIANCE

INDEPENDENCE OF ERROR TERM OBSERVATIONS

INFLUENTIAL OBSERVATIONS

NORMALITY TEST

REPORT FINDINGS

CHAPTER 6 PROGRAM CODE

SUMMARY

MISCELLANEOUS Q&AS FOR AMBITIOUS READERS

EXERCISES

7. Replication of Findings in Published Analyses

WHAT EXPLAINS THE GEOGRAPHIC SPREAD OF MILITARIZED INTERSTATE DISPUTES? REPLICATION AND DIAGNOSTICS OF BRAITHWAITE (2006)

DOES RELIGIOSITY INFLUENCE INDIVIDUAL ATTITUDES TOWARD INNOVATION? REPLICATION OF BÉNABOU ET AL. (2015)

CHAPTER 7 PROGRAM CODE

SUMMARY

8. Appendix: A Brief Introduction to Analyzing Categorical Data and Finding More Data

OBJECTIVE

GETTING DATA READY

DO MEN AND WOMEN DIFFER IN SELF-REPORTED HAPPINESS?

DO BELIEVERS IN GOD AND NON-BELIEVERS DIFFER IN SELF-REPORTED HAPPINESS?

SOURCES OF SELF-REPORTED HAPPINESS: LOGISTIC REGRESSION

WHERE TO FIND MORE DATA

References and Readings

Index

LIST OF FIGURES

1.1 How to Write First Toy Program in R

1.2 How to Install Add-on Package

1.3 Distribution of Discrete Variable vd$v1: Bar Chart

1.4 Distribution of Continuous Variable vd$v1: Boxplot and Histogram

1.5 Distribution of Wage Inequality from Iversen and Soskice (2006)

1.6 Distribution of PR and Majoritarian Systems from Iversen and Soskice (2006)

1.7 RStudio Screenshot

2.1 Using View() Function to View Raw Data

2.2 Distribution of Variable rgdpl

3.1 Types of Errors and Alternative Sampling Distributions

3.2 Histogram for Growth

3.3 Mean and 95% Confidence Interval for Growth

3.4 Mean and 95% Confidence Interval for Growth: 1960 and 1990

4.1 Simulated Positive Correlations of Two Random Variables

4.2 Scatter Plot of Trade Openness and Economic Growth

4.3 Correlation between Trade and Growth over Time

4.4 P Value of Correlation between Trade and Growth over Time

4.5 Anscombe Quartet Scatter Plot

5.1 Original Statistical Results from Frankel and Romer (1999)

5.2 Comparing Unlogged and Logged Income per Person

5.3 Trade Openness and Log of Income per Person

5.4 Coefficients Plot for Model 1

5.5 Partial Regression Plot

5.6 Explore Pairwise Relationships among Variables

6.1 Anscombe Quartet Regressions

6.2 Anscombe Quartet Residuals versus Fitted Values Plots

6.3 Diagnostic Plots for a Well-Behaved Regression

6.4 Residuals versus Fitted Values: Linearity

6.5 Residuals versus Independent Variables: Linearity

6.6 Trade Openness and Log of Income per Person

6.7 Distribution of Residuals by Region

6.8 Scatter Plot of Trade and Income by Region

6.9 Estimated Effect of Trade on Income by Region

6.10 Influence Plot of Influential Observations

6.11 Influential Observations Above Cook’s D Threshold

6.12 Normality Assumption Diagnostic Plot

7.1 Regression Diagnostic Plot: Residuals versus Fitted Values

7.2 Diagnostic Plot for Influential Observations: Cook’s D

7.3 Normality Assumption Diagnostic Plot

8.1 Sample Page from World Values Survey Codebook

LIST OF TABLES

1.1 Country Means for Variables Used in Regression Analysis (from Iversen and Soskice, 2006)

1.2 Statistics of Imported Data from Iversen and Soskice (2006)

1.3 Important Symbols in R

1.4 Arithmetic Operators

1.5 Logical Operators

1.6 Common Statistical and Mathematical Functions

2.1 List of Data Preparation Tasks and Related R Functions

3.1 Logic of Statistical Inference

3.2 Two-Sample Difference-of-Means Tests

5.1 Coefficient Interpretation in Log or Unlogged Models

5.2 Descriptive Statistics of Final Dataset

5.3 Effect of Trade Openness on Real Income per Person

6.1 Regression Results Using Anscombe’s Quartet

6.2 Effect of Trade on Income: Robustness Checks Part I

6.3 Effect of Trade on Income: Robustness Checks Part II

7.1 Variable Measures and Expected Effects

7.2 OLS Regression of Dispute Dispersion (Original Statistical Results Table from Braithwaite, 2006)

7.3 Original Descriptive Statistics Table in Braithwaite (2006)

7.4 Causes of Spread of Military Disputes: Replication and Robustness Tests

7.5 Most Important Qualities for Children to Have (from Bénabou et al., 2015)

7.6 Variable Labels for Dataset in Bénabou et al. (2015)

7.7 Replicating Table 2 in Bénabou et al. (2015)

ACKNOWLEDGMENTS

Five original tables from four different journal articles are reprinted in the book for replication exercises. The articles include (1) Iversen, Torben, and David Soskice, 2006, “Electoral Institutions and the Politics of Coalitions: Why Some Democracies Redistribute More Than Others,” American Political Science Review 100(2): 165–81, Table A1. Copyright: Cambridge University Press. (2) Frankel, Jeffrey A., and David Romer, 1999. “Does Trade Cause Growth?” American Economic Review 89(3): 379–99, Table 3. Copyright: American Economic Association. (3) Braithwaite, Alex. 2006. “The Geographic Spread of Militarized Disputes,” Journal of Peace Research 43(5): 507–22, Table I and Table II. Copyright: SAGE Publications. (4) Bénabou, Roland, Davide Ticchi, and Andrea Vindigni, 2015, “Religion and ‘Innovation,’” American Economic Review 105(5): 346–51, Table 2. Copyright: American Economic Association. Permissions to reprint the relevant tables in Iversen and Soskice (2006) and Braithwaite (2006) have been acquired and licensed from Cambridge University Press and SAGE Publications.

Jeffrey Frankel, Roland Bénabou, and American Economic Association deserve special thanks for graciously granting me permission to reprint the relevant tables in their articles for free.

Figures 1 through 4 in F. J. Anscombe’s “Graphs in Statistical Analysis,” published in 1973 in The American Statistician 27(1): 17–21, have been adapted and used with permission of the publisher, Taylor & Francis Ltd, http://www.tandfonline.com.

This book would not have been possible without the encouragement, help, and support of many students, colleagues, and friends. My undergraduate students in Polimetrics and Senior Research Seminar at Texas A&M University gave me the first impetus to write this book. Many students taking those two courses, especially Jacob King and Alex Goodman, caught typos and mistakes in earlier drafts. During the summer of 2016, Scarlet Amo, Corbin Cali, Chandler Dawson, and Elizabeth Gohmert experimented with using an earlier version of the manuscript to self-study R for data analysis. They provided detailed reports on each chapter and completed independent application papers. Their input has dramatically changed and improved how various materials in the book are now presented and structured. I thank them for their extraordinary work and effort. My graduate assistants, Molly Berkemeier, Kelly McCaskey, and Austin Johnson, provided excellent editorial assistance. My colleagues and friends, Tiyi Feng, Ren Mu, Erica Owen, and Carlisle Rainey, read parts of an earlier draft and provided valuable feedback and suggestions.

Many people at Oxford University Press have helped to make this manuscript possible and better. Scott Parris, who was the editor for my first book by Cambridge University Press, had been patiently encouraging and prodding me to finish this book until his retirement from Oxford. Happy retirement, Scott! Before retiring from Oxford, Scott handed my case to Anne Dellinger. Anne’s enthusiasm and encouragement were the main reason that I decided to stay with Oxford. After Anne departed from Oxford, David Pervin became my editor and offered sound advice. Scott’s assistant Cathryn Vaulman and David’s assistants Emily Mackenzie and Hayley Singer took care of many of the logistic issues in the process. Debbie Ruel corrected many errors and did a great job during copyediting, and Lincy Priya patiently dealt with my requests and smoothly handled the production of my book. Xun Pang and Jude Hays provided valuable comments and suggestions that helped to make the book enormously better.

Finally, my greatest debt of gratitude is owed to my wife, Liu, and my two children, Ellen and Andrew. Without their unyielding support, constant inquiry, and even reading parts of the book and checking my R code, I would not have finished the project. This book is dedicated to them!

INTRODUCTION

This book seeks to teach senior undergraduate and beginning graduate students in social sciences how to use R to manage, visualize, and analyze data in order to answer substantive research questions and reproduce the statistical analysis in published journal articles. Over the past several decades, statistical analysis training has become increasingly important for undergraduate and graduate students in many disciplines within social and behavioral sciences, such as economics, political science, public administration, business, public health, anthropology, psychology, sociology, education, and communication. With rapid progress in statistical computing, proficiency in using statistical software has become almost a universal requirement, albeit to varying degrees, in statistical methods courses. Popular software choices include: SAS, SPSS, Stata, and R. While SAS, SPSS, and Stata all have accessible introductory textbooks targeting students in social sciences, such textbooks on R are rare.

Compared with commercial packages like SAS, SPSS, and Stata, R has at least three strengths. It is a well-thought-out, coherent system that comes with a suite of software facilities for data management, visualization, and analysis. In addition, to meet emerging needs, a large community of R users constantly develops new open source add-on packages, already reaching over 10,000. Finally, perhaps the greatest perk of the software is that it is free. This financial benefit cannot be over-emphasized. Cash-strapped college students often find themselves relying on lab computers for access to SAS, SPSS, and Stata, or constrained by the limitations of the student versions of those commercial packages. Even postgraduation, many find it difficult to convince their employers to purchase a particular commercial package they know for their everyday use.

There are many reasons why R is preferred to other statistical software packages in higher education. But R’s greatest handicap to its widespread use in the social sciences is its steep learning curve. While the market has produced numerous books on R at various levels, introductory textbooks that focus on the needs of students in the social sciences are not easy to find. This book seeks to fill this void.

This book distinguishes itself from other introductory R or statistics books in three important ways. First, it intends to serve as an introductory text on using R for data analysis projects, targeting an audience rarely exposed to statistical programming. The rationale for emphasizing the introductory nature of this book is simple; it is driven by the needs and heterogeneity of the student body we often come across in classroom teaching in social science departments. Unlike students in math and statistics, many student users of R in social sciences have no experience in any computing language or programming software, and many will never achieve a higher level of programming beyond what is necessary for their everyday use in R. However, students in social sciences will find that the opportunity to use R for data manipulation, visualization, and analysis frequently presents itself in various courses and future careers. Hence, they need to become proficient at accomplishing common tasks in data manipulation, visualization, and analysis using R, without getting overly technical. In this respect, existing introductory texts on R programming that do not involve statistics tend to be overly comprehensive in coverage and are often geared toward students in math, statistics, sciences, and engineering, thus intimidating most social science students. Alain Zuur, Elena Ieno, and Erik Meesters’ A Beginner’s Guide to R and Philip Spector’s Data Manipulation with R are good examples. Their target audiences often are students in math, statistics, sciences, and engineering majors who have more experience in programming than fellow classmates in social sciences.

This book, in contrast, adopts a minimalist approach in teaching R. It covers only the most important features and functions in R that one will need for conducting reproducible research projects, with other materials moved to chapter appendices or removed from consideration completely. R is extremely flexible, almost always allowing multiple solutions to one programming task. While this is a strength, it does challenge beginning R users rarely exposed to computer programming. The minimalist approach adopted here will typically present one way to deal with a task in the main part of a chapter, leaving other material to a section called “Miscellaneous Questions for Ambitious Readers.” As a result, the minimalist approach should flatten the steep learning curve, a commonly noted disadvantage of R, thereby improving the software’s accessibility to undergraduates and similar audiences. Organizationally, this book breaks down chapters into small sections that mimic lab sessions for students. Each chapter focuses on only the essential R functions one needs to know in order to manipulate, visualize, and analyze data to accomplish some primary statistical analysis tasks. In the end, through this minimalist approach, the reader will accumulate enough R knowledge and skills to complete a course research project and to self-study more advanced R materials if necessary.

A second unique feature of this book is its emphasis on meeting the practical needs of students using R to conduct statistical analysis for research projects driven by substantive questions in social sciences. In addition to homework assignments and problem sets, statistical methods courses in social sciences often require the completion of a full-blown, substantively motivated research project. Such training is critical if statistical knowledge is to prove to be of any value and relevance to substantive courses and students’ future careers. Ideally, students can utilize completed statistical analysis papers as writing samples to showcase their quantitative skills in their graduate school or job applications.

In practice, to accomplish such a project on a substantive question, a student has to collect, clean, and manipulate data, visualize and analyze data systematically to address the question asked, and report findings in an organized manner. Many R books for introductory statistics tend to emphasize the R codes for statistical techniques, giving insufficient attention to the pre-analysis needs of users as well as the process of completing a research project. For example, John Verzani’s Using R for Introductory Statistics and Michael Crawley’s Introductory Statistics Using R are two popular texts in this category. Data preparation is not linked to particular research projects that address substantive questions.

In contrast, this book is written under the premise that the reader uses R primarily to address some substantive question of interest. This leads to several notable differences from other introductory statistics books using R. This book begins with the use of R to get an original raw dataset into a condition appropriate for statistical analysis, thus emphasizing how to deal with various issues that arise in such a process. Next, instead of starting with the interactive use of R, which is typical in other textbooks, this book gives exclusive attention to writing and executing R programs. This approach allows easy verification, recollection, and replication of analysis, and it is almost always how things are done in actual reproducible research. Students following this approach will write many well-documented R codes that address a variety of practical issues such that they can save those programs for future reference. Last but not least, the use of R in this book is closely integrated into a prototypical process that consists of a sequence of elements: a substantive question to be answered, a hypothesis that answers the question, the logic of statistical inference behind the empirical test of the hypothesis, the test statistic for statistical inference represented in mathematical notation and implemented computationally in R, and the presentation of findings in an organized manner. The emphasis is on an in-depth understanding of why we do statistical analysis and how R fits into actual empirical research. Hence, this research process- or project-oriented approach ought to significantly increase the likelihood that students will actually use R to solve problems in their future courses and careers.

A third unique feature of this book is its emphasis on teaching students how to replicate statistical analyses in published journal articles. Scientific progress requires previous findings be replicable and replicated; scientific education, like in physics and chemistry, always includes lab exercises that replicate previous experiments. As social scientific knowledge becomes increasingly evidence-based and relies on extensive data analysis, learning to replicate published results is a necessary step for undergraduates and first-year graduate students in their learning to conduct social scientific research. Such training now becomes feasible because of the availability of powerful free software and a wide range of datasets in the public domain. Many journals now require authors to submit and deposit replication datasets. Many original data from surveys and archival research are downloadable from the internet. Students no longer have to be just passive consumers of social scientific research but instead can actively scrutinize published research, play with the data, and reproduce or fail to reproduce previous findings. This will convert students from passive consumers into active learners. As reproducing research findings becomes the norm rather than the exception, it will empower the students, lower the barrier to their entry into the academic community, and challenge the professors and other knowledge producers. The wide gap between teaching and research commonly observed in undergraduate courses in social sciences will be narrowed. Such changes are likely to make teaching more interesting for professors, render learning more fruitful for students, and enable both parties to become more successful in their endeavors.

This book consists of eight chapters. Chapter 1 introduces R, illustrating how to write and execute programs using the software. Chapter 2 goes through the process of, and various main tasks in, getting data ready for analysis in R. Chapter 3 provides a conceptual background on the logic of statistical inference and then demonstrates how to make statistical inference with respect to one continuous outcome variable using one- and two-sample t tests. Chapter 4 moves into analyzing the relationship between two continuous variables, focusing on covariance and correlation. Chapter 5 introduces regression analysis, covering its conceptual foundation, model specification, estimation, interpretation, and inference. Chapter 6 continues with regression analysis, delving into various diagnostics and sensitivity analyses. Chapters 4 through 6 follow the same approach, integrating conceptual and mathematical foundation, data preparation, statistical analysis, and results reporting within each chapter. Chapter 7 walks readers through the process of replicating two published analyses. Finally, Chapter 8, as an appendix, provides a brief introduction to analyzing discrete data, demonstrating the Chi-squared test of independence and logistic regression.

No textbook can be perfect; this one is no exception. The minimalist approach, emphasizing the accessibility of R, comes at a price. Many commonly used functions and features of R, such as writing functions and loops, are not covered. Similarly, by focusing on teaching the research process of how to use R to address substantive questions, this book covers primarily explaining one continuous outcome variable and relevant statistical techniques, such as mean, difference of means, covariance, correlation, and cross-sectional regression. Hence, comprehensiveness in both programming and statistics is sacrificed, on purpose, for greater accessibility, clarity, and depth. The goal is to make this book accessible and useful for novices in both programming and data analysis.

In sum, this book integrates R programming, the logic and steps of statistical inference, and the process of empirical social scientific research in a highly accessible and structured fashion. It emphasizes learning to use R for essential data management, visualization, analysis, and replicating published research findings. By the end of this book, students will have learned how to do the following: (1) use R to import data, inspect data, identify dataset attributes, and manage observations, variables, and datasets; (2) use R to graph simple histograms, boxplots, scatter plots, and research findings; (3) use R to summarize data, conduct one-sample t-tests, test the difference-of-means between groups, compute covariance and correlation, estimate and interpret ordinary least squares (OLS) regression, and diagnose and correct regression assumption violations; and (4) replicate research findings in published journal articles. The principle behind this book is to teach students to learn as little R as possible but to do as much substantively driven data analysis at the beginner or intermediate level as possible. The minimalist approach should dramatically reduce the learning cost but still prove adequate for meeting the practical research needs of senior undergraduate and beginning graduate students in the social sciences. Having completed this book, students can competently use R and statistical analysis to answer substantive questions regarding some substantively interesting continuous outcome variable in a cross-sectional design. It is my hope that the newly acquired competence will motivate students to want to, rather than being forced to, learn more about R and statistics.

Learn about R and Write First Toy Programs

Chapter Objectives

In this first chapter, we will aim to achieve the following objectives:

1. Understand when to use R in a research project.

2. Learn about the basic background of R, software installation, and getting help.

3. Learn to set up a project folder for R programs and data files.

4. Learn to write and execute simple toy programs.

5. Learn to find and set the working directory for a project in R.

6. Learn to create a data vector.

7. Learn to calculate descriptive statistics and handle missing values.

8. Learn to convert a data vector into a data frame.

9. Learn to refer to a variable within a data frame.

10. Learn to install an add-on package, "stargazer," load it into R, and use it to get a descriptive statistics table.

11. Learn to graph the distribution of a variable.

12. Apply all the lessons learned to a real-world data example.

13. Learn about common coding errors and how to get help.

Materials in this chapter need about an hour and a half for a class of about 10 students to cover in a lab, including brief lecturing and hands-on practice. Larger classes or self-study could take longer.

When to Use R in a Research Project

Completing an empirical research project involves several stages, often starting with the identification of a research problem and ending with the report of findings and implications:

1. Identify a research problem

2. Survey the literature (find out what is known about the problem)

3. Formulate a theoretical argument and some testable hypothesis

4. Measure concepts

5. Collect data

6. Prepare data

7. Analyze data

8. Report findings and implications

The tasks of identifying a significant and interesting research problem, surveying the extant literature, formulating a coherent theoretical argument and some testable hypothesis that explains the research puzzle, measuring concepts in the theory empirically, and collecting data for the empirical indicators of the concepts, that is, tasks (1) to (5), are generally dealt with in substantive and research design courses in a field. Those topics are beyond the scope of this little R book. Yet tasks (6) to (8) may all involve R as a research instrument. Specifically, using R for actual research projects means analyzing particular research problems, such as evaluating the impact of a policy or testing the impact of a causal factor (or an independent variable) on an outcome (or a dependent variable) of interest, as postulated by pre-specified theoretical expectations. How to accomplish tasks (6) to (8) will be illustrated in the following chapters.

A research project of this type presents at least two challenges, for which R will be useful. First, in practice, such a project involves a range of tasks, such as importing data into software, merging different datasets together, verifying data, creating new variables, recoding and renaming variables, visualizing data, running statistical estimation procedures, carrying out diagnostic tests, and so on. Second, an analyst needs to be able to reproduce his or her own analysis, including dataset construction and estimation results, even years later. The first challenge concerns the efficiency of an analysis, whereas the second concerns the reproducibility and integrity of the analysis.

To achieve both efficiency and reproducibility, experienced analysts always choose to write down their computing code in one or more programs so that the code can be submitted, revised, and resubmitted to reproduce an analysis speedily and whenever necessary. Hence, in this book, we will focus on how to write and submit R programs for specific tasks in a program editor, rather than the interactive use or menu-driven interface of R. For all practical purposes, the programming approach is much more efficient and consistent than the interactive or menu-driven approach.

Before we step into how to use R, we will need to clarify some related organizational and housekeeping issues. In this chapter, we will first offer a very brief introduction to R, then demonstrate how to install R, write and execute R programs, install and load add-on packages, and produce graphical and numerical output, and then turn to essential reference information about important symbols and common coding errors. Notably, each line of R code will likely appear three times: presented as a stand-alone command line preceded or followed by an explanation of its purpose and function, listed together with the output from its execution, and collated with all other program code in the chapter for the sake of convenient reference. We will end the chapter with a section about miscellaneous issues of interest to ambitious readers and a section on exercises.

Essentials about R

A One-Paragraph Introduction to R

R is a computer language and an environment for statistical computing and graphics with important advantages. Started by Robert Gentleman and Ross Ihaka of the University of Auckland in 1995, it is now maintained by the R core-development team of volunteer developers. R is referred to as a computer language because as a dialect of the S language developed in the late 1980s at AT&T’s labs, R allows users to follow the algorithms, define and add new functions, and write new analytic methods, rather than merely supplying canned routines. R is also a coherent system which provides an environment with an integrated suite of software facilities for data storage, manipulation, analysis, and visualization. In addition, R is flexible. It runs on Windows, UNIX, and Mac OS X. It can be easily extended in terms of new functions and state-of-the-art statistical methods; the over 10,000 add-on packages by the end of January 2017 through the CRAN family of internet sites testify to this fact. Last but not least, R is free, as are its numerous add-on packages. Hence, R is popular among practitioners in many fields and scholars in many disciplines, including the social sciences.

Installation

As an open source software for statistical computing, R can be easily downloaded from the following site: http://www.r-project.org/. We may simply click on the highlighted download R link to reach a list of CRAN mirror sites. Clicking on any site we prefer directs us to the page for downloading the software for three different platforms: Linux, Windows, and Mac. R works slightly differently across the three platforms. For the purpose of this book, we will focus on the language and functionality specific to the Windows platform. Mac users may consult the Miscellaneous Q&A Section later in the chapter for some brief explanation.

R is being constantly updated to new versions by developers. It is worth noting that some R programs and packages used in this book could require R version 3.3.2 or newer. If the version of R on a machine is not up to date, one may simply uninstall the old version and install the latest version following the procedures described previously, or refer to the subsection on how to update R in the Miscellaneous Q&A Section.
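Whether an installed copy of R meets the 3.3.2 threshold mentioned above can be checked from within R itself. The following is only a minimal sketch: R.version.string and getRversion() are base R, while the names minimum_version and is_recent_enough are illustrative choices that do not come from the book.

```r
# Print the version string of the R installation that is currently running.
print(R.version.string)

# Threshold taken from the paragraph above; the variable name is illustrative.
minimum_version <- "3.3.2"

# getRversion() returns a version object that can be compared with a string.
is_recent_enough <- getRversion() >= minimum_version

if (is_recent_enough) {
  message("This R installation is version ", minimum_version, " or newer.")
} else {
  message("This R installation is older than ", minimum_version,
          "; consider updating.")
}
```

If the second message appears, updating R as described in the text would be the natural next step.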

HowtoStartAProjectFolderandWriteOurFirstRProgram

LearntoSetupAProjectFolderforProgramsandDataFiles

Thefirststepinaprojectistosetupaprojectfoldertoholdrelevantdatasets, programs,andoutputfiles.Wecanthinkofaprojectfolderasourhomemailing address,andalltherelevantdatasets,programs,andoutputfilesasthemailand packagestobedeliveredtous.Withoutthemailingaddress,thepackagesand mailwillnotbedeliveredtotherightplace.Hence,aprojectfolderallowsus toeasilyfindalltherelevantfilesandavoidhavingthemmingledandconflated withthosefilesforotherprojectsorpurposes.

In Windows, we can create a project folder via the following steps: Open My Computer or File Explorer; right click on the root directory, such as C: or D:; click on New; click on New Folder or Folder; and type in a meaningful name for the new folder, such as Project.
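The same folder can also be created from within R rather than through File Explorer. The sketch below uses a temporary location so that it runs on any platform; on a Windows machine one would presumably substitute a path such as "C:/Project". The name project_dir is an assumption made for illustration.

```r
# Illustrative location; replace with "C:/Project" on Windows if desired.
project_dir <- file.path(tempdir(), "Project")

# Create the folder only if it does not exist yet.
if (!dir.exists(project_dir)) {
  dir.create(project_dir)
}

# Confirm whether the folder is now there.
print(dir.exists(project_dir))
```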

Learn to Find and Set A Working Directory for A Project

When we open R, the default interface is the R Console page, which is based on the interactive mode. To create an R program, we should go to the R Editor page. To do so, we can open R, click on File on the menu bar, and then click on New script to open the R program editor. Now, click on the Save button on the menu bar or the Save option under File, and we will be prompted to enter a file name for an R program file that ends with .R. For an experiment, name the file session1.R (remember to end it with .R), and then save the file in the Project folder.

Learn to Write and Execute the Simplest Toy Program

Now is the time for us to learn to write an extremely simple R program and run it. R has a default working directory or folder (think of it as the post office address for mail and packages). We are interested in telling R to change the current default working directory to the Project folder. It is like directing our mail to be delivered to our own home address, rather than the post office address. The Project folder is where we keep our program and data files.

To do so, we first identify the working directory of the current R session, then change it to the Project folder, and finally verify that the change is successful. In the program editor, first type in

getwd()

The getwd function lists the name of the current working directory.

Next, type in

setwd("C:/Project")

The setwd function changes the current working directory for the current R session. The argument of the function is inside the parentheses, between double quotation marks, and employs one forward slash; it specifies the path to the Project folder as the new current working directory for the current R session. This line of code makes it possible, during the rest of our R session, for us to refer to the files within the Project folder without specifying the path again. Finally, note that R is case sensitive. Hence, R will treat Project and project as two different folders wherever the file system itself distinguishes case, so the spelling in the code should match the actual folder name exactly. If there is a mismatch in spelling between the program code and the actual folder name, R will produce an error message. Also note that any mismatch in terms of quotation marks, colon, etc. will cause R to produce an error message.
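To see what such an error message looks like without interrupting a session, the call can be wrapped in tryCatch. This is a sketch only; the folder name below is deliberately made up and assumed not to exist, and the names wrong_path and result are illustrative.

```r
# A folder name assumed not to exist, standing in for a misspelled path.
wrong_path <- file.path(tempdir(), "no_such_project_folder")

# setwd() signals an error when the directory cannot be found;
# tryCatch() captures the message instead of stopping the program.
result <- tryCatch(
  {
    setwd(wrong_path)
    "changed"
  },
  error = function(e) conditionMessage(e)
)

# Shows the captured error message rather than "changed".
print(result)
```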

In specifying the path, we may use one forward slash as above, or alternatively, a double backslash as follows:

setwd("C:\\Project")

Please note the double backslashes. This is very important. If we copy the path from our computer File Explorer, the copied and pasted path will contain only one backslash. For R, we will need to add an extra backslash, or change it to one forward slash.
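The relationship between the two forms can be sketched as follows. The variable names win_path and r_path are illustrative. Inside an R string, a backslash must be typed twice, yet the stored string holds only one.

```r
# A path as it must be typed in R code: each backslash is doubled.
win_path <- "C:\\Project"

# The stored string holds a single backslash, as cat() shows.
cat(win_path, "\n")

# Replace every backslash with a forward slash.
# fixed = TRUE treats "\\" as one literal backslash, not a regular expression.
r_path <- gsub("\\", "/", win_path, fixed = TRUE)
print(r_path)
```

Either win_path or r_path could then be passed to setwd on a Windows machine where the folder exists.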

Finally, type in again

getwd()

This allows us to verify the task is done as instructed.

We save these three lines of code into a program file called session1.R. This three-line R program asks R to display the default working directory, then sets the Project folder as our new current working directory, and finally asks R to display the current working directory again.

Having the .R suffix in the program file name is a good practice for two reasons. It helps us see immediately that it is an R program file. When we open a program file in the R editor, all files with .R suffix will appear automatically in the list of files for us to choose to open. If the program file does not have the .R suffix and if we want to open it in the R program editor, it will not show up automatically in the list of files. We will have to choose "All Files (*.*)" from file type in the lower right corner in order to see all files in the folder.

getwd()

setwd("C:/Project")

getwd()

To execute this little program in R, we may choose one of the following two ways:

1. If we want to execute the program line by line, put the cursor anywhere in that line of code, then we can execute it in one of three ways: (a) hit two keys Ctrl+R on the keyboard together; (b) right click the mouse and then click on Run line or selection; (c) click on the third little icon (right next to the second save script icon) on the upper left corner, representing Run line or selection.

2. If we want to execute the whole program in one run, highlight the whole program in R editor, and then either right click the mouse and click on Run line or selection, or hit two keys Ctrl+R on the keyboard together.

When we execute the program above, we will get the following output in R:

getwd()

[1] "C:/Users/QuanLi/BoxSync/RBook/Rnw_oup_formal"

setwd("C:/Project")

getwd()

[1] "C:/Project"

Note that the first line of code getwd() shows that the default current working directory was "C:/Users/QuanLi/BoxSync/RBook/Rnw_oup_formal", in which I have kept my knitr .Rnw and LaTeX files for writing this R book. Then, the second line of code asks R to set the current working directory to "C:/Project" instead. The third line of code shows that for the rest of this R session, files will be drawn from or saved to this new working directory unless otherwise specified via a different file path.
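What it means for files to be saved to the working directory can be sketched as follows. A temporary folder stands in for the Project folder so that the code runs on any machine, and the file name example.txt is made up for illustration.

```r
# Remember the original working directory so it can be restored.
old_wd <- getwd()

# Temporary folder standing in for the Project folder.
project_dir <- file.path(tempdir(), "ProjectDemo")
dir.create(project_dir, showWarnings = FALSE)
setwd(project_dir)

# A bare file name is now resolved inside the working directory.
writeLines("hello", "example.txt")
saved_there <- file.exists(file.path(project_dir, "example.txt"))
print(saved_there)

# Restore the original working directory.
setwd(old_wd)
```

The file was written with no path at all, yet it lands in the folder that setwd pointed to.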

One essential point about programming is that one should document the purpose of a program as a whole and what each line of code does so that days, weeks, or months from now, we or any others who open up the program will be able to understand what the program does and how it does it. For this purpose, we insert comment lines that begin with the # sign into a program. The # sign tells R not to execute that line. Note that the first comment specifies the purpose, time, and software version used. After adding comment lines, the little toy R program above will now be complete and look like the following:

# First R toy program, today's date, R version 3.2.3

# show current working directory path
getwd()

# change the working directory for program to project folder
setwd("C:/Project")

# show current working directory path again to verify
getwd()
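The date and software version recorded in the first comment can be looked up in R itself rather than typed from memory. A minimal sketch using base R follows; the names today and version_text are illustrative.

```r
# Today's date in year-month-day form, e.g. for a program's header comment.
today <- format(Sys.Date(), "%Y-%m-%d")

# A one-line description of the running R version.
version_text <- R.version.string

# Print a line that could be pasted in as the first comment of a program.
cat("# First R toy program,", today, ",", version_text, "\n")
```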

To demonstrate the general process graphically, Figure 1.1 presents four screenshots from R console and editor, which proceed from opening R (picture 1), to opening R editor (picture 2), to typing the three lines of code in R editor (picture 3), to running those three lines and producing output in R console (picture 4).

The biggest benefit of writing and saving a program is to reproduce the same output at any time so long as the functions of the software remain unchanged. For a reader unfamiliar with this approach of working with a software package, it might be useful to close R, open it again, and re-run the little program for replication and verification. Remember to save the program first before closing it; otherwise, we will lose all the changes since we last saved it. The ability to run the same program and produce the same result years after is one of the most important reasons why we prefer to program, rather than using the interactive mode via point and click.
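A saved program can also be re-run in a single call with the base function source, which is not covered in the steps above and is offered here only as a sketch. To stay self-contained, the code first writes a small program to a temporary file; that location is an assumption for illustration, and in practice the file would be the session1.R saved in the Project folder.

```r
# Write a tiny program to a temporary file named session1.R.
prog_file <- file.path(tempdir(), "session1.R")
writeLines(c("# show current working directory path", "getwd()"), prog_file)

# Run the whole file; echo = TRUE prints each command before its result.
source(prog_file, echo = TRUE)
```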

Since R is an object-oriented programming language, it is useful to know something about how R works with data. A simplified view is that R relies on a variety of functions, which take in data as input and then produce desired output
