
An Introduction to Statistical Learning

with Applications in R

Preface

Statistical learning refers to a set of tools for modeling and understanding complex data sets. It is a recently developed area in statistics and blends with parallel developments in computer science and, in particular, machine learning. The field encompasses many methods such as the lasso and sparse regression, classification and regression trees, and boosting and support vector machines.

With the explosion of “Big Data” problems, statistical learning has become a very hot field in many scientific areas as well as marketing, finance, and other business disciplines. People with statistical learning skills are in high demand.

One of the first books in this area—The Elements of Statistical Learning (ESL) (Hastie, Tibshirani, and Friedman)—was published in 2001, with a second edition in 2009. ESL has become a popular text not only in statistics but also in related fields. One of the reasons for ESL's popularity is its relatively accessible style. But ESL is intended for individuals with advanced training in the mathematical sciences. An Introduction to Statistical Learning (ISL) arose from the perceived need for a broader and less technical treatment of these topics. In this new book, we cover many of the same topics as ESL, but we concentrate more on the applications of the methods and less on the mathematical details. We have created labs illustrating how to implement each of the statistical learning methods using the popular statistical software package R. These labs provide the reader with valuable hands-on experience.

This book is appropriate for advanced undergraduates or master's students in statistics or related quantitative fields or for individuals in other disciplines who wish to use statistical learning tools to analyze their data. It can be used as a textbook for a course spanning one or two semesters.

We would like to thank several readers for valuable comments on preliminary drafts of this book: Pallavi Basu, Alexandra Chouldechova, Patrick Danaher, Will Fithian, Luella Fu, Sam Gross, Max Grazier G'Sell, Courtney Paulson, Xinghao Qiao, Elisa Sheng, Noah Simon, Kean Ming Tan, and Xin Lu Tan.

It's tough to make predictions, especially about the future.

Los Angeles, USA    Gareth James
Seattle, USA        Daniela Witten
Palo Alto, USA      Trevor Hastie
Palo Alto, USA      Robert Tibshirani

    2.1.1 Why Estimate f?
    2.1.2 How Do We Estimate f?
    2.1.3 The Trade-Off Between Prediction Accuracy and Model Interpretability
    2.1.4 Supervised Versus Unsupervised Learning
    2.1.5 Regression Versus Classification Problems
  2.2 Assessing Model Accuracy
    2.2.1 Measuring the Quality of Fit
    2.2.2 The Bias-Variance Trade-Off
    2.2.3 The Classification Setting
  2.3 Lab: Introduction to R
    2.3.1 Basic Commands
    2.3.2 Graphics
    2.3.3 Indexing Data
    2.3.4 Loading Data
    2.3.5 Additional Graphical and Numerical Summaries
  2.4 Exercises

3 Linear Regression
  3.1 Simple Linear Regression
    3.1.1 Estimating the Coefficients
    3.1.2 Assessing the Accuracy of the Coefficient Estimates
    3.1.3 Assessing the Accuracy of the Model
  3.2 Multiple Linear Regression
    3.2.1 Estimating the Regression Coefficients
    3.2.2 Some Important Questions
  3.3 Other Considerations in the Regression Model
    3.3.1 Qualitative Predictors
    3.3.2 Extensions of the Linear Model
    3.3.3 Potential Problems
  3.4 The Marketing Plan
  3.5 Comparison of Linear Regression with K-Nearest Neighbors
  3.6 Lab: Linear Regression
    3.6.1 Libraries
    3.6.2 Simple Linear Regression
    3.6.3 Multiple Linear Regression
    3.6.4 Interaction Terms
    3.6.5 Non-linear Transformations of the Predictors
    3.6.6 Qualitative Predictors
    3.6.7 Writing Functions
  3.7 Exercises

4 Classification
  4.1 An Overview of Classification
  4.2 Why Not Linear Regression?
  4.3 Logistic Regression
    4.3.1 The Logistic Model
    4.3.2 Estimating the Regression Coefficients
    4.3.3 Making Predictions
    4.3.4 Multiple Logistic Regression
    4.3.5 Logistic Regression for >2 Response Classes
  4.4 Linear Discriminant Analysis
    4.4.1 Using Bayes' Theorem for Classification
    4.4.2 Linear Discriminant Analysis for p = 1
    4.4.3 Linear Discriminant Analysis for p > 1
    4.4.4 Quadratic Discriminant Analysis
  4.5 A Comparison of Classification Methods
  4.6 Lab: Logistic Regression, LDA, QDA, and KNN
    4.6.1 The Stock Market Data
    4.6.2 Logistic Regression
    4.6.3 Linear Discriminant Analysis
    4.6.4 Quadratic Discriminant Analysis
    4.6.5 K-Nearest Neighbors
    4.6.6 An Application to Caravan Insurance Data
  4.7 Exercises

5 Resampling Methods
  5.1 Cross-Validation
    5.1.1 The Validation Set Approach
    5.1.2 Leave-One-Out Cross-Validation
    5.1.3 k-Fold Cross-Validation
    5.1.4 Bias-Variance Trade-Off for k-Fold Cross-Validation
    5.1.5 Cross-Validation on Classification Problems
  5.2 The Bootstrap
  5.3 Lab: Cross-Validation and the Bootstrap
    5.3.1 The Validation Set Approach
    5.3.2 Leave-One-Out Cross-Validation
    5.3.3 k-Fold Cross-Validation
    5.3.4 The Bootstrap
  5.4 Exercises

6 Linear Model Selection and Regularization
  6.1 Subset Selection
    6.1.1 Best Subset Selection
    6.1.2 Stepwise Selection
    6.1.3 Choosing the Optimal Model
  6.2 Shrinkage Methods
    6.2.1 Ridge Regression
    6.2.2 The Lasso
    6.2.3 Selecting the Tuning Parameter
  6.3 Dimension Reduction Methods
    6.3.1 Principal Components Regression
    6.3.2 Partial Least Squares
  6.4 Considerations in High Dimensions
    6.4.1 High-Dimensional Data
    6.4.2 What Goes Wrong in High Dimensions?
    6.4.3 Regression in High Dimensions
    6.4.4 Interpreting Results in High Dimensions
  6.5 Lab 1: Subset Selection Methods
    6.5.1 Best Subset Selection
    6.5.2 Forward and Backward Stepwise Selection
    6.5.3 Choosing Among Models Using the Validation Set Approach and Cross-Validation
  6.6 Lab 2: Ridge Regression and the Lasso
    6.6.1 Ridge Regression
    6.6.2 The Lasso
  6.7 Lab 3: PCR and PLS Regression
    6.7.1 Principal Components Regression
    6.7.2 Partial Least Squares
  6.8 Exercises

7 Moving Beyond Linearity
  7.1 Polynomial Regression
  7.2 Step Functions
  7.3 Basis Functions
  7.4 Regression Splines
    7.4.1 Piecewise Polynomials
    7.4.2 Constraints and Splines
    7.4.3 The Spline Basis Representation
    7.4.4 Choosing the Number and Locations of the Knots
    7.4.5 Comparison to Polynomial Regression
  7.5 Smoothing Splines
    7.5.1 An Overview of Smoothing Splines
    7.5.2 Choosing the Smoothing Parameter λ
  7.6 Local Regression
  7.7 Generalized Additive Models
    7.7.1 GAMs for Regression Problems
    7.7.2 GAMs for Classification Problems
  7.8 Lab: Non-linear Modeling
    7.8.1 Polynomial Regression and Step Functions
    7.8.2 Splines
    7.8.3 GAMs
  7.9 Exercises

8 Tree-Based Methods
  8.1 The Basics of Decision Trees
    8.1.1 Regression Trees
    8.1.2 Classification Trees
    8.1.3 Trees Versus Linear Models
    8.1.4 Advantages and Disadvantages of Trees
  8.2 Bagging, Random Forests, Boosting
    8.2.1 Bagging
    8.2.2 Random Forests
    8.2.3 Boosting
  8.3 Lab: Decision Trees
    8.3.1 Fitting Classification Trees
    8.3.2 Fitting Regression Trees
    8.3.3 Bagging and Random Forests
    8.3.4 Boosting
  8.4 Exercises

9 Support Vector Machines
  9.1 Maximal Margin Classifier
    9.1.1 What Is a Hyperplane?
    9.1.2 Classification Using a Separating Hyperplane
    9.1.3 The Maximal Margin Classifier
    9.1.4 Construction of the Maximal Margin Classifier
    9.1.5 The Non-separable Case
  9.2 Support Vector Classifiers
    9.2.1 Overview of the Support Vector Classifier
    9.2.2 Details of the Support Vector Classifier
  9.3 Support Vector Machines
    9.3.1 Classification with Non-linear Decision Boundaries
    9.3.2 The Support Vector Machine
    9.3.3 An Application to the Heart Disease Data
  9.4 SVMs with More than Two Classes
    9.4.1 One-Versus-One Classification
    9.4.2 One-Versus-All Classification
  9.5 Relationship to Logistic Regression
  9.6 Lab: Support Vector Machines
    9.6.1 Support Vector Classifier
    9.6.2 Support Vector Machine
    9.6.3 ROC Curves
    9.6.4 SVM with Multiple Classes
    9.6.5 Application to Gene Expression Data
  9.7 Exercises

10 Unsupervised Learning
  10.1 The Challenge of Unsupervised Learning
  10.2 Principal Components Analysis
    10.2.1 What Are Principal Components?
    10.2.2 Another Interpretation of Principal Components
    10.2.3 More on PCA
    10.2.4 Other Uses for Principal Components
  10.3 Clustering Methods
    10.3.1 K-Means Clustering
    10.3.2 Hierarchical Clustering
    10.3.3 Practical Issues in Clustering
  10.4 Lab 1: Principal Components Analysis
  10.5 Lab 2: Clustering
    10.5.1 K-Means Clustering
    10.5.2 Hierarchical Clustering
  10.6 Lab 3: NCI60 Data Example
    10.6.1 PCA on the NCI60 Data
    10.6.2 Clustering the Observations of the NCI60 Data
  10.7 Exercises

1 Introduction

An Overview of Statistical Learning

Statistical learning refers to a vast set of tools for understanding data. These tools can be classified as supervised or unsupervised. Broadly speaking, supervised statistical learning involves building a statistical model for predicting, or estimating, an output based on one or more inputs. Problems of this nature occur in fields as diverse as business, medicine, astrophysics, and public policy. With unsupervised statistical learning, there are inputs but no supervising output; nevertheless we can learn relationships and structure from such data. To provide an illustration of some applications of statistical learning, we briefly discuss three real-world data sets that are considered in this book.

Wage Data

In this application (which we refer to as the Wage data set throughout this book), we examine a number of factors that relate to wages for a group of males from the Atlantic region of the United States. In particular, we wish to understand the association between an employee's age and education, as well as the calendar year, on his wage. Consider, for example, the left-hand panel of Figure 1.1, which displays wage versus age for each of the individuals in the data set. There is evidence that wage increases with age but then decreases again after approximately age 60. The blue line, which provides an estimate of the average wage for a given age, makes this trend clearer.

G. James et al., An Introduction to Statistical Learning: with Applications in R, Springer Texts in Statistics, DOI 10.1007/978-1-4614-7138-7_1

FIGURE 1.1. Wage data, which contains income survey information for males from the central Atlantic region of the United States. Left: wage as a function of age. On average, wage increases with age until about 60 years of age, at which point it begins to decline. Center: wage as a function of year. There is a slow but steady increase of approximately $10,000 in the average wage between 2003 and 2009. Right: Boxplots displaying wage as a function of education, with 1 indicating the lowest level (no high school diploma) and 5 the highest level (an advanced graduate degree). On average, wage increases with the level of education.

Givenanemployee’s age,wecanusethiscurveto predict his wage.However, itisalsoclearfromFigure 1.1 thatthereisasignificantamountofvariabilityassociatedwiththisaveragevalue,andso age aloneisunlikelyto provideanaccuratepredictionofaparticularman’s wage. Wealsohaveinformationregarding eachemployee’seducationleveland the year inwhichthe wage wasearned.Thecenterandright-handpanelsof Figure 1.1,whichdisplay wage asafunctionofboth year and education,indicatethatbothofthesefactorsareassociatedwith wage.Wagesincrease byapproximately$10,000,inaroughlylinear(orstraight-line)fashion, between2003and2009,thoughthisriseisveryslightrelativetothevariabilityinthedata.Wagesarealsotypicallygreaterforindividualswith highereducationlevels:menwiththelowesteducationlevel(1)tendto havesubstantiallylowerwagesthanthosewiththehighesteducationlevel (5).Clearly,themostaccuratepredictionofagivenman’s wage willbe obtainedbycombininghis age,his education,andthe year.InChapter 3, wediscusslinearregression,whichcanbeusedtopredict wage fromthis dataset.Ideally,weshouldpredict wage inawaythataccountsforthe non-linearrelationshipbetween wage and age.InChapter 7,wediscussa classofapproachesforaddressingthisproblem.

Stock Market Data

The Wage data involves predicting a continuous or quantitative output value. This is often referred to as a regression problem. However, in certain cases we may instead wish to predict a non-numerical value—that is, a categorical or qualitative output. For example, in Chapter 4 we examine a stock market data set that contains the daily movements in the Standard & Poor's 500 (S&P) stock index over a 5-year period between 2001 and 2005. We refer to this as the Smarket data. The goal is to predict whether the index will increase or decrease on a given day using the past 5 days' percentage changes in the index. Here the statistical learning problem does not involve predicting a numerical value. Instead it involves predicting whether a given day's stock market performance will fall into the Up bucket or the Down bucket. This is known as a classification problem. A model that could accurately predict the direction in which the market will move would be very useful!

FIGURE 1.2. Left: Boxplots of the previous day's percentage change in the S&P index for the days for which the market increased or decreased, obtained from the Smarket data. Center and Right: Same as left panel, but the percentage changes for 2 and 3 days previous are shown.

The left-hand panel of Figure 1.2 displays two boxplots of the previous day's percentage changes in the stock index: one for the 648 days for which the market increased on the subsequent day, and one for the 602 days for which the market decreased. The two plots look almost identical, suggesting that there is no simple strategy for using yesterday's movement in the S&P to predict today's returns. The remaining panels, which display boxplots for the percentage changes 2 and 3 days previous to today, similarly indicate little association between past and present returns. Of course, this lack of pattern is to be expected: in the presence of strong correlations between successive days' returns, one could adopt a simple trading strategy to generate profits from the market. Nevertheless, in Chapter 4, we explore these data using several different statistical learning methods. Interestingly, there are hints of some weak trends in the data that suggest that, at least for this 5-year period, it is possible to correctly predict the direction of movement in the market approximately 60% of the time (Figure 1.3).

FIGURE 1.3. We fit a quadratic discriminant analysis model to the subset of the Smarket data corresponding to the 2001–2004 time period, and predicted the probability of a stock market decrease using the 2005 data. On average, the predicted probability of decrease is higher for the days in which the market does decrease. Based on these results, we are able to correctly predict the direction of movement in the market 60% of the time.

Gene Expression Data

The previous two applications illustrate data sets with both input and output variables. However, another important class of problems involves situations in which we only observe input variables, with no corresponding output. For example, in a marketing setting, we might have demographic information for a number of current or potential customers. We may wish to understand which types of customers are similar to each other by grouping individuals according to their observed characteristics. This is known as a clustering problem. Unlike in the previous examples, here we are not trying to predict an output variable.

We devote Chapter 10 to a discussion of statistical learning methods for problems in which no natural output variable is available. We consider the NCI60 data set, which consists of 6,830 gene expression measurements for each of 64 cancer cell lines. Instead of predicting a particular output variable, we are interested in determining whether there are groups, or clusters, among the cell lines based on their gene expression measurements. This is a difficult question to address, in part because there are thousands of gene expression measurements per cell line, making it hard to visualize the data.

The left-hand panel of Figure 1.4 addresses this problem by representing each of the 64 cell lines using just two numbers, Z1 and Z2. These are the first two principal components of the data, which summarize the 6,830 expression measurements for each cell line down to two numbers or dimensions. While it is likely that this dimension reduction has resulted in some loss of information, it is now possible to visually examine the data for evidence of clustering. Deciding on the number of clusters is often a difficult problem. But the left-hand panel of Figure 1.4 suggests at least four groups of cell lines, which we have represented using separate colors. We can now examine the cell lines within each cluster for similarities in their types of cancer, in order to better understand the relationship between gene expression levels and cancer.

FIGURE 1.4. Left: Representation of the NCI60 gene expression data set in a two-dimensional space, Z1 and Z2. Each point corresponds to one of the 64 cell lines. There appear to be four groups of cell lines, which we have represented using different colors. Right: Same as left panel except that we have represented each of the 14 different types of cancer using a different colored symbol. Cell lines corresponding to the same cancer type tend to be nearby in the two-dimensional space.

In this particular data set, it turns out that the cell lines correspond to 14 different types of cancer. (However, this information was not used to create the left-hand panel of Figure 1.4.) The right-hand panel of Figure 1.4 is identical to the left-hand panel, except that the 14 cancer types are shown using distinct colored symbols. There is clear evidence that cell lines with the same cancer type tend to be located near each other in this two-dimensional representation. In addition, even though the cancer information was not used to produce the left-hand panel, the clustering obtained does bear some resemblance to some of the actual cancer types observed in the right-hand panel. This provides some independent verification of the accuracy of our clustering analysis.

A Brief History of Statistical Learning

Though the term statistical learning is fairly new, many of the concepts that underlie the field were developed long ago. At the beginning of the nineteenth century, Legendre and Gauss published papers on the method of least squares, which implemented the earliest form of what is now known as linear regression. The approach was first successfully applied to problems in astronomy. Linear regression is used for predicting quantitative values, such as an individual's salary. In order to predict qualitative values, such as whether a patient survives or dies, or whether the stock market increases or decreases, Fisher proposed linear discriminant analysis in 1936. In the 1940s, various authors put forth an alternative approach, logistic regression. In the early 1970s, Nelder and Wedderburn coined the term generalized linear models for an entire class of statistical learning methods that include both linear and logistic regression as special cases.

By the end of the 1970s, many more techniques for learning from data were available. However, they were almost exclusively linear methods, because fitting non-linear relationships was computationally infeasible at the time. By the 1980s, computing technology had finally improved sufficiently that non-linear methods were no longer computationally prohibitive. In the mid 1980s, Breiman, Friedman, Olshen and Stone introduced classification and regression trees, and were among the first to demonstrate the power of a detailed practical implementation of a method, including cross-validation for model selection. Hastie and Tibshirani coined the term generalized additive models in 1986 for a class of non-linear extensions to generalized linear models, and also provided a practical software implementation.

Since that time, inspired by the advent of machine learning and other disciplines, statistical learning has emerged as a new subfield in statistics, focused on supervised and unsupervised modeling and prediction. In recent years, progress in statistical learning has been marked by the increasing availability of powerful and relatively user-friendly software, such as the popular and freely available R system. This has the potential to continue the transformation of the field from a set of techniques used and developed by statisticians and computer scientists to an essential toolkit for a much broader community.

This Book

The Elements of Statistical Learning (ESL) by Hastie, Tibshirani, and Friedman was first published in 2001. Since that time, it has become an important reference on the fundamentals of statistical machine learning. Its success derives from its comprehensive and detailed treatment of many important topics in statistical learning, as well as the fact that (relative to many upper-level statistics textbooks) it is accessible to a wide audience. However, the greatest factor behind the success of ESL has been its topical nature. At the time of its publication, interest in the field of statistical learning was starting to explode. ESL provided one of the first accessible and comprehensive introductions to the topic.

Since ESL was first published, the field of statistical learning has continued to flourish. The field's expansion has taken two forms. The most obvious growth has involved the development of new and improved statistical learning approaches aimed at answering a range of scientific questions across a number of fields. However, the field of statistical learning has also expanded its audience. In the 1990s, increases in computational power generated a surge of interest in the field from non-statisticians who were eager to use cutting-edge statistical tools to analyze their data. Unfortunately, the highly technical nature of these approaches meant that the user community remained primarily restricted to experts in statistics, computer science, and related fields with the training (and time) to understand and implement them.

In recent years, new and improved software packages have significantly eased the implementation burden for many statistical learning methods. At the same time, there has been growing recognition across a number of fields, from business to health care to genetics to the social sciences and beyond, that statistical learning is a powerful tool with important practical applications. As a result, the field has moved from one of primarily academic interest to a mainstream discipline, with an enormous potential audience. This trend will surely continue with the increasing availability of enormous quantities of data and the software to analyze it.

The purpose of An Introduction to Statistical Learning (ISL) is to facilitate the transition of statistical learning from an academic to a mainstream field. ISL is not intended to replace ESL, which is a far more comprehensive text both in terms of the number of approaches considered and the depth to which they are explored. We consider ESL to be an important companion for professionals (with graduate degrees in statistics, machine learning, or related fields) who need to understand the technical details behind statistical learning approaches. However, the community of users of statistical learning techniques has expanded to include individuals with a wider range of interests and backgrounds. Therefore, we believe that there is now a place for a less technical and more accessible version of ESL.

In teaching these topics over the years, we have discovered that they are of interest to master's and PhD students in fields as disparate as business administration, biology, and computer science, as well as to quantitatively-oriented upper-division undergraduates. It is important for this diverse group to be able to understand the models, intuitions, and strengths and weaknesses of the various approaches. But for this audience, many of the technical details behind statistical learning methods, such as optimization algorithms and theoretical properties, are not of primary interest. We believe that these students do not need a deep understanding of these aspects in order to become informed users of the various methodologies, and in order to contribute to their chosen fields through the use of statistical learning tools.

ISL is based on the following four premises.

1. Many statistical learning methods are relevant and useful in a wide range of academic and non-academic disciplines, beyond just the statistical sciences. We believe that many contemporary statistical learning procedures should, and will, become as widely available and used as is currently the case for classical methods such as linear regression. As a result, rather than attempting to consider every possible approach (an impossible task), we have concentrated on presenting the methods that we believe are most widely applicable.

2. Statistical learning should not be viewed as a series of black boxes. No single approach will perform well in all possible applications. Without understanding all of the cogs inside the box, or the interaction between those cogs, it is impossible to select the best box. Hence, we have attempted to carefully describe the model, intuition, assumptions, and trade-offs behind each of the methods that we consider.

3. While it is important to know what job is performed by each cog, it is not necessary to have the skills to construct the machine inside the box! Thus, we have minimized discussion of technical details related to fitting procedures and theoretical properties. We assume that the reader is comfortable with basic mathematical concepts, but we do not assume a graduate degree in the mathematical sciences. For instance, we have almost completely avoided the use of matrix algebra, and it is possible to understand the entire book without a detailed knowledge of matrices and vectors.

4. We presume that the reader is interested in applying statistical learning methods to real-world problems. In order to facilitate this, as well as to motivate the techniques discussed, we have devoted a section within each chapter to R computer labs. In each lab, we walk the reader through a realistic application of the methods considered in that chapter. When we have taught this material in our courses, we have allocated roughly one-third of classroom time to working through the labs, and we have found them to be extremely useful. Many of the less computationally-oriented students who were initially intimidated by R's command level interface got the hang of things over the course of the quarter or semester. We have used R because it is freely available and is powerful enough to implement all of the methods discussed in the book. It also has optional packages that can be downloaded to implement literally thousands of additional methods. Most importantly, R is the language of choice for academic statisticians, and new approaches often become available in R years before they are implemented in commercial packages. However, the labs in ISL are self-contained, and can be skipped if the reader wishes to use a different software package or does not wish to apply the methods discussed to real-world problems.

Who Should Read This Book?

This book is intended for anyone who is interested in using modern statistical methods for modeling and prediction from data. This group includes scientists, engineers, data analysts, or quants, but also less technical individuals with degrees in non-quantitative fields such as the social sciences or business. We expect that the reader will have had at least one elementary course in statistics. Background in linear regression is also useful, though not required, since we review the key concepts behind linear regression in Chapter 3. The mathematical level of this book is modest, and a detailed knowledge of matrix operations is not required. This book provides an introduction to the statistical programming language R. Previous exposure to a programming language, such as MATLAB or Python, is useful but not required.

We have successfully taught material at this level to master's and PhD students in business, computer science, biology, earth sciences, psychology, and many other areas of the physical and social sciences. This book could also be appropriate for advanced undergraduates who have already taken a course on linear regression. In the context of a more mathematically rigorous course in which ESL serves as the primary textbook, ISL could be used as a supplementary text for teaching computational aspects of the various approaches.

Notation and Simple Matrix Algebra

Choosing notation for a textbook is always a difficult task. For the most part we adopt the same notational conventions as ESL.

We will use n to represent the number of distinct data points, or observations, in our sample. We will let p denote the number of variables that are available for use in making predictions. For example, the Wage data set consists of 12 variables for 3,000 people, so we have n = 3,000 observations and p = 12 variables (such as year, age, wage, and more). Note that throughout this book, we indicate variable names using colored font: Variable Name. In some examples, p might be quite large, such as on the order of thousands or even millions; this situation arises quite often, for example, in the analysis of modern biological data or web-based advertising data.

In general, we will let x_ij represent the value of the jth variable for the ith observation, where i = 1, 2, ..., n and j = 1, 2, ..., p. Throughout this book, i will be used to index the samples or observations (from 1 to n) and j will be used to index the variables (from 1 to p). We let X denote an n × p matrix whose (i, j)th element is x_ij. That is,

    X = [ x_11  x_12  ...  x_1p
          x_21  x_22  ...  x_2p
          ...
          x_n1  x_n2  ...  x_np ].

For readers who are unfamiliar with matrices, it is useful to visualize X as a spreadsheet of numbers with n rows and p columns. At times we will be interested in the rows of X, which we write as x_1, x_2, ..., x_n. Here x_i is a vector of length p, containing the p variable measurements for the ith observation. That is,

    x_i = [ x_i1
            x_i2
            ...
            x_ip ].

(Vectors are by default represented as columns.) For example, for the Wage data, x_i is a vector of length 12, consisting of year, age, wage, and other values for the ith individual. At other times we will instead be interested in the columns of X, which we write as x_1, x_2, ..., x_p. Each is a vector of length n. That is,

    x_j = [ x_1j
            x_2j
            ...
            x_nj ].

For example, for the Wage data, x_1 contains the n = 3,000 values for year. Using this notation, the matrix X can be written as

    X = [ x_1  x_2  ...  x_p ],

or

    X = [ x_1^T
          x_2^T
          ...
          x_n^T ].

The ^T notation denotes the transpose of a matrix or vector. So, for example,

    X^T = [ x_11  x_21  ...  x_n1
            x_12  x_22  ...  x_n2
            ...
            x_1p  x_2p  ...  x_np ],

while

    x_i^T = [ x_i1  x_i2  ...  x_ip ].

We use y_i to denote the ith observation of the variable on which we wish to make predictions, such as wage. Hence, we write the set of all n observations in vector form as

    y = [ y_1
          y_2
          ...
          y_n ].

Then our observed data consist of {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)}, where each x_i is a vector of length p. (If p = 1, then x_i is simply a scalar.)
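This indexing convention maps directly onto R, where a matrix is created with matrix() and rows and columns are extracted with bracket indexing. The following is a small illustrative sketch; the numbers are arbitrary and do not come from any data set in this book:

```r
# A toy data matrix X with n = 4 observations (rows) and p = 2 variables (columns)
X <- matrix(c(1, 2,
              3, 4,
              5, 6,
              7, 8),
            nrow = 4, ncol = 2, byrow = TRUE)

dim(X)    # returns c(4, 2), i.e. n and p
X[2, ]    # x_2: the p = 2 measurements for the 2nd observation
X[, 1]    # x_1: the values of the 1st variable for all n observations
X[3, 2]   # x_32: the value of the 2nd variable for the 3rd observation
t(X)      # the transpose X^T, a p x n matrix
```

Note that matrix() fills column by column unless byrow = TRUE is supplied, which is why the call above lists the data one observation per line.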

In this text, a vector of length n will always be denoted in lowercase bold; e.g.

$$\mathbf{a} = \begin{pmatrix} a_1 \\ a_2 \\ \vdots \\ a_n \end{pmatrix}.$$
However, vectors that are not of length n (such as feature vectors of length p, as in (1.1)) will be denoted in lowercase normal font, e.g. a. Scalars will also be denoted in lowercase normal font, e.g. a. In the rare cases in which these two uses for lowercase normal font lead to ambiguity, we will clarify which use is intended. Matrices will be denoted using bold capitals, such as A. Random variables will be denoted using capital normal font, e.g. A, regardless of their dimensions.

Occasionally we will want to indicate the dimension of a particular object. To indicate that an object is a scalar, we will use the notation a ∈ R. To indicate that it is a vector of length k, we will use a ∈ R^k (or a ∈ R^n if it is of length n). We will indicate that an object is an r × s matrix using A ∈ R^{r×s}.

We have avoided using matrix algebra whenever possible. However, in a few instances it becomes too cumbersome to avoid it entirely. In these rare instances it is important to understand the concept of multiplying two matrices. Suppose that A ∈ R^{r×d} and B ∈ R^{d×s}. Then the product of A and B is denoted AB. The (i, j)th element of AB is computed by multiplying each element of the ith row of A by the corresponding element of the jth column of B. That is,

$$(\mathbf{AB})_{ij} = \sum_{k=1}^{d} a_{ik} b_{kj}.$$

As an example, consider

$$\mathbf{A} = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix} \quad \text{and} \quad \mathbf{B} = \begin{pmatrix} 5 & 6 \\ 7 & 8 \end{pmatrix}.$$

Then

$$\mathbf{AB} = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix} \begin{pmatrix} 5 & 6 \\ 7 & 8 \end{pmatrix} = \begin{pmatrix} 1\times 5 + 2\times 7 & 1\times 6 + 2\times 8 \\ 3\times 5 + 4\times 7 & 3\times 6 + 4\times 8 \end{pmatrix} = \begin{pmatrix} 19 & 22 \\ 43 & 50 \end{pmatrix}.$$
Note that this operation produces an r × s matrix. It is only possible to compute AB if the number of columns of A is the same as the number of rows of B.
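In R, the matrix product is computed with the %*% operator (the plain * operator would instead give the elementwise product). A quick check of a 2 × 2 product, using the same illustrative matrices A = (1 2; 3 4) and B = (5 6; 7 8):

```r
A <- matrix(c(1, 2,
              3, 4), nrow = 2, byrow = TRUE)
B <- matrix(c(5, 6,
              7, 8), nrow = 2, byrow = TRUE)

AB <- A %*% B   # each entry: a row of A times a column of B
AB              # a 2 x 2 matrix with rows (19, 22) and (43, 50)
```

Attempting A %*% B when ncol(A) != nrow(B) produces the error "non-conformable arguments", mirroring the conformability requirement above.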

Organization of This Book

Chapter 2 introduces the basic terminology and concepts behind statistical learning. This chapter also presents the K-nearest neighbor classifier, a very simple method that works surprisingly well on many problems. Chapters 3 and 4 cover classical linear methods for regression and classification. In particular, Chapter 3 reviews linear regression, the fundamental starting point for all regression methods. In Chapter 4 we discuss two of the most important classical classification methods, logistic regression and linear discriminant analysis.

A central problem in all statistical learning situations involves choosing the best method for a given application. Hence, in Chapter 5 we introduce cross-validation and the bootstrap, which can be used to estimate the accuracy of a number of different methods in order to choose the best one.

Much of the recent research in statistical learning has concentrated on non-linear methods. However, linear methods often have advantages over their non-linear competitors in terms of interpretability and sometimes also accuracy. Hence, in Chapter 6 we consider a host of linear methods, both classical and more modern, which offer potential improvements over standard linear regression. These include stepwise selection, ridge regression, principal components regression, partial least squares, and the lasso.

The remaining chapters move into the world of non-linear statistical learning. We first introduce in Chapter 7 a number of non-linear methods that work well for problems with a single input variable. We then show how these methods can be used to fit non-linear additive models for which there is more than one input. In Chapter 8, we investigate tree-based methods, including bagging, boosting, and random forests. Support vector machines, a set of approaches for performing both linear and non-linear classification,

are discussed in Chapter 9. Finally, in Chapter 10, we consider a setting in which we have input variables but no output variable. In particular, we present principal components analysis, K-means clustering, and hierarchical clustering.

At the end of each chapter, we present one or more R lab sections in which we systematically work through applications of the various methods discussed in that chapter. These labs demonstrate the strengths and weaknesses of the various approaches, and also provide a useful reference for the syntax required to implement the various methods. The reader may choose to work through the labs at his or her own pace, or the labs may be the focus of group sessions as part of a classroom environment. Within each R lab, we present the results that we obtained when we performed the lab at the time of writing this book. However, new versions of R are continuously released, and over time, the packages called in the labs will be updated. Therefore, in the future, it is possible that the results shown in the lab sections may no longer correspond precisely to the results obtained by the reader who performs the labs. As necessary, we will post updates to the labs on the book website.

We use a special symbol to denote sections or exercises that contain more challenging concepts. These can be easily skipped by readers who do not wish to delve as deeply into the material, or who lack the mathematical background.

Data Sets Used in Labs and Exercises

In this textbook, we illustrate statistical learning methods using applications from marketing, finance, biology, and other areas. The ISLR package available on the book website contains a number of data sets that are required in order to perform the labs and exercises associated with this book. One other data set is contained in the MASS library, and yet another is part of the base R distribution. Table 1.1 contains a summary of the data sets required to perform the labs and exercises. A couple of these data sets are also available as text files on the book website, for use in Chapter 2.

Name       Description
Auto       Gas mileage, horsepower, and other information for cars.
Boston     Housing values and other information about Boston suburbs.
Caravan    Information about individuals offered caravan insurance.
Carseats   Information about car seat sales in 400 stores.
College    Demographic characteristics, tuition, and more for USA colleges.
Default    Customer default records for a credit card company.
Hitters    Records and salaries for baseball players.
Khan       Gene expression measurements for four cancer types.
NCI60      Gene expression measurements for 64 cancer cell lines.
OJ         Sales information for Citrus Hill and Minute Maid orange juice.
Portfolio  Past values of financial assets, for use in portfolio allocation.
Smarket    Daily percentage returns for S&P 500 over a 5-year period.
USArrests  Crime statistics per 100,000 residents in 50 states of USA.
Wage       Income survey data for males in central Atlantic region of USA.
Weekly     1,089 weekly stock market returns for 21 years.

TABLE 1.1. A list of data sets needed to perform the labs and exercises in this textbook. All data sets are available in the ISLR library, with the exception of Boston (part of MASS) and USArrests (part of the base R distribution).

The book website also contains a number of resources, including the R package associated with this book, and some additional data sets.

Acknowledgements

A few of the plots in this book were taken from ESL: Figures 6.7, 8.3, and 10.12. All other plots are new to this book.

2 Statistical Learning

2.1 What Is Statistical Learning?

In order to motivate our study of statistical learning, we begin with a simple example. Suppose that we are statistical consultants hired by a client to provide advice on how to improve sales of a particular product. The Advertising data set consists of the sales of that product in 200 different markets, along with advertising budgets for the product in each of those markets for three different media: TV, radio, and newspaper. The data are displayed in Figure 2.1. It is not possible for our client to directly increase sales of the product. On the other hand, they can control the advertising expenditure in each of the three media. Therefore, if we determine that there is an association between advertising and sales, then we can instruct our client to adjust advertising budgets, thereby indirectly increasing sales. In other words, our goal is to develop an accurate model that can be used to predict sales on the basis of the three media budgets.

In this setting, the advertising budgets are input variables while sales is an output variable. The input variables are typically denoted using the symbol X, with a subscript to distinguish them. So X_1 might be the TV budget, X_2 the radio budget, and X_3 the newspaper budget. The inputs go by different names, such as predictors, independent variables, features, or sometimes just variables. The output variable, in this case sales, is often called the response or dependent variable, and is typically denoted using the symbol Y. Throughout this book, we will use all of these terms interchangeably.

G. James et al., An Introduction to Statistical Learning: with Applications in R, Springer Texts in Statistics, DOI 10.1007/978-1-4614-7138-7_2

FIGURE 2.1. The Advertising data set. The plot displays sales, in thousands of units, as a function of TV, radio, and newspaper budgets, in thousands of dollars, for 200 different markets. In each plot we show the simple least squares fit of sales to that variable, as described in Chapter 3. In other words, each blue line represents a simple model that can be used to predict sales using TV, radio, and newspaper, respectively.

More generally, suppose that we observe a quantitative response Y and p different predictors, X_1, X_2, ..., X_p. We assume that there is some relationship between Y and X = (X_1, X_2, ..., X_p), which can be written in the very general form

$$Y = f(X) + \epsilon. \tag{2.1}$$

Here f is some fixed but unknown function of X_1, ..., X_p, and ε is a random error term, which is independent of X and has mean zero. In this formulation, f represents the systematic information that X provides about Y.
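The model Y = f(X) + ε can be made concrete with a short simulation. The particular f and error distribution below are arbitrary choices for illustration; they are not the functions behind any of the book's figures:

```r
set.seed(1)
n   <- 1000
x   <- runif(n, min = 0, max = 10)   # a single predictor X
f   <- function(x) 5 + 0.5 * x       # the systematic (normally unknown) part
eps <- rnorm(n, mean = 0, sd = 1)    # random error, independent of x, mean zero
y   <- f(x) + eps                    # the observed response Y

mean(y - f(x))   # the realized errors average approximately zero
```

In practice only the pairs (x, y) would be observed; f and eps are visible here only because we generated the data ourselves, exactly as with the simulated Income data set below.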

As another example, consider the left-hand panel of Figure 2.2, a plot of income versus years of education for 30 individuals in the Income data set. The plot suggests that one might be able to predict income using years of education. However, the function f that connects the input variable to the output variable is in general unknown. In this situation one must estimate f based on the observed points. Since Income is a simulated data set, f is known and is shown by the blue curve in the right-hand panel of Figure 2.2. The vertical lines represent the error terms ε. We note that some of the 30 observations lie above the blue curve and some lie below it; overall, the errors have approximately mean zero.

In general, the function f may involve more than one input variable. In Figure 2.3 we plot income as a function of years of education and seniority. Here f is a two-dimensional surface that must be estimated based on the observed data.

FIGURE 2.2. The Income data set. Left: The red dots are the observed values of income (in tens of thousands of dollars) and years of education for 30 individuals. Right: The blue curve represents the true underlying relationship between income and years of education, which is generally unknown (but is known in this case because the data were simulated). The black lines represent the error associated with each observation. Note that some errors are positive (if an observation lies above the blue curve) and some are negative (if an observation lies below the curve). Overall, these errors have approximately mean zero.

In essence, statistical learning refers to a set of approaches for estimating f. In this chapter we outline some of the key theoretical concepts that arise in estimating f, as well as tools for evaluating the estimates obtained.

2.1.1 Why Estimate f?

There are two main reasons that we may wish to estimate f: prediction and inference. We discuss each in turn.

Prediction

In many situations, a set of inputs X are readily available, but the output Y cannot be easily obtained. In this setting, since the error term averages to zero, we can predict Y using

$$\hat{Y} = \hat{f}(X), \tag{2.2}$$

where f̂ represents our estimate for f, and Ŷ represents the resulting prediction for Y. In this setting, f̂ is often treated as a black box, in the sense that one is not typically concerned with the exact form of f̂, provided that it yields accurate predictions for Y.


FIGURE 2.3. The plot displays income as a function of years of education and seniority in the Income data set. The blue surface represents the true underlying relationship between income and years of education and seniority, which is known since the data are simulated. The red dots indicate the observed values of these quantities for 30 individuals.

As an example, suppose that X_1, ..., X_p are characteristics of a patient's blood sample that can be easily measured in a lab, and Y is a variable encoding the patient's risk for a severe adverse reaction to a particular drug. It is natural to seek to predict Y using X, since we can then avoid giving the drug in question to patients who are at high risk of an adverse reaction, that is, patients for whom the estimate of Y is high.

The accuracy of Ŷ as a prediction for Y depends on two quantities, which we will call the reducible error and the irreducible error. In general, f̂ will not be a perfect estimate for f, and this inaccuracy will introduce some error. This error is reducible because we can potentially improve the accuracy of f̂ by using the most appropriate statistical learning technique to estimate f. However, even if it were possible to form a perfect estimate for f, so that our estimated response took the form Ŷ = f(X), our prediction would still have some error in it! This is because Y is also a function of ε, which, by definition, cannot be predicted using X. Therefore, variability associated with ε also affects the accuracy of our predictions. This is known as the irreducible error, because no matter how well we estimate f, we cannot reduce the error introduced by ε.

Why is the irreducible error larger than zero? The quantity ε may contain unmeasured variables that are useful in predicting Y: since we don't measure them, f cannot use them for its prediction. The quantity ε may also contain unmeasurable variation. For example, the risk of an adverse reaction might vary for a given patient on a given day, depending on manufacturing variation in the drug itself or the patient's general feeling of well-being on that day.

Consider a given estimate f̂ and a set of predictors X, which yields the prediction Ŷ = f̂(X). Assume for a moment that both f̂ and X are fixed. Then, it is easy to show that

$$E(Y - \hat{Y})^2 = E[f(X) + \epsilon - \hat{f}(X)]^2 = \underbrace{[f(X) - \hat{f}(X)]^2}_{\text{Reducible}} + \underbrace{\mathrm{Var}(\epsilon)}_{\text{Irreducible}}, \tag{2.3}$$

where E(Y − Ŷ)² represents the average, or expected value, of the squared difference between the predicted and actual value of Y, and Var(ε) represents the variance associated with the error term ε.

The focus of this book is on techniques for estimating f with the aim of minimizing the reducible error. It is important to keep in mind that the irreducible error will always provide an upper bound on the accuracy of our prediction for Y. This bound is almost always unknown in practice.
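The reducible/irreducible decomposition above can be checked by simulation. In the sketch below, with an arbitrary linear f and Var(ε) = 1, even estimating f by least squares on a correctly specified model leaves a mean squared error close to the irreducible Var(ε) = 1:

```r
set.seed(2)
n <- 10000
x <- runif(n, 0, 10)
y <- 2 + 3 * x + rnorm(n, sd = 1)   # true f(x) = 2 + 3x, so Var(eps) = 1

fit <- lm(y ~ x)                    # estimate f by least squares
mse <- mean((y - fitted(fit))^2)    # mean squared prediction error
mse                                 # approximately Var(eps) = 1
```

With this much data the reducible error is nearly zero, so essentially all of the remaining error is irreducible: no estimator of f, however good, could drive the expected squared error below Var(ε).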

Inference

We are often interested in understanding the way that Y is affected as X_1, ..., X_p change. In this situation we wish to estimate f, but our goal is not necessarily to make predictions for Y. We instead want to understand the relationship between X and Y, or more specifically, to understand how Y changes as a function of X_1, ..., X_p. Now f̂ cannot be treated as a black box, because we need to know its exact form. In this setting, one may be interested in answering the following questions:

• Which predictors are associated with the response? It is often the case that only a small fraction of the available predictors are substantially associated with Y. Identifying the few important predictors among a large set of possible variables can be extremely useful, depending on the application.

• What is the relationship between the response and each predictor? Some predictors may have a positive relationship with Y, in the sense that increasing the predictor is associated with increasing values of Y. Other predictors may have the opposite relationship. Depending on the complexity of f, the relationship between the response and a given predictor may also depend on the values of the other predictors.

• Can the relationship between Y and each predictor be adequately summarized using a linear equation, or is the relationship more complicated? Historically, most methods for estimating f have taken a linear form. In some situations, such an assumption is reasonable or even desirable. But often the true relationship is more complicated, in which case a linear model may not provide an accurate representation of the relationship between the input and output variables.

In this book, we will see a number of examples that fall into the prediction setting, the inference setting, or a combination of the two.

For instance, consider a company that is interested in conducting a direct-marketing campaign. The goal is to identify individuals who will respond positively to a mailing, based on observations of demographic variables measured on each individual. In this case, the demographic variables serve as predictors, and response to the marketing campaign (either positive or negative) serves as the outcome. The company is not interested in obtaining a deep understanding of the relationships between each individual predictor and the response; instead, the company simply wants an accurate model to predict the response using the predictors. This is an example of modeling for prediction.

In contrast, consider the Advertising data illustrated in Figure 2.1. One may be interested in answering questions such as:

– Which media contribute to sales?

– Which media generate the biggest boost in sales? or

– How much increase in sales is associated with a given increase in TV advertising?

This situation falls into the inference paradigm. Another example involves modeling the brand of a product that a customer might purchase based on variables such as price, store location, discount levels, competition price, and so forth. In this situation one might really be most interested in how each of the individual variables affects the probability of purchase. For instance, what effect will changing the price of a product have on sales? This is an example of modeling for inference.

Finally, some modeling could be conducted both for prediction and inference. For example, in a real estate setting, one may seek to relate values of homes to inputs such as crime rate, zoning, distance from a river, air quality, schools, income level of community, size of houses, and so forth. In this case one might be interested in how the individual input variables affect the prices, that is, how much extra will a house be worth if it has a view of the river? This is an inference problem. Alternatively, one may simply be interested in predicting the value of a home given its characteristics: is this house under- or over-valued? This is a prediction problem.

Depending on whether our ultimate goal is prediction, inference, or a combination of the two, different methods for estimating f may be appropriate. For example, linear models allow for relatively simple and interpretable inference, but may not yield as accurate predictions as some other approaches. In contrast, some of the highly non-linear approaches that we discuss in the later chapters of this book can potentially provide quite accurate predictions for Y, but this comes at the expense of a less interpretable model for which inference is more challenging.
