An Introduction to Statistical Learning with Applications in R eBook
or qualitative output. For example, in Chapter 4 we examine a stock market data set that contains the daily movements in the Standard & Poor’s 500 (S&P) stock index over a 5-year period between 2001 and 2005. We refer to this as the Smarket data. The goal is to predict whether the index will increase or decrease on a given day using the past 5 days’ percentage changes in the index. Here the statistical learning problem does not involve predicting a numerical value. Instead it involves predicting whether a given day’s stock market performance will fall into the Up bucket or the Down bucket. This is known as a classification problem. A model that could accurately predict the direction in which the market will move would be very useful!
4. We presume that the reader is interested in applying statistical learning methods to real-world problems. In order to facilitate this, as well as to motivate the techniques discussed, we have devoted a section within each chapter to R computer labs. In each lab, we walk the reader through a realistic application of the methods considered in that chapter. When we have taught this material in our courses, we have allocated roughly one-third of classroom time to working through the labs, and we have found them to be extremely useful. Many of the less computationally-oriented students who were initially intimidated by R’s command level interface got the hang of things over the course of the quarter or semester. We have used R because it is freely available and is powerful enough to implement all of the methods discussed in the book. It also has optional packages that can be downloaded to implement literally thousands of additional methods. Most importantly, R is the language of choice for academic statisticians, and new approaches often become available in R years before they are implemented in commercial packages. However, the labs in ISL are self-contained, and can be skipped if the reader wishes to use a different software package or does not wish to apply the methods discussed to real-world problems.
We will use $n$ to represent the number of distinct data points, or observations, in our sample. We will let $p$ denote the number of variables that are available for use in making predictions. For example, the Wage data set consists of 12 variables for 3,000 people, so we have $n = 3{,}000$ observations and $p = 12$ variables (such as year, age, wage, and more). Note that throughout this book, we indicate variable names using colored font: VariableName. In some examples, $p$ might be quite large, such as on the order of thousands or even millions; this situation arises quite often, for example, in the analysis of modern biological data or web-based advertising data.
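In general, we will let $x_{ij}$ represent the value of the $j$th variable for the $i$th observation, where $i = 1, 2, \ldots, n$ and $j = 1, 2, \ldots, p$. Throughout this book, $i$ will be used to index the samples or observations (from 1 to $n$) and $j$ will be used to index the variables (from 1 to $p$). We let $\mathbf{X}$ denote an $n \times p$ matrix whose $(i,j)$th element is $x_{ij}$. That is,
\[
\mathbf{X} =
\begin{pmatrix}
x_{11} & x_{12} & \cdots & x_{1p} \\
x_{21} & x_{22} & \cdots & x_{2p} \\
\vdots & \vdots & \ddots & \vdots \\
x_{n1} & x_{n2} & \cdots & x_{np}
\end{pmatrix}.
\]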
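For readers who are unfamiliar with matrices, it is useful to visualize $\mathbf{X}$ as a spreadsheet of numbers with $n$ rows and $p$ columns.
At times we will be interested in the rows of $\mathbf{X}$, which we write as $x_1, x_2, \ldots, x_n$. Here $x_i$ is a vector of length $p$, containing the $p$ variable measurements for the $i$th observation. That is,
\[
x_i =
\begin{pmatrix}
x_{i1} \\ x_{i2} \\ \vdots \\ x_{ip}
\end{pmatrix}.
\qquad (1.1)
\]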
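At other times we will instead be interested in the columns of $\mathbf{X}$, which we write as $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_p$. Each is a vector of length $n$. That is,
\[
\mathbf{x}_j =
\begin{pmatrix}
x_{1j} \\ x_{2j} \\ \vdots \\ x_{nj}
\end{pmatrix}.
\]
For example, for the Wage data, $\mathbf{x}_1$ contains the $n = 3{,}000$ values for year. Using this notation, the matrix $\mathbf{X}$ can be written as
\[
\mathbf{X} = \begin{pmatrix} \mathbf{x}_1 & \mathbf{x}_2 & \cdots & \mathbf{x}_p \end{pmatrix},
\]
or
\[
\mathbf{X} =
\begin{pmatrix}
x_1^T \\ x_2^T \\ \vdots \\ x_n^T
\end{pmatrix}.
\]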
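The $^T$ notation denotes the transpose of a matrix or vector. So, for example,
\[
\mathbf{X}^T =
\begin{pmatrix}
x_{11} & x_{21} & \cdots & x_{n1} \\
x_{12} & x_{22} & \cdots & x_{n2} \\
\vdots & \vdots & \ddots & \vdots \\
x_{1p} & x_{2p} & \cdots & x_{np}
\end{pmatrix},
\]
while
\[
x_i^T = \begin{pmatrix} x_{i1} & x_{i2} & \cdots & x_{ip} \end{pmatrix}.
\]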
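We use $y_i$ to denote the $i$th observation of the variable on which we wish to make predictions, such as wage. Hence, we write the set of all $n$ observations in vector form as
\[
\mathbf{y} =
\begin{pmatrix}
y_1 \\ y_2 \\ \vdots \\ y_n
\end{pmatrix}.
\]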
Then our observed data consists of $\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$, where each $x_i$ is a vector of length $p$. (If $p = 1$, then $x_i$ is simply a scalar.)
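The row, column, and transpose notation can be made concrete with a small sketch. It is written in plain Python rather than R, the toy numbers are made up for illustration (they are not taken from the Wage data), and the helper names `row`, `column`, and `transpose` are hypothetical.

```python
# Toy data matrix X with n = 3 observations (rows) and p = 2 variables (columns).
# The values are invented for illustration only.
X = [
    [2003, 18],
    [2004, 24],
    [2005, 45],
]
n = len(X)      # number of observations
p = len(X[0])   # number of variables


def row(X, i):
    """x_i: the p measurements for the i-th observation (1-indexed, as in the text)."""
    return X[i - 1]


def column(X, j):
    """The j-th column of X: the n values of the j-th variable (1-indexed)."""
    return [r[j - 1] for r in X]


def transpose(X):
    """X^T: a p x n matrix whose (j, i)th element is x_ij."""
    return [list(col) for col in zip(*X)]


print(n, p)
print(row(X, 2))
print(column(X, 1))
print(transpose(X))
```

The text indexes from 1, so the helpers subtract 1 before indexing the zero-based Python lists.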
In this text, a vector of length $n$ will always be denoted in lower case bold; e.g.
\[
\mathbf{a} =
\begin{pmatrix}
a_1 \\ a_2 \\ \vdots \\ a_n
\end{pmatrix}.
\]
However, vectors that are not of length $n$ (such as feature vectors of length $p$, as in (1.1)) will be denoted in lower case normal font, e.g. $a$. Scalars will also be denoted in lower case normal font, e.g. $a$. In the rare cases in which these two uses for lower case normal font lead to ambiguity, we will clarify which use is intended. Matrices will be denoted using bold capitals, such as $\mathbf{A}$. Random variables will be denoted using capital normal font, e.g. $A$, regardless of their dimensions.
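Occasionally we will want to indicate the dimension of a particular object. To indicate that an object is a scalar, we will use the notation $a \in \mathbb{R}$. To indicate that it is a vector of length $k$, we will use $a \in \mathbb{R}^k$ (or $\mathbf{a} \in \mathbb{R}^n$ if it is of length $n$). We will indicate that an object is an $r \times s$ matrix using $\mathbf{A} \in \mathbb{R}^{r \times s}$.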
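We have avoided using matrix algebra whenever possible. However, in a few instances it becomes too cumbersome to avoid it entirely. In these rare instances it is important to understand the concept of multiplying two matrices. Suppose that $\mathbf{A} \in \mathbb{R}^{r \times d}$ and $\mathbf{B} \in \mathbb{R}^{d \times s}$. Then the product of $\mathbf{A}$ and $\mathbf{B}$ is denoted $\mathbf{AB}$. The $(i,j)$th element of $\mathbf{AB}$ is computed by multiplying each element of the $i$th row of $\mathbf{A}$ by the corresponding element of the $j$th column of $\mathbf{B}$ and summing the products. That is, $(\mathbf{AB})_{ij} = \sum_{k=1}^{d} a_{ik} b_{kj}$. As an example, consider
\[
\mathbf{A} = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}
\quad \text{and} \quad
\mathbf{B} = \begin{pmatrix} 5 & 6 \\ 7 & 8 \end{pmatrix}.
\]
Then
\[
\mathbf{AB} =
\begin{pmatrix}
1 \times 5 + 2 \times 7 & 1 \times 6 + 2 \times 8 \\
3 \times 5 + 4 \times 7 & 3 \times 6 + 4 \times 8
\end{pmatrix}
=
\begin{pmatrix} 19 & 22 \\ 43 & 50 \end{pmatrix}.
\]
Note that this operation produces an $r \times s$ matrix. It is only possible to compute $\mathbf{AB}$ if the number of columns of $\mathbf{A}$ is the same as the number of rows of $\mathbf{B}$.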
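The element-wise rule for the product can also be sketched in a few lines of plain Python. The function name `matmul` and the two small matrices are assumptions made for illustration; the code simply applies the sum over $k$ of $a_{ik} b_{kj}$ and checks the dimension condition.

```python
def matmul(A, B):
    """Multiply an r x d matrix A by a d x s matrix B (both lists of rows)."""
    r, d = len(A), len(A[0])
    if len(B) != d:
        # The product exists only if columns of A match rows of B.
        raise ValueError("number of columns of A must equal number of rows of B")
    s = len(B[0])
    # (AB)_ij = sum over k of a_ik * b_kj
    return [
        [sum(A[i][k] * B[k][j] for k in range(d)) for j in range(s)]
        for i in range(r)
    ]


# Made-up example: A is 2 x 3 and B is 3 x 2, so AB is 2 x 2.
A = [[1, 2, 3],
     [4, 5, 6]]
B = [[7, 8],
     [9, 10],
     [11, 12]]
AB = matmul(A, B)
print(AB)  # [[58, 64], [139, 154]]
```

Calling `matmul(A, A)` raises a `ValueError` here, because a 2 x 3 matrix cannot be multiplied by another 2 x 3 matrix.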
are discussed in Chapter 9. Finally, in Chapter 10, we consider a setting in which we have input variables but no output variable. In particular, we present principal components analysis, K-means clustering, and hierarchical clustering.
At the end of each chapter, we present one or more R lab sections in which we systematically work through applications of the various methods discussed in that chapter. These labs demonstrate the strengths and weaknesses of the various approaches, and also provide a useful reference for the syntax required to implement the various methods. The reader may choose to work through the labs at his or her own pace, or the labs may be the focus of group sessions as part of a classroom environment. Within each R lab, we present the results that we obtained when we performed the lab at the time of writing this book. However, new versions of R are continuously released, and over time, the packages called in the labs will be updated. Therefore, in the future, it is possible that the results shown in the lab sections may no longer correspond precisely to the results obtained by the reader who performs the labs. As necessary, we will post updates to the labs on the book website.
FIGURE 2.1. The Advertising data set. The plot displays sales, in thousands of units, as a function of TV, radio, and newspaper budgets, in thousands of dollars, for 200 different markets. In each plot we show the simple least squares fit of sales to that variable, as described in Chapter 3. In other words, each blue line represents a simple model that can be used to predict sales using TV, radio, and newspaper, respectively.
More generally, suppose that we observe a quantitative response $Y$ and $p$ different predictors, $X_1, X_2, \ldots, X_p$. We assume that there is some relationship between $Y$ and $X = (X_1, X_2, \ldots, X_p)$, which can be written in the very general form
\[
Y = f(X) + \epsilon.
\]
Here $f$ is some fixed but unknown function of $X_1, \ldots, X_p$, and $\epsilon$ is a random error term, which is independent of $X$ and has mean zero. In this formulation, $f$ represents the systematic information that $X$ provides about $Y$.
As another example, consider the left-hand panel of Figure 2.2, a plot of income versus years of education for 30 individuals in the Income data set. The plot suggests that one might be able to predict income using years of education. However, the function $f$ that connects the input variable to the output variable is in general unknown. In this situation one must estimate $f$ based on the observed points. Since Income is a simulated data set, $f$ is known and is shown by the blue curve in the right-hand panel of Figure 2.2. The vertical lines represent the error terms $\epsilon$. We note that some of the 30 observations lie above the blue curve and some lie below it; overall, the errors have approximately mean zero.
In general, the function $f$ may involve more than one input variable. In Figure 2.3 we plot income as a function of years of education and seniority. Here $f$ is a two-dimensional surface that must be estimated based on the observed data.
FIGURE 2.2. The Income data set. Left: The red dots are the observed values of income (in tens of thousands of dollars) and years of education for 30 individuals. Right: The blue curve represents the true underlying relationship between income and years of education, which is generally unknown (but is known in this case because the data were simulated). The black lines represent the error associated with each observation. Note that some errors are positive (if an observation lies above the blue curve) and some are negative (if an observation lies below the curve). Overall, these errors have approximately mean zero.
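A simulated data set of this kind can be sketched as follows. This is only an illustration in plain Python: the linear function `f`, the noise level, and the range of the input are assumptions chosen for the example and are not the curve or settings used to generate the Income data.

```python
import random


def f(x):
    # Hypothetical "true" relationship, chosen only for illustration.
    return 20.0 + 3.0 * x


def simulate(n, sd=5.0, seed=1):
    """Draw n observations from Y = f(X) + epsilon, with mean-zero Gaussian epsilon."""
    rng = random.Random(seed)
    xs = [rng.uniform(10.0, 22.0) for _ in range(n)]   # inputs
    eps = [rng.gauss(0.0, sd) for _ in range(n)]       # error terms
    ys = [f(x) + e for x, e in zip(xs, eps)]           # observed responses
    return xs, ys, eps


xs, ys, eps = simulate(10_000)
mean_error = sum(eps) / len(eps)
n_above = sum(e > 0 for e in eps)   # observations lying above the true curve

print(f"mean error: {mean_error:.3f}")
print(f"above the curve: {n_above}, below: {len(eps) - n_above}")
```

Because the data are simulated, each error can be recovered exactly as `y - f(x)`; with real data $f$ is unknown and the errors cannot be observed directly.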
In essence, statistical learning refers to a set of approaches for estimating $f$. In this chapter we outline some of the key theoretical concepts that arise in estimating $f$, as well as tools for evaluating the estimates obtained.
2.1.1 Why Estimate $f$?
There are two main reasons that we may wish to estimate $f$: prediction and inference. We discuss each in turn.
Prediction
In many situations, a set of inputs $X$ are readily available, but the output $Y$ cannot be easily obtained. In this setting, since the error term averages to zero, we can predict $Y$ using
\[
\hat{Y} = \hat{f}(X),
\]
where $\hat{f}$ represents our estimate for $f$, and $\hat{Y}$ represents the resulting prediction for $Y$. In this setting, $\hat{f}$ is often treated as a black box, in the sense that one is not typically concerned with the exact form of $\hat{f}$, provided that it yields accurate predictions for $Y$.
FIGURE 2.3. The plot displays income as a function of years of education and seniority in the Income data set. The blue surface represents the true underlying relationship between income and years of education and seniority, which is known since the data are simulated. The red dots indicate the observed values of these quantities for 30 individuals.
As an example, suppose that $X_1, \ldots, X_p$ are characteristics of a patient’s blood sample that can be easily measured in a lab, and $Y$ is a variable encoding the patient’s risk for a severe adverse reaction to a particular drug. It is natural to seek to predict $Y$ using $X$, since we can then avoid giving the drug in question to patients who are at high risk of an adverse reaction, that is, patients for whom the estimate of $Y$ is high.
The accuracy of $\hat{Y}$ as a prediction for $Y$ depends on two quantities, which we will call the reducible error and the irreducible error. In general, $\hat{f}$ will not be a perfect estimate for $f$, and this inaccuracy will introduce some error. This error is reducible because we can potentially improve the accuracy of $\hat{f}$ by using the most appropriate statistical learning technique to estimate $f$. However, even if it were possible to form a perfect estimate for $f$, so that our estimated response took the form $\hat{Y} = f(X)$, our prediction would still have some error in it! This is because $Y$ is also a function of $\epsilon$, which, by definition, cannot be predicted using $X$. Therefore, variability associated with $\epsilon$ also affects the accuracy of our predictions. This is known as the irreducible error, because no matter how well we estimate $f$, we cannot reduce the error introduced by $\epsilon$.
Why is the irreducible error larger than zero? The quantity $\epsilon$ may contain unmeasured variables that are useful in predicting $Y$: since we don’t measure them, $f$ cannot use them for its prediction. The quantity $\epsilon$ may also contain unmeasurable variation. For example, the risk of an adverse reaction might vary for a given patient on a given day, depending on
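Consider a given estimate $\hat{f}$ and a set of predictors $X$, which yields the prediction $\hat{Y} = \hat{f}(X)$. Assume for a moment that both $\hat{f}$ and $X$ are fixed. Then, it is easy to show that
\[
E(Y - \hat{Y})^2 = E[f(X) + \epsilon - \hat{f}(X)]^2
= \underbrace{[f(X) - \hat{f}(X)]^2}_{\text{reducible}} + \underbrace{\mathrm{Var}(\epsilon)}_{\text{irreducible}},
\]
where $E(Y - \hat{Y})^2$ represents the average, or expected value, of the squared difference between the predicted and actual value of $Y$, and $\mathrm{Var}(\epsilon)$ represents the variance associated with the error term $\epsilon$.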
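The intermediate steps behind this decomposition can be filled in as follows. This is a sketch that uses only the assumptions already stated: $\hat{f}$ and $X$ are fixed, so $f(X) - \hat{f}(X)$ is a constant, and $\epsilon$ has mean zero.

```latex
\begin{aligned}
E(Y - \hat{Y})^2
  &= E\big[f(X) + \epsilon - \hat{f}(X)\big]^2 \\
  &= [f(X) - \hat{f}(X)]^2
     + 2\,[f(X) - \hat{f}(X)]\,E(\epsilon)
     + E(\epsilon^2) \\
  &= [f(X) - \hat{f}(X)]^2 + E(\epsilon^2)
     && \text{since } E(\epsilon) = 0 \\
  &= [f(X) - \hat{f}(X)]^2 + \mathrm{Var}(\epsilon)
     && \text{since } \mathrm{Var}(\epsilon) = E(\epsilon^2) - [E(\epsilon)]^2 = E(\epsilon^2).
\end{aligned}
```

The cross term vanishes only because the error has mean zero; the first remaining term is the reducible error and the second is the irreducible error.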
The focus of this book is on techniques for estimating $f$ with the aim of minimizing the reducible error. It is important to keep in mind that the irreducible error will always provide an upper bound on the accuracy of our prediction for $Y$. This bound is almost always unknown in practice.
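The split into reducible and irreducible error can be checked numerically with a small Monte Carlo sketch in plain Python. All numbers here (the value of $f$ at the fixed $X$, the imperfect estimate, and the noise standard deviation) are made up for illustration, and the function name `expected_squared_error` is hypothetical.

```python
import random


def expected_squared_error(f_x, f_hat_x, sd, n_draws=100_000, seed=0):
    """Monte Carlo estimate of E(Y - Y_hat)^2 at one fixed X."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_draws):
        y = f_x + rng.gauss(0.0, sd)     # Y = f(X) + epsilon
        total += (y - f_hat_x) ** 2      # squared prediction error
    return total / n_draws


# Made-up values: true f(X), an imperfect estimate of it, and the noise level.
f_x, f_hat_x, sd = 50.0, 47.0, 2.0

reducible = (f_x - f_hat_x) ** 2         # [f(X) - f_hat(X)]^2 = 9.0
irreducible = sd ** 2                    # Var(epsilon) = 4.0

simulated = expected_squared_error(f_x, f_hat_x, sd)
perfect = expected_squared_error(f_x, f_x, sd)   # even a perfect estimate keeps Var(epsilon)

print(f"reducible + irreducible: {reducible + irreducible:.2f}")
print(f"simulated with imperfect estimate: {simulated:.2f}")
print(f"simulated with perfect estimate:   {perfect:.2f}")
```

With the perfect estimate the simulated error stays close to `irreducible` rather than dropping to zero, which is the bound described above.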
Inference
We are often interested in understanding the way that $Y$ is affected as $X_1, \ldots, X_p$ change. In this situation we wish to estimate $f$, but our goal is not necessarily to make predictions for $Y$. We instead want to understand the relationship between $X$ and $Y$, or more specifically, to understand how $Y$ changes as a function of $X_1, \ldots, X_p$. Now $\hat{f}$ cannot be treated as a black box, because we need to know its exact form. In this setting, one may be interested in answering the following questions:
• Which predictors are associated with the response? It is often the case that only a small fraction of the available predictors are substantially associated with $Y$. Identifying the few important predictors among a large set of possible variables can be extremely useful, depending on the application.
• What is the relationship between the response and each predictor? Some predictors may have a positive relationship with $Y$, in the sense that increasing the predictor is associated with increasing values of $Y$. Other predictors may have the opposite relationship. Depending on the complexity of $f$, the relationship between the response and a given predictor may also depend on the values of the other predictors.
• Can the relationship between $Y$ and each predictor be adequately summarized using a linear equation, or is the relationship more complicated? Historically, most methods for estimating $f$ have taken a linear form. In some situations, such an assumption is reasonable or even desirable. But often the true relationship is more complicated, in which case a linear model may not provide an accurate representation of the relationship between the input and output variables.
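For a single predictor, the direction of a relationship can be read off a simple least squares line, the kind of fit referred to in the Figure 2.1 caption. The sketch below is plain Python with made-up data, and the function name `simple_least_squares` is hypothetical; it assumes a linear form, which, as the last question notes, may not be adequate in practice.

```python
def simple_least_squares(x, y):
    """Return (intercept, slope) of the least squares line y = intercept + slope * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Made-up data: one response rises with the predictor, the other falls.
x = [1, 2, 3, 4, 5]
y_up = [2.1, 3.9, 6.2, 7.8, 10.1]
y_down = [10, 8, 6, 4, 2]

b0_up, b1_up = simple_least_squares(x, y_up)
b0_down, b1_down = simple_least_squares(x, y_down)

print(f"positive relationship: slope = {b1_up:.2f}")    # 1.99
print(f"negative relationship: slope = {b1_down:.2f}")  # -2.00
```

The sign of the slope indicates the direction of the association. Unlike a black-box predictor, the fitted line has an explicit form that can be inspected, which is what inference requires.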