Data: Types and Presentation
1 TYPES OF BIOLOGICAL DATA
2 ACCURACY AND SIGNIFICANT FIGURES
3 FREQUENCY DISTRIBUTIONS
4 CUMULATIVE FREQUENCY DISTRIBUTIONS
Scientific study involves the systematic collection, organization, analysis, and presentation of knowledge. Many investigations in the biological sciences are quantitative, where knowledge is in the form of numerical observations called data. (One numerical observation is a datum.*) In order for the presentation and analysis of data to be valid and useful, we must use methods appropriate to the type of data obtained, to the design of the data collection, and to the questions asked of the data; and the limitations of the data, of the data collection, and of the data analysis should be appreciated when formulating conclusions.
The word statistics is derived from the Latin for “state,” indicating the historical importance of governmental data gathering, which related principally to demographic information (including census data and “vital statistics”) and often to their use in military recruitment and tax collecting.†
The term statistics is often encountered as a synonym for data: One hears of college enrollment statistics (such as the numbers of newly admitted students, numbers of senior students, numbers of students from various geographic locations), statistics of a basketball game (such as how many points were scored by each player, how many fouls were committed), labor statistics (such as numbers of workers unemployed, numbers employed in various occupations), and so on. Hereafter, this use of the word statistics will not appear in this text. Instead, it will be used in its other common manner: to refer to the orderly collection, analysis, and interpretation of data with a view to objective evaluation of conclusions based on the data.
Statistics applied to biological problems is simply called biostatistics or, sometimes, biometry‡ (the latter term literally meaning “biological measurement”). Although the field of statistics has roots extending back hundreds of years, its development began in earnest in the late nineteenth century, and a major impetus from early in this development has been the need to examine biological data.
* The term data is sometimes seen as a singular noun meaning “numerical information.” This book refrains from that use.
† Peters (1987: 79) and Walker (1929: 32) attribute the first use of the term statistics to a German professor, Gottfried Achenwall (1719–1772), who used the German word Statistik in 1749, and the first published use of the English word to John Sinclair (1754–1835) in 1791.
‡ The word biometry, which literally means “biological measurement,” had, since the nineteenth century, been found in several contexts (such as demographics and, later, quantitative genetics; Armitage, 1985; Stigler, 2000), but using it to mean the application of statistical methods to biological information apparently was conceived between 1892 and 1901 by Karl Pearson, along with the name Biometrika for the still-important English journal he helped found; and it was first published in the inaugural issue of this journal in 1901 (Snedecor, 1954). The Biometrics Section of the American
From Chapter 1 of Biostatistical Analysis, Fifth Edition, Jerrold H. Zar. Copyright © 2010 by Pearson Education, Inc. Publishing as Pearson Prentice Hall. All rights reserved.
Statistical considerations can aid in the design of experiments intended to collect data and in the setting up of hypotheses to be tested. Many biologists attempt the analysis of their research data only to find that too few data were collected to enable reliable conclusions to be drawn, or that much extra effort was expended in collecting data that cannot be of ready use in the analysis of the experiment. Thus, a knowledge of basic statistical principles and procedures is important as research questions are formulated before an experiment and data collection are begun.
Once data have been obtained, we may organize and summarize them in such a way as to arrive at their orderly and informative presentation. Such procedures are often termed descriptive statistics. For example, measurements might be made of the heights of all 13-year-old children in a school district, perhaps determining an average height for each sex. However, perhaps it is desired to make some generalizations from these data. We might, for example, wish to make a reasonable estimate of the heights of all 13-year-olds in the state. Or we might wish to conclude whether the 13-year-old boys in the state are on the average taller than the girls of that age. The ability to make such generalized conclusions, inferring characteristics of the whole from characteristics of its parts, lies within the realm of inferential statistics.
1 TYPES OF BIOLOGICAL DATA
A characteristic (for example, size, color, number, chemical composition) that may differ from one biological entity to another is termed a variable (or, sometimes, a variate∗), and several different kinds of variables may be encountered by biologists. Because the appropriateness of descriptive or inferential statistical procedures depends upon the properties of the data obtained, it is desirable to distinguish among the principal kinds of data. The classification used here is that which is commonly employed (Senders, 1958; Siegel, 1956; Stevens, 1946, 1968). However, not all data fit neatly into these categories and some data may be treated differently depending upon the questions asked of them.
(a) Data on a Ratio Scale. Imagine that we are studying a group of plants, that the heights of the plants constitute a variable of interest, and that the number of leaves per plant is another variable under study. It is possible to assign a numerical value to the height of each plant, and counting the leaves allows a numerical value to be recorded for the number of leaves on each plant. Regardless of whether the height measurements are recorded in centimeters, inches, or other units, and regardless of whether the leaves are counted in a number system using base 10 or any other base, there are two fundamentally important characteristics of these data.
First, there is a constant size interval between adjacent units on the measurement scale. That is, the difference in height between a 36-cm and a 37-cm plant is the same as the difference between a 39-cm and a 40-cm plant, and the difference between eight and ten leaves is equal to the difference between nine and eleven leaves.
Statistical Association was established in 1938, successor to the Committee on Biometrics of that organization, and began publishing the Biometrics Bulletin in 1945, which transformed in 1947 into the journal Biometrics, a journal retaining major importance today. More recently, the term biometrics has become widely used to refer to the study of human physical characteristics (including facial and hand characteristics, fingerprints, DNA profiles, and retinal patterns) for identification purposes.
∗ “Variate” was first used by R. A. Fisher (1925: 5; David, 1995).
Second, it is important that there exists a zero point on the measurement scale and that there is a physical significance to this zero. This enables us to say something meaningful about the ratio of measurements. We can say that a 30-cm (11.8-in.) tall plant is half as tall as a 60-cm (23.6-in.) plant, and that a plant with forty-five leaves has three times as many leaves as a plant with fifteen.
Measurement scales having a constant interval size and a true zero point are said to be ratio scales of measurement. Besides lengths and numbers of items, ratio scales include weights (mg, lb, etc.), volumes (cc, cu ft, etc.), capacities (ml, qt, etc.), rates (cm/sec, mph, mg/min, etc.), and lengths of time (hr, yr, etc.).
(b) Data on an Interval Scale. Some measurement scales possess a constant interval size but not a true zero; they are called interval scales. A common example is that of the two common temperature scales: Celsius (C) and Fahrenheit (F). We can see that the same difference exists between 20°C (68°F) and 25°C (77°F) as between 5°C (41°F) and 10°C (50°F); that is, the measurement scale is composed of equal-sized intervals. But it cannot be said that a temperature of 40°C (104°F) is twice as hot as a temperature of 20°C (68°F); that is, the zero point is arbitrary.∗ (Temperature measurements on the absolute, or Kelvin [K], scale can be referred to a physically meaningful zero and thus constitute a ratio scale.)
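The distinction between interval and ratio scales can be checked numerically. The following is a minimal sketch in Python, using the standard conversion formulas; the function and variable names are our own, chosen only for illustration. Equal intervals remain equal after a change of scale, whereas the ratio of two temperatures changes with the scale unless the scale has a true zero.

```python
def celsius_to_fahrenheit(c):
    """Standard conversion from degrees Celsius to degrees Fahrenheit."""
    return c * 9 / 5 + 32


def celsius_to_kelvin(c):
    """Standard conversion from degrees Celsius to kelvins."""
    return c + 273.15


# Equal intervals on one interval scale are equal on the other.
diff_f_high = celsius_to_fahrenheit(25) - celsius_to_fahrenheit(20)  # 77 - 68
diff_f_low = celsius_to_fahrenheit(10) - celsius_to_fahrenheit(5)    # 50 - 41

# Ratios depend on where the arbitrary zero sits, so they disagree
# between Celsius and Fahrenheit for the same two temperatures.
ratio_c = 40 / 20
ratio_f = celsius_to_fahrenheit(40) / celsius_to_fahrenheit(20)      # 104 / 68

# On the Kelvin scale the zero is physically meaningful, so this ratio
# is the one that can be interpreted.
ratio_k = celsius_to_kelvin(40) / celsius_to_kelvin(20)

print(diff_f_high, diff_f_low)
print(ratio_c, round(ratio_f, 3), round(ratio_k, 3))
```

The two Fahrenheit differences agree, but the Celsius and Fahrenheit ratios do not, and the Kelvin ratio shows that 40°C is only slightly "hotter" than 20°C in the ratio sense.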
Some interval scales encountered in biological data collection are circular scales. Time of day and time of the year are examples of such scales. The interval between 2:00 p.m. (i.e., 1400 hr) and 3:30 p.m. (1530 hr) is the same as the interval between 8:00 a.m. (0800 hr) and 9:30 a.m. (0930 hr). But one cannot speak of ratios of times of day because the zero point (midnight) on the scale is arbitrary, in that one could just as well set up a scale for time of day which would have noon, or 3:00 p.m., or any other time as the zero point. Circular biological data are occasionally compass points, as when one records the compass direction in which an animal or plant is oriented. As the designation of north as 0° is arbitrary, this circular scale is a form of interval scale of measurement.
(c) Data on an Ordinal Scale. The preceding paragraphs on ratio and interval scales of measurement discussed data between which we know numerical differences. For example, if man A weighs 90 kg and man B weighs 80 kg, then man A is known to weigh 10 kg more than B. But our data may, instead, be a record only of the fact that man A weighs more than man B (with no indication of how much more). Thus, we may be dealing with relative differences rather than quantitative differences. Such data consist of an ordering or ranking of measurements and are said to be on an ordinal scale of measurement (ordinal being from the Latin word for “order”). We may speak of one biological entity being shorter, darker, faster, or more active than another; the sizes of five cell types might be labeled 1, 2, 3, 4, and 5, to denote their magnitudes relative to each other; or success in learning to run a maze may be recorded as A, B, or C.
∗ The German-Dutch physicist Gabriel Daniel Fahrenheit (1686–1736) invented the thermometer in 1714 and in 1724 employed a scale on which salt water froze at zero degrees, pure water froze at 32 degrees, and pure water boiled at 212 degrees. In 1742 the Swedish astronomer Anders Celsius (1701–1744) devised a temperature scale with 100 degrees between the freezing and boiling points of water (the so-called “centigrade” scale), first by referring to zero degrees as boiling and 100 degrees as freezing, and later (perhaps at the suggestion of Swedish botanist and taxonomist Carolus Linnaeus [1707–1778]) reversing these two reference points (Asimov, 1982: 177).
It is often true that biological data expressed on the ordinal scale could have been expressed on the interval or ratio scale had exact measurements been obtained (or obtainable). Sometimes data that were originally on interval or ratio scales will be changed to ranks; for example, examination grades of 99, 85, 73, and 66% (ratio scale) might be recorded as A, B, C, and D (ordinal scale), respectively.
Ordinal-scale data contain and convey less information than ratio or interval data, for only relative magnitudes are known. Consequently, quantitative comparisons are impossible (e.g., we cannot speak of a grade of C being half as good as a grade of A, or of the difference between cell sizes 1 and 2 being the same as the difference between sizes 3 and 4). However, we will see that many useful statistical procedures are, in fact, applicable to ordinal data.
(d) Data in Nominal Categories. Sometimes the variable being studied is classified by some qualitative measure it possesses rather than by a numerical measurement. In such cases the variable may be called an attribute, and we are said to be dealing with nominal, or categorical, data. Genetic phenotypes are commonly encountered biological attributes: The possible manifestations of an animal’s eye color might be brown or blue; and if human hair color were the attribute of interest, we might record black, brown, blond, or red. As other examples of nominal data (nominal is from the Latin word for “name”), people might be classified as male or female, or right-handed or left-handed. Or, plants might be classified as dead or alive, or as with or without fertilizer application. Taxonomic categories also form a nominal classification scheme (for example, plants in a study might be classified as pine, spruce, or fir).
Sometimes, data that might have been expressed on an ordinal, interval, or ratio scale of measurement may be recorded in nominal categories. For example, heights might be recorded as tall or short, or performance on an examination as pass or fail, where there is an arbitrary cut-off point on the measurement scale to separate tall from short and pass from fail.
As will be seen, statistical methods useful with ratio, interval, or ordinal data generally are not applicable to nominal data, and we must, therefore, be able to identify such situations when they occur.
(e) Continuous and Discrete Data. When we spoke previously of plant heights, we were dealing with a variable that could be any conceivable value within any observed range; this is referred to as a continuous variable. That is, if we measure a height of 35 cm and a height of 36 cm, an infinite number of heights is possible in the range from 35 to 36 cm: a plant might be 35.07 cm tall or 35.988 cm tall, or 35.3263 cm tall, and so on, although, of course, we do not have devices sensitive enough to detect this infinity of heights. A continuous variable is one for which there is a possible value between any other two values.
However, when speaking of the number of leaves on a plant, we are dealing with a variable that can take on only certain values. It might be possible to observe 27 leaves, or 28 leaves, but 27.43 leaves and 27.9 leaves are values of the variable that are impossible to obtain. Such a variable is termed a discrete or discontinuous variable (also known as a meristic variable). The number of white blood cells in 1 mm³ of blood, the number of giraffes visiting a water hole, and the number of eggs laid by a grasshopper are all discrete variables. The possible values of a discrete variable generally are consecutive integers, but this is not necessarily so. If the leaves on our plants are always formed in pairs, then only even integers are possible values of the variable. And the ratio of number of wings to number of legs of insects is a discrete variable that may only have the value of 0, 0.3333..., or 0.6666... (i.e., 0/6, 2/6, or 4/6, respectively).∗
Ratio-, interval-, and ordinal-scale data may be either continuous or discrete. Nominal-scale data by their nature are discrete.
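That a discrete variable need not take integer values can be illustrated with the wing-to-leg ratio mentioned above. The sketch below uses Python's standard fractions module; the wing counts of 0, 2, and 4 and the leg count of 6 are taken from the fractions 0/6, 2/6, and 4/6 given in the text, and the variable names are illustrative only.

```python
from fractions import Fraction

LEGS = 6                      # legs per insect
WING_COUNTS = (0, 2, 4)       # wing numbers implied by 0/6, 2/6, and 4/6

# Only these three ratios can occur; no value between them is possible.
possible_ratios = [Fraction(wings, LEGS) for wings in WING_COUNTS]

for ratio in possible_ratios:
    print(ratio, float(ratio))
```

The ratios reduce to 0, 1/3, and 2/3: a small, fixed set of attainable values, which is what makes the variable discrete even though two of the values are repeating decimal fractions.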
2 ACCURACY AND SIGNIFICANT FIGURES
Accuracy is the nearness of a measurement to the true value of the variable being measured. Precision is not a synonymous term but refers to the closeness to each other of repeated measurements of the same quantity. Figure 1 illustrates the difference between accuracy and precision of measurements.
FIGURE 1: Accuracy and precision of measurements. A 3-kilogram animal is weighed 10 times. The 10 measurements shown in sample (a) are relatively accurate and precise; those in sample (b) are relatively accurate but not precise; those of sample (c) are relatively precise but not accurate; and those of sample (d) are relatively inaccurate and imprecise.
Human error may exist in the recording of data. For example, a person may miscount the number of birds in a tract of land or misread the numbers on a heart-rate monitor. Or, a person might obtain correct data but record them in such a way (perhaps with poor handwriting) that a subsequent data analyst makes an error in reading them. We shall assume that such errors have not occurred, but there are other aspects of accuracy that should be considered.
∗ The ellipsis marks (...) may be read as “and so on.” Here, they indicate that 2/6 and 4/6 are repeating decimal fractions, which could just as well have been written as 0.3333333333333... and 0.6666666666666..., respectively.
Accuracy of measurement can be expressed in numerical reporting. If we report that the hind leg of a frog is 8 cm long, we are stating the number 8 (a value of a continuous variable) as an estimate of the frog’s true leg length. This estimate was made using some sort of a measuring device. Had the device been capable of more accuracy, we might have declared that the leg was 8.3 cm long, or perhaps 8.32 cm long. When recording values of continuous variables, it is important to designate the accuracy with which the measurements have been made. By convention, the value 8 denotes a measurement in the range of 7.50000... to 8.49999..., the value 8.3 designates a range of 8.25000... to 8.34999..., and the value 8.32 implies that the true value lies within the range of 8.31500... to 8.32499.... That is, the reported value is the midpoint of the implied range, and the size of this range is designated by the last decimal place in the measurement. The value of 8 cm implies an ability to determine length within a range of 1 cm, 8.3 cm implies a range of 0.1 cm, and 8.32 cm implies a range of 0.01 cm. Thus, to record a value of 8.0 implies greater accuracy of measurement than does the recording of a value of 8, for in the first instance the true value is said to lie between 7.95000... and 8.04999... (i.e., within a range of 0.1 cm), whereas 8 implies a value between 7.50000... and 8.49999... (i.e., within a range of 1 cm). To state 8.00 cm implies a measurement that ascertains the frog’s limb length to be between 7.99500... and 8.00499... cm (i.e., within a range of 0.01 cm). Those digits in a number that denote the accuracy of the measurement are referred to as significant figures. Thus, 8 has one significant figure, 8.0 and 8.3 each have two significant figures, and 8.00 and 8.32 each have three.
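The convention just described can be expressed as a short calculation. The following is a minimal sketch, assuming the recorded value is supplied as text exactly as written (so that 8, 8.0, and 8.00 remain distinguishable); the function names are our own. It uses Python's standard decimal module, which preserves trailing zeros. The upper bound returned is the limit that the range approaches (8.35 for 8.3, standing for 8.34999...).

```python
from decimal import Decimal


def implied_range(recorded):
    """Return (lower, upper) limits implied by a value recorded as text."""
    value = Decimal(recorded)
    # The exponent locates the last recorded decimal place:
    # 0 for "8", -1 for "8.3", -2 for "8.32".
    last_place = Decimal(1).scaleb(value.as_tuple().exponent)
    half = last_place / 2
    return value - half, value + half


def significant_figures(recorded):
    """Count the recorded digits, assuming every written digit is significant."""
    return len(Decimal(recorded).as_tuple().digits)


for text in ("8", "8.0", "8.3", "8.00", "8.32"):
    low, high = implied_range(text)
    print(text, low, high, significant_figures(text))
```

The digit count is reliable only for values written with a decimal point or without trailing zeros; a value such as 72,000 is ambiguous in this respect, which is the difficulty that scientific notation resolves.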
In working with exact values of discrete variables, the preceding considerations do not apply. That is, it is sufficient to state that our frog has four limbs or that its left lung contains thirteen flukes. The use of 4.0 or 13.00 would be inappropriate, for as the numbers involved are exactly 4 and 13, there is no question of accuracy or significant figures.
But there are instances where significant figures and implied accuracy come into play with discrete data. An entomologist may report that there are 72,000 moths in a particular forest area. In doing so, it is probably not being claimed that this is the exact number but an estimate of the exact number, perhaps accurate to two significant figures. In such a case, 72,000 would imply a range of accuracy of 1000, so that the true value might lie anywhere from 71,500 to 72,500. If the entomologist wished to convey the fact that this estimate is believed to be accurate to the nearest 100 (i.e., to three significant figures), rather than to the nearest 1000, it would be better to present the data in the form of scientific notation,∗ as follows: If the number 7.2 × 10^4 (= 72,000) is written, a range of accuracy of 0.1 × 10^4 (= 1000) is implied, and the true value is assumed to lie between 71,500 and 72,500. But if 7.20 × 10^4 were written, a range of accuracy of 0.01 × 10^4 (= 100) would be implied, and the true value would be assumed to be in the range of 71,950 to 72,050. Thus, the accuracy of large values (and this applies to continuous as well as discrete variables) can be expressed succinctly using scientific notation.
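The moth estimate can be used to sketch how scientific notation carries the implied accuracy. The code below is an illustration only, using Python's built-in number formatting; the function names are hypothetical, and the number of significant figures must be supplied by the person reporting the value, since it cannot be recovered from 72,000 itself.

```python
def to_scientific(value, significant_figures):
    """Write value in scientific notation with the stated significant figures."""
    return f"{value:.{significant_figures - 1}e}"


def implied_bounds(value, significant_figures):
    """Limits of the range implied by reporting value to that many figures."""
    exponent = int(to_scientific(value, significant_figures).split("e")[1])
    # Size of the implied range: one unit in the last significant place.
    unit = 10 ** (exponent - (significant_figures - 1))
    return value - unit / 2, value + unit / 2


for figures in (2, 3):
    print(to_scientific(72000, figures), implied_bounds(72000, figures))
```

Python writes the exponent as e+04 rather than × 10^4, but the mantissa (7.2 versus 7.20) carries the same information about accuracy as in the text.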
Calculators and computers typically yield results with more significant figures than are justified by the data. However, it is good practice (to avoid rounding error) to retain many significant figures until the last step in a sequence of calculations, and on attaining the result of the final step to round off to the appropriate number of figures.
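The effect of rounding too early can be seen in a small calculation. The numbers below are hypothetical and are not data from this text: suppose 7 events are observed in 9 hours, and we wish to state the number expected in 45 hours to one decimal place.

```python
events, observed_hours, target_hours = 7, 9, 45   # hypothetical values

rate_per_hour = events / observed_hours           # 0.7777...

# Rounding the intermediate rate to one decimal place before multiplying.
rounded_early = round(round(rate_per_hour, 1) * target_hours, 1)

# Carrying full precision and rounding only the final result.
rounded_late = round(rate_per_hour * target_hours, 1)

print(rounded_early, rounded_late)
```

Rounding the rate to 0.8 before multiplying gives 36.0, whereas rounding only at the end gives 35.0; the difference arises entirely from the premature rounding.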
3 FREQUENCY DISTRIBUTIONS
When collecting and summarizing large amounts of data, it is often helpful to record the data in the form of a frequency table. Such a table simply involves a listing of all the observed values of the variable being studied and how many times each value is observed. Consider the tabulation of the frequency of occurrence of sparrow nests in each of several different locations. This is illustrated in Example 1, where the observed kinds of nest sites are listed, and for each kind the number of nests observed is recorded. The distribution of the total number of observations among the various categories is termed a frequency distribution. Example 1 is a frequency table for nominal data, and these data may also be presented graphically by means of a bar graph (Figure 2), where the height of each bar is proportional to the frequency in the class represented. The widths of all bars in a bar graph should be equal so that the eye of the reader is not distracted from the differences in bar heights; this also makes the area of each bar proportional to the frequency it represents. Also, the frequency scale on the vertical axis should begin at zero to avoid exaggerating the apparent differences among bars. If, for example, a bar graph of the data of Example 1 were constructed with the vertical axis representing frequencies of 45 to 60 rather than 0 to 60, the results would appear as in Figure 3. Huff (1954) illustrates other techniques that can mislead the readers of graphs. It is good practice to leave space between the bars of a bar graph of nominal data, to emphasize the distinctness among the categories represented.

EXAMPLE 1 The Location of Sparrow Nests: A Frequency Table of Nominal Data

The variable is nest site, and there are four recorded categories of this variable. The numbers recorded in these categories constitute the frequency distribution.

Nest Site                        Number of Nests Observed
A. Vines                         56
B. Building eaves                60
C. Low tree branches             46
D. Tree and building cavities    49

FIGURE 2: A bar graph of the sparrow nest data of Example 1. An example of a bar graph for nominal data.

∗ The use of scientific notation (by physicists) can be traced back to at least the 1860s (Miller, 2004b).
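The distortion caused by a vertical axis that does not begin at zero can be quantified with the sparrow nest data of Example 1. The sketch below is illustrative; the function name and the choice of comparing the tallest bar with the shortest are our own. It computes how many times taller the tallest bar appears than the shortest for a given starting point of the axis.

```python
# Frequencies from Example 1 (number of nests observed at each kind of site).
nests = {
    "Vines": 56,
    "Building eaves": 60,
    "Low tree branches": 46,
    "Tree and building cavities": 49,
}


def bar_height_ratio(frequencies, axis_start=0):
    """Ratio of the tallest to the shortest drawn bar for a given axis start."""
    heights = [count - axis_start for count in frequencies.values()]
    return max(heights) / min(heights)


ratio_from_zero = bar_height_ratio(nests)          # axis begins at 0
ratio_from_45 = bar_height_ratio(nests, 45)        # axis begins at 45

print(round(ratio_from_zero, 2), ratio_from_45)
```

With the axis beginning at zero, the tallest bar is about 1.3 times the height of the shortest, which matches the ratio of the frequencies (60 to 46). With the axis beginning at 45, the same bar is drawn 15 times as tall as the shortest, although the data are unchanged.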
A frequency tabulation of ordinal data might appear as in Example 2, which presents the observed numbers of sunfish collected in each of five categories, each category being a degree of skin pigmentation. A bar graph (Figure 4) can be prepared for this frequency distribution just as for nominal data.
FIGURE 3: A bar graph of the sparrow nest data of Example 1, drawn with the vertical axis starting at 45. Compare this with Figure 2, where the axis starts at 0.
EXAMPLE 2 Numbers of Sunfish, Tabulated According to Amount of Black Pigmentation: A Frequency Table of Ordinal Data

The variable is amount of pigmentation, which is expressed by numerically ordered classes. The numbers recorded for the five pigmentation classes compose the frequency distribution.

Pigmentation Class    Amount of Pigmentation      Number of Fish
0                     No black pigmentation       13
1                     Faintly speckled            68
2                     Moderately speckled         44
3                     Heavily speckled            21
4                     Solid black pigmentation     8

FIGURE 4: A bar graph of the sunfish pigmentation data of Example 2. An example of a bar graph for ordinal data.
In preparing frequency tables of interval- and ratio-scale data, we can make a procedural distinction between discrete and continuous data. Example 3 shows discrete data that are frequencies of litter sizes in foxes, and Figure 5 presents this frequency distribution graphically.

EXAMPLE 3 Frequency of Occurrence of Various Litter Sizes in Foxes: A Frequency Table of Discrete, Ratio-Scale Data

The variable is litter size, and the numbers recorded for the five litter sizes make up the frequency distribution.

FIGURE 5: A bar graph of the fox litter data of Example 3. An example of a bar graph for discrete, ratio-scale data.
Example 4a shows discrete data that are the numbers of aphids found per clover plant. These data create quite a lengthy frequency table, and it is not difficult to imagine sets of data whose tabulation would result in an even longer list of frequencies. Thus, for purposes of preparing bar graphs, we often cast data into a frequency table by grouping them.
Example 4b is a table of the data from Example 4a arranged by grouping the data into size classes. The bar graph for this distribution appears as Figure 6. Such grouping results in the loss of some information and is generally utilized only to make frequency tables and bar graphs easier to read, and not for calculations performed on the data. There have been several “rules of thumb” proposed to aid in deciding into how many classes data might reasonably be grouped, for the use of too few groups will obscure the general shape of the distribution. But such “rules” or recommendations are only rough guides, and the choice is generally left to good judgment, bearing in mind that from 10 to 20 groups are useful for most biological work. (See also Doane, 1976.) In general, groups should be established that are equal in the size interval of the variable being measured. (For example, the group size interval in Example 4b is four aphids per plant.)
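Grouping discrete counts into classes of equal size is easily done by computer. The following is a minimal sketch using Python's standard collections module. The list of counts is hypothetical (it is not the aphid data of Example 4a), and the function name is our own; the class width of four matches the interval used in Example 4b.

```python
from collections import Counter


def group_counts(values, width):
    """Tabulate non-negative integer values into classes of equal width.

    Classes begin at 0, so a width of 4 gives 0-3, 4-7, 8-11, and so on.
    """
    lower_limits = Counter((value // width) * width for value in values)
    return {
        (low, low + width - 1): lower_limits[low]
        for low in sorted(lower_limits)
    }


# Hypothetical numbers of aphids counted on ten plants.
aphids_per_plant = [0, 2, 5, 7, 7, 9, 12, 13, 13, 18]

table = group_counts(aphids_per_plant, 4)
for (low, high), frequency in table.items():
    print(f"{low}-{high}: {frequency}")
```

Note that the grouped table records only that three plants had between 4 and 7 aphids, not that the counts were 5, 7, and 7; this is the loss of information referred to above.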
EXAMPLE 4a Number of Aphids Observed per Clover Plant: A Frequency Table of Discrete, Ratio-Scale Data

Number of Aphids on a Plant    Number of Plants Observed
0                              3

Total number of observations = 424
Because continuous data, contrary to discrete data, can take on an infinity of values, one is essentially always dealing with a frequency distribution tabulated by groups. If the variable of interest were a weight, measured to the nearest 0.1 mg, a frequency table entry of the number of weights measured to be 48.6 mg would be interpreted to mean the number of weights grouped between 48.5500... and 48.6499... mg (although in a frequency table this class interval is usually written as 48.55–48.65). Example 5 presents a tabulation of 130 determinations of the amount of phosphorus, in milligrams per gram, in dried leaves. (Ignore the last two columns of this table until Section 4.)
EXAMPLE 4b Number of Aphids Observed per Clover Plant: A Frequency Table Grouping the Discrete, Ratio-Scale Data of Example 4a

Number of Aphids on a Plant    Number of Plants Observed
0–3                             6
4–7                            17
8–11                           40
12–15                          54
16–19                          59
20–23                          75
24–27                          77
28–31                          55
32–35                          32
36–39                           8
40–43                           1

Total number of observations = 424

FIGURE 6: A bar graph of the aphid data of Example 4b. An example of a bar graph for grouped discrete, ratio-scale data.
EXAMPLE 5 Determinations of the Amount of Phosphorus in Leaves: A Frequency Table of Continuous Data

                    Frequency                  Cumulative frequency
Phosphorus          (i.e., number of      Starting with    Starting with
(mg/g of leaf)      determinations)       Low Values       High Values
8.15–8.25            2                      2              130
8.25–8.35            6                      8              128
8.35–8.45            8                     16              122
8.45–8.55           11                     27              114
8.55–8.65           17                     44              103
8.65–8.75           17                     61               86
8.75–8.85           24                     85               69
8.85–8.95           18                    103               45
8.95–9.05           13                    116               27
9.05–9.15           10                    126               14
9.15–9.25            4                    130                4

Total frequency = 130 = n
In presenting this frequency distribution graphically, one can prepare a histogram,∗ which is the name given to a bar graph based on continuous data. This is done in Figure 7; note that rather than indicating the range on the horizontal axis, we indicate only the midpoint of the range, a procedure that results in less crowded printing on the graph. Note also that adjacent bars in a histogram are often drawn touching each other, to emphasize the continuity of the scale of measurement, whereas in the other bar graphs discussed they generally are not.
FIGURE 7: A histogram of the leaf phosphorus data of Example 5. An example of a histogram for continuous data.
∗ The term histogram is from Greek roots (referring to a pole-shaped drawing) and was first published by Karl Pearson in 1895 (David, 1995).
FIGURE 8: A frequency polygon for the leaf phosphorus data of Example 5.
Often a frequency polygon is drawn instead of a histogram. This is done by plotting the frequency of each class as a dot (or other symbol) at the class midpoint and then connecting each adjacent pair of dots by a straight line (Figure 8). It is, of course, the same as if the midpoints of the tops of the histogram bars were connected by straight lines. Instead of plotting frequencies on the vertical axis, one can plot relative frequencies, or proportions of the total frequency. This enables different distributions to be readily compared and even plotted on the same axes. Sometimes, as in Figure 8, frequency is indicated on one vertical axis and the corresponding relative frequency on the other. (Using the data of Example 5, the relative frequency for 8.2 mg/g is 2/130 = 0.015, that for 8.3 mg/g is 6/130 = 0.046, that for 9.2 mg/g is 4/130 = 0.031, and so on. The total of all the frequencies is n, and the total of all the relative frequencies is 1.)
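The relative frequencies for Example 5 can be computed directly from the frequency column. The sketch below assumes nothing beyond the tabulated frequencies and the class midpoints; the variable names are illustrative.

```python
# Class midpoints (mg of phosphorus per g of leaf) and frequencies, Example 5.
midpoints = [8.2, 8.3, 8.4, 8.5, 8.6, 8.7, 8.8, 8.9, 9.0, 9.1, 9.2]
frequencies = [2, 6, 8, 11, 17, 17, 24, 18, 13, 10, 4]

n = sum(frequencies)                                   # total frequency
relative = [count / n for count in frequencies]        # proportions of n

for midpoint, count, proportion in zip(midpoints, frequencies, relative):
    print(f"{midpoint:.1f}  {count:3d}  {proportion:.3f}")

print("n =", n, " sum of relative frequencies =", round(sum(relative), 3))
```

Because each relative frequency is a proportion of the same total, two distributions with different values of n can be plotted on one vertical axis running from 0 to 1.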
Frequency polygons are also commonly used for discrete distributions, but one can argue against their use when dealing with ordinal data, as the polygon implies to the reader a constant size interval horizontally between points on the polygon. Frequency polygons should not be employed for nominal-scale data.
If we have a frequency distribution of values of a continuous variable that falls into a large number of class intervals, the data may be grouped as was demonstrated with discrete variables. This results in fewer intervals, but each interval is, of course, larger. The midpoints of these intervals may then be used in the preparation of a histogram or frequency polygon. The user of frequency polygons is cautioned that such a graph is simply an aid to the eye in following trends in frequency distributions, and one should not attempt to read frequencies between points on the polygon. Also note that the method presented for the construction of histograms and frequency polygons requires that the class intervals be equal. Lastly, the vertical axis (e.g., the frequency scale) on frequency polygons and bar graphs generally should begin with zero, especially if graphs are to be compared with one another. If this is not done, the eye may be misled by the appearance of the graph (as shown for nominal-scale data in Figures 2 and 3).
4 CUMULATIVE FREQUENCY DISTRIBUTIONS
A frequency distribution informs us how many observations occurred for each value (or group of values) of a variable. That is, examination of the frequency table of Example 3 (or its corresponding bar graph or frequency polygon) would yield information such as, “How many fox litters of four were observed?”, the answer being 27. But if it is desired to ask questions such as, “How many litters of four or more were observed?”, or “How many fox litters of five or fewer were observed?”, we are speaking of cumulative frequencies. To answer the first question, we sum all frequencies for litter sizes four and up, and for the second question, we sum all frequencies from the smallest litter size up through a size of five. We arrive at answers of 54 and 59, respectively.
In Example 5, the phosphorus concentration data are cast into two cumulative frequency distributions, one with cumulation commencing at the low end of the measurement scale and one with cumulation being performed from the high values toward the low values. The choice of the direction of cumulation is immaterial, as can be demonstrated. If one desired to calculate the number of phosphorus determinations less than 8.55 mg/g, namely 27, a cumulation starting at the low end might be used, whereas the knowledge of the frequency of determinations greater than 8.55 mg/g, namely 103, can be readily obtained from the cumulation commencing from the high end of the scale. But one can easily calculate any frequency from a low-to-high cumulation (e.g., 27) from its complementary frequency from a high-to-low cumulation (e.g., 103), simply by knowing that the sum of these two frequencies is the total frequency (i.e., n = 130); therefore, in practice it is not necessary to calculate both sets of cumulations.
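Both cumulations of Example 5, and the complementary relationship between them, can be reproduced with a few lines of code. This is a minimal sketch using the accumulate function of Python's standard itertools module; the variable names are our own.

```python
from itertools import accumulate

# Frequencies from Example 5, ordered from the lowest class to the highest.
frequencies = [2, 6, 8, 11, 17, 17, 24, 18, 13, 10, 4]
n = sum(frequencies)

# Cumulation starting with low values.
cum_from_low = list(accumulate(frequencies))

# Cumulation starting with high values: accumulate the reversed list,
# then reverse the result so it lines up with the original class order.
cum_from_high = list(accumulate(reversed(frequencies)))[::-1]

# A low-to-high cumulation through one class plus the high-to-low
# cumulation down to the next class always equals the total frequency.
complements_hold = all(
    cum_from_low[i] + cum_from_high[i + 1] == n
    for i in range(len(frequencies) - 1)
)

print(cum_from_low)
print(cum_from_high)
print(complements_hold)
```

The fourth entry of the first list (27) and the fifth entry of the second (103) are the two frequencies discussed above, and they sum to 130, so either cumulation can be recovered from the other.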
Cumulative frequency distributions are useful in determining medians, percentiles, and other quantiles. They are not often presented in bar graphs, but cumulative frequency polygons (sometimes called ogives) are not uncommon. (See Figures 9 and 10.)
FIGURE 9: Cumulative frequency polygon of the leaf phosphorus data of Example 5, with cumulation commencing from the lowest to the highest values of the variable.
FIGURE 10: Cumulative frequency polygon of the leaf phosphorus data of Example 5, with cumulation commencing from the highest to the lowest values of the variable.
Relative frequencies (proportions of the total frequency) can be plotted instead of (or, as in Figures 9 and 10, in addition to) frequencies on the vertical axis of a cumulative frequency polygon. This enables different distributions to be readily compared and even plotted on the same axes. (Using the data of Example 5 for Figure 9, the relative cumulative frequency for 8.2 mg/g is 2/130 = 0.015, that for 8.3 mg/g is 8/130 = 0.062, and so on. For Figure 10, the relative cumulative frequency for 8.2 mg/g is 130/130 = 1.000, that for 8.3 mg/g is 128/130 = 0.985, and so on.)
Populations and Samples
1 POPULATIONS
2 SAMPLES FROM POPULATIONS
3 RANDOM SAMPLING
4 PARAMETERS AND STATISTICS
5 OUTLIERS
1 POPULATIONS
The primary objective of a statistical analysis is to infer characteristics of a group of data by analyzing the characteristics of a small sampling of the group. This generalization from the part to the whole requires the consideration of such important concepts as population, sample, parameter, statistic, and random sampling. These topics are discussed in this chapter.
Basic to statistical analysis is the desire to draw conclusions about a group of measurements of a variable being studied. Biologists often speak of a “population” as a defined group of humans or of another species of organisms. Statisticians speak of a population (also called a universe) as a group of measurements (not organisms) about which one wishes to draw conclusions. It is the latter definition, the statistical definition of population, that will be used throughout this text. For example, an investigator may desire to draw conclusions about the tail lengths of bobcats in Montana. All Montana bobcat tail lengths are, therefore, the population under consideration. If a study is concerned with the blood-glucose concentration in three-year-old children, then the blood-glucose levels in all children of that age are the population of interest.
Populations are often very large, such as the body weights of all grasshoppers in Kansas or the eye colors of all female New Zealanders, but occasionally populations of interest may be relatively small, such as the ages of men who have traveled to the moon or the heights of women who have swum the English Channel.
2 SAMPLES FROM POPULATIONS
If the population under study is very small, it might be practical to obtain all the measurements in the population. If one wishes to draw conclusions about the ages of all men who have traveled to the moon, it would not be unreasonable to attempt to collect all the ages of the small number of individuals under consideration. Generally, however, populations of interest are so large that obtaining all the measurements is unfeasible. For example, we could not reasonably expect to determine the body weight of every grasshopper in Kansas. What can be done in such cases is to obtain a subset of all the measurements in the population. This subset of measurements constitutes a sample, and from the characteristics of samples we can draw conclusions about the characteristics of the populations from which the samples came.∗
Biologists may sample a population that does not physically exist. Suppose an experiment is performed in which a food supplement is administered to 40 guinea pigs, and the sample data consist of the growth rates of these 40 animals. Then the population about which conclusions might be drawn is the growth rates of all the guinea pigs that conceivably might have been administered the same food supplement under identical conditions. Such a population is said to be “imaginary” and is also referred to as “hypothetical” or “potential.”
3 RANDOM SAMPLING
Samples from populations can be obtained in a number of ways; however, for a sample to be representative of the population from which it came, and to reach valid conclusions about populations by induction from samples, statistical procedures typically assume that the samples are obtained in a random fashion. To sample a population randomly requires that each member of the population has an equal and independent chance of being selected. That is, not only must each measurement in the population have an equal chance of being chosen as a member of the sample, but the selection of any member of the population must in no way influence the selection of any other member. Throughout this text, “sample” will always imply “random sample.”†
It is sometimes possible to assign each member of a population a unique number and to draw a sample by choosing a set of such numbers at random. This is equivalent to having all members of a population in a hat and drawing a sample from them while blindfolded. Table 41 from Appendix: Statistical Tables and Graphs provides 10,000 random digits for this purpose. In this table, each digit from 0 to 9 has an equal and independent chance of appearing anywhere in the table. Similarly, each combination of two digits, from 00 to 99, is found at random in the table, as is each three-digit combination, from 000 to 999, and so on.
Assume that a random sample of 200 names is desired from a telephone directory having 274 pages, three columns of names per page, and 98 names per column. Entering Table 41 from Appendix: Statistical Tables and Graphs at random (i.e., do not always enter the table at the same place), one might decide first to arrive at a random combination of three digits. If this three-digit number is 001 to 274, it can be taken as a randomly chosen page number (if it is 000 or larger than 274, simply skip it and choose another three-digit number, e.g., the next one on the table). Then one might examine the next digit in the table; if it is a 1, 2, or 3, let it denote a page column (if a digit other than 1, 2, or 3 is encountered, it is ignored, passing to the next digit that is 1, 2, or 3). Then one could look at the next two-digit number in the table; if it is from 01 to 98, let it represent a randomly selected name within that column. This three-step procedure would be performed a total of 200 times to obtain the desired random sample. One can proceed in any direction in the random number table: left to right, right to left, upward, downward, or diagonally; but the direction should be decided on before looking at the table. Computers are capable of quickly generating random numbers (sometimes called “pseudorandom” numbers because the number generation is not perfectly random), and this is how Table 41 from Appendix: Statistical Tables and Graphs was derived.
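The three-step directory procedure can be mimicked with a stream of pseudorandom digits in place of the printed table. The following Python sketch makes some assumptions: all function names are hypothetical, a two-digit value of 00 or 99 is assumed to be skipped in the same way as an out-of-range page number, and repeated selections of the same name are not removed.

```python
import random

PAGES, COLUMNS, NAMES_PER_COLUMN = 274, 3, 98

def digit_stream(rng):
    """Endless stream of random digits 0-9, standing in for the printed table."""
    while True:
        yield rng.randrange(10)

def next_number(digits, n_digits):
    """Read the next n_digits digits as one integer (e.g. 0, 4, 2 -> 42)."""
    value = 0
    for _ in range(n_digits):
        value = value * 10 + next(digits)
    return value

def pick_entry(digits):
    """One pass of the three-step procedure: page, then column, then name."""
    page = next_number(digits, 3)
    while not 1 <= page <= PAGES:          # skip 000 and anything above 274
        page = next_number(digits, 3)
    column = next_number(digits, 1)
    while not 1 <= column <= COLUMNS:      # ignore digits other than 1, 2, 3
        column = next_number(digits, 1)
    name = next_number(digits, 2)
    while not 1 <= name <= NAMES_PER_COLUMN:  # assumed: skip 00 and 99
        name = next_number(digits, 2)
    return page, column, name

def sample_directory(sample_size=200, seed=None):
    """Repeat the three-step procedure sample_size times."""
    digits = digit_stream(random.Random(seed))
    return [pick_entry(digits) for _ in range(sample_size)]

entries = sample_directory(seed=2024)
print(len(entries), entries[:3])
```

Each element of `entries` is a (page, column, name) triple identifying one directory listing. Skipping out-of-range values, rather than rescaling them, keeps every valid page, column, and name position equally likely.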
∗ This use of the terms population and sample was established by Karl Pearson (1903).
† This concept of random sampling was established by Karl Pearson between 1897 and 1903 (Miller, 2004a).
Very often it is not possible to assign a number to each member of a population, and random sampling then involves biological, rather than simply mathematical, considerations. That is, the techniques for sampling Montana bobcats or Kansas grasshoppers require knowledge about the particular organism to ensure that the sampling is random. Researchers consult relevant books, periodical articles, or reports that address the specific kind of biological measurement to be obtained.
4 PARAMETERS AND STATISTICS
Several measures help to describe or characterize a population. For example, generally a preponderance of measurements occurs somewhere around the middle of the range of a population of measurements. Thus, some indication of a population “average” would express a useful bit of descriptive information. Such information is called a measure of central tendency (also called a measure of location).
It is also important to describe how dispersed the measurements are around the “average.” That is, we can ask whether there is a wide spread of values in the population or whether the values are rather concentrated around the middle. Such a descriptive property is called a measure of variability (or a measure of dispersion).
A quantity such as a measure of central tendency or a measure of dispersion is called a parameter when it describes or characterizes a population, and we shall be very interested in discussing parameters and drawing conclusions about them. Section 2 pointed out, however, that one seldom has data for entire populations, but nearly always has to rely on samples to arrive at conclusions about populations. Thus, one rarely is able to calculate parameters. However, by random sampling of populations, parameters can be estimated well. An estimate of a population parameter is called a statistic.∗ It is statistical convention to represent population parameters by Greek letters and sample statistics by Latin letters; this custom will be demonstrated for specific examples.
The statistics one calculates will vary from sample to sample for samples taken from the same population. Because one uses sample statistics as estimates of population parameters, it behooves the researcher to arrive at the “best” estimates possible. As for what properties to desire in a “good” estimate, consider the following.
First, it is desirable that if we take an indefinitely large number of samples from a population, the long-run average of the statistics obtained will equal the parameter being estimated. That is, for some samples a statistic may underestimate the parameter of interest, and for others it may overestimate that parameter; but in the long run the estimates that are too low and those that are too high will “average out.” If such a property is exhibited by a statistic, we say that we have an unbiased statistic or an unbiased estimator.
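The “averaging out” of an unbiased statistic can be illustrated by simulation. The sketch below is an assumption-laden example rather than anything from the text: the population is a hypothetical set of 10,000 simulated measurements, the statistic is the sample mean, and the sample size (20) and number of samples (5,000) are arbitrary.

```python
import random
import statistics

rng = random.Random(42)  # seeded only so the illustration is repeatable

# Hypothetical population of 10,000 simulated measurements.
population = [rng.gauss(50.0, 5.0) for _ in range(10_000)]
parameter = statistics.fmean(population)  # the population mean

# Draw many random samples and compute the statistic for each one.
sample_means = [
    statistics.fmean(rng.sample(population, 20)) for _ in range(5_000)
]
long_run_average = statistics.fmean(sample_means)

print(f"parameter:            {parameter:.3f}")
print(f"lowest sample mean:   {min(sample_means):.3f}")
print(f"highest sample mean:  {max(sample_means):.3f}")
print(f"long-run average:     {long_run_average:.3f}")
```

Individual sample means fall both below and above the parameter, but their long-run average should lie very close to it, which is what unbiasedness describes.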
Second, it is desirable that a statistic obtained from any single sample from a population be very close to the value of the parameter being estimated. This property of a statistic is referred to as precision,† efficiency, or reliability. As we commonly secure only one sample from a population, it is important to arrive at a close estimate of a parameter from a single sample.
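One way to picture precision is to compare how widely two different statistics scatter from sample to sample. The sketch below is illustrative only: it assumes a hypothetical, roughly bell-shaped population and compares the sample mean with the sample median as estimates of the population's center; neither the population nor the choice of the median comes from the text.

```python
import random
import statistics

rng = random.Random(7)  # seeded only so the illustration is repeatable

# Hypothetical population of 10,000 simulated measurements.
population = [rng.gauss(50.0, 5.0) for _ in range(10_000)]

means, medians = [], []
for _ in range(3_000):
    sample = rng.sample(population, 25)
    means.append(statistics.fmean(sample))
    medians.append(statistics.median(sample))

# A smaller spread means a single sample's statistic tends to land
# closer to the value being estimated.
spread_of_means = statistics.stdev(means)
spread_of_medians = statistics.stdev(medians)
print(f"spread of sample means:   {spread_of_means:.3f}")
print(f"spread of sample medians: {spread_of_medians:.3f}")
```

For a population of this shape the sample means should vary less than the sample medians, so the mean would be the more precise of the two statistics here.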
∗ This use of the terms parameter and statistic was defined by R. A. Fisher as early as 1922 (Miller, 2004a; Savage, 1976).
† The precision of a sample statistic, as defined here, should not be confused with the precision of a measurement.
Third, consider that one can take larger and larger samples from a population (the largest sample being the entire population). As the sample size increases, a consistent statistic will become a better estimate of the parameter it is estimating. Indeed, if the sample were the size of the population, then the best estimate would be obtained: the parameter itself.
5 OUTLIERS
Occasionally, a set of data will have one or more observations that are so different, relative to the other data in the sample, that we doubt they should be part of the sample. For example, suppose a researcher collected a sample consisting of the body weights of nineteen 20-week-old mallard ducks raised in individual laboratory cages, for which the following 19 data were recorded:
1.87, 3.75, 3.79, 3.82, 3.85, 3.87, 3.90, 3.94, 3.96, 3.99, 3.99, 4.00, 4.03, 4.04, 4.05, 4.06, 4.09, 8.97, and 39.8 kilograms.
Visual inspection of these 19 recorded data casts doubt upon the smallest datum (1.87 kg) and the two largest data (8.97 kg and 39.8 kg) because they differ so greatly from the rest of the weights in the sample. Data in striking disagreement with nearly all the other data in a sample are often called outliers or discordant data, and the occurrence of such observations generally calls for closer examination.
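The text relies on visual inspection. As a rough computational stand-in, the sketch below flags values lying more than 1.5 interquartile ranges beyond the quartiles, a common rule of thumb that is swapped in here for illustration and is not the method described in this section. The variable names are hypothetical.

```python
import statistics

# The 19 recorded duck weights, in kilograms.
weights_kg = [
    1.87, 3.75, 3.79, 3.82, 3.85, 3.87, 3.90, 3.94, 3.96, 3.99,
    3.99, 4.00, 4.03, 4.04, 4.05, 4.06, 4.09, 8.97, 39.8,
]

# Quartiles of the sample (statistics.quantiles sorts the data itself).
q1, _, q3 = statistics.quantiles(weights_kg, n=4)
iqr = q3 - q1

# Rule-of-thumb fences: 1.5 interquartile ranges beyond each quartile.
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr

flagged = [w for w in weights_kg if w < low_fence or w > high_fence]
print(flagged)
```

For these data the rule flags the same three weights that visual inspection does (1.87, 8.97, and 39.8 kg). Flagging a value only marks it for closer examination; it does not by itself justify removing it.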
Sometimes it is clear that an outlier is the result of incorrect recording of data. In the preceding example, a mallard duck weight of 39.8 kg is highly unlikely (to say the least!), for that is about the weight of a 12-year-old boy or girl (and such a duck would probably not fit in one of the laboratory cages). In this case, inspection of the data records might lead us to conclude that this body weight was recorded with a careless placement of the decimal point and should have been 3.98 kg instead of 39.8 kg. And, upon interrogation, the research assistant may admit to weighing the eighteenth duck with the scale set to pounds instead of kilograms, so the metric weight of that animal should have been recorded as 4.07 (not 8.97) kg.
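The two corrections described above amount to simple arithmetic, sketched here in Python. The variable names are hypothetical; the conversion factor is the standard definition of the pound (0.45359237 kg).

```python
KG_PER_POUND = 0.45359237  # exact, by definition of the international pound

# Eighteenth duck: the scale was set to pounds, so convert to kilograms.
recorded_in_pounds = 8.97
corrected_kg = round(recorded_in_pounds * KG_PER_POUND, 2)

# Misplaced decimal point: 39.8 should have been one place to the left.
misplaced_decimal = 39.8
shifted_kg = round(misplaced_decimal / 10, 2)

print(corrected_kg, shifted_kg)
```

The results, 4.07 kg and 3.98 kg, match the corrected values given in the text and fall within the range of the other weights in the sample.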
Also, upon further examination of the data-collection process, we may find that the 1.87-kg duck was taken from a wrong cage and was, in fact, only 4 weeks old, not 20 weeks old, and therefore did not belong in this sample. Or, perhaps we find that it was not a mallard duck, but some other bird species (and, therefore, did not belong in this sample). Statisticians say a sample is contaminated if it contains a datum that does not conform to the characteristics of the population being sampled. So the weight of a 4-week-old duck, or of a bird of a different species, would be a statistical contaminant and should be deleted from this sample.
There are also instances where it is known that a measurement was faulty, for example, when a laboratory technician spills coffee onto an electronic measuring device or into a blood sample to be analyzed. In such a case, the measurements known to be erroneous should be eliminated from the sample.
However, outlying data can also be correct observations taken from an intended population, collected purely by chance. As we shall see, when drawing a random sample from a population, it is relatively likely that a datum in the sample will be around the average of the population and very unlikely that a sample datum will be dramatically far from the average. But sample data very far from the average still may be possible.