Shuichi Shinmura


New Theory of Discriminant Analysis After R. Fisher
Advanced Research by the Feature-Selection Method for Microarray Data


New Theory of Discriminant Analysis After R. Fisher

Shuichi Shinmura
Faculty of Economics
Seikei University
Musashino-shi, Tokyo
Japan

ISBN 978-981-10-2163-3    ISBN 978-981-10-2164-0 (eBook)
DOI 10.1007/978-981-10-2164-0

Library of Congress Control Number: 2016947390

© Springer Science+Business Media Singapore 2016

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made.

Printed on acid-free paper

This Springer imprint is published by Springer Nature
The registered company is Springer Science+Business Media Singapore Pte Ltd.
Preface
This book introduces the new theory of discriminant analysis based on mathematical programming (MP)-based optimal linear discriminant functions (OLDFs) (hereafter, "the Theory") after R. Fisher. There are five serious problems of discriminant analysis, described in Sect. 1.1.2. I develop five OLDFs in Sect. 1.3. An OLDF based on a minimum number of misclassifications (minimum NM, MNM) criterion using integer programming (IP-OLDF) reveals four relevant facts in Sect. 1.3.3. IP-OLDF clarifies the relation between NM and LDF, in addition to the monotonic decrease of MNM. IP-OLDF and an OLDF using linear programming (LP-OLDF) are compared with Fisher's LDF and a quadratic discriminant function (QDF) using the Iris data in Chap. 2 and the cephalo-pelvic disproportion (CPD) data in Chap. 3. However, because IP-OLDF may not find the true MNM if the data do not satisfy the general position, as revealed by the Student data in Chap. 4 (Problem 1), I develop Revised IP-OLDF, Revised LP-OLDF, and Revised IPLP-OLDF, which is a mixture model of Revised LP-OLDF and Revised IP-OLDF. Only Revised IP-OLDF can find the true MNM corresponding to an interior point of the optimal convex polyhedron (optimal CP, OCP) defined on the discriminant coefficient space in Sect. 1.3. Because all LDFs except Revised IP-OLDF cannot discriminate the cases on the discriminant hyperplane exactly (Problem 1), the NMs of these LDFs may not be correct. IP-OLDF finds that the Swiss banknote data in Chap. 6, having six variables, are linearly separable data (LSD) and that the two variables (X4, X6) form the minimum linearly separable model, by examination of all 63 models made from the six independent variables. Revised IP-OLDF later confirms this result. By the monotonic decrease of MNM, the 16 models including (X4, X6) are linearly separable models. This fact is very important for understanding gene analysis. Only Revised IP-OLDF and a hard-margin support vector machine (H-SVM) can discriminate LSD theoretically (Problem 2). Problem 3 is the defect of the generalized inverse of variance-covariance matrices that causes trouble for QDF and regularized discriminant analysis (RDA). I solve Problem 3, which is explained by the pass/fail determinations using 18 examination scores in Chap. 5. Although these data are LSD, the error rates of Fisher's LDF and QDF are very high because these datasets do not satisfy Fisher's assumption. These facts point to a serious problem: we had better re-evaluate the discriminant results of Fisher's LDF and QDF. In particular, we should re-evaluate medical diagnoses and various ratings, because these data have the same character as the test data, having many cases on the discriminant hyperplane.

Because Fisher never formulated an equation for the standard error (SE) of error rates and discriminant coefficients (Problem 4), I develop a 100-fold cross-validation for small sample method (hereafter, "the Method 1"). The Method 1 offers the 95% confidence interval (CI) of the discriminant coefficients and error rates. Moreover, I develop a powerful model selection procedure, namely the best model with the minimum mean of error rates in the validation samples (M2). The best models of Revised IP-OLDF are better than those of the other seven LDFs using six datasets, including the Japanese-automobile data in addition to the above five datasets. Therefore, I mistakenly believed that I had established the Theory in 2015. However, when Revised IP-OLDF discriminated six microarray datasets (the datasets) in November 2015, it could naturally select features. Although Revised IP-OLDF can make feature selection naturally for the Swiss banknote data and Japanese-automobile data in Chap. 7, I did not consider this a very important fact, because the best model already offers a useful model selection procedure for common data. For more than ten years, many researchers have struggled with the analysis of gene datasets, because there are huge numbers of genes and it is difficult to analyze them by common statistical methods (Problem 5). I develop a Matroska feature-selection method (hereafter, "the Method 2") and a LINGO program. The Method 2 reveals that the dataset consists of several disjoint small linearly separable subspaces (small Matroskas, SMs) and another high-dimensional subspace that is not linearly separable. Therefore, we can analyze each SM by ordinary statistical methods. We found Problem 5 in November 2015 and solved it in December 2015.
The book represents my life's work and research, to which I have dedicated over 44 years of my life. After graduating from Kyoto University in 1971, I was employed by SCSK Corp. in Japan as a system integrator. Naoji Tuda, the grandson of the second-generation general director Teigo Iba of Sumitomo Zaibatsu, was my boss, and he believed that medical engineering (ME) was an important target for the information-processing industries. Through his decision, I became a member of the project for the automatic diagnostic system of electrocardiogram (ECG) data with the Osaka Center for Cancer and Cardiovascular Diseases and NEC. The project leader, Dr. Yutaka Nomura, ordered me to develop the medical diagnostic logic for ECG data using Fisher's LDF and QDF. Although I had hoped to become a mathematical researcher when I was a senior student in high school, I failed the entrance examination of the graduate school at Kyoto University because I spent too much time on the activities of the university swimming club. Although I did not become a mathematical researcher, I started research in ME. The research I conducted from 1971 to 1974 using Fisher's LDF and QDF was inferior to his experimental decision tree logic. Initially, I believed that my statistical ability was poor. However, I soon realized that Fisher's assumption was too strict for medical diagnosis. I proposed the earth model (Shinmura, 1984)1 for
1 See the references in Chap. 1.
medical diagnosis instead of Fisher's assumption. This experience gave me the motivation to develop the Theory. Shinmura et al. (1973, 1974) proposed a spectrum diagnosis using Bayesian theory that was the first trial of the Theory. However, logistic regression was more suitable for the earth model.
Shimizu et al. (1975) asked me to analyze photochemical air pollution data by Hayashi's quantification theory, and this became my first paper. Dr. Takaichirou Suzuki, leader of the Epidemiology Group, provided me with several themes concerning many types of cancers (Shinmura et al. 1983).
In 1975, I met Prof. Akihiko Miyake from the Nihon Medical School at the workshop organized by Dr. Shigekoto Kaihara, Professor Emeritus of the Medical School of Tokyo University. Miyake and Shinmura (1976) studied the relationship between the population and sample error rates of Fisher's LDF. Next, Miyake and Shinmura (1979) developed an OLDF based on the MNM criterion by a heuristic approach. Shinmura and Miyake (1979) discriminated the CPD data with collinearities. After we revised the paper two or three times, a statistical journal rejected it. However, Miyake and Shinmura (1980) was accepted by the Japanese Society for Medical and Biological Engineering (JSMBE). The former editors judged that OLDF based on the MNM criterion overestimated the validation sample, whereas Fisher's LDF did not, because Fisher's LDF was derived from the normal distribution without examination of real data. I was deeply disappointed that many statisticians disliked reviewing real data and started their research from a normal distribution, because it was very comfortable for them to work without the examination of real data (lotus eating). However, I could not develop a second trial of the Theory because of poor computer power and a defect in the heuristic approach.
Shinmura et al. (1987) analyzed the specific substance mycobacterium (SSM, commonly known as Maruyama vaccine). From 270,000 patients, we categorized 152,289 cancer patients into four postoperative groups. The patients who were administered SSM within one year after surgery were divided into four groups, every three months from the start of the SSM administration. We assumed that SSM was only water without side effects, and this was the null hypothesis. The survival time of the first group was longer than that of the fourth group from nine to 12 months after surgery, and the null hypothesis was rejected.
In 1994, Prof. Kazunori Yamaguchi and Michiko Watanabe strongly recommended that I apply for the position at Seikei University. After organizing the 9th Symposium of JSCS in SCSK at Ryogoku, near Ryogoku Kokugikan, in March 1995, I became a professor at the Economic Department in April of the same year. Dr. Tokuhide Doi presented a long-term care insurance system that employed a decision tree method, as advised by me. (Dr. Kaihara planned this system as an advisor to the Ministry of Health and Welfare, and I advised Dr. Doi to use the decision tree.)
In 1997, Prof. Tomoyuki Tarumi advised me to obtain a doctorate degree in science at his graduate school. Without examining the previous research, I developed IP-OLDF and LP-OLDF, which discriminated the Iris data, the CPD data, and 115 random number datasets. IP-OLDF found two relevant facts about the Theory. Therefore, we confirmed that the MNM criterion was essential for discriminant analysis, and we completed the Theory in 2015. The Theory is as useful for the gene datasets as for ordinary datasets. Readers can download all my research from researchmap and the Theory from ResearchGate.
https://www.researchgate.net/profile/Shuichi_Shinmura
http://researchmap.jp/read0049917/?lang=english
Musashino-shi, Japan
Shuichi Shinmura
Acknowledgments
I wish to thank all researchers who contributed to this book: Linus Schrage, Kevin Cunningham, Hitoshi Ichikawa, John Sall, Noriki Inoue, Kyoko Takenaka, Masaichi Okada, Naohiro Masukawa, Aki Ishii, Ian B. Jeffery, Tomoyuki Tarumi, Yutaka Tanaka, Kazunori Yamaguchi, Michiko Watanabe, Yasuo Ohashi, Akihiko Miyake, Shigekoto Kaihara, Akira Ooshima, Takaichirou Suzuki, Tadahiko Shimizu, Tatuo Aonuma, Kunio Tanabe, Hiroshi Yanai, Toji Makino, Jirou Kondou, Hiroshi Takamori, Hidenori Morimura, Atsuhiro Hayashi, Iebun Yun, Hirotaka Nakayama, Mika Satou, Masahiro Mizuta, Souichirou Moridaira, Yutaka Nomura, and Naoji Tuda.
I am grateful to my family, and in particular for the legacy of my late father, who supported the research: Otojirou Shinmura, Reiko Shinmura, Makiko Shinmura, Hideki Shinmura, and Kana Shinmura.
I would like to thank Editage (www.editage.jp) for English language editing.
Contents

1.3.2 Before and After SVM
1.3.3 IP-OLDF and Four New Facts of Discriminant Analysis
1.5.2 Pass/Fail Determination
2 Iris Data and Fisher
2.3.1 Comparison of MNM and Eight NMs
2.3.3 LINGO Program 1: Six MP-Based LDFs
2.4.1 Four Trials to Obtain Validation Sample
2.4.1.1 Generate Training and Validation Samples
2.4.1.2 20,000 Normal Random Sampling
2.4.2 Best Model Comparison
3 Cephalo-Pelvic Disproportion Data with Collinearities
3.3 100-Fold Cross-Validation
3.3.1 Best Model
4.4.1 Comparison of MNM and Nine NMs
5.3.1 MNM and Nine NMs
5.3.2 Error Rate Means (M1 and M2)
5.4 Pass/Fail Determination by Examination Scores (90% Level in 2012)
5.4.1 MNM and Nine NMs
5.4.2 Error Rate Means (M1 and M2)
5.4.3 95% CI of Discriminant Coefficient
5.5 Pass/Fail Determination by Examination Scores (10% Level in 2012)
5.5.1 MNM and Nine NMs
5.5.2 Error Rate Means (M1 and M2)
5.5.3 95% CI of Discriminant Coefficient
6 Best Model for Swiss Banknote Data
6.1 Introduction
6.2 Swiss Banknote Data
6.2.1 Data Outlook
6.2.2 Comparison of Seven LDFs for Original Data
6.3 100-Fold Cross-Validation for Small Sample Method
6.3.1 Best Model Comparison
6.3.2 95% CI of Discriminant Coefficient
6.3.2.1 Consideration of 27 Models
6.3.2.2 Revised IP-OLDF
6.3.2.3 Hard-Margin SVM (H-SVM) and Other LDFs
6.4 Explanation 1 for Swiss Banknote Data
6.4.1 Matroska in Linearly Separable Data
6.4.2 Explanation 1 of Method 2 by Swiss Banknote Data
6.5 Summary
7.2.2 Comparison of Nine Discriminant Functions for Non-LSD
7.2.3 Consideration of Statistical Analysis
7.3 100-Fold Cross-Validation (Method 1)
7.3.1 Comparison of Best Model
7.3.2 95% CI of Coefficients by Six MP-Based LDFs
7.3.2.1 Revised IP-OLDF Versus H-SVM
7.3.2.2 Revised IPLP-OLDF, Revised LP-OLDF, and Other LDFs
7.3.3 95% CI of Coefficients by Fisher's LDF and Logistic Regression
7.4 Matroska Feature-Selection Method (Method 2)
7.4.1 Feature-Selection by Revised IP-OLDF
7.4.2 Coefficient of H-SVM and SVM4
8 Matroska Feature-Selection Method for Microarray Dataset (Method 2)
8.1 Introduction
8.2 Matroska Feature-Selection Method (Method 2)
8.2.1 Short Story to Establish Method 2
8.2.2 Explanation of Method 2 by Alon et al. Dataset
8.2.2.1 Feature-Selection by Eight LDFs
8.2.2.2 Results of Alon et al. Dataset Using the LINGO Program
8.2.3 Summary of Six Microarray Datasets in 2016
8.2.4 Summary of Six Datasets in 2015
8.3 Results of the Golub et al. Dataset
8.3.1 Outlook of Method 2 by the LINGO Program 3
8.3.2 First Trial to Find the Basic Gene Sets
8.3.3 Another BGS in the Fifth SM
8.4 How to Analyze the First BGS
8.5 Statistical Analysis of SM1
8.5.1 One-Way ANOVA
9 LINGO Program 2 of Method 1
9.1 Introduction
9.2 Natural (Mathematical) Notation by LINGO
9.3 Iris Data in Excel
9.4 Six LDFs by LINGO
9.5 Discrimination of Iris Data by LINGO
9.6 How to Generate Resampling Samples and Prepare Data in Excel File
9.7 Set Model by LINGO
Symbols

Statistical Discriminant Functions by JMP

JMP: Statistical software supported by the JMP division of SAS Institute, Japan
JMP script: JMP script that solves Fisher's LDF and logistic regression by Method 1
LDF: Linear discriminant functions, such as Fisher's LDF, logistic regression, two OLDFs, three revised OLDFs, and three SVMs
Fisher's LDF: Fisher's linear discriminant function under Fisher's assumption
Logist: Logistic regression; in the tables, "Logist" is often used
QDF*: Quadratic discriminant function
RDA*: Regularized discriminant analysis
* QDF and RDA discriminate ordinary data in this book
Mathematical Programming (MP) by LINGO and What's Best!

What's Best!: Excel add-in solver
LINGO: MP solver that can solve LP, IP, QP, NLP, and stochastic programming
LINGO Program 1: LINGO program that solves the original data by six MP-based LDFs, explained in Sect. 2.3.3
LINGO Program 2: LINGO program that solves six MP-based LDFs by Method 1, explained in Chap. 9
LINGO Program 3: LINGO program that solves six MP-based LDFs by Method 2
LP: Linear programming; develops Revised LP-OLDF
IP: Integer programming; develops Revised IP-OLDF
QP: Quadratic programming; develops three SVMs
NLP: Nonlinear programming; defines Fisher's LDF

MP-based LDFs

SVM: Support vector machine
H-SVM: Hard-margin SVM
S-SVM: Soft-margin SVM
SVM4**: S-SVM for penalty c = 10,000
SVM1**: S-SVM for penalty c = 1
** Because there is no rule to decide a proper "c", we compare the results of SVM4 and SVM1
OLDF: Optimal LDF
LSD: Linearly separable data, the MNM of which is zero; LSD includes several linearly separable models (or subspaces)
Matroska: In gene analysis, we call all linearly separable spaces and subspaces Matroskas
Big Matroska: The microarray dataset is LSD and includes smaller Matroskas in it by the monotonic decrease of MNM
SM: Small Matroska, found by LINGO Program 3 (not explained in this book)
BGS: Basic gene set or subspace; the smallest Matroska in each SM
NM: Number of misclassifications
MNM: Minimum NM
CP: Convex polyhedron; the interior point of a CP has a unique NM and discriminates the same cases; defined by IP-OLDF, not Revised IP-OLDF
OCP: Optimal CP; the interior point of the OCP has the unique MNM
IP-OLDF: OLDF based on the MNM criterion using IP; if the data are not in general position, IP-OLDF may not find the true MNM
Revised IP-OLDF: Finds the interior point of the OCP and solves Problem 1
LP-OLDF: OLDF using LP; one of the L1-norm LDFs
Revised LP-OLDF: One of the L1-norm LDFs; although it is faster than other MP-based LDFs, it is weak against Problem 1
Revised IPLP-OLDF: A mixture model of Revised LP-OLDF in the first step and Revised IP-OLDF in the second step

DATA

Data: n cases by p independent variables
xi: The i-th p-dimensional independent variable vector (for i = 1, ..., n)
yi: Object variable; yi = 1 for class 1 and yi = -1 for class 2
Hi(b): Hi(b) = yi × (txi × b + 1) is a linear hyperplane (for i = 1, ..., n) that divides the p-dimensional coefficient space into a finite number of CPs (two half-planes, Hi(b) < 0 and Hi(b) > 0)
Hi(b) < 0: The minus half-plane of Hi(b). If Hi(bk) < 0, then Hi(bk) = yi × (txi × bk + 1) = yi × (tbk × xi + 1) < 0 and case xi is misclassified. If an interior point bk is located in h minus half-planes, NM = h; such an LDF misclassifies the same h cases

Ordinary or common data

Iris data: Fisher evaluated Fisher's LDF by these data
CPD data: Cephalo-pelvic disproportion data with collinearities
Student data: Pass/fail determination using student attributes
LSD: Linearly separable data that include linearly separable models; in gene analysis, we call LSD and linearly separable models Matroskas
Swiss banknote data: IP-OLDF finds these data are LSD; we explain Problems 2 and 5
Test data: Pass/fail determination using examination scores; these datasets are LSD, and we explain a trivial LDF
Japanese-automobile data: LSD; we explain Problems 3 and 5
The datasets: Six microarray datasets
Theory and Method

Theory: New theory of discriminant analysis after R. Fisher
Method 1: 100-fold cross-validation for small sample method
Method 2: Matroska feature-selection method for microarray datasets
M1: The mean of error rates in the training samples
M2: The mean of error rates in the validation samples
Best model: The M2 of the best model is minimum among all possible models of each LDF
LOO procedure: A leave-one-out model selection procedure
The best model: The model with minimum M2, used instead of LOO by Method 1
Diff1: The difference defined as (NM of the nine discriminant functions - MNM)
Diff: The difference defined as (M2 - M1)
M1Diff: The difference defined as (M1 of the nine discriminant functions - M1 of Revised IP-OLDF)
M2Diff: The difference defined as (M2 of the nine discriminant functions - M2 of Revised IP-OLDF)
Five Problems of Discriminant Analysis

Problem 1: All LDFs, with the exception of Revised IP-OLDF, cannot discriminate the cases on the discriminant hyperplane. The NMs of these LDFs may not be correct
Problem 2: All LDFs, with the exception of H-SVM and Revised IP-OLDF, cannot recognize LSD theoretically. Although Revised LP-OLDF and Revised IPLP-OLDF can often discriminate LSD, we never discuss them in Chap. 8 for this reason
Problem 3: The defect of the generalized inverse matrix technique; QDF misclassifies all cases as the other class in a particular case. Adding small random noise to the constant values solves Problem 3
Problem 4: Fisher never formulated an equation for the standard error of the error rate and discriminant coefficients. Method 1 offers the 95% confidence interval (CI) for the error rate and coefficients
Problem 5: For more than ten years, many researchers have struggled to analyze the microarray dataset that is LSD. Only Revised IP-OLDF can make feature selection naturally. I develop the Matroska feature-selection method (Method 2), which finds a surprising structure of the microarray dataset: the disjoint union of several small linearly separable subspaces (small Matroskas, SMs). Now we can analyze each SM very quickly. The student linearly separable data, Swiss banknote data, and Japanese-automobile data show the natural feature selection of Revised IP-OLDF. Therefore, I recommend that researchers of feature-selection methods, such as LASSO, evaluate and compare their theories using these datasets in Chaps. 4 and 6-8. I omit the results of the pass/fail determination using examination scores, which consist of only four variables
Chapter 1
New Theory of Discriminant Analysis

1.1 Introduction

1.1.1 Theory Theme
This book introduces a new theory of discriminant analysis (hereafter, "the Theory") after R. Fisher. This chapter explains how to solve the five serious problems of discriminant analysis. To the best of my knowledge, this is the first book that compares eight linear discriminant functions (LDFs) using several different types of data. These eight LDFs are as follows: Fisher's LDF (Fisher 1936, 1956), logistic regression (Cox 1958), hard-margin SVM (H-SVM) (Vapnik 1995), two soft-margin SVMs (S-SVMs), namely SVM4 (penalty c = 10,000) and SVM1 (penalty c = 1), and three optimal LDFs (OLDFs). At first, I develop an OLDF based on a minimum number of misclassifications (minimum NM (MNM)) criterion using integer programming (IP-OLDF) and an OLDF using linear programming (LP-OLDF) (Shinmura 2000b, 2003, 2004, 2005, 2007). However, because I find a defect of IP-OLDF, I develop three revised OLDFs: Revised IP-OLDF (Shinmura 2010a, 2011a), Revised LP-OLDF, and Revised IPLP-OLDF (Shinmura 2010b, 2014b). The Iris data in Chap. 2 are critical test data because Fisher evaluated Fisher's LDF with these data (Anderson 1945). The Cephalo-Pelvic Disproportion (CPD) data (Miyake and Shinmura 1980) in Chap. 3 are medical data with three collinearities. Although the Student data in Chap. 4 are a small data sample (Shinmura 2010a), we can understand Problem 1 because the data are not in general position. The 18 pass/fail determinations using examination scores in Chap. 5 are linearly separable data (LSD). None of the LDFs, with the exception of H-SVM and Revised IP-OLDF, can discriminate LSD theoretically. I demonstrate that the 18 error rates of Fisher's LDF and the quadratic discriminant function (QDF) are very high (Shinmura 2011b); nevertheless, these data are LSD. Moreover, seven LDFs, with the exception of Fisher's LDF, become trivial LDFs (Shinmura 2015b). The Swiss banknote data (Flury and Riedwyl 1988) in Chap. 6 and
the Japanese-automobile data (Shinmura 2016c) in Chap. 7 are also LSD. Although I develop a Matroska feature-selection method for microarray datasets (Method 2), it is difficult to understand the meaning of Method 2 if we do not know LSD discrimination very well. I call LSD a big Matroska. Just as a big Matroska includes several small Matroskas, the microarray dataset (the datasets) includes several linearly separable subspaces (small Matroskas (SMs)) within it (the largest Matroska). Therefore, I explain this idea using common data in Chaps. 6 and 7. When I discriminate the datasets, only Revised IP-OLDF can select features naturally and finds the surprising structure of the datasets (Shinmura 2015e-s, 2016b). Moreover, I develop a 100-fold cross-validation for small sample method (Method 1) (Shinmura 2010a, 2013, 2014c) instead of the leave-one-out (LOO) procedure (Lachenbruch and Mickey 1968). We can obtain two error rate means, M1 and M2, from the training and validation samples, respectively, and propose a simple model selection procedure that selects the best model with minimum M2. The best model of Revised IP-OLDF is better than the seven other M2s for the previous data, except for the Iris data.
We cannot discriminate cases on the discriminant hyperplane (Problem 1). Only Revised IP-OLDF can solve Problem 1. Moreover, only H-SVM and Revised IP-OLDF can discriminate LSD theoretically (Problem 2). Problem 3 is the defect of the generalized inverse matrix technique and of QDF misclassifying all cases to the other class in a particular case. I solve Problem 3. Fisher never formulated an equation for the standard errors (SEs) of the error rate and discriminant coefficients (Problem 4). The Method 1 offers the 95% confidence interval (CI) of the error rate and coefficients. For more than ten years, many researchers have struggled to analyze the dataset that is LSD (Problem 5). Only Revised IP-OLDF can make feature selection naturally. The Method 2 finds the surprising structure of the dataset, which is the disjoint union of several small gene subspaces (SMs) that are linearly separable models. If we can repair the specific genes found by Method 2, we might overcome cancer diseases. Now, we can analyze each SM very quickly. We call the linearly separable model in gene analysis a "Matroska." If the datasets are LSD, the full model is the largest Matroska that contains all smaller Matroskas within it. We already know that the smallest Matroska (the basic gene set or subspace (BGS)) can describe the Matroska structure completely by the monotonic decrease of MNM. On the other hand, LASSO (Buhlmann and Geer 2011; Simon et al. 2013) attempts to make a feature selection similar to Method 2. This book offers useful datasets and results for LASSO researchers from the following perspective:
1. Can an LDF obtained by LASSO discriminate three different types of LSD, such as the Swiss banknote data, the Japanese-automobile data, and the six microarray datasets, exactly?
2. Can an LDF obtained by LASSO find the Matroska structure correctly and list all BGSs?

If LASSO cannot find SMs or BGSs in the dataset, it cannot explain the data structure.
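As one concrete way to start on these two questions, the following minimal sketch (my own illustration, not code from the book) uses an L1-penalized logistic regression as a LASSO-type classifier and checks whether the genes it selects separate the two classes without error; the loader load_microarray is a hypothetical placeholder for whichever of the six datasets is being tested.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def lasso_check(X, y):
        """Fit an L1-penalized logistic regression and report whether the
        selected genes separate the two classes without error."""
        clf = LogisticRegression(penalty="l1", solver="liblinear", C=1.0)
        clf.fit(X, y)
        selected = np.flatnonzero(clf.coef_[0])      # genes with nonzero weight
        errors = np.sum(clf.predict(X) != y)         # NM on the training data
        return selected, errors

    # Hypothetical usage: X is an (n cases) x (p genes) matrix, y in {1, -1}.
    # X, y = load_microarray("golub")                # placeholder loader
    # genes, nm = lasso_check(X, y)
    # print(len(genes), "genes selected, NM =", nm)  # LSD requires NM == 0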
1.1.2 Five Problems

The Theory discusses only binary (two-class; class 1 or class 2) discrimination by eight LDFs: Revised IP-OLDF, Revised LP-OLDF, Revised IPLP-OLDF, H-SVM, SVM4, SVM1, Fisher's LDF, and logistic regression. The values of class 1 and class 2 are 1 and -1, respectively. We consider these values as the object variable of discriminant analysis and regression analysis. Let f(x) be an LDF and f(xi) be the discriminant score for xi. Although there are many difficult statistics in discriminant analysis, we should focus on the discriminant rule, which is quite direct: If yi × f(xi) > 0, xi is classified into class 1/class 2 correctly. If yi × f(xi) < 0, xi is misclassified. If yi × f(xi) = 0, we cannot discriminate xi correctly. This understanding is most important for discriminant analysis. There are five serious problems hidden in this simplistic scenario (Shinmura 2014a, 2015c, d).
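A minimal sketch of this discriminant rule (my own illustration, not code from the book) counts the misclassifications (NM) and, separately, the cases that lie exactly on the discriminant hyperplane:

    import numpy as np

    def discriminant_rule(f_scores, y):
        """Apply the sign-based discriminant rule to the scores f(x_i).
        y[i] is 1 for class 1 and -1 for class 2."""
        s = y * f_scores
        nm = np.sum(s < 0)             # misclassified cases
        h = np.sum(s == 0)             # cases on the hyperplane: undecidable
        return nm, h

    # Example: the third case sits exactly on the hyperplane (score 0).
    y = np.array([1, 1, -1, -1])
    f = np.array([2.0, -0.5, 0.0, -1.2])
    print(discriminant_rule(f, y))     # (1, 1): one NM plus one case with f = 0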
1.1.2.1 Problem 1

We cannot adequately discriminate the cases where xi lies on the discriminant hyperplane (f(xi) = 0). The Student data in Chap. 4 show this fact clearly. Thus far, this has been an unresolved problem. However, most researchers classify these cases into class 1 without a logical reason. They misunderstand the discriminant rule as follows: If f(xi) >= 0, xi is classified into class 1 correctly. If f(xi) < 0, xi is classified into class 2 properly. There are two mistakes in their rule. The first mistake is to classify the cases on the discriminant hyperplane into class 1 without a logical explanation. The second mistake is that we cannot determine a priori that cases with a positive discriminant score are classified into class 1 and those with a negative value into class 2, because the data determine this, not the researchers. Other statisticians propose deciding Problem 1 randomly (i.e., akin to throwing dice) because statistics is the study of probabilities. If users knew of this claim, they might be surprised and disappointed in discriminant analysis. In particular, medical doctors might be upset because they do not gamble with medical diagnoses, given that they seriously attempt to discriminate the cases near the discriminant hyperplane. Most statistical researchers are unaware of this fact of medical diagnosis. If we consider a pass/fail determination using the scores of four tests, where the passing mark is 50 points, we can obtain a trivial LDF such as f = T1 + T2 + T3 + T4 - 50. If f >= 0, a given student has passed the examination. On the other hand, if f < 0, the student has failed the examination. Because we can describe the discriminant rule clearly by the (independent) variables, we can correctly include a student on the discriminant hyperplane in the passing class. We have ignored this unresolved problem until now. The proposed Revised IP-OLDF based on MNM can treat Problem 1 appropriately (Shinmura 2010a). Indeed, with the exception of Revised IP-OLDF, no LDFs can correctly count the number of misclassifications (NMs). Therefore, we must count the number of cases where f(xi) = 0 and display this number "h" alongside the NM of all LDFs in the output. We must estimate a true NM that might increase to (NM + h). After I showed many examples of Problem 1, some statisticians claimed that the probability of cases on the discriminant hyperplane is zero, without a theoretical reason. They erroneously believe that we discriminate data on a continuous space.
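The trivial LDF above can be verified concretely; in this sketch (my own illustration), a student with exactly 50 total points falls on the discriminant hyperplane and is correctly passed because the rule is written directly in the variables:

    def pass_fail(t1, t2, t3, t4, passing_mark=50):
        """Trivial LDF f = T1 + T2 + T3 + T4 - 50 for a pass/fail decision."""
        f = t1 + t2 + t3 + t4 - passing_mark
        return "pass" if f >= 0 else "fail"   # f == 0 is a pass by definition

    print(pass_fail(20, 10, 15, 5))   # total is exactly 50 -> "pass"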
1.1.2.2 Problem 2

Only H-SVM and Revised IP-OLDF can recognize LSD theoretically.1 Other LDFs might not discriminate LSD exactly. When IP-OLDF discriminates the Swiss banknote data in Chap. 6, I find that these data are LSD. In addition, the Japanese-automobile data in Chap. 7 are LSD. Through both data, I explain the Matroska feature-selection method (Method 2) in Chap. 8. We can obtain examination scores easily, and these datasets are also LSD. Moreover, there is a trivial LSD. However, several LDFs cannot determine pass/fail using examination scores correctly (Shinmura 2015b). In particular, the error rates of Fisher's LDF and QDF are very high. Table 1.4 lists all 18 error rates of Fisher's LDF and QDF that are not zero in the pass/fail determinations from 2010 to 2012. This fact suggests that we should review the discriminant analyses of past important research, because the error rates may decrease. In medical diagnosis, researchers gave up studies whose error rates were over ten percent. However, Revised IP-OLDF may tell them that their error rates are zero. Moreover, discriminant functions that cannot discriminate LSD correctly are not helpful for gene analysis.
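To make the theoretical claim concrete, recall the standard hard-margin SVM formulation (a standard form from the SVM literature, not reproduced from the book's own equations):

    \min_{b, b_0} \; \frac{1}{2}\,\|b\|^2 \quad \text{subject to} \quad y_i\,({}^t x_i\, b + b_0) \ge 1 \quad (i = 1, \dots, n)

The constraints are feasible if and only if some hyperplane separates the two classes, so H-SVM recognizes LSD by definition; for non-LSD the constraint set is empty and the solver reports an error, which matches the remark in Sect. 1.3.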
1.1.2.3 Problem 3

If the variance-covariance matrix is singular, Fisher's LDF and QDF cannot be calculated because the inverse matrices do not exist. Because JMP (Sall et al. 2004) adopted the generalized inverse matrix technique, I had believed that Fisher's LDF and QDF could be calculated with the generalized inverse matrix without problems. When I discriminated math examination scores among 56 examination datasets from the National Center for University Entrance Examinations (NCUEE), QDF and a regularized discriminant analysis (RDA) (Friedman 1989) misclassified all students in the passing class as the failing class. If we exchange class 1 and class 2, QDF and RDA misclassify all students in the failing class as the passing class, as decided by the JMP specification. When QDF caused serious problems with problematic data, JMP switched QDF to RDA automatically. After three years of surveys, I found that RDA and QDF do not work correctly in a particular case where the values of a variable that belongs to one class are constant, because all the students in the passing class answered the particular question correctly. If users can select appropriate options for a modified RDA developed for this particular case, RDA works better than the QDF listed in Table 1.5, which is explained by the results of the Japanese-automobile data. However, JMP does not currently offer a modified QDF. Therefore, I judged that this was the defect of the generalized inverse matrix. If we add slight random noise to the constant value, QDF can discriminate the data exactly. Because it is basic statistical knowledge for us that data always vary, and because I trusted the quality of JMP, I needed three years to find the reason. Problem 3 has provided a warning for our statistical understanding: data always change.

1 Empirically, Revised LP-OLDF can discriminate LSD correctly. However, it is very weak against Problem 1. Logistic regression and SVM4 discriminate LSD correctly in many examinations. Fisher's LDF, QDF, and SVM1 perform poorly for LSD discrimination. I recommend that researchers review their old studies that used these three discriminant functions.
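A minimal sketch of this noise workaround (my own illustration in Python; the book applies it within JMP): a variable that is constant within one class makes that class's variance-covariance matrix singular, and a slight jitter restores invertibility without materially changing the discrimination.

    import numpy as np

    def add_jitter(X, scale=1e-6, seed=0):
        """Add slight random noise to every column so that no within-class
        variable is exactly constant (the case that breaks QDF/RDA)."""
        rng = np.random.default_rng(seed)
        return X + rng.normal(0.0, scale, size=X.shape)

    # Example: in the passing class every student answered item 3 correctly,
    # so column 3 is the constant 1 and the covariance matrix is singular.
    X_pass = np.array([[45., 30., 1.], [50., 28., 1.], [40., 35., 1.]])
    print(np.linalg.matrix_rank(np.cov(X_pass.T)))              # 2: singular
    print(np.linalg.matrix_rank(np.cov(add_jitter(X_pass).T)))  # 3: invertible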
1.1.2.4 Problem 4

Some statisticians erroneously believe that discriminant analysis is an inferential statistical method similar to regression analysis. However, Fisher never formulated an equation of SEs for the discriminant coefficients or error rates. Nonetheless, if we use the indicator yi of the mathematical programming-based linear discriminant functions (MP-based LDFs) in Eq. (1.7) as the object variable and analyze the data by regression analysis, the obtained regression coefficients are proportional to the coefficients of Fisher's LDF by the plug-in rule 1. Therefore, we can use model selection procedures such as stepwise procedures and all-possible-combination models (Goodnight 1978) with statistics such as AIC, BIC, and Cp from regression analysis. In this book, I propose Method 1 and a new model selection procedure, the best model. I set k = 100 and select the model with minimum M2 as the best model; this is a very direct and powerful model selection procedure compared with LOO. First, we select the best model for each LDF. Next, we select the model with minimum M2 among the six MP-based LDFs as the final best model. We claim that the final best model has generalization ability. Moreover, we obtain the 95% CI of the discriminant coefficients. Although I could demonstrate in 2010 that the best model was useful (Shinmura 2010a), I could not explain the useful meaning of the 95% CI of the discriminant coefficients before 2014. However, if we divide all coefficients by the LDF intercept and set the intercept to one, the six MP-based LDFs and logistic regression become trivial LDFs, and only Fisher's LDF is far from trivial (Shinmura 2015b). Moreover, I can explain the useful meaning of the 95% CI for the Swiss banknote and Japanese-automobile data (Shinmura 2016a, c) more precisely.
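The proportionality to Fisher's LDF is easy to check numerically. The following sketch (my own illustration with synthetic data, not code from the book) regresses the +1/-1 object variable on the independent variables and compares the direction of the regression coefficients with the plug-in direction Sigma^{-1}(m1 - m2):

    import numpy as np

    rng = np.random.default_rng(1)
    n = 200
    X1 = rng.multivariate_normal([1.0, 0.0], [[1.0, 0.3], [0.3, 1.0]], n)
    X2 = rng.multivariate_normal([-1.0, 0.5], [[1.0, 0.3], [0.3, 1.0]], n)
    X = np.vstack([X1, X2])
    y = np.r_[np.ones(n), -np.ones(n)]               # object variable: +1 / -1

    # Regression of y on X (with intercept) via least squares.
    A = np.column_stack([np.ones(2 * n), X])
    beta = np.linalg.lstsq(A, y, rcond=None)[0][1:]  # drop the intercept

    # Plug-in Fisher direction: pooled covariance inverse times mean difference.
    S = 0.5 * (np.cov(X1.T) + np.cov(X2.T))
    fisher = np.linalg.solve(S, X1.mean(0) - X2.mean(0))

    print(beta / fisher)   # both entries print (nearly) the same constant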
1.1.2.5 Problem 5

For more than ten years, many researchers have struggled to analyze the datasets (Problem 5). However, to the best of my knowledge, there has been no research on LSD discrimination thus far. I examine five different types of LSD: the Swiss banknote data, the pass/fail determinations of 18 examination datasets, the Japanese-automobile data, the student linearly separable data, and the six microarray datasets. When I discriminate the datasets, most of the coefficients of Revised IP-OLDF become zero. Only Revised IP-OLDF can select features naturally and finds the surprising structure of the datasets. The datasets are Alon et al. (1999), Chiaretti et al. (2004), Golub et al. (1999), Shipp et al. (2002), Singh et al. (2002), and Tian et al. (2003). Jeffery et al. (2006) analyzed these datasets and uploaded them on their homepage.2 Ishii et al. (2014) analyzed these datasets by principal component analysis (PCA). I find the Matroska structure in the datasets, with an MNM of zero. The Method 2 can decompose the high-dimensional gene space into several small Matroskas (SMs) (Shinmura 2015e-s, 2016a). We can analyze these SMs by ordinary statistical methods such as the t-test, one-way ANOVA, cluster analysis, and PCA. Because there has been no research on LSD discrimination thus far (to the best of our knowledge), many researchers have struggled and have not obtained good results. I explain Method 2 with the results of the Swiss banknote data in Chap. 6 and the Japanese-automobile data in Chap. 7, because Revised IP-OLDF can select variables naturally for ordinary data.
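A schematic sketch of this decomposition (my own outline, not the book's implementation: the book uses LINGO Program 3 with Revised IP-OLDF, for which scikit-learn's LinearSVC with a large penalty serves here as a rough hard-margin stand-in, and the exact zero-coefficient pattern of Revised IP-OLDF is imitated with a magnitude threshold): repeatedly find a linearly separable gene subset, set it aside as one SM, and continue on the remaining genes until no separable subset is left.

    import numpy as np
    from sklearn.svm import LinearSVC

    def matroska_sketch(X, y, max_sm=10):
        """Peel off disjoint linearly separable gene subsets (SMs).
        X: (cases x genes), y: labels in {1, -1}."""
        remaining = list(range(X.shape[1]))
        sms = []
        for _ in range(max_sm):
            clf = LinearSVC(C=1e6, max_iter=100000)  # large C approximates H-SVM
            clf.fit(X[:, remaining], y)
            if np.any(clf.predict(X[:, remaining]) != y):
                break                           # remaining genes are not LSD
            w = np.abs(clf.coef_[0])
            picked = [remaining[j] for j in np.flatnonzero(w > 1e-6)]
            sms.append(picked)                  # one small Matroska (SM)
            remaining = [j for j in remaining if j not in picked]
        return sms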
1.1.2.6 Summary

Revised IP-OLDF solves Problems 1, 2, and 5. Problem 3 is the defect of the generalized inverse matrix technique, and QDF now causes Problem 3. If we add slight random noise to the constant value, we can solve Problem 3 easily. I propose Method 1 and compare two statistical LDFs by JMP script and six MP-based LDFs by the LINGO Program 2 (Schrage 2006) using six different types of data. Through many results, I can confirm that Method 1 solves Problem 4 using a computer-intensive approach. Problem 5 is the complex analysis of microarray datasets. Only Revised IP-OLDF can make feature selection of the datasets naturally and find that the datasets consist of several disjoint unions of SMs. We can analyze each SM in the dataset easily because each SM is a small gene subspace. It is quite strange that the three SVMs cannot select features naturally.
1.2 Motivation for Our Research

1.2.1 Contribution by Fisher

Fisher described Fisher's LDF using variance-covariance matrices and founded the statistical discriminant theory. He assumed that the two classes (or groups) have the same variance-covariance matrices and that the two means are different (Fisher's assumption). However, because Fisher's assumption is too strict for actual data, QDF was defined for two classes having different variance-covariance matrices. This fact indicates that statisticians are aware that there exist data that do not satisfy Fisher's assumption. Moreover, multiclass discrimination that uses the
2 http://www.bioinf.ucd.ie/people/ian/
Mahalanobis distance has been proposed. In quality control, Taguchi and Jugulum (2002) considered that one class (the normal state) has a variance-covariance matrix, and the other class (the uncontrolled state) consists of only one case. They discriminated data through multiclass discrimination and claimed that a typical uncontrolled case is far from the normal state, with a large Mahalanobis distance. Their claim is similar to the "earth model" in medical diagnosis (Shinmura 1984). Because statistical software packages easily implement these discriminant functions based on variance-covariance matrices, we apply discriminant analysis to many applications in science, technology, and industry, such as medical diagnosis, pattern recognition, and various ratings. However, real data rarely satisfy Fisher's assumption. Therefore, it is well known that logistic regression is better than Fisher's LDF and QDF because it does not assume a particular theoretical distribution, such as a normal distribution. It is very strange and unfortunate that there is no discussion of this matter by researchers and users of logistic regression.
1.2.2 Defect of Fisher's Assumption for Medical Diagnosis

After graduating from Kyoto University in 1971, I became a member of the project that developed the automatic diagnostic system for electrocardiogram (ECG) data from 1971 to 1974. The project leader, who was a medical doctor, requested that I discriminate over ten3 abnormal symptoms from the normal symptom using Fisher's LDF and QDF. Our four years of research were inferior to the medical doctor's experimental decision tree logic. First, I believed that my results using Fisher's LDF and QDF were inferior to the decision tree logic results because my knowledge and experience were poor. Later, I realized that Fisher's assumption was not adequate for medical diagnosis. I summarize the two reasons for my failure below. On the other hand, there is no actual test for Fisher's assumption. I demonstrate that the NM of Fisher's LDF is close to MNM for the Iris data. We can use this trend instead of the test statistics of Fisher's hypothesis.

First Reason: In medical diagnosis, typical cases of abnormal symptoms are far from the discriminant hyperplane. I explained medical diagnosis with the "earth model," where the normal symptom is the land, abnormal symptoms are the mountains, and the discriminant hyperplanes are the horizon. The Mahalanobis-Taguchi strategy is similar to the earth model. This claim violates Fisher's assumption. In the statistical concept, we understand that the typical cases of both classes are the two averages of two normal distributions. Therefore, I believed that discriminant functions based on the variance-covariance matrices are not adequate for medical diagnosis and developed a spectrum diagnostic method (Shinmura et al. 1973, 1974). I knew that logistic regression is remarkably successful in medical diagnosis and understood that it is superior to the spectrum diagnostic method.
3 I cannot recollect the exact number of abnormal symptoms.
Currently, Japanese medical researchers discriminate data by logistic regression instead of Fisher's LDF and QDF. I regret that, as researchers and users of logistic regression, they did not discuss my claim.

Second Reason: There are many cases close to the discriminant hyperplane. I concluded that Fisher's LDF and QDF are fragile for the discrimination of particular data, such as pass/fail determination using examination scores (Shinmura 2011b) and the rating of bonds, stocks, and estates, in addition to medical data. These data also have the characteristic feature of having many cases close to the discriminant hyperplane. None of the LDFs, with the exception of Revised IP-OLDF, can discriminate the cases on the discriminant hyperplane correctly (Problem 1). Recently, because I could not access medical data for our research, I used pass/fail determination with examination scores instead of medical data.
1.2.3 Research Outlook

After 1975, I discriminated many data using Fisher's LDF, QDF, logistic regression, multiclass discrimination using the Mahalanobis distance, decision tree logic (or partitioning), and the quantification theory developed by Dr. Hayashi (Shimizu et al. 1975; Nomura and Shinmura 1978; Shinmura et al. 1983). Through these studies, I found Problems 1 and 4 (Shinmura 2014a, 2015c, d). In 1973, we developed the spectrum diagnostic method using Bayesian theory. However, logistic regression was more sophisticated than the spectrum diagnostic method. Next, we developed an OLDF based on the MNM criterion (Miyake and Shinmura 1979, 1980; Shinmura and Miyake 1979), which was a heuristic approach. Because Warmack and Gonzalez (1973) compared several discriminant functions, their research encouraged ours. We were not able to advance the research because we had low computer power and because of the defect of the heuristic approach.

Starting in 1997, I developed IP-OLDF (Shinmura 1998, 2000a, b; Shinmura and Tarumi 2000). Because I defined IP-OLDF in the discriminant coefficient space, I found two important facts of discriminant analysis. The first is OCP. The second is "the monotonic decrease of MNM." However, there was a serious defect in IP-OLDF for the Student data, which are not in general position. If data are not in general position, IP-OLDF might not find the vertex of a true OCP. This defect means that the obtained MNM might not be the true MNM; Problem 1 caused this defect. In 2007, Revised IP-OLDF solved the defect because it can find the interior point of the true OCP and avoid Problem 1. Therefore, I could solve Problem 1 completely. Until 2007, I was not able to evaluate the eight LDFs using validation samples because our research data were small samples.
After 2007, I developed Method 1. Through this breakthrough, I was able to solve Problem 4 and ended the basic research. Revised IP-OLDF solves Problems 1 and 2. Although I could evaluate the eight LDFs by M2, I could not explain the useful meaning of the 95% CI of the discriminant coefficients. After 2010, I started applied research on LSD discrimination. I found that Problem 3 is the defect of the generalized inverse matrix technique through the pass/fail determination that uses examination scores (Shinmura 2011b). With regard to IP-OLDF, I set the intercept of IP-OLDF to one and was able to obtain the two important facts, namely OCP and the monotonic decrease of MNM. Therefore, I divided all coefficients by the intercept and set the intercept to one. Through this second breakthrough, seven LDFs, with the exception of Fisher's LDF, became trivial LDFs in the pass/fail determination that uses examination scores, and I was able to explain the useful meaning of the coefficients of Revised IP-OLDF using the Swiss banknote data. Therefore, I had solved the four problems and could confirm the end of our research. However, when I discriminated the Shipp et al. dataset in October 2015, I found that Revised IP-OLDF can make feature selection naturally and can solve Problem 5 quickly.
1.2.4 Method 1 and Problem 4

If we set k = 100 in the Method 1, we can obtain 100 LDFs and 100 error rates from the training and validation samples. From the 100 LDFs, we obtain the 95% CI of the discriminant coefficients. From the 100 error rates, we obtain the 95% CI of the error rates and two means of error rates, M1 and M2, from the training and validation samples, respectively. We consider the model with minimum M2 among all possible combination models to be the best model. This standard is a direct and powerful model selection procedure compared with the LOO procedure.
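A minimal sketch of this resampling loop (my own illustration: a bootstrap split stands in for the book's resampling scheme, which is described in Chap. 9, and scikit-learn's LinearDiscriminantAnalysis stands in for the eight LDFs): k = 100 pairs of training and validation samples yield 100 fitted coefficient vectors and 100 pairs of error rates, from which M1, M2, and percentile CIs follow.

    import numpy as np
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    def method1_sketch(X, y, k=100, seed=0):
        """100-fold cross-validation for small samples: resample k times,
        fit an LDF, and collect coefficients and error rates."""
        rng = np.random.default_rng(seed)
        n = len(y)
        coefs, e_train, e_val = [], [], []
        for _ in range(k):
            idx = rng.integers(0, n, n)               # bootstrap training sample
            val = np.setdiff1d(np.arange(n), idx)     # out-of-sample cases
            clf = LinearDiscriminantAnalysis().fit(X[idx], y[idx])
            coefs.append(clf.coef_[0])
            e_train.append(np.mean(clf.predict(X[idx]) != y[idx]))
            e_val.append(np.mean(clf.predict(X[val]) != y[val]))
        m1, m2 = np.mean(e_train), np.mean(e_val)     # M1 and M2
        ci = np.percentile(coefs, [2.5, 97.5], axis=0)  # 95% CI per coefficient
        return m1, m2, ci

Running this over all possible combinations of variables and keeping the model with minimum m2 corresponds to the best model selection described above.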
We should distinguish such computer-intensive approaches from traditional inferential statistics with the SE equation based on the normal distribution. Statisticians without computer power established inferential statistics manually. Today, we can utilize the power of a computer with statistical and MP solvers, such as JMP and LINGO. I developed the Method 1 (Program 2) of Fisher's LDF and logistic regression with the JMP script supported by the JMP division of SAS Institute Japan. In addition, I developed Method 1 for the six MP-based LDFs with LINGO: Revised IP-OLDF, Revised IPLP-OLDF, Revised LP-OLDF, H-SVM, SVM4, and SVM1. I explain the LINGO Program 2 in Chap. 9. Researchers who want to analyze their research data can obtain the 95% CI for the error rate and discriminant coefficients. These statistics provide a precise and deterministic judgment on model selection compared with the LOO procedure. Until this point, I could not validate and evaluate Revised IP-OLDF against the seven other LDFs because I only had small original data and no validation samples. Researchers with small samples can validate and assess their research data with Method 1 and the best model.
Miyake and Shinmura (1976) discussed the "error rates of linear discriminant functions" by the traditional approach. On the other hand, Konishi and Honda (1992) discussed "error rate estimation using the bootstrap method." Their computer-intensive approaches are not traditional inferential statistics and do not offer the 95% CI of the error rates and coefficients for individual data. Although logistic regression outputs the 95% CI of the coefficients through the maximum likelihood method proposed by R. Fisher, this is also a computer-intensive approach. On the other hand, we can select the best model and obtain the 95% CI of the error rates and coefficients for the six LDFs by the Method 1 and the best model. Many researchers who want to discriminate small samples thus have a philosopher's stone.
1.3 Discriminant Functions

I compare two statistical LDFs by JMP and six MP-based LDFs by LINGO. I omit a kernel SVM because it is a nonlinear discriminant function. However, I evaluate QDF and RDA with the eight LDFs only for the original six different data, with the exception of the datasets. Next, I compare the two statistical LDFs and six MP-based LDFs on resampling samples if the data are LSD. If the data are not LSD, we cannot discriminate the data by H-SVM because it causes an error for non-LSD.
1.3.1 Statistical Discriminant Functions

Fisher defined Fisher's LDF by maximization of the variance ratio (between/within classes) in Eq. (1.1). Nonlinear programming (NLP) can solve this equation.
If we accept Fisher's assumption, the same LDF is obtained in Eq. (1.2) by another plug-in rule 2. This equation defines Fisher's LDF explicitly, whereas Eq. (1.1) defines the LDF implicitly. Therefore, statistical software packages adopt this equation. Some statisticians erroneously believe that discriminant analysis is inferential statistics, similar to regression analysis. Discriminant analysis is not traditional inferential statistics based on the normal distribution because there are no SEs for the discriminant coefficients and error rates (Problem 4). Therefore, Lachenbruch and Mickey proposed the LOO procedure for selecting a good discriminant model, as indicated in Table 1.6.
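The display equations were lost from this copy; their standard textbook forms, which I believe correspond to Eqs. (1.1) and (1.2), are

    \max_{b} \; \frac{\{{}^t b\,(m_1 - m_2)\}^2}{{}^t b\,\Sigma\,b}    (1.1)

    f(x) = {}^t\{x - (m_1 + m_2)/2\}\,\Sigma^{-1}(m_1 - m_2)    (1.2)

where m1 and m2 are the two class means and Sigma is the common (pooled) variance-covariance matrix under Fisher's assumption.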
Most real data do not satisfy Fisher's assumption. When the variance-covariance matrices of the two classes are not the same (Sigma1 != Sigma2), the QDF defined in Eq. (1.3) can be used. This fact is critical for us. Previous statisticians have known that most real data do not satisfy Fisher's assumption. We use the Mahalanobis distance in Eq. (1.4) for the discrimination of multiclasses. The Mahalanobis-Taguchi method of quality control is one of its applications.
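Again supplying the standard forms (my reconstruction of what Eqs. (1.3) and (1.4) state, not reproduced from the book):

    f_i(x) = -\tfrac{1}{2}\,{}^t(x - m_i)\,\Sigma_i^{-1}(x - m_i) - \tfrac{1}{2}\log|\Sigma_i| + \log P_i    (1.3)

    D^2(x, m_i) = {}^t(x - m_i)\,\Sigma_i^{-1}(x - m_i)    (1.4)

A case is assigned to the class with the largest quadratic score (1.3), or the smallest Mahalanobis distance (1.4); Pi denotes the prior probability of class i.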
We use Fisher's LDF and QDF in many areas, but they cannot be calculated when some variables remain constant. There are three cases. First, some variables that belong to both classes are the same constant. Second, some variables that belong to both classes are constant but different. Third, some variables that belong to one class are constant. Most statistical software packages exclude all variables in these three cases. On the other hand, JMP enhances QDF using the generalized inverse matrix technique. Therefore, QDF can treat the first and second cases correctly, but cannot manage the third case properly (Problem 3).
Recently, the logistic regression in Eq. (1.5) has been used instead of Fisher's LDF and QDF for two reasons. First, it is well known that the error rate of logistic regression is often less than that of Fisher's LDF and QDF because it is derived from real data instead of some normal distribution free from reality. Let "p" be the probability of belonging to the class of diseases. If the value of some variable is increasing/decreasing, "p" increases from zero (normal class) to one (abnormal class). This representation is very useful in medical diagnosis, as well as for ratings of real estate and bonds. On the contrary, Fisher's LDF assumes that the cases close to the average of the diseases are the representative cases of the diseases' class. Medical doctors never accept this claim. Although the maximum-likelihood procedure calculates the SEs of the logistic coefficients, we should distinguish this computer-intensive approach from the traditional inferential statistics based on a theoretical distribution derived manually. Firth (1993) indicated that the SEs of logistic coefficients become large and the convergence calculation becomes unstable for LSD. If I observe the following points: (1) I can find NM = 0 by changing the discriminant hyperplane on the ROC curve, (2) MNM = 0, (3) the SEs become large, and (4) the convergence calculation becomes unstable, I can determine that logistic regression can recognize LSD. I confirm that logistic regression can almost recognize LSD by this tedious work.
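The logistic model referenced as Eq. (1.5) has the standard form (my reconstruction):

    p = \frac{1}{1 + \exp\{-({}^t b\,x + b_0)\}}    (1.5)

where p is the probability of belonging to the disease class, and the coefficients (b, b0) are estimated by maximum likelihood.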
1.3.2 Before and After SVM
There are many types of research on MP-based discriminant analysis. Glover (1990) defined many linear programming (LP) discriminant models. Rubin (1997) proposed MP-based discriminant functions using IP. Stam (1997) summarized Lp-norm discriminant methods in 1997 and answered the question, "Why have statisticians rarely used Lp-norm methods?" He provided four reasons: communication, promotion, and terminology; software availability; the relative accuracy of