2 Gentle Start
1. Given S = ((x_i, y_i))_{i=1}^m, define the multivariate polynomial
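A natural choice, and presumably the intended construction, is

p(x) = −∏_{i ∈ [m] : y_i = 1} ‖x − x_i‖²,

so that p(x) ≥ 0 exactly when x is one of the positive training instances, and the corresponding hypothesis h_p(x) = 1[p(x) ≥ 0] is therefore consistent with S.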
3. (a) First, observe that by definition, A labels positively all the positive instances in the training set. Second, as we assume realizability, and since the tightest rectangle enclosing all positive examples is returned, all the negative instances are labeled correctly by A as well. We conclude that A is an ERM.
(b) Fix some distribution D over X, and define R∗ as in the hint. Let f∗ be the hypothesis associated with R∗. Given a training set S, denote by R(S) the rectangle returned by the proposed algorithm and by A(S) the corresponding hypothesis. The definition of the algorithm A implies that R(S) ⊆ R∗ for every S. Thus,

L_(D,f∗)(A(S)) = D[R∗ \ R(S)].
Fix some ε ∈ (0, 1). Define R_1, R_2, R_3 and R_4 as in the hint. For each i ∈ [4], define the event

F_i = {S : S contains no instance belonging to R_i}.

Note that if S contains an instance in each R_i, then R∗ \ R(S) ⊆ ∪_{i=1}^4 R_i, and hence L_(D,f∗)(A(S)) ≤ ε; therefore {S : L_(D,f∗)(A(S)) > ε} ⊆ ∪_{i=1}^4 F_i. Applying the union bound, we obtain

D^m({S : L_(D,f∗)(A(S)) > ε}) ≤ Σ_{i=1}^4 D^m(F_i).
Thus, it suffices to ensure that D^m(F_i) ≤ δ/4 for every i. Fix some i ∈ [4]. Then, the probability that a sample is in F_i is the probability that all of the instances don't fall in R_i, which is exactly (1 − ε/4)^m. Therefore,
D^m(F_i) = (1 − ε/4)^m ≤ exp(−εm/4),

and hence,

D^m({S : L_(D,f∗)(A(S)) > ε}) ≤ 4 exp(−εm/4).
Plugging in the assumption on m, we conclude our proof.
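To spell out this last step: under the exercise's assumption that m ≥ 4 log(4/δ)/ε (the standard form of this bound),

4 exp(−εm/4) ≤ 4 exp(−log(4/δ)) = δ.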
(c) The hypothesis class of axis-aligned rectangles in R^d is defined as follows. Given real numbers a_1 ≤ b_1, a_2 ≤ b_2, ..., a_d ≤ b_d, define the classifier h_(a1,b1,...,ad,bd) by
h_(a1,b1,...,ad,bd)(x_1, ..., x_d) = 1 if ∀i ∈ [d], a_i ≤ x_i ≤ b_i, and 0 otherwise.  (1)
The class of all axis-aligned rectangles in R^d is defined as

H_rec^d = {h_(a1,b1,...,ad,bd) : ∀i ∈ [d], a_i ≤ b_i}.
It can be seen that the same algorithm proposed above is an ERM for this case as well. The sample complexity is analyzed similarly. The only difference is that instead of 4 strips, we have 2d strips (2 strips for each dimension). Thus, it suffices to draw a training set of size ⌈2d log(2d/δ)/ε⌉.
(d) For each dimension, the algorithm has to find the minimal and the maximal values among the positive instances in the training sequence. Therefore, its runtime is O(md). Since we have shown that the required value of m is at most ⌈2d log(2d/δ)/ε⌉, it follows that the runtime of the algorithm is indeed polynomial in d, 1/ε, and log(1/δ).
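As an illustration, a minimal NumPy sketch of the tightest-enclosing-rectangle ERM described above; the function name and the array-based input format are illustrative choices.

import numpy as np

def tightest_rectangle_erm(X, y):
    """ERM for axis-aligned rectangles in R^d: return the tightest rectangle
    enclosing the positive examples, together with the induced classifier.

    X : (m, d) array of instances; y : (m,) array of {0, 1} labels.
    """
    pos = X[y == 1]
    d = X.shape[1]
    if len(pos) == 0:
        # No positive example: use an "empty" rectangle (all-negative hypothesis).
        a, b = np.full(d, np.inf), np.full(d, -np.inf)
    else:
        # The O(md) step from (d): coordinate-wise minima and maxima of the positives.
        a, b = pos.min(axis=0), pos.max(axis=0)

    def h(x):
        # Label 1 iff x lies in [a_1, b_1] x ... x [a_d, b_d].
        return int(np.all((a <= x) & (x <= b)))

    return (a, b), h

The coordinate-wise min/max computation is exactly the per-dimension scan whose O(md) cost is discussed in (d).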
3 A Formal Learning Model
1. The proofs follow (almost) immediately from the definition. We will show that the sample complexity is monotonically decreasing in the accuracy parameter ε. The proof that the sample complexity is monotonically decreasing in the confidence parameter δ is analogous.
Denote by D an unknown distribution over X, and let f ∈ H be the target hypothesis. Denote by A an algorithm which learns H with sample complexity m_H(·, ·). Fix some δ ∈ (0, 1). Suppose that 0 < ε_1 ≤ ε_2 ≤ 1. We need to show that m_1 := m_H(ε_1, δ) ≥ m_H(ε_2, δ) =: m_2. Given an i.i.d. training sequence of size m ≥ m_1, we have that with probability at least 1 − δ, A returns a hypothesis h such that
L_(D,f)(h) ≤ ε_1 ≤ ε_2.
By the minimality of m_2, we conclude that m_2 ≤ m_1.
2. (a) We propose the following algorithm (a short sketch of it appears after this exercise). If a positive instance x⁺ appears in S, return the (true) hypothesis h_{x⁺}. If S doesn't contain any positive instance, the algorithm returns the all-negative hypothesis. It is clear that this algorithm is an ERM.
(b) Let ε ∈ (0, 1), and fix the distribution D over X. If the true hypothesis is h⁻, the all-negative hypothesis, then our algorithm returns a perfect hypothesis.
Assume now that there exists a unique positive instance x⁺. It's clear that if x⁺ appears in the training sequence S, our algorithm returns a perfect hypothesis. Furthermore, if D[{x⁺}] ≤ ε then in any case the returned hypothesis has a generalization error of at most ε (with probability 1). Thus, it is only left to bound the probability of the case in which D[{x⁺}] > ε but x⁺ doesn't appear in S. Denote this event by F. Then

P_{S∼D^m}[F] = (1 − D[{x⁺}])^m < (1 − ε)^m ≤ exp(−εm).

Hence, H_Singleton is PAC learnable, and its sample complexity is bounded by

m_H(ε, δ) ≤ ⌈log(1/δ)/ε⌉.
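As an illustration, a minimal Python sketch of the ERM from part (a); the function name and the (x, y)-pair input format are illustrative choices.

def learn_singleton(sample):
    """ERM for H_Singleton: if a positive instance x+ appears in the sample,
    return h_{x+}; otherwise return the all-negative hypothesis h^-.

    sample : iterable of (x, y) pairs with y in {0, 1}.
    """
    for x, y in sample:
        if y == 1:
            x_plus = x
            return lambda z: int(z == x_plus)  # h_{x+}
    return lambda z: 0                         # h^-: labels everything negative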
3. Consider the ERM algorithm A which, given a training sequence S = ((x_i, y_i))_{i=1}^m, returns the hypothesis ĥ corresponding to the “tightest” circle which contains all the positive instances (a sketch appears after this exercise). Denote the radius of this hypothesis by r̂. Assume realizability and let h∗ be a circle with zero generalization error. Denote its radius by r∗.
Let ε, δ ∈ (0, 1). Let r̄ ≤ r∗ be a scalar s.t. D_X({x : r̄ ≤ ‖x‖ ≤ r∗}) = ε. Define E = {x ∈ R² : r̄ ≤ ‖x‖ ≤ r∗}. The probability (over drawing S) that L_D(h_S) ≥ ε is bounded above by the probability that no point in S belongs to E. The probability of this event is bounded above by (1 − ε)^m ≤ e^{−εm}.
The desired bound on the sample complexity follows by requiring that e^{−εm} ≤ δ.
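As an illustration, a minimal NumPy sketch of this ERM; names and input format are illustrative.

import numpy as np

def tightest_circle_erm(X, y):
    """ERM for concentric circles: the smallest origin-centered circle that
    contains all positive instances (radius 0 if there are none).

    X : (m, 2) array of points in R^2; y : (m,) array of {0, 1} labels.
    """
    pos = X[y == 1]
    r_hat = 0.0 if len(pos) == 0 else float(np.linalg.norm(pos, axis=1).max())
    # The returned hypothesis labels x positively iff ||x|| <= r_hat.
    return lambda x: int(np.linalg.norm(x) <= r_hat)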
4. We first observe that H is finite. Let us calculate its size accurately. Each hypothesis, besides the all-negative hypothesis, is determined by deciding, for each variable x_i, whether x_i, x̄_i, or neither of them appears in the corresponding conjunction. Thus, |H| = 3^d + 1. We conclude that H is PAC learnable and its sample complexity can be bounded by m_H(ε, δ) ≤ ⌈(d log 3 + log(1/δ))/ε⌉.
Let's describe our learning algorithm (a sketch appears after this exercise). We define h_0 = x_1 ∧ x̄_1 ∧ ... ∧ x_d ∧ x̄_d. Observe that h_0 is the all-negative hypothesis. Let ((a_1, y_1), ..., (a_m, y_m)) be an i.i.d. training sequence of size m. Since we cannot extract any information from negative examples, our algorithm ignores them. For each positive example a, we remove from the current hypothesis all the literals that are missing in a; that is, if a_i = 1 we remove x̄_i, and if a_i = 0 we remove x_i. Denote by h_t the hypothesis obtained after processing the first t examples. Finally, our algorithm returns h_m.
By construction and realizability, h_t labels positively all the positive examples among a_1, ..., a_t. For the same reason, the set of literals in h_t contains the set of literals in the target hypothesis. Thus, h_t classifies correctly the negative examples among a_1, ..., a_t. This implies that h_m is an ERM.
Since the algorithm takes linear time (in terms of the dimension d) to process each example, the running time is bounded by O(md).
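As an illustration, a minimal Python sketch of this consistent learner for Boolean conjunctions; the literal encoding and function names are illustrative choices.

def learn_conjunction(examples, d):
    """Consistent learner for conjunctions over d Boolean variables.

    examples : iterable of (a, y) pairs, a a length-d tuple of bits, y in {0, 1}.
    A hypothesis is a set of literals; ('x', i) stands for x_i and ('not', i) for
    its negation.
    """
    # h_0 contains every literal, i.e. the all-negative hypothesis.
    h = {('x', i) for i in range(d)} | {('not', i) for i in range(d)}
    for a, y in examples:
        if y == 0:
            continue  # negative examples are ignored
        # Remove the literals violated by the positive example a.
        for i in range(d):
            h.discard(('not', i) if a[i] == 1 else ('x', i))

    def predict(a):
        # The conjunction of the remaining literals.
        return int(all((a[i] == 1) if kind == 'x' else (a[i] == 0) for kind, i in h))

    return h, predict

Processing each example touches d literals, matching the O(md) running time noted above.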
5. Fix some h ∈ H with L_(D̄_m,f)(h) > ε, where D̄_m denotes the average distribution (1/m) Σ_{i=1}^m D_i. By definition,

L_(D̄_m,f)(h) = (1/m) Σ_{i=1}^m L_(D_i,f)(h) > ε.

We now bound the probability that h is consistent with S (i.e., that L_S(h) = 0) as follows:

P_S[L_S(h) = 0] = ∏_{i=1}^m (1 − L_(D_i,f)(h)) ≤ ((1/m) Σ_{i=1}^m (1 − L_(D_i,f)(h)))^m = (1 − L_(D̄_m,f)(h))^m ≤ (1 − ε)^m ≤ exp(−εm).

The first inequality is the geometric-arithmetic mean inequality. Applying the union bound, we conclude that the probability that there exists some h ∈ H with L_(D̄_m,f)(h) > ε which is consistent with S is at most |H| exp(−εm).
6. Suppose that H is agnostic PAC learnable, and let A be a learning algorithm that learns H with sample complexity m_H(·, ·). We show that H is PAC learnable using A.
Let D and f be an (unknown) distribution over X and the target function, respectively. We may assume w.l.o.g. that D is a joint distribution over X × {0, 1}, where the conditional probability of y given x is determined deterministically by f. Since we assume realizability, we have inf_{h∈H} L_D(h) = 0. Let ε, δ ∈ (0, 1). Then, for every positive integer m ≥ m_H(ε, δ), if we equip A with a training set S consisting of m i.i.d. instances which are labeled by f, then with probability at least 1 − δ (over the choice of S|_x), it returns a hypothesis h with

L_D(h) ≤ inf_{h'∈H} L_D(h') + ε = ε,

which is exactly the requirement of (realizable) PAC learning.
7. Let x ∈ X, and let α_x be the conditional probability of a positive label given x. Recall that the Bayes optimal predictor satisfies f_D(x) = 1 iff α_x ≥ 1/2, so P[f_D(X) ≠ Y | X = x] = min{α_x, 1 − α_x}. For every classifier g we have

P[g(X) ≠ Y | X = x] = P[g(X) = 0 | X = x] · α_x + P[g(X) = 1 | X = x] · (1 − α_x)
≥ P[g(X) = 0 | X = x] · min{α_x, 1 − α_x} + P[g(X) = 1 | X = x] · min{α_x, 1 − α_x}
= min{α_x, 1 − α_x}.
The statement now follows from the fact that the above is true for every x ∈ X. More formally, by the law of total expectation,
L_D(f_D) = E_{(x,y)∼D}[1[f_D(x) ≠ y]]
= E_{x∼D_X}[ E_{y∼D_{Y|x}}[1[f_D(x) ≠ y] | X = x] ]
= E_{x∼D_X}[ min{α_x, 1 − α_x} ]
≤ E_{x∼D_X}[ E_{y∼D_{Y|x}}[1[g(x) ≠ y] | X = x] ]
= L_D(g).
As we shall see,
8. (a) This was proved in the previous exercise.
(b) We proved in the previous exercise that for every distribution D, the Bayes optimal predictor f_D is optimal w.r.t. D.
(c) Choose any distribution D. Then A is not better than f_D w.r.t. D.
9. (a) Suppose that H is PAC learnable in the one-oracle model. Let A be an algorithm which learns H, and denote by m_H the function that determines its sample complexity. We prove that H is PAC learnable also in the two-oracle model.
Let D̄ be the distribution over labeled examples obtained by drawing a point from the positive oracle or the negative oracle with equal probability. Note that drawing points from the two oracles in this manner is equivalent to obtaining i.i.d. examples from a distribution which gives equal probability to positive and negative instances. Formally, for every subset E ⊆ X,

D̄[E] = (1/2) · D⁺[E] + (1/2) · D⁻[E].

Hence, if we feed A with m_H(ε/2, δ) examples drawn in this manner, then with probability at least 1 − δ, A returns a hypothesis h with L_(D̄,f)(h) ≤ ε/2. Since

L_(D̄,f)(h) = (1/2) · L_(D⁺,f)(h) + (1/2) · L_(D⁻,f)(h),

this implies that with probability at least 1 − δ, both L_(D⁺,f)(h) ≤ ε and L_(D⁻,f)(h) ≤ ε, so our definition of PAC learnability in the two-oracle model is satisfied. We can bound both m⁺_H(ε, δ) and m⁻_H(ε, δ) by m_H(ε/2, δ).
(b) Suppose that H is PAC learnable in the two-oracle model, and let A be an algorithm which learns H. We show that H is PAC learnable also in the standard model.
Let D be a distribution over X, and denote the target hypothesis by f. Let α = D[{x : f(x) = 1}]. Let ε, δ ∈ (0, 1). According to our assumptions, there exist m⁺ := m⁺_H(ε, δ/2) and m⁻ := m⁻_H(ε, δ/2) s.t. if we equip A with m⁺ examples drawn i.i.d. from D⁺ and m⁻ examples drawn i.i.d. from D⁻, then, with probability at least 1 − δ/2, A will return h with
L_(D⁺,f)(h) ≤ ε ∧ L_(D⁻,f)(h) ≤ ε.
Our algorithm B draws m = max{2m⁺/ε, 2m⁻/ε, 8 log(4/δ)/ε} samples according to D (a sketch of B appears after this exercise). If there are fewer than m⁺ positive examples, B returns h⁻, the all-negative hypothesis. Otherwise, if there are fewer than m⁻ negative examples, B returns h⁺, the all-positive hypothesis. Otherwise, B runs A on the sample and returns the hypothesis returned by A.
First we observe that if the sample contains m⁺ positive instances and m⁻ negative instances, then the reduction to the two-oracle model works well. More precisely, with probability at least 1 − δ/2, A returns h with
L_(D⁺,f)(h) ≤ ε ∧ L_(D⁻,f)(h) ≤ ε.

Hence, with probability at least 1 − δ/2, the algorithm B returns (the same) h with

L_(D,f)(h) = α · L_(D⁺,f)(h) + (1 − α) · L_(D⁻,f)(h) ≤ ε.

We consider now the following cases:
• Assume that α ≥ ε. We show that with probability at least 1 − δ/4, the sample contains at least m⁺ positive instances. For each i ∈ [m], define the indicator random variable Z_i, which takes the value 1 iff the i-th element in the sample is positive. Define Z = Σ_{i=1}^m Z_i to be the number of positive examples that were drawn. Clearly, E[Z] = αm ≥ εm ≥ 2m⁺. Using Chernoff's bound, we obtain

P[Z < m⁺] ≤ P[Z < E[Z]/2] ≤ exp(−E[Z]/8) ≤ exp(−εm/8).

By the way we chose m, we conclude that

P[Z < m⁺] ≤ δ/4.
Similarly, if 1 − α ≥ ε, the probability that fewer than m⁻ negative examples were drawn is at most δ/4. If both α ≥ ε and 1 − α ≥ ε, then, by the union bound, with probability at least 1 − δ/2 the training set contains at least m⁺ positive and at least m⁻ negative instances. As we mentioned above, if this is the case, the reduction to the two-oracle model works with probability at least 1 − δ/2. The desired conclusion follows by applying the union bound.
• Assume that α < ε, and fewer than m⁺ positive examples are drawn. In this case, B will return the hypothesis h⁻. We obtain

L_(D,f)(h⁻) = α < ε.

Similarly, if 1 − α < ε, and fewer than m⁻ negative examples are drawn, B will return h⁺. In this case,

L_(D,f)(h⁺) = 1 − α < ε.
All in all, we have shown that with probability at least 1 − δ, B returns a hypothesis h with L_(D,f)(h) ≤ ε. This satisfies our definition of PAC learnability in the one-oracle model.
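As an illustration, a minimal Python sketch of the reduction B from part (b); the two-oracle learner A is passed in as a callable, and m_plus, m_minus stand for the assumed sample complexities m⁺_H(ε, δ/2) and m⁻_H(ε, δ/2). All names are illustrative.

import math

def algorithm_B(sample, A, m_plus, m_minus):
    """Reduction from the two-oracle model to the standard one-oracle model.

    sample : list of (x, y) pairs drawn i.i.d. from D and labeled by f.
    A      : two-oracle learner, called as A(positives, negatives) -> hypothesis.
    """
    positives = [x for x, y in sample if y == 1]
    negatives = [x for x, y in sample if y == 0]
    if len(positives) < m_plus:
        return lambda x: 0                 # h^-: the all-negative hypothesis
    if len(negatives) < m_minus:
        return lambda x: 1                 # h^+: the all-positive hypothesis
    # Enough examples of each kind: feed A the required number of each.
    return A(positives[:m_plus], negatives[:m_minus])

def required_sample_size(m_plus, m_minus, eps, delta):
    # m = max{2 m^+ / eps, 2 m^- / eps, 8 log(4/delta) / eps}, as in the proof.
    return math.ceil(max(2 * m_plus / eps,
                         2 * m_minus / eps,
                         8 * math.log(4 / delta) / eps))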
4 Learning via Uniform Convergence
1. (a) Assume that for every ε, δ ∈ (0, 1) and every distribution D over X × {0, 1}, there exists m(ε, δ) ∈ N such that for every m ≥ m(ε, δ),

P_{S∼D^m}[L_D(A(S)) > ε] < δ.
Let λ > 0. We need to show that there exists m_0 ∈ N such that for every m ≥ m_0, E_{S∼D^m}[L_D(A(S))] ≤ λ. Let ε = min(1/2, λ/2). Set m_0 = m(ε, ε). For every m ≥ m_0, since the loss is bounded above by 1, we have

E_{S∼D^m}[L_D(A(S))] ≤ P_{S∼D^m}[L_D(A(S)) > ε] · 1 + ε < ε + ε ≤ λ.
(b) Assume now that
2. The left inequality follows from Corollary 4.4. We prove the right inequality. Fix some h ∈ H. Applying Hoeffding's inequality, we obtain

P_{S∼D^m}[|L_S(h) − L_D(h)| > ε] ≤ 2 exp(−2mε²).  (2)

The desired inequality is obtained by requiring that the right-hand side of Equation (2) is at most δ/|H|, and then applying the union bound.
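For completeness, requiring that 2 exp(−2mε²) ≤ δ/|H| and solving for m gives

m ≥ log(2|H|/δ) / (2ε²),

which is presumably the right inequality in question: m^UC_H(ε, δ) ≤ ⌈log(2|H|/δ) / (2ε²)⌉.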
5 The Bias-Complexity Tradeoff
1. We simply follow the hint. By Lemma B.1,